- Copy the SLES installation media to the server or frontend node so we can install the packages needed to build other packages
- Install openssl-devel, readline-devel, pam-devel and munge-devel (and libs) to build munge and slurm
- Configure an LDAP server on the service node then point the frontend node to the LDAP server
Why did we pick slurm?
Since we run slurm on all our other compute clusters, it made sense to run it on this system as well. Probably the biggest benefit is that all the accounting logs are sent to the same logging daemon.
Most of the slurm configuration was pretty straightforward. Since we only have a one-rack system at work, I have only configured it for the DYNAMIC layout mode.
The relevant configuration lines are shown here; a first-time Bluegene/P administrator will probably appreciate these snippets. The basic setup is:
- service-node runs slurmctld; this is also where all the Bluegene/P management software runs.
- frontend-node runs the slurmd.
I generated the bluegene.conf file by running smap -Dc -v and then typing 'save' in the command window.

    #
    # bluegene.conf file generated by smap
    # See the bluegene.conf man page for more information
    #
    CnloadImage=/bgsys/drivers/ppcfloor/boot/cns,/bgsys/drivers/ppcfloor/boot/cnk
    MloaderImage=/bgsys/drivers/ppcfloor/boot/uloader
    IoloadImage=/bgsys/drivers/ppcfloor/boot/cns,/bgsys/drivers/ppcfloor/boot/linux,/bgsys/drivers/ppcfloor/boot/ramdisk
    BridgeAPILogFile=/var/log/slurm/bridgeapi.log
    #Numpsets=4 # io poor
    Numpsets=16 # io rich
    BridgeAPIVerbose=2
    BasePartitionNodeCnt=512
    NodeCardNodeCnt=32
    LayoutMode=DYNAMIC

Running with the DYNAMIC layout mode is not recommended on a big system, but for small systems it's probably fine.
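
To make the numbers in bluegene.conf a little less opaque, here is a minimal Python sketch that parses the key=value lines and derives two ratios from them. It assumes that Numpsets is the number of psets (I/O nodes) per base partition, which is my reading of the bluegene.conf man page; the function names `parse_conf` and `derive_ratios` are made up for this illustration and are not part of slurm.

```python
# Sketch only: parse a trimmed bluegene.conf and derive two ratios.
# Assumption: Numpsets counts psets (I/O nodes) per base partition.
SAMPLE_CONF = """\
#Numpsets=4 # io poor
Numpsets=16 # io rich
BasePartitionNodeCnt=512
NodeCardNodeCnt=32
LayoutMode=DYNAMIC
"""


def parse_conf(text):
    """Return a dict of key -> value, ignoring comments and blank lines."""
    settings = {}
    for raw in text.splitlines():
        # Drop full-line comments and trailing comments alike.
        line = raw.split("#", 1)[0].strip()
        if not line or "=" not in line:
            continue
        key, value = line.split("=", 1)
        settings[key.strip()] = value.strip()
    return settings


def derive_ratios(settings):
    """Compute node cards per base partition and compute nodes per pset."""
    nodes = int(settings["BasePartitionNodeCnt"])
    per_card = int(settings["NodeCardNodeCnt"])
    psets = int(settings["Numpsets"])
    return {
        "node_cards_per_base_partition": nodes // per_card,
        "compute_nodes_per_pset": nodes // psets,
    }


if __name__ == "__main__":
    print(derive_ratios(parse_conf(SAMPLE_CONF)))
```

With the values above this gives 16 node cards per base partition and 32 compute nodes per pset. With the commented-out "io poor" value of Numpsets=4, the second figure would be 128 instead.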
The slurm.conf file also requires some tweaks, notably:

    SelectType=select/bluegene
    # COMPUTE NODES
    NodeName=bgp[000x001] Procs=2048 NodeHostname=bg-fe State=UNKNOWN
    PartitionName=compute Nodes=bgp[000x001] Default=YES MaxTime=12:00:00 State=UP

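
The Procs=2048 value and the bgp[000x001] range can be sanity-checked with a little arithmetic. The sketch below assumes that each slurm "node" here is one base partition of 512 compute nodes (BasePartitionNodeCnt), that each Blue Gene/P compute node has 4 cores, and that 000x001 denotes a box of three-digit XYZ coordinates from 000 to 001. The helper `expand_range` is a hypothetical illustration of that notation, not slurm's own parser.

```python
# Sketch only: check Procs and the node range under the assumptions above.
from itertools import product

CORES_PER_NODE = 4             # Blue Gene/P compute nodes are quad-core
BASE_PARTITION_NODE_CNT = 512  # BasePartitionNodeCnt from bluegene.conf


def expand_range(start, end):
    """Expand 'XYZ' start/end coordinates into every coordinate in the box."""
    axes = [range(int(s), int(e) + 1) for s, e in zip(start, end)]
    return ["".join(str(d) for d in coord) for coord in product(*axes)]


def procs_per_base_partition():
    """Cores in one base partition: compute nodes times cores per node."""
    return BASE_PARTITION_NODE_CNT * CORES_PER_NODE


if __name__ == "__main__":
    coords = expand_range("000", "001")
    print(coords)                                    # base partitions in the range
    print(procs_per_base_partition())                # should match Procs
    print(len(coords) * procs_per_base_partition())  # cores in the whole range
```

Under these assumptions the range expands to two base partitions, 000 and 001, each with 512 x 4 = 2048 cores. That gives 4096 cores in total, which is what a single Blue Gene/P rack provides.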
More documentation can be found at https://computing.llnl.gov/linux/slurm/bluegene.html. Luckily, the rest of the slurm configuration is quite straightforward, and if you are already experienced with slurm there should not be any major problems.
One of the gotchas I've come across so far is that I needed to configure both the frontend and service nodes to look up users in a directory service of some sort. Both nodes should have the same POSIX users so that all the accounting information is logged correctly. In my case I chose to use LDAP on the two machines. I did not point our Bluegene/P system at our existing LDAP servers because of some stock configurations that IBM had shipped.
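
One way to check that the two nodes really agree on their users is to compare their passwd databases. The following Python sketch assumes you have captured the output of `getent passwd` from each node as text. The user names and IDs in the sample data are invented, and `parse_passwd` and `diff_users` are hypothetical helper names.

```python
# Sketch only: compare passwd-format text captured from two nodes.
def parse_passwd(text):
    """Map user name -> (uid, gid) from passwd-format text."""
    users = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        fields = line.split(":")
        if len(fields) < 4:
            continue  # not a valid passwd entry
        users[fields[0]] = (int(fields[2]), int(fields[3]))
    return users


def diff_users(a, b):
    """Return users that are missing on one side or whose uid/gid differ."""
    problems = {}
    for name in sorted(set(a) | set(b)):
        if a.get(name) != b.get(name):
            problems[name] = (a.get(name), b.get(name))
    return problems


# Invented sample data standing in for each node's `getent passwd` output.
SERVICE_NODE = """\
alice:x:1001:100:Alice:/home/alice:/bin/bash
bob:x:1002:100:Bob:/home/bob:/bin/bash
"""
FRONTEND_NODE = """\
alice:x:1001:100:Alice:/home/alice:/bin/bash
bob:x:1003:100:Bob:/home/bob:/bin/bash
carol:x:1004:100:Carol:/home/carol:/bin/bash
"""

if __name__ == "__main__":
    mismatches = diff_users(parse_passwd(SERVICE_NODE), parse_passwd(FRONTEND_NODE))
    for name, (on_service, on_frontend) in mismatches.items():
        print(name, "service:", on_service, "frontend:", on_frontend)
```

An empty result means both nodes see the same users with the same uid and gid, which is what the accounting relies on.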
mpirun and its configuration file mpirun.cfg also caused me some minor problems with the suid bits, so we tweaked it a bit. The result is less secure than it was before, but we trust our users to a certain extent.
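
If you want to see how the suid bits are currently set before changing anything, a small Python sketch like the one below will report them. It only inspects files and does not reproduce the tweak described above. The paths are passed on the command line because the locations of mpirun and mpirun.cfg depend on your installation, and `is_setuid` and `describe` are hypothetical helper names.

```python
# Sketch only: report the permission bits of the files named on the command line.
import os
import stat
import sys


def is_setuid(mode):
    """True if the set-user-ID bit is set in a file mode."""
    return bool(mode & stat.S_ISUID)


def describe(path):
    """Summarise the mode, owner and setuid state of a path."""
    st = os.stat(path)
    return "{}: mode {} owner uid {} setuid={}".format(
        path, stat.filemode(st.st_mode), st.st_uid, is_setuid(st.st_mode)
    )


if __name__ == "__main__":
    # Usage: python check_suid.py <path to mpirun> <path to mpirun.cfg>
    for path in sys.argv[1:]:
        print(describe(path))
```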