Monday, 24 November 2014

Troubleshooting LDAP

Healthy:

slapd should be running

If not:

Check /var/log/ldap.log

If the database is corrupted, run db_recover -h <path-to-database>

<path-to-database> is typically /var/lib/ldap; the actual value is set by the directory directive in /etc/openldap/slapd.conf

A sample extract from /etc/openldap/slapd.conf:
# The database directory MUST exist prior to running slapd AND
# should only be accessible by the slapd and slap tools.
# Mode 700 recommended.
directory       /var/lib/ldap
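
The database path can also be read from the configuration file instead of guessed. A minimal sketch, assuming the classic slapd.conf format shown in the extract (the helper name slapd_dir is made up for this example and is not an OpenLDAP tool):

```shell
#!/bin/sh
# slapd_dir CONF: print the value of every "directory" directive in CONF.
# Comment lines start with "#", so their first field never equals "directory".
slapd_dir() {
    awk '$1 == "directory" { print $2 }' "$1"
}

# Example (not run here):
#   db_recover -h "$(slapd_dir /etc/openldap/slapd.conf)"
```

If slapd.conf defines more than one database, the helper prints one directory per line, so pick the right one before passing it to db_recover.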


Make sure the database directory is owned by the ldap user and group:
$ chown -R ldap:ldap /var/lib/ldap

If that does not work, compare the backup (for instance /var/lib/ldap/backup) against the current database directory.

Monday, 3 November 2014

Installing TCL

8.6.3

http://sourceforge.net/projects/tcl/files/Tcl/8.6.3/



8.6.1


Binary download: http://downloads.activestate.com/ActiveTcl/releases/8.6.1.0/

$ mkdir -p tcl/tcl-8.6.1

Untar and run install.sh

Installation Options

     Installation Directory:  /scratch1/dsi/dsinibal/tcl/tcl-8.6.1
     Demos Directory:         /scratch1/dsi/dsinibal/tcl/tcl-8.6.1/demos
     Runtime Directory:       See Installation Directory

Post-Install Messages

Please do not forget to extend your PATH and MANPATH variables to
get access to the applications and manpages distributed with ActiveTcl.

For a csh or compatible perform
    setenv PATH "/scratch1/dsi/dsinibal/tcl/tcl-8.6.1/bin:$PATH"

For a sh or similar perform
    PATH="/scratch1/dsi/dsinibal/tcl/tcl-8.6.1/bin:$PATH"
    export PATH

Some shells (bash for example) allow
    export PATH="/scratch1/dsi/dsinibal/tcl/tcl-8.6.1/bin:$PATH"

Similar changes are required for MANPATH


  Note that ActiveTcl 8.6.1.0 is a trimmed down distribution
  providing only the most important packages. All packages
  not found in the distribution can be installed by using
  the teacup client to the TEApot Package Management however.

  Further note that the documentation was not trimmed, and
  contains the documentation of all packages, even those not
  installed by the distribution.

Thursday, 30 October 2014

Installing GCC 4.9.1 from source on Fuji

To be installed: GMP 6.0.0, MPFR 3.1.2, MPC 1.0.2, GCC 4.9.1

The source code can be obtained from one of the mirrors.
Example:
ftp://gcc.gnu.org/pub/gcc/infrastructure/

GMP is needed by MPFR, both are needed by MPC, and all three are needed by GCC, so build them in that order.

GMP

$ tar -xjvf gmp-6.0.0a.tar.bz2

$ rsync -av gmp-6.0.0/ /apps/GNU/GMP/6.0.0/

$ cd /apps/GNU/GMP/6.0.0

$ ./configure --disable-shared --enable-static --prefix=/apps/GNU/GMP/6.0.0

$ make && make check && make install

MPFR

$ tar -xzvf mpfr-3.1.2.tar.gz

$ rsync -av mpfr-3.1.2/ /apps/GNU/MPFR/3.1.2-new/

$ cd /apps/GNU/MPFR/3.1.2-new

$ ./configure --disable-shared --enable-static --prefix=/apps/GNU/MPFR/3.1.2-new --with-gmp=/apps/GNU/GMP/6.0.0 

$ make && make check && make install

MPC

$ tar -xzvf mpc-1.0.2.tar.gz

$ rsync -av mpc-1.0.2/ /apps/GNU/MPC/1.0.2/

$ cd /apps/GNU/MPC/1.0.2

$ ./configure --disable-shared --enable-static --prefix=/apps/GNU/MPC/1.0.2 --with-gmp=/apps/GNU/GMP/6.0.0 --with-mpfr=/apps/GNU/MPFR/3.1.2-new

$ make && make check && make install

GCC

$ tar -xzvf gcc-4.9.1.tar.gz

$ cd /apps/GNU/GCC/4.9.1

$ <path-to-gcc-source>/gcc-4.9.1/configure --prefix=/apps/GNU/GCC/4.9.1 --with-gmp=/apps/GNU/GMP/6.0.0 --with-mpfr=/apps/GNU/MPFR/3.1.2-new --with-mpc=/apps/GNU/MPC/1.0.2 --disable-multilib

$ make #This will take a long time

$ make install

After a successful build, the install step prints one important message:

Libraries have been installed in:
   /apps/GNU/GCC/4.9.1/lib/../lib64

If you ever happen to want to link against installed libraries
in a given directory, LIBDIR, you must either use libtool, and
specify the full pathname of the library, or use the `-LLIBDIR'
flag during linking and do at least one of the following:
   - add LIBDIR to the `LD_LIBRARY_PATH' environment variable
     during execution
   - add LIBDIR to the `LD_RUN_PATH' environment variable
     during linking
   - use the `-Wl,-rpath -Wl,LIBDIR' linker flag
   - have your system administrator add LIBDIR to `/etc/ld.so.conf'

See any operating system documentation about shared libraries for
more information, such as the ld(1) and ld.so(8) manual pages.
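
Of the options in that message, the LD_LIBRARY_PATH route is the one that needs no root access. Here is a small sketch of a helper that prepends a directory only if it is not already listed (path_prepend is a hypothetical name of mine, not something shipped with GCC or libtool):

```shell
#!/bin/sh
# path_prepend DIR CURRENT: print CURRENT with DIR added in front,
# unless DIR is already one of the colon-separated entries.
path_prepend() {
    dir=$1
    current=$2
    case ":$current:" in
        *":$dir:"*) printf '%s\n' "$current" ;;
        *)          printf '%s\n' "${dir}${current:+:$current}" ;;
    esac
}

# Example (not run here):
#   LD_LIBRARY_PATH=$(path_prepend /apps/GNU/GCC/4.9.1/lib64 "$LD_LIBRARY_PATH")
#   export LD_LIBRARY_PATH
```

Printing the new value instead of exporting it inside the function keeps the helper usable for PATH and MANPATH as well.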

Wednesday, 29 October 2014

OpenFOAM 2.3.0 in Fuji

System: Fuji (Upgraded - CentOS 6.5)
Note: I used OpenMPI here for this test installation. Default MPI in Fuji is Intel MPI.

Download:
OpenFOAM-2.3.0.tgz
ThirdParty-2.3.0.tgz

Setting up GCC 4.9.1

GCC 4.9.1 is available from /apps/GNU/GCC/4.9.1

$ export PATH=/apps/GNU/GCC/4.9.1/bin:$PATH

$ export LD_LIBRARY_PATH=/apps/GNU/GCC/4.9.1/lib64:/apps/GNU/GCC/4.9.1/lib:/apps/GNU/MPC/1.0.2/lib:/apps/GNU/GMP/6.0.0/lib:/apps/GNU/MPFR/3.1.2/lib:$LD_LIBRARY_PATH

Setting up OpenFOAM Installation

$ mkdir ~/scratch/OpenFOAM

Download the OpenFOAM and ThirdParty into this directory.

$ tar -xzvf OpenFOAM-2.3.0.tgz
$ tar -xzvf ThirdParty-2.3.0.tgz

$ export FOAM_INST_DIR=~/scratch/OpenFOAM
$ foamDotFile=$FOAM_INST_DIR/OpenFOAM-2.3.0/etc/bashrc
$ [ -f $foamDotFile ] && . $foamDotFile


$ cd $FOAM_INST_DIR
$ mkdir obj
$ cd obj
$ ../OpenFOAM-2.3.0/Allwmake

Testing

Notice that all the executables, e.g. icoFoam, are installed in $FOAM_INST_DIR/bin.

On the remote machine, edit .bashrc to include:

export FOAM_INST_DIR=$HOME/scratch/OpenFOAM
source $FOAM_INST_DIR/OpenFOAM-2.3.0/etc/bashrc



To run a test parallel OpenFOAM on 2 nodes:

$ mkdir -p $FOAM_RUN

$ cp -r $FOAM_TUTORIALS $FOAM_RUN

$ cd $FOAM_RUN/tutorials/incompressible/icoFoam

$ cp -r cavity cavityParallel

Copy $WM_PROJECT_DIR/applications/utilities/parallelProcessing/decomposePar/decomposeParDict to cavityParallel/system

Edit the decomposeParDict:
numberOfSubdomains  2;
method simple;
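
For reference, the whole dictionary can also be written from scratch rather than edited by hand. This is only a sketch: the function name write_decompose_dict is mine, and splitting the domain as (N 1 1) along x is an assumption for illustration. With the simple method, the product of the three numbers in n has to equal numberOfSubdomains.

```shell
#!/bin/sh
# write_decompose_dict FILE N: write a minimal decomposeParDict that splits
# the case into N subdomains along x using the "simple" method.
write_decompose_dict() {
    cat > "$1" <<EOF
FoamFile
{
    version     2.0;
    format      ascii;
    class       dictionary;
    object      decomposeParDict;
}

numberOfSubdomains  $2;

method              simple;

simpleCoeffs
{
    n               ($2 1 1);
    delta           0.001;
}
EOF
}

# Example (not run here):
#   write_decompose_dict cavityParallel/system/decomposeParDict 2
```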

$ cd cavityParallel

$ blockMesh

$ cd ..

$ decomposePar -case cavityParallel

$ cd cavityParallel

$ echo -e 'fuji381\nfuji382' > hosts

$ mpirun --hostfile hosts -np 2 icoFoam -parallel

Tuesday, 28 October 2014

Installing OFED on Linux (CentOS 6.5)

As the title suggests, I will show how to install the OFED stack on CentOS 6.5.

Prerequisites

kernel-devel
rpm-build
libtool
gcc-c++
bison
flex
glib2-devel
glib2
tcl-devel
zlib-devel

Tips:
  • To prevent build errors, make sure your gcc version matches the one used to build your running kernel.
  • It's recommended to use the latest kernel from the repo.

Download OFED software from https://www.openfabrics.org/index.php

Extract the archive and run install.pl (use --help to see the options).

Reboot after installation.

Some points:

Typically, the locked-memory limit has to be set to unlimited to be able to run HPC MPI jobs across nodes.

Add the following to /etc/security/limits.conf:

* soft memlock unlimited
* hard memlock unlimited

Exit the shell and log in again; you should then see:
$ ulimit -l
unlimited
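
If ulimit -l still reports a small number, it helps to confirm that the limits file really contains both entries. A quick sketch (has_memlock_unlimited is a made-up helper name; it only looks at the wildcard "*" domain used above and also accepts the "-" type, which sets soft and hard together):

```shell
#!/bin/sh
# has_memlock_unlimited FILE: succeed if FILE grants unlimited memlock to
# all users ("*") for both the soft and the hard limit.
has_memlock_unlimited() {
    awk '
        $1 == "*" && $3 == "memlock" && $4 == "unlimited" {
            if ($2 == "soft" || $2 == "-") soft = 1
            if ($2 == "hard" || $2 == "-") hard = 1
        }
        END { exit !(soft && hard) }
    ' "$1"
}

# Example (not run here):
#   has_memlock_unlimited /etc/security/limits.conf && echo "memlock entries present"
```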

Monday, 27 October 2014

Intel MPI How To Use and Debug

Running /bin/hostname

MPIRUN directory:
/opt/intel/impi/<version>/intel64/bin

Source mpivars.sh

Create a machinefile:
$ cat mach.txt
node1
node2

Test run:
$ mpirun -r ssh -f mach.txt -ppn 1 -np 2 /bin/hostname

mpirun is a utility that runs mpdboot and then mpiexec, so options for mpdboot come first, followed by options for mpiexec. '-machinefile' is an option for mpiexec.
With mpirun, there is no need to run mpdboot (needed only for mpiexec) or to create mpd.hosts.

If instead you would like to use mpiexec, you would have to do the following.

Create mpd.hosts on your working directory.

Example is
$ cat mpd.hosts
node1
node2

Start the mpd ring:
$ mpdboot

Try another one: cpi.c
$ mpiicc cpi.c
$ mpirun -f mach.txt -ppn 4 -np 8 ./a.out

Debugging

Note: iptables rules can block MPI traffic between nodes. You might need to flush or edit the iptables rules.

#Pass DEBUG environment variables
export I_MPI_DEBUG=5

#Check mpd is up
$ mpdtrace

#To specifically use IB HCA port 2 instead of default port 1
export I_MPI_DAPL_PROVIDER=ofa-v2-mlx4_0-2

Note: The DAPL versions on all nodes must match. Older versions of Intel MPI do not support DAPL v2.0. When installing the OS, make sure the necessary InfiniBand drivers (e.g. DAPL 1.2 if using an old Intel MPI) are installed.

Monday, 13 October 2014

Configuring Internet Connection For Compute Nodes Of A Cluster

Idea: In a cluster, usually only the head node has an outgoing internet connection. This post details how to give the compute nodes outgoing internet access by using the head node as the router.

Assume enp129s0f0 is the internal network interface and enp129s0f1 is the external one.

1. Tell kernel to allow ip forwarding:

On the head node 
$ echo 1 > /proc/sys/net/ipv4/ip_forward

2. Configure IPTABLES to forward packets from internal network.

On the head node
$ sudo iptables -t nat -A POSTROUTING -o enp129s0f1 -j MASQUERADE
$ sudo iptables -A FORWARD -i enp129s0f0 -o enp129s0f1 -j ACCEPT
$ sudo iptables -A FORWARD -i enp129s0f1 -o enp129s0f0 -m state --state RELATED,ESTABLISHED -j ACCEPT
$ sudo iptables -A FORWARD -m conntrack --ctstate RELATED,ESTABLISHED -j ACCEPT
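
The two steps above can be wrapped so that the interface names are parameters. The sketch below only prints the commands, so they can be reviewed before running them as root; nat_rules is a name I made up, sysctl -w is used as an equivalent of writing 1 to /proc/sys/net/ipv4/ip_forward, and only the conntrack form of the RELATED,ESTABLISHED rule is kept.

```shell
#!/bin/sh
# nat_rules INT_IF EXT_IF: print (do not run) the commands that turn the
# head node into a NAT router for the internal network.
nat_rules() {
    int_if=$1
    ext_if=$2
    cat <<EOF
sysctl -w net.ipv4.ip_forward=1
iptables -t nat -A POSTROUTING -o $ext_if -j MASQUERADE
iptables -A FORWARD -i $int_if -o $ext_if -j ACCEPT
iptables -A FORWARD -i $ext_if -o $int_if -m conntrack --ctstate RELATED,ESTABLISHED -j ACCEPT
EOF
}

# Example (not run here):
#   nat_rules enp129s0f0 enp129s0f1 | sudo sh
```

Keep in mind that neither the forwarding flag nor rules added with iptables -A survive a reboot unless they are saved separately.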


Sunday, 12 October 2014

Debugging SSH without-password

Sometimes, we want to set up passwordless login across nodes. In this post, I will use the root user as an example.

Common Practice

Easiest way:
$ ssh-copy-id remotehostname

Sometimes, we may encounter
/usr/bin/ssh-copy-id: ERROR: No identities found

In that case, do:
$ ssh-copy-id -i ~/.ssh/id_rsa.pub remotehostname

If known_hosts has an offending key, ssh prints:
@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
@    WARNING: REMOTE HOST IDENTIFICATION HAS CHANGED!     @
@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
IT IS POSSIBLE THAT SOMEONE IS DOING SOMETHING NASTY!
Someone could be eavesdropping on you right now (man-in-the-middle attack)!
It is also possible that the RSA host key has just been changed.
   .
   .
   .
Offending key in /root/.ssh/known_hosts: 6
   .
   .
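
The "Offending key" line names both the file and the line number (6 in the message above). If the host key change is expected, for example because the node was reinstalled, that single line can be removed. A minimal sketch with a backup (remove_known_host_line is a hypothetical helper of mine; OpenSSH's own ssh-keygen -R hostname does the same job by host name):

```shell
#!/bin/sh
# remove_known_host_line FILE N: delete line N from FILE.
# The untouched original is kept as FILE.bak.
remove_known_host_line() {
    file=$1
    line=$2
    case $line in
        ''|*[!0-9]*) echo "usage: remove_known_host_line FILE N" >&2; return 1 ;;
    esac
    [ "$line" -ge 1 ] || return 1
    cp -p "$file" "$file.bak" || return 1
    # Rewrite FILE from the backup so its permissions stay as they were.
    sed "${line}d" "$file.bak" > "$file"
}

# Example (not run here):
#   remove_known_host_line /root/.ssh/known_hosts 6
```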

Thursday, 9 October 2014

Configuring a Compute Node (of a Supercomputer)

This post details how a compute node is set up after an upgrade (e.g. an OS update).
These instructions are based on the real setup of an HPC system that I administered.

Migrate User Accounts

Write a script to transfer the user accounts. This step creates the user home directories cleanly (optional).
Requires: /etc/shadow and /etc/passwd from the login node.
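
The core of such a script is picking the real user accounts out of the passwd file. This is only a sketch of that part, not the script I actually used: list_regular_users is a made-up name, and the UID threshold of 500 is an assumption based on the CentOS 6 default for regular accounts, so adjust it to your site.

```shell
#!/bin/sh
# list_regular_users PASSWD [MIN_UID]: print "name home" for every account
# in PASSWD whose UID is at least MIN_UID (default 500, an assumption),
# skipping the nfsnobody system account.
list_regular_users() {
    awk -F: -v min="${2:-500}" \
        '$3 >= min && $1 != "nfsnobody" { print $1, $6 }' "$1"
}

# Example (not run here), using a copy of the login node's passwd file:
#   list_regular_users passwd.login | while read -r user home; do
#       mkdir -p "$home"
#   done
```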

Next, copy over the /etc/passwd, /etc/shadow, /etc/group, and /etc/hosts from master node. Remember to make copies of the existing files on the compute node.

Transfer user home directory to /home01
$ cd /
$ ln -s home home01

Set up internet

$ echo 'GATEWAY=10.10.8.243' >> /etc/sysconfig/network

$ echo -e 'nameserver 10.10.8.236\nnameserver 202.83.248.3\nnameserver 123.136.66.68' >> /etc/resolv.conf

$ service network restart

Make a copy of the existing iptables rules, then flush them. Comment out the rules in /etc/sysconfig/iptables.

Wednesday, 8 October 2014

Testing Intel OFED Installation.

Intel TrueScale QDR Qlogic with Intel OFED.


First, you have to install OFED from Intel.

Testing OpenMPI (with Intel compilers).

cd /usr/mpi/intel/openmpi-<version>-qlc/tests/osu_benchmarks

source /usr/mpi/intel/openmpi-<version>-qlc/bin/mpivars.sh

mpirun -host node1,node2 -np 2 ./osu_latency
#0-byte message should be less than 2 us.

mpirun -host node1,node2 -np 2 ./osu_bw
#Larger messages should hit above 3000MB/s.

Note: When running MPI apps, do not use InfiniBand Verbs (IBV), OpenIB, DAPL, etc. Use PSM instead.
If the latency for a 0-byte message is around 5 us, the Verbs interface is being used instead of PSM.
If this happens, try exiting the ssh session and logging in again.
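
The 0-byte check can be scripted. The sketch below assumes the usual osu_latency output layout (header lines starting with "#", then one "size latency" pair per line); both function names are made up for this example, and the 2 us threshold is the one quoted above.

```shell
#!/bin/sh
# zero_byte_latency: read osu_latency output on stdin and print the
# latency (in us) reported for the 0-byte message.
zero_byte_latency() {
    awk '$1 == "0" { print $2; exit }'
}

# latency_ok LATENCY_US: succeed if the latency is below 2 us.
latency_ok() {
    awk -v l="$1" 'BEGIN { exit !(l + 0 < 2) }'
}

# Example (not run here):
#   lat=$(mpirun -host node1,node2 -np 2 ./osu_latency | zero_byte_latency)
#   latency_ok "$lat" || echo "0-byte latency is $lat us: probably not PSM"
```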

Tuesday, 30 September 2014

Installing Intel TrueScale Fabric HCA Host Software

Background: I have a Haswell system with CentOS 7.0 on which I would like to have InfiniBand software installed.

Removing Previous OFED installation

Before installing the Intel OFED, get the latest OFED from OpenFabrics Alliance. Run the install.pl script to uninstall the previously installed OFED software.

After a clean uninstall, reboot the machine. Manually remove any remaining ib modules after the reboot.
For example, I would do:
$ rmmod ib_qib
$ rmmod ib_mad
$ rmmod ib_core

Next, install the OFED software by running
$ ./install.pl -k <kernel-version> -s /lib/modules/<kernel-version>/build --umad-dev-rw

Reboot.

Installing the Intel OFED Software

Grab the installer from Intel Download Center

Since I am using CentOS 7, I will choose IntelIB-Basic.RHEL7-x86_64.7.3.0.0.26.tgz

Run the INSTALL script provided with the package with --force option (since the script will complain due to my OS being CentOS 7).

OFED components that I installed include:
  • IPoIB and MPI over uDAPL
  • ib_qib #Intel TrueScale cards
  • libibumad
  • OpenSM #Subnet Manager
  • SDP (Sockets Direct Protocol)
  • SRP
  • Perftest
  • Intel MPI
  • Debug info
NOTE: Do not install iWARP.

Monday, 15 September 2014

Setting up a two-node HPC cluster with InfiniBand

In this post, I will share how to build a two-node cluster with an InfiniBand (IB) interconnect.

I have installed CentOS 7.0 (server with GUI version) with IB and iWARP support.

To install the necessary software

The following are to be done on every node, unless stated otherwise.

$ yum groupinstall "Infiniband Support"
$ yum install infiniband-diags perftest qperf opensm

OpenSM is the subnet manager.
 
Extra steps (optional):

Edit /etc/default/opensm such that
$ cat /etc/default/opensm
PORTS="0x00117500007005aa 0x0011750000700c2a"

The port GUID can be obtained by doing
$ ibstat -p

Activate the services:
$ chkconfig rdma on
$ chkconfig opensm on #only on master node

$ service rdma start
$ service opensm start #only on master node
$ shutdown -r now

After services are started / reboot, hopefully ibstat will show State "Active" and Physical State "LinkUp".
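
That check is easy to script for both nodes. A minimal sketch, assuming the usual ibstat layout where each port prints a "State:" line followed by a "Physical state:" line (ib_link_up is a name I made up):

```shell
#!/bin/sh
# ib_link_up: read ibstat output on stdin; succeed if at least one port
# reports both "State: Active" and "Physical state: LinkUp".
ib_link_up() {
    awk '
        $1 == "State:"                     { active = ($2 == "Active") }
        $1 == "Physical" && $2 == "state:" { if (active && $3 == "LinkUp") up = 1 }
        END { exit !up }
    '
}

# Example (not run here):
#   ibstat | ib_link_up && echo "IB link is up"
```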

To check network connectivity

Display all switches
$ ibswitches

Display all hosts visible in the network
$ ibhosts

Reports link info
$ iblinkinfo

Testing with ibping
$ ibping -S #on one server
$ ibping -G 0x0011750000700c2a #on another server

Replace the port GUID above with yours.

Sunday, 14 September 2014

Setting up a two-node ethernet cluster

In this post I would like to share how to set up a two-node cluster.

First: Configuring the ethernet connection

For every node, we have to configure the network script for the ethernet card, /etc/hosts, /etc/hostname, /etc/resolv.conf, /etc/hosts.allow (and /etc/hosts.deny), and iptables. The goals are the following:
  • Assign a private static IP to every node.
  • Assign a hostname to every node.
  • Assign a virtual gateway and nameserver.
  • Allow communication between the nodes through iptables (if applicable) and hosts.allow.

Let's say we have two nodes and we would like to name them node1 (head node) and node2.

Note: On node1 (chosen as the head node), enp3s0f0 is connected to the internet. So I used enp3s0f1 for the local network.

Tuesday, 9 September 2014

Compiling Linux Kernel 3.16.1 for CentOS 7.0 on Haswell Server

Background: I would like to upgrade the existing kernel for CentOS 7.0 (3.10.0-123.6.3.el7.x86_64) to 3.16.1 (latest kernel from kernel.org - not available in repo yet).


Download the latest kernel from www.kernel.org into /usr/src/kernels. Untar it and change into the kernel directory (in this case linux-3.16.1).

Install the necessary packages:
$ yum groupinstall "Development Tools"
$ yum install ncurses ncurses-devel

Do a proper cleanup:
$ make mrproper

Copy current kernel config (in /boot) to .config (in working dir) as a base to use.
Note that 'make mrproper' deletes .config
$ cp /boot/config-3.10.0-123.el7.x86_64 .config

Edit the configuration further if needed:
$ make menuconfig

To adjust compiling to the number of your CPU cores (for faster compilation time):
$ export CONCURRENCY_LEVEL=`getconf _NPROCESSORS_ONLN`

Compile and build the rpms
$ make rpm

Install the kernel
$ rpm -ivh /root/rpmbuild/RPMS/x86_64/kernel-3.16.1-1.x86_64.rpm


This should set up the initrd and grub settings. If not, manually do these:
$ mkinitrd /boot/initramfs-3.16.1.img 3.16.1
$ grubby --add-kernel=/boot/vmlinuz-3.16.1  --initrd=/boot/initramfs-3.16.1.img --title="CentOS Linux 7.0 3.16.1" --make-default --copy-default

Tuesday, 19 August 2014

OpenFOAM 2.3.0 with GCC 4.9.1 on SGI UV running Linux

Hardware specs
Model: SGI Altix UV 1000
OS: SUSE Linux Enterprise Server 11

To use GCC 4.9.1
  • Set up the environment variables to include /apps/GNU/GCC/4.9.1 in $PATH and $LD_LIBRARY_PATH. Refer to the Readme file in that directory.

Directory Structure
/apps/OpenFOAM/OpenFOAM-2.3.0

Required:
OpenFOAM-2.3.0.tar.gz
ThirdParty-2.3.0.tar.gz

Environment variables:

#Use SGI MPI
export PATH=/opt/sgi/mpt/mpt-2.08/bin:$PATH
export LD_LIBRARY_PATH=/opt/sgi/mpt/mpt-2.08/lib:$LD_LIBRARY_PATH

#setting the flags and installation directory
export WM_CFLAGS="$WM_CFLAGS -DMPI_NO_CPPBIND -DSGIMPI"
export FOAM_INST_DIR=/apps/OpenFOAM
foamDotFile=$FOAM_INST_DIR/OpenFOAM-2.3.0/etc/bashrc

[ -f $foamDotFile ] && . $foamDotFile

Installation

./Allwmake