All posts by Brian Jo

Setting up and running hail on cluster (also featuring Apache Spark)

According to http://hail.is, “Hail is an open-source, scalable framework for exploring and analyzing genomic data. Starting from genetic data in VCF, BGEN or PLINK format, Hail can, for example:

load variant and sample annotations from text tables, JSON, VCF, VEP, and locus interval files
generate variant annotations like call rate, Hardy-Weinberg equilibrium p-value, and population-specific allele count
generate sample annotations like mean depth, imputed sex, and TiTv ratio
generate new annotations from existing ones as well as genotypes, and use these to filter samples, variants, and genotypes
find Mendelian violations in trios, prune variants in linkage disequilibrium, analyze genetic similarity between samples via the GRM and IBD matrix, and compute sample scores and variant loadings using PCA
perform variant, gene-burden and eQTL association analyses using linear, logistic, and linear mixed regression, and estimate heritability”

I have finished setting up hail to run on the cluster, and this document summarizes what needs to be done to run hail on the cluster, in both standalone and cluster modes.

These links will also prove to be useful – I recommend reading through them first:

https://hail.is/docs/stable/getting_started.html
https://www.princeton.edu/researchcomputing/faq/spark-via-slurm/

But note that I had to set quite a few configurations by hand to make this work.

First, it is worth noting that the spark distribution we have on the cluster is only compatible with python 2; if you have a default python 3 install and a PYTHONPATH pointing at it, you may need to disable these.
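A quick way to check whether a python 3 setup is leaking into your environment (plain shell commands, nothing hail-specific; only clear PYTHONPATH if it actually points at a python 3 install):

python --version          # spark on the cluster expects a 2.x interpreter here
echo $PYTHONPATH          # should not point at a python 3 site-packages directory
# unset PYTHONPATH        # uncomment to clear a conflicting PYTHONPATH for this session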

You can double check which versions of python, ipython and jupyter are being used. First, run:
module load anaconda
module load spark/hadoop2.6/2.1.0

Take note that although the spark tutorial from CSES says to run module load python, I found that you actually need to run module load anaconda to get the python packages that hail needs.

These commands load all the necessary binaries. You’ll see that they also need to be included in the .slurm file.

[bj5@della5 ~]$ which spark-submit
/usr/licensed/spark/spark-2.1.0-bin-hadoop2.6/bin/spark-submit
[bj5@della5 ~]$ which python
/usr/licensed/anaconda/5.0.1/bin/python
[bj5@della5 ~]$ which ipython
/usr/licensed/anaconda/5.0.1/bin/ipython
[bj5@della5 ~]$ which jupyter
/usr/licensed/anaconda/5.0.1/bin/jupyter

As for the version of hail, instead of building hail from source, CSES suggested that I download the pre-built distribution (compatible with spark 2.1.0) from https://storage.googleapis.com/hail-common/distributions/0.1/Hail-0.1-1214727c640f-Spark-2.1.0.zip

The hail directory is now located at /tigress/BEE/spark_hdfs/hail. Disregard the hadoop and spark directories in spark_hdfs; we’ll stick to the modules installed by CSES on Della.

Now, we need to set a few configurations in ~/.bashrc (or ~/.bash_profile, depending on how your shell is set up):

export JAVA_HOME=/usr/lib/jvm/jre-1.8.0-openjdk
export PATH=$PATH:$JAVA_HOME/bin
# These env vars added for Spark – change to fit your appropriate spark version
export SPARK_HOME=/usr/licensed/spark/spark-2.1.0-bin-hadoop2.6/
export PATH=$PATH:$SPARK_HOME/bin
# These env vars added for hail
export HAIL_HOME=/tigress/BEE/spark_hdfs/hail
export PATH=$PATH:$HAIL_HOME/bin
# set this to jupyter if running a jupyter notebook – otherwise set it to python
# export PYSPARK_DRIVER_PYTHON=/usr/licensed/anaconda/5.0.1/bin/jupyter
export PYSPARK_DRIVER_PYTHON=/usr/licensed/anaconda/5.0.1/bin/python
export PYSPARK_PYTHON=`which python`
# Set this forwarding if running a jupyter notebook
# export PYSPARK_DRIVER_PYTHON_OPTS="notebook --no-browser --port=8889 --ip=127.0.0.1"
export PYTHONPATH="$HAIL_HOME/python:$SPARK_HOME/python:$(echo ${SPARK_HOME}/python/lib/py4j-*-src.zip | tr '\n' ':')$PYTHONPATH"
export SPARK_CLASSPATH=$HAIL_HOME/jars/hail-all-spark.jar

Take note that the environment variable PYSPARK_DRIVER_PYTHON needs to be set differently depending on whether you’re running standalone mode with a jupyter notebook or cluster mode. PYSPARK_DRIVER_PYTHON_OPTS also needs to be set to allow ssh tunneling to the jupyter notebook (but not in cluster mode). Also take note of the JAVA_HOME directory; setting it is not mentioned in the CSES spark tutorial, but I’ve found that pointing JAVA_HOME at any other directory makes spark fail.

Let’s start with cluster mode: submitting a hail job on the cluster. Before doing so, I created a .zip file of the directory /tigress/BEE/spark_hdfs/hail/python/hail so that it can be passed to spark-submit via --py-files hail/python/hail.zip, as shown in the sketch below.
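Creating the zip is a one-time step. A minimal sketch, assuming the directory layout above and that the job is submitted from /tigress/BEE/spark_hdfs (so the relative paths passed to spark-submit below resolve):

cd /tigress/BEE/spark_hdfs/hail/python
# bundle the hail python package so spark-submit can ship it to the executors
zip -r hail.zip hail

With the archive in place, here is my test.slurm script: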

#!/bin/bash
#SBATCH -N 1
#SBATCH -t 10:00:00
#SBATCH --ntasks-per-node 3
#SBATCH --cpus-per-task 2

module load anaconda
module load spark/hadoop2.6/2.1.0

# sanity check that SPARK_HOME points at a real spark install
spark_core_jars=( "${SPARK_HOME}/jars/spark-core*.jar" )
if [ ${#spark_core_jars[@]} -eq 0 ]
then
echo "Could not find a spark-core jar in ${SPARK_HOME}/jars, are you sure SPARK_HOME is set correctly?" >&2
exit -1
fi

# spark-start is the CSES helper that launches a standalone spark cluster on the
# allocated nodes and sets $MASTER to the master URL
spark-start
echo $MASTER
spark-submit --total-executor-cores 6 --executor-memory 5G \
--jars hail/jars/hail-all-spark.jar \
--py-files hail/python/hail.zip test.py

This script submits a job with 6 cores (1 node, 3 tasks per node, and 2 cpus per task). The --total-executor-cores and --executor-memory options are not detailed in the hail tutorial, but they are suggested by the CSES tutorial.
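To submit it, the usual slurm workflow applies (assuming you saved the script as test.slurm; the name is arbitrary):

sbatch test.slurm     # submit the hail job
squeue -u $USER       # check that it is queued/running; output goes to slurm-<jobid>.out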

My test.py script imported a test vcf file and saved it in the .vds format used by hail:

from hail import *
hc = HailContext()
hc.import_vcf('file:///tigress/BEE/spark_hdfs/test.vcf').write('file:///tigress/BEE/spark_hdfs/test.vds')

This part is pretty simple; you just need to remember that our cluster doesn’t have a dedicated HDFS or other filesystem for Spark, so all file paths need to be prefixed with file://.

The part that gave me the most trouble (and is still not fully resolved) is running a spark-enabled jupyter notebook.

First, we need to start up a spark cluster that is idling:

#!/bin/bash
#SBATCH -N 1
#SBATCH -t 10:00:00
#SBATCH --ntasks-per-node 3
#SBATCH --cpus-per-task 2

module load anaconda
module load spark/hadoop2.6/2.1.0

spark_core_jars=( "${SPARK_HOME}/jars/spark-core*.jar" )
if [ ${#spark_core_jars[@]} -eq 0 ]
then
echo "Could not find a spark-core jar in ${SPARK_HOME}/jars, are you sure SPARK_HOME is set correctly?" >&2
exit -1
fi

spark-start
echo $MASTER
# keep the allocation alive so the spark cluster stays up for interactive use
sleep infinity

Then, we need to check the slurm output file, which will contain lines that look like this:
Starting master on spark://della-r1c3n12:7077
starting org.apache.spark.deploy.master.Master, logging to /tmp/spark-bj5-org.apache.spark.deploy.master.Master-1-della-r1c3n12.out
Starting slaves

This means that a spark cluster has been instantiated at spark://della-r1c3n12:7077
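If you would rather not eyeball the output file, the master URL can be pulled out with a quick grep (slurm-*.out assumes the default output file name; adjust if you set #SBATCH -o):

grep -o 'spark://[^ ]*' slurm-*.out     # prints e.g. spark://della-r1c3n12:7077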

Now, remembering to reset the PYSPARK_DRIVER_PYTHON and PYSPARK_DRIVER_PYTHON_OPTS variables (run . ~/.bashrc after editing the file; unfortunately, if you set environment variables in the .slurm script, jupyter is not able to see them; the jupyter-mode values are repeated below for reference), you can run:

pyspark --master spark://della-r1c3n12:7077 --total-executor-cores 6 --conf spark.sql.files.openCostInBytes=1099511627776 --conf spark.sql.files.maxPartitionBytes=1099511627776 --conf spark.hadoop.parquet.block.size=1099511627776

making sure that the master address is correctly set. The conf variables spark.sql.files.openCostInBytes, spark.sql.files.maxPartitionBytes and spark.hadoop.parquet.block.size are recommended in the hail tutorial only for Cloudera clusters, but hail does not work on our cluster unless they are set.
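For reference, switching to jupyter mode just means flipping the two driver-related exports in the ~/.bashrc block earlier (these are the same values as the commented-out lines there):

export PYSPARK_DRIVER_PYTHON=/usr/licensed/anaconda/5.0.1/bin/jupyter
export PYSPARK_DRIVER_PYTHON_OPTS="notebook --no-browser --port=8889 --ip=127.0.0.1"

Remember to source ~/.bashrc so that the shell launching pyspark picks them up.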

Now, from the local machine, you can run: ssh -N -f -L localhost:8889:localhost:8889 yourusername@clustername.princeton.edu

Now you can access the spark- and hail-enabled jupyter notebook from your local machine at http://127.0.0.1:8889. If this is your first time accessing it, you may also need to enter the token value printed by jupyter; in my case, the URL looked like this:

http://127.0.0.1:8889/?token=fb7f4cee67d15f6f900455cef2f38c2dd299cc60b0856032

Another important difference is that in order to start hail here, you need to run:

from hail import *
hc = HailContext(sc)

with the pre-defined SparkContext variable sc (created by pyspark).

However, there is still a problem with performing operations on data tables; they fail with the following error:
vds = hc.read('test.vds')
vds.sample_ids[:5]

---------------------------------------------------------------------------
IllegalArgumentException Traceback (most recent call last)

IllegalArgumentException: u"Error while instantiating 'org.apache.spark.sql.hive.HiveSessionState':"

It seems that the Hive-backed SQL session is not being properly initialized. For development purposes, I recommend installing hail locally on your machine (which takes a fair amount of custom configuration) and working with a small subset of genotypes, expression values, etc., until the issue with jupyter notebooks on the cluster is resolved. But make sure to install compatible versions of the JRE (or JDK), hadoop and hail. The github page for hail:

https://github.com/broadinstitute/hail.git

states that it is also compatible with spark-2.2.0, so building from source from the github repository may be a good option as well; this is what I ended up doing on my local machine.
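If you do build from source locally, the steps are roughly the following sketch; the exact gradle invocation (in particular the spark version property) is an assumption on my part and should be checked against the hail getting-started docs for your spark version:

git clone https://github.com/broadinstitute/hail.git
cd hail
# build the hail jar; the -Dspark.version flag below is assumed, check the hail docs
./gradlew -Dspark.version=2.2.0 shadowJar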

But the good news is that hail jobs with spark can now be set up and run on the cluster!

Installing IPython notebook on Della – the conda way (featuring Python 3 and IPython 4.0, among other things)

Thanks to Ian’s previous post, I was able to set up IPython notebook on Della, and I’ve been working extensively with it. However, when I tried to sync notebooks between my local machine and Della, I found that the version of IPython on Della is the old 2.3 release, and the newer notebook format is not backward compatible with it. So any IPython notebook that I create and work on locally simply will not work on Della, which is quite annoying.

Also, I think there is a lot of benefit to setting up and using Anaconda in my Della directory. It sets up a lot of packages (including Python 3, instead of the archaic 2.6 that Della runs; you have to module load python as Ian does in his post in order to load 2.7) and manages them seamlessly, without having to worry about what package is in what directory.

According to the Conda website:

Conda is an open source package management system and environment management system for installing multiple versions of software packages and their dependencies and switching easily between them. It works on Linux, OS X and Windows, and was created for Python programs but can package and distribute any software.

So, let’s get started. First, go to:

http://repo.continuum.io/miniconda/index.html

And download the latest Linux x86-64 version, namely:

Miniconda3-latest-Linux-x86_64.sh

Then, scp the Miniconda installer to your home directory on Della (e.g. /home/my.user.name/).
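For example, from the directory where you downloaded the installer (replace my.user.name with your Della username):

scp Miniconda3-latest-Linux-x86_64.sh my.user.name@della.princeton.edu:/home/my.user.name/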

Note: I initially tried using the easy_install way of installing conda, only to run into the following error:

Error: This installation of conda is not initialized. Use 'conda create -n
envname' to create a conda environment and 'source activate envname' to
activate it.

# Note that pip installing conda is not the recommended way for setting up your
# system. The recommended way for setting up a conda system is by installing
# Miniconda, see: http://repo.continuum.io/miniconda/index.html

It is indeed preferable to follow their instructions. Then run:

sh Miniconda3-latest-Linux-x86_64.sh

And follow their instructions. Conda will install a set of basic packages (including python 3.5, conda, openssl, pip and setuptools, to name only a few useful ones) under the directory you specify, or the default directory:

/home/my.user.name/miniconda3

It also modifies the PATH for you so that you don’t have to worry about that yourself. How nice of them. (But sometimes you might need to point explicitly at the default versions of programs that are on Della, especially when distributing jobs to other users. Don’t forget to specify them when needed. You should be set for most use cases, though.)
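A quick way to confirm that the conda-managed binaries are the ones on your PATH (exact paths and versions will vary with your install location):

which python       # should point into ~/miniconda3/bin
which conda
python --version   # a 3.x python from miniconda, not the system 2.6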

Now, since we are using the conda-managed version of pip, simply running

pip install ipython
pip install jupyter

or

conda install ipython
conda install jupyter

conda will integrate these packages into your environment. Neat.

That’s it! You can double check what packages you have by running:

conda list

After this, the steps for serving the notebook to your local browser are identical to the previous post. Namely:

#create mycert.pem using the following openssl cmd:

openssl req -x509 -nodes -days 365 -newkey rsa:1024 -keyout mycert.pem -out mycert.pem

# mv mycert.pem wherever you'd like

mv mycert.pem ~/local/lib/

# create an ipython profilename

ipython profile create profilename

# generate a password using the following ipython utility:

python
from IPython.lib import passwd
passwd()
Enter password:
Verify password:
'sha1:…'

#copy this hashed pass

vi /home/my.della.user.name/.ipython/profile_profilename/ipython_config.py

# edit:

c.NotebookApp.port = 1999 # change this port number to something not in use, I used 2999
c.NotebookApp.password = 'sha1:…' #use generated pass here
c.NotebookApp.certfile = u'/home/my.della.user.name/local/lib/mycert.pem'
c.NotebookApp.open_browser = False
c.NotebookApp.ip = '127.0.0.1'

[save]

 

# sign off and sign back on to della

ssh -A -L<Your Port #>:127.0.0.1:<Your Port #> my.della.user.name@della.princeton.edu

# boot up notebook

ipython notebook --ip=127.0.0.1 --profile=profilename --port <Your Port #>
# note that if you are trying to access Della
# from outside the Princeton CS department, you
# may have to forward the same port from your home computer
# to some princeton server, then again to Della

# In your browser go to

http://127.0.0.1:<Your Port #>


After you’ve set everything up, you can upload the ipython notebook to gist for sharing with others. I’ll repeat the steps from the post "Upload a gist to Github directly from Della" for convenience:

# First, install gist gem locally at della
gem install --user-install gist
echo 'export PATH=$PATH:/PATH/TO/HOME/.gem/ruby/1.8/bin/' >> ~/.bashrc
source ~/.bashrc

# Boot up connection
gist --login
[Enter Github username and password]

# Upload gist, e.g.
gist my_notebook.ipynb -u [secret_gist_string]

# secret_gist_string is the string already associated with a particular file on Github
# To obtain it, the first time you upload a file to Github (e.g. my_notebook.ipynb) go to
# https://github.com/ | Gist | Add file [upload file] | Create secret Gist, which will
# return a secret_gist_string on the panel at right (labeled “HTTPS”)

Here is an example ipython notebook that I shared through gist and is available for viewing:

GTEx eQTL detection: cis- pipeline