
Setting up and running hail on cluster (also featuring Apache Spark)

According to http://hail.is, “Hail is an open-source, scalable framework for exploring and analyzing genomic data. Starting from genetic data in VCF, BGEN or PLINK format, Hail can, for example:

load variant and sample annotations from text tables, JSON, VCF, VEP, and locus interval files
generate variant annotations like call rate, Hardy-Weinberg equilibrium p-value, and population-specific allele count
generate sample annotations like mean depth, imputed sex, and TiTv ratio
generate new annotations from existing ones as well as genotypes, and use these to filter samples, variants, and genotypes
find Mendelian violations in trios, prune variants in linkage disequilibrium, analyze genetic similarity between samples via the GRM and IBD matrix, and compute sample scores and variant loadings using PCA
perform variant, gene-burden and eQTL association analyses using linear, logistic, and linear mixed regression, and estimate heritability”

I have finished setting up hail on the cluster, and this document summarizes what needs to be done in order to run hail there, in both standalone and cluster modes.

These links will also prove to be useful – I recommend reading through them first:

https://hail.is/docs/stable/getting_started.html
https://www.princeton.edu/researchcomputing/faq/spark-via-slurm/

But it must be noted that I had to set quite a few configurations by hand to make this work.

First, it is worth noting that the spark distribution we have on the cluster is only compatible with python 2 – if you have a python 3 installation as your default and a PYTHONPATH set for it, you may need to disable these.

You can double check the versions of python, ipython and jupyter being used – first, run:
module load anaconda
module load spark/hadoop2.6/2.1.0

Take note that although the spark tutorial from CSES states that you should run module load python, I found that you actually need to run module load anaconda to get the data structures necessary for hail to run.

These commands will load all the necessary binaries. You’ll see that they also need to be included in the .slurm file.

[bj5@della5 ~]$ which spark-submit
/usr/licensed/spark/spark-2.1.0-bin-hadoop2.6/bin/spark-submit
[bj5@della5 ~]$ which python
/usr/licensed/anaconda/5.0.1/bin/python
[bj5@della5 ~]$ which ipython
/usr/licensed/anaconda/5.0.1/bin/ipython
[bj5@della5 ~]$ which jupyter
/usr/licensed/anaconda/5.0.1/bin/jupyter

As for the version of hail, instead of building hail from source, CSES suggested that I download the pre-built distribution (compatible with spark 2.1.0) from https://storage.googleapis.com/hail-common/distributions/0.1/Hail-0.1-1214727c640f-Spark-2.1.0.zip
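If you need to reproduce this step, a minimal sketch of downloading and unpacking the distribution could look like the following (the destination directory is an assumption; adjust it to wherever you want HAIL_HOME to point, and check that the unpacked directory is the one containing jars/ and python/):

cd /tigress/BEE/spark_hdfs
wget https://storage.googleapis.com/hail-common/distributions/0.1/Hail-0.1-1214727c640f-Spark-2.1.0.zip
# unpack; depending on the archive layout you may need to rename the extracted
# directory so that it ends up at /tigress/BEE/spark_hdfs/hail
unzip Hail-0.1-1214727c640f-Spark-2.1.0.zip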

The hail directory is now located at /tigress/BEE/spark_hdfs/hail. Disregard the hadoop and spark directories in spark_hdfs – we’ll stick to the modules installed by CSES on Della.

Now, we need to set a few configurations in ~/.bashrc (or ~/.bash_profile, depending on how your shell is set up):

export JAVA_HOME=/usr/lib/jvm/jre-1.8.0-openjdk
export PATH=$PATH:$JAVA_HOME/bin
# These env vars added for Spark - change to fit your spark version
export SPARK_HOME=/usr/licensed/spark/spark-2.1.0-bin-hadoop2.6/
export PATH=$PATH:$SPARK_HOME/bin
# These env vars added for hail
export HAIL_HOME=/tigress/BEE/spark_hdfs/hail
export PATH=$PATH:$HAIL_HOME/bin
# set this to jupyter if running a jupyter notebook - otherwise set it to python
# export PYSPARK_DRIVER_PYTHON=/usr/licensed/anaconda/5.0.1/bin/jupyter
export PYSPARK_DRIVER_PYTHON=/usr/licensed/anaconda/5.0.1/bin/python
export PYSPARK_PYTHON=`which python`
# Set this forwarding if running a jupyter notebook
# export PYSPARK_DRIVER_PYTHON_OPTS="notebook --no-browser --port=8889 --ip=127.0.0.1"
export PYTHONPATH="$HAIL_HOME/python:$SPARK_HOME/python:$(echo ${SPARK_HOME}/python/lib/py4j-*-src.zip | tr '\n' ':')$PYTHONPATH"
export SPARK_CLASSPATH=$HAIL_HOME/jars/hail-all-spark.jar

Take note that the environment variable PYSPARK_DRIVER_PYTHON needs to be set differently depending on whether you’re running standalone mode with a jupyter notebook or cluster mode. PYSPARK_DRIVER_PYTHON_OPTS also needs to be set in order to allow ssh tunneling to the jupyter notebook (but not in cluster mode). Also take note of the JAVA_HOME directory – setting this is not mentioned in the CSES spark tutorial, but I’ve found that pointing JAVA_HOME at a different directory breaks spark.
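After sourcing the updated ~/.bashrc (and with the anaconda and spark modules from above loaded), a quick sanity check is to confirm that the binaries resolve as expected and that python can find the hail package on PYTHONPATH. This is just a sketch; the exact paths depend on your setup.

source ~/.bashrc
echo $SPARK_HOME
echo $HAIL_HOME
which spark-submit
# should print a path under $HAIL_HOME/python; if the import fails, re-check PYTHONPATH
python -c "import hail; print(hail.__file__)"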

Let’s start by running cluster mode – submitting a hail job on the cluster. Before doing so, I created a .zip file of the directory /tigress/BEE/spark_hdfs/hail/python/hail in order to pass it to the spark-submit option --py-files hail/python/hail.zip, as sketched below.
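A minimal sketch of how that archive could be created (paths assume the hail distribution sits at /tigress/BEE/spark_hdfs/hail, as above, and that the job is submitted from /tigress/BEE/spark_hdfs so the relative path hail/python/hail.zip resolves):

cd /tigress/BEE/spark_hdfs/hail/python
# archive the hail python package so spark-submit can ship it to the executors
zip -r hail.zip hail

This is my test.slurm script: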

#!/bin/bash
#SBATCH -N 1
#SBATCH -t 10:00:00
#SBATCH --ntasks-per-node 3
#SBATCH --cpus-per-task 2

module load anaconda
module load spark/hadoop2.6/2.1.0

# sanity check that SPARK_HOME points at a real spark installation
spark_core_jars=( "${SPARK_HOME}"/jars/spark-core*.jar )
if [ ! -e "${spark_core_jars[0]}" ]
then
    echo "Could not find a spark-core jar in ${SPARK_HOME}/jars, are you sure SPARK_HOME is set correctly?" >&2
    exit 1
fi

spark-start
echo $MASTER
spark-submit --total-executor-cores 6 --executor-memory 5G \
--jars hail/jars/hail-all-spark.jar \
--py-files hail/python/hail.zip test.py

This script submits a job with 6 cores (1 node, 3 tasks per node, and 2 cpus per task). The --total-executor-cores and --executor-memory options are not detailed in the hail tutorial, but they are suggested by the CSES tutorial.

My test.py script imports a test vcf file and saves it in the vds format used by hail:

from hail import *
hc = HailContext()
hc.import_vcf('file:///tigress/BEE/spark_hdfs/test.vcf').write('file:///tigress/BEE/spark_hdfs/test.vds')

This part is pretty simple – you just need to remember that our cluster doesn’t have a dedicated HDFS or other distributed file system for Spark, so all file paths need to be prefixed with file://.
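With test.slurm and test.py in place, the job is submitted like any other SLURM job. A quick usage sketch (the output goes to the default slurm-JOBID.out file; replace JOBID with the number sbatch prints):

sbatch test.slurm
squeue -u $USER
# spark-submit and hail log output, including the spark master address, end up here
tail -f slurm-JOBID.out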

The part that gave me the most trouble (and is still not fully resolved) is running a spark-enabled jupyter notebook.

First, we need to start up a spark cluster that is idling:

#!/bin/bash
#SBATCH -N 1
#SBATCH -t 10:00:00
#SBATCH --ntasks-per-node 3
#SBATCH --cpus-per-task 2

module load anaconda
module load spark/hadoop2.6/2.1.0

# sanity check that SPARK_HOME points at a real spark installation
spark_core_jars=( "${SPARK_HOME}"/jars/spark-core*.jar )
if [ ! -e "${spark_core_jars[0]}" ]
then
    echo "Could not find a spark-core jar in ${SPARK_HOME}/jars, are you sure SPARK_HOME is set correctly?" >&2
    exit 1
fi

spark-start
echo $MASTER
sleep infinity

Then, we need to check the slurm output file, which will have lines that look like this:
Starting master on spark://della-r1c3n12:7077
starting org.apache.spark.deploy.master.Master, logging to /tmp/spark-bj5-org.apache.spark.deploy.master.Master-1-della-r1c3n12.out
Starting slaves

This means that a spark cluster has been instantiated at spark://della-r1c3n12:7077
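If you don’t want to eyeball the output file, a quick way to pull that address out is a one-liner like this (replace JOBID with your job’s id; the default output file name is slurm-JOBID.out):

grep -o 'spark://[^ ]*' slurm-JOBID.out | head -n 1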

Now, remembering to switch the PYSPARK_DRIVER_PYTHON and PYSPARK_DRIVER_PYTHON_OPTS variables to their jupyter settings (and to source ~/.bashrc after editing the file – unfortunately, environment variables set in the .slurm script are not visible to jupyter), you can run:

pyspark --master spark://della-r1c3n12:7077 --total-executor-cores 6 --conf spark.sql.files.openCostInBytes=1099511627776 --conf spark.sql.files.maxPartitionBytes=1099511627776 --conf spark.hadoop.parquet.block.size=1099511627776

making sure that the master address is correctly set. The hail tutorial recommends the conf variables spark.sql.files.openCostInBytes, spark.sql.files.maxPartitionBytes and spark.hadoop.parquet.block.size only for Cloudera clusters, but hail doesn’t work here unless these variables are set.

Now, from the local machine, you can run: ssh -N -f -L localhost:8889:localhost:8889 yourusername@clustername.princeton.edu

Now you can access the spark- and hail-enabled jupyter notebook from your local machine at http://127.0.0.1:8889. If this is your first time accessing it, you may also need to enter the token value that jupyter prints – in my case, it looked like this:

http://127.0.0.1:8889/?token=fb7f4cee67d15f6f900455cef2f38c2dd299cc60b0856032

Another important distinction is that in order to start hail, you need to run:

from hail import *
hc = HailContext(sc)

passing in the pre-defined SparkContext variable sc (in the spark-submit job above, HailContext() was able to create its own context).

However, there is still a problem with performing operations on data tables. The following code fails with the error shown below:
vds = hc.read('test.vds')
vds.sample_ids[:5]

—————————————————————————
IllegalArgumentException Traceback (most recent call last)

IllegalArgumentException: u"Error while instantiating 'org.apache.spark.sql.hive.HiveSessionState':"

It seems that the Hive-backed SQL session is not being properly initialized. For development purposes, I recommend installing hail locally on your machine and working with a small subset of genotypes, expression values, etc., until this jupyter notebook issue is resolved (which will likely take more custom configuration). Make sure to install compatible versions of the JRE (or JDK), hadoop and hail. The github page for hail:

https://github.com/broadinstitute/hail.git

states that it is also compatible with spark 2.2.0, so building from source using the github repository may be a good option as well – this is what I ended up doing on my local machine.
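For reference, the rough shape of a from-source build looks like the sketch below; the gradle invocation is based on the hail 0.1 getting started instructions and may differ for newer versions, so double-check the docs before relying on it.

git clone https://github.com/broadinstitute/hail.git
cd hail
# build the hail jar against the spark version you plan to use (2.2.0 here, per the github page)
./gradlew -Dspark.version=2.2.0 shadowJar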

The good news is that hail jobs with spark can now be set up and run on the cluster!