Category Archives: Uncategorized

Setting up and running hail on cluster (also featuring Apache Spark)

According to http://hail.is, “Hail is an open-source, scalable framework for exploring and analyzing genomic data. Starting from genetic data in VCF, BGEN or PLINK format, Hail can, for example:

load variant and sample annotations from text tables, JSON, VCF, VEP, and locus interval files
generate variant annotations like call rate, Hardy-Weinberg equilibrium p-value, and population-specific allele count
generate sample annotations like mean depth, imputed sex, and TiTv ratio
generate new annotations from existing ones as well as genotypes, and use these to filter samples, variants, and genotypes
find Mendelian violations in trios, prune variants in linkage disequilibrium, analyze genetic similarity between samples via the GRM and IBD matrix, and compute sample scores and variant loadings using PCA
perform variant, gene-burden and eQTL association analyses using linear, logistic, and linear mixed regression, and estimate heritability”
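
For a flavor of what this looks like in practice, here is a minimal sketch using the Hail 0.1 Python API once the setup described below is in place (method names follow the 0.1 docs; the input VCF path is the test file used later in this post, and the filtered output name is just an example):

from hail import *

hc = HailContext()

# load a VCF and compute per-variant and per-sample QC annotations
vds = hc.import_vcf('file:///tigress/BEE/spark_hdfs/test.vcf')
vds = vds.variant_qc().sample_qc()

# keep well-genotyped variants using the annotation expression language
vds_filtered = vds.filter_variants_expr('va.qc.callRate > 0.95')
vds_filtered.write('file:///tigress/BEE/spark_hdfs/test_filtered.vds')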

I have finished setting up hail to run on the cluster, and this document summarizes what needs to be done in order to run hail on cluster, in both standalone and cluster modes.

These links will also prove to be useful – I recommend reading through them first:

https://hail.is/docs/stable/getting_started.html
https://www.princeton.edu/researchcomputing/faq/spark-via-slurm/

But it must be noted that I had to hand-set a lot of configurations to make this work.

First, it is worth noting that the spark distribution we have on the cluster is only compatible with python 2 – if you have a default python3 directory and PYTHONPATH set, you may need to disable them.

You can double check the versions of python, ipython and jupyter being used – first, run:
module load anaconda
module load spark/hadoop2.6/2.1.0

Take note that although the spark tutorial from CSES states that you should run module load python, I found that you actually need to run module load anaconda to get the packages that hail needs to run.

These commands load all the necessary binaries; as you’ll see below, they also need to be included in the .slurm file.

[bj5@della5 ~]$ which spark-submit
/usr/licensed/spark/spark-2.1.0-bin-hadoop2.6/bin/spark-submit
[bj5@della5 ~]$ which python
/usr/licensed/anaconda/5.0.1/bin/python
[bj5@della5 ~]$ which ipython
/usr/licensed/anaconda/5.0.1/bin/ipython
[bj5@della5 ~]$ which jupyter
/usr/licensed/anaconda/5.0.1/bin/jupyter

As for the version of hail, instead of building hail from source, CSES suggested that I download the pre-built distribution (compatible with spark 2.1.0) from https://storage.googleapis.com/hail-common/distributions/0.1/Hail-0.1-1214727c640f-Spark-2.1.0.zip

The hail directory is now located at /tigress/BEE/spark_hdfs/hail. Disregard the hadoop and spark directories in spark_hdfs – we’ll stick to the modules installed by CSES on Della.

Now, we need to set a few configurations in ~/.bashrc (or ~/.bash_profile, depending on how your shell is set up):

export JAVA_HOME=/usr/lib/jvm/jre-1.8.0-openjdk
export PATH=$PATH:$JAVA_HOME/bin
# These env vars added for Spark – change to fit your appropriate spark version
export SPARK_HOME=/usr/licensed/spark/spark-2.1.0-bin-hadoop2.6/
export PATH=$PATH:$SPARK_HOME/bin
# These env vars added for hail
export HAIL_HOME=/tigress/BEE/spark_hdfs/hail
export PATH=$PATH:$HAIL_HOME/bin
# set this to jupyter if running a jupyter notebook – otherwise set it to python
# export PYSPARK_DRIVER_PYTHON=/usr/licensed/anaconda/5.0.1/bin/jupyter
export PYSPARK_DRIVER_PYTHON=/usr/licensed/anaconda/5.0.1/bin/python
export PYSPARK_PYTHON=`which python`
# Set this forwarding if running a jupyter notebook
# export PYSPARK_DRIVER_PYTHON_OPTS="notebook --no-browser --port=8889 --ip=127.0.0.1"
export PYTHONPATH="$HAIL_HOME/python:$SPARK_HOME/python:$(echo ${SPARK_HOME}/python/lib/py4j-*-src.zip | tr '\n' ':')$PYTHONPATH"
export SPARK_CLASSPATH=$HAIL_HOME/jars/hail-all-spark.jar

Take note that the environment variable PYSPARK_DRIVER_PYTHON needs to be set differently depending on whether you’re running standalone mode with a jupyter notebook or cluster mode. PYSPARK_DRIVER_PYTHON_OPTS also needs to be set to allow ssh tunneling to the jupyter notebook (but not in cluster mode). Also take note of the JAVA_HOME directory – setting it is not mentioned in the CSES spark tutorial, but I’ve found that spark does not work if JAVA_HOME points anywhere else.
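
As a quick sanity check that these settings are picked up, you can run a few lines with the python binary that which python reported above (a minimal sketch; nothing here is hail-specific):

# sanity-check the environment set up in ~/.bashrc
import os

for var in ('JAVA_HOME', 'SPARK_HOME', 'HAIL_HOME', 'PYSPARK_PYTHON', 'PYSPARK_DRIVER_PYTHON'):
    print('%s=%s' % (var, os.environ.get(var)))

# both imports should resolve through the PYTHONPATH entries above
import pyspark   # from $SPARK_HOME/python plus the py4j zip
import hail      # from $HAIL_HOME/python
print(pyspark.__file__)
print(hail.__file__)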

Let’s start by running cluster mode – submitting a hail job on the cluster. But before doing so, I created a .zip file of the directory /tigress/BEE/spark_hdfs/hail/python/hail in order to pass it to spark-submit via --py-files hail/python/hail.zip. This is my test.slurm script:

#!/bin/bash
#SBATCH -N 1
#SBATCH -t 10:00:00
#SBATCH --ntasks-per-node 3
#SBATCH --cpus-per-task 2

module load anaconda
module load spark/hadoop2.6/2.1.0

shopt -s nullglob
spark_core_jars=( "${SPARK_HOME}"/jars/spark-core*.jar )
if [ ${#spark_core_jars[@]} -eq 0 ]
then
echo "Could not find a spark-core jar in ${SPARK_HOME}/jars, are you sure SPARK_HOME is set correctly?" >&2
exit 1
fi

spark-start
echo $MASTER
spark-submit --total-executor-cores 6 --executor-memory 5G \
--jars hail/jars/hail-all-spark.jar \
--py-files hail/python/hail.zip test.py

This script submits a job with 6 cores (1 node, 3 tasks per node, and 2 cpus per task). The --total-executor-cores and --executor-memory options are not detailed in the hail tutorial, but they are suggested by the CSES tutorial.

My test.py script imported a test vcf file and saved it in a vds format used by hail:

from hail import *
hc = HailContext()
hc.import_vcf('file:///tigress/BEE/spark_hdfs/test.vcf').write('file:///tigress/BEE/spark_hdfs/test.vds')

This part is pretty simple – you just need to remember that our cluster doesn’t have a dedicated HDFS or other file system used by Spark, so all file paths need to be prefixed with file://.
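
For example, a follow-up job that reads the .vds back in follows the same convention (a minimal sketch; sample_ids is just a cheap way to confirm the read worked, and the path matches the test files above):

from hail import *

hc = HailContext()

# note the file:// prefix - there is no HDFS on this cluster
vds = hc.read('file:///tigress/BEE/spark_hdfs/test.vds')
print(vds.sample_ids[:5])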

The part that gave me the most trouble (and is still not fully resolved) is running a spark-enabled jupyter notebook.

First, we need to start up a spark cluster that is idling:

#!/bin/bash
#SBATCH -N 1
#SBATCH -t 10:00:00
#SBATCH --ntasks-per-node 3
#SBATCH --cpus-per-task 2

module load anaconda
module load spark/hadoop2.6/2.1.0

shopt -s nullglob
spark_core_jars=( "${SPARK_HOME}"/jars/spark-core*.jar )
if [ ${#spark_core_jars[@]} -eq 0 ]
then
echo "Could not find a spark-core jar in ${SPARK_HOME}/jars, are you sure SPARK_HOME is set correctly?" >&2
exit 1
fi

spark-start
echo $MASTER
sleep infinity

Then, we need to check the slurm output file, which will have a line that looks like this:
Starting master on spark://della-r1c3n12:7077
starting org.apache.spark.deploy.master.Master, logging to /tmp/spark-bj5-org.apache.spark.deploy.master.Master-1-della-r1c3n12.out
Starting slaves

This means that a spark cluster has been instantiated at spark://della-r1c3n12:7077

Now, remembering to reset the PYSPARK_DRIVER_PYTHON and PYSPARK_DRIVER_PYTHON_OPTS variables (run . ~/.bashrc after editing the file – unfortunately, if you set the environment variables in the .slurm script, jupyter is not able to see them), you can run:

pyspark --master spark://della-r1c3n12:7077 --total-executor-cores 6 --conf spark.sql.files.openCostInBytes=1099511627776 --conf spark.sql.files.maxPartitionBytes=1099511627776 --conf spark.hadoop.parquet.block.size=1099511627776

making sure that the master address is correctly set. The hail tutorial only recommends setting spark.sql.files.openCostInBytes, spark.sql.files.maxPartitionBytes and spark.hadoop.parquet.block.size for Cloudera clusters, but hail does not work on our cluster unless these variables are set.

Now, from the local machine, you can run: ssh -N -f -L localhost:8889:localhost:8889 yourusername@clustername.princeton.edu

Now, you can access the spark- and hail-enabled jupyter notebook from your local machine at http://127.0.0.1:8889 – if this is your first time accessing it, you may also need to enter the token value that jupyter prints. In my case, it looked like this:

http://127.0.0.1:8889/?token=fb7f4cee67d15f6f900455cef2f38c2dd299cc60b0856032

Another important difference is that, in order to start hail in the notebook, you need to run:

from hail import *
hc = HailContext(sc)

with the pre-defined sparkContext variable sc.
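
Before running anything heavy, it is worth confirming in the notebook that sc is actually attached to the standalone cluster rather than a local context (a quick check using standard SparkContext attributes):

# confirm the notebook's SparkContext is attached to the standalone cluster
print(sc.master)               # should be spark://della-r1c3n12:7077 (your master address)
print(sc.defaultParallelism)   # should reflect the 6 total executor cores requested above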

However, there is still a problem: operations on the data tables fail with the following error:
vds = hc.read('test.vds')
vds.sample_ids[:5]

—————————————————————————
IllegalArgumentException Traceback (most recent call last)

IllegalArgumentException: u"Error while instantiating 'org.apache.spark.sql.hive.HiveSessionState':"

It seems like the backend Hive SQL session is not being properly initialized. Until this issue with jupyter notebooks (and all the custom configuration it requires) is resolved, I recommend installing hail locally on your machine for development and working with a small subset of genotypes, expression values, etc. But make sure to install compatible versions of the JRE (or JDK), hadoop and hail. The github page for hail:

https://github.com/broadinstitute/hail.git

states that it is also compatible with spark-2.2.0, so building from source using the github repository may be a good option as well – this is what I ended up doing on my local machine.

But the good news is that now hail jobs with spark can be set up and run on cluster!

Fragile Family Scale Construction

The workflow for creating scale variables for the Fragile Family data is broken into four parts.
Here, we describe the generation of the Social Skills Self-Control subscale.
I highly recommend opening the scales summary document at /tigress/BEE/projects/rufragfam/data/fragile_families_scales_noquote.tsv in spreadsheet software (e.g. Excel), along with one or more of the scales documents for years 1, 3, 5, and 9: http://www.fragilefamilies.princeton.edu/documentation
First, SSH into the della server and cd into the Fragile Families restricted-use directory:
cd /tigress/BEE/projects/rufragfam/data

  • Step 1: create the scale variables file. Relevant script: sp1_processing_scales.ipynb or sp1_processing_scales.py. This python script first obtains the prefix descriptors for individual categories. That is, in the scales documentation, a group of questions is labeled as being asked of the mother, father, child, teacher, etc., and each of these has an abbreviation. The raw scale variables file can be accessed with

    less /tigress/BEE/projects/rufragfam/data/fragile_families_scales_noquote.tsv

    It is useful to have this file open in some spreadsheet or tab-delimited viewing software to get an idea of how the data is structured. Next, the script creates a map between each prefix descriptor and the corresponding Fragile Families raw data file. It then, through a mix of automated and manual work, attempts to match all variables defined in the scale documentation with the raw data files. After this automated and manual curation, 1514 of the scale variables defined in the PDFs could be found in the data, and 46 could not.

    This step only needs to be run if additional scale documents become available, for instance when the year 15 data is released; the year 15 scale variables would need to be added to the fragile_families_scales_noquote.tsv file prior to running this step.

  • Step 2: creating scales data table from raw data table + step 1 output
  • Relevant script: sp2_creating_scales_data.ipynb or sp2_creating_scales_data.py. This script takes the scale variables computed in part 1 and converts them into data tables for each scale. The output is stored as tab-delimited files under

    ls -al /tigress/BEE/projects/rufragfam/data/raw-scales/

    The output of this step still contains the uncleaned survey responses from the raw data. For any scale, there are a large number of inconsistencies and errors in the raw data, and these need to be cleaned before we can do any imputation or scale conversion. As with step 1, this step only needs to be run if new scales documentation is released, and only after updating fragile_families_scales_noquote.tsv.

  • Step 3: data cleaning and conversion of fragile families format to a format that can actually be run through imputation software.
  • Relevant script: sp3_clean_scales.ipynb or sp3_clean_scales.py.

    All unique responses to questions for a scale, e.g. Mental_Health_Scale_for_Depression, can be computed with

    cd /tigress/BEE/projects/rufragfam/data/raw-scales

    awk -F"\t" '{ print $4 }' Mental_Health_Scale_for_Depression.tsv | sort | uniq

    Unfortunately, there doesn’t seem to be an automated way to do this so I recommend going through the scale documents and the question/answer formats.

    The FF scale variables and the set of all response values they can take can be found in the file:
    /tigress/BEE/projects/rufragfam/data/scale_variables_and_responses.tsv

    The FF variable identifiers and labels (survey questions) can be found in the file:
    /tigress/BEE/projects/rufragfam/data/all_ff_variables.tsv

    To add support for a new scale, the replaceScaleSpecificQuantities function needs to be updated to encode the raw response values with something meaningful. For instance, for the Social Skills Self-Control subscale, we process Social_skills__Selfcontrol_subscale.tsv and replace values we wish to impute with float('nan'), and the rest of the values are recoded according to the ff_scales9.pdf documentation (a sketch of this kind of cleaning appears after this list). The cleaned scales are generated in the directory /tigress/BEE/projects/rufragfam/data/clean-scales/

  • Step 4: compute the scale values
  • Relevant script: sp4_computing_scales.ipynb or sp4_computing_scales.py. From the cleaned data and the procedures defined in the FF scales PDFs, we can reconstruct scale scores. To add support for your scale, add it to the if scale_file statement block. For example, the Social_skills__Selfcontrol_subscale.tsv scale is processed by first imputing the data and then summing up the individual counts across survey questions for each wave (see the sketch after this list). The final output file with all the scale data will be stored in /tigress/BEE/projects/rufragfam/data/ff_scales.tsv.

    We are currently using an implementation of multiple imputation by chained equations but other methods can be tested. See https://pypi.python.org/pypi/fancyimpute
    Also, this is a great resource for imputation in the Fragile Families data.
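
To make Steps 3 and 4 more concrete, here is a minimal pandas sketch of the kind of cleaning and scoring described above for the Social Skills Self-Control subscale. This is not the actual sp3/sp4 code: the column names (idnum, wave), the missing-value codes, and the recode map are hypothetical placeholders (the real recoding lives in replaceScaleSpecificQuantities and ff_scales9.pdf), and a simple column-mean fill stands in for the MICE imputation the pipeline actually uses.

import pandas as pd
import numpy as np

# read one raw scale table produced in Step 2 (path from the post; columns are placeholders)
df = pd.read_csv('/tigress/BEE/projects/rufragfam/data/raw-scales/Social_skills__Selfcontrol_subscale.tsv',
                 sep='\t')

# Step 3-style cleaning: mark responses that should be imputed as missing and
# recode the remaining responses to numeric item scores (illustrative values only)
MISSING = {'-9 Not in wave', '-3 Missing', '-6 Skip'}
RECODE = {'1 Not true': 0, '2 Sometimes true': 1, '3 Often true': 2}
item_cols = [c for c in df.columns if c not in ('idnum', 'wave')]
for c in item_cols:
    df[c] = df[c].map(lambda v: np.nan if v in MISSING else RECODE.get(v, np.nan))

# Step 4-style scoring: impute missing item responses, then sum the items per respondent/wave
# (the pipeline uses multiple imputation by chained equations; mean fill is a stand-in here)
df[item_cols] = df[item_cols].fillna(df[item_cols].mean())
df['self_control_score'] = df[item_cols].sum(axis=1)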

After adding in your scale in Steps 3 and 4, you can use the ff_scales.tsv file for data modeling. This is where it gets interesting!

Installing IPython notebook on Della – the conda way (featuring Python 3 and IPython 4.0, among other things)

Thanks to Ian’s previous post, I was able to set up IPython notebook on Della, and I’ve been working extensively with it. However, when I was trying to sync the notebooks between the copies on my local machine and Della, I found out that the version of IPython on Della is the old 2.3 release, and that IPython is not backward compatible. So any IPython notebook that I create and work on locally simply will not work on Della, which is quite annoying.

Also, I think there is a lot of benefit to setting up and using Anaconda in my Della directory. It sets up a lot of packages (including Python 3, instead of the archaic 2.6 that Della runs; you have to module load python as Ian does in his post in order to load 2.7) and manages them seamlessly, without having to worry about what package is in what directory.

According to the Conda website:

Conda is an open source package management system and environment management system for installing multiple versions of software packages and their dependencies and switching easily between them. It works on Linux, OS X and Windows, and was created for Python programs but can package and distribute any software.

So, let’s get started. First, go to:

http://repo.continuum.io/miniconda/index.html

And download the latest Linux x86-64 version, namely:

Miniconda3-latest-Linux-x86_64.sh

Then, scp the Miniconda installer to your Della local directory (e.g. /home/my.user.name/)

Note: I initially tried using the easy_install way of installing conda, only to run into the following error:

Error: This installation of conda is not initialized. Use 'conda create -n
envname' to create a conda environment and 'source activate envname' to
activate it.

# Note that pip installing conda is not the recommended way for setting up your
# system. The recommended way for setting up a conda system is by installing
# Miniconda, see: http://repo.continuum.io/miniconda/index.html

It indeed is preferable to follow their instructions. Then run:

sh Miniconda3-latest-Linux-x86_64.sh

And follow their instructions. Conda will install a set of basic packages (including python 3.5, conda, openssl, pip, and setuptools, to name a few) under the directory you specify, or the default directory:

/home/my.user.name/miniconda3

It also modifies the PATH for you so that you don’t have to worry about that yourself. How nice of them. (But sometimes you might need to specify the default versions of programs that are on della, especially for distributing jobs to other users, etc. Don’t forget to specify them when needed. But you should be set for most use cases.)
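
A quick way to double check that the conda python (rather than della’s system python) is the one being picked up:

# run this with whatever `which python` now resolves to
import sys
print(sys.executable)   # expect something like /home/my.user.name/miniconda3/bin/python
print(sys.version)      # expect the python 3.5.x that miniconda installed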

Now, since pip itself comes from the conda installation, you can run either:

pip install ipython
pip install jupyter

or

conda install ipython
conda install jupyter

and either way, conda will integrate these packages into your environment. Neat.

That’s it! You can double check what packages you have by running:

conda list

After this, the steps for serving the notebook to your local browser are identical to the previous post. Namely:

#create mycert.pem using the following openssl cmd:

openssl req -x509 -nodes -days 365 -newkey rsa:1024 -keyout mycert.pem -out mycert.pem

# move mycert.pem wherever you’d like

mv mycert.pem ~/local/lib/

# create an ipython profilename

ipython profile create profilename

# generate a password using the following ipython utility:

python
from IPython.lib import passwd
passwd()
Enter password:
Verify password:
'sha1:…'

#copy this hashed pass

vi /home/my.della.user.name/.ipython/profile_profilename/ipython_config.py

# edit:

c.NotebookApp.port = 1999 # change this port number to something not in use, I used 2999
c.NotebookApp.password = 'sha1:…' #use generated pass here
c.NotebookApp.certfile = u'/home/my.della.user.name/local/lib/mycert.pem'
c.NotebookApp.open_browser = False
c.NotebookApp.ip = '127.0.0.1'

[save]

 

# sign off and sign back on to della

ssh -A -L<Your Port #>:127.0.0.1:<Your Port #> my.della.user.name@della.princeton.edu

# boot up notebook

ipython notebook --ip=127.0.0.1 --profile=profilename --port <Your Port #>
# note that if you are trying to access Della
# from outside the Princeton CS department, you
# may have to forward the same port from your home computer
# to some princeton server, then again to Della

# In your browser go to

http://127.0.0.1:<Your Port #>


After you’ve set everything up, you can upload the ipython notebook to gist for sharing with others. I’ll repeat the post “Upload a gist to github directly from della” here for convenience:

# First, install gist gem locally at della
gem install --user-install gist
echo 'export PATH=$PATH:/PATH/TO/HOME/.gem/ruby/1.8/bin/' >> ~/.bashrc
source ~/.bashrc

# Boot up connection
gist --login
[Enter Github username and password]

# Upload gist, e.g.
gist my_notebook.ipynb -u [secret_gist_string]

# secret_gist_string is the string already associated with a particular file on Github
# To obtain it, the first time you upload a file to Github (e.g. my_notebook.ipynb) go to
# https://github.com/ | Gist | Add file [upload file] | Create secret Gist, which will
# return a secret_gist_string on the panel at right (labeled "HTTPS")

Here is an example ipython notebook that I shared through gist and is available for viewing:

GTEx eQTL detection: cis- pipeline

Upload a gist to github directly from della

 

# First, install gist gem locally at della
gem install --user-install gist
echo 'export PATH=$PATH:/PATH/TO/HOME/.gem/ruby/1.8/bin/' >> ~/.bashrc
source ~/.bashrc

# Boot up connection
gist --login
[Enter Github username and password]

# Upload gist, e.g.
gist my_notebook.ipynb -u [secret_gist_string]

# secret_gist_string is the string already associated with a particular file on Github
# To obtain it, the first time you upload a file to Github (e.g. my_notebook.ipynb) go to
# https://github.com/ | Gist | Add file [upload file] | Create secret Gist, which will
# return a secret_gist_string on the panel at right (labeled "HTTPS")