
Fragile Family Scale Construction

The workflow for creating scale variables for the Fragile Families data is broken into four parts.
Here, we describe the generation of the Social Skills Self-Control subscale.
I highly recommend opening the scales summary document at /tigress/BEE/projects/rufragfam/data/fragile_families_scales_noquote.tsv in spreadsheet software (e.g. Excel), along with one or more of the scales documents for years 1, 3, 5, and 9: http://www.fragilefamilies.princeton.edu/documentation
First, SSH into the Della server and cd into the Fragile Families restricted-use directory:
cd /tigress/BEE/projects/rufragfam/data

  • Step 1: create the scale variables file. Relevant script: sp1_processing_scales.ipynb or sp1_processing_scales.py. This Python script first obtains the prefix descriptors for the individual categories: in the scales documentation, each group of questions is labeled as being asked of the mother, father, child, teacher, etc., and each of these has an abbreviation. The raw scale variables file can be accessed with

    less /tigress/BEE/projects/rufragfam/data/fragile_families_scales_noquote.tsv

    It is useful to have this file open in spreadsheet or tab-delimited viewing software to get an idea of how the data is structured. Next, the script creates a map between each prefix descriptor and a Fragile Families raw data file. It then, through a mix of automated and manual work, attempts to match every variable defined in the scales documentation with the raw data files. After this automated and manual curation, 1514 of the scale variables defined in the PDFs could be found in the data, and 46 could not. (A minimal sketch of this matching, and of the Step 2 table splitting, appears after this list.)

    This step only needs to be run if additional scale documents become available, for instance when the year 15 data is released; the year 15 scale variables would then need to be added to the fragile_families_scales_noquote.tsv file before running this step.

  • Step 2: create the scales data tables from the raw data table and the Step 1 output. Relevant script: sp2_creating_scales_data.ipynb or sp2_creating_scales_data.py. This script takes the scale variables computed in Step 1 and converts them into a data table for each scale. The output is stored as tab-delimited files, which can be listed with

    ls -al /tigress/BEE/projects/rufragfam/data/raw-scales/

    The output of this step still contains the uncleaned survey responses from the raw data. For any scale, the raw data contains a large number of inconsistencies and errors, and these need to be cleaned before we can do any imputation or scale conversion. As with Step 1, this step only needs to be done when new scales documentation is released, and only after updating fragile_families_scales_noquote.tsv.

  • Step 3: clean the data and convert the Fragile Families format into a format that can actually be run through imputation software. Relevant script: sp3_clean_scales.ipynb or sp3_clean_scales.py.

    All unique responses to questions for a scale, e.g. Mental_Health_Scale_for_Depression, can be computed with

    cd /tigress/BEE/projects/rufragfam/data/raw-scales

    awk -F"\t" '{ print $4 }' Mental_Health_Scale_for_Depression.tsv | sort | uniq

    Unfortunately, there doesn't seem to be an automated way to enumerate the valid responses for each scale, so I recommend going through the scale documents and the question/answer formats.

    The FF scale variables and the set of all response values they can take can be found in the file:
    /tigress/BEE/projects/rufragfam/data/scale_variables_and_responses.tsv

    The FF variable identifiers and labels (survey questions) can be found in the file:
    /tigress/BEE/projects/rufragfam/data/all_ff_variables.tsv

    To add support for a new scale, the replaceScaleSpecificQuantities function needs to be updated to encode the raw response values with something meaningful. For instance, for the social skills self-control subscale, we process Social_skills__Selfcontrol_subscale.tsv and replace the values we wish to impute with float('nan'), and the rest of the values are recoded according to the ff_scales9.pdf documentation. (Both this cleaning and the Step 4 scoring are sketched after this list.) The cleaned scales will be generated in the directory /tigress/BEE/projects/rufragfam/data/clean-scales/

  • Step 4: compute the scale values. Relevant script: sp4_computing_scales.ipynb or sp4_computing_scales.py. From the cleaned data and the procedures defined in the FF scales PDFs, we can reconstruct the scale scores. To add support for your scale, add a branch for it to the if statement block that dispatches on scale_file. For example, the Social_skills__Selfcontrol_subscale.tsv scale is processed by first imputing the data and then summing the individual item responses across survey questions for each wave. The final output file with all the scale data will be stored in /tigress/BEE/projects/rufragfam/data/ff_scales.tsv.

    We are currently using an implementation of multiple imputation by chained equations, but other methods can be tested; see https://pypi.python.org/pypi/fancyimpute, which is also a great resource for imputation in the Fragile Families data.
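
To make Steps 1 and 2 concrete, here is a minimal sketch of the matching and table-splitting logic. It is not the actual sp1/sp2 code: the column names ("scale", "prefix", "variable"), the raw file names, and the per-prefix loading are hypothetical placeholders for illustration.

import pandas as pd

# Assumed layout of the scales summary file: one row per documented scale
# variable, with hypothetical columns "scale", "prefix", and "variable".
scales = pd.read_csv("fragile_families_scales_noquote.tsv", sep="\t")

# Step 1 (sketch): load one raw table per prefix descriptor (mother, father,
# teacher, ...) and keep the documented variables that actually appear there.
raw = {p: pd.read_csv("raw_%s.tsv" % p, sep="\t")  # hypothetical file names
       for p in scales["prefix"].unique()}
matched = scales[scales.apply(
    lambda r: r["variable"] in raw[r["prefix"]].columns, axis=1)]
print("matched %d of %d scale variables" % (len(matched), len(scales)))

# Step 2 (sketch): write one tab-delimited table per scale into raw-scales/,
# containing only that scale's matched variables.
for scale_name, group in matched.groupby("scale"):
    frames = [raw[r["prefix"]][[r["variable"]]] for _, r in group.iterrows()]
    pd.concat(frames, axis=1).to_csv(
        "raw-scales/%s.tsv" % scale_name, sep="\t", index=False)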

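And a matching sketch for Steps 3 and 4, using the social skills self-control subscale as the running example. The response strings, their numeric codes, and the "k5" wave prefix below are placeholders (the authoritative encodings are in the FF scale PDFs); also, early fancyimpute releases exposed MICE().complete(), while newer ones dropped MICE in favor of scikit-learn's IterativeImputer, so adjust to whatever your installed version provides.

import pandas as pd
from fancyimpute import MICE  # newer fancyimpute: use sklearn's IterativeImputer instead

# Step 3 (sketch): recode raw responses into numbers; values we wish to
# impute become NaN, and anything unrecognized is coerced to NaN as well.
RECODE = {
    "-9 Not in wave": float("nan"),
    "-2 Don't know": float("nan"),
    "0 Not true": 0.0,
    "1 Sometimes true": 1.0,
    "2 Often true": 2.0,
}
clean = (pd.read_csv("raw-scales/Social_skills__Selfcontrol_subscale.tsv", sep="\t")
         .replace(RECODE)
         .apply(pd.to_numeric, errors="coerce"))  # id columns would need to be set aside first

# Step 4 (sketch): impute the missing responses, then sum the item scores
# per wave; here we pretend year-9 items share a hypothetical "k5" prefix.
imputed = pd.DataFrame(MICE().complete(clean.values), columns=clean.columns)
year9 = [c for c in imputed.columns if c.startswith("k5")]
imputed[year9].sum(axis=1).to_csv("selfcontrol_y9.tsv", sep="\t", header=["selfcontrol_y9"])
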
After adding in your scale in Steps 3 and 4, you can use the ff_scales.tsv file for data modeling. This is where it gets interesting!

Installing IPython notebook on Della – the conda way (featuring Python 3 and IPython 4.0, among other things)

Thanks to Ian's previous post, I was able to set up IPython notebook on Della, and I've been working extensively with it. However, when I tried to sync notebooks between my local machine and Della, I found out that the version of IPython on Della is the old 2.3, and that IPython is not backward compatible. So any IPython notebook that I create and work on locally will simply not work on Della, which is quite annoying.

Also, I think there is a lot of benefit to setting up and using Anaconda in my Della directory. It sets up a lot of packages (including Python 3, instead of the archaic 2.6 that Della runs by default; you have to module load python, as Ian does in his post, to get 2.7) and manages them seamlessly, without having to worry about which package is in which directory.

According to the Conda website:

Conda is an open source package management system and environment management system for installing multiple versions of software packages and their dependencies and switching easily between them. It works on Linux, OS X and Windows, and was created for Python programs but can package and distribute any software.

So, let’s get started. First, go to:

http://repo.continuum.io/miniconda/index.html

And download the latest Linux x86-64 version, namely:

Miniconda3-latest-Linux-x86_64.sh

Then, scp the Miniconda installer to your home directory on Della (e.g. /home/my.user.name/).

Note: I initially tried using the easy_install way of installing conda, only to run into the following error:

Error: This installation of conda is not initialized. Use 'conda create -n
envname' to create a conda environment and 'source activate envname' to
activate it.

# Note that pip installing conda is not the recommended way for setting up your
# system. The recommended way for setting up a conda system is by installing
# Miniconda, see: http://repo.continuum.io/miniconda/index.html

It is indeed preferable to follow their instructions. Then run:

sh Miniconda3-latest-Linux-x86_64.sh

And follow the prompts. Conda will install a set of basic packages (including python 3.5, conda, openssl, pip, and setuptools, to name a few useful ones) under the directory you specify, or the default directory:

/home/my.user.name/miniconda3

It also modifies the PATH for you, so you don't have to worry about that yourself. How nice of them. (But sometimes you might need the default versions of programs that are already on Della, especially when distributing jobs to other users, etc., so don't forget to specify them when needed. You should be set for most use cases, though.)

Now, since we are using the conda-packaged version of pip, you can simply run

pip install ipython
pip install jupyter

or

conda install ipython
conda install jupyter

Either way, conda will integrate these packages into your environment. Neat.

That’s it! You can double check what packages you have by running:

conda list

After this, the steps for serving the notebook to your local browser are identical to the previous post. Namely:

#create mycert.pem using the following openssl cmd:

openssl req -x509 -nodes -days 365 -newkey rsa:1024 -keyout mycert.pem -out mycert.pem

# mv mycert.pem wherever you'd like

mv mycert.pem ~/local/lib/

# create an ipython profilename

ipython profile create profilename

# generate a password using the following ipython utility:

python
from IPython.lib import passwd
passwd()
Enter password:
Verify password:
'sha1:…'

#copy this hashed pass

vi /home/my.della.user.name/.ipython/profile_profilename/ipython_config.py

# edit:

c.NotebookApp.port = 1999 # change this port number to something not in use, I used 2999
c.NotebookApp.password = 'sha1:…' #use generated pass here
c.NotebookApp.certfile = u'/home/my.della.user.name/local/lib/mycert.pem'
c.NotebookApp.open_browser = False
c.NotebookApp.ip = '127.0.0.1'

[save]

 

# sign off and sign back on to della

ssh -A -L<Your Port #>:127.0.0.1:<Your Port #> my.della.user.name@della.princeton.edu

# boot up notebook

ipython notebook --ip=127.0.0.1 --profile=profilename --port <Your Port #>
# note that if you are trying to access Della
# from outside the Princeton CS department, you
# may have to forward the same port from your home computer
# to some princeton server, then again to Della
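# For example (hypothetical hosts; substitute a gateway you can
# actually reach from home, such as a CS department machine):
ssh -L<Your Port #>:127.0.0.1:<Your Port #> my.user.name@some.gateway.princeton.edu
# and then, from the gateway:
ssh -L<Your Port #>:127.0.0.1:<Your Port #> my.della.user.name@della.princeton.edu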

# In your browser go to

http://127.0.0.1:<Your Port #>


After you've set everything up, you can upload the IPython notebook to gist for sharing with others. I'll repeat the post "Upload a gist to github directly from della" for convenience:

# First, install gist gem locally at della
gem install --user-install gist
echo 'export PATH=$PATH:/PATH/TO/HOME/.gem/ruby/1.8/bin/' >> ~/.bashrc
source ~/.bashrc

# Boot up connection
gist --login
[Enter Github username and password]

# Upload gist, e.g.
gist my_notebook.ipynb -u [secret_gist_string]

# secret_gist_string is the string already associated with a particular file on Github
# To obtain it, the first time you upload a file to Github (e.g. my_notebook.ipynb) go to
# https://github.com/ | Gist | Add file [upload file] | Create secret Gist, which will
# return a secret_gist_string on the panel at right (labeled “HTTPS”)

Here is an example IPython notebook that I shared through gist and that is available for viewing:

GTEx eQTL detection: cis- pipeline

Upload a gist to github directly from della

 

# First, install gist gem locally at della
gem install --user-install gist
echo 'export PATH=$PATH:/PATH/TO/HOME/.gem/ruby/1.8/bin/' >> ~/.bashrc
source ~/.bashrc

# Boot up connection
gist --login
[Enter Github username and password]

# Upload gist, e.g.
gist my_notebook.ipynb -u [secret_gist_string]

# secret_gist_string is the string already associated with a particular file on Github
# To obtain it, the first time you upload a file to Github (e.g. my_notebook.ipynb) go to
# https://github.com/ | Gist | Add file [upload file] | Create secret Gist, which will
# return a secret_gist_string on the panel at right (labeled “HTTPS”)