All posts by daguiar

Fragile Family Scale Construction

The workflow for creating scale variables for the Fragile Families data is broken into four parts.
Here, we describe the generation of the Social Skills Self-Control subscale.
I highly recommend opening the scales summary document at /tigress/BEE/projects/rufragfam/data/fragile_families_scales_noquote.tsv with spreadsheet software (e.g. Excel), along with one or more of the scales documents for years 1, 3, 5, and 9: http://www.fragilefamilies.princeton.edu/documentation
First, SSH into the della server and cd into the Fragile Families restricted-use directory:
cd /tigress/BEE/projects/rufragfam/data

  • Step 1: create the scale variables file. Relevant script: sp1_processing_scales.ipynb or sp1_processing_scales.py. This Python script first obtains the prefix descriptors for the individual respondent categories. That is, in the scales documentation, a group of questions is labeled as being asked of the mother, father, child, teacher, etc., and each of these has an abbreviation. The raw scale variables file can be accessed with

    less /tigress/BEE/projects/rufragfam/data/fragile_families_scales_noquote.tsv

    It is useful to have this file open in some spreadsheet or tab-delimited viewing software to get an idea of how the data is structured. Next, the script creates a map between each prefix descriptor and the corresponding Fragile Families raw data file. It then, through a mix of automated and manual work, attempts to match every variable defined in the scale documentation to the raw data files. After this automated and manual curation, 1514 of the scale variables defined in the PDFs could be found in the data, and 46 could not.

    This step only needs to be run if additional scale documents become available, for instance when the year 15 data is released; in that case, the year 15 scale variables need to be added to the fragile_families_scales_noquote.tsv file prior to running this step.
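
    To make the matching concrete, here is a minimal sketch of the idea behind this step, not the actual sp1 script; the column name "variable" and the raw/*.tsv location are placeholders I made up, and the real script also relies on manual curation.

    # Minimal sketch only: match scale variables from the summary TSV against
    # the column headers of the raw data files. Names and paths are assumptions.
    import glob
    import pandas as pd

    scales = pd.read_csv("fragile_families_scales_noquote.tsv", sep="\t")

    # Collect every column header that appears in the raw Fragile Families files.
    raw_columns = set()
    for path in glob.glob("raw/*.tsv"):
        raw_columns.update(pd.read_csv(path, sep="\t", nrows=0).columns)

    # Split the documented scale variables into matched and unmatched sets.
    matched = scales[scales["variable"].isin(raw_columns)]
    unmatched = scales[~scales["variable"].isin(raw_columns)]
    print(len(matched), "matched;", len(unmatched), "unmatched")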

  • Step 2: create the scales data tables from the raw data table and the step 1 output. Relevant script: sp2_creating_scales_data.ipynb or sp2_creating_scales_data.py. This script takes the scale variables computed in step 1 and converts them into a data table for each scale. The output is stored as tab-delimited files, which can be listed with

    ls -al /tigress/BEE/projects/rufragfam/data/raw-scales/

    The output of this step still contains the uncleaned survey responses from the raw data. For any scale, there are a large number of inconsistencies and errors in the raw data, and these need to be cleaned before we can do any imputation or scale conversion. As with step 1, this step only needs to be rerun if new scales documentation is released, and only after updating fragile_families_scales_noquote.tsv.
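
    The per-scale extraction amounts to something like the following minimal sketch (not the actual sp2 script); the column names "scale", "variable", and "idnum" and the merged raw table ff_raw_merged.tsv are hypothetical placeholders.

    # Minimal sketch only: write one tab-delimited table per scale by pulling
    # that scale's variables out of a (hypothetical) merged raw data table.
    import os
    import pandas as pd

    scales = pd.read_csv("fragile_families_scales_noquote.tsv", sep="\t")
    raw = pd.read_csv("ff_raw_merged.tsv", sep="\t")  # placeholder file name

    os.makedirs("raw-scales", exist_ok=True)
    for scale_name, group in scales.groupby("scale"):
        cols = [v for v in group["variable"] if v in raw.columns]
        out = raw[["idnum"] + cols]  # "idnum" stands in for the family identifier
        out.to_csv(os.path.join("raw-scales", str(scale_name) + ".tsv"),
                   sep="\t", index=False)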

  • Step 3: clean the data and convert the Fragile Families format into a format that can actually be run through imputation software. Relevant script: sp3_clean_scales.ipynb or sp3_clean_scales.py.

    All unique responses to questions for a scale, e.g. Mental_Health_Scale_for_Depression, can be computed with

    cd /tigress/BEE/projects/rufragfam/data/raw-scales

    awk -F"\t" '{ print $4 }' Mental_Health_Scale_for_Depression.tsv | sort | uniq

    Unfortunately, there doesn't seem to be an automated way to do this, so I recommend going through the scale documents and checking the question/answer formats.

    The FF scale variables and the set of all response values they can take can be found in the file:
    /tigress/BEE/projects/rufragfam/data/scale_variables_and_responses.tsv

    The FF variable identifiers and labels (survey questions) can be found in the file:
    /tigress/BEE/projects/rufragfam/data/all_ff_variables.tsv

    To add support for a new scale, the replaceScaleSpecificQuantities function needs to be updated to encode the raw response values as something meaningful. For instance, for the Social Skills Self-Control subscale, we process Social_skills__Selfcontrol_subscale.tsv, replace values we wish to impute with float('nan'), and recode the rest of the values according to the ff_scales9.pdf documentation, as in the sketch below. The cleaned scales will be generated in the directory /tigress/BEE/projects/rufragfam/data/clean-scales/
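
    The following is a rough, hedged illustration of this kind of recoding, not the project's actual replaceScaleSpecificQuantities implementation; the response strings and numeric codes are made up, and the real mappings come from the scales PDFs.

    # Illustrative sketch only: map raw survey response strings to numeric codes
    # and mark values to be imputed as NaN. Strings and codes below are invented.
    import pandas as pd

    MISSING = {"-9 Not in wave", "-6 Skip", "-3 Missing"}          # example missing codes
    RECODE = {"1 never": 1.0, "2 sometimes": 2.0, "3 often": 3.0}  # example recoding

    def clean_response(value):
        if value in MISSING:
            return float("nan")        # mark for imputation in step 4
        return RECODE.get(value, float("nan"))

    raw = pd.read_csv("raw-scales/Social_skills__Selfcontrol_subscale.tsv", sep="\t")
    clean = raw.copy()
    for col in clean.columns[1:]:      # assume the first column is the family id
        clean[col] = clean[col].map(clean_response)
    clean.to_csv("clean-scales/Social_skills__Selfcontrol_subscale.tsv",
                 sep="\t", index=False)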

  • Step 4: compute the scale values. Relevant script: sp4_computing_scales.ipynb or sp4_computing_scales.py. From the cleaned data and the procedures defined in the FF scales PDFs, we can reconstruct scale scores. To add support for your scale, add a branch for it to the if scale_file statement block. For example, the Social_skills__Selfcontrol_subscale.tsv scale is processed by first imputing the data and then summing the individual responses across survey questions for each wave. The final output file with all the scale data will be stored in /tigress/BEE/projects/rufragfam/data/ff_scales.tsv.

    We are currently using an implementation of multiple imputation by chained equations, but other methods can be tested. See https://pypi.python.org/pypi/fancyimpute
    Also, this is a great resource for imputation in the Fragile Families data.
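
    As a minimal, hedged sketch of this step (not the project's sp4 script), the following imputes the cleaned items with scikit-learn's IterativeImputer as a stand-in for the MICE-style imputation, then sums items within each respondent/wave prefix; the prefix grouping and the output file name are assumptions for illustration.

    # Sketch only: impute missing item responses, then sum items per wave.
    import pandas as pd
    from sklearn.experimental import enable_iterative_imputer  # noqa: F401
    from sklearn.impute import IterativeImputer

    clean = pd.read_csv("clean-scales/Social_skills__Selfcontrol_subscale.tsv", sep="\t")
    ids, items = clean.iloc[:, 0], clean.iloc[:, 1:]

    imputed = pd.DataFrame(IterativeImputer(random_state=0).fit_transform(items),
                           columns=items.columns)

    # Group item columns by their leading respondent/wave prefix (e.g. m1, f3)
    # and sum within each group; the two-character prefix is an assumption.
    scores = imputed.T.groupby(lambda name: name[:2]).sum().T
    scores.insert(0, ids.name, ids)
    scores.to_csv("social_skills_selfcontrol_scale.tsv", sep="\t", index=False)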

After adding your scale in Steps 3 and 4, you can use the ff_scales.tsv file for data modeling. This is where it gets interesting!

[tutorial] bash commands for research

This post contains a quick overview of useful command line tools for research found in standard bash shells. For the sake of brevity, I will only include essential information here, and link to other informative pages where available.

awk

awk (also implemented in the newer variants nawk and gawk) is a utility for manipulating text files. Its syntax is much simpler than perl or python but still makes it fast to write scripts that process files by lines and columns.

examples

For each line in the file input.txt, print the first and third column separated by a tab. Store the result in output.txt

awk '{ print $1 "\t" $3 }' input.txt > output.txt

Replace each column in file input.txt with its absolute value and print out each line to output.txt.

awk '{for (i=1; i<=NF; i++) if ($i < 0) $i = -$i; print }' input.txt > output.txt

List the number of columns in each row in a file input.txt.

awk '{ print NF }' input.txt


sed

sed is a stream editor, similar to awk, but is focused on streams of text instead of delimited files. I find it most useful when working with patterns in files.

examples

Convert a comma-separated value file to a tab-delimited file


sed -e 's/,/\t/g' input.csv > output.tsv

Print the line before each line matching the regular expression regexp to standard output.


sed -n '/regexp/{g;1!p;};h' input.txt


screen

screen is an invaluable tool for creating virtual consoles that (1) keep sessions active through network failures, e.g. when using secure shells, (2) let you connect to your session from different locations, and (3) let you run a long process without maintaining an active shell. Also see the alias command for a useful screen alias.

examples

Open a new screen console, list all available screen sessions, and reattach to screen with a particular ID


screen; screen -ls; screen -r ID


alias

alias creates a command that stands for a more complex command or group of commands. Commonly used to simplify a complex command without writing a script.

examples

Set up an alias to quickly view all queued cluster jobs for user daguiar


alias myq='showq -u daguiar'

reattach to the first detached screen session


alias cscreen='screen -r $(screen -ls | grep Detached | head -n1 | awk "{print \$1}")'


grep

Find patterns in a text file. Useful options include -c to count matching lines, -i for case-insensitive matching, -v to invert the matches, and -P to use Perl-style regular expressions (sometimes easier to work with).

examples

Print all lines that begin with a tab followed by a 1 in file input.txt.


grep -P "^\t1" input.txt

Print the count of all the lines that end with a 2 followed by a tab in file input.txt.


grep -Pc "2\t$" input.txt


find

A tool for finding files. Can be chained with other tools for powerful pipelines. Useful options are -name to find files by filename, -wholename to find files by filename and path, and -maxdepth to descend at most this many directory levels. The -exec option is VERY useful.

examples

Find all files ending in .txt and concatenate them all together.


find . -name "*.txt" -exec cat {} \;


sort

sort sorts the lines of a file; it can sort by column using the -k option. Be sure to specify the type of sort, e.g. -n for numeric sort or -g for general numeric sort.

examples

Sort text file input.txt by its numeric first column and store it in output.txt.


sort -k1,1n input.txt > output.txt


paste/join/cat

Tools for combining files.
paste merges files line by line.
join merges two files on a common field.
cat concatenates a number of files one after the other. Can also be used to print a file to standard output.

examples

paste the lines of input1.txt and input2.txt together separating them with a space


paste -d " " input1.txt input2.txt

join two files input1.txt and input2.txt on the first field of both files (join expects both inputs to be sorted on the join field)


join -1 1 -2 1 input1.txt input2.txt

concatenate two files and store them in output.txt


cat input1.txt input2.txt > output.txt


Useful shorthand.

  • . : prefix to a filename means hidden file; in a path it means the current directory; a synonym for source when used alone. Examples: .ssh; ls ./; . executable.sh
  • / : root of the file system; all absolute paths start from here. Example: cd /home/daguiar
  • ../ : up one directory. Example: cd ../
  • | : redirect output of the command left of the pipe to the command right of the pipe. Example: ls | sort -r | head -n2
  • ~ : represents your home directory; has other uses. Examples: cd ~; ls ~
  • [Tab][Tab] : list all available completions. Example: cd a[Tab][Tab]
  • [ArrowUp/ArrowDown] : cycle through previously submitted commands
  • [Ctrl]c : kill the current process
  • [Ctrl]d : log out of the terminal
  • [Ctrl]z : suspend the current process; follow with bg to send it to the background or fg to bring it to the foreground (appending & to a command also runs it in the background). Example: find . -name "*", then [Ctrl]z and bg
  • > out.txt : redirect output to the file out.txt. Example: ls -al > files.txt

More information.

New to Linux or need a refresher? Start with these common commands.

You can view the manual for any Linux command by typing man command.

  • mv/cp/rm: move, copy, remove files
  • cd: change directory
  • ls: list files. Common options -l, -a, -S, -t
  • mkdir: make a directory
  • less: view a file in order; does not load the entire file into memory, great for previewing large files.
  • chmod: change permissions of a file. There are 3 permissions (read, write, execute) for the file owner, the group, and everyone else. Examples: chmod 777 file (give owner/group/everyone read/write/execute permissions); chmod g+rw file (add read and write access to the group permissions).
  • chgrp/chown: change the group/owner for a file.
  • vim/emacs/nano: popular text editors
  • history: print previously used commands
  • head/tail: view start/end of file
  • wget: network downloader
  • tar/gzip: popular compression and archival software. Examples: extract an archive: tar xvf file.tar; extract a gzipped tar archive: tar xzvf file.tar.gz; gzip a file: gzip file
  • groups: see which groups you belong to. Groups can be used to manage permissions for a set of individuals for a file.
  • pwd: print working directory
  • ssh: login to remote host
  • diff: compare two files, output differences. -w option will ignore whitespace
  • touch: update the timestamp of a file (or create a new one if it doesn’t exist)
  • xargs: builds argument lists from standard input and passes them to a command; example: echo break up a sentence into groups of 2 | xargs -n 2
  • ps: view running processes. examples, ps -U daguiar
  • top: display top processes
  • finger: look up information on a user
  • echo: print arguments to standard output
  • rsync: remote and local file synchronization tool
  • du: summarize disk space usage, example: du -h --max-depth=1
  • uname -a: get information about the machine you are currently logged in to. Related: echo $0 prints your current shell.
  • md5sum: checksum files. Verify that some important files you downloaded are genuine.

A worked-through exercise

Now that we’ve learned the basics of these commands, let’s put them all together.
You will have to use what you have learned to work out solutions to the tasks below. Note that there are many ways to solve each problem.

The scenario

Due to the recent discovery that the Brontosaurus may indeed be a genus of dinosaur, the NSF has reallocated all of its funding to dinosaur research. A collaboration between leading archaeologist Dr. Li-Fang (aka the iron fist of Taiwan) and crazed molecular biologist Dr. Bianca resulted in the extraction of DNA samples from several Brontosauri fossils. After characterizing the set of variants in the sample, a variant call format (VCF) file was generated and uploaded to the cloud. During upload, a deranged hacker and former world-class sprinter named Greg, who has a personal vendetta against the Brontosaurus, corrupted the text file. We must clean this file using the tools described in this tutorial.

Download the VCF file.

(1) In case we get interrupted during this workflow, start a new screen session.

(2) In a bash prompt, change into the directory containing the file and print the file to standard output.


(3) It appears the third individual (12th column) was maliciously inserted into the VCF file. Remove the 12th column from the file; equivalently, retain all columns except the 12th.


(4) There appears to be the string "ihatedinosaur" scattered around the file. Find and replace all occurrences of "ihatedinosaur" with the empty string.


(5) An erroneous chromosome was also added to the VCF file. Remove the chromosome FABRICATED from dino.vcf.


(6) We've recovered a sample that was removed from dino.vcf. Add the file dino4.vcf as a new sample in the dino.vcf file.


(7) Finally, sort the file so it is in chromosome, then position, order.


When you are done, you can terminate your screen session with [Ctrl]d. Congratulations, you've made the world safe.

Running the NPB-iso framework on BEERS simulated data

This tutorial explains how to run the full NPB-iso framework.
The major steps are:
1) Simulate reads sampled from novel splice forms using the BEERS simulator.
2) Simulate multiple individuals using the output of the BEERS simulator.
3) Run the NPB-iso model.
4) Visualize and analyze the results.


######################## SIMULATE DATA ########################
# constants CHANGE THESE
STORAGE_DIR=/tigress/daguiar/npb-iso
SCRIPT_DIR=/home/daguiar/workspace/npb-iso/src/beers
SOURCE_DIR=/home/daguiar/workspace/npb-iso/src/pythoncode

# get reference data
cd $STORAGE_DIR
wget http://itmat.rum.s3.amazonaws.com/simulator_config_human.tar.gz
tar xzvf simulator_config_human.tar.gz
rm simulator_config_human.tar.gz

# get the BEERS simulator
cd $SCRIPT_DIR
wget http://www.cbil.upenn.edu/BEERS/beers.tar
tar xvf beers.tar
rm beers.tar

# make config files OR use the supplied simulator configuration files (chosen here)
# this process takes a LONG TIME so it's recommended to use the existing reference data
# mkdir $STORAGE_DIR/testgene_dir
# usage: make_config_files_for_subset_of_gene_ids.pl
# perl make_config_files_for_subset_of_gene_ids.pl testgene testgene $STORAGE_DIR $STORAGE_DIR/testgene_dir

mkdir $STORAGE_DIR/sim1_out
# usage: perl reads_simulator.pl [options]
# simulate some data
perl reads_simulator.pl 100 sim1 -readlength 100 -numgenes 1 -nointron -nalt 6 -palt 0.4 -sn -mastercfgdir $STORAGE_DIR -outdir $STORAGE_DIR/sim1_out
# or simulate with idealistic parameters (this may fail due to random sampling
# of a non-expressed gene; just run it again)
mkdir $STORAGE_DIR/sim_ideal_out
perl reads_simulator.pl 100 sim_ideal -readlength 100 -numgenes 1 -error 0 -subfreq 0 -indelfreq 0 -nointron -tpercent 0 -tqual 1 -nalt 2 -palt 0.5 -sn -mastercfgdir $STORAGE_DIR -outdir $STORAGE_DIR/sim_ideal_out

######################## RUN NPB-ISO ########################
cd $SOURCE_DIR
# run the gene_11 simulated data
python VB/VarBayes.py --data-file ../../data/gene_11/small/data/GENE_11_LDA_sorted.npy --term2exon-file ../../data/gene_11/small/data/GENE_11_Exon_sorted.txt --output-dir ../../output/GENE_11/small/vb_output/ --N-iter 100

# run on the ideal simulated data; first we have to prepare
# the data in the format accepted by VarBayes.py
python simulate_reads.py --numsamps 1 --numreads 100 --resultDir $STORAGE_DIR/sim_ideal_out/ --stem sim_ideal --suffix temp
# extract individual, isoform number and position
awk '{print $1,$3,$6}' $STORAGE_DIR/sim_ideal_out/simulated_reads*.cig > $STORAGE_DIR/sim_ideal_out/ind_iso_pos.txt
# format the data to obtain a read dictionary and read term counts for each individual
# tricky bit here: --nb-iso includes the actual gene, so in our case, 2 isoforms + 1 gene = 3
python formatSimdata.py --resultsDir $STORAGE_DIR/sim_ideal_out/ --nb-ind 1 --nb-iso 3 --stem sim_ideal --gene-name sim_ideal

python VB/VarBayes.py --data-file $STORAGE_DIR/sim_ideal_out/sim_ideal_LDA_sorted.npy --term2exon-file $STORAGE_DIR/sim_ideal_out/sim_ideal_Exon_sorted.txt --output-dir $STORAGE_DIR/sim_ideal_out/sim_ideal/output/ --N-iter 100

######################## VISUALIZE RESULTS ########################
# GENE_11 results
python VB/analysis.py --data-dir ../../data/gene_11/small/data/ --results-dir ../../output/GENE_11/small/vb_output/ --results-prefix GENE_11_ --sorted-reads simulated_reads_allind.cig