Fragile Family Scale Construction

The workflow for creating scale variables for the Fragile Family data is broken into four parts.
As a running example, we describe how the Social skills Self-control subscale is generated.
I highly recommend opening the scales summary document at /tigress/BEE/projects/rufragfam/data/fragile_families_scales_noquote.tsv in a spreadsheet viewer (e.g. Excel), along with one or more of the scales documents for years 1, 3, 5, and 9: http://www.fragilefamilies.princeton.edu/documentation
First, SSH into the della server and cd into the Fragile Family restricted use directory:
cd /tigress/BEE/projects/rufragfam/data

  • Step 1: create the scale variables file. Relevant script: sp1_processing_scales.ipynb or sp1_processing_scales.py. This Python script first obtains the prefix descriptors for the individual respondent categories. That is, in the scales documentation, each group of questions is labeled as being asked of the mother, father, child, teacher, etc., and each of these respondents has an abbreviation. The raw scale variables file can be viewed with

    less /tigress/BEE/projects/rufragfam/data/fragile_families_scales_noquote.tsv

    It is useful to have this file open in a spreadsheet or tab-delimited viewer to get an idea of how the data is structured. Next, the script creates a map between each prefix descriptor and the corresponding Fragile Family raw data file. It then attempts, through a mix of automated and manual work, to match every variable defined in the scale documentation with the raw data files. After this curation, 1514 of the scale variables defined in the PDFs could be found in the data, and 46 could not.

    This step only needs to be rerun when additional scale documents become available, for instance when the year 15 data is released. The year 15 scale variables must be added to the fragile_families_scales_noquote.tsv file before running this step.
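
    The matching idea in step 1 can be sketched as follows. This is a minimal illustration, not the code in sp1_processing_scales.py: the function name match_scale_variables, the file names, and the variable names are all hypothetical, and the real script also involves manual curation that is not shown here.

```python
# Hypothetical sketch: every name and sample value below is illustrative only.

def match_scale_variables(scale_vars, raw_columns_by_file):
    """Split scale variables into those found in a raw file and those missing."""
    found = {}
    missing = []
    for var in scale_vars:
        # Case-insensitive lookup across the columns of every raw data file.
        source = next(
            (fname for fname, cols in raw_columns_by_file.items()
             if var.lower() in cols),
            None,
        )
        if source is None:
            missing.append(var)
        else:
            found[var] = source
    return found, missing


# Made-up column sets standing in for the headers of the raw data files.
raw_columns_by_file = {
    "mother_wave.tsv": {"idnum", "m_q1", "m_q2"},
    "teacher_wave.tsv": {"idnum", "t_q1"},
}
scale_vars = ["m_q1", "T_Q1", "f_q9"]

found, missing = match_scale_variables(scale_vars, raw_columns_by_file)
print(len(found), "found;", len(missing), "missing:", missing)
```

    Variables that end up in the missing list are the ones that would need manual follow-up against the scale documents.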

  • Step 2: create the scales data tables from the raw data table and the step 1 output
  • Relevant script: sp2_creating_scales_data.ipynb or sp2_creating_scales_data.py. This script takes the scale variables computed in step 1 and converts them into one data table per scale. The output is stored as tab-delimited files, which can be listed with

    ls -al /tigress/BEE/projects/rufragfam/data/raw-scales/

    The output of this step still contains the uncleaned survey responses from the raw data. For any given scale, the raw data contain a large number of inconsistencies and errors, which need to be cleaned before we can do any imputation or scale conversion. As with step 1, this step only needs to be rerun when new scale documentation is released, and only after updating fragile_families_scales_noquote.tsv.

  • Step 3: clean the data and convert the Fragile Family format into a format that imputation software can process.
  • Relevant script: sp3_clean_scales.ipynb or sp3_clean_scales.py.

    All unique responses to questions for a scale, e.g. Mental_Health_Scale_for_Depression, can be computed with

    cd /tigress/BEE/projects/rufragfam/data/raw-scales

    awk -F"\t" '{ print $4 }' Mental_Health_Scale_for_Depression.tsv | sort | uniq

    Unfortunately, there doesn't seem to be an automated way to decide how each of these responses should be cleaned, so I recommend going through the scale documents and the question/answer formats by hand.
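
    For anyone who prefers to stay in Python, the awk one-liner above can be approximated as below. This is a sketch under assumptions: the helper name unique_responses is hypothetical, and the in-memory sample rows are made up so that the block runs without access to the restricted data. Only the use of the fourth tab-delimited column comes from the awk command.

```python
import csv
import io


def unique_responses(handle, column_index=3):
    """Return the sorted unique values of one column of a tab-delimited file."""
    reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    values = set()
    for row in reader:
        # Like awk, fall back to an empty string when the field is absent.
        values.add(row[column_index] if len(row) > column_index else "")
    return sorted(values)


# Made-up rows; the real files live in raw-scales/ and have their own layout.
sample = io.StringIO(
    "id1\tvar_a\twave9\tOften\n"
    "id2\tvar_a\twave9\tNever\n"
    "id3\tvar_a\twave9\t-9 Not in wave\n"
    "id4\tvar_a\twave9\tOften\n"
)
responses = unique_responses(sample)
print(responses)
```

    With access to the data, the same function can be called on open("Mental_Health_Scale_for_Depression.tsv", newline="") instead of the sample.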

    The FF scale variables and the set of all response values they can take can be found in the file:
    /tigress/BEE/projects/rufragfam/data/scale_variables_and_responses.tsv

    The FF variable identifiers and labels (survey questions) can be found in the file:
    /tigress/BEE/projects/rufragfam/data/all_ff_variables.tsv

    To add support for a new scale, the replaceScaleSpecificQuantities function needs to be updated so that it encodes the raw response values as something meaningful. For instance, for the social skills self-control subscale, we process Social_skills__Selfcontrol_subscale.tsv and replace the values we wish to impute with float('nan'); the rest of the values are replaced according to the ff_scales9.pdf documentation. The cleaned scales are written to the directory /tigress/BEE/projects/rufragfam/data/clean-scales/
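
    The kind of recoding described here might look like the sketch below. It does not reproduce replaceScaleSpecificQuantities, whose signature is not shown in this document. The helper clean_response, the response labels, and the numeric codes are assumptions for illustration; the real coding must be taken from the scale documentation.

```python
import math

# Hypothetical coding of substantive answers; take the real one from the scale PDFs.
RESPONSE_CODES = {"Never": 0.0, "Sometimes": 1.0, "Often": 2.0}

# Hypothetical labels treated as missing and therefore left for imputation.
IMPUTE_LABELS = {"Refuse", "Don't know", "Missing"}


def clean_response(raw):
    """Map one raw survey response to a float, using NaN for values to impute."""
    label = raw.strip()
    if label in IMPUTE_LABELS:
        return float("nan")
    if label in RESPONSE_CODES:
        return RESPONSE_CODES[label]
    # Fail loudly so that unexpected raw values get reviewed by hand.
    raise ValueError(f"Unhandled response value: {raw!r}")


cleaned = [clean_response(v) for v in ["Often", " Never", "Refuse"]]
print(cleaned)
```

    Raising an error on unknown values, rather than silently mapping them to NaN, is a design choice that keeps the manual review of each scale explicit.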

  • Step 4: compute the scale values
  • Relevant script: sp4_computing_scales.ipynb or sp4_computing_scales.py. From the cleaned data and the procedures defined in the FF scales PDFs, we can reconstruct the scale scores. To add support for your scale, add a branch for it to the if scale_file statement block. For example, the Social_skills__Selfcontrol_subscale.tsv scale is processed by first imputing the data and then summing the individual responses across survey questions for each wave. The final output file with all the scale data is stored in /tigress/BEE/projects/rufragfam/data/ff_scales.tsv.

    We are currently using an implementation of multiple imputation by chained equations but other methods can be tested. See https://pypi.python.org/pypi/fancyimpute
    Also, this is a great resource for imputation in the Fragile Families data.
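
    The impute-then-sum procedure for a summed scale can be sketched as below. Note that this sketch swaps in simple column-mean imputation as a stand-in; it is not multiple imputation by chained equations and does not use fancyimpute. The function names and the sample matrix are hypothetical; rows are respondents and columns are the survey questions of one wave.

```python
import math


def impute_column_means(rows):
    """Replace each NaN with the mean of the observed values in its column."""
    n_cols = len(rows[0])
    means = []
    for j in range(n_cols):
        observed = [r[j] for r in rows if not math.isnan(r[j])]
        means.append(sum(observed) / len(observed) if observed else float("nan"))
    return [
        [means[j] if math.isnan(v) else v for j, v in enumerate(r)]
        for r in rows
    ]


def scale_scores(rows):
    """Impute missing responses, then sum across questions for each respondent."""
    return [sum(r) for r in impute_column_means(rows)]


nan = float("nan")
# Made-up cleaned responses for three respondents and three questions.
wave9 = [
    [2.0, 1.0, 0.0],
    [nan, 1.0, 2.0],
    [0.0, nan, 2.0],
]
scores = scale_scores(wave9)
print(scores)
```

    Replacing impute_column_means with another imputer leaves the summing step unchanged, which makes it straightforward to compare imputation methods.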

After adding in your scale in Steps 3 and 4, you can use the ff_scales.tsv file for data modeling. This is where it gets interesting!