logo created by Ji-Sung Kim
1 Introduction
We use discrete fragmentation-coagulation processes to model genetic data in this software package.
1.1 Description
Our genetic sequence model, the haplotype cluster graph model (HCGM), has several desirable properties: (1) it is Bayesian nonparametric and thus provides a mechanism for estimating model complexity directly from data [???,???]; (2) unlike copying models [???], it is exchangeable across samples and thus robust to permutations of the haplotype sampling order; (3) haplotype blocks may overlap; (4) the distribution of haplotypes into clusters follows Ewens’ sampling formula marginally; and (5) it is tractable for thousands of samples and hundreds of thousands of variants.
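Property (4) refers to Ewens' sampling formula. As a reminder of what that formula computes, here is a minimal Python sketch. The function name `ewens_probability` and the parameter name `theta` are illustrative and not part of dfcp; how `theta` relates to the package's hyperparameters is not specified in this manual.

```python
from collections import Counter
from math import factorial


def ewens_probability(block_sizes, theta):
    """Probability that n items split into clusters with these sizes.

    block_sizes: list of cluster sizes, e.g. [2, 1] for n = 3.
    theta: positive concentration parameter (illustrative name).
    """
    n = sum(block_sizes)
    # a_j = number of clusters that contain exactly j items
    multiplicities = Counter(block_sizes)
    # Rising factorial theta * (theta + 1) * ... * (theta + n - 1)
    rising = 1.0
    for i in range(n):
        rising *= theta + i
    prob = factorial(n) / rising
    for j, a_j in multiplicities.items():
        prob *= theta ** a_j / (j ** a_j * factorial(a_j))
    return prob
```

Summing the function over every possible multiset of cluster sizes for a fixed n gives 1, which is a quick sanity check on the formula.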
1.2 Download
phaseME can be downloaded from http://beehive.cs.princeton.edu/wiki/fcp-phase/.
2 phaseME options
2.1 Usage
usage: dfcp [options] [input_file] output_dir
Example usage
java -Xmx1g -jar dfcp.jar -seed 1234 -fBackwards false -fQuiet false -random_restarts 1 -I 1 -B 5 -T 10 -S 1 --mevar -DP 0.0 -APA 1 -APB 1 -DPA 0.1 -DPB 1.1 -u 1 -R 0.1 -fDirichletLearnMutation true -hcg -viz haps.txt path/to/output
2.2 Arguments
The parameters to dfcp.jar are given in -<parameter> <value> pairs if there are arguments, or as -<parameter> or --<parameter> for boolean switches.
2.2.1 Required
-A <arg>  initial value of Gamma (default 0.01 in [1.0E-5, 10.0])
-allocator <arg>
-APA <arg>  Gamma distribution alpha prior on alpha (default 1.0)
-APB <arg>  Gamma distribution beta prior on alpha (default 0.25)
-B <arg>  number of samples discarded in burnin (default 900)
-calls  print out calls.
-diagnostics <arg>  Conduct MCMC diagnostics.
-DP <arg>  Dirichlet prior for likelihood when computing for each cluster (default 1.1)
-DPA <arg>  Gamma distribution alpha prior on d (default 1.1)
-DPB <arg>  Gamma distribution beta prior on d (default 1.1)
-fastmosaic  save a fast mosaic diagram.
-fBackwards <arg>  Flag indicates if backwards messages should be computed and used to estimate statistics. (default: false)
-fBernoulli <arg>  bernoulli emission model.
-fDirichlet <arg>  dirichlet emission model.
-fDirichletLearnMutation <arg>  dirichlet emission model while learning mutation as part of gamma.
-fGaussian <arg>  Flag to indicate if we should use a Gaussian prior on alpha (default: false)
-fixalpha  do not resample alpha. (default false)
-fixGammas  do not resample Gammas. (default false)
-fixRs  do not resample Rs. (default false)
-fQuiet <arg>  Flag indicates if we should output summary stats. (default: false)
-GPA <arg>  Gamma distribution alpha prior on gamma (default 1.0)
-GPB <arg>  Gamma distribution beta prior on gamma (default 0.25)
-hcg  output haplotype cluster graph.
-help  print this help.
-I <arg>  number of iterations with fixed parameters (default 100)
-impute  do imputation.
-ivariance  save impute variance.
-K <arg>  max #sites (default 200)
-likelihood
-M,--mevar  Run maximization expectation algorithm with variational updates (instead of Gibbs Sampling and max-exp with MCMC updates).
-marginals  print out marginals.
-mergemosaic1 <arg>  merge DFCP runs mosaic input 1
-mergemosaic2 <arg>  merge DFCP runs mosaic input 2
-mergevcf1 <arg>  merge DFCP runs VCF input 1
-mergevcf2 <arg>  merge DFCP runs VCF input 2
-mosaic  save mosaic diagrams.
-mutation <arg>  set mutation rate
-P <arg>  number of parameter updates between Gibbs sweeps (default 10)
-p <arg>  ploidy or the maximum number of variant alleles at any locus (default 2)
-phase  do haplotype phasing.
-R <arg>  initial value of R (default 0.5 in [1.0E-10, 1.0])
-random_restarts <arg>  number of random restarts (default 1)
-reference <arg>  file to use for reference panel
-resample  Resample data.
-runtimes  get runtimes.
-S <arg>  number of MCMC iterations after burnin (default 3000)
-seed <arg>  seed for rng (default: random).
-statistics  collect statistics.
-T <arg>  #iterations between each collection (default 1 iteration)
-trace  record out traces.
-trajectories  save trajectory paths through cluster graph.
-u <arg>  initial value of alpha (default 10.0 in [0.001, 1000.0])
-V,--vcf  Read VCF file format as input.
-VG <arg>  number of genotypes to use in VCF input (default ALL)
-VH <arg>  number of haplotypes to use in VCF input (default ALL)
-viz  Visualize the haplotype cluster graph.
-X,--memcmc  Run maximization expectation algorithm with MCMC updates (instead of Gibbs Sampling and max-exp with variational updates).
2.2.2 HapCompass output
Several files will be output to output_dir.
hcg_viz.txt
Output when the -viz option is used.
Contains a dot file visualization of the haplotype cluster graph.
You should use the neato layout engine and output to SVG.
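One way to run that rendering step from a script is sketched below in Python. It assumes Graphviz is installed and `neato` is on the PATH; the helper names and the output file name `hcg_viz.svg` are illustrative choices, not something dfcp produces or requires.

```python
import subprocess


def neato_svg_command(dot_path, svg_path):
    """Build the Graphviz command line that lays out dot_path with neato
    and writes the result as SVG to svg_path."""
    return ["neato", "-Tsvg", dot_path, "-o", svg_path]


def render_hcg(dot_path="hcg_viz.txt", svg_path="hcg_viz.svg"):
    """Run neato; raises CalledProcessError if Graphviz reports a failure."""
    subprocess.run(neato_svg_command(dot_path, svg_path), check=True)
```

Calling `render_hcg("path/to/output/hcg_viz.txt", "path/to/output/hcg_viz.svg")` would write the SVG next to the dot file.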
If you open the SVG image in a compatible viewer (e.g. Chrome), when you mouse over the clusters, the line numbers of the input haplotypes will be displayed.
hcg.txt
A space delimited file representing the haplotype cluster graph.
The first line has three columns: the number of sites, the index (in the input) of the first variant, the index (in the input) of the last variant.
The following lines describe the clusters of the haplotype cluster graph in the following format:
<cluster_ID> <start_of_cluster> <cluster_length> <cluster_ploidy> <allele_counts> <outgoing_edges_count>
This is followed by <outgoing_edges_count> space-delimited entries describing the transitions from this cluster.
Each transition is encoded as
<to_cluster_ID>:<log_probability_of_transition>
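The layout described above can be read with a short script. The following Python sketch is an illustration, not the package's own reader. It makes two assumptions that this manual does not confirm: transition entries are the only tokens containing a colon, and <allele_counts> may span one or more space-delimited tokens (so the line is parsed from the right). The function name and dictionary keys are hypothetical.

```python
def parse_hcg(text):
    """Parse the contents of hcg.txt into (header, clusters)."""
    rows = [line.split() for line in text.splitlines() if line.strip()]
    # First line: number of sites, first variant index, last variant index
    n_sites, first_variant, last_variant = (int(x) for x in rows[0][:3])
    clusters = []
    for tokens in rows[1:]:
        # Count trailing "<to_cluster_ID>:<log_probability>" entries
        n_trans = 0
        while n_trans < len(tokens) and ":" in tokens[-1 - n_trans]:
            n_trans += 1
        count_index = len(tokens) - n_trans - 1
        if count_index < 4 or int(tokens[count_index]) != n_trans:
            raise ValueError("unexpected cluster line: " + " ".join(tokens))
        transitions = []
        for entry in tokens[count_index + 1:]:
            target, log_p = entry.rsplit(":", 1)
            transitions.append((target, float(log_p)))
        clusters.append({
            "id": tokens[0],
            "start": int(tokens[1]),
            "length": int(tokens[2]),
            "ploidy": int(tokens[3]),
            # Kept as raw tokens because their exact layout is assumed
            "allele_counts": tokens[4:count_index],
            "transitions": transitions,
        })
    return (n_sites, first_variant, last_variant), clusters
```

Since the transition values are log probabilities, `math.exp` recovers the probabilities, and the outgoing probabilities of a cluster can be compared against 1 as a consistency check.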
trajectories.txt
Each line encodes the transitions of one haplotype through the haplotype cluster graph by coagulated state.
3 File formats
The only file required to build the haplotype cluster graph is a set of reference haplotypes.
3.1 Reference haplotypes encoded by VCF file
When the -V option is used, a VCF file is expected as input.
The VCF specification can be found at https://github.com/amarcket/vcf_spec.
3.2 Reference haplotypes in a flat file
A file can contain only the haplotype data.
For example, the file may be
010000101100000001001001100011000001000010000000000100011010
010000101100000001001001000011000100000010010000000100011010
000110000001100001000001010011000000010010000000100101011010
010000101100000001001001100011000001000010000000000100011011
100000000001100001000001000011100000010010000000100101011010
010000101100000001001001101011000010000010000000000100111010
001001000000010010000100000100001000001000101100010010000000
001001010010001110110010000000010000100101100011011000000100
010000101100000001001001101011000000000010000000000100111010
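Before passing such a file to dfcp, it can help to check that it is well formed. The Python sketch below assumes that haplotypes are whitespace-separated strings of single-digit allele codes and that all haplotypes cover the same number of sites; both are assumptions drawn from the example above, and the function name is illustrative.

```python
def parse_haplotypes(text):
    """Return a list of haplotypes, each a list of integer allele codes."""
    rows = text.split()  # splits on spaces and newlines alike
    if not rows:
        raise ValueError("no haplotypes found")
    n_sites = len(rows[0])
    haplotypes = []
    for index, row in enumerate(rows, start=1):
        if not set(row) <= set("0123456789"):
            raise ValueError("haplotype %d has a non-digit allele" % index)
        if len(row) != n_sites:
            raise ValueError("haplotype %d has %d sites, expected %d"
                             % (index, len(row), n_sites))
        haplotypes.append([int(c) for c in row])
    return haplotypes
```

For example, `parse_haplotypes(open("haps.txt").read())` returns one list per haplotype, or raises `ValueError` naming the first malformed entry.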
4 Caveats and assumptions
The DFCP can be sensitive to hyperparameter settings and starting position.
5 More information
5.1 Citations
- Original papers on discrete fragmentation-coagulation processes for clustering genetic data [???,???].