logo created by Ji-Sung Kim


1  Introduction

This software package models genetic data using discrete fragmentation-coagulation processes (DFCPs).

1.1  Description

Our genetic sequence model, the haplotype cluster graph model (HCGM), has several desirable properties: (1) it is Bayesian nonparametric and thus provides a mechanism for estimating model complexity directly from data [???,???]; (2) unlike copying models [???], it is exchangeable across samples and thus robust to permutations of the haplotype sampling order; (3) haplotype blocks may overlap; (4) the distribution of haplotypes into clusters follows Ewens' sampling formula marginally; and (5) it is tractable for thousands of samples and hundreds of thousands of variants.
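
Property (4) refers to Ewens' sampling formula. As a rough illustration, and not phaseME's own code, the sketch below evaluates that formula for a given list of cluster sizes. The function name ewens_probability and the concentration parameter theta are assumptions made for this example; the text above does not say which phaseME parameter plays that role.

```python
from collections import Counter
from math import factorial


def ewens_probability(cluster_sizes, theta):
    """Probability, under Ewens' sampling formula, of observing a partition
    of n items whose clusters have the given sizes (order ignored).

    theta is the concentration parameter (an assumed name for this sketch).
    """
    n = sum(cluster_sizes)

    # Rising factorial theta * (theta + 1) * ... * (theta + n - 1).
    rising = 1.0
    for i in range(n):
        rising *= theta + i

    prob = factorial(n) / rising

    # a_j = number of clusters of size j; each size j contributes
    # theta**a_j / (j**a_j * a_j!).
    for size, count in Counter(cluster_sizes).items():
        prob *= theta ** count / (size ** count * factorial(count))
    return prob
```

For example, ewens_probability([2, 1], theta=2.0) is the probability that three haplotypes split into one cluster of two and one singleton. Summing over all possible size configurations for a fixed n gives 1.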

1.2  Download

phaseME can be downloaded from

2  phaseME options

2.1  Usage

usage: dfcp [options] [input_file] output_dir

Example usage

java -Xmx1g -jar dfcp.jar -seed 1234 -fBackwards false -fQuiet false -random_restarts 1 -I 1 -B 5 -T 10 -S 1 --mevar -DP 0.0 -APA 1 -APB 1 -DPA 0.1 -DPB 1.1 -u 1 -R 0.1 -fDirichletLearnMutation true -hcg -viz haps.txt path/to/output

2.2  Arguments

Parameters to dfcp.jar are given as -<parameter> <value> pairs when the parameter takes an argument, or as -<parameter> or --<parameter> for boolean switches.
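
The convention above can be sketched as a small helper that assembles a command line from value pairs and switches. This is an illustrative wrapper, not part of phaseME: the function build_dfcp_command and its keyword defaults are hypothetical, and only option names that appear in this manual are used in the example.

```python
def build_dfcp_command(valued, switches, input_file, output_dir,
                       jar="dfcp.jar", heap="1g"):
    """Return the argument list for one dfcp run (hypothetical helper).

    valued   -- mapping of option name to value, emitted as -<name> <value>
    switches -- iterable of boolean switch names, emitted as -<name>
    """
    cmd = ["java", f"-Xmx{heap}", "-jar", jar]
    for name, value in valued.items():
        # Flags such as -fQuiet take a lowercase true/false argument.
        text = str(value).lower() if isinstance(value, bool) else str(value)
        cmd += [f"-{name}", text]
    for name in switches:
        cmd.append(f"-{name}")
    # Positional arguments come last: [input_file] output_dir.
    cmd += [input_file, output_dir]
    return cmd


cmd = build_dfcp_command(
    {"seed": 1234, "fQuiet": False, "B": 5},
    ["hcg", "viz"],
    "haps.txt",
    "path/to/output",
)
print(" ".join(cmd))
```

Returning a list rather than a single string means the result can be passed directly to subprocess.run without shell quoting.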

2.2.1  Required

 -A <arg>                         initial value of Gamma (default 0.01 in
                                  [1.0E-5, 10.0])
 -allocator <arg>
 -APA <arg>                       Gamma distribution alpha prior on alpha
                                  (default 1.0)
 -APB <arg>                       Gamma distribution beta prior on alpha
                                  (default 0.25)
 -B <arg>                         number of samples discarded in burnin
                                  (default 900)
 -calls                           print out calls.
 -diagnostics <arg>               Conduct MCMC diagnostics.
 -DP <arg>                        Dirichlet prior for likelihood when
                                  computing for each cluster (default 1.1)
 -DPA <arg>                       Gamma distribution alpha prior on d
                                  (default 1.1)
 -DPB <arg>                       Gamma distribution beta prior on d
                                  (default 1.1)
 -fastmosaic                      save a fast mosaic diagram.
 -fBackwards <arg>                Flag indicates if backwards messages
                                  should be computed and used to estimate
                                  statistics. (default: false)
 -fBernoulli <arg>                bernoulli emission model.
 -fDirichlet <arg>                dirichlet emission model.
 -fDirichletLearnMutation <arg>   dirichlet emission model while learning
                                  mutation as part of gamma.
 -fGaussian <arg>                 Flag to indicate if we should use a
                                  Gaussian prior on alpha (default: false)
 -fixalpha                        do not resample alpha. (default false)
 -fixGammas                       do not resample Gammas. (default false)
 -fixRs                           do not resample Rs. (default false)
 -fQuiet <arg>                    Flag indicates if we should output
                                  summary stats. (default: false)
 -GPA <arg>                       Gamma distribution alpha prior on gamma
                                  (default 1.0)
 -GPB <arg>                       Gamma distribution beta prior on gamma
                                  (default 0.25)
 -hcg                             output haplotype cluster graph.
 -help                            print this help.
 -I <arg>                         number of iterations with fixed
                                  parameters (default 100)
 -impute                          do imputation.
 -ivariance                       save impute variance.
 -K <arg>                         max #sites (default 200)
 -M,--mevar                       Run maximization expectation algorithm
                                  with variational updates (instead of
                                  Gibbs Sampling and max-exp with MCMC
 -marginals                       print out marginals.
 -mergemosaic1 <arg>              merge DFCP runs mosaic input 1
 -mergemosaic2 <arg>              merge DFCP runs mosaic input 2
 -mergevcf1 <arg>                 merge DFCP runs VCF input 1
 -mergevcf2 <arg>                 merge DFCP runs VCF input 2
 -mosaic                          save mosaic diagrams.
 -mutation <arg>                  set mutation rate
 -P <arg>                         number of parameter updates between
                                  Gibbs sweeps (default 10)
 -p <arg>                         ploidy or the maximum number of variant
                                  alleles at any locus (default 2)
 -phase                           do haplotype phasing.
 -R <arg>                         initial value of R (default 0.5 in
                                  [1.0E-10, 1.0])
 -random_restarts <arg>           number of random restarts (default 1)
 -reference <arg>                 file to use for reference panel
 -resample                        Resample data.
 -runtimes                        get runtimes.
 -S <arg>                         number of MCMC iterations after burnin
                                  (default 3000)
 -seed <arg>                      seed for rng (default: random).
 -statistics                      collect statistics.
 -T <arg>                         #iterations between each collection
                                  (default 1 iteration)
 -trace                           record out traces.
 -trajectories                    save trajectory paths through cluster
 -u <arg>                         initial value of alpha (default 10.0 in
                                  [0.001, 1000.0])
 -V,--vcf                         Read VCF file format as input.
 -VG <arg>                        number of genotypes to use in VCF input
                                  (default ALL)
 -VH <arg>                        number of haplotypes to use in VCF input
                                  (default ALL)
 -viz                             Visualize the haplotype cluster graph.
 -X,--memcmc                      Run maximization expectation algorithm
                                  with MCMC updates (instead of Gibbs
                                  Sampling and max-exp with variational

2.2.2  phaseME output

Several files will be output to output_dir.

  • hcg_viz.txt
    Output when the -viz option is used.
    Contains a visualization of the haplotype cluster graph in Graphviz dot format.
    Render it with the neato layout engine and output to SVG.
    If you open the SVG image in a compatible viewer (e.g. Chrome), mousing over a cluster displays the line numbers of the input haplotypes.
  • hcg.txt
    A space-delimited file representing the haplotype cluster graph.
    The first line has three columns: the number of sites, the index (in the input) of the first variant, the index (in the input) of the last variant.
    The following lines describe the clusters of the haplotype cluster graph in this format:
    <cluster_ID> <start_of_cluster> <cluster_length> <cluster_ploidy> <allele_counts> <outgoing_edges_count>
    This is followed by
    <outgoing_edges_count> space-delimited entries describing the transitions from this cluster.
    Each transition is encoded as


  • trajectories.txt
    Each line encodes the transitions of each haplotype through the haplotype cluster graph by coagulated state.
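
As a minimal sketch of working with these files, the helpers below build the Graphviz command for rendering hcg_viz.txt and read the three-column first line of hcg.txt. Both function names are hypothetical, the header values are assumed to be integers, and the sample header line is made up for illustration. The cluster lines are not parsed here because their transition encoding is not fully described above.

```python
def neato_svg_command(dot_path, svg_path):
    """Argument list that renders a dot file to SVG with Graphviz neato."""
    return ["neato", "-Tsvg", dot_path, "-o", svg_path]


def parse_hcg_header(line):
    """Parse the first line of hcg.txt: number of sites, then the input
    indices of the first and last variant (assumed to be integers)."""
    fields = line.split()
    if len(fields) != 3:
        raise ValueError(f"expected 3 columns, found {len(fields)}")
    n_sites, first_variant, last_variant = (int(f) for f in fields)
    return {
        "n_sites": n_sites,
        "first_variant": first_variant,
        "last_variant": last_variant,
    }


# Made-up header line, for illustration only.
header = parse_hcg_header("200 0 199")
render = neato_svg_command("path/to/output/hcg_viz.txt",
                           "path/to/output/hcg_viz.svg")
```

The render list can be handed to subprocess.run(render, check=True) on a machine where Graphviz is installed.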


3  File formats

The only file required to build the haplotype cluster graph is a set of reference haplotypes.

3.1  Reference haplotypes encoded by VCF file

When the -V option is used, a VCF file is expected as input.
VCF specification can be found

3.2  Reference haplotypes in a flat file

A file can contain only the haplotype data.
For example, the file may be


4  Caveats and assumptions

The DFCP can be sensitive to hyperparameter settings and starting position.

5  More information

5.1  Citations

  1. Original papers on discrete fragmentation-coagulation processes for clustering genetic data [???,???].