Last updated: 25 July, 2017


Overview

This is the code repository for our paper and software for genotyping and parameter estimation in polyploids. The main program, ebg, is written in C++ and can be found in the ebg/ folder. Additional code for other parts of the paper can be found in the following folders:

Below is a list of the models that are implemented in our software, as well as a link to the paper describing the theory behind them.

Models:

Paper:

Obtaining and compiling ebg

The software to infer genotypes and model parameters is called ebg and can be found in the ebg folder in the main polyploid-genotyping repo on GitHub. Inside the ebg folder you can compile the software from source using the Makefile. We have successfully compiled the program using GCC and Clang (Mac OSX). No external libraries are required.

# Clone from GitHub
git clone https://github.com/pblischak/polyploid-genotyping.git

# Change into the ebg directory
cd polyploid-genotyping/ebg

# Compile and install ebg
make && sudo make install

Input data formats

There are three input files that are necessary to run an analysis with ebg (four if you are using the alloSNP model). The read count data files (total and alternative allele read counts) should be in plain text files as tab delimited matrices with individuals as rows and loci as columns. The per locus error rates files should be a single column with the error value listed for each locus on one line.

Total reads

24  12  4  8  46  ...  7
2  14  57  29  -9  ...  78
.
.
.
-9  68  -9  3  8  ...  5

Alternative allele reads

10  0  2  7  24  ...  0
1  3  54  23  -9  ...  75
.
.
.
-9  31  -9  1  2  ...  0

Per locus error rates

0.00254
0.00089
.
.
.
8e-5

Reference allele frequencies (alloSNP model only)

If you are running the alloSNP model, you will need a reference panel of allele frequencies for the genotypes in subgenomes one. This should be formated in the same way as the per locus error rates file: one allele frequency per locus listed on separate lines.

0.578
0.079
.
.
.
0.233

Running analyses

Analyses for each model can be run from the command line by calling the ebg executable. The options for each of the models can be viewed by typing: ebg <model> -h. Below we have given an example of what should be typed at the command line to run each model.

hwe

ebg hwe -t tot-reads.txt \
        -a alt-reads.txt \
        -e error.txt \
        -p 4 \
        --iters 1000 \
        --prefix hwe-test 

diseq

ebg diseq -t tot-reads.txt \
          -a alt-reads.txt \
          -e error.txt \
          -p 4 \
          --iters 1000 \
          --prefix diseq-test

alloSNP

ebg alloSNP -f reference-freqs.txt \
            -t tot-reads.txt \
            -a alt-reads.txt \
            -e error.txt \
            -p1 2 \
            -p2 4 \
            --iters 1000 \
            --prefix alloSNP-test

gatk

ebg gatk -t tot-reads.txt \
         -a alt-reads.txt \
         -e error.txt \
         -p 4 \
         --iters 1000 \
         --prefix gatk-test

Output files

The output files written for each analysis are tab delimited text files with parameter estimates, genotypes, and updated genotype probabilities.

Updated genotype probabilities are specified in a minimal, VCF-like matrix with loci as rows and individuals as columns. Each entry is a comma separated list for the different possible genotype values (the number of copies of the alternative allele). The files are called -PL.txt because the updated probabilities are on the PHRED scale:

\[ PL = -10 \times \log_{10}[P(\text{genotype}|\text{data})] \]

The updated probabilities for the alloSNP model are the joint probabilities for the genotypes in subgenomes one and two. The joint distribution has \((\text{ploidy}_1 + 1) \times (\text{ploidy}_2+1)\) entries that are listed in this order:

\[ (0,0),(0,1),(0,2),\dots,(0,\text{ploidy}_2),\\ (1,0),(1,1),(1,2),\dots,(1,\text{ploidy}_2),\\ \ldots,\\ (\text{ploidy}_1,0),(\text{ploidy}_1,1),(\text{ploidy}_1,2),\ldots,(\text{ploidy}_1,\text{ploidy}_2) \]

These genotype probabilities are included so that downstream analyses can include genotype uncertainty (e.g., estimates of heterozygosity, \(F_{ST}\), etc.).