Easy Manipulation of Genetic Variant Data

For analysis of genetic variants, single nucleotide polymorphism (SNP) information is widely used.

The genotypes of each sample for each variant is commonly coded in:

value genotype
0 homozygous REF
1 heterozyguous REF/ALT
2 homozygous ALT

Sometimes, when the genotype has uncertainties, it is represented in dosage after imputation. Given the posterior probabilities of each genotype, dosage is computed as $$\mathrm{dosage = 0 \cdot Prob(REF/REF) + 1 \cdot Prob(REF/ALT) + 2 \cdot Prob(ALT/ALT)}$$ and have values in $[0, 2]$.

Often, the data formats for genetic variants include

A common type of analysis for this data is genome-wide association studies (GWAS), often testing a statistical hypothesis variant by variant. Significance of each SNP is assessed by some type of regression: $$ \mathrm{trait ∼ SNP + age + sex + principal\;components + other\;covariates } $$

In this workshop, we learn four widely-used file types for genetic variants and how to manipulate them in Julia. We focus on accessing genotype information variant by varint, as it is a common workflow for a GWAS-based application. We learn also try to compute some simple properties of SNPs such as minor allele frequencies (MAF).

Why there are so many formats...

Genetics is a fast-moving subject driven by academics. In the early stage, they devise format that best suit their needs, rather than using what is out there. As the field matures, it converges to a couple of dominant formats. The formats covered in this workshop are considered to be a relatively dominant ones. There are many, many more out there.

Genetic Variant File Formats

Variant Call Format (.vcf)

Text-based format is the most intuitive to represent genetic variants. It is the most flexible format to include diverse information on the samples and variants.

(For raw text format)

With the scale of recent genetic data, with data set like UK Biobank having near a million subjects and millions of variants, the storage needed for storing a raw VCF file is prohibitively huge. A common approach to remedy this is storing the data in a compressed form (such as .gz file) and decompress it at analysis time as a stream. However, there is a trade-off between stored file size and time needed for decompression.

(For compressed format)

Basic usage: VCFTools.jl

As in typical VCF files, it has a bunch of meta-information lines, one header line for column names, and then one line for each marker. In this VCF, genetic data has fields GT (genotype), DS (dosage), and GL (genotype likelihood).

To access number of records and samples (individuals):

Information on samples and variants can be retrieved using the VariantCallFormat package:

Sample names can be retrieved by:

Information of each variant is accessible by:

Each row of a VCF file represents a single variant, so it is natural to parse the genotypes or dosages variant-by-variant. The function copy_gt!() computes genotypes of each variant, and copy_ds!() computes dosages of each variant, represented by values in $[0, 2]$.

The keyword arguments:

In order to rapidly process the variant information while keeping the file size small, various binary file types are devised. This format is basically a compact array of two-bit representation of genotypes. This is the native format of the well-celebrated large-scale GWAS tool, PLINK 1. This is suitable for high-performance applications directly dealing with genotype matrices such as iterative hard thresholding and admixture analysis. Major weakness of this format is that it cannot contain imputed information when there is uncertainty in genotypes.

Genotype Plink/SnpArray (hexadecimal byte) binary numeric
A1,A1 0x00 0b00 0
missing 0x01 0b01 1
A1,A2 0x02 0b10 2
A2,A2 0x03 0b11 3

Basic usage: SnpArrays.jl

Information of i-th sample is accessible by:

Information of j-th variant is accessible by:

Genotype of sample $i$, variant $j$ is accessible by:

Note that this is the encoding defined by the table above. If converted numerically in 0, 1, 2-encoding, it corresponds to 2.

Genotypes of each variant in numeric form can be accessed by:

Note that the count of the second allele is encoded, and it is often the reference allele. If it is desired to run the analyses based on the alternative allele count, the values have to be reversed (subtracted from 2.0). This has led to a lot of confusion, and some later projects began to put the reference allele first (most notably, UK Biobank).

Oxford BGEN format (.bgen + optional .sample)

The BGEN format is native to Oxford statistical genetics tools, such as IMPUTE2 and SNPTEST. This format employs variant-by-variant compression scheme, well-tailored for GWAS applications. The UK Biobank imputed data is distributed in this format. Compression algorithms supported are Gzip (used in .gz files) and Zstandard (used in .zst files)

Basic usage (BGEN.jl)

Sample names are accessible by:

Since how to iterate across the variants in BGEN files depend on if the index file is provided, we support an iterator interface for variants. If you are familiar with Python, it is an interface similar to generator defined using yield statement in Python. This iterator is accessible by the function iterator().

Dosage of each variant is accssible by:

PGEN is a backward-compatible extension of the BED format for PLINK 2 under development. It tries to overcome the limitation of the BED format, and can incorporate phase and dosage information. Cutting-edge GWAS tools now support this format.

Note: Linkage disequilibrium

Linkage disequilibrium (LD) is the correlation between nearby variants such that the alleles at neighboring polymorphisms (observed on the same chromosome) are associated within a population more often than if they were unlinked.

Basic usage

g is the genotype vector.

The encoding is as follows:

genotype code genotype category
0x00 homozygous REF
0x01 heterozygous REF-ALT
0x02 homozygous ALT
0x03 missing

To avoid array allocations for iterative genotype extraction, one may preallocate g and reuse it.

Similarly, ALT allele dosages are available through the functions alt_allele_dosage() and alt_allele_dosage!(). As genotype information is required to retrieve dosages, space for genotypes are also required for alt_allele_dosage!(). These functions return dosages, parsed genotypes, and endpoint of the dosage information in the current variant record.

Information of each sample and variant is available by reading in .psam and .pvar file as a DataFrame (not yet supported by PGENFiles.jl). .pvar format admits regular .vcf format.

For .pvar file, all the lines starting with # is header, ending with the column names. For test_pgen.pvar, it is the 26th line.

Exercise

Minor allele frequency (MAF) is widely used for determining if a variant is rare or frequent, and in GWAS, it is used for measuring information content of a variant. One such measure is the ratio of actual numerical variance and expected variance from the binomial model ($2 \hat{p}(1-\hat{p})$, where $\hat{p}$ is the MAF in $[0, 1]$).

By modifying the code cells above...

  1. Compute minor allele frequencies using the packages we used.
  2. Determine the minor/major allele.
  3. Compute the variance ratio measure?
  4. Print out the minor allele, MAF, and the variance ratio for each variant.

File format transformation

The file formats can be transformed between each other using plink2 on the command line. For example, the files used in this workshop is transformed from a VCF file using the following commands.

Concluding Remarks: Applications in OpenMendel

While we focused on variant-by-variant access of the genetic variant data, many of the packages we have seen has various utility functionalities such as filtering out the variants with low MAF or low genotype success rate, filter by chromosome, and merge. In case of the BED format (SnpArrays.jl), we support high-performance linear algebra on the genotype matrix which supports multithreading and GPU (graphics programming unit) computation.

GWAS application

High-performance applications using fixed-width BED format

Others