For analysis of genetic variants, single nucleotide polymorphism (SNP) information is widely used.
Each SNP is commonly identified by an ID beginning with rs ("rsID"). The genotype of each sample at each variant is commonly coded as:
value | genotype |
---|---|
0 | homozygous REF |
1 | heterozygous REF/ALT |
2 | homozygous ALT |
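The coding in the table can be sketched in Julia as a count of ALT alleles. This is only an illustration: `alt_count` is a hypothetical helper name, and it assumes diploid, biallelic genotypes written in the VCF style (`0` for REF, `1` for ALT, `.` for a missing allele).

```julia
# Count ALT alleles in a diploid genotype string such as "0/0", "0/1", or "1|1".
# Returns `missing` when either allele is missing (".").
# Assumes a biallelic variant: only allele "1" is counted as ALT.
function alt_count(gt::AbstractString)
    alleles = split(gt, r"[/|]")          # "/" = unphased, "|" = phased
    length(alleles) == 2 ||
        throw(ArgumentError("expected a diploid genotype, got \"$gt\""))
    any(==("."), alleles) && return missing
    return count(==("1"), alleles)
end

for gt in ("0/0", "0/1", "1/0", "1/1", "./.")
    println(gt, " => ", alt_count(gt))
end
```

Both `0/1` and `1/0` map to the value 1, since the coding only records how many ALT alleles are present.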
Sometimes, when a genotype is uncertain, it is represented as a dosage after imputation. Given the posterior probabilities of each genotype, the dosage is computed as $$\mathrm{dosage = 0 \cdot Prob(REF/REF) + 1 \cdot Prob(REF/ALT) + 2 \cdot Prob(ALT/ALT)}$$ and takes values in $[0, 2]$.
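As a minimal sketch of the dosage formula, the function below takes the three posterior probabilities and returns the expected ALT-allele count. The name `dosage` and the tolerance used in the sanity check are assumptions made for this example.

```julia
# Expected number of ALT alleles given posterior genotype probabilities.
# Arguments: P(REF/REF), P(REF/ALT), P(ALT/ALT).
function dosage(p_rr::Real, p_ra::Real, p_aa::Real)
    # The three probabilities should sum to 1 (up to rounding).
    isapprox(p_rr + p_ra + p_aa, 1; atol = 1e-6) ||
        throw(ArgumentError("genotype probabilities must sum to 1"))
    return 0 * p_rr + 1 * p_ra + 2 * p_aa
end

println(dosage(1.0, 0.0, 0.0))   # certain REF/REF
println(dosage(0.0, 0.5, 0.5))   # split between REF/ALT and ALT/ALT
```

A genotype known with certainty gives back the integer coding 0, 1, or 2; uncertainty produces values in between.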
Often, the data formats for genetic variants include
A common type of analysis for these data is the genome-wide association study (GWAS), which often tests a statistical hypothesis variant by variant. The significance of each SNP is assessed by some type of regression: $$ \mathrm{trait \sim SNP + age + sex + principal\;components + other\;covariates } $$
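To make the regression concrete, here is a rough sketch of an ordinary least-squares fit for a single SNP using only Julia's standard library. The helper `snp_ols` and the toy data are made up for illustration. A real GWAS would typically use a dedicated regression package, include more covariates, and convert the test statistic to a p-value.

```julia
using LinearAlgebra

# Least-squares fit of trait ~ intercept + SNP + covariates for one SNP.
# Returns the estimated SNP effect and its t-statistic.
function snp_ols(trait::AbstractVector, snp::AbstractVector, covariates::AbstractMatrix)
    n = length(trait)
    X = hcat(ones(n), snp, covariates)          # design matrix; column 2 is the SNP
    β = X \ trait                               # least-squares coefficients
    resid = trait - X * β
    σ² = sum(abs2, resid) / (n - size(X, 2))    # residual variance estimate
    se = sqrt(σ² * inv(X' * X)[2, 2])           # standard error of the SNP coefficient
    return (beta = β[2], tstat = β[2] / se)
end

# Hypothetical toy data: eight samples, genotypes coded 0/1/2.
snp   = [0, 1, 2, 0, 1, 2, 0, 1]
age   = [30.0, 41, 52, 38, 47, 29, 60, 55]
sex   = [0.0, 1, 0, 1, 1, 0, 0, 1]
trait = [1.1, 1.9, 3.2, 0.8, 2.1, 2.9, 1.3, 2.0]

fit = snp_ols(trait, snp, hcat(age, sex))
println(fit)
```

In a variant-by-variant analysis, a function like this would be called once per SNP with the same trait and covariates.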
In this workshop, we learn four widely used file types for genetic variants and how to manipulate them in Julia. We focus on accessing genotype information variant by variant, as this is a common workflow for GWAS-based applications. We also compute some simple properties of SNPs, such as minor allele frequencies (MAF).
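As a preview of the kind of per-variant computation meant here, the sketch below computes the MAF from one variant's genotypes coded 0/1/2. The helper name `maf` and the choice to skip `missing` entries are assumptions for this example, not the API of any package.

```julia
# Minor allele frequency from genotypes coded as ALT-allele counts (0, 1, 2).
# Entries equal to `missing` are skipped; returns NaN if nothing is observed.
function maf(genotypes)
    observed = collect(skipmissing(genotypes))
    isempty(observed) && return NaN
    alt_freq = sum(observed) / (2 * length(observed))   # each sample carries two alleles
    return min(alt_freq, 1 - alt_freq)                   # the rarer allele's frequency
end

println(maf([0, 1, 2, 0, 0, 1, missing, 0]))
```

Because the function only sums the values, it also accepts dosages in $[0, 2]$ in place of hard genotype calls.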
Why there are so many formats...
Genetics is a fast-moving field driven by academics. In its early stages, researchers devised formats that best suited their own needs rather than adopting what already existed. As the field matures, it converges on a few dominant formats. The formats covered in this workshop are among the relatively dominant ones; many more exist.
VCF (.vcf)
A text-based format is the most intuitive way to represent genetic variants. It is also the most flexible, accommodating diverse information on the samples and variants.
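To illustrate the text layout, the sketch below splits one VCF data line into its columns and pulls out the genotype (`GT`) and dosage (`DS`) entries for each sample. The line is taken from the workshop's `test_vcf.vcf.gz`, cut down to the first three samples and with the INFO column shortened to two of its entries. The variable names are arbitrary.

```julia
# One data line: eight fixed columns, a FORMAT column, then one column per sample.
record = "22 20000428 rs55902548 G T 100 PASS AC=323;AN=2184 GT:DS:GL " *
         "1/0:1.000:-5.00,0.00,-5.00 0/0:0.000:-0.35,-0.43,-0.73 0/1:1.000:-1.81,-0.01,-2.95"

fields      = split(record)            # VCF files are tab-delimited; split() handles any whitespace
rsid        = fields[3]                # ID column
format_keys = split(fields[9], ':')    # FORMAT column names the per-sample entries
samples     = fields[10:end]

gt_idx = findfirst(==("GT"), format_keys)
ds_idx = findfirst(==("DS"), format_keys)

# Per-sample entries are colon-separated in the order given by FORMAT.
gts     = [split(s, ':')[gt_idx] for s in samples]
dosages = [parse(Float64, split(s, ':')[ds_idx]) for s in samples]

println(rsid, ": GT = ", gts, ", DS = ", dosages)
```

Looking up `GT` and `DS` through the FORMAT column, rather than assuming fixed positions, keeps the sketch usable for files that list their per-sample entries in a different order.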
(For raw text format)
At the scale of recent genetic data, with data sets like UK Biobank having about half a million subjects and millions of variants, the storage needed for a raw VCF file is prohibitively large. A common remedy is to store the data in compressed form (such as a .gz file) and decompress it as a stream at analysis time. However, there is a trade-off between the stored file size and the time needed for decompression.
(For compressed format)
VCFTools.jl
using VCFTools

# Open the gzip-compressed VCF file for reading and print its first 35 lines.
fh = openvcf("test_vcf.vcf.gz", "r")
for _ in 1:35
    println(readline(fh))
end
close(fh)
##fileformat=VCFv4.1
##INFO=<ID=LDAF,Number=1,Type=Float,Description="MLE Allele Frequency Accounting for LD">
##INFO=<ID=AVGPOST,Number=1,Type=Float,Description="Average posterior probability from MaCH/Thunder">
##INFO=<ID=RSQ,Number=1,Type=Float,Description="Genotype imputation quality from MaCH/Thunder">
##INFO=<ID=ERATE,Number=1,Type=Float,Description="Per-marker Mutation rate from MaCH/Thunder">
##INFO=<ID=THETA,Number=1,Type=Float,Description="Per-marker Transition rate from MaCH/Thunder">
##INFO=<ID=CIEND,Number=2,Type=Integer,Description="Confidence interval around END for imprecise variants">
##INFO=<ID=CIPOS,Number=2,Type=Integer,Description="Confidence interval around POS for imprecise variants">
##INFO=<ID=END,Number=1,Type=Integer,Description="End position of the variant described in this record">
##INFO=<ID=HOMLEN,Number=.,Type=Integer,Description="Length of base pair identical micro-homology at event breakpoints">
##INFO=<ID=HOMSEQ,Number=.,Type=String,Description="Sequence of base pair identical micro-homology at event breakpoints">
##INFO=<ID=SVLEN,Number=1,Type=Integer,Description="Difference in length between REF and ALT alleles">
##INFO=<ID=SVTYPE,Number=1,Type=String,Description="Type of structural variant">
##INFO=<ID=AC,Number=.,Type=Integer,Description="Alternate Allele Count">
##INFO=<ID=AN,Number=1,Type=Integer,Description="Total Allele Count">
##ALT=<ID=DEL,Description="Deletion">
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
##FORMAT=<ID=DS,Number=1,Type=Float,Description="Genotype dosage from MaCH/Thunder">
##FORMAT=<ID=GL,Number=.,Type=Float,Description="Genotype Likelihoods">
##INFO=<ID=AA,Number=1,Type=String,Description="Ancestral Allele, ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/pilot_data/technical/reference/ancestral_alignments/README">
##INFO=<ID=AF,Number=1,Type=Float,Description="Global Allele Frequency based on AC/AN">
##INFO=<ID=AMR_AF,Number=1,Type=Float,Description="Allele Frequency for samples from AMR based on AC/AN">
##INFO=<ID=ASN_AF,Number=1,Type=Float,Description="Allele Frequency for samples from ASN based on AC/AN">
##INFO=<ID=AFR_AF,Number=1,Type=Float,Description="Allele Frequency for samples from AFR based on AC/AN">
##INFO=<ID=EUR_AF,Number=1,Type=Float,Description="Allele Frequency for samples from EUR based on AC/AN">
##INFO=<ID=VT,Number=1,Type=String,Description="indicates what type of variant the line represents">
##INFO=<ID=SNPSOURCE,Number=.,Type=String,Description="indicates if a snp was called when analysing the low coverage or exome alignment data">
##reference=GRCh37
##reference=GRCh37
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT HG00096 HG00097 HG00099 HG00100 HG00101 HG00102 HG00103 HG00104 HG00106 HG00108 HG00109 HG00110 HG00111 HG00112 HG00113 HG00114 HG00116 HG00117 HG00118 HG00119 HG00120 HG00121 HG00122 HG00123 HG00124 HG00125 HG00126 HG00127 HG00128 HG00129 HG00130 HG00131 HG00133 HG00134 HG00135 HG00136 HG00137 HG00138 HG00139 HG00140 HG00141 HG00142 HG00143 HG00146 HG00148 HG00149 HG00150 HG00151 HG00152 HG00154 HG00155 HG00156 HG00158 HG00159 HG00160 HG00171 HG00173 HG00174 HG00176 HG00177 HG00178 HG00179 HG00180 HG00182 HG00183 HG00185 HG00186 HG00187 HG00188 HG00189 HG00190 HG00231 HG00232 HG00233 HG00234 HG00235 HG00236 HG00237 HG00238 HG00239 HG00240 HG00242 HG00243 HG00244 HG00245 HG00246 HG00247 HG00249 HG00250 HG00251 HG00252 HG00253 HG00254 HG00255 HG00256 HG00257 HG00258 HG00259 HG00260 HG00261 HG00262 HG00263 HG00264 HG00265 HG00266 HG00267 HG00268 HG00269 HG00270 HG00271 HG00272 HG00273 HG00274 HG00275 HG00276 HG00277 HG00278 HG00280 HG00281 HG00282 HG00284 HG00285 HG00306 HG00309 HG00310 HG00311 HG00312 HG00313 HG00315 HG00318 HG00319 HG00320 HG00321 HG00323 HG00324 HG00325 HG00326 HG00327 HG00328 HG00329 HG00330 HG00331 HG00332 HG00334 HG00335 HG00336 HG00337 HG00338 HG00339 HG00341 HG00342 HG00343 HG00344 HG00345 HG00346 HG00349 HG00350 HG00351 HG00353 HG00355 HG00356 HG00357 HG00358 HG00359 HG00360 HG00361 HG00362
HG00364 HG00366 HG00367 HG00369 HG00372 HG00373 HG00375 HG00376 HG00377 HG00378 HG00381 HG00382 HG00383 HG00384 HG00403 HG00404 HG00406 HG00407 HG00418 HG00419 HG00421 HG00422 HG00427 HG00428 22 20000086 rs138720731 T C 100 PASS AC=7;RSQ=0.8454;AVGPOST=0.9983;AA=T;AN=2184;LDAF=0.0040;THETA=0.0001;VT=SNP;SNPSOURCE=LOWCOV;ERATE=0.0003;AF=0.0032;AFR_AF=0.01 GT:DS:GL 0/0:0.000:-0.03,-1.19,-5.00 0/0:0.000:-0.04,-1.05,-5.00 0/0:0.000:-0.07,-0.85,-5.00 0/0:0.000:-0.03,-1.18,-5.00 0/0:0.000:-0.06,-0.87,-5.00 0/0:0.000:-0.03,-1.14,-5.00 0/0:0.000:-0.06,-0.90,-5.00 0/0:0.000:-0.23,-0.45,-1.28 0/0:0.000:-0.03,-1.20,-5.00 0/0:0.000:-0.48,-0.48,-0.48 0/0:0.000:-0.11,-0.65,-4.40 0/0:0.000:-0.48,-0.48,-0.48 0/0:0.000:-0.06,-0.91,-5.00 0/0:0.000:-0.18,-0.47,-2.54 0/0:0.000:-0.48,-0.48,-0.48 0/0:0.000:-0.06,-0.90,-5.00 0/0:0.000:-0.01,-1.74,-5.00 0/0:0.000:-0.00,-3.66,-5.00 0/0:0.000:-0.48,-0.48,-0.48 0/0:0.000:-0.00,-2.53,-5.00 0/0:0.000:-0.09,-0.73,-5.00 0/0:0.000:-0.48,-0.48,-0.48 0/0:0.000:-0.48,-0.48,-0.48 0/0:0.000:-0.00,-3.11,-5.00 0/0:0.000:-0.06,-0.89,-5.00 0/0:0.000:-0.09,-0.71,-4.10 0/0:0.000:-0.11,-0.65,-4.40 0/0:0.000:-0.18,-0.47,-2.34 0/0:0.000:-0.22,-0.45,-1.32 0/0:0.000:-0.02,-1.29,-5.00 0/0:0.000:-0.03,-1.15,-5.00 0/0:0.000:-0.02,-1.45,-5.00 0/0:0.000:-0.00,-3.34,-5.00 0/0:0.000:-0.12,-0.61,-3.19 0/0:0.000:-0.11,-0.67,-4.40 0/0:0.000:-0.05,-0.99,-5.00 0/0:0.000:-0.18,-0.48,-2.15 0/0:0.000:-0.01,-1.47,-5.00 0/0:0.000:-0.10,-0.67,-3.62 0/0:0.000:-0.03,-1.14,-5.00 0/0:0.000:-0.09,-0.73,-4.40 0/0:0.000:-0.07,-0.84,-4.40 0/0:0.000:-0.18,-0.48,-2.46 0/0:0.000:-0.0292813,-1.18575,-5 0/0:0.000:-0.48,-0.48,-0.48 0/0:0.000:-0.01,-1.67,-5.00 0/0:0.000:-0.18,-0.47,-2.40 0/0:0.000:-0.03,-1.25,-5.00 0/0:0.000:-0.11,-0.66,-3.44 0/0:0.000:-0.09,-0.73,-4.70 0/0:0.000:-0.0418663,-1.03687,-4.39794 0/0:0.000:-0.08,-0.79,-3.14 0/0:0.000:-0.00,-2.30,-5.00 0/0:0.000:-0.00,-2.54,-5.00 0/0:0.000:-0.03,-1.21,-5.00 0/0:0.000:-0.06,-0.86,-5.00 0/0:0.000:-0.09,-0.71,-4.70 
0/0:0.000:-0.01,-1.49,-5.00 0/0:0.000:-0.01,-1.88,-5.00 0/0:0.000:-0.09,-0.71,-4.70 0/0:0.000:-0.03,-1.19,-5.00 0/0:0.000:-0.10,-0.67,-4.40 0/0:0.000:-0.01,-1.51,-5.00 0/0:0.000:-0.02,-1.40,-5.00 0/0:0.000:-0.03,-1.17,-5.00 0/0:0.000:-0.05,-0.93,-5.00 0/0:0.000:-0.01,-1.48,-5.00 0/0:0.000:-0.00,-2.02,-5.00 0/0:0.000:-0.03,-1.18,-5.00 0/0:0.000:-0.02,-1.46,-5.00 0/0:0.000:-0.03,-1.17,-5.00 0/0:0.050:-0.18,-0.47,-2.73 0/0:0.000:-0.17,-0.49,-2.97 0/0:0.000:-0.10,-0.68,-4.40 0/0:0.000:-0.05,-0.99,-5.00 0/0:0.000:-0.12,-0.62,-3.38 0/0:0.000:-0.00,-2.06,-5.00 0/0:0.000:-0.16,-0.51,-2.66 0/0:0.000:-0.11,-0.64,-4.22 0/0:0.000:-0.03,-1.22,-5.00 0/0:0.000:-0.01,-1.64,-5.00 0/0:0.000:-0.00,-2.85,-5.00 0/0:0.000:-0.02,-1.38,-5.00 0/0:0.000:-0.05,-0.94,-5.00 0/0:0.000:-0.0311436,-1.15989,-5 0/0:0.000:-0.36,-0.42,-0.73 0/0:0.000:-0.01,-1.88,-5.00 0/0:0.000:-0.05,-0.92,-5.00 0/0:0.000:-0.03,-1.16,-5.00 0/0:0.000:-0.04,-1.04,-5.00 0/0:0.000:-0.13,-0.59,-5.00 0/0:0.000:-0.02,-1.36,-5.00 0/0:0.000:-0.16,-0.51,-2.36 0/0:0.000:-0.02,-1.31,-5.00 0/0:0.000:-0.03,-1.19,-5.00 0/0:0.000:-0.03,-1.17,-5.00 0/0:0.000:-0.00,-4.40,-5.00 0/0:0.000:-0.03,-1.16,-5.00 0/0:0.000:-0.09,-0.73,-3.70 0/0:0.000:-0.19,-0.47,-1.77 0/0:0.000:-0.00,-3.32,-5.00 0/0:0.000:-0.17,-0.51,-2.00 0/0:0.000:-0.00,-2.17,-5.00 0/0:0.000:-0.00,-2.91,-5.00 0/0:0.000:-0.10,-0.71,-4.10 0/0:0.000:-0.03,-1.12,-5.00 0/0:0.000:-0.48,-0.48,-0.48 0/0:0.000:-0.01,-1.67,-5.00 0/0:0.000:-0.00,-2.09,-5.00 0/0:0.000:-0.04,-1.09,-5.00 0/0:0.000:-0.01,-1.48,-5.00 0/0:0.000:-0.02,-1.41,-5.00 0/0:0.000:-0.10,-0.69,-3.80 0/0:0.000:-0.01,-1.54,-5.00 0/0:0.000:-0.03,-1.16,-5.00 0/0:0.000:-0.09,-0.73,-4.70 0/0:0.000:-0.09,-0.74,-4.70 0/0:0.000:-0.06,-0.92,-5.00 0/0:0.000:-0.05,-0.97,-5.00 0/0:0.000:-0.08,-0.78,-5.00 0/0:0.000:-0.06,-0.92,-5.00 0/0:0.000:-0.10,-0.67,-4.40 0/0:0.000:-0.01,-1.71,-5.00 0/0:0.000:-0.03,-1.20,-5.00 0/0:0.000:-0.02,-1.26,-5.00 0/0:0.000:-0.04,-1.10,-5.00 0/0:0.000:-0.02,-1.27,-5.00 0/0:0.000:-0.48,-0.48,-0.48 
0/0:0.000:-0.01,-1.47,-5.00 0/0:0.000:-0.00,-2.00,-5.00 0/0:0.000:-0.10,-0.67,-4.22 0/0:0.050:-0.18,-0.47,-2.34 0/0:0.000:-0.05,-1.00,-5.00 0/0:0.000:-0.11,-0.65,-3.85 0/0:0.000:-0.10,-0.68,-4.70 0/0:0.000:-0.48,-0.48,-0.48 0/0:0.000:-0.08,-0.76,-5.00 0/0:0.000:-0.19,-0.47,-2.14 0/0:0.000:-0.00,-1.99,-5.00 0/0:0.000:-0.18,-0.47,-2.46 0/0:0.000:-0.09,-0.74,-4.40 0/0:0.450:-0.05,-0.94,-5.00 0/0:0.000:-0.48,-0.48,-0.48 0/0:0.000:-0.10,-0.69,-4.70 0/0:0.000:-0.01,-1.50,-5.00 0/0:0.000:-0.18,-0.47,-2.34 0/0:0.000:-0.03,-1.17,-5.00 0/0:0.000:-0.06,-0.88,-5.00 0/0:0.000:-0.02,-1.41,-5.00 0/0:0.000:-0.06,-0.88,-5.00 0/0:0.000:-0.18,-0.47,-1.95 0/0:0.000:-0.19,-0.46,-2.17 0/0:0.000:-0.03,-1.13,-5.00 0/0:0.000:-0.03,-1.18,-5.00 0/0:0.000:-0.18,-0.48,-2.23 0/0:0.000:-0.23,-0.45,-1.31 0/0:0.000:-0.11,-0.64,-3.92 0/0:0.000:-0.03,-1.17,-5.00 0/0:0.000:-0.03,-1.22,-5.00 0/0:0.000:-0.11,-0.66,-4.22 0/0:0.000:-0.12,-0.61,-2.38 0/0:0.000:-0.03,-1.22,-5.00 0/0:0.000:-0.40,-0.45,-0.60 0/0:0.000:-0.00,-2.98,-5.00 0/0:0.000:-0.13,-0.59,-2.09 0/0:0.000:-0.02,-1.37,-5.00 0/0:0.000:-0.477139,-0.477113,-0.477113 0/0:0.000:-0.04,-1.10,-5.00 0/0:0.000:-0.03,-1.23,-5.00 0/0:0.000:-0.01,-1.51,-5.00 0/0:0.000:-0.01,-1.67,-5.00 0/0:0.000:-0.08,-0.75,-4.40 0/0:0.000:-0.03,-1.23,-5.00 0/0:0.000:-0.10,-0.69,-4.40 0/0:0.000:-0.12,-0.63,-3.92 0/0:0.000:-0.01,-1.74,-5.00 0/0:0.000:-0.48,-0.48,-0.48 0/0:0.000:-0.19,-0.46,-2.60 0/0:0.000:-0.19,-0.46,-2.62 0/0:0.000:-0.11,-0.65,-4.70 0/0:0.000:-0.11,-0.66,-4.70 0/0:0.050:-0.18,-0.49,-2.04 0/0:0.050:-0.10,-0.67,-4.40 0/0:0.000:-0.01,-1.62,-5.00 0/0:0.000:-0.48,-0.48,-0.48 0/0:0.000:-0.23,-0.41,-1.51 0/0:0.000:-0.18,-0.48,-2.18 0/0:0.000:-0.03,-1.17,-5.00 0/0:0.000:-0.10,-0.68,-4.10 0/0:0.000:-0.03,-1.24,-5.00 0/0:0.000:-0.18,-0.48,-2.14 22 20000146 rs73387790 G A 100 PASS LDAF=0.0169;RSQ=0.9482;THETA=0.0004;AA=G;AN=2184;AVGPOST=0.9972;VT=SNP;SNPSOURCE=LOWCOV;AC=36;ERATE=0.0003;AF=0.02;AFR_AF=0.07;EUR_AF=0.0013 GT:DS:GL 0/0:0.000:-0.00,-2.68,-5.00 
0/0:0.000:-0.07,-0.82,-5.00 0/0:0.000:-0.13,-0.60,-3.05 0/0:0.000:-0.03,-1.24,-5.00 0/0:0.000:-0.18,-0.47,-3.08 0/0:0.000:-0.06,-0.90,-5.00 0/0:0.000:-0.01,-1.48,-5.00 0/0:0.000:-0.48,-0.48,-0.48 0/0:0.000:-0.16,-0.51,-2.63 0/0:0.000:-0.01,-1.76,-5.00 0/0:0.000:-0.10,-0.67,-5.00 0/0:0.000:-0.48,-0.48,-0.48 0/0:0.000:-0.11,-0.66,-4.40 0/0:0.000:-0.10,-0.69,-4.70 0/0:0.000:-0.48,-0.48,-0.48 0/0:0.000:-0.18,-0.47,-2.30 0/0:0.000:-0.00,-2.80,-5.00 0/0:0.000:-0.00,-2.02,-5.00 0/0:0.000:-0.48,-0.48,-0.48 0/0:0.000:-0.01,-1.49,-5.00 0/0:0.000:-0.01,-1.76,-5.00 0/0:0.000:-0.48,-0.48,-0.48 0/0:0.000:-0.48,-0.48,-0.48 0/0:0.000:-0.02,-1.40,-5.00 0/0:0.000:-0.00,-2.09,-5.00 0/0:0.000:-0.10,-0.70,-4.10 0/0:0.000:-0.22,-0.46,-1.27 0/0:0.000:-0.18,-0.48,-2.39 0/0:0.000:-0.48,-0.48,-0.48 0/0:0.000:-0.19,-0.47,-2.27 0/0:0.000:-0.07,-0.85,-5.00 0/0:0.000:-0.00,-2.53,-5.00 0/0:0.000:-0.00,-2.83,-5.00 0/0:0.000:-0.22,-0.46,-1.24 0/0:0.000:-0.19,-0.46,-2.27 0/0:0.000:-0.10,-0.68,-4.40 0/0:0.000:-0.09,-0.73,-4.22 0/0:0.000:-0.03,-1.19,-5.00 0/0:0.000:-0.15,-0.55,-2.64 0/0:0.000:-0.05,-0.97,-5.00 0/0:0.000:-0.08,-0.76,-4.70 0/0:0.000:-0.01,-1.49,-5.00 0/0:0.000:-0.06,-0.86,-5.00 0/0:0.000:-0.029681,-1.18006,-5 0/0:0.000:-0.48,-0.48,-0.48 0/0:0.000:-0.18,-0.48,-2.33 0/0:0.000:-0.10,-0.70,-4.70 0/0:0.000:-0.00,-2.04,-5.00 0/0:0.000:-0.03,-1.24,-5.00 0/0:0.000:-0.10,-0.69,-4.40 0/0:0.000:-0.0843997,-0.755377,-3.00877 0/0:0.000:-0.21,-0.47,-1.33 0/0:0.000:-0.00,-3.22,-5.00 0/0:0.000:-0.00,-3.70,-5.00 0/0:0.000:-0.05,-0.96,-5.00 0/0:0.000:-0.03,-1.21,-5.00 0/0:0.000:-0.01,-1.47,-5.00 0/0:0.000:-0.01,-1.86,-5.00 0/0:0.000:-0.05,-0.97,-5.00 0/0:0.000:-0.06,-0.90,-5.00 0/0:0.000:-0.01,-1.51,-5.00 0/0:0.000:-0.03,-1.25,-5.00 0/0:0.000:-0.01,-1.92,-5.00 0/0:0.000:-0.00,-2.58,-5.00 0/0:0.000:-0.06,-0.89,-5.00 0/0:0.000:-0.05,-1.00,-5.00 0/0:0.000:-0.05,-0.95,-5.00 0/0:0.000:-0.01,-1.72,-5.00 0/0:0.000:-0.00,-2.25,-5.00 0/0:0.000:-0.02,-1.43,-5.00 0/0:0.000:-0.18,-0.48,-2.03 
0/0:0.000:-0.18,-0.47,-2.72 0/0:0.000:-0.09,-0.72,-4.70 0/0:0.000:-0.18,-0.47,-2.42 0/0:0.000:-0.19,-0.46,-2.20 0/0:0.000:-0.24,-0.44,-1.19 0/0:0.000:-0.01,-1.75,-5.00 0/0:0.000:-0.01,-1.48,-5.00 0/0:0.000:-0.19,-0.46,-2.27 0/0:0.000:-0.05,-0.99,-5.00 0/0:0.000:-0.06,-0.87,-5.00 0/0:0.000:-0.00,-3.40,-5.00 0/0:0.000:-0.10,-0.68,-4.70 0/0:0.000:-0.01,-1.72,-5.00 0/0:0.000:-0.0148341,-1.47392,-5 0/0:0.000:-0.05,-0.95,-5.00 0/0:0.000:-0.03,-1.16,-5.00 0/0:0.000:-0.04,-1.03,-5.00 0/0:0.000:-0.02,-1.46,-5.00 0/0:0.000:-0.09,-0.73,-4.40 0/0:0.000:-0.02,-1.46,-5.00 0/0:0.000:-0.02,-1.26,-5.00 0/0:0.000:-0.03,-1.25,-5.00 0/0:0.000:-0.19,-0.45,-2.12 0/0:0.000:-0.01,-1.50,-5.00 0/0:0.000:-0.05,-0.96,-5.00 0/0:0.000:-0.00,-4.70,-5.00 0/0:0.000:-0.13,-0.58,-3.06 0/0:0.000:-0.48,-0.48,-0.48 0/0:0.000:-0.17,-0.51,-2.09 0/0:0.000:-0.00,-2.29,-5.00 0/0:0.000:-0.38,-0.43,-0.67 0/0:0.000:-0.00,-2.81,-5.00 0/0:0.000:-0.00,-3.25,-5.00 0/0:0.000:-0.22,-0.46,-1.26 0/0:0.000:-0.03,-1.17,-5.00 0/0:0.000:-0.03,-1.21,-5.00 0/0:0.000:-0.01,-1.50,-5.00 0/0:0.000:-0.01,-1.54,-5.00 0/0:0.000:-0.48,-0.48,-0.48 0/0:0.000:-0.00,-2.04,-5.00 0/0:0.000:-0.05,-0.93,-5.00 0/0:0.000:-0.10,-0.68,-3.70 0/0:0.000:-0.09,-0.72,-5.00 0/0:0.000:-0.01,-1.77,-5.00 0/0:0.000:-0.05,-0.93,-5.00 0/0:0.000:-0.01,-1.51,-5.00 0/0:0.000:-0.16,-0.51,-2.74 0/0:0.000:-0.10,-0.69,-4.40 0/0:0.000:-0.18,-0.48,-2.26 0/0:0.000:-0.18,-0.48,-2.37 0/0:0.000:-0.18,-0.48,-2.27 0/0:0.000:-0.00,-2.58,-5.00 0/0:0.000:-0.05,-0.93,-5.00 0/0:0.000:-0.03,-1.19,-5.00 0/0:0.000:-0.00,-2.34,-5.00 0/0:0.000:-0.03,-1.23,-5.00 0/0:0.000:-0.10,-0.69,-4.40 0/0:0.000:-0.08,-0.78,-4.70 0/0:0.000:-0.05,-0.95,-5.00 0/0:0.000:-0.03,-1.22,-5.00 0/0:0.000:-0.18,-0.48,-2.36 0/0:0.000:-0.01,-1.53,-5.00 0/0:0.000:-0.18,-0.48,-2.25 0/0:0.000:-0.10,-0.68,-4.70 0/0:0.000:-0.09,-0.73,-5.00 0/0:0.000:-0.02,-1.41,-5.00 0/0:0.000:-0.05,-0.93,-5.00 0/0:0.000:-0.03,-1.22,-5.00 0/0:0.000:-0.18,-0.47,-2.41 0/0:0.000:-0.09,-0.73,-4.40 0/0:0.000:-0.00,-2.00,-5.00 
0/0:0.000:-0.19,-0.46,-2.19 0/0:0.000:-0.03,-1.20,-5.00 0/0:0.000:-0.05,-0.98,-5.00 0/0:0.000:-0.18,-0.47,-2.31 0/0:0.000:-0.09,-0.73,-4.10 0/0:0.000:-0.48,-0.48,-0.48 0/0:0.000:-0.10,-0.67,-3.66 0/0:0.000:-0.12,-0.63,-4.70 0/0:0.000:-0.16,-0.51,-2.28 0/0:0.000:-0.02,-1.46,-5.00 0/0:0.000:-0.01,-1.75,-5.00 0/0:0.000:-0.05,-0.95,-5.00 0/0:0.000:-0.10,-0.68,-4.22 0/0:0.000:-0.05,-0.95,-5.00 0/0:0.000:-0.12,-0.62,-3.08 0/0:0.000:-0.19,-0.45,-2.25 0/0:0.000:-0.01,-1.77,-5.00 0/0:0.000:-0.13,-0.60,-2.15 0/0:0.000:-0.48,-0.48,-0.48 0/0:0.000:-0.06,-0.92,-5.00 0/0:0.000:-0.04,-1.05,-5.00 0/0:0.000:-0.19,-0.46,-2.30 0/0:0.000:-0.19,-0.46,-2.28 0/0:0.000:-0.01,-1.48,-5.00 0/0:0.000:-0.190265,-0.457324,-2.2321 0/0:0.000:-0.23,-0.46,-1.21 0/0:0.000:-0.01,-1.50,-5.00 0/0:0.000:-0.00,-2.44,-5.00 0/0:0.000:-0.01,-1.73,-5.00 0/0:0.000:-0.01,-1.53,-5.00 0/0:0.000:-0.05,-0.96,-5.00 0/0:0.000:-0.05,-0.94,-5.00 0/0:0.000:-0.19,-0.46,-2.62 0/0:0.000:-0.00,-2.25,-5.00 0/0:0.000:-0.19,-0.46,-2.23 0/0:0.000:-0.11,-0.65,-4.70 0/0:0.000:-0.18,-0.46,-2.68 0/0:0.000:-0.11,-0.66,-4.40 0/0:0.000:-0.19,-0.46,-2.43 0/0:0.000:-0.18,-0.49,-2.00 0/0:0.000:-0.06,-0.92,-5.00 0/0:0.000:-0.20,-0.45,-2.05 0/0:0.000:-0.48,-0.48,-0.48 0/0:0.000:-0.03,-1.18,-5.00 0/0:0.000:-0.10,-0.68,-4.00 0/0:0.000:-0.03,-1.20,-5.00 0/0:0.000:-0.10,-0.69,-4.22 0/0:0.000:-0.11,-0.65,-4.40 0/0:0.000:-0.11,-0.67,-3.85 22 20000199 rs183293480 A C 100 PASS LDAF=0.0009;THETA=0.0004;AN=2184;AVGPOST=0.9990;VT=SNP;AA=A;RSQ=0.6274;SNPSOURCE=LOWCOV;AC=1;ERATE=0.0003;AF=0.0005;EUR_AF=0.0013 GT:DS:GL 0/0:0.000:-0.00,-2.04,-5.00 0/0:0.000:-0.07,-0.82,-3.47 0/0:0.000:-0.07,-0.83,-5.00 0/0:0.000:-0.03,-1.12,-5.00 0/0:0.000:-0.11,-0.64,-4.10 0/0:0.000:-0.12,-0.62,-3.85 0/0:0.000:-0.01,-1.47,-5.00 0/0:0.000:-0.01,-1.54,-5.00 0/0:0.000:-0.10,-0.70,-4.70 0/0:0.000:-0.03,-1.18,-5.00 0/0:0.000:-0.16,-0.50,-3.30 0/0:0.000:-0.48,-0.48,-0.48 0/0:0.000:-0.03,-1.20,-5.00 0/0:0.000:-0.10,-0.70,-5.00 0/0:0.000:-0.19,-0.46,-2.46 
0/0:0.000:-0.16,-0.51,-2.67 0/0:0.000:-0.00,-2.57,-5.00 0/0:0.000:-0.00,-2.85,-5.00 0/0:0.000:-0.48,-0.48,-0.48 0/0:0.000:-0.00,-2.55,-5.00 0/0:0.000:-0.00,-2.02,-5.00 0/0:0.050:-0.48,-0.48,-0.48 0/0:0.000:-0.23,-0.46,-1.24 0/0:0.000:-0.02,-1.45,-5.00 0/0:0.000:-0.10,-0.68,-4.40 0/0:0.000:-0.06,-0.88,-5.00 0/0:0.000:-0.13,-0.61,-2.23 0/0:0.000:-0.05,-0.94,-5.00 0/0:0.000:-0.26,-0.43,-1.11 0/0:0.000:-0.04,-1.01,-5.00 0/0:0.000:-0.12,-0.62,-3.28 0/0:0.000:-0.00,-2.27,-5.00 0/0:0.000:-0.00,-2.42,-5.00 0/0:0.000:-0.04,-1.08,-5.00 0/0:0.000:-0.06,-0.88,-5.00 0/0:0.000:-0.04,-1.03,-5.00 0/0:0.000:-0.10,-0.70,-4.10 0/0:0.000:-0.18,-0.48,-2.22 0/0:0.000:-0.02,-1.41,-5.00 0/0:0.000:-0.03,-1.19,-5.00 0/0:0.000:-0.14,-0.57,-2.12 0/0:0.000:-0.10,-0.69,-4.70 0/0:0.000:-0.12,-0.62,-2.28 0/0:0.000:-0.0149779,-1.4698,-5 0/0:0.000:-0.10,-0.70,-4.70 0/0:0.000:-0.06,-0.89,-5.00 0/0:0.000:-0.39,-0.23,-2.25 0/0:0.000:-0.00,-3.40,-5.00 0/0:0.000:-0.07,-0.86,-4.40 0/0:0.000:-0.18,-0.48,-2.35 0/0:0.000:-0.00967891,-1.65679,-5 0/0:0.000:-0.22,-0.46,-1.24 0/0:0.000:-0.00,-2.27,-5.00 0/0:0.000:-0.00,-3.12,-5.00 0/0:0.000:-0.01,-1.73,-5.00 0/0:0.000:-0.02,-1.39,-5.00 0/0:0.000:-0.10,-0.70,-4.70 0/0:0.000:-0.03,-1.23,-5.00 0/0:0.000:-0.09,-0.72,-4.10 0/0:0.000:-0.10,-0.71,-4.70 0/0:0.000:-0.02,-1.45,-5.00 0/0:0.000:-0.04,-1.11,-5.00 0/0:0.000:-0.03,-1.14,-5.00 0/0:0.000:-0.00,-2.42,-5.00 0/0:0.000:-0.05,-0.94,-5.00 0/0:0.000:-0.05,-0.95,-5.00 0/0:0.000:-0.11,-0.65,-3.85 0/0:0.000:-0.05,-0.95,-5.00 0/0:0.000:-0.01,-1.72,-5.00 0/0:0.000:-0.00,-2.08,-5.00 0/0:0.000:-0.03,-1.20,-5.00 0/0:0.000:-0.05,-0.94,-5.00 0/0:0.000:-0.16,-0.52,-2.11 0/0:0.000:-0.10,-0.69,-4.70 0/0:0.000:-0.44,-0.46,-0.54 0/0:0.000:-0.07,-0.83,-5.00 0/0:0.000:-0.16,-0.51,-2.42 0/0:0.000:-0.18,-0.48,-2.30 0/0:0.000:-0.19,-0.46,-2.27 0/0:0.000:-0.06,-0.89,-5.00 0/0:0.000:-0.06,-0.88,-5.00 0/0:0.000:-0.00,-3.15,-5.00 0/0:0.000:-0.12,-0.61,-4.10 0/0:0.000:-0.06,-0.91,-5.00 0/0:0.000:-0.00935928,-1.67121,-5 
0/0:0.000:-0.18,-0.48,-2.13 0/0:0.000:-0.07,-0.85,-5.00 0/0:0.000:-0.03,-1.19,-5.00 0/0:0.000:-0.03,-1.15,-5.00 0/0:0.000:-0.09,-0.71,-4.70 0/0:0.000:-0.19,-0.46,-2.52 0/0:0.000:-0.03,-1.20,-5.00 0/0:0.000:-0.09,-0.73,-4.10 0/0:0.000:-0.11,-0.64,-4.40 0/0:0.000:-0.03,-1.21,-5.00 0/0:0.000:-0.19,-0.47,-1.83 0/0:0.000:-0.00,-2.76,-5.00 0/0:0.000:-0.09,-0.73,-3.92 0/0:0.000:-0.15,-0.55,-2.43 0/0:0.000:-0.18,-0.48,-1.89 0/0:0.000:-0.00,-3.18,-5.00 0/0:0.000:-0.48,-0.48,-0.48 0/0:0.000:-0.02,-1.43,-5.00 0/0:0.000:-0.22,-0.40,-5.00 0/0:0.000:-0.07,-0.84,-4.70 0/0:0.000:-0.03,-1.22,-5.00 0/0:0.000:-0.10,-0.69,-3.85 0/0:0.000:-0.01,-1.84,-5.00 0/0:0.000:-0.00,-3.07,-5.00 0/0:0.000:-0.10,-0.69,-3.85 0/0:0.000:-0.06,-0.91,-5.00 0/0:0.000:-0.10,-0.68,-4.70 0/0:0.000:-0.10,-0.69,-3.80 0/0:0.000:-0.48,-0.18,-4.10 0/0:0.000:-0.02,-1.47,-5.00 0/0:0.000:-0.05,-0.94,-5.00 0/0:0.000:-0.03,-1.22,-5.00 0/0:0.000:-0.18,-0.48,-2.34 0/0:0.000:-0.05,-0.95,-5.00 0/0:0.000:-0.48,-0.48,-0.48 0/0:0.000:-0.18,-0.48,-2.37 0/0:0.000:-0.10,-0.70,-4.40 0/0:0.000:-0.00,-2.44,-5.00 0/0:0.000:-0.01,-1.78,-5.00 0/0:0.000:-0.00,-2.06,-5.00 0/0:0.000:-0.00,-2.28,-5.00 0/0:0.000:-0.06,-0.92,-5.00 0/0:0.000:-0.03,-1.24,-5.00 0/0:0.000:-0.01,-1.51,-5.00 0/0:0.000:-0.05,-0.95,-5.00 0/0:0.000:-0.02,-1.46,-5.00 0/0:0.000:-0.05,-0.94,-5.00 0/0:0.000:-0.11,-0.64,-4.70 0/0:0.000:-0.19,-0.46,-2.54 0/0:0.000:-0.02,-1.38,-5.00 0/0:0.000:-0.48,-0.48,-0.48 0/0:0.000:-0.06,-0.92,-5.00 0/0:0.000:-0.48,-0.48,-0.48 0/0:0.000:-0.02,-1.26,-5.00 0/0:0.000:-0.18,-0.47,-2.47 0/0:0.000:-0.16,-0.51,-2.43 0/0:0.000:-0.01,-1.73,-5.00 0/0:0.000:-0.19,-0.46,-2.79 0/0:0.000:-0.06,-0.91,-5.00 0/0:0.000:-0.03,-1.20,-5.00 0/0:0.000:-0.18,-0.48,-2.35 0/0:0.000:-0.03,-1.21,-5.00 0/0:0.000:-0.05,-0.93,-5.00 0/0:0.000:-0.03,-1.18,-5.00 0/0:0.000:-0.19,-0.46,-2.72 0/0:0.000:-0.02,-1.45,-5.00 0/0:0.000:-0.16,-0.51,-2.71 0/0:0.000:-0.01,-1.80,-5.00 0/0:0.000:-0.10,-0.69,-3.66 0/0:0.000:-0.10,-0.69,-4.22 0/0:0.000:-0.23,-0.44,-1.26 
0/0:0.000:-0.19,-0.46,-2.12 0/0:0.000:-0.04,-1.07,-5.00 0/0:0.000:-0.03,-1.23,-5.00 0/0:0.000:-0.48,-0.48,-0.48 0/0:0.000:-0.19,-0.46,-2.20 0/0:0.000:-0.05,-0.95,-5.00 0/0:0.000:-0.11,-0.65,-3.44 0/0:0.000:-0.05,-1.00,-5.00 0/0:0.000:-0.03,-1.17,-5.00 0/0:0.000:-0.01,-1.77,-5.00 0/0:0.000:-0.189687,-0.458221,-2.2426 0/0:0.000:-0.01,-1.79,-5.00 0/0:0.050:-0.24,-0.42,-1.39 0/0:0.000:-0.19,-0.45,-4.70 0/0:0.000:-0.00,-2.15,-5.00 0/0:0.000:-0.05,-0.97,-5.00 0/0:0.000:-0.18,-0.48,-2.23 0/0:0.000:-0.00,-2.11,-5.00 0/0:0.000:-0.11,-0.66,-4.70 0/0:0.000:-0.01,-1.59,-5.00 0/0:0.000:-0.48,-0.48,-0.48 0/0:0.000:-0.18,-0.46,-2.63 0/0:0.000:-0.11,-0.66,-5.00 0/0:0.000:-0.48,-0.48,-0.48 0/0:0.000:-0.19,-0.46,-2.43 0/0:0.000:-0.19,-0.47,-1.84 0/0:0.000:-0.48,-0.48,-0.48 0/0:0.000:-0.19,-0.46,-2.21 0/0:0.000:-0.48,-0.48,-0.48 0/0:0.000:-0.01,-1.82,-5.00 0/0:0.000:-0.05,-0.99,-5.00 0/0:0.000:-0.05,-0.94,-5.00 0/0:0.000:-0.03,-1.22,-5.00 0/0:0.000:-0.00,-2.26,-5.00 0/0:0.000:-0.10,-0.69,-4.00 22 20000291 rs185807825 G T 100 PASS ERATE=0.0005;AVGPOST=0.9983;AA=G;AN=2184;LDAF=0.0015;VT=SNP;SNPSOURCE=LOWCOV;RSQ=0.5564;AC=2;THETA=0.0003;AF=0.0009;ASN_AF=0.0035 GT:DS:GL 0/0:0.000:-0.00,-2.06,-5.00 0/0:0.000:-0.07,-0.83,-5.00 0/0:0.000:-0.02,-1.27,-5.00 0/0:0.000:-0.01,-1.77,-5.00 0/0:0.000:-0.19,-0.45,-2.14 0/0:0.000:-0.11,-0.66,-5.00 0/0:0.000:-0.03,-1.17,-5.00 0/0:0.000:-0.02,-1.33,-5.00 0/0:0.000:-0.18,-0.48,-2.43 0/0:0.000:-0.05,-0.93,-5.00 0/0:0.000:-0.18,-0.46,-2.76 0/0:0.000:-0.48,-0.48,-0.48 0/0:0.000:-0.03,-1.15,-5.00 0/0:0.000:-0.07,-0.83,-4.40 0/0:0.000:-0.02,-1.33,-5.00 0/0:0.000:-0.10,-0.68,-4.40 0/0:0.000:-0.00,-2.28,-5.00 0/0:0.000:-0.01,-1.83,-5.00 0/0:0.000:-0.13,-0.61,-2.24 0/0:0.000:-0.00,-3.70,-5.00 0/0:0.000:-0.00,-2.21,-5.00 0/0:0.000:-0.48,-0.48,-0.48 0/0:0.000:-0.22,-0.46,-1.25 0/0:0.000:-0.00,-2.81,-5.00 0/0:0.000:-0.02,-1.44,-5.00 0/0:0.000:-0.06,-0.92,-5.00 0/0:0.000:-0.08,-0.79,-3.21 0/0:0.000:-0.01,-1.72,-5.00 0/0:0.000:-0.21,-0.46,-1.42 
0/0:0.000:-0.08,-0.79,-4.40 0/0:0.000:-0.04,-1.06,-5.00 0/0:0.000:-0.02,-1.42,-5.00 0/0:0.000:-0.00,-3.85,-5.00 0/0:0.000:-0.07,-0.83,-5.00 0/0:0.000:-0.01,-1.80,-5.00 0/0:0.000:-0.11,-0.66,-4.40 0/0:0.000:-0.01,-1.48,-5.00 0/0:0.000:-0.01,-1.47,-5.00 0/0:0.000:-0.05,-0.94,-5.00 0/0:0.000:-0.19,-0.46,-2.54 0/0:0.000:-0.03,-1.22,-5.00 0/0:0.000:-0.05,-1.00,-5.00 0/0:0.000:-0.13,-0.60,-2.19 0/0:0.000:-0.0021071,-2.31515,-5 0/0:0.000:-0.10,-0.68,-4.40 0/0:0.000:-0.10,-0.68,-4.22 0/0:0.000:-0.05,-0.93,-5.00 0/0:0.000:-0.00,-3.38,-5.00 0/0:0.000:-0.06,-0.90,-5.00 0/0:0.000:-0.20,-0.45,-1.99 0/0:0.000:-0.0193877,-1.35992,-5 0/0:0.000:-0.02,-1.28,-5.00 0/0:0.000:-0.00,-2.46,-5.00 0/0:0.000:-0.00,-2.75,-5.00 0/0:0.000:-0.18,-0.47,-2.36 0/0:0.000:-0.10,-0.68,-4.00 0/0:0.000:-0.02,-1.34,-5.00 0/0:0.000:-0.00,-3.27,-5.00 0/0:0.250:-0.03,-1.19,-5.00 0/0:0.000:-0.07,-0.84,-5.00 0/0:0.000:-0.01,-1.52,-5.00 0/0:0.000:-0.11,-0.65,-3.40 0/0:0.000:-0.10,-0.70,-4.70 0/0:0.000:-0.03,-1.19,-5.00 0/0:0.000:-0.03,-1.23,-5.00 0/0:0.000:-0.06,-0.92,-5.00 0/0:0.000:-0.00,-1.99,-5.00 0/0:0.000:-0.01,-1.61,-5.00 0/0:0.000:-0.01,-1.48,-5.00 0/0:0.000:-0.10,-0.69,-4.22 0/0:0.000:-0.01,-1.49,-5.00 0/0:0.000:-0.01,-1.59,-5.00 0/0:0.000:-0.10,-0.69,-3.36 0/0:0.000:-0.01,-1.48,-5.00 0/0:0.000:-0.08,-0.77,-4.10 0/0:0.000:-0.05,-1.00,-5.00 0/0:0.000:-0.10,-0.67,-3.80 0/0:0.000:-0.03,-1.19,-5.00 0/0:0.000:-0.06,-0.86,-5.00 0/0:0.000:-0.06,-0.92,-5.00 0/0:0.000:-0.11,-0.66,-4.70 0/0:0.000:-0.00,-2.68,-5.00 0/0:0.000:-0.05,-0.93,-5.00 0/0:0.000:-0.01,-1.84,-5.00 0/0:0.000:-0.0293556,-1.18469,-5 0/0:0.000:-0.48,-0.48,-0.48 0/0:0.000:-0.19,-0.46,-2.37 0/0:0.000:-0.10,-0.68,-4.10 0/0:0.000:-0.05,-0.97,-5.00 0/0:0.000:-0.03,-1.20,-5.00 0/0:0.000:-0.06,-0.91,-5.00 0/0:0.000:-0.18,-0.48,-2.06 0/0:0.000:-0.06,-0.87,-5.00 0/0:0.000:-0.03,-1.13,-5.00 0/0:0.000:-0.03,-1.24,-5.00 0/0:0.000:-0.05,-0.94,-5.00 0/0:0.000:-0.00,-3.85,-5.00 0/0:0.000:-0.48,-0.48,-0.48 0/0:0.000:-0.07,-0.83,-4.40 
0/0:0.000:-0.48,-0.48,-0.48 0/0:0.000:-0.00,-2.82,-5.00 0/0:0.000:-0.08,-0.76,-3.80 0/0:0.000:-0.00,-3.47,-5.00 0/0:0.000:-0.03,-1.18,-5.00 0/0:0.000:-0.11,-0.65,-3.59 0/0:0.000:-0.03,-1.20,-5.00 0/0:0.000:-0.02,-1.43,-5.00 0/0:0.000:-0.00,-2.73,-5.00 0/0:0.000:-0.06,-0.86,-4.22 0/0:0.000:-0.14,-0.57,-2.84 0/0:0.000:-0.01,-1.50,-5.00 0/0:0.000:-0.02,-1.45,-5.00 0/0:0.000:-0.01,-1.76,-5.00 0/0:0.000:-0.02,-1.46,-5.00 0/0:0.000:-0.01,-1.79,-5.00 0/0:0.000:-0.06,-0.90,-5.00 0/0:0.000:-0.01,-1.87,-5.00 0/0:0.000:-0.01,-1.50,-5.00 0/0:0.000:-0.03,-1.21,-5.00 0/0:0.000:-0.08,-0.79,-5.00 0/0:0.000:-0.06,-0.91,-5.00 0/0:0.000:-0.48,-0.48,-0.48 0/0:0.000:-0.00,-2.89,-5.00 0/0:0.000:-0.02,-1.46,-5.00 0/0:0.000:-0.20,-0.44,-2.07 0/0:0.000:-0.00,-2.65,-5.00 0/0:0.000:-0.02,-1.27,-5.00 0/0:0.000:-0.03,-1.21,-5.00 0/0:0.000:-0.01,-1.79,-5.00 0/0:0.000:-0.05,-0.95,-5.00 0/0:0.000:-0.10,-0.68,-4.70 0/0:0.000:-0.35,-0.41,-0.78 0/0:0.000:-0.09,-0.75,-3.70 0/0:0.000:-0.05,-0.98,-5.00 0/0:0.000:-0.05,-0.97,-5.00 0/0:0.000:-0.05,-0.93,-5.00 0/0:0.000:-0.08,-0.77,-5.00 0/0:0.000:-0.04,-1.03,-5.00 0/0:0.000:-0.02,-1.47,-5.00 0/0:0.000:-0.03,-1.16,-5.00 0/0:0.000:-0.01,-1.57,-5.00 0/0:0.000:-0.06,-0.89,-5.00 0/0:0.000:-0.06,-0.90,-5.00 0/0:0.000:-0.00,-2.30,-5.00 0/0:0.000:-0.05,-0.98,-5.00 0/0:0.000:-0.10,-0.69,-4.40 0/0:0.000:-0.01,-1.50,-5.00 0/0:0.000:-0.06,-0.87,-5.00 0/0:0.000:-0.16,-0.51,-2.37 0/0:0.000:-0.19,-0.46,-2.23 0/0:0.050:-0.27,-0.34,-3.22 0/0:0.000:-0.48,-0.48,-0.48 0/0:0.000:-0.08,-0.77,-4.40 0/0:0.000:-0.05,-0.94,-5.00 0/0:0.000:-0.05,-0.99,-5.00 0/0:0.000:-0.07,-0.82,-5.00 0/0:0.000:-0.12,-0.61,-2.91 0/0:0.000:-0.00,-1.98,-5.00 0/0:0.000:-0.03,-1.21,-5.00 0/0:0.000:-0.07,-0.82,-5.00 0/0:0.000:-0.48,-0.48,-0.48 0/0:0.000:-0.01,-1.49,-5.00 0/0:0.000:-0.12,-0.61,-3.15 0/0:0.000:-0.00,-2.08,-5.00 0/0:0.000:-0.02,-1.33,-5.00 0/0:0.000:-0.01,-1.74,-5.00 0/0:0.050:-0.11227,-0.642561,-4.22185 0/0:0.000:-0.22,-0.46,-1.28 0/0:0.000:-0.10,-0.70,-4.10 0/0:0.000:-0.10,-0.67,-3.62 
0/0:0.000:-0.03,-1.18,-5.00 0/0:0.000:-0.03,-1.24,-5.00 0/0:0.000:-0.10,-0.70,-4.10 0/0:0.000:-0.01,-1.80,-5.00 0/0:0.000:-0.48,-0.48,-0.48 0/0:0.000:-0.02,-1.47,-5.00 0/0:0.000:-0.06,-0.88,-5.00 0/0:0.000:-0.03,-1.13,-5.00 0/0:0.000:-0.48,-0.48,-0.48 0/0:0.000:-0.18,-0.46,-2.69 0/0:0.000:-0.03,-1.13,-5.00 0/0:0.000:-0.48,-0.48,-0.48 0/0:0.000:-0.18,-0.48,-2.41 0/0:0.000:-0.18,-0.47,-2.85 0/0:0.050:-0.48,-0.48,-0.48 0/0:0.000:-0.08,-0.76,-4.70 0/0:0.000:-0.10,-0.69,-4.00 0/0:0.000:-0.11,-0.65,-3.62 0/0:0.000:-0.48,-0.48,-0.48 0/0:0.000:-0.01,-1.70,-5.00 0/0:0.000:-0.00,-2.57,-5.00 22 20000428 rs55902548 G T 100 PASS AC=323;AVGPOST=0.9983;AA=G;AN=2184;VT=SNP;RSQ=0.9949;LDAF=0.1473;SNPSOURCE=LOWCOV;ERATE=0.0003;THETA=0.0003;AF=0.15;ASN_AF=0.0017;AMR_AF=0.15;AFR_AF=0.31;EUR_AF=0.15 GT:DS:GL 1/0:1.000:-5.00,0.00,-5.00 0/0:0.000:-0.35,-0.43,-0.73 0/1:1.000:-1.81,-0.01,-2.95 0/0:0.000:-0.01,-1.79,-5.00 0/0:0.000:-0.06,-0.86,-5.00 1/0:1.000:-0.19,-0.46,-2.18 0/0:0.000:-0.10,-0.68,-5.00 0/1:1.000:-4.40,-0.03,-1.12 0/1:1.000:-5.00,-0.69,-0.10 0/0:0.000:-0.10,-0.69,-4.70 0/0:0.000:-0.48,-0.48,-0.48 0/1:1.000:-5.00,-0.01,-1.77 0/0:0.000:-0.18,-0.48,-2.57 0/0:0.000:-0.02,-1.31,-5.00 0/0:0.000:-0.11,-0.65,-4.70 0/0:0.000:-0.10,-0.68,-4.70 0/0:0.000:-0.01,-1.72,-5.00 1/0:1.000:-5.00,0.00,-5.00 1/0:1.000:-1.38,-0.02,-2.61 0/1:1.000:-5.00,-1.40,-0.02 0/0:0.000:-0.00,-2.97,-5.00 0/0:0.000:-0.19,-0.47,-2.15 0/0:0.000:-0.44,-0.46,-0.54 0/0:0.000:-0.00,-2.52,-5.00 0/0:0.000:-0.05,-0.93,-5.00 0/0:0.000:-0.01,-1.77,-5.00 0/1:0.750:-0.22,-0.46,-1.26 0/0:0.000:-0.03,-1.19,-5.00 0/0:0.000:-0.21,-0.46,-1.42 0/0:0.000:-0.19,-0.45,-2.20 0/0:0.000:-0.12,-0.63,-3.36 0/0:0.000:-0.00,-2.54,-5.00 0/0:0.000:-0.00,-2.88,-5.00 1/0:1.000:-2.10,-0.00,-4.00 1/0:1.250:-5.00,-1.62,-0.01 0/1:1.000:-3.06,-0.47,-0.18 0/1:1.000:-5.00,-0.87,-0.06 1/1:2.000:-5.00,-0.84,-0.07 0/0:0.000:-0.05,-0.95,-5.00 0/0:0.000:-0.02,-1.46,-5.00 0/0:0.000:-0.14,-0.58,-2.17 0/0:0.000:-0.01,-1.77,-5.00 
0/0:0.000:-0.06,-0.91,-5.00 1/0:1.000:-5,0,-5 0/0:0.000:-0.05,-0.95,-5.00 1/0:1.000:-4.70,-0.70,-0.10 0/0:0.000:-0.09,-0.72,-4.70 0/0:0.000:-0.01,-1.54,-5.00 1/1:2.000:-5.00,-0.68,-0.10 0/0:0.000:-0.18,-0.48,-2.33 0/0:0.000:-0.00465443,-1.97224,-5 0/0:0.000:-0.48,-0.48,-0.48 1/0:1.000:-5.00,-0.93,-0.05 0/0:0.000:-0.00,-2.82,-5.00 1/1:2.000:-5.00,-1.67,-0.01 0/0:0.000:-0.05,-0.95,-5.00 1/0:1.000:-5.00,-0.87,-0.06 0/0:0.000:-0.01,-1.78,-5.00 0/0:0.000:-0.01,-1.83,-5.00 0/1:1.000:-5.00,-0.00,-3.70 0/1:0.950:-0.15,-0.53,-2.39 0/0:0.000:-0.09,-0.71,-4.70 0/0:0.000:-0.18,-0.48,-2.56 0/0:0.000:-0.00,-2.60,-5.00 0/0:0.000:-0.01,-1.49,-5.00 0/0:0.000:-0.48,-0.48,-0.48 0/0:0.000:-0.04,-1.09,-5.00 0/0:0.000:-0.00,-2.40,-5.00 0/0:0.000:-0.06,-0.92,-5.00 0/0:0.000:-0.01,-1.54,-5.00 0/0:0.000:-0.02,-1.45,-5.00 0/0:0.000:-0.01,-1.84,-5.00 0/1:1.000:-2.39,-0.30,-0.31 0/0:0.000:-0.03,-1.20,-5.00 0/0:0.000:-0.11,-0.64,-3.44 0/0:0.000:-0.02,-1.42,-5.00 0/0:0.000:-0.01,-1.76,-5.00 0/0:0.000:-0.03,-1.17,-5.00 0/1:1.000:-5.00,-0.03,-1.18 0/0:0.000:-0.03,-1.21,-5.00 0/0:0.000:-0.10,-0.70,-5.00 0/1:1.000:-5.00,0.00,-5.00 0/0:0.000:-0.01,-1.58,-5.00 0/1:1.000:-5.00,0.00,-5.00 1/0:1.000:-5,-5.21375e-05,-3.92082 0/0:0.000:-0.05,-0.92,-5.00 0/0:0.000:-0.02,-1.45,-5.00 0/1:1.000:-5.00,-0.00,-4.22 0/0:0.000:-0.09,-0.73,-4.22 0/0:0.000:-0.00,-2.03,-5.00 0/0:0.000:-0.03,-1.20,-5.00 0/0:0.000:-0.03,-1.19,-5.00 0/0:0.000:-0.18,-0.48,-2.03 0/0:0.000:-0.01,-1.59,-5.00 0/0:0.000:-0.37,-0.42,-0.71 0/0:0.000:-0.02,-1.45,-5.00 0/0:0.000:-0.00,-3.36,-5.00 1/0:1.000:-3.85,-0.05,-0.94 1/0:1.000:-5.00,-0.84,-0.07 1/0:1.000:-0.06,-0.90,-5.00 0/0:0.000:-0.00,-2.56,-5.00 0/0:0.000:-0.19,-0.47,-1.73 0/0:0.000:-0.06,-0.90,-5.00 0/0:0.000:-0.00,-3.74,-5.00 0/0:0.000:-0.22,-0.46,-1.26 1/0:1.000:-5.00,-0.00,-3.92 0/0:0.000:-0.10,-0.69,-3.85 0/0:0.000:-0.00,-2.76,-5.00 0/0:0.000:-0.01,-1.70,-5.00 0/0:0.000:-0.02,-1.46,-5.00 0/0:0.000:-0.06,-0.92,-5.00 0/0:0.000:-0.01,-1.73,-5.00 0/0:0.000:-0.05,-1.00,-5.00 
1/0:1.000:-2.32,-0.01,-1.58 0/1:0.950:-0.09,-0.73,-4.70 0/1:1.000:-2.25,-0.02,-1.52 0/0:0.000:-0.01,-1.50,-5.00 0/0:0.000:-0.10,-0.70,-4.40 0/0:0.000:-0.05,-0.99,-5.00 0/0:0.000:-0.03,-1.19,-5.00 0/0:0.000:-0.03,-1.25,-5.00 0/0:0.000:-0.10,-0.67,-4.10 0/1:1.000:-5.00,-0.00,-4.22 0/0:0.000:-0.18,-0.48,-2.02 0/0:0.000:-0.05,-0.94,-5.00 0/0:0.000:-0.00,-4.70,-5.00 0/0:0.000:-0.03,-1.23,-5.00 0/0:0.000:-0.05,-0.93,-5.00 0/0:0.000:-0.03,-1.20,-5.00 1/0:1.000:-0.37,-0.24,-2.87 0/0:0.000:-0.05,-0.94,-5.00 0/0:0.000:-0.18,-0.48,-2.36 0/0:0.000:-0.18,-0.48,-1.83 0/0:0.000:-0.18,-0.48,-2.23 0/0:0.000:-0.01,-1.78,-5.00 0/0:0.000:-0.06,-0.90,-5.00 0/0:0.000:-0.07,-0.82,-5.00 0/0:0.000:-0.10,-0.70,-4.70 0/0:0.000:-0.01,-1.48,-5.00 0/0:0.000:-0.02,-1.46,-5.00 0/0:0.000:-0.18,-0.48,-2.11 0/0:0.000:-0.10,-0.69,-4.70 0/0:0.000:-0.18,-0.47,-2.84 0/0:0.000:-0.03,-1.21,-5.00 1/0:1.000:-4.40,-0.00,-4.70 0/1:0.800:-0.03,-1.20,-5.00 0/0:0.000:-0.05,-0.98,-5.00 0/0:0.000:-0.10,-0.68,-3.70 0/0:0.000:-0.10,-0.67,-5.00 0/0:0.000:-0.48,-0.48,-0.48 0/0:0.000:-0.10,-0.68,-3.59 0/0:0.000:-0.02,-1.26,-5.00 0/0:0.000:-0.48,-0.48,-0.48 0/0:0.000:-0.01,-1.80,-5.00 1/0:1.000:-5.00,-0.66,-0.11 0/0:0.000:-0.12,-0.61,-2.49 1/1:2.000:-5.00,-1.30,-0.02 0/0:0.000:-0.04,-1.07,-5.00 0/0:0.000:-0.18,-0.48,-2.19 0/0:0.000:-0.07,-0.85,-5.00 1/0:1.000:-2.07,-0.04,-1.10 0/1:1.000:-0.09,-0.74,-4.70 0/1:1.000:-5.00,-0.00,-2.47 0/0:0.000:-0.02,-1.26,-5.00 1/0:1.000:-2.25,-0.01,-1.55 0/0:0.000:-0.00,-3.17,-5.00 0/0:0.000:-0.203967,-0.438136,-1.99396 0/0:0.000:-0.12,-0.62,-3.18 0/0:0.000:-0.03,-1.19,-5.00 0/0:0.000:-0.01,-1.66,-5.00 1/0:1.000:-5.00,-0.00,-3.74 1/0:1.000:-1.43,-0.02,-4.70 0/0:0.000:-0.05,-0.96,-5.00 0/0:0.000:-0.09,-0.74,-4.70 0/0:0.000:-0.48,-0.48,-0.48 1/0:1.000:-5.00,-0.18,-0.47 0/0:0.000:-0.17,-0.50,-2.70 0/1:1.000:-5.00,-0.00,-3.80 0/1:1.000:-2.22,-0.01,-1.97 0/1:1.000:-5.00,-0.67,-0.10 0/0:0.000:-0.21,-0.43,-2.01 0/0:0.000:-0.19,-0.47,-1.83 0/0:0.000:-0.10,-0.69,-4.40 0/0:0.000:-0.11,-0.64,-4.10 
0/0:0.000:-0.18,-0.47,-2.80 0/0:0.000:-0.01,-1.47,-5.00 0/0:0.000:-0.22,-0.43,-1.61 0/0:0.000:-0.02,-1.47,-5.00 0/0:0.000:-0.01,-1.70,-5.00 0/0:0.000:-0.05,-0.97,-5.00 0/0:0.000:-0.02,-1.28,-5.00
Like a typical VCF file, it has several meta-information lines, one header line with the column names, and then one line per marker. In this VCF, each genotype entry has the fields GT (genotype), DS (dosage), and GL (genotype likelihood).
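To make the `GT:DS:GL` layout concrete, here is a minimal sketch in plain Julia that splits one per-sample entry into its fields. The helper name `parse_entry` is hypothetical, and turning the GL values into a dosage assumes a flat prior over the three genotypes, which is not necessarily how the DS values in this file were produced.

```julia
# Hypothetical helper: parse one per-sample entry whose FORMAT column is "GT:DS:GL".
function parse_entry(entry::AbstractString)
    gt, ds, gl = split(entry, ':')
    # Count ALT alleles in the hard call; '/' separates unphased and '|' phased alleles.
    alt_count = count(==("1"), split(gt, r"[/|]"))
    # GL holds log10 likelihoods for REF/REF, REF/ALT, and ALT/ALT.
    lik = exp10.(parse.(Float64, split(gl, ',')))
    # ASSUMPTION: flat prior, so the posteriors are the normalized likelihoods.
    post = lik ./ sum(lik)
    gl_dosage = post[2] + 2 * post[3]
    return (alt_count = alt_count, ds = parse(Float64, ds), gl_dosage = gl_dosage)
end

entry = parse_entry("0/1:1.000:-1.81,-0.01,-2.95")
println(entry)
```

The `gl_dosage` field applies the dosage formula from the introduction to the normalized likelihoods, so it is only an approximation of the imputed DS value stored in the file.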
To access number of records and samples (individuals):
records = nrecords("test_vcf.vcf.gz")
1356
samples = nsamples("test_vcf.vcf.gz")
191
Information on samples and variants can be retrieved using the `VariantCallFormat` package.
Sample names can be retrieved by:
using VariantCallFormat
reader = VCF.Reader(openvcf("test_vcf.vcf.gz"))
h = header(reader)
h.sampleID
191-element Vector{String}: "HG00096" "HG00097" "HG00099" "HG00100" "HG00101" "HG00102" "HG00103" "HG00104" "HG00106" "HG00108" "HG00109" "HG00110" "HG00111" ⋮ "HG00383" "HG00384" "HG00403" "HG00404" "HG00406" "HG00407" "HG00418" "HG00419" "HG00421" "HG00422" "HG00427" "HG00428"
Information of each variant is accessible by:
reader = VCF.Reader(openvcf("test_vcf.vcf.gz"))
println("chrom\tposition\tids\t\treference\talternative")
cnt = 0
for record in reader
println("$(VCF.chrom(record))\t$(VCF.pos(record))\t$(try VCF.id(record) catch; ["."] end)\t$(VCF.ref(record))\t\t$(VCF.alt(record))")
cnt += 1
if cnt == 30
break
end
end
chrom position ids reference alternative 22 20000086 ["rs138720731"] T ["C"] 22 20000146 ["rs73387790"] G ["A"] 22 20000199 ["rs183293480"] A ["C"] 22 20000291 ["rs185807825"] G ["T"] 22 20000428 ["rs55902548"] G ["T"] 22 20000683 ["rs142720028"] A ["G"] 22 20000771 ["rs114690707"] A ["C"] 22 20000793 ["rs189842693"] T ["C"] 22 20000810 ["rs147349046"] C ["T"] 22 20000814 ["rs183154520"] T ["C"] 22 20000864 ["rs187930998"] G ["A"] 22 20000882 ["rs148068532"] C ["G"] 22 20000950 ["rs1978233"] T ["G"] 22 20000975 ["rs141800233"] G ["A"] 22 20001001 ["rs192051979"] T ["C"] 22 20001006 ["rs2079702"] G ["A"] 22 20001016 ["rs183256914"] C ["T"] 22 20001157 ["rs150580380"] G ["A"] 22 20001159 ["rs139570132"] C ["T"] 22 20001219 ["rs143369598"] G ["C"] 22 20001333 ["rs5993894"] C ["T"] 22 20001434 ["rs146344141"] C ["T"] 22 20001455 ["rs188666449"] G ["A"] 22 20001521 ["rs139601437"] C ["A"] 22 20001587 ["rs71788814"] CAG ["C"] 22 20001600 ["rs144217522"] T ["A"] 22 20001655 ["rs192606530"] G ["A"] 22 20001822 ["rs111598545"] A ["G"] 22 20002011 ["rs184950746"] C ["T"] 22 20002207 ["rs142461772"] G ["A"]
Each row of a VCF file represents a single variant, so it is natural to parse the genotypes or dosages variant by variant. The function `copy_gt!()` retrieves the genotypes of each variant, and `copy_ds!()` retrieves the dosages of each variant, represented by values in $[0, 2]$.
using VariantCallFormat
using Statistics
using StatsBase
# initialize VCF reader
people, snps = nsamples("test_vcf.vcf.gz"), nrecords("test_vcf.vcf.gz")
reader = VCF.Reader(openvcf("test_vcf.vcf.gz"))
# pre-allocate vector for marker data
g = zeros(Union{Missing, Float64}, people)
sampleID = Vector{String}(undef, people)
rec_chr = Vector{String}(undef, 1)
rec_pos = Vector{Any}(undef, 1)
rec_ids = Vector{Vector{String}}(undef, 1)
rec_ref = Vector{String}(undef, 1)
rec_alt = Vector{Vector{String}}(undef, 1)
for j = 1:30
copy_gt!(g, reader; model = :additive, impute = true,
sampleID=sampleID,
record_chr=rec_chr,
record_pos=rec_pos,
record_ids=rec_ids,
record_ref=rec_ref,
record_alt=rec_alt)
# Do some statistical analysis. Just for simple illustration, we show the mean of the genotype values.
println(mean(g))
end
close(reader)
0.0 0.0 0.0 0.0 0.2931937172774869 0.0 0.0 0.015706806282722512 0.0 0.0 0.005235602094240838 0.0 1.931937172774869 0.0 0.0 1.801047120418848 0.005235602094240838 0.0 0.0 0.015706806282722512 0.0 0.03664921465968586 0.0 0.0 0.0 0.0 0.0 0.04712041884816754 0.0 0.0
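The `impute = true` option in the call above fills missing genotypes with the mean of the non-missing values. As a rough sketch of that idea in plain Julia (the helper `impute_mean` is hypothetical and is not the routine used inside `copy_gt!`):

```julia
using Statistics

# Hypothetical helper: replace missing entries by the mean of the observed ones.
function impute_mean(g::AbstractVector)
    m = mean(skipmissing(g))
    return [ismissing(x) ? m : Float64(x) for x in g]
end

g_obs = [0.0, 1.0, missing, 2.0, 1.0]   # made-up genotypes with one missing call
g_imp = impute_mean(g_obs)
println(g_imp)
```

Mean imputation leaves the variant's mean unchanged, which is why it is a common default for variant-by-variant summaries.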
The keyword arguments:
- `model` defines which genotype model to use. The common choice is `:additive` for 0/1/2 encoding.
- `impute` sets whether to impute missing values with the mean of the nonmissing values.
- `sampleID` stores the list of sample IDs.
- `record_chr`, `record_pos`, `record_ids`, `record_ref`, and `record_alt` store the chromosome, position, identifiers, reference allele, and alternative alleles, respectively.

BED format (`.bed` + `.bim` and `.fam`)

To process variant information rapidly while keeping the file size small, various binary file types have been devised. The BED format is essentially a compact array of two-bit genotype representations. It is the native format of the celebrated large-scale GWAS tool PLINK 1, and it suits high-performance applications that work directly with genotype matrices, such as iterative hard thresholding and admixture analysis. Its major weakness is that it cannot store imputed information when genotypes are uncertain.
Genotype | Plink/SnpArray (hexadecimal byte) | binary | numeric |
---|---|---|---|
A1,A1 | 0x00 | 0b00 | 0 |
missing | 0x01 | 0b01 | 1 |
A1,A2 | 0x02 | 0b10 | 2 |
A2,A2 | 0x03 | 0b11 | 3 |
Variant (`.bim`) and sample (`.fam`) information

SnpArrays.jl

using SnpArrays
d = SnpData("test_bed")
SnpData(people: 191, snps: 1356, snp_info: Row │ chromosome snpid genetic_distance position allele1 allele2 │ String String Float64 Int64 String String ─────┼─────────────────────────────────────────────────────────────────────── 1 │ 22 rs138720731 0.0 20000086 C T 2 │ 22 rs73387790 0.0 20000146 A G 3 │ 22 rs183293480 0.0 20000199 C A 4 │ 22 rs185807825 0.0 20000291 T G 5 │ 22 rs55902548 0.0 20000428 T G 6 │ 22 rs142720028 0.0 20000683 G A …, person_info: Row │ fid iid father mother sex phenotype │ Abstract… Abstract… Abstract… Abstract… Abstract… Abstract… ─────┼────────────────────────────────────────────────────────────────── 1 │ 0 HG00096 0 0 0 -9 2 │ 0 HG00097 0 0 0 -9 3 │ 0 HG00099 0 0 0 -9 4 │ 0 HG00100 0 0 0 -9 5 │ 0 HG00101 0 0 0 -9 6 │ 0 HG00102 0 0 0 -9 …, srcbed: test_bed.bed srcbim: test_bed.bim srcfam: test_bed.fam )
Information of i-th sample is accessible by:
i = 20
d.person_info[i, :]
DataFrameRow (6 columns)
Row | fid | iid | father | mother | sex | phenotype |
---|---|---|---|---|---|---|
| | Abstract… | Abstract… | Abstract… | Abstract… | Abstract… | Abstract… |
20 | 0 | HG00119 | 0 | 0 | 0 | -9 |
Information of j-th variant is accessible by:
j = 7
d.snp_info[j, :]
DataFrameRow (6 columns)
Row | chromosome | snpid | genetic_distance | position | allele1 | allele2 |
---|---|---|---|---|---|---|
| | String | String | Float64 | Int64 | String | String |
7 | 22 | rs114690707 | 0.0 | 20000771 | C | A |
Genotype of sample $i$, variant $j$ is accessible by:
d.snparray[i, j]
0x03
Note that this is the encoding defined by the table above. Converted to the numeric `0`/`1`/`2` encoding, it corresponds to `2`.
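The conversion from the two-bit codes to allele counts can be sketched in plain Julia. The helper names `unpack_byte` and `a2_count` are hypothetical. The byte layout used here (four genotypes per byte, first sample in the lowest-order bits) follows the PLINK 1 file format description and is not stated in this workshop; `SnpArrays.jl` handles all of this internally.

```julia
# Hypothetical helper: unpack the four 2-bit codes stored in one byte,
# assuming the first sample sits in the lowest-order bits.
unpack_byte(b::UInt8) = [(b >> (2k)) & 0x03 for k in 0:3]

# Hypothetical helper: map a 2-bit code to the count of A2 alleles.
function a2_count(code::UInt8)
    code == 0x00 && return 0.0   # A1,A1
    code == 0x02 && return 1.0   # A1,A2
    code == 0x03 && return 2.0   # A2,A2
    return missing               # 0x01 marks a missing genotype
end

codes = unpack_byte(0b11100100)  # one byte holding the codes 0b00, 0b01, 0b10, 0b11
counts = a2_count.(codes)
println(codes, " -> ", counts)
```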
Genotypes of each variant in numeric form can be accessed by:
g = Vector{Float64}(undef, d.people)
cnt = 0
for j in 1:30
copyto!(g, @view(d.snparray[:, j]); impute=true,
model=ADDITIVE_MODEL,
center=false, scale=false)
# Do some statistical analyses
println(mean(g))
end
2.0 2.0 2.0 2.0 1.706806282722513 2.0 2.0 1.9842931937172774 2.0 2.0 1.9947643979057592 2.0 0.06806282722513089 2.0 2.0 0.19895287958115182 1.9947643979057592 2.0 2.0 1.9842931937172774 2.0 1.963350785340314 2.0 2.0 2.0 2.0 2.0 1.9528795811518325 2.0 2.0
Note that the count of the second allele (A2) is encoded, and A2 is often the reference allele. To run analyses based on the alternative allele count, the values have to be reversed (subtracted from `2.0`). This has led to a lot of confusion, and some later projects began to put the reference allele first (most notably, UK Biobank).
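As a small illustration of the reversal, the following sketch flips a vector of made-up A2 counts into counts of the other allele. The data are hypothetical and stand in for the vector `g` filled by `copyto!`.

```julia
using Statistics

# Made-up A2 allele counts for five samples (0/1/2 encoding).
a2_counts = [2.0, 2.0, 1.0, 2.0, 0.0]

# Counts of the other allele (A1): subtract from 2.0 elementwise.
a1_counts = 2.0 .- a2_counts

println(mean(a2_counts), " ", mean(a1_counts))
```

The two means always add up to 2. For the fifth variant above, 2 - 1.706806282722513 ≈ 0.2931937, which agrees with the mean obtained from the VCF file.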
BGEN format (`.bgen` + optional `.sample`)

The BGEN format is native to Oxford statistical genetics tools such as IMPUTE2 and SNPTEST. It employs a variant-by-variant compression scheme that is well suited to GWAS applications. The UK Biobank imputed data is distributed in this format. The supported compression algorithms are Gzip (used in `.gz` files) and Zstandard (used in `.zst` files).

BGEN.jl

using BGEN
b = Bgen("test_bgen.bgen")
WARNING: using BGEN.samples in module Main conflicts with an existing identifier.
Bgen(IOStream(<file test_bgen.bgen>), 0x0000000000020c71, BGEN.Header(0x000006d3, 0x00000014, 0x0000054c, 0x000000bf, 0x02, 0x02, false), ["1", "2", "3", "4", "5", "6", "7", "8", "9", "10" … "182", "183", "184", "185", "186", "187", "188", "189", "190", "191"], nothing)
Sample names are accessible by:
BGEN.samples(b)
191-element Vector{String}: "1" "2" "3" "4" "5" "6" "7" "8" "9" "10" "11" "12" "13" ⋮ "180" "181" "182" "183" "184" "185" "186" "187" "188" "189" "190" "191"
Because the way to iterate over the variants in a BGEN file depends on whether an index file is provided, we provide an iterator interface for variants. If you are familiar with Python, it is similar to a generator defined with the `yield` statement. This iterator is accessible through the function `iterator()`.
println("rsid\t\tchrom\tpos\t\t#alleles\tlist of alleles")
for (i, v) in enumerate(BGEN.iterator(b))
println("$(rsid(v))\t$(chrom(v))\t$(pos(v))\t$(n_alleles(v))\t\t$(alleles(v))")
if i == 30
break
end
end
rsid chrom pos #alleles list of alleles rs138720731 22 20000086 2 ["C", "T"] rs73387790 22 20000146 2 ["A", "G"] rs183293480 22 20000199 2 ["C", "A"] rs185807825 22 20000291 2 ["T", "G"] rs55902548 22 20000428 2 ["T", "G"] rs142720028 22 20000683 2 ["G", "A"] rs114690707 22 20000771 2 ["C", "A"] rs189842693 22 20000793 2 ["C", "T"] rs147349046 22 20000810 2 ["T", "C"] rs183154520 22 20000814 2 ["C", "T"] rs187930998 22 20000864 2 ["A", "G"] rs148068532 22 20000882 2 ["G", "C"] rs1978233 22 20000950 2 ["G", "T"] rs141800233 22 20000975 2 ["A", "G"] rs192051979 22 20001001 2 ["C", "T"] rs2079702 22 20001006 2 ["A", "G"] rs183256914 22 20001016 2 ["T", "C"] rs150580380 22 20001157 2 ["A", "G"] rs139570132 22 20001159 2 ["T", "C"] rs143369598 22 20001219 2 ["C", "G"] rs5993894 22 20001333 2 ["T", "C"] rs146344141 22 20001434 2 ["T", "C"] rs188666449 22 20001455 2 ["A", "G"] rs139601437 22 20001521 2 ["A", "C"] rs71788814 22 20001587 2 ["C", "CAG"] rs144217522 22 20001600 2 ["A", "T"] rs192606530 22 20001655 2 ["A", "G"] rs111598545 22 20001822 2 ["G", "A"] rs184950746 22 20002011 2 ["T", "C"] rs142461772 22 20002207 2 ["A", "G"]
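Any object that can be used in a `for` loop, as above, implements Julia's iteration protocol. The toy type below is a minimal sketch of that protocol; `ToyVariants` is hypothetical and is not how `BGEN.jl` implements its iterator, which reads variants from the file lazily.

```julia
# Hypothetical toy container standing in for a stream of variants.
struct ToyVariants
    rsids::Vector{String}
end

# `iterate` returns (item, next_state), or `nothing` once the items are exhausted.
function Base.iterate(it::ToyVariants, state::Int = 1)
    state > length(it.rsids) && return nothing
    return (it.rsids[state], state + 1)
end
Base.length(it::ToyVariants) = length(it.rsids)
Base.eltype(::Type{ToyVariants}) = String

toy = ToyVariants(["rs138720731", "rs73387790", "rs183293480"])
for (i, id) in enumerate(toy)
    println(i, "\t", id)
    i == 2 && break   # stop early, as in the loops above
end
```

Because the state is passed explicitly, stopping early with `break` costs nothing: items after the break are never produced.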
Dosage of each variant is accessible by:
for (i, v) in enumerate(BGEN.iterator(b))
g = first_allele_dosage!(b, v) # first allele is the ALT allele in this file.
println(mean(g))
if i == 30
break
end
end
0.0 0.0 0.0 0.0 0.29319373 0.0 0.0 0.015706806 0.0 0.0 0.005235602 0.0 1.9319372 0.0 0.0 1.8010471 0.005235602 0.0 0.0 0.015706806 0.0 0.036649216 0.0 0.0 0.0 0.0 0.0 0.04712042 0.0 0.0
PGEN format (`.pgen` + `.pvar` and `.psam`)

PGEN is a backward-compatible extension of the BED format, under development for PLINK 2. It aims to overcome the limitations of the BED format and can incorporate phase and dosage information. Cutting-edge GWAS tools now support this format.
`BGEN` with similar precision.
Variant (`.pvar`) and sample (`.psam`) files.
Note: Linkage disequilibrium
Linkage disequilibrium (LD) is the correlation between nearby variants such that the alleles at neighboring polymorphisms (observed on the same chromosome) are associated within a population more often than if they were unlinked.
using PGENFiles
p = Pgen("test_pgen.pgen");
print(PGENFiles.n_variants(p))
1356
print(PGENFiles.n_samples(p))
191
v_iter = PGENFiles.iterator(p);
v = first(v_iter)
g, _... = get_genotypes(p, v)
(UInt8[0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00 … 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00], UInt8[0x00], 0x0000000000000001)
`g` is the genotype vector.
g
191-element Vector{UInt8}: 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 ⋮ 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00
The encoding is as follows:
genotype code | genotype category |
---|---|
0x00 | homozygous REF |
0x01 | heterozygous REF-ALT |
0x02 | homozygous ALT |
0x03 | missing |
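Since `0x03` stands for a missing call, summary statistics computed directly on the raw codes would treat missing genotypes as the value 3. A minimal sketch of decoding the codes first, using a hypothetical helper `decode_pgen` and made-up codes:

```julia
using Statistics

# Hypothetical helper: map a PGEN genotype code to an ALT allele count.
decode_pgen(code::UInt8) = code == 0x03 ? missing : Float64(code)

g_raw = UInt8[0x00, 0x01, 0x02, 0x03, 0x00]   # made-up codes for five samples
g_num = decode_pgen.(g_raw)

println(mean(skipmissing(g_num)))   # mean over the four observed genotypes
println(mean(g_raw))                # raw mean, which counts the missing call as 3
```

When a variant has no missing calls, the two means coincide.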
To avoid array allocations during iterative genotype extraction, one may preallocate `g` and reuse it.
g = Vector{UInt8}(undef, PGENFiles.n_samples(p))
for (i, v) in enumerate(PGENFiles.iterator(p))
get_genotypes!(g, p, v)
println(mean(g))
# do something with genotypes in `g`...
if i == 30
break
end
end
0.0 0.0 0.0 0.0 0.2931937172774869 0.0 0.0 0.015706806282722512 0.0 0.0 0.005235602094240838 0.0 1.931937172774869 0.0 0.0 1.801047120418848 0.005235602094240838 0.0 0.0 0.015706806282722512 0.0 0.03664921465968586 0.0 0.0 0.0 0.0 0.0 0.04712041884816754 0.0 0.0
Similarly, ALT allele dosages are available through the functions `alt_allele_dosage()` and `alt_allele_dosage!()`. Because genotype information is required to retrieve dosages, `alt_allele_dosage!()` also needs space for the genotypes. These functions return the dosages, the parsed genotypes, and the endpoint of the dosage information in the current variant record.
d = Vector{Float32}(undef, PGENFiles.n_samples(p))
g = Vector{UInt8}(undef, PGENFiles.n_samples(p))
g_ld = similar(g)
for (i, v) in enumerate(v_iter)
alt_allele_dosage!(d, g, p, v)
# do something with dosage values in `d`...
println(mean(d))
if i == 30
break
end
end
0.0 0.0 0.0 0.0 0.29319373 0.0 0.0 0.015706806 0.0 0.0 0.005235602 0.0 1.9319372 0.0 0.0 1.8010471 0.005235602 0.0 0.0 0.015706806 0.0 0.036649216 0.0 0.0 0.0 0.0 0.0 0.04712042 0.0 0.0
Information on each sample and variant is available by reading the `.psam` and `.pvar` files as a `DataFrame` (not yet supported by `PGENFiles.jl`). The `.pvar` format also admits the regular `.vcf` layout.
using DataFrames, CSV
sample_info = CSV.read("test_pgen.psam", DataFrame)
first(sample_info, 5)
5 rows × 2 columns
Row | #IID | SEX |
---|---|---|
| | String7 | String3 |
1 | HG00096 | NA |
2 | HG00097 | NA |
3 | HG00099 | NA |
4 | HG00100 | NA |
5 | HG00101 | NA |
In a `.pvar` file, all lines starting with `#` are header lines, and the last of them gives the column names. For `test_pgen.pvar`, that is the 26th line.
variant_info = CSV.read("test_pgen.pvar", DataFrame; delim="\t", header=26)
first(variant_info, 5)
5 rows × 8 columns
Row | #CHROM | POS | ID | REF | ALT | QUAL | FILTER | INFO |
---|---|---|---|---|---|---|---|---|
| | Int64 | Int64 | String15 | String15 | String31 | String7 | String7 | String |
1 | 22 | 20000086 | rs138720731 | T | C | 100 | PASS | AC=7;RSQ=0.8454;AVGPOST=0.9983;AA=T;AN=2184;LDAF=0.0040;THETA=0.0001;VT=SNP;SNPSOURCE=LOWCOV;ERATE=0.0003;AF=0.0032;AFR_AF=0.01 |
2 | 22 | 20000146 | rs73387790 | G | A | 100 | PASS | LDAF=0.0169;RSQ=0.9482;THETA=0.0004;AA=G;AN=2184;AVGPOST=0.9972;VT=SNP;SNPSOURCE=LOWCOV;AC=36;ERATE=0.0003;AF=0.02;AFR_AF=0.07;EUR_AF=0.0013 |
3 | 22 | 20000199 | rs183293480 | A | C | 100 | PASS | LDAF=0.0009;THETA=0.0004;AN=2184;AVGPOST=0.9990;VT=SNP;AA=A;RSQ=0.6274;SNPSOURCE=LOWCOV;AC=1;ERATE=0.0003;AF=0.0005;EUR_AF=0.0013 |
4 | 22 | 20000291 | rs185807825 | G | T | 100 | PASS | ERATE=0.0005;AVGPOST=0.9983;AA=G;AN=2184;LDAF=0.0015;VT=SNP;SNPSOURCE=LOWCOV;RSQ=0.5564;AC=2;THETA=0.0003;AF=0.0009;ASN_AF=0.0035 |
5 | 22 | 20000428 | rs55902548 | G | T | 100 | PASS | AC=323;AVGPOST=0.9983;AA=G;AN=2184;VT=SNP;RSQ=0.9949;LDAF=0.1473;SNPSOURCE=LOWCOV;ERATE=0.0003;THETA=0.0003;AF=0.15;ASN_AF=0.0017;AMR_AF=0.15;AFR_AF=0.31;EUR_AF=0.15 |
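Rather than hard-coding the header line number, one could locate it programmatically. The sketch below uses a hypothetical helper, `header_line`, and assumes a VCF-style `.pvar` file in which meta-information lines start with `##` and the column-name line starts with a single `#`. The returned number could then be passed as the `header` keyword of `CSV.read`.

```julia
# Hypothetical helper: 1-based number of the first line that does not start with "##",
# which in a VCF-style .pvar file is the "#CHROM ..." column-name line.
function header_line(path::AbstractString)
    for (i, line) in enumerate(eachline(path))
        startswith(line, "##") || return i
    end
    return nothing   # only meta-information lines were found
end

# Demonstration on a small temporary file standing in for a .pvar file.
path, io = mktemp()
write(io, "##fileformat=VCFv4.1\n##source=example\n#CHROM\tPOS\tID\tREF\tALT\n22\t20000086\trs138720731\tT\tC\n")
close(io)
println(header_line(path))
```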
Minor allele frequency (MAF) is widely used to determine whether a variant is rare or common, and in GWAS it is also used to measure the information content of a variant. One such measure is the ratio of the empirical variance of the genotype values to the variance expected under the binomial model ($2 \hat{p}(1-\hat{p})$, where $\hat{p}$ is the MAF in $[0, 0.5]$).
By modifying the code cells above...
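One possible starting point, as a sketch rather than a reference solution: given a vector of ALT allele counts or dosages in $[0, 2]$, such as the ones filled in the loops above, the MAF and the variance ratio can be computed as follows. The function names `maf` and `info_ratio` are hypothetical, and the population variance (`corrected = false`) is an assumed choice for the numerator.

```julia
using Statistics

# Hypothetical helper: minor allele frequency from ALT allele counts or dosages in [0, 2].
function maf(dosage::AbstractVector{<:Real})
    p = mean(dosage) / 2          # ALT allele frequency
    return min(p, 1 - p)
end

# Hypothetical helper: empirical variance divided by the binomial variance 2p(1 - p).
function info_ratio(dosage::AbstractVector{<:Real})
    p = mean(dosage) / 2
    expected = 2 * p * (1 - p)
    expected == 0 && return NaN   # monomorphic variant, ratio undefined
    return var(dosage; corrected = false) / expected
end

x = [0.0, 1.0, 2.0, 1.0]          # made-up genotypes for four samples
println(maf(x), " ", info_ratio(x))
```

For rs55902548, the mean dosage of about 0.2932 reported above gives an ALT allele frequency of about 0.147, comparable to the `LDAF=0.1473` value in its INFO field.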
The file formats can be converted into one another using plink2 on the command line. For example, the files used in this workshop were converted from a VCF file using the following commands.
run(`plink2 --vcf test_vcf.vcf.gz --export bgen-1.3 --out test_bgen`) # ALT allele comes first
PLINK v2.00a2.3 64-bit (24 Jan 2020) www.cog-genomics.org/plink/2.0/ (C) 2005-2020 Shaun Purcell, Christopher Chang GNU General Public License v3 Logging to test_bgen.log. Options in effect: --export bgen-1.3 --out test_bgen --vcf test_vcf.vcf.gz Start time: Mon Jul 18 07:42:35 2022 16384 MiB RAM detected; reserving 8192 MiB for main workspace. Using up to 8 compute threads. --vcf: 1356 variants scanned. --vcf: test_bgen-temporary.pgen + test_bgen-temporary.pvar + test_bgen-temporary.psam written. 191 samples (0 females, 0 males, 191 ambiguous; 191 founders) loaded from test_bgen-temporary.psam. 1356 variants loaded from test_bgen-temporary.pvar. Note: No phenotype data present. Writing test_bgen.bgen ... 0%done. Writing test_bgen.sample ... done. End time: Mon Jul 18 07:42:35 2022
Process(`plink2 --vcf test_vcf.vcf.gz --export bgen-1.3 --out test_bgen`, ProcessExited(0))
run(`plink2 --vcf test_vcf.vcf.gz --make-bed --out test_bed`) # ALT allele comes first
PLINK v2.00a2.3 64-bit (24 Jan 2020) www.cog-genomics.org/plink/2.0/ (C) 2005-2020 Shaun Purcell, Christopher Chang GNU General Public License v3 Logging to test_bed.log. Options in effect: --make-bed --out test_bed --vcf test_vcf.vcf.gz Start time: Mon Jul 18 07:42:35 2022 16384 MiB RAM detected; reserving 8192 MiB for main workspace. Using up to 8 compute threads. --vcf: 1356 variants scanned. --vcf: test_bed-temporary.pgen + test_bed-temporary.pvar + test_bed-temporary.psam written. 191 samples (0 females, 0 males, 191 ambiguous; 191 founders) loaded from test_bed-temporary.psam. 1356 variants loaded from test_bed-temporary.pvar. Note: No phenotype data present. Writing test_bed.fam ... done. Writing test_bed.bim ... done. Writing test_bed.bed ... 0%done. End time: Mon Jul 18 07:42:35 2022
Process(`plink2 --vcf test_vcf.vcf.gz --make-bed --out test_bed`, ProcessExited(0))
run(`plink2 --vcf test_vcf.vcf.gz --make-pgen --out test_pgen`)
PLINK v2.00a2.3 64-bit (24 Jan 2020) www.cog-genomics.org/plink/2.0/ (C) 2005-2020 Shaun Purcell, Christopher Chang GNU General Public License v3 Logging to test_pgen.log. Options in effect: --make-pgen --out test_pgen --vcf test_vcf.vcf.gz Start time: Mon Jul 18 07:42:35 2022 16384 MiB RAM detected; reserving 8192 MiB for main workspace. Using up to 8 compute threads. --vcf: 1356 variants scanned. --vcf: test_pgen-temporary.pgen + test_pgen-temporary.pvar + test_pgen-temporary.psam written. 191 samples (0 females, 0 males, 191 ambiguous; 191 founders) loaded from test_pgen-temporary.psam. 1356 variants loaded from test_pgen-temporary.pvar. Note: No phenotype data present. Writing test_pgen.psam ... done. Writing test_pgen.pvar ... done. Writing test_pgen.pgen ... 0%done. End time: Mon Jul 18 07:42:35 2022
Process(`plink2 --vcf test_vcf.vcf.gz --make-pgen --out test_pgen`, ProcessExited(0))
While we focused on variant-by-variant access to genetic variant data, many of the packages we have seen also offer utilities such as filtering out variants with low MAF or low genotyping success rate, filtering by chromosome, and merging. For the BED format (`SnpArrays.jl`), we support high-performance linear algebra on the genotype matrix, with multithreading and GPU (graphics processing unit) computation.
- `OrdinalGWAS.jl`: GWAS for ordered categorical traits, e.g. disease status (undiagnosed, pre-disease, mild, moderate, severe).
- `TrajGWAS.jl`: GWAS for continuous longitudinal phenotypes using a modified linear mixed effects model. Tests both the mean effect and the within-subject variability effect.
- `MendelIHT.jl`: Rather than GWAS with variant-by-variant testing, uses a penalized regression model.
- `OpenADMIXTURE.jl`: Estimates the ancestry of samples. A Julia reimplementation of the highly celebrated ADMIXTURE software (written in C++), 8x faster in a multithreaded setting and 35x faster using a GPU.
- `VarianceComponentModels.jl`: Fitting and testing variance component models.
- `MendelImpute.jl`: Fast phase inference and genotype imputation.
- `TraitSimulation.jl`: Quickly simulate phenotypes under a variety of genetic architectures.
- `QuasiCopula.jl`: Analysis of correlated data with specified marginals using a flexible quasi-copula distribution.