Day-16 - more imputation

Post date: Oct 24, 2014 8:58:52 PM

I heard about dna.land from twitter. Looked interesting so I uploaded my genotypes. As Yaniv Erlich suggested, it did indeed take <1hr or so. Great! I downloaded the imputed VCF file.

What should I do now? Well NCBI have some cool VCF files that I could check against my own. I like to use BCFTOOLS for this.

Download the data and prepare:

wget ftp://ftp.ncbi.nih.gov/snp/organisms/human_9606/VCF/clinvar_20140929.vcf.gz

gunzip clinvar_20140929.vcf.gz

bgzip clinvar_20140929.vcf

tabix -f clinvar_20140929.vcf.gz

Prepare my data:

gunzip cod_imputed.vcf.gz

bgzip cod_imputed.vcf

tabix -f cod_imputed.vcf.gz

Annotate using BCFTOOLS:

bcft annotate -a clinvar_20140929.vcf.gz -c CLNSIG cod_imputed.vcf.gz | grep 'CLNSIG'

The results are below. I list only CLNSIG with values 4 (Likely pathogenic), 5 (Pathogenic), 6 (drug response), 7 (histocompatibility):

So it looks like I have 3 pathogenic variants, 2 heterozygotes and 1 homozygote.

rs2837902 lies between DSCAM and BACE2, conceivably in promoter regions of both.

rs12532066 lies upstream of CTTNBP2, conceivably in the promoter region.

rs7868318 lies in the intron of ZER1.

Some of the associated diseases look scary but would need to take with a pinch of salt I think. The main point of this exercise was to demonstrate how you can add useful additional information to your genotypes and look for additional variants that might not be explicitly reported by 23andMe (or accessory software like Promethease).

By the way, having the imputed VCF file is great. You could upload to SeattleSeq for instance and get some more annotations.

You could obviously just make a BED file from the imputed VCF file, and cross-reference that with other BED files of variants/regions of interest (using something like BEDTOOLS).

zcat cod_imputed.vcf.gz | grep -v '^#' | grep -P "0/1|1/1|1/0" | gawk '{print "chr"$1"\t"$2"\t"$2"\t"$3 }' > cod.bed

Download flagged SNPs from dbSNP 141 (clinical significance - snp141Flagged). Got this from the UCSC genome browser.

bedtools intersect -a cod.bed -b flagged_snps.dbsnp141.bed -wo | gawk '$10 != "unknown" && $10 != "" && $10 != "" && $10 != "" && $10 != "" && $10 != "" && $10 != "" && $10 != "" .... didn't get around to this, but worth a look at another time..

However, overall, I think something like VEP followed by LOFTEE would make more sense though. That, my friends, will be in my next installment...