Day-13 - PLINK annotation

Post date: Jul 20, 2011 1:56:05 PM

I use PLINK alot in my day to day work. It's and excellent tool for analysis and manipulation of genomewide association datasets. I had the idea of using its annotation feature to see what I could find out. The website provides a pre-computed SNP attributes file for dbSNP v129 (I re-did this for 132 but there isn't that much difference for gene annotations). The method is described on this page. What I wanted to do is ask the following: for SNPs in my genome that are rare in the HapMap Caucasian population (as a reference), highlight any genes with interesting functional mutations. Even though any findings might not have any disease associations, I think it's a useful exercise. In addition, you can use the method to annotate however you like, e.g. other functional annotations, conserved elements, sites under selection etc.

I'm just going the process as a series of steps:

# Download the data and prepare my own

wget http://pngu.mgh.harvard.edu/~purcell/plink/dist/snp129.attrib.gz

wget http://pngu.mgh.harvard.edu/~purcell/plink/dist/glist-hg18

grep '^rs' ../genome_Colm_O_Dushlaine_Full_20110124134637.txt | gawk '{print $2,$1,$3,"1",$4 }' > cod.snps

# manual edit - add header - CHR SNP BP P

# Annotate genes, dbSNP annotations and subset on rare SNPs (1%) (the snps.1pc.tmp file I had prepared earlier, pulling out SNP genotypes that I had that were <1% in frequency in CEU)

plink --annotate cod.snps attrib=snp129.attrib.gz ranges=glist-hg18 snps=snps.1pc.tmp --out cod.snps_vs_hg18_vs_snp129_vs_1pc_ceu

# The functional annotations cover the following, missense being the most frequent

gawk '{print $2 }' snp129.attrib | sort | uniq -c

348 =frameshift

19 =FRAMESHIFT

203933 =missense

30874 =MISSENSE

7682 =nonsense

849 =NONSENSE

963 =splice

297 =SPLICE

== 244765

# Look at functional categories of interest that have gene annotations. Note: these results are first-pass and the snp129.attrib file includes annotations based on a SNP being in LD with a functional SNP, i.e. it may not necessarily be the true SNP

grep -P "nonsense|NONSENSE|frameshift|FRAMESHIFT|SPLICE|splice" cod.snps_vs_hg18_vs_snp129_vs_1pc_ceu.annot | grep '('

1 rs969788 86689444 1 AG CLCA2(0)|=missense|=nonsense

5 rs1423414 41179513 1 AA C6(0)|=missense|=nonsense

5 rs4957377 41228756 1 CC C6(0)|=missense|=nonsense

5 rs4957152 41270160 1 CC C6(0)|=missense|=nonsense

5 rs2004385 41286190 1 AA C6(0)|=missense|=nonsense

5 rs7724199 156922535 1 AA ADAM19(0)|=missense|=nonsense

5 rs7719224 156931918 1 TT ADAM19(0)|=missense|=nonsense

5 rs11134822 156933880 1 TT ADAM19(0)|=missense|=nonsense

6 rs3778638 31200103 1 AA PSORS1C1(0)|=missense|=nonsense

6 rs2233945 31215340 1 AA PSORS1C1(0)|=missense|=nonsense

6 rs12364 31218765 1 AA CCHCR1(0)|=missense|=nonsense

6 rs9263739 31219335 1 TT CCHCR1(0)|=missense|=nonsense

6 rs9263740 31219379 1 CC CCHCR1(0)|=missense|=nonsense

6 rs130077 31230309 1 AA CCHCR1(0)|=missense|=nonsense

6 rs9263785 31233802 1 GG CCHCR1(0)|=missense|=nonsense

6 rs9263794 31237998 1 GG TCF19(0)|=missense|=nonsense

6 rs1044870 31238800 1 TT TCF19(0)|=missense|=nonsense

6 rs9263796 31240862 1 TT POU5F1(0)|=missense|=nonsense

6 rs9263800 31242578 1 AA POU5F1(0)|=missense|=nonsense

7 rs596572 4174875 1 TT SDK1(0)|=missense|=nonsense

7 rs623667 4177807 1 AA SDK1(0)|=missense|=nonsense

8 rs3213604 17770696 1 AT FGL1(0)|=frameshift|=missense

11 rs4910844 5733060 1 TT OR52N4(0)|=NONSENSE|=missense

11 rs10838637 5766053 1 GG OR52N1(0)|=missense|=nonsense

11 rs10769224 5766249 1 TT OR52N1(0)|=MISSENSE|=nonsense

11 rs10742787 5766322 1 TT OR52N1(0)|=MISSENSE|=nonsense

13 rs11568656 94512943 1 GT ABCC4(0)|=frameshift

13 rs7997839 94514336 1 AG ABCC4(0)|=frameshift

CLCA2: calcium channel: http://www.genecards.org/cgi-bin/carddisp.pl?gene=CLCA2

C6: immune-related: http://www.genecards.org/cgi-bin/carddisp.pl?gene=C6

ADAM19: lots of roles, including neurogenesis (explains my insanity!): http://www.genecards.org/cgi-bin/carddisp.pl?gene=ADAM19

PSORS1C1: unclear: http://www.genecards.org/cgi-bin/carddisp.pl?gene=PSORS1C1

CCHCR1: keratinocyte: http://www.genecards.org/cgi-bin/carddisp.pl?gene=CCHCR1

TCF19: transcription factor: http://www.genecards.org/cgi-bin/carddisp.pl?gene=TCF19

POU5F1: transcription factor: http://www.genecards.org/cgi-bin/carddisp.pl?gene=POU5F1

SDK1: cell adhesion protein that guides axonal terminals to specific synapses in developing neurons (definitely explains my insanity!): http://www.genecards.org/cgi-bin/carddisp.pl?gene=SDK1

FGL1: hepatocyte mitogenic activity: http://www.genecards.org/cgi-bin/carddisp.pl?gene=FGL1

OR52N4: olfactory receptor: http://www.genecards.org/cgi-bin/carddisp.pl?gene=OR52N4

OR52N1: olfactory receptor: http://www.genecards.org/cgi-bin/carddisp.pl?gene=OR52N1

ABCC4: organic anion pump relevant to cellular detoxification http://www.genecards.org/cgi-bin/carddisp.pl?gene=ABCC4