|
PennCNV CNV QC and annotation The generated raw CNV calls needs to be processed, so that calls from low quality samples be eliminated from the call file, and so that only calls meeting specific length criteria (for example, more than 10 SNPs) are kept in analysis. Furthermore, the calls would not make much biological sense unless we know how to functionally interpret them and correlate them with prior biological knowledge. 1 Filtering CNV calls by user-specified criteria
Filtering CNV calls by user-specified criteriaThe raw CNV calls often need to be filtered to keep a specific subset of calls for further analysis. In the PennCNV package, the filter_cnv.pl program can filter CNV calls based on various criteria, including both sample-level and on call-level criteria. For example, “-numsnp 10 -length 50k” can be used to select CNV calls containing 10 SNPs and larger than 50kb. [kaiwang@cc penncnv]$ filter_cnv.pl -numsnp 10 -length 50k sampleall.rawcnv As we can see, in the output, only CNV calls meeting specified criteria are printed out to the screen. Adding the “-output” argument to the above command will write the CNV calls to a file. For recently developed SNP arrays with high-density markers, PennCNV may tends to split large CNVs (such as those >500kb) into smaller parts (such as two or three 150kb CNV calls). It is possible to examine the CNV calls and merge adjacent calls together into one single call, using user-specified threshold (for example, gap<20% of total length). In other word, suppose there are three genomic segments, A, B and C, whereas A and C are called as deletion by PennCNV. If you divide the length of the gap B (measured by #probes) by the length of A+B+C, and if this fraction if <=20%, then it is recommended to merge A+B+C as a single deletion call. In the latest version of PennCNV, the clean_cnv.pl program can do this automatically for you, with the -fraction argument to control the fraction threshold. The -bp argument can be used so that the fraction is calculated as base pair length, rather than number of SNPs/markers withint he call.
Automatic Quality Control of CNV callsThe filter_cnv.pl program also has functionality to perform sample-level QC, in addition to call-level QC. Therefore, we can use the program to identify low-quality samples from a genotyping experiment, and eliminate them from future analysis. This analysis requires the LOG file used in CNV calling. The three samples (sample1.txt, sample2.txt and sample3.txt) used in the tutorial are actually of very good data quality. However, to illustrate how to apply the QC procedure, next we specify an extraordinarily stringent LRR_SD criteria to show that one sample (sample3.txt) cannot meet this criteria and is dropped during the QC procedure. [kaiwang@cc penncnv]$ filter_cnv.pl sampleall.rawcnv -qclogfile sampleall.log -qclrrsd 0.135 -qcpassout sampleall.qcpass -qcsumout sampleall.qcsum -out sampleall.goodcnv The above command asked to analyze the LOG file (sampleall.log), find all samples with LRR_SD less than 0.135, then write these samples to the sampleall.qcpass file, write the CNV calls for these samples to the sampleall.goodcnv file, and write the QC summary for all samples to the sampleall.qcsum file. Now we can examine the sampleall.qcsum file: [kaiwang@cc penncnv]$ cat sampleall.qcsum This is a tab-delimited text file that can be easily loaded into Excel for plots and histograms. For example, it is often informative to plot the number of CNV calls versus the LRR_SD measure to diagnose what's a good LRR_SD threshold to use in QC for a particular data set. Now examine the sampleall.qcpass file: [kaiwang@cc ~/project/penncnv/tutorial]$ cat sampleall.qcpass This means that sample1.txt does not pass the QC threshold. Checking the qcsum file, we can see that the sample1.txt has LRR_SD of 0.138, which is higher than the 0.135 as specified in the command line. Now check the sampleall.goodcnv file: we will see that only sample2.txt and sample3.txt are included in this file, that is, the CNV calls from “bad” sample is filtered from the output file. Again, the above example is used only for illustrating how the procedure works. In reality the 0.135 threshold is too stringent. Using a 0.24, or even 0.3 or 0.35, would be advised. (it does not hurt to change the threshold and see how many samples pass QC). Finally, it is highly recommended to also set the -qcnumcnv argument in the command line in practical settings; for example, “-qcnumcnv 50” would treat any samples with >50 CNV calls as low quality samples and eliminate them from analysis. (taking the qcsum file and draw a histogram to see the distribution and decide on a good threshold to use; it is not unusual for PennCNV to generate 2,000 CNV calls for a sample with very low quality even if LRR_SD appear to be normal!). Removing spurious CNV calls in specific genomic regions Several genomic regions are known to harbor spurious CNV calls that represent cell-line artifacts, so the calls should be eliminated before analysis. For example, the original PennCNV paper in Genome Research has done a comparison to show that immunoglobulin regions are especially likely to have deletions in cell line samples than whole-blood samples. Additionally, centromeric and telomeric regions are especially likely to harbor spurious CNV calls as well. The scan_region.pl program can easily help remove CNV calls in specific genomic regions. Some examples were given in the FAQ section since many people asked about this issue. One thing to note is that people may use different thresholds (like 100kb, 500kb, 1Mb) to define telomeric/centromeric regions. Another thing to note is that CNVs in some telomeric regions may be indeed functionally important rather than simple artifacts. So always double check when deleting things from the CNV call files to make sure that they make sense.
Finding overlapping/neighboring genes for CNV callsOne of the most common tasks for CNV annotation is to identify overlapping or neighboring genes. If the CNV calls are generated using hg18 (Mar 2006, NCBI build 36) human genome assembly, we can download UCSC known gene annotation (knownGene.txt.gz and kgXref.txt.gz) or refGene annotation (refGene.txt.gz and refLink.txt.gz) for CNV annotation. (After downloading these files, it is recommended to first unzip the files (for example, gunzip refGene.txt.gz), then rename them (for example, mv refGene.txt hg18_refGene.txt) to add the genome build information to the file name to remind yourself about the version of genome assembly). For example, to identify UCSC known genes that overlap with CNV calls (which were generated using the hg18 genome coordinates), we can run: [kai@adenine penncnv]$ scan_region.pl sampleall.cnv hg18_refGene.txt -refgene -reflink hg18_refLink.txt > sampleall.cnv.rg18 The output file contains two additional columns to each line of the sampleall.cnv file. The first column represents the gene symbols and the second column indicates the distance between CNV and gene. In our case, the distance is always zero since we will only find CNVs that overlap with a gene. If the CNV does not overlap with any gene, the “NOT_FOUND” notation will be shown for the corresponding CNVs. For example, several lines of the sampleall.cnv.rg18 file is shown below: chr2:242565979-242656041 numsnp=16 length=90,063 state2,cn=1 sample1.txt startsnp=rs12987376 endsnp=rs6740738 father triostate=233 NOT_FOUND NOT_FOUND For the above five CNV calls, only one overlaps with a gene (CTDSPL), while others are in intergenic regions. The second CNV call (50 SNPs, 97kb) is a de novo CNV. It is usually more useful to find neighboring genes for an intergenic CNV, we can use the --expandmax argument: [kai@adenine penncnv]$ scan_region.pl sampleall.cnv hg18_refGene.txt -refgene -reflink hg18_refLink.txt -expandmax 5m > sampleall.cnv.rg18 This will expand the CNV up to 5 megabases in both direction and then try to find neighboring genes. Only the closest gene to the CNV will be written to output, while this closest gene might be located to the left or right side of the CNV (note that we use “left” and “right” here since CNVs occur in both forward and reverse chains without a predefined direction). To find only left genes, we can use the “--expandleft 5m” argument. To find both left and right genes, we have to run the program twice with --expandleft and --expandright argument respectively. This can be done easily in one single step: [kai@adenine penncnv]$ scan_region.pl sampleall.cnv hg18_refGene.txt -refgene -reflink hg18_refLink.txt -expandleft 5m | scan_region.pl sampleall.cnv hg18_refGene.txt -refgene -reflink hg18_refLink.txt -expandright 5m > sampleall.cnv.rg18 Note that when <inputfile> is “stdin”, the scan_region.pl program will read input from STDIN (standard input, which can be a piped output from a previous command). The output will contain four extra columns, representing closest left gene, left distance, closest right gene and right distance, respectively. For example, the same five CNVs mentioned above were annotated as: chr2:208064035-208066083 numsnp=5 length=2,049 state2,cn=1 sample2.txt startsnp=rs918843 endsnp=rs959668 mother triostate=323 KLF7 325176 CREB1 36848 Tips: the UCSC known gene annotation is more comprehensive than RefSeq annotation, but it contains too many uncharacterized transcripts (with names such as AK091772 or BC015880). In most cases, it is probably a better idea to use RefSeq annotation to annotate CNVs.
Finding overlapping exons for CNV callsWe often want to know which CNVs severely affects genes, so examination of exon deletions may be one way to ensure that the CNV call indeed disrupt gene function. The “-refexon” argument, rather than “-refgene” argument, can be used in the above command line to find out exonic overlaps. Those CNV calls without exonic overlap with have “NOT_FOUND” appended to the end of the line, so a “fgrep -v NOT_FOUND” can be added to get rid of these CNVs affecting non-exonic regions.
Finding overlapping microRNA, evolutionarily conserved elements, transcription factor binding sites, etc for CNV callsThe scan_region.pl program can perform these tasks using the UCSC genome annotation file. Please use the “-m” argument for the scan_region.pl program to read a complete manual on how to achieve these goals.
CNV case-control comparisonThe PennCNV program implements a very preliminary functionality of case-control association analysis, to identify a stretch of SNPs that tend to have copy number changes in cases versus controls using Fisher’s Exact Test. This function is very preliminary and very rough, but can be a first-pass effort to identify potentially interesting regions in a case-control setting. A phenotype file needs to be supplied for this analysis. The file contains two tab-delimited columns, representing file name and the disease label, respectively. The disease label can be 1 (indicating non-affected) or 2 (indicating affected), or can be “control” and “case”. In fact, The disease label can be anything: if using the “-control_label” argument, then any string that match the control label will be treated as controls, otherwise treated as cases. The following is an example phenofile: [kaiwang@cc penncnv]$ cat phenotype sample1.txt chronic_disease sampel2.txt control sample3.txt acute_disease So the program will treat sample2.txt as control, and the other samples as cases. If a file name is not annotated in the phenofile, it will be treated as unknown and not used in the association analysis. For illustration purposes, using the CNV call file generated in the tutorial, now we can perform a simple case-control comparison between the case group (two samples) and the control group (one sample), using the CNV call file sampleall.rawcnv: [kaiwang@cc ~/project/penncnv/tutorial]$ detect_cnv.pl -cctest -pfb ../lib/hh550.hg18.pfb -phenofile phenotype -cnv sampleall.rawcnv The association results will be printed in the screen in tab-delimited text format (using -out argument to write to a file). For example, for the SNP rs11579261, two cases (out of two total cases) have CNV, but zero control (out of one control) has CNV. The P-value is 0.33 by Fisher’s Exact Test. Note that the results are printed in a per-marker basis: obviously if a stretch of neighboring markers all have good P-values it may indicates that this entire CNV region is associated with disease status. Some additional useful arguments include “-type del” or “-type dup” to examine deletions or duplications only. (By default both deletions and duplications are counted as CNV in the above calculation). The “-onesided” argument can be used to performed one-sided association test that only cares about regions where cases have more CNVs than controls. |