|
PennCNV individual-based CNV calling Suppose we already generated the necessary input signal intensity file, including sample1.txt, sample2.txt and sample3.txt, as described in the previous tutorial section in Input Files. Below is a description of the procedure for CNV calling for each of the three individuals using PennCNV.
General CNV calling overviewA general flowchart for CNV calling algorithm by PennCNV is described in the figure below: As we can see from the flowchart, for CNV calling, we need to have LRR and BAF values (contained within the signal intensity file sample*.txt), the population frequency of B alleles and SNP genome coordinates (contained within the file hh550.hg18.pfb), and an appropriate HMM model (contained within the hh550.hmm file). Notes: the hh550.hg17.pfb and hh550.hg18.pfb file contains the genome coordinates and the population frequency for each SNP in the HumanHap550 array. The differences are that they used different genome assembly: hg17 corresponds to the 2004 human genome assembly (or NCBI35 assembly), while the hg18 corresponds to the 2006 human genome assembly (or NCBI 36 assembly). In the latest release of PennCNV, the hg17 files have been removed from the package.
CNV calling on three individualsNext we can start calling CNVs for these three individuals. Suppose we are already inside the penncnv/ directory: [kai@adenine penncnv]$ detect_cnv.pl -test -hmm lib/hh550.hmm -pfb lib/hh550.hg18.pfb sample*.txt -log sampleall.log -out sampleall.rawcnv If you are using Windows, or if the above command does not work and show “bash: detect_cnv.pl: command not found” error message, try to add “perl” in the beginning of the command line before “detect_cnv.pl”. If you are using Windows shell, the “sample*.txt” won’t be recognized as the expansion to 3 files. Try to use “sample1.txt sample2.txt sample3.txt” explicitly in command line, or you can put the three file names into a list file (see tips below) and use the --list argument. In the command line, we have several arguments: the --test argument tells the program to generate CNV calls, the --hmmfile and --pfbfile arguments tell the program to use a HMM model file called lib/hh550.hmm and a Population Frequency of B Allele file called hh550.hg18.pfb. The sample*.txt notation tells the program to generate CNV calls for three files (sample1.txt, sample2.txt and sample3.txt), the –log argument tells the program to write log information to the sampleall.log file, while the –out argument tells the program to output CNV calls to a file called sampleall.rawcnv. This should be done in a few minutes in a modern computer. Tip: Instead of listing all file names in the command line, we can generate a listfile containing file names to be processed (one per line), then use the --list argument in the command line. This is especially useful when you want to process a subset of all files in a directory, or when you have too many (>2000) files in a directory (in this case, the expansion of sample*.txt exceed the maximum length of allowable characters in a command line). Note: By default only autosome CNVs will be detected, but the --chrx argument can be used in the above command to generate CNV calls for chromosome X only. The CNV calling for chrX is slightly different from that of autosomes. It is highly recommended to use the --sexfile argument to supply gender annotation for all genotyped samples. The sexfile is two-column file, with first column as signal file name, and second column as either male or female. If sex for a sample is not provided in sexfile, or if --sexfile is not specified, PennCNV will try to predict the gender of the sample, based on BAF heterozgous rate (fraction of chrX markers with BAF between 0.25 and 0.75) of >0.1 (female) or <0.1 (male). Next, PennCNV will adjust the LRR values by a fixed value such that females have median LRR at 0 and males have median LRR at the same value as that for CN=1 in the HMM file. After this step, the CNV calling is then applied in a similar way as autosomes. We can then take a look at the resulting sampleall.rawcnv file. The first a few lines look like this: chr1:156783977-156788016 numsnp=6 length=4,040 state2,cn=1 sample1.txt startsnp=rs16840314 endsnp=rs10489835 The first field in each line specifies the chromosome region. Be careful here! Although the sample*.txt file also contains chromosome coordinates for each SNP, this information is ignored during CNV calling. Instead, the chromosome coordinates for each SNP is read from the PFB file (hh550.hg18.pfb), which was generated using the chromosome coordinates for the 2006 human genome assembly, so the chromosome regions shown in the sampleall.rawcnv file are for 2006 human genome assembly. If we want to generate CNV calls for the 2004 human genome assembly, we will need to use a different PFB file (hh550.hg17.pfb). This design allows us to easily switch CNV calls for different genome assembly without changing the signal intensity files. This also means that the input signal intensity file does not have to contain the Chr and Position column: as they will be ignored by default. However, if you insist to use the genome coordinates annotation in your own input file (for whatever reasons), you can use the --coordinate_from_input argument, which will apply the Chr and Position annotation in input files, rather than PFB files. The second field and the third field in each line in sampleall.rawcnv tell how many SNPs are contained within the CNV and the length of the CNV (calculated by subtracting the coordinate of the last SNP by the coordinate of the first SNP plus 1). The fourth field tells the HMM state and the actual copy number (CN) of the CNV call. You can ignore the state, as it is jus an internal HMM parameter. The CN refers to the actual integer copy number estimates, and the normal copy number is 2. So for autosome, CN=0 or 1 means there is a deletion and CN>=3 means there is a duplication. For chrX or chrY in males, CN=1 is the normal copy number and CN=0 means a deletion. The fifth, sixth and seventh fields specify the original file name, the starting marker identifier and the ending marker identifier in the CNV, respectively. There are a total of 51 CNVs generated in this step for three individuals. Of the 17 CNVs in offspring, 12 are inherited from parents, and 5 are Not Detected in Parents (so-called CNV-NDPs). These CNV-NDPs may be false positive calls in offspring, or false negative calls in parents, or as de novo CNVs. Of the 12 inherited CNVs, 10 have identical boundaries as the corresponding CNVs in father or mother. Of the 5 CNV-NDPs, four of them contain less than 10 SNPs, but one of them is quite large (50 SNPs) and does not overlap with any immunoglobulin region. We can then take a look at the sampleall.log file, which contains information reported by the program during analysis of samples: [kai@beta penncnv]$ cat sampleall.log Note: As of June 2008, the –medianadjust argument is turned on by default in the program to reduce false positive duplication calls for problematic samples. The effect is that the BAF_median measure for all samples are automatically adjusted to be 0.5. If the user wants to reproduce the results from the previous version of PennCNV, you can turn the argument off by --nomedianadjust argument. As we can see from the log message, first the program reads the PFB file, and it discarded 178 records in chr M, Y and XY. Next the program reads signal information for autosomes from the sample1.txt file, and discarded 178 records in the signal file (The reason is that these records/SNPs is not read/annotated in the PFB file). Note that a total of 561288 markers are read into PennCNV for analysis: this sounds right for Illumina 550K arrays. For Affy GW6 or GW5 array, this number should be around 1.8 million or 800K; for Illumina 1M array, this number should be around 1M. If it's substantially lower than the expected number, then it means something is wrong so that the signal files are not read correctly. Examine the signal files to see how many lines (markers) it has, and whether the last line is complete, to check the possibility of file corruption or file incompletion during generation. Next the program prints out a list of sample quality summary: this information is very useful step in quality control (see http://www.neurogenome.org/cnv/penncnv/qc_tutorial.htm for more detail). If some of the quality measure looks bad, a warning message will be printed out that the sample does not pass quality control criteria. The CNV calls will still be made for all samples regardless of quality, but users are advised to take caution in analyzing CNV calls from low-quality samples.
GCmodel adjustmentThe current version of PennCNV now implements an improved version of wave adjustment procedure for genomic waves via the –gcmodel argument. The algorithm is described in detail in the Diskin et al paper. If a large fraction of your samples have waviness factor (WF value) less than -0.04 or higher than 0.04, it is much better to apply the adjustment procedure to reduce false positive calls. For the 3 samples in the tutorial, the quality is actually quite good so signal adjustment has little effect. Nevertheless, we can try to do the adjustment with the following command: [kai@adenine penncnv]$ detect_cnv.pl -test -hmm lib/hh550.hmm -pfb lib/hh550.hg18.pfb sample1.txt sample2.txt sample3.txt -log sampleall.adjusted.log -out sampleall.adjusted.rawcnv -gcmodel hh550.hg18.gcmodel This will apply the GC-model specified in hh550.hg18.gcmodel for signal adjustment, before generating CNV calls. The results will be written to sampleall.adjusted.rawcnv file. The following LOG information will be printed out: NOTICE: All program notification/warning messages that appear in STDERR will be also written to log file sampleall.adjusted.log
Compared to previous LOG messages, we can see that several extra lines are printed out, describing the changes in WF values and GCWF values. Basically WF measures the total signal fluctuation of a signal intensity file, while GCWF measures the GC-content-caused fluctuation of a signal intensity file. The signal adjustment procedure tries to reduce GC-content-caused fluctuation, and you will generally see a drop in the absolute value of GCWF after adjustment. |