|
Using calculate_association.pl for genome-wide association analysis The calculate_association.pl program in the GenGen package is written to perform simple single-marker association tests on data sets from genome-wide association studies (GWAS).
IntroductionThe calculate_association.pl program was originally developed in 2006 to perform simple TDT association tests. However, it has gradually evolved into a more sophisticated program for case-control analysis and quantitative traits regression analysis as well. There are certainly many other GWAS analysis software (such as PLINK), but there are several distinct features of the calculate_association.pl program:
However, the functionality of the program is restricted to simple single-marker association tests. It is unable to offer more complicated testing options (such as population structure inference, multi-marker tests) available from other software. As this program is constantly under development, another problem with the program is that future versions may not generate identical results as the current version, and the command line options may not be backward compatible. Finally, as the program has not been widely used, it may contain many un-identified bugs, and may potentially generate biased results. Therefore, it is advised to use different software to test the same data set, and if the results are discordant, it may indicate the presence of a bug. Of course, it would be appreciated if users can send bug reports (if you find some) or other comments to me, and I will try my best to improve the program. To read the detailed documentation of the program, one can use the --man argument when running the program. To see a list of arguments and their functionalities, one can use the --help argument when running the program. When the program is issued without arguments, it will print out abbreviated usage information, including supported arguments (Try to do it now to see a list of currently supported arguments and their functions). Note that you can use both double dash or single dash in the command line (so --output has the same meaning as -output), and you can omit trailing letters of the argument as long as there is no ambiguity with another argument (so the --output argument has the same meaning as the -o or -ou or -out or -outp or -outpu argument).
General overview of two-pass procedure
A general overview of the two-pass procedure used in this program is given in the above figure. Unlike other GWA software that read genotype data for all individuals into memory as a large matrix, the calculate_association.pl program scan the genotype file and process each line (each marker) sequentially. Therefore, although the marker inclusion/exclusion criteria can be used during association test, the individual inclusion/exclusion criteria cannot be applied during the initial run; instead, after the initial run is finished, the individuals failing inclusion criteria will be written to a --remove file, so that in the second run of the program, these individuals are excluded from analysis. I am unable to find the documentation on how exactly other software handles the issue of simultaneous inclusion/exclusion criteria for both markers and individuals. However, I think the two-pass procedure outlined above does make practical sense.
Input filesThe program requires two main input files, a ped_header file and a GT file. Alternatively, to facilitate PLINK users to use the program, it also accepts the input files generated by the --transpose argument in PLINK (TFAM file and TPED file): if you have used PLINK to analyze your data, see the section below for more details on how to generate TFAM/TPED files. These data formats are briefly described below: ped_header fileThe ped_header file can be considered a superset of the normal PED file used in most genetic analysis. It is a tab-delimited text file that records information in one line for each individual, and the first six columns of the line are family id, individual id, father id, mother id, sex and affection status (or quantitative trait). The following columns can be other phenotypes or covariates or genotypes or any information (such as race, name, address, etc) one wants to include. An example of a ped_header file is shown below: AU0215 AU021503 AU021502 AU021501 1 2 White AA In addition to the first 6 columns seen in a typical PED file, there are two extra columns: one specifying race, and the other specifying the genotype of a SNP marker. These two extra columns can be covariates in association analysis. Many more additional columns can be added in the file following the first 6 columns; therefore, it is even possible to use a regular PED file containing all SNP genotypes as a ped_header file. However, it is definitely not recommended to do so; instead one can use “cut -f 1-6” command instead to generate a ped_header file from a regular PED file (if it is tab-delimited).
GT fileThe GT file is a genotype file that contains information for one marker per line. In a canonical GT file, the first three columns for the file are markerid, chromosome and position, while the following columns contain the genotypes for each individual. To explain this in more detail, see the first 10 lines of a canonical GT file below: Name Chr Position sample1 sample2 sample3 sample4 sample5 sample6 sample7 Each line contains 10 tab-delimited columns, and the first three columns are general description of the SNP marker, while the rest columns are the genotypes of the markers. Note that it is okay to have one space between the two alleles in a genotype call (so “AB” and “A B” are both valid genotype calls). The format for the GT file can be slightly relaxed by the --illumina_gtfile_format and --affymetrix_gtfile_format arguments. In the case of Illumina GT file, one can simply export selected columns in the FullDataTable from the BeadStudio software as a GT file, without conforming to the strict criteria for a canonical GT file; instead, the definition of columns can be read from the first line of the GTfile. Missing genotypes can be specified as NC, --, 00 or 0. The first 10 lines of an illumina-GT file is shown below: Index Name Chr Position 4028211240_B.GType 4028211240_B.B Allele Freq 4028211240_B.Log R Ratio 4030178116_B.GType 4030178116_B.B Allele Freq 4030178116_B.Log R Ratio As you can see, the column position Name, Chr and Position columns are determined by the first line (so-called header line of GT file), and the genotype for each marker is also determined by the ***.GType column in the header line. In other words, the Illumina-GTfile can contain much more information (as extra columns) than a canonical GT file, as long as the header line contains enough information for identifying the required columns. The Affymetrix GT format will be explained later in this document. Basically it is the default output file generated by the APT (Affymetrix Power Tool) program with genotype-calling modules.
PLINK TFAM/TPED filesTo facilitate PINK users to use the program, users can simply convert the data file to TFAM/TPED format, and use --plink_tpedfile_format argument in this program to load the TFAM and TPED files. For example, suppose you have a GWAS data set as data.bim, data.bed and data.fam files. Then first use PLINK to generate the TFAM/TPED files: plink --bfile data --out newdata --transpose --recode Next analyze the newdata.tfam and newdata.tped files in calculate_association.pl, by adding the --plink_tpedfile_format argument (or simply --plink argument) to command line.
Individual list fileThe individual list file is used in the --keep or --remove argument that specifies a group of individuals to be used in analysis, or to be excluded from analysis. This is a simple file format, with one family identifier and individual identifier per line, separated by tab character. This format is the same as used by PLINK, so PLINK users can easily specify the inclusion/exclusion criteria using existing files. Note that the --keep and --remove arguments can accept multiple file names separated by comma, or they can be issued multiple times in the command line. This makes it very simple to apply different inclusion/exclusion criteria for association analysis based on multiple filtering files. For example, if one wants to test association on a list of Caucasian individuals from Germany, and suppose that all Caucasian individuals are listed in the indlist.cau file and all german individuals are listed in the indlist.ger file (these two files of course contains some overlaps), then the arguments: --keep indlist.cau,indlist.ger can be used in the program to specify two files together. Alternatively, the arguments: --keep indlist.cau --keep indlist.ger have the same effects. Note that this function is not available in PLINK: it is implemented in calculate_association.pl for convenience when having multiple complicated filtering options.
Marker list fileThe marker list file is used in the --extract or --exclude argument that specifies a group of markers to be used in analysis, or to be excluded from analysis. This is a simple file format, with one marker name per line. This format is the same as used by PLINK, so PLINK users can easily specify the inclusion/exclusion criteria using existing files. Note that the --extract and --exclude arguments can accept multiple file names separated by comma, or they can be issued multiple times in the command line. This is similar to the --keep and --remove arguments for individuals.
Association testsCase-control association testCase-control association test compares genotypes between two phenotype groups (case group and control group). There are typically five different models to test the association of binary phenotypes with bi-allelic genotypes, including allelic association test, Cochran-Armitage trend association test, genotypic association test (2df test), dominant model association test and recessive model association test. All five tests will be performed when the --cc argument is issued, and the results (chi2 and P values) for all five tests will be printed in the output, one marker per line. Notes: To help PLINK users get familiar with the program, the --assoc argument is also supported, and it has the same effect as the --cc argument. However, note that calculate_association.pl compute all five association tests simultaneously, so the results are comparable to using both the --assoc argument and --model argument in two runs in PLINK. Note that all the examples below assumed a ped_header file and a GT file called gt.txt. If the user wants to use PLINK-formatted TFAM/TPED files, it is necessary to add the --plink_tpedfile_format argument to the program. Otherwise error messages will be printed out in the screen.
Simple usageFor example, the following command: [kai@adenine example]$ calculate_association.pl -cc -ab ped_header gt.txt Specifies that the case-control association test should be used (through the --cc argument), and that the genotypes are coded as “AB” alphabet, rather than “ACGT” alphabet (through the -ab argument). The following output will be printed out in the screen: NOTICE: the following marker exclusion criteria is in effect: --geno 0.1 --maf 0.01 --mme 0.1 --hwe 0.001 The above messages contain output to both STDERR and STDOUT. In fact, only the association results (the 11 lines starting from “Name Chr Position”) are printed to STDOUT, but all other messages are printed to STDERR. As we can see from the association results, for each marker, the chi2 values and P values for five models are calculated, but the later 3 tests have “NA” values. This is due to the small sample size: by default the "--cellsize 5" argument is in effect, meaning that the minimum cell size in a contingency table is 5 (otherwise “NA” is printed). Note that the “case_AF” and “cont_AF” refers to case A allele frequency and control A allele frequency, respectively, as opposed to minor allele frequency. Note that for unknown reasons, the test statistic and P values for the trend test (TREND_chi2 and T_P) in my program slightly differs from those reported by the PLINK program. Also note that my program has a different definition of “dominant” and “recessive” as PLINK. For any marker, the terms refers to the effect of B allele over A allele, regardless of which is the minor allele; while for some other programs, the term may be defined differently. In addition, when a cell in the contingency is zero, my program still tries to report test statistic (when --cellsize is set as 0); however, PLINK refuse to report test statistics, even if the --cellsize argument is 0.
With --cellsize argumentWe can try run this program again but setting the --cellsize argument to zero: Name Chr Position A1:A2 case_AF cont_AF ALLELIC_chi2 A_P TREND_chi2 T_P GENO_chi2 G_P DOM_chi2 D_P REC_chi2 R_P In the above command, the “2> /dev/null” is used for suppression of warning/notification messages by redirection. As we can see, most markers now have the statistic values for the genotypic association test, the dominant model and the recessive model.
With --output argumentThe --output argument can be used to specify output files. When --output argument is used, multiple output files will be generated, including a $output.log (replace $output with the specified output file name) file that records the program notification/warning messages; a $output.minfo file that records the no-call rate, minor allele frequency, Hardy-Weinberg P-value, and fraction of nuclear families with Mendelian inconsistency for all markers (regardless of whether they pass quality control criteria); a $output.sinfo file that records non-call rate, fraction of markers with Mendelian inconsistency for all individuals (not that Mendelian error is calculated only for offspring, but not parents); a $output.finfo file that records the fraction of markers with Mendelian inconsistency for all nuclear families; most importantly, a $output.remove file that contains individuals failing the inclusion criteria and should be removed in the next round of analysis. For example, running the following command: [kai@adenine example]$ calculate_association.pl ped_header gt.txt -ab -cell 0 -out temp will generate the temp, temp.finfo, temp.minfo, temp.sinfo, temp.remove files in the same directory. Let’s examine them one by one: Name Chr Position A1:A2 case_AF cont_AF ALLELIC_chi2 A_P TREND_chi2 T_P GENO_chi2 G_P DOM_chi2 D_P REC_chi2 R_P This file contains statistics values for 9 markers that pass the default marker inclusion/threshold. This is the same as what we would see in STDOUT, when not using --output argument. NOTICE: program notification/warning messages that appear in STDERR will be also recorded in log file temp.log This file contains program notification/warning messages. This is almost the same as what we would see in STDERR, when not using --output argument (except the #dm_maf# statement, which is only printed when --output argument is used). The statement like “#dm_maf#” can tell what markers were excluded from association results due to failing the --maf threshold. Similarly, if a marker failed the --hwe threshold or other criteria, they will be printed between the “#dm_hwe#” lines in the log file. Other possible marker exclusion reason include invalid genotype calls (containing alleles other than ABCGT-0, printed between #dm_invalidgt# lines), invalid chromosomes (marker is located in a chromosome other than autosome and X chromosome, printed between #dm_invalidchr# lines), greater than 2 allele (marker genotype contains more than two alleles, printed between #dm_gt2allele# lines), marker failing no-call threshold (printed between #dm_geno# lines), marker failing MAF threshold (printed between #dm_maf# lines), marker failing HWE threshold (printed between #dm_hwe# lines), marker failing marker Mendelian error threshold (printed between #dm_mme# lines). This particular text file format facilitates simple text processing using command such as "fgrep dm_maf -A 10000 | fgrep dm_maf -B 10000" to get a specific group of markers. Name Chr Position genotype_missing minor_allele_frequency hwe_p_value marker_mendel_error comments This file contains marker information for all markers, even those that did not pass inclusion criteria and are not in association results. It is helpful to diagnose why a marker is not included in association results, and to further examine markers with good association P-values to see whether there is something likely to be wrong (like relatively small HWE P-values). Note that the last column in the file is the comment column: it sometimes include some additional comments to explain the problem with this marker in more detail (for example, when the marker has single allele in samples, what is the allele; when the marker has invalid genotypes, what are the invalid ones). family_id individual_id num_analyzed_marker nocall_count nocall_rate mendelian_error_count mendelian_error_rate This file contains sample information. The num_analyzed_marker column contains the total counts of good-quality markers (markers used in association results), while the other columns gives the no-call counts and Mendelian error counts for each individual. Again, the Mendelian error is only considered for offspring, but not parents. (It is possible that in a quartet, one child has excessive Mendelian error, but the other child is perfectly fine, so the trio can still be used in family-based association tests.) [kai@adenine ~/usr/kgenome/trunk/example]$ cat temp.finfo This file is currently not useful and can be safely ignored. [kai@adenine ~/usr/kgenome/trunk/example]$ cat temp.remove This file contains individuals that failed the inclusion criteria and should be removed from analysis during the second run of the program. The two columns are family identifier and individual identifier, respectively. We can check the temp.sinfo file to know why these two individuals are excluded (in this case, both of them have high no-call rate).
Changing marker and individual inclusion/exclusion criteriaSeveral marker and individual inclusion/exclusion criteria can be used (they can be found by issuing the program without arguments): Inclusion/Exclusion Criteria for Markers: Inclusion/Exclusion Criteria for Samples: We can try use some of them to see how results change: [kai@adenine example]$ calculate_association.pl ped_header gt.txt -ab -cell 0 -maf 0.2 -mind 1 In the above example, we specify that markers with Minor Allele Frequency less than 0.2 be excluded, and individuals with missing genotype more than 1 (which in turn means all individuals) be excluded. As we can see, four markers pass the threshold, and all individuals pass the threshold (so no --remove statement is printed in the STDERR).
Permutation procedureFor case-control studies, the permutation procedure randomly flip the case-control status for all samples (but ensure that the same number of cases and controls in the permutated samples as the original samples), then perform GWA association tests on all markers. This procedure is particularly useful, when one want to use pathway-based GWA analysis by the calculate_gsea.pl program. The resulting association results file will contain two additional columsn (chi2_perm and chi2_P_perm) that are read by the calculate_gsea.pl program for testing candidate pathways. An example (using 10 permutation cycles) is shown below: Name Chr Position A1:A2 case_AF cont_AF ALLELIC_chi2 A_P TREND_chi2 T_P GENO_chi2 G_P DOM_chi2 D_P REC_chi2 R_P CHI2_PERM CHI2_P_PERM When the "--cycle 10" argument is set, 10 phenotype permutations will be performed and the marker statistics for each permutation cycle is written in the CHI2_PERM and CHI2_P_PERM columns (the 10 numbers are separated by comma). As there are five different models to perform association tests, the default permutated chi2 and P values are calculated using the alleleic association tests. If one wishes to use a different testing strategy in calculating chi2 and P values, one can use --perm_method argument (1=allelic as default, 2=trend, 3=genotypic, 4=dom, 5=rec). The output association results can be fed into the calculate_gsea.pl program to calculate most significant pathways. See the documentation for calculate_gsea.pl program for more details.
Annotation of association resultsI believe that a good GWA program should provide annotation systems for association results to help better understand the P-values reported in a study. For example, by putting prior biological knowledge into the SNPs, we can assign higher weights to some SNPs, but lower weights for others. The prior knowledge can be anything like linkage peak, plausibility of positive finding, linkage disequilibrium patterns surrounding SNPs and so on. We can also use pathway information, network information, interaction information to better understand the association results. For example, the calculate_gsea.pl program (also in this program package) can be used for testing candidate pathways. In general, to keep the calculate_association.pl program easy and simple, I try to limit its ability to single-marker association test only, while putting more sophisticated functions to other programs that use the same or similar input file formats. Therefore, the calculate_association.pl program has only rudimentary functionalities for annotation of association results; for example: [kai@adenine example]$ calculate_association.pl ped_header gt.txt -ab -cell 0 -maf 0.2 -mind 1 -snppropfile gt.snpgenemap 2> /dev/null The --snppropfile is a file whose first tab-delimited column is marker name, and following columns are SNP annotations (any text annotation should be fine). This annotation is printed at the end of each line of the association results. In our case, the annotation is the neighboring gene for the SNP: [kai@adenine example]$ cat gt.snpgenemap Each SNP is assigned to a gene in the annotation file. However, you can make your own annotation file in very creative way, like marking markers as in prior linkage region, marking markers as candidate gene, marking markers with CNV information and so on. As long as the marker name and marker annotation is separated by a tab character in the same line, this will be a valid annotation file. Generally, after running the calculate_association.pl program, one can sort the association results and take the top 1000 SNPs (by something like "sort -k 7,7g output | head -n 1000"), then load them into a spreadsheet software (for example, Microsoft Excel) for detailed examination. Unlike other GWA software that only gives “marker name” in the results, the output here include chromosome, position and neighboring genes, so it somehow helps better interpretation of results (for example, sort by chr and position in Excel will reveal chromosome regions with multiple high hits, together with neighboring genes). When a SNP looks really good (has extraordinarily small P-value), one should also check it in the $output.minfo file to see whether there is something fishy about its missingness, MAF, HWE and Mendelian inconsistency.
TDT testThe transmission/disequilibrium test examines the over-transmission of a particular allele from heterozygous parents to their offsprings. Strictly speaking, for complex diseases, as we always assume that disease phenotype is linked with every gene, we are using TDT to test for association only. Therefore, the TDT test can be only applied to father-mother-offspring trios in GWA studies. To use TDT test, one can use the --tdt argument: [kai@adenine example]$ calculate_association.pl ped_header gt.txt -ab -tdt Note that only independent trios are used in the TDT test (two trios in two families). The “T:U” column calculate the transmission ratio of A:B alleles in all trios for each marker. If one want to force to test on all trios (for example, when testing linkage for sparse markers), the --force_tdt can be used: [kai@adenine example]$ calculate_association.pl ped_header gt.txt -ab -tdt -force_tdt The permutation procedure in TDT test is quite different from case-control associations. Instead of flipping case-control status, in a TDT-based permutation test, we flip the transmission/untransmission status. In other word, if a heterozygous parent transmit A allele to his/her offspring, during the permutation procedure, the B allele might instead be transmitted (with a 50% chance). The resulting chi2 and P values are calculated and written to the CHI2_PERM and CHI2_P_PERM columns. Again the association results with these two extra columns can be fed into the calculate_gsea.pl program for pathway-based association tests. One example of the output is shown below: [kai@adenine example]$ calculate_association.pl ped_header gt.txt -ab -tdt -force -cycle 10 2> /dev/null
Quantitative trait analysisA simple regression model is used in quantitative trait (QT) analysis in the current implementation of calculate_association.pl program. In the regression model, the dependent variable is the QT, while the independent variable is the genotypes (AA=0, AB=1 and BB=2). The association test is testing the null hypothesis that the slope of the regression is zero. Note that in this regression model, we are assuming an additive model for B alleles. If one wants to use a multiplicative model, one has to manually convert the QT values to logarithm scale in the ped_header file, before running this program. The --qt argument can be used to specify that the QT regression analysis should be performed: [kai@adenine ~/usr/kgenome/trunk/example]$ calculate_association.pl ped_header gt.txt -ab -qt In the output, the Intercept and Slope column represent the regression parameters, while the F test is used to calculate the P value. The slopes for two markers (rs3689947, rs13476331) are extremely small; indeed they should be zero, but due to numerical round-up, they were shown as non-zero. Note that in the above analysis, one marker was discarded due to failure to meet MAF threshold. To see association results for all markers: [kai@adenine ~/usr/kgenome/trunk/example]$ calculate_association.pl ped_header gt.txt -ab -qt -allm 2> /dev/null As we can see, when only one allele is present, the F and F_P values are shown as NA. Similarly, when other exceptions exist, like when invalid genotype is present in the marker, when the marker has more than 2 alleles, when the marker is located in Mitochondria, the F and F_P values will be shown as NA as well. The permutation procedure randomly scrambles the QT values for all individuals to be tested. (This means individuals not specified in --remove argument, and do have non-missing QT values). [kai@adenine ~/usr/kgenome/trunk/example]$ calculate_association.pl ped_header gt.txt -ab -qt -cycle 10 2> /dev/null Unlike association tests using chi2 as test statistic, here in our regression model we used F test statistic, so the two extra columns in the association results are F_PERM and F_P_PERM. The F_PERM values can be used by the calculate_gsea.pl program for pathway-based association tests to identify candidate pathways.
Advanced topics and notesChromosome X processingIn the Januaray 2009 version of the calculate_association.pl program, changes have been made to the chromosome X association procedure, for the sole purpose to match what PLINK reports. This is mainly to address some complaints on the discordance between the two programs on processing chrX markers. I have read the documentation and carefully investigated what PLINK does for case-control association test on chrX markers, and the general observations are that: (1) PLINK reports completely different test statistics for allelic association when the --assoc argument used versus when the --model argument is used (2) when the --assoc argument is used, PLINK treats chrX as if it is autosomes, that is, a male with AA genotype (in reality only one count of A allele) for chrX marker is simply treated as if he is a female with AA genotype (two counts of A allele). (3) when --model argument is used, PLINK disregard any individual annotated as male and only considers female. (What's puzzling here is that the allelic association contingency table in PLINK output is identical to the trend test contingency table, which I do not understand at all; it's not relevant here though). Unlike PLINK, since the calculate_association.pl program simultaneously calculates all five test statistic in case-control association test, I have to make some seemingly strange but indeed somewhat intuitive adjustment to the chrX association procedure. For allelic association test, I follow whatever PLINK does when its --assoc argument is used (treat males as females), and all male heterozygous genotypes are discarded from analysis. For the other four tests, only the female samples will be used in analysis, and all males are discarded. This procedure makes the most sense I think. Parallel computationTypically, when one wants to run many permutations, he/she can run the calculate_association.pl program in parallel in a computational cluster. Several tricks may be helpful here:
For example, to run 400 permutation cycles, the following 10 commands can be executed in parallel in a computer cluster: calculate_association.pl ../nb.tfam ../nb.tped -plink -cycle 40 -seed 1 -perm 2 -noflush > combine.perm1
|
|||||||||||||