Survey SNP-Trait Association

Large-scale genetic mapping studies seek to associate genetic markers of known location, such as SNPs, with various quantitative and qualitative phenotypic traits. The Marker-Trait Association and SNP-Trait Association processes were each developed to address specific needs of these investigations. Marker-Trait Association is especially useful for studies involving multi-allelic markers and some of the more complex modeling techniques. However, it is not particularly efficient at handling very large data sets. SNP-Trait Association was specifically designed for very large genetic data sets, but it lacks some of the more complex options available in Marker-Trait Association. The two processes complement each other well. However, neither Marker-Trait Association nor SNP-Trait Association can accommodate complex survey designs.

Survey SNP-Trait Association addresses this deficiency. It tests for association between various types of traits and the SNP genotypes or alleles of one SNP at a time, taking complex survey designs into account. Two types of analyses can be performed: an ANOVA based on SNP genotypes or a regression testing for a linear trend of SNP alleles. Adjustments can be made for quantitative covariates. Rao-Scott chi-square and F statistics can also be computed for non-continuous traits. P-values from these tests, with adjustments applied if requested, are plotted along the marker map.
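The linear trend analysis described above can be illustrated with a minimal Python sketch. It computes only the sampling-weighted slope of a trait on allele count for a single SNP. The function name and arguments are hypothetical and are not part of JMP Genomics or SAS. The sketch does not reproduce the design-based variance estimation (strata and clusters) that the SURVEYREG procedure performs, so it yields a point estimate only, not a p-value.

```python
def weighted_trend_slope(allele_counts, trait_values, weights):
    """Weighted least-squares slope of a trait on allele count for one SNP.

    allele_counts: copies of the reference allele per individual (0, 1, or 2)
    trait_values:  quantitative trait value per individual
    weights:       sampling weight per individual
    """
    total_weight = sum(weights)
    if total_weight <= 0:
        raise ValueError("sampling weights must sum to a positive value")

    # Weighted means of the predictor and the trait.
    x_bar = sum(w * x for w, x in zip(weights, allele_counts)) / total_weight
    y_bar = sum(w * y for w, y in zip(weights, trait_values)) / total_weight

    # Weighted sum of squares and cross-products around those means.
    sxx = sum(w * (x - x_bar) ** 2 for w, x in zip(weights, allele_counts))
    if sxx == 0:
        raise ValueError("SNP is monomorphic in this sample; slope is undefined")
    sxy = sum(
        w * (x - x_bar) * (y - y_bar)
        for w, x, y in zip(weights, allele_counts, trait_values)
    )
    return sxy / sxx
```

A slope near zero suggests no additive allele effect on the trait. Testing that hypothesis properly requires the survey-adjusted standard error that the process computes.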

See the SURVEYFREQ, SURVEYLOGISTIC, and SURVEYREG procedures in the SAS/STAT User's Guide for more information.

What do I need?

One Input Data Set, which contains all of the marker data, is needed for this process. The sample data set used in the following example, the survey_genotype.sas7bdat data set, was computer-generated and consists of 7000 rows of individuals with 38 columns corresponding to genetic data on these individuals. In this data set, genotypes are presented in the one-column format, with numerical identities assigned based on the prevalence of each allele. Regardless of the actual identifier, individuals homozygous for the most common allele are identified with a 2. Individuals heterozygous for the most common allele and the less common allele are identified with a 1. The six columns at the beginning of the data set (PSU, Stratum, SampWeight, Age, Response, and BMI) contain non-genetic information about the individuals. PSU, Stratum, and SampWeight describe the survey sample design and contain the cluster identifier, stratum identifier, and sampling weights, respectively. Age and BMI (body mass index) are variables that can be used as covariates in the association analysis. The Response column contains the values for the trait that we are interested in mapping in the genome.
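The one-column numeric coding described above can be sketched in Python as follows. This is an illustration under stated assumptions, not the routine used to build the sample data. The function name and the "A/G" input format are hypothetical. Coding individuals who carry no copy of the most common allele as 0 is also an assumption, since the text above states only the 2 and 1 codes.

```python
from collections import Counter


def encode_by_major_allele(genotypes, sep="/"):
    """Recode genotype strings such as "A/G" as counts of the most common allele.

    Missing genotypes (None) are passed through unchanged.
    """
    # Tally every allele observed at this marker across all individuals.
    allele_tally = Counter()
    for genotype in genotypes:
        if genotype is not None:
            allele_tally.update(genotype.split(sep))
    if not allele_tally:
        raise ValueError("no non-missing genotypes to encode")

    # The most prevalent allele defines the coding; ties are broken arbitrarily.
    major_allele = allele_tally.most_common(1)[0][0]

    # 2 = homozygous for the most common allele, 1 = heterozygous,
    # 0 = no copies (assumed code, not stated in the text).
    return [
        None if genotype is None else genotype.split(sep).count(major_allele)
        for genotype in genotypes
    ]


print(encode_by_major_allele(["A/A", "A/G", "G/G", "A/A", None]))
# prints [2, 1, 0, 2, None]
```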

The survey_genotype.sas7bdat data set is partially shown. Note that this is a wide data set; markers are listed in columns, whereas individuals are listed in rows.

A second, optional data set is the Annotation Data Set. This data set contains information, such as gene identity or chromosomal location, for each of the markers. This data set must be a tall data set; each row corresponds to a different marker.

Note: The top-to-bottom order of the rows in the annotation data set must match the left-to-right order of the columns in the input data set. This correspondence is required for markers to be matched appropriately.
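The ordering requirement in the note above can be checked before running the process. The following Python sketch is one way to do so, assuming the marker column names from the input data set and the marker identifiers from the annotation data set have already been read into two lists. The function name and the example marker names are hypothetical.

```python
def first_order_mismatch(marker_columns, annotation_ids):
    """Return the first position where the two marker orders disagree.

    marker_columns: marker column names of the wide input data set, left to right
    annotation_ids: marker identifiers of the tall annotation data set, top to bottom
    Returns None when both lists hold the same markers in the same order.
    """
    for position, (column, annotation) in enumerate(
        zip(marker_columns, annotation_ids)
    ):
        if column != annotation:
            return position

    # All shared positions agree; a length difference is still a mismatch.
    if len(marker_columns) != len(annotation_ids):
        return min(len(marker_columns), len(annotation_ids))
    return None


print(first_order_mismatch(["snp1", "snp2", "snp3"], ["snp1", "snp3", "snp2"]))
# prints 1
```

A return value of None indicates that markers will be matched as intended. Any other value is the zero-based position of the first marker that needs attention.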

The survey_genotype.sas7bdat data set is included in the Sample Data folder that comes with JMP Genomics.

For detailed information about the files and data sets used or created by JMP Life Sciences software, see Files and Data Sets.

Output/Results

The output generated by this process is summarized in a Tabbed report. Refer to the Survey SNP-Trait Association output documentation for detailed descriptions and guides to interpreting your results.