Introduction to Analysis of Genomic Data
Genomic data can come in many forms, including single nucleotide polymorphism (SNP) markers, methylation proportions, or gene, metabolite, or protein expression. Such data are typically stored in a tabular format composed of columns and rows. The two dimensions consist of measurements of many variables (for example, SNPs, mRNA expression values) made on some number of samples (for example, humans, plants, animals, cells). The number of variables in each data set is typically much larger than the number of samples. For genomics data tables, we define “Wide” and “Tall” formats more specifically as follows: a data table in which attributes are columns and samples are rows is denoted as Wide; a Tall data table, on the other hand, has attributes in rows and samples in columns. For most genomics analyses in JMP, you will want your data in wide form, although occasionally you may need to transpose it to tall form.
JMP Pro includes many enhancements to efficiently handle large wide tables with hundreds of thousands of columns and thousands of rows. Keep in mind that JMP stores all data in memory, so you will want to run it on a Windows or Mac machine with a generous amount of memory (a minimum of 32 GB of RAM is recommended). Certain analyses may require long computational times, and this book provides guidance on how best to navigate the numerous available options. With adequate computing hardware, JMP can handle tall problems with over one billion rows and wide problems with over one million columns. Problems this large are challenging and can still require long computational times for certain tasks, but as you become more proficient, it is possible to work at interactive speed with at least 100 million rows or 500,000 columns if you are cautious in what you ask for and how you go about it.
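To get a feel for why memory matters, the following minimal Python sketch estimates the size of a dense numeric table. It is not JMP code. It assumes 8 bytes per cell (double precision), and the row and column counts are hypothetical examples; actual memory use depends on column data types and on the working copies that an analysis creates.

```python
# Back-of-the-envelope memory estimate for a dense numeric table.
# ASSUMPTION: every cell is stored in 8 bytes (double precision).
# Real memory use will differ with column types and analysis overhead.
BYTES_PER_CELL = 8


def table_size_gb(n_rows: int, n_cols: int, bytes_per_cell: int = BYTES_PER_CELL) -> float:
    """Return the approximate table size in gigabytes (1 GB = 10**9 bytes)."""
    return n_rows * n_cols * bytes_per_cell / 1e9


# Hypothetical wide table: 2,000 samples by 500,000 markers.
wide_gb = table_size_gb(n_rows=2_000, n_cols=500_000)

# Hypothetical tall table: 100 million rows by 5 columns.
tall_gb = table_size_gb(n_rows=100_000_000, n_cols=5)

print(f"wide table: about {wide_gb:.1f} GB")
print(f"tall table: about {tall_gb:.1f} GB")
```

Under these assumptions, the raw data alone take up a noticeable share of a 32 GB machine before any analysis results are stored.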
JMP includes several routines for quickly converting between wide data and tall data. The Transpose platform (Tables > Transpose) converts one form directly to the other, making the columns the rows and vice versa. For more advanced pivoting, use Tables > Stack to convert wide to tall and Tables > Split to convert tall to wide.
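The three reshaping operations can be illustrated outside JMP with a small Python sketch. This is only a conceptual analogy, not how JMP implements Transpose, Stack, or Split; the sample names, marker names, and genotype values are hypothetical.

```python
# Hypothetical toy data in wide form: 2 samples (rows) by 3 SNP markers (columns).
samples = ["S1", "S2"]
markers = ["snp1", "snp2", "snp3"]
wide = [
    [0, 1, 2],  # genotypes for S1
    [2, 1, 0],  # genotypes for S2
]

# Transpose: rows become columns and vice versa,
# giving markers in rows and samples in columns.
transposed = [list(column) for column in zip(*wide)]

# Stack: one record per (sample, marker) pair.
stacked = [
    (sample, marker, value)
    for sample, row in zip(samples, wide)
    for marker, value in zip(markers, row)
]

# Split: rebuild the wide layout from the stacked records.
lookup = {(sample, marker): value for sample, marker, value in stacked}
rebuilt = [[lookup[(sample, marker)] for marker in markers] for sample in samples]

print(transposed)
print(stacked[:3])
print(rebuilt == wide)
```

Transposing flips the whole table at once, whereas stacking produces one row per measurement, which is why Stack and Split are better suited to more elaborate pivoting.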
For many genomics problems, it is common to have additional data about the variables. For example, for SNP markers we typically have the chromosome and location for each marker. You should store such data in a separate table, which we will usually refer to as the annotation table. Annotation data should be stored in tall form, with one column containing values that exactly match the column names in the main wide data table. If you convert the main data to tall, you can use Tables > Update or Tables > Join to merge it with the annotation data.
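The exact-match requirement and the merge step can be sketched in Python as follows. This is an illustration of the idea rather than JMP's Update or Join behavior; the marker names, chromosomes, positions, and genotype values are hypothetical.

```python
# Hypothetical annotation table in tall form: one row per marker.
annotation = [
    {"marker": "snp1", "chromosome": "1", "position": 10_500},
    {"marker": "snp2", "chromosome": "1", "position": 48_200},
    {"marker": "snp3", "chromosome": "2", "position": 7_300},
]

# Column names of the main wide table (hypothetical).
wide_columns = ["snp1", "snp2", "snp3"]

# Check that every wide-table column has an exactly matching annotation row.
annotated_names = {row["marker"] for row in annotation}
missing = [name for name in wide_columns if name not in annotated_names]

# Main data after conversion to tall form: markers in rows, samples in columns.
tall = [
    {"marker": "snp1", "S1": 0, "S2": 2},
    {"marker": "snp2", "S1": 1, "S2": 1},
    {"marker": "snp3", "S1": 2, "S2": 0},
]

# Merge on the shared marker name, keeping all columns from both tables.
by_marker = {row["marker"]: row for row in annotation}
merged = [{**row, **by_marker[row["marker"]]} for row in tall]

print(missing)
print(merged[0])
```

Running the name check before merging makes mismatches, such as differences in case or stray whitespace, visible early instead of silently producing rows without annotation.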
Other data measured on the samples (for example, treatment variables and covariates such as gender) are often called Experimental Design variables. These should be included directly in the main wide data table along with the molecular variables. It is often informative to color rows of the wide table using one or more of the experimental design variables via Rows > Color or Mark by Column.
Although this book focuses on wide genomic data, many of the same analyses can be done on large data sets from other disciplines. For example, high-throughput sensor measurements are now being made across science and engineering in fields such as manufacturing, semiconductors, and health research. Each field tends to have its own terminology specific to its measurement systems and variable names, and once this is learned, transferring analysis objectives and solutions tends to be straightforward.
A zip file containing all of the initial JMP tables used in the various analyses in this book is available here for you to download.