Annotation SAS Data Set

Use this field to specify the name and complete path to the annotation SAS data set.

Note: The annotation information is used as a key for a web search. Where applicable, each row in this data set should correspond to the column or pair of columns containing marker genotypes in the Input SAS Data Set. The markers must be in the same order in each data set.
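The ordering requirement in the note above can be checked before a process is run. The following is a minimal Python sketch, not part of JMP Life Sciences. The helper `check_marker_order` and the marker names are hypothetical, and the sketch assumes one genotype column per marker whose name matches the annotation label.

```python
def check_marker_order(genotype_columns, annotation_labels):
    """Return a list of mismatches between marker columns and annotation rows.

    Each positional mismatch is reported as (position, column, label).
    A length difference is reported as (None, n_columns, n_labels).
    An empty list means the two sequences line up row for column.
    """
    mismatches = []
    if len(genotype_columns) != len(annotation_labels):
        mismatches.append((None, len(genotype_columns), len(annotation_labels)))
    for i, (col, label) in enumerate(zip(genotype_columns, annotation_labels)):
        if col != label:
            mismatches.append((i, col, label))
    return mismatches


# Hypothetical marker columns (input data set) and labels (annotation data set)
genotype_columns = ["M1", "M2", "M3"]
annotation_labels = ["M1", "M3", "M2"]
print(check_marker_order(genotype_columns, annotation_labels))
```

An empty result indicates that the annotation rows are in the same order as the marker columns. Any other result lists the positions that need to be reordered.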

Annotation Data Sets

An annotation data set contains biological or chemical information about genes, SNPs, probes, probesets, or peptides, along with their properties. This annotation information comes from various online bioinformatics resources, including government agencies, academic organizations, and commercial entities. It is used to create a custom Annotation Data Set for your analysis.

The structure of an Annotation Data Set and the information that it provides can vary depending on the nature of the experiment, the source of the data and the application that generated it. The table below lists information commonly contained in an Annotation Data Set. Keep in mind that different providers might name annotation information differently.

Item	Description
Probe or Probeset ID	A unique identifier given to a probe or probeset in a probe array or microarray.
GenBank Accession Number	An accession number is a unique identifier given to a biological polymer sequence (such as DNA or a protein) when it is submitted to a sequence database (GenBank, EMBL, DDBJ).
UniGene Cluster ID	A unique identifier given to a cluster of sequences in UniGene.
Gene ID	A unique identifier assigned to a gene record in Entrez Gene. It is an integer and is species specific.
Description	Description of a gene, probe, or probeset.
Chromosomal Location	The physical location of a gene or sequence on a chromosome.
Ensembl ID	A unique identifier assigned to a sequence in Ensembl.
Swiss-Prot ID	A unique identifier assigned to a protein sequence in Swiss-Prot, a curated protein sequence database that provides a high level of annotation (such as the description of protein function, domain structures, post-translational modifications, variants, and so on), a minimal level of redundancy, and significant integration with other databases.
EC Number	A number assigned to an enzyme according to a scheme of standardized enzyme nomenclature developed by the Enzyme Commission of the Nomenclature Committee of the International Union of Biochemistry and Molecular Biology (IUBMB). The EC number is a unique identifier in ENZYME, the Enzyme nomenclature database, maintained at the ExPASy molecular biology server.
OMIM ID	A unique identifier assigned to a genetic disorder in the Online Mendelian Inheritance in Man. OMIM is a directory of human genes and genetic disorders, with links to literature references, sequence records, maps, and related databases.
dbSNP ID	A unique identifier assigned to a single nucleotide polymorphism (SNP) when it is submitted to the SNP database. Also known as an 'rs' ID.
RefSeq Accession	A unique identifier given to a sequence in the NCBI RefSeq database. The RefSeq database is a curated, non-redundant set including genomic DNA contigs, mRNAs and proteins for known genes, and entire chromosomes.
Gene Ontology ID	A unique alphanumerical identifier given to a GO term.
Genomic Location or Coordinate	A location assigned to a gene or a sequence at both the chromosome and sequence levels.

The structure of the Annotation Data Set for genetics processes differs from that of the microarray and proteomics processes.

For genetics, each row in the Annotation Data Set represents a marker or SNP used in the analysis. Its variables typically contain the following information: a name or identifier for each marker, the chromosome or candidate gene on which it is located, its location (in kilobases or centiMorgans, for example), and an accession number that can be used to retrieve more information about the locus from a publicly available online database. This data set can be specified on the Annotation tab found on most of the process dialogs, where the columns can be assigned to various roles:

• Annotation Label Variable - the name or ID variable that is used to label markers in the output

• Annotation Group Variable - the variable, such as chromosome, that can be used to group the analyses and output

• Annotation Location Variable - the variable containing marker locations, used to accurately represent distances between markers in p-value plots

• Accession Number Variable - the variable containing, for example, the GenBank accession number or dbSNP reference sequence ID, used to create buttons on p-value plots that provide direct access to the web page for the selected marker in the appropriate online database

This tab also allows conditional inclusion of markers in your analysis based on particular values of variables from the Annotation Data Set. The criteria can be entered in the Filter to Include Variables field in accordance with SAS syntax for WHERE statements.
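To illustrate how the roles and the inclusion filter interact, here is a minimal Python sketch. It is not JMP or SAS code: the variable names (`Marker`, `Chromosome`, `Position_kb`, `Accession`), the placeholder values, and the filter condition are all assumptions made for the example. The predicate mirrors a SAS WHERE condition such as `Chromosome = 1 and Position_kb < 200`.

```python
# Hypothetical annotation rows; names and values are illustrative only.
annotation = [
    {"Marker": "M1", "Chromosome": 1, "Position_kb": 120.0, "Accession": "ACC_PLACEHOLDER_1"},
    {"Marker": "M2", "Chromosome": 1, "Position_kb": 350.0, "Accession": "ACC_PLACEHOLDER_2"},
    {"Marker": "M3", "Chromosome": 2, "Position_kb": 80.0, "Accession": "ACC_PLACEHOLDER_3"},
    {"Marker": "M4", "Chromosome": 1, "Position_kb": 40.0, "Accession": "ACC_PLACEHOLDER_4"},
]

# Roles as they might be assigned on the Annotation tab (assumed mapping)
roles = {
    "label": "Marker",
    "group": "Chromosome",
    "location": "Position_kb",
    "accession": "Accession",
}


def include(row):
    """Python equivalent of the assumed WHERE condition:
    Chromosome = 1 and Position_kb < 200
    """
    return row["Chromosome"] == 1 and row["Position_kb"] < 200


# Keep only the markers that satisfy the filter
kept = [row for row in annotation if include(row)]

# Group the retained marker labels by the group variable
groups = {}
for row in kept:
    groups.setdefault(row[roles["group"]], []).append(row[roles["label"]])

print(groups)
```

Only markers that satisfy the condition are retained, and the retained markers are then organized by the group variable.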

For the microarray and proteomics processes, the Annotation Data Set must contain a merge key variable whose values exactly match those of some variable in a tall data set.
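Because the merge key values must match exactly, it can help to look for unmatched keys before running a process. The sketch below is a hypothetical Python check, not part of JMP Life Sciences. The helper `unmatched_keys` and the probe names are invented for illustration, and "exactly match" is assumed to mean character-for-character equality, so differences in case or trailing blanks count as mismatches.

```python
def unmatched_keys(annotation_keys, tall_keys):
    """Return the sorted, distinct keys that appear in the tall data set
    but have no exact match among the annotation merge key values."""
    annotation_set = set(annotation_keys)
    return sorted({key for key in tall_keys if key not in annotation_set})


# Hypothetical merge key values
annotation_keys = ["probe_1", "probe_2", "probe_3"]
tall_keys = ["probe_1", "Probe_2", "probe_3 "]  # case and trailing-blank differences

print(unmatched_keys(annotation_keys, tall_keys))
```

An empty list means that every key in the tall data set has a matching annotation row.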

For detailed information about the files and data sets used or created by JMP Life Sciences software, see Files and Data Sets.

To Specify an Annotation SAS Data Set:

The method used for this specification depends on whether JMP is connected to SAS on your local machine or to SAS on a server. Refer to the Specifying Folders, Files, and Data Sets documentation for detailed information.

Click Open to open the data set in JMP for inspection.
