Appendixes | The SAS WHERE Expression

A WHERE Expression is a type of SAS expression that enables you to filter and select observations meeting one or more specific defined criteria. A WHERE expression can be as simple as a single variable name. A WHERE expression can contain a SAS function, or it can be a sequence of operands and operators that define one or more conditions for selecting observations.A SAS WHERE expression contains the WHERE keyword, one or more operands and one or more operators, and takes the following general form:Note: Blank spaces must be entered between each element in a WHERE expression.An operand is an object to be operated on. An operand can be a variable (or column in a SAS data set), a SAS function (the result of a computation or other manipulation), or a constant.An operator is a symbol that requests a comparison, a logical operation, or arithmetic computation.WHERE expressions can either be simple or compound. Simple WHERE expressions contain only one condition that must be satisfied. Compound WHERE expressions contain more than one condition that must be satisfied, with each condition being separated by Boolean terms, such as and or or.For example, to filter only diseased individuals from a data set containing a mixed population of diseased (sick) and healthy (healthy) individuals (as indicated in a column named DiseaseStatus), you could use the following simple WHERE expression:To simultaneously filter diseased individuals who also have the genotype A/A (as indicated in a column named Marker1), you could use the following compound WHERE expression:Recall that JMP Life Sciences software runs SAS in the background for manipulation and analysis of genomic data sets. Any statements or commands, including WHERE expressions, that can be used in SAS can also be used here. However, because JMP dialogs function as the front end for generating and running the underlying SAS code, you do not need to understand all of the syntax for writing WHERE expressions. Instead, you are prompted by specific data entry fields (identified by the title: Filter to Include...) to specify the relevant operands and operators.The Filter to Include Observations field (shown below), found on many JMP Life Sciences dialogs, simplifies the specification of a WHERE expression.

To use the previous example, type DiseaseStatus = ‘sick’, as shown below, to retain only the diseased individuals from the input data set.The variable names used in this statement (for example, GeneId, Position) must exactly match those in the Input Data Set.A variable is a column in a SAS data set. Each variable has a name and a type (numeric or character) and contains values.To filter the data based on a numeric variable, type the name of the variable followed by a space, the relevant operator followed by a space, and the condition to be met. For example, to select only those rows whose numeric value in the Freq variable exceeds 0.5, type Freq > 0.5 in the Filter field.To continue the example listed above, to filter out those frequency observations that are not reported (indicated by the missing value symbol (.)), type Freq in the Filter field.To filter your data using a character variable, you must enclose the character string defining the condition to be met within quotation marks. For example, to filter only the diseased individuals, type DiseaseStatus = ‘sick’ in the Filter field. You can use either single quotation marks (‘x’) or double quotation marks (“x”). However, you must be consistent in the quotation marks that you use. If you open the condition with a single quotation mark, you must close it using a single quotation mark, and vice-verse. You cannot mix single and double quotation marks (‘x”, for example) within the same expression.The SUBSTR function examines the character strings in a variable for a specific substring and retains only those observations containing the specified substring. The argument for this function specifies the variable containing the substring, the position within the string at which the substring starts, and the length of the substring. For example, entering the estimation substr (GeneName,1,3) = “cyt” in the Filter field subsets the input data set so that only observations in which the character value in the GeneName column begins with cyt, such as cytosolic or cytochrome are retained. Others, such as lymphocyte, are not.Many additional, legitimate SAS functions can be used in WHERE estimations; for example TODAY, which returns the current data. However, most of these tend not to be very useful in analysis and so, are not discussed further here.A constant is the fixed value within a variable for which you are searching. The value is either numeric or character. Constants are also called literals. For example, a constant could be a specific genotype.Remember: Surround character constants with either single or double quotation marks. Do not enclose numeric constants in quotation marks.An operator is a symbol that requests a comparison, a logical operation, or arithmetic computation. When writing the WHERE expression, you should place the operator between two operands.Comparison operators (also called binary operators) compare a variable with a value or with another variable. Comparison operators propose a relationship and ask SAS to determine whether that relationship holds. Only observations that meet the condition(s) specified are included in the analysis.

= or EQ

•

• Entering Gene EQ “EGF” subsets the input data set, retaining only those observations where the value in the Gene column is EGF. ^= or NE

• Entering IndividualNum ^= 450 subsets the input data set, retaining only those observations where the value in the IndividualNum column is not equal to 450. < or LT

• Entering Position < 100000 subsets the input data set, retaining only those observations where the position value is less than 100,000. > or GT

• Entering Position GT 100000 subsets the input data set, retaining only those observations where the position value is greater than 100,000. <= or LE

• Entering Position <= 100000 subsets the input data set, retaining only those observations where the position value is less than or equal to 100,000. >= or GE

• Entering Position GE 100000 subsets the input data set, retaining only those observations where the position value is greater than or equal to 100,000.You can combine comparison operators into compound WHERE expressions. For example, entering 50000 <= Position <= 100000 includes only those observations whose values listed in the Position variable are greater than or equal to 50,000 and are less than or equal to 100,000.Arithmetic operators enable you to perform a mathematical operation on values in the specified operand.

• Entering x = y*0.1 in the Filter field selects and returns observations whose values in the x column are equal to the corresponding y column values multiplied by 0.1.

•

• Entering x = y/10 in the Filter field selects and returns observations whose values in the x column are equal to the corresponding y column values divided by 10.

•

• Entering x = y+1 in the Filter field selects and returns observations whose values in the x column are equal to the corresponding y column values plus one.

• Entering x = y-1 in the Filter field selects and returns observations whose values in the x column are equal to the corresponding y column values less one.

• Entering x = y**2 in the Filter field selects and returns observations whose values in the x column are equal to the square of the corresponding y column values.

• Entering Window IN (1,4,9,32) subsets the input data set, retaining only those observations where the value in the Window column is equal to either 1, 4, 9, or 32.

• Entering GeneName CONTAINS “cyt” subsets the input data set, retaining only those observations where the value in the GeneName column contains the string cyt (for example all of the cytochrome genes). IS NULL or IS MISSING

• Entering GeneName IS MISSING subsets the input data set, retaining only those observations where the value in the GeneName column is absent.Prefix operators are added to modify existing operators in a WHERE expression. The most commonly used prefix operator is NOT. Placing this prefix before the normal operator selects those observations not matching the specified condition.

• Entering GeneName NOT “cyt” subsets the input data set, retaining only those observations where the value in the GeneName column is not cyt.The + and - symbols, which typically serve as arithmetic operators, also function as prefix operators when put in front of an open parenthesis. Entering z = -(x + y) in the Filter field returns observations whose value for z equals the negative of the sum of the values in x and y.For example, entering GeneName CONTAINS “cyt” OR “Fed” in the Filter field selects and returns all rows with these character strings in the GeneName columns. Alternatively, entering GeneName CONTAINS (“cyt”, “Fed”) does the same.