Appendixes | Appendixes | The SAS WHERE Expression

The SAS WHERE Expression
A WHERE Expression is a type of SAS expression that enables you to filter and select observations meeting one or more specific defined criteria. A WHERE expression can be as simple as a single variable name. A WHERE expression can contain a SAS function, or it can be a sequence of operands and operators that define one or more conditions for selecting observations.
A SAS WHERE expression contains the WHERE keyword, one or more operands and one or more operators , and takes the following general form:
WHERE operand operator operand
Note : Blank spaces must be entered between each element in a WHERE expression.
The WHERE keyword alerts SAS to subset the data set.
An operand is an object to be operated on. An operand can be a variable (or column in a SAS data set), a SAS function (the result of a computation or other manipulation), or a constant.
An operator is a symbol that requests a comparison, a logical operation, or arithmetic computation.
WHERE expressions can either be simple or compound . Simple WHERE expressions contain only one condition that must be satisfied. Compound WHERE expressions contain more than one condition that must be satisfied, with each condition being separated by Boolean terms, such as and or or .
For example, to filter only diseased individuals from a data set containing a mixed population of diseased (sick) and healthy (healthy) individuals (as indicated in a column named DiseaseStatus ), you could use the following simple WHERE expression:
WHERE DiseaseStatus = ‘sick’
To simultaneously filter diseased individuals who also have the genotype A/A (as indicated in a column named Marker1 ), you could use the following compound WHERE expression:
WHERE DiseaseStatus = ‘sick’ and Marker1 = ‘A/A’
Using a SAS WHERE Expression
Recall that JMP Life Sciences software runs SAS in the background for manipulation and analysis of genomic data sets. Any statements or commands, including WHERE expressions, that can be used in SAS can also be used here. However, because JMP dialogs function as the front end for generating and running the underlying SAS code, you do not need to understand all of the syntax for writing WHERE expressions. Instead, you are prompted by specific data entry fields (identified by the title: Filter to Include... ) to specify the relevant operands and operators.
The Filter to Include Observations field (shown below), found on many JMP Life Sciences dialogs, simplifies the specification of a WHERE expression.
Note that the WHERE keyword has already been specified. There is no need to type the WHERE keyword. All you need to do is type the operands and the operators in the field.
*
To use the previous example, type DiseaseStatus = ‘sick’ , as shown below, to retain only the diseased individuals from the input data set.
The filter is applied when you click Run .
Specifying an Operand
An operand is defined as the object to be operated on or the condition to be met. Operands can be either variables, SAS functions, or constants.
The variable names used in this statement (for example, GeneId , Position ) must exactly match those in the Input Data Set.
A variable is a column in a SAS data set. Each variable has a name and a type (numeric or character) and contains values.
You must type the name exactly as it appears in the data set.
The values listed in numeric variables must be numeric. Character variables can contain numbers, letters, and symbols. The variable type determines how you specify that variable.
To filter the data based on a numeric variable, type the name of the variable followed by a space, the relevant operator followed by a space, and the condition to be met. For example, to select only those rows whose numeric value in the Freq variable exceeds 0.5, type Freq > 0.5 in the Filter field.
SAS treats numeric values of zero (0) or that are missing (.) as false. All other values are true.
To continue the example listed above, to filter out those frequency observations that are not reported (indicated by the missing value symbol (.)), type Freq in the Filter field.
To filter your data using a character variable, you must enclose the character string defining the condition to be met within quotation marks. For example, to filter only the diseased individuals, type DiseaseStatus = ‘sick’ in the Filter field. You can use either single quotation marks ( ‘x’ ) or double quotation marks ( “x” ). However, you must be consistent in the quotation marks that you use. If you open the condition with a single quotation mark, you must close it using a single quotation mark, and vice-verse. You cannot mix single and double quotation marks ( ‘x” , for example) within the same expression.
A SAS function returns a value from a computation or system manipulation. Most functions use arguments that you supply. To use a SAS function in a WHERE expression, type the name of the function followed by the argument. The argument must be contained within parentheses.
The SUBSTR function examines the character strings in a variable for a specific substring and retains only those observations containing the specified substring. The argument for this function specifies the variable containing the substring, the position within the string at which the substring starts, and the length of the substring. For example, entering the estimation substr (GeneName,1,3) = “cyt” in the Filter field subsets the input data set so that only observations in which the character value in the GeneName column begins with cyt , such as cyt osolic or cyt ochrome are retained. Others, such as lympho cyt e , are not.
Many additional, legitimate SAS functions can be used in WHERE estimations; for example TODAY , which returns the current data. However, most of these tend not to be very useful in analysis and so, are not discussed further here.
Please refer to Base SAS documentation for additional details about use of SAS functions in WHERE expressions, if you require more information.
A constant is the fixed value within a variable for which you are searching. The value is either numeric or character. Constants are also called literals. For example, a constant could be a specific genotype .
As with character and numeric variables, if the constant is a character, you must type either single or double quotation marks before and after the value. Remember, you cannot mix single and double quotation marks (‘x”, for example) within the same expression. Do not use quotation marks if the constant is numeric.
Remember : Surround character constants with either single or double quotation marks. Do not enclose numeric constants in quotation marks.
Specifying an Operator
An operator is a symbol that requests a comparison, a logical operation, or arithmetic computation. When writing the WHERE expression, you should place the operator between two operands.
Comparison operators (also called binary operators) compare a variable with a value or with another variable. Comparison operators propose a relationship and ask SAS to determine whether that relationship holds. Only observations that meet the condition(s) specified are included in the analysis.
The comparison operators, available to you, are shown below.
= or EQ
Entering Gene EQ “EGF” subsets the input data set, retaining only those observations where the value in the Gene column is EGF.
^= or NE
Entering IndividualNum ^= 450 subsets the input data set, retaining only those observations where the value in the IndividualNum column is not equal to 450.
< or LT
Entering Position < 100000 subsets the input data set, retaining only those observations where the position value is less than 100,000.
> or GT
Entering Position GT 100000 subsets the input data set, retaining only those observations where the position value is greater than 100,000.
<= or LE
Entering Position <= 100000 subsets the input data set, retaining only those observations where the position value is less than or equal to 100,000.
>= or GE
Entering Position GE 100000 subsets the input data set, retaining only those observations where the position value is greater than or equal to 100,000.
You can combine comparison operators into compound WHERE expressions. For example, entering 50000 <= Position <= 100000 includes only those observations whose values listed in the Position variable are greater than or equal to 50,000 and are less than or equal to 100,000.
Note : The combination of the > and < symbols (as in 7 > x < 9, for example) is not supported.
Remember : You do not need to type WHERE in the Filter field.
Arithmetic operators enable you to perform a mathematical operation on values in the specified operand.
The arithmetic operators, available to you, are shown below.
Entering x = y*0.1 in the Filter field selects and returns observations whose values in the x column are equal to the corresponding y column values multiplied by 0.1.
Entering x = y/10 in the Filter field selects and returns observations whose values in the x column are equal to the corresponding y column values divided by 10.
Entering x = y+1 in the Filter field selects and returns observations whose values in the x column are equal to the corresponding y column values plus one.
Entering x = y-1 in the Filter field selects and returns observations whose values in the x column are equal to the corresponding y column values less one.
Entering x = y**2 in the Filter field selects and returns observations whose values in the x column are equal to the square of the corresponding y column values.
Remember : You do not need to type WHERE in the Filter field.
Additional operators can be used to select observations matching or containing a specified set of characters, missing values, to concatenate variables, return minimum or maximum values, and so on.
The most useful of these operators are the IN, CONTAINS, and IS MISSING operators shown below.
Entering Window IN (1,4,9,32) subsets the input data set, retaining only those observations where the value in the Window column is equal to either 1, 4, 9, or 32.
Entering GeneName CONTAINS “cyt” subsets the input data set, retaining only those observations where the value in the GeneName column contains the string cyt (for example all of the cytochrome genes).
IS NULL or IS MISSING
Entering GeneName IS MISSING subsets the input data set, retaining only those observations where the value in the GeneName column is absent.
Prefix operators are added to modify existing operators in a WHERE expression. The most commonly used prefix operator is NOT. Placing this prefix before the normal operator selects those observations not matching the specified condition.
Entering GeneName NOT “cyt” subsets the input data set, retaining only those observations where the value in the GeneName column is not cyt .
The + and - symbols, which typically serve as arithmetic operators, also function as prefix operators when put in front of an open parenthesis. Entering z = -(x + y) in the Filter field returns observations whose value for z equals the negative of the sum of the values in x and y.
Compound WHERE Expressions
You might find it useful to combine multiple simple WHERE expressions into a single, compound WHERE expression. Expressions can be joined either by using Boolean terms, such as AND and OR, or by grouping arguments within parentheses.
For example, entering GeneName CONTAINS “cyt” OR “Fed” in the Filter field selects and returns all rows with these character strings in the GeneName columns. Alternatively, entering GeneName CONTAINS (“cyt”, “Fed”) does the same.
Additional Information
Please refer to Base SAS documentation for additional operators, as needed.