Effective Data Mining Using JMP Partition Platform
JMP’s Partition platform enables users to systematically analyze large data sets to discover unsuspected or unknown relationships. JMP uses visualization to create a successive tree of partitions according to a relationship between the X and Y variables. It finds a set of cuts or groupings of X values that best predict a Y value by exhaustively searching all possible cuts or groupings, recursively forming a tree of decision rules until the desired fit is reached. Through the use of visualization and recursive partitioning, JMP makes data mining techniques accessible to a wide range of users.
Data Mining Defined
Data mining is the analysis of large observational data sets with the goal of discovering unsuspected or unknown relationships. The data sets are collected for purposes other than those of the data mining study, and usually consist of a large number of records, a large number of variables measured on each record, or both. Often no obvious relationships exist between the Y (the variable of interest) and the X’s (the other variables in the data set). The starting data tends to be messy: records may be incomplete for some variables, and there may be many missing data values. Extensive pre-processing of large data sets may be required prior to the application of data mining. In the interest of brevity, this paper focuses on the application of data mining after data pre-processing has been completed.
Common Uses for Data Mining
Data mining is used in a variety of applications. In customer research it is used to answer questions like:
- Who buys what?
- Is a customer who buys Product X likely to buy Product Y?
In retail and banking, data mining is used to detect anomalies or deviations from the general pattern and to aid fraud detection. In marketing and sales, data mining is used to identify subsets of prospective customers who are more likely to buy the target product. In semiconductor manufacturing, data mining is used to detect process and equipment routings associated with low yields. In biological research, data mining techniques are used for variable reduction, particularly with microarrays, which have large numbers of variables, sometimes hundreds of thousands.
Data Mining Techniques
Data mining involves a collection of techniques, including:
- Multiple linear and logistic regression
- Classification and regression trees
- Neural nets
- Clustering algorithms
- Association rules
Other techniques are also often classed as data mining methods, including:
- Extensive display and visualization tools
- Variable reduction techniques
- Bayesian methods
Many of these techniques are beyond the capabilities of most users. Fortunately, JMP Statistical Discovery Software solves this problem by providing partitioning and visual discovery techniques that simplify data mining and bring its capabilities into the mainstream.
JMP Recursive Partitioning
The JMP Partition platform recursively partitions data according to a relationship between the X and Y variables, creating a tree of partitions. It finds a set of cuts or groupings of X values that best predict a Y value. It does this by exhaustively searching all possible cuts or groupings. These splits (or partitions) of the data are done recursively forming a tree of decision rules until the desired fit is reached.
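The search-and-recurse idea described above can be sketched in a few lines of Python. This is a minimal illustration, not JMP's implementation: it assumes a yes/no Y coded as 0/1, numeric X's, and entropy reduction as the measure of how well a cut predicts Y (the paper does not state which criterion JMP uses). The names `entropy`, `best_split`, and `grow` are hypothetical.

```python
from math import log2

def entropy(labels):
    """Shannon entropy (in bits) of a list of 0/1 labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    p = sum(labels) / n
    if p in (0.0, 1.0):
        return 0.0
    return -(p * log2(p) + (1 - p) * log2(1 - p))

def best_split(rows, y, features):
    """Try every cut on every feature and return the best one.

    rows: list of dicts, y: list of 0/1 labels, features: list of keys.
    Returns (gain, feature, cut), or None if no cut improves the fit.
    """
    n = len(rows)
    parent = entropy(y)
    best = None
    for f in features:
        values = sorted({r[f] for r in rows})
        for cut in values[1:]:  # candidate rule: x < cut versus x >= cut
            below = [yi for r, yi in zip(rows, y) if r[f] < cut]
            above = [yi for r, yi in zip(rows, y) if r[f] >= cut]
            child = (len(below) * entropy(below) + len(above) * entropy(above)) / n
            gain = parent - child
            if gain > 1e-12 and (best is None or gain > best[0]):
                best = (gain, f, cut)
    return best

def grow(rows, y, features, max_depth):
    """Split recursively until max_depth is reached or no cut helps."""
    split = best_split(rows, y, features) if max_depth > 0 else None
    if split is None:
        return {"n": len(y), "rate": sum(y) / len(y)}  # leaf: size and Y rate
    _, f, cut = split
    lo = [(r, yi) for r, yi in zip(rows, y) if r[f] < cut]
    hi = [(r, yi) for r, yi in zip(rows, y) if r[f] >= cut]
    return {
        "feature": f,
        "cut": cut,
        "below": grow([r for r, _ in lo], [yi for _, yi in lo], features, max_depth - 1),
        "at_or_above": grow([r for r, _ in hi], [yi for _, yi in hi], features, max_depth - 1),
    }
```

Here `max_depth` stands in for "until the desired fit is reached"; in JMP the user decides interactively when to stop splitting.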
Variations of this technique go by many labels and brand names: decision trees, CART™, CHAID™, C4.5, C5, and others. Using JMP, mainstream users find it easy to develop decision trees that explain the relationships between variables in large observational data sets, providing insight into their objectives, as the following example illustrates.
Using Partition - Marketing and Sales Example
The data set illustrated in Figure 1 consists of 9300 customer records from an insurance company database. Marketing and sales are planning a campaign to promote the company's private health insurance policy; however, past campaigns with indiscriminate targeting have demonstrated poor return on investment. Instead, they are interested in identifying and targeting subsets of customers who are more likely to be interested in private medical cover.
In Figure 1, the middle panel on the left-hand side of the data table lists the 17 variables (columns) extracted for each customer from the database, including household size, family income, building insurance premium, and so on. The bottom panel on the left-hand side (Rows) indicates that 3100 (one third) of the 9300 records have been randomly excluded for analysis purposes. The remaining 6200 records will be used to develop a tree of decision rules, and the 3100 excluded records will be used to evaluate the effectiveness of the decision rules.
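Outside JMP, the same random exclusion of one third of the rows could be sketched as follows. The function name `holdout_split` and the seed are hypothetical; only the record counts come from the example.

```python
import random

def holdout_split(n_rows, n_holdout, seed):
    """Randomly exclude n_holdout row indices; return (training, validation)."""
    rng = random.Random(seed)  # fixed seed so the split can be reproduced
    excluded = set(rng.sample(range(n_rows), n_holdout))
    training = [i for i in range(n_rows) if i not in excluded]
    validation = sorted(excluded)
    return training, validation

training, validation = holdout_split(9300, 3100, seed=1)
print(len(training), len(validation))  # 6200 3100
```

Keeping the excluded rows out of the tree-building step is what makes the later evaluation an independent check.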
After selecting the JMP Partition Platform from a menu, the user then selects the Y and X variables as indicated in Figure 2. The Y is the answer to the question, "Does the customer have private medical insurance? (yes/no)." The set of X’s are the remaining variables excluding those pertaining to a direct measurement of whether a customer has private medical insurance, which in this case means excluding the variables Private Health Insurance Premium and Private Health Insurance Premium (Company) from the list of X's.
After clicking OK the user gets the display indicated in Figure 3. Of the 6200 data records used in creating the tree of decision rules, just under 41% have purchased private medical insurance.
To start splitting the data into subgroups according to the values of the X’s, the user clicks the Split button. The first split reveals that the best single predictor of a customer’s tendency to buy private medical insurance is the amount paid for home buildings insurance.
Figure 4 shows that 66% of customers who pay £134 or more for their home buildings insurance also buy private medical insurance, whereas only 10% of customers who pay less than £134 for home buildings insurance buy private medical insurance.
Figure 5 shows the decision tree after a total of 5 splits. A total of six partitions or subgroups have been identified, two of which have more than 80% of the customers buying private medical insurance:
- The partition defined by paying between £134 and £430 for home building insurance and paying more than £590 for car insurance contains 1075 customers, 82% of whom have purchased private medical insurance.
- The partition defined by paying between £134 and £430 for home building insurance, paying more than or equal to £231 for home contents insurance and less than £84 for car insurance contains 452 customers, 82% of whom have purchased private medical insurance.
These two groups combined give a subgroup of 1527 customers, or about 25% of the 6200 customers used in building the decision tree. The mailing list for the marketing campaign has 100,000 customers who currently have no private medical cover. If the sample of 6200 customers used to build the decision tree is representative of the mailing list of 100,000 customers, the marketing team expects approximately 25,000 customers to be in the target group, with approximately 20,000 of them expected to buy private medical cover. This is the group that the team would like to target with their marketing campaign. Before doing so, however, the team validates the decision tree using the 3100 customer records that were excluded from the analysis.
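The projection above is simple arithmetic, reproduced here as a short sketch. It assumes, as the text does, that the proportions seen in the 6200 training records carry over to the mailing list; the variable names are illustrative.

```python
group_sizes = [1075, 452]   # the two partitions with an 82% purchase rate
training_n = 6200           # records used to build the tree
mailing_list_n = 100_000    # customers on the campaign mailing list
purchase_rate = 0.82        # rate observed in both target partitions

target_n = sum(group_sizes)                     # 1527 customers
share = target_n / training_n                   # about 0.246, i.e. roughly 25%
expected_targeted = mailing_list_n * share      # about 24,600, rounded to 25,000 in the text
expected_buyers = expected_targeted * purchase_rate  # about 20,200, rounded to 20,000 in the text

print(target_n, round(share, 3), round(expected_targeted), round(expected_buyers))
```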
Figure 6 contains the result of the independent validation of the decision tree, and shows that the evaluation data set corresponds closely to the results observed with the training data. The two target subgroups again have 82% of the customers also buying private medical cover, and these groups contain 804 of the 3100 customers, or 26%. This independent validation of the decision tree supports the marketing plan to target the two subgroups identified. Provided the sample of 6200 customers used to build the model is representative of the 100,000 customers in the original mailing list, the marketing campaign will target about one quarter of the customers, but those targeted will have a strong interest in purchasing private medical cover.
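The validation step amounts to applying the two decision rules to the held-out records and recomputing the group size and purchase rate. The sketch below shows one way to do that. The field names (`building`, `contents`, `car`, `has_private_medical`) are hypothetical, and treating "between £134 and £430" as including £134 and excluding £430 is an assumption, since the text does not specify the boundaries.

```python
def in_target_group(rec):
    """True if a record falls in either of the two high-response partitions."""
    if not (134 <= rec["building"] < 430):  # assumed boundary handling
        return False
    rule_1 = rec["car"] > 590
    rule_2 = rec["contents"] >= 231 and rec["car"] < 84
    return rule_1 or rule_2

def evaluate(records):
    """Return (targeted count, share of all records, purchase rate in target)."""
    targeted = [r for r in records if in_target_group(r)]
    share = len(targeted) / len(records)
    rate = (
        sum(r["has_private_medical"] for r in targeted) / len(targeted)
        if targeted
        else 0.0
    )
    return len(targeted), share, rate
```

Running `evaluate` on the training records and again on the excluded records, then comparing the two results, mirrors the comparison between Figure 5 and Figure 6.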
JMP provides a range of options to control the partitions created. The user can:
- Specify an X variable to be used as the splitting variable for any split instead of using the variable identified by JMP as being the next best predictor of Y. This is useful in situations when the best X variable has no meaning or has no practical value.
- Continue splitting to identify additional partitions that give good discrimination for the Y.
- Get displays of the predictive quality of the partitioning model including ROC and lift curves.
Partitioning models can be developed for both continuous and discrete Y’s.
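For readers unfamiliar with lift curves, lift is the response rate among the customers targeted divided by the overall response rate. In the example, the two target partitions have an 82% purchase rate against just under 41% overall, a lift of about 2. The following sketch computes cumulative lift at several targeting depths from predicted scores; it is an illustration of the concept rather than JMP's own calculation, and `lift_curve` is a hypothetical name.

```python
def lift_curve(scores, outcomes, steps=10):
    """Return (fraction_targeted, lift) points.

    scores: predicted probability of Y = yes for each record.
    outcomes: actual 0/1 values of Y for the same records.
    """
    ranked = [y for _, y in sorted(zip(scores, outcomes), key=lambda t: -t[0])]
    n = len(ranked)
    base_rate = sum(ranked) / n
    if base_rate == 0:
        raise ValueError("lift is undefined when no record has Y = 1")
    points = []
    for k in range(1, steps + 1):
        top = ranked[: max(1, round(n * k / steps))]  # highest-scoring records
        points.append((k / steps, (sum(top) / len(top)) / base_rate))
    return points
```

For a partition model, the score for each record is simply the Y rate of the leaf it falls into, so records in the same leaf share a score.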
Summary
JMP makes data mining techniques accessible to a wide range of users through its Partition platform, which uses visualization to build a tree of successive partitions according to a relationship between the X and Y variables. It finds a set of cuts or groupings of X values that best predict a Y value by exhaustively searching all possible cuts or groupings, recursively forming a tree of decision rules until the desired fit is reached.







