[1]:
import emat
emat.versions()
emat 0.5.1, plotly 4.14.3
CART
Classification and Regression Trees (CART) can be used for scenario discovery. They partition the explored space (i.e., the scope) into a number of sections, adding each partition so as to maximize the difference between observations on either side of the new partition divider, subject to some constraints.
The Mechanics of using CART
In order to use CART for scenario discovery, the analyst must first conduct a set of experiments. This includes having both the inputs and outputs of the experiments (i.e., you’ve already run the model or meta-model).
[2]:
import emat.examples
scope, db, model = emat.examples.road_test()
designed = model.design_experiments(n_samples=5000, sampler='mc', random_seed=42)
results = model.run_experiments(designed, db=False)
In order to use CART for scenario discovery, the analyst must also identify what constitutes a case that is “of interest”. This is essentially generating a True/False label for every case, using some combination of values of the output performance measures as well as (possibly) the values of the inputs. Some examples of possible definitions of “of interest” might include:
- Cases where total predicted VMT (a performance measure) is below some threshold.
- Cases where transit farebox revenue (a performance measure) is above some threshold.
- Cases where transit farebox revenue (a performance measure) is above 50% of budgeted transit operating cost (a policy lever).
- Cases where the average speed of tolled lanes (a performance measure) is less than free-flow speed but greater than 85% of free-flow speed (i.e., bounded both from above and from below).
- Cases that meet all of the above criteria simultaneously.
The salient features of a definition for “of interest” are that (a) it can be calculated for each case given the set of inputs and outputs, and (b) the result is a True or False value.
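As an illustrative sketch, a compound definition like the third bullet above could be written as a boolean combination of pandas Series. The column names and the free-flow speed constant here are hypothetical, and are not part of the Road Test scope:

free_flow_speed = 60.0  # illustrative constant, not from the Road Test scope
revenue_ok = results['farebox_revenue'] > 0.5 * results['operating_cost']  # hypothetical columns
speed_ok = (results['tolled_speed'] < free_flow_speed) & (results['tolled_speed'] > 0.85 * free_flow_speed)
of_interest = revenue_ok & speed_ok  # one True/False label per case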
For this example, we will define “of interest” as cases from the Road Test example that have positive net benefits.
[3]:
of_interest = results['net_benefits'] > 0
Having defined the cases of interest, to use CART we pass the explanatory data (i.e., the inputs) and the ‘of_interest’ variable to the CART object, and then we can invoke the tree_chooser method.
[4]:
from emat.analysis import CART
cart = CART(
    model.read_experiment_parameters(design_name='mc'),
    of_interest,
    scope=scope,
)
[5]:
chooser = cart.tree_chooser()
chooser
The CART algorithm develops a tree that seeks to make the “best” split at each decision point, generating two subsets of the original data that together provide the best (weighted) improvement in the target criterion, which can be either gini impurity or information gain (i.e., entropy reduction).
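For reference, here is a minimal sketch (not part of the EMAT API) of how these two criteria are computed for a node with binary labels, and how a candidate split is scored as the weighted impurity of its two children:

import numpy as np

def gini(y):
    # Gini impurity of a boolean "of interest" label array.
    p = np.mean(y)
    return 1.0 - p**2 - (1.0 - p)**2

def entropy(y):
    # Shannon entropy (in bits) of a boolean label array.
    p = np.mean(y)
    if p == 0.0 or p == 1.0:
        return 0.0
    return -p * np.log2(p) - (1.0 - p) * np.log2(1.0 - p)

def split_score(y_left, y_right, criterion=gini):
    # Weighted impurity of a candidate split; a CART-style splitter
    # effectively chooses the split that minimizes this value.
    n = len(y_left) + len(y_right)
    return (len(y_left) * criterion(y_left) + len(y_right) * criterion(y_right)) / n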
The tree_chooser method returns an interactive widget that allows an analyst to manipulate selected hyperparameters for the decision tree used by CART. The analyst can set the branch splitting criterion (gini impurity or information gain / entropy reduction), the maximum tree depth, and the minimum fraction of observations in any leaf node.
The display shows the decision tree created by CART, including the branching rule at each step, and a short summary of the data in each branch. The coloration of each tree node highlights the progress, with increasing saturation representing improvements in the branching criterion (gini or entropy) and the hue indicating the dominant result in each node. In the example above, the “of interest” cases are most densely collected in the blue nodes.
It is also possible to review the collection of leaf nodes in a tabular display, by calling the boxes_to_dataframe method, which reports the total dimensional restrictions for each box. Here, we pass a True argument to include box statistics as well.
[6]:
cart.boxes_to_dataframe(True)
[6]:
| | coverage | density | gini | entropy | res dim | mass | expand_capacity min | expand_capacity max | input_flow min | input_flow max | value_of_time min | value_of_time max |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| box 0 | 0.025449 | 0.124088 | 0.217380 | 0.540997 | 2 | 0.0548 | NaN | 13.7076 | NaN | 109.5 | NaN | NaN |
| box 1 | 0.013473 | 0.009424 | 0.018671 | 0.076951 | 2 | 0.3820 | 13.7076 | NaN | NaN | 109.5 | NaN | NaN |
| box 2 | 0.049401 | 0.098068 | 0.176902 | 0.462842 | 2 | 0.1346 | NaN | NaN | 109.5 | 123.5 | NaN | 0.114676 |
| box 3 | 0.113772 | 0.498361 | 0.499995 | 0.999992 | 2 | 0.0610 | NaN | NaN | 109.5 | 123.5 | 0.114676 | NaN |
| box 4 | 0.101048 | 0.517241 | 0.499405 | 0.999142 | 3 | 0.0522 | NaN | 39.168739 | 123.5 | NaN | NaN | 0.069707 |
| box 5 | 0.032934 | 0.109453 | 0.194946 | 0.498263 | 3 | 0.0804 | 39.168739 | NaN | 123.5 | NaN | NaN | 0.069707 |
| box 6 | 0.469311 | 0.889362 | 0.196795 | 0.501838 | 3 | 0.1410 | NaN | 59.772684 | 123.5 | NaN | 0.069707 | NaN |
| box 7 | 0.194611 | 0.553191 | 0.494341 | 0.991821 | 3 | 0.0940 | 59.772684 | NaN | 123.5 | NaN | 0.069707 | NaN |
This table shows various leaf node “boxes” as well as the trade-offs between coverage and density in each.
- Coverage is the percentage of the cases of interest that are in each box (i.e., the number of cases of interest in the box divided by the total number of cases of interest).
- Density is the share of cases in each box that are cases of interest (i.e., the number of cases of interest in the box divided by the total number of cases in the box).
For the statistically minded, this tradeoff can also be interpreted as the tradeoff between Type I (false positive) and Type II (false negative) error. High coverage minimizes the false negatives, while high density minimizes false positives.
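These statistics can be computed directly from the definitions above. As a minimal sketch, assuming in_box is a boolean Series marking the cases that fall inside a given box (in_box is an assumption for illustration; of_interest is the label defined earlier):

coverage = (in_box & of_interest).sum() / of_interest.sum()  # share of interesting cases captured
density = (in_box & of_interest).sum() / in_box.sum()        # share of captured cases that are interesting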
As with PRIM, we can select a particular box and then generate a number of visualizations around that selection.
[7]:
box = cart.select(6)
box
[7]:
<CartBox leaf 6 of 8>
coverage: 0.46931
density: 0.88936
mass: 0.14100
● input_flow >= 123.5
● value_of_time >= 0.06970738619565964
● expand_capacity <= 59.77268409729004
To help visualize these restricted dimensions better, we can generate a plot of the resulting box, overlaid on a ‘pairs’ scatter plot matrix (splom) of the various restricted dimensions.
In the figure below, each of the three restricted dimensions represents both a row and a column of figures. Each of the off-diagonal charts shows the bi-dimensional distribution of the data across two of the actively restricted dimensions. These charts are overlaid with a green rectangle denoting the selected box. The on-diagonal charts show the relative distribution of cases that are and are not of interest (unconditional on the selected box).
[8]:
box.splom()
Depending on the number of experiments in the data and the number and distribution of the cases of interest, it may be clearer to view these figures as a heat map matrix (hmm) instead of a splom.
[9]:
box.hmm()
CART API
class emat.analysis.CART(x, y, mass_min=0.05, mode=<RuleInductionType.BINARY: 'binary'>, scope=None, explorer=None)

Bases: emat.workbench.analysis.cart.CART

Classification and Regression Tree Algorithm

CART can be used in a manner similar to PRIM. It provides access to the underlying tree, but it can also show the boxes described by the tree in a table or graph form similar to PRIM.
Parameters:
- x (DataFrame) – The independent variables, generally the experimental design inputs.
- y (array-like, 1 dimension) – The dependent variable of interest.
- mass_min (float, default 0.05) – A value between 0 and 1 indicating the minimum fraction of data points in a terminal leaf.
- mode ({BINARY, CLASSIFICATION, REGRESSION}) – Indicates the mode in which CART is used. Binary indicates binary classification, classification is multiclass classification, and regression is regression.
- scope (Scope) – The EMAT exploratory scope, used primarily to facilitate visualization.
property boxes

Property for getting a list of box limits.
boxes_to_dataframe(include_stats=False)

Convert boxes to a pandas DataFrame.

Parameters: include_stats (bool, default False) – If True, the box statistics will also be retrieved and returned in the same DataFrame as the boxes.
build_tree(criterion='gini', max_depth=None, mass_min=None, min_samples_split=2)

Train CART on the data.
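For example, the tree can be rebuilt non-interactively with explicit hyperparameters instead of using the tree_chooser widget; the particular values shown here are illustrative:

cart.build_tree(criterion='entropy', max_depth=4, mass_min=0.05)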
select(i)

Select a leaf from the CART tree.

This will update the CART box to this selected box, as well as update the explorer, if one is attached to this CART.

Parameters: i (int) – The index of the box to select.
show_tree(mplfig=True, format='png')

Return an image of the tree.

Parameters:
- mplfig (bool, optional) – If True (default), returns a matplotlib figure with the tree; otherwise, returns the output as bytes.
- format ({'png', 'svg'}, default 'png') – The format of the output.
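For example, to display the tree inline in a notebook, or to get the raw bytes for saving to disk (the file name below is illustrative):

fig = cart.show_tree()  # matplotlib figure, the default behavior
tree_bytes = cart.show_tree(mplfig=False, format='svg')  # raw bytes instead
with open('cart_tree.svg', 'wb') as f:  # illustrative file name
    f.write(tree_bytes)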
property stats

Property for getting a list of dicts containing the statistics for each box.
stats_to_dataframe()

Convert stats to a pandas DataFrame.
tree_chooser()

An interactive chooser for setting decision tree hyperparameters.

This method returns an interactive widget that allows an analyst to manipulate selected hyperparameters for the decision tree used by CART. The analyst can set the branch splitting criterion (gini impurity or entropy reduction), the maximum tree depth, and the minimum fraction of observations in any leaf node.

Returns: ipywidgets.widgets.interaction.interactive