[1]:
import emat
emat.versions()
emat 0.5.1, plotly 4.14.3

CART

Classification and Regression Trees (CART) can be used for scenario discovery. They recursively partition the explored space (i.e., the scope), with each new split chosen to maximize the difference between the observations on either side of the divider, subject to some constraints.
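To make the partitioning idea concrete, here is a minimal sketch using scikit-learn’s DecisionTreeClassifier directly, with invented input names and an invented “of interest” rule; the actual EMAT interface is demonstrated below.

    import numpy as np
    from sklearn.tree import DecisionTreeClassifier, export_text

    rng = np.random.default_rng(42)
    X = rng.uniform(size=(500, 2))         # two hypothetical inputs
    y = (X[:, 0] > 0.6) & (X[:, 1] < 0.3)  # hypothetical "of interest" rule

    # Each split is chosen greedily to best separate True from False cases,
    # subject to constraints on tree depth and leaf size.
    tree = DecisionTreeClassifier(max_depth=2, min_samples_leaf=25)
    tree.fit(X, y)
    print(export_text(tree, feature_names=['input_a', 'input_b']))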

The Mechanics of using CART

In order to use CART for scenario discovery, the analyst must first conduct a set of experiments. This includes having both the inputs and outputs of the experiments (i.e., you’ve already run the model or meta-model).

[2]:
import emat.examples
scope, db, model = emat.examples.road_test()
designed = model.design_experiments(n_samples=5000, sampler='mc', random_seed=42)
results = model.run_experiments(designed, db=False)

In order to use CART for scenario discovery, the analyst must also identify what constitutes a case that is “of interest”. This is essentially generating a True/False label for every case, using some combination of values of the output performance measures as well as (possibly) the values of the inputs. Some examples of possible definitions of “of interest” might include:

  • Cases where total predicted VMT (a performance measure) is below some threshold.
  • Cases where transit farebox revenue (a performance measure) is above some threshold.
  • Cases where transit farebox revenue (a performance measure) is above 50% of budgeted transit operating cost (a policy lever).
  • Cases where the average speed of tolled lanes (a performance measure) is less than free-flow speed but greater than 85% of free-flow speed (i.e., bounded both from above and from below).
  • Cases that meet all of the above criteria simultaneously.

The salient features of a definition for “of interest” are that (a) it can be calculated for each case given the set of inputs and outputs, and (b) the result is a True or False value.
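For instance, a composite criterion can be assembled by combining boolean Series with elementwise operators. In this sketch, ‘time_savings’ and its threshold are hypothetical stand-ins for any second criterion:

    # Composite "of interest" definition combining two criteria (sketch).
    # Any column of the results DataFrame (measure, lever, or uncertainty)
    # can be used; 'time_savings' and the threshold 10 are illustrative.
    composite_interest = (
        (results['net_benefits'] > 0)
        & (results['time_savings'] > 10)
    )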

For this example, we will define “of interest” as cases from the Road Test example that have positive net benefits.

[3]:
of_interest = results['net_benefits'] > 0

Having defined the cases of interest, we pass the explanatory data (i.e., the inputs) and the ‘of_interest’ variable to the CART object, and then invoke the tree_chooser method.

[4]:
from emat.analysis import CART

cart = CART(
    model.read_experiment_parameters(design_name='mc'),
    of_interest,
    scope=scope,
)
[5]:
chooser = cart.tree_chooser()
chooser

The CART algorithm develops a tree that seeks to make the “best” split at each decision point, generating two subsets of the original data that together provide the best (weighted) improvement in the target criterion, which can be either gini impurity or information gain (i.e., entropy reduction).
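For reference, these are the standard impurity formulas, sketched here in plain Python (this is not code from EMAT, and the class shares and weights are illustrative):

    import numpy as np

    def gini(p):
        # Gini impurity of a node whose class shares p sum to 1.
        p = np.asarray(p)
        return 1.0 - np.sum(p ** 2)

    def entropy(p):
        # Shannon entropy (in bits) of a node with class shares p.
        p = np.asarray(p)
        p = p[p > 0]
        return -np.sum(p * np.log2(p))

    gini([0.3, 0.7])     # 0.42, for a node with 30% cases of interest
    entropy([0.3, 0.7])  # ~0.881 for the same node

    # A split's quality is the parent impurity minus the mass-weighted
    # impurity of its two children:
    parent = gini([0.3, 0.7])
    left, right, w_left = gini([0.6, 0.4]), gini([0.1, 0.9]), 0.4
    parent - (w_left * left + (1 - w_left) * right)  # 0.12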

The tree_chooser method returns an interactive widget that allows an analyst to manipulate selected hyperparameters for the decision tree used by CART. The analyst can set the branch splitting criteria (gini impurity or information gain / entropy reduction), the maximum tree depth, and the minimum fraction of observations in any leaf node.

The display shows the decision tree created by CART, including the branching rule at each step and a short summary of the data in each branch. The coloration of each tree node summarizes the quality of the splits: increasing saturation represents improvements in the branching criterion (gini or entropy), while the hue indicates the dominant result in each node. In the example above, the “of interest” cases are most densely collected in the blue nodes.

It is also possible to review the collection of leaf nodes in a tabular display, by calling the boxes_to_dataframe method, which reports the total dimensional restrictions for each box. Here, we provide a True argument to include box statistics as well.

[6]:
cart.boxes_to_dataframe(True)
[6]:
       Box Statistics                                              expand_capacity       input_flow      value_of_time
       coverage  density   gini      entropy   res dim  mass      min        max        min     max     min       max
box 0  0.025449  0.124088  0.217380  0.540997  2        0.0548    NaN        13.7076    NaN     109.5   NaN       NaN
box 1  0.013473  0.009424  0.018671  0.076951  2        0.3820    13.7076    NaN        NaN     109.5   NaN       NaN
box 2  0.049401  0.098068  0.176902  0.462842  2        0.1346    NaN        NaN        109.5   123.5   NaN       0.114676
box 3  0.113772  0.498361  0.499995  0.999992  2        0.0610    NaN        NaN        109.5   123.5   0.114676  NaN
box 4  0.101048  0.517241  0.499405  0.999142  3        0.0522    NaN        39.168739  123.5   NaN     NaN       0.069707
box 5  0.032934  0.109453  0.194946  0.498263  3        0.0804    39.168739  NaN        123.5   NaN     NaN       0.069707
box 6  0.469311  0.889362  0.196795  0.501838  3        0.1410    NaN        59.772684  123.5   NaN     0.069707  NaN
box 7  0.194611  0.553191  0.494341  0.991821  3        0.0940    59.772684  NaN        123.5   NaN     0.069707  NaN

This table shows various leaf node “boxes” as well as the trade-offs between coverage and density in each.

  • Coverage is the percentage of the cases of interest that are in each box (i.e., the number of cases of interest in the box divided by the total number of cases of interest).
  • Density is the share of cases in each box that are cases of interest (i.e., the number of cases of interest in the box divided by the total number of cases in the box).

For the statistically minded, this tradeoff can also be interpreted as the tradeoff between Type I (false positive) and Type II (false negative) error. High coverage minimizes the false negatives, while high density minimizes false positives.
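Both statistics can be recomputed by hand. The sketch below rebuilds box 6 from the table above, copying the thresholds from its restrictions; it assumes the results DataFrame includes the input columns, as it does in this example:

    # Membership in box 6, reconstructed from its dimensional restrictions.
    in_box = (
        (results['input_flow'] >= 123.5)
        & (results['value_of_time'] >= 0.069707)
        & (results['expand_capacity'] <= 59.772684)
    )
    coverage = (of_interest & in_box).sum() / of_interest.sum()  # ~0.469
    density = (of_interest & in_box).sum() / in_box.sum()        # ~0.889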

As with PRIM, we can make a selection of a particular box, and then generate a number of visualizations around that selection.

[7]:
box = cart.select(6)
box
[7]:
<CartBox leaf 6 of 8>
   coverage: 0.46931
   density:  0.88936
   mass:     0.14100
   ●       input_flow >= 123.5
   ●    value_of_time >= 0.06970738619565964
   ●  expand_capacity <= 59.77268409729004

To help visualize these restricted dimensions better, we can generate a plot of the resulting box, overlaid on a ‘pairs’ scatter plot matrix (splom) of the various restricted dimensions.

In the figure below, each of the three restricted dimensions represents both a row and a column of figures. Each off-diagonal chart shows the bi-dimensional distribution of the data across two of the actively restricted dimensions. These charts are overlaid with a green rectangle denoting the selected box. The on-diagonal charts show the relative distribution of cases that are and are not of interest (unconditional on the selected box).

[8]:
box.splom()

Depending on the number of experiments in the data and the number and distribution of the cases of interest, it may be clearer to view these figures as a heat map matrix (hmm) instead of a splom.

[9]:
box.hmm()

CART API

class emat.analysis.CART(x, y, mass_min=0.05, mode=<RuleInductionType.BINARY: 'binary'>, scope=None, explorer=None)[source]

Bases: emat.workbench.analysis.cart.CART

Classification and Regression Tree Algorithm

CART can be used in a manner similar to PRIM. It provides access to the underlying tree, but it can also show the boxes described by the tree in a table or graph form similar to PRIM.

Parameters:
  • x (DataFrame) – The independent variables, generally the experimental design inputs.
  • y (array-like, 1 dimension) – The dependent variable of interest.
  • mass_min (float, default 0.05) – A value between 0 and 1 indicating the minimum fraction of data points in a terminal leaf.
  • mode ({BINARY, CLASSIFICATION, REGRESSION}) – Indicates the mode in which CART is used: BINARY for binary classification, CLASSIFICATION for multiclass classification, and REGRESSION for regression.
  • scope (Scope) – The EMAT exploratory scope, used primarily to facilitate visualization.
property boxes

Property for getting a list of box limits

boxes_to_dataframe(include_stats=False)

Convert boxes to a pandas DataFrame.

Parameters:include_stats (bool, default False) – If True, the box statistics will also be retrieved and returned in the same DataFrame as the boxes.
build_tree(criterion='gini', max_depth=None, mass_min=None, min_samples_split=2)[source]

Train CART on the data.
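For example, the tree can be rebuilt non-interactively with explicit hyperparameters; the values shown are illustrative, not recommendations:

    cart.build_tree(criterion='entropy', max_depth=4, mass_min=0.10)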

select(i)[source]

Select a leaf from the CART tree.

This will update the CART box to this selected box, as well as update the explorer, if one is attached to this CART.

Parameters:i (int) – The index of the box to select.
show_boxes(together=False)

Display boxes.

Parameters:together (bool, optional) –
show_tree(mplfig=True, format='png')[source]

Return a rendering of the tree.

Parameters:
  • mplfig (bool, optional) – If True (default), returns a matplotlib figure containing the tree; otherwise, returns the output as bytes.
  • format ({‘png’, ‘svg’}, default ‘png’) – The format of the output.
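A short usage sketch, assuming a CART instance named cart as created above:

    fig = cart.show_tree()                            # matplotlib figure
    raw = cart.show_tree(mplfig=False, format='svg')  # raw bytes instead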
property stats

Property for getting a list of dicts containing the statistics for each box.

stats_to_dataframe()

Convert stats to a pandas DataFrame.

tree_chooser()[source]

An interactive chooser for setting decision tree hyperparameters.

This method returns an interactive widget that allows an analyst to manipulate selected hyperparameters for the decision tree used by CART. The analyst can set the branch splitting criteria (gini impurity or entropy reduction), the maximum tree depth, and the minimum fraction of observations in any leaf node.

Return type: ipywidgets.widgets.interaction.interactive