{
 "cells": [
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "768c40fd",
   "metadata": {},
   "outputs": [],
   "source": [
    "import emat\n",
    "emat.versions()"
   ]
  },
  {
   "cell_type": "raw",
   "id": "2730ad90",
   "metadata": {
    "raw_mimetype": "text/restructuredtext"
   },
   "source": [
    ".. _methodology-prim:"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "50cad623",
   "metadata": {},
   "source": [
    "# PRIM\n",
    "\n",
    "Patient rule induction method (PRIM) is a scenario \n",
    "discovery algorithm that operates on an existing set of data \n",
    "with model inputs and outputs (i.e., you have already\n",
    "designed and run a set of experiments using either a\n",
    "core model or a meta-model.)  Generally, a decently \n",
    "size set of experiments (hundreds or thousands) \n",
    "is used to describe the solution space, although \n",
    "no minimum number of experiments is formally required.\n",
    "\n",
    "PRIM is used for locating areas of an outcome space that \n",
    "are of particular interest, which it does by reducing \n",
    "the data size incrementally by small amounts in an \n",
    "iterative process as follows:\n",
    "    \n",
    "- Candidate boxes are generated.  These boxes represent \n",
    "  incrementally smaller sets of the data. Each box removes \n",
    "  a portion of the data based on the levels of a single \n",
    "  input variable.\n",
    "  * For categorical input variables, there is one \n",
    "    box generated for each category with each box \n",
    "    removing one category from the data set.\n",
    "  * For integer and continuous variables, two boxes \n",
    "    are generated – one box that removes a \n",
    "    portion of data representing the smallest set of \n",
    "    values for that input variable and another \n",
    "    box that removes a portion of data representing \n",
    "    the largest set of values for that input.  \n",
    "    The step size for these variables is controlled \n",
    "    by the analyst.\n",
    "- For each candidate box, the relative improvement \n",
    "  in the number of outcomes of interest inside \n",
    "  the box is calculated and the candidate box with \n",
    "  the highest improvement is selected.\n",
    "- The data in the selected candidate box replaces \n",
    "  the starting data and the process is repeated.\n",
    "\n",
    "The process ends based on a stopping criteria.\n",
    "For more details on the algorithm, \n",
    "see [Friedman and Fisher (1999)](http://statweb.stanford.edu/~jhf/ftp/prim.pdf) \n",
    "or [Kwakkel and Jaxa-Rozen (2016)](https://www.sciencedirect.com/science/article/pii/S1364815215301092).\n",
    "\n",
    "The PRIM algorithm is particularly useful for scenario \n",
    "discovery, which broadly is the process of \n",
    "identifying particular scenarios of interest in a \n",
    "large and deeply uncertain dataset.\n",
    "In the context of exploratory modeling, scenario \n",
    "discovery is often used to obtain a better understanding \n",
    "of areas of the uncertainty space where a policy or \n",
    "collection of policies performs poorly because it is \n",
    "often used in tandem with robust search methods for \n",
    "identifying policies that perform well \n",
    "([Kwakkel and Jaxa-Rozen (2016)](https://www.sciencedirect.com/science/article/pii/S1364815215301092))."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "e6b11029",
   "metadata": {},
   "source": [
    "## The Mechanics of using PRIM"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "d67e961b",
   "metadata": {},
   "source": [
    "In order to use PRIM for scenario discovery, the analyst must\n",
    "first conduct a set of experiments.  This includes having both\n",
    "the inputs and outputs of the experiments (i.e., you've already\n",
    "run the model or meta-model)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "684eecb8",
   "metadata": {},
   "outputs": [],
   "source": [
    "import emat.examples\n",
    "scope, db, model = emat.examples.road_test()\n",
    "designed = model.design_experiments(n_samples=5000, sampler='mc', random_seed=42)\n",
    "results = model.run_experiments(designed, db=False)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "dafe831d",
   "metadata": {},
   "source": [
    "In order to use PRIM for scenario discovery, the analyst must\n",
    "also identify what constitutes a case that is \"of interest\".\n",
    "This is essentially generating a True/False label for every \n",
    "case, using some combination of values of the output performance \n",
    "measures as well as (possibly) the values of the inputs.\n",
    "Some examples of possible definitions of \"of interest\" might\n",
    "include:\n",
    "\n",
    "- Cases where total predicted VMT (a performance measure) is below some threshold.\n",
    "- Cases where transit farebox revenue (a performance measure) is above some threshold.\n",
    "- Cases where transit farebox revenue (a performance measure) is above above 50% of\n",
    "  budgeted transit operating cost (a policy lever).\n",
    "- Cases where the average speed of tolled lanes (a performance measure) is less \n",
    "  than free-flow speed but greater than 85% of free-flow speed (i.e., bounded both\n",
    "  from above and from below).\n",
    "- Cases that meet all of the above criteria simultaneously.\n",
    "\n",
    "The salient features of a definition for \"of interest\" is that\n",
    "(a) it can be calculated for each case if given the set \n",
    "of inputs and outputs, and (b) that the result is a True or False value.\n",
    "\n",
    "For this example, we will define \"of interest\" as cases from the \n",
    "Road Test example that have positive net benefits."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "701b68e5",
   "metadata": {},
   "outputs": [],
   "source": [
    "of_interest = results['net_benefits']>0"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "7e8aca44",
   "metadata": {},
   "source": [
    "Having defined the cases of interest, to use PRIM we pass the\n",
    "explanatory data (i.e., the inputs) and the 'of_interest' variable\n",
    "to the `Prim` object, and then invoke the `find_box` method."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "f4902b8e",
   "metadata": {},
   "outputs": [],
   "source": [
    "from emat.analysis.prim import Prim"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "947ef687",
   "metadata": {},
   "outputs": [],
   "source": [
    "discovery = Prim(\n",
    "    model.read_experiment_parameters(design_name='mc'), \n",
    "    of_interest, \n",
    "    threshold=0.02,\n",
    "    scope=scope\n",
    ")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "3fa6755a",
   "metadata": {},
   "outputs": [],
   "source": [
    "box1 = discovery.find_box()"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "ccf47168",
   "metadata": {},
   "source": [
    "The `find_box` command actually provides a number of different possible boxes\n",
    "along a (heuristically) optimal trajectory, trading off coverage against\n",
    "density.  We can plot a static figure showing the tradeoff curve using the \n",
    "`show_tradeoff` command:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "56fdac91",
   "metadata": {},
   "outputs": [],
   "source": [
    "ax = box1.show_tradeoff()"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "d9ac68fe",
   "metadata": {},
   "source": [
    "This figure shows the trade-off between coverage and density.\n",
    "\n",
    "- **Coverage** is percentage of the cases of interest that are in the box\n",
    "  (i.e., number of cases of interest in the box divided by total number of \n",
    "  cases of interest).\n",
    "  The starting point of the PRIM algorithm is the unrestricted full set of cases, \n",
    "  which includes all outcomes of interest, and therefore, the coverage starts at \n",
    "  1.0 and drops as the algorithm progresses.  \n",
    "- **Density** is the share of cases in the box that are case of interest\n",
    "  (i.e., number of cases of interest in the box divided by the total \n",
    "  number of cases in the box). \n",
    "  As the box is reduced, the density will increase (as that is the objective \n",
    "  of the PRIM algorithm).  \n",
    "\n",
    "For the statistically minded, this tradeoff can also be interpreted as\n",
    "the tradeoff between Type I (false positive) and Type II (false negative)\n",
    "error.  High coverage minimizes the false negatives, while high density\n",
    "minimizes false positives.\n",
    "\n",
    "By default, the PRIM algorithm sets the \"selected\" box position as the \n",
    "particular box at the end of the peeling trajectory (highest density, at \n",
    "the top of the top of the tradeoff curve), which has the highest \n",
    "density, but the smallest coverage."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "983de121",
   "metadata": {},
   "outputs": [],
   "source": [
    "box1"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "ade712d1",
   "metadata": {},
   "source": [
    "In a Jupyter notebook, the analyst can interactively view and select other possible\n",
    "boxes using the `tradeoff_selector` method.  This method opens a dynamic viewer\n",
    "which shows the same basic figure as the static one above, but when hovering over\n",
    "each possible tradeoff point, a pop-up box appears with more details about\n",
    "that point, including which dimensions are restricted, and what the restrictions\n",
    "are.  A selection of the preferred tradeoff can be made by clicking on any point.[<sup>†</sup>](#_blank \"This website does not include a live Python kernel to \n",
    "execute commands, so some interactive functionality is not available.\")\n",
    "The currently selected preferred tradeoff is marked with an \"X\" instead \n",
    "of a dot."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "c46dbe59",
   "metadata": {},
   "outputs": [],
   "source": [
    "box1.tradeoff_selector()"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "01526485",
   "metadata": {},
   "source": [
    "In addition to the point-and-click interface, the analyst can also manually set \n",
    "the selection programmatically using Python code, selecting \n",
    "a preferred box (by index) that trades off density and coverage."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "cba05f56",
   "metadata": {},
   "outputs": [],
   "source": [
    "box1.select(30)\n",
    "box1"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "9f821bbb",
   "metadata": {},
   "source": [
    "To help visualize these restricted dimensions better, we can \n",
    "generate a plot of the resulting box,\n",
    "overlaid on a 'pairs' scatter plot matrix (`splom`) of the various restricted \n",
    "dimensions.\n",
    "\n",
    "In the figure below, each of the three restricted dimensions represents\n",
    "both a row and a column of figures.  Each of the off-diagonal charts show \n",
    "bi-dimensional distribution of the data across two of the actively\n",
    "restricted dimensions.  These charts are overlaid with a green rectangle\n",
    "denoting the selected box.  The on-diagonal charts show the relative\n",
    "distribution of cases that are and are not of interest (unconditional\n",
    "on the selected box)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "3c6e035d",
   "metadata": {},
   "outputs": [],
   "source": [
    "box1.splom()"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "c3d0eb73",
   "metadata": {},
   "source": [
    "Depending on the number of experiments in the data and the number \n",
    "and distribution of the cases of interest, it may be clearer to\n",
    "view these figures as a heat map matrix (`hmm`) intead of a splom. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "4f0995ee",
   "metadata": {},
   "outputs": [],
   "source": [
    "box1.hmm()"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "7f2b6e6c",
   "metadata": {},
   "source": [
    "## Non-Rectangular Boxes"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "cf8241e9",
   "metadata": {},
   "source": [
    "Regardless of any analyst intervention in box selection\n",
    "using the `select` method, initial box identified using PRIM\n",
    "will always be a rectangle (or, more generally, a hyper-rectangle).\n",
    "Every restricted dimension is specified independently, not conditional\n",
    "on any other dimension.  \n",
    "\n",
    "A second box can be overlaid after the first to expand the solution\n",
    "to include a non-rectangular area.  The second tier box itself is\n",
    "still another rectangle, but it can be defined using different boundaries,\n",
    "possibly on different dimensions, to give in aggregate a non-convex shape.\n",
    "Creating such a second box is done by calling the `find_box` method again,\n",
    "after the particulars of the first box are finalized."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "b342506e",
   "metadata": {},
   "outputs": [],
   "source": [
    "box2 = discovery.find_box()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "9e85280d",
   "metadata": {},
   "outputs": [],
   "source": [
    "box2"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "16f88dc3",
   "metadata": {},
   "source": [
    "The cases considered for the second box are *only* those cases\n",
    "not within the first box.  Because our first box had a relatively\n",
    "high coverage and high density, we have already captured most of\n",
    "the cases of interest and not too many other cases, so the \n",
    "second box has a much diminished trade-off curve.  The maximum\n",
    "value of the coverage for this second box is one minus the \n",
    "selected final coverage of the first box."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "a7029684",
   "metadata": {},
   "outputs": [],
   "source": [
    "box2.show_tradeoff();"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "db014f34",
   "metadata": {},
   "source": [
    "We can also make a similar scatter plot for this second box.\n",
    "There are 8 restricted dimensions in the currently selected\n",
    "solution, which can result in an unwieldy large scatter plot \n",
    "matrix.  To help, the `splom` command allows you to override\n",
    "the selection of dimensions to display in the rows and/or\n",
    "columns of the resulting figure, by explicitly giving a set\n",
    "of dimensions to display. These don't necessarily need to be\n",
    "restricted dimensions, they can be any dimensions available in\n",
    "the original PRIM analysis.\n",
    "\n",
    "Because this is the second box, the method will only plot the \n",
    "points that remain (i.e. the ones not included in the first\n",
    "box).  This results in a \"blotchy\" appearance for may of the \n",
    "graphs in this figure, as most of the \"of interest\" cases were\n",
    "captured previously."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "0d200e57",
   "metadata": {},
   "outputs": [],
   "source": [
    "box2.splom(\n",
    "    cols=['beta','input_flow','value_of_time','debt_type','yield_curve']\n",
    ")"
   ]
  },
  {
   "cell_type": "raw",
   "id": "a30082da",
   "metadata": {
    "raw_mimetype": "text/restructuredtext"
   },
   "source": [
    ".. include:: prim-api.irst"
   ]
  }
 ],
 "metadata": {
  "jupytext": {
   "cell_metadata_json": true,
   "encoding": "# -*- coding: utf-8 -*-",
   "formats": "ipynb,py:percent"
  },
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}