{ "cells": [ { "cell_type": "raw", "id": "ed4d7f49", "metadata": { "raw_mimetype": "text/restructuredtext" }, "source": [ ".. py:currentmodule:: emat.analysis.feature_scoring" ] }, { "cell_type": "code", "execution_count": null, "id": "1dc3aab9", "metadata": {}, "outputs": [], "source": [ "import emat\n", "emat.versions()" ] }, { "cell_type": "raw", "id": "87a050c2", "metadata": { "raw_mimetype": "text/restructuredtext" }, "source": [ ".. _methodology-feature-scoring:" ] }, { "cell_type": "markdown", "id": "ab57d071", "metadata": {}, "source": [ "# Feature Scoring\n", "\n", "Feature scoring is a methodology for identifying what model inputs (in machine \n", "learning terminology, “features”) have the greatest relationship to the outputs. \n", "The relationship is not necessarily linear, but rather can be any arbitrary \n", "linear or non-linear relationship. For example, consider the function:" ] }, { "cell_type": "code", "execution_count": null, "id": "02da1850", "metadata": {}, "outputs": [], "source": [ "import numpy\n", "\n", "def demo(A=0,B=0,C=0,**unused):\n", " \"\"\"\n", " Y = A/2 + sin(6πB) + ε\n", " \"\"\"\n", " return {'Y':A/2 + numpy.sin(6 * numpy.pi * B) + 0.1 * numpy.random.random()}" ] }, { "cell_type": "markdown", "id": "e4bf3d0c", "metadata": {}, "source": [ "We can readily tell from the functional form that the *B* term is the\n", "most significant when all parameter vary in the unit interval, as the \n", "amplitude of the sine wave attached to *B* is 1 (although the relationship \n", "is clearly non-linear) while the maximum change\n", "in the linear component attached to *A* is only one half, and the output\n", "is totally unresponsive to *C*.\n", "\n", "To demonstrate the feature scoring, we can define a scope to explore this \n", "demo model:" ] }, { "cell_type": "code", "execution_count": null, "id": "7a066e01", "metadata": {}, "outputs": [], "source": [ "demo_scope = emat.Scope(scope_file='', scope_def=\"\"\"---\n", "scope:\n", " name: demo\n", "inputs:\n", " A:\n", " ptype: exogenous uncertainty\n", " dtype: float\n", " min: 0\n", " max: 1\n", " B:\n", " ptype: exogenous uncertainty\n", " dtype: float\n", " min: 0\n", " max: 1\n", " C:\n", " ptype: exogenous uncertainty\n", " dtype: float\n", " min: 0\n", " max: 1\n", "outputs:\n", " Y: \n", " kind: info\n", "\"\"\")" ] }, { "cell_type": "markdown", "id": "cd044400", "metadata": {}, "source": [ "And then we will design and run some experiments to generate data used for\n", "feature scoring." ] }, { "cell_type": "code", "execution_count": null, "id": "7dfcd7e1", "metadata": {}, "outputs": [], "source": [ "from emat import PythonCoreModel\n", "demo_model = PythonCoreModel(demo, scope=demo_scope)\n", "experiments = demo_model.design_experiments(n_samples=5000)\n", "experiment_results = demo_model.run_experiments(experiments)" ] }, { "cell_type": "raw", "id": "a026a4fc", "metadata": { "raw_mimetype": "text/restructuredtext" }, "source": [ "The :func:`feature_scores` method from the `emat.analysis` subpackage allows for\n", "feature scoring based on the implementation found in the EMA Workbench." ] }, { "cell_type": "code", "execution_count": null, "id": "1cf06964", "metadata": {}, "outputs": [], "source": [ "from emat.analysis import feature_scores\n", "fs = feature_scores(demo_scope, experiment_results)\n", "fs" ] }, { "cell_type": "raw", "id": "300edaa4", "metadata": { "raw_mimetype": "text/restructuredtext" }, "source": [ "Note that the :func:`feature_scores` depend on the *scope* (to identify what are input features\n", "and what are outputs) and the *experiment_results*, but not on the model itself. \n", "\n", "We can plot each of these input parameters using the `display_experiments` method,\n", "which can help visualize the underlying data and exactly how *B* is the most important\n", "feature for this example." ] }, { "cell_type": "code", "execution_count": null, "id": "61f450e5", "metadata": {}, "outputs": [], "source": [ "from emat.analysis import display_experiments\n", "fig = display_experiments(demo_scope, experiment_results, render=False, return_figures=True)['Y']\n", "fig.update_layout(\n", " xaxis_title_text =f\"A (Feature Score = {fs.data.loc['Y','A']:.3f})\",\n", " xaxis2_title_text=f\"B (Feature Score = {fs.data.loc['Y','B']:.3f})\",\n", " xaxis3_title_text=f\"C (Feature Score = {fs.data.loc['Y','C']:.3f})\",\n", ")\n", "from emat.util.rendering import render_plotly\n", "render_plotly(fig, '.png')" ] }, { "cell_type": "markdown", "id": "ba237ad3", "metadata": {}, "source": [ "One important thing to consider is that changing the range of the input parameters \n", "in the scope can significantly impact the feature scores, even if the underlying \n", "model itself is not changed. For example, consider what happens to the features\n", "scores when we expand the range of the uncertainties:" ] }, { "cell_type": "code", "execution_count": null, "id": "9aeb0c1d", "metadata": {}, "outputs": [], "source": [ "demo_model.scope = emat.Scope(scope_file='', scope_def=\"\"\"---\n", "scope:\n", " name: demo\n", "inputs:\n", " A:\n", " ptype: exogenous uncertainty\n", " dtype: float\n", " min: 0\n", " max: 5\n", " B:\n", " ptype: exogenous uncertainty\n", " dtype: float\n", " min: 0\n", " max: 5\n", " C:\n", " ptype: exogenous uncertainty\n", " dtype: float\n", " min: 0\n", " max: 5\n", "outputs:\n", " Y: \n", " kind: info\n", "\"\"\")" ] }, { "cell_type": "code", "execution_count": null, "id": "666fa035", "metadata": {}, "outputs": [], "source": [ "wider_experiments = demo_model.design_experiments(n_samples=5000)\n", "wider_results = demo_model.run_experiments(wider_experiments)" ] }, { "cell_type": "code", "execution_count": null, "id": "61cb58a9", "metadata": {}, "outputs": [], "source": [ "from emat.analysis import feature_scores\n", "wider_fs = feature_scores(demo_model.scope, wider_results)\n", "wider_fs" ] }, { "cell_type": "code", "execution_count": null, "id": "276e5c5f", "metadata": {}, "outputs": [], "source": [ "fig = display_experiments(demo_model.scope, wider_results, render=False, return_figures=True)['Y']\n", "fig.update_layout(\n", " xaxis_title_text =f\"A (Feature Score = {wider_fs.data.loc['Y','A']:.3f})\",\n", " xaxis2_title_text=f\"B (Feature Score = {wider_fs.data.loc['Y','B']:.3f})\",\n", " xaxis3_title_text=f\"C (Feature Score = {wider_fs.data.loc['Y','C']:.3f})\",\n", ")\n", "render_plotly(fig, '.png')" ] }, { "cell_type": "markdown", "id": "fe876c26", "metadata": {}, "source": [ "The pattern has shifted, with the sine wave in *B* looking much more like the random noise,\n", "while the linear trend in *A* is now much more important in predicting the output, and\n", "the feature scores also shift to reflect this change." ] }, { "cell_type": "markdown", "id": "5cfe4ddd", "metadata": {}, "source": [ "## Road Test Feature Scores\n", "\n", "We can apply the feature scoring methodology to the Road Test example \n", "in a similar fashion. To demonstrate scoring, we'll first load and run\n", "a sample set of experients." ] }, { "cell_type": "code", "execution_count": null, "id": "2d49b0b1", "metadata": {}, "outputs": [], "source": [ "from emat.model.core_python import Road_Capacity_Investment\n", "road_scope = emat.Scope(emat.package_file('model','tests','road_test.yaml'))\n", "road_test = PythonCoreModel(Road_Capacity_Investment, scope=road_scope)\n", "road_test_design = road_test.design_experiments(sampler='lhs')\n", "road_test_results = road_test.run_experiments(design=road_test_design)" ] }, { "cell_type": "raw", "id": "e208b4c0", "metadata": { "raw_mimetype": "text/restructuredtext" }, "source": [ "With the experimental results in hand, we can use the same :func:`feature_scores`\n", "function to compute the scores." ] }, { "cell_type": "code", "execution_count": null, "id": "18f0e3e8", "metadata": {}, "outputs": [], "source": [ "feature_scores(road_scope, road_test_results)" ] }, { "cell_type": "raw", "id": "937d7c88", "metadata": { "raw_mimetype": "text/restructuredtext" }, "source": [ "By default, the :func:`feature_scores`\n", "function returns a stylized DataFrame, but we can also \n", "use the `return_type` argument to get a plain 'dataframe'\n", "or a rendered svg 'figure'." ] }, { "cell_type": "code", "execution_count": null, "id": "9f78f332", "metadata": {}, "outputs": [], "source": [ "feature_scores(road_scope, road_test_results, return_type='dataframe')" ] }, { "cell_type": "code", "execution_count": null, "id": "18f5ae85", "metadata": {}, "outputs": [], "source": [ "feature_scores(road_scope, road_test_results, return_type='figure')" ] }, { "cell_type": "markdown", "id": "3e08ca6d", "metadata": {}, "source": [ "The colors on the returned DataFrame highlight the most important input features\n", "for each performance measure (i.e., in each row). The yellow highlighted cell \n", "indicates the most important input feature for each output feature, and the \n", "other cells are colored from yellow through green to blue, showing high-to-low\n", "importance. These colors are from matplotlib's default \"viridis\" colormap. \n", "A different colormap can be used by giving a named colormap in the `cmap`\n", "argument." ] }, { "cell_type": "code", "execution_count": null, "id": "7dc922c8", "metadata": {}, "outputs": [], "source": [ "feature_scores(road_scope, road_test_results, cmap='copper')" ] }, { "cell_type": "markdown", "id": "0b354aa8", "metadata": {}, "source": [ "You may also notice small changes in the numbers given in the two tables above. This\n", "occurs because the underlying algorithm for scoring uses a random trees algorithm. If\n", "you need to have stable (replicable) results, you can provide an integer in the \n", "`random_state` argument." ] }, { "cell_type": "code", "execution_count": null, "id": "7987e978", "metadata": {}, "outputs": [], "source": [ "feature_scores(road_scope, road_test_results, random_state=1, cmap='bone')" ] }, { "cell_type": "markdown", "id": "85c7a9c4", "metadata": {}, "source": [ "Then if we call the function again with the same `random_state`, we get the same numerical result." ] }, { "cell_type": "code", "execution_count": null, "id": "df376925", "metadata": {}, "outputs": [], "source": [ "feature_scores(road_scope, road_test_results, random_state=1, cmap='YlOrRd')" ] }, { "cell_type": "markdown", "id": "a6e3b1e6", "metadata": {}, "source": [ "## Interpreting Feature Scores\n", "\n", "The correct interpretation of feature scores is obviously important. As noted above,\n", "the feature scores can reveal both linear and non-linear relationships. But the scores\n", "themselves give no information about which is which. \n", "\n", "In addition, while the default feature scoring algorithm generates scores that total \n", "to 1.0, it does not necessarily map to dividing up the explained variance. Factors that\n", "have little to no effect on the output still are given non-zero feature score values.\n", "You can see an example of this in the \"demo\" function above; that simple example \n", "literally ignores the \"C\" input, but it has a non-zero score assigned. If there are\n", "a large number of superfluous inputs, they will appear to reduce the scores attached\n", "to the meaningful inputs.\n", "\n", "It is also important to remember that these scores do not fully reflect \n", "any asymmetric relationships in the data. A feature may be very important for some portion\n", "of the range of a performance measure, and less important in other parts of the range.\n", "For example, in the Road Test model, the \"expand_capacity\" lever has a highly asymmetric\n", "impact on the \"net_benefits\" measure: it is very important in\n", "determining negative values (when congestion isn't going to be bad due to low \"input_flow\" volumes, \n", "the net loss is limited by how much we spend), but not so important for positive values \n", "(nearly any amount of expansion has a big payoff if congestion is going to be bad due \n", "to high \"input_flow\" volume). If we plot this three-way relationship specifically, we can \n", "observe it on one figure:" ] }, { "cell_type": "code", "execution_count": null, "id": "ce942b13", "metadata": {}, "outputs": [], "source": [ "from matplotlib import pyplot as plt\n", "fig, ax = plt.subplots()\n", "ax = road_test_results.plot.scatter(\n", " c='net_benefits', \n", " y='expand_capacity', \n", " x='input_flow', \n", " cmap='coolwarm',\n", " ax=ax,\n", ")" ] }, { "cell_type": "markdown", "id": "29c89764", "metadata": {}, "source": [ "Looking at the figure above, we can see the darker red clustered to the right,\n", "and the darker blue clustered in the top left.\n", "However, if we are not aware of this particular three-way relationship\n", "*a priori*, it may be difficult to discover it by looking through various\n", "combinations of three-way relationships. To uncover this kind of relationship,\n", "threshold scoring may be useful." ] }, { "cell_type": "raw", "id": "90f5609f", "metadata": { "raw_mimetype": "text/restructuredtext" }, "source": [ ".. _methodology-threshold-scoring:" ] }, { "cell_type": "markdown", "id": "a9652ef4", "metadata": {}, "source": [ "## Threshold Scoring" ] }, { "cell_type": "raw", "id": "f477e48d", "metadata": { "raw_mimetype": "text/restructuredtext" }, "source": [ "Threshold scoring provides a set of feature scores that don't relate to the\n", "overall magnitude of a performance measure, but rather whether that performance\n", "measure is above or below some threshold level. The :func:`threshold_feature_scores`\n", "function computes such scores for a variety of different thresholds, to develop\n", "a picture of the relationship across the range of outputs. " ] }, { "cell_type": "code", "execution_count": null, "id": "56020e57", "metadata": {}, "outputs": [], "source": [ "from emat.analysis.feature_scoring import threshold_feature_scores\n", "\n", "threshold_feature_scores(road_scope, 'net_benefits', road_test_results)" ] }, { "cell_type": "raw", "id": "1c5481b8", "metadata": { "raw_mimetype": "text/restructuredtext" }, "source": [ "This table of data may be overwhelming even for an analyst to interpret, but\n", "the :func:`threshold_feature_scores` function also offers a violin or ridgeline style figure\n", "that displays similar information in a more digestible format." ] }, { "cell_type": "code", "execution_count": null, "id": "cc135b65", "metadata": {}, "outputs": [], "source": [ "threshold_feature_scores(road_scope, 'net_benefits', road_test_results, return_type='figure.png')" ] }, { "cell_type": "code", "execution_count": null, "id": "2b516df1", "metadata": {}, "outputs": [], "source": [ "threshold_feature_scores(road_scope, 'net_benefits', road_test_results, return_type='ridge figure.svg')" ] }, { "cell_type": "markdown", "id": "ca616eed", "metadata": {}, "source": [ "In these figures, we can see that \"expand_capacity\" is important for negative outcomes,\n", "but for positive outcomes we should focus more on \"input_flow\", and to a lesser but\n", "still meaningful extent also \"value_of_time\"." ] }, { "cell_type": "markdown", "id": "89d3447f", "metadata": {}, "source": [ "## Feature Scoring API " ] }, { "cell_type": "raw", "id": "619b0b07", "metadata": { "raw_mimetype": "text/restructuredtext" }, "source": [ "\n", ".. autofunction:: feature_scores\n", ".. autofunction:: threshold_feature_scores" ] } ], "metadata": { "jupytext": { "encoding": "# -*- coding: utf-8 -*-", "formats": "ipynb,py:percent" }, "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" } }, "nbformat": 4, "nbformat_minor": 5 }