Meta-Model API

The MetaModel class wraps the code required to automatically implement metamodels. The resulting object is callable in the manner expected for a function attached to a PythonCoreModel (i.e., it accepts keyword arguments to set inputs and returns a dictionary of named performance measure outputs).

It is expected that users will not instantiate a MetaModel directly, but will instead create one implicitly through other tools. MetaModels can be created from an existing core model using its create_metamodel_from_design or create_metamodel_from_data methods, or with the create_metamodel() function, which can create a MetaModel directly from a scope and experimental results, without requiring a core model instance. Each of these functions returns a PythonCoreModel that already wraps the MetaModel in an interface ready for use with other TMIP-EMAT tools, so that in typical cases the user does not need to interact with or know anything about the MetaModel class itself, unless they care to dive into the underlying code or mathematical structures.
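In typical use, then, a metamodel is derived from a core model that has already run a design of experiments. A minimal sketch, assuming model is an existing core model instance and 'lhs' is the (hypothetical) name of a design already stored in its database:

    # Derive a metamodel from the stored results of the 'lhs' design.
    # The returned object is a PythonCoreModel wrapping the MetaModel.
    mm = model.create_metamodel_from_design('lhs')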

emat.create_metamodel(scope, experiments=None, metamodel_id=None, db=None, include_measures=None, exclude_measures=None, random_state=None, experiment_stratification=None, suppress_converge_warnings=False, regressor=None, name=None, design_name=None, find_best_metamodeltype=False)[source]

Create a MetaModel from a set of input and output observations.

Parameters:
  • scope (emat.Scope) – The scope for this model.
  • experiments (pandas.DataFrame) – This dataframe should contain all of the experimental inputs and outputs, including values for each uncertainty, lever, constant, and performance measure.
  • metamodel_id (int, optional) – An identifier for this meta-model. If not given, a unique id number will be created randomly (if no db is given) or sequentially, based on any existing metamodels already stored in the database.
  • db (Database, optional) – The database to use for loading and saving metamodels. If none is given here, the metamodel will not be stored in a database; otherwise, the metamodel is automatically saved to the database after it is created.
  • include_measures (Collection[str], optional) – If provided, only output performance measures with names in this set will be included.
  • exclude_measures (Collection[str], optional) – If provided, output performance measures with names in this set will be excluded.
  • random_state (int, optional) – A random state to use in the metamodel regression fitting.
  • experiment_stratification (pandas.Series, optional) – A stratification of experiments, used in cross-validation.
  • suppress_converge_warnings (bool, default False) – Suppress convergence warnings during metamodel fitting.
  • regressor (Estimator, optional) – A scikit-learn estimator implementing a multi-target regression. If not given, a detrended simple Gaussian process regression is used.
  • name (str, optional) – A descriptive name for this metamodel.
  • design_name (str, optional) – The name of the design of experiments from db to use to create the metamodel. Only used if experiments is not given explicitly.
  • find_best_metamodeltype (int, default 0) – Run a search to find the best metamodeltype for each performance measure, repeating each cross-validation step this many times. For more stable results, choose 3 or more, although larger numbers will be slow. If domain knowledge about the normal expected range and behavior of each performance measure is available, it is better to give the metamodeltype explicitly in the Scope.
Returns:

A callable object that, when called as a function, accepts keyword arguments as inputs and returns a dictionary of (measure name: value) pairs.

Return type:

PythonCoreModel
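A metamodel can thus be built without a core model instance at all. A sketch, assuming my_scope is an existing emat.Scope and results is a DataFrame of previously evaluated experiments (the measure name below is hypothetical):

    import emat

    # Build a metamodel from a scope plus existing experimental results.
    # `results` must contain a column for every uncertainty, lever,
    # constant, and performance measure defined in the scope.
    meta = emat.create_metamodel(
        scope=my_scope,
        experiments=results,
        include_measures=['total_delay'],  # hypothetical measure name
        suppress_converge_warnings=True,
    )
    # `meta` is a PythonCoreModel, ready for use with other TMIP-EMAT tools.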

class emat.MetaModel(input_sample, output_sample, metamodel_types=None, disabled_outputs=None, random_state=None, sample_stratification=None, suppress_converge_warnings=False, regressor=None, use_best_cv=True)[source]

A Gaussian process regression-based meta-model.

The MetaModel is a callable object that provides a standard EMA Workbench Python interface, taking keyword arguments for parameters and returning a Python dictionary of named outcomes.

Parameters:
  • input_sample (pandas.DataFrame) – A set of experimental parameters, where each row in the dataframe is an experiment that has already been evaluated using the core model. Each column will be a required keyword parameter when calling this meta-model.
  • output_sample (pandas.DataFrame) – A set of experimental performance measures, where each row in the dataframe is the results of the experiment evaluated using the core model. Each column of this dataframe will be a named output value in the returned dictionary from calling this meta-model.
  • metamodel_types (Mapping, optional) –

    If given, the keys of this mapping should include a subset of the columns in output_sample, and the values indicate the metamodel type for each performance measure, given as str. Available metamodel types include:

    • log: The natural log of the performance measure is taken before fitting the regression model. This is appropriate only when the performance measure will always give a strictly positive outcome. If the performance measure can take on non-positive values, this may result in errors.
    • log1p: The natural log of 1 plus the performance measure is taken before fitting the regression model. This is preferred to the log transform when the performance measure is only guaranteed to be non-negative, rather than strictly positive.
    • logxp(X): The natural log of X plus the performance measure is taken before fitting the regression model. This allows shifting the position of the regression intercept to a point other than 0.
    • clip(LO,HI): A linear model is used, but results are truncated to the range (LO, HI). Set either value to None for a one-sided truncation range.
    • linear: No transforms are made. This is the default when a performance measure is not included in metamodel_types.
  • disabled_outputs (Collection, optional) – A collection of disabled outputs. All names in this collection will appear in the outcomes dictionary returned when this meta-model is evaluated, but with a value of None. It is valid to include names that also appear in the columns of output_sample, although the principal use of this argument is for names that do not appear there; disabling an output that is present in output_sample does not remove it from the computational process.
  • random_state (int, optional) – A random state, passed to the created regression (but only if that regressor includes a ‘random_state’ parameter).
  • regressor (Estimator, optional) – A scikit-learn estimator implementing a multi-target regression. If not given, a detrended simple Gaussian process regression is used.
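Although users typically obtain a MetaModel implicitly (see above), a sketch of direct construction using wholly synthetic data may clarify the constructor arguments, in particular the metamodel_types mapping (all input and output names here are hypothetical):

    import numpy as np
    import pandas as pd
    from emat import MetaModel

    rng = np.random.default_rng(0)

    # Fifty synthetic prior experiments with two inputs...
    input_sample = pd.DataFrame({
        'speed_limit': rng.uniform(45, 75, size=50),
        'toll_price': rng.uniform(0.0, 2.0, size=50),
    })
    # ...and two measured outputs, one of which is strictly positive.
    output_sample = pd.DataFrame({
        'total_delay': np.exp(rng.normal(size=50)),
        'transit_share': rng.uniform(0.0, 0.4, size=50),
    })

    mm = MetaModel(
        input_sample,
        output_sample,
        # Log-transform the strictly positive measure before fitting.
        metamodel_types={'total_delay': 'log'},
    )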
__call__(*args, **kwargs)[source]

Evaluate the meta-model.

Parameters: **kwargs – All defined (meta)model parameters are passed as keyword arguments, including both uncertainties and levers.
Returns: A single dictionary containing all performance measure outcomes.
Return type: dict
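Continuing the sketch above, each column of input_sample becomes a required keyword argument when the metamodel is called:

    # Evaluate the fitted metamodel at a single point.
    outcomes = mm(speed_limit=60.0, toll_price=1.25)
    # outcomes is a dict, e.g. {'total_delay': ..., 'transit_share': ...}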
compute_std(*args, **kwargs)[source]

Evaluate standard deviations of estimates generated by the meta-model.

Parameters:
  • df (pandas.DataFrame, optional) – Input parameter values as a DataFrame, with one experiment per row; an alternative to passing individual keyword arguments.
  • **kwargs – All defined (meta)model parameters are passed as keyword arguments, including both uncertainties and levers.
Returns:

A single dictionary containing the standard deviations of the estimates of all performance measure outcomes.

Return type:

dict
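Continuing the same sketch, the uncertainty of the estimates at a point can be queried in the same manner as the estimates themselves:

    # Standard deviations of the metamodel estimates at a single point.
    stds = mm.compute_std(speed_limit=60.0, toll_price=1.25)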

cross_val_predicts(cv=5)[source]

Generate cross validated predictions using this meta-model.

Parameters: cv (int, default 5) – The number of folds to use in cross-validation. Set to zero for leave-one-out (i.e., the maximum number of folds), which may be quite slow.
Returns: The cross-validated predictions.
Return type: pandas.DataFrame
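For example, continuing the sketch above:

    # Out-of-sample predictions for every training experiment,
    # using 5-fold cross-validation.
    cv_preds = mm.cross_val_predicts(cv=5)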
cross_val_scores(cv=5, gpr_only=False, use_cache=True, return_type='styled', shortnames=None, **kwargs)[source]

Calculate the cross validation scores for this meta-model.

Parameters:
  • cv (int, default 5) – The number of folds to use in cross-validation.
  • gpr_only (bool, default False) – Whether to limit the cross-validation analysis to only the GPR step (i.e., to measure the improvement in meta-model fit from using the GPR-based meta-model over and above using the linear regression meta-model alone).
  • use_cache (bool, default True) – Use cached cross-validation results if available. All other arguments are ignored if cached results are available.
  • return_type ({'styled', 'raw'}) – How to return the results.
  • shortnames (Scope or callable) – If given as a Scope, measure names are converted to the more readable shortname values defined in that scope; if given as a callable, it is used to map measure names to something else.
Returns:

The cross-validation scores, by output.

Return type:

pandas.Series
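For example, to get machine-readable scores rather than a styled table:

    # One cross-validation score per output measure.
    scores = mm.cross_val_scores(cv=5, return_type='raw')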

get_length_scales()[source]

Get the length scales from the GPR kernels of this metamodel.

This MetaModel must already be fit to use this method, although the fit process is generally completed when the MetaModel is instantiated.

Returns: The length scales; the columns correspond to the columns of the pre-processed input (not the raw input), and the rows correspond to the outputs.
Return type: pandas.DataFrame
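For example, continuing the sketch above:

    # One row per output measure, one column per pre-processed input.
    scales = mm.get_length_scales()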
mix_length_scales(balance=None, inv=True)[source]

Mix the length scales from the GPR kernels of this metamodel.

This MetaModel must already be fit to use this method, although the fit process is generally completed when the MetaModel is instantiated.

Parameters:
  • balance (Mapping or Collection, optional) – When given as a mapping, the keys are the output measures that are included in the mix, and the values are the relative weights to use for mixing. When given as a collection, the items are the output measures that are included in the mix, all with equal weight.
  • inv (bool, default True) – Take the inverse of the length scales before mixing.
Returns:

The mixed length scales; the values correspond to the columns of the pre-processed input (not the raw input).

Return type:

ndarray
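For example, weighting one (hypothetical) measure twice as heavily as the other:

    # Mix the inverse length scales across the two outputs.
    mixed = mm.mix_length_scales(
        balance={'total_delay': 2.0, 'transit_share': 1.0},
    )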

pick_new_experiments(possible_experiments, batch_size, output_focus=None, scope: Optional[emat.scope.scope.Scope] = None, db: Optional[emat.database.database.Database] = None, design_name: Optional[str] = None, debug=None, future_experiments=None, future_experiments_std=None)[source]

Select a set of new experiments to perform from a pool of candidates.

This method implements the “maximin” approach described by Johnson et al. (1990), as proposed for batch-sequential augmentation of designs by Loeppky et al. (2010). New experiments are selected from a pool of possible new experiments by maximizing the minimum distance among the selected experiments, with distances between experiments scaled by the correlation parameters from a GP regression fitted to the initial experimental results. Note that the “binning” aspect of Loeppky et al. is not presently implemented here; instead, the analyst retains the ability to manually focus the new experiments by manipulating the input possible_experiments.

This implementation also extends Loeppky et al. by allowing for multiple output models, mixing the length scales from a selected set of outputs to focus the information gained from the new experiments on a subset of output measures.

Parameters:
  • possible_experiments (pandas.DataFrame) – A pool of possible experiments. All selected experiments will be drawn from this pool, so the pool should be sufficiently large and diverse to provide the required support for this process.
  • batch_size (int) – How many experiments to select from possible_experiments.
  • output_focus (Mapping or Collection, optional) – A subset of output measures that will be the focus of these new experiments. The length scales of these measures will be mixed when developing relative weights.
  • scope (Scope, optional) – The exploratory scope to use for writing the design to a database. Ignored unless db is also given.
  • db (Database, optional) – If provided, this design will be stored in the database indicated. Ignored unless scope is also given.
  • design_name (str, optional) – A name for this design, to identify it in the database. If not given, a unique name will be generated. Has no effect if no db or scope is given.
  • debug (Tuple[str, str], optional) – The names of the x and y axes to plot for debugging.
Returns:

A subset of rows from possible_experiments.

Return type:

pandas.DataFrame

References

  • Johnson, M.E., Moore, L.M., and Ylvisaker, D., 1990. “Minimax and maximin distance designs.” Journal of Statistical Planning and Inference 26, 131–148.
  • Loeppky, J., Moore, L., and Williams, B.J., 2010. “Batch sequential designs for computer experiments.” Journal of Statistical Planning and Inference 140, 1452–1464.
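A sketch of batch-sequential augmentation, continuing the synthetic example above (the candidate pool here is a simple uniform random sample; in practice it might be a large Latin hypercube or Monte Carlo design over the scope):

    # A hypothetical pool of 1,000 candidate input combinations.
    candidate_pool = pd.DataFrame({
        'speed_limit': rng.uniform(45, 75, size=1000),
        'toll_price': rng.uniform(0.0, 2.0, size=1000),
    })

    # Select the 10 candidates that maximize the minimum (scaled)
    # distance among the selected experiments, focusing the design
    # on a single output measure.
    new_design = mm.pick_new_experiments(
        candidate_pool,
        batch_size=10,
        output_focus=['total_delay'],
    )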
predict(*args, trend_only=False, residual_only=False, **kwargs)[source]

Generate predictions using the meta-model.

Parameters:
  • df (pandas.DataFrame, optional) – Input parameter values as a DataFrame, with one experiment per row; an alternative to passing individual keyword arguments.
  • trend_only (bool) – If true, return only the trend component of the prediction (i.e., the detrending regression, without the GPR residual).
  • residual_only (bool) – If true, return only the GPR residual component of the prediction, without the trend.
  • **kwargs – All defined (meta)model parameters are passed as keyword arguments, including both uncertainties and levers.
Returns:

A single dictionary containing the predicted values of all performance measure outcomes.

Return type:

dict
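Continuing the sketch above, a prediction can be decomposed into its trend and residual components:

    # Full prediction, then its two components.
    full = mm.predict(speed_limit=60.0, toll_price=1.25)
    trend = mm.predict(speed_limit=60.0, toll_price=1.25, trend_only=True)
    resid = mm.predict(speed_limit=60.0, toll_price=1.25, residual_only=True)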

preprocess_raw_input(df, to_type=None)[source]

Preprocess raw data input.

This convenience method provides batch-processing of a raw data input DataFrame into the format used for regression.

Parameters:
  • df (pandas.DataFrame) – The raw input data to process, which can include input values for multiple experiments.
  • to_type (dtype, optional) – If given, the entire resulting DataFrame is cast to this data type.
Returns:

The pre-processed data, formatted for use in the regression.

Return type:

pandas.DataFrame
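For example, to inspect the regression-ready representation of the training inputs from the sketch above:

    # Convert raw inputs into the internal pre-processed representation.
    processed = mm.preprocess_raw_input(input_sample)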