AI pipelines

Learning pipeline

  1. Load raw data by setting the raw_set attribute.

  2. Feature engineering: produce cleaned/derived data from the raw data using feature_engineering_function (default is the identity function).

  3. Split the derived data into train, validation, and test sets using train_val_test_split_function (default is train_test_split).

  4. Use predictive_variables and target_variables to define X and y, and pass them to the model's fit function. See Training.

  5. Evaluate performance using compute_metrics_function on the validation and test sets, if their sizes are not zero. More in Evaluating.

  6. Send an email to all validators with a summary of the learning process. Validators are added through add_validator.

The user can trigger all these steps at once by doing

model = AleiaModel("modelname")
...
model.learn()

or one after the other by doing

model = AleiaModel("modelname")
...
derived = model.feature_engineering("raw")
train, validation, test = model.train_validation_test_split()
model.train()
validation_metrics = model.validate()
test_metrics = model.test()
model._send_summary()

but that is not recommended.

When doing hyperparameter optimisation, one should always use learn, as model_class is called at the beginning of this method to define model.

Feature engineering

In AleiaModel, feature engineering is an optional method that the user can define and give to the model by defining the feature_engineering_function attribute:

def my_awesome_function(df):
    """do stuff on df"""
    ...
    return df_modified

model = AleiaModel("modelname")
model.feature_engineering_function = my_awesome_function

See User methods for more information about custom methods in AleiaModel.

It is assumed that the process providing the data to learn on yields the same format as the process providing the data to predict on, so by default both the learning and prediction pipelines execute this function: the learning pipeline on the raw_set and the prediction pipeline on the observations_set. The resulting datasets are saved in the derived_set and derived_observations_set attributes respectively. One can skip either of those steps by providing the model with the corresponding result dataset before running the pipeline.

AleiaModel will automatically pass the raw_set or observations_set as the first argument to the function. Any other argument is optional and must be passed using custom keyword arguments (see Custom keyword arguments).

The function should not modify the input dataset in place, but return a new one.

As the same method is used for both raw_set and observations_set, it must handle both the presence and the absence of the target variable.
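As an illustration, here is a minimal sketch of such a function. It uses a list of dicts standing in for a DataFrame, and the column names ("width", "height", "y") are hypothetical; the point is that it returns a new dataset and tolerates a missing target column.

```python
def my_feature_engineering(rows):
    """Derive new features without modifying the input in place.

    Sketch only: `rows` is a list of dicts standing in for a
    DataFrame, and the column names are made up. The target column
    "y" may or may not be present, since the same function runs on
    both raw_set and observations_set.
    """
    derived = []
    for row in rows:
        new_row = dict(row)                  # copy, never mutate the input
        new_row["area"] = row["width"] * row["height"]
        derived.append(new_row)
    return derived
```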

Train - Validate - Test split

The derived_set is then passed to this function, which can be set through the train_val_test_split_function attribute. By default, sklearn.model_selection.train_test_split is used.

It must accept at least 3 arguments, which AleiaModel passes automatically:

  1. The derived set

  2. The validation or test set size, as a fraction of the size of the derived set

  3. The random state

It should make only ONE split and return TWO sets; AleiaModel will call it twice to make both the validation and test sets.
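A sketch of a compatible custom split function, under the contract just described (the function name is illustrative; the default remains sklearn.model_selection.train_test_split):

```python
import random

def my_split_function(data, size, random_state):
    """Make a single split and return two sets.

    The arguments match the contract above: the dataset, the
    held-out fraction, and the random state. AleiaModel calls this
    twice, once to carve out the validation set and once for the
    test set.
    """
    rng = random.Random(random_state)
    indices = list(range(len(data)))
    rng.shuffle(indices)
    n_held_out = round(len(data) * size)
    held_out = [data[i] for i in indices[:n_held_out]]
    remaining = [data[i] for i in indices[n_held_out:]]
    return remaining, held_out
```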

The user can change the validation and test set sizes through set_set_sizes: its first argument is the validation size, the second is the test size. The default values are 0.2 and 0.2, so 20% of the derived set each, leaving 60% for the train set.

Either (or both) of those sizes can be 0, which will result in the pipeline skipping the corresponding validation or test step.

It is the user's responsibility to specify sizes that leave the train set with enough data to work with.

The user can change the random state through the random_state attribute (default is None).

The user can also provide Custom keyword arguments.

Training

Training will first look for the column names or index numbers that correspond to the predictive_variables in the train set, and by doing so will detect their types if they were not provided by the user. Then, if they are defined, it will do the same with the target_variables. These are optional, as one could for example be training a clustering model.

Both can be provided through the set_variables method, either as two collections of int or str, or as two Variables objects to provide more information.

If the data in the train set corresponding to the target_variables are int, bool or str, then classes will be the set of those values. This can be useful for computing some metrics in a classification problem.
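For instance, with string targets, classes amounts to the set of distinct target values (a plain-Python illustration, with made-up values):

```python
# Plain-Python illustration of how `classes` is derived from the
# target values found in the train set (the values are hypothetical).
targets = ["cat", "dog", "cat", "dog", "cat"]
classes = set(targets)   # the set of distinct target values
```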

The user can also choose to reshape the data by doing model.train(reshape_x=(-1, 1), reshape_y=(-1, 1)). This can be useful if, for example, the targets are a 1-D array but the model needs a 2-D shape, as with sklearn.linear_model.LinearRegression.
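These tuples follow numpy.reshape semantics, where -1 lets that dimension be inferred. What reshape_y=(-1, 1) amounts to, shown with plain lists:

```python
# Effect of reshape_y=(-1, 1) on 1-D targets, shown with plain lists
# (numpy.reshape semantics: -1 means "infer this dimension").
y = [3.0, 1.0, 2.0]            # 1-D targets, shape (3,)
y_column = [[v] for v in y]    # column vector, shape (3, 1)
```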

Finally, the AleiaModel object calls its model's fit method on the predictive and target variables.

Custom keyword arguments can be provided.

Evaluating

After the learning, the model will compute metrics using the user-defined compute_metrics_function on the validation and test sets. If it was not specified, an empty dictionary is returned as metrics.

The function should accept at least one argument: the predicted targets.

  1. The predicted targets are given automatically by AleiaModel when evaluating.

  2. If the model is supervised (real targets are known), they are given as the second argument.

  3. If 'x' or 'X' appears among the function's arguments, then the predictive variables are passed to the function. It can be any positional argument.

  4. If 'X' is the first or second argument and the model is supervised, the other argument is the real targets, and the predicted targets are not passed along.

  5. Otherwise, the first argument is the predicted targets (and, if supervised, the second is the real targets), and the predictive variables are passed as 'X'.

Any other argument must be passed as Custom keyword arguments when learning. The function should return a dictionary of metrics, the keys being the metric names.
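For instance, a minimal compute_metrics_function for a supervised classification model might look like this (the metric names are illustrative):

```python
def my_compute_metrics(y_pred, y_true):
    """Return a dictionary of metrics, keyed by metric name.

    Sketch for a supervised model: the predicted targets arrive as
    the first argument and the real targets as the second, as
    described above.
    """
    n_correct = sum(p == t for p, t in zip(y_pred, y_true))
    return {
        "accuracy": n_correct / len(y_true),
        "errors": len(y_true) - n_correct,
    }
```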

Prediction pipeline

  1. Load observations by setting observations_set attribute. See Give input data to AleiaModel.

  2. Feature engineering: see Feature Engineering.

  3. Uses predictive_variables to define the X to pass to the model's predict function, and returns the predicted values.

  4. Uses postprocess_function on the predictions and returns its result (default is the identity function). It must accept at least one argument and, optionally, Custom keyword arguments.
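A sketch of such a postprocess_function, assuming (purely for illustration) that the model outputs probabilities that downstream consumers want clipped and rounded:

```python
def my_postprocess(predictions):
    """Shape raw predictions for the consumer.

    Only the first argument is required; any extra ones would come
    in as custom keyword arguments. Clipping to [0, 1] and rounding
    is an arbitrary example of post-processing.
    """
    return [round(min(max(p, 0.0), 1.0), 2) for p in predictions]
```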

The user can trigger all these steps at once by doing

model = AleiaModel("modelname")
...
model.apply()

or one after the other by doing

model = AleiaModel("modelname")
...
derived = model.feature_engineering("observed")
predictions = model.predict()
post_processed_predictions = model.postprocess()

All the computed predictions can be retrieved via get_predictions.