HowTo: Use ForeTiS as a pip package

In this Jupyter notebook, we show how you can use ForeTiS as a pip package and guide you through the steps ForeTiS performs when triggering an optimization run.

Please clone the whole GitHub repository if you want to run this tutorial on your own, as we need the tutorial data from our GitHub repository and want to make sure that all paths we define are correct: git clone https://github.com/grimmlab/ForeTiS.git

Then, start a Jupyter notebook server on your machine and open this Jupyter notebook, which is located at docs/source/tutorials in the repository.

However, you could also download the individual files and define the paths yourself.

Installation, imports and paths

First, we may need to install ForeTiS (uncomment the line in the cell below if it is not already installed). Then, we import ForeTiS as well as further libraries that we need in this tutorial. Finally, we define some paths and filenames that we will use throughout this tutorial. We will save the results in the tutorial data directory within this repository.

[73]:
# !pip3 install --index-url https://test.pypi.org/simple/ --extra-index-url https://pypi.org/simple/ ForeTiS
[74]:
import ForeTiS
import pathlib
import pandas as pd
import datetime
import pprint
[75]:
# Definition of paths and filenames
cwd = pathlib.Path.cwd()  # current working directory, i.e. docs/source/tutorials
data_dir = cwd.joinpath('tutorial_data')  # directory containing the tutorial data
save_dir = data_dir  # results will be saved within the tutorial data directory
model = 'xgboost'  # prediction model to optimize

Run whole optimization pipeline at once

As shown for the Docker workflow, ForeTiS offers a function optim_pipeline.run() that triggers the whole optimization run.

In the definition of optim_pipeline.run(), we set several default values. In order to run it using our tutorial data, we just need to define the data and directories we want to use as well as the models we want to optimize. Furthermore, we set values for n_trials and periodical_refit_frequency to limit the waiting time for getting the results.
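
Before calling the pipeline, it can be helpful to look at all available parameters and their default values. The following is a minimal sketch using Python's built-in inspect module (not a ForeTiS feature):

[ ]:
import inspect

# Show the full signature of the pipeline entry point,
# including all parameters and their default values
print(inspect.signature(ForeTiS.optim_pipeline.run))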

When calling the function, we first see some information regarding the data preprocessing and the configuration of our optimization run, e.g. the data that is used. Then, the current progress of the optuna optimization with the results of the individual trials is shown. Afterwards, the optimized model with the best hyperparameters is retrained and tested on an unseen test set. This process of optimizing the hyperparameters and testing the model is executed for every created feature set. In the end, we show a summary of the whole optimization run.

[76]:
ForeTiS.optim_pipeline.run(
    data_dir=data_dir, data='nike_sales', config_file_section='nike_sales',
    save_dir=save_dir, models=[model], n_trials=10,
    periodical_refit_frequency=['complete']
)
/home/josef/.local/lib/python3.10/site-packages/ForeTiS/optimization/optuna_optim.py:120: ExperimentalWarning: RetryFailedTrialCallback is experimental (supported from v2.8.0). The interface can change in the future.
  failed_trial_callback=optuna.storages.RetryFailedTrialCallback(max_retry=3))
[I 2023-03-03 21:50:28,146] A new study created in RDB with name: 2023-03-03_21-50-28_-MODELxgboost-TRIALS10-FEATURESETdataset_full
---Dataset is already preprocessed---
### Dataset is loaded ###
### Starting Optuna Optimization for model xgboost and featureset dataset_full ###
## Starting Optimization
Params for Trial 0
{'n_estimators': 700, 'max_depth': 10, 'learning_rate': 0.225, 'gamma': 600, 'subsample': 0.2, 'colsample_bytree': 0.2, 'reg_lambda': 58.0, 'reg_alpha': 867.0}
[I 2023-03-03 21:50:30,959] Trial 0 finished with value: 12477267674.395584 and parameters: {'n_estimators': 700, 'max_depth': 10, 'learning_rate': 0.225, 'gamma': 600, 'subsample': 0.2, 'colsample_bytree': 0.2, 'reg_lambda': 58.0, 'reg_alpha': 867.0}. Best is trial 0 with value: 12477267674.395584.
Params for Trial 1
{'n_estimators': 800, 'max_depth': 8, 'learning_rate': 0.025, 'gamma': 970, 'subsample': 0.8500000000000001, 'colsample_bytree': 0.25, 'reg_lambda': 182.0, 'reg_alpha': 183.0}
[I 2023-03-03 21:50:34,236] Trial 1 finished with value: 23865390820.383522 and parameters: {'n_estimators': 800, 'max_depth': 8, 'learning_rate': 0.025, 'gamma': 970, 'subsample': 0.8500000000000001, 'colsample_bytree': 0.25, 'reg_lambda': 182.0, 'reg_alpha': 183.0}. Best is trial 0 with value: 12477267674.395584.
Params for Trial 2
{'n_estimators': 650, 'max_depth': 6, 'learning_rate': 0.15, 'gamma': 290, 'subsample': 0.6500000000000001, 'colsample_bytree': 0.15000000000000002, 'reg_lambda': 292.0, 'reg_alpha': 366.0}
[I 2023-03-03 21:50:36,728] Trial 2 finished with value: 16570702631.08224 and parameters: {'n_estimators': 650, 'max_depth': 6, 'learning_rate': 0.15, 'gamma': 290, 'subsample': 0.6500000000000001, 'colsample_bytree': 0.15000000000000002, 'reg_lambda': 292.0, 'reg_alpha': 366.0}. Best is trial 0 with value: 12477267674.395584.
Params for Trial 3
{'n_estimators': 750, 'max_depth': 9, 'learning_rate': 0.07500000000000001, 'gamma': 510, 'subsample': 0.6000000000000001, 'colsample_bytree': 0.05, 'reg_lambda': 608.0, 'reg_alpha': 170.0}
[I 2023-03-03 21:50:39,557] Trial 3 finished with value: 32826934290.40348 and parameters: {'n_estimators': 750, 'max_depth': 9, 'learning_rate': 0.07500000000000001, 'gamma': 510, 'subsample': 0.6000000000000001, 'colsample_bytree': 0.05, 'reg_lambda': 608.0, 'reg_alpha': 170.0}. Best is trial 0 with value: 12477267674.395584.
Params for Trial 4
{'n_estimators': 500, 'max_depth': 10, 'learning_rate': 0.3, 'gamma': 810, 'subsample': 0.35000000000000003, 'colsample_bytree': 0.1, 'reg_lambda': 684.0, 'reg_alpha': 440.0}
[I 2023-03-03 21:50:41,567] Trial 4 finished with value: 27304306183.488396 and parameters: {'n_estimators': 500, 'max_depth': 10, 'learning_rate': 0.3, 'gamma': 810, 'subsample': 0.35000000000000003, 'colsample_bytree': 0.1, 'reg_lambda': 684.0, 'reg_alpha': 440.0}. Best is trial 0 with value: 12477267674.395584.
Params for Trial 5
{'n_estimators': 550, 'max_depth': 6, 'learning_rate': 0.025, 'gamma': 910, 'subsample': 0.3, 'colsample_bytree': 0.7000000000000001, 'reg_lambda': 312.0, 'reg_alpha': 520.0}
[I 2023-03-03 21:50:43,798] Trial 5 finished with value: 41682236193.01747 and parameters: {'n_estimators': 550, 'max_depth': 6, 'learning_rate': 0.025, 'gamma': 910, 'subsample': 0.3, 'colsample_bytree': 0.7000000000000001, 'reg_lambda': 312.0, 'reg_alpha': 520.0}. Best is trial 0 with value: 12477267674.395584.
Params for Trial 6
{'n_estimators': 800, 'max_depth': 3, 'learning_rate': 0.3, 'gamma': 780, 'subsample': 0.9500000000000001, 'colsample_bytree': 0.9000000000000001, 'reg_lambda': 598.0, 'reg_alpha': 922.0}
[I 2023-03-03 21:50:46,862] Trial 6 finished with value: 10334330757.877611 and parameters: {'n_estimators': 800, 'max_depth': 3, 'learning_rate': 0.3, 'gamma': 780, 'subsample': 0.9500000000000001, 'colsample_bytree': 0.9000000000000001, 'reg_lambda': 598.0, 'reg_alpha': 922.0}. Best is trial 6 with value: 10334330757.877611.
Params for Trial 7
{'n_estimators': 500, 'max_depth': 3, 'learning_rate': 0.025, 'gamma': 320, 'subsample': 0.4, 'colsample_bytree': 0.3, 'reg_lambda': 829.0, 'reg_alpha': 357.0}
[I 2023-03-03 21:50:48,724] Trial 7 finished with value: 57564233462.41369 and parameters: {'n_estimators': 500, 'max_depth': 3, 'learning_rate': 0.025, 'gamma': 320, 'subsample': 0.4, 'colsample_bytree': 0.3, 'reg_lambda': 829.0, 'reg_alpha': 357.0}. Best is trial 6 with value: 10334330757.877611.
Params for Trial 8
{'n_estimators': 650, 'max_depth': 6, 'learning_rate': 0.05, 'gamma': 810, 'subsample': 0.1, 'colsample_bytree': 1.0, 'reg_lambda': 773.0, 'reg_alpha': 198.0}
[I 2023-03-03 21:50:51,289] Trial 8 finished with value: 59564589526.27223 and parameters: {'n_estimators': 650, 'max_depth': 6, 'learning_rate': 0.05, 'gamma': 810, 'subsample': 0.1, 'colsample_bytree': 1.0, 'reg_lambda': 773.0, 'reg_alpha': 198.0}. Best is trial 6 with value: 10334330757.877611.
Params for Trial 9
{'n_estimators': 500, 'max_depth': 9, 'learning_rate': 0.225, 'gamma': 730, 'subsample': 0.8, 'colsample_bytree': 0.1, 'reg_lambda': 358.0, 'reg_alpha': 115.0}
[I 2023-03-03 21:50:53,199] Trial 9 finished with value: 21784303396.630066 and parameters: {'n_estimators': 500, 'max_depth': 9, 'learning_rate': 0.225, 'gamma': 730, 'subsample': 0.8, 'colsample_bytree': 0.1, 'reg_lambda': 358.0, 'reg_alpha': 115.0}. Best is trial 6 with value: 10334330757.877611.
## Optuna Study finished ##
Study statistics:
  Finished trials:  10
  Pruned trials:  0
  Completed trials:  10
  Best Trial:  6
  Value:  10334330757.877611
  Params:
    colsample_bytree: 0.9000000000000001
    gamma: 780
    learning_rate: 0.3
    max_depth: 3
    n_estimators: 800
    reg_alpha: 922.0
    reg_lambda: 598.0
    subsample: 0.9500000000000001
## Retrain best model and test ##
## Results on test set with refitting period: complete ##
{'test_refitting_period_complete_mse': 7746675083.946438, 'test_refitting_period_complete_rmse': 88015.19802821804, 'test_refitting_period_complete_r2_score': 0.849709510924217, 'test_refitting_period_complete_explained_variance': 0.8500353551866467, 'test_refitting_period_complete_MAPE': 4.376419954227729, 'test_refitting_period_complete_sMAPE': 4.09452881498149}
[Image: tutorials_HowTo_Use_ForeTiS_as_a_pip_package_7_22.png]
### Finished Optuna Optimization for xgboost and featureset dataset_full ###
# Optimization runs done for models ['xgboost'] and ['dataset_full']
Results overview on the test set(s)
{'xgboost': {'dataset_full': {'Test': {'best_params': {'colsample_bytree': 0.9000000000000001,
                                                       'gamma': 780,
                                                       'learning_rate': 0.3,
                                                       'max_depth': 3,
                                                       'n_estimators': 800,
                                                       'reg_alpha': 922.0,
                                                       'reg_lambda': 598.0,
                                                       'subsample': 0.9500000000000001},
                                       'eval_metrics': {'test_refitting_period_complete_MAPE': 4.376419954227729,
                                                        'test_refitting_period_complete_explained_variance': 0.8500353551866467,
                                                        'test_refitting_period_complete_mse': 7746675083.946438,
                                                        'test_refitting_period_complete_r2_score': 0.849709510924217,
                                                        'test_refitting_period_complete_rmse': 88015.19802821804,
                                                        'test_refitting_period_complete_sMAPE': 4.09452881498149},
                                       'retries': 0,
                                       'runtime_metrics': {'process_time_max': 18.45159076599998,
                                                           'process_time_mean': 13.805673200400008,
                                                           'process_time_min': 10.010519391999992,
                                                           'process_time_std': 2.932392361664868,
                                                           'real_time_max': 3.247318983078003,
                                                           'real_time_mean': 2.4725477933883666,
                                                           'real_time_min': 1.8300788402557373,
                                                           'real_time_std': 0.4933237490143201}}}}}
<Figure size 432x288 with 0 Axes>

Within the defined save_dir, a results folder is created. Inside it, ForeTiS follows its default folder structure: model/featureset/. We can see this structure below with all optimization results for the defined model.

[77]:
result_folders = list(save_dir.joinpath('results', model).glob('*'))
for results_dir in result_folders:
    print(results_dir)
/home/josef/Schreibtisch/01_HorticulturalSalesPrediction/ForeTiS/docs/source/tutorials/tutorial_data/results/xgboost/2023-03-03_21-50-28_dataset_full

In the example below, we can see that each result folder contains several files with detailed results for the optimized model.

[78]:
result_elements = list(result_folders[0].glob('*'))
for result_element in result_elements:
    print(result_element)
/home/josef/Schreibtisch/01_HorticulturalSalesPrediction/ForeTiS/docs/source/tutorials/tutorial_data/results/xgboost/2023-03-03_21-50-28_dataset_full/xgboost_dataset_full_best_refitting_cycle_complete.png
/home/josef/Schreibtisch/01_HorticulturalSalesPrediction/ForeTiS/docs/source/tutorials/tutorial_data/results/xgboost/2023-03-03_21-50-28_dataset_full/xgboost_runtime_overview.csv
/home/josef/Schreibtisch/01_HorticulturalSalesPrediction/ForeTiS/docs/source/tutorials/tutorial_data/results/xgboost/2023-03-03_21-50-28_dataset_full/final_model_test_results.csv
/home/josef/Schreibtisch/01_HorticulturalSalesPrediction/ForeTiS/docs/source/tutorials/tutorial_data/results/xgboost/2023-03-03_21-50-28_dataset_full/final_model_feature_importances.csv
/home/josef/Schreibtisch/01_HorticulturalSalesPrediction/ForeTiS/docs/source/tutorials/tutorial_data/results/xgboost/2023-03-03_21-50-28_dataset_full/unfitted_model_trial6
/home/josef/Schreibtisch/01_HorticulturalSalesPrediction/ForeTiS/docs/source/tutorials/tutorial_data/results/xgboost/2023-03-03_21-50-28_dataset_full/Optuna_DB.db
/home/josef/Schreibtisch/01_HorticulturalSalesPrediction/ForeTiS/docs/source/tutorials/tutorial_data/results/xgboost/2023-03-03_21-50-28_dataset_full/validation_results_trial6.csv
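
The result folder also contains the Optuna database (Optuna_DB.db), from which the persisted study can be reloaded for further inspection. The following is a minimal sketch (not part of the ForeTiS API) that assumes the study name printed at the beginning of the optimization log above:

[ ]:
import optuna

# Load the persisted study from the SQLite database in the result folder
optuna_db_file = [element for element in result_elements if 'Optuna_DB' in str(element)][0]
study = optuna.load_study(
    study_name='2023-03-03_21-50-28_-MODELxgboost-TRIALS10-FEATURESETdataset_full',
    storage=f'sqlite:///{optuna_db_file}'
)
print(study.best_trial.params)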

The *_runtime_overview.csv file contains the process and real time as well as the evaluated parameters for each trial of the optimized model, including the final retraining with the best parameters, as we can see in the example below.

[79]:
runtime_overview_file = [overview_file for overview_file in result_elements if '_runtime_overview' in str(overview_file)][0]
pd.read_csv(runtime_overview_file)
[79]:
                         Trial refitting_cycle  process_time_s  real_time_s                                              params        note
0                            0             NaN       15.554622     2.766639  {'n_estimators': 700, 'max_depth': 10, 'learni...  successful
1                            1             NaN       18.451591     3.247319  {'n_estimators': 800, 'max_depth': 8, 'learnin...  successful
2                            2             NaN       13.801943     2.463260  {'n_estimators': 650, 'max_depth': 6, 'learnin...  successful
3                            3             NaN       15.702649     2.794137  {'n_estimators': 750, 'max_depth': 9, 'learnin...  successful
4                            4             NaN       10.769471     1.975961  {'n_estimators': 500, 'max_depth': 10, 'learni...  successful
5                            5             NaN       12.254852     2.201480  {'n_estimators': 550, 'max_depth': 6, 'learnin...  successful
6                            6             NaN       17.035096     3.030404  {'n_estimators': 800, 'max_depth': 3, 'learnin...  successful
7                            7             NaN       10.010519     1.830079  {'n_estimators': 500, 'max_depth': 3, 'learnin...  successful
8                            8             NaN       14.209953     2.536838  {'n_estimators': 650, 'max_depth': 6, 'learnin...  successful
9                            9             NaN       10.266035     1.879361  {'n_estimators': 500, 'max_depth': 9, 'learnin...  successful
10                        mean             NaN       13.805673     2.472548                                                 NaN         NaN
11                         std             NaN        2.932392     0.493324                                                 NaN         NaN
12                         max             NaN       18.451591     3.247319                                                 NaN         NaN
13                         min             NaN       10.010519     1.830079                                                 NaN         NaN
14  retraining_after_10_trials        complete        8.822604     1.510213  {'colsample_bytree': 0.9000000000000001, 'gamm...  successful
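
As the params column stores plain string representations of Python dictionaries, we can parse them back, e.g. to retrieve the parameters of the best trial (trial 6 in this run). This is a minimal sketch assuming the dict-style formatting shown above:

[ ]:
import ast

# Parse the params string of the best trial back into a dictionary
runtime_overview = pd.read_csv(runtime_overview_file)
best_params = ast.literal_eval(runtime_overview.loc[6, 'params'])
pprint.pprint(best_params)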

Beyond that, the detailed results for each optimized model contain validation and test results, saved prediction models, an Optuna database, a runtime overview with information for each trial (good for debugging, as pruning reasons are also documented) and, for some prediction models, also feature importances. Below, we load the final model's test results.

[80]:
final_model_test_results_file = [overview_file for overview_file in result_elements if 'final_model_test_results' in str(overview_file)][0]
pd.read_csv(final_model_test_results_file)
[80]:
y_pred_retrain y_true_retrain y_true_test y_pred_test_refitting_period_complete y_pred_test_var_refitting_period_complete test_refitting_period_complete_mse test_refitting_period_complete_rmse test_refitting_period_complete_r2_score test_refitting_period_complete_explained_variance test_refitting_period_complete_MAPE test_refitting_period_complete_sMAPE
0 1.172851e+06 1.189235e+06 1.632762e+06 1572810.375 0.0 7.746675e+09 88015.198028 0.84971 0.850035 4.37642 4.094529
1 1.138915e+06 1.185445e+06 1.255720e+06 1246359.750 0.0 NaN NaN NaN NaN NaN NaN
2 8.919995e+05 9.519060e+05 1.046709e+06 1061496.375 0.0 NaN NaN NaN NaN NaN NaN
3 9.541891e+05 9.149935e+05 1.080526e+06 1157195.750 0.0 NaN NaN NaN NaN NaN NaN
4 8.826827e+05 8.574261e+05 1.281535e+06 1304759.625 0.0 NaN NaN NaN NaN NaN NaN
... ... ... ... ... ... ... ... ... ... ... ...
2183 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
2184 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
2185 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
2186 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
2187 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN

2188 rows × 11 columns
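
The saved predictions can also be used for custom analyses, e.g. for a quick comparison of true and predicted test values. The following is a minimal sketch using matplotlib and the column names shown above:

[ ]:
import matplotlib.pyplot as plt

# Plot the true test values against the final model's predictions
test_results = pd.read_csv(final_model_test_results_file)
plt.plot(test_results['y_true_test'].dropna().values, label='y_true_test')
plt.plot(test_results['y_pred_test_refitting_period_complete'].dropna().values,
         label='y_pred_test_refitting_period_complete')
plt.xlabel('Test sample')
plt.ylabel('Sales')
plt.legend()
plt.show()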

Further information

This notebook shows how to use the ForeTiS pip package to run an optimization. Furthermore, we give an overview of the individual steps within optim_pipeline.run().

For more information on specific topics, see the following links:

[ ]: