ForeTiS.optim_pipeline

Module Contents

Functions

run(data_dir, save_dir[, datasplit, ...])

Run the whole optimization pipeline

ForeTiS.optim_pipeline.run(data_dir, save_dir, datasplit='timeseries-cv', test_set_size_percentage=20, val_set_size_percentage=20, n_splits=3, imputation_method=None, windowsize_current_statistics=3, windowsize_lagged_statistics=3, models=None, n_trials=200, pca_transform=False, save_final_model=False, periodical_refit_frequency=None, refit_drops=0, data=None, config_file_path=None, config_file_section=None, refit_window=5, intermediate_results_interval=None, batch_size=32, n_epochs=100000, event_lags=None, optimize_featureset=False, scale_thr=0.1, scale_seasons=2, cf_thr_perc=70, scale_window_factor=0.1, cf_r=0.4, cf_order=1, cf_smooth=4, scale_window_minimum=2, max_samples_factor=10, valtest_seasons=1, seasonal_valtest=True)

Run the whole optimization pipeline

Parameters:
  • data_dir (str) – directory where the data is stored

  • save_dir (str) – directory for saving the results. If None, the results are saved in data_dir

  • datasplit (str) – datasplit to use. Options are: timeseries-cv, cv, train-val-test

  • test_set_size_percentage (int) – size of the test set relevant for cv-test and train-val-test

  • val_set_size_percentage (int) – size of the validation set relevant for train-val-test

  • n_splits (int) – number of splits to use for ‘timeseries-cv’ or ‘cv’

  • imputation_method (str) – the imputation method to use. Options are: ‘mean’, ‘knn’, ‘iterative’

  • windowsize_current_statistics (int) – the window size for the feature engineering of the current statistics

  • windowsize_lagged_statistics (int) – the window size for the feature engineering of the lagged statistics

  • models (list) – list of models that should be optimized

  • n_trials (int) – number of trials for optuna

  • pca_transform (bool) – whether pca dimensionality reduction will be optimized or not

  • save_final_model (bool) – specify if the final model should be saved

  • periodical_refit_frequency (list) – if and for which intervals periodical refitting should be performed

  • refit_drops (int) – after how many periods the model should get updated

  • data (str) – the dataset that you want to use

  • config_file_path (str) – the path of the config file

  • config_file_section (str) – the section of the config file for the used dataset

  • refit_window (int) – number of seasons used for refitting

  • intermediate_results_interval (int) – number of trials after which intermediate results will be saved

  • batch_size (int) – batch size for neural network models

  • n_epochs (int) – number of epochs for neural network models

  • event_lags (int) – the event lags for the counters

  • optimize_featureset (bool) – whether the feature set will be optimized or not

  • scale_thr (float) – only relevant for evars-gpr: output scale threshold

  • scale_seasons (int) – only relevant for evars-gpr: output scale seasons taken into account

  • cf_thr_perc (int) – only relevant for evars-gpr: percentile of the train set anomaly factors used as threshold for change point detection (cpd) with changefinder

  • scale_window_factor (float) – only relevant for evars-gpr: scale window factor based on seasonal periods

  • cf_r (float) – only relevant for evars-gpr: changefinder's r parameter (decay factor for older values)

  • cf_order (int) – only relevant for evars-gpr: changefinder's SDAR model order parameter

  • cf_smooth (int) – only relevant for evars-gpr: changefinder's smoothing parameter

  • scale_window_minimum (int) – only relevant for evars-gpr: scale window minimum

  • max_samples_factor (int) – only relevant for evars-gpr: max samples factor of seasons to keep for gpr pipeline

  • valtest_seasons (int) – define the number of seasons to be used when seasonal_valtest is True

  • seasonal_valtest (bool) – whether validation and test sets should be a multiple of the season length
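
Example

The following is a minimal sketch of how a call to run might be assembled from the parameters listed above. It is not taken from the ForeTiS sources. The helper names build_run_kwargs and launch, the paths, the dataset name and the percentage range check are assumptions made for illustration, and 'evars-gpr' is assumed to be a valid entry for models because the parameter list refers to it. The default values are copied from the documented signature.

```python
# Subset of the defaults shown in the documented signature of run().
DOCUMENTED_DEFAULTS = {
    "datasplit": "timeseries-cv",
    "test_set_size_percentage": 20,
    "val_set_size_percentage": 20,
    "n_splits": 3,
    "n_trials": 200,
    "save_final_model": False,
}


def build_run_kwargs(data_dir, save_dir, **overrides):
    """Hypothetical helper: merge overrides into the documented defaults."""
    kwargs = {
        "data_dir": data_dir,
        "save_dir": save_dir,
        **DOCUMENTED_DEFAULTS,
        **overrides,
    }
    # Assumption: set sizes are percentages strictly between 0 and 100.
    for key in ("test_set_size_percentage", "val_set_size_percentage"):
        if not 0 < kwargs[key] < 100:
            raise ValueError(f"{key} must be between 0 and 100, got {kwargs[key]}")
    return kwargs


def launch(kwargs):
    """Hypothetical wrapper: forward the keyword arguments to run()."""
    # Imported lazily so build_run_kwargs works without ForeTiS installed.
    from ForeTiS import optim_pipeline

    return optim_pipeline.run(**kwargs)


kwargs = build_run_kwargs(
    "path/to/data",          # placeholder data directory
    "path/to/results",       # placeholder results directory
    data="my_dataset",       # hypothetical dataset name
    models=["evars-gpr"],    # assumed model identifier
    n_trials=50,             # fewer optuna trials than the default of 200
)
# launch(kwargs)  # uncomment once ForeTiS and the data are in place
```

Keeping the argument assembly separate from the call makes it possible to inspect or log the final settings before starting a potentially long optimization.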