ForeTiS.preprocess.base_dataset

Module Contents

Classes

Dataset

Class containing datasets ready for optimization.

class ForeTiS.preprocess.base_dataset.Dataset(data_dir, data, config_file_section, test_set_size_percentage, windowsize_current_statistics, windowsize_lagged_statistics, imputation_method='None', config=None, event_lags=None, valtest_seasons=None, seasonal_valtest=None)

Class containing datasets ready for optimization.

Attributes

  • user_input_params (mixed): the arguments passed by the user, or the default values from run.py where none were passed

  • values_for_counter (list): the values that should trigger the counter adder

  • columns_for_counter (list): the columns where the counter adder should be applied

  • columns_for_lags (list): the columns that should be lagged by one sample

  • columns_for_rolling_mean (list): the columns where the rolling mean should be applied

  • columns_for_lags_rolling_mean (list): the columns where seasonal lagged rolling mean should be applied

  • string_columns (list): columns containing strings

  • float_columns (list): columns containing floats

  • time_column (str): column containing the time information

  • seasonal_periods (int): how many datapoints one season has

  • featuresets_regex (list): regular expression with which the feature sets should be filtered

  • imputation (bool): whether to perform imputation or not

  • resample_weekly (bool): whether to resample weekly or not

  • time_format (str): the time format, either “W”, “D”, or “H”

  • features (list): the features of the dataset

  • categorical_columns (list): the categorical columns of the dataset

  • max_seasonal_lags (int): maximal number of seasonal lags to be applied

  • target_column (str): the target column for the prediction

  • featuresets (list): list containing all featuresets that get created in this class

Parameters:
  • data_dir (pathlib.Path) – data directory where the data is stored

  • data (str) – the dataset that you want to use

  • config_file_section (str) – the section of the config file for the used dataset

  • test_set_size_percentage (int) – size of the test set relevant for cv-test and train-val-test

  • windowsize_current_statistics (int) – the windowsize for the feature engineering of the current statistics

  • windowsize_lagged_statistics (int) – the windowsize for the feature engineering of the lagged statistics

  • imputation_method (str) – the imputation method to use. Options are: ‘mean’, ‘knn’, ‘iterative’

  • config (configparser.RawConfigParser) – the information from dataset_specific_config.ini

  • event_lags (int) – the event lags for the counters

  • valtest_seasons (int) – the number of seasons to be used for validation and testing when seasonal_valtest is True

  • seasonal_valtest (bool) – whether validation and test sets should be a multiple of the season length or a percentage of the dataset
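
The two ways of sizing the held-out set can be sketched as follows. This is a minimal illustration and not ForeTiS code: the helper name `holdout_length` is hypothetical, and it assumes that a seasonal split reserves `valtest_seasons * seasonal_periods` samples, while the percentage split reserves `test_set_size_percentage` percent of the rows.

```python
def holdout_length(n_samples, seasonal_valtest, test_set_size_percentage=20,
                   valtest_seasons=1, seasonal_periods=52):
    """Hypothetical helper: number of samples held out for testing."""
    if seasonal_valtest:
        # Assumed behaviour: hold out a whole number of seasons
        return valtest_seasons * seasonal_periods
    # Percentage-based split, rounded down to whole samples
    return n_samples * test_set_size_percentage // 100


# 1000 weekly samples, 20 percent held out
percentage_based = holdout_length(1000, seasonal_valtest=False)
# Two seasons of 52 weeks each held out
season_based = holdout_length(1000, seasonal_valtest=True, valtest_seasons=2)
```

With a seasonal split the held-out size no longer depends on the length of the dataset, which may be preferable when every evaluation window should cover complete seasons.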

load_raw_data(data_dir, data)

Load raw datasets

Parameters:
  • data_dir (str) – directory where the data is stored

  • data (str) – which dataset should be loaded

Returns:

list of datasets to use for optimization

Return type:

pandas.DataFrame

drop_non_target_useless_columns(df)

Drop the possible target columns that were not chosen as target column

Parameters:

df (pandas.DataFrame) – DataFrame to use for dropping

Returns:

DataFrame with only the target column and features left

set_dtypes(df)

Set the dtypes of the dataset. The columns in cols_to_str are converted to string; all remaining columns except the date column are converted to float.

Parameters:

df (pandas.DataFrame) – DataFrame whose columns data types should be set
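
The conversion rule described above might look roughly like this in pandas. The function `set_dtypes_sketch` and its explicit `cols_to_str` and `time_column` arguments are assumptions made for illustration; the actual method takes only `df` and presumably reads the column lists from the instance attributes.

```python
import pandas as pd


def set_dtypes_sketch(df, cols_to_str, time_column):
    """Hypothetical stand-in: strings for cols_to_str, floats for the rest."""
    df = df.copy()
    for col in df.columns:
        if col == time_column:
            continue  # the date column is left untouched
        if col in cols_to_str:
            df[col] = df[col].astype(str)
        else:
            df[col] = df[col].astype(float)
    return df


raw = pd.DataFrame({
    "date": pd.to_datetime(["2023-01-02", "2023-01-09"]),
    "store": [1, 2],
    "sales": ["10", "12.5"],
})
typed = set_dtypes_sketch(raw, cols_to_str=["store"], time_column="date")
```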

impute_dataset_train_test(df=None, test_set_size_percentage=20, imputation_method=None)

Get the imputed dataset as well as the train and test sets (the imputation is fitted on the train set)

Parameters:
  • df (pandas.DataFrame) – dataset to impute

  • test_set_size_percentage (int) – the size of the test set in percentage

  • imputation_method (str) – specify the used method if imputation is applied

Returns:

imputed dataset, train and test set

Return type:

pandas.DataFrame
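
Fitting the imputation on the train set only keeps information from the test period out of the training data. The sketch below shows the idea for mean imputation with plain pandas. The function name `impute_mean_train_test` is hypothetical, and the chronological split at the end of the frame is an assumption; the ‘knn’ and ‘iterative’ options are not covered here.

```python
import pandas as pd


def impute_mean_train_test(df, test_set_size_percentage=20):
    """Hypothetical sketch: mean imputation fitted on the train split only."""
    n_test = len(df) * test_set_size_percentage // 100
    split = len(df) - n_test
    train, test = df.iloc[:split], df.iloc[split:]
    # Column means come from the train rows only
    train_means = train.mean(numeric_only=True)
    train_imputed = train.fillna(train_means)
    test_imputed = test.fillna(train_means)
    return pd.concat([train_imputed, test_imputed]), train_imputed, test_imputed


df = pd.DataFrame({"sales": [2.0, None, 4.0, 6.0, 4.0, 4.0, 2.0, 6.0, None, 100.0]})
full, train, test = impute_mean_train_test(df, test_set_size_percentage=20)
```

In this example the missing value in the test split is filled with the train mean, so the outlier of 100 in the test rows has no influence on it.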

featureadding_and_resampling(df)

Function preparing train and test sets based on raw dataset.

Parameters:

df (pandas.DataFrame) – dataset with raw samples

Returns:

Data with added features and resampling

Return type:

list
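
As a rough illustration of the kind of processing this step involves, the following sketch resamples daily data to weekly sums and then adds a one-sample lag and a rolling mean. The column names, the window size, and the order of operations are assumptions chosen for the example and are not taken from ForeTiS.

```python
import pandas as pd

# Four weeks of daily data, starting on a Monday
daily = pd.DataFrame(
    {"sales": range(1, 29)},
    index=pd.date_range("2023-01-02", periods=28, freq="D"),
)

# Weekly resampling: each bin ends on a Sunday
weekly = daily.resample("W").sum()

# Lag by one sample: the previous week's value
weekly["sales_lag1"] = weekly["sales"].shift(1)

# Rolling mean over the current and the previous week
weekly["sales_rolling_mean"] = weekly["sales"].rolling(window=2).mean()
```

The first row of both derived columns is NaN because no earlier sample exists, which is one reason imputation or row dropping usually follows this kind of feature engineering.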