`ForeTiS.preprocess.base_dataset`

Module Contents

Classes

Dataset

Class containing datasets ready for optimization.

class ForeTiS.preprocess.base_dataset.Dataset(data_dir, data, config_file_section, test_set_size_percentage, windowsize_current_statistics, windowsize_lagged_statistics, imputation_method='None', config=None, event_lags=None, valtest_seasons=None, seasonal_valtest=None)

Class containing datasets ready for optimization.

Attributes

user_input_params (mixed): the arguments passed by the user or default values from run.py respectively

values_for_counter (list): the values that should trigger the counter adder

columns_for_counter (list): the columns where the counter adder should be applied

columns_for_lags (list): the columns that should be lagged by one sample

columns_for_rolling_mean (list): the columns where the rolling mean should be applied

columns_for_lags_rolling_mean (list): the columns where seasonal lagged rolling mean should be applied

string_columns (list): columns containing strings

float_columns (list): columns containing floats

time_column (str): the column containing the time information

seasonal_periods (int): how many datapoints one season has

featuresets_regex (list): regular expression with which the feature sets should be filtered

imputation (bool): whether to perform imputation or not

resample_weekly (bool): whether to resample weekly or not

time_format (str): the time format, either “W”, “D”, or “H”

features (list): the features of the dataset

categorical_columns (list): the categorical columns of the dataset

max_seasonal_lags (int): maximal number of seasonal lags to be applied

target_column (str): the target column for the prediction

featuresets (list): list containing all featuresets that get created in this class

Parameters:

data_dir (pathlib.Path) – data directory where the data is stored
data (str) – the dataset that you want to use
config_file_section (str) – the section of the config file for the used dataset
test_set_size_percentage (int) – size of the test set relevant for cv-test and train-val-test
windowsize_current_statistics (int) – the windowsize for the feature engineering of the current statistic
windowsize_lagged_statistics (int) – the windowsize for the feature engineering of the lagged statistics
imputation_method (str) – the imputation method to use. Options are: ‘mean’, ‘knn’, ‘iterative’
config (configparser.RawConfigParser) – the information from dataset_specific_config.ini
event_lags (int) – the event lags for the counters
valtest_seasons (int) – the number of seasons to be used for validation and testing when seasonal_valtest is True
seasonal_valtest (bool) – whether validation and test sets should be a multiple of the season length or a percentage of the dataset
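
The interplay of seasonal_valtest, valtest_seasons, seasonal_periods and test_set_size_percentage can be illustrated with a minimal sketch. The function holdout_size is hypothetical and not part of ForeTiS; the rounding rule and the exact formula are assumptions made for illustration only.

```python
def holdout_size(n_samples, test_set_size_percentage, seasonal_periods,
                 valtest_seasons=None, seasonal_valtest=False):
    """Return the number of samples held out for testing (illustrative only)."""
    if seasonal_valtest:
        # Assumption: the hold-out spans a whole number of seasons.
        return valtest_seasons * seasonal_periods
    # Assumption: otherwise the hold-out is a percentage of all samples,
    # rounded down to an integer.
    return n_samples * test_set_size_percentage // 100


# 500 weekly samples with 52 datapoints per season
seasonal_holdout = holdout_size(500, 20, seasonal_periods=52,
                                valtest_seasons=2, seasonal_valtest=True)
percentage_holdout = holdout_size(500, 20, seasonal_periods=52)
```

With seasonal_valtest=True the hold-out covers two full seasons; otherwise it is 20 percent of the 500 samples.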

load_raw_data(data_dir, data)

Load raw datasets

Parameters:

data_dir (str) – directory where the data is stored
data (str) – which dataset should be loaded

Returns:

list of datasets to use for optimization

Return type:

pandas.DataFrame

drop_non_target_useless_columns(df)

Drop the possible target columns that were not chosen as target column

Parameters: df (pandas.DataFrame) – DataFrame to use for dropping
Returns: DataFrame with only the target column and features left

set_dtypes(df)

Set the dtypes of the dataset. The columns in cols_to_str are converted to string; all other columns except the date column are converted to float.

Parameters: df (pandas.DataFrame) – DataFrame whose columns data types should be set

impute_dataset_train_test(df=None, test_set_size_percentage=20, imputation_method=None)

Get the imputed dataset as well as the train and test sets (the imputation is fitted to the train set)

Parameters:

df (pandas.DataFrame) – dataset to impute
test_set_size_percentage (int) – the size of the test set in percentage
imputation_method (str) – specify the used method if imputation is applied

Returns:

imputed dataset, train and test set

Return type:

pandas.DataFrame
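
What "fitted to train set" means can be shown with a small dependency-free sketch of mean imputation. It is not the ForeTiS implementation: the function impute_train_test is hypothetical, plain Python lists stand in for DataFrame columns, None marks a missing value, and the chronological split at the end of the series is an assumption.

```python
from statistics import mean


def impute_train_test(values, test_set_size_percentage=20):
    """Split chronologically, then fill gaps with the mean of the train part."""
    n_test = len(values) * test_set_size_percentage // 100
    split = len(values) - n_test
    train, test = values[:split], values[split:]

    # The fill value is computed from the training samples only, so no
    # information from the test period leaks into the imputation.
    fill = mean(v for v in train if v is not None)

    def fill_gaps(part):
        return [fill if v is None else v for v in part]

    return fill_gaps(train), fill_gaps(test)


series = [10.0, None, 14.0, 12.0, 100.0, None, 90.0, 80.0, None, 200.0]
train, test = impute_train_test(series, test_set_size_percentage=20)
```

The missing value in the test part is filled with the train mean, even though the test part contains a much larger observation.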

featureadding_and_resampling(df)

Function preparing train and test sets based on raw dataset.

Parameters: df (pandas.DataFrame) – dataset with raw samples
Returns: Data with added features and resampling
Return type: list
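
The kind of features listed under columns_for_lags and columns_for_rolling_mean can be sketched without any dependencies. The helpers lag_by_one and rolling_mean and the column name sales are hypothetical, and whether ForeTiS includes the current sample in the rolling window is an assumption here.

```python
def lag_by_one(values):
    """Shift a column by one sample; the first entry has no predecessor."""
    return [None] + values[:-1] if values else []


def rolling_mean(values, window):
    """Mean over the current sample and the window - 1 samples before it."""
    out = []
    for i in range(len(values)):
        if i + 1 < window:
            out.append(None)  # not enough history yet
        else:
            chunk = values[i + 1 - window:i + 1]
            out.append(sum(chunk) / window)
    return out


sales = [4.0, 6.0, 8.0, 10.0, 12.0]
features = {
    "sales_lag1": lag_by_one(sales),
    "sales_rolling_mean": rolling_mean(sales, window=3),
}
```

The window argument plays the role of windowsize_current_statistics. Positions without enough history are left as None, which is one reason an imputation step may be needed.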
