ForeTiS.preprocess.base_dataset
Module Contents
Classes
Dataset: Class containing datasets ready for optimization.
- class ForeTiS.preprocess.base_dataset.Dataset(data_dir, data, config_file_section, test_set_size_percentage, windowsize_current_statistics, windowsize_lagged_statistics, imputation_method='None', config=None, event_lags=None, valtest_seasons=None, seasonal_valtest=None)
Class containing datasets ready for optimization.
Attributes
user_input_params (mixed): the arguments passed by the user, or the default values from run.py
values_for_counter (list): the values that should trigger the counter adder
columns_for_counter (list): the columns where the counter adder should be applied
columns_for_lags (list): the columns that should be lagged by one sample
columns_for_rolling_mean (list): the columns where the rolling mean should be applied
columns_for_lags_rolling_mean (list): the columns where seasonal lagged rolling mean should be applied
string_columns (list): columns containing strings
float_columns (list): columns containing floats
time_column (str): column containing the time information
seasonal_periods (int): how many datapoints one season has
featuresets_regex (list): regular expression with which the feature sets should be filtered
imputation (bool): whether to perform imputation or not
resample_weekly (bool): whether to resample weekly or not
time_format (str): the time format, either “W”, “D”, or “H”
features (list): the features of the dataset
categorical_columns (list): the categorical columns of the dataset
max_seasonal_lags (int): maximal number of seasonal lags to be applied
target_column (str): the target column for the prediction
featuresets (list): list containing all featuresets that get created in this class
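The lag and rolling-mean attributes above describe standard time-series feature engineering. The following is a minimal pandas sketch of what such features could look like, not the ForeTiS implementation: the toy data, the column names, the window size, and the choice to shift before rolling are all assumptions made for illustration.

```python
import pandas as pd

# Hypothetical toy data; the column name "sales" is illustrative only.
df = pd.DataFrame(
    {"sales": [10.0, 12.0, 14.0, 16.0, 18.0, 20.0]},
    index=pd.date_range("2023-01-01", periods=6, freq="D"),
)

seasonal_periods = 2  # assumed: one season spans two samples
window = 2            # assumed rolling window size

# Lag by one sample (cf. columns_for_lags).
df["sales_lag1"] = df["sales"].shift(1)

# Rolling mean over the previous `window` samples (cf. columns_for_rolling_mean).
# Shifting first keeps the current value out of its own feature (an assumption
# of this sketch, not something the documentation states).
df["sales_rolling_mean"] = df["sales"].shift(1).rolling(window).mean()

# Seasonal lagged rolling mean (cf. columns_for_lags_rolling_mean):
# the same rolling mean, taken one season earlier.
df["sales_seasonal_rolling_mean"] = (
    df["sales"].shift(seasonal_periods).rolling(window).mean()
)

print(df)
```

The leading rows of each derived column are NaN because no earlier samples exist to fill the lag or the window.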
- Parameters:
data_dir (pathlib.Path) – data directory where the data is stored
data (str) – the dataset that you want to use
config_file_section (str) – the section of the config file for the used dataset
test_set_size_percentage (int) – size of the test set relevant for cv-test and train-val-test
windowsize_current_statistics (int) – the window size for the feature engineering of the current statistics
windowsize_lagged_statistics (int) – the window size for the feature engineering of the lagged statistics
imputation_method (str) – the imputation method to use. Options are: ‘mean’, ‘knn’, ‘iterative’
config (configparser.RawConfigParser) – the information from dataset_specific_config.ini
event_lags (int) – the event lags for the counters
valtest_seasons (int) – the number of seasons to be used for validation and testing when seasonal_valtest is True
seasonal_valtest (bool) – whether validation and test sets should be a multiple of the season length or a percentage of the dataset
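The interplay of test_set_size_percentage, valtest_seasons, seasonal_valtest, and seasonal_periods can be sketched as below. The helper holdout_length is hypothetical and only illustrates the two sizing modes described above; the rounding and the exact formula ForeTiS uses may differ.

```python
def holdout_length(n_samples, test_set_size_percentage,
                   seasonal_valtest=False, valtest_seasons=1,
                   seasonal_periods=1):
    """Number of samples reserved for the test set (hypothetical helper)."""
    if seasonal_valtest:
        # Seasonal mode: the hold-out set is a whole number of seasons.
        return valtest_seasons * seasonal_periods
    # Percentage mode: the hold-out set is a fraction of the dataset
    # (integer division is an assumption of this sketch).
    return n_samples * test_set_size_percentage // 100


# 200 weekly samples with a 20 % test set
print(holdout_length(200, 20))
# two seasons of 52 weekly samples each
print(holdout_length(200, 20, seasonal_valtest=True,
                     valtest_seasons=2, seasonal_periods=52))
```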
- load_raw_data(data_dir, data)
Load raw datasets
- drop_non_target_useless_columns(df)
Drop the possible target columns that were not chosen as target column
- Parameters:
df (pandas.DataFrame) – DataFrame to use for dropping
- Returns:
DataFrame with only the target column and features left
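A minimal sketch of the idea behind this step, assuming the candidate target columns are known as a list. The function drop_other_targets and the column names are hypothetical; the actual method may also remove further columns it considers useless, which this sketch does not attempt.

```python
import pandas as pd


def drop_other_targets(df, possible_targets, target_column):
    """Drop every candidate target column except the chosen one (hypothetical helper)."""
    to_drop = [
        col for col in possible_targets
        if col != target_column and col in df.columns
    ]
    # DataFrame.drop returns a new DataFrame; the input is left unchanged.
    return df.drop(columns=to_drop)


df = pd.DataFrame({
    "turnover": [1.0, 2.0],
    "amount": [3.0, 4.0],
    "temperature": [20.5, 21.0],
})
reduced = drop_other_targets(df, ["turnover", "amount"], "turnover")
print(list(reduced.columns))
```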
- set_dtypes(df)
Set the dtypes of the dataset. Columns in cols_to_str are converted to string; all other columns except the date column are converted to float.
- Parameters:
df (pandas.DataFrame) – DataFrame whose columns data types should be set
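The conversion rule described for set_dtypes can be sketched with plain pandas as follows. The function set_dtypes_sketch, its explicit arguments, and the example columns are assumptions for illustration; the real method works from the class attributes instead.

```python
import pandas as pd


def set_dtypes_sketch(df, cols_to_str, time_column):
    """String columns to str, the time column untouched, everything else to float."""
    df = df.copy()  # keep the caller's DataFrame unchanged
    for col in df.columns:
        if col == time_column:
            continue
        if col in cols_to_str:
            df[col] = df[col].astype(str)
        else:
            df[col] = df[col].astype(float)
    return df


df = pd.DataFrame({
    "date": pd.to_datetime(["2023-01-01", "2023-01-02"]),
    "holiday": ["no", "yes"],
    "sales": [3, 5],
})
typed = set_dtypes_sketch(df, cols_to_str=["holiday"], time_column="date")
print(typed.dtypes)
```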
- impute_dataset_train_test(df=None, test_set_size_percentage=20, imputation_method=None)
Get the imputed dataset as well as the train and test sets (imputation fitted on the train set)
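Fitting the imputation on the train set only keeps information from the test set out of the training data. The sketch below shows this for the ‘mean’ option using plain pandas; the function impute_train_test_mean, the chronological split, and the integer rounding of the test size are assumptions, and the ‘knn’ and ‘iterative’ options are not covered.

```python
import pandas as pd


def impute_train_test_mean(df, test_set_size_percentage=20):
    """Mean-impute train and test using statistics from the train set only."""
    n_test = len(df) * test_set_size_percentage // 100
    split = len(df) - n_test
    # Chronological split: the last rows form the test set.
    train, test = df.iloc[:split], df.iloc[split:]

    # "Fit" step: column means are computed on the train set only.
    train_means = train.mean(numeric_only=True)

    train_imputed = train.fillna(train_means)
    test_imputed = test.fillna(train_means)
    full_imputed = pd.concat([train_imputed, test_imputed])
    return full_imputed, train_imputed, test_imputed


df = pd.DataFrame({"sales": [1.0, None, 3.0, 5.0, None]})
full, train, test = impute_train_test_mean(df, test_set_size_percentage=20)
print(full)
```

In this toy example the missing value in the test set is filled with the train mean, not with a mean that includes test observations.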