:py:mod:`ForeTiS.preprocess.base_dataset`
=========================================

.. py:module:: ForeTiS.preprocess.base_dataset


Module Contents
---------------

Classes
~~~~~~~

.. autoapisummary::

   ForeTiS.preprocess.base_dataset.Dataset


.. py:class:: Dataset(data_dir, data, config_file_section, test_set_size_percentage, windowsize_current_statistics, windowsize_lagged_statistics, imputation_method = 'None', config = None, event_lags = None, valtest_seasons = None, seasonal_valtest = None)

   Class containing datasets ready for optimization.

   **Attributes**

   - user_input_params (*mixed*): the arguments passed by the user, or the default values from run.py
   - values_for_counter (*list*): the values that should trigger the counter adder
   - columns_for_counter (*list*): the columns to which the counter adder should be applied
   - columns_for_lags (*list*): the columns that should be lagged by one sample
   - columns_for_rolling_mean (*list*): the columns to which the rolling mean should be applied
   - columns_for_lags_rolling_mean (*list*): the columns to which the seasonal lagged rolling mean should be applied
   - string_columns (*list*): columns containing strings
   - float_columns (*list*): columns containing floats
   - time_column (*str*): the column containing the time information
   - seasonal_periods (*int*): how many data points one season has
   - featuresets_regex (*list*): regular expressions used to filter the feature sets
   - imputation (*bool*): whether to perform imputation or not
   - resample_weekly (*bool*): whether to resample weekly or not
   - time_format (*str*): the time format, either "W", "D", or "H"
   - features (*list*): the features of the dataset
   - categorical_columns (*list*): the categorical columns of the dataset
   - max_seasonal_lags (*int*): maximum number of seasonal lags to be applied
   - target_column (*str*): the target column for the prediction
   - featuresets (*list*): list containing all feature sets that get created in this class

   :param data_dir: directory where the data is stored
   :param data: the dataset that you want to use
   :param config_file_section: the section of the config file for the used dataset
   :param test_set_size_percentage: size of the test set, relevant for cv-test and train-val-test
   :param windowsize_current_statistics: the window size for the feature engineering of the current statistics
   :param windowsize_lagged_statistics: the window size for the feature engineering of the lagged statistics
   :param imputation_method: the imputation method to use. Options are: 'mean', 'knn', 'iterative'
   :param config: the information from dataset_specific_config.ini
   :param event_lags: the event lags for the counters
   :param valtest_seasons: the number of seasons to be used for validation and testing when seasonal_valtest is True
   :param seasonal_valtest: whether validation and test sets should be a multiple of the season length or a percentage of the dataset

   .. py:method:: load_raw_data(data_dir, data)

      Load the raw datasets.

      :param data_dir: directory where the data is stored
      :param data: which dataset should be loaded

      :return: list of datasets to use for optimization


   .. py:method:: drop_non_target_useless_columns(df)

      Drop the possible target columns that were not chosen as the target column.

      :param df: DataFrame to use for dropping

      :return: DataFrame with only the target column and features left


   .. py:method:: set_dtypes(df)

      Set the dtypes of the dataset. The columns in cols_to_str are converted to string; all other columns except the date column are converted to float.

      :param df: DataFrame whose column data types should be set


   .. py:method:: impute_dataset_train_test(df = None, test_set_size_percentage = 20, imputation_method = None)

      Get the imputed dataset as well as the train and test sets (the imputation is fitted on the train set).

      :param df: dataset to impute
      :param test_set_size_percentage: the size of the test set in percent
      :param imputation_method: the imputation method to use if imputation is applied

      :return: imputed dataset, train and test set


   .. py:method:: featureadding_and_resampling(df)

      Function preparing train and test sets based on the raw dataset.

      :param df: dataset with raw samples

      :return: data with added features and resampling applied
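The idea behind ``impute_dataset_train_test`` (imputation fitted on the train set and then applied to the test set) can be illustrated with a minimal sketch. This is not the ForeTiS implementation: the helper ``split_and_impute_mean``, the column name ``sales``, the chronological split that takes the last rows as the test set, and the restriction to mean imputation are all assumptions made for illustration.

```python
import numpy as np
import pandas as pd


def split_and_impute_mean(df, test_set_size_percentage=20):
    """Hypothetical helper: split chronologically, then mean-impute using train statistics."""
    # Assumption: the last test_set_size_percentage percent of the rows form the test set.
    n_test = int(round(len(df) * test_set_size_percentage / 100))
    split = len(df) - n_test
    train, test = df.iloc[:split], df.iloc[split:]

    # "Fit" on the train set only: column means are computed without any test rows.
    train_means = train.mean(numeric_only=True)

    # Apply the same train statistics to both parts to avoid leaking test information.
    train_imputed = train.fillna(train_means)
    test_imputed = test.fillna(train_means)
    full_imputed = pd.concat([train_imputed, test_imputed])
    return full_imputed, train_imputed, test_imputed


# Illustrative data with two missing values, one of them in the test part.
df = pd.DataFrame(
    {"sales": [1.0, np.nan, 3.0, 5.0, np.nan]},
    index=pd.date_range("2021-01-03", periods=5, freq="D"),
)
full, train, test = split_and_impute_mean(df, test_set_size_percentage=20)
```

Computing the fill values from the train rows only means the missing test value is replaced by the train mean, so no information from the test period enters the fitted statistics.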
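The kinds of features named in the attribute list (a lag by one sample, a rolling mean, weekly resampling) can be sketched with plain pandas calls. This is only an illustration under stated assumptions, not what ``featureadding_and_resampling`` does internally: the column name ``sales``, the window of 3 standing in for ``windowsize_current_statistics``, and the choice of ``sum`` as the weekly aggregation are all hypothetical.

```python
import pandas as pd

# Hypothetical daily series; the column name "sales" is illustrative only.
df = pd.DataFrame(
    {"sales": [float(v) for v in range(1, 15)]},
    index=pd.date_range("2021-01-04", periods=14, freq="D"),  # starts on a Monday
)

window = 3  # assumption: stands in for windowsize_current_statistics

features = df.copy()
# Lag by one sample: each row carries the previous observation.
features["sales_lag1"] = features["sales"].shift(1)
# Rolling mean over the last `window` samples, including the current one.
features["sales_rolling_mean"] = features["sales"].rolling(window).mean()

# Weekly resampling: aggregate the daily rows into weeks ending on Sunday.
# Summing is an assumption; a mean or another aggregation may be more appropriate.
weekly = df["sales"].resample("W").sum()
```

The first rows of the lagged and rolling columns are ``NaN`` because no earlier samples exist, which is one reason an imputation or row-dropping step usually accompanies this kind of feature engineering.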