:py:mod:`ForeTiS.preprocess.base_dataset`
=========================================

.. py:module:: ForeTiS.preprocess.base_dataset


Module Contents
---------------

Classes
~~~~~~~

.. autoapisummary::

   ForeTiS.preprocess.base_dataset.Dataset


.. py:class:: Dataset(data_dir, data, config_file_section, test_set_size_percentage, windowsize_current_statistics, windowsize_lagged_statistics, imputation_method = 'None', config = None, event_lags = None, valtest_seasons = None, seasonal_valtest = None)

   Class containing datasets ready for optimization.

   **Attributes**

       - user_input_params (*mixed*): the arguments passed by the user, or the default values from run.py where none were passed
       - values_for_counter (*list*): the values that should trigger the counter adder
       - columns_for_counter (*list*): the columns where the counter adder should be applied
       - columns_for_lags (*list*): the columns that should be lagged by one sample
       - columns_for_rolling_mean (*list*): the columns where the rolling mean should be applied
       - columns_for_lags_rolling_mean (*list*): the columns where seasonal lagged rolling mean should be applied
       - string_columns (*list*): columns containing strings
       - float_columns (*list*): columns containing floats
       - time_column (*str*): the column containing the time information
       - seasonal_periods (*int*): how many datapoints one season has
       - featuresets_regex (*list*): regular expressions used to filter the feature sets
       - imputation (*bool*): whether to perform imputation or not
       - resample_weekly (*bool*): whether to resample weekly or not
       - time_format (*str*): the time format, either "W", "D", or "H"
       - features (*list*): the features of the dataset
       - categorical_columns (*list*): the categorical columns of the dataset
       - max_seasonal_lags (*int*): maximal number of seasonal lags to be applied
       - target_column (*str*): the target column for the prediction
       - featuresets (*list*): list containing all featuresets that get created in this class

   :param data_dir: data directory where the data is stored
   :param data: the dataset that you want to use
   :param config_file_section: the section of the config file for the used dataset
   :param test_set_size_percentage: size of the test set in percent, relevant for cv-test and train-val-test
   :param windowsize_current_statistics: the window size for the feature engineering of the current statistics
   :param windowsize_lagged_statistics: the window size for the feature engineering of the lagged statistics
   :param imputation_method: the imputation method to use. Options are: 'mean', 'knn', 'iterative'
   :param config: the information from dataset_specific_config.ini
   :param event_lags: the event lags for the counters
   :param valtest_seasons: the number of seasons to be used for validation and testing when seasonal_valtest is True
   :param seasonal_valtest: whether the validation and test sets should be a multiple of the season length (True) or a percentage of the dataset (False)
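
   The two ways of sizing the test set described by ``seasonal_valtest`` can be sketched as follows. This is only an illustration under stated assumptions, not the ForeTiS implementation: ``holdout_size`` is a hypothetical helper, and the exact rounding ForeTiS applies is not documented here.

   .. code-block:: python

      def holdout_size(n_samples, seasonal_valtest, seasonal_periods=None,
                       valtest_seasons=None, test_set_size_percentage=None):
          """Return how many samples are held out (hypothetical helper)."""
          if seasonal_valtest:
              # Seasonal mode: hold out a whole number of seasons.
              return valtest_seasons * seasonal_periods
          # Percentage mode: hold out a share of all samples (rounded down here).
          return n_samples * test_set_size_percentage // 100

      # Ten years of weekly data, assuming 52 data points per season.
      seasonal = holdout_size(520, True, seasonal_periods=52, valtest_seasons=2)
      by_percentage = holdout_size(520, False, test_set_size_percentage=25)

   With these assumed inputs, the seasonal variant holds out 104 samples (two seasons) and the percentage variant holds out 130.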

   .. py:method:: load_raw_data(data_dir, data)

      Load raw datasets

      :param data_dir: directory where the data is stored
      :param data: which dataset should be loaded

      :return: list of datasets to use for optimization


   .. py:method:: drop_non_target_useless_columns(df)

      Drop the possible target columns that were not chosen as target column

      :param df: DataFrame to use for dropping

      :return: DataFrame with only the target column and features left
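
      A rough sketch of the idea, using a list of dictionaries as a plain-Python stand-in for the DataFrame. The helper ``drop_unused_targets`` and the column names are hypothetical; the actual method may drop further columns that this sketch does not cover.

      .. code-block:: python

         def drop_unused_targets(rows, possible_targets, target_column):
             """Remove every candidate target column except the chosen one."""
             to_drop = {c for c in possible_targets if c != target_column}
             return [{k: v for k, v in row.items() if k not in to_drop}
                     for row in rows]

         # Hypothetical data: two candidate targets, one ordinary feature.
         rows = [{"date": "2022-01-03", "sales_a": 10.0, "sales_b": 7.0, "temp": 3.5}]
         cleaned = drop_unused_targets(rows, ["sales_a", "sales_b"], "sales_a")

      Here ``cleaned`` keeps ``date``, ``sales_a`` and ``temp``, while ``sales_b`` is removed.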


   .. py:method:: set_dtypes(df)

      Set the dtypes of the dataset. The columns in cols_to_str are converted to string; all remaining columns except the date column are converted to float.

      :param df: DataFrame whose columns data types should be set
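
      The conversion rule can be illustrated on a single record. This is a plain-Python sketch rather than the pandas-based implementation; ``cast_row`` and the example column names are hypothetical.

      .. code-block:: python

         def cast_row(row, string_columns, time_column):
             """Cast one record: strings stay strings, the rest becomes float."""
             out = {}
             for name, value in row.items():
                 if name == time_column:
                     out[name] = value         # the time column is left untouched
                 elif name in string_columns:
                     out[name] = str(value)    # listed columns become strings
                 else:
                     out[name] = float(value)  # everything else becomes float
             return out

         row = {"date": "2022-01-03", "store": 12, "sales": "10"}
         typed = cast_row(row, string_columns=["store"], time_column="date")

      After the call, ``typed["store"]`` is the string ``"12"`` and ``typed["sales"]`` is the float ``10.0``.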


   .. py:method:: impute_dataset_train_test(df = None, test_set_size_percentage = 20, imputation_method = None)

      Get the imputed dataset as well as the train and test sets (imputation fitted on the train set)

      :param df: dataset to impute
      :param test_set_size_percentage: the size of the test set in percentage
      :param imputation_method: specify the used method if imputation is applied

      :return: imputed dataset, train and test set
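
      The point of fitting on the train set is that the test set never influences the fill values. A minimal sketch for the 'mean' option on a single column, assuming the test set is taken from the end of the series; ``split_and_impute_mean`` is a hypothetical helper and returns only the train and test parts.

      .. code-block:: python

         def split_and_impute_mean(values, test_set_size_percentage=20):
             """Split off the last part as test set, fill gaps with the train mean."""
             n_test = len(values) * test_set_size_percentage // 100
             split = len(values) - n_test
             train, test = values[:split], values[split:]

             # The fill value is computed from observed train values only.
             observed = [v for v in train if v is not None]
             fill = sum(observed) / len(observed)

             def impute(part):
                 return [fill if v is None else v for v in part]

             return impute(train), impute(test)

         series = [1.0, None, 3.0, 5.0, None]
         train, test = split_and_impute_mean(series, test_set_size_percentage=20)

      The train mean of the observed values 1.0, 3.0 and 5.0 is 3.0, so ``train`` becomes ``[1.0, 3.0, 3.0, 5.0]`` and the missing test value is filled with the same 3.0.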


   .. py:method:: featureadding_and_resampling(df)

      Function preparing train and test sets based on raw dataset.

      :param df: dataset with raw samples

      :return: data with added features and resampling applied
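
      Two of the feature types named in the class attributes, a lag by one sample and a rolling mean, can be sketched in plain Python. This is an illustration under assumptions, not the ForeTiS code: both helpers are hypothetical, and the rolling mean here uses a trailing window, which may differ from the library's windowing.

      .. code-block:: python

         def lag_by_one(values):
             """Shift the series by one sample; the first entry has no predecessor."""
             return [None] + values[:-1] if values else []

         def rolling_mean(values, window):
             """Trailing mean over `window` samples; None until the window is full."""
             out = []
             for i in range(len(values)):
                 if i + 1 < window:
                     out.append(None)
                 else:
                     chunk = values[i + 1 - window:i + 1]
                     out.append(sum(chunk) / window)
             return out

         sales = [2.0, 4.0, 6.0, 8.0]
         lagged = lag_by_one(sales)
         smoothed = rolling_mean(sales, window=2)

      For this input, ``lagged`` is ``[None, 2.0, 4.0, 6.0]`` and ``smoothed`` is ``[None, 3.0, 5.0, 7.0]``.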