prodata.preprocessing package#
Submodules#
prodata.preprocessing.checkers module#
- prodata.preprocessing.checkers.features_in_dataset(data: DataFrame, features: list)#
Check whether a given dataset has a set of features or not.
- Parameters:
data – dataset with features/attributes.
features – feature names (attributes) to be checked.
- Returns:
A list of features found in the dataset.
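As a rough illustration of the check described above, the following sketch uses plain pandas rather than the package itself. The helper name `find_features` and the column names are hypothetical; the sketch only approximates the documented behaviour of returning the requested features that exist in the dataset.

```python
import pandas as pd

def find_features(data: pd.DataFrame, features: list) -> list:
    # Hypothetical stand-in: keep only the requested names that are columns.
    return [name for name in features if name in data.columns]

df = pd.DataFrame({"machine.id": [1, 2], "time": [10, 20]})
found = find_features(df, ["time", "fuel"])
print(found)
```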
- prodata.preprocessing.checkers.has_data(data: DataFrame)#
Check whether a given dataset is empty or not.
- Parameters:
data – any dataset to be checked
- Returns:
None
prodata.preprocessing.datapreprocess module#
- prodata.preprocessing.datapreprocess.format_machine_properties(machines: DataFrame) → DataFrame#
Converts property columns from their deserialized format into one property column per unique property key.
Discards the deserialized columns and keeps only the newly formatted ones.
- Parameters:
machines – DataFrame of machines with property columns.
- Returns:
A DataFrame of machines with formatted property columns.
- prodata.preprocessing.datapreprocess.is_missing_data(data: DataFrame)#
- prodata.preprocessing.datapreprocess.preprocess_fuel_consumption(data: DataFrame, fuel_feature: str, min_total_fuel_gt=0, min_delta_fuel=0, delta_fuel_threshold=100, print_pp_desc=True, verbose=True)#
Preprocesses the Total Fuel Consumption data.
- Parameters:
data – total fuel consumption data.
fuel_feature – name of the total fuel feature.
min_total_fuel_gt – min. value for total fuel (Filtering via Query).
min_delta_fuel – min. delta fuel (Filtering via Query).
delta_fuel_threshold – delta fuel threshold for anomalies (Filtering via Query).
print_pp_desc – print a description of the preprocessed data.
verbose – print preprocessing steps (True/False).
- Returns:
A DataFrame with the preprocessed Total Fuel Consumption data.
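The parameter descriptions suggest three query-based filters, but how the function combines them is not documented here. The sketch below is therefore only an assumption, written with plain pandas: the column name `total_fuel`, the derived `delta_fuel` column, and the choice of inclusive comparisons are all hypothetical.

```python
import pandas as pd

fuel_feature = "total_fuel"  # hypothetical column name
data = pd.DataFrame({fuel_feature: [0.0, 10.0, 12.0, 500.0, 505.0]})

min_total_fuel_gt = 0
min_delta_fuel = 0
delta_fuel_threshold = 100

# Assumption: a delta is the difference between consecutive readings.
data["delta_fuel"] = data[fuel_feature].diff()

# Keep rows with a positive total and a delta inside the accepted range.
query = (
    f"{fuel_feature} > {min_total_fuel_gt}"
    f" and delta_fuel >= {min_delta_fuel}"
    f" and delta_fuel <= {delta_fuel_threshold}"
)
filtered = data.query(query)
print(filtered)
```

In this toy data the jump from 12 to 500 exceeds the threshold, so that row is treated as an anomaly and dropped.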
- prodata.preprocessing.datapreprocess.preprocess_operating_hours(data: DataFrame, hours_feature: str, time_feature='time', min_total_op_gt=0, min_delta_hours=0, time_issue_feature='has.delivery.issue', print_pp_desc=True, verbose=True)#
Preprocesses the Total Operating Hours data.
- Parameters:
data – total operating hours data.
hours_feature – name of the total op. hours feature.
time_feature – name of the time feature.
min_total_op_gt – min. value for total op. hours (Filtering via Query).
min_delta_hours – min. delta op. hours (Filtering via Query).
time_issue_feature – name of the time issue feature.
print_pp_desc – print a description of the preprocessed data.
verbose – print preprocessing steps (True/False).
- Returns:
A DataFrame with the preprocessed Total Op. Hours data.
prodata.preprocessing.disaggregator module#
- class prodata.preprocessing.disaggregator.RawDatasetSplitter(signals_key='signal_key', prefix_metrics=['value.common.', 'value.custom.'])#
Bases:
BaseEstimator, TransformerMixin
Raw Dataset Splitter: it splits a raw dataset, pulled via the Exports endpoint, by signal key/name.
- Parameters:
signals_key – name of the feature that contains the signal name.
prefix_metrics – list of Proemion's standard signal prefixes.
- Returns:
A dictionary keyed by signal name. Each value contains the signal label and its dataset (DataFrame).
- Output structure:
value.common.<name>: {label: 'name', data: pd.DataFrame(data)}
- fit(X, y=None)#
- transform(X, y=None)#
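The output structure above can be approximated with a pandas `groupby`. This is a sketch under assumptions, not the class's actual implementation: the toy data, the `value` column, and the way the label is derived by stripping the prefix are all hypothetical.

```python
import pandas as pd

signals_key = "signal_key"
prefix_metrics = ["value.common.", "value.custom."]

raw = pd.DataFrame({
    "signal_key": ["value.common.fuel", "value.common.hours", "value.common.fuel"],
    "value": [1.0, 5.0, 2.0],
})

split = {}
for key, group in raw.groupby(signals_key):
    # Assumption: the label is the signal name without its standard prefix.
    label = key
    for prefix in prefix_metrics:
        if key.startswith(prefix):
            label = key[len(prefix):]
            break
    split[key] = {"label": label, "data": group.reset_index(drop=True)}

print(sorted(split))
```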
prodata.preprocessing.filters module#
- class prodata.preprocessing.filters.MinPointsPerMachine(machine_id: str, min_number_dps: int)#
Bases:
BaseEstimator, TransformerMixin
Machine Data Points Filter: it counts the data points in a DataFrame per machine ID and then keeps only the machines that reach the minimum number of data points.
- Parameters:
machine_id – name of the machine ID feature in the dataset.
min_number_dps – minimum number of data points.
- Returns:
A DataFrame containing only the machines that have at least the given minimum number of data points.
- fit(X, y=None)#
- transform(X)#
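The filtering idea can be sketched in plain pandas as follows. The column names and data are hypothetical, and the snippet is an approximation of the described behaviour rather than the class's own code.

```python
import pandas as pd

machine_id = "machine_id"  # hypothetical column name
min_number_dps = 2

X = pd.DataFrame({
    "machine_id": ["a", "a", "a", "b", "c", "c"],
    "value": [1, 2, 3, 4, 5, 6],
})

# Count rows per machine, then map the count back onto every row.
counts = X[machine_id].map(X[machine_id].value_counts())
kept = X[counts >= min_number_dps]
print(kept)
```

Machine `b` has a single data point and is removed; `a` and `c` are kept.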
- class prodata.preprocessing.filters.Querier(query: str)#
Bases:
BaseEstimator, TransformerMixin
Feature Filter: it runs queries on a DataFrame.
With this filter, you can select a subset of the dataset based on your needs.
- Given a feature called A (int values), you could run queries such as:
"A > 10"
"A > 10 & A <= 20"
- Parameters:
query – query to be applied.
- Returns:
A DataFrame with selected data based on the passed query.
- fit(X, y=None)#
- transform(X)#
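The example queries above use the same expression syntax as pandas `DataFrame.query`. Assuming the class delegates to that method (which this page does not confirm), the second query behaves like this:

```python
import pandas as pd

X = pd.DataFrame({"A": [5, 12, 20, 25]})

# Keep rows where A is greater than 10 and at most 20.
selected = X.query("A > 10 & A <= 20")
print(selected)
```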
prodata.preprocessing.markers module#
- class prodata.preprocessing.markers.TimeLogIssueMarker(delta_hours_feature: str, delta_logtime_feature: str)#
Bases:
BaseEstimator, TransformerMixin
Time Log Issue Marker: it marks data points that logged more (delivered) hours than the time that actually elapsed.
- Parameters:
delta_hours_feature – name of the feature with delta op. hours.
delta_logtime_feature – name of the feature with delta of the log time.
- Returns:
A DataFrame with a boolean feature that indicates an issue in the delivery hours.
Note
The delta log time feature should be in the same unit as the delta hours (minutes/hours).
The delta hours feature must be present in the dataset.
Delta log time is rounded to reduce minor, insignificant differences from the delivered hours.
- fit(X, y=None)#
- transform(X)#
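A minimal sketch of the marking rule described in the note, using plain pandas. The column names, the rounding precision of two decimals, and the output feature name `has.delivery.issue` (borrowed from the default of `preprocess_operating_hours`) are assumptions.

```python
import pandas as pd

X = pd.DataFrame({
    "delta_hours": [1.0, 3.0, 0.5],      # operating hours reported since the last log
    "delta_logtime": [1.004, 2.0, 0.5],  # elapsed hours since the last log
})

# Round the elapsed time first so tiny differences do not raise a flag.
rounded = X["delta_logtime"].round(2)

# A data point is suspicious if it reports more hours than actually elapsed.
X["has.delivery.issue"] = X["delta_hours"] > rounded
print(X)
```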
prodata.preprocessing.transformers module#
- class prodata.preprocessing.transformers.BoxCoxTransformer(features: list, machine_id_feature: str, lmbda: float, adjust_transformation=True)#
Bases:
BaseEstimator, TransformerMixin
Box-Cox Transformer: it transforms the data using a Box-Cox transformation.
- Parameters:
features – names of the features to be transformed.
machine_id_feature – name of the machine ID feature in the dataset.
lmbda – lambda parameter of the Box-Cox transformation.
adjust_transformation – fill missing/NaN transformed values with the absolute values of the feature(s).
- Returns:
A DataFrame with transformed data.
- fit(X, y=None)#
- transform(X)#
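For reference, the standard Box-Cox formula for a fixed lambda can be written with NumPy as below. This sketch covers only the formula itself; it does not reproduce the per-machine handling or the `adjust_transformation` behaviour, and the column names are hypothetical. Box-Cox is defined for strictly positive inputs.

```python
import numpy as np
import pandas as pd

def box_cox(values, lmbda):
    # Box-Cox: log(x) when lambda is 0, otherwise (x**lambda - 1) / lambda.
    values = np.asarray(values, dtype=float)
    if lmbda == 0:
        return np.log(values)
    return (values ** lmbda - 1.0) / lmbda

X = pd.DataFrame({"delta_hours": [1.0, 4.0, 9.0]})
X["delta_hours_bc"] = box_cox(X["delta_hours"], lmbda=0.5)
print(X)
```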
- class prodata.preprocessing.transformers.CategoricalDataEncoder(features: list)#
Bases:
BaseEstimator, TransformerMixin
Categorical Data Encoder: it maps each categorical feature into a binary vector whose length equals the number of categories in that feature.
- Parameters:
features – a set of categorical features to be encoded.
- Returns:
A DataFrame with encoded (categorical) data.
- fit(X, y=None)#
- transform(X)#
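The encoding described here is commonly known as one-hot encoding. A comparable result can be obtained with pandas `get_dummies`; whether the class uses it internally is not stated, and the resulting column names below are those chosen by pandas, not necessarily by the class.

```python
import pandas as pd

X = pd.DataFrame({
    "machine_type": ["crane", "loader", "crane"],  # hypothetical categorical feature
    "hours": [1, 2, 3],
})

# One binary column per category; non-encoded columns are left untouched.
encoded = pd.get_dummies(X, columns=["machine_type"])
print(encoded)
```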
- class prodata.preprocessing.transformers.DeltaCalculator(features: list, machine_id_feature: str)#
Bases:
BaseEstimator, TransformerMixin
Delta Calculator: it computes deltas from (counter) signals/data.
- Parameters:
features – names of the features to compute deltas for.
machine_id_feature – name of the machine ID feature in the dataset.
- Returns:
A DataFrame with deltas.
Note
The dataset must contain the machine ID so that the deltas can be computed as efficiently as possible.
- fit(X, y=None)#
- transform(X)#
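Per-machine deltas of a counter signal can be sketched with a pandas `groupby` followed by `diff`. The column names, including the name of the output column, are hypothetical, and the rows are assumed to be sorted by time within each machine.

```python
import pandas as pd

X = pd.DataFrame({
    "machine_id": ["a", "a", "a", "b", "b"],
    "total_hours": [10.0, 12.0, 15.0, 100.0, 101.0],
})

# Difference to the previous reading of the same machine.
# The first reading of each machine has no predecessor and yields NaN.
X["delta_total_hours"] = X.groupby("machine_id")["total_hours"].diff()
print(X)
```

Grouping by machine ID prevents a delta from being computed across the boundary between two machines.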
- class prodata.preprocessing.transformers.DeltaTimeLogCalculator(machine_id_feature: str, time_metric='h', time_as_index=True, time_feature_name: str = None)#
Bases:
BaseEstimator, TransformerMixin
Delta Time Log Calculator: it computes the delta of the log time. It works whether the time is stored in the index or in a regular feature (column).
- Parameters:
machine_id_feature – name of the machine ID feature in the dataset.
time_metric – conversion unit: h = hour (default), m = minute.
time_as_index – whether the time feature is already set as the index.
time_feature_name – name of the time feature.
- Returns:
A DataFrame with delta of the logged time.
- fit(X, y=None)#
- transform(X)#
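The delta of the log time can be sketched in pandas as the per-machine difference between consecutive timestamps, converted to hours. This example keeps the time in a regular column; the column names and the output name `delta_logtime` are assumptions.

```python
import pandas as pd

X = pd.DataFrame({
    "machine_id": ["a", "a", "b", "b"],
    "time": pd.to_datetime([
        "2021-01-01 00:00", "2021-01-01 02:30",
        "2021-01-01 00:00", "2021-01-01 01:00",
    ]),
})

# Elapsed time since the previous log of the same machine (NaT for the first log).
elapsed = X.groupby("machine_id")["time"].diff()

# Convert to hours; divide by 60.0 instead of 3600.0 to get minutes.
X["delta_logtime"] = elapsed.dt.total_seconds() / 3600.0
print(X)
```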
- class prodata.preprocessing.transformers.DropFeatureSelector(drop_features: list)#
Bases:
BaseEstimator, TransformerMixin
Drop Feature Selector: it removes columns from the dataset by name.
- Parameters:
drop_features – names of the features to remove.
- Returns:
A DataFrame without the given list of features.
- fit(X, y=None)#
- transform(X)#
- class prodata.preprocessing.transformers.DropRowSelector(subset_features: list = None)#
Bases:
BaseEstimator, TransformerMixin
Drop Row Selector: it drops rows that contain missing values, optionally checking only a subset of features.
- Parameters:
subset_features – names of the features to check for missing values; rows are removed based on this set of features.
- Returns:
A DataFrame without missing values based on a given set of features.
- fit(X, y=None)#
- transform(X)#
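The difference between checking all features and checking a subset can be illustrated with pandas `dropna`. Whether the class calls `dropna` internally is an assumption; the data and column names are hypothetical.

```python
import pandas as pd

X = pd.DataFrame({
    "hours": [1.0, None, 3.0],
    "fuel": [None, 2.0, 4.0],
})

# Only rows with a missing value in "hours" are removed.
no_missing_hours = X.dropna(subset=["hours"])

# Without a subset, a missing value in any column removes the row.
no_missing_any = X.dropna()
print(no_missing_hours)
print(no_missing_any)
```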
- class prodata.preprocessing.transformers.FeatureRenamer(feature_map: dict)#
Bases:
BaseEstimator, TransformerMixin
Feature Renamer: it renames features based on a given feature-name map.
- Parameters:
feature_map – dictionary mapping the current feature names to the new ones.
- Returns:
A DataFrame with renamed features.
- fit(X, y=None)#
- transform(X)#
- class prodata.preprocessing.transformers.FeatureSelector(features: list)#
Bases:
BaseEstimator, TransformerMixin
Feature Selector: it selects features/attributes based on a given list of features.
- Parameters:
features – list of features to select from a dataset.
- Returns:
A DataFrame with the selected features.
- fit(X, y=None)#
- transform(X)#
- class prodata.preprocessing.transformers.PosixTimeConverter(posix_features: list, keep_posix_features=True, feature_as_index: str = None, rename_index=None)#
Bases:
BaseEstimator, TransformerMixin
POSIX Time Converter: it converts POSIX time into a date-time format.
- Parameters:
posix_features – list of POSIX time features.
keep_posix_features – keep POSIX features after transformation.
feature_as_index – name of the feature to set as index.
rename_index – new name for the index.
- Returns:
A DataFrame with date-time data.
- fit(X, y=None)#
- transform(X)#
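A POSIX-to-date-time conversion can be sketched with pandas `to_datetime`. The unit of the POSIX values is not stated on this page, so seconds are assumed here; the column names and the index name `timestamp` are hypothetical.

```python
import pandas as pd

X = pd.DataFrame({"time": [0, 86400], "value": [1.5, 2.5]})

# Assumption: POSIX values are in seconds; use unit="ms" for milliseconds.
X["time_dt"] = pd.to_datetime(X["time"], unit="s")

# Use the converted feature as the index and give the index a new name.
# The original POSIX column is kept, similar to keep_posix_features=True.
X = X.set_index("time_dt").rename_axis("timestamp")
print(X)
```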