prodata.preprocessing package#

Submodules#

prodata.preprocessing.checkers module#

prodata.preprocessing.checkers.features_in_dataset(data: DataFrame, features: list)#

Check whether a given dataset has a set of features or not.

Parameters:

data – dataset with features/attributes.
features – feature names (attributes) to be checked.

Returns: A list of features found in the dataset.
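The check described above can be illustrated with plain pandas. This is a minimal sketch of the documented behavior, not prodata's implementation; the helper name `features_found` and the column names are hypothetical.

```python
import pandas as pd


def features_found(data: pd.DataFrame, features: list) -> list:
    """Return the requested feature names that exist as columns in `data`."""
    return [name for name in features if name in data.columns]


# Hypothetical dataset and feature names, for illustration only.
df = pd.DataFrame({"machine.id": [1, 2], "total.hours": [10.5, 20.0]})
found = features_found(df, ["total.hours", "total.fuel"])
print(found)
```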

prodata.preprocessing.checkers.has_data(data: DataFrame)#

Check whether a given dataset is empty or not.

Parameters: data – any dataset to be checked.
Returns: None

prodata.preprocessing.datapreprocess module#

prodata.preprocessing.datapreprocess.format_machine_properties(machines: DataFrame) → DataFrame#

Formats property columns from their deserialized form into one property column per unique property key.

Discards the deserialized columns and keeps only the newly formatted ones.

Parameters: machines – DataFrame of machines with property columns.

Returns: DataFrame of machines with formatted property columns.

prodata.preprocessing.datapreprocess.is_missing_data(data: DataFrame)#

prodata.preprocessing.datapreprocess.preprocess_fuel_consumption(data: DataFrame, fuel_feature: str, min_total_fuel_gt=0, min_delta_fuel=0, delta_fuel_threshold=100, print_pp_desc=True, verbose=True)#

Preprocesses the Total Fuel Consumption data.

Parameters:

data – total fuel consumption data.
fuel_feature – name of the total fuel feature.
min_total_fuel_gt – min. value for total fuel (Filtering via Query).
min_delta_fuel – min. delta fuel (Filtering via Query).
delta_fuel_threshold – delta fuel threshold for anomalies (Filtering via Query).
print_pp_desc – print a description of the preprocessed data.
verbose – print preprocessing steps (True/False).

Returns:

A DataFrame with the preprocessed Total Fuel Consumption data.

prodata.preprocessing.datapreprocess.preprocess_operating_hours(data: DataFrame, hours_feature: str, time_feature='time', min_total_op_gt=0, min_delta_hours=0, time_issue_feature='has.delivery.issue', print_pp_desc=True, verbose=True)#

Preprocesses the Total Operating Hours data.

Parameters:

data – total operating hours data.
hours_feature – name of the total op. hours feature.
time_feature – name of the time feature.
min_total_op_gt – min. value for total op. hours (Filtering via Query).
min_delta_hours – min. delta op. hours (Filtering via Query).
time_issue_feature – name of the time issue feature.
print_pp_desc – print a description of the preprocessed data.
verbose – print preprocessing steps (True/False).

Returns:

A DataFrame with the preprocessed Total Op. Hours data.

prodata.preprocessing.disaggregator module#

class prodata.preprocessing.disaggregator.RawDatasetSplitter(signals_key='signal_key', prefix_metrics=['value.common.', 'value.custom.'])#

Bases: BaseEstimator, TransformerMixin

Raw Dataset Splitter: it splits a raw dataset pulled via the Exports endpoint by signal key/name.

Parameters:

signals_key – name of the feature that contains the signal name.
prefix_metrics – list of Proemion's standard signal prefixes.

Returns:

A dictionary keyed by signal name. Each value contains a signal label and a dataset (DataFrame).

Output structure: value.common.<name>: {label: 'name', data: pd.DataFrame(data)}
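The output structure can be pictured with a pandas-only sketch. It assumes a long-format raw dataset with one row per signal reading and derives the label by stripping the prefix; both are assumptions made for illustration, and `split_by_signal` is a hypothetical helper, not the class's actual code.

```python
import pandas as pd

PREFIXES = ["value.common.", "value.custom."]


def split_by_signal(raw: pd.DataFrame, signals_key: str = "signal_key") -> dict:
    """Build {signal name: {"label": ..., "data": DataFrame}} from a raw dataset."""
    result = {}
    for signal, group in raw.groupby(signals_key):
        label = signal
        for prefix in PREFIXES:
            if signal.startswith(prefix):
                # Assumed labeling rule: the signal name without its prefix.
                label = signal[len(prefix):]
                break
        result[signal] = {"label": label, "data": group.drop(columns=signals_key)}
    return result


# Hypothetical raw export with two signals.
raw = pd.DataFrame({
    "signal_key": ["value.common.engine.hours", "value.custom.fuel",
                   "value.common.engine.hours"],
    "time": [1, 1, 2],
    "value": [10.0, 5.0, 11.0],
})
signals = split_by_signal(raw)
print(sorted(signals))
```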

fit(X, y=None)#

transform(X, y=None)#

prodata.preprocessing.filters module#

class prodata.preprocessing.filters.MinPointsPerMachine(machine_id: str, min_number_dps: int)#

Bases: BaseEstimator, TransformerMixin

Machine Data Points Filter: it counts data points on a DataFrame by grouping machine IDs and then selects/filters only the machines with the minimum number of data points.

Parameters:

machine_id – name of the machine ID feature in the dataset.
min_number_dps – minimum number of data points.

Returns:

A DataFrame with machines with the given minimum number of data points.
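A pandas-only sketch of the counting and filtering step follows. It assumes "minimum" means a count greater than or equal to `min_number_dps`; the helper and column names are hypothetical and the class itself may work differently.

```python
import pandas as pd


def keep_machines_with_min_points(data: pd.DataFrame, machine_id: str,
                                  min_number_dps: int) -> pd.DataFrame:
    """Keep only rows of machines that have at least `min_number_dps` rows."""
    # Map every row to the number of rows its machine has in the dataset.
    counts = data[machine_id].map(data[machine_id].value_counts())
    return data[counts >= min_number_dps]


# Hypothetical data: machine "a" has three data points, machine "b" has one.
df = pd.DataFrame({"machine.id": ["a", "a", "a", "b"], "hours": [1, 2, 3, 4]})
filtered = keep_machines_with_min_points(df, "machine.id", 2)
print(filtered)
```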

fit(X, y=None)#

transform(X)#

class prodata.preprocessing.filters.Querier(query: str)#

Bases: BaseEstimator, TransformerMixin

Feature Filter: it runs queries on a DataFrame.

With this filter, you can select the dataset based on your needs.

Given a feature called A (int values), you could run queries such as:

“A > 10”
“A > 10 & A <= 20”

Parameters: query – query to be applied.
Returns: A DataFrame with selected data based on the passed query.
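The two example queries can be tried directly with `pandas.DataFrame.query`, which accepts the same expression syntax. Whether Querier delegates to that method is an assumption; the sketch only shows what the expressions select.

```python
import pandas as pd

# Hypothetical dataset with a single integer feature A.
df = pd.DataFrame({"A": [5, 12, 20, 31]})

above_ten = df.query("A > 10")
in_range = df.query("A > 10 & A <= 20")

print(above_ten["A"].tolist())
print(in_range["A"].tolist())
```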

fit(X, y=None)#

transform(X)#

prodata.preprocessing.markers module#

class prodata.preprocessing.markers.TimeLogIssueMarker(delta_hours_feature: str, delta_logtime_feature: str)#

Bases: BaseEstimator, TransformerMixin

Time Log Issue Marker: it marks data points that logged (delivered) more hours than the time that passed.

Parameters:

delta_hours_feature – name of the feature with delta op. hours.
delta_logtime_feature – name of the feature with delta of the log time.

Returns:

A DataFrame with a boolean feature that indicates an issue in the delivery hours.

Note

Delta log time (attribute) should be in the same unit as the delta hours (minutes/hours).
The delta hours feature must be present in the dataset.
Delta log time is rounded to reduce the minor/insignificant differences with delivered hours.
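The marking rule can be sketched with pandas as follows. The rounding precision (two decimals) and the helper name are assumptions; the output column name is borrowed from the `time_issue_feature` default shown earlier in this reference.

```python
import pandas as pd


def mark_time_log_issues(data: pd.DataFrame, delta_hours_feature: str,
                         delta_logtime_feature: str,
                         issue_feature: str = "has.delivery.issue") -> pd.DataFrame:
    """Flag rows whose delivered hours exceed the (rounded) elapsed log time."""
    marked = data.copy()
    # Rounding absorbs insignificant differences; 2 decimals is an assumed precision.
    elapsed = marked[delta_logtime_feature].round(2)
    marked[issue_feature] = marked[delta_hours_feature] > elapsed
    return marked


# Hypothetical deltas, both expressed in hours.
df = pd.DataFrame({"delta.hours": [1.0, 3.0, 0.5],
                   "delta.logtime": [1.004, 2.0, 0.5]})
marked = mark_time_log_issues(df, "delta.hours", "delta.logtime")
print(marked)
```

Only the second row is flagged: it reports 3 operating hours although just 2 hours passed between log entries.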

fit(X, y=None)#

transform(X)#

prodata.preprocessing.transformers module#

class prodata.preprocessing.transformers.BoxCoxTransformer(features: list, machine_id_feature: str, lmbda: float, adjust_transformation=True)#

Bases: BaseEstimator, TransformerMixin

Box Cox Transformer: it transforms the data using a Box-Cox transformation.

Parameters:

features – name of the features to be transformed.
machine_id_feature – name of the machine ID feature in the dataset.
lmbda – lambda parameter of the Box-Cox transformation.
adjust_transformation – fill missing/NaN transformed values with the absolute values of the feature(s).

Returns:

A DataFrame with transformed data.
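For reference, the Box-Cox transformation of a positive value x is (x^lmbda - 1) / lmbda when lmbda is non-zero and ln(x) when lmbda is zero. The sketch below applies it with NumPy and shows one possible reading of `adjust_transformation`; it ignores the per-machine grouping and is not the class's actual code.

```python
import numpy as np
import pandas as pd


def box_cox(values: pd.Series, lmbda: float, adjust: bool = True) -> pd.Series:
    """Box-Cox transform; non-positive inputs are undefined and become NaN."""
    positive = values.where(values > 0)
    if lmbda == 0:
        transformed = np.log(positive)
    else:
        transformed = (positive ** lmbda - 1.0) / lmbda
    if adjust:
        # Assumed meaning of adjust_transformation: fall back to |x| where undefined.
        transformed = transformed.fillna(values.abs())
    return transformed


# Hypothetical feature values, including non-positive ones.
hours = pd.Series([1.0, 4.0, 0.0, -9.0])
transformed = box_cox(hours, lmbda=0.5)
print(transformed.tolist())
```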

fit(X, y=None)#

transform(X)#

class prodata.preprocessing.transformers.CategoricalDataEncoder(features: list)#

Bases: BaseEstimator, TransformerMixin

Categorical Data Encoder: it maps categorical features into a binary vector with length equal to the number of categories in the given feature.

Parameters: features – a set of categorical features to be encoded.
Returns: A DataFrame with encoded (categorical) data.
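This kind of encoding (one binary column per category) can be reproduced with `pandas.get_dummies`. The sketch is an illustration with hypothetical column names; the encoder may name or order its output columns differently.

```python
import pandas as pd

# Hypothetical dataset with one categorical feature.
df = pd.DataFrame({"hours": [10, 20, 30],
                   "type": ["excavator", "crane", "excavator"]})

# Each category of "type" becomes its own binary column.
encoded = pd.get_dummies(df, columns=["type"])
print(encoded.columns.tolist())
```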

fit(X, y=None)#

transform(X)#

class prodata.preprocessing.transformers.DeltaCalculator(features: list, machine_id_feature: str)#

Bases: BaseEstimator, TransformerMixin

Delta Calculator: it computes delta based on (counter) signals/data.

Parameters:

features – name of the features to get deltas.
machine_id_feature – name of the machine ID feature in the dataset.

Returns:

A DataFrame with deltas.

Note

The dataset must contain the machine ID so that the delta can be computed as efficiently as possible.
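A per-machine delta of a counter signal can be sketched with a grouped `diff` in pandas. The `delta.` column prefix and the helper name are hypothetical; the first reading of each machine has no predecessor and therefore yields NaN.

```python
import pandas as pd


def add_deltas(data: pd.DataFrame, features: list,
               machine_id_feature: str) -> pd.DataFrame:
    """Add one delta column per feature, computed within each machine."""
    result = data.copy()
    deltas = result.groupby(machine_id_feature)[features].diff()
    for name in features:
        result[f"delta.{name}"] = deltas[name]
    return result


# Hypothetical counter readings for two interleaved machines.
df = pd.DataFrame({"machine.id": ["a", "a", "b", "b", "a"],
                   "total.hours": [10, 12, 100, 105, 15]})
with_deltas = add_deltas(df, ["total.hours"], "machine.id")
print(with_deltas)
```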

fit(X, y=None)#

transform(X)#

class prodata.preprocessing.transformers.DeltaTimeLogCalculator(machine_id_feature: str, time_metric='h', time_as_index=True, time_feature_name: str = None)#

Bases: BaseEstimator, TransformerMixin

Delta Time Log Calculator: it computes the delta of the log time. You can compute it if the time is already in the index or in the list of features (columns).

Parameters:

machine_id_feature – name of the machine ID feature in the dataset.
time_metric – conversion unit: h = hour (default), m = minute.
time_as_index – is the time feature already set as index?
time_feature_name – name of the time feature.

Returns:

A DataFrame with delta of the logged time.
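The sketch below computes the elapsed time between consecutive log entries of the same machine and converts it to the requested unit. It covers only the case where time is a regular column; the output column name `delta.logtime` and the helper are hypothetical.

```python
import pandas as pd


def delta_log_time(data: pd.DataFrame, machine_id_feature: str,
                   time_feature_name: str, time_metric: str = "h") -> pd.DataFrame:
    """Add the time elapsed since each machine's previous log entry."""
    seconds_per_unit = {"h": 3600.0, "m": 60.0}[time_metric]
    result = data.copy()
    # Previous timestamp of the same machine (NaT for its first entry).
    previous = result.groupby(machine_id_feature)[time_feature_name].shift()
    elapsed = result[time_feature_name] - previous
    result["delta.logtime"] = elapsed.dt.total_seconds() / seconds_per_unit
    return result


# Hypothetical log entries for two machines.
df = pd.DataFrame({
    "machine.id": ["a", "a", "b", "b"],
    "time": pd.to_datetime(["2024-01-01 00:00", "2024-01-01 01:30",
                            "2024-01-01 00:00", "2024-01-01 00:45"]),
})
logged = delta_log_time(df, "machine.id", "time")
print(logged)
```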

fit(X, y=None)#

transform(X)#

class prodata.preprocessing.transformers.DropFeatureSelector(drop_features: list)#

Bases: BaseEstimator, TransformerMixin

Drop Feature Selector: it removes columns from the dataset based on names.

Parameters: drop_features – names of the features to remove.
Returns: A DataFrame without the given list of features.

fit(X, y=None)#

transform(X)#

class prodata.preprocessing.transformers.DropRowSelector(subset_features: list = None)#

Bases: BaseEstimator, TransformerMixin

Drop Row Selector: it drops rows that contain missing values. Optionally, the check can be restricted to a subset of features.

Parameters: subset_features – names of the features to check for missing values; rows are removed based on this set of features.
Returns: A DataFrame without missing values based on a given set of features.
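The difference between checking all features and checking a subset can be seen with `pandas.DataFrame.dropna`. That the selector relies on this method is an assumption; the column names are hypothetical.

```python
import pandas as pd

# Hypothetical dataset with missing values in different columns.
df = pd.DataFrame({"hours": [1.0, None, 3.0], "fuel": [None, 2.0, 4.0]})

# Check only "hours": the row with a missing "fuel" value survives.
by_hours = df.dropna(subset=["hours"])

# Check every column: only fully populated rows survive.
complete = df.dropna()

print(by_hours.index.tolist(), complete.index.tolist())
```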

fit(X, y=None)#

transform(X)#

class prodata.preprocessing.transformers.FeatureRenamer(feature_map: dict)#

Bases: BaseEstimator, TransformerMixin

Feature Renamer: it renames features based on a given feature-name map.

Parameters: feature_map – dictionary containing the current and new feature names.
Returns: A DataFrame with renamed features.
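A feature map is assumed here to have the form {current name: new name}, which is what `pandas.DataFrame.rename` expects. The sketch uses hypothetical column names and is only meant to show the shape of such a map.

```python
import pandas as pd

# Hypothetical raw column names and their friendlier replacements.
df = pd.DataFrame({"machine": ["a"], "value.common.engine.hours": [10.0]})
feature_map = {"machine": "machine.id",
               "value.common.engine.hours": "total.hours"}

renamed = df.rename(columns=feature_map)
print(renamed.columns.tolist())
```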

fit(X, y=None)#

transform(X)#

class prodata.preprocessing.transformers.FeatureSelector(features: list)#

Bases: BaseEstimator, TransformerMixin

Feature Selector: it selects features/attributes based on a given list of features.

Parameters: features – list of features to select from a dataset.
Returns: A DataFrame with the selected features.

fit(X, y=None)#

transform(X)#

class prodata.preprocessing.transformers.PosixTimeConverter(posix_features: list, keep_posix_features=True, feature_as_index: str = None, rename_index=None)#

Bases: BaseEstimator, TransformerMixin

Posix Time Converter: it converts POSIX time into a date-time format.

Parameters:

posix_features – list of POSIX time features.
keep_posix_features – keep POSIX features after transformation.
feature_as_index – name of the feature to set as index.
rename_index – new name for the index.

Returns:

A DataFrame with date-time data.
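The conversion can be sketched with `pandas.to_datetime`. The sketch assumes POSIX time in seconds (use `unit="ms"` if the data holds milliseconds) and uses hypothetical names for the new column and the index; it keeps the original POSIX column, in the spirit of `keep_posix_features=True`.

```python
import pandas as pd

# Hypothetical dataset with POSIX timestamps in seconds.
df = pd.DataFrame({"time": [0, 86400], "value": [1, 2]})

converted = df.copy()
# Assumption: seconds since 1970-01-01 UTC.
converted["time.datetime"] = pd.to_datetime(converted["time"], unit="s", utc=True)
# Use the converted feature as the index and give the index a new name.
converted = converted.set_index("time.datetime").rename_axis("timestamp")
print(converted)
```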

fit(X, y=None)#

transform(X)#
