prodata.preprocessing package#

Submodules#

prodata.preprocessing.checkers module#

prodata.preprocessing.checkers.features_in_dataset(data: DataFrame, features: list)#

Check whether a given dataset has a set of features or not.

Parameters:
  • data – dataset with features/attributes.

  • features – feature names (attributes) to be checked.

Returns:

A list of features found in the dataset.
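
The check can be pictured with plain pandas. The sketch below is an assumption about the behaviour (it keeps the requested names that appear among the columns), not the library's implementation; the helper name `found_features` is hypothetical.

```python
import pandas as pd


def found_features(data: pd.DataFrame, features: list) -> list:
    # Hypothetical stand-in: keep the requested names that exist as columns.
    return [f for f in features if f in data.columns]


df = pd.DataFrame({"machine_id": [1, 2], "time": [10, 20]})
present = found_features(df, ["time", "fuel"])
```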

prodata.preprocessing.checkers.has_data(data: DataFrame)#

Check whether a given dataset is empty or not.

Parameters:

data – any dataset to be checked

Returns:

None

prodata.preprocessing.datapreprocess module#

prodata.preprocessing.datapreprocess.format_machine_properties(machines: DataFrame) DataFrame#

Formats property columns from their deserialized format into one property column per unique property key.

Discards the deserialized columns and keeps only the newly formatted ones.

Parameters:

machines – DataFrame of machines with property columns.

Returns:

DataFrame of machines with formatted property columns.

prodata.preprocessing.datapreprocess.is_missing_data(data: DataFrame)#
prodata.preprocessing.datapreprocess.preprocess_fuel_consumption(data: DataFrame, fuel_feature: str, min_total_fuel_gt=0, min_delta_fuel=0, delta_fuel_threshold=100, print_pp_desc=True, verbose=True)#

Preprocesses the Total Fuel Consumption data.

Parameters:
  • data – total fuel consumption data.

  • fuel_feature – name of the total fuel feature.

  • min_total_fuel_gt – min. value for total fuel (Filtering via Query).

  • min_delta_fuel – min. delta fuel (Filtering via Query).

  • delta_fuel_threshold – delta fuel threshold for anomalies (Filtering via Query).

  • print_pp_desc – print a description of the preprocessed data.

  • verbose – print preprocessing steps (True/False).

Returns:

A DataFrame with preprocessed Total Fuel Consumption.
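
How the three threshold parameters could translate into query-based filtering is sketched below with plain pandas. Everything here is an assumption made for illustration: the column name, the sample values, the choice of strict or inclusive comparisons, and the idea that deltas above the threshold are dropped as anomalies. It is not the function's actual implementation.

```python
import pandas as pd

# Assumed column name and sample counter readings, for illustration only.
fuel_feature = "total_fuel"
data = pd.DataFrame({fuel_feature: [0.0, 50.0, 60.0, 400.0, 410.0]})

min_total_fuel_gt = 0
min_delta_fuel = 0
delta_fuel_threshold = 100

# Keep rows whose counter value exceeds the minimum (assumed strict '>').
kept = data.query(f"{fuel_feature} > {min_total_fuel_gt}").copy()

# Delta between consecutive counter readings; the first row has no delta.
kept["delta_fuel"] = kept[fuel_feature].diff()

# Keep deltas inside the assumed bounds; rows with a missing delta drop out.
kept = kept.query(
    f"delta_fuel >= {min_delta_fuel} and delta_fuel <= {delta_fuel_threshold}"
)
```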

prodata.preprocessing.datapreprocess.preprocess_operating_hours(data: DataFrame, hours_feature: str, time_feature='time', min_total_op_gt=0, min_delta_hours=0, time_issue_feature='has.delivery.issue', print_pp_desc=True, verbose=True)#

Preprocesses the Total Operating Hours data.

Parameters:
  • data – total operating hours data.

  • hours_feature – name of the total op. hours feature.

  • time_feature – name of the time feature.

  • min_total_op_gt – min. value for total op. hours (Filtering via Query).

  • min_delta_hours – min. delta op. hours (Filtering via Query).

  • time_issue_feature – name of the time issue feature.

  • print_pp_desc – print a description of the preprocessed data.

  • verbose – print preprocessing steps (True/False).

Returns:

A DataFrame with preprocessed Total Op. Hours.

prodata.preprocessing.disaggregator module#

class prodata.preprocessing.disaggregator.RawDatasetSplitter(signals_key='signal_key', prefix_metrics=['value.common.', 'value.custom.'])#

Bases: BaseEstimator, TransformerMixin

Raw Dataset Splitter: it splits a raw dataset pulled via the Exports endpoint by signal key/name.

Parameters:
  • signals_key – name of the feature that contains the signal name.

  • prefix_metrics – list of Proemion’s standard signal prefixes.

Returns:

A dictionary keyed by signal name. Each value contains the signal label and its dataset (DataFrame).

Output structure:

value.common.<name>: {label: 'name', data: pd.DataFrame(data)}

fit(X, y=None)#
transform(X, y=None)#
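
The output structure above can be reproduced with a pandas `groupby`. This is a minimal sketch, assuming a long-format dataset with one row per reading and assuming the label is the signal key with its prefix removed; the sample data and the `strip_prefix` helper are hypothetical.

```python
import pandas as pd

prefixes = ["value.common.", "value.custom."]
raw = pd.DataFrame({
    "signal_key": ["value.common.fuel", "value.custom.temp", "value.common.fuel"],
    "value": [1.0, 20.5, 2.0],
})


def strip_prefix(key: str) -> str:
    # Assumed labelling rule: the signal key without its standard prefix.
    for prefix in prefixes:
        if key.startswith(prefix):
            return key[len(prefix):]
    return key


# One entry per signal key, each holding a label and that signal's rows.
split = {
    key: {"label": strip_prefix(key), "data": group.reset_index(drop=True)}
    for key, group in raw.groupby("signal_key")
}
```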

prodata.preprocessing.filters module#

class prodata.preprocessing.filters.MinPointsPerMachine(machine_id: str, min_number_dps: int)#

Bases: BaseEstimator, TransformerMixin

Machine Data Points Filter: it counts data points on a DataFrame by grouping machine IDs and then selects/filters only the machines with the minimum number of data points.

Parameters:
  • machine_id – name of the machine ID feature in the dataset.

  • min_number_dps – minimum number of data points.

Returns:

A DataFrame with machines with the given minimum number of data points.

fit(X, y=None)#
transform(X)#
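
The counting and filtering step can be sketched with pandas alone. The sample data is invented, and treating the minimum as inclusive (`>=`) is an assumption.

```python
import pandas as pd

df = pd.DataFrame({
    "machine_id": ["a", "a", "a", "b"],
    "value": [1, 2, 3, 4],
})
min_number_dps = 2

# Number of rows per machine, broadcast back onto every row.
counts = df.groupby("machine_id")["machine_id"].transform("size")

# Keep only rows of machines that reach the minimum (assumed inclusive).
filtered = df[counts >= min_number_dps]
```
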
class prodata.preprocessing.filters.Querier(query: str)#

Bases: BaseEstimator, TransformerMixin

Feature Filter: it runs queries on a DataFrame.

With this filter, you can select the dataset based on your needs.

Given a feature called A (int values), you could run queries such as:
  • "A > 10"

  • "A > 10 & A <= 20"

Parameters:

query – query to be applied.

Returns:

A DataFrame with selected data based on the passed query.

fit(X, y=None)#
transform(X)#
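
The query strings above follow the syntax of pandas `DataFrame.query`. The sketch below runs the second one directly on a small invented DataFrame; that the class delegates to `DataFrame.query` is an assumption.

```python
import pandas as pd

df = pd.DataFrame({"A": [5, 12, 20, 25]})

# Comparison operators bind tighter than '&' inside a query string.
selected = df.query("A > 10 & A <= 20")
```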

prodata.preprocessing.markers module#

class prodata.preprocessing.markers.TimeLogIssueMarker(delta_hours_feature: str, delta_logtime_feature: str)#

Bases: BaseEstimator, TransformerMixin

Time Log Issue Marker: it marks data points that logged more (delivered) hours than the time that elapsed.

Parameters:
  • delta_hours_feature – name of the feature with delta op. hours.

  • delta_logtime_feature – name of the feature with delta of the log time.

Returns:

A DataFrame with a boolean feature that indicates an issue in the delivery hours.

Note

  • Delta log time (attribute) should be in the same unit as the delta hours (minutes/hours).

  • Delta hours (feature) should be in the dataset.

  • Delta log time is rounded to reduce the minor/insignificant differences with delivered hours.

fit(X, y=None)#
transform(X)#
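
The marking rule described in the notes can be sketched as a single comparison. Column names, sample values, and rounding to whole units are assumptions; the output column name is borrowed from the `time_issue_feature` default shown earlier.

```python
import pandas as pd

df = pd.DataFrame({
    "delta_hours": [1.0, 3.0, 0.5],
    "delta_logtime": [1.02, 2.0, 0.98],
})

# Flag rows where more hours were delivered than (rounded) time elapsed.
df["has.delivery.issue"] = df["delta_hours"] > df["delta_logtime"].round()
```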

prodata.preprocessing.transformers module#

class prodata.preprocessing.transformers.BoxCoxTransformer(features: list, machine_id_feature: str, lmbda: float, adjust_transformation=True)#

Bases: BaseEstimator, TransformerMixin

Box Cox Transformer: it transforms the data using a Box-Cox transformation.

Parameters:
  • features – names of the features to be transformed.

  • machine_id_feature – name of the machine ID feature in the dataset.

  • lmbda – lambda parameter of the Box-Cox transformation.

  • adjust_transformation – fill Missing/NaN transformations with the absolute values of the feature(s).

Returns:

A DataFrame with transformed data.

fit(X, y=None)#
transform(X)#
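
For reference, the standard Box-Cox formula with a fixed lambda looks as follows. The sketch covers only the textbook transformation for positive values; the per-machine handling and the `adjust_transformation` behaviour are not reproduced, and the helper name `box_cox` is hypothetical.

```python
import numpy as np
import pandas as pd


def box_cox(x: pd.Series, lmbda: float) -> pd.Series:
    # Standard Box-Cox: log(x) when lambda is 0, else (x**lambda - 1) / lambda.
    if lmbda == 0:
        return np.log(x)
    return (x ** lmbda - 1) / lmbda


values = pd.Series([1.0, 4.0, 9.0])
transformed = box_cox(values, 0.5)
```
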
class prodata.preprocessing.transformers.CategoricalDataEncoder(features: list)#

Bases: BaseEstimator, TransformerMixin

Categorical Data Encoder: it maps categorical features into a binary vector with length equal to the number of categories in the given feature.

Parameters:

features – a set of categorical features to be encoded.

Returns:

A DataFrame with encoded (categorical) data.

fit(X, y=None)#
transform(X)#
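
The encoding described here is commonly called one-hot encoding. A minimal sketch with `pandas.get_dummies` is shown below; whether the class uses this function, and how it names the new columns, is an assumption.

```python
import pandas as pd

df = pd.DataFrame({
    "machine_id": [1, 2, 3],
    "type": ["excavator", "loader", "excavator"],
})

# One binary column per category of each encoded feature.
encoded = pd.get_dummies(df, columns=["type"])
```
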
class prodata.preprocessing.transformers.DeltaCalculator(features: list, machine_id_feature: str)#

Bases: BaseEstimator, TransformerMixin

Delta Calculator: it computes delta based on (counter) signals/data.

Parameters:
  • features – names of the features to compute deltas for.

  • machine_id_feature – name of the machine ID feature in the dataset.

Returns:

A DataFrame with deltas.

Note

The dataset must contain the machine ID so that the delta can be computed as efficiently as possible.

fit(X, y=None)#
transform(X)#
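
A per-machine delta on a counter signal can be sketched with `groupby(...).diff()`. Column names and values are invented, and the naming of the resulting column is an assumption.

```python
import pandas as pd

df = pd.DataFrame({
    "machine_id": ["a", "a", "b", "b"],
    "op_hours": [10.0, 12.5, 100.0, 101.0],
})

# Difference to the previous reading of the same machine; the first
# reading of each machine has no predecessor and becomes NaN.
df["delta_op_hours"] = df.groupby("machine_id")["op_hours"].diff()
```
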
class prodata.preprocessing.transformers.DeltaTimeLogCalculator(machine_id_feature: str, time_metric='h', time_as_index=True, time_feature_name: str = None)#

Bases: BaseEstimator, TransformerMixin

Delta Time Log Calculator: it computes the delta of the log time. You can compute it if the time is already in the index or in the list of features (columns).

Parameters:
  • machine_id_feature – name of the machine ID feature in the dataset.

  • time_metric – conversion metric: h = hour (default), m = minute.

  • time_as_index – is the time feature already set as index?

  • time_feature_name – name of the time feature.

Returns:

A DataFrame with delta of the logged time.

fit(X, y=None)#
transform(X)#
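
For the case where the time is already the index, the delta of the log time can be sketched as below, expressed in hours. The sample timestamps and the output column name are assumptions.

```python
import pandas as pd

df = pd.DataFrame(
    {"machine_id": ["a", "a", "b", "b"]},
    index=pd.to_datetime([
        "2024-01-01 00:00", "2024-01-01 02:30",
        "2024-01-01 00:15", "2024-01-01 01:15",
    ]),
)

# Time elapsed since the previous log entry of the same machine.
log_time = df.index.to_series()
delta = log_time.groupby(df["machine_id"]).diff()

# Convert the timedelta to hours (divide by 60 instead for minutes).
df["delta_logtime"] = delta.dt.total_seconds() / 3600.0
```
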
class prodata.preprocessing.transformers.DropFeatureSelector(drop_features: list)#

Bases: BaseEstimator, TransformerMixin

Drop Feature Selector: it removes columns from the dataset based on names.

Parameters:

drop_features – names of the features to remove.

Returns:

A DataFrame without the given list of features.

fit(X, y=None)#
transform(X)#
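
In plain pandas, the equivalent operation is a column drop. The sketch assumes the listed names exist in the dataset; the data is invented.

```python
import pandas as pd

df = pd.DataFrame({"machine_id": [1], "fuel": [3.2], "debug": ["x"]})

# Remove the named columns and keep everything else.
reduced = df.drop(columns=["debug"])
```
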
class prodata.preprocessing.transformers.DropRowSelector(subset_features: list = None)#

Bases: BaseEstimator, TransformerMixin

Drop Row Selector: it drops rows if they contain missing values. Also, it can be based on a subset of features.

Parameters:

subset_features – names of the features to check for missing values; rows are removed based on this set of features.

Returns:

A DataFrame without missing values based on a given set of features.

fit(X, y=None)#
transform(X)#
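
The difference between dropping on all features and dropping on a subset can be sketched with `DataFrame.dropna`. The sample data is invented, and delegation to `dropna` is an assumption.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "fuel": [1.0, np.nan, 3.0],
    "note": [None, "ok", "ok"],
})

# Without a subset, a missing value in any column removes the row.
complete_rows = df.dropna()

# With a subset, only missing values in the listed columns count.
fuel_rows = df.dropna(subset=["fuel"])
```
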
class prodata.preprocessing.transformers.FeatureRenamer(feature_map: dict)#

Bases: BaseEstimator, TransformerMixin

Feature Renamer: it renames features based on a given feature-name map.

Parameters:

feature_map – dictionary containing the current and new feature names.

Returns:

A DataFrame with renamed features.

fit(X, y=None)#
transform(X)#
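
A feature-name map of this kind could look as follows. The sketch assumes the dictionary maps current names (keys) to new names (values), as `DataFrame.rename` expects; the names themselves are invented.

```python
import pandas as pd

df = pd.DataFrame({"value.common.fuel": [3.2], "machine.id": [7]})

# Keys are current names, values are the new names (assumed direction).
feature_map = {"value.common.fuel": "fuel", "machine.id": "machine_id"}
renamed = df.rename(columns=feature_map)
```
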
class prodata.preprocessing.transformers.FeatureSelector(features: list)#

Bases: BaseEstimator, TransformerMixin

Feature Selector: it selects features/attributes based on a given list of features.

Parameters:

features – list of features to select from a dataset.

Returns:

A DataFrame with the selected features.

fit(X, y=None)#
transform(X)#
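
The `fit`/`transform` pair shared by the classes in this module follows the scikit-learn transformer convention. The stand-in below mirrors that interface without depending on scikit-learn; the class name `SelectColumns` is hypothetical and the code is not the library's implementation.

```python
import pandas as pd


class SelectColumns:
    """Hypothetical stand-in mirroring the fit/transform interface."""

    def __init__(self, features: list):
        self.features = features

    def fit(self, X, y=None):
        # Nothing to learn; return self so calls can be chained.
        return self

    def transform(self, X: pd.DataFrame) -> pd.DataFrame:
        # Keep only the requested columns, in the requested order.
        return X[self.features]


df = pd.DataFrame({"machine_id": [1], "fuel": [3.2], "debug": ["x"]})
selected = SelectColumns(["machine_id", "fuel"]).fit(df).transform(df)
```
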
class prodata.preprocessing.transformers.PosixTimeConverter(posix_features: list, keep_posix_features=True, feature_as_index: str = None, rename_index=None)#

Bases: BaseEstimator, TransformerMixin

Posix Time Converter: it converts the POSIX time into a date-time format.

Parameters:
  • posix_features – list of POSIX time features.

  • keep_posix_features – keep POSIX features after transformation.

  • feature_as_index – name of the feature to set as index.

  • rename_index – new name for the index.

Returns:

A DataFrame with date-time data.

fit(X, y=None)#
transform(X)#
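
The conversion can be sketched with `pandas.to_datetime`. The sketch assumes POSIX timestamps in seconds and keeps the original column, mirroring `keep_posix_features=True`; the column and index names are invented.

```python
import pandas as pd

df = pd.DataFrame({"time": [0, 3600], "value": [1.0, 2.0]})

# Assumes seconds since the epoch; use unit="ms" for milliseconds.
df["time_dt"] = pd.to_datetime(df["time"], unit="s")

# Use the converted feature as index and give the index a new name.
df = df.set_index("time_dt").rename_axis("timestamp")
```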

Module contents#