prodata package#
Subpackages#
- prodata.preprocessing package
- Submodules
- prodata.preprocessing.checkers module
- prodata.preprocessing.datapreprocess module
- prodata.preprocessing.disaggregator module
- prodata.preprocessing.filters module
- prodata.preprocessing.markers module
- prodata.preprocessing.transformers module
- Module contents
Submodules#
prodata.proquery module#
- class prodata.proquery.ProApiConfig#
Bases: Configuration
Extended proapi Configuration that stores additional OAuth2 token metadata.
- class prodata.proquery.ProQuery(client_id: str = None, client_secret: str = None, username: str = None, password: str = None, url: str = None, timeout: int = None, retries: int | Retry = None, access_token: str = None)#
Bases: object
ProQuery uses the Proemion API via the ‘proapi’ package for accessing endpoints and formatting results for scientific processing.
- There are four ways of calling the endpoints:
1. Raw responses via the APIs offered by the api attribute (ApiHandler).
2. get_resp() – Get data formatted as a list.
3. get_df() – Get data formatted as a pandas DataFrame.
4. input_df() – Uses mass requests and returns a DataFrame.
Methods 2-4 offer additional parameters and functionality; refer to their respective implementation documentation for details.
On instance initialization, ProQuery attempts to authenticate with the Proemion API using your credentials and white-label URL (if available).
Authentication is client- or user-based. Call __init__ with your credentials or provide them (client only) as environment variables: “PROEMION_API_CLIENT_ID” and “PROEMION_API_CLIENT_SECRET”.
Customers with white label URL should use their custom URL, e.g. https://customURL/api, instead of https://dataportal.proemion.com/api as a base URL.
- Parameters:
client_id (str) – Proemion API client id
client_secret (str) – Proemion API client secret
username (str) – Proemion user
password (str) – Proemion password
url (str) – Supply to overwrite the base URL; it needs to be supplied without the version. The version is always taken from the currently installed proapi config. To change the API version, install the respective proapi version.
timeout (int) – Default timeout in seconds, only applied when invoking get_resp() / get_df() / get_timeseries() without ‘_request_timeout’ kwarg.
retries – Default retries, applied when using the underlying urllib3 pool manager. Applies to direct api calls and get_resp() / get_df() / get_timeseries() invocations. Set as int for simple redirect retries or as urllib3 Retry for a detailed strategy: https://urllib3.readthedocs.io/en/stable/reference/urllib3.util.html#urllib3.util.Retry
- authenticate(client_id: str = None, client_secret: str = None, username: str = None, password: str = None)#
Authenticate proapi client with the API.
If client_id and client_secret are not provided, checks whether the environment variables exist. Prioritizes client over username authentication if both are present.
- Parameters:
client_id (str) – Proemion API client id
client_secret (str) – Proemion API client secret
username (str) – Proemion user
password (str) – Proemion password
- Raises:
ValueError – if neither ‘client_id’ nor ‘username’ is provided.
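The credential precedence described above can be sketched with plain Python. The helper name `resolve_credentials` is illustrative only and not part of the package; the environment variable names come from the class docstring.

```python
import os

def resolve_credentials(client_id=None, client_secret=None,
                        username=None, password=None):
    """Illustrative sketch of the documented precedence: explicit client
    credentials, then the environment variables, then user credentials."""
    # Fall back to the documented environment variables for client auth.
    client_id = client_id or os.environ.get("PROEMION_API_CLIENT_ID")
    client_secret = client_secret or os.environ.get("PROEMION_API_CLIENT_SECRET")
    if client_id and client_secret:
        # Client authentication wins when both credential sets are present.
        return ("client", client_id, client_secret)
    if username and password:
        return ("user", username, password)
    raise ValueError("Provide 'client_id'/'client_secret' or 'username'/'password'.")
```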
- check_area_presence(df: DataFrame, _from, to, areas: list, check_column: str = 'machine_id') DataFrame#
Check whether an entity was present in the given geo areas during a machine’s presence.
An entity can be anything from the machine itself to a DTC. The DataFrame of entities to check always needs to have the column ‘machine_id’. For each row’s machine_id the area presences are queried.
By default the presence of the machine itself is checked (check_column=’machine_id’). As long as a single presence is given, the machine is marked as present in the area, regardless of the count and duration of presences between _from and to.
If the entities in the DataFrame are not machines, the optional date column named by ‘check_column’ (of type datetime) needs to be populated. This date is checked against the machine_id’s area presences instead of the machine itself. Example: check_column=’start’ (start date of DTCs).
- Parameters:
df (pd.DataFrame) – DF to check for presences.
_from – Start time to query presences for. Can be an int, tuple or datetime object.
to – End time to query presences for. Can be an int, tuple or datetime object.
areas (list) – The area ids to check presences for.
check_column (str) – Optional, the date column to check presences for.
Returns: DataFrame with an additional boolean column per checked area. The resulting column(s) are named: {check_column}_in_area_{area_id}
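The output shape can be sketched with pandas. The presence lookup is mocked here, since the real method queries the API per machine_id; the column names and area id are hypothetical.

```python
import pandas as pd

# Hypothetical entity rows; the 'machine_id' column is always required.
df = pd.DataFrame({
    "machine_id": ["m1", "m2"],
    "start": pd.to_datetime(["2024-01-01", "2024-01-02"]),
})

# Mocked presence lookup (the real method queries the API per machine_id).
presences_in_area_42 = {"m1": True, "m2": False}

# One boolean column per checked area, named {check_column}_in_area_{area_id}.
check_column, area_id = "start", 42
df[f"{check_column}_in_area_{area_id}"] = df["machine_id"].map(presences_in_area_42)
```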
- static convert_column_to_datetime(column: Series, unit: str = 'ms') Series#
Convert a Series of timestamps to datetime.
Tries to handle a FloatingPointError caused by a bug in the pandas (> 2.1.4) to_datetime method: pandas-dev/pandas#58419. If the error appears, the column is split into NaN and non-NaN parts and datetime conversion is only applied to the non-NaN timestamps.
- Parameters:
column – pandas Series with timestamps to convert
unit – pd.to_datetime() timestamp unit, default is milliseconds
- Returns:
Series as datetime
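The NaN-splitting workaround can be sketched as follows; this illustrates the idea from the docstring, not the package’s exact implementation, and the helper name is assumed.

```python
import pandas as pd

def to_datetime_safe(column: pd.Series, unit: str = "ms") -> pd.Series:
    """Sketch: convert only the non-NaN timestamps and reassemble, working
    around the FloatingPointError seen in some pandas versions (#58419)."""
    try:
        return pd.to_datetime(column, unit=unit)
    except FloatingPointError:
        mask = column.notna()
        result = pd.Series(pd.NaT, index=column.index)
        result[mask] = pd.to_datetime(column[mask], unit=unit)
        return result

series = pd.Series([1640995200000, None])  # millis and a missing value
converted = to_datetime_safe(series)
```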
- static convert_time_cols(df: DataFrame) DataFrame#
Converts every column of a DataFrame that contains posix timestamps to datetime objects.
- Parameters:
df (pd.DataFrame) – The DataFrame to convert.
- Returns:
The converted DataFrame where every time column now contains datetime objects as values.
- convert_to_posix(*args, delta: float = None) tuple#
Helper method that converts various types of date representations to posix time millis.
The different *args types:
- datetime objects: instances of datetime
- tuples: (Year, Month, Day, Hour, Minute, Second), where Year, Month and Day are required
- posix timestamps: e.g. 1640995200 or 1640995200000. If the posix timestamp is in seconds, it will be converted to millis.
If delta is populated, the method returns a tuple with the start value (now - delta) and the end value (now).
- Parameters:
*args – A various number of time representations.
delta (float) – The time range in days to calculate the timestamps with.
- Returns:
Returns a tuple of posix timestamps in millis of the given *args in the given order.
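The accepted input types can be illustrated with a standard-library sketch. The function name and the seconds-vs-millis threshold are assumptions for illustration, not the package’s code.

```python
from datetime import datetime, timezone

def to_posix_millis(value):
    """Sketch of the documented conversion rules (illustrative only)."""
    if isinstance(value, datetime):
        return int(value.timestamp() * 1000)
    if isinstance(value, tuple):
        # (Year, Month, Day[, Hour, Minute, Second]); interpreted as UTC here.
        return int(datetime(*value, tzinfo=timezone.utc).timestamp() * 1000)
    if value < 10_000_000_000:
        # Heuristic: values this small are posix seconds -> convert to millis.
        return int(value * 1000)
    return int(value)
```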
- static deserialize_df(df: DataFrame, property_columns: bool = False, **kwargs) DataFrame#
Checks for list columns in the DataFrame, iterates over each row of the list column to extract any keys and values that exist within each list. These key-value pairs are then used to create a new set of columns in the resulting DataFrame.
All empty columns will be dropped.
- Parameters:
df – pd.DataFrame
property_columns (bool) – Set machine property keys as columns.
- Returns:
A fully deserialized DataFrame.
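The list-column expansion can be sketched with pandas. The column names and the key/value layout are hypothetical; the real method discovers list columns itself.

```python
import pandas as pd

# Hypothetical response rows: a list column holding key/value dicts.
df = pd.DataFrame({
    "machine_id": ["m1", "m2"],
    "properties": [[{"key": "oem", "value": "acme"}],
                   [{"key": "oem", "value": "beta"}]],
})

# Extract the key/value pairs of each row's list into new columns.
extracted = df["properties"].apply(
    lambda items: {d["key"]: d["value"] for d in items}
).apply(pd.Series)
result = pd.concat([df.drop(columns="properties"), extracted], axis=1)
```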
- get_df(func, to_datetime=True, time_delta: float = None, **kwargs) DataFrame#
Wrapper method that offers the same features as get_resp() but in addition converts the data to a deserialized DataFrame.
- Parameters:
func – The method that calls the desired endpoint.
to_datetime (bool) – Choose whether the columns of the DataFrame containing time values should be converted to datetime objects or kept as posix timestamps.
time_delta (float) – The value in days to calculate the time range parameters ‘_from’ and ‘to’ with. (-> for further info see convert_to_posix())
**kwargs – If property_columns=True, instead of flat property columns, returns properties with keys as columns and values as rows.
- Returns:
The requested data as a pd.DataFrame.
- get_dtcs(machines, _from=None, to=None, q: str = None, time_delta: float = None, sort: str = 'start', to_datetime: bool = True, as_dataframe: bool = True)#
Gets all DTCs of the given machines.
- Parameters:
machines – List or pd.Series of machine ids to query for.
_from – The starting time. Can be an integer, tuple or datetime object. (-> for further info see convert_to_posix())
to – The end time. Can be an integer, tuple or datetime object. (-> for further info see convert_to_posix())
q (str) – The query to execute for filtering.
time_delta (float) – The value in days to calculate the time range parameters ‘_from’ and ‘to’ with. (-> for further info see convert_to_posix())
sort (str) – The name of the property to sort by.
to_datetime (bool) – Choose whether the columns of the DataFrame containing time values should be converted to datetime objects or kept as posix timestamps.
as_dataframe – Sets the return type. If true -> pd.DataFrame - if false -> list
- Returns: A list of dicts or a pd.DataFrame containing the DTCs of each machine.
- get_measurements(machines, signals=None, _from=None, to=None, time_delta: float = None, to_datetime: bool = True, as_dataframe: bool = True)#
Gets the last measurements of the given machine list and signals. If no signals are provided, returns measurements for all signals. Optionally returns a DataFrame.
- Parameters:
machines – List or pd.Series of machine ids to query for.
signals – The names of signals to fetch. Can be a list or pd.Series.
_from – The starting time. Can be an integer, tuple or datetime object. (-> for further info see convert_to_posix())
to – The end time. Can be an integer, tuple or datetime object. (-> for further info see convert_to_posix())
time_delta (float) – The value in days to calculate the time range parameters ‘_from’ and ‘to’ with. (-> for further info see convert_to_posix())
to_datetime (bool) – Choose whether the columns of the DataFrame containing time values should be converted to datetime objects or kept as posix timestamps.
as_dataframe (bool) – Sets the return type. If true -> pd.DataFrame - if false -> list
- Returns: A list of dicts or a pd.DataFrame containing the latest values of the signals for each machine.
- get_resp(func, time_delta: float = None, **kwargs) list#
Wrapper method that offers additional features compared to a direct call over API.some_api….
Offers time conversion (-> see convert_to_posix()), time range handling (-> see time_range_handler()) and pagination handling (-> see pagination_handler()).
- Parameters:
func – The method that calls the desired endpoint.
time_delta (float) – The value in days to calculate the time range parameters ‘_from’ and ‘to’ with. (-> for further info see convert_to_posix())
**kwargs – The function’s parameters.
- Returns:
The requested data as a list.
- get_timeseries(machine_id: str, signals: [<class 'str'>], aggregation: str = 'raw', bucket: str | int = 'hour', _from=None, to=None, time_delta: float = None, time_zone: str = None, as_dataframe: bool = True, **kwargs)#
Queries timeseries for a machine and its signals.
- Parameters:
machine_id (str) – Machine ID.
signals (list) – The signal names to query.
aggregation (str) – The function to aggregate with. Possible functions are: min, max, average, std, sum, raw, avg_serial_diff, cumulative_sum or delta.
bucket (str | int) – Denotes the number of milliseconds in each fixed time bucket to aggregate to. Predefined string options are: hour, day, week or single (end - start). If none of these fits the use case, the bucket can be set manually using an int value.
_from – The starting time. Can be an integer, tuple or datetime object. (-> for further info see convert_to_posix())
to – The end time. Can be an integer, tuple or datetime object. (-> for further info see convert_to_posix())
time_delta (float) – The value in days to calculate the time range parameters ‘_from’ and ‘to’ with. (-> for further info see convert_to_posix())
time_zone (str) – An identifier of a time-zone in which to interpret calendar-based bucket sizes, as defined by the TZ database (e.g. “Europe/Berlin”). Required only if the bucket specifies a calendar-based period. Ignored if a fixed bucket size is used.
as_dataframe (bool) – Sets the return type. If true -> pd.DataFrame - if false -> list
**kwargs – Additional endpoint args like: ‘_request_timeout’.
Returns: A list of dicts or a pd.DataFrame containing the timeseries of each signal.
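The documented bucket options can be sketched as a small resolver. The function name and the millisecond constants for hour/day/week are stated assumptions derived from the parameter description, not the package’s code.

```python
def resolve_bucket(bucket, start_ms, end_ms):
    """Sketch: map the documented bucket options to a bucket size in ms."""
    fixed = {"hour": 3_600_000, "day": 86_400_000, "week": 604_800_000}
    if bucket == "single":
        return end_ms - start_ms   # one bucket spanning the whole range
    if isinstance(bucket, int):
        return bucket              # custom fixed bucket size in ms
    return fixed[bucket]           # predefined named bucket
```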
- get_timeseries_based_dtcs(machines: list, source_signal_key: str, spn_signal_key: str, fmi_signal_key: str, _from, to) DataFrame#
Get the DTCs for the given machines using timeseries data.
- Parameters:
machines (list) – Machines to request DTCs for.
source_signal_key (str) – Signal key of the source address.
spn_signal_key (str) – Signal key of the SPN id.
fmi_signal_key (str) – Signal key of the FMI.
_from – The starting time. Can be an integer, tuple or datetime object. (-> for further info see convert_to_posix())
to – The end time. Can be an integer, tuple or datetime object. (-> for further info see convert_to_posix())
Returns: DataFrame with DTCs.
- input_df(func, data, params_col_names: dict = None, body_col_names=None, to_datetime=True, rsql_cols: list = None, **kwargs) DataFrame#
This function is used for mass operations.
You can pass in a pd.DataFrame or pd.Series. The method iterates over every row, updates the parameters (**kwargs) and then calls the given function. Parameters set with values of the DataFrame are updated every iteration, while parameters set in **kwargs are static (used for every request). Any exceptions raised by func are caught, and their status code and message are added to the result.
- Parameters:
func – The method that calls the desired endpoint.
data – pd.Series or pd.DataFrame
params_col_names (dict) – A dict that describes which column to use to get the value for each parameter. The key represents the kwarg and the value the name of the column, e.g. {'id': 'id_column', …}.
body_col_names (dict, list) – A copy of the request body where values in curly brackets represent the column name from which to read the value; ‘{}’ indicates a dynamic value. Simple request body: {"oemExternalKey": "{the column name where the values of the oemExternalKeys are}"}
to_datetime (bool) – Choose whether the columns of the DataFrame containing time values should be converted to datetime objects or kept as posix timestamps.
rsql_cols (list) – A list of column names used to get the values for the ‘q’ parameter. Values are read from left to right. Example: rsql_cols=[‘machine_id’, ‘filename’] -> corresponding ‘q’ parameter q="machine.id=={} and filename=={}"
**kwargs – The function’s parameters.
- Returns:
A deserialized DataFrame that contains the DataFrame you passed in, the response message and response code, as well as the response data (if any).
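The rsql_cols mechanism can be illustrated with plain string formatting. The template and column names are taken from the example in the parameter description; the row values are hypothetical.

```python
# '{}' placeholders in the 'q' template are filled left to right with the
# row's values for the listed columns (example template from the docs above).
q_template = "machine.id=={} and filename=={}"
rsql_cols = ["machine_id", "filename"]
row = {"machine_id": "m1", "filename": "log.csv"}  # one DataFrame row as a dict

q = q_template.format(*(row[col] for col in rsql_cols))
```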
- internal_get_clfs_as_json(machine_id, _from=None, to=None, as_dataframe=True)#
Method requires access to private endpoints.
Get all the clfs. If neither _from nor to is specified, the time range of the last hour is applied. The time range cannot be larger than 1 hour.
- Parameters:
machine_id (str) – Machine ID.
_from – The starting time. Can be an integer, tuple or datetime object. (-> for further info see convert_to_posix())
to – The end time. Can be an integer, tuple or datetime object. (-> for further info see convert_to_posix())
as_dataframe – Returns the clfs in a DataFrame with selected columns
- Returns: A dict containing all the clf data, or a DataFrame containing the most important data of the clfs.
- response_to_list(response_objs) list#
Converts the different response types to a list of dicts. This makes the data easier to use in other methods.
- Parameters:
response_objs – Response objects or lists of response objects.
- Returns:
A list of dicts containing the response data.
- class prodata.proquery.TimeseriesQuery(machine_id: str, start: int, end: int, signals: [<class 'str'>], aggregation: str, bucket: int | str, time_zone: str)#
Bases: object
Carries and formats data for timeseries requests.
- aggregation: str#
- bucket: int | str#
- property delta: timedelta#
- end: int#
- halve_by_range()#
Halve the query by range.
Returns: tuple of two queries (lower half, upper half)
- machine_id: str#
- signals: [<class 'str'>]#
- split_by_signal() [Self]#
Split and return one query per signal.
Returns: list of queries per signal
- start: int#
- time_zone: str#
- to_data() dict#
Formats query as request data.
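The range- and signal-splitting behaviour documented above can be sketched with a minimal dataclass stand-in. The class covers only a subset of the TimeseriesQuery fields, and the splitting logic is assumed from the method descriptions, not copied from the package.

```python
from dataclasses import dataclass, replace

@dataclass
class Query:
    """Minimal stand-in mirroring a subset of the TimeseriesQuery fields."""
    machine_id: str
    start: int
    end: int
    signals: list

    def halve_by_range(self):
        # Split the time range into a lower and an upper half.
        mid = (self.start + self.end) // 2
        return replace(self, end=mid), replace(self, start=mid)

    def split_by_signal(self):
        # One query per signal, everything else unchanged.
        return [replace(self, signals=[s]) for s in self.signals]
```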