FeatureEngineer - correlate

class ds_capability.intent.feature_engineer_intent.FeatureEngineerIntent(property_manager: ~ds_capability.managers.feature_engineer_property_manager.FeatureEngineerPropertyManager, default_save_intent: bool = None, default_intent_level: [<class 'str'>, <class 'int'>, <class 'float'>] = None, order_next_available: bool = None, default_replace_intent: bool = None)

This class represents feature engineering intent actions that, depending on their application, capture the statistical and distributive characteristics of the data to provide targeted features of interest. Its focus is on building, correlating and modelling features in a way that better suits downstream feature requirements.

correlate_aggregate(canonical: ~pyarrow.lib.Table, headers: [<class 'str'>, <class 'list'>], action: str, to_header: str = None, seed: int = None, save_intent: bool = None, intent_order: int = None, intent_level: [<class 'int'>, <class 'str'>] = None, replace_intent: bool = None, remove_duplicates: bool = None) Table

Aggregates the first header in a list of headers with the next, continuing for all headers in the list. If only one header is given, the action is expected to be a single-parameter function such as ‘sqrt’. All actions must be pyarrow compute numeric operations.

Parameters:
  • canonical – a pa.Table as the reference table

  • headers – one or more headers of numeric type

  • action – a string representation of a pyarrow compute numeric operation.

  • to_header – (optional) an optional name to call the column

  • seed – (optional) the random seed. defaults to current datetime

  • save_intent – (optional) if the intent contract should be saved to the property manager

  • intent_level – (optional) the column name that groups intent to create a column

  • intent_order – (optional) the order in which each intent should run.
      - If None: defaults to -1
      - If -1: added to a level above any current instance of the intent section, level 0 if not found
      - If int: added to the level specified, overwriting any that already exist

  • replace_intent – (optional) if the intent method exists at the level, or default level:
      - True: replaces the current intent method with the new
      - False: leaves it untouched, disregarding the new intent

  • remove_duplicates – (optional) removes any duplicate intent in any level that is identical

Returns:

pa.Table

correlate_column_join(canonical: ~pyarrow.lib.Table, header: str, others: [<class 'str'>, <class 'list'>], drop_others: bool = None, sep: str = None, to_header: str = None, seed: int = None, save_intent: bool = None, intent_order: int = None, intent_level: [<class 'int'>, <class 'str'>] = None, replace_intent: bool = None, remove_duplicates: bool = None) Table

Creates a new composite column made up of other columns. The new column replaces the header column and the others are dropped unless the appropriate parameters are set.

Parameters:
  • canonical – a pa.Table as the reference table

  • header – the header for the target values to change

  • others – the other headers to join

  • drop_others – (optional) drop the others header columns. Defaults to True

  • sep – a separator between each column value

  • to_header – (optional) an optional name to call the column

  • seed – (optional) the random seed. defaults to current datetime

  • save_intent – (optional) if the intent contract should be saved to the property manager

  • intent_level – (optional) the column name that groups intent to create a column

  • intent_order – (optional) the order in which each intent should run.
      - If None: defaults to -1
      - If -1: added to a level above any current instance of the intent section, level 0 if not found
      - If int: added to the level specified, overwriting any that already exist

  • replace_intent – (optional) if the intent method exists at the level, or default level:
      - True: replaces the current intent method with the new
      - False: leaves it untouched, disregarding the new intent

  • remove_duplicates – (optional) removes any duplicate intent in any level that is identical

Returns:

pa.Table

correlate_date_delta(canonical: ~pyarrow.lib.Table, header: str, delta: str, units: str = None, to_header: str = None, seed: int = None, save_intent: bool = None, intent_order: int = None, intent_level: [<class 'int'>, <class 'str'>] = None, replace_intent: bool = None, remove_duplicates: bool = None) Table

Correlates a timestamp column to an integer delta column.

Parameters:
  • canonical – a pa.Table as the reference table

  • header – the header for the target values to change

  • delta – a table column to use as the delta

  • units – (optional) The Timedelta units e.g. ‘us’, ‘ms’, ‘s’, ‘m’, ‘h’, ‘D’. default is ‘D’

  • to_header – (optional) an optional name to call the column

  • seed – (optional) the random seed. defaults to current datetime

  • save_intent – (optional) if the intent contract should be saved to the property manager

  • intent_level – (optional) the column name that groups intent to create a column

  • intent_order – (optional) the order in which each intent should run.
      - If None: defaults to -1
      - If -1: added to a level above any current instance of the intent section, level 0 if not found
      - If int: added to the level specified, overwriting any that already exist

  • replace_intent – (optional) if the intent method exists at the level, or default level:
      - True: replaces the current intent method with the new
      - False: leaves it untouched, disregarding the new intent

  • remove_duplicates – (optional) removes any duplicate intent in any level that is identical

Returns:

pa.Table

correlate_date_diff(canonical: ~pyarrow.lib.Table, first_date: str, second_date: str, units: str = None, to_header: str = None, precision: int = None, seed: int = None, save_intent: bool = None, intent_level: [<class 'int'>, <class 'str'>] = None, intent_order: int = None, replace_intent: bool = None, remove_duplicates: bool = None, **kwargs)

Returns a column for the difference between a primary and secondary date, where the primary is an earlier date than the secondary.

Parameters:
  • canonical – a pa.Table as the reference table

  • first_date – the primary or older date field

  • second_date – the secondary or newer date field

  • units – (optional) The Timedelta units e.g. ‘us’, ‘ms’, ‘s’, ‘m’, ‘h’, ‘D’, ‘W’, ‘M’, ‘Y’. default is ‘D’

  • to_header – (optional) an optional name to call the column

  • precision – the precision of the result

  • seed – (optional) a seed value for the random function: default to None

  • save_intent – (optional) if the intent contract should be saved to the property manager

  • intent_level – (optional) the column name that groups intent to create a column

  • intent_order – (optional) the order in which each intent should run.
      - If None: defaults to -1
      - If -1: added to a level above any current instance of the intent section, level 0 if not found
      - If int: added to the level specified, overwriting any that already exist

  • replace_intent – (optional) if the intent method exists at the level, or default level:
      - True: replaces the current intent method with the new
      - False: leaves it untouched, disregarding the new intent

  • remove_duplicates – (optional) removes any duplicate intent in any level that is identical

  • kwargs – a set of kwargs to include in any executable function

Returns:

value set based on the selection list and the action

correlate_date_element(canonical: ~pyarrow.lib.Table, header: [<class 'str'>, <class 'list'>], elements: [<class 'dict'>, <class 'list'>], drop_header: bool = None, day_first: bool = None, year_first: bool = None, date_format: str = None, save_intent: bool = None, intent_level: [<class 'int'>, <class 'str'>] = None, intent_order: int = None, replace_intent: bool = None, remove_duplicates: bool = None)

Breaks a date down into value representations of its various parts and returns the elements requested. The elements are: yr, dec: decade, mon, day, dow: day of week, hr, min, woy: week of year, doy: day of year

Parameters:
  • canonical – a pa.Table as the reference table

  • header – The column header to take the elements from

  • elements – a list of elements or a dict of element keys and column name values

  • drop_header – drop the target column

  • year_first – specifies whether to parse with the year first. If True, parses dates with the year first, e.g. 10/11/12 is parsed as 2010-11-12. If both day_first and year_first are True, year_first takes precedence (same as dateutil).

  • day_first – specifies whether to parse with the day first. If True, parses dates with the day first, e.g. %d-%m-%Y. If False, defaults to the preferred format, normally %m-%d-%Y (but not strict).

  • date_format – if the date can’t be inferred, the date format to use, e.g. format=‘%Y%m%d’

  • save_intent – (optional) if the intent contract should be saved to the property manager

  • intent_level – (optional) the level name that groups intent by a reference name

  • intent_order – (optional) the order in which each intent should run.
      - If None: defaults to -1
      - If -1: added to a level above any current instance of the intent section, level 0 if not found
      - If int: added to the level specified, overwriting any that already exist

  • replace_intent – (optional) if the intent method exists at the level, or default level:
      - True: replaces the current intent method with the new
      - False: leaves it untouched, disregarding the new intent

  • remove_duplicates – (optional) removes any duplicate intent in any level that is identical

Returns:

pandas.DataFrame.

correlate_dates(canonical: ~pyarrow.lib.Table, header: str, choice: [<class 'int'>, <class 'float'>, <class 'str'>] = None, choice_header: str = None, offset: [<class 'int'>, <class 'dict'>, <class 'str'>] = None, jitter: [<class 'int'>, <class 'str'>] = None, jitter_units: str = None, ignore_time: bool = None, ignore_seconds: bool = None, min_date: str = None, max_date: str = None, now_delta: str = None, to_header: str = None, date_format: str = None, day_first: bool = None, year_first: bool = None, seed: int = None, save_intent: bool = None, intent_level: [<class 'int'>, <class 'str'>] = None, intent_order: int = None, replace_intent: bool = None, remove_duplicates: bool = None)

correlate a list of continuous dates adjusting those dates, or a subset of those dates, with a normalised jitter along with a value offset. choice, jitter and offset can accept environment variable string names starting with ${ and ending with }.

When using offset and a dict is passed, the dict should take the form {‘days’: 1}, where the unit is plural, to add 1 day or a singular name {‘hour’: 3}, where the unit is singular, to replace the current with 3 hours. Offsets can be ‘years’, ‘months’, ‘weeks’, ‘days’, ‘hours’, ‘minutes’ or ‘seconds’. If an int is passed days are assumed.
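The plural and singular keys in the offset dict follow the same convention as pandas.DateOffset, where plural keywords add to a timestamp and singular keywords replace that element. Whether correlate_dates uses DateOffset internally is an assumption; the snippet only illustrates the convention.

```python
import pandas as pd

stamp = pd.Timestamp("2023-01-15 10:30:00")

# plural key: one day is added to the timestamp
added = stamp + pd.DateOffset(**{"days": 1})

# singular key: the hour element is replaced with 3
replaced = stamp + pd.DateOffset(**{"hour": 3})
```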

Parameters:
  • canonical – a pa.Table as the reference table

  • header – the header in the Table to correlate

  • choice – (optional) The number of values or percentage between 0 and 1 to choose.

  • choice_header – (optional) those not chosen are given the values of the given header

  • offset – (optional) a temporal parameter that adds to, or replaces, the named date element. If an int, ‘days’ is assumed

  • jitter – (optional) the random jitter or deviation in days

  • jitter_units – (optional) the units of the jitter, Options: ‘W’, ‘D’, ‘h’, ‘m’, ‘s’. default ‘D’

  • to_header – (optional) an optional name to call the column

  • ignore_time – ignore time elements and only select from Year, Month, Day elements. Default is False

  • ignore_seconds – ignore second elements and only select from Year to minute elements. Default is False

  • min_date – (optional) a minimum date not to go below

  • max_date – (optional) a maximum date not to go above

  • now_delta – (optional) returns a delta from now as an int list, Options: ‘Y’, ‘M’, ‘W’, ‘D’, ‘h’, ‘m’, ‘s’

  • day_first – (optional) if the dates given are in day-first format. Defaults to True

  • year_first – (optional) if the dates given are year first. Default to False

  • date_format – (optional) the format of the output

  • seed – (optional) a seed value for the random function: default to None

  • save_intent – (optional) if the intent contract should be saved to the property manager

  • intent_level – (optional) the column name that groups intent to create a column

  • intent_order – (optional) the order in which each intent should run.
      - If None: defaults to -1
      - If -1: added to a level above any current instance of the intent section, level 0 if not found
      - If int: added to the level specified, overwriting any that already exist

  • replace_intent – (optional) if the intent method exists at the level, or default level:
      - True: replaces the current intent method with the new
      - False: leaves it untouched, disregarding the new intent

  • remove_duplicates – (optional) removes any duplicate intent in any level that is identical

Returns:

a list of equal size to that given

correlate_missing(canonical: ~pyarrow.lib.Table, header: str, to_header: str = None, strategy: str = None, constant: [<class 'str'>, <class 'int'>, <class 'float'>] = None, seed: int = None, save_intent: bool = None, intent_order: int = None, intent_level: [<class 'int'>, <class 'str'>] = None, replace_intent: bool = None, remove_duplicates: bool = None) Table

correlates a missing data imputation, replacing nulls with a given strategy. If no strategy is given or the strategy isn’t recognised then defaults to ‘mode’. As ‘mean’ or ‘median’ only apply to numeric values, if used on a categorical the strategy will revert back to the default.

The available strategies are ‘mean’, ‘median’, ‘mode’, ‘constant’, ‘forward’ and ‘backward’, where ‘forward’ carries non-null values forward to fill null slots and ‘backward’ carries non-null values backward to fill null slots.

Parameters:
  • canonical – a pa.Table as the reference table

  • header – the header for the target values to change

  • strategy – (optional) imputation strategy: ‘mean’, ‘median’, ‘mode’, ‘constant’, ‘forward’ or ‘backward’

  • constant – (optional) if the strategy is ‘constant’, the value to use.

  • to_header – (optional) an optional name to call the column

  • seed – (optional) the random seed. defaults to current datetime

  • save_intent – (optional) if the intent contract should be saved to the property manager

  • intent_level – (optional) the column name that groups intent to create a column

  • intent_order – (optional) the order in which each intent should run.
      - If None: defaults to -1
      - If -1: added to a level above any current instance of the intent section, level 0 if not found
      - If int: added to the level specified, overwriting any that already exist

  • replace_intent – (optional) if the intent method exists at the level, or default level:
      - True: replaces the current intent method with the new
      - False: leaves it untouched, disregarding the new intent

  • remove_duplicates – (optional) removes any duplicate intent in any level that is identical

Returns:

pa.Table

correlate_missing_probability(canonical: ~pyarrow.lib.Table, header: str, to_header: str = None, seed: int = None, save_intent: bool = None, intent_order: int = None, intent_level: [<class 'int'>, <class 'str'>] = None, replace_intent: bool = None, remove_duplicates: bool = None) Table
Parameters:
  • canonical – a pa.Table as the reference table

  • header – the header for the target values to change

  • to_header – (optional) an optional name to call the column

  • seed – (optional) the random seed. defaults to current datetime

  • save_intent – (optional) if the intent contract should be saved to the property manager

  • intent_level – (optional) the column name that groups intent to create a column

  • intent_order – (optional) the order in which each intent should run.
      - If None: defaults to -1
      - If -1: added to a level above any current instance of the intent section, level 0 if not found
      - If int: added to the level specified, overwriting any that already exist

  • replace_intent – (optional) if the intent method exists at the level, or default level:
      - True: replaces the current intent method with the new
      - False: leaves it untouched, disregarding the new intent

  • remove_duplicates – (optional) removes any duplicate intent in any level that is identical

Returns:

pa.Table

correlate_number(canonical: ~pyarrow.lib.Table, header: str, choice: [<class 'int'>, <class 'float'>, <class 'str'>] = None, choice_header: str = None, to_header: str = None, precision: int = None, jitter: [<class 'int'>, <class 'float'>, <class 'str'>] = None, offset: [<class 'int'>, <class 'float'>, <class 'str'>] = None, code_str: ~typing.Any = None, lower: [<class 'int'>, <class 'float'>] = None, upper: [<class 'int'>, <class 'float'>] = None, keep_zero: bool = None, seed: int = None, save_intent: bool = None, intent_level: [<class 'int'>, <class 'str'>] = None, intent_order: int = None, replace_intent: bool = None, remove_duplicates: bool = None) Table

correlate a list of continuous values adjusting those values, or a subset of those values, with a normalised jitter (std from the value) along with a value offset. choice, jitter and offset can accept environment variable string names starting with ${ and ending with }.

If the choice is an int, it represents the number of rows to choose. If the choice is a float, it must be between 0 and 1 and represents the fraction of rows to choose.

Parameters:
  • canonical – a pa.Table as the reference table

  • header – the header in the Table to correlate

  • choice – (optional) The number of values to choose to apply the change to. Can be an environment variable.

  • choice_header – (optional) those not chosen are given the values of the given header

  • to_header – (optional) an optional name to call the column

  • offset – (optional) a fixed value to offset or if str an operation to perform using @ as the header value.

  • code_str – (optional) a lambda function passed as a str, e.g. ‘lambda x: (x - 3) / 2’

  • jitter – (optional) a perturbation of the value where the jitter is a random normally distributed std

  • precision – (optional) the number of decimal places in the returned values. Defaults to 3

  • seed – (optional) the random seed. defaults to current datetime

  • keep_zero – (optional) if True then zeros passed remain zero despite a change, Default is False

  • lower – a minimum value not to go below

  • upper – a max value not to go above

  • save_intent – (optional) if the intent contract should be saved to the property manager

  • intent_level – (optional) the column name that groups intent to create a column

  • intent_order – (optional) the order in which each intent should run.
      - If None: defaults to -1
      - If -1: added to a level above any current instance of the intent section, level 0 if not found
      - If int: added to the level specified, overwriting any that already exist

  • replace_intent – (optional) if the intent method exists at the level, or default level:
      - True: replaces the current intent method with the new
      - False: leaves it untouched, disregarding the new intent

  • remove_duplicates – (optional) removes any duplicate intent in any level that is identical

Returns:

pa.Table

correlate_on_condition(canonical: ~pyarrow.lib.Table, header: str, condition: list, value: [<class 'int'>, <class 'float'>, <class 'bool'>, <class 'str'>], mask_null: bool = None, default: [<class 'int'>, <class 'float'>, <class 'bool'>, <class 'str'>] = None, to_header: str = None, seed: int = None, save_intent: bool = None, intent_order: int = None, intent_level: [<class 'int'>, <class 'str'>] = None, replace_intent: bool = None, remove_duplicates: bool = None) Table

Correlates a named header to another header where the condition is met, replacing the header column value with a constant or with the value at the same index of an array.

The condition is a list of triple tuples in the form [(comparison, operation, logic)], where comparison is the item or column to compare, operation is what to do when comparing, and logic, used when chaining tuples, is how to join the current boolean flags to the next. Examples might be:

[(1, ‘greater’, ‘or’), (-1, ‘less’, None)]
[(pa.array([‘INACTIVE’, ‘PENDING’]), ‘is_in’, None)]

The operator and logic are taken from pyarrow.compute and are:

operator => match_substring, match_substring_regex, equal, greater, less, greater_equal, less_equal, not_equal, is_in, is_null
logic => and, or, xor, and_not

Parameters:
  • canonical – a pa.Table as the reference table

  • header – the header for the target values to change

  • condition – a list of tuples in the form (comparison, operation, logic)

  • value – a constant value. If the value is a string starting with @, the values of the named header are taken

  • default – (optional) a default constant used where the condition is not met. If a string starting with @, the values of the named header are taken

  • to_header – (optional) an optional name to call the column

  • mask_null – (optional) if nulls in the other they require a value representation.

  • seed – (optional) the random seed. defaults to current datetime

  • save_intent – (optional) if the intent contract should be saved to the property manager

  • intent_level – (optional) the column name that groups intent to create a column

  • intent_order – (optional) the order in which each intent should run.
      - If None: defaults to -1
      - If -1: added to a level above any current instance of the intent section, level 0 if not found
      - If int: added to the level specified, overwriting any that already exist

  • replace_intent – (optional) if the intent method exists at the level, or default level:
      - True: replaces the current intent method with the new
      - False: leaves it untouched, disregarding the new intent

  • remove_duplicates – (optional) removes any duplicate intent in any level that is identical

Returns:

pa.Table

correlate_on_pandas(canonical: ~pyarrow.lib.Table, header: str, code_str: str, to_header: str = None, seed: int = None, save_intent: bool = None, intent_order: int = None, intent_level: [<class 'int'>, <class 'str'>] = None, replace_intent: bool = None, remove_duplicates: bool = None) Table
Allows a Pandas Series method to be run against a Table column. Examples of code_str:

“str.extract(‘([0-9]+)’).astype(‘float’)”
“apply(lambda x: x[0] if isinstance(x, str) else None)”

Parameters:
  • canonical – a pa.Table as the reference table

  • header – the header for the target values to change

  • code_str – a code string matching a Pandas Series method such as str or apply

  • to_header – (optional) an optional name to call the column

  • seed – (optional) the random seed. defaults to current datetime

  • save_intent – (optional) if the intent contract should be saved to the property manager

  • intent_level – (optional) the column name that groups intent to create a column

  • intent_order – (optional) the order in which each intent should run.
      - If None: defaults to -1
      - If -1: added to a level above any current instance of the intent section, level 0 if not found
      - If int: added to the level specified, overwriting any that already exist

  • replace_intent – (optional) if the intent method exists at the level, or default level:
      - True: replaces the current intent method with the new
      - False: leaves it untouched, disregarding the new intent

  • remove_duplicates – (optional) removes any duplicate intent in any level that is identical

Returns:

pa.Table

correlate_outliers(canonical: ~pyarrow.lib.Table, header: str, method: str = None, measure: [<class 'int'>, <class 'float'>, <class 'tuple'>] = None, seed: int = None, to_header: str = None, save_intent: bool = None, intent_level: [<class 'int'>, <class 'str'>] = None, intent_order: int = None, replace_intent: bool = None, remove_duplicates: bool = None)

creates a boolean column indicating which elements within the target column meet the method criteria. The method criteria for outliers are ‘empirical’, ‘interquartile’ (‘iqr’) or ‘custom’, where the lower and upper limits are set by the user. With ‘custom’, the measure parameter should be passed as a tuple of the lower and higher boundaries of exceptions. The default method is ‘interquartile’

Parameters:
  • canonical – a pyarrow table

  • header – The name of the target string column

  • method – (optional) The outlier method: ‘empirical’, ‘iqr’ or ‘custom’

  • measure – (optional) The outlier distance being std-width, k-factor or a (min, max) tuple

  • to_header – (optional) an optional name to call the column

  • seed – (optional) a seed value for the random function: default to None

  • save_intent – (optional) if the intent contract should be saved to the property manager

  • intent_level – (optional) the intent name that groups intent to create a column

  • intent_order – (optional) the order in which each intent should run.
      - If None: defaults to -1
      - If -1: added to a level above any current instance of the intent section, level 0 if not found
      - If int: added to the level specified, overwriting any that already exist

  • replace_intent – (optional) if the intent method exists at the level, or default level:
      - True: replaces the current intent method with the new
      - False: leaves it untouched, disregarding the new intent

  • remove_duplicates – (optional) removes any duplicate intent in any level that is identical

Returns:

correlate_replace(canonical: ~pyarrow.lib.Table, header: str, pattern: str, replacement: str, is_regex: bool = None, max_replacements: int = None, seed: int = None, to_header: str = None, save_intent: bool = None, intent_level: [<class 'int'>, <class 'str'>] = None, intent_order: int = None, replace_intent: bool = None, remove_duplicates: bool = None)

For each string in the target column, replaces non-overlapping substrings that match the given literal pattern with the given replacement. If max_replacements is given and not equal to -1, it limits the maximum number of replacements per input, counted from the left. Null values emit null.

If is_regex is True, RE2 regular expression syntax is used.

Parameters:
  • canonical – a pa.Table as the reference table

  • header – The name of the target string column

  • pattern – Substring pattern to look for inside input values.

  • replacement – What to replace the pattern with.

  • is_regex – (optional) if the pattern is a regex. Default False

  • max_replacements – (optional) The maximum number of strings to replace in each input value.

  • to_header – (optional) an optional name to call the column

  • seed – (optional) a seed value for the random function: default to None

  • save_intent – (optional) if the intent contract should be saved to the property manager

  • intent_level – (optional) the intent name that groups intent to create a column

  • intent_order – (optional) the order in which each intent should run.
      - If None: defaults to -1
      - If -1: added to a level above any current instance of the intent section, level 0 if not found
      - If int: added to the level specified, overwriting any that already exist

  • replace_intent – (optional) if the intent method exists at the level, or default level:
      - True: replaces the current intent method with the new
      - False: leaves it untouched, disregarding the new intent

  • remove_duplicates – (optional) removes any duplicate intent in any level that is identical