FeatureEngineer - correlate

class ds_capability.intent.feature_engineer_intent.FeatureEngineerIntent(property_manager: ~ds_capability.managers.feature_engineer_property_manager.FeatureEngineerPropertyManager, default_save_intent: bool = None, default_intent_level: [<class 'str'>, <class 'int'>, <class 'float'>] = None, order_next_available: bool = None, default_replace_intent: bool = None)

This class represents feature engineering intent actions that, depending on their application, represent the data’s statistical and distributive characteristics to provide targeted features of interest. Its focus is on building, correlating and modelling features in a way that is more conducive to downstream feature requirements.

correlate_aggregate(canonical: ~pyarrow.lib.Table, headers: [<class 'str'>, <class 'list'>], action: str, to_header: str = None, seed: int = None, save_intent: bool = None, intent_order: int = None, intent_level: [<class 'int'>, <class 'str'>] = None, replace_intent: bool = None, remove_duplicates: bool = None) → Table

Aggregates the first header in a list of headers with the next, continuing through all headers in the list. If only one header is given, then the action is expected to be a single-parameter function such as ‘sqrt’. All actions must be pyarrow compute numeric operations.

Parameters:

canonical – a pa.Table as the reference table
headers – one or more headers of numeric type
action – a string representation of a pyarrow compute numeric operation.
to_header – (optional) an optional name to call the column
seed – (optional) the random seed. defaults to current datetime
save_intent – (optional) if the intent contract should be saved to the property manager
intent_level – (optional) the column name that groups intent to create a column
intent_order – (optional) the order in which each intent should run. If None, defaults to -1. If -1, the intent is added to a level above any current instance of the intent section, or level 0 if none is found. If an int, the intent is added to the level specified, overwriting any that already exists
replace_intent – (optional) what to do if the intent method already exists at the level, or default level. True replaces the current intent method with the new one. False leaves it untouched, disregarding the new intent
remove_duplicates – (optional) removes any duplicate intent in any level that is identical

Returns:

pa.Table

correlate_column_join(canonical: ~pyarrow.lib.Table, header: str, others: [<class 'str'>, <class 'list'>], drop_others: bool = None, sep: str = None, to_header: str = None, seed: int = None, save_intent: bool = None, intent_order: int = None, intent_level: [<class 'int'>, <class 'str'>] = None, replace_intent: bool = None, remove_duplicates: bool = None) → Table

Creates a new composite column made up of other columns. The new column replaces the header column, and the others are dropped, unless the appropriate parameters are set.

Parameters:

canonical – a pa.Table as the reference table
header – the header for the target values to change
others – the other headers to join
drop_others – (optional) drop the others header columns. Defaults to True
sep – a separator between each column value
to_header – (optional) an optional name to call the column
seed – (optional) the random seed. defaults to current datetime
save_intent – (optional) if the intent contract should be saved to the property manager
intent_level – (optional) the column name that groups intent to create a column
intent_order – (optional) the order in which each intent should run. If None, defaults to -1. If -1, the intent is added to a level above any current instance of the intent section, or level 0 if none is found. If an int, the intent is added to the level specified, overwriting any that already exists
replace_intent – (optional) what to do if the intent method already exists at the level, or default level. True replaces the current intent method with the new one. False leaves it untouched, disregarding the new intent
remove_duplicates – (optional) removes any duplicate intent in any level that is identical

Returns:

pa.Table

correlate_date_delta(canonical: ~pyarrow.lib.Table, header: str, delta: str, units: str = None, to_header: str = None, seed: int = None, save_intent: bool = None, intent_order: int = None, intent_level: [<class 'int'>, <class 'str'>] = None, replace_intent: bool = None, remove_duplicates: bool = None) → Table

Correlates a timestamp column to an integer delta column.

Parameters:

canonical – a pa.Table as the reference table
header – the header for the target values to change
delta – a table column to use as the delta
units – (optional) The Timedelta units e.g. ‘us’, ‘ms’, ‘s’, ‘m’, ‘h’, ‘D’. default is ‘D’
to_header – (optional) an optional name to call the column
seed – (optional) the random seed. defaults to current datetime
save_intent – (optional) if the intent contract should be saved to the property manager
intent_level – (optional) the column name that groups intent to create a column
intent_order – (optional) the order in which each intent should run. If None, defaults to -1. If -1, the intent is added to a level above any current instance of the intent section, or level 0 if none is found. If an int, the intent is added to the level specified, overwriting any that already exists
replace_intent – (optional) what to do if the intent method already exists at the level, or default level. True replaces the current intent method with the new one. False leaves it untouched, disregarding the new intent
remove_duplicates – (optional) removes any duplicate intent in any level that is identical

Returns:

pa.Table

correlate_date_diff(canonical: ~pyarrow.lib.Table, first_date: str, second_date: str, units: str = None, to_header: str = None, precision: int = None, seed: int = None, save_intent: bool = None, intent_level: [<class 'int'>, <class 'str'>] = None, intent_order: int = None, replace_intent: bool = None, remove_duplicates: bool = None, **kwargs)

Returns a column holding the difference between a primary and a secondary date, where the primary is an earlier date than the secondary.
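To make the units and precision parameters concrete, here is a minimal standard-library sketch of a date difference. The function name date_diff and the unit table are assumptions for illustration; ‘M’ and ‘Y’ are left out because they have no fixed length in seconds.

```python
from datetime import datetime

# Assumed mapping from unit codes to seconds (fixed-length units only).
SECONDS_PER_UNIT = {"s": 1, "m": 60, "h": 3600, "D": 86400, "W": 604800}


def date_diff(first: datetime, second: datetime, units: str = "D", precision: int = 0) -> float:
    """Return (second - first) expressed in the given units, rounded to precision."""
    seconds = (second - first).total_seconds()
    return round(seconds / SECONDS_PER_UNIT[units], precision)


first_date = datetime(2024, 1, 1)
second_date = datetime(2024, 1, 15, 12, 0)

days = date_diff(first_date, second_date, units="D", precision=1)
hours = date_diff(first_date, second_date, units="h")
```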

Parameters:

canonical – a pa.Table as the reference table
first_date – the primary or older date field
second_date – the secondary or newer date field
units – (optional) The Timedelta units e.g. ‘us’, ‘ms’, ‘s’, ‘m’, ‘h’, ‘D’, ‘W’, ‘M’, ‘Y’. default is ‘D’
to_header – (optional) an optional name to call the column
precision – the precision of the result
seed – (optional) a seed value for the random function: default to None
save_intent – (optional) if the intent contract should be saved to the property manager
intent_level – (optional) the column name that groups intent to create a column
intent_order – (optional) the order in which each intent should run. If None, defaults to -1. If -1, the intent is added to a level above any current instance of the intent section, or level 0 if none is found. If an int, the intent is added to the level specified, overwriting any that already exists
replace_intent – (optional) what to do if the intent method already exists at the level, or default level. True replaces the current intent method with the new one. False leaves it untouched, disregarding the new intent
remove_duplicates – (optional) removes any duplicate intent in any level that is identical
kwargs – a set of kwargs to include in any executable function

Returns:

value set based on the selection list and the action

correlate_date_element(canonical: ~pyarrow.lib.Table, header: [<class 'str'>, <class 'list'>], elements: [<class 'dict'>, <class 'list'>], drop_header: bool = None, day_first: bool = None, year_first: bool = None, date_format: str = None, save_intent: bool = None, intent_level: [<class 'int'>, <class 'str'>] = None, intent_order: int = None, replace_intent: bool = None, remove_duplicates: bool = None)

Breaks a date down into value representations of the various parts of that date and returns the elements given. The elements are: yr, dec: decade, mon, day, dow: day of week, hr, min, woy: week of year, doy: day of year
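The element codes listed above can be illustrated with the standard library. This is only a sketch: the helper date_elements is hypothetical, and the conventions marked as assumptions in the comments (decade value, Monday as day 0, ISO week numbering) may differ from what the library returns.

```python
from datetime import datetime


def date_elements(value: datetime, elements: list) -> dict:
    """Return the requested date parts, keyed by element code."""
    parts = {
        "yr": value.year,
        "dec": value.year // 10 * 10,        # assumption: first year of the decade
        "mon": value.month,
        "day": value.day,
        "dow": value.weekday(),              # assumption: Monday = 0
        "hr": value.hour,
        "min": value.minute,
        "woy": value.isocalendar()[1],       # assumption: ISO week number
        "doy": value.timetuple().tm_yday,
    }
    return {code: parts[code] for code in elements}


stamp = datetime(2023, 3, 15, 9, 30)
selected = date_elements(stamp, ["yr", "dec", "mon", "dow", "woy", "doy"])
```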

Parameters:

canonical – a pa.Table as the reference table
header – The column header to take the elements from
elements – a list of elements or a dict of element keys and column name values
drop_header – drop the target column
year_first – specifies whether to parse with the year first. If True, parses dates with the year first, e.g. 10/11/12 is parsed as 2010-11-12. If both day_first and year_first are True, year_first takes precedence (same as dateutil).
day_first – specifies whether to parse with the day first. If True, parses dates with the day first, e.g. %d-%m-%Y. If False, defaults to the preferred format, normally %m-%d-%Y (but not strict)
date_format – if the date can’t be inferred uses date format eg format=’%Y%m%d’
save_intent – (optional) if the intent contract should be saved to the property manager
intent_level – (optional) the level name that groups intent by a reference name
intent_order – (optional) the order in which each intent should run. If None, defaults to -1. If -1, the intent is added to a level above any current instance of the intent section, or level 0 if none is found. If an int, the intent is added to the level specified, overwriting any that already exists
replace_intent – (optional) what to do if the intent method already exists at the level, or default level. True replaces the current intent method with the new one. False leaves it untouched, disregarding the new intent
remove_duplicates – (optional) removes any duplicate intent in any level that is identical

Returns:

pandas.DataFrame.

correlate_dates(canonical: ~pyarrow.lib.Table, header: str, choice: [<class 'int'>, <class 'float'>, <class 'str'>] = None, choice_header: str = None, offset: [<class 'int'>, <class 'dict'>, <class 'str'>] = None, jitter: [<class 'int'>, <class 'str'>] = None, jitter_units: str = None, ignore_time: bool = None, ignore_seconds: bool = None, min_date: str = None, max_date: str = None, now_delta: str = None, to_header: str = None, date_format: str = None, day_first: bool = None, year_first: bool = None, seed: int = None, save_intent: bool = None, intent_level: [<class 'int'>, <class 'str'>] = None, intent_order: int = None, replace_intent: bool = None, remove_duplicates: bool = None)

correlate a list of continuous dates adjusting those dates, or a subset of those dates, with a normalised jitter along with a value offset. choice, jitter and offset can accept environment variable string names starting with ${ and ending with }.

When a dict is passed as the offset, it should take one of two forms: a plural unit such as {‘days’: 1} adds to the current value (here, 1 day), while a singular unit such as {‘hour’: 3} replaces the current value (here, setting the hour to 3). Offsets can be ‘years’, ‘months’, ‘weeks’, ‘days’, ‘hours’, ‘minutes’ or ‘seconds’. If an int is passed, days are assumed.
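The difference between a plural (additive) and a singular (replacing) offset can be shown with the standard library alone. This sketch uses timedelta and datetime.replace as stand-ins for the two behaviours; it is an illustration of the semantics, not the library's own code.

```python
from datetime import datetime, timedelta

stamp = datetime(2024, 5, 10, 14, 45)

# Plural unit, as in {'days': 1}: shift the date forward by one day.
added = stamp + timedelta(days=1)

# Singular unit, as in {'hour': 3}: overwrite the hour with 3, leaving the rest unchanged.
replaced = stamp.replace(hour=3)
```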

Parameters:

canonical – a pa.Table as the reference table
header – the header in the Table to correlate
choice – (optional) The number of values or percentage between 0 and 1 to choose.
choice_header – (optional) those not chosen are given the values of the given header
offset – (optional) Temporal parameter that add to or replace the offset value. if int then assume ‘days’
jitter – (optional) the random jitter or deviation in days
jitter_units – (optional) the units of the jitter, Options: ‘W’, ‘D’, ‘h’, ‘m’, ‘s’. default ‘D’
to_header – (optional) an optional name to call the column
ignore_time – ignore time elements and only select from Year, Month, Day elements. Default is False
ignore_seconds – ignore second elements and only select from Year to minute elements. Default is False
min_date – (optional) a minimum date not to go below
max_date – (optional) a maximum date not to go above
now_delta – (optional) returns a delta from now as an int list, Options: ‘Y’, ‘M’, ‘W’, ‘D’, ‘h’, ‘m’, ‘s’
day_first – (optional) if the dates given are in day-first format. Defaults to True
year_first – (optional) if the dates given are year first. Default to False
date_format – (optional) the format of the output
seed – (optional) a seed value for the random function: default to None
save_intent – (optional) if the intent contract should be saved to the property manager
intent_level – (optional) the column name that groups intent to create a column
intent_order – (optional) the order in which each intent should run. If None, defaults to -1. If -1, the intent is added to a level above any current instance of the intent section, or level 0 if none is found. If an int, the intent is added to the level specified, overwriting any that already exists
replace_intent – (optional) what to do if the intent method already exists at the level, or default level. True replaces the current intent method with the new one. False leaves it untouched, disregarding the new intent
remove_duplicates – (optional) removes any duplicate intent in any level that is identical

Returns:

a list of equal size to that given

correlate_missing(canonical: ~pyarrow.lib.Table, header: str, to_header: str = None, strategy: str = None, constant: [<class 'str'>, <class 'int'>, <class 'float'>] = None, seed: int = None, save_intent: bool = None, intent_order: int = None, intent_level: [<class 'int'>, <class 'str'>] = None, replace_intent: bool = None, remove_duplicates: bool = None) → Table

correlates a missing data imputation, replacing nulls with a given strategy. If no strategy is given or the strategy isn’t recognised then defaults to ‘mode’. As ‘mean’ or ‘median’ only apply to numeric values, if used on a categorical the strategy will revert back to the default.

The available strategies are ‘mean’, ‘median’, ‘mode’, ‘constant’, ‘forward’ and ‘backward’, where ‘forward’ carries non-null values forward to fill null slots and ‘backward’ carries non-null values backward to fill null slots.

Parameters:

canonical – a pa.Table as the reference table
header – the header for the target values to change
strategy – (optional) the imputation strategy: ‘mean’, ‘median’, ‘mode’, ‘constant’, ‘forward’ or ‘backward’
constant – (optional) if the strategy is ‘constant’, the value to use.
to_header – (optional) an optional name to call the column
seed – (optional) the random seed. defaults to current datetime
save_intent – (optional) if the intent contract should be saved to the property manager
intent_level – (optional) the column name that groups intent to create a column
intent_order – (optional) the order in which each intent should run. If None, defaults to -1. If -1, the intent is added to a level above any current instance of the intent section, or level 0 if none is found. If an int, the intent is added to the level specified, overwriting any that already exists
replace_intent – (optional) what to do if the intent method already exists at the level, or default level. True replaces the current intent method with the new one. False leaves it untouched, disregarding the new intent
remove_duplicates – (optional) removes any duplicate intent in any level that is identical

Returns:

pa.Table

correlate_missing_probability(canonical: ~pyarrow.lib.Table, header: str, to_header: str = None, seed: int = None, save_intent: bool = None, intent_order: int = None, intent_level: [<class 'int'>, <class 'str'>] = None, replace_intent: bool = None, remove_duplicates: bool = None) → Table

Parameters:

canonical – a pa.Table as the reference table
header – the header for the target values to change
to_header – (optional) an optional name to call the column
seed – (optional) the random seed. defaults to current datetime
save_intent – (optional) if the intent contract should be saved to the property manager
intent_level – (optional) the column name that groups intent to create a column
intent_order – (optional) the order in which each intent should run. If None, defaults to -1. If -1, the intent is added to a level above any current instance of the intent section, or level 0 if none is found. If an int, the intent is added to the level specified, overwriting any that already exists
replace_intent – (optional) what to do if the intent method already exists at the level, or default level. True replaces the current intent method with the new one. False leaves it untouched, disregarding the new intent
remove_duplicates – (optional) removes any duplicate intent in any level that is identical

Returns:

pa.Table

correlate_number(canonical: ~pyarrow.lib.Table, header: str, choice: [<class 'int'>, <class 'float'>, <class 'str'>] = None, choice_header: str = None, to_header: str = None, precision: int = None, jitter: [<class 'int'>, <class 'float'>, <class 'str'>] = None, offset: [<class 'int'>, <class 'float'>, <class 'str'>] = None, code_str: ~typing.Any = None, lower: [<class 'int'>, <class 'float'>] = None, upper: [<class 'int'>, <class 'float'>] = None, keep_zero: bool = None, seed: int = None, save_intent: bool = None, intent_level: [<class 'int'>, <class 'str'>] = None, intent_order: int = None, replace_intent: bool = None, remove_duplicates: bool = None) → Table

correlate a list of continuous values adjusting those values, or a subset of those values, with a normalised jitter (std from the value) along with a value offset. choice, jitter and offset can accept environment variable string names starting with ${ and ending with }.

If the choice is an int, it represents the number of rows to choose. If the choice is a float, it must be between 0 and 1 and represents the fraction of rows to choose.
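The interplay of choice, jitter and offset can be sketched with the standard library. Both helpers below (choose_rows and jitter_and_offset) are hypothetical names, and the sketch assumes the jitter is the standard deviation of zero-mean normal noise added to each chosen value.

```python
import random


def choose_rows(n_rows: int, choice, seed=None) -> list:
    """Pick row indices: an int is a row count, a float in [0, 1] is a fraction of rows."""
    rng = random.Random(seed)
    if isinstance(choice, float):
        if not 0 <= choice <= 1:
            raise ValueError("a float choice must be between 0 and 1")
        count = int(round(n_rows * choice))
    else:
        count = min(int(choice), n_rows)
    return sorted(rng.sample(range(n_rows), count))


def jitter_and_offset(values, rows, jitter=0.0, offset=0.0, precision=3, seed=None) -> list:
    """Add normal noise (std = jitter) and a fixed offset to the chosen rows only."""
    rng = random.Random(seed)
    out = list(values)
    for i in rows:
        out[i] = round(out[i] + rng.gauss(0, jitter) + offset, precision)
    return out


by_fraction = choose_rows(10, 0.3, seed=1)   # 30% of 10 rows
by_count = choose_rows(10, 4, seed=1)        # exactly 4 rows
shifted = jitter_and_offset([1.0, 2.0, 3.0], rows=[0, 2], offset=10.0)
```

Rows that are not chosen keep their original value here; in the method itself, choice_header determines what the unchosen rows receive.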

Parameters:

canonical – a pa.Table as the reference table
header – the header in the Table to correlate
choice – (optional) The number of values to choose to apply the change to. Can be an environment variable.
choice_header – (optional) those not chosen are given the values of the given header
to_header – (optional) an optional name to call the column
precision – (optional) the number of decimal places in the returned values. Defaults to 3
offset – (optional) a fixed value to offset or, if a str, an operation to perform using @ as the header value.
code_str – (optional) a lambda function passed as a str, e.g. ‘lambda x: (x - 3) / 2’
jitter – (optional) a perturbation of the value where the jitter is a random normally distributed std
seed – (optional) the random seed. defaults to current datetime
keep_zero – (optional) if True then zeros passed remain zero despite a change, Default is False
lower – a minimum value not to go below
upper – a max value not to go above
save_intent – (optional) if the intent contract should be saved to the property manager
intent_level – (optional) the column name that groups intent to create a column
intent_order – (optional) the order in which each intent should run. If None, defaults to -1. If -1, the intent is added to a level above any current instance of the intent section, or level 0 if none is found. If an int, the intent is added to the level specified, overwriting any that already exists
replace_intent – (optional) what to do if the intent method already exists at the level, or default level. True replaces the current intent method with the new one. False leaves it untouched, disregarding the new intent
remove_duplicates – (optional) removes any duplicate intent in any level that is identical

Returns:

pa.Table

correlate_on_condition(canonical: ~pyarrow.lib.Table, header: str, condition: list, value: [<class 'int'>, <class 'float'>, <class 'bool'>, <class 'str'>], mask_null: bool = None, default: [<class 'int'>, <class 'float'>, <class 'bool'>, <class 'str'>] = None, to_header: str = None, seed: int = None, save_intent: bool = None, intent_order: int = None, intent_level: [<class 'int'>, <class 'str'>] = None, replace_intent: bool = None, remove_duplicates: bool = None) → Table

Correlates a named header with another header: where the condition is met, the header column value is replaced with a constant or with the value at the same index of an array.

The condition is a list of 3-tuples in the form [(comparison, operation, logic)], where comparison is the item or column to compare against, operation is the comparison to perform, and logic, when chaining tuples, is how the resulting boolean flags are joined to those of the next tuple. An example might be:

[(comparison, operation, logic)]
[(1, ‘greater’, ‘or’), (-1, ‘less’, None)]
[(pa.array([‘INACTIVE’, ‘PENDING’]), ‘is_in’, None)]

The operator and logic are taken from pyarrow.compute and are:

operator => match_substring, match_substring_regex, equal, greater, less, greater_equal, less_equal, not_equal, is_in, is_null
logic => and, or, xor, and_not

Parameters:

canonical – a pa.Table as the reference table
header – the header for the target values to change
condition – a list of (comparison, operation, logic) tuples, as described above
value – a constant value. If the value is a string starting with @, the values are taken from the named header
default – (optional) a default constant where the value is not applied. If a string starting with @, the values are taken from the named header
to_header – (optional) an optional name to call the column
mask_null – (optional) if nulls in the other they require a value representation.
seed – (optional) the random seed. defaults to current datetime
save_intent – (optional) if the intent contract should be saved to the property manager
intent_level – (optional) the column name that groups intent to create a column
intent_order – (optional) the order in which each intent should run. If None, defaults to -1. If -1, the intent is added to a level above any current instance of the intent section, or level 0 if none is found. If an int, the intent is added to the level specified, overwriting any that already exists
replace_intent – (optional) what to do if the intent method already exists at the level, or default level. True replaces the current intent method with the new one. False leaves it untouched, disregarding the new intent
remove_duplicates – (optional) removes any duplicate intent in any level that is identical

Returns:

pa.Table

correlate_on_pandas(canonical: ~pyarrow.lib.Table, header: str, code_str: str, to_header: str = None, seed: int = None, save_intent: bool = None, intent_order: int = None, intent_level: [<class 'int'>, <class 'str'>] = None, replace_intent: bool = None, remove_duplicates: bool = None) → Table

Allows a Pandas Series method to be run against a Table column. Examples of code_str:
“str.extract(‘([0-9]+)’).astype(‘float’)”
“apply(lambda x: x[0] if isinstance(x, str) else None)”

Parameters:

canonical – a pa.Table as the reference table
header – the header for the target values to change
code_str – a code string matching a Pandas Series method such as str or apply
to_header – (optional) an optional name to call the column
seed – (optional) the random seed. defaults to current datetime
save_intent – (optional) if the intent contract should be saved to the property manager
intent_level – (optional) the column name that groups intent to create a column
intent_order – (optional) the order in which each intent should run. If None, defaults to -1. If -1, the intent is added to a level above any current instance of the intent section, or level 0 if none is found. If an int, the intent is added to the level specified, overwriting any that already exists
replace_intent – (optional) what to do if the intent method already exists at the level, or default level. True replaces the current intent method with the new one. False leaves it untouched, disregarding the new intent
remove_duplicates – (optional) removes any duplicate intent in any level that is identical

Returns:

pa.Table

correlate_outliers(canonical: ~pyarrow.lib.Table, header: str, method: str = None, measure: [<class 'int'>, <class 'float'>, <class 'tuple'>] = None, seed: int = None, to_header: str = None, save_intent: bool = None, intent_level: [<class 'int'>, <class 'str'>] = None, intent_order: int = None, replace_intent: bool = None, remove_duplicates: bool = None)

creates a boolean column indicating which elements within the target column meet the method criteria. The method criteria for outliers are ‘empirical’, ‘interquartile’ (‘iqr’) or ‘custom’, where the lower and upper limits are set by the user. With ‘custom’, the measure parameter should be passed as a tuple of the lower and higher boundaries of exceptions. The default method is ‘interquartile’
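The three methods can be sketched with the standard library's statistics module. The helper outlier_flags is hypothetical, and the default measures used here (a k-factor of 1.5 for the interquartile method and a width of 3 standard deviations for the empirical method) are common conventions assumed for illustration, not values confirmed by this documentation. The sketch flags values outside the limits as True.

```python
import statistics


def outlier_flags(values, method="interquartile", measure=None) -> list:
    """Return True for each value that falls outside the method's lower/upper limits."""
    if method in ("interquartile", "iqr"):
        k = 1.5 if measure is None else measure          # assumed default k-factor
        q1, _, q3 = statistics.quantiles(values, n=4)
        lower, upper = q1 - k * (q3 - q1), q3 + k * (q3 - q1)
    elif method == "empirical":
        width = 3 if measure is None else measure        # assumed default std-width
        mean, std = statistics.mean(values), statistics.stdev(values)
        lower, upper = mean - width * std, mean + width * std
    elif method == "custom":
        lower, upper = measure                           # (min, max) tuple from the user
    else:
        raise ValueError(f"unknown method: {method}")
    return [v < lower or v > upper for v in values]


data = [10, 11, 12, 13, 14, 100]
iqr_flags = outlier_flags(data)                           # default method
custom_flags = outlier_flags(data, method="custom", measure=(11, 13))
```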

Parameters:

canonical – a pyarrow table
header – The name of the target string column
method – (optional) The outlier method: ‘empirical’, ‘iqr’ or ‘custom’
measure – (optional) The outlier distance being std-width, k-factor or a (min, max) tuple
to_header – (optional) an optional name to call the column
seed – (optional) a seed value for the random function: default to None
save_intent – (optional) if the intent contract should be saved to the property manager
intent_level – (optional) the intent name that groups intent to create a column
intent_order – (optional) the order in which each intent should run. If None, defaults to -1. If -1, the intent is added to a level above any current instance of the intent section, or level 0 if none is found. If an int, the intent is added to the level specified, overwriting any that already exists
replace_intent – (optional) what to do if the intent method already exists at the level, or default level. True replaces the current intent method with the new one. False leaves it untouched, disregarding the new intent
remove_duplicates – (optional) removes any duplicate intent in any level that is identical

Returns:

correlate_replace(canonical: ~pyarrow.lib.Table, header: str, pattern: str, replacement: str, is_regex: bool = None, max_replacements: int = None, seed: int = None, to_header: str = None, save_intent: bool = None, intent_level: [<class 'int'>, <class 'str'>] = None, intent_order: int = None, replace_intent: bool = None, remove_duplicates: bool = None)

For each string in the target column, replaces non-overlapping substrings that match the given literal pattern with the given replacement. If max_replacements is given and not equal to -1, it limits the maximum number of replacements per input, counted from the left. Null values emit null.

If is_regex is True, RE2 regular expression syntax is used.

Parameters:

canonical – a pa.Table as the reference table
header – The name of the target string column
pattern – Substring pattern to look for inside input values.
replacement – What to replace the pattern with.
is_regex – (optional) if the pattern is a regex. Default False
max_replacements – (optional) The maximum number of strings to replace in each input value.
to_header – (optional) an optional name to call the column
seed – (optional) a seed value for the random function: default to None
save_intent – (optional) if the intent contract should be saved to the property manager
intent_level – (optional) the intent name that groups intent to create a column
intent_order – (optional) the order in which each intent should run. If None, defaults to -1. If -1, the intent is added to a level above any current instance of the intent section, or level 0 if none is found. If an int, the intent is added to the level specified, overwriting any that already exists
replace_intent – (optional) what to do if the intent method already exists at the level, or default level. True replaces the current intent method with the new one. False leaves it untouched, disregarding the new intent
remove_duplicates – (optional) removes any duplicate intent in any level that is identical