FeatureEngineer - model
- class ds_capability.intent.feature_engineer_intent.FeatureEngineerIntent(property_manager: ~ds_capability.managers.feature_engineer_property_manager.FeatureEngineerPropertyManager, default_save_intent: bool = None, default_intent_level: [<class 'str'>, <class 'int'>, <class 'float'>] = None, order_next_available: bool = None, default_replace_intent: bool = None)
This class represents feature engineering intent actions that, depending on their application, represent the data’s statistical and distributive characteristics to provide targeted features of interest. Its focus is on building, correlating and modelling features in a way that is more conducive to downstream feature requirements.
- model_cat_cast(canonical: ~pyarrow.lib.Table, cat_type: bool = None, headers: [<class 'str'>, <class 'list'>] = None, d_types: [<class 'str'>, <class 'list'>] = None, regex: [<class 'str'>, <class 'list'>] = None, drop: bool = None, save_intent: bool = None, tm_format: str = None, tm_locale: str = None, intent_level: [<class 'int'>, <class 'str'>] = None, intent_order: int = None, replace_intent: bool = None, remove_duplicates: bool = None) Table
Casts int and float values back to string, decodes Dictionary types to string, casts bools to 1 and 0 before converting them to string, and converts dates to a given string format. If cat_type is True, all string types passed are converted to dictionary type.
- Parameters:
canonical – the pa.Table
cat_type – (optional) converts str to Categorical type
headers – (optional) a filter of headers from the ‘other’ dataset
drop – (optional) to drop or not drop the headers if specified
d_types – (optional) a filter on data type for the ‘other’ dataset. int, float, bool, object
regex – (optional) a regular expression to search the headers. Example: ‘^((?!_amt).)*$’ excludes ‘_amt’
tm_format – Pattern for formatting input values. Default “%Y-%m-%dT%H:%M:%S”
tm_locale – Locale to use for locale-specific format specifiers. Default ‘C’
save_intent – (optional) if the intent contract should be saved to the property manager
intent_level – (optional) the level name that groups intent by a reference name
intent_order – (optional) the order in which each intent should run. - If None: defaults to -1 - If -1: added to a level above any current instance of the intent section, level 0 if not found - If int: added to the level specified, overwriting any that already exist
replace_intent – (optional) if the intent method exists at the level, or default level - True - replaces the current intent method with the new - False - leaves it untouched, disregarding the new intent
remove_duplicates – (optional) removes any duplicate intent in any level that is identical
- Returns:
pa.Table.
- model_concat_remote(canonical: ~pyarrow.lib.Table, other: [<class 'str'>, <class 'pyarrow.lib.Table'>], headers: list, replace: bool = None, rename_map: [<class 'dict'>, <class 'list'>] = None, multi_map: dict = None, relative_freq: list = None, seed: int = None, save_intent: bool = None, intent_level: [<class 'int'>, <class 'str'>] = None, intent_order: int = None, replace_intent: bool = None, remove_duplicates: bool = None) Table
Takes a remote target dataset and samples columns from that target to the size of the canonical
- Parameters:
canonical – a pa.Table as the reference table
other – a direct pa.Table or reference to a connector.
headers – the headers to be selected from the other table
rename_map – (optional) a direct (list) or named (dict) mapping to the headers names.
multi_map – (optional) creates multiple columns from a single column, e.g. {new_name: name}, where name is copied to new_name
replace – (optional) assuming other is bigger than canonical, selects without replacement when True
relative_freq – (optional) a weighting pattern of the selected data
seed – (optional) a seed value for the random function: default to None
save_intent – (optional) if the intent contract should be saved to the property manager
intent_level – (optional) the column name that groups intent to create a column
intent_order – (optional) the order in which each intent should run. - If None: defaults to -1 - If -1: added to a level above any current instance of the intent section, level 0 if not found - If int: added to the level specified, overwriting any that already exist
replace_intent – (optional) if the intent method exists at the level, or default level - True - replaces the current intent method with the new - False - leaves it untouched, disregarding the new intent
remove_duplicates – (optional) removes any duplicate intent in any level that is identical
- Returns:
a pa.Table
- model_drop_columns(canonical: ~pyarrow.lib.Table, headers: [<class 'str'>, <class 'list'>] = None, d_types: [<class 'str'>, <class 'list'>] = None, regex: [<class 'str'>, <class 'list'>] = None, drop: bool = None, save_intent: bool = None, intent_level: [<class 'int'>, <class 'str'>] = None, intent_order: int = None, replace_intent: bool = None, remove_duplicates: bool = None) Table
Removes columns that are selected.
- Parameters:
canonical – the pa.Table
headers – (optional) a filter of headers from the dataset
d_types – (optional) a filter on data type for the ‘other’ dataset. int, float, bool, object
regex – (optional) a regular expression to search the headers. Example: ‘^((?!_amt).)*$’ excludes ‘_amt’
drop – (optional) to drop or not drop the headers if specified
save_intent – (optional) if the intent contract should be saved to the property manager
intent_level – (optional) the level name that groups intent by a reference name
intent_order – (optional) the order in which each intent should run. - If None: defaults to -1 - If -1: added to a level above any current instance of the intent section, level 0 if not found - If int: added to the level specified, overwriting any that already exist
replace_intent – (optional) if the intent method exists at the level, or default level - True - replaces the current intent method with the new - False - leaves it untouched, disregarding the new intent
remove_duplicates – (optional) removes any duplicate intent in any level that is identical
- Returns:
pa.Table.
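To see how a header regex such as the negative-lookahead example selects columns, the following standard-library sketch may help. The helper `select_headers` is hypothetical, and the way it treats drop (inverting the matched set) is an assumption about the intended behaviour, not a statement of how the library resolves its filters.

```python
import re


def select_headers(headers: list, regex: str, drop: bool = False) -> list:
    """Hypothetical helper: pick headers by regex, or all others if drop."""
    pattern = re.compile(regex)
    matched = [h for h in headers if pattern.search(h)]
    if drop:
        # assumed meaning of drop: invert the selection
        return [h for h in headers if h not in matched]
    return matched


names = ["id", "sale_amt", "tax_amt", "region"]
# the pattern matches only headers that do not contain '_amt'
without_amt = select_headers(names, r"^((?!_amt).)*$")
only_amt = select_headers(names, r"^((?!_amt).)*$", drop=True)
```

Here `without_amt` holds the headers lacking ‘_amt’ and `only_amt` holds the remainder.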
- model_filter_mask(canonical: ~pyarrow.lib.Table, mask: str, save_intent: bool = None, drop_mask: bool = None, intent_level: [<class 'int'>, <class 'str'>] = None, intent_order: int = None, replace_intent: bool = None, remove_duplicates: bool = None) Table
Filters a table using a boolean mask column.
- Parameters:
canonical – the pa.Table
mask – the header name of the mask
drop_mask – (optional) if the mask column should be dropped
save_intent – (optional) if the intent contract should be saved to the property manager
intent_level – (optional) the level name that groups intent by a reference name
intent_order – (optional) the order in which each intent should run. - If None: defaults to -1 - If -1: added to a level above any current instance of the intent section, level 0 if not found - If int: added to the level specified, overwriting any that already exist
replace_intent – (optional) if the intent method exists at the level, or default level - True - replaces the current intent method with the new - False - leaves it untouched, disregarding the new intent
remove_duplicates – (optional) removes any duplicate intent in any level that is identical
- Returns:
pa.Table.
- model_group(canonical: ~pyarrow.lib.Table, group_by: [<class 'str'>, <class 'list'>], headers: [<class 'str'>, <class 'list'>] = None, regex: bool = None, aggregator: str = None, list_choice: int = None, list_max: int = None, drop_group_by: bool = False, seed: int = None, include_weighting: bool = False, freq_precision: int = None, remove_weighting_zeros: bool = False, remove_aggregated: bool = False, save_intent: bool = None, intent_level: [<class 'int'>, <class 'str'>] = None, intent_order: int = None, replace_intent: bool = None, remove_duplicates: bool = None) DataFrame
Returns the full column values directly from another connector data source. In addition to the standard groupby aggregators there are also ‘list’ and ‘set’, which return an aggregated list or set. These can be used in conjunction with ‘list_choice’ and ‘list_max’ to control the return values. If list_max is set to 1, a single value is returned rather than a list of size 1.
- Parameters:
canonical – a direct or generated pd.DataFrame. see context notes below
headers – the column headers to apply the aggregation to
group_by – the column headers to group by
regex – if the column headers is a regex
aggregator – (optional) the aggregator as a function of Pandas DataFrame ‘groupby’ or ‘list’ or ‘set’
list_choice – (optional) used in conjunction with list or set aggregator to return a random n choice
list_max – (optional) used in conjunction with list or set aggregator restricts the list to a n size
drop_group_by – (optional) drops the group by headers
include_weighting – (optional) include a percentage weighting column for each
freq_precision – (optional) a precision for the relative_freq values
remove_aggregated – (optional) if used in conjunction with the weighting then drops the aggregator column
remove_weighting_zeros – (optional) removes zero values
seed – (optional) this is a placeholder, here for compatibility across methods
save_intent – (optional) if the intent contract should be saved to the property manager
intent_level – (optional) the intent name that groups intent to create a column
intent_order – (optional) the order in which each intent should run. - If None: defaults to -1 - If -1: added to a level above any current instance of the intent section, level 0 if not found - If int: added to the level specified, overwriting any that already exist
replace_intent – (optional) if the intent method exists at the level, or default level - True - replaces the current intent method with the new - False - leaves it untouched, disregarding the new intent
remove_duplicates – (optional) removes any duplicate intent in any level that is identical
- Returns:
a pd.DataFrame
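The difference between a standard aggregator and the ‘list’ and ‘set’ aggregations can be illustrated with pandas alone. This is a sketch with made-up data; it does not call model_group, and the slicing used to mimic list_max is an assumption about what that parameter is meant to achieve.

```python
import pandas as pd

df = pd.DataFrame({
    "store": ["a", "a", "b", "b", "b"],
    "item": ["x", "y", "x", "x", "z"],
    "qty": [1, 2, 3, 4, 5],
})

# a standard groupby aggregator
totals = df.groupby("store", as_index=False)["qty"].sum()

# 'list' keeps duplicates and order, 'set' keeps unique values only
items_list = df.groupby("store")["item"].agg(list)
items_set = df.groupby("store")["item"].agg(set)

# restrict each list to at most two entries, in the spirit of list_max
items_max2 = items_list.apply(lambda values: values[:2])
```

Each of the last three results is a Series indexed by store, with one Python list or set per group.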
- model_merge(canonical: ~typing.Any, other: ~typing.Any, left_on: str = None, right_on: str = None, on: str = None, how: str = None, headers: list = None, suffixes: tuple = None, indicator: bool = None, validate: str = None, replace_nulls: bool = None, seed: int = None, save_intent: bool = None, intent_level: [<class 'int'>, <class 'str'>] = None, intent_order: int = None, replace_intent: bool = None, remove_duplicates: bool = None) DataFrame
returns the full column values directly from another connector data source.
- Parameters:
canonical – a direct or generated pd.DataFrame. see context notes below
other – a direct or generated pd.DataFrame. see context notes below
left_on – the canonical key column(s) to join on
right_on – the merging dataset key column(s) to join on
on – if the left and right join have the same header name, this can replace left_on and right_on
how – (optional) One of ‘left’, ‘right’, ‘outer’, ‘inner’. Defaults to inner. See below for more detailed description of each method.
headers – (optional) a filter on the headers included from the right side
suffixes – (optional) A tuple of string suffixes to apply to overlapping columns. Defaults to (‘’, ‘_dup’).
indicator – (optional) Add a column to the output DataFrame called _merge with information on the source of each row. _merge is Categorical-type and takes on a value of left_only for observations whose merge key only appears in ‘left’ DataFrame or Series, right_only for observations whose merge key only appears in ‘right’ DataFrame or Series, and both if the observation’s merge key is found in both.
validate – (optional) default None. If specified, checks if the merge is of the specified type. “one_to_one” or “1:1”: checks if merge keys are unique in both left and right datasets. “one_to_many” or “1:m”: checks if merge keys are unique in left dataset. “many_to_one” or “m:1”: checks if merge keys are unique in right dataset. “many_to_many” or “m:m”: allowed, but does not result in checks.
replace_nulls – (optional) replaces nulls with an appropriate value dependent upon the field type
seed – this is a placeholder, here for compatibility across methods
save_intent – (optional) if the intent contract should be saved to the property manager
intent_level – (optional) the intent name that groups intent to create a column
intent_order – (optional) the order in which each intent should run. - If None: defaults to -1 - If -1: added to a level above any current instance of the intent section, level 0 if not found - If int: added to the level specified, overwriting any that already exist
replace_intent – (optional) if the intent method exists at the level, or default level - True - replaces the current intent method with the new - False - leaves it untouched, disregarding the new intent
remove_duplicates – (optional) removes any duplicate intent in any level that is identical
- Returns:
a pd.DataFrame
The other is a pd.DataFrame, a pd.Series or list, a connector contract str reference or a set of parameter instructions on how to generate a pd.DataFrame. The description of each is:
- pd.DataFrame -> a deep copy of the pd.DataFrame
- pd.Series or list -> creates a pd.DataFrame of one column with the ‘header’ name or ‘default’ if not given
- str -> instantiates a connector handler with the connector_name and loads the DataFrame from the connection
- dict -> use canonical2dict(…) to help construct a dict with a ‘method’ to build a pd.DataFrame
- methods:
  - model_*(…) -> one of the SyntheticBuilder model methods and parameters
  - @empty -> generates an empty pd.DataFrame where size and headers can be passed
    :size sets the index size of the dataframe
    :headers any initial headers for the dataframe
  - @generate -> generate a synthetic file from a remote Domain Contract
    :task_name the name of the SyntheticBuilder task to run
    :repo_uri the location of the Domain Product
    :size (optional) a size to generate
    :seed (optional) if a seed should be applied
    :run_book (optional) if specific intent should be run only
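The how, suffixes, indicator and validate parameters of model_merge mirror those of pandas' own merge, so their effect can be shown with pandas directly. The data below is made up for illustration, and the call is to pd.merge rather than to model_merge.

```python
import pandas as pd

left = pd.DataFrame({"id": [1, 2, 3], "value": [10, 20, 30]})
right = pd.DataFrame({"id": [2, 3, 4], "value": [200, 300, 400]})

merged = pd.merge(
    left, right,
    on="id",                  # same key name on both sides
    how="left",               # keep every row of the left table
    suffixes=("", "_dup"),    # overlapping right-side columns get '_dup'
    indicator=True,           # adds the categorical '_merge' column
    validate="one_to_one",    # raises if keys repeat on either side
)
```

With these inputs the left-only key 1 is kept with a missing value_dup, while key 4, present only on the right, is excluded by the left join.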
- model_num_cast(canonical: ~pyarrow.lib.Table, headers: [<class 'str'>, <class 'list'>] = None, d_types: [<class 'str'>, <class 'list'>] = None, regex: [<class 'str'>, <class 'list'>] = None, drop: bool = None, remove: list = None, save_intent: bool = None, tm_format: str = None, tm_units: str = None, tm_tz: str = None, intent_level: [<class 'int'>, <class 'str'>] = None, intent_order: int = None, replace_intent: bool = None, remove_duplicates: bool = None) Table
Casts strings and dictionary types back to int or float, removing any items listed in remove beforehand.
- Parameters:
canonical – the pa.Table
headers – (optional) a filter of headers from the ‘other’ dataset
drop – (optional) to drop or not drop the headers if specified
d_types – (optional) a filter on data type for the ‘other’ dataset. int, float, bool, object
regex – (optional) a regular expression to search the headers. Example: ‘^((?!_amt).)*$’ excludes ‘_amt’
remove – (optional) a list of items to remove from the string such as ‘$’ or ‘,’
tm_format – (optional) the format of the string dates used if the date cannot be coerced
tm_units – (optional)
tm_tz – (optional)
save_intent – (optional) if the intent contract should be saved to the property manager
intent_level – (optional) the level name that groups intent by a reference name
intent_order – (optional) the order in which each intent should run. - If None: defaults to -1 - If -1: added to a level above any current instance of the intent section, level 0 if not found - If int: added to the level specified, overwriting any that already exist
replace_intent – (optional) if the intent method exists at the level, or default level - True - replaces the current intent method with the new - False - leaves it untouched, disregarding the new intent
remove_duplicates – (optional) removes any duplicate intent in any level that is identical
- Returns:
pa.Table.
- model_reinstate_nulls(canonical: ~pyarrow.lib.Table, nulls_list=None, headers: [<class 'str'>, <class 'list'>] = None, data_type: [<class 'str'>, <class 'list'>] = None, regex: [<class 'str'>, <class 'list'>] = None, drop: bool = None, save_intent: bool = None, intent_level: [<class 'int'>, <class 'str'>] = None, intent_order: int = None, replace_intent: bool = None, remove_duplicates: bool = None) Table
Reinstates nulls in a string that have been masked with alternate values such as space or question-mark. By default, the nulls list is [‘’, ‘ ’, ‘NaN’, ‘nan’, ‘None’, ‘null’, ‘Null’, ‘NULL’]
- Parameters:
canonical – the pa.Table
nulls_list – (optional) potential null values to replace with a null.
headers – a list of headers to drop or filter on type
data_type – the column types to include or exclude. Default None else int, float, bool, object, ‘number’
regex – a regular expression to search the headers
drop – to drop or not drop the headers
save_intent – (optional) if the intent contract should be saved to the property manager
intent_level – (optional) the level name that groups intent by a reference name
intent_order – (optional) the order in which each intent should run. - If None: defaults to -1 - If -1: added to a level above any current instance of the intent section, level 0 if not found - If int: added to the level specified, overwriting any that already exist
replace_intent – (optional) if the intent method exists at the level, or default level - True - replaces the current intent method with the new - False - leaves it untouched, disregarding the new intent
remove_duplicates – (optional) removes any duplicate intent in any level that is identical
- Returns:
pa.Table.