FeatureEngineer - model

class ds_capability.intent.feature_engineer_intent.FeatureEngineerIntent(property_manager: ~ds_capability.managers.feature_engineer_property_manager.FeatureEngineerPropertyManager, default_save_intent: bool = None, default_intent_level: [<class 'str'>, <class 'int'>, <class 'float'>] = None, order_next_available: bool = None, default_replace_intent: bool = None)

This class represents feature engineering intent actions that, depending on their application, capture the statistical and distributional characteristics of the data to provide targeted features of interest. Its focus is on building, correlating and modeling features in a way that is more conducive to downstream feature requirements.

model_cat_cast(canonical: ~pyarrow.lib.Table, cat_type: bool = None, headers: [<class 'str'>, <class 'list'>] = None, d_types: [<class 'str'>, <class 'list'>] = None, regex: [<class 'str'>, <class 'list'>] = None, drop: bool = None, save_intent: bool = None, tm_format: str = None, tm_locale: str = None, intent_level: [<class 'int'>, <class 'str'>] = None, intent_order: int = None, replace_intent: bool = None, remove_duplicates: bool = None) Table

Reverses casting by converting int and float values to string, decoding Dictionary types to string, casting bools to 1 and 0 and then to string, and converting dates to a given string format. If cat_type is True, all string types passed are then converted to dictionary type.

Parameters:
  • canonical – the pa.Table

  • cat_type – (optional) converts str to Categorical type

  • headers – (optional) a filter of headers from the dataset

  • drop – (optional) to drop or not drop the headers if specified

  • d_types – (optional) a filter on data type for the dataset. int, float, bool, object

  • regex – (optional) a regular expression to search the headers. For example, ‘^((?!_amt).)*$’ excludes ‘_amt’

  • tm_format – Pattern for formatting input values. Default “%Y-%m-%dT%H:%M:%S”

  • tm_locale – Locale to use for locale-specific format specifiers. Default ‘C’

  • save_intent – (optional) if the intent contract should be saved to the property manager

  • intent_level – (optional) the level name that groups intent by a reference name

  • intent_order – (optional) the order in which each intent should run. If None, defaults to -1; if -1, the intent is added to a level above any current instance of the intent section, or to level 0 if none is found; if an int, the intent is added to the level specified, overwriting any that already exists.

  • replace_intent – (optional) applies if the intent method already exists at the level, or default level. If True, replaces the current intent method with the new one; if False, leaves it untouched, disregarding the new intent.

  • remove_duplicates – (optional) removes any duplicate intent in any level that is identical

Returns:

pa.Table.

model_concat_remote(canonical: ~pyarrow.lib.Table, other: [<class 'str'>, <class 'pyarrow.lib.Table'>], headers: list, replace: bool = None, rename_map: [<class 'dict'>, <class 'list'>] = None, multi_map: dict = None, relative_freq: list = None, seed: int = None, save_intent: bool = None, intent_level: [<class 'int'>, <class 'str'>] = None, intent_order: int = None, replace_intent: bool = None, remove_duplicates: bool = None) Table

Takes a remote target dataset and samples columns from that target to match the size of the canonical.

Parameters:
  • canonical – a pa.Table as the reference table

  • other – a direct pa.Table or reference to a connector.

  • headers – the headers to be selected from the other table

  • rename_map – (optional) a direct (list) or named (dict) mapping to the header names.

  • multi_map – (optional) creates multiple columns from a single column, e.g. {new_name: name}, where name is copied to new_name

  • replace – (optional) assuming other is bigger than canonical, selects without replacement when True

  • relative_freq – (optional) a weighting pattern of the selected data

  • seed – (optional) a seed value for the random function: default to None

  • save_intent – (optional) if the intent contract should be saved to the property manager

  • intent_level – (optional) the column name that groups intent to create a column

  • intent_order – (optional) the order in which each intent should run. If None, defaults to -1; if -1, the intent is added to a level above any current instance of the intent section, or to level 0 if none is found; if an int, the intent is added to the level specified, overwriting any that already exists.

  • replace_intent – (optional) applies if the intent method already exists at the level, or default level. If True, replaces the current intent method with the new one; if False, leaves it untouched, disregarding the new intent.

  • remove_duplicates – (optional) removes any duplicate intent in any level that is identical

Returns:

a pa.Table

model_drop_columns(canonical: ~pyarrow.lib.Table, headers: [<class 'str'>, <class 'list'>] = None, d_types: [<class 'str'>, <class 'list'>] = None, regex: [<class 'str'>, <class 'list'>] = None, drop: bool = None, save_intent: bool = None, intent_level: [<class 'int'>, <class 'str'>] = None, intent_order: int = None, replace_intent: bool = None, remove_duplicates: bool = None) Table

Removes columns that are selected.

Parameters:
  • canonical – the pa.Table

  • headers – (optional) a filter of headers from the dataset

  • d_types – (optional) a filter on data type for the dataset. int, float, bool, object

  • regex – (optional) a regular expression to search the headers. For example, ‘^((?!_amt).)*$’ excludes ‘_amt’

  • drop – (optional) to drop or not drop the headers if specified

  • save_intent – (optional) if the intent contract should be saved to the property manager

  • intent_level – (optional) the level name that groups intent by a reference name

  • intent_order – (optional) the order in which each intent should run. If None, defaults to -1; if -1, the intent is added to a level above any current instance of the intent section, or to level 0 if none is found; if an int, the intent is added to the level specified, overwriting any that already exists.

  • replace_intent – (optional) applies if the intent method already exists at the level, or default level. If True, replaces the current intent method with the new one; if False, leaves it untouched, disregarding the new intent.

  • remove_duplicates – (optional) removes any duplicate intent in any level that is identical

Returns:

pa.Table.
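
The regex filter on headers relies on a negative lookahead. As a small standalone illustration using only the standard library, the pattern ^((?!_amt).)*$ matches a header only when the text ‘_amt’ appears nowhere in it; the header names below are hypothetical.

```python
import re

# Hypothetical column headers.
headers = ["cust_id", "order_amt", "tax_amt", "region"]

# At every character position the lookahead (?!_amt) must succeed,
# so any header containing '_amt' fails to match.
pattern = re.compile(r"^((?!_amt).)*$")

matched = [h for h in headers if pattern.search(h)]
unmatched = [h for h in headers if not pattern.search(h)]
```

Here matched holds the headers without ‘_amt’ and unmatched holds the rest, which is the split a header selection would then act on.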

model_filter_mask(canonical: ~pyarrow.lib.Table, mask: str, save_intent: bool = None, drop_mask: bool = None, intent_level: [<class 'int'>, <class 'str'>] = None, intent_order: int = None, replace_intent: bool = None, remove_duplicates: bool = None) Table

Filters a table using a boolean mask column.

Parameters:
  • canonical – the pa.Table

  • mask – the header name of the mask

  • drop_mask – (optional) if the mask column should be dropped

  • save_intent – (optional) if the intent contract should be saved to the property manager

  • intent_level – (optional) the level name that groups intent by a reference name

  • intent_order – (optional) the order in which each intent should run. If None, defaults to -1; if -1, the intent is added to a level above any current instance of the intent section, or to level 0 if none is found; if an int, the intent is added to the level specified, overwriting any that already exists.

  • replace_intent – (optional) applies if the intent method already exists at the level, or default level. If True, replaces the current intent method with the new one; if False, leaves it untouched, disregarding the new intent.

  • remove_duplicates – (optional) removes any duplicate intent in any level that is identical

Returns:

pa.Table.

model_group(canonical: ~pyarrow.lib.Table, group_by: [<class 'str'>, <class 'list'>], headers: [<class 'str'>, <class 'list'>] = None, regex: bool = None, aggregator: str = None, list_choice: int = None, list_max: int = None, drop_group_by: bool = False, seed: int = None, include_weighting: bool = False, freq_precision: int = None, remove_weighting_zeros: bool = False, remove_aggregated: bool = False, save_intent: bool = None, intent_level: [<class 'int'>, <class 'str'>] = None, intent_order: int = None, replace_intent: bool = None, remove_duplicates: bool = None) DataFrame

Returns the full column values directly from another connector data source. In addition to the standard groupby aggregators there are also ‘list’ and ‘set’, which return an aggregated list or set. These can be used in conjunction with ‘list_choice’ and ‘list_max’ to control the return values. If list_max is set to 1 then a single value is returned rather than a list of size 1.

Parameters:
  • canonical – a direct or generated pd.DataFrame. see context notes below

  • headers – the column headers to apply the aggregation to

  • group_by – the column headers to group by

  • regex – if the column headers is a regex

  • aggregator – (optional) the aggregator as a function of Pandas DataFrame ‘groupby’ or ‘list’ or ‘set’

  • list_choice – (optional) used in conjunction with list or set aggregator to return a random n choice

  • list_max – (optional) used in conjunction with list or set aggregator restricts the list to a n size

  • drop_group_by – (optional) drops the group by headers

  • include_weighting – (optional) include a percentage weighting column for each

  • freq_precision – (optional) a precision for the relative_freq values

  • remove_aggregated – (optional) if used in conjunction with the weighting then drops the aggregator column

  • remove_weighting_zeros – (optional) removes zero values

  • seed – (optional) this is a placeholder, here for compatibility across methods

  • save_intent – (optional) if the intent contract should be saved to the property manager

  • intent_level – (optional) the intent name that groups intent to create a column

  • intent_order – (optional) the order in which each intent should run. If None, defaults to -1; if -1, the intent is added to a level above any current instance of the intent section, or to level 0 if none is found; if an int, the intent is added to the level specified, overwriting any that already exists.

  • replace_intent – (optional) applies if the intent method already exists at the level, or default level. If True, replaces the current intent method with the new one; if False, leaves it untouched, disregarding the new intent.

  • remove_duplicates – (optional) removes any duplicate intent in any level that is identical

Returns:

a pd.DataFrame
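
The aggregator options can be illustrated with plain pandas. The sketch below assumes that ‘list’ and ‘set’ simply collect each group's values and that list_max truncates the collected list; it is not the library's implementation, and the data and column names are hypothetical.

```python
import pandas as pd

# Hypothetical data to group.
df = pd.DataFrame({
    "region": ["N", "N", "S", "S", "S"],
    "sales": [10, 20, 5, 5, 15],
})

# A standard groupby aggregator applied to the selected header.
totals = df.groupby("region", as_index=False)["sales"].sum()

# 'list' and 'set' style aggregation: collect each group's values.
as_lists = df.groupby("region")["sales"].agg(list)
as_sets = df.groupby("region")["sales"].agg(set)

# Cap each list at n items, in the spirit of list_max.
list_max = 2
capped = as_lists.apply(lambda values: values[:list_max])

# With a cap of 1, keep the single value rather than a one-item list.
single = as_lists.apply(lambda values: values[0])
```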

model_merge(canonical: ~typing.Any, other: ~typing.Any, left_on: str = None, right_on: str = None, on: str = None, how: str = None, headers: list = None, suffixes: tuple = None, indicator: bool = None, validate: str = None, replace_nulls: bool = None, seed: int = None, save_intent: bool = None, intent_level: [<class 'int'>, <class 'str'>] = None, intent_order: int = None, replace_intent: bool = None, remove_duplicates: bool = None) DataFrame

Returns the full column values directly from another connector data source.

Parameters:
  • canonical – a direct or generated pd.DataFrame. see context notes below

  • other – a direct or generated pd.DataFrame. see context notes below

  • left_on – the canonical key column(s) to join on

  • right_on – the merging dataset key column(s) to join on

  • on – if the left and right join have the same header name this can replace left_on and right_on

  • how – (optional) One of ‘left’, ‘right’, ‘outer’, ‘inner’. Defaults to inner. See below for more detailed description of each method.

  • headers – (optional) a filter on the headers included from the right side

  • suffixes – (optional) A tuple of string suffixes to apply to overlapping columns. Defaults to (‘’, ‘_dup’).

  • indicator – (optional) Add a column to the output DataFrame called _merge with information on the source of each row. _merge is Categorical-type and takes on a value of left_only for observations whose merge key only appears in ‘left’ DataFrame or Series, right_only for observations whose merge key only appears in ‘right’ DataFrame or Series, and both if the observation’s merge key is found in both.

  • validate – (optional) if specified, checks if the merge is of the specified type. Defaults to None. “one_to_one” or “1:1”: checks if merge keys are unique in both left and right datasets. “one_to_many” or “1:m”: checks if merge keys are unique in left dataset. “many_to_one” or “m:1”: checks if merge keys are unique in right dataset. “many_to_many” or “m:m”: allowed, but does not result in checks.

  • replace_nulls – (optional) replaces nulls with an appropriate value dependent upon the field type

  • seed – this is a placeholder, here for compatibility across methods

  • save_intent – (optional) if the intent contract should be saved to the property manager

  • intent_level – (optional) the intent name that groups intent to create a column

  • intent_order – (optional) the order in which each intent should run. If None, defaults to -1; if -1, the intent is added to a level above any current instance of the intent section, or to level 0 if none is found; if an int, the intent is added to the level specified, overwriting any that already exists.

  • replace_intent – (optional) applies if the intent method already exists at the level, or default level. If True, replaces the current intent method with the new one; if False, leaves it untouched, disregarding the new intent.

  • remove_duplicates – (optional) removes any duplicate intent in any level that is identical

Returns:

a pd.DataFrame

The other is a pd.DataFrame, a pd.Series or list, a connector contract str reference or a set of parameter instructions on how to generate a pd.DataFrame. The description of each is:

  • pd.DataFrame -> a deep copy of the pd.DataFrame

  • pd.Series or list -> creates a pd.DataFrame of one column with the ‘header’ name or ‘default’ if not given

  • str -> instantiates a connector handler with the connector_name and loads the DataFrame from the connection

  • dict -> use canonical2dict(…) to help construct a dict with a ‘method’ to build a pd.DataFrame
    methods:
    • model_*(…) -> one of the SyntheticBuilder model methods and parameters

    • @empty -> generates an empty pd.DataFrame where size and headers can be passed

      :size sets the index size of the dataframe
      :headers any initial headers for the dataframe

    • @generate -> generate a synthetic file from a remote Domain Contract

      :task_name the name of the SyntheticBuilder task to run
      :repo_uri the location of the Domain Product
      :size (optional) a size to generate
      :seed (optional) if a seed should be applied
      :run_book (optional) if specific intent should be run only
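
The join parameters of model_merge (how, suffixes, indicator, validate) read like those of pandas.merge. Assuming they behave the same way, which the text above does not confirm, a left join with the documented default suffixes can be sketched in plain pandas; the frames and column names are hypothetical.

```python
import pandas as pd

# Hypothetical canonical (left) and other (right) frames sharing a key
# and one overlapping non-key column, 'name'.
left = pd.DataFrame({"cust_id": [1, 2, 3], "name": ["Ann", "Bob", "Cy"]})
right = pd.DataFrame({
    "cust_id": [2, 3, 4],
    "name": ["B.", "C.", "D."],
    "spend": [20.0, 30.0, 40.0],
})

merged = pd.merge(
    left,
    right,
    on="cust_id",             # same key name on both sides
    how="left",               # keep every row of the left frame
    suffixes=("", "_dup"),    # overlapping right columns get '_dup'
    indicator=True,           # adds the categorical '_merge' column
    validate="one_to_one",    # raises if keys repeat on either side
)
```

With these suffixes the left ‘name’ column keeps its header and the right one becomes ‘name_dup’, while ‘_merge’ records whether each key was found on the left only or on both sides.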

model_num_cast(canonical: ~pyarrow.lib.Table, headers: [<class 'str'>, <class 'list'>] = None, d_types: [<class 'str'>, <class 'list'>] = None, regex: [<class 'str'>, <class 'list'>] = None, drop: bool = None, remove: list = None, save_intent: bool = None, tm_format: str = None, tm_units: str = None, tm_tz: str = None, intent_level: [<class 'int'>, <class 'str'>] = None, intent_order: int = None, replace_intent: bool = None, remove_duplicates: bool = None) Table

Reverses casting of strings and dictionary types to int or float, removing any items listed in remove.

Parameters:
  • canonical – the pa.Table

  • headers – (optional) a filter of headers from the dataset

  • drop – (optional) to drop or not drop the headers if specified

  • d_types – (optional) a filter on data type for the dataset. int, float, bool, object

  • regex – (optional) a regular expression to search the headers. For example, ‘^((?!_amt).)*$’ excludes ‘_amt’

  • remove – (optional) a list of items to remove from the string such as ‘$’ or ‘,’

  • tm_format – (optional) the format of the string dates used if the date cannot be coerced

  • tm_units – (optional)

  • tm_tz – (optional)

  • save_intent – (optional) if the intent contract should be saved to the property manager

  • intent_level – (optional) the level name that groups intent by a reference name

  • intent_order – (optional) the order in which each intent should run. If None, defaults to -1; if -1, the intent is added to a level above any current instance of the intent section, or to level 0 if none is found; if an int, the intent is added to the level specified, overwriting any that already exists.

  • replace_intent – (optional) applies if the intent method already exists at the level, or default level. If True, replaces the current intent method with the new one; if False, leaves it untouched, disregarding the new intent.

  • remove_duplicates – (optional) removes any duplicate intent in any level that is identical

Returns:

pa.Table.

model_reinstate_nulls(canonical: ~pyarrow.lib.Table, nulls_list=None, headers: [<class 'str'>, <class 'list'>] = None, data_type: [<class 'str'>, <class 'list'>] = None, regex: [<class 'str'>, <class 'list'>] = None, drop: bool = None, save_intent: bool = None, intent_level: [<class 'int'>, <class 'str'>] = None, intent_order: int = None, replace_intent: bool = None, remove_duplicates: bool = None) Table

Reinstates nulls in a string that have been masked with alternate values such as space or question-mark. By default, the nulls list is [‘’, ‘ ’, ‘NaN’, ‘nan’, ‘None’, ‘null’, ‘Null’, ‘NULL’]

Parameters:
  • canonical – the pa.Table

  • nulls_list – (optional) potential null values to replace with a null.

  • headers – a list of headers to drop or filter on type

  • data_type – the column types to include or exclude. Default None else int, float, bool, object, ‘number’

  • regex – a regular expression to search the headers

  • drop – to drop or not drop the headers

  • save_intent – (optional) if the intent contract should be saved to the property manager

  • intent_level – (optional) the level name that groups intent by a reference name

  • intent_order – (optional) the order in which each intent should run. If None, defaults to -1; if -1, the intent is added to a level above any current instance of the intent section, or to level 0 if none is found; if an int, the intent is added to the level specified, overwriting any that already exists.

  • replace_intent – (optional) applies if the intent method already exists at the level, or default level. If True, replaces the current intent method with the new one; if False, leaves it untouched, disregarding the new intent.

  • remove_duplicates – (optional) removes any duplicate intent in any level that is identical

Returns:

pa.Table.