FeatureSelect

class ds_capability.intent.feature_select_intent.FeatureSelectIntent(property_manager: ~ds_capability.managers.feature_select_property_manager.FeatureSelectPropertyManager, default_save_intent: bool = None, default_intent_level: [<class 'str'>, <class 'int'>, <class 'float'>] = None, order_next_available: bool = None, default_replace_intent: bool = None)

This class represents feature selection intent actions focused on dimensionality reduction, specifically the reduction of columns. Its purpose is to discard irrelevant features, removing, among other things, constants, duplicates and statistically uninteresting columns.

As an early stage data pipeline process, FeatureSelect focuses on data preprocessing, and as such is a filter step for extracting features of interest.

auto_aggregate(canonical: ~pyarrow.lib.Table, action: str, headers: [<class 'str'>, <class 'list'>] = None, d_types: [<class 'str'>, <class 'list'>] = None, regex: [<class 'str'>, <class 'list'>] = None, drop: bool = None, to_header: str = None, drop_aggregated: bool = None, precision: int = None, save_intent: bool = None, intent_level: [<class 'int'>, <class 'str'>] = None, intent_order: int = None, replace_intent: bool = None, remove_duplicates: bool = None) Table

Given a set of columns, aggregates those columns using the given aggregation action. The available actions are ‘sum’, ‘prod’, ‘count’, ‘min’, ‘max’, ‘mean’, ‘list’, ‘list_first’ and ‘list_last’. ‘list_first’ and ‘list_last’ return the first or last value in the list.

Parameters:
  • canonical – the pa.Table

  • action – an aggregation action such as count or list_first.

  • headers – (optional) a filter of headers from the canonical

  • drop – (optional) whether to drop, rather than select, the headers specified

  • d_types – (optional) a filter on data type for the canonical: int, float, bool, object

  • regex – (optional) a regular expression to search the headers. For example, ‘^((?!_amt).)*$’ excludes headers containing ‘_amt’

  • to_header – (optional) a name for the aggregated column

  • drop_aggregated – (optional) drop the headers that were aggregated

  • precision – (optional) the value precision of the return values

  • save_intent – (optional) if the intent contract should be saved to the property manager

  • intent_level – (optional) the level name that groups intent by a reference name

  • intent_order – (optional) the order in which each intent should run.
      - if None: defaults to -1
      - if -1: added to a level above any current instance of the intent section, level 0 if not found
      - if int: added to the level specified, overwriting any that already exist

  • replace_intent – (optional) what to do if the intent method already exists at the level, or default level.
      - True: replaces the current intent method with the new
      - False: leaves it untouched, disregarding the new intent

  • remove_duplicates – (optional) removes any duplicate intent in any level that is identical

Returns:

pa.Table.
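
Example (illustrative sketch). The aggregation actions can be pictured as reducers applied across the selected columns, one row at a time. The code below is not the library's implementation: it assumes row-wise aggregation, models the table as a plain dict of column lists instead of a pa.Table, and the helper name aggregate_columns is hypothetical.

```python
import math

# Hypothetical stand-in for a pa.Table: column name -> list of values.
table = {
    "q1_amt": [1.0, 2.0, 3.0],
    "q2_amt": [4.0, 5.0, 6.0],
    "label": ["a", "b", "c"],
}

# One reducer per documented action, each applied to a single row of values.
ACTIONS = {
    "sum": sum,
    "prod": math.prod,
    "count": len,
    "min": min,
    "max": max,
    "mean": lambda row: sum(row) / len(row),
    "list": list,
    "list_first": lambda row: row[0],
    "list_last": lambda row: row[-1],
}


def aggregate_columns(table, action, headers, to_header, precision=None):
    """Aggregate the given headers row by row into a new column (assumed semantics)."""
    reducer = ACTIONS[action]
    rows = zip(*(table[h] for h in headers))
    values = [reducer(list(row)) for row in rows]
    if precision is not None:
        # Only round float results; lists and ints are left as they are.
        values = [round(v, precision) if isinstance(v, float) else v for v in values]
    result = dict(table)
    result[to_header] = values
    return result


total = aggregate_columns(table, "sum", ["q1_amt", "q2_amt"], "total_amt")
first = aggregate_columns(table, "list_first", ["q1_amt", "q2_amt"], "first_amt")
```

Here total gains a ‘total_amt’ column holding the per-row sum, and first gains a ‘first_amt’ column holding the value from the first selected header.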

auto_append_tables(canonical: ~pyarrow.lib.Table, other: ~pyarrow.lib.Table = None, headers: [<class 'str'>, <class 'list'>] = None, data_types: [<class 'str'>, <class 'list'>] = None, regex: [<class 'str'>, <class 'list'>] = None, drop: bool = None, other_headers: [<class 'str'>, <class 'list'>] = None, other_data_type: [<class 'str'>, <class 'list'>] = None, other_regex: [<class 'str'>, <class 'list'>] = None, other_drop: bool = None, seed: int = None, save_intent: bool = None, intent_level: [<class 'int'>, <class 'str'>] = None, intent_order: int = None, replace_intent: bool = None, remove_duplicates: bool = None) Table

Appends the other table to the canonical table.

Parameters:
  • canonical – a pa.Table

  • other – (optional) the pa.Table or connector to join. This is the dominant table and will replace like-named columns

  • headers – (optional) headers to select

  • data_types – (optional) data types to select. Use PyArrow data types, e.g. ‘pa.string()’

  • regex – (optional) a regular expression

  • drop – (optional) if True then drop the headers. False by default

  • other_headers – (optional) other headers to select

  • other_data_type – (optional) other data types to select. Use PyArrow data types, e.g. ‘pa.string()’

  • other_regex – (optional) other regular expression

  • other_drop – (optional) if True then drop the other headers

  • seed – (optional) placeholder

  • save_intent – (optional) if the intent contract should be saved to the property manager

  • intent_level – (optional) the level name that groups intent by a reference name

  • intent_order – (optional) the order in which each intent should run.
      - if None: defaults to -1
      - if -1: added to a level above any current instance of the intent section, level 0 if not found
      - if int: added to the level specified, overwriting any that already exist

  • replace_intent – (optional) what to do if the intent method already exists at the level, or default level.
      - True: replaces the current intent method with the new
      - False: leaves it untouched, disregarding the new intent

  • remove_duplicates – (optional) removes any duplicate intent in any level that is identical

Returns:

a pa.Table
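
Example (illustrative sketch). The ‘dominant table’ rule for like-named columns can be pictured as a dict update. This is not the library's code: the tables are modelled as plain dicts of column lists, the helper name append_columns is hypothetical, and the equal-row-count check is an assumption of the sketch.

```python
# Hypothetical stand-ins for pa.Table objects: column name -> list of values.
canonical = {"id": [1, 2, 3], "score": [0.1, 0.2, 0.3]}
other = {"score": [0.9, 0.8, 0.7], "grade": ["a", "b", "c"]}


def append_columns(canonical, other):
    """Combine columns; 'other' is dominant and replaces like-named columns."""
    lengths = {len(col) for col in list(canonical.values()) + list(other.values())}
    if len(lengths) > 1:
        # Assumption of this sketch: both tables must have the same row count.
        raise ValueError("all columns must have the same number of rows")
    combined = dict(canonical)
    combined.update(other)  # like-named columns take their values from 'other'
    return combined


combined = append_columns(canonical, other)
```

After the call, combined holds ‘id’, ‘score’ and ‘grade’, with ‘score’ taken from other.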

auto_cast_types(canonical: ~pyarrow.lib.Table, include_category: bool = None, category_max: int = None, include_bool: bool = None, include_timestamp: bool = None, tm_format: str = None, tm_units: str = None, tm_tz: str = None, save_intent: bool = None, intent_level: [<class 'int'>, <class 'str'>] = None, intent_order: int = None, replace_intent: bool = None, remove_duplicates: bool = None) Table

Attempts to cast each column of a table to its appropriate type. Casting to categories, booleans and timestamps is toggled on or off by setting the corresponding include parameter to True or False.

Parameters:
  • canonical – the pa.Table

  • include_category – (optional) if categories should be cast. Default True

  • category_max – (optional) the max number of unique values to consider categorical

  • include_bool – (optional) if booleans should be cast. Default True

  • include_timestamp – (optional) if timestamps should be cast. Default True

  • tm_format – (optional) if not standard, the format of the dates, example ‘%m-%d-%Y %H:%M:%S’

  • tm_units – (optional) units to cast timestamp. Options are ‘s’, ‘ms’, ‘us’, ‘ns’

  • tm_tz – (optional) timezone to cast timestamp

  • save_intent – (optional) if the intent contract should be saved to the property manager

  • intent_level – (optional) the level name that groups intent by a reference name

  • intent_order – (optional) the order in which each intent should run.
      - if None: defaults to -1
      - if -1: added to a level above any current instance of the intent section, level 0 if not found
      - if int: added to the level specified, overwriting any that already exist

  • replace_intent – (optional) what to do if the intent method already exists at the level, or default level.
      - True: replaces the current intent method with the new
      - False: leaves it untouched, disregarding the new intent

  • remove_duplicates – (optional) removes any duplicate intent in any level that is identical

Returns:

pa.Table.
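
Example (illustrative sketch). One way to picture type inference is to try progressively looser casts on a column of strings. The order of checks, the helper name infer_column and the type labels below are assumptions made for illustration, and the library may infer types quite differently.

```python
from datetime import datetime


def infer_column(values, category_max=20, tm_format=None):
    """Return (type_name, cast_values) for a list of strings."""
    lowered = [v.lower() for v in values]
    if all(v in ("true", "false") for v in lowered):
        return "bool", [v == "true" for v in lowered]
    # Try the strictest numeric type first, then fall back to float.
    for type_name, caster in (("int", int), ("float", float)):
        try:
            return type_name, [caster(v) for v in values]
        except ValueError:
            continue
    if tm_format is not None:
        try:
            return "timestamp", [datetime.strptime(v, tm_format) for v in values]
        except ValueError:
            pass
    # Few distinct values: treat the column as categorical.
    if len(set(values)) <= category_max:
        return "category", list(values)
    return "string", list(values)


table = {
    "flag": ["true", "False", "true"],
    "count": ["1", "2", "3"],
    "ratio": ["0.5", "1.5", "2"],
    "when": ["01-31-2024 10:00:00", "02-01-2024 11:30:00", "02-02-2024 09:15:00"],
    "colour": ["red", "blue", "red"],
    "note": ["x", "y", "z"],
}
inferred = {
    name: infer_column(col, category_max=2, tm_format="%m-%d-%Y %H:%M:%S")[0]
    for name, col in table.items()
}
```

With category_max set to 2, ‘colour’ (two distinct values) is labelled a category while ‘note’ (three distinct values) stays a string.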

auto_clean_header(canonical: ~pyarrow.lib.Table, case: str = None, rename_map: [<class 'dict'>, <class 'list'>, <class 'str'>] = None, replace_spaces: str = None, save_intent: bool = None, intent_level: [<class 'int'>, <class 'str'>] = None, intent_order: int = None, replace_intent: bool = None, remove_duplicates: bool = None) Table

Cleans the headers of a Table, replacing spaces with underscores. It also allows headers to be remapped and their case to be changed.

Parameters:
  • canonical – the pa.Table

  • rename_map – (optional) a dict of name-value pairs, a fixed-length list of column names, or a connector name

  • case – (optional) changes the headers to ‘lower’, ‘upper’ or ‘title’ case. Any other value leaves the case unchanged

  • replace_spaces – (optional) character to replace spaces with. Default is ‘_’ (underscore)

  • save_intent – (optional) if the intent contract should be saved to the property manager

  • intent_level – (optional) the level name that groups intent by a reference name

  • intent_order – (optional) the order in which each intent should run.
      - if None: defaults to -1
      - if -1: added to a level above any current instance of the intent section, level 0 if not found
      - if int: added to the level specified, overwriting any that already exist

  • replace_intent – (optional) what to do if the intent method already exists at the level, or default level.
      - True: replaces the current intent method with the new
      - False: leaves it untouched, disregarding the new intent

  • remove_duplicates – (optional) removes any duplicate intent in any level that is identical

Returns:

pa.Table.
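
Example (illustrative sketch). The header clean-up described above amounts to a rename, a space replacement and an optional case change. The helper name clean_headers, the order of the steps and the dict-only rename_map are assumptions of this sketch, not the library's behaviour.

```python
def clean_headers(headers, case=None, rename_map=None, replace_spaces="_"):
    """Return cleaned header names: rename, replace spaces, then change case."""
    rename_map = rename_map or {}
    cleaned = []
    for name in headers:
        name = rename_map.get(name, name)
        name = name.replace(" ", replace_spaces)
        if case == "lower":
            name = name.lower()
        elif case == "upper":
            name = name.upper()
        elif case == "title":
            name = name.title()
        # Any other value of 'case' leaves the name unchanged.
        cleaned.append(name)
    return cleaned


headers = ["Customer ID", "order date", "Amt"]
cleaned = clean_headers(headers, case="lower", rename_map={"Amt": "amount"})
```

The result is ‘customer_id’, ‘order_date’ and ‘amount’.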

auto_drop_columns(canonical: ~pyarrow.lib.Table, headers: [<class 'str'>, <class 'list'>] = None, d_types: [<class 'str'>, <class 'list'>] = None, regex: [<class 'str'>, <class 'list'>] = None, drop: bool = None, save_intent: bool = None, intent_level: [<class 'int'>, <class 'str'>] = None, intent_order: int = None, replace_intent: bool = None, remove_duplicates: bool = None) Table

Removes the selected columns from the table.

Parameters:
  • canonical – the pa.Table

  • headers – (optional) a filter of headers from the canonical

  • drop – (optional) whether to drop, rather than select, the headers specified

  • d_types – (optional) a filter on data type for the canonical: int, float, bool, object

  • regex – (optional) a regular expression to search the headers. For example, ‘^((?!_amt).)*$’ excludes headers containing ‘_amt’

  • save_intent – (optional) if the intent contract should be saved to the property manager

  • intent_level – (optional) the level name that groups intent by a reference name

  • intent_order – (optional) the order in which each intent should run.
      - if None: defaults to -1
      - if -1: added to a level above any current instance of the intent section, level 0 if not found
      - if int: added to the level specified, overwriting any that already exist

  • replace_intent – (optional) what to do if the intent method already exists at the level, or default level.
      - True: replaces the current intent method with the new
      - False: leaves it untouched, disregarding the new intent

  • remove_duplicates – (optional) removes any duplicate intent in any level that is identical

Returns:

pa.Table.
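
Example (illustrative sketch). The regex filter can be tried on header names with Python's re module. The table is modelled as a plain dict of column lists and drop_columns is a hypothetical helper; how the library combines headers, d_types and regex is not shown here.

```python
import re

# Hypothetical stand-in for a pa.Table: column name -> list of values.
table = {
    "id": [1, 2],
    "q1_amt": [10.0, 20.0],
    "q2_amt": [30.0, 40.0],
    "region": ["north", "south"],
}


def drop_columns(table, regex):
    """Remove every column whose header matches the regular expression."""
    pattern = re.compile(regex)
    return {name: col for name, col in table.items() if not pattern.search(name)}


# Drop the columns whose header ends in '_amt'.
kept = drop_columns(table, r"_amt$")

# The negative-lookahead pattern from the parameter list matches only
# headers that do not contain '_amt' anywhere.
without_amt = [h for h in table if re.search(r"^((?!_amt).)*$", h)]
```

Both approaches leave the ‘id’ and ‘region’ headers.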

auto_drop_correlated(canonical: ~pyarrow.lib.Table, threshold: float = None, save_intent: bool = None, intent_level: [<class 'int'>, <class 'str'>] = None, intent_order: int = None, replace_intent: bool = None, remove_duplicates: bool = None) Table

Uses ‘brute force’ techniques to remove highly correlated numeric columns, based on a correlation threshold that defaults to 0.95.

Parameters:
  • canonical – the pa.Table

  • threshold – (optional) threshold correlation between columns. default 0.95

  • save_intent – (optional) if the intent contract should be saved to the property manager

  • intent_level – (optional) the level name that groups intent by a reference name

  • intent_order – (optional) the order in which each intent should run.
      - if None: defaults to -1
      - if -1: added to a level above any current instance of the intent section, level 0 if not found
      - if int: added to the level specified, overwriting any that already exist

  • replace_intent – (optional) what to do if the intent method already exists at the level, or default level.
      - True: replaces the current intent method with the new
      - False: leaves it untouched, disregarding the new intent

  • remove_duplicates – (optional) removes any duplicate intent in any level that is identical

Returns:

pa.Table.
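
Example (illustrative sketch). A brute-force correlation filter can be pictured as comparing each numeric column with every column already kept. The sketch assumes Pearson correlation and a keep-the-first policy, neither of which is confirmed by the text above; the helper names pearson and drop_correlated are hypothetical.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length numeric lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0  # a constant column has no defined correlation
    return cov / math.sqrt(var_x * var_y)


def drop_correlated(table, threshold=0.95):
    """Keep a column only if it is not highly correlated with one already kept."""
    kept = []
    for name in table:
        if all(abs(pearson(table[name], table[k])) <= threshold for k in kept):
            kept.append(name)
    return {name: table[name] for name in kept}


table = {
    "a": [1.0, 2.0, 3.0, 4.0],
    "b": [2.0, 4.0, 6.0, 8.1],  # almost an exact multiple of 'a'
    "c": [4.0, 1.0, 3.0, 2.0],
}
reduced = drop_correlated(table)
```

Column ‘b’ is removed because it is almost perfectly correlated with ‘a’, while ‘c’ is kept.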

auto_drop_duplicates(canonical: ~pyarrow.lib.Table, save_intent: bool = None, intent_level: [<class 'int'>, <class 'str'>] = None, intent_order: int = None, replace_intent: bool = None, remove_duplicates: bool = None) Table

Removes columns that are duplicates of each other

Parameters:
  • canonical – the pa.Table

  • save_intent – (optional) if the intent contract should be saved to the property manager

  • intent_level – (optional) the level name that groups intent by a reference name

  • intent_order – (optional) the order in which each intent should run.
      - if None: defaults to -1
      - if -1: added to a level above any current instance of the intent section, level 0 if not found
      - if int: added to the level specified, overwriting any that already exist

  • replace_intent – (optional) what to do if the intent method already exists at the level, or default level.
      - True: replaces the current intent method with the new
      - False: leaves it untouched, disregarding the new intent

  • remove_duplicates – (optional) removes any duplicate intent in any level that is identical

Returns:

pa.Table.
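
Example (illustrative sketch). Duplicate columns can be found by comparing column contents and keeping the first occurrence. The dict-of-lists table, the keep-the-first policy and the helper name drop_duplicate_columns are assumptions of this sketch.

```python
def drop_duplicate_columns(table):
    """Keep the first of any group of columns with identical contents."""
    seen = set()
    result = {}
    for name, col in table.items():
        key = tuple(col)  # hashable snapshot of the column contents
        if key not in seen:
            seen.add(key)
            result[name] = col
    return result


# Hypothetical stand-in for a pa.Table: column name -> list of values.
table = {"a": [1, 2, 3], "b": [1, 2, 3], "c": [3, 2, 1]}
deduped = drop_duplicate_columns(table)
```

Column ‘b’ is dropped because its values are identical to those of ‘a’.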

auto_drop_noise(canonical: ~pyarrow.lib.Table, variance_threshold: float = None, nulls_threshold: float = None, save_intent: bool = None, intent_level: [<class 'int'>, <class 'str'>] = None, intent_order: int = None, replace_intent: bool = None, remove_duplicates: bool = None) Table

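Removes noisy columns: those whose proportion of nulls reaches nulls_threshold, those holding a single value or with a standard deviation of zero, those whose variance falls below variance_threshold, and those with a predominant value that makes up more than a proportion of 0.998 of the rows.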

Parameters:
  • canonical – the pa.Table

  • variance_threshold – (optional) the threshold limit of the variance of the values. Default 0.01

  • nulls_threshold – (optional) the threshold limit of the proportion of null values. Default 0.95

  • save_intent – (optional) if the intent contract should be saved to the property manager

  • intent_level – (optional) the level name that groups intent by a reference name

  • intent_order – (optional) the order in which each intent should run.
      - if None: defaults to -1
      - if -1: added to a level above any current instance of the intent section, level 0 if not found
      - if int: added to the level specified, overwriting any that already exist

  • replace_intent – (optional) what to do if the intent method already exists at the level, or default level.
      - True: replaces the current intent method with the new
      - False: leaves it untouched, disregarding the new intent

  • remove_duplicates – (optional) removes any duplicate intent in any level that is identical

Returns:

pa.Table.
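
Example (illustrative sketch). Three of the noise checks (mostly null, constant, dominated by one value) can be expressed with a null ratio and a value count. The helper name is_noise, the predominant_threshold parameter and the use of None for nulls are assumptions; the variance check is left out, and the library's exact rules may differ.

```python
from collections import Counter


def is_noise(values, nulls_threshold=0.95, predominant_threshold=0.998):
    """Flag a column that is mostly null, constant, or dominated by one value."""
    total = len(values)
    present = [v for v in values if v is not None]
    if total == 0 or (total - len(present)) / total >= nulls_threshold:
        return True
    top_count = Counter(present).most_common(1)[0][1]
    # A single distinct value, or one value filling nearly every row.
    return top_count == len(present) or top_count / total > predominant_threshold


# Hypothetical stand-in for a pa.Table: column name -> list of values.
table = {
    "mostly_null": [None] * 39 + [1],
    "constant": [7] * 40,
    "useful": list(range(40)),
}
clean = {name: col for name, col in table.items() if not is_noise(col)}
```

Only the ‘useful’ column survives the filter.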

auto_projection(canonical: ~pyarrow.lib.Table, headers: list = None, drop: bool = None, n_components: [<class 'int'>, <class 'float'>] = None, seed: int = None, save_intent: bool = None, intent_level: [<class 'int'>, <class 'str'>] = None, intent_order: int = None, replace_intent: bool = None, remove_duplicates: bool = None, **kwargs) Table

Projects the selected columns onto a lower-dimensional space using principal component analysis (PCA), a linear dimensionality reduction that uses Singular Value Decomposition of the data.

Parameters:
  • canonical – the pa.Table

  • headers – (optional) a list of headers to select (default) or drop from the dataset

  • drop – (optional) if True then drop the headers. False by default

  • n_components – (optional) Number of components to keep.

  • seed – (optional) placeholder

  • save_intent – (optional) if the intent contract should be saved to the property manager

  • intent_level – (optional) the level name that groups intent by a reference name

  • intent_order – (optional) the order in which each intent should run.
      - if None: defaults to -1
      - if -1: added to a level above any current instance of the intent section, level 0 if not found
      - if int: added to the level specified, overwriting any that already exist

  • replace_intent – (optional) what to do if the intent method already exists at the level, or default level.
      - True: replaces the current intent method with the new
      - False: leaves it untouched, disregarding the new intent

  • remove_duplicates – (optional) removes any duplicate intent in any level that is identical

  • kwargs – additional parameters to pass the PCA model

Returns:

pa.Table.
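
Example (illustrative sketch). To show what projecting onto a principal component means, the code below finds only the first component of a small dataset. It deliberately swaps the Singular Value Decomposition for power iteration on the covariance matrix so that it runs with the standard library alone. It is not the method's implementation, and the helper name first_principal_component is hypothetical.

```python
import math


def first_principal_component(columns, iterations=50):
    """Return (unit direction, projected scores) of the first principal component."""
    n, d = len(columns[0]), len(columns)
    means = [sum(c) / n for c in columns]
    centred = [[v - m for v in c] for c, m in zip(columns, means)]
    # Sample covariance matrix of the centred columns.
    cov = [
        [sum(a * b for a, b in zip(centred[i], centred[j])) / (n - 1) for j in range(d)]
        for i in range(d)
    ]
    vec = [1.0] * d
    for _ in range(iterations):
        # Power iteration: multiply by the covariance matrix, then normalise.
        nxt = [sum(cov[i][j] * vec[j] for j in range(d)) for i in range(d)]
        norm = math.sqrt(sum(v * v for v in nxt))
        if norm == 0:
            break
        vec = [v / norm for v in nxt]
    # Project every centred row onto the component direction.
    scores = [sum(centred[j][r] * vec[j] for j in range(d)) for r in range(n)]
    return vec, scores


x = [1.0, 2.0, 3.0, 4.0]
y = [2.0, 4.0, 6.0, 8.0]  # y = 2x, so all of the variance lies along one line
component, scores = first_principal_component([x, y])
```

Because y is exactly twice x, the component points along the direction (1, 2), normalised to unit length, and the two columns reduce to the single scores column without loss.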

auto_sample_rows(canonical: ~pyarrow.lib.Table, size: int, save_intent: bool = None, intent_level: [<class 'int'>, <class 'str'>] = None, intent_order: int = None, replace_intent: bool = None, remove_duplicates: bool = None) Table

Samples rows of the canonical, returning a randomly selected subset of the given size.

Parameters:
  • canonical – the pa.Table

  • size – the randomly selected subset size of the canonical

  • save_intent – (optional) if the intent contract should be saved to the property manager

  • intent_level – (optional) the level name that groups intent by a reference name

  • intent_order – (optional) the order in which each intent should run.
      - if None: defaults to -1
      - if -1: added to a level above any current instance of the intent section, level 0 if not found
      - if int: added to the level specified, overwriting any that already exist

  • replace_intent – (optional) what to do if the intent method already exists at the level, or default level.
      - True: replaces the current intent method with the new
      - False: leaves it untouched, disregarding the new intent

  • remove_duplicates – (optional) removes any duplicate intent in any level that is identical

Returns:

pa.Table.
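
Example (illustrative sketch). Row sampling can be pictured as drawing row indices without replacement and taking those rows from every column. The dict-of-lists table, the helper name sample_rows, the seed argument and the preserved row order are assumptions of this sketch. The method above has no seed parameter.

```python
import random


def sample_rows(table, size, seed=None):
    """Return a random subset of 'size' rows, keeping columns aligned."""
    n_rows = len(next(iter(table.values())))
    rng = random.Random(seed)
    # Sample indices without replacement; sorting keeps the original row order.
    indices = sorted(rng.sample(range(n_rows), size))
    return {name: [col[i] for i in indices] for name, col in table.items()}


# Hypothetical stand-in for a pa.Table: column name -> list of values.
table = {"id": list(range(10)), "value": [i * i for i in range(10)]}
sample = sample_rows(table, 4, seed=0)
```

Every column is indexed with the same row positions, so each sampled ‘id’ stays paired with its ‘value’.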