FeatureSelect
- class ds_capability.intent.feature_select_intent.FeatureSelectIntent(property_manager: ~ds_capability.managers.feature_select_property_manager.FeatureSelectPropertyManager, default_save_intent: bool = None, default_intent_level: [<class 'str'>, <class 'int'>, <class 'float'>] = None, order_next_available: bool = None, default_replace_intent: bool = None)
This class represents feature selection intent actions focusing on dimensionality reduction, specifically columnar reduction. Its purpose is to discard irrelevant features, removing, amongst other things, constant columns, duplicate columns and statistically uninteresting columns.
As an early stage data pipeline process, FeatureSelect focuses on data preprocessing, and as such is a filter step for extracting features of interest.
- auto_aggregate(canonical: ~pyarrow.lib.Table, action: str, headers: [<class 'str'>, <class 'list'>] = None, d_types: [<class 'str'>, <class 'list'>] = None, regex: [<class 'str'>, <class 'list'>] = None, drop: bool = None, to_header: str = None, drop_aggregated: bool = None, precision: int = None, save_intent: bool = None, intent_level: [<class 'int'>, <class 'str'>] = None, intent_order: int = None, replace_intent: bool = None, remove_duplicates: bool = None) Table
Given a set of columns, aggregates those columns row by row using the specified aggregation action. The actions are ‘sum’, ‘prod’, ‘count’, ‘min’, ‘max’, ‘mean’, ‘list’, ‘list_first’ and ‘list_last’. ‘list_first’ and ‘list_last’ return the first or last value in a list.
- Parameters:
canonical – the pa.Table
action – an aggregation action such as count or list_first.
headers – (optional) a filter of headers from the canonical table
drop – (optional) to drop or not drop the headers if specified
d_types – (optional) a filter on data type for the canonical table. int, float, bool, object
regex – (optional) a regular expression to search the headers. example ‘^((?!_amt).)*$’ excludes ‘_amt’
to_header – (optional) an optional name to call the column
drop_aggregated – (optional) drop the aggregation headers
precision – the value precision of the return values
save_intent – (optional) if the intent contract should be saved to the property manager
intent_level – (optional) the level name that groups intent by a reference name
intent_order – (optional) the order in which each intent should run.
  - If None: defaults to -1
  - If -1: added to a level above any current instance of the intent section, level 0 if not found
  - If int: added to the level specified, overwriting any that already exist
replace_intent – (optional) if the intent method exists at the level, or default level
  - True: replaces the current intent method with the new
  - False: leaves it untouched, disregarding the new intent
remove_duplicates – (optional) removes any duplicate intent in any level that is identical
- Returns:
pa.Table.
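To make the row-wise aggregation concrete, here is a minimal sketch in plain Python. It is not the library's implementation: the dict-of-lists layout stands in for a pa.Table, the function name `aggregate_rows` is hypothetical, and only a subset of the listed actions is shown.

```python
from math import prod
from statistics import mean

# Hypothetical stand-in for a table: column name -> list of values.
table = {"a": [1, 2, 3], "b": [10, 20, 30], "c": [5, 5, 5]}

def aggregate_rows(columns, headers, action, precision=None):
    """Aggregate the selected columns row by row (illustrative only)."""
    actions = {
        "sum": sum,
        "prod": prod,
        "min": min,
        "max": max,
        "mean": mean,
        "list": list,
        "list_first": lambda row: row[0],
        "list_last": lambda row: row[-1],
    }
    func = actions[action]
    # zip(*...) walks the selected columns one row at a time.
    rows = zip(*(columns[h] for h in headers))
    result = [func(list(row)) for row in rows]
    # Rounding is applied to numeric results only (an assumption).
    if precision is not None and action in ("sum", "prod", "mean"):
        result = [round(v, precision) for v in result]
    return result

totals = aggregate_rows(table, ["a", "b"], "sum")
firsts = aggregate_rows(table, ["a", "b", "c"], "list_first")
```

Here `totals` holds one sum per row of columns `a` and `b`, and `firsts` holds the value from the first selected column in each row.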
- auto_append_tables(canonical: ~pyarrow.lib.Table, other: ~pyarrow.lib.Table = None, headers: [<class 'str'>, <class 'list'>] = None, data_types: [<class 'str'>, <class 'list'>] = None, regex: [<class 'str'>, <class 'list'>] = None, drop: bool = None, other_headers: [<class 'str'>, <class 'list'>] = None, other_data_type: [<class 'str'>, <class 'list'>] = None, other_regex: [<class 'str'>, <class 'list'>] = None, other_drop: bool = None, seed: int = None, save_intent: bool = None, intent_level: [<class 'int'>, <class 'str'>] = None, intent_order: int = None, replace_intent: bool = None, remove_duplicates: bool = None) Table
Appends the canonical table with other
- Parameters:
canonical – a pa.Table
other – (optional) the pa.Table or connector to join. This is the dominant table and will replace like named columns
headers – (optional) headers to select
data_types – (optional) data types to select. use PyArrow data types eg ‘pa.string()’
regex – (optional) a regular expression
drop – (optional) if True then drop the headers. False by default
other_headers – other headers to select
other_data_type – other data types to select. use PyArrow data types eg ‘pa.string()’
other_regex – other regular expression
other_drop – if True then drop the other headers
seed – (optional) placeholder
save_intent – (optional) if the intent contract should be saved to the property manager
intent_level – (optional) the level name that groups intent by a reference name
intent_order – (optional) the order in which each intent should run.
  - If None: defaults to -1
  - If -1: added to a level above any current instance of the intent section, level 0 if not found
  - If int: added to the level specified, overwriting any that already exist
replace_intent – (optional) if the intent method exists at the level, or default level
  - True: replaces the current intent method with the new
  - False: leaves it untouched, disregarding the new intent
remove_duplicates – (optional) removes any duplicate intent in any level that is identical
- Returns:
a pa.Table
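The note that other is the dominant table can be illustrated with a small plain-Python sketch. It reads the description as a column-wise append, which is an assumption, and the name `append_columns` and the dict-of-lists layout are hypothetical stand-ins rather than the library's code.

```python
# Hypothetical stand-ins for two tables with the same number of rows.
canonical = {"id": [1, 2, 3], "score": [0.1, 0.2, 0.3]}
other = {"score": [9.1, 9.2, 9.3], "label": ["x", "y", "z"]}

def append_columns(canonical, other):
    """Column-wise append where 'other' wins on like-named columns."""
    lengths = {len(v) for v in list(canonical.values()) + list(other.values())}
    if len(lengths) > 1:
        raise ValueError("all columns must have the same number of rows")
    merged = dict(canonical)  # canonical columns keep their original position
    merged.update(other)      # like-named columns are replaced by 'other'
    return merged

result = append_columns(canonical, other)
```

In this sketch the `score` column of the result comes from `other`, while `id` is carried over from the canonical unchanged.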
- auto_cast_types(canonical: ~pyarrow.lib.Table, include_category: bool = None, category_max: int = None, include_bool: bool = None, include_timestamp: bool = None, tm_format: str = None, tm_units: str = None, tm_tz: str = None, save_intent: bool = None, intent_level: [<class 'int'>, <class 'str'>] = None, intent_order: int = None, replace_intent: bool = None, remove_duplicates: bool = None) Table
Attempts to cast the columns of a table to their appropriate types. Casting to category, boolean and timestamp is toggled on or off by setting the corresponding include parameter to True or False.
- Parameters:
canonical – the pa.Table
include_category – (optional) if categories should be cast. Default True
category_max – (optional) the max number of unique values to consider categorical
include_bool – (optional) if booleans should be cast. Default True
include_timestamp – (optional) if timestamps should be cast. Default True
tm_format – (optional) if not standard, the format of the dates, example ‘%m-%d-%Y %H:%M:%S’
tm_units – (optional) units to cast timestamp. Options are ‘s’, ‘ms’, ‘us’, ‘ns’
tm_tz – (optional) timezone to cast timestamp
save_intent – (optional) if the intent contract should be saved to the property manager
intent_level – (optional) the level name that groups intent by a reference name
intent_order – (optional) the order in which each intent should run.
  - If None: defaults to -1
  - If -1: added to a level above any current instance of the intent section, level 0 if not found
  - If int: added to the level specified, overwriting any that already exist
replace_intent – (optional) if the intent method exists at the level, or default level
  - True: replaces the current intent method with the new
  - False: leaves it untouched, disregarding the new intent
remove_duplicates – (optional) removes any duplicate intent in any level that is identical
- Returns:
pa.Table.
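Two of the parameters above lend themselves to a short illustration: tm_format for parsing non-standard timestamps, and category_max as a cut-off on the number of unique values. The sketch uses only the Python standard library; the helper `looks_categorical` is hypothetical and the library's actual decision logic may differ.

```python
from datetime import datetime

# The example format given for tm_format in the parameter list.
tm_format = "%m-%d-%Y %H:%M:%S"
raw = ["01-31-2024 08:15:00", "02-01-2024 23:59:59"]

# Parse each string into a datetime using the non-standard format.
parsed = [datetime.strptime(value, tm_format) for value in raw]

def looks_categorical(values, category_max):
    """Hypothetical check: few enough distinct values to treat as a category."""
    return len(set(values)) <= category_max

colours = ["red", "blue", "red", "green", "blue"]
is_category = looks_categorical(colours, category_max=3)
```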
- auto_clean_header(canonical: ~pyarrow.lib.Table, case: str = None, rename_map: [<class 'dict'>, <class 'list'>, <class 'str'>] = None, replace_spaces: str = None, save_intent: bool = None, intent_level: [<class 'int'>, <class 'str'>] = None, intent_order: int = None, replace_intent: bool = None, remove_duplicates: bool = None) Table
Cleans the headers of a Table, replacing spaces with underscores by default. Also allows header remapping and case selection.
- Parameters:
canonical – the pa.Table
rename_map – (optional) a dict of name value pairs, a fixed length list of column names or connector name
case – (optional) changes the headers to lower, upper, title. if none of these then no change
replace_spaces – (optional) character to replace spaces with. Default is ‘_’ (underscore)
save_intent – (optional) if the intent contract should be saved to the property manager
intent_level – (optional) the level name that groups intent by a reference name
intent_order – (optional) the order in which each intent should run.
  - If None: defaults to -1
  - If -1: added to a level above any current instance of the intent section, level 0 if not found
  - If int: added to the level specified, overwriting any that already exist
replace_intent – (optional) if the intent method exists at the level, or default level
  - True: replaces the current intent method with the new
  - False: leaves it untouched, disregarding the new intent
remove_duplicates – (optional) removes any duplicate intent in any level that is identical
- Returns:
pa.Table.
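The header cleaning described above can be sketched on a plain list of names. The function `clean_headers` is hypothetical, and the order of operations (rename first, then replace spaces, then change case) is an assumption rather than documented behaviour; only the dict form of rename_map is shown.

```python
def clean_headers(headers, case=None, rename_map=None, replace_spaces="_"):
    """Return cleaned header names (illustrative only)."""
    rename_map = rename_map or {}
    cleaned = []
    for name in headers:
        name = rename_map.get(name, name)        # optional remapping
        name = name.replace(" ", replace_spaces)  # spaces -> underscore by default
        if case == "lower":
            name = name.lower()
        elif case == "upper":
            name = name.upper()
        elif case == "title":
            name = name.title()
        # Any other value of case leaves the name unchanged.
        cleaned.append(name)
    return cleaned

headers = ["Customer ID", "order date", "amt"]
cleaned = clean_headers(headers, case="lower", rename_map={"amt": "amount"})
```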
- auto_drop_columns(canonical: ~pyarrow.lib.Table, headers: [<class 'str'>, <class 'list'>] = None, d_types: [<class 'str'>, <class 'list'>] = None, regex: [<class 'str'>, <class 'list'>] = None, drop: bool = None, save_intent: bool = None, intent_level: [<class 'int'>, <class 'str'>] = None, intent_order: int = None, replace_intent: bool = None, remove_duplicates: bool = None) Table
Automatically removes the selected columns.
- Parameters:
canonical – the pa.Table
headers – (optional) a filter of headers from the canonical table
drop – (optional) to drop or not drop the headers if specified
d_types – (optional) a filter on data type for the canonical table. int, float, bool, object
regex – (optional) a regular expression to search the headers. example ‘^((?!_amt).)*$’ excludes ‘_amt’
save_intent – (optional) if the intent contract should be saved to the property manager
intent_level – (optional) the level name that groups intent by a reference name
intent_order – (optional) the order in which each intent should run.
  - If None: defaults to -1
  - If -1: added to a level above any current instance of the intent section, level 0 if not found
  - If int: added to the level specified, overwriting any that already exist
replace_intent – (optional) if the intent method exists at the level, or default level
  - True: replaces the current intent method with the new
  - False: leaves it untouched, disregarding the new intent
remove_duplicates – (optional) removes any duplicate intent in any level that is identical
- Returns:
pa.Table.
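The exclusion pattern given for the regex parameter relies on a negative lookahead. The following sketch, using only Python's `re` module and hypothetical header names, shows which names the pattern (written with balanced parentheses) accepts and which it rejects.

```python
import re

# Hypothetical header names for illustration.
headers = ["id", "net_amt", "gross_amt", "region"]

# Negative lookahead: match only names that never contain '_amt'.
exclude_amt = re.compile(r"^((?!_amt).)*$")
selected = [h for h in headers if exclude_amt.search(h)]

# The complement, i.e. the names the pattern rejects.
rejected = [h for h in headers if h not in selected]
```

At every character position the lookahead `(?!_amt)` refuses to advance if `_amt` starts there, so any name containing `_amt` fails to match as a whole.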
Uses ‘brute force’ techniques to remove highly correlated numeric columns, based on a correlation threshold that defaults to 0.95.
- Parameters:
canonical – the pa.Table
threshold – (optional) threshold correlation between columns. default 0.95
save_intent – (optional) if the intent contract should be saved to the property manager
intent_level – (optional) the level name that groups intent by a reference name
intent_order – (optional) the order in which each intent should run.
  - If None: defaults to -1
  - If -1: added to a level above any current instance of the intent section, level 0 if not found
  - If int: added to the level specified, overwriting any that already exist
replace_intent – (optional) if the intent method exists at the level, or default level
  - True: replaces the current intent method with the new
  - False: leaves it untouched, disregarding the new intent
remove_duplicates – (optional) removes any duplicate intent in any level that is identical
- Returns:
pa.Table.
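One way to picture threshold-based removal of correlated columns is a pairwise Pearson check that keeps the first column of each highly correlated pair. This is a sketch under stated assumptions, not the library's algorithm: the helper names are hypothetical, the data is a dict of lists rather than a pa.Table, and all columns are assumed numeric and non-constant.

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation of two equal-length numeric lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    dev_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    dev_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (dev_x * dev_y)  # assumes neither column is constant

def drop_correlated(columns, threshold=0.95):
    """Keep a column only if it is not highly correlated with one already kept."""
    kept = []
    for name in columns:
        if all(abs(pearson(columns[name], columns[k])) <= threshold for k in kept):
            kept.append(name)
    return {name: columns[name] for name in kept}

data = {
    "x": [1, 2, 3, 4, 5],
    "x_scaled": [2, 4, 6, 8, 10.5],  # almost a linear copy of x
    "noise": [3, 1, 4, 1, 5],
}
reduced = drop_correlated(data)
```

Which member of a correlated pair is kept is a design choice; this sketch keeps the one that appears first.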
- auto_drop_duplicates(canonical: ~pyarrow.lib.Table, save_intent: bool = None, intent_level: [<class 'int'>, <class 'str'>] = None, intent_order: int = None, replace_intent: bool = None, remove_duplicates: bool = None) Table
Removes columns that are duplicates of each other
- Parameters:
canonical – the pa.Table
save_intent – (optional) if the intent contract should be saved to the property manager
intent_level – (optional) the level name that groups intent by a reference name
intent_order – (optional) the order in which each intent should run.
  - If None: defaults to -1
  - If -1: added to a level above any current instance of the intent section, level 0 if not found
  - If int: added to the level specified, overwriting any that already exist
replace_intent – (optional) if the intent method exists at the level, or default level
  - True: replaces the current intent method with the new
  - False: leaves it untouched, disregarding the new intent
remove_duplicates – (optional) removes any duplicate intent in any level that is identical
- Returns:
pa.Table.
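Duplicate-column removal can be sketched by comparing column contents and keeping the first occurrence. The function name `drop_duplicate_columns` and the dict-of-lists layout are assumptions for illustration; the library may compare columns differently.

```python
def drop_duplicate_columns(columns):
    """Keep the first of any columns whose values are identical."""
    seen = set()
    result = {}
    for name, values in columns.items():
        key = tuple(values)  # hashable snapshot of the column's contents
        if key not in seen:
            seen.add(key)
            result[name] = values
    return result

data = {"a": [1, 2, 3], "a_copy": [1, 2, 3], "b": [3, 2, 1]}
deduplicated = drop_duplicate_columns(data)
```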
- auto_drop_noise(canonical: ~pyarrow.lib.Table, variance_threshold: float = None, nulls_threshold: float = None, save_intent: bool = None, intent_level: [<class 'int'>, <class 'str'>] = None, intent_order: int = None, replace_intent: bool = None, remove_duplicates: bool = None) Table
Automatically removes columns that are at least 0.998 percent np.NaN, hold a single value, have a standard deviation of zero, or have a predominant value greater than the default 0.998 percent.
- Parameters:
canonical – the pa.Table
variance_threshold – (optional) The threshold limit of the variance of the values. Default 0.01
nulls_threshold – (optional) The threshold limit of null values. Default 0.95
save_intent – (optional) if the intent contract should be saved to the property manager
intent_level – (optional) the level name that groups intent by a reference name
intent_order – (optional) the order in which each intent should run.
  - If None: defaults to -1
  - If -1: added to a level above any current instance of the intent section, level 0 if not found
  - If int: added to the level specified, overwriting any that already exist
replace_intent – (optional) if the intent method exists at the level, or default level
  - True: replaces the current intent method with the new
  - False: leaves it untouched, disregarding the new intent
remove_duplicates – (optional) removes any duplicate intent in any level that is identical
- Returns:
pa.Table.
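The kinds of check described for noisy columns (mostly null, a single value, very low variance) can be sketched as a per-column test. The helper `is_noise` is hypothetical, uses None in place of np.NaN, borrows the default values from the parameter list, and makes its own assumptions about how the thresholds are compared.

```python
from statistics import pvariance

def is_noise(values, variance_threshold=0.01, nulls_threshold=0.95):
    """Return True if a non-empty column looks uninformative (illustrative only)."""
    nulls = sum(v is None for v in values)
    if nulls / len(values) >= nulls_threshold:
        return True                      # mostly null
    present = [v for v in values if v is not None]
    if len(set(present)) <= 1:
        return True                      # a single repeated value
    if all(isinstance(v, (int, float)) for v in present):
        return pvariance(present) < variance_threshold  # near-constant numbers
    return False

columns = {
    "mostly_null": [None] * 19 + [1.0],
    "constant": [7, 7, 7],
    "flat": [1.0, 1.01, 1.0, 1.01],
    "useful": [1.0, 2.0, 3.0],
    "labels": ["a", "b", "a"],
}
kept = {name: vals for name, vals in columns.items() if not is_noise(vals)}
```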
- auto_projection(canonical: ~pyarrow.lib.Table, headers: list = None, drop: bool = None, n_components: [<class 'int'>, <class 'float'>] = None, seed: int = None, save_intent: bool = None, intent_level: [<class 'int'>, <class 'str'>] = None, intent_order: int = None, replace_intent: bool = None, remove_duplicates: bool = None, **kwargs) Table
Principal component analysis (PCA) is a linear dimensionality reduction using Singular Value Decomposition of the data to project it to a lower dimensional space.
- Parameters:
canonical – the pa.Table
headers – (optional) a list of headers to select (default) or drop from the dataset
drop – (optional) if True then drop the headers. False by default
n_components – (optional) Number of components to keep.
seed – (optional) placeholder
save_intent – (optional) if the intent contract should be saved to the property manager
intent_level – (optional) the level name that groups intent by a reference name
intent_order – (optional) the order in which each intent should run.
  - If None: defaults to -1
  - If -1: added to a level above any current instance of the intent section, level 0 if not found
  - If int: added to the level specified, overwriting any that already exist
replace_intent – (optional) if the intent method exists at the level, or default level
  - True: replaces the current intent method with the new
  - False: leaves it untouched, disregarding the new intent
remove_duplicates – (optional) removes any duplicate intent in any level that is identical
kwargs – additional parameters to pass the PCA model
- Returns:
a pa.Table
- auto_sample_rows(canonical: ~pyarrow.lib.Table, size: int, save_intent: bool = None, intent_level: [<class 'int'>, <class 'str'>] = None, intent_order: int = None, replace_intent: bool = None, remove_duplicates: bool = None) Table
Samples rows of the canonical, returning a randomly selected subset of the given size.
- Parameters:
canonical – the pa.Table
size – the randomly selected subset size of the canonical
save_intent – (optional) if the intent contract should be saved to the property manager
intent_level – (optional) the level name that groups intent by a reference name
intent_order – (optional) the order in which each intent should run.
  - If None: defaults to -1
  - If -1: added to a level above any current instance of the intent section, level 0 if not found
  - If int: added to the level specified, overwriting any that already exist
replace_intent – (optional) if the intent method exists at the level, or default level
  - True: replaces the current intent method with the new
  - False: leaves it untouched, disregarding the new intent
remove_duplicates – (optional) removes any duplicate intent in any level that is identical
- Returns:
pa.Table.
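Row sampling amounts to choosing a random set of row indices and applying it to every column so that rows stay aligned. In the sketch below, `sample_rows` and its `seed` argument are hypothetical (the documented signature has no seed), and a dict of lists stands in for the pa.Table.

```python
import random

def sample_rows(columns, size, seed=None):
    """Return a random subset of rows, keeping columns aligned (illustrative only)."""
    n_rows = len(next(iter(columns.values())))
    rng = random.Random(seed)  # local generator, so global state is untouched
    # Sample without replacement; sorting keeps the original row order.
    indices = sorted(rng.sample(range(n_rows), min(size, n_rows)))
    return {name: [values[i] for i in indices] for name, values in columns.items()}

table = {"id": list(range(10)), "square": [i * i for i in range(10)]}
subset = sample_rows(table, size=4, seed=0)
```

Sorting the chosen indices is a design choice that preserves the original ordering of the sampled rows; dropping the sort would return them in random order instead.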