FeatureBuild
- class ds_capability.intent.feature_build_intent.FeatureBuildIntent(property_manager: ~ds_capability.managers.feature_build_property_manager.FeatureBuildPropertyManager, default_save_intent: bool = None, default_intent_level: [<class 'str'>, <class 'int'>, <class 'float'>] = None, order_next_available: bool = None, default_replace_intent: bool = None)
This class holds feature build intent actions that are bespoke to a particular use case but have broader reuse beyond it.
- build_difference(canonical: ~pyarrow.lib.Table, other: [<class 'str'>, <class 'pyarrow.lib.Table'>], on_key: [<class 'str'>, <class 'list'>], drop_zero_sum: bool = None, summary_connector: bool = None, flagged_connector: str = None, detail_connector: str = None, unmatched_connector: str = None, seed: int = None, save_intent: bool = None, intent_level: [<class 'int'>, <class 'str'>] = None, intent_order: int = None, replace_intent: bool = None, remove_duplicates: bool = None) Table
Returns the difference between two canonicals, joined on a common and unique key.
The on_key parameter can be a direct reference to the canonical column header or to an environment variable. If an environment variable is used, on_key should be set to "${<<YOUR_ENVIRON>>}", where <<YOUR_ENVIRON>> is the environment variable name.
If the flagged_connector parameter is used, a report flagging where left data differs from right data is sent to this connector, where 1 indicates a difference and 0 indicates the values are the same. By default this method returns this report, but if this parameter is set the original canonical is returned instead. This allows a canonical pipeline to continue through the component while outputting the difference report.
If the detail_connector parameter is used, a detail report is sent to this connector showing the left and right values that differ.
If the unmatched_connector parameter is used, the on_key values that don’t match between left and right are reported to this connector.
- Parameters:
canonical – a pa.Table as the reference table
other – a direct pa.Table or reference to a connector.
on_key – The name of the key that uniquely joins the canonical to others
drop_zero_sum – (optional) drops rows and columns which have a total sum of zero differences
summary_connector – (optional) a connector name where the summary report is sent
flagged_connector – (optional) a connector name where the differences are flagged
detail_connector – (optional) a connector name where the differences are shown
unmatched_connector – (optional) a connector name where the unmatched keys are shown
seed – (optional) this is a placeholder, here for compatibility across methods
save_intent – (optional) if the intent contract should be saved to the property manager
intent_level – (optional) the column name that groups intent to create a column
intent_order – (optional) the order in which each intent should run.
  - If None: defaults to -1
  - If -1: added to a level above any current instance of the intent section, level 0 if not found
  - If int: added to the level specified, overwriting any that already exist
replace_intent – (optional) if the intent method exists at the level, or default level
  - True: replaces the current intent method with the new
  - False: leaves it untouched, disregarding the new intent
remove_duplicates – (optional) removes any duplicate intent in any level that is identical
- Returns:
a pa.Table
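The flagged report, the unmatched keys and the "${...}" form of on_key described above can be illustrated with a small, self-contained sketch. This is not the library's implementation: it uses plain column dictionaries (the shape returned by pa.Table.to_pydict()) instead of pa.Table, and the helper names resolve_key and flag_differences are hypothetical. The treatment of drop_zero_sum is an assumption based on the parameter description.

```python
import os
import re


def resolve_key(on_key: str) -> str:
    """Return the environment variable value if on_key looks like "${NAME}",
    otherwise return on_key unchanged. (Hypothetical helper.)"""
    match = re.fullmatch(r"\$\{(\w+)\}", on_key)
    if match is None:
        return on_key
    return os.environ[match.group(1)]


def flag_differences(left: dict, right: dict, on_key: str,
                     drop_zero_sum: bool = False):
    """Compare two column dictionaries joined on a unique key.

    Returns (flagged, unmatched): flagged holds 1 where a value differs
    and 0 where it is the same; unmatched lists keys found on one side only.
    """
    key = resolve_key(on_key)
    left_idx = {k: i for i, k in enumerate(left[key])}
    right_idx = {k: i for i, k in enumerate(right[key])}
    # keys present on both sides, in left-hand order
    matched = [k for k in left[key] if k in right_idx]
    # keys present on exactly one side
    unmatched = sorted(set(left_idx) ^ set(right_idx))
    # only columns that exist on both sides can be compared
    shared = [c for c in left if c != key and c in right]
    flagged = {key: list(matched)}
    for col in shared:
        flagged[col] = [
            int(left[col][left_idx[k]] != right[col][right_idx[k]])
            for k in matched
        ]
    if drop_zero_sum:
        # assumption: drop columns and rows whose flags sum to zero
        keep_cols = [c for c in shared if sum(flagged[c]) > 0]
        keep_rows = [i for i in range(len(matched))
                     if any(flagged[c][i] for c in keep_cols)]
        flagged = {c: [flagged[c][i] for i in keep_rows]
                   for c in [key] + keep_cols}
    return flagged, unmatched


left = {"id": [1, 2, 3], "name": ["a", "b", "c"], "qty": [10, 20, 30]}
right = {"id": [1, 2, 4], "name": ["a", "x", "c"], "qty": [10, 20, 30]}

os.environ["JOIN_KEY"] = "id"  # hypothetical environment variable name
flagged, unmatched = flag_differences(left, right, "${JOIN_KEY}")
print(flagged)    # {'id': [1, 2], 'name': [0, 1], 'qty': [0, 0]}
print(unmatched)  # [3, 4]
```

With drop_zero_sum=True the same inputs reduce to the single differing cell, {'id': [2], 'name': [1]}, which is the kind of compact report the parameter description suggests.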
- build_profiling(canonical: ~pyarrow.lib.Table, profiling: str, headers: [<class 'str'>, <class 'list'>] = None, d_types: [<class 'str'>, <class 'list'>] = None, regex: [<class 'str'>, <class 'list'>] = None, drop: bool = None, connector_name: str = None, seed: int = None, save_intent: bool = None, intent_level: [<class 'int'>, <class 'str'>] = None, intent_order: int = None, replace_intent: bool = None, remove_duplicates: bool = None) Table
Data profiling is the process of examining, analyzing, and creating useful summaries of data. The process yields a high-level overview which aids in the discovery of data quality issues, risks, and overall trends. It can be used to identify any errors, anomalies, or patterns that may exist within the data. There are three types of data profiling available: ‘dictionary’, ‘schema’ or ‘quality’.
- Parameters:
canonical – a direct or generated pa.Table. See context notes below
profiling – The profiling name. Options are ‘dictionary’, ‘schema’ or ‘quality’
headers – (optional) a filter of headers from the canonical
d_types – (optional) a filter on data type for the canonical: int, float, bool, object
regex – (optional) a regular expression to search the headers. For example, ‘^((?!_amt).)*$’ excludes headers containing ‘_amt’
drop – (optional) to drop or not drop the headers if specified
connector_name – (optional) a connector name where the outcome is sent
seed – (optional) this is a placeholder, here for compatibility across methods
save_intent – (optional) if the intent contract should be saved to the property manager
intent_level – (optional) the column name that groups intent to create a column
intent_order – (optional) the order in which each intent should run.
  - If None: defaults to -1
  - If -1: added to a level above any current instance of the intent section, level 0 if not found
  - If int: added to the level specified, overwriting any that already exist
replace_intent – (optional) if the intent method exists at the level, or default level
  - True: replaces the current intent method with the new
  - False: leaves it untouched, disregarding the new intent
remove_duplicates – (optional) removes any duplicate intent in any level that is identical
- Returns:
a pa.Table
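The way the headers, regex and drop parameters might interact can be sketched with the standard library alone. This is an illustration under stated assumptions, not the library's code: the function name filter_headers is hypothetical, and it assumes a header is selected when it satisfies every filter supplied, with drop inverting the selection.

```python
import re


def filter_headers(columns, headers=None, regex=None, drop=False):
    """Select column names by explicit list and/or regular expression.

    Assumption: a name is selected when it passes every filter given;
    drop=True returns the names that were NOT selected.
    """
    if headers is None and regex is None:
        return list(columns)  # no filter supplied: keep everything
    if isinstance(headers, str):
        headers = [headers]
    if isinstance(regex, str):
        regex = [regex]
    selected = []
    for name in columns:
        by_name = headers is None or name in headers
        by_regex = regex is None or any(re.search(p, name) for p in regex)
        if by_name and by_regex:
            selected.append(name)
    if drop:
        return [name for name in columns if name not in selected]
    return selected


columns = ["id", "sale_amt", "tax_amt", "region"]

# the negative lookahead rejects any header containing "_amt"
print(filter_headers(columns, regex=r"^((?!_amt).)*$"))
# ['id', 'region']

# drop=True inverts the selection
print(filter_headers(columns, regex=r"^((?!_amt).)*$", drop=True))
# ['sale_amt', 'tax_amt']
```

The pattern works because `(?!_amt).` consumes one character only when the text at that position does not start with `_amt`, so a header containing `_amt` anywhere can never be matched through to `$`.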