FeatureBuild

class ds_capability.intent.feature_build_intent.FeatureBuildIntent(property_manager: ~ds_capability.managers.feature_build_property_manager.FeatureBuildPropertyManager, default_save_intent: bool = None, default_intent_level: [<class 'str'>, <class 'int'>, <class 'float'>] = None, order_next_available: bool = None, default_replace_intent: bool = None)

This class holds feature build intent actions that are bespoke to a particular use case but have broader reuse beyond it.

build_difference(canonical: ~pyarrow.lib.Table, other: [<class 'str'>, <class 'pyarrow.lib.Table'>], on_key: [<class 'str'>, <class 'list'>], drop_zero_sum: bool = None, summary_connector: bool = None, flagged_connector: str = None, detail_connector: str = None, unmatched_connector: str = None, seed: int = None, save_intent: bool = None, intent_level: [<class 'int'>, <class 'str'>] = None, intent_order: int = None, replace_intent: bool = None, remove_duplicates: bool = None) Table

Returns the difference between two canonicals, joined on a common and unique key. The on_key parameter can be a direct reference to a canonical column header or to an environment variable. If an environment variable is used, on_key should be set to "${<<YOUR_ENVIRON>>}", where <<YOUR_ENVIRON>> is the environment variable name.
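The "${...}" convention for on_key can be illustrated with a minimal sketch. The helper resolve_key and the variable name JOIN_KEY are hypothetical and not part of ds_capability; the sketch only shows one plausible way such a reference could be expanded for a single string key, and the library's own resolution logic may differ.

```python
import os
import re

# Hypothetical helper, not part of ds_capability: expands "${NAME}" from the environment.
_ENV_PATTERN = re.compile(r"^\$\{(\w+)\}$")


def resolve_key(on_key: str) -> str:
    """Return on_key unchanged, or the value of the environment variable it names."""
    match = _ENV_PATTERN.match(on_key)
    if match is None:
        return on_key  # treated as a direct column header
    name = match.group(1)
    if name not in os.environ:
        raise KeyError(f"environment variable '{name}' is not set")
    return os.environ[name]


os.environ["JOIN_KEY"] = "customer_id"  # illustrative variable name and value
direct = resolve_key("customer_id")     # plain header passes through
from_env = resolve_key("${JOIN_KEY}")   # reference is expanded
```

Both calls yield the same column header, so a pipeline could switch the join key through configuration without changing code.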

If the flagged_connector parameter is set, a report flagging where the left data differs from the right data is sent to that connector, where 1 indicates a difference and 0 indicates the values are the same. By default the method returns this report, but when this parameter is set the original canonical is returned instead. This allows a canonical pipeline to continue through the component while still outputting the difference report.

If the detail_connector parameter is set, a detail report showing the left and right values that differ is sent to that connector.

If the unmatched_connector parameter is set, the on_key values that do not match between left and right are reported to that connector.
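The three reports described above can be pictured with a small dependency-free sketch. Plain dictionaries keyed by the join key stand in for the two pa.Table inputs, both sides are assumed to share the same columns, and the function name and report layouts are illustrative assumptions rather than the library's actual output format.

```python
# Plain dictionaries keyed by the join key stand in for the two tables (assumption).
left = {
    1: {"price": 10, "qty": 2},
    2: {"price": 20, "qty": 5},
    3: {"price": 30, "qty": 1},
}
right = {
    1: {"price": 10, "qty": 2},
    2: {"price": 25, "qty": 5},
    4: {"price": 40, "qty": 7},
}


def difference_reports(left, right):
    """Build flagged, detail and unmatched reports for two keyed row sets."""
    flagged, detail = {}, []
    for key in sorted(left.keys() & right.keys()):
        l_row, r_row = left[key], right[key]
        # 1 where the left and right values differ, 0 where they are the same
        flagged[key] = {col: int(l_row[col] != r_row[col]) for col in l_row}
        for col in l_row:
            if l_row[col] != r_row[col]:
                detail.append(
                    {"key": key, "column": col, "left": l_row[col], "right": r_row[col]}
                )
    # keys that appear on only one side of the join
    unmatched = {
        "left_only": sorted(left.keys() - right.keys()),
        "right_only": sorted(right.keys() - left.keys()),
    }
    return flagged, detail, unmatched


flagged, detail, unmatched = difference_reports(left, right)
```

Here key 2 is flagged for price only, the detail report carries the differing pair 20 and 25, and keys 3 and 4 land in the unmatched report.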

Parameters:
  • canonical – a pa.Table as the reference table

  • other – a direct pa.Table or reference to a connector.

  • on_key – The name of the key that uniquely joins the canonical to others

  • drop_zero_sum – (optional) drops rows and columns that have a total sum of zero differences

  • summary_connector – (optional) a connector name where the summary report is sent

  • flagged_connector – (optional) a connector name where the differences are flagged

  • detail_connector – (optional) a connector name where the differences are shown

  • unmatched_connector – (optional) a connector name where the unmatched keys are shown

  • seed – (optional) this is a placeholder, here for compatibility across methods

  • save_intent – (optional) if the intent contract should be saved to the property manager

  • intent_level – (optional) the column name that groups intent to create a column

  • intent_order – (optional) the order in which each intent should run.
      - If None: defaults to -1
      - If -1: added to a level above any current instance of the intent section, level 0 if not found
      - If int: added to the level specified, overwriting any that already exist

  • replace_intent – (optional) what to do if the intent method already exists at the level, or default level.
      - True: replaces the current intent method with the new one
      - False: leaves it untouched, disregarding the new intent

  • remove_duplicates – (optional) removes any duplicate intent in any level that is identical

Returns:

a pa.Table

build_profiling(canonical: ~pyarrow.lib.Table, profiling: str, headers: [<class 'str'>, <class 'list'>] = None, d_types: [<class 'str'>, <class 'list'>] = None, regex: [<class 'str'>, <class 'list'>] = None, drop: bool = None, connector_name: str = None, seed: int = None, save_intent: bool = None, intent_level: [<class 'int'>, <class 'str'>] = None, intent_order: int = None, replace_intent: bool = None, remove_duplicates: bool = None) Table

Data profiling is the process of examining, analyzing, and creating useful summaries of data. The process yields a high-level overview which aids in the discovery of data quality issues, risks, and overall trends. It can be used to identify any errors, anomalies, or patterns that may exist within the data. There are three types of data profiling available: ‘dictionary’, ‘schema’ or ‘quality’.

Parameters:
  • canonical – a pa.Table to be profiled

  • profiling – The profiling name. Options are ‘dictionary’, ‘schema’ or ‘quality’

  • headers – (optional) a filter of headers from the canonical

  • d_types – (optional) a filter on data type for the canonical. int, float, bool, object

  • regex – (optional) a regular expression to search the headers. For example, ‘^((?!_amt).)*$’ excludes headers containing ‘_amt’

  • drop – (optional) to drop or not drop the headers if specified

  • connector_name – (optional) a connector name where the outcome is sent

  • seed – (optional) this is a placeholder, here for compatibility across methods

  • save_intent – (optional) if the intent contract should be saved to the property manager

  • intent_level – (optional) the column name that groups intent to create a column

  • intent_order – (optional) the order in which each intent should run.
      - If None: defaults to -1
      - If -1: added to a level above any current instance of the intent section, level 0 if not found
      - If int: added to the level specified, overwriting any that already exist

  • replace_intent – (optional) what to do if the intent method already exists at the level, or default level.
      - True: replaces the current intent method with the new one
      - False: leaves it untouched, disregarding the new intent

  • remove_duplicates – (optional) removes any duplicate intent in any level that is identical

Returns:

a pa.Table
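The negative-lookahead pattern given for the regex parameter of build_profiling can be tried on its own with Python's re module. This is a minimal sketch with made-up header names, written with balanced parentheses; it shows only which headers the pattern matches, and how the method then combines the match with headers, d_types and drop is not covered here.

```python
import re

# Illustrative header names, not taken from any real dataset.
headers = ["id", "total_amt", "tax_amt", "name", "created"]

# Matches a header only if no position in it starts the substring "_amt".
pattern = re.compile(r"^((?!_amt).)*$")

kept = [h for h in headers if pattern.search(h)]
excluded = [h for h in headers if not pattern.search(h)]
```

With these headers, kept holds "id", "name" and "created", while the two "_amt" columns fall into excluded.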