FeatureEngineer - sample

class ds_capability.intent.feature_engineer_intent.FeatureEngineerIntent(property_manager: ~ds_capability.managers.feature_engineer_property_manager.FeatureEngineerPropertyManager, default_save_intent: bool = None, default_intent_level: [<class 'str'>, <class 'int'>, <class 'float'>] = None, order_next_available: bool = None, default_replace_intent: bool = None)

This class represents feature engineering intent actions that, depending on their application, capture the statistical and distributive characteristics of the data to provide targeted features of interest. Its focus is on building, correlating and modelling features in a way that is more conducive to downstream feature requirements.

get_dist_bernoulli(probability: float, canonical: ~pyarrow.lib.Table = None, size: int = None, quantity: float = None, to_header: str = None, seed: int = None, save_intent: bool = None, intent_level: [<class 'int'>, <class 'str'>] = None, intent_order: int = None, replace_intent: bool = None, remove_duplicates: bool = None) Table

A Bernoulli discrete random distribution using scipy

Parameters:
  • probability – the probability of occurrence

  • canonical – (optional) a pa.Table to append the result table to

  • size – the size of the sample

  • quantity – a number between 0 and 1 representing data that isn’t null

  • to_header – (optional) an optional name to call the column

  • seed – a seed value for the random function: default to None

  • save_intent – (optional) if the intent contract should be saved to the property manager

  • intent_level – (optional) the column name that groups intent to create a column

  • intent_order – (optional) the order in which each intent should run. If None, defaults to -1. If -1, the intent is added to a level above any current instance of the intent section, or to level 0 if none is found. If an int, the intent is added to the level specified, overwriting any that already exists.

  • replace_intent – (optional) what to do if the intent method already exists at the level, or at the default level. True replaces the current intent method with the new one. False leaves it untouched, disregarding the new intent.

  • remove_duplicates – (optional) removes any duplicate intent in any level that is identical

Returns:

a pa.Table containing the random numbers
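As an illustration of how the probability, size and seed parameters interact, the following is a minimal standard-library sketch of Bernoulli sampling. It is not the library's implementation, which according to the description above uses scipy and returns a Table. The function name bernoulli_sample is hypothetical.

```python
import random


def bernoulli_sample(probability, size, seed=None):
    """Return `size` draws of 0 or 1, where 1 occurs with `probability`."""
    if not 0 <= probability <= 1:
        raise ValueError("probability must be between 0 and 1")
    rng = random.Random(seed)  # a fixed seed makes the draws repeatable
    return [1 if rng.random() < probability else 0 for _ in range(size)]


sample = bernoulli_sample(probability=0.3, size=10, seed=42)
```

Calling the function again with the same seed returns the same list, which is the behaviour a seed parameter is normally expected to provide.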

get_dist_binomial(number: [<class 'int'>, <class 'str'>, <class 'float'>], canonical: ~pyarrow.lib.Table = None, size: int = None, num_nulls: [<class 'float'>, <class 'int'>] = None, to_header: str = None, seed: int = None, save_intent: bool = None, intent_level: [<class 'int'>, <class 'str'>] = None, intent_order: int = None, replace_intent: bool = None, remove_duplicates: bool = None) Table

Creates a binomial as a boolean set of values based on a number or a probability, depending on the number passed. If the number is between 0 and 1 it is taken as a probability; otherwise it is taken as a numeric count.

The number parameter can be a direct reference to the canonical column header or to an environment variable. If an environment variable is used, number should be set to "${<<YOUR_ENVIRON>>}", where <<YOUR_ENVIRON>> is the environment variable name.

Parameters:
  • number – a probability between 0 and 1 or the number of True values.

  • canonical – (optional) a pa.Table to append the result table to

  • size – (optional) the size of the array.

  • num_nulls – (optional) a probability between 0 and 1 or the number of null values.

  • to_header – (optional) an optional name to call the column

  • seed – (optional) a seed value for the random function: default to None

  • save_intent – (optional) if the intent contract should be saved to the property manager

  • intent_level – (optional) the column name that groups intent to create a column

  • intent_order – (optional) the order in which each intent should run. If None, defaults to -1. If -1, the intent is added to a level above any current instance of the intent section, or to level 0 if none is found. If an int, the intent is added to the level specified, overwriting any that already exists.

  • replace_intent – (optional) what to do if the intent method already exists at the level, or at the default level. True replaces the current intent method with the new one. False leaves it untouched, disregarding the new intent.

  • remove_duplicates – (optional) removes any duplicate intent in any level that is identical

Returns:

a list of 1 or 0

As number is a fixed value, it can be represented by an environment variable with the format ‘${NAME}’, where NAME is the environment variable name.
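The two readings of number (probability or count) and the environment variable form can be sketched with the standard library. This is an illustration only, not the library's code: the helper names and the TRUE_COUNT variable are hypothetical, the sketch assumes that only values strictly between 0 and 1 are probabilities, and it does not cover the case where number refers to a canonical column header.

```python
import os
import random


def resolve_number(number):
    """Resolve a "${NAME}" string from the environment, else return a float."""
    if isinstance(number, str) and number.startswith("${") and number.endswith("}"):
        number = os.environ[number[2:-1]]
    return float(number)


def binomial_flags(number, size, seed=None):
    """Return `size` 0/1 flags from a probability or from a count of 1s."""
    value = resolve_number(number)
    rng = random.Random(seed)
    if 0 < value < 1:
        # Assumption: values strictly between 0 and 1 are probabilities.
        return [1 if rng.random() < value else 0 for _ in range(size)]
    # Otherwise treat the value as a count of 1s placed at random positions.
    count = max(0, min(int(value), size))
    flags = [1] * count + [0] * (size - count)
    rng.shuffle(flags)
    return flags


os.environ["TRUE_COUNT"] = "3"  # hypothetical environment variable
by_count = binomial_flags("${TRUE_COUNT}", size=10, seed=0)
by_probability = binomial_flags(0.5, size=10, seed=0)
```

Here by_count always contains exactly three 1s, while the number of 1s in by_probability varies with the seed.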

get_dist_bounded_normal(mean: float, std: float, lower: float, upper: float, canonical: ~pyarrow.lib.Table = None, precision: int = None, size: int = None, quantity: float = None, to_header: str = None, seed: int = None, save_intent: bool = None, intent_level: [<class 'int'>, <class 'str'>] = None, intent_order: int = None, replace_intent: bool = None, remove_duplicates: bool = None) Table

A bounded normal continuous random distribution.

Parameters:
  • mean – the mean of the distribution

  • std – the standard deviation

  • lower – the lower limit of the distribution

  • upper – the upper limit of the distribution

  • canonical – (optional) a pa.Table to append the result table to

  • precision – the precision of the returned number. If None, an int value is assumed; otherwise a float

  • size – the size of the sample

  • quantity – a number between 0 and 1 representing data that isn’t null

  • to_header – (optional) an optional name to call the column

  • seed – a seed value for the random function: default to None

  • save_intent – (optional) if the intent contract should be saved to the property manager

  • intent_level – (optional) the column name that groups intent to create a column

  • intent_order – (optional) the order in which each intent should run. If None, defaults to -1. If -1, the intent is added to a level above any current instance of the intent section, or to level 0 if none is found. If an int, the intent is added to the level specified, overwriting any that already exists.

  • replace_intent – (optional) what to do if the intent method already exists at the level, or at the default level. True replaces the current intent method with the new one. False leaves it untouched, disregarding the new intent.

  • remove_duplicates – (optional) removes any duplicate intent in any level that is identical

Returns:

a pa.Table containing the random numbers
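The description does not say how the bounds are enforced. One common approach is rejection sampling, where out-of-range draws are discarded and redrawn. The sketch below shows that approach with the standard library; it is an assumption about a possible technique, not the library's method, and the function name is hypothetical. It can loop for a long time if the bounds lie far from the mean.

```python
import random


def bounded_normal(mean, std, lower, upper, size, seed=None):
    """Draw `size` normal values, redrawing any that fall outside the bounds."""
    if lower >= upper:
        raise ValueError("lower must be less than upper")
    rng = random.Random(seed)
    values = []
    while len(values) < size:
        draw = rng.gauss(mean, std)
        if lower <= draw <= upper:  # keep only in-range draws
            values.append(draw)
    return values


heights = bounded_normal(mean=0.0, std=1.0, lower=-1.0, upper=1.0, size=100, seed=1)
```

Rejection keeps the shape of the normal curve inside the bounds, whereas clipping values to the limits would pile up probability mass at lower and upper.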

get_dist_normal(mean: float, std: float, canonical: ~pyarrow.lib.Table = None, precision: int = None, size: int = None, quantity: float = None, to_header: str = None, seed: int = None, save_intent: bool = None, intent_level: [<class 'int'>, <class 'str'>] = None, intent_order: int = None, replace_intent: bool = None, remove_duplicates: bool = None) Table

A normal (Gaussian) continuous random distribution.

Parameters:
  • mean – The mean (“centre”) of the distribution.

  • std – The standard deviation (jitter or “width”) of the distribution. Must be >= 0

  • canonical – (optional) a pa.Table to append the result table to

  • precision – The number of decimal points. The default is 3

  • size – the size of the sample. if a tuple of intervals, size must match the tuple

  • quantity – a number between 0 and 1 representing data that isn’t null

  • to_header – (optional) an optional name to call the column

  • seed – a seed value for the random function: default to None

  • save_intent – (optional) if the intent contract should be saved to the property manager

  • intent_level – (optional) the column name that groups intent to create a column

  • intent_order – (optional) the order in which each intent should run. If None, defaults to -1. If -1, the intent is added to a level above any current instance of the intent section, or to level 0 if none is found. If an int, the intent is added to the level specified, overwriting any that already exists.

  • replace_intent – (optional) what to do if the intent method already exists at the level, or at the default level. True replaces the current intent method with the new one. False leaves it untouched, disregarding the new intent.

  • remove_duplicates – (optional) removes any duplicate intent in any level that is identical

Returns:

a pa.Table containing the random numbers
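The precision and quantity parameters can be illustrated with a small standard-library sketch. It assumes that quantity is the fraction of values kept and that the remainder are set to null at random positions; how the library actually chooses the null positions is not stated. The function name is hypothetical.

```python
import random


def normal_with_nulls(mean, std, size, precision=3, quantity=1.0, seed=None):
    """Return `size` rounded normal draws with a share replaced by None."""
    rng = random.Random(seed)
    values = [round(rng.gauss(mean, std), precision) for _ in range(size)]
    # Assumption: `quantity` is the fraction of values that are not null.
    num_nulls = size - round(size * quantity)
    for index in rng.sample(range(size), num_nulls):
        values[index] = None
    return values


column = normal_with_nulls(mean=0.0, std=1.0, size=20, quantity=0.75, seed=7)
```

With size=20 and quantity=0.75, five of the twenty entries are None and the rest are rounded to three decimal places.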

get_distribution(distribution: str, canonical: ~pyarrow.lib.Table = None, is_stats: bool = None, precision: int = None, size: int = None, quantity: float = None, to_header: str = None, seed: int = None, save_intent: bool = None, intent_level: [<class 'int'>, <class 'str'>] = None, intent_order: int = None, replace_intent: bool = None, remove_duplicates: bool = None, **kwargs) Table

Returns a number based on the distribution type.

Parameters:
  • distribution – The string name of the distribution function from numpy random Generator class

  • is_stats – (optional) if the generator is from the stats package and not numpy

  • canonical – (optional) a pa.Table to append the result table to

  • precision – (optional) the precision of the returned number

  • size – (optional) the size of the sample

  • quantity – (optional) a number between 0 and 1 representing data that isn’t null

  • to_header – (optional) an optional name to call the column

  • seed – (optional) a seed value for the random function: default to None

  • save_intent – (optional) if the intent contract should be saved to the property manager

  • intent_level – (optional) the column name that groups intent to create a column

  • intent_order – (optional) the order in which each intent should run. If None, defaults to -1. If -1, the intent is added to a level above any current instance of the intent section, or to level 0 if none is found. If an int, the intent is added to the level specified, overwriting any that already exists.

  • replace_intent – (optional) what to do if the intent method already exists at the level, or at the default level. True replaces the current intent method with the new one. False leaves it untouched, disregarding the new intent.

  • remove_duplicates – (optional) removes any duplicate intent in any level that is identical

  • kwargs – the parameters of the method

Returns:

a pa.Table containing the random numbers
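Selecting a distribution by its string name and forwarding kwargs to it is a name-based dispatch. The sketch below shows the pattern using random.Random from the standard library as a stand-in for the numpy random Generator class named above, so the method names (gauss, uniform) and their keyword arguments are those of the standard library, not numpy's. The function name draw_by_name is hypothetical.

```python
import random


def draw_by_name(distribution, size, seed=None, **kwargs):
    """Call the named method of random.Random `size` times with `kwargs`."""
    rng = random.Random(seed)
    method = getattr(rng, distribution, None)
    if distribution.startswith("_") or not callable(method):
        raise ValueError(f"unknown distribution: {distribution}")
    return [method(**kwargs) for _ in range(size)]


gaussian = draw_by_name("gauss", size=5, seed=3, mu=10.0, sigma=2.0)
uniform = draw_by_name("uniform", size=5, seed=3, a=0.0, b=1.0)
```

The keyword arguments play the role of the kwargs parameter: they are passed through unchanged to whichever distribution function the name resolves to.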

get_noise(size: int, num_columns: int, canonical: ~pyarrow.lib.Table = None, seed: int = None, name_prefix: str = None, save_intent: bool = None, intent_level: [<class 'int'>, <class 'str'>] = None, intent_order: int = None, replace_intent: bool = None, remove_duplicates: bool = None) DataFrame

Generates multiple columns of noise in your dataset

Parameters:
  • size – The number of rows

  • num_columns – the number of columns of noise

  • canonical – (optional) a pa.Table to append the result table to

  • name_prefix – a name to prefix the column names

  • seed – (optional) a seed value for the random function: default to None

  • save_intent – (optional) if the intent contract should be saved to the property manager

  • intent_level – (optional) the column name that groups intent to create a column

  • intent_order – (optional) the order in which each intent should run. If None, defaults to -1. If -1, the intent is added to a level above any current instance of the intent section, or to level 0 if none is found. If an int, the intent is added to the level specified, overwriting any that already exists.

  • replace_intent – (optional) what to do if the intent method already exists at the level, or at the default level. True replaces the current intent method with the new one. False leaves it untouched, disregarding the new intent.

  • remove_duplicates – (optional) removes any duplicate intent in any level that is identical

Returns:

a DataFrame

get_sample_list(sample_name: str, canonical: ~pyarrow.lib.Table = None, sample_size: int = None, shuffle: bool = None, size: int = None, quantity: float = None, to_header: str = None, seed: int = None, save_intent: bool = None, intent_level: [<class 'int'>, <class 'str'>] = None, intent_order: int = None, replace_intent: bool = None, remove_duplicates: bool = None) Table

returns a sample set based on sample_name. To see the potential samples call the property ‘sample_list’.

Parameters:
  • sample_name – The name of the Sample method to be used.

  • canonical – (optional) a pa.Table to append the result table to

  • sample_size – (optional) the size of the sample to take from the reference file

  • shuffle – (optional) if the selection should be shuffled before selection. Default is true

  • quantity – (optional) a number between 0 and 1 representing the percentage quantity of the data

  • size – (optional) size of the return. default to 1

  • to_header – (optional) an optional name to call the column

  • seed – (optional) a seed value for the random function: default to None

  • save_intent – (optional) if the intent contract should be saved to the property manager

  • intent_level – (optional) the column name that groups intent to create a column

  • intent_order – (optional) the order in which each intent should run. If None, defaults to -1. If -1, the intent is added to a level above any current instance of the intent section, or to level 0 if none is found. If an int, the intent is added to the level specified, overwriting any that already exists.

  • replace_intent – (optional) what to do if the intent method already exists at the level, or at the default level. True replaces the current intent method with the new one. False leaves it untouched, disregarding the new intent.

  • remove_duplicates – (optional) removes any duplicate intent in any level that is identical

Returns:

a sample list

get_sample_map(sample_map: str, size: int, canonical: ~pyarrow.lib.Table = None, selection: list = None, mask_null: bool = None, headers: [<class 'str'>, <class 'list'>] = None, shuffle: bool = None, rename_columns: [<class 'dict'>, <class 'list'>] = None, seed: int = None, save_intent: bool = None, intent_level: [<class 'int'>, <class 'str'>] = None, intent_order: int = None, replace_intent: bool = None, remove_duplicates: bool = None, **kwargs) Table

returns a sample table based on sample_map. To see the potential samples call the property ‘sample_map’. The returned table can be filtered by row (selection) or by column (headers)

The selection is a list of triple tuples in the form [(comparison, operation, logic)], where comparison is the item or column to compare against, operation is the comparison to perform, and logic, used when chaining tuples, is how the resulting boolean flags are joined to those of the next tuple. An example might be:

[(comparison, operation, logic)]
[(1, 'greater', 'or'), (-1, 'less', None)]
[(pa.array(['INACTIVE', 'PENDING']), 'is_in', None)]

The operator and logic are taken from pyarrow.compute and are:

operator => extract_regex, equal, greater, less, greater_equal, less_equal, not_equal, is_in, is_null
logic => and, or, xor, and_not

{header: [(comparison, operation, logic)]}
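To make the chaining of selection tuples concrete, the following sketch evaluates a selection against a plain Python list. It uses the standard library operator module as a stand-in for pyarrow.compute and covers only a subset of the operators listed above. The chaining rule (each tuple's logic joins its flags to the next tuple's flags) is an interpretation of the description, and the function name selection_mask is hypothetical.

```python
import operator

# Stand-ins for a subset of the pyarrow.compute names listed above.
OPERATIONS = {
    "equal": operator.eq,
    "not_equal": operator.ne,
    "greater": operator.gt,
    "greater_equal": operator.ge,
    "less": operator.lt,
    "less_equal": operator.le,
    "is_in": lambda value, options: value in options,
}
LOGIC = {
    "and": lambda a, b: a and b,
    "or": lambda a, b: a or b,
    "xor": lambda a, b: a != b,
    "and_not": lambda a, b: a and not b,
}


def selection_mask(values, selection):
    """Return one boolean per value after applying the chained conditions."""
    mask = None
    pending_logic = None
    for comparison, operation, logic in selection:
        flags = [OPERATIONS[operation](value, comparison) for value in values]
        if mask is None:
            mask = flags
        else:
            # Assumption: the previous tuple's logic joins its flags to these.
            # A missing logic on a non-final tuple falls back to "and" here.
            join = LOGIC[pending_logic or "and"]
            mask = [join(a, b) for a, b in zip(mask, flags)]
        pending_logic = logic
    return mask if mask is not None else [True] * len(values)


# Keep values greater than 1 or less than -1.
outside = selection_mask([-2, 0, 3], [(1, "greater", "or"), (-1, "less", None)])
# Keep values found in the given options.
status = selection_mask(["ACTIVE", "PENDING"], [(["INACTIVE", "PENDING"], "is_in", None)])
```

In this sketch outside is [True, False, True] and status is [False, True], matching a reading of the two examples above as "greater than 1 or less than -1" and "is one of INACTIVE or PENDING".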

Parameters:
  • sample_map – the sample map name.

  • size – size of the return table.

  • canonical – (optional) a pa.Table to append the result table to

  • rename_columns – (optional) rename the returning sample columns with an exact list or replacement dict

  • selection – (optional) a list of selection tuples in the form [(comparison, operation, logic)], as described above

  • mask_null – (optional)

  • headers – a header or list of headers to filter on

  • shuffle – (optional) if the selection should be shuffled before selection. Default is true

  • seed – (optional) a seed value for the random function: default to None

  • save_intent – (optional) if the intent contract should be saved to the property manager

  • intent_level – (optional) the column name that groups intent to create a column

  • intent_order – (optional) the order in which each intent should run. If None, defaults to -1. If -1, the intent is added to a level above any current instance of the intent section, or to level 0 if none is found. If an int, the intent is added to the level specified, overwriting any that already exists.

  • replace_intent – (optional) what to do if the intent method already exists at the level, or at the default level. True replaces the current intent method with the new one. False leaves it untouched, disregarding the new intent.

  • remove_duplicates – (optional) removes any duplicate intent in any level that is identical

  • kwargs – any additional parameters to pass to the sample map

Returns:

pa.Table

get_synthetic_data_types(size: int, extend: bool = None, prob_nulls: float = None, seed: int = None, category_encode: bool = None, save_intent: bool = None, intent_level: [<class 'int'>, <class 'str'>] = None, intent_order: int = None, replace_intent: bool = None, remove_duplicates: bool = None) Table

A dataset with example data types

Parameters:
  • size – The size of the sample

  • extend – extend the synthetic dataset to include non-standard types

  • prob_nulls – (optional) a value between 0 and 1 giving the proportion of nulls. Default 0.02

  • category_encode – (optional) if the categorical should be encoded to DictionaryArray

  • seed – (optional) a seed value for the random function: default to None

  • save_intent – (optional) if the intent contract should be saved to the property manager

  • intent_level – (optional) the column name that groups intent to create a column

  • intent_order – (optional) the order in which each intent should run. If None, defaults to -1. If -1, the intent is added to a level above any current instance of the intent section, or to level 0 if none is found. If an int, the intent is added to the level specified, overwriting any that already exists.

  • replace_intent – (optional) what to do if the intent method already exists at the level, or at the default level. True replaces the current intent method with the new one. False leaves it untouched, disregarding the new intent.

  • remove_duplicates – (optional) removes any duplicate intent in any level that is identical

Returns:

pyarrow Table

get_synthetic_persona_usa(size: int, canonical: ~pyarrow.lib.Table = None, seed: int = None, category_encode: bool = None, save_intent: bool = None, intent_level: [<class 'int'>, <class 'str'>] = None, intent_order: int = None, replace_intent: bool = None, remove_duplicates: bool = None) Table

A synthetic dataset representing a persona in the United States of America

Parameters:
  • size – The size of the sample

  • canonical – (optional) a pa.Table to append the result table to

  • category_encode – (optional) if the categorical should be encoded to DictionaryArray

  • seed – (optional) a seed value for the random function: default to None

  • save_intent – (optional) if the intent contract should be saved to the property manager

  • intent_level – (optional) the column name that groups intent to create a column

  • intent_order – (optional) the order in which each intent should run. If None, defaults to -1. If -1, the intent is added to a level above any current instance of the intent section, or to level 0 if none is found. If an int, the intent is added to the level specified, overwriting any that already exists.

  • replace_intent – (optional) what to do if the intent method already exists at the level, or at the default level. True replaces the current intent method with the new one. False leaves it untouched, disregarding the new intent.

  • remove_duplicates – (optional) removes any duplicate intent in any level that is identical

Returns:

pyarrow Table
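The intent_order rules shared by the methods above can be sketched with a plain dictionary. This is one reading of the parameter description, not the property manager's actual storage: the section structure (a dict mapping order to the saved parameters of one intent method) and the function name place_intent are hypothetical.

```python
def place_intent(section, params, intent_order=None):
    """Store `params` in `section` (order -> params) and return the order used."""
    order = -1 if intent_order is None else intent_order
    if order == -1:
        # Next level above any existing entry, or 0 when the section is empty.
        order = max(section) + 1 if section else 0
    section[order] = params  # an explicit order overwrites what was there
    return order


section = {}
first = place_intent(section, {"probability": 0.2})
second = place_intent(section, {"probability": 0.5})
replaced = place_intent(section, {"probability": 0.9}, intent_order=0)
```

After these calls the section holds two entries: order 0 with probability 0.9, which overwrote the first call, and order 1 with probability 0.5.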

static sample_inspect(method: str)

A taste of a given sample method

property sample_list: list

A list of sample options

property sample_map: list

A list of sample options