FeatureEngineer - sample

class ds_capability.intent.feature_engineer_intent.FeatureEngineerIntent(property_manager: ~ds_capability.managers.feature_engineer_property_manager.FeatureEngineerPropertyManager, default_save_intent: bool = None, default_intent_level: [<class 'str'>, <class 'int'>, <class 'float'>] = None, order_next_available: bool = None, default_replace_intent: bool = None)

This class represents feature engineering intent actions that, depending on their application, represent the data’s statistical and distributive characteristics to provide targeted features of interest. Its focus is on building, correlating and modelling features in a way that is more conducive to the downstream feature requirements.

get_dist_bernoulli(probability: float, canonical: ~pyarrow.lib.Table = None, size: int = None, quantity: float = None, to_header: str = None, seed: int = None, save_intent: bool = None, intent_level: [<class 'int'>, <class 'str'>] = None, intent_order: int = None, replace_intent: bool = None, remove_duplicates: bool = None) → Table

A Bernoulli discrete random distribution using scipy

Parameters:

probability – the probability of occurrence
canonical – (optional) a pa.Table to append the result table to
size – the size of the sample
quantity – a number between 0 and 1 representing data that isn’t null
to_header – (optional) an optional name to call the column
seed – a seed value for the random function: default to None
save_intent – (optional) if the intent contract should be saved to the property manager
intent_level – (optional) the column name that groups intent to create a column
intent_order – (optional) the order in which each intent should run.
    - If None: defaults to -1
    - If -1: added to a level above any current instance of the intent section, level 0 if not found
    - If int: added to the level specified, overwriting any that already exist
replace_intent – (optional) if the intent method exists at the level, or default level:
    - True: replaces the current intent method with the new
    - False: leaves it untouched, disregarding the new intent
remove_duplicates – (optional) removes any duplicate intent in any level that is identical

Returns:

a pa.Table of random values
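As a rough illustration of what the probability, size, quantity and seed parameters describe, the following standard-library sketch draws Bernoulli trials and then nulls out a share of them. It is not the library's implementation (which, per the description above, uses scipy). The helper name bernoulli_sample and the reading of quantity as a per-value chance of being kept are assumptions made for illustration only.

```python
import random
from typing import List, Optional


def bernoulli_sample(probability: float, size: int, quantity: float = 1.0,
                     seed: Optional[int] = None) -> List[Optional[int]]:
    """Hypothetical helper: draw `size` Bernoulli trials, then null some out."""
    if not 0.0 <= probability <= 1.0:
        raise ValueError("probability must be between 0 and 1")
    if not 0.0 <= quantity <= 1.0:
        raise ValueError("quantity must be between 0 and 1")
    rng = random.Random(seed)  # a fixed seed makes the draw repeatable
    # Each trial is 1 with the given probability, otherwise 0.
    values = [1 if rng.random() < probability else 0 for _ in range(size)]
    # Assumption: quantity is the chance that each value is kept (not null).
    return [v if rng.random() < quantity else None for v in values]


sample = bernoulli_sample(probability=0.3, size=10, quantity=0.8, seed=42)
print(sample)
```

Calling the helper twice with the same seed gives the same list, which is the behaviour the seed parameter is meant to provide.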

get_dist_binomial(number: [<class 'int'>, <class 'str'>, <class 'float'>], canonical: ~pyarrow.lib.Table = None, size: int = None, num_nulls: [<class 'float'>, <class 'int'>] = None, to_header: str = None, seed: int = None, save_intent: bool = None, intent_level: [<class 'int'>, <class 'str'>] = None, intent_order: int = None, replace_intent: bool = None, remove_duplicates: bool = None) → Table

Creates a binomial as a boolean set of values based on a count or a probability, depending on the number passed. If the number is between 0 and 1 it is taken as a probability, else it is taken as a numeric count.

The number parameter can be a direct reference to the canonical column header or to an environment variable. If the environment variable is used number should be set to "${<<YOUR_ENVIRON>>}" where <<YOUR_ENVIRON>> is the environment variable name

Parameters:

number – a probability between 0 and 1 or the number of True values.
canonical – (optional) a pa.Table to append the result table to
size – (optional) the size of the array.
num_nulls – (optional) a probability between 0 and 1 or the number of null values.
to_header – (optional) an optional name to call the column
seed – (optional) a seed value for the random function: default to None
save_intent – (optional) if the intent contract should be saved to the property manager
intent_level – (optional) the column name that groups intent to create a column
intent_order – (optional) the order in which each intent should run.
    - If None: defaults to -1
    - If -1: added to a level above any current instance of the intent section, level 0 if not found
    - If int: added to the level specified, overwriting any that already exist
replace_intent – (optional) if the intent method exists at the level, or default level:
    - True: replaces the current intent method with the new
    - False: leaves it untouched, disregarding the new intent
remove_duplicates – (optional) removes any duplicate intent in any level that is identical

Returns:

a pa.Table of 1 or 0 values

Because number is a fixed value, it can be represented by an environment variable in the format ‘${NAME}’, where NAME is the environment variable name.
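The two readings of number (probability versus count) and the "${NAME}" environment variable form can be sketched with the standard library alone. This is an illustration under stated assumptions, not the library's code: resolve_number and binomial_flags are hypothetical names, the canonical column header form of number is not handled, and values of exactly 0 or 1 are treated here as counts because the description does not say how the boundaries are handled.

```python
import os
import random
import re
from typing import List, Optional, Union


def resolve_number(number: Union[int, float, str]) -> float:
    """Hypothetical helper: read "${NAME}" from the environment variable NAME."""
    if isinstance(number, str):
        match = re.fullmatch(r"\$\{(\w+)\}", number.strip())
        if match is None:
            raise ValueError(f"expected '${{NAME}}', got {number!r}")
        number = os.environ[match.group(1)]  # KeyError if NAME is not set
    return float(number)


def binomial_flags(number: Union[int, float, str], size: int,
                   seed: Optional[int] = None) -> List[int]:
    """Hypothetical helper: 1/0 flags from a probability or a count."""
    value = resolve_number(number)
    if value < 0:
        raise ValueError("number must not be negative")
    rng = random.Random(seed)
    if 0 < value < 1:
        # Strictly between 0 and 1: treat as the probability of a 1.
        return [1 if rng.random() < value else 0 for _ in range(size)]
    # Otherwise treat as a count of 1s placed at random positions.
    count = min(int(value), size)
    flags = [1] * count + [0] * (size - count)
    rng.shuffle(flags)
    return flags


os.environ["TRUE_COUNT"] = "3"  # example variable, set here only for the demo
print(binomial_flags("${TRUE_COUNT}", size=10, seed=0))  # exactly three 1s
print(binomial_flags(0.25, size=10, seed=0))             # each value is 1 with p=0.25
```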

get_dist_bounded_normal(mean: float, std: float, lower: float, upper: float, canonical: ~pyarrow.lib.Table = None, precision: int = None, size: int = None, quantity: float = None, to_header: str = None, seed: int = None, save_intent: bool = None, intent_level: [<class 'int'>, <class 'str'>] = None, intent_order: int = None, replace_intent: bool = None, remove_duplicates: bool = None) → Table

A bounded normal continuous random distribution.

Parameters:

mean – the mean of the distribution
std – the standard deviation
lower – the lower limit of the distribution
upper – the upper limit of the distribution
canonical – (optional) a pa.Table to append the result table to
precision – the precision of the returned number. If None, the value is returned as an int, otherwise as a float
size – the size of the sample
quantity – a number between 0 and 1 representing data that isn’t null
to_header – (optional) an optional name to call the column
seed – a seed value for the random function: default to None
save_intent – (optional) if the intent contract should be saved to the property manager
intent_level – (optional) the column name that groups intent to create a column
intent_order – (optional) the order in which each intent should run.
    - If None: defaults to -1
    - If -1: added to a level above any current instance of the intent section, level 0 if not found
    - If int: added to the level specified, overwriting any that already exist
replace_intent – (optional) if the intent method exists at the level, or default level:
    - True: replaces the current intent method with the new
    - False: leaves it untouched, disregarding the new intent
remove_duplicates – (optional) removes any duplicate intent in any level that is identical

Returns:

a pa.Table of random values
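One simple way to picture a bounded normal distribution is rejection sampling: draw from a normal distribution and discard anything outside the limits. The sketch below does this with the standard library. It is an assumption-laden illustration; the library may well use a different technique, and bounded_normal and max_tries are hypothetical names. The precision handling follows the parameter description above (None gives int values, otherwise float).

```python
import random
from typing import List, Optional, Union


def bounded_normal(mean: float, std: float, lower: float, upper: float,
                   size: int, precision: Optional[int] = None,
                   seed: Optional[int] = None,
                   max_tries: int = 100_000) -> List[Union[int, float]]:
    """Hypothetical helper: normal draws restricted to [lower, upper]."""
    if lower > upper:
        raise ValueError("lower must not exceed upper")
    if std < 0:
        raise ValueError("std must be >= 0")
    rng = random.Random(seed)
    result: List[Union[int, float]] = []
    tries = 0
    while len(result) < size:
        tries += 1
        if tries > max_tries:
            raise RuntimeError("too many rejected draws; widen the bounds")
        draw = rng.gauss(mean, std)
        # Round first so the value that is returned is the one that is checked.
        value = round(draw) if precision is None else round(draw, precision)
        if lower <= value <= upper:
            result.append(value)
    return result


print(bounded_normal(mean=0.0, std=1.0, lower=-1.0, upper=1.0,
                     size=5, precision=3, seed=11))
```

Rejection sampling becomes slow when the limits sit far from the mean, which is why the sketch caps the number of attempts.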

get_dist_normal(mean: float, std: float, canonical: ~pyarrow.lib.Table = None, precision: int = None, size: int = None, quantity: float = None, to_header: str = None, seed: int = None, save_intent: bool = None, intent_level: [<class 'int'>, <class 'str'>] = None, intent_order: int = None, replace_intent: bool = None, remove_duplicates: bool = None) → Table

A normal (Gaussian) continuous random distribution.

Parameters:

mean – The mean (“centre”) of the distribution.
std – The standard deviation (jitter or “width”) of the distribution. Must be >= 0
canonical – (optional) a pa.Table to append the result table to
precision – The number of decimal points. The default is 3
size – the size of the sample. If a tuple of intervals, size must match the tuple
quantity – a number between 0 and 1 representing data that isn’t null
to_header – (optional) an optional name to call the column
seed – a seed value for the random function: default to None
save_intent – (optional) if the intent contract should be saved to the property manager
intent_level – (optional) the column name that groups intent to create a column
intent_order – (optional) the order in which each intent should run.
    - If None: defaults to -1
    - If -1: added to a level above any current instance of the intent section, level 0 if not found
    - If int: added to the level specified, overwriting any that already exist
replace_intent – (optional) if the intent method exists at the level, or default level:
    - True: replaces the current intent method with the new
    - False: leaves it untouched, disregarding the new intent
remove_duplicates – (optional) removes any duplicate intent in any level that is identical

Returns:

a pa.Table of random values

get_distribution(distribution: str, canonical: ~pyarrow.lib.Table = None, is_stats: bool = None, precision: int = None, size: int = None, quantity: float = None, to_header: str = None, seed: int = None, save_intent: bool = None, intent_level: [<class 'int'>, <class 'str'>] = None, intent_order: int = None, replace_intent: bool = None, remove_duplicates: bool = None, **kwargs) → Table

Returns a number based on the distribution type.

Parameters:

distribution – the string name of the distribution function from the numpy random Generator class
is_stats – (optional) if the generator is from the stats package and not numpy
canonical – (optional) a pa.Table to append the result table to
precision – (optional) the precision of the returned number
size – (optional) the size of the sample
quantity – (optional) a number between 0 and 1 representing data that isn’t null
to_header – (optional) an optional name to call the column
seed – (optional) a seed value for the random function: default to None
save_intent – (optional) if the intent contract should be saved to the property manager
intent_level – (optional) the column name that groups intent to create a column
intent_order – (optional) the order in which each intent should run.
    - If None: defaults to -1
    - If -1: added to a level above any current instance of the intent section, level 0 if not found
    - If int: added to the level specified, overwriting any that already exist
replace_intent – (optional) if the intent method exists at the level, or default level:
    - True: replaces the current intent method with the new
    - False: leaves it untouched, disregarding the new intent
remove_duplicates – (optional) removes any duplicate intent in any level that is identical
kwargs – the parameters of the method

Returns:

a pa.Table of random values
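The idea of selecting a distribution by its string name and passing its parameters through kwargs can be sketched as a name-based lookup on a random generator. To stay dependency-free, the sketch below swaps in Python's standard-library random.Random where the method above refers to numpy's Generator, so the distribution and parameter names shown (gauss, mu, sigma and so on) are those of the standard library, not of numpy. The helper name draw_by_name and the ALLOWED set are hypothetical.

```python
import random
from typing import Any, List, Optional

# Assumption: a small allow-list of random.Random methods used as "distributions".
ALLOWED = {"gauss", "uniform", "expovariate", "triangular",
           "betavariate", "gammavariate", "lognormvariate"}


def draw_by_name(distribution: str, size: int, precision: Optional[int] = None,
                 seed: Optional[int] = None, **kwargs: Any) -> List[float]:
    """Hypothetical helper: call the named method `size` times with kwargs."""
    if distribution not in ALLOWED:
        raise ValueError(f"unknown distribution: {distribution!r}")
    rng = random.Random(seed)
    method = getattr(rng, distribution)  # look the method up by its string name
    values = [method(**kwargs) for _ in range(size)]
    if precision is not None:
        values = [round(v, precision) for v in values]
    return values


print(draw_by_name("gauss", size=3, precision=2, seed=1, mu=0.0, sigma=1.0))
print(draw_by_name("uniform", size=3, precision=2, seed=1, a=1.0, b=2.0))
```

Looking the method up by name keeps the caller's interface to a single string plus keyword arguments, at the cost of the caller needing to know each distribution's parameter names.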

get_noise(size: int, num_columns: int, canonical: ~pyarrow.lib.Table = None, seed: int = None, name_prefix: str = None, save_intent: bool = None, intent_level: [<class 'int'>, <class 'str'>] = None, intent_order: int = None, replace_intent: bool = None, remove_duplicates: bool = None) → DataFrame

Generates multiple columns of noise in your dataset

Parameters:

size – The number of rows
num_columns – the number of columns of noise
canonical – (optional) a pa.Table to append the result table to
name_prefix – a name to prefix the column names
seed – (optional) a seed value for the random function: default to None
save_intent – (optional) if the intent contract should be saved to the property manager
intent_level – (optional) the column name that groups intent to create a column
intent_order – (optional) the order in which each intent should run.
    - If None: defaults to -1
    - If -1: added to a level above any current instance of the intent section, level 0 if not found
    - If int: added to the level specified, overwriting any that already exist
replace_intent – (optional) if the intent method exists at the level, or default level:
    - True: replaces the current intent method with the new
    - False: leaves it untouched, disregarding the new intent
remove_duplicates – (optional) removes any duplicate intent in any level that is identical

Returns:

a DataFrame

get_sample_list(sample_name: str, canonical: ~pyarrow.lib.Table = None, sample_size: int = None, shuffle: bool = None, size: int = None, quantity: float = None, to_header: str = None, seed: int = None, save_intent: bool = None, intent_level: [<class 'int'>, <class 'str'>] = None, intent_order: int = None, replace_intent: bool = None, remove_duplicates: bool = None) → Table

Returns a sample set based on sample_name. To see the available samples, call the property ‘sample_list’.

Parameters:

sample_name – The name of the Sample method to be used.
canonical – (optional) a pa.Table to append the result table to
sample_size – (optional) the size of the sample to take from the reference file
shuffle – (optional) if the selection should be shuffled before selection. Default is True
quantity – (optional) a number between 0 and 1 representing the percentage quantity of the data
size – (optional) size of the return. default to 1
to_header – (optional) an optional name to call the column
seed – (optional) a seed value for the random function: default to None
save_intent – (optional) if the intent contract should be saved to the property manager
intent_level – (optional) the column name that groups intent to create a column
intent_order – (optional) the order in which each intent should run.
    - If None: defaults to -1
    - If -1: added to a level above any current instance of the intent section, level 0 if not found
    - If int: added to the level specified, overwriting any that already exist
replace_intent – (optional) if the intent method exists at the level, or default level:
    - True: replaces the current intent method with the new
    - False: leaves it untouched, disregarding the new intent
remove_duplicates – (optional) removes any duplicate intent in any level that is identical

Returns:

a sample list

get_sample_map(sample_map: str, size: int, canonical: ~pyarrow.lib.Table = None, selection: list = None, mask_null: bool = None, headers: [<class 'str'>, <class 'list'>] = None, shuffle: bool = None, rename_columns: [<class 'dict'>, <class 'list'>] = None, seed: int = None, save_intent: bool = None, intent_level: [<class 'int'>, <class 'str'>] = None, intent_order: int = None, replace_intent: bool = None, remove_duplicates: bool = None, **kwargs) → Table

Returns a sample table based on sample_map. To see the available samples, call the property ‘sample_map’. The returned table can be filtered by row (selection) or by column (headers).

The selection is a list of triple tuples in the form [(comparison, operation, logic)], where comparison is the item or column to compare against, operation is the comparison to perform, and logic, used when chaining tuples, is how the next tuple's boolean flags are joined to the current ones. Examples:

[(comparison, operation, logic)]
[(1, 'greater', 'or'), (-1, 'less', None)]
[(pa.array(['INACTIVE', 'PENDING']), 'is_in', None)]

The operator and logic names are taken from pyarrow.compute and are:

operator => extract_regex, equal, greater, less, greater_equal, less_equal, not_equal, is_in, is_null
logic => and, or, xor, and_not

{header: [(comparison, operation, logic)]}
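To make the chaining of selection tuples concrete, the sketch below evaluates a selection list against a plain Python list and returns a boolean mask. It is a pure-Python stand-in and not the library's pyarrow.compute-based code: selection_mask, OPERATIONS and LOGIC are hypothetical names, extract_regex is left out, null values fail every test except is_null, and a missing logic between two tuples is assumed to mean "and".

```python
import operator
from typing import Any, Callable, Dict, List, Optional, Sequence, Tuple

Condition = Tuple[Any, str, Optional[str]]  # (comparison, operation, logic)

OPERATIONS: Dict[str, Callable[[Any, Any], bool]] = {
    "equal": operator.eq,
    "not_equal": operator.ne,
    "greater": operator.gt,
    "less": operator.lt,
    "greater_equal": operator.ge,
    "less_equal": operator.le,
    "is_in": lambda value, options: value in options,
    "is_null": lambda value, _unused: value is None,
}

LOGIC: Dict[str, Callable[[bool, bool], bool]] = {
    "and": lambda a, b: a and b,
    "or": lambda a, b: a or b,
    "xor": lambda a, b: a != b,
    "and_not": lambda a, b: a and not b,
}


def selection_mask(values: Sequence[Any],
                   selection: Sequence[Condition]) -> List[bool]:
    """Hypothetical helper: one boolean per value, True where the row is kept."""
    mask: Optional[List[bool]] = None
    pending: Optional[str] = None  # logic carried over from the previous tuple
    for comparison, operation, logic in selection:
        test = OPERATIONS[operation]
        flags = [
            test(v, comparison) if (v is not None or operation == "is_null") else False
            for v in values
        ]
        if mask is None:
            mask = flags
        else:
            join = LOGIC[pending or "and"]  # assumption: default join is "and"
            mask = [join(a, b) for a, b in zip(mask, flags)]
        pending = logic
    return mask if mask is not None else [True] * len(values)


print(selection_mask([-3, 0, 2, 5], [(1, "greater", "or"), (-1, "less", None)]))
# [True, False, True, True]
print(selection_mask(["ACTIVE", "PENDING", "INACTIVE"],
                     [(["INACTIVE", "PENDING"], "is_in", None)]))
# [False, True, True]
```

In the first call the two tuples read as "greater than 1, or less than -1", so only the value 0 is excluded.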

Parameters:

sample_map – the sample map name.
size – size of the return table.
canonical – (optional) a pa.Table to append the result table to
rename_columns – (optional) rename the returning sample columns with an exact list or replacement dict
selection – (optional) a list of (comparison, operation, logic) tuples used to filter rows, as described above
mask_null – (optional)
headers – a header or list of headers to filter on
shuffle – (optional) if the selection should be shuffled before selection. Default is True
seed – (optional) a seed value for the random function: default to None
save_intent – (optional) if the intent contract should be saved to the property manager
intent_level – (optional) the column name that groups intent to create a column
intent_order – (optional) the order in which each intent should run.
    - If None: defaults to -1
    - If -1: added to a level above any current instance of the intent section, level 0 if not found
    - If int: added to the level specified, overwriting any that already exist
replace_intent – (optional) if the intent method exists at the level, or default level:
    - True: replaces the current intent method with the new
    - False: leaves it untouched, disregarding the new intent
remove_duplicates – (optional) removes any duplicate intent in any level that is identical
kwargs – any additional parameters to pass to the sample map

Returns:

pa.Table

get_synthetic_data_types(size: int, extend: bool = None, prob_nulls: float = None, seed: int = None, category_encode: bool = None, save_intent: bool = None, intent_level: [<class 'int'>, <class 'str'>] = None, intent_order: int = None, replace_intent: bool = None, remove_duplicates: bool = None) → Table

A dataset with example data types

Parameters:

size – The size of the sample
extend – extend the synthetic dataset to include non-standard types
prob_nulls – (optional) a value between 0 and 1 giving the proportion of nulls. Default 0.02
category_encode – (optional) if the categorical should be encoded to DictionaryArray
seed – (optional) a seed value for the random function: default to None
save_intent – (optional) if the intent contract should be saved to the property manager
intent_level – (optional) the column name that groups intent to create a column
intent_order – (optional) the order in which each intent should run.
    - If None: defaults to -1
    - If -1: added to a level above any current instance of the intent section, level 0 if not found
    - If int: added to the level specified, overwriting any that already exist
replace_intent – (optional) if the intent method exists at the level, or default level:
    - True: replaces the current intent method with the new
    - False: leaves it untouched, disregarding the new intent
remove_duplicates – (optional) removes any duplicate intent in any level that is identical

Returns:

pyarrow Table

get_synthetic_persona_usa(size: int, canonical: ~pyarrow.lib.Table = None, seed: int = None, category_encode: bool = None, save_intent: bool = None, intent_level: [<class 'int'>, <class 'str'>] = None, intent_order: int = None, replace_intent: bool = None, remove_duplicates: bool = None) → Table

A synthetic dataset representing a persona in the United States of America

Parameters:

size – The size of the sample
canonical – (optional) a pa.Table to append the result table to
category_encode – (optional) if the categorical should be encoded to DictionaryArray
seed – (optional) a seed value for the random function: default to None
save_intent – (optional) if the intent contract should be saved to the property manager
intent_level – (optional) the column name that groups intent to create a column
intent_order – (optional) the order in which each intent should run.
    - If None: defaults to -1
    - If -1: added to a level above any current instance of the intent section, level 0 if not found
    - If int: added to the level specified, overwriting any that already exist
replace_intent – (optional) if the intent method exists at the level, or default level:
    - True: replaces the current intent method with the new
    - False: leaves it untouched, disregarding the new intent
remove_duplicates – (optional) removes any duplicate intent in any level that is identical

Returns:

pyarrow Table

static sample_inspect(method: str): A taste of a given sample method

property sample_list: list: A list of sample options

property sample_map: list: A list of sample options