FeatureEngineer - synthesis

class ds_capability.intent.feature_engineer_intent.FeatureEngineerIntent(property_manager: ~ds_capability.managers.feature_engineer_property_manager.FeatureEngineerPropertyManager, default_save_intent: bool = None, default_intent_level: [<class 'str'>, <class 'int'>, <class 'float'>] = None, order_next_available: bool = None, default_replace_intent: bool = None)

This class represents feature engineering intent actions that, depending on their application, represent the data’s statistical and distributive characteristics to provide targeted features of interest. Its focus is on building, correlating and modelling features in a way that is more conducive to the downstream feature requirements.

get_analysis(size: int, other: [<class 'str'>, <class 'pyarrow.lib.Table'>], canonical: [<class 'str'>, <class 'pyarrow.lib.Table'>] = None, category_limit: int = None, date_jitter: int = None, date_units: str = None, sort_by: [<class 'str'>, <class 'list'>] = None, offset: [<class 'int'>, <class 'float'>] = None, seed: int = None, save_intent: bool = None, intent_level: [<class 'int'>, <class 'str'>] = None, intent_order: int = None, replace_intent: bool = None, remove_duplicates: bool = None) → Table

Builds a set of synthetic data columns based on other. If common columns exist in the named canonical, those columns remain as they are in the canonical. This allows an already constructed association to be used as a reference for a sub-category.
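The rule that shared columns stay as they are in the canonical can be pictured with plain dictionaries of columns. This is only a sketch of the merge rule described above, not the library's code: the helper name merge_keep_canonical and the dict-of-lists table representation are assumptions made for illustration.

```python
def merge_keep_canonical(canonical: dict, synthetic: dict) -> dict:
    """Append synthetic columns to canonical, keeping canonical's version of shared columns."""
    merged = dict(canonical)  # canonical columns are copied first and never overwritten
    for name, values in synthetic.items():
        if name not in merged:  # only columns the canonical lacks are added
            merged[name] = values
    return merged


canonical = {"gender": ["F", "M", "F"]}
synthetic = {"gender": ["M", "M", "M"], "age": [31, 45, 27]}
result = merge_keep_canonical(canonical, synthetic)
```

Here result keeps the canonical gender column and gains the new age column.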

Parameters:

size – The number of rows
other – a direct or generated pa.Table.
canonical – (optional) a pa.Table to append the result table to
category_limit – (optional) a global cap on categories captured. default to 20
sort_by – (optional) Name of the column to use to sort (ascending), or a list of multiple sorting conditions where each entry is a tuple with column name and sorting order (“ascending” or “descending”)
date_jitter – (optional) The size of the jitter. Default to 2
date_units – (optional) The date units. Options [‘W’, ‘D’, ‘h’, ‘m’, ‘s’, ‘milli’, ‘micro’]. Default ‘D’
offset – (optional) an offset value of a numeric column
seed – (optional) a seed value for the random function: default to None
save_intent – (optional) if the intent contract should be saved to the property manager
intent_level – (optional) the column name that groups intent to create a column
intent_order – (optional) the order in which each intent should run. If None, defaults to -1; if -1, added to a level above any current instance of the intent section, or level 0 if not found; if an int, added to the level specified, overwriting any that already exist
replace_intent – (optional) what to do if the intent method already exists at the level, or default level. True replaces the current intent method with the new one; False leaves it untouched, disregarding the new intent
remove_duplicates – (optional) removes any duplicate intent in any level that is identical

Returns:

a pa.Table
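The intent_order and replace_intent rules, which recur on every method of this class, can be read as a small placement routine. The sketch below is one interpretation of the parameter descriptions, not the property manager's actual code: the function place_intent, the {order: {method: params}} layout, and the choice to let replace_intent=False win over an explicit order are all assumptions.

```python
def place_intent(section: dict, method: str, params: dict,
                 intent_order: int = None, replace_intent: bool = True) -> int:
    """Record an intent in `section` ({order: {method: params}}) and return the order used."""
    order = -1 if intent_order is None else intent_order  # None defaults to -1
    if order == -1:
        # one level above any current instance of this method, or level 0 if not found
        existing = [o for o, methods in section.items() if method in methods]
        order = max(existing) + 1 if existing else 0
    methods = section.setdefault(order, {})
    if method in methods and not replace_intent:
        return order  # leave the stored intent untouched, disregarding the new one
    methods[method] = params  # add, or overwrite what already exists at this order
    return order


section = {}
first = place_intent(section, "get_number", {"start": 0, "stop": 10})
second = place_intent(section, "get_number", {"start": 5, "stop": 50})
place_intent(section, "get_number", {"start": 1, "stop": 2}, intent_order=0, replace_intent=False)
place_intent(section, "get_number", {"start": 7, "stop": 9}, intent_order=1, replace_intent=True)
```

The first two calls land on orders 0 and 1; the third leaves order 0 untouched, and the fourth overwrites order 1.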

get_analysis_group(size: int, other: [<class 'str'>, <class 'pyarrow.lib.Table'>], group_by: [<class 'str'>, <class 'list'>], sort_by: [<class 'str'>, <class 'list'>] = None, canonical: [<class 'str'>, <class 'pyarrow.lib.Table'>] = None, category_limit: int = None, date_jitter: int = None, date_units: str = None, offset: [<class 'int'>, <class 'float'>] = None, seed: int = None, save_intent: bool = None, intent_level: [<class 'int'>, <class 'str'>] = None, intent_order: int = None, replace_intent: bool = None, remove_duplicates: bool = None) → Table

Builds a set of synthetic data columns based on other, analysed separately for each group in group_by. If common columns exist in the named canonical, those columns remain as they are in the canonical. This allows an already constructed association to be used as a reference for a sub-category.

Parameters:

size – The number of rows
other – a direct or generated pa.Table.
group_by – Name of the column to use to group by
sort_by – (optional) Name of the column to use to sort (ascending), or a list of multiple sorting conditions where each entry is a tuple with column name and sorting order (“ascending” or “descending”)
canonical – (optional) a pa.Table to append the result table to
category_limit – (optional) a global cap on categories captured. zero value returns no limits
date_jitter – (optional) The size of the jitter. Default to 2
date_units – (optional) The date units. Options [‘W’, ‘D’, ‘h’, ‘m’, ‘s’, ‘milli’, ‘micro’]. Default ‘D’
offset – (optional) an offset value of a numeric column
seed – (optional) a seed value for the random function: default to None
save_intent – (optional) if the intent contract should be saved to the property manager
intent_level – (optional) the column name that groups intent to create a column
intent_order – (optional) the order in which each intent should run. If None, defaults to -1; if -1, added to a level above any current instance of the intent section, or level 0 if not found; if an int, added to the level specified, overwriting any that already exist
replace_intent – (optional) what to do if the intent method already exists at the level, or default level. True replaces the current intent method with the new one; False leaves it untouched, disregarding the new intent
remove_duplicates – (optional) removes any duplicate intent in any level that is identical

Returns:

pa.Table
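As a rough picture of what group-wise synthesis might involve, the sketch below samples groups in proportion to how often they were observed, then samples a value from that group's most frequent categories. It is a simplified stand-in written with the standard library, not the method's real analysis: the function synthesize_by_group, the list-of-dicts input and the single-column scope are assumptions.

```python
import random
from collections import Counter, defaultdict


def synthesize_by_group(rows, group_by, column, size, category_limit=20, seed=None):
    """Sample `size` rows that follow the per-group category frequencies seen in `rows`."""
    rng = random.Random(seed)
    group_counts = Counter(row[group_by] for row in rows)
    per_group = defaultdict(Counter)
    for row in rows:
        per_group[row[group_by]][row[column]] += 1

    groups = list(group_counts)
    result = []
    for group in rng.choices(groups, weights=[group_counts[g] for g in groups], k=size):
        # keep only the most frequent categories; a zero cap is treated as no limit
        top = per_group[group].most_common(category_limit or None)
        values, weights = zip(*top)
        result.append({group_by: group, column: rng.choices(values, weights=weights, k=1)[0]})
    return result


observed = [
    {"region": "north", "product": "tea"},
    {"region": "north", "product": "tea"},
    {"region": "north", "product": "coffee"},
    {"region": "south", "product": "juice"},
]
sample = synthesize_by_group(observed, "region", "product", size=50, seed=42)
```

Because values are drawn per group, a synthetic "south" row can only ever carry "juice", which is the association the grouping is meant to preserve.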

get_boolean(size: int, canonical: ~pyarrow.lib.Table = None, probability: float = None, quantity: float = None, to_header: str = None, seed: int = None, save_intent: bool = None, intent_level: [<class 'int'>, <class 'str'>] = None, intent_order: int = None, replace_intent: bool = None, remove_duplicates: bool = None) → Table

A boolean discrete random distribution

Parameters:

size – the size of the sample
canonical – (optional) a pa.Table to append the result table to
probability – a float between 0 and 1 of the probability of success. Default = 0.5
quantity – a number between 0 and 1 representing data that isn’t null
to_header – (optional) an optional name to call the column
seed – a seed value for the random function: default to None
save_intent – (optional) if the intent contract should be saved to the property manager
intent_level – (optional) the column name that groups intent to create a column
intent_order – (optional) the order in which each intent should run. If None, defaults to -1; if -1, added to a level above any current instance of the intent section, or level 0 if not found; if an int, added to the level specified, overwriting any that already exist
replace_intent – (optional) what to do if the intent method already exists at the level, or default level. True replaces the current intent method with the new one; False leaves it untouched, disregarding the new intent
remove_duplicates – (optional) removes any duplicate intent in any level that is identical

Returns:

a pa.Table of random boolean values
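The interplay of probability and quantity can be illustrated without the library. In this sketch, sketch_boolean is a hypothetical helper, and the choice to null out an exact share of values (rather than each value independently) is an assumption; the real method returns a pa.Table rather than a list.

```python
import random


def sketch_boolean(size, probability=0.5, quantity=1.0, seed=None):
    """Return `size` values that are True with `probability`, with a share set to None."""
    rng = random.Random(seed)
    values = [rng.random() < probability for _ in range(size)]
    # `quantity` is the fraction of values that stay non-null
    n_null = size - round(size * quantity)
    for idx in rng.sample(range(size), n_null):
        values[idx] = None
    return values


flags = sketch_boolean(size=100, probability=0.8, quantity=0.9, seed=7)
```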

get_category(selection: list, size: int, canonical: ~pyarrow.lib.Table = None, relative_freq: list = None, to_categorical: bool = None, quantity: float = None, to_header: str = None, seed: int = None, save_intent: bool = None, intent_level: [<class 'int'>, <class 'str'>] = None, intent_order: int = None, replace_intent: bool = None, remove_duplicates: bool = None) → Table

Returns a categorical chosen from the selection list, as a string type by default.

Parameters:

selection – a list of items to select from
size – size of the return
canonical – (optional) a pa.Table to append the result table to
relative_freq – a weighting pattern that does not have to add to 1
to_categorical – if the categorical should be returned encoded as a dictionary type or string type (default)
quantity – a number between 0 and 1 representing the percentage quantity of the data
to_header – (optional) an optional name to call the column
seed – a seed value for the random function: default to None
save_intent – (optional) if the intent contract should be saved to the property manager
intent_level – (optional) the column name that groups intent to create a column
intent_order – (optional) the order in which each intent should run. If None, defaults to -1; if -1, added to a level above any current instance of the intent section, or level 0 if not found; if an int, added to the level specified, overwriting any that already exist
replace_intent – (optional) what to do if the intent method already exists at the level, or default level. True replaces the current intent method with the new one; False leaves it untouched, disregarding the new intent
remove_duplicates – (optional) removes any duplicate intent in any level that is identical

Returns:

an item or list of items chosen from the list
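A weighting pattern that "does not have to add to 1" behaves like the weights of the standard library's random.choices, which normalises them internally. The sketch below uses that call to show the idea; sketch_category is a hypothetical name, and it assumes one weight per selection item, which the library itself may not require.

```python
import random


def sketch_category(selection, size, relative_freq=None, seed=None):
    """Pick `size` items from `selection`, weighted by `relative_freq` if given."""
    rng = random.Random(seed)
    # random.choices normalises the weights itself, so they need not sum to 1
    return rng.choices(selection, weights=relative_freq, k=size)


picks = sketch_category(["A", "B", "C"], size=1000, relative_freq=[6, 3, 1], seed=1)
```

With weights of 6, 3 and 1, roughly 60% of picks should be "A" and roughly 10% "C".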

get_datetime(start: ~typing.Any, until: ~typing.Any, canonical: ~pyarrow.lib.Table = None, relative_freq: list = None, at_most: int = None, ordered: str = None, date_format: str = None, timezone: str = None, time_unit: str = None, as_num: bool = None, ignore_time: bool = None, ignore_seconds: bool = None, size: int = None, quantity: float = None, to_header: str = None, seed: int = None, day_first: bool = None, year_first: bool = None, save_intent: bool = None, intent_level: [<class 'int'>, <class 'str'>] = None, intent_order: int = None, replace_intent: bool = None, remove_duplicates: bool = None) → Table

Returns a random date between two dates and/or times. Weighted patterns can be applied to the overall date range. If a signed ‘int’ type is passed as the start and/or until date, the inferred date is the current date time offset by that integer in ‘days’.

Note: If no patterns are set, this returns a uniformly random value between the range boundaries.

Parameters:

timezone – (optional)
time_unit – (optional) the time units for the timezone. Options ‘s’ ‘ms’ ‘us’ or ‘ns’
start – the start boundary of the date range; can be str, datetime, pd.datetime, pd.Timestamp or int
until – the up-until boundary of the date range; can be str, datetime, pd.datetime, pd.Timestamp or int
canonical – (optional) a pa.Table to append the result table to
quantity – (optional) the quantity of values that are not null. Number between 0 and 1
relative_freq – (optional) A pattern across the whole date range.
at_most – (optional) the most times a selection should be chosen
ordered – (optional) order the data ascending or descending; accepted values are ‘asc’ or ‘des’
ignore_time – ignore time elements and only select from Year, Month, Day elements. Default is False
ignore_seconds – ignore second elements and only select from Year to minute elements. Default is False
date_format – the string format of the date to be returned. if not set then pd.Timestamp returned
as_num – returns a list of Matplotlib date values as a float. Default is False
size – the size of the sample to return. Default to 1
to_header – (optional) an optional name to call the column
seed – a seed value for the random function: default to None
year_first – specifies whether to parse with the year first. If True, parses dates with the year first, e.g. 10/11/12 is parsed as 2010-11-12. If both day_first and year_first are True, year_first takes precedence (same as dateutil).
day_first – specifies whether to parse with the day first. If True, parses dates with the day first, e.g. %d-%m-%Y. If False, defaults to the preferred order, normally %m-%d-%Y (but not strict)
save_intent – (optional) if the intent contract should be saved to the property manager
intent_level – (optional) the column name that groups intent to create a column
intent_order – (optional) the order in which each intent should run. If None, defaults to -1; if -1, added to a level above any current instance of the intent section, or level 0 if not found; if an int, added to the level specified, overwriting any that already exist
replace_intent – (optional) what to do if the intent method already exists at the level, or default level. True replaces the current intent method with the new one; False leaves it untouched, disregarding the new intent
remove_duplicates – (optional) removes any duplicate intent in any level that is identical

Returns:

a single date, or a set of dates of the given size, in the format given.
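The handling of integer boundaries as day offsets from the current date time can be sketched with the standard datetime module. The helpers resolve_boundary and sketch_datetime are hypothetical, the now argument exists only to make the example repeatable, and strings are assumed to be in ISO format; the library's own parsing is broader than this.

```python
import random
from datetime import datetime, timedelta


def resolve_boundary(value, now):
    """Turn an int (offset in days from `now`), ISO string or datetime into a datetime."""
    if isinstance(value, int):
        return now + timedelta(days=value)
    if isinstance(value, str):
        return datetime.fromisoformat(value)
    return value


def sketch_datetime(start, until, size=1, ordered=None, ignore_time=False, seed=None, now=None):
    """Draw `size` uniformly random datetimes between the two resolved boundaries."""
    rng = random.Random(seed)
    now = now or datetime.now()
    lower, upper = resolve_boundary(start, now), resolve_boundary(until, now)
    span = (upper - lower).total_seconds()
    dates = [lower + timedelta(seconds=rng.uniform(0, span)) for _ in range(size)]
    if ignore_time:
        # keep only the year, month and day elements
        dates = [d.replace(hour=0, minute=0, second=0, microsecond=0) for d in dates]
    if ordered in ("asc", "des"):
        dates.sort(reverse=(ordered == "des"))
    return dates


fixed_now = datetime(2024, 1, 31, 12, 0, 0)
dates = sketch_datetime(-30, -1, size=5, ordered="asc", seed=3, now=fixed_now)
```

With start=-30 and until=-1, the five dates fall between 30 days and 1 day before fixed_now, in ascending order.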

get_intervals(intervals: list, canonical: ~pyarrow.lib.Table = None, relative_freq: list = None, precision: int = None, size: int = None, quantity: float = None, to_header: str = None, seed: int = None, save_intent: bool = None, intent_level: [<class 'int'>, <class 'str'>] = None, intent_order: int = None, replace_intent: bool = None, remove_duplicates: bool = None) → Table

Returns a number drawn from an interval selected from a list of tuple(lower, upper) intervals.

Parameters:

intervals – a list of unique tuple pairs representing the interval lower and upper boundaries
canonical – (optional) a pa.Table to append the result table to
relative_freq – a weighting pattern or probability that does not have to add to 1
precision – the precision of the returned number. if None then assumes int value else float
size – the size of the sample
quantity – a number between 0 and 1 representing data that isn’t null
to_header – (optional) an optional name to call the column
seed – a seed value for the random function: default to None
save_intent – (optional) if the intent contract should be saved to the property manager
intent_level – (optional) the column name that groups intent to create a column
intent_order – (optional) the order in which each intent should run. If None, defaults to -1; if -1, added to a level above any current instance of the intent section, or level 0 if not found; if an int, added to the level specified, overwriting any that already exist
replace_intent – (optional) what to do if the intent method already exists at the level, or default level. True replaces the current intent method with the new one; False leaves it untouched, disregarding the new intent
remove_duplicates – (optional) removes any duplicate intent in any level that is identical

Returns:

a random number
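One way to read the interval selection is as a two-step draw: pick an interval using the weighting pattern, then pick a number inside it. The sketch below follows that reading with the standard library; sketch_intervals is a hypothetical helper, and how the library treats the open or closed ends of each interval is not covered here.

```python
import math
import random


def sketch_intervals(intervals, relative_freq=None, precision=None, size=1, seed=None):
    """Pick a weighted interval, then a number inside it, `size` times."""
    rng = random.Random(seed)
    result = []
    for lower, upper in rng.choices(intervals, weights=relative_freq, k=size):
        value = rng.uniform(lower, upper)
        # no precision means an int is returned, otherwise a rounded float
        result.append(math.floor(value) if precision is None else round(value, precision))
    return result


ages = sketch_intervals([(18, 30), (30, 65)], relative_freq=[1, 3], size=200, seed=5)
prices = sketch_intervals([(18, 30), (30, 65)], precision=2, size=200, seed=5)
```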

get_number(start: [<class 'int'>, <class 'float'>, <class 'str'>] = None, stop: [<class 'int'>, <class 'float'>, <class 'str'>] = None, canonical: ~pyarrow.lib.Table = None, relative_freq: list = None, precision: int = None, ordered: str = None, at_most: int = None, size: int = None, quantity: float = None, to_header: str = None, seed: int = None, save_intent: bool = None, intent_order: int = None, intent_level: [<class 'int'>, <class 'str'>] = None, replace_intent: bool = None, remove_duplicates: bool = None) → Table

Returns a number in the range start to stop. If only stop is given, start is zero.

Parameters:

start – (optional) (signed) integer or float to start from. See below for str
stop – (signed) integer or float the number sequence goes up to but does not include. See below for str
canonical – (optional) a pa.Table to append the result table to
relative_freq – (optional) a weighting pattern or probability that does not have to add to 1
precision – (optional) the precision of the returned number. if None then assumes int value else float
ordered – (optional) order the data ascending or descending; accepted values are ‘asc’ or ‘des’
at_most – (optional) the most times a selection should be chosen
to_header – (optional) an optional name to call the column
size – (optional) the size of the sample
quantity – (optional) a number between 0 and 1 representing data that isn’t null
seed – (optional) a seed value for the random function: default to None
save_intent – (optional) if the intent contract should be saved to the property manager
intent_level – (optional) the column name that groups intent to create a column
intent_order – (optional) the order in which each intent should run. If None, defaults to -1; if -1, added to a level above any current instance of the intent section, or level 0 if not found; if an int, added to the level specified, overwriting any that already exist
replace_intent – (optional) what to do if the intent method already exists at the level, or default level. True replaces the current intent method with the new one; False leaves it untouched, disregarding the new intent
remove_duplicates – (optional) removes any duplicate intent in any level that is identical

Returns:

a random number

The start and stop values can each be given as a str naming an environment variable, in the format ‘${NAME}’ where NAME is the environment variable name.
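Resolving a boundary given in the ‘${NAME}’ format might look like the following. This is an illustrative sketch only: resolve_value and the variable name SKETCH_STOP are hypothetical, and the library's own lookup and type conversion may differ.

```python
import os
import re

_ENV_PATTERN = re.compile(r"^\$\{(\w+)\}$")


def resolve_value(value):
    """Return a number, reading it from the environment when given as '${NAME}'."""
    if isinstance(value, str):
        match = _ENV_PATTERN.match(value)
        if match:
            value = os.environ[match.group(1)]  # raises KeyError if NAME is not set
        number = float(value)
        return int(number) if number.is_integer() else number
    return value


os.environ["SKETCH_STOP"] = "100"
start, stop = resolve_value(0), resolve_value("${SKETCH_STOP}")
```

After this runs, stop holds the integer 100 read from the environment, while the literal start passes through unchanged.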

get_string_pattern(pattern: str, canonical: ~pyarrow.lib.Table = None, choices: dict = None, as_binary: bool = None, quantity: [<class 'float'>, <class 'int'>] = None, size: int = None, choice_only: bool = None, to_header: str = None, seed: int = None, save_intent: bool = None, intent_level: [<class 'int'>, <class 'str'>] = None, intent_order: int = None, replace_intent: bool = None, remove_duplicates: bool = None) → Table

Returns a random string based on the pattern given. The pattern is made up from the choices passed but by default is as follows:

c = random char [a-z][A-Z]

d = digit [0-9]

l = lower case char [a-z]

U = upper case char [A-Z]

p = all punctuation

s = space

You can also use punctuation in the pattern, which will be retained. A pattern example might be

UUddsdUU => BA12 2NE or dl-{UU} => 4g-{FY}

To create your own choices, pass a dictionary with a reference char as the key and a list of choices as the value.

Parameters:

pattern – the pattern to create the string from
canonical – (optional) a pa.Table to append the result table to
choices – (optional) an optional dictionary of list of choices to replace the default.
as_binary – (optional) if the return string is prefixed with a b
quantity – (optional) a number between 0 and 1 representing the percentage quantity of the data
size – (optional) the size of the return list. If None, returns a single value
choice_only – (optional) whether to use only the choices given, or to take characters not found in the choices as is
to_header – (optional) an optional name to call the column
seed – (optional) a seed value for the random function: default to None
save_intent – (optional) if the intent contract should be saved to the property manager
intent_level – (optional) the column name that groups intent to create a column
intent_order – (optional) the order in which each intent should run. If None, defaults to -1; if -1, added to a level above any current instance of the intent section, or level 0 if not found; if an int, added to the level specified, overwriting any that already exist
replace_intent – (optional) what to do if the intent method already exists at the level, or default level. True replaces the current intent method with the new one; False leaves it untouched, disregarding the new intent
remove_duplicates – (optional) removes any duplicate intent in any level that is identical

Returns:

a string based on the pattern
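The pattern expansion can be sketched as a character-by-character lookup into the choices dictionary. The code below mirrors the default keys listed above (c, d, l, U, p, s) but is not the library's implementation: sketch_string_pattern is a hypothetical helper, and dropping unmatched characters when choice_only is True is an assumption about that flag.

```python
import random
import string

DEFAULT_CHOICES = {
    "c": list(string.ascii_letters),
    "d": list(string.digits),
    "l": list(string.ascii_lowercase),
    "U": list(string.ascii_uppercase),
    "p": list(string.punctuation),
    "s": [" "],
}


def sketch_string_pattern(pattern, choices=None, choice_only=False, seed=None):
    """Replace each reference char in `pattern` with a random pick from its choice list."""
    rng = random.Random(seed)
    choices = DEFAULT_CHOICES if choices is None else choices
    out = []
    for char in pattern:
        if char in choices:
            out.append(rng.choice(choices[char]))
        elif not choice_only:
            out.append(char)  # characters without a choice list are kept as is
    return "".join(out)


postcode = sketch_string_pattern("UUddsdUU", seed=11)
tagged = sketch_string_pattern("dl-{UU}", seed=11)
colour = sketch_string_pattern("x-dd", choices={"x": ["red", "blue"], "d": list(string.digits)}, seed=11)
```

The last call passes a custom dictionary, so "x" expands to a whole word while the hyphen is retained.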