FeatureEngineer - synthesis

class ds_capability.intent.feature_engineer_intent.FeatureEngineerIntent(property_manager: ~ds_capability.managers.feature_engineer_property_manager.FeatureEngineerPropertyManager, default_save_intent: bool = None, default_intent_level: [<class 'str'>, <class 'int'>, <class 'float'>] = None, order_next_available: bool = None, default_replace_intent: bool = None)

This class represents feature engineering intent actions that, depending on their application, capture the statistical and distributive characteristics of data to provide targeted features of interest. Its focus is on building, correlating and modelling features in a way that better suits downstream feature requirements.

get_analysis(size: int, other: [<class 'str'>, <class 'pyarrow.lib.Table'>], canonical: [<class 'str'>, <class 'pyarrow.lib.Table'>] = None, category_limit: int = None, date_jitter: int = None, date_units: str = None, sort_by: [<class 'str'>, <class 'list'>] = None, offset: [<class 'int'>, <class 'float'>] = None, seed: int = None, save_intent: bool = None, intent_level: [<class 'int'>, <class 'str'>] = None, intent_order: int = None, replace_intent: bool = None, remove_duplicates: bool = None) Table

Builds a set of synthetic data columns based on other. If common columns exist in the named canonical, those columns remain as they are in the canonical. This allows an already constructed association to be used as the reference for a subcategory.

Parameters:
  • size – The number of rows

  • other – a direct or generated pa.Table.

  • canonical – (optional) a pa.Table to append the result table to

  • category_limit – (optional) a global cap on categories captured. Defaults to 20

  • sort_by – (optional) Name of the column to use to sort (ascending), or a list of multiple sorting conditions where each entry is a tuple with column name and sorting order (“ascending” or “descending”)

  • date_jitter – (optional) The size of the jitter. Default to 2

  • date_units – (optional) The date units. Options [‘W’, ‘D’, ‘h’, ‘m’, ‘s’, ‘milli’, ‘micro’]. Default ‘D’

  • offset – (optional) an offset value of a numeric column

  • seed – (optional) a seed value for the random function: default to None

  • save_intent – (optional) if the intent contract should be saved to the property manager

  • intent_level – (optional) the column name that groups intent to create a column

  • intent_order – (optional) the order in which each intent should run.
      - If None: defaults to -1
      - If -1: added to a level above any current instance of the intent section, level 0 if not found
      - If int: added to the level specified, overwriting any that already exist

  • replace_intent – (optional) if the intent method exists at the level, or default level:
      - True: replaces the current intent method with the new one
      - False: leaves it untouched, disregarding the new intent

  • remove_duplicates – (optional) removes any duplicate intent in any level that is identical

Returns:

a pa.Table

get_analysis_group(size: int, other: [<class 'str'>, <class 'pyarrow.lib.Table'>], group_by: [<class 'str'>, <class 'list'>], sort_by: [<class 'str'>, <class 'list'>] = None, canonical: [<class 'str'>, <class 'pyarrow.lib.Table'>] = None, category_limit: int = None, date_jitter: int = None, date_units: str = None, offset: [<class 'int'>, <class 'float'>] = None, seed: int = None, save_intent: bool = None, intent_level: [<class 'int'>, <class 'str'>] = None, intent_order: int = None, replace_intent: bool = None, remove_duplicates: bool = None) Table

Builds a set of synthetic data columns based on other, with the analysis separated by group_by. If common columns exist in the named canonical, those columns remain as they are in the canonical. This allows an already constructed association to be used as the reference for a subcategory.

Parameters:
  • size – The number of rows

  • other – a direct or generated pa.Table.

  • group_by – Name of the column to use to group by

  • sort_by – (optional) Name of the column to use to sort (ascending), or a list of multiple sorting conditions where each entry is a tuple with column name and sorting order (“ascending” or “descending”)

  • canonical – (optional) a pa.Table to append the result table to

  • category_limit – (optional) a global cap on categories captured. zero value returns no limits

  • date_jitter – (optional) The size of the jitter. Default to 2

  • date_units – (optional) The date units. Options [‘W’, ‘D’, ‘h’, ‘m’, ‘s’, ‘milli’, ‘micro’]. Default ‘D’

  • offset – (optional) an offset value of a numeric column

  • seed – (optional) a seed value for the random function: default to None

  • save_intent – (optional) if the intent contract should be saved to the property manager

  • intent_level – (optional) the column name that groups intent to create a column

  • intent_order – (optional) the order in which each intent should run.
      - If None: defaults to -1
      - If -1: added to a level above any current instance of the intent section, level 0 if not found
      - If int: added to the level specified, overwriting any that already exist

  • replace_intent – (optional) if the intent method exists at the level, or default level:
      - True: replaces the current intent method with the new one
      - False: leaves it untouched, disregarding the new intent

  • remove_duplicates – (optional) removes any duplicate intent in any level that is identical

Returns:

pa.Table

get_boolean(size: int, canonical: ~pyarrow.lib.Table = None, probability: float = None, quantity: float = None, to_header: str = None, seed: int = None, save_intent: bool = None, intent_level: [<class 'int'>, <class 'str'>] = None, intent_order: int = None, replace_intent: bool = None, remove_duplicates: bool = None) Table

A boolean discrete random distribution

Parameters:
  • size – the size of the sample

  • canonical – (optional) a pa.Table to append the result table to

  • probability – a float between 0 and 1 of the probability of success. Default = 0.5

  • quantity – a number between 0 and 1 representing data that isn’t null

  • to_header – (optional) an optional name to call the column

  • seed – a seed value for the random function: default to None

  • save_intent – (optional) if the intent contract should be saved to the property manager

  • intent_level – (optional) the column name that groups intent to create a column

  • intent_order – (optional) the order in which each intent should run.
      - If None: defaults to -1
      - If -1: added to a level above any current instance of the intent section, level 0 if not found
      - If int: added to the level specified, overwriting any that already exist

  • replace_intent – (optional) if the intent method exists at the level, or default level:
      - True: replaces the current intent method with the new one
      - False: leaves it untouched, disregarding the new intent

  • remove_duplicates – (optional) removes any duplicate intent in any level that is identical

Returns:

a pa.Table of random boolean values
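The interaction of probability and quantity can be pictured with a small standard-library sketch. It is an illustration under stated assumptions, not the library's implementation: sketch_boolean is a hypothetical helper, it returns a plain list rather than a pa.Table, and nulls are modelled as None.

```python
import random


def sketch_boolean(size, probability=0.5, quantity=1.0, seed=None):
    """Hypothetical helper: Bernoulli draws with a share of nulls."""
    rng = random.Random(seed)
    # each value is True with the given probability of success
    values = [rng.random() < probability for _ in range(size)]
    # keep roughly `quantity` of the values and null out the rest
    null_count = size - round(size * quantity)
    for idx in rng.sample(range(size), null_count):
        values[idx] = None
    return values


sample = sketch_boolean(size=10, probability=0.8, quantity=0.9, seed=31)
print(sample)
```

Passing the same seed reproduces the same sample, which is the role the seed parameter plays in the signature above.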

get_category(selection: list, size: int, canonical: ~pyarrow.lib.Table = None, relative_freq: list = None, to_categorical: bool = None, quantity: float = None, to_header: str = None, seed: int = None, save_intent: bool = None, intent_level: [<class 'int'>, <class 'str'>] = None, intent_order: int = None, replace_intent: bool = None, remove_duplicates: bool = None) Table

Returns categorical values chosen from selection, as strings by default.

Parameters:
  • selection – a list of items to select from

  • size – size of the return

  • canonical – (optional) a pa.Table to append the result table to

  • relative_freq – a weighting pattern that does not have to add to 1

  • to_categorical – if the categorical should be returned encoded as a dictionary type or string type (default)

  • quantity – a number between 0 and 1 representing the percentage quantity of the data

  • to_header – (optional) an optional name to call the column

  • seed – a seed value for the random function: default to None

  • save_intent – (optional) if the intent contract should be saved to the property manager

  • intent_level – (optional) the column name that groups intent to create a column

  • intent_order – (optional) the order in which each intent should run.
      - If None: defaults to -1
      - If -1: added to a level above any current instance of the intent section, level 0 if not found
      - If int: added to the level specified, overwriting any that already exist

  • replace_intent – (optional) if the intent method exists at the level, or default level:
      - True: replaces the current intent method with the new one
      - False: leaves it untouched, disregarding the new intent

  • remove_duplicates – (optional) removes any duplicate intent in any level that is identical

Returns:

a pa.Table of items chosen from the selection list
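A weighting pattern that "does not have to add to 1" behaves like the weights argument of Python's random.choices, which normalises whatever it is given. The sketch below shows that idea only; sketch_category is a hypothetical helper and returns a list, not a pa.Table.

```python
import random


def sketch_category(selection, size, relative_freq=None, seed=None):
    """Hypothetical helper: weighted selection from a list of items."""
    rng = random.Random(seed)
    # random.choices normalises the weights, so they need not add to 1;
    # with no weights every item is equally likely
    return rng.choices(selection, weights=relative_freq, k=size)


colours = sketch_category(["red", "green", "blue"], size=1000,
                          relative_freq=[6, 3, 1], seed=7)
print({c: colours.count(c) for c in ["red", "green", "blue"]})
```

With weights of 6, 3 and 1, "red" should appear roughly six times as often as "blue".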

get_datetime(start: ~typing.Any, until: ~typing.Any, canonical: ~pyarrow.lib.Table = None, relative_freq: list = None, at_most: int = None, ordered: str = None, date_format: str = None, timezone: str = None, time_unit: str = None, as_num: bool = None, ignore_time: bool = None, ignore_seconds: bool = None, size: int = None, quantity: float = None, to_header: str = None, seed: int = None, day_first: bool = None, year_first: bool = None, save_intent: bool = None, intent_level: [<class 'int'>, <class 'str'>] = None, intent_order: int = None, replace_intent: bool = None, remove_duplicates: bool = None) Table

Returns random dates between two dates and/or times. Weighted patterns can be applied to the overall date range. If a signed int is passed as the start and/or until date, the date is inferred as the current date and time offset by that number of days.

Note: If no patterns are set, values are drawn uniformly at random between the range boundaries.

Parameters:
  • timezone – (optional)

  • time_unit – (optional) the time units for the timezone. Options ‘s’ ‘ms’ ‘us’ or ‘ns’

  • start – the start boundary of the date range. Can be str, datetime, pd.datetime, pd.Timestamp or int

  • until – the up-until boundary of the date range. Can be str, datetime, pd.datetime, pd.Timestamp or int

  • canonical – (optional) a pa.Table to append the result table to

  • quantity – (optional) the quantity of values that are not null. Number between 0 and 1

  • relative_freq – (optional) A pattern across the whole date range.

  • at_most – (optional) the most times a selection should be chosen

  • ordered – (optional) order the data ascending or descending. Accepted values are ‘asc’ or ‘des’

  • ignore_time – ignore time elements and only select from Year, Month, Day elements. Default is False

  • ignore_seconds – ignore second elements and only select from Year to minute elements. Default is False

  • date_format – the string format of the date to be returned. if not set then pd.Timestamp returned

  • as_num – returns a list of Matplotlib date values as a float. Default is False

  • size – the size of the sample to return. Default to 1

  • to_header – (optional) an optional name to call the column

  • seed – a seed value for the random function: default to None

  • year_first – specifies whether to parse with the year first. If True, parses dates with the year first, e.g. 10/11/12 is parsed as 2010-11-12. If both day_first and year_first are True, year_first takes precedence (same as dateutil).

  • day_first – specifies whether to parse with the day first. If True, parses dates with the day first, e.g. %d-%m-%Y. If False, defaults to a preferred format, normally %m-%d-%Y (but not strict).

  • save_intent – (optional) if the intent contract should be saved to the property manager

  • intent_level – (optional) the column name that groups intent to create a column

  • intent_order – (optional) the order in which each intent should run.
      - If None: defaults to -1
      - If -1: added to a level above any current instance of the intent section, level 0 if not found
      - If int: added to the level specified, overwriting any that already exist

  • replace_intent – (optional) if the intent method exists at the level, or default level:
      - True: replaces the current intent method with the new one
      - False: leaves it untouched, disregarding the new intent

  • remove_duplicates – (optional) removes any duplicate intent in any level that is identical

Returns:

a date or size of dates in the format given.
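The integer form of start and until, read as an offset in days from the current date and time, can be sketched with the standard library as follows. The helpers resolve_boundary and sketch_datetime are hypothetical, the now argument exists only to make the example repeatable, and weighting patterns, formats and time zones are left out.

```python
import random
from datetime import datetime, timedelta


def resolve_boundary(value, now):
    """An int is read as an offset in days from `now`; a datetime is used as given."""
    if isinstance(value, int):
        return now + timedelta(days=value)
    return value


def sketch_datetime(start, until, size=1, seed=None, now=None):
    """Hypothetical helper: unweighted random datetimes between two boundaries."""
    rng = random.Random(seed)
    now = now or datetime.now()
    lower = resolve_boundary(start, now)
    upper = resolve_boundary(until, now)
    span = (upper - lower).total_seconds()
    # with no weighting pattern, every instant in the range is equally likely
    return [lower + timedelta(seconds=rng.uniform(0, span)) for _ in range(size)]


fixed_now = datetime(2024, 1, 15, 12, 0, 0)
dates = sketch_datetime(start=-30, until=-1, size=5, seed=11, now=fixed_now)
for d in dates:
    print(d.isoformat())
```

Here start=-30 and until=-1 select dates from thirty days before the reference time up to one day before it.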

get_intervals(intervals: list, canonical: ~pyarrow.lib.Table = None, relative_freq: list = None, precision: int = None, size: int = None, quantity: float = None, to_header: str = None, seed: int = None, save_intent: bool = None, intent_level: [<class 'int'>, <class 'str'>] = None, intent_order: int = None, replace_intent: bool = None, remove_duplicates: bool = None) Table

Returns numbers drawn from a list of tuple(lower, upper) intervals.

Parameters:
  • intervals – a list of unique tuple pairs representing the interval lower and upper boundaries

  • canonical – (optional) a pa.Table to append the result table to

  • relative_freq – a weighting pattern or probability that does not have to add to 1

  • precision – the precision of the returned number. if None then assumes int value else float

  • size – the size of the sample

  • quantity – a number between 0 and 1 representing data that isn’t null

  • to_header – (optional) an optional name to call the column

  • seed – a seed value for the random function: default to None

  • save_intent – (optional) if the intent contract should be saved to the property manager

  • intent_level – (optional) the column name that groups intent to create a column

  • intent_order – (optional) the order in which each intent should run.
      - If None: defaults to -1
      - If -1: added to a level above any current instance of the intent section, level 0 if not found
      - If int: added to the level specified, overwriting any that already exist

  • replace_intent – (optional) if the intent method exists at the level, or default level:
      - True: replaces the current intent method with the new one
      - False: leaves it untouched, disregarding the new intent

  • remove_duplicates – (optional) removes any duplicate intent in any level that is identical

Returns:

a random number
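One way to read the intervals, relative_freq and precision parameters together is a two-step draw: choose an interval by weight, then choose a value inside it. The sketch below follows that reading. It assumes both interval ends are reachable, which the text above does not specify, and sketch_intervals is a hypothetical helper that returns a list rather than a pa.Table.

```python
import random


def sketch_intervals(intervals, size, relative_freq=None, precision=None, seed=None):
    """Hypothetical helper: weighted choice of interval, then a value within it."""
    rng = random.Random(seed)
    result = []
    for lower, upper in rng.choices(intervals, weights=relative_freq, k=size):
        value = rng.uniform(lower, upper)
        # no precision means whole numbers, otherwise round to that many places
        result.append(int(value) if precision is None else round(value, precision))
    return result


ages = sketch_intervals([(18, 30), (30, 50), (50, 70)], size=8,
                        relative_freq=[2, 5, 3], seed=3)
print(ages)
```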

get_number(start: [<class 'int'>, <class 'float'>, <class 'str'>] = None, stop: [<class 'int'>, <class 'float'>, <class 'str'>] = None, canonical: ~pyarrow.lib.Table = None, relative_freq: list = None, precision: int = None, ordered: str = None, at_most: int = None, size: int = None, quantity: float = None, to_header: str = None, seed: int = None, save_intent: bool = None, intent_order: int = None, intent_level: [<class 'int'>, <class 'str'>] = None, replace_intent: bool = None, remove_duplicates: bool = None) Table

Returns a number in the range start to stop. If only stop is given, start is zero.

Parameters:
  • start – (optional) a signed integer or float to start from. See below for str

  • stop – a signed integer or float that the number sequence goes up to but does not include. See below for str

  • canonical – (optional) a pa.Table to append the result table to

  • relative_freq – (optional) a weighting pattern or probability that does not have to add to 1

  • precision – (optional) the precision of the returned number. if None then assumes int value else float

  • ordered – (optional) order the data ascending or descending. Accepted values are ‘asc’ or ‘des’

  • at_most – (optional) the most times a selection should be chosen

  • to_header – (optional) an optional name to call the column

  • size – (optional) the size of the sample

  • quantity – (optional) a number between 0 and 1 representing data that isn’t null

  • seed – (optional) a seed value for the random function: default to None

  • save_intent – (optional) if the intent contract should be saved to the property manager

  • intent_level – (optional) the column name that groups intent to create a column

  • intent_order – (optional) the order in which each intent should run.
      - If None: defaults to -1
      - If -1: added to a level above any current instance of the intent section, level 0 if not found
      - If int: added to the level specified, overwriting any that already exist

  • replace_intent – (optional) if the intent method exists at the level, or default level:
      - True: replaces the current intent method with the new one
      - False: leaves it untouched, disregarding the new intent

  • remove_duplicates – (optional) removes any duplicate intent in any level that is identical

Returns:

a random number

The start and stop values can be given as a str naming an environment variable, in the format ‘${NAME}’ where NAME is the environment variable name.
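The ‘${NAME}’ convention can be sketched as a lookup performed before the random draw. This is a guess at the mechanics rather than the library's code: resolve_value and sketch_number are hypothetical helpers, SKETCH_MAX is a made-up variable name, and the sketch assumes the variable holds a numeric string.

```python
import os
import random


def resolve_value(value):
    """Replace a '${NAME}' string with the numeric value of environment variable NAME."""
    if isinstance(value, str) and value.startswith("${") and value.endswith("}"):
        value = os.environ[value[2:-1]]
    return float(value)


def sketch_number(start=None, stop=None, size=1, precision=None, seed=None):
    """Hypothetical helper: numbers from start up to, but not including, stop."""
    rng = random.Random(seed)
    lower = 0.0 if start is None else resolve_value(start)
    upper = resolve_value(stop)
    # rng.random() is in [0, 1), so a draw stays below stop before any rounding
    draws = [lower + (upper - lower) * rng.random() for _ in range(size)]
    return [int(d) if precision is None else round(d, precision) for d in draws]


os.environ["SKETCH_MAX"] = "100"  # hypothetical variable name, set here for the demo
numbers = sketch_number(stop="${SKETCH_MAX}", size=5, seed=42)
print(numbers)
```

Because only stop is supplied, the lower bound falls back to zero, matching the description of the method.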

get_string_pattern(pattern: str, canonical: ~pyarrow.lib.Table = None, choices: dict = None, as_binary: bool = None, quantity: [<class 'float'>, <class 'int'>] = None, size: int = None, choice_only: bool = None, to_header: str = None, seed: int = None, save_intent: bool = None, intent_level: [<class 'int'>, <class 'str'>] = None, intent_order: int = None, replace_intent: bool = None, remove_duplicates: bool = None) Table

Returns a random string based on the pattern given. The pattern is made up from the choices passed but by default is as follows:

  • c = random char [a-z][A-Z]

  • d = digit [0-9]

  • l = lower case char [a-z]

  • U = upper case char [A-Z]

  • p = all punctuation

  • s = space

You can also use punctuation in the pattern, which will be retained. A pattern example might be:

UUddsdUU => BA12 2NE or dl-{UU} => 4g-{FY}

To create your own choices, pass a dictionary in which each key is a reference char and each value is a list of choices.

Parameters:
  • pattern – the pattern to create the string from

  • canonical – (optional) a pa.Table to append the result table to

  • choices – (optional) a dictionary of lists of choices to replace the default.

  • as_binary – (optional) if the return string is prefixed with a b

  • quantity – (optional) a number between 0 and 1 representing the percentage quantity of the data

  • size – (optional) the size of the return list. if None returns a single value

  • choice_only – (optional) whether to use only the choices given or to keep characters not found in the choices as they are

  • to_header – (optional) an optional name to call the column

  • seed – (optional) a seed value for the random function: default to None

  • save_intent – (optional) if the intent contract should be saved to the property manager

  • intent_level – (optional) the column name that groups intent to create a column

  • intent_order – (optional) the order in which each intent should run.
      - If None: defaults to -1
      - If -1: added to a level above any current instance of the intent section, level 0 if not found
      - If int: added to the level specified, overwriting any that already exist

  • replace_intent – (optional) if the intent method exists at the level, or default level:
      - True: replaces the current intent method with the new one
      - False: leaves it untouched, disregarding the new intent

  • remove_duplicates – (optional) removes any duplicate intent in any level that is identical

Returns:

a string based on the pattern
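The pattern expansion described for this method can be approximated in a few lines: each reference char is swapped for a random pick from its choices, and any other character is kept. The sketch uses the default keys listed above (c, d, l, U, p, s). sketch_string_pattern is a hypothetical helper, it returns a list of str rather than a pa.Table, and it ignores as_binary, quantity and choice_only.

```python
import random
import string

# default reference chars as listed in the method description
DEFAULT_CHOICES = {
    "c": string.ascii_letters,
    "d": string.digits,
    "l": string.ascii_lowercase,
    "U": string.ascii_uppercase,
    "p": string.punctuation,
    "s": " ",
}


def sketch_string_pattern(pattern, choices=None, size=1, seed=None):
    """Hypothetical helper: expand a pattern into random strings."""
    rng = random.Random(seed)
    lookup = DEFAULT_CHOICES if choices is None else choices
    result = []
    for _ in range(size):
        # swap each reference char for a random choice; keep anything else as is
        chars = [rng.choice(lookup[ch]) if ch in lookup else ch for ch in pattern]
        result.append("".join(chars))
    return result


codes = sketch_string_pattern("UUddsdUU", size=3, seed=5)
tags = sketch_string_pattern("dl-{UU}", size=3, seed=5)
print(codes, tags)
```

A custom dictionary such as {"x": ["A", "B"]} could be passed as choices to replace the defaults, in which case only "x" would be treated as a reference char.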