FeatureTransform
- class ds_capability.intent.feature_transform_intent.FeatureTransformIntent(property_manager: ~ds_capability.managers.feature_transform_property_manager.FeatureTransformPropertyManager, default_save_intent: bool = None, default_intent_level: [<class 'str'>, <class 'int'>, <class 'float'>] = None, order_next_available: bool = None, default_replace_intent: bool = None)
This class represents feature transformation intent actions, whereby features are converted from one format or structure to another. This includes scaling, encoding, discretization and activation trigger algorithms.
- activate_relu(canonical: ~pyarrow.lib.Table, header: str, precision: int = None, seed: int = None, save_intent: bool = None, intent_level: [<class 'int'>, <class 'str'>] = None, intent_order: int = None, replace_intent: bool = None, remove_duplicates: bool = None)
Activation functions play a crucial role in the backpropagation algorithm, which is the primary algorithm used for training neural networks. During backpropagation, the error of the output is propagated backwards through the network, and the weights of the network are updated based on this error. The activation function is used to introduce non-linearity into the output of a neural network layer.
The Rectified Linear Unit (ReLU) function is the most popular activation function. It replaces negative values with zero and keeps positive values unchanged, and is defined as f(x) = x * (x > 0)
- Parameters:
canonical – a pa.Table as the reference dataframe
header – the header in the Table to correlate
precision – (optional) how many decimal places. default to 3
seed – (optional) the random seed. defaults to current datetime
save_intent – (optional) if the intent contract should be saved to the property manager
intent_level – (optional) the intent level that groups intent to create a column
intent_order – (optional) the order in which each intent should run. If None: defaults to -1; if -1: added to a level above any current instance of the intent section, or level 0 if not found; if int: added to the level specified, overwriting any that already exist
replace_intent – (optional) whether to replace the intent method if it already exists at the level, or default level. True: replaces the current intent method with the new one; False: leaves it untouched, disregarding the new intent
remove_duplicates – (optional) removes any duplicate intent in any level that is identical
- Returns:
a pa.Table
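As an illustration of the ReLU definition above, the following is a minimal sketch in plain Python that works on a list rather than a pa.Table. The function name `relu` and its rounding behaviour are assumptions made for the example, not the library's implementation.

```python
def relu(values, precision=3):
    """Illustrative ReLU: negatives become zero, positives are kept."""
    # max(0.0, x) gives the same result as f(x) = x * (x > 0)
    return [round(max(0.0, x), precision) for x in values]


activated = relu([-2.5, -0.1, 0.0, 1.23456, 4.0])
print(activated)
```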
- activate_sigmoid(canonical: ~pyarrow.lib.Table, header: str, precision: int = None, seed: int = None, save_intent: bool = None, intent_level: [<class 'int'>, <class 'str'>] = None, intent_order: int = None, replace_intent: bool = None, remove_duplicates: bool = None)
Activation functions play a crucial role in the backpropagation algorithm, which is the primary algorithm used for training neural networks. During backpropagation, the error of the output is propagated backwards through the network, and the weights of the network are updated based on this error. The activation function is used to introduce non-linearity into the output of a neural network layer.
The logistic sigmoid function, the inverse of the logit, maps any input value to a value between 0 and 1 and is defined as f(x) = 1/(1+exp(-x))
The sigmoid function has an S-shaped curve: it asymptotically approaches 0 as x approaches negative infinity and 1 as x approaches positive infinity. This property makes it useful for binary classification problems, where the goal is to classify inputs into one of two categories.
- Parameters:
canonical – a pa.Table as the reference dataframe
header – the header in the Table to correlate
precision – (optional) how many decimal places. default to 3
seed – (optional) the random seed. defaults to current datetime
save_intent – (optional) if the intent contract should be saved to the property manager
intent_level – (optional) the intent level that groups intent to create a column
intent_order – (optional) the order in which each intent should run. If None: defaults to -1; if -1: added to a level above any current instance of the intent section, or level 0 if not found; if int: added to the level specified, overwriting any that already exist
replace_intent – (optional) whether to replace the intent method if it already exists at the level, or default level. True: replaces the current intent method with the new one; False: leaves it untouched, disregarding the new intent
remove_duplicates – (optional) removes any duplicate intent in any level that is identical
- Returns:
a pa.Table
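The sigmoid formula can be sketched in plain Python as below. This is only an illustration on a list of floats: the name `sigmoid` and the rounding step are assumptions, and the sketch is not hardened against overflow for very large negative inputs.

```python
import math


def sigmoid(values, precision=3):
    """Illustrative logistic sigmoid: f(x) = 1 / (1 + exp(-x))."""
    return [round(1.0 / (1.0 + math.exp(-x)), precision) for x in values]


squashed = sigmoid([-6.0, -2.0, 0.0, 2.0, 6.0])
print(squashed)
```

The outputs are symmetric around 0.5, so the values for -2.0 and 2.0 sum to 1.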
- activate_tanh(canonical: ~pyarrow.lib.Table, header: str, precision: int = None, seed: int = None, save_intent: bool = None, intent_level: [<class 'int'>, <class 'str'>] = None, intent_order: int = None, replace_intent: bool = None, remove_duplicates: bool = None)
Activation functions play a crucial role in the backpropagation algorithm, which is the primary algorithm used for training neural networks. During backpropagation, the error of the output is propagated backwards through the network, and the weights of the network are updated based on this error. The activation function is used to introduce non-linearity into the output of a neural network layer.
The hyperbolic tangent (tanh) function is a shifted and stretched version of the sigmoid function that maps the input values to a range between -1 and 1. It is defined as f(x) = (exp(x)-exp(-x))/(exp(x)+exp(-x))
Similar to the logistic sigmoid function, the hyperbolic tangent function has an S-shaped curve, but it maps its input to values between -1 and 1. This function is advantageous because it has desirable properties, such as being zero-centered (meaning that the average of its output is close to zero), which can help with the convergence of optimization algorithms during training.
- Parameters:
canonical – a pa.Table as the reference dataframe
header – the header in the Table to correlate
precision – (optional) how many decimal places. default to 3
seed – (optional) the random seed. defaults to current datetime
save_intent – (optional) if the intent contract should be saved to the property manager
intent_level – (optional) the intent level that groups intent to create a column
intent_order – (optional) the order in which each intent should run. If None: defaults to -1; if -1: added to a level above any current instance of the intent section, or level 0 if not found; if int: added to the level specified, overwriting any that already exist
replace_intent – (optional) whether to replace the intent method if it already exists at the level, or default level. True: replaces the current intent method with the new one; False: leaves it untouched, disregarding the new intent
remove_duplicates – (optional) removes any duplicate intent in any level that is identical
- Returns:
a pa.Table
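The relationship between tanh and the sigmoid can be checked numerically. The sketch below compares the definition given above with the identity tanh(x) = 2 * sigmoid(2x) - 1 and with Python's built-in `math.tanh`. The helper names are hypothetical and chosen only for this example.

```python
import math


def tanh_from_definition(x):
    """tanh written out as (exp(x) - exp(-x)) / (exp(x) + exp(-x))."""
    return (math.exp(x) - math.exp(-x)) / (math.exp(x) + math.exp(-x))


def tanh_from_sigmoid(x):
    """tanh as a stretched (2x) and shifted (2s - 1) logistic sigmoid."""
    s = 1.0 / (1.0 + math.exp(-2.0 * x))
    return 2.0 * s - 1.0


for x in (-2.0, -0.5, 0.0, 0.5, 2.0):
    print(x, tanh_from_definition(x), tanh_from_sigmoid(x), math.tanh(x))
```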
- discrete_custom(canonical: ~pyarrow.lib.Table, header: str, interval: [<class 'int'>, <class 'float'>, <class 'list'>] = None, lower: [<class 'int'>, <class 'float'>] = None, upper: [<class 'int'>, <class 'float'>] = None, categories: list = None, to_header: str = None, precision: int = None, seed: int = None, save_intent: bool = None, intent_level: [<class 'int'>, <class 'str'>] = None, intent_order: int = None, replace_intent: bool = None, remove_duplicates: bool = None) → Table
Converts a continuous representation into a discrete representation through interval categorisation based on custom intervals.
- Parameters:
canonical – a pa.Table as the reference table
header – the header in the Table to correlate
interval – (optional) the granularity of the analysis across the range. Default is 3. If a float is passed: the length of each interval; if a list[tuple] is passed: specific interval periods e.g []
lower – (optional) the lower limit of the number value. Default min()
upper – (optional) the upper limit of the number value. Default max()
to_header – (optional) an optional name to call the column
precision – (optional) the precision of the range and boundary values. Defaults to 5.
categories – (optional) a set of labels the same length as the intervals to name the categories
seed – (optional) the random seed. defaults to current datetime
save_intent – (optional) if the intent contract should be saved to the property manager
intent_level – (optional) the column name that groups intent to create a column
intent_order – (optional) the order in which each intent should run. If None: defaults to -1; if -1: added to a level above any current instance of the intent section, or level 0 if not found; if int: added to the level specified, overwriting any that already exist
replace_intent – (optional) whether to replace the intent method if it already exists at the level, or default level. True: replaces the current intent method with the new one; False: leaves it untouched, disregarding the new intent
remove_duplicates – (optional) removes any duplicate intent in any level that is identical
- Returns:
an equal length list of correlated values
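Binning against custom intervals with category labels can be sketched as follows. This is a plain-Python illustration, not the library's code: the helper `bin_custom` is hypothetical, and the choice of left-closed, right-open intervals is an assumption, since the boundary handling is not stated here.

```python
def bin_custom(values, intervals, categories=None):
    """Assign each value the label of the first interval containing it.

    Intervals are (low, high) tuples, assumed left-closed and right-open.
    Values outside every interval are labelled None.
    """
    labels = categories if categories is not None else list(range(len(intervals)))
    if len(labels) != len(intervals):
        raise ValueError("categories must be the same length as intervals")
    result = []
    for value in values:
        label = None
        for (low, high), name in zip(intervals, labels):
            if low <= value < high:
                label = name
                break
        result.append(label)
    return result


ages = [3, 17, 18, 42, 70, 120]
bands = bin_custom(ages, [(0, 18), (18, 65), (65, 120)], ["minor", "adult", "senior"])
print(bands)
```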
- discrete_intervals(canonical: ~pyarrow.lib.Table, header: str, interval: [<class 'int'>, <class 'list'>] = None, categories: list = None, to_header: str = None, precision: int = None, duplicates: str = None, seed: int = None, save_intent: bool = None, intent_level: [<class 'int'>, <class 'str'>] = None, intent_order: int = None, replace_intent: bool = None, remove_duplicates: bool = None)
Converts continuous values into discrete values through interval categorisation based on interval bins. Intervals can be either a single value giving the number of bins or a list of bin edges, allowing for non-uniform widths.
If intervals are not unique, duplicates provides a strategy: ‘raise’ a ValueError, ‘drop’ duplicate bins or ‘rank’ values from 1 to n before binning. Ranking might be used, for example, for sparse data with predominant zeros.
- Parameters:
canonical – a pa.Table as the reference table
header – the header in the Table to correlate
interval – (optional) the granularity of the analysis across the range. Default is 3. If an int is passed: the number of periods; if a list[int] is passed: specific interval periods e.g []
to_header – (optional) an optional name to call the column
precision – (optional) the precision of the range and boundary values. Defaults to 5.
categories – (optional) a set of labels the same length as the intervals to name the categories
duplicates – (optional) If intervals are not unique, ‘raise’ ValueError, ‘drop’ intervals or ‘rank’ values
seed – (optional) the random seed. defaults to current datetime
save_intent – (optional) if the intent contract should be saved to the property manager
intent_level – (optional) the column name that groups intent to create a column
intent_order – (optional) the order in which each intent should run. If None: defaults to -1; if -1: added to a level above any current instance of the intent section, or level 0 if not found; if int: added to the level specified, overwriting any that already exist
replace_intent – (optional) whether to replace the intent method if it already exists at the level, or default level. True: replaces the current intent method with the new one; False: leaves it untouched, disregarding the new intent
remove_duplicates – (optional) removes any duplicate intent in any level that is identical
- Returns:
an equal length list of correlated values
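To illustrate binning into a fixed number of intervals, here is a minimal sketch of equal-width binning in plain Python. It assumes that an int interval means equally wide bins spanning min() to max(); the helper name `equal_width_bins` is hypothetical and this is not the library's implementation.

```python
def equal_width_bins(values, n_bins=3):
    """Return the zero-based bin index of each value using equal-width bins."""
    low, high = min(values), max(values)
    width = (high - low) / n_bins
    if width == 0:
        # all values are identical, so everything falls in the first bin
        return [0] * len(values)
    # the maximum value would land on the upper edge, so clamp it into the last bin
    return [min(int((v - low) / width), n_bins - 1) for v in values]


binned = equal_width_bins([0, 1, 2, 3, 4, 5, 6, 7, 8, 9], n_bins=3)
print(binned)
```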
- discrete_quantiles(canonical: ~pyarrow.lib.Table, header: str, interval: [<class 'int'>, <class 'list'>] = None, categories: list = None, to_header: str = None, precision: int = None, duplicates: str = None, seed: int = None, save_intent: bool = None, intent_level: [<class 'int'>, <class 'str'>] = None, intent_order: int = None, replace_intent: bool = None, remove_duplicates: bool = None)
Converts continuous values into discrete values through interval categorisation based on quantile discretization. Intervals can be either a single value representing the number of quantiles or an ordered list of quantiles between 0 and 1, e.g. [0, .25, .5, .75, 1.]
If intervals are not unique, duplicates provides a strategy: ‘raise’ a ValueError, ‘drop’ duplicate bins or ‘rank’ values from 1 to n before binning. Ranking might be used, for example, for sparse data with predominant zeros.
- Parameters:
canonical – a pa.Table as the reference table
header – the header in the Table to correlate
interval – (optional) the granularity of the analysis across the range. Default is 3. If an int is passed: the number of periods; if a list[float] is passed: the percentiles or quantiles, all of which should fall between 0 and 1
to_header – (optional) an optional name to call the column
precision – (optional) the precision of the range and boundary values. Defaults to 5.
categories – (optional) a set of labels the same length as the intervals to name the categories
duplicates – (optional) If intervals are not unique, ‘raise’ ValueError, ‘drop’ intervals or ‘rank’ values
seed – (optional) the random seed. defaults to current datetime
save_intent – (optional) if the intent contract should be saved to the property manager
intent_level – (optional) the column name that groups intent to create a column
intent_order – (optional) the order in which each intent should run. If None: defaults to -1; if -1: added to a level above any current instance of the intent section, or level 0 if not found; if int: added to the level specified, overwriting any that already exist
replace_intent – (optional) whether to replace the intent method if it already exists at the level, or default level. True: replaces the current intent method with the new one; False: leaves it untouched, disregarding the new intent
remove_duplicates – (optional) removes any duplicate intent in any level that is identical
- Returns:
an equal length list of correlated values
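The ‘rank’ strategy for duplicates can be sketched as follows. With sparse data the quantile edges collapse onto the same value, so the values are ranked from 1 to n first and the ranks are binned instead. The helper names, the nearest-rank edge calculation and the tie-breaking by order of appearance are all assumptions made for illustration.

```python
def quantile_edges(values, quantiles):
    """Nearest-rank (lower) quantile edges; a simple illustrative choice."""
    ordered = sorted(values)
    last = len(ordered) - 1
    return [ordered[int(q * last)] for q in quantiles]


def bins_by_rank(values, n_bins):
    """Rank values 1..n (ties by order of appearance), then bin the ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])  # stable sort
    ranks = [0] * len(values)
    for rank, index in enumerate(order, start=1):
        ranks[index] = rank
    n = len(values)
    return [(rank - 1) * n_bins // n for rank in ranks]


sparse = [0, 0, 0, 0, 0, 0, 1, 5]
edges = quantile_edges(sparse, [0, 0.25, 0.5, 0.75, 1.0])
bins = bins_by_rank(sparse, n_bins=4)
print(edges)  # the repeated edges show why plain quantile binning fails here
print(bins)
```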
- encode_category_integer(canonical: ~pyarrow.lib.Table, headers: [<class 'str'>, <class 'list'>] = None, ordinal: bool = None, label_count: int = None, prefix=None, seed: int = None, save_intent: bool = None, intent_level: [<class 'int'>, <class 'str'>] = None, intent_order: int = None, replace_intent: bool = None, remove_duplicates: bool = None)
Integer encoding replaces the categories with integers from 1 to n, where n is the number of distinct categories of the variable. Integer encoding can be either nominal or ordinal.
Nominal data is categorical variables without any particular order between categories. This means that the categories cannot be sorted and there is no natural order between them.
Ordinal data represents categories with a natural, ordered relationship between each category. This means that the categories can be sorted in either ascending or descending order. In order to encode integers as ordinal, a ranking must be provided.
If ranking is given, the return will be ordinal values based on the ranking order of the list. If a categorical value is not found in the list it is grouped with other missing values and given the last ranking. This is known as rare-label encoding.
- Parameters:
canonical – pyarrow Table
headers – the header(s) to apply encoding to
ordinal – (optional) if the integer encoder is ordinal
label_count – (optional) if the ordinal is rare-label, the number of categories to rank
prefix – (optional) a str to prefix the column
seed – (optional) a seed value for the random function. Defaults to None
save_intent – (optional) if the intent contract should be saved to the property manager
intent_level – (optional) the intent level that groups intent to create a column
intent_order – (optional) the order in which each intent should run. If None: defaults to -1; if -1: added to a level above any current instance of the intent section, or level 0 if not found; if int: added to the level specified, overwriting any that already exist
replace_intent – (optional) whether to replace the intent method if it already exists at the level, or default level. True: replaces the current intent method with the new one; False: leaves it untouched, disregarding the new intent
remove_duplicates – (optional) removes any duplicate intent in any level that is identical
- Returns:
a pa.Table
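Ordinal encoding with a rare-label fallback, as described for this method, can be sketched in plain Python. The helper `encode_ordinal` and its `ranking` argument are hypothetical names used only for this example; they are not parameters of the method documented here.

```python
def encode_ordinal(values, ranking):
    """Map each category to its 1-based position in the ranking.

    Categories missing from the ranking share one final code,
    which is the rare-label group.
    """
    codes = {label: position for position, label in enumerate(ranking, start=1)}
    rare_code = len(ranking) + 1
    return [codes.get(value, rare_code) for value in values]


sizes = ["M", "S", "XL", "L", "XXS", "S"]
encoded_sizes = encode_ordinal(sizes, ranking=["S", "M", "L"])
print(encoded_sizes)
```

Here "XL" and "XXS" are not in the ranking, so both receive the last code.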
- encode_category_one_hot(canonical: ~pyarrow.lib.Table, headers: [<class 'str'>, <class 'list'>] = None, prefix=None, data_type: ~pyarrow.lib.Table = None, prefix_sep: str = None, dummy_na: bool = False, drop_first: bool = False, seed: int = None, save_intent: bool = None, intent_level: [<class 'int'>, <class 'str'>] = None, intent_order: int = None, replace_intent: bool = None, remove_duplicates: bool = None) → Table
Encodes categorical data types using one-hot encoding, which consists of encoding each categorical variable with a set of boolean variables (also called dummy variables) that take the value 0 or 1, indicating whether a category is present in an observation.
- Parameters:
canonical – pyarrow Table
headers – the header(s) to apply encoding to
prefix – a str, list of str, or dict of str to prefix to the encoded column names; a list should be of equal length to the headers.
prefix_sep – str separator, default ‘_’
dummy_na – add a column to indicate null values; if False, nulls are ignored.
drop_first – Whether to get k-1 dummies out of k categorical levels by removing the first level.
data_type – Data type for new columns. Only a single dtype is allowed.
seed – (optional) a seed value for the random function. Defaults to None
save_intent – (optional) if the intent contract should be saved to the property manager
intent_level – (optional) the intent level that groups intent to create a column
intent_order – (optional) the order in which each intent should run. If None: defaults to -1; if -1: added to a level above any current instance of the intent section, or level 0 if not found; if int: added to the level specified, overwriting any that already exist
replace_intent – (optional) whether to replace the intent method if it already exists at the level, or default level. True: replaces the current intent method with the new one; False: leaves it untouched, disregarding the new intent
remove_duplicates – (optional) removes any duplicate intent in any level that is identical
- Returns:
a pa.Table
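A minimal sketch of one-hot encoding, including the effect of prefix, prefix_sep and drop_first, is shown below. It works on a plain list and returns a dict of columns. The helper `one_hot` and the alphabetical ordering of categories are assumptions for the example, not the library's behaviour.

```python
def one_hot(values, prefix, prefix_sep="_", drop_first=False):
    """Return a dict of 0/1 indicator columns, one per category."""
    categories = sorted(set(values))
    if drop_first:
        # k-1 dummies: the first category is implied when all others are 0
        categories = categories[1:]
    return {
        f"{prefix}{prefix_sep}{category}": [int(value == category) for value in values]
        for category in categories
    }


colours = ["red", "blue", "red", "green"]
full = one_hot(colours, prefix="colour")
reduced = one_hot(colours, prefix="colour", drop_first=True)
print(full)
print(reduced)
```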
- encode_date_integer(canonical: ~pyarrow.lib.Table, headers: [<class 'str'>, <class 'list'>] = None, prefix=None, day_first: bool = None, year_first: bool = None, seed: int = None, save_intent: bool = None, intent_level: [<class 'int'>, <class 'str'>] = None, intent_order: int = None, replace_intent: bool = None, remove_duplicates: bool = None)
Date-to-integer encoding replaces dates with integer values.
- Parameters:
canonical – pyarrow Table
headers – the header(s) to apply encoding to
prefix – (optional) a str to prefix the column
day_first – (optional) if the dates given are day first format. Default to True
year_first – (optional) if the dates given are year first. Default to False
seed – (optional) a seed value for the random function. Defaults to None
save_intent – (optional) if the intent contract should be saved to the property manager
intent_level – (optional) the intent level that groups intent to create a column
intent_order – (optional) the order in which each intent should run. If None: defaults to -1; if -1: added to a level above any current instance of the intent section, or level 0 if not found; if int: added to the level specified, overwriting any that already exist
replace_intent – (optional) whether to replace the intent method if it already exists at the level, or default level. True: replaces the current intent method with the new one; False: leaves it untouched, disregarding the new intent
remove_duplicates – (optional) removes any duplicate intent in any level that is identical
- Returns:
a pa.Table
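The sketch below shows why day_first matters when ambiguous date strings are converted to integers. The integer representation used by the method is not stated here, so the example assumes the proleptic Gregorian ordinal from the standard library; the helper `date_to_int` and the slash-separated format are likewise assumptions.

```python
from datetime import date, datetime


def date_to_int(text, day_first=True):
    """Parse a dd/mm/yyyy or mm/dd/yyyy string and return its ordinal day."""
    fmt = "%d/%m/%Y" if day_first else "%m/%d/%Y"
    return datetime.strptime(text, fmt).date().toordinal()


as_day_first = date_to_int("03/04/2021", day_first=True)     # read as 3 April 2021
as_month_first = date_to_int("03/04/2021", day_first=False)  # read as 4 March 2021
print(as_day_first, as_month_first)
```

The same string yields two different integers, so the flag has to match the format of the source data.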
- scale_normalize(canonical: ~pyarrow.lib.Table, headers: [<class 'str'>, <class 'list'>] = None, scalar: [<class 'tuple'>, <class 'str'>] = None, prefix: str = None, precision: int = None, seed: int = None, save_intent: bool = None, intent_level: [<class 'int'>, <class 'str'>] = None, intent_order: int = None, replace_intent: bool = None, remove_duplicates: bool = None)
Normalization of continuous data using either Min-Max Scaling or Robust Scaling, dependent on the scalar passed.
Min-Max Scaling: scales the data such that it falls within a specified range, typically (0,1) as default if no value is passed, or (min, max) tuple. This method is suitable when your data should be uniformly distributed across the specified range.
Robust Scaling: Robust scaling is similar to min-max scaling but is less sensitive to outliers. It uses the interquartile range (0.25, 0.75) to scale the data. This is used if the scalar equals ‘robust’. Robust scaling is particularly useful when your dataset contains outliers or when the underlying data distribution is not necessarily Gaussian.
- Parameters:
canonical – pyarrow Table
headers – the header(s) to apply scaling to
scalar – (optional) a tuple scalar representing min and max values or ‘robust’ for interquartile scaling
prefix – (optional) a str prefix for generated headers
precision – (optional) how many decimal places. default to 3
seed – (optional) a seed value for the random function: default to None
save_intent – (optional) if the intent contract should be saved to the property manager
intent_level – (optional) the intent level that groups intent to create a column
intent_order – (optional) the order in which each intent should run. If None: defaults to -1; if -1: added to a level above any current instance of the intent section, or level 0 if not found; if int: added to the level specified, overwriting any that already exist
replace_intent – (optional) whether to replace the intent method if it already exists at the level, or default level. True: replaces the current intent method with the new one; False: leaves it untouched, disregarding the new intent
remove_duplicates – (optional) removes any duplicate intent in any level that is identical
- Returns:
a pa.Table
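The difference between the two normalization options can be seen on a small list containing an outlier. This is a hedged sketch using only the standard library: the helper names are hypothetical, and centring the robust scaler on the median is an assumption, since the text above only states that the interquartile range is used for scaling.

```python
import statistics


def min_max_scale(values, feature_range=(0, 1), precision=3):
    """Rescale values linearly so that min maps to a and max maps to b."""
    low, high = min(values), max(values)
    a, b = feature_range
    return [round(a + (v - low) * (b - a) / (high - low), precision) for v in values]


def robust_scale(values, precision=3):
    """Centre on the median (assumed) and divide by the interquartile range."""
    q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return [round((v - median) / (q3 - q1), precision) for v in values]


data = [1, 2, 3, 4, 100]  # 100 is an outlier
min_max = min_max_scale(data)
robust = robust_scale(data)
print(min_max)
print(robust)
```

With min-max scaling the outlier squeezes the other values towards 0, while robust scaling keeps them evenly spread.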
- scale_standardize(canonical: ~pyarrow.lib.Table, headers: [<class 'str'>, <class 'list'>] = None, prefix: str = None, precision: int = None, seed: int = None, save_intent: bool = None, intent_level: [<class 'int'>, <class 'str'>] = None, intent_order: int = None, replace_intent: bool = None, remove_duplicates: bool = None)
Z-Score Standardization (Standard Scaling). This method transforms the data to have a mean of 0 and a standard deviation of 1. It’s particularly useful when your data follows a Gaussian (normal) distribution. This transformation makes it easier to compare and work with features that may have different scales and ensures that they contribute equally to model training.
- Parameters:
canonical – pyarrow Table
headers – the header(s) to apply scaling to
prefix – (optional) a str prefix for generated headers
precision – (optional) how many decimal places. default to 3
seed – (optional) a seed value for the random function. Defaults to None
save_intent – (optional) if the intent contract should be saved to the property manager
intent_level – (optional) the intent level that groups intent to create a column
intent_order – (optional) the order in which each intent should run. If None: defaults to -1; if -1: added to a level above any current instance of the intent section, or level 0 if not found; if int: added to the level specified, overwriting any that already exist
replace_intent – (optional) whether to replace the intent method if it already exists at the level, or default level. True: replaces the current intent method with the new one; False: leaves it untouched, disregarding the new intent
remove_duplicates – (optional) removes any duplicate intent in any level that is identical
- Returns:
a pa.Table
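Z-score standardization can be sketched in a few lines of plain Python. The helper `standardize` is a hypothetical name, and the use of the population standard deviation (rather than the sample one) is an assumption made for this example.

```python
import statistics


def standardize(values, precision=3):
    """Subtract the mean and divide by the (population) standard deviation."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    return [round((v - mean) / std, precision) for v in values]


z_scores = standardize([2, 4, 4, 4, 5, 5, 7, 9])  # mean 5, standard deviation 2
print(z_scores)
```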
- scale_transform(canonical: ~pyarrow.lib.Table, transform: str, headers: [<class 'str'>, <class 'list'>] = None, prefix: str = None, precision: int = None, seed: int = None, save_intent: bool = None, intent_level: [<class 'int'>, <class 'str'>] = None, intent_order: int = None, replace_intent: bool = None, remove_duplicates: bool = None)
Log, sqrt (square root), cbrt (cube root), Box-Cox, and Yeo-Johnson transformations. These are techniques used to modify the distribution or scale of continuous data to make the data conform more closely to a normal distribution or to stabilize the variance.
Log Transformation (log): The log transformation involves taking the natural logarithm of each data point. It is particularly useful when dealing with data that follows an exponential distribution or has a right-skewed distribution. The log transformation can help make the data more symmetric and stabilize the variance.
Square Root Transformation (sqrt): The square root transformation involves taking the square root of each data point. It is often used when dealing with count data or data with a right-skewed distribution. Like the log transformation, the square root transformation can make the data more symmetric and stabilize the variance.
Cube Root Transformation (cbrt): The cube root transformation involves taking the cube root of each data point. It is another option for stabilizing variance and making data less skewed, especially when dealing with positive data with right-skewed distributions.
Box-Cox Transformation: The Box-Cox transformation is a family of power transformations that includes both the log and square root transformations as special cases. It can only be applied to strictly positive values.
Yeo-Johnson Transformation: The Yeo-Johnson transformation is an extension of the Box-Cox transformation that allows for the transformation of data with both positive and negative values.
These transformations are often applied to address issues like non-normality, heteroscedasticity (unequal variance), and skewed distributions in the data.
- Parameters:
canonical – pyarrow Table
headers – the header(s) to apply scaling to
transform – transform function, ‘log’, ‘sqrt’, ‘cbrt’, ‘boxcox’ or ‘yeojohnson’
prefix – (optional) a str prefix for generated headers
precision – (optional) how many decimal places. default to 3
seed – (optional) a seed value for the random function. Defaults to None
save_intent – (optional) if the intent contract should be saved to the property manager
intent_level – (optional) the intent level that groups intent to create a column
intent_order – (optional) the order in which each intent should run. If None: defaults to -1; if -1: added to a level above any current instance of the intent section, or level 0 if not found; if int: added to the level specified, overwriting any that already exist
replace_intent – (optional) whether to replace the intent method if it already exists at the level, or default level. True: replaces the current intent method with the new one; False: leaves it untouched, disregarding the new intent
remove_duplicates – (optional) removes any duplicate intent in any level that is identical
- Returns:
a pa.Table
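The Box-Cox and Yeo-Johnson families can be written out for a single value and a fixed power parameter. The sketch below uses the standard textbook formulas; the helper names and the `lmbda` argument are hypothetical, and fixing the parameter by hand is an assumption, since how this method chooses it is not described here.

```python
import math


def box_cox(x, lmbda):
    """Box-Cox: log(x) when lmbda is 0, otherwise (x**lmbda - 1) / lmbda."""
    if x <= 0:
        raise ValueError("Box-Cox requires strictly positive values")
    return math.log(x) if lmbda == 0 else (x ** lmbda - 1) / lmbda


def yeo_johnson(x, lmbda):
    """Yeo-Johnson: a Box-Cox style transform that also accepts x <= 0."""
    if x >= 0:
        return math.log1p(x) if lmbda == 0 else ((x + 1) ** lmbda - 1) / lmbda
    if lmbda == 2:
        return -math.log1p(-x)
    return -((1 - x) ** (2 - lmbda) - 1) / (2 - lmbda)


print(box_cox(math.e, 0))     # log case
print(box_cox(4.0, 0.5))      # a rescaled square root
print(yeo_johnson(-3.0, 1))   # lmbda = 1 leaves values unchanged
print(yeo_johnson(-3.0, 2))   # negative input, log branch
```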