As of version v0.12.0, Featuretools allows you to assign custom names to multi-output primitives: https://github.com/alteryx/featuretools/pull/794. By default, when you define a custom multi-output primitive, the column names of the generated features are suffixed with [0], [1], [2], and so on. So let's say I have the following code defining a multi-output primitive:
import numpy as np
from featuretools.primitives import make_trans_primitive
from featuretools import variable_types as vtypes

def sine_and_cosine_datestamp(column):
    """
    Returns the sine and cosine of the hour of the datestamp.
    """
    sine_hour = np.sin(column.dt.hour)
    cosine_hour = np.cos(column.dt.hour)
    ret = [sine_hour, cosine_hour]
    return ret
Sine_Cosine_Datestamp = make_trans_primitive(function=sine_and_cosine_datestamp,
                                             input_types=[vtypes.Datetime],
                                             return_type=vtypes.Numeric,
                                             number_output_features=2)
In the dataframe generated from DFS, the names of the two generated columns will be SINE_AND_COSINE_DATESTAMP(datestamp)[0] and SINE_AND_COSINE_DATESTAMP(datestamp)[1]. I would rather have the column names reflect the operations applied to the column, something like SINE_AND_COSINE_DATESTAMP(datestamp)[sine] and SINE_AND_COSINE_DATESTAMP(datestamp)[cosine]. Apparently you have to use the generate_names method to do so. I could not find anything online to help me use this method, and I kept running into errors. For example, when I tried the following code:
def sine_and_cosine_datestamp(column, string=['sine', 'cosine']):
    """
    Returns the sine and cosine of the hour of the datestamp.
    """
    sine_hour = np.sin(column.dt.hour)
    cosine_hour = np.cos(column.dt.hour)
    ret = [sine_hour, cosine_hour]
    return ret

def sine_and_cosine_generate_names(self, base_feature_names):
    return u'STRING_COUNT(%s, "%s")' % (base_feature_names[0], self.kwargs['string'])
Sine_Cosine_Datestamp = make_trans_primitive(function=sine_and_cosine_datestamp,
                                             input_types=[vtypes.Datetime],
                                             return_type=vtypes.Numeric,
                                             number_output_features=2,
                                             description="For each value in the base feature, "
                                                         "outputs the sine and cosine of the hour, day, and month.",
                                             cls_attributes={'generate_names': sine_and_cosine_generate_names})
I got an assertion error. What's even more perplexing to me is that when I went into the transform_primitive_base.py file found in the featuretools/primitives/base folder, I saw that the generate_names function looks like this:
def generate_names(self, base_feature_names):
    n = self.number_output_features
    base_name = self.generate_name(base_feature_names)
    return [base_name + "[%s]" % i for i in range(n)]
In the function above, it looks like there is no way that you can generate custom primitive names since it uses the base_feature_names and the number of output features by default. Any help would be appreciated.
Thanks for the question! This feature hasn't been documented well.
The main issue with your code was that sine_and_cosine_generate_names should return a list of strings, one for each output column.
It looks like you were adapting the StringCount example from the docs -- I think for this primitive it would be less error-prone to always use "sine" and "cosine" for the custom names, and remove the optional string argument from sine_and_cosine_datestamp. I also updated the feature name text to match your desired text.
After these changes:
def sine_and_cosine_datestamp(column):
    """
    Returns the sine and cosine of the hour of the datestamp.
    """
    sine_hour = np.sin(column.dt.hour)
    cosine_hour = np.cos(column.dt.hour)
    ret = [sine_hour, cosine_hour]
    return ret

def sine_and_cosine_generate_names(self, base_feature_names):
    # Return one name per output column, in the same order as the outputs.
    template = 'SINE_AND_COSINE_DATESTAMP(%s)[%s]'
    return [template % (base_feature_names[0], string) for string in ['sine', 'cosine']]
This created feature column names like SINE_AND_COSINE_DATESTAMP(order_date)[sine]. No changes were necessary to the actual make_trans_primitive call.
In the function above, it looks like there is no way that you can generate custom primitive names since it uses the base_feature_names and the number of output features by default.
That is the default generate_names function for transform primitives. Since we are assigning our custom generate_names function to Sine_Cosine_Datestamp, the default will not be used.
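For completeness, a sketch of the full call with the custom naming function attached via cls_attributes (this mirrors the call from your question, just without the string argument):

Sine_Cosine_Datestamp = make_trans_primitive(function=sine_and_cosine_datestamp,
                                             input_types=[vtypes.Datetime],
                                             return_type=vtypes.Numeric,
                                             number_output_features=2,
                                             description="For each value in the base feature, "
                                                         "outputs the sine and cosine of the hour.",
                                             cls_attributes={'generate_names': sine_and_cosine_generate_names})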
Hope that helps, let me know if you still have questions!
Related
I am trying to map the individual rows of a dataframe into a custom object. The dataframe consists of multiple molecules that interact with a specific target. Additionally, multiple molecular descriptors are given. A slice is given below:
Now I need to map each instance into a Molecule object defined something like this:
class Molecule:
    allDescriptorKeys = []

    def __init__(self, smiles, target, values):
        self.smiles = smiles
        self.target = target
        self.d = {}
        for i in range(len(Molecule.allDescriptorKeys)):
            self.d[Molecule.allDescriptorKeys[i]] = values[i]
Where the allDescriptorKeys class variable is set from outside the class using
def initdescriptorkeys(df):
    Molecule.allDescriptorKeys = df.keys().values
Now I need a class function readMolDescriptors that reads in the molecule descriptors of a single molecule (row/instance), to use later in an external method that loops over the whole dataframe. I guess I need something like this:
def readMolDescriptors(self, index):
    smiles = df.iloc[index]["SMILES"]
    target = df.iloc[index]["Target"]
    values = df.iloc[index][2:-1]
    newMolecule = Molecule(smiles, target, values)
    return newMolecule
But of course this is not really a class function, since df is defined outside the class. I have a hard time wrapping my head around this (probably easy) problem. Hope someone can help.
It seems that you want to build a class from which you build a new instance for each row of the dataframe, and after that you want to get rid of the dataframe and play with those Molecule instances alone. Consider this:
class Molecule:
    def __init__(self, data_row):
        ''' data_row: pd.Series '''
        self.smiles = data_row['SMILES']
        # more self.xxx = data_row['xxx']
        self.d = data_row.to_dict()
With this you can create an object of Molecule using a data row. For example,
molecules = [Molecule(data_row) for index, data_row in df.iterrows()]
To access a certain descriptor (e.g. nAT) value from the first molecule, you may do
print(molecules[0].d['nAT'])
although you can choose to define a more dedicated method on the class to handle access like that.
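For example, a minimal sketch of such an accessor (the method name get_descriptor is just a suggestion):

def get_descriptor(self, key):
    # Add this to the Molecule class above; looks up a single descriptor by name.
    return self.d[key]

Then molecules[0].get_descriptor('nAT') does the same lookup as above.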
Of course, to build something like readMolDescriptors, below is my version.
def build_molecule_from_dataframe(df, index):
    return Molecule(df.loc[index])
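Usage would then be something along these lines (assuming df is your original dataframe with its default integer index):

first_molecule = build_molecule_from_dataframe(df, 0)
print(first_molecule.d['nAT'])

# Or build every molecule in one go:
molecules = [build_molecule_from_dataframe(df, idx) for idx in df.index]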
I am looking to build fairly detailed annotations for the methods in a Python class. These are to be used in troubleshooting, documentation, tooltips for a user interface, etc. However, it's not clear how I can keep these annotations associated with the functions.
For context, this is a feature engineering class, so two example methods might be:
def create_feature_momentum(self):
    return self.data['mass'] * self.data['velocity']

def create_feature_kinetic_energy(self):
    return 0.5 * self.data['mass'] * self.data['velocity'].pow(2)
For example:
It'd be good to tell easily what core features were used in each engineered feature.
It'd be good to track arbitrary metadata about each method.
It'd be good to embed non-string data as metadata about each function, e.g. some example calculations on sample dataframes.
So far I've been manually creating docstrings like:
def create_feature_kinetic_energy(self) -> pd.Series:
    '''Calculate the non-relativistic kinetic energy.

    Depends on: ['mass', 'velocity']
    Supports NaN Values: False
    Unit: Energy (J)
    Example:
        self.data = pd.DataFrame({'mass': [0, 1, 2], 'velocity': [0, 1, 2]})
        self.create_feature_kinetic_energy()
        >>> pd.Series([0, 0.5, 4])
    '''
    return 0.5 * self.data['mass'] * self.data['velocity'].pow(2)
And then I'm using regex to get the data about a function by inspecting the __doc__ attribute. However, is there a better place than __doc__ where I could store information about a function? In the example above, it's fairly easy to parse the Depends on list, but in my use case it'd be good to also embed some example data as dataframes somehow (and I think writing them as markdown in the docstring would be hard).
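For reference, the regex parsing I have now looks roughly like this (a simplified sketch; the helper name is arbitrary):

import re

def get_dependencies(func):
    # Pull the "Depends on: [...]" line out of the docstring, if present.
    match = re.search(r"Depends on:\s*\[(.*?)\]", func.__doc__ or "")
    if not match:
        return []
    return [name.strip().strip("'\"") for name in match.group(1).split(",")]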
Any ideas?
I ended up writing a class as follows:
class ScubaDiver(pd.DataFrame):
    accessed = None

    def __getitem__(self, key):
        if self.accessed is None:
            self.accessed = set()
        self.accessed.add(key)
        return pd.Series(dtype=float)

    @property
    def columns(self):
        return list(self.accessed)
The way my code is written, I can do this:
sd = ScubaDiver()
foo(sd)
sd.columns
and sd.columns contains all the columns accessed by foo.
Though this might not work in your codebase.
I also wrote this decorator:
def add_note(notes: dict):
    '''Adds k:v pairs to a .notes attribute on the decorated function.'''
    def _(f):
        if not hasattr(f, 'notes'):
            f.notes = {}
        f.notes |= notes  # dict merge (Python 3.9+)
        return f
    return _
You can use it as follows:
@add_note({'Units': 'J', 'Relativity': False})
def create_feature_kinetic_energy(self):
    return 0.5 * self.data['mass'] * self.data['velocity'].pow(2)
and then you can do:
create_feature_kinetic_energy.notes['Units'] # J
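Since the metadata now lives on the function objects themselves, you could also sweep a whole class for it when building docs or tooltips; a rough sketch (the class name here is hypothetical):

import inspect

def collect_notes(cls):
    # Gather the .notes dict from every method that has one.
    return {name: func.notes
            for name, func in inspect.getmembers(cls, inspect.isfunction)
            if hasattr(func, 'notes')}

# collect_notes(FeatureEngineering) -> {'create_feature_kinetic_energy': {'Units': 'J', ...}}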
I am aware that Python is not statically typed and does not support keywords specifying return types like void and int the way Java and C do. I am also aware that we can use type hints to tell users that they can expect a specific type of return value from a function.
I am trying to implement a Python class which will read a config file (say, a JSON file) that dictates what data transformation methods should be applied on a pandas dataframe. The config file looks something like:
[
    {
        "input_folder_path": "./input/budget/",
        "input_file_name_or_pattern": "Global Budget Roll-up_9.16.19.xlsx",
        "sheet_name_of_excel_file": "Budget Roll-Up",
        "output_folder_path": "./output/budget/",
        "output_file_name_prefix": "transformed_budget_",

        "__comment__": "(Optional) File with Python class that houses data transformation functions, which will be imported and used in the transform process. If not provided, then the code will use default class in the 'transform_function.py' file.",
        "transform_functions_file": "./transform_functions/budget_transform_functions.py",

        "row_number_of_column_headers": 0,
        "row_number_where_data_starts": 1,
        "number_of_rows_to_skip_from_the_bottom_of_the_file": 0,

        "__comment__": "(Required) List of the functions and their parameters.",
        "__comment__": "These functions must be defined either in transform_functions.py or individual transformation file such as .\\transform_function\\budget_transform_functions.py",
        "functions_to_apply": [
            {
                "__function_comment__": "Drop empty columns in Budget roll up Excel file. No parameters required.",
                "function_name": "drop_unnamed_columns"
            },
            {
                "__function_comment__": "By the time we run this function, there should be only 13 columns total remaining in the raw data frame.",
                "function_name": "assert_number_of_columns_equals",
                "function_args": [13]
            },
            {
                "__function_comment__": "Map raw channel names 'Ecommerce' and 'ecommerce' to 'E-Commerce'.",
                "transform_function_name": "standardize_to_ecommerce",
                "transform_function_args": [["Ecommerce", "ecommerce"]]
            }
        ]
    }
]
In the main.py code, I have something like this:
if __name__ == '__main__':
    # 1. Process arguments passed into the program
    parser = argparse.ArgumentParser(description=transform_utils.DESC,
                                     formatter_class=argparse.RawTextHelpFormatter,
                                     usage=argparse.SUPPRESS)
    parser.add_argument('-c', required=True, type=str,
                        help=transform_utils.HELP)
    args = parser.parse_args()

    # 2. Load JSON configuration file
    if (not args.c) or (not os.path.exists(args.c)):
        raise transform_errors.ConfigFileError()

    # 3. Iterate through each transform procedure in the config file
    for config in transform_utils.load_config(args.c):
        output_file_prefix = transform_utils.get_output_file_path_with_name_prefix(config)
        custom_transform_funcs_module = transform_utils.load_custom_functions(config)
        row_idx_where_data_starts = transform_utils.get_row_index_where_data_starts(config)
        footer_rows_to_skip = transform_utils.get_number_of_rows_to_skip_from_bottom(config)

        for input_file in transform_utils.get_input_files(config):
            print("Processing file:", input_file)
            col_headers_from_input_file = transform_utils.get_raw_column_headers(input_file, config)
            if transform_utils.is_excel(input_file):
                sheet = transform_utils.get_sheet(config)
                print("Skipping this many rows (including header row) from the top of the file:",
                      row_idx_where_data_starts)
                cur_df = pd.read_excel(input_file,
                                       sheet_name=sheet,
                                       skiprows=row_idx_where_data_starts,
                                       skipfooter=footer_rows_to_skip,
                                       header=None,
                                       names=col_headers_from_input_file)

            custom_funcs_instance = custom_transform_funcs_module.TaskSpecificTransformFunctions()
            for func_and_params in transform_utils.get_functions_to_apply(config):
                print("=>Invoking transform function:", func_and_params)
                func_args = transform_utils.get_transform_function_args(func_and_params)
                func_kwargs = transform_utils.get_transform_function_kwargs(func_and_params)
                cur_df = getattr(custom_funcs_instance,
                                 transform_utils.get_transform_function_name(
                                     func_and_params))(cur_df, *func_args, **func_kwargs)
In budget_transform_functions.py file, I have:
class TaskSpecificTransformFunctions(TransformFunctions):
    def drop_unnamed_columns(self, df):
        """
        Drop columns that have 'Unnamed' as column header, which is a usual
        occurrence for some Excel/CSV raw data files with empty but hidden columns.

        Args:
            df: Raw dataframe to transform.

        Returns:
            Dataframe whose 'Unnamed' columns are dropped.
        """
        return df.loc[:, ~df.columns.str.contains(r'Unnamed')]

    def assert_number_of_columns_equals(self, df, num_of_cols_expected):
        """
        Assert that the total number of columns in the dataframe
        is equal to num_of_cols_expected (int).

        Args:
            df: Raw dataframe to transform.
            num_of_cols_expected: Number of columns expected (int).

        Returns:
            The original dataframe if the assertion is successful.

        Raises:
            ColumnCountMismatchError: If the number of columns found
            does not equal what is expected.
        """
        if df.shape[1] != num_of_cols_expected:
            raise transform_errors.ColumnCountError(
                ' '.join(["Expected column count of:", str(num_of_cols_expected),
                          "but found:", str(df.shape[1]), "in the current dataframe."])
            )
        else:
            print("Successfully checked that the current dataframe has:",
                  num_of_cols_expected, "columns.")
        return df
As you can see, I need future implementers of budget_transform_functions.py to be aware that the functions within TaskSpecificTransformFunctions must always return a pandas dataframe. I know that in Java you can create an interface, and whoever implements that interface has to abide by the return types of each method in that interface. I'm wondering if we have a similar construct (or a workaround to achieve a similar thing) in Python.
Hope this lengthy question makes sense, and I'm hoping someone with a lot more Python experience than I have will be able to teach me something about this. Thank you very much in advance for your answers/suggestions!
One way to check the return type of a function, at least at run time, is to wrap the function in another function that checks the return type. To automate this for subclasses, there is __init_subclass__. It can be used in the following way (polishing and handling of special cases still needed):
import pandas as pd

def wrapCheck(f):
    def checkedCall(*args, **kwargs):
        r = f(*args, **kwargs)
        if not isinstance(r, pd.DataFrame):
            raise Exception(f"Bad return value of {f.__name__}: {r!r}")
        return r
    return checkedCall

class TransformFunctions:
    def __init_subclass__(cls, **kwargs):
        super().__init_subclass__(**kwargs)
        for k, v in cls.__dict__.items():
            if callable(v):
                setattr(cls, k, wrapCheck(v))

class TryTransform(TransformFunctions):
    def createDf(self):
        return pd.DataFrame(data={"a": [1, 2, 3], "b": [4, 5, 6]})

    def noDf(self, a, b):
        return a + b

tt = TryTransform()
print(tt.createDf())  # Works
print(tt.noDf(2, 2))  # Fails with exception
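As one example of the polishing mentioned above, you would probably want to skip dunder methods and private helpers when wrapping; a sketch of that tweak (the leading-underscore rule is just one possible choice):

class TransformFunctions:
    def __init_subclass__(cls, **kwargs):
        super().__init_subclass__(**kwargs)
        for k, v in cls.__dict__.items():
            # Only wrap public callables; leave __init__ and _helpers untouched.
            if callable(v) and not k.startswith("_"):
                setattr(cls, k, wrapCheck(v))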
I'm writing an HTTP Request Handler with intuitive routing. My goal is to be able to apply a decorator to a function which states the HTTP method being used as well as the path to be listened on for executing the decorated function. Here's a sample of this implementation:
#route_handler("GET", "/personnel")
def retrievePersonnel():
return personnelDB.retrieveAll()
However, I also want to be able to add variables to the path. For example, /personnel/3 would fetch a personnel with an ID of 3. The way I want to go about doing this is providing a sort of 'variable mask' to the path passed into the route_handler. A new example would be:
#route_handler("GET", "/personnel/{ID}")
def retrievePersonnelByID(ID):
return personnelDB.retrieveByID(ID)
The decorator's purpose would be to compare the path literal (/personnel/3 for example) with the path 'mask' (/personnel/{ID}) and pass the 3 into the decorated function. I'm assuming the solution would be to compare the two strings, keep the differences, and place the difference in the literal into a variable named after the difference in the mask (minus the curly braces). But then I'd also have to check to see if the literal matches the mask minus the {} variable catchers...
tl;dr - is there a way to do
stringMask("/personnel/{ID}", "/personnel/5") -> True, {"ID": 5}
stringMask("/personnel/{ID}", "/flowers/5") -> False, {}
stringMask("/personnel/{ID}", "/personnel") -> False, {}
Since I'm guessing there isn't really an easy solution to this, I'm gonna post the solution I did. I was hoping there would be something I could do in a few lines, but oh well ¯\_(ツ)_/¯
def checkPath(self, mask):
    mask_parts = mask[1:].split("/")
    path_parts = self.path[1:].rstrip("/").split("/")
    if len(mask_parts) != len(path_parts):
        self.url_vars = {}
        return False
    vars = {}
    for i in range(len(mask_parts)):
        if mask_parts[i][0] == "{":
            vars[mask_parts[i][1:-1]] = path_parts[i]
        else:
            if mask_parts[i] != path_parts[i]:
                self.url_vars = {}
                return False
    self.url_vars = vars  # save extracted variables
    return True
A mask is just a string like one of the ones below:
/resource
/resource/{ID}
/group/{name}/resource/{ID}
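If you want the exact stringMask(mask, path) interface from the question, the same split-and-compare idea can be packaged as a standalone function; a rough sketch (extracted values come back as strings):

def stringMask(mask, path):
    mask_parts = mask[1:].split("/")
    path_parts = path[1:].rstrip("/").split("/")
    if len(mask_parts) != len(path_parts):
        return False, {}
    extracted = {}
    for mask_part, path_part in zip(mask_parts, path_parts):
        if mask_part.startswith("{") and mask_part.endswith("}"):
            extracted[mask_part[1:-1]] = path_part  # e.g. {"ID": "5"}
        elif mask_part != path_part:
            return False, {}
    return True, extracted

# stringMask("/personnel/{ID}", "/personnel/5") -> (True, {"ID": "5"})
# stringMask("/personnel/{ID}", "/flowers/5")   -> (False, {})
# stringMask("/personnel/{ID}", "/personnel")   -> (False, {})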
I am working on a code which takes a dataset and runs some algorithms on it.
User uploads a dataset, and then selects which algorithms will be run on this dataset and creates a workflow like this:
workflow = {
    0: {'dataset': 'some dataset'},
    1: {'algorithm1': "parameters"},
    2: {'algorithm2': "parameters"},
    3: {'algorithm3': "parameters"}
}
Which means I'll take workflow[0] as my dataset and run algorithm1 on it. Then I'll take its results and run algorithm2 on those results as my new dataset. Then I'll take the new results and run algorithm3 on them. It goes on like this until the last item, and there is no length limit for this workflow.
I am writing this in Python. Can you suggest some strategies about processing this workflow?
You want to run a pipeline on some dataset. That sounds like a reduce operation (fold in some languages). No need for anything complicated:
from functools import reduce
result = reduce(lambda data, step: algo_by_name(step[0])(step[1], data), workflow)
This assumes workflow looks like this (it's text-oriented, so you can load it with YAML/JSON):
workflow = ['data', ('algo0', {}), ('algo1', {'param': value}), … ]
And that your algorithms look like:
def algo0(p, data):
    …
    return output_data.filename
algo_by_name takes a name and gives you an algo function; for example:
def algo_by_name(name):
    return {'algo0': algo0, 'algo1': algo1}[name]
(old edit: if you want a framework for writing pipelines, you could use Ruffus. It's like a make tool, but with progress support and pretty flow charts.)
If each algorithm works on each element of the dataset, map() would be an elegant option:
dataset = workflow[0]
for algorithm in workflow[1:]:
    dataset = list(map(algorithm, dataset))
e.g. to square only the odd numbers (zeroing out the evens), use:
>>> algo1 = lambda x: 0 if x % 2 == 0 else x
>>> algo2 = lambda x: x * x
>>> dataset = range(10)
>>> workflow = (dataset, algo1, algo2)
>>> for algo in workflow[1:]:
...     dataset = list(map(algo, dataset))
...
>>> dataset
[0, 1, 0, 9, 0, 25, 0, 49, 0, 81]
The way you want to do it seems sound to me; otherwise you need to post more information about what you are trying to accomplish.
One piece of advice: I would put the workflow structure in a list of tuples rather than a dictionary:
workflow = [('dataset', 'some dataset'),
            ('algorithm1', "parameters"),
            ('algorithm2', "parameters"),
            ('algorithm3', "parameters")]
Define a Dataset class that tracks... data... for your set. Define methods in this class. Something like this:
class Dataset:
    # Some member fields here that define your data, and a constructor

    def algorithm1(self, param1, param2, param3):
        # Update member fields based on the algorithm
        ...

    def algorithm2(self, param1, param2):
        # More updating/processing
        ...
Now, iterate over your "workflow" dict. For the first entry, simply instantiate your Dataset class.
myDataset = Dataset() # Whatever actual construction you need to do
For each subsequent entry...
Extract the key/value somehow (I'd recommend changing your workflow data structure if possible; a dict is inconvenient here).
Parse the param string to a tuple of arguments (this step is up to you).
Assuming you now have the string algorithm and the tuple params for the current iteration...
getattr(myDataset, algorithm)(*params)
This will call the function on myDataset with the name specified by "algorithm" with the argument list contained in "params".
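Put together, the driving loop might look roughly like this (assuming the workflow has been converted to a list of (algorithm_name, param_string) tuples, and parse_params is a hypothetical helper that turns the parameter string into a tuple):

myDataset = Dataset()  # whatever actual construction you need to do

for algorithm, param_string in workflow[1:]:
    params = parse_params(param_string)  # hypothetical: returns a tuple of arguments
    getattr(myDataset, algorithm)(*params)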
Here is how I would do this (all code untested):
Step 1: You need to create the algorithms. The Dataset could look like this:
class Dataset(object):
    def __init__(self, dataset):
        self.dataset = dataset

    def __iter__(self):
        for x in self.dataset:
            yield x
Notice that you make an iterator out of it, so you iterate over it one item at a time. There's a reason for that, you'll see later:
Another algorithm could look like this:
class Multiplier(object):
    def __init__(self, previous, multiplier):
        self.previous = previous
        self.multiplier = multiplier

    def __iter__(self):
        for x in self.previous:
            yield x * self.multiplier
Step 2
Your user would then need to make a chain of these somehow. Now if they had access to Python directly, you could just do this:
dataset = Dataset(range(100))
multiplier = Multiplier(dataset, 5)
and then get the results by:
for x in multiplier:
    print(x)
And it would ask the multiplier for one piece of data at a time, and the multiplier would in turn ask the dataset. If you have a chain, this means that one piece of data is handled at a time, so you can handle huge amounts of data without using a lot of memory.
Step 3
Probably you want to specify the steps in some other way, for example a text file or a string (it sounds like this may be web-based?). Then you need a registry of the algorithms. The easiest way is to just create a module called "registry.py" like this:
algorithms = {}
Easy, eh? You would register a new algorithm like so:
from registry import algorithms
algorithms['dataset'] = Dataset
algorithms['multiplier'] = Multiplier
You'd also need a method that creates the chain from specifications in a text file or something. I'll leave that up to you. ;)
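As a rough illustration of that last piece, a chain builder over a simple spec list could look something like this (untested sketch, in the same spirit as the code above):

from registry import algorithms

def build_chain(spec):
    # spec example: [('dataset', range(100)), ('multiplier', 5)]
    name, arg = spec[0]
    chain = algorithms[name](arg)
    for name, arg in spec[1:]:
        chain = algorithms[name](chain, arg)
    return chain

for x in build_chain([('dataset', range(10)), ('multiplier', 5)]):
    print(x)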
(I would probably use the Zope Component Architecture and make algorithms components and register them in the component registry. But that is all strictly speaking overkill).