How to map dataframe rows into a custom object? - python

I am trying to map the individual rows of a dataframe into a custom object. The dataframe consists of multiple molecules that interact with a specific target. Additionally, multiple molecular descriptors are given. A slice is given below:
Now I need to map each instance into a Molecule object defined as something like this:
class Molecule:
    allDescriptorKeys = []
    def __init__(self, smiles, target, values):
        self.smiles = smiles
        self.target = target
        self.d = {}
        for i in range(len(Molecule.allDescriptorKeys)):
            self.d[Molecule.allDescriptorKeys[i]] = values[i]
Where the allDescriptorKeys class variable is set from outside the class using
def initdescriptorkeys(df):
    Molecule.allDescriptorKeys = df.keys().values
Now I need a class function readMolDescriptors that reads in the descriptors of a single molecule (row/instance), so that I can later use it in an external method to loop over the whole dataframe. I guess I need something like this:
def readMolDescriptors(self, index):
    smiles = df.iloc[index]["SMILES"]
    target = df.iloc[index]["Target"]
    values = df.iloc[index][2:-1]
    newMolecule = Molecule(smiles, target, values)
    return newMolecule
But of course this is not a class function since the df is defined outside the class. I have a hard time wrapping my head around this, probably easy, problem. Hope someone can help.

It seems that you want to build a class from which you build a new instance for each row of the dataframe, and after that you want to get rid of the dataframe and play with those Molecule instances alone. Consider this:
class Molecule:
    def __init__(self, data_row):
        ''' data_row: pd.Series. '''
        self.smiles = data_row['SMILES']
        # more self.xxx = data_row['xxx']
        self.d = data_row.to_dict()
With this you can create an object of Molecule using a data row. For example,
molecules = [Molecule(data_row) for index, data_row in df.iterrows()]
To access a certain descriptor (e.g. nAT) value from the first molecule, you may do
print(molecules[0].d['nAT'])
although you could also define a dedicated method on the class to handle access like that.
Of course, if you want something like readMolDescriptors, here is my version:
def build_molecule_from_dataframe(df, index):
    return Molecule(df.loc[index])
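One way to keep the row-reading logic inside the class, as the question asks, is an alternative constructor written as a classmethod. The following is only a sketch under some assumptions: from_row is a hypothetical name, the MW column and all values are made up for illustration, and plain dictionaries stand in for dataframe rows (with pandas, df.to_dict("records") should produce the same shape).

```python
from collections.abc import Mapping


class Molecule:
    """One molecule, built from any mapping of column name -> value."""

    def __init__(self, smiles, target, descriptors):
        self.smiles = smiles
        self.target = target
        self.d = dict(descriptors)

    @classmethod
    def from_row(cls, row: Mapping):
        # Hypothetical alternative constructor: every column except the two
        # identifying ones is treated as a descriptor.
        descriptors = {k: v for k, v in row.items() if k not in ("SMILES", "Target")}
        return cls(row["SMILES"], row["Target"], descriptors)


# Made-up records standing in for df.to_dict("records").
records = [
    {"SMILES": "CCO", "Target": "T1", "nAT": 9, "MW": 46.07},
    {"SMILES": "C", "Target": "T1", "nAT": 5, "MW": 16.04},
]
molecules = [Molecule.from_row(r) for r in records]
print(molecules[0].d["nAT"])
```

Because the classmethod receives the row as an argument, it no longer depends on a df variable defined outside the class.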

How to implement custom naming for multioutput primitives in FeatureTools

As of version v0.12.0, FeatureTools allows you to assign custom names to multi-output primitives: https://github.com/alteryx/featuretools/pull/794. By default, when you define custom multi-output primitives, the column names for the generated features are suffixed with [0], [1], [2], etc. So let us say that I have the following code to define a multi-output primitive:
def sine_and_cosine_datestamp(column):
    """
    Returns the Sin and Cos of the hour of datestamp
    """
    sine_hour = np.sin(column.dt.hour)
    cosine_hour = np.cos(column.dt.hour)
    ret = [sine_hour, cosine_hour]
    return ret
Sine_Cosine_Datestamp = make_trans_primitive(function = sine_and_cosine_datestamp,
                                             input_types = [vtypes.Datetime],
                                             return_type = vtypes.Numeric,
                                             number_output_features = 2)
In the dataframe generated from DFS, the names of the two generated columns will be SINE_AND_COSINE_DATESTAMP(datestamp)[0] and SINE_AND_COSINE_DATESTAMP(datestamp)[1]. In actuality, I would have liked the names of the columns to reflect the operations being taken on the column. So I would have liked the column names to be something like SINE_AND_COSINE_DATESTAMP(datestamp)[sine] and SINE_AND_COSINE_DATESTAMP(datestamp)[cosine]. Apparently you have to use the generate_names method in order to do so. I could not find anything online to help me use this method and I kept running into errors. For example, when I tried the following code:
def sine_and_cosine_datestamp(column, string = ['sine, cosine']):
    """
    Returns the Sin and Cos of the hour of the datestamp
    """
    sine_hour = np.sin(column.dt.hour)
    cosine_hour = np.cos(column.dt.hour)
    ret = [sine_hour, cosine_hour]
    return ret
def sine_and_cosine_generate_names(self, base_feature_names):
    return u'STRING_COUNT(%s, "%s")' % (base_feature_names[0], self.kwargs['string'])
Sine_Cosine_Datestamp = make_trans_primitive(function = sine_and_cosine_datestamp,
                                             input_types = [vtypes.Datetime],
                                             return_type = vtypes.Numeric,
                                             number_output_features = 2,
                                             description = "For each value in the base feature"
                                             "outputs the sine and cosine of the hour, day, and month.",
                                             cls_attributes = {'generate_names': sine_and_cosine_generate_names})
I got an assertion error. What's even more perplexing to me is that when I went into the transform_primitive_base.py file in the featuretools/primitives/base folder, I saw that the generate_names function looks like this:
def generate_names(self, base_feature_names):
    n = self.number_output_features
    base_name = self.generate_name(base_feature_names)
    return [base_name + "[%s]" % i for i in range(n)]
In the function above, it looks like there is no way that you can generate custom primitive names since it uses the base_feature_names and the number of output features by default. Any help would be appreciated.
Thanks for the question! This feature hasn't been documented well.
The main issue with your code was that sine_and_cosine_generate_names should return a list of strings, one for each output column, rather than a single string.
It looks like you were adapting the StringCount example from the docs -- I think for this primitive it would be less error-prone to always use "sine" and "cosine" for the custom names, and remove the optional string argument from sine_and_cosine_datestamp. I also updated the feature name text to match your desired text.
After these changes:
def sine_and_cosine_datestamp(column):
    """
    Returns the Sin and Cos of the hour of the datestamp
    """
    sine_hour = np.sin(column.dt.hour)
    cosine_hour = np.cos(column.dt.hour)
    ret = [sine_hour, cosine_hour]
    return ret
def sine_and_cosine_generate_names(self, base_feature_names):
    template = 'SINE_AND_COSINE_DATESTAMP(%s)[%s]'
    return [template % (base_feature_names[0], string) for string in ['sine', 'cosine']]
This created feature column names like SINE_AND_COSINE_DATESTAMP(order_date)[sine]. No changes were necessary to the actual make_trans_primitive call.
You wrote: "In the function above, it looks like there is no way that you can generate custom primitive names since it uses the base_feature_names and the number of output features by default."
That is the default generate_names function for transform primitives. Since we are assigning this custom generate_names function to Sine_Cosine_Datestamp, the default will not be used.
Hope that helps, let me know if you still have questions!
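To see why the override replaces the default naming, here is a small plain-Python sketch that does not import FeatureTools at all. FakeTransformPrimitive and its generate_name method are hypothetical stand-ins, not the library's classes; only the body of the default generate_names is copied from the question, and the sketch assumes that cls_attributes ends up setting attributes on the generated class.

```python
class FakeTransformPrimitive:
    """Hypothetical stand-in for a transform primitive base class."""
    name = "sine_and_cosine_datestamp"
    number_output_features = 2

    def generate_name(self, base_feature_names):
        # Assumed single-output naming, for illustration only.
        return "%s(%s)" % (self.name.upper(), ", ".join(base_feature_names))

    def generate_names(self, base_feature_names):
        # Default behaviour quoted in the question: suffix with [0], [1], ...
        n = self.number_output_features
        base_name = self.generate_name(base_feature_names)
        return [base_name + "[%s]" % i for i in range(n)]


def sine_and_cosine_generate_names(self, base_feature_names):
    # Must return one name per output column, in output order.
    template = "SINE_AND_COSINE_DATESTAMP(%s)[%s]"
    return [template % (base_feature_names[0], s) for s in ["sine", "cosine"]]


# Building a subclass with the method overridden mimics what passing
# cls_attributes={'generate_names': ...} is assumed to do.
CustomPrimitive = type(
    "CustomPrimitive",
    (FakeTransformPrimitive,),
    {"generate_names": sine_and_cosine_generate_names},
)

default_names = FakeTransformPrimitive().generate_names(["datestamp"])
custom_names = CustomPrimitive().generate_names(["datestamp"])
print(default_names)
print(custom_names)
```

The subclass attribute shadows the inherited method, so the index-based suffixes are never produced for the custom primitive.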

How to get class instance variables, python

I would like to get the names of __init__ parameters and modify them when the code runs. My class looks like this:
class Sample:
    def __init__(self, indicators: dict):
        self.names = []
        self.returns = 0.0
        for k, v in indicators.items():
            setattr(self, k, v)
            self.names.append(k)
The input of this class is a random choice of items from a list; then I assign those random items to a dictionary with integer values.
indicatorsList = ["SMA", "WMA", "EMA", "STOCHASTIC", "MACD", "HIGHEST_HIGH",
                  "HIGHEST_LOW", "HIGHEST_CLOSE", "LOWEST_HIGH", "LOWEST_LOW",
                  "LOWEST_CLOSE", "ATR", "LINGRES", "RSI", "WRSI", "ROC",
                  "DAY", "MONTH"]
# initializing the value of n
n = random.randint(2, int(math.ceil(len(indicatorsList)/2)))
randomIndList = n * [None]
for i in range(n):
    choice = random.choice(indicatorsList)
    randomIndList[i] = choice
...
...
sample = Sample(randDict)
Problem is, I don't know the names of these parameters in __init__, and I need to modify them later, for example like this:
sample.sma = random.randint(0, maxVal)
But I don't know if the object will have sma, or ema, or any other attribute, because of the way they're assigned randomly.
First of all, this code:
sample.sma = random.randint(0, maxVal)
will work, even if sample doesn't have an sma attribute. It will create one. Try it yourself and see.
But as you specified in your comment that you only want to modify attributes that already exist, that won't help in this case.
What you could do, with your existing class definition, is to loop over the names attribute you've already defined.
for name in sample.names:
    setattr(sample, name, random.randint(0, maxVal))
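If you need to update a single attribute only when it already exists, a hasattr check is one option. This is a sketch, not part of the original class: set_if_present is a hypothetical helper, and the indicator values are made up.

```python
class Sample:
    def __init__(self, indicators: dict):
        self.names = []
        self.returns = 0.0
        for k, v in indicators.items():
            setattr(self, k, v)
            self.names.append(k)


def set_if_present(obj, name, value):
    """Hypothetical helper: update an attribute only if it already exists."""
    if hasattr(obj, name):
        setattr(obj, name, value)
        return True
    return False


sample = Sample({"SMA": 10, "RSI": 14})
updated = set_if_present(sample, "SMA", 20)   # attribute exists, so it is changed
skipped = set_if_present(sample, "EMA", 5)    # attribute missing, nothing is created
print(updated, skipped, sample.SMA)
```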
However, you've basically reinvented a dictionary here, so why not redefine your class to directly use a dictionary?
class Sample:
    def __init__(self, indicators: dict):
        self.indicators = indicators
Now you no longer need dynamic setattr or getattr lookups. They're simply keys and values:
for key in sample.indicators:
    sample.indicators[key] = random.randint(0, maxVal)
(This also means you don't need the separate names attribute.)
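Put together, the dictionary-based version might look like the following sketch. The randomize method and the shortened indicator list are assumptions for illustration, and random.sample is used here in place of the question's random.choice loop so that no indicator is picked twice.

```python
import random


class Sample:
    def __init__(self, indicators: dict):
        # Copy the dict so later changes do not affect the caller's object.
        self.indicators = dict(indicators)
        self.returns = 0.0

    def randomize(self, max_val):
        # Hypothetical method: reassign every existing indicator value.
        # Only values change, so iterating over the keys here is safe.
        for key in self.indicators:
            self.indicators[key] = random.randint(0, max_val)


indicators_list = ["SMA", "WMA", "EMA", "RSI", "ROC"]
chosen = random.sample(indicators_list, k=3)
sample = Sample({name: 0 for name in chosen})
sample.randomize(max_val=50)
print(sample.indicators)
```

No attribute names need to be known in advance: whatever keys were chosen at construction time are the only ones that get modified.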

apply python class methods on list of instances

I recently moved from Matlab to Python and want to transfer some Matlab code to Python. However an obstacle popped up.
In Matlab you can define a class with its methods and create nd-arrays of instances. The nice thing is that you can apply the class methods to the array of instances as long as the method is written so it can deal with the arrays. Now in Python I found that this is not possible: when applying a class method to a list of instances it will not find the class method. Below is an example of how I would write the code:
class testclass():
    def __init__(self, data):
        self.data = data
    def times5(self):
        return testclass(self.data * 5)
classlist = [testclass(1), testclass(10), testclass(100)]
times5(classlist)
This will give an error on the times5(classlist) line. Now this is a simple example explaining what I want to do (the final class will have multiple numpy arrays as variables).
What is the best way to get this kind of functionality in Python? The reason I want to do this is because it allows batch operations and they make the class a lot more powerful. The only solution I can think of is to define a second class that has a list of instances of the first class as variables. The batch processing would need to be implemented in the second class then.
thanks!
UPDATE:
In your comment, I noticed this sentence:
"For example a function that takes the data of the first class in the list and subtracts the data of all following classes."
This can be solved with the reduce function.
from functools import reduce
class testclass():
    def __init__(self, data):
        self.data = data
    def times5(self):
        return testclass(self.data * 5)
classlist = [x.data for x in [testclass(1), testclass(10), testclass(100)]]
result = reduce(lambda x, y: x - y, classlist[1:], classlist[0])
print(result)
ORIGINAL ANSWER:
In fact, what you need is a list comprehension.
Let me show you the code:
class testclass():
    def __init__(self, data):
        self.data = data
    def times5(self):
        return testclass(self.data * 5)
classlist = [testclass(1), testclass(10), testclass(100)]
results = [x.times5() for x in classlist]
print(results)
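The question also mentions the idea of a second class that holds a list of instances. A minimal sketch of that approach could look like this; Measurement and MeasurementList are hypothetical names, and the __repr__ method is added only so that printing shows the data instead of object addresses.

```python
class Measurement:
    def __init__(self, data):
        self.data = data

    def times5(self):
        return Measurement(self.data * 5)

    def __repr__(self):
        return "Measurement(%r)" % (self.data,)


class MeasurementList:
    """Hypothetical wrapper that forwards a method call to every element."""

    def __init__(self, items):
        self.items = list(items)

    def times5(self):
        # The batch version is just the list comprehension, kept in one place.
        return MeasurementList(x.times5() for x in self.items)


batch = MeasurementList([Measurement(1), Measurement(10), Measurement(100)])
result = batch.times5()
print(result.items)  # [Measurement(5), Measurement(50), Measurement(500)]
```

If the data are numpy arrays anyway, storing one array inside a single instance and letting numpy broadcast may be simpler than a list of instances, but that depends on the final class design.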

How can I use the value of a variable in the name of another without using a dictionary in python?

The answer people have already given for using the value of a variable in the assignment of another is:
to create a dictionary and,
use dict[oldVariable] instead of defining a new one
I don't think that works in the context of what I'm trying to do...
I'm trying to define a class for a vector which would take a list as an input and assign an entry in the vector for each element of the list.
My code looks something like this right now:
class vector:
    def __init__(self, entries):
        for dim in range(len(entries)):
            for entry in entries:
                self.dim = entry  # here I want to assign self.1, self.2, etc all the way to however
                                  # many elements are in entries, but I can't replace self.dim with
                                  # dict[dim]
    def __str__(self):
        string = []
        for entry in range(1,4):
            string.append(self.entry)
        print(string)
How do I do this?
What you are doing here is a bit strange, since you are using a variable named "dim" in a for, but you do not do anything with that variable. It looks like you want to use a class as if it was an array... why don't you define an array within the class and access it from the outside with the index? v.elements[1] ... and so on?
Example:
class Vector:
    def __init__(self, entries):
        self.elements = []
        for e in entries:
            # transform e here first if the entries need any processing
            self.elements.append(e)
    def __str__(self):
        buff = ''
        for e in self.elements:
            buff += str(e)
        return buff
Hope this helps.
If I'm reading your question correctly, I think you're looking for the setattr function (https://docs.python.org/2/library/functions.html#setattr).
If you wanted to name the fields with a particular string value, such as each entry's position, you could just do this:
class vector:
    def __init__(self, entries):
        for dim, entry in enumerate(entries, start=1):
            # self.dim = entry would always write the attribute literally named "dim"
            setattr(self, str(dim), entry)
That will result in your object having attributes named '1', '2', and so on, one per element of entries, each holding the corresponding entry.
That being said, be aware that an integer value is generally a poor attribute name. Writing obj.1 is a syntax error, so you'd have to do getattr(obj, '1').
I agree with @Ricardo that you are going about this strangely and you should probably rethink how you're structuring this class, but I wanted to directly answer the question in case others land here looking for how to do dynamic naming.
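Both suggestions can be combined in one small sketch: keep the entries in a list for indexed access, and, if named attributes are really wanted, give them a letter prefix so they are valid identifiers. The x1, x2, ... naming scheme is an assumption made for this example, not something from the question.

```python
class Vector:
    def __init__(self, entries):
        self.elements = list(entries)
        # Optional dynamic attributes x1, x2, ... (hypothetical naming scheme).
        for dim, entry in enumerate(self.elements, start=1):
            setattr(self, "x%d" % dim, entry)

    def __getitem__(self, index):
        # Allows v[0], v[1], ... like a normal sequence.
        return self.elements[index]

    def __str__(self):
        # __str__ must return a string rather than print one.
        return str(self.elements)


v = Vector([3, 4, 5])
print(v.x1, getattr(v, "x2"), v[2])  # 3 4 5
print(v)  # [3, 4, 5]
```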

Getting all possible value combinations

For an automatic test thing, I have a class with ~15 parameters. I want to automatically generate instances of the class for every possible value combination. For instance, if the class was defined like so:
class meep():
    def __init__(self):
        self.par1 = 0    # can be in range {0-3}
        self.par2 = 1    # can be in range {1-2}
        self.par3 = 'a'  # can be in range {a-c}
What is the most efficient way to get instances of it with all possible value combinations? (i.e.
inst1=(par1=0,par2=1,par3=a),
inst2=(par1=0,par2=1,par3=b),
inst3=(par1=0,par2=1,par3=c),
inst4=(par1=1,par2=1,par3=a),
inst5=(par1=1,par2=1,par3=b),
inst6=(par1=1,par2=1,par3=c),
etc.)
Use itertools.product().
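As a rough sketch of how that could be applied to the class from the question: the version below assumes the constructor is changed to accept the parameters as keyword arguments, and param_values is a hypothetical dictionary listing the allowed values for each one.

```python
from itertools import product


class Meep:
    def __init__(self, par1=0, par2=1, par3="a"):
        self.par1 = par1
        self.par2 = par2
        self.par3 = par3


# Allowed values per parameter (ranges taken from the comments in the question).
param_values = {
    "par1": range(0, 4),   # 0-3
    "par2": range(1, 3),   # 1-2
    "par3": "abc",         # a-c
}

names = list(param_values)
instances = [
    Meep(**dict(zip(names, combo)))
    for combo in product(*param_values.values())
]
print(len(instances))  # 4 * 2 * 3 = 24
```

product varies the last parameter fastest, so the first three instances differ only in par3. With about 15 parameters the number of combinations grows multiplicatively, so iterating over product lazily instead of building the full list may be preferable.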
