Populating an object from dataframe - python

Currently trying to implement a genetic algorithm. I have built a Python class Gene, and I am trying to load Gene objects from a dataframe df:
class Gene:
    def __init__(self, id, nb_trax, nb_days):
        self.id = id
        self.nb_trax = nb_trax
        self.nb_days = nb_days
and then create a second class Chromosome with 20 Gene objects as its property:
class Chromosome(object):
    def __init__(self):
        self.port = [Gene() for id in range(20)]
This is the dataframe:
ID     nb_obj  nb_days
ECGYE  10259   62.965318
NLRTM   8007   46.550562
I successfully loaded the Gene using
tester=df.apply(lambda row: Gene(row['Injection Port'],row['Avg Daily Injection'],random.randint(1,10)), axis=1)
But I cannot load the Chromosome class using
f=Chromosome(tester)
I get this error
Traceback (most recent call last):
File "chrom.py", line 27, in <module>
f=Chromosome(tester)
TypeError: __init__() takes 1 positional argument but 2 were given
Any help please?

The error message counts self as the one positional argument: your Chromosome.__init__ is defined with no parameters besides self, so passing tester supplies a second argument.
Secondly, what you get from the apply on df in tester is a Series indexed like df whose values are Gene objects, not a DataFrame.
To solve this, make __init__ accept the genes and store them:
class Chromosome(object):
    def __init__(self, genes):
        # genes is the Series of Gene objects produced by df.apply
        self.port = list(genes)
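Putting the fix together, a minimal runnable sketch (using the shortened column names from the sample frame above; the real code uses 'Injection Port' and 'Avg Daily Injection'):

```python
import random
import pandas as pd

class Gene:
    def __init__(self, id, nb_trax, nb_days):
        self.id = id
        self.nb_trax = nb_trax
        self.nb_days = nb_days

class Chromosome(object):
    def __init__(self, genes):
        # genes is the Series of Gene objects produced by df.apply
        self.port = list(genes)

df = pd.DataFrame({'ID': ['ECGYE', 'NLRTM'],
                   'nb_obj': [10259, 8007],
                   'nb_days': [62.965318, 46.550562]})

# apply with axis=1 returns a Series whose values are Gene objects
tester = df.apply(lambda row: Gene(row['ID'], row['nb_obj'],
                                   random.randint(1, 10)), axis=1)
f = Chromosome(tester)
print([g.id for g in f.port])  # ['ECGYE', 'NLRTM']
```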

Related

Class method called in __init__ not giving same output as the same function used outside the class

I'm sure I'm missing something in how classes work here, but basically this is my class:
import pandas as pd
import numpy as np
import scipy
# example DF with OHLC columns and 100 rows
gold = pd.DataFrame({'Open': [i for i in range(100)], 'Close': [i for i in range(100)], 'High': [i for i in range(100)], 'Low': [i for i in range(100)]})
class Backtest:
    def __init__(self, ticker, df):
        self.ticker = ticker
        self.df = df
        self.levels = pivot_points(self.df)
    def pivot_points(self, df, period=30):
        highs = scipy.signal.argrelmax(df.High.values, order=period)
        lows = scipy.signal.argrelmin(df.Low.values, order=period)
        return list(df.High[highs[0]]) + list(df.Low[lows[0]])
inst = Backtest('gold', gold)  # gold is a Pandas DataFrame with Open High Low Close columns and data
inst.levels  # this gives me the whole dataframe (inst.df) instead of the expected output of pivot_points (a list of numbers)
The problem is that inst.levels returns the whole DataFrame instead of the return value of pivot_points (which is supposed to be a list of numbers).
When I called pivot_points on the same DataFrame outside the class, I got the list I expected, so I expected self.levels to hold that result after __init__ ran; instead I got the entire DataFrame.
You would have to call pivot_points() as self.pivot_points().
And there is no need for period as an argument if you never change it; if you do, it's fine as it is.
I'm not sure if this helps, but here are some tips about your class:
class Backtest:
    def __init__(self, ticker, df):
        self.ticker = ticker
        self.df = df
        # no need to define an instance variable here, you can access the method directly
        # self.levels = pivot_points(self.df)
    def pivot_points(self):
        period = 30
        # period is a local variable of pivot_points, so it can be accessed directly
        print(f'period inside Backtest.pivot_points: {period}')
        # df is an instance variable and can be accessed in any method of Backtest after instantiation
        print(f'self.df inside Backtest.pivot_points(): {self.df}')
        # to get any values out of pivot_points, we return some calculations
        return 1 + 1
    # if you do need an attribute like inst.level, you can create a property
    @property
    def level(self):
        return self.pivot_points()
gold = 'some data'
inst = Backtest('gold', gold)  # gold would be a Pandas DataFrame with Open High Low Close columns and data
print(f'inst.pivot_points() outside the class: {inst.pivot_points()}')
print(f'inst.level outside the class: {inst.level}')
This would be the result:
period inside Backtest.pivot_points: 30
self.df inside Backtest.pivot_points(): some data
inst.pivot_points() outside the class: 2
period inside Backtest.pivot_points: 30
self.df inside Backtest.pivot_points(): some data
inst.level outside the class: 2
Thanks to the commenter Henry Ecker I found that I had a function with the same name defined elsewhere in the file, whose output is the df. After changing that, my original code works as expected.
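For reference, a minimal sketch of the working pattern: pivot_points is called through self, and the scipy extrema search is replaced here by a naive neighbour comparison purely so the example runs without scipy:

```python
import pandas as pd

class Backtest:
    def __init__(self, ticker, df):
        self.ticker = ticker
        self.df = df
        self.levels = self.pivot_points(self.df)  # method call via self.

    def pivot_points(self, df):
        # naive local extrema: strictly above/below both neighbours
        highs = [df.High[i] for i in range(1, len(df) - 1)
                 if df.High[i - 1] < df.High[i] > df.High[i + 1]]
        lows = [df.Low[i] for i in range(1, len(df) - 1)
                if df.Low[i - 1] > df.Low[i] < df.Low[i + 1]]
        return highs + lows

df = pd.DataFrame({'High': [1, 3, 2, 5, 4], 'Low': [0, 2, 1, 4, 3]})
inst = Backtest('gold', df)
print(inst.levels)  # [3, 5, 1]
```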

Slicing Pandas DataFrames without losing the DataFrames attributes

I like to store metadata about a dataframe by simply setting an attribute and its corresponding value, like this:
df.foo = "bar"
However, I've found that attributes stored like this are gone once I slice the dataframe:
df.foo = "bar"
df[:100].foo
Traceback (most recent call last):
File "<input>", line 1, in <module>
File "C:\Users\admin\PycharmProjects\project\venv\lib\site-packages\pandas\core\generic.py", line 5465, in __getattr__
return object.__getattribute__(self, name)
AttributeError: 'DataFrame' object has no attribute 'foo'
I wonder if this behavior can be changed, similar to how drop=True or inplace=True change the way attributes like df.set_index(args) work. I didn't find anything helpful in the pandas docs.
For many operations, pandas returns a new object, so any attributes you have defined that aren't natively supported by the pd.DataFrame class will not persist.
A simple alternative is to subclass DataFrame. Be sure to add the attribute to _metadata, or it won't persist:
import pandas as pd
class MyDataFrame(pd.DataFrame):
    # temporary properties
    _internal_names = pd.DataFrame._internal_names
    _internal_names_set = set(_internal_names)
    # normal properties
    _metadata = ["foo"]
    @property
    def _constructor(self):
        return MyDataFrame
df = MyDataFrame({'data': range(10)})
df.foo = 'bar'
df[:100].foo
# 'bar'
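A quick check that the subclassed attribute survives operations that would drop it on a plain DataFrame (behaviour as of pandas 1.x; propagation through every operation is not guaranteed):

```python
import pandas as pd

class MyDataFrame(pd.DataFrame):
    _metadata = ["foo"]

    @property
    def _constructor(self):
        # slices and copies are rebuilt as MyDataFrame, carrying _metadata
        return MyDataFrame

df = MyDataFrame({'data': range(10)})
df.foo = 'bar'
print(df[:5].foo)     # survives slicing
print(df.copy().foo)  # survives copying
```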

create PySpark Dataframe column based on class method

I have a python class and it has functions like below:
class Features():
    def __init__(self, json):
        self.json = json
    def get_email(self):
        email = self.json.get('fields', {}).get('email', None)
        return email
And I am trying to use the get_email function in a PySpark dataframe to create a new column based on another column, "raw_json", which contains a JSON value:
df = data.withColumn('email', (F.udf(lambda j: Features.get_email(json.loads(j)), t.StringType()))('raw_json'))
So the ideal pyspark dataframe looks like below:
+-----------+-----------+
| raw_json  | email     |
+-----------+-----------+
|           |           |
|           |           |
+-----------+-----------+
But I am getting an error saying:
TypeError: unbound method get_email() must be called with Features instance as first argument (got dict instance instead)
How can I achieve this?
I have seen a similar question asked before but it was not resolved.
I guess you have misunderstood how classes are used in Python. You're probably looking for this instead:
udf = F.udf(lambda j: Features(json.loads(j)).get_email())
df = data.withColumn('email', udf('raw_json'))
where you instantiate a Features object and call the get_email method of the object.
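The same distinction can be seen in plain Python, without Spark (hypothetical sample JSON string):

```python
import json

class Features():
    def __init__(self, json):
        self.json = json

    def get_email(self):
        return self.json.get('fields', {}).get('email', None)

raw = '{"fields": {"email": "a@b.com"}}'
# Features.get_email(json.loads(raw)) would fail: it passes a plain dict
# where an instance of Features is expected.
email = Features(json.loads(raw)).get_email()  # instantiate, then call
print(email)  # a@b.com
```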

using assign method to add a column to an already-existing table

Below is the problem, the code and the error that arises. top_10_movies has two columns, which are rating and name.
import babypandas as bpd
top_10_movies = bpd.DataFrame().assign(
    Rating=top_10_movie_ratings,
    Name=top_10_movie_names
)
top_10_movies
You can use the assign method to add a column to an already-existing
table, too. Create a new DataFrame called with_ranking by adding a
column named "Ranking" to the table in top_10_movies
import babypandas as bpd
Ranking = my_ranking
with_ranking = top_10_movies.assign(Ranking)
TypeError Traceback (most recent call last)
<ipython-input-41-a56d9c05ae19> in <module>
1 import babypandas as bpd
2 Ranking = my_ranking
----> 3 with_ranking = top_10_movies.assign(Ranking)
TypeError: assign() takes 1 positional argument but 2 were given
While using assign, the new column needs a name to assign to, so you can do:
with_ranking = top_10_movies.assign(Ranking=Ranking)
Here's a simple example to check:
import pandas as pd
df = pd.DataFrame({'col': ['a', 'b']})
ranks = [1, 2]
df.assign(ranks)        # raises the same TypeError
df.assign(rank=ranks)   # works
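Since assign takes the column as a keyword argument, a column name that is only known as a string can be supplied by unpacking a dict:

```python
import pandas as pd

df = pd.DataFrame({'col': ['a', 'b']})
ranks = [1, 2]
name = 'Ranking'
out = df.assign(**{name: ranks})  # same as df.assign(Ranking=ranks)
print(out.columns.tolist())  # ['col', 'Ranking']
```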

Pandas Data Frame is not correctly identified: Instance of 'tuple' has no 'filter' member

I am writing a class that uses pandas functionality. As an input I have a pandas dataframe, but Python does not seem to recognize it correctly.
import pandas as pd
class box:
    def __init__(self, dataFrame, pers, limit):
        self.df = dataFrame,
        self.pers = pers,
        self.data = limit
    def cleanDataset(self):
        persDf = self.df.filter(regex=('^' + self.pers + r'[1-9]$'))
        persDf.replace({'-': None})
self.df.filter(...) gives me the warning: Instance of 'tuple' has no 'filter' member. I have found this but cannot apply the solution, since the problem is not caused by Django.
Can anyone help me out here?
Your problem is the comma at the end of self.df = dataFrame, (and self.pers = pers,). The comma isn't necessary here.
The trailing comma makes Python define self.df as a tuple with one member. To check this, create a box object b and try print(type(b.df)); it will return <class 'tuple'>.
Remove the commas after the attribute definitions:
class box:
    def __init__(self, dataFrame, pers, limit):
        self.df = dataFrame
        self.pers = pers
        self.data = limit
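A quick demonstration of what the trailing comma does, and that removing it restores the expected type:

```python
import pandas as pd

df = pd.DataFrame({'pers1': ['-', 'x']})

with_comma = df,     # trailing comma -> a 1-tuple containing the DataFrame
without_comma = df   # plain assignment -> the DataFrame itself

print(type(with_comma))     # <class 'tuple'>
print(type(without_comma))  # <class 'pandas.core.frame.DataFrame'>
```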
