Create class that appends all entries to a DataFrame

I am having trouble with the following task.
I need to create a class that accommodates the student id, name and grade of every student. My idea was to create an empty DataFrame and append to it the values I add to the class.
I came up with the code below.
class Student:
    students_archive = pd.DataFrame(index = 'student_id', columns = ['student_id', 'name', 'grade'])
    def __init__(self, s_id, name, grade):
        self.s_id = s_id
        self.name = name
        self.grade = grade
        st = {'student_id': self.s_id,'name': self.name, 'grade': self.grade}
        pd.concat([Student.students_archive, st])
I am, however, getting the following error:
If using all scalar values, you must pass an index
I don't really understand what's wrong and I have looked all around; can anybody help me? Thanks
I also can't help but think mine is the wrong approach, since the task doesn't actually specify that it needs to be a dataframe; it just says that I have to 'create a class that accommodates students' name, grade, and id, and create methods to add, remove or upgrade the students' values'. Perhaps I can do all of that without creating a dataframe?
Thank you

I can't comment yet, so here's an answer.
Running the code shows me 2 errors:
index = 'student_id' must be list-like, for example index = ['student_id'].
pd.concat([Student.students_archive, st]) does not accept a dictionary; it expects pandas objects, as in pd.concat([dataframe_0, dataframe_1]).
Also, I think it's better to create your dataframe inside __init__().
So it should result in something like:
class Student:
    def __init__(self, s_id, name, grade):
        self.s_id = s_id
        self.name = name
        self.grade = grade
        self.students_archive = pd.DataFrame(index = ['student_id'], columns = ['student_id', 'name', 'grade'])
        temp_df = pd.DataFrame.from_dict({'student_id': [self.s_id],'name': [self.name], 'grade': [self.grade]})
        new_temp_df = pd.concat([self.students_archive, temp_df])
But keep in mind that temp_df and new_temp_df are local variables and will be discarded when __init__() returns, so keep what you need by assigning it to an attribute on self.
As I understand it:
template = pd.DataFrame(index = ['student_id'], columns = ['student_id', 'name', 'grade'])
entries = pd.DataFrame.from_dict({'student_id': [self.s_id],'name': [self.name], 'grade': [self.grade]})
self.student_df = pd.concat([template, entries])
UPDATE:
Having read your comment and the code you linked, it seems like instantiating each student is out of scope. You seem to be on the right track though.
I would still create the main dataframe inside __init__(); it keeps things tidier imho. You can still access it from outside the class (more below).
Incorporate the functionality needed through methods (add/remove/update), which you have started doing.
Whether this approach is correct or not is probably up to your professor and whether they forbid libraries like pandas. I don't see anything wrong with it, as it gives you much of what's needed.
So, in code I would suggest something like this:
class Students:
    def __init__(self):
        # keep in mind that pandas will add another column to the left for row id (0, 1, 2, etc..)
        self.df = pd.DataFrame(columns = ['student_id', 'name', 'grade'])
    def add_student(self, s_id, nm, gr):
        # DataFrame.append() was removed in pandas 2.0, so we build a one-row dataframe and concatenate it;
        # pd.concat() returns a new dataframe containing the new values, we replace our old one with that
        new_row = pd.DataFrame([{'student_id':s_id, 'name':nm, 'grade':gr}])
        self.df = pd.concat([self.df, new_row], ignore_index = True)
    def remove_student(self, s_id):
        # we are basically getting a new dataframe without the rows that have s_id as their student_id
        self.df = self.df[self.df.student_id != s_id]
    # this could be broken down into 3 methods, 1 for each (id, name, grade)
    def update_student(self, s_id, nm, gr):
        # as I don't know how to update a row, I leave this as a placeholder:
        self.remove_student(s_id)
        self.add_student(s_id, nm, gr)
Accessing the dataframe from outside is as easy as:
S = Students()
print(S.df)
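The update_student() placeholder removes the row and adds it again, which also moves that student to the end of the dataframe. As a minimal sketch (the sample data and the standalone function form are assumptions made for illustration, not part of the answer), a row can instead be updated in place by selecting it with a boolean mask and .loc:

```python
import pandas as pd

# Hypothetical sample data standing in for Students.df
df = pd.DataFrame({
    'student_id': [1, 2, 3],
    'name': ['Ann', 'Bob', 'Cid'],
    'grade': [70, 80, 90],
})

def update_student(df, s_id, nm, gr):
    # True for the rows whose student_id matches s_id
    mask = df['student_id'] == s_id
    # Scalars are broadcast to every selected row; other rows are untouched
    df.loc[mask, 'name'] = nm
    df.loc[mask, 'grade'] = gr
    return df

update_student(df, 2, 'Bobby', 85)
print(df)
```

Inside the class, the same two .loc assignments could replace the body of update_student(), operating on self.df.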

Related

Class method called in init not giving same output as the same function used outside the class

I'm sure I'm missing something in how classes work here, but basically this is my class:
import pandas as pd
import numpy as np
import scipy
#example DF with OHLC columns and 100 rows
gold = pd.DataFrame({'Open':[i for i in range(100)],'Close':[i for i in range(100)],'High':[i for i in range(100)],'Low':[i for i in range(100)]})
class Backtest:
    def __init__(self, ticker, df):
        self.ticker = ticker
        self.df = df
        self.levels = pivot_points(self.df)
    def pivot_points(self,df,period=30):
        highs = scipy.signal.argrelmax(df.High.values,order=period)
        lows = scipy.signal.argrelmin(df.Low.values,order=period)
        return list(df.High[highs[0]]) + list(df.Low[lows[0]])
inst = Backtest('gold',gold) #gold is a Pandas Dataframe with Open High Low Close columns and data
inst.levels # This gives me the whole dataframe (inst.df) instead of the expected output of the pivot_points function (a list of integers)
The problem is inst.levels returns the whole DataFrame instead of the return value of the function pivot_points (which is supposed to be a list of integers)
When I called the pivot_points function on the same DataFrame outside this class I got the list I expected
I expected to get the result of the pivot_points() function after assigning it to self.levels inside the init but instead I got the entire DataFrame

You would have to call pivot_points() as self.pivot_points(), since it is a method of the class.
And there is no need to make period an argument if you are not changing it; if you are, it's fine where it is.
I'm not sure if this helps, but here are some tips about your class:
class Backtest:
    def __init__(self, ticker, df):
        self.ticker = ticker
        self.df = df
        # no need to define an instance variable here, you can access the method directly
        # self.levels = pivot_points(self.df)
    def pivot_points(self):
        period = 30
        # period is a local variable to pivot_points so I can access it directly
        print(f'period inside Backtest.pivot_points: {period}')
        # df is an instance variable and can be accessed in any method of Backtest after it is instantiated
        print(f'self.df inside Backtest.pivot_points(): {self.df}')
        # to get any values out of pivot_points we return some calculations
        return 1 + 1
    # if you do need an attribute like level to access it by inst.level you could create a property
    @property
    def level(self):
        return self.pivot_points()
gold = 'some data'
inst = Backtest('gold', gold) # gold is a Pandas Dataframe with Open High Low Close columns and data
print(f'inst.pivot_points() outside the class: {inst.pivot_points()}')
print(f'inst.level outside the class: {inst.level}')
This would be the result:
period inside Backtest.pivot_points: 30
self.df inside Backtest.pivot_points(): some data
inst.pivot_points() outside the class: 2
period inside Backtest.pivot_points: 30
self.df inside Backtest.pivot_points(): some data
inst.level outside the class: 2

Thanks to the commenter Henry Ecker I found that I had the function by the same name defined elsewhere in the file where the output is the df. After changing that my original code is working as expected
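A minimal sketch of that situation, with made-up data and a made-up stray helper: inside a method, the bare name pivot_points is looked up in the module scope, not on the class, so a same-named module-level function is called instead of the method.

```python
def pivot_points(df):
    # Hypothetical stray helper defined elsewhere in the same file
    return df

class Backtest:
    def __init__(self, df):
        self.df = df
        # Bare name: resolved in module scope, so the stray helper runs
        self.levels_wrong = pivot_points(self.df)
        # Through self: the method defined on the class runs
        self.levels_right = self.pivot_points(self.df)

    def pivot_points(self, df):
        # Stand-in for the real calculation
        return [max(df), min(df)]

inst = Backtest([3, 1, 2])
print(inst.levels_wrong)
print(inst.levels_right)
```

Without the module-level function, the bare call would raise a NameError instead, which is the usual hint that self. is missing.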

python/pandas: DataFrame inheritance and DataFrame update when 'inplace' is not possible

I am sorry, I am aware the title is somewhat fuzzy.
Context
I am using a DataFrame to keep track of files because pandas DataFrame offers several relevant features for all kinds of filtering that a dict cannot do, with loc, pd.IndexSlice, .index, .columns, pd.MultiIndex...
OK, so this may not appear to be the best choice to expert developers (which I am not), but all these functions have been so handy that I have come to use a DataFrame for this.
And, cherry on the cake, the __repr__ of a MultiIndex DataFrame is just perfect when I want to know what is inside my file list.
Quick introduction to Summary class, inheriting from DataFrame
Because my DataFrame, that I call 'Summary', has some specific functions, I would like to make it a class, inheriting from pandas DataFrame class.
It also has 'fixed' MultiIndexes, for both rows and columns.
Finally, because my Summary class is defined outside the Store class which is actually managing file organization, Summary class needs a function from Store to be able to retrieve file organization.
Questions
Trouble with pd.DataFrame is (AFAIK) you cannot append rows without creating a new DataFrame.
As Summary has a refresh function so that it can recreate itself by reading folder content, a refresh somehow 'reset' the 'Summary' object.
To manage Summary refresh, I have come up with a first code (not working) and finally a second one (working).
import pandas as pd
import numpy as np
# Dummy function
def summa(a,b):
    return a+b
# Does not work
class DatF1(pd.DataFrame):
    def __init__(self,meth,data=None):
        cmidx = pd.MultiIndex.from_arrays([['Index', 'Index'],['First', 'Last']])
        rmidx = pd.MultiIndex(levels=[[],[]], codes=[[],[]],
                              names=['Component','Interval'])
        super().__init__(data=data, index=rmidx, columns=cmidx, dtype=np.datetime64)
        self.meth=meth
    def refresh(self):
        values = [[pd.Timestamp('2020/02/10 8:00'),pd.Timestamp('2020/02/10 8:00')],
                  [pd.Timestamp('2020/02/11 8:00'),pd.Timestamp('2020/02/12 8:00')]]
        rmidx = pd.MultiIndex.from_arrays([['Comp1','Comp1'],['1h','1W']],names=['Component','Interval'])
        self = pd.DataFrame(values, index=rmidx, columns=self.columns)
ex1 = DatF1(summa)
In [10]: ex1.meth(3,4)
Out[10]: 7
ex1.refresh()
In [11]: ex1
Out[11]: Empty DatF1
Columns: [(Index, First), (Index, Last)]
Index: []
After refresh(), ex1 is still empty. refresh has not worked correctly.
# Works
class DatF2(pd.DataFrame):
    def __init__(self,meth,data=None):
        cmidx = pd.MultiIndex.from_arrays([['Index', 'Index'],['First', 'Last']])
        rmidx = pd.MultiIndex(levels=[[],[]], codes=[[],[]],
                              names=['Component','Interval'])
        super().__init__(data=data, index=rmidx, columns=cmidx, dtype=np.datetime64)
        self.meth=meth
    def refresh(self):
        values = [[pd.Timestamp('2020/02/10 8:00'),pd.Timestamp('2020/02/10 8:00')],
                  [pd.Timestamp('2020/02/11 8:00'),pd.Timestamp('2020/02/12 8:00')]]
        rmidx = pd.MultiIndex.from_arrays([['Comp1','Comp1'],['1h','1W']],names=['Component','Interval'])
        super().__init__(values, index=rmidx, columns=self.columns)
ex2 = DatF2(summa)
In [10]: ex2.meth(3,4)
Out[10]: 7
ex2.refresh()
In [11]: ex2
Out[11]:                                  Index
                                   First                Last
Component Interval
Comp1     1h       2020-02-10 08:00:00 2020-02-10 08:00:00
          1W       2020-02-11 08:00:00 2020-02-12 08:00:00
This code works!
I have 2 questions:
Why is the 1st code not working? (I am sorry, this is maybe obvious, but I have no idea why it does not work.)
Is calling super().__init__ in my refresh method acceptable coding practice? (Or, rephrased: is it acceptable to call super().__init__ in places other than the __init__ of my subclass?)
Thanks a lot for your help and advice. The world of class inheritance is quite new to me, and the fact that DataFrame content cannot be directly modified, so to speak, seems to make it a step more difficult to handle.
Have a good day,
Bests,
Error message when adding a new row
import pandas as pd
import numpy as np
# Dummy function
def new_rows():
    return [['Comp1','Comp1'],['1h','1W']]
# Does not work
class DatF1(pd.DataFrame):
    def __init__(self,meth,data=None):
        cmidx = pd.MultiIndex.from_arrays([['Index', 'Index'],['First', 'Last']])
        rmidx = pd.MultiIndex(levels=[[],[]], codes=[[],[]],
                              names=['Component','Interval'])
        super().__init__(data=data, index=rmidx, columns=cmidx, dtype=np.datetime64)
        self.meth=meth
    def refresh(self):
        values = [[pd.Timestamp('2020/02/10 8:00'),pd.Timestamp('2020/02/10 8:00')],
                  [pd.Timestamp('2020/02/11 8:00'),pd.Timestamp('2020/02/12 8:00')]]
        rmidx = self.meth()
        self[rmidx] = values
ex1 = DatF1(new_rows)
ex1.refresh()
KeyError: "None of [MultiIndex([('Comp1', 'Comp1'),\n ( '1h', '1W')],\n names=['Component', 'Interval'])] are in the [index]"

Answers to your questions
why the 1st code is not working?
You are trying to call the class you've inherited from. Honestly, I don't know what's happening exactly in your case. I assumed this would produce an error but you got an empty dataframe.
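One general Python point is relevant to the first version: inside a method, self is just a local name, so the statement self = pd.DataFrame(...) rebinds that name and leaves the object held by the caller unchanged. A minimal sketch with a hypothetical Box class (not the asker's code) showing the difference between rebinding and mutating:

```python
class Box:
    def __init__(self, items):
        self.items = list(items)

    def refresh_rebind(self):
        # Rebinds the local name `self` only; the caller's object is untouched
        self = Box(['new'])

    def refresh_mutate(self):
        # Changes the object that the caller also holds a reference to
        self.items = ['new']

b = Box(['old'])
b.refresh_rebind()
after_rebind = list(b.items)   # still the original contents
b.refresh_mutate()
after_mutate = list(b.items)   # now replaced
print(after_rebind, after_mutate)
```

This would also explain why no error is raised: rebinding a local name is perfectly legal, it just has no effect outside the method.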
is calling super().__init__ in my refresh method acceptable coding practise?
Maybe a legitimate use case exists for calling super().__init__ outside the __init__() method, but yours is not one of them. You have already inherited everything from the parent class in your __init__(). Why use it again?
A better solution
The solution to your problem is unexpectedly simple, because you can add a row to a DataFrame in place by assigning to a new index label with .loc:
df.loc['new_row'] = [value_1, value_2, ...]
Or in your case with a MultiIndex (see this SO post), using one label per index level (Component, Interval):
df.loc[('Comp1', '1h'), :] = [pd.Timestamp('2020/02/10 8:00'), pd.Timestamp('2020/02/10 8:00')]
Best practice
You should not inherit from pd.DataFrame. If you want to extend pandas use the documented API.
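One documented extension point is the accessor registry, pd.api.extensions.register_dataframe_accessor. The sketch below is only an illustration under assumptions: the accessor name "summary" and the durations() method are invented here, and the data reuses the values from the question.

```python
import pandas as pd

@pd.api.extensions.register_dataframe_accessor("summary")
class SummaryAccessor:
    def __init__(self, pandas_obj):
        # The plain DataFrame the accessor was reached through
        self._obj = pandas_obj

    def durations(self):
        # Hypothetical helper: time between the First and Last index of each row
        return self._obj[("Index", "Last")] - self._obj[("Index", "First")]

cols = pd.MultiIndex.from_arrays([["Index", "Index"], ["First", "Last"]])
rows = pd.MultiIndex.from_arrays(
    [["Comp1", "Comp1"], ["1h", "1W"]], names=["Component", "Interval"]
)
df = pd.DataFrame(
    [[pd.Timestamp("2020-02-10 08:00"), pd.Timestamp("2020-02-10 08:00")],
     [pd.Timestamp("2020-02-11 08:00"), pd.Timestamp("2020-02-12 08:00")]],
    index=rows, columns=cols,
)
d = df.summary.durations()
print(d)
```

The custom behaviour lives in the accessor while df stays an ordinary DataFrame, so pandas operations that return new frames do not lose the extra methods.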

DataFrame not being assigned given value

I have the following class and the print statement returns an empty dataframe even though I'm sure my get_percent_change method is returning the values. I even tried just assigning test to three. Still, empty dataframe.
Is it something to do with the fact it's inside a class? Inside the init method? I tried using self.metrics too.
class options_metrics:
    def __init__(self, calls, puts):
        self.calls, self.puts = calls, puts
        self.calls = self.calls.drop(["Type"])
        self.puts = self.puts.drop(["Type"])
        metrics = pd.DataFrame()
        metrics['Perc_Chg_Vol_Call'], metrics['Perc_Chg_Open_Int_Call'] = self.get_percent_change(self.calls)
        metrics['Test'] = 3
        print(metrics)
        input()
    def get_percent_change(self, option_df):
        perc_changes = option_df.pct_change(axis=1)
        print(perc_changes)
        return (perc_changes.ix['Vol',1], perc_changes.ix['Open_Int',1])
Here is the output:
Empty DataFrame
Columns: [Perc_Chg_Vol_Call, Perc_Chg_Open_Int_Call, Test]
Index: []

Switching the DataFrame to a Series worked.
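The reason, as far as the posted code shows: pd.DataFrame() has no rows, so assigning a scalar to a new column creates the column but has no rows to fill. A minimal sketch with made-up values (0.25 is a placeholder, not real output) comparing that with a Series and with a one-row DataFrame:

```python
import pandas as pd

empty = pd.DataFrame()
empty['Test'] = 3            # column is created, but there are no rows to fill
n_rows_frame = len(empty)

# A Series keyed by metric name holds scalar values directly
as_series = pd.Series({'Perc_Chg_Vol_Call': 0.25, 'Test': 3})

# Alternatively, wrap each scalar in a list to get a one-row DataFrame
one_row = pd.DataFrame({'Perc_Chg_Vol_Call': [0.25], 'Test': [3]})

print(n_rows_frame, len(as_series), len(one_row))
```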

Propagate pandas series metadata through joins

I'd like to be able to attach metadata to the series of dataframes (specifically, the original filename), so that after joining two dataframes I can see metadata on where each of the series came from.
I see github issues regarding _metadata (here, here), including some relating to the current _metadata attribute (here), but nothing in the pandas docs.
So far I can modify the _metadata attribute to supposedly allow preservation of metadata, but get an AttributeError after the join.
df1 = pd.DataFrame(np.random.randint(0, 4, (6, 3)))
df2 = pd.DataFrame(np.random.randint(0, 4, (6, 3)))
df1._metadata.append('filename')
df1[df1.columns[0]]._metadata.append('filename')
for c in df1:
    df1[c].filename = 'fname1.csv'
    df2[c].filename = 'fname2.csv'
df1[0]._metadata # ['name', 'filename']
df1[0].filename # fname1.csv
df2[0].filename # fname2.csv
df1[0][:3].filename # fname1.csv
mgd = pd.merge(df1, df2, on=[0])
mgd['1_x']._metadata # ['name', 'filename']
mgd['1_x'].filename # raises AttributeError
Any way to preserve this?
Update: Epilogue
As discussed here, __finalize__ cannot keep track of Series that are members of a dataframe, only independent series. So for now I'll keep track of the Series-level metadata by maintaining a dictionary of metadata attached to the dataframes. My code looks like:
def cust_merge(d1, d2):
    "Custom merge function for 2 dicts"
    ...
def finalize_df(self, other, method=None, **kwargs):
    for name in self._metadata:
        if method == 'merge':
            lmeta = getattr(other.left, name, {})
            rmeta = getattr(other.right, name, {})
            newmeta = cust_merge(lmeta, rmeta)
            object.__setattr__(self, name, newmeta)
        else:
            object.__setattr__(self, name, getattr(other, name, None))
    return self
df1.filenames = {c: 'fname1.csv' for c in df1}
df2.filenames = {c: 'fname2.csv' for c in df2}
pd.DataFrame._metadata = ['filenames']
pd.DataFrame.__finalize__ = finalize_df

I think something like this will work (and if not, please file a bug report, as this, while supported, is a bit bleeding edge; in other words, it IS possible that the join methods don't call this all the time. That is a bit untested).
See this issue for a more detailed example/bug fix.
from pandas import DataFrame
DataFrame._metadata = ['name','filename']
def __finalize__(self, other, method=None, **kwargs):
    """
    propagate metadata from other to self
    Parameters
    ----------
    other : the object from which to get the attributes that we are going
        to propagate
    method : optional, a passed method name ; possibly to take different
        types of propagation actions based on this
    """
    ### you need to arbitrate when there are conflicts
    for name in self._metadata:
        object.__setattr__(self, name, getattr(other, name, None))
    return self
DataFrame.__finalize__ = __finalize__
So this replaces the default finalizer for DataFrame with your custom one. Where I have indicated, you need to put some code which can arbitrate between conflicts. This is the reason this is not done by default: e.g. frame1 has name 'foo' and frame2 has name 'bar', so what do you do when the method is __add__, and what about another method? Let us know what you do and how it works out.
This ONLY replaces the finalizer for DataFrame (and you can simply do the default action if you want), which is to propagate other to self; you can also set nothing except under special cases of method.
This method is meant to be overridden in sub-classes; that's why you are monkey patching here (rather than sub-classing, which is most of the time overkill).

Python- Automating Variable naming for with imported CSV data [duplicate]

I'm a little new to python, stuck in on the 6.00x (spring 2013) course.
I'd hoped to try some of my new found knowledge but appear to have overreached.
The idea was to load a CSV file containing my bank statement into Python. I'd then hoped to turn each transaction into an instance of a class. I'd then hoped to start playing around with the data to see what I could do, but I appear to be failing at even the first hurdle: getting things nicely fitted into my object-oriented program.
I started with this to import my file:
import csv
datafile = open('PATH/TO/file.csv', 'r')
datareader = csv.reader(datafile)
data = []
for row in datareader:
    data.append(row)
That seems to work. I get a list of all the statement data that looks something like this below (You'll understand me not uploading the real data...)
[['date', 'type', 'details', 'amount', 'balance', 'accountname', 'accountdetails', 'blank_string'],['date', 'type', 'details', 'amount', 'balance', 'accountname', 'accountdetails', 'blank_string'],['date', 'type', 'details', 'amount', 'balance', 'accountname', 'accountdetails', 'blank_string'],['date', 'type', 'details', 'amount', 'balance', 'accountname', 'accountdetails', 'blank_string']]
so typing data[0] would get me:
['date', 'type', 'details', 'amount', 'balance', 'accountname', 'accountdetails', 'blank_string']
So then I created my class and constructor. With the idea of decomposing each one of these transactions into an easily accessible item.
class Transaction(object):
    """
    Abstract class for building different kinds of transaction
    """
    def __init__(self, data):
        self.date = data[0]
        self.trans_type = data[1]
        self.description = data[2]
        self.amount = data[3]
        self.balance = data[4]
        self.account_type = data[5]
        self.account_details = data[6]
I find this works if I now enter
T1 = Transaction(data[0])
However, I don't want to have to constantly type T1 =... T2=... T3=... T4=...; there are LOADS of transactions and it'd take forever!
So I tried this!
for i in range(len(data)):
    eval("T" + str(i)) = Transaction(data[i])
But python really doesn't like that... It reports back:
SyntaxError: There is an error in your program: * can't assign to function call (FILENAME.py, line 80)
So my question is. Why can't I iteratively use the eval() function to assign my data as an instance into class Transaction(object)?
If there's no way around that, how might else I go about doing so?
I also have a lingering doubt that my mistake suggests I've missed some point about object-oriented programming and when it's appropriate to use it. Would I be better off just feeding my CSV data into a dictionary as it's imported and playing around with it from there?
Many thanks!
Huw

Use a transactions = [] list instead, simply .append() new Transaction instances:
transactions = []
for row in datareader:
    transactions.append(Transaction(row))
or even:
transactions = [Transaction(row) for row in datareader]
There is no need to create individual variables for each row result.
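As for the SyntaxError itself: eval("T" + str(i)) is a function call, and a function call cannot be the target of an assignment. If labels such as T0, T1 are still wanted, a dictionary can carry them as keys. A minimal, self-contained sketch; the sample rows are invented and io.StringIO stands in for the real file:

```python
import csv
import io

class Transaction(object):
    def __init__(self, data):
        self.date = data[0]
        self.trans_type = data[1]
        self.description = data[2]
        self.amount = data[3]
        self.balance = data[4]
        self.account_type = data[5]
        self.account_details = data[6]

# Made-up statement rows in place of the real CSV file
sample = io.StringIO(
    "2013-03-01,DEB,Coffee,-2.50,97.50,Current,12345678,\n"
    "2013-03-02,CR,Salary,1000.00,1097.50,Current,12345678,\n"
)

transactions = [Transaction(row) for row in csv.reader(sample)]

# Labels become dictionary keys instead of variable names
by_label = {"T" + str(i): t for i, t in enumerate(transactions)}
print(by_label["T1"].description)
```

The list alone is usually enough; the dictionary only helps when items need to be looked up by a name rather than by position.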
