Python - Creating classes & calculation methods with dataframe inputs

I need to build classes and methods for different calculation steps.
I have a dataframe with numerical columns A, B, C.
I want the class to initialize the columns of the dataframe as inputs so I can call the following methods:
Method1:
sum(A)
Method2:
sum(A)*B
How do you do that in Python?
I know it is a really general question, but I have only come across really abstract OOP tutorials; I need something more specific to calculations & finance.
A pointer to a good tutorial would also help.
Thanks,
KS

You can do it like this:
import pandas as pd
import numpy as np

class MyCalculator:
    def __init__(self, df):
        self.df = df

    def m1(self):
        return self.df['A'].sum()

    def m2(self):
        # return np.multiply(self.m1(), self.df['B']).values
        return np.multiply(self.m1(), self.df['B']).values.reshape(-1, 1)

d = {
    'A': [1, 2, 3],
    'B': [4, 5, 6],
    'C': [7, 8, 9]
}

def main():
    df = pd.DataFrame(d)
    print(f'frame:\n {df}')
    b = MyCalculator(df)
    print(f'method 1:\n {b.m1()}')
    print(f'method 2:\n {b.m2()}')

# start
main()
Output:
frame:
    A  B  C
0  1  4  7
1  2  5  8
2  3  6  9
method 1:
6
method 2:
[[24]
[30]
[36]]
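
If you want the columns themselves stored as inputs, as the question phrases it, a minimal sketch (my own variation, not part of the answer above) could keep each column as an attribute:

import pandas as pd

class ColumnCalculator:
    # Hypothetical variant: stores the columns A, B, C as attributes
    def __init__(self, df):
        self.A = df['A']
        self.B = df['B']
        self.C = df['C']

    def m1(self):
        # sum(A)
        return self.A.sum()

    def m2(self):
        # sum(A) * B, returned as a Series aligned with the frame's index
        return self.m1() * self.B

calc = ColumnCalculator(pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6], 'C': [7, 8, 9]}))
print(calc.m1())  # 6
print(calc.m2())  # 24, 30, 36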

Related

Unique values in pandas

Hi, I've just started learning Python and am trying to learn pandas. I have a question about how to find the unique start and stop values in a dataframe. Can someone help me out here?
As you did not provide an example dataset, let's assume this one:
import numpy as np
import pandas as pd

np.random.seed(1)
df = pd.DataFrame({'start': np.random.randint(0, 10, 5),
                   'stop': np.random.randint(0, 10, 5),
                   }).T.apply(sorted).T
   start  stop
0      0     5
1      1     8
2      7     9
3      5     6
4      0     9
To get unique values for a given column (here start):
>>> df['start'].unique()
array([0, 1, 7, 5])
For all columns at once:
>>> df.apply(pd.unique, result_type='reduce')
start [0, 1, 7, 5]
stop [5, 8, 9, 6]
dtype: object
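
If you instead want the unique values pooled across both columns (an assumption about what "unique start and stop values" means, not part of the answer above), you can flatten the values first:

>>> pd.unique(df[['start', 'stop']].values.ravel())
array([0, 5, 1, 8, 7, 9, 6])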

Removing duplicates in dataframe via creating a list of their indices pandas

I have a dataframe (used_dataframe) that contains duplicates. I am required to create a list that contains the indices of those duplicates.
For this I used a function I found here:
Find indices of duplicate rows in pandas DataFrame
def duplicates(x):
    # dataframe = pd.read_csv(x)
    # df = dataframe.iloc[:, 1:]
    df = x
    duplicateRowsDF = df[df.duplicated()]
    df = df[df.duplicated(keep=False)]
    tuppl = df.groupby(list(df)).apply(lambda x: tuple(x.index)).tolist()  # this is the function!
    n = 1  # N. . .
    indicees = [x[n] for x in tuppl]
    return indicees

duplicates(used_df)
The next function I need is one where I remove the duplicates from the dataset, which I did like this:
def handling_duplicate_entries(mn):
    x = tidy(mn)
    indices = duplicates(tidy(mn))
    used_df = x
    used_df['indexcol'] = range(0, len(tidy(mn)))
    dropped = used_df[~used_df['indexcol'].isin(indices)]
    finito = dropped.drop(columns=['indexcol'])
    return finito

handling_duplicate_entries(used_df)
And it works - but when I want to check my solution (to assess that all duplicates have been removed), which I do by calling duplicates(handling_duplicate_entries(used_df)) and which should return an empty list to show that there are no duplicates, I get the error 'DataFrame' object has no attribute 'tolist'.
In the question linked above this has also been raised in a comment but not solved - and, to be quite frank, I would love to find a different solution for the duplicates function, because I don't quite understand it, but so far I haven't.
OK, I'll try to do my best.
If you are trying to find the duplicate indices and want to store those values in a list, you can use the following code. I have also included a small example that creates a dataframe and splits it into the duplicated rows and the data without any duplicates.
import pandas as pd

# Toy dataset
data = {
    'A': [0, 0, 3, 0, 3, 0],
    'B': [0, 1, 3, 2, 3, 0],
    'C': [0, 1, 3, 2, 3, 0]
}
df = pd.DataFrame(data)

# Count how many times each full row occurs and keep only the repeated rows
group = df.groupby(list(df.columns)).size()
group = group[group > 1].reset_index(name='count')
group = group.drop(columns=['count']).reset_index().rename(columns={'index': 'count'})

# Original indices of every row that belongs to a duplicated group
idxs = df.reset_index().merge(group, how='right')['index'].values
duplicates = df.loc[idxs]
no_duplicates = df.loc[~df.index.isin(idxs)]
duplicates
   A  B  C
0  0  0  0
5  0  0  0
2  3  3  3
4  3  3  3
no_duplicates
   A  B  C
1  0  1  1
3  0  2  2
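
As a quick check that nothing duplicated remains (the verification the question was after; my own sketch, not part of the answer above):

# Should print False: no full-row duplicates are left
print(no_duplicates.duplicated().any())
# And the list of remaining duplicate indices should be empty
print(no_duplicates[no_duplicates.duplicated(keep=False)].index.tolist())  # []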

Running the same apply function multiple times across a pandas groupby with different parameters passed in each time

My question involves the most efficient way to apply the same function again and again to a pandas groupby object while changing the parameters passed in each time.
Suppose I have the following code that creates a simple dataframe and a trivial apply function:
import pandas as pd
import numpy as np

df = pd.DataFrame({
    'group': np.repeat(['A', 'B', 'C'], 3),
    'values': np.arange(2, 20, 2)
})

def simple_function(data, value):
    new_df = data.sum() + value
    new_df['added'] = value
    return new_df
The simple_function sums across the df, adds value to this sum, and creates a new column that contains the value added.
I know how to use apply on an individual case:
new_df_add_five = df.groupby('group').apply(simple_function, 5)
"""
Returns:
       values  added
group
A          17      5
B          35      5
C          53      5
"""
new_df_add_six = df.groupby('group').apply(simple_function, 6)
But suppose I now want to combine the results of new_df_add_five and new_df_add_six together, to get something like this:
"""
       values  added
group
A          17      5
B          35      5
C          53      5
A          18      6
B          36      6
C          54      6
"""
Is there any way to achieve this without having to use a for loop across the params, as below?
pd_list = []
for param in [5, 6]:
    pd_list.append(df.groupby('group').apply(simple_function, param))
combined_df = pd.concat(pd_list)
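
One possible loop-free sketch (my own suggestion; it assumes pandas >= 1.2 for merge(how='cross')) is to cross-join the parameters onto the frame and aggregate once:

params = pd.DataFrame({'added': [5, 6]})

# One copy of the data per parameter, then a single groupby
out = (df.merge(params, how='cross')
         .groupby(['group', 'added'], as_index=False)['values'].sum())
out['values'] += out['added']  # add each parameter to its group sum
print(out)
#   group  added  values
# 0     A      5      17
# 1     A      6      18
# 2     B      5      35
# ...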

Pandas replacing one value with another for specified columns

I need to apply a function to a subset of columns in a dataframe. Consider the following toy example:
pdf = pd.DataFrame({'a': [1, 2, 3], 'b': [2, 3, 4], 'c': [5, 6, 7]})
arb_cols = ['a', 'b']
What I want to do is this:
[pdf[c] = pdf[c].apply(lambda x: 99 if x == 2 else x) for c in arb_cols]
But this is bad syntax. Is it possible to accomplish such a task without a for loop?
With mask
pdf.mask(pdf.loc[:,arb_cols]==2,99).assign(c=pdf.c)
Out[1190]:
    a   b  c
0   1  99  5
1  99   3  6
2   3   4  7
Or with assign
pdf.assign(**pdf.loc[:,arb_cols].mask(pdf.loc[:,arb_cols]==2,99))
Out[1193]:
    a   b  c
0   1  99  5
1  99   3  6
2   3   4  7
Do not use pd.Series.apply when you can use vectorised functions.
For example, the below should be efficient for larger dataframes even though there is an outer loop:
for col in arb_cols:
    pdf.loc[pdf[col] == 2, col] = 99
Another option is to use pd.DataFrame.replace:
pdf[arb_cols] = pdf[arb_cols].replace(2, 99)
Yet another option is to use numpy.where:
import numpy as np
pdf[arb_cols] = np.where(pdf[arb_cols] == 2, 99, pdf[arb_cols])
For this case it would probably be better to use applymap if you need to apply a custom function:
pdf[arb_cols] = pdf[arb_cols].applymap(lambda x : 99 if x == 2 else x)
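
All of the options above should agree on the toy frame; a quick sanity check (my own, not from the answers):

expected = pd.DataFrame({'a': [1, 99, 3], 'b': [99, 3, 4], 'c': [5, 6, 7]})
result = pdf.copy()
result[arb_cols] = result[arb_cols].replace(2, 99)
assert result.equals(expected)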

From a dataframe using the apply() method, how to return a new column with lists of elements from the dataframe?

There's an operation that is a little counterintuitive when using the pandas apply() method. It took me a couple of hours of reading to solve, so here it is.
So here is what I was trying to accomplish.
I have a pandas dataframe like so:
test = pd.DataFrame({'one': [[2],['test']], 'two': [[5],[10]]})
      one   two
0     [2]   [5]
1  [test]  [10]
and I want to add the columns per row to create a resulting column of lists, of length equal to the DataFrame's original length, like so:
def combine(row):
    result = row['one'] + row['two']
    return result
When running it through the dataframe using the apply() method:
test.apply(lambda x: combine(x), axis=1)
    one  two
0     2    5
1  test   10
Which isn't quite what we wanted. What we want is:
       result
0      [2, 5]
1  [test, 10]
EDIT
I know there are simpler solutions to this example. But this is an abstraction from a much more complex operation. Here's an example of a more complex one:
df_one:
   org_id        date  status  id
0       2  2015/02/01    True   3
1      10  2015/05/01    True  27
2      10  2015/06/01    True  18
3      10  2015/04/01   False  27
4      10  2015/03/01    True  40
df_two:
   org_id        date
0      12  2015/04/01
1      10  2015/02/01
2       2  2015/08/01
3      10  2015/08/01
Here's a more complex operation:
def operation(row, df_one):
    sel = (df_one.date < pd.Timestamp(row['date'])) & \
          (df_one['org_id'] == row['org_id'])
    last_changes = df_one[sel].groupby(['org_id', 'id']).last()
    id_list = last_changes[last_changes.status].reset_index().id.tolist()
    return id_list
then finally run:
df_one.sort_values('date', inplace=True)
df_two['id_list'] = df_two.apply(
    operation,
    axis=1,
    args=(df_one,)
)
This would be impossible with the simpler solutions. Hence my proposal below is to rewrite operation to:
def operation(row, df_one):
    sel = (df_one.date < pd.Timestamp(row['date'])) & \
          (df_one['org_id'] == row['org_id'])
    last_changes = df_one[sel].groupby(['org_id', 'id']).last()
    id_list = last_changes[last_changes.status].reset_index().id.tolist()
    return pd.Series({'id_list': id_list})
We'd expect the following result:
        id_list
0            []
1            []
2           [3]
3  [27, 18, 40]
IIUC we can simply sum two columns:
In [93]: test.sum(axis=1).to_frame('result')
Out[93]:
       result
0      [2, 5]
1  [test, 10]
because when we sum lists:
In [94]: [2] + [5]
Out[94]: [2, 5]
they are getting concatenated...
So the answer to this problem lies in how the pandas apply() method works.
When defining
def combine(row):
    result = row['one'] + row['two']
    return result
the function will return a list for each row that gets passed in. This is a problem when the function is used with the .apply() method, because pandas will interpret the resulting list as a Series in which each element is a column of that same row.
To solve this we need to create a Series where we specify a new column name like so:
def combine(row):
    result = row['one'] + row['two']
    return pd.Series({'result': result})
And if we run this again:
test.apply(lambda x: combine(x), axis=1)
       result
0      [2, 5]
1  [test, 10]
We'll get what we originally wanted! Again, this is because we are forcing pandas to interpret the entire result as a column.
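
As an alternative to wrapping the result in a pd.Series (a sketch of my own, not from the answer above), apply's result_type parameter can keep each list whole:

# result_type='reduce' asks pandas to return a Series of the raw return
# values instead of expanding each list into columns
test.apply(lambda row: row['one'] + row['two'], axis=1, result_type='reduce').to_frame('result')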
