Apply a function to a dataframe which includes the previous row - python

I have an input data frame for daily grocery spend which looks like this:
input_df1
Date Potatoes Spinach Lettuce
01/01/22 10 47 0
02/01/22 0 22 3
03/01/22 11 0 3
04/01/22 3 9 2
...
I need to apply a function that computes input_df1 + (previous inflated_df2 row * inflation%) to get inflated_df2 (except for the first row: the first day of the month has no inflation effect, so it stays the same as input_df1).
inflated_df2
inflation% 0.01 0.05 0.03
Date Potatoes Spinach Lettuce
01/01/22 10 47 0
02/01/22 0.10 24.35 3
03/01/22 11.0 1.218 3.09
04/01/22 3.11 9.06 2.093
...
This is what I attempted to get inflated_df2
inflated_df2.iloc[2:3,:] = input_df1.iloc[0:1,:]
inflated_df2.iloc[3:,:] = inflated_df2.apply(lambda x: input_df1[x] + (x.shift(periods=1, fill_value=0)) * x['inflation%'])

You can use accumulate from itertools
from itertools import accumulate
rates = {'Potatoes': 0.01, 'Spinach': 0.05, 'Lettuce': 0.03}
c = list(rates.keys())
r = list(rates.values())
df[c] = list(accumulate(df[c].to_numpy(), lambda bal, val: val + bal * r))
Output:
>>> df
Date Potatoes Spinach Lettuce
0 01/01/22 10.00000 47.000000 0.0000
1 02/01/22 0.10000 24.350000 3.0000
2 03/01/22 11.00100 1.217500 3.0900
3 04/01/22 3.11001 9.060875 2.0927
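The lambda carries the previous inflated row as the accumulator, so each step computes current input + previous inflated row * rate. A minimal sketch of the same recurrence as an explicit loop (assuming the df and rates defined above) may make it easier to see what accumulate carries forward:
import numpy as np

cols = list(rates.keys())
r = np.array(list(rates.values()))
values = df[cols].to_numpy(dtype=float)
out = values.copy()                      # first row stays equal to the input
for t in range(1, len(out)):
    # current day's input plus the previous *inflated* row scaled by the rates
    out[t] = values[t] + out[t - 1] * r
df[cols] = out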

Manipulate ordering/sorting of Multirow columns in a pandas DataFrame

This is a side-problem caused by an answer from another question.
I combine two crosstab() results with counted and normalized values. The problem is that the resulting column names are not in the right order. "Right" means that the margins_name (in my example it is "gesamt") should always appear as the last row/column, and not like this:
sex female gesamt male
n % n % n %
age
What I need is
sex female male gesamt
n % n % n %
age
This is the minimal working example
#!/usr/bin/env python3
import pandas as pd
import pydataset
# sample data
df = pydataset.data('agefat')
df = df.loc[df.age < 35]
# Label of the margin column/row
mn = 'gesamt'
# count absolute
taba = pd.crosstab(df.age, df.sex, margins=True, margins_name=mn)
# percentage / normalized
tabb = pd.crosstab(df.age, df.sex, margins=True, margins_name=mn,
                   normalize=True).round(4)*100
# combine (based on: https://stackoverflow.com/a/68362010/4865723)
tab = pd.concat([taba, tabb], axis=1, keys=['n', '%']).swaplevel(axis=1)
# sort the columns
tab = tab.sort_index(axis=1, ascending=[True, False])
print(tab)
I also have a possible solution that works, but I am not sure if this is good pandas practice. I manipulate the sort key so that the margins_name always gets the highest possible chr() value, which makes it appear at the end of a lexicographical ordering.
# workaround
tab = tab.sort_index(axis=1, ascending=[False, False],
                     key=lambda x: x.where(x.isin([mn]), chr(0x10ffff)))
print(tab) # looks like I expect
The result output
sex female male gesamt
n % n % n %
age
23 1 16.67 1 16.67 2 33.33
24 0 0.00 1 16.67 1 16.67
27 0 0.00 2 33.33 2 33.33
31 1 16.67 0 0.00 1 16.67
gesamt 2 33.33 4 66.67 6 100.00
Use an ordered CategoricalIndex for custom ordering of the first level of the MultiIndex:
i = tab.columns.levels[0]
out = sorted(i.difference([mn]))
out.append(mn)
new = pd.CategoricalIndex(i, ordered=True, categories=out)
tab.columns = tab.columns.set_levels(new, level=0)
tab = tab.sort_index(axis=1, ascending=[True, False])
print(tab)
sex female male gesamt
n % n % n %
age
2000 2 33.33 0 0.00 2 33.33
2001 1 16.67 1 16.67 2 33.33
2002 1 16.67 1 16.67 2 33.33
gesamt 4 66.67 2 33.33 6 100.00
I would just select the total columns using a list comprehension and piece together the columns selection as desired:
cols_tot = [c for c in tab.columns if c[0] == mn]
print(tab[[c for c in tab.columns if c not in cols_tot] + cols_tot])
sex female male gesamt
n % n % n %
age
23 1 16.67 1 16.67 2 33.33
24 0 0.00 1 16.67 1 16.67
27 0 0.00 2 33.33 2 33.33
31 1 16.67 0 0.00 1 16.67
gesamt 2 33.33 4 66.67 6 100.00
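A small variation on the same idea, as a sketch assuming the tab and mn from the question: build the desired column order once and pass it to reindex, which keeps the MultiIndex intact:
cols_tot = [c for c in tab.columns if c[0] == mn]
cols_rest = [c for c in tab.columns if c[0] != mn]
# reindex accepts an explicit list of (level0, level1) tuples
print(tab.reindex(columns=cols_rest + cols_tot))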
Let me highlight a detail in addition to #jezrael's original answer.
We know from #jezrael's answer that .sort_index() takes the ordering of categories into account. This has consequences when you call crosstab() on a column that is an ordered categorical and you add a margin (e.g. a total column) to the crosstab.
Going back to the MWE of my question, let's assume that age is not a number but an ordered category:
['younger then 20' < '20 till 60' < 'older then 60']
The column loses its categorical order, and .sort_index() will only sort it lexicographically, when you do it like this (as in the original MWE):
# Label of the margin column/row
mn = 'gesamt'
# count absolute
taba = pd.crosstab(df.age, df.sex, margins=True, margins_name=mn)
What you have to do is add the margins_name as one of the categories before calling crosstab():
df.age = df.age.cat.add_categories([mn]) # mn=='gesamt'

What would be the best way to perform operations against a significant number of rows from one df to one row from another?

If I have df1:
A B C D
0 4.51 6.212 3.12 1
1 3.12 3.444 1.12 1
2 6.98 7.413 7.02 0
3 4.51 8.916 5.12 1
....
n1 ~ 2000
and df2
A B C D
0 4.51 6.212 3.12 1
1 3.12 3.444 1.12 1
2 6.98 7.413 7.02 0
3 4.51 8.916 5.12 1
....
n2 = 10000+
And have to perform an operation like:
df12 =
df1[0,A]-df2[0,A] df1[0,B]-df2[0,B] df1[0,C]-df2[0,C]....
df1[0,A]-df2[1,A] df1[0,B]-df2[1,B] df1[0,C]-df2[1,C]
...
df1[0,A]-df2[n2,A] df1[0,B]-df2[n2,B] df1[0,C]-df2[n2,C]
...
df1[1,A]-df2[0,A] df1[1,B]-df2[0,B] df1[1,C]-df2[0,C]....
df1[1,A]-df2[1,A] df1[1,B]-df2[1,B] df1[1,C]-df2[1,C]
...
df1[1,A]-df2[n2,A] df1[1,B]-df2[n2,B] df1[1,C]-df2[n2,C]
...
df1[n1,A]-df2[0,A] df1[n1,B]-df2[0,B] df1[n1,C]-df2[0,C]....
df1[n1,A]-df2[1,A] df1[n1,B]-df2[1,B] df1[n1,C]-df2[1,C]
...
df1[n1,A]-df2[n2,A] df1[n1,B]-df2[n2,B] df1[n1,C]-df2[n2,C]
Where every row in df1 is compared against every row in df2 producing a score.
What would be the best way to perform this operation using either pandas or vaex/equivalent?
Thanks in advance!
Broadcasting is the way to go:
pd.DataFrame((df1.to_numpy()[:, None] - df2.to_numpy()[None, ...]).reshape(-1, df1.shape[1]),
             columns=df2.columns,
             index=pd.MultiIndex.from_product((df1.index, df2.index)))
Output (for the first three rows of df1 and the first two rows of df2):
A B C D
0 0 0.00 0.000 0.0 0.0
1 1.39 2.768 2.0 0.0
1 0 -1.39 -2.768 -2.0 0.0
1 0.00 0.000 0.0 0.0
2 0 2.47 1.201 3.9 -1.0
1 3.86 3.969 5.9 -1.0
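A rough sketch of what the broadcasting does, with hypothetical small sizes: df1.to_numpy()[:, None] has shape (n1, 1, k), df2.to_numpy()[None, ...] has shape (1, n2, k), so the subtraction broadcasts to (n1, n2, k), and the reshape flattens the first two axes into n1*n2 row pairs that line up with the MultiIndex:
import numpy as np

a = np.random.rand(3, 4)              # stands in for df1.to_numpy(): n1=3, k=4
b = np.random.rand(2, 4)              # stands in for df2.to_numpy(): n2=2
diff = a[:, None] - b[None, ...]      # shape (3, 2, 4): every row of a minus every row of b
flat = diff.reshape(-1, a.shape[1])   # shape (6, 4): row pair (i, j) becomes flat[i*2 + j]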
I would use openpyxl.
A nested loop like this would do (min_r, max_r, start_col and end_col are placeholder bounds you set yourself; sheet is an openpyxl worksheet, e.g. load_workbook('file.xlsx').active):
for row in sheet.iter_rows(min_row=min_r, min_col=start_col, max_col=end_col, max_row=max_r):
    for cell in row:
        df1 = cell.value
for row in sheet.iter_rows(min_row=min_r, min_col=start_col, max_col=end_col, max_row=max_r):
    for cell in row:
        df2 = cell.value
From here, what do you want to do: create new values, and put them where? This code only points at the cells.
An interesting question, which vaex can actually solve quite memory-efficiently (although we should be able to require practically no memory at all in the future).
Let's start by creating the vaex dataframes, and increase the numbers a bit, to 2,000 and 200,000 rows.
import vaex
import numpy as np
names = "ABCD"
N = 2000
M = N * 100
print(f'{N*M:,} rows')
400,000,000 rows
df1 = vaex.from_dict({name + '1': np.random.random(N) * 6 for name in names})
# We add a virtual range column for joining (requires no memory)
df1['i1'] = vaex.vrange(0, N, dtype=np.int32)
print(df1)
# A1 B1 C1 D1 i1
0 5.863720056164963 3.362898169138271 3.2880444660598784 0.1482153863632898 0
1 5.873731485927979 5.669031702051764 5.696571067838359 1.0310578585207142 1
2 4.513310303419997 4.466469647700519 5.047406986222205 3.4417402924374407 2
3 0.43402400660624174 1.157476656465433 2.179139262842482 1.1666706679131253 3
4 3.3698854360766526 2.203558794966768 0.39649910973621827 2.5576740079630502 4
... ... ... ... ... ...
1,995 4.836227485536714 4.093067389612236 5.992282902119859 1.3549691660861871 1995
1,996 1.1157617217838995 1.1606619796004967 3.2771620798090533 4.249631266421745 1996
1,997 4.628846984287445 4.019449674317169 3.7307713985954947 3.7702606362049362 1997
1,998 1.3196727531762933 2.6758762345410565 3.249315566523623 2.6501467681546123 1998
1,999 5.013902261001965 0.07240435410582324 0.8744364486077243 2.6801177594876218 1999
df2 = vaex.from_dict({name + '2': np.random.random(M) * 6 for name in names})
df2['i2'] = vaex.vrange(0, M, dtype=np.int32)
print(df2)
# A2 B2 C2 D2 i2
0 2.6928366822161234 3.227321076730826 5.944154034728931 3.3584747680090814 0
1 4.824761575636117 2.960087600265437 3.492601702004836 1.054879207993813 1
2 4.33510613806528 0.46883404117516103 5.632361155412736 0.436374137912523 2
3 0.0422709543055384 2.5319705848478855 3.5596949266321216 2.5364151309685354 3
4 2.749335843510271 3.5446979145461146 2.550223710733076 5.02069361871291 4
... ... ... ... ... ...
199,995 5.32205669155252 4.321667991189379 2.1192950613326182 5.937425946574905 199995
199,996 0.10746705113978328 4.104809740632655 0.6282195590464632 3.9603843538752974 199996
199,997 5.74108180127652 3.5863223687990136 4.64031507831471 4.610807268734913 199997
199,998 5.839402924722246 2.630974123991404 4.50411700551054 3.0960758923309983 199998
199,999 1.6954091816701466 1.8054911765387567 4.300317113825266 4.900845720973579 199999
Now we create our 'master' vaex dataframe, which requires no memory at all; it's made of a virtual range column and two expressions (stored as virtual columns):
df = vaex.from_arrays(i=vaex.vrange(0, N*M, dtype=np.int64))
df['i1'] = df.i // M # index to df1
df['i2'] = df.i % M # index to df2
print(df)
# i i1 i2
0 0 0 0
1 1 0 1
2 2 0 2
3 3 0 3
4 4 0 4
... ... ... ...
399,999,995 399999995 1999 199995
399,999,996 399999996 1999 199996
399,999,997 399999997 1999 199997
399,999,998 399999998 1999 199998
399,999,999 399999999 1999 199999
Unfortunately vaex cannot use these integer indices as lookups for joining directly; it has to go through a hashmap. So there is room for improvement here for vaex. If vaex could do this, we could scale this idea up to trillions of rows.
print(f"The next two joins require ~{len(df)*8*2//1024**2:,} MB of RAM")
The next two joins require ~6,103 MB of RAM
df_big = df.join(df1, on='i1')
df_big = df_big.join(df2, on='i2')
print(df_big)
# i i1 i2 A1 B1 C1 D1 A2 B2 C2 D2
0 0 0 0 5.863720056164963 3.362898169138271 3.2880444660598784 0.1482153863632898 2.6928366822161234 3.227321076730826 5.944154034728931 3.3584747680090814
1 1 0 1 5.863720056164963 3.362898169138271 3.2880444660598784 0.1482153863632898 4.824761575636117 2.960087600265437 3.492601702004836 1.054879207993813
2 2 0 2 5.863720056164963 3.362898169138271 3.2880444660598784 0.1482153863632898 4.33510613806528 0.46883404117516103 5.632361155412736 0.436374137912523
3 3 0 3 5.863720056164963 3.362898169138271 3.2880444660598784 0.1482153863632898 0.0422709543055384 2.5319705848478855 3.5596949266321216 2.5364151309685354
4 4 0 4 5.863720056164963 3.362898169138271 3.2880444660598784 0.1482153863632898 2.749335843510271 3.5446979145461146 2.550223710733076 5.02069361871291
... ... ... ... ... ... ... ... ... ... ... ...
399,999,995 399999995 1999 199995 5.013902261001965 0.07240435410582324 0.8744364486077243 2.6801177594876218 5.32205669155252 4.321667991189379 2.1192950613326182 5.937425946574905
399,999,996 399999996 1999 199996 5.013902261001965 0.07240435410582324 0.8744364486077243 2.6801177594876218 0.10746705113978328 4.104809740632655 0.6282195590464632 3.9603843538752974
399,999,997 399999997 1999 199997 5.013902261001965 0.07240435410582324 0.8744364486077243 2.6801177594876218 5.74108180127652 3.5863223687990136 4.64031507831471 4.610807268734913
399,999,998 399999998 1999 199998 5.013902261001965 0.07240435410582324 0.8744364486077243 2.6801177594876218 5.839402924722246 2.630974123991404 4.50411700551054 3.0960758923309983
399,999,999 399999999 1999 199999 5.013902261001965 0.07240435410582324 0.8744364486077243 2.6801177594876218 1.6954091816701466 1.8054911765387567 4.300317113825266 4.900845720973579
Now we have our big dataframe, and we only need to do the computation, which uses virtual columns and thus requires no extra memory.
# add virtual columns (which require no memory)
for name in names:
    df_big[name] = df_big[name + '1'] - df_big[name + '2']
print(df_big[['A', 'B', 'C', 'D']])
# A B C D
0 3.17088337394884 0.13557709240744487 -2.6561095686690526 -3.2102593816457916
1 1.038958480528846 0.40281056887283384 -0.20455723594495767 -0.9066638216305232
2 1.5286139180996834 2.8940641279631096 -2.344316689352858 -0.2881587515492332
3 5.821449101859425 0.8309275842903854 -0.2716504605722432 -2.3881997446052456
4 3.1143842126546923 -0.18179974540784372 0.7378207553268026 -4.87247823234962
... ... ... ... ...
399,999,995 -0.30815443055055525 -4.249263637083556 -1.244858612724894 -3.257308187087283
399,999,996 4.906435209862181 -4.032405386526832 0.2462168895612611 -1.2802665943876756
399,999,997 -0.7271795402745553 -3.51391801469319 -3.765878629706985 -1.930689509247291
399,999,998 -0.8255006637202813 -2.5585697698855805 -3.629680556902816 -0.41595813284337657
399,999,999 3.318493079331818 -1.7330868224329334 -3.4258806652175418 -2.220727961485957
If we had to do this all in memory, how much RAM would that have required?
print(f"This would otherwise require {len(df_big) * (4*3*8)//1023**2:,} MB of RAM")
This would otherwise require 36,692 MB of RAM
So, quite efficient I would say, and in the future it would be interesting to see if we can do the join more efficiently, and require practically zero RAM for this problem.

Calculating current, min, max, mean monthly growth from pandas dataframe

I have a dataset similar to the one below:
product_ID month amount_sold
1 1 23
1 2 34
1 3 85
2 1 47
2 2 28
2 3 9
3 1 73
3 2 84
3 3 12
I want the output to be like this:
For example, for product 1:
- avg_monthly_growth is calculated as ((85-34)/34*100 + (34-23)/23*100)/2 = 98.91%
- lowest_monthly_growth is (34-23)/23*100 = 47.83%
- highest_monthly_growth is (85-34)/34*100 = 150%
- current_monthly_growth is the growth between the latest two months (in this case, the growth from month 2 to month 3, as the months range from 1 to 3 for each product)
product_ID avg_monthly_growth lowest_monthly_growth highest_monthly_growth current_monthly_growth
1 98.91% 47.83% 150% 150%
2 ... ... ... ...
3 ... ... ... ...
I've tried df.loc[df.groupby('product_ID')['amount_sold'].idxmax(), :].reset_index() which gets me the max (and similarly the min), but I'm not too sure how to get the percentage growths.
You can use a pivot_table with pct_change() on axis=1, then create a dictionary with the desired series and build a DataFrame from it:
m = df.pivot_table(index='product_ID', columns='month', values='amount_sold').pct_change(axis=1)
d = {'avg_monthly_growth': m.mean(axis=1)*100, 'lowest_monthly_growth': m.min(axis=1)*100,
     'highest_monthly_growth': m.max(axis=1)*100, 'current_monthly_growth': m.iloc[:, -1]*100}
final = pd.DataFrame(d)
print(final)
avg_monthly_growth lowest_monthly_growth highest_monthly_growth \
product_ID
1 98.913043 47.826087 150.000000
2 -54.141337 -67.857143 -40.425532
3 -35.322896 -85.714286 15.068493
current_monthly_growth
product_ID
1 150.000000
2 -67.857143
3 -85.714286
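A sketch of an alternative in long format, without pivoting (assuming the same df with product_ID, month and amount_sold columns), uses groupby with pct_change and named aggregation; note that 'last' skips the NaN produced for each product's first month:
tmp = df.sort_values(['product_ID', 'month']).copy()
tmp['growth'] = tmp.groupby('product_ID')['amount_sold'].pct_change() * 100
final = tmp.groupby('product_ID')['growth'].agg(
    avg_monthly_growth='mean',
    lowest_monthly_growth='min',
    highest_monthly_growth='max',
    current_monthly_growth='last')
print(final)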

difference between two rows pandas

I have a dataframe as:
id|amount|date
20|-7|2017:12:25
20|-170|2017:12:26
20|7|2017:12:27
I want to subtract each row from another for the 'amount' column:
the output should be like:
id|amount|date|amount_diff
20|-7|2017:12:25|0
20|-170|2017:12:26|-177
20|7|2017:12:27|-163
I used the code:
df.sort_values(by='date',inplace=True)
df['amount_diff'] = df['invoice_amount'].diff()
and obtained the output as:
id|amount|date|amount_diff
20|-7|2017:12:25|163
20|-170|2017:12:26|-218
20|48|2017:12:27|0
IIUC you need:
df.sort_values(by='date',inplace=True)
df['amount_diff'] = df['amount'].add(df['amount'].shift()).fillna(0)
print (df)
id amount date amount_diff
0 20 -7 2017:12:25 0.0
1 20 -170 2017:12:26 -177.0
2 20 7 2017:12:27 -163.0
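For the pairwise sum shown above, a rolling window gives the same result; a small sketch under the same data:
# each value plus the one before it: -7 + -170 = -177, -170 + 7 = -163
df['amount_diff'] = df['amount'].rolling(2).sum().fillna(0)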
Because if you want to subtract, your solution should work:
df.sort_values(by='date',inplace=True)
df['amount_diff1'] = df['amount'].sub(df['amount'].shift()).fillna(0)
df['amount_diff2'] = df['amount'].diff().fillna(0)
print (df)
id amount date amount_diff1 amount_diff2
0 20 -7 2017:12:25 0.0 0.0
1 20 -170 2017:12:26 -163.0 -163.0
2 20 7 2017:12:27 177.0 177.0

Modify and round numbers in a pandas dataframe in Python

Long story short, I have a csv file which I read as a pandas dataframe. The file contains a weather report, but all of the measurements for temperature are in Fahrenheit. I've figured out how to convert them:
import pandas as pd
df = pd.read_csv('report.csv')
df['average temperature'] = (df['average temperature'] - 32) * 5/9
But then the data in this column has up to 6 decimal places.
I've found code that will round all the data in the dataframe, but I need it only for this column.
df.round(2)
I don't like how it has to be a separate piece of code on a separate line and how it modifies all of my data. Is there a way to go about this problem more elegantly? Is there a way to apply this to other columns in my dataframe, such as maximum temperature and minimum temperature without having to copy the above piece of code?
For rounding only some columns, use a subset:
cols = ['maximum temperature','minimum temperature','average temperature']
df[cols] = df[cols].round(2)
If you want to convert only some columns from a list:
cols = ['maximum temperature','minimum temperature','average temperature']
df[cols] = ((df[cols] - 32) * 5/9).round(2)
If you want to round each column separately:
df['average temperature'] = df['average temperature'].round(2)
df['maximum temperature'] = df['maximum temperature'].round(2)
df['minimum temperature'] = df['minimum temperature'].round(2)
Sample:
import numpy as np
import pandas as pd

df = (pd.DataFrame(np.random.randint(30, 100, (10, 3)),
                   columns=['maximum temperature','minimum temperature','average temperature'])
        .assign(a='m', b=range(10)))
print (df)
maximum temperature minimum temperature average temperature a b
0 97 60 98 m 0
1 64 86 64 m 1
2 32 64 95 m 2
3 60 56 93 m 3
4 43 89 64 m 4
5 40 62 86 m 5
6 37 40 70 m 6
7 61 33 46 m 7
8 36 44 46 m 8
9 63 30 33 m 9
cols = ['maximum temperature','minimum temperature','average temperature']
df[cols] = ((df[cols] - 32) * 5/9).round(2)
print (df)
maximum temperature minimum temperature average temperature a b
0 36.11 15.56 36.67 m 0
1 17.78 30.00 17.78 m 1
2 0.00 17.78 35.00 m 2
3 15.56 13.33 33.89 m 3
4 6.11 31.67 17.78 m 4
5 4.44 16.67 30.00 m 5
6 2.78 4.44 21.11 m 6
7 16.11 0.56 7.78 m 7
8 2.22 6.67 7.78 m 8
9 17.22 -1.11 0.56 m 9
Here's a single line solution with apply and a conversion function.
def convert_to_celsius(f):
    return 5.0 / 9.0 * (f - 32)

df[['Column A','Column B']] = df[['Column A','Column B']].apply(convert_to_celsius).round(2)
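Since the arithmetic is vectorized anyway, the same conversion can also be written without apply (a sketch using the same hypothetical 'Column A'/'Column B' names):
df[['Column A','Column B']] = (5.0 / 9.0 * (df[['Column A','Column B']] - 32)).round(2)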
