Sequential Calculation of Pandas Column without For Loop - python

I have the sample dataframe below
perc 2018_norm
0 0.009069 27.799849
1 0.011384 0.00
2 -0.000592 0.00
3 -0.002667 0.00
The value of the first row of 2018_norm comes from another DataFrame. I then want to fill in the second row through the end of the 2018_norm column, using the percentage change in the perc column and the previous row's 2018_norm value. I can currently achieve this with a for loop, giving the following result:
perc 2018_norm
0 0.009069 27.799849
1 0.011384 28.116324
2 -0.000592 28.099667
3 -0.002667 28.024713
4 -0.006538 27.841490
For loops over DataFrames are just slow, so I know I am missing something basic, but my Google searching hasn't yielded what I am looking for.
I've tried variations of y1df['2018_norm'].iloc[1:] = (y1df['perc'] * y1df['2018_norm'].shift(1)) + y1df['2018_norm'].shift(1) that just yield:
perc 2018_norm
0 0.009069 27.799849
1 0.011384 28.116324
2 -0.000592 0.00
3 -0.002667 0.00
4 -0.006538 0.00
What am I missing?
EDIT: To clarify, a basic for loop with df.iloc was not preferable, and a for loop with iterrows sped the computation up substantially, such that a loop using that function is a great solution for my use. Wen-Ben's response also directly answers the question I didn't mean to ask in my original post.

You can use df.iterrows() to loop much more quickly through a pandas data frame:
for idx, row in y1df.iterrows():
    if idx > 0:  # skip the first row
        y1df.loc[idx, '2018_norm'] = (1 + row['perc']) * y1df.loc[idx - 1, '2018_norm']
print(y1df)
perc 2018_norm
0 0.009069 27.799849
1 0.011384 28.116322
2 -0.000592 28.099678
3 -0.002667 28.024736

This is just cumprod
# cumulative growth factors, shifted so row 0 keeps its original value
# (the bogus last factor from fillna(1) + 1 is shifted off the end)
s = (df.perc.shift(-1).fillna(1) + 1).cumprod().shift().fillna(1) * df['2018_norm'].iloc[0]
df['2018_norm'] = s
df
Out[390]:
perc 2018_norm
0 0.009069 27.799849
1 0.011384 28.116322
2 -0.000592 28.099678
3 -0.002667 28.024736
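Equivalently, since each row multiplies the previous one by (1 + perc), the whole column is the first value times a cumulative product. A minimal sketch of that identity, using the sample values from the question (dividing by the first factor drops perc of row 0 from the product):

```python
import pandas as pd

df = pd.DataFrame({'perc': [0.009069, 0.011384, -0.000592, -0.002667],
                   '2018_norm': [27.799849, 0.0, 0.0, 0.0]})

# value_i = value_0 * prod_{k=1..i} (1 + perc_k)
growth = (1 + df['perc']).cumprod() / (1 + df['perc'].iloc[0])
df['2018_norm'] = growth * df['2018_norm'].iloc[0]
```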

Related

How to delete all rows under a special row in pandas (python)?

How to delete all the rows under the row with one column "Exercises" in pandas (Python)?
Data:
2021.08.16 19:37:15 146242975 XAUEUR buy 0.02 1 517.04 1 517.19 1 519.54 2021.08.16 20:38:30 1 517.15 - 0.12 0.00 0.22
2021.08.16 19:37:15 146242976 XAUEUR buy 0.02 1 517.04 1 517.19 1 522.04 2021.08.16 20:38:30 1 517.15 - 0.12 0.00 0.22
Exercises
2021.08.16 01:02:11 146037881 XAUUSD buy 0.18 / 0.18 market 1 777.72 1 781.47 2021.08.16 01:02:11 filled TP1
...
df = pd.DataFrame({'num':[1,2,3,4,'Exercises',6,7,8]})
# First find the row index by filtering on the column value
my_index = df.index[df['num'] == 'Exercises'].tolist()[0]  # there may be multiple matches; [0] takes the first
# my_index = 4
# Then slice the DataFrame and take the values into a new df
df_new = df[:my_index]  # this excludes the 'Exercises' row; use my_index + 1 to include it
I used the loc function.
df = pd.DataFrame({'col':['2021.08.16 19:37:15 146242975','2021.08.16 19:37:15 146242976','Exercises','2021.08.16 01:02:11 146037881'],'values':['a','b','c','d']})
df2 = df.set_index('col')
df2.loc[:'Exercises'][:-1].reset_index()
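An alternative sketch (my addition, not from the answers above): a cumulative boolean mask keeps everything above the marker row without changing the index, and leaves the frame untouched if the marker is absent:

```python
import pandas as pd

df = pd.DataFrame({'num': [1, 2, 3, 4, 'Exercises', 6, 7, 8]})

# eq marks the 'Exercises' row; cummax marks it and every row after it
mask = df['num'].eq('Exercises').cummax()
df_new = df[~mask]  # rows strictly above the marker
```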

How to transform all values NOT equal to 0 into 1 with REGEX, efficiently

I have a column that contains 0s and dates like 12/02/19. I want to transform all the dates into ones and multiply by the Enrolls_F column.
I'd prefer REGEX, but any other option is fine too. It's a large dataset; I tried a simple for loop and my kernel could not run it.
Data:
df = pd.DataFrame({ 'Enrolled_Date': ['0','2019/02/04','0','0','2019/02/04','0'] , 'Enrolls_F': ['1.11','1.11','0.222','1.11','5.22','1'] })
Attempts:
Trying to search for everything that starts with 2, replace it with 1, and multiply by Enrolls_F:
df_test = (df.replace({'Enrolled_Date': r'2.$'}, {'Enrolled_Date': '1'}, regex=True)) * df.Enrolls_F
# Nothing happens
IIUC, this should help you get the trouble sorted:
import pandas as pd
import numpy as np
df = pd.DataFrame({ 'Enrolled_Date': ['0','2019/02/04','0','0','2019/02/04','0'] , 'Enrolls_F': ['1.11','1.11','0.222','1.11','5.22','1'] })
df['Enrolled_Date'] = np.where(df['Enrolled_Date'] == '0', 0, 1)
# note: Enrolls_F holds strings, so 0 * '1.11' gives '' and 1 * '1.11' gives '1.11'
df['multiplication_column'] = df['Enrolled_Date'] * df['Enrolls_F']
print(df)
Output:
Enrolled_Date Enrolls_F multiplication_column
0 0 1.11
1 1 1.11 1.11
2 0 0.222
3 0 1.11
4 1 5.22 5.22
5 0 1
If you want the output as float, try this:
df.Enrolled_Date.ne('0').astype(int) * df.Enrolls_F.astype(float)
Out[212]:
0 0.00
1 1.11
2 0.00
3 0.00
4 5.22
5 0.00
dtype: float64
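Since the question asked for a regex specifically, a hedged sketch: all the dates start with 2, so replacing anything matching `^2.*$` with 1 and casting to float gives the same flags. This assumes every non-zero entry is a date starting with 2, as in the sample data:

```python
import pandas as pd

df = pd.DataFrame({'Enrolled_Date': ['0', '2019/02/04', '0', '0', '2019/02/04', '0'],
                   'Enrolls_F': ['1.11', '1.11', '0.222', '1.11', '5.22', '1']})

# the OP's attempt used r'2.$' ('2' plus one character at end of string), which
# never matches a full date; anchoring at the start and matching the rest does
flags = df['Enrolled_Date'].str.replace(r'^2.*$', '1', regex=True).astype(float)
result = flags * df['Enrolls_F'].astype(float)
```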

Merge multiple measurements into a pandas dataframe

I have some measurements organized in *.csv files as follows:
m_number,value
0,0.154
1,0.785
…
55,0.578
NaN,NaN
0,1.214
1,0.742
…
So there is always a set of x measurements (x should be constant within a single file, but that's not guaranteed, and I have to check this number), with sets separated by a NaN line.
After reading the data into a dataframe, I want to reorganize it for later usage:
m_number value 1 value 2 value 3 value 4
0 0 0.154 0.214 0.229 0.234
1 1 0.785 0.742 0.714 0.771
...
55 55 0.578 0.647 0.597 0.623
Each set of measurements should be one column.
Here's a snippet of the code:
split_index = df.index[df['m_number'].isnull()]
df_sliced = pd.DataFrame()
for i, index in enumerate(split_index):
    if i == 0:
        df_sliced = df.loc[0:index - 1].copy()
    else:
        # ToDo: rename first column to 'value 1' if more than 1 measurement
        temp = df['value'].loc[0:index - 1].copy()
        temp.reset_index(drop=True, inplace=True)
        df_sliced['value ' + str(i)] = temp
    df.drop(df.index[0:index - split_index[i - 1]], inplace=True)
The code works, but I do not like my current approach. So I'm asking if there's a better and more elegant solution for this problem.
Best,
Julz
You can use cumsum, set_index, and unstack to do this in three lines of code:
import numpy as np
import pandas as pd

# Create dummy data with 4 runs of 10 measures
df = pd.DataFrame({'m_number': np.tile(np.arange(10), 4), 'value': np.random.random(40)})

# Use a condition to find the start of each run, increment with cumsum, and
# unstack to create MultiIndex column headers
df_u = df.set_index([df['m_number'].eq(0).cumsum(), df['m_number']])[['value']].unstack()

# Corrected per the comments below: m_number first, so it becomes the row index
df_u = df.set_index([df['m_number'], df['m_number'].eq(0).cumsum()])[['value']].unstack()

# Flatten the MultiIndex column headers
df_u.columns = [f'{i}_{j}' for i, j in df_u.columns]

# Display results
df_u
Output:
value_1 value_2 value_3 value_4
m_number
0 0.919057 0.064409 0.288592 0.742759
1 0.449587 0.867031 0.193493 0.853700
2 0.551929 0.925111 0.895273 0.117306
3 0.487501 0.893696 0.696540 0.381469
4 0.389431 0.818801 0.771516 0.489404
5 0.790619 0.478995 0.023236 0.344112
6 0.015389 0.815073 0.195856 0.628263
7 0.068860 0.483731 0.752803 0.581106
8 0.109404 0.281335 0.330910 0.909965
9 0.695120 0.538676 0.766864 0.247283
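The question also notes that the number of measurements per set is not guaranteed to be constant and has to be checked. A small sketch of that check, reusing the same run counter and the answer's dummy data:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'m_number': np.tile(np.arange(10), 4),
                   'value': np.random.random(40)})

# each run restarts m_number at 0, so the running count of zeros labels the runs
run_id = df['m_number'].eq(0).cumsum()
sizes = df.groupby(run_id).size()
consistent = sizes.nunique() == 1  # True when every set has the same length
```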

Populate column in dataframe based on iat values

lookup={'Tier':[1,2,3,4],'Terr.1':[0.88,0.83,1.04,1.33],'Terr.2':[0.78,0.82,0.91,1.15],'Terr.3':[0.92,0.98,1.09,1.33],'Terr.4':[1.39,1.49,1.66,1.96],'Terr.5':[1.17,1.24,1.39,1.68]}
df={'Tier':[1,1,2,2,3,2,4,4,4,1],'Territory':[1,3,4,5,4,4,2,1,1,2]}
df=pd.DataFrame(df)
lookup=pd.DataFrame(lookup)
lookup contains the lookup values, and df contains the data being fed into iat.
I get the correct values when I print(lookup.iat[tier,terr]). However, when I try to set those values in a new column, it endlessly runs, or in this simple test case just copies 1 value 10 times.
for i in df["Tier"]:
    tier = i - 1
    for j in df["Territory"]:
        terr = j
        # print(lookup.iat[tier, terr])
        df["Rate"] = lookup.iat[tier, terr]
Any thoughts on a possible better solution?
You can use apply() after some modification to your lookup dataframe:
lookup = lookup.rename(columns={i: i.split('.')[-1] for i in lookup.columns}).set_index('Tier')
lookup.columns = lookup.columns.astype(int)
df['Rate'] = df.apply(lambda x: lookup.loc[x['Tier'],x['Territory']], axis=1)
Returns:
Tier Territory Rate
0 1 1 0.88
1 1 3 0.92
2 2 4 1.49
3 2 5 1.24
4 3 4 1.66
5 2 4 1.49
6 4 2 1.15
7 4 1 1.33
8 4 1 1.33
9 1 2 0.78
Once lookup is modified a bit, the same way as @rahlf23 did, plus using stack, you can merge both dataframes:
df['Rate'] = df.merge(lookup.rename(columns={i: int(i.split('.')[-1])
                                             for i in lookup.columns if 'Terr' in i})
                            .set_index('Tier').stack()
                            .reset_index().rename(columns={'level_1': 'Territory'}),
                      how='left')[0]
If you have a big dataframe df, this should be faster than using apply and loc.
Also, if any (Tier, Territory) pair in df does not exist in lookup, this method won't throw an error.
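If the Terr.N columns are already ordered by territory number, the whole lookup can also be done with one NumPy fancy-indexing step. A sketch, assuming Tier and Territory are both 1-based and every pair exists in lookup:

```python
import pandas as pd

lookup = pd.DataFrame({'Tier': [1, 2, 3, 4],
                       'Terr.1': [0.88, 0.83, 1.04, 1.33], 'Terr.2': [0.78, 0.82, 0.91, 1.15],
                       'Terr.3': [0.92, 0.98, 1.09, 1.33], 'Terr.4': [1.39, 1.49, 1.66, 1.96],
                       'Terr.5': [1.17, 1.24, 1.39, 1.68]})
df = pd.DataFrame({'Tier': [1, 1, 2, 2, 3, 2, 4, 4, 4, 1],
                   'Territory': [1, 3, 4, 5, 4, 4, 2, 1, 1, 2]})

# drop the Tier column into the index, then pick rows by Tier-1 and columns by Territory-1
rates = lookup.set_index('Tier').to_numpy()
df['Rate'] = rates[df['Tier'] - 1, df['Territory'] - 1]
```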

Calculate a rolling window weighted average on a Pandas column

I'm relatively new to python, and have been trying to calculate some simple rolling weighted averages across rows in a pandas data frame. I have a dataframe of observations df and a dataframe of weights w. I create a new dataframe to hold the inner-product between these two sets of values, dot.
As w has fewer rows than df, I use a for loop to calculate each row's weighted average over the preceding rows, up to the length of w.
More clearly, my set-up is as follows:
import pandas as pd
df = pd.DataFrame([0,1,2,3,4,5,6,7,8], index = range(0,9))
w = pd.DataFrame([0.1,0.25,0.5], index = range(0,3))
dot = pd.DataFrame(0, columns = ['dot'], index = df.index)
for i in range(0, len(df)):
    dot.loc[i] = sum(df.iloc[max(1, (i-3)):i].values * w.iloc[-min(3, (i-1)):4].values)
I would expect the result to be as follows (i.e. when i = 4)
dot.loc[4] = sum(df.iloc[max(1,(4-3)):4].values * w.iloc[-min(3,(4-1)):4].values)
print(dot.loc[4])  # 2.1
However, when running the for loop above, I receive the error:
ValueError: operands could not be broadcast together with shapes (0,1) (2,1)
Which is where I get confused - I think it must have to do with how I call i into iloc, as I don't receive shape errors when I manually calculate it, as in the example with 4 above. However, looking at other examples and documentation, I don't see why that's the case... Any help is appreciated.
Your first problem is that you are trying to multiply arrays of two different sizes. For example, when i=0 the different parts of your for loop return
df.iloc[max(1,(0-3)):0].values.shape
# (0,1)
w.iloc[-min(3,(0-1)):4].values.shape
# (2,1)
Which is exactly the error you are getting. The easiest way I can think of to make the arrays multipliable is to pad your dataframe with leading zeros, using concatenation.
df2 = pd.concat([pd.Series([0,0]),df], ignore_index=True)
df2
0
0 0
1 0
2 0
3 1
4 2
5 3
6 4
7 5
8 6
9 7
10 8
You can now use your for loop (with some minor tweaking):
for i in range(len(df)):
    dot.loc[i] = sum(df2.iloc[max(0, i):i+3].values * w.values)
A nicer way might be the one JohnE suggested: use the rolling and apply functions built into pandas, thereby getting rid of your for loop:
import numpy as np
df2.rolling(3,min_periods=3).apply(lambda x: np.dot(x,w))
0
0 NaN
1 NaN
2 0.00
3 0.50
4 1.25
5 2.10
6 2.95
7 3.80
8 4.65
9 5.50
10 6.35
You can also drop the first two padding rows and reset the index
df2.rolling(3,min_periods=3).apply(lambda x: np.dot(x,w)).drop([0,1]).reset_index(drop=True)
0
0 0.00
1 0.50
2 1.25
3 2.10
4 2.95
5 3.80
6 4.65
7 5.50
8 6.35
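The same leading-window dot product can also be written with np.convolve, which supplies the zero padding implicitly. A sketch using the question's data; w is applied oldest-first, matching the rolling version above:

```python
import numpy as np

vals = np.arange(9, dtype=float)   # the df column: 0..8
w = np.array([0.1, 0.25, 0.5])     # oldest observation in the window gets 0.1

# np.convolve flips the kernel, so reverse w to keep the oldest-first weighting;
# 'full' mode zero-pads the start, and the first len(vals) entries line up with the rows
dot = np.convolve(vals, w[::-1], mode='full')[:len(vals)]
```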
