Pandas dataframe calculations with previous row - python

I have a dataframe with the columns 'Linear' and 'Delta' and need to create the 'New' column.
In:
Linear Delta
30 -3
60 1.4
65 -0.3
62 4.4
21 -2.5
18 -0.1
34 -3.1
30 -1.5
45 0.5
55 -1.4
43 2.8
51 4.7
62 2.7
Out:
Linear Delta New
30 -3
60 1.4 60.0
65 -0.3 59.7
62 4.4 64.1
21 -2.5 61.6
18 -0.1 61.5
34 -3.1 58.4
30 -1.5 56.9
45 0.5 57.4
55 -1.4 55.0
43 2.8 57.8
51 4.7 51.0
62 2.7 53.7
The formula is the following:
New[i] = IF( AND(Linear[i-1]<50,Linear[i]>50) , Linear[i] , New[i-1]+Delta[i] )
I tried many different approaches, such as cumsum(), and spent many hours on it, but never found a solution.

For this kind of recursive algorithm, consider numba with a manual loop. You will likely find just-in-time compilation more efficient than Pandas-based methods or row-wise iteration.
import numpy as np
from numba import jit
@jit(nopython=True)
def calc_new(L, D):
    # res[i] depends on res[i-1], so the loop cannot be vectorized directly
    res = np.zeros(L.shape)
    res[0] = np.nan
    for i in range(1, len(res)):
        res[i] = L[i] if (L[i-1] < 50) & (L[i] > 50) else res[i-1] + D[i]
    return res
df['New'] = calc_new(df['Linear'].values, df['Delta'].values)
Result
print(df)
Linear Delta New
0 30 -3.0 NaN
1 60 1.4 60.0
2 65 -0.3 59.7
3 62 4.4 64.1
4 21 -2.5 61.6
5 18 -0.1 61.5
6 34 -3.1 58.4
7 30 -1.5 56.9
8 45 0.5 57.4
9 55 -1.4 55.0
10 43 2.8 57.8
11 51 4.7 51.0
12 62 2.7 53.7
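For readers without numba installed, the same recurrence can be sketched as a plain Python loop. This is only an illustrative variant: the function name calc_new_plain is made up here, and the dataframe is rebuilt from the question's sample data so the snippet runs on its own.

```python
import numpy as np
import pandas as pd

# Sample data copied from the question
df = pd.DataFrame({
    "Linear": [30, 60, 65, 62, 21, 18, 34, 30, 45, 55, 43, 51, 62],
    "Delta": [-3, 1.4, -0.3, 4.4, -2.5, -0.1, -3.1, -1.5, 0.5, -1.4, 2.8, 4.7, 2.7],
})

def calc_new_plain(linear, delta):
    """Hypothetical pure-Python variant of calc_new (no numba required)."""
    res = np.full(len(linear), np.nan)  # res[0] stays NaN: there is no previous row
    for i in range(1, len(res)):
        if linear[i - 1] < 50 and linear[i] > 50:
            res[i] = linear[i]              # Linear crossed 50 upwards: restart from Linear
        else:
            res[i] = res[i - 1] + delta[i]  # otherwise keep accumulating Delta
    return res

df["New"] = calc_new_plain(df["Linear"].to_numpy(), df["Delta"].to_numpy()).round(1)
print(df)
```

On large frames this loop will be slower than the compiled version, but it is handy for checking the logic.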

Not very nice, but working:
df['NEW'] = np.nan
for i, row in df.iterrows():
    if i > 0:
        m = (row['Linear'] > 50) & (df.loc[i-1, 'Linear'] < 50)
        df.loc[i, 'NEW'] = np.where(m, row['Linear'], row['Delta'] + df.loc[i-1, 'NEW'])
print(df)
Linear Delta New NEW
0 30 -3.0 NaN NaN
1 60 1.4 60.0 60.0
2 65 -0.3 59.7 59.7
3 62 4.4 64.1 64.1
4 21 -2.5 61.6 61.6
5 18 -0.1 61.5 61.5
6 34 -3.1 58.4 58.4
7 30 -1.5 56.9 56.9
8 45 0.5 57.4 57.4
9 55 -1.4 55.0 55.0
10 43 2.8 57.8 57.8
11 51 4.7 51.0 51.0
12 62 2.7 53.7 53.7

In this program I have used two temporary lists to get the work done. The "prevLinear" list holds the Linear[i-1] values. The "tempList" list stores the results of the "New" column for use in later calculations.
Execute the code to get the result.
Let me know if this helps.
Thank you.
import pandas as pd
import numpy as np
Linear = [30,60,65,62,21,18,34,30,45,55,43,51,62]
Delta = [-3,1.4,-0.3,4.4,-2.5,-0.1,-3.1,-1.5,0.5,-1.4,2.8,4.7,2.7]
df = pd.DataFrame({
    'Linear':Linear,
    'Delta':Delta
})
# Columns are selected by name, so the code does not depend on column order
prevLinear = [np.nan if i == 0 else df['Linear'].iloc[i-1] for i in range(len(df))]
df['prevLinear'] = prevLinear
tempList = []  # running 'New' values, read back as New[i-1]
new = []
for i, value in enumerate(df['Linear'].values):
    if value > 50 and df['prevLinear'].iloc[i] < 50:
        new.append(value)
        tempList.append(value)
    else:
        if i > 0:
            new.append(df['Delta'].iloc[i] + tempList[i-1])
            tempList.append(df['Delta'].iloc[i] + tempList[i-1])
        else:
            new.append(np.nan)
            tempList.append(np.nan)
new = np.round(new, 1)
df['New'] = new
finalDF = df[['Linear', 'Delta', 'New']]
print(finalDF)

Related

Create a single column from multiple columns in a dataframe

I have a dataframe with 5 columns: M1, M2, M3, M4 and M5. Each column contains floating-point values. Now I want to combine the data of 5 columns into one.
I tried
cols = list(df.columns)
df_new['Total'] = []
df_new['Total'] = [df_new['Total'].append(df[i], ignore_index=True) for i in cols]
But I'm getting this
I'm using Python 3.8.5 and Pandas 1.1.2.
Here's a part of my df
M1 M2 M3 M4 M5
0 5 12 20 26
0.5 5.5 12.5 20.5 26.5
1 6 13 21 27
1.5 6.5 13.5 21.5 27.5
2 7 14 22 28
2.5 7.5 14.5 22.5 28.5
10 15 22 30 36
10.5 15.5 22.5 30.5 36.5
11 16 23 31 37
11.5 16.5 23.5 31.5 37.5
12 17 24 32 38
12.5 17.5 24.5 32.5 38.5
And this is what I'm expecting
0
0.5
1
1.5
2
2.5
10
10.5
11
11.5
12
12.5
5
5.5
6
6.5
7
7.5
15
15.5
16
16.5
17
17.5
12
12.5
13
13.5
14
14.5
22
22.5
23
23.5
24
24.5
20
20.5
21
21.5
22
22.5
30
30.5
31
31.5
32
32.5
26
26.5
27
27.5
28
28.5
36
36.5
37
37.5
38
38.5
Just use the concat() method with a list comprehension:
import pandas as pd
result = pd.concat([df[x] for x in df.columns], ignore_index=True)
Printing result now gives your desired output.
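As a minimal sketch, here is the concat() approach on a three-column sample taken from the question's data, next to DataFrame.unstack() as a comparable alternative. The variable names stacked and unstacked are only illustrative.

```python
import pandas as pd

# First three rows of M1..M3 from the question
df = pd.DataFrame({
    "M1": [0.0, 0.5, 1.0],
    "M2": [5.0, 5.5, 6.0],
    "M3": [12.0, 12.5, 13.0],
})

# Column after column, with a fresh 0..n-1 index
stacked = pd.concat([df[c] for c in df.columns], ignore_index=True)

# unstack() on a frame returns a Series indexed by (column, row);
# dropping that index leaves the same column-major order
unstacked = df.unstack().reset_index(drop=True)

print(stacked)
```

Both give one Series with all of M1 first, then M2, then M3.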

How to style my dataframe by column with conditions?

I want to paint the share price cell green if it is higher than the target price and red if it is lower than the alert price, but my code is not working and keeps raising errors.
This is the code that I use
temp_df.style.apply(lambda x: ["background: red" if v < x.iloc[:,1:] and x.iloc[:,1:] != 0 else "" for v in x], subset=['Share Price'], axis = 0)
temp_df.style.apply(lambda x: ["background: green" if v > x.iloc[:,2:] and x.iloc[:,2:] != 0 else "" for v in x], subset=['Share Price'], axis = 0)
Can anyone give me an idea on how to do it?
Index Share Price Alert/Entry Target
0 622.0 424.0 950.0
1 6880.0 5200.0 7450.0
2 62860.0 40000.0 60000.0
3 7669.0 5500.0 8000.0
4 5295.0 3500.0 5500.0
5 227.0 165.0 250.0
6 3970.0 3200.0 4250.0
7 1300.0 850.0 1650.0
8 8480.0 6500.0 8500.0
9 11.3 0.0 0.0
10 66.0 58.0 75.0
11 7.3 6.4 9.6
12 114.8 75.0 130.0
13 172.3 90.0 0.0
14 2.6 2.4 3.2
15 76.8 68.0 85.0
16 19.6 15.4 21.0
17 21.9 11.0 18.6
18 35.4 29.0 42.0
19 12.5 9.2 0.0
20 15.5 0.0 0.0
21 449.8 0.0 0.0
22 4.3 3.6 5.0
23 47.4 40.0 55.0
24 0.6 0.5 0.6
25 49.2 45.0 72.0
26 13.9 0.0 0.0
27 3.0 2.4 4.5
28 2.4 1.8 4.2
29 54.0 0.0 0.0
30 293.5 100.0 250.0
31 190000.0 140000.0 220000.0
32 52200.0 46000.0 58000.0
33 100500.0 75000.0 115000.0
34 4.9 3.8 6.5
35 0.2 0.0 0.0
36 1430.0 980.0 1450.0
37 1585.0 0.0 0.0
38 15.6 11.0 18.0
39 3.3 2.8 6.0
40 52.5 45.0 68.0
41 46.5 35.0 0.0
42 193.6 135.0 0.0
43 122.8 90.0 0.0
44 222.6 165.0 265.0
Provided that "Index" is also a column:
temp_df.style.apply(lambda x: ["background: green" if (i==1 and v > x.iloc[3] and x.iloc[3] != 0) else ("background: red" if (i==1 and v < x.iloc[2]) else "") for i, v in enumerate(x)], axis=1)
Here i is the position of each cell within the row; the check i==1 restricts the styling to the Share Price column (column 1).
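The same row-wise rule may be easier to read as a named function. The sketch below assumes the four columns shown in the question; highlight_row is a hypothetical helper name, and the last sample row is made up so that the red branch is exercised.

```python
import pandas as pd

temp_df = pd.DataFrame({
    "Index": [0, 2, 9, 30, 99],
    "Share Price": [622.0, 62860.0, 11.3, 293.5, 50.0],
    "Alert/Entry": [424.0, 40000.0, 0.0, 100.0, 58.0],
    "Target": [950.0, 60000.0, 0.0, 250.0, 75.0],
})  # the row with Index 99 is invented for illustration

def highlight_row(row):
    """Hypothetical helper: return one CSS string per cell of the row."""
    styles = [""] * len(row)
    pos = row.index.get_loc("Share Price")  # only this cell is ever colored
    price, alert, target = row["Share Price"], row["Alert/Entry"], row["Target"]
    if target != 0 and price > target:
        styles[pos] = "background: green"   # above a non-zero target
    elif price < alert:
        styles[pos] = "background: red"     # below the alert price
    return styles

# Inspect the styles without rendering anything
styles = [highlight_row(row) for _, row in temp_df.iterrows()]
print(styles)
```

Passing the function as temp_df.style.apply(highlight_row, axis=1) should then color the cells the same way as the one-liner.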

Add a dataframe that represents the max value based on comparison of other dataframes

I have the following multi-index dataframe where one df represents the daily high of hypothetical stocks and the other consists of their previous day close.
High_Price Yest_Close
Ticker ABC XYZ RST ABC XYZ RST
2/1/19 3 10 90 2 9 88
1/31/19 3.5 9 88 4 9.5 89
1/30/19 2.5 9.5 86 3 9.8 85
1/29/19 4 8.5 92 3.5 8 93
1/28/19 4.5 8.2 95 4.8 8 96
1/27/19 2.8 7 94 2.6 6.5 93
1/26/19 2.6 6.5 93 2.7 7 92
I want to append another dataframe that represents the max value between the two dfs (High_Price and Yest_Close). So the third df should look like the following:
High_Price Yest_Close Max
Ticker ABC XYZ RST ABC XYZ RST ABC XYZ RST
2/1/19 3 10 90 2 9 88 3 10 90
1/31/19 3.5 9 88 4 9.5 89 4 9.5 89
1/30/19 2.5 9.5 86 3 9.8 85 3 9.8 86
1/29/19 4 8.5 92 3.5 8 93 4 8.5 93
1/28/19 4.5 8.2 95 4.8 8 96 4.8 8.2 96
1/27/19 2.8 7 94 2.6 6.5 93 2.8 7 94
1/26/19 2.6 6.5 93 2.7 7 92 2.7 7 93
I tried the following logic but it's not getting me the proper result:
df['Max',ticker] = df[['High_Price','Yest_Close']].max(axis=1)
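How should I fix my code to get the result I'm looking for?
You want level=1 inside max, then create a MultiIndex for the new columns and attach them with df.join (note that the level argument of max was deprecated in pandas 1.3 and removed in 2.0, where a groupby on the column level is needed instead):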
m = df[['High_Price','Yest_Close']].max(level=1,axis=1)
m.columns = pd.MultiIndex.from_product((['Max'],m.columns))
out = df.join(m)
High_Price Yest_Close Max
ABC XYZ RST ABC XYZ RST ABC XYZ RST
Ticker
2/1/19 3.0 10.0 90 2.0 9.0 88 3.0 10.0 90.0
1/31/19 3.5 9.0 88 4.0 9.5 89 4.0 9.5 89.0
1/30/19 2.5 9.5 86 3.0 9.8 85 3.0 9.8 86.0
1/29/19 4.0 8.5 92 3.5 8.0 93 4.0 8.5 93.0
1/28/19 4.5 8.2 95 4.8 8.0 96 4.8 8.2 96.0
1/27/19 2.8 7.0 94 2.6 6.5 93 2.8 7.0 94.0
1/26/19 2.6 6.5 93 2.7 7.0 92 2.7 7.0 93.0
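On pandas 2.0 and later, where max no longer accepts level, one possible equivalent is to transpose, group on the ticker level, and transpose back. This is a sketch on the first three rows of the question's data, not the original answer's code; sort=False is used so the tickers keep their original order.

```python
import pandas as pd

cols = pd.MultiIndex.from_product([["High_Price", "Yest_Close"], ["ABC", "XYZ", "RST"]])
df = pd.DataFrame(
    [[3, 10, 90, 2, 9, 88],
     [3.5, 9, 88, 4, 9.5, 89],
     [2.5, 9.5, 86, 3, 9.8, 85]],
    index=["2/1/19", "1/31/19", "1/30/19"],
    columns=cols,
)

# Transpose so tickers become a row level, take the max per ticker, transpose back
m = df[["High_Price", "Yest_Close"]].T.groupby(level=1, sort=False).max().T
m.columns = pd.MultiIndex.from_product((["Max"], m.columns))
out = df.join(m)
print(out)
```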

Merge in range pandas

This is my dataset. For this problem just consider the first and the last column.
45,37.25,14.5,-43.15,8.6
46,37.25,13.5,-42.15,8.6
47,37.25,12.5,-41.15,8.6
48,37.25,11.5,-40.15,8.6
49,37.25,10.5,-39.15,8.6
50,37.25,9.5,-38.15,8.6
51,36.25,8.5,-37.15,7.6
52,35.25,7.5,-36.15,6.6
53,34.25,6.5,-35.15,5.6
54,33.25,5.5,-34.15,4.6
55,32.25,4.5,-33.15,3.6
56,31.25,3.5,-32.15,2.6
57,30.25,2.5,-31.15,1.6
58,29.25,1.5,-30.15,0.6
59,28.25,0.5,-29.15,-0.4
60,27.25,-0.5,-28.15,-1.4
61,26.25,-0.5,-27.15,-1.4
62,25.25,-0.5,-26.15,-1.4
63,24.25,-0.5,-25.15,-1.4
64,23.25,-0.5,-24.15,-1.4
65,22.25,-0.5,-23.15,-1.4
The expected output is:
Below 50,8.6
51,7.6
52,6.6
53,5.6
54,4.6
55,3.6
56,2.6
57,1.6
58,0.6
59,-0.4
Above 60, -1.4
The logic here is that if the value of the last column is the same for 5 consecutive rows, the loop should break and return the output above.
I am trying to solve this the Pandas way, but I don't know where to start. Any help will be appreciated.
As suggested in the comments by @Erfan, there is probably a mistake in the first column of the output.
Here is one solution, assuming you want to keep the first row of each group:
# I renamed the columns
print(df)
# a x y z b
# 0 45 37.25 14.5 -43.15 8.6
# 1 46 37.25 13.5 -42.15 8.6
# 2 47 37.25 12.5 -41.15 8.6
# 3 48 37.25 11.5 -40.15 8.6
# 4 49 37.25 10.5 -39.15 8.6
# 5 50 37.25 9.5 -38.15 8.6
# 6 51 36.25 8.5 -37.15 7.6
# 7 52 35.25 7.5 -36.15 6.6
# 8 53 34.25 6.5 -35.15 5.6
# 9 54 33.25 5.5 -34.15 4.6
# 10 55 32.25 4.5 -33.15 3.6
# 11 56 31.25 3.5 -32.15 2.6
# 12 57 30.25 2.5 -31.15 1.6
# 13 58 29.25 1.5 -30.15 0.6
# 14 59 28.25 0.5 -29.15 -0.4
# 15 60 27.25 -0.5 -28.15 -1.4
# 16 61 26.25 -0.5 -27.15 -1.4
# 17 62 25.25 -0.5 -26.15 -1.4
# 18 63 24.25 -0.5 -25.15 -1.4
# 19 64 23.25 -0.5 -24.15 -1.4
# 20 65 22.25 -0.5 -23.15 -1.4
def valid(x):
    # Runs shorter than 5 rows are kept whole; longer runs collapse to their first row
    if len(x) < 5:
        return x
    return x.head(1)
df["ids"] = (df.b != df.b.shift()).cumsum()
output = df.groupby("ids").apply(valid).reset_index(level=0, drop=True)[df.columns[:-1]]
print(output)
# a x y z b
# 0 45 37.25 14.5 -43.15 8.6
# 6 51 36.25 8.5 -37.15 7.6
# 7 52 35.25 7.5 -36.15 6.6
# 8 53 34.25 6.5 -35.15 5.6
# 9 54 33.25 5.5 -34.15 4.6
# 10 55 32.25 4.5 -33.15 3.6
# 11 56 31.25 3.5 -32.15 2.6
# 12 57 30.25 2.5 -31.15 1.6
# 13 58 29.25 1.5 -30.15 0.6
# 14 59 28.25 0.5 -29.15 -0.4
# 15 60 27.25 -0.5 -28.15 -1.4
If you want the last row instead (for runs of 5 or more consecutive identical values), replace x.head(1) with x.tail(1) or whatever function you want.
Here is one way:
n = 5
s1 = df.iloc[:, -1].diff().ne(0).cumsum()
s2 = s1.groupby(s1).transform('count') >= n  # True for runs of at least n equal values
pd.concat([df[s2].groupby(s1).head(1), df[~s2]]).sort_index()
1 2 3 4 5
0 45 37.25 14.5 -43.15 8.6
6 51 36.25 8.5 -37.15 7.6
7 52 35.25 7.5 -36.15 6.6
8 53 34.25 6.5 -35.15 5.6
9 54 33.25 5.5 -34.15 4.6
10 55 32.25 4.5 -33.15 3.6
11 56 31.25 3.5 -32.15 2.6
12 57 30.25 2.5 -31.15 1.6
13 58 29.25 1.5 -30.15 0.6
14 59 28.25 0.5 -29.15 -0.4
15 60 27.25 -0.5 -28.15 -1.4
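Both answers rest on the same idea: label each run of equal values, then keep only the first row of long runs. A compact sketch of that idea with a single boolean mask follows. It uses only the first and last columns of the question's data, and the names run_id, run_len and pos_in_run are illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    "a": range(45, 66),
    "b": [8.6] * 6 + [7.6, 6.6, 5.6, 4.6, 3.6, 2.6, 1.6, 0.6, -0.4] + [-1.4] * 6,
})

n = 5
# A new run starts whenever b differs from the previous row
run_id = df["b"].ne(df["b"].shift()).cumsum()
run_len = run_id.groupby(run_id).transform("count")  # length of the run each row belongs to
pos_in_run = run_id.groupby(run_id).cumcount()       # 0 for the first row of each run

# Keep every row of short runs, and only the first row of runs with n or more rows
out = df[(run_len < n) | (pos_in_run == 0)]
print(out)
```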

Convert specific string to a numeric value in pandas

I am trying to analyse some rainfall data. An example of the data looks like this:
10 18/05/2016 26.9 40 20.8 34 52.2 20.8 46.5 45
11 19/05/2016 25.5 32 0.3 41.6 42 0.3 56.3 65.2
12 20/05/2016 8.5 29 18.4 9 36 18.4 28.6 46
13 21/05/2016 24.5 18 TRACE 3.5 17 TRACE 4.4 40
14 22/05/2016 0.6 18 0 6.5 14 0 8.6 20
15 23/05/2016 3.5 9 0.6 4.3 14 0.6 7 15
16 24/05/2016 3.6 25 T 3 12 T 14.9 9
17 25/05/2016 25 21 2.2 25.6 50 2.2 25 9
The rainfall data contains the specific strings 'TRACE' and 'T' (both meaning a non-measurable rainfall amount). For analysis, I would like to convert these strings into 1.0 (float). My desired data should look like this, so that I can plot the values as a line diagram:
10 18/05/2016 26.9 40 20.8 34 52.2 20.8 46.5 45
11 19/05/2016 25.5 32 0.3 41.6 42 0.3 56.3 65.2
12 20/05/2016 8.5 29 18.4 9 36 18.4 28.6 46
13 21/05/2016 24.5 18 1.0 3.5 17 1.0 4.4 40
14 22/05/2016 0.6 18 0 6.5 14 0 8.6 20
15 23/05/2016 3.5 9 0.6 4.3 14 0.6 7 15
16 24/05/2016 3.6 25 1.0 3 12 1.0 14.9 9
17 25/05/2016 25 21 2.2 25.6 50 2.2 25 9
Can someone point me in the right direction?
You can use df.replace and then convert the numeric columns to float using df.astype (the original datatype would be object, so without the conversion any operations on these columns would still suffer from performance issues):
df = df.replace('^T(RACE)?$', 1.0, regex=True)
df.iloc[:, 1:] = df.iloc[:, 1:].astype(float) # converting object columns to floats
This will replace all T or TRACE elements with 1.0.
Output:
10 18/05/2016 26.9 40 20.8 34.0 52.2 20.8 46.5 45.0
11 19/05/2016 25.5 32 0.3 41.6 42.0 0.3 56.3 65.2
12 20/05/2016 8.5 29 18.4 9.0 36.0 18.4 28.6 46.0
13 21/05/2016 24.5 18 1 3.5 17.0 1 4.4 40.0
14 22/05/2016 0.6 18 0 6.5 14.0 0 8.6 20.0
15 23/05/2016 3.5 9 0.6 4.3 14.0 0.6 7.0 15.0
16 24/05/2016 3.6 25 1 3.0 12.0 1 14.9 9.0
17 25/05/2016 25.0 21 2.2 25.6 50.0 2.2 25.0 9.0
Use replace by dict:
df = df.replace({'T':1.0, 'TRACE':1.0})
And then if necessary convert columns to float:
cols = df.columns.difference(['Date','another cols dont need convert'])
df[cols] = df[cols].astype(float)
df = df.replace({'T':1.0, 'TRACE':1.0})
cols = df.columns.difference(['Date','a'])
df[cols] = df[cols].astype(float)
print (df)
a Date 2 3 4 5 6 7 8 9
0 10 18/05/2016 26.9 40.0 20.8 34.0 52.2 20.8 46.5 45.0
1 11 19/05/2016 25.5 32.0 0.3 41.6 42.0 0.3 56.3 65.2
2 12 20/05/2016 8.5 29.0 18.4 9.0 36.0 18.4 28.6 46.0
3 13 21/05/2016 24.5 18.0 1.0 3.5 17.0 1.0 4.4 40.0
4 14 22/05/2016 0.6 18.0 0.0 6.5 14.0 0.0 8.6 20.0
5 15 23/05/2016 3.5 9.0 0.6 4.3 14.0 0.6 7.0 15.0
6 16 24/05/2016 3.6 25.0 1.0 3.0 12.0 1.0 14.9 9.0
7 17 25/05/2016 25.0 21.0 2.2 25.6 50.0 2.2 25.0 9.0
print (df.dtypes)
a int64
Date object
2 float64
3 float64
4 float64
5 float64
6 float64
7 float64
8 float64
9 float64
dtype: object
Extending the answer from @jezrael, you can replace and convert to floats in a single statement (assumes the first column is Date and the remaining ones are the desired numeric columns):
df.iloc[:, 1:] = df.iloc[:, 1:].replace({'T':1.0, 'TRACE':1.0}).astype(float)
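A small end-to-end sketch of the replace-then-convert idea. The column names r1 and r2 are made up, and the values are held as strings in object columns, as they typically are when a file contains text such as TRACE. The sketch assigns back by column label rather than through iloc, because on recent pandas versions an iloc assignment may write into the existing object columns and leave their dtype unchanged.

```python
import pandas as pd

# dtype=object mimics columns that were read as text because of 'T'/'TRACE'
df = pd.DataFrame({
    "Date": ["21/05/2016", "22/05/2016", "24/05/2016"],
    "r1": ["24.5", "0.6", "3.6"],
    "r2": ["TRACE", "0", "T"],
}, dtype=object)

num_cols = df.columns.difference(["Date"])  # every column except Date
df[num_cols] = df[num_cols].replace({"T": 1.0, "TRACE": 1.0}).astype(float)

print(df)
print(df.dtypes)
```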
