This is my dataset. For this problem, just consider the first and the last columns.
45,37.25,14.5,-43.15,8.6
46,37.25,13.5,-42.15,8.6
47,37.25,12.5,-41.15,8.6
48,37.25,11.5,-40.15,8.6
49,37.25,10.5,-39.15,8.6
50,37.25,9.5,-38.15,8.6
51,36.25,8.5,-37.15,7.6
52,35.25,7.5,-36.15,6.6
53,34.25,6.5,-35.15,5.6
54,33.25,5.5,-34.15,4.6
55,32.25,4.5,-33.15,3.6
56,31.25,3.5,-32.15,2.6
57,30.25,2.5,-31.15,1.6
58,29.25,1.5,-30.15,0.6
59,28.25,0.5,-29.15,-0.4
60,27.25,-0.5,-28.15,-1.4
61,26.25,-0.5,-27.15,-1.4
62,25.25,-0.5,-26.15,-1.4
63,24.25,-0.5,-25.15,-1.4
64,23.25,-0.5,-24.15,-1.4
65,22.25,-0.5,-23.15,-1.4
The expected output is:
Below 50,8.6
51,7.6
52,6.6
53,5.6
54,4.6
55,3.6
56,2.6
57,1.6
58,0.6
59,-0.4
Above 60,-1.4
The logic here is: if the value of the last column is the same for 5 continuous rows, then break the loop and return the output above.
I am trying to solve this the Pandas way, but I don't know where to start. Any help will be appreciated.
As suggested in the comments by @Erfan, there is probably a mistake in the first column of the output.
Here is one solution, assuming you want to keep the first row of each group:
# I renamed the columns
print(df)
# a x y z b
# 0 45 37.25 14.5 -43.15 8.6
# 1 46 37.25 13.5 -42.15 8.6
# 2 47 37.25 12.5 -41.15 8.6
# 3 48 37.25 11.5 -40.15 8.6
# 4 49 37.25 10.5 -39.15 8.6
# 5 50 37.25 9.5 -38.15 8.6
# 6 51 36.25 8.5 -37.15 7.6
# 7 52 35.25 7.5 -36.15 6.6
# 8 53 34.25 6.5 -35.15 5.6
# 9 54 33.25 5.5 -34.15 4.6
# 10 55 32.25 4.5 -33.15 3.6
# 11 56 31.25 3.5 -32.15 2.6
# 12 57 30.25 2.5 -31.15 1.6
# 13 58 29.25 1.5 -30.15 0.6
# 14 59 28.25 0.5 -29.15 -0.4
# 15 60 27.25 -0.5 -28.15 -1.4
# 16 61 26.25 -0.5 -27.15 -1.4
# 17 62 25.25 -0.5 -26.15 -1.4
# 18 63 24.25 -0.5 -25.15 -1.4
# 19 64 23.25 -0.5 -24.15 -1.4
# 20 65 22.25 -0.5 -23.15 -1.4
def valid(x):
    # groups shorter than 5 rows are kept whole; longer runs collapse to their first row
    if len(x) < 5:
        return x
    return x.head(1)

# label each run of consecutive equal values in column b with its own id
df["ids"] = (df.b != df.b.shift()).cumsum()
output = df.groupby("ids").apply(valid).reset_index(level=0, drop=True)[df.columns[:-1]]
print(output)
# a x y z b
# 0 45 37.25 14.5 -43.15 8.6
# 6 51 36.25 8.5 -37.15 7.6
# 7 52 35.25 7.5 -36.15 6.6
# 8 53 34.25 6.5 -35.15 5.6
# 9 54 33.25 5.5 -34.15 4.6
# 10 55 32.25 4.5 -33.15 3.6
# 11 56 31.25 3.5 -32.15 2.6
# 12 57 30.25 2.5 -31.15 1.6
# 13 58 29.25 1.5 -30.15 0.6
# 14 59 28.25 0.5 -29.15 -0.4
# 15 60 27.25 -0.5 -28.15 -1.4
If you want the last row instead (in the case where there are 5 or more consecutive identical rows), replace x.head(1) with x.tail(1), or whatever function you want.
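A minimal sketch of that variant:

def valid(x):
    if len(x) < 5:
        return x
    return x.tail(1)  # keep the last row of each long run instead of the first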
Here is one way
n = 5
# label runs of consecutive equal values in the last column
s1 = df.iloc[:, -1].diff().ne(0).cumsum()
# flag rows that belong to a run longer than n
s2 = s1.groupby(s1).transform('count') > n
# keep the first row of each long run, plus all rows of short runs, in original order
pd.concat([df[s2].groupby(s1).head(1), df[~s2]]).sort_index()
1 2 3 4 5
0 45 37.25 14.5 -43.15 8.6
6 51 36.25 8.5 -37.15 7.6
7 52 35.25 7.5 -36.15 6.6
8 53 34.25 6.5 -35.15 5.6
9 54 33.25 5.5 -34.15 4.6
10 55 32.25 4.5 -33.15 3.6
11 56 31.25 3.5 -32.15 2.6
12 57 30.25 2.5 -31.15 1.6
13 58 29.25 1.5 -30.15 0.6
14 59 28.25 0.5 -29.15 -0.4
15 60 27.25 -0.5 -28.15 -1.4
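One nuance, reading the requirement as "same for 5 continuous rows": transform('count') > n only flags runs longer than five rows. If a run of exactly five should also collapse, compare with >= instead:

s2 = s1.groupby(s1).transform('count') >= n  # runs of five or more collapse to one row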
I load an example dataset into a dataframe, fit a statsmodels OLS of Texture as a function of Mix, and then use that model to build an ANOVA table.
import numpy as np
import pandas as pd
import statsmodels.api as sm
from statsmodels.formula.api import ols
df = pd.read_csv('contrastExampleData.csv')
mod = ols(formula = 'Texture ~ Mix', data = df).fit()
aov_table = sm.stats.anova_lm(mod, typ = 1)
print(aov_table)
If it's preferred that I upload the csv and link it, please let me know.
The dataframe:
Mix Blend Flour SPI Texture
0 1 0.5 KSS 1.1 107.3
1 1 0.5 KSS 1.1 110.1
2 1 0.5 KSS 1.1 112.6
3 2 0.5 KSS 2.2 97.9
4 2 0.5 KSS 2.2 100.1
5 2 0.5 KSS 2.2 102.0
6 3 0.5 KSS 3.3 86.8
7 3 0.5 KSS 3.3 88.1
8 3 0.5 KSS 3.3 89.1
9 4 0.5 KNC 1.1 108.1
10 4 0.5 KNC 1.1 110.1
11 4 0.5 KNC 1.1 111.8
12 5 0.5 KNC 2.2 108.6
13 5 0.5 KNC 2.2 110.2
14 5 0.5 KNC 2.2 111.2
15 6 0.5 KNC 3.3 95.0
16 6 0.5 KNC 3.3 95.4
17 6 0.5 KNC 3.3 95.5
18 7 1.0 KSS 1.1 97.3
19 7 1.0 KSS 1.1 99.1
20 7 1.0 KSS 1.1 100.6
21 8 1.0 KSS 2.2 92.8
22 8 1.0 KSS 2.2 94.6
23 8 1.0 KSS 2.2 96.7
24 9 1.0 KSS 3.3 86.8
25 9 1.0 KSS 3.3 88.1
26 9 1.0 KSS 3.3 89.1
27 10 1.0 KNC 1.1 94.1
28 10 1.0 KNC 1.1 96.1
29 10 1.0 KNC 1.1 97.8
30 11 1.0 KNC 2.2 95.7
31 11 1.0 KNC 2.2 97.6
32 11 1.0 KNC 2.2 99.8
33 12 1.0 KNC 3.3 90.2
34 12 1.0 KNC 3.3 92.1
35 12 1.0 KNC 3.3 93.7
Resulting in output:
df sum_sq mean_sq F PR(>F)
Mix 1.0 520.080472 520.080472 10.828726 0.002334
Residual 34.0 1632.947028 48.027854 NaN NaN
However, this is entirely incorrect - the correct ANOVA table can be seen here. At a glance, the degrees of freedom should be 11 instead of 1, given that there are 12 levels of Mix, but I cannot figure out why this happened. I've done similar analyses with simpler datasets of only two columns without any issue. I've attempted to use sm.OLS and others but haven't had much luck. What is causing the incorrect ANOVA?
This question is effectively answered by this R question, as statsmodels uses R-style formulae. I found this just after posting and wanted to share it for others with similar questions in Python.
The solution is to treat the independent variable as categorical rather than numeric, since "Mix" is not a continuous numerical variable but 12 discrete labels. This is done by:
mod = ols(formula = 'Texture ~ C(Mix)', data = df).fit()
which results in the correct ANOVA table:
            df     sum_sq     mean_sq          F        PR(>F)
C(Mix)    11.0  2080.2875  189.117045  62.397705  6.550053e-15
Residual  24.0    72.7400    3.030833        NaN           NaN
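Equivalently, you can convert the column dtype up front so the original formula works unchanged. A small sketch, reusing the imports above (patsy treats pandas categorical columns as factors):

df['Mix'] = df['Mix'].astype('category')  # 12 discrete labels, not a continuous variable
mod = ols(formula = 'Texture ~ Mix', data = df).fit()
aov_table = sm.stats.anova_lm(mod, typ = 1)
print(aov_table)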
I have a dataframe with 5 columns: M1, M2, M3, M4 and M5. Each column contains floating-point values. Now I want to combine the data of 5 columns into one.
I tried
cols = list(df.columns)
df_new['Total'] = []
df_new['Total'] = [df_new['Total'].append(df[i], ignore_index=True) for i in cols]
But I'm getting this
I'm using Python 3.8.5 and Pandas 1.1.2.
Here's a part of my df
M1 M2 M3 M4 M5
0 5 12 20 26
0.5 5.5 12.5 20.5 26.5
1 6 13 21 27
1.5 6.5 13.5 21.5 27.5
2 7 14 22 28
2.5 7.5 14.5 22.5 28.5
10 15 22 30 36
10.5 15.5 22.5 30.5 36.5
11 16 23 31 37
11.5 16.5 23.5 31.5 37.5
12 17 24 32 38
12.5 17.5 24.5 32.5 38.5
And this is what I'm expecting:
0
0.5
1
1.5
2
2.5
10
10.5
11
11.5
12
12.5
5
5.5
6
6.5
7
7.5
15
15.5
16
16.5
17
17.5
12
12.5
13
13.5
14
14.5
22
22.5
23
23.5
24
24.5
20
20.5
21
21.5
22
22.5
30
30.5
31
31.5
32
32.5
26
26.5
27
27.5
28
28.5
36
36.5
37
37.5
38
38.5
Just make use of the concat() method with a generator expression:
import pandas as pd

result = pd.concat((df[x] for x in df.columns), ignore_index=True)
Now if you print result, you will get your desired output.
Performance (concat() vs. unstack()):
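A minimal timing sketch for the comparison above (the frame size and repeat count are arbitrary assumptions; note that df.unstack() needs reset_index(drop=True) to produce the same column-by-column order):

import timeit
import numpy as np
import pandas as pd

# hypothetical benchmark frame; the size is illustrative only
big = pd.DataFrame(np.random.rand(100_000, 5), columns=['M1', 'M2', 'M3', 'M4', 'M5'])

concat_way = lambda: pd.concat((big[c] for c in big.columns), ignore_index=True)
unstack_way = lambda: big.unstack().reset_index(drop=True)

print('concat :', timeit.timeit(concat_way, number=100))
print('unstack:', timeit.timeit(unstack_way, number=100))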
I have the following multi-index dataframe, where one set of columns (High_Price) represents the daily high of hypothetical stocks and the other (Yest_Close) their previous day's close.
High_Price Yest_Close
Ticker ABC XYZ RST ABC XYZ RST
2/1/19 3 10 90 2 9 88
1/31/19 3.5 9 88 4 9.5 89
1/30/19 2.5 9.5 86 3 9.8 85
1/29/19 4 8.5 92 3.5 8 93
1/28/19 4.5 8.2 95 4.8 8 96
1/27/19 2.8 7 94 2.6 6.5 93
1/26/19 2.6 6.5 93 2.7 7 92
I want to append another block of columns that holds the element-wise max between the two (High_Price and Yest_Close). So the result should look like the following:
High_Price Yest_Close Max
Ticker ABC XYZ RST ABC XYZ RST ABC XYZ RST
2/1/19 3 10 90 2 9 88 3 10 90
1/31/19 3.5 9 88 4 9.5 89 4 9.5 89
1/30/19 2.5 9.5 86 3 9.8 85 3 9.8 86
1/29/19 4 8.5 92 3.5 8 93 4 8.5 93
1/28/19 4.5 8.2 95 4.8 8 96 4.8 8.2 96
1/27/19 2.8 7 94 2.6 6.5 93 2.8 7 94
1/26/19 2.6 6.5 93 2.7 7 92 2.7 7 93
I tried the following logic but it's not getting me the proper result:
df['Max',ticker] = df[['High_Price','Yest_Close']].max(axis=1)
How should I fix my code to get the result I'm looking for?
You want level=1 inside max, then create a MultiIndex for the new columns and join them back with df.join:
# column-wise max across the two blocks, collapsed to the ticker level
m = df[['High_Price','Yest_Close']].max(level=1, axis=1)
# put the result under a new top-level label 'Max'
m.columns = pd.MultiIndex.from_product((['Max'], m.columns))
out = df.join(m)
High_Price Yest_Close Max
ABC XYZ RST ABC XYZ RST ABC XYZ RST
Ticker
2/1/19 3.0 10.0 90 2.0 9.0 88 3.0 10.0 90.0
1/31/19 3.5 9.0 88 4.0 9.5 89 4.0 9.5 89.0
1/30/19 2.5 9.5 86 3.0 9.8 85 3.0 9.8 86.0
1/29/19 4.0 8.5 92 3.5 8.0 93 4.0 8.5 93.0
1/28/19 4.5 8.2 95 4.8 8.0 96 4.8 8.2 96.0
1/27/19 2.8 7.0 94 2.6 6.5 93 2.8 7.0 94.0
1/26/19 2.6 6.5 93 2.7 7.0 92 2.7 7.0 93.0
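Note that the level argument of max has since been deprecated and removed (pandas 2.0). A sketch of the same idea for newer pandas, grouping on the ticker level of the columns instead (the resulting tickers may come back alphabetically sorted):

# transpose so the column MultiIndex becomes the row index, take the max
# per ticker, then transpose back
m = df[['High_Price','Yest_Close']].T.groupby(level=1).max().T
m.columns = pd.MultiIndex.from_product((['Max'], m.columns))
out = df.join(m)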
I have a dataframe comprising 'Linear' and 'Delta' columns and need to create the 'New' column.
In:
Linear Delta
30 -3
60 1.4
65 -0.3
62 4.4
21 -2.5
18 -0.1
34 -3.1
30 -1.5
45 0.5
55 -1.4
43 2.8
51 4.7
62 2.7
Out:
Linear Delta New
30 -3
60 1.4 60.0
65 -0.3 59.7
62 4.4 64.1
21 -2.5 61.6
18 -0.1 61.5
34 -3.1 58.4
30 -1.5 56.9
45 0.5 57.4
55 -1.4 55.0
43 2.8 57.8
51 4.7 51.0
62 2.7 53.7
The algorithmic formula is the following:
New[i] = IF( AND(Linear[i-1]<50, Linear[i]>50), Linear[i], New[i-1]+Delta[i] )
I tried a lot of different approaches, such as cumsum(), but never found a solution despite spending many hours on it.
For this kind of recursive algorithm, consider numba with a manual loop. You will likely find just-in-time compilation more efficient than Pandas-based methods / iteration.
import numpy as np
from numba import jit

@jit(nopython=True)
def calc_new(L, D):
    # sequential recurrence: each value depends on the previous result
    res = np.zeros(L.shape)
    res[0] = np.nan
    for i in range(1, len(res)):
        res[i] = L[i] if (L[i-1] < 50) & (L[i] > 50) else res[i-1] + D[i]
    return res

df['New'] = calc_new(df['Linear'].values, df['Delta'].values)
Result
print(df)
Linear Delta New
0 30 -3.0 NaN
1 60 1.4 60.0
2 65 -0.3 59.7
3 62 4.4 64.1
4 21 -2.5 61.6
5 18 -0.1 61.5
6 34 -3.1 58.4
7 30 -1.5 56.9
8 45 0.5 57.4
9 55 -1.4 55.0
10 43 2.8 57.8
11 51 4.7 51.0
12 62 2.7 53.7
Not very nice, but working:
import numpy as np

df['NEW'] = np.nan
for i, row in df.iterrows():
    if i > 0:
        # True when Linear crosses above 50 from below
        m = (row['Linear'] > 50) & (df.loc[i-1, 'Linear'] < 50)
        df.loc[i, 'NEW'] = np.where(m, row['Linear'], row['Delta'] + df.loc[i-1, 'NEW'])
print (df)
Linear Delta New NEW
0 30 -3.0 NaN NaN
1 60 1.4 60.0 60.0
2 65 -0.3 59.7 59.7
3 62 4.4 64.1 64.1
4 21 -2.5 61.6 61.6
5 18 -0.1 61.5 61.5
6 34 -3.1 58.4 58.4
7 30 -1.5 56.9 56.9
8 45 0.5 57.4 57.4
9 55 -1.4 55.0 55.0
10 43 2.8 57.8 57.8
11 51 4.7 51.0 51.0
12 62 2.7 53.7 53.7
In this program I have used two temporary lists to get the work done: "prevLinear" holds the Linear[i-1] values, and "tempList" stores the running results of the "New" column for use in later iterations. Execute the code to get the result, and let me know if this helps. Thank you.
import pandas as pd
import numpy as np

Linear = [30, 60, 65, 62, 21, 18, 34, 30, 45, 55, 43, 51, 62]
Delta = [-3, 1.4, -0.3, 4.4, -2.5, -0.1, -3.1, -1.5, 0.5, -1.4, 2.8, 4.7, 2.7]
df = pd.DataFrame({
    'Linear': Linear,
    'Delta': Delta
})
# previous row's Linear value (NaN for the first row); columns are referenced
# by name rather than positional iloc, which breaks if the column order changes
prevLinear = [np.nan if i == 0 else df['Linear'].iloc[i-1] for i, value in enumerate(df['Linear'].values)]
df['prevLinear'] = prevLinear
tempList = []  # running 'New' values, kept for the recursive lookup
new = []
for i, value in enumerate(df['Linear'].values):
    if value > 50 and df['prevLinear'].iloc[i] < 50:
        # Linear crossed above 50: restart from the Linear value itself
        new.append(value)
        tempList.append(value)
    else:
        if i > 0:
            new.append(df['Delta'].iloc[i] + tempList[i-1])
            tempList.append(df['Delta'].iloc[i] + tempList[i-1])
        else:
            new.append(np.nan)
            tempList.append(np.nan)
new = np.round(new, 1)
df['New'] = new
finalDF = df[['Linear', 'Delta', 'New']]
print(finalDF)
I am trying to do some analysis of rainfall data. An example of the data looks like this:
10 18/05/2016 26.9 40 20.8 34 52.2 20.8 46.5 45
11 19/05/2016 25.5 32 0.3 41.6 42 0.3 56.3 65.2
12 20/05/2016 8.5 29 18.4 9 36 18.4 28.6 46
13 21/05/2016 24.5 18 TRACE 3.5 17 TRACE 4.4 40
14 22/05/2016 0.6 18 0 6.5 14 0 8.6 20
15 23/05/2016 3.5 9 0.6 4.3 14 0.6 7 15
16 24/05/2016 3.6 25 T 3 12 T 14.9 9
17 25/05/2016 25 21 2.2 25.6 50 2.2 25 9
The rainfall data contain the specific strings 'TRACE' and 'T' (both meaning a non-measurable rainfall amount). For analysis, I would like to convert these strings to 1.0 (float). My desired data should look like this, so that I can plot the values as a line diagram:
10 18/05/2016 26.9 40 20.8 34 52.2 20.8 46.5 45
11 19/05/2016 25.5 32 0.3 41.6 42 0.3 56.3 65.2
12 20/05/2016 8.5 29 18.4 9 36 18.4 28.6 46
13 21/05/2016 24.5 18 1.0 3.5 17 1.0 4.4 40
14 22/05/2016 0.6 18 0 6.5 14 0 8.6 20
15 23/05/2016 3.5 9 0.6 4.3 14 0.6 7 15
16 24/05/2016 3.6 25 1.0 3 12 1.0 14.9 9
17 25/05/2016 25 21 2.2 25.6 50 2.2 25 9
Can someone point me in the right direction?
You can use df.replace and then convert the numeric columns to float using df.astype (the original datatype would be object, so any operations on these columns would otherwise suffer from performance issues):
df = df.replace('^T(RACE)?$', 1.0, regex=True)
df.iloc[:, 1:] = df.iloc[:, 1:].astype(float) # converting object columns to floats
This will replace all T or TRACE elements with 1.0.
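The anchors in the pattern ensure that only whole-cell T or TRACE values match. A quick check on a throwaway Series (the 'TOTAL' value is made up to show a non-match):

s = pd.Series(['T', 'TRACE', '0.3', 'TOTAL'])
print(s.replace('^T(RACE)?$', 1.0, regex=True))
# 0      1.0
# 1      1.0
# 2      0.3
# 3    TOTAL
# dtype: object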
Output:
10 18/05/2016 26.9 40 20.8 34.0 52.2 20.8 46.5 45.0
11 19/05/2016 25.5 32 0.3 41.6 42.0 0.3 56.3 65.2
12 20/05/2016 8.5 29 18.4 9.0 36.0 18.4 28.6 46.0
13 21/05/2016 24.5 18 1 3.5 17.0 1 4.4 40.0
14 22/05/2016 0.6 18 0 6.5 14.0 0 8.6 20.0
15 23/05/2016 3.5 9 0.6 4.3 14.0 0.6 7.0 15.0
16 24/05/2016 3.6 25 1 3.0 12.0 1 14.9 9.0
17 25/05/2016 25.0 21 2.2 25.6 50.0 2.2 25.0 9.0
Use replace with a dict:
df = df.replace({'T':1.0, 'TRACE':1.0})
And then, if necessary, convert the remaining columns to float:
cols = df.columns.difference(['Date','other columns that need no conversion'])
df[cols] = df[cols].astype(float)
A complete example:
df = df.replace({'T':1.0, 'TRACE':1.0})
cols = df.columns.difference(['Date','a'])
df[cols] = df[cols].astype(float)
print (df)
a Date 2 3 4 5 6 7 8 9
0 10 18/05/2016 26.9 40.0 20.8 34.0 52.2 20.8 46.5 45.0
1 11 19/05/2016 25.5 32.0 0.3 41.6 42.0 0.3 56.3 65.2
2 12 20/05/2016 8.5 29.0 18.4 9.0 36.0 18.4 28.6 46.0
3 13 21/05/2016 24.5 18.0 1.0 3.5 17.0 1.0 4.4 40.0
4 14 22/05/2016 0.6 18.0 0.0 6.5 14.0 0.0 8.6 20.0
5 15 23/05/2016 3.5 9.0 0.6 4.3 14.0 0.6 7.0 15.0
6 16 24/05/2016 3.6 25.0 1.0 3.0 12.0 1.0 14.9 9.0
7 17 25/05/2016 25.0 21.0 2.2 25.6 50.0 2.2 25.0 9.0
print (df.dtypes)
a int64
Date object
2 float64
3 float64
4 float64
5 float64
6 float64
7 float64
8 float64
9 float64
dtype: object
Extending the answer from @jezrael, you can replace and convert to floats in a single statement (this assumes the first column is Date and the remaining columns are the desired numeric ones):
df.iloc[:, 1:] = df.iloc[:, 1:].replace({'T':1.0, 'TRACE':1.0}).astype(float)