Grouping dataframe based on consecutive occurrence of values - python

I have a pandas DataFrame with one column that is either true or false (titled 'condition' in the example below). I would like to group the DataFrame by consecutive true or false values. I have tried to use pandas.groupby but haven't succeeded with it, although I suspect that's down to my lack of understanding. An example of the dataframe can be found below:
import pandas as pd
df = pd.DataFrame({'condition': [1, 1, 0, 0, 1, 1, 1],
                   'H': [2, 7, 1, 6.5, 7, 9, 22],
                   't': [1.1, 1.5, 0.9, 1.6, 1.1, 1.8, 2.0]})
print(df)
index condition H t
0 1 2 1.1
1 1 7 1.5
2 0 1 0.9
3 0 6.5 1.6
4 1 7 1.1
5 1 9 1.8
6 1 22 2.0
Ideally the output of the program would be something along the lines of what can be found below. I was thinking of using some sort of 'grouping' method to make it easier to call each set of results but not sure if this is the best method. Any help would be greatly appreciated.
index condition H t group
0 1 2 1.1 1
1 1 7 1.5 1
2 0 1 0.9 2
3 0 6.5 1.6 2
4 1 7 1.1 3
5 1 9 1.8 3
6 1 22 2.0 3

Since you're dealing with 0/1s, here's another alternative using diff + cumsum -
df['group'] = df.condition.diff().abs().cumsum().fillna(0).astype(int) + 1
df
condition H t group
index
0 1 2.0 1.1 1
1 1 7.0 1.5 1
2 0 1.0 0.9 2
3 0 6.5 1.6 2
4 1 7.0 1.1 3
5 1 9.0 1.8 3
6 1 22.0 2.0 3
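To see why this works, here is a small illustrative sketch of the intermediate steps on the example data (rebuilding the frame from the question; the steps name is just for display): diff marks every change of condition with +-1 (and NaN on the first row), abs folds that to 1, and cumsum counts the changes; fillna(0) + 1 then turns the count into the group number.
import pandas as pd

df = pd.DataFrame({'condition': [1, 1, 0, 0, 1, 1, 1],
                   'H': [2, 7, 1, 6.5, 7, 9, 22],
                   't': [1.1, 1.5, 0.9, 1.6, 1.1, 1.8, 2.0]})
steps = df[['condition']].assign(diff=df.condition.diff(),
                                 abs=df.condition.diff().abs(),
                                 cumsum=df.condition.diff().abs().cumsum())
print(steps)
#    condition  diff  abs  cumsum
# 0          1   NaN  NaN     NaN
# 1          1   0.0  0.0     0.0
# 2          0  -1.0  1.0     1.0
# 3          0   0.0  0.0     1.0
# 4          1   1.0  1.0     2.0
# 5          1   0.0  0.0     2.0
# 6          1   0.0  0.0     2.0
# fillna(0).astype(int) + 1 then gives 1, 1, 2, 2, 3, 3, 3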
If you don't mind floats, this can be made a little faster.
df['group'] = df.condition.diff().abs().cumsum() + 1
df.loc[0, 'group'] = 1
df
index condition H t group
0 0 1 2.0 1.1 1.0
1 1 1 7.0 1.5 1.0
2 2 0 1.0 0.9 2.0
3 3 0 6.5 1.6 2.0
4 4 1 7.0 1.1 3.0
5 5 1 9.0 1.8 3.0
6 6 1 22.0 2.0 3.0
Here's the version with numpy equivalents -
df['group'] = 1
df.loc[1:, 'group'] = np.cumsum(np.abs(np.diff(df.condition))) + 1
df
condition H t group
index
0 1 2.0 1.1 1
1 1 7.0 1.5 1
2 0 1.0 0.9 2
3 0 6.5 1.6 2
4 1 7.0 1.1 3
5 1 9.0 1.8 3
6 1 22.0 2.0 3
On my machine, here are the timings -
df = pd.concat([df] * 100000, ignore_index=True)
%timeit df['group'] = df.condition.diff().abs().cumsum().fillna(0).astype(int) + 1
10 loops, best of 3: 25.1 ms per loop
%%timeit
df['group'] = df.condition.diff().abs().cumsum() + 1
df.loc[0, 'group'] = 1
10 loops, best of 3: 23.4 ms per loop
%%timeit
df['group'] = 1
df.loc[1:, 'group'] = np.cumsum(np.abs(np.diff(df.condition))) + 1
10 loops, best of 3: 21.4 ms per loop
%timeit df['group'] = df['condition'].ne(df['condition'].shift()).cumsum()
100 loops, best of 3: 15.8 ms per loop

Compare the column with its shifted version using ne (!=) and then take the cumsum:
df['group'] = df['condition'].ne(df['condition'].shift()).cumsum()
print (df)
condition H t group
index
0 1 2.0 1.1 1
1 1 7.0 1.5 1
2 0 1.0 0.9 2
3 0 6.5 1.6 2
4 1 7.0 1.1 3
5 1 9.0 1.8 3
6 1 22.0 2.0 3
Detail:
print (df['condition'].ne(df['condition'].shift()))
index
0 True
1 False
2 True
3 False
4 True
5 False
6 False
Name: condition, dtype: bool
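A side note: because this is only an element-wise comparison, the same idiom also works when condition holds non-numeric values, where the diff-based approach above would not apply. A minimal sketch with made-up string data:
import pandas as pd

s = pd.Series(['a', 'a', 'b', 'b', 'a'])
print(s.ne(s.shift()).cumsum().tolist())
# [1, 1, 2, 2, 3]  - each consecutive run gets its own group number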
Timings:
df = pd.concat([df]*100000).reset_index(drop=True)
In [54]: %timeit df['group'] = df['condition'].ne(df['condition'].shift()).cumsum()
100 loops, best of 3: 12.2 ms per loop
In [55]: %timeit df['group'] = df.condition.diff().abs().cumsum().fillna(0).astype(int) + 1
10 loops, best of 3: 24.5 ms per loop
In [56]: %%timeit
...: df['group'] = 1
...: df.loc[1:, 'group'] = np.cumsum(np.abs(np.diff(df.condition))) + 1
...:
10 loops, best of 3: 26.6 ms per loop
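Whichever variant you use to build the group column, the consecutive runs the question asks about can then be pulled out with a plain groupby on that column. A small sketch, assuming df already has the group column from above (block and blocks are just illustrative names):
for key, block in df.groupby('group'):
    # each block is one run of consecutive identical condition values
    print(key)
    print(block)

# or keep the runs around for later lookup
blocks = {key: block for key, block in df.groupby('group')}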

Related

Select a number of rows in a dataframe after a valid condition

I would like to select a specified number of rows after a condition is met.
Here is my dataframe:
Timestamp entry
0 11.1 0
1 11.2 1
2 11.3 0
3 11.4 0
4 11.5 0
5 11.6 0
6 11.7 0
7 11.8 1
8 11.9 0
9 12.0 0
10 12.1 0
I would like to select the next three rows after the entry is equal to 1, so for the first occurrence I would obtain something like this:
Timestamp entry
1 11.2 1
2 11.3 0
3 11.4 0
4 11.5 0
I don't know what the most appropriate output is if I want to study every occurrence, maybe a groupby?
First remove 0 rows before first 1:
df = df[df['entry'].eq(1).cumsum().ne(0)]
df = df.groupby(df['entry'].cumsum()).head(4)
Timestamp entry
1 11.2 1
2 11.3 0
3 11.4 0
4 11.5 0
7 11.8 1
8 11.9 0
9 12.0 0
10 12.1 0
Details & explanation:
For a general solution that removes all values before the first match, compare with Series.eq, take the cumulative sum with Series.cumsum, and compare with Series.ne to filter out the rows where the cumulative sum is still 0:
print (df.assign(comp1=df['entry'].eq(1),
                 cumsum=df['entry'].eq(1).cumsum(),
                 mask=df['entry'].eq(1).cumsum().ne(0)))
Timestamp entry comp1 cumsum mask
0 11.1 0 False 0 False
1 11.2 1 True 1 True
2 11.3 0 False 1 True
3 11.4 0 False 1 True
4 11.5 0 False 1 True
5 11.6 0 False 1 True
6 11.7 0 False 1 True
7 11.8 1 True 2 True
8 11.9 0 False 2 True
9 12.0 0 False 2 True
10 12.1 0 False 2 True
After filtering by boolean indexing, create a helper Series with the cumulative sum to label the groups:
print (df['entry'].cumsum())
1 1
2 1
3 1
4 1
5 1
6 1
7 2
8 2
9 2
10 2
Name: entry, dtype: int64
So for the final solution, use GroupBy.head with 4 to get each row containing 1 together with the next 3 rows:
df = df.groupby(df['entry'].cumsum()).head(4)
print (df)
Timestamp entry
1 11.2 1
2 11.3 0
3 11.4 0
4 11.5 0
7 11.8 1
8 11.9 0
9 12.0 0
10 12.1 0
To loop over the groups, use:
for i, g in df.groupby(df['entry'].cumsum()):
    print (g.head(4))
If you want a list of DataFrames as output:
L = [g.head(4) for i, g in df.groupby(df['entry'].cumsum())]
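Putting it together, a minimal end-to-end sketch that rebuilds the sample data shown in the detail output above and applies both steps:
import pandas as pd

df = pd.DataFrame({'Timestamp': [11.1, 11.2, 11.3, 11.4, 11.5, 11.6,
                                 11.7, 11.8, 11.9, 12.0, 12.1],
                   'entry': [0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0]})
df = df[df['entry'].eq(1).cumsum().ne(0)]         # drop rows before the first 1
out = df.groupby(df['entry'].cumsum()).head(4)    # each 1 plus the next 3 rows
print(out)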

Interpolating a multi-index pandas dataframe

I need to interpolate a multi-index dataframe.
For example, this is the main dataframe:
a b c result
1 1 1 6
1 1 2 9
1 2 1 8
1 2 2 11
2 1 1 7
2 1 2 10
2 2 1 9
2 2 2 12
I need to find the result for:
1.3 1.7 1.55
What I've been doing so far is appending a pd.Series with NaN for each index level individually (see the stages below).
As you can see, this seems like a VERY inefficient way.
I would be happy if someone could enlighten me.
P.S.
I spent some time looking over SO, and if the answer is in there, I missed it:
Fill multi-index Pandas DataFrame with interpolation
Resampling Within a Pandas MultiIndex
pandas multiindex dataframe, ND interpolation for missing values
Algorithm:
stage 1:
a b c result
1 1 1 6
1 1 2 9
1 2 1 8
1 2 2 11
1.3 1 1 6.3
1.3 1 2 9.3
1.3 2 1 8.3
1.3 2 2 11.3
2 1 1 7
2 1 2 10
2 2 1 9
2 2 2 12
stage 2:
a b c result
1 1 1 6
1 1 2 9
1 2 1 8
1 2 2 11
1.3 1 1 6.3
1.3 1 2 9.3
1.3 1.7 1 7.7
1.3 1.7 2 10.7
1.3 2 1 8.3
1.3 2 2 11.3
2 1 1 7
2 1 2 10
2 2 1 9
2 2 2 12
stage 3:
a b c result
1 1 1 6
1 1 2 9
1 2 1 8
1 2 2 11
1.3 1 1 6.3
1.3 1 2 9.3
1.3 1.7 1 7.7
1.3 1.7 1.55 9.35
1.3 1.7 2 10.7
1.3 2 1 8.3
1.3 2 2 11.3
2 1 1 7
2 1 2 10
2 2 1 9
2 2 2 12
You can use scipy.interpolate.LinearNDInterpolator to do what you want. If the dataframe has a MultiIndex with the levels 'a', 'b' and 'c', then:
from scipy.interpolate import LinearNDInterpolator as lNDI
print (lNDI(points=df.index.to_frame().values, values=df.result.values)([1.3, 1.7, 1.55]))
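For reference, a minimal sketch (using the values from the question's table) that builds such a MultiIndex dataframe so the call above is runnable:
import pandas as pd
from scipy.interpolate import LinearNDInterpolator as lNDI

df = pd.DataFrame({'a': [1, 1, 1, 1, 2, 2, 2, 2],
                   'b': [1, 1, 2, 2, 1, 1, 2, 2],
                   'c': [1, 2, 1, 2, 1, 2, 1, 2],
                   'result': [6, 9, 8, 11, 7, 10, 9, 12]}).set_index(['a', 'b', 'c'])
print(lNDI(points=df.index.to_frame().values, values=df.result.values)([1.3, 1.7, 1.55]))
# roughly [9.35], matching stage 3 above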
Now, if you have a dataframe whose index holds all the (a, b, c) tuples you want to calculate, you can do, for example:
def pd_interpolate_MI (df_input, df_toInterpolate):
    from scipy.interpolate import LinearNDInterpolator as lNDI
    #create the function of interpolation
    func_interp = lNDI(points=df_input.index.to_frame().values, values=df_input.result.values)
    #calculate the value for the unknown index
    df_toInterpolate['result'] = func_interp(df_toInterpolate.index.to_frame().values)
    #return the dataframe with the new values
    return pd.concat([df_input, df_toInterpolate]).sort_index()
Then, for example, with your df and
df_toI = pd.DataFrame(index=pd.MultiIndex.from_tuples([(1.3, 1.7, 1.55), (1.7, 1.4, 1.9)], names=df.index.names))
you get:
print (pd_interpolate_MI(df, df_toI))
result
a b c
1.0 1.0 1.00 6.00
2.00 9.00
2.0 1.00 8.00
2.00 11.00
1.3 1.7 1.55 9.35
1.7 1.4 1.90 10.20
2.0 1.0 1.00 7.00
2.00 10.00
2.0 1.00 9.00
2.00 12.00

Update missing values in a Python pandas dataframe with matching conditions

I have a dataframe df1 with 3 columns (A, B, C), where NaN represents a missing value:
A B C
1 2 NaN
2 1 2.3
2 3 2.5
I have a dataframe df2 with 3 columns (A,B,D)
A B D
1 2 2
2 1 2
2 3 4
The expected output would be
A B C
1 2 2
2 1 2.3
2 3 2.5
I want to keep the values in column C of df1 intact where they are not missing, and fill the missing ones with the corresponding value of D from df2 where the other two columns match, i.e. df1.A == df2.A and df1.B == df2.B.
Any good solution?
One way would be to use columns A and B as the index. If you then use fillna, pandas will align the indices and give you the correct result:
df1.set_index(['A', 'B'])['C'].fillna(df2.set_index(['A', 'B'])['D']).reset_index()
Out:
A B C
0 1 2 2.0
1 2 1 2.3
2 2 3 2.5
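A minimal sketch that rebuilds df1 and df2 from the question and shows the alignment step explicitly:
import numpy as np
import pandas as pd

df1 = pd.DataFrame({'A': [1, 2, 2], 'B': [2, 1, 3], 'C': [np.nan, 2.3, 2.5]})
df2 = pd.DataFrame({'A': [1, 2, 2], 'B': [2, 1, 3], 'D': [2, 2, 4]})

c = df1.set_index(['A', 'B'])['C']   # C keyed by (A, B)
d = df2.set_index(['A', 'B'])['D']   # D keyed by (A, B)
print(c.fillna(d).reset_index())     # NaN in C filled from D at the matching (A, B) key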
IIUC:
In [100]: df['C'] = np.where((np.isnan(df.C))&(df.A==df1.A)&(df.B==df1.B),df1.D,df.C)
In [101]: df
Out[101]:
A B C
0 1.0 2.0 2.0
1 2.0 1.0 2.3
2 2.0 3.0 2.5
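Restated with the question's names (df1 and df2 instead of the df and df1 used in this answer), and noting that the element-wise np.where approach relies on the two frames being row-aligned, a hedged sketch:
import numpy as np
import pandas as pd

df1 = pd.DataFrame({'A': [1, 2, 2], 'B': [2, 1, 3], 'C': [np.nan, 2.3, 2.5]})
df2 = pd.DataFrame({'A': [1, 2, 2], 'B': [2, 1, 3], 'D': [2, 2, 4]})

# take D where C is NaN and the key columns agree row by row, otherwise keep C
df1['C'] = np.where(df1['C'].isna() & df1['A'].eq(df2['A']) & df1['B'].eq(df2['B']),
                    df2['D'], df1['C'])
print(df1)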
np.where is faster by comparison:
In [102]: %timeit df['C'] = np.where((np.isnan(df.C))&(df.A==df1.A)&(df.B==df1.B),df1.D,df.C)
1000 loops, best of 3: 1.3 ms per loop
In [103]: %timeit df.set_index(['A', 'B'])['C'].fillna(df1.set_index(['A', 'B'])['D']).reset_index()
100 loops, best of 3: 5.92 ms per loop

How to fillna by groupby outputs in pandas?

I have a dataframe with 4 columns (A, B, C, D). D has some NaN entries. I want to fill the NaN values with the average value of D over the rows having the same values of A, B, C.
For example, if the values of A, B, C, D are x, y, z and NaN respectively, then I want the NaN to be replaced by the average of D over the rows where A, B, C are x, y, z respectively.
df['D'].fillna(df.groupby(['A','B','C'])['D'].transform('mean')) would be faster than apply
In [2400]: df
Out[2400]:
A B C D
0 1 1 1 1.0
1 1 1 1 NaN
2 1 1 1 3.0
3 3 3 3 5.0
In [2401]: df['D'].fillna(df.groupby(['A','B','C'])['D'].transform('mean'))
Out[2401]:
0 1.0
1 2.0
2 3.0
3 5.0
Name: D, dtype: float64
In [2402]: df['D'] = df['D'].fillna(df.groupby(['A','B','C'])['D'].transform('mean'))
In [2403]: df
Out[2403]:
A B C D
0 1 1 1 1.0
1 1 1 1 2.0
2 1 1 1 3.0
3 3 3 3 5.0
Details
In [2396]: df.shape
Out[2396]: (10000, 4)
In [2398]: %timeit df['D'].fillna(df.groupby(['A','B','C'])['D'].transform('mean'))
100 loops, best of 3: 3.44 ms per loop
In [2397]: %timeit df.groupby(['A','B','C'])['D'].apply(lambda x: x.fillna(x.mean()))
100 loops, best of 3: 5.34 ms per loop
I think you need:
df.D = df.groupby(['A','B','C'])['D'].apply(lambda x: x.fillna(x.mean()))
Sample:
df = pd.DataFrame({'A':[1,1,1,3],
                   'B':[1,1,1,3],
                   'C':[1,1,1,3],
                   'D':[1,np.nan,3,5]})
print (df)
A B C D
0 1 1 1 1.0
1 1 1 1 NaN
2 1 1 1 3.0
3 3 3 3 5.0
df.D = df.groupby(['A','B','C'])['D'].apply(lambda x: x.fillna(x.mean()))
print (df)
A B C D
0 1 1 1 1.0
1 1 1 1 2.0
2 1 1 1 3.0
3 3 3 3 5.0
Link to duplicate of this question for further information:
Pandas Dataframe: Replacing NaN with row average
Another suggested way of doing it mentioned in the link is using a simple fillna on the transpose:
df.T.fillna(df.mean(axis=1)).T
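Note that the transpose idiom fills each NaN with the mean of its own row across all numeric columns, which is a different rule from the per-(A, B, C)-group mean above. A small illustrative sketch with made-up numbers:
import numpy as np
import pandas as pd

tmp = pd.DataFrame({'A': [1.0, 4.0], 'B': [np.nan, 5.0], 'C': [3.0, np.nan]})
print(tmp.T.fillna(tmp.mean(axis=1)).T)
#      A    B    C
# 0  1.0  2.0  3.0   (row mean of 1 and 3 is 2.0)
# 1  4.0  5.0  4.5   (row mean of 4 and 5 is 4.5)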

Getting only top values within each group that have the same column value

I have a table that looks something like this:
Column 1 | Column 2 | Column 3
1 a 100
1 r 100
1 h 200
1 j 200
2 a 50
2 q 50
2 k 40
3 a 10
3 q 150
3 k 150
Imagine I am trying to get the top values of each groupby('Column 1').
Normally I would just use .head(n), but in this case I am also trying to get only the top rows that share the same Column 3 value, like:
Column 1 | Column 2 | Column 3
1 a 100
1 r 100
2 a 50
2 q 50
3 a 10
Assume the table is already in the order I want it.
Any advice would be highly appreciated.
I think you first need groupby with first and then merge:
print df.groupby('Column 1')['Column 3'].first().reset_index()
Column 1 Column 3
0 1 100
1 2 50
2 3 10
print pd.merge(df,
               df.groupby('Column 1')['Column 3'].first().reset_index(),
               on=['Column 1','Column 3'])
Column 1 Column 2 Column 3
0 1 a 100
1 1 r 100
2 2 a 50
3 2 q 50
4 3 a 10
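A minimal end-to-end sketch that rebuilds the sample table from the question and applies the merge-based approach:
import pandas as pd

df = pd.DataFrame({'Column 1': [1, 1, 1, 1, 2, 2, 2, 3, 3, 3],
                   'Column 2': ['a', 'r', 'h', 'j', 'a', 'q', 'k', 'a', 'q', 'k'],
                   'Column 3': [100, 100, 200, 200, 50, 50, 40, 10, 150, 150]})

top = df.groupby('Column 1')['Column 3'].first().reset_index()   # first Column 3 value per group
print(pd.merge(df, top, on=['Column 1', 'Column 3']))            # keep only the rows matching it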
Timings:
df = pd.concat([df]*1000).reset_index(drop=True)
%timeit pd.merge(df, df.groupby('Column 1')['Column 3'].first().reset_index(), on=['Column 1','Column 3'])
100 loops, best of 3: 3.58 ms per loop
%timeit df[(df.assign(diff=df.groupby('Column 1')['Column 3'].diff().fillna(0)).groupby('Column 1')['diff'].cumsum() == 0)]
100 loops, best of 3: 5.06 ms per loop
My solution (without merging):
In [83]: idx = (df.assign(diff=df.groupby('Column1')['Column3'].diff().fillna(0))
   ....:          .groupby('Column1')['diff'].cumsum() == 0
   ....:        )
In [84]: df[idx]
Out[84]:
Column1 Column2 Column3
0 1 a 100
1 1 r 100
4 2 a 50
5 2 q 50
7 3 a 10
Explanation:
In [85]: df.assign(diff=df.groupby('Column1')['Column3'].diff().fillna(0))
Out[85]:
Column1 Column2 Column3 diff
0 1 a 100 0.0
1 1 r 100 0.0
2 1 h 200 100.0
3 1 j 200 0.0
4 2 a 50 0.0
5 2 q 50 0.0
6 2 k 40 -10.0
7 3 a 10 0.0
8 3 q 150 140.0
9 3 k 150 0.0
In [86]: df.assign(diff=df.groupby('Column1')['Column3'].diff().fillna(0)).groupby('Column1')['diff'].cumsum()
Out[86]:
0 0.0
1 0.0
2 100.0
3 100.0
4 0.0
5 0.0
6 -10.0
7 0.0
8 140.0
9 140.0
Name: diff, dtype: float64
