How to efficiently remove leading rows containing only 0 as value? - python

I have a pandas DataFrame whose first rows contain only zeros, and I would like to remove those leading rows.
Denoting my DataFrame df and its columns ['a', 'b', 'c'], I tried the following code:
df[(df[['a', 'b', 'c']] != 0).any(axis=1)]
But it also removes the non-leading all-zero rows, turning the following dataframe:
a b c
0 0 0
0 0 0
1 0 0
0 0 0
2 3 5
4 5 6
0 0 0
1 1 1
into this one:
a b c
1 0 0
2 3 5
4 5 6
1 1 1
That's not what I want; I only want to drop the leading rows. So I would like to have:
a b c
1 0 0
0 0 0
2 3 5
4 5 6
0 0 0
1 1 1
It would be great to have a simple and efficient solution using pandas functions. Thanks

General solution that works even if all rows are 0: first use cumsum, so the cumulative count of non-zero cells stays at 0 only for the leading all-zero rows, then test whether any value per row is True:
df1 = df[(df[['a', 'b', 'c']] != 0).cumsum().any(axis=1)]
print (df1)
a b c
2 1 0 0
3 0 0 0
4 2 3 5
5 4 5 6
6 0 0 0
7 1 1 1
Solution if there is at least one non-zero row in the data: get the position of the first non-zero row with Series.idxmax:
df1 = df.iloc[(df[['a', 'b', 'c']] != 0).any(axis=1).idxmax():]
print (df1)
a b c
2 1 0 0
3 0 0 0
4 2 3 5
5 4 5 6
6 0 0 0
7 1 1 1

Here is an example that finds the first row that is not all zeros and then selects everything from that point on. It should solve the problem you are describing:
ix_first_valid = df[(df != 0).any(axis=1)].index[0]
df[ix_first_valid:]
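If you prefer a mask-based variant, here is a minimal sketch (assuming the same df and column names as in the question) that marks the first non-zero row and keeps everything from there on; it also handles the edge case where every row is zero:
import pandas as pd
df = pd.DataFrame({'a': [0, 0, 1, 0, 2, 4, 0, 1],
                   'b': [0, 0, 0, 0, 3, 5, 0, 1],
                   'c': [0, 0, 0, 0, 5, 6, 0, 1]})
# True for rows with at least one non-zero value; cummax carries the
# first True forward, so every row from the first non-zero row on is kept.
mask = (df[['a', 'b', 'c']] != 0).any(axis=1).cummax()
print(df[mask])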

Related

Set value when row is maximum in group by - Python Pandas

I am trying to create a column (is_max) that is 1 if the value in column B is the maximum within its group of column A, and 0 otherwise.
Example:
[Input]
A B
1 2
2 3
1 4
2 5
[Output]
A B is_max
1 2 0
2 3 0
1 4 1
2 5 1
What I'm trying:
df['is_max'] = 0
df.loc[df.reset_index().groupby('A')['B'].idxmax(),'is_max'] = 1
Fix your code by removing the reset_index:
df['is_max'] = 0
df.loc[df.groupby('A')['B'].idxmax(),'is_max'] = 1
df
Out[39]:
A B is_max
0 1 2 0
1 2 3 0
2 1 4 1
3 2 5 1
I am assuming A is your grouping column, since you did not state it:
df['is_max']=(df['B']==df.groupby('A')['B'].transform('max')).astype(int)
or
df.groupby('A')['B'].apply(lambda x: x == x.max()).astype(int)
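A self-contained sketch of the transform approach, using the sample data from the question (note that transform flags ties, i.e. every row equal to its group maximum, while idxmax flags only the first):
import pandas as pd
df = pd.DataFrame({'A': [1, 2, 1, 2], 'B': [2, 3, 4, 5]})
# Compare each B against the maximum of B within its A group.
df['is_max'] = (df['B'] == df.groupby('A')['B'].transform('max')).astype(int)
print(df)
#    A  B  is_max
# 0  1  2       0
# 1  2  3       0
# 2  1  4       1
# 3  2  5       1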

Flag creation based on count of consecutive ones in a column

I have a data frame with a column containing only 0s and 1s. I need to create a flag column that is set when there are more than a certain number of consecutive ones in the first column.
In the example below the threshold is x >= 4: if there are 4 or more consecutive ones, the flag should be 1 for all of those consecutive rows.
col1 Flag
0 1 0
1 0 0
2 1 1
3 1 1
4 1 1
5 1 1
6 0 0
7 1 0
8 1 0
9 0 0
10 1 1
11 1 1
12 1 1
13 1 1
14 1 1
15 0 0
One change: let's say there is a new column Group; we need to group by that and compute the flag.
Group col1 Flag
0 A 1 0
1 B 0 0
2 B 1 1
3 B 1 1
4 B 1 1
5 B 1 1
6 C 0 0
7 C 1 0
8 C 1 0
9 C 0 0
10 D 1 0
11 D 1 0
12 D 1 0
13 E 1 0
14 E 1 0
15 E 0 0
As you can see, there are consecutive ones from rows 10 to 14, but they belong to different groups. Elements of a group can appear in any order.
Not that hard: create the group key with cumsum, then count within each group with transform. The key df.col1.ne(1).cumsum() puts each run of ones in the same group as the zero that precedes it, which is why the test is ge(5) rather than ge(4):
(df.groupby(df.col1.ne(1).cumsum())['col1'].transform('count').ge(5) & df.col1.eq(1)).astype(int)
Out[83]:
0 0
1 0
2 1
3 1
4 1
5 1
6 0
7 0
8 0
9 0
10 1
11 1
12 1
13 1
14 1
15 0
Name: col1, dtype: int32
You can achieve this in a couple of steps:
1. rolling(4).sum() to attain consecutive summations of your column.
2. Use where to get the 1s from "col1" whose summation window (from the previous step) is >= 4, turning the rest of the values into NaN.
3. bfill(limit=3) to backwards-fill the leftover 1s in your column by a maximum of 3 places.
4. fillna(0) to fill whatever is left over with 0.
df["my_flag"] = (df["col1"]
.where(
df["col1"].rolling(4).sum() >= 4
) # Selects the 1's whose consecutive sum >= 4. All other values become NaN
.bfill(limit=3) # Moving backwards from our leftover values,
# take the existing value and fill in a maximum of 3 NaNs
.fillna(0) # Fill in the rest of the NaNs with 0
.astype(int)) # Cast to integer data type, since we were working with floats temporarily
print(df)
col1 Flag my_flag
0 1 0 0
1 0 0 0
2 1 1 1
3 1 1 1
4 1 1 1
5 1 1 1
6 0 0 0
7 1 0 0
8 1 0 0
9 0 0 0
10 1 1 1
11 1 1 1
12 1 1 1
13 1 1 1
14 1 1 1
15 0 0 0
Edit:
For a grouped approach, you just need to use groupby().rolling to create your mask for use in where(). Everything after that is the same. I separated the rolling step to keep it as readable as possible:
grouped_counts_ge_4 = (df.groupby("Group")["col1"]
.rolling(4)
.sum()
.ge(4)
.reset_index(level=0, drop=True))
df["my_flag"] = (df["col1"]
.where(grouped_counts_ge_4)
.bfill(limit=3) # Moving backwards from our leftover values, take the existing value and fill in a maximum of 3 NaNs
.fillna(0) # Fill in the rest of the NaNs with 0
.astype(int)) # Cast to integer data type, since we were working with floats temporarily
print(df)
Group col1 Flag my_flag
0 A 1 0 0
1 B 0 0 0
2 B 1 1 1
3 B 1 1 1
4 B 1 1 1
5 B 1 1 1
6 C 0 0 0
7 C 1 0 0
8 C 1 0 0
9 C 0 0 0
10 D 1 0 0
11 D 1 0 0
12 D 1 0 0
13 E 1 0 0
14 E 1 0 0
15 E 0 0 0
Try this:
df['Flag'] = np.where(df['col1'].groupby((df['col1'].diff().ne(0) | df['col1'].eq(0)).cumsum()).transform('size').ge(4),1,0)
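For the grouped variant, here is a minimal run-length sketch (assuming the Group and col1 column names from the question; the runs and run_len names are only illustrative):
import pandas as pd
df = pd.DataFrame({
    'Group': list('ABBBBBCCCCDDDEEE'),
    'col1':  [1,0,1,1,1,1,0,1,1,0,1,1,1,1,1,0],
})
# Label runs of identical values within each group ...
runs = df.groupby('Group')['col1'].transform(lambda s: s.ne(s.shift()).cumsum()).rename('run')
# ... measure the length of each run ...
run_len = df.groupby(['Group', runs])['col1'].transform('size')
# ... and flag the ones that sit inside a run of 4 or more.
df['Flag'] = ((df['col1'] == 1) & (run_len >= 4)).astype(int)
print(df)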

ApplyMap function on Multiple columns pandas

I have this dataframe
dd = pd.DataFrame({'a':[1,5,3],'b':[3,2,3],'c':[2,4,5]})
a b c
0 1 3 2
1 5 2 4
2 3 3 5
I just want to replace the values in columns a and b that are smaller than the value in column c. I want to do this operation row-wise.
I did this
dd.applymap(lambda x: 0 if x < x['c'] else x )
I get this error:
TypeError: 'int' object is not subscriptable
I understand that x is an int, but how do I get the value of column c for that row?
I want this output
a b c
0 0 3 2
1 5 0 4
2 0 0 5
Use DataFrame.mask with DataFrame.lt:
df = dd.mask(dd.lt(dd['c'], axis=0), 0)
print (df)
a b c
0 0 3 2
1 5 0 4
2 0 0 5
Or you can set the values by comparing against column c, broadcast across the rows:
dd[dd < dd['c'].to_numpy()[:, None]] = 0
print (dd)
a b c
0 0 3 2
1 5 0 4
2 0 0 5
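As a side note, applymap is purely element-wise, so it never sees column c. If you really want a row-wise formulation, a (slower) sketch with apply along axis=1 could look like this:
import pandas as pd
dd = pd.DataFrame({'a': [1, 5, 3], 'b': [3, 2, 3], 'c': [2, 4, 5]})
# For each row, zero out the entries that are smaller than that row's c.
out = dd.apply(lambda row: row.mask(row < row['c'], 0), axis=1)
print(out)
#    a  b  c
# 0  0  3  2
# 1  5  0  4
# 2  0  0  5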

Difference of one multi index level

For a MultiIndex with a repeating level, how can I calculate the differences with another level of the index, effectively ignoring it?
Let me explain in code.
>>> ix = pd.MultiIndex.from_product([(0, 1, 2), (0, 1, 2, 3)])
>>> df = pd.DataFrame([5]*4 + [4]*4 + [3, 2, 1, 0], index=ix)
>>> df
0
0 0 5
1 5
2 5
3 5
1 0 4
1 4
2 4
3 4
2 0 3
1 2
2 1
3 0
Now, by some operation, I'd like to subtract the last set of values (2, 0:4) from the whole data frame, i.e. df - df.loc[2], to produce this:
0
0 0 2
1 3
2 4
3 5
1 0 1
1 2
2 3
3 4
2 0 0
1 0
2 0
3 0
But that statement produces an error. df - df.loc[2:3] does not, but apart from the trailing zeros only NaNs are produced, naturally because the indices don't match.
How could this be achieved?
I realised that the index level is precisely the problem. So I got a bit closer.
>>> df.droplevel(0) - df.loc[2]
0
0 2
0 1
0 0
1 3
1 2
1 0
2 4
2 3
2 0
3 5
3 4
3 0
Still not quite what I want. But I don't know if there's a convenient way of achieving what I'm after.
You can do this with unstack and stack:
new_df = df.unstack()
new_df.sub(new_df.loc[2]).stack()
Output:
0
0 0 2
1 3
2 4
3 5
1 0 1
1 2
2 3
3 4
2 0 0
1 0
2 0
3 0
Try creating a DataFrame with an identical index by mapping the last set of data onto the second index level so it is populated across the whole DataFrame, then subtract:
df - pd.DataFrame(index=df.index,data=df.index.get_level_values(1).map(df.loc[2].squeeze()))
0
0 0 2
1 3
2 4
3 5
1 0 1
1 2
2 3
3 4
2 0 0
1 0
2 0
3 0
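If you would rather keep it as aligned arithmetic instead of reshaping, a one-line sketch using the level argument of sub should also work here (this assumes the single data column 0, as in the example):
# Broadcast the four values of df.loc[2] across level 1 of the row MultiIndex.
result = df.sub(df.loc[2].squeeze(), axis=0, level=1)
print(result)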

Python pandas cumsum with reset every time there is a 0

I have a matrix with 0s and 1s, and want to do a cumsum on each column that resets to 0 whenever a zero is observed. For example, if we have the following:
df = pd.DataFrame([[0,1],[1,1],[0,1],[1,0],[1,1],[0,1]],columns = ['a','b'])
print(df)
a b
0 0 1
1 1 1
2 0 1
3 1 0
4 1 1
5 0 1
The result I desire is:
print(df)
a b
0 0 1
1 1 2
2 0 3
3 1 0
4 2 1
5 0 2
However, when I try df.cumsum() * df, I am able to correctly identify the 0 elements, but the counter does not reset:
print(df.cumsum() * df)
a b
0 0 1
1 1 2
2 0 3
3 2 0
4 3 4
5 0 5
You can use:
a = df != 0
df1 = a.cumsum()-a.cumsum().where(~a).ffill().fillna(0).astype(int)
print (df1)
a b
0 0 1
1 1 2
2 0 3
3 1 0
4 2 1
5 0 2
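The same idea spelled out with the intermediates named, in case the chained version is hard to follow (the variable names are only for illustration):
import pandas as pd
df = pd.DataFrame([[0,1],[1,1],[0,1],[1,0],[1,1],[0,1]], columns=['a','b'])
a = df != 0
running = a.cumsum()
# At every zero, remember how far the running count had got,
# then carry that value forward until the next zero.
last_reset = running.where(~a).ffill().fillna(0).astype(int)
# Subtracting it restarts the count after each zero.
print(running - last_reset)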
Try this
df = pd.DataFrame([[0,1],[1,1],[0,1],[1,0],[1,1],[0,1]],columns = ['a','b'])
df['groupId1']=df.a.eq(0).cumsum()
df['groupId2']=df.b.eq(0).cumsum()
New=pd.DataFrame()
New['a']=df.groupby('groupId1').a.transform('cumsum')
New['b']=df.groupby('groupId2').b.transform('cumsum')
New
Out[1184]:
a b
0 0 1
1 1 2
2 0 3
3 1 0
4 2 1
5 0 2
You may also try the following naive but reliable approach.
For every column, create groups to count within: a new group starts whenever the value changes from the previous row and lasts while the value stays constant, i.e. (x != x.shift()).cumsum().
Example:
a b
0 1 1
1 2 1
2 3 1
3 4 2
4 4 3
5 5 3
Calculate cumulative sums within groups per column using pd.DataFrame's apply and groupby methods, and you get a cumsum with zero reset in one line:
import pandas as pd
df = pd.DataFrame([[0,1],[1,1],[0,1],[1,0],[1,1],[0,1]], columns = ['a','b'])
cs = df.apply(lambda x: x.groupby((x != x.shift()).cumsum()).cumsum())
print(cs)
a b
0 0 1
1 1 2
2 0 3
3 1 0
4 2 1
5 0 2
A slightly hacky way would be to identify the indices of the zeros and set the corresponding values to the negative of those indices before doing the cumsum (note that this only resets correctly when the running total before each zero equals that zero's positional index, which holds here because column b contains a single zero):
import numpy as np
import pandas as pd
df = pd.DataFrame([[0,1],[1,1],[0,1],[1,0],[1,1],[0,1]], columns=['a','b'])
z = np.where(df['b'] == 0)
df.loc[z[0], 'b'] = -z[0]
df['b'] = np.cumsum(df['b'])
df
a b
0 0 1
1 1 2
2 0 3
3 1 0
4 1 1
5 0 2
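A fully vectorized NumPy alternative, sketched under the assumption that the values are non-negative (so the running total never decreases); unlike the index trick above, it also handles multiple zeros per column:
import numpy as np
import pandas as pd
df = pd.DataFrame([[0,1],[1,1],[0,1],[1,0],[1,1],[0,1]], columns=['a','b'])
def cumsum_with_reset(col):
    total = np.cumsum(col)
    # Running total as it stood at the most recent zero; subtracting it
    # restarts the count after every zero.
    at_last_zero = np.maximum.accumulate(np.where(col == 0, total, 0))
    return total - at_last_zero
print(df.apply(cumsum_with_reset))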
