Efficiently Drop Rows in a Pandas Dataframe - python

I have a dataset like:
Id Status
1 0
1 0
1 0
1 0
1 1
2 0
1 0 # --> gets removed since this row appears after id 1 already had a status of 1
2 0
3 0
3 0
I want to drop all rows of an id after its status became 1, i.e. my new dataset will be:
Id Status
1 0
1 0
1 0
1 0
1 1
2 0
2 0
3 0
3 0
I want to learn how to implement this computation efficiently since I have a very large (200 GB+) dataset.
The solution I currently have is to find the index of the first 1 and slice each group that way. In cases where no 1 exists, return the group unchanged:
def remove(series):
    indexless = series.reset_index(drop=True)
    ones = indexless[indexless['Status'] == 1]
    if len(ones) > 0:
        return indexless.iloc[:ones.index[0] + 1]
    else:
        return indexless

df.groupby('Id').apply(remove).reset_index(drop=True)
However, this runs very slowly. Is there any way to fix this, or an alternative way to speed up the computation?

The first idea is to create a cumulative sum per group from the boolean mask, but a shift is also necessary so that the first 1 is not lost:
#pandas 0.24+
s = (df['Status'] == 1).groupby(df['Id']).apply(lambda x: x.shift(fill_value=0).cumsum())
#pandas below
#s = (df['Status'] == 1).groupby(df['Id']).apply(lambda x: x.shift().fillna(0).cumsum())
df = df[s == 0]
print (df)
Id Status
0 1 0
1 1 0
2 1 0
3 1 0
4 1 1
5 2 0
7 2 0
8 3 0
9 3 0
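A minimal vectorized sketch that avoids GroupBy.apply altogether and relies only on the cython-backed groupby cumsum (assuming Status holds only 0s and 1s):
import pandas as pd

# df as in the question
df = pd.DataFrame({'Id':     [1, 1, 1, 1, 1, 2, 1, 2, 3, 3],
                   'Status': [0, 0, 0, 0, 1, 0, 0, 0, 0, 0]})

# keep rows while the per-Id running total of Status is still 0,
# plus the single row where that total first becomes 1
c = df.groupby('Id')['Status'].cumsum()
print(df[(c == 0) | ((c == 1) & df['Status'].eq(1))])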
Another solution is to use a custom function with Series.idxmax:
def f(x):
    if x['new'].any():
        return x.iloc[:x['new'].idxmax()+1, :]
    else:
        return x

df1 = (df.assign(new=(df['Status'] == 1))
         .groupby(df['Id'], group_keys=False)
         .apply(f)
         .drop('new', axis=1))
print (df1)
Id Status
0 1 0
1 1 0
2 1 0
3 1 0
4 1 1
5 2 0
7 2 0
8 3 0
9 3 0
Or a slightly modified first solution: filter only the groups that contain a 1 and apply the solution only there:
m = df['Status'].eq(1)
ids = df.loc[m, 'Id'].unique()
print (ids)
[1]
m1 = df['Id'].isin(ids)
m2 = (m[m1].groupby(df.loc[m1, 'Id'])
           .apply(lambda x: x.shift(fill_value=0).cumsum())
           .eq(0))
df = df[m2.reindex(df.index, fill_value=True)]
print (df)
Id Status
0 1 0
1 1 0
2 1 0
3 1 0
4 1 1
5 2 0
7 2 0
8 3 0
9 3 0

Let's start with this dataset.
l =[[1,0],[1,0],[1,0],[1,0],[1,1],[2,0],[1,0], [2,0], [2,1],[3,0],[2,0], [3,0]]
df_ = pd.DataFrame(l, columns = ['id', 'status'])
We will find the status=1 index for each id.
status_1_indice = df_[df_['status']==1].reset_index()[['index', 'id']].set_index('id')
index
id
1 4
2 8
Now we join df_ with status_1_indice:
join_table = df_.join(status_1_indice, on='id').reset_index().fillna(np.inf)
Notice the .fillna(np.inf) for ids that don't have status=1. Result:
level_0 id status index
0 0 1 0 4.000000
1 1 1 0 4.000000
2 2 1 0 4.000000
3 3 1 0 4.000000
4 4 1 1 4.000000
5 5 2 0 8.000000
6 6 1 0 4.000000
7 7 2 0 8.000000
8 8 2 1 8.000000
9 9 3 0 inf
10 10 2 0 8.000000
11 11 3 0 inf
The required dataframe can be obtained by:
join_table.query('level_0 <= index')[['id', 'status']]
Together:
status_1_indice = df_[df_['status']==1].reset_index()[['index', 'id']].set_index('id')
join_table = df_.join(status_1_indice, on='id').reset_index().fillna(np.inf)
required_df = join_table.query('level_0 <= index')[['id', 'status']]
id status
0 1 0
1 1 0
2 1 0
3 1 0
4 1 1
5 2 0
7 2 0
8 2 1
9 3 0
11 3 0
I can't vouch for the performance, but this is more straightforward than the method in the question.
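Since the real concern is a 200 GB dataset, it may be worth timing candidate approaches on a synthetic sample before committing to either. A rough sketch (the sizes and the helper names via_cumsum and via_join are arbitrary; via_cumsum uses a vectorized groupby-cumsum mask, via_join follows the join recipe above):
import time
import numpy as np
import pandas as pd

# synthetic sample: 50,000 ids x 20 rows, one status=1 per id at a random position
rng = np.random.default_rng(0)
n_ids, rows_per_id = 50_000, 20
big = pd.DataFrame({'id': np.repeat(np.arange(n_ids), rows_per_id), 'status': 0})
big.loc[np.arange(n_ids) * rows_per_id + rng.integers(0, rows_per_id, n_ids), 'status'] = 1

def via_cumsum(df):
    c = df.groupby('id')['status'].cumsum()
    return df[(c == 0) | ((c == 1) & df['status'].eq(1))]

def via_join(df):
    status_1_indice = df[df['status'] == 1].reset_index()[['index', 'id']].set_index('id')
    join_table = df.join(status_1_indice, on='id').reset_index().fillna(np.inf)
    return join_table.query('level_0 <= index')[['id', 'status']]

for fn in (via_cumsum, via_join):
    start = time.perf_counter()
    fn(big)
    print(fn.__name__, round(time.perf_counter() - start, 3), 's')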

Related

Set value when row is maximum in group by - Python Pandas

I am trying to create a column (is_max) that is 1 if column B is the maximum within its group (given by column A) and 0 if it is not.
Example:
[Input]
A B
1 2
2 3
1 4
2 5
[Output]
A B is_max
1 2 0
2 3 0
1 4 1
2 5 1
What I'm trying:
df['is_max'] = 0
df.loc[df.reset_index().groupby('A')['B'].idxmax(),'is_max'] = 1
Fix your code by removing the reset_index:
df['is_max'] = 0
df.loc[df.groupby('A')['B'].idxmax(),'is_max'] = 1
df
Out[39]:
A B is_max
0 1 2 0
1 2 3 0
2 1 4 1
3 2 5 1
I assume A is your grouping column, since you did not state it:
df['is_max']=(df['B']==df.groupby('A')['B'].transform('max')).astype(int)
or
df.groupby('A')['B'].apply(lambda x: x==x.max()).astype(int)
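For completeness, a small runnable sketch of the transform approach on the sample data; unlike idxmax, it flags every row that ties for the group maximum:
import pandas as pd

df = pd.DataFrame({'A': [1, 2, 1, 2], 'B': [2, 3, 4, 5]})
df['is_max'] = (df['B'] == df.groupby('A')['B'].transform('max')).astype(int)
print(df)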

Flag creation based on count of consecutive ones in a column

I have a data frame with a column with only 0's and 1's. I need to create a flag column where there are more than a certain number of consecutive ones in the first column.
In the example below, x >= 4: if there are 4 or more consecutive ones, then the flag should be 1 for all of those consecutive rows.
col1 Flag
0 1 0
1 0 0
2 1 1
3 1 1
4 1 1
5 1 1
6 0 0
7 1 0
8 1 0
9 0 0
10 1 1
11 1 1
12 1 1
13 1 1
14 1 1
15 0 0
One change: let's say there is a new column Group; we need to group by that and find the flag.
Group col1 Flag
0 A 1 0
1 B 0 0
2 B 1 1
3 B 1 1
4 B 1 1
5 B 1 1
6 C 0 0
7 C 1 0
8 C 1 0
9 C 0 0
10 D 1 0
11 D 1 0
12 D 1 0
13 E 1 0
14 E 1 0
15 E 0 0
As you can see, there are consecutive ones from 10 to 14, but they belong to different groups. Elements in a group can be in any order.
Not that hard: use cumsum to create the group key, then do a transform count:
(df.groupby(df.col1.ne(1).cumsum())['col1'].transform('count').ge(5) & df.col1.eq(1)).astype(int)
Out[83]:
0 0
1 0
2 1
3 1
4 1
5 1
6 0
7 0
8 0
9 0
10 1
11 1
12 1
13 1
14 1
15 0
Name: col1, dtype: int32
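To see why the count is compared against 5 rather than 4, here is a sketch of the same logic unrolled; the grouping key lumps each run of ones together with the zero immediately before it, so the group size is the run length plus one:
import pandas as pd

df = pd.DataFrame({'col1': [1, 0, 1, 1, 1, 1, 0, 1, 1, 0, 1, 1, 1, 1, 1, 0]})

# key increases at every row that is not 1, so a run of ones is grouped
# together with the zero right before it
key = df.col1.ne(1).cumsum()

# group size = run length + 1 (the leading zero), hence ge(5) for runs of >= 4 ones;
# the very first run of a column has no leading zero, so it would need 5 ones
size = df.groupby(key)['col1'].transform('count')

df['Flag'] = (size.ge(5) & df.col1.eq(1)).astype(int)
print(df)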
You can achieve this in a couple of steps:
rolling(4).sum() to attain consecutive summations of your column
Use where to get the 1's from "col1" where their summation window (from the previous step) is >= 4. Turn the rest of the values into np.NaN
bfill(limit=3) to backwards fill the leftover 1s in your column by a maximum of 3 places.
fillna(0) to fill what's left over with 0
df["my_flag"] = (df["col1"]
.where(
df["col1"].rolling(4).sum() >= 4
) # Selects the 1's whose consecutive sum >= 4. All other values become NaN
.bfill(limit=3) # Moving backwards from our leftover values,
# take the existing value and fill in a maximum of 3 NaNs
.fillna(0) # Fill in the rest of the NaNs with 0
.astype(int)) # Cast to integer data type, since we were working with floats temporarily
print(df)
col1 Flag my_flag
0 1 0 0
1 0 0 0
2 1 1 1
3 1 1 1
4 1 1 1
5 1 1 1
6 0 0 0
7 1 0 0
8 1 0 0
9 0 0 0
10 1 1 1
11 1 1 1
12 1 1 1
13 1 1 1
14 1 1 1
15 0 0 0
Edit:
For a grouped approach, you just need to use groupby().rolling to create your mask for use in where(). Everything after that is the same. I separated the rolling step to keep it as readable as possible:
grouped_counts_ge_4 = (df.groupby("Group")["col1"]
                         .rolling(4)
                         .sum()
                         .ge(4)
                         .reset_index(level=0, drop=True))

df["my_flag"] = (df["col1"]
                 .where(grouped_counts_ge_4)
                 .bfill(limit=3)  # Moving backwards from our leftover values, take the existing value and fill in a maximum of 3 NaNs
                 .fillna(0)       # Fill in the rest of the NaNs with 0
                 .astype(int))    # Cast to integer data type, since we were working with floats temporarily
print(df)
Group col1 Flag my_flag
0 A 1 0 0
1 B 0 0 0
2 B 1 1 1
3 B 1 1 1
4 B 1 1 1
5 B 1 1 1
6 C 0 0 0
7 C 1 0 0
8 C 1 0 0
9 C 0 0 0
10 D 1 0 0
11 D 1 0 0
12 D 1 0 0
13 E 1 0 0
14 E 1 0 0
15 E 0 0 0
Try this:
df['Flag'] = np.where(df['col1'].groupby((df['col1'].diff().ne(0) | df['col1'].eq(0)).cumsum()).transform('size').ge(4),1,0)
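The same one-liner unrolled into named steps, in case the grouping key is hard to read (identical logic, just annotated):
import numpy as np
import pandas as pd

df = pd.DataFrame({'col1': [1, 0, 1, 1, 1, 1, 0, 1, 1, 0, 1, 1, 1, 1, 1, 0]})

# run_id increases whenever the value changes or is 0, so every maximal run of
# ones becomes one group and every 0 sits in its own group of size 1
run_id = (df['col1'].diff().ne(0) | df['col1'].eq(0)).cumsum()

# size of the run each row belongs to
run_size = df['col1'].groupby(run_id).transform('size')

# flag rows inside a run of ones of length >= 4
df['Flag'] = np.where(run_size.ge(4), 1, 0)
print(df)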

Make pandas df in wide format and unconcatenate values to different columns

Sorry, I have a bit of trouble explaining the problem in the title.
By accident we pivoted our Pandas Dataframe to this:
df = pd.DataFrame(np.array([[1,1,2], [1,2,1], [2,1,2], [2,2,2],[3,1,3]]),columns=['id', '3s', 'score'])
id 3s score
1 1 2
1 2 1
2 1 2
2 2 2
3 1 3
But we need to unstack this so df looks like the original version below. The '3s' column 'unpivots' into consecutive sets of 3 ordered columns of 0s and 1s, filled in order. So if we had '3s' = 2 with 'score' = 2, the values will be [1, 1, 0] (2 out of 3, in order) in columns ['4', '5', '6'] (the second set of 3) for the corresponding id:
df2 = pd.DataFrame(np.array([[1,1,1,0,1,0,0], [2,1,1,0,1,1,0], [3,1,1,1,np.nan,np.nan,np.nan] ]),columns=['id', '1', '2','3','4','5','6'])
id 1 2 3 4 5 6
1 1 1 0 1 0 0
2 1 1 0 1 1 0
3 1 1 1
Any help greatly appreciated!
(please save me)
Use:
n = 3
df2 = df.reindex(index=df.index.repeat(n))
new_df = (df2.assign(score=df2['score'].gt(df2.groupby(['id', '3s'])
                                              .id
                                              .cumcount())
                                        .astype(int),
                     columns=df2.groupby('id').cumcount().add(1))
             .pivot_table(index='id',
                          values='score',
                          columns='columns',
                          fill_value='')
             .rename_axis(columns=None)
             .reset_index())
print(new_df)
Output
id 1 2 3 4 5 6
0 1 1.0 1.0 0.0 1 0 0
1 2 1.0 1.0 0.0 1 1 0
2 3 1.0 1.0 1.0
If you want, you can use fill_value = 0 instead:
id 1 2 3 4 5 6
0 1 1 1 0 1 0 0
1 2 1 1 0 1 1 0
2 3 1 1 1 0 0 0
This should do the trick:
for gr in df.groupby('3s').groups:
    for i in range(1, 4):
        df[str(i + (gr-1)*3)] = np.where((df['3s'].eq(gr)) & (df['score'].ge(i)), 1, 0)
df = df.drop(['3s', 'score'], axis=1).groupby('id').max().reset_index()
Output:
id 1 2 3 4 5 6
0 1 1 1 0 1 0 0
1 2 1 1 0 1 1 0
2 3 1 1 1 0 0 0

Python: how to get unique values over 2 different columns?

I have a dataframe like the following
df
idA idB yA yB
0 3 2 0 1
1 0 1 0 0
2 0 4 0 1
3 0 2 0 1
4 0 3 0 0
I would like to have a unique y for each id, so:
df
id y
0 0 0
1 1 0
2 2 1
3 3 0
4 4 1
First create a new DataFrame by flattening the columns selected with iloc using numpy.ravel, then sort_values and drop_duplicates by the id column:
df2 = (pd.DataFrame({'id': df.iloc[:, :2].values.ravel(),
                     'y': df.iloc[:, 2:4].values.ravel()})
         .sort_values('id')
         .drop_duplicates(subset=['id'])
         .reset_index(drop=True))
print (df2)
id y
0 0 0
1 1 0
2 2 1
3 3 0
4 4 1
Detail:
print (pd.DataFrame({'id': df.iloc[:, :2].values.ravel(),
                     'y': df.iloc[:, 2:4].values.ravel()}))
id y
0 3 0
1 2 1
2 0 0
3 1 0
4 0 0
5 4 1
6 0 0
7 2 1
8 0 0
9 3 0
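An equivalent way to build the same flattened frame, if the positional iloc selection feels fragile, is to stack the two id/y column pairs with pd.concat; a sketch assuming the column names from the question:
import pandas as pd

df = pd.DataFrame({'idA': [3, 0, 0, 0, 0], 'idB': [2, 1, 4, 2, 3],
                   'yA':  [0, 0, 0, 0, 0], 'yB':  [1, 0, 1, 1, 0]})

pairs = [df[['idA', 'yA']].set_axis(['id', 'y'], axis=1),
         df[['idB', 'yB']].set_axis(['id', 'y'], axis=1)]
df2 = (pd.concat(pairs)
         .drop_duplicates('id')
         .sort_values('id')
         .reset_index(drop=True))
print(df2)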

Python pandas cumsum with reset everytime there is a 0

I have a matrix with 0s and 1s, and want to do a cumsum on each column that resets to 0 whenever a zero is observed. For example, if we have the following:
df = pd.DataFrame([[0,1],[1,1],[0,1],[1,0],[1,1],[0,1]],columns = ['a','b'])
print(df)
a b
0 0 1
1 1 1
2 0 1
3 1 0
4 1 1
5 0 1
The result I desire is:
print(df)
a b
0 0 1
1 1 2
2 0 3
3 1 0
4 2 1
5 0 2
However, when I try df.cumsum() * df, I am able to correctly identify the 0 elements, but the counter does not reset:
print(df.cumsum() * df)
a b
0 0 1
1 1 2
2 0 3
3 2 0
4 3 4
5 0 5
You can use:
a = df != 0
df1 = a.cumsum()-a.cumsum().where(~a).ffill().fillna(0).astype(int)
print (df1)
a b
0 0 1
1 1 2
2 0 3
3 1 0
4 2 1
5 0 2
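To see what the expression above is doing, it can be unpacked into named steps (same result, just annotated):
import pandas as pd

df = pd.DataFrame([[0,1],[1,1],[0,1],[1,0],[1,1],[0,1]], columns=['a','b'])

a = df != 0        # True where the value is nonzero
c = a.cumsum()     # running count of nonzero values per column

# the running count frozen at the most recent zero (0 before the first zero)
at_last_zero = c.where(~a).ffill().fillna(0).astype(int)

# nonzero values counted since the last zero = cumsum that resets at every 0
print(c - at_last_zero)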
Try this
df = pd.DataFrame([[0,1],[1,1],[0,1],[1,0],[1,1],[0,1]],columns = ['a','b'])
df['groupId1']=df.a.eq(0).cumsum()
df['groupId2']=df.b.eq(0).cumsum()
New=pd.DataFrame()
New['a']=df.groupby('groupId1').a.transform('cumsum')
New['b']=df.groupby('groupId2').b.transform('cumsum')
New
Out[1184]:
a b
0 0 1
1 1 2
2 0 3
3 1 0
4 2 1
5 0 2
You may also try the following naive but reliable approach. For every column, create groups to count within: a new group starts whenever the value changes from one row to the next and lasts while the value stays constant: (x != x.shift()).cumsum(). For the example df below, the group keys look like this:
a b
0 1 1
1 2 1
2 3 1
3 4 2
4 4 3
5 5 3
Calculate cumulative sums within these groups per column using pd.DataFrame's apply and groupby methods, and you get the cumsum with zero reset in one line:
import pandas as pd
df = pd.DataFrame([[0,1],[1,1],[0,1],[1,0],[1,1],[0,1]], columns = ['a','b'])
cs = df.apply(lambda x: x.groupby((x != x.shift()).cumsum()).cumsum())
print(cs)
a b
0 0 1
1 1 2
2 0 3
3 1 0
4 2 1
5 0 2
A slightly hacky way would be to identify the indices of the zeros and set the corresponding values to the negative of those indices before doing the cumsum. This works here because column b has a single zero preceded only by ones, so the running sum just before the zero equals its index; for columns with several zeros (such as a), the groupby approaches above are safer:
import numpy as np
import pandas as pd

df = pd.DataFrame([[0,1],[1,1],[0,1],[1,0],[1,1],[0,1]], columns=['a','b'])
z = np.where(df['b'] == 0)
df.loc[z[0], 'b'] = -z[0]
df['b'] = np.cumsum(df['b'])
df
a b
0 0 1
1 1 2
2 0 3
3 1 0
4 1 1
5 0 2
