Python dataframe, drop everything after a certain record

I have a dataframe like this
import pandas as pd
df = pd.DataFrame({'id' : [1, 1, 1, 1, 2, 2, 2, 3, 3, 3], \
'counter' : [1, 2, 3, 4, 1, 2, 3, 1, 2, 3], \
'status':['a', 'b', 'b' ,'c', 'a', 'a', 'a', 'a', 'a', 'b'], \
'additional_data' : [12,35,13,523,6,12,6,1,46,236]}, \
columns=['id', 'counter', 'status', 'additional_data'])
df
Out[37]:
id counter status additional_data
0 1 1 a 12
1 1 2 b 35
2 1 3 b 13
3 1 4 c 523
4 2 1 a 6
5 2 2 a 12
6 2 3 a 6
7 3 1 a 1
8 3 2 a 46
9 3 3 b 236
The id column indicates which data belongs together, the counter indicates the order of the rows, and status is a special status code. I want to drop all rows after the first occurrence of a row with status='b', keeping the first row with status='b'.
Final output should look like this
id counter status additional_data
0 1 1 a 12
1 1 2 b 35
4 2 1 a 6
5 2 2 a 12
6 2 3 a 6
7 3 1 a 1
8 3 2 a 46
9 3 3 b 236
All help is, as always, greatly appreciated.

Use a custom function with idxmax to get the index of the first row whose status equals 'b'; because .loc label slicing is inclusive, that first 'b' row is kept:
def f(x):
    # boolean mask of the rows whose status is 'b'
    m = x['status'].eq('b')
    if m.any():
        # idxmax returns the label of the first True; .loc keeps everything up to and including it
        x = x.loc[:m.idxmax()]
    return x

a = df.groupby('id', group_keys=False).apply(f)
print(a)
id counter status additional_data
0 1 1 a 12
1 1 2 b 35
4 2 1 a 6
5 2 2 a 12
6 2 3 a 6
7 3 1 a 1
8 3 2 a 46
9 3 3 b 236
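For comparison, a fully vectorized alternative is also possible (a sketch, not part of the original answer; it assumes the df built above): count how many 'b' rows have been seen so far within each id and keep a row only while that count is still zero, or when it is the first 'b' itself.
# cumulative number of 'b' rows seen so far within each id
seen_b = df['status'].eq('b').groupby(df['id']).cumsum()
# keep rows before the first 'b' plus the first 'b' row itself
keep = seen_b.eq(0) | (seen_b.eq(1) & df['status'].eq('b'))
print(df[keep])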

Related

df.loc behavior when assigning a dict to a column

Let's say we have a df like the one below:
df = pd.DataFrame({'A': [3, 9, 3, 4], 'B': [7, 1, 6, 0], 'C': [9, 0, 3, 4], 'D': [1, 8, 0, 0]})
Starting df:
A B C D
0 3 7 9 1
1 9 1 0 8
2 3 6 3 0
3 4 0 4 0
If we wanted to assign new values to column A, I would expect the following to work:
d = {0:10,1:20,2:30,3:40}
df.loc[:,'A'] = d
Output:
A B C D
0 0 7 9 1
1 1 1 0 8
2 2 6 3 0
3 3 0 4 0
The values that are assigned instead are the keys of the dictionary.
However, if instead of assigning the dictionary to an existing column we create a new column, we get the same result the first time we run the code, but running the same code again gives the expected result. After that, assigning the dict to any column produces the expected output.
First time running df.loc[:,'E'] = {0:10,1:20,2:30,3:40}
Output:
A B C D E
0 0 7 9 1 0
1 1 1 0 8 1
2 2 6 3 0 2
3 3 0 4 0 3
Second time running df.loc[:,'E'] = {0:10,1:20,2:30,3:40}
A B C D E
0 0 7 9 1 10
1 1 1 0 8 20
2 2 6 3 0 30
3 3 0 4 0 40
Then if we run the same code as we did at first, we get a different result:
df.loc[:,'A'] = {0:10,1:20,2:30,3:40}
Output:
A B C D E
0 10 7 9 1 10
1 20 1 0 8 20
2 30 6 3 0 30
3 40 0 4 0 40
Is this the intended behavior? (I am running pandas version 1.4.2)
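A commonly used workaround (a sketch, not part of the question) is to wrap the dict in a Series first; the assignment then aligns on the index, so the values rather than the keys are written even on the first run:
import pandas as pd

df = pd.DataFrame({'A': [3, 9, 3, 4], 'B': [7, 1, 6, 0], 'C': [9, 0, 3, 4], 'D': [1, 8, 0, 0]})
d = {0: 10, 1: 20, 2: 30, 3: 40}
# pd.Series(d) has index [0, 1, 2, 3] and values [10, 20, 30, 40];
# .loc assignment aligns on that index
df.loc[:, 'A'] = pd.Series(d)
print(df['A'])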

How to get the cumulative count based on two columns

Let's say we have the following dataframe. If we wanted to find the count of consecutive 1's, we could use the code below.
col
0 0
1 1
2 1
3 1
4 0
5 0
6 1
7 1
8 0
9 1
10 1
11 1
12 1
13 0
14 1
15 1
df['col'].groupby(df['col'].diff().ne(0).cumsum()).cumsum()
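For reference, a quick look at the helper grouper (a sketch; it assumes df holds the single col column shown above): diff().ne(0).cumsum() gives every run of equal values its own label, so the groupby above restarts the running sum whenever the value changes.
# each run of equal consecutive values gets its own integer label
print(df['col'].diff().ne(0).cumsum().tolist())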
But the problem I see is when you need to use groupby with an id field. If we added an id field to the dataframe (below), it makes things more complicated, and we can no longer use the solution above.
id col
0 B 0
1 B 1
2 B 1
3 B 1
4 A 0
5 A 0
6 B 1
7 B 1
8 B 0
9 B 1
10 B 1
11 A 1
12 A 1
13 A 0
14 A 1
15 A 1
When presented with this issue, I've seen the suggestion of making a helper series to use in the groupby, like this:
s = df['col'].eq(0).groupby(df['id']).cumsum()
df['col'].groupby([df['id'],s]).cumsum()
That works, but the problem is that the first group contains the first row, which does not fit the criteria. This usually isn't a problem, but it is if we want to find the count: replacing the cumsum() at the end of the last groupby() with .transform('count') gives 6 instead of 5 for the count of consecutive 1's in the first B group.
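A small demonstration of that issue (a sketch; it assumes df holds the id/col data shown above): the helper series puts row 0 into the same group as rows 1, 2, 3, 6 and 7, so transform('count') reports 6 for all of them.
s = df['col'].eq(0).groupby(df['id']).cumsum()
# rows 0-3 all report 6 instead of 1 and 5
print(df['col'].groupby([df['id'], s]).transform('count').head(4))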
The only solution I can come up with for this problem is the following code:
df['col'].groupby([df['id'],df.groupby('id')['col'].transform(lambda x: x.diff().ne(0).astype(int).cumsum())]).transform('count')
Expected output:
0 1
1 5
2 5
3 5
4 2
5 2
6 5
7 5
8 1
9 2
10 2
11 2
12 2
13 1
14 2
15 2
This works, but uses transform() twice, which I heard isn't the fastest. It is the only solution I can think of that uses diff().ne(0) to get the "real" groups.
Indexes 1, 2, 3, 6 and 7 are all id B with the same value in the 'col' column, so the count should not reset and they should all be part of the same group.
Can this be done without using multiple .transform()?
The following code uses only one .transform() and relies on sorting the index to get the correct counts.
The original index is kept, so the final result can be reindexed back to the original order.
Use cum_counts['cum_counts'] to get the exact desired output, without the other column.
import pandas as pd
# test data as shown in OP
df = pd.DataFrame({'id': ['B', 'B', 'B', 'B', 'A', 'A', 'B', 'B', 'B', 'B', 'B', 'A', 'A', 'A', 'A', 'A'], 'col': [0, 1, 1, 1, 0, 0, 1, 1, 0, 1, 1, 1, 1, 0, 1, 1]})
# reset the index, then set the index and sort
df = df.reset_index().set_index(['index', 'id']).sort_index(level=1)
col
index id
4 A 0
5 A 0
11 A 1
12 A 1
13 A 0
14 A 1
15 A 1
0 B 0
1 B 1
2 B 1
3 B 1
6 B 1
7 B 1
8 B 0
9 B 1
10 B 1
# label each run of consecutive equal col values with a cumulative sum
g = df.col.ne(df.col.shift()).cumsum()
# use g to groupby and use only 1 transform to get the counts
cum_counts = df['col'].groupby(g).transform('count').reset_index(level=1, name='cum_counts').sort_index()
id cum_counts
index
0 B 1
1 B 5
2 B 5
3 B 5
4 A 2
5 A 2
6 B 5
7 B 5
8 B 1
9 B 2
10 B 2
11 A 2
12 A 2
13 A 1
14 A 2
15 A 2
After looking at TrentonMcKinney's solution, I came up with:
df = df.sort_values(['id'])
grp =(df[['id','col']] != df[['id','col']].shift()).any(axis=1).cumsum()
df['count'] = df.groupby(grp)['id'].transform('count')
df.sort_index()
Output:
id col count
0 B 0 1
1 B 1 5
2 B 1 5
3 B 1 5
4 A 0 2
5 A 0 2
6 B 1 5
7 B 1 5
8 B 0 1
9 B 1 2
10 B 1 2
11 A 1 2
12 A 1 2
13 A 0 1
14 A 1 2
15 A 1 2
IIUC, do you want this? (Without sorting by id first, the two runs of id B separated by the A rows are counted as separate groups, which is why these counts differ from the expected output above.)
grp = (df[['id', 'col']] != df[['id', 'col']].shift()).any(axis = 1).cumsum()
df['count'] = df.groupby(grp)['id'].transform('count')
df
Output:
id col count
0 B 0 1
1 B 1 3
2 B 1 3
3 B 1 3
4 A 0 2
5 A 0 2
6 B 1 2
7 B 1 2
8 B 0 1
9 B 1 2
10 B 1 2
11 A 1 2
12 A 1 2
13 A 0 1
14 A 1 2
15 A 1 2

How to count the number of occurrences of semi-duplicate rows and make the count a new column

I have a pandas dataframe as follows:
df = pd.DataFrame({'A': [4, 4, 1, 5, 1, 1],
                   'B': [2, 2, 2, 5, 2, 2],
                   'C': [1, 1, 3, 5, 3, 3],
                   'D': ['q', 'e', 'r', 'y', 'u', 'w']})
which looks like
A B C D
0 4 2 1 q
1 4 2 1 e
2 1 2 3 r
3 5 5 5 y
4 1 2 3 u
5 1 2 3 w
I would like to add a new column that is the count of duplicate rows, with respect to only the columns A, B, and C. This would look like
A B C D Count
0 4 2 1 q 2
1 4 2 1 e 2
2 1 2 3 r 3
3 5 5 5 y 1
4 1 2 3 u 3
5 1 2 3 w 3
I'm guessing this will be something like df.groupby(['A','B','C']).size() but I am unsure how to map the values back to the new 'Count' column. Thanks!
We can use transform to map each group's count back onto its rows:
df['Count'] = df.groupby(['A','B','C']).D.transform('count')
df['Count']
0 2
1 2
2 3
3 1
4 3
5 3
Name: Count, dtype: int64
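An equivalent alternative (a sketch): transform('size') counts every row in the group, including rows where the counted column is NaN (count skips NaN), so it is a safer drop-in when D may contain missing values.
df['Count'] = df.groupby(['A', 'B', 'C'])['A'].transform('size')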

Calculate DataFrame mode based on grouped data

I have the following DataFrame:
>>> df = pd.DataFrame({"a": [1, 1, 1, 1, 2, 2, 3, 3, 3], "b": [1, 5, 7, 9, 2, 4, 6, 14, 5], "c": [1, 0, 0, 1, 1, 1, 1, 0, 1]})
>>> df
a b c
0 1 1 1
1 1 5 0
2 1 7 0
3 1 9 1
4 2 2 1
5 2 4 1
6 3 6 1
7 3 14 0
8 3 5 1
I want to calculate the mode of column c for every unique value in a and then select the rows where c has this value.
This is my own solution:
>>> major_types = df.groupby(['a'])['c'].apply(lambda x: pd.Series.mode(x)[0])
>>> df = df.merge(major_types, how="left", right_index=True, left_on="a", suffixes=("", "_major"))
>>> df = df[df['c'] == df['c_major']].drop(columns="c_major", axis=1)
Which would output the following:
>>> df
a b c
1 1 5 0
2 1 7 0
4 2 2 1
5 2 4 1
6 3 6 1
8 3 5 1
It is very inefficient for large DataFrames. Any idea what to do?
IIUC, use GroupBy.transform instead of apply + merge:
df.loc[df['c'].eq(df.groupby('a')['c'].transform(lambda x: x.mode()[0]))]
a b c
1 1 5 0
2 1 7 0
4 2 2 1
5 2 4 1
6 3 6 1
8 3 5 1
Or, without a lambda (note that in a tied group, mode()[0] picks the smallest value, while this size-based version keeps every tied value):
s = df.groupby(['a', 'c'])['c'].transform('size')
df.loc[s.eq(s.groupby(df['a']).transform('max'))]

Adding dataframes only on selected rows

I have a dataframe like this
import pandas as pd
df = pd.DataFrame({'id' : [1, 1, 1, 1, 2, 2, 2, 3, 3, 3],\
'crit_1' : [0, 0, 1, 0, 0, 0, 1, 0, 0, 1], \
'crit_2' : ['a', 'a', 'b', 'b', 'a', 'b', 'a', 'a', 'a', 'a'],\
'value' : [3, 4, 3, 5, 1, 2, 4, 6, 2, 3]}, \
columns=['id' , 'crit_1', 'crit_2', 'value' ])
df
Out[41]:
id crit_1 crit_2 value
0 1 0 a 3
1 1 0 a 4
2 1 1 b 3
3 1 0 b 5
4 2 0 a 1
5 2 0 b 2
6 2 1 a 4
7 3 0 a 6
8 3 0 a 2
9 3 1 a 3
I pull a subset out of this frame based on crit_1
df_subset = df[(df['crit_1']==1)]
Then I perform a complex operation (the nature of which is unimportant for this question) on that subset, producing a new column:
df_subset['some_new_val'] = [1, 4, 2]
df_subset
Out[42]:
id crit_1 crit_2 value some_new_val
2 1 1 b 3 1
6 2 1 a 4 4
9 3 1 a 3 2
Now I want to add some_new_val back into my original dataframe, onto the column value. However, I only want to add it where there is a match on id and crit_2.
The result should look like this
id crit_1 crit_2 value new_value
0 1 0 a 3 3
1 1 0 a 4 4
2 1 1 b 3 4
3 1 0 b 5 6
4 2 0 a 1 1
5 2 0 b 2 6
6 2 1 a 4 4
7 3 0 a 6 8
8 3 0 a 2 4
9 3 1 a 3 5
You can use merge with a left join and then add:
# keep only the columns needed for the join and the values to add
cols = ['id', 'crit_2', 'some_new_val']
df = pd.merge(df, df_subset[cols], on=['id', 'crit_2'], how='left')
print(df)
id crit_1 crit_2 value some_new_val
0 1 0 a 3 NaN
1 1 0 a 4 NaN
2 1 1 b 3 1.0
3 1 0 b 5 1.0
4 2 0 a 1 4.0
5 2 0 b 2 NaN
6 2 1 a 4 4.0
7 3 0 a 6 2.0
8 3 0 a 2 2.0
9 3 1 a 3 2.0
df['some_new_val'] = df['some_new_val'].add(df['value'], fill_value=0)
print(df)
id crit_1 crit_2 value some_new_val
0 1 0 a 3 3.0
1 1 0 a 4 4.0
2 1 1 b 3 4.0
3 1 0 b 5 6.0
4 2 0 a 1 5.0
5 2 0 b 2 2.0
6 2 1 a 4 8.0
7 3 0 a 6 8.0
8 3 0 a 2 4.0
9 3 1 a 3 5.0
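An alternative to the merge (a sketch, not from the answer; it starts again from the original df and df_subset shown in the question): build a lookup Series keyed on (id, crit_2) and add the matched values onto value, treating rows without a match as 0.
# unique lookup keyed on (id, crit_2)
lookup = df_subset.set_index(['id', 'crit_2'])['some_new_val']
# the (id, crit_2) pair of every row in the original frame
keys = pd.MultiIndex.from_frame(df[['id', 'crit_2']])
# reindex pulls the matching value per row (NaN where there is no match)
df['new_value'] = df['value'] + lookup.reindex(keys).fillna(0).to_numpy()
print(df)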
