Find first non-zero element within a group in pandas - python

I have a dataframe that looks like the following. The column named target is my desired column:
group value target
1     1     0
1     2     0
1     3     2
1     4     0
1     5     1
2     1     0
2     2     0
2     3     0
2     4     1
2     5     3
Now I want to find the first non-zero value in the target column for each group and remove rows before that row in each group. So the output should be like this:
group value target
1     3     2
1     4     0
1     5     1
2     4     1
2     5     3
I have seen this post, but I don't know how to change the code to get my desired result.
How can I do this?

In the groupby, set sort to False, take the cumsum of target, then keep only the rows where it is not equal to 0:
df.loc[df.groupby(["group"], sort=False).target.cumsum() != 0]
group value target
2 1 3 2
3 1 4 0
4 1 5 1
8 2 4 1
9 2 5 3
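For reference, a minimal, self-contained sketch that rebuilds the example frame and applies this approach (variable names here are illustrative):
import pandas as pd

# Rebuild the example frame from the question.
df = pd.DataFrame({
    "group":  [1, 1, 1, 1, 1, 2, 2, 2, 2, 2],
    "value":  [1, 2, 3, 4, 5, 1, 2, 3, 4, 5],
    "target": [0, 0, 2, 0, 1, 0, 0, 0, 1, 3],
})

# Within each group the cumulative sum of target stays 0 until the first
# non-zero value, so keeping the rows where it is non-zero drops the leading zeros.
out = df.loc[df.groupby("group", sort=False)["target"].cumsum() != 0]
print(out)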

This should do it. I'm sure you can do it with fewer reset_index() calls, but that shouldn't affect speed too much if your dataframe isn't too big:
idx = dff[dff.target.ne(0)].reset_index().groupby('group').index.first()
mask = (dff.reset_index().set_index('group')['index'].ge(idx.to_frame()['index'])).values
df_final = dff[mask]
Output:
0 group value target
3 1 3 2
4 1 4 0
5 1 5 1
9 2 4 1
10 2 5 3
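A hedged alternative sketch (not taken from either answer): build a per-group "seen a non-zero target yet" flag with cummax, which also keeps working if target could contain negative values that pull the running sum back to zero:
# Assumes the question's frame is named df.
seen_nonzero = df["target"].ne(0).astype(int).groupby(df["group"]).cummax().astype(bool)
df_first_nonzero = df[seen_nonzero]   # same rows as the expected output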

Related

Set value when row is maximum in group by - Python Pandas

I am trying to create a column (is_max) that is 1 if column B is the maximum within its group of column A values, and 0 if it is not.
Example:
[Input]
A B
1 2
2 3
1 4
2 5
[Output]
A B is_max
1 2 0
2 5 0
1 4 1
2 3 0
What I'm trying:
df['is_max'] = 0
df.loc[df.reset_index().groupby('A')['B'].idxmax(),'is_max'] = 1
Fix your code by removing the reset_index:
df['is_max'] = 0
df.loc[df.groupby('A')['B'].idxmax(),'is_max'] = 1
df
Out[39]:
A B is_max
0 1 2 0
1 2 3 0
2 1 4 1
3 2 5 1
I assume A is your group column, since you did not state it:
df['is_max']=(df['B']==df.groupby('A')['B'].transform('max')).astype(int)
or
df.groupby('A')['B'].apply(lambda x: x == x.max()).astype(int)
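For reference, a minimal, self-contained run of the transform approach on the question's data (assuming, as above, that A is the group key):
import pandas as pd

# Rebuild the small example from the question; A is the group key.
df = pd.DataFrame({"A": [1, 2, 1, 2], "B": [2, 3, 4, 5]})

# 1 where B equals the group-wise maximum of B, 0 otherwise.
df["is_max"] = df["B"].eq(df.groupby("A")["B"].transform("max")).astype(int)
print(df)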

regrouping similar column values in pandas

I have a dataframe with many rows and columns, looking like this:
index col1 col2
1     0    1
2     5    1
3     5    4
4     5    4
5     3    4
6     2    4
7     2    1
8     2    2
I would like to keep only the values that differ from the value at the previous index and replace the others with 0. On the example dataframe, it would be:
index col1 col2
1     0    1
2     5    0
3     0    4
4     0    0
5     3    0
6     2    0
7     0    1
8     0    2
What is a solution that works for any number of rows/columns?
So you'd like to keep the values where the difference from the previous row is not equal to 0 (i.e., they're not the same), and put 0 everywhere else:
>>> df.where(df.diff().ne(0), other=0)
col1 col2
index
1 0 1
2 5 0
3 0 4
4 0 0
5 3 0
6 2 0
7 0 1
8 0 2
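For reference, a self-contained sketch that rebuilds the example (treating the index column as the dataframe index) and applies the same idea:
import pandas as pd

# Rebuild the example frame with `index` as the actual index.
df = pd.DataFrame(
    {"col1": [0, 5, 5, 5, 3, 2, 2, 2], "col2": [1, 1, 4, 4, 4, 4, 1, 2]},
    index=pd.Index(range(1, 9), name="index"),
)

# diff() is NaN in the first row and 0 wherever a value repeats the previous
# row, so keep everything whose diff is not 0 and overwrite the rest with 0.
print(df.where(df.diff().ne(0), other=0))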

Filter based on pairs within a group

Group Code
1 2
1 2
1 4
1 1
2 4
2 1
2 2
2 3
2 1
2 1
2 3
Within each group there are pairs. In Group 1, for example, the pairs are (2,2), (2,4), (4,1).
I want to filter these pairs based on code number 2 being present at the beginning of the pair.
In group 1, for example, only (2,2) and (2,4) will be kept, while (4,1) will be filtered out.
Expected Output:
Group Code
1 2
1 2
1 4
2 2
2 3
Check with groupby and shift:
out = df[df.groupby('Group').Code.apply(lambda x : x.eq(2) | x.eq(2).shift())]
Out[67]:
Group Code
0 1 2
1 1 2
2 1 4
6 2 2
7 2 3
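A hedged, apply-free variant of the same mask, sketched on a rebuilt copy of the example: compare Code with its per-group shift directly.
import pandas as pd

# Rebuild the example from the question.
df = pd.DataFrame({
    "Group": [1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2],
    "Code":  [2, 2, 4, 1, 4, 1, 2, 3, 1, 1, 3],
})

# Keep rows where Code is 2 or where the previous row in the same group was 2.
is_two = df["Code"].eq(2)
print(df[is_two | is_two.groupby(df["Group"]).shift(fill_value=False)])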

Filtering pandas dataframe groups based on groups comparison

I am trying to remove corrupted data from my pandas dataframe. I want to remove groups from the dataframe whose value differs by more than one from the previous group's value. Here is an example:
Value
0 1
1 1
2 1
3 2
4 2
5 2
6 8 <- if I group by Value, this group's value is larger than the
7 8    previous group's value by 6, so I want to remove this group
8 3    from the dataframe
9 3
Expected result:
Value
0 1
1 1
2 1
3 2
4 2
5 2
6 3
7 3
Edit:
jezrael's solution is great, but in my case it is possible that there will be duplicate group values:
Value
0 1
1 1
2 1
3 3
4 3
5 3
6 1
7 1
Sorry if I was not clear about this.
First remove duplicates to get the unique values, then compare the difference with the shifted values, and finally filter by boolean indexing:
s = df['Value'].drop_duplicates()
v = s[s.diff().gt(s.shift())]
df = df[~df['Value'].isin(v)]
print (df)
Value
0 1
1 1
2 1
3 2
4 2
5 2
8 3
9 3
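For reference, a self-contained run of this approach on the original example data:
import pandas as pd

# Rebuild the example from the question.
df = pd.DataFrame({"Value": [1, 1, 1, 2, 2, 2, 8, 8, 3, 3]})

# Unique values in order of appearance; flag those whose increase over the
# previous unique value is larger than that previous value, then drop every
# row carrying a flagged value.
s = df["Value"].drop_duplicates()
v = s[s.diff().gt(s.shift())]
print(df[~df["Value"].isin(v)])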
Maybe:
df2 = df.drop_duplicates()
print(df[df['Value'].isin(df2.loc[~df2['Value'].gt(df2['Value'].shift(-1)), 'Value'].tolist())])
Output:
Value
0 1
1 1
2 1
3 2
4 2
5 2
8 3
9 3
We can check whether the difference is less than or equal to 5 or is NaN. After that, we check for duplicates and keep those rows:
s = df[df['Value'].diff().le(5) | df['Value'].diff().isna()]
s[s.duplicated(keep=False)]
Value
0 1
1 1
2 1
3 2
4 2
5 2
8 3
9 3

Return sorted indexes skipping nan values in pandas?

I have the following data set:
PID,RUN_START_DATE,PUSHUP_START_DATE,SITUP_START_DATE,PULLUP_START_DATE
1,2013-01-24,2013-01-02,,2013-02-03
2,2013-01-30,2013-01-21,2013-01-13,2013-01-06
3,2013-01-29,2013-01-28,2013-01-01,2013-01-29
4,2013-02-16,2013-02-12,2013-01-04,2013-02-11
5,2013-01-06,2013-02-07,2013-02-25,2013-02-12
6,2013-01-26,2013-01-28,2013-02-12,2013-01-10
7,2013-01-26,,2013-01-12,2013-01-30
8,2013-01-03,2013-01-24,2013-01-19,2013-01-02
9,2013-01-22,2013-01-13,2013-02-03,
10,2013-02-06,2013-01-16,2013-02-07,2013-01-11
I know I can use numpy.argsort to return the sorted indexes of the values:
SQ_AL_INDX = numpy.argsort(df_sequence[['RUN_START_DATE', 'PUSHUP_START_DATE', 'SITUP_START_DATE', 'PULLUP_START_DATE']], axis=1)
...returns...
RUN_START_DATE PUSHUP_START_DATE SITUP_START_DATE PULLUP_START_DATE
0 2 1 0 3
1 3 2 1 0
2 2 1 0 3
3 2 3 1 0
4 0 1 3 2
5 3 0 1 2
6 1 2 0 3
7 3 0 2 1
8 3 1 0 2
9 3 1 0 2
But it seems to put pandas.NaT values into the first position. So in this example, where PID == 1, the sort order returned is 2 1 0 3, and index 2 in that result points at a pandas.NaT value.
How can I get the sorted indexes while skipping the pandas.NaT values (e.g., the return index values would be 2 1 np.NaN 3 or 2 1 pandas.NaT 3 or better yet 1 0 2 for PID 1 instead of 2 1 0 3)?
Pass numpy.argsort to the apply method instead of using it directly. This way, NaNs/NaTs persist. For your example:
In [2]: df_sequence[['RUN_START_DATE', 'PUSHUP_START_DATE', 'SITUP_START_DATE', 'PULLUP_START_DATE']].apply(numpy.argsort, axis=1)
Out[2]:
RUN_START_DATE PUSHUP_START_DATE SITUP_START_DATE PULLUP_START_DATE
0 1 0 NaN 2
(etc.)
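If per-row ranks (1 = earliest date) are an acceptable substitute for argsort positions, DataFrame.rank leaves NaT as NaN rather than sorting it first. A hedged sketch on a truncated copy of the sample data:
import io
import pandas as pd

# First two rows of the sample, for brevity.
csv = """PID,RUN_START_DATE,PUSHUP_START_DATE,SITUP_START_DATE,PULLUP_START_DATE
1,2013-01-24,2013-01-02,,2013-02-03
2,2013-01-30,2013-01-21,2013-01-13,2013-01-06
"""

date_cols = ["RUN_START_DATE", "PUSHUP_START_DATE", "SITUP_START_DATE", "PULLUP_START_DATE"]
df_sequence = pd.read_csv(io.StringIO(csv), parse_dates=date_cols)

# Rank each row's dates (1 = earliest); missing dates stay NaN instead of
# being placed first the way the argsort output above places NaT.
print(df_sequence[date_cols].rank(axis=1, method="first"))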
