Search for duplicated consecutive rows and put the result in an additional column - pandas / Python

I have a df:
df1
a b c d
0 2 4 1
0 2 5 1
0 1 6 2
1 2 7 2
1 1 8 1
1 1 4 1
I need to group by a and b, and if two consecutive values of d equal 1 within a group, I want the second row's c placed in a new column next to the first row, like:
df1
a b c d c1
0 2 4 1 5
0 1 6 2 nan
1 2 7 2 nan
1 1 8 1 4
Any ideas?
I tried
df1.groupby([df1.a, df1.b, df1.d.diff().ne(0)])
then selecting only the rows with 1s via .loc and merging the two dataframes again, but the first step is not completely correct.
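A minimal sketch of one possible approach (not from the original thread), assuming only pairs of consecutive d == 1 rows need to be collapsed; the pairing logic and the column name c1 are illustrative:

import pandas as pd

df1 = pd.DataFrame({'a': [0, 0, 0, 1, 1, 1],
                    'b': [2, 2, 1, 2, 1, 1],
                    'c': [4, 5, 6, 7, 8, 4],
                    'd': [1, 1, 2, 2, 1, 1]})

g = df1.groupby(['a', 'b'])

# A row "starts a pair" if it and the next row in the same group both have d == 1.
pair_start = df1['d'].eq(1) & g['d'].shift(-1).eq(1)

# For pair-starting rows, pull the next row's c into c1.
df1['c1'] = g['c'].shift(-1).where(pair_start)

# Drop the second row of each pair (the row whose predecessor started a pair).
consumed = pair_start.groupby([df1['a'], df1['b']]).shift(1, fill_value=False).astype(bool)
out = df1[~consumed].reset_index(drop=True)
print(out)
#    a  b  c  d   c1
# 0  0  2  4  1  5.0
# 1  0  1  6  2  NaN
# 2  1  2  7  2  NaN
# 3  1  1  8  1  4.0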

Related

Create a new column using the previous row of another column, based on group [duplicate]

I have a Pandas dataframe, and I want to create a new column whose values are that of another column, shifted down by one row. The last row should show NaN.
The catch is that I want to do this by group, with the last row of each group showing NaN. NOT have the last row of a group "steal" a value from a group that happens to be adjacent in the dataframe.
My attempted implementation is quite shamefully broken, so I'm clearly misunderstanding something fundamental.
df['B_shifted'] = df.groupby(['A'])['B'].transform(lambda x:x.values[1:])
Newer versions of pandas can now perform a shift on a group:
df['B_shifted'] = df.groupby(['A'])['B'].shift(1)
Note that when shifting down, it's the first row that has NaN.
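For illustration, a small made-up frame (column names A and B as above) showing that the first row of each group becomes NaN:

import pandas as pd

df = pd.DataFrame({'A': ['x', 'x', 'x', 'y', 'y'],
                   'B': [1, 2, 3, 4, 5]})

# Shift B down by one row within each A group; the first row of each
# group has nothing above it, so it becomes NaN.
df['B_shifted'] = df.groupby('A')['B'].shift(1)
print(df)
#    A  B  B_shifted
# 0  x  1        NaN
# 1  x  2        1.0
# 2  x  3        2.0
# 3  y  4        NaN
# 4  y  5        4.0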
Shift works on the output of the groupby clause:
>>> import numpy, pandas
>>> df = pandas.DataFrame(numpy.random.randint(1, 3, (10, 5)), columns=['a','b','c','d','e'])
>>> df
a b c d e
0 2 1 2 1 1
1 2 1 1 1 1
2 1 2 2 1 2
3 1 2 1 1 2
4 2 2 1 1 2
5 2 2 2 2 1
6 2 2 1 1 1
7 2 2 2 1 1
8 2 2 2 2 1
9 2 2 2 2 1
for k, v in df.groupby('a'):
    print(k)
    print('normal')
    print(v)
    print('shifted')
    print(v.shift(1))
1
normal
a b c d e
2 1 2 2 1 2
3 1 2 1 1 2
shifted
a b c d e
2 NaN NaN NaN NaN NaN
3 1 2 2 1 2
2
normal
a b c d e
0 2 1 2 1 1
1 2 1 1 1 1
4 2 2 1 1 2
5 2 2 2 2 1
6 2 2 1 1 1
7 2 2 2 1 1
8 2 2 2 2 1
9 2 2 2 2 1
shifted
a b c d e
0 NaN NaN NaN NaN NaN
1 2 1 2 1 1
4 2 1 1 1 1
5 2 2 1 1 2
6 2 2 2 2 1
7 2 2 1 1 1
8 2 2 2 1 1
9 2 2 2 2 1
@EdChum's comment is a better answer to this question, so I'm posting it here for posterity:
df['B_shifted'] = df.groupby(['A'])['B'].transform(lambda x:x.shift())
or similarly
df['B_shifted'] = df.groupby(['A'])['B'].transform('shift')
The former notation is more flexible, of course (e.g. if you want to shift by 2).
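As a small illustration of that flexibility (reusing the same made-up A/B frame as in the sketch above):

import pandas as pd

df = pd.DataFrame({'A': ['x', 'x', 'x', 'y', 'y'], 'B': [1, 2, 3, 4, 5]})

# The lambda form accepts any offset, e.g. shift by two rows within each group.
df['B_shifted2'] = df.groupby(['A'])['B'].transform(lambda x: x.shift(2))
print(df)
#    A  B  B_shifted2
# 0  x  1         NaN
# 1  x  2         NaN
# 2  x  3         1.0
# 3  y  4         NaN
# 4  y  5         NaN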

Pandas drop duplicates based on 2 columns, having different values

How do I drop duplicates in this specific way:
Index B C
1 2 1
2 2 0
3 3 1
4 3 1
5 4 0
6 4 0
7 4 0
8 5 1
9 5 0
10 5 1
Desired output :
Index B C
3 3 1
5 4 0
So I want to drop duplicates on B, but keep one sample/record only if C is the same on all rows of the group.
For example, B = 3 for index 3/4: since C = 1 for both, I do not drop them all, I keep one.
But B = 5 for index 8/9/10: since C is a mix of 1 and 0, the whole group gets dropped.
Try this, using transform with nunique and drop_duplicates:
df[df.groupby('B')['C'].transform('nunique') == 1].drop_duplicates(subset='B')
Output:
B C
Index
3 3 1
5 4 0
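For clarity, here is a small runnable walk-through of that expression on the sample data (the Index column is rebuilt as the frame's index here):

import pandas as pd

df = pd.DataFrame({'B': [2, 2, 3, 3, 4, 4, 4, 5, 5, 5],
                   'C': [1, 0, 1, 1, 0, 0, 0, 1, 0, 1]},
                  index=range(1, 11))

# Number of distinct C values per B group; a group where this is 1
# has the same C on every row.
n_unique = df.groupby('B')['C'].transform('nunique')
print(n_unique.tolist())  # [2, 2, 1, 1, 1, 1, 1, 2, 2, 2]

# Keep only the "same C everywhere" groups, then collapse each to one row.
out = df[n_unique == 1].drop_duplicates(subset='B')
print(out)
#    B  C
# 3  3  1
# 5  4  0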

Count unique values for each group in multi column with criteria in Pandas

UPDATED THE SAMPLE DATASET
I have the following data:
location ID Value
A 1 1
A 1 1
A 1 1
A 1 1
A 1 2
A 1 2
A 1 2
A 1 2
A 1 3
A 1 4
A 2 1
A 2 2
A 3 1
A 3 2
B 4 1
B 4 2
B 5 1
B 5 1
B 5 2
B 5 2
B 6 1
B 6 1
B 6 1
B 6 1
B 6 1
B 6 2
B 6 2
B 6 2
B 7 1
I want to count unique Values (only where the value equals 1 or 2) for each location and for each ID, to get the following output:
location ID_Count Value_Count
A 3 6
B 4 7
I tried df.groupby(['location'])[['ID', 'Value']].nunique(), but I am getting only the unique count of values; for example, I get a Value count of 4 for A and 2 for B.
Try agg, slicing ID on the True values.
For your updated sample, you just need to drop duplicates before processing. The rest is the same:
df = df.drop_duplicates(['location', 'ID', 'Value'])
df_agg = (df.Value.isin([1, 2]).groupby(df.location)
            .agg(ID_count=lambda x: df.loc[x[x].index, 'ID'].nunique(),
                 Value_count='sum'))
Out[93]:
ID_count Value_count
location
A 3 6
B 4 7
IIUC, you can try Series.isin with groupby.agg:
out = (df.assign(Value_Count=df['Value'].isin([1, 2]))
         .groupby("location", as_index=False)
         .agg({"ID": 'nunique', "Value_Count": 'sum'}))
print(out)
location ID Value_Count
0 A 3 6.0
1 B 4 7.0
Roughly the same as anky's, but using Series.where and named aggregation so we can rename the columns while creating them in the groupby:
grp = df.assign(Value=df['Value'].where(df['Value'].isin([1, 2]))).groupby('location')
grp.agg(
    ID_count=('ID', 'nunique'),
    Value_count=('Value', 'count')
).reset_index()
location ID_count Value_count
0 A 3 6
1 B 4 7
Let's try a very similar approach to the other answers, but this time we filter first:
(df[df['Value'].isin([1, 2])]
   .groupby(['location'], as_index=False)
   .agg({'ID': 'nunique', 'Value': 'size'})
)
Output:
location ID Value
0 A 3 6
1 B 4 7

Concatenate groups of multiple dataframes

I have a df1:
a b c
1 0 1 4
2 0 2 5
3 1 1 3
and a second df2:
a b c
1 0 1 5
2 0 2 5
3 1 1 4
These df's have the same groups in a and b. Within each group of 'a' and 'b' I want df2's rows underneath df1's:
a b c
1 0 1 4
2 0 1 5
3 0 2 5
4 0 2 5
5 1 1 3
6 1 1 4
How can I combine groupby() and concat() to get the desired output?
You can do concat and then sort_values:
df = pd.concat([df1, df2]).sort_values(['a', 'b']).reset_index(drop=True)
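A small runnable sketch of that approach on the sample frames; kind='stable' is added here (an assumption, not part of the original answer) to make explicit that df1's rows stay ahead of df2's within each (a, b) group:

import pandas as pd

df1 = pd.DataFrame({'a': [0, 0, 1], 'b': [1, 2, 1], 'c': [4, 5, 3]})
df2 = pd.DataFrame({'a': [0, 0, 1], 'b': [1, 2, 1], 'c': [5, 5, 4]})

# Stack the frames, then sort by the group keys; a stable sort keeps
# the concatenation order (df1 before df2) inside each group.
out = (pd.concat([df1, df2])
         .sort_values(['a', 'b'], kind='stable')
         .reset_index(drop=True))
print(out)
#    a  b  c
# 0  0  1  4
# 1  0  1  5
# 2  0  2  5
# 3  0  2  5
# 4  1  1  3
# 5  1  1  4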

Is it possible to obtain groupby style counts without collapsing Pandas DataFrame?

I have a DataFrame with 9 columns, and I'm trying to add a column of counts of unique values based on the first 3 columns (e.g. columns A, B, and C must match to count as a unique value, but the remaining columns can vary). I attempted to do this with groupby:
df = pd.DataFrame(resultsFile500.groupby(['chr','start','end']).size().reset_index().rename(columns={0:'count'}))
This returns a DataFrame with 5 columns, and the counts are what I want. However, I also need values from the original data frame, so what I have been trying to do is somehow get those values of counts as a column in the original df. So, this would mean that if two rows in columns chr, start, and end, had identical values, the counts column would be 2 in both rows, but they would not be collapsed to one row. Is there an easy solution here that I'm missing, or do I need to hack something together?
You can use .transform to get non-collapsing behavior:
>>> df
a b c d e
0 3 4 1 3 0
1 3 1 4 3 0
2 4 3 3 2 1
3 3 4 1 4 0
4 0 4 3 3 2
5 1 2 0 4 1
6 3 1 4 2 1
7 0 4 3 4 0
8 1 3 0 1 1
9 3 4 1 2 1
>>> df.groupby(['a','b','c']).transform('count')
d e
0 3 3
1 2 2
2 1 1
3 3 3
4 2 2
5 1 1
6 2 2
7 2 2
8 1 1
9 3 3
>>>
Note, I'll have to choose an arbitrary column from the .transform result, but then just do:
>>> df['unique_count'] = df.groupby(['a','b','c']).transform('count')['d']
>>> df
a b c d e unique_count
0 3 4 1 3 0 3
1 3 1 4 3 0 2
2 4 3 3 2 1 1
3 3 4 1 4 0 3
4 0 4 3 3 2 2
5 1 2 0 4 1 1
6 3 1 4 2 1 2
7 0 4 3 4 0 2
8 1 3 0 1 1 1
9 3 4 1 2 1 3
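A slightly more direct variant (my suggestion, not part of the original answer) is to select a single column before calling .transform, so there is no arbitrary column to pick afterwards; 'd' below is just an example choice:

import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.randint(1, 5, (10, 5)), columns=list('abcde'))

# Transforming one column directly yields a single Series of group sizes,
# so nothing has to be picked out of a wider transform result.
df['unique_count'] = df.groupby(['a', 'b', 'c'])['d'].transform('count')
print(df)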
