Python Pandas keep maximum 3 consecutive duplicates
I have this table:
import pandas as pd
list1 = [1,1,2,2,3,3,3,3,4,1,1,1,1,2,2]
df = pd.DataFrame(list1)
df.columns = ['A']
I want to keep at most 3 consecutive duplicates, or keep all of them when there are fewer than 3 (or no) duplicates.
The result should look like this:
list2 = [1,1,2,2,3,3,3,4,1,1,1,2,2]
result = pd.DataFrame(list2)
result.columns = ['A']
Use GroupBy.head with consecutive groups created by comparing shifted values for inequality (Series.ne with Series.shift) and taking the cumulative sum with Series.cumsum:
df1 = df.groupby(df.A.ne(df.A.shift()).cumsum()).head(3)
print (df1)
A
0 1
1 1
2 2
3 2
4 3
5 3
6 3
8 4
9 1
10 1
11 1
13 2
14 2
Detail:
print (df.A.ne(df.A.shift()).cumsum())
0 1
1 1
2 2
3 2
4 3
5 3
6 3
7 3
8 4
9 5
10 5
11 5
12 5
13 6
14 6
Name: A, dtype: int32
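As a quick sanity check (a minimal sketch, assuming df, list2 and result from the question are defined), the filtered values match the expected output; note that the original index is kept, so labels 7 and 12 are simply missing:
df1 = df.groupby(df.A.ne(df.A.shift()).cumsum()).head(3)
# same values as the expected result; only the index labels differ from `result`
assert df1['A'].tolist() == list2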
Let us do:
df[df.groupby(df['A'].diff().ne(0).cumsum())['A'].cumcount() < 3]
A
0 1
1 1
2 2
3 2
4 3
5 3
6 3
8 4
9 1
10 1
11 1
13 2
14 2
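For illustration (a sketch using the question's df), the per-run counter that this solution compares against 3 looks like the following; rows 7 and 12 reach position 3 within their run and are dropped:
counter = df.groupby(df['A'].diff().ne(0).cumsum())['A'].cumcount()
print(counter.tolist())
# [0, 1, 0, 1, 0, 1, 2, 3, 0, 0, 1, 2, 3, 0, 1]
df[counter < 3]   # keeps everything except index labels 7 and 12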
Solving with itertools.groupby, which groups only consecutive duplicates, then slicing the first 3 elements of each group:
import itertools
pd.Series(itertools.chain.from_iterable([*g][:3] for i,g in itertools.groupby(df['A'])))
0 1
1 1
2 2
3 2
4 3
5 3
6 3
7 4
8 1
9 1
10 1
11 2
12 2
dtype: int64
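Note that the itertools version builds a brand new Series, so the original index (with gaps at 7 and 12) is not preserved. If a DataFrame with column 'A' is wanted, one option (a sketch) is to wrap the result with to_frame:
out = pd.Series(itertools.chain.from_iterable(
    [*g][:3] for _, g in itertools.groupby(df['A']))).to_frame('A')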
Related
How to identify one column with continuous number and same value of another column?
I have a DataFrame with two columns A and B. I want to create a new column named C to identify runs of continuous A values that share the same B value. Here's an example:
import pandas as pd
df = pd.DataFrame({'A':[1,2,3,5,6,10,11,12,13,18], 'B':[1,1,2,2,3,3,3,3,4,4]})
I found a similar question, but that method only identifies the continuous A regardless of B:
df['C'] = df['A'].diff().ne(1).cumsum().sub(1)
I have tried to groupby B and apply the function like this:
df['C'] = df.groupby('B').apply(lambda x: x['A'].diff().ne(1).cumsum().sub(1))
However, it doesn't work: TypeError: incompatible index of inserted column with frame index.
The expected output is:
    A  B  C
    1  1  0
    2  1  0
    3  2  1
    5  2  2
    6  3  3
   10  3  4
   11  3  4
   12  3  4
   13  4  5
   18  4  6
Let's create a sequential counter using groupby, diff and cumsum, then factorize to re-encode the counter:
df['C'] = df.groupby('B')['A'].diff().ne(1).cumsum().factorize()[0]
Result:
    A  B  C
0   1  1  0
1   2  1  0
2   3  2  1
3   5  2  2
4   6  3  3
5  10  3  4
6  11  3  4
7  12  3  4
8  13  4  5
9  18  4  6
Use DataFrameGroupBy.diff, compare against 1 with Series.ne, take Series.cumsum and finally subtract 1:
df['C'] = df.groupby('B')['A'].diff().ne(1).cumsum().sub(1)
print (df)
    A  B  C
0   1  1  0
1   2  1  0
2   3  2  1
3   5  2  2
4   6  3  3
5  10  3  4
6  11  3  4
7  12  3  4
8  13  4  5
9  18  4  6
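For what it's worth, the two answers label the groups identically on this data, because the cumulative sum already starts at 1 and increases without gaps; a quick check (a sketch assuming the question's df):
codes_factorize = df.groupby('B')['A'].diff().ne(1).cumsum().factorize()[0]
codes_sub = df.groupby('B')['A'].diff().ne(1).cumsum().sub(1).to_numpy()
assert (codes_factorize == codes_sub).all()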
How to drop duplicates in pandas but keep more than the first
Let's say I have a pandas DataFrame:
import pandas as pd
df = pd.DataFrame({'a': [1,2,2,2,2,1,1,1,2,2]})
>> df
   a
0  1
1  2
2  2
3  2
4  2
5  1
6  1
7  1
8  2
9  2
I want to drop consecutive duplicates once they exceed a certain threshold n, keeping at most n of each run. Let's say that n=3. Then my target dataframe is:
>> df
   a
0  1
1  2
2  2
3  2
5  1
6  1
7  1
8  2
9  2
EDIT: Each set of consecutive repetitions is considered separately. In this example, rows 8 and 9 should be kept.
You can create a unique value for each consecutive group, then use groupby and head:
import numpy as np
group_value = np.cumsum(df.a.shift() != df.a)
df.groupby(group_value).head(3)
# result:
   a
0  1
1  2
2  2
3  2
5  1
6  1
7  1
8  2
9  2
Use boolean indexing with groupby.cumcount:
N = 3
df[df.groupby('a').cumcount().lt(N)]
Output:
   a
0  1
1  2
2  2
3  2
5  1
6  1
Note that cumcount here counts every occurrence of each value, not just consecutive ones, so the second run of 2 (rows 8-9) is dropped.
For the last N:
df[df.groupby('a').cumcount(ascending=False).lt(N)]
To apply it per run of consecutive repetitions instead:
df[df.groupby(df['a'].ne(df['a'].shift()).cumsum()).cumcount().lt(3)]
Output:
   a
0  1
1  2
2  2
3  2
5  1
6  1
7  1  # this is #3 of the local group
8  2
9  2
An advantage of boolean indexing is that you can reuse the mask for many other operations, such as setting values or masking:
group = df['a'].ne(df['a'].shift()).cumsum()
m = df.groupby(group).cumcount().lt(N)
df.where(m)
     a
0  1.0
1  2.0
2  2.0
3  2.0
4  NaN
5  1.0
6  1.0
7  1.0
8  2.0
9  2.0
df.loc[~m] = -1
   a
0  1
1  2
2  2
3  2
4 -1
5  1
6  1
7  1
8  2
9  2
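The same idea generalizes into a small reusable helper (a sketch; the function name is made up here), which also answers the question at the top of the page when called with n=3:
def keep_max_consecutive(df, col, n):
    # keep at most n rows from each run of consecutive equal values in `col`
    run_id = df[col].ne(df[col].shift()).cumsum()
    return df[df.groupby(run_id).cumcount().lt(n)]
keep_max_consecutive(df, 'a', 3)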
Pandas: Get rows with consecutive column values
I need to go through a large DataFrame and select consecutive rows with similar values in a column, i.e. in the DataFrame below, selecting on column x. Say I want consecutive values of 3 and 5 only:
col  row  x  y
  1    1  1  1
  5    7  3  0
  2    2  2  2
  6    3  3  8
  9    2  3  4
  5    3  3  9
  4    9  4  4
  5    5  5  1
  3    7  5  2
  6    6  6  6
  5    8  6  2
  3    7  6  0
The desired output would be:
col  row  x  y  consecutive-count
  6    3  3  8                  1
  9    2  3  4                  1
  5    3  3  9                  1
  5    5  5  1                  2
  3    7  5  2                  2
I tried:
m = df['x'].eq(df['x'].shift())
df[m|m.shift(-1, fill_value=False)]
But that includes the consecutive 6s that I don't want. I also tried:
df.query('x in [3,5]')
That returns every row where x is 3 or 5, whether consecutive or not.
IIUC, use masks for boolean indexing. Check for 3 or 5, and combine a cummax with a reverse cummax so that only positions from the first 3 up to the last 5 are kept:
m1 = df['x'].eq(3)
m2 = df['x'].eq(5)
out = df[(m1|m2)&(m1.cummax()&m2[::-1].cummax())]
Output:
   col  row  x  y
2    6    3  3  8
3    9    2  3  4
4    5    3  3  9
6    5    5  5  1
7    3    7  5  2
You can create a group column for consecutive values, and filter by the group count and the value of x:
# create unique ids for consecutive groups, then get each group's length:
group_num = (df.x.shift() != df.x).cumsum()
group_len = group_num.groupby(group_num).transform("count")
# filter the main df (copy so the new column can be added without a SettingWithCopyWarning):
df2 = df[(df.x.isin([3,5])) & (group_len > 1)].copy()
# add the new group number column
df2['consecutive-count'] = (df2.x != df2.x.shift()).cumsum()
Output:
   col  row  x  y  consecutive-count
3    6    3  3  8                  1
4    9    2  3  4                  1
5    5    3  3  9                  1
7    5    5  5  1                  2
8    3    7  5  2                  2
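A parameterized variant of the same approach (a sketch; the function name is made up), for arbitrary target values and a minimum run length:
def consecutive_value_rows(df, col, values, min_run=2):
    # rows whose value in `col` is in `values` and belongs to a run of at least `min_run`
    run_id = df[col].ne(df[col].shift()).cumsum()
    run_len = df.groupby(run_id)[col].transform('size')
    out = df[df[col].isin(values) & run_len.ge(min_run)].copy()
    out['consecutive-count'] = out[col].ne(out[col].shift()).cumsum()
    return out
consecutive_value_rows(df, 'x', [3, 5])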
Sort a subset of columns of a pandas dataframe alphabetically by column name
I'm having trouble finding the solution to a fairly simple problem. I would like to alphabetically arrange certain columns of a pandas dataframe that has over 100 columns (i.e. so many that I don't want to list them manually).
Example df:
import pandas as pd
subject = [1,1,1,1,1,1,2,2,2,2,2,2,3,3,3,4,4,4,4,4,4]
timepoint = [1,2,3,4,5,6,1,2,3,4,5,6,1,2,4,1,2,3,4,5,6]
c = [2,3,4,5,6,7,3,4,1,2,3,4,5,4,5,8,4,5,6,2,3]
d = [2,3,4,5,6,7,3,4,1,2,3,4,5,4,5,8,4,5,6,2,3]
a = [2,3,4,5,6,7,3,4,1,2,3,4,5,4,5,8,4,5,6,2,3]
b = [2,3,4,5,6,7,3,4,1,2,3,4,5,4,5,8,4,5,6,2,3]
df = pd.DataFrame({'subject':subject, 'timepoint':timepoint, 'c':c, 'd':d, 'a':a, 'b':b})
df.head()
   subject  timepoint  c  d  a  b
0        1          1  2  2  2  2
1        1          2  3  3  3  3
2        1          3  4  4  4  4
3        1          4  5  5  5  5
4        1          5  6  6  6  6
How could I rearrange the column names to generate a df.head() that looks like this:
   subject  timepoint  a  b  c  d
0        1          1  2  2  2  2
1        1          2  3  3  3  3
2        1          3  4  4  4  4
3        1          4  5  5  5  5
4        1          5  6  6  6  6
i.e. keep the first two columns where they are and then alphabetically arrange the remaining column names. Thanks in advance.
You can split your dataframe based on column names using the normal indexing operator [], sort the other columns alphabetically with sort_index(axis=1), and concat the two parts back together:
>>> pd.concat([df[['subject','timepoint']],
...            df[df.columns.difference(['subject', 'timepoint'])].sort_index(axis=1)],
...           ignore_index=False, axis=1)
    subject  timepoint  a  b  c  d
0         1          1  2  2  2  2
1         1          2  3  3  3  3
2         1          3  4  4  4  4
3         1          4  5  5  5  5
4         1          5  6  6  6  6
5         1          6  7  7  7  7
6         2          1  3  3  3  3
7         2          2  4  4  4  4
8         2          3  1  1  1  1
9         2          4  2  2  2  2
10        2          5  3  3  3  3
11        2          6  4  4  4  4
12        3          1  5  5  5  5
13        3          2  4  4  4  4
14        3          4  5  5  5  5
15        4          1  8  8  8  8
16        4          2  4  4  4  4
17        4          3  5  5  5  5
18        4          4  6  6  6  6
19        4          5  2  2  2  2
20        4          6  3  3  3  3
Specify the first two columns you want to keep (or determine them from the data), then sort all of the other columns. Use .loc with the resulting list to "sort" the DataFrame:
import numpy as np
first_cols = ['subject', 'timepoint']
#first_cols = df.columns[0:2].tolist()  # OR determine the first two from the data
other_cols = np.sort(df.columns.difference(first_cols)).tolist()
df = df.loc[:, first_cols+other_cols]
print(df.head())
   subject  timepoint  a  b  c  d
0        1          1  2  2  2  2
1        1          2  3  3  3  3
2        1          3  4  4  4  4
3        1          4  5  5  5  5
4        1          5  6  6  6  6
You can try getting the dataframe columns as a list, rearranging them, and assigning the result back to the dataframe using df = df[cols]:
import pandas as pd
subject = [1,1,1,1,1,1,2,2,2,2,2,2,3,3,3,4,4,4,4,4,4]
timepoint = [1,2,3,4,5,6,1,2,3,4,5,6,1,2,4,1,2,3,4,5,6]
c = [2,3,4,5,6,7,3,4,1,2,3,4,5,4,5,8,4,5,6,2,3]
d = [2,3,4,5,6,7,3,4,1,2,3,4,5,4,5,8,4,5,6,2,3]
a = [2,3,4,5,6,7,3,4,1,2,3,4,5,4,5,8,4,5,6,2,3]
b = [2,3,4,5,6,7,3,4,1,2,3,4,5,4,5,8,4,5,6,2,3]
df = pd.DataFrame({'subject':subject, 'timepoint':timepoint, 'c':c, 'd':d, 'a':a, 'b':b})
cols = df.columns.tolist()
cols = cols[:2] + sorted(cols[2:])
df = df[cols]
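An equivalent variant (a sketch, assuming the first two columns are the ones to keep in place) uses reindex; note that columns.difference already returns a sorted Index, so no explicit sort is needed:
fixed = list(df.columns[:2])
df = df.reindex(columns=fixed + list(df.columns.difference(fixed)))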
Python dataframe find index of top-5, then index into another column
I have a dataframe with two numeric columns, A & B. I want to find the top 5 values from col A and return the values from Col B held in the location of those top 5. Many thanks.
I think you need DataFrame.nlargest with column A for the top 5 rows and then select column B:
df = pd.DataFrame({'A':[4,5,26,43,54,36,18,7,8,9], 'B':range(10)})
print (df)
    A  B
0   4  0
1   5  1
2  26  2
3  43  3
4  54  4
5  36  5
6  18  6
7   7  7
8   8  8
9   9  9
print (df.nlargest(5, 'A'))
    A  B
4  54  4
3  43  3
5  36  5
2  26  2
6  18  6
a = df.nlargest(5, 'A')['B']
print (a)
4    4
3    3
5    5
2    2
6    6
Name: B, dtype: int64
Alternative solution with sorting:
a = df.sort_values('A', ascending=False)['B'].head(5)
print (a)
4    4
3    3
5    5
2    2
6    6
Name: B, dtype: int64
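Doing it literally as the question describes (a sketch): take the index of the top 5 values in A, then use it to index into B:
top_idx = df['A'].nlargest(5).index
b_vals = df.loc[top_idx, 'B']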
The nlargest function on the dataframe will do your work: df.nlargest(<number of rows>, 'column_to_sort')
import pandas as pd
df = pd.DataFrame({'A':[1,1,1,2,2,2,2,3,4],'B':[1,2,3,1,2,3,4,1,1]})
df.nlargest(5,'B')
Out[13]:
   A  B
6  2  4
2  1  3
5  2  3
1  1  2
4  2  2
# if you want only a certain column in the output, then use
df.nlargest(5,'B')['A']
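One extra detail worth knowing (a sketch): by default nlargest keeps only the first rows when values tie at the cutoff; pass keep='all' to keep every tied row.
df.nlargest(5, 'B', keep='all')   # may return more than 5 rows when there are ties at the boundary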