Python Pandas keep maximum 3 consecutive duplicates
I have this table:
import pandas as pd
list1 = [1,1,2,2,3,3,3,3,4,1,1,1,1,2,2]
df = pd.DataFrame(list1)
df.columns = ['A']
I want to keep at most 3 consecutive duplicates, or keep all of them when there are fewer than 3 (or no) duplicates.
The result should look like this:
list2 = [1,1,2,2,3,3,3,4,1,1,1,2,2]
result = pd.DataFrame(list2)
result.columns = ['A']
Use GroupBy.head with consecutive groups created by comparing shifted values for inequality (Series.ne with Series.shift) and taking the cumulative sum with Series.cumsum:
df1 = df.groupby(df.A.ne(df.A.shift()).cumsum()).head(3)
print (df1)
A
0 1
1 1
2 2
3 2
4 3
5 3
6 3
8 4
9 1
10 1
11 1
13 2
14 2
Detail:
print (df.A.ne(df.A.shift()).cumsum())
0 1
1 1
2 2
3 2
4 3
5 3
6 3
7 3
8 4
9 5
10 5
11 5
12 5
13 6
14 6
Name: A, dtype: int32
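As a quick sanity check (a minimal sketch, assuming df, list2 and result from the question are defined), the filtered values match the expected output; note that the original index is kept, so labels 7 and 12 are simply missing:
df1 = df.groupby(df.A.ne(df.A.shift()).cumsum()).head(3)
# same values as the expected result; only the index labels differ from `result`
assert df1['A'].tolist() == list2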
Let us do:
df[df.groupby(df['A'].diff().ne(0).cumsum())['A'].cumcount() < 3]
A
0 1
1 1
2 2
3 2
4 3
5 3
6 3
8 4
9 1
10 1
11 1
13 2
14 2
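For illustration (a sketch using the question's df), the per-run counter that this solution compares against 3 looks like the following; rows 7 and 12 reach position 3 within their run and are dropped:
counter = df.groupby(df['A'].diff().ne(0).cumsum())['A'].cumcount()
print(counter.tolist())
# [0, 1, 0, 1, 0, 1, 2, 3, 0, 0, 1, 2, 3, 0, 1]
df[counter < 3]   # keeps everything except index labels 7 and 12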
Solving with itertools.groupby, which groups only consecutive duplicates, then slicing the first 3 elements of each group:
import itertools
pd.Series(itertools.chain.from_iterable([*g][:3] for i,g in itertools.groupby(df['A'])))
0 1
1 1
2 2
3 2
4 3
5 3
6 3
7 4
8 1
9 1
10 1
11 2
12 2
dtype: int64
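Note that the itertools version builds a brand new Series, so the original index (with gaps at 7 and 12) is not preserved. If a DataFrame with column 'A' is wanted, one option (a sketch) is to wrap the result with to_frame:
out = pd.Series(itertools.chain.from_iterable(
    [*g][:3] for _, g in itertools.groupby(df['A']))).to_frame('A')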
Related
How to identify one column with continuous number and same value of another column?
I have a DataFrame with two columns A and B. I want to create a new column named C to identify runs of continuous A values that share the same B value. Here's an example:
import pandas as pd
df = pd.DataFrame({'A':[1,2,3,5,6,10,11,12,13,18], 'B':[1,1,2,2,3,3,3,3,4,4]})
I found a similar question, but that method only identifies the continuous A regardless of B:
df['C'] = df['A'].diff().ne(1).cumsum().sub(1)
I have tried to groupby B and apply the function like this:
df['C'] = df.groupby('B').apply(lambda x: x['A'].diff().ne(1).cumsum().sub(1))
However, it doesn't work: TypeError: incompatible index of inserted column with frame index.
The expected output is:
    A  B  C
    1  1  0
    2  1  0
    3  2  1
    5  2  2
    6  3  3
   10  3  4
   11  3  4
   12  3  4
   13  4  5
   18  4  6
Let's create a sequential counter using groupby, diff and cumsum, then factorize to re-encode the counter:
df['C'] = df.groupby('B')['A'].diff().ne(1).cumsum().factorize()[0]
Result:
    A  B  C
0   1  1  0
1   2  1  0
2   3  2  1
3   5  2  2
4   6  3  3
5  10  3  4
6  11  3  4
7  12  3  4
8  13  4  5
9  18  4  6
Use DataFrameGroupBy.diff, compare against 1 with Series.ne, take Series.cumsum and finally subtract 1:
df['C'] = df.groupby('B')['A'].diff().ne(1).cumsum().sub(1)
print (df)
    A  B  C
0   1  1  0
1   2  1  0
2   3  2  1
3   5  2  2
4   6  3  3
5  10  3  4
6  11  3  4
7  12  3  4
8  13  4  5
9  18  4  6
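For what it's worth, the two answers label the groups identically on this data, because the cumulative sum already starts at 1 and increases without gaps; a quick check (a sketch assuming the question's df):
codes_factorize = df.groupby('B')['A'].diff().ne(1).cumsum().factorize()[0]
codes_sub = df.groupby('B')['A'].diff().ne(1).cumsum().sub(1).to_numpy()
assert (codes_factorize == codes_sub).all()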
How to drop duplicates in pandas but keep more than the first
Let's say I have a pandas DataFrame:
import pandas as pd
df = pd.DataFrame({'a': [1,2,2,2,2,1,1,1,2,2]})
>> df
   a
0  1
1  2
2  2
3  2
4  2
5  1
6  1
7  1
8  2
9  2
I want to drop consecutive duplicates once they exceed a certain threshold n, keeping at most n of each run. Let's say that n=3. Then my target dataframe is:
>> df
   a
0  1
1  2
2  2
3  2
5  1
6  1
7  1
8  2
9  2
EDIT: Each set of consecutive repetitions is considered separately. In this example, rows 8 and 9 should be kept.
You can create a unique value for each consecutive group, then use groupby and head:
import numpy as np
group_value = np.cumsum(df.a.shift() != df.a)
df.groupby(group_value).head(3)
# result:
   a
0  1
1  2
2  2
3  2
5  1
6  1
7  1
8  2
9  2
Use boolean indexing with groupby.cumcount:
N = 3
df[df.groupby('a').cumcount().lt(N)]
Output:
   a
0  1
1  2
2  2
3  2
5  1
6  1
Note that cumcount here counts every occurrence of each value, not just consecutive ones, so the second run of 2 (rows 8-9) is dropped.
For the last N:
df[df.groupby('a').cumcount(ascending=False).lt(N)]
To apply it per run of consecutive repetitions instead:
df[df.groupby(df['a'].ne(df['a'].shift()).cumsum()).cumcount().lt(3)]
Output:
   a
0  1
1  2
2  2
3  2
5  1
6  1
7  1  # this is #3 of the local group
8  2
9  2
An advantage of boolean indexing is that you can reuse the mask for many other operations, such as setting values or masking:
group = df['a'].ne(df['a'].shift()).cumsum()
m = df.groupby(group).cumcount().lt(N)
df.where(m)
     a
0  1.0
1  2.0
2  2.0
3  2.0
4  NaN
5  1.0
6  1.0
7  1.0
8  2.0
9  2.0
df.loc[~m] = -1
   a
0  1
1  2
2  2
3  2
4 -1
5  1
6  1
7  1
8  2
9  2
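The same idea generalizes into a small reusable helper (a sketch; the function name is made up here), which also answers the question at the top of the page when called with n=3:
def keep_max_consecutive(df, col, n):
    # keep at most n rows from each run of consecutive equal values in `col`
    run_id = df[col].ne(df[col].shift()).cumsum()
    return df[df.groupby(run_id).cumcount().lt(n)]
keep_max_consecutive(df, 'a', 3)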
Pandas: Get rows with consecutive column values
I need to go through a large DataFrame and select consecutive rows with similar values in a column, i.e. in the DataFrame below, selecting on column x. Say I want consecutive values of 3 and 5 only:
col  row  x  y
  1    1  1  1
  5    7  3  0
  2    2  2  2
  6    3  3  8
  9    2  3  4
  5    3  3  9
  4    9  4  4
  5    5  5  1
  3    7  5  2
  6    6  6  6
  5    8  6  2
  3    7  6  0
The desired output would be:
col  row  x  y  consecutive-count
  6    3  3  8                  1
  9    2  3  4                  1
  5    3  3  9                  1
  5    5  5  1                  2
  3    7  5  2                  2
I tried:
m = df['x'].eq(df['x'].shift())
df[m|m.shift(-1, fill_value=False)]
But that includes the consecutive 6s that I don't want. I also tried:
df.query('x in [3,5]')
That returns every row where x is 3 or 5, whether consecutive or not.
IIUC, use masks for boolean indexing. Check for 3 or 5, and combine a cummax with a reverse cummax so that only positions from the first 3 up to the last 5 are kept:
m1 = df['x'].eq(3)
m2 = df['x'].eq(5)
out = df[(m1|m2)&(m1.cummax()&m2[::-1].cummax())]
Output:
   col  row  x  y
2    6    3  3  8
3    9    2  3  4
4    5    3  3  9
6    5    5  5  1
7    3    7  5  2
You can create a group column for consecutive values, and filter by the group count and the value of x:
# create unique ids for consecutive groups, then get each group's length:
group_num = (df.x.shift() != df.x).cumsum()
group_len = group_num.groupby(group_num).transform("count")
# filter the main df (copy so the new column can be added without a SettingWithCopyWarning):
df2 = df[(df.x.isin([3,5])) & (group_len > 1)].copy()
# add the new group number column
df2['consecutive-count'] = (df2.x != df2.x.shift()).cumsum()
Output:
   col  row  x  y  consecutive-count
3    6    3  3  8                  1
4    9    2  3  4                  1
5    5    3  3  9                  1
7    5    5  5  1                  2
8    3    7  5  2                  2
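A parameterized variant of the same approach (a sketch; the function name is made up), for arbitrary target values and a minimum run length:
def consecutive_value_rows(df, col, values, min_run=2):
    # rows whose value in `col` is in `values` and belongs to a run of at least `min_run`
    run_id = df[col].ne(df[col].shift()).cumsum()
    run_len = df.groupby(run_id)[col].transform('size')
    out = df[df[col].isin(values) & run_len.ge(min_run)].copy()
    out['consecutive-count'] = out[col].ne(out[col].shift()).cumsum()
    return out
consecutive_value_rows(df, 'x', [3, 5])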
Sort a subset of columns of a pandas dataframe alphabetically by column name
I'm having trouble finding the solution to a fairly simple problem. I would like to alphabetically arrange certain columns of a pandas dataframe that has over 100 columns (i.e. so many that I don't want to list them manually).
Example df:
import pandas as pd
subject = [1,1,1,1,1,1,2,2,2,2,2,2,3,3,3,4,4,4,4,4,4]
timepoint = [1,2,3,4,5,6,1,2,3,4,5,6,1,2,4,1,2,3,4,5,6]
c = [2,3,4,5,6,7,3,4,1,2,3,4,5,4,5,8,4,5,6,2,3]
d = [2,3,4,5,6,7,3,4,1,2,3,4,5,4,5,8,4,5,6,2,3]
a = [2,3,4,5,6,7,3,4,1,2,3,4,5,4,5,8,4,5,6,2,3]
b = [2,3,4,5,6,7,3,4,1,2,3,4,5,4,5,8,4,5,6,2,3]
df = pd.DataFrame({'subject':subject, 'timepoint':timepoint, 'c':c, 'd':d, 'a':a, 'b':b})
df.head()
   subject  timepoint  c  d  a  b
0        1          1  2  2  2  2
1        1          2  3  3  3  3
2        1          3  4  4  4  4
3        1          4  5  5  5  5
4        1          5  6  6  6  6
How could I rearrange the column names to generate a df.head() that looks like this:
   subject  timepoint  a  b  c  d
0        1          1  2  2  2  2
1        1          2  3  3  3  3
2        1          3  4  4  4  4
3        1          4  5  5  5  5
4        1          5  6  6  6  6
i.e. keep the first two columns where they are and then alphabetically arrange the remaining column names. Thanks in advance.
You can split your dataframe based on column names using the normal indexing operator [], sort the other columns alphabetically with sort_index(axis=1), and concat the two parts back together:
>>> pd.concat([df[['subject','timepoint']],
...            df[df.columns.difference(['subject', 'timepoint'])].sort_index(axis=1)],
...           ignore_index=False, axis=1)
    subject  timepoint  a  b  c  d
0         1          1  2  2  2  2
1         1          2  3  3  3  3
2         1          3  4  4  4  4
3         1          4  5  5  5  5
4         1          5  6  6  6  6
5         1          6  7  7  7  7
6         2          1  3  3  3  3
7         2          2  4  4  4  4
8         2          3  1  1  1  1
9         2          4  2  2  2  2
10        2          5  3  3  3  3
11        2          6  4  4  4  4
12        3          1  5  5  5  5
13        3          2  4  4  4  4
14        3          4  5  5  5  5
15        4          1  8  8  8  8
16        4          2  4  4  4  4
17        4          3  5  5  5  5
18        4          4  6  6  6  6
19        4          5  2  2  2  2
20        4          6  3  3  3  3
Specify the first two columns you want to keep (or determine them from the data), then sort all of the other columns. Use .loc with the resulting list to "sort" the DataFrame:
import numpy as np
first_cols = ['subject', 'timepoint']
#first_cols = df.columns[0:2].tolist()  # OR determine the first two from the data
other_cols = np.sort(df.columns.difference(first_cols)).tolist()
df = df.loc[:, first_cols+other_cols]
print(df.head())
   subject  timepoint  a  b  c  d
0        1          1  2  2  2  2
1        1          2  3  3  3  3
2        1          3  4  4  4  4
3        1          4  5  5  5  5
4        1          5  6  6  6  6
You can try getting the dataframe columns as a list, rearranging them, and assigning the result back to the dataframe using df = df[cols]:
import pandas as pd
subject = [1,1,1,1,1,1,2,2,2,2,2,2,3,3,3,4,4,4,4,4,4]
timepoint = [1,2,3,4,5,6,1,2,3,4,5,6,1,2,4,1,2,3,4,5,6]
c = [2,3,4,5,6,7,3,4,1,2,3,4,5,4,5,8,4,5,6,2,3]
d = [2,3,4,5,6,7,3,4,1,2,3,4,5,4,5,8,4,5,6,2,3]
a = [2,3,4,5,6,7,3,4,1,2,3,4,5,4,5,8,4,5,6,2,3]
b = [2,3,4,5,6,7,3,4,1,2,3,4,5,4,5,8,4,5,6,2,3]
df = pd.DataFrame({'subject':subject, 'timepoint':timepoint, 'c':c, 'd':d, 'a':a, 'b':b})
cols = df.columns.tolist()
cols = cols[:2] + sorted(cols[2:])
df = df[cols]
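An equivalent variant (a sketch, assuming the first two columns are the ones to keep in place) uses reindex; note that columns.difference already returns a sorted Index, so no explicit sort is needed:
fixed = list(df.columns[:2])
df = df.reindex(columns=fixed + list(df.columns.difference(fixed)))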
Python dataframe find index of top-5, then index into another column
I have a dataframe with two numeric columns, A & B. I want to find the top 5 values from col A and return the values from Col B held in the location of those top 5. Many thanks.
I think you need DataFrame.nlargest with column A for the top 5 rows and then select column B:
df = pd.DataFrame({'A':[4,5,26,43,54,36,18,7,8,9], 'B':range(10)})
print (df)
    A  B
0   4  0
1   5  1
2  26  2
3  43  3
4  54  4
5  36  5
6  18  6
7   7  7
8   8  8
9   9  9
print (df.nlargest(5, 'A'))
    A  B
4  54  4
3  43  3
5  36  5
2  26  2
6  18  6
a = df.nlargest(5, 'A')['B']
print (a)
4    4
3    3
5    5
2    2
6    6
Name: B, dtype: int64
Alternative solution with sorting:
a = df.sort_values('A', ascending=False)['B'].head(5)
print (a)
4    4
3    3
5    5
2    2
6    6
Name: B, dtype: int64
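Doing it literally as the question describes (a sketch): take the index of the top 5 values in A, then use it to index into B:
top_idx = df['A'].nlargest(5).index
b_vals = df.loc[top_idx, 'B']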
The nlargest function on the dataframe will do your work: df.nlargest(<number of rows>, 'column_to_sort')
import pandas as pd
df = pd.DataFrame({'A':[1,1,1,2,2,2,2,3,4],'B':[1,2,3,1,2,3,4,1,1]})
df.nlargest(5,'B')
Out[13]:
   A  B
6  2  4
2  1  3
5  2  3
1  1  2
4  2  2
# if you want only a certain column in the output, then use
df.nlargest(5,'B')['A']
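One extra detail worth knowing (a sketch): by default nlargest keeps only the first rows when values tie at the cutoff; pass keep='all' to keep every tied row.
df.nlargest(5, 'B', keep='all')   # may return more than 5 rows when there are ties at the boundary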