Remove duplicates after a certain number of occurrences - python

How do we filter the dataframe below to remove all duplicate ID rows after a certain number of occurrences of that ID? I.e. remove all rows with ID == 0 after the 3rd occurrence of ID == 0.
Thanks
import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.randint(0, 10, size=(100, 2)), columns=['ID', 'Value']).sort_values('ID')
df
Output:
ID Value
0 7
0 8
0 5
0 5
... ... ...
9 7
9 7
9 1
9 3
Desired output for filter_count = 3:
ID Value
0 7
0 8
0 5
1 7
1 7
1 1
2 3

If you want to do this for all IDs, use:
df.groupby("ID").head(3)
For a single ID, you can assign a new column using cumcount and then filter on the combined conditions:
df["count"] = df.groupby("ID")["Value"].cumcount()
print (df.loc[(df["ID"].ne(0))|((df["ID"].eq(0)&(df["count"]<3)))])
ID Value count
64 0 6 0
77 0 6 1
83 0 0 2
44 1 7 0
58 1 5 1
40 1 2 2
35 1 7 3
89 1 9 4
19 1 7 5
10 1 3 6
45 2 4 0
68 2 1 1
74 2 4 2
75 2 8 3
34 2 4 4
60 2 6 5
78 2 0 6
31 2 8 7
97 2 9 8
2 2 6 9
93 2 8 10
13 2 2 11
...

You can also do it without groupby:
df = pd.concat([df.loc[df.ID == 0].head(3), df.loc[df.ID != 0]])

Thanks Henry,
I modified your code and I think this should work as well.
Your df.groupby("ID").head(3) is great. Thanks.
df["count"] = df.groupby("ID")["Value"].cumcount()
df.loc[df["count"]<3].drop(['count'], axis=1)

Related

Pandas replace values (grouping by and iteration)

Good morning,
I have a problem when trying to replace some values. I have a dataframe with a column "loc10p" that separates the records into 10 groups, and within each group the records are divided into smaller subgroups; but the subgroup numbering restarts at 1 for each group instead of continuing from the previous group's last subgroup. For example:
c2[c2.loc10p.isin([1,2])].sort_values(['loc10p','subgrupoloc10'])[['loc10p','subgrupoloc10']]
loc10p subgrupoloc10
1 1 1
7 1 1
15 1 1
0 1 2
14 1 2
30 1 2
31 1 2
2 2 1
8 2 1
9 2 1
16 2 1
17 2 1
18 2 2
23 2 2
How can I transform that into something like the following:
loc10p subgrupoloc10
1 1 1
7 1 1
15 1 1
0 1 2
14 1 2
30 1 2
31 1 2
2 2 3
8 2 3
9 2 3
16 2 3
17 2 3
18 2 4
23 2 4
I tried a loop that separates each group into a different dataframe and then replaces the subgroup values with a counter carried over from the previous group, but it didn't replace anything:
w = 1
temporal = []
for e in range(1, 11):
    temp = c2[c2['loc10p'] == e]
    temporal.append(temp)
for e, i in zip(temporal, range(1, 9)):
    try:
        e.loc[:, 'subgrupoloc10'] = w
        w += 1
    except:
        pass
Any help will be really appreciated!!
Try with ngroup:
df['out'] = df.groupby(['loc10p','subgrupoloc10']).ngroup()+1
Out[204]:
1 1
7 1
15 1
0 2
14 2
30 2
31 2
2 3
8 3
9 3
16 3
17 3
18 4
23 4
dtype: int64
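Note that ngroup numbers the groups in sorted key order by default; if the data were not already sorted and you wanted numbering by order of appearance instead, you could pass sort=False (a minor variant of the answer above):
df['out'] = df.groupby(['loc10p', 'subgrupoloc10'], sort=False).ngroup() + 1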
Try:
groups = (df["subgrupoloc10"] != df["subgrupoloc10"].shift()).cumsum()
df["subgrupoloc10"] = groups
print(df)
Prints:
loc10p subgrupoloc10
1 1 1
7 1 1
15 1 1
0 1 2
14 1 2
30 1 2
31 1 2
2 2 3
8 2 3
9 2 3
16 2 3
17 2 3
18 2 4
23 2 4
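This works by starting a new group whenever subgrupoloc10 differs from the previous row and numbering the stretches with a cumulative sum. It assumes the subgroup id always changes at a loc10p boundary; a slightly more defensive variant (a sketch, not from the original answer) compares both columns:
key = df[["loc10p", "subgrupoloc10"]]
df["subgrupoloc10"] = key.ne(key.shift()).any(axis=1).cumsum()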

Python Dataframe GroupBy Function

I am having a hard time understanding what the code below does. I initially thought it was counting the unique appearances of the values in (Weight, Age) and (Weight, Height); however, when I ran this example, I found out it was doing something else.
data = [[0,33,15,4],[1,44,12,3],[0,44,12,5],[1,33,15,4],[0,77,13,4],[1,33,15,4],[1,99,40,7],[0,58,45,4],[1,11,13,4]]
df = pd.DataFrame(data,columns=["Lbl","Weight","Age","Height"])
print (df)
def group_fea(df,key,target):
    '''
    Adds columns for feature combinations
    '''
    tmp = df.groupby(key, as_index=False)[target].agg({
        key + target + '_nunique': 'nunique',
    }).reset_index()
    del tmp['index']
    print("****{}****".format(target))
    return tmp
# Add feature combinations
feature_key = ['Weight']
feature_target = ['Age', 'Height']
for key in feature_key:
    for target in feature_target:
        tmp = group_fea(df, key, target)
        df = df.merge(tmp, on=key, how='left')
print(df)
Lbl Weight Age Height
0 0 33 15 4
1 1 44 12 3
2 0 44 12 5
3 1 33 15 4
4 0 77 13 4
5 1 33 15 4
6 1 99 40 7
7 0 58 45 4
8 1 11 13 4
****Age****
****Height****
Lbl Weight Age Height WeightAge_nunique WeightHeight_nunique
0 0 33 15 4 1 1
1 1 44 12 3 1 2
2 0 44 12 5 1 2
3 1 33 15 4 1 1
4 0 77 13 4 1 1
5 1 33 15 4 1 1
6 1 99 40 7 1 1
7 0 58 45 4 1 1
8 1 11 13 4 1 1
I want to understand what the values in WeightAge_nunique and WeightHeight_nunique mean.
The value of WeightAge_nunique on a given row is the number of unique Ages that have the same Weight. The corresponding thing is true of WeightHeight_nunique. E.g., for people of Weight=44, there is only 1 unique age (12), hence WeightAge_nunique=1 on those rows, but there are 2 unique Heights (3 and 5), hence WeightHeight_nunique=2 on those same rows.
You can see that this happens because the grouping function groups by the "key" column (Weight), then performs the "nunique" aggregation function on the "target" column (either Age or Height).
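As an aside, the dict-of-name-to-function form of agg used in group_fea was deprecated and later removed from pandas; a minimal sketch of the same computation using named aggregation (pandas 0.25+), under the assumption you just want the new column:
def group_fea(df, key, target):
    # count unique `target` values per `key` value
    return df.groupby(key, as_index=False).agg(
        **{key + target + '_nunique': (target, 'nunique')}
    )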
Let us try transform
g = df.groupby('Weight').transform('nunique')
df['WeightAge_nunique'] = g['Age']
df['WeightHeight_nunique'] = g['Height']
df
Out[196]:
Lbl Weight Age Height WeightAge_nunique WeightHeight_nunique
0 0 33 15 4 1 1
1 1 44 12 3 1 2
2 0 44 12 5 1 2
3 1 33 15 4 1 1
4 0 77 13 4 1 1
5 1 33 15 4 1 1
6 1 99 40 7 1 1
7 0 58 45 4 1 1
8 1 11 13 4 1 1
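Since transform('nunique') above runs over every non-grouping column (including Lbl), selecting just the two columns you need first does the same work more directly (a minor variation, not from the original answer):
g = df.groupby('Weight')[['Age', 'Height']].transform('nunique')
df['WeightAge_nunique'] = g['Age']
df['WeightHeight_nunique'] = g['Height']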

How do you parse out data from a dataframe for each ID when an adjacent column contains a certain value?

I have a large dataframe in the following format. I need to keep, within each ID, only the rows from the first occurrence of values == 1 through the end of that ID's rows. This should reset on each ID, so the slice starts at the first 1 within each unique ID and ends when that ID's rows end.
d = {'ID': [1,1,1,1,1,2,2,2,2,2,3,3,3,3,4,4,4,4,4,4,4,4,4,5,5,5,5,5],
     'values': [0,0,0,1,0,1,0,1,1,1,0,1,0,0,0,0,0,0,1,1,0,1,0,1,1,1,1,1]}
df = pd.DataFrame(data=d)
df

# Desired output:
ND = {'ID': [1,1,2,2,2,2,2,3,3,3,4,4,4,4,4,5,5,5,5,5],
      'values': [1,0,1,0,1,1,1,1,0,0,1,1,0,1,0,1,1,1,1,1]}
df_final = pd.DataFrame(ND)
df_final
IIUC,
df[df.groupby('ID')['values'].transform('cummax')==1]
Output:
ID values
3 1 1
4 1 0
5 2 1
6 2 0
7 2 1
8 2 1
9 2 1
11 3 1
12 3 0
13 3 0
18 4 1
19 4 1
20 4 0
21 4 1
22 4 0
23 5 1
24 5 1
25 5 1
26 5 1
27 5 1
Details: cummax keeps the value at 1 after the first 1 is found within each group. Comparing the result to 1 then creates a boolean series, which is used for boolean indexing.
If your values column contains only 0 and 1, you can use groupby.cummax, which replaces any 0 that comes after a 1 (per ID) with 1, and then use the result as a boolean mask:
df_ = df[df.groupby('ID')['values'].cummax().astype(bool).to_numpy()]
print(df_)
ID values
3 1 1
4 1 0
5 2 1
6 2 0
7 2 1
8 2 1
9 2 1
11 3 1
12 3 0
13 3 0
18 4 1
19 4 1
20 4 0
21 4 1
22 4 0
23 5 1
24 5 1
25 5 1
26 5 1
27 5 1

How do I obtain the second highest value in a row?

I want to obtain the second highest value of a certain section for each row from a dataframe. How do I do this?
I have tried the following code but it doesn't work:
df.iloc[:, 5:-3].nlargest(2)(axis=1, level=2)
Is there any other way to obtain this?
Using apply with axis=1 you can find the second largest value for each row, by finding the 2 largest and then taking the last of them:
df.iloc[:, 5:-3].apply(lambda row: row.nlargest(2).values[-1],axis=1)
Example
The code below finds the second largest value in each row of df.
In [1]: import pandas as pd
In [2]: import numpy as np
In [3]: df = pd.DataFrame({'Col{}'.format(i):np.random.randint(0,100,5) for i in range(5)})
In [4]: df
Out[4]:
Col0 Col1 Col2 Col3 Col4
0 82 32 14 62 90
1 62 32 74 62 72
2 31 79 22 17 3
3 42 54 66 93 50
4 13 88 6 46 69
In [5]: df.apply(lambda row: row.nlargest(2).values[-1],axis=1)
Out[5]:
0 82
1 72
2 31
3 66
4 69
dtype: int64
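Note that row.nlargest(2).values[-1] counts duplicates: for a row like 5, 5, 1 the second largest is 5, not 1. If you want the second highest distinct value, drop duplicates first, e.g. row.drop_duplicates().nlargest(2).values[-1].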
I think you need to sort per row and then select the second-to-last column:
a = np.sort(df.iloc[:, 5:-3], axis=1)[:, -2]
Sample:
np.random.seed(100)
df = pd.DataFrame(np.random.randint(10, size=(10,10)))
print (df)
0 1 2 3 4 5 6 7 8 9
0 8 8 3 7 7 0 4 2 5 2
1 2 2 1 0 8 4 0 9 6 2
2 4 1 5 3 4 4 3 7 1 1
3 7 7 0 2 9 9 3 2 5 8
4 1 0 7 6 2 0 8 2 5 1
5 8 1 5 4 2 8 3 5 0 9
6 3 6 3 4 7 6 3 9 0 4
7 4 5 7 6 6 2 4 2 7 1
8 6 6 0 7 2 3 5 4 2 4
9 3 7 9 0 0 5 9 6 6 5
print (df.iloc[:, 5:-3])
5 6
0 0 4
1 4 0
2 4 3
3 9 3
4 0 8
5 8 3
6 6 3
7 2 4
8 3 5
9 5 9
a = np.sort(df.iloc[:, 5:-3], axis=1)[:, -2]
print (a)
[0 0 3 3 0 3 3 2 3 5]
If you need both values:
a = df.iloc[:, 5:-3].values
b = pd.DataFrame(a[np.arange(len(a))[:, None], np.argsort(a, axis=1)])
print (b)
0 1
0 0 4
1 0 4
2 3 4
3 3 9
4 0 8
5 3 8
6 3 6
7 2 4
8 3 5
9 5 9
You need to sort each row of your dataframe with numpy.sort() (which sorts ascending) and then take the second value from the end:
import numpy as np
second = np.sort(df.iloc[:, 5:-3], axis=1)[:, -2]
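np.sort returns a plain array; if you want the result aligned with the original index (a small convenience sketch), wrap it back into a Series:
second = pd.Series(
    np.sort(df.iloc[:, 5:-3].to_numpy(), axis=1)[:, -2],
    index=df.index,
)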

Pivot column and column values in pandas dataframe

I have a dataframe that looks like this, but with 26 rows and 110 columns:
index/io 1 2 3 4
0 42 53 23 4
1 53 24 6 12
2 63 12 65 34
3 13 64 23 43
Desired output:
index io value
0 1 42
0 2 53
0 3 23
0 4 4
1 1 53
1 2 24
1 3 6
1 4 12
2 1 63
2 2 12
...
I have tried with dicts and lists by transforming the dataframe to a dict, then building a list of index values and updating a new dict with io.
indx = []
for key, value in mydict.iteritems():
    for k, v in value.iteritems():
        indx.append(key)

indxio = {}
for element in indx:
    for key, value in mydict.iteritems():
        for k, v in value.iteritems():
            indxio.update({element: k})
I know this is probably way off, but it's the only thing I could think of. The process was too slow, so I stopped it.
You can use set_index, stack, and reset_index().
df.set_index("index/io").stack().reset_index(name="value")\
.rename(columns={'index/io':'index','level_1':'io'})
Output:
index io value
0 0 1 42
1 0 2 53
2 0 3 23
3 0 4 4
4 1 1 53
5 1 2 24
6 1 3 6
7 1 4 12
8 2 1 63
9 2 2 12
10 2 3 65
11 2 4 34
12 3 1 13
13 3 2 64
14 3 3 23
15 3 4 43
You need set_index + stack + rename_axis + reset_index:
df = df.set_index('index/io').stack().rename_axis(('index','io')).reset_index(name='value')
print (df)
index io value
0 0 1 42
1 0 2 53
2 0 3 23
3 0 4 4
4 1 1 53
5 1 2 24
6 1 3 6
7 1 4 12
8 2 1 63
9 2 2 12
10 2 3 65
11 2 4 34
12 3 1 13
13 3 2 64
14 3 3 23
15 3 4 43
Solution with melt and rename; the order of values is different, so sort_values is necessary:
d = {'index/io':'index'}
df = df.melt('index/io', var_name='io', value_name='value') \
.rename(columns=d).sort_values(['index','io']).reset_index(drop=True)
print (df)
index io value
0 0 1 42
1 0 2 53
2 0 3 23
3 0 4 4
4 1 1 53
5 1 2 24
6 1 3 6
7 1 4 12
8 2 1 63
9 2 2 12
10 2 3 65
11 2 4 34
12 3 1 13
13 3 2 64
14 3 3 23
15 3 4 43
And an alternative solution for numpy lovers:
df = df.set_index('index/io')
a = np.repeat(df.index, len(df.columns))
b = np.tile(df.columns, len(df.index))
c = df.values.ravel()
cols = ['index','io','value']
df = pd.DataFrame(np.column_stack([a,b,c]), columns = cols)
print (df)
index io value
0 0 1 42
1 0 2 53
2 0 3 23
3 0 4 4
4 1 1 53
5 1 2 24
6 1 3 6
7 1 4 12
8 2 1 63
9 2 2 12
10 2 3 65
11 2 4 34
12 3 1 13
13 3 2 64
14 3 3 23
15 3 4 43
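On newer pandas (1.1+, an assumption about your version), melt can also keep the index directly with ignore_index=False, which avoids renaming the column before the sort:
out = (df.set_index('index/io')
         .rename_axis('index')
         .melt(ignore_index=False, var_name='io', value_name='value')
         .reset_index()
         .sort_values(['index', 'io'], ignore_index=True))
print(out)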
