pandas dataframe duplicate values count not properly working


df['ID'].value_counts().values gives:
array([4, 3, 3, 1], dtype=int64)
input:
ID emp
a 1
a 1
b 1
a 1
b 1
c 1
c 1
a 1
b 1
c 1
d 1
When I shuffle the rows so the ID column is in a different order and run:
df.loc[~df.duplicated(keep='first', subset=['ID']), 'emp']= df['ID'].value_counts().values
output:
ID emp
a 4
c 3
d 3
c 1
b 1
a 1
c 1
a 1
b 1
b 1
a 1
expected result:
ID emp
a 4
c 3
d 1
c 1
b 3
a 1
c 1
a 1
b 1
b 1
a 1
Problem: the counts are assigned without checking which ID each row has before writing to emp.

The problem is that df['ID'].value_counts() returns a Series with one entry per unique ID, so it has a different length and index than the original data. To fill the column with the count that belongs to each row's ID, use Series.map:
df.loc[~df.duplicated(subset=['ID']), 'emp'] = df['ID'].map(df['ID'].value_counts())
Or GroupBy.transform with size:
df.loc[~df.duplicated(subset=['ID']), 'emp'] = df.groupby('ID')['ID'].transform('size')
The 4-value Series cannot be assigned back directly, because df.index and df['ID'].value_counts().index are different:
print (df['ID'].value_counts())
a 4
b 3
c 3
d 1
Name: ID, dtype: int64
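If you convert it to a NumPy array, the 4 values are assigned by position. The DataFrame has 4 groups (a, b, c, d), so ~df.duplicated(subset=['ID']) is True for exactly 4 rows, but the array is in value_counts() order (4, 3, 3, 1) rather than in the order in which the IDs first appear. That is the reason for the wrong output: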
print (df['ID'].value_counts().values)
[4 3 3 1]
What is needed is a new column (Series) with the same index as df:
print (df['ID'].map(df['ID'].value_counts()))
0 4
1 4
2 3
3 4
4 3
5 3
6 3
7 4
8 3
9 3
10 1
Name: ID, dtype: int64
print (df.groupby('ID')['ID'].transform('size'))
0 4
1 4
2 3
3 4
4 3
5 3
6 3
7 4
8 3
9 3
10 1
Name: ID, dtype: int64
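To make the difference concrete, here is a minimal sketch that rebuilds the shuffled input from the question and applies the Series.map approach. The variable names counts and first are only illustrative, and the row order is copied from the question's "output" table.

```python
import pandas as pd

# Shuffled ID order taken from the question's "output" table
df = pd.DataFrame({"ID": list("acdcbacabba"), "emp": 1})

# One count per unique ID, indexed by the ID itself
counts = df["ID"].value_counts()

# True only on the first occurrence of each ID
first = ~df.duplicated(subset=["ID"], keep="first")

# map() looks each row's ID up in counts, so the right-hand side shares
# df's index and the assignment aligns by label instead of by position
df.loc[first, "emp"] = df["ID"].map(counts)

print(df)
```

With this input the emp column should match the "expected result" table from the question: d gets 1 and b gets 3, regardless of the order in which value_counts() lists them.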

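For the sample DataFrame you posted, df.loc[~df.duplicated(keep='first', subset=['ID']), 'emp'] = df['ID'].value_counts().values alone already gives the desired output, but only because the IDs first appear there in the same order as the counts.
For the general case you can try: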
cond=~df.duplicated(keep='first', subset=['ID'])
df.loc[cond,'emp']=df.loc[cond,'ID'].map(df['ID'].value_counts())

Related

Pandas drop duplicate base on 2 columns, having differents value

How can I drop duplicates in this specific way?
Index B C
1 2 1
2 2 0
3 3 1
4 3 1
5 4 0
6 4 0
7 4 0
8 5 1
9 5 0
10 5 1
Desired output :
Index B C
3 3 1
5 4 0
So I want to drop duplicates on B, but keep a group only if C is the same on all of its rows, and then keep one sample record from it.
For example, B = 3 for index 3 and 4, and since C = 1 for both, I do not drop them all: one row is kept.
But B = 5 for index 8, 9 and 10 has C = 1 or 0, so the whole group is dropped.

Try this, using transform with nunique and drop_duplicates:
df[df.groupby('B')['C'].transform('nunique') == 1].drop_duplicates(subset='B')
Output:
B C
Index
3 3 1
5 4 0
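As a self-contained sketch of the one-liner above, the following rebuilds the sample table and splits the expression into two steps. The names same_c and result are illustrative only.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "B": [2, 2, 3, 3, 4, 4, 4, 5, 5, 5],
        "C": [1, 0, 1, 1, 0, 0, 0, 1, 0, 1],
    },
    index=pd.Index(range(1, 11), name="Index"),
)

# For every row, the number of distinct C values within its B group
same_c = df.groupby("B")["C"].transform("nunique") == 1

# Keep only groups where C is constant, then one row per B
result = df[same_c].drop_duplicates(subset="B")

print(result)
```

drop_duplicates keeps the first row of each group by default, which is why index 3 and 5 survive here.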

search column name based on matching row values

I have a data frame like below:
A B C D E F Input
1 2 3 4 5 6 1
1 2 3 4 5 6 3
I want an output column where I can get the column name, something like below:
A B C D E F Input Output
1 2 3 4 5 6 1 A
1 2 3 4 5 6 3 C
As you can see above that in row 1, Input has value 1 and column A also has value 1, so the output is A.

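We can use idxmax (with keyword arguments, since recent pandas versions no longer accept the axis positionally in drop):
df['Output'] = df.drop(columns='Input').eq(df['Input'], axis=0).idxmax(axis=1)
df['Output']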
0 A
1 C
dtype: object

Alternative with .dot:
df.drop(columns='Input').eq(df['Input'], axis=0).dot(df.columns.difference(['Input']))
0 A
1 C
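A runnable sketch of the idxmax approach on the sample data follows. One caveat worth checking in real data: idxmax returns the first column when no column matches, so this sketch additionally masks such rows with where. That masking step is an addition, not part of the original answer.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "A": [1, 1], "B": [2, 2], "C": [3, 3],
        "D": [4, 4], "E": [5, 5], "F": [6, 6],
        "Input": [1, 3],
    }
)

# Boolean frame: True where a column's value equals that row's Input
matches = df.drop(columns="Input").eq(df["Input"], axis=0)

# idxmax(axis=1) returns the label of the first True in each row;
# where() leaves NaN for rows that have no matching column at all
df["Output"] = matches.idxmax(axis=1).where(matches.any(axis=1))

print(df)
```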

Aggregate data frame rows based on conditions

I have this table
A B C E
1 2 1 3
1 2 4 4
2 7 1 1
3 4 0 2
3 4 8 3
Now, I want to remove duplicates based on column A and B and at the same time sum up column C. For E, it should take the value where C shows the max value. The desirable result table should look like this:
A B C E
1 2 5 4
2 7 1 1
3 4 8 3
I tried this: df.groupby(['A', 'B']).sum()['C'] but my data frame does not change at all as I am thinking that I didn't incorporate the E column part properly...Can somebody advise?
Thanks so much!

If rows are duplicates in columns A and B, we can group by those columns.
In [20]: df
Out[20]:
A B C E
0 1 1 5 4
1 1 1 1 1
2 3 3 8 3
In [21]: df.groupby(['A', 'B'])['C'].sum()
Out[21]:
A B
1 1 6
3 3 8
Name: C, dtype: int64
You wrote: "I tried this: df.groupby(['A', 'B']).sum()['C'] but my data frame does not change at all".
Yes, that is because pandas does not overwrite the initial DataFrame; groupby returns a new object.
In [22]: df
Out[22]:
A B C E
0 1 1 5 4
1 1 1 1 1
2 3 3 8 3
You have to overwrite it explicitly.
In [23]: df = df.groupby(['A', 'B'])['C'].sum()
In [24]: df
Out[24]:
A B
1 1 6
3 3 8
Name: C, dtype: int64
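The answer above covers the sum of C but not the E requirement from the question. One possible way to get both, sketched here on the question's own table, is to combine sum with idxmax. The names c_sum, e_at_max and result are illustrative, and this is only one of several ways to do it.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "A": [1, 1, 2, 3, 3],
        "B": [2, 2, 7, 4, 4],
        "C": [1, 4, 1, 0, 8],
        "E": [3, 4, 1, 2, 3],
    }
)

groups = df.groupby(["A", "B"])["C"]

# Sum of C per (A, B) group
c_sum = groups.sum()

# idxmax gives the row label of the largest C in each group;
# use those labels to pick the matching E values
e_at_max = df.loc[groups.idxmax()].set_index(["A", "B"])["E"]

# Both Series are indexed by (A, B), so concat aligns them
result = pd.concat([c_sum, e_at_max], axis=1).reset_index()

print(result)
```

If several rows in a group tie for the maximum C, idxmax picks the first one, so E is taken from that row.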

Delete keys with missing values?

My dataframe is
ID Alphabet Number1 Number2
1 A NaN 9
1 A 3 5
1 A 1 4
1 A 2 4
2 B 7 3
2 B 2 8
2 B 4 1
2 B 8 5
3 C 2 2
3 C 1 9
4 D 2 3
4 D 6 2
4 D 8 NaN
I got unique Alphabets by doing
df.groupby('Alphabet')['ID'].nunique()
and the result is
A 1
B 1
C 1
D 1
but I want to keep only the Alphabets that do NOT have any missing data.
I want the result to look like
B 1
C 1
and from this console result, how would I store "B" and "C" into a list?

IIUC, using all()
s = df.groupby('Alphabet').apply(lambda x: x.notnull().all()).all(axis=1)
df.groupby('Alphabet').ID.nunique()[s[s].index]
Out[1082]:
Alphabet
B 1
C 1
Name: ID, dtype: int64
Or
df.loc[df.Alphabet.isin(s[s].index)].groupby('Alphabet').ID.nunique()
Out[1095]:
Alphabet
B 1
C 1
Name: ID, dtype: int64
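The question also asks how to store "B" and "C" in a list. A minimal sketch, assuming "missing data" means a NaN in any column of any row of that Alphabet, could look like this. The names has_nan, bad and complete are illustrative.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    {
        "ID": [1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 4, 4, 4],
        "Alphabet": list("AAAABBBBCCDDD"),
        "Number1": [np.nan, 3, 1, 2, 7, 2, 4, 8, 2, 1, 2, 6, 8],
        "Number2": [9, 5, 4, 4, 3, 8, 1, 5, 2, 9, 3, 2, np.nan],
    }
)

# Rows that contain at least one NaN
has_nan = df.isna().any(axis=1)

# Alphabets that own at least one such row
bad = df.loc[has_nan, "Alphabet"].unique()

# Remaining Alphabets, as a plain Python list
complete = df.loc[~df["Alphabet"].isin(bad), "Alphabet"].unique().tolist()

# Same filter applied before counting unique IDs
nunique = df[~df["Alphabet"].isin(bad)].groupby("Alphabet")["ID"].nunique()

print(complete)
print(nunique)
```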

Count duplicate rows for each unique row value

I have the following pandas DataFrame:
a b c
1 s 5
1 w 5
2 s 5
3 s 6
3 e 6
3 e 5
I need to count duplicate rows for each unique value of a to obtain the following result:
a qty
1 2
2 1
3 3
How to do this in python?

You can use groupby:
g = df.groupby('a').size()
This returns:
a
1 2
2 1
3 3
dtype: int64
EDIT: rename only the single new column of counts.
If you need the counts as a named column, you can use:
g = df.groupby('a').size().reset_index().rename(columns={0:'qty'})
to obtain:
a qty
0 1 2
1 2 1
2 3 3
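A slightly shorter variant, sketched here on the sample data, passes the new column name straight to Series.reset_index through its name parameter instead of renaming afterwards.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "a": [1, 1, 2, 3, 3, 3],
        "b": ["s", "w", "s", "s", "e", "e"],
        "c": [5, 5, 5, 6, 6, 5],
    }
)

# size() counts rows per group; reset_index(name=...) names the count column
g = df.groupby("a").size().reset_index(name="qty")

print(g)
```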
