I work with this df:
A B C D
1 1 2 3
2 1 3 4
3 3 3 3
I want to add column E that holds the number of equal values in columns A-D by row.
Expected output:
A B C D E
1 1 2 3 2
2 1 3 4 0
3 3 3 3 4
Can anyone point me in the right direction?
Thanks!
Use a custom lambda function with Series.duplicated and keep=False to flag all duplicates in each row, then count the Trues with sum:
df['E'] = df.apply(lambda x: x.duplicated(keep=False).sum(), axis=1)
print (df)
A B C D E
0 1 1 2 3 2
1 2 1 3 4 0
2 3 3 3 3 4
If you need to specify the column names:
df['E'] = df.loc[:, 'A':'D'].apply(lambda x: x.duplicated(keep=False).sum(), axis=1)
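If the row-wise apply is slow on a large frame, here is a rough vectorized sketch of the same idea (it assumes the columns A-D and the sample data above):

import pandas as pd

df = pd.DataFrame({'A': [1, 2, 3], 'B': [1, 1, 3], 'C': [2, 3, 3], 'D': [3, 4, 3]})

# stack A-D into long form, count each value per row,
# then sum only the counts of values that appear more than once in that row
counts = df.loc[:, 'A':'D'].stack().groupby(level=0).value_counts()
df['E'] = (counts[counts > 1]
           .groupby(level=0).sum()
           .reindex(df.index, fill_value=0))
print(df)
#    A  B  C  D  E
# 0  1  1  2  3  2
# 1  2  1  3  4  0
# 2  3  3  3  3  4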
I have a dataframe I would like to filter. Consider the input dataframe below.
a b c
1 1 1
1 0 1
2 2 1
2 2 2
There are 3 columns (a, b, and c).
I would like to add a new column d that holds, for each row, the count of unique values in c for that row's (a, b) pair:
a b c d
1 1 1 1
1 0 1 1
2 2 1 2
2 2 2 2
Rows 0 and 1 each have a distinct (a, b) pair, so the appended d value for both rows is 1.
Rows 2 and 3 share the same (a, b) pair and have 2 unique values of c for that pair, so their d values are 2.
Let us try
df['cnt'] = df.groupby(['a','b'])['c'].transform('nunique')
df
Out[303]:
a b c cnt
0 1 1 1 1
1 1 0 1 1
2 2 2 1 2
3 2 2 2 2
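For reference, here is a self-contained sketch of the transform approach on the question's data, writing the result to the requested column name d:

import pandas as pd

df = pd.DataFrame({'a': [1, 1, 2, 2], 'b': [1, 0, 2, 2], 'c': [1, 1, 1, 2]})

# nunique is computed once per (a, b) group and broadcast back onto every row,
# so the result can be assigned directly as a new column
df['d'] = df.groupby(['a', 'b'])['c'].transform('nunique')
print(df)
#    a  b  c  d
# 0  1  1  1  1
# 1  1  0  1  1
# 2  2  2  1  2
# 3  2  2  2  2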
I think you want to use groupby and nunique
import pandas as pd

data = pd.DataFrame({
    'a': [1, 1, 2, 2],
    'b': [1, 0, 2, 2],
    'c': [1, 1, 2, 3]
})

unique_count = data.groupby(['a', 'b']).c.nunique()

data.set_index(['a', 'b']).assign(
    d=unique_count
).reset_index()
Output:
   a  b  c  d
0  1  1  1  1
1  1  0  1  1
2  2  2  2  2
3  2  2  3  2
UPDATED THE SAMPLE DATASET
I have the following data:
location ID Value
A 1 1
A 1 1
A 1 1
A 1 1
A 1 2
A 1 2
A 1 2
A 1 2
A 1 3
A 1 4
A 2 1
A 2 2
A 3 1
A 3 2
B 4 1
B 4 2
B 5 1
B 5 1
B 5 2
B 5 2
B 6 1
B 6 1
B 6 1
B 6 1
B 6 1
B 6 2
B 6 2
B 6 2
B 7 1
I want to count the unique Values (only where the value equals 1 or 2) for each location and for each ID, to get the following output:
location ID_Count Value_Count
A 3 6
B 4 7
I tried df.groupby(['location'])[['ID', 'Value']].nunique(), but that only gives the overall count of unique values, e.g. a Value_Count of 4 for A and 2 for B.
Try agg, slicing ID on the True values.
For your updated sample, you just need to drop duplicates before processing. The rest is the same:
df = df.drop_duplicates(['location', 'ID', 'Value'])

# group the boolean mask (Value is 1 or 2) by location;
# inside each group, x[x].index are the row labels where the mask is True
df_agg = (df.Value.isin([1, 2]).groupby(df.location)
            .agg(ID_count=lambda x: df.loc[x[x].index, 'ID'].nunique(),
                 Value_count='sum'))
Out[93]:
ID_count Value_count
location
A 3 6
B 4 7
IIUC, you can try Series.isin with groupby.agg:
out = (df.assign(Value_Count=df['Value'].isin([1, 2]))
         .groupby("location", as_index=False)
         .agg({"ID": 'nunique', "Value_Count": 'sum'}))
print(out)
location ID Value_Count
0 A 3 6.0
1 B 4 7.0
Roughly the same as anky's answer, but using Series.where and named aggregations so we can rename the columns while creating them in the groupby.
grp = df.assign(Value=df['Value'].where(df['Value'].isin([1, 2]))).groupby('location')

grp.agg(
    ID_count=('ID', 'nunique'),
    Value_count=('Value', 'count')
).reset_index()
location ID_count Value_count
0 A 3 6
1 B 4 7
Let's try a very similar approach to the other answers, but this time we filter first:
(df[df['Value'].isin([1, 2])]
   .groupby(['location'], as_index=False)
   .agg({'ID': 'nunique', 'Value': 'size'})
)
Output:
location ID Value
0 A 3 6
1 B 4 7
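For completeness, a rough sketch that combines the filter-first and drop-duplicates ideas above to reproduce the expected output on the updated sample (column names as in the question):

import pandas as pd

df = pd.DataFrame({
    'location': ['A'] * 14 + ['B'] * 15,
    'ID': [1] * 10 + [2, 2, 3, 3] + [4, 4, 5, 5, 5, 5] + [6] * 8 + [7],
    'Value': [1, 1, 1, 1, 2, 2, 2, 2, 3, 4,   # ID 1
              1, 2,                           # ID 2
              1, 2,                           # ID 3
              1, 2,                           # ID 4
              1, 1, 2, 2,                     # ID 5
              1, 1, 1, 1, 1, 2, 2, 2,         # ID 6
              1],                             # ID 7
})

out = (df[df['Value'].isin([1, 2])]
         .drop_duplicates(['location', 'ID', 'Value'])
         .groupby('location', as_index=False)
         .agg(ID_Count=('ID', 'nunique'), Value_Count=('Value', 'size')))
print(out)
#   location  ID_Count  Value_Count
# 0        A         3            6
# 1        B         4            7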
Is there a way to find unique rows, where unique is in the sense of two "identical" columns?
>>> d = pandas.DataFrame([['A',1],['A',2],['A',3],['B',1],['B',4],['B',2]], columns = ['col_a','col_b'])
>>> d
  col_a col_b
0 A 1
1 A 2
2 A 3
3 B 1
4 B 4
5 B 2
>>> d.merge(d,left_on='col_b',right_on='col_b')
  col_a_x col_b col_a_y
0 A 1 A
1 A 1 B
2 B 1 A
3 B 1 B
4 A 2 A
5 A 2 B
6 B 2 A
7 B 2 B
8 A 3 A
9 B 4 B
>>> d_desired
  col_a_x col_b col_a_y
0 A 1 A
1 A 1 B
3 B 1 B
4 A 2 A
5 A 2 B
7 B 2 B
8 A 3 A
9 B 4 B
But I would like to drop the duplicate entries, e.g. B 1 A and B 2 A.
I will later want to group by the two columns, so I need to always drop the same "duplicate": if I drop B 1 A I should also drop B 2 A, and not A 2 B.
Try this and see if it works for you:
M = d.merge(d, left_on='col_b', right_on='col_b')

# flag rows where the left label sorts after the right one (e.g. B 1 A, B 2 A);
# the extra inequality check is redundant with >, but harmless
cond = (M.col_a_x > M.col_a_y) & (M.col_a_x != M.col_a_y)

# keep only the other orientation, so the same "duplicate" is always dropped
M.loc[~cond]
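As a follow-up sketch for the grouping mentioned in the question, the filtered frame can then be grouped on the now-canonical pair, for example to count how many col_b values each pair shares:

pairs = M.loc[~cond].groupby(['col_a_x', 'col_a_y']).size()
print(pairs)
# col_a_x  col_a_y
# A        A          3
#          B          2
# B        B          3
# dtype: int64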
I have a DataFrame that is grouped by the y column and sorted by the count of each y value.
Code:
df['count'] = df.groupby(['y'])['y'].transform(pd.Series.value_counts)
df = df.sort('count', ascending=False)
Output:
x y count
1 a 4
3 a 4
2 a 4
1 a 4
2 c 3
1 c 3
2 c 3
2 b 2
1 b 2
Now, I want to sort the x column by its frequency within each y group, like below:
Expected Output:
x y count
1 a 4
1 a 4
2 a 4
3 a 4
2 c 3
2 c 3
1 c 3
2 b 2
1 b 2
It seems you need groupby with value_counts, and then numpy.repeat to expand the index values by their counts into a DataFrame:
s = df.groupby('y', sort=False)['x'].value_counts()
#alternative
#s = df.groupby('y', sort=False)['x'].apply(pd.Series.value_counts)
print (s)
y x
a 1 2
2 1
3 1
c 2 2
1 1
b 1 1
2 1
Name: x, dtype: int64
import numpy as np

df1 = pd.DataFrame(np.repeat(s.index.values, s.values).tolist(), columns=['y','x'])
#change order of columns
df1 = df1[['x','y']]
print (df1)
x y
0 1 a
1 1 a
2 2 a
3 3 a
4 2 c
5 2 c
6 1 c
7 1 b
8 2 b
You can sort on both columns at once with df.sort_values(['count','x'], ascending=[False,True]). If you are using an older version of pandas where df.sort_values is not supported, you can use:
df.sort(columns=['count','x'], ascending=[False,True])
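As an alternative sketch (not from the answers above), you can add a helper per-(y, x) frequency column and sort on it; the column names x, y and count follow the question, while x_freq is a hypothetical helper:

import pandas as pd

df = pd.DataFrame({'x': [1, 3, 2, 1, 2, 1, 2, 2, 1],
                   'y': ['a', 'a', 'a', 'a', 'c', 'c', 'c', 'b', 'b']})

df['count'] = df.groupby('y')['y'].transform('size')          # frequency of each y value
df['x_freq'] = df.groupby(['y', 'x'])['x'].transform('size')  # frequency of x within its y group

out = (df.sort_values(['count', 'x_freq', 'x'], ascending=[False, False, True])
         .drop(columns='x_freq'))
# x is now ordered by frequency inside each y group: 1,1,2,3 for a and 2,2,1 for c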
I have the following pandas DataFrame:
a b c
1 s 5
1 w 5
2 s 5
3 s 6
3 e 6
3 e 5
I need to count duplicate rows for each unique value of a to obtain the following result:
a qty
1 2
2 1
3 3
How can I do this in Python?
You can use groupby:
g = df.groupby('a').size()
This returns:
a
1 2
2 1
3 3
dtype: int64
EDIT: rename only the single new column of counts.
If you need a new column you can:
g = df.groupby('a').size().reset_index().rename(columns={0:'qty'})
to obtain:
a qty
0 1 2
1 2 1
2 3 3
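An equivalent sketch using reset_index(name=...), which names the count column in one step (data as in the question):

import pandas as pd

df = pd.DataFrame({'a': [1, 1, 2, 3, 3, 3],
                   'b': ['s', 'w', 's', 's', 'e', 'e'],
                   'c': [5, 5, 5, 6, 6, 5]})

qty = df.groupby('a').size().reset_index(name='qty')
print(qty)
#    a  qty
# 0  1    2
# 1  2    1
# 2  3    3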