Top N rows vs rows with Top N unique values using pandas - python
I have a pandas dataframe as shown below:
import pandas as pd
data ={'Count':[1,1,2,3,4,2,1,1,2,1,3,1,3,6,1,1,9,3,3,6,1,5,2,2,0,2,2,4,0,1,3,2,5,0,3,3,1,2,2,1,6,2,3,4,1,1,3,3,4,3,1,1,4,2,3,0,2,2,3,1,3,6,1,8,4,5,4,2,1,4,1,1,1,2,3,4,1,1,1,3,2,0,6,2,3,2,9,10,2,1,2,3,1,2,2,3,2,1,8,4,0,3,3,5,12,1,5,13,6,13,7,3,5,2,3,3,1,1,5,15,7,9,1,1,1,2,2,2,4,3,3,2,4,1,2,9,3,1,3,0,0,4,0,1,0,1,0]}
df = pd.DataFrame(data)
I would like to do the following:
a) Top 5 rows (this will return only 5 rows)
b) Rows with the top 5 unique values (this can return N > 5 rows if the top 5 values are repeated). See my example screenshot below, where selecting the top 5 unique values gives 8 rows.
I am able to get the top 5 rows by using the below:
df.nlargest(5,['Count'])
However, when I try the below for b), I don't get the expected output
df.nlargest(5,['Count'],keep='all')
I expect my output to be as shown below.
Are you after top 5 unique values or largest top five values?
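import numpy as np
# flag the first 5 rows, and the first 5 rows left after dropping duplicate values
df =(df.assign(top5rows=np.where(df.index.isin(df.head(5).index),'Y','N'),
top5unique=np.where(df.index.isin(df.drop_duplicates(keep='first').head(5).index), 'Y','N')))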
or did you need
df =(df.assign(top5rows=np.where(df.index.isin(df.head(5).index),'Y','N'),
top5unique=np.where(df['Count'].isin(list(df['Count'].unique()[:5])),'Y','N')))
    Count top5rows top5unique
0       1        Y          Y
1       1        Y          Y
2       2        Y          Y
3       3        Y          Y
4       4        Y          Y
5       2        N          Y
6       1        N          Y
7       1        N          Y
8       2        N          Y
9       1        N          Y
10      3        N          Y
11      1        N          Y
12      3        N          Y
13      6        N          Y
14      1        N          Y
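For part b), one possible approach (a sketch, not taken from the answers above) is to pick the five largest distinct values first and then keep every row whose Count is among them. The small frame below is made-up illustration data rather than the data from the question, and the names `top_values` and `top_unique_rows` are chosen here for the example. It also shows why `keep='all'` alone is not enough: it only retains rows tied with the last selected value.

```python
import pandas as pd

# Made-up illustration data (not the frame from the question).
df = pd.DataFrame({"Count": [5, 9, 9, 7, 3, 7, 1, 8, 2, 8, 6]})

# a) Top 5 rows: exactly five rows, ties broken by order of appearance.
top_rows = df.nlargest(5, "Count")

# keep='all' only adds rows tied with the last selected value (the second 7 here).
top_rows_with_ties = df.nlargest(5, "Count", keep="all")

# b) Rows holding the 5 largest distinct values: find the values, then filter.
top_values = df["Count"].drop_duplicates().nlargest(5)
top_unique_rows = df[df["Count"].isin(top_values)].sort_values(
    "Count", ascending=False
)

print(len(top_rows), len(top_rows_with_ties), len(top_unique_rows))  # 5 6 8
```

Filtering with `isin` keeps the original index, so the result can still be used to flag rows in the full frame if a Y/N column is wanted.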
Related
group column values with difference of 3(say) digit in python
I am new to Python. The problem statement is this: we have the data below as a dataframe
df = pd.DataFrame({'Diff':[1,1,2,3,4,4,5,6,7,7,8,9,9,10], 'value':[x,x,y,x,x,x,y,x,z,x,x,y,y,z]})
Diff value
1    x
1    x
2    y
3    x
4    x
4    x
5    y
6    x
7    z
7    x
8    x
9    y
9    y
10   z
We need to group the Diff column into ranges of 3 (say), like 0-3, 3-6, 6-9, >9, and the values should be counts.
Expected output is like
Diff  x  y  z
0-3   2  1
3-6   3  1
6-9   3     1
>=9      2  1
The example code in the question is wrong (the values are not quoted strings). Anyone who wants to practice can use the following code:
import pandas as pd
df = pd.DataFrame({'Diff':[1,1,2,3,4,4,5,6,7,7,8,9,9,10], 'value':'x,x,y,x,x,x,y,x,z,x,x,y,y,z'.split(',')})
Code:
labels = ['0-3', '3-6', '6-9', '>=9']
# right=False makes each bin include its left edge and exclude its right edge
grouper = pd.cut(df['Diff'], bins=[0, 3, 6, 9, float('inf')], right=False, labels=labels)
pd.crosstab(grouper, df['value'])
Output:
value  x  y  z
Diff
0-3    2  1  0
3-6    3  1  0
6-9    3  0  1
>=9    0  2  1
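The `right=False` argument matters for values that sit exactly on a bin edge. As a small sketch (the three test values are chosen here only to hit the edges), the same bins can be compared with and without it:

```python
import pandas as pd

edges = [0, 3, 6, 9, float("inf")]
labels = ["0-3", "3-6", "6-9", ">=9"]

# Values that fall exactly on the bin edges.
values = pd.Series([3, 6, 9])

# right=False: intervals are [0, 3), [3, 6), [6, 9), [9, inf)
left_closed = pd.cut(values, bins=edges, right=False, labels=labels)

# default right=True: intervals are (0, 3], (3, 6], (6, 9], (9, inf]
right_closed = pd.cut(values, bins=edges, labels=labels)

print(left_closed.tolist())   # ['3-6', '6-9', '>=9']
print(right_closed.tolist())  # ['0-3', '3-6', '6-9']
```

With the labels used in the answer ('>=9' for the last bin), the left-closed variant is the one whose labels match the actual interval boundaries.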
pd.DataFrame.update puts the result at the top of the dataframe
Let's say I have two dataframes like this:
n = {'x':['a','b','c','d','e'], 'y':['1','2','3','4','5'],'z':['0','0','0','0','0']}
nf = pd.DataFrame(n)
m = {'x':['b','d','e'], 'z':['10','100','1000']}
mf = pd.DataFrame(m)
I want to update the zeroes in the z column of the nf dataframe with the values from the z column of the mf dataframe, only in the rows whose key in column x matches. When I call nf.update(mf) I get
x y z
b 1 10
d 2 100
e 3 1000
d 4 0
e 5 0
instead of the desired output
x y z
a 1 0
b 2 10
c 3 0
d 4 100
e 5 1000
To solve your problem, you need to match the indexes of both dataframes, because update aligns rows on the index. Here is how you can do it:
n = {'x':['a','b','c','d','e'], 'y':['1','2','3','4','5'],'z':['0','0','0','0','0']}
nf = pd.DataFrame(n).set_index('x')
m = {'x':['b','d','e'], 'z':['10','100','1000']}
mf = pd.DataFrame(m).set_index('x')
nf.update(mf)
nf = nf.reset_index()
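As an alternative sketch (not the method from the answer above), the same result can be reached without changing the index of nf, by looking up each key of column x in mf and falling back to the existing z value where there is no match. The name `lookup` is introduced here only for illustration.

```python
import pandas as pd

nf = pd.DataFrame({"x": list("abcde"), "y": list("12345"), "z": ["0"] * 5})
mf = pd.DataFrame({"x": ["b", "d", "e"], "z": ["10", "100", "1000"]})

# Series keyed by x, used as a lookup table for the replacement values.
lookup = mf.set_index("x")["z"]

# Keys missing from the lookup map to NaN; fillna restores the original z there.
nf["z"] = nf["x"].map(lookup).fillna(nf["z"])

print(nf)
```

This keeps the rows of nf in their original order and leaves column x untouched, which is the part that went wrong when update aligned on the default integer index.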
Counting items by different conditions in rows in Python Pandas
I am trying to create a confusion matrix of an experiment. The dataset looks like this:
    Responses Condition
3           1         R
4           1         R
6           1         R
7           1         R
8          -1         R
9          -1         N
10         -1         N
11         -1         N
12         -1         R
13          1         R
I want to categorize four different conditions and count each of them:
1 & N, -1 & N, 1 & R, -1 & R
I want to count each of those situations in the dataframe. I have tried to use .itertuples, but I don't know how to use it with 2 parameters.
IIUC
df.groupby(['Condition','Responses']).size().unstack(fill_value=0).stack()
Condition  Responses
N          -1           3
            1           0
R          -1           2
            1           5
dtype: int64
Or
pd.crosstab(df.Condition,df.Responses)
Responses  -1  1
Condition
N           3  0
R           2  5
This could work:
import pandas as pd
df = pd.DataFrame({"Conditions":["R","R","N","N"], "Responses":[1,-1,-1,-1]})
df.groupby(["Conditions","Responses"]).apply(len).to_frame("occurrences")
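To try the crosstab approach end to end, here is a minimal sketch that rebuilds the sample data from the question and reads the four counts out of the table. The dictionary `counts_by_pair` is a name made up for this example.

```python
import pandas as pd

# Sample data as shown in the question (index 5 is absent there as well).
df = pd.DataFrame(
    {
        "Responses": [1, 1, 1, 1, -1, -1, -1, -1, -1, 1],
        "Condition": ["R", "R", "R", "R", "R", "N", "N", "N", "R", "R"],
    },
    index=[3, 4, 6, 7, 8, 9, 10, 11, 12, 13],
)

# Rows are conditions, columns are responses, cells are counts.
table = pd.crosstab(df["Condition"], df["Responses"])

# Flatten into {(condition, response): count}, including zero counts.
counts_by_pair = {
    (cond, int(resp)): int(table.loc[cond, resp])
    for cond in table.index
    for resp in table.columns
}

print(table)
print(counts_by_pair)
```

Because crosstab fills every cell of the grid, a combination that never occurs (here 1 & N) shows up with a count of 0 rather than being missing.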
Dataframe set_index produces duplicate index values instead of doing hierarchical grouping
I have a dataframe that looks like this (index not shown):
Time Letter Type Value
0    A      x    10
0    B      y    20
1    A      y    30
1    B      x    40
3    C      x    50
I want to produce a dataframe that looks like this:
Time Letter TypeX TypeY
0    A      10    20
0    B            20
1    A            30
1    B      40
3    C      50
To do that, I decided I would first create a table that has multiple indices, Time and Letter, and then unstack the last index, Type. Let's say my original dataframe was named my_table:
my_table.reset_index().set_index(['Time', 'Letter'])
Instead of grouping it so that under every Time, Letter index there is BOTH Type x and Type y, the rows seem to have been sorted (adding a few more entries to demonstrate the point):
Time(i) Letter(i) Type Value
0       A         x    10
        D         x    25
        H         x    15
        G         x    33
1       B         x    40
        G         x    10
3       C         x    50
0       B         y    20
        H         y    10
1       A         y    30
Why does this happen? I expected a result like so:
Time Letter Type Value
0    A      x    10
            y    30
     B      y    20
     H      x    15
            y    10
     D      x    25
     G      x    33
1    B      x    40
     G      x    10
3    C      x    50
The same behavior occurs when I make Type one of the indices; it just becomes bold as an index. How do I successfully group columns using Time and Letter to get X and Y to be matched by those columns, so I can successfully use unstack?
You need to set Type as part of the index as well:
df.set_index(['Time','Letter','Type']).Value.unstack(fill_value='').reset_index()
Out[178]:
Type  Time Letter   x   y
0        0      A  10
1        0      B      20
2        1      A      30
3        1      B  40
4        3      C  50
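The point of the answer is that set_index only relabels rows and does not group or reorder them; it is unstack that pivots the innermost index level into columns. A minimal sketch with the five rows from the question follows. It leaves out `fill_value=''` (an assumption made here for simplicity), so missing combinations appear as NaN instead of empty strings.

```python
import pandas as pd

my_table = pd.DataFrame(
    {
        "Time": [0, 0, 1, 1, 3],
        "Letter": ["A", "B", "A", "B", "C"],
        "Type": ["x", "y", "y", "x", "x"],
        "Value": [10, 20, 30, 40, 50],
    }
)

# Put all three keys in the index, then move the Type level into columns.
wide = (
    my_table.set_index(["Time", "Letter", "Type"])["Value"]
    .unstack("Type")
    .reset_index()
)

print(wide)
```

Each (Time, Letter) pair now occupies a single row, with one column per Type value.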
populate new column in a pandas dataframe which takes input from other columns
I have a function which should take x, y, z as input and return r as output. For example, my_func(x, y, z) takes x = 10, y = 'apple' and z = 2 and returns the value for column r. Similarly, the function takes x = 20, y = 'orange' and z = 4 and populates the value in column r. Any suggestions on what would be efficient code for this?
Before:
a  x  y        z
5  10 'apple'  2
2  20 'orange' 4
0  4  'apple'  2
5  5  'pear'   6
After:
a  x  y        z r
5  10 'apple'  2 x
2  20 'orange' 4 x
10 4  'apple'  2 x
5  5  'pear'   6 x
It depends on how complex your function is. In general you can use pandas.DataFrame.apply:
>>> def my_func(x):
...     return '{0} - {1} - {2}'.format(x['y'],x['a'],x['x'])
...
>>> df['r'] = df.apply(my_func, axis=1)
>>> df
   a   x         y  z                  r
0  5  10   'apple'  2   'apple' - 5 - 10
1  2  20  'orange'  4  'orange' - 2 - 20
2  0   4   'apple'  2    'apple' - 0 - 4
3  5   5    'pear'  6     'pear' - 5 - 5
axis=1 makes your function work 'for each row' instead of 'for each column':
Objects passed to functions are Series objects having index either the DataFrame’s index (axis=0) or the columns (axis=1)
But if it's a really simple function, like the one above, you can probably do it without a function at all, using vectorized operations.
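As a sketch of what the vectorized version mentioned above could look like for this particular formatting function, the column can be built from whole-column string operations and compared against the apply result. The data mirrors the question's table, with plain strings for y; the column names `r_apply` and `r_vectorized` are made up for the comparison.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "a": [5, 2, 0, 5],
        "x": [10, 20, 4, 5],
        "y": ["apple", "orange", "apple", "pear"],
        "z": [2, 4, 2, 6],
    }
)

def my_func(row):
    # Row-wise version, called once per row by apply(axis=1).
    return "{0} - {1} - {2}".format(row["y"], row["a"], row["x"])

df["r_apply"] = df.apply(my_func, axis=1)

# Vectorized version: operate on entire columns instead of row by row.
df["r_vectorized"] = df["y"] + " - " + df["a"].astype(str) + " - " + df["x"].astype(str)

print(df[["r_apply", "r_vectorized"]])
```

Both columns hold the same strings; the vectorized form avoids a Python-level function call per row, which tends to matter on larger frames.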