Top N rows vs rows with Top N unique values using pandas

I have a pandas dataframe like the one shown below:
import pandas as pd
data ={'Count':[1,1,2,3,4,2,1,1,2,1,3,1,3,6,1,1,9,3,3,6,1,5,2,2,0,2,2,4,0,1,3,2,5,0,3,3,1,2,2,1,6,2,3,4,1,1,3,3,4,3,1,1,4,2,3,0,2,2,3,1,3,6,1,8,4,5,4,2,1,4,1,1,1,2,3,4,1,1,1,3,2,0,6,2,3,2,9,10,2,1,2,3,1,2,2,3,2,1,8,4,0,3,3,5,12,1,5,13,6,13,7,3,5,2,3,3,1,1,5,15,7,9,1,1,1,2,2,2,4,3,3,2,4,1,2,9,3,1,3,0,0,4,0,1,0,1,0]}
df = pd.DataFrame(data)
I would like to do the following:
a) Get the top 5 rows (this returns exactly 5 rows)
b) Get the rows with the top 5 unique values (this can return more than 5 rows if any of the top 5 values repeat). In my example screenshot, selecting the top 5 unique values gives 8 rows.
I am able to get the top 5 rows with the following:
df.nlargest(5,['Count'])
However, when I try the following for b), I don't get the expected output:
df.nlargest(5,['Count'],keep='all')
I expect the output to look like the example described above, with 8 rows for the top 5 unique values.

Are you after the top 5 unique values or the 5 largest values?
import numpy as np
# head(5) marks the first five rows in the current order; sort by Count first if you want the largest
df =(df.assign(top5rows=np.where(df.index.isin(df.head(5).index),'Y','N'),
               top5unique=np.where(df.index.isin(df.drop_duplicates(keep='first').head(5).index), 'Y','N')))
Or did you need this?
# flag every row whose Count is one of the first five distinct values
df =(df.assign(top5rows=np.where(df.index.isin(df.head(5).index),'Y','N'),
               top5unique=np.where(df['Count'].isin(list(df['Count'].unique()[:5])),'Y','N')))
    Count top5rows top5unique
0       1        Y          Y
1       1        Y          Y
2       2        Y          Y
3       3        Y          Y
4       4        Y          Y
5       2        N          Y
6       1        N          Y
7       1        N          Y
8       2        N          Y
9       1        N          Y
10      3        N          Y
11      1        N          Y
12      3        N          Y
13      6        N          Y
14      1        N          Y
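If "top 5" means the five largest values, note that keep='all' in nlargest only keeps extra rows that tie with the last selected value; it does not count distinct values. Here is a minimal sketch of both selections, assuming a small made-up Count column. The sample data and the names top5_rows, top5_values and top5_unique_rows are illustrative and not from the question.

```python
import pandas as pd

# Hypothetical sample data, much shorter than the question's column
df = pd.DataFrame({"Count": [1, 9, 3, 9, 7, 7, 5, 4, 2, 5]})

# (a) exactly five rows: the five largest entries, duplicates included
top5_rows = df.nlargest(5, "Count")

# (b) every row whose value is one of the five largest distinct values
top5_values = df["Count"].drop_duplicates().nlargest(5)
top5_unique_rows = df[df["Count"].isin(top5_values)]

print(len(top5_rows), len(top5_unique_rows))
```

In this sample top5_rows has 5 rows while top5_unique_rows has 8, because 9, 7 and 5 each appear twice among the five largest distinct values.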

Related

group column values with difference of 3(say) digit in python

I am new to Python. The problem is as follows: we have the data below as a dataframe.
df = pd.DataFrame({'Diff':[1,1,2,3,4,4,5,6,7,7,8,9,9,10], 'value':[x,x,y,x,x,x,y,x,z,x,x,y,y,z]})
Diff value
1 x
1 x
2 y
3 x
4 x
4 x
5 y
6 x
7 z
7 x
8 x
9 y
9 y
10 z
We need to group the Diff column into bins of width 3 (say), like 0-3, 3-6, 6-9, >9, and the result should be the count of each value per bin.
Expected output is like
Diff  x  y  z
0-3   2  1
3-6   3  1
6-9   3     1
>=9      2  1

Example
The example code in the question does not run because the values are not quoted. If you want to try it yourself, use the following:
import pandas as pd

df = pd.DataFrame({'Diff':[1,1,2,3,4,4,5,6,7,7,8,9,9,10],
                   'value':'x,x,y,x,x,x,y,x,z,x,x,y,y,z'.split(',')})
Code
labels = ['0-3', '3-6', '6-9', '>=9']
grouper = pd.cut(df['Diff'], bins=[0, 3, 6, 9, float('inf')], right=False, labels=labels)
pd.crosstab(grouper, df['value'])
Output:
value  x  y  z
Diff
0-3    2  1  0
3-6    3  1  0
6-9    3  0  1
>=9    0  2  1
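The right=False argument is what puts the boundary values 3, 6 and 9 into the upper bin (for example, 3 lands in '3-6' rather than '0-3'). A small sketch comparing it with the default right=True, assuming a made-up series of boundary values; the names diff, left_closed and right_closed are only for illustration.

```python
import pandas as pd

# Hypothetical values chosen to sit on the bin edges
diff = pd.Series([1, 3, 6, 9, 10])
bins = [0, 3, 6, 9, float("inf")]
labels = ["0-3", "3-6", "6-9", ">=9"]

# right=False: intervals are [0, 3), [3, 6), [6, 9), [9, inf)
left_closed = pd.cut(diff, bins=bins, right=False, labels=labels)

# default right=True: intervals are (0, 3], (3, 6], (6, 9], (9, inf]
right_closed = pd.cut(diff, bins=bins, labels=labels)

print(left_closed.tolist())
print(right_closed.tolist())
```

With the default setting, 9 would fall into '6-9', which would not match a bin labelled '>=9'.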

pd.Dataframe.update puts the result at the top of the dataframe

Let's say I have two dataframes like this:
n = {'x':['a','b','c','d','e'], 'y':['1','2','3','4','5'],'z':['0','0','0','0','0']}
nf = pd.DataFrame(n)
m = {'x':['b','d','e'], 'z':['10','100','1000']}
mf = pd.DataFrame(m)
I want to update the zeroes in the z column of nf with the values from the z column of mf, but only in the rows whose key in column x matches.
When I call
nf.update(mf)
I get
x y z
b 1 10
d 2 100
e 3 1000
d 4 0
e 5 0
instead of the desired output
x y z
a 1 0
b 2 10
c 3 0
d 4 100
e 5 1000

To solve this, you need to align the indexes of both dataframes, because update matches rows by index label. Here is how you can do it:
import pandas as pd

n = {'x':['a','b','c','d','e'], 'y':['1','2','3','4','5'],'z':['0','0','0','0','0']}
nf = pd.DataFrame(n).set_index('x')  # use the key column as the index
m = {'x':['b','d','e'], 'z':['10','100','1000']}
mf = pd.DataFrame(m).set_index('x')
nf.update(mf)  # rows are now matched on x, not on position
nf = nf.reset_index()  # restore x as a regular column
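As an alternative that is not part of the answer above, the same result can probably be reached without touching the index of nf by mapping the keys through a lookup Series. This is only a sketch and assumes the x values in mf are unique; the name lookup is introduced here for illustration.

```python
import pandas as pd

nf = pd.DataFrame({"x": ["a", "b", "c", "d", "e"],
                   "y": ["1", "2", "3", "4", "5"],
                   "z": ["0", "0", "0", "0", "0"]})
mf = pd.DataFrame({"x": ["b", "d", "e"], "z": ["10", "100", "1000"]})

# Series indexed by key: x -> new z value (assumes unique keys in mf)
lookup = mf.set_index("x")["z"]

# Keys missing from lookup map to NaN, so fall back to the existing z
nf["z"] = nf["x"].map(lookup).fillna(nf["z"])

print(nf)
```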

Counting items by different conditions in rows in Python Pandas

I am trying to create a confusion matrix of an experiment.
The dataset looks like this:
    Responses Condition
3           1         R
4           1         R
6           1         R
7           1         R
8          -1         R
9          -1         N
10         -1         N
11         -1         N
12         -1         R
13          1         R
I want to distinguish four different combinations and count how often each one occurs:
1 & N,
-1 & N,
1 & R,
-1 & R
I want to count each one of those situations in the dataframe.
I have tried to use .itertuples, but I don't know how to use it with 2 parameters.

IIUC
df.groupby(['Condition','Responses']).size().unstack(fill_value=0).stack()
Condition  Responses
N          -1           3
            1           0
R          -1           2
            1           5
dtype: int64
Or
pd.crosstab(df.Condition,df.Responses)
Responses  -1  1
Condition
N           3  0
R           2  5

This could work:
import pandas as pd
df = pd.DataFrame({"Conditions":["R","R","N","N"], "Responses":[1,-1,-1,-1]})
df.groupby(["Conditions","Responses"]).apply(len).to_frame("occurrences")

Dataframe set_index produces duplicate index values instead of doing hierarchical grouping

I have a dataframe that looks like this (index not shown)
Time Letter Type Value
0 A x 10
0 B y 20
1 A y 30
1 B x 40
3 C x 50
I want to produce a dataframe that looks like this:
Time Letter TypeX TypeY
0 A 10 20
0 B 20
1 A 30
1 B 40
3 C 50
To do that, I decided I would first create a table that has multiple index levels, Time and Letter, and then unstack the last level, Type.
Let's say my original dataframe is named my_table. I ran
my_table.reset_index().set_index(['Time', 'Letter'])
but instead of grouping the rows so that every Time and Letter pair has BOTH Type x and Type y under it, the rows seem to have been sorted by Type (I added a few more entries to demonstrate the point):
Time(i) Letter(i) Type Value
0 A x 10
D x 25
H x 15
G x 33
1 B x 40
G x 10
3 C x 50
0 B y 20
H y 10
1 A y 30
Why does this happen? I expected a result like so:
Time Letter Type Value
0 A x 10
y 30
B y 20
H x 15
y 10
D x 25
G x 33
1 B x 40
G x 10
3 C x 50
The same behavior occurs when I make Type one of the indices, it just becomes bold as an index.
How do I successfully group columns using Time and Letter to get X and Y to be matched by those columns, so I can successfully use unstack?

You need to set Type as an index level as well:
df.set_index(['Time','Letter','Type']).Value.unstack(fill_value='').reset_index()
Out[178]:
Type  Time Letter   x   y
0        0      A  10
1        0      B      20
2        1      A      30
3        1      B  40
4        3      C  50
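For comparison, here is a sketch of the same reshaping with DataFrame.pivot instead of set_index plus unstack. This is a swapped-in technique, not the answer's code; it assumes each (Time, Letter, Type) combination occurs at most once and a pandas version that accepts a list for index. The names wide and result are illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    "Time": [0, 0, 1, 1, 3],
    "Letter": ["A", "B", "A", "B", "C"],
    "Type": ["x", "y", "y", "x", "x"],
    "Value": [10, 20, 30, 40, 50],
})

# One row per (Time, Letter), one column per Type; missing pairs become NaN
wide = df.pivot(index=["Time", "Letter"], columns="Type", values="Value")

# Rename the columns to TypeX / TypeY and turn the index back into columns
wide.columns = [f"Type{c.upper()}" for c in wide.columns]
result = wide.reset_index()

print(result)
```

Unlike the answer's fill_value='', this leaves the missing cells as NaN, which keeps the value columns numeric.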

populate new column in a pandas dataframe which takes input from other columns

I have a function that should take x, y, z as input and return r as output.
For example, my_func(x, y, z) takes x = 10, y = 'apple' and z = 2 and returns the value for column r. Similarly, it takes x = 20, y = 'orange' and z = 4 and populates column r. Any suggestions for efficient code to do this?
Before :
a x y z
5 10 'apple' 2
2 20 'orange' 4
0 4 'apple' 2
5 5 'pear' 6
After:
a x y z r
5 10 'apple' 2 x
2 20 'orange' 4 x
10 4 'apple' 2 x
5 5 'pear' 6 x

Depends on how complex your function is. In general you can use pandas.DataFrame.apply:
>>> def my_func(x):
... return '{0} - {1} - {2}'.format(x['y'],x['a'],x['x'])
...
>>> df['r'] = df.apply(my_func, axis=1)
>>> df
a x y z r
0 5 10 'apple' 2 'apple' - 5 - 10
1 2 20 'orange' 4 'orange' - 2 - 20
2 0 4 'apple' 2 'apple' - 0 - 4
3 5 5 'pear' 6 'pear' - 5 - 5
axis=1 makes your function work 'for each row' instead of 'for each column':
Objects passed to functions are Series objects having index either the
DataFrame’s index (axis=0) or the columns (axis=1)
But if it's a really simple function, like the one above, you can probably do it without a function at all, using vectorized operations.
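As a rough illustration of the vectorized route, the string built by my_func above can be assembled column by column. This sketch assumes the fruit names are stored as plain strings without the surrounding quotes shown in the printed table.

```python
import pandas as pd

df = pd.DataFrame({
    "a": [5, 2, 0, 5],
    "x": [10, 20, 4, 5],
    "y": ["apple", "orange", "apple", "pear"],
    "z": [2, 4, 2, 6],
})

# Whole-column string concatenation; numeric columns are cast to str first
df["r"] = df["y"] + " - " + df["a"].astype(str) + " - " + df["x"].astype(str)

print(df)
```

This avoids calling a Python function once per row, which is usually what makes apply with axis=1 slow on large frames.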
