Grouping by in Pandas dataframe - python

I feel like I am missing something really simple here; can someone tell me what is wrong with this code?
I am trying to group by Sex where the Age > 30 and the Survived value = 1.
'Sex' is a boolean value (1 or 0), if that makes a difference.
data_r.groupby('Sex')([data_r.Age >30],[data_r.Survived == 1]).count()
This is throwing:
"'DataFrameGroupBy' object is not callable"
Any ideas? Thanks.

You need to filter first and then groupby. groupby('Sex') returns a DataFrameGroupBy object, and calling it like a function is what raises the error.
data_r[(data_r.Age>30) & (data_r.Survived==1)].groupby('Sex').count()
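If it helps, here is a minimal runnable sketch of the same idea with made-up data (only the column names are taken from the question):

import pandas as pd

# Made-up Titanic-like sample; column names follow the question.
data_r = pd.DataFrame({'Sex': [1, 0, 1, 0],
                       'Age': [45, 22, 38, 60],
                       'Survived': [1, 1, 0, 1]})

# Each mask needs its own parentheses because & binds tighter than > and ==.
print(data_r[(data_r.Age > 30) & (data_r.Survived == 1)].groupby('Sex').count())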

You can do your filtering before grouping, using query:
data_r.query('Age > 30 and Survived == 1').groupby('Sex').count()
Output:
        PassengerId  Survived  Pclass  Name  Age  SibSp  Parch  Ticket  Fare  Cabin  Embarked
Sex
female           83        83      83    83   83     83     83      83    83     47        81
male             41        41      41    41   41     41     41      41    41     25        41
IMHO, I'd use size; it is safer, since count does not include null (NaN) values. Notice the different values across the columns above: that is due to NaN values.
data_r.query('Age > 30 and Survived == 1').groupby('Sex').size()
Output:
Sex
female 83
male 41
dtype: int64
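To see that count/size difference concretely, here is a tiny made-up example with a missing value:

import numpy as np
import pandas as pd

# Made-up frame: one missing Cabin value in the female group.
df = pd.DataFrame({'Sex': ['female', 'female', 'male'],
                   'Cabin': ['C1', np.nan, 'C2']})
print(df.groupby('Sex').count())  # count skips NaN: female -> 1
print(df.groupby('Sex').size())   # size counts rows: female -> 2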

assign values to a column based on multiple columns in dataframe

I have the following code, where I am trying to assign a value to a column based on the age of the person:
conditions = [df['age']<=25,df['age']>25,df['age']>=50]
values = ['age below 25','between 25 and 50','50+']
df['age category']=np.select(conditions,values)
Output:
gender name age age category
0 male A 45 between 25 and 50
1 female B 22 age below 25
2 other C 54 between 25 and 50
For the age 54 it should assign the age category 50+,
so I have tried the following code, which throws an error:
conditions = [df['age']<=25,(df['age']>25 & df['age']<50),df['age']>=50]
values = ['age below 25','between 25 and 50','50+']
df['age category']=np.select(conditions,values)
I think we can use either where, select or loc for this, but I am not entirely sure. Thanks in advance.
I would use cut here:
import numpy as np
import pandas as pd

### user defined threshold ages in order
ages = [25, 50]
### below is programmatic
labels = ([f'age below {ages[0]}']
          + [f'between {a} and {b}'
             for a, b in zip(ages, ages[1:])]
          + [f'{ages[-1]}+']
          )
df['age category'] = pd.cut(df['age'], bins=[0]+ages+[np.inf], labels=labels)
Output:
gender name age age category
0 male A 45 between 25 and 50
1 female B 22 age below 25
2 other C 54 50+
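One design note: pd.cut bins are right-inclusive by default (right=True), so bins=[0]+ages+[np.inf] gives the intervals (0, 25], (25, 50], (50, inf]. A quick boundary sketch:

import numpy as np
import pandas as pd

edge = pd.Series([25, 50])  # ages exactly on the thresholds
print(pd.cut(edge, bins=[0, 25, 50, np.inf],
             labels=['age below 25', 'between 25 and 50', '50+']))
# 25 -> 'age below 25', 50 -> 'between 25 and 50'; pass right=False to flip this.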
You can use the default parameter of np.select, and since the first matching condition is selected, you can use:
conditions = [df['age'] < 25, df['age'] < 50]
values = ['age below 25', 'between 25 and 50']
df['age category'] = np.select(conditions, values, default='50+')
print(df)
# Output:
age age category
0 56 50+
1 18 age below 25
2 39 between 25 and 50
3 21 age below 25
4 13 age below 25
5 24 age below 25
6 54 50+
7 47 between 25 and 50
8 43 between 25 and 50
9 60 50+
10 65 50+
11 21 age below 25
12 53 50+
13 66 50+
14 52 50+
15 13 age below 25
16 10 age below 25
17 46 between 25 and 50
18 13 age below 25
19 57 50+
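As an aside, the attempt in the question errors because & binds tighter than the comparison operators, so each comparison needs its own parentheses. The original three-condition version should run as:

conditions = [df['age'] <= 25,
              (df['age'] > 25) & (df['age'] < 50),
              df['age'] >= 50]
values = ['age below 25', 'between 25 and 50', '50+']
df['age category'] = np.select(conditions, values)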

Add number to grouped pandas values according to their occurrence in the group

I would like to add a number to repeated room occurrences with the same ID. My dataframe has two columns ('ID' and 'Room'); I want to add a number to each Room according to its occurrence in the 'Room' column for a single ID. Underneath are the original df and the desired df.
Example: ID 34 has 3 bedrooms, so I want the first to be -> Bedroom_1, the second -> Bedroom_2 and the third -> Bedroom_3.
original df:
ID Room
34 Livingroom
34 Bedroom
34 Kitchen
34 Bedroom
34 Bedroom
34 Storage
50 Kitchen
50 Kitchen
89 Livingroom
89 Bedroom
89 Bedroom
98 Livingroom
Desired df:
ID Room
34 Livingroom_1
34 Bedroom_1
34 Kitchen_1
34 Bedroom_2
34 Bedroom_3
34 Storage_1
50 Kitchen_1
50 Kitchen_2
89 Livingroom_1
89 Bedroom_1
89 Bedroom_2
98 Livingroom_1
Tried code:
import pandas as pd
import numpy as np

data = pd.DataFrame({"ID": [34, 34, 34, 34, 34, 34, 50, 50, 89, 89, 89, 98],
                     "Room": ['Livingroom', 'Bedroom', 'Kitchen', 'Bedroom', 'Bedroom', 'Storage',
                              'Kitchen', 'Kitchen', 'Livingroom', 'Bedroom', 'Bedroom', 'Livingroom']})
df = pd.DataFrame(columns=['ID'])
for i in range(df['Room'].nunique()):
    df_new = (df[df['Room'] == ])
    df_new.columns = ['ID', 'Room' + str(i)]
    df_result = df_result.merge(df_new, on='ID', how='outer')
Let's try concatenating Room with the cumcount of the df grouped by ID and Room, as follows (cumcount numbers each row within its group starting at 0, hence the +1):
df = df.assign(Room=df.Room + "_" + (df.groupby(['ID', 'Room']).cumcount() + 1).astype(str))
ID Room
0 34 Livingroom_1
1 34 Bedroom_1
2 34 Kitchen_1
3 34 Bedroom_2
4 34 Bedroom_3
5 34 Storage_1
6 50 Kitchen_1
7 50 Kitchen_2
8 89 Livingroom_1
9 89 Bedroom_1
10 89 Bedroom_2
11 98 Livingroom_1
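If it helps to see the intermediate step, the cumcount alone (run on the original df, before the assign) produces the per-(ID, Room) occurrence numbers that become the suffixes:

print(df.groupby(['ID', 'Room']).cumcount() + 1)
# 1 1 1 2 3 1 1 2 1 1 2 1  -> the _1/_2/_3 suffixes above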
import inflect

p = inflect.engine()
# group by both ID and Room so the numbering restarts per ID;
# note p.ordinal(1) gives '1st', so the suffixes come out as '_1st', '_2nd', ...
df['Room'] += df.groupby(['ID', 'Room']).cumcount().add(1).map(p.ordinal).radd('_')
print(df)
I copied this from https://stackoverflow.com/a/59951701/3756587.
Here is some code that can do that for you. I'm basically breaking it down into three steps:
1. Perform a groupby apply to run a custom function over each group. This allows you to generate new names for each pair of ID, Room.
2. Reduce the MultiIndex to the original index. Because we are grouping on two columns, the index is now a hierarchical grouping of the two; we discard the Room level because we want to use our new names.
3. Perform an explode on each entry. For simplicity, the apply result is computed as an array; a subsequent explode then gives each element of the array its own row.
import numpy as np

def f(rooms_col):
    arr = np.empty(len(rooms_col), dtype=object)
    for i, name in enumerate(rooms_col):
        arr[i] = name + f"_{i + 1}"
    return arr

# assuming data is the data from above
tmp_df = data.groupby(["ID", "Room"])["Room"].apply(f)
# Drop the old room name
tmp_df.index = tmp_df.index.droplevel(1)
# Explode the results array -> 1 row per entry
df = tmp_df.explode()
print(df)
Here is your output:
ID
34 Bedroom_1
34 Bedroom_2
34 Bedroom_3
34 Kitchen_1
34 Livingroom_1
34 Storage_1
50 Kitchen_1
50 Kitchen_2
89 Bedroom_1
89 Bedroom_2
89 Livingroom_1
98 Livingroom_1
Name: Room, dtype: object
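Note that, unlike the cumcount approaches above, this result comes back sorted by the (ID, Room) group keys rather than in the original row order; passing sort=False to groupby keeps groups in order of first appearance instead.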

Find the maximum value in a column with respect to another column

I have the below data frame as input:
first_name last_name age preTestScore postTestScore
0 Jason Miller 42 4 25
1 Molly Jacobson 52 24 94
2 Tina Ali 36 31 57
3 Jake Milner 24 2 62
4 Amy Cooze 73 3 70
I want the output as:
Amy 73
So basically I want to find the highest value in the age column, and I also want the name of the person with that highest age.
I tried with pandas using groupby as below:
df2=df.groupby(['first_name'])['age'].max()
But with this I am getting the output below:
first_name
Amy 73
Jake 24
Jason 42
Molly 52
Tina 36
Name: age, dtype: int64
whereas I only want
Amy 73
How shall I go about it in pandas?
You can get your result with the code below:
df.loc[df.age.idxmax(),['first_name','age']]
Here, with df.age.idxmax() we are getting the index of the row which has the maximum age value.
Then with df.loc[df.age.idxmax(),['first_name','age']] we are getting the columns 'first_name' & 'age' at that index.
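One caveat worth knowing: idxmax returns the label of the first maximum only, so if several people shared the top age you would still get a single row back. The result of the df.loc lookup here is a Series, e.g.:

oldest = df.loc[df.age.idxmax(), ['first_name', 'age']]
print(oldest['first_name'], oldest['age'])  # Amy 73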
This line of code should do the work:
df[df['age']==df['age'].max()][['first_name','age']]
The [['first_name','age']] holds the names of the columns you want in the result output.
Change as you want.
In this case the output will be:
  first_name  age
4        Amy   73
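Unlike the idxmax approach above, this boolean mask keeps every row that ties for the maximum age, which may be what you want if the oldest age is shared.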

Pivoting count of column value using python pandas

I have student data with IDs and some values, and I need to pivot the table for a count of ID.
Here's an example of data:
id name maths science
0 B001 john 50 60
1 B021 Kenny 89 77
2 B041 Jessi 100 89
3 B121 Annie 91 73
4 B456 Mark 45 33
pivot table:
count of ID
5
There are lots of different ways to approach this; I would use either shape or nunique(), as Sandeep suggested.
import pandas as pd

data = {'id': ['0', '1', '2', '3', '4'],
        'name': ['john', 'kenny', 'jessi', 'Annie', 'Mark'],
        'math': [50, 89, 100, 91, 45],
        'science': [60, 77, 89, 73, 33]}
df = pd.DataFrame(data)
print(df)
id name math science
0 0 john 50 60
1 1 kenny 89 77
2 2 jessi 100 89
3 3 Annie 91 73
4 4 Mark 45 33
then use either of the following:
df.shape[0], which gives you the number of rows in the DataFrame (shape is an attribute, not a method, so no parentheses);
or:
In:  df['id'].nunique()
Out: 5
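The two differ as soon as ids repeat; a quick made-up sketch:

dup = pd.DataFrame({'id': ['B001', 'B001', 'B021']})
print(dup.shape[0])         # 3 -- total rows
print(dup['id'].nunique())  # 2 -- distinct ids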

How to merge rows (with strings) based on column value (int) in Pandas dataframe?

I have datasets in the format
df1=
userid movieid tags timestamp
73 130682 b movie 1432523704
73 130682 comedy 1432523704
73 130682 horror 1432523704
77 1199 Trilogy of the Imagination 1163220043
77 2968 Gilliam 1163220138
77 2968 Trilogy of the Imagination 1163220039
77 4467 Trilogy of the Imagination 1163220065
77 4911 Gilliam 1163220167
77 5909 Takashi Miike 1163219591
and I want another dataframe to be in format
df2=
userid tags
73 b movie[1] comedy[1] horror[1]
77 Trilogy of the Imagination[3] Gilliam[1] Takashi Miike[1]
such that I can merge all tags together for word counts / term frequency.
In short, I want all the tags for one userid concatenated together, separated by " " (one space), so that I can also count the number of occurrences of each word. I am unable to concatenate the strings in tags together; I can count words and their occurrences. Any help/advice would be appreciated.
First count and reformat the result of the count per group. Keep it as an intermediate result:
r = df.groupby('userid').apply(lambda g: g.tags.value_counts()).reset_index(level=-1)
r
Out[46]:
level_1 tags
userid
73 b movie 1
73 horror 1
73 comedy 1
77 Trilogy of the Imagination 3
77 Gilliam 2
77 Takashi Miike 1
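(reset_index(level=-1) moves the innermost index level, the tag text, out into a column; because that level has no name, pandas auto-names the column level_1, which is why it shows up above.)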
This simple string manipulation will give you the result per line:
r.level_1+'['+r.tags.astype(str)+']'
Out[49]:
userid
73 b movie[1]
73 horror[1]
73 comedy[1]
77 Trilogy of the Imagination[3]
77 Gilliam[2]
77 Takashi Miike[1]
The neat part of being in Python is being able to finish it off like this:
(r.level_1+'['+r.tags.astype(str)+']').groupby(level=0).apply(' '.join)
Out[50]:
userid
73 b movie[1] horror[1] comedy[1]
77 Trilogy of the Imagination[3] Gilliam[2] Takas...
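If you prefer one version-stable pipeline, the same result can be sketched with groupby(...).size() in place of value_counts (assuming df1 is the frame from the question; note the tags then come out sorted alphabetically per user rather than by frequency):

# occurrences per (user, tag), then format as 'tag[n]' and join per user
counts = df1.groupby(['userid', 'tags']).size().reset_index(name='n')
counts['tag'] = counts['tags'] + '[' + counts['n'].astype(str) + ']'
df2 = counts.groupby('userid')['tag'].agg(' '.join).reset_index(name='tags')
print(df2)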
