I am new to data science and I want to count how many females and males there are for each Title.
I tried the following piece of code:
newdf = pd.DataFrame()
newdf['Title'] = full['Name'].map(lambda name: name.split(',')[1].split('.')[0].strip())
newdf['Age'] = full['Age']
newdf['Sex'] = full['Sex']
newdf.dropna(axis = 0,inplace=True)
print(newdf.head())
What I get is :
Title Age Sex
0 Mr 22.0 male
1 Mrs 38.0 female
2 Miss 26.0 female
3 Mrs 35.0 female
4 Mr 35.0 male
Then I try this to add the #male and #female columns:
df = pd.DataFrame()
df = newdf[['Age','Title']].groupby('Title').mean().sort_values(by='Age',ascending=False)
df['#People'] = newdf['Title'].value_counts()
df['Male'] = newdf['Title'].sum(newdf['Sex']=='male')
df['Female'] = newdf['Title'].sum(newdf['Sex']=='female')
Error message that I have:
TypeError: 'Series' objects are mutable, thus they cannot be hashed
What I expected is to have these columns: Title, Age (average), #People, #male, #female. So I want to know how many of those #People are male and how many are female.
P.S. Without these lines:
df['Male'] = newdf['Title'].sum(newdf['Sex']=='male')
df['Female'] = newdf['Title'].sum(newdf['Sex']=='female')
everything works fine, and I get:
Age #People
Title
Capt 70.000000 1
Col 54.000000 4
Sir 49.000000 1
Major 48.500000 2
Lady 48.000000 1
Dr 43.571429 7
....
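But without the #male and #female columns.
Use GroupBy.agg to aggregate the mean together with the size, then build the per-sex counts with crosstab and add them as new columns with DataFrame.join (note that the crosstab columns hold counts, despite the _avg suffix used here):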
df1 = (df.groupby('Title')['Age']
.agg([('Age','mean'),('#People','size')])
.sort_values(by='Age',ascending=False))
df2 = pd.crosstab(df['Title'], df['Sex']).add_suffix('_avg')
df = df1.join(df2)
print (df)
Age #People female_avg male_avg
Title
Mrs 36.5 2 2 0
Mr 28.5 2 0 2
Miss 26.0 1 1 0
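For reference, here is a minimal self-contained sketch of the same idea, built on the five sample rows shown in the question. The column names n_people, n_female and n_male are hypothetical names chosen for this example (they are not in the original answer), and named aggregation is used instead of the list of tuples. As far as I can tell, the original TypeError comes from Series.sum taking an axis as its first argument, so a boolean Series passed there cannot work as a filter.

```python
import pandas as pd

# Sample rows copied from the question's newdf.head() output.
newdf = pd.DataFrame(
    {
        "Title": ["Mr", "Mrs", "Miss", "Mrs", "Mr"],
        "Age": [22.0, 38.0, 26.0, 35.0, 35.0],
        "Sex": ["male", "female", "female", "female", "male"],
    }
)

# Mean age and group size per title (named aggregation).
stats = newdf.groupby("Title").agg(
    Age=("Age", "mean"),
    n_people=("Age", "size"),  # hypothetical column name
)

# Count of each sex per title; the prefix makes clear these are counts.
counts = pd.crosstab(newdf["Title"], newdf["Sex"]).add_prefix("n_")

summary = stats.join(counts).sort_values(by="Age", ascending=False)
print(summary)
```

The join aligns both frames on the Title index, so no explicit key is needed.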
I'm a beginner Python coder and I want to build a Python function that calculates a specific indicator.
As an example, the data looks like:
ID status Age Gender
01 healthy 16 Male
02 un_healthy 14 Female
03 un_healthy 22 Male
04 healthy 12 Female
05 healthy 33 Female
I want to build a function that calculates the percentage of healthy people, i.e. healthy / (healthy + un_healthy):
def health_rate(healthy, un_healthy,age){
if (age >= 15):
if (gender == "Male"):
return rateMale= (count(healthy)/count(healthy)+count(un_healthy))
Else
return rateFemale= (count(healthy)/count(healthy)+count(un_healthy))
Else
return print("underage");
and then just use .apply,
but the logic isn't right and I still don't get my desired output.
I want to return the male rate and the female rate.
You could use pivot_table (where df is your dataframe):
df = df[df.Age >= 15].pivot_table(
index="status", columns="Gender", values="ID",
aggfunc="count", margins=True, fill_value=0
)
Result for your sample dataframe:
Gender Female Male All
status
healthy 1 1 2
un_healthy 0 1 1
All 1 2 3
If you want percentages:
df = (df / df.loc["All", :] * 100).drop("All")
Result:
Gender Female Male All
status
healthy 100.0 50.0 66.666667
un_healthy 0.0 50.0 33.333333
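A compact variant of the same calculation, as a sketch only: pd.crosstab can normalize each column directly, which skips the manual division. The frame below is rebuilt from the sample data in the question; the variable names adults and rates are made up for this example.

```python
import pandas as pd

# Sample data from the question.
df = pd.DataFrame(
    {
        "ID": [1, 2, 3, 4, 5],
        "status": ["healthy", "un_healthy", "un_healthy", "healthy", "healthy"],
        "Age": [16, 14, 22, 12, 33],
        "Gender": ["Male", "Female", "Male", "Female", "Female"],
    }
)

# Keep only people aged 15 or older, as in the pivot_table answer.
adults = df[df["Age"] >= 15]

# normalize="columns" divides each Gender column by its own total.
rates = pd.crosstab(adults["status"], adults["Gender"], normalize="columns") * 100
print(rates)
```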
df[col_name].value_counts(normalize=True) gives you the proportions for the desired column. Here's how you can parameterize it:
def health_percentages(df, col_name):
    return df[col_name].value_counts(normalize=True) * 100
Example:
data = [ [1, 'healthy',16,'M'], [2, 'un_healthy',14,'F'], [3, 'un_healthy', 22, 'M'],[4, 'healthy', 12, 'F'],[5, 'healthy', 33, 'F']]
df = pd.DataFrame(data, columns = ['ID','status', 'Age', 'Gender'])
print(df)
print(health_percentages(df, 'status'))
#output:
ID status Age Gender
0 1 healthy 16 M
1 2 un_healthy 14 F
2 3 un_healthy 22 M
3 4 healthy 12 F
4 5 healthy 33 F
healthy 60.0
un_healthy 40.0
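To get one rate per gender, as the question asks, the age filter can be combined with a groupby. The following is a sketch under assumptions: the function name health_rate_by_gender and the min_age parameter are hypothetical, and the Gender values are spelled out as in the question's table rather than abbreviated.

```python
import pandas as pd

# Sample data from the question.
df = pd.DataFrame(
    {
        "ID": [1, 2, 3, 4, 5],
        "status": ["healthy", "un_healthy", "un_healthy", "healthy", "healthy"],
        "Age": [16, 14, 22, 12, 33],
        "Gender": ["Male", "Female", "Male", "Female", "Female"],
    }
)


def health_rate_by_gender(frame, min_age=15):
    """Return the percentage of healthy people per gender, ignoring anyone under min_age."""
    adults = frame[frame["Age"] >= min_age]
    # The mean of a boolean Series is the share of True values.
    is_healthy = adults["status"].eq("healthy")
    return is_healthy.groupby(adults["Gender"]).mean() * 100


rates = health_rate_by_gender(df)
print(rates)
```

The result is a Series indexed by Gender, so rates["Male"] and rates["Female"] give the two values separately.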
I have a Pandas dataframe similar to this one:
age name sex
0 30 jon male
1 blue php null
2 18 jane female
3 orange c++ null
and I am trying to concatenate every second row to the previous one adding extra columns:
age name sex colour language other
0 30 jon male blue php null
1 18 jane female orange c++ null
I tried shift(), but it duplicated every row.
How can this be done?
You can create a new dataframe by slicing the dataframe using iloc with a step of 2:
cols = ['age', 'name', 'sex']
new_cols = ['colour', 'language', 'other']
d = dict()
for col, ncol in zip(cols, new_cols):
    d[col] = df[col].iloc[::2].values    # even rows: original columns
    d[ncol] = df[col].iloc[1::2].values  # odd rows: new columns
pd.DataFrame(d)
Result:
age colour name language sex other
0 30 blue jon php male NaN
1 18 orange jane c++ female NaN
TRY:
df = pd.concat([df.iloc[::2].reset_index(drop=True), pd.DataFrame(
df.iloc[1::2].values, columns=['colour', 'language', 'other'])], axis=1)
OUTPUT:
age name sex colour language other
0 30 jon male blue php NaN
1 18 jane female orange c++ NaN
Reshape the values so that each pair of rows becomes one row, and create a new dataframe:
pd.DataFrame(df.values.reshape(-1, df.shape[1] * 2),
columns=['age', 'name', 'sex', 'colour', 'language', 'other'])
age name sex colour language other
0 30 jon male blue php NaN
1 18 jane female orange c++ NaN
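A self-contained sketch of the reshape approach, using the sample data from the question. It assumes the number of rows is even and that the rows strictly alternate between the two record types; the variable name wide is made up for this example.

```python
import pandas as pd

# Sample data from the question; None stands in for the null entries.
df = pd.DataFrame(
    {
        "age": [30, "blue", 18, "orange"],
        "name": ["jon", "php", "jane", "c++"],
        "sex": ["male", None, "female", None],
    }
)

# The reshape only works if rows come in complete pairs.
assert len(df) % 2 == 0, "expected an even number of rows"

# Each pair of consecutive rows is flattened into one row of six values.
wide = pd.DataFrame(
    df.to_numpy().reshape(-1, df.shape[1] * 2),
    columns=["age", "name", "sex", "colour", "language", "other"],
)
print(wide)
```

Because the source columns mix numbers and strings, every column in the result has object dtype; convert age explicitly if a numeric column is needed.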
I have the following pandas df:
Name
Jack
Alex
Jackie
Susan
I also have the following dict:
d = {'Jack':['Male','22'],'Alex':['Male','26'],'Jackie':['Female','28'],'Susan':['Female','30']}
I would like to add two columns for Gender and Age so that my df returns:
Name Gender Age
Jack Male 22
Alex Male 26
Jackie Female 28
Susan Female 30
I have tried:
df['Gender'] = df.Name.map(d[0])
df['Age'] = df.Name.map(d[1])
but no such luck. Any ideas or help would be much appreciated! Thanks!
df['Gender'] = df.Name.map(lambda x: d[x][0])
df['Age'] = df.Name.map(lambda x: d[x][1])
Take all the values of the dictionary and build a DataFrame from them (the rows follow the dictionary's insertion order):
d = {'Jack':['Male','22'],'Alex':['Male','26'],'Jackie':['Female','28'],'Susan':['Female','30']}
value_list = list(d.values())
df = pd.DataFrame(value_list, columns =['Gender', 'Age'])
print(df)
Use the pd.DataFrame constructor with Series.map, and use pd.concat to join the result to df:
df = pd.concat([df, pd.DataFrame(df.Name.map(d).tolist(), columns=['Gender', 'Age'])], axis=1)
print(df)
Name Gender Age
0 Jack Male 22
1 Alex Male 26
2 Jackie Female 28
3 Susan Female 30
These solutions also work when a name has no match in the dictionary, for example:
d = {'Alex':['Male','26'],'Jackie':['Female','28'],'Susan':['Female','30']}
print (df)
Name Gender Age
0 Alex Male 26
1 Jack NaN NaN
2 Jackie Female 28
3 Susan Female 30
Use DataFrame.from_dict on your dictionary and attach the result on the Name column with DataFrame.join. The advantage is that it works the same way if the input data has more columns:
df = df.join(pd.DataFrame.from_dict(d, orient='index', columns=['Gender','Age']), on='Name')
print (df)
Name Gender Age
0 Jack Male 22
1 Alex Male 26
2 Jackie Female 28
3 Susan Female 30
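The behaviour for a missing key can be checked with a small sketch. It uses the reduced dictionary without 'Jack' that appears in this thread; the variable names lookup and result are made up for this example.

```python
import pandas as pd

df = pd.DataFrame({"Name": ["Jack", "Alex", "Jackie", "Susan"]})

# 'Jack' is deliberately absent from the dictionary.
d = {"Alex": ["Male", "26"], "Jackie": ["Female", "28"], "Susan": ["Female", "30"]}

# One row per dictionary key, with the key as the index.
lookup = pd.DataFrame.from_dict(d, orient="index", columns=["Gender", "Age"])

# join(on="Name") matches df["Name"] against the index of lookup (a left join).
result = df.join(lookup, on="Name")
print(result)
```

Rows without a match keep their Name and get NaN in the new columns instead of raising an error.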
Your solution should work if you create two dictionaries:
d1 = {k:v[0] for k,v in d.items()}
d2 = {k:v[1] for k,v in d.items()}
df['Gender'] = df.Name.map(d1)
df['Age'] = df.Name.map(d2)
print (df)
Name Gender Age
0 Jack Male 22
1 Alex Male 26
2 Jackie Female 28
3 Susan Female 30
I have a dataframe called passenger_details which is shown below
Passenger Age Gender Commute_to_work Commute_mode Commute_time ...
Passenger1 32 Male I drive to work car 1 hour
Passenger2 26 Female I take the metro train NaN ...
Passenger3 33 Female NaN NaN 30 mins ...
Passenger4 29 Female I take the metro train NaN ...
...
I want to apply a function that turns missing values (NaN) into 0 and present values into 1, for the columns whose heading contains the string 'Commute'.
This is basically what I'm trying to achieve
Passenger Age Gender Commute_to_work Commute_mode Commute_time ...
Passenger1 32 Male 1 1 1
Passenger2 26 Female 1 1 0 ...
Passenger3 33 Female 0 0 1 ...
Passenger4 29 Female 1 1 0 ...
...
However, I'm struggling with how to phrase my code. This is what I have done
passenger_details = passenger_details.filter(regex = 'Location_', axis = 1).apply(lambda value: str(value).replace('value', '1', 'NaN','0'))
But I get a TypeError:
'replace() takes at most 3 arguments (4 given)'
Any help would be appreciated
Select the columns with Index.str.contains, test for non-missing values with DataFrame.notna, and finally cast to integer to map True/False to 1/0:
c = df.columns.str.contains('Commute')
df.loc[:, c] = df.loc[:, c].notna().astype(int)
print (df)
Passenger Age Gender Commute_to_work Commute_mode Commute_time
0 Passenger1 32 Male 1 1 1
1 Passenger2 26 Female 1 1 0
2 Passenger3 33 Female 0 0 1
3 Passenger4 29 Female 1 1 0
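A self-contained sketch of the same steps on the sample data from the question. It picks the columns with a plain list comprehension instead of str.contains and assigns through df[cols] rather than df.loc; both are choices made for this example, not part of the answer above. The text values in the Commute columns are approximated from the question's table.

```python
import numpy as np
import pandas as pd

passenger_details = pd.DataFrame(
    {
        "Passenger": ["Passenger1", "Passenger2", "Passenger3", "Passenger4"],
        "Age": [32, 26, 33, 29],
        "Gender": ["Male", "Female", "Female", "Female"],
        "Commute_to_work": ["I drive to work", "I take the metro", np.nan, "I take the metro"],
        "Commute_mode": ["car", "train", np.nan, "train"],
        "Commute_time": ["1 hour", np.nan, "30 mins", np.nan],
    }
)

# Column names that contain the substring 'Commute'.
commute_cols = [col for col in passenger_details.columns if "Commute" in col]

# notna() gives True for present values; astype(int) maps True/False to 1/0.
passenger_details[commute_cols] = passenger_details[commute_cols].notna().astype(int)
print(passenger_details)
```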
Hi, I am trying to assign certain values in the columns of a dataframe.
# Count the number of passengers per sex and title
full.groupby(['Sex', 'Title']).Title.count()
Sex Title
female Dona 1
Dr 1
Lady 1
Miss 260
Mlle 2
Mme 1
Mrs 197
Ms 2
the Countess 1
male Capt 1
Col 4
Don 1
Dr 7
Jonkheer 1
Major 2
Master 61
Mr 757
Rev 8
Sir 1
Name: Title, dtype: int64
The tail of my dataframe looks as follows:
Age Cabin Embarked Fare Name Parch PassengerId Pclass Sex SibSp Survived Ticket Title
413 NaN NaN S 8.0500 Spector, Mr. Woolf 0 1305 3 male 0 NaN A.5. 3236 Mr
414 39.0 C105 C 108.9000 Oliva y Ocana, Dona. Fermina 0 1306 1 female 0 NaN PC 17758 Dona
415 38.5 NaN S 7.2500 Saether, Mr. Simon Sivertsen 0 1307 3 male 0 NaN SOTON/O.Q. 3101262 Mr
416 NaN NaN S 8.0500 Ware, Mr. Frederick 0 1308 3 male 0 NaN 359309 Mr
417 NaN NaN C 22.3583 Peter, Master. Michael J 1 1309 3 male 1 NaN 2668 Master
The name of my dataframe is full and I want to change the values in Title.
Here is the code I wrote:
# Create a variable rare_title to modify the names of Title
rare_title = ['Dona', "Lady", "the Countess", "Capt", "Col", "Don", "Dr", "Major", "Rev", "Sir", "Jonkheer"]
# Also reassign mlle, ms, and mme accordingly
full[full.Title == "Mlle"].Title = "Miss"
full[full.Title == "Ms"].Title = "Miss"
full[full.Title == "Mme"].Title = "Mrs"
full[full.Title.isin(rare_title)].Title = "Rare Title"
I also tried the following code in pandas:
full.loc[full['Title'] == "Mlle", ['Sex', 'Title']] = "Miss"
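Still, the dataframe is not changed. Any help is appreciated.
Use loc-based indexing to select the matching rows and assign to them in a single step (chained indexing such as full[mask].Title = ... assigns to a temporary copy, so full itself never changes):
miss = ['Mlle', 'Ms']
rare_title = ['Dona', "Lady", ...]
full.loc[full.Title.isin(miss), 'Title'] = 'Miss'
full.loc[full.Title == 'Mme', 'Title'] = 'Mrs'
full.loc[full.Title.isin(rare_title), 'Title'] = 'Rare Title'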
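As a self-contained sketch of the loc approach, the snippet below applies the replacements from the question (Mlle and Ms become Miss, Mme becomes Mrs, rare titles become "Rare Title") to a small made-up frame. The seven sample titles are invented for illustration; the rare_title list is the one given in the question.

```python
import pandas as pd

# Small made-up sample; the real frame in the question is called full.
full = pd.DataFrame({"Title": ["Mr", "Mlle", "Ms", "Mme", "Dona", "Master", "Col"]})

rare_title = ["Dona", "Lady", "the Countess", "Capt", "Col", "Don",
              "Dr", "Major", "Rev", "Sir", "Jonkheer"]

# Row mask and column label go into one .loc call, so the assignment
# writes into full itself rather than into a temporary copy.
full.loc[full["Title"].isin(["Mlle", "Ms"]), "Title"] = "Miss"
full.loc[full["Title"] == "Mme", "Title"] = "Mrs"
full.loc[full["Title"].isin(rare_title), "Title"] = "Rare Title"

print(full["Title"].tolist())
```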