How to map to multiple values in a dictionary in pandas - python

I have the following pandas df:
Name
Jack
Alex
Jackie
Susan
i also have the following dict:
d = {'Jack':['Male','22'],'Alex':['Male','26'],'Jackie':['Female','28'],'Susan':['Female','30']}
I would like to add in two columns for Gender and Age so that my df returns:
Name Gender Age
Jack Male 22
Alex Male 26
Jackie Female 28
Susan Female 30
I have tried:
df['Gender'] = df.Name.map(d[0])
df['Age'] = df.Name.map(d[1])
but no such luck. Any ideas or help would be much appreciated! Thanks!

df['Gender'] = df.Name.map(lambda x: d[x][0])
df['Age'] = df.Name.map(lambda x: d[x][1])
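For reference, a minimal self-contained version of this approach (using the df and d from the question):
import pandas as pd

df = pd.DataFrame({'Name': ['Jack', 'Alex', 'Jackie', 'Susan']})
d = {'Jack': ['Male', '22'], 'Alex': ['Male', '26'],
     'Jackie': ['Female', '28'], 'Susan': ['Female', '30']}

# look each name up in the dict and take the relevant element of its list;
# note this raises KeyError if a name is missing from d
df['Gender'] = df.Name.map(lambda x: d[x][0])
df['Age'] = df.Name.map(lambda x: d[x][1])
print(df)
#      Name  Gender Age
# 0    Jack    Male  22
# 1    Alex    Male  26
# 2  Jackie  Female  28
# 3   Susan  Female  30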

Take all the values of the dictionary and build the new columns from them (note this builds a frame from the dict values alone, so it relies on the dict order matching the Name column):
d = {'Jack':['Male','22'],'Alex':['Male','26'],'Jackie':['Female','28'],'Susan':['Female','30']}
value_list = list(d.values())
df = pd.DataFrame(value_list, columns =['Gender', 'Age'])
print(df)

Use the pd.DataFrame constructor with Series.map, then pd.concat to concatenate the result with df:
In [2696]: df = pd.concat([df,pd.DataFrame(df.Name.map(d).tolist(), columns=['Gender', 'Age'])], axis=1)
In [2697]: df
Out[2697]:
Name Gender Age
0 Jack Male 22
1 Alex Male 26
2 Jackie Female 28
3 Susan Female 30

The solutions also work well if a name has no match in the dictionary, e.g.:
d = {'Alex':['Male','26'],'Jackie':['Female','28'],'Susan':['Female','30']}
print (df)
Name Gender Age
0 Jack NaN NaN
1 Alex Male 26
2 Jackie Female 28
3 Susan Female 30
Use DataFrame.from_dict on your dictionary and join it to the Name column with DataFrame.join; the advantage is that if the input data has more columns, everything works the same way:
df = df.join(pd.DataFrame.from_dict(d, orient='index', columns=['Gender','Age']), on='Name')
print (df)
Name Gender Age
0 Jack Male 22
1 Alex Male 26
2 Jackie Female 28
3 Susan Female 30
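To illustrate the "more columns" point, here is a hypothetical sketch with a third value per key (the city values are made up), starting again from the Name-only frame; the pattern is unchanged:
df = pd.DataFrame({'Name': ['Jack', 'Alex', 'Jackie', 'Susan']})
# hypothetical dict with an extra City value per name
d3 = {'Jack': ['Male', '22', 'NY'], 'Alex': ['Male', '26', 'LA'],
      'Jackie': ['Female', '28', 'SF'], 'Susan': ['Female', '30', 'TX']}
df = df.join(pd.DataFrame.from_dict(d3, orient='index',
                                    columns=['Gender', 'Age', 'City']), on='Name')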
Your solution should work if you create 2 dictionaries:
d1 = {k:v[0] for k,v in d.items()}
d2 = {k:v[1] for k,v in d.items()}
df['Gender'] = df.Name.map(d1)
df['Age'] = df.Name.map(d2)
print (df)
Name Gender Age
0 Jack Male 22
1 Alex Male 26
2 Jackie Female 28
3 Susan Female 30

Related

How to structure a Python function that takes input from a dataframe to calculate a specific indicator

I'm a beginner Python coder and I want to build a Python function that calculates a specific indicator.
For example, the data looks like:
ID status Age Gender
01 healthy 16 Male
02 un_healthy 14 Female
03 un_healthy 22 Male
04 healthy 12 Female
05 healthy 33 Female
I want to build a function that calculates the percentage of healthy people out of healthy + un_healthy:
def health_rate(healthy, un_healthy, age){
    if (age >= 15):
        if (gender == "Male"):
            return rateMale= (count(healthy)/count(healthy)+count(un_healthy))
        Else
            return rateFemale= (count(healthy)/count(healthy)+count(un_healthy))
    Else
        return print("underage");
and then just use .apply, but the logic isn't right and I still don't get my desired output.
I want to return a Male rate and a Female rate.
You could use pivot_table (with df being your dataframe):
df = df[df.Age >= 15].pivot_table(
    index="status", columns="Gender", values="ID",
    aggfunc="count", margins=True, fill_value=0
)
Result for your sample dataframe:
Gender Female Male All
status
healthy 1 1 2
un_healthy 0 1 1
All 1 2 3
If you want percentages:
df = (df / df.loc["All", :] * 100).drop("All")
Result:
Gender Female Male All
status
healthy 100.0 50.0 66.666667
un_healthy 0.0 50.0 33.333333
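A self-contained sketch of the whole pipeline, assuming the sample data from the question:
import pandas as pd

df = pd.DataFrame({'ID': ['01', '02', '03', '04', '05'],
                   'status': ['healthy', 'un_healthy', 'un_healthy', 'healthy', 'healthy'],
                   'Age': [16, 14, 22, 12, 33],
                   'Gender': ['Male', 'Female', 'Male', 'Female', 'Female']})

# count people aged 15+ per status/gender, then turn the counts into percentages
counts = df[df.Age >= 15].pivot_table(index='status', columns='Gender', values='ID',
                                      aggfunc='count', margins=True, fill_value=0)
rates = (counts / counts.loc['All', :] * 100).drop('All')
print(rates)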
df[col_name].value_counts(normalize=True) gives you the proportions for the desired column. Here's how you can parameterize it:
def health_percentages(df, col_name):
    return df[col_name].value_counts(normalize=True) * 100
Example:
data = [ [1, 'healthy',16,'M'], [2, 'un_healthy',14,'F'], [3, 'un_healthy', 22, 'M'],[4, 'healthy', 12, 'F'],[5, 'healthy', 33, 'F']]
df = pd.DataFrame(data, columns = ['ID','status', 'Age', 'Gender'])
print(df)
print(health_percentages(df, 'status'))
#output:
ID status Age Gender
0 1 healthy 16 M
1 2 un_healthy 14 F
2 3 un_healthy 22 M
3 4 healthy 12 F
4 5 healthy 33 F
healthy 60.0
un_healthy 40.0
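If you also need the rate split by Gender (the question asks for a Male rate and a Female rate), one possible extension, a sketch only and not part of the answer above, is to group before normalizing:
# proportion of each status within each gender, in percent
print(df.groupby('Gender')['status'].value_counts(normalize=True) * 100)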

Python Pandas concatenate every 2nd row to previous row

I have a Pandas dataframe similar to this one:
age name sex
0 30 jon male
1 blue php null
2 18 jane female
3 orange c++ null
and I am trying to concatenate every second row to the previous one adding extra columns:
age name sex colour language other
0 30 jon male blue php null
1 18 jane female orange c++ null
I tried shift() but it duplicated every row.
How can this be done?
You can create a new dataframe by slicing the dataframe using iloc with a step of 2:
cols = ['age', 'name', 'sex']
new_cols = ['colour', 'language', 'other']
d = dict()
for col, ncol in zip(cols, new_cols):
    d[col] = df[col].iloc[::2].values    # even rows keep the original column name
    d[ncol] = df[col].iloc[1::2].values  # odd rows move into the new column
pd.DataFrame(d)
Result:
age colour name language sex other
0 30 blue jon php male NaN
1 18 orange jane c++ female NaN
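If the interleaved column order is not what you want, a small follow-up sketch (reusing cols, new_cols and d from above) reorders the result:
result = pd.DataFrame(d)[cols + new_cols]   # original columns first, then the new ones
print(result)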
TRY:
df = pd.concat([df.iloc[::2].reset_index(drop=True),
                pd.DataFrame(df.iloc[1::2].values,
                             columns=['colour', 'language', 'other'])], axis=1)
OUTPUT:
age name sex colour language other
0 30 jon male blue php NaN
1 18 jane female orange c++ NaN
Reshape the values and create a new dataframe
pd.DataFrame(df.values.reshape(-1, df.shape[1] * 2),
columns=['age', 'name', 'sex', 'colour', 'language', 'other'])
age name sex colour language other
0 30 jon male blue php NaN
1 18 jane female orange c++ NaN
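A runnable version of the reshape approach, assuming the sample frame from the question:
import pandas as pd
import numpy as np

df = pd.DataFrame({'age': [30, 'blue', 18, 'orange'],
                   'name': ['jon', 'php', 'jane', 'c++'],
                   'sex': ['male', np.nan, 'female', np.nan]})

# every pair of adjacent rows becomes one row with twice as many columns;
# this assumes an even number of rows and that each pair belongs together
out = pd.DataFrame(df.values.reshape(-1, df.shape[1] * 2),
                   columns=['age', 'name', 'sex', 'colour', 'language', 'other'])
print(out)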

Choose higher value based off column value between two dataframes

My question is about choosing values based on two dataframes.
>>> df[['age','name']]
age name
0 44 Anna
1 22 Bob
2 33 Cindy
3 44 Danis
4 55 Cindy
5 66 Danis
6 11 Anna
7 43 Bob
8 12 Cindy
9 19 Danis
10 11 Anna
11 32 Anna
12 55 Anna
13 33 Anna
14 32 Anna
>>> df2[['age','name']]
age name
5 66 Danis
4 55 Cindy
0 44 Anna
7 43 Bob
The expected result is every row of df whose 'age' value is higher than the corresponding age in df2, matched on the column 'name'.
expected result
age name
12 55 Anna
Per comments, use merge and filter dataframe:
df.merge(df2, on='name', suffixes=('', '_y')).query('age > age_y')[['name', 'age']]
Output:
name age
4 Anna 55
IIUC, you can use this to find the max age of all names:
pd.concat([df,df2]).groupby('name')['age'].max()
Output:
name
Anna 55
Bob 43
Cindy 55
Danis 66
Name: age, dtype: int64
Try this:
index = df[df['age'] > age].index
df.loc[index]
There are a few edge cases you don't mention how you would like to resolve, but generally what you want to do is iterate down the df, compare ages, and use the larger. You could do so in the following manner:
df3 = pd.DataFrame(columns=['age', 'name'])
for x in range(len(df)):
    if df['age'][x] > df2['age'][x]:
        df3.loc[x] = [df['age'][x], df['name'][x]]
    else:
        df3.loc[x] = [df2['age'][x], df2['name'][x]]
Although you will need to modify this to reflect how you want to resolve names that are only in one list, or if the lists are of different sizes.
One solution that comes to mind is to merge and then filter with query:
df.merge(df2, on='name', suffixes=('', '_y')).query('age.gt(age_y)', engine='python')[['age','name']]
Out[175]:
age name
4 55 Anna
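If you also want the result to keep the original index from df (12, as in the expected output), a hedged variant of the merge approach carries the index through:
# keep df's original index as a column, merge, filter, then restore it
merged = df.reset_index().merge(df2, on='name', suffixes=('', '_y'))
print(merged.query('age > age_y').set_index('index')[['age', 'name']])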

How can I count how many male/female are in each title?

I am a newbie to data science and I want to count how many females/males are in each Title.
I tried the following piece of code:
newdf = pd.DataFrame()
newdf['Title'] = full['Name'].map(lambda name: name.split(',')[1].split('.')[0].strip())
newdf['Age'] = full['Age']
newdf['Sex'] = full['Sex']
newdf.dropna(axis=0, inplace=True)
print(newdf.head())
What I get is :
Title Age Sex
0 Mr 22.0 male
1 Mrs 38.0 female
2 Miss 26.0 female
3 Mrs 35.0 female
4 Mr 35.0 male
Then I try this to add the #male and #female columns:
df = pd.DataFrame()
df = newdf[['Age','Title']].groupby('Title').mean().sort_values(by='Age',ascending=False)
df['#People'] = newdf['Title'].value_counts()
df['Male'] = newdf['Title'].sum(newdf['Sex']=='male')
df['Female'] = newdf['Title'].sum(newdf['Sex']=='female')
Error message that I have:
TypeError: 'Series' objects are mutable, thus they cannot be hashed
What I expect is four columns: Age (average), #People, #male, #female, indexed by Title, so I know how many of those #People are male and female.
P.S. Without these lines:
df['Male'] = newdf['Title'].sum(newdf['Sex']=='male')
df['Female'] = newdf['Title'].sum(newdf['Sex']=='female')
everything works fine,and I get:
Age #People
Title
Capt 70.000000 1
Col 54.000000 4
Sir 49.000000 1
Major 48.500000 2
Lady 48.000000 1
Dr 43.571429 7
....
But without #male,#female.
Use GroupBy.agg to aggregate the mean together with the group size, and for the new columns add a crosstab via DataFrame.join:
df1 = (df.groupby('Title')['Age']
         .agg([('Age', 'mean'), ('#People', 'size')])
         .sort_values(by='Age', ascending=False))
df2 = pd.crosstab(df['Title'], df['Sex']).add_suffix('_avg')
df = df1.join(df2)
print (df)
Age #People female_avg male_avg
Title
Mrs 36.5 2 2 0
Mr 28.5 2 0 2
Miss 26.0 1 1 0
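For reference, a self-contained sketch built from the five-row head shown in the question reproduces that output:
import pandas as pd

df = pd.DataFrame({'Title': ['Mr', 'Mrs', 'Miss', 'Mrs', 'Mr'],
                   'Age': [22.0, 38.0, 26.0, 35.0, 35.0],
                   'Sex': ['male', 'female', 'female', 'female', 'male']})

df1 = (df.groupby('Title')['Age']
         .agg([('Age', 'mean'), ('#People', 'size')])   # mean age and group size per title
         .sort_values(by='Age', ascending=False))
df2 = pd.crosstab(df['Title'], df['Sex']).add_suffix('_avg')  # male/female counts; '_avg' is just the label used above
print(df1.join(df2))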

add a row at top in pandas dataframe [duplicate]

This question already has answers here:
Insert a row to pandas dataframe
(18 answers)
Closed 4 years ago.
Below is my dataframe
import pandas as pd
df = pd.DataFrame({'name': ['jon','sam','jane','bob'],
                   'age': [30,25,18,26],
                   'sex': ['male','male','female','male']})
age name sex
0 30 jon male
1 25 sam male
2 18 jane female
3 26 bob male
I want to insert a new row at the first position
name: dean, age: 45, sex: male
age name sex
0 45 dean male
1 30 jon male
2 25 sam male
3 18 jane female
4 26 bob male
What is the best way to do this in pandas?
Probably this is not the most efficient way but:
df.loc[-1] = ['45', 'Dean', 'male'] # adding a row
df.index = df.index + 1 # shifting index
df.sort_index(inplace=True)
Output:
age name sex
0 45 Dean male
1 30 jon male
2 25 sam male
3 18 jane female
4 26 bob male
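If this needs to happen repeatedly, one option (a sketch, not taken from the answer above; the helper name is made up and it assumes the frame has a default 0..n-1 RangeIndex) is to wrap the three steps in a function:
def insert_row_at_top(df, row):
    df.loc[-1] = row          # add the row with a temporary index of -1
    df.index = df.index + 1   # shift all indices up by one, so the new row becomes 0
    return df.sort_index()

df = insert_row_at_top(df, [45, 'Dean', 'male'])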
If it's going to be a frequent operation, then it makes sense (in terms of performance) to gather the data into a list first and then use pd.concat([], ignore_index=True) (similar to #Serenity's solution):
Demo:
data = []
# always inserting new rows at the first position - last row will be always on top
data.insert(0, {'name': 'dean', 'age': 45, 'sex': 'male'})
data.insert(0, {'name': 'joe', 'age': 33, 'sex': 'male'})
#...
pd.concat([pd.DataFrame(data), df], ignore_index=True)
In [56]: pd.concat([pd.DataFrame(data), df], ignore_index=True)
Out[56]:
age name sex
0 33 joe male
1 45 dean male
2 30 jon male
3 25 sam male
4 18 jane female
5 26 bob male
PS I wouldn't call .append(), pd.concat(), .sort_index() too frequently (for each single row) as it's pretty expensive. So the idea is to do it in chunks...
#edyvedy13's solution worked great for me. However it needs to be updated for the deprecation of pandas' sort method - now replaced with sort_index.
df.loc[-1] = ['45', 'Dean', 'male'] # adding a row
df.index = df.index + 1 # shifting index
df = df.sort_index() # sorting by index
Use pandas.concat and reindex the new dataframe:
import pandas as pd
df = pd.DataFrame({'name': ['jon','sam','jane','bob'],
                   'age': [30,25,18,26],
                   'sex': ['male','male','female','male']})
# new line
line = pd.DataFrame({'name': 'dean', 'age': 45, 'sex': 'male'}, index=[0])
# concatenate the two dataframes (df.ix is removed in modern pandas, so pass df directly)
df2 = pd.concat([line, df]).reset_index(drop=True)
print (df2)
Output:
age name sex
0 45 dean male
1 30 jon male
2 25 sam male
3 18 jane female
4 26 bob male
import pandas as pd
df = pd.DataFrame({'name': ['jon','sam','jane','bob'],
                   'age': [30,25,18,26],
                   'sex': ['male','male','female','male']})
df1 = pd.DataFrame({'name': ['dean'], 'age': [45], 'sex': ['male']})
df1 = pd.concat([df1, df])   # DataFrame.append is deprecated in newer pandas; concat is equivalent
df1 = df1.reset_index(drop=True)
That works.
This works for me.
>>> import pandas as pd
>>> df = pd.DataFrame({'name': ['jon','sam','jane','bob'],
...                    'age': [30,25,18,26],
...                    'sex': ['male','male','female','male']})
>>> df
age name sex
0 30 jon male
1 25 sam male
2 18 jane female
3 26 bob male
>>> df.loc['a']=[45,'dean','male']
>>> df
age name sex
0 30 jon male
1 25 sam male
2 18 jane female
3 26 bob male
a 45 dean male
>>> newIndex=['a']+[ind for ind in df.index if ind!='a']
>>> df=df.reindex(index=newIndex)
>>> df
age name sex
a 45 dean male
0 30 jon male
1 25 sam male
2 18 jane female
3 26 bob male
