I have a pandas dataframe that looks as follows:
df =
key value
1 Level 1
2 Age 35
3 Height 180
4 Gender 0
...
and a dictionary as follows:
my_dict = {
'Level':{0: 'Low', 1:'Medium', 2:'High'},
'Gender': {0: 'Female', 1: 'Male'}
}
I want to map from the dictionary to the dataframe and replace each entry in the 'value' column with its corresponding value in the dictionary, so that the output becomes:
key value
1 Level Medium
2 Age 35
3 Height 180
4 Gender Female
...
It's okay if the other values in the column become strings as well. How can I achieve this? Thanks for the help.
You can do this with replace:
out = df.set_index('key').T.replace(my_dict).T.reset_index()
out
Out[27]:
key value
0 Level Medium
1 Age 35
2 Height 180
3 Gender Female
df.at[1, 'value'] = my_dict['Level'][df.at[1, 'value']]
df.at[4, 'value'] = my_dict['Gender'][df.at[4, 'value']]
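The two df.at lines above hard-code the row labels. A minimal sketch of the same lookup applied to every row, assuming the 'value' column holds integers and that keys missing from my_dict (Age, Height) should keep their original value:

```python
import pandas as pd

# Sample frame and dictionary copied from the question
df = pd.DataFrame(
    {'key': ['Level', 'Age', 'Height', 'Gender'], 'value': [1, 35, 180, 0]},
    index=[1, 2, 3, 4],
)
my_dict = {
    'Level': {0: 'Low', 1: 'Medium', 2: 'High'},
    'Gender': {0: 'Female', 1: 'Male'},
}

# For each row, look up the inner mapping for its key; fall back to the
# original value when the key or the value has no entry in my_dict.
df['value'] = [
    my_dict.get(k, {}).get(v, v)
    for k, v in zip(df['key'], df['value'])
]
print(df)
```

The column ends up with mixed types (strings and integers), which the question says is acceptable.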
I am trying to loop over a data frame column which contains several lists and check whether the values in the lists are contained in another data frame column.
I am pretty new to Python and have been stuck on this problem for a while. I already tried to solve it with isin and str.contains, but I still got no match.
Here is the code I worked out so far:
data = [['yellow', 10,0], ['red', 15,0], ['blue', 14,0]]
df1 = pd.DataFrame(data, columns = ['Colour', 'Colour_id','Amount'])
df1
Colour Colour_id Amount
yellow 10 0
red 15 0
blue 14 0
data = [['tom',[10,15],200 ], ['adam', [10],50], ['john',[15,14],200]]
df2 = pd.DataFrame(data, columns = ['Name', 'Colour_id','Amount'])
df2
Name Colour_id Amount
tom [10,15] 200
adam [10] 50
john [15,14] 200
for indices, row in df2.iterrows():
    for i in row['Colour_id']:
        if i in df1['Colour_id']:
            df1['Amount']=df1['Amount']=df2['Amount']
        else:
            print("No")
The expected result is that the Amount column of df1 is filled like this:
Colour Colour_id Amount
yellow 10 250
red 15 400
blue 14 200
At the moment I only get the "No" of the else condition.
The idea is to convert the list column to a DataFrame, reshape it with stack, aggregate with sum to get a Series, and then use Series.map:
df3 = pd.DataFrame(df2['Colour_id'].values.tolist(), index=df2['Amount']).stack().reset_index()
s = df3.groupby(0)['Amount'].sum()
df1['Amount'] = df1['Colour_id'].map(s)
print (df1)
Colour Colour_id Amount
0 yellow 10 250
1 red 15 400
2 blue 14 200
Or use a defaultdict in pure Python to build a dictionary by summing the values, then map it to the new column:
from collections import defaultdict
d = defaultdict(int)
for cid, Amount in zip(df2['Colour_id'], df2['Amount']):
    for x in cid:
        d[x] += Amount
print (d)
defaultdict(<class 'int'>, {10: 250, 15: 400, 14: 200})
df1['Amount'] = df1['Colour_id'].map(d)
print (df1)
Colour Colour_id Amount
0 yellow 10 250
1 red 15 400
2 blue 14 200
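Another option, not part of the answers above, is DataFrame.explode. This is a sketch that assumes pandas 0.25 or newer and rebuilds the sample frames from the question; the names exploded and totals are only illustrative:

```python
import pandas as pd

df1 = pd.DataFrame({
    'Colour': ['yellow', 'red', 'blue'],
    'Colour_id': [10, 15, 14],
    'Amount': [0, 0, 0],
})
df2 = pd.DataFrame({
    'Name': ['tom', 'adam', 'john'],
    'Colour_id': [[10, 15], [10], [15, 14]],
    'Amount': [200, 50, 200],
})

# One row per (Name, single Colour_id); explode leaves an object column,
# so cast it back to integers before grouping.
exploded = df2.explode('Colour_id').astype({'Colour_id': 'int64'})

# Total Amount per colour id, then look each id up for df1
totals = exploded.groupby('Colour_id')['Amount'].sum()
df1['Amount'] = df1['Colour_id'].map(totals)
print(df1)
```

Ids in df1 that never occur in df2 would come out as NaN with this approach.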
My input data is shown below. Here 'gender' and 'ethderived' are two columns. I would like to replace their values (1, 2, 3, etc.) with categorical labels, e.g. 1 with Male and 2 with Female.
The mapping file looks as shown below (a sample of 2 columns)
Input data looks as shown below
I expect my output dataframe to look like this
I have tried to do this using the code below. Although the code runs without errors, I don't see any replacement happening. Can you please help me with this?
mapp = pd.read_csv('file2.csv')
data = pd.read_csv('file1.csv')
for col in mapp:
    if col in data.columns:
        print(col)
        s = list(mapp.loc[(mapp[col].str.contains('^\d')==True)].index)
        print("s is",s)
        for i in s:
            print("i is",i)
            try:
                value = mapp[col][i].split('. ')
                print("value 0 is",value[0])
                print("value 1 is",value[1])
                if value[0] in data[col].values:
                    data.replace({col:{value[0]:value[1]}})
            except:
                print("column not present")
    else:
        print("No")
Please note that I have shown only two columns, but the real data might have more than 600 columns. Any elegant approach or suggestion to make this simpler would be helpful. As I have two separate csv files, suggestions involving merge/join are also welcome, but please note that my mapping file contains values such as "1. Male" and "2. Female", which is why I used a regex.
Also note that several other columns can have mapping values that start with 1, e.g. 1. Single, 2. Married, 3. Divorced, etc.
Looking forward to your help
Use DataFrame.replace with nested dictionaries: the outer key is the column name in which to replace, and the inner dictionary of replacements is built with Series.str.extract:
df = pd.DataFrame({'Gender':['1.Male','2.Female', np.nan],
'Ethnicity':['1.Chinese','2.Indian','3.Malay']})
print (df)
Gender Ethnicity
0 1.Male 1.Chinese
1 2.Female 2.Indian
2 NaN 3.Malay
d={x:df[x].str.extract(r'(\d+)\.(.+)').dropna().set_index(0)[1].to_dict() for x in df.columns}
print (d)
{'Gender': {'1': 'Male', '2': 'Female'},
'Ethnicity': {'1': 'Chinese', '2': 'Indian', '3': 'Malay'}}
df1 = pd.DataFrame({'Gender':[2,1,2,1],
'Ethnicity':[1,2,3,1]})
print (df1)
Gender Ethnicity
0 2 1
1 1 2
2 2 3
3 1 1
#convert to strings before replace
df2 = df1.astype(str).replace(d)
print (df2)
Gender Ethnicity
0 Female Chinese
1 Male Indian
2 Female Malay
3 Male Chinese
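As for why the loop in the question appears to do nothing: DataFrame.replace returns a new frame by default, and the question's code discards that result. A minimal sketch with made-up sample data (the column holds strings here, which is an assumption about the real file):

```python
import pandas as pd

data = pd.DataFrame({'gender': ['1', '2', '1']})

# The return value is thrown away, so `data` itself is not modified
data.replace({'gender': {'1': 'Male', '2': 'Female'}})
unchanged = data['gender'].tolist()

# Assigning the result back keeps the replacement
data = data.replace({'gender': {'1': 'Male', '2': 'Female'}})
changed = data['gender'].tolist()

print(unchanged, changed)
```

Note also that value[0] in the question is a string such as '1'. If the data column was read as integers, the membership test and the replacement would not match, which is why the answer above converts with astype(str) first.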
If the entries are always in order (1.XXX, 2.XXX, ...), use the following (in this answer df1 is the mapping frame and df2 is the numeric input data):
m=df1.apply(lambda x: x.str[2:])
n=df2.sub(1).replace(m)
print(n)
gender ethderived
0 Female Chinese
1 Male Indian
2 Male Malay
3 Female Chinese
4 Male Chinese
5 Female Indian
I am new to pandas. Could you help me with the case below, please?
I have 2 DF:
df1 = pd.DataFrame({'A': ['name', 'color', 'city', 'animal'], 'number': ['1', '32', '22', '13']})
df2 = pd.DataFrame({'A': ['name', 'color', 'city', 'animal'], 'number': ['12', '2', '42', '15']})
df1
A number
0 name 1
1 color 32
2 city 22
3 animal 13
df2
A number
0 name 12
1 color 2
2 city 42
3 animal 15
I need to get the sum of the column number, e.g.
DF1
A number
0 name 13
1 color 34
2 city 64
3 animal 28
but if I do new = df1 + df2 I get
NEW
A number
0 namename 13
1 colorcolor 34
2 citycity 64
3 animalanimal 28
I even tried merge with on="A", but got nowhere.
Can anyone enlighten me, please?
Thank you
Here are two different ways: one with add, and one with concat and groupby. In either case, you need to make sure that your number columns are numeric first (your example dataframes have strings):
# set `number` to numeric (could be float, I chose int here)
df1['number'] = df1['number'].astype(int)
df2['number'] = df2['number'].astype(int)
# method 1, set the index to `A` in each and add the two frames together:
df1.set_index('A').add(df2.set_index('A')).reset_index()
# method 2, concatenate the two frames, groupby A, and get the sum:
pd.concat((df1,df2)).groupby('A',as_index=False).sum()
Output:
A number
0 animal 28
1 city 64
2 color 34
3 name 13
Merging isn't a bad idea, you just need to remember to convert numeric series to numeric, select columns to merge on, then sum on numeric columns via select_dtypes:
df1['number'] = pd.to_numeric(df1['number'])
df2['number'] = pd.to_numeric(df2['number'])
df = df1.merge(df2, on='A')
df['number'] = df.select_dtypes(include='number').sum(1) # 'number' means numeric columns
df = df[['A', 'number']]
print(df)
A number
0 name 13
1 color 34
2 city 64
3 animal 28
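The examples above assume both frames contain exactly the same keys in A. If that does not hold in your data, the add approach leaves NaN for keys present in only one frame. A hedged sketch using the fill_value parameter of DataFrame.add, with sample frames invented for illustration:

```python
import pandas as pd

# 'color' appears only in df1 and 'city' only in df2 (made-up data)
df1 = pd.DataFrame({'A': ['name', 'color'], 'number': [1, 32]})
df2 = pd.DataFrame({'A': ['name', 'city'], 'number': [12, 42]})

# fill_value=0 treats a key missing from one frame as 0 instead of NaN
total = (
    df1.set_index('A')
       .add(df2.set_index('A'), fill_value=0)
       .reset_index()
)
totals_by_key = total.set_index('A')['number'].to_dict()
print(total)
```

The summed column comes back as float here because alignment introduces missing values before they are filled.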
I have a big dataset (2m rows, 70 variables), which has many categorical variables. All categorical variables are coded in numbers (e.g. see df1)
df1:
obs gender job
1 1 1
2 1 2
3 2 2
4 1 1
I have another data frame with all explanations, looking like this:
df2:
Var: Value: Label:
gender 1 male
gender 2 female
job 1 blue collar
job 2 white collar
Is there a fast way to replace all values of the categorical columns with their labels from df2? This would save me from always having to look up the meaning of a value in df2. I found some solutions that replace values by hand, but I am looking for an automatic way of doing this.
Thank you
You could use a dictionary generated from df2, like this.
First, generate some dummy data:
import pandas as pd
import numpy as np
df1 = pd.DataFrame()
df1['obs'] = range(1,1001)
df1['gender'] = np.random.choice([1,2],1000)
df1['job'] = np.random.choice([1,2],1000)
df2 = pd.DataFrame()
df2['var'] = ['gender','gender','job','job']
df2['value'] = [1,2,1,2]
df2['label'] = ['male','female','blue collar', 'white collar']
If you want to replace a single variable, do something like this:
genderDict = dict(df2.loc[df2['var']=='gender'][['value','label']].values)
df1['gender_name'] = df1['gender'].apply(lambda x: genderDict[x])
And if you'd like to replace a bunch of variables:
colNames = list(df1.columns)
colNames.remove('obs')
for variable in colNames:
    varDict = dict(df2.loc[df2['var']==variable][['value','label']].values)
    df1[variable+'_name'] = df1[variable].apply(lambda x: varDict[x])
For a million rows it takes about 1 second, so it should be reasonably fast.
Create a mapper dictionary from df2 using groupby:
d = df2.groupby('Var').apply(lambda x: dict(zip(x['Value'], x['Label']))).to_dict()
{'gender': {1: 'male', 2: 'female'},
'job': {1: 'blue collar', 2: 'white collar'}}
Now map the values in df1, using each outer key of the dictionary as the column name and the inner dictionary as the mapper:
for col in df1.columns:
    if col in d.keys():
        df1[col] = df1[col].map(d[col])
You get
obs gender job
0 1 male blue collar
1 2 male white collar
2 3 female white collar
3 4 male blue collar
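For reference, here is a self-contained sketch of the same idea that builds the mapper by iterating over the groups instead of using groupby().apply. The column names Var, Value and Label are taken from the question's df2 (without the trailing colons), which is an assumption about the real data:

```python
import pandas as pd

df1 = pd.DataFrame({
    'obs': [1, 2, 3, 4],
    'gender': [1, 1, 2, 1],
    'job': [1, 2, 2, 1],
})
df2 = pd.DataFrame({
    'Var': ['gender', 'gender', 'job', 'job'],
    'Value': [1, 2, 1, 2],
    'Label': ['male', 'female', 'blue collar', 'white collar'],
})

# One inner {value: label} dictionary per variable named in df2
d = {var: dict(zip(g['Value'], g['Label'])) for var, g in df2.groupby('Var')}

# Apply each mapper only to columns that actually exist in df1
for col, mapper in d.items():
    if col in df1.columns:
        df1[col] = df1[col].map(mapper)
print(df1)
```

Values that have no entry in the mapper become NaN with Series.map, so it is worth checking for missing values afterwards.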
I'm trying to work with an .xls file that is actually written in HTML (or XML, I'm not sure).
I tried this:
df = pandas.read_html(r"filename.xls", skiprows=0)
The result was not a dataframe but a list, so I did this:
df = df[0]
After that I could do
print(df)
the result is as below
0 1 2
0 name age gender
1 john 18 male
2 ryan 20 male
Previously I did a similar task with other xlsx files and it worked fine, but not with this one.
For instance,
for index, row in df.iterrows():
    target = str(row['gender'])
    if target == 'male':
        df.loc[index,'gender'] = 'Y'
    else:
        df.loc[index,'gender'] = 'N'
In reality, the code is 400 lines long.
I want my dataframe to look like the one below so that I can reuse the code I have already written.
name age gender
0 john 18 male
1 ryan 20 male
As requested in the comments, I'm adding this result too.
I tried to skip the first row:
df = pandas.read_html(r"filename.xls", skiprows=1)
The result is as below
0 1 2
0 john 18 male
1 ryan 20 male
How can I do this?
Use parameter header=0:
df = pandas.read_html(r"filename.xls", header=0)[0]
Then, instead of looping, it is possible to use np.where.
Change:
for index, row in df.iterrows():
    target = str(row['gender'])
    if target == 'male':
        df.loc[index,'gender'] = 'Y'
    else:
        df.loc[index,'gender'] = 'N'
to:
df['gender'] = np.where(df['gender'] == 'male', 'Y', 'N')
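If the frame has already been loaded with integer column labels (as in the question's first printout), the first row can also be promoted to the header afterwards. This is a sketch on a hand-built frame standing in for the read_html result; raw and the 'flag' column are illustrative names only:

```python
import numpy as np
import pandas as pd

# Stand-in for what read_html returned: header text sits in row 0
raw = pd.DataFrame([
    ['name', 'age', 'gender'],
    ['john', '18', 'male'],
    ['ryan', '20', 'male'],
])

# Drop the header row from the data and use its values as column names
df = raw.iloc[1:].reset_index(drop=True)
df.columns = raw.iloc[0].tolist()

# Vectorised replacement for the row-by-row loop
df['flag'] = np.where(df['gender'] == 'male', 'Y', 'N')
print(df)
```

Passing header=0 to read_html, as shown above, is the more direct route when you can simply reload the file.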