Replace values in a dataframe with values from another dataframe - Regex - python

I have input data as shown below. Here 'gender' and 'ethderived' are two columns. I would like to replace their numeric codes (1, 2, 3, etc.) with categorical labels, e.g. 1 with Male and 2 with Female.
The mapping file looks as shown below - a sample of 2 columns
The input data looks as shown below
I expect my output dataframe to look like this
I have tried to do this using the code below. Though the code runs without errors, I don't see any replacement happening. Can you please help me with this?
mapp = pd.read_csv('file2.csv')
data = pd.read_csv('file1.csv')
for col in mapp:
    if col in data.columns:
        print(col)
        s = list(mapp.loc[(mapp[col].str.contains(r'^\d') == True)].index)
        print("s is", s)
        for i in s:
            print("i is", i)
            try:
                value = mapp[col][i].split('. ')
                print("value 0 is", value[0])
                print("value 1 is", value[1])
                if value[0] in data[col].values:
                    data.replace({col: {value[0]: value[1]}})
            except:
                print("column not present")
    else:
        print("No")
Please note that I have shown only two columns, but in reality there may be more than 600 columns. Any elegant approach/suggestion to make this simple would be helpful. As I have two separate csv files, any suggestions on merge/join etc. would also be helpful, but please note that my mapping file contains values such as "1. Male" and "2. Female"; hence I used a regex.
Also note that several other columns can have mapping values that start with 1, e.g. 1. Single, 2. Married, 3. Divorced, etc.
Looking forward to your help

Use DataFrame.replace with nested dictionaries - the outer key defines the column name to replace in, and the inner dictionary holds the replacement values, created with Series.str.extract:
df = pd.DataFrame({'Gender': ['1.Male', '2.Female', np.nan],
                   'Ethnicity': ['1.Chinese', '2.Indian', '3.Malay']})
print(df)
Gender Ethnicity
0 1.Male 1.Chinese
1 2.Female 2.Indian
2 NaN 3.Malay
d = {x: df[x].str.extract(r'(\d+)\.(.+)').dropna().set_index(0)[1].to_dict()
     for x in df.columns}
print(d)
{'Gender': {'1': 'Male', '2': 'Female'},
'Ethnicity': {'1': 'Chinese', '2': 'Indian', '3': 'Malay'}}
df1 = pd.DataFrame({'Gender': [2, 1, 2, 1],
                    'Ethnicity': [1, 2, 3, 1]})
print(df1)
Gender Ethnicity
0 2 1
1 1 2
2 2 3
3 1 1
# convert to strings before replacing
df2 = df1.astype(str).replace(d)
print(df2)
Gender Ethnicity
0 Female Chinese
1 Male Indian
2 Female Malay
3 Male Chinese

If the entries are always in order (1.XXX, 2.XXX, ...), strip the "N." prefix from the mapping frame and look the labels up by position:
m = df1.apply(lambda x: x.str[2:])                  # keep only the label after the "N." prefix
n = df2.sub(1).apply(lambda x: x.map(m[x.name]))    # 1-based codes -> 0-based positions in m
print(n)
gender ethderived
0 Female Chinese
1 Male Indian
2 Male Malay
3 Female Chinese
4 Male Chinese
5 Female Indian
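For the two-file setup in the question, the same nested-dict idea can be run directly against the mapping frame. A minimal sketch with hypothetical stand-ins for file1.csv and file2.csv (note the regex allows a space after the dot, to match values like "1. Male"):

```python
import pandas as pd

# hypothetical stand-ins for file2.csv (mapping) and file1.csv (data)
mapp = pd.DataFrame({'gender': ['1. Male', '2. Female'],
                     'ethderived': ['1. Chinese', '2. Indian']})
data = pd.DataFrame({'gender': ['1', '2', '1'],
                     'ethderived': ['2', '1', '1']})

# build {column: {code: label}} from the "N. Label" strings,
# only for columns that exist in both files
d = {c: mapp[c].str.extract(r'(\d+)\.\s*(.+)').set_index(0)[1].to_dict()
     for c in mapp.columns if c in data.columns}

out = data.replace(d)
print(out)
```

With 600+ columns the dictionary is built once from the mapping file, so only a single replace pass over the data is needed.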

Related

Cut substring on multiple columns when one column contains a particular substring

I want to cut a certain part of a string (applied to multiple columns, and different for each column) when one column contains a particular substring.
Example:
Assume the following dataframe
import pandas as pd
df = pd.DataFrame({'name': ['Allan2', 'Mike39', 'Brenda4', 'Holy5'],
                   'Age': [30, 20, 25, 18],
                   'Zodiac': ['Aries', 'Leo', 'Virgo', 'Libra'],
                   'Grade': ['A', 'AB', 'B', 'AA'],
                   'City': ['Aura', 'Somerville', 'Hendersonville', 'Gannon'],
                   'pahun': ['a_b_c', 'c_d_e', 'f_g', 'h_i_j']})
print(df)
Out:
name Age Zodiac Grade City pahun
0 Allan2 30 Aries A Aura a_b_c
1 Mike39 20 Leo AB Somerville c_d_e
2 Brenda4 25 Virgo B Hendersonville f_g
3 Holy5 18 Libra AA Gannon h_i_j
For example if one entry of column City ends with 'e', cut the last three letters of column 'City' and the last two letters of column 'name'.
What I tried so far is something like this:
df['City'] = df['City'].apply(lambda x: df['City'].str[:-3] if df.City.str.endswith('e'))
That doesn't work and I also don't really know how to cut letters on other columns while having the same if clause.
I'm happy for any help I get.
Thank you
You can build a boolean mask of the rows whose City ends with 'e', then update with loc:
mask = df['City'].str[-1] == 'e'
df.loc[mask, 'City'] = df.loc[mask, 'City'].str[:-3]
df.loc[mask, 'name'] = df.loc[mask, 'name'].str[:-2]
Output:
name Age Zodiac Grade City pahun
0 Allan2 30 Aries A Aura a_b_c
1 Mike 20 Leo AB Somervi c_d_e
2 Brend 25 Virgo B Hendersonvi f_g
3 Holy5 18 Libra AA Gannon h_i_j
import pandas as pd

df = pd.DataFrame({'name': ['Allan2', 'Mike39', 'Brenda4', 'Holy5'],
                   'Age': [30, 20, 25, 18],
                   'Zodiac': ['Aries', 'Leo', 'Virgo', 'Libra'],
                   'Grade': ['A', 'AB', 'B', 'AA'],
                   'City': ['Aura', 'Somerville', 'Hendersonville', 'Gannon'],
                   'pahun': ['a_b_c', 'c_d_e', 'f_g', 'h_i_j']})

def func(row):
    index = row.name
    if row['City'][-1] == 'e':  # check the last letter of City for each row; implement your condition here
        df.at[index, 'City'] = df['City'][index][:-3]
        df.at[index, 'name'] = df['name'][index][:-2]

df.apply(func, axis=1)
print(df)
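If several columns need trimming under the same condition, the mask approach generalizes with a dict of column name to number of trailing characters to cut (a sketch; the cut lengths are just the ones from the question):

```python
import pandas as pd

df = pd.DataFrame({'name': ['Allan2', 'Mike39', 'Brenda4', 'Holy5'],
                   'City': ['Aura', 'Somerville', 'Hendersonville', 'Gannon']})

# column -> how many trailing characters to drop when the condition holds
cuts = {'City': 3, 'name': 2}

mask = df['City'].str.endswith('e')
for col, n in cuts.items():
    df.loc[mask, col] = df.loc[mask, col].str[:-n]

print(df)
```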

How to create a filter according to input values in python

I'm trying to create a filter based on input values from users.
For example, if the user wants to filter the data for USA, Canada, or anything else, he must write these names, and the csv should contain only the data about them.
I have tried to create something with python and the pandas library.
import pandas as pd

df.columns = ['id_country', 'country', 'population', 'number cities']
filter_data = int(input('select country writing the id_country: '))
filtered = df.loc[df['id_country'] == filter_data]
indexdata = filtered.set_index('id_country')
indexdata.to_csv('C:\\Users\\Marco\\Desktop\\countries.csv', index=False)
this code only works when the user writes a single id_country; it doesn't work when the user wants to write 2 or more.
Use isin():
data = pd.DataFrame({'sample col1': [1,2,3,4,5], 'sample col2': ['a','b','c','d','e'], 'country': ['US', 'Canada','Japan','US','Canada']})
>>> data
sample col1 sample col2 country
0 1 a US
1 2 b Canada
2 3 c Japan
3 4 d US
4 5 e Canada
Ask the user for the value to look for in the country column:
>>> value_for_filter = input('Enter what country would you like to look data at:\n')
Enter what country would you like to look data at:
Canada
Filter the country column using user input:
>>> df = data[data['country'].isin([value_for_filter])]
>>> df
sample col1 sample col2 country
1 2 b Canada
4 5 e Canada
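To accept two or more countries at once, one option (a sketch; the prompt text and column names are just examples) is to split a comma-separated input and hand the resulting list to isin():

```python
import pandas as pd

data = pd.DataFrame({'country': ['US', 'Canada', 'Japan', 'US', 'Canada'],
                     'population': [331, 38, 125, 331, 38]})

# stand-in for: raw = input('Enter countries, comma separated:\n')
raw = 'US, Japan'
wanted = [c.strip() for c in raw.split(',')]

filtered = data[data['country'].isin(wanted)]
print(filtered)
```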

Python pandas: labeling categorical values based on legend dataframe

I have a big dataset (2m rows, 70 variables) with many categorical variables. All categorical variables are coded as numbers (e.g. see df1).
df1:
obs gender job
1 1 1
2 1 2
3 2 2
4 1 1
I have another data frame with all the explanations, looking like this:
df2:
Var: Value: Label:
gender 1 male
gender 2 female
job 1 blue collar
job 2 white collar
Is there a fast way to replace all values of the categorical columns with their labels from df2? This would save me the work of always looking up the meaning of a value in df2. I found some solutions for replacing values by hand, but I am looking for an automatic way of doing this.
Thank you
You could use a dictionary generated from df2, like this:
First, generate some dummy data:
import pandas as pd
import numpy as np
df1 = pd.DataFrame()
df1['obs'] = range(1, 1001)
df1['gender'] = np.random.choice([1, 2], 1000)
df1['job'] = np.random.choice([1, 2], 1000)
df2 = pd.DataFrame()
df2['var'] = ['gender', 'gender', 'job', 'job']
df2['value'] = [1, 2, 1, 2]
df2['label'] = ['male', 'female', 'blue collar', 'white collar']
If you want to replace one variable, do something like this:
genderDict = dict(df2.loc[df2['var'] == 'gender'][['value', 'label']].values)
df1['gender_name'] = df1['gender'].apply(lambda x: genderDict[x])
And if you'd like to replace a bunch of variables:
colNames = list(df1.columns)
colNames.remove('obs')
for variable in colNames:
    varDict = dict(df2.loc[df2['var'] == variable][['value', 'label']].values)
    df1[variable + '_name'] = df1[variable].apply(lambda x: varDict[x])
For a million rows it takes about a second, so it should be reasonably fast.
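The apply(lambda ...) lookup can also be written with Series.map, which takes a dict directly and is usually a little faster; a sketch on a small version of the dummy data:

```python
import pandas as pd

df1 = pd.DataFrame({'gender': [1, 2, 2, 1]})
df2 = pd.DataFrame({'var': ['gender', 'gender'],
                    'value': [1, 2],
                    'label': ['male', 'female']})

# same dict construction as above, then map instead of apply
genderDict = dict(df2.loc[df2['var'] == 'gender'][['value', 'label']].values)
df1['gender_name'] = df1['gender'].map(genderDict)
print(df1)
```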
Create a mapper dictionary from df2 using groupby:
d = df2.groupby('Var').apply(lambda x: dict(zip(x['Value'], x['Label']))).to_dict()
{'gender': {1: 'male', 2: 'female'},
'job': {1: 'blue collar', 2: 'white collar'}}
Now map the values in df1, using the outer key of the dictionary as the column name and the inner dictionary as the mapper:
for col in df1.columns:
    if col in d.keys():
        df1[col] = df1[col].map(d[col])
You get
obs gender job
0 1 male blue collar
1 2 male white collar
2 3 female white collar
3 4 male blue collar
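Given a nested dict like d, the loop can also be collapsed into one DataFrame.replace call; note that replace leaves unmatched values as they are, while map turns them into NaN (a sketch):

```python
import pandas as pd

df1 = pd.DataFrame({'obs': [1, 2], 'gender': [1, 2], 'job': [1, 2]})
d = {'gender': {1: 'male', 2: 'female'},
     'job': {1: 'blue collar', 2: 'white collar'}}

# one pass over all mapped columns; 'obs' is untouched
df1 = df1.replace(d)
print(df1)
```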

python pandas.column index number to real column name

I'm trying to deal with data that is an xls file written in html (or xml, I don't know).
I tried to do this:
df = pandas.read_html(r"filename.xls", skiprows=0)
and it was not a dataframe but just a list, so I did this:
df = df[0]
and after this, I could do
print(df)
the result is as below:
0 1 2
0 name age gender
1 john 18 male
2 ryan 20 male
Before, I did a similar task with other xlsx files and it just worked fine, but not with this one.
For instance:
for index, row in df.iterrows():
    target = str(row['gender'])
    if target == 'male':
        df.loc[index, 'gender'] = 'Y'
    else:
        df.loc[index, 'gender'] = 'N'
in reality, the code is 400 lines long....
I want my dataframe to look like the below so that I can re-use the code that I wrote already.
name age gender
0 john 18 male
1 ryan 20 male
As per the comment, I'm adding this result too.
I tried to skip the first row:
df = pandas.read_html(r"filename.xls", skiprows=1)
the result is as below:
0 1 2
0 john 18 male
1 ryan 20 male
how would I do it?
Use parameter header=0:
df = pandas.read_html(r"filename.xls", header=0)[0]
And then, instead of loops, it is possible to use np.where:
change:
for index, row in df.iterrows():
    target = str(row['gender'])
    if target == 'male':
        df.loc[index, 'gender'] = 'Y'
    else:
        df.loc[index, 'gender'] = 'N'
to:
df['gender'] = np.where(df['gender'] == 'male', 'Y', 'N')
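If the column ever has more than two categories, np.select generalizes the same idea (a sketch; the third value 'other' and the default 'U' are hypothetical):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'gender': ['male', 'female', 'other']})

# conditions are checked in order; anything unmatched gets the default
conditions = [df['gender'] == 'male', df['gender'] == 'female']
df['gender'] = np.select(conditions, ['Y', 'N'], default='U')
print(df)
```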

Pandas: assigning values to a dataframe column based on pivot table values using an if statement

I'm using the Titanic Kaggle data as a means to explore Pandas. I'm trying to figure out how to use an if statement inside of .ix[] (or otherwise). I have a pivot table I'm using to get a lookup value into my main dataframe. Here's a chunk of the pivot table (entitled 'data'):
Survived Count % Female Survived % Male Survived \
Sex female male female male
Embarked Pclass
C 1 42 17 43 42 97.67 40.48
2 7 2 7 10 100.00 20.00
3 15 10 23 43 65.22 23.26
Now I would like to go through each line in the main dataframe to assign its looked-up value. There is no problem looking up the value hardcoded, like:
df['Chance of Survival'] = data.ix['C']['% Female Survived'].get(1)
97.67
However, when trying to insert the dynamic portion to include an if statement, things don't work out so great:
df['Chance of Survival'] = data.ix[df.Embarked][('% Female Survived' if df.Sex == 'female') | ('% Male Survived' if df.Sex=='male')].get(df.Pclass)
So the desired output in my main dataframe would look like this:
PersonId Embarked Sex Pclass Chance of Survival
1 C female 1 97.67
2 C male 2 20.00
3 C male 3 23.26
Thanks in advance! :)
Got it, but I'll post it in case anyone else has a similar problem - or better yet, in case anyone has a nicer way of doing it. :)
def getValue(line):
    '''Look up a value in pivot table "data" given the contents of the line passed in from df'''
    value = lambda line: '% Male Survived' if line.Sex == 'male' else '% Female Survived'
    result = data.ix[line.Embarked][value(line)].get(line.Pclass)
    return result

df['Chance of Survival'] = df.apply(getValue, axis=1)
So, for anyone who wants to assign values in a column of one dataframe based on values from another: I used .ix[] to drill down to the value, then .apply() to apply a function across each row (axis=1), reading the line's values just as you would a dataframe's ('line.element' / line['element']).
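A tiny sketch of that axis=1 pattern in isolation (a toy frame; only the column-choosing step is shown):

```python
import pandas as pd

df = pd.DataFrame({'Sex': ['male', 'female'], 'Pclass': [1, 2]})

def pick_column(row):
    # each row arrives as a Series; attribute and [] access both work
    return '% Male Survived' if row.Sex == 'male' else '% Female Survived'

cols = df.apply(pick_column, axis=1)
print(cols.tolist())
```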
As far as I understand your problem, you want to assign values to an existing dataframe, and you are currently using DataFrame.ix.
The method you probably want is DataFrame.loc, which works like this:
df = pd.DataFrame({'foo':[1,2,3,4], 'bar':[1,2,3,4]})
df
bar foo
0 1 1
1 2 2
2 3 3
3 4 4
df.loc[1, 'foo'] = 4
df
bar foo
0 1 1
1 2 4
2 3 3
3 4 4
If you want to assign to new columns, you just have to create them first, e.g.:
df['newcolumn'] = np.nan
Then you can assign it with the code above.
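A minimal sketch of that create-then-assign pattern (the column name and values are just examples):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'foo': [1, 2, 3, 4]})
df['newcolumn'] = np.nan                  # create the column first
df.loc[df['foo'] > 2, 'newcolumn'] = 1.0  # then assign to a subset with .loc
print(df)
```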
