Python pandas: labeling categorical values based on legend dataframe

I have a big dataset (2m rows, 70 variables) with many categorical variables. All categorical variables are coded as numbers (see df1):
df1:
obs gender job
1 1 1
2 1 2
3 2 2
4 1 1
I have another data frame with all the explanations, which looks like this:
df2:
Var: Value: Label:
gender 1 male
gender 2 female
job 1 blue collar
job 2 white collar
Is there a fast way to replace all values of the categorical columns with their labels from df2? This would save me the work of always looking up the meaning of a value in df2. I have found some solutions for replacing values by hand, but I am looking for an automatic way of doing this.
Thank you

You could use a dictionary generated from df2, like this.
First, generate some dummy data:
import pandas as pd
import numpy as np
df1 = pd.DataFrame()
df1['obs'] = range(1,1001)
df1['gender'] = np.random.choice([1,2],1000)
df1['job'] = np.random.choice([1,2],1000)
df2 = pd.DataFrame()
df2['var'] = ['gender','gender','job','job']
df2['value'] = [1,2,1,2]
df2['label'] = ['male','female','blue collar', 'white collar']
If you want to replace one variable, do something like this:
genderDict = dict(df2.loc[df2['var']=='gender'][['value','label']].values)
df1['gender_name'] = df1['gender'].apply(lambda x: genderDict[x])
And if you'd like to replace a bunch of variables:
colNames = list(df1.columns)
colNames.remove('obs')
for variable in colNames:
    varDict = dict(df2.loc[df2['var'] == variable][['value', 'label']].values)
    df1[variable + '_name'] = df1[variable].apply(lambda x: varDict[x])
For a million rows it takes about 1 second, so it should be reasonably fast.
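As a side note, Series.map accepts a dictionary directly and is usually faster than apply with a lambda; a minimal sketch of the same loop, assuming the df1 and df2 built above:

for variable in colNames:
    varDict = dict(df2.loc[df2['var'] == variable][['value', 'label']].values)
    # map looks each code up in the dictionary; codes missing from df2 become NaN
    df1[variable + '_name'] = df1[variable].map(varDict)

One behavioral difference: apply with a plain dict lookup raises a KeyError on an unknown code, while map silently produces NaN.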

Create a mapper dictionary from df2 using groupby:
d = df2.groupby('Var').apply(lambda x: dict(zip(x['Value'], x['Label']))).to_dict()
{'gender': {1: 'male', 2: 'female'},
'job': {1: 'blue collar', 2: 'white collar'}}
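The same nested dictionary can also be built with a plain dict comprehension over the groups, which some find more readable (a sketch, equivalent to the groupby/apply line above):

# iterating a groupby yields (key, sub-frame) pairs
d = {var: dict(zip(g['Value'], g['Label'])) for var, g in df2.groupby('Var')}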
Now map the values in df1, using the outer key of the dictionary as the column name and the inner dictionary as the mapper:
for col in df1.columns:
    if col in d:
        df1[col] = df1[col].map(d[col])
You get:
obs gender job
0 1 male blue collar
1 2 male white collar
2 3 female white collar
3 4 male blue collar
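Because the outer keys of d are column names, DataFrame.replace also accepts this nested dictionary directly, so a one-line alternative to the loop above (a sketch, using the same d) would be:

# replace understands {column: {old_value: new_value}} nested dicts
df1 = df1.replace(d)

Unlike map, replace leaves values that are missing from the dictionary unchanged instead of turning them into NaN.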

Related

Compare the values of 2 columns in pandas dataframe to fill a third column

I have a dataframe (figure). Suppose that I add more observations to my dataframe.
For these new observations (9 and 10) I only fill in the color, food and age columns. For the score column, I want to compare against the existing observations: if the "food" and "color" columns carry the same labels as an earlier observation, the score value should equal that observation's score.
In this case the score values are 5.0 and 6.0, respectively. How can I automate this process when I add many observations without a score value?
You could try something like below:
import pandas as pd
#Shortened working data list, for demonstration purposes
data = [[1,'Red','Apple',70,5.0],[2,'Yellow','Pizza',90,6.0],[9,'Red','Apple', 2, None],[10,'Yellow','Pizza',2,None]]
#Set up data frame
df = pd.DataFrame(data, columns=['Observations', 'Color', 'Food', 'Age', 'Score'])
# Remove nan values
df_cleaned = df[df['Score'].notna()]
#Generate a dictionary with a key that combines Color and Food, and a value that equals Score
targetValues = dict(zip((df_cleaned.Color + df_cleaned.Food), df_cleaned.Score))
#Replace nan values in our original data frame with the values from our dictionary created above
df['Score'] = df['Score'].fillna((df.Color + df.Food).map(targetValues))
print(df)
That will yield an output like below:
Observations Color Food Age Score
0 1 Red Apple 70 5.0
1 2 Yellow Pizza 90 6.0
2 9 Red Apple 2 5.0
3 10 Yellow Pizza 2 6.0
The general idea is to create a dictionary and use its key-value pairs to replace the NaN values in your data frame.
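An alternative that avoids string concatenation (which can produce ambiguous keys when one label is a prefix of another) is to group by the two columns and take the first known score per group; a sketch, assuming the same df as above:

# 'first' picks the first non-null Score within each (Color, Food) group
df['Score'] = df['Score'].fillna(
    df.groupby(['Color', 'Food'])['Score'].transform('first')
)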

Python map values from dictionary to dataframe

I have a pandas dataframe which looks like as follows:
df =
key value
1 Level 1
2 Age 35
3 Height 180
4 Gender 0
...
and a dictionary as follows:
my_dict = {
    'Level': {0: 'Low', 1: 'Medium', 2: 'High'},
    'Gender': {0: 'Female', 1: 'Male'}
}
I want to map from the dictionary to the dataframe and replace entries in the 'value' column with their corresponding values from the dictionary, so that the output becomes:
key value
1 Level Medium
2 Age 35
3 Height 180
4 Gender Female
...
It's okay for the other values in the column to become strings as well. How can I achieve this? Thanks for the help.
Check with replace: transposing first turns the 'key' entries into column labels so the nested dictionary lines up with them, and a second transpose restores the original shape.
out = df.set_index('key').T.replace(my_dict).T.reset_index()
out
Out[27]:
key value
0 Level Medium
1 Age 35
2 Height 180
3 Gender Female
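Alternatively, you can update the affected cells one at a time: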
df.at[1, 'value'] = my_dict['Level'][df.at[1, 'value']]
df.at[4, 'value'] = my_dict['Gender'][df.at[4, 'value']]
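To apply the same idea to every row without hard-coding positions, a sketch (assuming the same 'key'/'value' layout as above):

for idx, row in df.iterrows():
    mapping = my_dict.get(row['key'])   # inner dict if this key is in my_dict, else None
    if mapping is not None:
        df.at[idx, 'value'] = mapping[row['value']]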

Loop all columns for value in any column

I'm trying to loop through all columns in a dataframe to find where a "Feature" condition is met in order to alter the corresponding FeatureValue. So if my dataframe (df) looks like below:
Feature FeatureValue Feature2 Feature2Value
Cat 1 Dog 3
Fish 2 Cat 1
I want to find where Feature=Cat or Feature2=Cat and change the corresponding FeatureValue or Feature2Value to 20. I tried the below to get started, but am struggling:
for column in df:
    if df.loc[df[column] == "Cat"]:
        print(column)
The solution would look like:
Feature FeatureValue Feature2 Feature2Value
Cat 20 Dog 3
Fish 2 Cat 20
Here is a way to do it:
# First we construct a dictionary linking each feature column to its value column
feature_value = {'Feature': 'FeatureValue', 'Feature2': 'Feature2Value'}
# Then we iterate over each feature column
for feature in feature_value:
    df.loc[df[feature] == 'Cat', feature_value[feature]] = 20
You currently have a wide data structure. To solve your problem in an elegant way, you should convert it to a long data structure. I don't know what you are doing with your data, but the long form is often much easier to deal with.
You can do it like this:
import pandas as pd
from itertools import chain
# set up your sample data
dta = {'Feature': ['Cat', 'Fish'], 'FeatureValue': [1, 2], 'Feature2': ['Dog', 'Cat'], 'Feature2Value': [3, 1]}
df = pd.DataFrame(data=dta)
# relabel your columns to be able to apply method `wide_to_long`
# this is a little ugly here only because your column labels are not wisely chosen
# if you had [Feature1,FeatureValue1,Feature2,FeatureValue2] as column labels,
# you could get rid of this part
columns = ['Feature', 'FeatureValue'] * int(len(df.columns)/2)
identifier = zip(range(int(len(df.columns)/2)), range(int(len(df.columns)/2)))
identifier = list(chain(*identifier))
columns = ['{}{}'.format(i,j) for i, j in zip(columns, identifier)]
df.columns = columns
# generate result
df['feature'] = df.index
df_long = pd.wide_to_long(df, stubnames=['Feature', 'FeatureValue'], i='feature', j='id')
Now, you converted your data from
Feature FeatureValue Feature2 Feature2Value
0 Cat 1 Dog 3
1 Fish 2 Cat 1
to this
Feature FeatureValue
feature id
0 0 Cat 1
1 0 Fish 2
0 1 Dog 3
1 1 Cat 1
This allows you to answer your problem in a single line, no loops:
df_long.loc[df_long['Feature'] == 'Cat', 'FeatureValue'] = 20
This yields
Feature FeatureValue
feature id
0 0 Cat 20
1 0 Fish 2
0 1 Dog 3
1 1 Cat 20
You can easily go back to your wide format afterwards, e.g. with unstack, as sketched below.
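A sketch of the round trip back to wide format, using the df_long built above (the numeric suffixes come back on the column names):

# move the 'id' index level back into the columns
df_wide = df_long.unstack('id')
# flatten the resulting MultiIndex columns back to Feature0, FeatureValue0, ...
df_wide.columns = ['{}{}'.format(stub, i) for stub, i in df_wide.columns]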

Replace values in a dataframe with values from another dataframe - Regex

I have input data as shown below. Here 'gender' and 'ethderived' are two of the columns. I would like to replace their numeric values (1, 2, 3, etc.) with categorical labels, e.g. 1 with Male and 2 with Female.
The mapping file looks as shown below - a sample of 2 columns.
The input data looks as shown below.
I expect my output dataframe to look like this.
I have tried to do this using the code below. Though the code runs fine, I don't see any replacement happening. Can you please help me with this?
mapp = pd.read_csv('file2.csv')
data = pd.read_csv('file1.csv')
for col in mapp:
    if col in data.columns:
        print(col)
        s = list(mapp.loc[(mapp[col].str.contains('^\d') == True)].index)
        print("s is", s)
        for i in s:
            print("i is", i)
            try:
                value = mapp[col][i].split('. ')
                print("value 0 is", value[0])
                print("value 1 is", value[1])
                if value[0] in data[col].values:
                    data.replace({col: {value[0]: value[1]}})
            except:
                print("column not present")
    else:
        print("No")
Please note that I have shown only two columns, but in reality there may be more than 600 columns. Any elegant approach/suggestions to keep it simple would be helpful. As I have two separate csv files, any suggestions on merge/join etc. would also be helpful, but please note that my mapping file contains values such as "1. Male" and "2. Female"; hence I used a regex.
Also note that several other columns can have mapping values that start with 1, e.g. 1. Single, 2. Married, 3. Divorced, etc.
Looking forward to your help.
Use DataFrame.replace with nested dictionaries - the outer key defines the column name to replace in, and the inner dictionary holds the replacement values, created here with Series.str.extract:
import pandas as pd
import numpy as np

df = pd.DataFrame({'Gender': ['1.Male', '2.Female', np.nan],
                   'Ethnicity': ['1.Chinese', '2.Indian', '3.Malay']})
print (df)
Gender Ethnicity
0 1.Male 1.Chinese
1 2.Female 2.Indian
2 NaN 3.Malay
d = {x: df[x].str.extract(r'(\d+)\.(.+)').dropna().set_index(0)[1].to_dict() for x in df.columns}
print (d)
{'Gender': {'1': 'Male', '2': 'Female'},
'Ethnicity': {'1': 'Chinese', '2': 'Indian', '3': 'Malay'}}
df1 = pd.DataFrame({'Gender': [2, 1, 2, 1],
                    'Ethnicity': [1, 2, 3, 1]})
print (df1)
Gender Ethnicity
0 2 1
1 1 2
2 2 3
3 1 1
#convert to strings before replace
df2 = df1.astype(str).replace(d)
print (df2)
Gender Ethnicity
0 Female Chinese
1 Male Indian
2 Female Malay
3 Male Chinese
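If you'd rather keep df1 numeric and skip the astype(str) step, the extracted keys can be cast back to integers first (a small sketch using the d built above):

# convert the inner keys from strings to ints so they match df1's dtypes
d_int = {col: {int(k): v for k, v in sub.items()} for col, sub in d.items()}
df2 = df1.replace(d_int)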
If the entries are always in order (1.XXX, 2.XXX, ...), you can exploit the positions instead of a regex. A sketch, assuming df1 here is the mapping frame ('1.Male', ...) and df2 holds the numeric codes, with matching column names:
m = df1.apply(lambda x: x.str[2:])   # strip the leading "n." from each label
n = df2.sub(1).replace(m.to_dict())  # shift 1-based codes to 0-based and replace via nested dict
print(n)
gender ethderived
0 Female Chinese
1 Male Indian
2 Male Malay
3 Female Chinese
4 Male Chinese
5 Female Indian

For Loop to Return Unique Values in DataFrame

I'm working through a beginner's ML code, and in order to count the number of unique samples in a column, the author uses this code:
def unique_vals(rows, col):
    """Find the unique values for a column in a dataset."""
    return set([row[col] for row in rows])
I am working with a DataFrame, however, and for me this code returns single letters: 'm', 'l', etc. I tried altering it to:
set(row[row[col] for row in rows)
But then it returns:
KeyError: "None of [Index(['Apple', 'Banana', 'Grape' dtype='object', length=2318)] are in the [columns]"
Thanks for your time!
In general, you don't need to do such things yourself, because pandas already does them for you. (The single letters appear because iterating over a DataFrame yields its column labels, which are strings, so row[col] ends up indexing into a string.)
In this case, what you want is the unique method, which you can call on a Series directly (pd.Series is the abstraction that represents, among other things, columns), and which returns a numpy array containing the unique values in that Series.
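For example (a sketch, assuming a DataFrame df with a hypothetical 'fruit' column):

unique_fruits = df['fruit'].unique()   # e.g. array(['Apple', 'Banana', 'Grape'], dtype=object)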
If you want the unique values for multiple columns, you can do something like this:
which_columns = ... # specify the columns whose unique values you want here
uniques = {col: df[col].unique() for col in which_columns}
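Relatedly, if you only need how many distinct values each column has, DataFrame.nunique does it in one call:

df.nunique()   # Series mapping each column to its number of unique values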
If you are working with categorical columns, the following code is very useful.
It will not only print the unique values but also the count of each unique value:
categorical_columns = ['col1', 'col2', ..., 'coln']  # replace with your column names
# Print frequency of categories
for col in categorical_columns:
    print('\nFrequency of Categories for variable %s' % col)
    print(df[col].value_counts())
Example:
df
pets location owner
0 cat San_Diego Champ
1 dog New_York Ron
2 cat New_York Brick
3 monkey San_Diego Champ
4 dog San_Diego Veronica
5 dog New_York Ron
categorical_columns = ['pets','owner','location']
#Print frequency of categories
for col in categorical_columns:
    print('\nFrequency of Categories for variable %s' % col)
    print(df[col].value_counts())
Output:
# Frequency of Categories for variable pets
# dog 3
# cat 2
# monkey 1
# Name: pets, dtype: int64
# Frequency of Categories for variable owner
# Champ 2
# Ron 2
# Brick 1
# Veronica 1
# Name: owner, dtype: int64
# Frequency of Categories for variable location
# New_York 3
# San_Diego 3
# Name: location, dtype: int64
