modify a dataframe column with a dictionary - python

I have a column in my dataframe whose values are identifiers, and a dictionary that maps each identifier to its meaning.
I would like to replace each identifier in the column with its meaning.
I thought it was as simple as the following (the identifier column is at position 3 of the dataframe),
df.iloc[:,3] = my_dict[df.iloc[:,3]]
But I get the following error,
TypeError: unhashable type: 'Series'.
The column in question contains integers and the dictionary looks like the following:
my_dict = {1: "One", 2: "Two", 3: "Three"}
How could I make this change to my dataframe?
Thank you very much.

I would recommend applying a lookup function to your column:
def explain_column(x, my_dict):
    if x in my_dict:
        return my_dict[x]
    else:
        return x  # assuming that you won't change the value if it's not in the dict

df['my_column'] = df['my_column'].apply(lambda x: explain_column(x, my_dict))
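As a quick check, here is a minimal run with the dictionary from the question (the unmatched value 9 is left unchanged):
import pandas as pd

my_dict = {1: "One", 2: "Two", 3: "Three"}
df = pd.DataFrame({'my_column': [1, 2, 9]})
df['my_column'] = df['my_column'].apply(lambda x: explain_column(x, my_dict))
print(df)
#   my_column
# 0       One
# 1       Two
# 2         9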

You could use .map()
Example:
import pandas as pd
data = [[1, 2, 3],
        [2, 3, 4]]
columns = ['a','b', 'c']
df = pd.DataFrame(data, columns=columns)
my_dict = {1: "One", 2: "Two", 3: "Three"}
df['a'] = df['a'].map(my_dict).fillna(df['a'])
df['c'] = df['c'].map(my_dict).fillna(df['c'])
Output:
print(df)
a b c
0 1 2 3
1 2 3 4
Is now (note that the mapping was applied only to columns 'a' and 'c', and the .fillna() keeps values that have no dictionary entry instead of turning them into NaN):
print(df)
a b c
0 One 2 Three
1 Two 3 4

Related

Error when creating a data frame in Python: ValueError [duplicate]

This may be a simple question, but I cannot figure out how to do this. Let's say that I have two variables as follows.
a = 2
b = 3
I want to construct a DataFrame from this:
df2 = pd.DataFrame({'A':a,'B':b})
This generates an error:
ValueError: If using all scalar values, you must pass an index
I tried this also:
df2 = (pd.DataFrame({'a':a,'b':b})).reset_index()
This gives the same error message.
The error message says that if you're passing scalar values, you have to pass an index. So you can either not use scalar values for the columns -- e.g. use a list:
>>> df = pd.DataFrame({'A': [a], 'B': [b]})
>>> df
A B
0 2 3
or use scalar values and pass an index:
>>> df = pd.DataFrame({'A': a, 'B': b}, index=[0])
>>> df
A B
0 2 3
You may try wrapping your dictionary into a list:
my_dict = {'A':1,'B':2}
pd.DataFrame([my_dict])
A B
0 1 2
You can also use pd.DataFrame.from_records which is more convenient when you already have the dictionary in hand:
df = pd.DataFrame.from_records([{ 'A':a,'B':b }])
You can also set index, if you want, by:
df = pd.DataFrame.from_records([{ 'A':a,'B':b }], index='A')
You need to create a pandas series first. The second step is to convert the pandas series to pandas dataframe.
import pandas as pd
data = {'a': 1, 'b': 2}
pd.Series(data).to_frame()
You can even provide a column name.
pd.Series(data).to_frame('ColumnName')
Maybe Series would provide all the functions you need:
pd.Series({'A':a,'B':b})
A DataFrame can be thought of as a collection of Series, hence you can:
Concatenate multiple Series into one DataFrame (as described here)
Add a Series variable into an existing DataFrame (example here)
A quick sketch of both is shown below.
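A minimal sketch of both operations, using made-up Series s1 and s2:
import pandas as pd

s1 = pd.Series([2, 5], name='A')
s2 = pd.Series([3, 6], name='B')

# Concatenate multiple Series column-wise into one DataFrame
df = pd.concat([s1, s2], axis=1)

# Add another Series as a new column of the existing DataFrame
df['C'] = pd.Series([4, 7])
print(df)
#    A  B  C
# 0  2  3  4
# 1  5  6  7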
Pandas magic at work; all logic goes out the window.
The error message "ValueError: If using all scalar values, you must pass an index" says you must pass an index.
This does not necessarily mean that passing an index makes pandas do what you want it to do.
When you pass an index, pandas will treat your dictionary keys as column names and the values as what the column should contain for each of the values in the index.
a = 2
b = 3
df2 = pd.DataFrame({'A':a,'B':b}, index=[1])
A B
1 2 3
Passing a larger index:
df2 = pd.DataFrame({'A':a,'B':b}, index=[1, 2, 3, 4])
A B
1 2 3
2 2 3
3 2 3
4 2 3
An index is usually automatically generated by a dataframe when none is given. However, pandas does not know how many rows of 2 and 3 you want. You can however be more explicit about it
df2 = pd.DataFrame({'A':[a]*4,'B':[b]*4})
df2
A B
0 2 3
1 2 3
2 2 3
3 2 3
The default index is 0 based though.
I would recommend always passing a dictionary of lists to the dataframe constructor when creating dataframes. It's easier for other developers to read. Pandas has a lot of caveats; don't make other developers have to be experts in all of them in order to read your code.
You could try:
df2 = pd.DataFrame.from_dict({'a':a,'b':b}, orient = 'index')
From the documentation on the 'orient' argument: If the keys of the passed dict should be the columns of the resulting DataFrame, pass ‘columns’ (default). Otherwise if the keys should be rows, pass ‘index’.
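For the scalars above, this puts the dict keys on the rows (0 is the default column name):
import pandas as pd

a, b = 2, 3
df2 = pd.DataFrame.from_dict({'a': a, 'b': b}, orient='index')
print(df2)
#    0
# a  2
# b  3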
I usually use the following to quickly create a small table from dicts.
Let's say you have a dict where the keys are filenames and the values are their corresponding file sizes; you could use the following code to put it into a DataFrame (notice the .items() call on the dict):
files = {'A.txt':12, 'B.txt':34, 'C.txt':56, 'D.txt':78}
filesFrame = pd.DataFrame(files.items(), columns=['filename','size'])
print(filesFrame)
filename size
0 A.txt 12
1 B.txt 34
2 C.txt 56
3 D.txt 78
You need to provide iterables as the values for the Pandas DataFrame columns:
df2 = pd.DataFrame({'A':[a],'B':[b]})
I had the same problem with numpy arrays and the solution is to flatten them:
data = {
    'b': array1.flatten(),
    'a': array2.flatten(),
}
df = pd.DataFrame(data)
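For instance, with small stand-in arrays (array1/array2 here are hypothetical):
import numpy as np
import pandas as pd

array1 = np.array([[1, 2], [3, 4]])
array2 = np.array([[5, 6], [7, 8]])

df = pd.DataFrame({'b': array1.flatten(), 'a': array2.flatten()})
print(df)
#    b  a
# 0  1  5
# 1  2  6
# 2  3  7
# 3  4  8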
import pandas as pd
a=2
b=3
d = {'A': a, 'B': b}
pd.DataFrame(pd.Series(d)).T
# .T transposes the DataFrame
Result:
A B
0 2 3
To figure out the "ValueError", you need to understand DataFrames and "scalar values".
To create a DataFrame from a dict, at least one of the values must be array-like.
An array carries its own index: the elements of ['a', 's', 'd', 'f'] have indexes 0, 1, 2, 3 respectively.
Therefore, if there is an array-like value, there is no need to specify an index; the scalar values are broadcast along it:
df_array_like = pd.DataFrame({
    'col'   : 10086,
    'col_2' : True,
    'col_3' : "'at least one array'",
    'col_4' : ['one array is arbitrary length', 'multi arrays should be the same length']})
print("df_array_like: \n", df_array_like)
Output:
df_array_like:
col col_2 col_3 col_4
0 10086 True 'at least one array' one array is arbitrary length
1 10086 True 'at least one array' multi arrays should be the same length
As the output shows, the index of the DataFrame is 0 and 1, the same as the index of the array ['one array is arbitrary length', 'multi arrays should be the same length'].
If you comment out 'col_4', it will raise
ValueError("If using all scalar values, you must pass an index")
because a scalar value (an integer, bool, or string) does not carry an index, and
Index(...) must be called with a collection of some kind.
Since the index is used to locate all the rows of the DataFrame, it should be an array, e.g.:
df_scalar_value = pd.DataFrame({
    'col'   : 10086,
    'col_2' : True,
    'col_3' : "'at least one array'"
}, index = ['fst_row', 'snd_row', 'third_row'])
print("df_scalar_value: \n", df_scalar_value)
Output:
df_scalar_value:
col col_2 col_3
fst_row 10086 True 'at least one array'
snd_row 10086 True 'at least one array'
third_row 10086 True 'at least one array'
I tried transpose() and it worked.
Downside: You create a new object.
testdict1 = {'key1':'val1','key2':'val2','key3':'val3','key4':'val4'}
df = pd.DataFrame.from_dict(data=testdict1,orient='index')
print(df)
print(f'ID for DataFrame before Transpose: {id(df)}\n')
df = df.transpose()
print(df)
print(f'ID for DataFrame after Transpose: {id(df)}')
Output
0
key1 val1
key2 val2
key3 val3
key4 val4
ID for DataFrame before Transpose: 1932797100424
key1 key2 key3 key4
0 val1 val2 val3 val4
ID for DataFrame after Transpose: 1932797125448
The input does not have to be a list of records; it can be a single dictionary as well:
pd.DataFrame.from_records({'a':1,'b':2}, index=[0])
a b
0 1 2
Which seems to be equivalent to:
pd.DataFrame({'a':1,'b':2}, index=[0])
a b
0 1 2
This is because a DataFrame has two intuitive dimensions - the columns and the rows.
You are only specifying the columns using the dictionary keys.
If you only want to specify one dimensional data, use a Series!
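For instance, a Series built from the same dict carries the keys as its index, so no extra index is needed:
import pandas as pd

s = pd.Series({'a': 1, 'b': 2})
print(s)
# a    1
# b    2
# dtype: int64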
If you intend to convert a dictionary of scalars, you have to include an index:
import pandas as pd
alphabets = {'A': 'a', 'B': 'b'}
index = [0]
alphabets_df = pd.DataFrame(alphabets, index=index)
print(alphabets_df)
Although index is not required for a dictionary of lists, the same idea can be expanded to a dictionary of lists:
planets = {'planet': ['earth', 'mars', 'jupiter'], 'length_of_day': ['1', '1.03', '0.414']}
index = [0, 1, 2]
planets_df = pd.DataFrame(planets, index=index)
print(planets_df)
Of course, for the dictionary of lists, you can build the dataframe without an index:
planets_df = pd.DataFrame(planets)
print(planets_df)
Change your 'a' and 'b' values to a list, as follows:
a = [2]
b = [3]
then execute the same code as follows:
df2 = pd.DataFrame({'A':a,'B':b})
df2
and you'll get:
A B
0 2 3
The simplest option is:
d = {'A': a, 'B': b}
df = pd.DataFrame(d, index=np.arange(1))
Another option is to convert the scalars into list on the fly using Dictionary Comprehension:
df = pd.DataFrame(data={k: [v] for k, v in mydict.items()})
The dict comprehension creates a new dict whose values are one-element lists, such as:
In [20]: mydict
Out[20]: {'a': 1, 'b': 2}
In [21]: mydict2 = { k: [v] for k, v in mydict.items()}
In [22]: mydict2
Out[22]: {'a': [1], 'b': [2]}
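Putting it together (with mydict as above):
import pandas as pd

mydict = {'a': 1, 'b': 2}
df = pd.DataFrame(data={k: [v] for k, v in mydict.items()})
print(df)
#    a  b
# 0  1  2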
Convert the dictionary to a DataFrame:
col_dict_df = pd.Series(col_dict).to_frame('new_col').reset_index()
Then give the columns new names:
col_dict_df.columns = ['col1', 'col2']
If you have a dictionary, you can turn it into a pandas DataFrame with the following line of code:
pd.DataFrame({"key": list(d.keys()), "value": list(d.values())})
Just pass the dict on a list:
a = 2
b = 3
df2 = pd.DataFrame([{'A':a,'B':b}])

Get a Dataframe index if it is of a specific dtype

I have been trying to build a preprocessing pipeline, but I am struggling a little to generate a list of the indexes for each column that is an object dtype. I have been able to get the names of each into an array using the following code:
categorical_features = [col for col in input.columns if input[col].dtype == 'object']
Is there an easy way to get the index of these columns, from the original input dataframe into a list, like this one that I built manually?
c = [1,3,4,5,6,7,8,9,10,11,12,14,15,16,17,18,19,20,21,22,23,24,25,28,29,
30,31,38,39,40,41,42,43,44,45,50,51,55,56]
Use df.select_dtypes + df.columns.get_indexer:
categorical_features = df.columns.get_indexer(df.select_dtypes('object').columns)
df.select_dtypes returns a copy of df with only the columns that are of the specified dtype(s) (you can specify multiple, e.g. df.select_dtypes(['object', 'int'])).
df.columns.get_indexer returns the indexes of the specified columns.
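For example, a minimal demonstration with made-up data, where the object columns sit at positions 0 and 2:
import pandas as pd

df = pd.DataFrame({'name': ['x', 'y'], 'n': [1, 2], 'city': ['u', 'v']})
categorical_features = df.columns.get_indexer(df.select_dtypes('object').columns)
print(categorical_features)  # [0 2]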
I think you need select_dtypes and enumerate:
df = pd.DataFrame({'A' : ['A', 'B', 'C'], 'B' : [1,2,3], 'C' : [1, '2', '3']})
print(df)
A B C
0 A 1 1
1 B 2 2
2 C 3 3
obj_cols = df.select_dtypes('object').columns
idx_cols = [idx for idx, col in enumerate(df.columns) if col in obj_cols]
[0, 2]
enumerate can help with that:
categorical_features_indexes = [i for i, col in enumerate(input.columns) if input[col].dtype == 'object']

How to change values in a given column based on a condition and put those new values in a new column? [duplicate]

I have a dictionary which looks like this: di = {1: "A", 2: "B"}
I would like to apply it to the col1 column of a dataframe similar to:
col1 col2
0 w a
1 1 2
2 2 NaN
to get:
col1 col2
0 w a
1 A 2
2 B NaN
How can I best do this? For some reason googling terms relating to this only shows me links about how to make columns from dicts and vice-versa :-/
You can use .replace. For example:
>>> df = pd.DataFrame({'col2': {0: 'a', 1: 2, 2: np.nan}, 'col1': {0: 'w', 1: 1, 2: 2}})
>>> di = {1: "A", 2: "B"}
>>> df
col1 col2
0 w a
1 1 2
2 2 NaN
>>> df.replace({"col1": di})
col1 col2
0 w a
1 A 2
2 B NaN
or directly on the Series, i.e. df["col1"].replace(di, inplace=True).
map can be much faster than replace
If your dictionary has more than a couple of keys, using map can be much faster than replace. There are two versions of this approach, depending on whether your dictionary exhaustively maps all possible values (and also whether you want non-matches to keep their values or be converted to NaNs):
Exhaustive Mapping
In this case, the form is very simple:
df['col1'].map(di) # note: if the dictionary does not exhaustively map all
# entries then non-matched entries are changed to NaNs
Although map most commonly takes a function as its argument, it can alternatively take a dictionary or Series: see the documentation for pandas.Series.map.
Non-Exhaustive Mapping
If you have a non-exhaustive mapping and wish to retain the existing variables for non-matches, you can add fillna:
df['col1'].map(di).fillna(df['col1'])
as in @jpp's answer here: Replace values in a pandas series via dictionary efficiently
Benchmarks
Using the following data with pandas version 0.23.1:
di = {1: "A", 2: "B", 3: "C", 4: "D", 5: "E", 6: "F", 7: "G", 8: "H" }
df = pd.DataFrame({ 'col1': np.random.choice( range(1,9), 100000 ) })
and testing with %timeit, it appears that map is approximately 10x faster than replace.
Note that your speedup with map will vary with your data. The largest speedup appears to be with large dictionaries and exhaustive replaces. See @jpp's answer (linked above) for more extensive benchmarks and discussion.
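A minimal timing sketch along these lines (absolute numbers will vary with your machine and pandas version):
import numpy as np
import pandas as pd
from timeit import timeit

di = {1: "A", 2: "B", 3: "C", 4: "D", 5: "E", 6: "F", 7: "G", 8: "H"}
df = pd.DataFrame({'col1': np.random.choice(range(1, 9), 100000)})

t_map = timeit(lambda: df['col1'].map(di), number=100)
t_replace = timeit(lambda: df['col1'].replace(di), number=100)
print(f"map: {t_map:.3f}s  replace: {t_replace:.3f}s")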
There is a bit of ambiguity in your question. There are at least three interpretations:
the keys in di refer to index values
the keys in di refer to df['col1'] values
the keys in di refer to index locations (not the OP's question, but thrown in for fun.)
Below is a solution for each case.
Case 1:
If the keys of di are meant to refer to index values, then you could use the update method:
df['col1'].update(pd.Series(di))
For example,
import pandas as pd
import numpy as np
df = pd.DataFrame({'col1': ['w', 10, 20],
                   'col2': ['a', 30, np.nan]},
                  index=[1, 2, 0])
# col1 col2
# 1 w a
# 2 10 30
# 0 20 NaN
di = {0: "A", 2: "B"}
# The value at the 0-index is mapped to 'A', the value at the 2-index is mapped to 'B'
df['col1'].update(pd.Series(di))
print(df)
yields
col1 col2
1 w a
2 B 30
0 A NaN
I've modified the values from your original post so it is clearer what update is doing.
Note how the keys in di are associated with index values. The order of the index values -- that is, the index locations -- does not matter.
Case 2:
If the keys in di refer to df['col1'] values, then @DanAllan and @DSM show how to achieve this with replace:
import pandas as pd
import numpy as np
df = pd.DataFrame({'col1': ['w', 10, 20],
                   'col2': ['a', 30, np.nan]},
                  index=[1, 2, 0])
print(df)
# col1 col2
# 1 w a
# 2 10 30
# 0 20 NaN
di = {10: "A", 20: "B"}
# The values 10 and 20 are replaced by 'A' and 'B'
df['col1'].replace(di, inplace=True)
print(df)
yields
col1 col2
1 w a
2 A 30
0 B NaN
Note how in this case the keys in di were changed to match values in df['col1'].
Case 3:
If the keys in di refer to index locations, then you could use
df['col1'].put(di.keys(), di.values())
since
df = pd.DataFrame({'col1': ['w', 10, 20],
                   'col2': ['a', 30, np.nan]},
                  index=[1, 2, 0])
di = {0: "A", 2: "B"}
# The values at the 0 and 2 index locations are replaced by 'A' and 'B'
df['col1'].put(di.keys(), di.values())
print(df)
yields
col1 col2
1 A a
2 10 30
0 B NaN
Here, the first and third rows were altered, because the keys in di are 0 and 2, which with Python's 0-based indexing refer to the first and third locations.
DSM has the accepted answer, but the coding doesn't seem to work for everyone. Here is one that works with the current version of pandas (0.23.4 as of 8/2018):
import pandas as pd
df = pd.DataFrame({'col1': [1, 2, 2, 3, 1],
                   'col2': ['negative', 'positive', 'neutral', 'neutral', 'positive']})
conversion_dict = {'negative': -1, 'neutral': 0, 'positive': 1}
df['converted_column'] = df['col2'].replace(conversion_dict)
print(df.head())
You'll see it looks like:
col1 col2 converted_column
0 1 negative -1
1 2 positive 1
2 2 neutral 0
3 3 neutral 0
4 1 positive 1
The docs for pandas.DataFrame.replace are here.
Given that map is faster than replace (@JohnE's solution), you need to be careful with non-exhaustive mappings where you intend to map specific values to NaN. The proper method in this case requires that you mask the Series when you .fillna, otherwise you undo the mapping to NaN.
import pandas as pd
import numpy as np
d = {'m': 'Male', 'f': 'Female', 'missing': np.nan}
df = pd.DataFrame({'gender': ['m', 'f', 'missing', 'Male', 'U']})
keep_nan = [k for k,v in d.items() if pd.isnull(v)]
s = df['gender']
df['mapped'] = s.map(d).fillna(s.mask(s.isin(keep_nan)))
gender mapped
0 m Male
1 f Female
2 missing NaN
3 Male Male
4 U U
Adding to this, if you ever have more than one column to remap in a dataframe:
def remap(data, dict_labels):
    """
    Take a dictionary of label mappings (dict_labels) and replace
    the (previously label-encoded) values with their strings.
    ex: dict_labels = {'col1': {1: 'A', 2: 'B'}}
    """
    for field, values in dict_labels.items():
        print("Remapping %s" % field)
        data.replace({field: values}, inplace=True)
    print("DONE")
    return data
Hope it can be useful to someone.
Cheers
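A hypothetical usage on a small frame (the column names and labels here are made up):
import pandas as pd

df = pd.DataFrame({'col1': [1, 2, 1], 'col2': [2, 1, 2]})
dict_labels = {'col1': {1: 'A', 2: 'B'}, 'col2': {1: 'X', 2: 'Y'}}
df = remap(df, dict_labels)
print(df)
#   col1 col2
# 0    A    Y
# 1    B    X
# 2    A    Y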
Or do apply:
df['col1'].apply(lambda x: {1: "A", 2: "B"}.get(x,x))
Demo:
>>> df['col1']=df['col1'].apply(lambda x: {1: "A", 2: "B"}.get(x,x))
>>> df
col1 col2
0 w a
1 1 2
2 2 NaN
>>>
You can update your mapping dictionary with missing pairs from the dataframe. For example:
df = pd.DataFrame({'col1': ['a', 'b', 'c', 'd', np.nan]})
map_ = {'a': 'A', 'b': 'B', 'd': np.nan}
# Get mapping from df
uniques = df['col1'].unique()
map_new = dict(zip(uniques, uniques))
# {'a': 'a', 'b': 'b', 'c': 'c', 'd': 'd', nan: nan}
# Update mapping
map_new.update(map_)
# {'a': 'A', 'b': 'B', 'c': 'c', 'd': nan, nan: nan}
df['col2'] = df['col1'].map(map_new)
Result:
col1 col2
0 a A
1 b B
2 c c
3 d NaN
4 NaN NaN
A nice complete solution that keeps a map of your class labels:
labels = features['col1'].unique()
labels_dict = dict(zip(labels, range(len(labels))))
features = features.replace({"col1": labels_dict})
This way, you can at any point refer to the original class label from labels_dict.
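For example, a hypothetical inverse lookup to recover the original label from an encoded value:
inverse_labels = {v: k for k, v in labels_dict.items()}
original = inverse_labels[0]  # the class label that was encoded as 0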
As an extension to what was proposed by Nico Coallier (apply to multiple columns) and U10-Forward (using the apply style of methods), summarised into a one-liner, I propose:
df.loc[:, ['col1', 'col2']].transform(lambda x: x.map(lambda x: {1: "A", 2: "B"}.get(x, x)))
.transform() processes each column as a Series, contrary to .apply(), which passes the columns aggregated in a DataFrame. Consequently you can apply the Series method map().
Finally, and I discovered this behaviour thanks to U10-Forward, you can pass the whole Series through map(), which processes it element by element.
The .get(x, x) accounts for the values you did not mention in your mapping dictionary, which would otherwise be turned into NaN by the .map() method.
Another approach is to apply a string-replace function:
import re

def multiple_replace(mapping, text):
    # Create a regular expression from the dictionary keys
    regex = re.compile("(%s)" % "|".join(map(re.escape, mapping.keys())))
    # For each match, look up the corresponding value in the dictionary
    return regex.sub(lambda mo: mapping[mo.group()], text)

Once you have defined the function, you can apply it to your dataframe. Note that both the keys and the values of the mapping must be strings for the regex substitution to work:
di = {"1": "A", "2": "B"}
df['col1'] = df.apply(lambda row: multiple_replace(di, str(row['col1'])), axis=1)

Python: Appending Value to a dataframe

So basically I am trying to declare an empty dataframe, then assign some values to it
df = pd.DataFrame()
df["a"] = 1234
df["b"] = b # Already defined earlier
df["c"] = c # Already defined earlier
df["t"] = df["b"]/df["c"]
I am getting the below output:
Empty DataFrame
Columns: [a, b, c, t]
Index: []
Can anyone explain why I am getting this empty dataframe even when I am assigning the values? Sorry if my question is kind of basic
I think you have to initialize the DataFrame like this:
df = pd.DataFrame(data=[[1234, b, c, b/c]], columns=list("abct"))
When you create a DataFrame with no initial data, it has no rows and no columns.
So assigning scalars to it the way you did does not add any rows.
Simply add those values as a list, e.g.:
df["a"] = [123]
You have started by initialising an empty DataFrame:
# Initialising an empty dataframe
df = pd.DataFrame()
# Print the DataFrame
print(df)
Result
Empty DataFrame
Columns: []
Index: []
As next you've created a column inside the empty DataFrame:
df["a"] = 1234
print(df)
Result
Empty DataFrame
Columns: [a]
Index: []
But you never added values to the existing column "a" - e.g. by using a dictionary (key "a" and value list [1, 2, 3, 4]):
df = pd.DataFrame({"a":[1, 2, 3, 4]})
print(df)
Result:
   a
0  1
1  2
2  3
3  4
In case a list of values is added, each value gets an index entry.
The problem is that a cell in a table needs both a row index value and a column index value to insert the cell value. So you need to decide if "a", "b", "c" and "t" are columns or row indexes.
If they are column indexes, then you'd need a row index (0 in the example below) along with what you have written above:
df = pd.DataFrame()
df.loc[0, "a"] = 1234
df.loc[0, "b"] = 2
df.loc[0, "c"] = 3
Result:
In : df
Out:
a b c
0 1234.0 2.0 3.0
Now that you have data in the dataframe you can perform column operations (i.e., create a new column "t" and for each row assign the value of the corresponding item under "b" divided by the corresponding items under "c"):
df["t"] = df["b"]/df["c"]
Of course, you can also use different indexes for each item as follows:
df = pd.DataFrame()
df.loc[0, "a"] = 1234
df.loc[1, "b"] = 2
df.loc[2, "c"] = 3
Result:
In : df
Out:
a b c
0 1234.0 NaN NaN
1 NaN 2.0 NaN
2 NaN NaN 3.0
But as you can see, the cells where you have not specified the (row, column, value) tuple are now NaN. This means if you try df["b"]/df["c"] you will get NaN values out, as you are trying an arithmetic operation with a NaN value.
In : df["b"]/df["c"]
Out:
0 NaN
1 NaN
2 NaN
dtype: float64
The converse is if you wanted to insert the items under one column. You'd now need a column header for this (0 in the below):
df = pd.DataFrame()
df.loc["a", 0] = 1234
df.loc["b", 0] = 2
df.loc["c", 0] = 3
Result:
In : df
Out:
0
a 1234.0
b 2.0
c 3.0
Now, in inserting the value for "t", you'd need to specify exactly which cells you are referring to (note that pandas won't perform vectorised row operations in the same way that it performs vectorised column operations).
df.loc["t", 0] = df.loc["b", 0]/df.loc["c", 0]

