Mapping a dictionary onto a dataframe column when the dictionary values are lists - python

I have a dictionary where the values are lists:
D = {1: ['a', 'b'], 2: ['c', 'd']}
I want to map the dictionary onto col1 of my dataframe.
col1
a
c
If the value of col1 appears in one of the dictionary's value lists, I want to replace it with the corresponding dictionary key.
Like this, my dataframe will become:
col1
1
2
Thanks in advance.

I would first invert the dictionary so each item maps to its key:
mapping = {}
for key, values in D.items():
    for item in values:
        mapping[item] = key
and then map it onto the column:
df['col1'] = df['col1'].map(mapping)
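The same inversion fits in a single dict comprehension; a minimal runnable sketch, assuming (as in the question) that no item appears under more than one key:

```python
import pandas as pd

D = {1: ['a', 'b'], 2: ['c', 'd']}

# Invert {key: [items]} into {item: key}.
mapping = {item: key for key, values in D.items() for item in values}

df = pd.DataFrame({'col1': ['a', 'c']})
df['col1'] = df['col1'].map(mapping)
print(df)
```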

You can also try using stack + reset_index and set_index with map.
d = pd.DataFrame({1: ['a','b'], 2:['c', 'd']})
mapping = d.stack().reset_index().set_index(0)["level_1"]
s = pd.Series(['a', 'c'], name="col1")
s.map(mapping)
0 1
1 2
Name: col1, dtype: int64
Step by step demo
d.stack()
0  1    a
   2    c
1  1    b
   2    d
dtype: object
d.stack().reset_index()
   level_0  level_1  0
0        0        1  a
1        0        2  c
2        1        1  b
3        1        2  d
d.stack().reset_index().set_index(0)
   level_0  level_1
0
a        0        1
c        0        2
b        1        1
d        1        2
Finally, we select the level_1 column as our mapping to pass in map function.
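Put together, a self-contained sketch of the whole pipeline:

```python
import pandas as pd

d = pd.DataFrame({1: ['a', 'b'], 2: ['c', 'd']})
# Stack to long form, then make the letters the index and keep the
# original column labels (level_1) as the mapping target.
mapping = d.stack().reset_index().set_index(0)["level_1"]

s = pd.Series(['a', 'c'], name="col1")
print(s.map(mapping))
```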

Do you mean something like this?
D = {1: ['a', 'b'], 2: ['c', 'd']}
for key, value in D.items():
    for each in value:
        if each in D[key]:
            print(each, "is in D[%s]" % key)
Output:
a is in D[1]
b is in D[1]
c is in D[2]
d is in D[2]

Related

Replace value in column by value in list by index

My dataframe column contains indices into a list, like:
id | idx
A | 0
B | 0
C | 2
D | 1
list = ['a', 'b', 'c', 'd']
I want to replace each value in the idx column greater than 0 with the list element at the corresponding index, so that:
id | idx
A | 0
B | 0
C | c # list[2]
D | b # list[1]
I tried to do this with a loop, but it does nothing... although if I remove ['idx'] it replaces all values in that row.
for index in df.idx.values:
    if index >= 1:
        df[df.idx == index]['idx'] = list[index]
Don't use list as a variable name, because it shadows the Python builtin. Also, df[df.idx == index]['idx'] = ... is chained assignment: it modifies a temporary copy, which is why your loop does nothing.
Then use Series.map with enumerate inside Series.mask:
L = ['a', 'b', 'c', 'd']
df['idx'] = df['idx'].mask(df['idx'] >=1, df['idx'].map(dict(enumerate(L))))
print (df)
id idx
0 A 0
1 B 0
2 C c
3 D b
Similar idea is processing only matched rows by mask:
L = ['a', 'b', 'c', 'd']
m = df['idx'] >=1
df.loc[m,'idx'] = df.loc[m,'idx'].map(dict(enumerate(L)))
print (df)
id idx
0 A 0
1 B 0
2 C c
3 D b
Create a dictionary for the items whose index is greater than 0, then use that mapping with replace to get your output:
mapping = dict((key,val) for key,val in enumerate(l) if key > 0)
print(mapping)
{1: 'b', 2: 'c', 3: 'd'}
df.replace(mapping)
id idx
0 A 0
1 B 0
2 C c
3 D b
Note: I changed the list variable name to l.
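For reference, a self-contained sketch of the Series.mask approach, with the frame built to match the question's data:

```python
import pandas as pd

df = pd.DataFrame({'id': ['A', 'B', 'C', 'D'], 'idx': [0, 0, 2, 1]})
L = ['a', 'b', 'c', 'd']

# Where idx >= 1, replace it with the list element at that position;
# elsewhere keep the original 0.
df['idx'] = df['idx'].mask(df['idx'] >= 1, df['idx'].map(dict(enumerate(L))))
print(df)
```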

Pandas new column from indexing list by row value

I am looking to create a new column in a Pandas data frame with the value of a list filtered by the df row value.
df = pd.DataFrame({'Index': [0,1,3,2], 'OtherColumn': ['a', 'b', 'c', 'd']})
Index OtherColumn
0 a
1 b
3 c
2 d
l = [1000, 1001, 1002, 1003]
Desired output:
Index OtherColumn Value
0 a -
1 b -
3 c 1003
2 d -
My code:
df.loc[df.OtherColumn == 'c', 'Value'] = l[df.Index]
This raises an error, since df.Index is not a single int but the whole column (it isn't filtered by OtherColumn == 'c').
For R users, I'm looking for:
df[OtherColumn == 'c', Value := l[Index]]
Thanks.
Convert the list to a numpy array for positional indexing, then filter by the mask on both sides:
m = df.OtherColumn == 'c'
df.loc[m, 'Value'] = np.array(l)[df.Index][m]
print (df)
Index OtherColumn Value
0 0 a NaN
1 1 b NaN
2 3 c 1003.0
3 2 d NaN
Or use numpy.where:
m = df.OtherColumn == 'c'
df['Value'] = np.where(m, np.array(l)[df.Index], '-')
print (df)
Index OtherColumn Value
0 0 a -
1 1 b -
2 3 c 1003
3 2 d -
Or:
df['value'] = np.where(m, df['Index'].map(dict(enumerate(l))), '-')
Use Series.where + Series.map:
df['value']=df['Index'].map(dict(enumerate(l))).where(df['OtherColumn']=='c','-')
print(df)
Index OtherColumn value
0 0 a -
1 1 b -
2 3 c 1003
3 2 d -
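A self-contained version of the map + where variant, for quick testing:

```python
import pandas as pd

df = pd.DataFrame({'Index': [0, 1, 3, 2], 'OtherColumn': ['a', 'b', 'c', 'd']})
l = [1000, 1001, 1002, 1003]

# Look each Index up in the list by position, then blank out rows
# where OtherColumn != 'c'.
df['Value'] = df['Index'].map(dict(enumerate(l))).where(df['OtherColumn'] == 'c', '-')
print(df)
```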

Most efficient way to return Column name in a pandas df

I have a pandas df that contains 4 different columns. In every row there's one value of importance, and I want to return the name of the column where that value appears. So for the df below I want to return the column name wherever the value 2 is found.
d = ({
    'A': [2, 0, 0, 2],
    'B': [0, 0, 2, 0],
    'C': [0, 2, 0, 0],
    'D': [0, 0, 0, 0],
})
df = pd.DataFrame(data=d)
Output:
A B C D
0 2 0 0 0
1 0 0 2 0
2 0 2 0 0
3 2 0 0 0
So it would be A,C,B,A
I'm doing this via
m = (df == 2).idxmax(axis=1)[0]
And then changing the row. But this isn't very efficient.
I'm also hoping to produce the output as a Series from pandas df
Use DataFrame.dot:
df.astype(bool).dot(df.columns).str.cat(sep=',')
Or,
','.join(df.astype(bool).dot(df.columns))
'A,C,B,A'
Or, as a list:
df.astype(bool).dot(df.columns).tolist()
['A', 'C', 'B', 'A']
...or a Series:
df.astype(bool).dot(df.columns)
0 A
1 C
2 B
3 A
dtype: object
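One caveat: astype(bool) treats any non-zero cell as True, so if the frame could ever hold values other than 0 and 2, comparing against 2 explicitly is safer. A sketch with the same data:

```python
import pandas as pd

d = {'A': [2, 0, 0, 2], 'B': [0, 0, 2, 0], 'C': [0, 2, 0, 0], 'D': [0, 0, 0, 0]}
df = pd.DataFrame(data=d)

# Boolean mask of exactly the cells equal to 2, dotted with the column labels.
result = (df == 2).dot(df.columns)
print(result.tolist())
```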

Map dictionary values in Pandas

I have a data frame (df) with the following:
var1
a 1
a 1
b 2
b 3
c 3
d 5
And a dictionary:
dict_cat = {
    'x': ['a', 'b', 'c'],
    'y': 'd'}
And I want to create a new column called cat in which depending of the var1 value, it takes the dict key value:
var1 cat
a 1 x
a 1 x
b 2 x
b 3 x
c 3 x
d 5 y
I have tried to map the dict to the variable using df['cat'] = df['var1'].map(dict_cat), but since the values are inside a list, Python does not recognize them and I only get NaN. Is there a way to do this using map, or should I create a function that iterates over rows checking whether var1 is in the dictionary lists?
Thanks!
You need to swap keys with values into a new dict and then use map:
print (df)
var1 var2
0 a 1
1 a 1
2 b 2
3 b 3
4 c 3
5 d 5
dict_cat = {'x' : ['a', 'b', 'c'],'y' : 'd' }
d = {k: oldk for oldk, oldv in dict_cat.items() for k in oldv}
print (d)
{'a': 'x', 'b': 'x', 'c': 'x', 'd': 'y'}
df['cat'] = df['var1'].map(d)
print (df)
var1 var2 cat
0 a 1 x
1 a 1 x
2 b 2 x
3 b 3 x
4 c 3 x
5 d 5 y
If the first column is the index, you can use rename, or convert the index with to_series and then use map:
print (df)
var1
a 1
a 1
b 2
b 3
c 3
d 5
dict_cat = {'x' : ['a', 'b', 'c'],'y' : 'd' }
d = {k: oldk for oldk, oldv in dict_cat.items() for k in oldv}
df['cat'] = df.rename(d).index
Or:
df['cat'] = df.index.to_series().map(d)
print (df)
var1 cat
a 1 x
a 1 x
b 2 x
b 3 x
c 3 x
d 5 y
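Note that the comprehension above handles the bare string value 'd' only because iterating a string yields its characters; a multi-character value such as 'de' would be split into 'd' and 'e'. A slightly more defensive sketch wraps bare strings in a list first:

```python
dict_cat = {'x': ['a', 'b', 'c'], 'y': 'd'}

d = {}
for cat, items in dict_cat.items():
    if isinstance(items, str):
        # Wrap a bare string so it counts as one item, not one per character.
        items = [items]
    for item in items:
        d[item] = cat
print(d)
```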

use column name as condition for where on pandas DataFrame

Say I have the following DataFrame:
arrays = [['foo', 'foo', 'bar', 'bar'],
['A', 'B', 'C', 'D']]
tuples = list(zip(*arrays))
columnValues = pd.MultiIndex.from_tuples(tuples)
df = pd.DataFrame(np.random.rand(4,4), columns = columnValues)
print(df)
foo bar
A B C D
0 0.037362 0.470010 0.315396 0.333798
1 0.339038 0.396307 0.487242 0.064883
2 0.691654 0.793609 0.044490 0.384154
3 0.605801 0.967021 0.156839 0.123816
I want to produce the following output:
foo bar
A B C D
0 0 0 0.315396 0.333798
1 0 0 0.487242 0.064883
2 0 0 0.044490 0.384154
3 0 0 0.156839 0.123816
I think I can use pd.DataFrame.where() for this, however I don't see how to pass the column name bar as a condition.
EDIT: I'm looking for a way to specifically use bar instead of foo to produce the desired outcome, as foo would actually be many columns
EDIT2: Unfortunately list comprehension breaks if the list contains all the column labels. Explicitly writing out the for loop does work though.
So instead of this:
df.loc[:, [col for col in df.columns.levels[0] if col != 'bar']] = 0
I use this:
for col in df.columns.levels[0]:
    if col not in nameList:
        df.loc[:, col] = 0
Use slicing to set your data. Here, you could access sub-columns (A, B), under foo.
In [12]: df
Out[12]:
foo bar
A B C D
0 0.040251 0.119267 0.170111 0.582362
1 0.978192 0.592043 0.515702 0.630627
2 0.762532 0.667234 0.450505 0.103858
3 0.871375 0.397503 0.966837 0.870184
In [13]: df.loc[:, 'foo'] = 0
In [14]: df
Out[14]:
foo bar
A B C D
0 0 0 0.170111 0.582362
1 0 0 0.515702 0.630627
2 0 0 0.450505 0.103858
3 0 0 0.966837 0.870184
If you want to set all columns except bar, you could do.
In [15]: df.loc[:, [col for col in df.columns.levels[0] if col != 'bar']] = 0
You could use get_level_values, I guess:
>>> df
foo bar
A B C D
0 0.039728 0.065875 0.825380 0.240403
1 0.617857 0.895751 0.484237 0.506315
2 0.332381 0.047287 0.011291 0.346073
3 0.216224 0.024978 0.834353 0.500970
>>> df.loc[:, df.columns.get_level_values(0) != "bar"] = 0
>>> df
foo bar
A B C D
0 0 0 0.825380 0.240403
1 0 0 0.484237 0.506315
2 0 0 0.011291 0.346073
3 0 0 0.834353 0.500970
df.columns.droplevel(1) != "bar" should also work, although I don't like it as much even though it's shorter because it inverts the selection logic.
Easier, without loc:
df['foo'] = 0
If you don't have this MultiIndex you can use loc (df.ix is deprecated and has been removed from pandas):
df.loc[:, ['A', 'B']] = 0
This automatically replaces the values in your columns 'A' and 'B' with 0.
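A self-contained sketch of the get_level_values approach (the data is random, so only the zeroed block is deterministic):

```python
import numpy as np
import pandas as pd

arrays = [['foo', 'foo', 'bar', 'bar'], ['A', 'B', 'C', 'D']]
columnValues = pd.MultiIndex.from_tuples(list(zip(*arrays)))
df = pd.DataFrame(np.random.rand(4, 4), columns=columnValues)

# Zero every column whose top level is not 'bar'.
df.loc[:, df.columns.get_level_values(0) != 'bar'] = 0
print(df)
```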
