How to replace NaN with a dictionary in a pandas DataFrame column - python

I want to replace each NaN in a DataFrame column with a dictionary like this: {"value": ["100"]}
df[column].apply(type).value_counts()
output:
<class 'dict'> 11565
<class 'float'> 43
df[column].isna().sum()
output => 43
How can I do this?

Use a lambda function to replace NaN values with the dictionary:
import numpy as np
import pandas as pd

df = pd.DataFrame({'column': [np.nan, {'a': [4, 5]}]})
d = {"value":["100"]}
df['column'] = df['column'].apply(lambda x: d if pd.isna(x) else x)
print (df)
column
0 {'value': ['100']}
1 {'a': [4, 5]}
Or list comprehension:
df['column'] = [d if pd.isna(x) else x for x in df['column']]

Related

use specific columns to map new column with json

I have a data frame with:
A B C
1 3 6
I want to take the 2 columns and create column D that reads {"A": "1", "C": "6"}
new dataframe output would be:
A B C D
1 3 6 {"A": "1", "C": "6"}
I have the following code:
df['D'] = df.apply(lambda x: x.to_json(), axis=1)
but this takes all columns, while I only need columns A and C and want to leave B out of the JSON that is created.
Any tips on just targeting the two columns would be appreciated.
It's not exactly what you asked, but you can convert your 2 columns into a dict; then, if you want to export the data in JSON format, use df['D'].to_json():
df['D'] = df[['A', 'C']].apply(dict, axis=1)
print(df)
# Output
A B C D
0 1 3 6 {'A': 1, 'C': 6}
For example, export the column D as JSON:
print(df['D'].to_json(orient='records', indent=4))
# Output
[
{
"A":1,
"C":6
}
]
Use a subset inside the lambda function:
df['D'] = df.apply(lambda x: x[['A','C']].to_json(), axis=1)
Or select the columns before apply:
df['D'] = df[['A','C']].apply(lambda x: x.to_json(), axis=1)
If plain dictionaries are acceptable, create them directly:
df['D'] = df[['A','C']].to_dict(orient='records')
print (df)
A B C D
0 1 3 6 {'A': 1, 'C': 6}
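If the goal is the exact string from the question, with the values quoted as strings, one possible sketch is to cast before serializing (the astype(str) step is an assumption about the desired quoting):

```python
import json

import pandas as pd

df = pd.DataFrame({'A': [1], 'B': [3], 'C': [6]})

# cast A and C to str so the JSON values come out quoted,
# then serialize each row's dict to a string
df['D'] = df[['A', 'C']].astype(str).apply(lambda row: json.dumps(row.to_dict()),
                                           axis=1)
```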

Replace values of empty dictionaries in a dataframe column

Given the following:
data = pd.DataFrame({"a": [{}, 1, 2]})
How best to replace {} with a particular value?
The following works:
rep = 0
data.apply(lambda x: [y if not isinstance(y, dict) else rep for y in x])
but I'm wondering if there's something more idiomatic.
Try with bool: an empty object returns False, so invert the mask to select the empty dicts:
data.loc[~data.a.astype(bool),'a'] = 0
data
Out[103]:
a
0 0
1 1
2 2
You can use pd.to_numeric with errors='coerce':
In [24]: data['a'] = pd.to_numeric(data['a'], errors='coerce').fillna(0).astype(int)
In [25]: data
Out[25]:
a
0 0
1 1
2 2
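Both approaches above treat every falsy or non-numeric value the same way. If only literal empty dicts should be replaced, a stricter sketch (the isinstance check is my reading of the intent):

```python
import pandas as pd

data = pd.DataFrame({"a": [{}, 1, 2]})
rep = 0

# match only cells that are dicts AND empty; other falsy
# values (0, '', etc.) are left untouched
mask = data['a'].apply(lambda v: isinstance(v, dict) and not v)
data.loc[mask, 'a'] = rep
```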

Pandas how to check if a substring of one column is a substring of another column

I have an input like this, and I want to check whether a substring in col_1 exists in col_2 or not. What I wanted to try involves two steps: split the text from col_1, then use a for loop to compare against col_2. I'm wondering if I can achieve this via df.apply?
INPUT>
dct = {'col_1': ['X_a', 'Y_b'],
       'col_2': ['a_b_c', 'c_d_e']}
df = pd.DataFrame(dct)
EXPECTED RESULT>
col_1 col_2 result
0 X_a a_b_c True
1 Y_b c_d_e False
Do you need something involving set intersections? Cast the elementwise intersection to bool to get True/False:
df['result'] = (df['col_1'].str.split('_').map(set)
                & df['col_2'].str.split('_').map(set)).astype(bool)
df
col_1 col_2 result
0 X_a a_b_c True
1 Y_b c_d_e False
You can use df.apply with axis=1. This means the function is applied to each row.
>>> import pandas as pd
>>>
>>> dct = {'col_1': ['X_a', 'Y_b'],
... 'col_2': ['a', 'c',]}
>>> df = pd.DataFrame(dct)
>>>
>>> def check_substring(row):
... _, second = row.col_1.split("_")
... return second in row.col_2
...
>>> df["result"] = df.apply(check_substring, axis=1)
>>> print(df)
col_1 col_2 result
0 X_a a True
1 Y_b c False
This is a one-liner using apply and an inner loop.
df['result'] = df.apply(lambda x: any(y in x['col_1'] for y in x['col_2'].split('_')), axis=1)
Example
dct = {'col_1': ['X_a', 'Y_b'],
       'col_2': ['a_b_c', 'c_d_e']}
df = pd.DataFrame(dct)
df['result'] = df.apply(lambda x: any(y in x['col_1'] for y in x['col_2'].split('_')), axis=1)
>>> df
col_1 col_2 result
0 X_a a_b_c True
1 Y_b c_d_e False
Try:
df['result'] = False
for i in range(len(df)):
    df.loc[i, 'result'] = any(tok in df.loc[i, 'col_2'].split('_')
                              for tok in df.loc[i, 'col_1'].split('_'))
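The expected output can also be read as: the token after the underscore in col_1 must appear among the underscore-delimited tokens of col_2. A plain list comprehension sketch of that reading (my interpretation, not taken from any answer above):

```python
import pandas as pd

df = pd.DataFrame({'col_1': ['X_a', 'Y_b'],
                   'col_2': ['a_b_c', 'c_d_e']})

# the last '_' token of col_1 must appear among the '_' tokens of col_2
df['result'] = [c1.split('_')[-1] in c2.split('_')
                for c1, c2 in zip(df['col_1'], df['col_2'])]
```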

How can I get the intersection of two pandas Series text columns?

I have two pandas Series containing sets of words; how can I get their elementwise intersection?
print(df)
0 {this, is, good}
1 {this, is, not, good}
print(df1)
0 {this, is}
1 {good, bad}
I'm looking for an output something like below.
print(df2)
0 {this, is}
1 {good}
I've tried this, but it returns:
df.apply(lambda x: x.intersection(df1))
TypeError: unhashable type: 'set'
This looks like simple set arithmetic: subtracting the difference leaves the intersection:
s1 = pd.Series([{'this', 'is', 'good'}, {'this', 'is', 'not', 'good'}])
s2 = pd.Series([{'this', 'is'}, {'good', 'bad'}])
s1 - (s1 - s2)
#Out[122]:
#0 {this, is}
#1 {good}
#dtype: object
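Series.combine can apply set.intersection elementwise, which may read more directly than the double subtraction (a sketch assuming both Series share the same index):

```python
import pandas as pd

s1 = pd.Series([{'this', 'is', 'good'}, {'this', 'is', 'not', 'good'}])
s2 = pd.Series([{'this', 'is'}, {'good', 'bad'}])

# combine pairs elements up by index and applies the function to each pair
s3 = s1.combine(s2, set.intersection)
```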
This approach works for me:
import pandas as pd
import numpy as np
data = np.array([{'this', 'is', 'good'}, {'this', 'is', 'not', 'good'}])
data1 = np.array([{'this', 'is'}, {'good', 'bad'}])
df = pd.Series(data)
df1 = pd.Series(data1)
df2 = pd.Series([df[i] & df1[i] for i in range(df.size)])  # range, not Python 2's xrange
print(df2)
I appreciate the above answers. Here is a simple example solving the same problem for a DataFrame (judging by your variable names df and df1, that is what you were asking about).
This df.apply(lambda row: row[0].intersection(df1.loc[row.name][0]), axis=1) will do it. Let's see how I reached the solution.
The answer at https://stackoverflow.com/questions/266582... was helpful for me.
>>> import pandas as pd
>>>
>>> df = pd.DataFrame({
... "set": [{"this", "is", "good"}, {"this", "is", "not", "good"}]
... })
>>>
>>> df
set
0 {this, is, good}
1 {not, this, is, good}
>>>
>>> df1 = pd.DataFrame({
... "set": [{"this", "is"}, {"good", "bad"}]
... })
>>>
>>> df1
set
0 {this, is}
1 {bad, good}
>>>
>>> df.apply(lambda row: row[0].intersection(df1.loc[row.name][0]), axis=1)
0 {this, is}
1 {good}
dtype: object
>>>
How did I reach the above solution?
>>> df.apply(lambda x: print(x.name), axis=1)
0
1
0 None
1 None
dtype: object
>>>
>>> df.loc[0]
set {this, is, good}
Name: 0, dtype: object
>>>
>>> df.apply(lambda row: print(row[0]), axis=1)
{'this', 'is', 'good'}
{'not', 'this', 'is', 'good'}
0 None
1 None
dtype: object
>>>
>>> df.apply(lambda row: print(type(row[0])), axis=1)
<class 'set'>
<class 'set'>
0 None
1 None
dtype: object
>>> df.apply(lambda row: print(type(row[0]), df1.loc[row.name]), axis=1)
<class 'set'> set {this, is}
Name: 0, dtype: object
<class 'set'> set {bad, good}
Name: 1, dtype: object
0 None
1 None
dtype: object
>>> df.apply(lambda row: print(type(row[0]), type(df1.loc[row.name])), axis=1)
<class 'set'> <class 'pandas.core.series.Series'>
<class 'set'> <class 'pandas.core.series.Series'>
0 None
1 None
dtype: object
>>> df.apply(lambda row: print(type(row[0]), type(df1.loc[row.name][0])), axis=1)
<class 'set'> <class 'set'>
<class 'set'> <class 'set'>
0 None
1 None
dtype: object
>>>
Similar to the above, except everything is kept in one DataFrame.
Current df:
df = pd.DataFrame({0: np.array([{'this', 'is', 'good'},{'this', 'is', 'not', 'good'}]), 1: np.array([{'this', 'is'},{'good', 'bad'}])})
Intersection of columns 0 & 1:
df[2] = df.apply(lambda x: x[0] & x[1], axis=1)

Looping dictionary through column using Pandas

I have a data frame with a column called "Input", consisting of various numbers.
I created a dictionary that looks like this
sampleDict = {
    "a": ["123", "456"],
    "b": ["789", "272"]
}
I am attempting to loop through column "Input" against this dictionary. If any of the values in the dictionary are found (123, 789, etc), I would like to create a new column in my data frame that signifies where it was found.
For example, I would like to create a column called "found" whose value is "a" when 456 is found in "Input", and "b" when 789 is found.
I tried the following code but my logic seems to be off:
for key in sampleDict:
    for p_key in df['Input']:
        if code in p_key:
            if code in sampleDict[key]:
                df = print(code)
print(df)
Use map with the lists flattened into a dictionary; this only works if all values across the lists are unique:
d = {k: oldk for oldk, oldv in sampleDict.items() for k in oldv}
print (d)
{'123': 'a', '456': 'a', '789': 'b', '272': 'b'}
df = pd.DataFrame({'Input':['789','456','100']})
df['found'] = df['Input'].map(d)
print (df)
Input found
0 789 b
1 456 a
2 100 NaN
If duplicated values across the lists are possible, aggregate first, e.g. join the keys and then map by the resulting Series:
sampleDict = {
    "a": ["123", "456", "789"],
    "b": ["789", "272"]
}
df1 = pd.DataFrame([(k, oldk) for oldk, oldv in sampleDict.items() for k in oldv],
                   columns=['a', 'b'])
s = df1.groupby('a')['b'].apply(', '.join)
print (s)
a
123 a
272 b
456 a
789 a, b
Name: b, dtype: object
df = pd.DataFrame({'Input':['789','456','100']})
df['found'] = df['Input'].map(s)
print (df)
Input found
0 789 a, b
1 456 a
2 100 NaN
You can use collections.defaultdict to construct a mapping of list values to key(s). Data from #jezrael.
from collections import defaultdict
d = defaultdict(list)
for k, v in sampleDict.items():
    for w in v:
        d[w].append(k)
print(d)
defaultdict(list,
{'123': ['a'], '272': ['b'], '456': ['a'], '789': ['a', 'b']})
Then use pd.Series.map to map inputs to keys in a new series:
df = pd.DataFrame({'Input':['789','456','100']})
df['found'] = df['Input'].map(d)
print(df)
Input found
0 789 [a, b]
1 456 [a]
2 100 NaN
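If the list output should instead look like the comma-joined strings from the first answer, .str.join can collapse the lists afterwards (a sketch; note the defaultdict is converted to a plain dict first, because map would otherwise use its __missing__ default of [] instead of NaN for unmatched inputs):

```python
from collections import defaultdict

import pandas as pd

sampleDict = {"a": ["123", "456", "789"], "b": ["789", "272"]}

d = defaultdict(list)
for k, v in sampleDict.items():
    for w in v:
        d[w].append(k)

df = pd.DataFrame({'Input': ['789', '456', '100']})
# map through a plain dict, then join each list of keys into one
# string; unmatched inputs stay NaN and pass through str.join
df['found'] = df['Input'].map(dict(d)).str.join(', ')
```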
Create a mask using a list comprehension, then convert the list to an array and select the True values from the search array:
sampleDict = {
    "a": ["123", "456"],
    "b": ["789", "272"]
}
search = ['789', '456', '100']
# https://www.techbeamers.com/program-python-list-contains-elements/
# https://stackoverflow.com/questions/10274774/python-elegant-and-efficient-ways-to-mask-a-list
for key, item in sampleDict.items():
    print(item)
    mask = [x in search for x in item]
    arr = np.array(item)
    print(arr[mask])
