Pandas fillna to empty dictionary - python

I have a pandas dataframe with a 'metadata' column that should contain a dictionary as value. However, some values are missing and set to NaN. I would like this to be {} instead.
Sometimes, the entire column is missing and initializing it to {} is also problematic.
For adding the column
tspd['metadata'] = {} # fails
tspd['metadata'] = [{} for _ in tspd.index] # works
For filling missing values
tspd['metadata'].replace(np.nan,{}) # does nothing
tspd['metadata'].fillna({}) # likewise does nothing
tspd.loc[tspd['metadata'].isna(), 'metadata'] = {} # error
tspd['metadata'] = tspd['metadata'].where(~tspd['metadata'].isna(), other={}) # this sets the NaN values to <built-in method values of dict object>
So adding the column works, but it is a bit ugly. Replacing the values without some (slow) loop seems impossible.

You can use the fact that np.nan == np.nan evaluates to False, so missing values can be replaced with:
tspd = pd.DataFrame({'a': [0,1,2], 'metadata':[{'a':'s'}, np.nan, {'d':'e'}]})
tspd['metadata'] = tspd['metadata'].apply(lambda x: {} if x != x else x)
print(tspd)
a metadata
0 0 {'a': 's'}
1 1 {}
2 2 {'d': 'e'}
Or:
tspd['metadata'] = [{} if x != x else x for x in tspd['metadata']]
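Another sketch of the same fix, using mask-based assignment: assigning a plain {} with .loc fails because pandas does not treat a dict as a scalar, but assigning a Series of fresh dicts aligned to the NaN positions works (assuming an object-dtype column, as here):

```python
import numpy as np
import pandas as pd

tspd = pd.DataFrame({'a': [0, 1, 2],
                     'metadata': [{'a': 's'}, np.nan, {'d': 'e'}]})

# Build a Series of fresh, independent dicts aligned to the NaN positions,
# then assign it with .loc -- this sidesteps the dict-is-not-a-scalar problem.
mask = tspd['metadata'].isna()
tspd.loc[mask, 'metadata'] = pd.Series([{} for _ in range(mask.sum())],
                                       index=tspd.index[mask])
print(tspd['metadata'].tolist())  # [{'a': 's'}, {}, {'d': 'e'}]
```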

Do not use [{}] * len(tspd). Create a fresh dict per row instead:
tspd['metadata'] = [{} for x in range(len(tspd))]
tspd
Out[326]:
a metadata
0 0 {}
1 1 {}
2 2 {}
Detail
tspd['metadata'] = [{}] * len(tspd)
tspd['metadata'].iloc[0]['lll']=1
tspd  # all rows are changed here, since they share the same dict object
Out[324]:
a metadata
0 0 {'lll': 1}
1 1 {'lll': 1}
2 2 {'lll': 1}
Create them one by one, so that each row gets its own independent {}:
tspd['metadata'] = [{} for x in range(len(tspd))]
tspd
Out[326]:
a metadata
0 0 {}
1 1 {}
2 2 {}
tspd['metadata'].iloc[0]['lll']=1
tspd
Out[328]:
a metadata
0 0 {'lll': 1}
1 1 {}
2 2 {}

Related

In a column of a dataframe, count the number of list elements starting with "a"

I have a dataset like
data = {'ID': ['first_value', 'second_value', 'third_value',
'fourth_value', 'fifth_value', 'sixth_value'],
'list_id': [['001', 'ab0', '44A'], [], ['005', '006'],
['a22'], ['azz'], ['aaa', 'abd']]
}
df = pd.DataFrame(data)
And I want to create two columns:
A column that counts the number of elements that start with "a" on 'list_id'
A column that counts the number of elements that DO NOT start with "a" on "list_id"
I was thinking on doing something like:
data['list_id'].apply(lambda x: for entity in x if x.startswith("a")
I thought of first counting the ones starting with “a” and then counting the ones not starting with “a”, so I did this:
sum(1 for w in data["list_id"] if w.startswith('a'))
However, this does not really work and I cannot make it work.
Any ideas? :)
Assuming this input:
ID list_id
0 first_value [001, ab0, 44A]
1 second_value []
2 third_value [005, 006]
3 fourth_value [a22]
4 fifth_value [azz]
5 sixth_value [aaa, abd]
you can use:
sum(1 for l in data['list_id'] for x in l if x.startswith('a'))
output: 5
If you would rather have a count per row:
df['starts_with_a'] = [sum(x.startswith('a') for x in l) for l in df['list_id']]
df['starts_with_other'] = df['list_id'].str.len()-df['starts_with_a']
NB. using a list comprehension is faster than apply
output:
ID list_id starts_with_a starts_with_other
0 first_value [001, ab0, 44A] 1 2
1 second_value [] 0 0
2 third_value [005, 006] 0 2
3 fourth_value [a22] 1 0
4 fifth_value [azz] 1 0
5 sixth_value [aaa, abd] 2 0
Using pandas something quite similar to your proposal works:
data = {'ID': ['first_value', 'second_value', 'third_value', 'fourth_value', 'fifth_value', 'sixth_value'],
'list_id': [['001', 'ab0', '44A'], [], ['005', '006'], ['a22'], ['azz'], ['aaa', 'abd']]
}
df = pd.DataFrame(data)
df["len"] = df.list_id.apply(len)
df["num_a"] = df.list_id.apply(lambda s: sum(map(lambda x: x[0] == "a", s)))
df["num_not_a"] = df["len"] - df["num_a"]
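A pandas-only variant of the per-row count can also be sketched with Series.explode (assuming pandas >= 0.25, where explode was added): each list element gets its own row, the matches are flagged, and grouping by the original index sums the flags back:

```python
import pandas as pd

df = pd.DataFrame({
    'ID': ['first_value', 'second_value', 'third_value',
           'fourth_value', 'fifth_value', 'sixth_value'],
    'list_id': [['001', 'ab0', '44A'], [], ['005', '006'],
                ['a22'], ['azz'], ['aaa', 'abd']]
})

# explode() gives each list element its own row (empty lists become NaN),
# str.startswith() flags the matches, and grouping on the original
# index sums the flags back to one count per input row.
s = df['list_id'].explode().str.startswith('a').fillna(False)
df['starts_with_a'] = s.groupby(level=0).sum().astype(int)
df['starts_with_other'] = df['list_id'].str.len() - df['starts_with_a']
print(df['starts_with_a'].tolist())  # [1, 0, 0, 1, 1, 2]
```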

How to put the print information into a current dataframe

I am having a brain fart. I wrote some code to get keywords from my data frame. It worked, but how can I put the print information into my current data frame? Thank you for the help in advance.
from scipy.sparse import coo_matrix
def sort_coo(coo_matrix):
    tuples = zip(coo_matrix.col, coo_matrix.data)
    return sorted(tuples, key=lambda x: (x[1], x[0]), reverse=True)

def extract_topn_from_vector(feature_names, sorted_items, topn=10):
    """get the feature names and tf-idf score of top n items"""
    # use only topn items from vector
    sorted_items = sorted_items[:topn]
    score_vals = []
    feature_vals = []
    # word index and corresponding tf-idf score
    for idx, score in sorted_items:
        # keep track of feature name and its corresponding score
        score_vals.append(round(score, 3))
        feature_vals.append(feature_names[idx])
    # create a dict of feature -> score
    # results = zip(feature_vals, score_vals)
    results = {}
    for idx in range(len(feature_vals)):
        results[feature_vals[idx]] = score_vals[idx]
    return results
#sort the tf-idf vectors by descending order of scores
sorted_items=sort_coo(tf_idf_vector.tocoo())
#extract only the top n; n here is 10
keywords=extract_topn_from_vector(feature_names,sorted_items,5)
#now print the results - NEED TO PUT THIS INFORMATION IN MY CURRENT DATAFRAME
print("\nAbstract:")
print(doc)
print("\nKeywords:")
for k in keywords:
    print(k, keywords[k])
First: a DataFrame is not Excel, so it may not look the way you expect.
You can use append() to add a new row of text. It automatically fills the missing columns with NaN if the row is shorter, or adds new NaN-filled columns if the row is longer.
import pandas as pd
data = {
    'X': ['A','B','C'],
    'Y': ['D','E','F'],
    'Z': ['G','H','I']
}
df = pd.DataFrame(data)
print(df)
df = df.append({"X": 'Abstract:'}, ignore_index=True)
df = df.append({"X": 'Keywords:'}, ignore_index=True)
keywords = {"first": 123, "second": 456, "third": 789}
for key, value in keywords.items():
    df = df.append({"X": key, "Y": value}, ignore_index=True)
print(df)
Result:
# Before
X Y Z
0 A D G
1 B E H
2 C F I
# After
X Y Z
0 A D G
1 B E H
2 C F I
3 Abstract: NaN NaN
4 Keywords: NaN NaN
5 first 123 NaN
6 second 456 NaN
7 third 789 NaN
Later you can replace NaN with something else - e.g. an empty string:
df = df.fillna('')
Result:
X Y Z
0 A D G
1 B E H
2 C F I
3 Abstract:
4 Keywords:
5 first 123
6 second 456
7 third 789
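Note that DataFrame.append() was deprecated in pandas 1.4 and removed in pandas 2.0. On current versions the same result can be sketched with pd.concat, collecting the extra rows first and concatenating once:

```python
import pandas as pd

df = pd.DataFrame({'X': ['A', 'B', 'C'],
                   'Y': ['D', 'E', 'F'],
                   'Z': ['G', 'H', 'I']})
keywords = {"first": 123, "second": 456, "third": 789}

# Collect the extra rows as a list of dicts, then concatenate once --
# missing columns are filled with NaN, just like the old append().
rows = [{"X": 'Abstract:'}, {"X": 'Keywords:'}]
rows += [{"X": k, "Y": v} for k, v in keywords.items()]
df = pd.concat([df, pd.DataFrame(rows)], ignore_index=True)
print(df.fillna('').to_string())
```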

How to get the max from row's elements in python?

I have a data frame that contains a single column Positive Dispatch,
index Positive Dispatch
0 a,c
1 b
2 a,b
Each keyword has its own value:
a,b,c = 12,22,11
I want to create a new column that contains the max of each row, for example in the first row there are a and c and between them a has the biggest value, which is 12 and so on:
Positive Dispatch Max
a,c 12
b 22
a,b 22
My attempt:
import pandas as pd
dic1 = {
    'a': [12,0,22],
    'b': [0,13,22],
    'c': [12,0,0],  # there can be N number of columns here, for example
}  # 'd': [11,22,333]
a,b,c = 12,22,11 # d will have its own value, for example d = 33
df = pd.DataFrame(dic1)
df['Positive Dispatch'] = df.gt(0).dot(df.columns + ',').str[:-1] #Creating the positive dispatch column
print(df['Positive Dispatch'].max(axis=1))
But this gives the error:
ValueError: No axis named 1 for object type <class 'pandas.core.series.Series'>
IIUC:
Create a dict, then calculate the max according to the keys and values of the dictionary using split() + max() + map():
d = {'a': a, 'b': b, 'c': c}
df['Max'] = df['Positive Dispatch'].str.split(',').map(lambda x: max([d.get(y) for y in x]))
# for more columns use applymap() in place of map(); the logic remains the same
OR
If you have more columns like 'Dispatch' then use:
d={'a':a,'b':b,'c':c,'d':d}
df[['Max','Min']]=df[['Positive Dispatch','Negative Dispatch']].applymap(lambda x:max([d.get(y) for y in x.split(',')]))
Sample Dataframe used:
dic1 = {
    'a': [12,0,22],
    'b': [0,13,22],
    'c': [12,0,0],  # there can be N number of columns here for example
    'd': [11,22,333]}
a,b,c,d = 12,22,11,33 # d will have its own value, for example d = 33
df = pd.DataFrame(dic1)
df['Positive Dispatch'] = df.gt(0).dot(df.columns + ',').str[:-1]
df['Negative Dispatch']=[['a,d'],['c,b,a'],['d,c']]
df['Negative Dispatch']=df['Negative Dispatch'].str.join(',')
output:
a b c Positive Dispatch Max
0 12 0 12 a,c 12
1 0 13 0 b 22
2 22 22 0 a,b 22
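The split + map + max idea can also be done with explode (a sketch assuming pandas >= 0.25, using only the 'Positive Dispatch' column and the keyword values from the question):

```python
import pandas as pd

df = pd.DataFrame({'Positive Dispatch': ['a,c', 'b', 'a,b']})
d = {'a': 12, 'b': 22, 'c': 11}

# Split into one keyword per row, map each keyword to its value,
# then take the max back at the original row level.
df['Max'] = (df['Positive Dispatch'].str.split(',')
             .explode().map(d)
             .groupby(level=0).max())
print(df['Max'].tolist())  # [12, 22, 22]
```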

Take the difference of all elements of a series with the previous ones in python pandas

I have a dataframe with sorted values labeled by ids, and I want to take the difference of the value for the first element of an id with the value of the last element of all the previous ids. The code below does what I want:
import pandas as pd
a = 'a'; b = 'b'; c = 'c'
df = pd.DataFrame(data=[*zip([a, a, a, b, b, c, a], [1, 2, 3, 5, 6, 7, 8])],
columns=['id', 'value'])
print(df)
# # take the last value for a particular id
# last_value_for_id = df.loc[df.id.shift(-1) != df.id, :]
# print(last_value_for_id)
current_id = ''; prev_values = {}; diffs = {}
for t in df.itertuples(index=False):
    prev_values[t.id] = t.value
    if current_id != t.id:
        current_id = t.id
    else:
        continue
    for k, v in prev_values.items():
        if k == current_id:
            continue
        diffs[(k, current_id)] = t.value - v
print(pd.DataFrame(data=diffs.values(), columns=['diff'], index=diffs.keys()))
prints:
id value
0 a 1
1 a 2
2 a 3
3 b 5
4 b 6
5 c 7
6 a 8
diff
a b 2
c 4
b c 1
a 2
c a 1
I want to do this in a vectorized manner however. I have found a way of getting the series of last elements as in:
# take the last value for a particular id
last_value_for_id = df.loc[df.id.shift(-1) != df.id, :]
print(last_value_for_id)
which gives me:
id value
2 a 3
4 b 6
5 c 7
but can't find a way of using this to take the diffs in a vectorized manner
Depending on how many ids you have, this works with a few thousand:
# enumerate ids, should be careful
ids = [a,b,c]
num_ids = len(ids)
# compute first and last
f = df.groupby('id').value.agg(['first','last'])
# lower triangle mask
mask = np.array([[i>=j for j in range(num_ids)] for i in range(num_ids)])
# compute diff of first and last, then mask
diff = np.where(mask, None, f['first'][None,:] - f['last'][:,None])
diff = pd.DataFrame(diff, index=ids, columns=ids)
# stack
diff.stack()
output:
a b 2
c 4
b c 1
dtype: object
Edit for updated data:
For the updated data, the approach is similar if we can create the f table:
# create blocks of consecutive id
blocks = df['id'].ne(df['id'].shift()).cumsum()
# groupby
groups = df.groupby(blocks)
# create first and last values
df['fv'] = groups.value.transform('first')
df['lv'] = groups.value.transform('last')
# the above f and ids
# note the column name change
f = df[['id','fv', 'lv']].drop_duplicates()
ids = f['id'].values
num_ids = len(ids)
Output:
a b 2
c 4
a 5
b c 1
a 2
c a 1
dtype: object
If you want to go further and drop the index (a,a), well, I'm so lazy :D.
My method
s=df.groupby(df.id.shift().ne(df.id).cumsum()).agg({'id':'first','value':['min','max']})
s.columns=s.columns.droplevel(0)
t=s['min'].values[:,None]-s['max'].values
t=t.astype(float)
Everything below is just reshaping, to match your output:
t[np.triu_indices(t.shape[1], 0)] = np.nan
newdf=pd.DataFrame(t,index=s['first'],columns=s['first'])
newdf.values[newdf.index.values[:,None]==newdf.index.values]=np.nan
newdf=newdf.T.stack()
newdf
Out[933]:
first first
a b 2.0
c 4.0
b c 1.0
a 2.0
c a 1.0
dtype: float64
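For the original (non-repeating-id) case, the groupby-first/last idea can be sketched self-contained with numpy's upper-triangle indices (a sketch that assumes each id appears as a single contiguous block, so it does not cover the repeated-'a' variant):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'id': ['a', 'a', 'a', 'b', 'b', 'c'],
                   'value': [1, 2, 3, 5, 6, 7]})

# First and last value per id, preserving order of appearance.
f = df.groupby('id', sort=False)['value'].agg(['first', 'last'])

# diff[i, j] = first value of id j minus last value of id i.
diff = f['first'].values[None, :] - f['last'].values[:, None]

# Keep the strictly upper triangle: each later id against each earlier id.
i, j = np.triu_indices(len(f), k=1)
out = pd.Series(diff[i, j],
                index=pd.MultiIndex.from_arrays([f.index[i], f.index[j]]))
print(out)  # (a,b)=2, (a,c)=4, (b,c)=1
```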

Filter a pandas dataframe using values from a dict

I need to filter a data frame with a dict, constructed with the key being the column name and the value being the value that I want to filter:
filter_v = {'A':1, 'B':0, 'C':'This is right'}
# this would be the normal approach
df[(df['A'] == 1) & (df['B'] ==0)& (df['C'] == 'This is right')]
But I want to do something along the lines of:
for column, value in filter_v.items():
    df[df[column] == value]
but this will filter the data frame several times, one value at a time, and not apply all filters at the same time. Is there a way to do it programmatically?
EDIT: an example:
df1 = pd.DataFrame({'A':[1,0,1,1, np.nan], 'B':[1,1,1,0,1], 'C':['right','right','wrong','right', 'right'],'D':[1,2,2,3,4]})
filter_v = {'A':1, 'B':0, 'C':'right'}
df1.loc[df1[filter_v.keys()].isin(filter_v.values()).all(axis=1), :]
gives
A B C D
0 1 1 right 1
1 0 1 right 2
3 1 0 right 3
but the expected result was
A B C D
3 1 0 right 3
only the last one should be selected.
IIUC, you should be able to do something like this:
>>> df1.loc[(df1[list(filter_v)] == pd.Series(filter_v)).all(axis=1)]
A B C D
3 1 0 right 3
This works by making a Series to compare against:
>>> pd.Series(filter_v)
A 1
B 0
C right
dtype: object
Selecting the corresponding part of df1:
>>> df1[list(filter_v)]
A C B
0 1 right 1
1 0 right 1
2 1 wrong 1
3 1 right 0
4 NaN right 1
Finding where they match:
>>> df1[list(filter_v)] == pd.Series(filter_v)
A B C
0 True False True
1 False False True
2 True False False
3 True True True
4 False False True
Finding where they all match:
>>> (df1[list(filter_v)] == pd.Series(filter_v)).all(axis=1)
0 False
1 False
2 False
3 True
4 False
dtype: bool
And finally using this to index into df1:
>>> df1.loc[(df1[list(filter_v)] == pd.Series(filter_v)).all(axis=1)]
A B C D
3 1 0 right 3
Abstraction of the above for the case of passing an array of filter values rather than a single value (analogous to pandas.core.series.Series.isin()). Using the same example:
df1 = pd.DataFrame({'A':[1,0,1,1, np.nan], 'B':[1,1,1,0,1], 'C':['right','right','wrong','right', 'right'],'D':[1,2,2,3,4]})
filter_v = {'A':[1], 'B':[1,0], 'C':['right']}
## Start with an array of all True
ind = [True] * len(df1)
## Loop through filters, updating the index
for col, vals in filter_v.items():
    ind = ind & (df1[col].isin(vals))
## Return the filtered dataframe
df1[ind]
## Returns
A B C D
0 1.0 1 right 1
3 1.0 0 right 3
Here is a way to do it:
df.loc[df[filter_v.keys()].isin(filter_v.values()).all(axis=1), :]
UPDATE:
With values being the same across columns you could then do something like this:
# Create your filtering function:
def filter_dict(df, dic):
    return df[df[dic.keys()].apply(
        lambda x: x.equals(pd.Series(dic.values(), index=x.index, name=x.name)), axis=1)]
# Use it on your DataFrame:
filter_dict(df1, filter_v)
Which yields:
A B C D
3 1 0 right 3
If it something that you do frequently you could go as far as to patch DataFrame for an easy access to this filter:
pd.DataFrame.filter_dict_ = filter_dict
And then use this filter like this:
df1.filter_dict_(filter_v)
Which would yield the same result.
BUT, it is not the right way to do it, clearly.
I would use DSM's approach.
For Python 2, #primer's answer is fine. But in Python 3 you should be careful because of dict_keys. For instance:
>> df.loc[df[filter_v.keys()].isin(filter_v.values()).all(axis=1), :]
>> TypeError: unhashable type: 'dict_keys'
The correct way in Python 3:
df.loc[df[list(filter_v.keys())].isin(list(filter_v.values())).all(axis=1), :]
Here's another way:
filterSeries = pd.Series(np.ones(df.shape[0], dtype=bool))
for column, value in filter_v.items():
    filterSeries = ((df[column] == value) & filterSeries)
This gives:
>>> df[filterSeries]
A B C D
3 1 0 right 3
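The mask-building loop above can also be folded into a one-liner with functools.reduce, ANDing one boolean mask per (column, value) pair (a sketch of the same logic, using the example data from the question):

```python
from functools import reduce
import numpy as np
import pandas as pd

df1 = pd.DataFrame({'A': [1, 0, 1, 1, np.nan], 'B': [1, 1, 1, 0, 1],
                    'C': ['right', 'right', 'wrong', 'right', 'right'],
                    'D': [1, 2, 2, 3, 4]})
filter_v = {'A': 1, 'B': 0, 'C': 'right'}

# Fold: start from an all-True mask, AND in one equality mask per filter.
mask = reduce(lambda m, kv: m & (df1[kv[0]] == kv[1]),
              filter_v.items(),
              pd.Series(True, index=df1.index))
print(df1[mask])  # only row 3 matches every filter
```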
To follow up on DSM's answer, you can also use any() to turn your query into an OR operation (instead of AND):
df1.loc[(df1[list(filter_v)] == pd.Series(filter_v)).any(axis=1)]
You can also create a query
query_string = ' and '.join(
    [f'({key} == "{val}")' if type(val) == str else f'({key} == {val})'
     for key, val in filter_v.items()]
)
df1.query(query_string)
Combining previous answers, here's a function you can feed to df1.loc. Allows for AND/OR (using how='all'/'any'), plus it allows comparisons other than == using the op keyword, if desired.
import operator

def quick_mask(df, filters, how='all', op=operator.eq) -> pd.Series:
    if how == 'all':
        comb = pd.Series.all
    elif how == 'any':
        comb = pd.Series.any
    return comb(op(df[[*filters]], pd.Series(filters)), axis=1)
# Usage
df1.loc[quick_mask(df1, filter_v)]
I had an issue due to my dictionary having multiple values for the same key.
I was able to change DSM's query to:
df1.loc[df1[list(filter_v)].isin(filter_v).all(axis=1), :]
