I tried using itertools.groupby with a pandas Series. But I got:
TypeError: boolean value of NA is ambiguous
Indeed some of my values are NA.
This is a minimal reproducible example:
import pandas as pd
import itertools
g = itertools.groupby([pd.NA,0])
next(g)
next(g)
Comparing NA to anything always results in NA, so g.__next__ ends up evaluating something like while NA internally, and taking the truth value of NA fails.
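For reference, the underlying behaviour can be checked directly:
import pandas as pd
pd.NA == pd.NA   # evaluates to <NA>, not True
bool(pd.NA)      # raises TypeError: boolean value of NA is ambiguous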
Is there a way to solve this, so itertools.groupby works with NA values? Or should I just accept it and use a different route to my (whatever) goal?
How about using a key function in itertools.groupby to convert pd.NA to None? Since == doesn't produce the desired output with pd.NA, we can use the is operator to perform identity comparison instead.
import pandas as pd
import itertools
arr = [pd.NA, pd.NA, 0, 1, 1]
keyfunc = lambda x: None if (x is pd.NA) else x
for key, group in itertools.groupby(arr, key=keyfunc):
    print(key, list(group))
Output:
None [<NA>, <NA>]
0 [0]
1 [1, 1]
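If the data may also contain other missing markers (None, float('nan'), pd.NaT), a key based on pd.isna normalises all of them; a variation of my own on the same idea:
import pandas as pd
import itertools
keyfunc = lambda x: None if pd.isna(x) else x
for key, group in itertools.groupby([pd.NA, float('nan'), 0, 1, 1], key=keyfunc):
    print(key, list(group))
# None [<NA>, nan]
# 0 [0]
# 1 [1, 1]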
I have a pandas DataFrame with several columns containing dicts. I am trying to identify columns that contain at least 1 dict.
import pandas as pd
import numpy as np
df = pd.DataFrame({
'i': [0, 1, 2, 3],
'd': [np.nan, {'p':1}, {'q':2}, np.nan],
't': [np.nan, {'u':1}, {'v':2}, np.nan]
})
# Iterate over cols to find dicts
cdict = [i for i in df.columns if isinstance(df[i][0],dict)]
cdict
[]
How do I find the columns containing dicts? Is there a way to do this without iterating over every cell of every column?
You can do:
s = df.applymap(lambda x:isinstance(x, dict)).any()
dict_cols = s[s].index.tolist()
print(dict_cols)
['d', 't']
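Note that DataFrame.applymap was deprecated in pandas 2.1 in favour of DataFrame.map; on newer versions the same check would look like this (assuming pandas >= 2.1):
s = df.map(lambda x: isinstance(x, dict)).any()
dict_cols = s[s].index.tolist()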
We can apply over the columns; this still iterates, but it makes use of apply:
df.apply(lambda x: [any(isinstance(y, dict) for y in x)], axis=0)
EDIT: I think using applymap is more direct. However, we can use our boolean result to get the column names:
any_dct = df.apply(lambda x: [any(isinstance(y, dict) for y in x)], axis=0, result_type="expand")
df.iloc[:,any_dct.iloc[0,:].tolist()].columns.values
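A slightly more direct way to turn that boolean row into column names is boolean indexing on df.columns (my variation on the same result):
mask = any_dct.iloc[0]
print(df.columns[mask].tolist())  # ['d', 't']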
I want to get values from a dict that looks like this:
pair_devices_count =
{('tWAAAA.jg', 'ttNggB.jg'): 1,
('tWAAAM.jg', 'ttWVsM.jg'): 2,
('tWAAAN.CV', 'ttNggB.AS'): 1,
('tWAAAN.CV', 'ttNggB.CV'): 2,
('tWAAAN.CV', 'ttNggB.QG'): 1}
(pairs of domains)
But when I use
train_data[['domain', 'target_domain']].apply(lambda x: pair_devices_count.get((x), 0))
it raises an error because pandas Series objects are not hashable.
How can I look up the dict values to generate the column
train['pair_devices_count']?
Without axis=1, apply passes each whole column to your function, so the lookup never sees a row. Apply row-wise instead:
train_data.apply(lambda x: pair_devices_count[(x.domain, x.target_domain)], axis=1)
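To keep the question's default of 0 for missing pairs, the same row-wise lookup can use .get (a sketch using the question's column names):
train_data['pair_devices_count'] = train_data.apply(
    lambda x: pair_devices_count.get((x['domain'], x['target_domain']), 0), axis=1
)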
pandas Series are not hashable, so convert the Series to a tuple before using .get. Consider the following simple example:
import pandas as pd
d = {('A','A'):1,('A','B'):2,('A','C'):3}
df = pd.DataFrame({'X':['A','A','A'],'Y':['C','B','A'],'Z':['X','Y','Z']})
df['d'] = df[['X','Y']].apply(lambda x: d.get(tuple(x)), axis=1)
print(df)
output
X Y Z d
0 A C X 3
1 A B Y 2
2 A A Z 1
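A variation of my own: build the tuples once with apply(tuple, axis=1) and look them up with Series.map, which yields NaN for pairs missing from the dict:
df['d'] = df[['X','Y']].apply(tuple, axis=1).map(d)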
In one column, I have 4 possible (non-sequential) values: A, 2, +, ?, and I want to order rows according to the custom sequence 2, ?, A, +. I followed some code I found online:
order_by_custom = pd.CategoricalDtype(['2', '?', 'A', '+'], ordered=True)
df['column_name'].astype(order_by_custom)
df.sort_values('column_name', ignore_index=True)
But for some reason, although it does sort, it still sorts by alphabetical (code point) order rather than the order I entered in the order_by_custom object.
Any ideas?
.astype returns a new Series after the conversion, but you did not do anything with the result. Assign it back to your DataFrame. Consider the following example:
import pandas as pd
df = pd.DataFrame({'orderno':[1,2,3],'custom':['X','Y','Z']})
order_by_custom = pd.CategoricalDtype(['Z', 'Y', 'X'], ordered=True)
df['custom'] = df['custom'].astype(order_by_custom)
print(df.sort_values('custom'))
output
orderno custom
2 3 Z
1 2 Y
0 1 X
You can use a custom dictionary to define the sort order. For example, the dictionary would be:
my_custom_dict = {'2': 0, '?': 1, 'A': 2, '+' : 3}
If your column is named "my_column_name", then:
df.sort_values(by=['my_column_name'], key=lambda x: x.map(my_custom_dict))
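A quick check with the values from the question (my example data, assuming the column holds strings):
import pandas as pd
df = pd.DataFrame({'my_column_name': ['A', '+', '2', '?']})
my_custom_dict = {'2': 0, '?': 1, 'A': 2, '+': 3}
print(df.sort_values(by=['my_column_name'], key=lambda x: x.map(my_custom_dict)))
# rows come out in the order 2, ?, A, +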
I have a list within a dictionary within a dictionary. The data set is very large. How can I most quickly return the list nested in the two dictionaries, given the keys for each dictionary level?
{"Dict1": {"Dict2": ['UNIQUE LIST']}}
Is there an alternative data structure that would make this more efficient?
I do not believe a more efficient data structure exists in Python. Simply retrieving the list using the regular indexing operator should be a very fast operation, even if both levels of dictionaries are very large.
nestedDict = {"Dict1": {"Dict2": ['UNIQUE LIST']}}
uniqueList = nestedDict["Dict1"]["Dict2"]
My only thought for improving performance was to try flattening the data structure into a single dictionary with tuples for keys. This would take more memory than the nested approach since the keys in the top-level dictionary will be replicated for every entry in the second-level dictionaries, but it will only compute the hash function once for every lookup. But this approach is actually slower than the nested approach in practice:
nestedDict = {i: {j: ['UNIQUE LIST'] for j in range(1000)} for i in range(1000)}
flatDict = {(i, j): ['UNIQUE LIST'] for i in range(1000) for j in range(1000)}
import random
def accessNested():
    i = random.randrange(1000)
    j = random.randrange(1000)
    return nestedDict[i][j]
def accessFlat():
    i = random.randrange(1000)
    j = random.randrange(1000)
    return flatDict[(i, j)]
import timeit
print(timeit.timeit(accessNested))
print(timeit.timeit(accessFlat))
Output:
2.0440238649971434
2.302736301004188
The fastest way to access the list within the nested dictionary is:
d = {"Dict1": {"Dict2": ['UNIQUE LIST']}}
print(d["Dict1"]["Dict2"])
Output:
['UNIQUE LIST']
But if you want to iterate over the list inside the nested dictionary, you can use the following code as an example:
d = {"a":{"b": ['1','2','3','4'] }}
for i in d["a"]["b"]:
    print(i)
Output:
1
2
3
4
If I understand correctly, you want to access a nested dictionary structure if...
if I am given a List that is specific to the key
So, here you have a sample dictionary and key that you want to access
d = {'a': {'a': 0, 'b': 1},
'b': {'a': {'a': 2}, 'b': 3}}
key = ('b', 'a', 'a')
The lazy approach
This is fast if you know Python dictionaries already, no need to learn other stuff!
>>> value = d
>>> for level in key:
...     value = value[level]
>>> value
2
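The same walk can be wrapped in a helper with functools.reduce (my sketch, not part of the original answer):
>>> from functools import reduce
>>> from operator import getitem
>>> def get_nested(d, keys):
...     # apply one lookup per key, left to right
...     return reduce(getitem, keys, d)
>>> get_nested(d, key)
2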
NestedDict from the ndicts package
If you pip install ndicts then you get the same "lazy approach" implementation in a nicer interface.
>>> from ndicts import NestedDict
>>> nd = NestedDict(d)
>>> nd[key]
2
>>> nd["b", "a", "a"]
2
This option is fast because you can't really write less code than nd[key] to get what you want.
Pandas dataframes
This is the solution that will give you performance. Lookups in dataframes should be quick, especially if you have a sorted index.
In this case we have hierarchical data with multiple levels, so I will create a MultiIndex first. I will use the NestedDict for ease, but anything else to flatten the dictionary will do.
>>> keys = list(nd.keys())
>>> values = list(nd.values())
>>> from pandas import DataFrame, MultiIndex
>>> index = MultiIndex.from_tuples(keys)
>>> df = DataFrame(values, index=index, columns=["Data"]).sort_index()
>>> df
Data
a a NaN 0
b NaN 1
b a a 2
b NaN 3
Use the loc method to get a row.
>>> df.loc[key]
Data 2
Name: (b, a, a), dtype: int64
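With the index sorted, loc also accepts a partial key, returning the sub-frame of all rows under that prefix (a side note of mine, standard pandas behaviour):
>>> df.loc[("b", "a")]  # every row whose first two levels are "b" and "a"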
I have a list of dictionaries that all have the same keys.
in_list = [{'index':1, 'value':2.}, {'index':1, 'value':3.}, {'index':2, 'value':4.}]
I'd like to create a new dictionary with the average of 'value' for each 'index'.
out_dict = {1:2.5, 2:4.}
What would be the most pythonic way to do this?
The following code does what I want, but I feel like it is clumsy:
tmp = {x: [] for x in range(1, 3)}
for el in in_list:
    tmp[el['index']].append(el['value'])
out_dict = {}
for key, val in tmp.iteritems():
    out_dict[key] = sum(val) / len(val)
Your code is fine, but you can make it a little more compact. As Transhuman's answer shows, you can avoid initialising tmp by making it a defaultdict of lists; another way is the dict.setdefault method. Then use a dict comprehension to calculate the averages:
in_list = [
{'index':1, 'value':2.},
{'index':1, 'value':3.},
{'index':2, 'value':4.}
]
out_dict = {}
for d in in_list:
    out_dict.setdefault(d['index'], []).append(d['value'])
out_dict = {k: sum(v) / len(v) for k, v in out_dict.items()}
print(out_dict)
output
{1: 2.5, 2: 4.0}
To do it without any packages to install (long one-liner :-) ), where lod is the in_list from the question:
import itertools,statistics
a = dict(zip(sorted(set([i['index'] for i in lod])),[statistics.mean(int(item['value']) for item in group) for key, group in itertools.groupby(lod, key=lambda x: x['index'])]))
Now:
print(a)
Returns:
{1: 2.5, 2: 4}
If Python 2:
import itertools
a = dict(zip(sorted(set([i['index'] for i in lod])), [sum(vals) / len(vals) for vals in ([item['value'] for item in group] for key, group in itertools.groupby(lod, key=lambda x: x['index']))]))
Explanation:
get the ordered list of unique indices using sorted(set(...))
use itertools.groupby to group by 'index', then take each group's average using statistics.mean (or sum and len)
combine the two sequences into a dict with dict(zip(...))
Note that itertools.groupby only groups consecutive items, so lod must already be sorted by 'index'.
Or, to make the code a little cleaner:
Python 3:
import itertools,statistics
unique_elements=sorted(set([i['index'] for i in lod]))
groups = [statistics.mean(int(item['value']) for item in group) for key, group in itertools.groupby(lod, key=lambda x: x['index'])]
a = dict(zip(unique_elements,groups))
Python 2:
import itertools
unique_elements = sorted(set([i['index'] for i in lod]))
groups = [sum(vals) / len(vals) for vals in ([item['value'] for item in group] for key, group in itertools.groupby(lod, key=lambda x: x['index']))]
a = dict(zip(unique_elements, groups))
I don't think your code is clumsy, but you could check out pandas.
>>> import pandas as pd
>>> in_list = [{'index':1, 'value':2.}, {'index':1, 'value':3.}, {'index':2, 'value':4.}]
>>>
>>> df = pd.DataFrame(in_list)
>>> df.groupby(by='index').mean()
value
index
1 2.5
2 4.0
You can transform the result to a standard dictionary if you like.
>>> df.groupby(by='index').mean().to_dict()['value']
{1: 2.5, 2: 4.0}
One way to do it is using collections.defaultdict:
in_list = [{'index':1, 'value':2.}, {'index':1, 'value':3.}, {'index':2, 'value':4.}]
from collections import defaultdict
d_dict = defaultdict(list)
for k, v in [d.values() for d in in_list]:
    d_dict[k].append(v)
{k:sum(v)/len(v) for k,v in d_dict.items()}
#{1: 2.5, 2: 4.0}
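Note that unpacking d.values() relies on 'index' being inserted before 'value' in every dict; indexing by key avoids that assumption (my tweak to the same loop):
for d in in_list:
    d_dict[d['index']].append(d['value'])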