Working with Pandas, I have to rewrite queries implemented as a dict:
query = {"height": 175}
The key is the attribute to query on, and the value can be a scalar or an iterable.
First I check whether the value is scalar and not NaN.
If this condition holds, I write the query expression with the == symbol; otherwise, if the value is an Iterable, I need to write the expression with the in keyword.
This is the code that I need to fix so that it also works with Iterables.
import numpy as np
from collections import Iterable
def query_dict_to_expr(query: dict) -> str:
    expr = " and ".join(["{} == {}"
                         .format(k, v) for k, v in query.items()
                         if (not np.isnan(v)
                             and np.isscalar(v))
                         else "{} in #v".format(k) if isinstance(v, Iterable)
                         ]
                        )
    return expr
but I get a SyntaxError at the else clause.
If I understand correctly, you don't need to check the type:
In [47]: query
Out[47]: {'height': 175, 'lst_col': [1, 2, 3]}
In [48]: ' and '.join(['{} == {}'.format(k,v) for k,v in query.items()])
Out[48]: 'height == 175 and lst_col == [1, 2, 3]'
Demo:
In [53]: df = pd.DataFrame(np.random.randint(5, size=(5,3)), columns=list('abc'))
In [54]: df
Out[54]:
a b c
0 0 0 3
1 4 2 4
2 2 2 3
3 0 1 0
4 0 4 1
In [55]: query = {"a": 0, 'b':[0,4]}
In [56]: q = ' and '.join(['{} == {}'.format(k,v) for k,v in query.items()])
In [57]: q
Out[57]: 'a == 0 and b == [0, 4]'
In [58]: df.query(q)
Out[58]:
a b c
0 0 0 3
4 0 4 1
You misplaced the if/else in the comprehension. If you put the if after the for, as in f(x) for x in iterable if g(x), it filters the elements of the iterable (and cannot be combined with an else). Instead, you want to keep all the elements, i.e. use f(x) for x in iterable where f(x) happens to be a ternary expression, i.e. of the form a(x) if c(x) else b(x).
For example, try it like this (simplified non-numpy example):
>>> query = {"foo": 42, "bar": [1,2,3]}
>>> " and ".join(["{} == {}".format(k, v)
if not isinstance(v, list)
else "{} in {}".format(k, v)
for k, v in query.items()])
'foo == 42 and bar in [1, 2, 3]'
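Putting the corrected ternary back into the original function, one possible fix might look like this. This is only a sketch: the NaN check from the question is left out, collections.abc.Iterable replaces the removed collections.Iterable, and scalar string values would still need quoting before being passed to df.query.

from collections.abc import Iterable  # collections.Iterable was removed in Python 3.10

def query_dict_to_expr(query: dict) -> str:
    # ternary inside the comprehension: "in" for non-string iterables, "==" otherwise
    return " and ".join(
        "{} in {}".format(k, list(v))
        if isinstance(v, Iterable) and not isinstance(v, str)
        else "{} == {}".format(k, v)
        for k, v in query.items()
    )

query_dict_to_expr({"height": 175, "lst_col": [1, 2, 3]})
# 'height == 175 and lst_col in [1, 2, 3]'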
I have a Pandas Series of lists of arbitrary length:
s = pd.Series([[1,2,3], [4,6], [7,8,9,10]])
and a list of elements
l = [1,2,3,6,7,8]
I want to return all elements of the series s whose values are all contained in l, and None otherwise. I want to do something like this, but applied to each element in the series:
s.where(s.isin(l), None)
So the output would be a series:
pd.Series([[1,2,3], None, None])
You can use the magic of python sets:
s.apply(set(l).issuperset)
Output:
0 True
1 False
2 False
dtype: bool
Then use where to modify the non matching rows using the previous output as mask:
s.where(s.apply(set(l).issuperset), None)
Output:
0 [1, 2, 3]
1 None
2 None
dtype: object
You can explode the series, use isin with l, and then use all with the parameter level=0 (equivalent to groupby.all on the index).
print(s.explode().isin(l).all(level=0))
0 True
1 False
2 False
dtype: bool
Use this Boolean mask in where to get your expected result:
s1 = s.where(s.explode().isin(l).all(level=0), None)
print(s1)
0 [1, 2, 3]
1 None
2 None
dtype: object
As noted in a comment by @mozway, the level=0 parameter of all is deprecated, so the solution becomes groupby.all:
s1 = s.where(s.explode().isin(l).groupby(level=0).all(), None)
@TomNash, you can combine the all function with a comprehension:
s = pd.Series([[1,2,3], [4,5,6], [7,8,9]])
l = [1,2,3,6,7,8]
final_list = []
for x in s:
    if all(item in l for item in x):
        final_list.append(x)
    else:
        final_list.append(None)
print(final_list)
OUTPUT:
[[1, 2, 3], None, None]
s = pd.Series([[1,2,3], [4,6], [7,8,9,10]])
l = [1,2,3,6,7,8]
new_series = []
for i in range(len(s)):
    s_in_l = 0
    for j in range(len(s[i])):
        if s[i][j] not in l:
            s_in_l = s_in_l + 1
    if s_in_l == 0:
        new_series.append(s[i])
    else:
        new_series.append(None)
new_series = pd.Series(new_series)
print(new_series)
output:
0 [1, 2, 3]
1 None
2 None
dtype: object
You can check whether each element of s is a subset of l with the .issubset function, as follows:
s.apply(lambda x: x if set(x).issubset(l) else None)
or make use of the numpy function setdiff1d, as follows:
s.apply(lambda x: x if (len(np.setdiff1d(x, l)) == 0) else None)
Result:
0 [1, 2, 3]
1 None
2 None
dtype: object
I have 2 questions:
I have a dataset that contains some duplicate IDs, but some of them have different actions so they can't be removed. For each ID I want to do some math and store the final value to work with later. I already have the indices of the duplicates, but this code doesn't work properly and gives NaN.
How can I write a nested loop using pandas? It takes too much time to run. I've already tried iterrows(), but it didn't work.
l_list = []
for i in range(len(idx)):
    for j in range(len(idx[i])):
        if df.at[j,'action'] == 0:
            a = df.rank[idx[i]]*50
            b = df.study_list[idx[i]].str.strip('[]').str.split(',').str.len()
            l_list.append(a + b)
Based on my understanding of what you've provided, see if this works:
In [15]: df
Out[15]:
ID rank action study_list
0 aaa 24 0 [a, b]
1 bbb 6 1 [1, 2, 3]
2 aaa 14 0 [1, 2, 3, 4]
In [16]: def do_thing(row):
    ...:     if row['ID'] == 'aaa' and row['action'] == 0:
    ...:         return row['rank'] * 50 + len(row['study_list'])
    ...:     else:
    ...:         return 100 * row['rank']
    ...:
In [17]: df['new_value'] = df.apply(do_thing, axis=1)
In [18]: df
Out[18]:
ID rank action study_list new_value
0 aaa 24 0 [a, b] 1202
1 bbb 6 1 [1, 2, 3] 600
2 aaa 14 0 [1, 2, 3, 4] 704
NOTE:
I have made many simplifications as your post doesn't enable a reproducible case. Read this thread to see how to best ask questions about Pandas.
I also can't guarantee speed as you have not provided the details regarding the size of the dataset.
I don't know what the variable idx is, or anything else about your setup. I think your code is wrong;
you could try this code:
l_list = []
for i in range(len(idx)):
    for j in range(len(idx[i])):
        if df.at[j,'action'] == 0:
            a = df.rank[idx[i]]*50
            b = df.study_list[idx[i]].str.strip('[]').str.split(',').str.len()
            l_list.append(a + b)
Simple dictionary:
d = {'a': set([1,2,3]), 'b': set([3, 4])}
(the sets may be turned into lists if it matters)
How do I convert it into a long/tidy DataFrame in which each column is a variable and every observation is a row, i.e.:
letter value
0 a 1
1 a 2
2 a 3
3 b 3
4 b 4
The following works, but it's a bit cumbersome:
id = 0
tidy_d = {}
for l, vs in d.items():
    for v in vs:
        tidy_d[id] = {'letter': l, 'value': v}
        id += 1
pd.DataFrame.from_dict(tidy_d, orient = 'index')
Is there any pandas magic to do this? Something like:
pd.DataFrame([d]).T.reset_index(level=0).unnest()
where unnest obviously doesn't exist and comes from R.
You can use a comprehension with itertools.chain and zip:
from itertools import chain
keys, values = map(chain.from_iterable, zip(*((k*len(v), v) for k, v in d.items())))
df = pd.DataFrame({'letter': list(keys), 'value': list(values)})
print(df)
letter value
0 a 1
1 a 2
2 a 3
3 b 3
4 b 4
This can be rewritten in a more readable fashion:
zipper = zip(*((k*len(v), v) for k, v in d.items()))
keys, values = map(list, map(chain.from_iterable, zipper))
df = pd.DataFrame({'letter': keys, 'value': values})
Use numpy.repeat with chain.from_iterable:
from itertools import chain
df = pd.DataFrame({
    'letter' : np.repeat(list(d.keys()), [len(v) for k, v in d.items()]),
    'value' : list(chain.from_iterable(d.values())),
})
print (df)
letter value
0 a 1
1 a 2
2 a 3
3 b 3
4 b 4
A tad more "pandaic", inspired by this post:
pd.DataFrame.from_dict(d, orient = 'index') \
    .rename_axis('letter').reset_index() \
    .melt(id_vars = ['letter'], value_name = 'value') \
    .drop('variable', axis = 1) \
    .dropna()
Some timings of melt and slightly modified chain answers:
import random
import timeit
from itertools import chain
import pandas as pd
print(pd.__version__)
dict_size = 1000000
randoms = [random.randint(0, 100) for __ in range(10000)]
max_list_size = 1000
d = {k: random.sample(randoms, random.randint(1, max_list_size)) for k in
     range(dict_size)}

def chain_():
    keys, values = map(chain.from_iterable,
                       zip(*(([k] * len(v), v) for k, v in d.items())))
    pd.DataFrame({'letter': list(keys), 'value': list(values)})

def melt_():
    pd.DataFrame.from_dict(d, orient='index'
                           ).rename_axis('letter').reset_index(
                           ).melt(id_vars=['letter'], value_name='value'
                           ).drop('variable', axis=1).dropna()
setup ="""from __main__ import chain_, melt_"""
repeat = 3
numbers = 10
def timer(statement, _setup=''):
    print(min(
        timeit.Timer(statement, setup=_setup or setup).repeat(repeat, numbers)))
print('timing')
timer('chain_()')
timer('melt_()')
Seems melt is faster for max_list_size 100:
1.0.3
timing
246.71311019999996
204.33705529999997
and slower for max_list_size 1000:
2675.8446872
4565.838648400002
probably because memory is being allocated for a much bigger DataFrame than needed
A variation of chain answer:
import itertools

def chain_2():
    keys, values = map(chain.from_iterable,
                       zip(*((itertools.repeat(k, len(v)), v) for k, v in d.items())))
    pd.DataFrame({'letter': list(keys), 'value': list(values)})
doesn't seem to be any faster
(python 3.7.6)
Just another one,
from collections import defaultdict
e = defaultdict(list)
for key, val in d.items():
    e["letter"] += [key] * len(val)
    e["value"] += list(val)
df = pd.DataFrame(e)
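Just for completeness: on pandas >= 0.25, Series.explode is probably the closest built-in to the unnest asked about above. A minimal sketch (note that the iteration order of the sets is not guaranteed, and the value column comes out as object dtype):

import pandas as pd

d = {'a': set([1, 2, 3]), 'b': set([3, 4])}

# each set element becomes its own row, with the dict key repeated in the index
df = (pd.Series(d, name='value')
        .explode()
        .rename_axis('letter')
        .reset_index())
print(df)
#   letter value
# 0      a     1
# 1      a     2
# 2      a     3
# 3      b     3
# 4      b     4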
I can solve my task by writing a for loop, but I wonder, how to do this in a more pandorable way.
So I have this dataframe storing some lists, and I want to find all the rows whose lists have any values in common.
(This code is just to obtain a df with lists:
>>> df = pd.DataFrame( {'a':['A','A','B','B','B','C'], 'b':[1,2,5,1,4,6]})
>>> df
a b
0 A 1
1 A 2
2 B 5
3 B 1
4 B 4
5 C 6
>>> d = df.groupby('a')['b'].apply(list)
)
Here we start:
>>> d
A [1, 2]
B [5, 1, 4]
C [6]
Name: b, dtype: object
I want to select rows with index 'A' and 'B', because their lists overlap by the value 1.
I could now write a for loop, or expand the dataframe at these lists (reversing the way I got it above) so that the other values are copied across multiple rows.
What would you do here? Or is there some way to use something like df.groupby(by=lambda x, y: not set(x).isdisjoint(y)) that compares two rows?
But groupby and boolean masking only look at one element at a time...
I then tried to overload the equality operator for lists; because lists are not hashable, I did it for tuples and sets instead (I set the hash to a constant 1 to avoid identity comparison). I then used groupby and merge of the frame with itself, but it seems to skip indexes it has already matched.
import pandas as pd
import numpy as np
from operator import itemgetter
class IndexTuple(set):
    def __hash__(self):
        # print(hash(str(self)))
        return hash(1)

    def __eq__(self, other):
        # print("eq ")
        is_equal = not set(self).isdisjoint(other)
        return is_equal
l = IndexTuple((1,7))
l1 = IndexTuple((4, 7))
print (l == l1)
df = pd.DataFrame(np.random.randint(low=0, high=4, size=(10, 2)), columns=['a','b']).reset_index()
d = df.groupby('a')['b'].apply(IndexTuple).to_frame().reset_index()
print (d)
print (d.groupby('b').b.apply(list))
print (d.merge (d, on = 'b', how = 'outer'))
outputs (it works fine for the first element, but at [{3}] there should be [{3}, {0, 3}] instead):
True
a b
0 0 {1}
1 1 {0, 2}
2 2 {3}
3 3 {0, 3}
b
{1} [{1}]
{0, 2} [{0, 2}, {0, 3}]
{3} [{3}]
Name: b, dtype: object
a_x b a_y
0 0 {1} 0
1 1 {0, 2} 1
2 1 {0, 2} 3
3 3 {0, 3} 1
4 3 {0, 3} 3
5 2 {3} 2
Using a merge on df:
v = df.merge(df, on='b')
common_cols = set(
    np.sort(v.iloc[:, [0, -1]].query('a_x != a_y'), axis=1).ravel()
)
common_cols
{'A', 'B'}
Now, pre-filter and call groupby:
df[df.a.isin(common_cols)].groupby('a').b.apply(list)
a
A [1, 2]
B [5, 1, 4]
Name: b, dtype: object
I understand you are asking for a "pandorable" solution, but in my opinion this is not a task ideally suited to pandas.
Below is one solution using collections.defaultdict and itertools.combinations which produces your result without using a dataframe.
from collections import defaultdict
from itertools import combinations
data = {'a':['A','A','B','B','B','C'], 'b':[1,2,5,1,4,6]}
d = defaultdict(set)
for i, j in zip(data['a'], data['b']):
    d[i].add(j)
res = {frozenset({i, j}) for i, j in combinations(d, 2) if not d[i].isdisjoint(d[j])}
# {frozenset({'A', 'B'})}
Explanation
Group the values into sets with collections.defaultdict, an O(n) operation.
Iterate using itertools.combinations to find set values which are not disjoint, using a set comprehension.
Use frozenset (or sorted tuple) for key type as lists are mutable and therefore cannot be used as dictionary keys.
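If you then want the overlapping groups themselves rather than just the pair of keys, one small follow-up using the defaultdict d built above:

matching = set().union(*res)                # {'A', 'B'}
overlapping = {k: d[k] for k in matching}
# {'A': {1, 2}, 'B': {1, 4, 5}}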
helpful
'[2, 4]'
'[0, 0]'
'[0, 1]'
'[7, 13]'
'[4, 6]'
The column helpful has a list inside a string. I want to split the 2 and 4 into separate columns.
[int(each) for each in df['helpful'][0].strip('[]').split(',')]
This works for the first row, but if I do
[int(each) for each in df['helpful'].strip('[]').split(',')]
it gives me an attribute error:
AttributeError: 'Series' object has no attribute 'strip'
How can I get output like this from my dataframe?
helpful not_helpful
2 4
0 0
0 1
7 13
4 6
As suggested by @abarnert, the first port of call is to find out why your data is coming across as strings and try to rectify that problem.
However, if this is beyond your control, you can use ast.literal_eval as below.
import pandas as pd
from ast import literal_eval
df = pd.DataFrame({'helpful': ['[2, 4]', '[0, 0]', '[0, 1]', '[7, 13]', '[4, 6]']})
res = pd.DataFrame(df['helpful'].map(literal_eval).tolist(),
columns=['helpful', 'not_helpful'])
# helpful not_helpful
# 0 2 4
# 1 0 0
# 2 0 1
# 3 7 13
# 4 4 6
Explanation
From the documentation, ast.literal_eval performs the following function:
Safely evaluate an expression node or a string containing a Python
literal or container display. The string or node provided may only
consist of the following Python literal structures: strings, bytes,
numbers, tuples, lists, dicts, sets, booleans, and None.
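In other words, literal_eval turns each string into a real Python list, which pd.DataFrame can then expand into columns. A quick standalone check:

from ast import literal_eval

parsed = literal_eval('[2, 4]')
print(parsed, type(parsed))  # [2, 4] <class 'list'>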
Assuming what you've described here accurately mimics your real-world case, how about a regex with .str.extract()?
>>> regex = r'\[(?P<helpful>\d+),\s*(?P<not_helpful>\d+)\]'
>>> df
helpful
0 [2, 4]
1 [0, 0]
2 [0, 1]
>>> df['helpful'].str.extract(regex, expand=True).astype(np.int64)
helpful not_helpful
0 2 4
1 0 0
2 0 1
Each pattern (?P<name>...) is a named capturing group. Here, there are two: helpful/not helpful. This assumes the pattern can be described by: opening bracket, 1 or more digits, comma, 0 or more spaces, 1 or more digits, and closing bracket. The Pandas method (.extract()), as its name implies, "extracts" the result of match.group(i) for each i:
>>> import re
>>> regex = r'\[(?P<helpful>\d+),\s*(?P<not_helpful>\d+)\]'
>>> re.search(regex, '[2, 4]').group('helpful')
'2'
>>> re.search(regex, '[2, 4]').group('not_helpful')
'4'
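As an aside, the AttributeError in the question comes from calling strip directly on the Series: the element-wise string methods live under the .str accessor. A minimal sketch of that route (an assumption-laden example, relying on every cell looking like '[x, y]'):

import pandas as pd

df = pd.DataFrame({'helpful': ['[2, 4]', '[0, 0]', '[0, 1]', '[7, 13]', '[4, 6]']})

# strip the brackets, split on the comma into two columns, then convert to numbers
out = (df['helpful'].str.strip('[]')
                    .str.split(',', expand=True)
                    .apply(pd.to_numeric))
out.columns = ['helpful', 'not_helpful']
print(out)
#    helpful  not_helpful
# 0        2            4
# 1        0            0
# 2        0            1
# 3        7           13
# 4        4            6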
Just for fun, without any modules.
s = """
helpful
'[2, 4]'
'[0, 0]'
'[0, 1]'
'[7, 13]'
'[4, 6]'
"""
lst = s.strip().splitlines()
d = {'helpful':[], 'not_helpful':[]}
el = [tuple(int(x) for x in e.strip("'[]").split(', ')) for e in lst[1:]]
d['helpful'].extend(x[0] for x in el)
d['not_helpful'].extend(x[1] for x in el)
NUM_WIDTH = 4
COLUMN_WIDTH = max(len(k) for k in d)
print('{:^{num_width}}{:^{column_width}}{:^{column_width}}'.format(
    ' ', *sorted(d),
    num_width=NUM_WIDTH,
    column_width=COLUMN_WIDTH
))

for i, v in enumerate(zip(d['helpful'], d['not_helpful']), 1):
    print('{:^{num_width}}{:^{column_width}}{:^{column_width}}'.format(
        i, *v,
        num_width=NUM_WIDTH,
        column_width=COLUMN_WIDTH
    ))