Short description
I want to walk along a numpy 2D array, starting from different points and moving in specified directions (either 1 or -1), until the value in another column changes (see below).
Current code
First let's generate a dataset:
import numpy as np

# Generate a big random dataset:
# first column is an id, second is a number (the metadata), third is a direction (1 or -1)
np.random.seed(123)
c1 = np.random.randint(0, 100, size=1000000)
c2 = np.random.randint(0, 20, size=1000000)
c3 = np.random.choice([1, -1], 1000000)
m = np.vstack((c1, c2, c3)).T
m = m[m[:, 0].argsort()]
Then I wrote the following code that starts at specific rows in the matrix (start_points) then keeps extending in the specified direction (direction_array) until the metadata changes:
def walk(mat, start_array):
    start_mat = mat[start_array]
    metadata = start_mat[:, 1]
    direction_array = start_mat[:, 2]
    walk_array = start_array
    while True:
        walk_array = np.add(walk_array, direction_array)
        try:
            walk_mat = mat[walk_array]
            walk_metadata = walk_mat[:, 1]
            if sorted(metadata) != sorted(walk_metadata):
                raise IndexError
        except IndexError:
            return start_mat, mat[walk_array + (direction_array * -1)]
import time

s = time.time()
for i in range(100000):
    start_points = np.random.randint(0, 1000000, size=3)
    res = walk(m, start_points)
Question
While the above code works fine, I think there must be an easier/more elegant way to walk along a numpy 2D array from different start points until the value of another column changes. For example, this approach requires me to slice the input array for every step of the while loop, which seems quite inefficient (especially when I have to run walk millions of times).
You don't have to index the whole input array inside the while loop. You can just use the column whose values you want to check.
I also refactored your code a bit, so there is no while True statement and no if that raises an error for no particular reason.
Code:
def walk(mat, start_array):
    start_mat = mat[start_array]
    metadata = sorted(start_mat[:, 1])
    direction_array = start_mat[:, 2]
    data = mat[:, 1]

    walk_array = np.add(start_array, direction_array)
    try:
        while metadata == sorted(data[walk_array]):
            walk_array = np.add(walk_array, direction_array)
    except IndexError:
        pass

    return start_mat, mat[walk_array - direction_array]
In this particular case, if len(start_array) is a big number (thousands of elements), you could use collections.Counter instead of sorted, as it will be much faster.
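For example, here is a minimal sketch of that Counter variant (same logic as above; only the multiset comparison changes, and walk_counter is just an illustrative name):
from collections import Counter

def walk_counter(mat, start_array):
    start_mat = mat[start_array]
    metadata = Counter(start_mat[:, 1])          # multiset of the starting metadata values
    direction_array = start_mat[:, 2]
    data = mat[:, 1]

    walk_array = np.add(start_array, direction_array)
    try:
        # the Counter comparison replaces the two sorted() calls
        while metadata == Counter(data[walk_array]):
            walk_array = np.add(walk_array, direction_array)
    except IndexError:
        pass

    return start_mat, mat[walk_array - direction_array]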
I was also thinking of another approach: you could build an array of the desired slices, each taken in the correct direction.
This approach seems rather dirty, but I will post it anyway in case you find it useful.
Code:
def walk(mat, start_array):
    start_mat = mat[start_array]
    metadata = sorted(start_mat[:, 1])
    direction_array = start_mat[:, 2]
    data = mat[:, 1]

    walk_slices = zip(*[
        data[start_array[i] + direction_array[i]::direction_array[i]]
        for i in range(len(start_array))
    ])

    for step, walk_metadata in enumerate(walk_slices):
        if metadata != sorted(walk_metadata):
            break

    return start_mat, mat[start_array + (direction_array * step)]
To perform the operation starting from a single row, define the following class:
class Walker:
    def __init__(self, tbl, row):
        self.tbl = tbl
        self.row = row
        self.dir = self.tbl[self.row, 2]

    # How many rows can I move from "row" in the indicated direction
    # while metadata doesn't change
    def numEq(self):
        # Metadata from "row" in the required direction
        md = self.tbl[self.row::self.dir, 1]
        return ((md != md[0]).cumsum() == 0).sum() - 1

    # Get row "n" positions from "row" in the indicated direction
    def getRow(self, n):
        return self.tbl[self.row + n * self.dir]
Then, to get the result, run:
def walk_2(m, start_points):
    # Create walkers for each starting point
    wlk = [Walker(m, n) for n in start_points]
    # How many rows can I move
    dist = min([w.numEq() for w in wlk])
    # Return rows from changed positions
    return np.vstack([w.getRow(dist) for w in wlk])
The execution time of my code is roughly the same as yours,
but in my opinion my code is more readable and concise.
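For reference, a rough way to time it, mirroring the benchmark loop from the question (the iteration count here is arbitrary):
import time

s = time.time()
for i in range(100000):
    start_points = np.random.randint(0, 1000000, size=3)
    res = walk_2(m, start_points)
print(time.time() - s)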
I have 2 very large lists of lists whose sizes are dynamic and not known in advance, as they come from a different source, and each sublist is 2000 entries long.
I need to iterate through each sublist of both lists of lists, pass it to an SQL query, do some data processing and then move on to the next sublist.
Using generators is ideal for iterating through such huge lists of lists.
For simplification, I am recreating the problem using 2 lists of lists with 5 sublists each, where each sublist has 2 entries.
import itertools

def test():
    Send_list = [['2000000000259140093', '1000000000057967562'],
                 ['4000000000008393617', '3000000000006545639'],
                 ['1000000000080880314', '1000000000119225203'],
                 ['1000000000096861508', '1000000000254915223'],
                 ['2000000000079125911', '1000000000014797506']]

    Pay_list = [['3000000000020597219', '1000000000079442325'],
                ['1000000000057621671', '3000000000020542928'],
                ['3000000000020531804', '4000000000010435913'],
                ['1000000000330634222', '3000000000002353220'],
                ['1000000000256385361', '2000000000286618770']]

    for list1, list2 in itertools.izip_longest(Send_list, Pay_list):
        yield [list1, list2]
Now, I can use the next() function to iterate through it piece by piece and pass the sublists to the SQL queries.
In [124]: c = next(test())
In [125]: c
Out[125]:
[['2000000000259140093', '1000000000057967562'],
['3000000000020597219', '1000000000079442325']]
a = c[0]
b = c[1]
placeholders1 = ','.join('?' for i in range(len(a)))
placeholders2 = ','.join('?' for i in range(len(b)))
sql1 = "select * from Pretty_Txns where Send_Customer in (%s)"% placeholders1
sql2 = "select * from Pretty_Txns where pay_Customer in (%s)"% placeholders2
df_send = pd.read_sql(sql1,cnx,params=a)
df_pay = pd.read_sql(sql2,cnx,params=b)
///data processing and passing the result frame back to sql///
result.to_sql()
///then repeating the same steps for the next sublists
Now when I tried using a for loop to loop through next():
for list in test():
    c = next(test())
    a = c[0]
    b = c[1]
    placeholders1 = ','.join('?' for i in range(len(a)))
    placeholders2 = ','.join('?' for i in range(len(b)))
    sql1 = "select * from Pretty_Txns where Send_Customer in (%s)"% placeholders1
    sql2 = "select * from Pretty_Txns where pay_Customer in (%s)"% placeholders2
    df_send = pd.read_sql(sql1,cnx,params=a)
    df_pay = pd.read_sql(sql2,cnx,params=b)
    ////lot of data processing steps and passing the final results back to sql
    result.to_sql()
It only iterates through the first two sublists and does the processing for that and stops.
The value of c right now is:
In [145]: c
Out[145]:
[['2000000000259140093', '1000000000057967562'],
['3000000000020597219', '1000000000079442325']]
This is the first sublist in both Send_list and Pay_list
In [149]: Send_list
Out[149]:
[['2000000000259140093', '1000000000057967562'],
['4000000000008393617', '3000000000006545639'],
['1000000000080880314', '1000000000119225203'],
['1000000000096861508', '1000000000254915223'],
['2000000000079125911', '1000000000014797506']]
In [150]: Pay_list
Out[150]:
[['3000000000020597219', '1000000000079442325'],
['1000000000057621671', '3000000000020542928'],
['3000000000020531804', '4000000000010435913'],
['1000000000330634222', '3000000000002353220'],
['1000000000256385361', '2000000000286618770']]
Once the data from the result dataframe is passed to sql, the control should go back to the step c=next(test()) and the whole process should repeat until the original list is exhausted.
I am struggling to accomplish that. Looking forward to some pointers and guidance.
Firstly, I don't see why you're mixing a for loop with an explicit call to next.
Secondly, next(test()) calls next on a new generator object at every iteration of the for loop, which means c will always be the first item from the gen. object. You may need to store the same gen. object somewhere and then call next on it repeatedly:
gen = test()
c = next(gen)
...
c = next(gen)
Finally, itertools.izip_longest returns an iterator, so you're probably complicating things by yielding values from it. You can simply return the iterator.
def test():
    ...
    return itertools.izip_longest(Send_list, Pay_list)
Well, don't create a new generator every time only to use its first element. Create one generator and iterate over that.
>>> for a, b in test():
...     print a, b
['2000000000259140093', '1000000000057967562'] ['3000000000020597219', '1000000000079442325']
['4000000000008393617', '3000000000006545639'] ['1000000000057621671', '3000000000020542928']
['1000000000080880314', '1000000000119225203'] ['3000000000020531804', '4000000000010435913']
['1000000000096861508', '1000000000254915223'] ['1000000000330634222', '3000000000002353220']
['2000000000079125911', '1000000000014797506'] ['1000000000256385361', '2000000000286618770']
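Putting it together, the query and processing can then go directly inside that loop; a rough sketch reusing the names from the question (cnx, Pretty_Txns and the data-processing step are assumed to exist as in your code):
for a, b in test():
    placeholders1 = ','.join('?' for _ in a)
    placeholders2 = ','.join('?' for _ in b)
    sql1 = "select * from Pretty_Txns where Send_Customer in (%s)" % placeholders1
    sql2 = "select * from Pretty_Txns where pay_Customer in (%s)" % placeholders2
    df_send = pd.read_sql(sql1, cnx, params=a)
    df_pay = pd.read_sql(sql2, cnx, params=b)
    # ... data processing ...
    # result.to_sql(...)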
I'm working on the problem below.
Problem:
Given an m * n grid where one is allowed to move up or right, find the different paths between two grid points.
I wrote a recursive version and a dynamic programming version, but they return different results. Any thoughts on what is wrong?
Source code:
from collections import defaultdict

def move_up_right(remaining_right, remaining_up, prefix, result):
    if remaining_up == 0 and remaining_right == 0:
        result.append(''.join(prefix[:]))
        return
    if remaining_right > 0:
        prefix.append('r')
        move_up_right(remaining_right-1, remaining_up, prefix, result)
        prefix.pop(-1)
    if remaining_up > 0:
        prefix.append('u')
        move_up_right(remaining_right, remaining_up-1, prefix, result)
        prefix.pop(-1)

def move_up_right_v2(remaining_right, remaining_up):
    # key is a tuple (given remaining_right, given remaining_up),
    # value is solutions in terms of list
    dp = defaultdict(list)
    dp[(0,1)].append('u')
    dp[(1,0)].append('r')
    for right in range(1, remaining_right+1):
        for up in range(1, remaining_up+1):
            for s in dp[(right-1,up)]:
                dp[(right,up)].append(s+'r')
            for s in dp[(right,up-1)]:
                dp[(right,up)].append(s+'u')
    return dp[(right, up)]

if __name__ == "__main__":
    result = []
    move_up_right(2,3,[],result)
    print result
    print '============'
    print move_up_right_v2(2,3)
In version 2 you should be starting your for loops at 0 not at 1. By starting at 1 you are missing possible permutations where you traverse the bottom row or leftmost column first.
Change version 2 to:
def move_up_right_v2(remaining_right, remaining_up):
    # key is a tuple (given remaining_right, given remaining_up),
    # value is solutions in terms of list
    dp = defaultdict(list)
    dp[(0,1)].append('u')
    dp[(1,0)].append('r')
    for right in range(0, remaining_right+1):
        for up in range(0, remaining_up+1):
            for s in dp[(right-1,up)]:
                dp[(right,up)].append(s+'r')
            for s in dp[(right,up-1)]:
                dp[(right,up)].append(s+'u')
    return dp[(right, up)]
And then:
result = []
move_up_right(2,3,[],result)
set(move_up_right_v2(2,3)) == set(result)
True
And just for fun... another way to do it:
from itertools import permutations
list(map(''.join, set(permutations('r'*2+'u'*3, 5))))
The problem with the dynamic programming version is that it doesn't take into account the paths that start with more than one move up ('uu...') or more than one move right ('rr...').
Before executing the main loop you need to fill dp[(x,0)] for every x from 1 to remaining_right, and dp[(0,y)] for every y from 1 to remaining_up.
In other words, replace this:
dp[(0,1)].append('u')
dp[(1,0)].append('r')
with this:
for right in range(1, remaining_right+1):
    dp[(right,0)].append('r'*right)
for up in range(1, remaining_up+1):
    dp[(0,up)].append('u'*up)
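Putting it together, the corrected function would then look like this (same structure as the original, with only the seeding changed and the target cell returned explicitly):
from collections import defaultdict

def move_up_right_v2(remaining_right, remaining_up):
    # key is a tuple (remaining_right, remaining_up),
    # value is the list of solutions for that sub-grid
    dp = defaultdict(list)
    # pre-fill the edges: paths that only go right or only go up
    for right in range(1, remaining_right+1):
        dp[(right, 0)].append('r'*right)
    for up in range(1, remaining_up+1):
        dp[(0, up)].append('u'*up)
    for right in range(1, remaining_right+1):
        for up in range(1, remaining_up+1):
            for s in dp[(right-1, up)]:
                dp[(right, up)].append(s+'r')
            for s in dp[(right, up-1)]:
                dp[(right, up)].append(s+'u')
    return dp[(remaining_right, remaining_up)]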
I'm currently faced with having to semi-regularly update (synchronize) a large-ish list of dicts from a canonical, changing source while maintaining my own updates to it. A non-standard merge, for which the simplest description is probably:
A is my own list of dicts (updated by my program to include cached values as additional keys).
b is some regularly sent information from a source (A was originally identical to b). It contains a few keys, but not the cached values I've added to A.
keys = ['key1', 'key2'] is a list of keys which both A and b have (A has more keys than that).
mkey = 'mtime' is a special key which both A and b have which indicates that I should invalidate the cached values of A.
Basically, if a dict in A matches a dict in b, I should keep the dict in A unless b['mtime'] > A['mtime']. If a dict appears in A but not in b I get rid of it, while if it appears in b but not in A I add it to A.
My holy grail objective is to not lose any cached key-value pairs in A at all, but I'm having trouble achieving that. My current solution looks something like this:
def priority_merge(A, b, keys, mkey):
    retval = []
    b_index = 0
    for elemA in A:
        if b_index >= len(b):
            break # No more items in b
        elemb = b[b_index]
        minA = { k: elemA[k] for k in keys }
        minb = { k: elemb[k] for k in keys }
        if minA == minb: # Found a match
            if elemA[mkey] >= elemb[mkey]:
                retval.append(elemA)
            else: # Check mkey to see if take b instead
                retval.append(elemb)
            b_index = b_index + 1
        else: # No match, check forward by one
            if b_index+1 >= len(b):
                continue
            elembplus = b[b_index+1]
            minb = { k: elembplus[k] for k in keys}
            if minA == minb:
                retval.append(elemb) # This is a new element
                if elemA[mkey] >= elembplus[mkey]:
                    retval.append(elemA)
                else:
                    retval.append(elembplus)
                b_index = b_index + 2
    if b_index <= len(b):
        retval.extend(b[b_index:])
    return retval
This works fine as long as I don't get more than one addition and/or deletion (in b relative to A) in a row. So if A contains 1, 2, 3, 5 and b contains 1, 2, 3, 4, 5 it's fine, but if A contains 1, 2, 5 and b contains 1, 2, 3, 4, 5 this breaks down.
I could do a check up to len(b) under the else case commented as # No match, check forward by one, or first iterate through both A and b to map matching elements and then iterate through again, based on that map, to create retval (a rough sketch of that second idea is below). This seems error-prone though (I'm sure the logic is doable, but I'm also fairly sure the code I write for it would be buggy). Please recommend a suitable algorithm to tackle this problem, whether it be one of my two ideas or something else.
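For reference, here is a minimal sketch of that second, map-based idea (it assumes the key tuples are unique within A; priority_merge_mapped is just an illustrative name):
def priority_merge_mapped(A, b, keys, mkey):
    # index A by its key tuple so lookups are O(1)
    a_by_key = {tuple(d[k] for k in keys): d for d in A}
    retval = []
    for elemb in b:
        elemA = a_by_key.get(tuple(elemb[k] for k in keys))
        if elemA is None:
            retval.append(elemb)        # in b but not in A: add it
        elif elemA[mkey] >= elemb[mkey]:
            retval.append(elemA)        # keep A's version with its cached keys
        else:
            retval.append(elemb)        # b is newer: the cached values are invalidated
    return retval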
As I mentioned, a hash-based approach can help with the comparison: based only on the keys list, you will be able to find the intersection elements (elements to be merged) and the difference elements.
from itertools import zip_longest

class HashedDictKey(dict):
    def __init__(self, keys_, **kwargs):
        super().__init__(**kwargs)
        self.keys_ = keys_

    def __hash__(self):
        return hash(tuple(sorted((k, self.get(k)) for k in self.keys_)))

    def __eq__(self, other):
        return hash(self) == hash(other)


def merge(A, B):
    to_be_added = []
    to_be_del = []
    to_be_updated = []

    def get(obj, it):
        for i in it:
            if obj == i:
                return i
        raise ValueError("No %s value" % obj)

    for a, b in zip_longest(A, B):
        if a is not None:
            if a in B:
                to_be_updated.append(a)
            else:
                to_be_del.append(a)
        if b is not None and b not in A:
            to_be_added.append(b)

    for i in to_be_del:
        A.remove(i)
    for j in to_be_added:
        A.append(j)
    for i in to_be_updated:
        a = get(i, A)
        b = get(i, B)
        if b['mtime'] > a['mtime']:
            # b is newer, so drop the stale entry and take b's version
            A.remove(a)
            A.append(b)
That's the complete snippet.
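A small usage illustration (the sample dicts below are made up purely to show the call pattern):
A = [HashedDictKey(['key1', 'key2'], key1=1, key2='a', mtime=5, cached=42),
     HashedDictKey(['key1', 'key2'], key1=2, key2='b', mtime=3, cached=7)]
B = [HashedDictKey(['key1', 'key2'], key1=1, key2='a', mtime=9),
     HashedDictKey(['key1', 'key2'], key1=3, key2='c', mtime=1)]

merge(A, B)
# A has been updated in place: the (2, 'b') entry is gone, the new (3, 'c')
# entry was added, and the stale (1, 'a') entry was replaced by B's newer version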
I have two equal-length 1D numpy arrays, id and data, where id is a sequence of repeating, ordered integers that define sub-windows on data. For example:
id data
1 2
1 7
1 3
2 8
2 9
2 10
3 1
3 -10
I would like to aggregate data by grouping on id and taking either the max or the min.
In SQL, this would be a typical aggregation query like SELECT MAX(data) FROM tablename GROUP BY id ORDER BY id.
Is there a way I can avoid Python loops and do this in a vectorized manner?
I've been seeing some very similar questions on Stack Overflow the last few days. The following code is very similar to the implementation of numpy.unique, and because it takes advantage of the underlying numpy machinery, it is most likely going to be faster than anything you can do in a Python loop.
import numpy as np
def group_min(groups, data):
    # sort with major key groups, minor key data
    order = np.lexsort((data, groups))
    groups = groups[order]  # this is only needed if groups is unsorted
    data = data[order]
    # construct an index which marks borders between groups
    index = np.empty(len(groups), 'bool')
    index[0] = True
    index[1:] = groups[1:] != groups[:-1]
    return data[index]

# max is very similar
def group_max(groups, data):
    order = np.lexsort((data, groups))
    groups = groups[order]  # this is only needed if groups is unsorted
    data = data[order]
    index = np.empty(len(groups), 'bool')
    index[-1] = True
    index[:-1] = groups[1:] != groups[:-1]
    return data[index]
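A quick sanity check on the example data from the question (expected maxima 7, 10, 1 and minima 2, 8, -10):
groups = np.array([1, 1, 1, 2, 2, 2, 3, 3])
data = np.array([2, 7, 3, 8, 9, 10, 1, -10])

print(group_max(groups, data))   # [ 7 10  1]
print(group_min(groups, data))   # [  2   8 -10]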
In pure Python:
from itertools import groupby, imap, izip
from operator import itemgetter as ig
print [max(imap(ig(1), g)) for k, g in groupby(izip(id, data), key=ig(0))]
# -> [7, 10, 1]
A variation:
print [data[id==i].max() for i, _ in groupby(id)]
# -> [7, 10, 1]
Based on #Bago's answer:
import numpy as np
# sort by `id` then by `data`
ndx = np.lexsort(keys=(data, id))
id, data = id[ndx], data[ndx]
# get max()
print data[np.r_[np.diff(id), True].astype(np.bool)]
# -> [ 7 10 1]
If pandas is installed:
from pandas import DataFrame
df = DataFrame(dict(id=id, data=data))
print df.groupby('id')['data'].max()
# id
# 1 7
# 2 10
# 3 1
I'm fairly new to Python and Numpy, but it seems like you can use the .at method of ufuncs rather than reduceat:
import numpy as np
data_id = np.array([0,0,0,1,1,1,1,2,2,2,3,3,3,4,5,5,5])
data_val = np.random.rand(len(data_id))
ans = np.empty(data_id[-1]+1) # might want to use max(data_id) and zeros instead
np.maximum.at(ans,data_id,data_val)
For example:
data_val = array([ 0.65753453, 0.84279716, 0.88189818, 0.18987882, 0.49800668,
0.29656994, 0.39542769, 0.43155428, 0.77982853, 0.44955868,
0.22080219, 0.4807312 , 0.9288989 , 0.10956681, 0.73215416,
0.33184318, 0.10936647])
ans = array([ 0.98969952, 0.84044947, 0.63460516, 0.92042078, 0.75738113,
0.37976055])
Of course this only makes sense if your data_id values are suitable for use as indices (i.e. non-negative integers and not huge...presumably if they are large/sparse you could initialize ans using np.unique(data_id) or something).
I should point out that the data_id doesn't actually need to be sorted.
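For instance, a minimal sketch of that initialization idea, which also avoids relying on the uninitialized values from np.empty (assuming non-negative integer ids):
import numpy as np

data_id = np.array([0,0,0,1,1,1,1,2,2,2,3,3,3,4,5,5,5])
data_val = np.random.rand(len(data_id))

# -inf is the identity for maximum, so uninitialized memory never leaks through
ans = np.full(data_id.max() + 1, -np.inf)
np.maximum.at(ans, data_id, data_val)
# ans[i] is now the max of data_val where data_id == i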
With only numpy and without loops:
import numpy as np
import pandas as pd

id = np.asarray([1,1,1,2,2,2,3,3])
data = np.asarray([2,7,3,8,9,10,1,-10])
# max
_ndx = np.argsort(id)
_id, _pos = np.unique(id[_ndx], return_index=True)
g_max = np.maximum.reduceat(data[_ndx], _pos)
# min
_ndx = np.argsort(id)
_id, _pos = np.unique(id[_ndx], return_index=True)
g_min = np.minimum.reduceat(data[_ndx], _pos)
# compare results with pandas groupby
np_group = pd.DataFrame({'min':g_min, 'max':g_max}, index=_id)
pd_group = pd.DataFrame({'id':id, 'data':data}).groupby('id').agg(['min','max'])
(pd_group.values == np_group.values).all() # TRUE
I've packaged a version of my previous answer in the numpy_indexed package; it's nice to have this all wrapped up and tested in a neat interface; plus it has a lot more functionality as well:
import numpy_indexed as npi
group_id, group_max_data = npi.group_by(id).max(data)
And so on.
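Presumably the same call pattern covers the other reductions too; for example (assuming the group_by object exposes min in the same way):
group_id, group_min_data = npi.group_by(id).min(data)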
A slightly faster and more general answer than the already accepted one; like the answer by joeln, it avoids the more expensive lexsort, and it works for arbitrary ufuncs. Moreover, it only demands that the keys are sortable, rather than being ints in a specific range. The accepted answer may still be faster, though, considering that the max/min isn't explicitly computed. The accepted solution's ability to ignore nans is neat; but one may also simply assign nan values a dummy key.
import numpy as np
def group(key, value, operator=np.add):
    """
    group the values by key
    any ufunc operator can be supplied to perform the reduction (np.maximum, np.minimum, np.subtract, and so on)
    returns the unique keys, their corresponding per-key reduction over the operator, and the keycounts
    """
    # upcast to numpy arrays
    key = np.asarray(key)
    value = np.asarray(value)
    # first, sort by key
    I = np.argsort(key)
    key = key[I]
    value = value[I]
    # the slicing points of the bins to sum over
    slices = np.concatenate(([0], np.where(key[:-1] != key[1:])[0] + 1))
    # first entry of each bin is a unique key
    unique_keys = key[slices]
    # reduce over the slices specified by index
    per_key_sum = operator.reduceat(value, slices)
    # number of counts per key is the difference of our slice points; cap off with the number of keys for the last bin
    key_count = np.diff(np.append(slices, len(key)))
    return unique_keys, per_key_sum, key_count
names = ["a", "b", "b", "c", "d", "e", "e"]
values = [1.2, 4.5, 4.3, 2.0, 5.67, 8.08, 9.01]
unique_keys, reduced_values, key_count = group(names, values)
print 'per group mean'
print reduced_values / key_count
unique_keys, reduced_values, key_count = group(names, values, np.minimum)
print 'per group min'
print reduced_values
unique_keys, reduced_values, key_count = group(names, values, np.maximum)
print 'per group max'
print reduced_values
I think this accomplishes what you're looking for:
[max([val for idx,val in enumerate(data) if id[idx] == k]) for k in sorted(set(id))]
For the outer list comprehension, from right to left, set(id) groups the ids, sorted() sorts them, for k ... iterates over them, and max takes the max of, in this case, another list comprehension. So moving to that inner list comprehension: enumerate(data) returns both the index and value from data, and if id[idx] == k picks out the data members corresponding to id k.
This iterates over the full data list for each id. With some preprocessing into sublists, it might be possible to speed it up, but it won't be a one-liner then.
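For example, a rough sketch of that preprocessing (reusing the id and data names from the question; buckets is just an illustrative name):
from collections import defaultdict

buckets = defaultdict(list)
for i, v in zip(id, data):
    buckets[i].append(v)

print([max(buckets[k]) for k in sorted(buckets)])
# -> [7, 10, 1]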
The following solution only requires a sort on the data (not a lexsort) and does not require finding boundaries between groups. It relies on the fact that if o is an array of indices into r then r[o] = x will fill r with the latest value x for each value of o, such that r[[0, 0]] = [1, 2] will return r[0] = 2. It requires that your groups are integers from 0 to number of groups - 1, as for numpy.bincount, and that there is a value for every group:
def group_min(groups, data):
    n_groups = np.max(groups) + 1
    result = np.empty(n_groups)
    order = np.argsort(data)[::-1]
    result[groups.take(order)] = data.take(order)
    return result

def group_max(groups, data):
    n_groups = np.max(groups) + 1
    result = np.empty(n_groups)
    order = np.argsort(data)
    result[groups.take(order)] = data.take(order)
    return result
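A quick check with the question's example, shifting the ids down so the groups run from 0 as required:
import numpy as np

groups = np.asarray([1, 1, 1, 2, 2, 2, 3, 3]) - 1   # groups must be 0 .. n_groups-1
data = np.asarray([2, 7, 3, 8, 9, 10, 1, -10])

print(group_max(groups, data))   # [ 7. 10.  1.]
print(group_min(groups, data))   # [  2.   8. -10.]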