Rewrite a double loop in a nicer and maybe shorter way - python

I am wondering if the following code can be written in a somewhat nicer way. Basically, I want to calculate z = f(x, y) for a (x, y) meshgrid.
a = linspace(0, xr, 100)
b = linspace(0, yr, 100)
for i in xrange(100):
    for j in xrange(100):
        z[i][j] = f(a[i],b[j])

Yeah. Your code as presented in the question is nice.
Do not ever think that few lines is "nice" or "cool". What counts is clarity, readability and maintainability. Other people should be able to understand your code (and you should understand it in 12 months, when you need to find a bug).
Many programmers, especially young ones, think that "clever" solutions are desirable. They are not. And that's what is so nice with the python community. We are much less afflicted by that mistake than others.

You could do something like:
z = [[f(item_a, item_b) for item_b in b] for item_a in a]

You could use itertools' product:
[f(i,j) for i,j in product( a, b )]
and if you really want to shrink those 5 lines into 1 then:
[f(i,j) for i,j in product( linspace(0,xr,100), linspace(0,yr,100) )]
To take it even further if you want a function of xr and yr where you can also preset the ranges of 0 and 100 to something else:
def ranged_linspace( _start, _end, _function ):
    def output_z( xr, yr ):
        return [_function( i, j ) for i,j in product( linspace( _start, xr, _end ), linspace( _start, yr, _end ) )]
    return output_z
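One thing to keep in mind with the product version: it returns a flat list, not the nested z[i][j] layout from the question. A minimal sketch of regrouping it into rows, assuming a stand-in f and two short plain lists in place of the linspace output (both are made up for illustration):

```python
from itertools import product

def f(x, y):
    # Hypothetical stand-in for the question's f
    return x + y

a = [0, 1, 2]   # stands in for linspace(0, xr, 100)
b = [10, 20]    # stands in for linspace(0, yr, 100)

# product(a, b) varies b fastest, so the flat list is ordered row by row
flat = [f(i, j) for i, j in product(a, b)]

# Regroup into len(a) rows of len(b) values each
nested = [flat[k:k + len(b)] for k in range(0, len(flat), len(b))]

print(flat)
print(nested)
```

If the nested layout is what you need, the plain nested list comprehension gets there directly without the regrouping step.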

If you set it all at once, you can use a list comprehension:
[[f(a[i], b[j]) for j in range(100)] for i in range(100)]
If you need to use a z that's already there, however, you can't do that and your code is about the neatest you'll get.
Addition: I don't know what this linspace does, but if it produces a 100-element list, use aaronasterling's list comprehension; there is no point in creating an extra iterator if you don't need one.

This shows the general result. a is made into a list 6-long and b is 4-long. The result is a list of 6 lists, and each nested list is 4 elements long.
>>> def f(x,y):
...     return x+y
...
>>> a, b = list(range(0, 12, 2)), list(range(0, 12, 3))
>>> print len(a), len(b)
6 4
>>> result = [[f(aa, bb) for bb in b] for aa in a]
>>> print result
[[0, 3, 6, 9], [2, 5, 8, 11], [4, 7, 10, 13], [6, 9, 12, 15], [8, 11, 14, 17], [10, 13, 16, 19]]

I think this is the one-line code that you are looking for (here with f(a, b) = a + b):
z = [[a+b for b in linspace(0,yr,100)] for a in linspace(0,xr,100)]

Your linspace actually looks like it could be np.linspace. If it is you could operate on the numpy arrays without having to iterate explicitly:
z = f(x[:, np.newaxis], y)
For example:
>>> import numpy as np
>>> x = np.linspace(0, 9, 10)
>>> y = np.linspace(0, 90, 10)
>>> x[:, np.newaxis] + y # or f(x[:, np.newaxis], y)
array([[ 0., 10., 20., 30., 40., 50., 60., 70., 80., 90.],
[ 1., 11., 21., 31., 41., 51., 61., 71., 81., 91.],
[ 2., 12., 22., 32., 42., 52., 62., 72., 82., 92.],
[ 3., 13., 23., 33., 43., 53., 63., 73., 83., 93.],
[ 4., 14., 24., 34., 44., 54., 64., 74., 84., 94.],
[ 5., 15., 25., 35., 45., 55., 65., 75., 85., 95.],
[ 6., 16., 26., 36., 46., 56., 66., 76., 86., 96.],
[ 7., 17., 27., 37., 47., 57., 67., 77., 87., 97.],
[ 8., 18., 28., 38., 48., 58., 68., 78., 88., 98.],
[ 9., 19., 29., 39., 49., 59., 69., 79., 89., 99.]])
But you could also use np.ogrid instead of two linspace:
>>> import numpy as np
>>> x, y = np.ogrid[0:10, 0:100:10]
>>> x + y # or f(x, y)
array([[ 0, 10, 20, 30, 40, 50, 60, 70, 80, 90],
[ 1, 11, 21, 31, 41, 51, 61, 71, 81, 91],
[ 2, 12, 22, 32, 42, 52, 62, 72, 82, 92],
[ 3, 13, 23, 33, 43, 53, 63, 73, 83, 93],
[ 4, 14, 24, 34, 44, 54, 64, 74, 84, 94],
[ 5, 15, 25, 35, 45, 55, 65, 75, 85, 95],
[ 6, 16, 26, 36, 46, 56, 66, 76, 86, 96],
[ 7, 17, 27, 37, 47, 57, 67, 77, 87, 97],
[ 8, 18, 28, 38, 48, 58, 68, 78, 88, 98],
[ 9, 19, 29, 39, 49, 59, 69, 79, 89, 99]])
It somewhat depends on what your f is. If it contains functions like math.sin, you need to replace them with numpy.sin.
If it's not about numpy, then you should stick either with your option or, optionally, use enumerate when looping:
for idx1, ai in enumerate(a):
    for idx2, bj in enumerate(b):
        z[idx1][idx2] = f(ai, bj)
This has the advantage that you don't need to hardcode your range (or xrange) or use len(a) as input. But in general, if there is not a huge performance difference [1], then use the method that you and others using your code will understand easily.
[1] If your a and b are numpy.arrays, then there would be a significant performance difference, because numpy can process the arrays much faster if no list<->numpy.array conversions are required.
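To make the math.sin versus numpy.sin point concrete, here is a small sketch that compares the explicit double loop with the broadcast call. The two functions f_scalar and f_array are hypothetical examples, not the asker's actual f:

```python
import math
import numpy as np

def f_scalar(x, y):
    # Works on plain floats only, because math.sin rejects arrays with more than one element
    return math.sin(x) + y

def f_array(x, y):
    # Same formula with numpy.sin, so it broadcasts over arrays
    return np.sin(x) + y

a = np.linspace(0, 1, 5)
b = np.linspace(0, 2, 4)

# Explicit double loop, as in the question
z_loop = np.empty((len(a), len(b)))
for i, ai in enumerate(a):
    for j, bj in enumerate(b):
        z_loop[i, j] = f_scalar(ai, bj)

# Broadcasting: a column of shape (5, 1) against a row of shape (4,)
z_vec = f_array(a[:, np.newaxis], b)

print(z_vec.shape)
print(np.allclose(z_loop, z_vec))
```

Both versions fill the same (len(a), len(b)) grid; the broadcast one only works because every operation inside f_array accepts arrays.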

Related

Merge multidimensional NumPy arrays based on first row

I have to work with sensor data (from ros, specifically, but it should not be relevant). To this end, I have several 2-D numpy arrays with one row storing the timestamps and the following others the corresponding sensors data. Problem is, such arrays do not have the same dimensions (different sampling times). I need to merge all of these arrays into a single big one. How can I do so based on the timestamp and, say, replace the missing numbers with 0 or NaN?
Example of my situation:
import numpy as np
time1=np.arange(1,10)
data1=np.random.randint(200, size=time1.shape)
a=np.array((time1,data1))
print(a)
time2=np.arange(1,10,2)
data2=np.random.randint(200, size=time2.shape)
b=np.array((time2,data2))
print(b)
Which returns output
[[ 1 2 3 4 5 6 7 8 9]
[ 51 9 117 174 164 60 95 197 30]]
[[ 1 3 5 7 9]
[ 35 188 114 153 36]]
What I am looking for is
[[ 1 2 3 4 5 6 7 8 9]
[ 51 9 117 174 164 60 95 197 30]
[ 35 0 188 0 114 0 153 0 36]]
Is there any way to achieve this in an efficient way? This is an example but I am working with thousands of samples. Thanks!
For simple case of one b-matrix
With first row of a storing all possible timestamps and both of those first rows in a and b being sorted, we can use np.searchsorted -
idx = np.searchsorted(a[0],b[0])
out_dtype = np.result_type(a.dtype, b.dtype)
b0 = np.zeros(a.shape[1],dtype=out_dtype)
b0[idx] = b[1]
out = np.vstack((a,b0))
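Since the question also mentions NaN as a fill value, here is a sketch of the same np.searchsorted idea with NaN instead of 0, using the numbers from the question. The sanity check on the timestamps is my own addition, on the assumption that every timestamp in b really does occur in a:

```python
import numpy as np

a = np.array([[1, 2, 3, 4, 5, 6, 7, 8, 9],
              [51, 9, 117, 174, 164, 60, 95, 197, 30]])
b = np.array([[1, 3, 5, 7, 9],
              [35, 188, 114, 153, 36]])

# Position of each timestamp of b inside the (sorted) timestamps of a
idx = np.searchsorted(a[0], b[0])

# searchsorted returns an insertion point even for values missing from a,
# so verify that every timestamp was actually found
if not np.array_equal(a[0][idx], b[0]):
    raise ValueError("b contains timestamps that are not in a")

# NaN needs a float row; stacking it promotes the whole result to float
row = np.full(a.shape[1], np.nan)
row[idx] = b[1]
out = np.vstack((a, row))

print(out)
```

The price of NaN is that the merged array becomes float, which is usually fine for sensor data but worth knowing if the timestamps must stay integers.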
For several b-matrices
Approach #1
To extend to multiple b-matrices, we can follow a similar method with np.searchsorted within a loop, like so -
def merge_arrays(a, B):
    # a : Array with first row holding all possible timestamps
    # B : list or tuple of all b-matrices
    lens = np.array([len(i) for i in B])
    L = (lens-1).sum() + len(a)
    out_dtype = np.result_type(*[i.dtype for i in B])
    out = np.zeros((L, a.shape[1]), dtype=out_dtype)
    out[:len(a)] = a
    s = len(a)
    for b_i in B:
        idx = np.searchsorted(a[0],b_i[0])
        out[s:s+len(b_i)-1,idx] = b_i[1:]
        s += len(b_i)-1
    return out
Sample run -
In [175]: a
Out[175]:
array([[ 4, 11, 16, 22, 34, 56, 67, 87, 91, 99],
[ 1, 2, 3, 4, 5, 6, 7, 8, 9, 10]])
In [176]: b0
Out[176]:
array([[16, 22, 34, 56, 67, 91],
[20, 80, 69, 79, 47, 64],
[82, 88, 49, 29, 19, 19]])
In [177]: b1
Out[177]:
array([[ 4, 16, 34, 99],
[28, 34, 0, 0],
[36, 53, 5, 38],
[17, 79, 4, 42]])
In [178]: merge_arrays(a, [b0,b1])
Out[178]:
array([[ 4, 11, 16, 22, 34, 56, 67, 87, 91, 99],
[ 1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
[ 0, 0, 20, 80, 69, 79, 47, 0, 64, 0],
[ 0, 0, 82, 88, 49, 29, 19, 0, 19, 0],
[28, 0, 34, 0, 0, 0, 0, 0, 0, 0],
[36, 0, 53, 0, 5, 0, 0, 0, 0, 38],
[17, 0, 79, 0, 4, 0, 0, 0, 0, 42]])
Approach #2
If looping with np.searchsorted seems to be the bottleneck, we can vectorize that part -
def merge_arrays_v2(a, B):
    # a : Array with first row holding all possible timestamps
    # B : list or tuple of all b-matrices
    lens = np.array([len(i) for i in B])
    L = (lens-1).sum() + len(a)
    out_dtype = np.result_type(*[i.dtype for i in B])
    out = np.zeros((L, a.shape[1]), dtype=out_dtype)
    out[:len(a)] = a
    s = len(a)
    r0 = [i[0] for i in B]
    r0s = np.concatenate((r0))
    idxs = np.searchsorted(a[0],r0s)
    cols = np.array([i.shape[1] for i in B])
    sp = np.r_[0,cols.cumsum()]
    start,stop = sp[:-1],sp[1:]
    for (b_i,s0,s1) in zip(B,start,stop):
        idx = idxs[s0:s1]
        out[s:s+len(b_i)-1,idx] = b_i[1:]
        s += len(b_i)-1
    return out
Here's an approach using np.searchsorted:
time1=np.arange(1,10)
data1=np.random.randint(200, size=time1.shape)
a=np.array((time1,data1))
# array([[ 1, 2, 3, 4, 5, 6, 7, 8, 9],
# [118, 105, 86, 94, 69, 17, 142, 46, 54]])
time2=np.arange(1,10,2)
data2=np.random.randint(200, size=time2.shape)
b=np.array((time2,data2))
# array([[ 1, 3, 5, 7, 9],
# [70, 15, 4, 97, 57]])
out = np.vstack([a, np.zeros(a.shape[1])])
out[out.shape[0]-1, np.searchsorted(a[0], b[0])] = b[1]
array([[ 1., 2., 3., 4., 5., 6., 7., 8., 9.],
[118., 105., 86., 94., 69., 17., 142., 46., 54.],
[ 70., 0., 15., 0., 4., 0., 97., 0., 57.]])
Update - Merging many matrices
Here's an almost fully vectorised approach for a scenario with multiple b matrices. This approach does not require a priori knowledge of which is the largest list:
def merge_timestamps(*x):
    # infer which is the list with maximum length
    # as well as individual lengths
    concat = np.concatenate(*x, axis=1)[0]
    lens = np.r_[np.flatnonzero(np.diff(concat) < 0), len(concat)]
    max_len_list = np.r_[lens[0], np.diff(lens)].argmax()
    # define the output matrix: timestamps and data of the largest
    # array, followed by one row of zeros per remaining array
    A = x[0][max_len_list]
    out = np.vstack([A, np.zeros((len(*x)-1, len(A[0])))])
    others = np.flatnonzero(~np.in1d(np.arange(len(*x)), max_len_list))
    # Update the output matrix with the values of the smaller
    # arrays according to their index. This is of course assuming
    # all values are contained in the largest
    for ix, i in enumerate(others):
        out[-(ix+1), x[0][i][0]-A[0].min()] = x[0][i][1]
    return out
Let's check with the following example:
time1=np.arange(1,10)
data1=np.random.randint(200, size=time1.shape)
a=np.array((time1,data1))
# array([[ 1, 2, 3, 4, 5, 6, 7, 8, 9],
# [107, 13, 123, 119, 137, 135, 65, 157, 83]])
time2=np.arange(1,10,2)
data2=np.random.randint(200, size=time2.shape)
b = np.array((time2,data2))
# array([[ 1, 3, 5, 7, 9],
# [ 81, 49, 83, 32, 179]])
time3=np.arange(1,4,2)
data3=np.random.randint(200, size=time3.shape)
c=np.array((time3,data3))
# array([[ 1, 3],
# [185, 117]])
merge_timestamps([a,b,c])
array([[ 1., 2., 3., 4., 5., 6., 7., 8., 9.],
[107., 13., 123., 119., 137., 135., 65., 157., 83.],
[185., 0., 117., 0., 0., 0., 0., 0., 0.],
[ 81., 0., 49., 0., 83., 0., 32., 0., 179.]])
As mentioned this approach does not require a priori knowledge of which is the largest list, i.e. it would also work with:
merge_timestamps([b, c, a])
array([[ 1., 2., 3., 4., 5., 6., 7., 8., 9.],
[107., 13., 123., 119., 137., 135., 65., 157., 83.],
[185., 0., 117., 0., 0., 0., 0., 0., 0.],
[ 81., 0., 49., 0., 83., 0., 32., 0., 179.]])
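The approaches above all assume that one array already holds every timestamp. If that is not guaranteed, one option is to build the common time axis from the union of all timestamps first. This is only a sketch of that variant; merge_on_union and its fill parameter are hypothetical names, and it assumes each input has its (sorted) timestamps in row 0:

```python
import numpy as np

def merge_on_union(*arrays, fill=np.nan):
    # All timestamps that occur in any input, sorted and de-duplicated
    stamps = np.unique(np.concatenate([arr[0] for arr in arrays]))
    rows = [stamps.astype(float)]
    for arr in arrays:
        # Column of each of this array's timestamps on the common axis
        idx = np.searchsorted(stamps, arr[0])
        for data in arr[1:]:
            row = np.full(stamps.size, fill, dtype=float)
            row[idx] = data
            rows.append(row)
    return np.vstack(rows)

a = np.array([[1, 2, 4], [10, 20, 40]])
b = np.array([[2, 3], [7, 8]])

out = merge_on_union(a, b)
print(out)
```

Here timestamp 3 only exists in b and timestamp 4 only in a, so both data rows end up with gaps filled by NaN.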
Applicable only if the sensor captures data at a fixed interval.
First we need to create a dataframe with a fixed interval (15 minutes in this case), then merge this dataframe with the sensor's data.
Code to generate a dataframe with a 15-minute interval (copied):
l = (pd.DataFrame(columns=['NULL'],
                  index=pd.date_range('2016-09-02T17:30:00Z', '2016-09-02T21:00:00Z',
                                      freq='15T'))
       .between_time('07:00','21:00')
       .index.strftime('%Y-%m-%dT%H:%M:%SZ')
       .tolist()
)
l = pd.DataFrame(l)
Assuming the data below comes from the sensor:
m = (pd.DataFrame(columns=['NULL'],
                  index=pd.date_range('2016-09-02T17:30:00Z', '2016-09-02T21:00:00Z',
                                      freq='30T'))
       .between_time('07:00','21:00')
       .index.strftime('%Y-%m-%dT%H:%M:%SZ')
       .tolist()
)
m = pd.DataFrame(m)
m['SensorData'] = np.arange(8)
Merge the above two dataframes:
df = l.merge(m, left_on = 0, right_on= 0,how='left')
df.loc[df['SensorData'].isna() == True,'SensorData'] = 0
Output
0 SensorData
0 2016-09-02T17:30:00Z 0.0
1 2016-09-02T17:45:00Z 0.0
2 2016-09-02T18:00:00Z 1.0
3 2016-09-02T18:15:00Z 0.0
4 2016-09-02T18:30:00Z 2.0
5 2016-09-02T18:45:00Z 0.0
6 2016-09-02T19:00:00Z 3.0
7 2016-09-02T19:15:00Z 0.0
8 2016-09-02T19:30:00Z 4.0
9 2016-09-02T19:45:00Z 0.0
10 2016-09-02T20:00:00Z 5.0
11 2016-09-02T20:15:00Z 0.0
12 2016-09-02T20:30:00Z 6.0
13 2016-09-02T20:45:00Z 0.0
14 2016-09-02T21:00:00Z 7.0

How to generate many interaction terms in Pandas?

I would like to estimate an IV regression model using many interactions with year, demographic, etc. dummies. I can't find an explicit method to do this in Pandas and am curious if anyone has tips.
I'm thinking of trying scikit-learn and this function:
http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.PolynomialFeatures.html
I recently faced a similar problem, where I needed a flexible way to create specific interactions, and looked through StackOverflow. I followed the tip in the comment above by @user333700 and, thanks to him, found patsy (http://patsy.readthedocs.io/en/latest/overview.html) and, after a Google search, its scikit-learn integration, patsylearn (https://github.com/amueller/patsylearn).
So, going through the example of @motam79, this is possible:
import numpy as np
import pandas as pd
from patsylearn import PatsyModel, PatsyTransformer
x = np.array([[ 3, 20, 11],
[ 6, 2, 7],
[18, 2, 17],
[11, 12, 19],
[ 7, 20, 6]])
df = pd.DataFrame(x, columns=["a", "b", "c"])
x_t = PatsyTransformer("a:b + a:c + b:c", return_type="dataframe").fit_transform(df)
This returns the following:
a:b a:c b:c
0 60.0 33.0 220.0
1 12.0 42.0 14.0
2 36.0 306.0 34.0
3 132.0 209.0 228.0
4 140.0 42.0 120.0
I answered a similar question here, where I provide another example with categorical variables:
How can an interaction design matrix be created from categorical variables?
You can use sklearn's PolynomialFeatures class. Here is an example:
Let's assume this is your design (i.e. feature) matrix:
import numpy as np
from sklearn.preprocessing import PolynomialFeatures
x = np.array([[ 3, 20, 11],
              [ 6, 2, 7],
              [18, 2, 17],
              [11, 12, 19],
              [ 7, 20, 6]])
x_t = PolynomialFeatures(2, interaction_only=True, include_bias=False).fit_transform(x)
Here is the result:
array([[ 3., 20., 11., 60., 33., 220.],
[ 6., 2., 7., 12., 42., 14.],
[ 18., 2., 17., 36., 306., 34.],
[ 11., 12., 19., 132., 209., 228.],
[ 7., 20., 6., 140., 42., 120.]])
The first 3 features are the original features, and the next three are interactions of the original features.
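If you would rather not pull in an extra dependency just for pairwise products, the same interaction columns can be sketched with plain NumPy and itertools.combinations. The column names a, b, c are taken from the patsy example above; the variable names pairs, interactions and labels are my own:

```python
from itertools import combinations
import numpy as np

x = np.array([[ 3, 20, 11],
              [ 6,  2,  7],
              [18,  2, 17],
              [11, 12, 19],
              [ 7, 20,  6]])
names = ["a", "b", "c"]

# Every unordered pair of distinct column indices: (0, 1), (0, 2), (1, 2)
pairs = list(combinations(range(x.shape[1]), 2))

# One product column per pair, in the same order as the pairs
interactions = np.column_stack([x[:, i] * x[:, j] for i, j in pairs])
labels = [f"{names[i]}:{names[j]}" for i, j in pairs]

print(labels)
print(interactions)
```

This only covers second-order interactions of numeric columns; categorical dummies and higher orders are where patsy or PolynomialFeatures save real work.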

Using interpolate function over 2-D array

I have a 1-D function that takes a long time to compute over a big 2-D array of 'x' values, so it is much easier to create an interpolation function using SciPy and then compute y with it, which will be much faster. However, I cannot use the interpolation function on arrays that have more than one dimension.
Example:
# First, I create the interpolation function in the domain I want to work
x = np.arange(1, 100, 0.1)
f = np.exp(x) # a complicated function
f_int = sp.interpolate.InterpolatedUnivariateSpline(x, f, k=2)
# Now, in the code I do that
x = [[13, ..., 1], [99, ..., 45], [33, ..., 98] ..., [15, ..., 65]]
y = f_int(x)
# Which I want that it returns y = [[f_int(13), ..., f_int(1)], ..., [f_int(15), ..., f_int(65)]]
But returns:
ValueError: object too deep for desired array
I know I could loop over all x members, but I don't know if it is a better option...
Thanks!
EDIT:
A function like that also would do the job:
def vector_op(function, values):
    orig_shape = values.shape
    values = np.reshape(values, values.size)
    return np.reshape(function(values), orig_shape)
I've tried the np.vectorize but it is too slow...
If f_int wants single dimensional data, you should flatten your input, feed it to the interpolator, then reconstruct your original shape:
>>> x = np.arange(1, 100, 0.1)
>>> f = 2 * x # a simple function to see the results are good
>>> f_int = scipy.interpolate.InterpolatedUnivariateSpline(x, f, k=2)
>>> x = np.arange(25).reshape(5, 5) + 1
>>> x
array([[ 1, 2, 3, 4, 5],
[ 6, 7, 8, 9, 10],
[11, 12, 13, 14, 15],
[16, 17, 18, 19, 20],
[21, 22, 23, 24, 25]])
>>> x_int = f_int(x.reshape(-1)).reshape(x.shape)
>>> x_int
array([[ 2., 4., 6., 8., 10.],
[ 12., 14., 16., 18., 20.],
[ 22., 24., 26., 28., 30.],
[ 32., 34., 36., 38., 40.],
[ 42., 44., 46., 48., 50.]])
x.reshape(-1) does the flattening, and the .reshape(x.shape) returns it to its original form.
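The flatten-then-reshape trick can be wrapped once and reused for any function that insists on 1-D input. A minimal sketch, where only_1d is a hypothetical stand-in for the spline (so it runs without SciPy) and apply_any_shape is a made-up helper name:

```python
import numpy as np

def only_1d(values):
    # Hypothetical stand-in for an interpolator that accepts only 1-D input
    values = np.asarray(values)
    if values.ndim != 1:
        raise ValueError("object too deep for desired array")
    return 2.0 * values

def apply_any_shape(function, values):
    # Flatten, evaluate once on the 1-D data, then restore the original shape
    values = np.asarray(values)
    return function(values.ravel()).reshape(values.shape)

x = np.arange(6).reshape(2, 3)
y = apply_any_shape(only_1d, x)

print(y)
```

Because np.asarray is applied first, this also accepts nested lists like the x in the question, as long as all rows have the same length.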
I think you want to do a vectorized function in numpy:
import numpy as np
# create some random test data
test = np.random.random((100,100))
# a normal python function that you want to apply
def myFunc(i):
    return np.exp(i)
# now vectorize the function so that it will work on numpy arrays
myVecFunc = np.vectorize(myFunc)
result = myVecFunc(test)
I would use a combination of a list comprehension and map (there might be a way to use two nested maps that I'm missing)
In [24]: x
Out[24]: [[1, 2, 3], [1, 2, 3], [1, 2, 3]]
In [25]: [map(lambda a: a*0.1, x_val) for x_val in x]
Out[25]:
[[0.1, 0.2, 0.30000000000000004],
[0.1, 0.2, 0.30000000000000004],
[0.1, 0.2, 0.30000000000000004]]
This is just for illustration purposes; replace lambda a: a*0.1 with your function, f_int.

variable assignment: keep shape

...better to directly show the code. Here it is:
import numpy as np
a = np.zeros([3, 3])
a
array([[ 0., 0., 0.],
[ 0., 0., 0.],
[ 0., 0., 0.]])
b = np.random.random_integers(0, 100, size = (1, 3))
b
array([[ 10, 3, 8]])
c = np.random.random_integers(0, 100, size = (4, 3))
c
array([[ 22, 21, 14],
[ 55, 64, 12],
[ 33, 85, 98],
[ 37, 44, 45]])
a = b will change dimensions of a
a = c will change dimensions of a
for a = b, I want:
array([[ 10., 3., 8.],
[ 0., 0., 0.],
[ 0., 0., 0.]])
and for a = c, I want:
array([[ 22, 21, 14],
[ 55, 64, 12],
[ 33, 85, 98]])
So I want to lock the shape of 'a' so that values being assigned to it get "cropped" if necessary. Of course without if statements.
The problem is that the assignment operator does not copy anything into a: it rebinds the name a to the other array, whereas what you want is to copy part of that array into the existing a.
So for this, if you know that b only has one row, then you can do:
a[0] = b
And if you know that a is 3x3, then you could also do:
a = c[0:3]
Furthermore, if you want them to be actual deep copies, you'll want:
a[0] = b.copy()
and
a = c[0:3].copy()
To make them independent.
If you don't already know the lengths of the matrices, you can use the len() function to find out at runtime.
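The difference between rebinding a name, writing into an existing array, and taking a view can be checked directly. This is just a sketch with made-up data; np.shares_memory is used only to show which results still point at c's buffer:

```python
import numpy as np

a = np.zeros((3, 3))
c = np.arange(12).reshape(4, 3)

same_object = a
a[:] = c[:3]                       # copies values into the existing 3x3 buffer
kept_identity = a is same_object   # a is still the same float array

view = c[0:3]                      # basic slicing returns a view of c
independent = c[0:3].copy()        # .copy() gives data that no longer depends on c

print(kept_identity, a.dtype)
print(np.shares_memory(view, c), np.shares_memory(independent, c))
```

Note that a[:] = ... keeps the shape and dtype of a, which is the "locked shape" behaviour the question asks for, while a = c[0:3] replaces a with an integer view of c.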
You can do this easily by using NumPy slice notation. Here is a SO question with good answers explaining it clearly. Essentially, you need to ensure that the shapes of the left-hand array and the right-hand array match, and you can achieve this by slicing the corresponding arrays appropriately.
import numpy as np
a = np.zeros([3, 3])
b = np.array([[ 10, 3, 8]])
c = np.array([[ 22, 21, 14],
[ 55, 64, 12],
[ 33, 85, 98],
[ 37, 44, 45]])
a[0] = b
print a
a = c[0:3]
print a
Output:
[[ 10. 3. 8.]
[ 0. 0. 0.]
[ 0. 0. 0.]]
[[22 21 14]
[55 64 12]
[33 85 98]]
It seems you want to replace elements in the top left of a 2D array with elements from a second 2D array without worrying about the sizes of the arrays. Here is a method:
def replacer(orig, repl):
    new = np.copy(orig)
    w2, h1 = new.shape
    w1, h2 = repl.shape
    new[0:min(w1,w2), 0:min(h1,h2)] = repl[0:min(w1,w2), 0:min(h1,h2)]
    return new
print replacer(a,b)
print replacer(a,c)
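For reference, here is a variant of the same cropping idea, checked against the a, b and c from the question. assign_cropped is a hypothetical name, and unlike replacer it writes into the array it is given, so the calls below pass a copy to keep the original a untouched:

```python
import numpy as np

def assign_cropped(target, values):
    # Overwrite the top-left corner of target with as much of values as fits
    values = np.atleast_2d(values)
    rows = min(target.shape[0], values.shape[0])
    cols = min(target.shape[1], values.shape[1])
    target[:rows, :cols] = values[:rows, :cols]
    return target

a = np.zeros((3, 3))
b = np.array([[10, 3, 8]])
c = np.array([[22, 21, 14],
              [55, 64, 12],
              [33, 85, 98],
              [37, 44, 45]])

after_b = assign_cropped(a.copy(), b)
after_c = assign_cropped(a.copy(), c)

print(after_b)
print(after_c)
```

The min() calls are what replace the if statements: a source that is too small fills only part of the target, and one that is too large gets cropped.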

Moving large SQL query to NumPy

I have a very large MySQL query in my web app that looks like this:
query = "
SELECT video_tag.video_id, (sum(user_rating.rating) * video.rating_norm) as score
FROM video_tag
JOIN user_rating ON user_rating.item_id = video_tag.tag_id
JOIN video ON video.id = video_tag.video_id
WHERE item_type = 3 AND user_id = 1 AND rating != 0 AND video.website_id = 2
AND rating_norm > 0 AND video_id NOT IN (1,2,3) GROUP BY video_id
ORDER BY score DESC LIMIT 20"
This query joins three tables (video, video_tag, and user_rating), groups the results, and does some basic math to compute a score for each video. This takes about 2s to run as the tables are large.
Instead of making SQL do all this work, I suspect it would be faster to do this computation using NumPy arrays. The data in 'video' and 'video_tag' is constant - so I could just load those table into memory once and not have to ping SQL each time.
However, while I can load these three tables into three separate arrays, I'm having a heck of a time replicating the above query (specifically the JOIN and GROUP BY parts). Has anyone any experience with replicating SQL queries using NumPy arrays?
Thanks!
What makes this exercise awkward is the single-data-type constraint for NumPy arrays. For instance, the GROUP BY operation implicitly requires (at least) one field/column of continuous values (to aggregate/sum) and one field/column to partition or group by.
Of course, NumPy recarrays can represent a 2D array (or SQL Table) using a different data type for each column (aka 'Field'), but I find these composite arrays cumbersome to work with. So in the code snippets below, I just used the conventional ndarray class to replicate the two SQL operations highlighted in the OP's Question.
To mimic SQL JOIN in NumPy:
First, create two NumPy arrays (A & B), each to represent an SQL Table. The primary keys for A are in the 1st column; the foreign key for B is also in the 1st column.
import numpy as NP
A = NP.random.randint(10, 100, 40).reshape(8, 5)
a = NP.random.randint(1, 3, 8).reshape(8, -1) # add column of primary keys
A = NP.column_stack((a, A))
B = NP.random.randint(0, 10, 4).reshape(2, 2)
b = NP.array([1, 2])
B = NP.column_stack((b, B))
Now (attempt to) replicate JOIN using NumPy array objects:
# prepare the array that will hold the 'result set':
AB = NP.column_stack((A, NP.zeros((A.shape[0], B.shape[1]-1))))
def join(A, B) :
    '''
    returns None, side effect is population of 'results set' NumPy array, 'AB';
    pass in A, B, two NumPy 2D arrays, representing the two SQL Tables to join
    '''
    k, v = B[:,0], B[:,1:]
    dx = dict(zip(k, v))
    for i in range(A.shape[0]) :
        # fill only row i with the B values that match its key
        AB[i,-2:] = dx[A[i,0]]
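The per-row dictionary lookup can also be replaced by a single np.searchsorted call. This is a sketch under two assumptions that the answer does not state: the keys in B are unique, and every key in A exists in B (i.e. inner-join semantics with no misses). The small A and B below are made up:

```python
import numpy as np

A = np.array([[1, 92],
              [2, 64],
              [1, 83]])
B = np.array([[2, 3, 7],
              [1, 5, 4]])   # keys deliberately unsorted

# Sort order of B's keys, then locate each key of A among them
order = np.argsort(B[:, 0])
pos = order[np.searchsorted(B[:, 0], A[:, 0], sorter=order)]

# Append the matching B values to every row of A
joined = np.column_stack((A, B[pos, 1:]))

print(joined)
```

Since both inputs are integer arrays here, the result stays integer instead of being promoted to float by a zero-filled result buffer.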
To mimic SQL GROUP BY in NumPy:
def group_by(AB, col_id) :
    '''
    returns 2D NumPy array aggregated on the unique values in column specified by col_id;
    pass in a 2D NumPy array and the col_id (integer) which holds the unique values to group by
    '''
    uv = NP.unique(AB[:,col_id])
    rest = NP.delete(AB, col_id, axis=1)   # all columns except the grouping column
    temp = []
    for v in uv :
        ndx = AB[:,col_id] == v
        temp.append(NP.sum(rest[ndx], axis=0))
    temp = NP.vstack(temp)
    uv = uv.reshape(-1, 1)
    return NP.column_stack((uv, temp))
For a test case, they return the correct result:
>>> A
array([[ 1, 92, 50, 67, 51, 75],
[ 2, 64, 35, 38, 69, 11],
[ 1, 83, 62, 73, 24, 55],
[ 2, 54, 71, 38, 15, 73],
[ 2, 39, 28, 49, 47, 28],
[ 1, 68, 52, 28, 46, 69],
[ 2, 82, 98, 24, 97, 98],
[ 1, 98, 37, 32, 53, 29]])
>>> B
array([[1, 5, 4],
[2, 3, 7]])
>>> join(A, B)
array([[ 1., 92., 50., 67., 51., 75., 5., 4.],
[ 2., 64., 35., 38., 69., 11., 3., 7.],
[ 1., 83., 62., 73., 24., 55., 5., 4.],
[ 2., 54., 71., 38., 15., 73., 3., 7.],
[ 2., 39., 28., 49., 47., 28., 3., 7.],
[ 1., 68., 52., 28., 46., 69., 5., 4.],
[ 2., 82., 98., 24., 97., 98., 3., 7.],
[ 1., 98., 37., 32., 53., 29., 5., 4.]])
>>> group_by(AB, 0)
array([[ 1., 341., 201., 200., 174., 228., 20., 16.],
[ 2., 239., 232., 149., 228., 210., 12., 28.]])
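With thousands of distinct groups, the Python loop over unique values can become the slow part. One loop-free alternative is np.unique with return_inverse plus np.add.at; the sketch below uses a cut-down, made-up version of A and assumes the grouping key sits in column 0:

```python
import numpy as np

A = np.array([[1, 92, 50],
              [2, 64, 35],
              [1, 83, 62],
              [2, 54, 71]])

# keys: sorted unique group ids; inverse: group index of every row
keys, inverse = np.unique(A[:, 0], return_inverse=True)

# Unbuffered add: each row's values are accumulated into its group's row
sums = np.zeros((keys.size, A.shape[1] - 1), dtype=A.dtype)
np.add.at(sums, inverse, A[:, 1:])

result = np.column_stack((keys, sums))
print(result)
```

np.add.at is used instead of sums[inverse] += ... because the latter applies only one update per repeated index, which would silently drop rows from the sum.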
