Moving large SQL query to NumPy - python

I have a very large MySQL query in my web app that looks like this:
query = """
SELECT video_tag.video_id, (SUM(user_rating.rating) * video.rating_norm) AS score
FROM video_tag
JOIN user_rating ON user_rating.item_id = video_tag.tag_id
JOIN video ON video.id = video_tag.video_id
WHERE item_type = 3 AND user_id = 1 AND rating != 0 AND video.website_id = 2
  AND rating_norm > 0 AND video_id NOT IN (1, 2, 3)
GROUP BY video_id
ORDER BY score DESC
LIMIT 20"""
This query joins three tables (video, video_tag, and user_rating), groups the results, and does some basic math to compute a score for each video. This takes about 2s to run as the tables are large.
Instead of making SQL do all this work, I suspect it would be faster to do this computation using NumPy arrays. The data in 'video' and 'video_tag' is constant - so I could just load those tables into memory once and not have to ping SQL each time.
However, while I can load these three tables into three separate arrays, I'm having a heck of a time replicating the above query (specifically the JOIN and GROUP BY parts). Does anyone have experience replicating SQL queries with NumPy arrays?
Thanks!

What makes this exercise awkward is the single-data-type constraint for NumPy arrays. For instance, the GROUP BY operation implicitly requires (at least) one field/column of continuous values (to aggregate/sum) and one field/column to partition or group by.
Of course, NumPy recarrays can represent a 2D array (or SQL table) using a different data type for each column (aka 'field'), but I find these composite arrays cumbersome to work with. So in the code snippets below, I just used the conventional ndarray class to replicate the two SQL operations highlighted in the OP's question.
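(For reference, here is a minimal sketch of the structured-array alternative being set aside; the field names are hypothetical.)
import numpy as NP

# one dtype per field, so a single array can stand in for an SQL table
video = NP.array([(1, 0.9), (2, 1.1)],
                 dtype=[('id', 'i4'), ('rating_norm', 'f4')])
print(video['rating_norm'])   # column access by field name, like SELECT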
To mimic SQL JOIN in NumPy:
First, create two NumPy arrays (A and B), each representing an SQL table. The join keys are in the 1st column of each: A's 1st column holds the (repeated) foreign keys, and B's 1st column holds the (unique) primary keys.
import numpy as NP
A = NP.random.randint(10, 100, 40).reshape(8, 5)
a = NP.random.randint(1, 3, 8).reshape(8, -1) # add a key column (references B's primary keys)
A = NP.column_stack((a, A))
B = NP.random.randint(0, 10, 4).reshape(2, 2)
b = NP.array([1, 2])
B = NP.column_stack((b, B))
Now (attempt to) replicate JOIN using NumPy array objects:
# prepare the array that will hold the 'result set':
AB = NP.column_stack((A, NP.zeros((A.shape[0], B.shape[1]-1))))

def join(A, B):
    '''
    returns None; side effect is population of the 'result set' array AB;
    pass in A, B, two NumPy 2D arrays representing the two SQL tables to join
    '''
    # map B's key column to its value columns, then look up each
    # of A's keys in that map (i.e., a hash join)
    k, v = B[:, 0], B[:, 1:]
    dx = dict(zip(k, v))
    for i in range(A.shape[0]):
        AB[i, -2:] = dx[A[i, 0]]
To mimic SQL GROUP BY in NumPy:
def group_by(AB, col_id):
    '''
    returns a 2D NumPy array aggregated over the unique values in the column
    specified by col_id; pass in a 2D NumPy array and the col_id (integer)
    of the column holding the values to group by
    '''
    uv = NP.unique(AB[:, col_id])      # the distinct group keys
    temp = []
    for v in uv:
        ndx = AB[:, col_id] == v       # boolean mask for rows in this group
        temp.append(NP.sum(AB[ndx, 1:], axis=0))
    temp = NP.row_stack(temp)
    uv = uv.reshape(-1, 1)
    return NP.column_stack((uv, temp))
For a test case, they return the correct result:
>>> A
array([[ 1, 92, 50, 67, 51, 75],
[ 2, 64, 35, 38, 69, 11],
[ 1, 83, 62, 73, 24, 55],
[ 2, 54, 71, 38, 15, 73],
[ 2, 39, 28, 49, 47, 28],
[ 1, 68, 52, 28, 46, 69],
[ 2, 82, 98, 24, 97, 98],
[ 1, 98, 37, 32, 53, 29]])
>>> B
array([[1, 5, 4],
[2, 3, 7]])
>>> join(A, B)
>>> AB
array([[ 1., 92., 50., 67., 51., 75., 5., 4.],
[ 2., 64., 35., 38., 69., 11., 3., 7.],
[ 1., 83., 62., 73., 24., 55., 5., 4.],
[ 2., 54., 71., 38., 15., 73., 3., 7.],
[ 2., 39., 28., 49., 47., 28., 3., 7.],
[ 1., 68., 52., 28., 46., 69., 5., 4.],
[ 2., 82., 98., 24., 97., 98., 3., 7.],
[ 1., 98., 37., 32., 53., 29., 5., 4.]])
>>> group_by(AB, 0)
array([[ 1., 341., 201., 200., 174., 228., 20., 16.],
[ 2., 239., 232., 149., 228., 210., 12., 28.]])
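As an aside, the Python-level loop in group_by can be avoided entirely; here is a hedged, vectorized sketch using np.unique and np.add.at (assuming, as above, that all non-key columns are to be summed):
def group_by_vectorized(AB, col_id=0):
    # label each row with the index of its group key
    keys, inv = NP.unique(AB[:, col_id], return_inverse=True)
    # accumulate the non-key columns into one row per group
    sums = NP.zeros((keys.size, AB.shape[1] - 1))
    NP.add.at(sums, inv, NP.delete(AB, col_id, axis=1))
    return NP.column_stack((keys, sums))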


Python - Taking dot product of long list of arrays

So I'm trying to take the dot product of two arrays using NumPy's dot product function.
import numpy as np
MWFrPos_Hydro1 = subPos1[submaskFirst1]
x = MWFrPos_Hydro1
MWFrVel_Hydro1 = subVel1[submaskFirst1]
y = MWFrVel_Hydro1
MWFrPosMag_Hydro1 = [np.linalg.norm(i) for i in MWFrPos_Hydro1]
np.dot(x, y)
returns
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-135-9ef41eb4235d> in <module>()
6
7
----> 8 np.dot(x, y)
ValueError: shapes (1220,3) and (1220,3) not aligned: 3 (dim 1) != 1220 (dim 0)
Am I using this function improperly?
The arrays look like this
print x
[[ 51.61872482 106.19775391 69.64765167]
[ 33.86419296 11.75729942 11.84990311]
[ 12.75009823 58.95491028 38.06708527]
...,
[ 99.00266266 96.0210495 18.79844856]
[ 27.18083954 74.35041809 78.07577515]
[ 19.29788399 82.16114044 1.20453501]]
print y
[[ 40.0402298 -162.62153625 -163.00158691]
[-359.41983032 -115.39328766 14.8419466 ]
[ 95.92044067 -359.26425171 234.57330322]
...,
[ 130.17840576 -7.00977898 42.09699249]
[ 37.37852478 -52.66002655 -318.15155029]
[ 126.1726532 121.3104248 -416.20855713]]
Would looping with np.vdot be more optimal in this circumstance?
You can't take the dot product of two n * m matrices unless m == n -- when multiplying two matrices A and B, B needs to have as many rows as A has columns. (So you can multiply an n * m matrix with an m * n matrix.)
See this article on multiplying matrices.
Some possible products for (n,3) arrays (here I'll just use one array, x):
In [434]: x=np.arange(12.).reshape(4,3)
In [435]: x
Out[435]:
array([[ 0., 1., 2.],
[ 3., 4., 5.],
[ 6., 7., 8.],
[ 9., 10., 11.]])
Element-by-element product, summed across the columns, giving n values. This is a magnitude-like number (the squared norm of each row).
In [436]: (x*x).sum(axis=1)
Out[436]: array([ 5., 50., 149., 302.])
Same thing with einsum, which gives more control over which axes are multiplied, and which are summed.
In [437]: np.einsum('ij,ij->i',x,x)
Out[437]: array([ 5., 50., 149., 302.])
dot requires the last axis of the 1st array and the second-to-last axis of the 2nd to have the same size, so I have to use x.T (transpose). The diagonal matches the above.
In [438]: np.dot(x,x.T)
Out[438]:
array([[ 5., 14., 23., 32.],
[ 14., 50., 86., 122.],
[ 23., 86., 149., 212.],
[ 32., 122., 212., 302.]])
np.einsum('ij,kj',x,x) does the same thing.
There is a newer matmul product (the @ operator), but with 2d arrays like this it is just dot. I have to turn them into 3d arrays to get the 4 values, and even then I have to squeeze out the excess dimensions:
In [450]: x[:,None,:] @ x[:,:,None]
Out[450]:
array([[[ 5.]],
[[ 50.]],
[[ 149.]],
[[ 302.]]])
In [451]: np.squeeze(_)
Out[451]: array([ 5., 50., 149., 302.])
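For the question's case, assuming what's wanted is one dot product per row of x and y (both (1220, 3)), the einsum form above applies directly; a small self-contained check:
import numpy as np

# x and y here are small stand-ins for the question's (1220, 3) arrays
x = np.random.rand(5, 3)
y = np.random.rand(5, 3)
row_dots = np.einsum('ij,ij->i', x, y)             # one value per row
print(np.allclose(row_dots, (x * y).sum(axis=1)))  # True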

Select all occurrences of top K values along each column in a NumPy array

Let's say I have a NumPy array as follows (my original array is 50K x 8.5K; this is a sample):
array([[ 1. , 2. , 3. ],
[ 1. , 0.5, 2. ],
[ 2. , 3. , 1. ]])
Now what I want is, for each column, to keep only the top K values (let's take K as 2 here) and re-code the others to zero.
So the output I am expecting is something like this:
array([[ 1., 2., 3.],
[ 1., 0., 2.],
[ 2., 3., 0.]])
So basically, we sort each column's values in descending order, and any value that is not amongst the K largest values of its column gets re-coded to zero.
I tried something like this, but it gives an error:
for x in range(e.shape[1]):
    e[:,x]=map(np.where(lambda x: x in e[:,x][::-1][:2], x, 0), e[:,x])
2
3 for x in range(e.shape[1]):
----> 4 e[:,x]=map(np.where(lambda x: x in e[:,x][::-1][:2], x, 0), e[:,x])
5
TypeError: 'numpy.ndarray' object is not callable
Currently I am iterating over each column. Any solution needs to be fast, since I have ~50K rows and 8K columns; iterating over each column and then mapping over each value in that column would be too slow, I guess.
Please advise.
With focus on performance for such large arrays, here's a vectorized approach to solve it -
K = 2 # Select top K values along each column
# Sort A, store the argsort for later usage
sidx = np.argsort(A,axis=0)
sA = A[sidx,np.arange(A.shape[1])]
# Perform differentiation along rows and look for non-zero differentiations
df = np.diff(sA,axis=0)!=0
# Perform cumulative summation along rows from bottom upwards.
# Thus, summations < K should give us a mask of valid ones that are to
# be kept per column. Use this mask to set rest as zeros in sorted array.
mask = (df[::-1].cumsum(0)<K)[::-1]
sA[:-1] *= mask
# Finally revert back to unsorted order by using sorted indices sidx
out = sA[sidx.argsort(0),np.arange(sA.shape[1])]
Please note that for more performance boost, np.argsort could be replaced by np.argpartition.
Sample input, output -
In [343]: A
Out[343]:
array([[106, 106, 102],
[105, 101, 104],
[106, 107, 101],
[107, 103, 106],
[106, 105, 108],
[106, 104, 105],
[107, 101, 101],
[105, 103, 102],
[104, 102, 106],
[104, 106, 101]])
In [344]: out
Out[344]:
array([[106, 106, 0],
[ 0, 0, 0],
[106, 107, 0],
[107, 0, 106],
[106, 0, 108],
[106, 0, 0],
[107, 0, 0],
[ 0, 0, 0],
[ 0, 0, 106],
[ 0, 106, 0]])
This should get you there:
def rwhere(a, b, p, k):
    if p >= len(b) or p >= k:
        return 0
    else:
        return np.where(a == b[p], b[p], rwhere(a, b, p + 1, k))

def codek(a, k):
    b = a.copy()
    b.sort(0)
    b = b[::-1]
    return rwhere(a, b, 0, k)

codek(a, 2)
array([[ 1., 2., 3.],
[ 1., 0., 2.],
[ 2., 3., 0.]])
OK, I just figured out what the problem in my code was: the where clause should be in the return condition of the lambda function. The below works fine.
array([[ 1. , 2. , 3. ],
[ 1. , 0.5, 2. ],
[ 2. , 3. , 1. ]])
import copy
e = copy.deepcopy(a)
for y in range(e.shape[1]):
    e[:,y] = map(lambda x: np.where(x in np.sort(a[:,y])[::-1][:2], x, 0), e[:,y])
array([[ 1., 2., 3.],
[ 1., 0., 2.],
[ 2., 3., 0.]])
Instead of 2, I can keep it as K and it should work fine for that too.
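For comparison, here is a simpler hedged sketch using np.partition: find the K-th largest value per column and zero out everything below it. Note the semantics: this keeps the top K entries counting duplicates, so it can differ from the unique-values approach above when there are ties at the cutoff.
import numpy as np

def topk_per_column(A, K):
    thresh = np.partition(A, -K, axis=0)[-K]   # K-th largest value per column
    return np.where(A >= thresh, A, 0)         # zero out everything below it

a = np.array([[1., 2.,  3.],
              [1., 0.5, 2.],
              [2., 3.,  1.]])
print(topk_per_column(a, 2))   # matches the expected output above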

How to generate many interaction terms in Pandas?

I would like to estimate an IV regression model using many interactions with year, demographic, etc. dummies. I can't find an explicit method to do this in Pandas and am curious if anyone has tips.
I'm thinking of trying scikit-learn and this function:
http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.PolynomialFeatures.html
I was faced with a similar problem, where I needed a flexible way to create specific interactions, and looked through StackOverflow. I followed the tip in the comment above from @user333700 and thanks to him found patsy (http://patsy.readthedocs.io/en/latest/overview.html) and, after a Google search, this scikit-learn integration, patsylearn (https://github.com/amueller/patsylearn).
So going through the example of @motam79, this is possible:
import numpy as np
import pandas as pd
from patsylearn import PatsyModel, PatsyTransformer
x = np.array([[ 3, 20, 11],
[ 6, 2, 7],
[18, 2, 17],
[11, 12, 19],
[ 7, 20, 6]])
df = pd.DataFrame(x, columns=["a", "b", "c"])
x_t = PatsyTransformer("a:b + a:c + b:c", return_type="dataframe").fit_transform(df)
This returns the following:
a:b a:c b:c
0 60.0 33.0 220.0
1 12.0 42.0 14.0
2 36.0 306.0 34.0
3 132.0 209.0 228.0
4 140.0 42.0 120.0
I answered to a similar question here, where I provide another example with categorical variables:
How can an interaction design matrix be created from categorical variables?
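As a quick illustration of the categorical case, here is a hedged sketch using patsy's dmatrix directly, without the patsylearn wrapper; the 'year' and 'x' columns are made up for the example:
import pandas as pd
from patsy import dmatrix

df = pd.DataFrame({"year": ["2010", "2011", "2010"], "x": [1.0, 2.0, 3.0]})
# C() marks 'year' as categorical; '- 1' drops the intercept column
design = dmatrix("C(year):x - 1", df, return_type="dataframe")
print(design)   # one interaction column per year level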
You can use sklearn's PolynomialFeatures function. Here is an example:
Let's assume this is your design (i.e. feature) matrix:
import numpy as np
from sklearn.preprocessing import PolynomialFeatures

x = np.array([[ 3, 20, 11],
              [ 6,  2,  7],
              [18,  2, 17],
              [11, 12, 19],
              [ 7, 20,  6]])
x_t = PolynomialFeatures(2, interaction_only=True, include_bias=False).fit_transform(x)
Here is the result:
array([[ 3., 20., 11., 60., 33., 220.],
[ 6., 2., 7., 12., 42., 14.],
[ 18., 2., 17., 36., 306., 34.],
[ 11., 12., 19., 132., 209., 228.],
[ 7., 20., 6., 140., 42., 120.]])
The first 3 features are the original features, and the next three are interactions of the original features.

"valid" and "full" convolution using fft2 in Python

This is an incomplete Python snippet of convolution with FFT.
I want to modify it to support 1) "valid" convolution and 2) "full" convolution:
import numpy as np
from numpy.fft import fft2, ifft2
image = np.array([[3, 2, 5, 6, 7, 8],
                  [5, 4, 2, 10, 8, 1]])
kernel = np.array([[4, 5],
                   [1, 2]])
fft_size = # what size should I put here for,
           # 1) valid convolution
           # 2) full convolution
convolution = ifft2(fft2(image, fft_size) * fft2(kernel, fft_size))
Thank you in advance.
In the case of 1-dimensional arrays x and y with lengths L and M, resp., you need to pad the FFT to size L + M - 1 for mode="full". For the 2-d case, apply that rule to each axis.
Using numpy, you can compute the size in the 2-d case with
np.array(x.shape) + np.array(y.shape) - 1
To implement the "valid" mode, you'll have to compute the "full" result and then slice out the valid part. For 1-d, assuming L > M, the valid data is the L - M + 1 elements in the center of the full data. Again, apply the same rule to each axis in the 2-d case.
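A minimal 1-d check of that padding rule (assuming real inputs): padding both FFTs to L + M - 1 reproduces np.convolve(x, y, mode="full").
import numpy as np

x = np.array([1.0, 2.0, 3.0])    # L = 3
y = np.array([0.5, 1.0])         # M = 2
n = len(x) + len(y) - 1          # pad to L + M - 1 = 4
full = np.fft.ifft(np.fft.fft(x, n) * np.fft.fft(y, n)).real
print(np.allclose(full, np.convolve(x, y, mode="full")))   # True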
For example,
import numpy as np
from numpy.fft import fft2, ifft2
def fftconvolve2d(x, y, mode="full"):
    """
    x and y must be real 2-d numpy arrays.
    mode must be "full" or "valid".
    """
    x_shape = np.array(x.shape)
    y_shape = np.array(y.shape)
    z_shape = x_shape + y_shape - 1
    z = ifft2(fft2(x, z_shape) * fft2(y, z_shape)).real
    if mode == "valid":
        # To compute a valid shape, either np.all(x_shape >= y_shape) or
        # np.all(y_shape >= x_shape).
        valid_shape = x_shape - y_shape + 1
        if np.any(valid_shape < 1):
            valid_shape = y_shape - x_shape + 1
            if np.any(valid_shape < 1):
                raise ValueError("empty result for valid shape")
        start = (z_shape - valid_shape) // 2
        end = start + valid_shape
        z = z[start[0]:end[0], start[1]:end[1]]
    return z
Here's the function applied to your example data:
In [146]: image
Out[146]:
array([[ 3, 2, 5, 6, 7, 8],
[ 5, 4, 2, 10, 8, 1]])
In [147]: kernel
Out[147]:
array([[4, 5],
[1, 2]])
In [148]: fftconvolve2d(image, kernel, mode="full")
Out[148]:
array([[ 12., 23., 30., 49., 58., 67., 40.],
[ 23., 49., 37., 66., 101., 66., 21.],
[ 5., 14., 10., 14., 28., 17., 2.]])
In [149]: fftconvolve2d(image, kernel, mode="valid")
Out[149]: array([[ 49., 37., 66., 101., 66.]])
More error checking could be added, and it could be modified to handle complex arrays and n-dimensional arrays. It would also be nice if additional padding were chosen to make the FFT calculation more efficient. If you made all those enhancements, you might end up with something like scipy.signal.fftconvolve (https://github.com/scipy/scipy/blob/master/scipy/signal/signaltools.py#L210):
In [152]: from scipy.signal import fftconvolve
In [153]: fftconvolve(image, kernel, mode="full")
Out[153]:
array([[ 12., 23., 30., 49., 58., 67., 40.],
[ 23., 49., 37., 66., 101., 66., 21.],
[ 5., 14., 10., 14., 28., 17., 2.]])
In [154]: fftconvolve(image, kernel, mode="valid")
Out[154]: array([[ 49., 37., 66., 101., 66.]])

Rewrite a double loop in a nicer and maybe shorter way

I am wondering if the following code can be written in a somewhat nicer way. Basically, I want to calculate z = f(x, y) for a (x, y) meshgrid.
a = linspace(0, xr, 100)
b = linspace(0, yr, 100)
for i in xrange(100):
    for j in xrange(100):
        z[i][j] = f(a[i], b[j])
Yeah. Your code as presented in the question is nice.
Do not ever think that fewer lines are "nice" or "cool". What counts is clarity, readability and maintainability.
Many programmers, especially young ones, think that "clever" solutions are desirable. They are not. And that's what is so nice with the python community. We are much less afflicted by that mistake than others.
You could do something like:
z = [[f(item_a, item_b) for item_b in b] for item_a in a]
You could use itertools' product:
[f(i,j) for i,j in product(a, b)]
and if you really want to shrink those 5 lines into 1 (note this gives a flat list, not a nested one):
[f(i,j) for i,j in product(linspace(0,xr,100), linspace(0,yr,100))]
To take it even further, if you want a function of xr and yr where you can also preset the range start of 0 and the point count of 100 to something else:
def ranged_linspace(_start, _end, _function):
    def output_z(xr, yr):
        return [_function(i, j) for i,j in product(linspace(_start, xr, _end), linspace(_start, yr, _end))]
    return output_z
If you set it all at once, you can use a list comprehension:
[[f(a[i], b[j]) for j in range(100)] for i in range(100)]
If you need to use a z that's already there, however, you can't do that and your code is about the neatest you'll get.
Addition: I don't know what this linspace does, but if it produces a 100-element list, use aaronasterling's list comprehension; no point in creating an extra iterator if you don't need to.
This shows the general result. a is made into a 6-element list and b into a 4-element list. The result is a list of 6 lists, and each nested list is 4 elements long.
>>> def f(x,y):
... return x+y
...
>>> a, b = list(range(0, 12, 2)), list(range(0, 12, 3))
>>> print len(a), len(b)
6 4
>>> result = [[f(aa, bb) for bb in b] for aa in a]
>>> print result
[[0, 3, 6, 9], [2, 5, 8, 11], [4, 7, 10, 13], [6, 9, 12, 15], [8, 11, 14, 17], [10, 13, 16, 19]]
I think this is the one-line code that you're looking for:
z = [[a+b for b in linspace(0,yr,100)] for a in linspace(0,xr,100)]
Your linspace actually looks like it could be np.linspace. If it is, you could operate on the NumPy arrays without having to iterate explicitly:
z = f(x[:, np.newaxis], y)
For example:
>>> import numpy as np
>>> x = np.linspace(0, 9, 10)
>>> y = np.linspace(0, 90, 10)
>>> x[:, np.newaxis] + y # or f(x[:, np.newaxis], y)
array([[ 0., 10., 20., 30., 40., 50., 60., 70., 80., 90.],
[ 1., 11., 21., 31., 41., 51., 61., 71., 81., 91.],
[ 2., 12., 22., 32., 42., 52., 62., 72., 82., 92.],
[ 3., 13., 23., 33., 43., 53., 63., 73., 83., 93.],
[ 4., 14., 24., 34., 44., 54., 64., 74., 84., 94.],
[ 5., 15., 25., 35., 45., 55., 65., 75., 85., 95.],
[ 6., 16., 26., 36., 46., 56., 66., 76., 86., 96.],
[ 7., 17., 27., 37., 47., 57., 67., 77., 87., 97.],
[ 8., 18., 28., 38., 48., 58., 68., 78., 88., 98.],
[ 9., 19., 29., 39., 49., 59., 69., 79., 89., 99.]])
But you could also use np.ogrid instead of two linspace:
import numpy as np
>>> x, y = np.ogrid[0:10, 0:100:10]
>>> x + y # or f(x, y)
array([[ 0, 10, 20, 30, 40, 50, 60, 70, 80, 90],
[ 1, 11, 21, 31, 41, 51, 61, 71, 81, 91],
[ 2, 12, 22, 32, 42, 52, 62, 72, 82, 92],
[ 3, 13, 23, 33, 43, 53, 63, 73, 83, 93],
[ 4, 14, 24, 34, 44, 54, 64, 74, 84, 94],
[ 5, 15, 25, 35, 45, 55, 65, 75, 85, 95],
[ 6, 16, 26, 36, 46, 56, 66, 76, 86, 96],
[ 7, 17, 27, 37, 47, 57, 67, 77, 87, 97],
[ 8, 18, 28, 38, 48, 58, 68, 78, 88, 98],
[ 9, 19, 29, 39, 49, 59, 69, 79, 89, 99]])
It somewhat depends on what your f is. If it contains functions like math.sin, you need to replace them with numpy.sin.
If it's not about numpy then you should stick either with your option or optionally using enumerate when looping:
for idx1, ai in enumerate(a):
    for idx2, bj in enumerate(b):
        z[idx1][idx2] = f(ai, bj)
This has the advantage that you don't need to hardcode your range (or xrange) or use len(a) as input. But in general, if there is no huge performance difference [1], use the method that you and others using your code will understand most easily.
[1] If your a and b are numpy.arrays then there would be a significant performance difference, because numpy can process the arrays much faster if no list<->numpy.array conversions are required.
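If f is an arbitrary Python function rather than something built from NumPy ufuncs, a hedged middle ground is np.vectorize: it is still a Python-level loop under the hood, so it buys the broadcasting syntax rather than speed. A minimal sketch (f here is a placeholder):
import numpy as np

def f(x, y):
    return x + y   # placeholder for the real function

a = np.linspace(0, 1.0, 100)
b = np.linspace(0, 2.0, 100)
z = np.vectorize(f)(a[:, np.newaxis], b)   # shape (100, 100)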
