How to generate many interaction terms in Pandas?

I would like to estimate an IV regression model using many interactions with year, demographic, and other dummies. I can't find an explicit method to do this in Pandas and am curious if anyone has tips.
I'm thinking of trying scikit-learn and this function:
http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.PolynomialFeatures.html

I was recently faced with a similar problem, where I needed a flexible way to create specific interactions, and looked through StackOverflow. I followed the tip in the comment above from @user333700 and thanks to him found patsy (http://patsy.readthedocs.io/en/latest/overview.html), and after a Google search, this scikit-learn integration, patsylearn (https://github.com/amueller/patsylearn).
So going through the example of @motam79, this is possible:
import numpy as np
import pandas as pd
from patsylearn import PatsyModel, PatsyTransformer
x = np.array([[ 3, 20, 11],
              [ 6,  2,  7],
              [18,  2, 17],
              [11, 12, 19],
              [ 7, 20,  6]])
df = pd.DataFrame(x, columns=["a", "b", "c"])
x_t = PatsyTransformer("a:b + a:c + b:c", return_type="dataframe").fit_transform(df)
This returns the following:
     a:b    a:c    b:c
0   60.0   33.0  220.0
1   12.0   42.0   14.0
2   36.0  306.0   34.0
3  132.0  209.0  228.0
4  140.0   42.0  120.0
I answered a similar question here, where I provide another example with categorical variables:
How can an interaction design matrix be created from categorical variables?
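For reference, here is a minimal sketch of that categorical case (column names invented for illustration): patsy's C() marks a column as categorical, so the formula expands it into dummies before building the interactions.
import pandas as pd
from patsy import dmatrix

df = pd.DataFrame({"year": [2001, 2001, 2002, 2002],
                   "group": ["a", "b", "a", "b"],
                   "x": [1.0, 2.0, 3.0, 4.0]})
# C() expands year/group into dummies; ':' interacts them with x;
# '- 1' drops the intercept so only the interaction columns remain
dm = dmatrix("C(year):C(group):x - 1", df, return_type="dataframe")
print(dm.columns.tolist())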

You can use sklearn's PolynomialFeatures function. Here is an example:
Let's assume this is your design (i.e. feature) matrix:
x = np.array([[ 3, 20, 11],
              [ 6,  2,  7],
              [18,  2, 17],
              [11, 12, 19],
              [ 7, 20,  6]])
from sklearn.preprocessing import PolynomialFeatures

x_t = PolynomialFeatures(2, interaction_only=True, include_bias=False).fit_transform(x)
Here is the result:
array([[  3.,  20.,  11.,  60.,  33., 220.],
       [  6.,   2.,   7.,  12.,  42.,  14.],
       [ 18.,   2.,  17.,  36., 306.,  34.],
       [ 11.,  12.,  19., 132., 209., 228.],
       [  7.,  20.,   6., 140.,  42., 120.]])
The first 3 features are the original features, and the next three are interactions of the original features.
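If you need to know which output column is which interaction, recent scikit-learn versions can label them (a sketch assuming scikit-learn >= 1.0, where get_feature_names_out exists; older versions use get_feature_names):
import numpy as np
import pandas as pd
from sklearn.preprocessing import PolynomialFeatures

x = np.array([[ 3, 20, 11],
              [ 6,  2,  7],
              [18,  2, 17],
              [11, 12, 19],
              [ 7, 20,  6]])
poly = PolynomialFeatures(2, interaction_only=True, include_bias=False)
x_t = poly.fit_transform(x)
# label columns after the original features: ['a', 'b', 'c', 'a b', 'a c', 'b c']
df_t = pd.DataFrame(x_t, columns=poly.get_feature_names_out(["a", "b", "c"]))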

Merge multidimensional NumPy arrays based on first row

I have to work with sensor data (from ROS, specifically, but it should not be relevant). To this end, I have several 2-D numpy arrays with one row storing the timestamps and the following rows the corresponding sensor data. The problem is that such arrays do not have the same dimensions (different sampling times). I need to merge all of these arrays into a single big one. How can I do so based on the timestamp and, say, replace the missing numbers with 0 or NaN?
Example of my situation:
import numpy as np
time1=np.arange(1,10)
data1=np.random.randint(200, size=time1.shape)
a=np.array((time1,data1))
print(a)
time2=np.arange(1,10,2)
data2=np.random.randint(200, size=time2.shape)
b=np.array((time2,data2))
print(b)
Which returns output
[[ 1 2 3 4 5 6 7 8 9]
[ 51 9 117 174 164 60 95 197 30]]
[[ 1 3 5 7 9]
[ 35 188 114 153 36]]
What I am looking for is
[[ 1 2 3 4 5 6 7 8 9]
[ 51 9 117 174 164 60 95 197 30]
[ 35 0 188 0 114 0 153 0 36]]
Is there any way to achieve this in an efficient way? This is an example but I am working with thousands of samples. Thanks!
For the simple case of one b-matrix
With the first row of a storing all possible timestamps and both of those first rows in a and b being sorted, we can use np.searchsorted -
idx = np.searchsorted(a[0],b[0])
out_dtype = np.result_type(a.dtype, b.dtype)
b0 = np.zeros(a.shape[1],dtype=out_dtype)
b0[idx] = b[1]
out = np.vstack((a,b0))
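As a quick sanity check, here is that snippet applied to the arrays from the question (a sketch; the values are the ones shown above):
import numpy as np

a = np.array([[ 1,  2,   3,   4,   5,  6,  7,   8,  9],
              [51,  9, 117, 174, 164, 60, 95, 197, 30]])
b = np.array([[ 1,   3,   5,   7,  9],
              [35, 188, 114, 153, 36]])

idx = np.searchsorted(a[0], b[0])   # positions of b's timestamps in a's
b0 = np.zeros(a.shape[1], dtype=np.result_type(a.dtype, b.dtype))
b0[idx] = b[1]                      # scatter b's data onto the full grid
out = np.vstack((a, b0))
# out[2] -> [ 35   0 188   0 114   0 153   0  36]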
For several b-matrices
Approach #1
To extend to multiple b-matrices, we can follow a similar method with np.searchsorted within a loop, like so -
def merge_arrays(a, B):
    # a : Array with first row holding all possible timestamps
    # B : list or tuple of all b-matrices
    lens = np.array([len(i) for i in B])
    L = (lens-1).sum() + len(a)
    out_dtype = np.result_type(*[i.dtype for i in B])
    out = np.zeros((L, a.shape[1]), dtype=out_dtype)
    out[:len(a)] = a
    s = len(a)
    for b_i in B:
        idx = np.searchsorted(a[0], b_i[0])
        out[s:s+len(b_i)-1, idx] = b_i[1:]
        s += len(b_i)-1
    return out
Sample run -
In [175]: a
Out[175]:
array([[ 4, 11, 16, 22, 34, 56, 67, 87, 91, 99],
[ 1, 2, 3, 4, 5, 6, 7, 8, 9, 10]])
In [176]: b0
Out[176]:
array([[16, 22, 34, 56, 67, 91],
[20, 80, 69, 79, 47, 64],
[82, 88, 49, 29, 19, 19]])
In [177]: b1
Out[177]:
array([[ 4, 16, 34, 99],
[28, 34, 0, 0],
[36, 53, 5, 38],
[17, 79, 4, 42]])
In [178]: merge_arrays(a, [b0,b1])
Out[178]:
array([[ 4, 11, 16, 22, 34, 56, 67, 87, 91, 99],
[ 1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
[ 0, 0, 20, 80, 69, 79, 47, 0, 64, 0],
[ 0, 0, 82, 88, 49, 29, 19, 0, 19, 0],
[28, 0, 34, 0, 0, 0, 0, 0, 0, 0],
[36, 0, 53, 0, 5, 0, 0, 0, 0, 38],
[17, 0, 79, 0, 4, 0, 0, 0, 0, 42]])
Approach #2
If looping with np.searchsorted seems to be the bottleneck, we can vectorize that part -
def merge_arrays_v2(a, B):
    # a : Array with first row holding all possible timestamps
    # B : list or tuple of all b-matrices
    lens = np.array([len(i) for i in B])
    L = (lens-1).sum() + len(a)
    out_dtype = np.result_type(*[i.dtype for i in B])
    out = np.zeros((L, a.shape[1]), dtype=out_dtype)
    out[:len(a)] = a
    s = len(a)
    r0 = [i[0] for i in B]
    r0s = np.concatenate(r0)
    idxs = np.searchsorted(a[0], r0s)
    cols = np.array([i.shape[1] for i in B])
    sp = np.r_[0, cols.cumsum()]
    start, stop = sp[:-1], sp[1:]
    for (b_i, s0, s1) in zip(B, start, stop):
        idx = idxs[s0:s1]
        out[s:s+len(b_i)-1, idx] = b_i[1:]
        s += len(b_i)-1
    return out
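A quick equivalence check on the sample data from Approach #1 (a sketch, assuming a, b0 and b1 are the arrays from the sample run above):
assert np.array_equal(merge_arrays(a, [b0, b1]),
                      merge_arrays_v2(a, [b0, b1]))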
Here's an approach using np.searchsorted:
time1=np.arange(1,10)
data1=np.random.randint(200, size=time1.shape)
a=np.array((time1,data1))
# array([[ 1, 2, 3, 4, 5, 6, 7, 8, 9],
# [118, 105, 86, 94, 69, 17, 142, 46, 54]])
time2=np.arange(1,10,2)
data2=np.random.randint(200, size=time2.shape)
b=np.array((time2,data2))
# array([[ 1, 3, 5, 7, 9],
# [70, 15, 4, 97, 57]])
out = np.vstack([a, np.zeros(a.shape[1])])
out[-1, np.searchsorted(a[0], b[0])] = b[1]
array([[ 1., 2., 3., 4., 5., 6., 7., 8., 9.],
[118., 105., 86., 94., 69., 17., 142., 46., 54.],
[ 70., 0., 15., 0., 4., 0., 97., 0., 57.]])
Update - Merging many matrices
Here's an almost fully vectorised approach for a scenario with multiple b matrices. This approach does not require a priori knowledge of which is the largest list:
def merge_timestamps(*x):
    # infer which is the list with maximum length
    # as well as individual lengths
    concat = np.concatenate(*x, axis=1)[0]
    lens = np.r_[np.flatnonzero(np.diff(concat) < 0), len(concat)]
    max_len_list = np.r_[lens[0], np.diff(lens)].argmax()
    # define the output matrix
    A = x[0][max_len_list]
    out = np.vstack([A, np.zeros((len(*x)-1, len(A[0])))])
    others = np.flatnonzero(~np.in1d(np.arange(len(*x)), max_len_list))
    # Update the output matrix with the values of the smaller
    # arrays according to their index. This is of course assuming
    # all values are contained in the largest
    for ix, i in enumerate(others):
        out[-(ix+1), x[0][i][0]-A[0].min()] = x[0][i][1]
    return out
Let's check with the following example:
time1=np.arange(1,10)
data1=np.random.randint(200, size=time1.shape)
a=np.array((time1,data1))
# array([[ 1, 2, 3, 4, 5, 6, 7, 8, 9],
# [107, 13, 123, 119, 137, 135, 65, 157, 83]])
time2=np.arange(1,10,2)
data2=np.random.randint(200, size=time2.shape)
b = np.array((time2,data2))
# array([[ 1, 3, 5, 7, 9],
# [ 81, 49, 83, 32, 179]])
time3=np.arange(1,4,2)
data3=np.random.randint(200, size=time3.shape)
c=np.array((time3,data3))
# array([[ 1, 3],
# [185, 117]])
merge_timestamps([a,b,c])
array([[ 1., 2., 3., 4., 5., 6., 7., 8., 9.],
[107., 13., 123., 119., 137., 135., 65., 157., 83.],
[185., 0., 117., 0., 0., 0., 0., 0., 0.],
[ 81., 0., 49., 0., 83., 0., 32., 0., 179.]])
As mentioned, this approach does not require a priori knowledge of which is the largest list, i.e. it would also work with:
merge_timestamps([b, c, a])
array([[ 1., 2., 3., 4., 5., 6., 7., 8., 9.],
[107., 13., 123., 119., 137., 135., 65., 157., 83.],
[185., 0., 117., 0., 0., 0., 0., 0., 0.],
[ 81., 0., 49., 0., 83., 0., 32., 0., 179.]])
This is applicable only if the sensor is capturing data at a fixed interval.
First we will need to create a dataframe with a fixed interval (a 15-minute interval in this case), then merge the sensor's data into this dataframe.
Code to generate a dataframe with a 15-minute interval (copied):
l = (pd.DataFrame(columns=['NULL'],
                  index=pd.date_range('2016-09-02T17:30:00Z', '2016-09-02T21:00:00Z',
                                      freq='15T'))
       .between_time('07:00', '21:00')
       .index.strftime('%Y-%m-%dT%H:%M:%SZ')
       .tolist())
l = pd.DataFrame(l)
Assuming the data below comes from the sensor:
m = (pd.DataFrame(columns=['NULL'],
                  index=pd.date_range('2016-09-02T17:30:00Z', '2016-09-02T21:00:00Z',
                                      freq='30T'))
       .between_time('07:00', '21:00')
       .index.strftime('%Y-%m-%dT%H:%M:%SZ')
       .tolist())
m = pd.DataFrame(m)
m['SensorData'] = np.arange(8)
Merge the above two dataframes:
df = l.merge(m, left_on=0, right_on=0, how='left')
df.loc[df['SensorData'].isna(), 'SensorData'] = 0
Output
0 SensorData
0 2016-09-02T17:30:00Z 0.0
1 2016-09-02T17:45:00Z 0.0
2 2016-09-02T18:00:00Z 1.0
3 2016-09-02T18:15:00Z 0.0
4 2016-09-02T18:30:00Z 2.0
5 2016-09-02T18:45:00Z 0.0
6 2016-09-02T19:00:00Z 3.0
7 2016-09-02T19:15:00Z 0.0
8 2016-09-02T19:30:00Z 4.0
9 2016-09-02T19:45:00Z 0.0
10 2016-09-02T20:00:00Z 5.0
11 2016-09-02T20:15:00Z 0.0
12 2016-09-02T20:30:00Z 6.0
13 2016-09-02T20:45:00Z 0.0
14 2016-09-02T21:00:00Z 7.0
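Alternatively, the same zero-filled result can be sketched more compactly with reindex (an alternative to the merge above, assuming the sensor timestamps are a subset of the 15-minute grid):
import numpy as np
import pandas as pd

grid = pd.date_range('2016-09-02T17:30:00Z', '2016-09-02T21:00:00Z', freq='15T')
sensor = pd.Series(np.arange(8),
                   index=pd.date_range('2016-09-02T17:30:00Z',
                                       '2016-09-02T21:00:00Z', freq='30T'))
# missing grid points get 0 instead of NaN
df = sensor.reindex(grid, fill_value=0).to_frame('SensorData')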

Replace values based on multiple conditions of two arrays?

Assume that I have two arrays
>>> import numpy as np
>>> a = np.random.randint(0, 10, size=(5, 4))
>>> a
array([[1, 6, 7, 4],
[2, 7, 4, 2],
[9, 3, 6, 4],
[9, 6, 8, 2],
[7, 2, 9, 5]])
>>> b = np.random.randint(0, 10, size=(5, 4))
>>> b
array([[ 5., 8., 6., 5.],
[ 1., 8., 4., 8.],
[ 1., 4., 6., 3.],
[ 4., 8., 6., 4.],
[ 8., 7., 7., 5.]], dtype=float32)
Now I have a situation where I need to compare the elements of each array and replace them with known values. For example, my conditions are:
if a == 0 then replace with 0 (or) if b == 0 then replace with 0
if a > 4 and < 11 then replace with 1 (or) if b > 1 and < 3 then replace with 1
if a > 10 and < 18 then replace with 2 (or) if b > 2 and < 5 then replace with 2
.
.
.
and finally
if a > 40 replace with 9 (or) if b > 9 then replace with 9.
The replaced values can be stored in a new array, which I need to use for another function.
The simplest form of element-wise comparison, like a[a > 2] = 1, works. But I am not aware of how to do multiple comparisons (multiple times) with the same method.
I am sure that an easy way exists in numpy which I am unable to find. Any help is appreciated.
np.digitize should do what you want. The first argument is the array whose values you want to replace and the second is the list of thresholds.
a_replace = np.digitize(a, [0, 4, 10, ..., 40], right=True)
b_replace = np.digitize(b, [0, 1, 2, ..., 9], right=True)
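A concrete, hypothetical run (the bin edges below are invented for illustration; the '...' above stands for the asker's full threshold lists):
import numpy as np

a = np.array([[1, 6, 7, 4],
              [2, 7, 4, 2]])
edges_a = [0, 4, 10, 17, 24, 31, 40]   # hypothetical thresholds for a
np.digitize(a, edges_a, right=True)
# array([[1, 2, 2, 1],
#        [1, 2, 1, 1]])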

Output in scipy.stats.binned_statistic_dd()

I am trying to use scipy.stats.binned_statistic_dd and I can't for the life of me figure out the outputs. Does anyone have any advice here?
Look at this simple sample program:
import scipy
scipy.__version__
# '0.14.0'
import numpy as np
print scipy.stats.binned_statistic_dd([np.ones(10), np.ones(10)], np.arange(10), 'count', bins=3)
#(array([[ 0., 0., 0.],
# [ 0., 10., 0.],
# [ 0., 0., 0.]]),
# [array([ 0.5 , 0.83333333, 1.16666667, 1.5 ]),
# array([ 0.5 , 0.83333333, 1.16666667, 1.5 ])],
# array([12, 12, 12, 12, 12, 12, 12, 12, 12, 12]))
So the documentation claims the outputs are:
statistic : ndarray, shape(nx1, nx2, nx3,...)
    The values of the selected statistic in each two-dimensional bin
edges : list of ndarrays
    A list of D arrays describing the (nxi + 1) bin edges for each dimension
binnumber : 1-D ndarray of ints
    This assigns to each observation an integer that represents the bin in which
    this observation falls. Array has the same length as values.
In the example the statistic makes good sense: I asked for the 'count' and got 10; there are 10 elements, all in that same bin. Edges makes good sense too: the data to be binned over was 2-dimensional and I wanted 3 bins, so I got out 4 edges per dimension, which are reasonable.
Then the binnumber makes no sense to me at all: array([12, 12, 12, 12, 12, 12, 12, 12, 12, 12]). There are indeed 10 numbers, the same length as the data inputted, np.arange(10), but the number 12 makes no sense at all. What am I missing? 12 is not an unraveled index over the bins turned into a multi-D array; since there are 3 bins in each dimension I could see numbers up to 9. What is 12 telling me?
The values in binnumbers are an unraveled index of bins that include an extra
set of "out of range" bins.
In this example,
In [40]: hst, edges, bincounts = binned_statistic_dd([np.ones(10), np.ones(10)], None, 'count', bins=3)
In [41]: hst
Out[41]:
array([[ 0., 0., 0.],
[ 0., 10., 0.],
[ 0., 0., 0.]])
the bins are numbered as follows:
  0  |  1  |  2  |  3  |  4
-----+-----+-----+-----+-----
  5  |  6  |  7  |  8  |  9
-----+-----+-----+-----+-----
 10  | 11  | 12  | 13  | 14
-----+-----+-----+-----+-----
 15  | 16  | 17  | 18  | 19
-----+-----+-----+-----+-----
 20  | 21  | 22  | 23  | 24
The "out of range" bins are not included in hst; the data in hst corresponds to bin numbers
6, 7, 8, 11, 12, 13, 16, 17 and 18. That's why all the values in bincounts are 12:
In [42]: bincounts
Out[42]: array([12, 12, 12, 12, 12, 12, 12, 12, 12, 12])
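Since the grid is (nbins + 2) wide in each dimension, np.unravel_index recovers the per-dimension bin indices from the unraveled numbers (newer SciPy versions can also return these directly via the expand_binnumbers argument):
import numpy as np

# 3 bins per dimension plus 2 out-of-range bins -> a 5x5 grid
np.unravel_index(12, (5, 5))   # -> (2, 2)
# subtracting 1 for the out-of-range bin on each side gives inner bin (1, 1),
# the center of the 3x3 hst above, which indeed holds the 10 counts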
You can use the range argument to force the counts into the outer bins. For example,
by setting the ranges of the coordinates to be [2, 3] and [0, 0.5], so all the values in the
first coordinate are left of their range and all the values in the second coordinate are
to the right of their range, all the points end up in the upper right outer bin, which is
bin index 4:
In [51]: binned_statistic_dd([np.ones(10), np.ones(10)], None, 'count', bins=3, range=[[2,3],[0,0.5]])
Out[51]:
(array([[ 0., 0., 0.],
[ 0., 0., 0.],
[ 0., 0., 0.]]),
[array([ 2. , 2.33333333, 2.66666667, 3. ]),
array([ 0. , 0.16666667, 0.33333333, 0.5 ])],
array([4, 4, 4, 4, 4, 4, 4, 4, 4, 4]))

Using interpolate function over 2-D array

I have a 1-D function that takes a long time to compute over a big 2-D array of 'x' values, so it is much easier to create an interpolating function using SciPy and then compute y using it, which will be much faster. However, I cannot use the interpolating function on arrays that have more than one dimension.
Example:
import numpy as np
import scipy as sp
import scipy.interpolate  # makes sp.interpolate available

# First, I create the interpolation function in the domain I want to work in
x = np.arange(1, 100, 0.1)
f = np.exp(x)  # a complicated function
f_int = sp.interpolate.InterpolatedUnivariateSpline(x, f, k=2)
# Now, in the code I do that
x = [[13, ..., 1], [99, ..., 45], [33, ..., 98] ..., [15, ..., 65]]
y = f_int(x)
# Which I want that it returns y = [[f_int(13), ..., f_int(1)], ..., [f_int(15), ..., f_int(65)]]
But returns:
ValueError: object too deep for desired array
I know I could loop over all x members, but I don't know if it is a better option...
Thanks!
EDIT:
A function like that also would do the job:
def vector_op(function, values):
    orig_shape = values.shape
    values = np.reshape(values, values.size)
    return np.reshape(function(values), orig_shape)
I've tried the np.vectorize but it is too slow...
If f_int wants single dimensional data, you should flatten your input, feed it to the interpolator, then reconstruct your original shape:
>>> x = np.arange(1, 100, 0.1)
>>> f = 2 * x # a simple function to see the results are good
>>> f_int = scipy.interpolate.InterpolatedUnivariateSpline(x, f, k=2)
>>> x = np.arange(25).reshape(5, 5) + 1
>>> x
array([[ 1, 2, 3, 4, 5],
[ 6, 7, 8, 9, 10],
[11, 12, 13, 14, 15],
[16, 17, 18, 19, 20],
[21, 22, 23, 24, 25]])
>>> x_int = f_int(x.reshape(-1)).reshape(x.shape)
>>> x_int
array([[ 2., 4., 6., 8., 10.],
[ 12., 14., 16., 18., 20.],
[ 22., 24., 26., 28., 30.],
[ 32., 34., 36., 38., 40.],
[ 42., 44., 46., 48., 50.]])
x.reshape(-1) does the flattening, and the .reshape(x.shape) returns it to its original form.
I think you want to do a vectorized function in numpy:
import numpy as np

# create some random test data
test = np.random.random((100, 100))

# a normal python function that you want to apply
def myFunc(i):
    return np.exp(i)

# now vectorize the function so that it will work on numpy arrays
myVecFunc = np.vectorize(myFunc)
result = myVecFunc(test)
I would use a combination of a list comprehension and map (there might be a way to use two nested maps that I'm missing)
In [24]: x
Out[24]: [[1, 2, 3], [1, 2, 3], [1, 2, 3]]
In [25]: [list(map(lambda a: a*0.1, x_val)) for x_val in x]
Out[25]:
[[0.1, 0.2, 0.30000000000000004],
[0.1, 0.2, 0.30000000000000004],
[0.1, 0.2, 0.30000000000000004]]
This is just for illustration purposes... replace lambda a: a*0.1 with your function, f_int.
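In Python 3 the same pattern with the spline from the question might look like this (a sketch; float() unwraps the 0-d array the spline returns for a scalar input):
y = [[float(f_int(a)) for a in row] for row in x]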

Moving large SQL query to NumPy

I have a very large MySQL query in my web app that looks like this:
query = """
    SELECT video_tag.video_id, (sum(user_rating.rating) * video.rating_norm) as score
    FROM video_tag
    JOIN user_rating ON user_rating.item_id = video_tag.tag_id
    JOIN video ON video.id = video_tag.video_id
    WHERE item_type = 3 AND user_id = 1 AND rating != 0 AND video.website_id = 2
        AND rating_norm > 0 AND video_id NOT IN (1,2,3)
    GROUP BY video_id
    ORDER BY score DESC
    LIMIT 20"""
This query joins three tables (video, video_tag, and user_rating), groups the results, and does some basic math to compute a score for each video. This takes about 2s to run as the tables are large.
Instead of making SQL do all this work, I suspect it would be faster to do this computation using NumPy arrays. The data in 'video' and 'video_tag' is constant - so I could just load those tables into memory once and not have to ping SQL each time.
However, while I can load these three tables into three separate arrays, I'm having a heck of a time replicating the above query (specifically the JOIN and GROUP BY parts). Has anyone any experience with replicating SQL queries using NumPy arrays?
Thanks!
What makes this exercise awkward is the single-data-type constraint for NumPy arrays. For instance, the GROUP BY operation implicitly requires (at least) one field/column of continuous values (to aggregate/sum) and one field/column to partition or group by.
Of course, NumPy recarrays can represent a 2D array (or SQL table) using a different data type for each column (aka 'field'), but I find these composite arrays cumbersome to work with. So in the code snippets below, I just used the conventional ndarray class to replicate the two SQL operations highlighted in the OP's question.
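For reference, a tiny sketch of such a composite (structured) array, with a named, typed field per SQL column (field names invented here):
import numpy as NP

tbl = NP.array([(1, 92.0), (2, 64.0)],
               dtype=[('video_id', 'i4'), ('score', 'f8')])
tbl['score']                 # column access by field name
tbl[tbl['video_id'] == 1]    # row selection, roughly a WHERE clause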
To mimic SQL JOIN in NumPy:
First, create two NumPy arrays (A & B), each representing an SQL table, with the join key in the 1st column of each (B holds the primary keys, A the foreign keys).
import numpy as NP
A = NP.random.randint(10, 100, 40).reshape(8, 5)
a = NP.random.randint(1, 3, 8).reshape(8, -1) # add a column of foreign keys
A = NP.column_stack((a, A))
B = NP.random.randint(0, 10, 4).reshape(2, 2)
b = NP.array([1, 2])
B = NP.column_stack((b, B))
Now (attempt to) replicate JOIN using NumPy array objects:
# prepare the array that will hold the 'result set':
AB = NP.column_stack((A, NP.zeros((A.shape[0], B.shape[1]-1))))

def join(A, B):
    '''
    returns None; the side effect is population of the 'result set' NumPy array 'AB';
    pass in A, B, two NumPy 2D arrays representing the two SQL tables to join
    '''
    k, v = B[:, 0], B[:, 1:]
    dx = dict(zip(k, v))
    for i in range(A.shape[0]):
        AB[i, -2:] = dx[A[i, 0]]   # look up the B row matching A's key
To mimic SQL GROUP BY in NumPy:
def group_by(AB, col_id):
    '''
    returns a 2D NumPy array aggregated on the unique values in the column specified by col_id;
    pass in a 2D NumPy array and the col_id (integer) which holds the unique values to group by
    '''
    uv = NP.unique(AB[:, col_id])
    temp = []
    for v in uv:
        ndx = AB[:, col_id] == v
        temp.append(NP.sum(AB[ndx, 1:], axis=0))
    temp = NP.row_stack(temp)
    uv = uv.reshape(-1, 1)
    return NP.column_stack((uv, temp))
For a test case, they return the correct result:
>>> A
array([[ 1, 92, 50, 67, 51, 75],
[ 2, 64, 35, 38, 69, 11],
[ 1, 83, 62, 73, 24, 55],
[ 2, 54, 71, 38, 15, 73],
[ 2, 39, 28, 49, 47, 28],
[ 1, 68, 52, 28, 46, 69],
[ 2, 82, 98, 24, 97, 98],
[ 1, 98, 37, 32, 53, 29]])
>>> B
array([[1, 5, 4],
[2, 3, 7]])
>>> join(A, B)
array([[ 1., 92., 50., 67., 51., 75., 5., 4.],
[ 2., 64., 35., 38., 69., 11., 3., 7.],
[ 1., 83., 62., 73., 24., 55., 5., 4.],
[ 2., 54., 71., 38., 15., 73., 3., 7.],
[ 2., 39., 28., 49., 47., 28., 3., 7.],
[ 1., 68., 52., 28., 46., 69., 5., 4.],
[ 2., 82., 98., 24., 97., 98., 3., 7.],
[ 1., 98., 37., 32., 53., 29., 5., 4.]])
>>> group_by(AB, 0)
array([[ 1., 341., 201., 200., 174., 228., 20., 16.],
[ 2., 239., 232., 149., 228., 210., 12., 28.]])
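For what it's worth, the group-by loop can also be avoided with np.add.at, an unbuffered scatter-add (a sketch producing the same aggregate as group_by(AB, 0) above):
uv, inv = NP.unique(AB[:, 0], return_inverse=True)
sums = NP.zeros((len(uv), AB.shape[1] - 1))
NP.add.at(sums, inv, AB[:, 1:])    # add each row into its group's slot
NP.column_stack((uv, sums))        # same rows as group_by(AB, 0)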
