Suppose you have a 2D array filled with consecutive integers, going from left to right and top to bottom. Hence it would look like
[[ 0, 1, 2, 3, 4],
[ 5, 6, 7, 8, 9],
[10, 11, 12, 13, 14],
[15, 16, 17, 18, 19]]
Now suppose you have a 1D array of some of the integers shown in the array above. Let's say this array is [6,7,11]. I want to extract the block/chunk of the 2D array that contains the elements of the list. With these two inputs the result should be
[[ 6., 7.],
[11., nan]]
(I am padding with np.nan where the selection cannot be reshaped into a full rectangle)
This is what I have written. Is there a simpler way please?
import numpy as np

def my_fun(my_list):
    ids_down = 4
    ids_across = 5
    layout = np.arange(ids_down * ids_across).reshape((ids_down, ids_across))
    ids = np.where((layout >= min(my_list)) & (layout <= max(my_list)), layout, np.nan)
    r, c = np.unravel_index(my_list, ids.shape)
    out = np.nan * np.ones(ids.shape)
    for i, t in enumerate(zip(r, c)):
        out[t] = my_list[i]
    ax1_mask = np.any(~np.isnan(out), axis=1)
    ax0_mask = np.any(~np.isnan(out), axis=0)
    out = out[ax1_mask, :]
    out = out[:, ax0_mask]
    return out
Then trying my_fun([6,7,11]) returns
[[ 6., 7.],
[11., nan]]
This 100% NumPy solution works for both contiguous and non-contiguous arrays of wanted numbers.
a = np.array([[ 0, 1, 2, 3, 4],
[ 5, 6, 7, 8, 9],
[10, 11, 12, 13, 14],
[15, 16, 17, 18, 19]])
n = np.array([6, 7, 11])
Identify the locations of the wanted numbers:
mask = np.isin(a, n)
Select the rows and columns that have the wanted numbers:
np.where(mask, a, np.nan)\
[mask.any(axis=1)][:, mask.any(axis=0)]
#array([[ 6., 7.],
# [11., nan]])
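For comparison with my_fun from the question, the same idea could be wrapped up like this (extract_block is just an illustrative name):
def extract_block(a, wanted):
    # keep only the wanted values, then drop rows/columns that contain none of them
    mask = np.isin(a, wanted)
    block = np.where(mask, a, np.nan)
    return block[mask.any(axis=1)][:, mask.any(axis=0)]

extract_block(a, n)
# array([[ 6.,  7.],
#        [11., nan]])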
One approach is to look for the bounding boxes by checking which elements in the array are contained in the second list. We can use scipy.ndimage:
from scipy import ndimage
m = np.isin(a, b)
a_components, _ = ndimage.label(m, np.ones((3, 3)))
bbox = ndimage.find_objects(a_components)
out = a[bbox[0]]
np.where(np.isin(out, b), out, np.nan)
array([[ 6., 7.],
[11., nan]])
Setup -
a = np.array([[ 0, 1, 2, 3, 4],
[ 5, 6, 7, 8, 9],
[10, 11, 12, 13, 14],
[15, 16, 17, 18, 19]])
b = np.array([6,7,11])
Or for b = np.array([10,12,16]) we'd get:
m = np.isin(a, b)
a_components, _ = ndimage.label(m, np.ones((3, 3)))
bbox = ndimage.find_objects(a_components)
out = a[bbox[0]]
np.where(np.isin(out, b), out, np.nan)
array([[10., nan, 12.],
[nan, 16., nan]])
We could also adapt the above for multiple bounding boxes by doing:
b = np.array([5, 11, 8, 14])
m = np.isin(a, b)
a_components, _ = ndimage.label(m, np.ones((3, 3)))
bbox = ndimage.find_objects(a_components)
l = []
for box in bbox:
    out = a[box]
    l.append(np.where(np.isin(out, b), out, np.nan))
print(l)
[array([[ 5., nan],
[nan, 11.]]),
array([[ 8., nan],
[nan, 14.]])]
Taking advantage of the specific form of template array A we can directly transform the test values to coordinates:
A = np.arange(20).reshape(4,5)
test = [6,7,11]
y,x = np.unravel_index(test,A.shape)
yl,yr = y.min(),y.max()
xl,xr = x.min(),x.max()
out = np.full((yr-yl+1,xr-xl+1),np.nan)
out[y-yl,x-xl]=test
out
# array([[ 6., 7.],
# [11., nan]])
I have to work with sensor data (from ROS, specifically, but that should not be relevant). To this end, I have several 2-D numpy arrays with one row storing the timestamps and the following rows the corresponding sensor data. The problem is that these arrays do not have the same dimensions (different sampling times). I need to merge all of these arrays into a single big one. How can I do so based on the timestamp and, say, replace the missing numbers with 0 or NaN?
Example of my situation:
import numpy as np
time1=np.arange(1,10)
data1=np.random.randint(200, size=time1.shape)
a=np.array((time1,data1))
print(a)
time2=np.arange(1,10,2)
data2=np.random.randint(200, size=time2.shape)
b=np.array((time2,data2))
print(b)
Which returns output
[[ 1 2 3 4 5 6 7 8 9]
[ 51 9 117 174 164 60 95 197 30]]
[[ 1 3 5 7 9]
[ 35 188 114 153 36]]
What I am looking for is
[[ 1 2 3 4 5 6 7 8 9]
[ 51 9 117 174 164 60 95 197 30]
[ 35 0 188 0 114 0 153 0 36]]
Is there any way to achieve this in an efficient way? This is an example but I am working with thousands of samples. Thanks!
For the simple case of one b-matrix
With the first row of a storing all possible timestamps and the first rows of both a and b being sorted, we can use np.searchsorted -
idx = np.searchsorted(a[0], b[0])
out_dtype = np.result_type(a.dtype, b.dtype)
b0 = np.zeros(a.shape[1], dtype=out_dtype)
b0[idx] = b[1]
out = np.vstack((a, b0))
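For instance, with the question's example (a[0] = [1, ..., 9] and b[0] = [1, 3, 5, 7, 9]) the searchsorted step yields the target columns directly:
np.searchsorted(a[0], b[0])
# array([0, 2, 4, 6, 8])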
For several b-matrices
Approach #1
To extend to multiple b-matrices, we can follow a similar method with np.searchsorted within a loop, like so -
def merge_arrays(a, B):
    # a : Array with first row holding all possible timestamps
    # B : list or tuple of all b-matrices
    lens = np.array([len(i) for i in B])
    L = (lens - 1).sum() + len(a)
    out_dtype = np.result_type(*[i.dtype for i in B])
    out = np.zeros((L, a.shape[1]), dtype=out_dtype)
    out[:len(a)] = a
    s = len(a)
    for b_i in B:
        idx = np.searchsorted(a[0], b_i[0])
        out[s:s + len(b_i) - 1, idx] = b_i[1:]
        s += len(b_i) - 1
    return out
Sample run -
In [175]: a
Out[175]:
array([[ 4, 11, 16, 22, 34, 56, 67, 87, 91, 99],
[ 1, 2, 3, 4, 5, 6, 7, 8, 9, 10]])
In [176]: b0
Out[176]:
array([[16, 22, 34, 56, 67, 91],
[20, 80, 69, 79, 47, 64],
[82, 88, 49, 29, 19, 19]])
In [177]: b1
Out[177]:
array([[ 4, 16, 34, 99],
[28, 34, 0, 0],
[36, 53, 5, 38],
[17, 79, 4, 42]])
In [178]: merge_arrays(a, [b0,b1])
Out[178]:
array([[ 4, 11, 16, 22, 34, 56, 67, 87, 91, 99],
[ 1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
[ 0, 0, 20, 80, 69, 79, 47, 0, 64, 0],
[ 0, 0, 82, 88, 49, 29, 19, 0, 19, 0],
[28, 0, 34, 0, 0, 0, 0, 0, 0, 0],
[36, 0, 53, 0, 5, 0, 0, 0, 0, 38],
[17, 0, 79, 0, 4, 0, 0, 0, 0, 42]])
Approach #2
If looping with np.searchsorted seems to be the bottleneck, we can vectorize that part -
def merge_arrays_v2(a, B):
    # a : Array with first row holding all possible timestamps
    # B : list or tuple of all b-matrices
    lens = np.array([len(i) for i in B])
    L = (lens - 1).sum() + len(a)
    out_dtype = np.result_type(*[i.dtype for i in B])
    out = np.zeros((L, a.shape[1]), dtype=out_dtype)
    out[:len(a)] = a
    s = len(a)
    r0 = [i[0] for i in B]
    r0s = np.concatenate(r0)
    idxs = np.searchsorted(a[0], r0s)
    cols = np.array([i.shape[1] for i in B])
    sp = np.r_[0, cols.cumsum()]
    start, stop = sp[:-1], sp[1:]
    for (b_i, s0, s1) in zip(B, start, stop):
        idx = idxs[s0:s1]
        out[s:s + len(b_i) - 1, idx] = b_i[1:]
        s += len(b_i) - 1
    return out
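As a quick sanity check (a sketch, reusing a, b0 and b1 from the sample run above), the two variants should agree:
# both merge implementations should produce identical output on the sample data
assert np.array_equal(merge_arrays(a, [b0, b1]), merge_arrays_v2(a, [b0, b1]))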
Here's an approach using np.searchsorted:
time1=np.arange(1,10)
data1=np.random.randint(200, size=time1.shape)
a=np.array((time1,data1))
# array([[ 1, 2, 3, 4, 5, 6, 7, 8, 9],
# [118, 105, 86, 94, 69, 17, 142, 46, 54]])
time2=np.arange(1,10,2)
data2=np.random.randint(200, size=time2.shape)
b=np.array((time2,data2))
# array([[ 1, 3, 5, 7, 9],
# [70, 15, 4, 97, 57]])
out = np.vstack([a, np.zeros(a.shape[1])])
out[out.shape[0]-1, np.searchsorted(a[0], b[0])] = b[1]
array([[ 1., 2., 3., 4., 5., 6., 7., 8., 9.],
[118., 105., 86., 94., 69., 17., 142., 46., 54.],
[ 70., 0., 15., 0., 4., 0., 97., 0., 57.]])
Update - Merging many matrices
Here's an almost fully vectorised approach for a scenario with multiple b matrices. This approach does not require a priori knowledge of which is the largest list:
def merge_timestamps(*x):
    # infer which is the list with maximum length
    # as well as individual lengths
    concat = np.concatenate(*x, axis=1)[0]
    lens = np.r_[np.flatnonzero(np.diff(concat) < 0), len(concat)]
    max_len_list = np.r_[lens[0], np.diff(lens)].argmax()
    # define the output matrix from the longest array (timestamps + data)
    A = x[0][max_len_list]
    out = np.vstack([A, np.zeros((len(*x) - 1, len(A[0])))])
    others = np.flatnonzero(~np.in1d(np.arange(len(*x)), max_len_list))
    # Update the output matrix with the values of the smaller
    # arrays according to their index. This is of course assuming
    # all values are contained in the largest
    for ix, i in enumerate(others):
        out[-(ix + 1), x[0][i][0] - A[0].min()] = x[0][i][1]
    return out
Let's check with the following example:
time1=np.arange(1,10)
data1=np.random.randint(200, size=time1.shape)
a=np.array((time1,data1))
# array([[ 1, 2, 3, 4, 5, 6, 7, 8, 9],
# [107, 13, 123, 119, 137, 135, 65, 157, 83]])
time2=np.arange(1,10,2)
data2=np.random.randint(200, size=time2.shape)
b = np.array((time2,data2))
# array([[ 1, 3, 5, 7, 9],
# [ 81, 49, 83, 32, 179]])
time3=np.arange(1,4,2)
data3=np.random.randint(200, size=time3.shape)
c=np.array((time3,data3))
# array([[ 1, 3],
# [185, 117]])
merge_timestamps([a,b,c])
array([[ 1., 2., 3., 4., 5., 6., 7., 8., 9.],
[107., 13., 123., 119., 137., 135., 65., 157., 83.],
[185., 0., 117., 0., 0., 0., 0., 0., 0.],
[ 81., 0., 49., 0., 83., 0., 32., 0., 179.]])
As mentioned this approach does not require a priori knowledge of which is the largest list, i.e. it would also work with:
merge_timestamps([b, c, a])
array([[ 1., 2., 3., 4., 5., 6., 7., 8., 9.],
[107., 13., 123., 119., 137., 135., 65., 157., 83.],
[185., 0., 117., 0., 0., 0., 0., 0., 0.],
[ 81., 0., 49., 0., 83., 0., 32., 0., 179.]])
Applicable only if the sensor is capturing data at a fixed interval.
First we need to create a dataframe with a fixed interval (a 15-minute interval in this case), then merge the sensor's data into it.
Code to generate a dataframe with a 15-minute interval (copied):
import pandas as pd

l = (pd.DataFrame(columns=['NULL'],
index=pd.date_range('2016-09-02T17:30:00Z', '2016-09-02T21:00:00Z',
freq='15T'))
.between_time('07:00','21:00')
.index.strftime('%Y-%m-%dT%H:%M:%SZ')
.tolist()
)
l = pd.DataFrame(l)
Assuming the data below comes from the sensor:
m = (pd.DataFrame(columns=['NULL'],
index=pd.date_range('2016-09-02T17:30:00Z', '2016-09-02T21:00:00Z',
freq='30T'))
.between_time('07:00','21:00')
.index.strftime('%Y-%m-%dT%H:%M:%SZ')
.tolist()
)
m = pd.DataFrame(m)
m['SensorData'] = np.arange(8)
Merge the above two dataframes:
df = l.merge(m, left_on = 0, right_on= 0,how='left')
df.loc[df['SensorData'].isna() == True,'SensorData'] = 0
Output
0 SensorData
0 2016-09-02T17:30:00Z 0.0
1 2016-09-02T17:45:00Z 0.0
2 2016-09-02T18:00:00Z 1.0
3 2016-09-02T18:15:00Z 0.0
4 2016-09-02T18:30:00Z 2.0
5 2016-09-02T18:45:00Z 0.0
6 2016-09-02T19:00:00Z 3.0
7 2016-09-02T19:15:00Z 0.0
8 2016-09-02T19:30:00Z 4.0
9 2016-09-02T19:45:00Z 0.0
10 2016-09-02T20:00:00Z 5.0
11 2016-09-02T20:15:00Z 0.0
12 2016-09-02T20:30:00Z 6.0
13 2016-09-02T20:45:00Z 0.0
14 2016-09-02T21:00:00Z 7.0
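As a small aside (not part of the original answer), the NaN replacement on SensorData can equivalently be written with fillna:
df['SensorData'] = df['SensorData'].fillna(0)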
I'm looking for a way to sum per column in a 2D array ("a" in the example below), starting from a row position defined per column in a separate 1D array ("ref" in the example below).
I have tried the following:
import numpy as np
a = np.arange(20).reshape(5, 4)
print(a) # representing an original large 2D array
ref = np.array([0, 2, 4, 1]) # reference array for defining start of sum
s = a.sum(axis=0)
print(s) # Works: sums all elements per column
s = a[2:].sum(axis=0)
print(s) # Works as well: sum from the third element till end per column
# This is what I look for: sum per column starting at element defined by ref[]
s = np.zeros(4).astype(int)         # makes an empty 1D array
for i in np.arange(4):              # for each column
    for j in np.arange(ref[i], 5):
        s[i] += a[j, i]             # sums all elements from ref till end (i.e. 5)
print(s)                            # This is the desired outcome

for i in np.arange(4):
    s = a[ref[i]:].sum(axis=0)
print(s)  # No good; same as a[ref[3]:].sum(axis=0), and here ref[3] = 1

s = np.zeros(4).astype(int)  # makes an empty 1D array
for i in np.arange(4):
    s[i] = np.sum(a[ref[i]:, i])
print(s)  # Yes; this is also the desired outcome
Is it possible to realize this without using a for loop?
Does numpy have functions for doing this in a single step?
s = a[ref:].sum(axis=0)
This would be nice, but is not working.
Thank you for your time!
A basic solution based on np.cumsum:
In [1]: a = np.arange(15).reshape(5, 3)
In [2]: res = np.array([0, 2, 3])
In [3]: b = np.cumsum(a, axis=0)
In [4]: b
Out[4]:
array([[ 0, 1, 2],
[ 3, 5, 7],
[ 9, 12, 15],
[18, 22, 26],
[30, 35, 40]])
In [5]: a
Out[5]:
array([[ 0, 1, 2],
[ 3, 4, 5],
[ 6, 7, 8],
[ 9, 10, 11],
[12, 13, 14]])
In [6]: b[res, np.arange(a.shape[1])]
Out[6]: array([ 0, 12, 26])
In [7]: b[-1, :] - b[res, np.arange(a.shape[1])]
Out[7]: array([30, 23, 14])
so it does not give us the result we want: we need to prepend a row of zeros to b:
In [13]: b = np.vstack([np.zeros((1, a.shape[1])), b])
In [14]: b
Out[14]:
array([[ 0., 0., 0.],
[ 0., 1., 2.],
[ 3., 5., 7.],
[ 9., 12., 15.],
[ 18., 22., 26.],
[ 30., 35., 40.]])
In [17]: b[-1, :] - b[res, np.arange(a.shape[1])]
Out[17]: array([ 30., 30., 25.])
which is, I believe, the desired output.
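Putting the pieces together, here is a minimal sketch of the whole idea as one function (column_sum_from is just an illustrative name), applied to the arrays from the question:
def column_sum_from(a, ref):
    # prepend a row of zeros so that row ref[i] of the cumulative sum
    # holds the total of the elements *above* the starting row in column i
    b = np.vstack([np.zeros((1, a.shape[1]), dtype=a.dtype), np.cumsum(a, axis=0)])
    return b[-1, :] - b[ref, np.arange(a.shape[1])]

a = np.arange(20).reshape(5, 4)
ref = np.array([0, 2, 4, 1])
column_sum_from(a, ref)
# array([40, 39, 18, 52])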
Let's say I have a NumPy array as follows (my original array is 50K x 8.5K in size; this is a sample):
array([[ 1. , 2. , 3. ],
[ 1. , 0.5, 2. ],
[ 2. , 3. , 1. ]])
Now what I want is, for each column, to keep only the top K values (let's take K as 2 here) and re-code the others to zero.
So output I am expecting is something like this:
array([[ 1., 2., 3.],
[ 1., 0., 2.],
[ 2., 3., 0.]])
So basically, if we sort each column's values in descending order, any value in that column that is not among the K largest values should be re-coded to zero.
I tried something like this, but it gives an error:
for x in range(e.shape[1]):
    e[:,x]=map(np.where(lambda x: x in e[:,x][::-1][:2], x, 0), e[:,x])
2
3 for x in range(e.shape[1]):
----> 4 e[:,x]=map(np.where(lambda x: x in e[:,x][::-1][:2], x, 0), e[:,x])
5
TypeError: 'numpy.ndarray' object is not callable
Currently I am also iterating over each column. Any solution that works fast would help, since I have around 50K rows and 8K columns, so iterating over each column and then mapping over each value in that column would be time-consuming.
Please advise.
With focus on performance for such large arrays, here's a vectorized approach to solve it -
K = 2 # Select top K values along each column
# Sort A, store the argsort for later usage
sidx = np.argsort(A,axis=0)
sA = A[sidx,np.arange(A.shape[1])]
# Perform differentiation along rows and look for non-zero differentiations
df = np.diff(sA,axis=0)!=0
# Perform cumulative summation along rows from bottom upwards.
# Thus, summations < K should give us a mask of valid ones that are to
# be kept per column. Use this mask to set rest as zeros in sorted array.
mask = (df[::-1].cumsum(0)<K)[::-1]
sA[:-1] *=mask
# Finally revert back to unsorted order by using sorted indices sidx
out = sA[sidx.argsort(0),np.arange(sA.shape[1])]
Please note that for more performance boost, np.argsort could be replaced by np.argpartition.
Sample input, output -
In [343]: A
Out[343]:
array([[106, 106, 102],
[105, 101, 104],
[106, 107, 101],
[107, 103, 106],
[106, 105, 108],
[106, 104, 105],
[107, 101, 101],
[105, 103, 102],
[104, 102, 106],
[104, 106, 101]])
In [344]: out
Out[344]:
array([[106, 106, 0],
[ 0, 0, 0],
[106, 107, 0],
[107, 0, 106],
[106, 0, 108],
[106, 0, 0],
[107, 0, 0],
[ 0, 0, 0],
[ 0, 0, 106],
[ 0, 106, 0]])
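Regarding the np.argpartition note above, here is a minimal sketch of what such a variant could look like. Note it is not a drop-in replacement for the mask-based code: it keeps exactly K entries per column, so duplicate values at the cut-off are handled differently from the duplicate-aware approach above.
K = 2
# argpartition only partially sorts, so the (n-K) smallest entries per column
# can be located without fully sorting each column
drop_idx = np.argpartition(A, -K, axis=0)[:-K]
out2 = A.copy()
# zero out everything except the K largest entries in each column
np.put_along_axis(out2, drop_idx, 0, axis=0)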
This should get you there:
def rwhere(a, b, p, k):
    if p >= len(b) or p >= k:
        return 0
    else:
        return np.where(a == b[p], b[p], rwhere(a, b, p + 1, k))

def codek(a, k):
    b = a.copy()
    b.sort(0)
    b = b[::-1]
    return rwhere(a, b, 0, k)
codek(a, 2)
array([[ 1., 2., 3.],
[ 1., 0., 2.],
[ 2., 3., 0.]])
OK, I just figured out what the problem in my code was: the np.where call should be inside the return value of the lambda function. The below works fine.
import copy

a = np.array([[1. , 2. , 3. ],
              [1. , 0.5, 2. ],
              [2. , 3. , 1. ]])
e = copy.deepcopy(a)
for y in range(e.shape[1]):
    e[:, y] = list(map(lambda x: np.where(x in np.sort(a[:, y])[::-1][:2], x, 0), e[:, y]))
e
array([[ 1.,  2.,  3.],
       [ 1.,  0.,  2.],
       [ 2.,  3.,  0.]])
Instead of 2 I can keep it as K and it should work fine for that too.
I have a 1-D function that takes a lot of time to compute over a big 2-D array of x values, so it is much easier to create an interpolating function using SciPy and then compute y with that, which is much faster. However, I cannot use the interpolation function on arrays that have more than one dimension.
Example:
# First, I create the interpolation function in the domain I want to work in
import numpy as np
import scipy as sp
import scipy.interpolate
x = np.arange(1, 100, 0.1)
f = np.exp(x)  # a complicated function
f_int = sp.interpolate.InterpolatedUnivariateSpline(x, f, k=2)
# Now, in the code I do that
x = [[13, ..., 1], [99, ..., 45], [33, ..., 98] ..., [15, ..., 65]]
y = f_int(x)
# Which I want that it returns y = [[f_int(13), ..., f_int(1)], ..., [f_int(15), ..., f_int(65)]]
But returns:
ValueError: object too deep for desired array
I know I could loop over all x members, but I don't know if it is a better option...
Thanks!
EDIT:
A function like this would also do the job:
def vector_op(function, values):
    orig_shape = values.shape
    values = np.reshape(values, values.size)
    return np.reshape(function(values), orig_shape)
I've tried np.vectorize, but it is too slow...
If f_int wants single dimensional data, you should flatten your input, feed it to the interpolator, then reconstruct your original shape:
>>> x = np.arange(1, 100, 0.1)
>>> f = 2 * x # a simple function to see the results are good
>>> f_int = scipy.interpolate.InterpolatedUnivariateSpline(x, f, k=2)
>>> x = np.arange(25).reshape(5, 5) + 1
>>> x
array([[ 1, 2, 3, 4, 5],
[ 6, 7, 8, 9, 10],
[11, 12, 13, 14, 15],
[16, 17, 18, 19, 20],
[21, 22, 23, 24, 25]])
>>> x_int = f_int(x.reshape(-1)).reshape(x.shape)
>>> x_int
array([[ 2., 4., 6., 8., 10.],
[ 12., 14., 16., 18., 20.],
[ 22., 24., 26., 28., 30.],
[ 32., 34., 36., 38., 40.],
[ 42., 44., 46., 48., 50.]])
x.reshape(-1) does the flattening, and the .reshape(x.shape) returns it to its original form.
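Equivalently, ravel can be used for the flattening step (it returns a view where possible, so it avoids a copy); this is just a stylistic variation:
x_int = f_int(x.ravel()).reshape(x.shape)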
I think you want to do a vectorized function in numpy:
# create some random test data
import numpy as np

test = np.random.random((100, 100))

# a normal python function that you want to apply
def myFunc(i):
    return np.exp(i)

# now vectorize the function so that it will work on numpy arrays
myVecFunc = np.vectorize(myFunc)
result = myVecFunc(test)
I would use a combination of a list comprehension and map (there might be a way to use two nested maps that I'm missing)
In [24]: x
Out[24]: [[1, 2, 3], [1, 2, 3], [1, 2, 3]]
In [25]: [map(lambda a: a*0.1, x_val) for x_val in x]
Out[25]:
[[0.1, 0.2, 0.30000000000000004],
[0.1, 0.2, 0.30000000000000004],
[0.1, 0.2, 0.30000000000000004]]
This is just for illustration purposes; replace lambda a: a*0.1 with your function, f_int.
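Concretely, with the interpolator from the question this could look roughly like the sketch below (assuming every row of x has the same length); note it still loops in Python per row, so the flatten-and-reshape answer above will generally be faster:
# apply the spline row by row; each row is 1-D, which f_int accepts
y = np.array([f_int(row) for row in x])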