I have a 1D numpy array and some offset/length values. I would like to extract from this array all entries that fall within [offset, offset+length), and use them to build a new 'reduced' array consisting only of the values picked out by the offset/length pairs.
For a single offset/length pair this is trivial with standard array slicing [offset:offset+length]. But how can I do this efficiently (i.e. without any loops) for many offset/length values?
Thanks,
Mark
You can build an index array with np.concatenate and use fancy indexing:
>>> import numpy as np
>>> a = np.arange(100)
>>> ind = np.concatenate((np.arange(5),np.arange(10,15),np.arange(20,30,2),np.array([8])))
>>> a[ind]
array([ 0, 1, 2, 3, 4, 10, 11, 12, 13, 14, 20, 22, 24, 26, 28, 8])
There is the naive method; just doing the slices:
>>> import numpy as np
>>> a = np.arange(100)
>>>
>>> offset_length = [(3,10),(50,3),(60,20),(95,1)]
>>>
>>> np.concatenate([a[offset:offset+length] for offset,length in offset_length])
array([ 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 50, 51, 52, 60, 61, 62, 63,
64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 95])
The following might be faster, but you would have to test/benchmark.
It works by constructing a list of the desired indices, which is a valid method of indexing a numpy array.
>>> indices = [offset + i for offset,length in offset_length for i in range(length)]
>>> a[indices]
array([ 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 50, 51, 52, 60, 61, 62, 63,
64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 95])
It's not clear whether this is actually faster than the naive method, but it might be if you have a lot of very short intervals.
(This last method is essentially the same as fraxel's solution, just using a different way of building the index list.)
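If even the index-building loop matters, the index construction itself can be pushed into numpy. Here is a sketch of a fully vectorized variant (my addition, not from either original answer): repeat each offset length times, then add a per-interval counter derived from a cumulative sum.
import numpy as np

a = np.arange(100)
offset_length = [(3,10),(50,3),(60,20),(95,1)]

offsets = np.array([o for o, _ in offset_length])
lengths = np.array([l for _, l in offset_length])
# position of each interval's first element in the output
starts = np.cumsum(lengths) - lengths
# repeat each offset `length` times and add 0,1,...,length-1 within each interval
indices = np.repeat(offsets, lengths) + np.arange(lengths.sum()) - np.repeat(starts, lengths)
reduced = a[indices]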
Performance testing
I've tested a few different cases: a few short intervals, a few long intervals, lots of short intervals. I used the following script:
import timeit

setup = 'import numpy as np; a = np.arange(1000); offset_length = %s'
for title, ol in [('few short', '[(3,10),(50,3),(60,10),(95,1)]'),
                  ('few long', '[(3,100),(200,200),(600,300)]'),
                  ('many short', '[(2*x,1) for x in range(400)]')]:
    print('**', title, '**')
    print('dbaupp 1st:', timeit.timeit('np.concatenate([a[offset:offset+length] for offset,length in offset_length])', setup % ol, number=10000))
    print('dbaupp 2nd:', timeit.timeit('a[[offset + i for offset,length in offset_length for i in range(length)]]', setup % ol, number=10000))
    print('    fraxel:', timeit.timeit('a[np.concatenate([np.arange(offset,offset+length) for offset,length in offset_length])]', setup % ol, number=10000))
This outputs:
** few short **
dbaupp 1st: 0.0474979877472
dbaupp 2nd: 0.190793991089
fraxel: 0.128381967545
** few long **
dbaupp 1st: 0.0416231155396
dbaupp 2nd: 1.58000087738
fraxel: 0.228138923645
** many short **
dbaupp 1st: 3.97210478783
dbaupp 2nd: 2.73584890366
fraxel: 7.34302687645
This suggests that my first method is the fastest when you have a few intervals (and it is significantly faster), and my second is the fastest when you have lots of intervals.
Related
Playing around with numpy:
import numpy as np
l = [39, 54, 72, 46, 89, 53, 96, 64, 2, 75]
nl = np.array(l.append(3))
# nl is array(None, dtype=object)
Now, if I inspect l, I get the list: [39, 54, 72, 46, 89, 53, 96, 64, 2, 75, 3]
My question is, why doesn't numpy create that list as an array?
If I do something like this:
nl = np.array(l.extend([45]))
I get the same thing. But if I concatenate without a method call, nl = np.array(l + [45]), it works.
What is causing this behaviour?
The append function always returns None, so you must do this in two separate lines of code:
import numpy as np
l = [39, 54, 72, 46, 89, 53, 96, 64, 2, 75]
l.append(3)
nl = np.array(l)
append and extend are in-place methods and return None.
print(l.append(3)) # None
print(l.extend([3])) # None
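As a side note (my addition, not part of the original answer): if you want a single expression that yields a new array, numpy's own append does not work in place and returns a fresh array instead.
import numpy as np

l = [39, 54, 72, 46, 89, 53, 96, 64, 2, 75]
# np.append returns a new array; the inputs are left untouched
nl = np.append(np.array(l), 3)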
Problem:
List of Multiples
Create a Python 3 function that takes two numbers (value, length) as arguments and returns a list of multiples of value until the size of the list reaches length.
Examples
list_of_multiples(value=7, length=5) ➞ [7, 14, 21, 28, 35]
list_of_multiples(value=12, length=10) ➞ [12, 24, 36, 48, 60, 72, 84, 96, 108, 120]
list_of_multiples(value=17, length=6) ➞ [17, 34, 51, 68, 85, 102]
def multiples(value, length):
    """
    value is the number to be multiplied;
    length is the number of multiples required.
    """
    out = []
    for i in range(1, length + 1):
        out.append(value * i)
    return out
Most Pythonic Way
def multiples(value, length):
return [*range(value, length*value+1, value)]
print(multiples(7, 5))
# [7, 14, 21, 28, 35]
print(multiples(12, 10))
# [12, 24, 36, 48, 60, 72, 84, 96, 108, 120]
print(multiples(17, 6))
# [17, 34, 51, 68, 85, 102]
Pythonic way:
def multiples(value, length):
return [value * i for i in range(1, length + 1)]
print(multiples(7, 5))
# [7, 14, 21, 28, 35]
print(multiples(12, 10))
# [12, 24, 36, 48, 60, 72, 84, 96, 108, 120]
print(multiples(17, 6))
# [17, 34, 51, 68, 85, 102]
def multiples(value, length):
    list_multiples = []
    i = 0
    while i < length:
        list_multiples.append(value*(i+1))
        i += 1
    return list_multiples
The easy, non-inline way would be:
def multiples(value, length):
    l = []
    for i in range(1, length+1):
        l.append(value*i)
    return l
The best answer for small values of length (< 100) is given by Nite Block.
However, once length becomes larger, numpy is significantly faster than Python loops:
numpy.arange(1, length+1) * value
With a length of 1000, the Python loop takes almost 4 times longer than numpy. See the code below:
import timeit
testcode_numpy = '''
import numpy
def multiples_numpy(value, length):
    return numpy.arange(1, length+1) * value
multiples_numpy(5, 1000)
'''
testcode = '''
def multiples(value, length):
    return [*range(value, length*value+1, value)]
multiples(5, 1000)
'''
print(timeit.timeit(testcode_numpy))
print(timeit.timeit(testcode))
# Result:
# with numpy:    2.4 s
# without numpy: 9.7 s
I have a timeseries with various downcasts. My question is: how do I slice a pandas DataFrame (or, to keep it simple here, the array) to get the data and the indexes of the descending parts of the timeseries?
import matplotlib.pyplot as plt
import numpy as np
b = np.asarray([ 1.3068586 , 1.59882279, 2.11291473, 2.64699527,
3.23948166, 3.81979878, 4.37630243, 4.97740025,
5.59247254, 6.18671493, 6.77414586, 7.43078595,
8.02243495, 8.59612224, 9.22302662, 9.83263379,
10.43125902, 11.0956864 , 11.61107838, 12.09616684,
12.63973254, 12.49437955, 11.6433792 , 10.61083269,
9.50534291, 8.47418827, 7.40571742, 6.56611512,
5.66963658, 4.89748187, 4.10543794, 3.44828054,
2.76866318, 2.24306623, 1.68034463, 1.26568186,
1.44548443, 2.01225076, 2.60715524, 3.21968562,
3.8622007 , 4.57035958, 5.14021305, 5.77879484,
6.42776897, 7.09397923, 7.71722028, 8.30860725,
8.96652218, 9.66157193, 10.23469208, 10.79889453,
10.5788411 , 9.38270646, 7.82070643, 6.74893389,
5.68200335, 4.73429009, 3.78358222, 3.05924946,
2.30428171, 1.78052369, 1.27897065, 1.16840532,
1.59452726, 2.13085096, 2.70989933, 3.3396291 ,
3.97318058, 4.62429262, 5.23997774, 5.91232803,
6.5906609 , 7.21099657, 7.82936331, 8.49636247,
9.15634983, 9.76450244, 10.39680729, 11.04659976,
11.69287237, 12.35692643, 12.99957563, 13.66228386,
14.31806385, 14.91871927, 15.57212978, 16.22288287,
16.84697357, 17.50502002, 18.15907842, 18.83068151,
19.50945548, 20.18020639, 20.84441358, 21.52792846,
22.17933087, 22.84614545, 23.51212887, 24.18308399,
24.8552263 , 25.51709528, 26.18724379, 26.84531493,
27.50690265, 28.16610365, 28.83394822, 29.49621179,
30.15118676, 30.8019521 , 31.46714114, 32.1213546 ,
32.79366952, 33.45233007, 34.12158193, 34.77502197,
35.4532211 , 36.11018053, 36.76540453, 37.41746323])
plt.plot(-b)
plt.show()
You can just change the negative diffs to NaN and then plot:
import pandas as pd

bb = pd.Series(-b)
bb[bb.diff().ge(0)] = np.nan
bb.plot()
To get the indexes of descending values, use:
bb.index[bb.diff().lt(0)]
Int64Index([ 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13,
14, 15, 16, 17, 18, 19, 20, 37, 38, 39, 40, 41, 42,
43, 44, 45, 46, 47, 48, 49, 50, 51, 65, 66, 67, 68,
69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80, 81,
82, 83, 84, 85, 86, 87, 88, 89, 90, 91, 92, 93, 94,
95, 96, 97, 98, 99, 100, 101, 102, 103, 104, 105, 106, 107,
108, 109, 110, 111, 112, 113, 114, 115, 116, 117, 118, 119],
dtype='int64')
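If you also want the descending stretches as separate segments rather than one flat index list, a small sketch (my addition, not part of the original answer) is to split the index array wherever it jumps by more than one:
idx = bb.index[bb.diff().lt(0)].to_numpy()
# break into contiguous runs wherever consecutive indexes differ by more than 1
runs = np.split(idx, np.where(np.diff(idx) > 1)[0] + 1)
segments = [bb[run] for run in runs]  # values of each descending stretch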
Create a second DataFrame shifted by one index, then subtract the two term by term; keeping only the rows with a negative difference gives you what you want:
import pandas as pd

df = pd.DataFrame(b)
df = pd.concat([df.shift(1), df], axis=1)
df.columns = ['t-1', 't']
df = df.drop(df.index[0])
df['diff'] = df['t'] - df['t-1']
res = df[df['diff'] < 0]
There is also an easy numpy-only solution (the question is tagged pandas, but the code uses only numpy) using np.where. You want the points where the plotted graph (-b) is descending, which means the underlying data b is ascending.
# the indices where the data is ascending.
ix, = np.where(np.diff(b) > 0)
# the values
c = b[ix]
Note that this will give you the first value in each ascending pair of consecutive values, while the pandas-based solution gives the second one. To get the same indices just add 1 to ix.
import pandas as pd

s = pd.Series(b)
assert np.all(s[s.diff() > 0].index == ix + 1)
assert np.all(s[s.diff() > 0] == b[ix + 1])
I have a 2D array containing the residual sum of squares of a given fit (unimportant here):
RSS[i,j] = np.sum((spectrum_theo - sp_exp_int) ** 2)
I would like to find the matrix element with the minimum value AND its position (i,j) in the matrix. Finding the minimum element is OK:
RSS_min = RSS[RSS != 0].min()
but for the index, I've tried:
ij_min = np.where(RSS == RSS_min)
which gives me:
ij_min = (array([3]), array([20]))
I would like to obtain instead:
ij_min = (3,20)
If I try:
ij_min = RSS.argmin()
I obtain:
ij_min = 0,
which is a wrong result.
Is there a function, in SciPy or elsewhere, that can do this? I've searched the web, but I've only found answers dealing with 1D arrays, not 2D or N-D ones.
Thanks!
The easiest fix based on what you have right now would just be to extract the elements from the array as a final step:
# ij_min = (array([3]), array([20]))
ij_min = np.where(RSS == RSS_min)
ij_min = tuple([i.item() for i in ij_min])
Does this work for you?
import numpy as np
array = np.random.rand(1000).reshape(10, 10, 10)
print(np.array(np.where(array == array.min())).flatten())
In the case of multiple minimums, you could try something like:
import numpy as np
array = np.array([[1,1,2,3],[1,1,4,5]])
print(list(zip(*np.where(array == array.min()))))
You can combine argmin with unravel_index.
For example, here's an array RSS:
In [123]: np.random.seed(123456)
In [124]: RSS = np.random.randint(0, 99, size=(5, 8))
In [125]: RSS
Out[125]:
array([[65, 49, 56, 43, 43, 91, 32, 87],
[36, 8, 74, 10, 12, 75, 20, 47],
[50, 86, 34, 14, 70, 42, 66, 47],
[68, 94, 45, 87, 84, 84, 45, 69],
[87, 36, 75, 35, 93, 39, 16, 60]])
Use argmin (which returns an integer that is the index in the flattened array), and then pass that to unravel_index along with the shape of RSS to convert the index of the flattened array into the indices of the 2D array:
In [126]: ij_min = np.unravel_index(RSS.argmin(), RSS.shape)
In [127]: ij_min
Out[127]: (1, 1)
ij_min itself can be used as an index into RSS to get the minimum value:
In [128]: RSS_min = RSS[ij_min]
In [129]: RSS_min
Out[129]: 8
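One caveat (my note, not part of the original answer): argmin only reports the first minimum in flattened order. If ties are possible, np.argwhere returns every position at once:
# each row of the result is one (i, j) position holding the minimum
all_min_positions = np.argwhere(RSS == RSS.min())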
Lately I've been doing a lot of processing on 8x8 blocks of image-data.
The standard approach has been to use nested for-loops to extract the blocks, e.g.
for y in range(0, height, 8):
    for x in range(0, width, 8):
        d = image_data[y:y+8, x:x+8]
        # further processing on the 8x8 block
I can't help but wonder whether there is a way to vectorize this operation, or another approach using numpy/scipy that I could use instead? An iterator of some kind?
A MWE [1]:
#!/usr/bin/env python
import sys

import numpy as np
from scipy.fftpack import dct, idct
import scipy.misc
import matplotlib.pyplot as plt

def dctdemo(coeffs=1):
    unzig = np.array([
         0,  1,  8, 16,  9,  2,  3, 10,
        17, 24, 32, 25, 18, 11,  4,  5,
        12, 19, 26, 33, 40, 48, 41, 34,
        27, 20, 13,  6,  7, 14, 21, 28,
        35, 42, 49, 56, 57, 50, 43, 36,
        29, 22, 15, 23, 30, 37, 44, 51,
        58, 59, 52, 45, 38, 31, 39, 46,
        53, 60, 61, 54, 47, 55, 62, 63])

    lena = scipy.misc.lena()
    height, width = lena.shape

    # reconstructed
    rec = np.zeros(lena.shape, dtype=np.int64)

    # Can this part be vectorized?
    for y in range(0, height, 8):
        for x in range(0, width, 8):
            d = lena[y:y+8, x:x+8].astype(float)
            D = dct(dct(d.T, norm='ortho').T, norm='ortho').reshape(64)
            Q = np.zeros(64, dtype=float)
            Q[unzig[:coeffs]] = D[unzig[:coeffs]]
            Q = Q.reshape([8, 8])
            q = np.round(idct(idct(Q.T, norm='ortho').T, norm='ortho'))
            rec[y:y+8, x:x+8] = q.astype(np.int64)

    plt.imshow(rec, cmap='gray')
    plt.show()

if __name__ == '__main__':
    try:
        c = int(sys.argv[1])
    except ValueError:
        sys.exit()
    else:
        if 1 <= c <= 64:
            dctdemo(c)
Footnotes:
[1] Actual application: https://github.com/figgis/dctdemo
There's a function view_as_windows for this in Scikit Image
http://scikit-image.org/docs/dev/api/skimage.util.html#view-as-windows
Unfortunately I will have to finish this answer another time, but you can grab the windows in a form that you can pass to dct with:
from skimage.util import view_as_windows
# your code...
d = view_as_windows(lena.astype(float), (8, 8), step=8).reshape(-1, 8, 8)
# 2D DCT of each block: transform the last two axes, not axis 0
D = dct(dct(d, axis=-1, norm='ortho'), axis=-2, norm='ortho')
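Since the answer stops there, here is a hedged sketch (my addition) of how the rest of the loop body from the MWE could be vectorized over all blocks at once, assuming non-overlapping windows (step=8) and the same unzig ordering as above:
import numpy as np
from scipy.fftpack import dct, idct
from skimage.util import view_as_windows

def dct2_blocks(image, coeffs, unzig):
    h, w = image.shape
    # non-overlapping 8x8 windows: shape (h//8, w//8, 8, 8)
    d = view_as_windows(image.astype(float), (8, 8), step=8)
    # 2D DCT of every block, one call per axis
    D = dct(dct(d, axis=-1, norm='ortho'), axis=-2, norm='ortho')
    # keep only the first `coeffs` zig-zag coefficients of each block
    keep = unzig[:coeffs]
    Q = np.zeros((h // 8, w // 8, 64))
    Q[..., keep] = D.reshape(h // 8, w // 8, 64)[..., keep]
    Q = Q.reshape(h // 8, w // 8, 8, 8)
    # inverse 2D DCT and reassembly into an (h, w) image
    q = np.round(idct(idct(Q, axis=-2, norm='ortho'), axis=-1, norm='ortho'))
    return q.swapaxes(1, 2).reshape(h, w).astype(np.int64)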
There is a function called extract_patches in the scikit-learn feature extraction routines. You need to specify a patch_size and an extraction_step. The result will be a view on your image as patches, which may overlap. The resulting array is 4D: the first two indices address the patch, and the last two index the pixels within the patch. Try this:
from sklearn.feature_extraction.image import extract_patches
patches = extract_patches(image_data, patch_size=(8, 8), extraction_step=(4, 4))
This gives (8, 8) size patches that overlap by half.
Note that up until now this uses no extra memory, because it is implemented using stride tricks. You can force a copy by reshaping
patches = patches.reshape(-1, 8, 8)
which will basically yield a list of patches.
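If you only need non-overlapping blocks and want to avoid the extra dependency, a plain-numpy sketch (my addition; it assumes the image dimensions are multiples of 8) achieves the same with reshape and swapaxes:
import numpy as np

h, w = image_data.shape
# split rows and columns into groups of 8, then bring the two block
# axes together: the result has shape (h//8 * w//8, 8, 8)
blocks = image_data.reshape(h // 8, 8, w // 8, 8).swapaxes(1, 2).reshape(-1, 8, 8)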