I have a script that accumulates (counts) bytes contained in two files. Bytes are C-like unsigned char integer values between 0 and 255.
The goal of this accumulator script is to count the joint counts or frequencies of bytes in those two files. Possibly to extend this out to multiple files/dimensions.
The two files are the same size, but they are very large, on the order of 6 TB or so.
I am using numpy.uint64 values, as I am getting overflow problems using Python's int type.
I have a 1D accumulator array that is 256**2 in length (one slot for each pair of byte values), in order to store joint counts.
I calculate the offset from a row-by-column -to- array offset calculation, so as to increment the joint frequency at the right index. I walk through both files in chunks of bytes (n_bytes), unpack them, and increment the frequency counter.
Here's a rough sketch of the code:
import numpy
import ctypes
import struct
buckets_per_signal_type = 2**(ctypes.c_ubyte(1).value * 8)
total_buckets = buckets_per_signal_type**2
buckets = numpy.zeros((total_buckets,), dtype=numpy.uint64)
# open file handles to two files (omitted for brevity...)
# buffer size that is known ahead of time to be a divisible
# unit of the original files
# (for example, here, reading in 2.4e6 bytes per loop iteration)
n_bytes = 2400000
total_bytes = 0L
# format used to unpack bytes
struct_format = "=%dB" % (n_bytes)
while True:
    # read in n_bytes chunk from each file
    first_file_bytes = first_file_handle.read(n_bytes)
    second_file_bytes = second_file_handle.read(n_bytes)
    # break if both file handles have nothing left to read
    if len(first_file_bytes) == 0 and len(second_file_bytes) == 0:
        break
    # unpack actual bytes
    first_bytes_unpacked = struct.unpack(struct_format, first_file_bytes)
    second_bytes_unpacked = struct.unpack(struct_format, second_file_bytes)
    for index in range(0, n_bytes):
        first_byte = first_bytes_unpacked[index]
        second_byte = second_bytes_unpacked[index]
        offset = first_byte * buckets_per_signal_type + second_byte
        buckets[offset] += 1
    total_bytes += n_bytes
    # repeat until both file handles are both EOF
# print out joint frequency (omitted)
Compared with the version where I used int, this is incredibly slow, more than an order of magnitude slower. The original job completed (incorrectly, due to overflow) in about 8 hours, and this numpy-based version had to be quit early, as it appeared it would take about 12-14 days to finish.
Either numpy is incredibly slow at this basic task, or I'm not doing an accumulator with numpy in a way that is Python-like. I suspect the latter, which is why I'm asking SO for help.
I read about numpy.add.at, but the unpacked byte arrays I would add to the buckets array do not have offset values that translate naturally to the "shape" of the buckets array.
Is there a way to store and increment an array of (long) integers, that does not overflow, and which is reasonably performant?
I could rewrite this in C, I guess, but am hoping there is something in numpy I am overlooking that will solve this quickly. Thanks for your advice.
Update
I had older versions of numpy and scipy that did not support numpy.add.at. So that was another issue to look into.
I'll try the following and see how that goes:
first_byte_arr = np.array(first_bytes_unpacked)
second_byte_arr = np.array(second_bytes_unpacked)
offsets = first_byte_arr * buckets_per_signal_type + second_byte_arr
np.add.at(buckets, offsets, 1L)
Hopefully it runs a little faster!
Update II
Using np.add.at and np.array, this job will take roughly 12 days to complete. I'm going to give up on numpy for now and go back to reading raw bytes with C, where the runtimes are a bit more reasonable. Thanks all for your advice!
Without trying to follow all the file read and struct code, it looks like you are adding 1 to an assortment of slots in the buckets array. That part shouldn't be taking days.
But to get an idea of how the dtype of buckets affects that step, I'll test adding 1 to a random assortment of indices.
In [57]: idx = np.random.randint(0,255**2,10000)
In [58]: %%timeit buckets = np.zeros(255**2, dtype=np.int64)
...: for i in idx:
...:     buckets[i] += 1
...:
9.38 ms ± 39.6 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
In [59]: %%timeit buckets = np.zeros(255**2, dtype=np.uint64)
...: for i in idx:
...:     buckets[i] += 1
...:
71.7 ms ± 2.35 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
uint64 is about 8x slower.
If there weren't duplicates, we could just do buckets[idx] += 1. But allowing for duplicates we have to use add.at:
In [60]: %%timeit buckets = np.zeros(255**2, dtype=np.int64)
...: np.add.at(buckets, idx, 1)
...:
1.6 ms ± 348 ns per loop (mean ± std. dev. of 7 runs, 1000 loops each)
In [61]: %%timeit buckets = np.zeros(255**2, dtype=np.uint64)
...: np.add.at(buckets, idx, 1)
...:
1.62 ms ± 15.5 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
Interesting that dtype uint64 does not affect the timing in this case.
You mention in comments that you tried a list accumulator. I assume like this:
In [62]: %%timeit buckets = [0]*(255**2)
...: for i in idx:
...:     buckets[i] += 1
...:
3.59 ms ± 44.2 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
That's faster than the iterative version of the array. In general iteration on arrays is slower than on lists. It's the 'whole-array' operations that are faster, such as add.at.
To verify that add.at is the correct substitute for iteration, compare
In [63]: buckets0 = np.zeros(255**2, dtype=np.int64)
In [64]: for i in idx: buckets0[i] += 1
In [66]: buckets01 = np.zeros(255**2, dtype=np.int64)
In [67]: np.add.at(buckets01, idx, 1)
In [68]: np.allclose(buckets0, buckets01)
Out[68]: True
In [69]: buckets02 = np.zeros(255**2, dtype=np.int64)
In [70]: buckets02[idx] += 1
In [71]: np.allclose(buckets0, buckets02)
Out[71]: False
In [75]: bucketslist = [0]*(255**2)
In [76]: for i in idx: bucketslist[i] += 1
In [77]: np.allclose(buckets0, bucketslist)
Out[77]: True
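The difference between those checks comes down to duplicate indices. A minimal sketch, with made-up index values chosen only for illustration, of how buffered fancy-index assignment counts a repeated index once while np.add.at counts every occurrence:

```python
import numpy as np

# hypothetical indices for illustration; index 1 appears twice on purpose
idx = np.array([1, 1, 2])

fancy = np.zeros(4, dtype=np.int64)
fancy[idx] += 1                 # buffered: the repeated index 1 is incremented once

unbuffered = np.zeros(4, dtype=np.int64)
np.add.at(unbuffered, idx, 1)   # unbuffered: every occurrence is applied

print(fancy.tolist())       # [0, 1, 1, 0]
print(unbuffered.tolist())  # [0, 2, 1, 0]
```

With 10000 random indices drawn from about 65000 slots, repeats are very likely, which is why the buckets[idx] += 1 result above does not match the loop.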
numpy has its own file I/O method in fromfile which you'd probably be better off using if you want the output in a numpy array. (See this question)
Probably better to use the array structure given by numpy to make your buckets a 2d array:
buckets_per_signal_type = 2**(ctypes.c_ubyte(1).value * 8)
buckets = numpy.zeros((buckets_per_signal_type, buckets_per_signal_type), dtype=numpy.uint64)
And then just use np.add.at to increment the bins
# define record_dtype to match your data (np.uint8 for raw bytes)
while True:
    data_1 = np.fromfile(first_file_handle, dtype=record_dtype, count=n_bytes)
    data_2 = np.fromfile(second_file_handle, dtype=record_dtype, count=n_bytes)
    s = min(data_1.size, data_2.size)
    if s == 0:
        break
    # index with a tuple so that each array addresses one axis of buckets
    np.add.at(buckets, (data_1[:s], data_2[:s]), 1)
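As a possible alternative to np.add.at, the joint counts for one chunk can also be obtained with np.bincount on the combined offsets. The following is only a sketch under stated assumptions: joint_counts_chunk is a hypothetical helper name, and both chunks are assumed to be bytes objects of equal length, as returned by file.read(). It has not been timed against the 6 TB workload from the question.

```python
import numpy as np

N_VALUES = 256  # number of distinct byte values

def joint_counts_chunk(first_chunk, second_chunk):
    """Return a (256, 256) array of joint byte counts for two equal-length chunks.

    Hypothetical helper for illustration; not part of the original script.
    """
    # widen before multiplying, otherwise uint8 arithmetic would wrap around
    a = np.frombuffer(first_chunk, dtype=np.uint8).astype(np.intp)
    b = np.frombuffer(second_chunk, dtype=np.uint8).astype(np.intp)
    # same row-by-column offset as in the question
    offsets = a * N_VALUES + b
    counts = np.bincount(offsets, minlength=N_VALUES * N_VALUES)
    return counts.reshape(N_VALUES, N_VALUES)

# accumulate per-chunk results into a running 64-bit total
total = np.zeros((N_VALUES, N_VALUES), dtype=np.int64)
total += joint_counts_chunk(b"\x00\x01\x01", b"\x02\x03\x03")
print(total[0, 2], total[1, 3])  # 1 2
```

A signed 64-bit total can hold values up to about 9.2e18, far above the roughly 6e12 byte pairs in the files described, so int64 should be sufficient here.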
Related
I have an array, x, which I want to make monotonic. Specifically, I want to do
y = x.copy()
for i in range(1, len(x)):
    y[i] = np.max(x[:i])
This is extremely slow for large arrays, but it feels like there should be a more efficient way of doing this. How can this operation be sped up?
The OP implementation is very inefficient because it does not use the information acquired on the previous iteration, resulting in O(n²) complexity.
def max_acc_OP(arr):
    result = np.empty_like(arr)
    for i in range(len(arr)):
        result[i] = np.max(arr[:i + 1])
    return result
Note that I fixed the OP code (which was otherwise throwing a ValueError: zero-size array to reduction operation maximum which has no identity) so that it takes the largest value among those up to and including position i.
It is easy to adapt that so that values at position i are excluded, but it leaves the first value of the result undefined, and it would never use the last value of the input. The first value of the result can be taken to be equal to the first value of the input, e.g.:
def max_acc2_OP(arr):
    result = np.empty_like(arr)
    result[0] = arr[0]  # uses first value of input
    # stop at len(arr): result has no index len(arr), and the last input value is never used
    for i in range(1, len(arr)):
        result[i] = np.max(arr[:i])
    return result
It is equally easy to have similar adaptations for the code below, and I do not think it is particularly relevant to cover both cases of the value at position i included and excluded. Henceforth, only the "included" case is covered.
Back to the efficiency of the solution: if you keep track of the current maximum and use that to fill your output array, instead of re-computing the maximum of all values up to i at each iteration, you can easily get to O(n) complexity:
def max_acc(arr):
    result = np.empty_like(arr)
    curr_max = arr[0]
    for i, x in enumerate(arr):
        if x > curr_max:
            curr_max = x
        result[i] = curr_max
    return result
However, this is still relatively slow because of the explicit looping.
Luckily, one can either rewrite this in vectorized form combining np.fmax() (or np.maximum() -- depending on how you need NaNs to be handled) and np.ufunc.accumulate():
np.fmax.accumulate()
# or
np.maximum.accumulate()
or, accelerating the solution above with Numba:
max_acc_nb = nb.njit(max_acc)
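To illustrate the NaN-handling difference between the two ufuncs mentioned above, here is a small sketch on a made-up float array: np.maximum propagates a NaN once it has been seen, whereas np.fmax skips it when the other operand is a number.

```python
import numpy as np

# made-up input with one NaN, for illustration only
arr = np.array([1.0, np.nan, 2.0, 0.5])

propagating = np.maximum.accumulate(arr)  # 1, nan, nan, nan
ignoring = np.fmax.accumulate(arr)        # 1, 1, 2, 2

print(propagating)
print(ignoring)
```

For integer inputs, such as the ones used in the timings, the two give the same result.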
Some timings on relatively large inputs are provided below:
n = 10000
arr = np.random.randint(0, n, n)
%timeit -n 4 -r 4 max_acc_OP(arr)
# 97.5 ms ± 14.2 ms per loop (mean ± std. dev. of 4 runs, 4 loops each)
%timeit -n 4 -r 4 np.fmax.accumulate(arr)
# 112 µs ± 134 µs per loop (mean ± std. dev. of 4 runs, 4 loops each)
%timeit -n 4 -r 4 np.maximum.accumulate(arr)
# 88.4 µs ± 107 µs per loop (mean ± std. dev. of 4 runs, 4 loops each)
%timeit -n 4 -r 4 max_acc(arr)
# 2.32 ms ± 146 µs per loop (mean ± std. dev. of 4 runs, 4 loops each)
%timeit -n 4 -r 4 max_acc_nb(arr)
# 9.11 µs ± 3.01 µs per loop (mean ± std. dev. of 4 runs, 4 loops each)
indicating that max_acc() is already much faster than max_acc_OP(), but np.maximum.accumulate() / np.fmax.accumulate() is even faster, and max_acc_nb() comes out as the fastest. As always, it is important to take these kinds of numbers with a grain of salt.
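For completeness, the "excluded" variant that the question's loop computes can be built from the same accumulate call by shifting the result one position to the right. A sketch, assuming a non-empty 1-D input; max_acc_excl is a hypothetical name introduced here:

```python
import numpy as np

def max_acc_excl(x):
    """Running maximum of x[:i] for i >= 1; position 0 keeps x[0].

    Hypothetical helper; assumes x is a non-empty 1-D array.
    """
    y = np.empty_like(x)
    y[0] = x[0]
    # the inclusive running maximum, shifted by one position
    y[1:] = np.maximum.accumulate(x)[:-1]
    return y

x = np.array([3, 1, 4, 1, 5])
print(np.maximum.accumulate(x).tolist())  # [3, 3, 4, 4, 5]
print(max_acc_excl(x).tolist())           # [3, 3, 3, 4, 4]
```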
I think it will work faster to just keep track of the maximum rather than calculating it each time for each sub-array
y = x.copy()
_max = y[0]
for i in range(1, len(x)):
    y[i] = _max
    _max = max(x[i], _max)
You can use a list comprehension for this, but the slice must never be empty, so the upper bound has to start at 1 rather than 0. If you want the loop itself to start from 0, you can write it like this:
y=[np.max(x[:i+1]) for i in range(len(x))]
or like that
y=[np.max(x[:i]) for i in range(1,len(x)+1)]
I am trying to order the zeroes and ones in arrangement of the order. The expected output is what I am trying to get to. Without using a list comprehension preferably.
import numpy as np
order = np.array([0,1,0,1,0])
zeroes= np.array([10,55, 30])
ones = np.array([3,8])
Expected Output
[10, 3, 55, 8, 30]
How about this (no Python loops: 750x faster than a list comprehension, when tested on 200k elements):
# note: updated version: faster and more robust to faulty input
def altcat(zeroes, ones, order):
    i0 = np.nonzero(order == 0)[0][:len(zeroes)]
    i1 = np.nonzero(order == 1)[0][:len(ones)]
    z = np.zeros_like(order, dtype=zeroes.dtype)
    z[i0] = zeroes[:len(i0)]
    z[i1] = ones[:len(i1)]
    return z
On your example:
>>> altcat(zeroes=np.array([10,55, 30]), ones=np.array([3,8]),
... order=np.array([0,1,0,1,0]))
array([10, 3, 55, 8, 30])
Speed
# set up
n = 200_000
np.random.seed(0)
order = np.random.randint(0, 2, size=n)
n1 = order.sum()
n0 = n - n1
ones = np.random.randint(100, size=n1)
zeroes = np.random.randint(100, size=n0)
# for comparison, a method proposed elsewhere, based on lists
def altcat_list(zeroes, ones, order):
    zeroes = list(zeroes)
    ones = list(ones)
    return [zeroes.pop(0) if i == 0 else ones.pop(0) for i in order]
Test:
a = %timeit -o altcat(zeroes, ones, order)
# 2.38 ms ± 573 ns per loop (mean ± std. dev. of 7 runs, 100 loops each)
b = %timeit -o altcat_list(zeroes, ones, order)
# 1.84 s ± 1.2 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
b.average / a.average
# 773.59
Note: I initially tried with n = 1_000_000, but while altcat does that in 12.4ms, the list-based version would take forever and I had to stop it.
It seems that the list-based method is worse than O(n) (100K: 0.4s; 200K: 1.84s; 400K: 10.4s).
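That super-linear growth is consistent with list.pop(0) having to shift all remaining elements on every call, which makes the whole comprehension quadratic. One way to keep the same pop-from-the-front style without that cost is collections.deque, whose popleft() runs in constant time. A sketch (altcat_deque is a hypothetical name, and it was not part of the timings above):

```python
from collections import deque

def altcat_deque(zeroes, ones, order):
    """Interleave zeroes and ones following order, popping from the left in O(1)."""
    queues = (deque(zeroes), deque(ones))
    return [queues[i].popleft() for i in order]

print(altcat_deque([10, 55, 30], [3, 8], [0, 1, 0, 1, 0]))  # [10, 3, 55, 8, 30]
```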
Addendum
If you really want to do it with a list comprehension and not in pure numpy, then at least consider this:
def altcat_list_mod(zeroes, ones, order):
    it = [iter(zeroes), iter(ones)]
    return [next(it[i]) for i in order]
That's faster than altcat_list(), but still almost 25x slower than altcat():
# on 200k elements
c = %timeit -o altcat_list_mod(zeroes, ones, order)
# 60 ms ± 24.3 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
c.average / a.average
# 24.93
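When the input is known to be consistent (exactly as many zeroes as there are 0 entries in order, and likewise for ones), the same idea can be written with boolean masks. This is only a sketch: altcat_mask is a hypothetical name, and unlike altcat() it raises a ValueError on mismatched lengths instead of truncating.

```python
import numpy as np

def altcat_mask(zeroes, ones, order):
    """Place zeroes where order == 0 and ones where order == 1.

    Hypothetical variant; assumes the counts in order match the input lengths.
    """
    out = np.empty(order.shape, dtype=np.result_type(zeroes, ones))
    out[order == 0] = zeroes
    out[order == 1] = ones
    return out

result = altcat_mask(np.array([10, 55, 30]), np.array([3, 8]),
                     np.array([0, 1, 0, 1, 0]))
print(result.tolist())  # [10, 3, 55, 8, 30]
```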
My problem is that I have an ndarray of shape (N, M, 3) and I am trying to check each element in the array using a low-level approach. Currently I am doing something like:
for i in range(N):
    for j in range(M):
        if ndarr[i][j][2] == 3:
            ndarr[i][j][0] = var1
Most of the time the ndarray I need to process is very large, usually around 1000x1000.
The same idea runs in C++ within a couple of milliseconds; in Python it takes around 30 seconds at best.
I would really appreciate it if someone could explain to me, or point me towards reading material on, how to efficiently iterate through an ndarray.
There is no way of doing that efficiently.
NumPy is a small Python wrapper around C code/datatypes. So an ndarray is actually a multidimensional C array. That means the memory address of the array is the address of the first element of the array. All other elements are stored consecutively in memory.
What your Python for loop does is grab each element of the array and temporarily save it somewhere else (as a Python data structure) before stuffing it back into the C array. As I have said, there is no way of doing that efficiently with a Python loop.
What you could do is use Numba's @jit to speed up the for loop, or look for a NumPy routine that can iterate over the array.
You can use logical indexing to do this more efficiently. It might be interesting to see how it compares with your C++ implementation.
import numpy as np
a = np.random.randn(2, 4, 3)
print(a)
idx = a[:, :, 2] > 0
a[idx, 0] = 9
print(a)
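Applied to the condition from the question (third component equal to 3, then overwrite the first component), the same logical-indexing idea might look like the sketch below. The function name and the sample values are made up for illustration.

```python
import numpy as np

def set_where_third_is_3(ndarr, var1):
    """In-place: where ndarr[i, j, 2] == 3, set ndarr[i, j, 0] = var1."""
    mask = ndarr[:, :, 2] == 3   # boolean mask of shape (N, M)
    ndarr[mask, 0] = var1        # write only the first component of matching cells
    return ndarr

# tiny made-up array of shape (2, 2, 3)
ndarr = np.array([[[0, 0, 3], [1, 1, 1]],
                  [[2, 2, 2], [5, 5, 3]]])
set_where_third_is_3(ndarr, 9)
print(ndarr[:, :, 0].tolist())  # [[9, 1], [2, 9]]
```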
In Numpy you have to use vectorized-commands (usually calling a C or Cython-function) to achieve good performance. As an alternative you can use Numba or Cython.
Two possible Implementations
import numba as nb
import numpy as np
def calc_np(ndarr,var1):
    # same condition as the loop: third component == 3, then set the first component
    ndarr[ndarr[:,:,2]==3, 0] = var1
    return ndarr
@nb.njit(parallel=True,cache=True)
def calc_nb(ndarr,var1):
    for i in nb.prange(ndarr.shape[0]):
        for j in range(ndarr.shape[1]):
            if ndarr[i,j,2] == 3:
                ndarr[i,j,0] = var1
    return ndarr
Timings
ndarr=np.random.randint(low=0,high=3,size=(1000,1000,3))
%timeit calc_np(ndarr,2)
#780 µs ± 6.78 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
#first call takes longer due to compilation overhead
res=calc_nb(ndarr,2)
%timeit calc_nb(ndarr,2)
#55.2 µs ± 160 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
Edit
You also use a wrong indexing method. ndarr[i] gives a 2d view on the original 3d-array, the next indexing operation [j] gives the next view on the previous view. This also has quite an impact on performance.
def calc_1(ndarr,var1):
    for i in range(ndarr.shape[0]):
        for j in range(ndarr.shape[1]):
            if ndarr[i][j][2] == 3:
                ndarr[i][j][0] = var1
    return ndarr
def calc_2(ndarr,var1):
    for i in range(ndarr.shape[0]):
        for j in range(ndarr.shape[1]):
            if ndarr[i,j,2] == 3:
                ndarr[i,j,0] = var1
    return ndarr
%timeit calc_1(ndarr,2)
#549 ms ± 11.1 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit calc_2(ndarr,2)
#321 ms ± 2.24 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
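A small sketch of what the two indexing styles do, using a made-up array: chained indexing builds an intermediate view at every step, while tuple indexing resolves the element in a single call.

```python
import numpy as np

a = np.arange(24).reshape(2, 3, 4)

row = a[1]           # intermediate 2-D view, shape (3, 4)
cell = row[2]        # another intermediate 1-D view, shape (4,)
chained = cell[3]    # finally the scalar

direct = a[1, 2, 3]  # one indexing call, no intermediate arrays

print(np.shares_memory(row, a))  # True: row is a view, not a copy
print(chained == direct)         # True: same element either way
```

Both styles return the same value; the difference is only the extra Python-level objects created along the way, which add up inside a double loop.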
I have a problem with Numba typing - I read the manual, but eventually hit a brick wall.
The function in question is a part of a bigger project - though it needs to run fast - Python lists are out of the question, hence I've decided on trying Numba. Sadly, the function fails in nopython=True mode, despite the fact that - according to my understanding - all types are being provided.
The code is as follows:
import numpy as np
from numba import jit, njit, uint8, int64, typeof
@jit(uint8[:,:,:](int64))
def findWhite(cropped):
    h1 = int64(0)
    for i in cropped:
        for j in i:
            if np.sum(j) == 765:
                h1 = h1 + int64(1)
            else:
                pass
    return h1
also, separately:
print(typeof(cropped))
array(uint8, 3d, C)
print(typeof(h1))
int64
In this case 'cropped' is a large uint8 3D C matrix (RGB tiff file comprehension - PIL.Image). Could someone please explain to a Numba newbie what am I doing wrong?
Have you considered using Numpy? That's often a good intermediate between Python lists and Numba, something like:
h1 = (cropped.sum(axis=-1) == 765).sum()
or
h1 = (cropped == 255).all(axis=-1).sum()
The example code you provide is not valid Numba. Your signature is also incorrect, since the input is a 3D array and the output an integer, it should probably be:
@njit(int64(uint8[:,:,:]))
Looping over the array like you do is not valid code. A close translation of your code would be something like this:
@njit(int64(uint8[:,:,:]))
def findWhite(cropped):
    h1 = int64(0)
    ys, xs, n_bands = cropped.shape
    for i in range(ys):
        for j in range(xs):
            if cropped[i, j, :].sum() == 765:
                h1 += 1
    return h1
But that isn't very fast and doesn't beat Numpy on my machine. With Numba it's fine to explicitly loop over every element in an array, and doing so is already a lot faster:
@njit(int64(uint8[:,:,:]))
def findWhite_numba(cropped):
    h1 = int64(0)
    ys, xs, zs = cropped.shape
    for i in range(ys):
        for j in range(xs):
            incr = 1
            for k in range(zs):
                if cropped[i, j, k] != 255:
                    incr = 0
                    break
            h1 += incr
    return h1
For a 5000x5000x3 array these are the result for me:
Numpy (h1 = (cropped == 255).all(axis=-1).sum()):
427 ms ± 6.37 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
findWhite:
612 ms ± 6.16 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
findWhite_numba:
31 ms ± 1.51 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
A benefit of the Numpy method is that it generalizes to any number of dimensions.
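As a quick sanity check of the two Numpy one-liners suggested above, here is a sketch on a tiny made-up image. The function names are hypothetical. Note that the sum-based test relies on there being exactly three bands (3 * 255 == 765), while the all-based test does not; the sum itself does not overflow because NumPy accumulates small integer dtypes in a wider integer type by default.

```python
import numpy as np

def count_white_by_sum(cropped):
    # valid only for exactly three uint8 bands, where 3 * 255 == 765
    return int((cropped.sum(axis=-1) == 765).sum())

def count_white_by_all(cropped):
    # works for any number of bands
    return int((cropped == 255).all(axis=-1).sum())

# made-up 2x2 RGB image: two pure-white pixels
cropped = np.array([[[255, 255, 255], [255, 0, 255]],
                    [[0, 0, 0], [255, 255, 255]]], dtype=np.uint8)
print(count_white_by_sum(cropped), count_white_by_all(cropped))  # 2 2
```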
def nonzero(a):
    row,colum = a.shape
    nonzero_row = np.array([],dtype=int)
    nonzero_col = np.array([],dtype=int)
    for i in range(0,row):
        for j in range(0,colum):
            if a[i,j] != 0:
                nonzero_row = np.append(nonzero_row,i)
                nonzero_col = np.append(nonzero_col,j)
    return (nonzero_row,nonzero_col)
The above code is much slower compared to
(row,col) = np.nonzero(edges_canny)
It would be great if I can get any direction how to increase the speed and why numpy functions are much faster?
There are 2 reasons why NumPy functions can outperform Python's types:
The values inside the array are native types, not Python types. This means NumPy doesn't need to go through the abstraction layer that Python has.
NumPy functions are (mostly) written in C. That actually only matters in some cases because a lot of Python functions are also written in C, for example sum.
In your case you also do something really inefficient: You append to an array. That's one really expensive operation in the middle of a double loop. That's an obvious (and unnecessary) bottleneck right there. You would get amazing speedups just by using lists as nonzero_row and nonzero_col and only convert them to array just before you return:
def nonzero_list_based(a):
    row,colum = a.shape
    a = a.tolist()
    nonzero_row = []
    nonzero_col = []
    for i in range(0,row):
        for j in range(0,colum):
            if a[i][j] != 0:
                nonzero_row.append(i)
                nonzero_col.append(j)
    return (np.array(nonzero_row), np.array(nonzero_col))
The timings:
import numpy as np
def nonzero_original(a):
    row,colum = a.shape
    nonzero_row = np.array([],dtype=int)
    nonzero_col = np.array([],dtype=int)
    for i in range(0,row):
        for j in range(0,colum):
            if a[i,j] != 0:
                nonzero_row = np.append(nonzero_row,i)
                nonzero_col = np.append(nonzero_col,j)
    return (nonzero_row,nonzero_col)
arr = np.random.randint(0, 10, (100, 100))
%timeit np.nonzero(arr)
# 315 µs ± 5.39 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
%timeit nonzero_original(arr)
# 759 ms ± 12.6 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit nonzero_list_based(arr)
# 13.1 ms ± 492 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
Even though it's about 40 times slower than the NumPy operation, it's still almost 60 times faster than your approach. There's an important lesson here: Avoid np.append whenever possible!
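The reason np.append is so costly in a loop is that it never grows an array in place: each call allocates a new array and copies the old contents into it. A minimal sketch with made-up values:

```python
import numpy as np

a = np.array([1, 2])
b = np.append(a, 3)   # allocates a new array and copies a into it

print(a.tolist())              # [1, 2]  (the original is unchanged)
print(b.tolist())              # [1, 2, 3]
print(np.shares_memory(a, b))  # False: b is a fresh copy

# collecting in a list and converting once avoids the repeated copies
collected = []
for value in (1, 2, 3):
    collected.append(value)
c = np.array(collected)
print(c.tolist())              # [1, 2, 3]
```

Appending n elements one at a time with np.append therefore copies on the order of n**2 elements in total, whereas the list-based version performs a single conversion at the end.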
One additional reason why NumPy outperforms alternative approaches is that its functions (mostly) use state-of-the-art algorithms (or "import" them, e.g. from BLAS/LAPACK/ATLAS/MKL) to solve the problems. These algorithms have been optimized for correctness and speed over years (if not decades). You shouldn't expect to find a faster or even comparable solution.