My problem is, I need to read around 50M lines from a file in format
x1 "\t" x2 "\t" .. x10 "\t" count
and then to compute the matrix A with components A[j][i] = Sum (over all lines) count * x_i * x_j.
I tried 2 approaches, both reading the file line by line:
1) keep A as a Python matrix and update it in a for loop:
for j in range(size):
    for i in range(size):
        A[j][i] += x[j] * x[i] * count
2) make A a numpy array, and update using numpy.add:
numpy.add(A, count * numpy.outer(x, x))
What surprised me is that the 2nd approach has been around 30% slower than the first one. And both are really slow - around 10 minutes for the whole file...
Is there some way to speed up the calculation of the matrix? Maybe there is some function that would read the data entirely from the file (or in large chunks) and not line per line? Any suggestions?
Some thoughts:
Use pandas.read_csv with the C engine to read the file. It is a lot faster than np.genfromtxt because the engine is C/Cython optimized.
You can read the whole file into memory and then do the calculations. This is the easiest way, but from an efficiency perspective your CPU will be mostly idle waiting for input. That time could be better spent calculating.
You can try to read and process line by line (e.g. with the csv module). While I/O will still be the bottleneck, by the end of the read you will have processed your file. The problem here is that you will still have some efficiency loss due to the Python overhead.
Probably the best combination would be to read by chunks using pandas.read_csv with the iterator and chunksize parameters set, and to process one chunk at a time. I bet there is an optimal chunk size that will beat the other methods.
Your matrix is symmetric, so compute just the upper half using your first approach (55 computations per row instead of 100).
The second approach is slower. I don't know why, but if you're instantiating 50M small ndarrays, that may well be the bottleneck, and using a single ndarray and copying each row's data into it
x = np.zeros((11,))
for l in data.readlines():
    x[:] = l.split()
    A += np.outer(x[:-1], x[:-1]) * x[-1]
may result in a speedup.
Depending on how much memory you have available on your machine, you could try using a regular expression to parse the values and numpy reshaping and slicing to apply the calculations. If you run out of memory, consider a similar approach but read the file in, say, 1M-line chunks.
import re
import numpy as np
num_cols = 11  # x1..x10 plus the count column
txt = open("C:/temp/input.dat").read()
values = re.split(r"[\t\n]", txt.strip())
thefloats = [float(v) for v in values]
mat = np.reshape(thefloats, (-1, num_cols))  # one row per input line
x, counts = mat[:, :-1], mat[:, -1]
A = (x * counts[:, None]).T @ x  # A[j][i] = sum over lines of count * x_j * x_i
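The chunked idea mentioned above can be sketched with NumPy alone. This is only an illustration under stated assumptions: the function name weighted_gram, the n_features and chunk_lines parameters and their defaults are made up here, and the file is assumed to hold exactly n_features values plus one count per tab-separated line.

```python
from itertools import islice

import numpy as np


def weighted_gram(path, n_features=10, chunk_lines=100_000):
    """Accumulate A = sum over lines of count * outer(x, x), chunk by chunk."""
    A = np.zeros((n_features, n_features))
    with open(path) as fh:
        while True:
            # Pull at most chunk_lines lines so memory use stays bounded.
            lines = list(islice(fh, chunk_lines))
            if not lines:
                break
            block = np.loadtxt(lines, delimiter="\t", ndmin=2)
            x, count = block[:, :n_features], block[:, n_features]
            # One matrix product per chunk replaces one outer product per line.
            A += (x * count[:, None]).T @ x
    return A
```

The per-line Python overhead is paid once per chunk instead of once per line, so chunk_lines trades memory for speed and is worth tuning.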
I need much faster code to remove the values of a 1D array (length ~10-15) that are common with another 1D array (length ~1e5-5e5, rarely up to 7e5); both are index arrays containing integers. There are no duplicates in the arrays, they are not sorted, and the order of the values must be kept in the main array after modification. I know this can be achieved using np.setdiff1d or np.in1d (neither of which is supported in numba-jitted no-python mode), and other similar posts (e.g. this) do not offer a much more efficient way to do so, but performance is important here because all the values in the main index array will gradually be removed in loops.
import numpy as np
import numba as nb
n = 500000
r = 10
arr1 = np.random.permutation(n)
arr2 = np.random.randint(0, n, r)
# @nb.jit
def setdif1d_np(a, b):
    return np.setdiff1d(a, b, assume_unique=True)
# @nb.jit
def setdif1d_in1d_np(a, b):
    return a[~np.in1d(a, b)]
There is another related post, proposed by norok2 for 2D arrays, whose solution (a hashing-like approach using numba) is ~15 times faster than the usual methods described there. This solution may be the best if it could be adapted for 1D arrays:
@nb.njit
def mul_xor_hash(arr, init=65537, k=37):
    result = init
    for x in arr.view(np.uint64):
        result = (result * k) ^ x
    return result
@nb.njit
def setdiff2d_nb(arr1, arr2):
    # : build `delta` set using hashes
    delta = {mul_xor_hash(arr2[0])}
    for i in range(1, arr2.shape[0]):
        delta.add(mul_xor_hash(arr2[i]))
    # : compute the size of the result
    n = 0
    for i in range(arr1.shape[0]):
        if mul_xor_hash(arr1[i]) not in delta:
            n += 1
    # : build the result
    result = np.empty((n, arr1.shape[-1]), dtype=arr1.dtype)
    j = 0
    for i in range(arr1.shape[0]):
        if mul_xor_hash(arr1[i]) not in delta:
            result[j] = arr1[i]
            j += 1
    return result
I tried to adapt that for 1D arrays, but I have some problems/questions with it.
At first, IDU what does mul_xor_hash exactly do, and if init and k are arbitrary selected or not
Why mul_xor_hash will not work without nb.njit:
File "C:/Users/Ali/Desktop/test - Copy - Copy.py", line 21, in mul_xor_hash
result = (result * k) ^ x
TypeError: ufunc 'bitwise_xor' not supported for the input types, and the inputs could not be safely coerced to any supported types according to the casting rule ''safe''
IDK how to implement mul_xor_hash on 1D arrays (if that is possible), which I guess may make it even faster than for 2D arrays, so I broadcast the input arrays to 2D with [None, :], which raises the following error just for arr2:
print(mul_xor_hash(arr2[0]))
ValueError: new type not compatible with array
and what does delta do
I am searching for the most efficient way to do this. In the absence of a better method than norok2's solution, how can that solution be adapted for 1D arrays?
Understanding the hash-based solution
At first, IDU what does mul_xor_hash exactly do, and if init and k are arbitrary selected or not
mul_xor_hash is a custom hash function. Functions mixing xor and multiply (possibly with shifts) are known to be relatively fast at computing the hash of a raw data buffer. The multiplication tends to shuffle bits, and the xor is used to combine/accumulate the result into a small fixed-size value (i.e. the final hash). There are many different hashing functions. Some are faster than others, and some cause more collisions than others in a given context. A fast hashing function causing too many collisions can be useless in practice, as it results in a pathological situation where all conflicting values need to be compared. This is why fast hash functions are hard to implement.
init and k are parameters most likely chosen to make the hash reasonably well balanced. This is pretty common in such hash functions. k needs to be sufficiently big for the multiplication to shuffle bits, and it should typically also be a prime number (values like powers of two tend to increase collisions due to modular arithmetic behaviour). init plays a significant role only for very small arrays (e.g. with 1 item): it helps to reduce collisions by xoring the final hash with a non-trivial constant. Indeed, if arr.size = 1, then result = (init * k) ^ arr[0] where init * k is a constant. Having an identity hash function equal to arr[0] is known to be bad since it tends to result in many collisions (this is a complex topic but, put shortly, arr[0] can be a multiple of the number of buckets in the hash table, for example). Thus, init should be a relatively big number and init * k should also be a big non-trivial value (a prime number is a good target value).
Why mul_xor_hash will not work without nb.njit
It depends on the input. The input needs to be a 1D array and have a raw size in bytes divisible by 8 (e.g. 64-bit items, 2n 32-bit items, 4n 16-bit items or 8n 8-bit items). Here are some examples:
mul_xor_hash(np.random.rand(10))
mul_xor_hash(np.arange(10)) # Does not work with 9 when the items are 32-bit integers
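The size constraint comes from the arr.view(np.uint64) call inside mul_xor_hash. A small check, with the dtype spelled out explicitly because the default integer type of np.arange differs between platforms:

```python
import numpy as np

# 10 x 32-bit items = 40 bytes, which can be reinterpreted as 5 64-bit words.
ok = np.arange(10, dtype=np.int32).view(np.uint64)
print(ok.shape)

# 9 x 32-bit items = 36 bytes, which is not a multiple of 8.
try:
    np.arange(9, dtype=np.int32).view(np.uint64)
    failed = False
except ValueError as exc:
    failed = True
    print("view failed:", exc)
```

With 64-bit items any length works, since every item already fills one 64-bit word.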
and what does delta do
It is a set containing the hashes of the arr2 rows, so that matching rows can be found faster than by comparing them without hashes.
how to prepare this solution for 1D arrays?
AFAIK, hashes are only used to avoid comparisons of rows, but this is because the input is a 2D array. In 1D, there is no such problem.
There is a big catch with this method: it only works if there are no hash collisions. Otherwise, the implementation wrongly assumes that values are equal even if they are not! @norok2 explicitly mentioned it in the comments though:
Note that the collision handling for the hashings should also be implemented
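One way to get collision handling for free in the 2D case is to let a Python set do the hashing and the equality check on the raw row bytes. This is a plain NumPy/Python sketch, not the Numba solution of norok2, and the name setdiff2d_bytes is made up for illustration. It assumes integer arrays of the same dtype (byte-wise comparison is not reliable for floats, e.g. 0.0 versus -0.0).

```python
import numpy as np


def setdiff2d_bytes(arr1, arr2):
    """Rows of arr1 that do not appear in arr2, in arr1's original order."""
    arr2 = np.asarray(arr2, dtype=arr1.dtype)
    # The set hashes the bytes and also compares them on a hash collision.
    seen = {row.tobytes() for row in arr2}
    keep = [i for i, row in enumerate(arr1) if row.tobytes() not in seen]
    return arr1[np.array(keep, dtype=np.intp)]
```

It will be much slower than a jitted loop, but it cannot silently drop a row because of a collision.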
Faster implementation
Using the 2D solution of @norok2 for 1D is not a good idea, since hashes will not make it faster the way they are used. In fact, a set already uses a hash function internally anyway. Not to mention that collisions need to be handled properly (which a set does).
Using a set is a relatively good idea since it makes the complexity O(n + m), where n = len(arr1) and m = len(arr2). That being said, if arr1 is converted to a set, then it will be too big to fit in the L1 cache (due to the size of arr1 in your case), resulting in slow cache misses. Additionally, the growing size of the set will cause values to be re-hashed, which is not efficient. If arr2 is converted to a set, then the many hash table fetches will not be very efficient since arr2 is very small in your case. This is why this solution is sub-optimal.
One solution is to split arr1 into chunks and then build a set based on the target chunk. You can then check efficiently whether a value is in the set or not. Building the set is still not very efficient due to the growing size. This problem is due to Python itself, which does not provide a way to reserve space for the data structure like other languages do (e.g. C++). One solution to avoid this issue is simply to reimplement a hash table, which is not trivial and is cumbersome. Actually, Bloom filters can be used to speed up this process, since on average they can quickly find whether there is no collision between the two sets arr1 and arr2 (though they are not trivial to implement).
Another optimization is to use multiple threads to compute the chunks in parallel, since they are independent. That being said, appending to the final array is not easy to do efficiently in parallel, especially since you do not want the order to be modified. One solution is to move the copy out of the parallel loop and do it serially, but this is slow and AFAIK there is no simple way to do that in Numba currently (since the parallelism layer is very limited). Consider using native languages like C/C++ for an efficient parallel implementation.
In the end, hashing can be pretty complex, and the speed-up can be quite small compared to a naive implementation with two nested loops, since arr2 only has a few items and modern processors can compare values quickly using SIMD instructions (while hash-based methods can hardly benefit from them on mainstream processors). Unrolling can help to write a pretty simple and fast implementation. Again, unfortunately, Numba uses LLVM-Jit internally, which appears to fail to vectorize such simple code (certainly due to missing optimizations in either LLVM-Jit or even LLVM itself). As a result, the non-vectorized code ends up a bit slower (rather than 4~10 times faster on a modern mainstream processor). One solution is to use C/C++ code instead (or possibly Cython).
Here is a serial implementation using basic Bloom filters:
@nb.njit('uint32(int32)')
def hash_32bit_4k(value):
    return (np.uint32(value) * np.uint32(27_644_437)) & np.uint32(0x0FFF)
@nb.njit(['int32[:](int32[:], int32[:])', 'int32[:](int32[::1], int32[::1])'])
def setdiff1d_nb_faster(arr1, arr2):
    out = np.empty_like(arr1)
    bloomFilter = np.zeros(4096, dtype=np.uint8)
    for j in range(arr2.size):
        bloomFilter[hash_32bit_4k(arr2[j])] = True
    cur = 0
    for i in range(arr1.size):
        # If the bloom-filter value is false, we know arr1[i] is not in arr2.
        # Otherwise, it may be a false positive (conflict) and we need to check to be sure.
        if bloomFilter[hash_32bit_4k(arr1[i])] and arr1[i] in arr2:
            continue
        out[cur] = arr1[i]
        cur += 1
    return out[:cur]
Here is an untested variant that should work for 64-bit integers (floating point numbers need memory views and possibly a prime constant too):
@nb.njit('uint64(int64)')
def hash_32bit_4k(value):
    return (np.uint64(value) * np.uint64(67_280_421_310_721)) & np.uint64(0x0FFF)
Note that if all the values in the small array are contained in the main array in each loop, then we can speed up the arr1[i] in arr2 part by removing values from arr2 when we find them. That being said, collisions and findings should be very rare so I do not expect this to be significantly faster (not to mention it adds some overhead and complexity). If items are computed in chunks, then the last chunks can be directly copied without any check but the benefit should still be relatively small. Note that this strategy can be effective for the naive (C/C++) SIMD implementation previously mentioned though (it can be about 2x faster).
Generalization and parallel implementation
This section focuses on which algorithm to use depending on the input size. In particular, it details a SIMD-based implementation and discusses the use of multiple threads.
First of all, the best algorithm to use depends on the value of r. More specifically:
when r is 0, the best thing to do is to return the input array arr1 unmodified (possibly a copy to avoid issues with in-place algorithms);
when r is 1, we can use one basic loop iterating over the array, but the best implementation is likely to use np.where of Numpy, which is highly optimized for that;
when r is small, like <10, then using a SIMD-based implementation should be particularly efficient, especially if the iteration range of the arr2-based loop is known at compile time and is unrolled;
for bigger r values that are still relatively small (e.g. r < 1000 and r << n), the provided hash-based solution should be one of the best;
for larger r values with r << n, the hash-based solution can be optimized by packing boolean values as bits in bloomFilter and by using multiple hash functions instead of one, so as to better handle collisions while being more cache-friendly (in fact, this is what actual Bloom filters do); note that multi-threading can be used to speed up the lookups when r is huge and r << n;
when r is big and not much smaller than n, then the problem is pretty hard to solve efficiently, and the best solution is certainly to sort both arrays (typically with a radix sort) and use a merge-based method to remove the duplicates, possibly with multiple threads when both r and n are huge (hard to implement).
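For the large-r case, a much simpler stand-in for the sort-and-merge idea can be written in plain NumPy. Note that this swaps in a different technique: it sorts only arr2 and binary-searches each value of arr1 with np.searchsorted (O(n log r)) rather than radix-sorting and merging both arrays. The name setdiff1d_sorted_lookup is made up for illustration.

```python
import numpy as np


def setdiff1d_sorted_lookup(arr1, arr2):
    """Remove from arr1 every value present in arr2, keeping arr1's order."""
    if arr2.size == 0:
        return arr1.copy()
    b = np.sort(arr2)
    # Position where each arr1 value would be inserted into the sorted arr2.
    pos = np.searchsorted(b, arr1)
    # Values larger than max(arr2) land one past the end; clamp them back.
    pos[pos == b.size] = b.size - 1
    # A value is in arr2 exactly when the element at that position equals it.
    return arr1[b[pos] != arr1]
```

Because arr1 itself is never sorted, the original order of the remaining values is preserved.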
Let's start with the SIMD-based solution. Here is an implementation:
@nb.njit('int32[:](int32[::1], int32[::1])')
def setdiff1d_nb_simd(arr1, arr2):
    out = np.empty_like(arr1)
    limit = arr1.size // 4 * 4
    limit2 = arr2.size // 2 * 2
    cur = 0
    z32 = np.int32(0)
    # Tile (x4) based computation
    for i in range(0, limit, 4):
        f0, f1, f2, f3 = z32, z32, z32, z32
        v0, v1, v2, v3 = arr1[i], arr1[i+1], arr1[i+2], arr1[i+3]
        # Unrolled (x2) loop searching for a match in `arr2`
        for j in range(0, limit2, 2):
            val1 = arr2[j]
            val2 = arr2[j+1]
            f0 += (v0 == val1) + (v0 == val2)
            f1 += (v1 == val1) + (v1 == val2)
            f2 += (v2 == val1) + (v2 == val2)
            f3 += (v3 == val1) + (v3 == val2)
        # Remainder of the previous loop
        if limit2 != arr2.size:
            val = arr2[arr2.size-1]
            f0 += v0 == val
            f1 += v1 == val
            f2 += v2 == val
            f3 += v3 == val
        if f0 == 0: out[cur] = arr1[i+0]; cur += 1
        if f1 == 0: out[cur] = arr1[i+1]; cur += 1
        if f2 == 0: out[cur] = arr1[i+2]; cur += 1
        if f3 == 0: out[cur] = arr1[i+3]; cur += 1
    # Remainder
    for i in range(limit, arr1.size):
        if arr1[i] not in arr2:
            out[cur] = arr1[i]
            cur += 1
    return out[:cur]
It turns out this implementation is always slower than the hash-based one on my machine, since Numba clearly generates inefficient code for the inner arr2-based loop. This appears to come from broken optimizations related to the == operator: Numba simply fails to use SIMD instructions for this operation (for no apparent reason). This prevents many alternative SIMD-related codes from being fast as long as they use Numba.
Another issue with Numba is that np.where is slow, since it uses a naive implementation while the Numpy one has been heavily optimized. The optimization done in Numpy can hardly be applied to the Numba implementation due to the previous issue. This prevents any speed-up from using np.where in Numba code.
In practice, the hash-based implementation is pretty fast, and the copy already takes a significant time on my machine. The computing part can be sped up using multiple threads. This is not easy since the parallelism model of Numba is very limited. The copy cannot be easily optimized with Numba (one could use non-temporal stores, but these are not yet supported by Numba) unless the computation is done in-place.
To use multiple threads, one strategy is to first split the range into chunks and then:
build a boolean array determining, for each item of arr1, whether the item is found in arr2 or not (fully parallel)
count the number of items found per chunk (fully parallel)
compute the offset of each destination chunk (hard to parallelize, especially with Numba, but fast thanks to chunks)
copy each chunk to the target location without copying found items (fully parallel)
Here is an efficient parallel hash-based implementation:
@nb.njit('int32[:](int32[:], int32[:])', parallel=True)
def setdiff1d_nb_faster_par(arr1, arr2):
    # Pre-computation of the bloom-filter
    bloomFilter = np.zeros(4096, dtype=np.uint8)
    for j in range(arr2.size):
        bloomFilter[hash_32bit_4k(arr2[j])] = True
    chunkSize = 1024 # To tune regarding the kind of input
    chunkCount = (arr1.size + chunkSize - 1) // chunkSize
    # Find for each item of `arr1` if the value is in `arr2` (parallel)
    # and count the number of item found for each chunk on the fly.
    # Note: thanks to page fault, big parts of `found` are not even written in memory if `arr2` is small
    found = np.zeros(arr1.size, dtype=nb.bool_)
    foundCountByChunk = np.empty(chunkCount, dtype=nb.uint16)
    for i in nb.prange(chunkCount):
        start, end = i * chunkSize, min((i + 1) * chunkSize, arr1.size)
        foundCountInChunk = 0
        for j in range(start, end):
            val = arr1[j]
            if bloomFilter[hash_32bit_4k(val)] and val in arr2:
                found[j] = True
                foundCountInChunk += 1
        foundCountByChunk[i] = foundCountInChunk
    # Compute the location of the destination chunks (sequential)
    outChunkOffsets = np.empty(chunkCount, dtype=nb.uint32)
    foundCount = 0
    for i in range(chunkCount):
        outChunkOffsets[i] = i * chunkSize - foundCount
        foundCount += foundCountByChunk[i]
    # Parallel chunk-based copy
    out = np.empty(arr1.size-foundCount, dtype=arr1.dtype)
    for i in nb.prange(chunkCount):
        srcStart, srcEnd = i * chunkSize, min((i + 1) * chunkSize, arr1.size)
        cur = outChunkOffsets[i]
        # Optimization: we can copy the whole chunk if there is nothing found in it
        if foundCountByChunk[i] == 0:
            out[cur:cur+(srcEnd-srcStart)] = arr1[srcStart:srcEnd]
        else:
            for j in range(srcStart, srcEnd):
                if not found[j]:
                    out[cur] = arr1[j]
                    cur += 1
    return out
This implementation is the fastest for the target input on my machine. It is generally fast when n is quite big and the overhead of creating threads is relatively small on the target platform (e.g. on PCs, but typically not on computing servers with many cores). The overhead of the parallel implementation is significant, so the target machine needs at least 4 cores for the implementation to be significantly faster than the sequential one.
It may be useful to tune the chunkSize variable for the target inputs. If r << n, it is better to use a pretty big chunkSize. That being said, the number of chunks needs to be sufficiently big for multiple threads to operate on many chunks. Thus, chunkSize should be significantly smaller than n / numberOfThreads.
On my machine, most of the time (65-70%) is spent in the final copy, which is mostly memory-bound and can hardly be optimized further with Numba.
Results
Here are results on my i5-9600KF-based machine (with 6 cores):
setdif1d_np: 2.65 ms
setdif1d_in1d_np: 2.61 ms
setdiff1d_nb: 2.33 ms
setdiff1d_nb_simd: 1.85 ms
setdiff1d_nb_faster: 0.73 ms
setdiff1d_nb_faster_par: 0.49 ms
The best provided implementation is about 4~5 times faster than the other ones.
What I found is that hashing does not help. It is just a trick for the 2D case, to convert 1D arrays (rows) to single numbers and put them as such in a set.
Below is the method of norok2 that I converted to 1D arrays (and added annotations to, for faster compilation).
Note that this is only slightly (20-30%) faster than the methods you already have. And of course that applies from the second function call onwards; the first call is slightly slower due to compilation.
@nb.njit('int32[:](int32[:], int32[:])')
def setdiff1d_nb(arr1, arr2):
    delta = set(arr2)
    # : build the result
    result = np.empty(len(arr1), dtype=arr1.dtype)
    j = 0
    for i in range(arr1.shape[0]):
        if arr1[i] not in delta:
            result[j] = arr1[i]
            j += 1
    return result[:j]
I have some python code for reading data from RAM of an FPGA and writing it to disk on my computer. The code's runtime is 2.56sec. I need to bring it down to 2sec.
mem = device.getNode("udaq.readout_mem").readBlock(16384)
device.dispatch()
ram.append(mem)
ram.reverse()
memory = ram.pop()
for j in range(16384):
    if 0 < j < 4096:
        f.write('0x%05x\t0x%08x\n' %(j, memory[j]))
    if 8192 < j < 12288:
        f.write('0x%05x\t0x%08x\n' %(j, memory[j]))
Your loop is very inefficient. You're literally iterating for nothing when values aren't in range. And you're spending a lot of time testing the indices.
Don't do one loop & 2 tests. Just create 2 loops without index tests (note that the first index of each range is skipped if we respect your tests):
for j in range(1,4096):
    f.write('0x%05x\t0x%08x\n' %(j, memory[j]))
for j in range(8193,12288):
    f.write('0x%05x\t0x%08x\n' %(j, memory[j]))
Maybe more Pythonic & more concise (& not using memory[j], so it has a chance to be faster):
import itertools
for start,end in ((1,4096),(8193,12288)):
    sl = itertools.islice(memory,start,end)
    for j,m in enumerate(sl,start):
        f.write('0x%05x\t0x%08x\n' %(j, m))
The outer loop replaces the 2 separate loops (so if there are more ranges, just add them to the tuple). The islice object iterates over a slice of the memory, but no copies are made. It iterates without checking the indices each time for array out of bounds, so it can be faster. It has yet to be benchmarked, but the writing to disk is probably taking a lot of time as well.
Jean-François Fabre's observations on the loops are very good, but we can go further. The code is performing around 8000 write operations, of constant size, and with nearly the same content. We can prepare a buffer to do that in one operation.
# Prepare buffer with static portions
addresses = list(range(1,4096)) + list(range(8193,12288))
dataoffset = 2+5+1+2
linelength = dataoffset+8+1
buf = bytearray(b"".join(b'0x%05x\t0x%08x\n' % (j, 0)
                         for j in addresses))
# Later on, fill in data
for line, address in enumerate(addresses):
    offset = linelength*line + dataoffset
    buf[offset:offset+8] = b"%08x" % memory[address]
f.write(buf)
This means far fewer system calls. It's likely we can go even further by e.g. reading the memory as a buffer and using b2a_hex or similar rather than a string formatting per word. It might also make sense to precalculate the offsets rather than using enumerate.
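The b2a_hex idea could look roughly like the sketch below. It is untested against the real hardware: it assumes memory behaves like a sequence of unsigned 32-bit integers that supports slicing, and the helper name format_words is made up for illustration. The file would need to be opened in binary mode for the final write.

```python
import binascii
import struct


def format_words(memory, ranges):
    """Build all dump lines for the given (start, end) ranges as one bytes object."""
    parts = []
    for start, end in ranges:
        words = memory[start:end]
        # Pack the block as big-endian 32-bit words and hex-encode it in one call
        # (binascii.hexlify is the same function as binascii.b2a_hex).
        hexed = binascii.hexlify(struct.pack(">%dI" % len(words), *words))
        for k in range(len(words)):
            # Each word contributes exactly 8 hex digits.
            parts.append(b"0x%05x\t0x%s\n" % (start + k, hexed[8 * k:8 * k + 8]))
    return b"".join(parts)
```

Usage would be a single call such as f.write(format_words(memory, ((1, 4096), (8193, 12288)))). Whether it actually beats the buffer approach has to be measured.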
I have very long arrays and tables of time-value pairs in pytables. I need to be able to perform linear interpolation and zero order hold interpolation on this data.
Currently, I'm turning the columns into numpy arrays using pytables' column-wise slice notation and then feeding the numpy arrays to scipy.interpolate.interp1d to create the interpolation functions.
Is there a better way to do this?
The reason I ask is that it is my understanding that turning the columns into numpy arrays basically copies them into memory. Which means that when I start running my code full throttle I'm going to be in trouble since I will be working with data sets large enough to drown my desktop. Please correct me if I'm mistaken on this point.
Also, due to the large amounts of data I'll be working with, I suspect that writing a function that iterates over the pytables arrays/tables in order to do the interpolation myself will be incredibly slow since I need to call the interpolation function many, many times (about as many times as there are records in the data I'm trying to interpolate).
Your question is difficult to answer because there is always a trade off between memory and computation time and you are essentially asking to not have to sacrifice either of them, which is impossible. scipy.interpolate.interp1d() requires that the arrays be in memory and writing an out-of-core interpolator requires that you query the disk linearly with the number of times that you call it.
That said, there are a couple of things that you can do, none of which are perfect.
The first thing that you can try is down sampling the data. This will cut down the data that you need to have in memory by the factor that you down sample. The disadvantage is that your interpolation is that much coarser. Luckily this is pretty easy to do. Just provide a step size to the columns that you access. For down sampling factor of 4 you would do:
with tb.open_file('myfile.h5', 'r') as f:
    x = f.root.mytable.cols.x[::4]
    y = f.root.mytable.cols.y[::4]
f = scipy.interpolate.interp1d(x, y)
ynew = f(xnew)
You could make this step size adjustable based on the memory available if you wanted to as well.
Alternatively, if the data set that you are interpolating values for - xnew - exists only on a subset of the original domain, you can get away with reading in only portions of the original table that are in the new neighborhood. Given a fudge factor of 10%, you would do something like the following:
query = "({0} <= x) & (x <= {1})".format(xnew.min()*0.9, xnew.max()*1.1)
with tb.open_file('myfile.h5', 'r') as f:
    data = f.root.mytable.read_where(query)
f = scipy.interpolate.interp1d(data['x'], data['y'])
ynew = f(xnew)
Extending this idea, if we have the case where xnew is sorted (monotonically increasing) but does extend over the entire original domain, then you can read in from the table on disk in a chunked fashion. Say we want to have 10 chunks:
newlen = len(xnew)
chunks = 10
chunklen = newlen // chunks  # assumes newlen is a multiple of chunks
ynew = np.empty(newlen, dtype=float)
for i in range(chunks):
    xnew_chunk = xnew[i*chunklen:(i+1)*chunklen]
    query = "({0} <= x) & (x <= {1})".format(xnew_chunk.min()*0.9,
                                             xnew_chunk.max()*1.1)
    with tb.open_file('myfile.h5', 'r') as f:
        data = f.root.mytable.read_where(query)
    f = scipy.interpolate.interp1d(data['x'], data['y'])
    ynew[i*chunklen:(i+1)*chunklen] = f(xnew_chunk)
Striking the balance between memory and I/O speed is always a challenge. There are probably things that you can do to speed these strategies up depending on how regular your data is. Still, this should be enough to get you started.
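For the two interpolation kinds mentioned in the question, here is a minimal NumPy-only sketch that could be applied to each in-memory chunk. It is an alternative to interp1d rather than what the answer above uses, it assumes x is sorted in increasing order, and the function names are made up for illustration.

```python
import numpy as np


def linear_interp(x, y, xnew):
    """Piecewise-linear interpolation; values outside [x[0], x[-1]] are clamped."""
    return np.interp(xnew, x, y)


def zero_order_hold(x, y, xnew):
    """Return the y of the most recent sample at or before each xnew."""
    # Index of the last x that is <= xnew.
    idx = np.searchsorted(x, xnew, side="right") - 1
    # Points before the first sample fall back to the first value.
    idx = np.clip(idx, 0, len(x) - 1)
    return y[idx]
```

Neither function builds an interpolator object, so the only arrays held in memory are the chunk itself and the result.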
When I load an array using numpy.loadtxt, it seems to take too much memory. E.g.
a = numpy.zeros(int(1e6))
causes an increase of about 8MB in memory (using htop, or just 8 bytes * 1 million ≈ 8MB). On the other hand, if I save and then load this array
numpy.savetxt('a.csv', a)
b = numpy.loadtxt('a.csv')
my memory usage increases by about 100MB! Again I observed this with htop. This was observed while in the iPython shell, and also while stepping through code using Pdb++.
Any idea what's going on here?
After reading jozzas's answer, I realized that if I know ahead of time the array size, there is a much more memory efficient way to do things if say 'a' was an mxn array:
b = numpy.zeros((m,n))
with open('a.csv', 'r') as f:
    reader = csv.reader(f)
    for i, row in enumerate(reader):
        b[i,:] = numpy.array(row)
Saving this array of floats to a text file creates a 24M text file. When you re-load this, numpy goes through the file line-by-line, parsing the text and recreating the objects.
I would expect memory usage to spike during this time, as numpy doesn't know how big the resultant array needs to be until it gets to the end of the file, so I'd expect there to be at least 24M + 8M + other temporary memory used.
Here's the relevant bit of the numpy code, from /lib/npyio.py:
# Parse each line, including the first
for i, line in enumerate(itertools.chain([first_line], fh)):
    vals = split_line(line)
    if len(vals) == 0:
        continue
    if usecols:
        vals = [vals[i] for i in usecols]
    # Convert each value according to its column and store
    items = [conv(val) for (conv, val) in zip(converters, vals)]
    # Then pack it according to the dtype's nesting
    items = pack_items(items, packing)
    X.append(items)
#...A bit further on
X = np.array(X, dtype)
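A rough back-of-the-envelope check of the sizes involved, assuming numpy.savetxt was called with its default format ('%.18e') and that each parsed value lives as a separate Python float object until the final np.array call. The exact object size depends on the Python build, so it is printed rather than stated.

```python
import sys

# One line of the text file as written with the default '%.18e' format.
line = "%.18e\n" % 0.0
bytes_per_value_on_disk = len(line)

n_values = 1_000_000
file_mib = bytes_per_value_on_disk * n_values / 2**20

print("bytes per line in the text file:", bytes_per_value_on_disk)
print("text file size: %.1f MiB" % file_mib)
print("bytes per value in a float64 array: 8")
print("bytes per Python float object:", sys.getsizeof(0.0))
```

The intermediate list also stores a pointer per entry on top of each float object, which is why the temporary structure is several times larger than the final array.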
This additional memory usage shouldn't be a concern, as this is just the way python works - while your python process appears to be using 100M of memory, internally it maintains knowledge of which items are no longer used, and will re-use that memory. For example, if you were to re-run this save-load procedure in the one program (save, load, save, load), your memory usage will not increase to 200M.
Here is what I ended up doing to solve this problem. It works even if you don't know the shape ahead of time. This performs the conversion to float first, and then combines the arrays (as opposed to @JohnLyon's answer, which combines the arrays of string then converts to float). This used an order of magnitude less memory for me, although perhaps was a bit slower. However, I literally did not have the requisite memory to use np.loadtxt, so if you don't have sufficient memory, then this will be better:
def numpy_loadtxt_memory_friendly(the_file, max_bytes = 1000000, **loadtxt_kwargs):
    numpy_arrs = []
    with open(the_file, 'rb') as f:
        i = 0
        while True:
            print(i)
            some_lines = f.readlines(max_bytes)
            if len(some_lines) == 0:
                break
            vec = np.loadtxt(some_lines, **loadtxt_kwargs)
            if len(vec.shape) < 2:
                vec = vec.reshape(1,-1)
            numpy_arrs.append(vec)
            i+=len(some_lines)
    return np.concatenate(numpy_arrs, axis=0)
I have a bunch of csv datasets, about 10Gb in size each. I'd like to generate histograms from their columns. But it seems like the only way to do this in numpy is to first load the entire column into a numpy array and then call numpy.histogram on that array. This consumes an unnecessary amount of memory.
Does numpy support online binning? I'm hoping for something that iterates over my csv line by line and bins values as it reads them. This way at most one line is in memory at any one time.
Wouldn't be hard to roll my own, but wondering if someone already invented this wheel.
As you said, it's not that hard to roll your own. You'll need to set up the bins yourself and reuse them as you iterate over the file. The following ought to be a decent starting point:
import numpy as np
datamin = -5
datamax = 5
numbins = 20
mybins = np.linspace(datamin, datamax, numbins)
myhist = np.zeros(numbins-1, dtype='int32')
for i in range(100):
    d = np.random.randn(1000,1)
    htemp, jnk = np.histogram(d, mybins)
    myhist += htemp
I'm guessing performance will be an issue with such large files, and the overhead of calling histogram on each line might be too slow. @doug's suggestion of a generator seems like a good way to address that problem.
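To connect this to an actual file, the random data above can be replaced by chunks read from the CSV, so that histogram is called once per chunk rather than once per line. A minimal sketch, assuming a headerless CSV with numeric columns; the function names and the chunk_rows default are made up for illustration.

```python
import csv
from itertools import islice

import numpy as np


def column_chunks(path, column, chunk_rows=100_000):
    """Yield one CSV column as float arrays of at most chunk_rows values."""
    with open(path, newline="") as fh:
        reader = csv.reader(fh)
        while True:
            rows = list(islice(reader, chunk_rows))
            if not rows:
                return
            yield np.array([row[column] for row in rows], dtype=float)


def streaming_histogram(path, column, bin_edges, chunk_rows=100_000):
    """Accumulate np.histogram counts over the file with fixed bin edges."""
    counts = np.zeros(len(bin_edges) - 1, dtype=np.int64)
    for chunk in column_chunks(path, column, chunk_rows):
        counts += np.histogram(chunk, bins=bin_edges)[0]
    return counts
```

Only chunk_rows rows are held in memory at a time, and the bin edges must be fixed up front so that the per-chunk counts can simply be added.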
Here's a way to bin your values directly:
import numpy as NP
column_of_values = NP.random.randint(10, 99, 10)
# set the bin values:
bins = NP.array([0.0, 20.0, 50.0, 75.0])
binned_values = NP.digitize(column_of_values, bins)
'binned_values' is an index array, containing the index of the bin to which each value in column_of_values belongs.
'bincount' will give you (obviously) the bin counts:
NP.bincount(binned_values)
Given the size of your data set, using Numpy's 'loadtxt' to build a generator might be useful:
data_array = NP.loadtxt("data_file.txt", delimiter=",")
def fnx():
    for i in range(0, data_array.shape[1]):
        yield data_array[:, i]
Binning with a Fenwick Tree (very large dataset; percentile boundaries needed)
I'm posting a second answer to the same question since this approach is very different, and addresses different issues.
What if you have a VERY large dataset (billions of samples), and you don't know ahead of time WHERE your bin boundaries should be? For example, maybe you want to bin things up in to quartiles or deciles.
For small datasets, the answer is easy: load the data in to an array, then sort, then read off the values at any given percentile by jumping to the index that percentage of the way through the array.
For large datasets where the memory size to hold the array is not practical (not to mention the time to sort)... then consider using a Fenwick Tree, aka a "Binary Indexed Tree".
I think these only work for positive integer data, so you'll at least need to know enough about your dataset to shift (and possibly scale) your data before you tabulate it in the Fenwick Tree.
I've used this to find the median of a 100 billion sample dataset, in reasonable time and very comfortable memory limits. (Consider using generators to open and read the files, as per my other answer; that's still useful.)
More on Fenwick Trees:
http://en.wikipedia.org/wiki/Fenwick_tree
http://community.topcoder.com/tc?module=Static&d1=tutorials&d2=binaryIndexedTrees
Are interval, segment, fenwick trees the same?
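A minimal sketch of the idea, assuming the data has already been shifted and scaled to integers in [0, size). The class and method names are made up for illustration, and the percentile uses the nearest-rank definition.

```python
import math


class FenwickCounter:
    """Counts of integer values in [0, size), with k-th smallest lookup."""

    def __init__(self, size):
        self.size = size
        self.tree = [0] * (size + 1)  # 1-based internal indexing
        self.total = 0

    def add(self, value, count=1):
        """Record `count` occurrences of `value` in O(log size)."""
        i = value + 1
        while i <= self.size:
            self.tree[i] += count
            i += i & -i
        self.total += count

    def kth_smallest(self, k):
        """Return the k-th smallest recorded value (k starts at 1)."""
        pos = 0
        step = 1 << (self.size.bit_length() - 1)
        # Descend the implicit tree, skipping blocks that hold fewer than k items.
        while step:
            nxt = pos + step
            if nxt <= self.size and self.tree[nxt] < k:
                pos = nxt
                k -= self.tree[nxt]
            step >>= 1
        return pos

    def percentile(self, p):
        """Value at percentile p (0 < p <= 100), nearest-rank definition."""
        rank = max(1, math.ceil(self.total * p / 100))
        return self.kth_smallest(rank)
```

Memory use depends only on the number of distinct integer values (size), not on the number of samples, which is what makes this workable for streams that do not fit in memory.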
Binning with Generators (large dataset; fixed-width bins; float data)
If you know the width of your desired bins ahead of time -- even if there are hundreds or thousands of buckets -- then I think rolling your own solution would be fast (both to write, and to run). Here's some Python that assumes you have an iterator that gives you the next value from the file:
from math import floor
binwidth = 20
counts = dict()
filename = "mydata.csv"
for val in next_value_from_file(filename):
    binname = int(floor(val/binwidth)*binwidth)
    if binname not in counts:
        counts[binname] = 0
    counts[binname] += 1
print(counts)
The values can be floats, but this is assuming you use an integer binwidth; you may need to tweak this a bit if you want to use a binwidth of some float value.
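One possible tweak for a float binwidth is to key the dictionary by the integer bin index instead of the bin's lower edge, which avoids using floats as keys. A small sketch; bin_counts is a made-up name.

```python
from collections import Counter
from math import floor


def bin_counts(values, binwidth):
    """Count values per bin; bin k covers [k * binwidth, (k + 1) * binwidth)."""
    counts = Counter()
    for val in values:
        counts[floor(val / binwidth)] += 1
    return counts


counts = bin_counts([0.1, 0.4, 0.5, 1.2, -0.2], binwidth=0.5)
for k in sorted(counts):
    print("[%g, %g): %d" % (k * 0.5, (k + 1) * 0.5, counts[k]))
```

The lower edge of a bin can always be recovered as k * binwidth when printing or plotting.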
As for next_value_from_file(), as mentioned earlier, you'll probably want to write a custom generator or object with an __iter__() method to do this efficiently. The pseudocode for such a generator would be this:
def next_value_from_file(filename):
    f = open(filename)
    for line in f:
        # parse out from the line the value or values you need
        val = parse_the_value_from_the_line(line)
        yield val
If a given line has multiple values, then make parse_the_value_from_the_line() either return a list or itself be a generator, and use this pseudocode:
def next_value_from_file(filename):
    f = open(filename)
    for line in f:
        for val in parse_the_values_from_the_line(line):
            yield val