I tried to use nested lists to hold scraped data from HTML, but after about 50,000 list appends I got a memory error. So I decided to change the lists to a NumPy array.
SapList = []
ListAll = np.array([])

def eachshop():  # fill SapList with each shop's data
    global ListAll
    SapList.append(RowNum)
    SapList.extend([sap])  # can hold from one to 10 values in one list: ["sap1", "sap2", "sap3", ..., "sap10"]
    SapList.extend([[strLink, ProdName], ProdCode, ProdH, NewPrice, OldPrice,
                    [FileName + '#Komp!A1', KompPrice], [FileName + '#Sav!A1', 'Sav']])
    SapList.extend([ss])  # can hold from zero to 80 sublists of 3 values: [["id1", "link", "address"], ..., ["id80", "link", "address"]]
    ListAll = np.append(np.array(SapList))
Then, when I do print(ListAll), I get an exception at C:\Python36\scrap.py, line 307 ("ListAll = np.append(np.array(SapList))"): setting an array element with a sequence.
Now, to speed things up, I am using pool.map:
def makePool(cP, func, iters):
    try:
        pool = ThreadPool(cP)
        # iterate over the URLs
        pool.map_async(func, enumerate(iters, start=2)).get(99999)
        pool.close()
        pool.join()
    except:
        print('Pool Error')
        raise
    finally:
        pool.terminate()
So how can I use a NumPy array in my example to reduce memory usage and speed up the I/O operations?
It looks like you are trying to make an array from a list that contains a number and lists. Something like:
In [6]: np.array([1, [1,2],[3,4]])
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-6-812a9ccb6ca0> in <module>()
----> 1 np.array([1, [1,2],[3,4]])
ValueError: setting an array element with a sequence.
It does work if all the elements are lists:
In [7]: np.array([[1], [1,2],[3,4,5]])
Out[7]: array([list([1]), list([1, 2]), list([3, 4, 5])], dtype=object)
But if they vary in length the result is an object array, not a 2d numeric array. Such an object dtype array is very much like a list of lists, containing pointers to lists elsewhere in memory.
A multidimensional numeric array can use less memory than a list of lists, but it isn't going to help if you need to make the lists first. And it does not help at all if the sublists vary in size.
Oh, and stay away from np.append. It's evil. Plus you misused it: np.append needs at least two arguments, the array and the values to append, and it copies the whole array on every call.
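As a minimal sketch (with made-up values, not the scraper's real data), collecting rows in a plain Python list and converting once at the end avoids both the repeated copying and the misuse:

import numpy as np

# Toy rows standing in for the scraped data; every row has the same length.
rows = []
for i in range(5):
    rows.append([i, i * 2.0, i * 3.0])

arr = np.array(rows)          # one allocation: a proper (5, 3) float array
print(arr.shape, arr.dtype)   # (5, 3) float64

# By contrast, np.append copies the whole array on every call and flattens:
slow = np.array([])
for i in range(5):
    slow = np.append(slow, [i, i * 2.0, i * 3.0])
print(slow.shape)             # (15,)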
As hpaulj pointed out already, numpy arrays will not help here, since you don't have consistent data sizes.
As Spinor8 suggested, dump out data in between instead:
AllList = []
limit = 10000
counter = 0
while not finished:
    if counter >= limit:
        print(AllList)   # or dump the chunk to a file here
        AllList = []
        counter = 0
    item = CreateYourList(...)
    AllList.append(item)
    counter += 1
Edit: Since your question is specifically asking about numpy and you even opened a bounty: numpy is not going to help you here, and here is why:
To use numpy efficiently, you have to know the array size at the time of array creation. numpy.append() doesn't actually append anything in place; it creates a new array, which is a huge overhead with large arrays.
Numpy arrays work best if all items have the same number of elements. Specifically, you can think of a numpy array like a matrix: all rows have the same number of columns.
You could create a numpy array based on the largest element in your data stream, but this would mean you allocate memory that you don't need (array elements that will never be filled). This will clearly not solve your memory problem.
So IMHO, your only way to solve this is to break your stream into chunks that your memory can handle, and stitch it together afterwards. Maybe write it to a (temporary) file and append to it?
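A minimal sketch of that idea, assuming the scraped rows come from a generator; the names scrape_all_shops and CHUNK_SIZE are made up for illustration:

import csv
import tempfile

CHUNK_SIZE = 10000   # hypothetical chunk size
buffer = []

with tempfile.NamedTemporaryFile('w', newline='', suffix='.csv', delete=False) as tmp:
    writer = csv.writer(tmp)
    for row in scrape_all_shops():      # hypothetical generator yielding one row per shop
        buffer.append(row)
        if len(buffer) >= CHUNK_SIZE:
            writer.writerows(buffer)    # flush the chunk to disk
            buffer.clear()
    if buffer:
        writer.writerows(buffer)        # flush the final partial chunk
    print('data written to', tmp.name)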
Related
I have a numpy array called combinations (a 2D array of 336^3 = 37,933,056 rows by 3 np.float32 columns), and also err_np (a 1D array of 336^3 = 37,933,056 np.float16 values) storing the percentage error of each 3-value combination.
I need to sort combinations by absolute error, so I do:
from sys import getsizeof as size
import numpy as np
# Some stuff gets done here
print(size(combinations)) # 120 - **EITHER THIS IS TOO LOW**
print(len(combinations)) # 37,933,056
print(combinations) # array OK
combinations = combinations[abs(err_np).argsort()]
print(size(combinations)) # 455,196,792 - **OR THIS IS TOO HIGH**
print(len(combinations)) # 37,933,056
print(combinations) # array sorted OK
I assumed the size before sorting was wrong, but keep in mind that size() is sys.getsizeof(), and the whole array is in there...
Is there a way to sort an np.array using less memory? It should also be as time-efficient as possible, given the size of the array. I currently del err_np after sorting combinations, so I don't care about keeping it.
So combinations is a (37,933,056, 3) shaped array. len() only reports the first dimension.
Its data then takes up:
In [182]: 37933056*3*4
Out[182]: 455196672
bytes - that's the shape times 4 bytes per element.
From the first size I'd say combinations is a view of some other array. The 120 only records the base array size, not the shared data memory.
After the sort, combinations is a 'copy', with its own databuffer, and size reports the full size.
getsizeof is only useful if you understand what it's reporting. It's ok if the array is not a view, but that size can just as easily be calculated from shape and dtype. getsizeof is even less useful when looking at things like lists.
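A small illustration of that view/copy difference, using a toy array instead of the 37-million-row one:

import numpy as np
from sys import getsizeof

base = np.zeros((1000, 3), dtype=np.float32)
view = base[::2]                 # a view: shares base's data buffer
copy = base[::2].copy()          # a copy: owns its own data buffer

print(getsizeof(view))           # small: only the array object itself is counted
print(getsizeof(copy))           # the object plus its full data buffer
print(view.nbytes, copy.nbytes)  # nbytes reports the data size either way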
I have a 1-dimensional numpy array and want to store sparse updates of it.
Say I have an array of length 500000 and want to do 100 updates of 100 elements each. The updates either add to or just change the values (I do not think it matters).
What is the best way to do it using numpy?
I wanted to just store two arrays: indices, values_to_add and therefore have two objects: one stores dense matrix and other just keeps indices and values to add, and I can just do something like this with the dense matrix:
dense_matrix[indices] += values_to_add
And if I have multiple updates, I just concat them.
But this numpy syntax doesn't handle repeated indices correctly: only one of the repeated updates takes effect.
Merging the (index, value) pairs by hand whenever an update repeats an index is O(n). I thought of using a dict instead of arrays to store the updates, which looks fine from a complexity point of view, but it isn't good numpy style.
What is the most expressive way to achieve this? I know about scipy sparse objects, but (1) I want pure numpy because (2) I want to understand the most efficient way to implement it.
If you have repeated indices you could use np.add.at. From the documentation:
Performs unbuffered in place operation on operand ‘a’ for elements
specified by ‘indices’. For addition ufunc, this method is equivalent
to a[indices] += b, except that results are accumulated for elements
that are indexed more than once.
Code
a = np.arange(10)
indices = [0, 2, 2]
np.add.at(a, indices, [-44, -55, -55])
print(a)
Output
[ -44 1 -108 3 4 5 6 7 8 9]
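Applied to the setup in the question (a sketch with small, made-up indices and values), concatenated updates with repeated indices accumulate correctly:

import numpy as np

dense_matrix = np.zeros(500000)

# Two batches of updates; concatenating them repeats index 20.
idx1, val1 = np.array([10, 20, 30]), np.array([1.0, 2.0, 3.0])
idx2, val2 = np.array([20, 40]), np.array([5.0, 7.0])

indices = np.concatenate([idx1, idx2])
values_to_add = np.concatenate([val1, val2])

np.add.at(dense_matrix, indices, values_to_add)   # the repeated index 20 accumulates
print(dense_matrix[[10, 20, 30, 40]])             # [1. 7. 3. 7.]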
Let's say I have a function (called numpyarrayfunction) that outputs an array every time I run it. I would like to run the function multiple times and store the resulting arrays. Obviously, the current method that I am using to do this -
numpyarray = np.zeros((5))
for i in range(5):
    numpyarray[i] = numpyarrayfunction()
generates an error message since I am trying to store an array within an array.
Eventually, what I would like to do is to take the average of the numbers that are in the arrays, and then take the average of these averages. But for the moment, it would be useful to just know how to store the arrays!
Thank you for your help!
As comments and other answers have already laid out, a good way to do this is to store the arrays being returned by numpyarrayfunction in a normal Python list.
If you want everything to be in a single numpy array (for, say, memory efficiency or computation speed), and the arrays returned by numpyarrayfunction are of a fixed length n, you could make numpyarray multidimensional:
numpyarray = np.empty((5, n))
for i in range(5):
    numpyarray[i, :] = numpyarrayfunction()
Then you could do np.average(numpyarray, axis = 1) to average over the second axis, which would give you back a one-dimensional array with the average of each array you got from numpyarrayfunction. np.average(numpyarray) would be the average over all the elements, or np.average(np.average(numpyarray, axis = 1)) if you really want the average value of the averages.
More on numpy array indexing.
I initially misread what was going on inside the for loop there. The reason you're getting an error is that numpy arrays only store numeric types by default, and numpyarrayfunction is returning something that isn't a single number (from the name, probably another numpy array). If that function already returns a full numpy array, then you can do something more like this:
arrays = []
for i in range(5):
    arrays.append(numpyarrayfunction(args))
Then, you can take the average like so:
avgarray = np.zeros(len(arrays[0]))
for array in arrays:
    avgarray += array
avgarray = avgarray / len(arrays)
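If all the returned arrays have the same length, the loop above can also be replaced with a stacked, vectorized version (a brief sketch):

import numpy as np

stacked = np.stack(arrays)        # shape (5, n), assuming equal-length arrays
avgarray = stacked.mean(axis=0)   # elementwise average across the arrays
overall = avgarray.mean()         # the "average of the averages"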
If I have a huge list of lists in memory and I wish to convert it into an array, does the naive approach cause Python to make a copy of all the data, taking twice the space in memory? Should I instead convert the list of lists vector by vector, popping elements as I go?
# for instance
list_of_lists = [[...], ..., [...]]
arr = np.array(list_of_lists)
Edit:
Is it better to create an empty array of a known size and then populate it incrementally, thus avoiding the list_of_lists object entirely? Could this be accomplished by something as simple as some_array[i] = some_list_of_float_values?
I'm just putting this here as it's a bit long for a comment.
Have you read the numpy documentation for array?
numpy.array(object, dtype=None, copy=True, order=None, subok=False, ndmin=0)
"""
...
copy : bool, optional
If true (default), then the object is copied. Otherwise, a copy will
only be made if __array__ returns a copy, if obj is a nested sequence,
or if a copy is needed to satisfy any of the other requirements (dtype,
order, etc.).
...
"""
When you say you don't want to copy the data of the original array when creating the numpy array, what data structure are you hoping to end up with?
A lot of the speed-up you get from using numpy is because the C arrays it creates are contiguous in memory. A Python list is just an array of pointers to objects, so you have to go and find each object every time - which isn't the case in numpy, as it's not written in Python.
If you just have the numpy array reference the Python lists in your 2D structure, then you'll lose the performance gains.
If you do np.array(my_2D_python_array, copy=False), I don't know what it will actually produce, but you could easily test it yourself: look at the shape of the array and see what kind of objects it houses.
If you want the numpy array to be contiguous though, at some point you're going to have to allocate all of the memory it needs (which, if it's as large as you're suggesting, might be difficult to find as a single contiguous section).
Sorry that was pretty rambling, just a comment. How big are the actual arrays you're looking at?
Here's a plot of the cpu usage and memory usage of a small sample program:
from __future__ import division
#Make a large python 2D array
N, M = 10000, 18750
print "%i x %i = %i doubles = %f GB" % (N, M, N * M, N*M*8/10**9)
#grab pid to moniter memory and cpu usage
import os
pid = os.getpid()
os.system("python moniter.py -p " + str(pid) + " &")
print "building python matrix"
large_2d_array = [[n + m*M for n in range(N)] for m in range(M)]
import numpy
from datetime import datetime
print datetime.now(), "creating numpy array with copy"
np1 = numpy.array(large_2d_array, copy=True)
print datetime.now(), "deleting array"
del(np1)
print datetime.now(), "creating numpy array with copy"
np1 = numpy.array(large_2d_array, copy=False)
print datetime.now(), "deleting array"
del(np1)
1, 2, and 3 are the points where each of the matrices finishes being created. Note that the native Python list of lists takes up much more memory than the numpy arrays - Python objects each have their own overhead, and the lists are lists of pointers to those objects. For the numpy array this is not the case, so it is considerably smaller.
Also note that using copy=False on the Python object has no effect - new data is always created. You could get around this by creating a numpy array of Python objects (using dtype=object), but I wouldn't advise it.
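To the edit in the original question: yes, you can preallocate and fill row by row, so the list of lists never has to exist alongside the array. A sketch, assuming the row and column counts are known up front and produce_row is a hypothetical stand-in for whatever yields each list of floats:

import numpy as np

n_rows, n_cols = 1000, 3   # assumed to be known in advance
arr = np.empty((n_rows, n_cols), dtype=np.float64)

for i in range(n_rows):
    some_list_of_float_values = produce_row(i)   # hypothetical row source
    arr[i] = some_list_of_float_values           # copies the floats into row i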
The actual problem I have is that I want to store a long sorted list of (float, str) tuples in RAM. A plain list doesn't fit in my 4 GB of RAM, so I thought I could use two numpy.ndarrays.
The source of the data is an iterable of 2-tuples. numpy has a fromiter function, but how can I use it? The number of items in the iterable is unknown. I can't consume it to a list first due to memory limitations. I thought of itertools.tee, but it seems to add a lot of memory overhead here.
What I guess I could do is consume the iterator in chunks and add those to the arrays. Then my question is, how do I do that efficiently? Should I maybe make two 2D arrays and add rows to them? (Then later I'd need to convert them to 1D.)
Or maybe there's a better approach? Everything I really need is to search through an array of strings by the value of the corresponding number in logarithmic time (that's why I want to sort by the value of float) and to keep it as compact as possible.
P.S. The iterable is not sorted.
Perhaps build a single, structured array using np.fromiter:
import numpy as np
def gendata():
    # You, of course, have a different gendata...
    for i in xrange(N):
        yield (np.random.random(), str(i))

N = 100
arr = np.fromiter(gendata(), dtype='<f8,|S20')
Sorting it by the first column, using the second for tie-breakers will take O(N log N) time:
arr.sort(order=['f0','f1'])
Finding the row by the value in the first column can be done with searchsorted in O(log N) time:
# Some pseudo-random value in arr['f0']
val = arr['f0'][10]
print(arr[10])
# (0.049875262239617246, '46')
idx = arr['f0'].searchsorted(val)
print(arr[idx])
# (0.049875262239617246, '46')
You've asked many important questions in the comments; let me attempt to answer them here:
The basic dtypes are explained in the numpybook. There may be one or two extra dtypes (like float16) that have been added since that book was written, but the basics are all explained there. Perhaps a more thorough discussion is in the online documentation, which is a good supplement to the examples you mentioned here.
Dtypes can be used to define structured arrays with column names, or with default column names. 'f0', 'f1', etc. are the default column names. Since I defined the dtype as '<f8,|S20' and failed to provide column names, NumPy named the first column 'f0' and the second 'f1'. If we had used
dtype=[('fval','<f8'), ('text','|S20')]
then the structured array arr would have column names 'fval' and 'text'.
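For example (a small sketch of the named-column variant):

import numpy as np

dt = np.dtype([('fval', '<f8'), ('text', '|S20')])
arr = np.array([(0.25, b'foo'), (0.75, b'bar')], dtype=dt)

print(arr['fval'])   # [0.25 0.75]
print(arr['text'])   # [b'foo' b'bar'] (bytes, because |S20 is a fixed-width byte string)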
Unfortunately, the dtype has to be fixed at the time np.fromiter is called. You
could conceivably iterate through gendata once to discover the
maximum length of the strings, build your dtype and then call
np.fromiter (and iterate through gendata a second time), but
that's rather burdensome. It is of course better if you know in
advance the maximum size of the strings. (|S20 defines the string
field as having a fixed length of 20 bytes.)
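A sketch of that two-pass idea, assuming gendata() can be called twice and yields the same (float, string) pairs each time:

import numpy as np

# First pass: measure the longest string.
maxlen = max(len(s) for _, s in gendata())

# Second pass: build the structured array with an exactly-sized string field.
dt = np.dtype([('fval', '<f8'), ('text', 'S%d' % maxlen)])
arr = np.fromiter(gendata(), dtype=dt)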
NumPy arrays place data of a
pre-defined size in arrays of a fixed size. Think of the array (even multidimensional ones) as a contiguous block of one-dimensional memory. (That's an oversimplification -- there are non-contiguous arrays -- but will help your imagination for the following.) NumPy derives much of its speed by taking advantage of the fixed sizes (set by the dtype) to quickly compute the offsets needed to access elements in the array. If the strings had variable sizes, then it
would be hard for NumPy to find the right offsets. By hard, I mean
NumPy would need an index or somehow be redesigned. NumPy is simply not
built this way.
NumPy does have an object dtype which allows you to place a pointer (4 or 8 bytes, depending on the platform) to any Python object you desire. This way, you can have NumPy
arrays with arbitrary Python data. Unfortunately, the np.fromiter
function does not allow you to create arrays of dtype object. I'm not sure why there is this restriction...
Note that np.fromiter has better performance when the count is
specified. By knowing the count (the number of rows) and the
dtype (and thus the size of each row) NumPy can pre-allocate
exactly enough memory for the resultant array. If you do not specify
the count, then NumPy will make a guess for the initial size of the
array, and if too small, it will try to resize the array. If the
original block of memory can be extended you are in luck. But if
NumPy has to allocate an entirely new hunk of memory then all the old
data will have to be copied to the new location, which will slow down
the performance significantly.
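For instance (a sketch reusing the gendata generator from above, with a made-up row count):

import numpy as np

N = 100
arr = np.fromiter(gendata(), dtype='<f8,|S20', count=N)   # pre-allocates exactly N rows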
Here is a way to build N separate arrays out of a generator of N-tuples:
import numpy as np
import itertools as IT
def gendata():
    # You, of course, have a different gendata...
    N = 100
    for i in xrange(N):
        yield (np.random.random(), str(i))

def fromiter(iterable, dtype, chunksize=7):
    chunk = np.fromiter(IT.islice(iterable, chunksize), dtype=dtype)
    result = [chunk[name].copy() for name in chunk.dtype.names]
    size = len(chunk)
    while True:
        chunk = np.fromiter(IT.islice(iterable, chunksize), dtype=dtype)
        N = len(chunk)
        if N == 0:
            break
        newsize = size + N
        for arr, name in zip(result, chunk.dtype.names):
            col = chunk[name]
            arr.resize(newsize, refcheck=0)
            arr[size:] = col
        size = newsize
    return result
x, y = fromiter(gendata(), '<f8,|S20')
order = np.argsort(x)
x = x[order]
y = y[order]
# Some pseudo-random value in x
N = 10
val = x[N]
print(x[N], y[N])
# (0.049875262239617246, '46')
idx = x.searchsorted(val)
print(x[idx], y[idx])
# (0.049875262239617246, '46')
The fromiter function above reads the iterable in chunks (of size chunksize). It calls the NumPy array method resize to extend the resultant arrays as necessary.
I used a small default chunksize since I was testing this code on small data. You, of course, will want to either change the default chunksize or pass a chunksize parameter with a larger value.