How can I build a numpy array out of a generator object?
Let me illustrate the problem:
>>> import numpy
>>> def gimme():
...     for x in xrange(10):
...         yield x
...
>>> gimme()
<generator object at 0x28a1758>
>>> list(gimme())
[0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
>>> numpy.array(xrange(10))
array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])
>>> numpy.array(gimme())
array(<generator object at 0x28a1758>, dtype=object)
>>> numpy.array(list(gimme()))
array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])
In this instance, gimme() is the generator whose output I'd like to turn into an array. However, the array constructor does not iterate over the generator; it simply stores the generator itself. The behaviour I want is that of numpy.array(list(gimme())), but I don't want to pay the memory overhead of having the intermediate list and the final array in memory at the same time. Is there a more space-efficient way?
One Google search beyond this Stack Overflow result, I found that there is numpy.fromiter(data, dtype, count). The default count=-1 takes all elements from the iterable. It requires the dtype to be set explicitly. In my case, this worked:
numpy.fromiter(something.generate(from_this_input), float)
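If you can predict how many items the generator will yield, passing count as well lets numpy.fromiter preallocate the output array instead of growing it. A minimal sketch, reusing the gimme() generator from the question (ported to range for Python 3):
import numpy
def gimme(n=10):
    for x in range(n):
        yield x
# count preallocates the output, so no intermediate list is built
arr = numpy.fromiter(gimme(10), dtype=float, count=10)
# array([0., 1., 2., 3., 4., 5., 6., 7., 8., 9.])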
Numpy arrays require their length to be set explicitly at creation time, unlike Python lists. This is necessary so that space for each item can be allocated consecutively in memory. Consecutive allocation is the key feature of numpy arrays: this, combined with the native-code implementation, lets operations on them execute much faster than on regular lists.
Keeping this in mind, it is technically impossible to take a generator object and turn it into an array unless you either:
1. can predict how many elements it will yield when run:
my_array = numpy.empty(predict_length())
for i, el in enumerate(gimme()): my_array[i] = el
2. are willing to store its elements in an intermediate list:
my_array = numpy.array(list(gimme()))
3. can make two identical generators, run through the first one to find the total length, initialize the array, and then run through the generator again to find each element:
length = sum(1 for el in gimme())
my_array = numpy.empty(length)
for i, el in enumerate(gimme()): my_array[i] = el
1 is probably what you're looking for. 2 is space inefficient, and 3 is time inefficient (you have to go through the generator twice).
While you can create a 1D array from a generator with numpy.fromiter(), you can create an N-D array from a generator with numpy.stack:
>>> mygen = (np.ones((5, 3)) for _ in range(10))
>>> x = numpy.stack(mygen)
>>> x.shape
(10, 5, 3)
It also works for 1D arrays:
>>> numpy.stack(2*i for i in range(10))
array([ 0, 2, 4, 6, 8, 10, 12, 14, 16, 18])
Note that numpy.stack internally consumes the generator and creates an intermediate list with arrays = [asanyarray(arr) for arr in arrays]. The implementation can be found here.
[WARNING]
As pointed out by #Joseh Seedy, NumPy 1.16 raises a warning when these functions are called with generators, which defeats this usage.
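If you run into that warning (or an outright error on newer NumPy versions), a simple workaround is to materialize the generator yourself before stacking; a sketch of the obvious fix, at the cost of the intermediate list numpy.stack would have built anyway:
import numpy as np
mygen = (np.ones((5, 3)) for _ in range(10))
x = np.stack(list(mygen))   # consume the generator into a list first
x.shape                     # (10, 5, 3)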
Somewhat tangential, but if your generator is a list comprehension, you can use numpy.where to get your result more efficiently (I discovered this in my own code after seeing this post).
The vstack, hstack, and dstack functions can take as input generators that yield multi-dimensional arrays.
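For example, with vstack (wrapping the generator in list() as a precaution, since the NumPy 1.16 caveat above applies to these functions too):
import numpy as np
blocks = (np.full((2, 3), i) for i in range(4))   # generator of 2x3 arrays
stacked = np.vstack(list(blocks))
stacked.shape   # (8, 3)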
Related
I have an array:
a = [1, 3, 5, 7, 29 ... 5030, 6000]
This array gets created from a previous process, and the length of the array could be different (it depends on user input).
I also have an array:
b = [3, 15, 67, 78, 138]
(Which could also be completely different)
I want to use the array b to slice the array a into multiple arrays.
More specifically, I want the result arrays to be:
array1 = a[:3]
array2 = a[3:15]
...
arrayn = a[138:]
Where n = len(b).
My first thought was to create a 2D array slices with dimensions (len(b), something). However, we don't know this something beforehand, so I assigned it the value len(a), as that is the maximum number of elements it could contain.
I have this code:
slices = np.zeros((len(b), len(a)))
for i in range(1, len(b)):
    slices[i] = a[b[i-1]:b[i]]
But I get this error:
ValueError: could not broadcast input array from shape (518) into shape (2253412)
You can use numpy.split:
np.split(a, b)
Example:
np.split(np.arange(10), [3,5])
# [array([0, 1, 2]), array([3, 4]), array([5, 6, 7, 8, 9])]
b.insert(0, 0)
result = []
for i in range(1, len(b)):
    sub_list = a[b[i-1]:b[i]]
    result.append(sub_list)
result.append(a[b[-1]:])
You are getting the error because you are attempting to create a ragged array. This is not allowed in numpy.
An improvement on #Bohdan's answer:
from itertools import zip_longest
result = [a[start:end] for start, end in zip_longest(np.r_[0, b], b)]
The trick here is that zip_longest makes the final slice go from b[-1] to None, which is equivalent to a[b[-1]:], removing the need for special processing of the last element.
Please do not select this. This is just a thing I added for fun. The "correct" answer is #Psidom's answer.
What is the most efficient way to remove the last element from a 1-dimensional NumPy array? (like pop for a list)
NumPy arrays have a fixed size, so you cannot remove an element in-place. For example using del doesn't work:
>>> import numpy as np
>>> arr = np.arange(5)
>>> del arr[-1]
ValueError: cannot delete array elements
Note that the index -1 represents the last element. That's because negative indices in Python (and NumPy) are counted from the end, so -1 is the last, -2 is the one before last and -len is actually the first element. That's just for your information in case you didn't know.
Python lists are variable sized so it's easy to add or remove elements.
So if you want to remove an element you need to create a new array or view.
Creating a new view
You can create a new view containing all elements except the last one using the slice notation:
>>> arr = np.arange(5)
>>> arr
array([0, 1, 2, 3, 4])
>>> arr[:-1] # all but the last element
array([0, 1, 2, 3])
>>> arr[:-2] # all but the last two elements
array([0, 1, 2])
>>> arr[1:] # all but the first element
array([1, 2, 3, 4])
>>> arr[1:-1] # all but the first and last element
array([1, 2, 3])
However a view shares the data with the original array, so if one is modified so is the other:
>>> sub = arr[:-1]
>>> sub
array([0, 1, 2, 3])
>>> sub[0] = 100
>>> sub
array([100, 1, 2, 3])
>>> arr
array([100, 1, 2, 3, 4])
Creating a new array
1. Copy the view
If you don't like this memory sharing, you have to create a new array; in this case it's probably simplest to create a view and then copy it (for example using the copy() method of arrays):
>>> arr = np.arange(5)
>>> arr
array([0, 1, 2, 3, 4])
>>> sub_arr = arr[:-1].copy()
>>> sub_arr
array([0, 1, 2, 3])
>>> sub_arr[0] = 100
>>> sub_arr
array([100, 1, 2, 3])
>>> arr
array([0, 1, 2, 3, 4])
2. Using integer array indexing [docs]
However, you can also use integer array indexing to remove the last element and get a new array. Integer (advanced) array indexing always creates a copy, never a view:
>>> arr = np.arange(5)
>>> arr
array([0, 1, 2, 3, 4])
>>> indices_to_keep = [0, 1, 2, 3]
>>> sub_arr = arr[indices_to_keep]
>>> sub_arr
array([0, 1, 2, 3])
>>> sub_arr[0] = 100
>>> sub_arr
array([100, 1, 2, 3])
>>> arr
array([0, 1, 2, 3, 4])
This integer array indexing can be useful to remove arbitrary elements from an array (which can be tricky or impossible when you want a view):
>>> arr = np.arange(5, 10)
>>> arr
array([5, 6, 7, 8, 9])
>>> arr[[0, 1, 3, 4]] # keep first, second, fourth and fifth element
array([5, 6, 8, 9])
If you want a generalized function that removes the last element using integer array indexing:
def remove_last_element(arr):
    return arr[np.arange(arr.size - 1)]
3. Using boolean array indexing [docs]
There is also boolean indexing that could be used, for example:
>>> arr = np.arange(5, 10)
>>> arr
array([5, 6, 7, 8, 9])
>>> keep = [True, True, True, True, False]
>>> arr[keep]
array([5, 6, 7, 8])
This also creates a copy! And a generalized approach could look like this:
def remove_last_element(arr):
    if not arr.size:
        raise IndexError('cannot remove last element of empty array')
    keep = np.ones(arr.shape, dtype=bool)
    keep[-1] = False
    return arr[keep]
If you would like more information on NumPy's indexing, the documentation on "Indexing" is quite good and covers a lot of cases.
4. Using np.delete()
Normally I wouldn't recommend the NumPy functions that "seem" like they modify the array in-place (like np.append and np.insert) but actually return copies, because these are generally needlessly slow and misleading. You should avoid them whenever possible, which is why this is the last point in my answer. However, in this case it's actually a perfect fit, so I have to mention it:
>>> arr = np.arange(10, 20)
>>> arr
array([10, 11, 12, 13, 14, 15, 16, 17, 18, 19])
>>> np.delete(arr, -1)
array([10, 11, 12, 13, 14, 15, 16, 17, 18])
5. Using np.resize()
NumPy has another method that sounds like it does an in-place operation but it really returns a new array:
>>> arr = np.arange(5)
>>> arr
array([0, 1, 2, 3, 4])
>>> np.resize(arr, arr.size - 1)
array([0, 1, 2, 3])
To remove the last element I simply provided a new shape that is 1 smaller than before, which effectively removes the last element.
Modifying the array in-place
Yes, I wrote earlier that you cannot modify an array in place. That's because in most cases it's not possible, or only possible by disabling some (completely useful) safety checks. I'm not sure about the internals, but depending on the old size and the new size this could involve an (internal-only) copy operation, so it might be slower than creating a view.
Using np.ndarray.resize()
If the array doesn't share its memory with any other array, then it's possible to resize the array in place:
>>> arr = np.arange(5, 10)
>>> arr.resize(4)
>>> arr
array([5, 6, 7, 8])
However, that will throw a ValueError if the array is referenced by another array as well:
>>> arr = np.arange(5)
>>> view = arr[1:]
>>> arr.resize(4)
ValueError: cannot resize an array that references or is referenced by another array in this way. Use the resize function
You can disable that safety check by setting refcheck=False, but that shouldn't be done lightly: it makes you vulnerable to segmentation faults and memory corruption if the other reference tries to access the removed elements. The refcheck argument should be treated as an expert-only option!
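For completeness, this is what that expert-only escape hatch looks like; a sketch only, since accessing the other reference afterwards is exactly the unsafe part:
arr = np.arange(5)
view = arr[1:]
arr.resize(4, refcheck=False)   # skips the safety check; reading 'view' afterwards is unsafe
arr                             # array([0, 1, 2, 3])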
Summary
Creating a view is really fast and doesn't take much additional memory, so whenever possible you should try to work with views. However, depending on the use case, it's not so easy to remove arbitrary elements using basic slicing. While it's easy to remove the first n elements and/or the last n elements, or to keep every xth element (the step argument for slicing), that is about all you can do with it.
But in your case of removing the last element of a one-dimensional array I would recommend:
arr[:-1] # if you want a view
arr[:-1].copy() # if you want a new array
because these most clearly express the intent and everyone with Python/NumPy experience will recognize that.
Timings
Based on the timing framework from this answer:
# Setup
import numpy as np
def view(arr):
    return arr[:-1]
def array_copy_view(arr):
    return arr[:-1].copy()
def array_int_index(arr):
    return arr[np.arange(arr.size - 1)]
def array_bool_index(arr):
    if not arr.size:
        raise IndexError('cannot remove last element of empty array')
    keep = np.ones(arr.shape, dtype=bool)
    keep[-1] = False
    return arr[keep]
def array_delete(arr):
    return np.delete(arr, -1)
def array_resize(arr):
    return np.resize(arr, arr.size - 1)
# Timing setup
timings = {view: [],
           array_copy_view: [], array_int_index: [], array_bool_index: [],
           array_delete: [], array_resize: []}
sizes = [2**i for i in range(1, 20, 2)]
# Timing
for size in sizes:
    print(size)
    func_input = np.random.random(size=size)
    for func in timings:
        print(func.__name__.ljust(20), ' ', end='')
        res = %timeit -o func(func_input)  # if you use IPython, otherwise use the "timeit" module
        timings[func].append(res)
# Plotting
%matplotlib notebook
import matplotlib.pyplot as plt
import numpy as np
fig = plt.figure(1)
ax = plt.subplot(111)
for func in timings:
    ax.plot(sizes,
            [time.best for time in timings[func]],
            label=func.__name__)
ax.set_xscale('log')
ax.set_yscale('log')
ax.set_xlabel('size')
ax.set_ylabel('time [seconds]')
ax.grid(which='both')
ax.legend()
plt.tight_layout()
I get the following timings as a log-log plot to cover all the details; lower time still means faster, but the range between two ticks represents one order of magnitude instead of a fixed amount. In case you're interested in the specific values, I copied them into this gist.
According to these timings those two approaches are also the fastest. (Python 3.6 and NumPy 1.14.0)
If you just want to quickly get the array without its last element (not removing it explicitly), use slicing:
array[:-1]
To delete the last element from a 1-dimensional NumPy array, use the numpy.delete method, like so:
import numpy as np
# Create a 1-dimensional NumPy array that holds 5 values
values = np.array([1, 2, 3, 4, 5])
# Remove the last element of the array using the numpy.delete method
values = np.delete(values, -1)
print(values)
Output:
[1 2 3 4]
The last value of the NumPy array, which was 5, is now removed.
Is there a more Pythonic way to tell the list which parts of it have to stay and which parts have to be removed?
li = [1,2,3,4,5,6,7]
Wanted list:
[1,2,3,6,7]
I can do that this way:
wl = li[:-4]+li[-2:]
I'm looking for something like li[:-4,-2:] (in one statement/command)
Of course I can use remove, but the same kind of thing comes up in many situations, like:
Wanted list:
[3,4,5,6,7]
I can do del li[0:2]
But it's more common to do:
li[2:]
Unlike regular Python lists, NumPy arrays can be indexed by other sequence-like objects (other than tuples), e.g. by regular Python lists or by another NumPy array.
import numpy as np
li = np.arange(1, 8)
# array([1, 2, 3, 4, 5, 6, 7])
wl = li[[0,1,2,5,6]]
# array([1, 2, 3, 6, 7])
Of course, this leaves you with the problem of creating the index sequence (the regular Python list [0,1,2,5,6] in this example), which puts you back at square one. (Unless you need to access several NumPy arrays at the same indices, so you create this index list once and then re-use it.)
You should probably only consider this if you have additional reasons to use NumPy in general or specifically NumPy arrays.
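That said, if the parts you want are themselves expressible as slices (as in li[:-4] plus li[-2:]), numpy.r_ can build the index array from slice notation for you, which avoids spelling out the indices by hand; a small sketch:
import numpy as np
li = np.arange(1, 8)            # array([1, 2, 3, 4, 5, 6, 7])
wl = li[np.r_[0:3, 5:7]]        # indices 0, 1, 2 and 5, 6
# array([1, 2, 3, 6, 7])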
If you want the output list to follow a certain logic, you can use the filter function.
filter(lambda x: x > 2, li)
or maybe
filter(lambda x: x < 4 or x > 5, li)
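Note that in Python 3 filter returns a lazy iterator, so wrap it in list() if you need an actual list:
li = [1, 2, 3, 4, 5, 6, 7]
wl = list(filter(lambda x: x < 4 or x > 5, li))
# [1, 2, 3, 6, 7]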
What is the best way to improve this code:
def my_func(x, y):
    ... do smth ...
    return cmp(x', y')
my_list = range(0, N)
my_list.sort(cmp=my_func)
A Python list takes a lot of memory compared with a numpy array (6800 MB vs 700 MB), but numpy.array doesn't have a sort function with a cmp argument.
Are there other ways to improve memory usage or sort numpy's array with my cmp function?
Update: my current solution is a C function (shared with SWIG) that sorts a huge array of integers and returns it to python after sorting.
But I hope that there is some way to implement memory efficient sorting of huge datasets with Python. Any ideas?
If you can write a ufunc that converts your array, you can do a fast sort with argsort:
b = convert(a)
idx = np.argsort(b)
sort_a = a[idx]
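For example, if the comparison in my_func boils down to comparing some transformed value of each element, you can apply that transform as a vectorized operation and sort by its argsort, with no Python-level cmp at all. A sketch with np.abs standing in for whatever the real conversion is:
import numpy as np
a = np.array([3, -1, -7, 2, 5])
b = np.abs(a)          # the vectorized "convert" step
idx = np.argsort(b)    # order induced by the converted values
sort_a = a[idx]        # array([-1,  2,  3,  5, -7])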
As an alternative, you can use the built-in sorted with a numpy array:
>>> a = np.arange(10, 1, -1)
>>> sorted(a, cmp=lambda a,b: cmp(a,b))
[2, 3, 4, 5, 6, 7, 8, 9, 10]
It is not in-place, so you have 1400 MB compared to 6800 MB.
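Note that sorted() lost the cmp argument in Python 3; there the equivalent is functools.cmp_to_key. A sketch, with a plain numeric comparison standing in for the custom logic:
import functools
import numpy as np
def my_cmp(x, y):
    # placeholder for the real comparison of the transformed values
    return (x > y) - (x < y)
a = np.arange(10, 1, -1)
result = sorted(a, key=functools.cmp_to_key(my_cmp))
# [2, 3, 4, 5, 6, 7, 8, 9, 10]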
Say I have an array with a couple hundred elements. I need to iterate over the array and replace one or more items in the array with some other item. Which strategy is more efficient in Python in terms of speed (I'm not worried about memory)?
For example: I have an array
my_array = [1,2,3,4,5,6]
I want to replace the first 3 elements with one element with the value 123.
Option 1 (inline):
my_array = [1,2,3,4,5,6]
my_array.remove(0,3)
my_array.insert(0,123)
Option2 (new array creation):
my_array = [1,2,3,4,5,6]
my_array = my_array[3:]
my_array.insert(0,123)
Both of the above options will give a result of:
>>> [123,4,5,6]
Any comments would be appreciated. Especially if there are options I have missed.
If you want to replace an item or a set of items in a list, you should never use your first option. Removing and adding to a list in the middle is slow (reference). Your second option is also fairly inefficient, since you're doing two operations for a single replacement.
Instead, just do slice assignment, as eiben's answer instructs. This will be significantly faster and more efficient than either of your methods:
>>> my_array = [1,2,3,4,5,6]
>>> my_array[:3] = [123]
>>> my_array
[123, 4, 5, 6]
arr[0] = x
replaces the 0th element with x. You can also replace whole slices.
>>> arr = [1, 2, 3, 4, 5, 6]
>>> arr[0:3] = [8, 9, 99]
>>> arr
[8, 9, 99, 4, 5, 6]
>>>
And generally it's unclear what you're trying to achieve. Please provide more information or an example.
OK, as for your update. The remove method doesn't work (remove needs one argument). But the slicing I presented works for your case too:
>>> arr
[8, 9, 99, 4, 5, 6]
>>> arr[0:3] = [4]
>>> arr
[4, 4, 5, 6]
I would guess it's the fastest method, but do try it with timeit. According to my tests it's twice as fast as your "new array" approach.
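If you want to verify that on your own data, something along these lines with the timeit module works (numbers will of course vary with list size and Python version):
import timeit
def slice_assign():
    my_array = [1, 2, 3, 4, 5, 6]
    my_array[0:3] = [123]
    return my_array
def new_list():
    my_array = [1, 2, 3, 4, 5, 6]
    my_array = my_array[3:]
    my_array.insert(0, 123)
    return my_array
print(timeit.timeit(slice_assign, number=100000))
print(timeit.timeit(new_list, number=100000))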
If you're looking for speed efficiency and are manipulating series of integers, you should use the standard array module instead:
>>> import array
>>> my_array = array.array('i', [1,2,3,4,5,6])
>>> my_array = my_array[3:]
>>> my_array.insert(0,123)
>>> my_array
array('i', [123, 4, 5, 6])
The key thing is to avoid moving large numbers of list items more than you absolutely have to. Slice assignment, as far as I'm aware, still involves moving the items after the slice around, which is bad news.
How do you recognise when you have a sequence of items which need to be replaced? I'll assume you have a function like:
def replacement(objects, startIndex):
    "returns a pair (numberOfObjectsToReplace, replacementObject), or None if there should be no replacement"
I'd then do:
def replaceAll(objects):
    src = 0
    dst = 0
    while src < len(objects):
        replacementInfo = replacement(objects, src)
        if replacementInfo is not None:
            numberOfObjectsToReplace, replacementObject = replacementInfo
        else:
            numberOfObjectsToReplace = 1
            replacementObject = objects[src]
        objects[dst] = replacementObject
        src = src + numberOfObjectsToReplace
        dst = dst + 1
    del objects[dst:]
This code still does a few more loads and stores than it absolutely has to, but not many.
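For the example from the question (collapse the leading run 1, 2, 3 into a single 123), a hypothetical replacement function used with replaceAll above could look like this:
def replacement(objects, startIndex):
    # hypothetical rule: collapse the run [1, 2, 3] starting here into a single 123
    if objects[startIndex:startIndex + 3] == [1, 2, 3]:
        return (3, 123)
    return None
my_array = [1, 2, 3, 4, 5, 6]
replaceAll(my_array)
# my_array is now [123, 4, 5, 6]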