What is the difference between Python's range() and NumPy's arange() functions?

I read in a web search that numpy.arange takes less space than Python's range function, but when I tried the code below I got a different result.
import sys
import numpy as np

x = range(1, 10000)
print(sys.getsizeof(x))  # --> Output is 48

a = np.arange(1, 10000, 1, dtype=np.int8)
print(sys.getsizeof(a))  # --> Output is 10095
Could anyone please explain?

In Python 3, range is an object that can generate a sequence of numbers; it is not the actual sequence. You may need to brush up on some basic Python reading, paying attention to things like lists and generators, and their differences.
In [359]: x = range(3)
In [360]: x
Out[360]: range(0, 3)
We have to use something like list() or a list comprehension to actually create those numbers:
In [361]: list(x)
Out[361]: [0, 1, 2]
In [362]: [i for i in x]
Out[362]: [0, 1, 2]
A range is often used in a for i in range(3): print(i) kind of loop.
arange is a numpy function that produces a numpy array:
In [363]: arr = np.arange(3)
In [364]: arr
Out[364]: array([0, 1, 2])
We can iterate on such an array, but it is slower than [362]:
In [365]: [i for i in arr]
Out[365]: [0, 1, 2]
But for doing math, the array is much better:
In [366]: arr * 10
Out[366]: array([ 0, 10, 20])
The array can also be created from the list in [361] (and, for compatibility with earlier Py2 usage, from the range itself):
In [376]: np.array(list(x)) # np.array(x)
Out[376]: array([0, 1, 2])
But this is slower than using arange directly (that's an implementation detail).
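If you want to verify the speed claim yourself, a rough sketch using IPython's %timeit (exact timings depend on your machine):
%timeit np.array(list(range(10000)))  # builds a Python list first, then copies it into an array
%timeit np.arange(10000)              # fills the array directly; typically much faster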
Despite the similarity in names, these shouldn't be seen as simple alternatives. Use range in basic Python constructs such as for loops and comprehensions. Use arange when you need an array.
An important innovation in Python (compared to earlier languages) is that we can iterate directly on a list; we don't have to step through indices. And if we need indices along with values, we can use enumerate:
In [378]: alist = ['a','b','c']
In [379]: for i in range(3): print(alist[i]) # index iteration
a
b
c
In [380]: for v in alist: print(v) # iterate on list directly
a
b
c
In [381]: for i,v in enumerate(alist): print(i,v) # index and values
0 a
1 b
2 c
Thus you might not see range used that much in basic Python code.

The range type constructor creates range objects, which represent sequences of integers with a start, stop, and step in a space-efficient manner, calculating the values on the fly.
The np.arange function returns a numpy.ndarray object, which is essentially a wrapper around a primitive array. This is a fast and relatively compact representation compared to creating a Python list with list(range(N)). But range objects are more space-efficient still; indeed, they take constant space, so for all practical purposes range(a) is the same size as range(b) for any integers a, b.
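You can see the constant-space claim directly (the sizes shown are from a 64-bit CPython build and may differ on yours):
import sys

print(sys.getsizeof(range(10)))     # 48
print(sys.getsizeof(range(10**9)))  # 48 -- same size, regardless of length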
As an aside, take care when interpreting the results of sys.getsizeof; you must understand what it actually measures. For a list it counts only the list object and its array of element pointers, not the elements themselves, whereas an ndarray's size includes its data buffer. So do not naively compare the sizes of Python lists and numpy.ndarray objects, for example.
Perhaps whatever you read was referring to Python 2, where range returned a list. List objects generally require more space than numpy.ndarray objects.

arange stores each individual value of the array, while range stores only three values (start, stop, and step). That is why arange takes more space than range.
Since the question is about size, that is the answer.
But NumPy arrays and arange have many advantages over Python lists from a speed, space, and efficiency perspective.
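To make this concrete, a small sketch (the getsizeof value is platform-dependent, but nbytes is exact):
import sys
import numpy as np

print(np.arange(10000, dtype=np.int8).nbytes)   # 10000 -- one byte per element
print(np.arange(10000, dtype=np.int64).nbytes)  # 80000 -- eight bytes per element
print(sys.getsizeof(range(10000)))              # 48 -- only start, stop, and step are stored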

Related

Increment all entries in an array by 'n' without a for loop

I have an array:
arr = [5,5,5,5,5,5]
I want to increment a particular range of arr by n. So if n=2 and the range is [2,5], the array should look like this:
arr = [5,5,7,7,7,5]
I need to do this without a for loop, for a problem I'm trying to solve.
Tried:
arr[2:5] = [n]*3
but that obviously replaces the entries rather than incrementing them, and the array becomes:
arr = [5,5,2,2,2,5]
Any suggestions would be highly appreciated.
n = 2
arr_range = slice(2, 5)
arr = [5, 5, 5, 5, 5, 5]
arr[arr_range] = map(lambda x: x + n, arr[arr_range])
# arr
# [5, 5, 7, 7, 7, 5]
But I would recommend using numpy...
import numpy as np

n = 2
arr_range = slice(2, 5)
arr = np.array([5, 5, 5, 5, 5, 5])
arr[arr_range] += n
# arr -> array([5, 5, 7, 7, 7, 5])
You actually have a list, not an array. If you convert it to a Numpy array it is simple.
>>> n=3
>>> arr = np.array([5,5,5,5,5,5])
>>> arr[2:5] += n
>>> arr
array([5, 5, 8, 8, 8, 5])
You have basically two options (for code see below):
Use slice assignment via a list comprehension (a[:] = [x+1 for x in a]),
Use a for-loop (even though you exclude this in your question, I don't see a legitimate reason for doing so).
They come with pros and cons. Let's assume you are going to replace some fraction of the list items (as opposed to a fixed number of items). The for-loop runs in Python and hence might be slower, but it has O(1) memory usage. The list comprehension and the slice assignment both operate in C (assuming you are using CPython), but they have O(N) memory usage due to the temporary list.
Using a generator doesn't buy anything since it is converted to a list anyway before the assignment happens (this is necessary because if the generator had fewer or more items than the slice, the list would need to be resized accordingly; see the source code).
Using a map adds even more overhead since it needs to call the mapped function on every item.
The following is a performance comparison of the different methods. The for-loop is fastest for very small lists since it has minimal overhead (just the range object). For more than about a dozen items, the list comprehension clearly outperforms the other methods and especially for larger lists (len(a) > 3e5) the difference to the generator becomes noticeable (the generator cannot provide information about its size, so the generated list needs to be resized as more items are fetched). For very large lists the difference between for-loop and list comprehension seems to shrink again since the memory overhead tends to outweigh the loop cost, but reaching that point would require unusually large lists (where you'd be better off using something like Numpy anyway).
This is the code using the perfplot package:
import numpy
import perfplot

def use_generator(a):
    i = slice(0, len(a) // 2)
    a[i] = (x + 1 for x in a[i])

def use_map(a):
    i = slice(0, len(a) // 2)
    a[i] = map(lambda x: x + 1, a[i])

def use_list(a):
    i = slice(0, len(a) // 2)
    a[i] = [x + 1 for x in a[i]]

def use_loop(a):
    for i in range(len(a) // 2):
        a[i] += 1

perfplot.show(
    setup=lambda n: [0] * n,
    kernels=[use_generator, use_map, use_list, use_loop],
    n_range=[2**k for k in range(1, 26)],
    xlabel="len(a)",
    equality_check=None,
)

Iterate over numpy with index (numpy equivalent of python enumerate)

I'm trying to create a function that will calculate the lattice distance (number of horizontal and vertical steps) between elements in a multi-dimensional numpy array. For this I need to retrieve the actual numbers from the indexes of each element as I iterate through the array. I want to store those values as numbers that I can run through a distance formula.
For the example array A
A=np.array([[1,2,3],[4,5,6],[7,8,9]])
I'd like to create a loop that iterates through each element and for the first element 1 it would retrieve a=0, b=0 since 1 is at A[0,0], then a=0, b=1 for element 2 as it is located at A[0,1], and so on...
My envisioned output is two numbers (corresponding to the two index values for that element) for each element in the array. So in the example above, it would be the two values that I am assigning to be a and b. I only will need to retrieve these two numbers within the loop (rather than save separately as another data object).
Any thoughts on how to do this would be greatly appreciated!
As I've become more familiar with the NumPy and pandas ecosystem, it's become clearer to me that iteration is usually outright wrong due to how slow it is in comparison, and that writing vectorized operations is best whenever possible. Though the style is not as obvious/Pythonic at first, I've (anecdotally) gained ridiculous speedups with vectorized operations; more than 1000x in one case, by swapping out a row-iteration .apply(lambda) pattern.
@MSeifert's answer provides this much better and will be significantly more performant on a dataset of any real size.
There is also a more general answer by @cs95 covering and comparing alternatives to iteration in pandas.
Original Answer
You can iterate through the values in your array with numpy.ndenumerate to get the indices of the values in your array.
Using the documentation above:
import numpy as np

A = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
for index, value in np.ndenumerate(A):
    print(index, value)  # operate here
You can do it using np.ndenumerate but generally you don't need to iterate over an array.
You can simply create a meshgrid (or open grid) to get all indices at once and you can then process them (vectorized) much faster.
For example
>>> x, y = np.mgrid[slice(A.shape[0]), slice(A.shape[1])]
>>> x
array([[0, 0, 0],
       [1, 1, 1],
       [2, 2, 2]])
>>> y
array([[0, 1, 2],
       [0, 1, 2],
       [0, 1, 2]])
and these can be processed like any other array. So if your function that needs the indices can be vectorized you shouldn't do the manual loop!
For example to calculate the lattice distance for each point to a point say (2, 3):
>>> abs(x - 2) + abs(y - 3)
array([[5, 4, 3],
       [4, 3, 2],
       [3, 2, 1]])
For distances an ogrid would be faster. Just replace np.mgrid with np.ogrid:
>>> x, y = np.ogrid[slice(A.shape[0]), slice(A.shape[1])]
>>> np.hypot(x - 2, y - 3) # cartesian distance this time! :-)
array([[ 3.60555128,  2.82842712,  2.23606798],
       [ 3.16227766,  2.23606798,  1.41421356],
       [ 3.        ,  2.        ,  1.        ]])
Another possible solution:
import numpy as np

A = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
for _, val in np.ndenumerate(A):
    ind = np.argwhere(A == val)
    print(val, ind)
In this case you will obtain the array of indices at which each value appears, which also covers values that occur in the array more than once.

Assembling a numpy vector

How to add a numpy array A to elements of a numpy array B with indices given by an index array C?
Ideally, I can write:
A=np.zeros(4,float)
B=np.array([1,2,3,4])
C=np.array([1,2,1,3])
A[C] += B
print(A)
output:
[0, 4, 2, 4]
but it doesn't work, since (according to the documentation) A[C] is a copy.
(I only wonder why it does in fact work when each index in C appears only once.)
I need this to be fast (for big arrays).
It looks like your example was supposed to be
A = np.zeros(4, dtype=float)
B = np.array([1, 2, 3, 4])
C = np.array([1, 2, 1, 3])
A[C] += B
print(A)
If so, then instead of +=, you want numpy.add.at. add.at does what += does, but with repeated indices handled the way you want. Similar constructs work for other operators, e.g. subtract.at for -=.
numpy.add.at(A, C, B)
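Putting it together with the arrays from the question reproduces the desired output:
import numpy as np

A = np.zeros(4, dtype=float)
B = np.array([1, 2, 3, 4])
C = np.array([1, 2, 1, 3])
np.add.at(A, C, B)  # repeated indices in C each accumulate into A
print(A)            # [0. 4. 2. 4.]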

Sorting based on one of the list among Nested list in python

I have a list such as [[4,5,6],[2,3,1]]. Now I want to sort the list based on list[1], i.e. the output should be [[6,4,5],[1,2,3]]. So basically I am sorting 2,3,1 and reordering list[0] to keep the pairs together.
While searching I found a function that sorts based on the first element of every sublist, but not this. Also, I do not want to recreate the list as [[4,2],[5,3],[6,1]] and then use that function.
Since [4, 5, 6] and [2, 3, 1] serve two different purposes, I will make a function taking two arguments: the list to be reordered, and the list whose sorting will decide the order. I'll only return the reordered list.
This answer has timings of three different solutions for creating a permutation list for a sort. Using the fastest option gives this solution:
def pyargsort(seq):
    return sorted(range(len(seq)), key=seq.__getitem__)

def using_pyargsort(a, b):
    "Reorder the list a the same way as list b would be reordered by a normal sort"
    return [a[i] for i in pyargsort(b)]

print(using_pyargsort([4, 5, 6], [2, 3, 1]))  # [6, 4, 5]
The pyargsort method is inspired by the numpy argsort method, which does the same thing much faster. Numpy also has advanced indexing operations whereby an array can be used as an index, making possible very quick reordering of an array.
So if your need for speed is great, one would assume that this numpy solution would be faster:
import numpy as np

def using_numpy(a, b):
    "Reorder the list a the same way as list b would be reordered by a normal sort"
    return np.array(a)[np.argsort(b)].tolist()

print(using_numpy([4, 5, 6], [2, 3, 1]))  # [6, 4, 5]
However, for short lists (length < 1000), this solution is in fact slower than the first. This is because we first convert the a and b lists to arrays and then convert the result back to a list before returning. If we instead assume you're using numpy arrays throughout your application, so that we do not need to convert back and forth, we get this solution:
def all_numpy(a, b):
    "Reorder array a the same way as array b would be reordered by a normal sort"
    return a[np.argsort(b)]

print(all_numpy(np.array([4, 5, 6]), np.array([2, 3, 1])))  # [6 4 5]
The all_numpy function executes up to 10 times faster than the using_pyargsort function.
The following logarithmic graph compares these three solutions with the two alternative solutions from the other answers. The arguments are two randomly shuffled ranges of equal length, and the functions all receive identically ordered lists. I'm timing only the time the function takes to execute. For illustrative purposes I've added an extra graph line for each numpy solution where the 60 ms overhead for loading numpy is added to the time.
As we can see, the all-numpy solution beats the others by an order of magnitude. Converting from python list and back slows the using_numpy solution down considerably in comparison, but it still beats pure python for large lists.
For a list length of about 1,000,000, using_pyargsort takes 2.0 seconds, using_numpy + overhead takes only 1.3 seconds, while all_numpy + overhead takes 0.3 seconds.
The sorting you describe is not very easy to accomplish. The only way that I can think of to do it is to use zip to create the list you say you don't want to create:
lst = [[4, 5, 6], [2, 3, 1]]
# key = operator.itemgetter(1) works too, and may be slightly faster ...
transpose_sort = sorted(zip(*lst), key=lambda x: x[1])
lst = list(zip(*transpose_sort))
Is there a reason for this constraint?
Also note that you could do this all in one line if you really want to:
lst = list(zip(*sorted(zip(*lst), key=lambda x: x[1])))
This results in a list of tuples. If you really want a list of lists, you can map the result:
lst = list(map(list, lst))
Or a list comprehension would work as well:
lst = [ list(x) for x in lst ]
If the second list doesn't contain duplicates, you could just do this:
l = [[4,5,6],[2,3,1]] #the list
l1 = l[1][:] #a copy of the to-be-sorted sublist
l[1].sort() #sort the sublist
l[0] = [l[0][l1.index(x)] for x in l[1]] #order the first sublist accordingly
(As this keeps a copy of the sublist l[1], it might be a bad idea if your input list is huge.)
How about this one:
a = [[4,5,6],[2,3,1]]
[a[0][i] for i in sorted(range(len(a[1])), key=lambda x: a[1][x])]
It uses the same principle numpy's argsort is built on, without having to use numpy and without the zip stuff.
Neither using numpy nor the zipping around seems to be the cheapest way for giant structures. Unfortunately the .sort() method is built into the list type and uses hard-wired access to the elements in the list (overriding __getitem__() or similar does not have any effect here).
So you can implement your own sort() which sorts two or more lists according to the values in one; this is basically what numpy does.
Or you can create a list of values to sort, sort that, and recreate the sorted original list out of it.
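A minimal sketch of that last idea: sort a permutation of indices by the key sublist, then rebuild every sublist in that order.
lst = [[4, 5, 6], [2, 3, 1]]
order = sorted(range(len(lst[1])), key=lst[1].__getitem__)  # permutation that sorts lst[1]
lst = [[row[i] for i in order] for row in lst]
# lst -> [[6, 4, 5], [1, 2, 3]]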

Pythonic way to get the first AND the last element of the sequence

What is the easiest and cleanest way to get the first AND the last elements of a sequence? E.g., I have a sequence [1, 2, 3, 4, 5], and I'd like to get [1, 5] via some kind of slicing magic. What I have come up with so far is:
l = len(s)
result = s[0:l:l-1]
I actually need this for a bit more complex task. I have a 3D numpy array, which is cubic (i.e. is of size NxNxN, where N may vary). I'd like an easy and fast way to get a 2x2x2 array containing the values from the vertices of the source array. The example above is an oversimplified, 1D version of my task.
Use this:
result = [s[0], s[-1]]
Since you're using a numpy array, you may want to use fancy indexing:
import numpy as np

a = np.arange(27)
indices = [0, -1]
b = a[indices]  # array([0, 26])
For the 3d case:
vertices = [(0,0,0), (0,0,-1), (0,-1,0), (0,-1,-1), (-1,-1,-1), (-1,-1,0), (-1,0,0), (-1,0,-1)]
indices = tuple(zip(*vertices))  # can store this for later use
a = np.arange(27).reshape((3,3,3))  # dummy array for testing; can be any shape/size :)
vertex_values = a[indices].reshape((2,2,2))
I first write down all the vertices (although I am willing to bet there is a clever way to do it using itertools which would let you scale this up to N dimensions; see the sketch after this paragraph). The order in which you specify the vertices is the order they will have in the output array. Then I "transpose" the list of vertices (using zip) so that all the x indices are together, all the y indices are together, and so on (that's how numpy likes it). At this point, you can save that index tuple and use it to index your array whenever you want the corners of your box. You can easily reshape the result into a 2x2x2 array (although the order I have is probably not the order you want).
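Here is a hedged sketch of that itertools generalization (not part of the original answer): itertools.product([0, -1], repeat=ndim) enumerates the corner index tuples for any number of dimensions.
import itertools
import numpy as np

a = np.arange(27).reshape((3, 3, 3))
vertices = itertools.product([0, -1], repeat=a.ndim)  # all 2**ndim corner index tuples
indices = tuple(zip(*vertices))                       # transpose for fancy indexing
vertex_values = a[indices].reshape((2,) * a.ndim)     # a 2x2x2 array of corner values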
This would give you a list of the first and last element in your sequence:
result = [s[0], s[-1]]
Alternatively, this would give you a tuple
result = s[0], s[-1]
With the particular case of a (N,N,N) ndarray X that you mention, would the following work for you?
s = slice(0,N,N-1)
X[s,s,s]
Example
>>> N = 3
>>> X = np.arange(N*N*N).reshape(N,N,N)
>>> s = slice(0,N,N-1)
>>> print(X[s,s,s])
[[[ 0  2]
  [ 6  8]]

 [[18 20]
  [24 26]]]
>>> from operator import itemgetter
>>> first_and_last = itemgetter(0, -1)
>>> first_and_last([1, 2, 3, 4, 5])
(1, 5)
Why do you want to use a slice? Getting each element with
result = [s[0], s[-1]]
is better and more readable.
If you really need to use the slice, then your solution is the simplest working one that I can think of.
This also works for the 3D case you've mentioned.
