I build a class with some iteration over coming data. The data are in an array form without use of numpy objects. On my code I often use .append to create another array. At some point I changed one of the big array 1000x2000 to numpy.array. Now I have an error after error. I started to convert all of the arrays into ndarray but comments like .append does not work any more. I start to have a problems with pointing to rows, columns or cells. and have to rebuild all code.
I try to google an answer to the question: "what is and advantage of using ndarray over normal array" I can't find a sensible answer. Can you write when should I start to use ndarrays and if in your practice do you use both of them or stick to one only.
Sorry if the question is a novice level, but I am new to python, just try to move from Matlab and want to understand what are pros and cons. Thanks
NumPy and Python arrays share the property of being efficiently stored in memory.
NumPy arrays can be added together, multiplied by a number, you can calculate, say, the sine of all their values in one function call, etc. As HYRY pointed out, they can also have more than one dimension. You cannot do this with Python arrays.
On the other hand, Python arrays can indeed be appended to. Note that NumPy arrays can however be concatenated together (hstack(), vstack(),…). That said, NumPy arrays are mostly meant to have a fixed number of elements.
It is common to first build a list (or a Python array) of values iteratively and then convert it to a NumPy array (with numpy.array(), or, more efficiently, with numpy.frombuffer(), as HYRY mentioned): this allows mathematical operations on arrays (or matrices) to be performed very conveniently (simple syntax for complex operations). Alternatively, numpy.fromiter() might be used to construct the array from an iterator. Or loadtxt() to construct it from a text file.
There are at least two main reasons for using NumPy arrays:
NumPy arrays require less space than Python lists. So you can deal with more data in a NumPy array (in-memory) than you can with Python lists.
NumPy arrays have a vast library of functions and methods unavailable
to Python lists or Python arrays.
Yes, you can not simply convert lists to NumPy arrays and expect your code to continue to work. The methods are different, the bool semantics are different. For the best performance, even the algorithm may need to change.
However, if you are looking for a Python replacement for Matlab, you will definitely find uses for NumPy. It is worth learning.
array.array can change size dynamically. If you are collecting data from some source, it's better to use array.array. But array.array is only one dimension, and there is no calculation functions to do with it. So, when you want to do some calculation with your data, convert it to numpy.ndarray, and use functions in numpy.
numpy.frombuffer can create a numpy.ndarray that shares the same data buffer with array.array objects, it's fast because it don't need to copy the data.
Here is a demo:
import numpy as np
import array
import random
a = array.array("d")
# a for loop that collects 10 channels data
for x in range(100):
a.extend([random.random() for _ in xrange(10)])
# create a ndarray that share the same data buffer with array a, and reshape it to 2D
na = np.frombuffer(a, dtype=float).reshape(-1, 10)
# call some numpy function to do the calculation
np.sum(na, axis=0)
Another great advantage of using NumPy arrays over built-in lists is the fact that NumPy has a C API that allows native C and C++ code to access NumPy arrays directly. Hence, many Python libraries written in low-level languages like C are expecting you to work with NumPy arrays instead of Python lists.
Reference: Python for Data Analysis: Data Wrangling with Pandas, NumPy, and IPython
Related
I'm trying to understand the differences between what people call matrices and what people call lists within lists.
Are they the same in that, once created, you can do identical things to them (reference elements the same way within them, etc).
Examples:
Making lists within a list:
ListsInLists = [[1,2],[3,4],[5,6]]
Making a multidimensional array:
np.random.rand(3,2)
Stacking arrays to make a matrix:
Array1 = [1,2,3,4]
Array2 = [5,6,7,8]
CompleteArray = vstack((Array1,Array2))
A list of list is very different from a two-dimensional Numpy array.
A list has dynamic size and can hold any type of object, whereas an array has a fixed size and entries of uniform type.
In a list of lists, each sublist can have different sizes. An array has fixed dimensions along each axis.
An array is stored in a contiguous block of memory, whereas the objects in a list can be stored anywhere on the heap.
Numpy arrays are more restrictive, but offer greater performance and memory efficiency. They also provide convenient functions for vectorised mathematical operations.
Internally, a list is represented as an array of pointers that point to arbitrary Python objects. The array uses exponential over-allocation to achieve linear performance when appending repeatedly at the end of the list. A Numpy array on the other hand is typically represented as a C array of numbers.
(This answer does not cover the special case of Numpy object arrays, which can hold any kind of Python object as well. They are rarely used, since they have the restrictions of Numpy arrays, but don't have the performance advantages.)
They are not the same. Arrays are more memory efficient in python than lists, and there are additional functions that can be performed on arrays thanks the to numpy module that you cannot perform on lists.
For calculations, working with arrays in numpy tends to be a lot faster than using built in list functions.
You can read a bit more into it if you want in the answers to this question.
array.array is a built-in type, looks like it will be much more efficient than list for some numerical tasks.
In numpy I could create a 2-d array easily, for example:
a = numpy.asarray([[1,2][3,4]], dtype='int')
But I couldn't find how to create a 2-d array using array.array, can I?
No, though you can do index math to emulate one (which is how they work behind the scenes anyway). array.array is useful for its storage and conversion capabilities but actual calculation on array contents is likely far more efficient using numpy. Of course you can also keep a list of arrays if you wish.
Just a short question that I can't find the answer to before i head off for the day,
When i do something like this:
v1 = float_list_python = ... # <some list of floats>
v2 = float_array_NumPy = ... # <some numpy.ndarray of floats>
# I guess they don't have to be floats -
# but some object that also has a native
# object in C, so that numpy can just use
# that
If i want to multiply these vectors by a scalar, my understanding has always been that the python list is a list of object references, and so looping through the list to do the multiplication must fetch the locations of all the floats, and then must get the floats in order to do it - which is one of the reasons it's slow.
If i do the same thing in NumPy, then, well, i'm not sure what happens. There are a number of things i imagine could happen:
It splits the multpilication up across the cores.
It vectorises the multications (as well?)
The documentation i've found suggests that many of the primitives in numpy take advantage of the first option there whenever they can (i don't have a computer on hand at the moment i can test it on). And my intuition tells me that number 2 should happen whenever it's possible.
So my question is, if I create a NumPy array of python objects, will it still at least perform operations on the list in parallel? I know that if you create an array of objects that have native C types, then it will actually create a contiguous array in memory of the actual objects, and that if you create an numpy array of python objects it will create an array of references, but i don't see why this would rule out parallel operations on said list, and cannot find anywhere that explicitly states that.
EDIT: I feel there's a bit of confusion over what i'm asking. I understand what vectorisation is, I understand that it is a compiler optimisation, and not something you necesarily program in (though aligning the data such that it's contiguous in memory is important). On the grounds of vectorisation, all i wanted to know was whether or not numpy uses it. If i do something like np_array1 * np_array2 does the underlying library call use vectorisation (presuming that dtype is a compatible type).
For the splitting up over the cores, all i mean there, is if i again do something like np_array1 * np_array2, but this time dtype=object: would it divide that work up amongst there cores?
numpy is fast because it performs numeric operations like this in fast compiled C code. In contrast the list operation operates at the interpreted Python level (streamlined as much as possible with Python bytecodes etc).
A numpy array of numeric type stores those numbers in a data buffer. At least in the simple cases this is just a block of bytes that C code can step through efficiently. The array also has shape and strides information that allows multidimensional access.
When you multiply the array by a scalar, it, in effect, calls a C function titled something like 'multiply_array_by_scalar', which does the multiplication in fast compiled code. So this kind of numpy operation is fast (compared to Python list code) regardless of the number of cores or other multi-processing/threading enhancements.
Arrays of objects do not have any special speed advantage (compared to lists), at least not at this time.
Look at my answer to a question about creating an array of arrays, https://stackoverflow.com/a/28284526/901925
I had to use iteration to initialize the values.
Have you done any time experiments? For example, construct an array, say (1000,2). Use tolist() to create an equivalent list of lists. And make a similar array of objects, with each object being a (2,) array or list (how much work did that take?). Now do something simple like len(x) for each of those sub lists.
#hpaulj provided a good answer to your question. In general, from reading your question it occurred to me that you do not actually understand what "vectorization" does under the hood. This writeup is a pretty decent explanation of vectorization and how it enables faster computations - http://quantess.net/2013/09/30/vectorization-magic-for-your-computations/
With regards to point 1 - Distributing computations across multiple cores, this is not always the case with Numpy. However, there are libraries like numexpr that enable multithreaded, highly efficient Numpy array computations with support for several basic logical and arithmetic operators. Numexpr can be used to turbo charge critical computations when used in conjunction with Numpy as it avoids replicating large arrays in memory for vectorization routines (as is the case for Numpy) and can use all cores on your system to perform computations.
What are the advantages of NumPy over regular Python lists?
I have approximately 100 financial markets series, and I am going to create a cube array of 100x100x100 = 1 million cells. I will be regressing (3-variable) each x with each y and z, to fill the array with standard errors.
I have heard that for "large matrices" I should use NumPy as opposed to Python lists, for performance and scalability reasons. Thing is, I know Python lists and they seem to work for me.
What will the benefits be if I move to NumPy?
What if I had 1000 series (that is, 1 billion floating point cells in the cube)?
NumPy's arrays are more compact than Python lists -- a list of lists as you describe, in Python, would take at least 20 MB or so, while a NumPy 3D array with single-precision floats in the cells would fit in 4 MB. Access in reading and writing items is also faster with NumPy.
Maybe you don't care that much for just a million cells, but you definitely would for a billion cells -- neither approach would fit in a 32-bit architecture, but with 64-bit builds NumPy would get away with 4 GB or so, Python alone would need at least about 12 GB (lots of pointers which double in size) -- a much costlier piece of hardware!
The difference is mostly due to "indirectness" -- a Python list is an array of pointers to Python objects, at least 4 bytes per pointer plus 16 bytes for even the smallest Python object (4 for type pointer, 4 for reference count, 4 for value -- and the memory allocators rounds up to 16). A NumPy array is an array of uniform values -- single-precision numbers takes 4 bytes each, double-precision ones, 8 bytes. Less flexible, but you pay substantially for the flexibility of standard Python lists!
NumPy is not just more efficient; it is also more convenient. You get a lot of vector and matrix operations for free, which sometimes allow one to avoid unnecessary work. And they are also efficiently implemented.
For example, you could read your cube directly from a file into an array:
x = numpy.fromfile(file=open("data"), dtype=float).reshape((100, 100, 100))
Sum along the second dimension:
s = x.sum(axis=1)
Find which cells are above a threshold:
(x > 0.5).nonzero()
Remove every even-indexed slice along the third dimension:
x[:, :, ::2]
Also, many useful libraries work with NumPy arrays. For example, statistical analysis and visualization libraries.
Even if you don't have performance problems, learning NumPy is worth the effort.
Alex mentioned memory efficiency, and Roberto mentions convenience, and these are both good points. For a few more ideas, I'll mention speed and functionality.
Functionality: You get a lot built in with NumPy, FFTs, convolutions, fast searching, basic statistics, linear algebra, histograms, etc. And really, who can live without FFTs?
Speed: Here's a test on doing a sum over a list and a NumPy array, showing that the sum on the NumPy array is 10x faster (in this test -- mileage may vary).
from numpy import arange
from timeit import Timer
Nelements = 10000
Ntimeits = 10000
x = arange(Nelements)
y = range(Nelements)
t_numpy = Timer("x.sum()", "from __main__ import x")
t_list = Timer("sum(y)", "from __main__ import y")
print("numpy: %.3e" % (t_numpy.timeit(Ntimeits)/Ntimeits,))
print("list: %.3e" % (t_list.timeit(Ntimeits)/Ntimeits,))
which on my systems (while I'm running a backup) gives:
numpy: 3.004e-05
list: 5.363e-04
Here's a nice answer from the FAQ on the scipy.org website:
What advantages do NumPy arrays offer over (nested) Python lists?
Python’s lists are efficient general-purpose containers. They support
(fairly) efficient insertion, deletion, appending, and concatenation,
and Python’s list comprehensions make them easy to construct and
manipulate. However, they have certain limitations: they don’t support
“vectorized” operations like elementwise addition and multiplication,
and the fact that they can contain objects of differing types mean
that Python must store type information for every element, and must
execute type dispatching code when operating on each element. This
also means that very few list operations can be carried out by
efficient C loops – each iteration would require type checks and other
Python API bookkeeping.
All have highlighted almost all major differences between numpy array and python list, I will just brief them out here:
Numpy arrays have a fixed size at creation, unlike python lists (which can grow dynamically). Changing the size of ndarray will create a new array and delete the original.
The elements in a Numpy array are all required to be of the same data type (we can have the heterogeneous type as well but that will not gonna permit you mathematical operations) and thus will be the same size in memory
Numpy arrays are facilitated advances mathematical and other types of operations on large numbers of data. Typically such operations are executed more efficiently and with less code than is possible using pythons build in sequences
The standard mutable multielement container in Python is the list. Because of Python's dynamic typing, we can even create heterogeneous list. To allow these flexible types, each item in the list must contain its own type info, reference count, and other information. That is, each item is a complete Python object.
In the special case that all variables are of the same type, much of this information is redundant; it can be much more efficient to store data in a fixed-type array (NumPy-style).
Fixed-type NumPy-style arrays lack this flexibility, but are much more efficient for storing and manipulating data.
NumPy is not another programming language but a Python extension module. It provides fast and efficient operations on arrays of homogeneous data.
Numpy has fixed size of creation.
In Python :lists are written with square brackets.
These lists can be homogeneous or heterogeneous
The main advantages of using Numpy Arrays Over Python Lists:
It consumes less memory.
Fast as compared to the python List.
Convenient to use.
What are the advantages of NumPy over regular Python lists?
I have approximately 100 financial markets series, and I am going to create a cube array of 100x100x100 = 1 million cells. I will be regressing (3-variable) each x with each y and z, to fill the array with standard errors.
I have heard that for "large matrices" I should use NumPy as opposed to Python lists, for performance and scalability reasons. Thing is, I know Python lists and they seem to work for me.
What will the benefits be if I move to NumPy?
What if I had 1000 series (that is, 1 billion floating point cells in the cube)?
NumPy's arrays are more compact than Python lists -- a list of lists as you describe, in Python, would take at least 20 MB or so, while a NumPy 3D array with single-precision floats in the cells would fit in 4 MB. Access in reading and writing items is also faster with NumPy.
Maybe you don't care that much for just a million cells, but you definitely would for a billion cells -- neither approach would fit in a 32-bit architecture, but with 64-bit builds NumPy would get away with 4 GB or so, Python alone would need at least about 12 GB (lots of pointers which double in size) -- a much costlier piece of hardware!
The difference is mostly due to "indirectness" -- a Python list is an array of pointers to Python objects, at least 4 bytes per pointer plus 16 bytes for even the smallest Python object (4 for type pointer, 4 for reference count, 4 for value -- and the memory allocators rounds up to 16). A NumPy array is an array of uniform values -- single-precision numbers takes 4 bytes each, double-precision ones, 8 bytes. Less flexible, but you pay substantially for the flexibility of standard Python lists!
NumPy is not just more efficient; it is also more convenient. You get a lot of vector and matrix operations for free, which sometimes allow one to avoid unnecessary work. And they are also efficiently implemented.
For example, you could read your cube directly from a file into an array:
x = numpy.fromfile(file=open("data"), dtype=float).reshape((100, 100, 100))
Sum along the second dimension:
s = x.sum(axis=1)
Find which cells are above a threshold:
(x > 0.5).nonzero()
Remove every even-indexed slice along the third dimension:
x[:, :, ::2]
Also, many useful libraries work with NumPy arrays. For example, statistical analysis and visualization libraries.
Even if you don't have performance problems, learning NumPy is worth the effort.
Alex mentioned memory efficiency, and Roberto mentions convenience, and these are both good points. For a few more ideas, I'll mention speed and functionality.
Functionality: You get a lot built in with NumPy, FFTs, convolutions, fast searching, basic statistics, linear algebra, histograms, etc. And really, who can live without FFTs?
Speed: Here's a test on doing a sum over a list and a NumPy array, showing that the sum on the NumPy array is 10x faster (in this test -- mileage may vary).
from numpy import arange
from timeit import Timer
Nelements = 10000
Ntimeits = 10000
x = arange(Nelements)
y = range(Nelements)
t_numpy = Timer("x.sum()", "from __main__ import x")
t_list = Timer("sum(y)", "from __main__ import y")
print("numpy: %.3e" % (t_numpy.timeit(Ntimeits)/Ntimeits,))
print("list: %.3e" % (t_list.timeit(Ntimeits)/Ntimeits,))
which on my systems (while I'm running a backup) gives:
numpy: 3.004e-05
list: 5.363e-04
Here's a nice answer from the FAQ on the scipy.org website:
What advantages do NumPy arrays offer over (nested) Python lists?
Python’s lists are efficient general-purpose containers. They support
(fairly) efficient insertion, deletion, appending, and concatenation,
and Python’s list comprehensions make them easy to construct and
manipulate. However, they have certain limitations: they don’t support
“vectorized” operations like elementwise addition and multiplication,
and the fact that they can contain objects of differing types mean
that Python must store type information for every element, and must
execute type dispatching code when operating on each element. This
also means that very few list operations can be carried out by
efficient C loops – each iteration would require type checks and other
Python API bookkeeping.
All have highlighted almost all major differences between numpy array and python list, I will just brief them out here:
Numpy arrays have a fixed size at creation, unlike python lists (which can grow dynamically). Changing the size of ndarray will create a new array and delete the original.
The elements in a Numpy array are all required to be of the same data type (we can have the heterogeneous type as well but that will not gonna permit you mathematical operations) and thus will be the same size in memory
Numpy arrays are facilitated advances mathematical and other types of operations on large numbers of data. Typically such operations are executed more efficiently and with less code than is possible using pythons build in sequences
The standard mutable multielement container in Python is the list. Because of Python's dynamic typing, we can even create heterogeneous list. To allow these flexible types, each item in the list must contain its own type info, reference count, and other information. That is, each item is a complete Python object.
In the special case that all variables are of the same type, much of this information is redundant; it can be much more efficient to store data in a fixed-type array (NumPy-style).
Fixed-type NumPy-style arrays lack this flexibility, but are much more efficient for storing and manipulating data.
NumPy is not another programming language but a Python extension module. It provides fast and efficient operations on arrays of homogeneous data.
Numpy has fixed size of creation.
In Python :lists are written with square brackets.
These lists can be homogeneous or heterogeneous
The main advantages of using Numpy Arrays Over Python Lists:
It consumes less memory.
Fast as compared to the python List.
Convenient to use.