What is the main advantage of using the array module instead of lists?
The arrays will take less space.
I've never used the array module, numpy provides the same benefits plus many many more.
Arrays are very similar to lists "except that the type of objects stored in them is constrained. The type is specified at object creation time by using a type code, which is a single character."
http://docs.python.org/library/array.html
So if you know your array will only contain objects of a certain type then go ahead (if performance is crucial), if not just use a list.
Arrays can be very useful if you want to strictly type the data in your collection. Notwithstanding performance it can be quite convenient to be sure of the data types contained within your data structure. However, arrays don't feel very 'pythonic' to me (though I must stress this is only personal preference). Unless you are really concerned with the type of data within the collection I would stick with lists as they afford you a great deal of flexibility. The very slight memory optimisations gained between lists vs arrays are insignificant unless you have an extremely large amount of data.
Related
Probably obvious, but just wanted to confirm:
What would be the best suited data type for a numerical ID in pandas?
Let's say that I have a sequential numerical ID type user_id, which would be better:
an int64 type (that would seem to be the most obvious choice given the numerical representation of the field)
a category type (which might make more sense, given that the ID is not to be used for actual numerical operations, but rather as a unique identifier)
Same question for characters-based IDs, would it be better to use an object or a category type?
I would be tempted to use the category data type (thinking there might be performance benefits, as I would imagine these categories are somehow optimised/hashed/indexed for performance), but I was wondering whether this data type is more suited for a more limited subset of distinct values than the possible 100's of thousands unique user_ids I might have in my dataset.
Thanks!
Operating on object-typed dataframes/arrays is slow because Pandas needs to operate on each item using the inefficient CPython interpreter. This causes a high overhead due to reference counting, internal pointer indirections, type checks, internal function calls, etc. Pandas often uses Numpy internally which can be much faster when the types are native like int64, int32, float64, etc. In that case, Numpy can execute a optimized native code that is not slowed down by the CPython overheads and that can even benefit from hardware SIMD units (regarding the target function used). While Numpy supports bounded strings, Pandas does not use this but slow CPython string objects instead. Strings are inherently slow, even in native codes, because of their generally variable size that is often predictable (this strongly impacts the processor that need to predict branches so to be fast, see this post about branch prediction). In practice, unicode characters make strings even slower (it makes the use of SIMD instructions very difficult and branch even harder to predict). Categorial are basically integers associated with a mapping table (of unique values). Categorial columns can theoretically be faster for some computation because the table is already computed. However, the initial computation of the table can be expensive. Additionally, the table is not always used efficient where it could resulting sometimes to a surprisingly slower execution compared to integers. Not to mentions the table can be big when all the values are different. Integers are the less expensive type. Smaller integer can often be faster. Indeed, SIMD vectors have a fixed size (eg. the AVX-2 SIMD instruction set of 86-64 processors can compute 32 int8 value in a row compared to only 4 int64). Furthermore, smaller items cause the whole columns to take less memory reducing the memory throughput so it improve the performance of memory-bound codes (starting from dataframe copies that are pretty frequent in Pandas). However, this is not always faster because smaller types can sometime cause type-conversion adding an additional overhead (though this overhead can be mitigated using lower-level optimizations). Thus, if you are working on huge dataframe, please consider using small integer types. Otherwise, int64 is certainly a very good option.
I am counting various patterns in graphs, and I store the relevant information in a defaultdict of lists of numpy arrays of size N, where the index values are integers.
I want to efficiently know if I am creating a duplicate array. Not removing duplicates can exponentially grow the amount of duplicates to the point where what I am doing becomes infeasible. But there are potentially hundreds of thousands of arrays, stored in different lists, under different keys. As far as I know, I can't hash an array.
If I simply needed to check for duplicate nonzero indices, I would store the nonzero indices as a bit sequence of ones and then hash that value. But, I don't only need to check the indices - I need to also check their integer values. Is there any way to do this short of coming up with a completely knew design that uses different structures?
Thanks.
The basic idea is “How can I use my own hash (and perhaps ==) to store things differently in a set/dict?” (where “differently” includes “without raising TypeError for being non-hashable).
The first part of the answer is defining your hash function, for example following myrtlecat’s comment. However, beware the standard non-answer based on it: store the custom hash of each object in a set (or map it to, say, the original object with a dict). That you don’t have to provide an equality implementation is a hint that this is wrong: hash values aren’t always unique! (Exception: if you want to “hash by identity”, and know all your keys will outlive the map, id does provide unique “hashes”.)
The rest of the answer is to wrap your desired keys in objects that expose your hash/equality functions as __hash__ and __eq__. Note that overriding the non-hashability of mutable types comes with an obligation to not alter the (underlying) keys! (C programmers would often call doing so undefined behavior.)
For code, see an old answer by xperroni (which includes the option to increase safety by basing the comparisons on private copies that are less likely to be altered by some other code), though I’d add __slots__ to combat the memory overhead.
Just a short question that I can't find the answer to before i head off for the day,
When i do something like this:
v1 = float_list_python = ... # <some list of floats>
v2 = float_array_NumPy = ... # <some numpy.ndarray of floats>
# I guess they don't have to be floats -
# but some object that also has a native
# object in C, so that numpy can just use
# that
If i want to multiply these vectors by a scalar, my understanding has always been that the python list is a list of object references, and so looping through the list to do the multiplication must fetch the locations of all the floats, and then must get the floats in order to do it - which is one of the reasons it's slow.
If i do the same thing in NumPy, then, well, i'm not sure what happens. There are a number of things i imagine could happen:
It splits the multpilication up across the cores.
It vectorises the multications (as well?)
The documentation i've found suggests that many of the primitives in numpy take advantage of the first option there whenever they can (i don't have a computer on hand at the moment i can test it on). And my intuition tells me that number 2 should happen whenever it's possible.
So my question is, if I create a NumPy array of python objects, will it still at least perform operations on the list in parallel? I know that if you create an array of objects that have native C types, then it will actually create a contiguous array in memory of the actual objects, and that if you create an numpy array of python objects it will create an array of references, but i don't see why this would rule out parallel operations on said list, and cannot find anywhere that explicitly states that.
EDIT: I feel there's a bit of confusion over what i'm asking. I understand what vectorisation is, I understand that it is a compiler optimisation, and not something you necesarily program in (though aligning the data such that it's contiguous in memory is important). On the grounds of vectorisation, all i wanted to know was whether or not numpy uses it. If i do something like np_array1 * np_array2 does the underlying library call use vectorisation (presuming that dtype is a compatible type).
For the splitting up over the cores, all i mean there, is if i again do something like np_array1 * np_array2, but this time dtype=object: would it divide that work up amongst there cores?
numpy is fast because it performs numeric operations like this in fast compiled C code. In contrast the list operation operates at the interpreted Python level (streamlined as much as possible with Python bytecodes etc).
A numpy array of numeric type stores those numbers in a data buffer. At least in the simple cases this is just a block of bytes that C code can step through efficiently. The array also has shape and strides information that allows multidimensional access.
When you multiply the array by a scalar, it, in effect, calls a C function titled something like 'multiply_array_by_scalar', which does the multiplication in fast compiled code. So this kind of numpy operation is fast (compared to Python list code) regardless of the number of cores or other multi-processing/threading enhancements.
Arrays of objects do not have any special speed advantage (compared to lists), at least not at this time.
Look at my answer to a question about creating an array of arrays, https://stackoverflow.com/a/28284526/901925
I had to use iteration to initialize the values.
Have you done any time experiments? For example, construct an array, say (1000,2). Use tolist() to create an equivalent list of lists. And make a similar array of objects, with each object being a (2,) array or list (how much work did that take?). Now do something simple like len(x) for each of those sub lists.
#hpaulj provided a good answer to your question. In general, from reading your question it occurred to me that you do not actually understand what "vectorization" does under the hood. This writeup is a pretty decent explanation of vectorization and how it enables faster computations - http://quantess.net/2013/09/30/vectorization-magic-for-your-computations/
With regards to point 1 - Distributing computations across multiple cores, this is not always the case with Numpy. However, there are libraries like numexpr that enable multithreaded, highly efficient Numpy array computations with support for several basic logical and arithmetic operators. Numexpr can be used to turbo charge critical computations when used in conjunction with Numpy as it avoids replicating large arrays in memory for vectorization routines (as is the case for Numpy) and can use all cores on your system to perform computations.
What are the advantages of NumPy over regular Python lists?
I have approximately 100 financial markets series, and I am going to create a cube array of 100x100x100 = 1 million cells. I will be regressing (3-variable) each x with each y and z, to fill the array with standard errors.
I have heard that for "large matrices" I should use NumPy as opposed to Python lists, for performance and scalability reasons. Thing is, I know Python lists and they seem to work for me.
What will the benefits be if I move to NumPy?
What if I had 1000 series (that is, 1 billion floating point cells in the cube)?
NumPy's arrays are more compact than Python lists -- a list of lists as you describe, in Python, would take at least 20 MB or so, while a NumPy 3D array with single-precision floats in the cells would fit in 4 MB. Access in reading and writing items is also faster with NumPy.
Maybe you don't care that much for just a million cells, but you definitely would for a billion cells -- neither approach would fit in a 32-bit architecture, but with 64-bit builds NumPy would get away with 4 GB or so, Python alone would need at least about 12 GB (lots of pointers which double in size) -- a much costlier piece of hardware!
The difference is mostly due to "indirectness" -- a Python list is an array of pointers to Python objects, at least 4 bytes per pointer plus 16 bytes for even the smallest Python object (4 for type pointer, 4 for reference count, 4 for value -- and the memory allocators rounds up to 16). A NumPy array is an array of uniform values -- single-precision numbers takes 4 bytes each, double-precision ones, 8 bytes. Less flexible, but you pay substantially for the flexibility of standard Python lists!
NumPy is not just more efficient; it is also more convenient. You get a lot of vector and matrix operations for free, which sometimes allow one to avoid unnecessary work. And they are also efficiently implemented.
For example, you could read your cube directly from a file into an array:
x = numpy.fromfile(file=open("data"), dtype=float).reshape((100, 100, 100))
Sum along the second dimension:
s = x.sum(axis=1)
Find which cells are above a threshold:
(x > 0.5).nonzero()
Remove every even-indexed slice along the third dimension:
x[:, :, ::2]
Also, many useful libraries work with NumPy arrays. For example, statistical analysis and visualization libraries.
Even if you don't have performance problems, learning NumPy is worth the effort.
Alex mentioned memory efficiency, and Roberto mentions convenience, and these are both good points. For a few more ideas, I'll mention speed and functionality.
Functionality: You get a lot built in with NumPy, FFTs, convolutions, fast searching, basic statistics, linear algebra, histograms, etc. And really, who can live without FFTs?
Speed: Here's a test on doing a sum over a list and a NumPy array, showing that the sum on the NumPy array is 10x faster (in this test -- mileage may vary).
from numpy import arange
from timeit import Timer
Nelements = 10000
Ntimeits = 10000
x = arange(Nelements)
y = range(Nelements)
t_numpy = Timer("x.sum()", "from __main__ import x")
t_list = Timer("sum(y)", "from __main__ import y")
print("numpy: %.3e" % (t_numpy.timeit(Ntimeits)/Ntimeits,))
print("list: %.3e" % (t_list.timeit(Ntimeits)/Ntimeits,))
which on my systems (while I'm running a backup) gives:
numpy: 3.004e-05
list: 5.363e-04
Here's a nice answer from the FAQ on the scipy.org website:
What advantages do NumPy arrays offer over (nested) Python lists?
Python’s lists are efficient general-purpose containers. They support
(fairly) efficient insertion, deletion, appending, and concatenation,
and Python’s list comprehensions make them easy to construct and
manipulate. However, they have certain limitations: they don’t support
“vectorized” operations like elementwise addition and multiplication,
and the fact that they can contain objects of differing types mean
that Python must store type information for every element, and must
execute type dispatching code when operating on each element. This
also means that very few list operations can be carried out by
efficient C loops – each iteration would require type checks and other
Python API bookkeeping.
All have highlighted almost all major differences between numpy array and python list, I will just brief them out here:
Numpy arrays have a fixed size at creation, unlike python lists (which can grow dynamically). Changing the size of ndarray will create a new array and delete the original.
The elements in a Numpy array are all required to be of the same data type (we can have the heterogeneous type as well but that will not gonna permit you mathematical operations) and thus will be the same size in memory
Numpy arrays are facilitated advances mathematical and other types of operations on large numbers of data. Typically such operations are executed more efficiently and with less code than is possible using pythons build in sequences
The standard mutable multielement container in Python is the list. Because of Python's dynamic typing, we can even create heterogeneous list. To allow these flexible types, each item in the list must contain its own type info, reference count, and other information. That is, each item is a complete Python object.
In the special case that all variables are of the same type, much of this information is redundant; it can be much more efficient to store data in a fixed-type array (NumPy-style).
Fixed-type NumPy-style arrays lack this flexibility, but are much more efficient for storing and manipulating data.
NumPy is not another programming language but a Python extension module. It provides fast and efficient operations on arrays of homogeneous data.
Numpy has fixed size of creation.
In Python :lists are written with square brackets.
These lists can be homogeneous or heterogeneous
The main advantages of using Numpy Arrays Over Python Lists:
It consumes less memory.
Fast as compared to the python List.
Convenient to use.
I'm porting an C++ scientific application to python, and as I'm new to python, some problems come to my mind:
1) I'm defining a class that will contain the coordinates (x,y). These values will be accessed several times, but they only will be read after the class instantiation. Is it better to use an tuple or an numpy array, both in memory and access time wise?
2) In some cases, these coordinates will be used to build a complex number, evaluated on a complex function, and the real part of this function will be used. Assuming that there is no way to separate real and complex parts of this function, and the real part will have to be used on the end, maybe is better to use directly complex numbers to store (x,y)? How bad is the overhead with the transformation from complex to real in python? The code in c++ does a lot of these transformations, and this is a big slowdown in that code.
3) Also some coordinates transformations will have to be performed, and for the coordinates the x and y values will be accessed in separate, the transformation be done, and the result returned. The coordinate transformations are defined in the complex plane, so is still faster to use the components x and y directly than relying on the complex variables?
Thank you
In terms of memory consumption, numpy arrays are more compact than Python tuples.
A numpy array uses a single contiguous block of memory. All elements of the numpy array must be of a declared type (e.g. 32-bit or 64-bit float.) A Python tuple does not necessarily use a contiguous block of memory, and the elements of the tuple can be arbitrary Python objects, which generally consume more memory than numpy numeric types.
So this issue is a hands-down win for numpy, (assuming the elements of the array can be stored as a numpy numeric type).
On the issue of speed, I think the choice boils down to the question, "Can you vectorize your code?"
That is, can you express your calculations as operations done on entire arrays element-wise.
If the code can be vectorized, then numpy will most likely be faster than Python tuples. (The only case I could imagine where it might not be, is if you had many very small tuples. In this case the overhead of forming the numpy arrays and one-time cost of importing numpy might drown-out the benefit of vectorization.)
An example of code that could not be vectorized would be if your calculation involved looking at, say, the first complex number in an array z, doing a calculation which produces an integer index idx, then retrieving z[idx], doing a calculation on that number, which produces the next index idx2, then retrieving z[idx2], etc. This type of calculation might not be vectorizable. In this case, you might as well use Python tuples, since you won't be able to leverage numpy's strength.
I wouldn't worry about the speed of accessing the real/imaginary parts of a complex number. My guess is the issue of vectorization will most likely determine which method is faster. (Though, by the way, numpy can transform an array of complex numbers to their real parts simply by striding over the complex array, skipping every other float, and viewing the result as floats. Moreover, the syntax is dead simple: If z is a complex numpy array, then z.real is the real parts as a float numpy array. This should be far faster than the pure Python approach of using a list comprehension of attribute lookups: [z.real for z in zlist].)
Just out of curiosity, what is your reason for porting the C++ code to Python?
A numpy array with an extra dimension is tighter in memory use, and at least as fast!, as a numpy array of tuples; complex numbers are at least as good or even better, including for your third question. BTW, you may have noticed that -- while questions asked later than yours were getting answers aplenty -- your was laying fallow: part of the reason is no doubt that asking three questions within a question turns responders off. Why not just ask one question per question? It's not as if you get charged for questions or anything, you know...!-)