Couldn't find how NumPy and PyTorch store arrays (tensors), so I'm asking here.
If you need to store multidimensional arrays like those two libraries do, should you store all values in a single contiguous array together with the array's shape, which tells you how to index into it (and also makes reshaping cheap) — in C, something like int arr[]; — or should you store the multidimensional array as is, for example a 2D integer array as int arr[][];?
Which way gives maximum performance and memory efficiency?
Also, any reference on how to go on developing such a thing is appreciated.
P.S. This is for a school exam, so no libraries.
I'm currently working on a helper class to transfer data from a Java ND-Array to a Python numpy nd-array. The Java array uses ND4J and I'm able to ascertain shape, stride, and row/column ordering from the ND4J INDArray.
Py4j allows me to natively transmit a bytearray back from the JVM. However, I'm not too familiar with numpy and I don't quite know whether it has preference for row or column ordering and how I can provide shape information if I give it a bytearray representing a 1D array of data.
The closest question I could find was this: Quickest way to convert 1D byte array to 2D numpy array
However, it doesn't tell me much about providing explicit shape information - it only applies to RGB image data.
So my question is, how can I do something like np.array(bytearray, shape) and how can I know numpy's preferred ordering so I can prepare the incoming data?
Edit
Half-answered my question. Looks like numpy does indeed allow for specific ordering via an extra parameter on many of its array creation methods: https://docs.scipy.org/doc/numpy/reference/routines.array-creation.html
Edit 2
Learning more, I need to make sure that the bytearray (converted from byte[]) is the right datatype. It's almost always going to be double, so should I pass a float type or a numpy.float64?
What you can do is
np.array(bytearray).reshape(shape)
where the output of np.array() is a 1D array, which you then reshape to be in the shape you want it to be. Note that reshaping does not change the order in memory, only how your data is viewed.
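One caveat: np.array(bytearray) gives you uint8 elements, one per byte. Since the incoming data is doubles, np.frombuffer with an explicit dtype is a closer fit. A minimal sketch — the 2x3 shape and the simulated payload are made up for illustration:

```python
import numpy as np

# simulate the bytes arriving from the JVM: six doubles
payload = bytearray(np.arange(6, dtype=np.float64).tobytes())

# interpret the raw bytes as float64 values, then reshape to the known shape
a = np.frombuffer(payload, dtype=np.float64).reshape(2, 3)
print(a.shape)   # (2, 3)
print(a[1, 2])   # 5.0
```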
When linearly iterating through a default C-style NumPy array, the last dimension of your array will iterate the fastest, this means
a[0,0,0]
a[0,0,1]
are next to each other in memory, while
a[0,0,0]
a[0,1,0]
are not. Knowing this you should be able to figure out the shape argument.
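You can see this ordering directly by flattening a small array — this just illustrates the C-order layout described above:

```python
import numpy as np

a = np.arange(8).reshape(2, 2, 2)   # C order by default
flat = a.ravel(order='C')           # walk the memory linearly

# the last axis varies fastest, so a[0,0,0] and a[0,0,1] are neighbours
print(flat[:2])       # [0 1] == a[0,0,0], a[0,0,1]
print(a[0, 1, 0])     # 2 -> two steps away in memory from a[0,0,0]
```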
Thirdly, dtype=float and dtype=np.float64 are interchangeable, which you can confirm by comparing
print(np.arange(1, dtype=float).dtype)
print(np.arange(1, dtype=np.float64).dtype)
I am working with a large matrix of size m * n, with m, n > 100000. Since my data is too large to hold in memory, I want to store the matrix on disk and work with it via HDF5 and PyTables.
However, the elements of my matrix are themselves small 5x5 matrices of real values.
I have already looked at the following post, but I would like to know if there is any other way of storing this type of data in tables?
(Create a larger matrix from smaller matrices in numpy)
Thank you in advance
In numpy there are two relevant structures.
One is a 4-dimensional array, e.g. np.zeros((100,100,5,5), int). The other is a 2-dimensional array of objects, np.zeros((100,100), dtype=object). With an object array, the elements can be anything - strings, numbers, lists, your 5x5 arrays, another 7x3 array, None, etc.
It is easiest to do math on the 4-d array, for example taking the mean across all the 5x5 subarrays, or taking the [:,:,0,0] corner of every subarray.
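For example, a couple of those whole-array operations on the 4-d layout (the small shapes are chosen just for illustration):

```python
import numpy as np

a = np.arange(2 * 3 * 5 * 5).reshape(2, 3, 5, 5)

# mean across all the 5x5 subarrays: one number per subarray
means = a.mean(axis=(2, 3))
print(means.shape)       # (2, 3)

# the [0, 0] corner of every 5x5 subarray
corners = a[:, :, 0, 0]
print(corners.shape)     # (2, 3)
```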
If your subarrays are all 5x5, it can be tricky to create and fill that object array, because np.array(...) tries to create the 4-dim array whenever it can.
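One way around that, if you really do want the object array, is to allocate it empty and fill it in a loop so np.array() never gets the chance to collapse it:

```python
import numpy as np

# allocate the object array first, then fill it with 5x5 subarrays;
# this avoids np.array() merging everything into one 4-D array
obj = np.empty((2, 2), dtype=object)
for i in range(2):
    for j in range(2):
        obj[i, j] = np.zeros((5, 5))

print(obj.shape)        # (2, 2)
print(obj[0, 0].shape)  # (5, 5)
```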
With h5py you can chunk the file, and access portions of the larger array. But you still have to have a workable numpy representation to do anything with them.
I'm trying to understand the differences between what people call matrices and what people call lists within lists.
Are they the same in that, once created, you can do identical things to them (reference elements the same way within them, etc).
Examples:
Making lists within a list:
ListsInLists = [[1,2],[3,4],[5,6]]
Making a multidimensional array:
np.random.rand(3,2)
Stacking arrays to make a matrix:
Array1 = [1,2,3,4]
Array2 = [5,6,7,8]
CompleteArray = np.vstack((Array1, Array2))
A list of lists is very different from a two-dimensional NumPy array.
A list has dynamic size and can hold any type of object, whereas an array has a fixed size and entries of uniform type.
In a list of lists, each sublist can have different sizes. An array has fixed dimensions along each axis.
An array is stored in a contiguous block of memory, whereas the objects in a list can be stored anywhere on the heap.
Numpy arrays are more restrictive, but offer greater performance and memory efficiency. They also provide convenient functions for vectorised mathematical operations.
Internally, a list is represented as an array of pointers that point to arbitrary Python objects. The array uses exponential over-allocation to achieve linear performance when appending repeatedly at the end of the list. A Numpy array on the other hand is typically represented as a C array of numbers.
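A quick illustration of the space difference (getsizeof numbers vary by platform and Python version, so treat them as approximate):

```python
import sys
import numpy as np

lst = list(range(1000))
arr = np.arange(1000, dtype=np.int64)

# the list holds 1000 pointers, each referring to a separate int object;
# the array holds 1000 raw 8-byte integers in one contiguous block
print(sys.getsizeof(lst))  # the pointer array alone, int objects not included
print(arr.nbytes)          # exactly 1000 * 8 = 8000 bytes of data
```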
(This answer does not cover the special case of Numpy object arrays, which can hold any kind of Python object as well. They are rarely used, since they have the restrictions of Numpy arrays, but don't have the performance advantages.)
They are not the same. Arrays are more memory efficient in Python than lists, and thanks to the numpy module there are additional operations you can perform on arrays that you cannot perform on lists.
For calculations, working with arrays in numpy tends to be a lot faster than using built in list functions.
You can read a bit more into it if you want in the answers to this question.
array.array is a standard-library type that looks like it will be much more efficient than list for some numerical tasks.
In numpy I could create a 2-d array easily, for example:
a = numpy.asarray([[1,2],[3,4]], dtype='int')
But I couldn't find how to create a 2-d array using array.array, can I?
No, though you can do index math to emulate one (which is how they work behind the scenes anyway). array.array is useful for its storage and conversion capabilities but actual calculation on array contents is likely far more efficient using numpy. Of course you can also keep a list of arrays if you wish.
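The index math looks like this — a minimal sketch of row-major addressing over a flat array.array (the 3x4 shape and the helper function are just illustrative):

```python
import array

rows, cols = 3, 4
a = array.array('d', [0.0] * (rows * cols))  # flat storage for a 3x4 "matrix"

def get(buf, i, j):
    # row-major: element (i, j) sits at offset i*cols + j
    return buf[i * cols + j]

a[1 * cols + 2] = 7.5   # the moral equivalent of a[1][2] = 7.5
print(get(a, 1, 2))     # 7.5
```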
I am building a class that iterates over incoming data. The data are in plain (non-numpy) array form. In my code I often use .append to build up another array. At some point I changed one of the big 1000x2000 arrays to a numpy.array. Now I get error after error. I started converting all of the arrays into ndarrays, but methods like .append no longer work, and I have problems indexing rows, columns, and cells, so I have to rebuild all my code.
I tried to google an answer to the question "what is the advantage of using ndarray over a normal array?" but I can't find a sensible answer. Can you explain when I should start using ndarrays, and whether in your practice you use both of them or stick to one only?
Sorry if the question is novice level, but I am new to Python, just trying to move over from Matlab, and I want to understand the pros and cons. Thanks
NumPy and Python arrays share the property of being efficiently stored in memory.
NumPy arrays can be added together, multiplied by a number, you can calculate, say, the sine of all their values in one function call, etc. As HYRY pointed out, they can also have more than one dimension. You cannot do this with Python arrays.
On the other hand, Python arrays can indeed be appended to. Note that NumPy arrays can however be concatenated together (hstack(), vstack(),…). That said, NumPy arrays are mostly meant to have a fixed number of elements.
It is common to first build a list (or a Python array) of values iteratively and then convert it to a NumPy array (with numpy.array(), or, more efficiently, with numpy.frombuffer(), as HYRY mentioned): this allows mathematical operations on arrays (or matrices) to be performed very conveniently (simple syntax for complex operations). Alternatively, numpy.fromiter() might be used to construct the array from an iterator. Or loadtxt() to construct it from a text file.
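Putting that workflow together — build with a list, convert once, then use array math. A minimal sketch:

```python
import numpy as np

values = []
for i in range(5):
    values.append(i ** 2)   # lists grow cheaply, one element at a time

arr = np.array(values)       # one conversion at the end
print(arr + 1)               # vectorized arithmetic: [ 1  2  5 10 17]
print(np.sqrt(arr))          # [0. 1. 2. 3. 4.]
```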
There are at least two main reasons for using NumPy arrays:
NumPy arrays require less space than Python lists. So you can deal with more data in a NumPy array (in-memory) than you can with Python lists.
NumPy arrays have a vast library of functions and methods unavailable to Python lists or Python arrays.
You cannot simply convert lists to NumPy arrays and expect your code to keep working. The methods are different, the bool semantics are different, and for the best performance even the algorithm may need to change.
However, if you are looking for a Python replacement for Matlab, you will definitely find uses for NumPy. It is worth learning.
array.array can change size dynamically, so if you are collecting data from some source, it's convenient to use array.array. But array.array is only one-dimensional, and there are no calculation functions for it. So when you want to do some calculation with your data, convert it to a numpy.ndarray and use the functions in numpy.
numpy.frombuffer can create a numpy.ndarray that shares the same data buffer as an array.array object; it's fast because it doesn't need to copy the data.
Here is a demo:
import numpy as np
import array
import random
a = array.array("d")
# a loop that collects 10 channels of data, 100 samples each
for x in range(100):
    a.extend([random.random() for _ in range(10)])
# create an ndarray that shares the same data buffer as array a, and reshape it to 2D
na = np.frombuffer(a, dtype=float).reshape(-1, 10)
# call some numpy function to do the calculation
np.sum(na, axis=0)
Another great advantage of NumPy arrays over built-in lists is that NumPy has a C API that allows native C and C++ code to access NumPy arrays directly. Hence, many Python libraries written in low-level languages like C expect you to work with NumPy arrays rather than Python lists.
Reference: Python for Data Analysis: Data Wrangling with Pandas, NumPy, and IPython