How to deal with large integers in NumPy? - python

I'm doing a data analysis project where I'm working with really large numbers. I originally did everything in pure python but I'm now trying to do it with numpy and pandas. However it seems like I've hit a roadblock, since it is not possible to handle integers larger than 64 bits in numpy (if I use python ints in numpy they max out at 9223372036854775807). Do I just throw away numpy and pandas completely or is there a way to use them with python-style arbitrary large integers? I'm okay with a performance hit.

by default numpy keeps elements as number datatype.
But you can force typing to object, like below
import numpy as np
x = np.array([10,20,30,40], dtype=object)
x_exp2 = 1000**x
print(x_exp2)
the output is
[1000000000000000000000000000000
1000000000000000000000000000000000000000000000000000000000000
1000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000
1000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000]
The drawback is that the execution is much slower.
Later Edit to show that np.sum() works. There could be some limitations of course.
import numpy as np
x = np.array([10,20,30,40], dtype=object)
x_exp2 = 1000**x
print(x_exp2)
print(np.sum(x_exp2))
print(np.prod(x_exp2))
and the output is:
[1000000000000000000000000000000
1000000000000000000000000000000000000000000000000000000000000
1000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000
1000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000]
1000000000000000000000000000001000000000000000000000000000001000000000000000000000000000001000000000000000000000000000000
1000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000

Related

How can I make my Python program use 4 bytes for an int instead of 24 bytes?

To save memory, I want to use less bytes (4) for each int I have instead of 24.
I looked at structs, but I don't really understand how to use them.
https://docs.python.org/3/library/struct.html
When I do the following:
myInt = struct.pack('I', anInt)
sys.getsizeof(myInt) doesn't return 4 like I expected.
Is there something that I am doing wrong? Is there another way for Python to save memory for each variable?
ADDED: I have 750,000,000 integers in an array that I wish to be able to use given an index.
If you want to hold many integers in an array, use a numpy ndarray. Numpy is a very popular third-party package that handles arrays more compactly than Python alone does. Numpy is not in the standard library so that it could be updated more frequently than Python itself is updated--it was considered to be added to the standard library. Numpy is one of the reasons Python has become so popular for Data Science and for other scientific uses.
Numpy's np.int32 type uses four bytes for an integer. Declare your array full of zeros with
import numpy as np
myarray = np.zeros((750000000,), dtype=np.int32)
Or if you just want the array and do not want to spend any time initializing the values,
myarray = np.empty((750000000,), dtype=np.int32)
You then fill and use the array as you like. There is some Python overhead for the complete array, so the array's size will be slightly larger than 4 * 750000000, but the size will be close.

Python ctypes combine two c_double arrays

I have a Python program that needs to pass an array to a .dll that is expecting an array of c doubles. This is currently done by the following code, which is the fastest method of conversion I could find:
from array import array
from ctypes import *
import numpy as np
python_array = np.array(some_python_array)
temp = array('d', python_array.astype('float'))
c_double_array = (c_double * len(temp)).from_buffer(temp)
...where 'np.array' is just there to show that in my case the python_array is a numpy array. Let's say I now have two c_double arrays: c_double_array_a and c_double_array_b, the issue I'm having is I would like to append c_double_array_b to c_double_array_a without reconverting to/from whatever python typically uses for arrays. Is there a way to do this with the ctypes library?
I've been reading through the docs here but nothing seems to detail combining two c_type arrays after creation. It is very important in my program that they can be combined after creation, of course it would be trivial to just append python_array_b to python_array_a and then convert but that won't work in my case.
Thanks!
P.S. if anyone knows a way to speed up the conversion code that would also be greatly appreciated, it takes on the order of 150ms / million elements in the array and my program typically handles 1-10 million elements at a time.
Leaving aside the construction of the ctypes arrays (for which Mark's comment is surely relevant), the issue is that C arrays are not resizable: you can't append or extend them. (There do exist wrappers that provide these features, which may be useful references in constructing this.) What you can do is make a new array of the size of the two existing arrays together and then ctypes.memmove them into it. It might be possible to improve the performance by using realloc, but you'd have to go even lower than normal ctypes memory management to use it.

Does numpy methods work correctly on numbers too big to fit numpy dtypes?

I would like to know if numbers bigger than what int64 or float128 can be correctly processed by numpy functions
EDIT: numpy functions applied to numbers/python objects outside of any numpy array. Like using a np function in a list comprehension that applies to the content of a list of int128?
I can't find anything about that in their docs, but I really don't know what to think and expect. From tests, it should work but I want to be sure, and a few trivial test won't help for that. So I come here for knowledge:
If np framework is not handling such big numbers, are its functions able to deal with these anyway?
EDIT: sorry, I wasn't clear. Please see the edit above
Thanks by advance.
See the Extended Precision heading in the Numpy documentation here. For very large numbers, you can also create an array with dtype set to 'object', which will allow you essentially to use the Numpy framework on the large numbers but with lower performance than using native types. As has been pointed out, though, this will break when you try to call a function not supported by the particular object saved in the array.
import numpy as np
arr = np.array([10**105, 10**106], dtype='object')
But the short answer is that you you can and will get unexpected behavior when using these large numbers unless you take special care to account for them.
When storing a number into a numpy array with a dtype not sufficient to store it, you will get truncation or an error
arr = np.empty(1, dtype=np.int64)
arr[0] = 2**65
arr
Gives OverflowError: Python int too large to convert to C long.
arr = np.empty(1, dtype=float16)
arr[0] = 2**64
arr
Gives inf (and no error)
arr[0] = 2**15 + 2
arr
Gives [ 32768.] (i.e., 2**15), so truncation occurred. It would be harder for this to happen with float128...
You can have numpy arrays of python objects, which could be a python integer which is too big to fit in np.int64. Some of numpy's functionality will work, but many functions call underlying c code which will not work. Here is an example:
import numpy as np
a = np.array([123456789012345678901234567890]) # a has dtype object now
print((a*2)[0]) # Works and gives the right result
print(np.exp(a)) # Does not work, because "'int' object has no attribute 'exp'"
Generally, most functionality will probably be lost for your extremely large numbers. Also, as it has been pointed out, when you have an array with a dtype of np.int64 or similar, you will have overflow problems, when you increase the size of your array elements over that types limit. With numpy, you have to be careful about what your array's dtype is!

Toeplitz matrix using numpy/scipy

In Octave or Matlab there is a neat, compact way to create large Toeplitz matrices, for example:
T = toeplitz([1,-0.25,zeros(1,20)])
That saves a lot of time that would otherwise be spent to fill the matrix with dozens or hundreds of zeros by using extra lines of code.
Yet I don't seem to be able to do the same with neither scipy or numpy, although those two libraries have both toeplitz() and zeros() functions.
Is there a similar way to do that or do I have to put together a routine of my own to do that (not a huge problem, but still a nuisance)?
Thanks,
F.
Currently I think the best you can do is:
from numpy import concatenate, zeros
from scipy.linalg import toeplitz
toeplitz(concatenate([[1., -.25], zeros(20)]))
As of python 3.5 however we'll have:
toeplitz([1., -.25, *zeros(20)])
So that's something to look forward to.
Another option is using r_:
import numpy as np
from scipy.linalg import toeplitz
toeplitz(np.r_[1, -0.25, np.zeros(20)])
You can also work with lists:
toeplitz([1, -0.25] + [0]*20)

Is it possible to reproduce randn() of MATLAB with NumPy?

I wonder if it is possible to exactly reproduce the whole sequence of randn() of MATLAB with NumPy. I coded my own routine with Python/Numpy, and it is giving me a little bit different results from the MATLAB code somebody else did, and I am having hard time finding out where it is coming from because of different random draws.
I have found the numpy.random.seed value which produces the same number for the first draw, but from the second draw and on, it is completely different. I'm making multivariate normal draws for about 20,000 times so I don't want to just save the matlab draws and read it in Python.
The user asked if it was possible to reproduce the output of randn() of Matlab, not rand. I have not been able to set the algorithm or seed to reproduce the exact number for randn(), but the solution below works for me.
In Matlab: Generate your normal distributed random numbers as follows:
rng(1);
norminv(rand(1,5),0,1)
ans =
-0.2095 0.5838 -3.6849 -0.5177 -1.0504
In Python: Generate your normal distributed random numbers as follows:
import numpy as np
from scipy.stats import norm
np.random.seed(1)
norm.ppf(np.random.rand(1,5))
array([[-0.2095, 0.5838, -3.6849, -0.5177,-1.0504]])
It is quite convenient to have functions, which can reproduce equal random numbers, when moving from Matlab to Python or vice versa.
If you set the random number generator to the same seed, it will theoretically create the same numbers, ie in matlab. I am not quite sure how to best do it, but this seems to work, in matlab do:
rand('twister', 5489)
and corresponding in numy:
np.random.seed(5489)
To (re)initalize your random number generators. This gives for me the same numbers for rand() and np.random.random(), however not for randn, I am not sure if there is an easy method for that.
With newer matlab versions you can probably set up a RandStream with the same properties as numpy, for older you can reproduce numpy's randn in matlab (or vice versa). Numpy uses the polar form to create the uniform numbers from np.random.random() (the second algorithm given here: http://www.taygeta.com/random/gaussian.html). You could just write that algorithm in matlab to create the same randn numbers as numpy does from the rand function in matlab.
If you don't need a huge amount of random numbers, just save them in a .mat and read them from scipy.io though...
Just wanted to further clarify on using the twister/seeding method: MATLAB and numpy generate the same sequence using this seeding but will fill them out in matrices differently.
MATLAB fills out a matrix down columns, while python goes down rows. So in order to get the same matrices in both, you have to transpose:
MATLAB:
rand('twister', 1337);
A = rand(3,5)
A =
Columns 1 through 2
0.262024675015582 0.459316887214567
0.158683972154466 0.321000540520167
0.278126519494360 0.518392820597537
Columns 3 through 4
0.261942925565145 0.115274226683149
0.976085284877434 0.386275068634359
0.732814552690482 0.628501179539712
Column 5
0.125057926335599
0.983548605143641
0.443224868645128
python:
import numpy as np
np.random.seed(1337)
A = np.random.random((5,3))
A.T
array([[ 0.26202468, 0.45931689, 0.26194293, 0.11527423, 0.12505793],
[ 0.15868397, 0.32100054, 0.97608528, 0.38627507, 0.98354861],
[ 0.27812652, 0.51839282, 0.73281455, 0.62850118, 0.44322487]])
Note: I also placed this answer on this similar question: Comparing Matlab and Numpy code that uses random number generation

Categories

Resources