I want to put many large long integers into memory without any space between them. How to do that with python 2.7 code in linux?
The large long integers all use the same number of bits. There is totally about 4 gb of data. Leaving spaces of a few bits to make each long integer uses multiples of 8 bits in memory is ok. I want to do bitwise operation on them later.
So far, I am using a python list. But I am not sure if that leaves no space in memory between the integers. Can ctypes help?
Thank you.
The old code uses bitarray (https://pypi.python.org/pypi/bitarray/0.8.1)
import bitarray
data = bitarray.bitarray()
with open('data.bin', 'rb') as f:
data.fromfile(f)
result = data[:750000] & data[750000:750000*2]
This works and the bitarray doesn't have gap in memory. But, the bitarray bitwise and is slower than the native python's bitwise operation of long integer by about 6 times on the computer. Slicing the bitarray in the old code and accessing an element in the list in the newer code use roughly the same amount of time.
Newer code:
import cPickle as pickle
with open('data.pickle', 'rb') as f:
data = pickle.load(f)
# data is a list of python's (long) integers
result = data[0] & data[1]
Numpy:
In the above code. result = data[0] & data[1] creates a new long integer.
Numpy has the out option for numpy.bitwise_and. That would avoid creating a new numpy array. Yet, numpy's bool array seems to use one byte per bool instead of one bit per bool. While, converting the bool array into a numpy.uint8 array avoids this problem, counting the number of set bit is too slow.
The python's native array can't handle the large long integers:
import array
xstr = ''
for i in xrange(750000):
xstr += '1'
x = int(xstr, 2)
ar = array.array('l',[x,x,x])
# OverflowError: Python int too large to convert to C long
You can use the array module, for example:
import array
ar = array('l', [25L, 26L, 27L])
ar[1] # 26L
The primary cause for space inefficiency is the internal structure of a Python long. Assuming a 64-bit platform, Python only uses 30 out of 32 bits to store a value. The gmpy2 library provides access to the GMP (GNU Multiple Precision Arithmetic Library). The internal structure of the gmpy2.mpz type uses all available bits. Here is the size difference for storing a 750000-bit value.
>>> import gmpy2
>>> import sys
>>> a=long('1'*750000, 2)
>>> sys.getsizeof(a)
100024
>>> sys.getsizeof(gmpy2.mpz(a))
93792
The & operation with ``gmpy2.mpz` is also significantly faster.
$ python -m timeit -s "a=long('A'*93750,16);b=long('7'*93750)" "c=a & b"
100000 loops, best of 3: 7.78 usec per loop
$ python -m timeit -s "import gmpy2;a=gmpy2.mpz('A'*93750,16);b=gmpy2.mpz('7'*93750)" "c=a & b"
100000 loops, best of 3: 4.44 usec per loop
If all your operations are in-place, the gmpy2.xmpz type allows changing the internal value of an instance without creating a new instance. It is faster as long as all the operations are immediate.
$ python -m timeit -s "import gmpy2;a=gmpy2.xmpz('A'*93750,16);b=gmpy2.xmpz('7'*93750)" "a &= b"
100000 loops, best of 3: 3.31 usec per loop
Disclaimer: I maintain the gmpy2 library.
Related
Let's say I have a few numeric variables that will be used throughout the code like int and str objects. So I'm wondering, which type should I use to store them?
Here's an example:
class Nums:
a = 1 # stored as int
b = '2' # stored as str
print(int(Nums.b) + 5) # 7
print(str(Nums.a) + 'Hello') # 1Hello
As you can see, there are the two choices of storage and example uses. But the usage won't be so defined like it is here. For example, I will iterate over the class variables and concatenate them all in a string and also compute their product. So should I store them as int and cast to str when I need concatenation or as str and cast them to int before multiplication?
Some brief research using the IPython's %timeit has shown that the performance is about the same, int to str is a tiny bit faster:
In [3]: %timeit int('56')
The slowest run took 12.92 times longer than the fastest. This could mean that an intermediate result is being cached.
1000000 loops, best of 3: 360 ns per loop
In [4]: %timeit str(56)
The slowest run took 8.36 times longer than the fastest. This could mean that an intermediate result is being cached.
1000000 loops, best of 3: 353 ns per loop
And I would expect this result, as an str call really just calls the object's __str__ method, while reading a string as a number could be more complicated. Is there any real difference and what's the reason behind it?
I'm talking about Python 3.5 particularly
If your program uses numeric values, you should certainly store them as type int. Numeric types are almost always faster than string types, and even if your use case means the string representation is faster, the other benefits are worth it. To name a few:
You get sanity checks for free. If someone tries to shove "five" into your program, if you're storing as integers you'll get an error early. If you're storing as strings, your error may be much harder to reason about.
Anyone picking up your program will expect numeric values to be stored as numeric types
And again, storing things as numeric types is almost always faster
I am working with very high dimensional vectors for machine learning and was thinking about using numpy to reduce the amount of memory used. I run a quick test to see how much memory I could save using numpy (1)(3):
Standard list
import random
random.seed(0)
vector = [random.random() for i in xrange(2**27)]
Numpy array
import numpy
import random
random.seed(0)
vector = numpy.fromiter((random.random() for i in xrange(2**27)), dtype=float)
Memory usage (2)
Numpy array: 1054 MB
Standard list: 2594 MB
Just like I expected.
By allocing a continues block of memory with native floats numpy only consumes about half of the memory the standard list is using.
Because I know my data is pretty spare, I did the same test with sparse data.
Standard list
import random
random.seed(0)
vector = [random.random() if random.random() < 0.00001 else 0.0 for i in xrange(2 ** 27)]
Numpy array
from numpy import fromiter
from random import random
random.seed(0)
vector = numpy.fromiter((random.random() if random.random() < 0.00001 else 0.0 for i in xrange(2 ** 27)), dtype=float)
Memory usage (2)
Numpy array: 1054 MB
Standard list: 529 MB
Now all of the sudden, the python list uses half the amount of memory the numpy array uses! Why?
One thing I could think of is that python dynamically switches to a dict representation when it detects that it contains very sparse data. Checking this could potentially add a lot of extra run-time overhead so I don't really think that this is going on.
Notes
I started a fresh new python shell for every test.
Memory measured with htop.
Run on 32bit Debian.
A Python list is just an array of references (pointers) to Python objects. In CPython (the usual Python implementation) a list gets slightly over-allocated to make expansion more efficient, but it never gets converted to a dict. See the source code for further details: List object implementation
In the sparse version of the list, you have a lot of pointers to a single int 0 object. Those pointers take up 32 bits = 4 bytes, but your numpy floats are certainly larger, probably 64 bits.
FWIW, to make the sparse list / array tests more accurate you should call random.seed(some_const) with the same seed in both versions so that you get the same number of zeroes in both the Python list and the numpy array.
I need to use a complicated kind of dict and change the values of some keys dynamically.
So I tried it the following way but encountered with MemoryError with about 32GB RAM. The sys.getsizeof(d) returns 393356 and the sys.getsizeof(d.items()) is 50336.
Did I use the python dict in a wrong way ? can anyone help me !?
d=nltk.defaultdict(lambda:nltk.defaultdict(float))
for myarticlewords in mywords:
for i in myarticlewords:
for j in myarticlewords:
d[i][j]+=1.0
Traceback stoped at "d[i][j]+=1.0 "
When I tried :
dd=dict( (i,d[i].items() ) for i in d.keys() )
Traceback (most recent call last):
File "<pyshell#34>", line 1, in <module>
dd=dict( (i,d[i].items() ) for i in d.keys() )
File "<pyshell#34>", line 1, in <genexpr>
dd=dict( (i,d[i].items() ) for i in d.keys() )
MemoryError
Thanks!
You seem to be using a 32-bit version of python. If you're running windows, you've probably hit the windows memory limit for 32-bit programs, which is 2GB.
That lines up with the numbers I've calculated based on some educated guesses. First, a few important facts: getsizeof only returns the size of the dict itself, not of the things stored in it. This is true of all "container" types. Also, dictionaries increase their size in a staggered way, after every so many items are added.
Now, when I store a dictionary with between about 5500 and 21000 items, getsizeof returns 786712 -- i.e. 393356 * 2. My version of Python is 64-bit, so this strongly suggests to me that you're storing between 5500 and 21000 items using a 32-bit version of Python. You're using nltk, which suggests that you're storing word digrams here. So that means you have a minimum of about 5500 words. You're storing a second dictionary for each of those words, which is also a 5500-item dictionary. So what you really have here is 393356 + 393356 * 5500 bytes, plus a minimum of 5500 * 20 bytes for word storage. Summing it all up:
>>> (393356 + 393356 * 5500 + 5500 * 20) / 1000000000.0
2.163961356
You're trying to store at least 2GB of data. So in short, if you want to make use of those 32 gigabytes of memory, you should upgrade to a 64-bit version of Python.
I'll add that if you're concerned about performance, you may just want to use pickle (or cPickle) rather than shelve to store the dictionary. shelve will probably be slower, even if you set writeback=True.
>>> shelve_d = shelve.open('data', writeback=True)
>>> normal_d = {}
>>> def fill(d):
... for i in xrange(100000):
... d[str(i)] = i
...
>>> %timeit fill(shelve_d)
1 loops, best of 3: 2.6 s per loop
>>> %timeit fill(normal_d)
10 loops, best of 3: 35.4 ms per loop
Saving the dictionary with pickle will take some time too, naturally, but at least it won't slow down the computation itself.
I need to create a large bytearry of a specific size but the size is not known prior to run time. The bytes need to be fairly random. The bytearray size may be as small as a few KBs but as large as a several MB. I do not want to iterate byte-by-byte. This is too slow -- I need performance similar to numpy.random. However, I do not have the numpy module available for this project. Is there something part of a standard python install that will do this? Or do i need to compile my own using C?
for those asking for timings:
>>> timeit.timeit('[random.randint(0,128) for i in xrange(1,100000)]',setup='import random', number=100)
35.73110193696641
>>> timeit.timeit('numpy.random.random_integers(0,128,100000)',setup='import numpy', number=100)
0.5785652013481126
>>>
The os module provides urandom, even on Windows:
bytearray(os.urandom(1000000))
This seems to perform as quickly as you need, in fact, I get better timings than your numpy (though our machines could be wildly different):
timeit.timeit(lambda:bytearray(os.urandom(1000000)), number=10)
0.0554857286941
There are several possibilities, some faster than os.urandom. Also consider whether the data has to be generated deterministically from a random seed. This is invaluable for unit tests where failures have to be reproducible.
short and pithy:
lambda n:bytearray(map(random.getrandbits,(8,)*n))
I've use the above for unit tests and it was fast enough but can it be done faster?
using itertools:
lambda n:bytearray(itertools.imap(random.getrandbits,itertools.repeat(8,n))))
itertools and struct producing 8 bytes per iteration
lambda n:(b''.join(map(struct.Struct("!Q").pack,itertools.imap(
random.getrandbits,itertools.repeat(64,(n+7)//8)))))[:n]
Anything based on b''.join will fill 3-7x the memory consumed by the final bytearray with temporary objects since it queues up all the sub-strings before joining them together and python objects have lots of storage overhead.
Producing large chunks with a specialized function gives better performance and avoids filling memory.
import random,itertools,struct,operator
def randbytes(n,_struct8k=struct.Struct("!1000Q").pack_into):
if n<8000:
longs=(n+7)//8
return struct.pack("!%iQ"%longs,*map(
random.getrandbits,itertools.repeat(64,longs)))[:n]
data=bytearray(n);
for offset in xrange(0,n-7999,8000):
_struct8k(data,offset,
*map(random.getrandbits,itertools.repeat(64,1000)))
offset+=8000
data[offset:]=randbytes(n-offset)
return data
Performance
.84 MB/s :original solution with randint:
4.8 MB/s :bytearray(getrandbits(8) for _ in xrange(n)): (solution by other poster)
6.4MB/s :bytearray(map(getrandbits,(8,)*n))
7.2 MB/s :itertools and getrandbits
10 MB/s :os.urandom
23 MB/s :itertools and struct
35 MB/s :optimised function (holds for len = 100MB ... 1KB)
Note:all tests used 10KB as the string size. Results were consistent up till intermediate results filled memory.
Note:os.urandom is meant to provide secure random seeds. Applications expand that seed with their own fast PRNG. Here's an example, using AES in counter mode as a PRNG:
import os
seed=os.urandom(32)
from cryptography.hazmat.primitives.ciphers import Cipher, algorithms, modes
from cryptography.hazmat.backends import default_backend
backend = default_backend()
cipher = Cipher(algorithms.AES(seed), modes.CTR(b'\0'*16), backend=backend)
encryptor = cipher.encryptor()
nulls=b'\0'*(10**5) #100k
from timeit import timeit
t=timeit(lambda:encryptor.update(nulls),number=10**5) #1GB, (100K*10k)
print("%.1f MB/s"%(1000/t))
This produces pseudorandom data at 180 MB/s. (no hardware AES acceleration, single core) That's only ~5x the speed of the pure python code above.
Addendum
There's a pure python crypto library waiting to be written. Putting the above techniques together with hashlib and stream cipher techniques looks promising. Here's a teaser, a fast string xor (42MB/s).
def xor(a,b):
s="!%iQ%iB"%divmod(len(a),8)
return struct.pack(s,*itertools.imap(operator.xor,
struct.unpack(s,a),
struct.unpack(s,b)))
What's wrong with just including numpy? Anyhow, this creates a random N-bit integer:
import random
N = 100000
bits = random.getrandbits(N)
So if you needed to see if the value of the j-th bit is set or not, you can do bits & (2**j)==(2**j)
EDIT: He asked for byte array not bit array. Ned's answer is better: your_byte_array= bytearray((random.getrandbits(8) for i in xrange(N))
import random
def randbytes(n):
for _ in xrange(n):
yield random.getrandbits(8)
my_random_bytes = bytearray(randbytes(1000000))
There's probably something in itertools that could help here, there always is...
My timings indicate that this goes about five times faster than [random.randint(0,128) for i in xrange(1,100000)]
What is an efficient way to initialize and access elements of a large array in Python?
I want to create an array in Python with 100 million entries, unsigned 4-byte integers, initialized to zero. I want fast array access, preferably with contiguous memory.
Strangely, NumPy arrays seem to be performing very slow. Are there alternatives I can try?
There is the array.array module, but I don't see a method to efficiently allocate a block of 100 million entries.
Responses to comments:
I cannot use a sparse array. It will be too slow for this algorithm because the array becomes dense very quickly.
I know Python is interpreted, but surely there is a way to do fast array operations?
I did some profiling, and I get about 160K array accesses (looking up or updating an element by index) per second with NumPy. This seems very slow.
I have done some profiling, and the results are completely counterintuitive.
For simple array access operations, numpy and array.array are 10x slower than native Python arrays.
Note that for array access, I am doing operations of the form:
a[i] += 1
Profiles:
[0] * 20000000
Access: 2.3M / sec
Initialization: 0.8s
numpy.zeros(shape=(20000000,), dtype=numpy.int32)
Access: 160K/sec
Initialization: 0.2s
array.array('L', [0] * 20000000)
Access: 175K/sec
Initialization: 2.0s
array.array('L', (0 for i in range(20000000)))
Access: 175K/sec, presumably, based upon the profile for the other array.array
Initialization: 6.7s
Just a reminder how Python's integers work: if you allocate a list by saying
a = [0] * K
you need the memory for the list (sizeof(PyListObject) + K * sizeof(PyObject*)) and the memory for the single integer object 0. As long as the numbers in the list stay below the magic number V that Python uses for caching, you are fine because those are shared, i.e. any name that points to a number n < V points to the exact same object. You can find this value by using the following snippet:
>>> i = 0
>>> j = 0
>>> while i is j:
... i += 1
... j += 1
>>> i # on my system!
257
This means that as soon as the counts go above this number, the memory you need is sizeof(PyListObject) + K * sizeof(PyObject*) + d * sizeof(PyIntObject), where d < K is the number of integers above V (== 256). On a 64 bit system, sizeof(PyIntObject) == 24 and sizeof(PyObject*) == 8, i.e. the worst case memory consumption is 3,200,000,000 bytes.
With numpy.ndarray or array.array, memory consumption is constant after initialization, but you pay for the wrapper objects that are created transparently, as Thomas Wouters said. Probably, you should think about converting the update code (which accesses and increases the positions in the array) to C code, either by using Cython or scipy.weave.
Try this:
x = [0] * 100000000
It takes just a few seconds to execute on my machine, and access is close to instant.
If you are are not able to vectorize your calculuations, Python/Numpy will be slow. Numpy is fast because vectorized calculations occur at a lower level than Python. The core numpy functions are all written in C or Fortran. Hence sum(a) is not a python loop with many accesses, it's a single low level C call.
Numpy's Performance Python demo page has a good example with different options. You can easily get 100x increase by using a lower level compiled language, Cython, or using vectorized functions if feasible. This blog post that shows a 43 fold increase using Cython for a numpy usecase.
It's unlikely you'll find anything faster than numpy's array. The implementation of the array itself is as efficient as it would be in, say, C (and basically the same as array.array, just with more usefulness.)
If you want to speed up your code, you'll have to do it by doing just that. Even though the array is implemented efficiently, accessing it from Python code has certain overhead; for example, indexing the array produces integer objects, which have to be created on the fly. numpy offers a number of operations implemented efficiently in C, but without seeing the actual code that isn't performing as well as you want it's hard to make any specific suggestions.
For fast creation, use the array module.
Using the array module is ~5 times faster for creation, but about twice as slow for accessing elements compared to a normal list:
# Create array
python -m timeit -s "from array import array" "a = array('I', '\x00'
* 100000000)"
10 loops, best of 3: 204 msec per loop
# Access array
python -m timeit -s "from array import array; a = array('I', '\x00'
* 100000000)" "a[4975563]"
10000000 loops, best of 3: 0.0902 usec per loop
# Create list
python -m timeit "a = [0] * 100000000"
10 loops, best of 3: 949 msec per loop
# Access list
python -m timeit -s "a = [0] * 100000000" "a[4975563]"
10000000 loops, best of 3: 0.0417 usec per loop
In addition to the other excellent solutions, another way is to use a dict instead of an array (elements which exist are non-zero, otherwise they're zero). Lookup time is O(1).
You might also check if your application is resident in RAM, rather than swapping out. It's only 381 MB, but the system may not be giving you it all for whatever reason.
However there are also some really fast sparse matrices (SciPy and ndsparse). They are done in low-level C, and might also be good.
If
access speed of array.array is acceptable for your application
compact storage is most important
you want to use standard modules (no NumPy dependency)
you are on platforms that have /dev/zero
the following may be of interest to you. It initialises array.array about 27 times faster than array.array('L', [0]*size):
myarray = array.array('L')
f = open('/dev/zero', 'rb')
myarray.fromfile(f, size)
f.close()
On How to initialise an integer array.array object with zeros in Python I'm looking for an even better way.
I would simply create your own data type that doesn't initialize ANY values.
If you want to read an index position that has NOT been initialized, you return zeroes. Still, do not initialize any storage.
If you want to read an index position that HAS been initialized, simply return the value.
If you want to write to an index position that has NOT been initialized, initialize it, and store the input.
NumPy is the appropriate tool for a large, fixed-size, homogeneous array. Accessing individual elements of anything in Python isn't going to be all that fast, though whole-array operations can often be conducted at speeds similar to C or Fortran. If you need to do operations on millions and millions of elements individually quickly, there is only so much you can get out of Python.
What sort of algorithm are you implementing? How do you know that using sparse arrays is too slow if you haven't tried it? What do you mean by "efficient"? You want quick initialization? That is the bottleneck of your code?