An efficient way of making a large random bytearray - python

I need to create a large bytearray of a specific size, but the size is not known prior to run time. The bytes need to be fairly random. The bytearray may be as small as a few KB or as large as several MB. I do not want to iterate byte-by-byte: that is too slow -- I need performance similar to numpy.random. However, I do not have the numpy module available for this project. Is there something in a standard Python install that will do this, or do I need to compile my own using C?
For those asking for timings:
>>> timeit.timeit('[random.randint(0,128) for i in xrange(1,100000)]',setup='import random', number=100)
35.73110193696641
>>> timeit.timeit('numpy.random.random_integers(0,128,100000)',setup='import numpy', number=100)
0.5785652013481126
>>>

The os module provides urandom, even on Windows:
bytearray(os.urandom(1000000))
This seems to perform as quickly as you need; in fact, I get better timings than your numpy results (though our machines could be wildly different):
timeit.timeit(lambda:bytearray(os.urandom(1000000)), number=10)
0.0554857286941

There are several possibilities, some faster than os.urandom. Also consider whether the data has to be generated deterministically from a random seed. This is invaluable for unit tests where failures have to be reproducible.
short and pithy:
lambda n:bytearray(map(random.getrandbits,(8,)*n))
I've used the above for unit tests and it was fast enough, but can it be done faster?
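For the reproducible-seed case mentioned above, the same pattern works with a dedicated, seeded random.Random instance; a minimal sketch (the seed value is arbitrary):
import random

def seeded_randbytes(n, seed=1234):
    rng = random.Random(seed)                    # independent, reproducible generator
    return bytearray(map(rng.getrandbits, (8,) * n))

# Same seed, same bytes, so a failing test can be replayed exactly.
assert seeded_randbytes(16) == seeded_randbytes(16)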
using itertools:
lambda n:bytearray(itertools.imap(random.getrandbits,itertools.repeat(8,n)))
itertools and struct producing 8 bytes per iteration
lambda n:(b''.join(map(struct.Struct("!Q").pack,itertools.imap(
    random.getrandbits,itertools.repeat(64,(n+7)//8)))))[:n]
Anything based on b''.join will fill 3-7x the memory consumed by the final bytearray with temporary objects since it queues up all the sub-strings before joining them together and python objects have lots of storage overhead.
Producing large chunks with a specialized function gives better performance and avoids filling memory.
import random,itertools,struct,operator
def randbytes(n,_struct8k=struct.Struct("!1000Q").pack_into):
    if n<8000:
        longs=(n+7)//8
        return struct.pack("!%iQ"%longs,*map(
            random.getrandbits,itertools.repeat(64,longs)))[:n]
    data=bytearray(n)
    for offset in xrange(0,n-7999,8000):
        _struct8k(data,offset,
            *map(random.getrandbits,itertools.repeat(64,1000)))
    offset+=8000
    data[offset:]=randbytes(n-offset)
    return data
Performance
0.84 MB/s : original solution with randint
4.8 MB/s : bytearray(getrandbits(8) for _ in xrange(n)) (solution by other poster)
6.4 MB/s : bytearray(map(getrandbits,(8,)*n))
7.2 MB/s : itertools and getrandbits
10 MB/s : os.urandom
23 MB/s : itertools and struct
35 MB/s : optimised function (holds for len = 100MB ... 1KB)
Note: all tests used 10KB as the string size. Results were consistent up until intermediate results filled memory.
Note: os.urandom is meant to provide secure random seeds. Applications expand that seed with their own fast PRNG. Here's an example, using AES in counter mode as a PRNG:
import os
seed=os.urandom(32)
from cryptography.hazmat.primitives.ciphers import Cipher, algorithms, modes
from cryptography.hazmat.backends import default_backend
backend = default_backend()
cipher = Cipher(algorithms.AES(seed), modes.CTR(b'\0'*16), backend=backend)
encryptor = cipher.encryptor()
nulls=b'\0'*(10**5) #100k
from timeit import timeit
t=timeit(lambda:encryptor.update(nulls),number=10**4) #1GB, (100K*10k)
print("%.1f MB/s"%(1000/t))
This produces pseudorandom data at 180 MB/s. (no hardware AES acceleration, single core) That's only ~5x the speed of the pure python code above.
Addendum
There's a pure python crypto library waiting to be written. Putting the above techniques together with hashlib and stream cipher techniques looks promising. Here's a teaser, a fast string xor (42MB/s).
def xor(a,b):
    s="!%iQ%iB"%divmod(len(a),8)
    return struct.pack(s,*itertools.imap(operator.xor,
        struct.unpack(s,a),
        struct.unpack(s,b)))
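A quick sanity check of the xor above (both inputs must be the same length for the struct format to line up):
a = b"\x0f" * 10
b = b"\xf0" * 10
assert xor(a, b) == b"\xff" * 10   # bytewise XOR of two equal-length strings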

What's wrong with just including numpy? Anyhow, this creates a random N-bit integer:
import random
N = 100000
bits = random.getrandbits(N)
So if you need to see whether the j-th bit is set, you can check (bits & (2**j)) == (2**j) -- note the parentheses, since == binds more tightly than & in Python.
EDIT: He asked for a byte array, not a bit array. Ned's answer is better: your_byte_array = bytearray(random.getrandbits(8) for i in xrange(N))

import random
def randbytes(n):
    for _ in xrange(n):
        yield random.getrandbits(8)
my_random_bytes = bytearray(randbytes(1000000))
There's probably something in itertools that could help here, there always is...
My timings indicate that this goes about five times faster than [random.randint(0,128) for i in xrange(1,100000)]

Related

Why does my python 2.7 process use an incredibly high amount of memory?

I am trying to understand why this python code results in a process that requires 236 MB of memory, considering that the list is only 76 MB in size.
import sys
import psutil
initial = psutil.virtual_memory().available / 1024 / 1024
available_memory = psutil.virtual_memory().available
vector_memory = sys.getsizeof([])
vector_position_memory = sys.getsizeof([1]) - vector_memory
positions = 10000000
print "vector with %d positions should use %d MB of memory " % (positions, (vector_memory + positions * vector_position_memory) / 1024 / 1024)
print "it used %d MB of memory " % (sys.getsizeof(range(0, positions)) / 1024 / 1024)
final = psutil.virtual_memory().available / 1024 / 1024
print "however, this process used in total %d MB" % (initial - final)
The output is:
vector with 10000000 positions should use 76 MB of memory
it used 76 MB of memory
however, this process used in total 236 MB
Adding x10 more positions (i.e. positions = 100000000) results in x10 more memory.
vector with 100000000 positions should use 762 MB of memory
it used 762 MB of memory
however, this process used in total 2330 MB
My ultimate goal is to suck as much memory as I can to create a very long list. To do this, I created this code to understand/predict how big my list could be based on available memory. To my surprise, python needs a ton of memory to manage my list, I guess.
Why does python use so much memory?! What is it doing with it? Any idea how I can predict python's memory requirements so that I can create a list that uses pretty much all the available memory while preventing the OS from swapping?
The getsizeof function only includes the space used by the list itself.
But the list is effectively just an array of pointers to int objects, and you created 10000000 of those, and each one of those takes memory as well—typically 24 bytes.
The first few numbers (usually up to 255) are pre-created and cached by the interpreter, so they're effectively free, but the rest are not. So, you want to add something like this:
int_memory = sys.getsizeof(10000)
print "%d int objects should use another %d MB of memory " % (positions - 256, (positions - 256) * int_memory / 1024 / 1024)
And then the results will make more sense.
But notice that if you aren't creating a range with 10M unique ints, but instead, say, 10M random ints from 0-10000, or 10M copies of 0, that calculation will no longer be correct. So if you want to handle those cases, you need to do something like stashing the id of every object you've seen so far and skipping any additional references to the same id.
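A minimal sketch of that idea (tracking ids so shared objects are only counted once; it only descends into lists and tuples, unlike the fuller recipes mentioned below):
import sys

def deep_getsizeof(obj, seen=None):
    # Count each distinct object once, so shared references aren't double-counted.
    if seen is None:
        seen = set()
    if id(obj) in seen:
        return 0
    seen.add(id(obj))
    size = sys.getsizeof(obj)
    if isinstance(obj, (list, tuple)):
        size += sum(deep_getsizeof(item, seen) for item in obj)
    return size

print deep_getsizeof(range(0, 1000))   # list plus 1000 distinct int objects
print deep_getsizeof([0] * 1000)       # list plus one shared int, counted once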
The Python 2.x docs used to have a link to an old recursive getsizeof function that does that, and more… but that link went dead, so it was removed.
The 3.x docs have a link to a newer one, which may or may not work in Python 2.7. (I notice from a quick glance that it uses a __future__ statement for print, and falls back from reprlib.repr to repr, so it probably does.)
If you're wondering why every int is 24 bytes long (in 64-bit CPython; it's different for different platforms and implementations, of course):
CPython represents every builtin type as a C struct that contains, at least, space for a refcount and a pointer to the type. Any actual value the object needs to represent is in addition to that. [1] So, the smallest non-singleton type is going to take 24 bytes per instance.
If you're wondering how you can avoid using up 24 bytes per integer, the answer is to use NumPy's ndarray—or, if for some reason you can't, the stdlib's array.array.
Either one lets you specify a "native type", like np.int32 for NumPy or i for array.array, and create an array that holds 100M of those native-type values directly. That takes exactly 4 bytes per value, plus a few dozen bytes of constant header overhead, which is far smaller than a list's 8-byte pointer per value (plus a bit of slack at the end that scales with the length) plus an int object wrapping each value.
Using array.array, you're sacrificing speed for space [2], because every time you want to access one of those values, Python has to pull it out and "box" it as an int object.
Using NumPy, you're gaining both speed and space, because NumPy will let you perform vectorized operations over the whole array in a tightly-optimized C loop.
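A rough way to see the difference (sizes vary by platform and Python version; numpy is only needed for the last two lines):
import sys
import array
import numpy as np

n = 10000000

as_list = range(n)                        # list of pointers, plus an int object per value
as_array = array.array('i', xrange(n))    # packed native 4-byte ints
as_numpy = np.arange(n, dtype=np.int32)   # packed 4-byte ints with vectorized operations

print sys.getsizeof(as_list)    # pointer array only; the int objects are extra
print sys.getsizeof(as_array)   # about 4 bytes per value plus a small header
print as_numpy.nbytes           # exactly 4 * n bytes of data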
1. What about non-builtin types, that you create in Python with class? They have a pointer to a dict—which you can see from Python-land as __dict__—that holds all the attributes you add. So they're 24 bytes according to getsizeof, but of course you have to also add the size of that dict.
2. Unless you aren't. Preventing your system from going into swap hell is likely to speed things up a lot more than the boxing and unboxing slows things down. And, even if you aren't avoiding that massive cliff, you may still be avoiding smaller cliffs involving VM paging or cache locality.

List with sparse data consumes less memory than the same data as numpy array

I am working with very high dimensional vectors for machine learning and was thinking about using numpy to reduce the amount of memory used. I run a quick test to see how much memory I could save using numpy (1)(3):
Standard list
import random
random.seed(0)
vector = [random.random() for i in xrange(2**27)]
Numpy array
import numpy
import random
random.seed(0)
vector = numpy.fromiter((random.random() for i in xrange(2**27)), dtype=float)
Memory usage (2)
Numpy array: 1054 MB
Standard list: 2594 MB
Just like I expected.
By allocating a contiguous block of memory of native floats, numpy consumes only about half the memory the standard list is using.
Because I know my data is pretty sparse, I did the same test with sparse data.
Standard list
import random
random.seed(0)
vector = [random.random() if random.random() < 0.00001 else 0.0 for i in xrange(2 ** 27)]
Numpy array
import numpy
import random
random.seed(0)
vector = numpy.fromiter((random.random() if random.random() < 0.00001 else 0.0 for i in xrange(2 ** 27)), dtype=float)
Memory usage (2)
Numpy array: 1054 MB
Standard list: 529 MB
Now all of a sudden, the python list uses half the amount of memory the numpy array uses! Why?
One thing I could think of is that python dynamically switches to a dict representation when it detects that it contains very sparse data. Checking this could potentially add a lot of extra run-time overhead so I don't really think that this is going on.
Notes
I started a fresh new python shell for every test.
Memory measured with htop.
Run on 32bit Debian.
A Python list is just an array of references (pointers) to Python objects. In CPython (the usual Python implementation) a list gets slightly over-allocated to make expansion more efficient, but it never gets converted to a dict. See the source code for further details: List object implementation
In the sparse version of the list, you have a lot of pointers to a single shared 0.0 float object (the constant in the list comprehension is created once and reused). Those pointers take up 32 bits = 4 bytes each on your 32-bit build, but your numpy floats are 64 bits = 8 bytes each (dtype=float means float64).
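You can verify the sharing directly; a small sketch (exact byte counts depend on the build):
import sys

sparse = [1.23 if i == 0 else 0.0 for i in xrange(1000)]

# Every 0.0 element is literally the same object; only its pointer is repeated.
print len(set(id(x) for x in sparse))   # 2: one 1.23 object, one shared 0.0 object

# The list itself is just the pointer array; the shared float is paid for once.
print sys.getsizeof(sparse)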
FWIW, to make the sparse list / array tests more accurate you should call random.seed(some_const) with the same seed in both versions so that you get the same number of zeroes in both the Python list and the numpy array.

File Checksums in Python

I am creating an application related to files, and I was looking for ways to compute checksums for them. I want to know which hashing method is best for calculating file checksums -- MD5, SHA-1, or something else -- based on these criteria:
The checksum should be unique. I know this is theoretical, but I still want the probability of collisions to be very, very small.
I can conclude that two files are equal if their checksums are equal.
Speed (not very important, but still).
Please feel free to be as elaborate as possible.
It depends on your use case.
If you're only worried about accidental collisions, both MD5 and SHA-1 are fine, and MD5 is generally faster. In fact, MD4 is also sufficient for most use cases, and usually even faster… but it isn't as widely implemented. (In particular, it isn't in hashlib.algorithms_guaranteed… although it should be in hashlib.algorithms_available on most stock Mac, Windows, and Linux builds.)
On the other hand, if you're worried about intentional attacks—i.e., someone intentionally crafting a bogus file that matches your hash—you have to consider the value of what you're protecting. MD4 is almost definitely not sufficient, MD5 is probably not sufficient, but SHA-1 is borderline. At present, Keccak (which will soon be SHA-3) is believed to be the best bet, but you'll want to stay on top of this, because things change every year.
The Wikipedia page on Cryptographic hash function has a table that's usually updated pretty frequently. To understand the table:
To generate a collision against an MD4 requires only 3 rounds, while MD5 requires about 2 million, and SHA-1 requires 15 trillion. That's enough that it would cost a few million dollars (at today's prices) to generate a collision. That may or may not be good enough for you, but it's not good enough for NIST.
Also, remember that "generally faster" isn't nearly as important as "tested faster on my data and platform". With that in mind, in 64-bit Python 3.3.0 on my Mac, I created a 1MB random bytes object, then did this:
In [173]: md4 = hashlib.new('md4')
In [174]: md5 = hashlib.new('md5')
In [175]: sha1 = hashlib.new('sha1')
In [180]: %timeit md4.update(data)
1000 loops, best of 3: 1.54 ms per loop
In [181]: %timeit md5.update(data)
100 loops, best of 3: 2.52 ms per loop
In [182]: %timeit sha1.update(data)
100 loops, best of 3: 2.94 ms per loop
As you can see, md4 is significantly faster than the others.
Tests using hashlib.md5() instead of hashlib.new('md5'), and using bytes with less entropy (runs of 1-8 string.ascii_letters separated by spaces) didn't show any significant differences.
And, for the hash algorithms that came with my installation, as tested below, nothing beat md4.
for x in hashlib.algorithms_available:
    h = hashlib.new(x)
    print(x, timeit.timeit(lambda: h.update(data), number=100))
If speed is really important, there's a nice trick you can use to improve on this: Use a bad, but very fast, hash function, like zlib.adler32, and only apply it to the first 256KB of each file. (For some file types, the last 256KB, or the 256KB nearest the middle without going over, etc. might be better than the first.) Then, if you find a collision, generate MD4/SHA-1/Keccak/whatever hashes on the whole file for each file.
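A sketch of that two-pass idea (the helper names are made up for illustration; adjust the 256KB window and the fallback hash to taste):
import os
import zlib
from collections import defaultdict

PREFIX = 256 * 1024   # the cheap pass only reads the first 256KB of each file

def quick_key(path):
    # Fast, collision-prone fingerprint: file size plus adler32 of the first 256KB.
    with open(path, 'rb') as f:
        return os.path.getsize(path), zlib.adler32(f.read(PREFIX))

def candidate_duplicates(paths):
    groups = defaultdict(list)
    for p in paths:
        groups[quick_key(p)].append(p)
    # Only files whose cheap keys collide are worth a full-file hash
    # (e.g. with the hash_file function shown further down).
    return [ps for ps in groups.values() if len(ps) > 1]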
Finally, since someone asked in a comment how to hash a file without reading the whole thing into memory:
def hash_file(path, algorithm='md5', bufsize=8192):
    h = hashlib.new(algorithm)
    with open(path, 'rb') as f:
        while True:
            block = f.read(bufsize)
            if not block:
                break
            h.update(block)
    return h.digest()
If squeezing out every bit of performance is important, you'll want to experiment with different values for bufsize on your platform (powers of two from 4KB to 8MB). You also might want to experiment with using raw file handles (os.open and os.read), which may sometimes be faster on some platforms.
The collision probability with a hash of sufficiently many bits is, theoretically, quite small:
Assuming random hash values with a uniform distribution, a collection
of n different data blocks and a hash function that generates b bits,
the probability p that there will be one or more collisions is bounded
by the number of pairs of blocks multiplied by the probability that a
given pair will collide, i.e.
p <= n(n-1)/2 * 2^-b
And, so far, SHA-1 collisions with 160 bits have been unobserved. Assuming one exabyte (10^18) of data, in 8KB blocks, the theoretical chance of a collision is 10^-20 -- a very very small chance.
A useful shortcut is to eliminate files known to be different from each other through short-circuiting.
For example, in outline:
Read the first X blocks of all files of interest;
Group the ones that have the same hash for the first X blocks as potentially containing the same file data;
For each file with the first X blocks that are unique, you can assume the entire file is unique vs all other tested files -- you do not need to read the rest of that file;
With the remaining files, read more blocks until you prove the signatures are the same or different.
With X blocks of sufficient size, 95%+ of the files will be correctly discriminated into unique files in the first pass. This is much faster than blindly reading the entire file and calculating the full hash for each and every file.
MD5 tends to work great for checksums ... same with SHA-1 ... both have a very small probability of collisions, although I think SHA-1 has a slightly smaller collision probability since it uses more bits.
If you are really worried about it, you could use both checksums (one MD5 and one SHA-1). The chance that both match while the files differ is infinitesimally small (still not 100% impossible, but very, very unlikely) ... (this seems like bad form and is by far the slowest solution).
Typically (read: in every instance I have ever encountered) an MD5 or an SHA-1 match is sufficient to assume uniqueness.
There is no way to 100% guarantee uniqueness short of a byte-by-byte comparison.
I created a small duplicate-file remover script a few days back, which reads the content of each file and creates a hash for it, then compares it with the next file; even if the names are different, the checksums will be the same.
import hashlib
import os
hash_table = {}
dups = []
path = "C:\\images"
for img in os.listdir(path):
    img_path = os.path.join(path, img)
    _file = open(img_path, "rb")
    content = _file.read()
    _file.close()
    md5 = hashlib.md5(content)
    _hash = md5.hexdigest()
    if _hash in hash_table:
        dups.append(img)
    else:
        hash_table[_hash] = img

Efficient way to generate and use millions of random numbers in Python

I'm in the process of working on a programming project that involves some pretty extensive Monte Carlo simulation in Python, and as such the generation of a tremendous number of random numbers. Very nearly all of them, if not all of them, can be generated by Python's built-in random module.
I'm something of a coding newbie, and unfamiliar with efficient and inefficient ways to do things. Is it faster to generate, say, all the random numbers as a list and then iterate through that list, or to generate a new random number each time a function is called, which will happen inside a very large loop?
Or some other, undoubtedly more clever method?
Generate a random number each time. Since the inner workings of the loop only care about a single random number, generate and use it inside the loop.
Example:
# do this:
import random
for x in xrange(SOMEVERYLARGENUMBER):
    n = random.randint(1,1000) # whatever your range of random numbers is
    # Do stuff with n

# don't do this:
import random
# This list comprehension generates random numbers in a list
numbers = [random.randint(1,1000) for x in xrange(SOMEVERYLARGENUMBER)]
for n in numbers:
    # Do stuff with n
    pass
Obviously, in practical terms it really doesn't matter, unless you're dealing with billions and billions of iterations, but why bother generating all those numbers if you're only going to be using one at a time?
import random
for x in (random.randint(0,80) for x in xrange(1000*1000)):
    print x
The code between parentheses will only generate one item at a time, so it's memory safe.
Python's builtin random module, e.g. random.random() or random.randint() (some distributions are also available; you probably want gaussian), does about 300K samples/s.
Since you are doing numerical computation, you probably use numpy anyway, which offers better performance if you generate random numbers one array at a time instead of one number at a time, plus a wider choice of distributions. At roughly 60K array draws/s * 1024 (array length), that's ~60M samples/s.
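For example, drawing one array per call and handing the values out from a generator (the batch size of 1024 matches the array length quoted above):
import numpy as np

BATCH = 1024

def random_stream():
    # One numpy call per 1024 samples instead of 1024 Python-level calls.
    while True:
        for x in np.random.random(BATCH):
            yield x

stream = random_stream()
total = sum(next(stream) for _ in xrange(100000))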
You can also read /dev/urandom on Linux and OSX; my hw/sw (OSX laptop) manages ~10MB/s.
Surely there must be faster ways to generate random numbers en masse, e.g.:
from Crypto.Cipher import AES
from Crypto.Util import Counter
import secrets
aes = AES.new(secrets.token_bytes(16), AES.MODE_CTR, counter=Counter.new(128))
data = b"0" * 2 ** 20
with open("filler.bin", "wb") as f:
    while True:
        f.write(aes.encrypt(data))
This generates 200MB/s on a single core of i5-4670K
Common ciphers like AES and Blowfish manage 112 MB/s and 70 MB/s on my stack. Furthermore, modern processors make AES even faster, up to some 700 MB/s -- see this link to test runs on a few hardware combinations (edit: link broken). You could use the weaker ECB mode, provided you feed distinct inputs into it, and achieve up to 3 GB/s.
Stream ciphers are better suited for the task, e.g. RC4 tops out at 300 MB/s on my hardware. You may get the best results from the most popular ciphers, as more effort was spent optimising those, both in hardware and in software.
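For instance, with PyCrypto's ARC4 (key drawn from os.urandom; encrypting zeros just exposes the keystream, and RC4 output is of course not suitable where real cryptographic strength is required):
import os
from Crypto.Cipher import ARC4

cipher = ARC4.new(os.urandom(32))        # keyed once, then streamed
zeros = b"\0" * (1024 * 1024)
random_megabyte = cipher.encrypt(zeros)  # 1MB of pseudorandom bytes per call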
Code to generate 10M random numbers efficiently and faster:
import random
l=10000000
listrandom=[]
for i in range (l):
    value=random.randint(0,l)
    listrandom.append(value)
print listrandom
Time taken includes the I/O time spent printing to the screen:
real 0m27.116s
user 0m24.391s
sys 0m0.819s
Using Numpy -
import numpy as np
# low and high set the range (e.g. 1 to 100); size is how many numbers to draw
np.random.randint(low=1, high=100, size=1000000)

Best seed for parallel process

I need to run Monte Carlo simulations in parallel on different machines. The code is in C++, but the program is set up and launched with a Python script that sets a lot of things, in particular the random seed. The function setseed takes a 4-byte unsigned integer.
Using a simple
import time
setseed(int(time.time()))
is not very good because I submit the jobs to a queue on a cluster; they remain pending for some minutes and then start, but the start time is unpredictable and two jobs can start at the same time (to the second), so I switched to:
setseed(int(time.time()*100))
but I'm not happy with it. What is the best solution? Maybe I can combine information from time, machine id, and process id. Or maybe the best solution is to read from /dev/random (Linux machines)?
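A sketch of what combining those sources might look like, hashing them down to the 4-byte unsigned integer setseed expects (setseed is the function mentioned above):
import os
import time
import socket
import hashlib
import struct

# Mix the start time, hostname and process id, then crush to a 4-byte unsigned int.
material = "%r %s %d" % (time.time(), socket.gethostname(), os.getpid())
seed = struct.unpack("!I", hashlib.sha1(material).digest()[:4])[0]
setseed(seed)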
How to read 4 bytes from /dev/random?
f = open("/dev/random","rb")
f.read(4)
gives me a string, but I want an integer!
Reading from /dev/random is a good idea. Just convert the 4-byte string into an integer:
f = open("/dev/random","rb")
rnd_str = f.read(4)
Either using struct:
import struct
rand_int = struct.unpack('I', rnd_str)[0]
Update: the uppercase I is needed.
Or multiply and add:
rand_int = 0
for c in rnd_str:
    rand_int <<= 8
    rand_int += ord(c)
You could simply copy the four bytes over into an integer; that should be the least of your worries.
But parallel pseudo-random number generation is a rather complex topic and very often not done well. Usually you generate seeds on one machine and distribute them to the others.
Take a look at SPRNG, which handles exactly your problem.
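The simplest form of "generate seeds on one machine" is to draw them all from a single well-seeded generator on the submitting host and hand one to each job; a sketch (how the seed reaches each job depends on your queue system):
import os
import random
import struct

# One master generator on the submitting machine, seeded from the OS.
master = random.Random(struct.unpack("!I", os.urandom(4))[0])

n_jobs = 100
seeds = [master.randrange(2**32) for _ in xrange(n_jobs)]
# seeds[i] is a 4-byte unsigned integer suitable for setseed();
# pass it to job i however your submission script passes parameters.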
If this is Linux or a similar OS, you want /dev/urandom -- it always produces data immediately.
/dev/random may stall waiting for the system to gather randomness. It does produce cryptographic-grade random numbers, but that is overkill for your problem.
You can use a random number as the seed, which has the advantage of being operating-system agnostic (no /dev/random needed), with no conversion from string to int:
Why not simply use
random.randrange(-2**31, 2**31)
as the seed of each process? Slightly different starting times give wildly different seeds, this way…
You could alternatively use the random.jumpahead method, if you know roughly how many random numbers each process is going to use (the documentation of random.WichmannHill.jumpahead is useful).
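A sketch of the jumpahead approach (Python 2 only; the stride is a rough upper bound on how many numbers each process will draw):
import random

def make_stream(process_index, base_seed=12345, stride=10**7):
    # Every process starts from the same base seed, then jumps to its own
    # distant region of the generator's sequence (Python 2 random only).
    rng = random.Random(base_seed)
    rng.jumpahead(process_index * stride)
    return rng

rng = make_stream(3)   # e.g. the fourth worker
print rng.random()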
