Is it possible to share gmpy2 multiprecision integers (https://pypi.python.org/pypi/gmpy2) between processes (created by multiprocessing) without creating copies in memory?
Each integer has about 750,000 bits. The integers are not modified by the processes.
Thank you.
Update: Tested code is below.
I would try the following untested approach:
Create a memory mapped file using Python's mmap library.
Use gmpy2.to_binary() to convert a gmpy2.mpz instance into a binary string.
Write both the length of the binary string and the binary string itself into the memory mapped file. To allow for random access, begin every write at a multiple of a fixed value, say 94000 in your case.
Populate the memory mapped file with all your values.
Then in each process, use gmpy2.from_binary() to read the data from the memory mapped file.
You need to read both the length of the binary string and the binary string itself. You should be able to pass a slice of the memory mapped file directly to gmpy2.from_binary().
It may be simpler to create a list of (start, end) values for the position of each byte string in the memory mapped file and then pass that list to each process.
Update: Here is some sample code that has been tested on Linux with Python 3.4.
import mmap
import struct
import multiprocessing as mp
import gmpy2

# Number of mpz integers to place in the memory buffer.
z_count = 40000

# Maximum number of bits in each integer.
z_bits = 750000

# Total number of bytes used to store each integer.
# Size is rounded up to a multiple of 4.
z_size = 4 + (((z_bits + 31) // 32) * 4)

def f(instance):
    global mm
    s = 0
    for i in range(z_count):
        mm.seek(i * z_size)
        t = struct.unpack('i', mm.read(4))[0]
        z = gmpy2.from_binary(mm.read(t))
        s += z
    print(instance, z % 123456789)

def main():
    global mm
    mm = mmap.mmap(-1, z_count * z_size)
    rs = gmpy2.random_state(42)
    for i in range(z_count):
        z = gmpy2.mpz_urandomb(rs, z_bits)
        b = gmpy2.to_binary(z)
        mm.seek(i * z_size)
        mm.write(struct.pack('i', len(b)))
        mm.write(b)
    ctx = mp.get_context('fork')
    pool = ctx.Pool(4)
    pool.map_async(f, range(4))
    pool.close()
    pool.join()

if __name__ == '__main__':
    main()
Related
I am trying to remove the top 24 lines of a raw file. I opened the original raw file (call it raw1.raw), converted it to a NumPy array, then initialized a new array and removed the top 24 lines. But after writing the new array to a new binary file (raw2.raw), I found raw2.raw is only 15.2 MB while the original raw1.raw is about 30.6 MB. My code:
import numpy as np
import imageio
import rawpy
import cv2

def ave():
    fd = open('raw1.raw', 'rb')
    rows = 3000  # around 3000, not the real rows
    cols = 5100  # around 5100, not the real cols
    f = np.fromfile(fd, dtype=np.uint8, count=rows*cols)
    I_array = f.reshape((rows, cols))  # notice row, column format
    #print(I_array)
    fd.close()
    im = np.zeros((rows - 24, cols))
    for i in range(len(I_array) - 24):
        for j in range(len(I_array[i])):
            im[i][j] = I_array[i + 24][j]
    #print(im)
    newFile = open("raw2.raw", "wb")
    im.astype('uint8').tofile(newFile)
    newFile.close()

if __name__ == "__main__":
    ave()
I tried using im.astype('uint16') when writing the binary file, but the values are wrong if I use uint16.
There must clearly be more data in your 'raw1.raw' file that you are not using. Are you sure that file wasn't created from 'uint16' data, so that you are just pulling out the first half as 'uint8' data? I just checked by writing random data:
import os, numpy as np
x = np.random.randint(0,256,size=(3000,5100),dtype='uint8')
x.tofile(open('testfile.raw','wb'))
print(os.stat('testfile.raw').st_size) #I get 15.3MB.
So, 'uint8' for a 3000 by 5100 clearly takes up 15.3MB. I don't know how you got 30+.
EDIT:
Just to add more clarification: do you realize that reassigning dtype does nothing more than change the "view" of your data? It doesn't affect the actual data that is stored in memory. This also applies to data that you read from a file. Take for example:
import numpy as np
#The way to understand x is that x takes 12 bytes in memory and uses
#that information to hold 3 values. The first 4 bytes are the first value,
#the second 4 bytes are the second, etc.
x = np.array([1,2,3],dtype='uint32')
#Change x to display those 12 bytes as 6 different values. Doing this does
#NOT change the data that the array is holding. You are only changing the
#'view' of the data.
x.dtype = 'uint16'
print(x)
In general (there are a few special cases), changing the dtype doesn't change the underlying data. However, the conversion function .astype() does change the underlying data. If you have an array of 12 bytes viewed as 'uint32', then running .astype('uint8') will take each entry (4 bytes) and convert it (known as casting) to a uint8 entry (1 byte). The new array will only have 3 bytes for the 3 entries. You can see this literally:
x = np.array([1,2,3],dtype='uint32')
print(x.tobytes())
y = x.astype('uint8')
print(y.tobytes())
So, when we say that a file is 30MB, we mean that the file (minus some header information) holds 30,000,000 bytes, which are exactly uint8s. 1 uint8 is 1 byte. If an array has 6000 by 5100 uint8s (bytes), then the array has 30,600,000 bytes of information in memory.
Likewise, if you read a file (it DOES NOT MATTER which file) with np.fromfile(..., dtype=np.uint8, count=15_300_000), then you told Python to read EXACTLY 15_300_000 bytes (again, 1 byte is 1 uint8) of information (15.3MB). Whether your file is 100MB, 40MB, or even 30MB is completely irrelevant, because you told Python to read only the first 15.3MB of data.
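To make the byte accounting concrete, here is a small, untested sketch that reuses the testfile.raw written above (the counts are just illustrative):

import numpy as np

# Read only the first 100 bytes of the file, whatever its total size.
a = np.fromfile('testfile.raw', dtype=np.uint8, count=100)
print(a.nbytes)            # 100 -- one byte per uint8

# Re-viewing the same buffer as uint16 halves the element count, not the bytes.
b = a.view('uint16')
print(b.size, b.nbytes)    # 50 100

# Casting with .astype() allocates new, larger storage.
c = a.astype('uint16')
print(c.size, c.nbytes)    # 100 200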
I want to "unpack" OR de-serialize the formatted data that is outputed from python's struct.pack() function. The data is sent over the network to another platform that uses Java only.
The Python function that sends data over the network, uses this formater:
def formatOutputMsg_Array(self, mac, arr):
    mac_bin = mac.encode("ascii")
    mac_len = len(mac_bin)
    arr_bin = array.array('d', arr).tobytes()
    arr_len = len(arr_bin)
    m = struct.pack('qqd%ss%ss' % (mac_len, arr_len), mac_len, arr_len, time.time(), mac_bin, arr_bin)
    return m
Here are the docs for python's struct (refer to section 7.3.2.2. Format Characters):
https://docs.python.org/2/library/struct.html
1) What does 'qqd%ss%ss' mean?
Does it mean long, long, double, char[], char[]?
2) Why is the modulo operator "%" used here with a tuple: 'qqd%ss%ss' % (mac_len, arr_len)?
The first argument to pack is the result of the expression 'qqd%ss%ss' % (mac_len, arr_len), where the two %s are replaced by the values of the given variables. Assuming mac_len == 8 and arr_len == 4, for example, the result is qqd8s4s. s preceded by a number simply means to copy the given bytes for that format into the result.
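As an illustration, here is a minimal, untested sketch of that round trip; the MAC string and the two doubles are made-up values (so mac_len is 17 and arr_len is 16, not the 8 and 4 used above):

import array
import struct
import time

# Hypothetical payload (not from the question): a 17-character MAC string
# and two doubles, giving mac_len == 17 and arr_len == 16.
mac_bin = "AA:BB:CC:DD:EE:FF".encode("ascii")
arr_bin = array.array('d', [1.5, 2.5]).tobytes()

fmt = 'qqd%ss%ss' % (len(mac_bin), len(arr_bin))   # -> 'qqd17s16s'
packed = struct.pack(fmt, len(mac_bin), len(arr_bin), time.time(), mac_bin, arr_bin)

# On the receiving side, the two q fields can be read first; they tell you
# how many bytes the two trailing s fields occupy.
mac_len, arr_len = struct.unpack_from('qq', packed, 0)
print(fmt, mac_len, arr_len)   # qqd17s16s 17 16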
I am getting struct.error: bad char in struct format when packing the bytes back into the struct, even without making any changes to them.
I am trying to do bitwise operations on each byte in an RGBTRIPLE of a 24-bit BMP image. For the sake of simplicity, I am posting the code with just one sample byte sequence representing a pixel in a bitmap; I don't perform any bitwise operations on it, I just try to pack it back.
from struct import *
from collections import namedtuple

def main():
    RGBTRIPLE = namedtuple('RGBTRIPLE', 'rgbtRed rgbtGreen rgbtBlue')
    rgbt_fmt = '=BBB'
    rgbt_size = calcsize(rgbt_fmt)
    rgbt_buffer = b'\x1c\x1e\x1f'
    rgbt = RGBTRIPLE._make(unpack(rgbt_fmt, rgbt_buffer))
    rgbtRed = rgbt.rgbtRed
    rgbtGreen = rgbt.rgbtGreen
    rgbtBlue = rgbt.rgbtBlue
    rgbt_buffer = pack('rgbt_fmt', rgbtRed, rgbtGreen, rgbtBlue)

if __name__ == "__main__":
    main()
From what I understand, the problem is that when I am unpacking bytes, I am getting ints with size > 1 byte. What is the best way to fix the size of those ints at 1 byte, so I can pack them back using the same =BBB struct format?
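For what it's worth, here is a minimal sketch of the round trip, assuming the quoted 'rgbt_fmt' on the pack line was unintentional: the values unpacked with '=BBB' come back as plain Python ints in range(256), and they pack back unchanged once the format variable itself (not the string literal 'rgbt_fmt') is passed to pack():

from struct import pack, unpack

rgbt_fmt = '=BBB'
rgbt_buffer = b'\x1c\x1e\x1f'

# Unpack three unsigned single-byte values.
r, g, b = unpack(rgbt_fmt, rgbt_buffer)

# Pack them back with the same format variable; this round-trips exactly.
assert pack(rgbt_fmt, r, g, b) == rgbt_buffer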
I am coding a basic frequency analysis of WAVE audio files, but I have trouble when it comes to converting WAVE frames to integers.
Here is the relevant part of my code:
import wave

track = wave.open('/some_path/my_audio.wav', 'r')
byt_depth = track.getsampwidth()  # byte depth of the file, in BYTES
frame_rate = track.getframerate()
buf_size = 512

def byt_sum(word):
    # convert a string of n bytes into an int in [0; 256**n - 1]
    return sum((256**k) * word[k] for k in range(len(word)))

raw_buf = track.readframes(buf_size)
'''
One frame is a string of n bytes, where n = byt_depth.
For instance, with a 24-bit encoded file, track.readframes(1) could be:
b'\xff\xfe\xfe'.
raw_buf[n] returns an int in [0;255]
'''
sample_buf = [byt_sum(raw_buf[byt_depth*k:byt_depth*(k+1)])
              - 2**(8*byt_depth - 1) for k in range(buf_size)]
Problem is: when I plot sample_buf for a single sine signal, I get a wrecked, alternating sine signal. I can't figure out why the signal flips upside-down.
Any idea?
P.S.: Since I'm French, my English is a bit shaky. Feel free to edit if there are ugly mistakes.
It might be because you need to use an unsigned value to represent the 16-bit samples. See https://en.wikipedia.org/wiki/Pulse-code_modulation
Try adding 32767 to each sample.
Also, you should use the Python struct module to decode the buffer.
import struct

buf_size = 512
# 'H' is for an unsigned 16-bit integer; try 'h' (signed) as well
sample_buf = struct.unpack('H' * buf_size, raw_buf)
The easiest way is to use a library that does the decoding for you. There are several Python libraries available; my favorite is the soundfile module:
import soundfile as sf
signal, samplerate = sf.read('/some_path/my_audio.wav')
I'm trying to use Python to create a random binary file. This is what I've got already:
f = open(filename,'wb')
for i in xrange(size_kb):
    for ii in xrange(1024/4):
        f.write(struct.pack("=I",random.randint(0,sys.maxint*2+1)))
f.close()
But it's terribly slow (0.82 seconds for size_kb=1024 on my 3.9GHz SSD disk machine). A big bottleneck seems to be the random int generation (replacing the randint() with a 0 reduces running time from 0.82s to 0.14s).
Now I know there are more efficient ways of creating random data files (namely dd if=/dev/urandom) but I'm trying to figure this out for sake of curiosity... is there an obvious way to improve this?
IMHO - the following is completely redundant:
f.write(struct.pack("=I",random.randint(0,sys.maxint*2+1)))
There's absolutely no need to use struct.pack, just do something like:
import os
fileSizeInBytes = 1024
with open('output_filename', 'wb') as fout:
fout.write(os.urandom(fileSizeInBytes)) # replace 1024 with a size in kilobytes if it is not unreasonably large
Then, if you need to re-use the file by reading integers back out of it, use struct.unpack.
(my use case is generating a file for a unit test so I just need a
file that isn't identical with other generated files).
Another option is to just write a UUID4 to the file, but since I don't know the exact use case, I'm not sure that's viable.
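If you do later need to read the file back as integers with struct.unpack, as mentioned above, a small sketch (the filename and the 32-bit interpretation are just assumptions):

import struct

with open('output_filename', 'rb') as fin:
    data = fin.read()

# Interpret the random bytes as native-endian unsigned 32-bit integers.
count = len(data) // 4
values = struct.unpack('=%dI' % count, data[:count * 4])
print(len(values), values[:4])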
The Python code you should write depends entirely on the way you intend to use the random binary file. If you just need "rather good" randomness for multiple purposes, then the code of Jon Clements is probably the best.
However, on Linux OS at least, os.urandom relies on /dev/urandom, which is described in the Linux Kernel (drivers/char/random.c) as follows:
The /dev/urandom device [...] will return as many bytes as are
requested. As more and more random bytes are requested without giving
time for the entropy pool to recharge, this will result in random
numbers that are merely cryptographically strong. For many
applications, however, this is acceptable.
So the question is: is this acceptable for your application? If you prefer a more secure RNG, you could read bytes from /dev/random instead. The main drawback of this device is that it can block indefinitely if the Linux kernel is not able to gather enough entropy. There are also other cryptographically secure RNGs like EGD.
Alternatively, if your main concern is execution speed and you just need some "light" randomness for a Monte-Carlo method (i.e. unpredictability doesn't matter, uniform distribution does), you could consider generating your random binary file once and reusing it many times, at least for development.
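For example, a minimal sketch of reading straight from /dev/random on Linux (note that this read may block until the kernel has gathered enough entropy):

# Read 16 bytes from the kernel's blocking entropy pool (Linux only).
with open('/dev/random', 'rb') as dev:
    random_bytes = dev.read(16)
print(random_bytes.hex())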
Here's a complete script, based on the accepted answer, that creates random files.
import sys, os

def help(error: str = None) -> None:
    if error and error != "help":
        print("***", error, "\n\n", file=sys.stderr, sep=' ', end='')
        sys.exit(1)
    print("""\tCreates binary files with random content""", end='\n')
    print("""Usage:""")
    print(os.path.split(__file__)[1], """ "name1" "1TB" "name2" "5kb"
Accepted units: MB, GB, KB, TB, B""")
    sys.exit(2)

# https://stackoverflow.com/a/51253225/1077444
def convert_size_to_bytes(size_str):
    """Convert human filesizes to bytes.
    ex: 1 tb, 1 kb, 1 mb, 1 pb, 1 eb, 1 zb, 3 yb
    To reverse this, see hurry.filesize or the Django filesizeformat template
    filter.
    :param size_str: A human-readable string representing a file size, e.g.,
    "22 megabytes".
    :return: The number of bytes represented by the string.
    """
    multipliers = {
        'kilobyte': 1024,
        'megabyte': 1024 ** 2,
        'gigabyte': 1024 ** 3,
        'terabyte': 1024 ** 4,
        'petabyte': 1024 ** 5,
        'exabyte': 1024 ** 6,
        'zetabyte': 1024 ** 7,
        'yottabyte': 1024 ** 8,
        'kb': 1024,
        'mb': 1024 ** 2,
        'gb': 1024 ** 3,
        'tb': 1024 ** 4,
        'pb': 1024 ** 5,
        'eb': 1024 ** 6,
        'zb': 1024 ** 7,
        'yb': 1024 ** 8,
    }
    for suffix in multipliers:
        size_str = size_str.lower().strip().strip('s')
        if size_str.lower().endswith(suffix):
            return int(float(size_str[0:-len(suffix)]) * multipliers[suffix])
    else:
        if size_str.endswith('b'):
            size_str = size_str[0:-1]
        elif size_str.endswith('byte'):
            size_str = size_str[0:-4]
    return int(size_str)

if __name__ == "__main__":
    input = {}  # { file: byte_size }
    if (len(sys.argv) - 1) % 2 != 0:
        print("-- Provide even number of arguments --")
        print(f'--\tGot: {len(sys.argv)-1}: "' + r'" "'.join(sys.argv[1:]) + '"')
        sys.exit(2)
    elif len(sys.argv) == 1:
        help()

    try:
        for file, size_str in zip(sys.argv[1::2], sys.argv[2::2]):
            input[file] = convert_size_to_bytes(size_str)
    except ValueError as ex:
        print(f'Invalid size: "{size_str}"', file=sys.stderr)
        sys.exit(1)

    for file, size_bytes in input.items():
        print(f"Writing: {file}")
        # https://stackoverflow.com/a/14276423/1077444
        with open(file, 'wb') as fout:
            while size_bytes > 0:
                wrote = min(size_bytes, 1024)  # chunk size
                fout.write(os.urandom(wrote))
                size_bytes -= wrote