I have a dataframe containing millions of floats. I want to turn them into bytes and join them in a single line. Iterating over each of them is kinda slow. Is there a way to speed this up?
import struct
import numpy as np
# list of floats [197496.84375, 177091.28125, 140972.3125, 120965.9140625, ...]
# 5M - 20M floats in total
data = df.to_numpy().flatten().tolist()
# too slow
dataline = b''.join([struct.pack('>f', event) for event in data])
I tried another approach, but apart from being slow, it also produces a different result
import struct
import numpy as np
def myfunc(event):
    return struct.pack('>f', event)
data = df.to_numpy().flatten()
myfunc_vec = np.vectorize(myfunc)
result = myfunc_vec(data)
dataline = b''.join(result)
UPD: I found an example here: Fastest way to pack a list of floats into bytes in python, but it doesn't allow me to specify endianness. Putting '%s>f' instead of '%sf' results in an error:
error: bad char in struct format
import random
import struct
floatlist = [random.random() for _ in range(10**5)]
buf = struct.pack('%sf' % len(floatlist), *floatlist)
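For the record, the "bad char in struct format" error happens because the byte-order character must be the very first character of the format string, before the repeat count, so the format should be '>%df' rather than '%d>f'. Below is a sketch of both that fix and a loop-free numpy variant (the dataframe line is commented out since df is not defined here):

import struct
import numpy as np

floatlist = [197496.84375, 177091.28125, 140972.3125]

# Byte-order character first, then the repeat count: '>%df', not '%d>f'.
dataline = struct.pack('>%df' % len(floatlist), *floatlist)

# Alternatively, let numpy cast to big-endian float32 and dump the raw buffer,
# skipping the Python-level loop and the intermediate list entirely.
assert dataline == np.asarray(floatlist, dtype='>f4').tobytes()

# For the dataframe from the question (assuming df exists as above):
# dataline = df.to_numpy().ravel().astype('>f4').tobytes()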
I am trying to combine the solutions provided in both of these SO answers - Using threading to slice an array into chunks and perform calculation on each chunk and reassemble the returned arrays into one array, and Pass multiple parameters to concurrent.futures.Executor.map?. I have a numpy array that I chunk into segments, and I want each chunk to be sent to a separate thread along with an additional argument. This additional argument is a constant and will not change. performCalc is a function that takes two arguments: one chunk of the original numpy array, and the constant.
First solution I tried
import psutil
import numpy as np
import sys
from concurrent.futures import ThreadPoolExecutor
from functools import partial

def main():
    testThread()

def testThread():
    minLat = -65.76892
    maxLat = 66.23587
    minLon = -178.81404
    maxLon = 176.2949
    latGrid = np.arange(minLat, maxLat, 0.05)
    lonGrid = np.arange(minLon, maxLon, 0.05)
    gridLon, gridLat = np.meshgrid(latGrid, lonGrid)
    grid_points = np.c_[gridLon.ravel(), gridLat.ravel()]
    n_jobs = psutil.cpu_count(logical=False)
    chunk = np.array_split(grid_points, n_jobs, axis=0)
    x = ThreadPoolExecutor(max_workers=n_jobs)
    maxDistance = 4.3
    func = partial(performCalc, chunk)
    args = [chunk, maxDistance]
    # This prints 4.3 twice although there are four cores in the system
    results = x.map(func, args)
    # This prints "test" four times correctly
    results1 = x.map(performTest, chunk)

def performCalc(chunk, maxDistance):
    print(maxDistance)
    return chunk

def performTest(chunk):
    print("test")

main()
So performCalc() prints 4.3 twice even though the number of cores in the system is 4, while performTest() prints "test" four times correctly. I am not able to figure out the reason for this behavior.
Also, I am sure the way I set up the functools.partial call is incorrect.
1) There are four chunks of the original numpy array.
2) Each chunk is to be paired with maxDistance and sent to performCalc()
3) There will be four threads that will print maxDistance and will return parts of the total result which will be returned in one array
Where am I going wrong ?
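For what it's worth, a stripped-down sketch of the same shape suggests where the two calls come from: Executor.map calls the mapped function once per element of the iterable it is given, and args has exactly two elements (the list of chunks, and the float). The names below mirror the question's code but the data is illustrative:

from concurrent.futures import ThreadPoolExecutor
from functools import partial

def performCalc(chunk, maxDistance):
    print(maxDistance)
    return chunk

chunks = [[1, 2], [3, 4], [5, 6], [7, 8]]  # stand-in for the four numpy sub-arrays
func = partial(performCalc, chunks)
args = [chunks, 4.3]

with ThreadPoolExecutor(max_workers=4) as x:
    # map() iterates over args' two elements, so func runs exactly twice:
    # performCalc(chunks, chunks) and then performCalc(chunks, 4.3).
    list(x.map(func, args))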
UPDATE
I tried using the lambda approach as well
results = x.map(lambda p:performCalc(*p),args)
but this prints nothing.
Using the solution provided by user mkorvas, as shown here - How to pass a function with more than one argument to python concurrent.futures.ProcessPoolExecutor.map()? - I was able to solve my problem, as shown here:
import psutil
import numpy as np
import sys
from concurrent.futures import ThreadPoolExecutor
from functools import partial

def main():
    testThread()

def testThread():
    minLat = -65.76892
    maxLat = 66.23587
    minLon = -178.81404
    maxLon = 176.2949
    latGrid = np.arange(minLat, maxLat, 0.05)
    lonGrid = np.arange(minLon, maxLon, 0.05)
    print(latGrid.shape, lonGrid.shape)
    gridLon, gridLat = np.meshgrid(latGrid, lonGrid)
    grid_points = np.c_[gridLon.ravel(), gridLat.ravel()]
    print(grid_points.shape)
    n_jobs = psutil.cpu_count(logical=False)
    chunk = np.array_split(grid_points, n_jobs, axis=0)
    x = ThreadPoolExecutor(max_workers=n_jobs)
    maxDistance = 4.3
    func = partial(performCalc, maxDistance)
    results = x.map(func, chunk)

def performCalc(maxDistance, chunk):
    print(maxDistance)
    return chunk

main()
What one apparently needs to do (and I do not know why; maybe somebody can clarify in another answer) is switch the order of the input arguments to performCalc(), as shown here:
def performCalc(maxDistance, chunk):
    print(maxDistance)
    return chunk
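The reason, as far as I can tell, is simply how functools.partial binds arguments: the values frozen by partial become the leading positional arguments, and Executor.map supplies each chunk as the next one. A minimal sketch (illustrative names only):

from functools import partial

def performCalc(maxDistance, chunk):
    return (maxDistance, chunk)

func = partial(performCalc, 4.3)
print(func('chunk0'))   # (4.3, 'chunk0'): the frozen 4.3 fills the first parameter

# To keep the original performCalc(chunk, maxDistance) order instead, the
# constant could be bound by keyword:
def performCalcOrig(chunk, maxDistance):
    return (chunk, maxDistance)

func2 = partial(performCalcOrig, maxDistance=4.3)
print(func2('chunk0'))  # ('chunk0', 4.3)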
Why does it appear that concatenation in Python 3 is slower in some cases than in Python 2?
The most impacted method of concatenation appears to be successive concatenation of bytes objects, which has gone from an O(n) to O(n²) operation.
The bulk of my profiling code is here:
#!/usr/bin/env python

from operator import concat
from sys import version, version_info
from timeit import timeit  # Compatibility: ver >= 2.6

# ver = version.partition('\n')[0].rstrip()
ver = '.'.join(str(v) for v in version_info[:3])
print(ver)

if version_info[0] == 2:
    from StringIO import StringIO
else:
    from io import StringIO
    from functools import reduce
    xrange = range

def build_plus():
    output = ''
    for _ in xrange(input_len):
        output += 'a'
    return output

def build_join():
    return ''.join('a' for _ in xrange(input_len))

def build_bytes_plus():
    output = b''
    for _ in xrange(input_len):
        output += b'a'
    return output

def build_stringio():
    output = StringIO()
    for _ in xrange(input_len):
        output.write('a')
    return output.getvalue()

def build_reduce():
    return reduce(concat, ('a' for _ in xrange(input_len)))

builds = {'str+': build_plus,
          'join': build_join,
          'reduce': build_reduce,
          'bytes+': build_bytes_plus,
          'StringIO': build_stringio}

if version_info[0] == 2:
    import cStringIO
    def build_cstringio():
        output = cStringIO.StringIO()
        for _ in xrange(input_len):
            output.write('a')
        return output.getvalue()
    builds['cStringIO'] = build_cstringio
else:
    from io import BytesIO
    def build_bytesio():
        output = BytesIO()
        for _ in xrange(input_len):
            output.write(b'a')
        return output.getvalue()
    builds['BytesIO'] = build_bytesio

resfile = open('times.csv', 'a')
size_range = 50   # Number of points over the size axis
min_order = 1.0   # 10^x byte input min
max_order = 5.0   # 10^x byte input max

for allow_gc in (False, True):
    setup = 'gc.enable()' if allow_gc else 'pass'
    for build_name, build_fun in builds.items():
        # For a roughly constant confidence interval, aim for uniform sample density
        # across the (logarithmic) input size axis.
        for size_index in range(size_range + 1):
            input_len = int(10**((max_order - min_order)*size_index/size_range + min_order))
            # Rather than repeating many measurements at one input size, perform one
            # measurement per input size for a continuous range of input sizes and
            # apply smoothing later.
            dur = timeit(build_fun, setup, number=1)
            resfile.write('"%s",%s,"%s",%d,%.6g\n' % (ver, str(allow_gc).upper(), build_name,
                                                      input_len, dur))
Some graphs from my R script are shown here:
Concatenating strings with + or += in a loop was never a good idea. It only seemed efficient because there was a weird, controversial special case in the bytecode interpreter loop which would attempt to concatenate strings mutatively if it could prove no one else had a reference to the string it was messing with. There was no efficient resize policy in place; it just called realloc and hoped for the best, so it could still end up O(n²) if realloc needed to copy.
In Python 3, that weird special case now handles unicode strings instead of bytestrings. Bytestring concatenation goes back to building a new string object each time, so your loop goes back to O(n²).
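A rough sketch of how this shows up in practice, and a linear-time alternative if you do need incremental bytes building in Python 3: bytearray is mutable and over-allocates, so in-place extension stays amortized O(1) per append. (Timings are illustrative; absolute numbers will vary by machine and version.)

import timeit

def concat_bytes(n):
    out = b''
    for _ in range(n):
        out += b'a'       # builds a brand-new bytes object every iteration
    return out

def concat_bytearray(n):
    out = bytearray()
    for _ in range(n):
        out += b'a'       # extends in place; amortized O(1)
    return bytes(out)

for n in (10000, 20000, 40000):
    t_bytes = timeit.timeit(lambda: concat_bytes(n), number=1)
    t_array = timeit.timeit(lambda: concat_bytearray(n), number=1)
    # Expect t_bytes to grow roughly 4x when n doubles, t_array roughly 2x.
    print(n, t_bytes, t_array)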
In Python, I have a huge list of floating point values (nearly 30 million values). I have to convert each of them to a 4-byte value in little-endian format and write them all to a binary file in order.
For a list with some thousands or even 100k values, my code works fine. But as the data grows, it takes a long time to process and write to the file. What optimization techniques can I use to write to the file more efficiently?
As suggested in this blog, I am replacing all the small writes to the file with a single bytearray. But the performance is still not satisfactory.
I have also tried multiprocessing (concurrent.futures.ProcessPoolExecutor()) to utilize all the cores in the system instead of a single CPU core, but it still takes a long time to complete.
Can anyone give me more suggestions on how to improve the performance (in terms of time and memory) of this task?
Here is my code:
import struct

def process_value(value):
    # Reinterpret the float's bits as an unsigned int, then hex-format it
    hex_value = hex(struct.unpack('<I', struct.pack('<f', value))[0])
    if len(hex_value.split('x')[1]) < 8:
        hex_value = hex_value[:2] + ('0' * (8 - len(hex_value.split('x')[1]))) + hex_value[2:]
    dec1 = int(hex_value.split('x')[1][0] + hex_value.split('x')[1][1], 16)
    dec2 = int(hex_value.split('x')[1][2] + hex_value.split('x')[1][3], 16)
    dec3 = int(hex_value.split('x')[1][4] + hex_value.split('x')[1][5], 16)
    dec4 = int(hex_value.split('x')[1][6] + hex_value.split('x')[1][7], 16)
    msg = bytearray([dec4, dec3, dec2, dec1])
    return msg

def main_function(fp, values):
    msg = bytearray()
    for val in values:
        msg.extend(process_value(val))
    fp.write(msg)
You could try converting all the floats before writing them, and then write the resulting data in one go:
import struct

my_floats = [1.111, 1.222, 1.333, 1.444]

with open('floats.bin', 'wb') as f_output:
    f_output.write(struct.pack('<{}f'.format(len(my_floats)), *my_floats))
For the number of values you have, you might need to do this in large blocks:
import struct

def blocks(data, n):
    for i in range(0, len(data), n):
        yield data[i:i+n]

my_floats = [1.111, 1.222, 1.333, 1.444]

with open('floats.bin', 'wb') as f_output:
    for block in blocks(my_floats, 10000):
        f_output.write(struct.pack('<{}f'.format(len(block)), *block))
The output from struct.pack() is in the correct binary format for writing directly to the file. The file must be opened in binary mode, which is why 'wb' is used.
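If numpy is an option, here is a sketch of a loop-free alternative: cast the list to a little-endian float32 array and write its raw buffer in a single call, with no per-value Python work.

import numpy as np

my_floats = [1.111, 1.222, 1.333, 1.444]

with open('floats.bin', 'wb') as f_output:
    # '<f4' is a little-endian 4-byte float; tofile() dumps the raw buffer
    # straight to the file object.
    np.asarray(my_floats, dtype='<f4').tofile(f_output)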
I am creating a sparse matrix file by extracting features from an input file. Each row of the input file contains one film ID, followed by some feature IDs and the score of each feature.
6729792 4:0.15568 8:0.198796 9:0.279261 13:0.17829 24:0.379707
The first number is the ID of the film; the value to the left of each colon is a feature ID, and the value to the right is that feature's score.
Each line represents one film, and the number of feature:score pairs varies from one film to another.
Here is how I construct my sparse matrix.
import sys
import os
import os.path
import time
import json
import numpy as np
import tables as tb
from Film import Film
import scipy
from scipy.sparse import coo_matrix, csr_matrix, rand

def sparseCreate(self, Debug):
    a = rand(self.total_rows, self.total_columns, format='csr')
    l, m = a.shape[0], a.shape[1]
    f = tb.open_file("sparseFile.h5", 'w')
    filters = tb.Filters(complevel=5, complib='blosc')
    data_matrix = f.create_carray(f.root, 'data', tb.Float32Atom(), shape=(l, m), filters=filters)
    index_film = 0
    input_data = open('input_file.txt', 'r')
    for line in input_data:
        my_line = np.array(line.split())
        id_film = my_line[0]
        my_line = np.core.defchararray.split(my_line[1:], ":")
        self.data_matrix_search_normal[str(id_film)] = index_film
        self.data_matrix_search_reverse[index_film] = str(id_film)
        for element in my_line:
            if int(element[0]) in self.selected_features:
                column = self.index_selected_feature[str(element[0])]
                data_matrix[index_film, column] = float(element[1])
        index_film += 1
    self.selected_matrix = data_matrix
    json.dump(self.data_matrix_search_reverse,
              open(os.path.join(self.output_path, "data_matrix_search_reverse.json"), 'wb'),
              sort_keys=True, indent=4)
    my_films = Film(
        self.selected_matrix, self.data_matrix_search_reverse, self.path_doc, self.output_path)
    x_matrix_unique = self.selected_matrix[:, :]
    r_matrix_unique = np.asarray(x_matrix_unique)
    f.close()
    return my_films
Question:
I feel that this function is too slow on big datasets and takes too long to run.
How can I improve and accelerate it, maybe using MapReduce? What is wrong in this function that makes it so slow?
The slowness comes from IO, plus conversions (from str, to str, even twice to str for the same variable, etc.), plus splits, plus explicit loops. By the way, there is a csv module in the Python standard library that could be used to parse your input file; you could experiment with it (I assume you use a space as the delimiter). I also see that you repeatedly convert element[0] to int/str, which is bad: you create many temporary objects. If you call this function several times, you could try to reuse some internal objects (an array?). You could also try to implement it in another style, with map or a list comprehension, but experiments are needed...
The general idea of Python code optimization is to avoid executing explicit Python bytecode and to prefer native/C Python functions wherever possible. And certainly try to eliminate as many conversions as you can. Also, if the input file is under your control, you can format it with fixed-length fields, which lets you avoid splitting/parsing entirely (only string indexing).
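To make the "prefer native functions, avoid per-cell work" advice concrete, here is a hedged sketch (not the original code path): parse with plain str.split, accumulate flat row/column/value lists, and build the matrix with a single coo_matrix call instead of assigning cells one at a time. The feature mappings below are placeholders standing in for the question's selected_features / index_selected_feature attributes.

import numpy as np
from scipy.sparse import coo_matrix

# Placeholder mappings standing in for the question's attributes.
selected_features = {4, 8, 9, 13, 24}
index_selected_feature = {f: i for i, f in enumerate(sorted(selected_features))}

rows, cols, vals = [], [], []
film_ids = []
with open('input_file.txt') as input_data:
    for row_idx, line in enumerate(input_data):
        tokens = line.split()
        film_ids.append(tokens[0])
        for pair in tokens[1:]:
            feat, score = pair.split(':')
            if int(feat) in selected_features:
                rows.append(row_idx)
                cols.append(index_selected_feature[int(feat)])
                vals.append(float(score))

# One native call builds the whole matrix; convert to CSR for row slicing.
matrix = coo_matrix((vals, (rows, cols)),
                    shape=(len(film_ids), len(selected_features))).tocsr()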
I made a pickle file storing the grayscale value of each pixel in 100,000 80x80 images.
(Plus an array of 100,000 integers whose values are one-digit).
My approximation for the total size of the pickle is,
4 bytes x 80 x 80 x 100,000 = 2.56 GB
plus the array of integers, which shouldn't be that large.
The generated pickle file, however, is over 16GB, so it takes hours just to unpickle and load it, and it eventually freezes after exhausting all available memory.
Is there something wrong with my calculation or is it the way I pickled it?
I pickled the file in the following way.
from PIL import Image
import pickle
import os
import numpy
import time
trainpixels = numpy.empty([80000, 6400])
trainlabels = numpy.empty(80000)
validpixels = numpy.empty([10000, 6400])
validlabels = numpy.empty(10000)
testpixels = numpy.empty([10408, 6400])
testlabels = numpy.empty(10408)

i = 0
tr = 0
va = 0
te = 0
# indir1: root directory containing the image files (defined elsewhere)
for (root, dirs, filenames) in os.walk(indir1):
    print 'hello'
    for f in filenames:
        try:
            im = Image.open(os.path.join(root, f))
            Imv = im.load()
            x, y = im.size
            pixelv = numpy.empty(6400)
            ind = 0
            for ii in range(x):
                for j in range(y):
                    temp = float(Imv[j, ii])
                    temp = float(temp / 255.0)
                    pixelv[ind] = temp
                    ind += 1
            if i < 40000:
                trainpixels[tr] = pixelv
                tr += 1
            elif i < 45000:
                validpixels[va] = pixelv
                va += 1
            else:
                testpixels[te] = pixelv
                te += 1
            print str(i) + '\t' + str(f)
            i += 1
        except IOError:
            continue

trainimage = (trainpixels, trainlabels)
validimage = (validpixels, validlabels)
testimage = (testpixels, testlabels)

output = open('data.pkl', 'wb')
pickle.dump(trainimage, output)
pickle.dump(validimage, output)
pickle.dump(testimage, output)
Please let me know if you see something wrong with either my calculation or my code!
Python pickles are not a thrifty mechanism for storing data, as you're storing objects instead of "just the data."
The following test case takes 24 kB on my system, and this is for a small, sparsely populated numpy array stored in a pickle:
import os
import sys
import numpy
import pickle

testlabels = numpy.empty(1000)
testlabels[0] = 1
testlabels[99] = 0

test_labels_size = sys.getsizeof(testlabels)  # 80

output = open('/tmp/pickle', 'wb')
test_labels_pickle = pickle.dump(testlabels, output)

print os.path.getsize('/tmp/pickle')
Further, I'm not sure why you believe 4 bytes to be the size of a number in Python -- non-numpy ints are 24 bytes (sys.getsizeof(1)) and numpy arrays have a minimum overhead of 80 bytes (sys.getsizeof(numpy.array([0], float))).
As you stated as a response to my comment, you have reasons for staying with Pickle, so I won't try to convince you further to not store objects, but be aware of the overhead of storing objects.
As an option: reduce the size of your training data, or pickle fewer objects.
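To put numbers on the overhead point, here is a sketch of two ways to shrink the file, assuming you stay with pickle: in Python 2, pickle.dump() without a protocol argument uses protocol 0, which serializes every float64 as ASCII text; a binary protocol keeps the file near the array's in-memory size, and float32 halves it again.

import pickle
import numpy

# A 1000 x 6400 stand-in for one of the question's pixel arrays (float64 by default).
pixels = numpy.zeros([1000, 6400])

# Binary protocol 2 stores the raw 8-bytes-per-value buffer instead of ASCII text.
with open('data.pkl', 'wb') as output:
    pickle.dump(pixels, output, protocol=2)

# Grayscale values in [0, 1] do not need double precision; float32 matches the
# question's 4-bytes-per-pixel estimate and halves the file size.
with open('data32.pkl', 'wb') as output:
    pickle.dump(pixels.astype(numpy.float32), output, protocol=2)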