Fastest way to write large CSV with Python

I want to write random sample data to a CSV file until it is 1 GB in size. The following code works:
import numpy as np
import uuid
import csv
import os

outfile = 'data.csv'
outsize = 1024  # MB

with open(outfile, 'ab') as csvfile:
    wtr = csv.writer(csvfile)
    while (os.path.getsize(outfile)//1024**2) < outsize:
        wtr.writerow(['%s,%.6f,%.6f,%i' % (uuid.uuid4(), np.random.random()*50, np.random.random()*50, np.random.randint(1000))])
How can I make it faster?

The problem appears to be mainly I/O-bound. You can improve the I/O a bit by writing to the file in larger chunks instead of writing one line at a time:
import numpy as np
import uuid
import os

outfile = 'data-alt.csv'
outsize = 10  # MB
chunksize = 1000

with open(outfile, 'ab') as csvfile:
    while (os.path.getsize(outfile)//1024**2) < outsize:
        data = [[uuid.uuid4() for i in range(chunksize)],
                np.random.random(chunksize)*50,
                np.random.random(chunksize)*50,
                np.random.randint(1000, size=(chunksize,))]
        csvfile.writelines(['%s,%.6f,%.6f,%i\n' % row for row in zip(*data)])
You can experiment with the chunksize (the number of rows written per chunk) to see what works best on your machine.
Here is a benchmark, comparing the above code to your original code, with outsize set to 10 MB:
% time original.py
real 0m5.379s
user 0m4.839s
sys 0m0.538s
% time write_in_chunks.py
real 0m4.205s
user 0m3.850s
sys 0m0.351s
So this is about 25% faster than the original code.
PS. I tried replacing the calls to os.path.getsize with an estimate of the total number of lines needed. Unfortunately, it did not improve the speed. Since the number of bytes needed to represent the final int varies, the estimate is also inexact -- that is, it does not perfectly replicate the behavior of your original code. So I left the os.path.getsize in place.
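For reference, a minimal sketch of that estimation idea, assuming a rough average row length (the avg_row_bytes value is a guess, so the resulting file only lands near the target size rather than matching the getsize-based loop exactly):
import uuid
import numpy as np

outfile = 'data-est.csv'
outsize = 10 * 1024**2   # target size in bytes
chunksize = 1000

# Rough average length of one encoded row:
# 36-char uuid + two '%.6f' floats + up to 3-digit int + 3 commas + newline.
avg_row_bytes = 36 + 9 + 9 + 3 + 3 + 1
nchunks = outsize // (avg_row_bytes * chunksize)

with open(outfile, 'w') as csvfile:
    for _ in range(nchunks):
        data = [[uuid.uuid4() for _ in range(chunksize)],
                np.random.random(chunksize) * 50,
                np.random.random(chunksize) * 50,
                np.random.randint(1000, size=(chunksize,))]
        csvfile.writelines('%s,%.6f,%.6f,%i\n' % row for row in zip(*data))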

With all the unnecessary stuff removed, it should be faster and easier to understand:
import random
import uuid

outfile = 'data.csv'
outsize = 1024 * 1024 * 1024  # 1 GB

with open(outfile, 'ab') as csvfile:
    size = 0
    while size < outsize:
        txt = '%s,%.6f,%.6f,%i\n' % (uuid.uuid4(), random.random()*50, random.random()*50, random.randrange(1000))
        size += len(txt)
        csvfile.write(txt)
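If you want to combine this running byte counter with the chunked writes from the answer above, a sketch might look like this (not benchmarked here; on platforms with newline translation the byte count is only approximate):
import random
import uuid

outfile = 'data.csv'
outsize = 1024 * 1024 * 1024  # 1 GB
chunksize = 10000             # rows buffered per write

with open(outfile, 'a') as csvfile:
    size = 0
    while size < outsize:
        rows = ['%s,%.6f,%.6f,%i\n' % (uuid.uuid4(), random.random()*50,
                                       random.random()*50, random.randrange(1000))
                for _ in range(chunksize)]
        size += sum(len(r) for r in rows)
        csvfile.writelines(rows)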

This is an update building on unutbu's answer above:
A large % of the time was spent generating random numbers and checking the file size.
If you generate the rows ahead of time, you can assess the raw disk I/O performance:
import time
from pathlib import Path
import numpy as np
import uuid

outfile = Path('data-alt.csv')
chunksize = 1_800_000

data = [
    [uuid.uuid4() for i in range(chunksize)],
    np.random.random(chunksize) * 50,
    np.random.random(chunksize) * 50,
    np.random.randint(1000, size=(chunksize,))
]
rows = ['%s,%.6f,%.6f,%i\n' % row for row in zip(*data)]

t0 = time.time()
with open(outfile, 'a') as csvfile:
    csvfile.writelines(rows)
tdelta = time.time() - t0
print(tdelta)
On my standard 860 EVO SSD (not NVMe), I get 1.43 s for 1_800_000 rows, so that's about 1,258,741 rows/sec (not too shabby imo).

Related

How to optimize binary file manipulation?

Here is my code:
def decode(filename):
    with open(filename, "rb") as binary_file:
        # Read the whole file at once
        data = bytearray(binary_file.read())
        for i in range(len(data)):
            data[i] = 0xff - data[i]

    with open("out.log", "wb") as out:
        out.write(data)
I have a file of around 10 MB, and I need to translate it by flipping every bit and save the result to a new file.
It takes around 1 second to translate a 10 MB file using my code, while it takes less than 1 ms in C.
This is my first Python script. I don't know if it is right to use bytearray. The most time-consuming part is the loop over the bytearray.
If using the numpy library is an option, then using it would be much★ faster, since it can perform the operation on all the bytes via a single statement. Doing byte-level operations in pure Python on relatively large amounts of data is inherently going to be slow compared to using a module like numpy, which is implemented in C and optimized for array processing.
★ Although not by quite as much in Python 2 as in 3 (see results below).
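On its own, the numpy version of your decode() boils down to something like this minimal sketch (the same idea is embedded in the benchmark framework below):
import numpy as np

def decode(filename, out_filename):
    with open(filename, 'rb') as f:
        # Read the whole file into an array of unsigned bytes.
        data = np.frombuffer(f.read(), dtype=np.uint8)
    # One vectorized statement flips every byte.
    data = 0xff - data
    with open(out_filename, 'wb') as out:
        out.write(data)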
The following is a framework I set up to benchmark using it vs the code in your question. It may seem like a lot of code, but most of it is just part of the scaffolding for making performance comparisons.
I encourage others answering this question to also make use of it.
from __future__ import print_function
from collections import namedtuple
import os
import sys
from random import randrange
from textwrap import dedent
from tempfile import NamedTemporaryFile
import timeit
import traceback

N = 1  # Number of executions of each "algorithm".
R = 3  # Number of repetitions of those N executions.

UNITS = 1024 * 1024  # MBs
FILE_SIZE = 10 * UNITS

# Create test files. Must be done here at module-level to allow file
# deletions at end.
with NamedTemporaryFile(mode='wb', delete=False) as inp_file:
    FILE_NAME_IN = inp_file.name
    print('Creating temp input file: "{}", length {:,d}'.format(FILE_NAME_IN, FILE_SIZE))
    inp_file.write(bytearray(randrange(256) for _ in range(FILE_SIZE)))

with NamedTemporaryFile(mode='wb', delete=False) as out_file:
    FILE_NAME_OUT = out_file.name
    print('Creating temp output file: "{}"'.format(FILE_NAME_OUT))

# Common setup for all testcases (executed prior to any Testcase specific setup).
COMMON_SETUP = dedent("""
    from __main__ import FILE_NAME_IN, FILE_NAME_OUT
    """)

class Testcase(namedtuple('CodeFragments', ['setup', 'test'])):
    """ A test case is composed of separate setup and test code fragments. """
    def __new__(cls, setup, test):
        """ Dedent code fragment in each string argument. """
        return tuple.__new__(cls, (dedent(setup), dedent(test)))

testcases = {
    "user3181169": Testcase("""
        def decode(filename, out_filename):
            with open(filename, "rb") as binary_file:
                # Read the whole file at once
                data = bytearray(binary_file.read())
                for i in range(len(data)):
                    data[i] = 0xff - data[i]

            with open(out_filename, "wb") as out:
                out.write(data)
        """, """
        decode(FILE_NAME_IN, FILE_NAME_OUT)
        """
    ),

    "using numpy": Testcase("""
        import numpy as np

        def decode(filename, out_filename):
            with open(filename, 'rb') as file:
                data = np.frombuffer(file.read(), dtype=np.uint8)
                # Applies mathematical operation to entire array.
                data = 0xff - data

            with open(out_filename, "wb") as out:
                out.write(data)
        """, """
        decode(FILE_NAME_IN, FILE_NAME_OUT)
        """,
    ),
}

# Collect timing results of executing each testcase multiple times.
try:
    results = [
        (label,
         min(timeit.repeat(testcases[label].test,
                           setup=COMMON_SETUP + testcases[label].setup,
                           repeat=R, number=N)),
        ) for label in testcases
    ]
except Exception:
    traceback.print_exc(file=sys.stdout)  # direct output to stdout
    sys.exit(1)

# Display results.
major, minor, micro = sys.version_info[:3]
bitness = 64 if sys.maxsize > 2**32 else 32
print('Fastest to slowest execution speeds using ({}-bit) Python {}.{}.{}\n'
      '({:,d} execution(s), best of {:d} repetition(s)'.format(
          bitness, major, minor, micro, N, R))
print()

longest = max(len(result[0]) for result in results)  # length of longest label
ranked = sorted(results, key=lambda t: t[1])  # ascending sort by execution time
fastest = ranked[0][1]

for result in ranked:
    print('{:>{width}} : {:9.6f} secs, relative speed: {:6,.2f}x, ({:8,.2f}% slower)'
          ''.format(
              result[0], result[1], round(result[1]/fastest, 2),
              round((result[1]/fastest - 1) * 100, 2),
              width=longest))

# Clean-up.
for filename in (FILE_NAME_IN, FILE_NAME_OUT):
    try:
        os.remove(filename)
    except FileNotFoundError:
        pass
Output (Python 3):
Creating temp input file: "T:\temp\tmpw94xdd5i", length 10,485,760
Creating temp output file: "T:\temp\tmpraw4j4qd"
Fastest to slowest execution speeds using (32-bit) Python 3.7.1
(1 execution(s), best of 3 repetition(s)
using numpy : 0.017744 secs, relative speed: 1.00x, ( 0.00% slower)
user3181169 : 1.099956 secs, relative speed: 61.99x, (6,099.14% slower)
Output (Python 2):
Creating temp input file: "t:\temp\tmprk0njd", length 10,485,760
Creating temp output file: "t:\temp\tmpvcaj6n"
Fastest to slowest execution speeds using (32-bit) Python 2.7.15
(1 execution(s), best of 3 repetition(s)
using numpy : 0.017930 secs, relative speed: 1.00x, ( 0.00% slower)
user3181169 : 0.937218 secs, relative speed: 52.27x, (5,126.97% slower)

Why is reading large files in python so slow?

I am trying to read a CSV file I created earlier in Python using
with open(csvname, 'w') as csvfile:
    csvwriter = csv.writer(csvfile, delimiter=',')
    csvwriter.writerows(data)
Data is a random matrix containing about 30k * 30k entries in np.float32 format, about 10 GB file size in total.
When I read the file back in using this function (since I already know the size of my matrix, and np.genfromtxt is incredibly slow and would need about 100 GB of RAM at this point)
def read_large_txt(path, delimiter=',', dtype=np.float32, nrows=0):
    t1 = time.time()
    with open(path, 'r') as f:
        out = np.empty((nrows, nrows), dtype=dtype)
        for (ii, line) in enumerate(f):
            if ii % 2 == 0:
                out[int(ii/2)] = line.split(delimiter)
    print('Reading %s took %.3f s' % (path, time.time() - t1))
    return out
it takes me about 10 minutes to read that file. The hard drive I am using should be able to read about 100 MB/s, which would bring the reading time down to about 1-2 minutes.
Any ideas what I may be doing wrong?
Related: why numpy narray read from file consumes so much memory?
That's where the function read_large_txt is from.
I found a quite simple solution. Since I am creating the files as well, I don't need to save them as .csv files. It is way (!) faster to load them as .npy files:
Loading a 30k * 30k matrix stored as .csv (including splitting each line by ',') takes about 10 minutes. Doing the same with a matrix stored as .npy takes about 10 seconds!
That's why I have to change the code I wrote above to:
np.save(npyname, data)
and in the other script to
out = np.load(npyname + '.npy')
Another advantage of this method is: (in my case) the .npy files only have about 40% the size of the .csv files. :)
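A minimal, self-contained sketch of that comparison on a much smaller matrix (using np.savetxt/np.loadtxt for the CSV side rather than the csv module, and hypothetical demo file names):
import time
import numpy as np

# Small demo matrix; the original was ~30k x 30k float32.
data = np.random.random((2000, 2000)).astype(np.float32)

np.savetxt('demo.csv', data, delimiter=',')
np.save('demo', data)                      # writes demo.npy

t0 = time.time()
a = np.loadtxt('demo.csv', delimiter=',', dtype=np.float32)
print('csv load: %.3f s' % (time.time() - t0))

t0 = time.time()
b = np.load('demo.npy')
print('npy load: %.3f s' % (time.time() - t0))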

How to improve performance when having a huge list of bytes to be written to file?

In Python, I have a huge list of floating point values (nearly 30 million values). I have to convert each of them to a 4-byte value in little-endian format and write all of them to a binary file in order.
For a list with some thousands or even 100k values, my code works fine. But as the data grows, it takes longer to process and write to the file. What optimization techniques can I use to write to the file more efficiently?
As suggested in this blog, I am replacing all the small writes to the file with the use of a bytearray. But still, the performance is not satisfactory.
I have also tried multiprocessing (concurrent.futures.ProcessPoolExecutor()) to utilize all the cores in the system instead of a single CPU core, but it still takes more time to complete the execution.
Can anyone give me more suggestions on how to improve the performance (in terms of time and memory) of this problem?
Here is my code:
import struct

def process_value(value):
    hex_value = hex(struct.unpack('<I', struct.pack('<f', value))[0])
    if len(hex_value.split('x')[1]) < 8:
        hex_value = hex_value[:2] + ('0' * (8 - len(hex_value.split('x')[1]))) + hex_value[2:]
    dec1 = int(hex_value.split('x')[1][0] + hex_value.split('x')[1][1], 16)
    dec2 = int(hex_value.split('x')[1][2] + hex_value.split('x')[1][3], 16)
    dec3 = int(hex_value.split('x')[1][4] + hex_value.split('x')[1][5], 16)
    dec4 = int(hex_value.split('x')[1][6] + hex_value.split('x')[1][7], 16)
    msg = bytearray([dec4, dec3, dec2, dec1])
    return msg

def main_function(fp, values):
    msg = bytearray()
    for val in values:
        msg.extend(process_value(val))
    fp.write(msg)
You could try converting all the floats before writing them, and then write the resulting data in one go:
import struct

my_floats = [1.111, 1.222, 1.333, 1.444]

with open('floats.bin', 'wb') as f_output:
    f_output.write(struct.pack('<{}f'.format(len(my_floats)), *my_floats))
For the amount of values you have, you might need to do this in large blocks:
import struct

def blocks(data, n):
    for i in xrange(0, len(data), n):
        yield data[i:i+n]

my_floats = [1.111, 1.222, 1.333, 1.444]

with open('floats.bin', 'wb') as f_output:
    for block in blocks(my_floats, 10000):
        f_output.write(struct.pack('<{}f'.format(len(block)), *block))
The output from struct.pack() should be in the correct binary format for writing directly to the file. The file must be opened in binary mode, which is why wb is used.
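If numpy is available (as in several of the answers above), an alternative sketch for the same little-endian 4-byte output is to convert the list to a float32 array and write it in one go; this isn't the approach used here, just another option:
import numpy as np

my_floats = [1.111, 1.222, 1.333, 1.444]

with open('floats.bin', 'wb') as f_output:
    # '<f4' = little-endian 32-bit float; tofile() writes the raw bytes.
    np.asarray(my_floats, dtype='<f4').tofile(f_output)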

Python pickle file strangely large

I made a pickle file storing the grayscale value of each pixel in 100,000 80x80 images.
(Plus an array of 100,000 integers whose values are one digit.)
My approximation for the total size of the pickle is
4 bytes x 80 x 80 x 100,000 = 2.56 GB
plus the array of integers, which shouldn't be that large.
The generated pickle file, however, is over 16 GB, so it's taking hours just to unpickle and load it, and it eventually freezes after it exhausts memory.
Is there something wrong with my calculation or is it the way I pickled it?
I pickled the file in the following way.
from PIL import Image
import pickle
import os
import numpy
import time

trainpixels = numpy.empty([80000, 6400])
trainlabels = numpy.empty(80000)
validpixels = numpy.empty([10000, 6400])
validlabels = numpy.empty(10000)
testpixels = numpy.empty([10408, 6400])
testlabels = numpy.empty(10408)

i = 0
tr = 0
va = 0
te = 0
for (root, dirs, filenames) in os.walk(indir1):
    print 'hello'
    for f in filenames:
        try:
            im = Image.open(os.path.join(root, f))
            Imv = im.load()
            x, y = im.size
            pixelv = numpy.empty(6400)
            ind = 0
            for ii in range(x):
                for j in range(y):
                    temp = float(Imv[j, ii])
                    temp = float(temp/255.0)
                    pixelv[ind] = temp
                    ind += 1
            if i < 40000:
                trainpixels[tr] = pixelv
                tr += 1
            elif i < 45000:
                validpixels[va] = pixelv
                va += 1
            else:
                testpixels[te] = pixelv
                te += 1
            print str(i)+'\t'+str(f)
            i += 1
        except IOError:
            continue

trainimage = (trainpixels, trainlabels)
validimage = (validpixels, validlabels)
testimage = (testpixels, testlabels)

output = open('data.pkl', 'wb')
pickle.dump(trainimage, output)
pickle.dump(validimage, output)
pickle.dump(testimage, output)
Please let me know if you see something wrong with either my calculation or my code!
Python Pickles are not a thrifty mechanism for storing data, as you're storing objects instead of "just the data."
The following test case takes 24kb on my system and this is for a small, sparsely populated numpy array stored in a pickle:
import os
import sys
import numpy
import pickle

testlabels = numpy.empty(1000)
testlabels[0] = 1
testlabels[99] = 0

test_labels_size = sys.getsizeof(testlabels)  # 80

output = open('/tmp/pickle', 'wb')
test_labels_pickle = pickle.dump(testlabels, output)

print os.path.getsize('/tmp/pickle')
Further, I'm not sure why you believe 4 bytes to be the size of a number in Python -- non-numpy ints are 24 bytes (sys.getsizeof(1)) and numpy arrays are a minimum of 80 bytes (sys.getsizeof(numpy.array([0], float))).
As you stated as a response to my comment, you have reasons for staying with Pickle, so I won't try to convince you further to not store objects, but be aware of the overhead of storing objects.
As an option: reduce the size of your training data/Pickle fewer objects.
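As an illustration of the first option, a minimal sketch that shrinks the arrays before pickling: numpy.empty defaults to float64, so requesting float32 (assuming that precision is enough for normalized grayscale values) halves the array size, and a binary pickle protocol avoids the bulky text-based default:
import numpy
import pickle

# float32 instead of the default float64 halves the in-memory (and pickled) size.
trainpixels = numpy.empty([80000, 6400], dtype=numpy.float32)
trainlabels = numpy.empty(80000, dtype=numpy.uint8)  # one-digit labels fit in a byte

with open('data.pkl', 'wb') as output:
    pickle.dump((trainpixels, trainlabels), output, protocol=pickle.HIGHEST_PROTOCOL)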

Speed up creation of random data

I have written this very simple script to create some random data for machine learning.
from random import randint

f = open('2014-07-17-1M_testdata_1Mx500.cvs', 'w', 50000000)  # 50 MB write buffer
for i in range(1000000):  # num rows
    for i2 in range(500):  # entries per row
        f.write(str(randint(0, 1000000)))  # Return a random integer N such that a <= N <= b.
        if i2 != 499:  # entries per row - 1
            f.write(",")
    f.write("\n")
    if i != 0 and i % 100000 == 0:
        print(str(i) + " lines written")
f.close()
However, I've noticed that one CPU core runs at 100% load and the data creation takes much longer than the disk speed would allow.
For creating large datasets (100+ GB), is there an easy way to speed this up? Some faster random library perhaps?
Pure Python is a tough one, but luckily there are efficient Python libraries that can help speed things up. numpy is a good one:
import numpy
import numpy.random

f = open('2014-07-17-1M_testdata_1Mx500.csv', 'w', 50000000)
for i in range(1000):
    m = numpy.random.random_integers(0, 1000000, (1000, 500))
    numpy.savetxt(f, m, delimiter=',')
f.close()
Running on my MacBook Pro, the code is bound by writing to the disk rather than by the CPU, so this seems to do the trick.
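One caveat, as a side note: numpy.random.random_integers has been deprecated in newer NumPy releases in favor of numpy.random.randint (whose upper bound is exclusive). A sketch of the same approach with that API, plus an integer format string so savetxt doesn't fall back to its float default:
import numpy as np

with open('2014-07-17-1M_testdata_1Mx500.csv', 'w', 50000000) as f:
    for i in range(1000):
        # randint's upper bound is exclusive, hence 1000001 to keep the same range.
        m = np.random.randint(0, 1000001, size=(1000, 500))
        np.savetxt(f, m, delimiter=',', fmt='%d')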
