Why is reading large files in Python so slow?

I am trying to read a CSV file I created earlier in Python using

with open(csvname, 'w') as csvfile:
    csvwriter = csv.writer(csvfile, delimiter=',')
    csvwriter.writerows(data)

data is a random matrix with about 30k * 30k entries in np.float32 format, about 10 GB file size in total.
I read the file back in using this function (since I already know the size of my matrix, and np.genfromtxt is incredibly slow and would need about 100 GB of RAM at this point):

def read_large_txt(path, delimiter=',', dtype=np.float32, nrows=0):
    t1 = time.time()
    with open(path, 'r') as f:
        out = np.empty((nrows, nrows), dtype=dtype)
        for (ii, line) in enumerate(f):
            if ii % 2 == 0:
                out[int(ii/2)] = line.split(delimiter)
    print('Reading %s took %.3f s' % (path, time.time() - t1))
    return out
It takes me about 10 minutes to read that file. The hard drive I am using should be able to read about 100 MB/s, which would bring the reading time down to about 1-2 minutes.
Any ideas what I may be doing wrong?
Related: why numpy narray read from file consumes so much memory?
That's where the function read_large_txt is from.

I found a quite simple solution. Since I am creating the files as well, I don't need to save them as .csv files. It is way (!) faster to load them as .npy files:
Loading (incl. splitting each line by ',') a 30k * 30k matrix stored as .csv takes about 10 minutes. Doing the same with a matrix stored as .npy takes about 10 seconds!
So I changed the writing code above to:
np.save(npyname, data)
and in the other script to
out = np.load(npyname + '.npy')
Another advantage of this method: in my case the .npy files are only about 40% of the size of the .csv files. :)
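For cases where the data really does have to stay in CSV, a faster parser is usually the better fix than a hand-rolled loop. Below is a minimal sketch (not from the original answer), assuming pandas is available and that csvname refers to the same file written above; pandas skips the blank lines produced by the csv writer by default:

import numpy as np
import pandas as pd

# Hedged alternative sketch: pandas' C parser is much faster than a
# per-line Python loop and can produce float32 directly.
# 'csvname' is assumed to be the CSV file written above.
out = pd.read_csv(csvname, header=None, dtype=np.float32).values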

Related

How to improve performance when writing a huge list of bytes to a file?

In Python, I have a huge list of floating point values (nearly 30 million values). I have to convert each of them to a 4-byte value in little-endian format and write them all to a binary file in order.
For a list with a few thousand or even 100k values my code works fine, but as the data grows it takes longer and longer to process and write to the file. What optimization techniques can I use to write to the file more efficiently?
As suggested in this blog, I replaced all the small writes to the file with a single bytearray. But the performance is still not satisfactory.
I have also tried multiprocessing (concurrent.futures.ProcessPoolExecutor()) to utilize all the cores in the system instead of a single CPU core, but it still takes a long time to complete.
Can anyone give me more suggestions on how to improve the performance (in terms of time and memory) of this problem?
Here is my code:
def process_value(value):
    hex_value = hex(struct.unpack('<I', struct.pack('<f', value))[0])
    if len(hex_value.split('x')[1]) < 8:
        hex_value = hex_value[:2] + ('0' * (8 - len(hex_value.split('x')[1]))) + hex_value[2:]
    dec1 = int(hex_value.split('x')[1][0] + hex_value.split('x')[1][1], 16)
    dec2 = int(hex_value.split('x')[1][2] + hex_value.split('x')[1][3], 16)
    dec3 = int(hex_value.split('x')[1][4] + hex_value.split('x')[1][5], 16)
    dec4 = int(hex_value.split('x')[1][6] + hex_value.split('x')[1][7], 16)
    msg = bytearray([dec4, dec3, dec2, dec1])
    return msg

def main_function(fp, values):
    msg = bytearray()
    for val in values:
        msg.extend(process_value(val))
    fp.write(msg)
You could try converting all the floats before writing them, and then write the resulting data in one go:
import struct

my_floats = [1.111, 1.222, 1.333, 1.444]

with open('floats.bin', 'wb') as f_output:
    f_output.write(struct.pack('<{}f'.format(len(my_floats)), *my_floats))
For the number of values you have, you might need to do this in large blocks:
import struct

def blocks(data, n):
    for i in xrange(0, len(data), n):
        yield data[i:i+n]

my_floats = [1.111, 1.222, 1.333, 1.444]

with open('floats.bin', 'wb') as f_output:
    for block in blocks(my_floats, 10000):
        f_output.write(struct.pack('<{}f'.format(len(block)), *block))
The output from struct.pack() is already in the correct binary format for writing directly to the file. Note that the file must be opened in binary mode, i.e. with wb.
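If numpy is an option, the packing can be pushed down into C entirely. This is a minimal sketch, not part of the original answer, using the '<f4' dtype (little-endian 4-byte float):

import numpy as np

my_floats = [1.111, 1.222, 1.333, 1.444]

# Convert once to little-endian float32 and dump the raw buffer to disk.
arr = np.asarray(my_floats, dtype='<f4')
with open('floats.bin', 'wb') as f_output:
    arr.tofile(f_output)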

How to write a huge bytearray to a file progressively without hitting MemoryError

I am working on a tool that generates random data for testing purposes. See below the part of my code that is giving me grief. This works perfectly, and faster (around 20 seconds) than conventional solutions, when the file is around 400 MB; however, once it reaches around 500 MB I get an out-of-memory error. How can I write the contents to the file progressively, keeping no more than 10 MB in memory at any one time?
def createfile(filename, size_kb):
    tbl = bytearray(range(256))
    numrand = os.urandom(size_kb*1024)
    with open(filename, "wb") as fh:
        fh.write(numrand.translate(tbl))

createfile("file1.txt", 500*1024)
Any help will be greatly appreciated
You can write out chunks of 10 MB at a time, rather than generating the entire file in one go. As pointed out by @mhawke, the translate call is redundant and can be removed:
def createfile(filename, size_kb):
    chunks = size_kb / (1024*10)
    with open(filename, "wb") as fh:
        for iter in range(chunks):
            numrand = os.urandom(size_kb*1024 / chunks)
            fh.write(numrand)
        numrand = os.urandom(size_kb*1024 % chunks)
        fh.write(numrand)

createfile("c:/file1.txt", 500*1024)
Combining Jaco's and mhawke's answers and handling some float conversions, here is code that can generate GBs of data in less than 10 seconds:
import math
import os

def createfile(filename, size_kb):
    chunksize = 1024
    chunks = math.ceil(size_kb / chunksize)
    with open(filename, "wb") as fh:
        for iter in range(chunks):
            numrand = os.urandom(int(size_kb*1024 / chunks))
            fh.write(numrand)
        numrand = os.urandom(int(size_kb*1024 % chunks))
        fh.write(numrand)
Creates a 1 GB file in less than 8 seconds.
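A slightly different way to bound memory, not taken from the answers above, is to fix the chunk size instead of the chunk count, so memory use stays at roughly chunk_kb no matter how large the requested file is. A sketch (createfile_chunked and chunk_kb are names introduced here for illustration):

import os

def createfile_chunked(filename, size_kb, chunk_kb=10 * 1024):
    # Hypothetical variant: write fixed 10 MB chunks until the target size
    # is reached, so at most chunk_kb of random data is held in memory.
    remaining = size_kb * 1024
    chunk_bytes = chunk_kb * 1024
    with open(filename, "wb") as fh:
        while remaining > 0:
            n = min(chunk_bytes, remaining)
            fh.write(os.urandom(n))
            remaining -= n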

Fastest way to write large CSV with Python

I want to write some random sample data to a CSV file until it is 1 GB big. The following code works:
import numpy as np
import uuid
import csv
import os

outfile = 'data.csv'
outsize = 1024 # MB
with open(outfile, 'ab') as csvfile:
    wtr = csv.writer(csvfile)
    while (os.path.getsize(outfile)//1024**2) < outsize:
        wtr.writerow(['%s,%.6f,%.6f,%i' % (uuid.uuid4(), np.random.random()*50,
                                           np.random.random()*50, np.random.randint(1000))])
How can I make it faster?
The problem appears to be mainly IO-bound. You can improve the I/O a bit by writing to the file in larger chunks instead of writing one line at a time:
import numpy as np
import uuid
import os

outfile = 'data-alt.csv'
outsize = 10 # MB
chunksize = 1000
with open(outfile, 'ab') as csvfile:
    while (os.path.getsize(outfile)//1024**2) < outsize:
        data = [[uuid.uuid4() for i in range(chunksize)],
                np.random.random(chunksize)*50,
                np.random.random(chunksize)*50,
                np.random.randint(1000, size=(chunksize,))]
        csvfile.writelines(['%s,%.6f,%.6f,%i\n' % row for row in zip(*data)])
You can experiment with the chunksize (the number of rows written per chunk) to see what works best on your machine.
Here is a benchmark, comparing the above code to your original code, with outsize set to 10 MB:
% time original.py
real 0m5.379s
user 0m4.839s
sys 0m0.538s
% time write_in_chunks.py
real 0m4.205s
user 0m3.850s
sys 0m0.351s
So this is about 25% faster than the original code.
PS. I tried replacing the calls to os.path.getsize with an estimate of the total number of lines needed. Unfortunately, it did not improve the speed. Since the number of bytes needed to represent the final int varies, the estimate is also inexact -- that is, it does not perfectly replicate the behavior of your original code. So I left the os.path.getsize in place.
Removing all the unnecessary stuff, so it should be faster and easier to understand:
import random
import uuid

outfile = 'data.csv'
outsize = 1024 * 1024 * 1024 # 1GB
with open(outfile, 'ab') as csvfile:
    size = 0
    while size < outsize:
        txt = '%s,%.6f,%.6f,%i\n' % (uuid.uuid4(), random.random()*50,
                                     random.random()*50, random.randrange(1000))
        size += len(txt)
        csvfile.write(txt)
This is an update building on unutbu's answer above:
A large % of the time was spent generating random numbers and checking the file size.
If you generate the rows ahead of time, you can assess the raw disk I/O performance:
import time
from pathlib import Path
import numpy as np
import uuid

outfile = Path('data-alt.csv')
chunksize = 1_800_000

data = [
    [uuid.uuid4() for i in range(chunksize)],
    np.random.random(chunksize) * 50,
    np.random.random(chunksize) * 50,
    np.random.randint(1000, size=(chunksize,))
]
rows = ['%s,%.6f,%.6f,%i\n' % row for row in zip(*data)]

t0 = time.time()
with open(outfile, 'a') as csvfile:
    csvfile.writelines(rows)
tdelta = time.time() - t0
print(tdelta)
On my standard 860 EVO SSD (not NVMe), I get 1.43 s for 1_800_000 rows, so that's 1,258,741 rows/sec (not too shabby imo).

How to download huge Oracle LOB with cx_Oracle on memory constrained system?

I'm developing part of a system where processes are limited to about 350MB of RAM; we use cx_Oracle to download files from an external system for processing.
The external system stores files as BLOBs, and we can grab them doing something like this:
# ... set up Oracle connection, then
cursor.execute(u"""SELECT filename, data, filesize
                   FROM FILEDATA
                   WHERE ID = :id""", id=the_one_you_wanted)
filename, lob, filesize = cursor.fetchone()

with open(filename, "w") as the_file:
    the_file.write(lob.read())
lob.read() will obviously fail with MemoryError when we hit a file larger than 300-350MB, so we've tried something like this instead of reading it all at once:
read_size = 0
chunk_size = lob.getchunksize() * 100
while read_size < filesize:
    data = lob.read(chunk_size, read_size + 1)
    read_size += len(data)
    the_file.write(data)
Unfortunately, we still get MemoryError after several iterations. From the time lob.read() is taking, and the out-of-memory condition we eventually get, it looks as if lob.read() is pulling (chunk_size + read_size) bytes from the database every time. That is, reads are taking O(n) time and O(n) memory, even though the buffer is quite a bit smaller.
To work around this, we've tried something like:
read_size = 0
while read_size < filesize:
    q = u'''SELECT dbms_lob.substr(data, 2000, %s)
            FROM FILEDATA WHERE ID = :id''' % (read_size + 1)
    cursor.execute(q, id=filedataid[0])
    row = cursor.fetchone()
    read_size += len(row[0])
    the_file.write(row[0])
This pulls 2000 bytes (argh) at a time, and takes forever (something like two hours for a 1.5GB file). Why 2000 bytes? According to the Oracle docs, dbms_lob.substr() stores its return value in a RAW, which is limited to 2000 bytes.
Is there some way I can store the dbms_lob.substr() results in a larger data object and read maybe a few megabytes at a time? How do I do this with cx_Oracle?
I think that the argument order in lob.read() is reversed in your code. The first argument should be the offset, the second argument should be the amount to read. This would explain the O(n) time and memory usage.
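For illustration, here is a minimal sketch of the chunked loop with the arguments in the documented order (offset first, then amount), assuming the same lob, filesize and the_file objects as in the question:

offset = 1  # cx_Oracle LOB offsets are 1-based
chunk_size = lob.getchunksize() * 100
while offset <= filesize:
    # read at most chunk_size bytes starting at the current offset
    data = lob.read(offset, chunk_size)
    if not data:
        break
    the_file.write(data)
    offset += len(data)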

Why is loading this file taking so much memory?

Trying to load a file into Python. It's a very big file (1.5 GB), but I have the available memory and I just want to do this once (hence the use of Python; I just need to sort the file one time, so Python was an easy choice).
My issue is that loading this file results in way too much memory usage. When I've loaded about 10% of the lines into memory, Python is already using 700 MB, which is clearly too much. At around 50% the script hangs, using 3.03 GB of real memory (and slowly rising).
I know this isn't the most efficient method of sorting a file (memory-wise), but I just want it to work so I can move on to more important problems :D So, what is wrong with the following Python code that's causing the massive memory usage?
print 'Loading file into memory'
input_file = open(input_file_name, 'r')
input_file.readline() # Toss out the header
lines = []
totalLines = 31164015.0
currentLine = 0.0
printEvery100000 = 0

for line in input_file:
    currentLine += 1.0
    lined = line.split('\t')
    printEvery100000 += 1
    if printEvery100000 == 100000:
        print str(currentLine / totalLines)
        printEvery100000 = 0
    lines.append((lined[timestamp_pos].strip(), lined[personID_pos].strip(),
                  lined[x_pos].strip(), lined[y_pos].strip()))

input_file.close()
print 'Done loading file into memory'
EDIT: In case anyone is unsure, the general consensus seems to be that each allocated variable eats up more and more memory. I "fixed" it in this case by 1) calling readlines(), which still loads all the data, but only has one string object of overhead per line. This loads the entire file using about 1.7 GB. Then, when I call lines.sort(), I pass a key function that splits on tabs and returns the right column value, converted to an int. This is computationally slow, and memory-intensive overall, but it works. Learned a ton about variable allocation overhead today :D
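As an illustration of that fix (not the asker's exact code), here is a sketch of the readlines()-plus-key approach, assuming input_file_name and timestamp_pos from the question, and that the timestamp column converts cleanly to int:

def sort_key(line):
    # timestamp_pos is the column index used in the question (assumed here)
    return int(line.split('\t')[timestamp_pos])

with open(input_file_name, 'r') as f:
    f.readline()           # toss out the header
    lines = f.readlines()  # one str object per line, no per-column tuples

lines.sort(key=sort_key)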
Here is a rough estimate of the memory needed, based on the constants derived from your example. At a minimum you have to figure the Python internal object overhead for each split line, plus the overhead for each string.
It estimates 9.1 GB to store the file in memory, assuming the following constants, which are off by a bit, since you're only using part of each line:
1.5 GB file size
31,164,015 total lines
each line split into a list with 4 pieces
Code:
import sys

def sizeof(lst):
    return sys.getsizeof(lst) + sum(sys.getsizeof(v) for v in lst)

GIG = 1024**3
file_size = 1.5 * GIG
lines = 31164015
num_cols = 4
avg_line_len = int(file_size / float(lines))

val = 'a' * (avg_line_len / num_cols)
lst = [val] * num_cols
line_size = sizeof(lst)

print 'avg line size: %d bytes' % line_size
print 'approx. memory needed: %.1f GB' % ((line_size * lines) / float(GIG))
Returns:
avg line size: 312 bytes
approx. memory needed: 9.1 GB
I don't know about the analysis of the memory usage, but you might try this to get it to work without running out of memory. You sort into a new file which is accessed through a memory mapping (I've been led to believe this works efficiently in terms of memory). Mmap has some OS-specific behaviour; I tested this on Linux (on a very small scale).
This is the basic code. To make it run with decent time efficiency, you'd probably want to do a binary search on the sorted file to find where to insert each line; otherwise it will probably take a long time.
You can find a file-seeking binary search algorithm in this question.
Hopefully a memory efficient way of sorting a massive file by line:
import os
from mmap import mmap

input_file = open('unsorted.txt', 'r')
output_file = open('sorted.txt', 'w+')

# need to provide something in order to be able to mmap the file
# so we'll just copy the first line over
output_file.write(input_file.readline())
output_file.flush()
mm = mmap(output_file.fileno(), os.stat(output_file.name).st_size)
cur_size = mm.size()

for line in input_file:
    mm.seek(0)
    tup = line.split("\t")
    while True:
        cur_loc = mm.tell()
        o_line = mm.readline()
        o_tup = o_line.split("\t")
        if o_line == '' or tup[0] < o_tup[0]: # EOF or we found our spot
            mm.resize(cur_size + len(line))
            mm[cur_loc+len(line):] = mm[cur_loc:cur_size]
            mm[cur_loc:cur_loc+len(line)] = line
            cur_size += len(line)
            break
