Opening a 25GB text file for processing - python

I have a 25GB file I need to process. Here is what I'm currently doing, but it takes an extremely long time to open:
collection_pricing = os.path.join(pricing_directory, 'collection_price')
with open(collection_pricing, 'r') as f:
    collection_contents = f.readlines()
    length_of_file = len(collection_contents)
    for num, line in enumerate(collection_contents):
        print '%s / %s' % (num+1, length_of_file)
        cursor.execute(...)
How could I improve this?

Unless the lines in your file are really, really big, do not print the progress at every line. Printing to a terminal is very slow. Print progress e.g. every 100 or every 1000 lines.
Use the available operating system facilities to get the size of a file - os.path.getsize(), see Getting file size in Python?
Get rid of readlines() to avoid reading 25GB into memory. Instead read and process line by line, see e.g. How to read large file, line by line in python

Pass through the file twice: Once to count lines, once to do the printing. Never call readlines on a file that size -- you'll end up swapping everything to disk. (Actually, just never call readlines in general. It's silly.)
(Incidentally, I'm assuming that you're actually doing something with the lines, rather than just the number of lines -- the code you posted there doesn't actually use anything from the file other than the number of newlines in it.)
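For what it's worth, a minimal sketch of that two-pass idea, reusing the question's collection_pricing path (the database call is only hinted at, as in the original):
# pass 1: just count the lines, cheaply
with open(collection_pricing) as f:
    total_lines = sum(1 for _ in f)

# pass 2: do the real work, reporting progress against the known total
with open(collection_pricing) as f:
    for num, line in enumerate(f, 1):
        if num % 10000 == 0:
            print '%s / %s' % (num, total_lines)
        # cursor.execute(...) goes here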

Combining the answers above, here is how I modified it.
size_of_file = os.path.getsize(collection_pricing)
progress = 0
line_count = 0
with open(collection_pricing, 'r') as f:
    for line in f:
        line_count += 1
        progress += len(line)
        if line_count % 10000 == 0:
            print '%s / %s' % (progress, size_of_file)
This has the following improvements:
Doesn't use readlines(), so everything isn't stored in memory
Only prints every 10,000 lines
Uses the size of the file instead of the line count to measure progress, so the file doesn't have to be iterated twice.

Related

In Python (SageMath 9.0) - text file on 1B lines - optimal way to read from a specific line

I'm running SageMath 9.0, on Windows 10 OS
I've read several similar questions (and answers) on this site, mainly this one on reading from the 7th line and this one on optimizing. But I have some specific issues: I need to understand how to optimally read from a specific (possibly very far away) line, and whether I should read line by line, or whether reading by block could be "more optimal" in my case.
I have a 12GB text file, made of around 1 billion small lines, all made of ASCII printable characters. Each line has a constant number of characters. Here are the actual first 5 lines:
J??????????
J???????C??
J???????E??
J??????_A??
J???????F??
...
For context, this file is a list of all non-isomorphic graphs on 11 vertices, encoded in graph6 format. The file has been computed and made available by Brendan McKay on his webpage here.
I need to check every graph for some properties. I could use the generator for G in graphs(11), but this can take very long (at least a few days on my laptop). I want to use the complete database in the file, so that I'm able to stop and start again from a certain point.
My current code reads the file line by line from the start, and does some computation after reading each line:
with open(filename, 'r') as file:
    while True:
        # Get next line from file
        line = file.readline()
        # if line is empty, end of file is reached
        if not line:
            print("End of Database Reached")
            break
        G = Graph()
        from_graph6(G, line.strip())
        run_some_code(G)
In order to be able to stop the code, or save the progress in case of a crash, I was thinking of:
Every million lines read (or so), save the progress in a specific file
When restarting the code, read the last saved value and, instead of using line = file.readline(), use the itertools option for line in islice(file, start_line, None),
so that my new code is
from itertools import islice
start_line = load('foo')
count = start_line
save_every_n_lines = 1000000
with open(filename, 'r') as file:
    for line in islice(file, start_line, None):
        G = Graph()
        from_graph6(G, line.strip())
        run_some_code(G)
        count += 1
        if (count % save_every_n_lines) == 0:
            save(count, 'foo')
The code does work, but I would like to understand if I can optimise it. I'm not a big fan of my if statement in my for loop.
Is itertools.islice() the right option here? The documentation states "If start is non-zero, then elements from the iterable are skipped until start is reached". As "start" could be quite large, and given that I'm working on simple text files, could there be a faster option, in order to "jump" directly to the start line?
Knowing that the text file is fixed, could it be more optimal to split the actual file into 100 or 1000 smaller files and read them one by one? This would get rid of the if statement in my for loop.
I also have the option to read blocks of lines in one go instead of line by line, and then work on a list of graphs. Could that be a good option?
Each line has a constant number of characters. So "jumping" might be feasible.
Assuming each line is the same size, you can use a memory mapped file and read it by index without mucking about with seek and tell. The memory mapped file emulates a bytearray and you can take record-sized slices from the array for the data you want. If you want to pause processing, you only have to save the current record index in the array and you can start up again with that index later.
This example is on Linux - mmap open on Windows is a bit different - but after it's set up, access should be the same.
import os
import mmap

# I think this is the record plus newline: 11 graph6 characters and '\n'
LINE_SZ = 12
RECORD_SZ = LINE_SZ - 1

# generate test file
testdata = "testdata.txt"
with open(testdata, 'wb') as f:
    for i in range(100):
        f.write("R{: 10}\n".format(i).encode('ascii'))

f = open(testdata, 'rb')
data = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)

# the i-th record is
i = 20
record = data[i*LINE_SZ:i*LINE_SZ+RECORD_SZ]
print("record 20", record)

# you can stick it in a function. this is a bit slower, but encapsulated
def get_record(mmapped_file, index):
    return mmapped_file[index*LINE_SZ:index*LINE_SZ+RECORD_SZ]

print("get record 20", get_record(data, 20))

# to enumerate
def enum_records(mmapped_file, start, stop=None, step=1):
    if stop is None:
        stop = mmapped_file.size() // LINE_SZ
    for pos in range(start*LINE_SZ, stop*LINE_SZ, step*LINE_SZ):
        yield mmapped_file[pos:pos+RECORD_SZ]

print("enum 6 to 8", [record for record in enum_records(data, 6, 9)])

del data
f.close()
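For the use case in the question, resuming could then look roughly like this - a sketch that reuses the question's Graph/from_graph6/run_some_code/save/load names and assumes data is the mmap of the graph6 file, with LINE_SZ = 12 as above:
start_record = load('foo')                      # record index saved by a previous run
for offset, record in enumerate(enum_records(data, start_record)):
    G = Graph()
    from_graph6(G, record.decode('ascii'))      # slices come back as bytes from the mmap
    run_some_code(G)
    if (start_record + offset + 1) % 1000000 == 0:
        save(start_record + offset + 1, 'foo')  # checkpoint every million records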
If the length of the line is constant (in this case it's 12: 11 characters plus the newline character), you might do
def get_line(k, line_len):
    with open('file') as f:
        f.seek(k * line_len)
        return next(f)
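Building on that, here is a rough sketch of how the question's save/restart loop could use seek() to jump straight to the saved line instead of islice(). It reuses the question's names, opens the file in binary mode so byte offsets are exact, and assumes every line, newline included, is exactly line_len bytes:
line_len = 12                             # 11 graph6 characters plus the newline
count = load('foo')
save_every_n_lines = 1000000
with open(filename, 'rb') as file:        # binary mode: seek by raw byte offset
    file.seek(count * line_len)           # jump directly to the saved position
    for raw in file:
        G = Graph()
        from_graph6(G, raw.decode('ascii').strip())
        run_some_code(G)
        count += 1
        if (count % save_every_n_lines) == 0:
            save(count, 'foo')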

Does it take RAM to save a readlines array?

I am using lineslist = file.readlines() on a 2GB file.
So, I guess it will create a lineslist array of 2GB or more in size. So, basically, is it the same as readfile = file.read(), which also creates readfile (an instance/variable?) of exactly 2GB?
Why should I prefer readlines in this case?
Adding to that I have one more question, it is also mentioned here https://docs.python.org/2/tutorial/inputoutput.html:
readline(): a newline character (\n) is left at the end of the string, and is only omitted on the last line of the file if the file doesn’t end in a newline. This makes the return value unambiguous;
I don't understand the last point. So, does readlines() also leave the last element of its array unambiguous if there is no \n at the end of the file?
We are dealing with combining files (which were split on the basis of block size), so I am thinking of choosing readlines or read. As the individual files may not end with a \n after splitting, it would be a problem, I think, if readlines were ambiguous about the last line.
PS: I haven't learnt Python, so forgive me if there is no such thing as instances in Python or if I am speaking rubbish. I am just assuming.
EDIT:
Ok, I just found out: it's not returning any ambiguous output.
len(lineslist)
6923798
lineslist[6923797]
"\xf4\xe5\xcf1)\xff\x16\x93\xf2\xa3-\....\xab\xbb\xcd"
So, it doesn't end with '\n', but it's not ambiguous output either.
There is no ambiguous output with readline either for the last line.
If I understood your issue correctly, you just want to combine (i.e. concatenate) files.
If memory is an issue, for line in f is normally the way to go.
I tried benchmarking using a 1.9GB csv file. One possible alternative is to read in large chunks of the data which fit in memory.
Codes:
# read in large chunks - fastest in my test
chunksize = 2**16
with open(fn, 'r') as f:
    chunk = f.read(chunksize)
    while chunk:
        chunk = f.read(chunksize)
# 1 loop, best of 3: 4.48 s per loop

# read whole file in one go - slowest in my test
with open(fn, 'r') as f:
    chunk = f.read()
# 1 loop, best of 3: 11.7 s per loop

# read file using iterator over each line - most practical for most cases
with open(fn, 'r') as f:
    for line in f:
        s = line
# 1 loop, best of 3: 6.74 s per loop
Knowing this you could write something like:
with open(outputfile, 'w') as fo:
    for inputfile in inputfiles:  # assuming inputfiles is a list of filepaths
        with open(inputfile, 'r') as fi:
            for chunk in iter(lambda: fi.read(chunksize), ''):
                fo.write(chunk)
        fo.write('\n')  # newline between each file (might not be necessary)
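As a side note, the standard library can do the buffered copying for you; here is a sketch with shutil.copyfileobj under the same inputfiles/outputfile assumption, using binary mode for a plain byte-for-byte concatenation with no separator:
import shutil

with open(outputfile, 'wb') as fo:
    for inputfile in inputfiles:
        with open(inputfile, 'rb') as fi:
            shutil.copyfileobj(fi, fo)    # copies in fixed-size chunks, never the whole file at once
Since the goal is to reassemble files that were split by block size, an exact byte-for-byte copy with no added newlines is probably what is wanted anyway.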
file.read() will read the entire stream of data as 1 long string, whereas file.readlines() will create a list of lines from the stream.
Generally performance will suffer, especially in the case of large files, if you read in the entire thing all at once. The general approach is to iterate over the file object line by line, which it supports.
for line in file_object:
    # Process the line
This way of processing only consumes memory for one line at a time (loosely speaking) and not the entire contents of the file.
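To make the read()/readlines() difference (and the trailing-newline point from the question) concrete, here is a tiny demonstration with a hypothetical two-line file that does not end in a newline:
with open('demo.txt', 'w') as f:
    f.write('first\nsecond')        # note: no trailing newline

with open('demo.txt') as f:
    print(repr(f.read()))           # 'first\nsecond'       -> one long string
with open('demo.txt') as f:
    print(repr(f.readlines()))      # ['first\n', 'second'] -> only the last element lacks '\n'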
Yes, readlines() reads the whole file into a variable.
It would be much better to read the file line by line:
f = open("file_path", "r")
for line in f:
    print line
This loads only one line into RAM at a time, so you're saving about 1.99 GB of memory :)
As I understand it, you want to concatenate two files.
target = open("target_file", "w")
f1 = open("f1", "r")
f2 = open("f2", "r")
for line in f1:
    target.write(line)
for line in f2:
    target.write(line)
target.close()
Or consider using other technology like bash:
cat file1 > target
cat file2 >> target

Split large files using python

I have some trouble trying to split large files (say, around 10GB). The basic idea is to simply read the lines and group every, say, 40000 lines into one file.
But there are two ways of "reading" files.
1) The first one is to read the WHOLE file at once, and make it into a LIST. But this requires loading the WHOLE file into memory, which is painful for a file this large. (I think I asked such a question before.)
In Python, approaches to read the WHOLE file at once that I've tried include:
input1 = f.readlines()
input1 = commands.getoutput('zcat ' + file).splitlines(True)
input1 = subprocess.Popen(["cat", file],
                          stdout=subprocess.PIPE, bufsize=1)
Well, then I can easily group 40000 lines into one file with list[40000:80000] or list[80000:120000].
Another advantage of using a list is that we can easily point to specific lines.
2) The second way is to read line by line, processing each line as it is read. The lines that have been read won't be kept in memory.
Examples include:
f=gzip.open(file)
for line in f: blablabla...
or
for line in fileinput.FileInput(fileName):
I'm sure that for gzip.open, this f is NOT a list but a file object, and it seems we can only process it line by line; so how can I execute this "split" job? How can I point to specific lines of the file object?
Thanks
NUM_OF_LINES = 40000
filename = 'myinput.txt'
with open(filename) as fin:
    fout = open("output0.txt", "wb")
    for i, line in enumerate(fin):
        fout.write(line)
        if (i+1) % NUM_OF_LINES == 0:
            fout.close()
            fout = open("output%d.txt" % (i/NUM_OF_LINES+1), "wb")
    fout.close()
If there's nothing special about having a specific number of file lines in each file, the readlines() function also accepts a size 'hint' parameter that behaves like this:
If given an optional parameter sizehint, it reads that many bytes from
the file and enough more to complete a line, and returns the lines
from that. This is often used to allow efficient reading of a large
file by lines, but without having to load the entire file in memory.
Only complete lines will be returned.
...so you could write that code something like this:
# assume that an average line is about 80 chars long, and that we want about
# 40K lines in each file.
SIZE_HINT = 80 * 40000

fileNumber = 0
with open("inputFile.txt", "rt") as f:
    while True:
        buf = f.readlines(SIZE_HINT)
        if not buf:
            # we've read the entire file in, so we're done.
            break
        outFile = open("outFile%d.txt" % fileNumber, "wt")
        outFile.writelines(buf)
        outFile.close()
        fileNumber += 1
The best solution I have found is to use the library filesplit.
You only need to specify the input file, the output folder and the desired size in bytes for the output files. The library will do all the work for you.
from fsplit.filesplit import Filesplit

def split_cb(f, s):
    print("file: {0}, size: {1}".format(f, s))

fs = Filesplit()
fs.split(file="/path/to/source/file", split_size=900000, output_dir="/pathto/output/dir", callback=split_cb)
For a 10GB file, the second approach is clearly the way to go. Here is an outline of what you need to do:
Open the input file.
Open the first output file.
Read one line from the input file and write it to the output file.
Maintain a count of how many lines you've written to the current output file; as soon as it reaches 40000, close the output file, and open the next one.
Repeat steps 3-4 until you've reached the end of the input file.
Close both files.
import fileinput

chunk_size = 40000
fout = None
for i, line in enumerate(fileinput.FileInput(filename)):
    if i % chunk_size == 0:
        if fout:
            fout.close()
        fout = open('output%d.txt' % (i // chunk_size), 'w')
    fout.write(line)
fout.close()
Obviously, as you are doing work on the file, you will need to iterate over the file's contents in some way -- whether you do that manually or you let a part of the Python API do it for you (e.g. the readlines() method) is not important. In big O analysis, this means you will spend O(n) time (n being the size of the file).
But reading the file into memory also requires O(n) space. Although sometimes we do need to read a 10 GB file into memory, your particular problem does not require it. We can iterate over the file object directly. Of course, the file object does require space, but we have no reason to hold the contents of the file twice, in two different forms.
Therefore, I would go with your second solution.
I created this small script to split the large file in a few seconds. It took only 20 seconds to split a text file with 20M lines into 10 small files each with 2M lines.
split_length = 2_000_000
large_file = open('large-file.txt', encoding='utf-8', errors='ignore').readlines()
number_of_files = (len(large_file) + split_length - 1) // split_length
for file_count in range(number_of_files):
    split_start_value = file_count * split_length
    split_end_value = split_length * (file_count + 1)
    file_content = ''.join(large_file[split_start_value:split_end_value])
    new_file = open(f'splitted-file-{file_count}.txt', 'w', encoding='utf-8', errors='ignore')
    new_file.write(file_content)
    new_file.close()
    print(f'created file {file_count + 1}')
To split a file line-wise:
group every, say 40000 lines into one file
You can use module filesplit with method bylinecount (version 4.0):
import os
from filesplit.split import Split

LINES_PER_FILE = 40_000  # see PEP 515 for readable numeric literals
filename = 'myinput.txt'
outdir = 'splitted/'     # to store split files `myinput_1.txt` etc.
os.makedirs(outdir, exist_ok=True)  # the output directory has to exist
Split(filename, outdir).bylinecount(LINES_PER_FILE)
This is similar to rafaoc's answer which apparently used outdated version 2.0 to split by size.

Reading a big file cost too much memory in Python 2.7

I used .readline() to parse the file line by line, because I need to find the start position from which to extract data into a list, and the end point at which to pause extracting, then repeat until the end of the file.
My file to read is formatted like this:
blabla...
useless....
...
/sign/
data block(e.g. 10 cols x 1000 rows)
... blank line
/sign/
data block(e.g. 10 cols x 1000 rows)
... blank line
...
EOF
Let's call this file 'myfile'. Here is my Python snippet:
f = open('myfile', 'r')
blocknum = 0  # number the data block
data = []
while True:
    # find the beginning of the block to extract
    while not f.readline().startswith('/sign/'):
        pass
    # create a multidimensional list to store the data block
    data.append([])
    blocknum += 1
    line = f.readline()
    while line.strip():
        # check if the line is a blank line, i.e. the end of one block
        data[blocknum-1].append(["%2.6E" % float(x) for x in line.split()])
        line = f.readline()
    print "Read Block %d" % blocknum
    if not f.readline():
        break
The result was that reading a 500MB file consumed almost 2GB of RAM. I cannot figure it out; can somebody help?
Thanks very much!
You have quite a lot of non-Pythonic, ambiguous lines in your code. I am not sure, but I think you can modify your code the following way first and then check it again against memory usage:
data = []
with open('myfile', 'r') as f:
    for line in f:
        # find the extract beginning - think you can add here more parameters to check
        if not line.strip() or line.startswith('/sign/'):
            continue
        data.append(["%2.6E" % float(x) for x in line.strip().split()])
But I think this code will also use quite a lot of memory. However, if you don't really need to store all the data read from the file, you can modify the code to use a generator expression and process the file data line by line - this would save your memory, I guess.
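A minimal sketch of that generator idea, keeping the question's block structure but holding only one block in memory at a time (the helper name parse_blocks is made up here):
def parse_blocks(path):
    with open(path, 'r') as f:
        block = None
        for line in f:
            if line.startswith('/sign/'):
                if block is not None:
                    yield block                 # hand the finished block to the caller
                block = []
            elif block is not None and line.strip():
                block.append(["%2.6E" % float(x) for x in line.split()])
        if block is not None:
            yield block                         # last block (file may not end with a blank line)

for blocknum, block in enumerate(parse_blocks('myfile'), 1):
    print("Read Block %d" % blocknum)           # process the block here, then let it be freed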

Change python file in place

I have a large xml file (40 Gb) that I need to split into smaller chunks. I am working with limited space, so is there a way to delete lines from the original file as I write them to new files?
Thanks!
Say you want to split the file into N pieces, then simply start reading from the back of the file (more or less) and repeatedly call truncate:
Truncate the file's size. If the optional size argument is present, the file is truncated to (at most) that size. The size defaults to the current position. The current file position is not changed. ...
import os
import stat

BUF_SIZE = 4096
size = os.stat("large_file")[stat.ST_SIZE]
chunk_size = size // N
# or simply set a fixed chunk size based on your free disk space
c = 0
in_ = open("large_file", "r+")
while size > 0:
    in_.seek(-min(size, chunk_size), 2)
    # now you have to find a safe place to split the file at somehow
    # just read forward until you found one
    ...
    old_pos = in_.tell()
    with open("small_chunk%2d" % (c, ), "w") as out:
        b = in_.read(BUF_SIZE)
        while len(b) > 0:
            out.write(b)
            b = in_.read(BUF_SIZE)
    in_.truncate(old_pos)
    size = old_pos
    c += 1
Be careful, as I didn't test any of this. It might be needed to call flush after the truncate call, and I don't know how fast the file system is going to actually free up the space.
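For the "find a safe place" step left as ... above, one hedged way to do it for this XML case is to read forward line by line until a closing tag goes past (the tag name below is just a placeholder):
def find_split_point(fobj, closing_tag='</EndTag>'):
    # scan forward from the current position until a closing tag is seen,
    # and return the offset just after that line (or EOF if none is found)
    while True:
        line = fobj.readline()
        if not line or closing_tag in line:
            return fobj.tell()
In the loop above, the old_pos = in_.tell() line would then become old_pos = find_split_point(in_).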
If you're on Linux/Unix, why not use the split command like this guy does?
split --bytes=100m /input/file /output/dir/prefix
EDIT: then use csplit.
I'm pretty sure there is a way, as I've even been able to edit and read from the source files of scripts I've run, but the biggest problem would probably be all the shifting that would have to be done if you started at the beginning of the file. On the other hand, if you go through the file and record all the starting positions of the lines, you could then go in reverse order of position to copy the lines out. Once that's done, you could go back, take the new files one at a time and (if they're small enough) use readlines() to generate a list, reverse the order of the list, then seek to the beginning of the file and overwrite the lines in their old order with the lines in their new one.
(You would truncate the file after reading each block of lines from the end by using the truncate() method, which truncates all data past the current file position if used without any arguments besides the file object itself, assuming you're using one of the classes, or a subclass of one, from the io package to read your file. You'd just have to make sure that the current file position ends up at the beginning of the last line to be written to a new file.)
EDIT: Based on your comment about having to make the separations at the proper closing tags, you'll probably also have to develop an algorithm to detect such tags (perhaps using the peek method), possibly using a regular expression.
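A rough sketch of that idea, simplified to peel whole blocks of lines off the end of the file (so nothing needs to be reversed afterwards, and only one block of extra disk space is needed at a time); the closing-tag detection mentioned in the EDIT is left out, and the names are made up:
LINES_PER_CHUNK = 40000
offsets = [0]
with open('large_file.xml', 'rb') as f:
    for line in f:
        offsets.append(offsets[-1] + len(line))      # byte offset where every line starts

chunk_no = 0
with open('large_file.xml', 'r+b') as f:
    while len(offsets) > 1:
        start = offsets[max(0, len(offsets) - 1 - LINES_PER_CHUNK)]
        f.seek(start)
        data = f.read()                              # the last LINES_PER_CHUNK (or fewer) lines
        with open('chunk_%05d.xml' % chunk_no, 'wb') as out:
            out.write(data)
        f.truncate(start)                            # give that space back before the next pass
        del offsets[max(1, len(offsets) - LINES_PER_CHUNK):]
        chunk_no += 1
Note that the chunks come out in reverse order, with the end of the original file in chunk 0.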
If time is not a major factor (or wear and tear on your disk drive):
Open handle to file
Read up to the size of your partition / logical break point (due to the xml)
Save the rest of your file to disk (not sure how python handles this as far as directly overwriting file or memory usage)
Write the partition to disk
goto 1
If Python does not give you this level of control, you may need to dive into C.
You could always parse the XML file and write out, say, every 10000 elements to their own file. Look at the Incremental Parsing section of this link.
http://effbot.org/zone/element-iterparse.htm
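A rough sketch of the iterparse approach (the record tag, the wrapper element and the file names are placeholders; namespaces and the original header are ignored):
import xml.etree.ElementTree as ET

batch, file_no = [], 0
context = ET.iterparse('large_file.xml', events=('start', 'end'))
_, root = next(context)                               # grab the root so it can be cleared
for event, elem in context:
    if event == 'end' and elem.tag == 'record':       # 'record' stands in for the repeated element
        batch.append(ET.tostring(elem))
        root.clear()                                  # drop processed elements so memory stays bounded
        if len(batch) == 10000:
            with open('chunk_%d.xml' % file_no, 'wb') as out:
                out.write(b'<records>' + b''.join(batch) + b'</records>')
            batch, file_no = [], file_no + 1
if batch:                                             # whatever is left over at the end
    with open('chunk_%d.xml' % file_no, 'wb') as out:
        out.write(b'<records>' + b''.join(batch) + b'</records>')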
Here is my script...
import string
import os
from ftplib import FTP

# make ftp connection
ftp = FTP('server')
ftp.login('user', 'pwd')
ftp.cwd('/dir')

f1 = open('large_file.xml', 'r')
size = 0
split = False
count = 0

for line in f1:
    if not split:
        file = 'split_' + str(count) + '.xml'
        f2 = open(file, 'w')
        if count > 0:
            f2.write('<?xml version="1.0"?>\n')
            f2.write('<StartTag xmlns="http://www.blah/1.2.0">\n')
        size = 0
        count += 1
        split = True
    if size < 1073741824:
        f2.write(line)
        size += len(line)
    elif str(line) == '</EndTag>\n':
        f2.write(line)
        f2.write('</EndEndTag>\n')
        print('completed file %s' % str(count))
        f2.close()
        f2 = open(file, 'r')
        print("ftp'ing file...")
        ftp.storbinary('STOR ' + file, f2)
        print('ftp done.')
        split = False
        f2.close()
        os.remove(file)
    else:
        f2.write(line)
        size += len(line)
It's time to buy a new hard drive!
You can make a backup before trying all the other answers, so you don't lose any data :)
