I have a large XML file (40 GB) that I need to split into smaller chunks. I am working with limited space, so is there a way to delete lines from the original file as I write them to new files?
Thanks!
Say you want to split the file into N pieces, then simply start reading from the back of the file (more or less) and repeatedly call truncate:
Truncate the file's size. If the optional size argument is present, the file is truncated to (at most) that size. The size defaults to the current position. The current file position is not changed. ...
import os
import stat

BUF_SIZE = 4096
size = os.stat("large_file")[stat.ST_SIZE]
chunk_size = size // N
# or simply set a fixed chunk size based on your free disk space
c = 0
in_ = open("large_file", "r+")

while size > 0:
    in_.seek(-min(size, chunk_size), 2)
    # now you have to find a safe place to split the file at somehow
    # just read forward until you found one
    ...
    old_pos = in_.tell()
    with open("small_chunk%2d" % (c, ), "w") as out:
        b = in_.read(BUF_SIZE)
        while len(b) > 0:
            out.write(b)
            b = in_.read(BUF_SIZE)
    in_.truncate(old_pos)
    size = old_pos
    c += 1
Be careful, as I didn't test any of this. You might need to call flush after the truncate call, and I don't know how quickly the file system will actually free up the space.
If you're on Linux/Unix, why not use the split command like this guy does?
split --bytes=100m /input/file /output/dir/prefix
EDIT: then use csplit.
I'm pretty sure it's possible, as I've even been able to edit/read from the source files of scripts while they were running, but the biggest problem would probably be all the shifting that would happen if you started at the beginning of the file. On the other hand, if you go through the file and record the starting position of every line, you could then go in reverse order of position to copy the lines out. Once that's done, you could go back, take the new files one at a time and (if they're small enough) use readlines() to generate a list, reverse the order of the list, then seek to the beginning of the file and overwrite the lines in their old order with the lines in their new one.
(You would truncate the file after reading the first block of lines from the end by using the truncate() method, which truncates all data past the current file position if used without any arguments besides that of the file object, assuming you're using one of the classes or a subclass of one of the classes from the io package to read your file. You'd just have to make sure that the current file position ends up at the beginning of the last line to be written to a new file.)
EDIT: Based on your comment about having to make the separations at the proper closing tags, you'll probably also have to develop an algorithm to detect such tags (perhaps using the peek method), possibly using a regular expression.
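A rough sketch of the record-the-positions-then-truncate idea described above (illustrative only: it peels whole blocks of trailing lines off the end rather than individual lines, so no reversal pass is needed, and it does not handle the XML closing-tag issue; CHUNK_LINES and the output naming are made up for the example):

import os

CHUNK_LINES = 100_000  # hypothetical number of lines per output chunk

def split_in_place(path):
    # Pass 1: record the byte offset at which every line starts.
    line_starts = []
    with open(path, "rb") as f:
        pos = 0
        for line in f:
            line_starts.append(pos)
            pos += len(line)

    # Pass 2: repeatedly copy the last CHUNK_LINES lines to a new file,
    # then truncate them off the original so the space is given back.
    chunk_no = 0
    with open(path, "r+b") as f:
        while line_starts:
            cut = max(0, len(line_starts) - CHUNK_LINES)
            start = line_starts[cut]
            f.seek(start)
            with open("%s.part%04d" % (path, chunk_no), "wb") as out:
                out.write(f.read())
            f.truncate(start)
            del line_starts[cut:]
            chunk_no += 1

# split_in_place("large_file.xml")  # note: the last chunk of the file becomes part 0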
If time is not a major factor (or wear and tear on your disk drive):
1. Open a handle to the file
2. Read up to the size of your partition / logical break point (due to the XML)
3. Save the rest of your file to disk (not sure how Python handles this as far as directly overwriting the file or memory usage)
4. Write the partition to disk
5. goto 1
If Python does not give you this level of control, you may need to dive into C.
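A rough Python sketch of that loop, assuming a plain byte boundary is acceptable (the XML break-point logic is left out, and PART_SIZE is a made-up value); note that step 3 temporarily needs enough free space to hold the remainder:

import os

PART_SIZE = 512 * 1024 * 1024  # hypothetical partition size in bytes

def peel_off_front(path, part_no):
    """Copy the first PART_SIZE bytes of path into a partition file,
    then rewrite path so it only contains the remaining bytes."""
    tmp = path + ".rest"
    with open(path, "rb") as src:
        chunk = src.read(PART_SIZE)          # step 2: read one partition's worth
        if not chunk:
            return False                     # nothing left to split
        with open(tmp, "wb") as rest:        # step 3: save the rest of the file to disk
            while True:
                buf = src.read(1024 * 1024)
                if not buf:
                    break
                rest.write(buf)
    with open("part%04d" % part_no, "wb") as part:
        part.write(chunk)                    # step 4: write the partition to disk
    os.replace(tmp, path)                    # the original now holds only the rest
    return True

part_no = 0
while peel_off_front("large_file.xml", part_no):  # step 5: goto 1
    part_no += 1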
You could always parse the XML file and write out, say, every 10000 elements to their own file. Look at the Incremental Parsing section of this link.
http://effbot.org/zone/element-iterparse.htm
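For illustration, a hedged sketch of what that iterparse-based splitting could look like; the 'record' tag name, the chunk file names, and the wrapper root element are all placeholders you'd adapt to your document (namespaced tags will also show up as '{uri}tag' in elem.tag):

import xml.etree.ElementTree as ET

def split_by_elements(path, tag, per_file=10000):
    batch, file_no = [], 0
    # iterparse streams the document instead of loading all 40 GB at once
    for event, elem in ET.iterparse(path, events=("end",)):
        if elem.tag == tag:
            batch.append(ET.tostring(elem))
            elem.clear()  # release the element so memory stays bounded
            if len(batch) == per_file:
                write_batch(batch, file_no)
                batch, file_no = [], file_no + 1
    if batch:
        write_batch(batch, file_no)

def write_batch(batch, file_no):
    # wrap each chunk in a minimal root element so it is well-formed on its own
    with open("chunk%04d.xml" % file_no, "wb") as out:
        out.write(b"<root>\n")
        out.writelines(batch)
        out.write(b"</root>\n")

# split_by_elements("large_file.xml", "record")  # hypothetical element name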
Here is my script...
import string
import os
from ftplib import FTP

# make ftp connection
ftp = FTP('server')
ftp.login('user', 'pwd')
ftp.cwd('/dir')

f1 = open('large_file.xml', 'r')
size = 0
split = False
count = 0

for line in f1:
    if not split:
        file = 'split_' + str(count) + '.xml'
        f2 = open(file, 'w')
        if count > 0:
            f2.write('<?xml version="1.0"?>\n')
            f2.write('<StartTag xmlns="http://www.blah/1.2.0">\n')
        size = 0
        count += 1
        split = True
    if size < 1073741824:
        f2.write(line)
        size += len(line)
    elif str(line) == '</EndTag>\n':
        f2.write(line)
        f2.write('</EndEndTag>\n')
        print('completed file %s' % str(count))
        f2.close()
        f2 = open(file, 'r')
        print("ftp'ing file...")
        ftp.storbinary('STOR ' + file, f2)
        print('ftp done.')
        split = False
        f2.close()
        os.remove(file)
    else:
        f2.write(line)
        size += len(line)
It's time to buy a new hard drive!
You can make a backup before trying all the other answers and not lose your data :)
I'm running SageMath 9.0, on Windows 10 OS
I've read several similar questions (and answers) on this site: mainly this one on reading from the 7th line, and this one on optimizing. But I have some specific issues: I need to understand how to optimally read from a specific (possibly very far away) line, and whether I should read line by line, or if reading by block could be "more optimal" in my case.
I have a 12 GB text file, made of around 1 billion small lines, all made of ASCII printable characters. Each line has a constant number of characters. Here are the actual first 5 lines:
J??????????
J???????C??
J???????E??
J??????_A??
J???????F??
...
For context, this file is a list of all non-isomorphic graphs on 11 vertices, encoded using the graph6 format. The file has been computed and made available by Brendan McKay on his webpage here.
I need to check every graph for some properties. I could use the generator for G in graphs(11), but this can take very long (a few days at least on my laptop). I want to use the complete database in the file, so that I'm able to stop and start again from a certain point.
My current code reads the file line by line from the start, and does some computation after reading each line:
with open(filename,'r') as file:
    while True:
        # Get next line from file
        line = file.readline()
        # if line is empty, end of file is reached
        if not line:
            print("End of Database Reached")
            break
        G = Graph()
        from_graph6(G,line.strip())
        run_some_code(G)
In order to be able to stop the code, or save the progress in case of crash, I was thinking of :
Every million lines read (or so), save the progress in a specific file
When restarting the code, read the last saved value, and instead of using line = file.readline(), I would use the itertools option for line in islice(file, start_line, None).
so that my new code is
from itertools import islice

start_line = load('foo')
count = start_line
save_every_n_lines = 1000000

with open(filename,'r') as file:
    for line in islice(file, start_line, None):
        G = Graph()
        from_graph6(G,line.strip())
        run_some_code(G)
        count += 1
        if (count % save_every_n_lines) == 0:
            save(count,'foo')
The code does work, but I would like to understand if I can optimise it. I'm not a big fan of my if statement in my for loop.
Is itertools.islice() the right option here? The documentation states "If start is non-zero, then elements from the iterable are skipped until start is reached". As "start" could be quite large, and given that I'm working on simple text files, could there be a faster option, in order to "jump" directly to the start line?
Knowing that the text file is fixed, could it be more optimal to split the actual file into 100 or 1000 smaller files and read them one by one? This would get rid of the if statement in my for loop.
I also have the option to read blocks of lines in one go instead of line by line, and then work on a list of graphs. Could that be a good option?
Each line has a constant number of characters. So "jumping" might be feasible.
Assuming each line is the same size, you can use a memory-mapped file and read it by index without mucking about with seek and tell. The memory-mapped file emulates a bytearray and you can take record-sized slices from the array for the data you want. If you want to pause processing, you only have to save the current record index in the array and you can start up again with that index later.
This example is on Linux - opening an mmap on Windows is a bit different - but after it's set up, access should be the same.
import os
import mmap

# I think this is the record plus newline
LINE_SZ = 12
RECORD_SZ = LINE_SZ - 1

# generate test file
testdata = "testdata.txt"
with open(testdata, 'wb') as f:
    for i in range(100):
        f.write("R{: 10}\n".format(i).encode('ascii'))

f = open(testdata, 'rb')
data = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)

# the i-th record is
i = 20
record = data[i*LINE_SZ:i*LINE_SZ+RECORD_SZ]
print("record 20", record)

# you can stick it in a function. this is a bit slower, but encapsulated
def get_record(mmapped_file, index):
    return mmapped_file[index*LINE_SZ:index*LINE_SZ+RECORD_SZ]

print("get record 20", get_record(data, 20))

# to enumerate
def enum_records(mmapped_file, start, stop=None, step=1):
    if stop is None:
        stop = mmapped_file.size() // LINE_SZ
    for pos in range(start*LINE_SZ, stop*LINE_SZ, step*LINE_SZ):
        yield mmapped_file[pos:pos+RECORD_SZ]

print("enum 6 to 8", [record for record in enum_records(data, 6, 9)])

del data
f.close()
If the length of the line is constant (in this case it's 12: 11 characters plus the newline), you might do
def get_line(k, line_len):
    with open('file') as f:
        f.seek(k*line_len)
        return next(f)
I need to update the last line of files a bit over 2 GB in size, made up of lines of text that cannot be read with readlines(). Currently, it works fine by looping through the lines one by one. However, I am wondering whether there is a compiled library that can achieve this more efficiently? Thanks!
Current approach
myfile = open("large.XML")
for line in myfile:
do_something()
If this is really something line based (where a true XML parser isn't necessarily the best solution), mmap can help here.
mmap the file, then call .rfind('\n') on the resulting object (possibly with adjustments to handle the file ending with a newline when you really want the non-empty line before it, not the empty "line" following it). You can then slice out the final line alone. If you need to modify the file in place, you can resize the file to shave off (or add) a number of bytes corresponding to the difference between the line you sliced and the new line, then write back the new line. Avoids reading or writing any more of the file than you need.
Example code (please comment if I made a mistake):
import mmap

# In Python 3.1 and earlier, you'd wrap mmap in contextlib.closing; mmap
# didn't support the context manager protocol natively until 3.2; see example below
with open("large.XML", 'r+b') as myfile, mmap.mmap(myfile.fileno(), 0, access=mmap.ACCESS_WRITE) as mm:
    # len(mm) - 1 handles files ending w/newline by getting the prior line
    # + 1 to avoid catching prior newline (and handle one line file seamlessly)
    startofline = mm.rfind(b'\n', 0, len(mm) - 1) + 1

    # Get the line (with any newline stripped)
    line = mm[startofline:].rstrip(b'\r\n')

    # Do whatever calculates the new line, decoding/encoding to use str
    # in do_something to simplify; this is an XML file, so I'm assuming UTF-8
    new_line = do_something(line.decode('utf-8')).encode('utf-8')

    # Resize to accommodate the new line (or to strip data beyond the new line)
    mm.resize(startofline + len(new_line))  # + 1 if you need to add a trailing newline
    mm[startofline:] = new_line  # Replace contents; add a b"\n" if needed
Apparently on some systems (e.g. OSX) without mremap, mm.resize won't work, so to support those systems, you'd probably split the with (so the mmap closes before the file object), and use file object based seeks, writes and truncates to fix up the file. The following example includes my previously mentioned Python 3.1 and earlier specific adjustment to use contextlib.closing for completeness:
import mmap
from contextlib import closing

with open("large.XML", 'r+b') as myfile:
    with closing(mmap.mmap(myfile.fileno(), 0, access=mmap.ACCESS_WRITE)) as mm:
        startofline = mm.rfind(b'\n', 0, len(mm) - 1) + 1
        line = mm[startofline:].rstrip(b'\r\n')
        new_line = do_something(line.decode('utf-8')).encode('utf-8')
    myfile.seek(startofline)  # Move to where old line began
    myfile.write(new_line)    # Overwrite existing line with new line
    myfile.truncate()         # If existing line longer than new line, get rid of the excess
The advantages to mmap over any other approach are:
No need to read any more of the file beyond the line itself (meaning 1-2 pages of the file, the rest never gets read or written)
Using rfind means you can let Python do the work of finding the newline quickly at the C layer (in CPython); explicit seeks and reads of a file object could match the "only read a page or so", but you'd have to hand-implement the search for the newline
Caveat: This approach will not work (at least, not without modification to avoid mapping more than 2 GB, and to handle resizing when the whole file might not be mapped) if you're on a 32 bit system and the file is too large to map into memory. On most 32 bit systems, even in a newly spawned process, you only have 1-2 GB of contiguous address space available; in certain special cases, you might have as much as 3-3.5 GB of user virtual addresses (though you'll lose some of the contiguous space to the heap, stack, executable mapping, etc.). mmap doesn't require much physical RAM, but it needs contiguous address space; one of the huge benefits of a 64 bit OS is that you stop worrying about virtual address space in all but the most ridiculous cases, so mmap can solve problems in the general case that it couldn't handle without added complexity on a 32 bit OS. Most modern computers are 64 bit at this point, but it's definitely something to keep in mind if you're targeting 32 bit systems (and on Windows, even if the OS is 64 bit, they may have installed a 32 bit version of Python by mistake, so the same problems apply).

Here's yet one more example that works (assuming the last line isn't 100+ MB long) on 32 bit Python (omitting closing and imports for brevity) even for huge files:
with open("large.XML", 'r+b') as myfile:
filesize = myfile.seek(0, 2)
# Get an offset that only grabs the last 100 MB or so of the file aligned properly
offset = max(0, filesize - 100 * 1024 ** 2) & ~(mmap.ALLOCATIONGRANULARITY - 1)
with mmap.mmap(myfile.fileno(), 0, access=mmap.ACCESS_WRITE, offset=offset) as mm:
startofline = mm.rfind(b'\n', 0, len(mm) - 1) + 1
# If line might be > 100 MB long, probably want to check if startofline
# follows a newline here
line = mm[startofline:].rstrip(b'\r\n')
new_line = do_something(line.decode('utf-8')).encode('utf-8')
myfile.seek(startofline + offset) # Move to where old line began, adjusted for offset
myfile.write(new_line) # Overwrite existing line with new line
myfile.truncate() # If existing line longer than new line, get rid of the excess
Update: Use ShadowRanger's answer. It's much shorter and more robust.
For posterity:
Read the last N bytes of the file and search backwards for the newline.
#!/usr/bin/env python

with open("test.txt", "wb") as testfile:
    testfile.write('\n'.join(["one", "two", "three"]) + '\n')

with open("test.txt", "r+b") as myfile:
    # Read the last 1kiB of the file
    # we could make this be dynamic, but chances are there's
    # a number like 1kiB that'll work 100% of the time for you
    myfile.seek(0, 2)
    filesize = myfile.tell()
    blocksize = min(1024, filesize)
    myfile.seek(-blocksize, 2)
    # search backwards for a newline (excluding very last byte
    # in case the file ends with a newline)
    index = myfile.read().rindex('\n', 0, blocksize - 1)
    # seek to the character just after the newline
    myfile.seek(index + 1 - blocksize, 2)
    # read in the last line of the file
    lastline = myfile.read()
    # modify last_line
    lastline = "Brand New Line!\n"
    # seek back to the start of the last line
    myfile.seek(index + 1 - blocksize, 2)
    # write out new version of the last line
    myfile.write(lastline)
    myfile.truncate()
I have some trouble trying to split large files (say, around 10 GB). The basic idea is to simply read the lines and group, say, every 40000 lines into one file.
But there are two ways of "reading" files.
1) The first one is to read the WHOLE file at once, and make it into a LIST. But this requires loading the WHOLE file into memory, which is painful for a file this large. (I think I asked such questions before.)
In Python, the approaches I've tried to read the WHOLE file at once include:
input1=f.readlines()
input1 = commands.getoutput('zcat ' + file).splitlines(True)
input1 = subprocess.Popen(["cat", file],
                          stdout=subprocess.PIPE, bufsize=1)
Well, then I can just easily group 40000 lines into one file by: list[40000:80000] or list[80000:120000]
Or the advantage of using list is that we can easily point to specific lines.
2) The second way is to read line by line: process each line while reading it. Those read lines won't be saved in memory.
Examples include:
f=gzip.open(file)
for line in f: blablabla...
or
for line in fileinput.FileInput(fileName):
I'm sure that with gzip.open, this f is NOT a list but a file object. And it seems we can only process it line by line; so how can I execute this "split" job? How can I point to specific lines of the file object?
Thanks
NUM_OF_LINES = 40000
filename = 'myinput.txt'
with open(filename) as fin:
    fout = open("output0.txt", "wb")
    for i, line in enumerate(fin):
        fout.write(line)
        if (i+1) % NUM_OF_LINES == 0:
            fout.close()
            fout = open("output%d.txt" % (i/NUM_OF_LINES+1), "wb")
    fout.close()
If there's nothing special about having a specific number of file lines in each file, the readlines() function also accepts a size 'hint' parameter that behaves like this:
If given an optional parameter sizehint, it reads that many bytes from
the file and enough more to complete a line, and returns the lines
from that. This is often used to allow efficient reading of a large
file by lines, but without having to load the entire file in memory.
Only complete lines will be returned.
...so you could write that code something like this:
# assume that an average line is about 80 chars long, and that we want about
# 40K in each file.
SIZE_HINT = 80 * 40000
fileNumber = 0
with open("inputFile.txt", "rt") as f:
while True:
buf = f.readlines(SIZE_HINT)
if not buf:
# we've read the entire file in, so we're done.
break
outFile = open("outFile%d.txt" % fileNumber, "wt")
outFile.write(buf)
outFile.close()
fileNumber += 1
The best solution I have found is using the library filesplit.
You only need to specify the input file, the output folder and the desired size in bytes for output files. Finally, the library will do all the work for you.
from fsplit.filesplit import Filesplit

def split_cb(f, s):
    print("file: {0}, size: {1}".format(f, s))

fs = Filesplit()
fs.split(file="/path/to/source/file", split_size=900000, output_dir="/pathto/output/dir", callback=split_cb)
For a 10 GB file, the second approach is clearly the way to go. Here is an outline of what you need to do:
1. Open the input file.
2. Open the first output file.
3. Read one line from the input file and write it to the output file.
4. Maintain a count of how many lines you've written to the current output file; as soon as it reaches 40000, close the output file, and open the next one.
5. Repeat steps 3-4 until you've reached the end of the input file.
6. Close both files.
import fileinput

chunk_size = 40000
fout = None
for (i, line) in enumerate(fileinput.FileInput(filename)):
    if i % chunk_size == 0:
        if fout: fout.close()
        fout = open('output%d.txt' % (i/chunk_size), 'w')
    fout.write(line)
fout.close()
Obviously, as you are doing work on the file, you will need to iterate over the file's contents in some way -- whether you do that manually or you let a part of the Python API do it for you (e.g. the readlines() method) is not important. In big O analysis, this means you will spend O(n) time (n being the size of the file).
But reading the file into memory requires O(n) space also. Although sometimes we do need to read a 10 GB file into memory, your particular problem does not require this. We can iterate over the file object directly. Of course, the file object does require space, but we have no reason to hold the contents of the file twice in two different forms.
Therefore, I would go with your second solution.
I created this small script to split the large file in a few seconds. It took only 20 seconds to split a text file with 20M lines into 10 small files each with 2M lines.
split_length = 2_000_000
file_count = 0
large_file = open('large-file.txt', encoding='utf-8', errors='ignore').readlines()

for index in range(0, len(large_file)):
    if (index > 0) and (index % split_length == 0):
        new_file = open(f'splitted-file-{file_count}.txt', 'a', encoding='utf-8', errors='ignore')
        split_start_value = file_count * split_length
        split_end_value = split_length * (file_count + 1)
        file_content_list = large_file[split_start_value:split_end_value]
        file_content = ''.join(line for line in file_content_list)
        new_file.write(file_content)
        new_file.close()
        file_count += 1
        print(f'created file {file_count}')

# write whatever is left after the last full chunk (the final, possibly partial, file)
if file_count * split_length < len(large_file):
    new_file = open(f'splitted-file-{file_count}.txt', 'a', encoding='utf-8', errors='ignore')
    new_file.write(''.join(large_file[file_count * split_length:]))
    new_file.close()
    file_count += 1
    print(f'created file {file_count}')
To split a file line-wise:
group every, say 40000 lines into one file
You can use module filesplit with method bylinecount (version 4.0):
import os
from filesplit.split import Split

LINES_PER_FILE = 40_000  # see PEP515 for readable numeric literals
filename = 'myinput.txt'
outdir = 'splitted/'  # to store split-files `myinput_1.txt` etc.

os.makedirs(outdir, exist_ok=True)  # make sure the output folder exists
Split(filename, outdir).bylinecount(LINES_PER_FILE)
This is similar to rafaoc's answer which apparently used outdated version 2.0 to split by size.
I need to know how to read lines from a file in Python so that I read the last line first and continue in that fashion until the cursor reaches the beginning of the file. Any ideas?
The general approach to this problem, reading a text file in reverse, line-wise, can be solved by at least three methods.
The general problem is that since each line can have a different length, you can't know beforehand where each line starts in the file, nor how many of them there are. This means you need to apply some logic to the problem.
General approach #1: Read the entire file into memory
With this approach, you simply read the entire file into memory, in some data structure that subsequently allows you to process the list of lines in reverse. A stack, a doubly linked list, or even an array can do this.
Pros: Really easy to implement (probably built into Python for all I know)
Cons: Uses a lot of memory, can take a while to read large files
General approach #2: Read the entire file, store position of lines
With this approach, you also read through the entire file once, but instead of storing the entire file (all the text) in memory, you only store the binary positions inside the file where each line started. You can store these positions in a similar data structure as the one storing the lines in the first approach.
Whenever you want to read line X, you have to re-read the line from the file, starting at the position you stored for the start of that line (a small sketch follows after the pros/cons).
Pros: Almost as easy to implement as the first approach
Cons: can take a while to read large files
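A small sketch of approach #2 (one forward pass to index where each line starts, then seek back to each recorded position in reverse; illustrative only, reading in binary so the offsets are exact byte positions):

def lines_in_reverse(path):
    # Pass 1: record the byte offset of the start of every line.
    offsets = []
    with open(path, "rb") as f:
        pos = 0
        for line in f:
            offsets.append(pos)
            pos += len(line)
    # Pass 2: re-read each line, last one first.
    with open(path, "rb") as f:
        for start in reversed(offsets):
            f.seek(start)
            yield f.readline()

# for raw in lines_in_reverse("whatever.txt"):
#     print(raw.decode("utf-8"), end="")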
General approach #3: Read the file in reverse, and "figure it out"
With this approach you will read the file block-wise or similar, from the end, and see where the ends are. You basically have a buffer, of say, 4096 bytes, and process the last line of that buffer. When your processing, which has to move one line at a time backward in that buffer, comes to the start of the buffer, you need to read another buffer worth of data, from the area before the first buffer you read, and continue processing.
This approach is generally more complicated, because you need to handle such things as lines being broken over two buffers, and long lines could even cover more than two buffers.
It is, however, the one that would require the least amount of memory, and for really large files, it might also be worth doing this to avoid reading through gigabytes of information first.
Pros: Uses little memory, does not require you to read the entire file first
Cons: Much harder to implement and get right for all corner cases
There are numerous links on the net that show how to do the third approach:
ActiveState Recipe 120686 - Read a text file backwards
ActiveState Recipe 439045 - Read a text file backwards (yet another implementation)
Top4Download.com Script - Read a text file backwards
Recipe 120686: Read a text file backwards (Python)
You can also use the Python module file_read_backwards. It reads the file in a memory-efficient manner. It works with Python 2.7 and 3.
It supports "utf-8","latin-1", and "ascii" encoding. It will work with "\r", "\n", and "\r\n" as new lines.
After installing it, via pip install file_read_backwards (v1.2.1), you can read the entire file backwards (line-wise) via:
#!/usr/bin/env python2.7

from file_read_backwards import FileReadBackwards

with FileReadBackwards("/path/to/file", encoding="utf-8") as frb:
    for l in frb:
        print l
Further documentation can be found at http://file-read-backwards.readthedocs.io/en/latest/readme.html
A straightforward way is to first create a temporary reversed file, then reverse each line in this file.
import os, tempfile

def reverse_file(in_filename, fout, blocksize=1024):
    filesize = os.path.getsize(in_filename)
    fin = open(in_filename, 'rb')
    for i in range(filesize // blocksize, -1, -1):
        fin.seek(i * blocksize)
        data = fin.read(blocksize)
        fout.write(data[::-1])

def enumerate_reverse_lines(in_filename, blocksize=1024):
    fout = tempfile.TemporaryFile()
    reverse_file(in_filename, fout, blocksize=blocksize)
    fout.seek(0)
    for line in fout:
        yield line[::-1]
The above code will yield lines with newlines at the beginning instead of the end, and there is no attempt to handle DOS/Windows-style newlines (\r\n).
This solution is simpler than any others I've seen.
def xreadlines_reverse(f, blksz=524288):
    "Act as a generator to return the lines in file f in reverse order."
    buf = ""
    f.seek(0, 2)
    pos = f.tell()
    lastn = 0
    if pos == 0:
        pos = -1
    while pos != -1:
        nlpos = buf.rfind("\n", 0, -1)
        if nlpos != -1:
            line = buf[nlpos + 1:]
            if line[-1] != "\n":
                line += "\n"
            buf = buf[:nlpos + 1]
            yield line
        elif pos == 0:
            pos = -1
            yield buf
        else:
            n = min(blksz, pos)
            f.seek(-(n + lastn), 1)
            rdbuf = f.read(n)
            lastn = len(rdbuf)
            buf = rdbuf + buf
            pos -= n
Example usage:
for line in xreadlines_reverse(open("whatever.txt")):
    do_stuff(line)
I am working with a very large text file (tsv), around 200 million entries. One of the columns is a date and the records are sorted on date. Now I want to start reading the records from a given date. Currently I was just reading from the start, which is very slow since I need to read almost 100-150 million records just to reach that record. I was thinking that if I could use binary search to speed it up, I could get away with at most about 28 extra record reads (log2(200 million)). Does Python allow reading the nth line without caching or reading the lines before it?
If the file does not have fixed-length lines, you are out of luck; some function will have to read the file. If the lines are fixed length, you can open the file, call file.seek(line*linesize), and then read the file from there.
If the file to read is big, and you don't want to read the whole file in memory at once:
fp = open("file")
for i, line in enumerate(fp):
if i == 25:
# 26th line
elif i == 29:
# 30th line
elif i > 29:
break
fp.close()
Note that i == n-1 for the nth line.
You can use the method fileObject.seek(offset[, whence])
# offset -- This is the position of the read/write pointer within the file.
# whence -- This is optional and defaults to 0, which means absolute file positioning;
#           other values are 1, which means seek relative to the current position,
#           and 2, which means seek relative to the file's end.

file = open("test.txt", "r")
line_size = 8  # 6 characters plus the line ending (2 bytes for '\r\n' here; use 7 if lines end with '\n')
line_number = 5

file.seek(line_number * line_size, 0)
for i in range(5):
    print(file.readline())
file.close()
This code uses the following file:
100101
101102
102103
103104
104105
105106
106107
107108
108109
109110
110111
Python has no way to skip "lines" in a file. The best way that I know of is to employ a generator to yield lines based on a certain condition, i.e. date > 'YYYY-MM-DD'. At least this way you reduce memory usage and time spent on I/O.
example:
# using python 3.4 syntax (parameter type annotation)
from datetime import datetime

def yield_right_dates(filepath: str, mydate: datetime):
    with open(filepath, 'r') as myfile:
        for line in myfile:
            # assume:
            # the file is tab separated (because .tsv is the extension)
            # the date column has column-index == 0
            # the date format is '%Y-%m-%d'
            line_splt = line.split('\t')
            if datetime.strptime(line_splt[0], '%Y-%m-%d') > mydate:
                yield line_splt

my_file_gen = yield_right_dates(filepath='/path/to/my/file', mydate=datetime(2015, 1, 1))
# then you can do whatever processing you need on the stream, or put it in one giant list.
desired_lines = [line for line in my_file_gen]
But this is still limiting you to one processor :(
Assuming you're on a unix-like system and bash is your shell, I would split the file using the shell utility split, then use multiprocessing and the generator defined above.
I don't have a large file to test with right now, but I'll update this answer later with a benchmark on iterating it whole, vs. splitting and then iterating it with the generator and multiprocessing module.
With greater knowledge on the file (e.g. if all the desired dates are clustered at the beginning | center | end), you might be able to optimize the read further.
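A sketch of that split-then-parallelize idea, assuming the chunk files produced by split sit in one directory and that yield_right_dates from the snippet above is defined at module level so the worker processes can use it (the paths and pool size are placeholders):

import glob
from datetime import datetime
from multiprocessing import Pool

TARGET_DATE = datetime(2015, 1, 1)

def process_chunk(chunk_path):
    # yield_right_dates is the generator defined in the answer above
    return [row for row in yield_right_dates(filepath=chunk_path, mydate=TARGET_DATE)]

if __name__ == '__main__':
    chunk_files = sorted(glob.glob('/path/to/chunks/x*'))  # split names pieces xaa, xab, ...
    with Pool(processes=4) as pool:
        results = pool.map(process_chunk, chunk_files)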
As others have commented, Python doesn't support this, as it doesn't know where lines start and end (unless they're fixed length). If you're doing this repeatedly, I'd recommend either padding out the lines to a constant length (if practical) or, failing that, reading them into some kind of basic database. You'll take a bit of a hit to memory size, but unless you're only indexing once in a blue moon it'll probably be worth it.
If space is a big concern and padding isn't possible, you could also add a (line-number) tag at the start of each line. While you would have to guess the size of the jumps and then parse a sample line to check them, that would allow you to build a searching algorithm that finds the right line quickly for a cost of only around 10 characters per line; a sketch of that kind of seek-and-check search follows.
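Since the file in the question is already sorted by date, the date column itself can serve as the search key, so the seek-and-check search can be sketched without any padding or tags; everything about the format (tab-separated, date in column 0, '%Y-%m-%d') is an assumption carried over from the earlier answer:

import os
from datetime import datetime

def line_date(raw_line, fmt='%Y-%m-%d'):
    # assumes tab-separated lines with the date in column 0
    return datetime.strptime(raw_line.split(b'\t')[0].decode(), fmt)

def offset_of_first_line_on_or_after(path, target):
    """Binary-search a date-sorted file by byte offset: seek to the middle,
    skip to the next full line, compare its date, and repeat."""
    size = os.path.getsize(path)
    with open(path, 'rb') as f:
        first = f.readline()
        if not first or line_date(first) >= target:
            return 0  # the very first line already qualifies
        lo, hi = 0, size
        while lo < hi:
            mid = (lo + hi) // 2
            f.seek(mid)
            f.readline()            # discard the (possibly partial) line we landed in
            line = f.readline()     # first complete line after offset mid
            if not line or line_date(line) >= target:
                hi = mid
            else:
                lo = mid + 1
        f.seek(lo)
        f.readline()                # move from offset lo to the start of the next line
        return f.tell()

# offset = offset_of_first_line_on_or_after('/path/to/my/file', datetime(2015, 1, 1))
# with open('/path/to/my/file', 'rb') as f:
#     f.seek(offset)
#     for line in f:
#         ...   # every line from here on is on or after the target date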