I'm writing a program to search for a specific line in a very large (unordered) file, so I'd prefer not to load the entire file into memory.
I'm implementing multithreading to speed up the process, and I'm trying to give each thread a particular part of the file, i.e. the first thread scans the first quarter of the file, the second thread (simultaneously) scans from the point where the first thread's quarter ends, and so on.
To do this I need to find the byte locations of different parts of the file; for simplicity, let's say I just want to find the middle of the file. The problem is that each line has a different length, so if I just do
fo.seek(0, 2)
end = fo.tell()
mid = end/2
fo.seek(mid, 0)
It could drop me in the middle of a line. So I need a way to seek to the next or previous newline. Also note that I don't need the exact middle, just somewhere around it (since it's a very large file).
Here's what I was able to code. I'm not sure whether this loads the file into memory or not, and I would really like to avoid opening two handles to the same file (I did so in my program because I didn't want to worry about the offset changing when I read the file).
Any modification (or a new program) which is faster would be appreciated.
fo = open(filename, "r+")   # open for reading and writing
f2 = open(filename, "r+")
file_ = dict()
fo.seek(0, 2)               # jump to the end of the file
file_['end'] = fo.tell()    # total size in bytes
file_['mid'] = file_['end'] / 2
fo.seek(file_['mid'], 0)
f2.seek(file_['mid'], 0)
line = f2.readline()        # read (and discard) the partial line to land on a newline boundary
fo.seek(f2.tell(), 0)
file_['mid'] = f2.tell()    # adjusted midpoint, now at the start of a line
fo.seek(file_['mid'], 0)
print fo.readline()
How large is very large? grep tears through even 1-10 GB files relatively quickly.
If the file is static and you plan to search through it repeatedly, you could split it:
split -l <line_count> <file>
Now you have multiple files, and can pass each to a separate thread/process/whatever.
Is the file sorted? That changes things again, since now you can just binary search with fo.seek() calls.
How fast is fast enough? Beyond a certain point, you're going to have to build a search index. Up to that point, simple tools like grep, split, etc. work wonders.
Without more information, it's impossible to say what the right tradeoffs are here.
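If you do go the manual route from your question (one file handle per thread, each scanning its own byte range), here is a minimal sketch of computing newline-aligned chunk boundaries. It is only an illustration; the function name and chunk count are my own choices:

def chunk_boundaries(path, num_chunks):
    """Return (start, end) byte offsets for num_chunks roughly equal
    pieces of the file, each starting at the beginning of a line."""
    with open(path, 'rb') as f:
        f.seek(0, 2)                        # jump to the end to get the size
        size = f.tell()
        starts = [0]
        for i in range(1, num_chunks):
            f.seek(i * size // num_chunks)  # rough cut point
            f.readline()                    # skip the partial line we landed in
            starts.append(f.tell())         # the next full line starts here
        ends = starts[1:] + [size]
    return list(zip(starts, ends))

Each worker would then open its own handle, seek to its start offset, and stop once tell() passes its end offset.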
Related
This is an issue of trying to reach the line to start from and proceed from there in the shortest time possible.
I have a huge text file that I'm reading and performing operations on line after line. I am currently keeping track of the line number that I have parsed, so that in case of a system crash I know how far I got.
How do I restart reading the file from that point without starting over from the beginning?
count = 0
all_parsed = os.listdir("urltextdir/")
with open(filename, "r") as readfile:
    for eachurl in readfile:
        if str(count) + ".txt" not in all_parsed:
            urltext = getURLText(eachurl)
            with open("urltextdir/" + str(count) + ".txt", "w") as writefile:
                writefile.write(urltext)
            result = processUrlText(urltext)
            saveinDB(result)
        count += 1
This is what I'm currently doing, but when it crashes after a million lines, I have to loop through all those lines in the file to reach the point I want to start from. My other alternative is to use readlines() and load the entire file into memory.
Is there an alternative that I can consider?
Unfortunately, line number isn't really a basic position for file objects, and the special seeking/telling functions are ruined by next(), which is called in your loop. You can't jump to a line, but you can jump to a byte position. So one way would be:
line = readfile.readline()     # Must use `readline`, not iteration, or tell() is unreliable
while line:
    lastell = readfile.tell()  # Location of the cursor in the file after reading the line
    print(lastell)
    print(line)                # Do with line what you would normally do
    line = readfile.readline() # Advance the loop; at EOF readline() returns '' and the loop ends
Now you can easily jump back with
readfile.seek(lastell)  # You need to have kept the last lastell
You would need to keep saving lastell to a file or printing it so on restart you know which byte you're starting at.
Unfortunately you can't use the written file for this, as any modification to the character amount will ruin a count based on this.
Here is one full implementation. Create a file called tell and put 0 inside of it, and then you can run:
with open('tell', 'r+') as tfd:
    with open('abcdefg') as fd:
        fd.seek(int(tfd.readline()))        # Get last position
        line = fd.readline()                # Init loop
        while line:
            print(line.strip(), fd.tell())  # Action on line
            tfd.seek(0)                     # Clear and
            tfd.write(str(fd.tell()))       # write the new position only if successful
            line = fd.readline()            # Advance loop
        print(line)                         # At EOF readline() returns '' and the loop ends
You can check if such a file exists and create it in the program of course.
As #Edwin pointed out in the comments, you may want to call fd.flush() and os.fsync(fd.fileno()) (import os if that isn't clear) to make sure that after every write your file contents are actually on disk. This applies to both write operations you are doing, the tell file being the quicker of the two, of course. It may slow things down considerably, so if you are satisfied with the synchronicity as is, don't use it, or only flush the tfd. You can also specify the buffer size when calling open so Python flushes more often automatically, as detailed in https://stackoverflow.com/a/3168436/6881240.
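For illustration, a minimal sketch of that flush-and-fsync step wrapped in a hypothetical helper (the function and file names are my own, not part of the answer above):

import os

def save_position(tell_path, byte_offset):
    """Persist the current byte offset so a restart can resume from it."""
    with open(tell_path, 'w') as tfd:
        tfd.write(str(byte_offset))
        tfd.flush()               # push Python's internal buffer to the OS
        os.fsync(tfd.fileno())    # ask the OS to commit it to the disk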
If I got it right, you could make a simple log file to store the count in, but I would still recommend using many files, or storing every line or paragraph in a database like SQL or MongoDB.
I guess it depends on what system your script is running on, and what resources (such as memory) you have available.
But with the popular saying "memory is cheap", you can simply read the file into memory.
As a test, I created a file with 2 million lines, each line 1024 characters long with the following code:
ms = 'a' * 1024
with open('c:\\test\\2G.txt', 'w') as out:
    for _ in range(0, 2000000):
        out.write(ms + '\n')
This resulted in a 2 GB file on disk.
I then read the file into a list in memory, like so:
my_file_as_list = [a for a in open('c:\\test\\2G.txt', 'r').readlines()]
I checked the Python process, and it used a little over 2 GB in memory (on a 32 GB system).
Access to the data was very fast, and can be done by list slicing methods.
You need to keep track of the index of the list, so that when your system crashes, you can start from that index again.
But more importantly... if your system is "crashing", then you need to find out why it is crashing... surely a couple of million lines of data is not a reason to crash these days...
I have a folder full of files that need to be modified in order to extract the true file in its real format.
I need to remove a certain number of bytes from BOTH the beginning and end of the file in order to extract the data I am looking for.
How can I do this in python?
I need this to work recursively on an entire folder.
I also need this to output (or modify the existing) file with the bytes removed.
I would greatly appreciate any help or guidance you can provide.
Recursive iteration over files: os.walk
Change position in file: f.seek
Get file size: os.stat
Remove data from current position to end of file: f.truncate
So, the base logic (a rough sketch follows after this list):
Iterate over the files.
Get the file size.
Open the file ('rb+', I suppose).
Seek to the position from which you want to read the file.
Read up to the bytes you want to drop from the end: f.read(file_size - top_dropped - bottom_dropped).
Seek back to position 0.
Write the text you read back into the file.
Truncate the file.
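Putting that together, a minimal sketch might look like the following. The function and argument names are my own, the byte counts in the usage line are placeholders, and it reads the kept portion into memory in one go, so it only suits files that fit in RAM:

import os

def strip_bytes(path, top_dropped, bottom_dropped):
    """Remove top_dropped bytes from the start and bottom_dropped bytes
    from the end of the file, modifying it in place."""
    file_size = os.stat(path).st_size
    with open(path, 'rb+') as f:
        f.seek(top_dropped)                                      # skip the leading bytes
        data = f.read(file_size - top_dropped - bottom_dropped)  # keep only the middle
        f.seek(0)
        f.write(data)                                            # shift the payload to the front
        f.truncate()                                             # drop the leftover tail

# e.g. strip_bytes("some/file.bin", 10, 4) drops 10 leading and 4 trailing bytes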
Your question is pretty badly constructed, but as this is somewhat advanced stuff I'll provide you with the code.
You can then use os.walk() to recursively traverse the directory you want and apply my slicefile() function (an example follows after the code).
This code does the following:
After checking the validity of the start and end arguments, it creates a memory map on top of the opened file.
mmap() creates a memory map object that maps, in this case, the portion of the file system over which the file is written. The object exposes both a string-like and a file-like interface, with some additional methods like move(). So you can treat the memory map either as a string or as a file, or use size(), move(), resize() or whatever additional methods you need.
We calculate the distance between our start and end, i.e. how many bytes we will keep in the end.
We move a stream of bytes end-start long from our start position to position 0, i.e. we shift them backwards by the number of bytes indicated by the starting point.
We discard the rest of the file, i.e. we resize it to end-start bytes. What is left is our new content.
The bigger the file, the longer the operation takes; unfortunately there is not much you can do about it. If a file is big, this is your best bet. The procedure is the same as when removing items from the start/middle of an in-memory array, except that here it has to be buffered (in chunks) so as not to fill up RAM.
If your file is smaller than a third of your free RAM, you can load it whole into a string with f.read(), perform string slicing on the loaded content (s = s[start:end]), and then write it back by opening the file again and just doing f.write(s).
If you have enough disk space, you can open another file, seek to the starting point you want in the original file, then read it in chunks and write them into the new file, perhaps even using shutil.copyfileobj(). After that, you remove the original file and use os.rename() to put the new one in its place. These are your only 3 options:
whole file into RAM; move by buffering backwards and then resizing; and copying into another file, then renaming it. The second option is the most universal and won't fail you on small or big files, so that is the one I used.
OK, not only 3 options. There is a fourth: it might be possible to cut off N bytes from the beginning of a file by manipulating the file system itself with low-level operations, i.e. to write a kind of truncate() that truncates the beginning instead of the end. But that would be pretty suicidal; in the end fragmentation would occur and a whole mess would arise. You don't need such speed anyway. You can be patient until your script finishes. :D
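For completeness, a rough sketch of the third option (copy a byte range into a new file in chunks, then swap it into place); the function name, buffer size and temp-file suffix are my own choices:

import os

def slice_to_new_file(src, start, end, bufsize=1024 * 1024):
    """Copy bytes [start, end) of src into a temporary file in chunks,
    then replace src with it."""
    tmp = src + '.tmp'
    with open(src, 'rb') as fin, open(tmp, 'wb') as fout:
        fin.seek(start)
        remaining = end - start
        while remaining > 0:
            chunk = fin.read(min(bufsize, remaining))
            if not chunk:
                break
            fout.write(chunk)
            remaining -= len(chunk)
    os.remove(src)        # os.rename() will not overwrite an existing name on Windows
    os.rename(tmp, src)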
Why did I use mmap()?
Because it uses memory maps implemented in the OS rather than completely new code. This reduces the number of system calls needed to deal with the opened file; half of the work is thrust upon the operating system, leaving Python to breathe easily.
Because it is mostly written in C, which makes it a touch faster than a pure Python implementation would be.
Because it implements move(), which we need. The buffering and everything is already written, so there is no need for the bulky while loop that would be the alternative (manual) solution.
And so on...
from mmap import mmap

def slicefile(path, start=0, end=None):
    f = open(path, "r+b")  # Read and write binary
    f.seek(0, 2)
    size = f.tell()
    start = 0 if start is None else start
    end = size if end is None else end
    start = size + start if start < 0 else start
    end = size + end if end < 0 else end
    end = size if end > size else end
    if (end == size and start == 0) or (end <= start):
        f.close()
        return
    # If start is 0, no need to move anything, just cut off the rest after end
    if start == 0:
        f.seek(end)
        f.truncate()
        f.close()
        return
    # Modify in place using mapped memory:
    newsize = end - start
    m = mmap(f.fileno(), 0)
    m.move(0, start, newsize)
    m.flush()
    m.resize(newsize)
    m.close()
    f.close()
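And a possible way to apply it to a whole folder tree with os.walk(), as mentioned above; the path and byte counts here are placeholders for whatever your files actually need:

import os

for root, dirs, files in os.walk("path/to/folder"):
    for name in files:
        # drop the first 128 bytes and the last 64 bytes of every file
        slicefile(os.path.join(root, name), start=128, end=-64)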
I have a text file test.txt, with the following contents:
Thing 1. string
And I'm creating a python file that will increment the number every time it gets run without affecting the rest of the string, like so.
Run once:
Thing 2. string
Run twice:
Thing 3. string
Run three times:
Thing 4. string
Run four times:
Thing 5. string
This is the code that I'm using to accomplish this.
file = open("test.txt","r+")
started = False
beginning = 0 #start of the digits
done = False
num = 0
#building the number from digits
while not done:
next = file.read(1)
if ord(next) in range(48, 58): #ascii values of 0-9
started = True
num *= 10
num += int(next)
elif started: #has reached the end of the number
done = True
else: #has not reached the beginning of the number
beginning += 1
num += 1
file.seek(beginning,0)
file.write(str(num))
This code works, so long as the number is not 10^n - 1 (9, 99, 999, etc.), because in those cases it writes more bytes than were previously in the number. As such, it will overwrite the characters that follow.
So this brings me to the point. I need a way to write to the file that overwrites previously existing bytes, which I have, and a way to write to the file that does not overwrite previously existing bytes, which I don't have. Does such a mechanism exist in Python, and if so, what is it?
I have already tried opening the file using the line file = open("test.txt","a+") instead. When I do that, it always writes to the end, regardless of the seek point.
file = open("test.txt","w+") will not work because I need to keep the contents of the file while altering it, and files opened in any variant of w mode are wiped clean.
I have also thought of solving my problem using a function like this:
#file is assumed to be in r+ mode
def write(string, file, index=-1):
    if index != -1:
        file.seek(index, 0)
    remainder = file.read()
    file.seek(index)
    file.write(string + remainder)
But I also want to be able to expand the solution to larger files, and reading the rest of the file by itself changes what I'm trying to accomplish from O(1) to O(n). It also seems very non-Pythonic, since it accomplishes the task in a less-than-straightforward way.
It would also make my I/O operations inconsistent: I would have class methods (file.read() and file.write()) to read from the file and write to it replacing old characters, but an external function to insert without replacing.
If I make the code inline, rather than a function, it means I have to write several of the same lines of code every time I try to write without replacing, which is also non-Pythonic.
To reiterate my question, is there a more straightforward way to do this, or am I stuck with the function?
Unfortunately, what you want to do is not possible. This is a limitation at a lower level than Python, in the operating system. Neither the Unix nor the Windows file access API offers any way to insert new bytes in the middle of a file without overwriting the bytes that were already there.
Reading the rest of the file and rewriting it is the usual workaround. Actually, the usual workaround is to rewrite the entire file under a new name and then use rename to move it back to the old name. On Unix, this accomplishes an atomic file update - unless the computer crashes, concurrent readers will see either the new file or the old file, not some hybrid. (Windows, sadly, still does not allow you to rename over a name that already exists, so if you use this strategy you have to delete the old file first, opening an unavoidable race window where the file might appear not to exist at all.)
Yes, this is O(N), and yes, if you use the write-new-file-and-rename strategy it temporarily consumes scratch disk space equal to the size of the file (old or new, whichever is larger). That's just how it is.
I haven't thought about it enough to give you even a sketch of the code, but it should be possible to use context managers to wrap up the write-new-file-and-rename approach tidily.
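As a rough sketch of that idea (my own, not a standard recipe): a context manager that writes to a temporary file and renames it over the original only if the block succeeds:

import os
import tempfile
from contextlib import contextmanager

@contextmanager
def atomic_rewrite(path):
    """Yield a temporary file; on success, replace `path` with it."""
    fd, tmppath = tempfile.mkstemp(dir=os.path.dirname(os.path.abspath(path)))
    tmp = os.fdopen(fd, 'w')
    try:
        yield tmp
        tmp.close()
        os.replace(tmppath, path)  # os.replace overwrites the target, even on Windows
    except Exception:
        tmp.close()
        os.remove(tmppath)
        raise

# Usage sketch: rewrite the file with text inserted at a character offset.
# with open('test.txt') as src:
#     data = src.read()
# with atomic_rewrite('test.txt') as dst:
#     dst.write(data[:offset] + 'inserted text' + data[offset:])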
No, the disk doesn't work like you think it does.
You have to remember that your file is stored on disk as one contiguous
chunk of data*
Your disk happens to be wound up in a great big spool, a bit like a record,
but if you were to unwind your file, you'd get something that looks like
this:
+------------------------------------------------------------+
| Thing 1. String                                             |
+------------------------------------------------------------+
^ ^                                                        ^ ^
| |                                                        | |
| Start of file                                  End of file |
|                                                            |
Start of disk                                    End of disk
As you've discovered, there's no way to simply insert data in the middle.
Generally speaking, that wouldn't be possible at all, without physically
altering your disk. And who wants to do that? Especially when just flipping
the magnetic bits on your disk is so much easier and faster. In order to
do what you want to do, you have to read the bytes that you want to
overwrite, then start writing down your new ones. It might look something
like this:
Open the file
Seek to the point of insert
Read the current byte
Seek backward one byte
Write down the first byte of the new string
Read the next byte
Seek backward one byte
Write down the next byte of the new string
Repeat until all the bytes have been written to disk
Close the file
Of course, this might be a little bit on the slow side, due to all the
seeking back & forth in the file. It might be faster to read each line,
and then seek back to the previous location in the file. It should be
relatively straightforward to implement something like this in Python,
but as you've discovered, there are system limitations that Python can't
really overcome.
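For what it's worth, here is a rough sketch of that read-then-overwrite idea, done in chunks from the end of the file backwards so every region is read before it is overwritten. The function name, buffer size and bytes-based interface are my own choices:

def insert_bytes(path, offset, data, bufsize=64 * 1024):
    """Insert `data` (a bytes object) at `offset` by shifting the file's tail
    forward in chunks, working from the end so nothing is overwritten
    before it has been read."""
    with open(path, "r+b") as f:
        f.seek(0, 2)
        size = f.tell()
        pos = size
        while pos > offset:
            chunk_start = max(offset, pos - bufsize)
            f.seek(chunk_start)
            chunk = f.read(pos - chunk_start)       # read the region first
            f.seek(chunk_start + len(data))
            f.write(chunk)                          # then write it len(data) bytes later
            pos = chunk_start
        f.seek(offset)
        f.write(data)                               # finally fill the gap

# Usage sketch (hypothetical): insert_bytes('test.txt', 6, b'1') shifts everything
# from offset 6 onward one byte to the right and writes b'1' into the gap.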
*Unless the files are fragmented, but we're living in an ideal
world where gravity adheres to 9.8 m/s² and the Earth is a perfect
sphere.
As a disclaimer, I'm hardly a computer scientist, but I've been reading everything I can on the subject of efficient file i/o to try and tackle this facet of a project I'm working on.
I have a very large (10 - 100 GB) log file of comma-separated values that I need to parse through. The first value labels it as "A" or "B"; for every "A" line, I need to examine the line before it and the line after it, and if either line before or after it meets a criterion, I want to store it in memory or write it to a file. The lines are not uniform in size.
That is my specific problem: I can't seem to locate an efficient way to do this in a non-binary file. With a binary file, I'd simply iterate over the file once and rewind to and fro with a logical check. I've investigated memory mapping, but it seems structured for binary files; my current code is Pythonic and takes weeks to run [see disclaimer].
My other question would be-- how easily could parallelism be invoked to help here? I have a notion of how -- map the file out three lines at a time and send each chunk to each node [lines 1,2,3 go to one node; lines 3,4,5 go to another ...], but I have no idea how to go about implementing this.
Any help would be very much appreciated.
Just read the lines in a loop. Keep track of the previous line in memory and examine it when needed.
Pseudocode:
for each line:
    previousLine := currentLine
    read currentLine from file
    do processing...
This is efficient assuming you're already reading every line into memory anyway, and if you use a proper buffering scheme for reading the file (read large chunks at a time into memory).
I don't think parallelism will help in this situation. If properly written, the bottleneck of the program should be disk I/O, and multiple threads/processes can't read from disk any faster than a single thread. Parallelism only improves CPU-bound problems.
For what it's worth you can "seek" in ASCII files the same way you can with binary files. You would just keep track of the file offset each time you begin to read a line, and store that offset so you know where to seek back to later. Depending on how this is implemented this will never perform better than the above, though, and sometimes worse (you would want the file data to be buffered in memory so that the "seeking" is a memory operation and not a disk operation; you definitely want to read the file contents sequentially to maximize cache-ahead benefits).
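A minimal sketch of that offset-tracking idea (the file name is a placeholder; binary mode keeps tell() and seek() exact):

offsets = []
with open('your-file', 'rb') as f:
    while True:
        pos = f.tell()          # byte offset where the next line begins
        line = f.readline()
        if not line:
            break
        offsets.append(pos)
        # ... process line ...

# Later, jump straight back to line i without rescanning from the start:
# with open('your-file', 'rb') as f:
#     f.seek(offsets[i])
#     line_i = f.readline()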
Here's a first pass. Assumes properly formatted lines of text.
from itertools import chain

with open('your-file') as f:
    prev_line = None
    cur_line = f.readline()
    for next_line in chain(f, [None]):
        pieces = cur_line.split(',')
        if pieces[0] == 'A':
            check_against_criterion_if_not_none(prev_line)
            check_against_criterion_if_not_none(next_line)
        prev_line, cur_line = cur_line, next_line
A nifty trick is tacking that extra None onto the end of the file with itertools.chain, so that the code properly checks the last line of the file against the second-to-last line.
I'm trying to find out the best way to read/process lines of a super large file.
Here I just try
for line in f:
Part of my script is as below:
import gzip

o = gzip.open(file2, 'w')
LIST = []
f = gzip.open(file1, 'r')
for i, line in enumerate(f):
    if i % 4 != 3:
        LIST.append(line)
    else:
        LIST.append(line)
        b1 = [ord(x) for x in line]
        ave1 = (sum(b1) - 10) / float(len(line) - 1)
        if (ave1 < 84):
            del LIST[-4:]
output1 = o.writelines(LIST)
My file1 is around 10 GB, and when I run the script, the memory usage just keeps increasing to something like 15 GB, without any output. That means the computer is still trying to read the whole file into memory first, right? This really makes it no different from using readlines().
However in the post:
Different ways to read large data in python
Srika told me:
The for line in f treats the file object f as an iterable, which automatically uses buffered IO and memory management so you don't have to worry about large files.
But obviously I still need to worry about large files... I'm really confused.
thx
edit:
Every 4 lines form a kind of group in my data.
The purpose is to do some calculations on every 4th line and, based on that calculation, decide whether we need to keep those 4 lines. So writing lines is my purpose.
The reason the memory keeps increasing even though you use enumerate is that you are using LIST.append(line). That basically accumulates all the lines of the file in a list, so it is all sitting in memory. You need to find a way to not accumulate lines like this: read, process & move on to the next.
Another way you could do it is to read your file in chunks (in fact, reading 1 line at a time qualifies, with 1 chunk == 1 line), i.e. read a small part of the file, process it, then read the next chunk, and so on. I still maintain that this is the best way to read files in Python, large or small.
with open(...) as f:
    for line in f:
        <do something with line>
The with statement handles opening and closing the file, including if an exception is raised in the inner block. The for line in f treats the file object f as an iterable, which automatically uses buffered IO and memory management so you don't have to worry about large files.
It looks like at the end of this function, you're taking all of the lines you've read into memory, and then immediately writing them to a file. Maybe you can try this process:
Read the lines you need into memory (the first 3 lines).
On the 4th line, append the line & perform your calculation.
If your calculation is what you're looking for, flush the values in your collection to the file.
Regardless of what follows, create a new collection instance.
I haven't tried this out, but it could maybe look something like this:
import gzip

o = gzip.open(file2, 'w')
f = gzip.open(file1, 'r')
LIST = []

for i, line in enumerate(f):
    if i % 4 != 3:
        LIST.append(line)
    else:
        LIST.append(line)
        b1 = [ord(x) for x in line]
        ave1 = (sum(b1) - 10) / float(len(line) - 1)

        # If we've found what we want, save them to the file
        if (ave1 >= 84):
            o.writelines(LIST)

        # Release the values in the list by starting a clean list to work with
        LIST = []
EDIT: As a thought though, since your file is so large, this may not be the best technique because of all the lines you would have to write to file, but it may be worth investigating regardless.
Since you add all the lines to the list LIST and only sometimes remove some of them, LIST will become longer and longer. All those lines that you store in LIST will take up memory. Don't keep all the lines around in a list if you don't want them to take up memory.
Also your script doesn't seem to produce any output anywhere, so the point of it all isn't very clear.
Ok, you know what your problem is already from the other comments/answers, but let me simply state it.
You are only reading a single line at a time into memory, but you are storing a significant portion of these in memory by appending to a list.
In order to avoid this you need to store something in the filesystem or a database (on the disk) for later look up if your algorithm is complicated enough.
From what I see, it seems you can easily write the output incrementally, i.e. you are currently using a list to store valid lines to write to output as well as temporary lines you may delete at some point. To be efficient with memory, you want to write the lines from your temporary list as soon as you know they are valid output.
In summary, use your list to store only temporary data you need to do your calculations based off of, and once you have some valid data ready for output you can simply write it to disk and delete it from your main memory (in python this would mean you should no longer have any references to it.)
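A minimal sketch of that incremental idea, assuming the same 4-lines-per-group layout, the same file1/file2 names and the same ave1 >= 84 keep-criterion as the question (text mode is used here so ord() works on characters):

import gzip
from itertools import islice

with gzip.open(file1, 'rt') as f, gzip.open(file2, 'wt') as o:
    while True:
        group = list(islice(f, 4))   # read at most 4 lines, never the whole file
        if len(group) < 4:
            break                    # ignore a trailing partial group
        last = group[3]
        ave1 = (sum(ord(c) for c in last) - 10) / float(len(last) - 1)
        if ave1 >= 84:
            o.writelines(group)      # write the group immediately, keep nothing in memory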
If you do not use the with statement, you must close the file handles:
o.close()
f.close()