I have a giant file (1.2GB) of feature vectors saved as a csv file.
In order to go through the lines, I've created a python class that loads batches of rows from the giant file, to the memory, one batch at a time.
In order for this class to know where exactly to read in the file to get a batch of batch_size complete rows (lets say batch_size=10,000), in the first time using a giant file, this class goes through the entire file once, and registers the offset of each line, and saves these offsets to a helping file, so that later it could "file.seek(starting_offset); batch = file.read(num_bytes)" to read the next batch of lines.
First, I implemented the registration of line offsets in this manner:
offset = 0;
line_offsets = [];
for line in self.fid:
line_offsets.append(offset);
offset += len(line);
And it worked lovely with giant_file1.
But then I processed these features and created giant_file2 (with normalized features), with the assistance of this class I made.
And next, when I wanted to read batches of lines form giant_file2, it failed, because the batch strings it would read were not in the right place (for instance, reading something like "-00\n15.467e-04,..." instead of "15.467e-04,...\n").
So I tried changing the line offset calculation part to:
offset = 0;
line_offsets = [];
while True:
line = self.fid.readline();
if (len(line) <= 0):
break;
line_offsets.append(offset);
offset = self.fid.tell();
The main change is that the offset I register is taken from the result of fid.tell() instead of cumulative lengths of lines.
This version worked well with giant_file2, but failed with giant_file1.
The further I investigated it, I came to the feeling that functions seek(), tell() and read() are inconsistent with each other.
For instance:
fid = file('giant_file1.csv');
fid.readline();
>>>'0.089,169.039,10.375,-30.838,59.171,-50.867,13.968,1.599,-26.718,0.507,-8.967,-8.736,\n'
fid.tell();
>>>67L
fid.readline();
>>>'15.375,91.43,15.754,-147.691,54.234,54.478,-0.435,32.364,4.64,29.479,4.835,-16.697,\n'
fid.seek(67);
fid.tell();
>>>67L
fid.readline();
>>>'507,-8.967,-8.736,\n'
There is some contradiction here: when I'm positioned (according to fid.tell()) at byte 67 once the line read is one thing and in the second time (again when fid.tell() reports I'm positioned at byte 67) the line that is read is different.
I can't trust tell() and seek() to put me in the desired location to read from the beginning of the desired line.
On the other hand, when I use (with giant_file1) the length of strings as reference for seek() I get the correct position:
fid.seek(0);
line = fid.readline();
fid.tell();
>>>87L
len(line);
>>>86
fid.seek(86);
fid.readline();
>>>'15.375,91.43,15.754,-147.691,54.234,54.478,-0.435,32.364,4.64,29.479,4.835,-16.697,\n'
So what is going on?
The only difference between giant_file1 and giant_file2 that I can think of is that in giant_file1 the values are written with decimal dot (e.g. -0.435), and in giant_file2 they are all in scientific format (e.g. -4.350e-01). I don't think any of them is coded in unicode (I think so, since the strings I read with simple file.read() seem readable. how can I make sure?).
I would very much appreciate your help, with explanations, ideas for the cause, and possible solutions (or workarounds).
Thank you,
Yonatan.
I think you have a newline problem. Check whether giant_file1.csv is ending lines with \n or \r\n If you open the file in text mode, the file will return lines ending with \n, only and throw away redundant \r. So, when you look at the length of the line returned, it will be 1 off of the actual file position (which has consumed not just the \n, but also the \r\n). These errors will accumulate as you read more lines, of course.
The solution is to open the file in binary mode, instead. In this mode, there is no \r\n -> \n reduction, so your tally of line lengths would stay consistent with your file tell( ) queries.
I hope that solves it for you - as it's an easy fix. :) Good luck with your project and happy coding!
I had to do something similar in the past and ran into something in the standard library called linecache. You might want to look into that as well.
http://docs.python.org/library/linecache.html
Related
I have a folder full of files that need to be modified in order to extract the true file in it's real format.
I need to remove a certain number of bytes from BOTH the beginning and end of the file in order to extract the data I am looking for.
How can I do this in python?
I need this to work recursively on an entire folder only
I also need this to output (or modify the exisiting) file with the bytes removed.
I would greatly appreciate any help or guidance you can provide.
Recursive iteration over files os.walk
Change position in file: f.seek
Get file size: os.stat
Remove data from current position to end of file: f.truncate
So, base logic:
Iterate over files
Get file size.
Open file ('rb+' i suppouse )
Seek to position from wich you want read file
Read until bytes you want to drop ( f.read(file_size - top_dropped - bottom_dropped ) )
Seek(0)
Write read text to file
Truncate file
Your question is pretty badly constructed, but as this is somewhat advanced stuff I'll provide you with a code.
You can now use os.walk() to recursively traverse directory you want and apply my slicefile() function.
This code does the following:
After checking validity of start and end arguments it creates a memory map on top of an opened file.
mmap() creates a memory map object that mapps, in this case, portion of a file system over which the file is written. The object exposes both a string-like and file-like interface with some additional methods like move(). So you can treat memory map either as a string or as a file or use size(), move(), resize() or whatever additional methods you need.
We calculate what is a distance between our start and end, i.e. this is how much bytes we will have in the end.
We move stream of bytes, end-start long, starting from our start position to the 0 position i.e. we move them backwards for number of bytes indicated by starting point.
We discard the rest of file. I.e. we resize it to end-start bytes. So, what is left is our new string.
The operation will be longer as the file is bigger. Unfortunately there is nothing much you can do about it. If a file is big, this is your best bet. The procedure is the same as when removing items from a start/middle of an in-memory array, except this has to be buffered (in chunks) not to fill RAM too much.
If your file is smaller than a third of your free RAM space, you can load it whole into a string with f.read(), then perform string slicing on the loaded content ( s = s[start:end] ) and then write it back into file by opening it again and just doing f.write(s).
If you have enough disk space, you can open another file, seek to the starting point you want in the original file and then read it in chunks and write these into the new file. Perhaps even using shutil.copyfileobj(). After that, you remove the original file and use os.rename() to put the new one in its place. These are your only 3 options.
Whole file into RAM; move by buffering backward and then resizing; and, copying into another file, then renaming it. The second option is most universal and won't fail you for small or big files. Therefore I used it.
OK, Not only 3 options. There is a fourth option. It could be possible to cut off N number of bytes from beginning of the file by manipulating the file system itself using low-level operations. To write a kind of truncate() function that truncates the beginning instead of the end. But this would be pretty suicidal. In the end memory fragmentation would occur and whole mess will arise. You don't need such speed anyway. You will be patient until your script finishes. :D
Why did i use mmap()?
Because it uses memory maps implemented in OS rather than completely new code. This reduces number of system calls needed to deal with the opened file. Half of the work is thrust upon operating system, leaving Python to breathe easily.
Because it is mostly written in C which makes it a touch faster than its pure Python implementation would be.
Because it implements move() which wee need. The buffering and everything is already written, so no needs for bulky while loop which would be the alternative (manual) solution.
And so on...
from mmap import mmap
def slicefile (path, start=0, end=None):
f = open(path, "r+b") # Read and write binary
f.seek(0, 2)
size = f.tell()
start = 0 if start==None else start
end = size if end==None else end
start = size+start if start<0 else start
end = size+end if end<0 else end
end = size if end>size else end
if (end==size and start==0) or (end<=start):
f.close()
return
# If start is 0, no need to move anything, just cut off the rest after end
if start==0:
f.seek(end)
f.truncate()
f.close()
return
# Modify in place using mapped memory:
newsize = end-start
m = mmap(f.fileno(), 0)
m.move(0, start, newsize)
m.flush()
m.resize(newsize)
m.close()
f.close()
This is my code for getting lines from a file by seeking a position using f.seek() method but I am getting a wrong output. it is printing from middle of the first line.
can u help me to solve this please?
f=open(r"sample_text_file","r")
last_pos=int(f.read())
f1=open(r"C:\Users\ddadi\Documents\project2\check\log_file3.log","r")
f1.seek(last_pos)
for i in f1:
print i
last_position=f1.tell()
with open('sample_text.txt', 'a') as handle:
handle.write(str(last_position))
sample_text file contains the file pointer offset which is returned by f1.tell()
If it's printing from the middle of a line that's almost certainly because your offset is wrong. You don't explain how you came by the magic number you use as an argument to seek, and without that information it's difficult to help more precisely.
One thing is, however, rather important. It's not a very good idea to use seek on a file that is open in text mode (the default in Python). Try using open(..., 'rb') and see if the process becomes a little more predictable. It sounds as though you may have got the offset by counting characters after reading in text mode, but good old Windows includes carriage return characters in text files, which are removed by the Python I/O routines before your program sees the text.
I have a text file test.txt, with the following contents:
Thing 1. string
And I'm creating a python file that will increment the number every time it gets run without affecting the rest of the string, like so.
Run once:
Thing 2. string
Run twice:
Thing 3. string
Run three times:
Thing 4. string
Run four times:
Thing 5. string
This is the code that I'm using to accomplish this.
file = open("test.txt","r+")
started = False
beginning = 0 #start of the digits
done = False
num = 0
#building the number from digits
while not done:
next = file.read(1)
if ord(next) in range(48, 58): #ascii values of 0-9
started = True
num *= 10
num += int(next)
elif started: #has reached the end of the number
done = True
else: #has not reached the beginning of the number
beginning += 1
num += 1
file.seek(beginning,0)
file.write(str(num))
This code works, so long as the number is not 10^n-1 (9, 99, 999, etc) because in those cases, it writes more bytes than were previously in the number. As such, it will override the characters that follow.
So this brings me to the point. I need a way to write to the file that overwrites previously bytes, which I have, and a way to write to the file that does not overwrite previously existing bytes, which I don't have. Does such a mechanism exist in python, and if so, what is it?
I have already tried opening the file using the line file = open("test.txt","a+") instead. When I do that, it always writes to the end, regardless of the seek point.
file = open("test.txt","w+") will not work because I need to keep the contents of the file while altering it, and files opened in any variant of w mode are wiped clean.
I have also thought of solving my problem using a function like this:
#file is assumed to be in r+ mode
def write(string, file, index = -1):
if index != -1:
file.seek(index, 0)
remainder = file.read()
file.seek(index)
file.write(remainder + string)
But I also want to be able to expand the solution to larger files, and reading the rest of the file single-handedly changes what I'm trying to accomplish from being O(1) to O(n). It also seems very non-Pythonic, since it seeks to accomplish the task in a less-than-straightforward way.
It would also make my I/O operations inconsistent: I would have class methods (file.read() and file.write()) to read from the file and write to it replacing old characters, but an external function to insert without replacing.
If I make the code inline, rather than a function, it means I have to write several of the same lines of code every time I try to write without replacing, which is also non-Pythonic.
To reiterate my question, is there a more straightforward way to do this, or am I stuck with the function?
Unfortunately, what you want to do is not possible. This is a limitation at a lower level than Python, in the operating system. Neither the Unix nor the Windows file access API offers any way to insert new bytes in the middle of a file without overwriting the bytes that were already there.
Reading the rest of the file and rewriting it is the usual workaround. Actually, the usual workaround is to rewrite the entire file under a new name and then use rename to move it back to the old name. On Unix, this accomplishes an atomic file update - unless the computer crashes, concurrent readers will see either the new file or the old file, not some hybrid. (Windows, sadly, still does not allow you to rename over a name that already exists, so if you use this strategy you have to delete the old file first, opening an unavoidable race window where the file might appear not to exist at all.)
Yes, this is O(N), and yes, if you use the write-new-file-and-rename strategy it temporarily consumes scratch disk space equal to the size of the file (old or new, whichever is larger). That's just how it is.
I haven't thought about it enough to give you even a sketch of the code, but it should be possible to use context managers to wrap up the write-new-file-and-rename approach tidily.
No, the disk doesn't work like you think it does.
You have to remember that your file is stored on disk as one contiguous
chunk of data*
Your disk happens to be wound up in a great big spool, a bit like a record,
but if you were to unwind your file, you'd get something that looks like
this:
+------------------------------------------------------------+
| Thing 1. String |
+------------------------------------------------------------+
^ ^
^ | \_, ^
| Start of file End of File |
Start of disk End of disk
As you've discovered, there's no way to simply insert data in the middle.
Generally speaking, that wouldn't be possible at all, without physically
altering your disk. And who wants to do that? Especially when just flipping
the magnetic bits on your disk is so much easier and faster. In order to
do what you want to do, you have to read the bytes the you want to
overwrite, then start writing down your new ones. It might look something
like this:
Open the file
Seek to the point of insert
Read the current byte
Seek backward one byte
Write down the first byte of the new string
Read the next byte
Seek backward one byte
Write down the next byte of the new string
Repeat until all the bytes have been written to disk
close the file
Of course, this might be a little bit on the slow side, due to all the
seeking back & forth in the file. It might be faster to read each line,
and then seek back to the previous location in the file. It should be
relatively straightforward to implement something like this in Python,
but as you've discovered, there are system limitations that Python can't
really overcome.
*Unless the files are fragmented, but we're living in an ideal
world where gravity adheres to 9.8m/s2 and the Earth is a perfect
sphere.
As a disclaimer, I'm hardly a computer scientist, but I've been reading everything I can on the subject of efficient file i/o to try and tackle this facet of a project I'm working on.
I have a very large (10 - 100 GB) log file of comma-separated values that I need to parse through. The first value labels it as "A" or "B"; for every "A" line, I need to examine the line before it and the line after it, and if either line before or after it meets a criterion, I want to store it in memory or write it to a file. The lines are not uniform in size.
That is my specific problem: I can't seem to locate an efficient way to do this in a non-binary file. With a binary file, I'd simply iterate over the file once and rewind to and fro with a logical check. I've investigated memory mapping, but it seems structured for binary files; my current code is Pythonic and takes weeks to run [see disclaimer].
My other question would be-- how easily could parallelism be invoked to help here? I have a notion of how -- map the file out three lines at a time and send each chunk to each node [lines 1,2,3 go to one node; lines 3,4,5 go to another ...], but I have no idea how to go about implementing this.
Any help would be very much appreciated.
Just read the lines in a loop. Keep track of the previous line in memory and examine it when needed.
Pseudocode:
for each line:
previousLine := currentLine
read currentLine from file
do processing...
This is efficient assuming you're already reading every line into memory anyway, and if you use a proper buffering scheme for reading the file (read large chunks at a time into memory).
I don't think parallelism will help in this situation. If properly written, the bottleneck of the program should be disk I/O, and multiple threads/processes can't read from disk any faster than a single thread. Parallelism only improves CPU-bound problems.
For what it's worth you can "seek" in ASCII files the same way you can with binary files. You would just keep track of the file offset each time you begin to read a line, and store that offset so you know where to seek back to later. Depending on how this is implemented this will never perform better than the above, though, and sometimes worse (you would want the file data to be buffered in memory so that the "seeking" is a memory operation and not a disk operation; you definitely want to read the file contents sequentially to maximize cache-ahead benefits).
Here's a first pass. Assumes properly formatted lines of text.
from itertools import chain
with open('your-file') as f:
prev_line = None
cur_line = f.readline()
for next_line in chain(f, [None]):
pieces = cur_line.split(',')
if pieces[0] == 'A':
check_against_criterion_if_not_none(prev_line)
check_against_criterion_if_not_none(next_line)
prev_line, cur_line = cur_line, next_line
A nifty trick is tacking on that extra 'None' at the end of the file, using itertools.chain, so that code properly checks the last line of the file against the 2nd to last line.
I am working with a very large text file (tsv) around 200 million entries. One of the column is date and records are sorted on date. Now I want to start reading the record from a given date. Currently I was just reading from start which is very slow since I need to read almost 100-150 million records just to reach that record. I was thinking if I can use binary search to speed it up, I can do away in just max 28 extra record reads (log(200 million)). Does python allow to read nth line without caching or reading lines before it?
If the file is not fixed length, you are out of luck. Some function will have to read the file. If the file is fixed length, you can open the file, use the function file.seek(line*linesize). Then read the file from there.
If the file to read is big, and you don't want to read the whole file in memory at once:
fp = open("file")
for i, line in enumerate(fp):
if i == 25:
# 26th line
elif i == 29:
# 30th line
elif i > 29:
break
fp.close()
Note that i == n-1 for the nth line.
You can use the method fileObject.seek(offset[, whence])
#offset -- This is the position of the read/write pointer within the file.
#whence -- This is optional and defaults to 0 which means absolute file positioning, other values are 1 which means seek relative to the current position and 2 means seek relative to the file's end.
file = open("test.txt", "r")
line_size = 8 # Because there are 6 numbers and the newline
line_number = 5
file.seek(line_number * line_size, 0)
for i in range(5):
print(file.readline())
file.close()
To this code I use the next file:
100101
101102
102103
103104
104105
105106
106107
107108
108109
109110
110111
python has no way to skip "lines" in a file. the best way that I know is to employ a generator to yield lines based on certain condition, i.e. date > 'YYYY-MM-DD'. At least this way you reduce memory usage & time spent on i/o.
example:
# using python 3.4 syntax (parameter type annotation)
from datetime import datetime
def yield_right_dates(filepath: str, mydate: datetime):
with open(filepath, 'r') as myfile:
for line in myfile:
# assume:
# the file is tab separated (because .tsv is the extension)
# the date column has column-index == 0
# the date format is '%Y-%m-%d'
line_splt = line.split('\t')
if datetime.strptime(line_splt[0], '%Y-%m-%d') > mydate:
yield line_splt
my_file_gen = yield_right_dates(filepath='/path/to/my/file', mydate=datetime(2015,01,01))
# then you can do whatever processing you need on the stream, or put it in one giant list.
desired_lines = [line for line in my_file_gen]
But this is still limiting you to one processor :(
Assuming you're on a unix-like system and bash is your shell, I would split the file using the shell utility split, then use multiprocessing and the generator defined above.
I don't have a large file to test with right now, but I'll update this answer later with a benchmark on iterating it whole, vs. splitting and then iterating it with the generator and multiprocessing module.
With greater knowledge on the file (e.g. if all the desired dates are clustered at the beginning | center | end), you might be able to optimize the read further.
As others have commented python doesn't support this as it doesn't know where lines start and end (unless they're fixed length). If you're doing this repeatedly I'd recommend either padding out the lines to a constant length (if practical) or failing that reading them into some kind of basic database. You'll take a bit of a hit to memory size but unless you're only indexing once in a blue moon it'll probably be worth it.
If space is a big concern and padding isn't possible you could also add a (linenumber) tag at the start of each line. While you would have to guess the size of jumps and then parse a sample of line to check them that would allow you to make a searching algorithm to find the right line quickly for only around 10 characters per line