I'm trying to open a file and read from the last point read. My files are rather big (20 Mb to ~ 1 Gb) After doing some research it seems that tell() and seek() would be one of the most efficient ways to perform this. I've tried the following code
opened = open(filename, "rU")
f1 = csv.reader(opened)
k = []
for line in f1:
k.append(opened.tell())
When I do this every value in the list is 8272 Long. Does that mean that I cannot use this implementation? Is there something I'm missing? Thanks for your help!
I'm running python 2.7 in Windows 7
Update
After piecing together everything learned here and trial and error I get the following code
opened = open(filename, "rU")
k = [0]
where = 1
for switch in opened:
where += len(switch) + 1
f = StringIO.StringIO(switch)
interesting = csv.reader(f, delimiter=',')
good_values = interesting.next()
k.append(where)
return k
This allows the user to know exactly where in the file to go to while still being able to parse it according to its format. I'm not completely sure of why the offsets need to be constantly added (It seems that the newline is not accurately accounted for in len()).
It looks like the csv.reader is reading the file in chunks of 8272 bytes, that's why you see this number returned from opened.tell() many times - until, I guess, you have read all the lines from your file in the range of 0-8272. After that you will see 8272*2 a few times, exact number will depend on the length of the lines in the buffer read.
So, basically, in your program, tell() doesn't give you offsets of new CSV lines, as you seem to assume. It's only telling you about offset of the end of the file's region currently read into an internal OS buffer used by system functions used to implement the Python's IO functions.
Related
I'd like to understand the difference in RAM-usage of this methods when reading a large file in python.
Version 1, found here on stackoverflow:
def read_in_chunks(file_object, chunk_size=1024):
while True:
data = file_object.read(chunk_size)
if not data:
break
yield data
f = open(file, 'rb')
for piece in read_in_chunks(f):
process_data(piece)
f.close()
Version 2, I used this before I found the code above:
f = open(file, 'rb')
while True:
piece = f.read(1024)
process_data(piece)
f.close()
The file is read partially in both versions. And the current piece could be processed. In the second example, piece is getting new content on every cycle, so I thought this would do the job without loading the complete file into memory.
But I don't really understand what yield does, and I'm pretty sure I got something wrong here. Could anyone explain that to me?
There is something else that puzzles me, besides of the method used:
The content of the piece I read is defined by the chunk-size, 1KB in the examples above. But... what if I need to look for strings in the file? Something like "ThisIsTheStringILikeToFind"?
Depending on where in the file the string occurs, it could be that one piece contains the part "ThisIsTheStr" - and the next piece would contain "ingILikeToFind". Using such a method it's not possible to detect the whole string in any piece.
Is there a way to read a file in chunks - but somehow care about such strings?
yield is the keyword in python used for generator expressions. That means that the next time the function is called (or iterated on), the execution will start back up at the exact point it left off last time you called it. The two functions behave identically; the only difference is that the first one uses a tiny bit more call stack space than the second. However, the first one is far more reusable, so from a program design standpoint, the first one is actually better.
EDIT: Also, one other difference is that the first one will stop reading once all the data has been read, the way it should, but the second one will only stop once either f.read() or process_data() throws an exception. In order to have the second one work properly, you need to modify it like so:
f = open(file, 'rb')
while True:
piece = f.read(1024)
if not piece:
break
process_data(piece)
f.close()
starting from python 3.8 you might also use an assignment expression (the walrus-operator):
with open('file.name', 'rb') as file:
while chunk := file.read(1024):
process_data(chunk)
the last chunk may be smaller than CHUNK_SIZE.
as read() will return b"" when the file has been read the while loop will terminate.
I think probably the best and most idiomatic way to do this would be to use the built-in iter() function along with its optional sentinel argument to create and use an iterable as shown below. Note that the last chunk might be less that the requested chunk size if the file size isn't an exact multiple of it.
from functools import partial
CHUNK_SIZE = 1024
filename = 'testfile.dat'
with open(filename, 'rb') as file:
for chunk in iter(partial(file.read, CHUNK_SIZE), b''):
process_data(chunk)
Update: Don't know when it was added, but almost exactly what's above is in now shown as an example in the official documentation of the iter() function.
I'm trying to find out the best way to read/process lines for super large file.
Here I just try
for line in f:
Part of my script is as below:
o=gzip.open(file2,'w')
LIST=[]
f=gzip.open(file1,'r'):
for i,line in enumerate(f):
if i%4!=3:
LIST.append(line)
else:
LIST.append(line)
b1=[ord(x) for x in line]
ave1=(sum(b1)-10)/float(len(line)-1)
if (ave1 < 84):
del LIST[-4:]
output1=o.writelines(LIST)
My file1 is around 10GB; and when I run the script, the memory usage just keeps increasing to be like 15GB without any output. That means the computer is still trying to read the whole file into memory first, right? This really makes no different than using readlines()
However in the post:
Different ways to read large data in python
Srika told me:
The for line in f treats the file object f as an iterable, which automatically uses buffered IO and memory management so you don't have to worry about large files.
But obviously I still need to worry large files..I'm really confused.
thx
edit:
Every 4 lines is kind of group in my data.
THe purpose is to do some calculations on every 4th line; and based on that calculation, decide if we need to append those 4 lines.So writing lines is my purpose.
The reason the memory keeps inc. even after you use enumerator is because you are using LIST.append(line). That basically accumulates all the lines of the file in a list. Obviously its all sitting in-memory. You need to find a way to not accumulate lines like this. Read, process & move on to next.
One more way you could do is read your file in chunks (in fact reading 1 line at a time can qualify in this criteria, 1chunk == 1line), i.e. read a small part of the file process it then read next chunk etc. I still maintain that this is best way to read files in python large or small.
with open(...) as f:
for line in f:
<do something with line>
The with statement handles opening and closing the file, including if an exception is raised in the inner block. The for line in f treats the file object f as an iterable, which automatically uses buffered IO and memory management so you don't have to worry about large files.
It looks like at the end of this function, you're taking all of the lines you've read into memory, and then immediately writing them to a file. Maybe you can try this process:
Read the lines you need into memory (the first 3 lines).
On the 4th line, append the line & perform your calculation.
If your calculation is what you're looking for, flush the values in your collection to the file.
Regardless of what follows, create a new collection instance.
I haven't tried this out, but it could maybe look something like this:
o=gzip.open(file2,'w')
f=gzip.open(file1,'r'):
LIST=[]
for i,line in enumerate(f):
if i % 4 != 3:
LIST.append(line)
else:
LIST.append(line)
b1 = [ord(x) for x in line]
ave1 = (sum(b1) - 10) / float(len(line) - 1
# If we've found what we want, save them to the file
if (ave1 >= 84):
o.writelines(LIST)
# Release the values in the list by starting a clean list to work with
LIST = []
EDIT: As a thought though, since your file is so large, this may not be the best technique because of all the lines you would have to write to file, but it may be worth investigating regardless.
Since you add all the lines to the list LIST and only sometimes remove some lines from it, LIST we become longer and longer. All those lines that you store in LIST will take up memory. Don't keep all the lines around in a list if you don't want them to take up memory.
Also your script doesn't seem to produce any output anywhere, so the point of it all isn't very clear.
Ok, you know what your problem is already from the other comments/answers, but let me simply state it.
You are only reading a single line at a time into memory, but you are storing a significant portion of these in memory by appending to a list.
In order to avoid this you need to store something in the filesystem or a database (on the disk) for later look up if your algorithm is complicated enough.
From what I see it seems you can easily write the output incrementally. ie. You are currently using a list to store valid lines to write to output as well as temporary lines you may delete at some point. To be efficient with memory you want to write the lines from your temporary list as soon as you know these are valid output.
In summary, use your list to store only temporary data you need to do your calculations based off of, and once you have some valid data ready for output you can simply write it to disk and delete it from your main memory (in python this would mean you should no longer have any references to it.)
If you do not use the with statement , you must close the file's handlers:
o.close()
f.close()
I'm new to Python from the R world, and I'm working on big text files, structured in data columns (this is LiDaR data, so generally 60 million + records).
Is it possible to change the field separator (eg from tab-delimited to comma-delimited) of such a big file without having to read the file and do a for loop on the lines?
No.
Read the file in
Change separators for each line
Write each line back
This is easily doable with just a few lines of Python (not tested but the general approach works):
# Python - it's so readable, the code basically just writes itself ;-)
#
with open('infile') as infile:
with open('outfile', 'w') as outfile:
for line in infile:
fields = line.split('\t')
outfile.write(','.join(fields))
I'm not familiar with R, but if it has a library function for this it's probably doing exactly the same thing.
Note that this code only reads one line at a time from the file, so the file can be larger than the physical RAM - it's never wholly loaded in.
You can use the linux tr command to replace any character with any other character.
Actually lets say yes, you can do it without loops eg:
with open('in') as infile:
with open('out', 'w') as outfile:
map(lambda line: outfile.write(','.join(line.split('\n'))), infile)
You cant, but i strongly advise you to check generators.
Point is that you can make faster and well structured program without need to write and store data in memory in order to process it.
For instance
file = open("bigfile","w")
j = (i.split("\t") for i in file)
s = (","join(i) for i in j)
#and now magic happens
for i in s:
some_other_file.write(i)
This code spends memory for holding only single line.
I have a large xml file (40 Gb) that I need to split into smaller chunks. I am working with limited space, so is there a way to delete lines from the original file as I write them to new files?
Thanks!
Say you want to split the file into N pieces, then simply start reading from the back of the file (more or less) and repeatedly call truncate:
Truncate the file's size. If the optional size argument is present, the file is truncated to (at most) that size. The size defaults to the current position. The current file position is not changed. ...
import os
import stat
BUF_SIZE = 4096
size = os.stat("large_file")[stat.ST_SIZE]
chunk_size = size // N
# or simply set a fixed chunk size based on your free disk space
c = 0
in_ = open("large_file", "r+")
while size > 0:
in_.seek(-min(size, chunk_size), 2)
# now you have to find a safe place to split the file at somehow
# just read forward until you found one
...
old_pos = in_.tell()
with open("small_chunk%2d" % (c, ), "w") as out:
b = in_.read(BUF_SIZE)
while len(b) > 0:
out.write(b)
b = in_.read(BUF_SIZE)
in_.truncate(old_pos)
size = old_pos
c += 1
Be careful, as I didn't test any of this. It might be needed to call flush after the truncate call, and I don't know how fast the file system is going to actually free up the space.
If you're on Linux/Unix, why not use the split command like this guy does?
split --bytes=100m /input/file /output/dir/prefix
EDIT: then use csplit.
I'm pretty sure there is, as I've even been able to edit/read from the source files of scripts I've run, but the biggest problem would probably be all the shifting that would be done if you started at the beginning of the file. On the other hand, if you go through the file and record all the starting positions of the lines, you could then go in reverse order of position to copy the lines out; once that's done, you could go back, take the new files, one at a time, and (if they're small enough), use readlines() to generate a list, reverse the order of the list, then seek to the beginning of the file and overwrite the lines in their old order with the lines in their new one.
(You would truncate the file after reading the first block of lines from the end by using the truncate() method, which truncates all data past the current file position if used without any arguments besides that of the file object, assuming you're using one of the classes or a subclass of one of the classes from the io package to read your file. You'd just have to make sure that the current file position ends up at the beginning of the last line to be written to a new file.)
EDIT: Based on your comment about having to make the separations at the proper closing tags, you'll probably also have to develop an algorithm to detect such tags (perhaps using the peek method), possibly using a regular expression.
If time is not a major factor (or wear and tear on your disk drive):
Open handle to file
Read up to the size of your partition / logical break point (due to the xml)
Save the rest of your file to disk (not sure how python handles this as far as directly overwriting file or memory usage)
Write the partition to disk
goto 1
If Python does not give you this level of control, you may need to dive into C.
You could always parse the XML file and write out say every 10000 elements to there own file. Look at the Incremental Parsing section of this link.
http://effbot.org/zone/element-iterparse.htm
Here is my script...
import string
import os
from ftplib import FTP
# make ftp connection
ftp = FTP('server')
ftp.login('user', 'pwd')
ftp.cwd('/dir')
f1 = open('large_file.xml', 'r')
size = 0
split = False
count = 0
for line in f1:
if not split:
file = 'split_'+str(count)+'.xml'
f2 = open(file, 'w')
if count > 0:
f2.write('<?xml version="1.0"?>\n')
f2.write('<StartTag xmlns="http://www.blah/1.2.0">\n')
size = 0
count += 1
split = True
if size < 1073741824:
f2.write(line)
size += len(line)
elif str(line) == '</EndTag>\n':
f2.write(line)
f2.write('</EndEndTag>\n')
print('completed file %s' %str(count))
f2.close()
f2 = open(file, 'r')
print("ftp'ing file...")
ftp.storbinary('STOR ' + file, f2)
print('ftp done.')
split = False
f2.close()
os.remove(file)
else:
f2.write(line)
size += len(line)
Its a time to buy a new hard drive!
You can make backup before trying all other answers and don't get data lost :)
I am working with a very large text file (tsv) around 200 million entries. One of the column is date and records are sorted on date. Now I want to start reading the record from a given date. Currently I was just reading from start which is very slow since I need to read almost 100-150 million records just to reach that record. I was thinking if I can use binary search to speed it up, I can do away in just max 28 extra record reads (log(200 million)). Does python allow to read nth line without caching or reading lines before it?
If the file is not fixed length, you are out of luck. Some function will have to read the file. If the file is fixed length, you can open the file, use the function file.seek(line*linesize). Then read the file from there.
If the file to read is big, and you don't want to read the whole file in memory at once:
fp = open("file")
for i, line in enumerate(fp):
if i == 25:
# 26th line
elif i == 29:
# 30th line
elif i > 29:
break
fp.close()
Note that i == n-1 for the nth line.
You can use the method fileObject.seek(offset[, whence])
#offset -- This is the position of the read/write pointer within the file.
#whence -- This is optional and defaults to 0 which means absolute file positioning, other values are 1 which means seek relative to the current position and 2 means seek relative to the file's end.
file = open("test.txt", "r")
line_size = 8 # Because there are 6 numbers and the newline
line_number = 5
file.seek(line_number * line_size, 0)
for i in range(5):
print(file.readline())
file.close()
To this code I use the next file:
100101
101102
102103
103104
104105
105106
106107
107108
108109
109110
110111
python has no way to skip "lines" in a file. the best way that I know is to employ a generator to yield lines based on certain condition, i.e. date > 'YYYY-MM-DD'. At least this way you reduce memory usage & time spent on i/o.
example:
# using python 3.4 syntax (parameter type annotation)
from datetime import datetime
def yield_right_dates(filepath: str, mydate: datetime):
with open(filepath, 'r') as myfile:
for line in myfile:
# assume:
# the file is tab separated (because .tsv is the extension)
# the date column has column-index == 0
# the date format is '%Y-%m-%d'
line_splt = line.split('\t')
if datetime.strptime(line_splt[0], '%Y-%m-%d') > mydate:
yield line_splt
my_file_gen = yield_right_dates(filepath='/path/to/my/file', mydate=datetime(2015,01,01))
# then you can do whatever processing you need on the stream, or put it in one giant list.
desired_lines = [line for line in my_file_gen]
But this is still limiting you to one processor :(
Assuming you're on a unix-like system and bash is your shell, I would split the file using the shell utility split, then use multiprocessing and the generator defined above.
I don't have a large file to test with right now, but I'll update this answer later with a benchmark on iterating it whole, vs. splitting and then iterating it with the generator and multiprocessing module.
With greater knowledge on the file (e.g. if all the desired dates are clustered at the beginning | center | end), you might be able to optimize the read further.
As others have commented python doesn't support this as it doesn't know where lines start and end (unless they're fixed length). If you're doing this repeatedly I'd recommend either padding out the lines to a constant length (if practical) or failing that reading them into some kind of basic database. You'll take a bit of a hit to memory size but unless you're only indexing once in a blue moon it'll probably be worth it.
If space is a big concern and padding isn't possible you could also add a (linenumber) tag at the start of each line. While you would have to guess the size of jumps and then parse a sample of line to check them that would allow you to make a searching algorithm to find the right line quickly for only around 10 characters per line