In Python, I'm reading a large file, and I want to add each line (after some modifications) to an empty list. I want to do this for only the first few lines, so I did:
X = []
for line in range(3):
    i = file.readline()
    m = str(i)
    X.append(m)
However, an error shows up saying there is a MemoryError for the line
i = file.readline().
What should I do? It is the same even if I make the range 1 (although I don't know how that affects the line, since it's inside the loop).
How do I avoid the error? I'm iterating, and I can't make it into a binary file because the file isn't just integers - there are decimals and non-numerical characters.
The txt file is 5 gigs.
Any ideas?
filehandle.readline() breaks lines on the newline character (\n) - if your file has gigantic lines, or no newlines at all, you'll need to figure out a different way of chunking it.
Normally you might read the file in chunks and process those chunks one by one.
Can you figure out how you might break up the file? Could you, for example, only read 1024 bytes at a time, and work with that chunk?
If not, it's often easier to clean up the format of the file instead of designing a complicated reader.
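For example, here is a minimal sketch of chunk-by-chunk reading (the 1024-byte size and the bigfile.txt name are just placeholders, not anything from your setup):
with open('bigfile.txt') as f:
    while True:
        chunk = f.read(1024)   # read at most 1024 characters at a time
        if not chunk:          # an empty string means end of file
            break
        # process the chunk here, e.g. split it, count things, write it elsewhere
        print(len(chunk))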
This is an issue of trying to reach the line I want to start from, and proceeding from there, in the shortest time possible.
I have a huge text file that I'm reading and performing operations on line after line. I am currently keeping track of the line number that I have parsed, so that in case of any system crash I know how far I got.
How do I restart reading the file from that point without starting over from the beginning again?
count = 0
all_parsed = os.listdir("urltextdir/")
with open(filename, "r") as readfile:
    for eachurl in readfile:
        if str(count)+".txt" not in all_parsed:
            urltext = getURLText(eachurl)
            with open("urltextdir/"+str(count)+".txt", "w") as writefile:
                writefile.write(urltext)
            result = processUrlText(urltext)
            saveinDB(result)
        count += 1
This is what I'm currently doing, but when it crashes at a million lines, I have to go through all those lines in the file again to reach the point I want to start from. My other alternative is to use readlines and load the entire file into memory.
Is there an alternative that I can consider?
Unfortunately, line number isn't really a basic position for file objects, and the seeking/telling functions are unreliable while a for loop is iterating over the file (which calls next under the hood). You can't jump to a line, but you can jump to a byte position. So one way would be:
line = readfile.readline()          # Must use `readline`, not a for loop, so tell() stays accurate!
while line:
    lastell = readfile.tell()       # Byte position of the cursor in the file after reading the line
    print(lastell)
    print(line)                     # Do with line what you would normally do
    line = readfile.readline()      # Advance the loop
Now you can easily jump back with
readfile.seek(lastell)  # You need to have kept the last lastell
You would need to keep saving lastell to a file or printing it so on restart you know which byte you're starting at.
Unfortunately you can't use the written output file for this, as any change to the number of characters written would throw off a count based on it.
Here is one full implementation. Create a file called tell and put 0 inside of it, and then you can run:
with open('tell', 'r+') as tfd:
    with open('abcdefg') as fd:
        fd.seek(int(tfd.readline()))        # Get last position
        line = fd.readline()                # Init loop
        while line:
            print(line.strip(), fd.tell())  # Action on line
            tfd.seek(0)                     # Clear and
            tfd.write(str(fd.tell()))       # write new position only if successful
            line = fd.readline()            # Advance loop
You can check if such a file exists and create it in the program of course.
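For instance, a tiny sketch of that existence check, reusing the tell filename from the example:
import os

if not os.path.exists('tell'):
    with open('tell', 'w') as tfd:
        tfd.write('0')   # start at byte 0 of the data file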
As #Edwin pointed out in the comments, you may want to call fd.flush() and os.fsync(fd.fileno()) (import os if that isn't clear) to make sure that after every write your file contents are actually on disk. This would apply to both write operations you are doing, the tell file being the quicker of the two, of course. It may slow things down considerably, so if you are satisfied with the synchronicity as is, don't use it, or only flush the tfd. You can also specify the buffer size when calling open so Python flushes more often, as detailed in https://stackoverflow.com/a/3168436/6881240.
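As a rough sketch of how those calls could be combined with a smaller buffer (same example filenames as above; this is one possible arrangement, not the only one):
import os

with open('tell', 'r+', buffering=1) as tfd:      # line-buffered, so writes reach the OS sooner
    with open('abcdefg') as fd:
        fd.seek(int(tfd.readline() or 0))         # Get last position (0 if the file is empty)
        line = fd.readline()
        while line:
            # ... act on line here ...
            tfd.seek(0)
            tfd.write(str(fd.tell()))
            tfd.flush()                           # push Python's buffer to the OS
            os.fsync(tfd.fileno())                # ask the OS to commit it to disk
            line = fd.readline()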
If I got it right, you could make a simple log file to store the count in.
Still, I would recommend using many files, or storing every line or paragraph in a database like SQL or MongoDB.
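If you go the database route, here is a minimal sketch using the standard library's sqlite3 (the progress.db filename and the table layout are made up for illustration):
import sqlite3

conn = sqlite3.connect('progress.db')
conn.execute('CREATE TABLE IF NOT EXISTS progress (id INTEGER PRIMARY KEY, last_line INTEGER)')

def save_progress(count):
    # overwrite the single progress row with the latest line count
    conn.execute('INSERT OR REPLACE INTO progress (id, last_line) VALUES (1, ?)', (count,))
    conn.commit()

def load_progress():
    row = conn.execute('SELECT last_line FROM progress WHERE id = 1').fetchone()
    return row[0] if row else 0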
I guess it depends on what system your script is running on, and what resources (such as memory) you have available.
But with the popular saying "memory is cheap", you can simply read the file into memory.
As a test, I created a file with 2 million lines, each line 1024 characters long with the following code:
ms = 'a' * 1024
with open('c:\\test\\2G.txt', 'w') as out:
    for _ in range(0, 2000000):
        out.write(ms + '\n')
This resulted in a 2 GB file on disk.
I then read the file into a list in memory, like so:
my_file_as_list = [a for a in open('c:\\test\\2G.txt', 'r').readlines()]
I checked the Python process, and it used a little over 2 GB of memory (on a 32 GB system).
Access to the data was very fast, and can be done by list slicing methods.
You need to keep track of the index into the list; when your system crashes, you can start from that index again.
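As a rough sketch of that idea (the checkpoint.txt filename and the process() function are placeholders, not part of the original code):
import os

start = 0
if os.path.exists('checkpoint.txt'):
    with open('checkpoint.txt') as cp:
        start = int(cp.read() or 0)

for index in range(start, len(my_file_as_list)):
    process(my_file_as_list[index])              # your per-line work goes here
    with open('checkpoint.txt', 'w') as cp:      # record progress after each line
        cp.write(str(index + 1))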
But more important... if your system is "crashing" then you need to find out why it is crashing... surely a couple of million lines of data is not a reason to crash anymore these days...
Assume I've got a large file where I want to replace the nth line. I am aware of this solution:
import os

w = open('out', 'w')
for line in open('in', 'r'):
    w.write(replace_somehow(line))
w.close()
os.remove('in')
os.rename('out', 'in')
I do not want to rewrite the whole file (which has many lines) if the line to be replaced is near the beginning of the file.
Is there any proper way to replace the nth line directly?
Unless your new line is guaranteed to be exactly the same length as the original line, there is no way around rewriting the entire file.
Some word processors get really fancy by storing a journal of changes, or a big list of chunks with extra space at the end of each chunk, or a database of smaller chunks, so that auto-save modifications can be done quickly (just append to the journal, or rewrite a single chunk, or do a database update), but the real "save" button will then reconstruct the whole file and write it all at once.
This is worth doing if you autosave much more often than the user manually saves, and your files are very big. (Keep in mind that when, e.g., Microsoft Word was designed, 100KB was really big…)
And this points to the right answer. If you've got 5GB of data, and you need to change the Nth record within that, you should not be using a format that's defined as a sequence of variable-length records with no index. Which is what a text file is. The simplest format that makes sense for your case is a sequence of fixed-size records—but if you need to insert or remove records as well as changing them in-place, it will be just as bad as a text file would. So, first think through your requirements, then pick a data structure.
If you need to deal with some more limited format (like text files) for interchange with other programs, that's fine. You will have to rewrite the entire file once, after all of your changes, to "export", but you won't have to do it every time you make any change.
If all of your lines are exactly the same length, you can do this as follows:
with open('myfile.txt', 'rb+') as f:
    f.seek(FIXED_LINE_LENGTH * line_number)
    f.write(new_line)   # new_line must be a bytes object, since the file is open in binary mode
Note that it's length in bytes that matters, not length in characters. And you must open the file in binary mode to use it this way.
If you don't know which line number you're trying to replace, you'd want something like this:
with open('myfile.txt', 'rb+') as f:
    for line_number, line in enumerate(f):
        if is_the_right_line(line):
            f.seek(FIXED_LINE_LENGTH * line_number)
            f.write(new_line)
If your lines aren't all required to be the same length, but you can be absolutely positive that this one new line is the same length as the old line, you can do this:
with open('myfile.txt', 'rb+') as f:
    last_pos = 0
    for line_number, line in enumerate(f):
        if is_the_right_line(line):
            f.seek(last_pos)
            f.write(new_line)
        last_pos = f.tell()
I have a giant file (1.2GB) of feature vectors saved as a csv file.
In order to go through the lines, I've created a Python class that loads batches of rows from the giant file into memory, one batch at a time.
In order for this class to know where exactly to read in the file to get a batch of batch_size complete rows (let's say batch_size=10,000), the first time it is used on a giant file the class goes through the entire file once, registers the offset of each line, and saves these offsets to a helper file, so that later it can do file.seek(starting_offset); batch = file.read(num_bytes) to read the next batch of lines.
First, I implemented the registration of line offsets in this manner:
offset = 0
line_offsets = []
for line in self.fid:
    line_offsets.append(offset)
    offset += len(line)
And it worked lovely with giant_file1.
But then I processed these features and created giant_file2 (with normalized features), with the assistance of this class I made.
And next, when I wanted to read batches of lines from giant_file2, it failed, because the batch strings it read were not at the right place (for instance, reading something like "-00\n15.467e-04,..." instead of "15.467e-04,...\n").
So I tried changing the line offset calculation part to:
offset = 0
line_offsets = []
while True:
    line = self.fid.readline()
    if len(line) <= 0:
        break
    line_offsets.append(offset)
    offset = self.fid.tell()
The main change is that the offset I register is taken from the result of fid.tell() instead of cumulative lengths of lines.
This version worked well with giant_file2, but failed with giant_file1.
The further I investigated, the more I got the feeling that seek(), tell() and read() are inconsistent with each other.
For instance:
>>> fid = file('giant_file1.csv')
>>> fid.readline()
'0.089,169.039,10.375,-30.838,59.171,-50.867,13.968,1.599,-26.718,0.507,-8.967,-8.736,\n'
>>> fid.tell()
67L
>>> fid.readline()
'15.375,91.43,15.754,-147.691,54.234,54.478,-0.435,32.364,4.64,29.479,4.835,-16.697,\n'
>>> fid.seek(67)
>>> fid.tell()
67L
>>> fid.readline()
'507,-8.967,-8.736,\n'
There is some contradiction here: the first time I'm positioned at byte 67 (according to fid.tell()) the line read is one thing, and the second time (again when fid.tell() reports I'm at byte 67) the line read is different.
I can't trust tell() and seek() to put me in the desired location to read from the beginning of the desired line.
On the other hand, when I use (with giant_file1) the length of strings as reference for seek() I get the correct position:
>>> fid.seek(0)
>>> line = fid.readline()
>>> fid.tell()
87L
>>> len(line)
86
>>> fid.seek(86)
>>> fid.readline()
'15.375,91.43,15.754,-147.691,54.234,54.478,-0.435,32.364,4.64,29.479,4.835,-16.697,\n'
So what is going on?
The only difference between giant_file1 and giant_file2 that I can think of is that in giant_file1 the values are written with a decimal point (e.g. -0.435), while in giant_file2 they are all in scientific notation (e.g. -4.350e-01). I don't think either of them is encoded in Unicode (the strings I read with a simple file.read() seem readable - how can I make sure?).
I would very much appreciate your help, with explanations, ideas for the cause, and possible solutions (or workarounds).
Thank you,
Yonatan.
I think you have a newline problem. Check whether giant_file1.csv ends its lines with \n or \r\n. If you open the file in text mode, it will return lines ending with \n only and throw away the redundant \r. So when you look at the length of the line returned, it will be 1 off from the actual file position (which has consumed not just the \n, but also the \r). These errors accumulate as you read more lines, of course.
The solution is to open the file in binary mode instead. In this mode there is no \r\n -> \n reduction, so your tally of line lengths stays consistent with your fid.tell() queries.
I hope that solves it for you - as it's an easy fix. :) Good luck with your project and happy coding!
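As a sketch of what that looks like in practice (reusing the giant_file1.csv name from the question; not a full implementation of the batching class):
line_offsets = []
with open('giant_file1.csv', 'rb') as fid:   # binary mode: no \r\n -> \n translation
    offset = 0
    for line in fid:
        line_offsets.append(offset)
        offset += len(line)                  # in binary mode this matches the real byte position
Later, fid.seek(line_offsets[n]) followed by a read will land exactly at the start of line n.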
I had to do something similar in the past and ran into something in the standard library called linecache. You might want to look into that as well.
http://docs.python.org/library/linecache.html
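For example, a quick illustration of the module (the data.txt filename and line number are just for demonstration):
import linecache

fourth_line = linecache.getline('data.txt', 4)   # line numbers are 1-based
print(fourth_line)
Keep in mind that linecache reads the whole file into memory behind the scenes, so it may not help much for truly giant files.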
I want to
open a file
add 4 underscore characters to the beginning of each line
find blank lines
replace the newline character in the blank lines with 50 underscore characters
add new lines before and after the 50 underscore characters
I found many similar questions on Stack Overflow, but I could not combine all these operations without getting errors. See my previous question here. Is there a simple beginner's way to accomplish this so that I can start from there? (I don't mind writing to the same file; there is no need to open two files.) Thanks.
You're going to have to pick:
Use two files, but never have to store more than 1 line in memory at a time
or
Build the new file in memory as you read the original, then overwrite the original with the new
A file isn't a flexible memory structure. You can't replace the 1 or 2 characters of a newline with 50 underscores; it just doesn't work like that. If you are sure the new file is going to be a manageable size and you don't mind writing over the original, you can do it without a new file.
Myself, I would always allow the user to opt for an output file. What if something goes wrong? Disk space is super cheap.
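A rough sketch of the two-file option, applied to the steps listed in the question (input.txt and output.txt are placeholder names, and 'blank' is taken to mean whitespace-only):
with open('input.txt') as src, open('output.txt', 'w') as dst:
    for line in src:
        if line.strip():                      # ordinary line: prefix it with 4 underscores
            dst.write('____' + line)
        else:                                 # blank line: 50 underscores with newlines around them
            dst.write('\n' + '_' * 50 + '\n')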
You can do everything you want by reading the file first, performing the changes on the lines in memory, and finally writing it back. If the file doesn't fit in memory, then you should read the file in batches and create a temporary file. You can't modify the file in place.
In Python, suppose I have a file data.txt which has 6 lines of data. I want to calculate the number of lines, which I am planning to do by going through each character and counting the number of '\n' in the file. How do I read one character at a time from the file? readline takes the whole line.
I think the method you're looking for is readlines, as in
lines = open("inputfilex.txt", "r").readlines()
This will give you a list of each of the lines in the file. To find out how many lines, you can just do:
len(lines)
And then access it using indexes, like lines[3] or lines[-1] as you would any normal Python list.
You can use read(1) to read a single byte. help(file) says:
read(size) -> read at most size bytes, returned as a string.
If the size argument is negative or omitted, read until EOF is reached.
Notice that when in non-blocking mode, less data than what was requested
may be returned, even if no size parameter was given.
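For instance, a small sketch of counting newlines one character at a time (data.txt is the filename from the question):
newlines = 0
with open('data.txt') as f:
    while True:
        c = f.read(1)       # read a single character
        if not c:           # an empty string means end of file
            break
        if c == '\n':
            newlines += 1
print(newlines)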
Note that reading a file a byte at a time is quite un-"Pythonic". This is par for the course in C, but Python can do a lot more work with far less code. For example, you can read the entire file into an array in one line of code:
lines = f.readlines()
You could then access by line number with a simple lines[lineNumber] lookup.
Or if you don't want to store the entire file in memory at once, you can iterate over it line-by-line:
for line in f:
# Do whatever you want.
That is much more readable and idiomatic.
It seems the simplest answer for you would be to do:
lines = 0
for line in file:
    lines += 1
    # do whatever else you need to do for each line
Or the equivalent construction explicitly using readline(). I'm not sure why you want to look at every character when you said above that readline() is correctly reading each line in its entirety.
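That equivalent readline() construction would look something like this (a sketch; data.txt stands in for your file):
lines = 0
with open('data.txt') as f:
    while True:
        line = f.readline()
        if not line:        # readline() returns '' at end of file
            break
        lines += 1
        # do whatever else you need to do for each line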
To access a file based on its lines, make a list of its lines.
with open('myfile') as f:
    lines = list(f)
then simply access lines[3] to get the fourth line and so forth. (Note that this will not strip the newline characters.)
The linecache module can also be useful for this.