I'm running into a problem that I haven't seen anyone else on Stack Overflow encounter, or even find via Google for that matter.
My main goal is to replace occurrences of a string in the file with another string. Is there a way to access all of the lines in the file?
The problem is that when I try to read in a large text file (1-2 GB) of text, Python only reads a subset of it.
For example, I'll do a really simple command such as:
newfile = open("newfile.txt", "w")
f = open("filename.txt", "r")
for line in f:
    replaced = line.replace("string1", "string2")
    newfile.write(replaced)
And it only writes the first 382 MB of the original file. Has anyone encountered this problem previously?
I tried a few different solutions such as using:
import fileinput
import sys

for line in fileinput.input("filename.txt", inplace=1):
    sys.stdout.write(line.replace("string1", "string2"))
But it has the same effect. Nor does reading the file in chunks help, e.g. using
f.read(10000)
I've narrowed it down to most likely being a reading problem rather than a writing problem, because it happens even when simply printing out lines. I know that there are more lines: when I open the file in a full text editor such as Vim, I can see what the last line should be, and it is not the last line that Python prints.
Can anyone offer any advice or things to try?
I'm currently using a 32-bit version of Windows XP with 3.25 GB of RAM, and running Python 2.7.
Try:
f = open("filename.txt", "rb")
On Windows, rb means open the file in binary mode. According to the docs, text mode vs. binary mode only has an impact on end-of-line characters. But (if I remember correctly) opening files in text mode on Windows also treats a Ctrl-Z byte (hex 1A) as end-of-file.
You can also specify the mode when using fileinput:
fileinput.input("filename.txt", inplace=1, mode="rb")
Are you sure the problem is with reading and not with writing out?
Do you close the file that is written to, either explicitly with newfile.close() or by using the with construct?
Not closing the output file is often the source of such problems when buffering is going on somewhere. If that's the case in your setting too, closing should fix your initial solutions.
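For example, keeping the explicit-open style from the question, the decisive line would be the final close(), which flushes any buffered output (a sketch, untested against the original data):

newfile = open("newfile.txt", "w")
f = open("filename.txt", "r")
for line in f:
    newfile.write(line.replace("string1", "string2"))
f.close()
newfile.close()  # without this, the buffered tail of the output can be lost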
If you use the file like this:
with open("filename.txt") as f:
for line in f:
newfile.write(line.replace("string1", "string2"))
It should only read one line into memory at a time, unless you keep a reference to that line. After each line is read it will be up to Python's garbage collector to get rid of it. Give this a try and see if it works for you :)
Found the solution, thanks to Gareth Latty. Using an iterator:
def read_in_chunks(file, chunk_size=1000):
    while True:
        data = file.read(chunk_size)
        if not data:
            break
        yield data
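A hypothetical usage of that generator, reusing the question's filenames; note that a plain chunked replace can miss matches that straddle a chunk boundary:

with open("filename.txt") as f, open("newfile.txt", "w") as newfile:
    for chunk in read_in_chunks(f):
        newfile.write(chunk.replace("string1", "string2"))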
This answer was posted as an edit to the question Python Does Not Read Entire Text File by the OP user1297872 under CC BY-SA 3.0.
Please do not behead me for my noob question. I have looked up many other questions on Stack Overflow concerning this topic, but haven't found a solution that works as intended.
The Problem:
I have a fairly large txt file (about 5 MB) that I want to copy via readlines() or any other built-in string-handling function into a new file. For smaller files the following code sure works (only schematically coded here):
f = open('C:/.../old.txt', 'r');
n = open('C:/.../new.txt', 'w');
for line in f:
    print(line, file=n);
However, as I found out here (UnicodeDecodeError: 'charmap' codec can't encode character X at position Y: character maps to undefined), internal restrictions of Windows prohibit this from working on larger files. So far, the only solution I came up with is the following:
f = open('C:/.../old.txt', 'r', encoding='utf8', errors='ignore');
n = open('C:/.../new.txt', 'a');
for line in f:
    print(line, file=sys.stderr) and append(line, file='C:/.../new.txt');
f.close();
n.close();
But this doesn't work. I do get a new.txt file, but it is empty. So, how do I iterate through a long txt file and write every line into a new txt file? Is there a way to read sys.stderr as the source for the new file? (I actually don't have any idea what this sys.stderr is.)
I know this is a noob question, but I don't know where to look for an answer anymore.
Thanks in advance!
There is no need to use print(); just write to the file:
with open('C:/.../old.txt', 'r') as f, open('C:/.../new.txt', 'w') as n:
    n.writelines(f)
However, it sounds like you may have an encoding issue, so make sure that both files are opened with the correct encoding. If you provide the error output perhaps more help can be provided.
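If the input really is UTF-8, a sketch along these lines (reusing the question's elided paths; the encoding arguments are an assumption) should sidestep the charmap error:

with open('C:/.../old.txt', 'r', encoding='utf8', errors='ignore') as f:
    with open('C:/.../new.txt', 'w', encoding='utf8') as n:
        n.writelines(f)  # both files use an explicit encoding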
BTW: Python doesn't use ; as a line terminator, it can be used to separate 2 statements if you want to put them on the same line but this is generally considered bad form.
You can redirect standard output to a file, as in the code below.
I successfully copied a 6 MB text file with this.
import sys

bigoutput = open("bigcopy.txt", "w")
sys.stdout = bigoutput
with open("big.txt", "r") as biginput:
    for bigline in biginput.readlines():
        print(bigline.replace("\n", ""))
bigoutput.close()
Why don't you just use the shutil module and copy the file?
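For a byte-for-byte copy with no per-line processing, that would be just (a sketch reusing the question's elided paths):

import shutil

shutil.copyfile('C:/.../old.txt', 'C:/.../new.txt')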
You can try this code; it works for me.
with open("file_path/../large_file.txt") as f:
with open("file_path/../new_file", "wb") as new_f:
new_f.writelines(f.readlines())
new_f.close()
f.close()
I'm having a problem understanding why my Python program does what it does when reading (first) lines from files and adding the lines into a list. For some reason the first line needs to be empty or it'll not read the first line correctly. And if the first line is empty, Python doesn't treat it as empty.
The thing is, I have two types of files:
First file is in the form:
text:more text
another text:and more
and the second file in the form:
text_file.txt
anothertext_file.txt
Both files are UTF-8 encoded text files. The first line of each file that gets added to a list in my program is "text" and "text_file.txt" respectively, but any code that, for example, tries to say
if something == "text":
...
will not get executed even if the "something" is the same as the "text".
So I'm assuming that my problem is that somewhere along the way, my computer writes some invisible characters at the beginning of the text file, and that makes the first line not what it seems. Maybe? I have actually found a solution for the problem simply by adding an empty line and an if clause when reading the file line by line:
if not "." in line:
...
and in the other filetype:
if not ":" in line:
...
Those if clauses work and my program does what it's supposed to (as long as I always add an empty line to the beginning of the file), but I haven't been able to find the real reason why my program behaves this way. Also, I would rather not resort to this kind of workaround if there's an easier solution that doesn't involve editing all my files and adding if clauses to my code.
Would appreciate any help understanding what's happening here!
Edit: as you people have been asking for my code, here it is:
filelist = []
with open("filename.txt", "r", encoding="UTF-8") as f:
    for line in f:
        filelist.append(line.rstrip("\n"))
This does not work properly. Also I tried it like mxds said,
filelist = []
with open("filename.txt", "r", encoding="UTF-8") as f:
    lines = f.readlines()
for line in lines:
    filelist.append(line.rstrip("\n"))
and this does not work either. The problem only shows up in the first character of the first line.
Edit2:
It seems the problem is a byte order mark (BOM) at the beginning of my text files. After some quick googling I didn't find a solution for how to remove it. I'm creating my files with plain Windows Notepad.
Final edit:
Apparently Notepad is not a real text editor. I guess I'll just swap over from Notepad to Notepad++ to avoid this problem. However, just in case I still have to handle my files in Notepad: if I open an existing text file in Notepad and add some text to it, will it add a BOM, or does it do that only when creating the file?
Looks like you've already done the legwork on this, but according to How to make Notepad to save text in UTF-8 without BOM?, the best answer is not to use Notepad (but Notepad++ is ok). :)
Alternatively, you can strip the BOM in Python; on Python 2 byte strings this looks like:
line = line.decode("utf-8-sig").encode("utf-8")
See https://docs.python.org/3/library/codecs.html:
To increase the reliability with which a UTF-8 encoding can be detected, Microsoft invented a variant of UTF-8 (that Python 2.5 calls "utf-8-sig") for its Notepad program: Before any of the Unicode characters is written to the file, a UTF-8 encoded BOM (which looks like this as a byte sequence: 0xef, 0xbb, 0xbf) is written. ... On decoding utf-8-sig will skip those three bytes if they appear as the first three bytes in the file. In UTF-8, the use of the BOM is discouraged and should generally be avoided.
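Since the question's code is Python 3 (it passes encoding= to open()), the simplest fix may be to open the file with the utf-8-sig codec directly; a sketch reusing the question's own loop:

filelist = []
with open("filename.txt", "r", encoding="utf-8-sig") as f:  # skips a leading BOM if present
    for line in f:
        filelist.append(line.rstrip("\n"))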
A classic approach to reading text files in Python is:
with open(fname, 'r') as f:
    lines = f.readlines()
After which you can process the lines like this:
for line in lines:
    # do something with line...
As other comments have hinted, you may want to make sure this works first. It would help if you post your current code for review.
I just had a similar issue: Python's readlines() reports invalid characters at the head of the first line, something like ﻿ (the UTF-8 BOM bytes misdecoded). I have tried all the suggestions I could google, with no luck.
I came up with a simple trick: add a blank line as the first line of the text file, then skip it:
if len(lines[i]) > len(lines[0]):
    pass  # do things
else:
    pass  # skip this line
In my case len(lines[0]) == 4; all other lines are longer than 4.
I am attempting to output a new txt file but it comes up blank. I am doing this:
my_file = open("something.txt","w")
#and then
my_file.write("hello")
Right after this line the interpreter just prints 5, and then no text shows up in the file.
What am I doing wrong?
The write isn't flushed to disk until you close the file. If I open an interpreter and then enter:
my_file = open('something.txt', 'w')
my_file.write('hello')
and then open the file in a text program, there is no text.
If I then issue:
my_file.close()
Voila! Text!
If you just want to flush once and keep writing, you can do that too:
my_file.flush()
my_file.write('\nhello again') # file still says 'hello'
my_file.flush() # now it says 'hello again' on the next line
By the way, if you happen to read the beautiful, wonderful documentation for file.write, which is only 2 lines long, you would have your answer (emphasis mine):
Write a string to the file. There is no return value. Due to buffering, the string may not actually show up in the file until the flush() or close() method is called.
If you don't want to worry about closing the file, use with:
with open("something.txt","w") as f:
f.write('hello')
Then Python will take care of closing the file for you automatically.
As Two-Bit Alchemist pointed out, the file has to be closed. The Python file writer uses a buffer (BufferedIOBase, I think), meaning it collects a certain number of bytes before writing them to disk in bulk. This is done to save overhead when a lot of write operations are performed on a single file.
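As an illustration of that buffering (a sketch, not part of the original answer): line buffering makes text-mode writes reach the OS at every newline, before any close():

myfile = open("somefile.txt", "w", buffering=1)  # 1 selects line buffering (text mode only)
myfile.write("42\n")  # the trailing newline triggers a flush to the OS
# "42" is already visible on disk here, even though the file isn't closed yet
myfile.close()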
Also: When working with files, try using a with-environment to make sure your file is closed after you are done writing/reading:
with open("somefile.txt", "w") as myfile:
myfile.write("42")
# when you reach this point, i.e. leave the with-environment,
# the file is closed automatically.
As m00am explains above, the Python file writer buffers output, collecting a certain number of bytes before writing them to disk in bulk.
Your code is also OK. Just add a statement to close the file, and it will work correctly:
my_file = open("fin.txt","w")
#and then
my_file.write("hello")
my_file.close()
This question already has answers here: Unable to read huge (20GB) file from CPython (2 answers). Closed 9 years ago.
I'm a Python newbie and had a quick question regarding memory usage when reading large text files. I have a ~13 GB CSV I'm trying to read line by line, following the Python documentation and more experienced Python users' advice not to use readlines(), in order to avoid loading the entire file into memory.
When trying to read a line from the file I get the error below and am not sure what might be causing it. Besides this error, I also notice my PC's memory usage is excessively high. This was a little surprising since my understanding of the readline function is that it only loads a single line from the file at a time into memory.
For reference, I'm using Continuum Analytic's Anaconda distribution of Python 2.7 and PyScripter as my IDE for debugging and testing. Any help or insight is appreciated.
with open(R'C:\temp\datasets\a13GBfile.csv', 'r') as f:
    foo = f.readline()  # <-- Err: SystemError: ..\Objects\stringobject.c:3902 bad argument to internal function
UPDATE:
Thank you all for the quick, informative and very helpful feedback. I reviewed the referenced link, which describes exactly the problem I was having. After applying the documented 'rU' mode option I was able to read lines from the file like normal. I hadn't noticed this mode mentioned in the documentation link I was referencing initially, and neglected to look at the details of the open function first. Thanks again.
Unix text files end each line with \n.
Windows text files end each line with \r\n.
When you open a file in text mode, 'r', Python assumes it has the native line endings for your platform.
So, if you open a Unix text file on Windows, Python will look for \r\n sequences to split the lines. But there won't be any, so it'll treat your whole file as one giant 13-billion-character line. So that readline() call ends up trying to read the whole thing into memory.
The fix for this is to use universal newlines mode, by opening the file in mode rU. As explained in the docs for open:
supplying 'U' opens the file as a text file, but lines may be terminated by any of the following: the Unix end-of-line convention '\n', the Macintosh convention '\r', or the Windows convention '\r\n'.
So, instead of searching for \r\n sequences to split the lines, it looks for \r\n, or \n, or \r. And there are millions of \n. So, the problem is solved.
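In the question's terms, the one-line change would look like this (a sketch; Python 2 universal-newlines mode):

with open(R'C:\temp\datasets\a13GBfile.csv', 'rU') as f:
    foo = f.readline()  # now returns a single, sanely delimited line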
A different way to fix this is to use binary mode, 'rb'. In this mode, Python doesn't do any conversion at all, and assumes all lines end in \n, no matter what platform you're on.
On its own, this is pretty hacky—it means you'll end up with an extra \r on the end of every line in a Windows text file.
But it means you can pass the file on to a higher-level file reader like csv that wants binary files, so it can parse them the way it wants to. On top of magically solving this problem for you, a higher-level library will also probably make the rest of your code a lot simpler and more robust. For example, it might look something like this:
import csv

with open(R'C:\temp\datasets\a13GBfile.csv', 'rb') as f:
    for row in csv.reader(f):
        pass  # do stuff with each row
Now each row is automatically split on commas, except that commas that are inside quotes or escaped in the appropriate way don't count, and so on, so all you need to deal with is a list of column values.
I'm learning Python, and have run into a bit of a problem. On my OSX install of Python 3.1, this happens in the console:
>>> filename = "test"
>>> reader = open(filename, 'r')
>>> writer = open(filename, 'w')
>>> reader.read()
''
>>> writer.write("hello world\n")
12
>>> reader.read()
''
And running more test in bash confirms that there is nothing in test. What's going on?
Thanks.
There are two potential reasons why you are seeing this behaviour.
When you open a file for writing (with the "w" open mode in Python), the OS truncates the original file to zero length. So by opening the file for reading first and then opening it for writing, you destroy whatever contents the reading handle could have returned; both handles now refer to the same, initially empty file.
After you swap the order of opening so you open for writing and then reading, you won't necessarily be able to read the data from the file until you flush it:
>>> writer.flush()
>>> reader.read()
'hello world\n'
Flushing the file writes any data that might be in Python's file buffers to the OS, so that when you read from the file from the other handle, the OS will return the data. Note that Python itself doesn't know these two handles refer to the same file, but the OS does.
You're probably trashing your file. It's not usually a good idea to open a file for reading and writing at the same time.
Buffering. If you really want to read and write to the same file, open one handle using "w+".
And with the buffering, you will need to force the buffer to be emptied before reading back. Closing the file is a good way to do this.
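A minimal sketch of that "w+" approach; the seek back to the start before reading is an added assumption, since a freshly written handle is positioned at the end of the file:

with open("test", "w+") as f:
    f.write("hello world\n")
    f.flush()        # force Python's buffer out to the OS
    f.seek(0)        # rewind to the beginning before reading back
    print(f.read())  # prints: hello world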