Python readline() fails when trying to read large (~13GB) csv file [duplicate]

This question already has answers here:
Unable to read huge (20GB) file from CPython
(2 answers)
Closed 9 years ago.
I'm a Python newbie and had a quick question regarding memory usage when reading large text files. I have a ~13GB csv I'm trying to read line by line, following the Python documentation and more experienced Python users' advice not to use readlines(), in order to avoid loading the entire file into memory.
When trying to read a line from the file I get the error below and am not sure what might be causing it. Besides this error, I also notice my PC's memory usage is excessively high. This was a little surprising since my understanding of the readline function is that it only loads a single line from the file at a time into memory.
For reference, I'm using Continuum Analytic's Anaconda distribution of Python 2.7 and PyScripter as my IDE for debugging and testing. Any help or insight is appreciated.
with open(R'C:\temp\datasets\a13GBfile.csv', 'r') as f:
    foo = f.readline()  # <-- Err: SystemError: ..\Objects\stringobject.c:3902 bad argument to internal function
UPDATE:
Thank you all for the quick, informative and very helpful feedback. I reviewed the referenced link, which is exactly the problem I was having. After applying the documented 'rU' mode I was able to read lines from the file like normal. I didn't notice this mode mentioned in the documentation link I was referencing initially, and neglected to look at the details of the open function first. Thanks again.

Unix text files end each line with \n.
Windows text files end each line with \r\n.
When you open a file in text mode, 'r', Python assumes it has the native line endings for your platform.
So, if you open a Unix text file on Windows, Python will look for \r\n sequences to split the lines. But there won't be any, so it'll treat your whole file as one giant 13-billion-character line. So that readline() call ends up trying to read the whole thing into memory.
The fix for this is to use universal newlines mode, by opening the file in mode rU. As explained in the docs for open:
supplying 'U' opens the file as a text file, but lines may be terminated by any of the following: the Unix end-of-line convention '\n', the Macintosh convention '\r', or the Windows convention '\r\n'.
So, instead of searching for \r\n sequences to split the lines, it looks for \r\n, or \n, or \r. And there are millions of \n. So, the problem is solved.
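In other words, one changed character fixes the snippet from the question (a minimal sketch of the fix):

# 'rU' = universal newlines (Python 2): \n, \r, or \r\n all end a line
with open(R'C:\temp\datasets\a13GBfile.csv', 'rU') as f:
    foo = f.readline()  # now reads a single line, not the whole file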
A different way to fix this is to use binary mode, 'rb'. In this mode, Python doesn't do any conversion at all, and assumes all lines end in \n, no matter what platform you're on.
On its own, this is pretty hacky—it means you'll end up with an extra \r on the end of every line in a Windows text file.
But it means you can pass the file on to a higher-level file reader like csv that wants binary files, so it can parse them the way it wants to. On top of magically solving this problem for you, a higher-level library will also probably make the rest of your code a lot simpler and more robust. For example, it might look something like this:
import csv

with open(R'C:\temp\datasets\a13GBfile.csv', 'rb') as f:
    for row in csv.reader(f):
        pass  # do stuff with each row here
Now each row is automatically split on commas, except that commas that are inside quotes or escaped in the appropriate way don't count, and so on, so all you need to deal with is a list of column values.
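For instance, with a couple of hypothetical sample rows, a quoted comma stays inside its field:

import csv

sample = ['alice,"1,000",NY', 'bob,250,LA']  # hypothetical data
for row in csv.reader(sample):
    print row
# ['alice', '1,000', 'NY']
# ['bob', '250', 'LA']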

Related

How to print lines in a file by seeking a position from another text file in python?

This is my code for getting lines from a file by seeking a position using the f.seek() method, but I am getting wrong output: it starts printing from the middle of the first line.
Can you help me solve this, please?
f=open(r"sample_text_file","r")
last_pos=int(f.read())
f1=open(r"C:\Users\ddadi\Documents\project2\check\log_file3.log","r")
f1.seek(last_pos)
for i in f1:
print i
last_position=f1.tell()
with open('sample_text.txt', 'a') as handle:
handle.write(str(last_position))
The sample_text file contains the file-pointer offset returned by f1.tell().
If it's printing from the middle of a line that's almost certainly because your offset is wrong. You don't explain how you came by the magic number you use as an argument to seek, and without that information it's difficult to help more precisely.
One thing is, however, rather important. It's not a very good idea to use seek on a file that is open in text mode (the default in Python). Try using open(..., 'rb') and see if the process becomes a little more predictable. It sounds as though you may have got the offset by counting characters after reading in text mode, but good old Windows includes carriage return characters in text files, which are removed by the Python I/O routines before your program sees the text.
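A sketch of that pattern (paths as in the question; the offset file is called sample_text.txt throughout, since the question uses two slightly different names for it). Binary mode keeps tell() and seek() byte-exact, and readline() is used instead of for-iteration because iterating a file in Python 2 buffers ahead and skews tell(). Note also that the original appends ('a') each new offset, so the offset file accumulates numbers; 'w' keeps only the latest:

with open("sample_text.txt", "r") as f:
    last_pos = int(f.read())

with open(r"C:\Users\ddadi\Documents\project2\check\log_file3.log", "rb") as log:
    log.seek(last_pos)
    while True:
        line = log.readline()  # readline(), not "for line in log":
        if not line:           # iteration buffers ahead and skews tell()
            break
        print line.rstrip()    # binary mode leaves \r\n in place; strip it
    last_position = log.tell()

with open("sample_text.txt", "w") as handle:  # 'w', not 'a': keep only the newest offset
    handle.write(str(last_position))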

Python - failing to read correctly the first line of a text file to a list

I'm having a problem understanding why my python program does what it does when reading (first) lines from files and adding the lines into a list. For some reason the first line needs to be empty or it won't be read correctly, and even then the "empty" first line isn't actually empty (at least not according to Python).
The thing is, I have two types of files:
First file is in the form:
text:more text
another text:and more
and the second file in the form:
text_file.txt
anothertext_file.txt
Both files are UTF-8 encoded text files. The first line of both files that gets added to a list in my program, is "text" and "text_file.txt" but any code that for example tries to say
if something == "text":
    ...
will not get executed even if the "something" is the same as the "text".
So I'm assuming that my problem is that somewhere in the machine code (or something), my computer writes some invisible code in the beginning of the text file and that makes the first line not what it is. Maybe? I have actually found a solution for the problem simply by adding an empty line and an if clause when reading the file line by line:
if not "." in line:
...
and in the other filetype:
if not ":" in line:
...
Those if clauses work and my program does what it's supposed to (as long as I always add an empty line to the beginning of the file), but I haven't been able to find a real reason why my program behaves this way. Also, I would like to avoid this kind of workaround if there's an easier solution that doesn't involve editing all my files and adding if clauses to my code.
Would appreciate any help understanding what's happening here!
Edit: as you people have been asking for my code, here it is:
filelist = []
with open("filename.txt", "r", encoding="UTF-8") as f:
    for line in f:
        filelist.append(line.rstrip("\n"))
This does not work properly. Also I tried it like mxds said,
filelist = []
with open("filename.txt", "r", encoding="UTF-8") as f:
    lines = f.readlines()
for line in lines:
    filelist.append(line.rstrip("\n"))
and this does not work either. The problem only shows up in the first character of the first line of each file.
Edit2:
It seems the problem is a Byte Order Mark at the beginning of my text files. After some quick googling I didn't find a solution for how to remove it. I'm creating my files with plain Windows Notepad.
Final edit:
Apparently notepad is not a real text editor. I guess I'll just swap over from notepad to notepad++ to avoid this problem. However, just in case I'll have to handle my files in notepad: If I open a textfile in notepad and add some text in it, will it add a BOM or should it do that only in the creating of the file?
Looks like you've already done the legwork on this, but according to How to make Notepad to save text in UTF-8 without BOM?, the best answer is not to use Notepad (but Notepad++ is ok). :)
Alternatively, you can strip the BOM in Python with:
line = line.decode("utf-8-sig").encode("utf-8")
See https://docs.python.org/3/library/codecs.html:
To increase the reliability with which a UTF-8 encoding can be
detected, Microsoft invented a variant of UTF-8 (that Python 2.5 calls
"utf-8-sig") for its Notepad program: Before any of the Unicode
characters is written to the file, a UTF-8 encoded BOM (which looks
like this as a byte sequence: 0xef, 0xbb, 0xbf) is written.
...
On decoding utf-8-sig will skip those three bytes if they appear as the first three bytes in the file. In UTF-8, the use of the BOM is discouraged and should generally be avoided.
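Since the asker's open(..., encoding=...) calls indicate Python 3, the simplest fix is to hand that codec straight to open(); a minimal sketch reusing the question's own loop:

filelist = []
# "utf-8-sig" skips a leading BOM if one is present and otherwise
# behaves exactly like plain "utf-8"
with open("filename.txt", "r", encoding="utf-8-sig") as f:
    for line in f:
        filelist.append(line.rstrip("\n"))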
A classic approach to reading text files in Python is:
with open(fname, 'r') as f:
    lines = f.readlines()
After which you can process the lines like this:
for line in lines:
    # do something with line...
As other comments have hinted, you may want to make sure this works first. It would help if you post your current code for review.
I just had a similar issue: python readlines() reports invalid chars at the head of the first line (the UTF-8 BOM, which typically renders as ï»¿). I tried all the suggestions I could google, with no luck.
I came up with a simple trick: add a blank line as the first line of the text file, then skip that line with
if len(line[i]) > len(line[0]):
    # do things
else:
    # skip the line
In my case len(line[0]) == 4 (presumably the three BOM bytes plus the newline); all other lines are longer than 4.

Python redirect/write output to file?

I want to write the output to a file, but all I got was "None", even for words that have synonyms.
Note: when I am not writing to the file, the output works perfectly fine. Another note: the output appears on the screen whether I am writing to a file or not, but I get "None" in the file. Is there any way to fix it?
[I'm using Python v2.7, Mac version]
file=open("INPUT.txt","w") #opening a file
for xxx in Diacritics:
print xxx
synsets = wn.get_synsetids_from_word(xxx) or []
for s in synsets:
file.write(str(wn._items[s].describe()))
I tried to simplify the question and rewrote your code so that it's an independent test that you should be able to run, and eventually modify, if that was the problem with your code.
test = "Is this a real life? Is this fantasy? Caught in a test slide..."
with open('test.txt', 'w') as f:
for word in test.split():
f.write(word) # test.txt output: Isthisareallife?Isthisfantasy?Caughtinatestslide...
A side note: it almost sounds like you want to append rather than truncate, but I am not sure, so take a look at this:
The most commonly-used values of mode are 'r' for reading, 'w' for writing (truncating the file if it already exists), and 'a' for appending (which on some Unix systems means that all writes append to the end of the file regardless of the current seek position). If mode is omitted, it defaults to 'r'. The default is to use text mode, which may convert '\n' characters to a platform-specific representation on writing and back on reading. Thus, when opening a binary file, you should append 'b' to the mode value to open the file in binary mode, which will improve portability. (Appending 'b' is useful even on systems that don’t treat binary and text files differently, where it serves as documentation.) See below for more possible values of mode.
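A minimal sketch of the difference, using a hypothetical demo.txt:

with open('demo.txt', 'w') as f:  # 'w' truncates any existing contents
    f.write('first\n')
with open('demo.txt', 'w') as f:
    f.write('second\n')           # demo.txt now holds only "second"

with open('demo.txt', 'a') as f:  # 'a' appends to whatever is there
    f.write('third\n')            # demo.txt now holds "second" then "third"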
file.write() is going to write whatever is returned by the describe() command. Because 'None' is being written, and because output always goes to the screen, the problem is that describe is writing to the screen directly (probably with print) and returning None.
You need to use some other method besides describe, or give the correct parameters to describe to have it return the strings instead of printing them, or file a bug report. (I am not familiar with that package, so I don't know which is the correct course of action.)
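If describe() really does print and return None, one possible workaround (a sketch, not specific to that package, whose API I don't know) is to capture stdout around the call:

import sys
from StringIO import StringIO  # Python 2; use io.StringIO on Python 3

def capture_output(func, *args, **kwargs):
    """Call func and return whatever it printed as a string."""
    saved_stdout = sys.stdout
    sys.stdout = buf = StringIO()
    try:
        func(*args, **kwargs)
    finally:
        sys.stdout = saved_stdout  # always restore, even on error
    return buf.getvalue()

# hypothetical usage with the question's objects:
# file.write(capture_output(wn._items[s].describe))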

Reading file in Python one line at a time

I do appreciate that this question has been asked millions of times, but I can't figure out why, while attempting to read a .txt file line by line, I get the entire file read in one go.
This is my little snippet
num = 0
with open(inStream, "r") as f:
    for line in f:
        num += 1
        print line + " ..."
print num
Having a look at the open function, there is nothing that suggests a second param to limit the reading, as that is just the "mode" to open the file.
So I can only guess there is some problem with my file, but it is a txt file with entries line by line.
Any hint?
Without a little more information, it's hard to be absolutely sure… but most likely, your problem is inappropriate line endings.
For example, on a modern Mac OS X system, lines in text files end with '\n' newline characters. So, when you do for line in f:, Python breaks the text file on '\n' characters.
But on classic Mac OS 9, lines in text files ended with '\r' instead. If you have some ancient classic Mac text files lying around, and you give one to Python, it will go looking for '\n' characters and not find any, so it'll think the whole file is one giant line.
(Of course in real life, Windows is a problem more often than classic Mac OS, but I used this example because it's simpler.)
Python 2: Fortunately, Python has a feature called "universal newlines". For full details, see the link, but the short version is that adding "U" onto the end of the mode when opening a text file means Python will read any of the three standard line-ending conventions (and give them to your code as Unix-style '\n').
In other words, just change one line:
with open(inStream, "rU") as f:
Python 3: Universal newlines are part of the standard behavior; adding "U" has no effect and is deprecated.
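For completeness, a minimal Python 3 sketch; newline=None is the default and is only spelled out here for clarity:

# Python 3: newline=None (the default) converts '\n', '\r', and
# '\r\n' all to '\n' on read, so no 'U' flag is needed
inStream = "input.txt"  # hypothetical path; the question's variable name
with open(inStream, "r", newline=None) as f:
    for line in f:
        print(line, end="")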

Python Does Not Read Entire Text File

I'm running into a problem that I haven't seen anyone on StackOverflow encounter, or even on Google for that matter.
My main goal is to be able to replace occurrences of a string in the file with another string. Is there a way to access all of the lines in the file?
The problem is that when I try to read in a large text file (1-2 gb) of text, python only reads a subset of it.
For example, I'll do a really simple command such as:
newfile = open("newfile.txt", "w")
f = open("filename.txt", "r")
for line in f:
    replaced = line.replace("string1", "string2")
    newfile.write(replaced)
And it only writes the first 382 mb of the original file. Has anyone encountered this problem previously?
I tried a few different solutions such as using:
import fileinput
import sys

for i, line in enumerate(fileinput.input("filename.txt", inplace=1)):
    sys.stdout.write(line.replace("string1", "string2"))
But it has the same effect. Nor does reading the file in chunks such as using
f.read(10000)
I've narrowed it down to most likely being a reading problem and not a writing problem, because it happens even when simply printing out lines. I know that there are more lines: when I open the file in a full text editor such as Vim, I can see what the last line should be, and it is not the last line that python prints.
Can anyone offer any advice or things to try?
I'm currently using a 32-bit version of Windows XP with 3.25 gb of ram, and running Python 2.7
Try:
f = open("filename.txt", "rb")
On Windows, rb means open the file in binary mode. According to the docs, text mode vs. binary mode only has an impact on end-of-line characters. But (if I remember correctly) I believe opening files in text mode on Windows also treats a stray EOF character (hex 1A) as the end of the file, which would explain reading stopping partway through.
You can also specify the mode when using fileinput:
fileinput.input("filename.txt", inplace=1, mode="rb")
Are you sure the problem is with reading and not with writing out?
Do you close the file that is written to, either explicitly newfile.close() or using the with construct?
Not closing the output file is often the source of such problems when buffering is going on somewhere. If that's the case in your setting too, closing should fix your initial solutions.
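A minimal sketch combining both suggestions, with each file managed by with so the output is flushed and closed automatically (and 'rb'/'wb' sidesteps the Windows text-mode quirk from the accepted answer):

with open("filename.txt", "rb") as f, open("newfile.txt", "wb") as newfile:
    for line in f:
        newfile.write(line.replace("string1", "string2"))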
If you use the file like this:
with open("filename.txt") as f:
    for line in f:
        newfile.write(line.replace("string1", "string2"))
It should only read into memory one line at a time, unless you keep a reference to that line in memory.
After each line is read it will be up to Python's garbage collector to get rid of it. Give this a try and see if it works for you :)
Found the solution thanks to Gareth Latty. Using an iterator:
def read_in_chunks(file, chunk_size=1000):
    while True:
        data = file.read(chunk_size)
        if not data:
            break
        yield data
This answer was posted as an edit to the question Python Does Not Read Entire Text File by the OP user1297872 under CC BY-SA 3.0.
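For reference, a usage sketch for that generator, reusing the question's file names. One caveat worth noting: replace() on raw chunks can miss a match that straddles a chunk boundary, which line-based reading avoids:

with open("filename.txt", "rb") as f, open("newfile.txt", "wb") as newfile:
    for chunk in read_in_chunks(f):
        # caveat: a match split across two chunks is not replaced
        newfile.write(chunk.replace("string1", "string2"))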
