Reading file in Python one line at a time

I do appreciate this question has been asked a million times, but I can't figure out why, while attempting to read a .txt file line by line, I get the entire file read in one go.
This is my little snippet
num = 0
with open(inStream, "r") as f:
    for line in f:
        num += 1
        print line + " ..."
print num
Having a look at the open function, there isn't anything that suggests a second param to limit the reading, as that is just the "mode" to open the file.
So I can only guess there is some problem with my file, but it is a plain .txt file with one entry per line.
Any hint?

Without a little more information, it's hard to be absolutely sure… but most likely, your problem is inappropriate line endings.
For example, on a modern Mac OS X system, lines in text files end with '\n' newline characters. So, when you do for line in f:, Python breaks the text file on '\n' characters.
But on classic Mac OS 9, lines in text files ended with '\r' instead. If you have some ancient classic Mac text files lying around, and you give one to Python, it will go looking for '\n' characters and not find any, so it'll think the whole file is one giant line.
(Of course in real life, Windows is a problem more often than classic Mac OS, but I used this example because it's simpler.)
Python 2: Fortunately, Python has a feature called "universal newlines". For full details, see the link, but the short version is that adding "U" onto the end of the mode when opening a text file means Python will read any of the three standard line-ending conventions (and give them to your code as Unix-style '\n').
In other words, just change one line:
with open(inStream, "rU") as f:
Python 3: Universal newlines are part of the standard behavior; adding "U" has no effect and is deprecated.
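For reference, here is a Python 3 version of the snippet above (a minimal sketch; newline=None is already open's default and enables the same universal-newline translation):
num = 0
with open(inStream, "r", newline=None) as f:  # newline=None: '\r', '\n', and '\r\n' all become '\n'
    for line in f:
        num += 1
        print(line.rstrip("\n") + " ...")
print(num)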

Related

Python - failing to read correctly the first line of a text file to a list

I'm having a problem understanding why my Python program does what it does when reading (first) lines from files and adding the lines to a list. For some reason the first line needs to be empty or it won't read the first line correctly. And if the first line is empty, it's not actually empty (at least not according to Python).
The thing is, I have two types of files:
First file is in the form:
text:more text
another text:and more
and the second file in the form:
text_file.txt
anothertext_file.txt
Both files are UTF-8 encoded text files. The first lines that get added to a list in my program are "text" and "text_file.txt" respectively, but any code that, for example, tries to say
if something == "text":
    ...
will not get executed even if the "something" is the same as the "text".
So I'm assuming that my problem is that somewhere in the machine code (or something), my computer writes some invisible code in the beginning of the text file and that makes the first line not what it is. Maybe? I have actually found a solution for the problem simply by adding an empty line and an if clause when reading the file line by line:
if not "." in line:
...
and in the other filetype:
if not ":" in line:
...
Those if clauses work and my program does what it's supposed to (as long as I always add an empty line to the beginning of the file), but I haven't been able to find the real reason why my program behaves as it does. Also, I would rather not use this kind of workaround if there's an easier solution that doesn't involve editing all my files and adding if clauses to my code.
Would appreciate any help understanding what's happening here!
Edit: as you people have been asking for my code, here it is:
filelist = []
with open("filename.txt", "r", encoding="UTF-8") as f:
for line in f:
filelist.append(line.rstrip("\n"))
This does not work properly. Also I tried it like mxds said,
filelist = []
with open("filename.txt", "r", encoding="UTF-8") as f:
lines = f.readlines()
for line in lines:
filelist.append(line.rstrip("\n"))
and this does not work either. The problem only affects the first character of the first line of the files.
Edit2:
It seems the problem is a byte order mark (BOM) at the beginning of my text files. After some quick googling I didn't find a solution for how to remove it. I'm creating my files with plain Windows Notepad.
Final edit:
Apparently Notepad is not a real text editor. I guess I'll just swap over from Notepad to Notepad++ to avoid this problem. However, just in case I have to handle my files in Notepad: if I open a text file in Notepad and add some text to it, will it add a BOM, or does it only do that when creating the file?
Looks like you've already done the legwork on this, but according to How to make Notepad to save text in UTF-8 without BOM?, the best answer is not to use Notepad (but Notepad++ is ok). :)
Alternatively, you can strip the BOM in Python. On Python 2, where lines read from a file are byte strings, you can round-trip through the "utf-8-sig" codec:
line = line.decode("utf-8-sig").encode("utf-8")
See https://docs.python.org/3/library/codecs.html:
To increase the reliability with which a UTF-8 encoding can be
detected, Microsoft invented a variant of UTF-8 (that Python 2.5 calls
"utf-8-sig") for its Notepad program: Before any of the Unicode
characters is written to the file, a UTF-8 encoded BOM (which looks
like this as a byte sequence: 0xef, 0xbb, 0xbf) is written.
...
On decoding utf-8-sig will skip those three bytes if they appear as the first three bytes in the file. In UTF-8, the use of the BOM is discouraged and should generally be avoided.
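On Python 3 (which your open(..., encoding="UTF-8") call suggests you are using), the simplest fix is to pass that codec straight to open; a minimal sketch using the file name from the question:
filelist = []
with open("filename.txt", "r", encoding="utf-8-sig") as f:  # skips a leading BOM if present
    for line in f:
        filelist.append(line.rstrip("\n"))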
A classic approach to reading text files in Python is:
with open(fname, 'r') as f:
    lines = f.readlines()
After which you can process the lines like this:
for line in lines:
    # do something with line...
As other comments have hinted, you may want to make sure this works first. It would help if you post your current code for review.
I just had a similar issue: Python's readlines() reported invalid characters at the start of the first line (the BOM bytes). I tried all the suggestions I could google, with no luck.
I came up with a simple trick: add a blank line as the first line of the text file, then skip any line that isn't longer than that first line:
if len(lines[i]) > len(lines[0]):
    # do things
else:
    # skip it
In my case len(lines[0]) was 4 (the three BOM bytes plus the newline), and all the other lines were longer than 4.

Python readline() fails when trying to read large (~ 13GB) csv file [duplicate]

This question already has answers here:
Unable to read huge (20GB) file from CPython
(2 answers)
Closed 9 years ago.
I'm a Python newbie and had a quick question regarding memory usage when reading large text files. I have a ~13GB csv I'm trying to read line by line, following the Python documentation and more experienced Python users' advice not to use readlines(), in order to avoid loading the entire file into memory.
When trying to read a line from the file I get the error below and am not sure what might be causing it. Besides this error, I also notice my PC's memory usage is excessively high. This was a little surprising since my understanding of the readline function is that it only loads a single line from the file at a time into memory.
For reference, I'm using Continuum Analytics' Anaconda distribution of Python 2.7 and PyScripter as my IDE for debugging and testing. Any help or insight is appreciated.
with open(R'C:\temp\datasets\a13GBfile.csv', 'r') as f:
    foo = f.readline()  # <-- Err: SystemError: ..\Objects\stringobject.c:3902 bad argument to internal function
UPDATE:
Thank you all for the quick, informative and very helpful feedback, I reviewed the referenced link which is exactly the problem I was having. After applying the documented 'rU' option mode I was able to read lines from the file like normal. I didn't notice this mode mentioned in the documentation link I was referencing initially and neglected to look at the details for the open function first. Thanks again.
Unix text files end each line with \n.
Windows text files end each line with \r\n.
When you open a file in text mode, 'r', Python assumes it has the native line endings for your platform.
So, if you open a Unix text file on Windows, Python will look for \r\n sequences to split the lines. But there won't be any, so it'll treat your whole file as one giant 13-billion-character line. So that readline() call ends up trying to read the whole thing into memory.
The fix for this is to use universal newlines mode, by opening the file in mode rU. As explained in the docs for open:
supplying 'U' opens the file as a text file, but lines may be terminated by any of the following: the Unix end-of-line convention '\n', the Macintosh convention '\r', or the Windows convention '\r\n'.
So, instead of searching for \r\n sequences to split the lines, it looks for \r\n, or \n, or \r. And there are millions of \n. So, the problem is solved.
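Applied to the snippet from the question, the fix is a single character in the mode:
with open(R'C:\temp\datasets\a13GBfile.csv', 'rU') as f:
    foo = f.readline()  # now stops at the first line ending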
A different way to fix this is to use binary mode, 'rb'. In this mode, Python doesn't do any conversion at all, and assumes all lines end in \n, no matter what platform you're on.
On its own, this is pretty hacky—it means you'll end up with an extra \r on the end of every line in a Windows text file.
But it means you can pass the file on to a higher-level file reader like csv that wants binary files, so it can parse them the way it wants to. On top of magically solving this problem for you, a higher-level library will also probably make the rest of your code a lot simpler and more robust. For example, it might look something like this:
import csv

with open(R'C:\temp\datasets\a13GBfile.csv', 'rb') as f:
    for row in csv.reader(f):
        # do stuff
Now each row is automatically split on commas, except that commas that are inside quotes or escaped in the appropriate way don't count, and so on, so all you need to deal with is a list of column values.

Python Does Not Read Entire Text File

I'm running into a problem that I haven't seen anyone on StackOverflow encounter or even google for that matter.
My main goal is to replace occurrences of a string in the file with another string. Is there a way to access all of the lines in the file?
The problem is that when I try to read in a large text file (1-2 gb) of text, python only reads a subset of it.
For example, I'll run a really simple command such as:
newfile = open("newfile.txt", "w")
f = open("filename.txt", "r")
for line in f:
    replaced = line.replace("string1", "string2")
    newfile.write(replaced)
And it only writes the first 382 mb of the original file. Has anyone encountered this problem previously?
I tried a few different solutions such as using:
import fileinput
import sys

for i, line in enumerate(fileinput.input("filename.txt", inplace=1)):
    sys.stdout.write(line.replace("string1", "string2"))
But it has the same effect. Nor does reading the file in chunks such as using
f.read(10000)
I've narrowed it down to most likely being a reading problem and not a writing problem, because the same thing happens when simply printing out lines. I know that there are more lines: when I open the file in a full text editor such as Vim, I can see what the last line should be, and it is not the last line that Python prints.
Can anyone offer any advice or things to try?
I'm currently using a 32-bit version of Windows XP with 3.25 gb of ram, and running Python 2.7
Try:
f = open("filename.txt", "rb")
On Windows, rb means open the file in binary mode. According to the docs, text mode vs. binary mode only has an impact on end-of-line characters. But (if I remember correctly) opening files in text mode on Windows also treats a Ctrl-Z character (hex 1A) as end-of-file, which would explain the file being cut short partway through.
You can also specify the mode when using fileinput:
fileinput.input("filename.txt", inplace=1, mode="rb")
Are you sure the problem is with reading and not with writing out?
Do you close the file that is written to, either explicitly newfile.close() or using the with construct?
Not closing the output file is often the source of such problems when buffering is going on somewhere. If that's the case in your setting too, closing should fix your initial solutions.
If you use the file like this:
with open("filename.txt") as f:
for line in f:
newfile.write(line.replace("string1", "string2"))
It should only read into memory one line at a time, unless you keep a reference to that line in memory.
After each line is read it will be up to Python's garbage collector to get rid of it. Give this a try and see if it works for you :)
Found the solution thanks to Gareth Latty. Using an iterator:
def read_in_chunks(file, chunk_size=1000):
    while True:
        data = file.read(chunk_size)
        if not data:
            break
        yield data
This answer was posted as an edit to the question Python Does Not Read Entire Text File by the OP user1297872 under CC BY-SA 3.0.
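A minimal sketch of how the generator might drive the replacement task (note the caveat that a match straddling a chunk boundary would be missed; a larger chunk_size reduces, but does not eliminate, that risk):
with open("filename.txt", "r") as infile, open("newfile.txt", "w") as outfile:
    for chunk in read_in_chunks(infile):
        outfile.write(chunk.replace("string1", "string2"))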

editing a single .txt line in python 3.1

I have some data stored in a .txt file in this format:
----------|||||||||||||||||||||||||-----------|||||||||||
1029450386abcdefghijklmnopqrstuvwxy0293847719184756301943
1020414646canBeFollowedBySpaces 3292532113435532419963
don't ask...
I have many lines of this, and I need a way to add more digits to the end of a particular line.
I've written code to find the line I want, but I'm stumped as to how to add 11 characters to the end of it. I've looked around, and this site has been helpful with some other issues I've run into, but I can't seem to find what I need for this.
It is important that the line retain its position in the file, and its contents in their current order.
Using Python 3.1, how would you turn this:
1020414646canBeFollowedBySpaces 3292532113435532419963
into
1020414646canBeFollowedBySpaces 329253211343553241996301846372998
As a general principle, there's no shortcut to "inserting" new data in the middle of a text file. You will need to make a copy of the entire original file in a new file, modifying your desired line(s) of text on the way.
For example:
with open("input.txt") as infile:
with open("output.txt", "w") as outfile:
for s in infile:
s = s.rstrip() # remove trailing newline
if "target" in s:
s += "0123456789"
print(s, file=outfile)
os.rename("input.txt", "input.txt.original")
os.rename("output.txt", "input.txt")
Check out the fileinput module; it can do a sort of "in-place" edit of a file, though I believe temporary files are still involved internally.
import fileinput

for line in fileinput.input('input.txt', inplace=1, backup='.orig'):
    if line.startswith('1020414646canBeFollowedBySpaces'):
        line = line.rstrip() + '01846372998' + '\n'
    print(line, end='')
The print now prints to the file instead of the console.
You might want to back up your original file before editing.
# In binary mode the data are bytes, so on Python 3 both literals need the b prefix.
target_chain = b'1020414646canBeFollowedBySpaces 3292532113435532419963'
to_add = b'01846372998'

with open('zaza.txt', 'rb+') as f:
    ch = f.read()
    x = ch.find(target_chain)  # (assumes target_chain is found; find returns -1 otherwise)
    f.seek(x + len(target_chain), 0)
    f.write(to_add)
    f.write(ch[x + len(target_chain):])
In this method it's absolutely obligatory to open the file in binary mode 'b', because of Python's treatment of line endings in text mode (see universal newlines, enabled by default).
The '+' in the mode allows writing as well as reading.
In this method, what comes before target_chain in the file remains untouched, and what comes after it is shifted ahead. As said by Greg Hewgill, there is no way to push bits apart on a hard disk to insert new bits in the middle.
Evidently, if the file is very big, reading all of its content into ch could consume too much memory, and the algorithm should then be changed: read line after line until the line containing target_chain, then read the next line before inserting, and keep alternating "read the next line, re-write the current line" until the end of the file, so that the content after the insertion point is shifted progressively.
You see what I mean...
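If you do need that streaming variant, here is a minimal sketch (not the OP's code; it shifts the tail backwards from the end of the file rather than forwards, which avoids overwriting bytes before they have been read):
def insert_after(path, target, to_add, bufsize=4096):
    # target and to_add are bytes; target is assumed to sit within one line
    with open(path, 'rb+') as f:
        pos = 0
        for line in f:
            i = line.find(target)
            if i != -1:
                insert_at = pos + i + len(target)
                break
            pos += len(line)
        else:
            return  # target not found
        f.seek(0, 2)
        end = f.tell()
        remaining = end - insert_at
        # move the tail in chunks, starting from the end of the file
        while remaining > 0:
            chunk = min(bufsize, remaining)
            f.seek(insert_at + remaining - chunk)
            data = f.read(chunk)
            f.seek(insert_at + remaining - chunk + len(to_add))
            f.write(data)
            remaining -= chunk
        f.seek(insert_at)
        f.write(to_add)

insert_after('zaza.txt', b'3292532113435532419963', b'01846372998')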
Copy the file, line by line, to another file. When you get to the line that needs extra chars then add them before writing.

How do you read a file into a list in Python? [duplicate]

This question already has answers here:
How to read a file line-by-line into a list?
(28 answers)
Closed 8 years ago.
I want to prompt a user for a number of random numbers to be generated and saved to a file. He gave us that part. The part we have to do is to open that file, convert the numbers into a list, then find the mean, standard deviation, etc. without using the easy built-in Python tools.
I've tried using open but it gives me invalid syntax (the file name I chose was "numbers" and it saved into "My Documents" automatically, so I tried open(numbers, 'r') and open(C:\name\MyDocuments\numbers, 'r') and neither one worked).
with open('C:/path/numbers.txt') as f:
    lines = f.read().splitlines()
This will give you a list of the values (strings) you had in your file, with newlines stripped.
Also, watch your backslashes in Windows path names, as those are also escape chars in strings. You can use forward slashes or double backslashes instead.
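For illustration, three equivalent ways to spell the same Windows path:
p1 = 'C:/path/numbers.txt'      # forward slashes
p2 = 'C:\\path\\numbers.txt'    # doubled backslashes
p3 = r'C:\path\numbers.txt'     # raw string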
Two ways to read a file into a list in Python (note these are not mutually exclusive):
use of with - supported from python 2.5 and above
use of list comprehensions
1. use of with
This is the pythonic way of opening and reading files.
#Sample 1 - elucidating each step but not memory efficient
lines = []
with open(r"C:\name\MyDocuments\numbers") as file:  # raw string: \n would otherwise be an escape
    for line in file:
        line = line.strip()  #or some other preprocessing
        lines.append(line)  #storing everything in memory!
#Sample 2 - a more pythonic and idiomatic way but still not memory efficient
with open(r"C:\name\MyDocuments\numbers") as file:
    lines = [line.strip() for line in file]
#Sample 3 - a more pythonic way with efficient memory usage. Proper usage of with and file iterators.
with open(r"C:\name\MyDocuments\numbers") as file:
    for line in file:
        line = line.strip()  #preprocess line
        doSomethingWithThisLine(line)  #take action on line instead of storing in a list. more memory efficient at the cost of execution speed.
The .strip() is used on each line of the file to remove the \n newline character that each line might have. When the with block ends, the file will be closed automatically for you. This is true even if an exception is raised inside it.
2. use of list comprehension
This could be considered inefficient as the file descriptor might not be closed immediately. Could be a potential issue when this is called inside a function opening thousands of files.
data = [line.strip() for line in open("C:/name/MyDocuments/numbers", 'r')]
Note that file closing is implementation dependent. Normally unused variables are garbage collected by python interpreter. In cPython (the regular interpreter version from python.org), it will happen immediately, since its garbage collector works by reference counting. In another interpreter, like Jython or Iron Python, there may be a delay.
f = open("file.txt")
lines = f.readlines()
Look over here. readlines() returns a list containing one line per element. Note that these lines contain the \n (newline-character) at the end of the line. You can strip off this newline-character by using the strip()-method. I.e. call lines[index].strip() in order to get the string without the newline character.
As joaquin noted, do not forget to f.close() the file.
Converting strings to integers is easy: int("12").
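Putting those pieces together, a minimal sketch (reusing the file name from the snippet above):
f = open("file.txt")
numbers = [int(line.strip()) for line in f.readlines()]
f.close()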
The pythonic way to read a file and put every lines in a list:
from __future__ import with_statement  # for python 2.5
with open('C:/path/numbers.txt', 'r') as f:
    lines = f.readlines()
Then, assuming that each line contains a number,
numbers = [int(e.strip()) for e in lines]
You need to pass a filename string to open. There's an extra complication when the string has \ in it, because that's a special string escape character to Python. You can fix this by doubling up each one as \\ or by putting an r in front of the string, as follows: r'C:\name\MyDocuments\numbers'.
Edit: The edits to the question make it completely different from the original, and since none of them was from the original poster I'm not sure they're warranted. However it does point out one obvious thing that might have been overlooked, and that's how to add "My Documents" to a filename.
In an English version of Windows XP, My Documents is actually C:\Documents and Settings\name\My Documents. This means the open call should look like:
open(r"C:\Documents and Settings\name\My Documents\numbers", 'r')
I presume you're using XP because you call it My Documents - it changed in Vista and Windows 7. I don't know if there's an easy way to look this up automatically in Python.
hdl = open("C:/name/MyDocuments/numbers", 'r')
milist = hdl.readlines()
hdl.close()
To summarize a bit from what people have been saying:
f=open('data.txt', 'w') # will make a new file or erase a file of that name if it is present
f=open('data.txt', 'r') # will open a file as read-only
f=open('data.txt', 'a') # will open a file for appending (appended data goes to the end of the file)
If you want something in place similar to a try/finally (the file is closed automatically, even on errors), use with:
with open('data.txt') as f:
    for line in f:
        print line
I think #movieyoda's code is probably what you should use, however.
If you have multiple numbers per line and you have multiple lines, you can read them in like this:
#!/usr/bin/env python
from os.path import dirname

with open(dirname(__file__) + '/data/path/filename.txt') as input_data:
    input_list = [map(int, num.split()) for num in input_data.readlines()]
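On Python 3, map returns a lazy iterator rather than a list, so an inner list comprehension (or list(map(...))) is the equivalent; a sketch under the same assumed file layout:
#!/usr/bin/env python3
from os.path import dirname, join

with open(join(dirname(__file__), 'data/path/filename.txt')) as input_data:
    input_list = [[int(num) for num in line.split()] for line in input_data]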
