File contents not as long as expected - python

with open(sourceFileName, 'rt') as sourceFile:
    sourceFileContents = sourceFile.read()

sourceFileContentsLength = len(sourceFileContents)
i = 0
while i < sourceFileContentsLength:
    print(str(i) + ' ' + sourceFileContents[i])
    i += 1
Please forgive the unPythonic indexed loop; this is only test code, and there are reasons to do it that way in the real code.
Anyhoo, the real code seemed to be ending the loop sooner than expected, so I knocked up the dummy above, which removes all of the logic of the real code.
The sourceFileContentsLength reports as 13,690, but when I print the contents out char by char, there are still a few hundred more chars in the file which are not being printed out.
What gives?
Should I be using something other than <fileHandle>.read() to get the file's entire contents into a single string?
Have I hit some maximum string length? If so, can I get around it?
Might it be line endings if the file was edited in Windows & the script is run in Linux (sorry, I can't post the file, it's company confidential)
What else?
[Update] I think we can strike two of those ideas.
For maximum string length, see this question.
I did an ls -lAF to a temp directory. Only 6k+ chars, but the script handled it just fine. Should I be worrying about line endings? If so, what can I do about it? The source files tend to get edited under both Windows & Linux, but the script will only run under Linux.
[Update++] I changed the line endings on my input file to Linux in Eclipse, but still got the same result.

If you read a file in text mode, it will automatically convert line endings like \r\n to \n.
Try using
with open(sourceFileName, newline='') as sourceFile:
instead; this will turn off newline translation (\r\n will be returned as \r\n).
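
To confirm that the translation accounts for the missing characters, you can compare the on-disk size against both kinds of read. A minimal sketch, assuming a hypothetical source.txt:

import os

size_on_disk = os.path.getsize('source.txt')      # raw byte count

with open('source.txt', 'rt') as f:               # text mode: '\r\n' -> '\n'
    translated_len = len(f.read())

with open('source.txt', 'rt', newline='') as f:   # translation turned off
    raw_len = len(f.read())

# With Windows line endings, raw_len exceeds translated_len by one per line.
print(size_on_disk, translated_len, raw_len)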

If your file is encoded in something like UTF-8, you should decode it before counting the characters:
with open(sourceFileName, 'rb') as sourceFile:    # read the raw bytes
    sourceFileContents_utf8 = sourceFile.read()
sourceFileContents_unicode = sourceFileContents_utf8.decode('utf8')
print(len(sourceFileContents_unicode))
i = 0
source_file_contents_length = len(sourceFileContents_unicode)
while i < source_file_contents_length:
    print('%s %s' % (i, sourceFileContents_unicode[i]))
    i += 1
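On Python 3 you can also let open() do the decoding for you; a sketch, assuming the file really is UTF-8:

with open(sourceFileName, 'r', encoding='utf8') as sourceFile:
    # len() now counts characters, not bytes
    sourceFileContents = sourceFile.read()
print(len(sourceFileContents))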

Related

Python - failing to read correctly the first line of a text file to a list

I'm having a problem understanding why my Python program does what it does when reading (first) lines from files and adding the lines to a list. For some reason the first line needs to be empty, or it won't read the first line correctly. And if the first line is empty, it's not empty (at least not according to Python).
The thing is, I have two types of files:
First file is in the form:
text:more text
another text:and more
and the second file in the form:
text_file.txt
anothertext_file.txt
Both files are UTF-8 encoded text files. The first line of both files that gets added to a list in my program, is "text" and "text_file.txt" but any code that for example tries to say
if something == "text":
    ...
will not get executed even if something looks identical to "text".
So I'm assuming that my problem is that somewhere in the machine code (or something), my computer writes some invisible code in the beginning of the text file and that makes the first line not what it is. Maybe? I have actually found a solution for the problem simply by adding an empty line and an if clause when reading the file line by line:
if not "." in line:
...
and in the other filetype:
if not ":" in line:
...
Those if clauses work, and my program does what it's supposed to (as long as I always add an empty line to the beginning of the file), but I haven't been able to find the real reason why my program behaves this way. I'd also like to avoid this kind of workaround if there's a simpler solution that doesn't involve editing all my files and adding if clauses to my code.
Would appreciate any help understanding what's happening here!
Edit: as you people have been asking for my code, here it is:
filelist = []
with open("filename.txt", "r", encoding="UTF-8") as f:
    for line in f:
        filelist.append(line.rstrip("\n"))
This does not work properly. Also I tried it like mxds said,
filelist = []
with open("filename.txt", "r", encoding="UTF-8") as f:
    lines = f.readlines()
for line in lines:
    filelist.append(line.rstrip("\n"))
and this does not work either. The problem is only with the first character of the first line.
Edit2:
It seems the problem is having a byte order mark (BOM) at the beginning of my text files. After some quick googling, I didn't find a solution for how to remove it. I'm creating my files with plain Windows Notepad.
Final edit:
Apparently Notepad is not a real text editor. I guess I'll just swap over from Notepad to Notepad++ to avoid this problem. However, just in case I have to handle my files in Notepad: if I open a text file in Notepad and add some text to it, will it add a BOM, or should it do that only when creating the file?
Looks like you've already done the legwork on this, but according to How to make Notepad to save text in UTF-8 without BOM?, the best answer is not to use Notepad (but Notepad++ is ok). :)
Alternatively, you can strip the BOM in Python with:
line = line.decode("utf-8-sig").encode("utf-8")
See https://docs.python.org/3/library/codecs.html:
To increase the reliability with which a UTF-8 encoding can be detected, Microsoft invented a variant of UTF-8 (that Python 2.5 calls "utf-8-sig") for its Notepad program: Before any of the Unicode characters is written to the file, a UTF-8 encoded BOM (which looks like this as a byte sequence: 0xef, 0xbb, 0xbf) is written.
...
On decoding utf-8-sig will skip those three bytes if they appear as the first three bytes in the file. In UTF-8, the use of the BOM is discouraged and should generally be avoided.
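On Python 3, the same codec can be passed straight to open(), which is probably the simplest fix here. A sketch, reusing the question's own filename and list:

filelist = []
# encoding="utf-8-sig" consumes a leading BOM if present, so the first
# line compares equal to its plain text.
with open("filename.txt", "r", encoding="utf-8-sig") as f:
    for line in f:
        filelist.append(line.rstrip("\n"))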
A classic approach to reading text files in Python is:
with open(fname, 'r') as f:
    lines = f.readlines()
After which you can process the lines like this:
for line in lines:
    # do something with line...
As other comments have hinted, you may want to make sure this works first. It would help if you post your current code for review.
I just had a similar issue: Python readlines() reported invalid chars heading the first line, something like . I tried all the suggestions I could google, with no luck.
I came up with a simple trick: add a blank line as the first line in the text file, then skip it when reading:

if len(line[i]) > len(line[0]):
    # do things
else:
    # skip this line

In my case len(line[0]) == 4, and all other lines are longer than 4.

Python conditional statement based on text file string

Noob question here. I'm scheduling a cron job to run a Python script every 2 hours, but I want the script to stop running after 48 hours, which is not a feature of cron. To work around this, the script records each execution as a tally mark x in a text file at the end of its run, and at the beginning of the run it opens the text file and only proceeds if the count is less than n.
However, my script seems to always run regardless of the conditions. Here's an example of what I've tried:
with open("curl-output.txt", "a+") as myfile:
data = myfile.read()
finalrun = "xxxxx"
if data != finalrun:
[CURL CODE]
with open("curl-output.txt", "a") as text_file:
text_file.write("x")
text_file.close()
I think I'm missing something simple here. Please advise if there is a better way of achieving this. Thanks in advance.
The problem with your original code is that you're opening the file in a+ mode, which seems to set the seek position to the end of the file (try print(data) right after you read the file). If you use r instead, it works. (I'm not sure that's how it's supposed to be. This answer states it should write at the end, but read from the beginning. The documentation isn't terribly clear).
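If you'd rather keep the single a+ handle, seeking back to the start before reading should also work; a sketch:

with open("curl-output.txt", "a+") as myfile:
    myfile.seek(0)           # 'a+' positions at the end; rewind before reading
    data = myfile.read()     # now sees the existing tally marks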
Some suggestions: Instead of comparing against the "xxxxx" string, you could just check the length of the data (if len(data) < 5). Or alternatively, as was suggested, use pickle to store a number, which might look like this:
import pickle

try:
    with open("curl-output.txt", "rb") as myfile:
        num = pickle.load(myfile)
except FileNotFoundError:
    num = 0

if num < 5:
    do_curl_stuff()
    num += 1
    with open("curl-output.txt", "wb") as myfile:
        pickle.dump(num, myfile)
Two more things concerning your original code: You're making the first with block bigger than it needs to be. Once you've read the string into data, you don't need the file object anymore, so you can remove one level of indentation from everything except data = myfile.read().
Also, you don't need to close text_file manually. with will do that for you (that's the point).
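Putting those two suggestions together, the first block might look like this sketch (the [CURL CODE] placeholder is the OP's):

with open("curl-output.txt", "r") as myfile:   # 'r' reads from the beginning
    data = myfile.read()

finalrun = "xxxxx"
if data != finalrun:
    # [CURL CODE]
    with open("curl-output.txt", "a") as text_file:
        text_file.write("x")   # the 'with' closes the file; no manual close needed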
This sounds more like a job for scheduling with the at command.
See http://www.ibm.com/developerworks/library/l-job-scheduling/ for different job scheduling mechanisms.
The first bug that is immediately obvious to me is that you are appending to the file even if data == finalrun. So when data == finalrun, you don't run curl but you do append another 'x' to the file. On the next run, data will be not equal to finalrun again so it will continue to execute the curl code.
The solution is of course to nest the code that appends to the file under the if statement.
There is probably a trailing end-of-line \n character, which means your file contains something like xx\n and not simply xx. That is probably why your condition does not work :)
EDIT
What happens if you type the following at the Python command line?

open('filename.txt', 'r').read()  # where filename.txt is the name of your file

You will be able to see whether there is a \n or not.
Try using this condition with your if clause instead:

if data.count('x') == 24:

The data string may contain extraneous characters like newlines. Check repr(data) to see whether it is actually 24 x's.
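For example, a quick diagnostic sketch:

with open("curl-output.txt", "r") as f:
    data = f.read()
print(repr(data))             # repr() makes a trailing newline visible, e.g. 'xxx\n'
if data.count('x') == 24:
    print("limit reached; skip the curl run")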

Reading file in Python one line at a time

I do appreciate this question has been asked millions of times, but I can't figure out why, when attempting to read a .txt file line by line, I get the entire file read in one go.
This is my little snippet
num = 0
with open(inStream, "r") as f:
    for line in f:
        num += 1
        print line + " ..."
print num
Having a look at the open function, there isn't anything that suggests a second param to limit the reading, as that is just the "mode" to open the file.
So I can only guess there is some problem with my file, but it is a txt file with entries line by line.
Any hint?
Without a little more information, it's hard to be absolutely sure… but most likely, your problem is inappropriate line endings.
For example, on a modern Mac OS X system, lines in text files end with '\n' newline characters. So, when you do for line in f:, Python breaks the text file on '\n' characters.
But on classic Mac OS 9, lines in text files ended with '\r' instead. If you have some ancient classic Mac text files lying around, and you give one to Python, it will go looking for '\n' characters and not find any, so it'll think the whole file is one giant line.
(Of course in real life, Windows is a problem more often than classic Mac OS, but I used this example because it's simpler.)
Python 2: Fortunately, Python has a feature called "universal newlines". For full details, see the link, but the short version is that adding "U" onto the end of the mode when opening a text file means Python will read any of the three standard line-ending conventions (and give them to your code as Unix-style '\n').
In other words, just change one line:
with open(inStream, "rU") as f:
Python 3: Universal newlines are part of the standard behavior; adding "U" has no effect and is deprecated.
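For reference, a sketch of both behaviors in Python 3:

# Universal newlines are the default: '\r', '\n', and '\r\n' all become '\n'.
with open(inStream, "r") as f:
    lines = f.readlines()

# Passing newline='' turns translation off; endings come through untouched.
with open(inStream, "r", newline="") as f:
    raw_lines = f.readlines()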

Extracting a substring makes the for loop break in python

I am having a lot of trouble with a text file given to me that I need to parse. This is my third attempt at parsing it (I tried both C and PHP, which seem to fail in different ways).
I have this extremely simple code:
import fileinput

for line in fileinput.input(['basin_stclair.txt']):
    print line[0:64]
For some reason the code exits after the first print.
If I print the lines whole, it never stops, but the lines are still combined. (If I only let the loop run for one iteration, I get two lines, i.e. 14 floats.)
The text file looks like this (several hundred lines like this one, 7 floats per line):
1.749766 3.735660 0.294098 310.461737 0.000000 0.231367 0.230505
When I copy the entire text into Kate, it gets all jumbled and lines combine.
The text file was made using Excel on a Windows machine. (I'm working on a Linux box.)
Any ideas?
You have some problem with the newline characters in your file. Try opening the file using Python's universal newline support:
for line in open('basin_stclair.txt', 'U'):
    print line[0:64]
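If you want to see which endings the file actually uses, here is a quick binary-mode diagnostic (Python 2, to match the question):

raw = open('basin_stclair.txt', 'rb').read()   # binary mode: no translation
crlf = raw.count('\r\n')                       # Windows endings
cr_only = raw.count('\r') - crlf               # bare '\r' (old Mac / some Excel exports)
lf_only = raw.count('\n') - crlf               # bare '\n' (Unix)
print crlf, cr_only, lf_only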
Are you trying to print the first 64 lines? If so, try something like this:
i = 0
for line in fileinput.input(['basin_stclair.txt']):
    print line[0:64]
    if i > 63:
        break
    i = i + 1
Are you trying to print the first 64 characters of each line? Try something like this:
for line in fileinput.input(['basin_stclair.txt']):
    if len(line) > 63:
        print line[0:64]

Decompressing a .bz2 file in Python

So, this is a seemingly simple question, but I'm apparently very very dull. I have a little script that downloads all the .bz2 files from a webpage, but for some reason the decompressing of that file is giving me a MAJOR headache.
I'm quite a Python newbie, so the answer is probably quite obvious, please help me.
In this bit of the script, I already have the file, and I just want to read it into a variable, then decompress it. Is that right? I've tried all sorts of ways to do this; I usually get a "ValueError: couldn't find end of stream" error on the last line in this snippet. I've tried to open up the zipfile and write it out to a string in a zillion different ways. This is the latest.
openZip = open(zipFile, "r")
s = ''
while True:
    newLine = openZip.readline()
    if len(newLine) == 0:
        break
    s += newLine
print s
uncompressedData = bz2.decompress(s)
Hi Alex, I should've listed all the other methods I've tried, as I've tried the read() way.
METHOD A:
print 'decompressing ' + filename
fileHandle = open(zipFile)
uncompressedData = ''
while True:
    s = fileHandle.read(1024)
    if not s:
        break
    print('RAW "%s"', s)
    uncompressedData += bz2.decompress(s)
uncompressedData += bz2.flush()
newFile = open(steamTF2mapdir + filename.split(".bz2")[0], "w")
newFile.write(uncompressedData)
newFile.close()
I get the error:
uncompressedData += bz2.decompress(s)
ValueError: couldn't find end of stream
METHOD B
zipFile = steamTF2mapdir + filename
print 'decompressing ' + filename
fileHandle = open(zipFile)
s = fileHandle.read()
uncompressedData = bz2.decompress(s)
Same error:
uncompressedData = bz2.decompress(s)
ValueError: couldn't find end of stream
Thanks so much for your prompt reply. I'm really banging my head against the wall, feeling inordinately thick for not being able to decompress a simple .bz2 file.
By the by, I used 7zip to decompress it manually, to make sure the file isn't wonky or anything, and it decompresses fine.
You're opening and reading the compressed file as if it was a textfile made up of lines. DON'T! It's NOT.
uncompressedData = bz2.BZ2File(zipFile).read()
seems to be closer to what you're angling for.
Edit: the OP has shown a few more things he's tried (though I don't see any notes about having tried the best method -- the one-liner I recommend above!) but they seem to all have one error in common, and I repeat the key bits from above:
opening ... the compressed file as if
it was a textfile ... It's NOT.
open(filename) and even the more explicit open(filename, 'r') open, for reading, a text file -- a compressed file is a binary file, so in order to read it correctly you must open it with open(filename, 'rb'). ((my recommended bz2.BZ2File KNOWS it's dealing with a compressed file, of course, so there's no need to tell it anything more)).
In Python 2.*, on Unix-y systems (i.e. every system except Windows), you could get away with a sloppy use of open (but in Python 3.* you can't, as text is Unicode, while binary is bytes -- different types).
In Windows (and before then in DOS) it's always been indispensable to distinguish, as Windows' text files, for historical reasons, are peculiar (they use two bytes rather than one to end lines, and, at least in some cases, take a byte worth '\x1A' as meaning a logical end of file), so the reading and writing low-level code must compensate.
So I suspect the OP is using Windows and is paying the price for not carefully using the 'rb' option ("read binary") to the open built-in. (though bz2.BZ2File is still simpler, whatever platform you're using!-).
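Put into the OP's own variables, the recommended approach might look like this sketch (note the "wb" on the output file, since the decompressed map data is binary too):

import bz2

uncompressedData = bz2.BZ2File(zipFile).read()   # BZ2File handles binary mode itself

newFile = open(steamTF2mapdir + filename.split(".bz2")[0], "wb")
newFile.write(uncompressedData)
newFile.close()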
openZip = open(zipFile, "r")
If you're running on Windows, you may want to do say openZip = open(zipFile, "rb") here since the file is likely to contain CR/LF combinations, and you don't want them to be translated.
newLine = openZip.readline()
As Alex pointed out, this is very wrong, as the concept of "lines" is foreign to a compressed stream.
s = fileHandle.read(1024)
[...]
uncompressedData += bz2.decompress(s)
This is wrong for the same reason. 1024-byte chunks aren't likely to mean much to the decompressor, since it's going to want to work with its own block size.
s = fileHandle.read()
uncompressedData = bz2.decompress(s)
If that doesn't work, I'd say it's the new-line translation problem I mentioned above.
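If you do need to process the file in chunks (say, to bound memory use), bz2.BZ2Decompressor keeps state across calls, unlike one-shot bz2.decompress; a sketch:

import bz2

decompressor = bz2.BZ2Decompressor()
uncompressedData = ''
fileHandle = open(zipFile, 'rb')       # binary mode matters here too
while True:
    chunk = fileHandle.read(1024)      # any chunk size is fine now
    if not chunk:
        break
    uncompressedData += decompressor.decompress(chunk)
fileHandle.close()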
This was very helpful.
44 of 2300 files gave an "end of file missing" error with a Windows open().
Adding the b(inary) flag to open fixed the problem.
for line in bz2.BZ2File(filename, 'rb', 10000000):
works well. (the 10M is the buffering size that works well with the large files involved)
Thanks!
