I have a huge gzipped text file which I need to read, line by line. I go with the following:
for i, line in enumerate(codecs.getreader('utf-8')(gzip.open('file.gz'))):
    print i, line
At some point late in the file, the python output diverges from the file. This is because lines are getting broken due to weird special characters that python thinks are newlines. When I open the file in 'vim', they are correct, but the suspect characters are formatted weirdly. Is there something I can do to fix this?
I've tried other codecs including utf-16, latin-1. I've also tried with no codec.
I looked at the file using 'od'. Sure enough, there are \n characters where they shouldn't be. But the "wrong" ones are preceded by a weird character. I think there's some encoding here where some characters take two bytes, with the trailing byte being \n if not decoded properly.
According to 'od -h file', the offending character pair is '1d1c'. (Note that 'od -h' prints two-byte words, which come out byte-swapped on a little-endian machine, so the bytes on disk are \x1c followed by \x1d.)
If I replace:
gzip.open('file.gz')
With:
os.popen('zcat file.gz')
It works fine (and is actually quite a bit faster). But I'd like to know where I'm going wrong.
Try again with no codec. The following reproduces your problem when using a codec, and the absence of the problem without it:
import gzip
import os
import codecs
data = gzip.open("file.gz", "wb")
data.write('foo\x1d\x1cbar\nbaz')
data.close()
print list(codecs.getreader('utf-8')(gzip.open('file.gz')))
print list(os.popen('zcat file.gz'))
print list(gzip.open('file.gz'))
Outputs:
[u'foo\x1d', u'\x1c', u'bar\n', u'baz']
['foo\x1d\x1cbar\n', 'baz']
['foo\x1d\x1cbar\n', 'baz']
I asked (in a comment) """Show us the output from print repr(weird_special_characters). When you open the file in vim, WHAT are correct? Please be more precise than "formatted weirdly".""" But nothing :-(
What file are you looking at with od? file.gz?? If you can see anything recognisable in there, it's not a gzip file! You're not seeing newlines, you're seeing binary bytes that contain 0x0A.
If the original file was utf-8 encoded, what was the point of trying it with other codecs?
Does "works OK with zcat" mean that you got recognisable data without a utf8 decode step??
I suggest that you simplify your code, and do it a step at a time ... see for example the accepted answer to this question. Try it again and please show the exact code that you ran, and use repr() when describing the results.
Update It looks like DS has guessed what you were trying to explain about the \x1c and \x1d.
Here are some notes on WHY it happens like that:
In byte (ASCII) strings, only \r and \n are treated as line breaks:
>>> import pprint
>>> text = ''.join('A' + chr(i) for i in range(32)) + 'BBB'
>>> print repr(text)
'A\x00A\x01A\x02A\x03A\x04A\x05A\x06A\x07A\x08A\tA\nA\x0bA\x0cA\rA\x0eA\x0fA\x10A\x11A\x12A\x13A\x14A\x15A\x16A\x17A\x18A\x19A\x1aA\x1bA\x1cA\x1dA\x1eA\x1fBBB'
>>> pprint.pprint(text.splitlines(True))
['A\x00A\x01A\x02A\x03A\x04A\x05A\x06A\x07A\x08A\tA\n', # line break
'A\x0bA\x0cA\r', # line break
'A\x0eA\x0fA\x10A\x11A\x12A\x13A\x14A\x15A\x16A\x17A\x18A\x19A\x1aA\x1bA\x1cA\x1dA\x1eA\x1fBBB']
>>>
However in Unicode, the characters \x1C (FILE SEPARATOR), \x1D (GROUP SEPARATOR), and \x1E (RECORD SEPARATOR) also qualify as line endings:
>>> text = u''.join('A' + unichr(i) for i in range(32)) + u'BBB'
>>> print repr(text)
u'A\x00A\x01A\x02A\x03A\x04A\x05A\x06A\x07A\x08A\tA\nA\x0bA\x0cA\rA\x0eA\x0fA\x10A\x11A\x12A\x13A\x14A\x15A\x16A\x17A\x18A\x19A\x1aA\x1bA\x1cA\x1dA\x1eA\x1fBBB'
>>> pprint.pprint(text.splitlines(True))
[u'A\x00A\x01A\x02A\x03A\x04A\x05A\x06A\x07A\x08A\tA\n', # line break
u'A\x0bA\x0cA\r', # line break
u'A\x0eA\x0fA\x10A\x11A\x12A\x13A\x14A\x15A\x16A\x17A\x18A\x19A\x1aA\x1bA\x1c', # line break
u'A\x1d', # line break
u'A\x1e', # line break
u'A\x1fBBB']
>>>
This will happen whatever codec you use. You still need to work out what (if any) codec you need to use. You also need to work out whether the original file was really a text file and not a binary file. If it's a text file, you need to consider the meaning of the \x1c and \x1d in the file.
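If the data really is UTF-8 and only \n should count as a line ending, one workaround (a sketch, not the only option) is to let the plain gzip iterator do the splitting at the byte level, where \x1c and \x1d are ordinary bytes, and decode each line afterwards:

import gzip

# Iterate raw bytes (a gzip file object splits on '\n' only),
# then decode each line; assumes the bytes really are UTF-8.
f = gzip.open('file.gz')
for i, raw_line in enumerate(f):
    line = raw_line.decode('utf-8')
    print i, line
f.close()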
Related
I'm having a problem understanding why my Python program does what it does when reading (first) lines from files and adding the lines to a list. For some reason the first line needs to be empty, or it won't be read correctly. And if the first line is empty, it's apparently not empty (at least not according to Python).
The thing is, I have two types of files:
First file is in the form:
text:more text
another text:and more
and the second file in the form:
text_file.txt
anothertext_file.txt
Both files are UTF-8 encoded text files. The first line of each file that gets added to a list in my program should be "text" and "text_file.txt" respectively, but any code that, for example, tries to say
if something == "text":
...
will not get executed, even if "something" really is the same as "text".
So I'm assuming that my problem is that somewhere in the machine code (or something), my computer writes some invisible code in the beginning of the text file and that makes the first line not what it is. Maybe? I have actually found a solution for the problem simply by adding an empty line and an if clause when reading the file line by line:
if not "." in line:
...
and in the other filetype:
if not ":" in line:
...
Those if clauses work and my program does what it's supposed to (as long as I always add an empty line to the beginning of the file), but I haven't been able to find a real reason for why my program is behaving as it is. Also, I would like to avoid this kind of workaround if there's an easier solution that doesn't involve editing all my files and adding if clauses to my code.
Would appreciate any help understanding what's happening here!
Edit: as you people have been asking for my code, here it is:
filelist = []
with open("filename.txt", "r", encoding="UTF-8") as f:
for line in f:
filelist.append(line.rstrip("\n"))
This does not work properly. I also tried it the way mxds said,
filelist = []
with open("filename.txt", "r", encoding="UTF-8") as f:
lines = f.readlines()
for line in lines:
filelist.append(line.rstrip("\n"))
and this does not work either. The problem affects only the first character of the first line.
Edit2:
It seems the problem is a byte order mark at the beginning of my text files. After some quick googling I didn't find a solution for removing it. I'm creating my files with plain Windows Notepad.
Final edit:
Apparently Notepad is not a real text editor. I guess I'll just swap from Notepad to Notepad++ to avoid this problem. However, just in case I still have to handle my files in Notepad: if I open an existing text file in Notepad and add some text to it, will it add a BOM, or does that happen only when the file is first created?
Looks like you've already done the legwork on this, but according to How to make Notepad to save text in UTF-8 without BOM?, the best answer is not to use Notepad (but Notepad++ is ok). :)
Alternatively, you can strip the BOM in Python with:
line = line.decode("utf-8-sig").encode("utf-8")
See https://docs.python.org/3/library/codecs.html:
To increase the reliability with which a UTF-8 encoding can be
detected, Microsoft invented a variant of UTF-8 (that Python 2.5 calls
"utf-8-sig") for its Notepad program: Before any of the Unicode
characters is written to the file, a UTF-8 encoded BOM (which looks
like this as a byte sequence: 0xef, 0xbb, 0xbf) is written.
...
On decoding utf-8-sig will skip those three bytes if they appear as the first three bytes in the file. In UTF-8, the use of the BOM is discouraged and should generally be avoided.
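In Python 3 (which the open(..., encoding=...) calls in the question suggest), str objects have no decode() method; the simpler route is to pass the utf-8-sig codec straight to open(). A sketch based on the asker's own snippet:

filelist = []
with open("filename.txt", "r", encoding="utf-8-sig") as f:
    for line in f:
        # a leading BOM, if present, has already been stripped by the codec
        filelist.append(line.rstrip("\n"))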
A classic approach to reading text files in Python is:
with open(fname, 'r') as f:
    lines = f.readlines()
After which you can process the lines like this:
for line in lines:
    # do something with line...
As other comments have hinted, you may want to make sure this works first. It would help if you post your current code for review.
I just had a similar issue: readlines() reported invalid characters at the head of the first line (most likely a BOM). I tried every suggestion I could google, with no luck.
I came up with a simple trick: add a blank line as the first line of the text file, then skip that line when reading:
if len(lines[i]) > len(lines[0]):
    # do things
else:
    # skip it
In my case len(lines[0]) == 4, and all other lines are longer than 4.
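A less fragile alternative to padding the file with a blank line is to test for the BOM explicitly and drop it (a sketch, Python 3; the file name is a placeholder and assumes the stray characters really are a UTF-8 BOM):

with open("input.txt", encoding="utf-8") as f:
    lines = f.readlines()
# A UTF-8 BOM decodes to U+FEFF; remove it from the first line if present.
if lines and lines[0].startswith('\ufeff'):
    lines[0] = lines[0][1:]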
I know that I can read the entire file into memory and simply replace the offending character in memory then iterate through the stored file, but I don't want to do that because these are MASSIVE text files (often exceeding 4GB).
With that said, I want to iterate line by line through a file (which has been properly encoded as utf-8 using codecs) but I don't want line breaks to occur on the \x0b (\v) character. Unfortunately, there is some binary data that shows up in my file that has the \x0b character. Naturally, this causes a line break which ends up splitting up some lines that I need to keep intact. I'd like to ignore this character when determining where line breaks should occur while iterating through the file.
Is there a parameter or approach that will enable me to do this? I'm ok with writing my own generator to iterate line by line through the file by specifying my own valid line break characters, but I'm not sure if there isn't a simpler approach, and I'm not sure how to do this since I'm using the codecs library to handle encoding.
Here are some (sanitized) sample data:
Record#|EventID|Date| Time-UTC|Level|computer name|param_01|param_02|param_03|param_04|param_05|param_06|source name|event log
84491|682|03/19/2015| 21:59:16.000|a-pass|WKS-WINXP32BIT|sample_user|SampleGroup|(0x0,0xF38058)|RDP-Tcp#12|RogueApp|10.3.98.6|Security|C:\Users\sampleuser\EventLogs\problem-child\SecEvent.Evt
90582|682|04/03/2015| 14:42:14.000|a-pass|WKS-WINXP32BIT|sample_user|SampleGroup|(0x0,0x35BDF)|RDP-Tcp#5|RogueApp|10.3.98.14|Security|C:\Users\sampleuser\EventLogs\problem-child\SecEvent.Evt
90613|682|04/03/2015| 16:26:03.000|a-pass|WKS-WINXP32BIT|sample_user|SampleGroup|(0x0,0x35BDF)|RDP-Tcp#9|Anonymous's Mac|192.168.1.14ì°†éªá…°ê±ºà¨€A਀Aì°†éªá… 걺଀A଀Aé¶é«á… Ö Î„|Security|C:\Users\sampleuser\EventLogs\problem-child\SecEvent.Evt
90626|682|04/03/2015| 16:57:35.000|a-pass|WKS-WINXP32BIT|sample_user|SampleGroup|(0x0,0x35BDF)|RDP-Tcp#11|RogueApp|10.3.98.14|Security|C:\Users\sampleuser\EventLogs\problem-child\SecEvent.Evt
91018|682|04/04/2015| 13:56:13.000|a-pass|WKS-WINXP32BIT|sample_user|SampleGroup|(0x0,0x100513C)|RDP-Tcp#33|Anonymous|10.3.58.13|Security|C:\Users\sampleuser\EventLogs\problem-child\SecEvent.Evt
91038|682|04/04/2015| 14:09:19.000|a-pass|WKS-WINXP32BIT|sample_user|SampleGroup|(0x0,0x100513C)|RDP-Tcp#39|Anonymous's Mac|192.168.1.18ì°†éªá…°ê±ºæ¸€x渀xì°†éªá… 걺欀x欀xé¶é«á… Ѐ̀|Security|C:\Users\sampleuser\EventLogs\problem-child\SecEvent.Evt
91064|682|04/04/2015| 15:25:33.000|a-pass|WKS-WINXP32BIT|ACN-Helpdesk|WKS-WINXP32BIT|(0x0,0x11FA916)|RDP-Tcp#43|CONTROLLER|10.3.58.4|Security|C:\Users\sampleuser\EventLogs\problem-child\SecEvent.Evt
91163|682|04/04/2015| 16:40:19.000|a-pass|WKS-WINXP32BIT|Anonymous|SampleGroup|(0x0,0x37D49)|RDP-Tcp#2|Anonymous's Mac|192.168.1.18ì°†éªá…°ê±ºá´€æ®–ᴀ殖찆éªá… 걺㬀殖㬀殖é¶é«á… Ѐ̀|Security|C:\Users\sampleuser\EventLogs\problem-child\SecEvent.Evt
91204|682|04/04/2015| 18:10:55.000|a-pass|WKS-WINXP32BIT|Anonymous|SampleGroup|(0x0,0x37D49)|RDP-Tcp#5|Anonymous's Mac|192.168.1.18ì°†éªá…°ê±ºæ˜€æ˜€ì°†éªá… 걺愀愀é¶é«á… Ѐ̀|Security|C:\Users\sampleuser\EventLogs\problem-child\SecEvent.Evt
91545|682|04/05/2015| 13:41:58.000|a-pass|WKS-WINXP32BIT|Anonymous|SampleGroup|(0x0,0x37D49)|RDP-Tcp#7|Anonymous's Mac|192.168.1.14ì°†éªá…°ê±ºìˆ€ìˆ€ì°†éªá… 걺대대é¶é«á… Ѐ̀|Security|C:\Users\sampleuser\EventLogs\problem-child\SecEvent.Evt
91567|682|04/05/2015| 14:42:21.000|a-pass|WKS-WINXP32BIT|Anonymous|SampleGroup|(0x0,0x37D49)|RDP-Tcp#9|Anonymous's Mac|192.168.1.14ì°†éªá…°ê±ºæ €æ €ì°†éªá… 걺愀愀é¶é«á… Ѐ̀|Security|C:\Users\sampleuser\EventLogs\problem-child\SecEvent.Evt
92120|682|04/06/2015| 19:06:43.000|a-pass|WKS-WINXP32BIT|ACN-Helpdesk|WKS-WINXP32BIT|(0x0,0x3D6DB)|RDP-Tcp#2|Anonymous's Mac|192.168.1.14ì°†éªá…°ê±ºç„€ç„€ì°†éªá… 걺眀眀é¶é«á… Ѐ̀|Security|C:\Users\sampleuser\EventLogs\problem-child\SecEvent.Evt
It parses everything fine except for the very last row. Yes I know there shouldn't be binary data in a CSV file, but there is. And I have no choice in that matter.
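For what it's worth, here is a sketch of the kind of generator the asker mentions: read the file in binary, split on \n yourself (so \x0b never breaks a line), and decode each piece afterwards. The function name and chunk size are arbitrary:

import io

def lines_lf_only(path, encoding='utf-8'):
    # Yield decoded lines, breaking only on '\n'.
    with io.open(path, 'rb') as f:
        buf = b''
        for chunk in iter(lambda: f.read(65536), b''):
            buf += chunk
            parts = buf.split(b'\n')
            buf = parts.pop()  # the last piece may be an incomplete line
            for part in parts:
                yield (part + b'\n').decode(encoding)
        if buf:
            yield buf.decode(encoding)

Decoding after the byte-level split is safe for UTF-8, since a \n byte can never occur inside a multi-byte sequence.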
>>> with open("out.test","wb") as f:
... f.write("a\va\nb\rq")
...
>>> for line in open("out.test","rb"):
... print line.decode("utf8")
...
a♂a
q
Seems fine in Python 2.7 ... what kind of encoding is this file, that this won't work?
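For comparison, the premature splitting the asker describes does appear once a codecs reader is involved, because its readline() follows Unicode line-break rules. On the same test file, this should produce something like:

>>> import codecs
>>> list(codecs.open("out.test", encoding="utf8"))
[u'a\x0b', u'a\n', u'b\r', u'q']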
with open(sourceFileName, 'rt') as sourceFile:
    sourceFileConents = sourceFile.read()

sourceFileConentsLength = len(sourceFileConents)
i = 0
while i < sourceFileConentsLength:
    print(str(i) + ' ' + sourceFileConents[i])
    i += 1
Please forgive the unPythonic indexing loop; this is only test code, and there are reasons to do it that way in the real code.
Anyhoo, the real code seemed to be ending the loop sooner than expected, so I knocked up the dummy above, which removes all of the logic of the real code.
The sourceFileConentsLength reports as 13,690, but when I print the contents out character by character, there are still a few hundred more characters in the file which never get printed.
What gives?
Should I be using something other than <fileHandle>.read() to get the file's entire contents into a single string?
Have I hit some maximum string length? If so, can I get around it?
Might it be line endings if the file was edited in Windows & the script is run in Linux (sorry, I can't post the file, it's company confidential)
What else?
[Update] I think we can strike two of those ideas.
For maximum string length, see this question.
I did an ls -lAF to a temp file. Only 6k+ chars, but the script handled it just fine. Should I be worrying about line endings? If so, what can I do about it? The source files tend to get edited under both Windows and Linux, but the script will only run under Linux.
[Update++] I changed the line endings on my input file to Linux in Eclipse, but still got the same result.
If you read a file in text mode it will automatically convert line endings like \r\n to \n.
Try using
with open(sourceFileName, newline='') as sourceFile:
instead; this will turn off newline-translation (\r\n will be returned as \r\n).
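A quick way to see the effect on the character count (a throwaway demo file, Python 3):

# Write a file with Windows line endings, then read it back both ways.
with open('demo.txt', 'w', newline='') as f:
    f.write('one\r\ntwo\r\n')

print(len(open('demo.txt', 'r').read()))              # 8  -- each \r\n became \n
print(len(open('demo.txt', 'r', newline='').read()))  # 10 -- raw endings kept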
If your file is encoded in something like UTF-8, you should decode it before counting the characters:
sourceFileContents_utf8 = open(sourceFileName, 'rb').read()
sourceFileContents_unicode = sourceFileContents_utf8.decode('utf8')
print(len(sourceFileContents_unicode))

i = 0
source_file_contents_length = len(sourceFileContents_unicode)
while i < source_file_contents_length:
    print('%s %s' % (str(i), sourceFileContents_unicode[i]))
    i += 1
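In Python 3 the same check would read the file in binary first, since text-mode reads already return decoded str (a sketch):

with open(sourceFileName, 'rb') as sourceFile:
    raw = sourceFile.read()
print(len(raw))                  # byte count
print(len(raw.decode('utf8')))   # character count; smaller if multi-byte characters exist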
I am trying to write text to an output file that explicitly shows all of the newline characters (\n, \r, \r\n,). I am using Python 3 and Windows 7. My thought was to do this by converting the strings that I am writing into bytes.
My code looks like this:
file_object = open(r'C:\Users\me\output.txt', 'wb')
for line in lines:
    line = bytes(line, 'UTF-8')
    print('Line: ', line)  # for debugging
    file_object.write(line)
file_object.close()
The print() statement to standard output (my Windows terminal) is as I want it to be. For example, one line looks like so, with the \n character visible.
Line: b'<p class="byline">Foo C. Bar</p>\n'
However, the write() method does not explicitly print any of the newline characters in my output.txt file. Why does write() not explicitly show the newline characters in my output text file, even though I'm writing in bytes mode, but print does explicitly show the newline characters in the windows terminal?
What Python does when writing strings or bytes to text or binary files:
Strings to a text file: written directly.
Bytes to a text file: print() writes the repr (a plain write() raises an exception).
Strings to a binary file: raises an exception.
Bytes to a binary file: written directly.
You say that you get what you’re looking for when you write a bytes to standard out (a text file). That, with the pseudo-table above, suggests you might look into using repr. Specifically, if you’re looking for the output b'<p class="byline">Foo C. Bar</p>\n', you’re looking for the repr of a bytes object. If line was a str to start with and you don’t actually need that b at the beginning, you might instead be looking for the repr of the string, '<p class="byline">Foo C. Bar</p>\n'. If so, you could write it like this:
with open(r'C:\Users\me\output.txt', 'w') as file_object:
    for line in lines:
        file_object.write(repr(line) + '\n')
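A quick check of what repr() does to the newline:

>>> print(repr('Foo\n'))
'Foo\n'

The trailing newline is now written out as the two literal characters backslash and n, with quotes around the whole string.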
Getting some practice playing with dictionaries and file i/o today when a file gave me an unexpected output that I'm curious about. I wrote the following simple function that just takes the first line of a text file, breaks it into individual words, and puts each word into a dictionary:
def create_dict(file):
    dict = {}
    for i, item in enumerate(file.readline().split(' ')):
        dict[i] = item
    file.seek(0)
    return dict

print "Enter a file name:"
f = open(raw_input('-> '))
dict1 = create_dict(f)
print dict1
Simple enough; in every case it produces exactly the expected output. Every case, that is, except one. I have one text file that was created by piping the output of another Python script to a text file via the following shell command:
C:\> python script.py > textFile.txt
When I use textFile.txt with my dictionary script, I get an output that looks like:
{0: '\xff\xfeN\x00Y\x00', 1: '\x00S\x00t\x00a\x00t\x00e\x00', 2: '\x00h\x00a\x00s\x00:\x00', 3: '\x00', 4: '\x00N\x00e\x00w\x00', 5: '\x00Y\x00o\x00r\x00k\x00\r\x00\n'}
What is this output called? Why does piping the output of the script to a text file via the command line produce a different type of string than any other text file? Why are there no visible differences when I open this file in my text editor? I searched and searched but I don't even know what that would be called as I'm still pretty new.
Your file is UTF-16 encoded. The first two bytes are a byte order mark (BOM): \xff and \xfe. You will also notice that each character appears to take two bytes, one of which is \x00.
You can use the codecs module to decode for you:
import codecs
f = codecs.open(raw_input('-> '), 'r', encoding='utf-16')
Or, if you are using Python 3 you can supply the encoding argument to open().
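For example (a sketch; input() replaces raw_input in Python 3):

f = open(input('-> '), encoding='utf-16')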
I guess the problem you've met is a character-encoding problem.
In Python 2, the default character encoding is ASCII, so when you read a file with the plain open() function, the contents are treated as ASCII bytes.
If the file was written in another encoding, you need to decode those bytes with the right codec before the text looks 'normal'.
Commonly the encoding is UTF-8, in which case you can try item.decode('utf-8').
You can search for more information about character encodings (ASCII, UTF-8, Unicode) and how to convert between them.
Hope this helps.
>>> import codecs
>>> codecs.BOM_UTF16_LE
'\xff\xfe'
To read a UTF-16 encoded file you could use the io module:
import io
with io.open(filename, encoding='utf-16') as file:
    words = [word for line in file for word in line.split()]
The advantage compared to codecs.open() is that it supports the universal newline mode like the builtin open(), and io.open() is the builtin open() in Python 3.