Iterate through file but ignore certain line break characters? - python

I know that I can read the entire file into memory and simply replace the offending character in memory then iterate through the stored file, but I don't want to do that because these are MASSIVE text files (often exceeding 4GB).
With that said, I want to iterate line by line through a file (which has been properly encoded as utf-8 using codecs) but I don't want line breaks to occur on the \x0b (\v) character. Unfortunately, there is some binary data that shows up in my file that has the \x0b character. Naturally, this causes a line break which ends up splitting up some lines that I need to keep intact. I'd like to ignore this character when determining where line breaks should occur while iterating through the file.
Is there a parameter or approach that will enable me to do this? I'm ok with writing my own generator to iterate line by line through the file by specifying my own valid line break characters, but I'm not sure if there isn't a simpler approach, and I'm not sure how to do this since I'm using the codecs library to handle encoding.
Here are some (sanitized) sample data:
Record#|EventID|Date| Time-UTC|Level|computer name|param_01|param_02|param_03|param_04|param_05|param_06|source name|event log
84491|682|03/19/2015| 21:59:16.000|a-pass|WKS-WINXP32BIT|sample_user|SampleGroup|(0x0,0xF38058)|RDP-Tcp#12|RogueApp|10.3.98.6|Security|C:\Users\sampleuser\EventLogs\problem-child\SecEvent.Evt
90582|682|04/03/2015| 14:42:14.000|a-pass|WKS-WINXP32BIT|sample_user|SampleGroup|(0x0,0x35BDF)|RDP-Tcp#5|RogueApp|10.3.98.14|Security|C:\Users\sampleuser\EventLogs\problem-child\SecEvent.Evt
90613|682|04/03/2015| 16:26:03.000|a-pass|WKS-WINXP32BIT|sample_user|SampleGroup|(0x0,0x35BDF)|RDP-Tcp#9|Anonymous's Mac|192.168.1.14ì°†éªá…°ê±ºà¨€A਀Aì°†éªá… ê±ºà¬€A଀Aé¶é«á… Ö Î„|Security|C:\Users\sampleuser\EventLogs\problem-child\SecEvent.Evt
90626|682|04/03/2015| 16:57:35.000|a-pass|WKS-WINXP32BIT|sample_user|SampleGroup|(0x0,0x35BDF)|RDP-Tcp#11|RogueApp|10.3.98.14|Security|C:\Users\sampleuser\EventLogs\problem-child\SecEvent.Evt
91018|682|04/04/2015| 13:56:13.000|a-pass|WKS-WINXP32BIT|sample_user|SampleGroup|(0x0,0x100513C)|RDP-Tcp#33|Anonymous|10.3.58.13|Security|C:\Users\sampleuser\EventLogs\problem-child\SecEvent.Evt
91038|682|04/04/2015| 14:09:19.000|a-pass|WKS-WINXP32BIT|sample_user|SampleGroup|(0x0,0x100513C)|RDP-Tcp#39|Anonymous's Mac|192.168.1.18ì°†éªá…°ê±ºæ¸€x渀xì°†éªá… ê±ºæ¬€x欀xé¶é«á… Ð€Ì€|Security|C:\Users\sampleuser\EventLogs\problem-child\SecEvent.Evt
91064|682|04/04/2015| 15:25:33.000|a-pass|WKS-WINXP32BIT|ACN-Helpdesk|WKS-WINXP32BIT|(0x0,0x11FA916)|RDP-Tcp#43|CONTROLLER|10.3.58.4|Security|C:\Users\sampleuser\EventLogs\problem-child\SecEvent.Evt
91163|682|04/04/2015| 16:40:19.000|a-pass|WKS-WINXP32BIT|Anonymous|SampleGroup|(0x0,0x37D49)|RDP-Tcp#2|Anonymous's Mac|192.168.1.18ì°†éªá…°ê±ºá´€æ®–ᴀ殖찆éªá… ê±ºã¬€æ®–㬀殖é¶é«á… Ð€Ì€|Security|C:\Users\sampleuser\EventLogs\problem-child\SecEvent.Evt
91204|682|04/04/2015| 18:10:55.000|a-pass|WKS-WINXP32BIT|Anonymous|SampleGroup|(0x0,0x37D49)|RDP-Tcp#5|Anonymous's Mac|192.168.1.18ì°†éªá…°ê±ºæ˜€æ˜€ì°†éªá… ê±ºæ„€æ„€é¶é«á… Ð€Ì€|Security|C:\Users\sampleuser\EventLogs\problem-child\SecEvent.Evt
91545|682|04/05/2015| 13:41:58.000|a-pass|WKS-WINXP32BIT|Anonymous|SampleGroup|(0x0,0x37D49)|RDP-Tcp#7|Anonymous's Mac|192.168.1.14ì°†éªá…°ê±ºìˆ€ìˆ€ì°†éªá… ê±ºëŒ€ëŒ€é¶é«á… Ð€Ì€|Security|C:\Users\sampleuser\EventLogs\problem-child\SecEvent.Evt
91567|682|04/05/2015| 14:42:21.000|a-pass|WKS-WINXP32BIT|Anonymous|SampleGroup|(0x0,0x37D49)|RDP-Tcp#9|Anonymous's Mac|192.168.1.14ì°†éªá…°ê±ºæ €æ €ì°†éªá… ê±ºæ„€æ„€é¶é«á… Ð€Ì€|Security|C:\Users\sampleuser\EventLogs\problem-child\SecEvent.Evt
92120|682|04/06/2015| 19:06:43.000|a-pass|WKS-WINXP32BIT|ACN-Helpdesk|WKS-WINXP32BIT|(0x0,0x3D6DB)|RDP-Tcp#2|Anonymous's Mac|192.168.1.14ì°†éªá…°ê±ºç„€ç„€ì°†éªá… ê±ºçœ€çœ€é¶é«á… Ð€Ì€|Security|C:\Users\sampleuser\EventLogs\problem-child\SecEvent.Evt
It parses everything fine except for the very last row. Yes I know there shouldn't be binary data in a CSV file, but there is. And I have no choice in that matter.

>>> with open("out.test","wb") as f:
...     f.write("a\va\nb\rq")
...
>>> for line in open("out.test","rb"):
...     print line.decode("utf8")
...
a♂a
q
This seems fine in Python 2.7. What encoding is this file in, that this won't work?
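One possible explanation, assuming the file is read through codecs.open as the question says: the codecs stream readers split lines the way str.splitlines() does, and for Unicode text that includes \x0b, \x0c and a few other characters. A minimal sketch of an alternative is to open the file with io.open and newline="\n", so that only \n ends a line. The name iter_lines and the sample records are made up for illustration.

```python
import io
import os
import tempfile


def iter_lines(path, encoding="utf-8"):
    # newline="\n" turns off universal-newline handling, so "\n" is the
    # only line terminator; "\x0b" stays inside the line.
    with io.open(path, "r", encoding=encoding, newline="\n") as f:
        for line in f:
            yield line.rstrip("\n")


# Hypothetical sample: two records, the first containing a vertical tab.
fd, sample_path = tempfile.mkstemp()
with os.fdopen(fd, "wb") as out:
    out.write(b"90613|682|a\x0bb|Security\n90626|682|ok|Security\n")

records = list(iter_lines(sample_path))
os.remove(sample_path)
print(len(records))
```

The file is still read lazily, so memory use does not depend on the file size. If the file uses Windows line endings, each line keeps a trailing "\r" with this setting and would need line.rstrip("\r\n") instead.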

Related

Python read from text file unexpected results for find() function

I'm reading a text file generated by Praat : a .TextGrid.
fo=open(myFile)
fo.seek(0)
Then I have a loop over the lines of this file, in the course of which I need to identify some particular lines, so I evaluate a condition :
for line in fo:
    (...)
    foundName = line.find("name")
    if foundName>0:
        <things>
My problem is that for some files this works and my processing is all right, but for other files, although the string "name" appears in some lines, it is never found. Searching for each character individually works (e.g. find('n'), find('a'), etc.), but searching for strings does not (e.g. find('na')).
For these files, I observed that it is True that
line[x]=='n'
line[x+2]=='a'
And I don't understand why the contents of the file are "spread out" this way...
How can I overcome this? Is it a question of encoding?
It's an indentation issue. You're checking only the last line of the file and not all lines.
You should do
foundName = False
for line in fo:
    (...)
    foundName = "name" in line
    if foundName:
        <things>
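The symptom in the question (line[x]=='n' but line[x+2]=='a') is also what a UTF-16 file looks like when it is read as 8-bit text, because every ASCII character is followed by a zero byte. The sketch below assumes the problem files are UTF-16 with a BOM; that is a guess from the symptom, not something confirmed in the thread. The sample content and temporary file are made up.

```python
import io
import os
import tempfile

# Made-up sample standing in for a UTF-16 encoded .TextGrid file.
fd, path = tempfile.mkstemp(suffix=".TextGrid")
with os.fdopen(fd, "wb") as out:
    out.write(u'item [1]:\n    name = "words"\n'.encode("utf-16"))

# Read as raw bytes, the search fails because of the interleaved zero bytes.
with open(path, "rb") as fo:
    raw_hits = [line for line in fo if b"name" in line]

# Decoded as UTF-16, the same search succeeds.
with io.open(path, "r", encoding="utf-16") as fo:
    text_hits = [line for line in fo if u"name" in line]

os.remove(path)
print(len(raw_hits), len(text_hits))
```

If only some of the files are UTF-16, checking whether the first two bytes are b"\xff\xfe" or b"\xfe\xff" is one way to choose the encoding per file.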

Single Line from file is too big?

In python, I'm reading a large file, and I want to add each line(after some modifications) to an empty list. I want to do this to only the first few lines, so I did:
X = []
for line in range(3):
    i = file.readline()
    m = str(i)
    X.append(m)
However, an error shows up, and says there is a MemoryError for the line
i = file.readline().
What should I do? It is the same even if I make the range 1 (although I don't know how that affects the line, since it's inside the loop).
How do I not get the error code? I'm iterating, and I can't make it into a binary file because the file isn't just integers - there's decimals and non-numerical characters.
The txt file is 5 gigs.
Any ideas?
filehandle.readline() breaks lines via the newline character (\n) - if your file has gigantic lines, or no new lines at all, you'll need to figure out a different way of chunking it.
Normally you might read the file in chunks and process those chunks one by one.
Can you figure out how you might break up the file? Could you, for example, only read 1024 bytes at a time, and work with that chunk?
If not, it's often easier to clean up the format of the file instead of designing a complicated reader.
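The "read 1024 bytes at a time" idea can be sketched as a small generator. The name iter_chunks, the chunk size and the sample file are illustrative assumptions; how to process each chunk depends on the real structure of the file.

```python
import os
import tempfile
from functools import partial


def iter_chunks(path, chunk_size=1024):
    # Yield successive blocks of at most chunk_size bytes until EOF.
    with open(path, "rb") as f:
        for chunk in iter(partial(f.read, chunk_size), b""):
            yield chunk


# Made-up sample: one long "line" with no newline at all.
fd, path = tempfile.mkstemp()
with os.fdopen(fd, "wb") as out:
    out.write(b"x" * 2500)

sizes = [len(chunk) for chunk in iter_chunks(path)]
os.remove(path)
print(sizes)
```

Only one chunk is held in memory at a time, so a 5 GB file with no newlines no longer triggers a MemoryError. A record that straddles a chunk boundary has to be stitched together by the caller.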

Python - failing to read correctly the first line of a text file to a list

I'm having a problem understanding why my python program does what it does when reading (first) lines from files and adding the lines into a list. For some reason the first line needs to be empty or it'll not read the first line correctly. If the first line is empty, it's not empty (at least not according to python).
The thing is, I have two types of files:
First file is in the form:
text:more text
another text:and more
and the second file in the form:
text_file.txt
anothertext_file.txt
Both files are UTF-8 encoded text files. The first line of both files that gets added to a list in my program, is "text" and "text_file.txt" but any code that for example tries to say
if something == "text":
...
will not get executed even if the "something" is the same as the "text".
So I'm assuming that my problem is that somewhere in the machine code (or something), my computer writes some invisible code in the beginning of the text file and that makes the first line not what it is. Maybe? I have actually found a solution for the problem simply by adding an empty line and an if clause when reading the file line by line:
if not "." in line:
...
and in the other filetype:
if not ":" in line:
...
Those if clauses work and my program does what it's supposed to (as long as I always add an empty line to the beginning of the file), but I haven't been able to find the real reason why my program behaves as it does. Also, I would like to avoid this kind of workaround if there's an easier solution that doesn't involve editing all my files and adding if clauses to my code.
Would appreciate any help understanding what's happening here!
Edit: as you people have been asking for my code, here it is:
filelist = []
with open("filename.txt", "r", encoding="UTF-8") as f:
    for line in f:
        filelist.append(line.rstrip("\n"))
This does not work properly. Also I tried it like mxds said,
filelist = []
with open("filename.txt", "r", encoding="UTF-8") as f:
    lines = f.readlines()
for line in lines:
    filelist.append(line.rstrip("\n"))
and this does not work either. The problem only affects the first character of the first line of each file.
Edit2:
It seems the problem is having a Byte order mark in the beginning of my text files. After a quick googling I didn't find a solution as to how I could remove it. I'm creating my files with just windows notepad.
Final edit:
Apparently notepad is not a real text editor. I guess I'll just swap over from notepad to notepad++ to avoid this problem. However, just in case I'll have to handle my files in notepad: If I open a textfile in notepad and add some text in it, will it add a BOM or should it do that only in the creating of the file?
Looks like you've already done the legwork on this, but according to How to make Notepad to save text in UTF-8 without BOM?, the best answer is not to use Notepad (but Notepad++ is ok). :)
Alternatively, if line is a byte string, you can strip the BOM in Python with:
line = line.decode("utf-8-sig").encode("utf-8")
See https://docs.python.org/3/library/codecs.html:
To increase the reliability with which a UTF-8 encoding can be
detected, Microsoft invented a variant of UTF-8 (that Python 2.5 calls
"utf-8-sig") for its Notepad program: Before any of the Unicode
characters is written to the file, a UTF-8 encoded BOM (which looks
like this as a byte sequence: 0xef, 0xbb, 0xbf) is written.
...
On decoding utf-8-sig will skip those three bytes if they appear as the first three bytes in the file. In UTF-8, the use of the BOM is discouraged and should generally be avoided.
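Since the question's code already passes an encoding to open(), the same codec can be applied when opening the file instead of converting each line. This is a minimal sketch; the temporary file and its contents are made up to mimic a file saved by Notepad.

```python
import io
import os
import tempfile

# Made-up sample: UTF-8 text preceded by the three BOM bytes.
fd, path = tempfile.mkstemp()
with os.fdopen(fd, "wb") as out:
    out.write(b"\xef\xbb\xbftext:more text\nanother text:and more\n")

# With plain "utf-8" the BOM survives as U+FEFF at the start of line one.
with io.open(path, "r", encoding="utf-8") as f:
    plain_first = f.readline().rstrip("\n")

# With "utf-8-sig" the BOM is skipped if present.
with io.open(path, "r", encoding="utf-8-sig") as f:
    sig_first = f.readline().rstrip("\n")

os.remove(path)
print(repr(plain_first), repr(sig_first))
```

"utf-8-sig" also reads files that have no BOM, so it can be used for both kinds of file without the blank-line workaround.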
A classic approach to reading text files in Python is:
with open(fname, 'r') as f:
    lines = f.readlines()
After which you can process the lines like this:
for line in lines:
    # do something with line...
As other comments have hinted, you may want to make sure this works first. It would help if you post your current code for review.
I just had a similar issue: Python's readlines() reported invalid characters at the start of the first line. I tried every suggestion I could find on Google, with no luck.
I came up with a simple trick: add a blank line as the first line in the text file, then skip it with a length check:
if len(line[i]) > len(line[0]):
    do things
else:
    skipping
In my case len(line[0]) = 4, and all other lines are longer than 4.

python reading files

I need to get a specific line number from a file that I am passing into a python program I wrote. I know that the line I want will be line 5, so is there a way I can just grab line 5, and not have to iterate through the file?
If you know how many bytes you have before the line you're interested in, you could seek to that point and read out a line. Otherwise, a "line" is not a first class construct (it's just a list of characters terminated by a character you're assigning a special meaning to - a newline). To find these newlines, you have to read the file in.
Practically speaking, you could use the readline method to read off 5 lines and then read your line.
Why are you trying to do this?
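The "read off the earlier lines, then take the one you want" approach can be sketched with itertools.islice, which stops reading as soon as the requested line has been returned. The helper name get_line and the sample file are made up for illustration.

```python
import os
import tempfile
from itertools import islice


def get_line(path, line_number):
    # Return line `line_number` (1-based), or None if the file is shorter.
    with open(path, "r") as f:
        return next(islice(f, line_number - 1, line_number), None)


# Made-up sample file with six numbered lines.
fd, path = tempfile.mkstemp()
with os.fdopen(fd, "w") as out:
    for n in range(1, 7):
        out.write("line %d\n" % n)

fifth = get_line(path, 5)
missing = get_line(path, 10)
os.remove(path)
print(repr(fifth), missing)
```

The first four lines are still read and discarded, which is unavoidable unless the byte offset of line 5 is known in advance.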
You can use linecache:
import linecache
get = linecache.getline
print(get(path_of_file, number_of_line))
I think the following should do:
line_number=4
# Avoid reading the whole file
f = open('path/to/my/file','r')
count=1
for i in f:
    if count==line_number:
        print i
        break
    count+=1
# By reading the whole file
f = open('path/to/my/file','r')
lines = f.read().splitlines()
print lines[line_number-1] # Index starts from 0
This should give you the 4th line in the file.

taking a character input in python from a file?

In Python, suppose I have a file data.txt with 6 lines of data. I want to count the number of lines, which I plan to do by going through each character and counting the '\n' characters in the file. How do I read one character at a time from the file? readline() takes the whole line.
I think the method you're looking for is readlines, as in
lines = open("inputfilex.txt", "r").readlines()
This will give you a list of each of the lines in the file. To find out how many lines, you can just do:
len(lines)
And then access it using indexes, like lines[3] or lines[-1] as you would any normal Python list.
You can use read(1) to read a single byte. help(file) says:
read(size) -> read at most size bytes, returned as a string.
If the size argument is negative or omitted, read until EOF is reached.
Notice that when in non-blocking mode, less data than what was requested
may be returned, even if no size parameter was given.
Note that reading a file a byte at a time is quite un-"Pythonic". This is par for the course in C, but Python can do a lot more work with far less code. For example, you can read the entire file into a list of lines in one line of code:
lines = f.readlines()
You could then access by line number with a simple lines[lineNumber] lookup.
Or if you don't want to store the entire file in memory at once, you can iterate over it line-by-line:
for line in f:
    # Do whatever you want.
That is much more readable and idiomatic.
It seems the simplest answer for you would be to do:
lines = 0
for line in file:
    lines += 1
    # do whatever else you need to do for each line
Or the equivalent construction explicitly using readline(). I'm not sure why you want to look at every character when you said above that readline() is correctly reading each line in its entirety.
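If the goal really is to count '\n' characters, a middle ground between read(1) and reading whole lines is to read fixed-size blocks and count the newlines in each one. This is a sketch; the function name count_newlines, the block size and the sample file are assumptions made for illustration.

```python
import os
import tempfile


def count_newlines(path, chunk_size=65536):
    # Count b"\n" bytes without loading the whole file into memory.
    total = 0
    with open(path, "rb") as f:
        while True:
            chunk = f.read(chunk_size)
            if not chunk:
                break
            total += chunk.count(b"\n")
    return total


# Made-up sample standing in for data.txt: six lines of data.
fd, path = tempfile.mkstemp()
with os.fdopen(fd, "wb") as out:
    out.write(b"one\ntwo\nthree\nfour\nfive\nsix\n")

newline_count = count_newlines(path)
os.remove(path)
print(newline_count)
```

If the last line has no trailing newline, this count is one less than the number of lines, so the two are not always interchangeable.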
To access a file based on its lines, make a list of its lines.
with open('myfile') as f:
    lines = list(f)
then simply access lines[3] to get the fourth line and so forth. (Note that this will not strip the newline characters.)
The linecache module can also be useful for this.
