Python read from text file unexpected results for find() function - python

I'm reading a text file generated by Praat : a .TextGrid.
fo=open(myFile)
fo.seek(0)
Then I have a loop over the lines of this file, in the course of which I need to identify some particular lines, so I evaluate a condition :
for line in fo:
(...)
foundName = line.find("name")
if foundName>0:
<things>
My problem is that for some files, this works, and my processing is all right, but for some other files, although the string "name" belongs to some lines, it is never found. For each character individually, it works (e.g. find('n'), find('a'), etc), but not for strings (e.g. find('na').
For these files, I observed that it is True that
line[x]=='n'
line[x+2]=='a'
And Idon't understand why the contents of the file is "spread" this way...
How to overcome this ? Is it a question of encoding ?

It's an indentation issue. Your checking the last line of the file and not all lines.
You should do
foundName = False
for line in fo:
(...)
foundName = foundName and (name in line)
if foundName:
<things>

Related

Binary text isn't found in a binary file

I opened a file in binary mode. I need to find a certain string inside this file and print the line after that. However, the string doesn't appear to be found in the text file. I looked into the text file manually, and the string is definitely found on one line.
I tried opening the file as textfile (not binary mode) and not making the string binary, but that gave an error that I solved with this question. The answer on that question led to the below (and current) code.
with open(os.path.join(directory, filename), 'rb') as read_obj:
# print(read_obj.read())
for line in read_obj:
line_number += 1
if b"PREPARED FOR" in line:
break
print(line_number)
Ok. So. Apparently .readlines() worked. I just had to read all the lines, loop through them. Find the one that had the string in it, call that index and add one to find the next line.

Python - failing to read correctly the first line of a text file to a list

I'm having a problem understanding why my python program does what it does when reading (first) lines from files and adding the lines into a list. For some reason the first line needs to be empty or it'll not read the first line correctly. If the first line is empty, it's not empty (at least not according to python).
The thing is, I have two types of files:
First file is in the form:
text:more text
another text:and more
and the second file in the form:
text_file.txt
anothertext_file.txt
Both files are UTF-8 encoded text files. The first line of both files that gets added to a list in my program, is "text" and "text_file.txt" but any code that for example tries to say
if something == "text":
...
will not get executed even if the "something" is the same as the "text".
So I'm assuming that my problem is that somewhere in the machine code (or something), my computer writes some invisible code in the beginning of the text file and that makes the first line not what it is. Maybe? I have actually found a solution for the problem simply by adding an empty line and an if clause when reading the file line by line:
if not "." in line:
...
and in the other filetype:
if not ":" in line:
...
Those if clauses work and my program does what it's supposed to (as long as I always add an empty line to the beginning of the file), but I haven't been able to find a real reason for why my program is behaving as it is. Also, I would like to not have to do this kind of a workaround if there's an easier solution that doesn't involve me editing all my files and adding an if clauses to my code.
Would appreciate any help understanding what's happening here!
Edit: as you people have been asking for my code, here it is:
filelist = []
with open("filename.txt", "r", encoding="UTF-8") as f:
for line in f:
filelist.append(line.rstrip("\n"))
This does not work properly. Also I tried it like mxds said,
filelist = []
with open("filename.txt", "r", encoding="UTF-8") as f:
lines = f.readlines()
for line in lines:
filelist.append(line.rstrip("\n"))
and this does not work either. It is only a problem in the files in the first character of the first line.
Edit2:
It seems the problem is having a Byte order mark in the beginning of my text files. After a quick googling I didn't find a solution as to how I could remove it. I'm creating my files with just windows notepad.
Final edit:
Apparently notepad is not a real text editor. I guess I'll just swap over from notepad to notepad++ to avoid this problem. However, just in case I'll have to handle my files in notepad: If I open a textfile in notepad and add some text in it, will it add a BOM or should it do that only in the creating of the file?
Looks like you've already done the legwork on this, but according to How to make Notepad to save text in UTF-8 without BOM?, the best answer is not to use Notepad (but Notepad++ is ok). :)
Alternatively, you can strip the BOM in Python with:
line = line.decode("utf-8-sig").encode("utf-8")
See https://docs.python.org/3/library/codecs.html:
To increase the reliability with which a UTF-8 encoding can be
detected, Microsoft invented a variant of UTF-8 (that Python 2.5 calls
"utf-8-sig") for its Notepad program: Before any of the Unicode
characters is written to the file, a UTF-8 encoded BOM (which looks
like this as a byte sequence: 0xef, 0xbb, 0xbf) is written.
...
On decoding utf-8-sig will skip those three bytes if they appear as the first three bytes in the file. In UTF-8, the use of the BOM is discouraged and should generally be avoided.
A classic approach to reading text files in Python is:
with open(fname, 'r') as f:
lines = f.readlines()
After which you can process the lines like this:
for line in lines:
# do something with line...
As other comments have hinted, you may want to make sure this works first. It would help if you post your current code for review.
I just had similar issue: python readlines() reports invalid chars heading the first line, something like . I have tried all suggestions i can google, with no luck.
I came up with a simple trick: skip the line with
add a blank line as the first line in the text file
if len(line[i]) > len(line[0]):
do things
else:
skipping
in my case, the len(line[0] = 4, all other lines are longer than 4

Iterate through file but ignore certain line break characters?

I know that I can read the entire file into memory and simply replace the offending character in memory then iterate through the stored file, but I don't want to do that because these are MASSIVE text files (often exceeding 4GB).
With that said, I want to iterate line by line through a file (which has been properly encoded as utf-8 using codecs) but I don't want line breaks to occur on the \x0b (\v) character. Unfortunately, there is some binary data that shows up in my file that has the \x0b character. Naturally, this causes a line break which ends up splitting up some lines that I need to keep intact. I'd like to ignore this character when determining where line breaks should occur while iterating through the file.
Is there a parameter or approach that will enable me to do this? I'm ok with writing my own generator to iterate line by line through the file by specifying my own valid line break characters, but I'm not sure if there isn't a simpler approach, and I'm not sure how to do this since I'm using the codecs library to handle encoding.
Here are some (sanitized) sample data:
Record#|EventID|Date| Time-UTC|Level|computer name|param_01|param_02|param_03|param_04|param_05|param_06|source name|event log
84491|682|03/19/2015| 21:59:16.000|a-pass|WKS-WINXP32BIT|sample_user|SampleGroup|(0x0,0xF38058)|RDP-Tcp#12|RogueApp|10.3.98.6|Security|C:\Users\sampleuser\EventLogs\problem-child\SecEvent.Evt
90582|682|04/03/2015| 14:42:14.000|a-pass|WKS-WINXP32BIT|sample_user|SampleGroup|(0x0,0x35BDF)|RDP-Tcp#5|RogueApp|10.3.98.14|Security|C:\Users\sampleuser\EventLogs\problem-child\SecEvent.Evt
90613|682|04/03/2015| 16:26:03.000|a-pass|WKS-WINXP32BIT|sample_user|SampleGroup|(0x0,0x35BDF)|RDP-Tcp#9|Anonymous's Mac|192.168.1.14ì°†éªá…°ê±ºà¨€A਀Aì°†éªá… ê±ºà¬€A଀Aé¶é«á… Ö Î„|Security|C:\Users\sampleuser\EventLogs\problem-child\SecEvent.Evt
90626|682|04/03/2015| 16:57:35.000|a-pass|WKS-WINXP32BIT|sample_user|SampleGroup|(0x0,0x35BDF)|RDP-Tcp#11|RogueApp|10.3.98.14|Security|C:\Users\sampleuser\EventLogs\problem-child\SecEvent.Evt
91018|682|04/04/2015| 13:56:13.000|a-pass|WKS-WINXP32BIT|sample_user|SampleGroup|(0x0,0x100513C)|RDP-Tcp#33|Anonymous|10.3.58.13|Security|C:\Users\sampleuser\EventLogs\problem-child\SecEvent.Evt
91038|682|04/04/2015| 14:09:19.000|a-pass|WKS-WINXP32BIT|sample_user|SampleGroup|(0x0,0x100513C)|RDP-Tcp#39|Anonymous's Mac|192.168.1.18ì°†éªá…°ê±ºæ¸€x渀xì°†éªá… ê±ºæ¬€x欀xé¶é«á… Ð€Ì€|Security|C:\Users\sampleuser\EventLogs\problem-child\SecEvent.Evt
91064|682|04/04/2015| 15:25:33.000|a-pass|WKS-WINXP32BIT|ACN-Helpdesk|WKS-WINXP32BIT|(0x0,0x11FA916)|RDP-Tcp#43|CONTROLLER|10.3.58.4|Security|C:\Users\sampleuser\EventLogs\problem-child\SecEvent.Evt
91163|682|04/04/2015| 16:40:19.000|a-pass|WKS-WINXP32BIT|Anonymous|SampleGroup|(0x0,0x37D49)|RDP-Tcp#2|Anonymous's Mac|192.168.1.18ì°†éªá…°ê±ºá´€æ®–ᴀ殖찆éªá… ê±ºã¬€æ®–㬀殖é¶é«á… Ð€Ì€|Security|C:\Users\sampleuser\EventLogs\problem-child\SecEvent.Evt
91204|682|04/04/2015| 18:10:55.000|a-pass|WKS-WINXP32BIT|Anonymous|SampleGroup|(0x0,0x37D49)|RDP-Tcp#5|Anonymous's Mac|192.168.1.18ì°†éªá…°ê±ºæ˜€æ˜€ì°†éªá… ê±ºæ„€æ„€é¶é«á… Ð€Ì€|Security|C:\Users\sampleuser\EventLogs\problem-child\SecEvent.Evt
91545|682|04/05/2015| 13:41:58.000|a-pass|WKS-WINXP32BIT|Anonymous|SampleGroup|(0x0,0x37D49)|RDP-Tcp#7|Anonymous's Mac|192.168.1.14ì°†éªá…°ê±ºìˆ€ìˆ€ì°†éªá… ê±ºëŒ€ëŒ€é¶é«á… Ð€Ì€|Security|C:\Users\sampleuser\EventLogs\problem-child\SecEvent.Evt
91567|682|04/05/2015| 14:42:21.000|a-pass|WKS-WINXP32BIT|Anonymous|SampleGroup|(0x0,0x37D49)|RDP-Tcp#9|Anonymous's Mac|192.168.1.14ì°†éªá…°ê±ºæ €æ €ì°†éªá… ê±ºæ„€æ„€é¶é«á… Ð€Ì€|Security|C:\Users\sampleuser\EventLogs\problem-child\SecEvent.Evt
92120|682|04/06/2015| 19:06:43.000|a-pass|WKS-WINXP32BIT|ACN-Helpdesk|WKS-WINXP32BIT|(0x0,0x3D6DB)|RDP-Tcp#2|Anonymous's Mac|192.168.1.14ì°†éªá…°ê±ºç„€ç„€ì°†éªá… ê±ºçœ€çœ€é¶é«á… Ð€Ì€|Security|C:\Users\sampleuser\EventLogs\problem-child\SecEvent.Evt
It parses everything fine except for the very last row. Yes I know there shouldn't be binary data in a CSV file, but there is. And I have no choice in that matter.
>>> with open("out.test","wb") as f:
... f.write("a\va\nb\rq")
...
>>> for line in open("out.test","rb"):
... print line.decode("utf8")
...
a♂a
q
seems fine in python 2.7 ... what kind of encoding is this file that this wont work?

editing a single .txt line in python 3.1

i have some data stored in a .txt file in this format:
----------|||||||||||||||||||||||||-----------|||||||||||
1029450386abcdefghijklmnopqrstuvwxy0293847719184756301943
1020414646canBeFollowedBySpaces 3292532113435532419963
don't ask...
i have many lines of this, and i need a way to add more digits to the end of a particular line.
i've written code to find the line i want, but im stumped as to how to add 11 characters to the end of it. i've looked around, this site has been helpful with some other issues i've run into, but i can't seem to find what i need for this.
it is important that the line retain its position in the file, and its contents in their current order.
using python3.1, how would you turn this:
1020414646canBeFollowedBySpaces 3292532113435532419963
into
1020414646canBeFollowedBySpaces 329253211343553241996301846372998
As a general principle, there's no shortcut to "inserting" new data in the middle of a text file. You will need to make a copy of the entire original file in a new file, modifying your desired line(s) of text on the way.
For example:
with open("input.txt") as infile:
with open("output.txt", "w") as outfile:
for s in infile:
s = s.rstrip() # remove trailing newline
if "target" in s:
s += "0123456789"
print(s, file=outfile)
os.rename("input.txt", "input.txt.original")
os.rename("output.txt", "input.txt")
Check out the fileinput module, it can do sort of "inplace" edits with files. though I believe temporary files are still involved in the internal process.
import fileinput
for line in fileinput.input('input.txt', inplace=1, backup='.orig'):
if line.startswith('1020414646canBeFollowedBySpaces'):
line = line.rstrip() + '01846372998' '\n'
print(line, end='')
The print now prints to the file instead of the console.
You might want to back up your original file before editing.
target_chain = '1020414646canBeFollowedBySpaces 3292532113435532419963'
to_add = '01846372998'
with open('zaza.txt','rb+') as f:
ch = f.read()
x = ch.find(target_chain)
f.seek(x + len(target_chain),0)
f.write(to_add)
f.write(ch[x + len(target_chain):])
In this method it's absolutely obligatory to open the file in binary mode 'b' for some reason linked to the treatment of the end of lines by Python (see Universal Newline, enabled by default)
The mode 'r+' is to allow the writing as well as the reading
In this method, what is before the target_chain in the file remains untouched. And what is after the target_chain is shifted ahead. As said by Greg Hewgill, there is no possibility to move apart bits on a hard drisk to insert new bits in the middle.
Evidently, if the file is very big, reading all of its content in ch could be too much memory consuming and the algorithm should then be changed: reading line after line until the line containing the target_chain, and then reading the next line before inserting, and then continuing to do "reading the next line - re-writing on the current line" until the end of the file in order to shift progressively the content from the line concerned with addition.
You see what I mean...
Copy the file, line by line, to another file. When you get to the line that needs extra chars then add them before writing.

taking a character input in python from a file?

in python , suppose i have file data.txt . which has 6 lines of data . I want to calculate the no of lines which i am planning to do by going through each character and finding out the number of '\n' in the file . How to take one character input from the file ? Readline takes the whole line .
I think the method you're looking for is readlines, as in
lines = open("inputfilex.txt", "r").readlines()
This will give you a list of each of the lines in the file. To find out how many lines, you can just do:
len(lines)
And then access it using indexes, like lines[3] or lines[-1] as you would any normal Python list.
You can use read(1) to read a single byte. help(file) says:
read(size) -> read at most size bytes, returned as a string.
If the size argument is negative or omitted, read until EOF is reached.
Notice that when in non-blocking mode, less data than what was requested
may be returned, even if no size parameter was given.
Note that reading a file a byte at a time is quite un-"Pythonic". This is par for the course in C, but Python can do a lot more work with far less code. For example, you can read the entire file into an array in one line of code:
lines = f.readlines()
You could then access by line number with a simple lines[lineNumber] lookup.
Or if you don't want to store the entire file in memory at once, you can iterate over it line-by-line:
for line in f:
# Do whatever you want.
That is much more readable and idiomatic.
It seems the simplest answer for you would be to do:
for line in file:
lines += 1
# do whatever else you need to do for each line
Or the equivalent construction explicitly using readline(). I'm not sure why you want to look at every character when you said above that readline() is correctly reading each line in its entirety.
To access a file based on its lines, make a list of its lines.
with open('myfile') as f:
lines = list(f)
then simply access lines[3] to get the fourth line and so forth. (Note that this will not strip the newline characters.)
The linecache module can also be useful for this.

Categories

Resources