I need a function that iterates over all the lines in a file.
Here's what I have so far:
def LineFeed(file):
    ret = ""
    for byte in file:
        ret = ret + str(byte)
        if str(byte) == '\r':
            yield ret
            ret = ""
All the lines in the file end with \r (not \n), and I'm reading it in "rb" mode (I have to read this file in binary).
The yield doesn't work and returns nothing. Maybe there's a problem with the comparison?
I'm just not sure how you represent a byte/char in Python.
I'm getting the idea that if you for loop over an "rb" file, it still tries to iterate over lines, not bytes. How can I iterate over bytes?
My problem is that I don't have standard line endings. Also, my file is filled with 0x00 bytes and I would like to get rid of them all, so I think I would need a second yield function. How could I implement that? I just don't know how to represent the 0x00 byte or the NULL char in Python.
I think that you are confused about what "for x in file" does. Assuming you got your handle like "file = open(file_name)", byte in this case will be an entire line, not a single character. So you are only calling yield when the entire line consists of a single carriage return. Try changing "byte" to "line" and iterating over that with a second loop, as the demo below shows.
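For example (the filename here is just a placeholder), iterating a handle yields whole lines, newline included:

with open('the_file.txt') as f:
    for item in f:
        print(repr(item))   # each item is a full line, e.g. 'first line\n'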
Perhaps if you were to explain what this file represents, why it has lots of '\x00', why you think you need to read it in binary mode, we could help you with your underlying problem.
Otherwise, try the following code; it avoids any dependence on (or interference from) your operating system's line-ending convention.
lines = open("the_file", "rb").read().split("\r")
for line in lines:
    process(line)
Edit: the ASCII NUL (not "NULL") byte is "\x00".
If you're in control of how you open the file, I'd recommend opening it with universal newlines, since \r isn't recognized as a linefeed character if you just use 'rb' mode, but it is if you use 'Urb'.
This will only work if you aren't including \n as well as \r in your binary file somewhere, since the distinction between \r and \n is lost when using universal newlines.
Assuming you want your yielded lines to still be \r terminated:
NUL = '\x00'

def lines_without_nulls(path):
    with open(path, 'Urb') as f:
        for line in f:
            yield line.replace(NUL, '').replace('\n', '\r')
So, your problem is iterating over the lines of a file open in binary mode that use '\r' as a line separator. Since the file is in binary mode, you cannot use the universal newline feature, and it turns out that '\r' is not interpreted as a line separator in binary mode.
Reading a file char by char is a terribly inefficient thing to do in Python, but here's how you could iterate over your lines:
def cr_lines(the_file):
    line = []
    while True:
        byte = the_file.read(1)
        if not byte:
            break
        line.append(byte)
        if byte == '\r':
            yield ''.join(line)
            line = []
    if line:
        yield ''.join(line)
To be more efficient, you would need to read bigger chunks of text and handle the buffering in your iterator. Keep in mind that you could get strange bugs if you seek while iterating; preventing those bugs would require a subclass of file so you can purge the buffer on seek. A rough sketch of that buffered approach follows.
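Something like this, assuming the Python 2 behaviour where read() on a file opened in 'rb' mode returns a str (the generator name and chunk size are made up):

def cr_lines_buffered(the_file, chunk_size=4096):
    pending = ''
    while True:
        chunk = the_file.read(chunk_size)
        if not chunk:
            break
        pending += chunk
        pieces = pending.split('\r')
        pending = pieces.pop()        # tail with no '\r' yet; keep buffering
        for piece in pieces:
            yield piece + '\r'        # keep lines '\r'-terminated
    if pending:
        yield pending                 # leftover text after the last '\r'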
Note the use of the ''.join(line) idiom. Accumulating a string with += has terrible performance and is a common mistake made by beginning programmers.
Edit:
Concatenating strings with string1 += string2 is slow; try joining a list of strings instead.
ddaa is right: you shouldn't need the struct package if the binary file only contains ASCII. Also, my generator returns the string after the final '\r', before EOF. With these two minor fixes, my code is suspiciously similar (practically identical) to this more recent answer.
Code snip:
def LineFeed(f):
    ret = []
    while True:
        oneByte = f.read(1)
        if not oneByte:
            break
        # Return everything up to, but not including, the carriage return
        if oneByte == '\r':
            yield ''.join(ret)
            ret = []
        else:
            ret.append(oneByte)
    if ret:
        # also yield any text after the final '\r', before EOF
        yield ''.join(ret)

if __name__ == '__main__':
    lf = LineFeed(open('filename', 'rb'))
    for something in lf:
        doSomething(something)
I'm modifying a file with Python that may already contain newlines, like the following:
#comment
something
#new comment
something else
My code appends some lines to this file. I'm also writing the code that will remove what I added (ideally still working if other modifications have occurred in the file).
Currently, I end up with a file that grows each time I apply the code (append/remove), with newline characters at the end of the file.
I'm looking for a clean way to remove those newlines without too much programmatic complexity. Newlines "inside" the file should remain; newlines at the end of the file should be removed.
Use the str.rstrip() method:
my_file = open("text.txt", "r+")
content = my_file.read()
content = content.rstrip('\n')
my_file.seek(0)
my_file.write(content)
my_file.truncate()
my_file.close()
I needed a way to remove the newline at EOF without having to read the whole file into memory. The code below works for me; I find it efficient in terms of memory when dealing with large files.
with open('test.txt', 'r+') as f:  # opens file in read/write text mode
    f.seek(0, 2)                   # navigates to the position at end of file
    f.seek(f.tell() - 1)           # navigates to the position of the last char
    last_char = f.read()
    if last_char == '\n':
        f.truncate(f.tell() - 1)
I have two text files, both of them having 150000+ lines of data. I need to shorten them to a range of lines.
Allow me to explain:
The line which starts with "BO_ " must be the first line, and the last will be the one which does not start with "BO_". How do I compare a set of characters if Python reads the file one byte at a time?
Is there any built-in function to trim the lines in the file? I thought of getting each byte and checking it consecutively against 'B', 'O', '_' and ' ', but this would be hectic, and I bet the memory would run out before it even finished checking the file, especially if the pattern only occurs at the end of the file.
I tried the following code:
def character(f):
    c = f.read(1)
    while c:
        yield c
        c = f.read(1)
This code works perfectly fine; it returns each byte of the text. But going by this approach would be difficult and time-consuming, and the code would get very ugly.
You can use f.readline() to read a line (up until a newline b"\n" character); see the Python documentation on file objects for details.
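As a rough sketch of how that could apply here (the filename, and the assumption that the "BO_" lines form one contiguous block, are mine rather than from the question):

block = []
with open('data.txt') as f:
    while True:
        line = f.readline()
        if not line:
            break
        if line.startswith('BO_'):
            block.append(line)
        elif block:
            break   # first non-"BO_" line after the block ends the range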
I have looked around StackOverflow and couldn't find an answer to my specific question so forgive me if I have missed something.
import re

target = open('output.txt', 'w')

for line in open('input.txt', 'r'):
    match = re.search(r'Stuff', line)
    if match:
        match_text = match.group()
        target.write(match_text + '\n')
    else:
        continue

target.close()
The file I am parsing is huge so need to process it line by line.
This (of course) leaves an additional newline at the end of the file.
How should I best change this code so that the final match doesn't put an extra newline character at the end of the file? Should it go through the file again at the end and remove the trailing newline (which seems a bit inefficient)?
The existing StackOverflow questions I have found cover removing all new lines from a file.
If there is a more pythonic / efficient way to write this code I would welcome suggestions for my own learning also.
Thanks for the help!
Another thing you can do is truncate the file. .tell() gives us the current byte position in the file; we then subtract one and truncate there to remove the trailing newline.
with open('a.txt', 'w') as f:
    f.write('abc\n')
    f.write('def\n')
    f.truncate(f.tell() - 1)
On Linux and MacOS, the -1 is correct, but on Windows it needs to be -2. A more Pythonic method of determining which is to check os.linesep.
import os

remove_chars = len(os.linesep)

with open('a.txt', 'w') as f:
    f.write('abc\n')
    f.write('def\n')
    f.truncate(f.tell() - remove_chars)
kindall's answer is also valid, with the exception that you said it's a large file. This method will let you handle a terabyte-sized file on a gigabyte of RAM.
Write the newline of each line at the beginning of the next line. To avoid writing a newline at the beginning of the first line, use a variable that is initialized to an empty string and then set to a newline in the loop.
import re

with open('input.txt') as source, open('output.txt', 'w') as target:
    newline = ''
    for line in source:
        match = re.search(r'Stuff', line)
        if match:
            target.write(newline + match.group())
            newline = '\n'
I also restructured your code a bit (the else: continue is not needed, because what else is the loop going to do?) and changed it to use the with statement so the files are automatically closed.
The shortest path from what you have to what you want is probably to store the results in a list, then join the list with newlines and write that to the file.
import re

target = open('output.txt', 'w')
results = []

for line in open('input.txt', 'r'):
    match = re.search(r'Stuff', line)
    if match:
        results.append(match.group())

target.write("\n".join(results))
target.close()
Voilà, no extra newline at the beginning or end. Might not scale very well if the resulting list is huge. (And like kindall, I left out the else.)
Since you're performing the same regex over and over, you'd probably want to compile it beforehand.
import re
prog = re.compile(r'Stuff')
I tend to input from and output to stdin and stdout for simplicity. But that's a matter of taste (and specs).
from sys import stdin, stdout
Ignoring the specific requirement about removing the final EOL[1], and just addressing the bit about your own learning, the whole thing could be written like this:
from itertools import imap
stdout.writelines(match.group() for match in imap(prog.match, stdin) if match)
[1] As others have commented, this is a Bad Thing, and it's extremely annoying when someone does this.
Specifically, I have exported a CSV file from Google AdWords.
I read the file line by line and change the phone numbers.
Here is the literal script:
for line in open('ads.csv', 'r'):
    newdata = changeNums(line)
    sys.stdout.write(newdata)
And changeNums() just performs some string replaces and returns the string.
The problem is that at the end of the printed lines there is a musical note character.
The original CSV does not have this note at the end of lines. Also, I cannot copy-paste the note.
Is this some kind of encoding issue or what's going on?
Try opening with universal newline support:
for line in open('ads.csv', 'rU'):
    # etc
Either:
the original file has some characters in it (and they're being shown as this symbol in the terminal),
changeNums is creating those characters, or
stdout.write is sending some uninterpreted newline symbol that, again, is being shown by the terminal as this symbol; try changing this line to print(newdata).
My guess: changeNums is adding it.
Best debugging commands:
print([ord(x) for x in line])
print([ord(x) for x in newdata])
print(line == newdata)
And check for the character values present in the string.
You can strip out the newlines by:
for line in open('ads.csv', 'r'):
    line = line.rstrip('\n')
    newdata = changeNums(line)
    sys.stdout.write(newdata)
An odd "note" character at the end is usually a CR/LF newline issue between *nix and *dos/*win environments.
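One quick way to see exactly which characters are there is to print the repr() of a line; a stray carriage return then shows up explicitly (the sample data below is made up):

line = 'some,csv,fields\r\n'   # made-up sample
print(repr(line))              # shows the '\r' that the terminal renders oddly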
So I have a program which runs. This is part of the code:
FileName = 'Numberdata.dat'
NumberFile = open(FileName, 'r')

for Line in NumberFile:
    if Line == '4':
        print('1')
    else:
        print('9')

NumberFile.close()
A pretty pointless thing to do, yes, but I'm just doing it to enhance my understanding. However, this code doesn't work. The file remains as it is: the 4's are not replaced by 1's, everything else isn't replaced by 9's, and it all merely stays the same. Where am I going wrong?
Numberdata.dat is "444666444666444888111000444"
It is now:
FileName = 'Binarydata.dat'
BinaryFile = open(FileName, 'w')

for character in BinaryFile:
    if charcter == '0':
        NumberFile.write('')
    else:
        NumberFile.write('#')

BinaryFile.close()
You need to build up a string and write it to the file.
FileName = 'Numberdata.dat'
NumberFileHandle = open(FileName, 'r')
newFileString = ""

for line in NumberFileHandle:
    for char in line:  # this will work for any number of lines
        if char == '4':
            newFileString += "1"
        elif char == '\n':
            newFileString += char
        else:
            newFileString += "9"

NumberFileHandle.close()
NumberFileHandle = open(FileName, 'w')
NumberFileHandle.write(newFileString)
NumberFileHandle.close()
First, Line will never equal 4 because each line read from the file includes the newline character at the end. Try if Line.strip() == '4'. This will remove all white space from the beginning and end of the line.
Edit: I just saw your edit... naturally, if you have all your numbers on one line, the line will never equal 4. You probably want to read the file a character at a time, not a line at a time.
Second, you're not writing to any file, so naturally the file won't be getting changed. You will run into difficulty changing a file as you read it (since you have to figure out how to back up to the same place you just read from), so the usual practice is to read from one file and write to a different one.
Because you need to write to the file as well.
with open(FileName, 'w') as f:
    f.write(...)
Right now you are just reading and manipulating the data, but you're not writing it back.
At the end you'll need to reopen your file in write mode and write to it.
If you're looking for references, take a look at the open() documentation and at the Reading and Writing Files section of the Python Tutorial.
Edit: You shouldn't read and write at the same time from the same file. You could either write to a temp file and call shutil.move() at the end, or load and manipulate your data and then re-open the original file in write mode and write it back.
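As a rough sketch of the temp-file approach (the filenames and the replacement logic are placeholders, not the asker's exact code):

import shutil

with open('Numberdata.dat') as src, open('Numberdata.tmp', 'w') as dst:
    for char in src.read():
        if char == '\n':
            dst.write(char)               # keep line breaks as-is
        else:
            dst.write('1' if char == '4' else '9')

shutil.move('Numberdata.tmp', 'Numberdata.dat')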
You are not sending any output to the file; you are simply printing 1 and 9 to stdout, which is usually the terminal or interpreter.
If you want to write to the file you have to use open again with 'w', e.g.:
out = open(FileName, 'w')
Then you can call out.write('1'), for example. You can also use:
print >>out, '1'
Also, if you want to overwrite the file, it is a better idea to read it in full first and write afterwards.
According to your comment:
Numberdata is just a load of numbers all one line. Maybe that's where I'm going wrong? It is "444666444666444888111000444"
I can tell you that the for loop iterates over lines, not over chars, so there is a logic error.
Moreover, you have to write the file, as Rik Poggi said (just remember to open it in write mode).
A few things:
The r flag to open indicates read-only mode. This obviously won't let you write to the file.
print() outputs things to the screen. What you really want to do is output to the file. Have you read the Python File I/O tutorial?
for line in file_handle: loops through files one line at a time. Thus, if line == '4' will only be true if the line consists of a single character, 4, all on its own.
If you want to loop over characters in a string, then do something like for character in line:.
Modifying bits of a file "in place" is a bit harder than you think.
This is because if you insert data into the middle of a file, the rest of the data has to shuffle over to make room - this is really slow because everything after your insertion has to be rewritten.
In theory, a one-byte for one-byte replacement can be done fast, but in general people don't want to replace byte-for-byte, so this is an advanced feature. (See seek().) The usual approach is to just write out a whole new file.
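Putting those points together, a minimal sketch (the filename is from the question; printing to stdout is just for illustration, and newlines will also hit the else branch here):

with open('Numberdata.dat', 'r') as f:
    for line in f:                 # one line at a time
        for character in line:     # then one character at a time
            if character == '4':
                print('1')
            else:
                print('9')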
Because print doesn't write to your file.
You have to open the file and read it, modify the string you obtain to create a new string, then open the file again and write the new string back.
FileName = 'Numberdata.dat'

NumberFile = open(FileName, 'r')
data = NumberFile.read()
NumberFile.close()

dl = data.split('\n')
for i in range(len(dl)):
    if dl[i] == '4':
        dl[i] = '1'
    else:
        dl[i] = '9'

NumberFile = open(FileName, 'w')
NumberFile.write('\n'.join(dl))
NumberFile.close()
Try it this way. There are surely different methods, but this seems like the most "linear" one to me =)