Reading and comparing a set of bytes: Python

I have two text files, both with 150,000+ lines of data. I need to trim them down to a range of lines.
Allow me to explain:
The line which starts with "BO_ " must be the first line, and the last will be the first one which does not start with "BO_". How do I compare a set of characters if Python reads the file one byte at a time?
Is there any inbuilt function to trim the lines in the file? I thought of reading each byte and checking the bytes consecutively against 'B', 'O', '_', and ' '. But this would be tedious, and I bet the memory would run out before it even finished checking the file, considering the match might happen only at the end of the file.
I tried the following code:
def character(f):
    c = f.read(1)
    while c:
        yield c
        c = f.read(1)
This code works perfectly fine: it yields each byte of the text. But going byte by byte like this will be difficult and time-consuming, and the code would be very ugly.

You can use f.readline() to read a line (everything up to and including the next newline character, b"\n").
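Building on that, here is a minimal sketch of the whole trim using line iteration instead of byte-by-byte reads; the "BO_ " prefix comes from the question, while the file name and the surrounding logic are assumptions:

kept = []
with open('data.txt') as f:          # 'data.txt' is a placeholder name
    for line in f:
        if line.startswith('BO_ '):
            kept.append(line)
        elif kept:
            kept.append(line)        # first non-"BO_" line after the block: keep it and stop
            break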

Related

Single Line from file is too big?

In Python, I'm reading a large file, and I want to add each line (after some modifications) to an empty list. I want to do this for only the first few lines, so I did:
X = []
for line in range(3):
    i = file.readline()
    m = str(i)
    X.append(m)
However, an error shows up, saying there is a MemoryError for the line
i = file.readline()
What should I do? It is the same even if I make the range 1 (although I don't know how that affects things, since it's inside the loop).
How do I avoid the error? I'm iterating, and I can't make it into a binary file because the file isn't just integers; there are decimals and non-numerical characters.
The txt file is 5 gigs.
Any ideas?
filehandle.readline() breaks lines at the newline character (\n); if your file has gigantic lines, or no newlines at all, you'll need to figure out a different way of chunking it.
Normally you might read the file in chunks and process those chunks one by one.
Can you figure out how you might break up the file? Could you, for example, only read 1024 bytes at a time, and work with that chunk?
If not, it's often easier to clean up the format of the file instead of designing a complicated reader.
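For example, a sketch of that chunked approach; the file name is a placeholder and process() stands in for whatever per-chunk logic you need:

CHUNK_SIZE = 1024

with open('bigfile.txt') as f:
    while True:
        chunk = f.read(CHUNK_SIZE)
        if not chunk:                # an empty string signals EOF
            break
        process(chunk)               # your per-chunk logic goes here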

File contents not as long as expected

with open(sourceFileName, 'rt') as sourceFile:
    sourceFileConents = sourceFile.read()
sourceFileConentsLength = len(sourceFileConents)
i = 0
while i < sourceFileConentsLength:
    print(str(i) + ' ' + sourceFileConents[i])
    i += 1
Please forgive the unPythonic for-i loop; this is only the test code, and there are reasons to do it that way in the real code.
Anyhoo, the real code seemed to be ending the loop sooner than expected, so I knocked up the dummy above, which removes all of the logic of the real code.
sourceFileConentsLength reports as 13,690, but when I print it out char by char, there are still a few hundred chars more in the file which are not being printed out.
What gives?
Should I be using something other than <fileHandle>.read() to get the file's entire contents into a single string?
Have I hit some maximum string length? If so, can I get around it?
Might it be line endings, if the file was edited in Windows and the script is run in Linux? (Sorry, I can't post the file; it's company confidential.)
What else?
[Update] I think that we can strike two of those ideas.
For maximum string length, see this question.
I did an ls -lAF to a temp directory. Only 6k+ chars, but the script handled it just fine. Should I be worrying about line endings? If so, what can I do about it? The source files tend to get edited under both Windows and Linux, but the script will only run under Linux.
[Update++] I changed the line endings on my input file to Linux in Eclipse, but still got the same result.
If you read a file in text mode it will automatically convert line endings like \r\n to \n.
Try using
with open(sourceFileName, newline='') as sourceFile:
instead; this will turn off newline-translation (\r\n will be returned as \r\n).
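One quick way to test the newline-translation theory is to compare the length on disk with the length after translation; a sketch using the question's sourceFileName, assuming a single-byte encoding:

with open(sourceFileName, 'rb') as f:
    raw_len = len(f.read())      # bytes on disk, \r\n left intact
with open(sourceFileName, 'rt') as f:
    text_len = len(f.read())     # \r\n collapsed to \n in text mode
# each CRLF line ending accounts for one "missing" character
print(raw_len - text_len, 'line endings were translated')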
If your file is encoded in something like UTF-8, you should decode the raw bytes before counting the characters:
sourceFileBytes = open(sourceFileName, 'rb').read()
sourceFileContents = sourceFileBytes.decode('utf8')
print(len(sourceFileContents))
i = 0
source_file_contents_length = len(sourceFileContents)
while i < source_file_contents_length:
    print('%s %s' % (i, sourceFileContents[i]))
    i += 1

Fastest way to split super long line into multiple lines

I have a huge XML file (about 1 TB) that is written as one long line.
I want to extract some of its features, and I think it will be easier to do this once the long line is split into new lines after each tag.
The file is built like that:
<textA textB textC> <textD textE textF> <textG textH textI>
I now started cracking the long line with this code:
eof = 0
while eof == 0:
    character = historyfile.read(1)
    if character != ">" and character != "":
        output.write(character)
    if character == ">":
        output.write('>' + '\n')
    if character == "":
        eof = 1
Unfortunately, this code will take about 12 days to process the whole file.
I am now wondering whether there are much faster ways to process the file in a similar way, ideally at least twice as fast.
My first idea is to maybe just parse through the file and replace the closing tag like this:
for line in infile:
    line.replace('>', '>' + '\n')
Do you think this approach will be much faster? I would try it myself, but I already have the first version running for a day and a half ;)
If you tried to just read the file line by line, that one line of 1 TB would give you a str variable of the same length. I do not know the implementation details, but I would guess a MemoryError is raised long before the read finishes.
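If the goal is just to get the tags onto their own lines, a chunked read-and-replace keeps memory flat no matter how long the line is. A sketch, with placeholder file names; since '>' is a single character, it can never straddle a chunk boundary:

CHUNK_SIZE = 1024 * 1024                 # 1 MiB per read

with open('history.xml') as infile, open('split.xml', 'w') as outfile:
    while True:
        chunk = infile.read(CHUNK_SIZE)
        if not chunk:
            break
        # str.replace returns a new string; write it out and move on
        outfile.write(chunk.replace('>', '>\n'))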

Taking a character input in Python from a file?

In Python, suppose I have a file data.txt which has 6 lines of data. I want to calculate the number of lines, which I am planning to do by going through each character and counting the number of '\n' in the file. How do I take one character of input from the file? readline takes the whole line.
I think the method you're looking for is readlines, as in
lines = open("inputfilex.txt", "r").readlines()
This will give you a list of each of the lines in the file. To find out how many lines, you can just do:
len(lines)
And then access it using indexes, like lines[3] or lines[-1] as you would any normal Python list.
You can use read(1) to read a single byte. help(file) says:
read(size) -> read at most size bytes, returned as a string.
If the size argument is negative or omitted, read until EOF is reached.
Notice that when in non-blocking mode, less data than what was requested
may be returned, even if no size parameter was given.
Note that reading a file a byte at a time is quite un-"Pythonic". This is par for the course in C, but Python can do a lot more work with far less code. For example, you can read the entire file into an array in one line of code:
lines = f.readlines()
You could then access by line number with a simple lines[lineNumber] lookup.
Or if you don't want to store the entire file in memory at once, you can iterate over it line-by-line:
for line in f:
    # Do whatever you want.
That is much more readable and idiomatic.
It seems the simplest answer for you would be to do:
lines = 0
for line in file:
    lines += 1
    # do whatever else you need to do for each line
Or the equivalent construction explicitly using readline(). I'm not sure why you want to look at every character when you said above that readline() is correctly reading each line in its entirety.
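For reference, that explicit readline() construction would look like this (f is an open file handle; readline() returns an empty string only at EOF):

lines = 0
while True:
    line = f.readline()
    if not line:
        break
    lines += 1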
To access a file based on its lines, make a list of its lines.
with open('myfile') as f:
    lines = list(f)
then simply access lines[3] to get the fourth line and so forth. (Note that this will not strip the newline characters.)
The linecache module can also be useful for this.
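For example, a quick sketch with linecache, reusing the 'myfile' name from above (line numbers are 1-based):

import linecache

fourth = linecache.getline('myfile', 4)   # returns '' if the line doesn't exist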

Specifying chars in Python

I need a function that iterates over all the lines in the file.
Here's what I have so far:
def LineFeed(file):
    ret = ""
    for byte in file:
        ret = ret + str(byte)
        if str(byte) == '\r':
            yield ret
            ret = ""
All the lines in the file end with \r (not \n), and I'm reading it in "rb" mode (I have to read this file in binary).
The yield doesn't work and returns nothing. Maybe there's a problem with the comparison?
I'm just not sure how you represent a byte/char in Python.
I'm getting the idea that if you for-loop over a file opened in "rb" mode, it still tries to iterate over lines, not bytes. How can I iterate over bytes?
My problem is that I don't have standard line endings. Also, my file is filled with 0x00 bytes and I would like to get rid of them all, so I think I would need a second yield function. How could I implement that? I just don't know how to represent the 0x00 byte, the NUL char, in Python.
I think that you are confused about what "for x in file" does. Assuming you got your handle like file = open(file_name), byte in this case will be an entire line, not a single character. So you are only calling yield when the entire line consists of a single carriage return. Try changing "byte" to "line" and iterating over that with a second loop.
Perhaps if you were to explain what this file represents, why it has lots of '\x00', why you think you need to read it in binary mode, we could help you with your underlying problem.
Otherwise, try the following code; it avoids any dependence on (or interference from) your operating system's line-ending convention.
lines = open("the_file", "rb").read().split("\r")
for line in lines:
    process(line)
Edit: the ASCII NUL (not "NULL") byte is "\x00".
If you're in control of how you open the file, I'd recommend opening it with universal newlines, since \r isn't recognized as a linefeed character if you just use 'rb' mode, but it is if you use 'Urb'.
This will only work if you aren't including \n as well as \r in your binary file somewhere, since the distinction between \r and \n is lost when using universal newlines.
Assuming you want your yielded lines to still be \r terminated:
NUL = '\x00'

def lines_without_nulls(path):
    with open(path, 'Urb') as f:
        for line in f:
            yield line.replace(NUL, '').replace('\n', '\r')
So, your problem is iterating over the lines of a file open in binary mode that use '\r' as a line separator. Since the file is in binary mode, you cannot use the universal newline feature, and it turns out that '\r' is not interpreted as a line separator in binary mode.
Reading a file char by char is a terribly inefficient thing to do in Python, but here's how you could iterate over your lines:
def cr_lines(the_file):
    line = []
    while True:
        byte = the_file.read(1)
        if not byte:
            break
        line.append(byte)
        if byte == '\r':
            yield ''.join(line)
            line = []
    if line:
        yield ''.join(line)
To be more efficient, you would need to read bigger chunks of text and handle the buffering in your iterator. Keep in mind that you could get strange bugs if you seek while iterating; preventing those bugs would require a subclass of file so you can purge the buffer on seek.
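A sketch of that buffered approach, carrying any partial line over between reads; the chunk size is an arbitrary choice:

def cr_lines_buffered(the_file, chunk_size=4096):
    pending = ''
    while True:
        chunk = the_file.read(chunk_size)
        if not chunk:
            break
        pending += chunk
        pieces = pending.split('\r')
        pending = pieces.pop()            # the last piece may be incomplete
        for piece in pieces:
            yield piece + '\r'            # keep lines '\r'-terminated
    if pending:
        yield pending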
Note the use of the ''.join(line) idiom. Accumulating a string with += has terrible performance and is a common mistake made by beginning programmers.
Edit: string1 += string2 style string concatenation is slow; try joining a list of strings instead.
ddaa is right: you shouldn't need the struct package if the binary file only contains ASCII. Also, my generator returns the string after the final '\r', before EOF. With these two minor fixes, my code is suspiciously similar (practically identical) to this more recent answer.
Code snip:
def LineFeed(f):
    ret = []
    while True:
        oneByte = f.read(1)
        if not oneByte:
            break
        # Yield everything up to, but not including, the carriage return
        if oneByte == '\r':
            yield ''.join(ret)
            ret = []
        else:
            ret.append(oneByte)
    # Yield the trailing text after the final '\r', before EOF
    if ret:
        yield ''.join(ret)

if __name__ == '__main__':
    lf = LineFeed(open('filename', 'rb'))
    for something in lf:
        doSomething(something)
