Fastest way to split super long line into multiple lines - python

I have a huge XML-File (about 1TB) that is written in one long line.
I want to extract some of its features, and I think this will be easier once the long line is split into a new line after each tag.
The file is built like that:
<textA textB textC> <textD textE textF> <textG textH textI>
I now started cracking the long line with this code:
eof = 0
while eof == 0:
    character = historyfile.read(1)
    if character != ">" and character != "":
        output.write(character)
    if character == ">":
        output.write('>' + '\n')
    if character == "":
        eof = 1
Unfortunately, this code will take about 12 days to process the whole file.
I am now wondering whether there are much faster ways to process the file in a similar way, ideally at least twice as fast.
My first idea is to maybe just parse through the file and replace the closing tag like this:
for line in infile:
    line.replace('>', '>' + '\n')
Do you think this approach will be much faster? I would try it myself, but the first version has already been running for a day and a half ;)

If you try to read the file line by line, and the whole 1TB file is a single line, you will get a str variable of the same length. I do not know the implementation details, but I would guess a MemoryError is raised long before the read finishes.
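A minimal sketch of the chunked alternative (the chunk size is an arbitrary choice; historyfile and output are assumed to be the already-open file objects from the question): read the file in large fixed-size blocks, add a newline after every '>', and write each block out, so only one block is ever held in memory.
CHUNK_SIZE = 64 * 1024 * 1024          # 64 MB per read; tune to the available RAM
while True:
    chunk = historyfile.read(CHUNK_SIZE)
    if not chunk:
        break
    # str.replace returns a new string, so the result has to be written out explicitly
    output.write(chunk.replace('>', '>\n'))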

Related

Reading and comparing a set of bytes: Python

I have two text files, both of them having 150000+ lines of data. I need to shorten them to a range of lines.
Allow me to explain:
The line which starts with "BO_ " must be the first line, and the last will be the one which does not start with "BO_". How do I compare a set of characters, given that Python reads the file one byte at a time?
Is there any inbuilt function to trim the lines in the file? I thought of getting each byte and checking it consecutively against B, O, _ and " ", but that would be hectic, and I bet the memory would run out before the whole file is even checked, especially if the part I am looking for only appears at the end of the file.
I tried the following code:
def character(f):
    c = f.read(1)
    while c:
        yield c
        c = f.read(1)
This code works perfectly fine and returns each byte of the text. But going down this route would be difficult and time-consuming, and the code would be very ugly.
You can use f.readline() to read a line (everything up to and including the next newline b"\n" character); see the documentation of file objects for more.
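As a minimal sketch of that idea (messages.txt and trimmed.txt are placeholder names, and it assumes the goal is to keep the contiguous block of lines starting with "BO_ "): iterate line by line, start collecting at the first "BO_ " line, and stop at the first line after the block that does not start with "BO_ ".
kept = []
with open("messages.txt") as f:
    in_block = False
    for line in f:                      # one line at a time, never the whole file
        if line.startswith("BO_ "):
            in_block = True
            kept.append(line)
        elif in_block:
            break                       # first non-"BO_ " line after the block ends the range
with open("trimmed.txt", "w") as out:
    out.writelines(kept)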

Ignore the rest of the line read after using file.readline(size)

I have got an issue.
I have a Python application that will be deployed in various places, so Mr Nasty will quite likely tinker with the app.
So the problem is security related. The app will receive a plain-text file from a remote source. The device has a very limited amount of RAM (Raspberry Pi).
It is very much possible to feed extremely large input to the script, which would be big trouble.
I want to avoid reading each line of the file "as is" and instead read just the first part of the line, limited to e.g. 44 bytes, and ignore the rest.
So just for the sake of the case a very crude sample:
lines = []
with open("path/to/file.txt", "r") as fh:
    while True:
        line = fh.readline(44)
        if not line:
            break
        lines.append(line)
This works, but if a line is longer than 44 characters, the next read will return the rest of the line, or even several 44-byte parts of the same line.
To demonstrate:
print(lines)
['aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa',
'aaaaaaaaaaaaaaaaaaaaaaaaa \n',
'11111111111111111111111111111111111111111111',
'111111111111111111111111111111111111111\n',
'bbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbb',
'bbbbbbbbbbbbbbb\n',
'22222222222222222222222222222222222222222\n',
'cccccccccccccccccccccccccccccccccccccccccccc',
'cccccccccccccccccccccccccccccccccccccccccccc',
'cccc\n',
'333333333333\n',
'dddddddddddddddddddd\n']
This wouldn't save me from reading the whole content into a variable and potentially causing a neat DoS.
I thought that maybe using file.next() would jump to the next line.
lines = []
with open("path/to/file.txt", "r") as fh:
    while True:
        line = fh.readline(44)
        if not line:
            break
        if line != "":
            lines.append(line.strip())
        fh.next()
But this throws an error:
Traceback (most recent call last):
  File "./test.py", line 7, in <module>
    line = fh.readline(44)
ValueError: Mixing iteration and read methods would lose data
...which I can't do much about.
I've read up on file.seek(), but that doesn't really offer any capability for this (going by the docs).
While I was writing this question, I actually figured it out myself. It's so simple it's almost embarrassing. But I thought I would finish the post and leave it here for others who may have the same issue.
So my solution:
lines = []
with open("path/to/file.txt", "r") as fh:
    while True:
        line = fh.readline(44)
        if not line:
            break
        lines.append(line)
        if '\n' not in line:
            fh.readline()
So the output now looks like this:
print(lines)
['aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa',
'11111111111111111111111111111111111111111111',
'bbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbb',
'22222222222222222222222222222222222222222\n',
'cccccccccccccccccccccccccccccccccccccccccccc',
'333333333333\n',
'dddddddddddddddddddd\n']
Which is close enough.
I don't dare to say it's the best or a good solution, but it seems to do the job, and I'm not storing the redundant part of the lines in a variable at all.
But just for the sake of curiosity, I actually have a question.
As above:
fh.readline()
When you call such a method without redirecting its output to a variable or else, where does this store the input, and what's its lifetime (I mean when is it going to be destroyed if it's being stored at all)?
Thank you all for the inputs. I've learned a couple of useful things.
I don't really like the way file.read(n) works, even though most of the solutions rely on it.
Thanks to you guys, I've come up with an improved version of my original solution, using only file.readline(n):
limit = 10
lineList = []
with open("linesfortest.txt", "rb") as fh:
    while True:
        line = fh.readline(limit)
        if not line:
            break
        if line.strip() != "":
            lineList.append(line.strip())
        while '\n' not in line:
            line = fh.readline(limit)
print(lineList)
If my thinking is correct, the inner while loop will read successive chunks of the same line until it reads the EOL character, and meanwhile it will reuse just one size-limited variable again and again.
And that produces this output:
['"Alright,"',
'"You\'re re',
'"Tell us!"',
'"Alright,"',
'Question .',
'"The Answe',
'"Yes ...!"',
'"Of Life,',
'"Yes ...!"',
'"Yes ...!"',
'"Is ..."',
'"Yes ...!!',
'"Forty-two']
From the content of
"Alright," said the computer and settled into silence again. The two men fidgeted. The tension was unbearable.
"You're really not going to like it," observed Deep Thought.
"Tell us!"
"Alright," said Deep Thought.
Question ..."
"The Answer to the Great
"Yes ...!"
"Of Life, the Universe and Everything ..." said Deep Thought
"Yes ...!" "Is ..." said Deep Thought, and paused.
"Yes ...!"
"Is ..."
"Yes ...!!!...?"
"Forty-two," said Deep Thought, with infinite majesty and calm.
When you just do:
f.readline()
a line is read from the file, and a string is allocated, returned, then discarded.
If you have very large lines, you could run out of memory (in the allocation/reallocation phase) just by calling f.readline() (it happens when some files are corrupt) even if you don't store the value.
Limiting the size of the line works, but if you call f.readline() again, you get the remainder of the line. The trick is to skip the remaining chars until a line termination char is found. A simple standalone example of how I'd do it:
max_size = 20
with open("test.txt") as f:
    while True:
        l = f.readline(max_size)
        if not l:
            break  # we reached the end of the file
        if l[-1] != '\n':
            # skip the rest of the line
            while True:
                c = f.read(1)
                if not c or c == "\n":  # end of file or end of line
                    break
        print(l.rstrip())
That example reads the start of a line, and if the line has been truncated (that is, when it doesn't end with a line termination character), it reads the rest of the line one character at a time, discarding it. Even if the line is very long, it doesn't consume memory. It's just dead slow.
About combining next() and readline(): those are competing mechanisms (manual iteration vs. classical line reads) and they mustn't be mixed, because the buffering of one method may be ignored by the other. But you can mix read() and readline(), or a for loop and next().
Try it like this:
'''
$cat test.txt
aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
bbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbb
ccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccc
'''
from time import sleep  # trust me on this one

lines = []
with open("test.txt", "r") as fh:
    while True:
        line = fh.readline(44)
        print(line.strip())
        if not line:
            #sleep(0.05)
            break
        lines.append(line.strip())
        if not line.endswith("\n"):
            # discard the rest of the line one character at a time
            # (also stop on "", in case the file does not end with a newline)
            while fh.readline(1) not in ("\n", ""):
                pass
print(lines)
Quite simple: it reads 44 characters, and if the chunk doesn't end in a newline, it reads one character at a time until it reaches one, to avoid pulling large chunks into memory; only then does it go on to process the next 44 characters and append them to the list.
Don't forget to use line.strip() to avoid getting \n as part of the string when the line is shorter than 44 characters.
I'm going to assume you're asking your original question here, and not your side question about temporary values (which Jean-François Fabre has already answered nicely).
Your existing solution doesn't actually solve your problem.
Let's say your attacker creates a line that's 100 million characters long. So:
You do a fh.readline(44), which reads the first 44 characters.
Then you do a fh.readline() to discard the rest of the line. This has to read the rest of the line into a string to discard it, so it uses up 100MB.
You could handle this by reading one character at a time in a loop until '\n', but there's a better solution: just fh.readline(44) in a loop until '\n'. Or maybe fh.readline(8192) or something—temporarily wasting 8KB (it's effectively the same 8KB being used over and over) isn't going to help your attacker.
For example:
while True:
    line = fh.readline(20)
    if not line:
        break
    lines.append(line.strip())
    while line and not line.endswith('\n'):
        line = fh.readline(8192)
In practice, this isn't going to be that much more efficient. A Python 2.x file object wraps a C stdio FILE, which already has a buffer, and with the default arguments to open, it's a buffer chosen by your platform. Let's say your platform uses 16KB.
So, whether you read(1) or readline(8192), it's actually reading 16KB at a time off disk into some hidden buffer, and just copying 1 or 8192 characters out of that buffer into a Python string.
And, while it obviously takes more time to loop 16384 times and build 16384 tiny strings than to loop twice and build two 8K strings, that time is still probably smaller than the disk I/O time.
So, if you understand the read(1) code better and can debug and maintain it more easily, just do that.
However, there might be a better solution here. If you're on a 64-bit platform, or your largest possible file is under 2GB (or it's acceptable for a file >2GB to raise an error before you even process it), you can mmap the file, then search it as if it were a giant string in memory:
from contextlib import closing
import mmap
lines = []
with open('ready.py') as f:
    with closing(mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)) as m:
        start = 0
        while True:
            end = m.find('\n', start)
            if end == -1:
                lines.append(m[start:start+44])
                break
            lines.append(m[start:min(start+44, end)])
            start = end + 1
This maps the whole file into virtual memory, but most of that virtual memory is not mapped to physical memory. Your OS will automatically take care of paging it in and out as needed to fit well within your resources. (And if you're worried about "swap hell": swapping out an unmodified page that's already backed by a disk file is essentially instantaneous, so that's not an issue.)
For example, let's say you've got a 1GB file. On a laptop with 16GB of RAM, it'll probably end up with the whole file mapped into 1GB of contiguous memory by the time you reach the end, but that's also probably fine. On a resource-constrained system with 128MB of RAM, it'll start throwing out the least recently used pages, and it'll end up with just the last few pages of the file mapped into memory, which is also fine. The only difference is that, if you then tried to print m[0:100], the laptop would be able to do it instantaneously, while the embedded box would have to reload the first page into memory. Since you're not doing that kind of random access through the file, that doesn't come up.

Single Line from file is too big?

In python, I'm reading a large file, and I want to add each line(after some modifications) to an empty list. I want to do this to only the first few lines, so I did:
X = []
for line in range(3):
    i = file.readline()
    m = str(i)
    X.append(m)
However, a MemoryError shows up for the line
i = file.readline()
What should I do? It is the same even if I make the range 1 (although I don't know how that affects anything, since the readline is inside the loop).
How do I avoid the error? I'm iterating, and I can't make it into a binary file because the file isn't just integers - there are decimals and non-numerical characters.
The txt file is 5 gigs.
Any ideas?
filehandle.readline() breaks lines via the newline character (\n) - if your file has gigantic lines, or no new lines at all, you'll need to figure out a different way of chunking it.
Normally you might read the file in chunks and process those chunks one by one.
Can you figure out how you might break up the file? Could you, for example, only read 1024 bytes at a time, and work with that chunk?
If not, it's often easier to clean up the format of the file instead of designing a complicated reader.
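A minimal sketch of that chunked approach (big_file.txt and process() are placeholders for your file and your per-chunk logic): read fixed-size blocks and hand each one to whatever processing you need, so only one block lives in memory at a time.
from functools import partial

CHUNK_SIZE = 1024                       # bytes per read; adjust to taste
with open("big_file.txt", "r") as f:
    # iter() with a sentinel calls f.read(CHUNK_SIZE) repeatedly until it returns ''
    for chunk in iter(partial(f.read, CHUNK_SIZE), ''):
        process(chunk)                  # hypothetical function: do your per-chunk work here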

Python conditional statement based on text file string

Noob question here. I'm scheduling a cron job for a Python script to run every 2 hours, but I want the script to stop running after 48 hours, which is not something cron can do. To work around this, I'm recording the number of executions at the end of the script in a text file, using a tally mark x, and reading the text file at the beginning of the script so that it only runs if the count is less than n.
However, my script seems to always run regardless of the conditions. Here's an example of what I've tried:
with open("curl-output.txt", "a+") as myfile:
data = myfile.read()
finalrun = "xxxxx"
if data != finalrun:
[CURL CODE]
with open("curl-output.txt", "a") as text_file:
text_file.write("x")
text_file.close()
I think I'm missing something simple here. Please advise if there is a better way of achieving this. Thanks in advance.
The problem with your original code is that you're opening the file in a+ mode, which seems to set the seek position to the end of the file (try print(data) right after you read the file). If you use r instead, it works. (I'm not sure that's how it's supposed to be. This answer states it should write at the end, but read from the beginning. The documentation isn't terribly clear).
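For illustration, a minimal sketch of that fix (keeping a+ so the file is created on the first run, but seeking back to the start before reading):
with open("curl-output.txt", "a+") as myfile:
    myfile.seek(0)       # "a+" positions the file pointer at the end, so rewind first
    data = myfile.read()
print(repr(data))        # shows exactly what was read, including any trailing '\n'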
Some suggestions: Instead of comparing against the "xxxxx" string, you could just check the length of the data (if len(data) < 5). Or alternatively, as was suggested, use pickle to store a number, which might look like this:
import pickle
try:
    with open("curl-output.txt", "rb") as myfile:
        num = pickle.load(myfile)
except FileNotFoundError:
    num = 0
if num < 5:
    do_curl_stuff()
    num += 1
    with open("curl-output.txt", "wb") as myfile:
        pickle.dump(num, myfile)
Two more things concerning your original code: You're making the first with block bigger than it needs to be. Once you've read the string into data, you don't need the file object anymore, so you can remove one level of indentation from everything except data = myfile.read().
Also, you don't need to close text_file manually. with will do that for you (that's the point).
Sounds more like a job for scheduling with the at command?
See http://www.ibm.com/developerworks/library/l-job-scheduling/ for different job scheduling mechanisms.
The first bug that is immediately obvious to me is that you are appending to the file even if data == finalrun. So when data == finalrun, you don't run curl, but you do append another 'x' to the file. On the next run, data will no longer be equal to finalrun, so it will continue to execute the curl code.
The solution is of course to nest the code that appends to the file under the if statement, as sketched below.
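A sketch of that restructuring (reading with r via a try/except so the first run, when the tally file does not exist yet, still works; the curl call stays a placeholder):
try:
    with open("curl-output.txt", "r") as myfile:
        data = myfile.read()
except IOError:              # first run: the tally file does not exist yet
    data = ""

finalrun = "xxxxx"
if data != finalrun:
    # [CURL CODE] goes here
    with open("curl-output.txt", "a") as text_file:
        text_file.write("x")     # the tally mark is only appended when curl actually ran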
Well, there is probably an end-of-line \n character, which means your file will contain something like xx\n and not simply xx. That is probably why your condition does not work :)
EDIT
If, at the Python command line, you type
open('filename.txt', 'r').read()  # where filename.txt is the name of your file
you will be able to see whether there is a \n or not.
Try using this condition along with the if clause instead:
if data.count('x') == 24
The data string may contain extraneous data such as newline characters. Check repr(data) to see whether it is actually 24 x's.
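A small sketch of that check (24 being one run every 2 hours for 48 hours; I use < 24 to mean "still under the limit", which is how I read the original intent, and counting the x characters sidesteps any stray newline):
with open("curl-output.txt", "r") as myfile:
    data = myfile.read()
print(repr(data))                  # e.g. 'xxx\n' would reveal a hidden newline
if data.count('x') < 24:           # only the tally marks are counted, whitespace ignored
    pass                           # run the curl code here, then append another 'x'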

File contents not as long as expected

with open(sourceFileName, 'rt') as sourceFile:
    sourceFileConents = sourceFile.read()

sourceFileConentsLength = len(sourceFileConents)
i = 0
while i < sourceFileConentsLength:
    print(str(i) + ' ' + sourceFileConents[i])
    i += 1
Please forgive the unPythonic for-i loop; this is only the test code, and there are reasons to do it that way in the real code.
Anyhoo, the real code seemed to be ending the loop sooner than expected, so I knocked up the dummy above, which removes all of the logic of the real code.
The sourceFileConentsLength reports as 13,690, but when I print the contents out char by char, there are still a few hundred more chars in the file which are not being printed out.
What gives?
Should I be using something other than <fileHandle>.read() to get the file's entire contents into a single string?
Have I hit some maximum string length? If so, can I get around it?
Might it be line endings if the file was edited in Windows & the script is run in Linux (sorry, I can't post the file, it's company confidential)
What else?
[Update] I think that strikes out two of those ideas.
For maximum string length, see this question.
I did an ls -lAF to a temp directory. Only 6k+ chars, but the script handled it just fine. Should I be worrying about line endings? If so, what can I do about it? The source files tend to get edited under both Windows & Linux, but the script will only run under Linux.
[Update++] I changed the line endings on my input file to Linux in Eclipse, but still got the same result.
If you read a file in text mode it will automatically convert line endings like \r\n to \n.
Try using
with open(sourceFileName, newline='') as sourceFile:
instead; this will turn off newline-translation (\r\n will be returned as \r\n).
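A quick way to confirm that newline translation explains the difference (a small sketch using the sourceFileName from the question, assuming Python 3): compare the translated text length with the raw byte length.
with open(sourceFileName, 'rt') as f:
    translated = f.read()              # '\r\n' pairs arrive as single '\n' characters
with open(sourceFileName, 'rb') as f:
    raw = f.read()                     # raw bytes, no translation
# With Windows line endings, the lengths differ by exactly the number of '\r\n' pairs.
print(len(translated), len(raw), raw.count(b'\r\n'))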
If your file is encoded in something like UTF-8, you should decode it before counting the characters (read the raw bytes in binary mode first, then decode):
sourceFileContents_utf8 = open(sourceFileName, 'rb').read()
sourceFileContents_unicode = sourceFileContents_utf8.decode('utf8')
print(len(sourceFileContents_unicode))

i = 0
source_file_contents_length = len(sourceFileContents_unicode)
while i < source_file_contents_length:
    print('%s %s' % (str(i), sourceFileContents_unicode[i]))
    i += 1
