How to read a large file (socket programming and Python)?

I'm a beginner in socket programming and Python. I would like to learn how to send a large text file (e.g., > 5MB) from the server to the client. I keep getting an error that says
Traceback (most recent call last):
File "fserver.py", line 50, in <module>
reply = f.read()
ValueError: Mixing iteration and read methods would lose data
Below is part of my code. Can someone take a look and give me some hints on how to resolve this issue? Thank you for your time.
myserver.py
#validate filename
if os.path.exists(filename):
    with open(filename) as f:
        for line in f:
            reply = f.read()
            client.send(reply)
        #f = open(filename, 'r')
        #reply = f.read()
        #client.send(piece)
else:
    reply = 'File not found'
    client.send(reply)
myclient.py
while True:
    print 'Enter a command: list or get <filename>'
    command = raw_input()
    if command.strip() == 'quit':
        break
    client_socket.send(command)
    data = client_socket.recv(socksize)
    print data

The problem here has nothing to do with sockets, or with how big the file is. When you do this:
for line in f:
    reply = f.read()
The for line in f is trying to read one line of the file at a time, and then for each line you're trying to read the entire file. That won't work.
If you didn't get this error (which you won't in many cases), the first time through the loop you would read and ignore the first line, and then read and send everything but the first line (or, possibly, everything but the first, say, 4KB) as one giant reply, and then the loop would be done.
What you want is either one or the other:
for line in f:
    reply = line
… or …
# no for loop
reply = f.read()
Meanwhile, on your client side, you're only doing one recv. That's going to get the first 4K (or whatever socksize is) or less, and then you never receive anything else.
What you need is a loop. Like this:
while True:
    data = client_socket.recv(socksize)
    print data
But now you have a new problem. Once the file is done, the client will sit there waiting forever for the next chunk of data, which will never come. So the client needs to know when it's done. And the only way it can know that is if the server puts that information into the data stream.
One way to do this is to send the length before the file. One standardized way to do this is to use the netstring protocol. You can find libraries that do this for you, but it's simple enough to do by hand. Or maybe do something more like HTTP, where the headers are just separated by newlines, and separated from the body by a blank line; then you can use socket.makefile as your protocol implementation. Or even a binary protocol, where you just send the length as four bytes.
There's another problem we might as well fix while we're here: send(reply) doesn't necessarily send the whole reply; it sends anywhere from 1 byte to the whole thing, and returns a number telling you what got sent. The simple fix to that is to use sendall(reply), which guarantees to send all of it.
And finally: Your server is expecting that each recv will get exactly one command, as sent by send. But sockets don't work that way. Sockets are byte streams, not message streams; there's nothing preventing recv from getting, say, just half a command, and then your server will break. So, you need some kind of protocol in that direction as well. Again, you could use netstring, or newline-separated messages, or a binary length prefix, but you have to do something.
(The blog post linked above has very simple example code for using binary length prefixes as a protocol.)
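For reference, here's a minimal sketch of the binary length-prefix idea. This is my own illustration, not the blog post's code; send_msg/recv_msg are hypothetical helper names:
import struct

def send_msg(sock, payload):
    # prefix the payload with its length as a 4-byte big-endian integer
    sock.sendall(struct.pack('!I', len(payload)) + payload)

def recv_exactly(sock, n):
    # keep calling recv until exactly n bytes have arrived
    chunks = []
    while n > 0:
        chunk = sock.recv(n)
        if not chunk:
            raise EOFError('socket closed mid-message')
        chunks.append(chunk)
        n -= len(chunk)
    return b''.join(chunks)

def recv_msg(sock):
    # read the 4-byte length, then exactly that many bytes of payload
    (length,) = struct.unpack('!I', recv_exactly(sock, 4))
    return recv_exactly(sock, length)
The server would call send_msg for each reply and the client would call recv_msg in its loop; the same framing works in both directions, which also solves the command-splitting problem.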

You can do for line in file.readlines().


Ignore the rest of the line read after using file.readline(size)

I've got an issue.
I have a Python application that will be deployed in various places, so Mr Nasty is highly likely to tinker with the app.
The problem is security related. The app will receive a plain text file from a remote source, and the device has a very limited amount of RAM (a Raspberry Pi).
It is entirely possible to feed extremely large input to the script, which would be big trouble.
I want to avoid reading each line of the file "as is" and instead read just the first part of the line, limited to e.g. 44 bytes, and ignore the rest.
So, just for the sake of the example, a very crude sample:
lines = []
with open("path/to/file.txt", "r") as fh:
    while True:
        line = fh.readline(44)
        if not line:
            break
        lines.append(line)
This works, but if a line is longer than 44 characters, the next read will return the rest of the line, or even several 44-byte parts of the same line.
To demonstrate:
print(lines)
['aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa',
'aaaaaaaaaaaaaaaaaaaaaaaaa \n',
'11111111111111111111111111111111111111111111',
'111111111111111111111111111111111111111\n',
'bbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbb',
'bbbbbbbbbbbbbbb\n',
'22222222222222222222222222222222222222222\n',
'cccccccccccccccccccccccccccccccccccccccccccc',
'cccccccccccccccccccccccccccccccccccccccccccc',
'cccc\n',
'333333333333\n',
'dddddddddddddddddddd\n']
This wouldn't save me from reading the whole content into a variable and potentially causing a neat DoS.
I thought that maybe using file.next() would jump to the next line.
lines = []
with open("path/to/file.txt", "r") as fh:
    while True:
        line = fh.readline(44)
        if not line:
            break
        if line != "":
            lines.append(line.strip())
            fh.next()
But this throws an error:
Traceback (most recent call last):
File "./test.py", line 7, in <module>
line = fh.readline(44)
ValueError: Mixing iteration and read methods would lose data
...which I can't do much about.
I've read up on file.seek(), but that doesn't offer any such capability whatsoever (according to the docs).
While I was writing this post, I actually figured it out myself. It's so simple it's almost embarrassing. But I thought I'd finish the post and leave it here for others who may have the same issue.
So my solution:
lines = []
with open("path/to/file.txt", "r") as fh:
    while True:
        line = fh.readline(44)
        if not line:
            break
        lines.append(line)
        if '\n' not in line:
            fh.readline()
So the output now looks like this:
print(lines)
['aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa',
'11111111111111111111111111111111111111111111',
'bbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbb',
'22222222222222222222222222222222222222222\n',
'cccccccccccccccccccccccccccccccccccccccccccc',
'333333333333\n',
'dddddddddddddddddddd\n']
Which is close enough.
I don't dare to say it's the best or a good solution, but it seems to do the job, and I'm not storing the redundant part of the lines in a variable at all.
But just for the sake of curiosity, I actually have a question.
As above:
fh.readline()
When you call such a method without assigning its output to a variable or anything else, where does it store the input, and what is its lifetime (I mean, when is it destroyed, if it's stored at all)?
Thank you all for the inputs. I've learned a couple of useful things.
I don't really like the way file.read(n) works, even though most of the solutions rely on it.
Thanks to you guys I've come up with an improved version of my original solution, using only file.readline(n):
limit = 10
lineList = []
with open("linesfortest.txt", "rb") as fh:
    while True:
        line = fh.readline(limit)
        if not line:
            break
        if line.strip() != "":
            lineList.append(line.strip())
        while '\n' not in line:
            line = fh.readline(limit)

print(lineList)
If my thinking is correct, the inner while loop will keep reading chunks of the same line until it reads the EOL character, and meanwhile it only ever uses a single size-limited variable, over and over.
And that provides an output:
['"Alright,"',
'"You\'re re',
'"Tell us!"',
'"Alright,"',
'Question .',
'"The Answe',
'"Yes ...!"',
'"Of Life,',
'"Yes ...!"',
'"Yes ...!"',
'"Is ..."',
'"Yes ...!!',
'"Forty-two']
From the content of
"Alright," said the computer and settled into silence again. The two men fidgeted. The tension was unbearable.
"You're really not going to like it," observed Deep Thought.
"Tell us!"
"Alright," said Deep Thought.
Question ..."
"The Answer to the Great
"Yes ...!"
"Of Life, the Universe and Everything ..." said Deep Thought
"Yes ...!" "Is ..." said Deep Thought, and paused.
"Yes ...!"
"Is ..."
"Yes ...!!!...?"
"Forty-two," said Deep Thought, with infinite majesty and calm.
When you just do:
f.readline()
a line is read from the file, and a string is allocated, returned, then discarded.
If you have very large lines, you could run out of memory (in the allocation/reallocation phase) just by calling f.readline() (it happens when some files are corrupt) even if you don't store the value.
Limiting the size of the line works, but if you call f.readline() again, you get the remainder of the line. The trick is to skip the remaining characters until a line termination character is found. A simple standalone example of how I'd do it:
max_size = 20

with open("test.txt") as f:
    while True:
        l = f.readline(max_size)
        if not l:
            break  # we reached the end of the file
        if l[-1] != '\n':
            # skip the rest of the line
            while True:
                c = f.read(1)
                if not c or c == "\n":  # end of file or end of line
                    break
        print(l.rstrip())
That example reads the start of a line and, if the line was truncated (that is, it doesn't end with a line terminator), reads the rest of the line and discards it. Even if the line is very long, it doesn't consume memory. It's just dead slow.
About combining next() and readline(): those are competing mechanisms (manual iteration vs. classic line reading) and they mustn't be mixed, because the buffering of one method may be ignored by the other. But you can mix read() with readline(), and a for loop with next().
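A quick demonstration of that rule, assuming a Python 2 file object (Python 3's io objects don't raise this error):
with open("test.txt") as f:
    first = f.readline(10)   # fine
    rest = f.read(10)        # also fine: both use the same file position

with open("test.txt") as f:
    next(f)                  # start iterating; fills the read-ahead buffer
    try:
        f.read(10)           # ValueError: Mixing iteration and read methods...
    except ValueError as e:
        print(e)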
Try like this:
'''
$cat test.txt
aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
bbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbb
ccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccc
'''
from time import sleep # trust me on this one
lines = []
with open("test.txt", "r") as fh:
while True:
line = fh.readline(44)
print (line.strip())
if not line:
#sleep(0.05)
break
lines.append(line.strip())
if not line.endswith("\n"):
while fh.readline(1) != "\n":
pass
print(lines)
Quite simple: it will read 44 characters, and if they don't end in a newline it will read 1 character at a time until it reaches one, to avoid pulling large chunks into memory; only then will it go on to process the next 44 characters and append them to the list.
Don't forget to use line.strip() to avoid getting \n as part of the string when it's shorter than 44 characters.
I'm going to assume you're asking your original question here, and not your side question about temporary values (which Jean-François Fabre has already answered nicely).
Your existing solution doesn't actually solve your problem.
Let's say your attacker creates a line that's 100 million characters long. So:
You do a fh.readline(44), which reads the first 44 characters.
Then you do a fh.readline() to discard the rest of the line. This has to read the rest of the line into a string to discard it, so it uses up 100MB.
You could handle this by reading one character at a time in a loop until '\n', but there's a better solution: just fh.readline(44) in a loop until '\n'. Or maybe fh.readline(8192) or something—temporarily wasting 8KB (it's effectively the same 8KB being used over and over) isn't going to help your attacker.
For example:
while True:
    line = fh.readline(20)
    if not line:
        break
    lines.append(line.strip())
    while line and not line.endswith('\n'):
        line = fh.readline(8192)
In practice, this isn't going to be that much more efficient. A Python 2.x file object wraps a C stdio FILE, which already has a buffer, and with the default arguments to open, it's a buffer chosen by your platform. Let's say your platform uses 16KB.
So, whether you read(1) or readline(8192), it's actually reading 16KB at a time off disk into some hidden buffer, and just copying 1 or 8192 characters out of that buffer into a Python string.
And, while it obviously takes more time to loop 16384 times and build 16384 tiny strings than to loop twice and build two 8K strings, that time is still probably smaller than the disk I/O time.
So, if you understand the read(1) code better and can debug and maintain it more easily, just do that.
However, there might be a better solution here. If you're on a 64-bit platform, or your largest possible file is under 2GB (or it's acceptable for a file >2GB to raise an error before you even process it), you can mmap the file, then search it as if it were a giant string in memory:
from contextlib import closing
import mmap

lines = []
with open('ready.py') as f:
    with closing(mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)) as m:
        start = 0
        while True:
            end = m.find('\n', start)
            if end == -1:
                lines.append(m[start:start+44])
                break
            lines.append(m[start:min(start+44, end)])
            start = end + 1
This maps the whole file into virtual memory, but most of that virtual memory is not mapped to physical memory. Your OS will automatically take care of paging it in and out as needed to fit well within your resources. (And if you're worried about "swap hell": swapping out an unmodified page that's already backed by a disk file is essentially instantaneous, so that's not an issue.)
For example, let's say you've got a 1GB file. On a laptop with 16GB of RAM, it'll probably end up with the whole file mapped into 1GB of contiguous memory by the time you reach the end, but that's also probably fine. On a resource-constrained system with 128MB of RAM, it'll start throwing out the least recently used pages, and it'll end up with just the last few pages of the file mapped into memory, which is also fine. The only difference is that, if you then tried to print m[0:100], the laptop would be able to do it instantaneously, while the embedded box would have to reload the first page into memory. Since you're not doing that kind of random access through the file, that doesn't come up.

Large File handling in Python

I am doing some file operations in Python, using Python 3.5.2.
I have a large file of 4GB, and I'm reading the file in chunks of, say, 2KB.
I have a question.
If a 2KB chunk happens to end in the middle of a line (between two newlines), will that line be truncated, or will the half-read line's contents be returned?
Regards,
Yes, this is a problem. You can see that with a much smaller test:
import io

s = io.BytesIO(b'line\nanother line\nanother\n')
while True:
    buf = s.read(10)
    if not buf:
        break
    print('*** new buffer')
    for line in buf.splitlines():
        print(line.decode())
The output is:
*** new buffer
line
anoth
*** new buffer
er line
an
*** new buffer
other
As you can see, the first buffer ends with a truncated partial line that finishes in the next buffer, exactly what you were worried about. In fact, this will happen not just occasionally, but most of the time.
The solution is to keep around the overflow (after the last line) from the old buffer, and use it as part of the new buffer. You should try to code this up for yourself, to make sure you understand it (remember to print out the leftover overflow at the end of the file).
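If you want a starting point (try it yourself first), here's a rough sketch of that carry-over idea, reusing the BytesIO example from above:
import io

s = io.BytesIO(b'line\nanother line\nanother\n')
leftover = b''
while True:
    buf = s.read(10)
    if not buf:
        break
    buf = leftover + buf                    # prepend what the last chunk left over
    *complete, leftover = buf.split(b'\n')  # the last piece may be a partial line
    for line in complete:
        print(line.decode())
if leftover:
    print(leftover.decode())                # whatever remained after the last newline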
But the good news is that you rarely need to do this, because Python file objects do it for you:
s = io.BytesIO(b'line\nanother line\nanother\n')
for line in s:
    print(line.decode(), end='')
That's it. You can test this with a real file from open(path, 'rb') in place of BytesIO; it works just as well. Python will read in about a page at a time and generate lines one by one, automatically handling all the tricky stuff for you. If "about a page" isn't good enough, you can use something more explicit, e.g., passing buffering=2048 to the open function.
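For example (the filename here is just an illustration):
path = 'big.txt'   # illustrative filename
with open(path, 'rb', buffering=2048) as f:
    for line in f:
        print(line.decode(), end='')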
In fact, you can do even better. Open the file in text mode, and Python will still read in about a page at a time, split it into lines, and decode them for you on the fly—and probably a lot more efficiently than anything you would have come up with:
for line in open(path):
    print(line, end='')

Python conditional statement based on text file string

Noob question here. I'm scheduling a cron job for a Python script for every 2 hours, but I want the script to stop running after 48 hours, which is not a feature of cron. To work around this, I'm recording the number of executions at the end of the script in a text file using a tally mark x and opening the text file at the beginning of the script to only run if the count is less than n.
However, my script seems to always run regardless of the conditions. Here's an example of what I've tried:
with open("curl-output.txt", "a+") as myfile:
data = myfile.read()
finalrun = "xxxxx"
if data != finalrun:
[CURL CODE]
with open("curl-output.txt", "a") as text_file:
text_file.write("x")
text_file.close()
I think I'm missing something simple here. Please advise if there is a better way of achieving this. Thanks in advance.
The problem with your original code is that you're opening the file in a+ mode, which seems to set the seek position to the end of the file (try print(data) right after you read the file). If you use r instead, it works. (I'm not sure that's how it's supposed to be. This answer states it should write at the end, but read from the beginning. The documentation isn't terribly clear).
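A quick way to see this for yourself (the exact behaviour can vary by platform and Python version):
with open("curl-output.txt", "a+") as myfile:
    print(repr(myfile.read()))   # typically '' because the position starts at the end
with open("curl-output.txt", "r") as myfile:
    print(repr(myfile.read()))   # the full contents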
Some suggestions: Instead of comparing against the "xxxxx" string, you could just check the length of the data (if len(data) < 5). Or alternatively, as was suggested, use pickle to store a number, which might look like this:
import pickle

try:
    with open("curl-output.txt", "rb") as myfile:
        num = pickle.load(myfile)
except FileNotFoundError:
    num = 0

if num < 5:
    do_curl_stuff()
    num += 1
    with open("curl-output.txt", "wb") as myfile:
        pickle.dump(num, myfile)
Two more things concerning your original code: You're making the first with block bigger than it needs to be. Once you've read the string into data, you don't need the file object anymore, so you can remove one level of indentation from everything except data = myfile.read().
Also, you don't need to close text_file manually. with will do that for you (that's the point).
This sounds more like a job for scheduling with the at command.
See http://www.ibm.com/developerworks/library/l-job-scheduling/ for different job scheduling mechanisms.
The first bug that is immediately obvious to me is that you are appending to the file even if data == finalrun. So when data == finalrun, you don't run curl but you do append another 'x' to the file. On the next run, data will be not equal to finalrun again so it will continue to execute the curl code.
The solution is of course to nest the code that appends to the file under the if statement.
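A minimal sketch of that fix (run_curl is a hypothetical stand-in for the curl code, and the file is read with "r" as discussed in the other answer):
with open("curl-output.txt", "r") as myfile:
    data = myfile.read()

finalrun = "xxxxx"
if data != finalrun:
    run_curl()                       # hypothetical stand-in for [CURL CODE]
    with open("curl-output.txt", "a") as text_file:
        text_file.write("x")         # the tally mark is only added after a real run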
Well, there is probably an end-of-line \n character, which means your file will contain something like xx\n and not simply xx. That is probably why your condition does not work :)
EDIT
If, at the Python interactive prompt, you type
open('filename.txt', 'r').read()  # where filename.txt is the name of your file
you will be able to see whether there is a \n or not.
Try using this condition with the if clause instead:
if data.count('x') == 24:
The data string may contain extraneous characters like newlines. Check repr(data) to see whether it is actually 24 x's.
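An illustrative version of that check (the threshold of 24 matches the 48-hour / 2-hour schedule):
with open("curl-output.txt", "r") as myfile:
    data = myfile.read()

print(repr(data))                # e.g. 'xxx\n' makes a stray newline visible
if data.count('x') < 24:         # still fewer than 24 recorded runs
    pass                         # run the curl code here, then append another 'x'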

Overwriting a file in python

I'm trying to overwrite a file in Python so it only keeps the most up-to-date information read from a serial port. I've tried several different methods and read quite a few different posts, but the file keeps appending the information over and over without overwriting the previous entry.
import serial

ser = serial.Serial('/dev/ttyUSB0', 57600)
target = open('wxdata', 'w+')
with ser as port, target as outf:
    while 1:
        target.truncate()
        outf.write(ser.read())
        outf.flush()
I have a weather station sending data wirelessly to a Raspberry Pi. I just want the file to keep one line of the current data received; right now it just keeps looping and adding over and over. Any help would be greatly appreciated.
I would change your code to look like:
from serial import Serial

with Serial('/dev/ttyUSB0', 57600) as port:
    while True:
        with open('wxdata', 'w') as file:
            file.write(port.read())
This will make sure it gets truncated, flushed, etc. Why do work you don't have to? :)
By default, truncate() only truncates the file to the current position. Which, with your loop, is only at 0 the first time through. Change your loop to:
while 1:
    outf.seek(0)
    outf.truncate()
    outf.write(ser.read())
    outf.flush()
Note that truncate() does accept an optional size argument, which you could pass 0 for, but you'd still need to seek back to the beginning before writing the next part anyway.
Before you start writing the file, add the following lines:
outf.seek(0)
outf.truncate()
This will make it so that whatever you write next will overwrite the file.

Open text file, print new lines only in python

I am opening a text file which, once created, is constantly being written to, and then printing any new lines to a console, as I don't want to reprint the whole text file each time. I am checking whether the file has grown in size; if it has, I just print the next new lines. This mostly works, but occasionally it gets confused about where the next new line is, and new lines appear a few lines up, mixed in with the old lines.
Is there a better way to do this? Below is my current code.
infile = "Null"
while not os.path.exists(self.logPath):
time.sleep(.1)
if os.path.isfile(self.logPath):
infile = codecs.open(self.logPath, encoding='utf8')
else:
raise ValueError("%s isn't a file!" % file_path)
lastSize = 0
lastLineIndex = 0
while True:
wx.Yield()
fileSize = os.path.getsize(self.logPath)
if fileSize > lastSize:
lines = infile.readlines()
newLines = 0
for line in lines[lastLineIndex:]:
newLines += 1
self.running_log.WriteText(line)
lastLineIndex += newLines
if "DBG-X: Returning 1" in line:
self.subject = "FAILED! - "
self.sendEmail(self)
break
if "DBG-X: Returning 0" in line:
self.subject = "PASSED! - "
self.sendEmail(self)
break
fileSize1 = fileSize
infile.flush()
infile.seek(0)
infile.close()
Also my application freezes whilst waiting for the text file to be created, as it takes a couple of seconds to appear, which isn't great.
Cheers.
This solution could help. You'd also have to do a bit of waiting until the file appears, using os.path.isfile and time.sleep.
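That approach boils down to waiting for the file and then following it as it grows; a rough sketch (the path is illustrative, and the question's code would use self.logPath and self.running_log instead of printing):
import io
import os
import sys
import time

log_path = "run.log"                  # hypothetical path
while not os.path.exists(log_path):
    time.sleep(0.1)                   # wait for the file to be created

with io.open(log_path, encoding="utf8") as infile:
    while True:
        line = infile.readline()
        if line:
            sys.stdout.write(line)    # a newly appended line (may be partial if the writer is mid-line)
        else:
            time.sleep(0.1)           # nothing new yet; poll again shortly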
Maybe you could:
open the file each time you need to read from it,
use lastSize as the argument to seek() directly to where you stopped at the last read (a sketch follows below).
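A minimal sketch of those two suggestions combined, reading in binary so seek() can take the raw byte offset (the names are illustrative; in the class above it would be self.logPath and self.running_log):
import io
import os
import sys
import time

log_path = "run.log"                   # hypothetical path
last_size = 0
while True:
    size = os.path.getsize(log_path)
    if size > last_size:
        with io.open(log_path, "rb") as f:
            f.seek(last_size)          # jump straight past what was already shown
            new_data = f.read()
        last_size = size
        # note: a chunk boundary could split a multibyte character; good enough for a sketch
        sys.stdout.write(new_data.decode("utf8"))
    time.sleep(0.1)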
Additional comment: I don't know if you need some protection, but I think you should not bother testing whether the given filename is a file or not; just open it in a try...except block and catch any problems.
As for the freezing of your application, you may want to use some kind of threading: for instance, one thread, your main one, handles the GUI, and a second one waits for the file to be created. Once the file is created, the second thread sends signals to the GUI thread containing the data to be displayed.
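A rough sketch of that two-thread idea, using wx.CallAfter (which is safe to call from a worker thread and runs the given callable on the GUI thread); the function and its arguments are illustrative:
import os
import threading
import time
import wx

def watch_log(log_path, text_ctrl):
    while not os.path.exists(log_path):
        time.sleep(0.1)                     # waiting no longer blocks the GUI
    with open(log_path) as f:
        while True:
            line = f.readline()
            if line:
                wx.CallAfter(text_ctrl.WriteText, line)   # hand the line to the GUI thread
            else:
                time.sleep(0.1)             # nothing new yet; poll again shortly

# started from the GUI code, for example:
# worker = threading.Thread(target=watch_log, args=(self.logPath, self.running_log))
# worker.daemon = True
# worker.start()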
