I'm trying to write a simple Python script for my mobile phone that periodically loads a web page using urllib2. I don't actually care about the server response; I only want to pass some values in the URL to the PHP script. The problem is that Python for S60 uses the old 2.5.4 Python core, which seems to have a memory leak in the urllib2 module. From what I've read, similar problems appear in every kind of network communication. The bug was reported here a couple of years ago, along with some workarounds. I've tried everything I could find on that page, and with the help of Google, but my phone still runs out of memory after ~70 page loads. Strangely, the garbage collector does not seem to make any difference either, except for making my script much slower. It is said that the newer (3.1) core solves this issue, but unfortunately I can't wait a year (or more) for the S60 port to arrive.
Here's how my script looks after adding every little trick I've found:
import urllib2, httplib, gc

while True:
    url = "http://something.com/foo.php?parameter=" + value
    f = urllib2.urlopen(url)
    f.read(1)
    f.fp._sock.recv = None  # hacky avoidance
    f.close()
    del f
    gc.collect()
Any suggestions on how to make it run indefinitely without getting the "cannot allocate memory" error?
Thanks in advance,
cheers, b_m
UPDATE:
I've managed to connect 92 times before it ran out of memory, but it's still not good enough.
UPDATE 2:
I tried the socket method suggested earlier; this is the second-best (still wrong) solution so far:
import socket, threading
from time import sleep

class UpdateSocketThread(threading.Thread):
    def run(self):
        global data
        while 1:
            url = "/foo.php?parameter=%d" % data
            s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
            s.connect(('something.com', 80))
            s.send('GET ' + url + ' HTTP/1.0\r\n\r\n')
            s.close()
            sleep(1)
I tried the little tricks from above too. The thread dies after ~50 uploads (the phone has 50MB of free memory left, but the Python shell apparently does not).
UPDATE 3:
I think I'm getting closer to the solution! I tried sending multiple requests without closing and reopening the socket. This may be the key, since this method leaves only one open file descriptor. The problem is:
import socket
s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
s.connect(("something.com", 80))
s.send("test")  # returns 4 (sent bytes, which is cool)
s.send("test")  # 4
s.send("test")  # 4
s.send("GET /foo.php?parameter=bar HTTP/1.0\r\n\r\n")  # returns the number of sent bytes, ok
s.send("GET /foo.php?parameter=bar HTTP/1.0\r\n\r\n")  # returns 0 on the phone, error on Windows 7*
s.send("GET /foo.php?parameter=bar HTTP/1.0\r\n\r\n")  # returns 0 on the phone, error on Windows 7*
s.send("test")  # returns 0, strange...
*: error message: 10053, software caused connection abort
Why can't I send multiple messages??
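The most likely explanation for the zeros and error 10053: the request is plain HTTP/1.0 with no Connection: keep-alive header, so the server closes the connection as soon as it has answered the first request, and every later send() fails. Below is a minimal sketch (untested on S60) of reusing one socket with HTTP/1.1 keep-alive, reading each response fully before the next request. It assumes the server reports a Content-Length (chunked responses would need extra parsing); send_values is just an illustrative name.

import socket

def send_values(values, host='something.com', port=80):
    # One socket, many requests: HTTP/1.1 keep-alive, reading each response
    # completely before sending the next request.
    s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    s.connect((host, port))
    f = s.makefile('rb')                  # easier line-based parsing
    for value in values:
        s.sendall("GET /foo.php?parameter=%s HTTP/1.1\r\n"
                  "Host: %s\r\n"
                  "Connection: keep-alive\r\n\r\n" % (value, host))
        # Read the status line and headers, remember Content-Length, skip the body.
        length = 0
        line = f.readline()
        while line.strip():
            if line.lower().startswith('content-length:'):
                length = int(line.split(':', 1)[1])
            line = f.readline()
        f.read(length)
    f.close()
    s.close()

With this there is only ever one socket and one file object alive, which should also keep the file-descriptor count flat.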
Using the test code suggested by your link, I tested my Python installation and confirmed that it indeed leaks. But if, as @Russell suggested, I put each urlopen in its own process, the OS should clean up the memory leaks. In my tests, memory, unreachable objects and open files all remained more or less constant. I split the code into two files:
connection.py
import cPickle, sys, urllib2

def connectFunction(queryString):
    conn = urllib2.urlopen('http://something.com/foo.php?parameter=' + str(queryString))
    data = conn.read()
    outfile = open('sometempfile', 'wb')
    cPickle.dump(data, outfile)
    outfile.close()

if __name__ == '__main__':
    connectFunction(sys.argv[1])
launcher.py
import subprocess, cPickle

# code from your link to check the number of unreachable objects
def print_unreachable_len():
    # check memory on memory leaks
    import gc
    gc.set_debug(gc.DEBUG_SAVEALL)
    gc.collect()
    unreachableL = []
    for it in gc.garbage:
        unreachableL.append(it)
    return len(str(unreachableL))

# my code
if __name__ == '__main__':
    print 'Before running a single process:', print_unreachable_len()
    return_value_list = []
    # values is a list or a generator containing (or yielding) the parameters to pass to the URL
    for i, value in enumerate(values):
        subprocess.call(['python', 'connection.py', str(value)])
        print 'after running', i, 'processes:', print_unreachable_len()
        infile = open('sometempfile', 'rb')
        return_value_list.append(cPickle.load(infile))
        infile.close()
Obviously, this is sequential, so you will only execute a single connection at a time, which may or may not be an issue for you. If it is, you will have to find a non-blocking way of communicating with the processes you're launching, but I'll leave that as an exercise for you.
EDIT: On re-reading your question, it seems you don't care about the server response. In that case, you can get rid of all the pickling-related code. And obviously, you won't have the print_unreachable_len()-related bits in your final code either.
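If you do want several connections in flight at once, here is one possible sketch (an assumption on my part: the responses really don't matter, so the workers can be fire-and-forget and there is no pickling). It launches connection.py in small batches with subprocess.Popen and only waits at the end of each batch; the batch size of 5 is arbitrary.

import subprocess

def launch_batches(values, batch_size=5):
    # Start up to batch_size connection.py workers at once, then wait for
    # the whole batch to finish before launching the next one.
    values = list(values)
    for start in range(0, len(values), batch_size):
        procs = [subprocess.Popen(['python', 'connection.py', str(v)])
                 for v in values[start:start + batch_size]]
        for p in procs:
            p.wait()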
There is a reference cycle in urllib2, created at urllib2.py:1216. The issue is ongoing and has existed since 2009.
https://bugs.python.org/issue1208304
I think this is probably your problem. To summarize that thread, there's a memory leak in Pys60's DNS lookup, and you can work around it by moving DNS lookup outside the inner loop.
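A minimal sketch of that workaround: resolve the name once with socket.gethostbyname() outside the loop, and connect to the numeric IP from then on. The Host header is there in case the server relies on name-based virtual hosting; the URL and the ping_server name are just illustrative.

import socket, urllib2

HOST = 'something.com'
HOST_IP = socket.gethostbyname(HOST)      # one DNS lookup, done once

def ping_server(value):
    # Connecting to the IP avoids a DNS query on every request.
    url = 'http://%s/foo.php?parameter=%s' % (HOST_IP, value)
    req = urllib2.Request(url, headers={'Host': HOST})
    f = urllib2.urlopen(req)
    f.read(1)
    f.close()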
This seems like a (very!) hacky workaround, but a bit of googling found this comment on the problem:
Apparently adding f.read(1) will stop the leaking!
import urllib2
f = urllib2.urlopen('http://www.google.com')
f.read(1)
f.close()
EDIT: oh, I see you already have f.read(1)... I'm all out of ideas then :/
Consider using the low-level socket API (related howto) instead of urllib2.
import socket

HOST = 'daring.cwi.nl'    # The remote host
PORT = 50007              # The same port as used by the server
s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
s.connect((HOST, PORT))
s.send('GET /path/to/file/index.html HTTP/1.0\r\n\r\n')

# you'll need to figure out how much data to read and read that exactly
# or wait for read() to return data of zero length (I think!)
DATA_SZ = 1024
data = s.recv(DATA_SZ)
s.close()
print 'Received', repr(data)
How to execute and read an HTTP request via low-level sockets is a bit beyond the scope of the question (and would perhaps make a good question on its own on Stack Overflow; I searched but didn't see it), but I hope this points you in the direction of a solution that may resolve your problem!
EDIT: An answer here about using makefile may be helpful: HTTP basic authentication using sockets in python
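For example, here is a small sketch that reads the whole response through the file object returned by makefile(), so you don't have to guess DATA_SZ; the host and path are placeholders.

import socket

def fetch(host, path):
    # One HTTP/1.0 request; the server closes the connection when it is done,
    # so read() simply returns everything.
    s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    s.connect((host, 80))
    s.sendall('GET %s HTTP/1.0\r\nHost: %s\r\n\r\n' % (path, host))
    f = s.makefile('rb')
    response = f.read()
    f.close()
    s.close()
    return response

print fetch('something.com', '/foo.php?parameter=bar')[:200]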
This does not leak for me with Python 2.6.1 on a Mac. Which version are you using?
BTW, your program as posted doesn't run as-is (for one thing, value is never defined); here is a version that does work for me:
import urllib2, httplib, gc

value = "foo"
count = 0
while True:
    url = "http://192.168.1.1/?parameter=" + value
    f = urllib2.urlopen(url)
    f.read(1)
    f.fp._sock.recv = None  # hacky avoidance
    f.close()
    del f
    print "count=", count
    count += 1
Depending on the platform and Python version, Python might not release memory back to the OS. See this Stack Overflow thread. That said, Python should not endlessly consume memory. Judging from the code you use, it appears to be a bug in the Python runtime, unless urllib/sockets use globals, which I don't believe they do - blame it on Python on S60!
Have you considered other sources of memory leakage? An endlessly open log file, an ever-growing array or something like that? If it truly is a bug in the sockets interface, then your only option is to use the subprocess approach.
Related
I'm having difficulties getting pyserial to play nicely with a virtual port. I know this is an area which a few others have written about, but I couldn't find anything which solved my problem in those answers. Forgive me if I'm just being dense, and the solution exists ready-made elsewhere.
This is what I'm trying to achieve: I want to set up a virtual port, to which I can write data in one .py file, and from which I can then read data in another .py file. This is for the purposes of development and testing; I don't always have access to the device around which my current project is built.
This is my code so far:
dummy_serial.py
import os, pty, serial, time

master, slave = pty.openpty()
m_name = os.ttyname(master)
s_name = os.ttyname(slave)

# This tells us which ports "openpty" has happened to choose.
print("master: " + m_name)
print("slave: " + s_name)

ser = serial.Serial(s_name, 9600)

message = "Hello, world!"
encoded = message.encode("ascii")

while True:
    ser.write(encoded)
    time.sleep(1)
reader.py
import serial, time

# The port will change, depending on what port "openpty" (in the other file)
# happens to choose.
ser = serial.Serial("/dev/pts/1", 9600)

while True:
    time.sleep(1)
    incoming_bytes = ser.inWaiting()
    # This print statement gives us an idea of what's going on.
    print(incoming_bytes)
    if incoming_bytes != 0:
        data = ser.read(incoming_bytes)
        print(data)
At present, dummy_serial.py seems to run okay. However, reader.py just keeps saying that there are no bytes waiting to be read, and hence reads no data.
What I would like:
An explanation of why ser.inWaiting() keeps returning 0, and a solution which makes ser.read(x) actually spit out "Hello, world!"
Or an explanation of why what I'm trying to do is fundamentally silly, and a better means of creating a writeable/readable virtual port.
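For reference, one thing worth checking is which end of the pty each side talks to: bytes written to the slave device come out on the master file descriptor, and vice versa. Below is a minimal single-process sketch of that round trip (just an illustration of the plumbing, not the two-file setup above).

import os, pty, serial

master, slave = pty.openpty()
ser = serial.Serial(os.ttyname(slave), 9600)    # one side opens the slave end

ser.write("Hello, world!".encode("ascii"))
print(os.read(master, 1024))                    # the bytes appear on the master fd

# And the other direction: write to the master, read through pyserial.
os.write(master, b"ack")
print(ser.read(3))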
I've tried looking around online through various Python docs, forums, and other people's questions, but I haven't found anyone with this same question.
My scripts typically create a socket connection that tries connecting to ports 1-9999 and only reports when a port is open. When I run this on Windows it takes 1 second to scan a port before moving on to the next one (60 ports/min, ~16.5 min for 1000 ports). When I run the same script on Linux, it cycles through all 9999 ports very quickly, while still returning the same desired results.
I was hoping to build cross-compatible tools, but it appears Linux is just the better operating system when it comes to my networking needs? I have both at my disposal, so I don't mind using one over the other. I'd just like to know if there's anything that could be done to make port scanning almost as fast on both operating systems; otherwise I won't spend as much time building on/for Windows.
The difference in speed is the same regardless of which network I'm on.
My questions are:
• Why is the performance so different on Windows compared to Linux when given the same functions?
• Is there anything that can be done to make port scanning with sockets as fast as it is on Linux?
--edit--
Here's the piece I use to check ports:
from socket import *

def whole_scan(Host_):
    service = ''
    host = Host_
    max_port = 9999
    min_port = 1

    def scan_host(host, port, r_code=1):
        try:
            s = socket(AF_INET, SOCK_STREAM)
            code = s.connect_ex((host, port))
            if code == 0:
                r_code = code
            s.close()
        except Exception, e:
            pass
        return r_code

    hostip = gethostbyname(host)
    for port in range(min_port, max_port):
        try:
            response = scan_host(host, port)
            if response == 0:
                try:
                    service = getservbyport(port)
                except Exception, e:
                    service = 'n/a'
                print(" |--port: %d\t%s" % (port, service.upper()))
        except Exception, e:
            pass
I've also verified that my firewall is disabled, and adding the value to my registry to disable the limit on connections had no effect on performance. I'm on Windows 10.
Windows limits the concurrent number of half-open connections and that may be at play here if you are opening that many connection requests at a time. For example, on Windows 7 try setting this key value to 0 (to disable it)
HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Services\Tcpip\Parameters\EnableConnectionRateLimiting
I doubt that this is causing the performance problem; however, there is a bug in your scan_host() function.
This function attempts to return r_code, but r_code is only updated when connect_ex() returns 0. Should connect_ex() return a non-zero value, or an exception occur in the same block of code, the function silently falls back to the default r_code of 1, and the exception itself is swallowed. The calling code likewise catches and then ignores that and all other exceptions.
It's not a good idea to ignore exceptions; perhaps you might learn something relevant to the problem, perhaps not, but I suggest that you log the exceptions that are occurring.
Also, it would be useful if you added some debug print statements into your code. This will help you locate the part of your code where the majority of time is spent.
There is also the line:
hostip = gethostbyname(host)
whose result never seems to be used - I can't tell for sure, because the indentation in your post may not be quite right.
Another thing to consider is DNS. Possibly the DNS server used by Windows is slower, or there is some issue there. You could eliminate that by using the IP address instead of host name:
response = scan_host(gethostbyname(host), port)
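A sketch of scan_host() along those lines: it returns connect_ex()'s result directly, logs exceptions instead of hiding them, and sets an explicit connect timeout so no single port can stall the scan for long (the 0.5-second value is just an assumption to illustrate the idea).

from socket import socket, AF_INET, SOCK_STREAM, error

def scan_host(host, port, timeout=0.5):
    # Returns 0 if the port accepted the connection, a non-zero errno otherwise.
    s = socket(AF_INET, SOCK_STREAM)
    s.settimeout(timeout)                 # bound how long each probe may take
    try:
        return s.connect_ex((host, port))
    except error, e:
        print("port %d: %s" % (port, e))  # log instead of swallowing the error
        return -1
    finally:
        s.close()

whole_scan() can then call scan_host(hostip, port) with the pre-resolved address, as suggested above.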
I am reading strings from serial in a loop and have realized that the processor is at 100% (on a Raspberry Pi) while waiting for the next serial.read().
I found recommendations to add a few sleeps here and there, but doing this might cause missing serial data. In theory I am getting a string from serial every 5 seconds, but it could be a bit more or less and is not in my control.
Is there a way to solve this in Python better and with less processor use?
#!/usr/bin/env python
import datetime
import serial

ser = serial.Serial("/dev/ttyUSB0", 57600, timeout=0)

def sr():
    while True:
        for line in ser.read():
            try:
                response = ser.readlines(None)
                response = str(response)
                print response
            except:
                print datetime.datetime.now(), " No data from serial connection."

if __name__ == '__main__':
    sr()
    ser.close()
From what I remember (it's been a while since I used pySerial), I am sure that pySerial uses buffers, so as long as your message doesn't fill the buffer you shouldn't lose any data.
Assuming I'm looking at the docs for the right module, the following page:
http://pyserial.sourceforge.net/pyserial_api.html
makes mention of buffers on both input and output.
So you should have no problem putting sleeps into your program, as the buffers will collect the data until you read it (assuming your messages are not big enough to cause an overflow).
James
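Building on that, here is a small sketch of the reading loop with a blocking read timeout instead of timeout=0: read()/readline() then wait inside pySerial, so the CPU stays idle between messages while the OS buffer still collects whatever arrives. The 5-second timeout mirrors the expected message interval, and newline-terminated messages are an assumption.

#!/usr/bin/env python
import serial

# A non-zero timeout makes reads block until data arrives or the timeout expires.
ser = serial.Serial("/dev/ttyUSB0", 57600, timeout=5)

def sr():
    while True:
        line = ser.readline()      # blocks quietly, ~0% CPU while waiting
        if line:
            print line.strip()
        # an empty string just means the 5 s timeout expired with no data

if __name__ == '__main__':
    try:
        sr()
    finally:
        ser.close()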
I'm trying to develop a very simple proof-of-concept to retrieve and process data in a streaming manner. The server I'm requesting from will send data in chunks, which is good, but I'm having issues using httplib to iterate through the chunks.
Here's what I'm trying:
import httplib

def getData(src):
    d = src.read(1024)
    while d and len(d) > 0:
        yield d
        d = src.read(1024)

if __name__ == "__main__":
    con = httplib.HTTPSConnection('example.com', port='8443', cert_file='...', key_file='...')
    con.putrequest('GET', '/path/to/resource')
    response = con.getresponse()
    for s in getData(response):
        print s
        raw_input() # Just to give me a moment to examine each packet
Pretty simple. Just open an HTTPS connection to server, request a resource, and grab the result, 1024 bytes at a time. I'm definitely making the HTTPS connection successfully, so that's not a problem at all.
However, what I'm finding is that the call to src.read(1024) returns the same thing every time. It only ever returns the first 1024 bytes of the response, apparently never keeping track of a cursor within the file.
So how am I supposed to receive 1024 bytes at a time? The documentation on read() is pretty sparse. I've thought about using urllib or urllib2, but neither seems to be able to make an HTTPS connection.
HTTPS is required, and I am working in a rather restricted corporate environment where packages like Requests are a bit tough to get my hands on. If possible, I'd like to find a solution within Python's standard lib.
// Big Old Fat Edit
Turns out that in my original code I had simply forgotten to update the d variable. I initialized it with a read outside the yield loop and never changed it inside the loop. Once I added that back in, it worked perfectly.
So, in short, I'm just a big idiot.
Is your con.putrequest() actually working? Doing a request with that method requires you to also call a bunch of other methods as you can see in the official httplib documentation:
http://docs.python.org/2/library/httplib.html
As an alternative to using the request() method described above, you
can also send your request step by step, by using the four functions
below.
putrequest()
putheader()
endheaders()
send()
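For completeness, a sketch of what the step-by-step variant could look like for your request (the host, port and certificate arguments are placeholders taken from your snippet):

import httplib

con = httplib.HTTPSConnection('example.com', port=8443, cert_file='...', key_file='...')
con.putrequest('GET', '/path/to/resource')
con.putheader('Accept', '*/*')
con.endheaders()                 # without this, the request is never actually sent
response = con.getresponse()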
Is there any reason why you're not using the default HTTPConnection.request() function?
Here's a working version for me, using request() instead:
import httplib

def getData(src, chunk_size=1024):
    d = src.read(chunk_size)
    while d:
        yield d
        d = src.read(chunk_size)

if __name__ == "__main__":
    con = httplib.HTTPSConnection('google.com')
    con.request('GET', '/')
    response = con.getresponse()
    for s in getData(response, 8):
        print s
        raw_input() # Just to give me a moment to examine each packet
You can use the seek command to move the cursor along with your read.
This is my attempt at the problem. I apologize if I made it less pythonic in process.
if __name__ == "__main__":
    con = httplib.HTTPSConnection('example.com', port='8443', cert_file='...', key_file='...')
    con.putrequest('GET', '/path/to/resource')
    response = con.getresponse()
    c = 0
    while True:
        response.seek(c*1024, 0)
        data = response.read(1024)
        c += 1
        if len(data) == 0:
            break
        print data
        raw_input()
I hope it is at least helpful.
I have an application that calls telnetlib.read_until(). For the most part, it works fine.
However when my app's telnet connection fails, it's hard to debug the exact cause. Is it my script or is the server connection dodgy? (This is a development lab, so there are a lot of dodgy servers).
What I would like to do is to be able to easily snoop the data placed into the cooked queue before my app calls telnetlib.read_until() (thereby hopefully avoiding impacting my app's operation.)
Poking around in telnetlib.py, I found that 'buf[0]' is just the data I want: the newly-added data without the repetition caused by snooping 'cookedq'.
I can insert a line right before the end of telnetlib.process_rawq() to print out the processed data as it is received from the server.
telnetlib.process_rawq ...
...
self.cookedq = self.cookedq + buf[0]
print("Dbg: Cooked Queue contents = %r" % buf[0])  # <= my added debug line
self.sbdataq = self.sbdataq + buf[1]
This works well. I can see the data almost exactly as received by my app without impacting its operation at all.
Here's the question: Is there a snazzier way to accomplish this? This approach is basic and works, but I'll have to remember to re-make this change every time I upgrade Python's libraries.
My attempts to simply extend Telnet.process_rawq() were unsuccessful, since buf is local to Telnet.process_rawq().
Is there a (more pythonic) way to snoop this telnetlib.process_rawq()-internal value without modifying telnetlib.py?
Thanks.
I just found a much better solution (by reading the code, duh!)
telnetlib has a debugging output option already built-in. Just call set_debuglevel(1) and Bob's your uncle.
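For example, a minimal usage sketch (the host and the prompt string are placeholders):

import telnetlib

tn = telnetlib.Telnet('somehost.example.com')
tn.set_debuglevel(1)                 # dumps everything sent/received to stdout
print tn.read_until('login: ', 5)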
The easy hack is to monkey patch the library. Copy and paste the function you want to change into your source (unfortunately, process_rawq is a rather large function), and modify it as you need. You can then replace the method in the class with your own.
import telnetlib

def process_rawq(self):
    # ... existing body of Telnet.process_rawq, copied verbatim, with the
    # debug print added:
    self.cookedq = self.cookedq + buf[0]
    print("Dbg: Cooked Queue contents = %r" % buf[0])
    self.sbdataq = self.sbdataq + buf[1]

telnetlib.Telnet.process_rawq = process_rawq
You could alternatively try the debugging built-in to the telnetlib module with set_debuglevel(1), which prints a lot of info to stdout.
In this situation, I would tend to just grab wireshark/tshark/tcpdump and directly inspect the network session.