I hope this isn't a duplicate; either I haven't been able to find an adequate solution or I'm just not 100% sure what I'm looking for. I've written a program to thread lots of requests. I create threads to:
Fetch responses from a number of APIs, such as share.yandex.ru/gpp.xml?url=MY_URL, as well as scraping blogs
Parse the responses of all the requests from the example above / JSON / using python-goose to extract articles
Return the parsed results back to the primary thread and insert them into a database.
It had all been going well until it needed to pull back larger amounts of data, which I hadn't tested before. The primary problem is that it takes me over my memory limit on a shared Linux server (512 MB), triggering a kill. That should be enough, as it's only a few thousand requests, although I could be wrong. I'm clearing all large data variables/objects within the main thread, but that doesn't seem to help either.
I ran memory_profiler on the primary function that creates the threads, using a thread class that looks like this:
class URLThread(Thread):
    def __init__(self, request):
        super(URLThread, self).__init__()
        self.url = request['request']
        self.post_id = request['post_id']
        self.domain_id = request['domain_id']
        self.post_data = request['post_params']
        self.type = request['type']
        self.code = ""
        self.result = ""
        self.final_results = ""
        self.error = ""
        self.encoding = ""

    def run(self):
        try:
            self.request = get_page(self.url, self.type)
            self.code = self.request['code']
            self.result = self.request['result']
            self.final_results = response_handler(dict(result=self.result, type=self.type, orig_url=self.url))
            self.encoding = chardet.detect(self.result)
            self.error = self.request['error']
        except Exception as e:
            exc_type, exc_obj, exc_tb = sys.exc_info()
            fname = os.path.split(exc_tb.tb_frame.f_code.co_filename)[1]
            errors.append((exc_type, fname, exc_tb.tb_lineno, e, 'NOW()'))
            pass
@profile
def multi_get(uris, timeout=2.0):
    def alive_count(lst):
        alive = map(lambda x: 1 if x.isAlive() else 0, lst)
        return reduce(lambda a, b: a + b, alive)
    threads = [URLThread(uri) for uri in uris]
    for thread in threads:
        thread.start()
    while alive_count(threads) > 0 and timeout > 0.0:
        timeout = timeout - UPDATE_INTERVAL
        sleep(UPDATE_INTERVAL)
    return [{"request": x.url,
             "code": str(x.code),
             "result": x.result,
             "post_id": str(x.post_id),
             "domain_id": str(x.domain_id),
             "final_results": x.final_results,
             "error": str(x.error),
             "encoding": str(x.encoding),
             "type": x.type}
            for x in threads]
And the results look like this on the first batch of requests I pump through it (it's a link because the output text isn't readable here, and I can't paste an HTML table or embed an image until I get 2 more points):
http://tinypic.com/r/28c147d/8
And it doesn't seem to drop any of the memory on subsequent passes (I'm batching 100 requests/threads through at a time). By this I mean that once a batch of threads is complete, they seem to stay in memory, and every time it runs another batch, more memory is added, as below:
http://tinypic.com/r/nzkeoz/8
Am I doing something really stupid here?
Python will generally free the memory taken up by an object once there are no references to that object left. Your multi_get function returns a list of dictionaries that still hold references to the data produced by every thread you created (the raw result, the parsed final_results, and so on), so it's unlikely that Python would free that memory. But we would need to see what the code calling multi_get does with that list in order to be sure.
To start freeing the memory you will need to stop returning references to that data from this function, or, if you want to keep doing that, at least delete the references somewhere (del x) once you are finished with them.
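A minimal sketch of what that might look like in the calling code, which was not shown in the question, so the names here (batches, insert_into_db) are placeholders:
for batch in batches:
    results = multi_get(batch)
    insert_into_db(results)  # placeholder for the real database insert
    # Drop the only remaining references to the responses so Python can
    # reclaim the memory before the next batch is fetched.
    del results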
I'm currently trying to model a service counter with SimPy, but I'm running into difficulties using yield to hold the resources. In Counter.arrive(), if the line "yield req" is present, the entire function skips execution (at least I think that's what happens, since I don't get any of the print output). However, if I comment out that line, the code executes as if nothing is wrong. This is a problem because, without the yield, the code is not blocked until the request is approved, and the entire simulation fails because everyone gets to use the resource.
Code snippet as follows:
import simpy

class Counter:
    def __init__(self, env, name, staff):
        self.env = env
        self.staff = simpy.Resource(env, staff)
        self.name = name
        self.dreq = []

    def arrive(self, name):
        ...
        req = self.staff.request()
        yield req
        output = "Req: %s\n" % req
        self.dreq.append(req)
        ...
        print(output)
        ...

def customer(env, counter, name):
    print("Customer %s arrived at %s" % (name, env.now))
    counter.arrive(name)
    yield env.timeout(5)
    print("Customer %s left at %s" % (name, env.now))

...

env = simpy.Environment()
counter = Counter(env, "A", 1)

def setup(env, counter, MAX_CUST):
    for i in range(MAX_CUST):
        env.process(customer(env, counter, 1))
        yield env.timeout(1)

env.process(setup(env, counter, 5))
env.run(until=100)
Edit: I understand that using yield should pause the function until the request is approved, but the very first request does not go through either, which does not make sense, as there is 1 unit of the resource available at the start.
Docs for convenience: https://simpy.readthedocs.io/en/3.0.6/topical_guides/resources.html
Requests (and timeouts and everything else you need to yield) get processed by SimPy, so they need to reach SimPy to be processed. You tell SimPy to process customer with env.process:
env.process(customer(env,counter, 1))
In customer you call counter.arrive(name). Because arrive is a generator (because of the yield), it does nothing until something calls next() on it. SimPy needs to know about it to process it properly. You should be able to do this with:
env.process(counter.arrive(name))
which should fix your problem.
Note also that in this code you never release the resource, so only one customer can ever actually arrive.
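A minimal sketch of both changes (the service time inside arrive is purely illustrative), using SimPy's context-manager form of request() so the resource is released when the customer is done; arrive is still the method on Counter and customer is the process function from the question:
    def arrive(self, name):
        # Requesting via a context manager releases the resource
        # automatically when the block is left.
        with self.staff.request() as req:
            yield req
            print("Req: %s" % req)
            yield self.env.timeout(5)  # illustrative service time while holding the counter

def customer(env, counter, name):
    print("Customer %s arrived at %s" % (name, env.now))
    yield env.process(counter.arrive(name))  # let SimPy drive the generator
    print("Customer %s left at %s" % (name, env.now))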
Recently I tried to add threading to my scraper so that it can scrape more efficiently.
But somehow it randomly causes python.exe to crash with a "has stopped working" dialog and no further information, so I have no idea how to debug it.
Here is some relevant code:
Where the threads are initiated:
def run(self):
    """
    create the threads and run the scraper
    :return:
    """
    self.__load_resource()
    self.__prepare_threads_args()  # each thread is allocated a different set of links to scrape, so there should be no collisions
    for item in self.threads_args:
        try:
            t = threading.Thread(target=self.urllib_method, args=(item,))
            # use the following expression to use the selenium scraper
            # t = threading.Thread(target=self.__scrape_site, args=(item,))
            self.threads.append(t)
            t.start()
        except Exception as ex:
            print ex
What the scraper looks like:
def urllib_method(self, thread_args):
    """
    :param thread_args: arguments containing the files to scrape and the proxy to use
    :return:
    """
    site_scraper = SiteScraper()
    for file in thread_args["files"]:
        current_folder_path = self.__prepare_output_folder(file["name"])

        articles_without_comments_file = os.path.join(current_folder_path, "articles_without_comments")
        articles_without_comments_links = get_links_from_file(articles_without_comments_file) if isfile(articles_without_comments_file) else []

        articles_scraped_file = os.path.join(current_folder_path, "articles_scraped")
        articles_scraped_links = get_links_from_file(articles_scraped_file) if isfile(articles_scraped_file) else []

        links = get_links_from_file(file["path"])
        for link in links:
            article_id = extract_article_id(link)

            if isfile(join(current_folder_path, article_id)):
                print "skip: ", link
                if link not in articles_scraped_links:
                    append_text_to_file(articles_scraped_file, link)
                continue

            if link in articles_without_comments_links:
                continue

            comments = site_scraper.call_comments_endpoint(article_id, thread_args["proxy"])

            if comments != "Pro article" and comments != "Crash" and comments != "No Comments" and comments is not None:
                print article_id, comments[0:14]
                write_text_to_file(os.path.join(current_folder_path, article_id), comments)
                sleep(1)
                append_text_to_file(articles_scraped_file, link)
            elif comments == "No Comments":
                print "article without comments: ", article_id
                if link not in articles_without_comments_links:
                    append_text_to_file(articles_without_comments_file, link)
                sleep(1)
I have tried running the script on both Windows 10 and 8.1, and the issue exists on both.
Also, the more data it scrapes, the more frequently it happens, and the more threads are used, the more frequently it happens.
Threads in Python pre 3.2 are very unsafe to use, due to the diabolical Global Interpreter Lock.
The preferred way to utilize multiple cores and processes in Python is via the multiprocessing package.
https://docs.python.org/2/library/multiprocessing.html
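A minimal sketch (not from the question) of how the per-link work could be moved into a process pool; scrape_one and the example links are placeholders for the scraper's real logic:
from multiprocessing import Pool

def scrape_one(link):
    # placeholder: fetch and parse a single link, return whatever should be saved
    return link

if __name__ == '__main__':
    links = ["http://example.com/a", "http://example.com/b"]  # placeholder link list
    pool = Pool(processes=4)                # e.g. one worker process per core
    results = pool.map(scrape_one, links)   # the work is spread across processes
    pool.close()
    pool.join()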
I'm trying to generate a report that contains differing "groupings" of data. For each grouping I have to query Postgres differently and apply different logic, which can take a fair amount of time (roughly an hour).
To increase performance I've created a thread for each task, each with its own connection, since psycopg2 executes queries serially per connection. I'm using numpy to calculate the median and mean values of a portion of the data that is common to every group.
A short example of my code is the following:
# -*- coding: utf-8 -*-
import numpy

from postgres import Connection
from lookup import Lookup
from queries import QUERY1, QUERY2
from threading import Thread


class Report(object):
    def __init__(self, **credentials):
        self.conn = self.__get_conn(**credentials)
        self._lookup = Lookup(self.conn)
        self.data = {}

    def __get_conn(self, **credentials):
        return Connection(**credentials)

    def _get_averages(self, data):
        return {
            'mean': numpy.mean(data),
            'median': numpy.median(data)
        }

    def method1(self):
        conn = self.__get_conn()
        cursor = conn.get_cursor()
        data = cursor.execute(QUERY1)
        for row in data:
            # Logic specific to the results returned by the query.
            row['arg1'] = self._lookup.find_data_by_method_1(row)
            avgs = self._get_averages(row['data'])
            row['mean'] = avgs['mean']
            row['median'] = avgs['median']
        return data

    def method2(self):
        conn = self.__get_conn()
        cursor = conn.get_cursor()
        data = cursor.execute(QUERY2)
        for row in data:
            # Logic specific to the results returned by the query.
            row['arg2'] = self._lookup.find_data_by_method_2(row)
            avgs = self._get_averages(row['data'])
            row['mean'] = avgs['mean']
            row['median'] = avgs['median']
        return data

    def lookup(self, arg):
        methods = {
            'arg1': self.method1,
            'arg2': self.method2
        }
        method = methods[arg]
        self.data[arg] = method()

    def lookup_args(self):
        return self._lookup.find_args()

    def do_something_with_data(self):
        print self.data


def main():
    creds = {
        'host': 'host',
        'user': 'postgres',
        'database': 'mydatabase',
        'password': 'mypassword'
    }
    reporter = Report(**creds)
    args = reporter.lookup_args()

    threads = []
    for arg in args:
        thread = Thread(target=reporter.lookup, args=(arg,))
        threads.append(thread)
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()

    reporter.do_something_with_data()
The imported Connection class is a simple wrapper around psycopg2 to facilitate cursor creation and connecting to multiple postgres databases.
The imported Lookup class accepts a Connection instance and is used to perform short queries to find relevant data; incorporating those lookups into the larger query drastically decreases performance.
The data accepted by the example _get_averages method is a list of decimal.Decimal objects.
When I run all threads simultaneously I get a segfault. If I run each thread independently the script finishes successfully.
Using gdb I find numpy to be the culprit:
Program received signal SIGSEGV, Segmentation fault.
[Switching to Thread 0x7fffedc8c700 (LWP 10997)]
0x00007ffff2ac33b7 in sortCompare (a=0x2e956298, b=0x2e956390) at numpy/core/src/multiarray/item_selection.c:1045
1045 numpy/core/src/multiarray/item_selection.c: No such file or directory.
in numpy/core/src/multiarray/item_selection.c
I'm aware of this bug with numpy, but that seems to only affect sorting lists containing both class instances and other numeric types. The objects in my lists are guaranteed to be decimal.Decimal instances. (Yes, I verified this).
What could cause numpy to cause a segfault when used inside of a thread, but behave as expected otherwise?
I am having problems using the pyUSB library to read data from an ELM327 OBDII to USB device. I know that I need to write a command to the device on the write endpoint and read the received data back on the read endpoint. It doesn't seem to want to work for me though.
I wrote my own class obdusb for this:
import sys
import usb.core
from time import sleep


class obdusb:
    def __init__(self, _vend, _prod):
        '''Handle to USB device'''
        self.idVendor = _vend
        self.idProduct = _prod
        self._dev = usb.core.find(idVendor=_vend, idProduct=_prod)
        return None

    def GetDevice(self):
        '''Must be called after constructor'''
        return self._dev

    def SetupEndpoint(self):
        '''Must be called after constructor'''
        try:
            self._dev.set_configuration()
        except usb.core.USBError as e:
            sys.exit("Could not set configuration")

        self._endpointWrite = self._dev[0][(0,0)][1]
        self._endpointRead = self._dev[0][(0,0)][0]

        # Resetting device and setting vehicle protocol (Auto)
        # 20ms is required as a delay between each written command
        # ATZ resets device
        self._dev.write(self._endpointWrite.bEndpointAddress, 'ATZ', 0)
        sleep(0.002)
        # ATSP 0 should set vehicle protocol automatically
        self._dev.write(self._endpointWrite.bEndpointAddress, 'ATSP 0', 0)
        sleep(0.02)
        return self._endpointRead

    def GetData(self, strCommand):
        data = []
        self._dev.write(self._endpointWrite.bEndpointAddress, strCommand, 0)
        sleep(0.002)
        data = self._dev.read(self._endpointRead.bEndpointAddress, self._endpointRead.wMaxPacketSize)
        return data
So I then use this class and call the GetData method using this code:
import obdusb

#Setting up library, device and endpoint
lib = obdusb.obdusb(0x0403, 0x6001)
myDev = lib.GetDevice()
endp = lib.SetupEndpoint()

def PrintResults(arr):
    size = len(arr)
    print "Data currently in buffer:"
    for i in range(0, size):
        print "[" + str(i) + "]: " + str(arr[i])

#Testing GetData function with random OBD command
#0902 requests the VIN number of the vehicle
dataArr = lib.GetData('0902')
PrintResults(dataArr)
raw_input("Press any key")
This only ever prints the numbers 1 and 60 from the [0] and [1] elements of the array. No other data has been returned from the command. This is the case whether the device is connected to a car or not. I don't know what these 2 pieces of information are. I was expecting it to return a string of hexadecimal numbers. Does anyone know what I am doing wrong here?
If you don't use ATST or ATAT, you have to expect the default timeout of 200 ms between every write/read combination.
Are you sending a '\r' after each command? It looks like you aren't, so the device is waiting forever for a carriage return.
And a hint: test with 010D or 010C or something similar; with 09xx it can be difficult to know what to expect.
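A minimal sketch (not from the original post) of the '\r' suggestion applied to the asker's GetData method; the attribute names follow the question's obdusb class, and the 0.2 s pause is an assumption based on the default timeout mentioned above:
    def GetData(self, strCommand):
        # Terminate the command with a carriage return so the ELM327 executes it.
        self._dev.write(self._endpointWrite.bEndpointAddress, strCommand + '\r', 0)
        sleep(0.2)  # allow roughly the ELM327's default 200 ms response window
        return self._dev.read(self._endpointRead.bEndpointAddress,
                              self._endpointRead.wMaxPacketSize)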
UPDATE:
You can do that both ways, as long as you separate each command with a carriage return.
http://elmelectronics.com/ELM327/AT_Commands.pdf
http://elmelectronics.com/DSheets/ELM327DS.pdf (Expanded list).
That command list was quite useful to me.
ATAT can be used to adjust the timeout.
When you send 010D, the ELM chip will normally wait 200 ms to collect all possible responses; sometimes there is more than one reply, which is why it waits the full 200 ms.
What you can also do, though it's a mystery why only scan tools tend to implement this:
'010D1\r'
The 1 after the command tells the ELM to report back as soon as it has 1 reply from the bus. This reduces the delay quite effectively, at the cost of not being able to get more values for '010D' (which is speed!).
Sorry for my English; I hope this sends you in the right direction.
I just found the Queue module, which is helping me adapt the pyftpdlib module. I'm running a very strict FTP server, and my goal is to restrict the filenames available for upload. This is to prevent people from uploading whatever they want (it's actually the backend of an upload client, not a complete FTP server).
I have this in the ftpserver Authorizer:
def fetch_worlds(queue, username):
    while queue.empty():
        worlds = models.World.objects.filter(is_synced=True, user__username=username)
        print worlds
        queue.put(worlds, timeout=1)


class FTPAuthorizer(ftpserver.DummyAuthorizer):
    def __init__(self):
        self.q = Queue.Queue()
        self.t = None  # Thread
        self.world_item = None

    def has_perm(self, username, perm, path=None):
        print "Checking permission\n"
        if perm not in ['r', 'w']:
            return False
        # Check world name
        self.t = threading.Thread(target=fetch_worlds, args=(self.q, username))
        self.t.daemon = True
        self.t.start()
        self.world_item = self.q.get()
        print "WORLDITEM: %s" % self.world_item
        if path is not None:
            path = os.path.basename(path)
            for world in self.world_item:
                test = "{0}_{1}.zip".format(username, world.name)
                if path == test:
                    print "Match on %s" % test
                    return True
        return False
My issue is that after the server starts, the first time I STOR a file, it does an initial db call and gets all the worlds properly. But when I then add another world (for example, set is_synced=True on one), it still returns the old data from self.q.get(). has_perm() is called every time a file is uploaded, and it needs to return live data (to check whether a file is allowed).
For example, brand new server:
STOR file.zip, self.q.get() returns <World1, World2>
Update the database via other methods, etc.
STOR file2.zip: inside fetch_worlds, print worlds shows <World1, World2, World3>, but self.q.get() still returns <World1, World2>
The Queue module seemed like it would be helpful, but I can't get the implementation right.
(also couldn't add tag pyftpdlib)
I think this is what could be happening here:
When has_perm is called, you create a thread that will query a database (?) to add elements to the queue.
After calling start, the call to the database will take some time.
Meanwhile, in your main thread, you enter q.get, which blocks.
The db call finishes and the result is added to the queue,
and it is immediately removed from the queue again by the blocking q.get.
The queue is now empty, so your thread enters the while loop again, executes the same query again, and puts the result onto the queue.
The next call to q.get will return that instead of what it expects.
So you have a race condition here, which is already apparent from the fact that you add something to the queue in a loop while there is no loop when pulling items off.
You also assume that the element you get from the queue is the result of what you put onto it before; that doesn't have to be true. If you call has_perm twice, this results in two calls to fetch_worlds, with the possibility that the queue.empty() check fails for one of the calls, so only one result is put onto the queue. Now you have two threads waiting on q.get, but only one will get a result, while the other waits until one becomes available...
has_perm looks like it should be a blocking call anyway.
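A minimal sketch (not from the original code) of one way to avoid the race: give each call its own queue and have the worker put exactly one result instead of looping on queue.empty(). models.World is the same model the question's code already queries; lookup_worlds is a hypothetical helper name:
import Queue
import threading

def fetch_worlds(queue, username):
    # Run the query once and hand back exactly one result for this call.
    worlds = models.World.objects.filter(is_synced=True, user__username=username)
    queue.put(list(worlds))

def lookup_worlds(username):
    q = Queue.Queue()  # a fresh queue per call, so results cannot get mixed up
    t = threading.Thread(target=fetch_worlds, args=(q, username))
    t.daemon = True
    t.start()
    return q.get()     # blocks until this call's worker has finished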