How to create managers for the worker threads? - python

The code works fine for a single "manager", which basically launches some HTTP GETs to a server. But I've hit a brick wall.
How do I create two managers now, each with its own Download_Dashlet_Job object and tcp_pool_object? In essence, the managers would be commanding their own workers on two separate jobs. This seems to be a really good puzzle for learning Python classes.
import workerpool
from urllib3 import HTTPConnectionPool

class Download_Dashlet_Job(workerpool.Job):
    def __init__(self, url):
        self.url = url

    def run(self):
        request = tcp_pool_object.request('GET', self.url, headers=headers)

tcp_pool_object = HTTPConnectionPool('M_Server', port=8080, timeout=None, maxsize=3, block=True)
dashlet_thread_worker_pool_object = workerpool.WorkerPool(size=100)

# this section emulates a single manager calling 6 threads from the pool
# but limited to 3 TCP sockets by tcp_pool_object
for url in open("overview_urls.txt"):
    job_object = Download_Dashlet_Job(url.strip())
    dashlet_thread_worker_pool_object.put(job_object)

dashlet_thread_worker_pool_object.shutdown()
dashlet_thread_worker_pool_object.wait()

First, workerpool.WorkerPool(size=100) creates 100 worker threads. The comment in your code says you want 6 threads, so change that size to 6.
To create a second manager, you create another pool: another WorkerPool and, in your case, another HTTPConnectionPool for it. Alternatively, you can create another job class and just add this different type of job to the same pool, if you prefer. A sketch of the two-manager setup follows.
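For illustration, here is a minimal sketch of two independent managers, each with its own job objects, its own HTTPConnectionPool and its own WorkerPool. The second URL file (report_urls.txt) and the report_* names are made up for the example.
import workerpool
from urllib3 import HTTPConnectionPool

class DashletJob(workerpool.Job):
    def __init__(self, url, tcp_pool):
        self.url = url
        self.tcp_pool = tcp_pool              # each job is told which connection pool to use

    def run(self):
        self.tcp_pool.request('GET', self.url)

# manager 1: its own TCP pool (3 sockets) and its own worker pool (6 threads)
dashlet_tcp_pool = HTTPConnectionPool('M_Server', port=8080, timeout=None, maxsize=3, block=True)
dashlet_workers = workerpool.WorkerPool(size=6)

# manager 2: a separate TCP pool and worker pool for a second, independent job
report_tcp_pool = HTTPConnectionPool('M_Server', port=8080, timeout=None, maxsize=3, block=True)
report_workers = workerpool.WorkerPool(size=6)

for url in open("overview_urls.txt"):
    dashlet_workers.put(DashletJob(url.strip(), dashlet_tcp_pool))

for url in open("report_urls.txt"):           # hypothetical second list of URLs
    report_workers.put(DashletJob(url.strip(), report_tcp_pool))

for pool in (dashlet_workers, report_workers):
    pool.shutdown()
    pool.wait()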

Related

Build scraper REST API server with pyppeteer or selenium

I need to create a server to which I can make REST requests, obtaining the scraped data from the indicated site.
For example, a URL like this:
http://myip/scraper?url=www.exampe.com&token=0
I have to scrape a site built in JavaScript that detects whether it is opened by a real or a headless browser.
The only alternatives are selenium or pyppeteer together with a virtual display.
I currently use selenium and FastAPI, but it is not a usable solution with a lot of requests: for each request Chrome is opened and closed, which delays the response a lot and uses a lot of resources.
With pyppeteer's async API you can open multiple tabs at the same time in the same browser instance, reducing response times, but this would likely lead to other problems after a certain number of tabs.
I was thinking of creating a pool of browser instances to divide the various requests across, the way puppeteer-cluster does.
But so far I haven't been able to figure it out.
This is the code I am currently trying for the Browser class:
import json
from pyppeteer import launch
from strings import keepa_storage

class Browser:

    async def __aenter__(self):
        self._session = await launch(headless=False,
                                     args=['--no-sandbox', '--disable-gpu', '--lang=it',
                                           '--disable-blink-features=AutomationControlled'],
                                     autoClose=False)
        return self

    async def __aexit__(self, *err):
        self._session = None

    async def fetch(self, url):
        page = await self._session.newPage()
        page_source = None
        try:
            await page.goto("https://example.com/404")
            for key in keepa_storage:
                await page.evaluate(
                    "window.localStorage.setItem('{}',{})".format(key, json.dumps(keepa_storage.get(key))))
            await page.goto(url)
            await page.waitForSelector('#tableElement')
            page_source = await page.content()
        except TimeoutError:
            print(f'Timeout for: {url}')
        finally:
            await page.close()
        return page_source
And this code for the request:
async with Browser() as http:
    source = await asyncio.gather(
        http.fetch('https://example.com')
    )
But I have no idea how to reuse the same browser session for multiple server requests
While initialising the server, create a Manager object. As per the implementation, the manager automatically spawns all the Workers it needs. In the API handler, invoke manager.assign(item); this should get an idle worker and assign the item to it. If no worker is idle at that moment, then, thanks to the Queue nature of manager._AVAILABLE_WORKERS, it will wait until a worker becomes available. On a different thread, run an infinite loop that invokes manager.heartbeat() to make sure the workers are not slacking off.
The comments describe the purpose of each method and what it is supposed to do. That should be enough to get you started. Feel free to let me know in case further clarification is required.
from queue import Queue


class Worker:
    ###
    # class to define behavior and parameters of workers
    ###

    def __init__(self, base_url):
        ###
        # Initialises a worker
        # STEP 1. Create one worker with given inputs
        # STEP 2. Mark the worker busy
        # STEP 3. Get ready for item consumption with initialisation/login process done
        # STEP 4. Mark the worker available and active
        ###
        raise NotImplementedError()

    def process_item(self, **item):
        ###
        # Worker processes the given item and returns data to manager
        # STEP 1. Worker marks himself busy
        # STEP 2. Worker processes the item. Handle errors here
        # STEP 3. Worker marks himself available
        # STEP 4. Return the data scraped
        ###
        raise NotImplementedError()


class Manager:
    ###
    # class for manager who supervises all the workers and assigns work to them
    ###

    def __init__(self):
        self._WORKERS = set()  # set container to hold all the workers' details
        self._AVAILABLE_WORKERS = Queue(maxsize=10)  # queue container to hold available workers
        # create all the workers we want and add them to self._WORKERS and self._AVAILABLE_WORKERS

    def assign(self, item):
        ###
        # Assigns an item to a worker to be processed and, once processed, returns data to the server
        # STEP 1. Remove worker from available pool
        # STEP 2. Assign item to worker
        # STEP 3A. If the item is successfully processed, put the worker back into the available pool
        # STEP 3B. If an error occurred during item processing, try to reset the worker and put it back
        #          into the available pool
        ###
        raise NotImplementedError()

    def heartbeat(self):
        ###
        # Process to check that all the workers are active and accounted for at a particular interval.
        # If a worker is available but not in the pool, add it to the pool after checking that it's not busy.
        # If a worker is not active, reset the worker and add it to the pool.
        ###
        raise NotImplementedError()
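To make the skeleton above more concrete, here is a minimal async sketch of the same Manager/Worker idea wired into FastAPI with pyppeteer. All names are illustrative and it makes simplifying assumptions: one worker owns one tab of a single shared browser, and the manager hands out idle workers through an asyncio.Queue.
import asyncio
from fastapi import FastAPI
from pyppeteer import launch

class Worker:
    def __init__(self, page):
        self.page = page                      # each worker owns one tab of the shared browser

    async def process_item(self, url):
        await self.page.goto(url)
        return await self.page.content()

class Manager:
    def __init__(self):
        self._available = asyncio.Queue()     # idle workers wait here

    async def start(self, n_workers=3):
        self._browser = await launch(headless=True, args=['--no-sandbox'])
        for _ in range(n_workers):
            await self._available.put(Worker(await self._browser.newPage()))

    async def assign(self, url):
        worker = await self._available.get()  # waits here if every worker is busy
        try:
            return await worker.process_item(url)
        finally:
            await self._available.put(worker)  # always hand the worker back

app = FastAPI()
manager = Manager()

@app.on_event("startup")
async def startup():
    await manager.start()

@app.get("/scraper")
async def scraper(url: str):
    return {"source": await manager.assign(url)}
Because all handlers of a single worker process run on one event loop, a plain asyncio.Queue is enough here; requests simply wait inside manager.assign() until a tab is free.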

tornado one handler blocks for another

Using python/tornado I wanted to set up a little "trampoline" server that allows two devices to communicate with each other in a RESTish manner. There are probably vastly superior/simpler "off the shelf" ways to do this, and I'd welcome those suggestions, but I still feel it would be educational to figure out how to do my own using tornado.
Basically, the idea was that I would have the device in the role of server doing a long poll with a GET. The client device would POST to the server, at which point the POST body would be transferred as the response of the blocked GET. Before the POST responded, it would block. The server side then does a PUT with the response, which is transferred to the blocked POST and returned to the device. I thought maybe I could do this with tornado.queues, but that appears not to have worked out. My code:
import tornado
import tornado.web
import tornado.httpserver
import tornado.queues

ToServerQueue = tornado.queues.Queue()
ToClientQueue = tornado.queues.Queue()

class Query(tornado.web.RequestHandler):
    def get(self):
        toServer = ToServerQueue.get()
        self.write(toServer)

    def post(self):
        toServer = self.request.body
        ToServerQueue.put(toServer)
        toClient = ToClientQueue.get()
        self.write(toClient)

    def put(self):
        ToClientQueue.put(self.request.body)
        self.write(bytes())

services = tornado.web.Application([(r'/query', Query)], debug=True)
services.listen(49009)
tornado.ioloop.IOLoop.instance().start()
Unfortunately, the ToServerQueue.get() does not actually block until the queue has an item, but rather returns a tornado.concurrent.Future, which is not a legal value to pass to the self.write() call.
I guess my general question is twofold:
1) How can one HTTP verb invocation (e.g. get, put, post, etc) block and then be signaled by another HTTP verb invocation.
2) How can I share data from one invocation to another?
I've only really scratched the simple/straightforward use cases of making little REST servers with tornado. I wonder if the coroutine stuff is what I need, but haven't found a good tutorial/example of that to help me see the light, if that's indeed the way to go.
1) How can one HTTP verb invocation (e.g. get, put, post, etc) block and then be signaled by another HTTP verb invocation.
2) How can I share data from one invocation to another?
A new RequestHandler object is created for every request, so you need some coordinator, e.g. queues or locks with a shared state object (in your case it would amount to re-implementing a queue).
tornado.queues are queues for coroutines. Queue.get, Queue.put and Queue.join return Future objects, which need to be "resolved" - the scheduled task finishes either with success or with an exception. To wait until a future is resolved you should yield it (just like in the doc examples of tornado.queues). The verb methods also need to be decorated with tornado.gen.coroutine.
import tornado.gen

class Query(tornado.web.RequestHandler):

    @tornado.gen.coroutine
    def get(self):
        toServer = yield ToServerQueue.get()
        self.write(toServer)

    @tornado.gen.coroutine
    def post(self):
        toServer = self.request.body
        yield ToServerQueue.put(toServer)
        toClient = yield ToClientQueue.get()
        self.write(toClient)

    @tornado.gen.coroutine
    def put(self):
        yield ToClientQueue.put(self.request.body)
        self.write(bytes())
The GET request will last (wait in a non-blocking manner) until something is available on the queue (or until a timeout, which can be passed as a Queue.get argument).
tornado.queues.Queue also provides get_nowait (there is put_nowait as well), which does not have to be yielded - it either returns an item from the queue immediately or raises an exception.
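For reference, the same handlers can also be written with native coroutines instead of the decorator. This is a sketch assuming Tornado 5+ on Python 3.5+, where async/await replaces gen.coroutine/yield:
class Query(tornado.web.RequestHandler):
    async def get(self):
        toServer = await ToServerQueue.get()   # waits without blocking the IOLoop
        self.write(toServer)

    async def post(self):
        await ToServerQueue.put(self.request.body)
        toClient = await ToClientQueue.get()
        self.write(toClient)

    async def put(self):
        await ToClientQueue.put(self.request.body)
        self.write(bytes())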

Django saving progress to the Session in subscript

So I'm wondering how this is done right.
I try to save the progress of a long-running task inside the request.session object, and then be able to get the status of the process with another view method.
I'm using the Pool class to make my long-running process async:
MyCalculation.py

def longrunning(x, request):
    request.session['status'] = 5
    return x * x

views.py

def dolongrunning(request, x):
    pool = Pool(processes=1)
    result = pool.apply_async(MyCalculation.longrunning, [x, request])
    return JsonResponse(..)

def status(request):
    return JsonResponse(request.session.get('status'))
So this doesn't work. My async job does execute, but the request object doesn't get my progress information.
How could I accomplish that, or is there another way?
I have the feeling that passing the request object is a bad idea in general.
What would be good practice for storing the status of a long-running operation in Django/Python?
Different processes do not share the same memory space; each one gets its own copy.
In your case, the request object received by the worker process in the longrunning function is a copy of the one created in the parent process. Changes made in one of the processes do not affect the others.
What you want to do is send updates from the worker process to the parent one and then, within the parent one, update the request status.
from multiprocessing import Pool, Manager


def worker(task, message_queue):  # longrunning
    # do something
    message_queue.put(5)
    # do something else
    message_queue.put(42)


def request_handler(request, task, message_queue):  # dolongrunning
    result = pool.apply_async(worker, [task, message_queue])
    return JsonResponse(..)


def status(request):
    status = message_queue.get()  # this is blocking if there are no messages in the queue
    request.session['status'] = status
    return JsonResponse(request.session['status'])


pool = Pool(processes=1)
# a managed queue can be passed as an argument to Pool workers;
# a plain multiprocessing.Queue can only be shared through inheritance
message_queue = Manager().Queue()
This is quite simplified, and it actually blocks on status requests if no status has been set yet, but it gives the idea.
A better way would be to store the statuses in a buffer and keep the message queue empty with a thread: each time a status request is received, the last status update received from the workers would be returned. A sketch of that follows.
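A minimal sketch of that buffering idea, reusing the message_queue from the snippet above. The (job_id, status) tuple format and the job_id view argument are assumptions for the example: a background thread drains the queue and keeps only the latest status per job, so status requests never block.
import threading

latest_status = {}  # job_id -> last status reported by a worker


def drain_queue(message_queue):
    # runs forever in the background, emptying the queue as updates arrive
    while True:
        job_id, status_value = message_queue.get()  # workers put (job_id, status) tuples
        latest_status[job_id] = status_value


threading.Thread(target=drain_queue, args=(message_queue,), daemon=True).start()


def status(request, job_id):
    # never blocks: just report the last known status for this job
    return JsonResponse({'status': latest_status.get(job_id, 'unknown')})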

Simple threading in Django

I need to implement threading in Django. I require three simple APIs:
/work?process=data1&jobid=1&jobtype=nonasync
/status
/kill?jobid=1
The API descriptions are:
The work API will take a process and spawn a thread that processes it. For now, we can assume it to be a simple sleep(10) method. It will name the thread jobid-1, and the thread should be retrievable by this name. A new thread cannot be created if that jobid already exists. The jobtype can be async, i.e. the API call immediately returns HTTP status code 200 after spawning a thread, or nonasync, such that the API waits for the server to complete the thread and returns the result.
The status API should just show the statuses of all running processes.
The kill API should kill a process based on its jobid; the status API should not show that job any longer.
Here is my Django code:
import time
import threading

from django.shortcuts import render_to_response
from django.template import RequestContext

processList = []

class Processes(threading.Thread):
    """ The work api can instantiate a process object and monitor its completion"""
    threadBeginTime = time.time()

    def __init__(self, timeout, threadName, jobType):
        threading.Thread.__init__(self)
        self.totalWaitTime = timeout
        self.threadName = threadName
        self.jobType = jobType

    def beginThread(self):
        self.thread = threading.Thread(target=self.execution,
                                       name=self.threadName)
        self.thread.start()

    def execution(self):
        time.sleep(self.totalWaitTime)

    def calculatePercentDone(self):
        """Gets the current percent done for the thread."""
        temp = time.time()
        secondsDone = float(temp - self.threadBeginTime)
        percentDone = float(secondsDone * 100 / self.totalWaitTime)
        return (secondsDone, percentDone)

    def killThread(self):
        pass
        # time.sleep(self.totalWaitTime)

def work(request):
    """ Django process initiation view """
    data = {}
    timeout = int(request.REQUEST.get('process'))
    jobid = int(request.REQUEST.get('jobid'))
    jobtype = request.REQUEST.get('jobtype')
    myProcess = Processes(timeout, jobid, jobtype)
    myProcess.beginThread()
    processList.append(myProcess)
    return render_to_response('work.html', {'data': data}, RequestContext(request))

def status(request):
    """ Django process status view """
    data = {}
    for p in processList:
        print p.threadName, p.calculatePercentDone()
    return render_to_response('server-status.html', {'data': data}, RequestContext(request))

def kill(request):
    """ Django process kill view """
    data = {}
    jobid = int(request.REQUEST.get('jobid'))
    # find jobid in processList and kill it
    return render_to_response('server-status.html', {'data': data}, RequestContext(request))
There are several implementation issues in the above code. The thread spawning is not done in a proper way, and I am not able to retrieve the process statuses in the status function. Also, the kill function is still unimplemented, as I could not grab a thread by its job id. Need help refactoring.
Update: I am doing this example for learning purposes, not for writing production code, hence I will not favour any off-the-shelf queueing libraries. The objective here is to understand how multithreading works in conjunction with a web framework and what edge cases need to be dealt with.
As @Daniel Roseman mentioned above, doing threading INSIDE of a Django request/response cycle is a very bad idea for many reasons.
What you're actually looking for here is task queueing.
There are a few libraries out there which make this sort of thing fairly simple -- I'll list them below in order of ease of use (the simplest ones are listed first):
Django-RQ (https://github.com/ui/django-rq) -- A very awesome, simple API that uses Redis to handle queueing and asynchronous tasks.
Celery (http://docs.celeryproject.org/en/latest/django/first-steps-with-django.html) -- A very powerful, flexible, large project which handles queueing and supports many different backend technologies. I'd recommend it for large projects, but for everything else I'd use RQ, as it's quite a bit simpler.
Just my two cents.
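For a flavour of the Django-RQ route, here is a minimal sketch. It assumes django-rq is installed and a Redis-backed 'default' queue is configured in settings.RQ_QUEUES; the do_work job and the view bodies are made up for the example.
import time

import django_rq
from django.http import JsonResponse


def do_work(seconds):
    # stand-in for the long-running job
    time.sleep(seconds)
    return seconds * seconds


def work(request):
    # enqueue the job and return its id immediately (the async case)
    job = django_rq.get_queue('default').enqueue(do_work, 10)
    return JsonResponse({'jobid': job.id})


def status(request):
    # look the job up by id and report its current state
    job = django_rq.get_queue('default').fetch_job(request.GET['jobid'])
    return JsonResponse({'status': job.get_status() if job else 'unknown'})
The worker process started with python manage.py rqworker does the actual sleeping, so the web process never blocks.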

Is the Session object from Python's Requests library thread safe?

Python's popular Requests library is said to be thread-safe on its home page, but no further details are given. If I call requests.session(), can I then safely pass this object to multiple threads like so:
session = requests.session()

for i in xrange(thread_count):
    threading.Thread(
        target=target,
        args=(session,),
        kwargs={}
    )
and make requests using the same connection pool in multiple threads?
If so, is this the recommended approach, or should each thread be given its own connection pool? (Assuming the total size of all the individual connection pools summed to the size of what would be one big connection pool, like the one above.) What are the pros and cons of each approach?
After reviewing the source of requests.session, I'm going to say the session object might be thread-safe, depending on the implementation of CookieJar being used.
Session.prepare_request reads from self.cookies, and Session.send calls extract_cookies_to_jar(self.cookies, ...), and that calls jar.extract_cookies(...) (jar being self.cookies in this case).
The source for Python 2.7's cookielib acquires a lock (threading.RLock) while it updates the jar, so it appears to be thread-safe. On the other hand, the documentation for cookielib says nothing about thread-safety, so maybe this feature should not be depended on?
UPDATE
If your threads are mutating any attributes of the session object such as headers, proxies, stream, etc. or calling the mount method or using the session with the with statement, etc. then it is not thread-safe.
https://github.com/psf/requests/issues/1871 implies that Session is not thread-safe, and that at least one maintainer recommends one Session per thread.
I just opened https://github.com/psf/requests/issues/2766 to clarify the documentation.
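If you follow the maintainers' recommendation of one Session per thread, a minimal sketch using threading.local looks like this (the get_session helper and target function are illustrative):
import threading
import requests

_thread_local = threading.local()


def get_session():
    # lazily create one Session per thread; each thread then reuses its own connection pool
    if not hasattr(_thread_local, "session"):
        _thread_local.session = requests.Session()
    return _thread_local.session


def target(url):
    response = get_session().get(url)
    return response.status_code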
I also faced the same question and went to the source code to find a suitable solution for me.
In my opinion, the Session class generally has various problems:
It initializes the default HTTPAdapter in the constructor and leaks it if you mount another one to 'http' or 'https'.
HTTPAdapter implementation maintains the connection pool, I think it is not something to create on each Session object instantiation.
Session closes HTTPAdapter, thus you can't reuse the connection pool between different Session instances.
Session class doesn't seem to be thread safe according to various discussions.
HTTPAdapter internally uses urllib3.PoolManager. I didn't find any obvious thread-safety problem in its source code, so I would rather trust the documentation, which says that urllib3 is thread-safe.
As a conclusion from the above list, I didn't find anything better than overriding the Session class:
from collections import OrderedDict

from requests import Session
from requests.adapters import HTTPAdapter
from requests.cookies import cookiejar_from_dict
from requests.hooks import default_hooks
from requests.models import DEFAULT_REDIRECT_LIMIT
from requests.utils import default_headers


class HttpSession(Session):
    def __init__(self, adapter: HTTPAdapter):
        self.headers = default_headers()
        self.auth = None
        self.proxies = {}
        self.hooks = default_hooks()
        self.params = {}
        self.stream = False
        self.verify = True
        self.cert = None
        self.max_redirects = DEFAULT_REDIRECT_LIMIT
        self.trust_env = True
        self.cookies = cookiejar_from_dict({})
        self.adapters = OrderedDict()
        self.mount('https://', adapter)
        self.mount('http://', adapter)

    def close(self) -> None:
        pass
And creating the connection factory like this:
from urllib3.util.retry import Retry

# DEFAULT_CONNECTION_POOL_MAX_SIZE and DEFAULT_RETRY_POLICY are constants
# defined elsewhere by the author of this snippet
class HttpSessionFactory:
    def __init__(self,
                 pool_max_size: int = DEFAULT_CONNECTION_POOL_MAX_SIZE,
                 retry: Retry = DEFAULT_RETRY_POLICY):
        self.__http_adapter = HTTPAdapter(pool_maxsize=pool_max_size, max_retries=retry)

    def session(self) -> Session:
        return HttpSession(self.__http_adapter)

    def close(self):
        self.__http_adapter.close()
Finally, somewhere in the code I can write:
with self.__session_factory.session() as session:
    response = session.get(request_url)
And all my session instances will reuse the same connection pool.
And somewhere at the end when the application stops I can close the HttpSessionFactory.
Hope this will help somebody.
