Build scraper REST API server with pyppeteer or selenium - python

I need to create a server to which I can make REST requests and get back the scraped data from the indicated site.
For example, a URL like this:
http://myip/scraper?url=www.example.com&token=0
I have to scrape a site built with JavaScript that detects whether it is opened by a real or a headless browser.
The only alternatives are Selenium or pyppeteer, together with a virtual display.
I currently use Selenium and FastAPI, but it is not a usable solution under many requests: for each request Chrome is opened and closed, which delays the response a lot and uses a lot of resources.
With async pyppeteer you can open multiple tabs at the same time in the same browser instance, reducing response times, but this would likely lead to other problems after a certain number of tabs.
I was thinking of creating a pool of browser instances across which to divide the various requests, as puppeteer-cluster does.
But so far I haven't been able to figure it out.
This is the code I'm currently trying for the Browser class:
import json

from pyppeteer import launch
from pyppeteer.errors import TimeoutError  # pyppeteer raises its own TimeoutError
from strings import keepa_storage


class Browser:

    async def __aenter__(self):
        self._session = await launch(headless=False,
                                     args=['--no-sandbox', '--disable-gpu', '--lang=it',
                                           '--disable-blink-features=AutomationControlled'],
                                     autoClose=False)
        return self

    async def __aexit__(self, *err):
        await self._session.close()  # close the browser when leaving the context
        self._session = None

    async def fetch(self, url):
        page = await self._session.newPage()
        page_source = None
        try:
            await page.goto("https://example.com/404")
            for key in keepa_storage:
                await page.evaluate(
                    "window.localStorage.setItem('{}', {})".format(key, json.dumps(keepa_storage.get(key))))
            await page.goto(url)
            await page.waitForSelector('#tableElement')
            page_source = await page.content()
        except TimeoutError:
            print(f'Timeout for: {url}')
        finally:
            await page.close()
        return page_source
And this code for the request:
async with Browser() as http:
    source = await asyncio.gather(
        http.fetch('https://example.com')
    )
But I have no idea how to reuse the same browser session for multiple server requests.

While initialising the server, create a Manager object. As per this design, the manager automatically spawns all the workers needed. In the API handler, invoke manager.assign(item); this should fetch an idle worker and assign the item to it. If no worker is idle at the moment, then, because manager._AVAILABLE_WORKERS is a Queue, the call should wait until a worker becomes available. On a different thread, run an infinite loop that invokes manager.heartbeat() to make sure the workers are not slacking off.
I have described in the comments what each method is for and what it is supposed to do. That should be enough to get you started. Feel free to let me know in case further clarification is required.
from queue import Queue


class Worker:
    ###
    # class to define behavior and parameters of workers
    ###

    def __init__(self, base_url):
        ###
        # Initialises a worker
        # STEP 1. Create one worker with given inputs
        # STEP 2. Mark the worker busy
        # STEP 3. Get ready for item consumption with initialisation/login process done
        # STEP 4. Mark the worker available and active
        ###
        raise NotImplementedError()

    def process_item(self, **item):
        ###
        # Worker processes the given item and returns data to manager
        # STEP 1. Worker marks himself busy
        # STEP 2. Worker processes the item. Handle errors here
        # STEP 3. Worker marks himself available
        # STEP 4. Return the data scraped
        ###
        raise NotImplementedError()


class Manager:
    ###
    # class for manager who supervises all the workers and assigns work to them
    ###

    def __init__(self):
        self._WORKERS = set()  # set container to hold all the workers' details
        self._AVAILABLE_WORKERS = Queue(maxsize=10)  # queue container to hold available workers
        # create all the workers we want and add them to self._WORKERS and self._AVAILABLE_WORKERS

    def assign(self, item):
        ###
        # Assigns an item to a worker to be processed and, once processed, returns data to the server
        # STEP 1. Remove worker from available pool
        # STEP 2. Assign item to worker
        # STEP 3A. If the item is successfully processed, put the worker back into the available pool
        # STEP 3B. If an error occurred during item processing, try to reset the worker and put the
        #          worker back into the available pool
        ###
        raise NotImplementedError()

    def heartbeat(self):
        ###
        # Process to check that all the workers are active and accounted for at a particular interval.
        # If a worker is available but not in the pool, add it to the pool after checking that it's not busy.
        # If a worker is not active, reset the worker and add it to the pool.
        ###
        raise NotImplementedError()
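To make the skeleton concrete with the tools from the question, here is a minimal sketch that wires a small pool of pyppeteer browsers into FastAPI. It is only an illustration of the idea, not the answer's exact design: the thread-safe Queue is swapped for an asyncio.Queue (pyppeteer is asyncio-based), the heartbeat is omitted, and names such as Worker.start and the /scraper route are made up for the example.
import asyncio
from fastapi import FastAPI
from pyppeteer import launch


class Worker:
    def __init__(self):
        self.browser = None

    async def start(self):
        # one browser instance per worker; headless=False as in the question
        # (which then needs a virtual display such as Xvfb)
        self.browser = await launch(headless=False, args=['--no-sandbox'], autoClose=False)

    async def process_item(self, url):
        page = await self.browser.newPage()
        try:
            await page.goto(url)
            return await page.content()
        finally:
            await page.close()


class Manager:
    def __init__(self, size=3):
        self.size = size
        self._available = asyncio.Queue()

    async def start(self):
        for _ in range(self.size):
            worker = Worker()
            await worker.start()
            await self._available.put(worker)

    async def assign(self, url):
        worker = await self._available.get()      # waits here if every worker is busy
        try:
            return await worker.process_item(url)
        finally:
            await self._available.put(worker)      # always hand the worker back to the pool


app = FastAPI()
manager = Manager(size=3)


@app.on_event("startup")
async def startup():
    await manager.start()


@app.get("/scraper")
async def scraper(url: str, token: str = "0"):
    html = await manager.assign(url)
    return {"url": url, "content": html}
Because the browsers are created once at startup and only tabs are opened per request, the per-request Chrome launch cost from the question disappears.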

How to redirect to another page and call a Python function when clicking on submit button [duplicate]

I am writing an application in Flask, which works really well except that WSGI is synchronous and blocking. I have one task in particular which calls out to a third-party API, and that task can take several minutes to complete. I would like to make that call (it's actually a series of calls) and let it run while control is returned to Flask.
My view looks like:
@app.route('/render/<id>', methods=['POST'])
def render_script(id=None):
    ...
    data = json.loads(request.data)
    text_list = data.get('text_list')
    final_file = audio_class.render_audio(data=text_list)
    # do stuff
    return Response(
        mimetype='application/json',
        status=200
    )
Now, what I want to do is have the line
final_file = audio_class.render_audio()
run and provide a callback to be executed when the method returns, whilst Flask can continue to process requests. This is the only task which I need Flask to run asynchronously, and I would like some advice on how best to implement this.
I have looked at Twisted and Klein, but I'm not sure whether they are overkill; maybe threading would suffice. Or maybe Celery is a good choice for this?
I would use Celery to handle the asynchronous task for you. You'll need to install a broker to serve as your task queue (RabbitMQ and Redis are recommended).
app.py:
from flask import Flask
from celery import Celery

broker_url = 'amqp://guest@localhost'  # Broker URL for RabbitMQ task queue

app = Flask(__name__)
celery = Celery(app.name, broker=broker_url)
celery.config_from_object('celeryconfig')  # Your celery configurations in a celeryconfig.py


@celery.task(bind=True)
def some_long_task(self, x, y):
    # Do some long task
    ...


@app.route('/render/<id>', methods=['POST'])
def render_script(id=None):
    ...
    data = json.loads(request.data)
    text_list = data.get('text_list')
    final_file = audio_class.render_audio(data=text_list)
    some_long_task.delay(x, y)  # Call your async task and pass whatever necessary variables
    return Response(
        mimetype='application/json',
        status=200
    )
Run your Flask app, and start another process to run your celery worker.
$ celery worker -A app.celery --loglevel=debug
I would also refer to Miguel Grinberg's write-up for a more in-depth guide to using Celery with Flask.
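If the client needs to poll for the outcome, note that some_long_task.delay(x, y) returns an AsyncResult whose id you can hand back in the response and look up later. A rough sketch of a status endpoint built on that (the /status route is mine, not part of the answer above):
from flask import jsonify


@app.route('/status/<task_id>')
def task_status(task_id):
    result = celery.AsyncResult(task_id)   # look the task up by the id returned from .delay()
    payload = {'state': result.state}      # e.g. PENDING, STARTED, SUCCESS, FAILURE
    if result.ready():
        payload['result'] = str(result.result)
    return jsonify(payload)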
Threading is another possible solution. Although the Celery based solution is better for applications at scale, if you are not expecting too much traffic on the endpoint in question, threading is a viable alternative.
This solution is based on Miguel Grinberg's PyCon 2016 Flask at Scale presentation, specifically slide 41 in his slide deck. His code is also available on github for those interested in the original source.
From a user perspective the code works as follows:
You make a call to the endpoint that performs the long running task.
This endpoint returns 202 Accepted with a link to check on the task status.
Calls to the status link return 202 while the task is still running, and return 200 (and the result) when the task is complete.
To convert an API call into a background task, simply add the @async_api decorator.
Here is a fully contained example:
from flask import Flask, g, abort, current_app, request, url_for
from werkzeug.exceptions import HTTPException, InternalServerError
from flask_restful import Resource, Api
from datetime import datetime
from functools import wraps
import threading
import time
import uuid

tasks = {}

app = Flask(__name__)
api = Api(app)


@app.before_first_request
def before_first_request():
    """Start a background thread that cleans up old tasks."""
    def clean_old_tasks():
        """
        This function cleans up old tasks from our in-memory data structure.
        """
        global tasks
        while True:
            # Only keep tasks that are running or that finished less than 5
            # minutes ago.
            five_min_ago = datetime.timestamp(datetime.utcnow()) - 5 * 60
            tasks = {task_id: task for task_id, task in tasks.items()
                     if 'completion_timestamp' not in task or task['completion_timestamp'] > five_min_ago}
            time.sleep(60)

    if not current_app.config['TESTING']:
        thread = threading.Thread(target=clean_old_tasks)
        thread.start()


def async_api(wrapped_function):
    @wraps(wrapped_function)
    def new_function(*args, **kwargs):
        def task_call(flask_app, environ):
            # Create a request context similar to that of the original request
            # so that the task can have access to flask.g, flask.request, etc.
            with flask_app.request_context(environ):
                try:
                    tasks[task_id]['return_value'] = wrapped_function(*args, **kwargs)
                except HTTPException as e:
                    tasks[task_id]['return_value'] = current_app.handle_http_exception(e)
                except Exception as e:
                    # The function raised an exception, so we set a 500 error
                    tasks[task_id]['return_value'] = InternalServerError()
                    if current_app.debug:
                        # We want to find out if something happened so reraise
                        raise
                finally:
                    # We record the time of the response, to help in garbage
                    # collecting old tasks
                    tasks[task_id]['completion_timestamp'] = datetime.timestamp(datetime.utcnow())
                    # close the database session (if any)

        # Assign an id to the asynchronous task
        task_id = uuid.uuid4().hex

        # Record the task, and then launch it
        tasks[task_id] = {'task_thread': threading.Thread(
            target=task_call, args=(current_app._get_current_object(),
                                    request.environ))}
        tasks[task_id]['task_thread'].start()

        # Return a 202 response, with a link that the client can use to
        # obtain task status
        print(url_for('gettaskstatus', task_id=task_id))
        return 'accepted', 202, {'Location': url_for('gettaskstatus', task_id=task_id)}
    return new_function


class GetTaskStatus(Resource):
    def get(self, task_id):
        """
        Return status about an asynchronous task. If this request returns a 202
        status code, it means that task hasn't finished yet. Else, the response
        from the task is returned.
        """
        task = tasks.get(task_id)
        if task is None:
            abort(404)
        if 'return_value' not in task:
            return '', 202, {'Location': url_for('gettaskstatus', task_id=task_id)}
        return task['return_value']


class CatchAll(Resource):
    @async_api
    def get(self, path=''):
        # perform some intensive processing
        print("starting processing task, path: '%s'" % path)
        time.sleep(10)
        print("completed processing task, path: '%s'" % path)
        return f'The answer is: {path}'


api.add_resource(CatchAll, '/<path:path>', '/')
api.add_resource(GetTaskStatus, '/status/<task_id>')


if __name__ == '__main__':
    app.run(debug=True)
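A small client-side sketch of the 202/Location flow described above, assuming the example app is running locally on port 5000 and using the requests library:
import time
from urllib.parse import urljoin

import requests

base = 'http://localhost:5000'
resp = requests.get(base + '/some/long/path')          # kicks off the background task
status_url = urljoin(base, resp.headers['Location'])   # 202 response points at /status/<task_id>

while True:
    status = requests.get(status_url)
    if status.status_code == 200:                       # finished: the body holds the return value
        print(status.text)
        break
    time.sleep(1)                                       # still 202: poll again shortly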
You can also try using multiprocessing.Process with daemon=True; the process.start() method does not block and you can return a response/status immediately to the caller while your expensive function executes in the background.
I experienced a similar problem while working with the Falcon framework, and using a daemon process helped.
You'd need to do the following:
from multiprocessing import Process


@app.route('/render/<id>', methods=['POST'])
def render_script(id=None):
    ...
    heavy_process = Process(  # Create a daemonic process with heavy "my_func"
        target=my_func,
        daemon=True
    )
    heavy_process.start()
    return Response(
        mimetype='application/json',
        status=200
    )


# Define some heavy function
def my_func():
    time.sleep(10)
    print("Process finished")
You should get a response immediately, and after 10 seconds you should see a printed message in the console.
NOTE: Keep in mind that daemonic processes are not allowed to spawn any child processes.
Flask 2.0
Flask 2.0 now supports async routes. You can use the httpx library and asyncio coroutines for this. You can change your code a bit, as shown below:
@app.route('/render/<id>', methods=['POST'])
async def render_script(id=None):
    ...
    data = json.loads(request.data)
    text_list = data.get('text_list')
    final_file = await asyncio.gather(
        audio_class.render_audio(data=text_list),
        do_other_stuff_function()
    )
    # Just make sure that the coroutines do not make any blocking calls inside them.
    return Response(
        mimetype='application/json',
        status=200
    )
The above is just pseudocode, but you can check out how asyncio works with Flask 2.0, and for HTTP calls you can use httpx. Also make sure the coroutines only do I/O-bound work.
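As a slightly more concrete (but still illustrative) sketch of the same idea, here is a self-contained async route that fans out two HTTP calls with httpx and asyncio.gather; the route and URLs are examples only, and Flask needs the async extra installed (pip install "flask[async]"):
import asyncio

import httpx
from flask import Flask, jsonify

app = Flask(__name__)


async def fetch_status(client, url):
    # purely I/O-bound coroutine: no blocking calls inside
    resp = await client.get(url)
    return {url: resp.status_code}


@app.route('/check')
async def check():
    async with httpx.AsyncClient() as client:
        results = await asyncio.gather(
            fetch_status(client, 'https://example.com'),
            fetch_status(client, 'https://example.org'),
        )
    return jsonify(results)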
If you are using Redis, you can use its Pub/Sub mechanism to handle background tasks.
See more: https://redis.com/ebook/part-2-core-concepts/chapter-3-commands-in-redis/3-6-publishsubscribe/
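A very rough sketch of that idea with the redis-py client, where the view publishes a job and a separate worker process subscribes and does the heavy work. The channel name and payload shape are made up, and note that plain Pub/Sub does not persist messages, so the worker must already be listening:
import json

import redis

r = redis.Redis()  # assumes Redis on localhost:6379


def enqueue_render(text_list):
    # called from the Flask view: publish the job and return immediately
    r.publish('render-jobs', json.dumps({'text_list': text_list}))


def worker_loop():
    # run in a separate worker process: subscribe and handle jobs as they arrive
    pubsub = r.pubsub()
    pubsub.subscribe('render-jobs')
    for message in pubsub.listen():
        if message['type'] == 'message':
            job = json.loads(message['data'])
            # ... do the heavy work here, e.g. audio_class.render_audio(data=job['text_list'])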

How to get httpx.gather() with return_exceptions=True to complete the Queue of tasks when the exception count exceeds the worker count?

I'm using asyncio in concert with the httpx.AsyncClient for the first time and trying to figure out how to complete my list of tasks when some number of them may fail. I'm using a pattern I found in a few places where I populate an asyncio Queue with coroutine functions, and have a set of workers process that queue from inside asyncio.gather. Normally, if the function doing the work raises an exception, you'll see the whole script just fail during that processing, and report the exception along with a RuntimeWarning: coroutine foo was never awaited, indicating that you never finished your list.
I found the return_exceptions option for asyncio.gather, and that has helped, but not completely. My script will still die after I've hit the exception the same number of times as the total number of workers I've passed to gather. The following is a simple script that demonstrates the problem.
from httpx import AsyncClient, Timeout
from asyncio import run, gather, Queue as asyncio_Queue
from random import choice


async def process_url(client, url):
    """
    opens the URL and pulls a header attribute
    randomly raises an exception to demonstrate my problem
    """
    if choice([True, False]):
        await client.get(url)
        print(f'retrieved url {url}')
    else:
        raise AssertionError(f'generated error for url {url}')


async def main(worker_count, urls):
    """
    orchestrates the workers that call process_url
    """
    httpx_timeout = Timeout(10.0, read=20.0)
    async with AsyncClient(timeout=httpx_timeout, follow_redirects=True) as client:
        tasks = asyncio_Queue(maxsize=0)
        for url in urls:
            await tasks.put(process_url(client, url))

        async def worker():
            while not tasks.empty():
                await tasks.get_nowait()

        results = await gather(*[worker() for _ in range(worker_count)], return_exceptions=True)
        return results


if __name__ == '__main__':
    urls = ['https://stackoverflow.com/questions',
            'https://stackoverflow.com/jobs',
            'https://stackoverflow.com/tags',
            'https://stackoverflow.com/users',
            'https://www.google.com/',
            'https://www.bing.com/',
            'https://www.yahoo.com/',
            'https://www.foxnews.com/',
            'https://www.cnn.com/',
            'https://www.npr.org/',
            'https://www.opera.com/',
            'https://www.mozilla.org/en-US/firefox/',
            'https://www.google.com/chrome/',
            'https://www.epicbrowser.com/'
            ]
    print(f'processing {len(urls)} urls')
    run_results = run(main(4, urls))
    print('\n'.join([str(rr) for rr in run_results]))
one run of this script outputs:
processing 14 urls
retrieved url https://stackoverflow.com/tags
retrieved url https://stackoverflow.com/jobs
retrieved url https://stackoverflow.com/users
retrieved url https://www.bing.com/
generated error for url https://stackoverflow.com/questions
generated error for url https://www.foxnews.com/
generated error for url https://www.google.com/
generated error for url https://www.yahoo.com/
sys:1: RuntimeWarning: coroutine 'process_url' was never awaited
Process finished with exit code 0
Here you see that we got through 8 of the total 14 urls, but by the time we reached 4 errors, the script wrapped up and ignored the rest of the urls.
What I want to do is have the script complete the full set of urls, but inform me of the errors at the end. Is there a way to do this here? It may be that I'll have to wrap everything in process_url() inside a try/except block and use something like aiofile to dump them out in the end?
Update
To be clear, this demo script is a simplification of what I'm really doing. My real script is hitting a small number of server api endpoints a few hundred thousand times. The purpose of using the set of workers is to avoid overwhelming the server I'm hitting [it's a test server, not production, so it's not intended to handle huge volumes of requests, though the number is greater than 4 8-)]. I'm open to learning about alternatives.
The program design you have outlined should work OK, but you must prevent the tasks (instances of your worker function) from crashing. The below listing shows one way to do that.
Your Queue is named "tasks" but the items you place in it aren't tasks - they are coroutines. As it stands, your program has five tasks: one of them is the main function, which is made into a task by asyncio.run(). The other four tasks are instances of worker, which are made into tasks by asyncio.gather.
When worker awaits on a coroutine and that coroutine crashes, the exception is propagated into worker at the await statement. Because the exception isn't handled, worker will crash in turn. To prevent that, do something like this:
async def worker():
    while not tasks.empty():
        try:
            await tasks.get_nowait()
        except Exception:
            pass
            # You might want to do something more intelligent here
            # (logging, perhaps), rather than simply suppressing the exception
This should allow your example program to run to completion.
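If the goal is to finish every URL and still see which ones failed at the end, one option (a sketch building on the listing above, not the only way) is to queue (url, coroutine) pairs and collect the exceptions instead of suppressing them:
async def main(worker_count, urls):
    httpx_timeout = Timeout(10.0, read=20.0)
    errors = []
    async with AsyncClient(timeout=httpx_timeout, follow_redirects=True) as client:
        tasks = asyncio_Queue(maxsize=0)
        for url in urls:
            # queue (url, coroutine) pairs so failures can be reported per URL
            await tasks.put((url, process_url(client, url)))

        async def worker():
            while not tasks.empty():
                url, coro = tasks.get_nowait()
                try:
                    await coro
                except Exception as exc:
                    errors.append((url, exc))  # record the failure and keep going

        await gather(*[worker() for _ in range(worker_count)])
    return errors
main() then returns a list of (url, exception) pairs that can be printed after run() completes, and return_exceptions is no longer needed because the workers themselves never raise.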

Keep Websockets connection open for incoming requests

I have a Flask server that accepts HTTP requests from a client. This HTTP server needs to delegate work to a third-party server using a websocket connection (for performance reasons).
I find it hard to wrap my head around how to create a permanent websocket connection that can stay open for HTTP requests. Sending requests to the websocket server in a run-once script works fine and looks like this:
async def send(websocket, payload):
    await websocket.send(json.dumps(payload).encode("utf-8"))

async def recv(websocket):
    data = await websocket.recv()
    return json.loads(data)

async def main(payload):
    uri = f"wss://the-third-party-server.com/xyz"
    async with websockets.connect(uri) as websocket:
        future = send(websocket, payload)
        future_r = recv(websocket)
        _, output = await asyncio.gather(future, future_r)
        return output

asyncio.get_event_loop().run_until_complete(main({...}))
Here, main() establishes a WSS connection and closes it when done, but how can I keep that connection open for incoming HTTP requests, so that I can call main() for each of those without re-establishing the WSS connection?
The main problem is that when you write a web app that responds to HTTP(S), your code has a "life cycle" that is very peculiar: usually you have a "view" function that gets the request data, performs all the actions needed to gather the response data, and returns it.
In most web frameworks this "view" function has to be independent from the rest of the system: it should be able to perform its duty relying on no data or objects other than what it gets when called, namely the request data and the system configuration. This independence is what lets the application server (the framework parts that actually connect your program to the internet) choose among a variety of ways to serve your program: it may run your view function in several parallel threads, in several parallel processes, or even in different processes spread across containers or physical servers; your application does not need to care about that.
If you want a resource that is available across calls to your view functions, you need to break out of this paradigm. For example, frameworks typically create a pool of database connections, so that views running in the same process can reuse those connections. These connections are usually supplied by the framework itself, which implements a mechanism that allows them to be reused and made available transparently, as needed. You have to recreate a mechanism of the same sort if you want to keep a websocket connection alive.
In a certain way, you need a Python object that can mediate your websocket data and behave like a "server" for your web view functions.
That is simpler to do than it sounds: a special Python class designed to have a single instance per process, which keeps the connection open and is able to send and receive data from parallel calls without mangling it, is enough. A callable that ensures this instance exists in the current process is enough to work under any strategy configured to serve your app to the web.
If you are using Flask, which does not use asyncio, there is a further complication: you lose the async ability inside your views, so they will have to wait for the websocket request to complete; it is then the job of your application server to run your views in different threads or processes to ensure availability. And it is your job to run the asyncio loop for your websocket in a separate thread, so that it can make the requests it needs.
Here is some example code.
Please note that, apart from using a single websocket per process, this has no provisions for failures of any kind. Most importantly, it does nothing in parallel: all send/recv pairs are blocking, as you give no clue of a mechanism that would allow pairing each outgoing message with its response.
import asyncio
import json
import threading
from queue import Queue

import websockets


class AWebSocket:
    instance = None

    def __new__(cls, *args, **kw):
        if cls.instance:
            return cls.instance
        return super().__new__(cls)

    def __init__(self, *args, **kw):
        cls = self.__class__
        if cls.instance:
            # init will be called even if new finds the existing instance,
            # so we have to check again
            return
        cls.instance = self  # remember the singleton so later calls reuse it
        self.outgoing = Queue()
        self.responses = Queue()
        self.socket_thread = threading.Thread(target=self.start_socket)
        self.socket_thread.start()

    def start_socket(self):
        # starts an async loop in a separate thread, and keeps
        # the web socket running in this separate thread
        loop = asyncio.new_event_loop()
        asyncio.set_event_loop(loop)
        loop.run_until_complete(self.core())

    async def _send(self, websocket, payload):
        await websocket.send(json.dumps(payload).encode("utf-8"))

    async def _recv(self, websocket):
        data = await websocket.recv()
        return json.loads(data)

    async def core(self):
        uri = f"wss://the-third-party-server.com/xyz"
        async with websockets.connect(uri) as websocket:
            self.websocket = websocket
            while True:
                # This code is as you wrote it:
                # it essentially blocks until a message is sent
                # and the answer is received back.
                # You have to have a mechanism in your websocket
                # messages allowing you to identify the corresponding
                # answer to each request. On doing so, this is trivially
                # parallelizable simply by calling asyncio.create_task
                # instead of awaiting on asyncio.gather
                payload = self.outgoing.get()
                future = self._send(websocket, payload)
                future_r = self._recv(websocket)
                _, response = await asyncio.gather(future, future_r)
                self.responses.put(response)

    def send(self, payload):
        # This is the method you call from your views
        # simply do:
        # `output = AWebSocket().send(payload)`
        self.outgoing.put(payload)
        return self.responses.get()
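Calling this from a Flask view then looks roughly like the sketch below; the route and payload shape are illustrative:
from flask import Flask, jsonify, request

app = Flask(__name__)


@app.route('/delegate', methods=['POST'])
def delegate():
    payload = request.get_json()
    # AWebSocket() always returns the single per-process instance,
    # so the websocket connection is opened once and then reused
    output = AWebSocket().send(payload)
    return jsonify(output)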

Django saving progress to the Session in subscript

So I'm wondering how this is done right.
I'm trying to save the progress of a long-running task inside the request.session object, and then be able to get the status of the process from another view method.
I'm using the Pool class to make my long-running task asynchronous:
MyCalculation.py

def longrunning(x, request):
    request.session['status'] = 5
    return x * x

views.py

def dolongrunning(request, x):
    pool = Pool(processes=1)
    result = pool.apply_async(MyCalculation.longrunning, [x, request])
    return JsonResponse(..)

def status(request):
    return JsonResponse(request.session.get('status'))
This doesn't work: my async job does execute, but the request object doesn't receive my progress information.
How could I accomplish that, or is there another way?
I have the feeling that passing the request object is a bad idea in general.
What would be a good practice to store the status of a long-running operation in Django/Python?
Different processes do not share the same memory space; each one gets its own copy.
In your case, the request object received by the worker process in the longrunning function is a copy of the one created in the parent process. Changes done in one of the processes do not affect the others.
What you want to do is send updates from the worker process to the parent one and then, within the parent, update the request status.
from multiprocessing import Pool, Queue


def worker(task, message_queue):  # longrunning
    # do something
    message_queue.put(5)
    # do something else
    message_queue.put(42)


def request_handler(request, task, message_queue):  # dolongrunning
    result = pool.apply_async(worker, [task, message_queue])
    return JsonResponse(..)


def status(request):
    status = message_queue.get()  # this is blocking if no messages in queue
    request.session['status'] = status
    return JsonResponse(request.session['status'])


pool = Pool(processes=1)
message_queue = Queue()
This is quite simplified, and it actually blocks on status requests if no status has been set, but it gives an idea.
A better way would be to store the status updates in a buffer and keep the message queue drained with a thread: each time a status request is received, the last status update received from the workers would be returned.
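A rough sketch of that buffering idea, assuming the workers put (task_id, status) tuples on the queue and that the status view receives a task_id from the URL:
import threading

from django.http import JsonResponse

latest_status = {}  # task_id -> last status update reported by a worker


def drain_queue(message_queue):
    # runs forever in a background thread, keeping the queue empty
    while True:
        task_id, status_update = message_queue.get()  # blocks until a worker reports progress
        latest_status[task_id] = status_update


threading.Thread(target=drain_queue, args=(message_queue,), daemon=True).start()


def status(request, task_id):
    # never blocks: just return whatever the workers reported last
    return JsonResponse({'status': latest_status.get(task_id)})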

How to create managers for the worker threads?

The code below works fine for a single "manager", which basically launches some HTTP GETs to a server. But I've hit a brick wall.
How do I now create two managers, each with its own Download_Dashlet_Job object and tcp_pool_object? In essence, the managers would be commanding their own workers on two separate jobs. This seems to be a really good puzzle for learning Python classes.
import workerpool
from urllib3 import HTTPConnectionPool


class Download_Dashlet_Job(workerpool.Job):
    def __init__(self, url):
        self.url = url

    def run(self):
        request = tcp_pool_object.request('GET', self.url, headers=headers)


tcp_pool_object = HTTPConnectionPool('M_Server', port=8080, timeout=None, maxsize=3, block=True)
dashlet_thread_worker_pool_object = workerpool.WorkerPool(size=100)

# this section emulates a single manager calling 6 threads from the pool but limited to 3 TCP sockets by tcp_pool_object
for url in open("overview_urls.txt"):
    job_object = Download_Dashlet_Job(url.strip())
    dashlet_thread_worker_pool_object.put(job_object)

dashlet_thread_worker_pool_object.shutdown()
dashlet_thread_worker_pool_object.wait()
First, workerpool.WorkerPool(size=100) creates 100 worker threads. In the comment in your code you say you want 6 threads, so you need to change that to 6.
To create a second manager, you need to create another pool. You can also create another job class and just add this different type of job to the same pool, if you prefer.
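One way to get two managers is to wrap the connection pool, the worker pool, and the job into a small class, so each manager owns its own copies. A sketch along those lines (the second server name and URL file are placeholders, and the undefined headers argument from your job is dropped):
import workerpool
from urllib3 import HTTPConnectionPool


class Download_Dashlet_Job(workerpool.Job):
    def __init__(self, tcp_pool, url):
        self.tcp_pool = tcp_pool  # each job carries the connection pool of its own manager
        self.url = url

    def run(self):
        self.tcp_pool.request('GET', self.url)


class Manager:
    def __init__(self, host, port, threads=6, sockets=3):
        # each manager owns its own TCP connection pool and its own thread pool
        self.tcp_pool = HTTPConnectionPool(host, port=port, timeout=None,
                                           maxsize=sockets, block=True)
        self.thread_pool = workerpool.WorkerPool(size=threads)

    def download(self, url_file):
        with open(url_file) as urls:
            for url in urls:
                self.thread_pool.put(Download_Dashlet_Job(self.tcp_pool, url.strip()))

    def finish(self):
        self.thread_pool.shutdown()
        self.thread_pool.wait()


# two independent managers, each commanding its own workers on its own job
manager_a = Manager('M_Server', 8080)
manager_b = Manager('Another_Server', 8080)  # placeholder host
manager_a.download('overview_urls.txt')
manager_b.download('other_urls.txt')         # placeholder file
manager_a.finish()
manager_b.finish()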
