How to make a Python multi-threading scraper a websocket server

I currently have a Python multi-threading scraper that looks like the following. How can I turn it into a websocket server so that it can push a message whenever the scraper finds new data? Most websocket implementations I found are asynchronous and built on asyncio, and I don't want to rewrite all of my code for that.
Are there any websocket implementations I can use without rewriting my multi-threading code as async? Thanks!
import time
import requests
from threading import Thread

class Scraper:
    def _send_request(self):
        response = requests.get(url)
        if response.status_code == 200 and 'good' in response.json():
            send_ws_message()  # this is what I want the websocket server to do

    def scrape(self):
        while True:
            Thread(target=self._send_request).start()
            time.sleep(1)

scraper = Scraper()
scraper.scrape()
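One possible approach, sketched below under the assumption that you use the third-party websockets package: run its asyncio server in a single background thread and push messages into it from your scraper threads with asyncio.run_coroutine_threadsafe, so the rest of the threaded code stays as it is. The names CONNECTED, broadcast and the port are illustrative, not from the question.

import asyncio
import threading
import websockets

CONNECTED = set()

async def handler(ws):
    # One-argument handler signature, as used by recent versions of websockets.
    # Track connected clients so the scraper can broadcast to them.
    CONNECTED.add(ws)
    try:
        await ws.wait_closed()
    finally:
        CONNECTED.discard(ws)

async def broadcast(message):
    for ws in set(CONNECTED):
        await ws.send(message)

async def serve_forever():
    async with websockets.serve(handler, "localhost", 8765):
        await asyncio.Future()  # run until the process exits

def start_ws_loop():
    loop = asyncio.new_event_loop()
    threading.Thread(target=loop.run_until_complete,
                     args=(serve_forever(),), daemon=True).start()
    return loop

WS_LOOP = start_ws_loop()

def send_ws_message(message):
    # Safe to call from any scraper thread.
    asyncio.run_coroutine_threadsafe(broadcast(message), WS_LOOP)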

Related

How to fire and forget an HTTP request?

Is it possible to fire a request and not wait for the response at all?
For Python, most internet searches result in
asynchronous-requests-with-python-requests
grequests
requests-futures
However, all of the above solutions spawn a new thread and wait for the response on that thread. Is it possible not to wait for any response at all, anywhere?
You can run your thread as a daemon; see the code below. If you comment out the line t.daemon = True, the program will wait for the thread to finish before exiting. With daemon set to True, it simply exits. You can try it with the example below.
import requests
import threading
import time

def get_thread():
    g = requests.get("http://www.google.com")
    time.sleep(2)
    print(g.text[0:100])

if __name__ == '__main__':
    t = threading.Thread(target=get_thread)
    t.daemon = True  # Try commenting this out, running it, and see the difference
    t.start()
    print("Done")
I don't really know what you are trying to achieve by just firing an HTTP request, so I will list some use cases I can think of.
Ignoring the result
If the only thing you want is for your program to feel like it never stalls while making a request, you can use a library like aiohttp to issue the request without awaiting the response.
import aiohttp
import asyncio

async def main():
    async with aiohttp.ClientSession() as session:
        # Fire the request as a background task rather than awaiting it; a bare
        # session.get(...) call only creates a coroutine and never runs. The
        # session must stay open long enough for the request to be sent.
        asyncio.ensure_future(session.get('http://python.org'))
        await asyncio.sleep(0)  # yield once so the task can start

loop = asyncio.get_event_loop()
loop.run_until_complete(main())
But how can you know that the request was made successfully if you don't check anything?
Ignoring the body
Maybe you want to be very performant and are worried about losing time reading the body. In that case you can just fire the request, check the status code and then close the connection.
import http.client

def make_request(url="yahoo.com", timeout=50):
    conn = http.client.HTTPConnection(url, timeout=timeout)
    conn.request("GET", "/")
    res = conn.getresponse()
    print(res.status)
    conn.close()
If you close the connection as I did above, you won't be able to reuse it.
The right way
I would recommend awaiting the asynchronous calls with aiohttp so you can add the necessary logic without having to block.
But if you are looking for raw performance, a custom solution built on http.client may be necessary. You could also consider very small requests/responses, short timeouts, and compression in your client and server.
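A minimal sketch of that "await and check" approach, assuming aiohttp; the URL and the bare status check are illustrative:

import asyncio
import aiohttp

async def fire(url):
    async with aiohttp.ClientSession() as session:
        async with session.get(url) as resp:
            # Await only the status line and headers; skip reading the body.
            return resp.status

status = asyncio.run(fire("http://python.org"))
print(status)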

Get url without blocking the function

I'm trying to get a resource from a URL inside a route on a web server without blocking, since fetching it sometimes takes 11+ seconds.
I switched from Flask to aiohttp for this.
import urllib.request

async def process(request):
    data = await request.json()
    req = urllib.request.Request(
        data["resource_url"],
        data=None,
        headers=hdrs
    )
    # Do processing on the resource
But I'm not sure how to make the call, and would it allow other calls to be made to this route while the resource is being fetched?
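A minimal sketch of one possible approach: fetch the resource with aiohttp's own client instead of urllib, so the handler yields to the event loop while it waits and other requests to the route keep being served. The route path and response shape below are illustrative, not from the question.

import aiohttp
from aiohttp import web

async def process(request):
    data = await request.json()
    async with aiohttp.ClientSession() as session:
        async with session.get(data["resource_url"]) as resp:
            body = await resp.read()
    # Do processing on the resource; other requests to this route are
    # handled while the awaits above are pending.
    return web.json_response({"size": len(body)})

app = web.Application()
app.add_routes([web.post("/process", process)])
web.run_app(app)  # starts the aiohttp server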

How to open different urls at the same time using python selenium?

Is it possible to open 50 different urls at the same time using Selenium with Python?
Is this possible using threading?
If so, how would I go about doing this?
If not, then what would be a good method to do so?
You can try the following to open 50 URLs one by one, each in a new tab:
urls = ["http://first.com", "http://second.com", ...]
for url in urls:
    driver.execute_script('window.open("%s")' % url)
You can use Celery (a distributed task queue) to open all these urls.
Or you can use async/await with aiohttp on Python >= 3.5, which runs a single thread in a single process but works concurrently (it uses the wait time on one URL to fetch the others).
Here is a code sample for that; the event loop takes care of scheduling these concurrent tasks.
#!/usr/local/bin/python3.5
import asyncio
from aiohttp import ClientSession

async def hello(url):
    async with ClientSession() as session:
        async with session.get(url) as response:
            response = await response.read()
            print(response)

loop = asyncio.get_event_loop()
loop.run_until_complete(hello("http://httpbin.org/headers"))
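A sketch of extending this to several URLs with asyncio.gather (the URL list is illustrative):

import asyncio
from aiohttp import ClientSession

async def fetch(session, url):
    async with session.get(url) as response:
        return await response.read()

async def fetch_all(urls):
    async with ClientSession() as session:
        # Schedule all fetches at once; gather returns when they all finish.
        return await asyncio.gather(*(fetch(session, u) for u in urls))

urls = ["http://httpbin.org/headers", "http://httpbin.org/ip"]
results = asyncio.get_event_loop().run_until_complete(fetch_all(urls))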
Well, opening 50 urls at the same time seems unreasonable and will require a LOT of processing, but it is possible. However, I would recommend a form of iteration, opening one url at a time, 50 times.
from selenium import webdriver

urls = ['list of urls here', '2nd url', ...]
driver = webdriver.Firefox()
for url in urls:
    driver.get(url)
    ...  # rest of your code
driver.quit()
But you can also make one driver.get(url) per url you want, using different drivers, or tabs. It will require a lot of processing, though.
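A rough sketch of the "one driver per url, one thread each" variant mentioned above (resource heavy, as noted; the urls are illustrative):

from threading import Thread
from selenium import webdriver

def open_url(url):
    driver = webdriver.Firefox()
    driver.get(url)
    # ...rest of your per-page code...
    driver.quit()

urls = ["http://first.com", "http://second.com"]
threads = [Thread(target=open_url, args=(u,)) for u in urls]
for t in threads:
    t.start()
for t in threads:
    t.join()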

The tasks from asyncio.gather do not work concurrently

I want to scrape data from a website concurrently, but I found that the following program is NOT executed concurrently.
import asyncio
import requests
from bs4 import BeautifulSoup

async def return_soup(url):
    r = requests.get(url)
    r.encoding = "utf-8"
    soup = BeautifulSoup(r.text, "html.parser")

    future = asyncio.Future()
    future.set_result(soup)
    return future

async def parseURL_async(url):
    print("Started to download {0}".format(url))
    soup = await return_soup(url)
    print("Finished downloading {0}".format(url))
    return soup

loop = asyncio.new_event_loop()
asyncio.set_event_loop(loop)

t = [parseURL_async(url_1), parseURL_async(url_2)]
loop.run_until_complete(asyncio.gather(*t))
However, this program starts downloading the second page only after the first one finishes. If my understanding is correct, the await keyword on await return_soup(url) waits for the function to complete, and while waiting it returns control to the event loop, which should let the loop start the second download.
And once the function finally finishes execution, the future instance within it gets the result value.
But why does this not work concurrently? What am I missing here?
Using asyncio is different from using threads in that you cannot add it to an existing code base to make it concurrent. Specifically, code that runs in the asyncio event loop must not block - all blocking calls must be replaced with non-blocking versions that yield control to the event loop. In your case, requests.get blocks and defeats the parallelism implemented by asyncio.
To avoid this problem, you need to use an http library that is written with asyncio in mind, such as aiohttp.
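A minimal sketch of that fix, assuming aiohttp (url_1 and url_2 come from the question and are assumed to be defined):

import asyncio
import aiohttp
from bs4 import BeautifulSoup

async def return_soup(session, url):
    async with session.get(url) as resp:
        text = await resp.text(encoding="utf-8")
    return BeautifulSoup(text, "html.parser")

async def main():
    async with aiohttp.ClientSession() as session:
        # Both downloads now yield to the event loop while waiting on I/O.
        return await asyncio.gather(
            return_soup(session, url_1),
            return_soup(session, url_2),
        )

soups = asyncio.get_event_loop().run_until_complete(main())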
I'll add a little more to user4815162342's response. The asyncio framework uses coroutines that must cede control of the thread while they do the long operation. See the diagram at the end of this section for a nice graphical representation. As user4815162342 mentioned, the requests library doesn't support asyncio. I know of two ways to make this work concurrently. The first is to do what user4815162342 suggested and switch to a library with native support for asynchronous requests. The second is to run this synchronous code in separate threads or processes. The latter is easy because of the run_in_executor function.
import asyncio
import requests
from bs4 import BeautifulSoup

loop = asyncio.get_event_loop()

async def return_soup(url):
    r = await loop.run_in_executor(None, requests.get, url)
    r.encoding = "utf-8"
    return BeautifulSoup(r.text, "html.parser")

async def parseURL_async(url):
    print("Started to download {0}".format(url))
    soup = await return_soup(url)
    print("Finished downloading {0}".format(url))
    return soup

t = [parseURL_async(url_1), parseURL_async(url_2)]
loop.run_until_complete(asyncio.gather(*t))
This solution removes some of the benefit of using asyncio, as the long operation will still probably be executed from a fixed size thread pool, but it's also much easier to start with.
The reason, as mentioned in other answers, is the lack of library support for coroutines.
As of Python 3.9, though, you can use the function asyncio.to_thread as an alternative for I/O concurrency.
Obviously this is not exactly equivalent, because as the name suggests it runs your functions in separate threads as opposed to a single thread in the event loop, but it can be a way to achieve I/O concurrency without relying on proper async support from the library.
In your example the code would be:
import asyncio
import requests
from bs4 import BeautifulSoup

def return_soup(url):
    r = requests.get(url)
    r.encoding = "utf-8"
    return BeautifulSoup(r.text, "html.parser")

def parseURL_async(url):
    print("Started to download {0}".format(url))
    soup = return_soup(url)
    print("Finished downloading {0}".format(url))
    return soup

async def main():
    result_url_1, result_url_2 = await asyncio.gather(
        asyncio.to_thread(parseURL_async, url_1),
        asyncio.to_thread(parseURL_async, url_2),
    )

asyncio.run(main())

Filling the list with responses and processing them at the same time

I'm trying to download some products from a web page. This web page (according to its robots.txt) allows me to send 2000 requests per minute. The problem is that sending the requests sequentially and then processing the responses takes too much time.
I've realised that the method which sends the request can be moved into a pool, which is much faster, probably because the processor doesn't have to wait for a response and can send the next request in the meantime.
So I have a pool, and the responses are appended to the list RESPONSES.
Simple code:
from multiprocessing.pool import ThreadPool as Pool
import requests

RESPONSES = []

with open('products.txt') as f:
    LINES = f.readlines()[:100]

def post_request(url):
    html = requests.get(url).content
    RESPONSES.append(html)

def parse_html_return_object(resp):
    # some code here
    pass

def insert_object_into_database():
    pass

pool = Pool(100)
for line in LINES:
    pool.apply_async(post_request, args=(line[:-1],))

pool.close()
pool.join()
What I want is to process those RESPONSES (HTMLs) as they come in: pop responses from the RESPONSES list and parse them while the requests are still being made.
So it could be like this (Time -->):
post_request(line1)->post_request(line2)->Response_line1->parse_html_return_object(response)->post_request...
Is there some simple way to do that?
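A minimal sketch of one way to do this (not a confirmed answer from the thread): use the pool's imap_unordered so each response is handed to the parser as soon as it arrives, while the remaining requests are still in flight. The function names mirror the question's placeholders.

from multiprocessing.pool import ThreadPool as Pool
import requests

def post_request(url):
    return requests.get(url).content

def parse_html_return_object(html):
    pass  # some code here

with open('products.txt') as f:
    LINES = [line.strip() for line in f.readlines()[:100]]

with Pool(100) as pool:
    # Results are yielded in completion order, so parsing overlaps
    # with the downloads that are still running.
    for html in pool.imap_unordered(post_request, LINES):
        obj = parse_html_return_object(html)
        # insert_object_into_database(obj)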
