How can I use multithreading with requests? - python

Hello, I have this Python code which uses the requests module:
import requests
url1 = "myurl1" # I do not remember exactly the exact url
reponse1 = requests.get(url1)
temperature1 = reponse1.json()["temperature"]
url2 = "myurl2" # I do not remember exactly the exact url
reponse2 = requests.get(url2)
temperature2 = reponse2.json()["temp"]
url3 = "myurl3" # I do not remember exactly the exact url
reponse3 = requests.get(url3)
temperature3 = reponse3.json()[0]
print(temperature1)
print(temperature2)
print(temperature3)
And I have to tell you, this is a little bit slow... Do you have a solution to improve the speed of my code? I thought about using multithreading, but I don't know how to use it...
Thank you very much!

Try Python's concurrent.futures executors:
import requests
from concurrent.futures import ThreadPoolExecutor, as_completed
from multiprocessing import cpu_count

urls = ['/url1', '/url2', '/url3']

with ThreadPoolExecutor(max_workers=2 * cpu_count()) as executor:
    future_to_url = {executor.submit(requests.get, url): url for url in urls}
    for future in as_completed(future_to_url):
        response = future.result()  # TODO: handle exceptions here
        url = future_to_url[future]
        # TODO: do something with that data
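Applied to the question's three temperature endpoints, a minimal sketch might look like this (the URLs are the placeholders from the question, the differing JSON keys come from the original code, and fetch_temperature is just an illustrative helper name):
import requests
from concurrent.futures import ThreadPoolExecutor

# Illustrative helper; URL placeholders and JSON keys come from the question.
def fetch_temperature(url, key):
    response = requests.get(url, timeout=10)
    response.raise_for_status()
    # key is either a dict key ("temperature", "temp") or a list index (0)
    return response.json()[key]

jobs = [("myurl1", "temperature"), ("myurl2", "temp"), ("myurl3", 0)]

with ThreadPoolExecutor(max_workers=len(jobs)) as executor:
    futures = [executor.submit(fetch_temperature, url, key) for url, key in jobs]
    for future in futures:
        print(future.result())  # re-raises here if the request failed
The three requests run concurrently, so the total time is roughly that of the slowest request rather than the sum of all three.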

Related

Web scraping with urllib.request - data is not refreshing

I am trying to read a table on a website. The first (initial) read is correct, but the subsequent requests in the loop are out of date (the information doesn't change even though the website changes). Any suggestions?
The link shown in the code is not the actual website that I am looking at. Also, I am going through a proxy server.
I don't get an error, just out of date information.
Here is my code:
import time
import urllib.request
from pprint import pprint
from html_table_parser.parser import HTMLTableParser
import pandas as pd
def url_get_contents(url):
    # making a request to the website
    req = urllib.request.Request(url=url)
    f = urllib.request.urlopen(req)
    return f.read()

link = 'https://www.w3schools.com/html/html_tables.asp'
xhtml = url_get_contents(link).decode('utf-8')
p = HTMLTableParser()
p.feed(xhtml)
stored_page = p.tables[0]

while True:
    try:
        xhtml = url_get_contents(link).decode('utf-8')
        p = HTMLTableParser()
        p.feed(xhtml)
        print('now: ', p.tables[0])
        time.sleep(120)
        continue
    # To handle exceptions
    except Exception as e:
        print("error")

threading: function seems to run as a blocking loop although I am using threading

I am trying to speed up web scraping by running my http requests in a ThreadPoolExecutor from the concurrent.futures library.
Here is the code:
import concurrent.futures
import requests
from bs4 import BeautifulSoup
urls = [
    'https://www.interactivebrokers.eu/en/index.php?f=41295&exch=ibfxcfd&showcategories=CFD',
    'https://www.interactivebrokers.eu/en/index.php?f=41634&exch=chix_ca',
    'https://www.interactivebrokers.eu/en/index.php?f=41634&exch=tase',
    'https://www.interactivebrokers.eu/en/index.php?f=41295&exch=chixen-be&showcategories=STK',
    'https://www.interactivebrokers.eu/en/index.php?f=41295&exch=bvme&showcategories=STK'
]

def get_url(url):
    print(url)
    r = requests.get(url)
    soup = BeautifulSoup(r.text, 'lxml')
    a = soup.select_one('a')
    print(a)

with concurrent.futures.ThreadPoolExecutor(max_workers=12) as executor:
    results = {executor.submit( get_url(url)) : url for url in urls}
    for future in concurrent.futures.as_completed(results):
        try:
            pass
        except Exception as exc:
            print('ERROR for symbol:', results[future])
            print(exc)
However, when looking at how the script prints in the CLI, it seems that the requests are sent in a blocking loop.
Additionally, if I run the code using the plain loop below, I can see that it takes roughly the same time.
for u in urls:
    get_url(u)
I have had some success implementing concurrency with this library before, and I am at a loss regarding what is going wrong here.
I am aware of the existence of the asyncio library as an alternative, but I would be keen on using threading instead.
You're not actually running your get_url calls as tasks; you call them in the main thread, and pass the result to executor.submit, experiencing the concurrent.futures analog to this problem with raw threading.Thread usage. Change:
results = {executor.submit( get_url(url)) : url for url in urls}
to:
results = {executor.submit(get_url, url) : url for url in urls}
so you pass the function to call and its arguments to the submit call (which then runs them in threads for you) and it should parallelize your code.
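As a quick sketch of the corrected pattern end to end (assuming get_url is modified to return the selected tag instead of only printing it, which the question's version does not do):
# Sketch of the corrected submit pattern; assumes get_url returns soup.select_one('a').
with concurrent.futures.ThreadPoolExecutor(max_workers=12) as executor:
    results = {executor.submit(get_url, url): url for url in urls}
    for future in concurrent.futures.as_completed(results):
        try:
            first_link = future.result()
            print(results[future], '->', first_link)
        except Exception as exc:
            print('ERROR for symbol:', results[future])
            print(exc)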

Python program call with Tornado POST request blocks session till end of the Python program

This program takes multiple URLs as input and is called via the URL localhost:8888/api/v1/crawler.
The program takes 1+ hour to run, which is OK, but it blocks other APIs.
While it is running, no other API will work until the existing call ends, so I want to run this program asynchronously. How can I achieve that with the same program?
@tornado.web.asynchronous
@gen.coroutine
@use_args(OrgTypeSchema)
def post(self, args):
    print "Enter In Crawler Match Script POST"
    print "Argsssss........"
    print args
    data = tornado.escape.json_decode(self.request.body)
    print "Data................"
    import json
    print json.dumps(data.get('urls'))
    from urllib import urlopen
    from bs4 import BeautifulSoup
    try:
        urls = json.dumps(data.get('urls'));
        urls = urls.split()
        import sys
        list = [];
        # orig_stdout = sys.stdout
        # f = open('out.txt', 'w')
        # sys.stdout = f
        for url in urls:
            # print "FOFOFOFOFFOFO"
            # print url
            url = url.replace('"', " ")
            url = url.replace('[', " ")
            url = url.replace(']', " ")
            url = url.replace(',', " ")
            print "Final Url "
            print url
            try:
                site = urlopen(url) ..............
Your post method is 100% synchronous. You should make the site = urlopen(url) call async. There is an async HTTP client in Tornado for that; there is also a good example here.
You are using urllib which is the reason for blocking.
Tornado provides a non-blocking client called AsyncHTTPClient, which is what you should be using.
Use it like this:
from tornado.httpclient import AsyncHTTPClient

@gen.coroutine
@use_args(OrgTypeSchema)
def post(self, args):
    ...
    http_client = AsyncHTTPClient()
    site = yield http_client.fetch(url)
    ...
Another thing that I'd like to point out is: don't import modules from inside a function. Although it's not the reason for the blocking, it is still slower than putting all your imports at the top of the file. Read this question.
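Putting the pieces together, a minimal sketch of the handler might look like the following; the decorators and OrgTypeSchema come from the question, while the list handling and the final self.write are assumptions for illustration:
from tornado import gen
from tornado.httpclient import AsyncHTTPClient

@gen.coroutine
@use_args(OrgTypeSchema)
def post(self, args):
    data = tornado.escape.json_decode(self.request.body)
    urls = data.get('urls', [])  # assumes 'urls' is already a JSON list
    http_client = AsyncHTTPClient()
    bodies = []
    for url in urls:
        # yield suspends this coroutine without blocking the IOLoop,
        # so other API calls keep being served while the fetch runs
        response = yield http_client.fetch(url)
        bodies.append(response.body)
    self.write({'fetched': len(bodies)})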

Fastest way to read and process 100,000 URLs in Python

I have a file with 100,000 URLs that I need to request and then process. The processing takes a non-negligible amount of time compared to the request, so simply using multithreading seems to give me only a partial speed-up. From what I have read, I think using the multiprocessing module, or something similar, would offer a more substantial speed-up because I could use multiple cores. I'm guessing I want to use multiple processes, each with multiple threads, but I'm not sure how to do that.
Here is my current code, using threading (based on What is the fastest way to send 100,000 HTTP requests in Python?):
from threading import Thread
from Queue import Queue
import requests
from bs4 import BeautifulSoup
import sys
concurrent = 100
def worker():
    while True:
        url = q.get()
        html = get_html(url)
        process_html(html)
        q.task_done()

def get_html(url):
    try:
        html = requests.get(url, timeout=5, headers={'Connection':'close'}).text
        return html
    except:
        print "error", url
        return None

def process_html(html):
    if html == None:
        return
    soup = BeautifulSoup(html)
    text = soup.get_text()
    # do some more processing
    # write the text to a file

q = Queue(concurrent * 2)
for i in range(concurrent):
    t = Thread(target=worker)
    t.daemon = True
    t.start()

try:
    for url in open('text.txt'):
        q.put(url.strip())
    q.join()
except KeyboardInterrupt:
    sys.exit(1)
If the file isn't bigger than your available memory, instead of opening it with the "open" method use mmap ( https://docs.python.org/3/library/mmap.html ). It will give the same speed as if you were working with memory and not a file.
with open("test.txt") as f:
mmap_file = mmap.mmap(f.fileno(), 0)
# code that does what you need
mmap_file.close()
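If you combine this with the queue-based workers from the question, a small sketch (Python 3 syntax, assuming the queue q and the worker threads have already been started as in the original code) could feed URLs from the mapped file like this:
import mmap

# Assumes q and the worker threads from the question's code are already set up.
with open("text.txt", "rb") as f:
    mmap_file = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
    for line in iter(mmap_file.readline, b""):  # readline returns b"" at EOF
        q.put(line.decode("utf-8").strip())
    mmap_file.close()
q.join()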

Webscrape multithread python 3

I have been doing a simple web scraping program to learn how to code, and I made it work, but I wanted to see how to make it faster. I wanted to ask how I could implement multithreading in this program. All the program does is open the stock symbols file and search for the price of each stock online.
Here is my code
import urllib.request
import urllib
from threading import Thread
symbolsfile = open("Stocklist.txt")
symbolslist = symbolsfile.read()
thesymbolslist = symbolslist.split("\n")
i=0
while i<len (thesymbolslist):
theurl = "http://www.google.com/finance/getprices?q=" + thesymbolslist[i] + "&i=10&p=25m&f=c"
thepage = urllib.request.urlopen(theurl)
# read the correct character encoding from `Content-Type` request header
charset_encoding = thepage.info().get_content_charset()
# apply encoding
thepage = thepage.read().decode(charset_encoding)
print(thesymbolslist[i] + " price is " + thepage.split()[len(thepage.split())-1])
i= i+1
If you just apply a function to each item of a list, I recommend multiprocessing.Pool.map(function, list).
https://docs.python.org/3/library/multiprocessing.html?highlight=multiprocessing%20map#multiprocessing.pool.Pool.map
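A minimal sketch of that suggestion, reusing the URL pattern and symbol file from the question (get_price is just an illustrative helper name, and the worker count is arbitrary):
import urllib.request
from multiprocessing import Pool

# Illustrative helper; URL pattern and decoding logic come from the question.
def get_price(symbol):
    theurl = "http://www.google.com/finance/getprices?q=" + symbol + "&i=10&p=25m&f=c"
    thepage = urllib.request.urlopen(theurl)
    charset_encoding = thepage.info().get_content_charset()
    text = thepage.read().decode(charset_encoding)
    return symbol + " price is " + text.split()[-1]

if __name__ == "__main__":
    with open("Stocklist.txt") as symbolsfile:
        thesymbolslist = symbolsfile.read().split("\n")
    with Pool(processes=8) as pool:  # arbitrary worker count
        for line in pool.map(get_price, thesymbolslist):
            print(line)
Note that for I/O-bound work like HTTP requests a thread pool would do just as well; this simply follows the multiprocessing suggestion above.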
You need to use asyncio. That's a quite neat package that could also help you with scraping. I have created a small snippet of how to integrate with LinkedIn using asyncio, but you can adapt it to your needs quite easily.
import asyncio
import requests
def scrape_first_site():
    url = 'http://example.com/'
    response = requests.get(url)

def scrape_another_site():
    url = 'http://example.com/other/'
    response = requests.get(url)

loop = asyncio.get_event_loop()
tasks = [
    loop.run_in_executor(None, scrape_first_site),
    loop.run_in_executor(None, scrape_another_site)
]
loop.run_until_complete(asyncio.wait(tasks))
loop.close()
Since the default executor is a ThreadPoolExecutor, it will run each task in a separate thread. You can use a ProcessPoolExecutor if you'd like to run tasks in processes rather than threads (e.g. because of GIL-related issues).
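For the ProcessPoolExecutor variant mentioned above, a short sketch (the scrape functions are the ones from the snippet and must be defined at module level so they can be pickled):
import asyncio
from concurrent.futures import ProcessPoolExecutor

loop = asyncio.get_event_loop()
with ProcessPoolExecutor() as pool:
    tasks = [
        loop.run_in_executor(pool, scrape_first_site),
        loop.run_in_executor(pool, scrape_another_site),
    ]
    loop.run_until_complete(asyncio.wait(tasks))
loop.close()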
