Multithread python requests [duplicate] - python

This question already has answers here:
What is the fastest way to send 100,000 HTTP requests in Python?
(21 answers)
Closed 6 years ago.
For my bachelor thesis I need to grab some data from about 40,000 websites. I am using Python requests for this, but at the moment it is really slow at getting responses from the servers.
Is there any way to speed it up and keep my current header setting? All the tutorials I found were without a header.
Here is my code snippet:
def parse(url):
    headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_1) AppleWebKit/537.36 (KHTML, like Gecko) '
                             'Chrome/39.0.2171.95 Safari/537.36'}
    r = requests.get(url, headers=headers)
    for line in r.iter_lines():
        ...

Well, you can use threads, since this is an I/O-bound problem. The built-in threading library is a good choice. I used a Semaphore object to limit how many threads can run at the same time.
import time
import threading

# Maximum number of threads allowed to run at the same time
lock = threading.Semaphore(2)


def parse(url):
    """
    Change to your logic; sleep is used here to mock the HTTP request.
    """
    try:
        print('getting info', url)
        time.sleep(2)
    finally:
        # When we are done, give the slot back so another thread can start
        lock.release()


def parse_pool():
    # List of all your urls
    list_of_urls = ['website1', 'website2', 'website3', 'website4']
    # Keep the thread objects so we can join them later
    thread_pool = []
    for url in list_of_urls:
        # Take a slot first; this blocks while two threads are already running
        lock.acquire()
        # Create a new thread that calls your function with a url
        thread = threading.Thread(target=parse, args=(url,))
        thread_pool.append(thread)
        thread.start()
    for thread in thread_pool:
        thread.join()
    print('done')
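For comparison, here is a minimal sketch of the same idea with concurrent.futures.ThreadPoolExecutor from the standard library, which caps the number of worker threads for you and keeps the header setting. The names fetch, fetch_all and HEADERS, the worker count and the placeholder User-Agent value are assumptions for illustration, not part of the question.

```python
from concurrent.futures import ThreadPoolExecutor
import urllib.request

# Placeholder header; substitute the User-Agent string you already use.
HEADERS = {'User-Agent': 'my-thesis-crawler/0.1'}


def fetch(url, headers):
    """Download one URL with custom headers (standard library only)."""
    request = urllib.request.Request(url, headers=headers)
    with urllib.request.urlopen(request, timeout=10) as response:
        return response.read()


def fetch_all(urls, fetch_one, max_workers=8):
    """Run fetch_one over urls on at most max_workers threads.

    Results come back in the same order as urls.
    """
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(fetch_one, urls))
```

You could call it as fetch_all(urls, lambda u: fetch(u, HEADERS)), or pass a function that wraps requests.get(u, headers=HEADERS) if you prefer to stay with requests.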

You can use asyncio to run tasks concurrently. asyncio.wait() returns two sets of tasks, the completed ones and the pending ones, and you can read each URL's result from the completed set. The results will be in an arbitrary order, but it is a faster approach.
import asyncio


async def parse(url):
    print('in parse for url {}'.format(url))
    # Write the logic for fetching the info here; it must await the response
    # from the url. asyncio.sleep() is only a stand-in for that call.
    info = await asyncio.sleep(1, result='info')
    print('done with url {}'.format(url))
    return 'parse {} result from {}'.format(info, url)


async def main(sites):
    print('starting main')
    # asyncio.wait() expects tasks, so wrap each coroutine in one
    parses = [
        asyncio.ensure_future(parse(url))
        for url in sites
    ]
    print('waiting for phases to complete')
    completed, pending = await asyncio.wait(parses)
    results = [t.result() for t in completed]
    print('results: {!r}'.format(results))


websites = ['site1', 'site2', 'site3']
asyncio.run(main(websites))
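If the order of the results matters, asyncio.gather() is an alternative to asyncio.wait(): it returns the results in the order the awaitables were passed in. A small sketch, where asyncio.sleep() again stands in for the real request and the delays are made up so that the first site finishes last:

```python
import asyncio


async def parse(url, delay):
    # asyncio.sleep stands in for the real network call.
    await asyncio.sleep(delay)
    return 'result from {}'.format(url)


async def main(sites):
    # Earlier sites get longer delays, so they finish last.
    coroutines = [
        parse(url, 0.01 * (len(sites) - i))
        for i, url in enumerate(sites)
    ]
    # gather() keeps the input order, regardless of which one finished first.
    return await asyncio.gather(*coroutines)


results = asyncio.run(main(['site1', 'site2', 'site3']))
print(results)
```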

I think it's a good idea to use multiple threads or processes (threading or multiprocessing), or you can use grequests (asynchronous requests built on gevent).

Related

Run Parallel Request session in python

I am trying to open multiple web sessions and save the data to CSV files. I have written my code using a for loop and requests.get, but it takes very long to access 90 web locations. Can anyone tell me how to run the whole process in parallel for loc_var?
The code works fine; the only issue is that it processes loc_var one by one, which takes a long time.
I want to access all the loc_var URLs of the for loop in parallel and write the CSV files.
Below is the code:
import pandas as pd
import numpy as np
import os
import requests
import datetime
import zipfile
t=datetime.date.today()-datetime.timedelta(2)
server = [("A","web1",":5000","username=usr&password=p7Tdfr")]
'''List of all web_ips'''
web_1 = ["Web1","Web2","Web3","Web4","Web5","Web6","Web7","Web8","Web9","Web10","Web11","Web12","Web13","Web14","Web15"]
'''List of All location'''
loc_var =["post1","post2","post3","post4","post5","post6","post7","post8","post9","post10","post11","post12","post13","post14","post15","post16","post17","post18"]
for s,web,port,usr in server:
    login_url='http://'+web+port+'/api/v1/system/login/?'+usr
    print (login_url)
    s= requests.session()
    login_response = s.post(login_url)
    print("login Responce",login_response)
    #Start access the Web for Loc_variable
    for mkt in loc_var:
        #output is CSV File
        com_actions_url='http://'+web+port+'/api/v1/3E+date(%5C%22'+str(t)+'%5C%22)and+location+%3D%3D+%27'+mkt+'%27%22&page_size=-1&format=%22csv%22'
        print("com_action_url",com_actions_url)
        r = s.get(com_actions_url)
        print("action",r)
        if r.ok == True:
            with open(os.path.join("/home/Reports_DC/", "relation_%s.csv"%mkt),'wb') as f:
                f.write(r.content)
        # If loc is not aceesble try with another Web_1 List
        if r.ok == False:
            while r.ok == False:
                for web_2 in web_1:
                    login_url='http://'+web_2+port+'/api/v1/system/login/?'+usr
                    com_actions_url='http://'+web_2+port+'/api/v1/3E+date(%5C%22'+str(t)+'%5C%22)and+location+%3D%3D+%27'+mkt+'%27%22&page_size=-1&format=%22csv%22'
                    login_response = s.post(login_url)
                    print("login Responce",login_response)
                    print("com_action_url",com_actions_url)
                    r = s.get(com_actions_url)
                    if r.ok == True:
                        with open(os.path.join("/home/Reports_DC/", "relation_%s.csv"%mkt),'wb') as f:
                            f.write(r.content)
                        break
There are multiple approaches that you can take to make concurrent HTTP requests. Two that I've used are (1) multiple threads with concurrent.futures.ThreadPoolExecutor or (2) send the requests asynchronously using asyncio/aiohttp.
To use a thread pool to send your requests in parallel, you would first generate a list of URLs that you want to fetch in parallel (in your case generate a list of login_urls and com_action_urls), and then you would request all of the URLs concurrently as follows:
from concurrent.futures import ThreadPoolExecutor
import requests
def fetch(url):
    page = requests.get(url)
    return page.text
    # Catch HTTP errors/exceptions here
pool = ThreadPoolExecutor(max_workers=5)
urls = ['http://www.google.com', 'http://www.yahoo.com', 'http://www.bing.com'] # Create a list of urls
for page in pool.map(fetch, urls):
    # Do whatever you want with the results ...
    print(page[0:100])
Using asyncio/aiohttp is generally faster than the threaded approach above, but the learning curve is steeper. Here is a simple example (Python 3.7+):
import asyncio
import aiohttp
urls = ['http://www.google.com', 'http://www.yahoo.com', 'http://www.bing.com']
async def fetch(session, url):
    async with session.get(url) as resp:
        return await resp.text()
    # Catch HTTP errors/exceptions here
async def fetch_concurrent(urls):
    loop = asyncio.get_event_loop()
    async with aiohttp.ClientSession() as session:
        tasks = []
        for u in urls:
            tasks.append(loop.create_task(fetch(session, u)))
        for result in asyncio.as_completed(tasks):
            page = await result
            #Do whatever you want with results
            print(page[0:100])
asyncio.run(fetch_concurrent(urls))
But unless you are going to be making a huge number of requests, the threaded approach will likely be sufficient (and way easier to implement).
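The "catch HTTP errors/exceptions here" comments above can be handled in one place if you submit the requests yourself and collect them with as_completed. This is only a sketch: fetch_many is a hypothetical helper, and fetch is whatever function you use to download one URL (for example the fetch shown in the thread pool example).

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def fetch_many(urls, fetch, max_workers=5):
    """Return (pages, errors): two dicts keyed by URL."""
    pages, errors = {}, {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Remember which future belongs to which URL.
        future_to_url = {pool.submit(fetch, url): url for url in urls}
        for future in as_completed(future_to_url):
            url = future_to_url[future]
            try:
                pages[url] = future.result()
            except Exception as exc:  # narrow this to the errors you expect
                errors[url] = exc
    return pages, errors
```

One failing URL then no longer aborts the whole loop, which is what happens when pool.map re-raises the first exception while you iterate over its results.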

Continuously probing multiple URLs with fewer threads, how to control threads

Background:
I want to monitor, say, 100 URLs (take a snapshot and store it if the content differs from the previous one). My plan is to use urllib.request to scan them every x minutes, say x=5, non-stop.
So I can't use a single for loop and sleep, as I want to kick off detection for URL1 and then kick off URL2 almost simultaneously.
while TRUE:
for url in urlList:
do_detection()
time.sleep(sleepLength)
Therefore I should be using a pool? But I should limit the threads to a small number that my CPU can handle (I can't use 100 threads if I have 100 URLs).
My question:
Even if I can send the 100 URLs in my list to ThreadPool(4) with four threads, how should I design it so that each thread handles 100/4=25 URLs, i.e. the thread probes URL1, waits 300 seconds before the next probe of URL1, and in the meantime does URL2 ... URL25 and then goes back to URL1? I don't want to wait 5 min*25 for a full cycle.
Pseudocode or examples would be a great help! I can't find or think of a way to make looper() and detector() behave as needed.
(I think "How to scrap multiple html page in parallel with beautifulsoup in python?" is close, but not an exact answer.)
Maybe something like this for each thread? I will now try to work out how to split the 100 items between the threads. pool.map(func, iterable[, chunksize]) takes a list, and I can set chunksize to 25.
def one_thread(Url):
    For url in Url[0:24]:
        CurrentDetect(url)
    if 300-timelapsed>0:
        remain_sleeping=300-timelapsed
    else:
        remain_sleeping=0
    sleep (remain_sleeping)
    For url in Url[0:24]:
        NextDetect()
The non-working code I am trying to write:
import urllib.request as req
import time
def url_reader(url = "http://stackoverflow.com"):
    try
        f = req.urlopen(url)
        print (f.read())
    except Exception as err
        print (err)
def save_state():
    pass
    return []
def looper (sleepLength=720,urlList):
    for url in urlList: #initial save
        Latest_saved.append(save_state(url_reader(url))) # return a list
    while TRUE:
        pool = ThreadPool(4)
        results = pool.map(urllib2.urlopen, urls)
        time.sleep(sleepLength) # how to parallel this? if we have 100 urls, then takes 100*20 min to loop?
        detector(urlList) #? use last saved status returned to compare?
def detector (urlList):
    for url in urlList:
        contentFirst=url_reader(url)
        contentNext=url_reader(url)
        if contentFirst!=contentNext:
            save_state(contentFirst)
            save_state(contentNext)
You need to install requests,
pip install requests
if you want to use the following code:
# -*- coding: utf-8 -*-
import concurrent.futures
import requests
import queue
import threading
# URL Pool
URLS = [
    # Put your urls here
]
# Time interval (in seconds)
INTERVAL = 5 * 60
# The number of worker threads
MAX_WORKERS = 4
# You should set up request headers
# if you want to better evade anti-spider programs
HEADERS = {
    'Accept': '*/*',
    'Accept-Encoding': 'gzip, deflate',
    'Accept-Language': 'en-US,en;q=0.9',
    'Cache-Control': 'max-age=0',
    'Connection': 'keep-alive',
    #'Host': None,
    'If-Modified-Since': '0',
    #'Referer': None,
    'User-Agent': 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/67.0.3396.62 Safari/537.36',
}
############################
def handle_response(response):
    # TODO implement your logics here !!!
    raise RuntimeError('Please implement function `handle_response`!')

# Retrieve a single page and report the URL and contents
def load_url(session, url):
    #print('load_url(session, url={})'.format(url))
    response = session.get(url)
    if response.status_code == 200:
        # You can refactor this part and
        # make it run in another thread
        # devoted to handling local IO tasks,
        # to reduce the burden of Net IO worker threads
        return handle_response(response)

def ThreadPoolExecutor():
    return concurrent.futures.ThreadPoolExecutor(max_workers=MAX_WORKERS)

# Generate a session object
def Session():
    session = requests.Session()
    session.headers.update(HEADERS)
    return session

# We can use a with statement to ensure threads are cleaned up promptly
with ThreadPoolExecutor() as executor, Session() as session:
    if not URLS:
        raise RuntimeError('Please fill in the array `URLS` to start probing!')
    tasks = queue.Queue()
    for url in URLS:
        tasks.put_nowait(url)
    def wind_up(url):
        #print('wind_up(url={})'.format(url))
        tasks.put(url)
    while True:
        url = tasks.get()
        # Work
        executor.submit(load_url, session, url)
        threading.Timer(interval=INTERVAL, function=wind_up, args=(url,)).start()
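The question also asks to store a snapshot only if the content differs from the previous one. One possible way to fill in handle_response is to keep a digest of the last content seen per URL. ChangeDetector below is a hypothetical helper, not part of the code above, and it assumes each URL is probed by only one worker at a time (which holds when a URL is re-queued only after its interval has passed).

```python
import hashlib


class ChangeDetector:
    """Remember a digest of the last content seen for each URL."""

    def __init__(self):
        self._digests = {}

    def has_changed(self, url, content):
        """Return True if content (bytes) differs from the previous snapshot.

        The first snapshot of a URL counts as a change, so it gets stored.
        """
        digest = hashlib.sha256(content).hexdigest()
        changed = self._digests.get(url) != digest
        self._digests[url] = digest
        return changed
```

Inside handle_response you could then call detector.has_changed(response.url, response.content) and write the snapshot to disk only when it returns True.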

semaphore/multiple pool locks in asyncio for 1 proxy - aiohttp

I have 500,000 URLs and want to get the response of each asynchronously.
import aiohttp
import asyncio
@asyncio.coroutine
def worker(url):
    response = yield from aiohttp.request('GET', url, connector=aiohttp.TCPConnector(share_cookies=True, verify_ssl=False))
    body = yield from response.read_and_close()
    print(url)
def main():
    url_list = [] # lacs of urls, extracting from a file
    loop = asyncio.get_event_loop()
    loop.run_until_complete(asyncio.wait([worker(u) for u in url_list]))
main()
I want 200 connections at a time (200 concurrent), not more, because when I run this program for 50 URLs, i.e. url_list[:50], it works fine,
but if I pass the whole list, I get this error:
aiohttp.errors.ClientOSError: Cannot connect to host www.example.com:443 ssl:True Future/Task exception was never retrieved future: Task()
Maybe the request rate is too high and the server refuses to respond beyond a limit?
Yes, one can expect a server to stop responding once you send it too much traffic (whatever its definition of "too much traffic" is).
One way to limit the number of concurrent requests (throttle them) in such cases is to use asyncio.Semaphore, which is used much like the semaphores in multithreading: you create a semaphore and make sure the operation you want to throttle acquires it before doing the actual work and releases it afterwards.
For your convenience, asyncio.Semaphore implements the context manager protocol to make this even easier.
The most basic approach:
CONCURRENT_REQUESTS = 200
@asyncio.coroutine
def worker(url, semaphore):
    # Acquiring/releasing semaphore using context manager.
    with (yield from semaphore):
        response = yield from aiohttp.request(
            'GET',
            url,
            connector=aiohttp.TCPConnector(share_cookies=True,
                                           verify_ssl=False))
        body = yield from response.read_and_close()
    print(url)
def main():
    url_list = [] # lacs of urls, extracting from a file
    semaphore = asyncio.Semaphore(CONCURRENT_REQUESTS)
    loop = asyncio.get_event_loop()
    loop.run_until_complete(asyncio.wait([worker(u, semaphore) for u in url_list]))
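On current Python versions the same throttling is written with async def and async with. The sketch below is an illustration only: asyncio.sleep() stands in for the HTTP request, the limit is lowered to 3 so the effect is easy to see, and the stats dictionary is an assumption added purely to record how many workers were active at once.

```python
import asyncio

CONCURRENT_REQUESTS = 3  # the question would use 200


async def worker(url, semaphore, stats):
    # Only CONCURRENT_REQUESTS workers can be inside this block at once.
    async with semaphore:
        stats['active'] += 1
        stats['peak'] = max(stats['peak'], stats['active'])
        await asyncio.sleep(0.01)  # stand-in for the real HTTP request
        stats['active'] -= 1
    return url


async def main(url_list):
    semaphore = asyncio.Semaphore(CONCURRENT_REQUESTS)
    stats = {'active': 0, 'peak': 0}
    done = await asyncio.gather(*(worker(u, semaphore, stats) for u in url_list))
    return done, stats['peak']


done, peak = asyncio.run(main(['url{}'.format(i) for i in range(10)]))
print(len(done), peak)
```

All ten coroutines are scheduled immediately, but the recorded peak never exceeds the semaphore's limit.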

How to open Post Urls in multithreads in python

I am using Python 2.7 on a Windows machine. I have an array of URLs accompanied by data and headers, so the POST method is required.
In simple execution it works well:
rescodeinvalid =[]
success = []
for i in range(0,len(HostArray)):
    data = urllib.urlencode(post_data)
    req = urllib2.Request(HostArray[i], data)
    response = urllib2.urlopen(req)
    rescode=response.getcode()
    if rescode == 400:
        rescodeinvalid.append(HostArray[i])
    if rescode == 200:
        success.append(HostArray[i])
My question is: if HostArray is very large, the loop takes a long time.
So how can I check each URL of HostArray in multiple threads? Depending on whether the response code of a URL is 200, I do a different operation. I have arrays to store the 200 and 400 responses.
So how do I do this with multithreading in Python?
If you want to do each one in a separate thread you could do something like:
import threading
import urllib
import urllib2

rescodeinvalid =[]
success = []
def post_and_handle(url,post_data):
    data = urllib.urlencode(post_data)
    req = urllib2.Request(url, data)
    response = urllib2.urlopen(req)
    rescode=response.getcode()
    if rescode == 400:
        rescodeinvalid.append(url) # Append is thread safe
    elif rescode == 200:
        success.append(url) # Append is thread safe
workers = []
for i in range(0,len(HostArray)):
    t = threading.Thread(target=post_and_handle,args=(HostArray[i],post_data))
    t.start()
    workers.append(t)
# Wait for all of the requests to complete
for t in workers:
    t.join()
I'd also suggest using requests: http://docs.python-requests.org/en/latest/
as well as a thread pool:
Threading pool similar to the multiprocessing Pool?
Thread pool usage:
from multiprocessing.pool import ThreadPool
# Done here because this must be done in the main thread
pool = ThreadPool(processes=50) # use a max of 50 threads
# do this instead of Thread(target=func,args=args,kwargs=kwargs))
pool.apply_async(func,args,kwargs)
pool.close() # I think
pool.join()
Scrapy uses the Twisted library to call multiple URLs in parallel without the overhead of opening a new thread per request. It also manages an internal queue to accumulate and even prioritize them. As a bonus, you can restrict the number of parallel requests by setting the maximum number of concurrent requests. You can launch a Scrapy spider either as an external process or from your code; just set the spider's start_urls = HostArray.
Your case (basically processing a list into another list) looks like an ideal candidate for concurrent.futures (see for example this answer) or you may go all the way to Executor.map. And of course use ThreadPoolExecutor to limit the number of concurrently running threads to something reasonable.
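A rough sketch of the Executor.map idea, written for Python 3 (concurrent.futures is not in the Python 2.7 standard library; there it needs the futures backport). classify is a hypothetical helper, and post is assumed to be any callable that takes a URL and returns the integer status code.

```python
from concurrent.futures import ThreadPoolExecutor


def classify(hosts, post, max_workers=20):
    """Call post(url) for every host and sort the URLs by status code.

    post is any callable returning an integer HTTP status code.
    """
    success, invalid = [], []
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # map() yields status codes in the same order as hosts.
        for url, code in zip(hosts, pool.map(post, hosts)):
            if code == 200:
                success.append(url)
            elif code == 400:
                invalid.append(url)
    return success, invalid
```

Because the lists are filled in the calling thread, no shared state is touched from the workers. Note that urllib2.urlopen raises HTTPError for a 400 response instead of returning it, so a post function built on it would have to catch that exception and return its code attribute.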

grequests pool with multiple request.session?

I want to make a lot of URL requests to a REST web service, typically between 75k and 90k. However, I need to throttle the number of concurrent connections to the web service.
I started playing around with grequests in the following manner, but quickly started chewing up open sockets.
concurrent_limit = 30
urls = buildUrls()
hdrs = {'Host' : 'hostserver'}
g_requests = (grequests.get(url, headers=hdrs) for url in urls)
g_responses = grequests.map(g_requests, size=concurrent_limit)
As this runs for a minute or so, I get hit with 'maximum number of sockets reached' errors.
As far as I can tell, each of the requests.get calls in grequests uses its own session, which means a new socket is opened for each request.
I found a note on GitHub describing how to make grequests use a single session. But this seems to effectively bottleneck all requests into a single shared pool. That seems to defeat the purpose of asynchronous HTTP requests.
s = requests.session()
rs = [grequests.get(url, session=s) for url in urls]
grequests.map(rs)
Is it possible to use grequests or gevent.Pool in a way that creates a number of sessions?
Put another way: how can I make many concurrent HTTP requests using either queuing or connection pooling?
I ended up not using grequests to solve my problem. I'm still hopeful it might be possible.
I used threading:
from queue import Queue  # "from Queue import Queue" on Python 2
from threading import Thread

import requests


class MyAwesomeThread(Thread):
    """
    Threading wrapper to handle counting and processing of tasks
    """
    def __init__(self, session, q):
        self.q = q
        self.count = 0
        self.session = session
        self.response = None
        Thread.__init__(self)

    def run(self):
        """TASK RUN BY THREADING"""
        while True:
            url, host = self.q.get()
            httpHeaders = {'Host' : host}
            # Use this thread's own session, not the module-level variable
            self.response = self.session.get(url, headers=httpHeaders)
            # handle response here
            self.count+= 1
            self.q.task_done()

q=Queue()
threads = []
for i in range(CONCURRENT):
    session = requests.session()
    t=MyAwesomeThread(session,q)
    t.daemon=True # allows us to send an interrupt
    threads.append(t)
## build urls and add them to the Queue
for url in buildurls():
    q.put_nowait((url,host))
## start the threads
for t in threads:
    t.start()
## wait until every queued url has been processed
q.join()
rs is a list of AsyncRequest objects. Each AsyncRequest has its own session.
rs = [grequests.get(url) for url in urls]
grequests.map(rs)
for ar in rs:
    print(ar.session.cookies)
Something like this:
NUM_SESSIONS = 50
sessions = [requests.Session() for i in range(NUM_SESSIONS)]
reqs = []
i = 0
for url in urls:
    reqs.append(grequests.get(url, session=sessions[i % NUM_SESSIONS]))
    i+=1
responses = grequests.map(reqs, size=NUM_SESSIONS*5)
That should spread the requests over 50 different sessions.
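A related pattern, sketched here with plain threads instead of grequests, is to give every worker thread its own session through threading.local, so the number of sessions (and therefore connection pools) is bounded by the number of workers. The names get_session and fetch_all are made up for this example, and session_factory is assumed to be something like requests.Session.

```python
import threading
from concurrent.futures import ThreadPoolExecutor

_local = threading.local()


def get_session(session_factory):
    """Return this thread's session, creating it on first use."""
    if not hasattr(_local, 'session'):
        _local.session = session_factory()
    return _local.session


def fetch_all(urls, session_factory, max_workers=30):
    """Fetch urls on max_workers threads, one session per thread."""
    def fetch(url):
        return get_session(session_factory).get(url)

    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(fetch, urls))
```

With requests this would be called as fetch_all(urls, requests.Session, max_workers=30), matching the concurrent_limit from the question.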
