Background:
I want to monitor, say, 100 URLs (take a snapshot and store it if the content differs from the previous one). My plan is to use urllib.request to scan them every x minutes, say x=5, non-stop.
So I can't use a single for loop with a sleep, because I want to kick off detection for URL1 and then kick off URL2 almost simultaneously:
while True:
    for url in urlList:
        do_detection()
        time.sleep(sleepLength)
Therefore I should be using a pool? But I should limit the number of threads to a small amount that my CPU can handle (I can't use 100 threads for 100 URLs).
My question:
Even if I can send the 100 URLs in my list to ThreadPool(4) with four threads, how should I design it so that each thread handles 100/4 = 25 URLs, so that a thread probes URL1, then URL2 ... URL25, and sleeps only long enough that URL1 is probed again 300 seconds after its previous probe, before going back to URL1? I don't want to wait 5 min * 25 for a full cycle.
Pseudocode or examples would be a great help! I can't find or think of a way to make looper() and detector() behave as needed.
(I think How to scrap multiple html page in parallel with beautifulsoup in python? is close, but not an exact answer.)
Maybe something like this for each thread? I will try to work out how to split the 100 items across the threads now. pool.map(func, iterable[, chunksize]) takes a list, and I can set chunksize to 25.
def one_thread(urls):
    for url in urls[0:25]:
        CurrentDetect(url)
    if 300 - time_elapsed > 0:
        remain_sleeping = 300 - time_elapsed
    else:
        remain_sleeping = 0
    sleep(remain_sleeping)
    for url in urls[0:25]:
        NextDetect(url)
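Concretely, the per-thread loop I have in mind would look roughly like this rough sketch, where detect(url) is a placeholder for whatever detection function I end up with:
import time

INTERVAL = 300  # probe each URL every 300 seconds

def one_thread(url_slice, detect):
    while True:
        started = time.monotonic()
        for url in url_slice:
            detect(url)  # placeholder detection call
        elapsed = time.monotonic() - started
        time.sleep(max(0, INTERVAL - elapsed))  # sleep only for the remainder of the interval
Each of the four threads would then get its own slice, e.g. urlList[0:25], urlList[25:50], and so on.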
The non-working code I am trying to write:
import time
import urllib.request as req
from multiprocessing.dummy import Pool as ThreadPool  # thread-based Pool

Latest_saved = []

def url_reader(url="http://stackoverflow.com"):
    try:
        f = req.urlopen(url)
        return f.read()
    except Exception as err:
        print(err)

def save_state(content):
    pass  # TODO: persist the snapshot
    return []

def looper(urlList, sleepLength=720):
    for url in urlList:  # initial save
        Latest_saved.append(save_state(url_reader(url)))  # returns a list
    while True:
        pool = ThreadPool(4)
        results = pool.map(url_reader, urlList)
        time.sleep(sleepLength)  # how to parallelize this? if we have 100 urls, it takes 100*20 min to loop?
        detector(urlList)  # ? use the last saved status returned to compare?

def detector(urlList):
    for url in urlList:
        contentFirst = url_reader(url)
        contentNext = url_reader(url)
        if contentFirst != contentNext:
            save_state(contentFirst)
            save_state(contentNext)
You need to install requests,
pip install requests
if you want to use the following code:
# -*- coding: utf-8 -*-

import concurrent.futures
import requests
import queue
import threading

# URL Pool
URLS = [
    # Put your urls here
]

# Time interval (in seconds)
INTERVAL = 5 * 60

# The number of worker threads
MAX_WORKERS = 4

# You should set up request headers
# if you want to better evade anti-spider programs
HEADERS = {
    'Accept': '*/*',
    'Accept-Encoding': 'gzip, deflate',
    'Accept-Language': 'en-US,en;q=0.9',
    'Cache-Control': 'max-age=0',
    'Connection': 'keep-alive',
    #'Host': None,
    'If-Modified-Since': '0',
    #'Referer': None,
    'User-Agent': 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/67.0.3396.62 Safari/537.36',
}

############################

def handle_response(response):
    # TODO implement your logics here !!!
    raise RuntimeError('Please implement function `handle_response`!')

# Retrieve a single page and report the URL and contents
def load_url(session, url):
    #print('load_url(session, url={})'.format(url))
    response = session.get(url)
    if response.status_code == 200:
        # You can refactor this part and
        # make it run in another thread
        # devoted to handling local IO tasks,
        # to reduce the burden of Net IO worker threads
        return handle_response(response)

def ThreadPoolExecutor():
    return concurrent.futures.ThreadPoolExecutor(max_workers=MAX_WORKERS)

# Generate a session object
def Session():
    session = requests.Session()
    session.headers.update(HEADERS)
    return session

# We can use a with statement to ensure threads are cleaned up promptly
with ThreadPoolExecutor() as executor, Session() as session:
    if not URLS:
        raise RuntimeError('Please fill in the array `URLS` to start probing!')
    tasks = queue.Queue()
    for url in URLS:
        tasks.put_nowait(url)
    def wind_up(url):
        #print('wind_up(url={})'.format(url))
        tasks.put(url)
    while True:
        url = tasks.get()
        # Work
        executor.submit(load_url, session, url)
        threading.Timer(interval=INTERVAL, function=wind_up, args=(url,)).start()
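For the original goal (saving a snapshot only when the content changed), handle_response could look something like the sketch below; the hashing and file layout are just assumptions, not part of the code above:
import hashlib
import pathlib

_last_hashes = {}  # url -> hash of the last saved snapshot (in-memory only)

def handle_response(response):
    digest = hashlib.sha256(response.content).hexdigest()
    if _last_hashes.get(response.url) != digest:
        _last_hashes[response.url] = digest
        # hypothetical storage scheme: one file per (url hash, content hash)
        name = hashlib.md5(response.url.encode()).hexdigest() + '-' + digest[:12] + '.html'
        pathlib.Path('snapshots').mkdir(exist_ok=True)
        pathlib.Path('snapshots', name).write_bytes(response.content)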
Related
I have searched through all the requests docs I can find, including requests-futures, and I can't find anything addressing this question. Wondering if I missed something or if anyone here can help answer:
Is it possible to create and manage multiple unique/autonomous sessions with requests.Session()?
In more detail, I'm asking if there is a way to create two sessions I could use "get" with separately (maybe via multiprocessing) that would retain their own unique set of headers, proxies, and server-assigned sticky data.
If I want Session_A to hit someimaginarysite.com/page1.html using specific headers and proxies, get cookied and everything else, and then hit someimaginarysite.com/page2.html with the same everything, I can just create one session object and do it.
But what if I want a Session_B, at the same time, to start on page3.html and then hit page4.html with totally different headers/proxies than Session_A, and be assigned its own cookies and whatnot? Crawl multiple pages consecutively with the same session, not just a single request followed by another request from a blank (new) session.
Is this as simple as saying:
import requests
Session_A = requests.Session()
Session_B = requests.Session()
headers_a = {A headers here}
proxies_a = {A proxies here}
headers_b = {B headers here}
proxies_b = {B proxies here}
response_a = Session_A.get('https://someimaginarysite.com/page1.html', headers=headers_a, proxies=proxies_a)
response_a = Session_A.get('https://someimaginarysite.com/page2.html', headers=headers_a, proxies=proxies_a)
# --- and on a separate thread/processor ---
response_b = Session_B.get('https://someimaginarysite.com/page3.html', headers=headers_b, proxies=proxies_b)
response_b = Session_B.get('https://someimaginarysite.com/page4.html', headers=headers_b, proxies=proxies_b)
Or will the above just create one session accessible by two names so the server will see the same cookies and session appearing with two ips and two sets of headers... which would seem more than a little odd.
Greatly appreciate any help with this, I've exhausted my research abilities.
I think there is probably a better way to do this, but without more information about the pagination and exactly what you want, it is a bit hard to understand exactly what you need. The following will create two threads, each making sequential calls while keeping the same headers and proxies. Once again, there may be a better approach, but with the limited information it's a bit murky.
import requests
import concurrent.futures

def session_get_same(urls, headers, proxies):
    lst = []
    with requests.Session() as s:
        for url in urls:
            lst.append(s.get(url, headers=headers, proxies=proxies))
    return lst

def main():
    with concurrent.futures.ThreadPoolExecutor() as executor:
        futures = [
            executor.submit(
                session_get_same,
                urls=[
                    'https://someimaginarysite.com/page1.html',
                    'https://someimaginarysite.com/page2.html'
                ],
                headers={'user-agent': 'curl/7.61.1'},
                proxies={'https': 'https://10.10.1.11:1080'}
            ),
            executor.submit(
                session_get_same,
                urls=[
                    'https://someimaginarysite.com/page3.html',
                    'https://someimaginarysite.com/page4.html'
                ],
                headers={'user-agent': 'curl/7.61.2'},
                proxies={'https': 'https://10.10.1.11:1080'}
            ),
        ]
        flst = []
        for future in concurrent.futures.as_completed(futures):
            flst.append(future.result())
        return flst
Not sure if this is a better version of the first function; mileage may vary:
def session_get_same(urls, headers, proxies):
    lst = []
    with requests.Session() as s:
        s.headers.update(headers)
        s.proxies.update(proxies)
        for url in urls:
            lst.append(s.get(url))
    return lst
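For completeness, a minimal way to drive this, assuming the module is run directly:
if __name__ == '__main__':
    for responses in main():          # one list of responses per session/thread
        for response in responses:
            print(response.url, response.status_code)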
I have managed to send multiple requests to a web API at the same time through ThreadPoolExecutor and get the JSON responses, but I can't send requests with a payload.
Would you be kind enough to look at my code and suggest an edit so it sends the payload (data, headers)?
I just don't know how to send the payload.
from concurrent.futures import ThreadPoolExecutor
import requests
from timer import timer

URL = 'whatever.com'
payload = {'aaaaa': '0xxxxxxx'}
headers = {
    'abc': 'xyz',
    'Content-Type': 'application/json',
}

def fetch(session, url):
    with session.post(url) as response:
        print(response.json())

#timer(1, 1)
def main():
    with ThreadPoolExecutor(max_workers=100) as executor:
        with requests.session() as session:
            executor.map(fetch, [session] * 100, [URL] * 100)
            executor.shutdown(wait=True)
Normally you specify a "payload" using the data keyword argument on the call to the post method. But if you want to send it in JSON format, then you should use the json keyword argument:
session.post(url, json=payload, headers=headers)
(If the header specified 'Content-Type': 'application/json', as yours does, and if payload were already a JSON string, which yours is not, then you would be correct to use the data keyword argument, since you would not need any JSON conversion. But here you clearly need requests to first convert a Python dictionary to a JSON string for transmission, and that is why the json argument is being used. You do not really need to pass the headers argument explicitly, since requests will set an appropriate Content-Type for you.)
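For illustration, these two calls send the same body (a small sketch using the url, payload and headers names from the question):
import json

# Let requests serialize the dict and set the Content-Type header for you:
response = session.post(url, json=payload)

# Equivalent, doing the serialization yourself:
response = session.post(url, data=json.dumps(payload), headers={'Content-Type': 'application/json'})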
Now I know this is just a "dummy" program fetching the same URL 100 times. In a more realistic version you would be fetching 100 different URLs, but you would, of course, be using the same Session instance for each call to fetch. You could therefore simplify the program in the following way:
from functools import partial

...

def main():
    with ThreadPoolExecutor(max_workers=100) as executor:
        with requests.Session() as session:
            worker = partial(fetch, session)  # first argument will be session
            executor.map(worker, [URL] * 100)
            # remove following line
            #executor.shutdown(wait=True)
Note that I have also commented out your explicit call to the shutdown method, since shutdown will be called automatically when the with ... as executor: block terminates.
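If you do have 100 distinct URLs, the same pattern works unchanged; a sketch with a hypothetical urls list, reusing the imports and fetch from above:
def main():
    urls = ['https://example.com/a', 'https://example.com/b']  # your real URLs here
    with ThreadPoolExecutor(max_workers=min(100, len(urls))) as executor:
        with requests.Session() as session:
            worker = partial(fetch, session)
            executor.map(worker, urls)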
I have a set of URLs (same http server but different request parameters). What I want to achieve is to keep on requesting all of them asynchronously or in parallel, until I kill it.
I started with threading.Thread() to create one thread per URL, with a while True: loop in the requesting function. This was already faster than a single thread making one request at a time, of course, but I would like to achieve better performance.
Then I tried aiohttp library to run the requests asynchronously. My code is like this (FYI, each URL is composed with url_base and product.id, and each URL has a different proxy to be used for the request):
async def fetch(product, i, proxies, session):
    headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_4) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/84.0.4147.105 Safari/537.36'}
    while True:
        try:
            async with session.get(
                url_base + product.id,
                proxy=proxies[i],
                headers=headers,
                ssl=False
            ) as response:
                content = await response.read()
                print(content)
        except Exception as e:
            print('ERROR ', str(e))

async def startQuery(proxies):
    tasks = []
    async with aiohttp.ClientSession() as session:
        for [i, product] in enumerate(hermes_products):
            task = asyncio.ensure_future(fetch(product, i, proxies, session))
            tasks.append(task)
        responses = asyncio.gather(*tasks)
        await responses

loop = asyncio.get_event_loop()
loop.run_until_complete(startQuery(global_proxy))
My observations: 1) it is not as fast as I would expect, actually slower than using threads; 2) more importantly, the requests only returned normally at the beginning of the run, and soon almost all of them returned errors like:
ERROR Cannot connect to host PROXY_IP:PORT ssl:False [Connect call failed ('PROXY_IP', PORT)]
or
ERROR 503, message='Too many open connections'
or
ERROR [Errno 54] Connection reset by peer
Am I doing something wrong here (particularly with the while True loop)? If so, how can I achieve my goal properly?
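(For reference, one common way to bound the number of simultaneous connections is a shared asyncio.Semaphore, optionally combined with a TCPConnector limit on the session. This is only a sketch under assumed limits, not necessarily a fix for the proxy errors above:)
import asyncio
import aiohttp

MAX_CONCURRENCY = 30  # assumed cap; tune to what the server/proxies tolerate

async def bounded_fetch(semaphore, session, url):
    # The semaphore ensures at most MAX_CONCURRENCY requests are in flight at once.
    async with semaphore:
        async with session.get(url, ssl=False) as response:
            return await response.read()

async def run(urls):
    semaphore = asyncio.Semaphore(MAX_CONCURRENCY)
    # Additionally limit connections at the transport level:
    connector = aiohttp.TCPConnector(limit=MAX_CONCURRENCY)
    async with aiohttp.ClientSession(connector=connector) as session:
        tasks = [bounded_fetch(semaphore, session, url) for url in urls]
        return await asyncio.gather(*tasks, return_exceptions=True)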
This question already has answers here:
What is the fastest way to send 100,000 HTTP requests in Python?
(21 answers)
Closed 6 years ago.
For my bachelor thesis I need to grab some data out of about 40000 websites. Therefore I am using python requests, but at the moment it is really slow with getting a response from the server.
Is there any way to speed it up while keeping my current header setting? All the tutorials I found did not use headers.
Here is my code snippet:
def parse(url):
    headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_1) AppleWebKit/537.36 (KHTML, like Gecko) '
                             'Chrome/39.0.2171.95 Safari/537.36'}
    r = requests.get(url, headers=headers)
    for line in r.iter_lines():
        ...
Well, you can use threads, since this is an I/O-bound problem. The built-in threading library is your best choice. I used a Semaphore object to limit how many threads can run at the same time.
import time
import threading

# Semaphore limiting the number of parallel threads
lock = threading.Semaphore(2)

def parse(url):
    """
    Change to your logic, I just use sleep to mock the http request.
    """
    print('getting info', url)
    time.sleep(2)
    # We are done, release the semaphore so another thread can start
    lock.release()

def parse_pool():
    # List of all your urls
    list_of_urls = ['website1', 'website2', 'website3', 'website4']
    # List of thread objects so we can handle them later
    thread_pool = []
    for url in list_of_urls:
        # Create a new thread that calls your function with a url
        thread = threading.Thread(target=parse, args=(url,))
        thread_pool.append(thread)
        thread.start()
        # Acquire the semaphore; this blocks if too many threads are already running
        lock.acquire()
    for thread in thread_pool:
        thread.join()
    print('done')
You can use asyncio to run tasks concurrently. You can collect the URL responses (both the completed and the pending ones) from the return value of asyncio.wait() and call the coroutines asynchronously. The results will come back in an unpredictable order, but it is a faster approach.
import asyncio

async def parse(url):
    print('in parse for url {}'.format(url))
    # write the logic for fetching the info; it awaits the response from the url
    info = await fetch_info(url)  # fetch_info is a placeholder for your own coroutine
    print('done with url {}'.format(url))
    return 'parse {} result from {}'.format(info, url)

async def main(sites):
    print('starting main')
    parses = [
        parse(url)
        for url in sites
    ]
    print('waiting for parses to complete')
    completed, pending = await asyncio.wait(parses)
    results = [t.result() for t in completed]
    print('results: {!r}'.format(results))

event_loop = asyncio.get_event_loop()
try:
    websites = ['site1', 'site2', 'site3']
    event_loop.run_until_complete(main(websites))
finally:
    event_loop.close()
I think it's a good idea to use multithreading, such as threading or multiprocessing, or you can use grequests (asynchronous requests based on gevent).
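For example, a minimal grequests sketch, where urls and headers are placeholders for your own URL list and header dict:
import grequests  # pip install grequests

def fetch_all(urls, headers, concurrency=20):
    # Build the requests lazily, then run them on a gevent pool of `concurrency` greenlets
    reqs = (grequests.get(url, headers=headers) for url in urls)
    return grequests.map(reqs, size=concurrency)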
I want to make a lot of URL requests to a REST web service, typically between 75k and 90k. However, I need to throttle the number of concurrent connections to the web service.
I started playing around with grequests in the following manner, but it quickly started chewing up open sockets.
concurrent_limit = 30
urllist = buildUrls()
hdrs = {'Host': 'hostserver'}
g_requests = (grequests.get(url, headers=hdrs) for url in urllist)
g_responses = grequests.map(g_requests, size=concurrent_limit)
As this runs for a minute or so, I get hit with 'maximum number of sockets reached' errors.
As far as I can tell, each of the requests.get calls in grequests uses its own session, which means a new socket is opened for each request.
I found a note on GitHub describing how to make grequests use a single session. But this seems to effectively bottleneck all requests into a single shared pool, which seems to defeat the purpose of asynchronous HTTP requests.
s = requests.session()
rs = [grequests.get(url, session=s) for url in urls]
grequests.map(rs)
Is it possible to use grequests or gevent.Pool in a way that creates a number of sessions?
Put another way: how can I make many concurrent HTTP requests, either through queuing or connection pooling?
I ended up not using grequests to solve my problem. I'm still hopeful it might be possible.
I used threading:
import requests
from queue import Queue
from threading import Thread

class MyAwesomeThread(Thread):
    """
    Threading wrapper to handle counting and processing of tasks
    """
    def __init__(self, session, q):
        self.q = q
        self.count = 0
        self.session = session
        self.response = None
        Thread.__init__(self)

    def run(self):
        """TASK RUN BY THREADING"""
        while True:
            url, host = self.q.get()
            httpHeaders = {'Host': host}
            self.response = self.session.get(url, headers=httpHeaders)
            # handle response here
            self.count += 1
            self.q.task_done()

q = Queue()
threads = []
for i in range(CONCURRENT):
    session = requests.session()
    t = MyAwesomeThread(session, q)
    t.daemon = True  # allows us to send an interrupt
    threads.append(t)

## build urls and add them to the Queue
for url in buildurls():
    q.put_nowait((url, host))

## start the threads
for t in threads:
    t.start()
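One thing the snippet leaves out is waiting for the work to finish. Since the workers are daemon threads, the main thread needs to block until the queue drains; a small addition, relying on the task_done() calls already made in run():
## block until every queued (url, host) pair has been processed
q.join()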
rs is a list of AsyncRequest objects; each AsyncRequest has its own session.
rs = [grequests.get(url) for url in urls]
grequests.map(rs)
for ar in rs:
print(ar.session.cookies)
Something like this:
NUM_SESSIONS = 50
sessions = [requests.Session() for i in range(NUM_SESSIONS)]
reqs = []
i = 0
for url in urls:
    reqs.append(grequests.get(url, session=sessions[i % NUM_SESSIONS]))
    i += 1
responses = grequests.map(reqs, size=NUM_SESSIONS*5)
That should spread the requests over 50 different sessions.