How can I improve the multithreading speed in my code?
My code takes 130 seconds to make 700 requests with 100 threads, which is really slow and frustrating given that I'm using 100 threads.
My code takes each URL from a file (urls.txt), edits the parameter values, and makes a request to every modified URL as well as to the original (unedited) URL.
Let me show you an example:
Let's consider the following url:
https://www.test.com/index.php?parameter=value1&parameter2=value2
The URL contains 2 parameters, so my code will make 3 requests:
1 request to the original URL:
https://www.test.com/index.php?parameter=value1&parameter2=value2
1 request with the first value modified:
https://www.test.com/index.php?parameter=replaced_value&parameter2=value2
1 request with the second value modified:
https://www.test.com/index.php?parameter=value1&parameter2=replaced_value
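For reference, the parameter-replacement step above can also be written with the standard library's urllib.parse instead of string splitting. The sketch below is only to make the scheme concrete; the build_urls helper and the replaced_value placeholder are illustrative names, not part of the original code.

from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode

def build_urls(url, replacement='replaced_value'):
    """Return the original URL plus one variant per query parameter."""
    parts = urlsplit(url)
    params = parse_qsl(parts.query)
    urls = [url]  # the unedited URL comes first
    for i in range(len(params)):
        # replace the value of the i-th parameter only
        modified = [(n, replacement if j == i else v)
                    for j, (n, v) in enumerate(params)]
        urls.append(urlunsplit(parts._replace(query=urlencode(modified))))
    return urls

# build_urls('https://www.test.com/index.php?parameter=value1&parameter2=value2')
# -> 3 URLs: the original plus one per replaced parameter value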
I have tried using asyncio for this, but I had more success with concurrent.futures.
I even tried increasing the number of threads, which I thought was the issue at first, but it wasn't: if I increased the thread count considerably, the script would freeze at the start for 30-50 seconds and it didn't really increase the speed as I expected.
I assume this is a code issue in how I build up the multithreading, because I saw other people achieve incredible speeds with concurrent.futures.
import requests
import uuid
from concurrent.futures import ThreadPoolExecutor, as_completed
import time

start = time.time()

headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_4) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.131 Safari/537.36'}

def make_request(url2):
    try:
        if '?' in url2 and '=' in url2:
            request_1 = requests.get(url2, headers=headers, timeout=10)
            url2_modified = url2.split("?")[1]
            times = url2_modified.count("&") + 1
            for x in range(0, times):
                split1 = url2_modified.split("&")[x]
                value = split1.split("=")[1]
                parameter = split1.split("=")[0]
                url = url2.replace('='+value, '=1')
                request_2 = requests.get(url, stream=True, headers=headers, timeout=10)
                html_1 = request_1.text
                html_2 = request_2.text
                print(request_1.status_code, '-', url2)
                print(request_2.status_code, '-', url)
    except requests.exceptions.RequestException as e:
        return e

def runner():
    threads = []
    with ThreadPoolExecutor(max_workers=100) as executor:
        file1 = open('urls.txt', 'r', errors='ignore')
        Lines = file1.readlines()
        count = 0
        for line in Lines:
            count += 1
            threads.append(executor.submit(make_request, line.strip()))

runner()
end = time.time()
print(end - start)
Inside the loop in make_request you run a normal requests.get, and it doesn't use a thread (or any other method) to make it faster - so it has to wait for the previous request to finish before running the next one.
In make_request I use another ThreadPoolExecutor to run every requests.get (created in the loop) in a separate thread:
executor.submit(make_modified_request, modified_url)
and it gives me a time of ~1.2s.
If I instead call the function directly
make_modified_request(modified_url)
then it gives me a time of ~3.2s.
Minimal working example:
I use a real URL, https://httpbin.org/get, so everyone can simply copy and run it.
from concurrent.futures import ThreadPoolExecutor
import requests
import time
#import urllib.parse

# --- constants --- (PEP8: UPPER_CASE_NAMES)

HEADERS = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_4) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.131 Safari/537.36'}

# --- functions ---

def make_modified_request(url):
    """Send modified url."""
    print('send:', url)
    response = requests.get(url, stream=True, headers=HEADERS)
    print(response.status_code, '-', url)
    html = response.text  # ???
    # ... code to process HTML ...

def make_request(url):
    """Send normal url and create threads with modified urls."""
    threads = []
    with ThreadPoolExecutor(max_workers=10) as executor:
        print('send:', url)
        # send base url
        response = requests.get(url, headers=HEADERS)
        print(response.status_code, '-', url)
        html = response.text  # ???

        #parts = urllib.parse.urlparse(url)
        #print('query:', parts.query)
        #arguments = urllib.parse.parse_qs(parts.query)
        #print('arguments:', arguments)  # dict {'a': ['A'], 'b': ['B'], 'c': ['C'], 'd': ['D'], 'e': ['E']}

        arguments = url.split("?")[1]
        arguments = arguments.split("&")
        arguments = [arg.split("=") for arg in arguments]
        print('arguments:', arguments)  # list [['a', 'A'], ['b', 'B'], ['c', 'C'], ['d', 'D'], ['e', 'E']]

        for name, value in arguments:
            modified_url = url.replace('='+value, '=1')
            print('modified_url:', modified_url)

            # run thread with modified url
            threads.append(executor.submit(make_modified_request, modified_url))

            # run normal function with modified url
            #make_modified_request(modified_url)

    print('[make_request] len(threads):', len(threads))

def runner():
    threads = []
    with ThreadPoolExecutor(max_workers=10) as executor:
        #fh = open('urls.txt', errors='ignore')
        fh = [
            'https://httpbin.org/get?a=A&b=B&c=C&d=D&e=E',
            'https://httpbin.org/get?f=F&g=G&h=H&i=I&j=J',
            'https://httpbin.org/get?k=K&l=L&m=M&n=N&o=O',
            'https://httpbin.org/get?a=A&b=B&c=C&d=D&e=E',
            'https://httpbin.org/get?f=F&g=G&h=H&i=I&j=J',
            'https://httpbin.org/get?k=K&l=L&m=M&n=N&o=O',
        ]
        for line in fh:
            url = line.strip()
            # create thread with url
            threads.append(executor.submit(make_request, url))

    print('[runner] len(threads):', len(threads))

# --- main ---

start = time.time()
runner()
end = time.time()
print('time:', end - start)
BTW:
I was thinking to use single
executor = ThreadPoolExecutor(max_workers=10)
and later use the same executor in all functions - maybe it would run a little faster - but at this moment I don't have working code.
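A rough sketch of that single-executor idea (untested on my side): create the executor once in runner() and pass it into make_request, so the modified-URL requests reuse the same pool instead of a nested one.

from concurrent.futures import ThreadPoolExecutor
import requests

HEADERS = {'User-Agent': 'Mozilla/5.0'}

def make_modified_request(url):
    response = requests.get(url, headers=HEADERS)
    print(response.status_code, '-', url)

def make_request(executor, url):
    response = requests.get(url, headers=HEADERS)
    print(response.status_code, '-', url)
    arguments = [arg.split("=") for arg in url.split("?")[1].split("&")]
    for name, value in arguments:
        modified_url = url.replace('=' + value, '=1')
        # reuse the shared executor instead of creating a nested one
        executor.submit(make_modified_request, modified_url)

def runner(urls):
    with ThreadPoolExecutor(max_workers=10) as executor:
        outer = [executor.submit(make_request, executor, url) for url in urls]
        # wait for the outer tasks first, so every inner submit() happens
        # before the pool starts shutting down
        for f in outer:
            f.result()
    # leaving the `with` block then waits for the modified-url requests too

runner(['https://httpbin.org/get?a=A&b=B&c=C'])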
Related
I am very new to asynchronous programming and I was playing around with httpx. I have the following code and I am sure I am doing something wrong - I just don't know what it is. There are two methods, one synchronous and the other asynchronous. They both pull from Google Finance. On my system I am seeing the following times:
Asynchronous: 5.015218734741211
Synchronous: 5.173618316650391
Here is the code:
import httpx
import asyncio
import time
#
#--------------------------------------------------------------------
#
#--------------------------------------------------------------------
#
def sync_pull(url):
    r = httpx.get(url)
    print(r.status_code)
#
#--------------------------------------------------------------------
#
#--------------------------------------------------------------------
#
async def async_pull(url):
    async with httpx.AsyncClient() as client:
        r = await client.get(url)
        print(r.status_code)
#
#--------------------------------------------------------------------
#
#--------------------------------------------------------------------
#
if __name__ == "__main__":
    goog_fin_nyse_url = 'https://www.google.com/finance/quote/'
    tickers = ['F', 'TWTR', 'CVX', 'VZ', 'GME', 'GM', 'PG', 'AAL',
               'MARK', 'AAP', 'THO', 'NGD', 'ZSAN', 'SEAC',
               ]

    print("Running asynchronously...")
    async_start = time.time()
    for ticker in tickers:
        url = goog_fin_nyse_url + ticker + ':NYSE'
        asyncio.run(async_pull(url))
    async_end = time.time()
    print(f"Time lapsed is: {async_end - async_start}")

    print("Running synchronously...")
    sync_start = time.time()
    for ticker in tickers:
        url = goog_fin_nyse_url + ticker + ':NYSE'
        sync_pull(url)
    sync_end = time.time()
    print(f"Time lapsed is: {sync_end - sync_start}")
I had hoped the asynchronous approach would take a fraction of the time the synchronous approach requires. What am I doing wrong?
When you say asyncio.run(async_pull(url)) you're saying run async_pull and wait for the result to come back. Since you do this once per ticker in your loop, you're essentially using asyncio to run things synchronously and won't see performance benefits.
What you need to do is create several async calls and run them concurrently. There are several ways to do this; the easiest is to use asyncio.gather (see https://docs.python.org/3/library/asyncio-task.html#asyncio.gather), which takes in a sequence of coroutines and runs them concurrently. Adapting your code is fairly straightforward: you create an async function that takes a list of urls, calls async_pull on each of them, passes those coroutines to asyncio.gather, and awaits the results. Adapting your code to this looks like the following:
import httpx
import asyncio
import time

def sync_pull(url):
    r = httpx.get(url)
    print(r.status_code)

async def async_pull(url):
    async with httpx.AsyncClient() as client:
        r = await client.get(url)
        print(r.status_code)

async def async_pull_all(urls):
    return await asyncio.gather(*[async_pull(url) for url in urls])

if __name__ == "__main__":
    goog_fin_nyse_url = 'https://www.google.com/finance/quote/'
    tickers = ['F', 'TWTR', 'CVX', 'VZ', 'GME', 'GM', 'PG', 'AAL',
               'MARK', 'AAP', 'THO', 'NGD', 'ZSAN', 'SEAC',
               ]

    print("Running asynchronously...")
    async_start = time.time()
    results = asyncio.run(async_pull_all([goog_fin_nyse_url + ticker + ':NYSE' for ticker in tickers]))
    async_end = time.time()
    print(f"Time lapsed is: {async_end - async_start}")

    print("Running synchronously...")
    sync_start = time.time()
    for ticker in tickers:
        url = goog_fin_nyse_url + ticker + ':NYSE'
        sync_pull(url)
    sync_end = time.time()
    print(f"Time lapsed is: {sync_end - sync_start}")
Running this way, the asynchronous version runs in about a second for me as opposed to seven synchronously.
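A further tweak worth trying (a sketch, not part of the adaptation above): open a single AsyncClient and share it across all the requests, so httpx can reuse connections instead of creating a fresh client per URL.

import asyncio
import httpx

async def async_pull(client, url):
    # the client is created once by the caller and shared by every request
    r = await client.get(url)
    print(r.status_code)

async def async_pull_all(urls):
    async with httpx.AsyncClient() as client:
        return await asyncio.gather(*[async_pull(client, url) for url in urls])

# usage: asyncio.run(async_pull_all(list_of_urls))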
Here's a nice pattern I use (I tend to change it a little each time). In general, I make a module async_utils.py and just import the top-level fetching function (e.g. here fetch_things), and then my code is free to forget about the internals (other than error handling). You can do it in other ways, but I like the 'functional' style of aiostream, and often find the repeated calls to the process function take certain defaults I set using functools.partial.
Note: async currying with partials is Python 3.8+ only
You can pass in a tqdm.tqdm progress bar to pbar (initialised with known size total=len(things)) to have it update when each async response is processed.
import asyncio

import httpx
from aiostream import stream
from functools import partial

__all__ = ["fetch", "process_thing", "async_fetch_urlset", "fetch_things"]

async def fetch(session, url, raise_for_status=False):
    response = await session.get(str(url))
    if raise_for_status:
        response.raise_for_status()
    return response

async def process_thing(data, things, pbar=None, verbose=False):
    # Map the response back to the thing it came from in the things list
    source_url = data.history[0].url if data.history else data.url
    thing = next(t for t in things if source_url == t.get("thing_url"))
    # Handle `data.content` here, where `data` is the `httpx.Response`
    if verbose:
        print(f"Processing {source_url=}")
    thing.update({"computed_value": "result goes here"})
    if pbar:
        pbar.update()

async def async_fetch_urlset(urls, things, pbar=None, verbose=False, timeout_s=10.0):
    timeout = httpx.Timeout(timeout=timeout_s)
    async with httpx.AsyncClient(timeout=timeout) as session:
        ws = stream.repeat(session)
        xs = stream.zip(ws, stream.iterate(urls))
        ys = stream.starmap(xs, fetch, ordered=False, task_limit=20)
        process = partial(process_thing, things=things, pbar=pbar, verbose=verbose)
        zs = stream.map(ys, process)
        return await zs

def fetch_things(urls, things, pbar=None, verbose=False):
    return asyncio.run(async_fetch_urlset(urls, things, pbar, verbose))
In this example, the input is a list of dicts (with string keys and values), things: list[dict[str,str]], and the key "thing_url" is accessed to retrieve the URL. Having a dict or object is desirable instead of just the URL string for when you want to 'map' the result back to the object it came from. The process_thing function is able to modify the input list things in-place (i.e. any changes are not scoped within the function; they are visible in the scope that called it).
You'll often find errors arise during async runs that you don't get when running synchronously, so you'll need to catch them and retry. A common gotcha is to retry at the wrong level (e.g. around the entire loop).
In particular, you'll want to import and catch httpcore.ConnectTimeout, httpx.ConnectTimeout, httpx.RemoteProtocolError, and httpx.ReadTimeout.
Increasing the timeout_s parameter will reduce the frequency of the timeout errors by letting the AsyncClient 'wait' for longer, but doing so may in fact slow down your program (it won't "fail fast" quite as fast).
Here's an example of how to use the async_utils module given above:
from async_utils import fetch_things
import httpx
import httpcore

# UNCOMMENT THIS TO SEE ALL THE HTTPX INTERNAL LOGGING
#import logging
#log = logging.getLogger()
#log.setLevel(logging.DEBUG)
#log_format = logging.Formatter('[%(asctime)s] [%(levelname)s] - %(message)s')
#console = logging.StreamHandler()
#console.setLevel(logging.DEBUG)
#console.setFormatter(log_format)
#log.addHandler(console)

# the key name must match the "thing_url" key used in process_thing
things = [
    {"thing_url": "https://python.org", "name": "Python"},
    {"thing_url": "https://www.python-httpx.org/", "name": "HTTPX"},
]

#log.debug("URLSET:" + str(list(t.get("thing_url") for t in things)))

def make_urlset(things):
    """Make a URL generator (empty if all have been fetched)"""
    urlset = (t.get("thing_url") for t in things if "computed_value" not in t)
    return urlset

retryable_errors = (
    httpcore.ConnectTimeout,
    httpx.ConnectTimeout, httpx.RemoteProtocolError, httpx.ReadTimeout,
)

# ASYNCHRONOUS:
max_retries = 100
for i in range(max_retries):
    print(f"Retry {i}")
    try:
        urlset = make_urlset(things)
        foo = fetch_things(urls=urlset, things=things, verbose=True)
        break  # everything fetched and processed successfully
    except retryable_errors as exc:
        print(f"Caught {exc!r}")
        if i == max_retries - 1:
            raise
    except Exception:
        raise

# SYNCHRONOUS:
#for t in things:
#    resp = httpx.get(t["thing_url"])
In this example I set a key "computed_value" on a dictionary once the async response has successfully been processed, which then prevents that URL from being entered into the generator on the next round (when make_urlset is called again). In this way, the generator gets progressively smaller. You can also do it with lists, but I find a generator of the URLs to be pulled works reliably. For an object you'd change the dictionary key assignment/access (update/in) to attribute assignment/access (setattr/hasattr).
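If you want the tqdm progress bar mentioned earlier, a minimal sketch of how it could be passed in (assuming tqdm is installed; total=len(things) as described above):

from tqdm import tqdm

urlset = make_urlset(things)
with tqdm(total=len(things)) as pbar:
    # each call to process_thing advances the bar via pbar.update()
    fetch_things(urls=urlset, things=things, pbar=pbar, verbose=False)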
I wanted to post a working version of the code using futures - virtually the same run-time:
import httpx
import asyncio
import time
#
#--------------------------------------------------------------------
# Synchronous pull
#--------------------------------------------------------------------
#
def sync_pull(url):
    r = httpx.get(url)
    print(r.status_code)
#
#--------------------------------------------------------------------
# Asynchronous Pull
#--------------------------------------------------------------------
#
async def async_pull(url):
    async with httpx.AsyncClient() as client:
        r = await client.get(url)
        print(r.status_code)
#
#--------------------------------------------------------------------
# Build tasks queue & execute coroutines
#--------------------------------------------------------------------
#
async def build_task() -> None:
    goog_fin_nyse_url = 'https://www.google.com/finance/quote/'
    tickers = ['F', 'TWTR', 'CVX', 'VZ', 'GME', 'GM', 'PG', 'AAL',
               'MARK', 'AAP', 'THO', 'NGD', 'ZSAN', 'SEAC',
               ]
    tasks = []
    #
    ## The following block of code will create a queue full of function
    ## calls
    for ticker in tickers:
        url = goog_fin_nyse_url + ticker + ':NYSE'
        tasks.append(asyncio.ensure_future(async_pull(url)))
    start_time = time.time()
    #
    ## This block of code will dereference the function calls
    ## from the queue, which will cause them all to run
    ## rapidly
    await asyncio.gather(*tasks)
    #
    ## Calculate time lapsed
    finish_time = time.time()
    elapsed_time = finish_time - start_time
    print(f"\n Time spent processing: {elapsed_time} ")

# Start from here
if __name__ == "__main__":
    asyncio.run(build_task())
I am basically calling the API with various values coming from the list list_of_string_ids.
I expect to create 20 threads, tell them to do something, write the values to the DB, and then have them all return and pick up the next piece of data, etc.
I have a problem getting this to work using threading. Below is code that works correctly as expected, however it takes very long to finish executing (around 45 minutes or more). The website I am getting the data from allows async I/O at a rate of 20 requests.
I assume this could make my code up to 20x faster, but I'm not really sure how to implement it.
import requests
import json
import time
import threading
import queue

headers = {'Content-Type': 'application/json',
           'Authorization': 'Bearer TOKEN'}

start = time.perf_counter()

project_id_number = 123
project_id_string = 'pjiji4533'
name = "Assignment"
list_of_string_ids = [132,123,5345,123,213,213,...,n] # Len of list is 20000

def construct_url_threaded(project_id_number, id_string):
    url = f"https://api.test.com/{project_id_number}/{id_string}"
    r = requests.get(url, headers=headers)  # Max rate allowed is 20 requests at once.
    json_text = r.json()
    comments = json.dumps(json_text, indent=2)
    for item in json_text['data']:
        # DO STUFF
        pass

for string_id in list_of_string_ids:
    construct_url_threaded(project_id_number=project_id_number, id_string=string_id)
My trial is below
def main():
    q = queue.Queue()
    threads = [threading.Thread(target=create_url_threaded, args=(project_id_number, string_id, q)) for i in range(5)]  # 5 is for testing
    for th in threads:
        th.daemon = True
        th.start()

    result1 = q.get()
    result2 = q.get()
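One way to actually use the roughly 20 concurrent requests the API allows is a ThreadPoolExecutor with 20 workers. This is only a sketch, assuming construct_url_threaded stays exactly as defined above (it already does the request and the per-item processing):

from concurrent.futures import ThreadPoolExecutor, as_completed

def run_all(string_ids, max_workers=20):  # 20 = the concurrency the API allows
    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        futures = [
            executor.submit(construct_url_threaded, project_id_number, string_id)
            for string_id in string_ids
        ]
        for future in as_completed(futures):
            future.result()  # re-raises any exception from the worker thread

run_all(list_of_string_ids)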
I have a json file with a list of urls.
After reading the documentation, I figured multiprocessing.Pool was the best option for me.
I ran 10 urls with multiprocessing.Pool(10), expecting the results to be pretty much instant, but it takes about 12 seconds to complete everything. I'm not sure if I am using it correctly, but below is my code.
import json
import multiprocessing
import time
from functools import partial

import requests
import simplejson

def download_site(data, outputArr, proxyArr=None):
    session = requests.Session()
    # print("Scraping last name {lastName}".format(lastName=data['lastName']))
    userAgents = open('user-agents.txt').read().split('\n')
    params = (
        ('Name', data['lastName']),
        ('type', 'P'),
    )
    url = 'someurl'
    if not proxyArr:
        proxyArr = {
            'http': data['proxy']['http']
        }
    try:
        with session.get(url, params=params, proxies=proxyArr, headers=headers) as response:
            name = multiprocessing.current_process().name
            try:
                content = response.json()
                loadJson = json.loads(content)['nameBirthDetails']
                for case in loadJson:
                    dateFile = loadJson[case]['dateFiled']
                    year = int(dateFile.split('/')[-1])
                    if year > 2018:
                        profileDetailUrlParams = (
                            ('caseId', loadJson[case]['caseYear']),
                            ('caseType', 'WC'),
                            ('caseNumber', loadJson[case]['caseSeqNbr']),
                        )
                        loadJson[case]['caseDetail'] = getAttorneyData(profileDetailUrlParams, session, proxyArr)
                        outputArr.append(loadJson[case])
                        # print("Total Scraped Results so far ", len(outputArr))
            except (requests.exceptions.ConnectionError, json.decoder.JSONDecodeError):
                print("Error Found JSON DECODE ERROR - passing for last name", data['lastName'])
            except simplejson.errors.JSONDecodeError:
                print("Found Simple Json Error", data['lastName'])
                pass
                # newProxy = generate_random_proxy()
                # download_site(data, outputArr, newProxy)
    except:
        raise

def queueList(sites):
    manager = multiprocessing.Manager()
    outputArr = manager.list()
    functionMain = partial(download_site, outputArr=outputArr)
    p = multiprocessing.Pool(10)
    records = p.map(functionMain, sites)
    p.terminate()
    p.join()

if __name__ == "__main__":
    outputArr = []
    fileData = json.loads(open('lastNamesWithProxy.json').read())[:10]
    start_time = time.time()
    queueList(fileData)
    duration = time.time() - start_time
    print(f"Downloaded {len(fileData)} in {duration} seconds")
The download_site function is where I fetch a list via the requests library - then, for each item in the list, I make another request via the getAttorneyData function.
How can I further hone this to run faster? I have a high-end computer, so CPU shouldn't be an issue; I want to use it to its max potential.
My goal is to be able to spawn 10 workers and have each worker handle one request, so 10 requests would take me 1-2 seconds instead of the current 12.
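Since the bottleneck here is network I/O rather than CPU, one thing that may be worth trying is multiprocessing.pool.ThreadPool, which keeps the same map() interface but uses threads, so the process start-up cost and the Manager proxy go away. This is just a sketch of that swap, assuming download_site stays as written above:

from multiprocessing.pool import ThreadPool
from functools import partial

def queueList(sites):
    outputArr = []  # a plain list is fine when the workers are threads
    functionMain = partial(download_site, outputArr=outputArr)
    with ThreadPool(10) as pool:  # same interface as multiprocessing.Pool
        pool.map(functionMain, sites)
    return outputArr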
I am trying to download an Excel file from a specific website. On my local computer it works perfectly:
>>> r = requests.get('http://www.health.gov.il/PublicationsFiles/IWER01_2004.xls')
>>> r.status_code
200
>>> r.content
b'\xd0\xcf\x11\xe0\xa1\xb1...\x00\x00' # Long binary string
But when I run it from a remote Ubuntu server, I get a message related to enabling cookies/JavaScript.
r = requests.get('http://www.health.gov.il/PublicationsFiles/IWER01_2004.xls')
>>> r.status_code
200
>>> r.content
b'<HTML>\n<head>\n<script>\nChallenge=141020;\nChallengeId=120854618;\nGenericErrorMessageCookies="Cookies must be enabled in order to view this page.";\n</script>\n<script>\nfunction test(var1)\n{\n\tvar var_str=""+Challenge;\n\tvar var_arr=var_str.split("");\n\tvar LastDig=var_arr.reverse()[0];\n\tvar minDig=var_arr.sort()[0];\n\tvar subvar1 = (2 * (var_arr[2]))+(var_arr[1]*1);\n\tvar subvar2 = (2 * var_arr[2])+var_arr[1];\n\tvar my_pow=Math.pow(((var_arr[0]*1)+2),var_arr[1]);\n\tvar x=(var1*3+subvar1)*1;\n\tvar y=Math.cos(Math.PI*subvar2);\n\tvar answer=x*y;\n\tanswer-=my_pow*1;\n\tanswer+=(minDig*1)-(LastDig*1);\n\tanswer=answer+subvar2;\n\treturn answer;\n}\n</script>\n<script>\nclient = null;\nif (window.XMLHttpRequest)\n{\n\tvar client=new XMLHttpRequest();\n}\nelse\n{\n\tif (window.ActiveXObject)\n\t{\n\t\tclient = new ActiveXObject(\'MSXML2.XMLHTTP.3.0\');\n\t};\n}\nif (!((!!client)&&(!!Math.pow)&&(!!Math.cos)&&(!![].sort)&&(!![].reverse)))\n{\n\tdocument.write("Not all needed JavaScript methods are supported.<BR>");\n\n}\nelse\n{\n\tclient.onreadystatechange = function()\n\t{\n\t\tif(client.readyState == 4)\n\t\t{\n\t\t\tvar MyCookie=client.getResponseHeader("X-AA-Cookie-Value");\n\t\t\tif ((MyCookie == null) || (MyCookie==""))\n\t\t\t{\n\t\t\t\tdocument.write(client.responseText);\n\t\t\t\treturn;\n\t\t\t}\n\t\t\t\n\t\t\tvar cookieName = MyCookie.split(\'=\')[0];\n\t\t\tif (document.cookie.indexOf(cookieName)==-1)\n\t\t\t{\n\t\t\t\tdocument.write(GenericErrorMessageCookies);\n\t\t\t\treturn;\n\t\t\t}\n\t\t\twindow.location.reload(true);\n\t\t}\n\t};\n\ty=test(Challenge);\n\tclient.open("POST",window.location,true);\n\tclient.setRequestHeader(\'X-AA-Challenge-ID\', ChallengeId);\n\tclient.setRequestHeader(\'X-AA-Challenge-Result\',y);\n\tclient.setRequestHeader(\'X-AA-Challenge\',Challenge);\n\tclient.setRequestHeader(\'Content-Type\' , \'text/plain\');\n\tclient.send();\n}\n</script>\n</head>\n<body>\n<noscript>JavaScript must be enabled in order to view this page.</noscript>\n</body>\n</HTML>'
Locally I run macOS with Chrome installed (I'm not actively using it for the script, but maybe it's related?); remotely I run Ubuntu on DigitalOcean without any GUI browser installed.
The behavior of requests has nothing to do with what browsers are installed on the system; it does not depend on or interact with them in any way.
The problem here is that the resource you are requesting has some kind of "bot mitigation" mechanism enabled to prevent just this kind of access. It returns some javascript with logic that needs to be evaluated, and the results of that logic are then used for an additional request to "prove" you're not a bot.
Luckily, it appears that this specific mitigation mechanism has been solved before, and I was able to quickly get this request working utilizing the challenge-solving functions from that code:
from math import cos, pi, floor

import requests

URL = 'http://www.health.gov.il/PublicationsFiles/IWER01_2004.xls'


def parse_challenge(page):
    """
    Parse a challenge given by mmi and mavat's web servers, forcing us to solve
    some math stuff and send the result as a header to actually get the page.
    This logic is pretty much copied from https://github.com/R3dy/jigsaw-rails/blob/master/lib/breakbot.rb
    """
    top = page.split('<script>')[1].split('\n')
    challenge = top[1].split(';')[0].split('=')[1]
    challenge_id = top[2].split(';')[0].split('=')[1]
    return {'challenge': challenge, 'challenge_id': challenge_id, 'challenge_result': get_challenge_answer(challenge)}


def get_challenge_answer(challenge):
    """
    Solve the math part of the challenge and get the result
    """
    arr = list(challenge)
    last_digit = int(arr[-1])
    arr.sort()
    min_digit = int(arr[0])
    subvar1 = (2 * int(arr[2])) + int(arr[1])
    subvar2 = str(2 * int(arr[2])) + arr[1]
    power = ((int(arr[0]) * 1) + 2) ** int(arr[1])
    x = (int(challenge) * 3 + subvar1)
    y = cos(pi * subvar1)
    answer = x * y
    answer -= power
    answer += (min_digit - last_digit)
    answer = str(int(floor(answer))) + subvar2
    return answer


def main():
    s = requests.Session()
    r = s.get(URL)

    if 'X-AA-Challenge' in r.text:
        challenge = parse_challenge(r.text)
        r = s.get(URL, headers={
            'X-AA-Challenge': challenge['challenge'],
            'X-AA-Challenge-ID': challenge['challenge_id'],
            'X-AA-Challenge-Result': challenge['challenge_result']
        })

        yum = r.cookies
        r = s.get(URL, cookies=yum)

    print(r.content)


if __name__ == '__main__':
    main()
You can use this code to avoid the block (it uses requests-html, whose render() call executes the page's JavaScript):
from requests_html import HTMLSession

url = 'your url goes here'

s = HTMLSession()
s.headers['user-agent'] = 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/96.0.4664.45 Safari/537.36'

r = s.get(url)
r.html.render(timeout=8000)

print(r.status_code)
print(r.content)
My current code, as it stands, prints an empty list. How do I wait for all requests and callbacks to finish before continuing with the code flow?
from requests_futures.sessions import FuturesSession
from time import sleep

session = FuturesSession(max_workers=100)

i = 1884001540 - 100
list = []

def testas(session, resp):
    print(resp)
    resp = resp.json()
    print(resp['participants'][0]['stats']['kills'])
    list.append(resp['participants'][0]['stats']['kills'])

while i < 1884001540:
    url = "https://acs.leagueoflegends.com/v1/stats/game/NA1/" + str(i)
    temp = session.get(url, background_callback=testas)
    i += 1

print(list)
From looking at session.py in requests-futures-0.9.5.tar.gz, it's necessary to create a future in order to wait for its result, as shown in this code:
from requests_futures.sessions import FuturesSession
session = FuturesSession()
# request is run in the background
future = session.get('http://httpbin.org/get')
# ... do other stuff ...
# wait for the request to complete, if it hasn't already
response = future.result()
print('response status: {0}'.format(response.status_code))
print(response.content)
As shown in the README.rst, a future can and should be created for every session.get() and then waited on to complete.
This might be applied in your code as follows, starting just before the while loop:
future = []
while i < 1884001540:
    url = "https://acs.leagueoflegends.com/v1/stats/game/NA1/" + str(i)
    future.append(session.get(url, background_callback=testas))
    i += 1

for f in future:
    response = f.result()
    # the following print statements may be useful for debugging
    # print('response status: {0}'.format(response.status_code))
    # print(response.content, "\n")

print(list)
I'm not sure how your system will respond to a very large number of futures, and another way to do it is by processing them in smaller groups, say 100 or 1000 at a time. It might be wise to test the script with a relatively small number of them at the beginning to find out how fast they return results.
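A minimal sketch of that batching idea, reusing session and testas from the question; the chunk size of 1000 is arbitrary:

CHUNK = 1000
start_id = 1884001540 - 100
end_id = 1884001540

for chunk_start in range(start_id, end_id, CHUNK):
    futures = []
    for i in range(chunk_start, min(chunk_start + CHUNK, end_id)):
        url = "https://acs.leagueoflegends.com/v1/stats/game/NA1/" + str(i)
        futures.append(session.get(url, background_callback=testas))
    for f in futures:
        f.result()  # wait for this chunk to finish before starting the next

print(list)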
From the project page at https://pypi.python.org/pypi/requests-futures it says:
from requests_futures.sessions import FuturesSession
session = FuturesSession()
# first request is started in background
future_one = session.get('http://httpbin.org/get')
# second requests is started immediately
future_two = session.get('http://httpbin.org/get?foo=bar')
# wait for the first request to complete, if it hasn't already
response_one = future_one.result()
So it seems that .result() is what you are looking for.