Improve speed of current multiprocessing.Pool() - Python

I have a JSON file with a list of URLs.
After reading the documentation, I figured multiprocessing.Pool was the best option for me.
I ran 10 URLs with multiprocessing.Pool(10) and expected the results to be pretty much instant, but it takes about 12 seconds to complete everything. I'm not sure if I am using it correctly; my code is below.
def download_site(data, outputArr, proxyArr=None):
    session = requests.Session()
    # print("Scraping last name {lastName}".format(lastName=data['lastName']))
    userAgents = open('user-agents.txt').read().split('\n')
    params = (
        ('Name', data['lastName']),
        ('type', 'P'),
    )
    url = 'someurl'
    if not proxyArr:
        proxyArr = {
            'http': data['proxy']['http']
        }
    try:
        with session.get(url, params=params, proxies=proxyArr, headers=headers) as response:
            name = multiprocessing.current_process().name
            try:
                content = response.json()
                loadJson = json.loads(content)['nameBirthDetails']
                for case in loadJson:
                    dateFile = loadJson[case]['dateFiled']
                    year = int(dateFile.split('/')[-1])
                    if year > 2018:
                        profileDetailUrlParams = (
                            ('caseId', loadJson[case]['caseYear']),
                            ('caseType', 'WC'),
                            ('caseNumber', loadJson[case]['caseSeqNbr']),
                        )
                        loadJson[case]['caseDetail'] = getAttorneyData(profileDetailUrlParams, session, proxyArr)
                        outputArr.append(loadJson[case])
                        # print("Total Scraped Results so far ", len(outputArr))
            except (requests.exceptions.ConnectionError, json.decoder.JSONDecodeError):
                print("Error Found JSON DECODE ERROR - passing for last name", data['lastName'])
            except simplejson.errors.JSONDecodeError:
                print("Found Simple Json Error", data['lastName'])
                pass
                # newProxy = generate_random_proxy()
                # download_site(data, outputArr, newProxy)
    except:
        raise

def queueList(sites):
    manager = multiprocessing.Manager()
    outputArr = manager.list()
    functionMain = partial(download_site, outputArr=outputArr)
    p = multiprocessing.Pool(10)
    records = p.map(functionMain, sites)
    p.terminate()
    p.join()

if __name__ == "__main__":
    outputArr = []
    fileData = json.loads(open('lastNamesWithProxy.json').read())[:10]
    start_time = time.time()
    queueList(fileData)
    duration = time.time() - start_time
    print(f"Downloaded {len(fileData)} in {duration} seconds")
The function download_site is where I fetch a list via the requests library; then, for each item in that list, I make another request via the function getAttorneyData.
How can I tune this further to run faster? I have a high-end computer, so CPU shouldn't be an issue, and I want to use it to its full potential.
My goal is to spawn 10 workers and have each worker handle one request, so that 10 requests take 1-2 seconds instead of the 12 they currently take.
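Since each worker spends nearly all of its time waiting on the network rather than using the CPU, the same overlap can be had with a thread pool, which avoids process start-up, pickling, and the Manager proxy for the shared list. Below is a minimal sketch of that approach, assuming download_site stays as written above; the input file name and data layout are taken from the question, and this is untested against the real endpoint.

import json
import time
from concurrent.futures import ThreadPoolExecutor

def queueListThreaded(sites, workers=10):
    outputArr = []  # threads share memory, so a plain list works; no Manager proxy needed
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # one task per record; exiting the with-block waits for them all,
        # and consuming the iterator re-raises any worker exception
        list(pool.map(lambda data: download_site(data, outputArr), sites))
    return outputArr

if __name__ == "__main__":
    fileData = json.loads(open('lastNamesWithProxy.json').read())[:10]
    start_time = time.time()
    results = queueListThreaded(fileData)
    print(f"Downloaded {len(fileData)} in {time.time() - start_time:.1f} seconds")

Note that each task still performs the search request plus one getAttorneyData call per matching case sequentially, so the total time is bounded by the slowest record/proxy chain, not by the worker count.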

Related

Threading in Python: limiting the number of threads and passing a list of different values as arguments

I am basically calling the API with various values coming from the list list_of_string_ids.
I expect to create 20 threads, tell them to do something, write the values to the DB, and then have each return and pick up the next piece of data, and so on.
I am having trouble getting this to work using threading. Below is code that works correctly as expected, but it takes very long to finish execution (around 45 minutes or more). The website I am getting the data from allows async I/O at a rate of 20 requests at once.
I assume this could make my code up to 20x faster, but I am not really sure how to implement it.
import requests
import json
import time
import threading
import queue
headers = {'Content-Type': 'application/json',
           'Authorization': 'Bearer TOKEN'}
start = time.perf_counter()
project_id_number = 123
project_id_string = 'pjiji4533'
name = "Assignment"
list_of_string_ids = [132,123,5345,123,213,213,...,n] # Len of list is 20000

def construct_url_threaded(project_id_number, id_string):
    url = "https://api.test.com/{}/{}".format(project_id_number, id_string)
    r = requests.get(url, headers=headers)  # Max rate allowed is 20 requests at once.
    json_text = r.json()
    comments = json.dumps(json_text, indent=2)
    for item in json_text['data']:
        pass  # DO STUFF

for string_id in list_of_string_ids:
    construct_url_threaded(project_id_number=project_id_number, id_string=string_id)
My attempt is below:
def main():
    q = queue.Queue()
    threads = [threading.Thread(target=create_url_threaded, args=(project_id_number, string_id, q)) for i in range(5)]  # 5 is for testing
    for th in threads:
        th.daemon = True
        th.start()
    result1 = q.get()
    result2 = q.get()
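For reference, concurrent.futures.ThreadPoolExecutor can cap concurrency at the site's limit of 20 while still working through all 20,000 ids, and it handles thread lifecycle and result collection. A minimal sketch, under the assumption that the request from construct_url_threaded is refactored to return the parsed payload; fetch_one and fetch_all are hypothetical helpers, and the endpoint, headers, and variable names come from the question:

import requests
from concurrent.futures import ThreadPoolExecutor, as_completed

def fetch_one(project_id_number, id_string):
    # same request as construct_url_threaded, but returning the payload instead of processing it
    url = "https://api.test.com/{}/{}".format(project_id_number, id_string)
    r = requests.get(url, headers=headers)
    return id_string, r.json()

def fetch_all(project_id_number, id_strings, max_workers=20):
    results = {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = [pool.submit(fetch_one, project_id_number, s) for s in id_strings]
        for future in as_completed(futures):        # yields each future as its request finishes
            id_string, payload = future.result()
            results[id_string] = payload            # or write to the DB here
    return results

# data = fetch_all(project_id_number, list_of_string_ids)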

Asynchronous error handling and response processing of an unbounded list of tasks using zeep

So here is my use case:
I read from a database rows containing information to make a complex SOAP call (I'm using zeep to do these calls).
One row from the database corresponds to a request to the service.
There can be up to 20 thousand lines, so I don't want to read everything in memory before making the calls.
I need to process the responses: when the response is OK, I need to store some returned information back into my database, and when there is an exception I need to process the exception for that particular request/response pair.
I also need to capture some external information at the time of request creation, so that I know where to store the response. In my current code I'm using the delightful property of gather() that the results come back in the same order.
I read the relevant PEPs and Python documentation but I'm still very confused, as there seems to be multiple ways to solve the same problem.
I also went through countless exercises on the web, but the examples are all trivial - it's either asyncio.sleep() or some webscraping with a finite list of urls.
The solution I have come up with so far kind of works: the asyncio.gather() method is very, very useful, but I have not been able to 'feed' it from a generator. I'm currently just counting up to an arbitrary size and then starting a .gather() operation. I've transcribed the code below, with the boring parts left out, and I've tried to anonymise it.
I've tried solutions involving semaphores, queues, and different event loops, but I'm failing every time. Ideally I'd like to be able to create Futures 'continuously'; I think I'm missing the logic of 'convert this awaitable call to a Future'.
I'd be grateful for any help!
import asyncio
from asyncio import Future

import zeep
from zeep.plugins import HistoryPlugin

history = HistoryPlugin()
max_concurrent_calls = 5
provoke_errors = True

def export_data_async(db_variant: str, order_nrs: set):
    st = time.time()
    results = []
    loop = asyncio.get_event_loop()

    def get_client1(service_name: str, system: Systems = Systems.ACME) -> Tuple[zeep.Client, zeep.client.Factory]:
        client1 = zeep.Client(wsdl=system.wsdl_url(service_name=service_name),
                              transport=transport,
                              plugins=[history],
                              )
        factory_ns2 = client1.type_factory(namespace='ns2')
        return client1, factory_ns2

    table = 'ZZZZ'
    moveback_table = 'EEEEEE'
    moveback_dict = create_default_empty_ordered_dict('attribute1 attribute2 attribute3 attribute3')

    client, factory = get_client1(service_name='ACMEServiceName')
    if log.isEnabledFor(logging.DEBUG):
        client.wsdl.dump()
        zeep_log = logging.getLogger('zeep.transports')
        zeep_log.setLevel(logging.DEBUG)

    with Db(db_variant) as db:
        db.open_db(CON_STRING[db_variant])
        db.init_table_for_read(table, order_list=order_nrs)

        counter_failures = 0
        tasks = []
        sids = []
        results = []

        def handle_future(future: Future) -> None:
            results.extend(future.result())

        def process_tasks_concurrently() -> None:
            nonlocal tasks, sids, counter_failures, results
            futures = asyncio.gather(*tasks, return_exceptions=True)
            futures.add_done_callback(handle_future)
            loop.run_until_complete(futures)
            for i, response_or_fault in enumerate(results):
                if type(response_or_fault) in [zeep.exceptions.Fault, zeep.exceptions.TransportError]:
                    counter_failures += 1
                    log_webservice_fault(sid=sids[i], db=db, err=response_or_fault, object=table)
                else:
                    db.write_dict_to_table(
                        moveback_table,
                        {'sid': sids[i],
                         'attribute1': response_or_fault['XXX']['XXX']['xxx'],
                         'attribute2': response_or_fault['XXX']['XXX']['XXXX']['XXX'],
                         'attribute3': response_or_fault['XXXX']['XXXX']['XXX'],
                         }
                    )
            db.commit_db_con()
            tasks = []
            sids = []
            results = []
            return

        for row in db.rows(table):
            if int(row.id) % 2 == 0 and provoke_errors:
                payload = faulty_message_payload(row=row,
                                                 factory=factory,
                                                 )
            else:
                payload = message_payload(row=row,
                                          factory=factory,
                                          )
            tasks.append(client.service.myRequest(
                MessageHeader=factory.MessageHeader(**message_header_arguments(row=row)),
                myRequestPayload=payload,
                _soapheaders=[security_soap_header],
            ))
            sids.append(row.sid)
            if len(tasks) == max_concurrent_calls:
                process_tasks_concurrently()

        if tasks:  # this is the remainder of len(db.rows) % max_concurrent_calls
            process_tasks_concurrently()

        loop.run_until_complete(transport.session.close())
        db.execute_this_statement(statement=update_sql)
        db.commit_db_con()
        log.info(db.activity_log)
        if counter_failures:
            log.info(f"{table :<25} Count failed: {counter_failures}")

    print("time async: %.2f" % (time.time() - st))
    return results
Failed attempt with Queue: (blocks at await client.service)
loop = asyncio.get_event_loop()
counter = 0
results = []

async def payload_generator(db_variant: str, order_nrs: set):
    # code that generates the data for the request
    yield counter, row, payload

async def service_call_worker(queue, results):
    while True:
        counter, row, payload = await queue.get()
        results.append(await client.service.myServicename(
            MessageHeader=calculate_message_header(row=row),
            myPayload=payload,
            _soapheaders=[security_soap_header],
        ))
        print(colorama.Fore.BLUE + f'after result returned {counter}')
        # Here do the relevant processing of response or error
        queue.task_done()

async def main_with_q():
    n_workers = 3
    queue = asyncio.Queue(n_workers)
    e = pprint.pformat(queue)
    p = payload_generator(DB_VARIANT, order_list_from_args())
    results = []
    workers = [asyncio.create_task(service_call_worker(queue, results))
               for _ in range(n_workers)]
    async for c in p:
        await queue.put(c)
    await queue.join()  # wait for all tasks to be processed
    for worker in workers:
        worker.cancel()

if __name__ == '__main__':
    try:
        loop.run_until_complete(main_with_q())
        loop.run_until_complete(transport.session.close())
    finally:
        loop.close()
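For reference, the queue-based attempt above is structurally close to a working pattern; the usual reason it blocks at await client.service... is that the default zeep client is synchronous, so the call never yields control back to the event loop. Below is a minimal sketch of the same producer/consumer shape with the (possibly blocking) SOAP call pushed to a thread via run_in_executor; call_service, handle_result, and handle_error are hypothetical stand-ins for the actual request, database write, and fault logging, not part of the original code.

import asyncio

async def worker(queue, call_service, handle_result, handle_error):
    loop = asyncio.get_running_loop()
    while True:
        sid, payload = await queue.get()
        try:
            # run the blocking SOAP call in a thread so the event loop stays free
            response = await loop.run_in_executor(None, call_service, payload)
            handle_result(sid, response)
        except Exception as err:            # zeep Fault / TransportError end up here
            handle_error(sid, err)
        finally:
            queue.task_done()

async def run_requests(rows, call_service, handle_result, handle_error, n_workers=5):
    queue = asyncio.Queue(maxsize=n_workers)        # bounded, so rows are pulled lazily
    workers = [asyncio.create_task(worker(queue, call_service, handle_result, handle_error))
               for _ in range(n_workers)]
    for sid, payload in rows:                       # rows can be a generator over the DB cursor
        await queue.put((sid, payload))             # blocks while all workers are busy
    await queue.join()                              # wait for the last responses to be handled
    for w in workers:
        w.cancel()

# asyncio.run(run_requests(row_generator(), call_service, handle_result, handle_error))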

How can I get the Google Calendar API status_code in Python when listing events?

I am trying to use the Google Calendar API:
events_result = service.events().list(calendarId=calendarId,
                                      timeMax=now,
                                      alwaysIncludeEmail=True,
                                      maxResults=100, singleEvents=True,
                                      orderBy='startTime').execute()
Everything is OK when I have permission to access the calendarId, but it throws errors when I don't have permission for that calendarId.
I built an autoload.py function with the schedule package to load events every 10 minutes; this function stops whenever an error comes up, and I have to use an SSH terminal to restart autoload.py manually.
So I want to know:
How can I get the status_code so that, for example, if it is 404, Python will pass and move on?
Answer:
You can use a try/except block within a loop to go through all your calendars, and skip over accesses which throw an error.
Code Example:
To get the error code, make sure to import json:
import json
and then you can get the error code out of the Exception:
calendarIds = ["calendar ID 1", "calendar ID 2", "calendar Id 3", "etc"]

for i in calendarIds:
    try:
        events_result = service.events().list(calendarId=i,
                                              timeMax=now,
                                              alwaysIncludeEmail=True,
                                              maxResults=100, singleEvents=True,
                                              orderBy='startTime').execute()
    except Exception as e:
        print(json.loads(e.content)['error']['code'])
        continue
Further Reading:
Python Try Except - w3schools
Python For Loops - w3schools
Thanks to @Rafa Guillermo, I uploaded the full code of the autoload.py program below, but I also want to know how to get the response JSON or status_code for a Google API request.
The solution:
try:
    code here
except Exception as e:
    continue
import schedule
import time
from datetime import datetime
import dir
import sqlite3
from project.function import cmsCalendar as cal

db_file = str(dir.dir) + '/admin.sqlite'

def get_list_shop_from_db(db_file):
    cur = sqlite3.connect(db_file).cursor()
    query = cur.execute('SELECT * FROM Shop')
    colname = [d[0] for d in query.description]
    result_list = [dict(zip(colname, r)) for r in query.fetchall()]
    cur.close()
    cur.connection.close()
    return result_list

def auto_load_google_database(list_shop, calendarError=False):
    shopId = 0
    for shop in list_shop:
        try:
            shopId = shopId + 1
            print("dang ghi vao shop", shopId)  # "writing to shop <id>"
            service = cal.service_build()
            shop_step_time_db = list_shop[shopId]['shop_step_time']
            shop_duration_db = list_shop[shopId]['shop_duration']
            slot_available = list_shop[shopId]['shop_slots']
            slot_available = int(slot_available)
            workers = list_shop[shopId]['shop_workers']
            workers = int(workers)
            calendarId = list_shop[shopId]['shop_calendarId']
            if slot_available > workers:
                a = workers
            else:
                a = slot_available
            if shop_duration_db == None:
                shop_duration_db = '30'
            if shop_step_time_db == None:
                shop_step_time_db = '15'
            shop_duration = int(shop_duration_db)
            shop_step_time = int(shop_step_time_db)
            shop_start_time = list_shop[shopId]['shop_start_time']
            shop_start_time = datetime.strptime(shop_start_time, "%H:%M:%S.%f").time()
            shop_end_time = list_shop[shopId]['shop_end_time']
            shop_end_time = datetime.strptime(shop_end_time, "%H:%M:%S.%f").time()
            # capacity of each time slot, taken from the JSON file WorkShop.js
            booking_status = cal.auto_load_listtimes(service, shopId, calendarId, shop_step_time, shop_duration, a,
                                                     shop_start_time,
                                                     shop_end_time)
        except Exception as e:
            continue

def main():
    list_shop = get_list_shop_from_db(db_file)
    auto_load_google_database(list_shop)

if __name__ == '__main__':
    main()
    schedule.every(5).minutes.do(main)
    while True:
        # Checks whether a scheduled task
        # is pending to run or not
        schedule.run_pending()
        time.sleep(1)
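On the follow-up about the status code: the Google API Python client raises googleapiclient.errors.HttpError for failed requests, and that exception carries both the HTTP status and the response body. A small sketch, assuming the same service, calendarId, and now variables used above, so only the expected errors get skipped:

import json
from googleapiclient.errors import HttpError

try:
    events_result = service.events().list(calendarId=calendarId,
                                          timeMax=now,
                                          alwaysIncludeEmail=True,
                                          maxResults=100, singleEvents=True,
                                          orderBy='startTime').execute()
except HttpError as e:
    status = e.resp.status              # numeric HTTP status, e.g. 404
    body = json.loads(e.content)        # parsed error JSON returned by the API
    if status in (403, 404):            # no permission / unknown calendar: skip this shop
        pass
    else:
        raise                           # anything unexpected should still surface

Catching HttpError instead of a bare Exception also means real bugs in the loop body are not silently swallowed.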

Python asyncio and aiohttp slowing down after 150+ requests

I'm using asyncio and aiohttp to make an async scraper.
For some reason, after I hit 150+ requests it starts slowing down. The first async run, where I get the links, works fine. The second one, where I get the chapters, is where the slowness happens: after around 200 requests it needs a minute per request. Any idea why? Am I using asyncio or aiohttp incorrectly?
Edit: I'm running this locally on a machine with 7 GB of RAM, so I don't think I'm running out of memory.
import aiohttp
import asyncio
import async_timeout
import re
from lxml import html
import timeit
from os import makedirs, chmod

basepath = ""
start = timeit.default_timer()
novel = ""
novel = re.sub(r"[^a-zA-Z0-9 ]+/", "", novel)
novel = re.sub(r" ", "-", novel)
novel_url = {}

@asyncio.coroutine
def get(*args, **kwargs):
    response = yield from aiohttp.request('GET', *args, **kwargs)
    return (yield from response.text())

def scrape_links(page):
    url = html.fromstring(page)
    links = url.xpath("")
    chapter_count = url.xpath("")
    dictonaries = dict(zip(chapter_count, links))
    novel_url.update(dictonaries)

def print_links(query):
    # Makedirs and apply chmod
    makedirs('%s/%s' % (basepath, query), exist_ok=True)
    makedirs('%s/%s/img' % (basepath, query), exist_ok=True)
    chmod('%s/%s' % (basepath, query), 0o765)
    chmod('%s/%s/img/' % (basepath, query), 0o765)
    url = 'https://www.examplesite.org/' + query
    page = yield from get(url, compress=True)
    magnet = scrape_links(page)

loop = asyncio.get_event_loop()
f = asyncio.wait([print_links(novel)])
loop.run_until_complete(f)

##### now getting chapters from links array
def scrape_chapters(page, i):
    url = html.fromstring(page)
    title = url.xpath("")
    title = ''.join(title)
    title = re.sub(r"", "", title)
    chapter = url.xpath("")
    # Use this to join them instead of looping through if it doesn't work in epub maker
    # chapter = '\n'.join(chapter)
    print(title)
    # file = open("%s/%s/%s-%s.html" % (basepath, novel, novel, i), 'w+')
    # file.write("<h1>%s</h1>" % title)
    # for x in chapter:
    #     file.write("\n<p>%s</p>" % x)
    # file.close()

def print_chapters(query):
    chapter = (str(query[0]))
    chapter_count = re.sub(r"CH ", "", chapter)
    page = yield from get(query[1], compress=True)
    chapter = scrape_chapters(page, chapter_count)

loop = asyncio.get_event_loop()
f = asyncio.wait([print_chapters(d) for d in novel_url.items()])
loop.run_until_complete(f)

stop = timeit.default_timer()
print("\n")
print(stop - start)
Could it be due to the limit on aiohttp.ClientSession connections?
https://docs.aiohttp.org/en/latest/http_request_lifecycle.html#how-to-use-the-clientsession
You might try passing a connector with a larger limit: https://docs.aiohttp.org/en/latest/client_advanced.html#limiting-connection-pool-size
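A minimal sketch of that suggestion, assuming the per-chapter requests are rewritten to share one ClientSession: create a TCPConnector with a higher limit than the default of 100 and reuse the session for every request, instead of calling aiohttp.request() each time as the code above does.

import asyncio
import aiohttp

async def fetch(session, url):
    async with session.get(url) as response:
        return await response.text()

async def fetch_all(urls, limit=200):
    connector = aiohttp.TCPConnector(limit=limit)           # default limit is 100 connections
    async with aiohttp.ClientSession(connector=connector) as session:
        # one shared session/connection pool for every chapter request
        return await asyncio.gather(*(fetch(session, url) for url in urls))

# pages = asyncio.get_event_loop().run_until_complete(fetch_all(list(novel_url.values())))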

Python: wait for requests_futures.sessions to finish before continuing with the code flow

My current code, as it stands, prints an empty list. How do I wait for all requests and callbacks to finish before continuing with the code flow?
from requests_futures.sessions import FuturesSession
from time import sleep

session = FuturesSession(max_workers=100)
i = 1884001540 - 100
list = []

def testas(session, resp):
    print(resp)
    resp = resp.json()
    print(resp['participants'][0]['stats']['kills'])
    list.append(resp['participants'][0]['stats']['kills'])

while i < 1884001540:
    url = "https://acs.leagueoflegends.com/v1/stats/game/NA1/" + str(i)
    temp = session.get(url, background_callback=testas)
    i += 1

print(list)
From looking at session.py in requests-futures-0.9.5.tar.gz, it's necessary to keep the future returned by each request in order to wait for its result, as shown in this code:
from requests_futures.sessions import FuturesSession
session = FuturesSession()
# request is run in the background
future = session.get('http://httpbin.org/get')
# ... do other stuff ...
# wait for the request to complete, if it hasn't already
response = future.result()
print('response status: {0}'.format(response.status_code))
print(response.content)
As shown in the README.rst a future can and should be created for every session.get() and waited on to complete.
This might be applied in your code as follows starting just before the while loop:
future = []
while i < 1884001540:
    url = "https://acs.leagueoflegends.com/v1/stats/game/NA1/" + str(i)
    future.append(session.get(url, background_callback=testas))
    i += 1

for f in future:
    response = f.result()
    # the following print statements may be useful for debugging
    # print('response status: {0}'.format(response.status_code))
    # print(response.content, "\n")

print(list)
I'm not sure how your system will respond to a very large number of futures, and another way to do it is to process them in smaller groups, say 100 or 1000 at a time. It might be wise to test the script with a relatively small number of them at first to find out how fast they return results.
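A minimal sketch of that batching idea; requests-futures returns standard concurrent.futures futures, so they can be drained in groups with as_completed. The URL pattern, session, and testas callback come from the question, and BATCH_SIZE is an arbitrary choice.

from concurrent.futures import as_completed

BATCH_SIZE = 100
start_id = 1884001540 - 100
end_id = 1884001540

for batch_start in range(start_id, end_id, BATCH_SIZE):
    batch = [session.get("https://acs.leagueoflegends.com/v1/stats/game/NA1/" + str(i),
                         background_callback=testas)
             for i in range(batch_start, min(batch_start + BATCH_SIZE, end_id))]
    for future in as_completed(batch):      # yields each future as its request completes
        future.result()                     # re-raises any request exception here

print(list)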
From here, https://pypi.python.org/pypi/requests-futures, it says:
from requests_futures.sessions import FuturesSession
session = FuturesSession()
# first request is started in background
future_one = session.get('http://httpbin.org/get')
# second requests is started immediately
future_two = session.get('http://httpbin.org/get?foo=bar')
# wait for the first request to complete, if it hasn't already
response_one = future_one.result()
So it seems that .result() is what you are looking for.
