I am trying to open multiple web sessions and save the data into CSV files. I have written my code using a for loop and requests.get, but it takes very long to access about 90 web locations. Can anyone let me know how to run the whole process in parallel over loc_var?
The code works fine; the only issue is that it runs one loc_var at a time, which takes a long time.
I want to access all the loc_var URLs in the for loop in parallel and do the CSV write operation.
Below is the code:
import pandas as pd
import numpy as np
import os
import requests
import datetime
import zipfile
t=datetime.date.today()-datetime.timedelta(2)
server = [("A","web1",":5000","username=usr&password=p7Tdfr")]
'''List of all web_ips'''
web_1 = ["Web1","Web2","Web3","Web4","Web5","Web6","Web7","Web8","Web9","Web10","Web11","Web12","Web13","Web14","Web15"]
'''List of All location'''
loc_var =["post1","post2","post3","post4","post5","post6","post7","post8","post9","post10","post11","post12","post13","post14","post15","post16","post17","post18"]
for s, web, port, usr in server:
    login_url = 'http://' + web + port + '/api/v1/system/login/?' + usr
    print(login_url)
    s = requests.session()
    login_response = s.post(login_url)
    print("login Response", login_response)

    # Start accessing the web for each location in loc_var
    for mkt in loc_var:
        # Output is a CSV file
        com_actions_url = 'http://' + web + port + '/api/v1/3E+date(%5C%22' + str(t) + '%5C%22)and+location+%3D%3D+%27' + mkt + '%27%22&page_size=-1&format=%22csv%22'
        print("com_action_url", com_actions_url)
        r = s.get(com_actions_url)
        print("action", r)
        if r.ok:
            with open(os.path.join("/home/Reports_DC/", "relation_%s.csv" % mkt), 'wb') as f:
                f.write(r.content)
        # If the location is not accessible, retry with the servers in the web_1 list
        if not r.ok:
            while not r.ok:
                for web_2 in web_1:
                    login_url = 'http://' + web_2 + port + '/api/v1/system/login/?' + usr
                    com_actions_url = 'http://' + web_2 + port + '/api/v1/3E+date(%5C%22' + str(t) + '%5C%22)and+location+%3D%3D+%27' + mkt + '%27%22&page_size=-1&format=%22csv%22'
                    login_response = s.post(login_url)
                    print("login Response", login_response)
                    print("com_action_url", com_actions_url)
                    r = s.get(com_actions_url)
                    if r.ok:
                        with open(os.path.join("/home/Reports_DC/", "relation_%s.csv" % mkt), 'wb') as f:
                            f.write(r.content)
                        break
There are multiple approaches you can take to make concurrent HTTP requests. Two that I've used are (1) multiple threads with concurrent.futures.ThreadPoolExecutor or (2) sending the requests asynchronously using asyncio/aiohttp.
To use a thread pool to send your requests in parallel, you would first generate a list of URLs that you want to fetch in parallel (in your case generate a list of login_urls and com_action_urls), and then you would request all of the URLs concurrently as follows:
from concurrent.futures import ThreadPoolExecutor
import requests

def fetch(url):
    page = requests.get(url)
    return page.text
    # Catch HTTP errors/exceptions here

pool = ThreadPoolExecutor(max_workers=5)

urls = ['http://www.google.com', 'http://www.yahoo.com', 'http://www.bing.com']  # Create a list of urls

for page in pool.map(fetch, urls):
    # Do whatever you want with the results ...
    print(page[0:100])
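Applied to the question at the top, a rough sketch might look like the following. It assumes the logged-in requests session s, plus web, port, t and loc_var from the original code, and moves the CSV write into the worker (sharing one session across threads is usually fine for plain GETs; if you see odd behaviour, create one session per worker instead):

from concurrent.futures import ThreadPoolExecutor
import os

def fetch_location(mkt):
    # Assumes s (the logged-in requests session), web, port and t
    # are already defined as in the question's code.
    com_actions_url = ('http://' + web + port + '/api/v1/3E+date(%5C%22' + str(t) +
                       '%5C%22)and+location+%3D%3D+%27' + mkt +
                       '%27%22&page_size=-1&format=%22csv%22')
    r = s.get(com_actions_url)
    if r.ok:
        with open(os.path.join("/home/Reports_DC/", "relation_%s.csv" % mkt), 'wb') as f:
            f.write(r.content)
    return mkt, r.status_code

with ThreadPoolExecutor(max_workers=10) as pool:
    for mkt, status in pool.map(fetch_location, loc_var):
        print(mkt, status)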
Using asyncio/aiohttp is generally faster than the threaded approach above, but the learning curve is steeper. Here is a simple example (Python 3.7+):
import asyncio
import aiohttp

urls = ['http://www.google.com', 'http://www.yahoo.com', 'http://www.bing.com']

async def fetch(session, url):
    async with session.get(url) as resp:
        return await resp.text()
        # Catch HTTP errors/exceptions here

async def fetch_concurrent(urls):
    loop = asyncio.get_event_loop()
    async with aiohttp.ClientSession() as session:
        tasks = []
        for u in urls:
            tasks.append(loop.create_task(fetch(session, u)))

        for result in asyncio.as_completed(tasks):
            page = await result
            # Do whatever you want with results
            print(page[0:100])

asyncio.run(fetch_concurrent(urls))
But unless you are going to be making a huge number of requests, the threaded approach will likely be sufficient (and way easier to implement).
Related
I have some Python code that I want to speed up using threads, but when I try, the same lines get duplicated in the output. Is there any way I could speed it up without getting duplicate lines?
Code:
import requests
import json

f = open("urls.json")
data = json.load(f)

def urls():
    for i in data['urls']:
        r = requests.get("https://" + i)
        print(r.headers)
You can use the ThreadPoolExecutor class from concurrent.futures. It is more efficient than managing Thread objects yourself.
You can change the max_workers value according to your task.
Here is the code:
import requests
from concurrent.futures import ThreadPoolExecutor
import json

with open("urls.json") as f:
    data = json.load(f)

def urls():
    urls = ["https://" + url for url in data['urls']]
    print(urls)

    with ThreadPoolExecutor(max_workers=5) as pool:
        iterator = pool.map(requests.get, urls)
        for response in iterator:
            print(response.headers)
            print("\n")
Make async or threaded calls.
So, you would do something like this:
import aiohttp
import asyncio
import time

start_time = time.time()

async def main():
    async with aiohttp.ClientSession() as session:
        for number in range(1, 151):
            pokemon_url = f'https://pokeapi.co/api/v2/pokemon/{number}'
            async with session.get(pokemon_url) as resp:
                pokemon = await resp.json()
                print(pokemon['name'])

asyncio.run(main())
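Note that the loop above still awaits each response before sending the next request, so the calls run essentially one after another. A small variation that actually overlaps the requests uses asyncio.gather (same endpoint; fetch_pokemon is just an illustrative helper name):

import asyncio
import aiohttp

async def fetch_pokemon(session, number):
    # Fetch one pokemon and return its name
    url = f'https://pokeapi.co/api/v2/pokemon/{number}'
    async with session.get(url) as resp:
        data = await resp.json()
        return data['name']

async def main():
    async with aiohttp.ClientSession() as session:
        # Schedule all requests at once and wait for them together
        names = await asyncio.gather(*(fetch_pokemon(session, n) for n in range(1, 151)))
    for name in names:
        print(name)

asyncio.run(main())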
You could also use multiprocessing, as per the comment, but async is better for I/O-bound tasks.
For my project I need to call an API and store the results in a list, but I have to make more than 5000 requests with different values, so it takes a huge amount of time to complete. Is there any way to send the requests in parallel to finish the process quickly? I tried some threading code for this but couldn't figure out a way to solve it.
import requests

res_list = []
l = [19821, 29674, 41983, 40234, .....]  # Nearly 5000 items for now and the count may increase in future

for i in l:
    URL = "https://api.something.com/?key=xxx-xxx-xxx&job_id={0}".format(i)
    res = requests.get(url=URL)
    res_list.append(res.text)
Probably you just need to make your queries asynchronously. Something like this:
import asyncio
import aiohttp

NUMBERS = [1, 2, 3]

async def call():
    async with aiohttp.ClientSession() as session:
        for num in NUMBERS:
            async with session.get(f'http://httpbin.org/get?{num}') as resp:
                print(resp.status)
                print(await resp.text())

if __name__ == '__main__':
    loop = asyncio.new_event_loop()
    loop.run_until_complete(call())
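As written, this loop still awaits each response before starting the next one. To actually fire the question's ~5000 requests concurrently, you could gather the coroutines and collect the bodies into a list. A rough sketch, assuming the same key/job_id URL pattern from the question (fetch_job is a made-up helper name):

import asyncio
import aiohttp

job_ids = [19821, 29674, 41983, 40234]  # the full list of ~5000 ids

async def fetch_job(session, job_id):
    # Request one job_id and return the response body as text
    url = "https://api.something.com/?key=xxx-xxx-xxx&job_id={0}".format(job_id)
    async with session.get(url) as resp:
        return await resp.text()

async def main():
    async with aiohttp.ClientSession() as session:
        res_list = await asyncio.gather(*(fetch_job(session, i) for i in job_ids))
    return res_list

results = asyncio.run(main())
print(len(results))

For thousands of URLs you will probably also want to cap the concurrency, for example with asyncio.Semaphore as shown in the last answer below.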
In order to create a scraper for a page with dynamically loaded content, requests-html provides modules to get the rendered page after the JS execution. However, when trying to use AsyncHTMLSession by calling the arender() method in a multithreaded implementation, the generated HTML doesn't change.
E.g. for the URL provided in the source code, the table's HTML values are empty by default; after the script execution emulated by the arender() method, the values are expected to be inserted into the markup, yet no visible changes appear in the retrieved source.
from pprint import pprint
#from bs4 import BeautifulSoup
import asyncio
from timeit import default_timer
from concurrent.futures import ThreadPoolExecutor
from requests_html import AsyncHTMLSession, HTML

async def fetch(session, url):
    r = await session.get(url)
    await r.html.arender()
    return r.content

def parseWebpage(page):
    print(page)

async def get_data_asynchronous():
    urls = [
        'http://www.fpb.pt/fpb2014/!site.go?s=1&show=jog&id=258215'
    ]

    with ThreadPoolExecutor(max_workers=20) as executor:
        with AsyncHTMLSession() as session:
            # Set any session parameters here before calling `fetch`

            # Initialize the event loop
            loop = asyncio.get_event_loop()

            # Use list comprehension to create a list of
            # tasks to complete. The executor will run the `fetch`
            # function for each url in the urls list
            tasks = [
                await loop.run_in_executor(
                    executor,
                    fetch,
                    *(session, url)  # Allows us to pass in multiple arguments to `fetch`
                )
                for url in urls
            ]

            # Initializes the tasks to run and awaits their results
            for response in await asyncio.gather(*tasks):
                parseWebpage(response)

def main():
    loop = asyncio.get_event_loop()
    future = asyncio.ensure_future(get_data_asynchronous())
    loop.run_until_complete(future)

main()
The source code representation after the rendering method executes is not under the response's content attribute, but under raw_html on the HTML object. In this case, the value returned should be r.html.raw_html.
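In other words, only the return value of the fetch coroutine above needs to change, roughly like this:

async def fetch(session, url):
    r = await session.get(url)
    await r.html.arender()
    # The rendered markup lives on the HTML object, not on the response body
    return r.html.raw_html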
I have a text file with over 20 million lines in the below format:
ABC123456|fname1 lname1|fname2 lname2
.
.
.
.
My task is to read the file line by line, send both names to the Google transliteration API, and print the results on the terminal (Linux). Below is my code:
import asyncio
import urllib.parse
from aiohttp import ClientSession

async def getResponse(url):
    async with ClientSession() as session:
        async with session.get(url) as response:
            response = await response.read()
            print(response)

loop = asyncio.get_event_loop()
tasks = []
# I'm using test server localhost, but you can use any url
url = "https://www.google.com/inputtools/request?{}"

for line in open('tg.txt'):
    vdata = line.split("|")
    if len(vdata) == 3:
        names = vdata[1] + "_" + vdata[2]
        tdata = {"text": names, "ime": "transliteration_en_te"}
        qstring = urllib.parse.urlencode(tdata)
        task = asyncio.ensure_future(getResponse(url.format(qstring)))
        tasks.append(task)

loop.run_until_complete(asyncio.wait(tasks))
In the above code, my file tg.txt contains 20+ million lines. When I run it, my laptop freezes and I have to hard restart it. But this code works fine when I use another file tg1.txt which has only 10 lines. What am I missing?
You can try to use asyncio.gather(*futures) instead of asyncio.wait.
Also try to do this in batches of a fixed size (for example, 10 lines per batch) and add a print after each processed batch; it should help you debug your app.
Also, your futures could finish in a different order, so it's better to store the result of gather and print it when the processing of the batch is finished.
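A minimal sketch of that batching idea, reusing the getResponse coroutine and url template from the question (the batch size is arbitrary, and getResponse would need to return the body instead of printing it for the gathered results to be useful):

import asyncio
import urllib.parse

BATCH_SIZE = 1000  # tune to what your machine and the API can handle

async def process_batch(lines):
    tasks = []
    for line in lines:
        vdata = line.split("|")
        if len(vdata) == 3:
            tdata = {"text": vdata[1] + "_" + vdata[2], "ime": "transliteration_en_te"}
            qstring = urllib.parse.urlencode(tdata)
            tasks.append(getResponse(url.format(qstring)))
    # gather returns results in the same order as the tasks were passed in
    results = await asyncio.gather(*tasks)
    print("processed a batch of", len(tasks), "lines")
    return results

async def main():
    batch = []
    for line in open('tg.txt'):
        batch.append(line)
        if len(batch) == BATCH_SIZE:
            await process_batch(batch)
            batch = []
    if batch:  # leftover lines
        await process_batch(batch)

asyncio.run(main())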
I have 500,000 URLs and want to get the response for each asynchronously.
import aiohttp
import asyncio

@asyncio.coroutine
def worker(url):
    response = yield from aiohttp.request('GET', url, connector=aiohttp.TCPConnector(share_cookies=True, verify_ssl=False))
    body = yield from response.read_and_close()
    print(url)

def main():
    url_list = []  # lakhs of urls, extracted from a file
    loop = asyncio.get_event_loop()
    loop.run_until_complete(asyncio.wait([worker(u) for u in url_list]))

main()
I want 200 connections at a time (200 concurrent), not more than that, because
when I run this program for 50 urls it works fine, i.e. url_list[:50],
but if I pass the whole list, I get this error:
aiohttp.errors.ClientOSError: Cannot connect to host www.example.com:443 ssl:True Future/Task exception was never retrieved future: Task()
Maybe the request frequency is too high and the server refuses to respond after a limit?
Yes, one can expect a server to stop responding after you send too much traffic (whatever the definition of "too much traffic") to it.
One way to limit the number of concurrent requests (to throttle them) in such cases is to use asyncio.Semaphore, similar in use to the semaphores used in multithreading: just as there, you create a semaphore and make sure the operation you want to throttle acquires that semaphore prior to doing the actual work and releases it afterwards.
For your convenience, asyncio.Semaphore implements the context manager protocol to make it even easier.
Most basic approach:
CONCURRENT_REQUESTS = 200

@asyncio.coroutine
def worker(url, semaphore):
    # Acquiring/releasing semaphore using context manager.
    with (yield from semaphore):
        response = yield from aiohttp.request(
            'GET',
            url,
            connector=aiohttp.TCPConnector(share_cookies=True,
                                           verify_ssl=False))
        body = yield from response.read_and_close()
        print(url)

def main():
    url_list = []  # lakhs of urls, extracted from a file
    semaphore = asyncio.Semaphore(CONCURRENT_REQUESTS)
    loop = asyncio.get_event_loop()
    loop.run_until_complete(asyncio.wait([worker(u, semaphore) for u in url_list]))