How to run HTTP requests asynchronously in Python

This code I'm running synchronously:
import requests

URL = "http://maps.googleapis.com/maps/api/geocode/json"
location = "delhi technological university"
PARAMS = {'address': location}

for _ in range(1, 100):
    r = requests.get(url=URL, params=PARAMS)
    print(r)
How do I run the same code asynchronously in Python?
I tried this:
import requests
import asyncio

loop = asyncio.get_event_loop()
URL = "http://maps.googleapis.com/maps/api/geocode/json"
location = "delhi technological university"
PARAMS = {'address': location}

async def run():
    for j in range(1, 100):
        r = requests.get(url=URL, params=PARAMS)

if __name__ == "__main__":
    loop.run_until_complete(run())
    loop.close()
I tried the above code but I get a runtime error: RuntimeError: This event loop is already running.
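One common approach (a sketch of my own, not from the original thread, assuming the aiohttp package is available): replace requests, which blocks the event loop, with aiohttp and drive everything with asyncio.run(). The URL and parameters below are taken from the question; the rest is illustrative.

import asyncio
import aiohttp

URL = "http://maps.googleapis.com/maps/api/geocode/json"
PARAMS = {'address': 'delhi technological university'}

async def fetch(session):
    # aiohttp performs the request without blocking the event loop
    async with session.get(URL, params=PARAMS) as resp:
        print(resp.status)
        return await resp.text()

async def main():
    # schedule all requests concurrently and wait for them to finish
    async with aiohttp.ClientSession() as session:
        await asyncio.gather(*(fetch(session) for _ in range(1, 100)))

if __name__ == "__main__":
    asyncio.run(main())

The "This event loop is already running" error typically means the code was run in an environment that already drives a loop (a Jupyter notebook, for example); there the coroutine should be awaited directly instead of calling run_until_complete.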

Related

How to use html.render inside a thread?

When I send a function to a thread that parses a page and then calls html.render, an error occurs:
Error: There is no current event loop in thread 'Thread-1 (take_proxy_us_spys_one_thread)
I found a discussion of a similar problem and saw that someone managed to implement this, but I still get the error.
Here is my code, which is supposed to run repeatedly.
Please help me understand what is going wrong.
import urllib3
import requests
import time
from requests_html import HTMLSession
import threading
import fake_useragent

def take_proxy_us_spys_one(urls: list = [], header: dict = None):
    for url in urls:
        try:
            url_first = 'https://spys.one'
            r = requests.get(url_first, headers=header)
            cookies = r.cookies
            session = HTMLSession()
            r = session.post(url,
                             data={'xx00': '', 'xpp': '5', 'xf1': '0', 'xf2': '0', 'xf3': '0', 'xf4': '0', 'xf5': '0'},
                             headers=header,
                             cookies=cookies)
            r.html.render(reload=False)
            print(str(r))
        except Exception as exc:
            print("Error: " + str(exc))

def take_proxy_us_spys_one_thread(event, sleeptime=60, urls=[], lock=None):
    while event.is_set():
        try:
            user = fake_useragent.UserAgent().random
            header = {'User-Agent': user}
            lock.acquire() if lock is not None else None
            proxies_1 = take_proxy_us_spys_one(urls=urls, header=header)
            lock.release() if lock is not None else None
            time.sleep(sleeptime)
        except Exception as exc:
            print("Error: " + str(exc))
            time.sleep(sleeptime)

if __name__ == '__main__':
    start_in_thread = True
    urllib3.disable_warnings()
    urls_spys_one = [
        'https://spys.one/free-proxy-list/ALL/'
    ]
    lock = threading.Lock()
    event = threading.Event()
    event.set()
    t2 = threading.Thread(target=take_proxy_us_spys_one_thread, args=(event, 10, urls_spys_one, lock)).start()
I tried to implement the mechanism from here.
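One workaround that is commonly suggested for "There is no current event loop in thread" (my own sketch, not confirmed by the original thread): create and register a fresh event loop at the top of the thread function, because worker threads have no event loop by default and requests_html's render() looks one up internally.

import asyncio
import threading

def worker():
    # A worker thread has no event loop by default, which is what triggers
    # "There is no current event loop in thread ...". Create one and register
    # it as this thread's current loop before any HTMLSession()/render() call.
    loop = asyncio.new_event_loop()
    asyncio.set_event_loop(loop)
    print("event loop ready in", threading.current_thread().name)
    # the HTMLSession()/r.html.render() logic from the code above would go here

t = threading.Thread(target=worker)
t.start()
t.join()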

aiohttp asyncio parsing works fine for a time, then without any error gets no data

I need to parse the HTML of a list of domains (main pages only).
The script works well for a while, then starts returning no data at very high speed; it looks like the requests aren't even being sent.
My code:
import asyncio
import time
import aiohttp
import pandas as pd
import json
from bs4 import BeautifulSoup

df = pd.read_excel('work_file.xlsx')
domains_count = df.shape[0]
start_time = time.time()
print(start_time)
data = {}

async def get_data(session, url, j):
    try:
        async with session.get(url) as resp:
            html = await resp.text()
            rawhtml = BeautifulSoup(html, 'lxml')
            title = rawhtml.title
            data[url] = {'url': url, 'resp': resp.status, 'title': str(title), 'html': str(rawhtml)}
            print(j)
            print(url)
    except Exception as e:
        data[url] = {'url': url, 'resp': str(e), 'title': 'None', 'html': 'None'}
        print(j)
        print(url)
        print(str(e))

async def get_queue():
    tasks = []
    async with aiohttp.ClientSession(timeout=aiohttp.ClientTimeout(total=120)) as session:
        for j, i in enumerate(df.domain):
            i = 'http://' + i.lower()
            task = asyncio.create_task(get_data(session, i, j))
            tasks.append(task)
        await asyncio.gather(*tasks)

if __name__ == '__main__':
    asyncio.run(get_queue())
    with open('parsed_data.json', 'a+') as file:
        file.write(json.dumps(data))
    end_time = time.time() - start_time
    print(end_time)
It was a timeout error.
'resp': str(e)
This code prints only the exception message, and a TimeoutError carries no message, so str(e) is an empty string.
Using repr(e) makes the error visible.
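A tiny standalone illustration of the difference (not part of the original script):

import asyncio

try:
    raise asyncio.TimeoutError()
except Exception as e:
    print("str(e): ", str(e))   # empty string: the exception carries no message
    print("repr(e):", repr(e))  # e.g. "TimeoutError()", so the error is visible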

(Python) How can I apply asyncio in while loop with accumulator?

I have a block of code that works well for fetching data from a specific site's API. The issue is that the site returns at most 50 objects per call, so I have to make many calls, and the fetching takes far too long (sometimes nearly 20 minutes). Here is my code:
import concurrent.futures
import requests

supply = 3000
offset = 0
token_ids = []

while offset < supply:
    url = "url_1" + str(offset)
    response = requests.request("GET", url)
    a = response.json()
    assets = a["assets"]

    def get_token_ids(an):
        if str(an['sell_orders']) == 'None' and str(an['last_sale']) == 'None' and str(an['num_sales']) == '0':
            token_ids.append(str(an['token_id']))

    with concurrent.futures.ThreadPoolExecutor() as executor:
        results = [executor.submit(get_token_ids, asset) for asset in assets]

    offset += 50

print(token_ids)
The problem is that the code waits for each request to finish before making the next one. I am thinking of an improvement where, once a request is sent, the offset is incremented and the loop moves on to the next request, so I don't have to wait. I don't know how to do it; I have studied asyncio, but it is still a challenge for me. Can anyone help me with this?
The problem is that Requests is not asynchronous, so each of its network calls blocks the loop until it completes.
https://docs.python-requests.org/en/latest/user/advanced/#blocking-or-non-blocking
Therefore, it is better to try asynchronous libraries, for example, aiohttp:
https://github.com/aio-libs/aiohttp
Example
Create a session for all connections:
async with aiohttp.ClientSession() as session:
and run all the desired requests:
results = await asyncio.gather(
    *[get_data(session, offset) for offset in range(0, supply, step)]
)
Here the requests are executed asynchronously: session.get(url) fetches only the response headers, and the body is retrieved with await response.json():
async with session.get(url) as response:
    a = await response.json()
And in the main block the event loop is started:
loop = asyncio.get_event_loop()
token_ids = loop.run_until_complete(main())
loop.close()
The full code
import aiohttp
import asyncio

async def get_data(session, offset):
    token_ids = []
    url = "url_1" + str(offset)
    async with session.get(url) as response:
        # For tests:
        # print("Status:", response.status)
        # print("Content-type:", response.headers['content-type'])
        a = await response.json()
        assets = a["assets"]
        for asset in assets:
            if str(asset['sell_orders']) == 'None' and str(asset['last_sale']) == 'None' and str(asset['num_sales']) == '0':
                token_ids.append(str(asset['token_id']))
    return token_ids

async def main():
    supply = 3000
    step = 50
    token_ids = []
    # Create a session for all connections and pass it to the "get" function
    async with aiohttp.ClientSession() as session:
        results = await asyncio.gather(
            *[get_data(session, offset) for offset in range(0, supply, step)]
        )
    for ids in results:
        token_ids.extend(ids)
    return token_ids

if __name__ == "__main__":
    # asynchronous code starts here
    loop = asyncio.get_event_loop()
    token_ids = loop.run_until_complete(main())
    loop.close()
    # asynchronous code ends here
    print(token_ids)
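As a side note, on Python 3.7+ the explicit loop management in the __main__ block can be replaced with asyncio.run(), which creates and closes the loop itself:

if __name__ == "__main__":
    token_ids = asyncio.run(main())
    print(token_ids)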

concurrent.futures multithreading with 2 lists as variables

So I would like to multi-thread the following working piece of code with concurrent.futures, but nothing I've tried so far seems to work.
def download(song_filename_list, song_link_list):
    with requests.Session() as s:
        login_request = s.post(login_url, data=payload, headers=headers)
        for x in range(len(song_filename_list)):
            download_request = s.get(song_link_list[x], headers=download_headers, stream=True)
            if download_request.status_code == 200:
                print(f"Downloading {x+1} out of {len(song_filename_list)}!\n")
            else:
                print(f"\nStatus Code: {download_request.status_code}!\n")
                sys.exit()
            with open(song_filename_list[x], "wb") as file:
                file.write(download_request.content)
The 2 main variables are the song_filename_list and the song_link_list.
The first list holds the name of each file and the second holds their respective download links.
So the name and link of each file are located at the same position.
For example: name_of_file1 = song_filename_list[0] and link_of_file1 = song_link_list[0]
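A quick way to see this pairing is Python's zip, which walks both lists in lockstep (a standalone illustration with placeholder values, not the real data):

song_filename_list = ['file1.osz', 'file2.osz']
song_link_list = ['https://example.com/1', 'https://example.com/2']

# Each filename is paired with the link at the same position.
for name, link in zip(song_filename_list, song_link_list):
    print(name, '->', link)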
This is the most recent attempt at multi-threading:
def download(song_filename_list, song_link_list):
    with requests.Session() as s:
        login_request = s.post(login_url, data=payload, headers=headers)
        x = []
        for i in range(len(song_filename_list)):
            x.append(i)
        with concurrent.futures.ThreadPoolExecutor() as executor:
            executor.submit(get_file, x)

def get_file(x):
    download_request = s.get(song_link_list[x], headers=download_headers, stream=True)
    if download_request.status_code == 200:
        print(f"Downloading {x+1} out of {len(song_filename_list)}!\n")
    else:
        print(f"\nStatus Code: {download_request.status_code}!\n")
        sys.exit()
    with open(song_filename_list[x], "wb") as file:
        file.write(download_request.content)
Could someone explain to me what I am doing wrong?
Nothing happens after the get_file function call; it skips all the code and exits without any errors. Where is my logic wrong?
EDIT 1:
After adding prints to:
print(song_filename_list, song_link_list)
with concurrent.futures.ThreadPoolExecutor() as executor:
    print("Before executor.map")
    executor.map(get_file, zip(song_filename_list, song_link_list))
    print("After executor.map")
print(song_filename_list, song_link_list)
and to the start and end of get_file and around its file.write.
The output is as follows:
Succesfully logged in!
["songs names"] ["songs links"] <- These are correct.
Before executor.map
After executor.map
["songs names"] ["songs links"] <- These are correct.
Exiting.
In other words, the values are correct, but executor.map never runs get_file.
EDIT 2:
Here are the values used.
song_filename_list = ['100049 Himeringo - Yotsuya-san ni Yoroshiku.osz', '1001507 ZUTOMAYO - Kan Saete Kuyashiiwa.osz']
song_link_list = ['https://osu.ppy.sh/beatmapsets/100049/download', 'https://osu.ppy.sh/beatmapsets/1001507/download']
EDIT 3:
After some tinkering around, it seems that this works:
for i in range(len(song_filename_list)):
    with concurrent.futures.ThreadPoolExecutor() as executor:
        executor.submit(get_file, song_filename_list, song_link_list, i, s)

def get_file(song_filename_list, song_link_list, i, s):
    download_request = s.get(song_link_list[i], headers=download_headers, stream=True)
    if download_request.status_code == 200:
        print("Downloading...")
    else:
        print(f"\nStatus Code: {download_request.status_code}!\n")
        sys.exit()
    with open(song_filename_list[i], "wb") as file:
        file.write(download_request.content)
In your download() function you submit the whole list, whereas you should submit each item:
def download(song_filename_list, song_link_list):
    with requests.Session() as s:
        login_request = s.post(login_url,
                               data=payload,
                               headers=headers)
        for i in range(len(song_filename_list)):
            with concurrent.futures.ThreadPoolExecutor() as executor:
                executor.submit(get_file, i)
You can simplify this with the executor's .map() method:
def download(song_filename_list, song_link_list):
    with requests.Session() as session:
        session.post(login_url,
                     data=payload,
                     headers=headers)
    with concurrent.futures.ThreadPoolExecutor() as executor:
        executor.map(get_file, song_filename_list, song_link_list)
Where the get_file function is:
def get_file(song_name, song_link):
    with requests.Session() as session:
        download_request = session.get(song_link,
                                       headers=download_headers,
                                       stream=True)
        if download_request.status_code == 200:
            print(f"Downloaded {song_name}")
        else:
            print(f"\nStatus Code: {download_request.status_code}!\n")
        with open(song_name, "wb") as file:
            file.write(download_request.content)
This avoids sharing state between threads, which avoids potential data races.
If you need to monitor how many songs have been downloaded, you can use tqdm, which has a thread_map iterator wrapper that does exactly this.
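For example, a minimal sketch using thread_map from tqdm.contrib.concurrent (assuming get_file, song_filename_list and song_link_list are defined as above):

from tqdm.contrib.concurrent import thread_map

# Maps get_file over both lists in a thread pool and shows a progress bar.
thread_map(get_file, song_filename_list, song_link_list)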

Asyncio exception handling, possible to not gather exceptions?

I have some code which makes API calls with asyncio and aiohttp. For some URLs asyncio raises an exception, so I let it be returned (with asyncio.gather(return_exceptions=True)) so it doesn't break the event loop. Is it possible to not gather the returned exceptions, so that only the results that worked are returned? Or do I need to clean up the list manually afterwards?
This is the code:
import asyncio
import aiohttp
import ssl
import datetime as dt

limit = 30
start_epoch = int(dt.datetime(2018, 7, 1).timestamp())
end_epoch = int(dt.datetime.now().timestamp())
epoch_step = 40000
url_list = []

while True:
    url = "https://api.pushshift.io/reddit/search/comment/?q=" + "Nestle" + "&size=" + str(limit) + "&after=" + str(start_epoch) + "&before=" + str(start_epoch + epoch_step)
    url_list.append(url)
    start_epoch += epoch_step
    if start_epoch > end_epoch:
        break

async def fetch(session, url):
    async with session.get(url, ssl=ssl.SSLContext()) as response:
        return await response.json()

async def fetch_all(urls, loop):
    async with aiohttp.ClientSession(loop=loop) as session:
        results = await asyncio.gather(*[fetch(session, url) for url in urls], return_exceptions=True)
        return results

if __name__ == '__main__':
    loop = asyncio.get_event_loop()
    urls = url_list
    htmls = loop.run_until_complete(fetch_all(urls, loop))
    print(htmls)
and it returns a list which looks something like this:
[ContentTypeError("0, message='Attempt to decode JSON with unexpected mimetype: text/html'",), {'data': [{'author':...]
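As for the question itself: asyncio.gather() with return_exceptions=True always includes the exception objects in its result list, so the usual approach is to filter them out afterwards. A minimal standalone sketch (my own, not from the original thread):

import asyncio

async def ok():
    return {'data': 1}

async def bad():
    raise ValueError("boom")

async def main():
    results = await asyncio.gather(ok(), bad(), return_exceptions=True)
    # keep only the successful results, dropping any returned exceptions
    return [r for r in results if not isinstance(r, Exception)]

print(asyncio.run(main()))  # [{'data': 1}]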
