Parallelize checking of dead URLs

Parallelize checking of dead URLs - python

The question is quite easy: Is it possible to test a list of URLs and store in a list only dead URLs (response code > 400) using asynchronous function?
I previously use requests library to do it and it works great but I have a big list of URLs to test and if I do it sequentially it takes more than 1 hour.
I saw a lot of article on how to make parallels requests using asyncio and aiohttp but I didn't see many things about how to test URLs with these libraries.
Is it possible to do it?

Using multithreading you could do it like this:
import requests
from concurrent.futures import ThreadPoolExecutor
results = dict()
# test the given url
# add url and status code to the results dictionary if GET succeeds but status code >= 400
# also add url to results dictionary if an exception arises with full exception details
def test_url(url):
try:
r = requests.get(url)
if r.status_code >= 400:
results[url] = f'{r.status_code=}'
except requests.exceptions.RequestException as e:
results[url] = str(e)
# return a list of URLs to be checked. probably get these from a file in reality
def get_list_of_urls():
return ['https://facebook.com', 'https://google.com', 'http://google.com/nonsense', 'http://goooglyeyes.org']
def main():
with ThreadPoolExecutor() as executor:
executor.map(test_url, get_list_of_urls())
print(results)
if __name__ == '__main__':
main()

You could do something like this using aiohttp and asyncio.
Could be done more pythonic I guess but this should work.
import aiohttp
import asyncio
urls = ['url1', 'url2']
async def test_url(session, url):
async with session.get(url) as resp:
if resp.status > 400:
return url
async def main():
async with aiohttp.ClientSession() as session:
tasks = []
for url in urls:
tasks.append(asyncio.ensure_future(test_url(session, url)))
dead_urls = await asyncio.gather(*tasks)
print(dead_urls)
asyncio.run(main())

Very basic example, but this is how I would solve it:
from aiohttp import ClientSession
from asyncio import create_task, gather, run
async def TestUrl(url, session):
async with session.get(url) as response:
if response.status >= 400:
r = await response.text()
print(f"Site: {url} is dead, response code: {str(response.status)} response text: {r}")
async def TestUrls(urls):
resultsList: list = []
async with ClientSession() as session:
# Maybe some rate limiting?
partitionTasks: list = [
create_task(TestUrl(url, session))
for url in urls]
resultsList.append(await gather(*partitionTasks, return_exceptions=False))
# do stuff with the results or return?
return(resultsList)
async def main():
urls = []
test = await TestUrls(urls)
if __name__ == "__main__":
run(main())

Try using a ThreadPoolExecutor
from concurrent.futures import ThreadPoolExecutor
import requests
url_list=[
"https://www.google.com",
"https://www.adsadasdad.com",
"https://www.14fsdfsff.com",
"https://www.ggr723tg.com",
"https://www.yyyyyyyyyyyyyyy.com",
"https://www.78sdf8sf5sf45sf.com",
"https://www.wikipedia.com",
"https://www.464dfgdfg235345.com",
"https://www.tttllldjfh.com",
"https://www.qqqqqqqqqq456.com"
]
def check(url):
r=requests.get(url)
if r.status_code < 400:
print(f"{url} is ALIVE")
with ThreadPoolExecutor(max_workers=5) as e:
for url in url_list:
e.submit(check, url)

Multiprocessing could be the better option for your problem.
from multiprocessing import Process
from multiprocessing import Manager
import requests
def checkURLStatus(url, url_status):
res = requests.get(url)
if res.status_code >= 400:
url_status[url] = "Inactive"
else:
url_status[url] = "Active"
if __name__ == "__main__":
urls = [
"https://www.google.com"
]
manager = Manager()
# to store the results for later usage
url_status = manager.dict()
procs = []
for url in urls:
proc = Process(target=checkURLStatus, args=(url, url_status))
procs.append(proc)
proc.start()
for proc in procs:
proc.join()
print(url_status.values())
url_status is a shared variable to store data for separate threads. Refer this page for more info.

Related

using python requests to speed up for loop

I have a python code and I want to speed it up using threads but when I try to I get the same lines getting duplicated, is there is any way I could speed it up without getting duplicate lines
code
import requests
import json
f = open("urls.json")
data = json.load(f)
def urls():
for i in data['urls']:
r = requests.get("https://" + i)
print(r.headers)

You can use ThreadPoolExecutor class from concurrent.futures. It is efficient way according to Thread class.
You can change the max_workers value according to your task
Here is the piece of code:
import requests
from concurrent.futures import ThreadPoolExecutor
import json
with open("urls.json") as f:
data = json.load(f)
def urls():
urls = ["https://" + url for url in data['urls']]
print(urls)
with ThreadPoolExecutor(max_workers=5) as pool:
iterator = pool.map(requests.get,urls)
for response in iterator:
print(response.headers)
print("\n")

Make async or threaded calls.
So, you would do something like this:
import aiohttp
import asyncio
import time
start_time = time.time()
async def main():
async with aiohttp.ClientSession() as session:
for number in range(1, 151):
pokemon_url = f'https://pokeapi.co/api/v2/pokemon/{number}'
async with session.get(pokemon_url) as resp:
pokemon = await resp.json()
print(pokemon['name'])
asyncio.run(main())
Could also do multiprocessing as per the comment, but async is better for i/o type tasks.

Does anyone know how I can make this code move faster?

I have finished making a web scraper that will go through Roblox, and pick out all of the usernames of the first 1000 accounts made on Roblox. Fortunately it works! However, there is a downside.
My problem is that this code takes absolutely FOREVER to finish. Does anyone know a more efficient way to write the same thing, or is this just the base speed of Python Requests? Code is below :)
PS: The code took 5 minutes to go through only 600 accounts.
def find_account(id):
import requests
from bs4 import BeautifulSoup
r = requests.request(url=f'https://web.roblox.com/users/{id}/profile', method='get')
if r.status_code == 200:
soup = BeautifulSoup(r.text, 'html.parser')
stuff = soup.find_all('h2')
special = stuff[0]
special = list(special)
special = special[0]
return str(special) + ' ID: {}'.format(id)
else:
return None
users = []
for i in range(10000,11000):
users.append(find_account(i))
print(f'{i-9999} out of 1000 done')
#There is more below this, but that is just the GUI and stuff. This is the part that gets the usernames.

Try the async library to asynchronously attempt to do the same thing. The advantage of using async python is that you do not need to wait for one http call to finish before calling the next. This is a fantastic article on how to write concurrent/parallel code in python, give it a read if the syntax here is confusing.
refactored to run in async mode:
import asyncio
import aiohttp
from bs4 import BeautifulSoup
async def find_account(id, session):
async with session.get(f'https://web.roblox.com/users/{id}/profile') as r:
if r.status == 200:
response_text = await r.read()
soup = BeautifulSoup(response_text, 'html.parser')
stuff = soup.find_all('h2')
special = stuff[0]
special = list(special)
special = special[0]
print(f'{id-9999} out of 1000 done')
return str(special) + ' ID: {}'.format(id)
else:
return None
async def crawl_url_id_range(min_id, max_id):
tasks = []
async with aiohttp.ClientSession() as session:
for id in range(min_id, max_id):
tasks.append(asyncio.ensure_future(find_account(id=id, session=session)))
return await asyncio.gather(*tasks)
event_loop = asyncio.get_event_loop()
users = event_loop.run_until_complete(crawl_url_id_range(min_id=10000, max_id=11000))
I tested and the above code works fairly well.

Make Python Async Requests Faster

I am writing a get method that gets an array of ids and then makes a request for each id. The array of ids can be potentially 500+ and right now the requests are taking 20+ minutes. I have tried several different async methods like aiohttp and async and neither of them have worked to make the requests faster. Here is my code:
async def get(self):
self.set_header("Access-Control-Allow-Origin", "*")
story_list = []
duplicates = []
loop = asyncio.get_event_loop()
ids = loop.run_in_executor(None, requests.get, 'https://hacker-news.firebaseio.com/v0/newstories.json?print=pretty')
response = await ids
response_data = response.json()
print(response.text)
for url in response_data:
if url not in duplicates:
duplicates.append(url)
stories = loop.run_in_executor(None, requests.get, "https://hacker-news.firebaseio.com/v0/item/{}.json?print=pretty".format(
url))
data = await stories
if data.status_code == 200 and len(data.text) > 5:
print(data.status_code)
print(data.text)
story_list.append(data.json())
Is there a way I can use multithreading to make the requests faster?

The main issue here is that the code isn't really async.
After getting your list of URL's you are then fetching them one at a time and then awaiting the response.
A better idea would be to filter out the duplicates (use a set) before queuing all of the URLs in the executor and awaiting all of them to finish eg:
async def get(self):
self.set_header("Access-Control-Allow-Origin", "*")
stories = []
loop = asyncio.get_event_loop()
# Single executor to share resources
executor = ThreadPoolExecutor()
# Get the initial set of ids
response = await loop.run_in_executor(executor, requests.get, 'https://hacker-news.firebaseio.com/v0/newstories.json?print=pretty')
response_data = response.json()
print(response.text)
# Putting them in a set will remove duplicates
urls = set(response_data)
# Build the set of futures (returned by run_in_executor) and wait for them all to complete
responses = await asyncio.gather(*[
loop.run_in_executor(
executor, requests.get,
"https://hacker-news.firebaseio.com/v0/item/{}.json?print=pretty".format(url)
) for url in urls
])
# Process the responses
for response in responses:
if response.status_code == 200 and len(response.text) > 5:
print(response.status_code)
print(response.text)
stories.append(response.json())
return stories

Multithreading Python Requests Through Tor

The following code is my attempt at doing python requests through tor, this works fine, however I am interested in adding multithreading to this.
So I would like to simultaneously do about 10 different requests and process their outputs. What is the simplest and most efficient way to do this?
def onionrequest(url, onionid):
onionid = onionid
session = requests.session()
session.proxies = {}
session.proxies['http'] = 'socks5h://localhost:9050'
session.proxies['https'] = 'socks5h://localhost:9050'
#r = session.get('http://google.com')
onionurlforrequest = "http://" + url
try:
r = session.get(onionurlforrequest, timeout=15)
except:
return None
if r.status_code = 200:
listofallonions.append(url)

I would recommend using the the following packages to achieve this: asyncio, aiohttp, aiohttp_socks
example code:
import asyncio
import aiohttp
from aiohttp_socks import ProxyConnector
async def fetch(session, url):
async with session.get(url) as response:
return await response.text()
async def main(urls):
tasks = []
connector = ProxyConnector.from_url('socks5://localhost:9150', rdns=True)
async with aiohttp.ClientSession(connector=connector, rdns=True) as session:
for url in urls:
tasks.append(fetch(session, url))
htmls = await asyncio.gather(*tasks)
for html in htmls:
print(html)
if __name__ == '__main__':
urls = [
'http://python.org',
'https://google.com',
...
]
loop = asyncio.get_event_loop()
loop.run_until_complete(main(urls))
Using asyncio can get a bit daunting at first, so you might need to practice for a while before you get the hang of it.
If you want a more in-depth explanation of the difference between synchronous and asynchronous, check out this question.

Run Parallel Request session in python

I am trying to open a multiple web session and save the data into CSV, Have written my code using for loop & requests.get options, But it's taking so long to access 90 number of Web location. Can anyone let me know how the whole process run in parallel for loc_var:
The code is working fine, only the issue is running one by one for loc_var, and took so long time.
Want to access all the for loop loc_var URL in parallel and write operation of CSV
Below is the Code:
import pandas as pd
import numpy as np
import os
import requests
import datetime
import zipfile
t=datetime.date.today()-datetime.timedelta(2)
server = [("A","web1",":5000","username=usr&password=p7Tdfr")]
'''List of all web_ips'''
web_1 = ["Web1","Web2","Web3","Web4","Web5","Web6","Web7","Web8","Web9","Web10","Web11","Web12","Web13","Web14","Web15"]
'''List of All location'''
loc_var =["post1","post2","post3","post4","post5","post6","post7","post8","post9","post10","post11","post12","post13","post14","post15","post16","post17","post18"]
for s,web,port,usr in server:
login_url='http://'+web+port+'/api/v1/system/login/?'+usr
print (login_url)
s= requests.session()
login_response = s.post(login_url)
print("login Responce",login_response)
#Start access the Web for Loc_variable
for mkt in loc_var:
#output is CSV File
com_actions_url='http://'+web+port+'/api/v1/3E+date(%5C%22'+str(t)+'%5C%22)and+location+%3D%3D+%27'+mkt+'%27%22&page_size=-1&format=%22csv%22'
print("com_action_url",com_actions_url)
r = s.get(com_actions_url)
print("action",r)
if r.ok == True:
with open(os.path.join("/home/Reports_DC/", "relation_%s.csv"%mkt),'wb') as f:
f.write(r.content)
# If loc is not aceesble try with another Web_1 List
if r.ok == False:
while r.ok == False:
for web_2 in web_1:
login_url='http://'+web_2+port+'/api/v1/system/login/?'+usr
com_actions_url='http://'+web_2+port+'/api/v1/3E+date(%5C%22'+str(t)+'%5C%22)and+location+%3D%3D+%27'+mkt+'%27%22&page_size=-1&format=%22csv%22'
login_response = s.post(login_url)
print("login Responce",login_response)
print("com_action_url",com_actions_url)
r = s.get(com_actions_url)
if r.ok == True:
with open(os.path.join("/home/Reports_DC/", "relation_%s.csv"%mkt),'wb') as f:
f.write(r.content)
break

There are multiple approaches that you can take to make concurrent HTTP requests. Two that I've used are (1) multiple threads with concurrent.futures.ThreadPoolExecutor or (2) send the requests asynchronously using asyncio/aiohttp.
To use a thread pool to send your requests in parallel, you would first generate a list of URLs that you want to fetch in parallel (in your case generate a list of login_urls and com_action_urls), and then you would request all of the URLs concurrently as follows:
from concurrent.futures import ThreadPoolExecutor
import requests
def fetch(url):
page = requests.get(url)
return page.text
# Catch HTTP errors/exceptions here
pool = ThreadPoolExecutor(max_workers=5)
urls = ['http://www.google.com', 'http://www.yahoo.com', 'http://www.bing.com'] # Create a list of urls
for page in pool.map(fetch, urls):
# Do whatever you want with the results ...
print(page[0:100])
Using asyncio/aiohttp is generally faster than the threaded approach above, but the learning curve is more complicated. Here is a simple example (Python 3.7+):
import asyncio
import aiohttp
urls = ['http://www.google.com', 'http://www.yahoo.com', 'http://www.bing.com']
async def fetch(session, url):
async with session.get(url) as resp:
return await resp.text()
# Catch HTTP errors/exceptions here
async def fetch_concurrent(urls):
loop = asyncio.get_event_loop()
async with aiohttp.ClientSession() as session:
tasks = []
for u in urls:
tasks.append(loop.create_task(fetch(session, u)))
for result in asyncio.as_completed(tasks):
page = await result
#Do whatever you want with results
print(page[0:100])
asyncio.run(fetch_concurrent(urls))
But unless you are going to be making a huge number of requests, the threaded approach will likely be sufficient (and way easier to implement).

Develop Reference

Python is a programming language that lets you work quickly and integrate systems more effectively.

Parallelize checking of dead URLs - python

Related

using python requests to speed up for loop

Does anyone know how I can make this code move faster?

Make Python Async Requests Faster

Multithreading Python Requests Through Tor

Run Parallel Request session in python

Categories

Resources