Does anyone know how I can make this code move faster? - python

I have finished making a web scraper that will go through Roblox, and pick out all of the usernames of the first 1000 accounts made on Roblox. Fortunately it works! However, there is a downside.
My problem is that this code takes absolutely FOREVER to finish. Does anyone know a more efficient way to write the same thing, or is this just the base speed of Python Requests? Code is below :)
PS: The code took 5 minutes to go through only 600 accounts.
def find_account(id):
    import requests
    from bs4 import BeautifulSoup
    r = requests.request(url=f'https://web.roblox.com/users/{id}/profile', method='get')
    if r.status_code == 200:
        soup = BeautifulSoup(r.text, 'html.parser')
        stuff = soup.find_all('h2')
        special = stuff[0]
        special = list(special)
        special = special[0]
        return str(special) + ' ID: {}'.format(id)
    else:
        return None

users = []
for i in range(10000,11000):
    users.append(find_account(i))
    print(f'{i-9999} out of 1000 done')
#There is more below this, but that is just the GUI and stuff. This is the part that gets the usernames.

Try asyncio together with an asynchronous HTTP client such as aiohttp to do the same thing. The advantage of async Python is that you do not need to wait for one HTTP call to finish before starting the next. This is a fantastic article on how to write concurrent/parallel code in Python; give it a read if the syntax here is confusing.
Refactored to run in async mode:
import asyncio
import aiohttp
from bs4 import BeautifulSoup

async def find_account(id, session):
    async with session.get(f'https://web.roblox.com/users/{id}/profile') as r:
        if r.status == 200:
            response_text = await r.read()
            soup = BeautifulSoup(response_text, 'html.parser')
            stuff = soup.find_all('h2')
            special = stuff[0]
            special = list(special)
            special = special[0]
            print(f'{id-9999} out of 1000 done')
            return str(special) + ' ID: {}'.format(id)
        else:
            return None

async def crawl_url_id_range(min_id, max_id):
    tasks = []
    async with aiohttp.ClientSession() as session:
        for id in range(min_id, max_id):
            tasks.append(asyncio.ensure_future(find_account(id=id, session=session)))
        return await asyncio.gather(*tasks)

# asyncio.run creates the event loop, runs the coroutine, and closes the loop
users = asyncio.run(crawl_url_id_range(min_id=10000, max_id=11000))
I tested and the above code works fairly well.
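To see where the speed-up comes from without touching the network, here is a small sketch in which `asyncio.sleep` stands in for the HTTP round trip. The names `fake_request`, `run_sequential`, `run_concurrent`, and `timed`, and the 0.05 s latency, are illustrative assumptions and not part of the answer above.

```python
import asyncio
import time

LATENCY = 0.05  # assumed round-trip time per request, in seconds


async def fake_request(user_id):
    # Stand-in for an HTTP call: the coroutine yields control while "waiting".
    await asyncio.sleep(LATENCY)
    return f"user-{user_id}"


async def run_sequential(ids):
    # Each request starts only after the previous one has finished.
    return [await fake_request(i) for i in ids]


async def run_concurrent(ids):
    # All requests are started together; gather returns results in input order.
    return await asyncio.gather(*(fake_request(i) for i in ids))


def timed(coro_func, ids):
    # Run one of the coroutines above and return (results, elapsed seconds).
    start = time.perf_counter()
    results = asyncio.run(coro_func(ids))
    return results, time.perf_counter() - start
```

Calling `timed(run_sequential, range(10))` and `timed(run_concurrent, range(10))` should show the sequential version taking roughly ten times the latency, while the concurrent one takes about one latency in total.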

Related

proxies do not work with aiohttp requests?

So I've been experimenting with web scraping with aiohttp, and I ran into this issue where whenever I use a proxy, the code within the session.get doesn't run. I've looked all over the internet and couldn't find a solution.
import asyncio
import time
import aiohttp
from aiohttp.client import ClientSession
import random

failed = 0
success = 0
proxypool = []
with open("proxies.txt", "r") as jsonFile:
    lines = jsonFile.readlines()
    for i in lines:
        x = i.split(":")
        proxypool.append("http://"+x[2]+":"+x[3].rstrip()+"@"+x[0]+":"+x[1])

async def download_link(url:str,session:ClientSession):
    global failed
    global success
    proxy = proxypool[random.randint(0, len(proxypool))]
    print(proxy)
    async with session.get(url, proxy=proxy) as response:
        if response.status != 200:
            failed +=1
        else:
            success +=1
        result = await response.text()
        print(result)

async def download_all(urls:list):
    my_conn = aiohttp.TCPConnector(limit=1000)
    async with aiohttp.ClientSession(connector=my_conn,trust_env=True) as session:
        tasks = []
        for url in urls:
            task = asyncio.ensure_future(download_link(url=url,session=session))
            tasks.append(task)
        await asyncio.gather(*tasks,return_exceptions=True) # the await must be nested inside the session

url_list = ["https://www.google.com"]*100
start = time.time()
asyncio.run(download_all(url_list))
end = time.time()
print(f'download {len(url_list)-failed} links in {end - start} seconds')
print(failed, success)
Here is the problem, though: the code works fine on my Mac. However, when I try to run the exact same code on Windows, it doesn't run. It also works fine without proxies, but as soon as I add them, it doesn't work.
At the end, you can see that I print failed and success. On my Mac it outputs 0, 100, whereas on my Windows computer it prints 0, 0. This shows that the code inside session.get isn't running (also, nothing is printed).
The proxies I am using are paid proxies, and they work normally if I use requests.get(). Their format is "http://user:pass@ip:port".
I have also tried using just "http://ip:port" and then BasicAuth to carry the user and password, but this does not work either.
I've seen that many other people have had this problem, but the issue never seems to get solved.
Any help would be appreciated :)
After some more testing and research I found the issue: I needed to add ssl=False.
So the correct way to make the request would be:
async with session.get(url, proxy=proxy, ssl=False) as response:
That worked for me.
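One likely reason the script prints `0 0` with no other output is that `asyncio.gather(..., return_exceptions=True)` returns exceptions as ordinary results instead of raising them, so a failure inside `download_link` never shows up unless the results are inspected. Below is a network-free sketch of checking them. `flaky` and `run_all` are hypothetical names, and the simulated error only stands in for whatever the proxy connection actually raised.

```python
import asyncio


async def flaky(n):
    # Stand-in for download_link: odd inputs fail, even inputs succeed.
    await asyncio.sleep(0)
    if n % 2:
        raise ConnectionError(f"request {n} failed")
    return n


async def run_all(numbers):
    # With return_exceptions=True, failures come back as items in the result list.
    results = await asyncio.gather(*(flaky(n) for n in numbers),
                                   return_exceptions=True)
    ok = [r for r in results if not isinstance(r, BaseException)]
    errors = [r for r in results if isinstance(r, BaseException)]
    for err in errors:
        # Printing (or logging) makes otherwise silent failures visible.
        print(f"{type(err).__name__}: {err}")
    return ok, errors
```

The same caution applies to the proxy selection in the question: `random.randint(0, len(proxypool))` includes the upper bound, so it can occasionally index past the end of the list, whereas `random.choice(proxypool)` cannot.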

Parallelize checking of dead URLs

The question is quite simple: is it possible to test a list of URLs and store only the dead ones (response code >= 400) in a list, using asynchronous functions?
I previously used the requests library for this and it works great, but I have a big list of URLs to test, and doing it sequentially takes more than an hour.
I have seen a lot of articles on how to make parallel requests using asyncio and aiohttp, but not much about how to test URLs with these libraries.
Is it possible to do it?
Using multithreading you could do it like this:
import requests
from concurrent.futures import ThreadPoolExecutor

results = dict()

# test the given url
# add url and status code to the results dictionary if GET succeeds but status code >= 400
# also add url to results dictionary if an exception arises with full exception details
def test_url(url):
    try:
        r = requests.get(url)
        if r.status_code >= 400:
            results[url] = f'{r.status_code=}'
    except requests.exceptions.RequestException as e:
        results[url] = str(e)

# return a list of URLs to be checked. probably get these from a file in reality
def get_list_of_urls():
    return ['https://facebook.com', 'https://google.com', 'http://google.com/nonsense', 'http://goooglyeyes.org']

def main():
    with ThreadPoolExecutor() as executor:
        executor.map(test_url, get_list_of_urls())
    print(results)

if __name__ == '__main__':
    main()
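A variation on the thread-pool approach, sketched under a few assumptions: instead of writing into a shared dictionary, each worker returns its result and `executor.map` collects them in input order. The `probe` parameter and the helper names are hypothetical. `probe_with_urllib` uses only the standard library, so the sketch does not depend on `requests`.

```python
from concurrent.futures import ThreadPoolExecutor
import urllib.error
import urllib.request


def probe_with_urllib(url, timeout=10):
    """Return the HTTP status code for url; connection problems raise."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status
    except urllib.error.HTTPError as e:
        # urlopen raises HTTPError for 4xx/5xx; the code is still available.
        return e.code


def check(url, probe):
    """Return (url, reason) if the URL looks dead, else None."""
    try:
        status = probe(url)
    except Exception as e:  # connection errors, DNS failures, timeouts, ...
        return url, str(e)
    return (url, f"status_code={status}") if status >= 400 else None


def find_dead(urls, probe=probe_with_urllib, max_workers=8):
    """Check all URLs in a thread pool and return {url: reason} for dead ones."""
    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        results = executor.map(lambda u: check(u, probe), urls)
        return dict(r for r in results if r is not None)
```

Passing a fake `probe` makes the function easy to try without network access; with the default it would issue real GET requests.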
You could do something like this using aiohttp and asyncio.
It could probably be written more Pythonically, but this should work.
import aiohttp
import asyncio

urls = ['url1', 'url2']

async def test_url(session, url):
    async with session.get(url) as resp:
        if resp.status >= 400:
            return url

async def main():
    async with aiohttp.ClientSession() as session:
        tasks = []
        for url in urls:
            tasks.append(asyncio.ensure_future(test_url(session, url)))
        results = await asyncio.gather(*tasks)
    # test_url returns None for live URLs, so keep only the dead ones
    dead_urls = [url for url in results if url is not None]
    print(dead_urls)

asyncio.run(main())
Very basic example, but this is how I would solve it:
from aiohttp import ClientSession
from asyncio import create_task, gather, run

async def TestUrl(url, session):
    async with session.get(url) as response:
        if response.status >= 400:
            r = await response.text()
            print(f"Site: {url} is dead, response code: {str(response.status)} response text: {r}")

async def TestUrls(urls):
    resultsList: list = []
    async with ClientSession() as session:
        # Maybe some rate limiting?
        partitionTasks: list = [
            create_task(TestUrl(url, session))
            for url in urls]
        resultsList.append(await gather(*partitionTasks, return_exceptions=False))
    # do stuff with the results or return?
    return(resultsList)

async def main():
    urls = []
    test = await TestUrls(urls)

if __name__ == "__main__":
    run(main())
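The "Maybe some rate limiting?" comment can be addressed with an `asyncio.Semaphore`, which caps how many requests are in flight at once. The sketch below is network-free: `fake_check` stands in for the `session.get` call, and the names and the limit of 5 are assumptions.

```python
import asyncio


async def fake_check(url):
    # Stand-in for the HTTP request; pretend URLs containing "dead" return 404.
    await asyncio.sleep(0.01)
    return 404 if "dead" in url else 200


async def find_dead_limited(urls, limit=5, check=fake_check):
    """Return (dead_urls, peak) where peak is the highest number of concurrent checks."""
    semaphore = asyncio.Semaphore(limit)
    in_flight = 0
    peak = 0

    async def guarded(url):
        nonlocal in_flight, peak
        async with semaphore:  # at most `limit` coroutines pass this point
            in_flight += 1
            peak = max(peak, in_flight)
            try:
                status = await check(url)
            finally:
                in_flight -= 1
        return url if status >= 400 else None

    results = await asyncio.gather(*(guarded(u) for u in urls))
    return [u for u in results if u is not None], peak
```

The `peak` counter is only there to make the cap observable; in real code the `async with semaphore:` line is the part that matters.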
Try using a ThreadPoolExecutor:
from concurrent.futures import ThreadPoolExecutor
import requests

url_list=[
    "https://www.google.com",
    "https://www.adsadasdad.com",
    "https://www.14fsdfsff.com",
    "https://www.ggr723tg.com",
    "https://www.yyyyyyyyyyyyyyy.com",
    "https://www.78sdf8sf5sf45sf.com",
    "https://www.wikipedia.com",
    "https://www.464dfgdfg235345.com",
    "https://www.tttllldjfh.com",
    "https://www.qqqqqqqqqq456.com"
]

def check(url):
    r=requests.get(url)
    if r.status_code < 400:
        print(f"{url} is ALIVE")

with ThreadPoolExecutor(max_workers=5) as e:
    for url in url_list:
        e.submit(check, url)
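Note that `requests.get` raises an exception for hosts that cannot be reached, and with `e.submit` that exception stays inside the returned future unless its result is read. A sketch of collecting the futures and reading them with `as_completed` follows. `fake_get` is a hypothetical stand-in for the HTTP call so the example runs offline, and `classify` is likewise an illustrative name.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def fake_get(url):
    # Stand-in for requests.get(url).status_code; unknown hosts raise.
    if "unknown" in url:
        raise ConnectionError(f"cannot resolve {url}")
    return 404 if "missing" in url else 200


def classify(urls, get_status=fake_get, max_workers=5):
    """Return (sorted list of live URLs, {dead_url: reason})."""
    alive, dead = [], {}
    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        futures = {executor.submit(get_status, url): url for url in urls}
        for future in as_completed(futures):
            url = futures[future]
            try:
                status = future.result()  # re-raises any worker exception
            except Exception as exc:
                dead[url] = str(exc)
                continue
            if status < 400:
                alive.append(url)
            else:
                dead[url] = f"HTTP {status}"
    return sorted(alive), dead
```

This way unreachable hosts are recorded as dead rather than disappearing silently.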
Multiprocessing could be the better option for your problem.
from multiprocessing import Process
from multiprocessing import Manager
import requests

def checkURLStatus(url, url_status):
    res = requests.get(url)
    if res.status_code >= 400:
        url_status[url] = "Inactive"
    else:
        url_status[url] = "Active"

if __name__ == "__main__":
    urls = [
        "https://www.google.com"
    ]
    manager = Manager()
    # to store the results for later usage
    url_status = manager.dict()
    procs = []
    for url in urls:
        proc = Process(target=checkURLStatus, args=(url, url_status))
        procs.append(proc)
        proc.start()
    for proc in procs:
        proc.join()
    print(url_status.values())
url_status is a shared dictionary, created through the Manager, that the separate processes write their results into. Refer to this page for more info.

Concurrent HTTP and SQL requests using async Python 3

First time trying asyncio and aiohttp.
I have the following code that gets URLs from a MySQL database, makes a GET request to each, and pushes the responses back to the MySQL database.
if __name__ == "__main__":
    database_name = 'db_name'
    company_name = 'company_name'
    my_db = Db(database=database_name) # wrapper class for mysql.connector
    urls_dict = my_db.get_rest_api_urls_for_specific_company(company_name=company_name)
    update_id = my_db.get_updateid()
    my_db.get_connection(dictionary=True)
    for url in urls_dict:
        url_id = url['id']
        url = url['url']
        table_name = my_db.make_sql_table_name_by_url(url)
        insert_query = my_db.get_sql_for_insert(table_name)
        r = requests.get(url=url).json() # make the request
        args = [json.dumps(r), update_id, url_id]
        my_db.db_execute_one(insert_query, args, close_conn=False)
    my_db.close_conn()
This works fine, but how can I run it asynchronously to speed it up?
I have looked here, here and here but can't seem to get my head around it.
Here is what I have tried based on @Raphael Medaer's answer.
async def fetch(url):
    async with ClientSession() as session:
        async with session.request(method='GET', url=url) as response:
            json = await response.json()
            return json

async def process(url, update_id):
    table_name = await db.make_sql_table_name_by_url(url)
    result = await fetch(url)
    print(url, result)

if __name__ == "__main__":
    """Get urls from DB"""
    db = Db(database="fuse_src")
    urls = db.get_rest_api_urls() # This returns list of dictionary
    update_id = db.get_updateid()
    url_list = []
    for url in urls:
        url_list.append(url['url'])
    print(update_id)
    asyncio.get_event_loop().run_until_complete(
        asyncio.gather(*[process(url, update_id) for url in url_list]))
I get an error in the process method:
TypeError: object str can't be used in 'await' expression
I'm not sure what the problem is.
Any code example specific to this would be highly appreciated.
Making this code asynchronous will not speed it up by itself. It only helps if part of the work can overlap, for instance by having several SQL or HTTP queries in flight at the same time. Asynchronous programming does not execute code simultaneously; instead, while one task is waiting on a long I/O operation, other parts of your code get to run.
First of all, you'll have to use asynchronous libraries instead of synchronous ones:
mysql.connector could be replaced by aiomysql from aio-libs.
requests could be replaced by aiohttp.
To execute multiple asynchronous tasks "in parallel" (for instance to replace your loop for url in urls_dict:), read carefully about asyncio tasks and the gather function.
I will not (re)write your code in an asynchronous way, but here are a few lines of pseudo code which could help you:
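import asyncio

# Pseudo code: fetch() and MyDbConnection are placeholders for your own async helpers
async def process(db, url):
    result = await fetch(url)
    await db.commit(result)

async def main():
    db = MyDbConnection()
    # await is only valid inside an async function, so the setup lives in main()
    urls = await db.fetch_all_urls()
    await asyncio.gather(*[process(db, url) for url in urls])

if __name__ == "__main__":
    asyncio.run(main())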
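The `TypeError: object str can't be used in 'await' expression` in the question most likely arises because `db.make_sql_table_name_by_url` is an ordinary function that returns a `str`, and only awaitables (coroutines, tasks, futures) can follow `await`. If the database wrapper stays synchronous, one option is to call cheap functions directly and to push blocking calls onto a worker thread with `loop.run_in_executor`. The sketch below uses stand-ins (`make_table_name`, `blocking_insert`, `fetch`, and the `rows` list) that are assumptions, not the asker's `Db` class.

```python
import asyncio
import threading
import time

rows = []                      # stand-in for the database table
rows_lock = threading.Lock()   # inserts run on worker threads


def make_table_name(url):
    # Plain synchronous function: call it directly, do not await it.
    return url.replace("://", "_").replace("/", "_").replace(".", "_")


def blocking_insert(table, payload):
    # Stand-in for a blocking INSERT through a synchronous driver.
    time.sleep(0.01)
    with rows_lock:
        rows.append((table, payload))


async def fetch(url):
    # Stand-in for an aiohttp request returning JSON.
    await asyncio.sleep(0.01)
    return {"url": url}


async def process(url):
    table = make_table_name(url)
    payload = await fetch(url)
    loop = asyncio.get_running_loop()
    # Run the blocking call in the default thread pool so the loop stays free.
    await loop.run_in_executor(None, blocking_insert, table, payload)


async def main(urls):
    await asyncio.gather(*(process(u) for u in urls))
```

With a real synchronous driver, whether one connection may be shared across threads depends on the driver, so a connection per worker (or an async driver such as aiomysql, as suggested above) may be needed.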

Multithreading Python Requests Through Tor

The following code is my attempt at doing python requests through tor, this works fine, however I am interested in adding multithreading to this.
So I would like to simultaneously do about 10 different requests and process their outputs. What is the simplest and most efficient way to do this?
def onionrequest(url, onionid):
    onionid = onionid
    session = requests.session()
    session.proxies = {}
    session.proxies['http'] = 'socks5h://localhost:9050'
    session.proxies['https'] = 'socks5h://localhost:9050'
    #r = session.get('http://google.com')
    onionurlforrequest = "http://" + url
    try:
        r = session.get(onionurlforrequest, timeout=15)
    except:
        return None
    if r.status_code == 200:
        listofallonions.append(url)
I would recommend using the following packages to achieve this: asyncio, aiohttp, aiohttp_socks
Example code:
import asyncio
import aiohttp
from aiohttp_socks import ProxyConnector

async def fetch(session, url):
    async with session.get(url) as response:
        return await response.text()

async def main(urls):
    tasks = []
    # 9150 is the Tor Browser port; a standalone Tor daemon (as in the question) listens on 9050
    connector = ProxyConnector.from_url('socks5://localhost:9150', rdns=True)
    # rdns is an option of the connector; ClientSession does not accept it
    async with aiohttp.ClientSession(connector=connector) as session:
        for url in urls:
            tasks.append(fetch(session, url))
        htmls = await asyncio.gather(*tasks)
        for html in htmls:
            print(html)

if __name__ == '__main__':
    urls = [
        'http://python.org',
        'https://google.com',
        ...
    ]
    asyncio.run(main(urls))
Using asyncio can get a bit daunting at first, so you might need to practice for a while before you get the hang of it.
If you want a more in-depth explanation of the difference between synchronous and asynchronous, check out this question.

Run Parallel Request session in python

I am trying to open multiple web sessions and save the data to CSV. I have written my code using a for loop and requests.get, but it takes very long to access 90 web locations. Can anyone tell me how to run the whole process in parallel over loc_var?
The code works fine; the only issue is that it processes loc_var one entry at a time, which takes a long time.
I want to access all the loc_var URLs from the for loop in parallel and write the CSV files.
Below is the code:
import pandas as pd
import numpy as np
import os
import requests
import datetime
import zipfile

t=datetime.date.today()-datetime.timedelta(2)
server = [("A","web1",":5000","username=usr&password=p7Tdfr")]
'''List of all web_ips'''
web_1 = ["Web1","Web2","Web3","Web4","Web5","Web6","Web7","Web8","Web9","Web10","Web11","Web12","Web13","Web14","Web15"]
'''List of All location'''
loc_var =["post1","post2","post3","post4","post5","post6","post7","post8","post9","post10","post11","post12","post13","post14","post15","post16","post17","post18"]
for s,web,port,usr in server:
    login_url='http://'+web+port+'/api/v1/system/login/?'+usr
    print (login_url)
    s= requests.session()
    login_response = s.post(login_url)
    print("login Responce",login_response)
    #Start access the Web for Loc_variable
    for mkt in loc_var:
        #output is CSV File
        com_actions_url='http://'+web+port+'/api/v1/3E+date(%5C%22'+str(t)+'%5C%22)and+location+%3D%3D+%27'+mkt+'%27%22&page_size=-1&format=%22csv%22'
        print("com_action_url",com_actions_url)
        r = s.get(com_actions_url)
        print("action",r)
        if r.ok == True:
            with open(os.path.join("/home/Reports_DC/", "relation_%s.csv"%mkt),'wb') as f:
                f.write(r.content)
        # If loc is not aceesble try with another Web_1 List
        if r.ok == False:
            while r.ok == False:
                for web_2 in web_1:
                    login_url='http://'+web_2+port+'/api/v1/system/login/?'+usr
                    com_actions_url='http://'+web_2+port+'/api/v1/3E+date(%5C%22'+str(t)+'%5C%22)and+location+%3D%3D+%27'+mkt+'%27%22&page_size=-1&format=%22csv%22'
                    login_response = s.post(login_url)
                    print("login Responce",login_response)
                    print("com_action_url",com_actions_url)
                    r = s.get(com_actions_url)
                    if r.ok == True:
                        with open(os.path.join("/home/Reports_DC/", "relation_%s.csv"%mkt),'wb') as f:
                            f.write(r.content)
                        break
There are multiple approaches that you can take to make concurrent HTTP requests. Two that I've used are (1) multiple threads with concurrent.futures.ThreadPoolExecutor or (2) send the requests asynchronously using asyncio/aiohttp.
To use a thread pool to send your requests in parallel, you would first generate a list of URLs that you want to fetch in parallel (in your case generate a list of login_urls and com_action_urls), and then you would request all of the URLs concurrently as follows:
from concurrent.futures import ThreadPoolExecutor
import requests

def fetch(url):
    page = requests.get(url)
    return page.text
    # Catch HTTP errors/exceptions here

pool = ThreadPoolExecutor(max_workers=5)
urls = ['http://www.google.com', 'http://www.yahoo.com', 'http://www.bing.com'] # Create a list of urls
for page in pool.map(fetch, urls):
    # Do whatever you want with the results ...
    print(page[0:100])
Using asyncio/aiohttp is generally faster than the threaded approach above, but the learning curve is steeper. Here is a simple example (Python 3.7+):
import asyncio
import aiohttp

urls = ['http://www.google.com', 'http://www.yahoo.com', 'http://www.bing.com']

async def fetch(session, url):
    async with session.get(url) as resp:
        return await resp.text()
        # Catch HTTP errors/exceptions here

async def fetch_concurrent(urls):
    async with aiohttp.ClientSession() as session:
        tasks = []
        for u in urls:
            # asyncio.create_task schedules the coroutine on the running loop
            tasks.append(asyncio.create_task(fetch(session, u)))
        # as_completed yields results in the order the requests finish
        for result in asyncio.as_completed(tasks):
            page = await result
            #Do whatever you want with results
            print(page[0:100])

asyncio.run(fetch_concurrent(urls))
But unless you are going to be making a huge number of requests, the threaded approach will likely be sufficient (and way easier to implement).
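Applied to the question, each location can be handled as one independent job that tries the primary host first and then the fallbacks, and the jobs can be spread over a thread pool. This is a sketch under assumptions: `fetch` is a hypothetical callable returning the CSV bytes or `None` on failure (in the real script it would wrap the login and `s.get` calls), and the host and location names are placeholders.

```python
from concurrent.futures import ThreadPoolExecutor


def fetch_with_fallback(location, hosts, fetch):
    """Try each host in order; return (location, host, data) for the first success."""
    for host in hosts:
        data = fetch(host, location)
        if data is not None:
            return location, host, data
    return location, None, None  # every host failed


def fetch_all(locations, hosts, fetch, max_workers=10):
    """Run one fallback job per location in a thread pool; return {location: (host, data)}."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        jobs = pool.map(lambda loc: fetch_with_fallback(loc, hosts, fetch), locations)
        return {loc: (host, data) for loc, host, data in jobs}
```

Unlike the `while r.ok == False` loop in the question, this tries each host once and then gives up on that location, which avoids looping forever when no host can serve it. Since every location maps to its own file name, the CSV files can be written from the returned dictionary afterwards.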
