I've been experimenting with web scraping using aiohttp, and I ran into an issue: whenever I use a proxy, the code inside the session.get block doesn't run. I've looked all over the internet and couldn't find a solution.
import asyncio
import random
import time

import aiohttp
from aiohttp.client import ClientSession

failed = 0
success = 0
proxypool = []
with open("proxies.txt", "r") as proxy_file:
    lines = proxy_file.readlines()
    for i in lines:
        x = i.split(":")
        proxypool.append("http://" + x[2] + ":" + x[3].rstrip() + "@" + x[0] + ":" + x[1])


async def download_link(url: str, session: ClientSession):
    global failed
    global success
    # random.choice avoids the off-by-one in randint(0, len(proxypool)),
    # whose upper bound is inclusive
    proxy = random.choice(proxypool)
    print(proxy)
    async with session.get(url, proxy=proxy) as response:
        if response.status != 200:
            failed += 1
        else:
            success += 1
        result = await response.text()
        print(result)


async def download_all(urls: list):
    my_conn = aiohttp.TCPConnector(limit=1000)
    async with aiohttp.ClientSession(connector=my_conn, trust_env=True) as session:
        tasks = []
        for url in urls:
            task = asyncio.ensure_future(download_link(url=url, session=session))
            tasks.append(task)
        await asyncio.gather(*tasks, return_exceptions=True)  # the await must be nested inside the session


url_list = ["https://www.google.com"] * 100
start = time.time()
asyncio.run(download_all(url_list))
end = time.time()
print(f'download {len(url_list)-failed} links in {end - start} seconds')
print(failed, success)
Here is the problem, though: the code works fine on my Mac. However, when I run the exact same code on Windows, it doesn't work. It also works fine without proxies, but as soon as I add them, it fails.
At the end, you can see that I print failed and success. On my Mac it outputs 0, 100, whereas on my Windows computer it prints 0, 0. This shows that the code inside session.get isn't running (also, nothing is printed).
The proxies I am using are paid proxies, and they work normally if I use requests.get(). Their format is "http://user:pass@ip:port".
I have also tried just using "http://ip:port" then using BasicAuth to carry the user and password, but this does not work either.
I've seen that many other people have had this problem, however the issue never seems to get solved.
Any help would be appreciated :)
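For reference, here is a minimal sketch of the proxy-list handling described in the question. It assumes each line of proxies.txt has the form host:port:user:password, which is what the indexing in the question implies; build_proxy_url and pick_proxy are hypothetical helper names, and the sample line is made up.

```python
import random
from urllib.parse import quote


def build_proxy_url(line: str) -> str:
    """Turn one 'host:port:user:password' line into an authenticated proxy URL."""
    host, port, user, password = line.strip().split(":", 3)
    # Percent-encode the credentials so characters such as '@' or ':' in a
    # password cannot be mistaken for URL delimiters.
    return f"http://{quote(user, safe='')}:{quote(password, safe='')}@{host}:{port}"


def pick_proxy(pool: list[str]) -> str:
    # random.choice never indexes past the end of the list, unlike
    # pool[random.randint(0, len(pool))], whose upper bound is inclusive.
    return random.choice(pool)


# Made-up sample line, used only for illustration.
pool = [build_proxy_url(line) for line in ["10.0.0.1:8080:alice:s3cret"]]
print(pick_proxy(pool))
```

The resulting string can be passed as the proxy argument of session.get, as in the question.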
After some more testing and research I found the issue: I needed to add ssl=False, which turns off certificate verification for the request.
So the correct way to make the request would be:
async with session.get(url, proxy=proxy, ssl=False) as response:
That worked for me.
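A likely reason the script in the question printed 0, 0 with no other output is that asyncio.gather(..., return_exceptions=True) returns exceptions as ordinary results instead of raising them, so a failing request stays silent unless the results are inspected. The sketch below illustrates this with a made-up fake_download coroutine standing in for download_link; it makes no network calls.

```python
import asyncio


async def fake_download(i: int) -> str:
    # Stand-in for download_link(); odd-numbered calls fail the way a
    # broken proxy connection might.
    await asyncio.sleep(0)
    if i % 2:
        raise ConnectionError(f"request {i} failed")
    return f"body {i}"


async def main() -> tuple[list, list]:
    results = await asyncio.gather(
        *(fake_download(i) for i in range(4)), return_exceptions=True
    )
    # With return_exceptions=True, failures arrive as values in the result list.
    errors = [r for r in results if isinstance(r, BaseException)]
    bodies = [r for r in results if not isinstance(r, BaseException)]
    for err in errors:
        print(f"{type(err).__name__}: {err}")
    return bodies, errors


bodies, errors = asyncio.run(main())
print(len(bodies), "succeeded,", len(errors), "failed")
```

Printing the collected exceptions this way would have shown the underlying SSL error directly.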
Related
I'm trying to create a simple program to check proxies, and about 99% of them time out. This is 100% NOT an issue with the proxies, and I can't seem to figure out what is. Here is the code:
import asyncio
import collections
import random

import aiohttp
import requests

# Blank lines in the downloaded list are skipped so no bare 'http://' entry is created.
proxies = ['http://' + proxy.strip() for proxy in requests.get('https://raw.githubusercontent.com/TheSpeedX/PROXY-List/master/http.txt').text.split('\n') if proxy.strip()]  # large proxy list
#proxies = ['http://' + proxy.strip() for proxy in requests.get('https://raw.githubusercontent.com/monosans/proxy-list/main/proxies/http.txt').text.split('\n') if proxy.strip()]  # small proxy list
random.shuffle(proxies)
statuses = collections.Counter()


async def make_request(session, proxies_queue):
    url = 'https://google.com'
    try:
        proxy = proxies_queue.get_nowait()
        async with session.get(url, proxy=proxy) as resp:
            statuses[resp.status] += 1
            print('status')
    except Exception as e:
        print(f'Exception: {e}')
        statuses['exception'] += 1


async def make_requests(n, delay):
    proxies_queue = asyncio.Queue()
    for proxy in proxies:
        proxies_queue.put_nowait(proxy)
    tasks = []
    timeout = aiohttp.ClientTimeout(total=10)
    async with aiohttp.ClientSession(timeout=timeout) as session:
        for i in range(n):
            tasks.append(asyncio.create_task(make_request(session, proxies_queue)))
            await asyncio.sleep(delay)
        for task in asyncio.as_completed(tasks):
            await task


asyncio.run(make_requests(100, 0.2))
I've had random runs of the program produce 100% 200 status codes, so that's why I'm not trusting anyone who tells me it's the proxies. Also, I just checked the smaller proxy list the moment it was updated, and it produced the same result.
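One thing that may help narrow this down: print(f'Exception: {e}') shows an empty message for a timeout, because the timeout exception usually carries no text, and counting everything under 'exception' hides which kind of failure dominates. The sketch below tallies failures by exception class instead. fake_request and checked are hypothetical stand-ins (no network access), and asyncio.wait_for plays the role of the session timeout.

```python
import asyncio
import collections

statuses = collections.Counter()


async def fake_request(delay: float, fail: bool) -> int:
    # Hypothetical stand-in for session.get(); not a real network call.
    await asyncio.sleep(delay)
    if fail:
        raise ConnectionResetError("proxy closed the connection")
    return 200


async def checked(delay: float, fail: bool, timeout: float) -> None:
    try:
        status = await asyncio.wait_for(fake_request(delay, fail), timeout)
        statuses[status] += 1
    except Exception as e:
        # repr() keeps the class name even when str(e) is empty,
        # which is the case for a timeout.
        print(f"Exception: {e!r}")
        statuses[type(e).__name__] += 1


async def main() -> None:
    await asyncio.gather(
        checked(0.0, False, 0.5),   # succeeds
        checked(0.0, True, 0.5),    # connection error
        checked(1.0, False, 0.05),  # times out
    )


asyncio.run(main())
print(dict(statuses))
```

With the counts split by class, it becomes easier to tell timeouts apart from refused connections or SSL failures.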
I have finished making a web scraper that will go through Roblox, and pick out all of the usernames of the first 1000 accounts made on Roblox. Fortunately it works! However, there is a downside.
My problem is that this code takes absolutely FOREVER to finish. Does anyone know a more efficient way to write the same thing, or is this just the base speed of Python Requests? Code is below :)
PS: The code took 5 minutes to go through only 600 accounts.
import requests
from bs4 import BeautifulSoup


def find_account(id):
    r = requests.request(url=f'https://web.roblox.com/users/{id}/profile', method='get')
    if r.status_code == 200:
        soup = BeautifulSoup(r.text, 'html.parser')
        stuff = soup.find_all('h2')
        special = stuff[0]
        special = list(special)
        special = special[0]
        return str(special) + ' ID: {}'.format(id)
    else:
        return None


users = []
for i in range(10000, 11000):
    users.append(find_account(i))
    print(f'{i-9999} out of 1000 done')
# There is more below this, but that is just the GUI and stuff. This is the part that gets the usernames.
Try asyncio with aiohttp to do the same thing asynchronously. The advantage of async Python is that you do not need to wait for one HTTP call to finish before starting the next. This is a fantastic article on how to write concurrent/parallel code in Python; give it a read if the syntax here is confusing.
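To make the "no waiting" point concrete, here is a small sketch that uses asyncio.sleep as a stand-in for an HTTP call (fake_http_call is a made-up name, and the 0.1 s latency is an assumption). Twenty simulated calls overlap, so the total time stays close to one call rather than the sum of all twenty.

```python
import asyncio
import time


async def fake_http_call(i: int) -> int:
    # Stand-in for one HTTP request that takes about 0.1 s.
    await asyncio.sleep(0.1)
    return i


async def main() -> list:
    # gather() starts all calls and waits for them together.
    return await asyncio.gather(*(fake_http_call(i) for i in range(20)))


start = time.perf_counter()
results = asyncio.run(main())
elapsed = time.perf_counter() - start
print(f"{len(results)} calls in {elapsed:.2f} s")
```

Run one after another, the same calls would take roughly 2 seconds.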
Refactored to run in async mode:
import asyncio

import aiohttp
from bs4 import BeautifulSoup


async def find_account(id, session):
    async with session.get(f'https://web.roblox.com/users/{id}/profile') as r:
        if r.status == 200:
            response_text = await r.read()
            soup = BeautifulSoup(response_text, 'html.parser')
            stuff = soup.find_all('h2')
            special = stuff[0]
            special = list(special)
            special = special[0]
            print(f'{id-9999} out of 1000 done')
            return str(special) + ' ID: {}'.format(id)
        else:
            return None


async def crawl_url_id_range(min_id, max_id):
    tasks = []
    async with aiohttp.ClientSession() as session:
        for id in range(min_id, max_id):
            tasks.append(asyncio.ensure_future(find_account(id=id, session=session)))
        # gather inside the session block so the session stays open
        return await asyncio.gather(*tasks)


users = asyncio.run(crawl_url_id_range(min_id=10000, max_id=11000))
I tested and the above code works fairly well.
This is my first time trying asyncio and aiohttp.
I have the following code that gets URLs from a MySQL database, makes GET requests to them, and pushes the responses back to the MySQL database.
import json

import requests

if __name__ == "__main__":
    database_name = 'db_name'
    company_name = 'company_name'
    my_db = Db(database=database_name)  # wrapper class for mysql.connector
    urls_dict = my_db.get_rest_api_urls_for_specific_company(company_name=company_name)
    update_id = my_db.get_updateid()
    my_db.get_connection(dictionary=True)
    for url in urls_dict:
        url_id = url['id']
        url = url['url']
        table_name = my_db.make_sql_table_name_by_url(url)
        insert_query = my_db.get_sql_for_insert(table_name)
        r = requests.get(url=url).json()  # make the request
        args = [json.dumps(r), update_id, url_id]
        my_db.db_execute_one(insert_query, args, close_conn=False)
    my_db.close_conn()
This works fine, but how can I run it asynchronously to speed it up?
I have looked here, here and here but can't seem to get my head around it.
Here is what I have tried based on @Raphael Medaer's answer.
import asyncio

from aiohttp import ClientSession


async def fetch(url):
    async with ClientSession() as session:
        async with session.request(method='GET', url=url) as response:
            json = await response.json()
            return json


async def process(url, update_id):
    table_name = await db.make_sql_table_name_by_url(url)
    result = await fetch(url)
    print(url, result)


if __name__ == "__main__":
    """Get urls from DB"""
    db = Db(database="fuse_src")
    urls = db.get_rest_api_urls()  # This returns list of dictionary
    update_id = db.get_updateid()
    url_list = []
    for url in urls:
        url_list.append(url['url'])
    print(update_id)
    asyncio.get_event_loop().run_until_complete(
        asyncio.gather(*[process(url, update_id) for url in url_list]))
I get an error in the process method:
TypeError: object str can't be used in 'await' expression
I'm not sure what the problem is.
Any code example specific to this would be highly appreciated.
Making this code asynchronous will not speed it up by itself. It helps only if you run parts of your code in "parallel", for instance by issuing multiple (SQL or HTTP) queries at the "same time". Asynchronous programming does not execute code simultaneously; instead, while one task is waiting on long I/O, other parts of your code get to run.
First of all, you'll have to use asynchronous libraries instead of synchronous ones.
mysql.connector could be replaced by aiomysql from aio-libs.
requests could be replaced by aiohttp.
To execute multiple asynchronous tasks in "parallel" (for instance, to replace your loop for url in urls_dict:), read carefully about asyncio tasks and the gather function.
I will not (re)write your code in an asynchronous way, but here are a few lines of pseudocode that could help you:
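import asyncio


async def process(db, url):
    result = await fetch(url)  # fetch(): placeholder for your aiohttp-based coroutine
    await db.commit(result)    # assumes an asynchronous database wrapper


async def main():
    db = MyDbConnection()      # placeholder, e.g. a class built on aiomysql
    urls = await db.fetch_all_urls()  # await is only valid inside a coroutine
    await asyncio.gather(*[process(db, url) for url in urls])


if __name__ == "__main__":
    asyncio.run(main())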
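As for the TypeError in the question: await only works on awaitables, and a regular method that returns a str is not one. A minimal sketch of two ways around this follows, assuming the database wrapper stays synchronous. make_table_name and blocking_insert are hypothetical stand-ins for the Db methods, and asyncio.to_thread requires Python 3.9 or newer.

```python
import asyncio
import time


def make_table_name(url: str) -> str:
    # Ordinary (synchronous) function, a hypothetical stand-in for
    # db.make_sql_table_name_by_url(); it returns a str, not an awaitable.
    return url.rsplit("/", 1)[-1].replace("-", "_")


def blocking_insert(table: str, payload: str) -> str:
    # Stand-in for a blocking database call made through a synchronous driver.
    time.sleep(0.05)
    return f"inserted into {table}: {payload}"


async def process(url: str) -> str:
    table = make_table_name(url)  # no await: the function is synchronous
    # asyncio.to_thread runs the blocking call in a worker thread so the
    # event loop can keep serving other tasks in the meantime.
    return await asyncio.to_thread(blocking_insert, table, "{}")


async def main() -> list:
    urls = [
        "https://api.example.com/v1/open-orders",
        "https://api.example.com/v1/users",
    ]
    return await asyncio.gather(*(process(u) for u in urls))


rows = asyncio.run(main())
print(rows)
```

Cheap synchronous helpers can simply be called without await; only calls that block for a noticeable time need the thread hand-off (or a truly asynchronous driver such as aiomysql).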
I am downloading some information from webpages in the form
http://example.com?p=10
http://example.com?p=20
...
The point is that I don't know how many there are. At some point I will receive an error from the server, or I may want to stop processing because I have enough. I want to run the requests in parallel.
import requests


def generator_query(step=10):
    i = 0
    while True:  # yields URLs indefinitely; the caller decides when to stop
        yield "http://example.com?p=%d" % i
        i += step


def task(url):
    t = requests.get(url).text
    if not t:  # after the last one
        return None
    return t
I can implement it with a consumer/producer pattern using queues, but I am wondering whether a higher-level implementation is possible, for example with the concurrent module.
Non-concurrent example:
results = []
for url in generator_query():
    page = task(url)
    if page is None:  # stop after the last page
        break
    results.append(page)
You could use ThreadPoolExecutor from concurrent.futures. An example of how to use it is provided here.
You'll need to break out of the example's for-loop when you get invalid answers from the server (the except section) or whenever you feel you have enough data (you could count valid responses in the else section, for example).
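Since the linked example is not reproduced here, the following is one possible sketch of the idea rather than the example itself: submit pages in small batches and stop at the first empty result. fake_task stands in for task(url) and pretends the server has pages up to p=70; LAST_PAGE, the batch size, and the worker count are all assumptions made for illustration.

```python
from concurrent.futures import ThreadPoolExecutor
from itertools import count, islice

LAST_PAGE = 70  # pretend the server has pages p=0..70; unknown to the client


def fake_task(p: int):
    # Hypothetical stand-in for task(url): returns None after the last page.
    return f"page {p}" if p <= LAST_PAGE else None


def fetch_until_empty(step: int = 10, batch_size: int = 4, workers: int = 4) -> list:
    results = []
    offsets = count(0, step)  # 0, 10, 20, ... without a known end
    with ThreadPoolExecutor(max_workers=workers) as pool:
        while True:
            batch = list(islice(offsets, batch_size))
            # map() runs the batch concurrently and yields results in order.
            for page in pool.map(fake_task, batch):
                if page is None:
                    return results  # first empty page: stop everything
                results.append(page)


pages = fetch_until_empty()
print(len(pages), pages[-1])
```

The trade-off of batching is that up to batch_size - 1 requests past the last page may be issued before the loop notices the end.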
You could use aiohttp for this purpose:
import asyncio

import aiohttp


async def fetch(session, url):
    async with session.get(url) as response:
        return await response.text()


async def coro(step):
    url = 'https://example.com?p={}'.format(step)
    async with aiohttp.ClientSession() as session:
        html = await fetch(session, url)
        print(html)


async def main():
    # asyncio.wait() no longer accepts bare coroutines (Python 3.11+),
    # so gather them instead
    await asyncio.gather(*(coro(i * 10) for i in range(10)))


if __name__ == '__main__':
    asyncio.run(main())
As for the page error, you might have to figure it out yourself, since I don't know what website you're dealing with. Maybe try...except?
Note: if your Python version is higher than 3.5, you might run into an SSL certificate verification error.
Situation:
I am trying to send an HTTP request to every domain listed in a file I have already downloaded and get the destination URL I was redirected to.
Problem: I followed a tutorial, but I get far fewer responses than expected. It's around 100 responses per second, whereas the tutorial lists 100,000 responses per minute.
The script also gets slower and slower after a couple of seconds, until I get just 1 response every 5 seconds.
Already tried: At first I thought the problem was that I ran it on a Windows server. But when I tried the script on my own computer (Unix, macOS), it was only a little faster. On another Linux server it performed the same as on my computer.
Code: https://pastebin.com/WjLegw7K
import asyncio
import glob
import os
import re

from aiohttp import ClientSession

work_dir = os.path.dirname(__file__)


async def fetch(url, session):
    try:
        async with session.get(url, ssl=False) as response:
            if response.status == 200:
                delay = response.headers.get("DELAY")
                date = response.headers.get("DATE")
                print("{}:{} with delay {}".format(date, response.url, delay))
                return await response.read()
    except Exception:
        pass


async def bound_fetch(sem, url, session):
    # Getter function with semaphore.
    async with sem:
        await fetch(url, session)


async def run():
    os.chdir(work_dir)
    for file in glob.glob("cdx-*"):
        print("Opening: " + file)
        opened_file = file
        tasks = []
        # create instance of Semaphore
        sem = asyncio.Semaphore(40000)
        with open(work_dir + '/' + file) as infile:
            seen = set()
            async with ClientSession() as session:
                for line in infile:
                    regex = re.compile(r'://(.*?)/')
                    domain = regex.search(line).group(1)
                    domain = domain.lower()
                    if domain not in seen:
                        seen.add(domain)
                    task = asyncio.ensure_future(bound_fetch(sem, 'http://' + domain, session))
                    tasks.append(task)
                    del line
                responses = asyncio.gather(*tasks)
                await responses
            infile.close()
        del seen
        del file


loop = asyncio.get_event_loop()
future = asyncio.ensure_future(run())
loop.run_until_complete(future)
I really don't know how to fix that issue. Especially because I'm very new to Python... but I have to get it to work somehow :(
It's hard to tell what is going wrong without actually debugging the code, but one potential problem is that file processing is serialized. In other words, the code never processes the next file until all the requests from the current file have finished. If there are many files and one of them is slow, this could be a problem.
To change this, define run along these lines:
async def run():
    os.chdir(work_dir)
    async with ClientSession() as session:
        sem = asyncio.Semaphore(40000)
        seen = set()
        pending_tasks = set()
        for f in glob.glob("cdx-*"):
            print("Opening: " + f)
            with open(f) as infile:
                lines = list(infile)
            for line in lines:
                domain = re.search(r'://(.*?)/', line).group(1)
                domain = domain.lower()
                if domain in seen:
                    continue
                seen.add(domain)
                task = asyncio.ensure_future(bound_fetch(sem, 'http://' + domain, session))
                pending_tasks.add(task)
                # ensure that each task removes itself from the pending set
                # when done, so that the set doesn't grow without bounds
                task.add_done_callback(pending_tasks.remove)
        # await the remaining tasks
        if pending_tasks:  # asyncio.wait() raises ValueError on an empty set
            await asyncio.wait(pending_tasks)
Another important thing: silencing all exceptions in fetch() is bad practice because there is no indication that something has started going wrong (due to either a bug or a simple typo). This might well be the reason your script becomes "slow" after a while - fetch is raising exceptions and you're never seeing them. Instead of pass, use something like print(f'failed to get {url}: {e}') where e is the object you get from except Exception as e.
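A minimal sketch of that change is shown below. flaky_get is a hypothetical stand-in for session.get so the snippet runs without network access, and the failures list is an optional extra that lets the caller count errors afterwards.

```python
import asyncio


async def flaky_get(url: str) -> bytes:
    # Hypothetical stand-in for session.get(url); fails for one host.
    await asyncio.sleep(0)
    if "bad" in url:
        raise OSError("connection refused")
    return b"ok"


async def fetch(url: str, failures: list):
    try:
        return await flaky_get(url)
    except Exception as e:
        # Report the failure instead of silently passing.
        print(f"failed to get {url}: {e!r}")
        failures.append((url, e))
        return None


async def main():
    failures = []
    urls = ["http://good.example", "http://bad.example"]
    bodies = await asyncio.gather(*(fetch(u, failures) for u in urls))
    return bodies, failures


bodies, failures = asyncio.run(main())
print(len(failures), "request(s) failed")
```

If the failure count climbs as the run progresses, the slowdown is probably caused by errors (timeouts, refused connections, exhausted sockets) rather than by the event loop itself.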
Several additional remarks:
There is almost never a need to del local variables in Python; the garbage collector does that automatically.
You needn't close() a file opened using a with statement. with is designed specifically to do such closing automatically for you.
The code added domains to a seen set, but also processed an already seen domain. This version skips the domain for which it had already spawned a task.
You can create a single ClientSession and use it for the entire run.