Proxy Error : BeautifulSoup HTTPSConnectionPool - python

My code is supposed to call https://httpbin.org/ip to get my origin IP using a random proxy chosen from a list scraped from a website that provides free proxies.
However, when I run the code below, it sometimes returns a correct response (status 200 with the expected body) and other times it raises:
MaxRetryError: HTTPSConnectionPool(host='httpbin.org', port=443): Max retries exceeded with url: /ip (Caused by ProxyError('Cannot connect to proxy.', NewConnectionError('<urllib3.connection.VerifiedHTTPSConnection object at 0x000001EF83500DC8>: Failed to establish a new connection: [WinError 10060] A connection attempt failed because the connected party did not properly respond after a period of time, or established connection failed because connected host has failed to respond')))
Traceback (most recent call last):
File "<ipython-input-196-baf92a94e8ec>", line 19, in <module>
response = s.get(url,proxies=proxyDict)
This is the code I am using:

import random
import requests
from bs4 import BeautifulSoup

# Scrape the free proxy list; IP and port are the first two columns of each row
res = requests.get('https://free-proxy-list.net/', headers={'User-Agent': 'Mozilla/5.0'})
soup = BeautifulSoup(res.text, "lxml")
proxies = []
for items in soup.select("#proxylisttable tbody tr"):
    proxy_list = ':'.join([item.text for item in items.select("td")[:2]])
    proxies.append(proxy_list)

url = 'https://httpbin.org/ip'
choosenProxy = random.choice(proxies)
proxyDict = {
    'http': 'http://' + str(choosenProxy),
    'https': 'https://' + str(choosenProxy)
}
s = requests.Session()
response = s.get(url, proxies=proxyDict)
print(response.text)
What does the error mean? Is there a way I could fix this?

Try the following solution. It will keep trying with different proxies until it finds a working one. Once it finds a working proxy, the script should give you the required response and break the loop.
import random
import requests
from bs4 import BeautifulSoup

url = 'https://httpbin.org/ip'
proxies = []

res = requests.get('https://free-proxy-list.net/', headers={'User-Agent': 'Mozilla/5.0'})
soup = BeautifulSoup(res.text, "lxml")
for items in soup.select("#proxylisttable tbody tr"):
    proxy_list = ':'.join([item.text for item in items.select("td")[:2]])
    proxies.append(proxy_list)

while True:
    choosenProxy = random.choice(proxies)
    proxyDict = {
        'http': f'http://{choosenProxy}',
        'https': f'https://{choosenProxy}'
    }
    print("trying with:", proxyDict)
    try:
        response = requests.get(url, proxies=proxyDict, timeout=5)
        print(response.text)
        break
    except Exception:
        continue
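The error itself just means the randomly chosen proxy could not be reached at all; free proxies go stale very quickly, which is why only some runs succeed. If you would rather not loop forever when every proxy in the list happens to be dead, a small variation is to bound the number of attempts and drop proxies that fail. This is only a sketch of that idea, not part of the original answer:

import random
import requests

def fetch_via_random_proxy(url, proxy_pool, max_attempts=10, timeout=5):
    # Try up to max_attempts random proxies, removing each dead one from the pool
    pool = list(proxy_pool)
    for _ in range(max_attempts):
        if not pool:
            break
        proxy = random.choice(pool)
        proxy_dict = {'http': f'http://{proxy}', 'https': f'https://{proxy}'}
        try:
            return requests.get(url, proxies=proxy_dict, timeout=timeout)
        except requests.exceptions.RequestException:
            pool.remove(proxy)  # dead or too slow; never pick it again
    raise RuntimeError("no working proxy found")

# Usage with the proxies list built above:
# print(fetch_via_random_proxy('https://httpbin.org/ip', proxies).text)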

Related

How do I build a connection to the internet through a VPN, while using a library that connects to websites?

I am connected to the web via a VPN and I would like to connect to a news site to grab, well, news. For this a library exists: FinNews. And this is the code:
import FinNews as fn
cnbc_feed = fn.CNBC(topics=['finance', 'earnings'])
print(cnbc_feed.get_news())
print(cnbc_feed.possible_topics())
Now, because of the VPN, the connection won't work and it throws:
<urlopen error [WinError 10061] No connection could be made because
the target machine actively refused it ( client - server )
So I started separately to work out how to make a connection work, and it does (the output is "Connected"):
import urllib.request
from urllib.error import HTTPError, URLError

proxy = "http://user:pw#proxy:port"
proxies = {"http": "http://%s" % proxy}
url = "http://www.google.com/search?q=test"
headers = {'User-agent': 'Mozilla/5.0'}
try:
    proxy_support = urllib.request.ProxyHandler(proxies)
    opener = urllib.request.build_opener(proxy_support, urllib.request.HTTPHandler(debuglevel=1))
    urllib.request.install_opener(opener)
    req = urllib.request.Request(url, None, headers)
    html = urllib.request.urlopen(req).read()
    #print (html)
    print("Connected")
except (HTTPError, URLError) as err:
    print("No internet connection.")
Now I have figured out how to access the news and how to make a connection via the VPN, but I can't bring the two together. I want to grab the news via the library, through the VPN. I am fairly new to Python, so I guess I don't fully get the logic yet.
EDIT: I tried to combine it with feedparser, based on furas' hint:
import urllib.request
from urllib.error import HTTPError, URLError

import feedparser

proxy = "http://user:pw#proxy:port"
proxies = {"http": "http://%s" % proxy}
#url = "http://www.google.com/search?q=test"
#url = "http://www.reddit.com/r/python/.rss"
url = "https://timesofindia.indiatimes.com/rssfeedstopstories.cms"
headers = {'User-agent': 'Mozilla/5.0'}
try:
    proxy_support = urllib.request.ProxyHandler(proxies)
    opener = urllib.request.build_opener(proxy_support, urllib.request.HTTPHandler(debuglevel=1))
    urllib.request.install_opener(opener)
    req = urllib.request.Request(url, None, headers)
    html = urllib.request.urlopen(req).read()
    #print (html)
    #print ("Connected")
    feed = feedparser.parse(html)
    #print (feed['feed']['link'])
    print("Number of RSS posts :", len(feed.entries))
    entry = feed.entries[1]
    print("Post Title :", entry.title)
except (HTTPError, URLError) as err:
    print("No internet connection.")
But I get the same error... this is a big nut to crack.
May I ask for your advice? Thank you :)
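One common way to send a third-party library's traffic through a proxy without touching its code is to set the standard HTTP_PROXY / HTTPS_PROXY environment variables before the library makes any requests; both urllib and requests read them by default. A sketch under the assumption that FinNews (or feedparser underneath it) fetches its feeds with one of those libraries, reusing the question's placeholder credentials in the usual user:pw@host:port form:

import os

# Placeholder proxy; urllib and requests both honour these variables
os.environ["HTTP_PROXY"] = "http://user:pw@proxy:port"
os.environ["HTTPS_PROXY"] = "http://user:pw@proxy:port"

import FinNews as fn

cnbc_feed = fn.CNBC(topics=['finance', 'earnings'])
print(cnbc_feed.get_news())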

Getting SSLError: HTTPSConnectionPool when scraping a list of links

I have a list of links from a reviews website:
links =['https://www.yelp.com/biz/city-tamale-bronx-3', 'https://www.yelp.com/biz/the-boogie-down-grind-caf%C3%A9-bronx', 'https://www.yelp.com/biz/fratillis-pizza-bronx', 'https://www.yelp.com/biz/randall-restaurant-bronx', 'https://www.yelp.com/biz/valencia-bakery-bronx-3', 'https://www.yelp.com/biz/the-point-cafe-and-bascom-catering-new-york', 'https://www.yelp.com/biz/delfini-restaurant-bronx', 'https://www.yelp.com/biz/bayside-seafood-company-bronx', 'https://www.yelp.com/biz/il-forno-bakery-bronx', 'https://www.yelp.com/biz/allen-restaurant-bronx']
I wrote a function that retrieves the names of the reviewers:
import time

import requests
from bs4 import BeautifulSoup

def getReviewerName(restaurantLink, headers, proxies):
    session = requests.Session()
    time.sleep(10)
    req = session.get(restaurantLink, headers=headers, proxies=proxies)
    bs = BeautifulSoup(req.text, "html.parser")
    time.sleep(4)
    nameDiv = bs.find_all("div", {"class": "media-story"})
    time.sleep(3)
    name = [name.find("li", {"class": "user-name"}) for name in nameDiv]
    time.sleep(2)
    name = [n.text for n in name if n is not None]
    print(name)
I am applying time.sleep before each request so that my bot remains undetected.
I wrote a for loop that applies the function getReviewerName to each link in the list of links:
for link in links:
    headers = {'User-Agent': get_User_Agent()}
    proxies = {"http": "http://" + get_proxies(), "https": "http://" + get_proxies()}
    getReviewerName(link, headers, proxies)
In this for loop I am using a function called get_User_Agent(), which returns a random User-Agent, and a function called get_proxies(), which returns a random proxy, all with the aim of remaining undetected.
I am getting the expected result just for the first link in the list of links:
['\nDavid L.\n', '\nKarla G.\n', '\nMickey W.\n', '\nGabrielle P.\n', '\nOmar M.\n', '\nフェルナンド\n', '\nMichael B.\n', '\nBrittany H.\n', '\nTy C.\n', '\ndouble double u.\n', '\nLizzy N.\n', '\nAlina G.\n', '\nSam W.\n', '\nCristina C.\n', '\nLetticia C.\n', '\nJennifer S.\n', '\nJeremy R.\n', '\nKahliah L.\n', '\nE. M.\n', '\nSaïeda H.\n']
However, when I get to the second link, I am getting an SSLError:
SSLError: HTTPSConnectionPool(host='www.yelp.com', port=443): Max retries exceeded with url: /biz/the-boogie-down-grind-caf%C3%A9-bronx (Caused by SSLError(SSLError("bad handshake: Error([('SSL routines', 'tls_process_server_certificate', 'certificate verify failed')],)",),))
Any help on how to sort this out will be much appreciated. Thanks!
Check your proxies; you are probably using HTTP proxies as HTTPS, i.e. change your proxy format from:
{"https": "https://..."}
to
{"http": "http://..."}

Python HTTPConnectionPool Failed to establish a new connection: [Errno 11004] getaddrinfo failed

I was wondering whether my requests are being blocked by the website and I need to set a proxy. I first tried to close the HTTP connection, but I failed. I also tried to test my code, but now it seems to produce no output. Maybe if I use a proxy everything will be OK?
Here is the code.
import requests
from urllib.parse import urlencode
import json
from bs4 import BeautifulSoup
import re
from html.parser import HTMLParser
from multiprocessing import Pool
from requests.exceptions import RequestException
import time

def get_page_index(offset, keyword):
    #headers = {'User-Agent':'Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10_6_8; en-us) AppleWebKit/534.50 (KHTML, like Gecko) Version/5.1 Safari/534.50'}
    data = {
        'offset': offset,
        'format': 'json',
        'keyword': keyword,
        'autoload': 'true',
        'count': 20,
        'cur_tab': 1
    }
    url = 'http://www.toutiao.com/search_content/?' + urlencode(data)
    try:
        response = requests.get(url, headers={'Connection': 'close'})
        response.encoding = 'utf-8'
        if response.status_code == 200:
            return response.text
        return None
    except RequestException as e:
        print(e)

def parse_page_index(html):
    data = json.loads(html)
    if data and 'data' in data.keys():
        for item in data.get('data'):
            url = item.get('article_url')
            if url and len(url) < 100:
                yield url

def get_page_detail(url):
    try:
        response = requests.get(url, headers={'Connection': 'close'})
        response.encoding = 'utf-8'
        if response.status_code == 200:
            return response.text
        return None
    except RequestException as e:
        print(e)

def parse_page_detail(html):
    soup = BeautifulSoup(html, 'lxml')
    title = soup.select('title')[0].get_text()
    pattern = re.compile(r'articleInfo: (.*?)},', re.S)
    pattern_abstract = re.compile(r'abstract: (.*?)\.', re.S)
    res = re.search(pattern, html)
    res_abstract = re.search(pattern_abstract, html)
    if res and res_abstract:
        data = res.group(1).replace(r".replace(/<br \/>|\n|\r/ig, '')", "") + '}'
        abstract = res_abstract.group(1).replace(r"'", "")
        content = re.search(r'content: (.*?),', data).group(1)
        source = re.search(r'source: (.*?),', data).group(1)
        time_pattern = re.compile(r'time: (.*?)}', re.S)
        date = re.search(time_pattern, data).group(1)
        date_today = time.strftime('%Y-%m-%d')
        img = re.findall(r'src="(.*?)&quot', content)
        if date[1:11] == date_today and len(content) > 50 and img:
            return {
                'title': title,
                'content': content,
                'source': source,
                'date': date,
                'abstract': abstract,
                'img': img[0]
            }

def main(offset):
    flag = 1
    html = get_page_index(offset, '光伏')
    for url in parse_page_index(html):
        html = get_page_detail(url)
        if html:
            data = parse_page_detail(html)
            if data:
                html_parser = HTMLParser()
                cwl = html_parser.unescape(data.get('content'))
                data['content'] = cwl
                print(data)
                print(data.get('img'))
                flag += 1
                if flag == 5:
                    break

if __name__ == '__main__':
    pool = Pool()
    pool.map(main, [i*20 for i in range(10)])
and the error is here:
HTTPConnectionPool(host='tech.jinghua.cn', port=80): Max retries exceeded with url: /zixun/20160720/f191549.shtml (Caused by NewConnectionError('<requests.packages.urllib3.connection.HTTPConnection object at 0x00000000048523C8>: Failed to establish a new connection: [Errno 11004] getaddrinfo failed',))
By the way, when I first tested my code, everything seemed to be OK!
Thanks in advance!
It seems to me you're hitting the connection limit of the HTTPConnectionPool, since you start 10 jobs at the same time.
Try one of the following:
Increase the request timeout (seconds): requests.get('url', timeout=5)
Close the response with Response.close(): instead of returning response.text directly, assign it to a variable, close the response, and then return that variable (see the sketch below)
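A minimal sketch of that second suggestion applied to the asker's get_page_detail (my adaptation, not code from the answer):

import requests
from requests.exceptions import RequestException

def get_page_detail(url):
    try:
        response = requests.get(url, headers={'Connection': 'close'})
        response.encoding = 'utf-8'
        if response.status_code == 200:
            text = response.text   # read the body first
            response.close()       # then release the connection back to the pool
            return text
        response.close()
        return None
    except RequestException as e:
        print(e)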
When I faced this issue I had the following problems.
I wasn't able to do the following:
- The requests Python module was unable to get information from any URL, although I could browse the same site with a browser and could get wget or curl to download the page.
- pip install was also not working and used to fail with the following error:
Failed to establish a new connection: [Errno 11004] getaddrinfo failed
A certain site had blocked me, so I tried forcebindip to use another network interface for my Python modules, and then removed it. That probably messed up my network configuration, and the requests module and even the direct socket module got stuck and could not fetch any URL.
So I followed the network configuration reset described in the link below and now I am good.
network configuration reset
In case it helps someone else, I faced this same error message:
Client-Request-ID=long-string Retry policy did not allow for a retry: , HTTP status code=Unknown, Exception=HTTPSConnectionPool(host='table.table.core.windows.net', port=443): Max retries exceeded with url: /service(PartitionKey='requests',RowKey='9999') (Caused by NewConnectionError('<urllib3.connection.HTTPSConnection object at 0x000001D920ADA970>: Failed to establish a new connection: [Errno 11001] getaddrinfo failed')).
...when trying to retrieve a record from Azure Table Storage using
table_service.get_entity(table_name, partition_key, row_key).
My issue:
- I had the table_name incorrectly defined.
- The URL I had built was malformed (after ".com" there was no slash, so another part of the URL was fused onto it).
Sometimes it's due to a VPN connection. I had the same problem: I wasn't even able to install the requests package via pip. I turned off my VPN and voilà, I managed to install it and also to make requests. The [Errno 11004] code was gone.

How to overcome urlopen error [WinError 10060] issue?

I have followed this tutorial but I still fail to get output. Below is my code in view.py
import urllib.request

from bs4 import BeautifulSoup
from django.shortcuts import render

def index(request):
    #html="a"
    #url= requests.get("https://www.python.org/")
    #page = urllib.request.urlopen(url)
    #soup = BeautifulSoup(page.read())
    #soup=url.content
    #urllib3.disable_warnings()
    #requests.packages.urllib3.disable_warnings(InsecureRequestWarning)
    #url=url.content
    #default_headers = make_headers(basic_auth='myusername:mypassword')
    #http = ProxyManager("https://myproxy.com:8080/", headers=default_headers)
    r = urllib.request.urlopen('http://www.aflcio.org/Legislation-and-Politics/Legislative-Alerts').read()
    soup = BeautifulSoup(r)
    url = type(soup)
    context = {"result": url}
    return render(request, 'index.html', context)
Output:
urlopen error [WinError 10060] A connection attempt failed because the
connected party did not properly respond after a period of time, or
established connection failed because connected host has failed to respond
If you are sitting behind a firewall or similar you might have to specify a proxy for the request to get through.
See the example below using the requests library.
import requests
proxies = {
'http': 'http://10.10.1.10:3128',
'https': 'http://10.10.1.10:1080',
}
r = requests.get('http://example.org', proxies=proxies)
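If you want to stay with urllib.request, which the view above already uses, the same proxy settings can be installed globally with a ProxyHandler (a sketch; 10.10.1.10:3128 and :1080 are just the placeholder addresses from the example above):

import urllib.request

# Install an opener that routes all urllib requests through the proxy
proxy_support = urllib.request.ProxyHandler({
    'http': 'http://10.10.1.10:3128',
    'https': 'http://10.10.1.10:1080',
})
opener = urllib.request.build_opener(proxy_support)
urllib.request.install_opener(opener)

r = urllib.request.urlopen('http://www.aflcio.org/Legislation-and-Politics/Legislative-Alerts').read()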

python using requests with valid hostname

Trying to use requests to download a list of URLs and catch the exception if one is a bad URL. Here's my test code:
import requests
from requests.exceptions import ConnectionError

#goodurl
url = "http://www.google.com"
#badurl with good host
#url = "http://www.google.com/thereisnothing.jpg"
#url with bad host
#url = "http://somethingpotato.com"
print url
try:
    r = requests.get(url, allow_redirects=True)
    print "the url is good"
except ConnectionError, e:
    print e
    print "the url is bad"
The problem is that if I pass in url = "http://www.google.com", everything works as it should and as expected, since it is a good URL.
http://www.google.com
the url is good
But if I pass in url = "http://www.google.com/thereisnothing.jpg"
I still get:
http://www.google.com/thereisnothing.jpg
the url is good
So it's almost like it's not even looking at anything after the "/".
Just to see if the error checking is working at all, I passed a bad hostname: #url = "http://somethingpotato.com"
Which kicked back the error message I expected:
http://somethingpotato.com
HTTPConnectionPool(host='somethingpotato.com', port=80): Max retries exceeded with url: / (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7f1b6cd15b90>: Failed to establish a new connection: [Errno -2] Name or service not known',))
the url is bad
What am I missing to make requests catch a bad URL, not just a bad hostname?
Thanks
requests does not raise an exception for a 404 response. Instead you need to filter those out by checking whether the status is 'ok' (HTTP response 200).
import requests
from requests.exceptions import ConnectionError

#goodurl
url = "http://www.google.com/nothing"
#badurl with good host
#url = "http://www.google.com/thereisnothing.jpg"
#url with bad host
#url = "http://somethingpotato.com"
print url
try:
    r = requests.get(url, allow_redirects=True)
    if r.status_code == requests.codes.ok:
        print "the url is good"
    else:
        print "the url is bad"
except ConnectionError, e:
    print e
    print "the url is bad"
EDIT:
import requests
from requests.exceptions import ConnectionError

def printFailedUrl(url, response):
    if isinstance(response, ConnectionError):
        print "The url " + url + " failed to connect with the exception " + str(response)
    else:
        print "The url " + url + " produced the failed response code " + str(response.status_code)

def testUrl(url):
    try:
        r = requests.get(url, allow_redirects=True)
        if r.status_code == requests.codes.ok:
            print "the url is good"
        else:
            printFailedUrl(url, r)
    except ConnectionError, e:
        printFailedUrl(url, e)

def main():
    testUrl("http://www.google.com") #'Good' Url
    testUrl("http://www.google.com/doesnotexist.jpg") #'Bad' Url with 404 response
    testUrl("http://sdjgb") #'Bad' url with inaccessable url

main()
In this case one function can handle either an exception or a request response passed into it. This way you can respond differently when the URL returns a non-'good' (non-200) response versus when the URL is unusable and throws an exception. Hope this has the information you need.
What you want is to check r.status_code. Getting r.status_code for "http://www.google.com/thereisnothing.jpg" will give you 404. You can add a condition so that only a URL with a 200 status code counts as "good".
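A more compact way to get the same behaviour is requests' raise_for_status(), which raises an HTTPError for 4xx/5xx responses such as the 404 above. A sketch of that alternative, written so that it runs under both Python 2 and 3:

import requests
from requests.exceptions import ConnectionError, HTTPError

def check_url(url):
    try:
        r = requests.get(url, allow_redirects=True)
        r.raise_for_status()   # raises HTTPError for 4xx/5xx status codes such as 404
        print("the url is good")
    except HTTPError as e:
        print("the url is bad (bad path): " + str(e))
    except ConnectionError as e:
        print("the url is bad (bad host): " + str(e))

check_url("http://www.google.com/thereisnothing.jpg")  # 404 -> bad path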
