Hello, I am trying to make requests to an HTTPS page through proxies, but it gives me an error. I tried a couple of proxies from http://free-proxy.cz/en/proxylist/country/all/https/ping/all and other free proxy lists, but none of them work (only HTTP does).
import requests

proxies = [
    {
        "https": "207.236.12.76:10458"
    }
]

url = "https://api.ipify.org?format=json"

for proxy in proxies:
    resp = requests.get(url, proxies=proxy)
    print(resp.text)
This gives me the following error:
raise SSLError(e, request=request)
requests.exceptions.SSLError: HTTPSConnectionPool(host='api.ipify.org', port=443): Max retries exceeded with url: /?format=json (Caused by SSLError(SSLEOFError(8, 'EOF occurred in violation of protocol (_ssl.c:1129)')))
When I tried adding the scheme, like {"https": "https://207.236.12.76:10458"}, I got:
raise ProxyError(e, request=request)
requests.exceptions.ProxyError: HTTPSConnectionPool(host='api.ipify.org', port=443): Max retries exceeded with url: /?format=json (Caused by ProxyError('Cannot connect to proxy.', OSError('Tunnel connection failed: 403 Forbidden')))
Am I doing something wrong, or do the proxies just not work?
Before implementing, I'd suggest you check all the proxies with curl, like this:
curl -v -L "https://2ip.ru/" -x "https://205.207.100.81:8282"
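If you want to do the same check from Python before your real run, here is a minimal sketch (the proxy addresses and the echo URL are just examples) that keeps only the proxies that can actually tunnel HTTPS:

import requests

# Example candidates; replace with the proxies from your list.
candidates = ["207.236.12.76:10458", "205.207.100.81:8282"]

def working_proxies(candidates, test_url="https://api.ipify.org?format=json"):
    """Return only the proxies that complete an HTTPS request."""
    good = []
    for address in candidates:
        proxy = {"https": f"http://{address}"}
        try:
            resp = requests.get(test_url, proxies=proxy, timeout=10)
            resp.raise_for_status()
            good.append(address)
        except requests.RequestException:
            pass  # dead, slow, or non-tunneling proxy: skip it
    return good

print(working_proxies(candidates))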
Here is my code:

import os
import requests

os.environ['REQUESTS_CA_BUNDLE'] = os.path.join('/path/to/', 'ca-own.crt')

s = requests.Session()
s.cert = ('some.crt', 'some.key')
s.get('https://some.site.com')
The last call raises:
requests.exceptions.SSLError: HTTPSConnectionPool(host='some.site.com', port=443): Max retries exceeded with url: / (Caused by SSLError(SSLCertVerificationError(1, '[SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed: unable to get local issuer certificate (_ssl.c:1131)')))
With curl:
curl --cacert ca-own.crt --key some.key --cert some.crt https://some.site.com
returns normal html code.
How can I make a Python requests.Session send the correct certificates to the endpoint?
P.S. The same thing happens if I add the following:
s.verify = 'some.crt'
or
cat some.crt ca-own.crt > res.crt
s.verify = 'res.crt'
P.P.S.
cat some.crt some.key > res.pem
s.cert = "res.pem"
requests.exceptions.SSLError: HTTPSConnectionPool(host='some.site.com', port=443): Max retries exceeded with url: / (Caused by SSLError(SSLCertVerificationError(1, '[SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed: unable to get local issuer certificate (_ssl.c:1131)')))
cat ca-own.crt some.crt some.key > res.pem
s.cert = "res.pem"
requests.exceptions.SSLError: HTTPSConnectionPool(host='some.site.com', port=443): Max retries exceeded with url: / (Caused by SSLError(SSLError(116, '[X509: KEY_VALUES_MISMATCH] key values mismatch (_ssl.c:4067)')))
The above code will work if you pass verify=False to the GET request, but that is not ideal security-wise (it leaves you open to man-in-the-middle attacks), so you should instead point the verify parameter at the CA certificate (the issuer's certificate) file. More info here
session = requests.Session()
session.verify = "/path/to/issuers-certificate"  # the CA (issuer's) certificate
session.get('https://some.site.com')
You can try this:
session = requests.Session()
session.verify = "your CA cert"
response = session.get(url, cert=('path of client cert','path of client key'))
session.close()
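For the original example, the requests equivalent of the working curl command would be a sketch like this (file names taken from the question): verify against your own CA bundle and pass the client certificate and key as a tuple rather than a concatenated PEM.

import requests

s = requests.Session()
s.verify = 'ca-own.crt'            # CA that signed the server cert (like curl --cacert)
s.cert = ('some.crt', 'some.key')  # client certificate and key (like curl --cert / --key)

resp = s.get('https://some.site.com')
print(resp.status_code)

If you do want a single combined PEM for s.cert, the client certificate has to come first in the file, followed by the key; as far as I can tell, putting ca-own.crt first (as in the P.P.S.) is exactly what triggers the KEY_VALUES_MISMATCH, because the private key is checked against the first certificate in the file.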
I'm trying to fetch the title from a webpage. The title is visible there as BM Wendling Real Estate. The script I've tried can sometimes scrape the title, but most of the time it throws a 403 status. As the site bans IPs, I used proxies to get around that.
import random
import requests
from bs4 import BeautifulSoup

link = 'https://www.veteranownedbusiness.com/business/25150/bm-wendling-real-estate'

headers = {"User-Agent":"Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/86.0.4240.183 Safari/537.36"}

def get_proxy_list():
    r = requests.get('https://www.sslproxies.org/')
    soup = BeautifulSoup(r.text,"html.parser")
    proxies = [':'.join([item.select_one("td").text,item.select_one("td:nth-of-type(2)").text]) for item in soup.select("table.table tr") if "yes" in item.text]
    return proxies

def get(proxies):
    proxy = proxies.pop(random.randrange(len(proxies)))
    return {'https': f'http://{proxy}','http': f'http://{proxy}'}

def scrape(url,proxy,proxies):
    while True:
        try:
            print("proxy being used: {}".format(proxy))
            r = requests.get(url, headers=headers, proxies=proxy, timeout=10)
            assert r.status_code == 200
            soup = BeautifulSoup(r.text,"html.parser")
            title = soup.select_one(".bizname_hdr > h1").get_text(strip=True)
            return title
        except Exception as e:
            proxy = get(proxies)

if __name__ == "__main__":
    proxies = get_proxy_list()
    proxy = get(proxies)
    title = scrape(link,proxy,proxies)
    print(title)
Question: How can I scrape the title unhindered?
Note: The site restricts access to a few countries.
This is a bit of a long story, but one with a semi-happy ending, so please bear with me:
First, I made a few changes to your program. The first one was to ensure that I was only selecting proxies where 'yes' was in the 'Https' column. Second, I think the requests documentation may be a bit misleading: the proxies dict has two keys, 'https' and 'http', but I believe these should be different IP/port combinations. Since the proxies we are using are HTTPS proxies, I am only providing the 'https' key. Finally, I have changed the function interfaces slightly and printed out some diagnostics:
import random
import requests
from bs4 import BeautifulSoup

link = 'https://www.veteranownedbusiness.com/business/25150/bm-wendling-real-estate'

headers = {"User-Agent":"Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/86.0.4240.183 Safari/537.36"}

def get_proxy_list():
    r = requests.get('https://www.sslproxies.org/')
    soup = BeautifulSoup(r.text,"html.parser")
    proxies = (
        ((item.select_one("td").text, item.select_one("td:nth-of-type(2)").text), item.select_one("td:nth-of-type(7)").text)
        for item in soup.select("table.table tr") if "yes" in item.text
    )
    proxies = filter(lambda proxy: proxy[1] == 'yes', proxies)
    return [':'.join(proxy[0]) for proxy in proxies]

def select_proxy(proxies):
    proxy = proxies.pop(random.randrange(len(proxies)))
    #return {'https': f'https://{proxy}','http': f'http://{proxy}'}
    return {'https': f'http://{proxy}'}

def scrape(url):
    proxies = get_proxy_list()
    while True:
        try:
            proxy = select_proxy(proxies)
            print("\nproxy being used: {}".format(proxy))
            r = requests.get(url, headers=headers, proxies=proxy, timeout=10)
            assert r.status_code == 200
            soup = BeautifulSoup(r.text,"html.parser")
            title = soup.select_one(".bizname_hdr > h1").get_text(strip=True)
            return title
        except Exception as e:
            print('Exception is:', e)

if __name__ == "__main__":
    title = scrape(link)
    print(title)
What I got was a lot of errors:
proxy being used: {'https': 'http://71.19.145.97:6100'}
Exception is: HTTPSConnectionPool(host='www.veteranownedbusiness.com', port=443): Max retries exceeded with url: /business/25150/bm-wendling-real-estate (Caused by ProxyError('Cannot connect to proxy.', NewConnectionError('<urllib3.connection.HTTPSConnection object at 0x0000027DCFBB35E0>: Failed to establish a new connection: [WinError 10061] No connection could be made because the target machine actively refused it')))
proxy being used: {'https': 'http://1.20.100.134:40698'}
Exception is: HTTPSConnectionPool(host='www.veteranownedbusiness.com', port=443): Max retries exceeded with url: /business/25150/bm-wendling-real-estate (Caused by ConnectTimeoutError(<urllib3.connection.HTTPSConnection object at 0x0000027DCFBB3DF0>, 'Connection to 1.20.100.134 timed out. (connect timeout=10)'))
proxy being used: {'https': 'http://212.126.102.142:31785'}
Exception is: HTTPSConnectionPool(host='www.veteranownedbusiness.com', port=443): Max retries exceeded with url: /business/25150/bm-wendling-real-estate (Caused by ProxyError('Cannot connect to proxy.', timeout('timed out')))
proxy being used: {'https': 'http://68.183.185.149:8118'}
Exception is: HTTPSConnectionPool(host='www.veteranownedbusiness.com', port=443): Max retries exceeded with url: /business/25150/bm-wendling-real-estate (Caused by ConnectTimeoutError(<urllib3.connection.HTTPSConnection object at 0x0000027DCFBB38E0>, 'Connection to 68.183.185.149 timed out. (connect timeout=10)'))
proxy being used: {'https': 'http://141.0.11.243:80'}
Exception is: HTTPSConnectionPool(host='www.veteranownedbusiness.com', port=443): Max retries exceeded with url: /business/25150/bm-wendling-real-estate (Caused by ProxyError('Cannot connect to proxy.', OSError(0, 'Error')))
proxy being used: {'https': 'http://139.59.90.141:80'}
Exception is: HTTPSConnectionPool(host='www.veteranownedbusiness.com', port=443): Max retries exceeded with url: /business/25150/bm-wendling-real-estate (Caused by ProxyError('Cannot connect to proxy.', timeout('timed out')))
proxy being used: {'https': 'http://36.89.8.235:8080'}
Exception is: HTTPSConnectionPool(host='www.veteranownedbusiness.com', port=443): Max retries exceeded with url: /business/25150/bm-wendling-real-estate (Caused by ConnectTimeoutError(<urllib3.connection.HTTPSConnection object at 0x0000027DCFBC24F0>, 'Connection to 36.89.8.235 timed out. (connect timeout=10)'))
proxy being used: {'https': 'http://150.242.182.98:80'}
Exception is: HTTPSConnectionPool(host='www.veteranownedbusiness.com', port=443): Max retries exceeded with url: /business/25150/bm-wendling-real-estate (Caused by ProxyError('Cannot connect to proxy.', timeout('timed out')))
proxy being used: {'https': 'http://168.169.146.12:8080'}
Exception is: HTTPSConnectionPool(host='www.veteranownedbusiness.com', port=443): Max retries exceeded with url: /business/25150/bm-wendling-real-estate (Caused by SSLError(SSLCertVerificationError(1, '[SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed: self signed certificate in certificate chain (_ssl.c:1123)')))
etc.
I broke out of the above. I noticed that when I entered the link URL in my Chrome browser, it took an extremely long time for the host name to be resolved, and when I reloaded the page it just took a very long time for the page to load again. I am guessing that some of the above timeouts are due to the proxies having the same difficulty. There are, of course, also errors just trying to connect to the proxy itself. But all of this may be moot: now that I have connected to the website, I am able to fetch the page simply with:
import requests
from bs4 import BeautifulSoup
from time import time
headers = {"User-Agent":"Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/86.0.4240.183 Safari/537.36"}
url = 'https://www.veteranownedbusiness.com/business/25150/bm-wendling-real-estate'
t0 = time()
r = requests.get(url, headers=headers, timeout=120)
text = r.text
t1 = time()
soup = BeautifulSoup(text, "html.parser")
title = soup.select_one(".bizname_hdr > h1").get_text(strip=True)
print(title, t1 - t0)
Prints:
BM Wendling Real Estate 63.243138551712036
Note that I specified a 120-second timeout, and in fact it took 63 seconds for the results to come back. Yet when I ping the host, it responds rather quickly:
Pinging www.veteranownedbusiness.com [172.67.75.109] with 32 bytes of data:
Reply from 172.67.75.109: bytes=32 time=3ms TTL=59
Reply from 172.67.75.109: bytes=32 time=4ms TTL=59
Reply from 172.67.75.109: bytes=32 time=4ms TTL=59
Reply from 172.67.75.109: bytes=32 time=4ms TTL=59
Ping statistics for 172.67.75.109:
Packets: Sent = 4, Received = 4, Lost = 0 (0% loss),
Approximate round trip times in milli-seconds:
Minimum = 3ms, Maximum = 4ms, Average = 3ms
So the hang-up is probably not so much slow DNS lookup in general as whatever it is that makes the page take so long to load.
So I went back to the original program (actually my updated version) and changed the timeout value to 120. I am not sure whether this gives more time to connect to the proxy, but it does seem to give more time to connect to the actual website. The results were:
proxy being used: {'https': 'http://217.172.122.5:8080'}
Exception is: HTTPSConnectionPool(host='www.veteranownedbusiness.com', port=443): Max retries exceeded with url: /business/25150/bm-wendling-real-estate (Caused by ProxyError('Cannot connect to proxy.', NewConnectionError('<urllib3.connection.HTTPSConnection object at 0x000002DF162D3460>: Failed to establish a new connection: [WinError 10060] A connection attempt failed because the connected party did not properly respond after a period of time, or established connection failed because connected host has failed to respond')))
proxy being used: {'https': 'http://36.89.182.225:32338'}
Exception is: HTTPSConnectionPool(host='www.veteranownedbusiness.com', port=443): Max retries exceeded with url: /business/25150/bm-wendling-real-estate (Caused by ProxyError('Cannot connect to proxy.', NewConnectionError('<urllib3.connection.HTTPSConnection object at 0x000002DF162D3C70>: Failed to establish a new connection: [WinError 10060] A connection attempt failed because the connected party did not properly respond after a period of time, or established connection failed because connected host has failed to respond')))
proxy being used: {'https': 'http://95.58.145.22:8080'}
BM Wendling Real Estate
I was lucky to get a result on the third proxy try. I have rerun this several times since, and after approximately 10-15 unsuccessful tries I break out. I have had only one further success.
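If you would rather not break out by hand, here is a sketch (reusing get_proxy_list, select_proxy, and headers from my version above) that caps the number of proxy attempts and uses requests' two-part timeout, so a dead proxy fails fast while the slow site still gets a long read window:

def scrape_bounded(url, max_attempts=15):
    proxies = get_proxy_list()
    for _ in range(max_attempts):
        if not proxies:
            break  # ran out of proxies to try
        proxy = select_proxy(proxies)
        print("\nproxy being used: {}".format(proxy))
        try:
            # (connect timeout, read timeout): 10s to reach the proxy,
            # up to 120s for the slow page to come back.
            r = requests.get(url, headers=headers, proxies=proxy, timeout=(10, 120))
            r.raise_for_status()
            soup = BeautifulSoup(r.text, "html.parser")
            return soup.select_one(".bizname_hdr > h1").get_text(strip=True)
        except Exception as e:
            print('Exception is:', e)
    return None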
Here is the segment of code that is causing an issue:
proxies = {"http": "http://%s:%s#proxy_address:8080" %(user, pwd),
"https": "http://%s:%s#proxy_address:8080" %(user, pwd)}
requests.get('https://www.apple.com/', proxies=proxies)
I receive the error:
ProxyError: HTTPSConnectionPool(host='www.apple.com', port=443): Max retries exceeded with url: / (Caused by ProxyError('Cannot connect to proxy.', OSError('Tunnel connection failed: 407 Proxy Authentication Required')))
Can someone please help me resolve this issue? I'm sure that my username and password are correct.
Thanks.
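Since a 407 usually means the credentials never reached the proxy intact, one thing worth checking (a sketch, not a confirmed fix for your setup): requests expects them in the standard user:password@host:port form, and any special characters in the username or password must be percent-encoded so they don't break URL parsing:

from urllib.parse import quote
import requests

user, pwd = 'my_user', 'p@ss word'  # example credentials only

# Percent-encode the credentials, then embed them with '@' before the host.
auth = "%s:%s" % (quote(user, safe=''), quote(pwd, safe=''))
proxies = {"http": "http://%s@proxy_address:8080" % auth,
           "https": "http://%s@proxy_address:8080" % auth}

requests.get('https://www.apple.com/', proxies=proxies)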
I'm testing web scraping on localhost, using the requests library to open and read website content. When I test some website on my localhost, it works perfectly.
But the same script with the same tested URL on the production server returns:
HTTPSConnectionPool(host='example.com', port=443): Max retries exceeded with url: /somewhere.html (Caused by SSLError(SSLError(1, '[SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed (_ssl.c:852)'),))
Does anybody know what the difference is?
Give this a try: (See here for more)
requests.get('your_url_here', verify=False)
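If you would rather not turn verification off entirely (verify=False leaves you open to man-in-the-middle attacks), a common cause of this localhost-versus-production difference is an outdated or missing CA store on the server. Here is a sketch of the safer route, assuming the certifi package is installed (pip install --upgrade certifi):

import certifi
import requests

# Verify against certifi's up-to-date CA bundle instead of the
# possibly stale system store on the production machine.
r = requests.get('https://example.com/somewhere.html', verify=certifi.where())
print(r.status_code)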