I wanted to get a proxy list from this web page: https://free-proxy-list.net/
but I am stuck on this error and don't know how to fix it.
requests.exceptions.ProxyError: HTTPSConnectionPool(host='free-proxy-list.net', port=443): Max retries exceeded with url: / (Caused by ProxyError('Cannot connect to proxy.', NewConnectionError('<urllib3.connection.VerifiedHTTPSConnection object at 0x00000278BFFA1EB0>: Failed to establish a new connection: [WinError 10060] A connection attempt failed because the connected party did not properly respond after a period of time, or established connection failed because connected host has failed to respond')))
By the way, this is the related code:
import urllib
import requests
from bs4 import BeautifulSoup
from fake_useragent import UserAgent
ua = UserAgent(cache=False)
header = {
"User-Agent": str(ua.msie)
}
proxy = {
"https": "http://95.66.151.101:8080"
}
urls = "https://free-proxy-list.net/"
res = requests.get(urls, proxies=proxy)
soup = BeautifulSoup(res.text,'lxml')
I also tried scraping other websites, but I realized that this isn't the right approach.
You're using "https" as the key in the proxies dict when your proxy is an HTTP proxy.
Proxies should always be passed in this format.
For an HTTP proxy:
{"http": "http proxy"}
For an HTTPS proxy:
{"https": "https proxy"}
And for the User-Agent header:
{"User-Agent": "Opera/9.80 (X11; Linux x86_64; U; de) Presto/2.2.15 Version/10.00"}
Example
import requests
requests.get("https://example.com", proxies={"http":"http://95.66.151.101:8080"}, headers={"User-Agent": "Opera/9.80 (X11; Linux x86_64; U; de) Presto/2.2.15 Version/10.00"})
The module you imported with from fake_useragent import UserAgent is irrelevant and unnecessary here.
Extra
The error could also have happened because the proxy isn't valid or responded improperly.
If you are looking for free lists of proxies, consider checking out these sources (a minimal fetch-and-use sketch follows the list):
https://pastebin.com/raw/VJwVkqRT
https://proxyscrape.com/free-proxy-list
https://www.freeproxylists.net/
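As a rough illustration of how such a list could be consumed, here is a minimal sketch. It assumes the raw pastebin link above serves plain text with one ip:port entry per line and that the chosen entry is an HTTP proxy; both are assumptions about the source, not guarantees, and free proxies die quickly.

import requests

# Assumption: the raw list serves plain text with one "ip:port" entry per line.
raw = requests.get("https://pastebin.com/raw/VJwVkqRT", timeout=10).text
candidates = [line.strip() for line in raw.splitlines() if line.strip()]

# Try the first candidate as an HTTP proxy; expect failures and fall through to the next one.
proxy = {"http": "http://" + candidates[0]}
try:
    res = requests.get("http://example.com", proxies=proxy, timeout=10)
    print(res.status_code)
except requests.exceptions.RequestException:
    print("Proxy not reachable, try the next candidate.")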
I have never used the fake_useragent module and don't know what it's for, but I removed it. I also don't know why you added those header elements; I don't believe they are necessary for the task you described. Looking at the HTML at your link, the proxies live in the <tbody> inside the <div class="container"> under the section with id="list". The code below returns all the elements in that area, including all the proxies. You can alter this if you want to get more specific info.
import requests
from bs4 import BeautifulSoup
urls = "https://free-proxy-list.net/"
res = requests.get(urls)
soup = BeautifulSoup(res.text,"html.parser")
tbody = soup.find("tbody")
print(tbody.prettify())
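If you want the proxies as ip:port strings rather than raw HTML, here is a small follow-up sketch. It assumes the first two cells of each table row are the IP address and the port, which matches the page layout at the time of writing but may change.

import requests
from bs4 import BeautifulSoup

res = requests.get("https://free-proxy-list.net/")
soup = BeautifulSoup(res.text, "html.parser")

proxies = []
for row in soup.find("tbody").find_all("tr"):
    cells = row.find_all("td")
    if len(cells) >= 2:  # assumption: cell 0 = IP address, cell 1 = port
        proxies.append(cells[0].text + ":" + cells[1].text)

print(proxies[:10])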
Related
I am developing a web scraping application with BeautifulSoup and Django and I am experiencing what I think are connection issues.
The app has to check whether a website satisfies various SEO requirements, and for that I have to make several requests: first to get the "soup", and then to check whether, for example, robots.txt and sitemap.xml exist. I guess some sites are blocking my app because of that, and I keep getting the "'Connection aborted.', RemoteDisconnected" error, or in other cases I don't get the error but the "soup" is empty. Is there a way to fix this? I have tried time.sleep() but it doesn't seem to work.
This is part of my code:
http = PoolManager()
r = http.request('GET', "https://" + url, headers={'User-Agent': "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/51.0.2704.103 Safari/537.36", 'Accept-Encoding': 'br'})
soup = BeautifulSoup(r.data, 'lxml')
And this is where I check whether robots.txt and sitemap.xml exist:
robots_url = url + "/robots.txt"
robot = requests.get(robots_url, headers=headers)
if robot.ok:
    robot = True
else:
    robot = False

sleep(5)

sitemap_url = url + '/sitemap.xml'
sitemap = requests.get(sitemap_url, headers=headers)
if sitemap.ok:
    sitemap = True
else:
    sitemap = False
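For reference, here is a minimal sketch that consolidates those two checks behind a shared requests.Session with explicit error handling. The User-Agent value is taken from the snippet above; the helper name and timeout are illustrative assumptions, not a confirmed fix for the RemoteDisconnected error.

import requests

session = requests.Session()
session.headers.update({
    "User-Agent": ("Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 "
                   "(KHTML, like Gecko) Chrome/51.0.2704.103 Safari/537.36")
})

def exists(check_url):
    """Return True if the URL answers with an OK status, False otherwise."""
    try:
        return session.get(check_url, timeout=10).ok
    except requests.exceptions.RequestException:
        return False

# `url` is assumed to already include the scheme (e.g. "https://example.com").
robot = exists(url + "/robots.txt")
sitemap = exists(url + "/sitemap.xml")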
On most websites the code works fine, but there are some pages, which I suppose have a higher security level, that end the connection with this error:
During handling of the above exception (('Connection aborted.', RemoteDisconnected('Remote end closed connection without response'))), another exception occurred:
/app/.heroku/python/lib/python3.9/site-packages/django/core/handlers/exception.py, line 47, in inner
Thank you so much in advance for your time and advice.
I am trying to web scrape an http website and I am getting the error below when I try to read it.
HTTPSConnectionPool(host='proxyvipecc.nb.xxxx.com', port=83): Max retries exceeded with url: http://campanulaceae.myspecies.info/ (Caused by ProxyError('Cannot connect to proxy.', OSError('Tunnel connection failed: 403 Forbidden',)))
Below is the code I have written, using a similar website. I tried using urllib and a user-agent and I still get the same issue.
url = "http://campanulaceae.myspecies.info/"
response = requests.get(url, headers={'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.77 Safari/537.36'})
soup = BeautifulSoup(response.text, 'html.parser')
Can anyone help me with this issue? Thanks in advance.
You should try adding a proxy when requesting the URL.
proxyDict = {
    'http': "add http proxy",
    'https': "add https proxy"
}
requests.get(url, proxies=proxyDict)
You can find more information in the requests documentation.
I tried using User-Agent: Defined and it worked for me.
url = "http://campanulaceae.myspecies.info/"
headers = {
"Accept-Language" : "en-US,en;q=0.5",
"User-Agent": "Defined",
}
response = requests.get(url, headers=headers)
response.raise_for_status()
data = response.text
soup = BeautifulSoup(data, 'html.parser')
print(soup.prettify())
If you get an error that says "bs4.FeatureNotFound: Couldn't find a tree builder with the features you requested: html-parser.", it means you're not using the right parser. Install the lxml module, import it at the top, and then pass "lxml" instead of "html.parser" when you make the soup.
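For illustration, the swap would look roughly like this (assuming lxml has been installed with pip install lxml, and data is the response text from the snippet above):

from bs4 import BeautifulSoup

# "lxml" requires the lxml package to be installed (pip install lxml).
soup = BeautifulSoup(data, "lxml")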
I am sending a POST request with proxies but keep running into a proxy error.
I have already tried multiple solutions on Stack Overflow for [WinError 10061] No connection could be made because the target machine actively refused it.
I tried changing system settings, verified that the remote server exists and is running, and no HTTP_PROXY environment variable is set on the system.
import requests

proxy = {IP_ADDRESS:PORT}  # proxy
proxy = {'https': 'https://' + proxy}

# standard header
header = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/67.0.3396.87 Safari/537.36",
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8",
    "Referer": "https://tres-bien.com/adidas-yeezy-boost-350-v2-black-fu9006-fw19",
    "Accept-Encoding": "gzip, deflate, br",
    "Accept-Language": "en-GB,en-US;q=0.9,en;q=0.8"
}

# payload to be posted
payload = {
    "form_key": "1UGlG3F69LytBaMF",
    "sku": "adi-fw19-003",
    # the two values above are populated dynamically; hardcoded here to help you replicate
    "fullname": "myname",
    "email": "myemail@gmail.com",
    "address": "myaddress",
    "zipcode": "areacode",
    "city": "mycity",
    "country": "mycountry",
    "phone": "myphonenumber",
    "Size_raffle": "US_11"
}

r = requests.post(url, proxies=proxy, headers=header, verify=False, json=payload)
print(r.status_code)
Expected output: 200, alongside an email verification sent to my email address.
Actual output: requests.exceptions.ProxyError: HTTPSConnectionPool(host='tres-bien.com', port=443): Max retries exceeded with url: /adidas-yeezy-boost-350-v2-black-fu9006-fw19 (Caused by ProxyError('Cannot connect to proxy.', NewConnectionError(': Failed to establish a new connection: [WinError 10061] No connection could be made because the target machine actively refused it',)))
Quite a few things are wrong here... (after looking at the raffle page you're trying to post to, I suspect it is https://tres-bien.com/adidas-yeezy-boost-350-v2-black-fu9006-fw19, based on the exception you posted).
1) I'm not sure what's going on with your first definition of proxy as a dict instead of a string. That said, it's probably good practice to define both http and https proxies; if your proxy can support https, it should also be able to support http.
proxy = {
    'http': 'http://{}:{}'.format(IP_ADDRESS, PORT),
    'https': 'https://{}:{}'.format(IP_ADDRESS, PORT)
}
2) The second issue is that the raffle you're trying to submit to takes URL-encoded form data, not JSON. Your request should therefore be structured like:
r = requests.post(
    url=url,
    headers=headers,
    data=payload
)
3) That page has a ReCaptcha present, which is missing from your form payload. This isn't why your request is getting a connection error, but you're not going to successfully submit a form that has a ReCaptcha field without a proper token.
4) Finally, I suspect the root of your ProxyError is that you are trying to POST to the wrong URL. Looking at Chrome Inspector, you should be submitting this data to
https://tres-bien.com/tbscatalog/manage/rafflepost/ whereas your exception output indicates you are POSTing to https://tres-bien.com/adidas-yeezy-boost-350-v2-black-fu9006-fw19
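Putting points 1, 2 and 4 together, a sketch of the corrected request might look like the following. The endpoint is the one inferred from Chrome Inspector above, the proxy details are placeholders, header and payload are reused from the question, and it still omits the ReCaptcha token from point 3, so the submission would not actually be accepted as-is.

import requests

# Placeholder proxy details; substitute your own working proxy.
IP_ADDRESS, PORT = "1.2.3.4", 8080
proxies = {
    "http": "http://{}:{}".format(IP_ADDRESS, PORT),
    "https": "https://{}:{}".format(IP_ADDRESS, PORT),
}

# POST endpoint inferred via Chrome Inspector in point 4, not the product page URL.
url = "https://tres-bien.com/tbscatalog/manage/rafflepost/"

# `header` and `payload` as defined in the question; form data goes in data=, not json=.
r = requests.post(url, headers=header, data=payload, proxies=proxies)
print(r.status_code)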
Good luck with the shoes.
I have used BeautifulSoup4 and Python to parse local HTML files a few times in the past. Now I would like to scrape a website using a proxy (400 requests are needed in total, and after 100 requests the IP gets blocked).
After slowing down my script with an ordinary sleep, I want to use a proxy, but I have never done this before and need some help. I tried two methods, with help from Stack Overflow questions:
Method 1
This method works with another website, but it doesn't download the data I want. When I print the response I received, it prints "Response [200]". When I try this method with the real website, it returns the error "Max retries exceeded with url:". I suspect the proxy is not being handled correctly. When I try to read the HTML, I get the following error:
page_html = response.read()
AttributeError: 'Response' object has no attribute 'read'
response = requests.get(URL, proxies=PROXY, headers=HEADER)
Method 2
I was able to download another webpage, but I wasn't able to download the original webpage (which blocked me). I assume there is a mistake in the script and the proxy isn't handled correctly: either the real IP is sent to the website, or I can't connect to the proxy.
response = urllib.request.urlopen(urllib.request.Request(url, None, header, proxy))
My script looks like this:
HEADER = {'User-Agent': 'Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.9.0.7) Gecko/2009021910 Firefox/3.0.7'}
URL = "https://www.website.php"
PROXY = {"https": "https//59.110.7.190:1080"}

#response.close() Is this even necessary
page_html = response.read()  # With Method 1 I also tried response.text which resulted in "str is not callable"
response.close()

page_soup = soup(page_html, "html.parser")
adresses = page_soup.findAll("li", {"class": "list-group-item"})

for address in adresses:
    try:
        #parsing the html
    except (TypeError):
        f.write("invalid data" + "\n")
    time.sleep(random.randint(1, 10))
The error I usually get is the following:
requests.exceptions.ProxyError: HTTPSConnectionPool(host='www.firmendb.de', port=443): Max retries exceeded with url: /[website.php] (Caused by ProxyError('Cannot connect to proxy.', NewConnectionError(': Failed to establish a new connection: [Errno 11001] getaddrinfo failed',)))
Process finished with exit code 1
I assume I messed up the proxy part of the script; it did work before I tried to add the proxy. Because I have never done this before, my main question is: is the proxy part correct? I got the proxy from the following website: https://free-proxy-list.net/
How to choose a proxy from these lists?
How to connect to the proxies?
Any suggestions on proxy-providers to use?
Any proposal for my script?
If you don't mind using an API, I can recommend https://gimmeproxy.com, which has proved to be a reliable source of working proxies.
There is even a python wrapper: https://github.com/ericfourrier/gimmeproxy-api
Result will be like this:
{
"supportsHttps": true,
"protocol": "socks5",
"ip": "19.162.12.82",
"port": "915",
"get": true,
"post": true,
"cookies": true,
"referer": true,
"user-agent": true,
"anonymityLevel": 1,
"websites": {
"example": true,
"google": false,
"amazon": true
},
"country": "US",
"tsChecked": 1517952910,
"curl": "socks5://19.162.12.82:915",
"ipPort": "19.162.12.82:915",
"type": "socks5",
"speed": 17.7,
"otherProtocols": {}
}
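As a sketch of how such a result could be consumed with requests, assuming the service exposes a getProxy endpoint that returns JSON in the format shown above (and noting that a SOCKS proxy like this one additionally needs pip install requests[socks]):

import requests

# Assumption: this endpoint returns one proxy as JSON in the format shown above.
info = requests.get("https://gimmeproxy.com/api/getProxy", timeout=10).json()

# "curl" already contains protocol://ip:port, e.g. "socks5://19.162.12.82:915".
proxies = {"http": info["curl"], "https": info["curl"]}

r = requests.get("https://example.com", proxies=proxies, timeout=10)
print(r.status_code)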
Thanks to the comments! The error was that I didn't consider how often proxies change: I wrote the proxy into my script well before testing it.
To help others, this is how the script would have to look in Python 3.
Of course the HEADER/URL/PROXY could also be a list and then fed through a for loop.
import random
import time
import requests
from bs4 import BeautifulSoup as soup

HEADER = {'User-Agent': 'Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.9.0.7) Gecko/2009021910 Firefox/3.0.7'}
URL = "https://www.website.php"
PROXY = {"https": "https://59.110.7.190:1080"}

response = requests.get(URL, proxies=PROXY, headers=HEADER)
page_html = response.text

page_soup = soup(page_html, "html.parser")
adresses = page_soup.findAll("li", {"class": "list-group-item"})  # for example

for address in adresses:
    try:
        #parsing the html
    except (TypeError):
        f.write("invalid data" + "\n")  # f is an already-open output file
    time.sleep(random.randint(1, 10))
I am trying to make a request through a SOCKS5 proxy server over HTTPS, but it fails or returns an empty string. I am using the PySocks library.
Here is my example
WEB_SITE_PROXY_CHECK_URL = "whatismyipaddress.com/"
REQUEST_SCHEMA = "https://"
host_url = REQUEST_SCHEMA + WEB_SITE_PROXY_CHECK_URL
socket.connect((host_url, 443))
request = "GET / HTTP/1.1\nHost: " + host_url + "\nUser-Agent:Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.11 (KHTML, like Gecko) Chrome/23.0.1271.64 Safari/537.11\n\n"
socket.send(request)
response = socket.recv(4096)
print response
But it doesn't work; it prints an empty response.
Is there any way to make an HTTPS request through a SOCKS5 proxy in Python?
Thanks
As of requests version 2.10.0, released on 2016-04-29, requests supports SOCKS.
It requires PySocks, which can be installed with pip install pysocks.
import requests

host_url = 'https://example.com'

# Fill in your own proxies' details
proxies = {'http': 'socks5://user:pass@host:port',
           'https': 'socks5://user:pass@host:port'}

# define headers if you will
headers = {}

response = requests.get(host_url, headers=headers, proxies=proxies)
Beware: when using a SOCKS proxy, requests will make HTTP requests with the full URL (e.g., GET example.com HTTP/1.1 rather than GET / HTTP/1.1), and this behavior may cause problems.
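A related detail, documented in requests/urllib3 rather than stated in the answer above: with the socks5:// scheme, hostnames are resolved locally, while socks5h:// asks the proxy to resolve them, which can matter if local DNS is blocked or you want to avoid DNS leaks.

# Resolve hostnames through the proxy instead of locally (note the "h").
proxies = {'http': 'socks5h://user:pass@host:port',
           'https': 'socks5h://user:pass@host:port'}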