I am currently building a proxy rotator in Python. Everything is running fine so far, except that despite the proxies, the IP-tracker pages return my own IP.
I have already read through dozens of posts on this forum. The usual answer is "something is wrong with the proxy in this case".
I have a long list of proxies (about 600) which I test with my method, and when I scraped them I made sure they were marked either "elite" or "anonymous" before putting them on this list.
So can it be that the majority of free proxies are "junk" when it comes to anonymity, or am I fundamentally doing something wrong?
And is there a way to find out how anonymous a given proxy actually is?
Python 3.10.
import time
import requests

headers = {
    "User-agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.169 Safari/537.36"
}
proxi = {"http": ""}
proxy_ping_ready = ["173.219.112.85:8080",
                    "43.132.148.107:2080",
                    "216.176.187.99:8886",
                    "193.108.21.234:1234",
                    "151.80.120.192:3128",
                    "139.255.10.234:8080",
                    "120.24.33.141:8000",
                    "12.88.29.66:9080",
                    "47.241.66.249:1081",
                    "51.79.205.165:8080",
                    "63.250.53.181:3128",
                    "160.3.168.70:8080"]
ipTracker = ["wtfismyip.com/text", "api.ip.sb/ip", "ipecho.net/plain", "ifconfig.co/ip"]

for element in proxy_ping_ready:
    for choice in ipTracker:
        try:
            proxi["http"] = "http://" + element
            ips = requests.get(f'https://{choice}', proxies=proxi, timeout=1, headers=headers).text
            print(f'My IP address is: {ips}', choice)
        except Exception as e:
            print("Error:", e)
        time.sleep(3)
Output (example):
My IP address is: 89.13.9.135
api.ip.sb/ip
My IP address is: 89.13.9.135
wtfismyip.com/text
My IP address is: 89.13.9.135
ifconfig.co/ip
(Every time my own address).
You only set your proxy for HTTP traffic; since the tracker URLs are requested over https, you need to include a key for HTTPS traffic as well.
proxi["http"] = "http://" + element
proxi["https"] = "http://" + element # or "https://" + element, depends on the proxy
As James mentioned, you should also set an https proxy:
proxi["https"] = "http://" + element
If you are getting "Max retries exceeded with url" errors, it most probably means that the proxy is not working or is too slow and overloaded, so you might increase your timeout.
You can verify whether your proxy is working by setting it as an environment variable. I took one from your list:
import os
os.environ["http_proxy"] = "173.219.112.85:8080"
os.environ["https_proxy"] = "173.219.112.85:8080"
and then run your code without the proxies argument by changing your request to
ips = requests.get('https://wtfismyip.com/text', headers=headers).text
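As for the original question about telling how anonymous a proxy is: a rough check (just a sketch, looks_anonymous is a name I made up) is to compare the IP a tracker reports through the proxy with the one it reports directly; if they match, the proxy is transparent and leaks your address. Truly "elite" proxies typically also avoid adding headers such as Via and X-Forwarded-For, which a plain IP tracker will not show you.
import requests

def looks_anonymous(proxy, tracker="https://wtfismyip.com/text", timeout=5):
    # Your real address, fetched directly (trust_env=False ignores any *_proxy env vars)
    with requests.Session() as direct:
        direct.trust_env = False
        real_ip = direct.get(tracker, timeout=timeout).text.strip()
    # What the tracker sees when the request goes through the proxy
    proxies = {"http": "http://" + proxy, "https": "http://" + proxy}
    proxied_ip = requests.get(tracker, proxies=proxies, timeout=timeout).text.strip()
    return proxied_ip != real_ip

print(looks_anonymous("173.219.112.85:8080"))  # proxy taken from the question's list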
Related
1 - The target DOMAIN is https://www.dnb.com/
This website blocks access from many countries around the world, including mine (Algeria).
So the known solution is clear (use a proxy), which I did.
2 - Configuring the system proxy in the network settings and connecting to the website via Google Chrome works; using Firefox with the proxy settings also works fine.
3 - Then I came to my code to start the job:
import requests
# 1. Initialize the proxy
proxy = "xxx.xxx.xxx.xxx:3128"
# 2. Setting the Headers (I cloned Firefox request headers)
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:88.0) Gecko/20100101 Firefox/88.0",
    "Accept-Encoding": "gzip, deflate, br",
    "Connection": "keep-alive",
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8",
    "Upgrade-Insecure-Requests": "1",
    "Host": "www.dnb.com",
    "DNT": "1"
}
# 3. URL
URL = "https://www.dnb.com/business-directory/company-profiles.bicicletas_monark_s-a.7ad1f8788ea84850ceef11444c425a52.html"
# 4. Make a get request.
r = requests.get(URL, headers=headers, proxies={"https": proxy})
# Nothing is returned and the program keeps executing (like an infinite loop).
Note:
I know this keeps waiting because the default timeout is None, but I am sure the setup is working and the requests library should return a response; a timeout here would only serve to assess the reliability of the proxy, for example.
So, what is the cause of this? It is stuck (and so am I), while I get the response and the correct HTML content with Firefox, Chrome and Postman using the same configuration.
I checked your code and ran it on my local machine. It seems the issue is with the proxy. I added a public proxy and it is working. You can confirm it by adding a "timeout" argument of a few seconds to the requests.get call. Also, if the code then works properly (even if the response is 403), it means there is an issue with your proxy.
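For example, a minimal variant of the request from the question, reusing URL and headers from the snippet above (the proxy address stays a placeholder; the scheme is added explicitly and both keys are set):
proxies = {"http": "http://xxx.xxx.xxx.xxx:3128", "https": "http://xxx.xxx.xxx.xxx:3128"}
try:
    r = requests.get(URL, headers=headers, proxies=proxies, timeout=10)
    print(r.status_code)
except requests.exceptions.RequestException as err:
    print("Proxy failed or too slow:", err)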
My code works when I make the request from my local machine.
When I try to make the request from AWS EC2 I get the following error:
requests.exceptions.ReadTimeout: HTTPSConnectionPool(host='www1.xyz.com', port=443): Read timed out. (read timeout=20)
I tried checking the URL and that was not the issue. I then went ahead and tried to visit the page using the URL and the hidemyass web proxy with the location set to that of the AWS EC2 machine; it got a 404.
The code:
# Dummy URLs
header = {
    "User-Agent": "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/50.0.2661.75 Safari/537.36",
    "X-Requested-With": "XMLHttpRequest"
}
url = 'https://www1.xyz.com/iKeys.jsp?symbol={}&date=31DEC2020'.format(
    symbol)
raw_page = requests.get(url, timeout=10, headers=header).text
I have tried setting the proxies to another IP address in the request, which I found online:
proxies = {
    "http": "http://125.99.100.193",
    "https": "https://125.99.100.193",
}
raw_page = requests.get(url, timeout=10, headers=header, proxies=proxies).text
Still got the same error.
1 - Do I need to specify the port in proxies? Could this be causing the error when the proxy is set?
2 - What could be a solution for this?
Thanks
I have a web link as below:
https://www.nseindia.com/live_market/dynaContent/live_watch/option_chain/optionKeys.jsp
I use the code below to collect the data but am getting this error:
requests.exceptions.ConnectionError: ('Connection aborted.',
OSError("(10060, 'WSAETIMEDOUT')",))
My Code:
from requests import Session
import lxml.html

expiry_list = []
try:
    session = Session()
    headers = {'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/79.0.3945.117 Safari/537.36'}
    session.headers.update(headers)
    url = 'https://www1.nseindia.com/live_market/dynaContent/live_watch/option_chain/optionKeys.jsp'
    params = {'symbolCode': 9999, 'symbol': 'BANKNIFTY', 'instrument': 'OPTIDX', 'date': '-', 'segmentLink': 17}
    response = session.get(url, params=params)
    soup = lxml.html.fromstring(response.text)
    expiry_list = soup.xpath('//form[@id="ocForm"]//option/text()')
    expiry_list.remove(expiry_list[0])
except Exception as error:
    print("Error:", error)
print("Expiry_Date =", expiry_list)
It works perfectly on my local machine but gives this error on an Amazon EC2 instance. Do any settings need to be changed to resolve the request timeout error?
AWS houses lots of botnets, so spam blacklists frequently list AWS IPs. Your EC2 is probably part of an IP block that is blacklisted. You might be able to verify by putting your public EC2 IP in here https://mxtoolbox.com/. I would try verifying if you can even make a request via curl from the command line curl -v {URL}. If that times out, then I bet your IP is blocked by the remote server's firewall rules. Since your home IP has access, you can try to setup a VPN on your network, have the EC2 connect to your VPN, and then retry your python script. It should work then, but it will be as if you're making the request from your home (so don't do anything stupid). Most routers allow you to setup an OpenVPN or PPTP VPN right in the admin UI. I suspect that once your EC2's IP changes, you'll trick the upstream server and be able to scrape.
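The same quick check can be done from Python instead of curl; a small sketch using the URL from the question (if this times out on the EC2 box but succeeds from your home machine, the blocked-IP theory is confirmed):
import requests

url = "https://www1.nseindia.com/live_market/dynaContent/live_watch/option_chain/optionKeys.jsp"
try:
    r = requests.get(url, timeout=10)
    print("Reachable, status:", r.status_code)
except requests.exceptions.RequestException as err:
    print("Blocked or unreachable from this host:", err)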
I want to fetch an IPv6 page with urllib.
It works with square-bracket IPv6 notation, but I have no clue how to (easily) convince Python to do an IPv6 request when I give it the FQDN.
The IP used below, for example, is that of https://www.dslreports.com/whatismyip
from sys import version_info

PY3K = version_info >= (3, 0)
if PY3K:
    import urllib.request as urllib
else:
    import urllib2 as urllib

url = None
opener = urllib.build_opener()
opener.addheaders = [('User-agent',
                      "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/59.0.3071.115 Safari/537.36")]
url = opener.open("http://[2607:fad0:3706:1::1000]/whatismyip", timeout=3)
content = url.read()
I finally solved my issue. Not in the most elegant way, but it works for me.
After reading:
Force requests to use IPv4 / IPv6
and
Python urllib2 force IPv4
I decided to do a DNS lookup and just send a Host header with the FQDN to grab the content. (Host headers are needed for vhosts.)
Here is the ugly snippet:
# Ugly hack to get either IPv4 or IPv6 response from server
# (assumes: socket imported, urlparse available -- urllib.parse.urlparse on Python 3,
#  urlparse.urlparse on Python 2 -- plus the urllib alias from the question above;
#  ip_kind() and haveIPv6 are my own helpers defined elsewhere in the script)
parsed_uri = urlparse(server)
fqdn = "{uri.netloc}".format(uri=parsed_uri)
scheme = "{uri.scheme}".format(uri=parsed_uri)
path = "{uri.path}".format(uri=parsed_uri)
try:
    # Already a literal IP (strip the surrounding brackets for the check)
    ipVersion = ip_kind(fqdn[1:-1])
    ip = fqdn
except ValueError:
    # FQDN given: resolve it and pick an address family
    addrs = socket.getaddrinfo(fqdn, 80)
    if haveIPv6:
        ipv6_addrs = [addr[4][0] for addr in addrs if addr[0] == socket.AF_INET6]
        ip = "[" + ipv6_addrs[0] + "]"
    else:
        ipv4_addrs = [addr[4][0] for addr in addrs if addr[0] == socket.AF_INET]
        ip = ipv4_addrs[0]
server = "{}://{}{}".format(scheme, ip, path)
url = urllib.Request(server, None, {'User-agent': 'Mozilla/5.0 (Windows; U; Windows NT 5.1; de; rv:1.9.1.5) Gecko/20091102 Firefox/3.5.5'})
# Next line adds the host header
url.host = fqdn
content = urllib.urlopen(url).read()
This is far from ideal and it could be much cleaner but it works for me.
It is implemented here: https://github.com/SteveClement/ipgetter/tree/IPv6
This simply goes through a list of servers that return your border gateway IP, now in IPv6 too.
[update: this line about Python 2 / Python 3 is no longer valid since the question has been updated]
First, you seem to use Python 2. This is important because the urllib module has been split into parts and renamed in Python 3.
Secondly, your code snippet seems incorrect: build_opener is not a function available with urllib. It is available with urllib2.
So, I assume that your code is in fact the following one:
import urllib2

opener = urllib2.build_opener()
opener.addheaders = [('User-agent',
                      "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/59.0.3071.115 Safari/537.36")]
url = opener.open("http://www.dslreports.com/whatismyip", timeout=3)
If your DNS resolver correctly handles IPv6 resource records, if your operating system is built with a dual IPv4/IPv6 stack or a single IPv6-only stack, and if you have a working IPv6 network path to dslreports.com, this Python program will use IPv6 to connect to www.dslreports.com. So there is no need to convince Python to do an IPv6 request.
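If, on the other hand, you really do want to force IPv6 for a FQDN (the subject of the questions linked earlier: Force requests to use IPv4 / IPv6 and Python urllib2 force IPv4), one blunt but common approach is to filter the results of socket.getaddrinfo before the connection is made. A minimal Python 3 sketch (my own assumption, not something spelled out in those links; note it affects every connection made by the process):
import socket
import urllib.request

_orig_getaddrinfo = socket.getaddrinfo

def getaddrinfo_ipv6_only(host, port, family=0, type=0, proto=0, flags=0):
    # Keep only AF_INET6 records so the connection has to use IPv6.
    return [ai for ai in _orig_getaddrinfo(host, port, family, type, proto, flags)
            if ai[0] == socket.AF_INET6]

socket.getaddrinfo = getaddrinfo_ipv6_only

content = urllib.request.urlopen("https://www.dslreports.com/whatismyip", timeout=3).read()
print(content)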
I am trying to do an automated task via python through the mechanize module:
Enter the keyword in a web form, submit the form.
Look for a specific element in the response.
This works once. But when I repeat the task for a list of keywords, I get HTTP Error 429 (Too Many Requests).
I tried the following to work around this:
Adding custom headers (I noted them down specifically for that very website by using a proxy) so that it looks like a legit browser request.
import mechanize

br = mechanize.Browser()
br.addheaders = [('User-agent', 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2228.0 Safari/537.36'),
                 ('Connection', 'keep-alive'),
                 ('Accept', 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8'),
                 ('Upgrade-Insecure-Requests', '1'),
                 ('Accept-Encoding', 'gzip, deflate, sdch'),
                 ('Accept-Language', 'en-US,en;q=0.8')]
Since the blocked response was coming for every 5th request, I tried sleeping for 20 seconds after every 5 requests.
Neither of the two methods worked.
You need to limit the rate of your requests to conform to what the server's configuration permits. (Web Scraper: Limit to Requests Per Minute/Hour on Single Domain? may show the permitted rate)
mechanize uses a heavily-patched version of urllib2 (Lib/site-packages/mechanize/_urllib2.py) for network operations, and its Browser class is a descendant of its _urllib2_fork.OpenerDirector.
So the simplest way to patch its logic seems to be to add a handler to your Browser object:
- with default_open and an appropriate handler_order to place it before all the others (lower is higher priority),
- that stalls until the request is eligible, e.g. with a token bucket or leaky bucket algorithm as implemented in Throttling with urllib2 (note that a bucket should probably be per-domain or per-IP),
- and that finally returns None to push the request on to the following handlers (see the sketch below).
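A minimal sketch of such a handler (assuming mechanize re-exports BaseHandler and that add_handler is inherited from the OpenerDirector lineage mentioned above; this uses a simple fixed-interval delay instead of a full token bucket, and keeps one global bucket rather than one per domain or per IP):
import time
import mechanize

class ThrottleHandler(mechanize.BaseHandler):
    # Low handler_order = higher priority, so this runs before the real openers.
    handler_order = 100

    def __init__(self, max_per_minute=5):
        self.min_interval = 60.0 / max_per_minute
        self.last_request = 0.0

    def default_open(self, request):
        # Stall until enough time has passed since the previous request.
        wait = self.min_interval - (time.time() - self.last_request)
        if wait > 0:
            time.sleep(wait)
        self.last_request = time.time()
        return None  # hand the request on to the following handlers

br = mechanize.Browser()
br.add_handler(ThrottleHandler(max_per_minute=5))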
Since this is a common need, you should probably publish your implementation as an installable package.