This is the script:
import requests
import json
import urlparse
from requests.adapters import HTTPAdapter

s = requests.Session()
s.mount('http://', HTTPAdapter(max_retries=1))

with open('proxies.txt') as proxies:
    for line in proxies:
        proxy = json.loads(line)
        with open('urls.txt') as urls:
            for line in urls:
                url = line.rstrip()
                data = requests.get(url, proxies=proxy)
                data1 = data.content
                print data1
                print {'http': line}
As you can see, it's trying to access a list of URLs through a list of proxies. Here is the urls.txt file:
http://api.exip.org/?call=ip
here is the proxies.txt file:
{"http":"http://107.17.92.18:8080"}
I got this proxy at www.hidemyass.com. Could it be a bad proxy? I have tried several and this is the result. Note: if you are trying to replicate this, you may have to update the proxy to a recent one at hidemyass.com. They seem to stop working eventually.
here is the full error and traceback:
Traceback (most recent call last):
File "test.py", line 17, in <module>
data=requests.get(url, proxies=proxy)
File "/usr/local/lib/python2.7/dist-packages/requests/api.py", line 55, in get
return request('get', url, **kwargs)
File "/usr/local/lib/python2.7/dist-packages/requests/api.py", line 44, in request
return session.request(method=method, url=url, **kwargs)
File "/usr/local/lib/python2.7/dist-packages/requests/sessions.py", line 335, in request
resp = self.send(prep, **send_kwargs)
File "/usr/local/lib/python2.7/dist-packages/requests/sessions.py", line 454, in send
history = [resp for resp in gen] if allow_redirects else []
File "/usr/local/lib/python2.7/dist-packages/requests/sessions.py", line 144, in resolve_redirects
allow_redirects=False,
File "/usr/local/lib/python2.7/dist-packages/requests/sessions.py", line 438, in send
r = adapter.send(request, **kwargs)
File "/usr/local/lib/python2.7/dist-packages/requests/adapters.py", line 327, in send
raise ConnectionError(e)
requests.exceptions.ConnectionError: HTTPConnectionPool(host=u'219.231.143.96', port=18186): Max retries exceeded with url: http://www.google.com/ (Caused by <class 'httplib.BadStatusLine'>: '')
Looking at the stack trace you've provided, your error is caused by the httplib.BadStatusLine exception, which, according to the docs, is:
Raised if a server responds with a HTTP status code that we don’t understand.
In other words, whatever is returned (if anything is returned at all) by the proxy server cannot be parsed by httplib, which performs the actual request.
From my experience with (writing) HTTP proxies I can say that some implementations may not follow the specs too strictly (the HTTP RFCs aren't easy reading, actually) or use hacks to work around old browsers that have flaws in their implementations.
So, answering this:
Could it be a bad proxy?
... I'd say that this is possible. The only real way to be sure is to see what the proxy server actually returns.
Try to debug it with a debugger, or grab a packet sniffer (something like Wireshark or Network Monitor) to analyze what happens on the network. Knowing exactly what the proxy server returns should give you the key to solving this issue.
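If you want a quick look without a full packet capture, here is a minimal sketch (not from the original post) that sends one request through the proxy over a bare socket and prints the first line that comes back, i.e. the status line httplib refuses to parse. It assumes the proxy from proxies.txt, the URL from urls.txt, and Python 2 like the rest of the script:
import socket

# assumed values, taken from proxies.txt / urls.txt above; the proxy may well be dead by now
proxy_host, proxy_port = '107.17.92.18', 8080
target = 'http://api.exip.org/?call=ip'

sock = socket.create_connection((proxy_host, proxy_port), timeout=10)
# an HTTP proxy expects the full absolute URI on the request line
sock.sendall('GET %s HTTP/1.1\r\nHost: api.exip.org\r\nConnection: close\r\n\r\n' % target)

raw = ''
while True:
    chunk = sock.recv(4096)
    if not chunk:
        break
    raw += chunk
sock.close()

# the first line is exactly what BadStatusLine complains about when it is empty or malformed
print repr(raw.split('\r\n')[0])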
Maybe you are overloading the proxy server by sending too many requests in a short period of time. You say that you got the proxy from a popular free proxy website, which means that you're not the only one using that server and it's often under heavy load.
If you add some delay between your requests, like this:
from time import sleep
[...]
data=requests.get(url, proxies=proxy)
data1=data.content
print data1
print {'http': line}
sleep(1)
(note the sleep(1) which pauses the execution of the code for one second)
Does it work?
def hello(self):
    self.s = requests.Session()
    self.s.headers.update({'User-Agent': self.user_agent})
    return True
Try this, it worked for me :)
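For context, a hedged sketch of how that snippet might sit inside a class; the Scraper class and the user_agent attribute are assumptions for illustration, not part of the original answer:
import requests

class Scraper(object):
    def __init__(self, user_agent='my-scraper/1.0'):  # assumed attribute
        self.user_agent = user_agent
        self.hello()

    def hello(self):
        self.s = requests.Session()
        self.s.headers.update({'User-Agent': self.user_agent})
        return True

scraper = Scraper()
print(scraper.s.get('http://httpbin.org/user-agent').text)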
This happens when you send too many requests to the public IP address of https://anydomainname.example.com/. As you can see, it is caused by something that blocks or denies access to the public IP address that https://anydomainname.example.com/ maps to. One workaround is the following Python script, which looks up the public IP address of the endpoint domain and writes that mapping into the /etc/hosts file.
import re
import socket
import subprocess
from typing import Tuple

ENDPOINT = 'https://anydomainname.example.com/'


def get_public_ip() -> Tuple[str, str, str]:
    """
    Command to get public_ip address of host machine and endpoint domain

    Returns
    -------
    my_public_ip : str
        Ip address string of host machine.
    end_point_ip_address : str
        Ip address of endpoint domain host.
    end_point_domain : str
        domain name of endpoint.
    """
    # bash_command = """host myip.opendns.com resolver1.opendns.com | \
    #     grep "myip.opendns.com has" | awk '{print $4}'"""
    # bash_command = """curl ifconfig.co"""
    # bash_command = """curl ifconfig.me"""
    bash_command = """curl icanhazip.com"""
    my_public_ip = subprocess.getoutput(bash_command)
    my_public_ip = re.compile("[0-9.]{4,}").findall(my_public_ip)[0]
    end_point_domain = (
        ENDPOINT.replace("https://", "")
        .replace("http://", "")
        .replace("/", "")
    )
    end_point_ip_address = socket.gethostbyname(end_point_domain)
    return my_public_ip, end_point_ip_address, end_point_domain


def set_etc_host(ip_address: str, domain: str) -> str:
    """
    A function to write mapping of ip_address and domain name in /etc/hosts.

    Ref: https://stackoverflow.com/questions/38302867/how-to-update-etc-hosts-file-in-docker-image-during-docker-build

    Parameters
    ----------
    ip_address : str
        IP address of the domain.
    domain : str
        domain name of endpoint.

    Returns
    -------
    str
        Message to identify success or failure of the operation.
    """
    bash_command = """echo "{} {}" >> /etc/hosts""".format(ip_address, domain)
    output = subprocess.getoutput(bash_command)
    return output


if __name__ == "__main__":
    my_public_ip, end_point_ip_address, end_point_domain = get_public_ip()
    output = set_etc_host(ip_address=end_point_ip_address, domain=end_point_domain)
    print("My public IP address:", my_public_ip)
    print("ENDPOINT public IP address:", end_point_ip_address)
    print("ENDPOINT Domain Name:", end_point_domain)
    print("Command output:", output)
You can call the above script before running your desired function :)
This happens when you overload the server with too many requests. To work around it you can increase the time between requests, but the best thing in my case was to increase the number of retries per request:
requests.adapters.DEFAULT_RETRIES = 5 # increase retries number
requests.get(url)
If this still doesn't help, you can find more approaches here.
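If you want more control than the module-level DEFAULT_RETRIES, a sketch using urllib3's Retry class mounted on a session through an HTTPAdapter would look roughly like this (urllib3 ships as a dependency of requests; the retry count, backoff and status codes below are illustrative, not prescriptive):
import requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry

session = requests.Session()
retries = Retry(total=5,                 # retry up to 5 times
                backoff_factor=0.5,      # wait 0.5s, 1s, 2s, ... between attempts
                status_forcelist=[502, 503, 504])
session.mount('http://', HTTPAdapter(max_retries=retries))
session.mount('https://', HTTPAdapter(max_retries=retries))

response = session.get(url)  # url as in the snippet above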
Although this is most likely a newbie question, I struggled to find any information online to help me with my problem.
My code is meant to scrape onion sites. I can connect to Tor, and the web scraper works fine as a stand-alone, but when I tried combining both code blocks I kept getting numerous errors regarding the keyword argument in my code. Even attempting to delete it presents me with bugs, so I am a bit lost on what I'm supposed to do.
import socket
import socks
import requests
from pywebcopy import save_webpage

socks.set_default_proxy(socks.SOCKS5, "127.0.0.1", 9050)
socket.socket = socks.socksocket


def get_tor_session():
    session = requests.session()
    # Tor uses the 9050 port as the default socks port
    session.proxies = {'http': 'socks5h://127.0.0.1:9050',
                       'https': 'socks5h://127.0.0.1:9050'}
    return session


session = get_tor_session()
print(session.get("http://httpbin.org/ip").text)

kwargs = {'project_name': 'site folder'}

save_webpage(
    # url of the website
    session.get(url="http://elfqv3zjfegus3bgg5d7pv62eqght4h6sl6yjjhe7kjpi2s56bzgk2yd.onion"),
    # folder where the copy will be saved
    project_folder=r"C:\Users\admin\Desktop\WebScraping",
    **kwargs
)
In this case, I'm presented with the following error:
TypeError: Cannot mix str and non-str arguments
attempting to replace
project_folder=r"C:\Users\admin\Desktop\WebScraping",
**kwargs
with
kwargs,
project_folder=r"C:\Users\admin\Desktop\WebScraping"
presents me with this error:
TypeError: save_webpage() got multiple values for argument
traceback for the first error:
File "C:\Users\admin\Desktop\WebScraping\tor.py", line 43, in <module>
**kwargs
File "C:\Users\admin\anaconda3\lib\site-packages\pywebcopy\api.py", line 58, in save_webpage
config.setup_config(url, project_folder, project_name, **kwargs)
File "C:\Users\admin\anaconda3\lib\site-packages\pywebcopy\configs.py", line 189, in setup_config
SESSION.load_rules_from_url(urljoin(project_url, '/robots.txt'))
File "C:\Users\admin\anaconda3\lib\urllib\parse.py", line 487, in urljoin
base, url, _coerce_result = _coerce_args(base, url)
File "C:\Users\admin\anaconda3\lib\urllib\parse.py", line 120, in _coerce_args
raise TypeError("Cannot mix str and non-str arguments")
I'd really appreciate an explanation of what causes such a bug and how to avoid it in the future.
Not sure why this hasn't been answered yet. As mentioned in my comment, simply change this:
save_webpage(
    # url of the website
    session.get(url=...),
    # folder where the copy will be saved
    project_folder=r"C:\Users\admin\Desktop\WebScraping",
    **kwargs
)
To:
save_webpage(
    # url of the website
    url=...,
    # folder where the copy will be saved
    project_folder=r"C:\Users\admin\Desktop\WebScraping",
    **kwargs
)
save_webpage makes the request internally, so it expects url to be a string. Passing it the Response object returned by session.get() is what makes urljoin fail later with "Cannot mix str and non-str arguments".
SOLVED
Adding the following code resolved the issue:
# Bypass local DNS resolution and hand the hostname straight to the (SOCKS) socket,
# which is what .onion addresses need.
def getaddrinfo(*args):
    return [(socket.AF_INET, socket.SOCK_STREAM, 6, '', (args[0], args[1]))]

socket.getaddrinfo = getaddrinfo
I am using a private rotating proxy provided by (https://proxy.webshare.io/proxy/rotating?) in which each request to the rotating proxy receives a new IP address. When I am using
requests.get('https://httpbin.org/get', headers=headers, proxies=get_proxy())
it returns a new IP each time I make a request. But when using
session = requests.Session()
session.headers = headers
session.proxies = get_proxy()
session.get('https://httpbin.org/get')
it returns the same IP each time I make a request.
How does the session object behave differently from the requests.get() function when it comes to proxies?
A Session reuses the values previously set on it (cookies, headers, proxies, and so on) for each subsequent request. If you want to change the proxy for each request in the session, then use Prepared Requests to set it each time, or just wrap the call in a function:
def send(session, url):
    return session.get(url, proxies=get_proxy())

sess = requests.Session()
sess.headers = headers
resp = send(sess, 'https://httpbin.org/get')
print(resp.status_code)
But if you're trying to hide your origin IP for scraping or something, you probably don't want to persist cookies and the like, so you shouldn't use sessions at all.
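In that case a minimal sketch (reusing get_proxy() and headers from the question) is to call requests.get() directly, so nothing carries over between requests:
import requests

for _ in range(3):
    resp = requests.get('https://httpbin.org/get',
                        headers=headers,        # from the question
                        proxies=get_proxy())    # fresh proxy dict for every call
    print(resp.json()['origin'])                # should differ between requests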
The following code works; it takes a proxylistfile.txt file and checks every proxy in it:
from requests import *
import bs4
import sys

if len(sys.argv) < 2:
    print('Usage: ./testproxy.py <proxylistfile.txt>')
    sys.exit()

ifco = 'http://ifconfig.co'
PROXIES_FILE = sys.argv[1]
proxy = dict()

with open(PROXIES_FILE) as file:
    for line in file:
        if line[0] == '#' or line == "\n":
            continue
        line_parts = line.replace('\n', '').split(':')
        proxy['http'] = f'{line_parts[0]}://{line_parts[1]}:{line_parts[2]}'
        try:
            i = get(ifco, proxies=proxy, timeout=11)
            print(f"{proxy['http']} - successful - IP ---> ", end='')
            zu = bs4.BeautifulSoup(i.text, 'html.parser')
            testo = zu.findAll('p', text=True)[0].get_text()
            print(testo)
        except:
            print(f"{proxy['http']} - unsuccessful")
It connects to the ifconfig.co site and returns its real IP, to check whether the proxy works.
The output will be something like:
http://proxy:port - successful - IP ---> your.real.ip
The input file format should be like:
http:1.1.1.1:3128
I finally switched to another rotating proxy provider (https://www.proxyegg.com) and the issue is now resolved.
GAE Python URL Fetch throws InvalidURLError, while the same URL works perfectly with Postman (a Google Chrome app).
CODE
url = "https://abcdefgh:28dfd95928dfd95928dfd95928dfd95928dfd95928dfd959#twilix.exotel.in/v1/Accounts/abcdefgh/Sms/send"

form_fields = {
    "From": "08039511111",
    "To": "+919844100000",
    "Body": "message for you"
}
form_data = urllib.urlencode(form_fields)

try:
    result = urlfetch.fetch(url=url,
                            payload=form_data,
                            method=urlfetch.POST,
                            headers={'Content-Type': 'application/x-www-form-urlencoded'}
                            )
    logging.info("result = [" + repr(result) + "] ")
except Exception:
    logging.error("Exception. [" + traceback.format_exc() + "] ")
OUTPUT LOGS
2016-01-21 15:48:23.368 +0530 E Exception. [
Traceback (most recent call last): File "main.py", line 27, in get method=urlfetch.POST,
File "/base/data/home/runtimes/python27/python27_lib/versions/1/google/appengine/api/urlfetch.py", line 271, in fetch return rpc.get_result()
File "/base/data/home/runtimes/python27/python27_lib/versions/1/google/appengine/api/apiproxy_stub_map.py", line 613, in get_result return self.__get_result_hook(self)
File "/base/data/home/runtimes/python27/python27_lib/versions/1/google/appengine/api/urlfetch.py", line 389, in _get_fetch_result 'Invalid request URL: ' + url + error_detail) InvalidURLError: Invalid request URL: https://abcdefgh:28dfd95928dfd95928dfd95928dfd95928dfd95928dfd959#twilix.exotel.in/v1/Accounts/abcdefgh/Sms/send ]
For security purposes, I have replaced the sensitive text in the URL with similar but different characters.
The code indicates an INVALID_URL RPC error code was received from the urlfetch service.
The most common occurrence seems to be due to the URL length limit (check whether your unedited URL hits it): Undocumented max length for urlfetch URL?
A long time ago it was also seen for very slow URLs (in Go land, but I suspect the urlfetch service itself is the same one serving all language sandboxes). I'm unsure if this still stands (I also see a DEADLINE_EXCEEDED error code, which might have been introduced specifically for such cases in the meantime): Google App Engine Go HTTP request to a slow page
The failure might also be related to incorrect parsing of the rather unusual "host" portion of your URL, foo:blah#hostname. Check whether you get the same error after dropping the foo:blah# portion. If that is indeed the case, you might want to file an issue with Google - the URL seems valid and works with curl as well.
I found the problem and the solution.
We need to specify the HTTP auth info using headers.
# rpc is created beforehand with urlfetch.create_rpc()
urlfetch.make_fetch_call(rpc,
                         url,
                         method=urlfetch.POST,
                         headers={"Authorization": "Basic %s" % base64.b64encode(URL_USERNAME + ":" + URL_PASSWORD)},
                         )
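Applied to the blocking urlfetch.fetch() call from the question, the same fix would look roughly like this (a sketch only; URL_USERNAME and URL_PASSWORD stand for the credentials that were embedded in the URL, and form_data is built exactly as in the question):
import base64
from google.appengine.api import urlfetch

# credentials removed from the URL itself
url = "https://twilix.exotel.in/v1/Accounts/abcdefgh/Sms/send"

result = urlfetch.fetch(
    url=url,
    payload=form_data,  # built with urllib.urlencode as in the question
    method=urlfetch.POST,
    headers={
        "Content-Type": "application/x-www-form-urlencoded",
        "Authorization": "Basic %s" % base64.b64encode(URL_USERNAME + ":" + URL_PASSWORD),
    },
)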
Courtesy
https://stackoverflow.com/a/8454580/1443563 by raugfer
As it says in the title, I am trying to access a URL through several different proxies sequentially (using a for loop). Right now this is my code:
import requests
import json

with open('proxies.txt') as proxies:
    for line in proxies:
        proxy = json.loads(line)
        with open('urls.txt') as urls:
            for line in urls:
                url = line.rstrip()
                data = requests.get(url, proxies={'http': line})
                data1 = data.text
                print data1
and my urls.txt file:
http://api.exip.org/?call=ip
and my proxies.txt file:
{"https": "84.22.41.1:3128"}
{"http":"194.126.181.47:81"}
{"http":"218.108.170.170:82"}
that I got at www.hidemyass.com
For some reason, the output is
68.6.34.253
68.6.34.253
68.6.34.253
as if it is accessing that website through my own router's IP address. In other words, it is not trying to access it through the proxies I give it; it is just looping through and using my own over and over again. What am I doing wrong?
According to this thread, you need to specify the proxies dictionary as {"protocol" : "ip:port"}, so your proxies file should look like
{"https": "84.22.41.1:3128"}
{"http": "194.126.181.47:81"}
{"http": "218.108.170.170:82"}
EDIT:
You're reusing line for both URLs and proxies. It's fine to reuse line in the inner loop, but you should be using proxies=proxy; you've already parsed the JSON and don't need to build another dictionary. Also, as abanert says, you should check that the protocol you're requesting matches that of the proxy. The reason the proxies are specified as a dictionary is to allow lookup of the matching protocol.
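Putting the pieces together, a sketch of the corrected loops (Python 2, as in the question) looks like this:
import json
import urlparse
import requests

with open('proxies.txt') as proxies:
    for proxy_line in proxies:
        proxy = json.loads(proxy_line)
        with open('urls.txt') as urls:
            for url_line in urls:
                url = url_line.rstrip()
                # skip proxies that don't handle this URL's protocol
                if urlparse.urlparse(url).scheme not in proxy:
                    continue
                data = requests.get(url, proxies=proxy)
                print data.text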
There are two obvious problems right here:
data=requests.get(url, proxies={'http':line})
First, because you have a for line in urls: inside the for line in proxies:, line is going to be the current URL here, not the current proxy. And besides, even if you weren't reusing line, it would be the JSON string representation, not the dict you decoded from JSON.
Then, if you fix that to use proxy, instead of something like {'https': '83.22.41.1:3128'}, you're passing {'http': {'https': '83.22.41.1:3128'}}. And that obviously isn't a valid value.
To fix both of those problems, just do this:
data=requests.get(url, proxies=proxy)
Meanwhile, what happens when you have an HTTPS URL, but the current proxy is an HTTP proxy? You're not going to use the proxy. So you probably want to add something to skip over them, like this:
if urlparse.urlparse(url).scheme not in proxy:
    continue
Directly copied from another answer of mine.
Well, actually you can, I've done this with a few lines of code and it works pretty well.
import requests


class Client:

    def __init__(self):
        self._session = requests.Session()
        self.proxies = None

    def set_proxy_pool(self, proxies, auth=None, https=True):
        """Randomly choose a proxy for every GET/POST request

        :param proxies: list of proxies, like ["ip1:port1", "ip2:port2"]
        :param auth: if proxy needs auth
        :param https: default is True, pass False if you don't need https proxy
        """
        from random import choice

        if https:
            self.proxies = [{'http': p, 'https': p} for p in proxies]
        else:
            self.proxies = [{'http': p} for p in proxies]

        def get_with_random_proxy(url, **kwargs):
            proxy = choice(self.proxies)
            kwargs['proxies'] = proxy
            if auth:
                kwargs['auth'] = auth
            return self._session.original_get(url, **kwargs)

        def post_with_random_proxy(url, *args, **kwargs):
            proxy = choice(self.proxies)
            kwargs['proxies'] = proxy
            if auth:
                kwargs['auth'] = auth
            return self._session.original_post(url, *args, **kwargs)

        self._session.original_get = self._session.get
        self._session.get = get_with_random_proxy
        self._session.original_post = self._session.post
        self._session.post = post_with_random_proxy

    def remove_proxy_pool(self):
        self.proxies = None
        self._session.get = self._session.original_get
        self._session.post = self._session.original_post
        del self._session.original_get
        del self._session.original_post

    # You can define whatever operations using self._session
I use it like this:
client = Client()
client.set_proxy_pool(['112.25.41.136', '180.97.29.57'])
It's simple, but actually works for me.
How do I seek to a particular position in a remote (HTTP) file so I can download only that part?
Let's say the bytes in a remote file were: 1234567890
I want to seek to 4 and download 3 bytes from there, so I would have: 456
And also, how do I check whether a remote file exists?
I tried os.path.isfile(), but it returns False when I pass it a remote file URL.
If you are downloading the remote file through HTTP, you need to set the Range header.
Check in this example how it can be done. Looks like this:
myUrlclass.addheader("Range","bytes=%s-" % (existSize))
EDIT: I just found a better implementation. This class is very simple to use, as can be seen in the docstring.
class HTTPRangeHandler(urllib2.BaseHandler):
    """Handler that enables HTTP Range headers.

    This was extremely simple. The Range header is a HTTP feature to
    begin with so all this class does is tell urllib2 that the
    "206 Partial Content" response from the HTTP server is what we
    expected.

    Example:
        import urllib2
        import byterange

        range_handler = range.HTTPRangeHandler()
        opener = urllib2.build_opener(range_handler)

        # install it
        urllib2.install_opener(opener)

        # create Request and set Range header
        req = urllib2.Request('http://www.python.org/')
        req.header['Range'] = 'bytes=30-50'
        f = urllib2.urlopen(req)
    """

    def http_error_206(self, req, fp, code, msg, hdrs):
        # 206 Partial Content Response
        r = urllib.addinfourl(fp, hdrs, req.get_full_url())
        r.code = code
        r.msg = msg
        return r

    def http_error_416(self, req, fp, code, msg, hdrs):
        # HTTP's Range Not Satisfiable error
        raise RangeError('Requested Range Not Satisfiable')
Update: The "better implementation" has moved to github: excid3/urlgrabber in the byterange.py file.
I highly recommend using the requests library. It is easily the best HTTP library I have ever used. In particular, to accomplish what you have described, you would do something like:
import requests
url = "http://www.sffaudio.com/podcasts/ShellGameByPhilipK.Dick.pdf"
# Retrieve bytes between offsets 3 and 5 (inclusive).
r = requests.get(url, headers={"range": "bytes=3-5"})
# If a 4XX client error or a 5XX server error is encountered, we raise it.
r.raise_for_status()
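Continuing that sketch, you can confirm the server actually honored the Range header and look at the bytes that came back:
# 206 Partial Content means the server returned only the requested range;
# a plain 200 would mean it ignored the Range header and sent the whole file.
print(r.status_code)
print(r.content)  # the three bytes at offsets 3-5
For the second part of the question: os.path.isfile() only knows about the local filesystem, so a common approach for a remote file is to send a HEAD request (requests.head(url)) and check the response status code.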
AFAIK, this is not possible using fseek() or similar. You need to use the HTTP Range header to achieve this. This header may or may not be supported by the server, so your mileage may vary.
import urllib2
myHeaders = {'Range':'bytes=0-9'}
req = urllib2.Request('http://www.promotionalpromos.com/mirrors/gnu/gnu/bash/bash-1.14.3-1.14.4.diff.gz',headers=myHeaders)
partialFile = urllib2.urlopen(req)
s2 = (partialFile.read())
EDIT: This is of course assuming that by remote file you mean a file stored on an HTTP server...
If the file you want is on an FTP server, FTP only allows you to specify a start offset, not a range. If this is what you want, then the following code should do it (not tested!):
import ftplib
fileToRetrieve = 'somefile.zip'
fromByte = 15
ftp = ftplib.FTP('ftp.someplace.net')
outFile = open('partialFile', 'wb')
ftp.retrbinary('RETR '+ fileToRetrieve, outFile.write, rest=str(fromByte))
outFile.close()
You can use httpio to access remote HTTP files as if they were local:
pip install httpio
import zipfile
import httpio

url = "http://some/large/file.zip"
with httpio.open(url) as fp:
    zf = zipfile.ZipFile(fp)
    print(zf.namelist())