I am trying to access the web via a proxy server in Python using the requests library, but I am having trouble authenticating with the proxy, since the proxy I am using requires a password.
proxyDict = {
    'http' : 'username:mypassword#77.75.105.165',
    'https' : 'username:mypassword#77.75.105.165'
}
r = requests.get("http://www.google.com", proxies=proxyDict)
I am getting the following error:
Traceback (most recent call last):
File "<pyshell#13>", line 1, in <module>
r = requests.get("http://www.google.com", proxies=proxyDict)
File "C:\Python27\lib\site-packages\requests\api.py", line 78, in get
:param url: URL for the new :class:`Request` object.
File "C:\Python27\lib\site-packages\requests\api.py", line 65, in request
"""Sends a POST request. Returns :class:`Response` object.
File "C:\Python27\lib\site-packages\requests\sessions.py", line 187, in request
def head(self, url, **kwargs):
File "C:\Python27\lib\site-packages\requests\models.py", line 407, in send
"""
File "C:\Python27\lib\site-packages\requests\packages\urllib3\poolmanager.py", line 127, in proxy_from_url
File "C:\Python27\lib\site-packages\requests\packages\urllib3\connectionpool.py", line 521, in connection_from_url
File "C:\Python27\lib\site-packages\requests\packages\urllib3\connectionpool.py", line 497, in get_host
ValueError: invalid literal for int() with base 10: 'h6f2v6jh5dsxa#77.75.105.165'
How do I solve this?
Thanks in advance for your help.
You should remove the embedded username and password from proxyDict and use the auth parameter instead. The ValueError happens because the URL parser treats everything after the last colon as a port number, and with # as the separator, the string 'password#host' cannot be converted to an int.
import requests
from requests.auth import HTTPProxyAuth
proxyDict = {
    'http' : '77.75.105.165',
    'https' : '77.75.105.165'
}
auth = HTTPProxyAuth('username', 'mypassword')
r = requests.get("http://www.google.com", proxies=proxyDict, auth=auth)
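Alternatively, requests accepts credentials embedded directly in the proxy URL; the separator between the password and the host must be @, not # (a minimal sketch reusing the placeholder credentials and host from the question):
import requests
# 'user:password@host' form; the '#' in the original was parsed as part of a port number
proxyDict = {
    'http' : 'http://username:mypassword@77.75.105.165',
    'https' : 'http://username:mypassword@77.75.105.165'
}
r = requests.get("http://www.google.com", proxies=proxyDict)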
I've been having a similar problem on Windows, and found that the only way to get requests to work was to set the proxies as environment variables before starting Python. For you this would be something like:
set HTTP_PROXY=http://77.75.105.165
set HTTPS_PROXY=https://77.75.105.165
You might also want to check whether a specific port is required, and if so append it to the URL. For example, if the port is 8443 then do:
set HTTP_PROXY=http://77.75.105.165:8443
set HTTPS_PROXY=https://77.75.105.165:8443
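Once these variables are set, requests picks them up automatically, so no proxies argument is needed (a minimal sketch; this works because trust_env is True by default on requests sessions):
import requests
# HTTP_PROXY / HTTPS_PROXY are read from the environment automatically
r = requests.get("http://www.google.com")
print(r.status_code)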
You can use the urllib library for this. Note that in Python 3 urlopen no longer takes a proxies argument; build an opener around a ProxyHandler instead:
from urllib import request
# urlopen() has no proxies parameter in Python 3; route through a ProxyHandler
opener = request.build_opener(request.ProxyHandler(request.getproxies()))
response = opener.open("your URL")
I have a URL in a config file, which I parse using ConfigParser and then pass to requests.
config.ini
[default]
root_url ='https://reqres.in/api/users?page=2'
FetchFeeds.py
import requests
from configparser import ConfigParser
import os
config = ConfigParser()
config.read(os.path.join(os.path.dirname(__file__), '../Config', 'config.ini'))
rootUrl = (config['default']['root_url'])
print(rootUrl)
response = requests.get(rootUrl)
Even though the URL prints properly as 'https://reqres.in/api/users?page=2', I am getting the following error from the requests module:
Traceback (most recent call last):
File "C:/Users/sam/PycharmProjects/testProject/GET_Request/FetchFeeds.py", line 11, in <module>
response = requests.get(rootUrl)
File "C:\Users\sam\PycharmProjects\testProject\venv\lib\site-packages\requests\api.py", line 76, in get
return request('get', url, params=params, **kwargs)
File "C:\Users\sam\PycharmProjects\testProject\venv\lib\site-packages\requests\api.py", line 61, in request
return session.request(method=method, url=url, **kwargs)
File "C:\Users\sam\PycharmProjects\testProject\venv\lib\site-packages\requests\sessions.py", line 530, in request
resp = self.send(prep, **send_kwargs)
File "C:\Users\sam\PycharmProjects\testProject\venv\lib\site-packages\requests\sessions.py", line 637, in send
adapter = self.get_adapter(url=request.url)
File "C:\Users\sam\PycharmProjects\testProject\venv\lib\site-packages\requests\sessions.py", line 730, in get_adapter
raise InvalidSchema("No connection adapters were found for {!r}".format(url))
requests.exceptions.InvalidSchema: No connection adapters were found for "'https://reqres.in/api/users?page=2'"
Process finished with exit code 1
Please note that I checked carefully and there are no white spaces.
The main issue is that ConfigParser keeps the apostrophes as part of the value, so requests tries to fetch the literal URL 'https://reqres.in/api/users?page=2' (quotes included), which has no valid scheme. The solution is to not put apostrophes around values in config.ini files:
[default]
root_url = https://reqres.in/api/users?page=2
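To see why the quotes break things, note that ConfigParser stores values verbatim, quotes included, so requests receives a URL whose first character is an apostrophe rather than a scheme. A quick demonstration (a minimal sketch):
from configparser import ConfigParser
cfg = ConfigParser()
cfg.read_string("[default]\nroot_url ='https://reqres.in/api/users?page=2'\n")
# The surrounding apostrophes are part of the stored value:
print(repr(cfg['default']['root_url']))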
Also consider configuring the query parameters in your code rather than in the config file, and use this instead:
requests.get(config['default']['root_url'], params={'page': 2})
I'm trying to write a Reddit bot. I decided to start with a simple one to make sure I was doing things properly, and I got a RequestException.
my code (bot.py):
import praw
for s in praw.Reddit('bot1').subreddit("learnpython").hot(limit=5):
print s.title
my praw.ini file:
# The URL prefix for OAuth-related requests.
oauth_url=https://oauth.reddit.com
# The URL prefix for regular requests.
reddit_url=https://www.reddit.com
# The URL prefix for short URLs.
short_url=https://redd.it
[bot1]
client_id=HIDDEN
client_secret=HIDDEN
password=HIDDEN
username=HIDDEN
user_agent=ILovePythonBot0.1
(where HIDDEN replaces the actual id, secret, password and username.)
my Traceback:
Traceback (most recent call last):
File "bot.py", line 3, in <module>
for s in praw.Reddit('bot1').subreddit("learnpython").hot(limit=5):
File "/usr/local/lib/python2.7/dist-packages/praw/models/listing/generator.py", line 79, in next
return self.__next__()
File "/usr/local/lib/python2.7/dist-packages/praw/models/listing/generator.py", line 52, in __next__
self._next_batch()
File "/usr/local/lib/python2.7/dist-packages/praw/models/listing/generator.py", line 62, in _next_batch
self._listing = self._reddit.get(self.url, params=self.params)
File "/usr/local/lib/python2.7/dist-packages/praw/reddit.py", line 322, in get
data = self.request('GET', path, params=params)
File "/usr/local/lib/python2.7/dist-packages/praw/reddit.py", line 406, in request
params=params)
File "/usr/local/lib/python2.7/dist-packages/prawcore/sessions.py", line 131, in request
params=params, url=url)
File "/usr/local/lib/python2.7/dist-packages/prawcore/sessions.py", line 70, in _request_with_retries
params=params)
File "/usr/local/lib/python2.7/dist-packages/prawcore/rate_limit.py", line 28, in call
response = request_function(*args, **kwargs)
File "/usr/local/lib/python2.7/dist-packages/prawcore/requestor.py", line 48, in request
raise RequestException(exc, args, kwargs)
prawcore.exceptions.RequestException: error with request request() got an unexpected keyword argument 'json'
Any help would be appreciated. PS: I am using Python 2.7 on Ubuntu 14.04. Please ask me for any other information you may need.
The way I see it, you have a problem with your request to the Reddit API. Try changing the user agent in your praw.ini configuration. According to the PRAW basic configuration options, the user agent should follow the format <platform>:<app ID>:<version string> (by /u/<reddit username>). Try that and see what happens.
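For example, the user_agent line in your praw.ini could look like this (platform, app ID and username here are hypothetical; substitute your own):
user_agent=linux:ILovePythonBot:0.1 (by /u/yourusername)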
I am using my company's internet, and I need to access a webpage to scrape data off it. I am using the Python Requests module. The page I need to access is reached through a POST request. My company has a proxy; I can get through it using the proxies flag in requests.post(). However, there is an authentication step that uses cookies, and I can't seem to get through it. How should I do the authentication part when using a POST request?
I am trying to use the authentication process as described in this thread, but it's not working:
Authentication and python Requests
The code is set up this way:
import ssl
from MyHtmlParser import MyHTMLParser
from lxml import html
import requests
from bs4 import BeautifulSoup as bs
def authenticate(s, url):
    headers = {'USER_NAME': 'me', 'PASSWORD': 'mypassword', '_Id': 'submit'}
    page = s.get(url)
    soup = bs(page.content)
    value = soup.form.find_all('input')[2]['value']
    headers.update({'value_name': value})
    auth = s.post(url, params=headers, cookies=page.cookies)

post_url_finance = 'https://opsdata<company>com/scripts/finance/finance.exe'
values_finance = {'EMPLOYEE_TOTAL': 'employeeId'}
proxies = {'http': 'http://proxy-<company>.com'}
page = requests.post(post_url_finance, data=values_finance, proxies=proxies)
print page.content
However, I am getting this error back:
$ python postUsingRequests.py
Traceback (most recent call last):
File "postUsingRequests.py", line 53, in <module>
page = requests.post(post_url_finance, data=values_finance, proxies=proxies)
File "C:\Python27\lib\site-packages\requests\api.py", line 109, in post
return request('post', url, data=data, json=json, **kwargs)
File "C:\Python27\lib\site-packages\requests\api.py", line 50, in request
response = session.request(method=method, url=url, **kwargs)
File "C:\Python27\lib\site-packages\requests\sessions.py", line 465, in request
resp = self.send(prep, **send_kwargs)
File "C:\Python27\lib\site-packages\requests\sessions.py", line 573, in send
r = adapter.send(request, **kwargs)
File "C:\Python27\lib\site-packages\requests\adapters.py", line 431, in send
raise SSLError(e, request=request)
requests.exceptions.SSLError: [SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed (_ssl.c:590)
The problem you are having seems to be caused by an untrusted SSL certificate.
The quickest fix is setting verify=False. Please note that this disables certificate verification and exposes your application to security risks. But as you mentioned, it is running in a safe network, so this may be acceptable.
s = requests.Session()
s.auth = ('----', '----')  # requests expects a (username, password) tuple here
pageCert = s.post(post_url_finance, proxies=proxies, verify=False)
I used s.auth together with verify=False; this gave me a response back instead of the SSL error.
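If you can export the certificate chain your company's proxy presents, a safer alternative is to point verify at that CA bundle instead of disabling verification entirely (the path below is a placeholder):
pageCert = s.post(post_url_finance, proxies=proxies,
                  verify='C:/certs/company-ca.pem')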
I am working with a local HTML file in Python, and I am trying to use lxml to parse the file. For some reason I can't get the file to load properly, and I'm not sure whether this has to do with not having an HTTP server set up on my local machine, with my etree usage, or with something else.
My reference for this code was this: http://docs.python-guide.org/en/latest/scenarios/scrape/
This could be a related problem: Requests : No connection adapters were found for, error in Python3
Here is my code:
from lxml import html
import requests
page = requests.get('C:\Users\...\sites\site_1.html')
tree = html.fromstring(page.text)
test = tree.xpath('//html/body/form/div[3]/div[3]/div[2]/div[2]/div/div[2]/div[2]/p[1]/strong/text()')
print test
The traceback that I'm getting reads:
C:\Python27\python.exe "C:/Users/.../extract_html/extract.py"
Traceback (most recent call last):
File "C:/Users/.../extract_html/extract.py", line 4, in <module>
page = requests.get('C:\Users\...\sites\site_1.html')
File "C:\Python27\lib\site-packages\requests\api.py", line 69, in get
return request('get', url, params=params, **kwargs)
File "C:\Python27\lib\site-packages\requests\api.py", line 50, in request
response = session.request(method=method, url=url, **kwargs)
File "C:\Python27\lib\site-packages\requests\sessions.py", line 465, in request
resp = self.send(prep, **send_kwargs)
File "C:\Python27\lib\site-packages\requests\sessions.py", line 567, in send
adapter = self.get_adapter(url=request.url)
File "C:\Python27\lib\site-packages\requests\sessions.py", line 641, in get_adapter
raise InvalidSchema("No connection adapters were found for '%s'" % url)
requests.exceptions.InvalidSchema: No connection adapters were found for 'C:\Users\...\sites\site_1.html'
Process finished with exit code 1
You can see that it has something to do with a "connection adapter" but I'm not sure what that means.
If the file is local, you shouldn't be using requests -- just open the file and read it in. requests expects to be talking to a web server.
with open(r'C:\Users\...site_1.html', "r") as f:
    page = f.read()
tree = html.fromstring(page)
There is a better way to do it: use the parse function instead of fromstring. It reads straight from a file (or URL) and returns an ElementTree:
tree = html.parse(r"C:\Users\...site_1.html")
print(html.tostring(tree))
You can also try using Beautiful Soup
from bs4 import BeautifulSoup
# Specify a parser explicitly to avoid bs4's "no parser specified" warning
with open("filepath", encoding="utf8") as f:
    soup = BeautifulSoup(f, "html.parser")
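You can then query the parsed document much as the question's XPath does; for example (a hypothetical selector, adjust it to your markup):
print(soup.find('strong').get_text())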
I think I've discovered a problem with the Requests library's handling of redirects when using HTTPS. As far as I can tell, this is only a problem when the server redirects the Requests client to another HTTPS resource.
I can assure you that the proxy I'm using supports HTTPS and the CONNECT method because I can use it with a browser just fine. I'm using version 2.1.0 of the Requests library which is using 1.7.1 of the urllib3 library.
I watched the transactions in Wireshark and I can see the first transaction for https://www.paypal.com/, but I don't see anything for https://www.paypal.com/home. I keep getting timeouts when debugging any deeper in the stack, so I don't know where to go from here. I'm definitely not seeing the request for /home as a result of the redirect, so it must be erroring out in the code before it gets sent to the proxy.
I want to know if this truly is a bug or if I am doing something wrong. It is really easy to reproduce so long as you have access to a proxy that you can send traffic through. See the code below:
import requests
proxiesDict = {
    'http': "http://127.0.0.1:8080",
    'https': "http://127.0.0.1:8080"
}
# This fails with "requests.exceptions.ProxyError: Cannot connect to proxy. Socket error: [Errno 111] Connection refused." when it tries to follow the redirect to /home
r = requests.get("https://www.paypal.com/", proxies=proxiesDict)
# This succeeds.
r = requests.get("https://www.paypal.com/home", proxies=proxiesDict)
This also happens when using urllib3 directly, so it is probably a bug in urllib3, which Requests uses under the hood. See below:
import urllib3
proxy = urllib3.proxy_from_url('http://127.0.0.1:8080/')
# This fails with the same error as above.
res = proxy.urlopen('GET', 'https://www.paypal.com/')
# This succeeds.
res = proxy.urlopen('GET', 'https://www.paypal.com/home')
Here is the traceback when using Requests:
Traceback (most recent call last):
File "tests/downloader_tests.py", line 22, in test_proxy_https_request
r = requests.get("https://www.paypal.com/", proxies=proxiesDict)
File "/usr/local/lib/python2.7/dist-packages/requests/api.py", line 55, in get
return request('get', url, **kwargs)
File "/usr/local/lib/python2.7/dist-packages/requests/api.py", line 44, in request
return session.request(method=method, url=url, **kwargs)
File "/usr/local/lib/python2.7/dist-packages/requests/sessions.py", line 382, in request
resp = self.send(prep, **send_kwargs)
File "/usr/local/lib/python2.7/dist-packages/requests/sessions.py", line 505, in send
history = [resp for resp in gen] if allow_redirects else []
File "/usr/local/lib/python2.7/dist-packages/requests/sessions.py", line 167, in resolve_redirects
allow_redirects=False,
File "/usr/local/lib/python2.7/dist-packages/requests/sessions.py", line 485, in send
r = adapter.send(request, **kwargs)
File "/usr/local/lib/python2.7/dist-packages/requests/adapters.py", line 375, in send
raise ProxyError(e)
requests.exceptions.ProxyError: Cannot connect to proxy. Socket error: [Errno 111] Connection refused.
Update:
The problem only seems to happen with a 302 (Found) redirect, not with normal 301 (Moved Permanently) redirects. Also, I noticed that with the Chrome browser, PayPal doesn't return a redirect; I do see the redirect when using Requests, even though I'm borrowing Chrome's User-Agent for this experiment. I'm looking for more URLs that return a 302 in order to get more data points.
I need this to work for all URLs or at least understand why I'm seeing this behavior.
This is a bug in urllib3. We're tracking it as urllib3 issue #295.