I am trying to build a simple web bot in Python, on Windows, using MechanicalSoup. Unfortunately, I am sitting behind a (company-enforced) proxy. I could not find a way to provide a proxy to MechanicalSoup. Is there such an option at all? If not, what are my alternatives?
EDIT: Following Eytan's hint, I added proxies and verify to my code, which got me a step further, but I still cannot submit a form:
import mechanicalsoup

proxies = {
    'https': 'my.https.proxy:8080',
    'http': 'my.http.proxy:8080'
}

url = 'https://stackoverflow.com/'
browser = mechanicalsoup.StatefulBrowser()
front_page = browser.open(url, proxies=proxies, verify=False)
form = browser.select_form('form[action="/search"]')
form.print_summary()
form["q"] = "MechanicalSoup"
form.print_summary()
browser.submit(form, url=url)
The code hangs in the last line, and submit doesn't accept proxies as an argument.
It seems that proxies have to be specified at the session level. Then they are not required in browser.open, and submitting the form also works:
import mechanicalsoup

proxies = {
    'https': 'my.https.proxy:8080',
    'http': 'my.http.proxy:8080'
}

url = 'https://stackoverflow.com/'
browser = mechanicalsoup.StatefulBrowser()
browser.session.proxies = proxies  # THIS IS THE SOLUTION!
front_page = browser.open(url, verify=False)
form = browser.select_form('form[action="/search"]')
form["q"] = "MechanicalSoup"
result = browser.submit(form, url=url)
result.status_code
returns 200 (i.e. "OK").
According to their docs, this should work:
browser.get(url, proxies=proxy)
Try passing the 'proxies' argument to your requests.
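MechanicalSoup is built on requests, so a third option is the standard proxy environment variables, which requests honors by default. A minimal sketch, reusing the proxy hosts from the question:

import os
import mechanicalsoup

# requests reads the standard proxy environment variables by default
# (trust_env=True), so MechanicalSoup's underlying Session picks them up
os.environ['HTTP_PROXY'] = 'http://my.http.proxy:8080'
os.environ['HTTPS_PROXY'] = 'http://my.https.proxy:8080'

browser = mechanicalsoup.StatefulBrowser()
front_page = browser.open('https://stackoverflow.com/')  # routed through the proxy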
Related
I'm building a small script to test certain proxies against the API.
It seems that the actual request isn't routed through the provided proxy. For example, the following request goes through and I get a response from the API:
import requests

r = requests.post("https://someapi.com", data=request_data,
                  proxies={"http": "http://999.999.999.999:1212"}, timeout=5)
print(r.text)
How come I get the response even if the proxy provided was invalid?
You can define the proxies like this:
import requests

pxy = "http://999.999.999.999:1212"
proxyDict = {
    'http': pxy,
    'https': pxy,
    'ftp': pxy,
    'SOCKS4': pxy
}

r = requests.post("https://someapi.com", data=request_data,
                  proxies=proxyDict, timeout=5)
print(r.text)
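A note on why the original call ignored the proxy: requests selects a proxy by matching the target URL's scheme against the keys of the proxies dict. The POST went to an https:// URL with only 'http' mapped, so it bypassed the proxy and reached the API directly. Once 'https' is mapped too, a bogus proxy fails loudly instead; a minimal check, using httpbin.org as an echo service:

import requests

pxy = "http://999.999.999.999:1212"
proxies = {'http': pxy, 'https': pxy}

# Without proxies this prints your real IP; with the bogus proxy mapped
# for https, requests raises an error instead of silently going direct.
print(requests.get("https://httpbin.org/ip", timeout=5).text)
try:
    requests.get("https://httpbin.org/ip", proxies=proxies, timeout=5)
except requests.exceptions.RequestException as err:
    print("proxy failed as expected:", err)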
I am trying to login to a website using python.
The login URL is:
https://login.flash.co.za/apex/f?p=pwfone:login
and the form action URL is shown as:
https://login.flash.co.za/apex/wwv_flow.accept
When I use 'Inspect Element' in Chrome while logging in manually, these are the form posts that show up (p_t02 = password):
There are a few hidden items that I'm not sure how to add to the Python code below.
When I use this code, the login page is returned:
import requests

url = 'https://login.flash.co.za/apex/wwv_flow.accept'
values = {
    'p_flow_id': '1500',
    'p_flow_step_id': '101',
    'p_page_submission_id': '3169092211412',
    'p_request': 'LOGIN',
    'p_t01': 'solar',
    'p_t02': 'password',
    'p_checksum': ''
}

r = requests.post(url, data=values)
print(r.content)
How can I adjust this code to perform a login?
Chrome network:
This is more or less what your script should look like. Use a session to handle the cookies automatically. Fill in the username and password fields manually.
import requests
from bs4 import BeautifulSoup

logurl = "https://login.flash.co.za/apex/f?p=pwfone:login"
posturl = 'https://login.flash.co.za/apex/wwv_flow.accept'

with requests.Session() as s:
    s.headers = {"User-Agent": "Mozilla/5.0"}
    res = s.get(logurl)
    soup = BeautifulSoup(res.text, "lxml")
    values = {
        'p_flow_id': soup.select_one("[name='p_flow_id']")['value'],
        'p_flow_step_id': soup.select_one("[name='p_flow_step_id']")['value'],
        'p_instance': soup.select_one("[name='p_instance']")['value'],
        'p_page_submission_id': soup.select_one("[name='p_page_submission_id']")['value'],
        'p_request': 'LOGIN',
        'p_arg_names': soup.select_one("[name='p_arg_names']")['value'],
        'p_t01': 'username',
        'p_t02': 'password',
        'p_md5_checksum': soup.select_one("[name='p_md5_checksum']")['value'],
        'p_page_checksum': soup.select_one("[name='p_page_checksum']")['value']
    }
    r = s.post(posturl, data=values)
    print(r.content)
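A note on why the script scrapes the hidden fields instead of reusing the values captured in Chrome: APEX submission IDs and checksums appear to be generated per session, so hard-coded copies go stale; fetching the login page first picks up values that match the cookies held by the session.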
Since I cannot recreate your case, I can't tell you exactly what to change, but when I was doing such things I used Postman to intercept all the requests my browser sends. So I'd install that, along with its browser extension, and then perform the login. You can then view the request in Postman, and also the response it received; what's more, it generates the Python code for the request too, so you can simply copy and use it.
In short: use Postman, perform the login, clone the request.
I wrote a simple python script to make a request to this website http://www.lagado.com/proxy-test using the requests module.
This website essentially tells you whether the request is using a proxy or not. According to the website, my request is not going through a proxy and is in fact coming from my own IP address.
Here is the code:
import time

import bs4
import requests
# RandomHeaders comes from the asker's own helper module

proxiesLocal = {
    'https': proxy
}
headers = RandomHeaders.LoadHeader()
url = "http://www.lagado.com/proxy-test"
res = ''

while res == '':
    try:
        res = requests.get(url, headers=headers, proxies=proxiesLocal)
        proxyTest = bs4.BeautifulSoup(res.text, "lxml")
        items = proxyTest.find_all("p")
        print(len(items))
        for item in items:
            print(item.text)
        quit()
    except:
        print('sleeping')
        time.sleep(5)
        continue
Assuming that proxy is a variable of type string that stores the address of the proxy, what am I doing wrong?
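A likely culprit, given how requests matches proxies to URLs: the proxies dict is keyed by URL scheme, and the test page is plain http://, but proxiesLocal only maps 'https', so the request falls back to a direct connection and the site sees your real IP. A minimal sketch of the fix (proxy stands in for the question's variable):

proxy = 'http://1.2.3.4:8080'  # hypothetical stand-in for the question's variable

# map the proxy under both schemes so plain http:// URLs are covered too
proxiesLocal = {
    'http': proxy,
    'https': proxy
}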
I am trying to send a GET request through a proxy with authentication.
I have the following existing code:
import httplib  # Python 2 stdlib; the module is http.client on Python 3

username = 'myname'
password = '1234'
proxyserver = "136.137.138.139"
url = "http://google.com"

c = httplib.HTTPConnection(proxyserver, 83, timeout=30)
c.connect()
c.request("GET", url)
resp = c.getresponse()
data = resp.read()
print data
When running this code, I get an answer from the proxy saying that I must provide authentication, which is correct.
In my code, I don't use the login and password. My problem is that I don't know how to use them!
Any idea?
You can refer to this code if you specifically want to use httplib:
https://gist.github.com/beugley/13dd4cba88a19169bcb0
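For reference, Basic proxy authentication comes down to one extra header: base64 of 'user:password' sent as Proxy-Authorization. A minimal httplib sketch along those lines, assuming the proxy accepts Basic auth:

import base64
import httplib  # Python 2 stdlib; use http.client on Python 3

username = 'myname'
password = '1234'
proxyserver = "136.137.138.139"
url = "http://google.com"

# Basic proxy auth is base64("user:password") sent in a
# Proxy-Authorization header with every request routed via the proxy
auth = base64.b64encode('%s:%s' % (username, password))
headers = {'Proxy-Authorization': 'Basic ' + auth}

c = httplib.HTTPConnection(proxyserver, 83, timeout=30)
c.request("GET", url, headers=headers)
resp = c.getresponse()
data = resp.read()
print data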
But you could also use the easier requests module.
import requests

proxies = {
    "http": "http://username:password@proxyserver:port/",
    # "https": "https://username:password@proxyserver:port/",
}

url = 'http://google.com'
data = requests.get(url, proxies=proxies)
I'm using the Python requests package to send http requests. I want to add a single proxy to the requests session object. eg.
session = requests.Session()
session.proxies = {...} # Here I want to add a single proxy
Currently I am looping through a bunch of proxies, and at each iteration a new session is made. I only want to set a single proxy for each iteration.
The only example I see in the documentation is:
proxies = {
    "http": "http://10.10.1.10:3128",
    "https": "http://10.10.1.10:1080",
}

requests.get("http://example.org", proxies=proxies)
I've tried to follow this, but to no avail. Here is my code from the script:
# eg. line = 59.43.102.33:80
r = s.get('http://icanhazip.com', proxies={'http': 'http://' + line})
But I get an error:
requests.packages.urllib3.exceptions.LocationParseError: Failed to parse 59.43.102.33:80
How is it possible to set a single proxy on a session object?
In addition to @neowu's answer, if you would like to set a proxy for the lifetime of a session object, you can also do the following:
import requests

proxies = {'http': 'http://10.11.4.254:3128'}
s = requests.session()
s.proxies.update(proxies)

# The proxies are attached to the session object, so they are used
# automatically here; no need to pass them separately on each call.
s.get("http://www.example.com")
In fact, you are right, but you must check your definition of 'line'. I have tried this, and it's OK:
>>> import requests
>>> s = requests.Session()
>>> s.get("http://www.baidu.com", proxies={'http': 'http://10.11.4.254:3128'})
<Response [200]>
Did you define line like line = ' 59.43.102.33:80'? There is a space at the front of the address, which would cause the parse error you are seeing.
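If so, stripping the whitespace (and keeping the scheme prefix) before handing the address to requests avoids the LocationParseError; a minimal sketch:

line = ' 59.43.102.33:80'  # note the leading space
proxy = 'http://' + line.strip()  # strip() removes it; prefix the scheme too
print(proxy)  # http://59.43.102.33:80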
There are other ways you can set proxies, apart from the solutions you have got so far:
import requests

with requests.Session() as s:
    # either like this
    s.proxies = {'https': 'http://105.234.154.195:8888', 'http': 'http://199.188.92.69:8000'}
    # or like this
    s.proxies['https'] = 'http://105.234.154.195:8888'
    r = s.get(link)  # 'link' being whatever URL you want to fetch
Hopefully this may lead to an answer:
urllib3.util.url.parse_url(url)
Given a url, return a parsed Url namedtuple. Best-effort is performed to parse incomplete urls. Fields not provided will be None.
Retrieved from https://urllib3.readthedocs.org/en/latest/helpers.html