I'm filling in a form on a web page using Python's requests module. I'm submitting the form as a POST request, which works fine, and I get the expected response. However, it's a multistep form; after the first "submit" the site loads another form on the same page (using AJAX). The POST response contains this HTML page. Now, how do I use this response to fill the form on the new page? Can I intertwine the requests module with Twill or Mechanize in some way?
Here's the code for the POST:
import requests
from requests.auth import HTTPProxyAuth
import formfill
from twill import get_browser
from twill.commands import *
import mechanize
from mechanize import ParseResponse, urlopen, urljoin
http_proxy = "some_Proxy"
https_proxy = "some_Proxy"
proxyDict = {
    "http": http_proxy,
    "https": https_proxy
}
auth = HTTPProxyAuth("user","pass")
r = requests.post("site_url",data={'key':'value'},proxies=proxyDict,auth=auth)
The response r above contains the new HTML page that resulted from submitting that form. This HTML page also has a form that I have to fill. Can I pass r to Twill or Mechanize in some way and use Mechanize's form-filling API? Any ideas would be helpful.
The problem here is that you need to actually interact with the JavaScript on the page. requests, while an excellent library, has no support for JavaScript interaction; it is just an HTTP library.
If you want to interact with JavaScript-rich web pages in a meaningful way, I would suggest Selenium. Selenium drives a full web browser, so it can navigate exactly as a person would.
The main issue is that you'll see your speed drop precipitously: rendering a web page takes a lot longer than a raw HTML request. If that's a real deal breaker for you, you've got two options:
Go headless: There are many options here, but I personally prefer CasperJS. You should see roughly a 3x speed-up in browsing times by going headless, but every site is different.
Find a way to do everything through HTTP: most non-visual site functions have an equivalent HTTP request. Using the Network tab of the Chrome developer tools you can dig into the requests that are actually being fired, then replicate those in Python.
As far as the tools you mentioned go, neither mechanize nor twill will help. Since your main issue here is JavaScript interaction rather than cookie management, and neither of those frameworks supports JavaScript, you would run into the same problem.
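For illustration, here is a minimal Selenium sketch of that flow. The field names ('key', 'second_field', 'submit') and the URL are placeholders, not taken from your page, so treat this as a rough outline rather than working code for your site:

from selenium import webdriver

browser = webdriver.Firefox()      # or webdriver.Chrome()
browser.implicitly_wait(10)        # wait up to 10s for elements (covers the AJAX-loaded form)
browser.get("site_url")

# fill and submit the first form (field names are hypothetical)
browser.find_element_by_name("key").send_keys("value")
browser.find_element_by_name("submit").click()

# the AJAX-loaded second form can then be filled the same way
browser.find_element_by_name("second_field").send_keys("another value")
browser.find_element_by_name("submit").click()

browser.quit()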
UPDATE: If the POST response is actually the new page, then you're not really dealing with AJAX at all. If that's the case and you have the raw HTML, you should simply mimic the HTTP request that the second form would send. The same approach you used on the first form will work on the second: either pull the field values out of the HTML response, or hard-code the successive requests.
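Sketched out with requests and BeautifulSoup, reusing the proxyDict and auth from your code; the form field names ('second_key') are assumptions, so pull the real ones out of r.text:

import requests
from urlparse import urljoin
from bs4 import BeautifulSoup

s = requests.Session()
r = s.post("site_url", data={'key': 'value'}, proxies=proxyDict, auth=auth)

# find the second form in the returned HTML
soup = BeautifulSoup(r.text, 'html.parser')
form = soup.find('form')

# carry over any hidden inputs the second form expects
payload = {i['name']: i.get('value', '')
           for i in form.find_all('input', type='hidden') if i.get('name')}
payload['second_key'] = 'second_value'   # your own values (names are assumptions)

# the form may post to a relative URL, so resolve it against the page URL
action = urljoin(r.url, form.get('action', ''))
r2 = s.post(action, data=payload, proxies=proxyDict, auth=auth)
print r2.status_code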
using Mechanize:

import mechanize

br = mechanize.Browser()
br.open("site_url")

#list the forms on the page to find the one you need
for form in br.forms():
    print "Form name:", form.name
    print form

#select the 1st form on the page (nr=0; nr=1 for the next, etc.)
#OR select it by name: br.select_form(form.name)
br.select_form(nr=0)

#fill in the fields ('form#' stands for the actual field name)
br.form['form#'] = 'Test Name'

r = br.submit() #can always pass in additional params
Related
I am looking to open a connection with Python to http://www.horseandcountry.tv, which takes my login parameters via the POST method. I would like to connect to this site in order to scrape it for all video links (this I also don't know how to do yet, but I'm using the project to learn).
My question is: how do I pass my credentials to the individual pages of the website? For example, if all I wanted to do was use Python code to open a browser window pointing to http://play.horseandcountry.tv/live/ and have it open with me already logged in, how do I go about this?
As far as I know you have two options, depending on how you want to crawl and what you need to crawl:
1) Use urllib. You can do your POST request with the necessary login credentials. This is the low-level solution, which means it is fast, but it doesn't handle high-level stuff like JavaScript.
2) Use Selenium. With that you can simulate a browser (Chrome, Firefox, other...) and run actions via your Python code. It is much slower, but it works well on more "sophisticated" websites.
What I usually do: I try the first option, and if I encounter a problem like a JavaScript security layer on the website, then I go for option 2. Moreover, Selenium can open a real web browser on your desktop and give you a visual of your scraping.
In any case, just Google "urllib/selenium login to website" and you'll find what you need.
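As a rough sketch of option 2, a Selenium login could look like the following. The login URL and the field names are guesses (the field names are taken from the form data shown in the answer below), so verify them in the page source:

from selenium import webdriver

driver = webdriver.Firefox()
driver.get("https://play.horseandcountry.tv/login/")

# field names are assumptions -- inspect the login form to get the real ones
driver.find_element_by_name("account_email").send_keys("your_email")
driver.find_element_by_name("account_password").send_keys("your_password")
driver.find_element_by_name("submit").click()

# the browser window now holds a logged-in session you can keep driving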
If you want to avoid Selenium (opening web browsers), you can go with requests: it can log in to the website and grab anything you need in the background.
Here is how you can log in to that website with requests.
import requests
from bs4 import BeautifulSoup

#Login form data
payload = {
    'account_email': 'your_email',
    'account_password': 'your_password',
    'submit': 'Sign In'
}

with requests.Session() as s:
    #Log in to the website
    response = s.post('https://play.horseandcountry.tv/login/', data=payload)

    #Check if logged in successfully
    soup = BeautifulSoup(response.text, 'lxml')
    logged_in = soup.find('p', attrs={'class': 'navbar-text pull-right'})
    print s.cookies
    print response.status_code
    if logged_in.text.startswith('Logged in as'):
        print 'Logged In Successfully!'
If you need explanations for this, you can check this answer or the requests documentation.
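Once logged in, the same session (and its cookies) can be reused to fetch the pages you want to scrape. For example, to list every link on the live page; whether those are the actual video links is an assumption about the markup:

    #still inside the "with requests.Session() as s:" block from above
    videos = s.get('http://play.horseandcountry.tv/live/')
    video_soup = BeautifulSoup(videos.text, 'lxml')
    for link in video_soup.find_all('a', href=True):
        print link['href']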
You could also use the requests module. It is one of the most popular Python HTTP libraries. Here are some questions that relate to what you would like to do:
Log in to website using Python Requests module
logging in to website using requests
I came across a situation where I used Python Requests or urllib2 to open URLs and got 404 'page not found' responses. For example, url = 'https://www.facebook.com/mojombo'. However, I can copy and paste those URLs into a browser and visit them. Why does this happen?
I need to get some content from those pages' HTML source code. Since I can't open those URLs using Requests or urllib2, I can't use BeautifulSoup to extract elements from the HTML source. Is there a way to get those pages' source code and extract content from it using Python?
Although this is a general question, I still need some working code to solve it. Thanks!
It looks like your browser is using cookies to log you in. Try opening that URL in a private or incognito tab, and you'll probably not be able to access it.
However, if you are using Requests, you can pass the appropriate login information as a dictionary of values. You'll need to check the form information to see what the fields are, but Requests can handle that as well.
The normal format would be:
import requests

payload = {
    'username': 'your username',
    'password': 'your password'
}
p = requests.post(myurl, data=payload)
with more or fewer fields added as needed.
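To then fetch the pages you are actually after while staying logged in, you can do the login inside a requests.Session, which keeps the cookies for every later request. A minimal sketch continuing the snippet above (the protected-page URL is a placeholder):

with requests.Session() as s:
    s.post(myurl, data=payload)                     # log in; cookies are kept on s
    r = s.get('https://www.example.com/protected')  # placeholder for the page you want
    print r.status_code                             # should no longer be 404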
As part of my quest to become better at Python I am now attempting to sign in to a website I frequent, send myself a private message, and then sign out. So far, I've managed to sign in (using urllib, cookiejar and urllib2). However, I cannot work out how to fill in the required form to send myself a message.
The form is located at /messages.php?action=send. There are three things that need to be filled in for the message to send: text fields named name, title and message. Additionally, there is a submit button (named "submit").
How can I fill in this form and send it?
import urllib
import urllib2

name = "name field"
title = "title field"
message = "message text"

data = {
    "name": name,
    "title": title,
    "message": message,
    "submit": "submit"
}
encoded_data = urllib.urlencode(data)
content = urllib2.urlopen("http://www.abc.com/messages.php?action=send",
                          encoded_data)
print content.readlines()
Just replace http://www.abc.com/messages.php?action=send with the URL your form is being submitted to.
Reply to your comment: if that URL is where your form is located, and you need to do this for just one website, look at the page's source code and find
<form method="POST" action="some_address.php">
then use that address as the URL parameter for urllib2.urlopen.
You also have to realise what the submit button does: it just sends an HTTP request to the URL given by the form's action attribute. So what you do is simulate this request with urllib2.
You can use mechanize to work with this easily; it will simplify submitting the form. Don't forget to check the parameter names (name, title, message) against the HTML source of the form.
import mechanize

br = mechanize.Browser()
br.open("http://mywebsite.com/messages.php?action=send")

br.select_form(nr=0)                     # select the first form on the page
br.form['name'] = 'Enter your Name'      # fill the three text fields
br.form['title'] = 'Enter your Title'
br.form['message'] = 'Enter your message'

req = br.submit()                        # submits the form and returns the response
You want the mechanize library. This lets you easily automate the process of browsing websites and submitting forms/following links. The site I've linked to has quite good examples and documentation.
Try to work out the requests that are made (e.g. using the Chrome developer tools or Firefox/Firebug) and imitate the POST request containing the desired form data.
In addition to the great mechanize library mentioned by Andrew, I'd also suggest using BeautifulSoup to parse the HTML.
If you don't want to use mechanize but still want an easy, clean way to create HTTP requests, I recommend the excellent requests module.
To post data to a web page with cURL, use something like this:
curl -d name="Shrimant" -d title="Hello world" -d message="Hello, how are you" -d submit="Send" "http://www.example.com/messages.php?action=send"
The -d option tells cURL that the next item is some data to be sent to the server at http://www.example.com/messages.php?action=send
I'm trying to scrape a page (my router's admin page), but the device seems to be serving a different page to urllib2 than to my browser. Has anyone seen this before? How can I get around it?
This is the code I'm using:
>>> from BeautifulSoup import BeautifulSoup
>>> import urllib2
>>> page = urllib2.urlopen("http://192.168.1.254/index.cgi?active_page=9133&active_page_str=page_bt_home&req_mode=0&mimic_button_field=btn_tab_goto:+9133..&request_id=36590071&button_value=9133")
>>> soup = BeautifulSoup(page)
>>> soup.prettify()
(html output is removed by markdown)
With Firebug, watch what headers and cookies are sent to the server. Then emulate the same request with urllib2.Request and cookielib.
EDIT: You can also use mechanize.
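A rough sketch of that approach; the header values are just examples of what Firebug might show, and the URL is shortened to the router's root:

import urllib2
import cookielib

cj = cookielib.CookieJar()
opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(cj))

# copy whatever headers Firebug shows your browser sending (these are examples)
req = urllib2.Request("http://192.168.1.254/",
                      headers={'User-Agent': 'Mozilla/5.0',
                               'Referer': 'http://192.168.1.254/'})
page = opener.open(req)
print page.read()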
It may be simpler than Wireshark to use Firebug to see the form of the request being made, and then emulate the same in your code.
Use Wireshark to see what your browser's request looks like, and add the missing parts so that your request looks the same.
To tweak urllib2 headers, try this.
Probably this isn't working because you haven't supplied credentials for the admin page.
Use mechanize to load the login page and fill out the username/password.
Then you should have a cookie set to allow you to continue to the admin page.
It is much harder using just urllib2. You will need to manage the cookies yourself if you choose to stick to that route.
In my case it was one of the following:
1) The website could tell that the access was not from a browser, so I had to fake a browser in Python like this:
import urllib2
from BeautifulSoup import BeautifulSoup

url = "http://www.example.com"  # placeholder for the page that refuses non-browser clients

# Build an opener to fake a browser... Google here I come!
opener = urllib2.build_opener()
# To fake the browser
opener.addheaders = [('User-agent', 'Mozilla/5.0')]
# Read the page
soup = BeautifulSoup(opener.open(url).read())
2) The content of the page was filled in dynamically by JavaScript. In that case, read the following post: https://stackoverflow.com/a/11460633/2160507
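For case 2, one option is to let a real browser render the page and then hand the resulting HTML to BeautifulSoup, for example with Selenium. This is only a sketch with a placeholder URL, not tied to any particular site:

from selenium import webdriver
from BeautifulSoup import BeautifulSoup

url = "http://www.example.com"   # placeholder for the javascript-heavy page

driver = webdriver.Firefox()
driver.get(url)

# page_source holds the HTML after the javascript has run
soup = BeautifulSoup(driver.page_source)
driver.quit()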
I've been Googling around for quite some time now and can't seem to get this to work. A lot of my searches have pointed me to similar problems, but they all seem to be related to cookie grabbing/storing. I think I've set that up properly, but when I try to open the 'hidden' page, it keeps bringing me back to the login page saying my session has expired.
import urllib, urllib2, cookielib, webbrowser
username = 'userhere'
password = 'passwordhere'
url = 'http://example.com'
webbrowser.open(url, new=1, autoraise=1)
cj = cookielib.CookieJar()
opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(cj))
login_data = urllib.urlencode({'username' : username, 'j_password' : password})
opener.open('http://example.com', login_data)
resp = opener.open('http://example.com/afterlogin')
print resp
webbrowser.open(url, new=1, autoraise=1)
First off, when doing cookie-based authentication, you need a CookieJar to store your cookies in, much in the same way that your browser stores its cookies in a place where it can find them again.
After opening the login page through Python and saving the cookie from a successful login, you could use MozillaCookieJar to write the Python-created cookies in a format a Firefox browser can parse. However, Firefox 3.x no longer uses the cookie format that MozillaCookieJar produces, and I have not been able to find viable alternatives.
If all you need to do is retrieve specific data whose format is known in advance, then I suggest you keep all your HTTP interactions within Python. It is much easier, and you don't have to rely on specific browsers being available. If it is absolutely necessary to show things in a browser, you could fetch the so-called 'hidden' page through urllib2 (which incidentally integrates very nicely with cookielib), save the HTML to a temporary file and pass that to webbrowser.open, which will then render that specific page. Further redirects are not possible.
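A sketch of that last idea, using the same example.com placeholders as your code; the login field names here are guesses and must match the site's actual form:

import urllib, urllib2, cookielib, webbrowser, tempfile

cj = cookielib.CookieJar()
opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(cj))

# log in first so the cookie jar holds the session cookie
login_data = urllib.urlencode({'username': 'userhere', 'password': 'passwordhere'})
opener.open('http://example.com', login_data)

# fetch the 'hidden' page and dump it to a temporary file
html = opener.open('http://example.com/afterlogin').read()
f = tempfile.NamedTemporaryFile(suffix='.html', delete=False)
f.write(html)
f.close()

# show that one rendered page; further clicks won't carry the session along
webbrowser.open('file://' + f.name)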
I've generally used the mechanize library to handle stuff like this. That doesn't answer your question about why your existing code isn't working, but it's something else to play with.
The provided code calls:
opener.open('http://example.com', login_data)
but throws away the response. I would look at this response to see if it says "Bad password" or "I only accept IE" or similar.
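Something like this, using the same opener and login_data objects from your code, would let you see it:

resp = opener.open('http://example.com', login_data)
body = resp.read()
print resp.geturl()   # did the server bounce you back to the login page?
print body[:500]      # look for 'Bad password', 'session has expired', etc.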