How to fill out forms with mechanize in Python

I am writing a script to log in to a website, and I would like to fill out a form once I am logged in. I can fill out the login form with no problems, but I am unsure how to fill out the actual form after that. The form I would like to access is made up of three tabs that navigate between pages. Below is the code I use to get to the form:
import mechanize

op = mechanize.Browser()  # use mechanize's Browser
op.set_handle_robots(False)
url = "http://example.com"
op.open(url)
email = raw_input("Username: ")  # ask for username
password = raw_input("Password: ")  # ask for password
op.select_form(nr=0)  # mechanize numbers forms 0, 1, 2, ...; there is only one form, so select 0
op.form["loginForm:user"] = email
op.form["loginForm:pw"] = password
op.submit()
for i in op.forms():
    print i
This code prints out the following (I changed the URL and replaced the actual javax.faces value with X's to protect the original code):
<NavForm POST https://someurl.xhtml application/x-www-form-urlencoded
<HiddenControl(NavForm=NavForm) (readonly)>
<HiddenControl(javax.faces.ViewState=XXXX) (readonly)>>
When I inspect the form's elements with Chrome, the fields I see there are not shown here. What is going on?
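One way to see what is happening is to dump every form and control that mechanize actually received. Fields that are only created in the browser by JavaScript (as JSF/AJAX component libraries often do) never appear in the raw HTML that mechanize parses, so they will be missing from this listing too. A minimal sketch, continuing from the op browser above:
# Enumerate each form and its controls as mechanize sees them.
# Anything Chrome shows but this loop doesn't was added client-side.
for form in op.forms():
    print "Form:", form.name, form.action
    for control in form.controls:
        print "  type=%s name=%s" % (control.type, control.name)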

Related

Python webscraping page which requires login

I am trying to automate a web data gathering process using Python. In my case, I need to pull the information from https://app.ixml.com.br/documentos/nfe page. However, before you go to this page, you need to log in at https://app.ixml.com/login. The code below should theoretically log into the site:
import re
from robobrowser import RoboBrowser
username = 'email'
password = 'password'
br = RoboBrowser()
br.open('https://app.ixml.com.br/login')
form = br.get_form()
form['email'] = username
form['senha'] = password
br.submit_form(form)
src = str(br.parsed)  # 'parsed' is a property, not a method
However, by printing the src variable, I get the source code of the https://app.ixml.com.br/login page, i.e. from before logging in. If I add the following lines at the end of the previous code:
br.open('https://app.ixml.com.br/documentos/nfe')
src2 = str(br.parsed)
The src2 variable contains the source code of the page https://app.ixml.com.br/. I tried some variations, such as creating a new br object, but got the same result. How can I access the information at https://app.ixml.com.br/documentos/nfe?
If it is OK to have a browser window open, you can try to solve this using Selenium. This package makes it possible to write a program that reacts just like a user would.
The following code should log you in:
from selenium import webdriver
browser = webdriver.Firefox()
browser.get("https://app.ixml.com.br/login")
browser.find_element_by_id("email").send_keys("abc#mail")
browser.find_element_by_id("senha").send_keys("abc")
browser.find_element_by_css_selector("button").click()
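Once logged in, the same browser object can navigate to the page the question is actually after and read its HTML. A short sketch continuing from the code above (the URL comes from the question; page_source returns the DOM after JavaScript has run):
browser.get("https://app.ixml.com.br/documentos/nfe")
html = browser.page_source  # HTML of the page after login
print(html)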

Shell script to download a lot of HTML files and store them statically with all CSS

I have posted roughly 290 questions on a science forum, and I would like to get them back by downloading them with all the associated answers.
The first issue is that I have to be logged in to my personal space to see the list of all my messages. How do I get past this first barrier so that a shell script or a single wget command can retrieve all the URLs and their content? Can I pass a login and a password to wget so that I am logged in and redirected to the appropriate URL listing all my messages?
Once this first issue is solved, the second issue is that I have to start from 6 different menu pages that all contain the titles and links of the questions.
Moreover, for some of my questions, the answers and discussions span multiple pages.
So I wonder whether I can achieve this global download while storing the pages statically, with the local CSS also stored on my computer (to keep the same formatting in my browser when I consult them on my PC).
The menu pages are only accessible once I am logged in to the website, which could also be an issue when downloading with wget if I am obliged to be connected.
An example of a URL containing the list of messages, once I am logged in, is:
https://forums.futura-sciences.com/search.php?searchid=22897684
The other pages (there are 6 or 7 pages of discussion titles in total appearing in the main menu page) have the format:
https://forums.futura-sciences.com/search.php?searchid=22897684&pp=&page=2 (for page 2)
https://forums.futura-sciences.com/search.php?searchid=22897684&pp=&page=5 (for page 5)
On each of these pages one can see the title and link of each discussion that I would like to download, with its CSS as well (knowing that each discussion may itself span multiple pages):
for example, the first page of a discussion is https://forums.futura-sciences.com/archives/804364-demonstration-dilatation-temps.html,
its page 2 is https://forums.futura-sciences.com/archives/804364-demonstration-dilatation-temps-2.html,
and its page 3 is https://forums.futura-sciences.com/archives/804364-demonstration-dilatation-temps-3.html.
Naively, I tried to do all this with a single command (using the example URL from my personal space quoted at the beginning of this post, i.e. https://forums.futura-sciences.com/search.php?searchid=22897684):
wget -r --no-check-certificate --html-extension --convert-links "https://forums.futura-sciences.com/search.php?searchid=22897684"
but unfortunately, this command downloads all sorts of files, and possibly not even what I want, i.e. my discussions.
I don't know which approach to use: must I first store all the URLs in a file (including all sub-pages containing all the answers and the whole discussion for each of my initial questions)?
Afterwards, I could perhaps do wget -i all_URL_questions.txt. How can I carry out this operation?
Update
My issue needs a script. I tried the following things with Python:
1)
import urllib, urllib2, cookielib
username = 'USERNAME'
password = 'PASSWORD'
cj = cookielib.CookieJar()
opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(cj))
login_data = urllib.urlencode({'username' : username, 'password' : password})
opener.open('https://forums.futura-sciences.com/login.php', login_data)
resp = opener.open('https://forums.futura-sciences.com/search.php?do=finduser&userid=253205&contenttype=vBForum_Post&showposts=1')
print resp.read()
But the page printed is not the home page of my personal space.
2)
import requests
# Fill in your details here to be posted to the login form.
payload = {
    'inUserName': 'USERNAME',
    'inUserPass': 'PASSWORD'
}
# Use 'with' to ensure the session context is closed after use.
with requests.Session() as s:
    p = s.post('https://forums.futura-sciences.com/login.php?do=login', data=payload)
    # print the html returned or something more intelligent to see if it's a successful login page.
    print p.text.encode('utf8')
    # An authorised request.
    r = s.get('https://forums.futura-sciences.com/search.php?do=finduser&userid=253205&contenttype=vBForum_Post&showposts=1')
    print r.text.encode('utf8')
Here too, this doesn't work.
3)
import requests
import bs4
site_url = 'https://forums.futura-sciences.com/login.php?do=login'
userid = 'USERNAME'
password = 'PASSWORD'
file_url = 'https://forums.futura-sciences.com/search.php?do=finduser&userid=253205&contenttype=vBForum_Post&showposts=1'
o_file = 'abc.html'
# create session
s = requests.Session()
# GET request. This will generate cookie for you
s.get(site_url)
# login to site.
s.post(site_url, data={'vb_login_username': userid, 'vb_login_password': password})
# Next thing will be to visit URL for file you would like to download.
r = s.get(file_url)
# Download file
with open(o_file, 'wb') as output:
    output.write(r.content)
print("requests:: File {} downloaded successfully!".format(o_file))
# Close session once all work done
s.close()
Same thing; the content is wrong.
4)
from selenium import webdriver
# To prevent download dialog
profile = webdriver.FirefoxProfile()
profile.set_preference('browser.download.folderList', 2) # custom location
profile.set_preference('browser.download.manager.showWhenStarting', False)
profile.set_preference('browser.download.dir', '/tmp')
profile.set_preference('browser.helperApps.neverAsk.saveToDisk', 'text/csv')
webdriver.get('https://forums.futura-sciences.com/')
webdriver.find_element_by_id('ID').send_keys('USERNAME')
webdriver.find_element_by_id ('ID').send_keys('PASSWORD')
webdriver.find_element_by_id('submit').click()
browser = webdriver.Firefox()
browser.get('https://forums.futura-sciences.com/search.php?do=finduser&userid=253205&contenttype=vBForum_Post&showposts=1')
Still no luck logging in with USERNAME and PASSWORD and getting the content of the home page of my personal space.
5)
from selenium import webdriver
from selenium.webdriver.firefox.webdriver import FirefoxProfile
from selenium.webdriver.firefox.options import Options
from selenium.webdriver.common.desired_capabilities import DesiredCapabilities
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.common.by import By
import time
def MS_login(username, passwd):  # call this with username and password
    firefox_capabilities = DesiredCapabilities.FIREFOX
    firefox_capabilities['moz:webdriverClick'] = False
    driver = webdriver.Firefox(capabilities=firefox_capabilities)
    fp = webdriver.FirefoxProfile()
    fp.set_preference("browser.download.folderList", 2)  # 0: desktop, 1: default "Downloads" directory, 2: the directory set below
    fp.set_preference("browser.download.dir", "/Users/user/work_archives_futura/")
    driver.get('https://forums.futura-sciences.com/')  # change the url to your website
    time.sleep(5)  # wait for redirection and rendering
    driver.delete_all_cookies()  # clean up the prior login sessions
    driver.find_element_by_xpath("//input[@name='vb_login_username']").send_keys(username)
    elem = WebDriverWait(driver, 10).until(EC.visibility_of_element_located((By.XPATH, "//input[@name='vb_login_password']")))
    elem.send_keys(Keys.TAB)
    driver.find_element_by_xpath("//input[@type='submit']").click()
    print("success !!!!")
    driver.close()  # close the browser
    return driver

if __name__ == '__main__':
    MS_login("USERNAME", "PASSWORD")
The window opens fine and the username gets filled in, but it is impossible to fill in the password or to click on submit.
PS: the main issue may be that the password field has the display:none property, so I can't simulate a TAB operation to reach the password field and fill it in once I have entered the login.
It seems you're pretty knowledgeable already about scraping using the various methods. All that was missing were the correct field names in the POST request.
I used the Chrome dev tools (F12, then the Network tab). With that open, if you log in and quickly stop the browser window from redirecting, you'll be able to see the full request to login.php and inspect the fields.
With that I was able to build this for you. It includes a handy dumping function for responses. To test that my code works, you can use your real password for the positive case and the bad-password line for the negative case.
import requests
import json

s = requests.Session()

def dumpResponseData(r, fileName):
    print(r.status_code)
    print(json.dumps(dict(r.headers), indent=1))
    cookieDict = s.cookies.get_dict()
    print(json.dumps(cookieDict, indent=1))
    outfile = open(fileName, mode="w")
    outfile.write(r.text)
    outfile.close()

username = "your-username"
password = "your-password"
# password = "bad password"

def step1():
    data = dict()
    data["do"] = "login"
    data["vb_login_md5password"] = ""
    data["vb_login_md5password_utf"] = ""
    data["s"] = ""
    data["securitytoken"] = "guest"
    data["url"] = "/search.php?do=finduser&userid=1077817&contenttype=vBForum_Post&showposts=1"
    data["vb_login_username"] = username
    data["vb_login_password"] = password
    p = s.post('https://forums.futura-sciences.com/login.php?do=login', data=data)
    # Logged In?
    if "vbseo_loggedin" in s.cookies.keys():
        print("Logged In!")
    else:
        print("Login Failed :(")

if __name__ == "__main__":
    step1()
I don't have any posts in my newly created Futura account, so I can't really do any more testing for you - I don't want to spam their forum with garbage.
But I would probably start by requesting the post-search URL and scraping the links using bs4, as in the sketch below.
Then you could probably just use wget -r for each link you've scraped.
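A minimal sketch of that scraping step, reusing the session s from the login code above (the '/archives/' URL pattern for discussion links is an assumption based on the example URLs in the question; check the real markup):
import bs4

# Fetch the post-search results with the logged-in session.
r = s.get('https://forums.futura-sciences.com/search.php?do=finduser&userid=1077817&contenttype=vBForum_Post&showposts=1')
soup = bs4.BeautifulSoup(r.text, 'html.parser')

# Collect links that look like discussion pages (assumed pattern).
links = [a['href'] for a in soup.find_all('a', href=True) if '/archives/' in a['href']]

# One URL per line, ready for: wget -i all_URL_questions.txt
with open('all_URL_questions.txt', 'w') as f:
    f.write('\n'.join(links))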
@Researcher is correct on their advice when it comes to the requests library. You are not posting all of the request params that the browser would send. Overall, I think it will be difficult to get requests to pull everything once you factor in static content and client-side JavaScript.
Your selenium code from section 4 has a few mistakes in it:
# yours
webdriver.find_element_by_id('ID').send_keys('USERNAME')
webdriver.find_element_by_id ('ID').send_keys('PASSWORD')
webdriver.find_element_by_id('submit').click()

# should be
webdriver.find_element_by_id('vb_login_username').send_keys('USERNAME')
webdriver.find_element_by_id('vb_login_password').send_keys('PASSWORD')
webdriver.find_element_by_xpath("//input[@type='submit']").click()
You may need to fiddle with the xpath for the submit button.
Hint: You can debug along the way by taking screenshots:
webdriver.find_element_by_id('vb_login_username').send_keys('USERNAME')
webdriver.find_element_by_id('vb_login_password').send_keys('PASSWORD')
webdriver.get_screenshot_as_file('before_submit.png')
webdriver.find_element_by_xpath("//input[@type='submit']").click()
webdriver.get_screenshot_as_file('after_submit.png')

Request.Response object doesn't redirect to the right URL

Two-month-old Python noob here.
I'm using MechanicalSoup to fill in a login form on a webpage, which I then want to submit in order to go to the user-profile page.
Although I don't get any errors in my code, after submitting the form I still get the current URL of the homepage from my new response object.
Moreover, the status code of this response object is 200, which implies that the request was successful?
Here's the relevant part of my code:
def randomstring():
    letters = string.ascii_lowercase
    return ''.join(random.choice(letters) for i in range(5))

br = mechanicalsoup.StatefulBrowser(soup_config=({'features': 'html.parser'}))
browser = br.open("https://www.opinieland.nl/nl-nl")
page = br.get_current_page()
Form = br.select_form(nr=0)
form = br.get_current_form()
for tag in page.find_all(True):
    LOGIN = page.select('input[class="col-md-7 col-sm-7 col-xs-6 form-control"]')
    for i in LOGIN:
        if i.get("name") == None:
            i["name"] = randomstring()
        else:
            continue
apsuser = Form.form.select('input[class="col-md-7 col-sm-7 col-xs-6 form-control"]')[0]
apspas = Form.form.select('input[class="col-md-7 col-sm-7 col-xs-6 form-control"]')[1]
form.set_input({apsuser.get('name'): username, apspas.get('name'): password})
form2 = br.select_form(selector=('a[class="btn btn-danger"]'))
soup = br.submit(form2, url="https://www.opinieland.nl/nl-nl")
As I said, the code doesn't raise any errors, and when launching the browser I can see that the forms are filled in correctly.
Help is appreciated :), and of course any additional tips about my code are welcome too!
MechanicalSoup is capable of submitting a login form, so long as it is not handled by JavaScript (see "When to use MechanicalSoup?"). In this case, I think a minor misuse of MechanicalSoup is causing the error.
Once you fill out the form, you generally want to submit it with br.submit_selected().
The variable form2 does not appear to be a form (no POST or GET action), just a link:
In [9]: form2.form
Out[9]: <a class="btn btn-danger" id="apslogin" style="margin-top:4px"> Inloggen</a>
To submit the correct form, you should therefore replace
form2 = br.select_form(selector=('a[class="btn btn-danger"]'))
soup = br.submit(form2, url="https://www.opinieland.nl/nl-nl")
with
br.submit_selected()
For a complete example that showcases filling out and submitting a login form, see the MechanicalSoup login tutorial.
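For reference, here is a minimal end-to-end sketch of that pattern against a hypothetical login page (the URL, form selector, and field names are placeholders; substitute the real ones):
import mechanicalsoup

browser = mechanicalsoup.StatefulBrowser()
browser.open("https://example.com/login")      # hypothetical login page
browser.select_form('form[action="/login"]')   # assumed form selector
browser["username"] = "my-user"                # assumed field names
browser["password"] = "my-password"
browser.submit_selected()                      # submit the selected form
print(browser.get_url())                       # should now be the post-login URL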

Input html form data from python script

I am working on a project and I need to validate a piece of data using a third party site. I wrote a python script using the lxml package that successfully checks if a specific piece of data is valid.
Unfortunately, the site does not have a convenient URL scheme for its data, and therefore I cannot predict the specific URL that will contain the data for each unique request. Instead, the third-party site has a query page with a standard HTML text input that redirects to the proper URL.
My question is this: is there a way to input a value into the HTML input and submit it, all from my Python script?
Yes there is.
Mechanize
Forms
List the forms
import mechanize
br = mechanize.Browser()
br.open(url)
for form in br.forms():
    print "Form name:", form.name
    print form
Select a form
br.select_form("form1")
br.form = list(br.forms())[0]
Login form example
br.select_form("login")
br['login:loginUsernameField'] = user
br['login:password'] = password
br.method = "POST"
response = br.submit()
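The submit() call returns a response object; reading it gives you the page that came back after the POST (a brief sketch continuing from the response above):
html = response.read()  # the page returned after submitting
print br.geturl()       # the URL you ended up on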
Selenium
Sending input
Given an element defined as:
<input type="text" name="passwd" id="passwd-id" />
you could find it using any of:
element = driver.find_element_by_id("passwd-id")
element = driver.find_element_by_name("passwd")
element = driver.find_element_by_xpath("//input[@id='passwd-id']")
You may want to enter some text into a text field:
element.send_keys("some text")
You can simulate pressing the arrow keys by using the “Keys” class:
element.send_keys("and some", Keys.ARROW_DOWN)
These are the two packages I'm aware of that can do what you've asked.

Mechanize to submit and read response

I am using mechanize in Python to submit a form and print out the response, but it does not seem to work:
import mechanize

# The URL to this service
URL = 'http://sppp.rajasthan.gov.in/bidsearch.php'

def main():
    # Create a Browser instance
    b = mechanize.Browser()
    # Load the page
    b.open(URL)
    # Select the form
    b.select_form(nr=0)
    # Fill out the form
    b['ddlfinancialyear'] = '2015-2016'
    b.submit()
    b.response().read()
What I am trying to do is submit a form at the URL http://sppp.rajasthan.gov.in/bidsearch.php. When the form is submitted (by passing the value '2015-2016' to the 'ddlfinancialyear' control), another page should be returned as a response, but I am not getting any output.
Try assigning the result of b.submit() before reading it:
S = b.submit()
S.read()
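Note that in the question's code main() is never actually called, and the response is read but never printed, so even a successful submit produces no output. A small sketch of the end of the flow (same form assumptions as the question):
S = b.submit()    # keep the response from the POST
print b.geturl()  # confirm we landed on the results page
print S.read()    # actually print the returned HTML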
