First, I just want to say that there are a number of similar questions out there, but none are working for me.
I'm a Python/scraping newbie, and I'm trying to use the mechanize module to log in to a website. However, when I submit the correct credentials in the form, it fails to log in. Here's my code:
br = mechanize.Browser()
cj = cookielib.LWPCookieJar()
br.set_cookiejar(cj)
br.set_handle_equiv(True)
br.set_handle_redirect(True)
br.set_handle_referer(True)
br.set_handle_robots(False)
br.set_handle_refresh(mechanize._http.HTTPRefreshProcessor(), max_time=1)
br.addheaders = [('User-agent', 'Chrome')]
br.open('https://www.chess.com/login')
br.select_form(nr=0)
br.form['c1'] = 'username'
br.form['loginpassword'] = 'password'
br.submit()
print br.open('http://www.chess.com/home/game_archive?member=username&page=1').read()
The final print command shows the page source as though I'm not logged in. If I enter the same credentials through Chrome, visit the final URL, and view the source, I see the correct (logged-in) page, so the username and password shouldn't be the problem. Anyone have a good guess about what is going on?
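One way to debug a silent login failure is to confirm the form's actual field names before filling them in. With mechanize you can print each form from `br.forms()`; the same check can be done on raw HTML with the standard library's HTML parser. A minimal sketch (the sample HTML below is hypothetical, reusing the `c1`/`loginpassword` names from the code above):

```python
try:
    from HTMLParser import HTMLParser  # Python 2
except ImportError:
    from html.parser import HTMLParser  # Python 3

class InputNameCollector(HTMLParser):
    """Collect the name attribute of every <input> tag."""
    def __init__(self):
        HTMLParser.__init__(self)
        self.names = []

    def handle_starttag(self, tag, attrs):
        if tag == 'input':
            attrs = dict(attrs)
            if 'name' in attrs:
                self.names.append(attrs['name'])

# Hypothetical login form -- inspect the real page source instead.
sample = '''
<form action="/login" method="post">
  <input type="text" name="c1">
  <input type="password" name="loginpassword">
  <input type="hidden" name="_token" value="abc">
</form>
'''

parser = InputNameCollector()
parser.feed(sample)
print(parser.names)  # ['c1', 'loginpassword', '_token']
```

If the dump reveals a hidden token field like the one above, the site may also expect it to be submitted, which mechanize handles automatically only when you submit the form it parsed.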
Related
I wrote the following script to submit a file here.
The login works fine, and the first submission to a problem works, but I'm not able to submit to the same problem more than once.
import mechanize
import cookielib
from bs4 import BeautifulSoup
import html2text
import urllib2
br = mechanize.Browser()
cj = cookielib.LWPCookieJar()
br.set_cookiejar(cj)
br.set_handle_equiv(True)
br.set_handle_redirect(True)
br.set_handle_referer(True)
br.set_handle_robots(False)
br.set_handle_refresh(mechanize._http.HTTPRefreshProcessor(), max_time=1)
br.addheaders = [('User-agent', 'Chrome')]
# Logging in
br.open('http://codeforces.com/enter')
br.select_form(nr=1)
br.form['handle'] = username
br.form['password'] = password
br.submit()
# Submitting
br.open('http://codeforces.com/problemset/submit')
br.select_form(nr=1)
br.form['submittedProblemCode'] = problemCode
#selecting language
br.form['programTypeId'] = ['42']
br.form.add_file(open("code.cpp"), 'text/plain', "code.cpp")
br.submit()
print br.geturl()
For a successful submission, br.geturl() prints
http://codeforces.com/problemset/status
which is the required page, but for an unsuccessful submission it prints
http://codeforces.com/problemset/submit?csrf_token=/insert token/
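Since a failed submission bounces back to the submit URL with a `csrf_token` query parameter while a success redirects to the status page, the outcome can be detected from `br.geturl()` alone. A sketch of that check (the URL shapes are taken from the two outputs above):

```python
try:
    from urlparse import urlparse, parse_qs  # Python 2
except ImportError:
    from urllib.parse import urlparse, parse_qs  # Python 3

def submission_succeeded(final_url):
    """A successful submit redirects to /problemset/status; a failed one
    lands back on /problemset/submit with a csrf_token query parameter."""
    parsed = urlparse(final_url)
    if parsed.path.endswith('/problemset/status'):
        return True
    return 'csrf_token' not in parse_qs(parsed.query)

print(submission_succeeded('http://codeforces.com/problemset/status'))  # True
print(submission_succeeded('http://codeforces.com/problemset/submit?csrf_token=abc'))  # False
```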
I am trying to use mechanize to scrape a website that requires me to log in. Here is the start of my code.
#!/usr/bin/python
#scrape the admissions part of SAFE
import mechanize
import cookielib
from BeautifulSoup import BeautifulSoup
import html2text
# Browser
br = mechanize.Browser()
# Cookie Jar
cj = cookielib.LWPCookieJar()
br.set_cookiejar(cj)
# Browser options
br.set_handle_equiv(True)
br.set_handle_gzip(True)
br.set_handle_redirect(True)
br.set_handle_referer(True)
br.set_handle_robots(False)
br.set_handle_refresh(mechanize._http.HTTPRefreshProcessor(), max_time=1)
br.addheaders = [('User-agent', 'Chrome')]
# The site we will navigate into, handling its session
br.open('https://url')
# View available forms
for f in br.forms():
    print f
This gives me
<POST https://userstuff application/x-www-form-urlencoded
<HiddenControl(lt=LT-227363-Ja4QpRvdxrbQF0nb7XcR2jQDydH43s) (readonly)>
<HiddenControl(execution=e1s1) (readonly)>
<HiddenControl(_eventId=submit) (readonly)>
<TextControl(username=)>
<PasswordControl(password=)>
<SubmitButtonControl(submit=) (readonly)>
<CheckboxControl(warn=[on])>>
How can I now enter the username and password?
I tried
# Select the first (index zero) form
br.select_form(nr=0)
# User credentials
br.form['username'] = 'username'
br.form['password'] = 'password'
# Login
br.submit()
But that doesn't seem to work.
In the end, this worked for me:
#!/usr/bin/python
#scraper
import mechanize
import cookielib
from BeautifulSoup import BeautifulSoup
import html2text
# Browser
br = mechanize.Browser()
# Cookie Jar
cj = cookielib.LWPCookieJar()
br.set_cookiejar(cj)
# Browser options
br.set_handle_equiv(True)
br.set_handle_gzip(True)
br.set_handle_redirect(True)
br.set_handle_referer(True)
br.set_handle_robots(False)
br.set_handle_refresh(mechanize._http.HTTPRefreshProcessor(), max_time=1)
br.addheaders = [('User-agent', 'Chrome')]
# The site we will navigate into, handling its session
br.open('url1')
# View available forms
for f in br.forms():
    if f.attrs['id'] == 'fm1':
        br.form = f
        break
# User credentials
br.form['username'] = 'username'
br.form['password'] = 'password'
# Login
br.submit()
#Now we need to confirm again
br.open('https://url2')
# Select the first (index zero) form
br.select_form(nr=0)
# Login
br.submit()
print(br.open('https://url2').read())
I'd look at the html form rather than what mechanize gives you. Below is an example of a form I've tried to fill out in the past.
<input type="text" name="user_key" value="">
<input type="password" name="user_password">
Below is the code I use to log into that website using the form above:
# Browser
br = mechanize.Browser()
# Cookie Jar
cj = cookielib.LWPCookieJar()
br.set_cookiejar(cj)
# Browser options
br.set_handle_equiv(True)
br.set_handle_gzip(True)
br.set_handle_redirect(True)
br.set_handle_refresh(False)
br.set_handle_referer(True)
br.set_handle_robots(False)
# Follows refresh 0 but not hangs on refresh > 0
br.set_handle_refresh(mechanize._http.HTTPRefreshProcessor(), max_time=1)
# User-Agent
br.addheaders = [('User-agent', 'Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.9.0.1) Gecko/2008071615 Fedora/3.0.1-1.fc9 Firefox/3.0.1')]
# The site we will navigate into, handling its session
br.open('https://www.website.com/login')
#select the first form
br.select_form(nr=0)
#user credentials
br['user_key'] = 'myusername@gmail.com'
br['user_password'] = 'mypassword'
# Login
br.submit()
link = 'http://www.website.com/url_i_want_to_scrape'
br.open(link)
response = br.response().read()
print response
Your issue could be that you're either choosing the wrong form or giving incorrect field names.
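Whichever form you pick, it also helps to verify programmatically that the login actually took effect before scraping. A minimal sketch of such a check (the "Log out" marker is an assumption; use whatever string appears only for logged-in users on your site):

```python
def looks_logged_in(html, marker='Log out'):
    """Heuristic: a page rendered for a logged-in user usually contains a
    logout link, while the anonymous version shows a login form instead."""
    return marker.lower() in html.lower()

# Hypothetical response snippets
print(looks_logged_in('<a href="/logout">Log out</a>'))     # True
print(looks_logged_in('<form action="/login">...</form>'))  # False
```

Calling a check like this on `br.response().read()` right after `br.submit()` catches a failed login immediately instead of silently scraping the anonymous version of every page.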
I need to login into a website by using mechanize in python and then continue traversing that website using pycurl. So what I need to know is how to transfer a logged-in state established via mechanize into pycurl. I assume it's not just about copying the cookie over. Or is it? Code examples are valued ;)
Why I'm not willing to use pycurl alone:
I have time constraints and my mechanize code worked after 5 minutes of modifying this example as follows:
import mechanize
import cookielib
# Browser
br = mechanize.Browser()
# Cookie Jar
cj = cookielib.LWPCookieJar()
br.set_cookiejar(cj)
# Browser options
br.set_handle_equiv(True)
br.set_handle_gzip(True)
br.set_handle_redirect(True)
br.set_handle_referer(True)
br.set_handle_robots(False)
# Follows refresh 0 but not hangs on refresh > 0
br.set_handle_refresh(mechanize._http.HTTPRefreshProcessor(), max_time=1)
# debugging messages?
#br.set_debug_http(True)
#br.set_debug_redirects(True)
#br.set_debug_responses(True)
# User-Agent (this is cheating, ok?)
br.addheaders = [('User-agent', 'Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.9.0.1) Gecko/2008071615 Fedora/3.0.1-1.fc9 Firefox/3.0.1')]
# Open the site
r = br.open('https://thewebsite.com')
html = r.read()
# Show the source
print html
# or
print br.response().read()
# Show the html title
print br.title()
# Show the response headers
print r.info()
# or
print br.response().info()
# Show the available forms
for f in br.forms():
    print f
# Select the first (index zero) form
br.select_form(nr=0)
# Let's search
br.form['username']='someusername'
br.form['password']='somepwd'
br.submit()
print br.response().read()
# Looking at some results in link format
for l in br.links(url_regex='\.com'):
    print l
Now if I could only transfer the right information from br object to pycurl I would be done.
Why I'm not willing to use mechanize alone:
Mechanize is based on urllib, and urllib is a nightmare. I had too many traumatizing issues with it. I can swallow one or two calls in order to log in, but please no more. In contrast, pycurl has proven for me to be stable, customizable, and fast. From my experience, pycurl is to urllib as Star Trek is to The Flintstones.
PS: In case anyone wonders, I use BeautifulSoup once I have the HTML.
Solved it. Apparently it WAS all about the cookie. Here is my code to get the cookie:
import cookielib
import mechanize
def getNewLoginCookieFromSomeWebsite(username='someusername', pwd='somepwd'):
    """
    Returns a login cookie for somewebsite.com by using mechanize.
    """
    # Browser
    br = mechanize.Browser()
    # Cookie Jar
    cj = cookielib.LWPCookieJar()
    br.set_cookiejar(cj)
    # Browser options
    br.set_handle_equiv(True)
    br.set_handle_gzip(True)
    br.set_handle_redirect(True)
    br.set_handle_referer(True)
    br.set_handle_robots(False)
    # Follows refresh 0 but does not hang on refresh > 0
    br.set_handle_refresh(mechanize._http.HTTPRefreshProcessor(), max_time=1)
    # User-Agent
    br.addheaders = [('User-agent', 'Mozilla/5.0 (X11; Ubuntu; Linux i686; rv:26.0) Gecko/20100101 Firefox/26.0')]
    # Open login site
    response = br.open('https://www.somewebsite.com')
    # Select the first (index zero) form
    br.select_form(nr=0)
    # Enter credentials
    br.form['user'] = username
    br.form['password'] = pwd
    br.submit()
    # Build a "name=value;" string from the cookies collected during login
    cookiestr = ""
    for c in br._ua_handlers['_cookies'].cookiejar:
        cookiestr += c.name + '=' + c.value + ';'
    return cookiestr
To use that cookie with pycurl, all you have to do is add the following before c.perform() is called:
c.setopt(pycurl.COOKIE, getNewLoginCookieFromSomeWebsite("username", "pwd"))
Keep in mind: some websites keep updating the cookie via Set-Cookie headers, and pycurl (unlike mechanize) does not automatically perform any operations on cookies. Pycurl simply receives the string and leaves it to the user to decide what to do with it.
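The string built by the loop above is just the standard `Cookie` request-header syntax: `name=value` pairs joined with semicolons. A standalone sketch of that formatting, decoupled from mechanize (the cookie names here are made up):

```python
def cookie_header(cookies):
    """Join (name, value) pairs into the string pycurl.COOKIE expects."""
    return ''.join('%s=%s;' % (name, value) for name, value in cookies)

# Hypothetical cookies as they might come out of a cookiejar
pairs = [('PHPSESSID', 'abc123'), ('logged_in', 'yes')]
print(cookie_header(pairs))  # PHPSESSID=abc123;logged_in=yes;
```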
I've got a script set to log into a website. The challenge is that I'm running the script on EC2 and the website is asking for me to do additional verification by sending me a custom code.
I receive the email immediately but need to be able to update that field on the fly.
This is the script
import urllib2
import cookielib
import urllib
import requests
import mechanize
from bs4 import BeautifulSoup
# Browser
br = mechanize.Browser()
# Cookie Jar
cj = cookielib.LWPCookieJar()
br.set_cookiejar(cj)
# Browser options
br.set_handle_equiv(True)
br.set_handle_gzip(True)
br.set_handle_redirect(True)
br.set_handle_refresh(False)
br.set_handle_referer(True)
br.set_handle_robots(False)
# Follows refresh 0 but not hangs on refresh > 0
br.set_handle_refresh(mechanize._http.HTTPRefreshProcessor(), max_time=1)
# User-Agent (this is cheating, ok?)
br.addheaders = [('User-agent', 'Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.9.0.1) Gecko/2008071615 Fedora/3.0.1-1.fc9 Firefox/3.0.1')]
# The site we will navigate into, handling its session
br.open('https://www.website.com/login')
#select the first form
br.select_form(nr=0)
#user credentials
br['user_key'] = 'username@gmail.com'
br['user_password'] = 'somePassw0rd'
# Login
br.submit()
#enter verification code
input_var = raw_input("Enter something: ")
#put verification code in form
br['Verication'] = str(input_var)
#submit form
br.submit()
The challenge for me is that I keep getting an error saying:
AttributeError: mechanize._mechanize.Browser instance has no attribute __setitem__ (perhaps you forgot to .select_form()?)
What can I do to make this run as intended?
After you call br.submit(), you go straight into
br['Verication'] = str(input_var)
This is incorrect: calling br.submit() leaves your browser with no form selected anymore.
After submitting, I would try:
for form in br.forms():
    print form
to see if there is another form to be selected.
Read the HTML of the login page and check exactly what happens when you click login. You may have to re-select a form on that same page and then assign the verification code to one of its controls.
I am trying to log into the website http://ogame.us using Python to access the data. After looking around the web to find out how to do this, I settled on using the mechanize module. I think I have the general gist of the code down, but when I submit the HTML form nothing happens. Here's the code:
import sys,os
import mechanize, urllib
import cookielib
from BeautifulSoup import BeautifulSoup,BeautifulStoneSoup,Tag
import datetime, time, socket
import re,sys,os,mechanize,urllib,time, urllib2
br = mechanize.Browser()
cj = cookielib.LWPCookieJar()
br.set_cookiejar(cj)
# Browser options
br.set_handle_equiv(True)
br.set_handle_gzip(True)
br.set_handle_redirect(True)
br.set_handle_referer(True)
br.set_handle_robots(False)
# Follows refresh 0 but not hangs on refresh > 0
br.set_handle_refresh(mechanize._http.HTTPRefreshProcessor(), max_time=1)
# User-Agent (this is cheating, ok?)
br.addheaders = [('User-agent', 'Mozilla/5.0 (Windows; U; Windows NT 6.0; en-US; rv:1.9.0.6)')]
br.open('http://ogame.us')
br.select_form(name = 'loginForm' )
br['login'] = 'stackexample'
br['pass'] = 'examplepassword'
br['uni_url'] = ['uni103.ogame.us']
br.submit()
print br.geturl()
The response from geturl() is the same URL I was at before. Anyone know what is going on?
Try this:
data = br.submit()
html=data.read()
Maybe select the button directly?
response = br.submit(type="submit", id="loginSubmit")
There is a third field (uni) that I was not completing. Everything else was correct.
In the future, with Google Chrome (and probably other browsers) you can view the actual requests sent by the browser by opening Chrome Developer Tools and looking under the Network tab. This saves quite a bit of time.