I've looked through different responses to my question but still haven't managed to get it working :(.
I'm logging in to a site using Python and mechanize; my code looks like this:
import mechanize
import cookielib

br = mechanize.Browser()
# Cookie Jar
cj = cookielib.LWPCookieJar()
br.set_cookiejar(cj)
...
r = br.open('http://...')
html = r.read()
form = br.forms().next()
br.form = form
br.submit()
Submitting the form is not a problem; the problem is that when I call br.open() again to perform a GET request, Python doesn't send the PHPSESSID cookie back (I checked this in Wireshark). Any ideas?
Thanks!
import os
import cookielib, urllib2

ckjar = cookielib.MozillaCookieJar(os.path.join(r'C:\Documents and Settings\tom\Application Data\Mozilla\Firefox\Profiles\h5m61j1i.default', 'cookies.txt'))

req = urllib2.Request(url, postdata, header)  # url, postdata and header defined elsewhere
req.add_header('User-Agent', 'Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)')

opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(ckjar))

f = opener.open(req)
htm = f.read()
f.close()
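If you would rather stay entirely within mechanize, here is a minimal sketch of another option (the file name 'session_cookies.lwp' and the URL are placeholders): save the LWPCookieJar to disk after logging in and reload it before later requests, so the session cookie (e.g. PHPSESSID) is sent again.

import os
import mechanize
import cookielib

COOKIE_FILE = 'session_cookies.lwp'

br = mechanize.Browser()
cj = cookielib.LWPCookieJar()
if os.path.isfile(COOKIE_FILE):
    # Reload cookies saved by an earlier run (including session cookies)
    cj.load(COOKIE_FILE, ignore_discard=True, ignore_expires=True)
br.set_cookiejar(cj)

r = br.open('http://example.com/login')
# ... select and submit the login form here ...

# Persist the cookies so later br.open() calls reuse the same session
cj.save(COOKIE_FILE, ignore_discard=True, ignore_expires=True)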
Related
I need to log in to a website using mechanize in Python and then continue traversing that website using pycurl. So what I need to know is how to transfer the logged-in state established via mechanize into pycurl. I assume it's not just about copying the cookie over. Or is it? Code examples are valued ;)
Why I'm not willing to use pycurl alone:
I have time constraints and my mechanize code worked after 5 minutes of modifying this example as follows:
import mechanize
import cookielib
# Browser
br = mechanize.Browser()
# Cookie Jar
cj = cookielib.LWPCookieJar()
br.set_cookiejar(cj)
# Browser options
br.set_handle_equiv(True)
br.set_handle_gzip(True)
br.set_handle_redirect(True)
br.set_handle_referer(True)
br.set_handle_robots(False)
# Follows refresh 0 but not hangs on refresh > 0
br.set_handle_refresh(mechanize._http.HTTPRefreshProcessor(), max_time=1)
# debugging messages?
#br.set_debug_http(True)
#br.set_debug_redirects(True)
#br.set_debug_responses(True)
# User-Agent (this is cheating, ok?)
br.addheaders = [('User-agent', 'Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.9.0.1) Gecko/2008071615 Fedora/3.0.1-1.fc9 Firefox/3.0.1')]
# Open the site
r = br.open('https://thewebsite.com')
html = r.read()
# Show the source
print html
# or
print br.response().read()
# Show the html title
print br.title()
# Show the response headers
print r.info()
# or
print br.response().info()
# Show the available forms
for f in br.forms():
    print f
# Select the first (index zero) form
br.select_form(nr=0)
# Let's search
br.form['username']='someusername'
br.form['password']='somepwd'
br.submit()
print br.response().read()
# Looking at some results in link format
for l in br.links(url_regex=r'\.com'):
    print l
Now if I could only transfer the right information from br object to pycurl I would be done.
Why I'm not willing to use mechanize alone:
Mechanize is based on urllib, and urllib is a nightmare; I've had too many traumatizing issues with it. I can swallow one or two calls in order to log in, but please no more. In contrast, pycurl has proven stable, customizable and fast for me. From my experience, pycurl is to urllib what Star Trek is to The Flintstones.
PS: In case anyone wonders, I use BeautifulSoup once I have the html
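For completeness, a minimal sketch of that BeautifulSoup step (the tag and attribute used here are purely illustrative, not taken from any real page):

from BeautifulSoup import BeautifulSoup  # BeautifulSoup 3; for bs4: from bs4 import BeautifulSoup

soup = BeautifulSoup(html)
# e.g. pull all link targets out of the page fetched with mechanize
for a in soup.findAll('a', href=True):
    print a['href']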
Solved it. Apparently it WAS all about the cookie. Here is my code to get the cookie:
import cookielib
import mechanize
def getNewLoginCookieFromSomeWebsite(username='someusername', pwd='somepwd'):
    """
    Returns a login cookie string for somewebsite.com by using mechanize.
    """
    # Browser
    br = mechanize.Browser()
    # Cookie Jar
    cj = cookielib.LWPCookieJar()
    br.set_cookiejar(cj)
    # Browser options
    br.set_handle_equiv(True)
    br.set_handle_gzip(True)
    br.set_handle_redirect(True)
    br.set_handle_referer(True)
    br.set_handle_robots(False)
    # Follows refresh 0 but does not hang on refresh > 0
    br.set_handle_refresh(mechanize._http.HTTPRefreshProcessor(), max_time=1)
    # User-Agent
    br.addheaders = [('User-agent', 'Mozilla/5.0 (X11; Ubuntu; Linux i686; rv:26.0) Gecko/20100101 Firefox/26.0')]
    # Open login site
    response = br.open('https://www.somewebsite.com')
    # Select the first (index zero) form
    br.select_form(nr=0)
    # Enter credentials
    br.form['user'] = username
    br.form['password'] = pwd
    br.submit()
    cookiestr = ""
    for c in br._ua_handlers['_cookies'].cookiejar:
        cookiestr += c.name + '=' + c.value + ';'
    return cookiestr
In order to use that cookie with pycurl, all you have to do is add the following line before c.perform() is called:
c.setopt(pycurl.COOKIE, getNewLoginCookieFromSomeWebsite("username", "pwd"))
Keep in mind: some websites may keep updating the cookie via Set-Cookie headers, and pycurl (unlike mechanize) does not automatically do anything with cookies. pycurl simply receives the string and leaves it to the user to decide what to do with it.
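If the site does keep refreshing the session via Set-Cookie, one option (a sketch, assuming the placeholder URL and cookie-file name are adjusted to your case) is to also switch on libcurl's own cookie engine, so pycurl tracks cookie updates by itself on top of the seed cookie from mechanize:

import StringIO
import pycurl

buf = StringIO.StringIO()
c = pycurl.Curl()
c.setopt(pycurl.URL, 'https://www.somewebsite.com/some/protected/page')
# Seed the request with the login cookie obtained through mechanize
c.setopt(pycurl.COOKIE, getNewLoginCookieFromSomeWebsite('username', 'pwd'))
# An empty COOKIEFILE enables libcurl's cookie engine, so Set-Cookie
# headers from responses are honoured on later requests with this handle
c.setopt(pycurl.COOKIEFILE, '')
# Optionally write all known cookies to disk when the handle is closed
c.setopt(pycurl.COOKIEJAR, 'pycurl_cookies.txt')
c.setopt(pycurl.WRITEFUNCTION, buf.write)
c.perform()
c.close()
print buf.getvalue()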
I'm trying to login with mechanize, get the session cookie, then load a protected page but mechanize doesn't seem to be saving or re-using the session. When I try to load the protected resource I get redirected to the login page. Can anyone see what I'm doing wrong from the code below?
import mechanize
import urllib
import Cookie
import cookielib
cookiejar=cookielib.LWPCookieJar()
br = mechanize.Browser()
br.set_cookiejar(cookiejar)
br.set_handle_robots(False)
br.addheaders = [('User-agent', 'Mozilla/5.0 Compatible')]
br.set_cookiejar(cookiejar)
params = {'email_address': 'name@company.com', 'password': 'password'}
data = urllib.urlencode(params)
request = mechanize.Request('/myLoginPage', data=data)
response = br.open(request)
html = response.read()
request = mechanize.Request('/myProtectedPage')
response = br.open(request)
At this point, response is not the data from the protected resource; it's a redirect to the login page.
I want to send a POST request to the page after opening it in Python (using urllib2.urlopen). The webpage is http://wireless.walmart.com/content/shop-plans/?r=wm
The code I am using right now is:
import urllib
import urllib2

url = 'http://wireless.walmart.com/content/shop-plans/?r=wm'
user_agent = 'Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; Trident/5.0)'
values = {'carrierID': '68',
          'conditionToType': '1',
          'cssPrepend': 'wm20',
          'partnerID': '36575'}
headers = {'User-Agent': user_agent}

data = urllib.urlencode(values)
req = urllib2.Request(url, data, headers)
response = urllib2.urlopen(req)
page = response.read()

walmart = open('Walmart_ContractPlans_ATT.html', 'wb')
walmart.write(page)
walmart.close()
This gives me the page that opens by default. After inspecting the page with Firebug, I found that carrierID:68 is sent when I click the button that triggers this POST request.
I want to simulate this browser behaviour.
Please help me resolve this.
For web scraping I prefer to use requests and pyquery. First you download the data:
import requests
from pyquery import PyQuery as pq
url = 'http://wireless.walmart.com/content/getRatePlanInfo'
payload = {'carrierID':68, 'conditionToType':1, 'cssPrepend':'wm20'}
r = requests.post(url, data=payload)
d = pq(r.text)
After this you proceed to parse the elements, for example to extract all plans:
plans = []
plans_selector = '.wm20_planspage_planDetails_sub_detailsDiv_ul_li'
d(plans_selector).each(lambda i, n: plans.append(pq(n).text()))
Result:
['Basic 200',
'Simply Everything',
'Everything Data 900',
'Everything Data 450',
'Talk 450',
...
I recommend looking at a browser emulator like mechanize, rather than trying to do this with raw HTTP requests.
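A rough sketch of that approach (the form index and the carrierID control are guesses; print the forms first to find the real names and fields):

import mechanize

br = mechanize.Browser()
br.set_handle_robots(False)
br.addheaders = [('User-agent', 'Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; Trident/5.0)')]
br.open('http://wireless.walmart.com/content/shop-plans/?r=wm')

# Print every form so you can see the real names and hidden fields
for f in br.forms():
    print f

# Hypothetical: select the plan form and set the carrier before submitting
br.select_form(nr=0)
br.form.set_all_readonly(False)  # allow changing hidden controls such as carrierID
br.form['carrierID'] = '68'
response = br.submit()
print response.read()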
I'm trying to fix a program that logs in to my MU account and retrieves some data....
I don't know what I'm doing wrong. Here's the code:
#!/usr/bin/env python
import urllib, urllib2, cookielib
username = 'username'
password = 'password'
cj = cookielib.CookieJar()
opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(cj))
login_data = urllib.urlencode({'username' : username, 'password' : password})
opener.open('http://megaupload.com/index.php?c=login', login_data)
resp = opener.open('http://www.megaupload.com/index.php?c=filemanager')
print resp.read()
Thx for any answer!
You can simulate filling in the form.
For that you can use the mechanize library, which is based on the Perl module WWW::Mechanize.
#!/usr/bin/env python
import urllib, urllib2, cookielib, mechanize
username = 'username'
password = 'password'
br = mechanize.Browser()
cj = cookielib.CookieJar()
br.set_cookiejar(cj)
br.set_handle_robots(False)
br.addheaders = [('User-agent', 'Mozilla/5.0 (Windows; U; Windows NT 6.1; fr; rv:1.9.2) Gecko/20100115 Firefox/3.6')]
br.open('http://www.megaupload.com/?c=login')
br.select_form('loginfrm')
br.form['username'] = username
br.form['password'] = password
br.submit()
resp = br.open('http://www.megaupload.com/index.php?c=filemanager')
print resp.read()
See Use mechanize to log into megaupload
Okay, I just implemented it myself and it seems you just forgot one value. That's why I always use TamperData or something similar to check what my browser actually sends to the server: way easier and shorter than digging through the HTML.
Anyway, just add 'redir': 1 to your dict and it'll work:
import http.cookiejar
import urllib.parse
import urllib.request

if __name__ == '__main__':
    username = 'username'
    password = 'password'
    cj = http.cookiejar.CookieJar()
    opener = urllib.request.build_opener(urllib.request.HTTPCookieProcessor(cj))
    login_data = urllib.parse.urlencode({'username': username, 'password': password,
                                         'login': 1, 'redir': 1}).encode('UTF-8')
    response = opener.open("http://www.megaupload.com/?c=login", login_data)
    with open("test.txt", "w") as file:
        file.write(response.read().decode("UTF-8"))  # so we can compare the resulting html easily
Although I must say I'll have a look at mechanize and co. now; I do something like this often enough that it could be quite worthwhile. Still, I can't stress enough that the most important help is a browser plugin that lets you check the data actually being sent ;)
You might have more luck with mechanize or twill, which are designed to streamline these kinds of processes. Otherwise, I think your opener is missing at least one important component: something to process cookies. Here's a bit of code I have lying around from the last time I did this:
import cookielib
import urllib2

# build opener with HTTPCookieProcessor
cookie_jar = cookielib.MozillaCookieJar('tasks.cookies')
o = urllib2.build_opener(
    urllib2.HTTPRedirectHandler(),
    urllib2.HTTPHandler(debuglevel=0),
    urllib2.HTTPSHandler(debuglevel=0),
    urllib2.HTTPCookieProcessor(cookie_jar)
)
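A sketch of how that opener could then be used for the login POST and a follow-up request (URLs and field names are placeholders):

import urllib

login_data = urllib.urlencode({'username': 'someusername', 'password': 'somepwd'})
# The POST performs the login; the HTTPCookieProcessor stores the session cookie
o.open('http://www.example.com/login', login_data).read()

# Later requests through the same opener send the stored cookie automatically
print o.open('http://www.example.com/members-only').read()

# Optionally persist the cookies (MozillaCookieJar writes to 'tasks.cookies')
cookie_jar.save(ignore_discard=True, ignore_expires=True)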
My guess is to add the c=login name/value pair to login_data rather than including it directly in the URL.
You're probably also breaking a TOS/EULA, but I can't say I care that much.
I am using this code:
import urllib2

def req(url, postfields):
    proxy_support = urllib2.ProxyHandler({"http": "127.0.0.1:8118"})
    opener = urllib2.build_opener(proxy_support)
    opener.addheaders = [('User-agent', 'Mozilla/5.0')]
    return opener.open(url).read()
To make a simple HTTP GET request (using Tor as a proxy).
Now I would like to know how to make multiple requests using the same cookie.
For example:
req('http://loginpage', 'postfields')
source = req('http://pageforloggedinonly', 0)
#do stuff with source
req('http://anotherpageforloggedinonly', 'StuffFromSource')
I know that my req function doesn't support POST (yet), but I have sent postfields using httplib before, so I guess I can figure that out myself. What I don't understand is how to use cookies. The examples I've seen are all single requests only; I want to reuse the cookie from the first login request in the succeeding requests, or save/load the cookie from a file (like curl does), which would make everything easier.
The code I posted is only to illustrate what I am trying to achieve; I think I will use httplib(2) for the final app.
UPDATE:
cookielib.LWPCookieJar worked fine; here's a sample I did for testing:
import urllib2, cookielib, os

def request(url, postfields, cookie):
    urlopen = urllib2.urlopen
    cj = cookielib.LWPCookieJar()
    Request = urllib2.Request
    if os.path.isfile(cookie):
        cj.load(cookie)
    opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(cj))
    urllib2.install_opener(opener)
    txheaders = {'User-agent': 'Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)'}
    req = Request(url, postfields, txheaders)
    handle = urlopen(req)
    cj.save(cookie)
    return handle.read()

print request('http://google.com', None, 'cookie.txt')
The cookielib module is what you need to do this. There's a nice tutorial with some code samples.
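To give the idea in a few lines, here is a minimal sketch (the URLs and form fields are placeholders, and the Tor proxy from the question is kept) of a cookie-aware opener that is reused across requests, so the session cookie from the login is sent on every following request:

import urllib, urllib2, cookielib

cj = cookielib.LWPCookieJar()
opener = urllib2.build_opener(
    urllib2.ProxyHandler({'http': '127.0.0.1:8118'}),  # Tor proxy, as in the question
    urllib2.HTTPCookieProcessor(cj),
)
opener.addheaders = [('User-agent', 'Mozilla/5.0')]

# POST to the login page; the cookie jar captures the session cookie
opener.open('http://loginpage', urllib.urlencode({'user': 'u', 'pass': 'p'})).read()

# The same opener now sends the stored cookie on later requests
source = opener.open('http://pageforloggedinonly').read()

# Save/load the jar to reuse the session across runs, much like curl's cookie file
cj.save('cookie.txt', ignore_discard=True)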