Python - urllib2 & cookielib

Python - urllib2 & cookielib - python

I am trying to open the following website and retrieve the initial cookie and use it for the second url-open BUT if you run the following code it outputs 2 different cookies. How do I use the initial cookie for the second url-open?
import cookielib, urllib2
cj = cookielib.CookieJar()
opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(cj))
home = opener.open('https://www.idcourts.us/repository/start.do')
print cj
search = opener.open('https://www.idcourts.us/repository/partySearch.do')
print cj
Output shows 2 different cookies every time as you can see:
<cookielib.CookieJar[<Cookie JSESSIONID=0DEEE8331DE7D0DFDC22E860E065085F for www.idcourts.us/repository>]>
<cookielib.CookieJar[<Cookie JSESSIONID=E01C2BE8323632A32DA467F8A9B22A51 for www.idcourts.us/repository>]>

This is not a problem with urllib. That site does some funky stuff. You need to request a couple of stylesheets for it to validate your session id:
import cookielib, urllib2
cj = cookielib.CookieJar()
opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(cj))
# default User-Agent ('Python-urllib/2.6') will *not* work
opener.addheaders = [
('User-Agent', 'Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10.6; en-US; rv:1.9.2.11) Gecko/20101012 Firefox/3.6.11'),
]
stylesheets = [
'https://www.idcourts.us/repository/css/id_style.css',
'https://www.idcourts.us/repository/css/id_print.css',
]
home = opener.open('https://www.idcourts.us/repository/start.do')
print cj
sessid = cj._cookies['www.idcourts.us']['/repository']['JSESSIONID'].value
# Note the +=
opener.addheaders += [
('Referer', 'https://www.idcourts.us/repository/start.do'),
]
for st in stylesheets:
# da trick
opener.open(st+';jsessionid='+sessid)
search = opener.open('https://www.idcourts.us/repository/partySearch.do')
print cj
# perhaps need to keep updating the referer...

Not an actual answer (but far too long for a comment); possibly useful to anyone else trying to answer this.
Despite my best attempts, I can't figure this out.
Looking in Firebug, the cookie seems to remain the same (works properly) for Firefox.
I added urllib2.HTTPSHandler(debuglevel=1) to debug what headers Python is sending, and it does appear to resend the cookie.
I also added all the Firefox request headers to see if that would help (it didn't):
opener.addheaders = [
('User-Agent', 'Mozilla/5.0 (Windows; U; Windows NT 6.1; en-GB; rv:1.9.2.13) Gecko/20101203 Firefox/3.6.13'),
..
]
My test code:
import cookielib, urllib2
cj = cookielib.CookieJar()
opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(cj), urllib2.HTTPSHandler(debuglevel=1))
opener.addheaders = [
('User-Agent', 'Mozilla/5.0 (Windows; U; Windows NT 6.1; en-GB; rv:1.9.2.13) Gecko/20101203 Firefox/3.6.13'),
('Accept', 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8'),
('Accept-Language', 'en-gb,en;q=0.5'),
('Accept-Encoding', 'gzip,deflate'),
('Accept-Charset', 'ISO-8859-1,utf-8;q=0.7,*;q=0.7'),
('Keep-Alive', '115'),
('Connection', 'keep-alive'),
('Cache-Control', 'max-age=0'),
('Referer', 'https://www.idcourts.us/repository/partySearch.do'),
]
home = opener.open('https://www.idcourts.us/repository/start.do')
print cj
search = opener.open('https://www.idcourts.us/repository/partySearch.do')
print cj
I feel like I'm missing something obvious.

I think, it is a problem with the server it is Setting a new cookie for each request.

Related

proper way of implementing a user agent into a urllib.request.build_opener

I'm trying to set the user agent for my urllib request:
opener = urllib.request.build_opener(
urllib.request.HTTPCookieProcessor(cj),
urllib.request.HTTPRedirectHandler(),
urllib.request.ProxyHandler({'http': proxy})
)
and finally:
response3 = opener.open("https://www.google.com:443/search?q=test", timeout=timeout_value).read().decode("utf-8")
What would be the best way to set the user-agent header to
Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/35.0.1916.47 Safari/537.36

With urllib we have two options, as far as I know.
build_opener returns a OpenerDirector object, which has an addheaders attribute. We can change the user-agent and other headers with that attribute.
opener.addheaders = [('User-Agent', 'My User-Agent')]
url = 'http://httpbin.org/user-agent'
r = opener.open(url, timeout=5)
text = r.read().decode("utf-8")
Alternatively, we can install the OpenerDirector object to the global opener with install_opener and use urlopen to submit the request. Now can use Request to set the headers.
urllib.request.install_opener(opener)
url = 'http://httpbin.org/user-agent'
headers = {'user-agent': "My User-Agent"}
req = urllib.request.Request(url, headers=headers)
r = urllib.request.urlopen(req, timeout=5)
text = r.read().decode("utf-8")
Personally, I prefer the second method because it is more consistent. Once we install the opener all requests will have the same handlers, and we can continue using urllib the same way. However, if you don't want to use those handlers for all requests you should choose the first method and use addheaders to set headers for a specific OpenerDirector object.
With requests things are simpler.
We can use the session.heders attribute if we want to change the user-agent or other headers for all requests,
s = requests.session()
s.headers['user-agent'] = "My User-Agent"
r = s.get(url, timeout=5)
or use the headers parameter if we want to set headers for a specific request only.
headers = {'user-agent': "My User-Agent"}
r = requests.get(url, headers=headers, timeout=5)

Fetching website html using python

I am very new to Python. I had a working code to get an html page and parse text from it but it stopped working recently. Perhaps the website changed,but I am not able to retrieve the data anymore.
Any help is appreciated. Here is the code that used to work before.
from cookielib import CookieJar
from urllib2 import build_opener, HTTPCookieProcessor
from lxml import etree
user_agent = 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/66.0.3359.139 Safari/537.36'
url = 'https://www.nasdaq.com/markets/stocks/symbol-change-history.aspx?sortby=EFFECTIVE&d'
cj = CookieJar()
opener = build_opener(HTTPCookieProcessor(cj))
opener.addheaders = [('User-Agent', user_agent),('Accept','text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8')]
page = opener.open(url, timeout=2)
parser = etree.HTMLParser()
rootDOM = etree.parse(page, parser)
html = etree.tostring(rootDOM.getroot(), pretty_print=True, method='html')
I get error SSLError: ('The read operation timed out',)

Consecutive requests with python Requests.Session() not working

I was trying to do this,
import requests
s=requests.Session()
login_data = dict(userName='user', password='pwd')
ra=s.post('http://example/checklogin.php', data=login_data)
print ra.content
print ra.headers
ans = dict(answer='5')
f=s.cookies
r=s.post('http://example/level1.php',data=ans,cookies=f)
print r.content
But the second post request returns a 404 error, can someone help me why ?

In the latest version of requests, the sessions object comes equipped with Cookie Persistence, look at the requests Sessions ojbects docs.
So you don't need add the cookie artificially.
Just
import requests
s=requests.Session()
login_data = dict(userName='user', password='pwd')
ra=s.post('http://example/checklogin.php', data=login_data)
print ra.content
print ra.headers
ans = dict(answer='5')
r=s.post('http://example/level1.php',data=ans)
print r.content
Just print the cookie to look up wheather you were logged.
for cookie in s.cookies:
print (cookie.name, cookie.value)
And is the example site is yours?
If not maybe the site reject the bot/crawler !
And you can change your requests's user-agent as looks likes you are using a browser.
For example:
import requests
s=requests.Session()
headers = {
'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/29.0.1547.62 Safari/537.36'
}
login_data = dict(userName='user', password='pwd')
ra=s.post('http://example/checklogin.php', data=login_data, headers=headers)
print ra.content
print ra.headers
ans = dict(answer='5')
r=s.post('http://example/level1.php',data=ans, headers = headers)
print r.content

Why isn't my Python program working? It uses HTTP Headers

EDIT: I changed the code and it still doesn't work! I used the links from the answer to do it but it didn't work!
Why does this not work? When I run it takes a long time to run and never finishes!
import urllib
import urllib2
url = 'https://www.locationary.com/index.jsp?ACTION_TOKEN=tile_loginBar_jsp$JspView$LoginAction'
values = {'inUserName' : 'USER',
'inUserPass' : 'PASSWORD'}
data = urllib.urlencode(values)
req = urllib2.Request(url, data)
req.add_header('Host', 'www.locationary.com')
req.add_header('User-Agent', 'Mozilla/5.0 (Windows NT 6.1; rv:8.0) Gecko/20100101 Firefox/8.0')
req.add_header('Accept', 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8')
req.add_header('Accept-Language', 'en-us,en;q=0.5')
req.add_header('Accept-Encoding','gzip, deflate')
req.add_header('Accept-Charset','ISO-8859-1,utf-8;q=0.7,*;q=0.7')
req.add_header('Connection','keep-alive')
req.add_header('Referer','http://www.locationary.com/')
req.add_header('Cookie','site_version=REGULAR; __utma=47547066.1079503560.1321924193.1322707232.1324693472.36; __utmz=47547066.1321924193.1.1.utmcsr=(direct)|utmccn=(direct)|utmcmd=(none); nickname=jacob501; locaCountry=1033; locaState=1795; locaCity=Montreal; jforumUserId=1; PMS=1; TurnOFfTips=true; Locacookie=enable; __utma=47547066.1079503560.1321924193.1322707232.1324693472.36; __utmz=47547066.1321924193.1.1.utmcsr=(direct)|utmccn=(direct)|utmcmd=(none); nickname=jacob501; PMS=1; __utmb=47547066.15.10.1324693472; __utmc=47547066; JSESSIONID=DC7F5AB08264A51FBCDB836393CB16E7; PSESSIONID=28b334905ab6305f7a7fe051e83857bc280af1a9; __utmc=47547066; __utmb=47547066.15.10.1324693472; ACTION_RESULT_CODE=ACTION_RESULT_FAIL; ACTION_ERROR_TEXT=java.lang.NullPointerException')
req.add_header('Content-Type','application/x-www-form-urlencoded')
#user_agent = 'Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)'
#headers = { 'User-Agent' : user_agent }
response = urllib2.urlopen(req)
page = response.read()
print page

The remote server (the one at www.locationary.com) is waiting for the content of your HTTP post request, based on the Content-Type and Content-Length headers. Since you're never actually sending said awaited data, the remote server waits — and so does read() — until you do so.
I need to know how to send the content of my http post request.
Well, you need to actually send some data in the request. See:
urllib2 - The Missing Manual
How do I send a HTTP POST value to a (PHP) page using Python?
Final, "working" version:
import urllib
import urllib2
url = 'https://www.locationary.com/index.jsp?ACTION_TOKEN=tile_loginBar_jsp$JspView$LoginAction'
values = {'inUserName' : 'USER',
'inUserPass' : 'PASSWORD'}
data = urllib.urlencode(values)
req = urllib2.Request(url, data)
req.add_header('Host', 'www.locationary.com')
req.add_header('User-Agent', 'Mozilla/5.0 (Windows NT 6.1; rv:8.0) Gecko/20100101 Firefox/8.0')
req.add_header('Accept', 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8')
req.add_header('Accept-Language', 'en-us,en;q=0.5')
req.add_header('Accept-Charset','ISO-8859-1,utf-8;q=0.7,*;q=0.7')
req.add_header('Connection','keep-alive')
req.add_header('Referer','http://www.locationary.com/')
req.add_header('Cookie','site_version=REGULAR; __utma=47547066.1079503560.1321924193.1322707232.1324693472.36; __utmz=47547066.1321924193.1.1.utmcsr=(direct)|utmccn=(direct)|utmcmd=(none); nickname=jacob501; locaCountry=1033; locaState=1795; locaCity=Montreal; jforumUserId=1; PMS=1; TurnOFfTips=true; Locacookie=enable; __utma=47547066.1079503560.1321924193.1322707232.1324693472.36; __utmz=47547066.1321924193.1.1.utmcsr=(direct)|utmccn=(direct)|utmcmd=(none); nickname=jacob501; PMS=1; __utmb=47547066.15.10.1324693472; __utmc=47547066; JSESSIONID=DC7F5AB08264A51FBCDB836393CB16E7; PSESSIONID=28b334905ab6305f7a7fe051e83857bc280af1a9; __utmc=47547066; __utmb=47547066.15.10.1324693472; ACTION_RESULT_CODE=ACTION_RESULT_FAIL; ACTION_ERROR_TEXT=java.lang.NullPointerException')
req.add_header('Content-Type','application/x-www-form-urlencoded')
response = urllib2.urlopen(req)
page = response.read()
print page
Don't explicitly set the Content-Length header
Remove the req.add_header('Accept-Encoding','gzip, deflate') line, so that the response doesn't have to be decompressed (or — exercise left to the reader — ungzip it yourself)

Changing user agent on urllib2.urlopen

How can I download a webpage with a user agent other than the default one on urllib2.urlopen?

I answered a similar question a couple weeks ago.
There is example code in that question, but basically you can do something like this: (Note the capitalization of User-Agent as of RFC 2616, section 14.43.)
opener = urllib2.build_opener()
opener.addheaders = [('User-Agent', 'Mozilla/5.0')]
response = opener.open('http://www.stackoverflow.com')

headers = { 'User-Agent' : 'Mozilla/5.0' }
req = urllib2.Request('www.example.com', None, headers)
html = urllib2.urlopen(req).read()
Or, a bit shorter:
req = urllib2.Request('www.example.com', headers={ 'User-Agent': 'Mozilla/5.0' })
html = urllib2.urlopen(req).read()

Setting the User-Agent from everyone's favorite Dive Into Python.
The short story: You can use Request.add_header to do this.
You can also pass the headers as a dictionary when creating the Request itself, as the docs note:
headers should be a dictionary, and will be treated as if add_header() was called with each key and value as arguments. This is often used to “spoof” the User-Agent header, which is used by a browser to identify itself – some HTTP servers only allow requests coming from common browsers as opposed to scripts. For example, Mozilla Firefox may identify itself as "Mozilla/5.0 (X11; U; Linux i686) Gecko/20071127 Firefox/2.0.0.11", while urllib2‘s default user agent string is "Python-urllib/2.6" (on Python 2.6).

For python 3, urllib is split into 3 modules...
import urllib.request
req = urllib.request.Request(url="http://localhost/", headers={'User-Agent':' Mozilla/5.0 (Windows NT 6.1; WOW64; rv:12.0) Gecko/20100101 Firefox/12.0'})
handler = urllib.request.urlopen(req)

All these should work in theory, but (with Python 2.7.2 on Windows at least) any time you send a custom User-agent header, urllib2 doesn't send that header. If you don't try to send a User-agent header, it sends the default Python / urllib2
None of these methods seem to work for adding User-agent but they work for other headers:
opener = urllib2.build_opener(proxy)
opener.addheaders = {'User-agent':'Custom user agent'}
urllib2.install_opener(opener)
request = urllib2.Request(url, headers={'User-agent':'Custom user agent'})
request.headers['User-agent'] = 'Custom user agent'
request.add_header('User-agent', 'Custom user agent')

For urllib you can use:
from urllib import FancyURLopener
class MyOpener(FancyURLopener, object):
version = 'Mozilla/5.0 (Windows; U; Windows NT 5.1; it; rv:1.8.1.11) Gecko/20071127 Firefox/2.0.0.11'
myopener = MyOpener()
myopener.retrieve('https://www.google.com/search?q=test', 'useragent.html')

Another solution in urllib2 and Python 2.7:
req = urllib2.Request('http://www.example.com/')
req.add_unredirected_header('User-Agent', 'Custom User-Agent')
urllib2.urlopen(req)

there are two properties of urllib.URLopener() namely:
addheaders = [('User-Agent', 'Python-urllib/1.17'), ('Accept', '*/*')] and
version = 'Python-urllib/1.17'.
To fool the website you need to changes both of these values to an accepted User-Agent. for e.g.
Chrome browser : 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/33.0.1750.149 Safari/537.36'
Google bot : 'Googlebot/2.1'
like this
import urllib
page_extractor=urllib.URLopener()
page_extractor.addheaders = [('User-Agent', 'Googlebot/2.1'), ('Accept', '*/*')]
page_extractor.version = 'Googlebot/2.1'
page_extractor.retrieve(<url>, <file_path>)
changing just one property does not work because the website marks it as a suspicious request.

Try this :
html_source_code = requests.get("http://www.example.com/",
headers={'User-Agent':'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/44.0.2403.107 Safari/537.36',
'Upgrade-Insecure-Requests': '1',
'x-runtime': '148ms'},
allow_redirects=True).content

Develop Reference

Python is a programming language that lets you work quickly and integrate systems more effectively.

Python - urllib2 & cookielib - python

I think, it is a problem with the server it is Setting a new cookie for each request.

Related

proper way of implementing a user agent into a urllib.request.build_opener

Fetching website html using python

Consecutive requests with python Requests.Session() not working

Why isn't my Python program working? It uses HTTP Headers

Changing user agent on urllib2.urlopen

Categories

Resources