Script to connect to a web page - Python

Looking for a Python script that would simply connect to a web page (maybe with some querystring parameters).
I am going to run this script as a batch job on Unix.

urllib2 will do what you want and it's pretty simple to use.
import urllib
import urllib2
params = {'param1': 'value1'}
# passing data as the second argument makes this a POST;
# for a GET, append "?" + urllib.urlencode(params) to the URL instead
req = urllib2.Request("http://someurl", urllib.urlencode(params))
res = urllib2.urlopen(req)
data = res.read()
It's also nice because it's easy to modify the above code to do all sorts of other things like POST requests, Basic Authentication, etc.
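For example, here is a minimal Basic Authentication sketch with urllib2 (the URL and credentials are placeholders):
import urllib2
# install an opener that sends Basic Auth credentials (example values)
password_mgr = urllib2.HTTPPasswordMgrWithDefaultRealm()
password_mgr.add_password(None, "http://someurl", "user", "password")
opener = urllib2.build_opener(urllib2.HTTPBasicAuthHandler(password_mgr))
data = opener.open("http://someurl").read()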

Try this:
import urllib2
aResp = urllib2.urlopen("http://google.com/")
print aResp.read()

If you need your script to actually function as a user of the site (clicking links, etc.), then you're probably looking for the Python mechanize library.
Python Mechanize
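A minimal mechanize sketch (the URL and link text are placeholders):
import mechanize
br = mechanize.Browser()
br.set_handle_robots(False)  # mechanize obeys robots.txt by default
br.open("http://someurl")
response = br.follow_link(text_regex="Next")  # follow the first link whose text matches
print(response.read())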

A simple wget called from a shell script might suffice.
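For example (quote the URL so the shell doesn't interpret the & in the query string):
wget -O output.html "http://www.example.com/?param1=value1&param2=value2"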

In Python 2.7:
import urllib2
params = "key=val&key2=val2"  # make sure it's in query-string format
url = "http://www.example.com"
html = urllib2.urlopen(url + "?" + params).read()
print html
More info at https://docs.python.org/2.7/library/urllib2.html
In Python 3.6:
from urllib.request import urlopen
params = "key=val&key2=val2"  # make sure it's in query-string format
url = "http://www.example.com"
html = urlopen(url + "?" + params).read()
print(html)
More info at https://docs.python.org/3.6/library/urllib.request.html
To encode params into query-string format:
def myEncode(dictionary):
    result = ""
    for k in dictionary:  # k is the key
        result += k + "=" + dictionary[k] + "&"
    return result[:-1]  # all but the last `&`
I'm pretty sure this should work in either Python 2 or Python 3...
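A quick usage sketch (note the standard library already provides this via urllib.urlencode in Python 2 / urllib.parse.urlencode in Python 3, which also percent-encodes the values, unlike this helper):
params = myEncode({'key': 'val', 'key2': 'val2'})
print(params)  # key=val&key2=val2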

What are you trying to do? If you're just trying to fetch a web page, cURL is a pre-existing (and very common) tool that does exactly that.
Basic usage is very simple:
curl www.example.com

You might want to simply use httplib from the standard library.
import httplib
myConnection = httplib.HTTPConnection('www.example.com')  # pass just the host, not a full URL
You can find the official reference here: http://docs.python.org/library/httplib.html
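Note that constructing the connection doesn't fetch anything by itself; a fuller GET with httplib looks roughly like this:
import httplib
myConnection = httplib.HTTPConnection('www.example.com')
myConnection.request('GET', '/?param1=value1')  # path plus query string
response = myConnection.getresponse()
print response.status, response.reason
data = response.read()
myConnection.close()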

Related

Convert urllib2 python code to use urllib module

I have the following code below which runs using the urllib2 module, but I have a requirement to upgrade to Python 3.x and this prevents the use of urllib2. I am aware it is split across urllib.request and urllib.error, but I am struggling to convert the following code to use the urllib module instead after reading through the doc and other relevant questions. Any help is greatly appreciated.
opener = urllib2.build_opener(urllib2.HTTPHandler)
request = urllib2.Request(url=event['ResponseURL'], data=data)
request.add_header('Content-Type', '')
request.get_method = lambda: 'PUT'
url = opener.open(request)
All you need to do is replace urllib2 with urllib.request. You are not using anything that has moved to other urllib.* modules:
import urllib.request
opener = urllib.request.build_opener(urllib.request.HTTPHandler)
request = urllib.request.Request(url=event['ResponseURL'], data=data)
request.add_header('Content-Type', '')
request.get_method = lambda: 'PUT'
url = opener.open(request)
You can always run the 2to3 command-line tool on your Python 2 code and see what changes it makes; the default action is to output changes on stdout in unified diff format.
The urllib fixer will then also add imports for urllib.error and urllib.parse at the top, because it knows that code that imported urllib2 could need any of the 3 urllib.* modules; it isn't smart enough to limit the import only to those that are actually needed after transforming the rest of the urllib2 references in the module.
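For example, assuming the code lives in a file called myscript.py (the filename is just a placeholder):
2to3 myscript.py       # print the proposed changes as a unified diff
2to3 -w myscript.py    # rewrite the file in place (a .bak backup is kept)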

python requests with cookielib equivalent to wget with --load-cookies [duplicate]

I'm trying to access an authenticated site using a cookies.txt file (generated with a Chrome extension) with Python Requests:
import requests, cookielib
cj = cookielib.MozillaCookieJar('cookies.txt')
cj.load()
r = requests.get(url, cookies=cj)
It doesn't throw any error or exception, but it incorrectly yields the login screen. However, I know that my cookie file is valid, because I can successfully retrieve my content using it with wget. Any idea what I'm doing wrong?
Edit:
I'm tracing cookielib.MozillaCookieJar._really_load and can verify that the cookies are correctly parsed (i.e. they have the correct values for the domain, path, secure, etc. tokens). But as the transaction is still resulting in the login form, it seems that wget must be doing something additional (as the exact same cookies.txt file works for it).
MozillaCookieJar inherits from FileCookieJar which has the following docstring in its constructor:
Cookies are NOT loaded from the named file until either the .load() or
.revert() method is called.
So you need to call the .load() method yourself.
Also, as Jermaine Xu noted, the first line of the file needs to contain either the # Netscape HTTP Cookie File or # HTTP Cookie File string. Files generated by the plugin you use do not contain such a string, so you have to insert it yourself. I raised an appropriate bug at http://code.google.com/p/cookie-txt-export/issues/detail?id=5
EDIT
Session cookies are saved with 0 in the 5th column. If you don't pass ignore_expires=True to the load() method, all such cookies are discarded when loading from a file.
File session_cookie.txt:
# Netscape HTTP Cookie File
.domain.com TRUE / FALSE 0 name value
Python script:
import cookielib
cj = cookielib.MozillaCookieJar('session_cookie.txt')
cj.load()
print len(cj)
Output:
0
EDIT 2
Although we managed to get cookies into the jar above, they are subsequently discarded by cookielib because they still have a value of 0 in the expires attribute. To prevent this we have to set the expiry time to some future time, like so:
import time
for cookie in cj:
    # set the cookie expiry date to 14 days from now
    cookie.expires = time.time() + 14 * 24 * 3600
EDIT 3
I checked both wget and curl and both use a 0 expiry time to denote session cookies, which means it's the de facto standard. However, Python's implementation uses an empty string for the same purpose, hence the problem raised in the question. I think Python's behavior in this regard should be in line with what wget and curl do, and that's why I raised the bug at http://bugs.python.org/issue17164
I'll note that replacing the 0s with empty strings in the 5th column of the input file and passing ignore_discard=True to load() is an alternate way of solving the problem (no need to change the expiry time in this case).
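A sketch of that alternate approach, assuming the 0 in the expiry column of session_cookie.txt has been replaced with an empty string:
import cookielib
cj = cookielib.MozillaCookieJar('session_cookie.txt')
# an empty expiry marks the cookie as a session (discard) cookie,
# so ignore_discard is needed to keep it in the jar
cj.load(ignore_discard=True)
print len(cj)  # should now print 1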
I tried taking into account everything that Piotr Dobrogost had valiantly figured out about MozillaCookieJar but to no avail. I got fed up and just parsed the damn cookies.txt myself and now all is well:
import re
import requests
def parseCookieFile(cookiefile):
    """Parse a cookies.txt file and return a dictionary of key/value pairs
    compatible with requests."""
    cookies = {}
    with open(cookiefile, 'r') as fp:
        for line in fp:
            if not re.match(r'^#', line):
                lineFields = line.strip().split('\t')
                cookies[lineFields[5]] = lineFields[6]
    return cookies
cookies = parseCookieFile('cookies.txt')
import pprint
pprint.pprint(cookies)
r = requests.get('https://example.com', cookies=cookies)
This worked for me:
from http.cookiejar import MozillaCookieJar
from pathlib import Path
import requests
cookies = Path('/Users/name/cookies.txt')
jar = MozillaCookieJar(cookies)
jar.load()
requests.get('https://path.to.site.com', cookies=jar)
<Response [200]>
I tried editing Tristan's answer to add some info, but the SO edit queue was apparently full, so I am writing this answer instead, since I have struggled quite a bit with using existing cookies with Python requests.
First, get the cookies from the Chrome. Easiest way would be to use an extension called 'cookies.txt'
https://chrome.google.com/webstore/detail/get-cookiestxt/bgaddhkoddajcdgocldbbfleckgcbcid/related
After downloading those cookies, use the below code to make sure that you are able to parse the file without any issues.
import re, requests, pprint
def parseCookieFile(cookiefile):
    """Parse a cookies.txt file and return a dictionary of key/value pairs
    compatible with requests."""
    cookies = {}
    with open(cookiefile, 'r') as fp:
        for line in fp:
            if not re.match(r'^#', line):
                lineFields = re.findall(r'[^\s]+', line)  # capture anything that isn't whitespace
                try:
                    cookies[lineFields[5]] = lineFields[6]
                except Exception as e:
                    print(e)
    return cookies
cookies = parseCookieFile('cookies.txt') #replace the filename
pprint.pprint(cookies)
Next, use those cookies with Python requests:
x = requests.get('your__url', verify=False, cookies=cookies)
print (x.content)
This should save you a day of digging through different SO posts and trying cookielib and other methods that never worked for me.
I finally found a way to make it work (I got the idea by looking at curl's verbose output): instead of loading my cookies from a file, I simply created a dict with the required name/value pairs:
cd = {'v1': 'n1', 'v2': 'n2'}
r = requests.get(url, cookies=cd)
and it worked (although it doesn't explain why the previous method didn't). Thanks for all the help, it's really appreciated.

How do I get the HTML of a website using Python 3?

I've been trying to do this with repl.it and have tried several solutions on this site, but none of them work. Right now, my code looks like
import urllib
url = "http://www.pythonchallenge.com/pc/def/linkedlist.php?nothing=12345"
print (urllib.urlopen(url).read())
but it just says "AttributeError: module 'urllib' has no attribute 'urlopen'".
If I add import urllib.urlopen, it tells me there's no module named that. How can I fix my problem?
The syntax you are using for the urllib library is from Python v2. The library has changed somewhat for Python v3. The new notation would look something more like:
import urllib.request
response = urllib.request.urlopen("http://www.google.com")
html = response.read()
The html object is a bytes object containing the returned HTML of the site; call html.decode('utf-8') if you need it as a str. Much like the original urllib library, you should not expect images or other data files to be included in this returned object.
The confusing part here is that, in Python 3, this would fail if you did:
import urllib
response = urllib.request.urlopen("http://www.google.com")
html = response.read()
This strange module-importing behavior is, I am told, as intended and working. BUT it is non-intuitive and awkward. More importantly, for you, it makes the situation harder to debug. Enjoy.
Python3
import urllib
import requests  # this only works because importing requests loads urllib.request as a side effect
url = "http://www.pythonchallenge.com/pc/def/linkedlist.php?nothing=12345"
r = urllib.request.urlopen(url).read()
print(r)
or
import urllib.request
url = "http://www.pythonchallenge.com/pc/def/linkedlist.php?nothing=12345"
r = urllib.request.urlopen(url).read()
print(r)

Problems crawling wordreference

I am trying to crawl wordreference, but I am not succeeding.
The first problem I have encountered is that a big part of the page is loaded via JavaScript, but that shouldn't be much of a problem because I can see what I need in the source code.
So, for example, I want to extract the first two meanings for a given word, so from this url: http://www.wordreference.com/es/translation.asp?tranword=crane I need to extract grulla and grúa.
This is my code:
import lxml.html as lh
import urllib2
url = 'http://www.wordreference.com/es/translation.asp?tranword=crane'
doc = lh.parse((urllib2.urlopen(url)))
trans = doc.xpath('//td[@class="ToWrd"]/text()')
for i in trans:
    print i
The result is that I get an empty list.
I have tried to crawl it with scrapy too, with no success. I am not sure what is going on; the only way I have been able to crawl it is with curl, but that is sloppy. I want to do it in an elegant way, with Python.
Thank you very much
It looks like you need a User-Agent header to be sent, see Changing user agent on urllib2.urlopen.
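For reference, a minimal sketch of sending a User-Agent header with the original urllib2 code (the header value is just an example):
import lxml.html as lh
import urllib2
url = 'http://www.wordreference.com/es/translation.asp?tranword=crane'
req = urllib2.Request(url, headers={'User-Agent': 'Mozilla/5.0'})  # example UA string
doc = lh.parse(urllib2.urlopen(req))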
Also, just switching to requests would do the trick (it automatically sends a python-requests/version User-Agent by default):
import lxml.html as lh
import requests
url = 'http://www.wordreference.com/es/translation.asp?tranword=crane'
response = requests.get(url)
doc = lh.fromstring(response.content)
trans = doc.xpath('//td[@class="ToWrd"]/text()')
for i in trans:
    print(i)
Prints:
grulla
grúa
plataforma
...
grulla blanca
grulla trompetera

Download file as string in python

I want to download a file to python as a string. I have tried the following, but it doesn't seem to work. What am I doing wrong, or what else might I do?
from urllib import request
webFile = request.urlopen(url).read()
print(webFile)
The following example works.
from urllib.request import urlopen
url = 'http://winterolympicsmedals.com/medals.csv'
output = urlopen(url).read()
print(output.decode('utf-8'))
Alternatively, you could use requests, which provides a more human-readable syntax. Keep in mind that requests requires that you install additional dependencies, which may increase the complexity of deploying the application, depending on your production environment.
import requests
url = 'http://winterolympicsmedals.com/medals.csv'
output = requests.get(url).text
print(output)
In Python 3.x, using the urllib package looks like this:
from urllib.request import urlopen
data = urlopen('http://www.google.com').read()  # bytes
body = data.decode('utf-8')
Another good library for this is http://docs.python-requests.org
It's not built-in, but I've found it to be much more usable than urllib*.
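A minimal sketch using the same CSV URL as above:
import requests
url = 'http://winterolympicsmedals.com/medals.csv'
r = requests.get(url)
print(r.text)  # response body decoded to str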
