Python 3.x: playing with the internet

I'm doing a small project to help my work go by faster.
I currently have a program written in Python 3.2 that does almost all of the manual labour for me, with one exception.
I need to log on to the company website (username and password) then choose a month and year and click download.
I would like to write a little program to do that for me, so that the whole process is completely done by the program.
I have looked into it and I can only find tools for 2.X.
I have looked into urllib and I know that some of the 2.X modules are now in urllib.request.
I have even found some code to start it off; however, I'm confused as to how to put it into practice.
Here is what I have found:
import urllib2
theurl = 'http://www.someserver.com/toplevelurl/somepage.htm'
username = 'johnny'
password = 'XXXXXX'
# a great password
passman = urllib2.HTTPPasswordMgrWithDefaultRealm()
# this creates a password manager
passman.add_password(None, theurl, username, password)
# because we have put None at the start it will always
# use this username/password combination for urls
# for which `theurl` is a super-url
authhandler = urllib2.HTTPBasicAuthHandler(passman)
# create the AuthHandler
opener = urllib2.build_opener(authhandler)
urllib2.install_opener(opener)
# All calls to urllib2.urlopen will now use our handler
# Make sure not to include the protocol in with the URL, or
# HTTPPasswordMgrWithDefaultRealm will be very confused.
# You must (of course) use it when fetching the page though.
pagehandle = urllib2.urlopen(theurl)
# authentication is now handled automatically for us
All Credit to Michael Foord and his page: Basic Authentication
So I changed the code around a bit and replaced every 'urllib2' with 'urllib.request'.
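Here is roughly what the snippet looks like after that rename (just a sketch; I haven't got it working against my site yet):
import urllib.request
theurl = 'http://www.someserver.com/toplevelurl/somepage.htm'
username = 'johnny'
password = 'XXXXXX'
passman = urllib.request.HTTPPasswordMgrWithDefaultRealm()
passman.add_password(None, theurl, username, password)
authhandler = urllib.request.HTTPBasicAuthHandler(passman)
opener = urllib.request.build_opener(authhandler)
urllib.request.install_opener(opener)
# all calls to urllib.request.urlopen now use our handler
pagehandle = urllib.request.urlopen(theurl)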
Then I learned how to open a web page, figuring the program should open the page using the login and password data, and then I'll learn how to download the files from it.
import webbrowser
ie = webbrowser.get('c:\\program files\\internet explorer\\iexplore.exe')
ie.open(theurl)
(I know Explorer is garbage, I'm just using it to test; then I'll be using Chrome ;) )
But that doesn't open the page with the login data entered; it simply opens the page as though you had typed in the URL.
How do I get it to open the page with the password handle?
I sort of understand how Michael made them, but I'm not sure which to use to actually open the website.
Also, as an afterthought: might I need to look into cookies?
Thanks for your time

You're getting things confused here.
webbrowser is a wrapper around your actual web browser, and urllib is a library for HTTP- and URL-related work.
They don't know about each other, and they serve very different purposes.
In older IE versions, you could encode the HTTP Basic Auth username and password directly in the URL, like so:
http(s)://Username:Password@Server/Resource.ext - I believe Firefox and Chrome still support that; IE killed it: http://support.microsoft.com/kb/834489/EN-US
If you want to emulate a browser, rather than just open a real one, take a look at mechanize: http://wwwsearch.sourceforge.net/mechanize/

Your browser doesn't know anything about the authentication you've done in Python (and that has nothing to do with whether your browser is garbage or not). The webbrowser module simply offers convenience methods for launching a browser and pointing it at a URL; you can't 'transfer' your credentials to the browser.
As for migrating from Python 2 to Python 3: the 2to3 tool can convert simple scripts like yours automatically.
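For example (yourscript.py is just a placeholder; -w rewrites the file in place, so run it on a copy first):
2to3 -w yourscript.py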

They are not running in the same environment.
You need to figure out what really happens when you click the download button. Use your browser's developer tools to capture the POST request the website sends, then build the same request in Python to fetch the file.
Requests is a nice library that makes that kind of thing much easier.
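For example, something along these lines (all URLs and form field names below are made up; substitute what the developer tools actually show you):
import requests

session = requests.Session()
# log in first so the session keeps the auth cookie
session.post('http://www.someserver.com/login',
             data={'username': 'johnny', 'password': 'XXXXXX'})
# then replay the download request you observed in the developer tools
resp = session.post('http://www.someserver.com/report',
                    data={'month': '01', 'year': '2013'})
with open('report.csv', 'wb') as f:
    f.write(resp.content)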

I would use Selenium. This is some code from a little script I have, hacked about a bit to give you an idea:
from selenium import webdriver

def get_name():
    user = 'johnny'
    passwd = 'XXXXXX'
    try:
        driver = webdriver.Remote(desired_capabilities=webdriver.DesiredCapabilities.HTMLUNIT)
        driver.get('http://www.someserver.com/toplevelurl/somepage.htm')
        assert 'Page Title' in driver.title
        username = driver.find_element_by_name('name_of_userid_box')
        username.send_keys(user)
        password = driver.find_element_by_name('name_of_password_box')
        password.send_keys(passwd)
        submit = driver.find_element_by_name('name_of_login_button')
        submit.click()
        driver.get('http://www.someserver.com/toplevelurl/page_with_download_button.htm')
        assert 'page_with_download_button title' in driver.title
        download = driver.find_element_by_name('download_button')
        download.click()
    except Exception:
        print('process failed')
I'm new to Python, so that may not be the best code ever written, but it should give you the general idea.
Hope it helps

Related

how to convert IP address into http for urllib

I'm looking to embark on a personal project: creating an application with which I can save docs/text/images from the site my browser is currently at. From my research I've concluded that one of two approaches is possible for now: using cookies, or using packet sniffers to identify the IP address (the packet sniffer method being more relevant at the moment).
I would like to automate the application so that I don't have to copy the URL from my browser and paste it into a script that uses urllib.
Do experienced network programmers have any suggestions regarding the process, or the modules and libraries I need?
Thanks so much,
Jonathan
If you want to download all images, docs, and text while you're actively browsing (which is probably a bad idea considering the sheer amount of bandwidth) then you'll want something more than urllib2. I assume you don't want to have to keep copying and pasting all the urls into a script to download everything, if that is not the case a simple urllib2 and beautifulsoup filter would do you wonders.
However if what I assume is correct then you are probably going to want to investigate selenium. From there you can launch a selenium window (defaults to Firefox) and then do your browsing normally. The best option from there is to continually poll the current url and if it is different identify all of the elements you want to download and then use urllib2 to download them. Since I don't know what you want to download I can't really help you on that part. However here is what something like that would look like in selenium:
from selenium import webdriver
from time import sleep

# Startup the web-browser
browser = webdriver.Firefox()
current_url = browser.current_url

while True:
    try:
        # If we have a new url, identify and download your items
        if browser.current_url != current_url:
            # Download the stuff here
            current_url = browser.current_url
    # Triggered once you close the web-browser
    except:
        break
    # Sleep for half a second to avoid demolishing your machine from constant polling
    sleep(0.5)
Once again I advise against doing this, as constantly downloading images, text, and documents would take up a huge amount of space.

Emulating a browser to download a file?

There's an FLV file on the web that can be downloaded directly in Chrome. The file is a television program, published by CCTV (China Central Television). CCTV is a non-profit, state-owned broadcaster, financed by the Chinese taxpayer, which allows us to download their content without infringing copyright.
Using wget, I can download the file from a different address, but not from the address that works in Chrome.
This is what I've tried to do:
url='http://114.80.235.200/f4v/94/163005294.h264_1.f4v?10000&key=7b9b1155dc632cbab92027511adcb300401443020d&playtype=1&tk=163659644989925531390490125&brt=2&bc=0&nt=0&du=1496650&ispid=23&rc=200&inf=1&si=11000&npc=1606&pp=0&ul=2&mt=-1&sid=10000&au=0&pc=0&cip=222.73.44.31&hf=0&id=tudou&itemid=135558267&fi=163005294&sz=59138302'
wget -c $url --user-agent="" -O xfgs.f4v
This doesn't work either:
wget -c $url -O xfgs.f4v
The output is:
Connecting to 118.26.57.12:80... connected.
HTTP request sent, awaiting response... 403 Forbidden
2013-02-13 09:50:42 ERROR 403: Forbidden.
What am I doing wrong?
I ultimately want to download it with the Python library mechanize. Here is the code I'm using for that:
import mechanize
br = mechanize.Browser()
br.set_handle_robots(False)
br.set_handle_equiv(False)
br.addheaders = [('User-agent', 'Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.9.0.1) Gecko/2008071615 Fedora/3.0.1-1.fc9 Firefox/3.0.1')]
url='http://114.80.235.200/f4v/94/163005294.h264_1.f4v?10000&key=7b9b1155dc632cbab92027511adcb300401443020d&playtype=1&tk=163659644989925531390490125&brt=2&bc=0&nt=0&du=1496650&ispid=23&rc=200&inf=1&si=11000&npc=1606&pp=0&ul=2&mt=-1&sid=10000&au=0&pc=0&cip=222.73.44.31&hf=0&id=tudou&itemid=135558267&fi=163005294&sz=59138302'
r = br.open(url).read()
tofile=open("/tmp/xfgs.f4v","w")
tofile.write(r)
tofile.close()
This is the result:
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/usr/lib/python2.7/dist-packages/mechanize/_mechanize.py", line 203, in open
return self._mech_open(url, data, timeout=timeout)
File "/usr/lib/python2.7/dist-packages/mechanize/_mechanize.py", line 255, in _mech_open
raise response
mechanize._response.httperror_seek_wrapper: HTTP Error 403: Forbidden
Can anyone explain how to get the mechanize code to work please?
First of all, if you are attempting any kind of scraping (yes this counts as scraping even though you are not necessarily parsing HTML), you have a certain amount of preliminary investigation to perform.
If you don't already have Firefox and Firebug, get them. Then if you don't already have Chrome, get it.
Start up Firefox/Firebug, and Chrome, clear out all of your cookies/etc. Then open up Firebug, and in Chrome open up View->Developer->Developer Tools.
Then load up the main page of the video you are trying to grab. Take notice of any cookies/headers/POST variables/query string variables that are being set when the page loads. You may want to save this info somewhere.
Then try to download the video, once again, take notice of any cookies/headers/post variables/query string variables that are being set when the video is loaded. It is very likely that there was a cookie or POST variable set when you initially loaded the page, that is required to actually pull the video file.
When you write your python, you are going to need to emulate this interaction as closely as possible. Use python-requests. This is probably the simplest URL library available, and unless you run into a wall somehow with it (something it can't do), I would never use anything else. The second I started using python-requests, all of my URL fetching code shrunk by a factor of 5x.
Now, things are probably not going to work the first time you try them. Soooo, you will need to load the main page using python. Print out all of your cookies/headers/POST variables/query string variables, and compare them to what Chrome/Firebug had. Then try loading your video, once again, compare all of these values (that means what YOU sent the server, and what the SERVER sent you back as well). You will need to figure out what is different between them (don't worry, we ALL learned this one in Kindergarten... "one of these things is not like the other") and dissect how that difference is breaking stuff.
If at the end of all of this, you still can't figure it out, then you probably need to look at the HTML for the page that contains the link to the movie. Look for any javascript in the page. Then use Firebug/Chrome Developer Tools to inspect the javascript and see if it is doing some kind of management of your user session. If it is somehow generating tokens (cookies or POST/GET variables) related to video access, you will need to emulate its tokenizing method in python.
Hopefully all of this helps, and doesn't look too scary. The key is you are going to need to be a scientist. Figure out what you know, what you don't, what you want, and start experimenting and recording your results. Eventually a pattern will emerge.
Edit: Clarify steps
Investigate how state is being maintained
Pull initial page with python, grab any state info you need from it
Perform any tokenizing that may be required with that state info
Pull the video using the tokens from steps 2 and 3
If stuff blows up, output your request/response headers, cookies, query vars, and post vars, and compare them to Chrome/Firebug
Return to step 1 until you find a solution
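A bare-bones sketch of steps 2 to 4 with python-requests (every URL, header, and token below is a placeholder for whatever you actually find in Firebug):
import requests

s = requests.Session()                        # keeps cookies between requests
page = s.get('http://example.com/video_page.html')
print(s.cookies)                              # compare with what Firebug recorded
print(page.request.headers)
# ... derive whatever token the page or its javascript sets, then:
video = s.get('http://example.com/path/to/file.f4v',
              headers={'Referer': 'http://example.com/video_page.html'})
with open('out.f4v', 'wb') as f:
    f.write(video.content)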
Edit:
You may also be getting redirected on either of these requests (the html page or the file download). You will most likely miss the request/response in Firebug/Chrome if that is happening. The solution is to use a sniffer like LiveHTTPHeaders, or, as other responders have suggested, Wireshark or Fiddler. Note that Fiddler will do you no good if you are on a Linux or OSX box; it is Windows only and is definitely focused on .NET development... (ugh). Wireshark is very useful but overkill for most problems, and depending on what machine you are running, you may have problems getting it working. So I would suggest LiveHTTPHeaders first.
I love this kind of problem
It seems that mechanize can do stateful browsing, meaning that it will keep context and cookies between browser requests. I would suggest first loading the complete page where the video is located, then making a second request to download the video explicitly. That way, the web server will think that a full (legitimate) browsing session is in progress.
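A rough sketch of that idea with mechanize (page_url is a placeholder for the page that embeds the player; video_url is the long f4v URL from the question):
import mechanize

br = mechanize.Browser()
br.set_handle_robots(False)
page_url = 'http://www.example.com/page_with_the_player.html'
video_url = 'http://114.80.235.200/f4v/94/163005294.h264_1.f4v?...'
br.open(page_url)                   # first request establishes the session and cookies
data = br.open(video_url).read()    # second request fetches the file within that session
open('/tmp/xfgs.f4v', 'wb').write(data)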
You can use Selenium or Watir to do all the stuff you need in a browser.
Since you don't want to see the browser, you can run Selenium headless.
see also this answer.
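One common way to run it headless on Linux is to put the browser on a virtual display (a sketch, assuming Xvfb and the pyvirtualdisplay package are installed):
from pyvirtualdisplay import Display
from selenium import webdriver

display = Display(visible=0, size=(1024, 768))
display.start()                 # Firefox will now render into the virtual display
driver = webdriver.Firefox()
driver.get('http://www.example.com/')
# ... do your clicking and downloading here ...
driver.quit()
display.stop()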
Assuming that you did not type the URL out of the blue by hand, use mechanize to first go to the page where you got that from. Then emulate the action you take to download the actual file (probably clicking a link or a button).
This might not work though, as mechanize keeps state of cookies and redirects but does not handle any JavaScript real-time changes to the html pages. To check whether JavaScript is crucial for the operation, switch off JavaScript in Chrome (or any other browser) and make sure you can download the file. If JavaScript is necessary, I would try to programmatically drive a browser to get the file.
My usual approach to trying this kind of scraping is
try wget or Python's urllib2
try mechanize
drive a browser
Unless there is some captcha, the last one usually works, but the others are easier (and faster).
In order to clarify the "why" part of your question, you can route your browser's and your code's requests through a debugging proxy. If you are using Windows, I suggest fiddler2. There are other debugging proxies for other platforms as well, but fiddler2 is definitely my favourite.
http://www.fiddler2.com/fiddler2/
https://www.owasp.org/index.php/Category:OWASP_WebScarab_Project
http://www.charlesproxy.com/
Or more low level
http://netcat.sourceforge.net/
http://www.wireshark.org/
Once you know the differences it is usually much simpler to come up with a solution. I suspect that the other answers with regard to stateful browsing / cookies are correct. With the mentioned tools you can analyze these cookies and roll a suitable solution without going for browser automation.
I think many sites use temporary links that only exist in your session. The code in the url is probably something like your session-id. That means the particular link will never work again.
You'll have to reopen the page that contains the link using some library that accommodates this session (as mentioned in other answers), and then try to locate the link and only use it within that session.
While the current accepted answer (by G. Shearer) is the best possible advice for scraping in general, I've found a way to skip a few steps - with a Firefox extension called cliget that takes the request context with all the HTTP headers and cookies and generates a curl (or wget) command that is copied to the clipboard.
EDIT: this feature is also available in the network panels of Firebug and the Chrome debugger - right-click the request, "copy as cURL"
Most of the time you'll get a very verbose command with a few apparently unneeded headers, but you can remove those one by one until the server rejects the request, instead of the opposite (which, honestly, I find frustrating - I often got stuck thinking what header was missing from the request).
(Also, you might want to remove the -O option from the curl commandline to see the result in stdout instead of downloading it to a file, and add -v to see the full header list)
Even if you don't want to use curl/wget, converting one curl/wget commandline to python code is just a matter of knowing how to add headers to an urllib request (or any http request library for that matter)
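For example, a command such as
curl 'http://example.com/file.f4v' -H 'Cookie: sid=abc123' -H 'Referer: http://example.com/page'
maps roughly to this urllib2 snippet (the URL and header values here are made up):
import urllib2

req = urllib2.Request('http://example.com/file.f4v')
req.add_header('Cookie', 'sid=abc123')
req.add_header('Referer', 'http://example.com/page')
data = urllib2.urlopen(req).read()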
There's an open source, Python library, named ghost, that wraps a headless, WebKit browser, so you can control everything through a simple API:
from ghost import Ghost
ghost = Ghost()
page, resources = ghost.open('http://my.web.page')
It supports cookies, JavaScript and everything else. You can inject JavaScript into the page, and although it's headless, so nothing is rendered graphically, you still have the DOM. It's a complete browser.
It wouldn't scale well, but it's lots of fun, and may be useful when you need something approaching a complete browser.
from urllib import urlopen  # Python 2; in Python 3 this is urllib.request.urlopen
print urlopen(url).read()  # built-in high-level interface for fetching online resources
Have you tried the requests module? It's much simpler to use than urllib2, pycurl, etc., yet it's powerful. It has the following features (a quick example follows the list):
International Domains and URLs
Keep-Alive & Connection Pooling
Sessions with Cookie Persistence
Browser-style SSL Verification
Basic/Digest Authentication
Elegant Key/Value Cookies
Automatic Decompression
Unicode Response Bodies
Multipart File Uploads
Connection Timeouts
.netrc support
Python 2.6—3.3
Thread-safe.
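For instance, a Basic Auth download takes only a couple of lines (a sketch; the URL and credentials are the placeholders from the question):
import requests

r = requests.get('http://www.someserver.com/toplevelurl/somepage.htm',
                 auth=('johnny', 'XXXXXX'))
open('somepage.htm', 'wb').write(r.content)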
You could use Internet Download Manager; it is able to capture and download any streaming media from any website.

How do you open a URL with Python without using a browser?

I want to open a URL with Python code, but I don't want to use the "webbrowser" module. I tried that already and it worked (it opened the URL in my actual default browser, which is what I DON'T want). So then I tried using urllib (urlopen) and mechanize. Both of them ran fine with my program, but neither of them actually sent my request to the website!
Here is part of my code:
finalURL="http://www.locationary.com/access/proxy.jsp?ACTION_TOKEN=proxy_jsp$JspView$SaveAction&inPlaceID=" + str(newPID) + "&xxx_c_1_f_987=" + str(ZA[z])
print finalURL
print ""
br.open(finalURL)
page = urllib2.urlopen(finalURL).read()
When I go into the site, locationary.com, it doesn't show that any changes have been made! When I used "webbrowser" though, it did show changes on the website after I submitted my URL. How can I do the same thing that webbrowser does without actually opening a browser?
I think the website wants a "GET"
I'm not sure what OS you're working on, but if you use something like httpscoop (Mac) or fiddler (PC) or Wireshark, you should be able to watch the traffic and see what's happening. It may be that the website does a redirect (which your browser is following) or there's some other subsequent activity.
Start an HTTP sniffer, make the request using the web browser and watch the traffic. Once you've done that, try it with the python script and see if the request is being made, and what the difference is in the HTTP traffic. This should help identify where the disconnect is.
An HTTP GET doesn't need any specific code or action on the client side: it's just the base URL (http://server/) + path + optional query string.
If the URL is correct, then the code above should work. Some pointers what you can try next:
Is the URL really correct? Use Firebug or a similar tool to watch the network traffic, which gives you the full URL plus any header fields from the HTTP request.
Maybe the site requires you to log in first. If so, make sure you set up cookies correctly.
Some sites require a correct "referrer" field (to protect themselves against deep linking). Add the referrer header which your browser used to the request.
The log file of the server is a great source of information to troubleshoot such problems - when you have access to it.
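A minimal urllib2 sketch combining the cookie and referrer pointers above (the Referer value is only a guess; use whatever your browser actually sent):
import urllib2, cookielib

cj = cookielib.CookieJar()
opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(cj))
req = urllib2.Request(finalURL, headers={'Referer': 'http://www.locationary.com/'})
print opener.open(req).read()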

Loading cookies in python

I am a novice programmer attempting to access Google Insights using Python. I can access sites which don't require cookies fine, but I can't seem to properly pass the cookies along. The cookies file was exported from Mozilla Firefox and is on the Z: drive, which is also where I'm running Python from.
I'm also pretty sure my code for saving the file could be done better than reading and writing, but I don't know how to do that either. Any help would be appreciated.
import urllib2
import cookielib
import os
url = "http://www.google.com/insights/search/overviewReport?q=eagles%2Ccsco&geo=US&cmpt=q&content=1&export=2"
cj = cookielib.MozillaCookieJar()
cj.load('cookies6.txt')
opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(cj))
file = opener.open(url)
output = open('test2.csv','wb')
output.write(file.read())
output.close()
I haven't tested your code; however:
As far as I can tell there seems to be nothing wrong with your code
I've tried the url you're searching and had no problems downloading the csv without any cookies
In my previous experience with Google, you might be looking at the problem the wrong way: it is not that you don't have the right cookies, but that Google automatically blocks requests from bots. If this is the case, you must replace the User-Agent HTTP header to mimic an actual browser. Beware, however, that this is against Google's terms of service, and if you make too many requests per minute Google will block all requests from your IP for about 8 hours.
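For example, you could add a browser-like User-Agent to the opener from the question (the exact string is just an example):
opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(cj))
opener.addheaders = [('User-Agent',
                      'Mozilla/5.0 (Windows NT 6.1; rv:18.0) Gecko/20100101 Firefox/18.0')]
file = opener.open(url)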

Controlling a browser using Python, on a Mac

I'm looking for a way to programmatically control a browser on a Mac (i.e. Firefox or Safari or Chrome/-ium or Opera, but not IE) using Python.
The actions I need include following links, checking if elements exist in a page, and submitting forms.
Which solution would you recommend?
I like Selenium; it's scriptable through Python. The Selenium IDE only runs in Firefox, but Selenium RC supports multiple browsers.
Check out python-browsercontrol.
Also, you could read this forum page (I know, it's old, but it seems extremely relevant to your question):
http://bytes.com/topic/python/answers/45528-python-client-side-browser-script-language
Also: http://docs.python.org/library/webbrowser.html
Example:
from browser import *
my_browser = Firefox(99, '/usr/lib/firefox/firefox-bin')
my_browser.open_url('cnn.com')
open_url returns when the cnn.com home page document is loaded in the browser frame.
Might be a bit restrictive, but py-appscript may be the easiest way of controlling an AppleScript-able browser from Python.
For more complex things, you can use the PyObjC to achieve pretty much anything - for example, webkit2png is a Python script which uses WebKit to load a page, and save an image of it. You need to have a decent understanding of Objective-C and Cocoa/etc to use it (as it just exposes ObjC objects to Python)
Screen-scraping may achieve what you want with much less complexity.
Check out spynner Python module.
Spynner is a stateful programmatic web browser module for Python. It is based upon PyQt and WebKit. It supports JavaScript, AJAX, and every other technology that WebKit is able to handle (Flash, SVG, ...). Spynner takes advantage of jQuery, a powerful JavaScript library that makes interaction with pages and event simulation really easy.
Using Spynner you would be able to simulate a web browser with no GUI (though a browsing window can be opened for debugging purposes), so it may be used to implement crawlers or acceptance testing tools.
See some examples at GitHub page.
Try mechanize, if you don't actually need a browser.
Example:
import re
import mechanize
br = mechanize.Browser()
br.open("http://www.example.com/")
# follow second link with element text matching regular expression
response1 = br.follow_link(text_regex=r"cheese\s*shop", nr=1)
assert br.viewing_html()
print br.title()
print response1.geturl()
print response1.info() # headers
print response1.read() # body
br.select_form(name="order")
# Browser passes through unknown attributes (including methods)
# to the selected HTMLForm.
br["cheeses"] = ["mozzarella", "caerphilly"] # (the method here is __setitem__)
# Submit current form. Browser calls .close() on the current response on
# navigation, so this closes response1
response2 = br.submit()
Several Mac applications can be controlled via OSAScript (a.k.a. AppleScript), which can be sent via the osascript command. O'Reilly has an article on invoking osascript from Python. I can't vouch for it doing exactly what you want, but it's a starting point.
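A minimal sketch of that approach (it assumes Safari is installed; the AppleScript one-liner just tells it to open a URL):
import subprocess

script = 'tell application "Safari" to open location "http://www.example.com/"'
subprocess.call(['osascript', '-e', script])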
Maybe overpowered, but check out Marionette to control Firefox. There is a tutorial at readthedocs:
You first start a Marionette-enabled Firefox instance:
firefox -marionette
Then you create a client:
from marionette import Marionette  # import path may differ between marionette client versions

client = Marionette('localhost', port=2828)
client.start_session()
Navigation, for example, is done via:
url = 'http://mozilla.org'
client.navigate(url)
client.go_back()
client.go_forward()
assert client.get_url() == url
Check out Mozmill: https://github.com/mikeal/mozmill
Mozmill is a UI Automation framework for Mozilla apps like Firefox and Thunderbird. It's both an addon and a Python command-line tool. The addon provides an IDE for writing and running the JavaScript tests and the Python package provides a mechanism for running the tests from the command line as well as providing a way to test restarting the application.
Take a look at PyShell (an extension to PyXPCOM).
Example:
promptSvc = components.classes["@mozilla.org/embedcomp/prompt-service;1"].\
    getService(components.interfaces.nsIPromptService)
promptSvc.alert(None, 'Greeting...', "Hello from Python")
You can use the Selenium library for Python; here is a simple example (in the form of a unittest):
#!/usr/bin/env python3
import unittest
from selenium import webdriver

class FooTest(unittest.TestCase):

    def setUp(self):
        self.driver = webdriver.Firefox()
        self.base_url = "http://example.com"

    def is_text_present(self, text):
        return str(text) in self.driver.page_source

    def test_example(self):
        self.driver.get(self.base_url + "/")
        self.assertTrue(self.is_text_present("Example"))

if __name__ == '__main__':
    suite = unittest.TestLoader().loadTestsFromTestCase(FooTest)
    result = unittest.TextTestRunner(verbosity=2).run(suite)
