I need to submit a form to a server and retrieve a CSV file from it over the internet with Python.
The server website is http://222.158.245.253/obweb/data/c1/c1_output6.aspx?LocationNo=012, which publishes observation data of the sea around Japan.
So far, I have always selected the item and the date by hand and clicked the button.
Then, when the file-save dialog box is displayed, I save the CSV file from the server.
I would like to automate this manual work with Python.
I have studied Python and web scraping and have used Python modules like BeautifulSoup.
However, this website is difficult to scrape because it is an ASP.NET (aspx) page.
So, please help me.
You can avoid scraping if you can find out what URL the form is POSTing to. Inspect the source code of the page and see if the form tag has an action attribute. This is the URL that the form sends all of your fields to (including the item and date you specify).
You're going to want to use the requests library to make your POST request. It'll be something like this example from the requests quickstart:
import requests

payload = {'item': '<your item>', 'date': '<your date>'}
r = requests.post("<form post url>", data=payload)
You can then likely access the csv file that's returned with
print r.content
Though you may have to process r.content for it to be meaningful.
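Because this is an ASP.NET WebForms page, the POST usually also has to echo back the hidden __VIEWSTATE and __EVENTVALIDATION fields from the page (and sometimes __EVENTTARGET). Here is a rough sketch of that, assuming requests and BeautifulSoup; the item/date field names are placeholders you would read from the form's HTML:

import requests
from bs4 import BeautifulSoup

url = "http://222.158.245.253/obweb/data/c1/c1_output6.aspx?LocationNo=012"

session = requests.Session()

# GET the page first to pick up the hidden ASP.NET state fields
soup = BeautifulSoup(session.get(url).content, "html.parser")
payload = {}
for field in soup.find_all("input", type="hidden"):
    payload[field["name"]] = field.get("value", "")

# Placeholder field names -- use the real name="..." attributes from the form
payload["ddlItem"] = "<your item>"
payload["txtDate"] = "<your date>"

# WebForms pages usually post back to the same URL
r = session.post(url, data=payload)
with open("observation.csv", "wb") as f:
    f.write(r.content)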
I'm browsing through the Angie's List website with my Fiddler extension open in Chrome, and after each page load, Fiddler captures an XHR response for that page (I believe it's just a pixel tracker indicating the event that I visited a new page). I'd like to be able to capture the content of these responses automatically in a CSV file. So for example if I run "python getXHR.py http://www.angieslist.com" I'd want my csv output file to append:
angieslist.com,http://536371345.log.optimizely.com/event?a=536371345&d=536371345&y=fal...
How can I do this? I know Python better than other languages, but other languages are fine. If there is a way to do this directly through Fiddler/Firebug that's fine too.
With NetExport (an extension for Firebug) you can save all requests as JSON (a HAR file) and then use Python (and the json module) to find requests with X-Requested-With: XMLHttpRequest.
BTW: NetExport has an Auto Export feature.
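If you go that route, the NetExport capture is a HAR file, which is plain JSON. A minimal sketch of the CSV step, assuming the capture was saved as angieslist.har (the file name and the site label are placeholders):

import csv
import json

# Load the HAR capture that NetExport (or Auto Export) wrote
with open("angieslist.har") as f:
    har = json.load(f)

# Append every XHR request URL to the output CSV
with open("xhr_requests.csv", "a") as out:
    writer = csv.writer(out)
    for entry in har["log"]["entries"]:
        headers = dict((h["name"].lower(), h["value"]) for h in entry["request"]["headers"])
        if headers.get("x-requested-with") == "XMLHttpRequest":
            writer.writerow(["angieslist.com", entry["request"]["url"]])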
I have a scraper script that pulls binary content off publishers' websites. It's built to replace the manual action of saving hundreds of individual PDF files that colleagues would otherwise have to undertake.
The websites are credential based, and we have the correct credentials and permissions to collect this content.
I have encountered a website that has the pdf file inside an iFrame.
I can extract the content URL from the HTML. When I feed the URL to the content grabber, I collect a small piece of HTML that says: <html><body>Forbidden: Direct file requests are not allowed.</body></html>
I can feed the URL directly to the browser, and the PDF file resolves correctly.
I am assuming that there is a session cookie (or something, I'm not 100% comfortable with the terminology) that gets sent with the request to show that the GET request comes from a live session, not a remote link.
I looked at the referring URL, and saw these different URLs that point to the same article, collected over a day of testing (I have scrubbed identifiers from the URLs):
http://content_provider.com/NDM3NTYyNi45MTcxODM%3D/elibrary//title/issue/article.pdf
http://content_provider.com/NDM3NjYyMS4wNjU3MzY%3D/elibrary//title/issue/article.pdf
http://content_provider.com/NDM3Njc3Mi4wOTY3MDM%3D/elibrary//title/issue/article.pdf
http://content_provider.com/NDM3Njg3Ni4yOTc0NDg%3D/elibrary//title/issue/article.pdf
This suggests that there is something in the URL that is unique, and needs associating to something else to circumvent the direct link detector.
Any suggestions on how to get round this problem?
OK. The answer was cookies and headers. I collected the GET header info via HttpFox, made an identical header object in my script, grabbed the session ID from the request cookie, and sent the cookie with each request.
For good measure I also set the user agent to a known working browser agent, just in case the server was checking agent details.
Works fine.
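For anyone hitting the same wall, a rough sketch of that approach with the requests library; the header values and the cookie name are placeholders copied from what HttpFox shows for a working browser request:

import requests

# Headers copied from a working browser request (values are placeholders)
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 6.1; rv:18.0) Gecko/20100101 Firefox/18.0",
    "Referer": "http://content_provider.com/elibrary/title/issue/",
}

# Session cookie grabbed from the live session (name and value are placeholders)
cookies = {"SESSIONID": "<session id from the live session>"}

pdf_url = "http://content_provider.com/NDM3NTYyNi45MTcxODM%3D/elibrary//title/issue/article.pdf"
r = requests.get(pdf_url, headers=headers, cookies=cookies)

with open("article.pdf", "wb") as f:
    f.write(r.content)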
There's an FLV file on the web that can be downloaded directly in Chrome. The file is a television program, published by CCTV (China Central Television). CCTV is a non-profit, state-owned broadcaster, financed by the Chinese taxpayer, which allows us to download its content without infringing copyright.
Using wget, I can download the file from a different address, but not from the address that works in Chrome.
This is what I've tried to do:
url='http://114.80.235.200/f4v/94/163005294.h264_1.f4v?10000&key=7b9b1155dc632cbab92027511adcb300401443020d&playtype=1&tk=163659644989925531390490125&brt=2&bc=0&nt=0&du=1496650&ispid=23&rc=200&inf=1&si=11000&npc=1606&pp=0&ul=2&mt=-1&sid=10000&au=0&pc=0&cip=222.73.44.31&hf=0&id=tudou&itemid=135558267&fi=163005294&sz=59138302'
wget -c $url --user-agent="" -O xfgs.f4v
This doesn't work either:
wget -c $url -O xfgs.f4v
The output is:
Connecting to 118.26.57.12:80... connected.
HTTP request sent, awaiting response... 403 Forbidden
2013-02-13 09:50:42 ERROR 403: Forbidden.
What am I doing wrong?
I ultimately want to download it with the Python library mechanize. Here is the code I'm using for that:
import mechanize
br = mechanize.Browser()
br.set_handle_robots(False)
br.set_handle_equiv(False)
br.addheaders = [('User-agent', 'Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.9.0.1) Gecko/2008071615 Fedora/3.0.1-1.fc9 Firefox/3.0.1')]
url='http://114.80.235.200/f4v/94/163005294.h264_1.f4v?10000&key=7b9b1155dc632cbab92027511adcb300401443020d&playtype=1&tk=163659644989925531390490125&brt=2&bc=0&nt=0&du=1496650&ispid=23&rc=200&inf=1&si=11000&npc=1606&pp=0&ul=2&mt=-1&sid=10000&au=0&pc=0&cip=222.73.44.31&hf=0&id=tudou&itemid=135558267&fi=163005294&sz=59138302'
r = br.open(url).read()
tofile = open("/tmp/xfgs.f4v", "wb")
tofile.write(r)
tofile.close()
This is the result:
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/usr/lib/python2.7/dist-packages/mechanize/_mechanize.py", line 203, in open
return self._mech_open(url, data, timeout=timeout)
File "/usr/lib/python2.7/dist-packages/mechanize/_mechanize.py", line 255, in _mech_open
raise response
mechanize._response.httperror_seek_wrapper: HTTP Error 403: Forbidden
Can anyone explain how to get the mechanize code to work please?
First of all, if you are attempting any kind of scraping (yes this counts as scraping even though you are not necessarily parsing HTML), you have a certain amount of preliminary investigation to perform.
If you don't already have Firefox and Firebug, get them. Then if you don't already have Chrome, get it.
Start up Firefox/Firebug, and Chrome, clear out all of your cookies/etc. Then open up Firebug, and in Chrome open up View->Developer->Developer Tools.
Then load up the main page of the video you are trying to grab. Take notice of any cookies/headers/POST variables/query string variables that are being set when the page loads. You may want to save this info somewhere.
Then try to download the video, once again, take notice of any cookies/headers/post variables/query string variables that are being set when the video is loaded. It is very likely that there was a cookie or POST variable set when you initially loaded the page, that is required to actually pull the video file.
When you write your python, you are going to need to emulate this interaction as closely as possible. Use python-requests. This is probably the simplest URL library available, and unless you run into a wall somehow with it (something it can't do), I would never use anything else. The second I started using python-requests, all of my URL fetching code shrunk by a factor of 5x.
Now, things are probably not going to work the first time you try them. Soooo, you will need to load the main page using python. Print out all of your cookies/headers/POST variables/query string variables, and compare them to what Chrome/Firebug had. Then try loading your video, once again, compare all of these values (that means what YOU sent the server, and what the SERVER sent you back as well). You will need to figure out what is different between them (don't worry, we ALL learned this one in Kindergarten... "one of these things is not like the other") and dissect how that difference is breaking stuff.
If at the end of all of this, you still can't figure it out, then you probably need to look at the HTML for the page that contains the link to the movie. Look for any javascript in the page. Then use Firebug/Chrome Developer Tools to inspect the javascript and see if it is doing some kind of management of your user session. If it is somehow generating tokens (cookies or POST/GET variables) related to video access, you will need to emulate its tokenizing method in python.
Hopefully all of this helps, and doesn't look too scary. The key is you are going to need to be a scientist. Figure out what you know, what you don't, what you want, and start experimenting and recording your results. Eventually a pattern will emerge.
Edit: Clarify steps
1. Investigate how state is being maintained
2. Pull the initial page with Python, and grab any state info you need from it
3. Perform any tokenizing that may be required with that state info
4. Pull the video using the tokens from steps 2 and 3 (see the sketch after this list)
5. If stuff blows up, output your request/response headers, cookies, query vars, and post vars, and compare them to Chrome/Firebug
6. Return to step 1 until you find a solution
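A hypothetical sketch of steps 2-5 with python-requests; the page URL is a placeholder, and any real token logic has to come out of your Firebug/Chrome comparison:

import requests

session = requests.Session()

# Step 2: pull the initial page; the Session keeps whatever cookies it sets
page = session.get("http://www.example.com/page-that-embeds-the-video")

# Step 5 material: what you sent vs. what the server sent back
print page.request.headers
print page.headers
print session.cookies

# Steps 3-4: build the video URL/tokens however the site requires, then fetch it
video_url = "<the full .f4v URL, including its query string>"
video = session.get(video_url, stream=True)
print video.status_code

if video.ok:
    with open("xfgs.f4v", "wb") as f:
        for chunk in video.iter_content(chunk_size=65536):
            f.write(chunk)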
Edit:
You may also be getting redirected at either one of these requests (the HTML page or the file download). You will most likely miss the request/response in Firebug/Chrome if that is happening. The solution would be to use a sniffer like LiveHTTPHeaders, or, as has been suggested by other responders, Wireshark or Fiddler. Note that Fiddler will do you no good if you are on a Linux or OSX box; it is Windows-only and is definitely focused on .NET development... (ugh). Wireshark is very useful but overkill for most problems, and depending on what machine you are running, you may have problems getting it working. So I would suggest LiveHTTPHeaders first.
I love this kind of problem
It seems that mechanize can do stateful browsing, meaning that it will keep context and cookies between browser requests. I would suggest first loading the complete page where the video is located, then making a second request to download the video explicitly. That way, the web server will think that a full (legitimate) browsing session is ongoing.
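A rough sketch of that idea with mechanize; the page URL is a placeholder, and the video URL is the one from the question:

import mechanize

br = mechanize.Browser()
br.set_handle_robots(False)
br.addheaders = [('User-agent', 'Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.9.0.1) Gecko/2008071615 Fedora/3.0.1-1.fc9 Firefox/3.0.1')]

# Load the page that embeds the video first, so the session cookies get set
br.open("http://www.example.com/page-that-embeds-the-video")

# Then request the video URL in the same Browser; the cookies are sent along
video_url = "<the full .f4v URL from the question>"
data = br.open(video_url).read()

tofile = open("/tmp/xfgs.f4v", "wb")
tofile.write(data)
tofile.close()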
You can use Selenium or Watir to do everything you need in a browser.
Since you don't want to see the browser, you can run Selenium headless.
See also this answer.
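For example, here is a sketch with a recent Selenium and headless Chrome (this answer predates headless Chrome, so at the time this would have been PhantomJS or Xvfb). The URLs are placeholders, and the browser's session cookies are handed to requests for the actual download:

import requests
from selenium import webdriver
from selenium.webdriver.chrome.options import Options

# Let the headless browser load the page and run its JavaScript
opts = Options()
opts.add_argument("--headless")
driver = webdriver.Chrome(options=opts)
driver.get("http://www.example.com/page-that-embeds-the-video")

# Copy the browser's cookies into a requests session for the file download
session = requests.Session()
for cookie in driver.get_cookies():
    session.cookies.set(cookie["name"], cookie["value"])
driver.quit()

r = session.get("<the full .f4v URL>", stream=True)
with open("xfgs.f4v", "wb") as f:
    for chunk in r.iter_content(chunk_size=65536):
        f.write(chunk)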
Assuming that you did not type the URL out of the blue by hand, use mechanize to first go to the page where you got that from. Then emulate the action you take to download the actual file (probably clicking a link or a button).
This might not work, though, as mechanize keeps state of cookies and redirects but does not handle any JavaScript real-time changes to the HTML pages. To check whether JavaScript is crucial for the operation, switch off JavaScript in Chrome (or any other browser) and make sure you can download the file. If JavaScript is necessary, I would try to programmatically drive a browser to get the file.
My usual approach to trying this kind of scraping is
try wget or Python's urllib2
try mechanize
drive a browser
Unless there is some captcha, the last one usually works, but the others are easier (and faster).
In order to clarify the "why" part of your question, you can route your browser's and your code's requests through a debugging proxy. If you are using Windows, I suggest Fiddler2. There exist other debugging proxies for other platforms as well, but Fiddler2 is definitely my favourite.
http://www.fiddler2.com/fiddler2/
https://www.owasp.org/index.php/Category:OWASP_WebScarab_Project
http://www.charlesproxy.com/
Or, more low-level:
http://netcat.sourceforge.net/
http://www.wireshark.org/
Once you know the differences it is usually much simpler to come up with a solution. I suspect that the other answers with regard to stateful browsing / cookies are correct. With the mentioned tools you can analyze these cookies and roll a suitable solution without going for browser automation.
I think many sites use temporary links that only exist in your session. The code in the url is probably something like your session-id. That means the particular link will never work again.
You'll have to reopen the page that contains the link using some library that accommodates this session (as mentioned in other answers), and then try to locate the link and use it only within that session.
While the current accepted answer (by G. Shearer) is the best possible advice for scraping in general, I've found a way to skip a few steps: a Firefox extension called cliget takes the request context, with all the HTTP headers and cookies, and generates a curl (or wget) command that is copied to the clipboard.
EDIT: this feature is also available in the network panels of Firebug and the Chrome debugger: right-click the request, "copy as cURL".
Most of the time you'll get a very verbose command with a few apparently unneeded headers, but you can remove those one by one until the server rejects the request, instead of the opposite (which, honestly, I find frustrating: I often got stuck wondering which header was missing from the request).
(Also, you might want to remove the -O option from the curl commandline to see the result in stdout instead of downloading it to a file, and add -v to see the full header list)
Even if you don't want to use curl/wget, converting one curl/wget commandline to python code is just a matter of knowing how to add headers to an urllib request (or any http request library for that matter)
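For instance, a copied command like curl '<url>' -H 'User-Agent: ...' -H 'Referer: ...' -H 'Cookie: ...' maps to urllib2 roughly like this (all values are placeholders to be pasted from the generated command):

import urllib2

req = urllib2.Request("<url from the curl command>")
req.add_header("User-Agent", "<value from -H 'User-Agent: ...'>")
req.add_header("Referer", "<value from -H 'Referer: ...'>")
req.add_header("Cookie", "<value from -H 'Cookie: ...'>")

response = urllib2.urlopen(req)
tofile = open("xfgs.f4v", "wb")
tofile.write(response.read())
tofile.close()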
There's an open-source Python library named Ghost that wraps a headless WebKit browser, so you can control everything through a simple API:
from ghost import Ghost
ghost = Ghost()
page, resources = ghost.open('http://my.web.page')
It supports cookies, JavaScript and everything else. You can inject JavaScript into the page, and although it's headless and doesn't render anything graphically, you still have the DOM. It's a complete browser.
It wouldn't scale well, but it's lots of fun, and may be useful when you need something approaching a complete browser.
from urllib import urlopen
print urlopen(url).read()  # urllib's built-in high-level interface; read() returns the response body (it does not raise on HTTP error codes, so check .getcode() yourself)
Did you try the requests module? It's much simpler to use than urllib2, pycurl, etc.,
yet it's powerful. It has the following features (the link is here):
International Domains and URLs
Keep-Alive & Connection Pooling
Sessions with Cookie Persistence
Browser-style SSL Verification
Basic/Digest Authentication
Elegant Key/Value Cookies
Automatic Decompression
Unicode Response Bodies
Multipart File Uploads
Connection Timeouts
.netrc support
Python 2.6—3.3
Thread-safe.
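A tiny illustration of a few of those features (the URL and credentials are placeholders):

import requests

session = requests.Session()              # cookies persist across requests
session.auth = ("user", "password")       # basic authentication
r = session.get("http://example.com/some/file", timeout=10)   # connection timeout
print r.status_code
print r.headers["Content-Type"]
print r.encoding                          # unicode response bodies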
You could use Internet Download Manager; it is able to capture and download streaming media from most websites.
I want to open a URL with Python code, but I don't want to use the webbrowser module. I tried that already and it worked (it opened the URL in my actual default browser, which is what I DON'T want). So then I tried using urllib (urlopen) and mechanize. Both of them ran fine with my program, but neither of them actually sent my request to the website!
Here is part of my code:
finalURL="http://www.locationary.com/access/proxy.jsp?ACTION_TOKEN=proxy_jsp$JspView$SaveAction&inPlaceID=" + str(newPID) + "&xxx_c_1_f_987=" + str(ZA[z])
print finalURL
print ""
br.open(finalURL)
page = urllib2.urlopen(finalURL).read()
When I go into the site, locationary.com, it doesn't show that any changes have been made! When I used "webbrowser" though, it did show changes on the website after I submitted my URL. How can I do the same thing that webbrowser does without actually opening a browser?
I think the website wants a "GET"
I'm not sure what OS you're working on, but if you use something like HTTP Scoop (Mac), Fiddler (PC), or Wireshark, you should be able to watch the traffic and see what's happening. It may be that the website does a redirect (which your browser is following) or there's some other subsequent activity.
Start an HTTP sniffer, make the request using the web browser and watch the traffic. Once you've done that, try it with the python script and see if the request is being made, and what the difference is in the HTTP traffic. This should help identify where the disconnect is.
An HTTP GET doesn't need any specific code or action on the client side: it's just the base URL (http://server/) + path + optional query string.
If the URL is correct, then the code above should work. Some pointers what you can try next:
Is the URL really correct? Use Firebug or a similar tool to watch the network traffic which gives you the full URL plus any header fields from the HTTP request.
Maybe the site requires you to log in, first. If so, make sure you set up cookies correctly.
Some sites require a correct "Referer" field (to protect themselves against deep linking). Add the Referer header which your browser used to the request.
The log file of the server is a great source of information for troubleshooting such problems, when you have access to it.
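To illustrate the cookie and Referer points, here is a sketch with urllib2 and cookielib; the URLs and header values are placeholders (copy the ones your browser actually sends):

import cookielib
import urllib2

# Keep cookies across requests, the way a browser would
cj = cookielib.CookieJar()
opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(cj))
opener.addheaders = [
    ("User-Agent", "Mozilla/5.0"),
    ("Referer", "http://www.locationary.com/"),
]

# Visit the site (or its login page) first so any session cookies get set
opener.open("http://www.locationary.com/")

# Then make the GET with the same cookies and headers
finalURL = "<the proxy.jsp URL built in the question>"
response = opener.open(finalURL)
print response.getcode()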