I'm trying to get the current contract prices on this page into a string: http://www.cmegroup.com/trading/equity-index/us-index/e-mini-sandp500.html
I would really like a python 2.6 solution.
It was easy to get the page HTML using urllib, but this number seems to be live and isn't in that HTML. When I inspect the element in Chrome, it's in a td element with some class on it.
But I don't know how to get at this with Python. I tried BeautifulSoup (but after several attempts gave up getting a tar.gz to work on my Windows x64 system), and then ElementTree, but really my programming interest is data analysis. I'm not a website designer and don't really want to become one, so it's all kind of a foreign language. Is this live price XML?
Any assistance gratefully received. Ideally a simple to install module and some actual code, but all hints and tips very welcome.
It looks like the numbers in the table are filled in by JavaScript, so just fetching the HTML with urllib or another library won't be enough, since those libraries don't run the JavaScript. You'll need a library like PyQt to simulate the browser rendering the page and executing the JS that fills in the numbers, then scrape the resulting HTML.
See this blog post on working with PyQt: http://blog.motane.lu/2009/07/07/downloading-a-pages-content-with-python-and-webkit/
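In outline, the technique from that post looks something like this - a minimal sketch assuming PyQt4 with its WebKit module is installed:

import sys
from PyQt4.QtCore import QUrl
from PyQt4.QtGui import QApplication
from PyQt4.QtWebKit import QWebPage

class Render(QWebPage):
    """Load a URL, run its JavaScript, and keep the final HTML."""
    def __init__(self, url):
        self.app = QApplication(sys.argv)
        QWebPage.__init__(self)
        self.loadFinished.connect(self._finished)
        self.mainFrame().load(QUrl(url))
        self.app.exec_()  # blocks until _finished calls quit()

    def _finished(self, ok):
        self.html = unicode(self.mainFrame().toHtml())
        self.app.quit()

r = Render("http://www.cmegroup.com/trading/equity-index/us-index/e-mini-sandp500.html")
print r.html  # the rendered HTML, with the live numbers filled in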
If you look at that website with something like Firebug, you can see the AJAX calls it's making. For instance, the initial values are being filled in with an AJAX call (at least for me) to:
http://www.cmegroup.com/CmeWS/md/MDServer/V1/Venue/G/Exchange/XCME/FOI/FUT/Product/ES?currentTime=1292780678142&contractCDs=,ESH1,ESM1,ESU1,ESZ1,ESH2,ESH1,ESM1,ESU1,ESZ1,ESH2
This returns a JSON response, which is then parsed by JavaScript to fill in the table. It would be pretty simple to do that yourself with urllib and then use simplejson (or the stdlib json module) to parse the response.
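A minimal sketch of that approach, for Python 2.6 as requested - note the currentTime and contract codes in the URL are copied from the call above and will likely need refreshing for the current session:

import urllib2
import json  # stdlib in 2.6; simplejson works the same way

# the AJAX URL captured above
url = ("http://www.cmegroup.com/CmeWS/md/MDServer/V1/Venue/G/Exchange/"
       "XCME/FOI/FUT/Product/ES?currentTime=1292780678142"
       "&contractCDs=,ESH1,ESM1,ESU1,ESZ1,ESH2,ESH1,ESM1,ESU1,ESZ1,ESH2")

data = json.loads(urllib2.urlopen(url).read())

# the structure of the response is an assumption here -- pretty-print
# it first and pick out the price fields you need
print json.dumps(data, indent=2)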
Also, you should read this disclaimer very carefully. What you are trying to do is probably not cool with the owners of the website.
It's hard to know what to tell you without knowing where the number is coming from. It could also be generated server-side by PHP or ASP, so you're going to have to figure out where the number actually comes from first.
Full disclaimer - I'm not a programmer. I'm trying to get the 12-month rent price (currently 1,976) by scraping the following webpage: https://www.essexapartmenthomes.com/apartments/bonita-cedars/floor-plans-and-pricing. My problem is that when I enter the commands below into my shell terminal, no results are returned even though I expect some sort of information. I thought this would be relatively straightforward from the tutorials I've watched, but this website looks to be structured differently (perhaps it's more complex). I used SelectorGadget to verify that the CSS selector is correct. What am I missing?
scrapy shell "https://www.essexapartmenthomes.com/apartments/bonita-cedars/floor-plans-and-pricing"
response.css('.pricing-list::text').extract()
It's not going to be that easy since the linked page relies heavily on JavaScript. You have two options:
You can use a rendering engine like Splash to render the JavaScript after you load the page, and see whether you can then extract the data (sketched below)
Or you can find the endpoints the site uses to fetch the data and call those endpoints yourself.
Either way, it's not going to be as trivial as you thought, and it might be a good idea to consult someone with experience.
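Here is a minimal sketch of the first option, assuming the scrapy-splash plugin is installed and configured in settings.py and a Splash instance is running locally; the wait time is a guess you'll need to tune:

import scrapy
from scrapy_splash import SplashRequest  # requires scrapy-splash setup in settings.py

class RentSpider(scrapy.Spider):
    name = "rent"

    def start_requests(self):
        url = ("https://www.essexapartmenthomes.com/apartments/"
               "bonita-cedars/floor-plans-and-pricing")
        # render the page with Splash, pausing so its JavaScript can run
        yield SplashRequest(url, self.parse, args={"wait": 2})

    def parse(self, response):
        # the same selector as before, now run against the rendered HTML
        yield {"prices": response.css(".pricing-list::text").extract()}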
I want to get some information from a web page. I use requests.get to fetch the page, but I cannot find what I want in the result. Checking carefully, I found that the info I want is in a list with a scrollbar: when I drag the scrollbar down, more and more info is loaded. So I guess not all of the info in the list has been loaded yet when I get the page with the requests module. I want to know what happens in this process and how I can gather the information I want. (I am not familiar with HTML.)
I want to know what happens in this process
It sounds like when the user scrolls, the scrolling causes some JavaScript (JS) to execute, and the JS makes repeated requests to the server for more data. Unfortunately, the requests module cannot execute the JavaScript on an HTML page--all you get back is the text of the JS. The inability to execute a page's JavaScript, and therefore to retrieve what the user actually sees, has been a problem for a long time. Fortunately, smart programmers have largely solved it. You need a different module: check out the selenium module (a sketch follows at the end of this answer).
I am not familiar with HTML
Scraping web pages can get really tricky really fast, and some web pages proactively try to prevent programs from scraping their content, so you need to know both HTML and JS in order to figure out what is going on.
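For the scrolling problem above, a minimal selenium sketch might look like this - the URL is a placeholder and the scroll count is a guess:

import time
from selenium import webdriver

driver = webdriver.Chrome()  # requires chromedriver on your PATH
driver.get("http://example.com/page-with-the-list")  # placeholder URL

# scroll to the bottom a few times so the JavaScript loads more items
for _ in range(10):
    driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
    time.sleep(1)  # give the requests triggered by scrolling time to finish

html = driver.page_source  # now includes the lazily loaded items
driver.quit()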
I know there are lxml and BeautifulSoup, but they won't work for my project, because I don't know in advance the HTML format of the site I'll be scraping an article from. Is there a Python module similar to Readability that does a pretty good job of finding the content of an article and returning it?
It's possible to do this using PhantomJS (C++) or PyPhantomJS (Python).
They're both headless WebKit-based browsers that you can fully control from JavaScript, which I find makes it really easy to do things such as scrape the content of an article.
PyPhantomJS also has a plugin system, so that's definitely a plus. :)
Extracting the real content from a content page cannot be done automatically - at least not with standard tools. You have to identify where the real content is stored yourself (by specifying the relevant CSS ID or class in your own HTML extraction code).
Using HTQL, the query is:
&html_main_text
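For completeness, a minimal sketch of running that query from Python, assuming the htql module is installed - the article URL is a placeholder:

import urllib2
import htql  # the HTQL binding for Python

# fetch an article page and let HTQL's built-in heuristic pull out
# the main body text
page = urllib2.urlopen("http://example.com/some-article").read()
results = htql.query(page, "&html_main_text")
print results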
I am using BeautifulSoup and urllib2 to download HTML pages and parse them. The problem is malformed HTML pages. Though BeautifulSoup is good at handling malformed HTML, it's still not as good as Firefox.
Considering that Firefox or WebKit are more up to date and resilient at handling HTML, I think it's ideal to use one of them to construct and normalize the DOM tree of a page, and then manipulate it through Python.
However, I can't find any Python binding for this. Can anyone suggest a way?
I ran into some solutions based on running a headless Firefox process and manipulating it through Python, but is there a more Pythonic solution available?
Perhaps pywebkitgtk would do what you need.
See http://wiki.python.org/moin/WebBrowserProgramming
There are quite a lot of options - I'm maintaining the page above so that I don't keep repeating myself.
You should look at pyjamas-desktop: see the examples/uitest example, because we use exactly this trick to get copies of the HTML page "out", so that the Python-to-JavaScript compiler can be tested by comparing the page results after each unit test.
Each of the runtimes supported and used by pyjamas-desktop is capable of allowing access to the "innerHTML" property of the document's body element (and a hell of a lot more).
Bottom line: it is trivial to do what you want to do, but you have to know where to look to find out how to do it.
l.
You might like PyWebkitDFB from http://www.gnu.org/software/pythonwebkit/
I would like to write a program that will find bus stop times and update my personal webpage accordingly.
If I were to do this manually I would
Visit www.calgarytransit.com
Enter a stop number, e.g. 9510
Click the button "next bus"
The results may look like the following:
10:16p Route 154
10:46p Route 154
11:32p Route 154
Once I've grabbed the time and routes then I will update my webpage accordingly.
I have no idea where to start. I know diddly squat about web programming but can write some C and Python. What are some topics/libraries I could look into?
Beautiful Soup is a Python library designed for parsing web pages. Between it and urllib2 (urllib.request in Python 3) you should be able to figure out what you need.
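As a starting point, here is a minimal fetch-and-parse sketch with urllib.request and bs4; note the "next bus" lookup is really a form submission, so you would still need to find the form's target URL and fields yourself:

import urllib.request
from bs4 import BeautifulSoup

# grab the front page and parse it -- this just shows the pattern
html = urllib.request.urlopen("http://www.calgarytransit.com").read()
soup = BeautifulSoup(html, "html.parser")

# dump the text of every table cell as a starting point for locating
# the departure times
for td in soup.find_all("td"):
    print(td.get_text(strip=True))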
What you're asking about is called "web scraping." I'm sure if you google around you'll find some stuff, but the core notion is that you want to open a connection to the website, slurp in the HTML, parse it and identify the chunks you want.
The Python Wiki has a good lot of stuff on this.
Since you write in C, you may want to check out cURL; in particular, take a look at libcurl. It's great.
You can use the mechanize library that is available for Python: http://wwwsearch.sourceforge.net/mechanize/
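A minimal sketch of submitting the stop-number form with mechanize - the form index and field name here are assumptions, so list the page's forms to discover the real ones:

import mechanize

br = mechanize.Browser()
br.open("http://www.calgarytransit.com")

# the form index and field name are assumptions -- run
# `for f in br.forms(): print f` to see what the page actually has
br.select_form(nr=0)
br["stopNumber"] = "9510"
response = br.submit()
print response.read()  # the results page with the bus times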
You can use Perl to help you complete your task.
use strict;
use warnings;
use LWP::UserAgent;

my $browser = LWP::UserAgent->new;
my $response = $browser->get("http://google.com");

# $response->is_success reports whether the request worked
print $response->content;
Your response object can tell you whether the request succeeded, as well as giving you the content of the page. You can also use this same library to post to a page.
Here is some documentation. http://metacpan.org/pod/LWP::UserAgent
That site doesn't offer an API for getting the data you need, so you'll have to parse the actual HTML page returned by, for example, a cURL request.
This is called Web scraping, and it even has its own Wikipedia article where you can find more information.
Also, you might find more details in this SO discussion.
As long as the layout of the web page you're trying to 'scrape' doesn't change regularly, you should be able to parse the HTML with any modern programming language.