Grabbing text from a webpage

I would like to write a program that will find bus stop times and update my personal webpage accordingly.
If I were to do this manually, I would:
1. Visit www.calgarytransit.com
2. Enter a stop number, e.g. 9510
3. Click the "next bus" button
The results may look like the following:
10:16p Route 154
10:46p Route 154
11:32p Route 154
Once I've grabbed the times and routes, I will update my webpage accordingly.
I have no idea where to start. I know diddly squat about web programming but can write some C and Python. What are some topics/libraries I could look into?

Beautiful Soup is a Python library designed for parsing web pages. Between it and urllib2 (urllib.request in Python 3) you should be able to figure out what you need.

What you're asking about is called "web scraping." I'm sure if you google around you'll find some stuff, but the core notion is that you want to open a connection to the website, slurp in the HTML, parse it and identify the chunks you want.
The Python Wiki has a good lot of stuff on this.
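To make the "slurp in the HTML, parse it and identify the chunks you want" idea concrete, here is a minimal sketch using only the Python 3 standard library (html.parser stands in for Beautiful Soup here). The sample markup, the regular expression and the names fetch_html, TextCollector and parse_arrivals are my own assumptions based on the sample results in the question; the real page will almost certainly be laid out differently.

```python
import re
from html.parser import HTMLParser
from urllib.request import urlopen


def fetch_html(url):
    """Download a page and return its HTML as text (not called in this sketch)."""
    with urlopen(url, timeout=10) as response:
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")


class TextCollector(HTMLParser):
    """Collect the non-empty text chunks of a page, ignoring the tags."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        text = data.strip()
        if text:
            self.chunks.append(text)


# Assumed format, taken from the sample results in the question: "10:16p Route 154"
ARRIVAL = re.compile(r"(\d{1,2}:\d{2}[ap])\s+Route\s+(\d+)")


def parse_arrivals(html):
    """Return (time, route) pairs found in the text of the page."""
    collector = TextCollector()
    collector.feed(html)
    collector.close()
    return ARRIVAL.findall("\n".join(collector.chunks))


# Hypothetical markup standing in for the real results page
SAMPLE_HTML = """
<html><body><ul>
  <li>10:16p Route 154</li>
  <li>10:46p Route 154</li>
  <li>11:32p Route 154</li>
</ul></body></html>
"""

print(parse_arrivals(SAMPLE_HTML))
```

Against a live site you would pass the result of fetch_html(...) to parse_arrivals instead of SAMPLE_HTML, once you know which URL actually returns the results.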

Since you write in C, you may want to check out cURL; in particular, take a look at libcurl. It's great.

You can use the mechanize library that is available for Python http://wwwsearch.sourceforge.net/mechanize/

You can use Perl to help you complete your task.
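use strict;
use warnings;
use LWP::UserAgent;
my $browser = LWP::UserAgent->new;
my $response = $browser->get("http://google.com");
# Stop with the HTTP status line if the request failed
die "Request failed: " . $response->status_line . "\n" unless $response->is_success;
print $response->decoded_content;
The response object can tell you whether the request succeeded, and it also returns the content of the page. You can use the same library to POST to a page.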
Here is some documentation. http://metacpan.org/pod/LWP::UserAgent

That site doesn't offer an API that would give you the data you need. In that case you'll need to parse the actual HTML page returned by, for example, a cURL request.

This is called Web scraping, and it even has its own Wikipedia article where you can find more information.
Also, you might find more details in this SO discussion.

As long as the layout of the web page you're trying to scrape doesn't change regularly, you should be able to parse the HTML with any modern programming language.

Related

Python Scrapy Isn't Extracting Data

Full disclaimer - I'm not a programmer. I'm trying to get the 12 month rent price (which is currently 1,976) by scraping the following webpage - https://www.essexapartmenthomes.com/apartments/bonita-cedars/floor-plans-and-pricing. My problem is that when I enter the below into my shell terminal, no results are being returned even though I expect some sort of information. I thought this would have been relatively straightforward from the tutorials I've watched, but this website looks to be structured differently (perhaps more complex). I used SelectorGadget to verify the CSS Selector is correct. What am I missing?
scrapy shell "https://www.essexapartmenthomes.com/apartments/bonita-cedars/floor-plans-and-pricing"
response.css('.pricing-list::text').extract()

It's not going to be that easy since the linked page relies heavily on JavaScript. You have two options:
You can use a rendering engine like Splash to render the JavaScript after you load the page and see if you can extract the data.
Or you can see which endpoints the site uses to fetch the data and request them yourself.
Either way, it's not going to be as trivial as you thought, and it might be a good idea to consult someone with experience.

Suggestion on creation app for getting information from webpage

First, I want to say that I have experience with Python and some web libraries like mechanize, Beautiful Soup and urllib2.
The idea is to create an app that will grab information from the webpage I'm currently looking at in my web browser, and then store it.
For example:
I manually go to the website and create a user.
Then I run my app, which grabs some details from the webpage I'm currently looking at, like user name, first name, last name and so on.
Problems:
I don't know how to make a program run on top of my web browser. I can't simply write a script that logs in to this webpage and does the rest with Beautiful Soup, because the site has very good protection against web crawlers and bots.
I need a place to start. So the main question is: is it possible to grab information that is currently displayed in my web browser? If yes, I hope to hear some suggestions on how to make my program look at the browser.
Please feel free to ask if you don't understand what I'm asking, or if you have suggestions for libraries I can use.

The easiest thing to do is probably to save the HTML content of the current page to a file (using File -> Save Page As or whatever it is in your browser) and then running Beautiful Soup / lxml.html / whatever on that file.
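As a rough illustration of the "save the page, then parse the file" approach, here is a sketch that uses only the Python 3 standard library instead of Beautiful Soup or lxml. The element ids (username, first-name, last-name), the file name and the FieldExtractor / extract_fields names are invented for the example; you would replace them with whatever the saved page really contains.

```python
import tempfile
from html.parser import HTMLParser
from pathlib import Path


class FieldExtractor(HTMLParser):
    """Collect the text inside elements whose id is one of `wanted_ids`.

    Assumes the wanted elements contain plain text only (no nested tags).
    """

    def __init__(self, wanted_ids):
        super().__init__()
        self.wanted_ids = set(wanted_ids)
        self.current = None  # id of the wanted element we are inside, if any
        self.fields = {}

    def handle_starttag(self, tag, attrs):
        element_id = dict(attrs).get("id")
        if element_id in self.wanted_ids:
            self.current = element_id
            self.fields[element_id] = ""

    def handle_data(self, data):
        if self.current is not None:
            self.fields[self.current] += data

    def handle_endtag(self, tag):
        self.current = None


def extract_fields(path, wanted_ids):
    """Parse a saved HTML file and return {id: text} for the wanted ids."""
    parser = FieldExtractor(wanted_ids)
    parser.feed(Path(path).read_text(encoding="utf-8"))
    parser.close()
    return {key: value.strip() for key, value in parser.fields.items()}


# Stand-in for a page saved from the browser with "Save Page As"
with tempfile.TemporaryDirectory() as folder:
    saved_page = Path(folder) / "profile.html"
    saved_page.write_text(
        '<html><body><span id="username">jdoe</span> '
        '<span id="first-name">Jane</span> '
        '<span id="last-name">Doe</span></body></html>',
        encoding="utf-8",
    )
    details = extract_fields(saved_page, ["username", "first-name", "last-name"])

print(details)
```

Because the browser has already logged in and rendered the page, the script never talks to the site itself, which sidesteps the bot protection mentioned in the question.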
You could probably also get Selenium to do what you want, though I've never used it and am not sure.

Read all pages within a domain

I am using the urllib library to fetch pages. Typically I have the top-level domain name & I wish to extract some information from EVERY page within that domain. Thus, if I have xyz.com, I'd like my code to fetch the data from xyz.com/about etc. Here's what I am using:
import urllib,re
htmlFile = urllib.urlopen("http://www.xyz.com/"+r"(.*)")
html = htmlFile.read()
...............
This does not do the trick for me though. Any ideas are appreciated.
Thanks.
-T

I don't know why you would expect domain.com/(.*) to work. You need to have a list of all the pages (dynamic or static) within that domain. Your python program cannot automatically know that. This knowledge you must obtain from elsewhere, either by following links or looking at the sitemap of the website.
As a footnote, scraping is a slightly shady business. Always make sure, no matter what method you employ, that you are not violating any terms and conditions.
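If the site publishes a sitemap, reading the page list from it could look roughly like the sketch below. It assumes the sitemap follows the standard sitemaps.org XML format; the sample document and the urls_from_sitemap name are made up for illustration, and not every site has a sitemap at all.

```python
import xml.etree.ElementTree as ET

# Namespace used by sitemaps that follow the sitemaps.org protocol
SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def urls_from_sitemap(xml_text):
    """Return every <loc> URL listed in a sitemap document."""
    root = ET.fromstring(xml_text)
    return [
        loc.text.strip()
        for loc in root.findall("sm:url/sm:loc", SITEMAP_NS)
        if loc.text
    ]


# Hypothetical sitemap for the xyz.com example from the question
SAMPLE_SITEMAP = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>http://www.xyz.com/</loc></url>
  <url><loc>http://www.xyz.com/about</loc></url>
</urlset>"""

print(urls_from_sitemap(SAMPLE_SITEMAP))
```

Each URL in the returned list can then be fetched one at a time with urllib.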

You are trying to use a regular expression on the web server. Turns out, web servers don't actually support this kind of format, so it's failing.
To do what you're trying to do, you need to implement a spider: a program that downloads a page, finds all the links within it, and decides which of them to follow. It then downloads each of those pages and repeats the process.
Some things to watch out for: loops, multiple links that end up pointing at the same page, links going outside of the domain, and getting banned from the web server for spamming it with thousands of requests.
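A bare-bones version of such a spider might look like the following sketch (Python 3, standard library only). The crawl and LinkCollector names, the max_pages limit and the FAKE_SITE dictionary are my own illustration; the fetch function is passed in so the example runs without touching the network. A real fetch function would download the URL with urllib and should pause between requests.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin, urlsplit


class LinkCollector(HTMLParser):
    """Collect the href of every <a> tag on a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)


def crawl(start_url, fetch, max_pages=100):
    """Breadth-first crawl of a single domain; returns {url: html}.

    `fetch` is any function that maps a URL to its HTML, or None on failure.
    """
    domain = urlsplit(start_url).netloc
    queue = deque([start_url])
    seen = {start_url}  # guards against loops and duplicate links
    pages = {}
    while queue and len(pages) < max_pages:
        url = queue.popleft()
        html = fetch(url)
        if html is None:
            continue
        pages[url] = html
        collector = LinkCollector()
        collector.feed(html)
        for href in collector.links:
            # Resolve relative links and drop "#fragment" parts
            link, _fragment = urldefrag(urljoin(url, href))
            # Stay inside the starting domain and skip pages already queued
            if urlsplit(link).netloc == domain and link not in seen:
                seen.add(link)
                queue.append(link)
    return pages


# Hypothetical three-page site used in place of real HTTP requests
FAKE_SITE = {
    "http://www.xyz.com/": '<a href="/about">About</a> <a href="/contact#form">Contact</a>',
    "http://www.xyz.com/about": '<a href="/">Home</a> <a href="http://other.example/">Elsewhere</a>',
    "http://www.xyz.com/contact": '<a href="/about">About</a>',
}

pages = crawl("http://www.xyz.com/", FAKE_SITE.get)
print(sorted(pages))
```

The seen set handles loops and duplicate links, the netloc comparison keeps the crawl inside the domain, and max_pages is a crude cap on how many requests get sent.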

In addition to @zigdon's answer, I recommend you take a look at the Scrapy framework.
CrawlSpider will help you to implement crawling quite easily.

Scrapy has this functionality built in, so you don't need to write the recursive link-following yourself. It works asynchronously and handles all the heavy lifting for you. Just specify your domain, your search terms and how deep you want it to crawl (for example, the whole site).
http://doc.scrapy.org/en/latest/index.html

Is there a python module which web scrapes the image, title and a description of any link?

What I'm looking for, should give me something like this ->

There are many APIs available that can accomplish your task (more precisely, the task you describe in your question, not the image :) ). I personally use diffbot, which I discovered after reading this. Beware though: this kind of "content" extraction does not always succeed, because of the nature of web pages. It relies on heuristics and training, and thus may not suffice for your specific purposes...

If you want a screenshot of the entire page, then something like https://stackoverflow.com/questions/1041371/alexa-api may help.
Otherwise, if you just want to get a few key images from the page, you could use mechanize to assist you. When you connect to a webpage, you can search through all the links on the page using:
for link in br.links():
where br is your browser object.
You can see an example here:
Download all the links(related documents) on a webpage using Python
If you print dir(link), it will show you various properties such as link.text and link.url. Furthermore, you can import urlsplit from the urlparse module and use it on the URL. You can direct the browser to the URL and scrape the images as shown in the above example.
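If you would rather not depend on mechanize, a rough standard-library sketch (Python 3) for pulling a title, a description and image URLs out of a page you have already downloaded could look like this. The PreviewParser name and the sample markup are made up, and it only looks at the title tag, the "description" meta tag and img tags, so treat it as a starting point rather than a general solution.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class PreviewParser(HTMLParser):
    """Collect the <title>, the meta description and image URLs of a page."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.title = ""
        self.description = ""
        self.images = []
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "title":
            self._in_title = True
        elif tag == "meta" and (attrs.get("name") or "").lower() == "description":
            self.description = attrs.get("content") or ""
        elif tag == "img" and attrs.get("src"):
            # Turn relative image paths into absolute URLs
            self.images.append(urljoin(self.base_url, attrs["src"]))

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data


# Hypothetical page and URL, used instead of a real download
SAMPLE_PAGE = """
<html><head><title>Example page</title>
<meta name="description" content="A short summary."></head>
<body><img src="/img/logo.png"><img src="photo.jpg"></body></html>
"""

preview = PreviewParser("http://www.example.com/articles/post.html")
preview.feed(SAMPLE_PAGE)
preview.close()
print(preview.title, preview.description, preview.images)
```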

You should really use a search engine's interpretation of the page and the images in it.
You could use the Python wrapper for the Bing API, or the xGoogle library.
Beware: the xGoogle library poses as a browser when talking to Google, which may not be an endorsed way to consume Google's data.

This one should help: http://palewi.re/posts/2008/04/20/python-recipe-grab-a-page-scrape-a-table-download-a-file/
It teaches you how to scrape content and images and store them.

Grabbing non-HTML data from a website using python

I'm trying to get the current contract prices on this page to a string: http://www.cmegroup.com/trading/equity-index/us-index/e-mini-sandp500.html
I would really like a python 2.6 solution.
It was easy to get the page html using urllib, but it seems like this number is live and not in the html. I inspected the element in Chrome and it's some td class thing.
But I don't know how to get at this with python. I tried beautifulsoup (but after several attempts gave up getting a tar.gz to work on my windows x64 system), and then elementtree, but really my programming interest is data analysis. I'm not a website designer and don't really want to become one, so it's all kind of a foreign language. Is this live price XML?
Any assistance gratefully received. Ideally a simple to install module and some actual code, but all hints and tips very welcome.

It looks like the numbers in the table are filled in by Javascript, so just fetching the HTML with urllib or another library won't be enough since they don't run the javascript. You'll need to use a library like PyQt to simulate the browser rendering the page/executing the JS to fill in the numbers, then scrape the output HTML of that.
See this blog post on working with PyQt: http://blog.motane.lu/2009/07/07/downloading-a-pages-content-with-python-and-webkit/

If you look at that website with something like Firebug, you can see the AJAX calls it's making. For instance, the initial values are being filled in with an AJAX call (at least for me) to:
http://www.cmegroup.com/CmeWS/md/MDServer/V1/Venue/G/Exchange/XCME/FOI/FUT/Product/ES?currentTime=1292780678142&contractCDs=,ESH1,ESM1,ESU1,ESZ1,ESH2,ESH1,ESM1,ESU1,ESZ1,ESH2
This returns a JSON response, which is then parsed by JavaScript to fill in the table. It would be pretty simple to do that yourself with urllib and then use simplejson to parse the response.
Also, you should read this disclaimer very carefully. What you are trying to do is probably not cool with the owners of the website.
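For what it's worth, the "fetch the JSON and parse it" step can be sketched as below. This is Python 3 with the standard json module (the stdlib descendant of simplejson, also available in Python 2.6). I have not verified the structure of the real response, so the "quotes", "code" and "last" field names and the numbers in SAMPLE_RESPONSE are invented placeholders; inspect the actual JSON and adjust accordingly.

```python
import json
from urllib.request import Request, urlopen


def fetch_json(url):
    """Request a URL and decode the JSON body (not called in this sketch)."""
    request = Request(url, headers={"User-Agent": "Mozilla/5.0"})
    with urlopen(request, timeout=10) as response:
        return json.loads(response.read().decode("utf-8"))


def last_prices(payload):
    """Map contract code to last price, assuming the made-up layout below."""
    return {quote["code"]: float(quote["last"]) for quote in payload["quotes"]}


# Invented payload with invented values; the real response will look different
SAMPLE_RESPONSE = """
{"quotes": [
    {"code": "ESH1", "last": "1237.50"},
    {"code": "ESM1", "last": "1232.25"}
]}
"""

prices = last_prices(json.loads(SAMPLE_RESPONSE))
print(prices)
```

With the real endpoint you would call fetch_json(url) in place of json.loads(SAMPLE_RESPONSE), subject to the site's terms mentioned above.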

It's hard to know what to tell you without knowing where the number is coming from. It could be PHP or ASP too, so you are going to have to figure out which language the number is in.
