Data extraction from the web - Python

I am planning to do data extraction from web sources (web scraping) as part of my work. I would like to extract information within a 10 km radius of my company.
I would like to extract information such as condominiums: their addresses, number of units, and price per square foot. Other things too, like the number of schools, kindergartens, and hotels in the area.
I understand I need to extract from a few sources/webpages. I will also be using Python.
I would like to know which library or libraries I should be using. Is web scraping the only means? Can we extract info from Google Maps?
Also, if anyone has any experience, I would really appreciate it if you could guide me on this.
Thanks a lot, guys.

For Google Maps, try the API. Using web scraping tools to extract Maps data is against Google's Terms of Service.
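A minimal sketch of what the API route can look like, using a Places "nearby search" call; the coordinates and API key are placeholders, and you should verify the endpoint, fields, and billing requirements against Google's current documentation:

```python
# A minimal sketch, not a definitive implementation: a Places API
# "nearby search" for schools within 10 km. The coordinates and the
# API key are placeholders.
import requests

params = {
    "location": "1.3000,103.8000",  # placeholder lat,lng of your company
    "radius": 10000,                # metres (10 km)
    "type": "school",               # also try "lodging" for hotels
    "key": "YOUR_API_KEY",          # placeholder
}
resp = requests.get(
    "https://maps.googleapis.com/maps/api/place/nearbysearch/json",
    params=params,
)
for place in resp.json().get("results", []):
    print(place["name"], place.get("vicinity"))
```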
If you are using Python, it has very nice libraries for this purpose: BeautifulSoup and Scrapy.
Other means? You can extract POIs (points of interest) from OSM (OpenStreetMap) data; try the open-source tools. Property info? It may be available from your county / state government office, so give it a try.
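For the OSM route, a minimal sketch using the public Overpass API; the coordinates are placeholders for your company's location, and the amenity/tourism values are standard OSM tags:

```python
# A minimal sketch: query Overpass for schools, kindergartens, and
# hotels within a 10 km radius of a point.
import requests

lat, lon = 1.3000, 103.8000  # placeholder coordinates
query = f"""
[out:json];
(
  node["amenity"="school"](around:10000,{lat},{lon});
  node["amenity"="kindergarten"](around:10000,{lat},{lon});
  node["tourism"="hotel"](around:10000,{lat},{lon});
);
out;
"""
resp = requests.post("https://overpass-api.de/api/interpreter", data={"data": query})
for element in resp.json()["elements"]:
    tags = element.get("tags", {})
    print(tags.get("name"), element["lat"], element["lon"])
```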

How to use a URL to get .csv data in Python

First post - be gentle!
I am starting to learn Python and would like to get information from a table on a web page (https://en.wikipedia.org/wiki/European_Union#Demographics) into a pandas DataFrame.
I am using Google Colab, and from researching a bit I understand the process has something to do with 'web scraping': turning HTML into CSV.
Any thoughts welcome, please. Worth noting I am constrained by not being able to download additional software due to the secure nature of my work.
Thanks.
You need a library to help you parse the HTML - a well-known library for that in Python would be BeautifulSoup.
There are also some available tools online that do this kind of thing for you, and you can take some inspiration from them, even if you can't use them directly: https://wikitable2csv.ggor.de/
As you can see, the website above uses the CSS selector "table.wikitable" to identify the tables.
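Since you are in Colab, where requests, bs4, pandas, and lxml are preinstalled (so nothing extra to download), a minimal sketch combining that selector with pandas:

```python
# A minimal sketch: fetch the page, select the "wikitable" tables,
# and let pandas convert one into a DataFrame / CSV.
import requests
import pandas as pd
from bs4 import BeautifulSoup

url = "https://en.wikipedia.org/wiki/European_Union"
html = requests.get(url, headers={"User-Agent": "Mozilla/5.0"}).text

soup = BeautifulSoup(html, "html.parser")
tables = soup.select("table.wikitable")  # all data tables on the page

# pandas turns an HTML table straight into a DataFrame; pick the
# index of the demographics table you're after.
df = pd.read_html(str(tables[0]))[0]
df.to_csv("eu_table.csv", index=False)
```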
You can use Scrapy, a Python-based scraping framework, to fetch and parse the data as required. In Scrapy, you create spiders which crawl a set of URLs you have initialized. Furthermore, you can parse the HTML data using something like Beautiful Soup to get your table from the response. The Scrapy documentation is itself pretty useful and should get you set up quickly! Scrapy also lets you export the parsed data as CSV, which should help you with the export part.
All the best!
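A minimal spider sketch along those lines; the selector and output field are illustrative rather than exact:

```python
# A minimal Scrapy spider sketch. Run it with:
#   scrapy runspider eu_tables.py -o tables.csv
import scrapy

class EUTableSpider(scrapy.Spider):
    name = "eu_tables"
    start_urls = ["https://en.wikipedia.org/wiki/European_Union"]

    def parse(self, response):
        # Walk the rows of each "wikitable" on the page and emit
        # the cell texts; Scrapy's -o flag handles the CSV export.
        for row in response.css("table.wikitable tr"):
            cells = row.css("td::text").getall()
            if cells:
                yield {"cells": " | ".join(cells)}
```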

Python Web Scripting

I wanted to do this before for some websites but didn't know where to start. This time, however, I am adamant. I am talking about the scripts where we crawl a website and extract the data we require. My target is this: I have to appear for job interviews in December. There is this site (http://www.geeksforgeeks.org/) which contains a large number of questions from previous interviews (like http://www.geeksforgeeks.org/amazon-interview-set-42-on-campus/ & http://www.geeksforgeeks.org/adobe-interview-set-6-campus-mts-1/). Every title has the word "set" and a number in it. It is quite cumbersome to keep track of what I have done and what I have not. So I want to extract the questions from each of these pages and put them in a PDF with the title.
How can I do this using curl, regex and Scrapy? I am intermediate in C/C++/Java but have only beginner proficiency in Python. Any help is much appreciated. Also, point me to any such scripts you know of. I want to do this on my own; I just require a starting point and some guidance. Thanks.
If you want just a starting point, try Scrapy, a screen-scraping framework for Python. I would also recommend the requests library for making HTTP requests. It's by far the simplest option (with no loss of power).
Also, don't try to parse HTML or XML with a regex. Just don't. Use one of the fine libraries available (BeautifulSoup or lxml, or lxml with a BeautifulSoup backend, are the most popular, but there are others).
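A minimal requests + BeautifulSoup sketch for one interview page; the h1/p tags are an assumption about the site's markup, so inspect the real HTML and adjust:

```python
# A minimal sketch, not regex-based: fetch one interview page and
# pull out its title and paragraphs.
import requests
from bs4 import BeautifulSoup

url = "http://www.geeksforgeeks.org/amazon-interview-set-42-on-campus/"
soup = BeautifulSoup(requests.get(url).text, "html.parser")

title = soup.find("h1").get_text(strip=True)       # assumed tag
questions = [p.get_text(" ", strip=True) for p in soup.find_all("p")]

# Dump to plain text first; turning text into a PDF (e.g. with
# reportlab) can be a separate step once extraction works.
with open("questions.txt", "a") as f:
    f.write(title + "\n\n" + "\n".join(questions) + "\n\n")
```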

Python scripting for XBMC

I am new to programming and to Python itself; I have no programming experience. I have managed to read up on Python and have done some fairly basic Python tutorials, and now I am ready for my first project in Python.
I am basing my project around XBMC; I want to develop some addons for this awesome media center.
I have a few websites that I want to scrape and display in XBMC. One is a music website and one is a paid TV website which is only available to people with accounts. I have managed to scrape a website with feedparser, but I have no idea how to output these titles and links to play in XBMC.
My question here is: where do I start, how do I construct the script for these websites, and what tools/libraries/modules do I need? And what do I need to do to include it in XBMC?
On the general topic of webpage scraping, which has been asked about a ton of times, the common answer is always Mechanize/Beautiful Soup for Python. That would allow you to actually get your data.
Once you have your data, it's then just a matter of formatting it the way you want for your XBMC app: http://wiki.xbmc.org/index.php?title=HOW-TO:Write_Python_Scripts_for_XBMC
It's a two-step process:
1. Get your data from a source and format it into some common structure.
2. Use the common structure to populate your elements in the XBMC script.
What you actually want to do with your script will determine how you would use your data. If it's simply providing information, then the link above pretty much explains it.
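A minimal sketch of both steps, assuming an RSS-style music feed parsed with feedparser; the feed URL is a placeholder, and the xbmcplugin/xbmcgui calls follow the classic XBMC addon API from the wiki link above, so check them against your XBMC version:

```python
# A minimal two-step sketch: feedparser for the data, xbmcplugin
# for the display. URL is a placeholder.
import sys
import feedparser
import xbmcgui
import xbmcplugin

# Step 1: get data from the source into a common structure.
feed = feedparser.parse("http://example.com/music/rss")  # placeholder
items = [(entry.title, entry.link) for entry in feed.entries]

# Step 2: use that structure to populate XBMC list items.
handle = int(sys.argv[1])  # XBMC passes the plugin handle as argv[1]
for title, url in items:
    li = xbmcgui.ListItem(label=title)
    xbmcplugin.addDirectoryItem(handle, url, li)
xbmcplugin.endOfDirectory(handle)
```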

Mining Wikipedia for mapping relations for text mining

I am planning to develop a web-based application which could crawl Wikipedia to find relations and store them in a database. By relations, I mean searching for a name, say 'Bill Gates', finding his page, downloading it, and pulling various pieces of information from the page to store in a database. Information may include his date of birth, his company, and a few other things. But I need to know if there is any way to find these unique data points on the page, so that I can store them in a database. Any specific books or algorithms would be greatly appreciated. Mentions of good open-source libraries would also be helpful.
Thank You
If you haven't already, you should have a look at DBpedia. Many categories of wiki articles have "Infoboxes" for the kinds of information you describe, and they've made a database out of it:
http://en.wikipedia.org/wiki/DBpedia
You might also leverage some of the information in Metaweb's Freebase (which overlaps with, and I believe may even integrate, the info from DBpedia). They have an API for querying their graph database, and there's a Python wrapper for it called freebase-python.
UPDATE: Freebase is no more; it was acquired by Google and eventually folded into the Google Knowledge Graph. There is an API, but I don't think they have anything like the formal syncing Freebase had with public sources like Wikipedia. I'm personally disappointed with how this turned out. :-/
As for the natural language processing bit, if you do make headway on that problem you might consider these databases as repositories for any information you do mine.
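To give a feel for the DBpedia route, a minimal sketch querying its public SPARQL endpoint for one of the infobox facts mentioned above; the dbr/dbo prefixes are predefined on the endpoint, but verify property names against dbpedia.org:

```python
# A minimal sketch: ask DBpedia's SPARQL endpoint for Bill Gates's
# date of birth, as extracted from the Wikipedia infobox.
import requests

query = """
SELECT ?birthDate WHERE {
  dbr:Bill_Gates dbo:birthDate ?birthDate .
}
"""
resp = requests.get(
    "https://dbpedia.org/sparql",
    params={"query": query, "format": "application/sparql-results+json"},
)
for row in resp.json()["results"]["bindings"]:
    print(row["birthDate"]["value"])
```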
You mention Python and open source, so I would investigate NLTK (the Natural Language Toolkit). Text mining and natural language processing are areas where you can do a lot with a dumb algorithm (e.g. pattern matching), but if you want to go a step further and do something more sophisticated - i.e. trying to extract information that is stored in a flexible manner, or trying to find information that might be interesting but is not known a priori - then natural language processing should be investigated.
NLTK is intended for teaching, so it is a toolkit. This approach suits Python very well. There are a couple of books for it as well; the O'Reilly book is also published online with an open license. See NLTK.org.
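A tiny NLTK sketch of that "step further": tokenizing, tagging, and chunking named entities out of a sentence. It assumes you have fetched the standard data packages (tokenizers, taggers, chunkers) via nltk.download():

```python
# A minimal sketch: named-entity chunking with NLTK.
import nltk

sentence = "Bill Gates co-founded Microsoft in 1975."
tokens = nltk.word_tokenize(sentence)
tagged = nltk.pos_tag(tokens)
tree = nltk.ne_chunk(tagged)
print(tree)  # entities appear as labelled subtrees (PERSON, ORGANIZATION, ...)
```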
Jvc, there are existing Python modules that can do everything you mentioned above.
For pulling information from webpages, I like to use Selenium (http://seleniumhq.org/projects/ide/). Basically, you can locate and retrieve information on any webpage using a number of identifiers (id, XPath, etc.).
However, like winwaed said, it can be inflexible if you are simply "pattern matching", especially since some websites use dynamic code, meaning the identifiers can change with each subsequent reload of the page. But this problem can be solved by adding regular expressions, i.e. (.*), to your code. Check out this YouTube video: http://www.youtube.com/watch?v=Ap_DlSrT-iE. Even though he is using BeautifulSoup to scrape the website, you can see how he uses regular expressions to pull the information from the page.
Also, I'm not sure what type of database you are working with, but pyodbc (http://code.google.com/p/pyodbc/) works with ODBC databases, including mainstream ones like SQL Server and Microsoft Access.
So, my advice is to look into Selenium for finding the info on the webpage, pyodbc to store and retrieve it, and regular expressions when the identifiers are dynamic.
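A minimal Selenium sketch of the "locate by identifier" idea; this uses the current Selenium 4 API (older tutorials use the find_element_by_* methods instead) and assumes a chromedriver on your PATH:

```python
# A minimal sketch: open a page and locate an element by CSS selector.
from selenium import webdriver
from selenium.webdriver.common.by import By

driver = webdriver.Chrome()
driver.get("http://en.wikipedia.org/wiki/Bill_Gates")

# Locate by CSS selector; By.ID and By.XPATH work the same way.
infobox = driver.find_element(By.CSS_SELECTOR, "table.infobox")
print(infobox.text)
driver.quit()
```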

Getting a list of all churches in a certain state using Python

I am pretty good with Python, so pseudo-code will suffice when the details are trivial. Please get me started on the task: how do I go about crawling the net for the snail-mail addresses of churches in my state? Once I have a one-liner such as "123 Old West Road #3 Old Lyme City MD 01234", I can probably parse it into city, state, street, number, and apartment with enough trial and error.
My problem is: if I use white pages online, how do I deal with all the HTML junk, HTML tables, ads, etc.? I do not think I need their phone numbers, but they will not hurt - I can always throw them out once parsed. Even if your solution is half-manual (such as save to PDF, then open Acrobat, save as text), I might be happy with it still. Thanks! Heck, I will even accept Perl snippets - I can translate them myself.
You could use mechanize. It's a Python library that simulates a browser, so you could crawl through the white pages (similarly to what you would do manually).
In order to deal with the 'html junk' python has a library for that too: BeautifulSoup
It is a lovely way to get the data you want out of HTML (of course it assumes you know a little bit about HTML, as you will still have to navigate the parse tree).
Update: as to your follow-up question on how to click through multiple pages, mechanize is a library to do just that. Take a closer look at their examples, especially the follow_link method. As I said, it simulates a browser, so 'clicking' can be realized quickly in Python.
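A minimal mechanize sketch of that "clicking"; the URL and link text are placeholders, so match them to the directory you're crawling:

```python
# A minimal sketch: open a page, then follow a link like a browser.
import mechanize

br = mechanize.Browser()
br.set_handle_robots(False)  # some directories serve a restrictive robots.txt
br.open("http://www.example.com/churches?page=1")  # placeholder
response = br.follow_link(text="Next")  # "clicks" the link
html = response.read()
```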
Try lynx --dump <url> to download the web pages. All the troublesome HTML tags will be stripped from the output, and all the links from the page will appear together.
What you're trying to do is called Scraping or web scraping.
If you do some searches on python and scraping, you may find a list of tools that will help.
(I have never used Scrapy, but its site looks promising :)
Beautiful Soup is a no-brainer. Here's a site you might start at: http://www.churchangel.com/. They have a huge list, and the formatting is very regular - translation: it's easy to set up BeautifulSoup to scrape it.
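A minimal sketch of that setup; the <td> guess below is hypothetical, so inspect churchangel.com's actual listing markup and adjust the selectors:

```python
# A minimal sketch: pull text from a regularly formatted listing page
# and crudely filter for address-like lines.
import requests
from bs4 import BeautifulSoup

url = "http://www.churchangel.com/"  # drill into your state from here
soup = BeautifulSoup(requests.get(url).text, "html.parser")

# Regular formatting usually means listings live in a repeated tag.
for cell in soup.find_all("td"):  # hypothetical tag choice
    text = cell.get_text(" ", strip=True)
    if any(ch.isdigit() for ch in text):  # crude address filter
        print(text)
```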
Python scripts might not be the best tool for this job if you're just looking for addresses of churches in a geographic area.
The US Census provides a data set of churches for use with geographic information systems. If finding all the X in a spatial area is a recurring problem, invest in learning a GIS. Then you can bring your Python skills to bear on many geographic tasks.
