I am working on a data-mining project for which I need to analyse the progress of discussion in a thread of a forum. I am interested in extracting information like time of post, stats of post's author (no. of posts, joining date, etc.), text of the post, etc.
However, while using standard scraping tools (like Scrapy in Python) I need to write regular expressions to detect these fields in the page's HTML source. As these tags vary with the type of forum, it is becoming a major problem to maintain the regular expressions for every forum. Is there a standard bank of such regular expressions available, so that they can be reused based on the type of forum?
Or is there any other technique to extract these fields from a forum's pages?
I wrote some configuration files for some major forums. Hopefully you can decipher them and infer how to parse each forum.
For VBulletin:
enclosed_section=tag:table,attributes:id;threadslist
thread=tag:a,attributes:id;REthread_title_
list_next_page=type:next_page,attributes:anchor_text;>
post=tag:div,attributes:id;REpost_message_
thread_next_page=type:next_page,attributes:anchor_text;>
enclosed_section is the element (a table here) that contains the links to all the threads
thread is where you'll find the link to each thread
list_next_page is the link to the next page with list of threads
post is the div with the post text.
thread_next_page is the link to the next page of the thread
For Invision:
enclosed_section=tag:table,attributes:id;forum_table
thread=tag:a,attributes:class;topic_title
list_next_page=tag:a,attributes:rel;next,inside_tag_attribute:href
post=tag:div,attributes:class;post entry-content |
thread_next_page=tag:a,attributes:rel;next,inside_tag_attribute:href
post_count_section=tag:td,attributes:class;stats
post_count=tag:li,attributes:,reg_exp:(\d+) Repl
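Here is a minimal sketch of how one of these config lines might be loaded and applied with BeautifulSoup. The parse_rule and extract_posts helpers and the sample HTML are my own illustrations, not part of the config format; I am assuming the RE prefix marks a prefix-style match on the attribute value.
from bs4 import BeautifulSoup

def parse_rule(line):
    # Turn 'post=tag:div,attributes:id;REpost_message_' into a dict.
    name, spec = line.split("=", 1)
    rule = {"name": name}
    for part in spec.split(","):
        key, _, value = part.partition(":")
        rule[key] = value
    return rule

def extract_posts(html, rule):
    # Find tags matching a rule; an 'RE' prefix means a prefix-style match.
    soup = BeautifulSoup(html, "html.parser")
    attr_name, _, attr_value = rule["attributes"].partition(";")
    if attr_value.startswith("RE"):
        prefix = attr_value[len("RE"):]
        matcher = lambda v: v is not None and v.startswith(prefix)
    else:
        matcher = attr_value
    return soup.find_all(rule["tag"], attrs={attr_name: matcher})

rule = parse_rule("post=tag:div,attributes:id;REpost_message_")
html = '<div id="post_message_123">Hello from a vBulletin post</div>'
for post in extract_posts(html, rule):
    print(post.get_text(strip=True))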
You'll still have to create several approaches per forum. But as Henley suggests, there are also a lot of forums that share their structure.
As for easily parsing the dates of the forum's threads, dateparser was born from this specific requirement and could be of great help.
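A quick illustration of dateparser on the kind of timestamps forums typically show; the sample strings here are made up, and the exact output depends on your locale settings:
import dateparser

for raw in ("Today, 09:15 AM", "2 hours ago", "12-24-2011, 03:07 PM"):
    print(raw, "->", dateparser.parse(raw))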
Related
I am trying to write a program to retrieve all of the links for questions that have active bounties in a specific tag. I have not yet implemented the specific-tag feature, because I am stuck just trying to get all of the links.
from re import findall
from urllib.request import urlopen

def fetch_source(url):
    return str(urlopen(url).read())

site = 'http://stackoverflow.com/?tab=featured'

def fetch_links(source):
    source = fetch_source(source)
    return findall("\/questions\/[0-9]*\/(?:[A-z]|\-)+", source)

print(fetch_links(site))
This will fetch many of the links, but it misses a lot of them because my regex only allows [A-z]|\- in the title. I'm not sure how to fix this, though, because some questions have quotation marks in their titles, and if I allow those, I won't know where the question link ends.
I'm sorry for being new to python, but I am just trying to figure stuff out.
Using regex would become completely infeasible for getting questions by specific tag.
You are correct that your regex is missing a lot of titles, but using findall really isn't appropriate in this situation. Beautiful Soup is a much better tool for retrieving links, and I recommend you look into it.
In this instance, however, the Stack Exchange API has you covered.
For similar questions, just search (or Google) through the API documentation until you see the feature you're looking for, in your case featured questions.
Enter the parameters you want, and the API documentation will generate a link:
https://api.stackexchange.com/2.2/questions/featured?order=desc&sort=votes&tagged=python&site=stackoverflow
Example: retrieving all featured Python questions.
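As a rough sketch, the same call can be made from Python with requests; the item field used here (link) follows the documented question object:
import requests

resp = requests.get(
    "https://api.stackexchange.com/2.2/questions/featured",
    params={"order": "desc", "sort": "votes",
            "tagged": "python", "site": "stackoverflow"},
)
for question in resp.json().get("items", []):
    print(question["link"])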
I wanted to do this before for some websites but didn't know where to start. This time, however, I am adamant. I am talking about the scripts where we crawl a website and extract the data we require.

My target is this: basically, I have to appear for job interviews in December. There is this site (http://www.geeksforgeeks.org/) which contains a large number of questions from previous interviews (like http://www.geeksforgeeks.org/amazon-interview-set-42-on-campus/ & http://www.geeksforgeeks.org/adobe-interview-set-6-campus-mts-1/). Every title has the word "set" and a number in it. It is quite cumbersome to keep track of what I have done and what I haven't. So I want to extract the questions from each of these pages and put them in a PDF with the title.

How can I do this using curl, regex and Scrapy? I am intermediate in C/C++/Java but have only beginner proficiency in Python. Any help is much appreciated. Also, point me to any such scripts you know of. I want to do this on my own; I just require a starting point and some guidance. Thanks.
If you want just a starting point, try Scrapy, a screen-scraping library for Python. I would also recommend the requests library for making the HTTP requests; it's by far the simplest option (with no loss of power).
Also, don't try to parse html or xml with a regex. Just don't. Use one of the fine libraries available (beautifulsoup or lxml, or lxml with a beautifulsoup backend are the most popular, but there are others).
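As a starting-point sketch only: requests plus BeautifulSoup can pull the title and body of one of the interview pages mentioned in the question. The .entry-title and .entry-content selectors are a guess at typical WordPress markup and will likely need adjusting to the real page:
import requests
from bs4 import BeautifulSoup

url = "http://www.geeksforgeeks.org/amazon-interview-set-42-on-campus/"
soup = BeautifulSoup(requests.get(url).text, "html.parser")

title = soup.select_one(".entry-title")    # guessed selector
body = soup.select_one(".entry-content")   # guessed selector
if title and body:
    print(title.get_text(strip=True))
    print(body.get_text("\n", strip=True))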
I am planning to develop a web-based application which could crawl Wikipedia to find relations and store them in a database. By relations, I mean searching for a name, say 'Bill Gates', finding his page, downloading it, pulling the various pieces of information from the page and storing them in a database. The information may include his date of birth, his company and a few other things. But I need to know if there is any way to extract this specific data from the page, so that I can store it in a database. Any specific books or algorithms would be greatly appreciated. Also, mentions of good open-source libraries would be helpful.
Thank You
If you haven't already, you should have a look at DBpedia. Many categories of wiki articles have "Infoboxes" for the kinds of information you describe, and they've made a database out of it:
http://en.wikipedia.org/wiki/DBpedia
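As a hedged sketch, the same infobox-style facts can be pulled from DBpedia's public SPARQL endpoint with plain requests; the property name (dbo:birthDate) follows the DBpedia ontology, but double-check it against the current schema:
import requests

query = """
PREFIX dbo: <http://dbpedia.org/ontology/>
PREFIX dbr: <http://dbpedia.org/resource/>
SELECT ?birthDate WHERE { dbr:Bill_Gates dbo:birthDate ?birthDate . }
"""
resp = requests.get(
    "https://dbpedia.org/sparql",
    params={"query": query, "format": "application/sparql-results+json"},
)
for row in resp.json()["results"]["bindings"]:
    print(row["birthDate"]["value"])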
You might also leverage some of the information in Metaweb's Freebase (which overlaps with, and I believe may even integrate, the info from DBpedia). They have an API for querying their graph database, and there's a Python wrapper for it called freebase-python.
UPDATE: Freebase is no more; they were acquired by Google and eventually folded into the Google Knowledge Graph. There is an API but I don't think they have anything like the formal sync'ing Freebase had with public sources like Wikipedia. I'm personally disappointed in how this looks to have turned out. :-/
As for the natural language processing bit, if you do make headway on that problem you might consider these databases as repositories for any information you do mine.
You mention Python and open source, so I would investigate the NLTK (Natural Language Toolkit). Text mining and natural language processing is one of those things where you can do a lot with a dumb algorithm (e.g. pattern matching), but if you want to go a step further and do something more sophisticated, i.e. try to extract information that is stored in a flexible manner or try to find information that might be interesting but is not known a priori, then natural language processing should be investigated.
NLTK is intended for teaching, so it is organized as a toolkit. This approach suits Python very well. There are a couple of books for it as well; the O'Reilly book is also published online under an open license. See NLTK.org.
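A small sketch of the kind of step beyond pattern matching mentioned above, using NLTK's off-the-shelf tokenizer, POS tagger and named-entity chunker (the corresponding nltk.download() data packages need to be installed first):
import nltk

text = "Bill Gates co-founded Microsoft and was born on October 28, 1955."
tokens = nltk.word_tokenize(text)
tagged = nltk.pos_tag(tokens)
for chunk in nltk.ne_chunk(tagged):
    if hasattr(chunk, "label"):               # named-entity subtrees only
        entity = " ".join(token for token, tag in chunk.leaves())
        print(chunk.label(), "->", entity)    # e.g. PERSON -> Bill Gates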
Jvc, there are existing python modules that can do everything you mentioned above.
For pulling information from webpages, I like to use Selenium, http://seleniumhq.org/projects/ide/. Basically, you can locate and retrieve information on any webpage using a number of identifiers (id, XPath, etc.).
However, like winwaed said, it can be inflexible if you are simply "pattern matching", especially since some websites use dynamic code, meaning the identifiers can change with each subsequent reload of the page. But this problem can be solved by adding regular expressions, i.e. (.*), to your code. Check out this YouTube video, http://www.youtube.com/watch?v=Ap_DlSrT-iE. Even though he is using BeautifulSoup to scrape the website, you can see how he uses regular expressions to pull the information from the page.
Also, I'm not sure what type of database you are working with, but pyodbc, http://code.google.com/p/pyodbc/, can work with SQL databases, including mainstream ones like Microsoft Access.
So, my advice is to look into Selenium for finding the info on the webpage, pyodbc to store and retrieve it, and regular expressions when the identifiers are dynamic.
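A rough sketch of that combination, using the Selenium Python bindings rather than the IDE; the XPath, the regex and the table schema are placeholders to adapt:
import re
import pyodbc
from selenium import webdriver
from selenium.webdriver.common.by import By

driver = webdriver.Firefox()
driver.get("https://en.wikipedia.org/wiki/Bill_Gates")
# Grab the infobox text; the XPath is a placeholder for whatever identifier fits.
infobox = driver.find_element(By.XPATH, "//table[contains(@class, 'infobox')]").text
driver.quit()

# Fall back to a regular expression when the markup is too dynamic to pin down.
born = re.search(r"Born\s*\n(.+)", infobox)
if born:
    conn = pyodbc.connect("DSN=people")   # placeholder DSN and table schema
    conn.execute("INSERT INTO people (name, born) VALUES (?, ?)",
                 ("Bill Gates", born.group(1)))
    conn.commit()
    conn.close()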
Can anyone point me towards a ready made RSS screen scraper, preferably in Python in order to get full text RSS feeds?
There's a good list of them here, which mentions Feed Parser, which you use like this:
import feedparser

python_wiki_rss_url = ("http://www.python.org/cgi-bin/moinmoin/"
                       "RecentChanges?action=rss_rc")
feed = feedparser.parse(python_wiki_rss_url)
You can then do things like:
for item in feed["items"]:
    print(item["title"])
feedparser.org is great
Sorry, but it doesn't exist in Python, though such scrapers do exist in PHP. You are more than welcome to use and improve the one I made, named scraped. Though it does not handle all sites, it is a recipe-based system that currently only handles the NYT, WSJ and the Economist. I am working on an all-inclusive algorithm, but it's a major undertaking. It involves a ton of analysis of the different types of HTML and XML. Even the 3 sites mentioned above have vastly different algorithms for how to scrape them, WSJ being the most complex by far. They screw their HTML up with so much useless crap, mainly just to stop you.
Here is the program I was talking about; it requires lxml, but the readme explains everything. It reads the config files, parses partial RSS feeds, takes the links and then scrapes those links, formulating in the end an RSS 2.0 XML file, which I mainly convert into an ebook for my Kindle. I utilize lxml, BeautifulSoup and feedparser.
http://tinyurl.com/yh3s9pa
You can also look at the calibre project, which uses a similar recipe-based method to the way I do it.
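A minimal sketch of that partial-feed-to-full-text flow, using feedparser for the entry links and BeautifulSoup for the article body; the feed URL and the "article-body" class are placeholders, since every site needs its own recipe:
import feedparser
import requests
from bs4 import BeautifulSoup

feed = feedparser.parse("http://example.com/partial-feed.rss")   # placeholder feed
for entry in feed.entries:
    html = requests.get(entry.link).text
    body = BeautifulSoup(html, "html.parser").find("div", class_="article-body")
    if body:
        print(entry.title)
        print(body.get_text("\n", strip=True))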
I am pretty good with Python, so pseudo-code will suffice when the details are trivial. Please get me started on the task: how do I go about crawling the net for the snail-mail addresses of churches in my state? Once I have a one-liner such as "123 Old West Road #3 Old Lyme City MD 01234", I can probably parse it into city, state, street, number and apt with enough trial and error.

My problem is: if I use white pages online, then how do I deal with all the HTML junk, HTML tables, ads, etc.? I do not think I need their phone number, but it will not hurt; I can always throw it out once parsed. Even if your solution is half-manual (such as save to PDF, then open Acrobat, save as text), I might be happy with it still. Thanks! Heck, I will even accept Perl snippets; I can translate them myself.
You could use mechanize. It's a Python library that simulates a browser, so you could crawl through the white pages (similarly to what you would do manually).
To deal with the 'HTML junk', Python has a library for that too: BeautifulSoup.
It is a lovely way to get the data you want out of HTML (of course it assumes you know a little bit about HTML, as you will still have to navigate the parse tree).
Update: as to your follow-up question on how to click through multiple pages, mechanize is a library to do just that. Take a closer look at their examples, especially the follow_link method. As I said, it simulates a browser, so 'clicking' can be realized quickly in Python.
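A small sketch of that 'clicking', with BeautifulSoup doing the parsing afterwards; the URL and the link text are placeholders:
import mechanize
from bs4 import BeautifulSoup

br = mechanize.Browser()
br.set_handle_robots(False)                    # many listing sites disallow robots
br.open("http://example.com/white-pages")      # placeholder URL
response = br.follow_link(text_regex=r"Next")  # the 'click' to the next page
soup = BeautifulSoup(response.read(), "html.parser")
print(soup.get_text(" ", strip=True)[:200])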
Try lynx --dump <url> to download the web pages. All the troublesome HTML tags will be stripped from the output, and all the links from the page will appear together.
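If you want that from inside Python, a subprocess wrapper is enough; the URL is a placeholder:
import subprocess

text = subprocess.run(["lynx", "--dump", "http://example.com"],
                      capture_output=True, text=True).stdout
print(text)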
What you're trying to do is called Scraping or web scraping.
If you do some searches on python and scraping, you may find a list of tools that will help.
(I have never used scrapy, but its site looks promising :)
Beautiful Soup is a no-brainer. Here's a site you might start at: http://www.churchangel.com/. They have a huge list and the formatting is very regular -- translation: easy to set up BSoup to scrape.
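A bare-bones sketch of that kind of scrape; the find_all target is a placeholder, since I haven't checked the site's actual markup:
import requests
from bs4 import BeautifulSoup

html = requests.get("http://www.churchangel.com/").text
soup = BeautifulSoup(html, "html.parser")
for cell in soup.find_all("td"):            # placeholder selector
    text = cell.get_text(" ", strip=True)
    if text:
        print(text)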
Python scripts might not be the best tool for this job, if you're just looking for addresses of churches in a geographic area.
The US census provides a data set of churches for use with geographic information systems. If finding all the X's in a spatial area is a recurring problem for you, invest in learning a GIS. Then you can bring your Python skills to bear on many geographic tasks.