Fetching data from the Wiki - python

I am currently developing a wiki and will keep posting information into it. However, I'll also need to fetch that information from the wiki using Python code. For example, if I have a wiki page about a company, say Coca Cola, I will need all the text that I have posted on that page to be passed to my Python program. Please let me know if there's a way to do this.
Thanks!

You can use api.php to get the wiki source text; it includes only the actual article content.
I wrote this for the German Wikipedia, so it works with umlauts. Some special characters of other languages don't work (Russian works, so the problem is probably with some Asian languages). This is a working example:
import urllib2
from BeautifulSoup import BeautifulStoneSoup
import xml.sax.saxutils

def load(lemma, language="en", format="xml"):
    """ Get the Wikipedia source text (not the HTML source code)

    format: xml, json, ...
    language: en, de, ...

    Returns None if the page doesn't exist.
    """
    url = 'http://' + language + '.wikipedia.org/w/api.php' + \
          '?action=query&format=' + format + \
          '&prop=revisions&rvprop=content' + \
          '&titles=' + lemma
    request = urllib2.Request(url)
    handle = urllib2.urlopen(request)
    text = handle.read()
    if format == 'xml':
        soup = BeautifulStoneSoup(text)
        rev = soup.rev
        if rev is not None:
            text = unicode(rev.contents[0])
            text = xml.sax.saxutils.unescape(text)
        else:
            return None
    return text

print load("Coca-Cola")
If you want the actual HTML source instead, you have to change the URL and the part that uses BeautifulStoneSoup.
BeautifulStoneSoup parses XML, BeautifulSoup parses HTML. Both are part of the BeautifulSoup package.
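A rough sketch of that HTML variant (the load_html helper below is illustrative, not part of the answer above: it fetches the rendered page at /wiki/<title> and parses it with BeautifulSoup instead of BeautifulStoneSoup; the User-Agent header is an assumption to avoid blocked requests):

import urllib2
from BeautifulSoup import BeautifulSoup

def load_html(lemma, language="en"):
    """Fetch the rendered HTML page (not the wiki source text)."""
    url = 'http://' + language + '.wikipedia.org/wiki/' + lemma
    request = urllib2.Request(url, headers={'User-Agent': 'wiki-fetch-example'})
    return BeautifulSoup(urllib2.urlopen(request).read())

soup = load_html("Coca-Cola")
print soup.find('h1')  # the page heading, as a quick sanity check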

One approach is to download the page with urllib or httplib and then analyze it with regexes to extract precisely the information you want. It can be tedious, but it's relatively easy to do.
There may be other solutions for analyzing the source of the page, such as dedicated parsers; I don't know enough about them.
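A minimal sketch of that download-then-regex idea (the URL and pattern are only illustrative; real pages usually need more careful expressions):

import urllib
import re

html = urllib.urlopen('http://example.com/').read()
# pull out the page title as a trivial example of "precise information"
match = re.search(r'<title>(.*?)</title>', html, re.IGNORECASE | re.DOTALL)
if match:
    print match.group(1).strip()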

In the past for this sort of thing I've used Semantic MediaWiki, and found it to work reasonably well. It's not terribly flexible, though, so if you're doing something complicated you'll find yourself writing custom plugins or delegating to an external service to do the work.
I ultimately ended up writing a lot of python web services to do extra processing.

Related

How to copy all the text from a URL (like [Ctrl+A][Ctrl+C] in a web browser) in Python?

I know there is an easy way to copy all the source of a URL, but that's not my task. I need to save just the text (exactly as a web browser user would copy it) to a *.txt file.
Is it unavoidable to parse the HTML source for this, or is there a better way?
I think it is impossible if you don't parse at all. You could use HTMLParser (http://docs.python.org/2/library/htmlparser.html) and just keep the data between tags, but you will most likely get many elements you don't want.
Getting exactly the same result as [Ctrl-C] would be very difficult without parsing, because of things like style="display: none;" which hide text; reproducing that would mean fully parsing the HTML, JavaScript, and CSS of both the document and its resource files.
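A rough sketch of that HTMLParser approach, keeping only the data between tags and skipping script/style blocks (the class and names are made up for illustration):

from HTMLParser import HTMLParser  # html.parser on Python 3

class TextExtractor(HTMLParser):
    def __init__(self):
        HTMLParser.__init__(self)
        self.parts = []
        self.skip = 0

    def handle_starttag(self, tag, attrs):
        if tag in ('script', 'style'):
            self.skip += 1

    def handle_endtag(self, tag):
        if tag in ('script', 'style') and self.skip:
            self.skip -= 1

    def handle_data(self, data):
        if not self.skip:
            self.parts.append(data)

parser = TextExtractor()
parser.feed('<p>visible text</p><script>var hidden = 1;</script>')
print ' '.join(parser.parts)  # -> 'visible text'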
Parsing is required. I don't know if there's a library method for this. A simple regex:
from re import sub
text = sub(r"<[^>]+>", " ", html)
This requires many improvements, but it's a starting point.
With Python, the BeautifulSoup module is great for parsing HTML and well worth a look. To get the text from a webpage, it's just a case of:
#!/usr/bin/env python
import urllib2
from bs4 import BeautifulSoup

url = 'http://python.org'
html = urllib2.urlopen(url).read()
soup = BeautifulSoup(html)
# you can refine this even further if needed, e.g. soup.body.div.get_text()
text = soup.body.get_text()
print text
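One possible refinement, given the concerns above about scripts and hidden markup: drop <script> and <style> elements before calling get_text() (a sketch, not a full substitute for what the browser does with CSS):

import urllib2
from bs4 import BeautifulSoup

html = urllib2.urlopen('http://python.org').read()
soup = BeautifulSoup(html)
for tag in soup(['script', 'style']):   # shorthand for find_all
    tag.extract()                       # remove the element from the tree
print soup.body.get_text()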

Is there a way to use readability and python to extract just text, not HTML?

I need to extract pure text from a random web page at runtime, on the server side. I use Google App Engine and a Python port of Readability.
There are a number of those ports:
an early version by gfxmonk, based on BeautifulSoup;
a version by minvolai based on gfxmonk's, except it uses lxml rather than BeautifulSoup, making it (according to minvolai, see the project page) faster, albeit introducing a dependency on lxml;
a version by Yuri Baburov aka buriy. Like minvolai's, it depends on lxml; it also depends on chardet to detect encoding.
I use Yuri's version, as it is the most recent and seems to be in active development.
I managed to make it run on Google App Engine using Python 2.7.
Now the "problem" is that it returns HTML, whereas I need pure text.
The advice in this Stack Overflow question about link extraction is to use BeautifulSoup. I will, if there is no other choice, but BeautifulSoup would be yet another dependency, as I use the lxml-based version.
My questions:
Is there a way to get pure text from the Python Readability version that I use, without forking the code?
Is there a way to easily retrieve pure text from the HTML result of Python Readability, e.g. by using lxml, BeautifulSoup, a regex, or something else?
If the answer to the above is no, or yes but not easily, what is the way to modify Python Readability? Is such a modification even desirable enough (to enough people) to make such an extension official?
You can use html2text. It is a nifty tool.
Here is a link on how to use it with the Python Readability tool - together they are called read2text.
http://brettterpstra.com/scripting-readability-markdownify-for-clipping-web-pages/
Hope this helps :)
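A minimal html2text sketch (this assumes the html2text package is installed; read2text itself is the combined tool described in the link above):

import html2text

converter = html2text.HTML2Text()
converter.ignore_links = True            # keep the link text, drop the targets
print converter.handle('<p>Hello, <b>world</b>!</p>')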
Not to let this linger, here is my current solution.
I did not find a way to use the Readability ports.
I decided to use Beautiful Soup, version 4.
BS has one simple function to extract text.
Code:
from bs4 import BeautifulSoup
soup = BeautifulSoup(html)
text = soup.get_text()
First, you extract the HTML contents with readability,
html_snippet = Document(html).summary()
Then, use a library to remove HTML tags. There are caveats:
1) you probably need spaces: "<p>some text<br>other text" shouldn't become "some textother text", and you might want lists converted into " - ".
2) "&#39;" should be displayed as "'", and "&gt;" should be displayed as ">" -- this is called HTML entity replacement (see below).
I usually use a library called bleach to clean out unnecessary tags and attributes:
cleaned_text = bleach.clean(html_snippet, tags=[])
or
cleaned_text = bleach.clean(html_snippet, tags=['i', 'b'])
You will need some kind of html2text library if you want to remove all tags and get better text formatting, or you can implement a custom formatting procedure yourself.
But I think you get the rough idea now.
For a simple text formatting with bleach:
For example, if you want paragraphs as "\n", and list items as "\n - ", then:
norm_html = bleach.clean(html_snippet, tags=['p', 'br', 'li'])
replaced_html = norm_html.replace('<p>', '\n').replace('</p>', '\n')
replaced_html = replaced_html.replace('<br>', '\n').replace('<li>', '\n - ')
cleaned_text = bleach.clean(replaced_html, tags=[])
For a regexp that only strips HTML tags and does entity replacement ("&gt;" should become ">" and so on), you can take a look at https://stackoverflow.com/a/7778368/217895
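A rough sketch of that strip-then-unescape idea (Python 2 style to match the rest of this thread; on Python 3 you would use html.unescape instead of HTMLParser.unescape):

import re
from HTMLParser import HTMLParser

def strip_tags(html_snippet):
    no_tags = re.sub(r'<[^>]+>', ' ', html_snippet)  # drop the tags
    return HTMLParser().unescape(no_tags)            # &gt; -> >, &#39; -> '

print strip_tags('<p>5 &gt; 3 &amp; that is true</p>')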

Fastest, easiest, and best way to parse an HTML table?

I'm trying to get this table http://www.datamystic.com/timezone/time_zones.html into array format so I can do whatever I want with it. Preferably in PHP, Python, or JavaScript.
This is the kind of problem that comes up a lot, so rather than looking for help with this specific problem, I'm looking for ideas on how to solve all similar problems.
BeautifulSoup is the first thing that comes to mind.
Another possibility is copying/pasting it in TextMate and then running regular expressions.
What do you suggest?
This is the script that I ended up writing, but as I said, I'm looking for a more general solution.
from BeautifulSoup import BeautifulSoup
import urllib2

url = 'http://www.datamystic.com/timezone/time_zones.html'
response = urllib2.urlopen(url)
html = response.read()
soup = BeautifulSoup(html)
tables = soup.findAll("table")
table = tables[1]
rows = table.findAll("tr")
for row in rows:
    tds = row.findAll('td')
    if len(tds) == 4:
        countrycode = tds[1].string
        timezone = tds[2].string
        if countrycode is not None and timezone is not None:
            print "'%s' => '%s'," % (countrycode.strip(), timezone.strip())
Comments and suggestions for improvement to my python code welcome, too ;)
For your general problem: try lxml.html from the lxml package (think of it as the stdlib's xml.etree on steroids: the same XML API, but with HTML support, XPath, XSLT, etc.).
A quick example for your specific case:
from lxml import html

tree = html.parse('http://www.datamystic.com/timezone/time_zones.html')
table = tree.findall('//table')[1]
data = [
    [td.text_content().strip() for td in row.findall('td')]
    for row in table.findall('tr')
]
This will give you a nested list: each sub-list corresponds to a row in the table and contains the data from the cells. The sneakily inserted advertisement rows are not filtered out yet, but it should get you on your way. (and by the way: lxml is fast!)
BUT, more specifically for your particular use case: there are better ways to get at timezone database information than scraping that particular webpage (aside: note that the web page actually states that you are not allowed to copy its contents). There are even existing libraries that already use this information; see for example python-dateutil.
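A small illustration of using an existing timezone database instead of scraping (this assumes the python-dateutil package; pytz would work just as well):

from datetime import datetime
from dateutil import tz

eastern = tz.gettz('America/New_York')
print eastern.tzname(datetime(2012, 7, 1))  # 'EDT' -- daylight saving applied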
Avoid regular expressions for parsing HTML; they're simply not appropriate for it. You want a DOM parser like BeautifulSoup for sure.
A few other alternatives:
SimpleHTMLDom (PHP)
Hpricot & Nokogiri (Ruby)
Web::Scraper (Perl/CPAN)
All of these are reasonably tolerant of poorly formed HTML.
I suggest loading the document with an XML parser like DOMDocument::loadHTMLFile, which is bundled with PHP, and then using XPath to grab the data you need.
This is not the fastest way, but the most readable (in my opinion) in the end. You can use Regex, which will probably be a little faster, but would be bad style (hard to debug, hard to read).
EDIT: Actually this is hard, because the page you mentioned is not valid HTML (see validator.w3.org). In particular, tags with no opening/closing counterpart get heavily in the way.
It looks, though, like xmlstarlet (http://xmlstar.sourceforge.net/, a great tool) is able to repair the problem (run xmlstarlet fo -R). xmlstarlet can also run XPath and XSLT scripts, which can help you extract your data with a simple shell script.
While we were building SerpApi we tested many platforms and parsers.
The benchmark results for Python, and more detail, are in this full article on Medium:
https://medium.com/@vikoky/fastest-html-parser-available-now-f677a68b81dd
The efficiency of a regex can be superior to that of a DOM parser.
Look at this comparison:
http://www.rockto.com/launcher/28852/mochien.com/Blog/Read/A300111001736/Regex-VS-DOM-untuk-Rockto-Team
You can find many more by searching the web.

lxml can't parse <table>?

I want to parse tables in HTML, but I found lxml can't parse them. What's wrong?
# -*- coding: utf8 -*-
import urllib
import lxml.etree

keyword = 'lxml+tutorial'
url = 'http://www.baidu.com/s?wd='

if __name__ == '__main__':
    page = 0
    link = url + keyword + '&pn=' + str(page)
    f = urllib.urlopen(link)
    content = f.read()
    f.close()
    tree = lxml.etree.HTML(content)
    query_link = '//table'
    info_link = tree.xpath(query_link)
    print info_link
the print result is just []...
lxml's documentation says, "The support for parsing broken HTML depends entirely on libxml2's recovery algorithm. It is not the fault of lxml if you find documents that are so heavily broken that the parser cannot handle them. There is also no guarantee that the resulting tree will contain all data from the original document. The parser may have to drop seriously broken parts when struggling to keep parsing."
And sure enough, the HTML returned by Baidu is invalid: the W3C validator reports "173 Errors, 7 warnings". I don't know (and haven't investigated) whether these particular errors have caused your trouble with lxml, because I think that your strategy of using lxml to parse HTML found "in the wild" (which is nearly always invalid) is doomed.
For parsing invalid HTML, you need a parser that implements the (surprisingly bizarre!) HTML error recovery algorithm. So I recommend swapping lxml for html5lib, which handles Baidu's invalid HTML with no problems:
>>> import urllib
>>> from html5lib import html5parser, treebuilders
>>> p = html5parser.HTMLParser(tree = treebuilders.getTreeBuilder('dom'))
>>> dom = p.parse(urllib.urlopen('http://www.baidu.com/s?wd=foo').read())
>>> len(dom.getElementsByTagName('table'))
12
I see several places where that code could be improved, but for your question, here are my suggestions:
Use lxml.html.parse(link) rather than lxml.etree.HTML(content) so all the "just works" automatics can kick in. (eg. Handling character coding declarations in headers properly)
Try using tree.findall(".//table") rather than tree.xpath("//table"). I'm not sure whether it'll make a difference, but I just used that syntax in a project of my own a few hours ago without issue and, as a bonus, it's compatible with non-LXML ElementTree APIs.
The other major thing I'd suggest would be using Python's built-in functions for building URLs so you can be sure the URL you're building is valid and properly escaped in all circumstances.
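A short sketch combining those suggestions (the query parameters are just the ones from the question, used here for illustration):

import urllib
from lxml import html

params = urllib.urlencode({'wd': 'lxml tutorial', 'pn': 0})
tree = html.parse('http://www.baidu.com/s?' + params)  # fetches and parses the page
print len(tree.findall('.//table'))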
If LXML can't find a table and the browser shows a table to exist, I can only imagine it's one of these three problems:
Bad request. LXML gets a page without a table in it. (eg. error 404 or 500)
Bad parsing. Something about the page confused lxml.etree.HTML when called directly.
Javascript needed. Maybe the table is generated client-side.

Python strategy for extracting text from malformed html pages

I'm trying to extract text from arbitrary HTML pages. Some of the pages (which I have no control over) have malformed HTML or scripts which make this difficult. Also, I'm on a shared hosting environment, so while I can install Python libraries, I can't just install anything I want on the server.
pyparsing and html2text.py also did not seem to work for malformed HTML pages.
Example URL is http://apnews.myway.com/article/20091015/D9BB7CGG1.html
My current implementation is approximately the following:
# Try using BeautifulSoup 3.0.7a
import BeautifulSoup
from BeautifulSoup import Comment

soup = BeautifulSoup.BeautifulSoup(s)
comments = soup.findAll(text=lambda text: isinstance(text, Comment))
[comment.extract() for comment in comments]
c = soup.findAll('script')
for i in c:
    i.extract()
body = soup.body(text=True)
text = ''.join(body)
# if BeautifulSoup can't handle it,
# alter the html by finding the 1st instance of "<body" and replacing everything
# prior to that with "<html><head></head>", then try BeautifulSoup again with the new html
If BeautifulSoup still does not work, I resort to a heuristic: I look at the first and last characters of a line (to see whether it looks like a line of code, e.g. "#", "<", or ";"), take a sample of the line, and check whether its tokens are English words or numbers. If too few of the tokens are words or numbers, I guess that the line is code.
I could use machine learning to inspect each line, but that seems a little expensive and I would probably have to train it (since I don't know that much about unsupervised machine learning), and of course write it as well.
Any advice, tools, or strategies would be most welcome. I also realize that the latter part of this is rather messy: if a line is determined to contain code, I currently throw away the entire line, even if it contains some small amount of actual English text.
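A very rough sketch of that line-classifying heuristic (the tiny word list and the 0.5 threshold are arbitrary placeholders, not from the original post; a real version would use a proper dictionary):

import re

ENGLISH_WORDS = set(['the', 'a', 'of', 'and', 'to', 'in', 'is', 'news'])

def looks_like_text(line):
    tokens = re.findall(r'[A-Za-z0-9]+', line.lower())
    if not tokens:
        return False
    good = sum(1 for t in tokens if t in ENGLISH_WORDS or t.isdigit())
    return float(good) / len(tokens) >= 0.5

print looks_like_text('the news is in the paper and the price is 75')  # True
print looks_like_text('var x=document.getElementById("a");')           # False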
Try not to laugh, but:
from subprocess import Popen, PIPE

class TextFormatter:
    def __init__(self, lynx='/usr/bin/lynx'):
        self.lynx = lynx

    def html2text(self, unicode_html_source):
        "Expects unicode; returns unicode"
        return Popen([self.lynx,
                      '-assume-charset=UTF-8',
                      '-display-charset=UTF-8',
                      '-dump',
                      '-stdin'],
                     stdin=PIPE,
                     stdout=PIPE).communicate(input=unicode_html_source.encode('utf-8'))[0].decode('utf-8')
I hope you've got lynx!
Well, it depends how good the solution has to be. I had a similar problem, importing hundreds of old html pages into a new website. I basically did
from BeautifulSoup import BeautifulSoup
from html2text import html2text

# remove all that crap around the body and let BS fix the tags
newhtml = "<html><body>%s</body></html>" % (
    u''.join(unicode(tag) for tag in BeautifulSoup(oldhtml).body.contents))
# use html2text to turn it into text
text = html2text(newhtml)
and it worked out, but of course the documents could be so bad that even BS can't salvage much.
BeautifulSoup does badly with malformed HTML. What about some regex-fu?
>>> import re
>>>
>>> html = """<p>This is paragraph with a bunch of lines
... from a news story.</p>"""
>>>
>>> pattern = re.compile('(?<=p>).+(?=</p)', re.DOTALL)
>>> pattern.search(html).group()
'This is paragraph with a bunch of lines\nfrom a news story.'
You can then assemble a list of valid tags from which you want to extract information.
