I want to extract the information in the Infobox of specific Wikipedia pages, mainly countries. Specifically, I want to achieve this without scraping the page with Python + BeautifulSoup4 or any other language + library, if possible. I'd rather use the official API, because I noticed that the CSS tags differ between Wikipedia subdomains (i.e. other languages).
The question How to get Infobox from a Wikipedia article by Mediawiki API? states that the following method works, which is indeed true for the given title (Scary Monsters and Nice Sprites), but unfortunately it doesn't work on the pages I tried (see further below).
https://en.wikipedia.org/w/api.php?action=query&prop=revisions&rvprop=content&format=xmlfm&titles=Scary%20Monsters%20and%20Nice%20Sprites&rvsection=0
However, I suppose Wikimedia changed their infobox template, because when I run the above query, all I get is the content but not the infobox. E.g. running the query on Europäische_Union (European_Union) yields, among other things, the following snippet:
{{Infobox Europäische Union}}
<!--{{Infobox Staat}} <- Vorlagen-Parameter liegen in [[Spezial:Permanenter Link/108232313]] -->
It works fine for the English version of Wikipedia though.
So the page I want to extract the infobox from would be: http://de.wikipedia.org/wiki/Europäische_Union
And this is the code I'm using:
#!/usr/bin/env python
# -*- coding: utf-8 -*-
import sys
reload(sys)
sys.setdefaultencoding("utf-8")
import lxml.etree
import urllib
title = "Europäische_Union"
params = { "format":"xml", "action":"query", "prop":"revisions", "rvprop":"content", "rvsection":0 }
params["titles"] = "API|%s" % urllib.quote(title.encode("utf8"))
qs = "&".join("%s=%s" % (k, v) for k, v in params.items())
url = "http://de.wikipedia.org/w/api.php?%s" % qs
tree = lxml.etree.parse(urllib.urlopen(url))
revs = tree.xpath('//rev')
print revs[-1].text
Am I missing something very substantial?
The data must not be taken from Wikipedia but from Wikidata, Wikipedia's structured-data counterpart. (Also, that's not a standard infobox: it has no parameters and is filled in on the template itself.)
Use the Wikidata API module wbgetclaims to get all the data on the European Union:
https://www.wikidata.org/w/api.php?action=wbgetclaims&entity=Q458
Much neater, eh? See https://www.wikidata.org/wiki/Wikidata:Data_access for more.
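If you want to consume that from Python 3, here is a minimal sketch using the requests library (untested; Q458 is the European Union item, and property IDs such as P36 are Wikidata's own identifiers):

import requests

API = "https://www.wikidata.org/w/api.php"
params = {
    "action": "wbgetclaims",
    "entity": "Q458",   # European Union
    "format": "json",
}
claims = requests.get(API, params=params).json()["claims"]

# `claims` maps property IDs (e.g. "P36" = capital) to lists of statements.
for prop, statements in claims.items():
    print(prop, len(statements))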
Related
I am trying to extract all names from this site:
https://en.wikipedia.org/wiki/Category:Masculine_given_names
(I want all names that are listed on this site and its following pages, but also those in the subcategories listed at the top, like Afghan masculine given names, African masculine given names, etc.)
I tried this with the following code:
import pywikibot
from pywikibot import pagegenerators
site = pywikibot.Site()
cat = pywikibot.Category(site,'Category:Masculine_given_names')
gen = pagegenerators.CategorizedPageGenerator(cat)
for idx, page in enumerate(gen):
    text = page.text
    print(idx)
    print(text)
This generally works fine and at least gave me the detail page of a single name. But how can I get all the names from all the subpages of this category, and also from its subcategories?
How to find subcategories and subpages on wikipedia using pywikibot?
This is already answered here using Category methods, but you can also use the pagegenerators.CategorizedPageGenerator function. All you need is to set the recurse option:
>>> gen = pagegenerators.CategorizedPageGenerator(cat, recurse=True)
Refer to the documentation for it. You may also include pagegenerators options within your script in the way described in this example and call your script with the -catr option:
pwb.py <yourscript> -catr:Masculine_given_names
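If you prefer to keep everything inside the script, a small end-to-end sketch in Python 3 (untested, built on the snippet from the question) could look like this:

import pywikibot
from pywikibot import pagegenerators

site = pywikibot.Site()
cat = pywikibot.Category(site, 'Category:Masculine_given_names')

# recurse=True also walks subcategories such as "Afghan masculine given names";
# an integer (e.g. recurse=2) would limit the recursion depth instead.
gen = pagegenerators.CategorizedPageGenerator(cat, recurse=True)

for idx, page in enumerate(gen):
    print(idx, page.title())  # the page title is usually the name itself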
Let me get this straight: I'm trying to build a reader web app like Google Reader, Feedly, etc. Hence I'm trying to fetch RSS in Python using the feedparser library. The thing is, not every website's RSS is in the same format; I mean, some have no title and some have no publish date in the RSS. However, I found that digg.com/reader is very useful: Digg's reader gets the RSS with publish date and title too. I wonder how this works? Any clue or little bit of help would be appreciated.
I've recently done some projects with the feedparser library, and it can be very frustrating since many RSS feeds are different. What works best for me is something like this:
#to get posts from hackaday.com
import feedparser
feed = feedparser.parse("http://www.hackaday.com/blog/feed/") #get feed from hackaday
feed = feed['items'] #Get items in feed (this is the best way I've found)
print feed[0]['title'] #print post title
print feed[0]['summary'] #print post summary
print feed[0]['published'] #print date published
These are just a few of the different "fields" that feedparser has. To find the one you want, just run these commands in the Python shell and see what fits your needs.
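Since, as the question points out, not every feed provides every field, it also helps to guard against missing keys; feedparser entries behave like dictionaries, so a small Python 3 sketch (same feed URL as above, untested) could be:

import feedparser

feed = feedparser.parse("http://www.hackaday.com/blog/feed/")

for entry in feed.entries:
    # .get() avoids an error when a feed omits a field
    title = entry.get("title", "(no title)")
    published = entry.get("published", "(no publish date)")
    print(title, "-", published)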
You can use feedparser to find out whether a website has Atom or RSS, and then deal with each type. If a website has no publish date or title, you can extract them using other libraries like goose-extractor or newspaper (the example below uses newspaper):
from newspaper import Article
import feedparser
def extract_date(url):
    article = Article(url)
    article.download()
    article.parse()
    date = article.publish_date
    return date

d = feedparser.parse("http://feeds.feedburner.com/webnewsit")  # an Italian website
d.entries[0]  # the last entry
try:
    d.entries[0].published
except AttributeError:
    link_last_entry = d.entries[0].link
    publish_date = extract_date(link_last_entry)
Let me know if you still don't get the publication date
I very often need to set up physical properties for technical computations. It is not convenient to fill in such data by hand, so I would like to grab it from a public web page (Wikipedia, for example) using a Python script.
I have been trying several ways:
using an HTML parser like lxml.etree (I have no experience; I was just trying to follow a tutorial)
using the pandas wikitable import (likewise)
using urllib2 to download the HTML source and then searching for keywords with regular expressions
What I'm able to do:
I didn't find any universal solution applicable to various sources of information. The only script I made that actually works uses just simple urllib2 and regular expressions. It can grab physical properties of elements from this page, which is plain HTML.
What I'm not able to do:
I'm not able to do that with more sophisticated web pages like this one. The HTML code of the page that I grab with urllib2 does not contain the keywords and data I'm looking for (like flexural strength or modulus of elasticity). Actually, it seems that it does not contain the wiki page at all! How is that possible? Are these wiki tables linked in somehow dynamically? How can I get the content of the table with urllib? Why does urllib2 not grab this data when my web browser does?
I have no experience with web programming.
I don't understand why it is so hard to get any machine-readable data from free public online sources of information.
Web scraping is difficult. Not because it's rocket science, but because it's just messy. For the moment you'll need to hand-craft scrapers for different sites and use them as long as the site's structure does not change.
There are more automated approaches to web information extraction, e.g. as described in this paper: Harvesting Relational Tables from Lists on the Web, but this is not mainstream yet.
A large number of web pages contain data structured in the form of “lists”.
Many such lists can be further split into multi-column tables, which can then be used
in more semantically meaningful tasks. However, harvesting relational tables from such
lists can be a challenging task. The lists are manually generated and hence need not
have well defined templates – they have inconsistent delimiters (if any) and often have
missing information.
However, there are a lot of tools to get to the (HTML) content more quickly, e.g. BeautifulSoup:
Beautiful Soup is a Python library designed for quick turnaround projects like screen-scraping.
>>> from BeautifulSoup import BeautifulSoup as Soup
>>> import urllib
>>> page = urllib.urlopen("http://www.substech.com/dokuwiki/doku.php?"
"id=thermoplastic_acrylonitrile-butadiene-styrene_abs").read()
>>> soup = Soup(page) # the HTML gets parsed here
>>> soup.findAll('table')
Example output: https://friendpaste.com/DnWDviSiHIYQEBduTqkWd. More documentation can be found here: http://www.crummy.com/software/BeautifulSoup/bs4/doc/#searching-the-tree.
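If you switch to the newer bs4 package (the documentation linked above), the property rows can be flattened into lists of cell strings. A rough, untested sketch against the same page (still using Python 2's urllib.urlopen as in the snippet above):

from bs4 import BeautifulSoup
import urllib

page = urllib.urlopen("http://www.substech.com/dokuwiki/doku.php?"
                      "id=thermoplastic_acrylonitrile-butadiene-styrene_abs").read()
soup = BeautifulSoup(page, "html.parser")

for row in soup.find_all("tr"):
    cells = [td.get_text(strip=True) for td in row.find_all("td")]
    if cells:
        print(cells)  # e.g. ['Flexural strength', '70', 'MPa', '10000', 'psi']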
If you want to extract data from a bigger set of pages, take a look at scrapy.
I don't understand what you mean by
it seems that it does not contain the wiki page at all
I got this relatively rapidly:
import httplib
import re
hostu = 'www.substech.com'
timeout = 7
hypr = httplib.HTTPConnection(host=hostu,timeout = timeout)
rekete_page = ('/dokuwiki/doku.php?id='
'thermoplastic_acrylonitrile-butadiene-styrene_abs')
hypr.request('GET',rekete_page)
x = hypr.getresponse().read()
hypr.close()
#print '\n'.join('%d %r' % (i,line) for i,line in enumerate(x.splitlines(1)))
r = re.compile('\t<tr>\n.+?\t</tr>\n',re.DOTALL)
r2 = re.compile('<th[^>]*>(.*?)</th>')
r3 = re.compile('<td[^>]*>(.*?)</td>')
for y in r.findall(x):
    print
    #print repr(y)
    print map(str.strip, r2.findall(y))
    print map(str.strip, r3.findall(y))
result
[]
['<strong>Thermoplastic</strong>']
[]
['<strong>Acrylonitrile</strong><strong>-Butadiene-Styrene (ABS)</strong>']
[]
['<strong>Property</strong>', '<strong>Value in metric unit</strong>', '<strong>Value in </strong><strong>US</strong><strong> unit</strong>']
['Density']
['1.05 *10\xc2\xb3', 'kg/m\xc2\xb3', '65.5', 'lb/ft\xc2\xb3']
['Modulus of elasticity']
['2.45', 'GPa', '350', 'ksi']
['Tensile strength']
['45', 'MPa', '6500', 'psi']
['Elongation']
['33', '%', '33', '%']
['Flexural strength']
['70', 'MPa', '10000', 'psi']
['Thermal expansion (20 \xc2\xbaC)']
['90*10<sup>-6</sup>', '\xc2\xbaC\xcb\x89\xc2\xb9', '50*10<sup>-6</sup>', 'in/(in* \xc2\xbaF)']
['Thermal conductivity']
['0.25', 'W/(m*K)', '1.73', 'BTU*in/(hr*ft\xc2\xb2*\xc2\xbaF)']
['Glass transition temperature']
['100', '\xc2\xbaC', '212', '\xc2\xbaF']
['Maximum work temperature']
['70', '\xc2\xbaC', '158', '\xc2\xbaF']
['Electric resistivity']
['10<sup>8</sup>', 'Ohm*m', '10<sup>10</sup>', 'Ohm*cm']
['Dielectric constant']
['2.4', '-', '2.4', '-']
I am trying to export a category from the Turkish Wikipedia by following http://www.mediawiki.org/wiki/Manual:Parameters_to_Special:Export. Here is the code I am using:
# -*- coding: utf-8 -*-
import requests
from BeautifulSoup import BeautifulStoneSoup
from sys import version
link = "http://tr.wikipedia.org/w/index.php?title=%C3%96zel:D%C4%B1%C5%9FaAktar&action=submit"
def get(pages=[], category=False, curonly=True):
    params = {}
    if pages:
        params["pages"] = "\n".join(pages)
    if category:
        params["addcat"] = 1
        params["category"] = category
    if curonly:
        params["curonly"] = 1
    headers = {"User-Agent": "Wiki Downloader -- Python %s, contact: Yaşar Arabacı: yasar11732#gmail.com" % version}
    r = requests.post(link, headers=headers, data=params)
    return r.text

print get(category="Matematik")
Since I am trying to get data from the Turkish Wikipedia, I have used its URL. Other things should be self-explanatory. I am getting the form page that you can use to export data instead of the actual XML. Can anyone see what I am doing wrong here? I have also tried making a GET request.
There is no parameter named category; the category name should go in the catname parameter.
But Special:Export was not built for bots, it was built for humans. So, if you use catname correctly, it will return the form again, this time with the pages from the category filled in. You are then supposed to click "Submit" again, which will return the XML you want.
I think doing this in code would be too complicated. It would be easier if you used the API instead. There are some Python libraries that can help you with that: Pywikipediabot or wikitools.
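For example, here is a hedged sketch of a direct API call with requests (untested; it assumes the Turkish category prefix Kategori: and omits continuation handling for categories with more than 500 members):

import requests

API = "https://tr.wikipedia.org/w/api.php"
params = {
    "action": "query",
    "list": "categorymembers",
    "cmtitle": "Kategori:Matematik",
    "cmlimit": "500",
    "format": "json",
}
data = requests.get(API, params=params).json()
for member in data["query"]["categorymembers"]:
    print(member["title"])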
Sorry my original answer was horribly flawed. I misunderstood the original intent.
I did some more experimenting because I was curious. It seems that the code you have above is not necessarily incorrect; rather, the Special:Export documentation is misleading. The documentation states that using catname and addcat will add the categories to the output, but instead it only lists the pages and categories within the specified catname inside an HTML form. It seems that Wikipedia actually requires that the pages you wish to download be specified explicitly. Granted, the documentation doesn't appear to be very thorough on that matter. I would suggest that you parse the page for the pages within the category and then explicitly download those pages with your script. I do see an issue with this approach in terms of efficiency: due to the nature of Wikipedia's data, you'll get a lot of pages which are simply category pages for other pages.
As an aside, it could possibly be faster to use the actual corpus of data from Wikipedia, which is available for download.
Good luck!
I recently discovered the genshi.builder module. It reminds me of Divmod Nevow's Stan module. How would one use genshi.builder.tag to build an HTML document with a particular doctype? Or is this even a good thing to do? If not, what is the right way?
It's not possible to build an entire page using just genshi.builder.tag -- you would need to perform some surgery on the resulting stream to insert the doctype. Besides, the resulting code would look horrific. The recommended way to use Genshi is to use a separate template file, generate a stream from it, and then render that stream to the output type you want.
genshi.builder.tag is mostly useful for when you need to generate simple markup from within Python, such as when you're building a form or doing some sort of logic-heavy modification of the output.
See documentation for:
Creating and using templates
The XML-based template language
genshi.builder API docs
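For reference, here is a minimal sketch of that template-based approach (untested; the inline template string and variable names are only illustrative, and in practice the template would live in its own file loaded via a TemplateLoader):

from genshi.template import MarkupTemplate

template = MarkupTemplate("""
<html xmlns:py="http://genshi.edgewall.org/">
  <head><title>${title}</title></head>
  <body><div py:content="body_text">placeholder</div></body>
</html>
""")

# Bind the template variables and serialize with an HTML4 doctype.
stream = template.generate(title="Hello world!", body_text="Body text")
print(stream.render('html', doctype='html4'))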
If you really want to generate a full document using only builder.tag, this (completely untested) code could be a good starting point:
from itertools import chain
from genshi.core import DOCTYPE, Stream
from genshi.output import DocType
from genshi.builder import tag as t
# Build the page using `genshi.builder.tag`
page = t.html (t.head (t.title ("Hello world!")), t.body (t.div ("Body text")))
# Convert the page element into a stream
stream = page.generate ()
# Chain the page stream with a stream containing only an HTML4 doctype declaration
stream = Stream (chain ([(DOCTYPE, DocType.get ('html4'), None)], stream))
# Convert the stream to text using the "html" renderer (could also be xml, xhtml, text, etc)
text = stream.render ('html')
The resulting page will have no whitespace in it -- it'll look normal, but you'll have a hard time reading the source code because it will be entirely on one line. Implementing appropriate filters to add whitespace is left as an exercise to the reader.
Genshi.builder is for "programmatically generating markup streams"[1]. I believe the purpose of it is as a backend for the templating language. You're probably looking for the templating language for generating a whole page.
You can, however do the following:
>>> import genshi.output
>>> genshi.output.DocType('html')
('html', '-//W3C//DTD HTML 4.01//EN', 'http://www.w3.org/TR/html4/strict.dtd')
See other Doctypes here: http://genshi.edgewall.org/wiki/ApiDocs/genshi.output#genshi.output:DocType
[1] genshi.builder.__doc__