Exporting Wikipedia with Python

I am trying to export a category from the Turkish Wikipedia by following http://www.mediawiki.org/wiki/Manual:Parameters_to_Special:Export . Here is the code I am using:
# -*- coding: utf-8 -*-
import requests
from BeautifulSoup import BeautifulStoneSoup
from sys import version
link = "http://tr.wikipedia.org/w/index.php?title=%C3%96zel:D%C4%B1%C5%9FaAktar&action=submit"
def get(pages=[], category=False, curonly=True):
    params = {}
    if pages:
        params["pages"] = "\n".join(pages)
    if category:
        params["addcat"] = 1
        params["category"] = category
    if curonly:
        params["curonly"] = 1
    headers = {"User-Agent": "Wiki Downloader -- Python %s, contact: Yaşar Arabacı: yasar11732#gmail.com" % version}
    r = requests.post(link, headers=headers, data=params)
    return r.text
print get(category="Matematik")
Since I am trying to get data from the Turkish Wikipedia, I have used its URL. The rest should be self-explanatory. Instead of the actual XML, I get the form page that you would use to export data. Can anyone see what I am doing wrong here? I have also tried making a GET request.

There is no parameter named category; the category name should go in the catname parameter.
But Special:Export was not built for bots, it was built for humans. So even if you use catname correctly, it will return the form again, this time with the pages from the category filled in. You are then supposed to click "Submit" again, which returns the XML you want.
I think doing this in code would be too complicated. It would be easier to use the API instead. There are Python libraries that can help with that, such as Pywikipediabot or wikitools.
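For readers who want to see the API route without either library, here is a minimal sketch using only the standard library. It assumes the standard MediaWiki list=categorymembers module is available at tr.wikipedia.org/w/api.php and that the Turkish category namespace prefix is "Kategori:"; the function names and the User-Agent string are made up for illustration.

```python
import json
import urllib.parse
import urllib.request

API_URL = "https://tr.wikipedia.org/w/api.php"  # assumed endpoint for the Turkish Wikipedia


def fetch_json(params):
    """Send one GET request to the API and decode the JSON reply."""
    url = API_URL + "?" + urllib.parse.urlencode(params)
    request = urllib.request.Request(
        url, headers={"User-Agent": "category-listing-sketch/0.1"}  # placeholder
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)


def category_titles(category, fetch=fetch_json):
    """Yield the titles of all members of a category, following continuation."""
    cont = None
    while True:
        params = {
            "action": "query",
            "list": "categorymembers",
            "cmtitle": "Kategori:" + category,  # assumed namespace prefix
            "cmlimit": "500",
            "format": "json",
        }
        if cont:
            params["cmcontinue"] = cont
        reply = fetch(params)
        for member in reply.get("query", {}).get("categorymembers", []):
            yield member["title"]
        # The API signals that more results exist with a "continue" block.
        cont = reply.get("continue", {}).get("cmcontinue")
        if not cont:
            break
```

Calling list(category_titles("Matematik")) would then give the titles to export. The fetch argument exists only so the network call can be swapped out, for example in a test.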

Sorry, my original answer was horribly flawed. I misunderstood the original intent.
I did some more experimenting because I was curious. It seems that your code is not necessarily incorrect; rather, the Special:Export documentation is misleading. The documentation states that using catname and addcat will add the categories to the output, but instead it only lists the pages and categories within the specified catname inside an HTML form. It seems that Wikipedia actually requires the pages you wish to download to be specified explicitly. Granted, their documentation is not very thorough on that matter. I would suggest that you parse the returned page for the pages within the category and then explicitly download those pages with your script. I do see an efficiency issue with this approach: due to the nature of Wikipedia's data, you'll get a lot of pages that are simply category pages for other pages.
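The second half of that suggestion might be sketched like this: given titles already collected from the category, drop the ones that are themselves category pages and put the rest into the pages parameter, one title per line, as the question's code already does. The "Kategori:" prefix and the function name are assumptions for illustration.

```python
def export_form_data(titles, curonly=True, category_prefix="Kategori:"):
    """Build POST data for Special:Export from an explicit list of page titles."""
    # Skip titles that are category pages rather than articles.
    wanted = [title for title in titles if not title.startswith(category_prefix)]
    data = {"pages": "\n".join(wanted)}
    if curonly:
        data["curonly"] = 1  # current revision only, as in the question
    return data
```

The result can be passed as the data argument of the requests.post call in the question.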
As an aside, it might be faster to use the actual Wikipedia data dumps, which are available for download.
Good luck!

Related

Get all names from wikipedia-site?

I am trying to extract all names from this page:
https://en.wikipedia.org/wiki/Category:Masculine_given_names
(I want all names listed on this page and on its continuation pages, and also those in the subcategories listed at the top, such as Afghan masculine given names, African masculine given names, etc.)
I tried the following code:
import pywikibot
from pywikibot import pagegenerators
site = pywikibot.Site()
cat = pywikibot.Category(site,'Category:Masculine_given_names')
gen = pagegenerators.CategorizedPageGenerator(cat)
for idx, page in enumerate(gen):
    text = page.text
    print(idx)
    print(text)
This generally works and gives me at least the detail page for each name. But how can I get all the names from all the continuation pages and also from the subcategories?
How to find subcategories and subpages on wikipedia using pywikibot?
This is already answered here using Category methods, but you can also use the pagegenerators CategorizedPageGenerator function. All you need to do is set the recurse option:
>>> gen = pagegenerators.CategorizedPageGenerator(cat, recurse=True)
Refer to the documentation for details. You may also include pagegenerators options in your script in the way described in this example and call your script with the -catr option:
pwb.py <yourscript> -catr:Masculine_given_names
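To make clear what recursing into subcategories involves, here is a toy model in plain Python. It is not pywikibot's implementation: a dictionary stands in for the wiki, the page names are placeholders, and walk_category is a hypothetical helper. The point it illustrates is that categories can reference each other, so each one should be visited only once.

```python
def walk_category(tree, root):
    """Yield the pages of root and of every subcategory reachable from it."""
    seen = set()
    stack = [root]
    while stack:
        category = stack.pop()
        if category in seen:
            continue  # guard against categories that contain each other
        seen.add(category)
        node = tree.get(category, {})
        yield from node.get("pages", [])
        stack.extend(node.get("subcats", []))


# Placeholder data: category names from the question, made-up page names.
TOY_TREE = {
    "Masculine given names": {
        "pages": ["name-1", "name-2"],
        "subcats": ["Afghan masculine given names"],
    },
    "Afghan masculine given names": {
        "pages": ["name-3"],
        "subcats": ["Masculine given names"],  # deliberate cycle
    },
}

names = list(walk_category(TOY_TREE, "Masculine given names"))
```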

How to extract Infobox from (German) Wikipedia using MediaWiki API?

I want to extract the information in the infobox from specific Wikipedia pages, mainly countries. Specifically, I want to achieve this without scraping the page with Python + BeautifulSoup4 or any other language + library, if possible. I'd rather use the official API, because I noticed the CSS tags differ between Wikipedia subdomains (that is, between languages).
The question "How to get Infobox from a Wikipedia article by Mediawiki API?" states that the following method works, which is indeed true for the given title (Scary Monsters and Nice Sprites), but unfortunately it doesn't work on the pages I tried (further below).
https://en.wikipedia.org/w/api.php?action=query&prop=revisions&rvprop=content&format=xmlfm&titles=Scary%20Monsters%20and%20Nice%20Sprites&rvsection=0
However, I suppose Wikimedia changed their infobox template, because when I run the above query all I get is the content but not the infobox. For example, running the query on Europäische_Union (European_Union) returns, among other things, the following snippet:
{{Infobox Europäische Union}}
<!--{{Infobox Staat}} <- Vorlagen-Parameter liegen in [[Spezial:Permanenter Link/108232313]] -->
It works fine for the English version of Wikipedia though.
So the page I want to extract the infobox from would be: http://de.wikipedia.org/wiki/Europäische_Union
And this is the code I'm using:
#!/usr/bin/env python
# -*- coding: utf-8 -*-
import sys
reload(sys)
sys.setdefaultencoding("utf-8")
import lxml.etree
import urllib
title = "Europäische_Union"
params = { "format":"xml", "action":"query", "prop":"revisions", "rvprop":"content", "rvsection":0 }
params["titles"] = "API|%s" % urllib.quote(title.encode("utf8"))
qs = "&".join("%s=%s" % (k, v) for k, v in params.items())
url = "http://de.wikipedia.org/w/api.php?%s" % qs
tree = lxml.etree.parse(urllib.urlopen(url))
revs = tree.xpath('//rev')
print revs[-1].text
Am I missing something very substantial?
Data like this should not be taken from Wikipedia but from Wikidata, which is Wikipedia's structured-data counterpart. (Also, that's not a standard infobox: it has no parameters and is filled in on the template itself.)
Use the Wikidata API module wbgetclaims to get all the data on the European Union:
https://www.wikidata.org/w/api.php?action=wbgetclaims&entity=Q458
Much neater, eh? See https://www.wikidata.org/wiki/Wikidata:Data_access for more.
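A minimal Python 3 sketch of calling wbgetclaims and reading values out of the reply. It assumes format=json is added to the request and that each claim carries its value under mainsnak, datavalue, value; the function names and the User-Agent string are placeholders, not part of any library.

```python
import json
import urllib.parse
import urllib.request

WIKIDATA_API = "https://www.wikidata.org/w/api.php"


def get_claims(entity_id):
    """Fetch all claims for one entity via wbgetclaims, decoded from JSON."""
    query = urllib.parse.urlencode(
        {"action": "wbgetclaims", "entity": entity_id, "format": "json"}
    )
    request = urllib.request.Request(
        WIKIDATA_API + "?" + query,
        headers={"User-Agent": "wbgetclaims-sketch/0.1"},  # placeholder
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)


def claim_values(reply, property_id):
    """Return the main values recorded for one property in a wbgetclaims reply."""
    values = []
    for claim in reply.get("claims", {}).get(property_id, []):
        snak = claim.get("mainsnak", {})
        # Snaks of type "novalue" or "somevalue" carry no datavalue.
        if snak.get("snaktype") == "value":
            values.append(snak["datavalue"]["value"])
    return values
```

For example, claim_values(get_claims("Q458"), "P1082") would return the population values recorded for the European Union, each as a dictionary describing a quantity.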

wikitools, wikipedia and python

Does anybody have experience fetching a Wikipedia page using wikitools for Python (and Django)? I am trying to get an article, but I only get the first few lines. I need to fetch the whole article and can't figure out how. The documentation is not very helpful either. My code is:
wikiobj = wiki.Wiki("http://en.wikipedia.org/w/api.php?title=Some_Title&action=raw&maxlag=-1")
wikipage = page.Page(wikiobj, url, section='content')
wikidata = wikipage.getWikiText(True).decode('utf-8', 'replace')
Any help will be appreciated.
I'm using wikitools in my project, not for getting page text, but I initialize the wiki object differently:
wikiobj = wiki.Wiki("http://en.wikipedia.org/w/api.php")
wikipage = page.Page(wikiobj, title="Some_Title")
You don't need to supply any query string after api.php in the Wiki class.
Next, look at the definition of Page class:
__init__(self, site, title=False, check=True, followRedir=True, section=False, sectionnumber=False, pageid=False, namespace=False)
So you need to pass title to the constructor of the Page class (you passed an undefined url variable as the title and 'content' as the section).

ruby fetching url content is always empty

I am so frustrated trying to use Ruby to fetch the content of a specific URL.
I've tried many different ways, such as open-uri and a standard request; none has worked so far. I always get empty HTML. I also tried fetching the same URL with Python, which always returned the correct HTML content. I am really not sure why. Please help, as I am new to both Ruby and Python. I want to use Ruby (I prefer the tidy syntax and human-friendly function names, and installing libraries with gem and Homebrew on a Mac is easier than with Python's easy_install), but I am now considering Python because it just works (though I am still trying to get my head around the 2.x versus 3.x issue). I may be doing something really stupid, but I think that is very unlikely.
ruby 1.9.2p136 (2010-12-25 revision 30365) [i386-darwin10.6.0]
Implementation 1:
url = URI.parse('http//:www.stackoverflow.com/') req = Net::HTTP::Get.new(url.path)
res = Net::HTTP.start(url.host, url.port) {|http| http.request(req) }
puts res.body #empty
Implementation 2:
doc = Nokogiri::HTML(open("http//:www.stackoverflow.com/", "User-Agent" => "Safari"))
#empty
#I tried to use without user agent, without Nokogiri none worked.
Python Implementation which worked every time perfectly
f = urllib.urlopen("http//:www.stackoverflow.com/")
# Read from the object, storing the page's contents in 's'.
s = f.read()
f.close()
print s
If that is your exact code, it is invalid for several reasons:
http//: should be http://
The URL needs a path. If you want the root page of example.com, it needs to be http://example.com/; the trailing slash is significant.
If you put two statements on one line, you need to use ; to mark the end of the first one.
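The first two points can be checked quickly with a URL parser. The sketch below uses Python's urllib.parse rather than Ruby's URI (the question mentions both languages), and describe is a hypothetical helper; Ruby's parser may report the malformed URL differently, but in both cases there is no usable scheme or host to connect to.

```python
from urllib.parse import urlsplit


def describe(url):
    """Split a URL into the parts an HTTP client needs."""
    parts = urlsplit(url)
    return {"scheme": parts.scheme, "host": parts.netloc, "path": parts.path}


# Misplaced colon: the parser finds neither a scheme nor a host,
# and the whole string ends up in the path.
broken = describe("http//:www.stackoverflow.com/")

# Corrected URL: scheme, host and root path are all present.
fixed = describe("http://www.stackoverflow.com/")
```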
So:
require 'net/http'
url = URI.parse('http://www.yellowpages.com.au/search/listings?clue=plumber&locationClue=Australia')
req = Net::HTTP::Get.new(url.path)
res = Net::HTTP.start(url.host, url.port) {|http| http.request(req) }
puts res.body
The same is true when using open with Nokogiri.
EDIT: that site often returns bad results:
counter = 0
20.times do
  url = URI.parse('http://www.yellowpages.com.au/search/listings?clue=plumber&locationClue=Australia')
  req = Net::HTTP::Get.new(url.path)
  res = Net::HTTP.start(url.host, url.port) {|http| http.request(req) }
  sleep 1
  counter += 1 unless res.body.empty?
end
puts counter
For me this returned a non-empty body only once out of 20 attempts. If you substitute another site, it works every time.
curl "http://www.yellowpages.com.au/search/listings?clue=plumber&locationClue=Australia"
Yields the same inconsistent results.
Two examples with open-uri (standard library), a wrapper for (among others) the rather cumbersome Net::HTTP:
require 'open-uri'
open("http://www.stackoverflow.com/"){|f| puts f.read}
puts URI::parse("http://www.google.com/").read

how to search for specific file type with yahoo search API?

Does anyone know whether Yahoo's programmatic search has a parameter that restricts results to links to files of a specific type (PDF, for example)?
It's possible to do that in the GUI, but how can it be done through the API?
I'd very much appreciate sample code in Python, but any other solution would be helpful as well.
Yes, there is:
http://developer.yahoo.com/search/boss/boss_guide/Web_Search.html#id356163
Thank you.
I found that something like this works OK (the file type is the first argument, and the remaining arguments form the query):
format = sys.argv[1]
query = " ".join(sys.argv[2:])
srch = create_search("Web", app_id, query=query, format=format)
Here's what I do for this sort of thing. It exposes more of the parameters so you can tune it to your needs. This should print the URLs of the first ten PDFs for the query "resume" [mine's not one of them ;) ]. You can download those URLs however you like.
The JSON dictionary returned from the query is a little gross, but this should get you started. Be aware that in real code you will need to check whether some of the keys in the dictionary exist. When there are no results, this code will probably throw an exception.
The link that Tiago provided is good for knowing what values are supported for the "type" parameter.
from yos.crawl import rest
APPID="XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX"
base_url = "http://boss.yahooapis.com/ysearch/%s/v%d/%s?start=%d&count=%d&type=%s" + "&appid=" + APPID
querystr="resume"
start=0
count=10
type="pdf"
search_url = base_url % ("web", 1, querystr, start, count, type)
json_result = rest.load_json(search_url)
for url in [recs['url'] for recs in json_result['ysearchresponse']['resultset_web']]:
    print url
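The caveat about missing keys could be handled along these lines. This is a sketch in Python 3 that assumes the response layout used in the code above (ysearchresponse, resultset_web, url); result_urls is a hypothetical helper, not part of the BOSS library.

```python
def result_urls(json_result):
    """Return result URLs, or an empty list when the expected keys are absent."""
    response = json_result.get("ysearchresponse", {})
    records = response.get("resultset_web", [])
    # Skip any record that has no "url" entry instead of raising KeyError.
    return [record["url"] for record in records if "url" in record]
```

With this, a query that returns no results yields an empty list rather than an exception, so the printing loop simply does nothing.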
