i try to extract all names from this site -
https://en.wikipedia.org/wiki/Category:Masculine_given_names
(and i want to have all names which are listed on this site and the following pages - but also the subcategories which are listed at the top like Afghan masculine given names, African masculine given names, etc.)
I tried this with the following code:
import pywikibot
from pywikibot import pagegenerators
site = pywikibot.Site()
cat = pywikibot.Category(site,'Category:Masculine_given_names')
gen = pagegenerators.CategorizedPageGenerator(cat)
for idx,page in enumerate(gen):
text = page.text
print(idx)
print(text)
Which generally works fine and gave me at least the detail-page of a single name page. But how can i get all the names / from all the subpages on this site but also from the subcategories?
How to find subcategories and subpages on wikipedia using pywikibot?
This is already answered here using Category methods but you can also use pagegenerators CategorizedPageGenerator function. All you need is setting the recurse option:
>>> gen = pagegenerators.CategorizedPageGenerator(cat, recurse=True)
Refer the documentation for it. You may also include pagegenerators options within your script in such a way decribed in this example and call your script with -catr option:
pwb.py <yourscript> -catr:Masculine_given_names
Related
I'm trying to extract the version number from software packages hosted on SourceForge based on this Stack Overflow post. Specifically, I'm using the Release API and the "best_release.json" call. I have the following examples:
7-zip: https://sourceforge.net/projects/sevenzip/best_release.json
KeePass: https://sourceforge.net/projects/keepass/best_release.json
OpenOffice.org:
https://sourceforge.net/projects/openofficeorg.mirror/best_release.json
Using the following code snippet:
import requests
"""
Un/comment the following lines to change the project name and test
different responses.
"""
proj = "keepass"
# proj = "sevenzip"
# proj = "openofficeorg.mirror"
r = requests.get(f'https://sourceforge.net/projects/{proj}/best_release.json')
json_resp = r.json()
print(json_resp['release']['filename'])
I receive the respective results for each package:
7-Zip: /7-Zip/22.00/7z2200-linux-x86.tar.xz
KeePass: /KeePass 2.x/2.51.1/KeePass-2.51.1.zip
Openoffice.org: /extended/iso/en/OOo_3.3.0_Win_x86_install_en-US_20110219.iso
I'm wondering how I can extract the file versions from these disparate packages. Looking at the results, one can see that there are different naming conventions. For example, 7-Zip puts the file version as "22.00" in the second directory level. KeePass, however, puts it in the second directory level as well as the filename itself. OpenOffice.org puts it inside the filename.
Is there a way to do some sort of fuzzy match that can attempt to extract a "best guess" file version given a filename?
I thought of using regular expressions, re. For example, I can use the (\d+) capture group to capture one or more digits, as demonstrated here. However, this would also capture text such as "x86," which I don't want. I just desire some text that looks closest to a version number, but I'm unsure how to do this.
in need to extract more information from tripAdvisor
my code:
item = TripadvisorItem()
item['url'] = response.url.encode('ascii', errors='ignore')
item['state'] = hxs.xpath('//*[#id="PAGE"]/div[2]/div[1]/ul/li[2]/a/span/text()').extract()[0].encode('ascii', errors='ignore')
if(item['state']==[]):
item['state']=hxs.xpath('//*[#id="HEADING_GROUP"]/div[2]/address/span/span/span[contains(#class,"region_title")][2]/text()').extract()
item['city'] = hxs.select('//*[#id="PAGE"]/div[2]/div[1]/ul/li[3]/a/span/text()').extract()
if(item['city']==[]):
item['city'] =hxs.xpath('//*[#id="HEADING_GROUP"]/div[2]/address/span/span/span[1]/span/text()').extract()
if(item['city']==[]):
item['city']=hxs.xpath('//*[#id="HEADING_GROUP"]/div[2]/address/span/span/span[3]/span/text()').extract()
item['city']= item['city'][0].encode('ascii', errors='ignore')
item['hotelName'] = hxs.xpath('//*[#id="HEADING"]/span[2]/span/a/text()').extract()
item['hotelName']=item['hotelName'][0].encode('ascii', errors='ignore')
reviews = hxs.select('.//div[contains(#id, "review")]')
1. For every hotel in tripAdvisor, there is a id number for the hotel. like 80075 for this hotel: http://www.tripadvisor.com/Hotel_Review-g60763-d80075-Reviews-Amsterdam_Court_Hotel-New_York_City_New_York.html#REVIEWS
how can i extract this id from the TA item?
More information i need for every hotel : shortDescription, stars, zipCode, country and coordinates(long, lat). Can i extract this things?
i need to extract for every review the traveller type. how?
my code for review:
for review in reviews:
it = Review()
it['state'] = item['state']
it['city'] = item['city']
it['hotelName'] = item['hotelName']
it['date'] = review.xpath('.//div[1]/div[2]/div/div[2]/span[2]/#title').extract()
if(it['date']==[]):
it['date']=review.xpath('.//div[1]/div[2]/div/div[2]/span[2]/text()').extract()
if(it['date']!=[]):
it['date']=it['date'][0].encode('ascii', errors='ignore').replace("Reviewed","").strip()
it['userName'] = review.xpath('.//div[contains(#class,"username mo")]/span/text()').extract()
if (it['userName']!=[]):
it['userName']=it['userName'][0].encode('ascii', errors='ignore')
it['userLocation'] = ''.join(review.xpath('.//div[contains(#class,"location")]/text()').extract()).strip().encode('ascii', errors='ignore')
it['reviewTitle'] = review.xpath('.//div[1]/div[2]/div[1]/div[contains(#class,"quote")]/text()').extract()
if(it['reviewTitle']!=[]):
it['reviewTitle']=it['reviewTitle'][0].encode('ascii', errors='ignore')
else:
it['reviewTitle'] = review.xpath('.//div[1]/div[2]/div/div[1]/a/span[contains(#class,"noQuotes")]/text()').extract()
if(it['reviewTitle']!=[]):
it['reviewTitle']=it['reviewTitle'][0].encode('ascii', errors='ignore')
it['reviewContent'] = review.xpath('.//div[1]/div[2]/div[1]/div[3]/p/text()').extract()
if(it['reviewContent']!=[]):
it['reviewContent']=it['reviewContent'][0].encode('ascii', errors='ignore').strip()
it['generalRating'] = review.xpath('.//div/div[2]/div/div[2]/span[1]/img/#alt').extract()
if(it['generalRating']!=[]):
it['generalRating'] =it['generalRating'][0].encode('ascii', errors='ignore').split()[0]
there is a good manual how to find these things? i lost myself with all the spans and the divs..
thanks!
I'll try to do this in purely XPath. Unfortunately, it looks like most of the info you want is contained in <script> tags:
Hotel ID - Returns "80075"
substring-before(normalize-space(substring-after(//script[contains(., "geoId:") and contains(., "lat")]/text(), "locId:")), ",")
Alternatively, the Hotel ID is in the URL, as another answerer mentioned. If you're sure the format will always be the same (such as including a "d" prior to the ID), then you can use that instead.
Rating (the one at the top) - Returns "3.5"
//span[contains(#class, "rating_rr")]/img/#content
There are a couple instances of ratings on this page. The main rating at the top is what I've grabbed here. I haven't tested this within Scrapy, so it's possible that it's popoulated by JavaScript and not initially loaded as part of the HTML. If that's the case, you'll need to grab it somewhere else or use something like Selenium/PhantomJS.
Zip Code - Returns "10019"
(//span[#property="v:postal-code"]/text())[1]
Again, same deal as above. It's in the HTML, but you should check whether it's there upon page load.
Country - Returns ""US""
substring-before(substring-after(//script[contains(., "modelLocaleCountry")]/text(), "modelLocaleCountry = "), ";")
This one comes with quotes. You can always (and you should) use a pipeline to sanitize scraped data to get it to look the way you want.
Coordinates - Returns "40.76174" and "-73.985275", respectively
Lat: substring-before(normalize-space(substring-after(//script[contains(., "geoId:") and contains(., "lat")]/text(), "lat:")), ",")
Lon: substring-before(normalize-space(substring-after(//script[contains(., "geoId:") and contains(., "lat")]/text(), "lng:")), ",")
I'm not entirely sure where the short description exists on this page, so I didn't include that. It's possible you have to navigate elsewhere to get it. I also wasn't 100% sure what the "traveler type" meant, so I'll leave that one up to you.
As far as a manual, it's really about practice. You learn tricks and hacks for working within XPath, and Scrapy allows you to use some added features, such as regex and pipelines. I wouldn't recommend doing the whole "absolute path" XPath (i.e., ./div/div[3]/div[2]/ul/li[3]/...), since any deviation from that within the DOM will completely ruin your scraping. If you have a lot of data to scrape, and you plan on keeping this around a while, your project will become unmanageable very quickly if any site moves around even a single <div>.
I'd recommend more "querying" XPaths, such as //div[contains(#class, "foo")]//a[contains(#href, "detailID")]. Paths like that will make sure that no matter how many elements are placed between the elements you know will be there, and even if multiple target elements are slightly different from each other, you'll be able to grab them consistently.
XPaths are a lot of trial and error. A LOT. Here are a few tools that help me out significantly:
XPath Helper (Chrome extension)
scrapy shell <URL>
scrapy view <URL> (for rendering Scrapy's response in a browser)
PhantomJS (if you're interested in getting data that's been inserted via JavaScript)
Hope some of this helped.
Is it acceptable to get it from the URL using a regex?
id = re.search('(-d)([0-9]+)',url).group(2)
I want to extract the information in the Infobox from specific Wikipedia pages, mainly countries. Specifically I want to achieve this without scraping the page using Python + BeautifulSoup4 or any other languages + libraries, if possible. I'd rather use the official API, because I noticed the CSS tags are different for different Wikipedia subdomains (as in other languages).
In How to get Infobox from a Wikipedia article by Mediawiki API? states that using the following method would work, which is indeed true for the given tital (Scary Monsters and Nice Sprites), but unfortunately doesn't work on the pages I tried (further below).
https://en.wikipedia.org/w/api.php?action=query&prop=revisions&rvprop=content&format=xmlfm&titles=Scary%20Monsters%20and%20Nice%20Sprites&rvsection=0
However, I suppose Wikimedia changed their infobox template, because when I run the above query all I get is the content, but not the infobox. E.g. running the query on Europäische_Union (European_Union) results (among others) in the following snippet
{{Infobox Europäische Union}}
<!--{{Infobox Staat}} <- Vorlagen-Parameter liegen in [[Spezial:Permanenter Link/108232313]] -->
It works fine for the English version of Wikipedia though.
So the page I want to extract the infobox from would be: http://de.wikipedia.org/wiki/Europäische_Union
And this is the code I'm using:
#!/usr/bin/env python
# -*- coding: utf-8 -*-
import sys
reload(sys)
sys.setdefaultencoding("utf-8")
import lxml.etree
import urllib
title = "Europäische_Union"
params = { "format":"xml", "action":"query", "prop":"revisions", "rvprop":"content", "rvsection":0 }
params["titles"] = "API|%s" % urllib.quote(title.encode("utf8"))
qs = "&".join("%s=%s" % (k, v) for k, v in params.items())
url = "http://de.wikipedia.org/w/api.php?%s" % qs
tree = lxml.etree.parse(urllib.urlopen(url))
revs = tree.xpath('//rev')
print revs[-1].text
Am I missing something very substantial?
Data must not be taken from Wikipedia, but from Wikidata which is Wikipedia's structured data counterpart. (Also, that's not a standard infobox: it has no parameters and it's filled on the template itself.)
Use the Wikidata API module wbgetclaims to get all the data on the European Union:
https://www.wikidata.org/w/api.php?action=wbgetclaims&entity=Q458
Much neater, eh? See https://www.wikidata.org/wiki/Wikidata:Data_access for more.
I am trying to crawl pubmed with python and get the pubmed ID for all papers that an article was cited by.
For example this article (ID: 11825149)
http://www.ncbi.nlm.nih.gov/pubmed/11825149
Has a page linking to all articles that cite it:
http://www.ncbi.nlm.nih.gov/pubmed?linkname=pubmed_pubmed_citedin&from_uid=11825149
The problem is it has over 200 links but only shows 20 per page. The 'next page' link is not accessible by url.
Is there a way to open the 'send to' option or view the content on the next pages with python?
How I currently open pubmed pages:
def start(seed):
webpage = urlopen(seed).read()
print webpage
citedByPage = urlopen('http://www.ncbi.nlm.nih.gov/pubmedlinkname=pubmed_pubmed_citedin&from_uid=' + pageid).read()
print citedByPage
From this I can extract all the cited by links on the first page, but how can I extract them from all pages? Thanks.
I was able to get the cited by IDs using a method from this page
http://www.bio-cloud.info/Biopython/en/ch8.html
Back in Section 8.7 we mentioned ELink can be used to search for citations of a given paper. Unfortunately this only covers journals indexed for PubMed Central (doing it for all the journals in PubMed would mean a lot more work for the NIH). Let’s try this for the Biopython PDB parser paper, PubMed ID 14630660:
>>> from Bio import Entrez
>>> Entrez.email = "A.N.Other#example.com"
>>> pmid = "14630660"
>>> results = Entrez.read(Entrez.elink(dbfrom="pubmed", db="pmc",
... LinkName="pubmed_pmc_refs", from_uid=pmid))
>>> pmc_ids = [link["Id"] for link in results[0]["LinkSetDb"][0]["Link"]]
>>> pmc_ids
['2744707', '2705363', '2682512', ..., '1190160']
Great - eleven articles. But why hasn’t the Biopython application note been found (PubMed ID 19304878)? Well, as you might have guessed from the variable names, there are not actually PubMed IDs, but PubMed Central IDs. Our application note is the third citing paper in that list, PMCID 2682512.
So, what if (like me) you’d rather get back a list of PubMed IDs? Well we can call ELink again to translate them. This becomes a two step process, so by now you should expect to use the history feature to accomplish it (Section 8.15).
But first, taking the more straightforward approach of making a second (separate) call to ELink:
>>> results2 = Entrez.read(Entrez.elink(dbfrom="pmc", db="pubmed", LinkName="pmc_pubmed",
... from_uid=",".join(pmc_ids)))
>>> pubmed_ids = [link["Id"] for link in results2[0]["LinkSetDb"][0]["Link"]]
>>> pubmed_ids
['19698094', '19450287', '19304878', ..., '15985178']
This time you can immediately spot the Biopython application note as the third hit (PubMed ID 19304878).
Now, let’s do that all again but with the history …TODO.
And finally, don’t forget to include your own email address in the Entrez calls.
I am trying to export a category from Turkish wikipedia page by following http://www.mediawiki.org/wiki/Manual:Parameters_to_Special:Export . Here is the code I am using;
# -*- coding: utf-8 -*-
import requests
from BeautifulSoup import BeautifulStoneSoup
from sys import version
link = "http://tr.wikipedia.org/w/index.php?title=%C3%96zel:D%C4%B1%C5%9FaAktar&action=submit"
def get(pages=[], category = False, curonly=True):
params = {}
if pages:
params["pages"] = "\n".join(pages)
if category:
params["addcat"] = 1
params["category"] = category
if curonly:
params["curonly"] = 1
headers = {"User-Agent":"Wiki Downloader -- Python %s, contact: Yaşar Arabacı: yasar11732#gmail.com" % version}
r = requests.post(link, headers=headers, data=params)
return r.text
print get(category="Matematik")
Since I am trying to get data from Turkish wikipedia, I have used its url. Other things should be self explanatory. I am getting the form page that you can use to export data instead of the actual xml. Can anyone see what am I doing wrong here? I have also tried making a get request.
There is no parameter named category, the category name should be in the catname parameter.
But Special:Export was not build for bots, it was build for humans. So, if you use catname correctly, it will return the form again, this time with pages from the category filled in. Then you are supposed to click "Submit" again, which will return the XML you want.
I think doing this in code would be too complicated. It would be easier if you used the API instead. There are some Python libraries that can help you with that: Pywikipediabot or wikitools.
Sorry my original answer was horribly flawed. I misunderstood the original intent.
I did some more experimenting because I was curious. It seems that the code you have above is not necessarily incorrect, it is, in fact, that the Special Export documentation is misleading. The documentation states that using catname and addcat will add the categories to the output, but instead it only lists the pages and categories within the specified catname inside an html form. It seems that wikipedia actually requires that the pages that you wish download be specified explicitly. Granted, there documentation doesn't necessarily appear to be very thorough on that matter. I would suggest that you parse the page for the pages within the category and then explicitly download those pages with your script. I do see an an issue with this approach in terms of efficiency. Due to the nature of Wikipedia's data, you'll get a lot of pages which are simply category pages of other pages.
As an aside, it could possibly be faster to use the actual corpus of data from Wikipedia which is available for download.
Good luck!