Grab numbers from web page data tables - Python

I very often need to set up physical properties for technical computations, and it is not convenient to fill in such data by hand. I would like to grab such data from a public web page (Wikipedia, for example) using a Python script.
I have tried several approaches:
using an HTML parser like lxml.etree (I have no experience with it - I was just trying to follow a tutorial)
using the pandas wikitable import (likewise just following a tutorial)
using urllib2 to download the HTML source and then searching for keywords with regular expressions
What I'm able to do:
I didn't find any universal solution applicable to various sources of information. The only script I made that actually works uses just plain urllib2 and regular expressions. It can grab physical properties of elements from this page, which is plain HTML.
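For illustration, it boils down to something like this (a simplified sketch; the URL and the regular expression here are just placeholders for the real ones):

import urllib2
import re

# placeholder URL; the real script points at a plain-HTML table of element properties
html = urllib2.urlopen('http://www.example.com/element_properties.html').read()

# e.g. pull the number that follows the keyword "Density"
m = re.search(r'Density[^0-9]*([0-9.]+)', html)
if m:
    density = float(m.group(1))
    print density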
What I'm not able to do:
I'm not able to do the same with more sophisticated web pages like this one. The HTML I download with urllib2 does not contain the keywords and data I'm looking for (like Flexural strength or Modulus of elasticity). Actually, it seems it does not contain the wiki page at all! How is that possible? Are these wiki tables linked dynamically somehow? How can I get the content of the table with urllib? Why does my web browser get this data when urllib2 does not?
I have no experience with web programming.
I don't understand why it is so hard to get any machine-readable data from free public online sources of information.

Web scraping is difficult. Not because it's rocket science, but because it's just messy. For the moment you'll need to hand-craft scrapers for different sites and use them as long as the site's structure does not change.
There are more automated approaches to web information extraction, e.g. as described in this paper: Harvesting Relational Tables from Lists on the Web, but this is not mainstream yet.
A large number of web pages contain data structured in the form of "lists". Many such lists can be further split into multi-column tables, which can then be used in more semantically meaningful tasks. However, harvesting relational tables from such lists can be a challenging task. The lists are manually generated and hence need not have well defined templates – they have inconsistent delimiters (if any) and often have missing information.
However, there are a lot of tools to get to the (HTML) content more quickly, e.g. BeautifulSoup:
Beautiful Soup is a Python library designed for quick turnaround projects like screen-scraping.
>>> from BeautifulSoup import BeautifulSoup as Soup
>>> import urllib
>>> page = urllib.urlopen("http://www.substech.com/dokuwiki/doku.php?"
...                       "id=thermoplastic_acrylonitrile-butadiene-styrene_abs").read()
>>> soup = Soup(page) # the HTML gets parsed here
>>> soup.findAll('table')
Example output: https://friendpaste.com/DnWDviSiHIYQEBduTqkWd. More documentation can be found here: http://www.crummy.com/software/BeautifulSoup/bs4/doc/#searching-the-tree.
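If you want the numbers rather than raw tags, you can walk the rows of those tables. A rough sketch (assuming, as the page's markup suggests, that the property name sits in a <th> cell and the values/units in the following <td> cells):

# build {property: (metric value, metric unit)} from the parsed tables
props = {}
for table in soup.findAll('table'):
    for row in table.findAll('tr'):
        header = row.find('th')   # property name cell, e.g. "Density"
        cells = [''.join(td.findAll(text=True)).strip() for td in row.findAll('td')]
        if header is not None and len(cells) >= 2:
            name = ''.join(header.findAll(text=True)).strip()
            props[name] = (cells[0], cells[1])

print props.get('Modulus of elasticity')   # -> ('2.45', 'GPa') on that page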
If you want to extract data from a bigger set of pages, take a look at scrapy.
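A minimal spider for the same page could look roughly like this (a sketch, not a drop-in solution; the item structure is up to you):

import scrapy

class MaterialSpider(scrapy.Spider):
    name = "materials"
    start_urls = [
        "http://www.substech.com/dokuwiki/doku.php?"
        "id=thermoplastic_acrylonitrile-butadiene-styrene_abs",
    ]

    def parse(self, response):
        # yield the raw cell texts of every table row; post-process as needed
        for row in response.css("table tr"):
            cells = [c.strip() for c in row.css("th::text, td::text").extract()]
            if cells:
                yield {"cells": cells}

Save it as e.g. materials.py and run scrapy runspider materials.py -o abs.json to dump the rows to a JSON file.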

I don't understand what you mean by
it seems it does not contain the wiki page at all
I got the following relatively quickly:
import httplib
import re

hostu = 'www.substech.com'
timeout = 7
hypr = httplib.HTTPConnection(host=hostu, timeout=timeout)
rekete_page = ('/dokuwiki/doku.php?id='
               'thermoplastic_acrylonitrile-butadiene-styrene_abs')
hypr.request('GET', rekete_page)
x = hypr.getresponse().read()
hypr.close()

#print '\n'.join('%d %r' % (i,line) for i,line in enumerate(x.splitlines(1)))

r = re.compile('\t<tr>\n.+?\t</tr>\n', re.DOTALL)  # one <tr>...</tr> block per match
r2 = re.compile('<th[^>]*>(.*?)</th>')             # header cells (property names)
r3 = re.compile('<td[^>]*>(.*?)</td>')             # data cells (values and units)

for y in r.findall(x):
    print
    #print repr(y)
    print map(str.strip, r2.findall(y))
    print map(str.strip, r3.findall(y))
Result:
[]
['<strong>Thermoplastic</strong>']
[]
['<strong>Acrylonitrile</strong><strong>-Butadiene-Styrene (ABS)</strong>']
[]
['<strong>Property</strong>', '<strong>Value in metric unit</strong>', '<strong>Value in </strong><strong>US</strong><strong> unit</strong>']
['Density']
['1.05 *10\xc2\xb3', 'kg/m\xc2\xb3', '65.5', 'lb/ft\xc2\xb3']
['Modulus of elasticity']
['2.45', 'GPa', '350', 'ksi']
['Tensile strength']
['45', 'MPa', '6500', 'psi']
['Elongation']
['33', '%', '33', '%']
['Flexural strength']
['70', 'MPa', '10000', 'psi']
['Thermal expansion (20 \xc2\xbaC)']
['90*10<sup>-6</sup>', '\xc2\xbaC\xcb\x89\xc2\xb9', '50*10<sup>-6</sup>', 'in/(in* \xc2\xbaF)']
['Thermal conductivity']
['0.25', 'W/(m*K)', '1.73', 'BTU*in/(hr*ft\xc2\xb2*\xc2\xbaF)']
['Glass transition temperature']
['100', '\xc2\xbaC', '212', '\xc2\xbaF']
['Maximum work temperature']
['70', '\xc2\xbaC', '158', '\xc2\xbaF']
['Electric resistivity']
['10<sup>8</sup>', 'Ohm*m', '10<sup>10</sup>', 'Ohm*cm']
['Dielectric constant']
['2.4', '-', '2.4', '-']
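If you would rather end up with a dict of properties than printed lists, you can strip the remaining inline tags and pair each header cell with its value cells; a rough continuation of the loop above:

tag = re.compile(r'<[^>]+>')   # strips leftovers like <strong> or <sup>
properties = {}
for y in r.findall(x):
    names = [tag.sub('', h).strip() for h in r2.findall(y)]
    cells = [tag.sub('', c).strip() for c in r3.findall(y)]
    if names and len(cells) >= 2:
        # keep only the metric value and unit, e.g. ('70', 'MPa')
        properties[names[0]] = (cells[0], cells[1])

print properties.get('Flexural strength')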

Related

How do I find text after a given key word?

I am practicing my programming skills (in Python) and I realized that I don't know what to do when I need to find a value that is unknown but introduced by a keyword. I am taking the information from a website whose page source contains '"size":"10","stockKeepingUnitId":"(random number)"'.
How can I figure out what that number is?
This is what I have so far --
from bs4 import BeautifulSoup as bs  # assuming this import, since bs(...) is used below

def stock():
    global session
    endpoint = '(website)'
    response = session.get(endpoint)  # 'session' is a requests.Session created elsewhere
    soup = bs(response.text, "html.parser")
    sizes = soup.find('"size":"10","stockKeepingUnitId":')
Off the top of my head there are two ways to do this. Say you have the string mystr = 'some text...content:"67588978"'. The first way is just to search for "content:" in the string and use string slicing to take everything after it:
num = mystr[mystr.index('content:"') + len('content:"'):-1]
Alternatively, and probably as the better solution, you could use regular expressions:
import re
nums = re.findall(r'.*?content:\"(\d+)\"', mystr)
As you haven't provided an example of the dataset you're trying to analyze, there could also be a number of other solutions. If you're trying to parse a JSON or YAML file, there are simple libraries to turn them into Python dicts (json is part of the standard library, and PyYAML handles YAML files easily).
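Applied to the fragment you quoted, that would look something like this (page_source stands in for response.text; the exact surrounding markup is a guess):

import re

page_source = '..."size":"10","stockKeepingUnitId":"67588978"...'  # stand-in for response.text
m = re.search(r'"size":"10","stockKeepingUnitId":"(\d+)"', page_source)
if m:
    print(m.group(1))  # -> 67588978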

unable to fetch the list values from the website

I fetch all the details from the target website but I am unable to get some specific information (the size list). Please guide me on that.
Targeted domain: https://shop.adidas.ae/en/messi-16-3-indoor-boots/BA9855.html
My code is: response.xpath('//ul[@class="product-size"]//li/text()').extract()
I need to fetch that data.
Thanks!
E-commerce websites often have the data in JSON format in the page source and have JavaScript unpack it on the user's end.
In this case you can open the page source with JavaScript disabled and search for keywords (like a specific size).
Here it can be found with regular expressions:
import re
import json
data = re.findall('window.assets.sizesMap = (\{.+?\});', response.body_as_unicode())
json.loads(data[0])
Out:
{'16': {'uk': '0k', 'us': '0.5'},
'17': {'uk': '1k', 'us': '1'},
'18': {'uk': '2k', 'us': '2.5'},
...}
Edit: more precisely, you probably want a different part of the JSON, but the approach is more or less the same:
data = re.findall('window.assets.sizes = (\{(?:.|\n)+?\});', response.body_as_unicode())
json.loads(data[0].replace("'", '"'))  # replace single quotes with double quotes
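Either way, once json.loads has turned it into a Python dict you can index into it directly; e.g. with the sizesMap shown above:

size_map = json.loads(re.findall('window.assets.sizesMap = (\{.+?\});', response.body_as_unicode())[0])
# which internal id corresponds to UK size '1k'?
print([k for k, v in size_map.items() if v.get('uk') == '1k'])  # -> ['17'] given the output above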
The data you want to fetch is loaded by JavaScript; the tag's class, "js-size-value", says so explicitly.
If you want to get it, you will need a rendering service. I suggest Splash: it is simple to install and simple to use. You will need Docker to install Splash.
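A minimal sketch of the spider side with scrapy-splash (assuming Splash is running locally on its default port and the scrapy-splash middlewares are enabled in settings, as described in its README):

import scrapy
from scrapy_splash import SplashRequest  # pip install scrapy-splash

class SizesSpider(scrapy.Spider):
    name = "sizes"

    def start_requests(self):
        url = "https://shop.adidas.ae/en/messi-16-3-indoor-boots/BA9855.html"
        # give the JavaScript that fills in the size list a moment to run
        yield SplashRequest(url, self.parse, args={"wait": 2})

    def parse(self, response):
        # the rendered HTML now contains the size list
        yield {"sizes": response.xpath('//ul[@class="product-size"]//li/text()').extract()}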

How to extract Infobox from (German) Wikipedia using MediaWiki API?

I want to extract the information in the Infobox from specific Wikipedia pages, mainly countries. Specifically, I want to achieve this without scraping the page using Python + BeautifulSoup4 or any other language + library, if possible. I'd rather use the official API, because I noticed the CSS tags differ between Wikipedia subdomains (i.e. other languages).
How to get Infobox from a Wikipedia article by Mediawiki API? states that the following method works, which is indeed true for the given title (Scary Monsters and Nice Sprites), but unfortunately it doesn't work on the pages I tried (further below).
https://en.wikipedia.org/w/api.php?action=query&prop=revisions&rvprop=content&format=xmlfm&titles=Scary%20Monsters%20and%20Nice%20Sprites&rvsection=0
However, I suppose Wikimedia changed their infobox template, because when I run the above query all I get is the content but not the infobox. E.g. running the query on Europäische_Union (European_Union) results, among other things, in the following snippet:
{{Infobox Europäische Union}}
<!--{{Infobox Staat}} <- Vorlagen-Parameter liegen in [[Spezial:Permanenter Link/108232313]] -->
It works fine for the English version of Wikipedia though.
So the page I want to extract the infobox from would be: http://de.wikipedia.org/wiki/Europäische_Union
And this is the code I'm using:
#!/usr/bin/env python
# -*- coding: utf-8 -*-
import sys
reload(sys)
sys.setdefaultencoding("utf-8")
import lxml.etree
import urllib
title = "Europäische_Union"
params = { "format":"xml", "action":"query", "prop":"revisions", "rvprop":"content", "rvsection":0 }
params["titles"] = "API|%s" % urllib.quote(title.encode("utf8"))
qs = "&".join("%s=%s" % (k, v) for k, v in params.items())
url = "http://de.wikipedia.org/w/api.php?%s" % qs
tree = lxml.etree.parse(urllib.urlopen(url))
revs = tree.xpath('//rev')
print revs[-1].text
Am I missing something very substantial?
Data like this should not be taken from Wikipedia but from Wikidata, which is Wikipedia's structured-data counterpart. (Also, that's not a standard infobox: it has no parameters and it is filled in on the template itself.)
Use the Wikidata API module wbgetclaims to get all the data on the European Union:
https://www.wikidata.org/w/api.php?action=wbgetclaims&entity=Q458
Much neater, eh? See https://www.wikidata.org/wiki/Wikidata:Data_access for more.
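If you want that from Python, a quick sketch with requests (the property ID used below, P1082 for population, is just an example; look up the IDs you need on Wikidata):

import requests

# Q458 = European Union
resp = requests.get(
    "https://www.wikidata.org/w/api.php",
    params={"action": "wbgetclaims", "entity": "Q458", "format": "json"},
)
claims = resp.json()["claims"]
for claim in claims.get("P1082", []):          # P1082 = population
    print(claim["mainsnak"]["datavalue"]["value"])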

How to access pubmed data for forms with multiple pages in python crawler

I am trying to crawl PubMed with Python and get the PubMed IDs of all papers that cite a given article.
For example this article (ID: 11825149)
http://www.ncbi.nlm.nih.gov/pubmed/11825149
Has a page linking to all articles that cite it:
http://www.ncbi.nlm.nih.gov/pubmed?linkname=pubmed_pubmed_citedin&from_uid=11825149
The problem is that it has over 200 links but only shows 20 per page, and the 'next page' link is not accessible by URL.
Is there a way to open the 'send to' option or view the content on the next pages with python?
How I currently open PubMed pages:
from urllib2 import urlopen

def start(seed):
    webpage = urlopen(seed).read()
    print webpage
    citedByPage = urlopen('http://www.ncbi.nlm.nih.gov/pubmed?linkname=pubmed_pubmed_citedin&from_uid=' + pageid).read()
    print citedByPage
From this I can extract all the cited by links on the first page, but how can I extract them from all pages? Thanks.
I was able to get the cited-by IDs using a method from this page:
http://www.bio-cloud.info/Biopython/en/ch8.html
Back in Section 8.7 we mentioned ELink can be used to search for citations of a given paper. Unfortunately this only covers journals indexed for PubMed Central (doing it for all the journals in PubMed would mean a lot more work for the NIH). Let’s try this for the Biopython PDB parser paper, PubMed ID 14630660:
>>> from Bio import Entrez
>>> Entrez.email = "A.N.Other@example.com"
>>> pmid = "14630660"
>>> results = Entrez.read(Entrez.elink(dbfrom="pubmed", db="pmc",
... LinkName="pubmed_pmc_refs", from_uid=pmid))
>>> pmc_ids = [link["Id"] for link in results[0]["LinkSetDb"][0]["Link"]]
>>> pmc_ids
['2744707', '2705363', '2682512', ..., '1190160']
Great - eleven articles. But why hasn't the Biopython application note been found (PubMed ID 19304878)? Well, as you might have guessed from the variable names, these are not actually PubMed IDs, but PubMed Central IDs. Our application note is the third citing paper in that list, PMCID 2682512.
So, what if (like me) you’d rather get back a list of PubMed IDs? Well we can call ELink again to translate them. This becomes a two step process, so by now you should expect to use the history feature to accomplish it (Section 8.15).
But first, taking the more straightforward approach of making a second (separate) call to ELink:
>>> results2 = Entrez.read(Entrez.elink(dbfrom="pmc", db="pubmed", LinkName="pmc_pubmed",
... from_uid=",".join(pmc_ids)))
>>> pubmed_ids = [link["Id"] for link in results2[0]["LinkSetDb"][0]["Link"]]
>>> pubmed_ids
['19698094', '19450287', '19304878', ..., '15985178']
This time you can immediately spot the Biopython application note as the third hit (PubMed ID 19304878).
Now, let’s do that all again but with the history …TODO.
And finally, don’t forget to include your own email address in the Entrez calls.

exporting wikipedia with Python

I am trying to export a category from the Turkish Wikipedia by following http://www.mediawiki.org/wiki/Manual:Parameters_to_Special:Export . Here is the code I am using:
# -*- coding: utf-8 -*-
import requests
from BeautifulSoup import BeautifulStoneSoup
from sys import version

link = "http://tr.wikipedia.org/w/index.php?title=%C3%96zel:D%C4%B1%C5%9FaAktar&action=submit"

def get(pages=[], category=False, curonly=True):
    params = {}
    if pages:
        params["pages"] = "\n".join(pages)
    if category:
        params["addcat"] = 1
        params["category"] = category
    if curonly:
        params["curonly"] = 1
    headers = {"User-Agent": "Wiki Downloader -- Python %s, contact: Yaşar Arabacı: yasar11732@gmail.com" % version}
    r = requests.post(link, headers=headers, data=params)
    return r.text

print get(category="Matematik")
Since I am trying to get data from the Turkish Wikipedia, I have used its URL. Other things should be self-explanatory. I am getting the form page that you can use to export data, instead of the actual XML. Can anyone see what I am doing wrong here? I have also tried making a GET request.
There is no parameter named category; the category name should go in the catname parameter.
But Special:Export was not built for bots, it was built for humans. So if you use catname correctly, it will return the form again, this time with the pages from the category filled in. Then you are supposed to click "Submit" again, which will return the XML you want.
I think doing this in code would be too complicated. It would be easier if you used the API instead. There are some Python libraries that can help you with that: Pywikipediabot or wikitools.
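For example, with the API you can first list the pages in the category and then export them explicitly. A rough sketch (assuming the Turkish category namespace prefix "Kategori:" and the API's exportnowrap option, which returns the same XML as Special:Export):

# -*- coding: utf-8 -*-
import requests

api = "http://tr.wikipedia.org/w/api.php"

# 1) list the members of the category
members = requests.get(api, params={
    "action": "query",
    "list": "categorymembers",
    "cmtitle": "Kategori:Matematik",
    "cmlimit": "500",
    "format": "json",
}).json()["query"]["categorymembers"]
titles = [m["title"] for m in members]

# 2) export those pages as XML (the API caps how many titles fit in one request)
xml = requests.get(api, params={
    "action": "query",
    "titles": "|".join(titles[:50]),
    "export": 1,
    "exportnowrap": 1,
}).text
print(xml)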
Sorry, my original answer was horribly flawed. I misunderstood the original intent.
I did some more experimenting because I was curious. It seems the code you have above is not necessarily incorrect; rather, the Special:Export documentation is misleading. The documentation states that using catname and addcat will add the category's pages to the output, but instead it only lists the pages and categories within the specified catname inside an HTML form. It seems that Wikipedia actually requires the pages you wish to download to be specified explicitly. Granted, the documentation doesn't appear to be very thorough on that matter. I would suggest that you parse the page for the pages within the category and then explicitly download those pages with your script. I do see an issue with this approach in terms of efficiency: due to the nature of Wikipedia's data, you'll get a lot of pages which are simply category pages of other pages.
As an aside, it could possibly be faster to use the actual corpus of data from Wikipedia which is available for download.
Good luck!
