I can fetch most of the details from the desired website, but I am unable to get some specific information. Please guide me.
targeted domain: https://shop.adidas.ae/en/messi-16-3-indoor-boots/BA9855.html
My code is response.xpath('//ul[@class="product-size"]//li/text()').extract()
I still need to fetch the size data.
Thanks!
E-commerce websites often have the data in JSON format in the page source and have JavaScript unpack it on the user's end.
In that case you can open up the page source with JavaScript disabled and search for keywords (like a specific size).
Here I found the data can be pulled out with regular expressions:
import re
import json

# the sizes map is embedded in the page source as a javascript object literal
data = re.findall(r'window\.assets\.sizesMap = (\{.+?\});', response.body_as_unicode())
json.loads(data[0])
Out:
{'16': {'uk': '0k', 'us': '0.5'},
'17': {'uk': '1k', 'us': '1'},
'18': {'uk': '2k', 'us': '2.5'},
...}
Edit: more precisely, you probably want a different part of the JSON, but the approach is the same:
data = re.findall(r'window\.assets\.sizes = (\{(?:.|\n)+?\});', response.body_as_unicode())
json.loads(data[0].replace("'", '"'))  # the literal uses single quotes, so swap them for double quotes to get valid JSON
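For context, here is a minimal sketch of how this could sit inside a Scrapy callback; the spider name and the yielded field are assumptions, and response.text is the newer spelling of body_as_unicode():

import json
import re

import scrapy


class SizesSpider(scrapy.Spider):
    name = "adidas_sizes"  # hypothetical spider name
    start_urls = ["https://shop.adidas.ae/en/messi-16-3-indoor-boots/BA9855.html"]

    def parse(self, response):
        # the size data is embedded in the page source as a javascript object literal
        match = re.search(r"window\.assets\.sizes = (\{(?:.|\n)+?\});", response.text)
        if match:
            sizes = json.loads(match.group(1).replace("'", '"'))
            yield {"sizes": sizes}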
The data you want to fetch is loaded by JavaScript; the tag's class, class="js-size-value", says so explicitly.
If you want to get it that way, you will need a rendering service. I suggest Splash: it is simple to install and simple to use. You will need Docker to run Splash.
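For reference, a rough sketch of that route with the scrapy-splash plugin; this assumes Splash is already running in Docker on localhost:8050 and that scrapy-splash is enabled in settings.py:

import scrapy
from scrapy_splash import SplashRequest


class SizesSplashSpider(scrapy.Spider):
    name = "adidas_sizes_splash"  # hypothetical spider name

    def start_requests(self):
        url = "https://shop.adidas.ae/en/messi-16-3-indoor-boots/BA9855.html"
        # render the page in Splash so the javascript that fills in the sizes runs
        yield SplashRequest(url, self.parse, args={"wait": 2})

    def parse(self, response):
        # once rendered, the js-size-value elements contain the actual sizes
        yield {"sizes": response.css(".js-size-value::text").extract()}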
I am trying to extract the text
60 Days from website A
https://www.vitalsource.com/products/abnormal-psychology-susan-nolen-hoeksema-v9781259765667
Lifetime Access from website B
https://www.vitalsource.com/products/teaming-with-nutrients-jeff-lowenfels-v9781604695175
I tried to use absolute XPaths, but both return nothing.
For A:
//div[2]/div[1]/label[1]
For B:
//div[1]/span[1]/label[1]
Nor does the CSS path work:
.u-weight--bold.type--magic9.u-inline
I believe the texts I want to extract are not generated by JavaScript, so I don't know what else I can do to solve this problem.
Please assist!
Thank you in advance.
The information you need is rendered by JavaScript, but it is also available in JSON format inside the page. All you need to do is select the element that contains the data, parse it with the json library, and access the desired field.
import json
import pprint
data = response.xpath(
    '//div[@data-react-class="vs.CurrentRegionOnlyWarningModal"]'
    '/@data-react-props'
).extract_first()
json_data = json.loads(data)
pprint.pprint(json_data)
{'selectedVariant': None,
'variants': [{'asset_id': 88677112,
'created_at': '2016-10-07T14:17:10.000Z',
'deleted_at': None,
'distributable': True,
'downloadable_duration': 'perpetual',
'full_base_currency': 'USD',
'full_base_price': '107.5',
'full_currency': 'USD',
'full_price': '107.5',
'full_price_converted': False,
'id': 476831514,
'import_id': 'a3b99a3de0df7d0442253798cba8b8ea',
'in_store': True,
'item_type': 'Single',
....
'online_duration': '60 days',
So, you can access it normally:
for x in json_data['variants']:
    print(x['online_duration'])
It's important to note that this site has several variants for each product, and there are more fields containing this same string. You have to understand how the site organizes its products to get the right data, but this approach should be enough to access all the information you need.
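As an illustration, you can filter the variants list using the fields shown above; which field actually identifies the edition you want is an assumption that depends on the product:

# e.g. keep only distributable variants and pair each variant id with its duration
durations = {
    v["id"]: v["online_duration"]
    for v in json_data["variants"]
    if v.get("distributable")
}
print(durations)  # e.g. {476831514: '60 days', ...}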
It is generated by JavaScript, unfortunately, so you would most likely need to use something like Selenium for this.
I am a newbie at Python and I am working on a personal project. The project involves grabbing data through LDAP, which returns entries like the sample below.
Sample of the data:
cn=abcd
[[('uid=abcd,OU=active,OU=employees,OU=people,O=xxxx.com',
{'status': ['Active'],
'co': ['India'],
'cn': ['abcd'],
'msDS-UserAccountDisabled': ['FALSE'],
'departmentNumber': ['122839'],
'objectClass': ['top', 'person', 'organizationalPerson', 'user', 'inetOrgPerson', 'ciscoperson'],
'userPrincipalName': ['surahuja'],
'publishpager': ['n'],
Let's say that the content of the data is something like
'directreportees' : ['2345','1234','6789']
Right now, the search filter is something like:
for item in directreportees:
    search_filter = "(employeenumber=" + item + ")"
I need to put the search filter in a form where I can specify that the number of direct reportees is greater than 0. Is that possible through search filters, or do I have no option but to grab the data and run the test on it afterwards?
Secondly, I need to search on the department too. For example, I need to check whether a value such as A.B.NOS. C contains a specific sequence such as NOS. Can I put this check in the search filter as well?
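For what it's worth, LDAP search filters cannot count values, but a presence filter matches any entry that has at least one value for an attribute, and a substring filter matches values containing a given fragment. A rough sketch with python-ldap; the server URL, base DN, and attribute names here are assumptions taken from the question:

import ldap

conn = ldap.initialize("ldap://ldap.example.com")  # hypothetical server
conn.simple_bind_s()  # anonymous bind; use credentials if your directory requires them

base_dn = "OU=people,O=xxxx.com"

# entries with at least one direct reportee (presence filter)
managers = conn.search_s(base_dn, ldap.SCOPE_SUBTREE,
                         "(directreportees=*)", ["cn", "directreportees"])

# entries whose department value contains the sequence "NOS" (substring filter)
nos_departments = conn.search_s(base_dn, ldap.SCOPE_SUBTREE,
                                "(department=*NOS*)", ["cn", "department"])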
I very often need to set up physical properties for technical computations. It is not convenient to fill in such data by hand, so I would like to grab it from some public web page (Wikipedia, for example) using a Python script.
I was trying several approaches:
using an HTML parser like lxml.etree (I have no experience; I was just trying to follow a tutorial)
using pandas' wikitable import (likewise)
using urllib2 to download the HTML source and then searching for keywords with regular expressions
What I'm able to do:
I didn't find any universal solution applicable to various sources of information. The only script I made which actually works uses just plain urllib2 and regular expressions. It can grab physical properties of elements from this page, which is plain HTML.
What I'm not able to do:
I'm not able to do the same with more sophisticated web pages like this one. The HTML code of the page which I grab with urllib2 does not contain the keywords and data I'm looking for (like Flexural strength or Modulus of elasticity). Actually it seems it does not contain the wiki page at all! How is that possible? Are these wiki tables linked in dynamically somehow? How can I get the content of the table with urllib? Why does urllib2 not grab this data when my web browser does?
I have no experience with web programming.
I don't understand why it is so hard to get any machine-readable data from free public online sources of information.
Web scraping is difficult. Not because it's rocket science, but because it's just messy. For the moment you'll need to hand-craft scrapers for different sites and use them as long as the site's structure does not change.
There are more automated approaches to web information extraction, e.g. as described in the paper Harvesting Relational Tables from Lists on the Web, but they are not mainstream yet.
A large number of web pages contain data structured in the form of “lists”.
Many such lists can be further split into multi-column tables, which can then be used
in more semantically meaningful tasks. However, harvesting relational tables from such
lists can be a challenging task. The lists are manually generated and hence need not
have well defined templates – they have inconsistent delimiters (if any) and often have
missing information.
However, there are a lot of tools to get to the (HTML) content more quickly, e.g. BeautifulSoup:
Beautiful Soup is a Python library designed for quick turnaround projects like screen-scraping.
>>> from BeautifulSoup import BeautifulSoup as Soup
>>> import urllib
>>> page = urllib.urlopen("http://www.substech.com/dokuwiki/doku.php?"
"id=thermoplastic_acrylonitrile-butadiene-styrene_abs").read()
>>> soup = Soup(page) # the HTML gets parsed here
>>> soup.findAll('table')
Example output: https://friendpaste.com/DnWDviSiHIYQEBduTqkWd. More documentation can be found here: http://www.crummy.com/software/BeautifulSoup/bs4/doc/#searching-the-tree.
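Building on that, a small sketch that turns the first table on the page into a property → values dictionary; the table index and the cell layout are assumptions about this particular page:

>>> table = soup.findAll('table')[0]   # assume the first table holds the properties
>>> props = {}
>>> for row in table.findAll('tr'):
...     cells = [''.join(c.findAll(text=True)).strip()
...              for c in row.findAll(['th', 'td'])]
...     if len(cells) > 1:
...         props[cells[0]] = cells[1:]
...
>>> props.get('Modulus of elasticity')   # should give something like ['2.45', 'GPa', '350', 'ksi']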
If you want to extract data from a bigger set of pages, take a look at scrapy.
I don't understand what you mean by
it seem that it does not contain the wikipage at all
I got this relatively quickly:
import httplib
import re
hostu = 'www.substech.com'
timeout = 7
hypr = httplib.HTTPConnection(host=hostu,timeout = timeout)
rekete_page = ('/dokuwiki/doku.php?id='
'thermoplastic_acrylonitrile-butadiene-styrene_abs')
hypr.request('GET',rekete_page)
x = hypr.getresponse().read()
hypr.close()
#print '\n'.join('%d %r' % (i,line) for i,line in enumerate(x.splitlines(1)))
r = re.compile('\t<tr>\n.+?\t</tr>\n', re.DOTALL)   # one whole table row
r2 = re.compile('<th[^>]*>(.*?)</th>')              # header cells
r3 = re.compile('<td[^>]*>(.*?)</td>')              # data cells
for y in r.findall(x):
    print
    #print repr(y)
    print map(str.strip, r2.findall(y))
    print map(str.strip, r3.findall(y))
result
[]
['<strong>Thermoplastic</strong>']
[]
['<strong>Acrylonitrile</strong><strong>-Butadiene-Styrene (ABS)</strong>']
[]
['<strong>Property</strong>', '<strong>Value in metric unit</strong>', '<strong>Value in </strong><strong>US</strong><strong> unit</strong>']
['Density']
['1.05 *10\xc2\xb3', 'kg/m\xc2\xb3', '65.5', 'lb/ft\xc2\xb3']
['Modulus of elasticity']
['2.45', 'GPa', '350', 'ksi']
['Tensile strength']
['45', 'MPa', '6500', 'psi']
['Elongation']
['33', '%', '33', '%']
['Flexural strength']
['70', 'MPa', '10000', 'psi']
['Thermal expansion (20 \xc2\xbaC)']
['90*10<sup>-6</sup>', '\xc2\xbaC\xcb\x89\xc2\xb9', '50*10<sup>-6</sup>', 'in/(in* \xc2\xbaF)']
['Thermal conductivity']
['0.25', 'W/(m*K)', '1.73', 'BTU*in/(hr*ft\xc2\xb2*\xc2\xbaF)']
['Glass transition temperature']
['100', '\xc2\xbaC', '212', '\xc2\xbaF']
['Maximum work temperature']
['70', '\xc2\xbaC', '158', '\xc2\xbaF']
['Electric resistivity']
['10<sup>8</sup>', 'Ohm*m', '10<sup>10</sup>', 'Ohm*cm']
['Dielectric constant']
['2.4', '-', '2.4', '-']
There's lots of information on retrieving GET variables from a Python script. Unfortunately I can't figure out how to send GET variables from a Python script to an HTML page, so I'm just wondering if there's a simple way to do this.
I'm using Google App Engine webapp to develop my site. Thanks for your support!
Just append the GET parameters to the URL: request.html?param1=value1&param2=value2.
Now you could just build that string from Python variables holding the parameter names and values.
Edit: better to use Python's urllib:
import urllib
params = urllib.urlencode({'param1': 'value1', 'param2': 'value2', 'param3': 'value3'})
url = "example.com?%s" % params
Does anyone have a nifty way to get all the three-letter alphabetic currency codes (an example of the ones I mean is at http://www.iso.org/iso/support/faqs/faqs_widely_used_standards/widely_used_standards_other/currency_codes/currency_codes_list-1.htm) into a list in Python 2.5? Note I don't want a screen-scraping version, as the code has to work offline; the website is just an example of the codes.
It looks like there should be a way using the locale library, but it's not clear to me from the documentation, and there must be a better way than copy-pasting the codes into a file!
To clarify the question further: in C#, the following code solved the same problem very neatly using the built-in locale libraries:
CultureInfo.GetCultures(CultureTypes.SpecificCultures)
.Select(c => new RegionInfo(c.LCID).CurrencySymbol)
.Distinct()
I was hoping there might be an equivalent in python. And thanks to everyone who has provided an answer so far.
Not very elegant or nifty, but you can generate the list once and save it to use later:
import urllib, re
url = "http://www.iso.org/iso/support/faqs/faqs_widely_used_standards/widely_used_standards_other/currency_codes/currency_codes_list-1.htm"
print re.findall(r'<td valign="top">\s+([A-WYZ][A-Z]{2})\s+</td>', urllib.urlopen(url).read())
output:
['AFN', 'EUR', 'ALL', 'DZD', 'USD', 'EUR', 'AOA', 'ARS', 'AMD', 'AWG', 'AUD',
...
'UZS', 'VUV', 'EUR', 'VEF', 'VND', 'USD', 'USD', 'MAD', 'YER', 'ZMK', 'ZWL', 'SDR']
Note that the regex skips codes beginning with X, as those are apparently reserved names; even so, one rogue entry slips through (SDR, the last element), which you can just delete yourself.
You can get currency codes (and other data) from GeoNames. Here's some code that downloads the data (save the file locally to achieve the same result offline) and populates a list:
import urllib2

data = urllib2.urlopen('http://download.geonames.org/export/dump/countryInfo.txt')

ccodes = []
for line in data.read().split('\n'):
    if not line.startswith('#'):        # skip the commented header lines
        line = line.split('\t')
        try:
            if line[10]:                # column 10 holds the ISO currency code
                ccodes.append(line[10])
        except IndexError:
            pass

ccodes = list(set(ccodes))              # de-duplicate
ccodes.sort()
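Since the goal is for the code to work offline, you can run this once and cache the result in a plain text file (the file name is just an example):

# write the list out once while online
with open('currency_codes.txt', 'w') as f:
    f.write('\n'.join(ccodes))

# later, load it back with no network access
with open('currency_codes.txt') as f:
    ccodes = f.read().splitlines()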