I'm trying to build a parser that downloads data from a web page. The problem is that the page is probably dynamically generated: there is some code in double curly brackets that apparently generates the HTML. It looks like Django template code.
Here is a pattern:
<script charset="utf-8" type="text/javascript">var browseDefaultColumn = 4; var browse5ColumnLength= '15,24'; var browse4ColumnLength = '20,28'; var browse3ColumnLength = '25,42';var priceFilterSliderEnabled = true;var browseLowPageLength = 24;var browseHighPageLength = 100;</script>
<script id="products-template" type="text/template">
{{#products}}
<li class="{{RowCssClass}}" style="{{RowStyle}}" li-productid="{{ItemCode}}">
<div class="s-productthumbbox">
<div class="productimage s-productthumbimage col-xs-6 col-sm-12 col-md-12">
<a href="{{PrdUrl}}" class="s-product-sache">{{#ImgSashVisible}}
<img src="{{ImgSashUrl}}" class="rtSashImg img-responsive">
{{/ImgSashVisible}}
</a>
<a href="{{PrdUrl}}" class="ProductImageList">
<div>
<img class="rtimg img-responsive" src='{{MainImage}}' alt='{{Brand}} {{DisplayName}}' />
</div>
{{#EnableAltImages}}
<div class="AlternateImageContainerDiv">
<img class="rtimg ProductImageListAlternateImage img-responsive" src='{{AltImage}}' alt='{{Brand}} {{DisplayName}}' />
</div>
{{/EnableAltImages}}
</a>
<div class="QuickBuyAndWishListContainerDiv hidden-xs {{QuickBuyAndWishListCss}}">
{{#IsQuickBuyEnabled}}
I'm looking for a way to get the final HTML with the generated content filled in, so I can parse it with, for example, Beautiful Soup — or some other efficient way to get the data.
The HTML you have is probably a template, and it needs to be parsed by a template engine to populate the content, after which you should be able to get the final HTML and parse that.
You do not normally get raw template HTML from a server; is this an offline file?
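To see what the missing step looks like, here is a minimal sketch of template rendering — a plain regular-expression substitution with made-up field values, not a real Mustache/Handlebars engine:

```python
import re

# Minimal illustration only: the site's JavaScript fills the {{...}} placeholders
# with data, and only the resulting HTML is useful to a parser like Beautiful Soup.
# The template fields mirror the question; the data values are invented.
template = '<li li-productid="{{ItemCode}}"><a href="{{PrdUrl}}">{{DisplayName}}</a></li>'
data = {"ItemCode": "A123", "PrdUrl": "/products/a123", "DisplayName": "Example product"}

rendered = re.sub(r"\{\{(\w+)\}\}", lambda m: data.get(m.group(1), ""), template)
print(rendered)
# <li li-productid="A123"><a href="/products/a123">Example product</a></li>
```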
Related
I am trying to scrape the Euronorm and the CO2 from a list of cars from an auction website. I have so far succeeded in navigating to the correct auction webpage and downloading that webpage using Selenium. The information that I need is the {{CO2Emission}} and {{EmissionClass}} for all the cars in the following script:
<script id="lot-template" type="text/x-handlebars-template">
<li data-id="{{Id}}">
<a href="{{LotUrl}}">
{{#if IsFollowing}}<figcaption><i class="fa fa-star"></i></figcaption>{{/if}}
<img src="{{ImagePath}}" alt="{{LocaleName}}" />
</a>
<div class="list-info">
<h3>
<a class="car-title" href="{{LotUrl}}">{{LocaleName}}</a>
</h3>
<ul class="item-specs">
<li>Objectnumber: {{Number}}</li>
{{#if EngineSize}}
<li>CC: {{EngineSize}}</li>{{/if}}
<li>Fuel: {{FuelType}}</li>
{{#if PowerKW}}
<li>KW: {{PowerKW}}</li>{{/if}}
{{#if CO2Emission}}
<li>CO2: {{CO2Emission}} g/km</li>{{/if}}
{{#if EmissionClass}}
<li>Euronorm: {{EmissionClass}}</li>{{/if}}
{{#if FirstInUse}}
<li> First Registration: {{date FirstInUse}}</li>{{/if}}
{{#if Mileage}}
<li>Counter: {{Mileage}} {{MileageType}}</li>{{/if}}
{{#if Location}}
<li>Location {{Location}}</li>{{/if}}
{{#if LicensePlate}}
<li>License plate {{LicensePlate}}</li>{{/if}}
</ul>
</div>
<div class="btnrow">
{{#if HasBid}}
<span class="extra">My bid (Excl VAT): <strong>€ {{BidAmount}}</strong></span>
{{/if}}
{{#if IsOpenForBids}}
<i class="fa fa-gavel"></i>{{#if HasBid }}Change Bid{{else}}Bid now{{/if}}
{{/if}}
<a class="btn" href="{{LotUrl}}"><i class="fa fa-arrow-right"></i> details</a>
</div>
</li>
</script>
Is it possible to get the information out of this script? I am a bit stuck now and I would like to know how to proceed. I am new to web scraping, so I am just trying stuff at the moment.
Thank you!
You would not be able to get the information you need from this Handlebars template alone. The template is combined with data to produce HTML, so you have two options for extracting the data you need:
Parse the HTML that is generated with this template
Find the source of the data that gets fed into this template
The source of the data may be an API, or in some form that does not require scraping, so I would try that first, then try parsing HTML.
It would be useful to know which site/page you are trying to scrape.
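As a sketch of the second option: the data behind such a template is often fetched as JSON (check the browser's network tab). The payload below is hypothetical — its field names mirror the template's placeholders, but the real endpoint and shape will differ per site:

```python
import json

# Hypothetical JSON payload of the kind a page fetches before rendering its
# handlebars template; field names are taken from the template's placeholders,
# everything else (shape, values) is an assumption for illustration.
payload = '''{"Lots": [
  {"Id": 1, "LocaleName": "Example Car A", "CO2Emission": 120, "EmissionClass": "Euro 6"},
  {"Id": 2, "LocaleName": "Example Car B", "CO2Emission": null, "EmissionClass": "Euro 5"}
]}'''

lots = json.loads(payload)["Lots"]
emissions = {lot["LocaleName"]: (lot["CO2Emission"], lot["EmissionClass"]) for lot in lots}
print(emissions)
```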
I am trying to parse several items from a blog, but I am unable to reach the last two items I need.
The html is:
<div class="post">
<div class="postHeader">
<h2 class="postTitle"><span></span>cuba and the cameraman</h2>
<span class="postMonth" title="2017">Nov</span>
<span class="postDay" title="2017">24</span>
<div class="postSubTitle"><span class="postCategories">TV Shows</span></div>
</div>
<div class="postContent"><p><a target="_blank" href="https://image.com/test.jpg"><img class="aligncenter" src="https://image.com/test.jpg"/></a> <br />
n/A<br />
<br />
<strong>Links:</strong> <a target='_blank' href='http://www.imdb.com/title/tt7320560/'>IMDB</a><br />
</p>
The data I need is the title "cuba and the cameraman", the image URL "https://image.com/test.jpg", and the IMDB link "http://www.imdb.com/title/tt7320560/".
So far I have only managed to parse all the postTitle values for the website:
all_titles = []
url = 'http://test.com'
browser.get(url)
titles = browser.find_elements_by_class_name('postHeader')
for title in titles:
    link = title.find_element_by_tag_name('a')
    all_titles.append(link.text)
But I can't get the image and IMDB links using the same method as above (by class name).
Could you help me with this? Thanks.
You need a more precise search; there is a built-in family of find_element_by_* methods. Try XPath:
for post in driver.find_elements_by_xpath('//div[@class="post"]'):
    title = post.find_element_by_xpath('.//h2[@class="postTitle"]//a').text
    img_src = post.find_element_by_xpath('.//div[@class="postContent"]//img').get_attribute('src')
    link = post.find_element_by_xpath('.//div[@class="postContent"]//a[last()]').get_attribute('href')
Remember you can always get the HTML source via driver.page_source and parse it using whatever tool you like.
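For example, a sketch of that approach: feed the page HTML (here the static snippet from the question, standing in for driver.page_source) to Beautiful Soup and pull out the three fields. Assumes beautifulsoup4 is installed:

```python
from bs4 import BeautifulSoup

# Static stand-in for driver.page_source, taken from the question's HTML.
html = '''
<div class="post">
  <div class="postHeader">
    <h2 class="postTitle"><span></span>cuba and the cameraman</h2>
  </div>
  <div class="postContent"><p>
    <a target="_blank" href="https://image.com/test.jpg"><img class="aligncenter" src="https://image.com/test.jpg"/></a>
    <strong>Links:</strong> <a target="_blank" href="http://www.imdb.com/title/tt7320560/">IMDB</a>
  </p></div>
</div>
'''
soup = BeautifulSoup(html, "html.parser")
for post in soup.find_all("div", class_="post"):
    title = post.find("h2", class_="postTitle").get_text(strip=True)
    img_src = post.find("div", class_="postContent").find("img")["src"]
    imdb = post.find("div", class_="postContent").find_all("a")[-1]["href"]
    print(title, img_src, imdb)
```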
Here is the HTML code:
<div class="rendering rendering_person rendering_short rendering_person_short">
<h3 class="title">
<a rel="Person" href="https://moh-it.pure.elsevier.com/en/persons/massimo-eraldo-abate" class="link person"><span>Massimo Eraldo Abate</span></a>
</h3>
<ul class="relations email">
<li class="email"><span>massimo.abate@ior.it</span></li>
</ul>
<p class="type"><span class="family">Person: </span>Academic</p>
</div>
From the above code, how do I extract "Massimo Eraldo Abate"?
Please help me.
You can extract the name using
response.xpath('//h3[@class="title"]/a/span/text()').extract_first()
Also, look at this Scrapinghub's blogpost for introduction to XPath.
Please take a look at the Scrapy docs; there are lots of ways of extracting text:
>>> body = '<html><body><span>good</span></body></html>'
>>> Selector(text=body).xpath('//span/text()').extract()
>>> response = HtmlResponse(url='http://example.com', body=body)
>>> Selector(response=response).xpath('//span/text()').extract()
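For a runnable sketch without Scrapy, the same XPath from the answer above can be applied with lxml (assuming lxml is installed; the snippet is the HTML from the question):

```python
from lxml import html as lh  # assumes lxml is installed

snippet = '''
<div class="rendering rendering_person rendering_short rendering_person_short">
  <h3 class="title">
    <a rel="Person" href="https://moh-it.pure.elsevier.com/en/persons/massimo-eraldo-abate"
       class="link person"><span>Massimo Eraldo Abate</span></a>
  </h3>
</div>
'''
tree = lh.fromstring(snippet)
# Same expression as the Scrapy answer, just run through lxml's xpath().
name = tree.xpath('//h3[@class="title"]/a/span/text()')[0]
print(name)
```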
Given the following code:
<html>
<body>
<div class="category1" id="foo">
<div class="category2" id="bar">
<div class="category3">
</div>
<div class="category4">
<div class="category5"> test
</div>
</div>
</div>
</div>
</body>
</html>
How do I extract the word "test" from <div class="category5"> test using BeautifulSoup, i.e. how do I deal with nested divs? I tried to look this up on the Internet but didn't find any easy-to-grasp example, so I set up this one. Thanks.
XPath would be the straightforward answer; however, it is not supported in BeautifulSoup.
Updated: with a BeautifulSoup solution
To do so, given that you know the class and the element (div in this case), you can use a for loop with attrs to get what you want:
from bs4 import BeautifulSoup
html = '''
<html>
<body>
<div class="category1" id="foo">
<div class="category2" id="bar">
<div class="category3">
</div>
<div class="category4">
<div class="category5"> test
</div>
</div>
</div>
</div>
</body>
</html>'''
content = BeautifulSoup(html, 'html.parser')
for div in content.find_all('div', attrs={'class': 'category5'}):
    print(div.text)
test
I have no problem extracting the text from your HTML sample; as @MartijnPieters suggested, you will need to find out why your div element is missing.
Another update
You're missing lxml as a parser for BeautifulSoup, which is why None was returned: nothing was parsed to begin with. Installing lxml should solve your issue.
You may also consider using lxml or a similar library that supports XPath — dead easy, if you ask me.
from lxml import etree
tree = etree.fromstring(html)  # or etree.parse from a source
tree.xpath('.//div[@class="category5"]/text()')
[' test\n ']
I have the following given html structure
<li class="g">
<div class="vsc">
<div class="alpha"></div>
<div class="beta"></div>
<h3 class="r">
<a href="http://stackoverflow.com">result</a>
</h3>
</div>
</li>
The above HTML structure keeps repeating. What would be the easiest way to parse all the links (stackoverflow.com) from it using BeautifulSoup and Python?
BeautifulSoup 4 offers a convenient way of accomplishing this, using CSS selectors:
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'html.parser')
print([a["href"] for a in soup.select('h3.r a')])
This also has the advantage of constraining the selection by context: it selects only those anchor nodes that are descendants of an h3 node with class r.
Omitting the constraint or choosing one most suitable for the need is easy by just tweaking the selector; see the CSS selector docs for that.
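For a concrete run of that selector (with a made-up anchor and text, since the h3 elements in the question were shown empty):

```python
from bs4 import BeautifulSoup

# Minimal, self-contained example; the href and link text are invented.
html = '<li class="g"><div class="vsc"><h3 class="r"><a href="http://stackoverflow.com">result</a></h3></div></li>'
soup = BeautifulSoup(html, "html.parser")
links = [a["href"] for a in soup.select("h3.r a")]
print(links)
# ['http://stackoverflow.com']
```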
Using CSS selectors as proposed by Petri is probably the best way to do it with BS. However, I can't help recommending lxml.html and XPath, which are pretty much perfect for the job.
Test html:
html="""
<html>
<li class="g">
<div class="vsc"></div>
<div class="alpha"></div>
<div class="beta"></div>
<h3 class="r">
<a href="http://www.correct.com">link</a>
</h3>
</li>
<li class="g">
<div class="vsc"></div>
<div class="alpha"></div>
<div class="beta"></div>
<h3 class="r">
<a href="http://www.correct.com">link</a>
</h3>
</li>
<li class="g">
<div class="vsc"></div>
<div class="gamma"></div>
<div class="beta"></div>
<h3 class="r">
<a href="http://www.incorrect.com">link</a>
</h3>
</li>
</html>"""
and it's basically a one-liner:
import lxml.html as lh
doc=lh.fromstring(html)
doc.xpath('.//li[@class="g"][div/@class = "vsc"][div/@class = "alpha"][div/@class = "beta"][h3/@class = "r"]/h3/a/@href')
Out[264]:
['http://www.correct.com', 'http://www.correct.com']