How to scrape a Handlebars.js script in Python?

I am trying to scrape the Euronorm and the CO2 from a list of cars from an auction website. I have so far succeeded in navigating to the correct auction webpage and downloading that webpage using Selenium. The information that I need is the {{CO2Emission}} and {{EmissionClass}} for all the cars in the following script:
<script id="lot-template" type="text/x-handlebars-template">
<li data-id="{{Id}}">
<a href="{{LotUrl}}">
{{#if IsFollowing}}<figcaption><i class="fa fa-star"></i></figcaption>{{/if}}
<img src="{{ImagePath}}" alt="{{LocaleName}}" />
</a>
<div class="list-info">
<h3>
<a class="car-title" href="{{LotUrl}}">{{LocaleName}}</a>
</h3>
<ul class="item-specs">
<li>Objectnumber: {{Number}}</li>
{{#if EngineSize}}
<li>CC: {{EngineSize}}</li>{{/if}}
<li>Fuel: {{FuelType}}</li>
{{#if PowerKW}}
<li>KW: {{PowerKW}}</li>{{/if}}
{{#if CO2Emission}}
<li>CO2: {{CO2Emission}} g/km</li>{{/if}}
{{#if EmissionClass}}
<li>Euronorm: {{EmissionClass}}</li>{{/if}}
{{#if FirstInUse}}
<li> First Registration: {{date FirstInUse}}</li>{{/if}}
{{#if Mileage}}
<li>Counter: {{Mileage}} {{MileageType}}</li>{{/if}}
{{#if Location}}
<li>Location {{Location}}</li>{{/if}}
{{#if LicensePlate}}
<li>License plate {{LicensePlate}}</li>{{/if}}
</ul>
</div>
<div class="btnrow">
{{#if HasBid}}
<span class="extra">My bid (Excl VAT): <strong>€ {{BidAmount}}</strong></span>
{{/if}}
{{#if IsOpenForBids}}
<i class="fa fa-gavel"></i>{{#if HasBid }}Change Bid{{else}}Bid now{{/if}}
{{/if}}
<a class="btn" href="{{LotUrl}}"><i class="fa fa-arrow-right"></i> details</a>
</div>
</li>
</script>
Is it possible to get the information out of this script? I am a bit stuck now and I would like to know how to proceed. I am new to web scraping, so I am just trying stuff at the moment.
Thank you!

You would not be able to get the information you need from this handlebars template. The template is combined with data to produce HTML, therefore you have two options for extracting the data you need:
Parse the HTML that is generated with this template
Find the source of the data that gets fed into this template
The data may come from an API, or be available in some other form that does not require scraping, so I would try that first and fall back to parsing the HTML.
It would be useful to know which site/page you are trying to scrape.
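As a rough illustration of the first option, here is a minimal sketch that reads the CO2 and Euronorm values out of HTML that has already been rendered from the template. The RENDERED_HTML sample and the LotSpecParser class are made up for this example; with Selenium you would feed driver.page_source instead, and the real markup may differ from what the template suggests.

```python
from html.parser import HTMLParser

# Hypothetical rendered output for one lot; the real page will differ.
RENDERED_HTML = """
<li data-id="42">
  <ul class="item-specs">
    <li>Objectnumber: 1001</li>
    <li>Fuel: Diesel</li>
    <li>CO2: 120 g/km</li>
    <li>Euronorm: Euro 6</li>
  </ul>
</li>
"""


class LotSpecParser(HTMLParser):
    """Collect 'Label: value' pairs from the item-specs list of each lot."""

    def __init__(self):
        super().__init__()
        self.lots = []          # one dict per <li data-id="...">
        self._in_specs = False  # True while inside <ul class="item-specs">

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "li" and "data-id" in attrs:
            self.lots.append({"Id": attrs["data-id"]})
        elif tag == "ul" and attrs.get("class") == "item-specs":
            self._in_specs = True

    def handle_endtag(self, tag):
        if tag == "ul":
            self._in_specs = False

    def handle_data(self, data):
        text = data.strip()
        if self._in_specs and self.lots and ":" in text:
            label, value = text.split(":", 1)
            self.lots[-1][label.strip()] = value.strip()


parser = LotSpecParser()
parser.feed(RENDERED_HTML)
for lot in parser.lots:
    # .get() returns None when the template's {{#if}} block left a field out
    print(lot["Id"], lot.get("CO2"), lot.get("Euronorm"))
```

Because the template wraps both fields in {{#if}} blocks, some lots will simply lack them, which is why the lookup uses .get() rather than indexing.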

Related

Extract URLs from a class using Scrapy

I am trying to use Scrapy to get a list of URLs from this website. I have the class of the div and I want all the a tags in it.
Here is the link to the website; I am trying to get the URL of each profile.
https://www.letsmakeaplan.org/find-a-cfp-professional?limit=10&pg=1&sort=random&distance=5
This is the code I use to try to pull the URLs from the page above:
sel = Selector(text=driver.page_source)
books1 = sel.xpath("//div[@class='faceted-search-results-container-listing']/a/@herf").extract()
This comes back empty.
This is the HTML from the website:
<div class="faceted-search-results-container-listing" style="">
<a href="/find-a-cfp-professional/certified-professional-profile/a9a0ca36-3c70-4ea4-a853-7f704fe4cc98" class="find-cfp-item js-card-link">
<div class="find-cfp-item-top">
<div class="h5 find-cfp-item-name">C. H. Simmons, CFP®</div>
<div class="find-cfp-item-read-more"><span>view details</span></div>
</div>
<div class="find-cfp-item-bottom">
<div class="find-cfp-item-column" data-column="1">
<img src="https://login.cfp.net/eweb/photos/91475.jpg" data-default-img="/-/media/feature/cfp/lmapprofile/default-profile-avatar.jpeg" data-default-img-backup="/images/default-profile-avatar.jpeg" alt="C. Simmons Headshot" class="find-cfp-item-headshot" onerror="handleImg(this, event);">
<div class="find-cfp-item-text">
Simmons and Starzl Wealth Management<br>
110 Bay St<br>
Gadsden, AL 35901-5229<br>
</div>
</div>
<div class="find-cfp-item-column" data-column="2">
<div class="h6 find-cfp-item-column-heading">Planning Services Offered</div>
<div class="find-cfp-item-text" data-line-clamp="4">
Investment Planning, Retirement Planning
</div>
</div>
<div class="find-cfp-item-column" data-column="3">
<div class="find-cfp-item-column-inner">
<div class="h6 find-cfp-item-column-heading">Client Focus</div>
<div class="find-cfp-item-text" data-line-clamp="1">
None Provided
</div>
</div>
<div class="find-cfp-item-column-inner">
<div class="h6 find-cfp-item-column-heading">Minimum Investable Assets</div>
<div class="find-cfp-item-text" data-line-clamp="1">
$500,000
</div>
</div>
</div>
</div>
</a>

It looks like the search results come from an AJAX call to a JSON API and are rendered dynamically.
You can get all of the information from the raw JSON data if you scrape the API URL instead:
scrapy.Request(url='https://www.letsmakeaplan.org/api/feature/lmapprofilesearch/search?limit=10&pg=1&sort=random&distance=5')
# callback on the spider that receives the JSON response
def parse(self, response):
    data = response.json()
    results = data["results"]
    links = [i["item_url"] for i in results]
    yield {'links': links}
Output:
'/find-a-cfp-professional/certified-professional-profile/b1a27bac-77f0-4796-ab7f-7e15c19d8421'
'/find-a-cfp-professional/certified-professional-profile/e493f31f-88c7-4fdd-9863-9712ba85c95c'
'/find-a-cfp-professional/certified-professional-profile/2d634f05-331e-4699-b1a8-96e7a20aa0bf'
'/find-a-cfp-professional/certified-professional-profile/d9074216-7321-469f-b42f-2988d84d4a2b'
'/find-a-cfp-professional/certified-professional-profile/7f55e98c-df27-4922-b3a4-07c341a87f65'
'/find-a-cfp-professional/certified-professional-profile/1b0377a2-4545-45af-9ac4-18a8af2ffecd'
'/find-a-cfp-professional/certified-professional-profile/66b78e79-608b-4079-86c2-d9ae84c3a762'
'/find-a-cfp-professional/certified-professional-profile/e884f42b-8239-475a-b55f-5bb6f1130a36'
'/find-a-cfp-professional/certified-professional-profile/b00abd44-5969-4f02-a052-e6ef34b60e9b'
'/find-a-cfp-professional/certified-professional-profile/10ae9e9f-f11e-4f79-91c4-05f24e0c7a0e'
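The links returned above are relative, so they need to be joined with the site root before they can be requested. A minimal sketch of that step follows, using only the standard library; SAMPLE_BODY is a made-up, trimmed-down response (only the "results" and "item_url" keys come from the answer), and profile_urls is a hypothetical helper name.

```python
import json
from urllib.parse import urljoin

BASE_URL = "https://www.letsmakeaplan.org"

# Hypothetical response body; only "results" and "item_url" are taken
# from the answer above, the overall shape is otherwise an assumption.
SAMPLE_BODY = json.dumps({
    "results": [
        {"item_url": "/find-a-cfp-professional/certified-professional-profile/b1a27bac-77f0-4796-ab7f-7e15c19d8421"},
        {"item_url": "/find-a-cfp-professional/certified-professional-profile/e493f31f-88c7-4fdd-9863-9712ba85c95c"},
    ]
})


def profile_urls(body, base_url=BASE_URL):
    """Return absolute profile URLs from a search API response body."""
    data = json.loads(body)
    return [urljoin(base_url, item["item_url"]) for item in data.get("results", [])]


for url in profile_urls(SAMPLE_BODY):
    print(url)
```

Inside a Scrapy callback, response.urljoin(item_url) does the same joining against the response URL.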

Selenium python how to upload file when there is no input type file?

I am using Selenium with Python to automate a site. The problem I face is that I have to upload a file, but there is no input element of type file that I could use the send_keys() method on.
The File upload element:
<div id="data-assets-interior-file-upload" data-upload-properties="{"formId":"form-main-1","path":"data[assets][interior]","modalUpload":"Uploading...","warnExtensionsStrings":{"pdf":"<div class=\"a-box a-alert-inline a-alert-inline-warning\"><div class=\"a-box-inner a-alert-container\"><i class=\"a-icon a-icon-alert\"><\/i><div class=\"a-alert-content\">\n Most PDF files do not produce great results in an automated conversion process. We recommend using a Word, Mobi, ePUB or HTML file if you have one. <a href=\"\/en_US\/help\/topic\/A2GF0UFHIYG9VQ?ref_=_fg\" target=\"_blank\" rel=\"noopener noreferrer\">Learn more.<\/a>\n <\/div><\/div><\/div>\n <div id=\"file-warn-actions\" class=\"a-form-actions a-spacing-none a-spacing-top-large\">\n <div class=\"a-row a-spacing-none\">\n <div class=\"a-column a-span6\">\n <span class=\"a-declarative\" data-action=\"potter-file-warn-extension-continue\" data-potter-file-warn-extension-continue=\"{}\">\n <span id=\"file-warn-extension-continue\" class=\"a-button a-button-base button-fill\"><span class=\"a-button-inner\"><button id=\"file-warn-extension-continue-announce\" class=\"a-button-text\" type=\"button\">\n Continue with PDF\n <\/button><\/span><\/span>\n <\/span>\n <\/div>\n <div class=\"a-column a-span6 a-span-last\">\n <span class=\"a-declarative\" data-action=\"potter-file-warn-extension-cancel\" data-potter-file-warn-extension-cancel=\"{}\">\n <span id=\"file-warn-extension-cancel\" class=\"a-button a-button-primary button-fill\"><span class=\"a-button-inner\"><button id=\"file-warn-extension-cancel-announce\" class=\"a-button-text\" type=\"button\">\n I have another format\n <\/button><\/span><\/span>\n <\/span>\n <\/div>\n <\/div>\n <\/div>","pdf-header":"Do you have another 
format?"},"acceptedExtensions":"doc,docx,zip,htm,html,mobi,azw,epub,rtf,txt,pdf,kpf","multipart":null,"persistSuccess":true,"warnExtensions":["pdf"],"key":"save","url":"\/en_US\/title-setup\/kindle\/A3U1YUNVSBYTMZ\/content\/action\/save","workflowId":"assets.interior","assetType":"DIGITAL_BOOK_BLOCK"}" class="a-section jele-file-field">
<div class="a-section a-spacing-none file-upload-options-section">
<p class="a-spacing-small">
</p>
<div class="a-row file-upload-extra-info-message-section">
<div class="a-column a-span12">
<div class="a-box a-alert a-alert-info"><div class="a-box-inner a-alert-container"><i class="a-icon a-icon-alert"></i><div class="a-alert-content">Use Kindle Create to transform your manuscript to an eBook with professional book themes, images, and Table of Contents. Click here to download for free.</div></div></div>
</div>
</div>
<br/>
<div class="a-row file-upload-browse-section">
<div class="a-column a-span12">
<span class="a-declarative" data-action="browse-clicked" data-browse-clicked="{"signInRequired":false,"id":"data-assets-interior-file-upload"}">
<span id="data-assets-interior-file-upload-browse-button" class="a-button a-button-primary file-upload-browse-button"><span class="a-button-inner"><button id="data-assets-interior-file-upload-browse-button-announce" class="a-button-text" type="button">
Upload Book
</button></span></span>
</span>
<span class="a-declarative" data-action="file-selected" data-file-selected="{}" id="data-assets-interior-uploader">
<span class="fileuploader a-hidden"></span>
</span>
<p class="a-spacing-top-small a-size-mini a-color-tertiary a-text-italic">
</p>
</div>
</div>
<input type="hidden" name="" value="doc,docx,zip,htm,html,mobi,azw,epub,rtf,txt,pdf,kpf" id="data-assets-interior-file-upload-accepted-extensions" class="accepted-extensions"/>
</div> </div>
Can anyone let me know how to handle this scenario? If you are going to recommend another library for it, please post relevant examples in Python as well. Thank you.

Django: sidebar with dynamic URLs: how to dynamically create URLs which have dynamic folders in the path

I have a problem with dynamic URLs in sidebar navigation in Django and I hope some of you can help me shed some light on how to solve it. I have looked for similar questions but I couldn't find an answer for my case.
Basically, what I want to achieve is to have a sidebar with links. This sidebar will be reused on many pages, so it sits in a separate sidebar.py file, which is later imported to the pages.
<h6 class="sidebar-heading d-flex justify-content-between align-items-center px-3 mt-4 mb-1 text-muted">
<span>Content</span>
<a class="d-flex align-items-center text-muted" href="#">
<span data-feather="plus-circle"></span>
</a>
</h6>
<ul class="nav flex-column">
<li class="nav-item">
<a class="nav-link active" href="DYNAMIC LINK HERE">
<span data-feather="home"></span>
Status codes
</a>
</li>
<li class="nav-item">
<a class="nav-link" href="#">
<span data-feather="file"></span>
Depth
</a>
</li>
</ul>
The links I want to display are the following:
urls.py
path('<id>/<crawl_id>/dashboard/', ProjectDashboard, name="crawl_dashboard"),
path('<id>/<crawl_id>/dashboard/status-codes/', StatusCodeDashboard, name="status_code_dashboard"),
path('<id>/<crawl_id>/dashboard/url-depth/', UrlDepthDashboard, name="url_depth_dashboard"),
As you can see, they are dynamic URLs which take an id and crawl_id. So, for each crawl dashboard I want the sidebar to link to its relative status_code_dashboard page and url_depth_dashboard page.
As an example:
/22/123/dashboard --> should have a sidebar with links to:
/22/123/dashboard/status-code/
/22/123/dashboard/url-depth/
What I tried to do is to create a context processor like the following:
def get_dashboard_paths(request):
# Get current path
current_path = request.get_full_path()
depths_dashboard = current_path + 'url-depth/'
return {
'depths_dashboard': depths_dashboard
}
...and then in the sidebar.py template use {{depths_dashboard}}...
This works but it's not scalable: when I am on /22/123/dashboard/status-code/, for example, I still want the sidebar to link to the other sections. With the context processor above, wrong links like these would be created:
/22/123/dashboard/status-code/status-code/
/22/123/dashboard/status-code/url-depth/
Do you have a hint on how I can display the sidebar on all of the above pages with dynamic URLs based on id and crawl_id? Basically the question is, how can I correctly send those parameters dynamically depending on which id and crawl_id context I am in?
Thanks a lot!

Just pass the id and crawl_id into your template. Then in the template:
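<a href="{% url 'crawl_dashboard' id crawl_id %}">Dashboard</a>
<a href="{% url 'status_code_dashboard' id crawl_id %}">Status code</a>
<a href="{% url 'url_depth_dashboard' id crawl_id %}">URL depth</a>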
If you specifically want to use the context processor, you can also get these numbers from get_full_path().split('/').
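The split-based variant could look roughly like the sketch below. It assumes the path always starts with <id>/<crawl_id>/dashboard/ as in the urls.py from the question; dashboard_links is a hypothetical helper, and the returned keys are illustrative names for the template. In a real project, building the links with Django's reverse() and the URL names would likely be more robust than string concatenation.

```python
def dashboard_links(full_path):
    """Build sidebar links from a path like '/22/123/dashboard/status-codes/'."""
    # Drop any query string, then split into non-empty segments:
    # '/22/123/dashboard/' -> ['22', '123', 'dashboard']
    parts = [p for p in full_path.split("?")[0].split("/") if p]
    if len(parts) < 3 or parts[2] != "dashboard":
        return {}  # not a dashboard page, nothing to add to the context
    project_id, crawl_id = parts[0], parts[1]
    base = f"/{project_id}/{crawl_id}/dashboard/"
    return {
        "crawl_dashboard": base,
        "status_code_dashboard": base + "status-codes/",
        "url_depth_dashboard": base + "url-depth/",
    }


def get_dashboard_paths(request):
    """Context processor: expose the sidebar links to every template."""
    return dashboard_links(request.get_full_path())


print(dashboard_links("/22/123/dashboard/status-codes/"))
```

Because the links are always rebuilt from the first two path segments, they stay the same no matter which dashboard sub-page is currently open.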

Using Selenium Webdriver, grabbing data not showing up in innerhtml

I am trying to use selenium to grab text data from a page.
Printing the HTML attributes:
element = driver.find_element_by_id("divresults")
print(element.get_attribute('innerHTML'))
Results:
<div id="divDesktopResults"> </div>
print(element.get_attribute('outerHTML'))
Results:
<div id="divresults" data-bind="html:resultsContent"><div id="divDesktopResults"> </div></div>
Tried grabbing this element:
driver.find_element_by_css_selector("span[class='glyphicon glyphicon-tasks']")
Results:
Message: no such element: Unable to locate element: {"method":"css selector","selector":"span[class='glyphicon glyphicon-tasks']"}
This is the code when copied from the Browser. There is much more below 'divresults' that did not show up in the innerhtml printout
<div id="divresults" data-bind="html:resultsContent">
<div>
<div class="row" style="font-size:8pt;">
<a data-toggle="tooltip" style="text-decoration:underline" href="#pdfviewer?ID=D218101736">
<strong>D218101736 </strong>
<span class="glyphicon glyphicon-new-window"></span>
</a>
<div class="btn-group" style="font-size:8pt;margin-left:10px;" id="btnD218101736">
<span style="display:none;font-size:8pt;" id="lblD218101736"> Added To Cart</span>
<button type="button" style="font-size:8pt;" class="btn btn-primary dropdown-toggle" data-toggle="dropdown"> Add To Cart
<span class="caret"></span>
</button>
<ul class="dropdown-menu" role="menu">
<li> <strong>Regular ($7.00)</strong> </li>
<li> <strong>Certified ($12.00)</strong> </li>
</ul>
</div>
</div> <br>
<ul class="nav nav-tabs compact">
<li class="active">
<a data-toggle="tab" href="#D218101736_Doc">
<span class="glyphicon glyphicon-file"></span>
<span>Doc Info</span>
</a>
</li>
<li class="hidden-xs">
<a data-toggle="tab" href="#D218101736_Thumbnail">
<span class="glyphicon glyphicon-th-large"></span>
<span>Thumbnail</span>
</a>
</li>
....
How do I get the data beneath divresults in this instance?

My guess is that it's one of two things:
1. There is more than one element that matches that locator. To investigate this, try using $$("#divresults") in the dev console and make sure that it returns 1. If it returns more than one, run $$("#divresults")[0] and make sure the element returned is the one you want. If it is, go on to step 2. If it isn't, you will need to find a locator that is more specific. If you want our help, you will need to provide a link to the page or more of the surrounding HTML to the desired element.
2. You need to add a wait so that the contents of the element can finish loading. You could wait for a locator like #divresults strong or any number of locators to find some of the elements that were missing. You would wait for them to be visible (or at least present). See the docs for more info and options.
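To make the idea of the wait concrete, here is a hand-rolled polling loop. It is only a sketch: Selenium ships its own WebDriverWait and expected_conditions helpers, which are the usual tool for this, and the loop below is a stand-in so the example can run without a browser. wait_for_elements and FakeDriver are hypothetical names; the only driver method assumed is find_elements(by, value).

```python
import time


def wait_for_elements(driver, css_selector, timeout=10.0, poll=0.5):
    """Poll until at least one element matches, or raise TimeoutError."""
    deadline = time.monotonic() + timeout
    while True:
        # "css selector" is the string value behind Selenium's By.CSS_SELECTOR
        found = driver.find_elements("css selector", css_selector)
        if found:
            return found
        if time.monotonic() >= deadline:
            raise TimeoutError(f"no element matched {css_selector!r} within {timeout}s")
        time.sleep(poll)


class FakeDriver:
    """Stand-in for a WebDriver: the content 'loads' on the third lookup."""

    def __init__(self):
        self.calls = 0

    def find_elements(self, by, value):
        self.calls += 1
        return ["<strong>D218101736</strong>"] if self.calls >= 3 else []


driver = FakeDriver()
elements = wait_for_elements(driver, "#divresults strong", timeout=5, poll=0.01)
print(len(elements), driver.calls)
```

With a real driver, the same call would keep retrying until the dynamically bound content under #divresults has been inserted, instead of reading innerHTML once while it is still empty.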

Need help scraping items from a list with Scrapy using ancestor

I am trying to scrape details like Contact, Location, Phone and Rate. The HTML is below. The list is dynamic, so sometimes only a few of the items, like Contact and Location, appear on the page, while at other times all of them appear. I am thinking I can use the icon tag to get the required text, but I am unable to find any documentation on this. Any help would be highly appreciated.
Thanks in advance.
<div class="detail-all-label">
<i class="abc-Contact"></i>
<div class="detail-all-text"><b>Contact</b>: Ram Bahadur</div>
</div>
<div class="detail-all-label">
<i class="abc-font abc-Location"></i>
<div class="detail-all-text"><b>Location</b>: Kathmandu</div>
</div>
<div class="detail-all-label">
<i class="abc-font abc-Website"></i>
<div class="detail-all-text"><b>Website</b>: itworkremotely</div>
</div>
<div class="detail-all-label">
<i class="abc-font abc-Phone"></i>
<div class="detail-all-text"><b>Phone</b>: 3283550121</div>
</div>
<div class="detail-all-label">
<i class="abc-font abc-Rate"></i>
<div class="detail-all-text"><b>Rate</b>: €700 - 10000</div>
</div>

You can get all of the detail values that have a preceding b element inside the div with class="detail-all-text":
for detail in response.xpath("//div[@class='detail-all-text']/b"):
    name = detail.xpath("text()").get()
    # the value is the text node right after <b>, e.g. ": Ram Bahadur"
    value = detail.xpath("following-sibling::text()").get(default="").lstrip(": ").strip()
    print(name, value)
