I am trying to parse all of the links on this page which have an identical hierarchy. I am not getting any traceback, but I am not getting the data either.
I am trying to get the href attribute from the highlighted portion of code:
My current code is:
def link_parser(soup, itemsList):
    for item in soup.findAll("div", {"class": "tileInfo"}):
        for link in item.findAll("a", {"class": "productClick productTitle"}):
            try:
                itemsList.put(removeNonAscii(html_parser.unescape(link.string)).replace(',', ' ') + "," + clean_a_url(link['href']))
            except Exception:
                print "Formatting error: "
                traceback.print_exc(file=sys.stdout)
    return ""
It looks like you were trying to scrape Target's website - perhaps this page.
You've encountered one of the fundamental difficulties of web page scraping - what you see is not always what you get. In this case, they are AJAXing in a bunch of content after the page loads. Notice the little pinwheel animation when you first load the page - the content you were trying to access simply does not exist in the DOM until all the various JS scripts they've got on that page have run (and they've got a whole lot of them).
I clicked through a bit and it looks like the code responsible for generating that content is this bit of jquery:
<script id="productTitleTmpl" type="text/x-jquery-tmpl" >
{{if $item.parent.parent.viewType != "details"}}
{{tmpl($data.itemAttributes) "#productBrandTmpl"}}
{{/if}}
<a class="productClick productTitle" id="prodTitle-{{= $item.parent.parent.viewType}}-{{= $item.parent.parent.currentPageNumber}}-{{= $item.parent.productCounter}}" href="/{{= productDetailPageURL}}#prodSlot={{= $item.parent.parent.viewType}}_{{= $item.parent.parent.currentPageNumber}}_{{= $item.parent.productCounter}}" title="{{= title}}" name="prodTitle_{{= $item.catalogEntryId}}">
{{= $item.parent.parent.fetchProductTitleForView($item.productTitle)}}
</a>
So, anyway. If you really are dead-set on scraping this page, you will need to ditch urllib (or whatever you were using to fetch the HTML). Instead, visit the page with a JavaScript-enabled headless browser (driven by something like Selenium), let the JavaScript run, and then scrape it. All of that is outside the realm of this answer, but you can Google around for various headless browser solutions and find one that works for you.
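If it helps, here is a minimal sketch of that approach, assuming Selenium and a browser driver are installed (the URL is a placeholder, and the fixed sleep is the crudest possible wait):

import time

from bs4 import BeautifulSoup
from selenium import webdriver

driver = webdriver.Firefox()  # any WebDriver works; a headless one is fine too
driver.get("http://www.target.com/...")  # placeholder for the page in question

# Crude: give the page's scripts time to inject the product tiles.
# An explicit wait on a known element would be more robust.
time.sleep(10)

# Hand the fully rendered DOM to BeautifulSoup and reuse the original parsing.
soup = BeautifulSoup(driver.page_source, "html.parser")
for link in soup.find_all("a", class_="productClick productTitle"):
    print(link.get("href"))

driver.quit()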
I am trying to print by ID using Selenium. As far as I can tell, "a" is the tag and "title" is the attribute. See HTML below.
When I run the following code:
print(driver.find_elements(By.TAG_NAME, "a")[0].get_attribute('title'))
I get the output:
Zero Tolerance
So I'm getting the first attribute correctly. When I increment the code above:
print(driver.find_elements(By.TAG_NAME, "a")[1].get_attribute('title'))
My expected output is:
Aaliyah Love
However, I'm just getting blank. No errors. What am I doing incorrectly? Please don't suggest using XPath or CSS; I'm trying to learn Selenium tags.
HTML:
<a class=" Link ScenePlayer-ChannelName-Link styles_1lHAYbZZr4 Link ScenePlayer-ChannelName-Link styles_1lHAYbZZr4" href="/en/channel/ztfilms" title="Zero Tolerance" rel="">Zero Tolerance</a>
...
<a class=" Link ActorThumb-ActorImage-Link styles_3dXcTxVCON Link ActorThumb-ActorImage-Link styles_3dXcTxVCON" href="/[RETRACTED]/Aaliyah-Love/63565" title="Aaliyah Love"
Selenium locators are a toolbox and you're saying you only want to use a screwdriver (By.TAG_NAME) for all jobs. We aren't saying that you shouldn't use By.TAG_NAME, we're saying that you should use the right tool for the right job and sometimes (most times) By.TAG_NAME is not the right tool for the job. CSS selectors are WAY more powerful locators because they can search for not only tags but also classes, properties, etc.
It's hard to say for sure what's going on without access to the site/page. It could be that the entire page isn't loaded and you need to add a wait for the page to finish loading (maybe count links expected on the page?). It could be that your locator isn't specific enough and is catching other A tags that don't have a title attribute.
I would start by doing some debugging.
links = driver.find_elements(By.TAG_NAME, "a")
for link in links:
    print(link.get_attribute('title'))
What does this print?
If it prints some blank lines sprinkled throughout the actual titles, your locator is probably not specific enough. Try a CSS selector
links = driver.find_elements(By.CSS_SELECTOR, "a[title]")
for link in links:
    print(link.get_attribute('title'))
If instead it returns some titles and then nothing but blank lines, the page is probably not fully loaded. Try something like
from selenium.webdriver.support.ui import WebDriverWait

count = 20  # the number of expected links on the page
link_locator = (By.TAG_NAME, "a")
# Note: the locator tuple must be unpacked with * when passed to find_elements.
WebDriverWait(driver, 10).until(lambda wd: len(wd.find_elements(*link_locator)) == count)
links = driver.find_elements(*link_locator)
for link in links:
    print(link.get_attribute('title'))
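If you don't know the expected count up front, another option (assuming the links you care about all carry a title attribute) is Selenium's built-in expected conditions:

from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait

# Despite its name, presence_of_all_elements_located returns as soon as at
# least one match is present, so it only guards against a completely empty
# page -- keep the count check above if you need every link.
links = WebDriverWait(driver, 10).until(
    EC.presence_of_all_elements_located((By.CSS_SELECTOR, "a[title]"))
)
for link in links:
    print(link.get_attribute('title'))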
I want to export all store data from the following website into an Excel file:
https://www.ybpn.de/ihre-parfuemerien
The problem: the map is "dynamic", so the data I need only loads once you enter a postal code.
The data I need is stored in the div class "storefinder__list-item", with a unique reference in the data-storefinder-reference attribute, for example: data-storefinder-reference="132"
I tried:
soup.find("div", {"data-storefinder-reference": "132"})
But the output is: None
I think this problem is caused by the fact that the page is dynamic: the data only loads once you enter a postal code. So when I search for the reference id "132" it is "there", but not yet loaded into the page, so bs4 can't find this id.
Any ideas to improve the code?
For this you might need to look into tools like Selenium and/or headless Firefox.
Selenium in particular allows you to "remote-control" web pages with Python.
Here is a tutorial: https://realpython.com/modern-web-automation-with-python-and-selenium/
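A rough sketch of what that could look like here, with headless Firefox (note: the postal-code input selector is a guess; check the real one in the browser's dev tools):

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.firefox.options import Options

options = Options()
options.add_argument("--headless")
driver = webdriver.Firefox(options=options)
driver.get("https://www.ybpn.de/ihre-parfuemerien")

# Hypothetical selector for the postal-code field -- inspect the page for the real one.
search_box = driver.find_element(By.CSS_SELECTOR, "input[name='plz']")
search_box.send_keys("10115")
search_box.submit()

driver.implicitly_wait(10)  # makes the find_elements call below wait for matches

# Once the list has rendered, the items exist in the DOM and can be read.
for item in driver.find_elements(By.CSS_SELECTOR, "div.storefinder__list-item"):
    print(item.get_attribute("data-storefinder-reference"), item.text)

driver.quit()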
If the problem is waiting for the page to load, you can do that with Selenium:
result = driver.execute_script('var text = document.title; return text;')
If the page uses jQuery, you can wait for the document to be ready before reading values:
# execute_async_script exposes a callback as its last argument, which lets
# the value escape the ready() handler (a plain execute_script cannot
# return from inside a callback).
result = driver.execute_async_script("""
    var done = arguments[arguments.length - 1];
    $(document).ready(function () {
        done($('yourselector').text());
    });
""")
Note: For Selenium you can look here
You could just open the page in Chrome or Firefox, open the web debug console and query the elements. If you see them, they are in the DOM and thus queryable. But that will be done in JavaScript. If you're lucky, they use jQuery.
Set-up
I'm using scrapy to scrape housing ads.
For each ad, I'm trying to obtain info on year of construction.
This info is stated in most ads.
Problem
I can see the year of construction and the other info around it in the about section when I check the ad in the browser and its HTML code in developer mode.
However, when I use Scrapy, I get an empty list back. I can scrape other parts of the ad page (price, rooms, etc.), but not the about section.
Check this example ad.
If I use response.css('#caracteristique_bien').extract_first(), I get:
<div id="caracteristique_bien"></div>
That's as far as I can go; anything deeper returns nothing.
How can I obtain the year of construction?
As I mentioned, this is rendered using JavaScript, which means that some parts of the HTML are loaded dynamically by the browser (Scrapy is not a browser).
The good thing in this case is that the JavaScript data sits inside the actual response, which means you can still parse that information, just differently.
For example, to get the description, you can find it inside an inline script:
import re

import demjson

script_info = response.xpath('//script[contains(., "Object.defineProperty")]/text()').extract_first()

# getting description
description_json = re.search(r"descriptionBien', (\{.+?\});", script_info, re.DOTALL).group(1)
real_description = demjson.decode(description_json)['value']

# getting surface area
surface_json = re.search(r"surfaceT', (\{.+?\})\);", script_info, re.DOTALL).group(1)
real_surface = demjson.decode(surface_json)['value']
...
As you can see, script_info contains all the information; you just need to come up with a way to parse it to get what you want.
But there is some information that isn't inside the same response. To get it you need to do a GET request to:
https://www.seloger.com/detail,json,caracteristique_bien.json?idannonce=139747359
As you can see, it only requires the idannonce, which you can get from the previous response with:
demjson.decode(re.search(r"idAnnonce', (\{.+?\})\);", script_info, re.DOTALL).group(1))['value']
Later with the second request, you can get for example the "construction year" with:
import json
...
data = json.loads(response.body)
general = [x for x in data['categories'] if x['name'] == 'Général'][0]
construction_year = [y for y in general['criteria'] if 'construction' in y['value']][0]['value']
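Putting the two requests together in a spider could look roughly like this (a sketch assembled from the snippets above; the start URL is a placeholder):

import json
import re

import demjson
import scrapy


class AdSpider(scrapy.Spider):
    name = "ads"
    start_urls = ["https://www.seloger.com/..."]  # placeholder ad URL

    def parse(self, response):
        # Pull the inline script that defines all the ad properties.
        script_info = response.xpath(
            '//script[contains(., "Object.defineProperty")]/text()').extract_first()
        idannonce = demjson.decode(re.search(
            r"idAnnonce', (\{.+?\})\);", script_info, re.DOTALL).group(1))['value']
        # Fetch the characteristics JSON that the browser loads separately.
        url = ('https://www.seloger.com/detail,json,caracteristique_bien.json'
               '?idannonce=%s' % idannonce)
        yield scrapy.Request(url, callback=self.parse_caracteristiques)

    def parse_caracteristiques(self, response):
        data = json.loads(response.body)
        general = [x for x in data['categories'] if x['name'] == 'Général'][0]
        year = [y for y in general['criteria'] if 'construction' in y['value']][0]['value']
        yield {'construction_year': year}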
Loaded the page, opened the browser's devtools, did a Ctrl-F with the CSS selector you used (caracteristique_bien), and found this request: https://www.seloger.com/detail,json,caracteristique_bien.json?idannonce=139747359
where you can find what you are looking for.
Looking at your example, the ad is loaded dynamically with JavaScript, so you won't be able to get it via Scrapy alone.
You can use Selenium for (massive) scraping (I did similar things on a famous French ads website).
Just use it headless with Chrome options and this will be fine:
from selenium import webdriver

options = webdriver.ChromeOptions()
options.add_argument('--headless')
driver = webdriver.Chrome(options=options)
I recently implemented adding target="_blank" to external links like this:
@hooks.register('after_edit_page')
def do_after_page_edit(request, page):
    if hasattr(page, "body"):
        soup = BeautifulSoup(page.body)
        for a in soup.findAll('a'):
            if a.has_attr("href"):
                a["target"] = "_blank"
        page.body = str(soup)
        page.body = page.body.replace("<html><head></head><body>", "")
        page.body = page.body.replace("</body></html>", "")
        page.body = page.body.replace("></embed>", "/>")
        page.save()
@hooks.register('construct_whitelister_element_rules')
def whitelister_element_rules():
    return {
        'a': attribute_rule({'href': check_url, 'target': True}),
    }
Problems:
Beautiful Soup messes with the output, adding html, head & body tags (see: Don't put html, head and body tags automatically, beautifulsoup).
It also messes with the embed tags (see: How to get BeautifulSoup 4 to respect a self-closing tag?).
Hence my crappy "fix" manually replacing parts of the output with blank strings.
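For what it's worth, the wrapper tags come from the tree builder: lxml and html5lib build complete documents, while Python's built-in parser leaves fragments alone. A sketch of the same loop parsed with "html.parser" (which should make the string replacements unnecessary):

from bs4 import BeautifulSoup

# html.parser keeps the fragment as-is instead of wrapping it in
# <html><head></head><body>...</body></html> the way lxml/html5lib do.
soup = BeautifulSoup(page.body, "html.parser")
for a in soup.find_all('a'):
    if a.has_attr('href'):
        a["target"] = "_blank"
page.body = str(soup)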
Question:
What is the correct and best way to do this?
Starting with Wagtail v2.5, there is an API to do customisations like this as part of Wagtail’s rich text processing: Rewrite handlers, with the register_rich_text_features hook.
Here is an example of using this new API to make a rewrite handler that sets a target="_blank" attribute to all external links:
from django.utils.html import escape
from wagtail.core import hooks
from wagtail.core.rich_text import LinkHandler


class NewWindowExternalLinkHandler(LinkHandler):
    # This specifies to do this override for external links only.
    # Other identifiers are available for other types of links.
    identifier = 'external'

    @classmethod
    def expand_db_attributes(cls, attrs):
        href = attrs["href"]
        # Let's add the target attr, and also rel="noopener" + noreferrer fallback.
        # See https://github.com/whatwg/html/issues/4078.
        return '<a href="%s" target="_blank" rel="noopener noreferrer">' % escape(href)


@hooks.register('register_rich_text_features')
def register_external_link(features):
    features.register_link_type(NewWindowExternalLinkHandler)
In this example I'm also adding rel="noopener" to fix a known security issue with target="_blank".
Compared to previous solutions to this problem, this new approach is the most reliable: it’s completely server-side and only overrides how links are rendered on the site’s front-end rather than how they are stored, and only relies on documented APIs instead of internal ones / implementation details.
Have been struggling with the same problem and couldn't achieve it using Wagtail hooks. My initial solution was to manipulate the content in base.html, using a filter. A filter to cut pieces of code works perfectly when placed in the content block, for example:
{{ self.body|cut:' href="http:' }}
The filter above deletes parts of the content, but unfortunately 'replace' is not available as a filter (I'm using Python 3.x). Therefore my next approach was building a custom filter to provide 'replace' as a filter option. Long story short: it partly worked, but only if the content was converted from the original 'StreamValue' datatype to 'string'. This conversion resulted in content with all HTML tags shown, so the replacement did not result in working HTML. I couldn't get the content back to StreamValue again, and no other Python datatype remedied the issue.
Eventually jQuery got the job done for me:
$(document).ready(function(){
    $('a[href^="http://"]').attr('target', '_blank');
});
This code adds target="_blank" to each link whose href starts with 'http://', so all internal (relative) links stay in the existing tab. It needs to be placed at the end of your base.html (or similar), and of course you need to load jQuery before you run it.
Got my answer from here.
Don't know if jQuery is the correct and best way to do it, but it works like a charm for me with minimal coding.
I am trying to make a program to collect links and some values from a website. It mostly works well, but I have come across a page on which it does not work.
With Firebug I can see that this is the HTML code of the elusive "link" (I can't find it when viewing the page's source, though):
<a class="visit" href="/tet?id=12&mv=13&san=221">
221
</a>
and this is the script:
<td><a href=\"/tet?id=12&mv=13&san=221\" class=\"visit\">221<\/a><\/td><\/tr>
I'm wondering how to get both the "link" ("/tet?id=12&mv=13&san=221") from the HTML code and the string "221" from either the script or the HTML, using Selenium, Mechanize or Requests (or some other library).
I have made an unsuccessful attempt at getting it with Mechanize using the br.links() function, which collected a number of links from the site, just not the one I am after.
Extra info (this might be important): to get to the page I have to click on a button with this code:
<a id="f33" class="button-flat small selected-no" onclick="qc.pA('visitform', 'f33', 'QClickEvent', '', 'f52'); if ($j('#f44').length == 0) { $j('f44').style.display='inline'; }; $j('#f38').hide();qc.recordControlModification('f38', 'DisplayStyle', 'hide'); document.getElementById('forumpanel').className = 'section-3'; return false;" href="#">
load2
</a>
after which a "new page" loads in a part of the window (but the URL never changes).
I think you pasted the wrong script of yours ;)
I'm not sure what you need exactly - there are at least two different approaches:
Matching all hrefs using a regex
Matching specific tags and using get_attribute(...)
For the first one, you have to get the whole HTML source of the page with something like webdriver.page_source and run something like the following regex over it (you will have to escape either the single or the double quotes!):
<a.+?href=['"](.*?)['"].*?/?>
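A minimal sketch of the regex route (crude, as regex-over-HTML always is; it assumes the page is already loaded in webdriver):

import re

# Pull every href out of the raw page source.
html = webdriver.page_source
hrefs = re.findall(r'<a.+?href=[\'"](.*?)[\'"].*?>', html, re.DOTALL)
print(hrefs)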
If you need the hrefs of all matching links, you could use something similar to webdriver.find_elements_by_css_selector('.visit') (take care to choose find_elements_... instead of find_element_...!) to obtain a list of webelements and iterate through them to get their attributes.
This could result in code like this:
hrefs = []
elements = webdriver.find_elements_by_css_selector('.visit')
for element in elements:
    hrefs.append(element.get_attribute('href'))
Or a one-liner using a list comprehension:
hrefs = [element.get_attribute('href') for element
         in webdriver.find_elements_by_css_selector('.visit')]