I am scraping metadata from the New York Times' website. I'm looking to gather three pieces of information:
Headline
Article URL
Thumbnail image
I have been successful in gathering all three, except when the NYTimes homepage itself displays the article's image. In that case, I've tried to capture the homepage thumbnail image but have been unsuccessful. Here is my code so far:
for item in soup.select('.story-wrapper'):
    try:
        headline = item.find('h3').get_text()
        link = item.find('a')['href']
        image = item.select('.css-hdqqnp')
    except (AttributeError, TypeError):
        # skip wrappers that have no headline or no link
        continue
The css selector .css-hdqqnp references the class of the thumbnail image for article images that are displayed on the NYTimes homepage (as opposed to being just text).
How can I get the thumbnail image for an article if it's already displayed on the homepage, as opposed to being available only on the article page (which I've already successfully gathered)?
The problem is that the HTML structure is
<div class="..." span="4">
<div class="....">
<section class="story-wrapper"> ... </section>
</div>
</div>
<div class="..." span="6">
<div class="....">
<!-- ... your nested img-tag inside a div-tag with css class 'css-hdqqnp' -->
</div>
</div>
That is, the image is not inside the section-tag. Instead, it's inside the next sibling tag of the section's grandparent tag. Consequently, you could search for the image thumbnails like this:
for item in soup.select('.story-wrapper'):
    headline = item.find('h3').get_text()
    link = item.find('a')['href']
    # find_next_sibling() skips whitespace text nodes, which .next_sibling would return
    if (sibling := item.parent.parent.find_next_sibling()) is not None:
        if (image := sibling.find("img")) is not None:
            image_url = image["src"]
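To make the "next sibling of the grandparent" idea concrete without fetching the live page, here is a minimal sketch that uses only the standard library (xml.etree.ElementTree swapped in for BeautifulSoup). The markup, URLs and the helper next_element_sibling are made up for illustration; the real NYTimes markup is not well-formed XML and would still need an HTML parser.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified markup mirroring the structure described above.
SAMPLE = """
<root>
  <div span="4">
    <div>
      <section class="story-wrapper">
        <a href="https://example.com/story"><h3>Example headline</h3></a>
      </section>
    </div>
  </div>
  <div span="6">
    <div>
      <div class="css-hdqqnp"><img src="https://example.com/thumb.jpg"/></div>
    </div>
  </div>
</root>
"""

root = ET.fromstring(SAMPLE)
# ElementTree has no parent pointers, so build a child -> parent map once.
parent_of = {child: parent for parent in root.iter() for child in parent}


def next_element_sibling(element):
    """Return the element following `element` under the same parent, or None."""
    parent = parent_of.get(element)
    if parent is None:
        return None
    siblings = list(parent)
    index = siblings.index(element)
    return siblings[index + 1] if index + 1 < len(siblings) else None


stories = []
for section in root.iter("section"):
    if section.get("class") != "story-wrapper":
        continue
    headline = "".join(section.find(".//h3").itertext()).strip()
    link = section.find(".//a").get("href")
    # section -> parent div -> grandparent div -> its next sibling
    grandparent = parent_of[parent_of[section]]
    sibling = next_element_sibling(grandparent)
    image = sibling.find(".//img") if sibling is not None else None
    image_url = image.get("src") if image is not None else None
    stories.append((headline, link, image_url))
```

Stories that have no image sibling end up with None as their image URL, so the article-page fallback can still be applied to them.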
This is the html code of a website I want to scrape
<div>
<div class="activityinstance">
<a class="" onclick="" href="https://www.blablabla.com">
<img src="http://www.blablabla.com/justapicture.jpg" class="iconlarge activityicon" alt="" role="presentation" aria-hidden="true">
<span class="instancename">title<span class="accesshide "> text
</span>
</span>
</a>
</div>
</div>
IMAGELINK = "http://www.blablabla.com/justapicture.jpg"
My aim is to find, on a particular page, all of the hrefs that are associated with IMAGELINK, using Python.
The picture from this URL tends to be shown multiple times, and I want to receive all the links so I can click on them.
I tried to find elements by class name "a" to extract all of the links on the page; that way, if I could find their XPath, I could just append "/img" and get the "src" attribute from that element.
But the problem is that I haven't found a way to extract the XPath from a given WebDriver element.
NOTE: I don't have access to the XPath of the element unless I write some function to generate it.
Find all elements with tag img and print the src attribute:
imgs = driver.find_elements_by_xpath("//img")
for img in imgs:
    print(img.get_attribute("src"))
I think he wanted the href of the parent a element for each img whose src equals IMAGELINK:
links = driver.find_elements_by_xpath("//img[@src='http://www.blablabla.com/justapicture.jpg']/parent::a")
for link in links:
    print(link.get_attribute("href"))
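The same selection can be tried offline. This is only a sketch: it uses the standard library's xml.etree.ElementTree instead of Selenium, the extra links in the sample are invented, and ElementTree supports just a subset of XPath (".." stands in for parent::a here).

```python
import xml.etree.ElementTree as ET

IMAGELINK = "http://www.blablabla.com/justapicture.jpg"

# Hypothetical page: two links use IMAGELINK, one uses a different picture.
SAMPLE = f"""
<div>
  <div class="activityinstance">
    <a href="https://www.blablabla.com/first"><img src="{IMAGELINK}"/><span>title</span></a>
  </div>
  <div class="activityinstance">
    <a href="https://www.blablabla.com/other"><img src="http://www.blablabla.com/another.jpg"/></a>
  </div>
  <div class="activityinstance">
    <a href="https://www.blablabla.com/second"><img src="{IMAGELINK}"/></a>
  </div>
</div>
"""

root = ET.fromstring(SAMPLE)
# Select every img with the wanted src, then step up to its parent element.
parents = root.findall(f".//img[@src='{IMAGELINK}']/..")
# Keep only parents that really are links.
hrefs = [parent.get("href") for parent in parents if parent.tag == "a"]
```

With Selenium you would keep the WebElement objects instead of the href strings, so that each one can be clicked.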
I have a problem with retrieving the text from a div class of a website.
The structure of the page is attached below. I'm trying to retrieve only the text 'black' from <span class="product-details__toggler-selected" title="black">.
At the moment I don't retrieve anything with it.
My XPath is this:
color = response.xpath("//div[@class='product-details__toggler-info-title']/p/span[@class='product-details__toggler-selected']/text()").extract()
Structure of page:
<div class="product-details__toggler-info-title">
<span class="product-details__toggler-title">Culoare</span>
<span class="product-details__toggler-selected" title="black"><em class="s-color-bg" style="background-color: #000000">black</em><span class="s-color-name">black</span></span>
</div>
Try one of the XPaths below to get the required value:
//div[@class='product-details__toggler-info-title']//span[@class='product-details__toggler-selected']/span/text()
or
//div[@class='product-details__toggler-info-title']//span[@class='product-details__toggler-selected']/@title
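As a quick offline check of both suggestions, here is a sketch using the standard library's xml.etree.ElementTree rather than Scrapy. ElementTree does not support /text() or /@title steps, so .text and .get() play those roles. It also shows that a path going through a p element matches nothing in the markup from the question.

```python
import xml.etree.ElementTree as ET

# Markup copied from the question; it happens to be well-formed XML.
SAMPLE = """
<div class="product-details__toggler-info-title">
  <span class="product-details__toggler-title">Culoare</span>
  <span class="product-details__toggler-selected" title="black"><em class="s-color-bg" style="background-color: #000000">black</em><span class="s-color-name">black</span></span>
</div>
"""

root = ET.fromstring(SAMPLE)  # root is the outer div

# The question's path goes through a <p> that is not present in this markup.
via_p = root.findall("./p/span[@class='product-details__toggler-selected']")

selected = root.find("./span[@class='product-details__toggler-selected']")
color_from_title = selected.get("title")                               # like .../@title
color_from_text = selected.find("./span[@class='s-color-name']").text  # like .../span/text()
```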
Currently I am generating a pdf from a html template in django/python.
Here is a relevant snippet from my view
result = StringIO.StringIO()
html = render_to_string('some_ref/pdf.html', context)  # context: the dictionary passed to the template
pdf = pisa.pisaDocument(StringIO.StringIO(html), dest=result)
return HttpResponse(result.getvalue(), content_type='application/pdf')
And my template is an html file that I would like to insert a hyperlink into. Something like
<td style="padding-left: 5px;">
{{ some_other_variable }}
</td>
Actually, the pdf generates fine and the template variables are passed correctly and show in the pdf. What is inside the a tag is highlighted in blue and underlined as if you could click on it, but when I try to click on it, the link is not followed. I have seen pdfs before with clickable links, so I believe it can be done.
Is there a way I can do this to make clickable hyperlinks on my pdf using pisa?
It works with the complete URL, including the http protocol and the domain:
{{ some_other_variable }}
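One way to make sure the template always receives a complete URL is to build it in the view before rendering. A minimal sketch, assuming a made-up site root (BASE_URL), a made-up path and a hypothetical template variable link_url; in a Django view, request.build_absolute_uri() can serve the same purpose.

```python
from urllib.parse import urljoin

# Hypothetical site root; in a real view it would come from the request or settings.
BASE_URL = "https://example.com"


def absolute_url(path, base=BASE_URL):
    """Turn a site-relative path into a full URL with protocol and domain."""
    return urljoin(base, path)


# Context for the PDF template; link_url is a hypothetical variable name.
context = {
    "some_other_variable": "Report 42",
    "link_url": absolute_url("/reports/42/"),
}
```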
I am writing a spider to download all images on the front page of a subreddit using scrapy. To do so, I have to find the image links to download the images from and use a CSS or XPath selector.
Upon inspection, the links are provided but the HTML looks like this for all of them:
<div class="expando expando-uninitialized" style="display: none" data-cachedhtml=" <div class="media-preview" id="media-preview-7lp06p" style="max-width: 861px"> <div class="media-preview-content"> <img class="preview" src="https://i.redditmedia.com/Q-LKAeFelFa9wAdrnvuwCMyXLrs0ULUKMsJTXSf3y34.jpg?w=861&s=69085fb507bed30f1e4228e83e24b6b2" width="861" height="638"> </div> </div> " data-pin-condition="function() {return this.style.display != 'none';}"><span class="error">loading...</span></div>
From what I can tell, it looks like all of the new elements are being initialized inside the opening tag of the <div> element. Could you explain what exactly is going on here, and how one would go about extracting image information from this?
*Sorry, I'm not quite sure how to properly format the html code, but there really isn't all too much to format, as it is all one big tag anyway.
How to read the mangled attribute, data-cachedhtml
The HTML is a mess. Try the techniques listed in How to parse invalid (bad / not well-formed) XML? to get viable markup before using XPath. It may take three passes:
Clean up the markup mess.
Get the attribute value of data-cachedhtml.
Use XPath to extract the image links.
XPath part
For the de-mangled data-cachedhtml in this form:
<div class="media-preview" id="media-preview-7lp06p" style="max-width: 861px">
<div class="media-preview-content">
<a href="https://i.redd.it/29moua43so501.jpg" class="may-blank">
<img class="preview" src="https://i.redditmedia.com/elided"
width="861" height="638"/>
</a>
</div>
<span class="error">loading...</span>
</div>
This XPath will retrieve the preview image links:
//a/img/@src
(That is, all src attributes of img element children of a elements.)
or
This XPath will retrieve the click-through image links:
//a[img]/@href
(That is, all href attributes of the a elements that have an img child.)
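The passes above can be sketched with the standard library alone. This assumes the raw page escapes the attribute value with entities such as &lt; and &quot;, which the browser inspector shows decoded; html.parser unescapes attribute values, so the cached markup can be fed to a second parser. The class AttributeCollector, the helper collect and the shortened sample are made up for illustration, and html.parser is swapped in for XPath.

```python
from html.parser import HTMLParser


class AttributeCollector(HTMLParser):
    """Collect one attribute's values from every occurrence of one tag."""

    def __init__(self, wanted_tag, wanted_attribute):
        super().__init__()
        self.wanted_tag = wanted_tag
        self.wanted_attribute = wanted_attribute
        self.values = []

    def handle_starttag(self, tag, attrs):
        if tag == self.wanted_tag:
            value = dict(attrs).get(self.wanted_attribute)
            if value is not None:
                self.values.append(value)


def collect(markup, tag, attribute):
    parser = AttributeCollector(tag, attribute)
    parser.feed(markup)
    parser.close()
    return parser.values


# Shortened, hypothetical version of the outer div with an entity-escaped attribute.
OUTER = (
    '<div class="expando expando-uninitialized" style="display: none" '
    'data-cachedhtml="&lt;div class=&quot;media-preview&quot;&gt; '
    '&lt;a href=&quot;https://i.redd.it/29moua43so501.jpg&quot;&gt;'
    '&lt;img class=&quot;preview&quot; src=&quot;https://i.redditmedia.com/elided&quot;&gt;'
    '&lt;/a&gt; &lt;/div&gt;">'
    '<span class="error">loading...</span></div>'
)

preview_links = []
click_through_links = []
for cached in collect(OUTER, "div", "data-cachedhtml"):
    # `cached` is now ordinary HTML text, so parse it a second time.
    preview_links.extend(collect(cached, "img", "src"))
    click_through_links.extend(collect(cached, "a", "href"))
```

In a Scrapy spider, the equivalent would be to read the attribute with a selector and build a new selector from its value.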
I am trying to parse several items from a blog, but I am unable to reach the last two items I need.
The html is:
<div class="post">
<div class="postHeader">
<h2 class="postTitle"><span></span>cuba and the cameraman</h2>
<span class="postMonth" title="2017">Nov</span>
<span class="postDay" title="2017">24</span>
<div class="postSubTitle"><span class="postCategories">TV Shows</span></div>
</div>
<div class="postContent"><p><a target="_blank" href="https://image.com/test.jpg"><img class="aligncenter" src="https://image.com/test.jpg"/></a> <br />
n/A<br />
<br />
<strong>Links:</strong> <a target='_blank' href='http://www.imdb.com/title/tt7320560/'>IMDB</a><br />
</p>
The data I need is the title "cuba and the cameraman" (code below), the "https://image.com/test.jpg" URL and the "http://www.imdb.com/title/tt7320560/" IMDB link.
So far I have only managed to parse the postTitle values for the website correctly:
all_titles = []
url = 'http://test.com'
browser.get(url)
titles = browser.find_elements_by_class_name('postHeader')
for title in titles:
    link = title.find_element_by_tag_name('a')
    all_titles.append(link.text)
But I can't get the image and IMDB links using the same method as above (class name).
Could you help me with this? Thanks.
You need a more accurate search. There is a family of find_element_by_XX functions built in; try XPath:
for post in driver.find_elements_by_xpath('//div[@class="post"]'):
    title = post.find_element_by_xpath('.//h2[@class="postTitle"]//a').text
    img_src = post.find_element_by_xpath('.//div[@class="postContent"]//img').get_attribute('src')
    link = post.find_element_by_xpath('.//div[@class="postContent"]//a[last()]').get_attribute('href')
Remember that you can always get the HTML source from driver.page_source and parse it using whatever tool you like.
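As an illustration of parsing the page source separately, here is a sketch with the standard library's xml.etree.ElementTree. The PAGE_SOURCE string is a tidied, well-formed stand-in for driver.page_source based on the markup in the question; a real page is rarely valid XML, so in practice an HTML parser would be the safer choice.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed stand-in for driver.page_source.
PAGE_SOURCE = """
<body>
  <div class="post">
    <div class="postHeader">
      <h2 class="postTitle"><span></span>cuba and the cameraman</h2>
    </div>
    <div class="postContent"><p>
      <a target="_blank" href="https://image.com/test.jpg"><img class="aligncenter" src="https://image.com/test.jpg"/></a><br/>
      <strong>Links:</strong> <a target="_blank" href="http://www.imdb.com/title/tt7320560/">IMDB</a><br/>
    </p></div>
  </div>
</body>
"""

root = ET.fromstring(PAGE_SOURCE)
posts = []
for post in root.findall(".//div[@class='post']"):
    heading = post.find(".//h2[@class='postTitle']")
    # itertext() also picks up text that follows the empty <span>.
    title = "".join(heading.itertext()).strip()
    content = post.find(".//div[@class='postContent']")
    img_src = content.find(".//img").get("src")
    # The IMDB link is the last <a> inside the post content.
    imdb_link = content.findall(".//a")[-1].get("href")
    posts.append((title, img_src, imdb_link))
```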