Finding an href link using Python, Selenium, and XPath - python

I want to get the href from a <p> tag using an XPath expression.
I want to use the text from <h1> tag ('Cable Stripe Knit L/S Polo') and simultaneously text from the <p> tag ('White') to find the href in the <p> tag.
Note: There are more colors of one item (more articles with different <p> tags, but the same <h1> tag)!
HTML source
<article>
<div class="inner-article">
<a href="/shop/tops-sweaters/ix4leuczr/a1ykz7f2b" style="height:150px;">
</a>
<h1>
<a href="/shop/tops-sweaters/ix4leuczr/a1ykz7f2b" class="name-link">Cable Stripe Knit L/S Polo
</a>
</h1>
<p>
White
</p>
</div>
</article>
I've tried this code, but it didn't work.
specificProductColor = driver.find_element_by_xpath("//div[@class='inner-article' and contains(text(), 'White') and contains(text(), 'Cable')]/p")
driver.get(specificProductColor.get_attribute("href"))

As per the HTML source, the XPath expression to get the href tags would be something like this:
specificProductColors = driver.find_elements_by_xpath("//div[@class='inner-article']//a[contains(text(), 'White') or contains(text(), 'Cable')]")
specificProductColors[0].get_attribute("href")
specificProductColors[1].get_attribute("href")
Since there are two hyperlink tags, you should be using find_elements_by_xpath which returns a list of elements. In this case it would return two hyperlink tags, and you could get their href using the get_attribute method.

I've got working code. It's not the fastest one - this part takes approximately 550 ms, but it works. If someone could simplify that, I'd be very thankful :)
It takes all products with the specified keyword (Cable) from the product page and all products with a specified color (White) from the product page as well. It compares href links and matches wanted product with wanted color.
I also want to simplify the loop - stop both for loops if the links match.
specificProduct = driver.find_elements_by_xpath("//div[@class='inner-article']//*[contains(text(), '" + productKeyword[arrayCount] + "')]")
specificProductColor = driver.find_elements_by_xpath("//div[@class='inner-article']//*[contains(text(), '" + desiredColor[arrayCount] + "')]")
for colorLink in specificProductColor:
    specProductColor = colorLink.get_attribute("href")
    for productLink in specificProduct:
        specProduct = productLink.get_attribute("href")
        if specProductColor == specProduct:
            print(specProduct)
            wantedProduct = specProduct
driver.get(wantedProduct)
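The nested loops can be avoided entirely by combining both conditions into a single XPath. A sketch of that idea, checked here against the question's HTML with lxml rather than a live Selenium session (the same expression string can be passed to driver.find_element_by_xpath; the 'Cable' and 'White' literals stand in for productKeyword and desiredColor):

```python
from lxml import html

page = html.fromstring("""
<article><div class="inner-article">
<a href="/shop/tops-sweaters/ix4leuczr/a1ykz7f2b" style="height:150px;"></a>
<h1><a href="/shop/tops-sweaters/ix4leuczr/a1ykz7f2b" class="name-link">Cable Stripe Knit L/S Polo</a></h1>
<p>White</p>
</div></article>""")

# Select the <h1> link of the inner-article whose heading contains the
# keyword AND whose <p> contains the wanted color.
xpath = ("//div[@class='inner-article']"
         "[.//h1/a[contains(., 'Cable')]]"
         "[p[contains(normalize-space(.), 'White')]]"
         "//h1/a")
link = page.xpath(xpath)[0]
print(link.get("href"))  # /shop/tops-sweaters/ix4leuczr/a1ykz7f2b
```

With Selenium this becomes a single find_element_by_xpath call, so no post-hoc href comparison is needed.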

Related

How to scrape the content from all the div tags which also contain another tag?

The website I am trying to scrape has all of its content laid out under the same div class type: mock-div. Upon inspecting its HTML, the relevant content is only present under those div tags which also contain the figure tag. What should be the correct XPath?
I tried to see if the following would work
response.xpath("//figure~//").getall()
but it returns ValueError: XPath error: Invalid expression in //figure~// and rightly so.
<div class="mock-div">
<h2 class="mock-h2" id="id1"> hello world </h2>
<figure class="mock-fig"><img src="file.jpg" alt="filename">
<figcaption>file caption</figcaption> </figure>
<p>text1</p>
<p>text2</p>
</div>
...
<div class="mock-div">
<h2 class="mock-h2" id="id2"> footer </h2>
<p> end of the webpage </p>
From the HTML above, we want to extract the following information from every matching div tag:
<h2> tag: hello world
<p> tag: text1, text2
src value from img tag: file.jpg
alt value from img tag: filename
figcaption tag: file caption
Use the class as the xpath identifier.
for section in response.xpath('//div[@class="mock-div"]'):
    h2 = section.xpath('./h2/text()').get()
    p_s = section.xpath('./p/text()').getall()
    src = section.xpath('.//img/@src').get()
    alt = section.xpath('.//img/@alt').get()
    fig_caption = section.xpath('.//figcaption/text()').get()
Here is an example that uses the method you described, by grabbing the parent div of the figure elements. You would simply use the .. xpath selector to grab the parent of the figure element.
For example:
import scrapy
html = """
<div class="mock-div">
<h2 class="mock-h2" id="id1"> hello world </h2>
<figure class="mock-fig"><img src="file.jpg" alt="filename">
<figcaption>file caption</figcaption> </figure>
<p>text1</p>
<p>text2</p>
</div>
"""
html = scrapy.Selector(text=html)
for elem in html.xpath("//figure"):
    section = elem.xpath('./..')
    print({
        'h2': section.xpath('./h2/text()').get(),
        'p_s': section.xpath('./p/text()').getall(),
        'src': section.xpath('.//img/@src').get(),
        'alt': section.xpath('.//img/@alt').get(),
        'fig_caption': section.xpath('.//figcaption/text()').get(),
    })
OUTPUT:
{'h2': ' hello world ', 'p_s': ['text1', 'text2'], 'src': 'file.jpg', 'alt': 'filename', 'fig_caption': 'file caption'}
This isn't a strategy that I would usually recommend though.
You could use XPath Axes for this. For example, this XPath will get all div tags that contain a figure tag:
response.xpath("//div[descendant::figure]").getall()
XPath Axes
An axis represents a relationship to the context (current) node, and
is used to locate nodes relative to that node on the tree.
descendant
Selects all descendants (children, grandchildren, etc.) of the current
node
See also:
XPath Axes
XPath Axes and their Shortcuts
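A quick check of the descendant-axis filter against the question's HTML, run here with lxml (scrapy's selectors are built on the same XPath engine underneath):

```python
from lxml import html

doc = html.fromstring("""
<div>
<div class="mock-div"><h2 class="mock-h2" id="id1"> hello world </h2>
<figure class="mock-fig"><img src="file.jpg" alt="filename"/>
<figcaption>file caption</figcaption></figure>
<p>text1</p><p>text2</p></div>
<div class="mock-div"><h2 class="mock-h2" id="id2"> footer </h2>
<p> end of the webpage </p></div>
</div>""")

# Only the div that actually contains a <figure> somewhere beneath it matches.
matches = doc.xpath('//div[@class="mock-div"][descendant::figure]')
print(len(matches))                                # 1
print(matches[0].xpath('./h2/text()')[0].strip())  # hello world
```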

How to find elements under a located element?

I have a web page something like:
<div class='product-pod-padding'>
<a class='header product-pod--ie-fix' href='link1'/>
<div> SKU#1</div>
</div>
<div class='product-pod-padding'>
<a class='header product-pod--ie-fix' href='link2'/>
<div> SKU#2</div>
</div>
<div class='product-pod-padding'>
<a class='header product-pod--ie-fix' href='link3'/>
<div> SKU#3</div>
</div>
When I loop through the products with the following code, it gives the expected outcome:
products = driver.find_elements_by_xpath("//div[@class='product-pod-padding']")
for index, product in enumerate(products):
    print(product.text)
SKU#1
SKU#2
SKU#3
However, if I try to locate the href of each product, it will only return the first item's link:
products = driver.find_elements_by_xpath("//div[@class='product-pod-padding']")
for index, product in enumerate(products):
    print(index)
    print(product.text)
    url = product.find_element_by_xpath("//a[@class='header product-pod--ie-fix']").get_attribute('href')
    print(url)
SKU#1
link1
SKU#2
link1
SKU#3
link1
What should I do to get the corrected links?
This should make your code functional:
[...]
products = driver.find_elements_by_xpath("//div[@class='product-pod-padding']")
for index, product in enumerate(products):
    print(index)
    print(product.text)
    url = product.find_element_by_xpath(".//a[@class='header product-pod--ie-fix']").get_attribute('href')
    print(url)
[..]
The crux here is the dot at the front of the XPath, which restricts the search to within the current element.
You need to use a relative XPath in order to locate a node inside another node.
//a[@class='header product-pod--ie-fix'] will always return the first match from the beginning of the DOM.
You need to put a dot . on the front of the XPath locator
".//a[@class='header product-pod--ie-fix']"
This will retrieve the desired element inside its parent element:
url = product.find_element_by_xpath(".//a[@class='header product-pod--ie-fix']").get_attribute('href')
So, your entire code could be as follows:
products = driver.find_elements_by_xpath("//div[@class='product-pod-padding']")
for index, product in enumerate(products):
    print(index)
    url = product.find_element_by_xpath(".//a[@class='header product-pod--ie-fix']").get_attribute('href')
    print(url)
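The difference between the two locators can be seen without a browser. A minimal sketch using lxml in place of Selenium (the XPath semantics are identical, and the HTML mirrors the question):

```python
from lxml import html

doc = html.fromstring("""
<div>
<div class='product-pod-padding'><a class='header' href='link1'>A</a></div>
<div class='product-pod-padding'><a class='header' href='link2'>B</a></div>
<div class='product-pod-padding'><a class='header' href='link3'>C</a></div>
</div>""")
products = doc.xpath("//div[@class='product-pod-padding']")

# Without the leading dot, the search restarts at the document root each
# time, so every iteration finds the same first <a>.
print([p.xpath("//a")[0].get("href") for p in products])   # ['link1', 'link1', 'link1']
# With the dot, the search is scoped to each product element.
print([p.xpath(".//a")[0].get("href") for p in products])  # ['link1', 'link2', 'link3']
```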

Extracting string from <h1> element with logic attached

I am trying to scrape some sports game data, and I have run into some issues with my code. Eventually I will move this data into a dataframe and then into a database.
In the code, I have found the class element of one of the headers I would like to parse. There are multiple h1's in the HTML I am parsing.
<div class="type-game">
<div class="type">NHL Regular Season</div>
<h1>Blackhawks vs. Ducks</h1>
</div>
With this HTML structure, how can I get the h1 to return to a string I can use to populate a dataframe?
Code I have tried so far is:
req = requests.get(url)  # + str(page) + '/')
soup = bs(req.text, 'html.parser')
stype = soup.find('h1', class_='type-game')
print(stype)
This code returns "None". I have checked other articles on here and nothing has worked so far.
For the next level of my question, is there a way to create a For loop or similar to go through all of the pages (website is numbered sequentially for events) for any games that contain a string?
For example, if I wanted to only save games that have the Chicago Blackhawks in the h1 for the div element that has class= type-game?
Pseudocode would be something like this:
For webpages 1 to 10000:
    if the h1 under class='type-game' contains "Blackhawks":
        proceed with parsing the page
    if not:
        skip it and go to the next webpage
I know this is a little open ended, but I have a good VBA background and trying to apply those coding ideas to Python has been a challenge.
Select your elements more specifically, for example with CSS selectors:
soup.select('h1:-soup-contains("Blackhawks")')
or
soup.select('div.type-game h1:-soup-contains("Blackhawks")')
To get the text from a tag just use .text or get_text()
for e in soup.select('h1:-soup-contains("Blackhawks")'):
    print(e.text)
Example
html='''
<div class="type-game">
<div class="type">NHL Regular Season</div>
<h1>Blackhawks vs. Ducks</h1>
</div>
<div class="type-game">
<div class="type">NHL Regular Season</div>
<h1>Hawks vs. Ducks</h1>
</div>
<div class="type-game">
<div class="type">NHL Regular Season</div>
<h1>Ducks vs. Blackhawks</h1>
</div>
'''
from bs4 import BeautifulSoup

soup = BeautifulSoup(html, 'lxml')
for e in soup.select('h1:-soup-contains("Blackhawks")'):
    print(e.text)
Output
Blackhawks vs. Ducks
Ducks vs. Blackhawks
EDIT
for e in soup.select('div.type-game h1'):
    if 'Blackhawks' in e.text:
        print(e.text)  # or do whatever else is needed
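The pagination half of the question can be handled by wrapping the check in a small helper. A sketch, assuming the event pages really are numbered sequentially (the URL pattern in the comment is a placeholder, not the real site):

```python
from bs4 import BeautifulSoup

def is_blackhawks_game(page_html):
    """True if the page's type-game block has an <h1> mentioning the Blackhawks."""
    soup = BeautifulSoup(page_html, "html.parser")
    h1 = soup.select_one("div.type-game h1")
    return h1 is not None and "Blackhawks" in h1.get_text()

# The crawl loop would then be (requests.get omitted here):
# for page in range(1, 10001):
#     page_html = requests.get(f"https://example.com/event/{page}").text
#     if is_blackhawks_game(page_html):
#         ...  # proceed with parsing this page
```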

How can I find <img src> nested within <div> using Beautiful Soup?

New to both Python and Beautiful Soup. I am trying to collect the src of an img inserted into a collapsible section on an e-commerce site. The collapsible sections that contain the images have the class of accordion__contents, but <img> inserted into the collapsible sections do not have a specific class. Not every page contains an image; some contain multiple.
I am trying to extract the src from img that are randomly nested within <div>. In the HTML example below, my desired output would be: ['https://example.com/image1.png']
<div class="accordion__title">Description</div>
<div class="accordion__contents">
<p>Enjoy Daiya’s Hon’y Mustard Dressing on your salads</p>
</div>
<div class="accordion__title">Ingredients</div>
<div class="accordion__contents">
<p>Non-GMO Expeller Pressed Canola Oil, Filtered Water</p>
<p><strong>CONTAINS: MUSTARD</strong></p>
</div>
<div class="accordion__title">Nutrition</div>
<div class="accordion__contents">
<p>
<img alt="" class="alignnone size-medium wp-image-57054" height="300" src="https://example.com/image1.png" width="162"/>
</p>
</div>
<div class="accordion__title">Warnings</div>
<div class="accordion__contents">
<p><strong>Contains mustard</strong></p>
</div>
I've written the following code that successfully drills down to the full tag, but I can't figure out how to extract src once I'm there.
img_href = container.find_all(class_='accordion__contents')  # generates the output above, in a list form
img_href = [img.find_all('img') for img in img_href]
for x in img_href:
    if len(x) == 0:  # skip over empty items in the list that don't have images
        continue
    else:
        print(x)  # print to make sure the image is there
        x.find('img')['src']  # generates error - see below
The error I am getting is ResultSet object has no attribute 'find'. You're probably treating a list of items like a single item. Did you call find_all() when you meant to call find()? My intent is not to be treating a list like an item, thus the loop.
I've tried find_all() combined with .attrs('src') but that also didn't work. What am I doing wrong?
I've simplified my example, but the URL for the page I'm scraping is here.
You can use CSS selector ".accordion__contents img":
import requests
from bs4 import BeautifulSoup
url = "https://gtfoitsvegan.com/product/hony-mustard-dressing-by-daiya/?v=7516fd43adaa"
soup = BeautifulSoup(requests.get(url).content, "html.parser")
all_imgs = [img["src"] for img in soup.select(".accordion__contents img")]
print(all_imgs)
Prints:
['https://gtfoitsvegan.com/wp-content/uploads/2021/04/Daiya-Honey-Mustard-Nutrition-Facts-162x300.png']
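For completeness, the question's own nested-find_all approach also works once the list-of-lists is iterated correctly; each item of the outer list is a (possibly empty) list of img tags, so the inner list should be looped over rather than .find being called on it. A sketch against a trimmed version of the question's HTML:

```python
from bs4 import BeautifulSoup

html = """
<div class="accordion__contents"><p>Enjoy the dressing on your salads</p></div>
<div class="accordion__contents"><p>
<img alt="" src="https://example.com/image1.png"/></p></div>
"""
container = BeautifulSoup(html, "html.parser")

srcs = []
for section in container.find_all(class_="accordion__contents"):
    for img in section.find_all("img"):  # empty list when no image: loop just skips
        srcs.append(img["src"])
print(srcs)  # ['https://example.com/image1.png']
```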

How do I get text from a <p> tag using regex applied within BeautifulSoup?

I've written a script in Python that uses regex to fetch text from certain p tags, but the script gives me an empty list.
This is the relevant portion of the HTML:
<div class="result__links">
<p class="result__outcome u-hide-phablet">Kolkata Knight Riders won by 7 wickets</p>
<p class="result__info u-hide-phablet">
Match 15, 20:00 IST (14:30 GMT), Sawai Mansingh Stadium, Jaipur
</p>
<a class="result__button result__button--mc btn" href="/match/2018/15?tab=scorecard">Match Centre</a>
</div>
How do I fetch the text of the p tag wrapped within the class below?
class='result__info u-hide-phablet'
The purpose is to fetch the text of above mentioned tag using regex.
This is what I've tried so far:
winner = soup.find_all('p',class_="result__outcome u-hide-phablet")
win_list = re.findall(r'>(.*?)</p>', str(winner))
The above code produces an empty list. Any help on this will be highly appreciated.
Post script: I'm looking for any solution related to regex.
For accessing the tags you are interested in you can do:
for p in soup.findAll("p", {"class": "result__outcome u-hide-phablet"}):
    tags_text = p.text
In the same way for span you need to do:
for span in soup.findAll("span", {"class": "result__score result__score--winner"}):
    tags_text = span.text
That gets you the text of each tag, as asked in the question.
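Since the post script asks specifically for a regex route, here is a regex-only sketch run against the question's HTML snippet. Note re.DOTALL, which is needed for the result__info tag because its text spans several lines; regex on HTML stays fragile compared with the .text approach above:

```python
import re

html = '''<div class="result__links">
<p class="result__outcome u-hide-phablet">Kolkata Knight Riders won by 7 wickets</p>
<p class="result__info u-hide-phablet">
Match 15, 20:00 IST (14:30 GMT), Sawai Mansingh Stadium, Jaipur
</p>
</div>'''

# re.DOTALL lets .*? span the newlines inside the result__info tag;
# \s* trims the surrounding whitespace out of the captured group.
info = re.findall(r'<p class="result__info u-hide-phablet">\s*(.*?)\s*</p>',
                  html, re.DOTALL)
print(info)  # ['Match 15, 20:00 IST (14:30 GMT), Sawai Mansingh Stadium, Jaipur']
```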
