How can I find <img src> nested within <div> using Beautiful Soup?

New to both Python and Beautiful Soup. I am trying to collect the src of an img inserted into a collapsible section on an e-commerce site. The collapsible sections that contain the images have the class accordion__contents, but the <img> tags inserted into them do not have a class of their own. Not every page contains an image, and some contain several.
I am trying to extract the src from <img> tags that are nested at varying depths within <div> elements. In the HTML example below, my desired output would be: https://example.com/image1.png
<div class="accordion__title">Description</div>
<div class="accordion__contents">
<p>Enjoy Daiya’s Hon’y Mustard Dressing on your salads</p>
</div>
<div class="accordion__title">Ingredients</div>
<div class="accordion__contents">
<p>Non-GMO Expeller Pressed Canola Oil, Filtered Water</p>
<p><strong>CONTAINS: MUSTARD</strong></p>
</div>
<div class="accordion__title">Nutrition</div>
<div class="accordion__contents">
<p>
<img alt="" class="alignnone size-medium wp-image-57054" height="300" src="https://example.com/image1.png" width="162"/>
</p>
</div>
<div class="accordion__title">Warnings</div>
<div class="accordion__contents">
<p><strong>Contains mustard</strong></p>
</div>
I've written the following code that successfully drills down to the full tag, but I can't figure out how to extract src once I'm there.
img_href = container.find_all(class_='accordion__contents')  # generates the output above, in a list form
img_href = [img.find_all('img') for img in img_href]
for x in img_href:
    if len(x) == 0:  # skip over empty items in the list that don't have images
        continue
    else:
        print(x)  # print to make sure the image is there
        x.find('img')['src']  # generates error - see below
The error I am getting is: "ResultSet object has no attribute 'find'. You're probably treating a list of items like a single item. Did you call find_all() when you meant to call find()?" My intent is not to treat a list like a single item, hence the loop.
I've tried find_all() combined with .attrs('src'), but that didn't work either. What am I doing wrong?
I've simplified my example, but the actual page I'm scraping is https://gtfoitsvegan.com/product/hony-mustard-dressing-by-daiya/?v=7516fd43adaa.

You can use the CSS selector ".accordion__contents img":
import requests
from bs4 import BeautifulSoup
url = "https://gtfoitsvegan.com/product/hony-mustard-dressing-by-daiya/?v=7516fd43adaa"
soup = BeautifulSoup(requests.get(url).content, "html.parser")
all_imgs = [img["src"] for img in soup.select(".accordion__contents img")]
print(all_imgs)
Prints:
['https://gtfoitsvegan.com/wp-content/uploads/2021/04/Daiya-Honey-Mustard-Nutrition-Facts-162x300.png']
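A note on the .attrs('src') attempt from the question: attrs is a dict, not a method, so the working forms are img['src'], img.get('src'), or img.attrs['src']. If some pages have <img> tags without a src, a small defensive sketch using .get(), which returns None instead of raising KeyError:
for img in soup.select(".accordion__contents img"):
    src = img.get("src")  # same as img.attrs.get("src"); None when the attribute is missing
    if src:
        print(src)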

Related

Extracting string from <h1> element with logic attached

I am trying to scrape some sports game data and I have run into some issues with my code. Eventually I will move this data into a dataframe and then into a database.
In the code, I have found the class of the element wrapping one of the headers I would like to parse. There are multiple h1 elements in the HTML I am parsing.
<div class="type-game">
<div class="type">NHL Regular Season</div>
<h1>Blackhawks vs. Ducks</h1>
</div>
With this HTML structure, how can I get the h1 returned as a string I can use to populate a dataframe?
Code I have tried so far is:
import requests
from bs4 import BeautifulSoup as bs

req = requests.get(url)  # + str(page) + '/')
soup = bs(req.text, 'html.parser')
stype = soup.find('h1', class_='type-game')
print(stype)
This code returns "None". I have checked other articles on here and nothing has worked so far.
For the next level of my question, is there a way to create a for loop or similar to go through all of the pages (the website numbers events sequentially) for any games that contain a string?
For example, if I wanted to only save games that have the Chicago Blackhawks in the h1 of the div element that has class="type-game"?
Pseudocode would be something like this:
For webpages 1 to 10000:
    if the h1 inside class="type-game" contains "Blackhawks":
        proceed with parsing the page
    if not, skip it and go to the next webpage
I know this is a little open ended, but I have a good VBA background and trying to apply those coding ideas to Python has been a challenge.
Select your elements more specifically, for example with CSS selectors:
soup.select('h1:-soup-contains("Blackhawks")')
or
soup.select('div.type-game h1:-soup-contains("Blackhawks")')
To get the text from a tag just use .text or get_text()
for e in soup.select('h1:-soup-contains("Blackhawks")'):
    print(e.text)
Example
from bs4 import BeautifulSoup

html = '''
<div class="type-game">
    <div class="type">NHL Regular Season</div>
    <h1>Blackhawks vs. Ducks</h1>
</div>
<div class="type-game">
    <div class="type">NHL Regular Season</div>
    <h1>Hawks vs. Ducks</h1>
</div>
<div class="type-game">
    <div class="type">NHL Regular Season</div>
    <h1>Ducks vs. Blackhawks</h1>
</div>
'''
soup = BeautifulSoup(html,'lxml')
for e in soup.select('h1:-soup-contains("Blackhawks")'):
    print(e.text)
Output
Blackhawks vs. Ducks
Ducks vs. Blackhawks
EDIT
for e in soup.select('div.type-game h1'):
    if 'Blackhawks' in e.text:
        print(e.text)  # or do whatever else needs doing
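For the follow-up about walking sequentially numbered event pages, a rough sketch (the URL pattern here is hypothetical; substitute the real one):
import requests
from bs4 import BeautifulSoup

base_url = 'https://example.com/events/'  # hypothetical pattern for the sequentially numbered pages
for page in range(1, 10001):
    resp = requests.get(base_url + str(page))
    if resp.status_code != 200:
        continue  # event page missing, move on
    soup = BeautifulSoup(resp.text, 'html.parser')
    if soup.select_one('div.type-game h1:-soup-contains("Blackhawks")'):
        pass  # proceed with parsing this page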

Having trouble savings links to a list variable with selenium

I'm practicing web scraping with Selenium by opening users' dating profiles on a dating site. I need Selenium to save an href link for every profile on the page, but unfortunately it only saves the first profile in the list rather than building a list variable with all the links. All of the profiles start with the same two div class/style attributes, "member-thumbnail" and "position: absolute". Thank you for any help you can offer.
Here is the website code:
<div class="member-thumbnail">
<div style="position: absolute;">
<a href="/Member/Details/LvL-Up">
<img src="//storage.com/imgcdn/m/t/502b24cb-3f75-49a1-a61a-ae80e18d86a0" class="presenceLine online">
</a>
</div>
</div>
Here is my code:
link_list = []
link_list = browser.find_element_by_css_selector('.member-thumbnail a').get_attribute('href')
length_link_list = len(link_list)
for i in range(0, length_link_list):
    browser.get(link_list[i])
Use find_elements_by_css_selector instead of find_element_by_css_selector.
If you're going to loop through the whole list returned by find_elements_by_css_selector, consider iterating over it directly, which is a bit more Pythonic:
link_list = browser.find_elements_by_css_selector('.member-thumbnail a')
for element in link_list:
    browser.get(element.get_attribute('href'))
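One caveat with the loop above: browser.get() loads a new page, which invalidates the WebElements collected from the previous one, so every iteration after the first can raise a stale element error. A safer sketch extracts the hrefs into plain strings before navigating:
link_list = browser.find_elements_by_css_selector('.member-thumbnail a')
hrefs = [element.get_attribute('href') for element in link_list]  # plain strings survive navigation
for href in hrefs:
    browser.get(href)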

Using Python and BeautifulSoup to scrape list with variable orders and tags based on text strings

Details: MacOS, Python3, BeautifulSoup4
I am new to Python and even newer to BeautifulSoup, so please excuse any beginner mistakes. I am attempting to scrape HTML pages that do not heavily differentiate their tags by classes or div ids. In other words, I am trying to scrape the middle section of a list. The list will have an unpredictable number of tags and elements (sometimes an unordered list, other times a description list), so what I am scraping is fairly unpredictable. However, I do have two known variables: the header string text I want to START at and the header string text I want to END at.
I have assembled the following example html to test this on:
<div class="panel panel-default">
<div class="panel-heading">
<h3 class="panel-title">First Section Title - Known Variable or String</h3>
</div>
</div>
<div>
<ul class="unstyled">
<li>Item1</li>
<li>Item2</li>
<li>Empty LI Tags Also Exist</li>
</ul>
<dl class="dl-horizontal">
<dt>Title of some description list</dt>
<dd>Another item may exist here</dd>
</dl>
</div>
<div>
<div class="panel panel-default">
<div class="panel-heading">
<h3 class="panel-title">Another Section Title</h3>
</div>
</div>
<ul class="unstyled">
<li>Item1</li>
<li></li>
</ul>
<dl class="dl-horizontal">
<dt>Another Description List Title</dt>
<dd>Another item may exist here</dd>
<dt>And here</dt>
<dd>And Here</dd>
</dl>
</div>
<div>
<div class="panel panel-default">
<div class="panel-heading">
<h3 class="panel-title">Section Title (String) I Wish To Stop At - Known Variable or String</h3>
</div>
</div>
</div>
Again, using the above model, I want to start at the first section I listed and end at the known text string of a particular section towards the bottom.
I have listed my Python script below. So far it grabs the correct information; however, I do not believe it will work under all circumstances, and there is probably a more efficient way to go about this. Here are some of the issues I believe are in my script:
My script is rather static - while it appears to start at the correct header, I have pieced out the two sections separately because I do not believe my for loop is working the way it should (##Section 2 should not be needed if the loop were written correctly).
Because my for loop is likely not doing what I think it is (I'd like it to iterate through the sections), I never defined the stopping point (the text string of the section I wish to stop at).
Since I am not convinced the loop is working correctly, I do not believe this will handle any curveballs the site throws at me - for example, a variable number of items in a list, or an additional section added between the defined "beginning" and "ending" sections.
I believe what needs to happen is:
Libraries need to be imported
Locate first section
Find next sibling
Keep finding siblings and returning text until the stop string matches
Python:
##Scrape
# import the BeautifulSoup and requests libraries
from bs4 import BeautifulSoup, Tag
import requests

soup = BeautifulSoup(open("mock.html"), "html.parser")  # BeautifulSoup(page.read())

# Begin by grabbing the section headings
stuff = soup.find_all(class_="panel-heading")

# Search for the first section title text string
next_elem = soup.find(text="First Section Title - Known Variable or String").findNext('li').contents[0]

# Attempt to scan the remainder of the section, starting with the next line item
next_next = next_elem.parent.find_next_sibling()
for item in next_next.findAll('li', 'dt', 'dd'):
    if isinstance(item, Tag):
        print(item.text)
print(next_elem)
print(next_next.text)

##Section 2 - I'd like to cut this out
s2_elem = soup.find(text="Another Section Title").findNext('li').contents[0]
s2_nxnx = s2_elem.parent.find_next_sibling()
s2_nxnxnx = s2_nxnx.parent.find_next_sibling()
print(s2_elem)
print(s2_nxnx.text)
print(s2_nxnxnx.text)
You could use a variable to spot when you are between search_start and search_end:
from bs4 import BeautifulSoup, Tag

search_start = "First Section Title - Known Variable or String"
search_end = "Section Title (String) I Wish To Stop At - Known Variable or String"

soup = BeautifulSoup(open("mock.html"), "html.parser")

start = False
for el in soup.find_all(['li', 'dt', 'dd', 'h3']):
    if el.name == 'h3':
        if el.text == search_start:
            start = True
        elif el.text == search_end:
            break
    elif start and isinstance(el, Tag):
        print(el.text)
This would give you the following output:
Item1
Item2
Empty LI Tags Also Exist
Title of some description list
Another item may exist here
Item1
Another Description List Title
Another item may exist here
And here
And Here
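If you need the items as data rather than printed output, the same flag logic wraps naturally into a function; a sketch (section_items is just an illustrative name):
def section_items(soup, start_title, end_title):
    # collect the text of li/dt/dd elements between two h3 section titles
    items, started = [], False
    for el in soup.find_all(['li', 'dt', 'dd', 'h3']):
        if el.name == 'h3':
            if el.text == start_title:
                started = True
            elif el.text == end_title:
                break
        elif started:
            items.append(el.text)
    return items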

Python/Selenium web scraping

for link in data_links:
    driver.get(link)
    review_dict = {}
    # get the size of company
    size = driver.find_element_by_xpath('//*[@id="EmpBasicInfo"]//span')
    # location = ??? need to get this part as well.
My concern: I am using Selenium/Python to scrape the "501 to 1000 employees" and "Biotech & Pharmaceuticals" strings from the spans, but I am not able to extract the text from the elements using XPath. I have tried getText, get_attribute, everything. Please help!
For each iteration I get the element back, but not its text value.
Thank you in advance!
It seems you want only the text rather than to interact with the elements. One solution is to let BeautifulSoup parse the HTML for you, with Selenium supplying the JavaScript-built page source: first grab the HTML with html = driver.page_source, and then you can do something like:
from bs4 import BeautifulSoup

html = '''
<div id="CompanyContainer">
    <div id="EmpBasicInfo">
        <div class="">
            <div class="infoEntity"></div>
            <div class="infoEntity">
                <label>Industry</label>
                <span class="value">Woodcliff</span>
            </div>
            <div class="infoEntity">
                <label>Size</label>
                <span class="value">501 to 1000 employees</span>
            </div>
        </div>
    </div>
</div>
'''  # Just a sample, since I don't have the actual page to interact with.
soup = BeautifulSoup(html, 'html.parser')
>>> soup.find("div", {"id":"EmpBasicInfo"}).findAll("div", {"class":"infoEntity"})[2].find("span").text
'501 to 1000 employees'
Or, of course, avoiding specific indexing and looking for the <label>Size</label>, it should be more readable:
>>> [a.span.text for a in soup.findAll("div", {"class":"infoEntity"}) if (a.label and a.label.text == 'Size')]
['501 to 1000 employees']
Using selenium you can do:
>>> driver.find_element_by_xpath("//*[@id='EmpBasicInfo']/div[1]/div/div[3]/span").text
'501 to 1000 employees'
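The index-based path above breaks if the page ever reorders its fields; anchoring on the Size label, as the BeautifulSoup version does, is more robust (a sketch, assuming the same markup):
>>> driver.find_element_by_xpath("//div[@class='infoEntity'][label[text()='Size']]/span").text
'501 to 1000 employees'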

Finding an href link using Python, Selenium, and XPath

I want to get the href of an article using an XPath expression.
I want to use the text from the <h1> tag ('Cable Stripe Knit L/S Polo') and simultaneously the text from the <p> tag ('White') to find the right href.
Note: there can be more colors of one item (more articles with different <p> tags but the same <h1> text)!
HTML source
<article>
    <div class="inner-article">
        <a href="/shop/tops-sweaters/ix4leuczr/a1ykz7f2b" style="height:150px;">
        </a>
        <h1>
            <a href="/shop/tops-sweaters/ix4leuczr/a1ykz7f2b" class="name-link">Cable Stripe Knit L/S Polo
            </a>
        </h1>
        <p>
            White
        </p>
    </div>
</article>
I've tried this code, but it didn't work.
specificProductColor = driver.find_element_by_xpath("//div[@class='inner-article' and contains(text(), 'White') and contains(text(), 'Cable')]/p")
driver.get(specificProductColor.get_attribute("href"))
As per the HTML source, the XPath expression to get the href tags would be something like this:
specificProductColors = driver.find_elements_by_xpath("//div[@class='inner-article']//a[contains(text(), 'White') or contains(text(), 'Cable')]")
specificProductColors[0].get_attribute("href")
specificProductColors[1].get_attribute("href")
Since there are two hyperlink tags, you should be using find_elements_by_xpath which returns a list of elements. In this case it would return two hyperlink tags, and you could get their href using the get_attribute method.
I've got working code. It's not the fastest - this part takes approximately 550 ms - but it works. If someone could simplify it, I'd be very thankful :)
It takes all products matching the specified keyword (Cable) from the product page, and all products matching the specified color (White) as well. It compares the href links and matches the wanted product with the wanted color.
I also want to simplify the loop - stop both for loops as soon as the links match (see the sketch after the code below).
specificProduct = driver.find_elements_by_xpath("//div[@class='inner-article']//*[contains(text(), '" + productKeyword[arrayCount] + "')]")
specificProductColor = driver.find_elements_by_xpath("//div[@class='inner-article']//*[contains(text(), '" + desiredColor[arrayCount] + "')]")
for i in specificProductColor:
    specProductColor = i.get_attribute("href")
    for i in specificProduct:
        specProduct = i.get_attribute("href")
        if specProductColor == specProduct:
            print(specProduct)
            wantedProduct = specProduct
driver.get(wantedProduct)
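One way to drop both loops entirely (a sketch, assuming an href uniquely identifies an article): collect each list's links into a set and intersect them:
colorLinks = {el.get_attribute("href") for el in specificProductColor}
productLinks = {el.get_attribute("href") for el in specificProduct}
matches = colorLinks & productLinks  # hrefs that matched both keyword and color
if matches:
    wantedProduct = next(iter(matches))
    driver.get(wantedProduct)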
