Python Find text at end of html element - python

I need to pull the movie title and year out of the HTML below using the BeautifulSoup find() method.
The code below returns the name of the movie, but I'm unable to return only the year:
find('p').find('a').text
<div class="col-sm-6 col-lg-3">
<div class="poster-container">
<a class="poster-link" href="/title/80244680/">
<img alt="A Tale of Two Kitchens (2019)" class="poster" src="https://occ-0-37-33.1.nflxso.net/dnm/api/v6/0DW6CdE4gYtYx8iy3aj8gs9WtXE/AAAABfTGUtIG2HYlEhUbvzPHmiAyPSkDcBIhQx_Ey06KfkgaUEwELBtJsJYP71-Vsx06NTKFKWZQupZGNVE8DCo8dC0j-zpcaNCPGFiyNJKN7tonZ3gMSAM.jpg?r=397"/>
<div class="overlay d-none d-lg-block text-center">
<span class="d-block font-weight-bold small mt-3">Documentaries</span>
<span class="d-block font-weight-bold small">International Movies</span>
</div>
</a>
</div>
<p><strong>A Tale of Two Kitchens</strong><br/>2019</p>
</div>
A Tale of Two Kitchens
<br/>

my_element.contents[-1]
This will give you the last element contained inside my_element: in this case, if my_element is the <p>, this will give the text "2019" as a NavigableString. (The first child is the <strong> tag, which contains <a> and all the rest.)
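A minimal sketch of the contents[-1] approach, assuming only the <p> line from the question as it is shown (a <strong> title, a <br/>, then the year as trailing text). The variable names are illustrative.

```python
from bs4 import BeautifulSoup

# Only the <p> line from the question, exactly as shown there.
html = '<p><strong>A Tale of Two Kitchens</strong><br/>2019</p>'
soup = BeautifulSoup(html, 'html.parser')

p = soup.find('p')
# The title is the text of the <strong> child.
title = p.find('strong').get_text(strip=True)
# The year is the last child of <p>: a NavigableString after <br/>.
year = str(p.contents[-1]).strip()

print(title)
print(year)
```

If the trailing text might be missing, check that p.contents[-1] is a string rather than a tag before using it.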

Use the following code: find the <a> tag and then use next_element.
from bs4 import BeautifulSoup
html='''<div class="col-sm-6 col-lg-3">
<div class="poster-container">
<a class="poster-link" href="/title/80244680/">
<img alt="A Tale of Two Kitchens (2019)" class="poster" src="https://occ-0-37-33.1.nflxso.net/dnm/api/v6/0DW6CdE4gYtYx8iy3aj8gs9WtXE/AAAABfTGUtIG2HYlEhUbvzPHmiAyPSkDcBIhQx_Ey06KfkgaUEwELBtJsJYP71-Vsx06NTKFKWZQupZGNVE8DCo8dC0j-zpcaNCPGFiyNJKN7tonZ3gMSAM.jpg?r=397"/>
<div class="overlay d-none d-lg-block text-center">
<span class="d-block font-weight-bold small mt-3">Documentaries</span>
<span class="d-block font-weight-bold small">International Movies</span>
</div>
</a>
</div>
<p><strong>A Tale of Two Kitchens</strong><br/>2019</p>
</div>
A Tale of Two Kitchens
<br/>'''
soup=BeautifulSoup(html,'html.parser')
item=soup.select_one('.col-sm-6.col-lg-3').find_next('p')
print(item.text)
Output:
A Tale of Two Kitchens2019
item=soup.select_one('.col-sm-6.col-lg-3').find_next('p').find('a').text
print(item)
output:
A Tale of Two Kitchens
item=soup.select_one('.col-sm-6.col-lg-3').find_next('p').find('a').next_element.next_element.next_element
print(item)
output:
2019

Related

When using beautifulsoup to web scrape then save to csv, I am only receiving one row of information instead of all desired rows

Disclaimer: I am new to coding.
I assume my issue is within my for loop, but I am not sure what to change even after browsing answered questions on stackoverflow. So, here is my code with regards to my question:
csv_file = open('converter_scrape.csv', 'w')
csv_writer = csv.writer(csv_file)
csv_writer.writerow(['Category Name', 'Price'])
entries = soup.find_all('div', class_="prices")
for entry in entries:
    cat_name = entry.h3.text.strip()
    print(cat_name)
    cat_price = entry.p.text.strip()
    print(cat_price)
    csv_writer.writerow([cat_name, cat_price])
csv_file.close()
The above script produces "Small Breadloaf Cat" and "$105-$200/each" from the website. This is what I want, but there are more after this one. The for loop stops after retrieving one. I am seeking the name, and the corresponding price (Small Breadloaf Cat, Large GM Cat, Large Foreign Cat, etc). However my csv is only getting the very first category+price and not all of them.
<div class="prices">
<div class="price-list">
<div class="price ">
<a href="https://X.com/metal/small-breadloaf-cat/">
<h3>Small Breadloaf Cat</h3>
<p> $105-$200/each </p>
</a>
</div>
<div class="price ">
<a href="https://X.com/metal/large-gm-cat/">
<h3>Large GM Cat</h3>
<p> $165-$256/each </p>
</a>
</div>
<div class="price ">
<a href="https://X.com/metal/large-foreign-cat/">
<h3>Large Foreign Cat</h3>
<p> $200-$351/each </p>
</a>
</div>
<div class="price ">
<a href="https://X.com/metal/xl-foreign-cat/">
<h3>XL Foreign Cat</h3>
<p> $350-$500/each </p>
</a>
</div>
<div class="price ">
<a href="https://X.com/metal/small-gm-cat/">
<h3>Small GM Cat</h3>
<p> $85-$168/each </p>
</a>
</div>
<div class="price ">
<a href="https://X.com/metal/small-foreign-cat/">
<h3>Small Foreign Cat</h3>
<p> $108-$149/each </p>
</a>
</div>
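Try a nested lookup in the for loop:
for entry in entries:
    for price in entry.find_all('div', class_='price'):
        csv_writer.writerow([price.h3.text.strip(), price.p.text.strip()])
find_all('div', class_="prices") returns only the single outer 'prices' container, so entry.h3 and entry.p pick up just the first heading and paragraph inside it. You want one row per 'price' div, which takes a second find_all inside the loop.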
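A fuller sketch of the two-level lookup, assuming a shortened copy of the markup from the question (two of the price entries). It writes to an in-memory buffer instead of converter_scrape.csv so it can be run as-is; the buffer and rows names are illustrative.

```python
import csv
import io
from bs4 import BeautifulSoup

# Shortened copy of the markup from the question.
html = '''<div class="prices">
<div class="price-list">
<div class="price ">
<a href="https://X.com/metal/small-breadloaf-cat/">
<h3>Small Breadloaf Cat</h3>
<p> $105-$200/each </p>
</a>
</div>
<div class="price ">
<a href="https://X.com/metal/large-gm-cat/">
<h3>Large GM Cat</h3>
<p> $165-$256/each </p>
</a>
</div>
</div>
</div>'''

soup = BeautifulSoup(html, 'html.parser')

rows = []
# Outer lookup: the single "prices" container.
for entry in soup.find_all('div', class_='prices'):
    # Inner lookup: one "price" div per category.
    for price in entry.find_all('div', class_='price'):
        rows.append([price.h3.get_text(strip=True),
                     price.p.get_text(strip=True)])

# In-memory stand-in for the CSV file.
buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow(['Category Name', 'Price'])
writer.writerows(rows)
print(buffer.getvalue())
```

class_='price' matches the class token exactly, so the 'prices' and 'price-list' containers are not picked up by the inner lookup.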

selenium scrape multiple attributes within a block at the same time

I have a webpage that follows this pattern:
<a class="card cardlisting0" href="abc/def/gh.com">
<div class="contentWrapper">
<div class="card-content">
<time datetime="2020-05-31">3 hours ago</time>
</div>
</div>
</a>
<a class="card cardlisting1" href="ijk/lmn/op.com">
<div class="contentWrapper">
<div class="card-content">
<time datetime="2020-04-30">20200430</time>
</div>
</div>
</a>
...
I want to scrape the href and datetime attributes in pairs: [abc/def/gh.com, 2020-05-31], [ijk/lmn/op.com, 2020-04-30]
How can I do this?
Thank you.
You can try the following:
from bs4 import BeautifulSoup
t='''<a class="card cardlisting0" href="abc/def/gh.com">
<div class="contentWrapper">
<div class="card-content">
<time datetime="2020-05-31">3 hours ago</time>
</div>
</div>
</a>
<a class="card cardlisting1" href="ijk/lmn/op.com">
<div class="contentWrapper">
<div class="card-content">
<time datetime="2020-04-30">20200430</time>
</div>
</div>
</a>'''
soup=BeautifulSoup(t,"lxml")
aTags=soup.select('a')
data=[]
for aTag in aTags:
    timeTag=aTag.select_one('time')
    data.append([aTag.get('href'),timeTag['datetime']])
print(data)
Instead of t, you can pass the page source returned by Selenium (driver.page_source).
Output:
[['abc/def/gh.com', '2020-05-31'], ['ijk/lmn/op.com', '2020-04-30']]
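You can use find_elements() with an XPath locator and get_attribute() in Python, as follows (the older find_elements_by_xpath() helper was removed in Selenium 4.3):
from selenium.webdriver.common.by import By
# for the hrefs: match every card, not only cardlisting0
urls = [a.get_attribute('href') for a in driver.find_elements(By.XPATH, '//a[contains(@class, "card")]')]
# for the datetimes
dates = [time_element.get_attribute('datetime') for time_element in driver.find_elements(By.XPATH, '//a[contains(@class, "card")]//time')]
# pair them up
pairs = list(zip(urls, dates))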

Beautiful soup multiple Span Extract Table

I am currently working on my class assignment. I have to extract the data from the SPECS table from this webpage.
https://www.consumerreports.org/products/drip-coffee-maker/behmor-connected-alexa-enabled-temperature-control-396982/overview/
The data I need is stored as
<h2 class="crux-product-title">Specs</h2>
</div>
</div>
<div class="row">
<div class="col-xs-12">
<div class="product-model-features-specs-item">
<div class="row">
<div class='col-lg-6 col-md-6 col-sm-6 col-xs-12 product-model-features-specs-item-key'>
<span class="crux-body-copy crux-body-copy--small--bold">
Programmable
<span class="product-model-tooltip">
<span class="crux-icons crux-icons-help-information" aria-hidden="true"></span>
<span class="product-model-tooltip-window">
<span class="crux-icons crux-icons-close" aria-hidden="true"></span>
<span class="crux-body-copy crux-body-copy--small--bold">Programmable</span>
<span class="crux-body-copy crux-body-copy--small">Programmable models have a clock and can be set to brew at a specified time.</span>
</span>
</span>
</span>
</div>
<div class="col-lg-6 col-md-6 col-sm-6 col-xs-12 product-model-features-specs-item-value">
<span class='crux-body-copy crux-body-copy--small'>Yes</span>
</div>
</div>
</div>
</div>
</div>
<div class="row">
<div class="col-xs-12">
<div class="product-model-features-specs-item">
<div class="row">
<div class='col-lg-6 col-md-6 col-sm-6 col-xs-12 product-model-features-specs-item-key'>
<span class="crux-body-copy crux-body-copy--small--bold">
Thermal carafe/mug
<span class="product-model-tooltip">
<span class="crux-icons crux-icons-help-information" aria-hidden="true"></span>
<span class="product-model-tooltip-window">
<span class="crux-icons crux-icons-close" aria-hidden="true"></span>
<span class="crux-body-copy crux-body-copy--small--bold">Thermal carafe/mug</span>
<span class="crux-body-copy crux-body-copy--small">Keeps coffee warm for about four hours; thermal mugs don't hold heat as well.</span>
</span>
</span>
</span>
I need to create lists for the three span classes:
crux-body-copy crux-body-copy--small--bold
crux-body-copy crux-body-copy--small
crux-body-copy crux-body-copy--small
The table is hard to extract because it uses multiple nested spans.
I used Beautiful Soup with find_all and find, passing the span class name.
I always got only the first value.
How do I do this?
I don't know if this will work for you.
from simplified_scrapy import SimplifiedDoc,req,utils
html = ''' ''' # Your html
doc = SimplifiedDoc(html)
spans = doc.selects('span.crux-body-copy crux-body-copy--small--bold')
for span in spans:
    # print (span.firstText())
    print (span.select('span.crux-body-copy crux-body-copy--small--bold').text)
    print (span.select('span.crux-body-copy crux-body-copy--small').unescape())
Result:
Programmable
Programmable models have a clock and can be set to brew at a specified time.
Thermal carafe/mug
Keeps coffee warm for about four hours; thermal mugs don't hold heat as well.
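For comparison, here is a sketch of the same extraction with BeautifulSoup, which the question asked about. It assumes a trimmed copy of the first spec item from the question and builds three lists (names, descriptions, values); those list names are illustrative. Looping over each product-model-features-specs-item and searching inside it is what avoids getting only the first value.

```python
from bs4 import BeautifulSoup

# Trimmed copy of the first spec item from the question.
html = '''
<div class="product-model-features-specs-item">
<div class="row">
<div class='col-lg-6 col-md-6 col-sm-6 col-xs-12 product-model-features-specs-item-key'>
<span class="crux-body-copy crux-body-copy--small--bold">
Programmable
<span class="product-model-tooltip">
<span class="product-model-tooltip-window">
<span class="crux-body-copy crux-body-copy--small--bold">Programmable</span>
<span class="crux-body-copy crux-body-copy--small">Programmable models have a clock and can be set to brew at a specified time.</span>
</span>
</span>
</span>
</div>
<div class="col-lg-6 col-md-6 col-sm-6 col-xs-12 product-model-features-specs-item-value">
<span class='crux-body-copy crux-body-copy--small'>Yes</span>
</div>
</div>
</div>
'''

soup = BeautifulSoup(html, 'html.parser')
names, descriptions, values = [], [], []

for item in soup.select('div.product-model-features-specs-item'):
    key = item.select_one('.product-model-features-specs-item-key')
    value = item.select_one('.product-model-features-specs-item-value')
    # The first non-empty string in the key cell is the spec name.
    names.append(next(key.stripped_strings))
    # The non-bold "small" span inside the tooltip holds the description.
    tip = key.select_one('.product-model-tooltip-window span.crux-body-copy--small')
    descriptions.append(tip.get_text(strip=True) if tip else None)
    # The value cell holds a single span such as "Yes".
    values.append(value.get_text(strip=True) if value else None)

print(names, descriptions, values)
```

The CSS class selector matches whole class tokens, so span.crux-body-copy--small does not match the --small--bold spans.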

xpath to match the specific element based on inner html child tag text

I have an html as shown below
<div class="xtree">
<img class="dojoimg">
<span class="presentation">+</span>
<span class ="treenode">
<div class="ctreefolder">.... </div>
<div class="presentationfolder">.... </div>
<span >Setting</span>
</span>
</div>
<div class="xtree">
<img class="dojoimg">
<span class="presentation">+</span>
<span class ="treenode">
<div class="ctreefolder">.... </div>
<div class="presentationfolder">.... </div>
<span >Home</span>
</span>
</div>
<div class="xtree">
<img class="dojoimg">
<span class="presentation">+</span>
<span class ="treenode">
<div class="ctreefolder">.... </div>
<div class="presentationfolder">.... </div>
<span >products</span>
</span>
</div>
I want to click the img icon based on the text in the last span tag.
For example, I want to select the first img tag if the last span contains "Setting". Can you please help me write an XPath for this UI element to use in Selenium WebDriver with Python?
I think this XPath will help you. It selects the xtree div whose inner span contains the text, then takes that div's img:
//div[@class="xtree"][.//span[contains(text(), "Setting")]]/img[@class="dojoimg"]
Hope this concept will help you.
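One way to check an XPath of this kind without a browser is to run it against the markup with lxml. The sketch below assumes two of the xtree blocks from the question; icon_xpath is a hypothetical helper, not part of Selenium or lxml, and it matches the label exactly with normalize-space() rather than with contains().

```python
from lxml import html as lxml_html

# Two of the xtree blocks from the question.
page = '''<div class="xtree">
<img class="dojoimg">
<span class="presentation">+</span>
<span class="treenode">
<div class="ctreefolder">....</div>
<div class="presentationfolder">....</div>
<span>Setting</span>
</span>
</div>
<div class="xtree">
<img class="dojoimg">
<span class="presentation">+</span>
<span class="treenode">
<div class="ctreefolder">....</div>
<div class="presentationfolder">....</div>
<span>Home</span>
</span>
</div>'''


def icon_xpath(label):
    # Hypothetical helper: XPath for the img inside the xtree div
    # that has a span whose text equals the given label.
    return ('//div[@class="xtree"]'
            '[.//span[normalize-space(text())="%s"]]'
            '/img[@class="dojoimg"]' % label)


doc = lxml_html.fromstring(page)
icons = doc.xpath(icon_xpath("Home"))
print(len(icons))
```

In Selenium the same string could be passed to an XPath locator, for example driver.find_element(By.XPATH, icon_xpath("Setting")).click(), assuming a driver that has already loaded the page.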
Here is my solution, using find_element_by_link_text:
driver.find_element_by_link_text("Reveal").click()

BeautifulSoup parse, get information from child tag

I have the following "web-site" (here is a piece of the HTML):
<div class="moduleBody">
<div class="feature">
<div class="feature">
<h2>
<a href="somelink">sometext</a>
</h2>
<div class="relatedInfo">
<span class="relatedTopics">
<span class="timestamp">22 Mar 2014</span>
</div>
</div>
</div>
I would like to extract sometext and somelink. For this purpose, I have written the following Python code:
for links in soup.find_all('div','moduleBody'):
    for link in links.find_all('div','feature'):
        if not("video" in (link['href'])):
            print "Name: "+link.text
            #sibling_page=urllib2.urlopen("major_link"+link['href'])
            print " Link extracted: "+link['href']
However, this code prints nothing. Could you suggest where my mistake is?
Your div does not have an href attribute. You have to look one level down at the <a> element.
from bs4 import BeautifulSoup
html = """
<div class="moduleBody">
<div class="feature">
<div class="feature">
<h2>
<a href="somelink">sometext</a>
</h2>
<div class="relatedInfo">
<span class="relatedTopics">
<span class="timestamp">22 Mar 2014</span>
</div>
</div>
</div>
"""
soup = BeautifulSoup(html, "html.parser")
for links in soup.find_all("div", "moduleBody"):
    for link in links.find_all("div", "feature"):
        for a in links.find_all("a"):
            if "video" not in a['href']:
                print("Name: " + a.text)
                print("Link extracted: " + a['href'])
Prints:
Name: sometext
Link extracted: somelink
Name: sometext
Link extracted: somelink
It finds it twice, as your html is broken. BeautifulSoup fixes it as follows:
<div class="moduleBody">
<div class="feature">
<div class="feature">
<h2>
<a href="somelink">
sometext
</a>
</h2>
<div class="relatedInfo">
<span class="relatedTopics">
<span class="timestamp">
22 Mar 2014
</span>
</span>
</div>
</div>
</div>
</div>
Inside your second for loop, your link variable holds a reference to <div class="feature">...</div>, which does not have an href attribute.
It depends heavily on your structure, but if the <div class="feature"> tag always starts with an <h2> tag that contains only an <a> tag, you can get the anchor tag <a> first:
for links in soup.find_all('div','moduleBody'):
    for link in links.find_all('div','feature'):
        anchor_tag = link.h2.a
        if 'video' not in anchor_tag['href']:
            print('Name: %s' % anchor_tag.text)
            print('Link extracted: %s' % anchor_tag['href'])
By the way, your HTML is not well-formed: the first <div class="feature"> tag should be closed.
<div class="moduleBody">
<div class="feature"></div>
<div class="feature">
<h2>
<a href="somelink">sometext</a>
</h2>
<div class="relatedInfo">
<span class="relatedTopics">
<span class="timestamp">22 Mar 2014</span>
</div>
</div>
</div>
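If the nested feature divs cannot be fixed at the source, one option is to select the anchors directly so each link is reported once. This is a sketch under the assumption that every wanted link sits in an <h2> inside a feature div, as in the question's markup; the results name is illustrative.

```python
from bs4 import BeautifulSoup

# The question's markup, including the unclosed tags.
html = """
<div class="moduleBody">
<div class="feature">
<div class="feature">
<h2>
<a href="somelink">sometext</a>
</h2>
<div class="relatedInfo">
<span class="relatedTopics">
<span class="timestamp">22 Mar 2014</span>
</div>
</div>
</div>
"""

soup = BeautifulSoup(html, "html.parser")

results = []
# select() returns each matching element once, even though the anchor
# sits inside two nested "feature" divs.
for a in soup.select("div.moduleBody div.feature h2 a[href]"):
    if "video" not in a["href"]:
        results.append((a.get_text(strip=True), a["href"]))

for name, link in results:
    print("Name: " + name)
    print("Link extracted: " + link)
```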
