I am currently working on my class assignment. I have to extract the data from the SPECS table from this webpage.
https://www.consumerreports.org/products/drip-coffee-maker/behmor-connected-alexa-enabled-temperature-control-396982/overview/
The data I need is stored as
<h2 class="crux-product-title">Specs</h2>
</div>
</div>
<div class="row">
<div class="col-xs-12">
<div class="product-model-features-specs-item">
<div class="row">
<div class='col-lg-6 col-md-6 col-sm-6 col-xs-12 product-model-features-specs-item-key'>
<span class="crux-body-copy crux-body-copy--small--bold">
Programmable
<span class="product-model-tooltip">
<span class="crux-icons crux-icons-help-information" aria-hidden="true"></span>
<span class="product-model-tooltip-window">
<span class="crux-icons crux-icons-close" aria-hidden="true"></span>
<span class="crux-body-copy crux-body-copy--small--bold">Programmable</span>
<span class="crux-body-copy crux-body-copy--small">Programmable models have a clock and can be set to brew at a specified time.</span>
</span>
</span>
</span>
</div>
<div class="col-lg-6 col-md-6 col-sm-6 col-xs-12 product-model-features-specs-item-value">
<span class='crux-body-copy crux-body-copy--small'>Yes</span>
</div>
</div>
</div>
</div>
</div>
<div class="row">
<div class="col-xs-12">
<div class="product-model-features-specs-item">
<div class="row">
<div class='col-lg-6 col-md-6 col-sm-6 col-xs-12 product-model-features-specs-item-key'>
<span class="crux-body-copy crux-body-copy--small--bold">
Thermal carafe/mug
<span class="product-model-tooltip">
<span class="crux-icons crux-icons-help-information" aria-hidden="true"></span>
<span class="product-model-tooltip-window">
<span class="crux-icons crux-icons-close" aria-hidden="true"></span>
<span class="crux-body-copy crux-body-copy--small--bold">Thermal carafe/mug</span>
<span class="crux-body-copy crux-body-copy--small">Keeps coffee warm for about four hours; thermal mugs don't hold heat as well.</span>
</span>
</span>
</span>
I need to create Lists for the three span class
class="crux-body-copy crux-body-copy--small--bold
crux-body-copy crux-body-copy--small
crux-body-copy crux-body-copy--small
The problem with extracting the table is because of multiple span used in the table.
I used BEAUTIFUL SOUP and used find_all and find and used the span name to call it.
I always got the first value.
How do I do this?
I don't know if this will work for you.
from simplified_scrapy import SimplifiedDoc,req,utils
html = ''' ''' # Your html
doc = SimplifiedDoc(html)
spans = doc.selects('span.crux-body-copy crux-body-copy--small--bold')
for span in spans:
# print (span.firstText())
print (span.select('span.crux-body-copy crux-body-copy--small--bold').text)
print (span.select('span.crux-body-copy crux-body-copy--small').unescape())
Result:
Programmable
Programmable models have a clock and can be set to brew at a specified time.
Thermal carafe/mug
Keeps coffee warm for about four hours; thermal mugs don't hold heat as well.
Related
Good day, guys. I have a task to collect Name and Email for person from this site:
https://www.espeakers.com/s/nsas/search?available_on=&awards&budget=0%2C10&bureau_id=304&distance=1000&fee=false&items_per_page=3701&language=en&location=&norecord=false&nt=0&page=0&presenter_type=&q=%5B%5D&require&review=false&sort=speakername&video=false&virtual=false
I use selenium and python to scrape it, but I have a problem with accessing an url for people. The sample structure of person card is:
<div class="col-xs-12 col-sm-6 col-md-4 col-lg-3">
<div class="speaker-tile" id="sid12026">
<div class="speaker-thumb" style='background-image: url("https://streamer.espeakers.com/assets/6/12026/159445.jpg"); background-size: contain;'>
<div class="row">
<div class="col-xs-8 text-left">
</div>
<div class="col-xs-4 text-right speaker-top-actions">
<i class="fa fa-ellipsis-h fa-fw">
</i>
</div>
</div>
</div>
<div class="speaker-details">
<div class="speaker-name">
Alex Aanderud
</div>
<div class="row" style="margin-top: 15px;">
<div class="col-xs-12 col-sm-12">
<div class="speaker-location">
<i class="fa fa-map-marker mp-tertiary-background">
</i>
AZ
<span>
,
</span>
US
</div>
</div>
<div class="col-sm-6 col-xs-12">
<div class="speaker-awards">
</div>
</div>
</div>
<div class="speaker-oneline text-left">
<p>
</p>
<div>
Certified Trainer of Advanced Integrative Psychology and Certified John Maxwell Speaker, Trainer, Coach, will transform your organization and improve your results.
</div>
</div>
<div class="speaker-assets">
<div class="row">
</div>
</div>
<div class="speaker-actions">
<div class="row">
<div class="text-center col-xs-12">
<div class="btn btn-flat mp-primary btn-block">
<span class="hidden-xs hidden-sm">
View Profile
</span>
<span class="visible-xs visible-sm">
Profile
</span>
</div>
</div>
</div>
</div>
</div>
</div>
</div>
And the when you click on
<span class="hidden-xs hidden-sm">
View Profile
</span>
It moves you to page with person info where I can access it. How I can use selenium to do this, or there are others solutions that can help me.
Thanks!
If you notice, all the profile urls are of the form
https://www.espeakers.com/s/nsas/profile/id
where id is a 5 digits number such as 27397. So you just need to extract the id and concatenate it with the base url to obtain the profile url.
url = 'https://www.espeakers.com/s/nsas/profile/'
profile_urls = [url + el.get_attribute('id')[3:] for el in driver.find_elements(By.CSS_SELECTOR, '.speaker-tile')]
names = [el.text for el in driver.find_elements(By.CSS_SELECTOR, '.speaker-name')]
names is a list containing all the names, urls is a list containing the corresponding profile urls
The example im stuck with is like this
<div class="nav-links">
<div class="nav-previous">
<a href="prevlink" rel="prev">
<span class="meta-nav" aria-hidden="true">Previous </span>
<span class="screen-reader-text">Previous post:</span> <br>
<span class="post-title">
Title
</span>
</a>
</div>
<div class="nav-next">
<a href="nextlink" rel="next">
<span class="meta-nav" aria-hidden="true">Next </span> <span class="screen-reader-text">Next post:</span>
<br>
<span class="post-title">
Title
</span>
</a>
</div>
my ultimate goal is to get the value of href but all i could is bet the whole <div class ... element. im using beautifulsoup python
You can print all the values for href by finding all the links in the page
links = soup.find_all("a")
for link in links:
print(link.attrs['href'])
This is a part of HTML code from following page following page:
<div>
<div class="sidebar-labeled-information">
<span>
Economic skill:
</span>
<span>
10.646
</span>
</div>
<div class="sidebar-labeled-information">
<span>
Strength:
</span>
<span>
2336
</span>
</div>
<div class="sidebar-labeled-information">
<span>
Location:
</span>
<div>
<a href="region.html?id=454">
Little Karoo
<div class="xflagsSmall xflagsSmall-Argentina">
</div>
</a>
</div>
</div>
<div class="sidebar-labeled-information">
<span>
Citizenship:
</span>
<div>
<div class="xflagsSmall xflagsSmall-Poland">
</div>
<small>
<a href="pendingCitizenshipApplications.html">
change
</a>
</small>
</div>
</div>
</div>
I want to extract region.html?id=454 from it. I don't know how to narrow the search down to <a href="region.html?id=454">, since there are a lot of <a href=> tags.
Here is the python code:
session=session()
r = session.get('https://orange.e-sim.org/battle.html?id=5377',headers=headers,verify=False)
soup = BeautifulSoup(r.text, 'html.parser')
div = soup.find_all('div',attrs={'class':'sidebar-labeled-information'})
And the output of this code is:
[<div class="sidebar-labeled-information" id="levelMission">
<span>Level:</span> <span>15</span>
</div>, <div class="sidebar-labeled-information" id="currRankText">
<span>Rank:</span>
<span>Colonel</span>
</div>, <div class="sidebar-labeled-information">
<span>Economic skill:</span>
<span>10.646</span>
</div>, <div class="sidebar-labeled-information">
<span>Strength:</span>
<span>2336</span>
</div>, <div class="sidebar-labeled-information">
<span>Location:</span>
<div>
<a href="region.html?id=454">Little Karoo<div class="xflagsSmall xflagsSmall-Argentina"></div>
</a>
</div>
</div>, <div class="sidebar-labeled-information">
<span>Citizenship:</span>
<div>
<div class="xflagsSmall xflagsSmall-Poland"></div>
<small>change
</small>
</div>
</div>]
But my desired output is region.html?id=454.
The page which I'm trying to search in is located here, but you need to have an account to view the page.
soup = BeautifulSoup(html)
links = soup.findAll('a', href=True)
for link in links:
href = link['href']
url = urlparse(href)
if url.path == "region.html":
print (url.path + "?" + url.query)
This prints region.html?id=454
you can try using this class:
xflagsSmall
and find the parrent of that element
element=soup.find("div",{"class": "xflagsSmall"})
parent_element=element.find_parent()
link=parent_element.attrs["href"]```
You can query on base of href value:
element=soup.find("a",{"href": "region.html?id=454"})
element.attrs["href"]
I need to pull the movie title and year out of the HTML text below using the BeautifulSoup find() method.
the below returns the name of the movie, but I'm unable to return only the year
find('p').find('a').text
<div class="col-sm-6 col-lg-3">
<div class="poster-container">
<a class="poster-link" href="/title/80244680/">
<img alt="A Tale of Two Kitchens (2019)" class="poster" src="https://occ-0-37-33.1.nflxso.net/dnm/api/v6/0DW6CdE4gYtYx8iy3aj8gs9WtXE/AAAABfTGUtIG2HYlEhUbvzPHmiAyPSkDcBIhQx_Ey06KfkgaUEwELBtJsJYP71-Vsx06NTKFKWZQupZGNVE8DCo8dC0j-zpcaNCPGFiyNJKN7tonZ3gMSAM.jpg?r=397"/>
<div class="overlay d-none d-lg-block text-center">
<span class="d-block font-weight-bold small mt-3">Documentaries</span>
<span class="d-block font-weight-bold small">International Movies</span>
</div>
</a>
</div>
<p><strong>A Tale of Two Kitchens</strong><br/>2019</p>
</div>
A Tale of Two Kitchens
<br/>
my_element.contents[-1]
This will give you the last element contained inside my_element: in this case, if my_element is the <p>, this will give the text "2019" as a NavigableString. (The first child is the <strong> tag, which contains <a> and all the rest.)
Use the following code.find the <a> tag and then use next_element
from bs4 import BeautifulSoup
html='''<div class="col-sm-6 col-lg-3">
<div class="poster-container">
<a class="poster-link" href="/title/80244680/">
<img alt="A Tale of Two Kitchens (2019)" class="poster" src="https://occ-0-37-33.1.nflxso.net/dnm/api/v6/0DW6CdE4gYtYx8iy3aj8gs9WtXE/AAAABfTGUtIG2HYlEhUbvzPHmiAyPSkDcBIhQx_Ey06KfkgaUEwELBtJsJYP71-Vsx06NTKFKWZQupZGNVE8DCo8dC0j-zpcaNCPGFiyNJKN7tonZ3gMSAM.jpg?r=397"/>
<div class="overlay d-none d-lg-block text-center">
<span class="d-block font-weight-bold small mt-3">Documentaries</span>
<span class="d-block font-weight-bold small">International Movies</span>
</div>
</a>
</div>
<p><strong>A Tale of Two Kitchens</strong><br/>2019</p>
</div>
A Tale of Two Kitchens
<br/>'''
soup=BeautifulSoup(html,'html.parser')
item=soup.select_one('.col-sm-6.col-lg-3').find_next('p')
print(item.text)
Output:
A Tale of Two Kitchens2019
item=soup.select_one('.col-sm-6.col-lg-3').find_next('p').find('a').text
print(item)
output:
A Tale of Two Kitchens
item=soup.select_one('.col-sm-6.col-lg-3').find_next('p').find('a').next_element.next_element.next_element
print(item)
output:
2019
I want to get player and referee content from this site and store it in a db. At first, when I looked through it, all the players and the referees were in response.css("div.prelims p.indent::text"), and I could use regex to parse the ones with players from the ones with referees. No problem.
Then I took a harder look at the rest of the site, only to see that they DO NOT follow this structure consistently. Here is an example:
<div class="prelims">
<p class="indent">Text about players.</p>
</div>
<div class="num" id="p1">
<span class="num">1</span>
<p class="indent">Text about players.</p>
</div>
<div class="num" id="p2">
<span class="num">2</span>
<p class="indent">Text about players.</p>
</div>
<div class="num" id="p3">
<span class="num">3</span>
<p class="indent">Text about players.</p>
</div>
<div class="num" id="p4">
<span class="num">4</span>
<p class="indent">Text about players.</p>
</div>
<div class="num" id="p5">
<span class="num">5</span>
<p class="indent">Text about referee.</p>
</div>
<div class="num" id="p6">
Not only does this page have this 'num' and 'span' that the other page didn't, but my regex, which worked fine on the test page, breaks on the first p class=indent here.
What are some general principles of spider design that can make my spider more resilient against all this variability, and still be able to get the results into the right tables in my db? I am using DjangoItem, and was looking forward to a smooth pipeline into my db, but now I may have to wrangle this data to even get it into the right shape to insert. Your wisdom, insight, and experience greatly appreciated.
I think you can ignore the div tags if all the p tags that you want to capture have the indent class:
import re
text = r'''
<div class="prelims">
<p class="indent">Text about players.</p>
</div>
<div class="num" id="p1">
<span class="num">1</span>
<p class="indent">Text about players.</p>
</div>
<div class="num" id="p2">
<span class="num">2</span>
<p class="indent">Text about players.</p>
</div>
<div class="num" id="p3">
<span class="num">3</span>
<p class="indent">Text about players.</p>
</div>
<div class="num" id="p4">
<span class="num">4</span>
<p class="indent">Text about players.</p>
</div>
<div class="num" id="p5">
<span class="num">5</span>
<p class="indent">Text about referee.</p>
</div>
<div class="num" id="p6">
'''
pattern = re.compile(r"<p.*class=[\"\']indent[\"\'].*>(.+)<\/p>", re.MULTILINE)
for m in re.findall(pattern, text):
print(m)
Output:
Text about players.
Text about players.
Text about players.
Text about players.
Text about players.
Text about referee.