I'm trying to use BeautifulSoup to strip the HTML tags from content returned by exchangelib. What I have so far is this:
from exchangelib import Credentials, Account
import urllib3
from bs4 import BeautifulSoup
credentials = Credentials('myemail@notreal.com', 'topSecret')
account = Account('myemail@notreal.com', credentials=credentials, autodiscover=True)
for item in account.inbox.all().order_by('-datetime_received')[:1]:
    soup = BeautifulSoup(item.unique_body, 'html.parser')
    print(soup)
As is, this uses exchangelib to grab the most recent email from my inbox via Exchange and prints its unique_body, which contains the body text of the email. Here is a sample of the output from print(soup):
<html><body><div>
<div><span lang="en-US">
<div style="margin:0;"><font face="Calibri,sans-serif" size="2"><span style="font-size:11pt;">Hey John,</span></font></div>
<div style="margin:0;"><font face="Calibri,sans-serif" size="2"><span style="font-size:11pt;"> </span></font></div>
<div style="margin:0;"><font face="Calibri,sans-serif" size="2"><span style="font-size:11pt;">Here is a test email</span></font></div>
</span></div>
</div>
</body></html>
My end goal is to have it print:
Hey John,
Here is a test email
From what I'm reading in the BeautifulSoup documentation, the scraping step belongs between my "soup =" line and the final print line.
My issue is that the examples I've found search by a tag and a class, such as name_box = soup.find('h1', attrs={'class': 'name'}), but my HTML has neither h1 tags nor classes.
As someone who is new to Python, how should I go about doing this?
You can use find_all to get every font tag and then iterate over the results, printing the text of each one.
from bs4 import BeautifulSoup
html="""<html><body><div>
<div><span lang="en-US">
<div style="margin:0;"><font face="Calibri,sans-serif" size="2"><span style="font-size:11pt;">Hey John,</span></font></div>
<div style="margin:0;"><font face="Calibri,sans-serif" size="2"><span style="font-size:11pt;"> </span></font></div>
<div style="margin:0;"><font face="Calibri,sans-serif" size="2"><span style="font-size:11pt;">Here is a test email</span></font></div>
</span></div>
</div>
</body></html>"""
soup = BeautifulSoup(html, "html.parser")
for font in soup.find_all('font'):
    print(font.text)
Output:
Hey John,
Here is a test email
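If the goal is only the visible text, a tag-agnostic alternative is to walk every text node and skip the empty ones. The following is a minimal sketch, not part of the answer above; it assumes a shortened copy of the sample HTML and uses BeautifulSoup's stripped_strings generator, so it does not depend on the font tags being present.

```python
from bs4 import BeautifulSoup

# Shortened copy of the sample email body from the question.
html = """<html><body><div>
<div style="margin:0;"><font face="Calibri,sans-serif" size="2"><span style="font-size:11pt;">Hey John,</span></font></div>
<div style="margin:0;"><font face="Calibri,sans-serif" size="2"><span style="font-size:11pt;"> </span></font></div>
<div style="margin:0;"><font face="Calibri,sans-serif" size="2"><span style="font-size:11pt;">Here is a test email</span></font></div>
</div>
</body></html>"""

soup = BeautifulSoup(html, "html.parser")

# stripped_strings yields each text node with surrounding whitespace removed
# and skips nodes that are whitespace only (such as the spacer line).
lines = list(soup.stripped_strings)

for line in lines:
    print(line)
```

In the original script, the same idea would apply to the soup built from item.unique_body.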
You need to print the content of the font tags. You can use the select method and pass it a type selector for the font element.
from bs4 import BeautifulSoup as bs
html = '''
<html><body><div>
<div><span lang="en-US">
<div style="margin:0;"><font face="Calibri,sans-serif" size="2"><span style="font-size:11pt;">Hey John,</span></font></div>
<div style="margin:0;"><font face="Calibri,sans-serif" size="2"><span style="font-size:11pt;"> </span></font></div>
<div style="margin:0;"><font face="Calibri,sans-serif" size="2"><span style="font-size:11pt;">Here is a test email</span></font></div>
</span></div>
</div>
</body></html>
'''
soup = bs(html, 'lxml')
textStuff = [item.text for item in soup.select('font') if item.text != ' ']
print(textStuff)
Related
I want to get all the information within <span> tags within <p> tags within the <div> tag using Python and BeautifulSoup. I am looking for the information within the 'Data that I want to read' <span>.
<body>
<div id='output'>
<p style="overflow-wrap: break-word">CONNECTED</p>
<p style="overflow-wrap: break-word">SENT</p>
<p style="overflow-wrap: break-word">
<span style="color: blue">
Data that I want to read
<span/>
</p>
<div/>
<body/>
I have the following, which finds the text within the <div> tags and nothing else.
# Pass headers by keyword; the second positional argument of requests.get is params.
req = requests.get(url, headers=headers)
soup = BeautifulSoup(req.content, 'html.parser')
websiteData = soup.find_all("div")
for someData in websiteData:
    childElement = someData.findChildren("p", recursive=True)
    for child in childElement:
        childElementofChildElement = child.findChildren("span", recursive=True)
        for grandchild in childElementofChildElement:
            print(grandchild)
You can use CSS selector for the task:
from bs4 import BeautifulSoup
html_doc = '''\
<body>
<div id='output'>
<p style="overflow-wrap: break-word">CONNECTED</p>
<p style="overflow-wrap: break-word">SENT</p>
<p style="overflow-wrap: break-word">
<span style="color: blue">
Data that I want to read
</span>
</p>
<div/>
<body/>'''
soup = BeautifulSoup(html_doc, 'html.parser')
for t in soup.select('div#output p span'):
    print(t.text.strip())
Prints:
Data that I want to read
CSS selector div#output p span means select all <span> tags that are under <p> tag and the <p> tag should be under <div> tag with id="output".
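To make the meaning of "under" concrete, here is a small sketch comparing the descendant combinator (a space) with the child combinator (>). The markup is hypothetical and made up for this illustration; only select() and get_text() from BeautifulSoup are used.

```python
from bs4 import BeautifulSoup

# Hypothetical markup, made up for this illustration.
html_doc = """
<div id="output">
  <p>CONNECTED</p>
  <p><span>direct</span> <em><span>nested</span></em></p>
</div>
<div id="other"><p><span>elsewhere</span></p></div>
"""
soup = BeautifulSoup(html_doc, "html.parser")

# A space matches spans at any depth below a <p> inside div#output.
descendants = [s.get_text(strip=True) for s in soup.select("div#output p span")]

# ">" only matches spans whose parent is a <p> that is itself a direct child of div#output.
children = [s.get_text(strip=True) for s in soup.select("div#output > p > span")]

print(descendants)
print(children)
```

The span inside div#other is excluded in both cases because the selector is anchored to the id output.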
I need to get hrefs from <a> tags on a website, but only the ones inside the <span>s located in <div>s with the class arm.
<html>
<body>
<div class="arm">
<span>
link
link
link
</span>
</div>
<div class="arm">
<span>
link
link
link
</span>
</div>
<div class="arm">
<span>
link
link
link
</span>
</div>
<div class="footnote">
<span>
anotherLink
anotherLink
anotherLink
</span>
</div>
</body>
</html>
import requests
from bs4 import BeautifulSoup as bs
request = requests.get("url")
html = bs(request.content, 'html.parser')
for arm in html.select(".arm"):
    anchor = arm.select("span > a")
    print("anchor['href']")
But my code doesn't print anything
Your code looks fine until you get to the print("anchor['href']") line which I assume is meant to be print(anchor['href']).
Now, anchor is a ResultSet (a list of tags), so you need another loop to get the hrefs. Here is how those final lines should look with minimal changes to your code:
for arm in html.select(".arm"):
    anchor = arm.select("span > a")
    for x in anchor:
        print(x.attrs['href'])
We basically add:
    for x in anchor:
        print(x.attrs['href'])
And you should get the hrefs. All the best.
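The point about ResultSet can be checked in isolation. The sketch below uses hypothetical hrefs (/one, /two, /skip), since the links in the question's sample were not preserved; it shows that indexing the result of select() with a string fails, while iterating over it works.

```python
from bs4 import BeautifulSoup

# Hypothetical markup; the href values are placeholders for illustration.
html = """
<div class="arm"><span><a href="/one">link</a> <a href="/two">link</a></span></div>
<div class="footnote"><span><a href="/skip">anotherLink</a></span></div>
"""
soup = BeautifulSoup(html, "html.parser")

# select() returns a list-like ResultSet, even when there is a single match.
anchors = soup.select(".arm span > a")

try:
    anchors["href"]  # a list cannot be indexed by an attribute name
    indexing_failed = False
except TypeError:
    indexing_failed = True

# Read the attribute from each tag instead.
hrefs = [a["href"] for a in anchors]
print(hrefs)
```

Combining the two selectors into ".arm span > a" is a design choice here; it avoids the nested loop but is otherwise equivalent for this markup.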
Try using the find_all() method to obtain the values in tags with a specific class.
I have replicated your HTML file and obtained the values in the span tag. Please see my sample code below.
Replicated HTML file:
# Creating the HTML file
file_html = open("demo.html", "w")
# Adding the input data to the HTML file
file_html.write('''<html>
<body>
<div class="arm">
<span>
link
link
link
</span>
</div>
<div class="arm">
<span>
link
link
link
</span>
</div>
<div class="arm">
<span>
link
link
link
</span>
</div>
<div class="footnote">
<span>
anotherLink
anotherLink
anotherLink
</span>
</div>
</body>
</html>''')
# Saving the data into the HTML file
file_html.close()
code:
import requests
from bs4 import BeautifulSoup as bs
# Reading the replicated HTML file
demo = open("demo.html", "r")
results = bs(demo, 'html.parser')
# Using the find_all method to find specific tags and classes
job_elements = results.find_all("div", class_="arm")
for job_element in job_elements:
    links = job_element.find_all("a")
    for link in links:
        print(link['href'])
reference:
https://realpython.com/beautiful-soup-web-scraper-python/
I need to get the individual URL for each country from the "a href=" inside each "div" with the class "well span4". For example, I need to get https://www.rulac.org/browse/countries/myanmar and https://www.rulac.org/browse/countries/the-netherlands and every other URL after "a href=" (as shown in the partial HTML structure below).
Since the "a href=" is not under any class, how do I search for and get all the country URLs?
import pandas as pd
import numpy as np
import requests
from bs4 import BeautifulSoup
url = "https://www.rulac.org/browse/countries/P36"
resp = requests.get(url)
soup = BeautifulSoup(resp.content, 'html.parser')
res = soup.find_all("div", class_="well span4")
# Partial html structure shown as below
[<div class="well span4">
<a href="https://www.rulac.org/browse/countries/myanmar">
<div class="map-wrap">
<img alt="Myanmar" src="https://maps.googleapis.com/maps/api/staticmap?size=700x700&zoom=5&center=19.7633057,96.07851040000003&format=png&style=feature:administrative.locality%7Celement:all%7Cvisibility:off&style=feature:water%7Celement:all%7Chue:0xEDF9FF%7Clightness:80%7Csaturation:9&style=feature:road%7Celement:all%7Cvisibility:off&style=feature:landscape%7Celement:all%7Chue:0xE0EADC&key=AIzaSyBN1vexCTXoQaavAWZULZTwnIAWoYtAvwU" title="Myanmar"/>
<img class="marker" src="https://www.rulac.org/assets/images/templates/marker-country.png"/>
</div>
</a>
<h2>Myanmar</h2>
<a class="btn" href="https://www.rulac.org/browse/countries/myanmar">Read on <i class="icon-caret-right"></i></a>
</div>,
<div class="well span4">
<a href="https://www.rulac.org/browse/countries/the-netherlands">
<div class="map-wrap">
<img alt="Netherlands" src="https://maps.googleapis.com/maps/api/staticmap?size=700x700&zoom=5&center=52.203566364441,5.7275408506393&format=png&style=feature:administrative.locality%7Celement:all%7Cvisibility:off&style=feature:water%7Celement:all%7Chue:0xEDF9FF%7Clightness:80%7Csaturation:9&style=feature:road%7Celement:all%7Cvisibility:off&style=feature:landscape%7Celement:all%7Chue:0xE0EADC&key=AIzaSyBN1vexCTXoQaavAWZULZTwnIAWoYtAvwU" title="Netherlands"/>
<img class="marker" src="https://www.rulac.org/assets/images/templates/marker-country.png"/>
</div>
</a>
<h2>Netherlands</h2>
<a class="btn" href="https://www.rulac.org/browse/countries/the-netherlands">Read on <i class="icon-caret-right"></i></a>
</div>,
<div class="well span4">
<a href="https://www.rulac.org/browse/countries/niger">
<div class="map-wrap">
<img alt="Niger" src="https://maps.googleapis.com/maps/api/staticmap?size=700x700&zoom=5&center=13.5115963,2.1253854000000274&format=png&style=feature:administrative.locality%7Celement:all%7Cvisibility:off&style=feature:water%7Celement:all%7Chue:0xEDF9FF%7Clightness:80%7Csaturation:9&style=feature:road%7Celement:all%7Cvisibility:off&style=feature:landscape%7Celement:all%7Chue:0xE0EADC&key=AIzaSyBN1vexCTXoQaavAWZULZTwnIAWoYtAvwU" title="Niger"/>
<img class="marker" src="https://www.rulac.org/assets/images/templates/marker-country.png"/>
</div>
</a>
<h2>Niger</h2>
<a class="btn" href="https://www.rulac.org/browse/countries/niger">Read on <i class="icon-caret-right"></i></a>
</div>,
You can use soup.select() with a CSS selector to get all <a> elements of class btn that are children of <div>s with classes well and span4. Like this:
import requests
from bs4 import BeautifulSoup
url = "https://www.rulac.org/browse/countries/P36"
resp = requests.get(url)
soup = BeautifulSoup(resp.content, 'html.parser')
res = soup.select("div.well.span4 > a.btn")
# get all hrefs in a list and print it
hrefs = [el['href'] for el in res]
for href in hrefs:
    print(href)
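If relying on the btn class is undesirable, another option is to match every <a> that carries an href inside the div and then drop duplicates, since each country is linked twice. This is a sketch under that assumption, run against a shortened copy of the question's HTML rather than the live page; the a[href] attribute selector and dict.fromkeys for order-preserving de-duplication are choices made for this example.

```python
from bs4 import BeautifulSoup

# Shortened copy of the structure shown in the question.
html = """
<div class="well span4">
  <a href="https://www.rulac.org/browse/countries/myanmar"><div class="map-wrap"></div></a>
  <h2>Myanmar</h2>
  <a class="btn" href="https://www.rulac.org/browse/countries/myanmar">Read on</a>
</div>
<div class="well span4">
  <a href="https://www.rulac.org/browse/countries/niger"><div class="map-wrap"></div></a>
  <h2>Niger</h2>
  <a class="btn" href="https://www.rulac.org/browse/countries/niger">Read on</a>
</div>
"""
soup = BeautifulSoup(html, "html.parser")

# a[href] matches any anchor with an href attribute, with or without a class.
all_hrefs = [a["href"] for a in soup.select("div.well.span4 a[href]")]

# dict.fromkeys keeps the first occurrence of each URL, in document order.
unique_hrefs = list(dict.fromkeys(all_hrefs))

for href in unique_hrefs:
    print(href)
```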
I'm trying to use Beautiful Soup to extract the title of a job. The title attribute of the span tag is the same as its text, e.g. the text is 'Barista' and so is the title. So far I've been using .find_all, but I don't know how to make it work for this.
Sample html:
<h2 class="jobTitle jobTitle-color-purple jobTitle-newJob">
<div class="new topLeft holisticNewBlue desktop">
<span class="label">new</span>
</div>
<span title="Barista">Barista</span>
</h2>
Try something like this.
# Imports.
from bs4 import BeautifulSoup
# HTML code.
html_str = '''<h2 class="jobTitle jobTitle-color-purple jobTitle-newJob">
<div class="new topLeft holisticNewBlue desktop">
<span class="label">new</span>
</div>
<span title="Barista">Barista</span>
</h2>'''
# Parsing HTML.
soup = BeautifulSoup(html_str, 'lxml')
# Searching for `span` tags with `title` attributes.
list_html_titles = soup.find_all('span', attrs={'title': True})
# Getting titles from HTML code blocks.
list_titles = [x.text for x in list_html_titles]
You can take advantage of the recursive argument of BeautifulSoup's find() to get just the direct child of h2.
I tested the following code sample and it works:
from bs4 import BeautifulSoup
html_str = '''<h2 class="jobTitle jobTitle-color-purple jobTitle-newJob">
<div class="new topLeft holisticNewBlue desktop">
<span class="label">new</span>
</div>
<span title="Barista">Barista</span>
</h2>'''
soup = BeautifulSoup(html_str, 'lxml')
title = soup.h2.find('span', recursive=False).text
print(title)
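The same lookup can also be written as a CSS attribute selector, which additionally makes it easy to read the title attribute itself rather than the text. This is a minimal sketch using the question's sample HTML; the choice of select_one and the html.parser backend are assumptions of this example, not something the answers above require.

```python
from bs4 import BeautifulSoup

html_str = '''<h2 class="jobTitle jobTitle-color-purple jobTitle-newJob">
<div class="new topLeft holisticNewBlue desktop">
<span class="label">new</span>
</div>
<span title="Barista">Barista</span>
</h2>'''

soup = BeautifulSoup(html_str, "html.parser")

# span[title] matches only spans that have a title attribute,
# so the "new" label span is skipped.
span = soup.select_one("h2.jobTitle span[title]")

title_from_attr = span["title"]              # value of the title attribute
title_from_text = span.get_text(strip=True)  # visible text of the span

print(title_from_attr, title_from_text)
```

Note that select_one returns None when nothing matches, so real scraping code would likely check for that before indexing.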
Hi, I am trying to use Python and Beautiful Soup to scrape data from IMDb. Following the documentation online, I am able to retrieve all the data using this code:
from requests import get
from bs4 import BeautifulSoup
url = 'https://www.imdb.com/title/tt1405406/episodes?season=1&ref_=tt_eps_sn_1'
response = get(url)
html_soup = BeautifulSoup(response.text, 'html.parser')
type(html_soup)
movie_containers = html_soup.find_all('div', class_ = 'image')
print(movie_containers)
With the above code I am able to retrieve a list of all the div elements with the class image, as shown below:
<div class="image">
<a href="/title/tt1486497/" itemprop="url" title="Pilot"> <div class="hover-over-image zero-z-index" data-const="tt1486497">
<img alt="Pilot" class="zero-z-index" height="126" src="https://m.media-amazon.com/images/M/MV5BNTExMDIwNTUyNF5BMl5BanBnXkFtZTcwNzU5MDg1Mg@@._V1_UX224_CR0,0,224,126_AL_.jpg" width="224"/>
<div>S1, Ep1</div>
</div>
</a> </div>
<div class="image">
<a href="/title/tt1485650/" itemprop="url" title="The Night of the Comet"> <div class="hover-over-image zero-z-index" data-const="tt1485650">
<img alt="The Night of the Comet" class="zero-z-index" height="126" src="https://m.media-amazon.com/images/M/MV5BMjIyNDczNDYzNV5BMl5BanBnXkFtZTcwNDk1MDQ4Mg@@._V1_UX224_CR0,0,224,126_AL_.jpg" width="224"/>
<div>S1, Ep2</div>
</div>
</a> </div>
However, I am trying to get the value of the data-const attribute from the result. I want to display just the data-const values instead of the whole HTML result. Expected result: tt1486497, tt1485650
Instead, search for the inner div by the class names it uses.
from bs4 import BeautifulSoup
html = """<div class="image">
<a href="/title/tt1486497/" itemprop="url" title="Pilot"> <div class="hover-over-image zero-z-index" data-const="tt1486497">
<img alt="Pilot" class="zero-z-index" height="126" src="https://m.media-amazon.com/images/M/MV5BNTExMDIwNTUyNF5BMl5BanBnXkFtZTcwNzU5MDg1Mg@@._V1_UX224_CR0,0,224,126_AL_.jpg" width="224"/>
<div>S1, Ep1</div>
</div>
</a> </div>
<div class="image">
<a href="/title/tt1485650/" itemprop="url" title="The Night of the Comet"> <div class="hover-over-image zero-z-index" data-const="tt1485650">
<img alt="The Night of the Comet" class="zero-z-index" height="126" src="https://m.media-amazon.com/images/M/MV5BMjIyNDczNDYzNV5BMl5BanBnXkFtZTcwNDk1MDQ4Mg@@._V1_UX224_CR0,0,224,126_AL_.jpg" width="224"/>
<div>S1, Ep2</div>
</div>
</a> </div>"""
soup = BeautifulSoup(html, "lxml")
for div in soup.find_all("div", attrs={"class":"hover-over-image zero-z-index"}):
    print(div["data-const"])
Output:
tt1486497
tt1485650
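Try something along the lines of the following. Note that movie_containers is a ResultSet, which has no select method, so call select on html_soup instead:
for dc in html_soup.select('div.image div.hover-over-image'):
    print(dc['data-const'])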
output:
tt1486497
tt1485650
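To get the exact comma-separated form given in the question's expected result, the values can be collected and joined. The sketch below runs against a shortened copy of the question's HTML rather than the live IMDb page, and uses the attribute selector div[data-const] as one possible way to avoid depending on the class names.

```python
from bs4 import BeautifulSoup

# Shortened copy of the markup shown in the question.
html = """
<div class="image">
<a href="/title/tt1486497/" title="Pilot"><div class="hover-over-image zero-z-index" data-const="tt1486497"><div>S1, Ep1</div></div></a>
</div>
<div class="image">
<a href="/title/tt1485650/" title="The Night of the Comet"><div class="hover-over-image zero-z-index" data-const="tt1485650"><div>S1, Ep2</div></div></a>
</div>
"""
soup = BeautifulSoup(html, "html.parser")

# div[data-const] matches any div that carries the attribute,
# regardless of which classes it has.
ids = [div["data-const"] for div in soup.select("div.image div[data-const]")]

result = ", ".join(ids)
print(result)
```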
I recommend using requests-html. It's more intuitive than just using beautiful soup.
Example:
from requests_html import HTMLSession
url = 'https://www.imdb.com/title/tt1405406/episodes?season=1&ref_=tt_eps_sn_1'
session = HTMLSession()
response = session.get(url)
html = response.html
# requests-html uses find() with a CSS selector; data-const is on the inner div, not the <a>.
imageContainers = html.find("div.image")
dataConsts = [x.find("div.hover-over-image", first=True).attrs["data-const"] for x in imageContainers]
This should do what you need, but I couldn't test it.
Good luck!