Not getting the entire <li> line using BeautifulSoup - python

I am using BeautifulSoup to extract the list items under the class "secondary-nav-main-links" from the https://www.champlain.edu/current-students web page. I thought my working code below would print each entire "li" element on a single line, but the closing "/li" tag ends up on its own line. I included screen captures of the current output and the intended output. Any ideas? Thanks!!
from urllib.request import urlopen
from bs4 import BeautifulSoup
html = urlopen('https://www.champlain.edu/current-students')
bs = BeautifulSoup(html.read(), 'html.parser')
soup = bs.find(class_='secondary-nav secondary-nav-sm has-callouts')
for div in soup.find_all('li'):
    print(div)
Current output: [screenshot capture1]
Intended output: [screenshot capture2]

You can remove the newline characters with str.replace:
str(div).replace('\n','')
And you can unescape HTML entities such as &amp; with html.unescape. To replace &amp; with &, add this to the print statement:
import html
html.unescape(str(div))
So your code becomes
from urllib.request import urlopen
from bs4 import BeautifulSoup
import html
# Use a different name for the response so it does not shadow the html module
page = urlopen('https://www.champlain.edu/current-students')
bs = BeautifulSoup(page.read(), 'html.parser')
soup = bs.find(class_='secondary-nav secondary-nav-sm has-callouts')
for div in soup.find_all('li'):
    print(html.unescape(str(div).replace('\n','')))
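If the list items are indented in the source, replacing only '\n' leaves the leading spaces behind. A small variation (my own suggestion, not from the original answer) that collapses each newline together with the indentation that follows it, using the standard re module:
import re
for div in soup.find_all('li'):
    # Remove each newline and any whitespace immediately after it
    print(re.sub(r'\n\s*', '', html.unescape(str(div))))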

Related

How to get the text in <script>

A while ago I used the following code to get window._sharedData; but now the same code returns nothing. What should I do?
If I change script to div it works, but what I need is script.
code.py
from bs4 import BeautifulSoup
html1 = '<h1><script>window._sharedData;</script></h1>'
soup = BeautifulSoup(html1)
print(soup.find('script').text)
Add html.parser or lxml as the parser and call .string instead of .text:
from bs4 import BeautifulSoup
html = '<h1><script>window._sharedData;</script></h1>'
soup = BeautifulSoup(html, 'html.parser')
print(soup.find('script').string)
You should use BeautifulSoup(html1, 'lxml') instead of BeautifulSoup(html1). If the output is empty, use .string instead of .text. You can try it:
from bs4 import BeautifulSoup
html1 = '<h1><script>window._sharedData;</script></h1>'
soup = BeautifulSoup(html1, 'lxml')
print(soup.find('script').text)
or
print(soup.find('script').string)
Output will be:
window._sharedData;
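If you are wondering why .text can come back empty at all: in recent bs4 releases (4.9+), the contents of <script> and <style> tags are stored as special Script/Stylesheet strings that get_text() and .text skip, while .string returns the tag's single child node directly. A quick check (the version-dependent behavior is my reading of the bs4 changelog, not something stated in the answers above):
from bs4 import BeautifulSoup
soup = BeautifulSoup('<h1><script>window._sharedData;</script></h1>', 'html.parser')
tag = soup.find('script')
print(repr(tag.text))    # may be '' on bs4 >= 4.9
print(tag.string)        # window._sharedData;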

I am trying to use Xpath to retrieve the script from a TV show, but it is returning an empty list

from lxml import html
import requests
page = requests.get('http://officequotes.net/no1-01.php')
tree = html.fromstring(page.content)
complete_script = tree.xpath('/html/body/table/tbody/tr[2]/td[2]')
print(complete_script)
I expected the entire (TV show) script to be displayed, but all I am getting is an empty list.
Browsers insert <tbody> into the rendered DOM, but it is usually absent from the raw HTML that requests downloads, so an XPath that includes tbody matches nothing. You can skip the tbody and scrape the table directly:
from lxml import html
import requests
page = requests.get('http://officequotes.net/no1-01.php')
tree = html.fromstring(page.content)
complete_script = tree.xpath('//table/tr[2]/td[2]//text()')
# Strip whitespace from each extracted text node
results = [esc.strip() for esc in complete_script]
# Drop empty strings and stray &nbsp entities
remove = {'', '&nbsp'}
results = [rem for rem in results if rem not in remove]
print(results)
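To turn the cleaned pieces back into readable dialogue, they can be joined line by line (a small follow-up, not part of the original answer):
print('\n'.join(results))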
But I would prefer BeautifulSoup, which extracts the same thing more easily:
from bs4 import BeautifulSoup
import requests
page = requests.get('http://officequotes.net/no1-01.php')
soup = BeautifulSoup(page.content,'lxml')
complete_script = soup.select('table > tr > td')[2].get_text()
print(complete_script)
I would use bs4 4.7.1 with nth-of-type to get the right td, then stripped_strings to loop over and print out.
Edit: Looking at #johnsnow06's answer, and wondering why my get_text output was less well formatted, I discovered it is because I used lxml rather than html.parser. So my code below could be
print(soup.select_one('td:nth-child(2)').get_text())
provided the parser is 'html.parser'. The &nbsp entities are then removed, as is the need for the loop.
import requests
from bs4 import BeautifulSoup as bs
r = requests.get('http://officequotes.net/no1-01.php')
soup = bs(r.content, 'lxml')
for i in soup.select_one('td:nth-child(2)').stripped_strings:
    print(i.replace('&nbsp', ' '))
With other versions of bs4 you could use
lines = soup.select('td')[2]
for line in lines.stripped_strings:
    print(line.replace('&nbsp', ' '))
With xpath you want something more like:
from lxml import html
import requests
page = requests.get('http://officequotes.net/no1-01.php')
tree = html.fromstring(page.content)
complete_script = tree.xpath('*//tr[2]/td[2]//text()')
for item in complete_script:
    print(item.replace('&nbsp', ' '))

Python Beautiful Soup Parse First Href in Each Div

Given the following code:
# import the module
import bs4 as bs
import urllib.request
import re
masterURL = 'http://www.metrolyrics.com/top100.html'
sauce = urllib.request.urlopen(masterURL).read()
soup = bs.BeautifulSoup(sauce,'lxml')
for div in soup.findAll('ul', {'class': 'song-list'}):
    for span in div:
        for link in span:
            for a in link:
                print(a)
I can parse multiple divs, and I get a result as follows:
My question is: instead of getting the full contents of the div, how can I return only the highlighted portion, the URL of the href?
Try this. You need to specify the right class to fetch the urls connected to it.
from bs4 import BeautifulSoup
import urllib.request
masterURL = 'http://www.metrolyrics.com/top100.html'
sauce = urllib.request.urlopen(masterURL).read()
soup = BeautifulSoup(sauce,'lxml')
for div in soup.find_all(class_='subtitle'):
    print(div.get("href"))
Output:
http://www.metrolyrics.com/charles-goose-lyrics.html
http://www.metrolyrics.com/param-singh-lyrics.html
http://www.metrolyrics.com/westlife-lyrics.html
http://www.metrolyrics.com/luis-fonsi-lyrics.html
http://www.metrolyrics.com/grease-lyrics.html
http://www.metrolyrics.com/shanti-dope-lyrics.html
and so on ---
if 'href' in a.attrs:
    print(a.attrs['href'])
This will give you what you need.
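Putting it together (a sketch of my own, not from either answer): the nested loops can be replaced with a single CSS selector that matches only anchors carrying an href inside the song list:
import bs4 as bs
import urllib.request
sauce = urllib.request.urlopen('http://www.metrolyrics.com/top100.html').read()
soup = bs.BeautifulSoup(sauce, 'lxml')
# ul.song-list a[href] selects every <a> that actually has an href attribute
for a in soup.select('ul.song-list a[href]'):
    print(a['href'])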

BeautifulSoup returns empty list

I am new to scraping with Python. I am using BeautifulSoup to extract quotes from a website, and here's my code:
#!/usr/bin/python3
from urllib.request import urlopen
from bs4 import BeautifulSoup
r = urlopen("http://quotes.toscrape.com/tag/inspirational/")
bsObj = BeautifulSoup(r, "lxml")
links = bsObj.find_all("div", {"class:" "quote"})
print(links)
It returns:
[]
But when I try this:
for link in links:
    print(link)
It returns nothing.
(Note: this happens to me on every website.)
Edit: the purpose of the code above is just to return a Tag, not the text (the quote).
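The likely culprit is the misplaced colon: {"class:" "quote"} is a set containing the single string "class:quote" (adjacent Python string literals concatenate), not a dict mapping "class" to "quote", so the filter matches nothing. A minimal corrected sketch:
#!/usr/bin/python3
from urllib.request import urlopen
from bs4 import BeautifulSoup
r = urlopen("http://quotes.toscrape.com/tag/inspirational/")
bsObj = BeautifulSoup(r, "lxml")
# The colon belongs to the dict syntax, not the key string;
# class_="quote" would work equally well.
links = bsObj.find_all("div", {"class": "quote"})
print(links)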

Crawling a news website and getting the news content

I'm trying to download the text from a news website. The HTML is:
<div class="pane-content">
<div class="field field-type-text field-field-noticia-bajada">
<div class="field-items">
<div class="field-item odd">
<p>"My Text" target="_blank">www.injuv.cl</a></strong></p> </div>
The output should be: My Text
I'm using the following python code:
try:
from BeautifulSoup import BeautifulSoup
except ImportError:
from bs4 import BeautifulSoup
html = "My URL"
parsed_html = BeautifulSoup(html)
p = parsed_html.find("div", attrs={'class':'pane-content'})
print(p)
But the output of the code is: None. Do you know what is wrong with my code?
The problem is that you are not parsing the HTML, you are parsing the URL string:
html = "My URL"
parsed_html = BeautifulSoup(html)
Instead, you need to get/retrieve/download the source first, example in Python 2:
from urllib2 import urlopen
html = urlopen("My URL")
parsed_html = BeautifulSoup(html, "html.parser")
In Python 3, it would be:
from urllib.request import urlopen
html = urlopen("My URL")
parsed_html = BeautifulSoup(html, "html.parser")
Or, you can use the third-party "for humans"-style requests library:
import requests
html = requests.get("My URL").content
parsed_html = BeautifulSoup(html, "html.parser")
Also note that you should not be using BeautifulSoup version 3 at all - it is not maintained anymore. Replace:
try:
from BeautifulSoup import BeautifulSoup
except ImportError:
from bs4 import BeautifulSoup
with just:
from bs4 import BeautifulSoup
BeautifulSoup accepts a string of HTML. You need to retrieve the HTML from the page using the URL.
Check out urllib for making HTTP requests. (Or requests for an even simpler way.) Retrieve the HTML and pass that to BeautifulSoup like so:
import urllib
from bs4 import BeautifulSoup
# Get the HTML (Python 2; in Python 3 use urllib.request.urlopen)
conn = urllib.urlopen("http://www.example.com")
html = conn.read()
# Give BeautifulSoup the HTML:
soup = BeautifulSoup(html, "html.parser")
From here, just parse as you attempted previously.
p = soup.find("div", attrs={'class':'pane-content'})
print(p)
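Once the div is actually found, the visible text can be pulled out with get_text (a small follow-up sketch, not part of the original answer):
p = soup.find("div", attrs={'class': 'pane-content'})
if p is not None:
    # strip=True trims surrounding whitespace from each text fragment
    print(p.get_text(strip=True))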
