The following code works:
#!/usr/bin/python3
# -*- coding: utf-8 -*-
import requests
from bs4 import BeautifulSoup
news_uri = 'http://www3.nhk.or.jp/news/easy/k10011356641000/k10011356641000.html'
r = requests.get(news_uri)
r.encoding = 'utf-8'
soup = BeautifulSoup(r.text, 'html.parser')
body = soup.find('div', attrs={'id':'newsarticle'})
#body.div.unwrap()
for match in body.findAll('span'):
    match.unwrap()
for match in body.findAll('a'):
    match.unwrap()
print(str(body))
However, if you uncomment body.div.unwrap() it results in the following error:
Traceback (most recent call last):
File "test_div.py", line 13, in <module>
body.div.unwrap()
AttributeError: 'NoneType' object has no attribute 'unwrap'
I have done a test using the plain text output from:
body = soup.find('div', attrs={'id':'newsarticle'})
This then works as expected and removes the outer div. Any suggestions?
As body = soup.find('div', attrs={'id':'newsarticle'}), the body variable contains the following HTML:
<div id="newsarticle">
<p>...</p>
<p>...</p>
<p>...</p>
<p>...</p>
<p>...</p>
<p></p>
<p></p>
</div>
That means the only direct descendants of the <div> tag are <p> tags. Writing body.div tells BeautifulSoup to search for a <div> tag that is a descendant of the current <div>. Since no such tag is present, body.div evaluates to None.
Because of this, body.div.unwrap() amounts to None.unwrap(), which throws the error AttributeError: 'NoneType' object has no attribute 'unwrap'.
If you want to remove the div tag, simply use this:
body = soup.find('div', attrs={'id':'newsarticle'})
body.unwrap()
or
soup.find('div', attrs={'id':'newsarticle'}).unwrap()
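For completeness, here is a minimal, self-contained sketch of unwrap() removing an outer <div> (the HTML snippet is made up to mimic the article structure):

```python
from bs4 import BeautifulSoup

html = '<div id="newsarticle"><p><span>text</span></p></div>'
soup = BeautifulSoup(html, 'html.parser')

# find() returns the tag itself, so unwrap() can be called on it directly
body = soup.find('div', attrs={'id': 'newsarticle'})
body.unwrap()  # strips the <div> wrapper but keeps its children

print(str(soup))  # <p><span>text</span></p>
```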
Related
I'm using the following piece of code to find an attribute in a piece of HTML code:
results = soup.findAll("svg", {"data-icon" : "times"})
This works, and it returns a list with the tag and attributes. However, I would also like to move from that part of the HTML to the sibling (if that's the right term) below it, and retrieve the contents of that paragraph. See the example below.
<div class="382"><svg aria-hidden="true" data-icon="times".......</svg></div>
<div class="405"><p>Example</p></div>
I can't seem to figure out how to do this properly. Searching for the div class names does not work, because the class name is randomised.
You can use a CSS selector combining :has() with the adjacent-sibling combinator +:
from bs4 import BeautifulSoup
html_doc = """
<div class="382"><svg aria-hidden="true" data-icon="times"> ... </svg></div>
<div class="405"><p>Example</p></div>
"""
soup = BeautifulSoup(html_doc, "html.parser")
div = soup.select_one('div:has(svg[data-icon="times"]) + div')
print(div.text)
Prints:
Example
Or without CSS selector:
div = soup.find("svg", attrs={"data-icon": "times"}).find_next("div")
print(div.text)
Prints:
Example
I'm using Python 3.7. I want to locate all the elements in my HTML page that have an attribute, "data-permalink", regardless of what its value is, even if the value is empty. However, I'm confused about how to do this. I'm using the bs4 package and tried the following
soup = BeautifulSoup(html)
soup.findAll("data-permalink")
[]
soup.findAll("a")
[<a href=" ... </a>]
soup.findAll("a.data-permalink")
[]
The attribute is normally only found in anchor tags on my page, hence my unsuccessful "a.data-permalink" attempt. I would like to return the elements that contain the attribute.
Your selector is invalid:
soup.findAll("a.data-permalink")
That syntax belongs to the .select() method, and even there it would not do what you want: a.data-permalink means an <a> tag with the class data-permalink, not the attribute.
To match any tag that has the attribute, use * with select():
.select('*[data-permalink]')
or pass True as the tag name when using findAll():
.findAll(True, attrs={'data-permalink' : True})
example
from bs4 import BeautifulSoup
html = '''<a data-permalink="a">link</a>
<b>bold</b>
<i data-permalink="i">italic</i>'''
soup= BeautifulSoup(html, 'html.parser')
permalink = soup.select('*[data-permalink]')
# or
# permalink = soup.findAll(True, attrs={'data-permalink' : True})
print(permalink)
Result (note the <b> element, which has no data-permalink attribute, is skipped):
[<a data-permalink="a">link</a>, <i data-permalink="i">italic</i>]
I am a beginner struggling through a course, so this problem is probably really simple, but I am running this (admittedly messy) code (saved under file x.py) to extract a link and a name from a website with lines formatted like:
<li style="margin-top: 21px;">
Prabhjoit
</li>
So I set up this:
import urllib.request, urllib.parse, urllib.error
from bs4 import BeautifulSoup
import ssl
# Ignore SSL certificate errors
ctx = ssl.create_default_context()
ctx.check_hostname = False
ctx.verify_mode = ssl.CERT_NONE
url = input('Enter - ')
html = urllib.request.urlopen(url, context=ctx).read()
soup = BeautifulSoup(html, 'html.parser')
count = 0
for line in soup:
    if not line.startswith('<li'):
        continue
    stuff = line.split('"')
    link = stuff[3]
    thing = stuff[4].split('<')
    name = thing[0].split('>')
    count = count + 1
    if count == 18:
        break
print(name[1])
print(link)
And it keeps producing the error:
Traceback (most recent call last):
File "x.py", line 15, in <module>
if not line.startswith('<li'):
TypeError: 'NoneType' object is not callable
I have struggled with this for hours, and I would be grateful for any suggestions.
line is not a string, and it has no startswith() method. It is a BeautifulSoup Tag object, because BeautifulSoup has parsed the HTML source text into a rich object model. Don't try to treat it as text!
The error is caused because if you access any attribute on a Tag object that it doesn't know about, it does a search for a child element with that name (so here it executes line.find('startswith')), and since there is no element with that name, None is returned. None.startswith() then fails with the error you see.
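You can see this fallback behaviour directly (a small sketch with made-up HTML):

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup('<li><a href="x">link</a></li>', 'html.parser')
li = soup.li

# Unknown attribute access on a Tag falls back to a child-element search:
print(li.a)           # <a href="x">link</a> -- a real child <a> tag
print(li.startswith)  # None -- there is no <startswith> child element
```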
If you wanted to find the 18th <li> element, just ask BeautifulSoup for that specific element:
soup = BeautifulSoup(html, 'html.parser')
li_link_elements = soup.select('li a[href]', limit=18)
if len(li_link_elements) == 18:
    last = li_link_elements[-1]
    print(last.get_text())
    print(last['href'])
This uses a CSS selector to find only the <a> link elements whose parent is a <li> element and that have a href attribute. The search is limited to just 18 such tags, and the last one is printed, but only if we actually found 18 in the page.
The element text is retrieved with the Tag.get_text() method, which will include text from any nested elements (such as <span> or <strong> or other extra markup), and the href attribute is accessed using standard indexing notation.
I'm attempting to scrape a page that has a section like this:
<a name="id_631"></a>
<hr>
<div class="store-class">
<div>
<span><strong>Store City</strong></span>
</div>
<div class="store-class-content">
<p>Event listing</p>
<p>Event listing2</p>
<p>Event listing3</p>
</div>
<div>
Stuff about contact info
</div>
</div>
The page is a list of sections like that and the only way to differentiate them is by the name attribute in the <a> tag.
So I'm thinking I want to target that then go to the next_sibling to get the <hr> then again to the next sibling to get the <div class="store-class"> section. All I want is the info in that div tag.
I'm not sure how to target that <a> tag to move down two siblings though. When I try print(soup.find_all('a', {"name":"id_631"})) that just gives me what's in the tag, which is nothing.
Here's my script:
import requests
from bs4 import BeautifulSoup
r = requests.get("http://www.tandyleather.com/en/leathercraft-classes")
soup = BeautifulSoup(r.text, 'html.parser')
print(soup.find("a", id="id_631").find_next_sibling("div", class_="store-class"))
But I get the error:
Traceback (most recent call last):
File "tandy.py", line 8, in <module>
print(soup.find("a", id="id_631").find_next_sibling("div", class_="store-class"))
AttributeError: 'NoneType' object has no attribute 'find_next_sibling'
find_next_sibling() to the rescue:
soup.find("a", attrs={"name": "id_631"}).find_next_sibling("div", class_="store-class")
Also, html.parser has to be replaced with either lxml or html5lib.
See also:
Differences between parsers
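Put together, a minimal sketch of the corrected lookup, run against the snippet from the question rather than the live page:

```python
from bs4 import BeautifulSoup

html = '''
<a name="id_631"></a>
<hr>
<div class="store-class">
  <div><span><strong>Store City</strong></span></div>
  <div class="store-class-content">
    <p>Event listing</p>
  </div>
</div>
'''
soup = BeautifulSoup(html, 'html.parser')

# name is a plain attribute here, not an id, so match it via attrs
anchor = soup.find('a', attrs={'name': 'id_631'})
store = anchor.find_next_sibling('div', class_='store-class')
print(store.find('p').get_text())  # Event listing
```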
I am trying to scrape a website with BeautifulSoup but am having a problem.
I was following a tutorial written for Python 2.7, which had exactly the same code and ran with no problems.
import urllib.request
from bs4 import *
htmlfile = urllib.request.urlopen("http://en.wikipedia.org/wiki/Steve_Jobs")
htmltext = htmlfile.read()
soup = BeautifulSoup(htmltext)
title = (soup.title.text)
body = soup.find("Born").findNext('td')
print (body.text)
If I try to run the program I get,
Traceback (most recent call last):
File "C:\Users\USER\Documents\Python Programs\World Population.py", line 13, in <module>
body = soup.find("Born").findNext('p')
AttributeError: 'NoneType' object has no attribute 'findNext'
Is this a problem with Python 3, or am I just too naive?
The find and find_all methods do not search for arbitrary text in the document; they search for HTML tags. The documentation makes that clear (my italics):
Pass in a value for name and you’ll tell Beautiful Soup to only consider tags with certain names. Text strings will be ignored, as will tags whose names that don’t match. This is the simplest usage:
soup.find_all("title")
# [<title>The Dormouse's story</title>]
That's why your soup.find("Born") is returning None and hence why it complains about NoneType (the type of None) having no findNext() method.
That page you reference contains (at the time this answer was written) eight copies of the word "born", none of which are tags.
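The difference is easy to demonstrate (a sketch with made-up HTML; the string= argument is how you ask for text nodes instead of tags):

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup('<th>Born</th><td>1955-02-24</td>', 'html.parser')

print(soup.find('Born'))         # None -- searches for a <Born> tag
print(soup.find(string='Born'))  # Born -- matches the text node instead
print(soup.find(string='Born').find_next('td'))  # <td>1955-02-24</td>
```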
Looking at the HTML source for that page, you'll find the best option may be to look for the correct span (formatted for readability):
<th scope="row" style="text-align: left;">Born</th>
<td>
<span class="nickname">Steven Paul Jobs</span><br />
<span style="display: none;">(<span class="bday">1955-02-24</span>)</span>February 24, 1955<br />
</td>
The find method looks for tags, not text. To find the name, birthday and birthplace, you would have to look up the span elements with the corresponding class name, and access the text attribute of that item:
import urllib.request
from bs4 import *
soup = BeautifulSoup(urllib.request.urlopen("http://en.wikipedia.org/wiki/Steve_Jobs"), 'html.parser')
title = soup.title.text
name = soup.find('span', {'class': 'nickname'}).text
bday = soup.find('span', {'class': 'bday'}).text
birthplace = soup.find('span', {'class': 'birthplace'}).text
print(name)
print(bday)
print(birthplace)
Output:
Steven Paul Jobs
1955-02-24
San Francisco, California, US
PS: You don't have to call read on urlopen; BeautifulSoup accepts file-like objects.