BeautifulSoup - removing a line of code

I have started learning BeautifulSoup. I am trying to remove a line containing </div> from an HTML document.
Most examples in the documentation deal with whole tags (both the opening and the closing part).
Is it possible to modify just one part of a tag?
For example:
</div>
<div >Hello</div>
<div data-foo="value">foo!</div>
How can I remove just the first line?

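You can try BeautifulSoup's unwrap() on the invalid tag, but the parser already drops a closing tag that has no opening counterpart while retaining the others, so the loop below finds nothing left to unwrap:
from bs4 import BeautifulSoup
html_doc = '''</div>
<div >Hello</div>
<div data-foo="value">foo!</div>'''
soup = BeautifulSoup(html_doc, 'html.parser')
invalid_tags = ['</div>']
for tag in invalid_tags:
    # find_all() matches tag names such as 'div'; '</div>' matches nothing
    for match in soup.find_all(tag):
        match.unwrap()
print(soup)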
result:
<div>Hello</div>
<div data-foo="value">foo!</div>

You don't need to do anything; the parser repairs it automatically.
from bs4 import BeautifulSoup
html_doc = '''</div>
<div>World</div>
<div data-foo="value">foo!''' # also invalid, no closing
soup = BeautifulSoup(html_doc, 'html.parser')
print(soup)
output
<div>World</div>
<div data-foo="value">foo!</div>
unwrap() is for removing a tag while keeping its contents, not for repairing markup.
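To see the difference between the parser's automatic repair and unwrap(), here is a minimal sketch using the markup from the question. It assumes the html.parser backend (other parsers may handle the stray tag differently), and the names repaired and unwrapped are only for illustration.

```python
from bs4 import BeautifulSoup

html_doc = '''</div>
<div >Hello</div>
<div data-foo="value">foo!</div>'''

# Parsing alone discards the closing tag that has no opening tag.
soup = BeautifulSoup(html_doc, 'html.parser')
repaired = str(soup).strip()
print(repaired)

# unwrap() removes a tag that exists in the tree but keeps its contents.
# Here it is applied only to the <div> that carries a data-foo attribute.
for div in soup.find_all('div', attrs={'data-foo': True}):
    div.unwrap()
unwrapped = str(soup).strip()
print(unwrapped)
```

The first print shows the two well-formed divs only; the second shows that the unwrapped div is gone while its text is still in the document.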

Related

How to get data from nested HTML using BeautifulSoup in Django

I am trying to learn web scraping and I'm stuck at a point where the data I want is wrapped by a div tag as so:
<div class="maincounter-number">
<span style="color:#aaa">803 </span>
</div>
There are several values like that and I need all of them (e.g. 803). I guess I need to use soup.find_all(...), but I don't know what to pass to it. Can anyone help?
I am working in Python (Django).
This should do what you are looking to do:
from bs4 import BeautifulSoup
html_doc = '<div class="maincounter-number"><span style="color:#aaa">803 </span></div>'
soup = BeautifulSoup(html_doc, 'html.parser')
print(soup.find_all('span', {'style': 'color:#aaa'})[0].get_text())
If you just want to query the text in the div and search by class:
print(soup.find_all('div', {'class': 'maincounter-number'})[0].get_text())
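Since the question asks for all of the values and not just the first, a minimal sketch of looping over every matching div might look like this. The second div and its value are made-up sample data, and counters is just an illustrative name.

```python
from bs4 import BeautifulSoup

# Two counters: the first is from the question, the second is invented sample data.
html_doc = '''
<div class="maincounter-number"><span style="color:#aaa">803 </span></div>
<div class="maincounter-number"><span>1,204 </span></div>
'''
soup = BeautifulSoup(html_doc, 'html.parser')

# get_text(strip=True) trims the whitespace around each piece of text.
counters = [
    div.get_text(strip=True)
    for div in soup.find_all('div', class_='maincounter-number')
]
print(counters)
```

Searching by the div's class rather than the span's inline style should be less fragile if the page styling changes.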

Python BeautifulSoup - Text Between Div

I am working on a web scraper project and can't get BeautifulSoup to give me the text inside the div. Below is my code. Any suggestions on how to get Python to print just "5x5", without the <div> and </div> tags and without the whitespace?
import requests
from bs4 import BeautifulSoup
source = requests.get('https://www.stor-it.com/self-storage/meridian-id-83646').text
soup = BeautifulSoup(source, 'lxml')
unit = soup.find('div', class_="unit-size")
print(unit)
This script returns the following:
<div class="unit-size">
5x5 </div>
You can use .text to retrieve the text, then strip() to remove the whitespace.
Try unit.text.strip()
Change your print statement from print(unit) to print(unit.text), and add .strip() to drop the surrounding whitespace.
Alternatively, use a CSS class selector:
from bs4 import BeautifulSoup
source= '''
<div class="unit-size">
5x5 </div>
'''
soup = BeautifulSoup(source, 'lxml')
unit = soup.select('.unit-size')
print(unit[0].text.strip())
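If only the first match is needed, select_one() together with get_text(strip=True) is another option. This is a sketch under the assumption that the built-in html.parser is acceptable in place of lxml; the name size is illustrative.

```python
from bs4 import BeautifulSoup

source = '''
<div class="unit-size">
5x5 </div>
'''
soup = BeautifulSoup(source, 'html.parser')

# select_one() returns the first match, or None when nothing matches.
unit = soup.select_one('.unit-size')
size = unit.get_text(strip=True) if unit is not None else None
print(size)
```

Checking for None avoids an AttributeError on pages where the div is missing.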

BeautifulSoup - how to extract text without opening tag and before <br> tag?

I'm new to python and beautifulsoup and spent quite a few hours trying to figure this one out.
I want to extract three particular text extracts within a <div> that has no class.
The first text extract I want is within an <a> tag which is within an <h4> tag. I managed to extract this one.
The second text extract immediately follows the closing h4 tag </h4> and is followed by a <br> tag.
The third text extract immediately follows the <br> tag after the second text extract and is also followed by a <br> tag.
Here is the HTML extract I am working with:
<div>
<h4 class="actorboxLink">
Decheterie de Bagnols
</h4>
Route des 4 Vents<br>
63810 Bagnols<br>
</div>
I want to extract:
Decheterie de Bagnols < That works
Route des 4 Vents < Doesn't work
63810 Bagnols < Doesn't work
Here is the code I have so far:
import urllib
from bs4 import BeautifulSoup
data = urllib.urlopen(url).read()
soup = BeautifulSoup(data, "html.parser")
name = soup.findAll("h4", class_="actorboxLink")
for a_tag in name:
    print a_tag.text.strip()
I need something like "soup.findAll(all text after </h4>)"
I played with using .next_sibling but I can't get it to work.
Any ideas? Thanks
UPDATE:
I tried this:
for a_tag in classActorboxLink:
    print a_tag.find_all_next(string=True, limit=5)
which gives me:
[u'\n', u'\r\n\t\t\t\t\t\tDecheterie\xa0de\xa0Bagnols\t\t\t\t\t', u'\n', u'\r\n\t\t\t\tRoute\xa0des\xa04\xa0Vents', u'\r\n\t\t\t\t63810 Bagnols']
It's a start, but I need to remove all the whitespace and unnecessary characters. I tried using .strip(), .strings and .stripped_strings, but none of them works. Examples:
for a_tag in classActorboxLink.strings
for a_tag in classActorboxLink.stripped_strings
print a_tag.find_all_next(string=True, limit=5).strip()
For all three I get:
AttributeError: 'ResultSet' object has no attribute 'strings/stripped_strings/strip'
Locate the h4 element and use find_next_siblings():
h4s = soup.find_all("h4", class_="actorboxLink")
for h4 in h4s:
    for text in h4.find_next_siblings(text=True):
        print(text.strip())
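As a self-contained sketch of the sibling approach, the snippet below runs against the HTML extract from the question and skips the whitespace-only strings. It assumes html.parser and Python 3; the names name and address are illustrative.

```python
from bs4 import BeautifulSoup

html_doc = '''<div>
<h4 class="actorboxLink">
Decheterie de Bagnols
</h4>
Route des 4 Vents<br>
63810 Bagnols<br>
</div>'''
soup = BeautifulSoup(html_doc, 'html.parser')

for h4 in soup.find_all('h4', class_='actorboxLink'):
    name = h4.get_text(strip=True)
    # Keep only the text nodes after the <h4> that contain something
    # other than whitespace, and trim each one.
    address = [
        s.strip()
        for s in h4.find_next_siblings(string=True)
        if s.strip()
    ]
    print(name, address)
```

Note that strip() has to be called on each string; calling it on the ResultSet itself is what raises the AttributeError shown in the question.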
If you don't need the three pieces of text in separate variables, you can call get_text() on the <div> to get them all in one string. If the page has other <div> tags but they all have classes, you can select only the <div> tags without a class by passing class_=False. If you can't isolate the <div> you are interested in, this solution won't work for you.
from urllib.request import urlopen
from bs4 import BeautifulSoup
data = urlopen(url).read()  # url is the address of the page to scrape
soup = BeautifulSoup(data, "html.parser")
for name in soup.find_all("div", class_=False):
    print(name.get_text().strip())
BTW this is Python 3 and bs4.

Extracting text between <br> with beautifulsoup, but without next tag

I'm using python + beautifulsoup to try to get the text between the br's. The closest I got to this was by using next_sibling in the following manner:
<html>
<body>
</a><span class="strong">Title1</span>
<p>Text1</p>
<br>The Text I want to get<br>
<p>Text I dont want</p>
</body>
</html>
for span in soup.findAll("span", {"class" : "strong"}):
    print(span.next_sibling.next_sibling.text)
But this prints:
The Text I want to getText I dont want
So what I want comes after the first <p> but before the second, and I can't figure out how to extract it when there are no real tags around it, only the <br> tags as references.
I need it to print:
The Text I want to get
Since the HTML you've provided is broken, the behavior differs depending on which parser BeautifulSoup uses.
With the lxml parser, BeautifulSoup converts the br tag into a self-closing one:
>>> soup = BeautifulSoup(data, 'lxml')
>>> print(soup)
<html>
<body>
<span class="strong">Title1</span>
<p>Text1</p>
<br/>The Text I want to get<br/>
<p>Text I dont want</p>
</body>
</html>
Note that this requires lxml to be installed. If that is okay for you, find the br and get its next sibling:
from bs4 import BeautifulSoup
data = """your HTML"""
soup = BeautifulSoup(data, 'lxml')
print(soup.br.next_sibling) # prints "The Text I want to get"
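If installing lxml is not an option, the same idea can be sketched with the built-in html.parser, which also treats <br> as an empty element. This version checks every <br> and keeps only the siblings that are non-empty text; the name texts is illustrative.

```python
from bs4 import BeautifulSoup, NavigableString

data = '''<html>
<body>
</a><span class="strong">Title1</span>
<p>Text1</p>
<br>The Text I want to get<br>
<p>Text I dont want</p>
</body>
</html>'''
soup = BeautifulSoup(data, 'html.parser')

texts = []
for br in soup.find_all('br'):
    sibling = br.next_sibling
    # Skip siblings that are tags or whitespace-only strings.
    if isinstance(sibling, NavigableString) and sibling.strip():
        texts.append(sibling.strip())
print(texts)
```

The isinstance check matters because the node after a <br> can just as well be another tag, which has no strip() method.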
Also see:
Using beautifulsoup to extract text between line breaks (e.g. <br /> tags)
Parsing unclosed `<br>` tags with BeautifulSoup
Using Python Scrapy
In [4]: hxs.select('//body/text()').extract()
Out[4]: [u'\n', u'\n', u'\n', u'The Text I want to get', u'\n', u'\n']

Parsing out data using BeautifulSoup in Python

I am attempting to use BeautifulSoup to parse through a DOM tree and extract the names of authors. Below is a snippet of HTML to show the structure of the code I'm going to scrape.
<html>
<body>
<div class="list-authors">
<span class="descriptor">Authors:</span>
Dacheng Lin,
Ronald A. Remillard,
Jeroen Homan
</div>
<div class="list-authors">
<span class="descriptor">Authors:</span>
A.G. Kosovichev
</div>
<!--There are many other div tags with this structure-->
</body>
</html>
My point of confusion is that when I do soup.find, it finds the first occurrence of the div tag that I'm searching for. After that, I search for all 'a' link tags. At this stage, how do I extract the authors' names from each of the link tags and print them out? Is there a way to do it using BeautifulSoup, or do I need to use regex? How do I continue iterating over every other div tag and extract the authors' names?
import re
import urllib2,sys
from BeautifulSoup import BeautifulSoup, NavigableString
html = urllib2.urlopen(address).read()
soup = BeautifulSoup(html)
try:
    authordiv = soup.find('div', attrs={'class': 'list-authors'})
    links = authordiv.findAll('a')
    for link in links:
        print ''.join(link[0].contents)
    #Iterate through entire page and print authors
except IOError:
    print 'IO error'
Just use findAll for the divs, like you do for the links:
for authordiv in soup.findAll('div', attrs={'class': 'list-authors'}):
Since link is already taken from an iterable, you don't need to subindex link; you can just do link.contents[0].
With your new example with two separate <div class="list-authors">, print link.contents[0] yields:
Dacheng Lin
Ronald A. Remillard
Jeroen Homan
A.G. Kosovichev
So I'm not sure I understand the comment about searching other divs. If they have different classes, you will need either a separate soup.find and soup.findAll for each class, or a modified first soup.find.
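Putting the two loops together in bs4 and Python 3 might look like the sketch below. The HTML snippet in the question shows the names as plain text, so the <a> tags and their href="#" values here are assumptions based on the question's description of link tags; authors_per_div is an illustrative name.

```python
from bs4 import BeautifulSoup

# Sample markup: the <a> tags and hrefs are assumed, not taken from the real page.
html = '''
<div class="list-authors">
<span class="descriptor">Authors:</span>
<a href="#">Dacheng Lin</a>,
<a href="#">Ronald A. Remillard</a>,
<a href="#">Jeroen Homan</a>
</div>
<div class="list-authors">
<span class="descriptor">Authors:</span>
<a href="#">A.G. Kosovichev</a>
</div>
'''
soup = BeautifulSoup(html, 'html.parser')

# Outer loop: every author div. Inner loop: every link inside that div.
authors_per_div = [
    [a.get_text(strip=True) for a in div.find_all('a')]
    for div in soup.find_all('div', class_='list-authors')
]
for authors in authors_per_div:
    print(', '.join(authors))
```

Keeping one list per div preserves which authors belong to which entry, and no regex is needed.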
