I want to do a search and replace on the textual part of the content of the HTML elements.
E.g., replacing foo with <b>bar</b> in
<div id="foo">foo <i>foo</i> hi foo hi</div>
should result in
<div id="foo"><b>bar</b> <i><b>bar</b></i> hi <b>bar</b> hi</div>
I already have a working version in Perl, but the HTML parser there is buggy:
#!/usr/bin/env perl
##
use strict;
use warnings;
use v5.34.0;
use Mojo::DOM;
##
my $input = do { local $/; <STDIN> };
my $dom = Mojo::DOM->new($input);
$dom->descendant_nodes->grep(sub { $_->type eq 'text' })
    ->each(sub {
        $_->replace(s/(sth)/<span class="todo at_tag">$1<\/span>/gr)
    });
say $dom;
Search all text nodes containing foo
Create a b element
Replace the text with the new element
Insert the desired text into the b
from bs4 import BeautifulSoup
import re
htmlString = '''
<div id="foo">foo <i>foo</i> hi foo hi</div>
'''
soup = BeautifulSoup(htmlString, "html.parser")
# string= is the current name of the older text= argument
for n in soup.find_all(string=re.compile('foo')):
    bold = soup.new_tag("b")
    n.replace_with(bold)
    bold.insert(0, 'bar')
print(soup)
Output:
<div id="foo"><b>bar</b><i><b>bar</b></i><b>bar</b></div>
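Note that the output above loses the surrounding text (" hi ... hi"), because each matching text node is replaced as a whole. A minimal sketch of one way to keep the rest of the text, assuming BeautifulSoup 4 with the built-in html.parser; the helper name wrap_matches is hypothetical and not part of bs4:

```python
import re
from bs4 import BeautifulSoup, NavigableString


def wrap_matches(soup, pattern, tag_name, replacement):
    """Hypothetical helper: swap every regex match inside text nodes
    for <tag_name>replacement</tag_name>, keeping the text around it."""
    regex = re.compile(pattern)
    for node in list(soup.find_all(string=regex)):
        pieces = []
        last = 0
        for m in regex.finditer(node):
            if m.start() > last:
                # Text between the previous match and this one
                pieces.append(NavigableString(node[last:m.start()]))
            new_tag = soup.new_tag(tag_name)
            new_tag.string = replacement
            pieces.append(new_tag)
            last = m.end()
        if last < len(node):
            # Text after the final match
            pieces.append(NavigableString(node[last:]))
        # Put the pieces in front of the old node, then drop the old node
        for piece in pieces:
            node.insert_before(piece)
        node.extract()
    return soup


soup = BeautifulSoup('<div id="foo">foo <i>foo</i> hi foo hi</div>', "html.parser")
wrap_matches(soup, r"foo", "b", "bar")
result = str(soup)
print(result)
```

Only text nodes are touched, so the id="foo" attribute stays as it is.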
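It's not recommended to use string manipulation functions such as .replace or regex on HTML strings. Since you are looking for a solution along those lines, here is one anyway, though this should really be done with BeautifulSoup. The second .replace call undoes the first substitution, which hit the id attribute: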
html = """<div id="foo">foo <i>foo</i> hi foo hi</div>"""
res = html.replace("foo", "<b>bar</b>").replace("<b>bar</b>", "foo", 1)
print(res)
Output:
<div id="foo"><b>bar</b> <i><b>bar</b></i> hi <b>bar</b> hi</div>
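If a pure string approach is still wanted, a slightly less fragile sketch is a regex with a negative lookahead that skips matches sitting inside a tag. This is only an illustration of that idea, not a general HTML-safe method: it assumes no literal ">" appears inside attribute values, comments, or scripts.

```python
import re

html_text = '<div id="foo">foo <i>foo</i> hi foo hi</div>'

# (?![^<>]*>) rejects a match when the next angle bracket is ">",
# which means the match is located inside a tag such as id="foo".
result = re.sub(r"foo(?![^<>]*>)", "<b>bar</b>", html_text)
print(result)
```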
I am trying to use the BeautifulSoup module with Python to do the following:
Within a div for HTML, for each paragraph tag, I want to add a bold tag to the first letter of each word within the paragraph. For example:
<div class="body">
<p>The quick brown fox</p>
</div>
which would read: The quick brown fox
would then become
<div class="body">
<p><b>T</b>he <b>q</b>uick <b>b</b>rown <b>f</b>ox</p>
</div>
that would read: The quick brown fox
Using bs4 I've been unable to find a good solution for this and am open to ideas.
You could use replace_with() combined with a list comprehension: extract the text from the tag, process it as a plain string, and then replace the tag with a new bs4 object. The count of 1 in str.replace makes sure only the first letter of each word is wrapped:
soup.p.replace_with(
    BeautifulSoup(
        ' '.join([s.replace(s[0], f'<b>{s[0]}</b>', 1) for s in soup.p.string.split(' ')]),
        'html.parser'
    )
)
Example
from bs4 import BeautifulSoup
html = '''
<div class="body">
<p>The quick brown fox</p>
</div>'''
soup = BeautifulSoup(html, 'html.parser')
soup.p.replace_with(
    BeautifulSoup(
        # count=1 so that only the first letter of each word is wrapped
        ' '.join([s.replace(s[0], f'<b>{s[0]}</b>', 1) for s in soup.p.string.split(' ')]),
        'html.parser'
    )
)
print(soup)
Output
<div class="body">
<b>T</b>he <b>q</b>uick <b>b</b>rown <b>f</b>ox
</div>
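The output above no longer contains the <p> element, because replace_with() swaps out the tag itself. A minimal sketch that rebuilds the children of each <p> instead, assuming the paragraph holds plain text only (any nested markup would be flattened by get_text()):

```python
from bs4 import BeautifulSoup

html_doc = '''<div class="body">
<p>The quick brown fox</p>
</div>'''
soup = BeautifulSoup(html_doc, "html.parser")

for p in soup.find_all("p"):
    words = p.get_text().split(" ")
    p.clear()  # remove the old children but keep the <p> tag itself
    for i, word in enumerate(words):
        if i:
            p.append(" ")  # restore the space between words
        if word:
            b = soup.new_tag("b")
            b.string = word[0]
            p.append(b)
            if word[1:]:
                p.append(word[1:])

result = str(soup.p)
print(result)
```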
I don't know much about how Python parses HTML in detail, but I can provide you with some ideas.
To find <p> tags, you can use RegEx <p.*?>.*?</p> or use str.find("<p>") and walk until </p>.
To add <b> tags, perhaps this code will work:
def add_bold(s: str) -> str:
    ret = ""
    isFirstLet = True
    for i in s:
        # Wrap the first non-space character of each word
        if isFirstLet and i != " ":
            ret += "<b>" + i + "</b>"
            isFirstLet = False
        else:
            ret += i
        if i == " ":
            isFirstLet = True
    return ret
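For comparison, the same first-letter wrapping can be sketched with a single regex substitution. The function name add_bold_re is hypothetical, and the word-boundary rule differs slightly from the loop above: a letter after an apostrophe or hyphen also counts as the start of a word.

```python
import re


def add_bold_re(s: str) -> str:
    # \b(\w) captures the first word character after each word boundary
    return re.sub(r"\b(\w)", r"<b>\1</b>", s)


result = add_bold_re("The quick brown fox")
print(result)
```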
Trying to get my head around html construction with BS.
I'm trying to insert a new tag:
self.new_soup.body.insert(3, """<div id="file_history"></div>""")
when I check the result, I get:
&lt;div id="file_history"&gt;&lt;/div&gt;
So I'm inserting a string that is being sanitised into web-safe HTML.
What I expect to see is:
<div id="file_history"></div>
How do I insert a new div tag in position 3 with the id file_history?
See the documentation on how to append a tag:
soup = BeautifulSoup("<b></b>", "html.parser")
original_tag = soup.b
new_tag = soup.new_tag("a", href="http://www.example.com")
original_tag.append(new_tag)
original_tag
# <b><a href="http://www.example.com"></a></b>
new_tag.string = "Link text."
original_tag
# <b><a href="http://www.example.com">Link text.</a></b>
Use the factory method to create new elements:
new_tag = self.new_soup.new_tag('div', id='file_history')
and insert it:
self.new_soup.body.insert(3, new_tag)
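A small self-contained sketch of the difference, assuming the built-in html.parser and a made-up four-paragraph body: inserting a plain string gets escaped on output, while a tag created with new_tag() is inserted as real markup.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup(
    "<body><p>one</p><p>two</p><p>three</p><p>four</p></body>", "html.parser"
)

# A real Tag object becomes an element at child position 3
new_tag = soup.new_tag("div", id="file_history")
soup.body.insert(3, new_tag)
result = str(soup.body)
print(result)

# A plain string becomes a text node, so its angle brackets are escaped
other = BeautifulSoup("<body></body>", "html.parser")
other.body.insert(0, '<div id="file_history"></div>')
escaped = str(other.body)
print(escaped)
```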
The other answers come straight from the documentation. Here is a shortcut:
from bs4 import BeautifulSoup
temp_soup = BeautifulSoup('<div id="file_history"></div>', 'lxml')
# The lxml parser automatically adds <html> and <body> tags
# There is only one 'div' tag, so it's the only member in the 'contents' list
div_tag = temp_soup.html.body.contents[0]
# Or more simply
div_tag = temp_soup.html.body.div
your_new_soup.body.insert(3, div_tag)
I'm making a request to a webpage and trying to retrieve some text from it. The text is split up with span tags like this:
<span class="ed">This</span>
<span class="1">is</span>
<span class="12">jvgviehrgjfne</span>
<span class="dfe">my</span>
<span class="fd">gt4ugirdfgr</span>
<span class="df">string</span>
There are "inline style sheets" (CSS) that say whether each piece of text should be displayed on screen, so the gibberish text is never shown. This is an example of one such sheet:
.ed{display:inline}
.1{display:inline}
.12{display:none}
.dfe{display:inline}
.fd{display:none}
.df{display:inline}
But there are more CSS files like this, so I don't know whether there is a better way to achieve my goal (print the text that shows on screen and skip the gibberish that is not displayed).
My script is able to print the text, but all of it (gibberish included), like this: "This is jvgviehrgjfne my gt4ugirdfgr script!"
If I understood you correctly, you should parse the CSS files with a regex to find the classes associated with display:inline and pass the results to the BeautifulSoup API. Here is one way:
import re
import bs4
page_txt = """
<span class="ed">This</span>
<span class="1">is</span>
<span class="12">jvgviehrgjfne</span>
<span class="dfe">my</span>
<span class="fd">gt4ugirdfgr</span>
<span class="df">string</span>
"""
css_file_read_output = """
.ed{display:inline}
.1{display:inline}
.12{display:none}
.dfe{display:inline}
.fd{display:none}
.df{display:inline}"""
css_file_lines = css_file_read_output.splitlines()
css_lines_text = []
for line in css_file_lines:
    inline_search = re.search(".*inline.*", line)
    if inline_search is not None:
        inline_group = inline_search.group()
        class_name_search = re.search(r"\..*\{", inline_group)
        class_name_group = class_name_search.group()
        class_name_group = class_name_group[1:-1]  # getting rid of the last { and first .
        css_lines_text.append(class_name_group)
    else:
        pass
page_bs = bs4.BeautifulSoup(page_txt, "lxml")
wanted_text_list = []
for line in css_lines_text:
    wanted_line = page_bs.find("span", class_=line)
    wanted_text = wanted_line.get_text(strip=True)
    wanted_text_list.append(wanted_text)
wanted_string = " ".join(wanted_text_list)
print(wanted_string)
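A more compact variant of the same idea, as a sketch: build a class-to-display mapping with one regex, then walk the spans in page order and skip those whose class is set to display:none. It assumes one simple rule per class, as in the sample sheet, and uses the built-in html.parser; real stylesheets would need a proper CSS parser.

```python
import re
from bs4 import BeautifulSoup

page_txt = """
<span class="ed">This</span>
<span class="1">is</span>
<span class="12">jvgviehrgjfne</span>
<span class="dfe">my</span>
<span class="fd">gt4ugirdfgr</span>
<span class="df">string</span>
"""
css_text = """
.ed{display:inline}
.1{display:inline}
.12{display:none}
.dfe{display:inline}
.fd{display:none}
.df{display:inline}"""

# Map each class name to its display value, e.g. {"ed": "inline", "12": "none"}
display = dict(re.findall(r"\.([\w-]+)\s*\{\s*display\s*:\s*([\w-]+)", css_text))

soup = BeautifulSoup(page_txt, "html.parser")
visible = [
    span.get_text(strip=True)
    for span in soup.find_all("span")
    if not any(display.get(c) == "none" for c in span.get("class", []))
]
result = " ".join(visible)
print(result)
```

Iterating over the spans rather than over the CSS rules keeps the words in the order they appear on the page.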
How can I find all spans with a class of 'blue' that contain text in the format:
04/18/13 7:29pm
which could therefore be:
04/18/13 7:29pm
or:
Posted on 04/18/13 7:29pm
In terms of constructing the logic to do this, this is what I have got so far:
new_content = original_content.find_all('span', {'class' : 'blue'}) # using beautiful soup's find_all
pattern = re.compile('<span class=\"blue\">[data in the format 04/18/13 7:29pm]</span>') # using re
for _ in new_content:
    result = re.findall(pattern, _)
    print result
I've been referring to https://stackoverflow.com/a/7732827 and https://stackoverflow.com/a/12229134 to try and figure out a way to do this, but the above is all I have got so far.
Edit:
To clarify the scenario, there are spans with:
<span class="blue">here is a lot of text that i don't need</span>
and
<span class="blue">this is the span i need because it contains 04/18/13 7:29pm</span>
Note that I only need 04/18/13 7:29pm, not the rest of the content.
Edit 2:
I also tried:
pattern = re.compile('<span class="blue">.*?(\d\d/\d\d/\d\d \d\d?:\d\d\w\w)</span>')
for _ in new_content:
    result = re.findall(pattern, _)
    print result
and got error:
'TypeError: expected string or buffer'
import re
from bs4 import BeautifulSoup
html_doc = """
<html>
<body>
<span class="blue">here is a lot of text that i don't need</span>
<span class="blue">this is the span i need because it contains 04/18/13 7:29pm</span>
<span class="blue">04/19/13 7:30pm</span>
<span class="blue">Posted on 04/20/13 10:31pm</span>
</body>
</html>
"""
# parse the html
soup = BeautifulSoup(html_doc, "html.parser")
# find a list of all span elements
spans = soup.find_all('span', {'class' : 'blue'})
# create a list of lines corresponding to element texts
lines = [span.get_text() for span in spans]
# collect the dates from the list of lines using regex matching groups
found_dates = []
for line in lines:
    # [ap] matches "a" or "p"; a "|" inside a character class is a literal pipe
    m = re.search(r'(\d{2}/\d{2}/\d{2} \d+:\d+[ap]m)', line)
    if m:
        found_dates.append(m.group(1))
# print the dates we collected
for date in found_dates:
    print(date)
Output:
04/18/13 7:29pm
04/19/13 7:30pm
04/20/13 10:31pm
This is a flexible regex that you can use:
"(\d\d?/\d\d?/\d\d\d?\d?\s*\d\d?:\d\d[apAP][mM])"
Example:
>>> import re
>>> from bs4 import BeautifulSoup
>>> html = """
<html>
<body>
<span class="blue">here is a lot of text that i don't need</span>
<span class="blue">this is the span i need because it contains 04/18/13 7:29pm</span>
<span class="blue">04/19/13 7:30pm</span>
<span class="blue">04/18/13 7:29pm</span>
<span class="blue">Posted on 15/18/2013 10:00AM</span>
<span class="blue">Posted on 04/20/13 10:31pm</span>
<span class="blue">Posted on 4/1/2013 17:09aM</span>
</body>
</html>
"""
>>> soup = BeautifulSoup(html)
>>> lines = [i.get_text() for i in soup.find_all('span', {'class' : 'blue'})]
>>> ok = [m.group(1)
...       for line in lines
...       for m in (re.search(r'(\d\d?/\d\d?/\d\d\d?\d?\s*\d\d?:\d\d[apAP][mM])', line),)
...       if m]
>>> ok
[u'04/18/13 7:29pm', u'04/19/13 7:30pm', u'04/18/13 7:29pm', u'15/18/2013 10:00AM', u'04/20/13 10:31pm', u'4/1/2013 17:09aM']
>>> for i in ok:
...     print(i)
...
04/18/13 7:29pm
04/19/13 7:30pm
04/18/13 7:29pm
15/18/2013 10:00AM
04/20/13 10:31pm
4/1/2013 17:09aM
This pattern seems to satisfy what you're looking for:
>>> pattern = re.compile(r'<span class="blue">.*?(\d\d/\d\d/\d\d \d\d?:\d\d\w\w)</span>')
>>> pattern.match('<span class="blue">here is a lot of text that i dont need</span>')
>>> pattern.match('<span class="blue">this is the span i need because it contains 04/18/13 7:29pm</span>').groups()
('04/18/13 7:29pm',)
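The TypeError in the question most likely comes from passing bs4 Tag objects to re.findall, which expects a string. A minimal sketch of applying this pattern to the find_all results, assuming the sample markup from the question and the built-in html.parser: each tag is converted with str() first.

```python
import re
from bs4 import BeautifulSoup

html_doc = """
<span class="blue">here is a lot of text that i don't need</span>
<span class="blue">this is the span i need because it contains 04/18/13 7:29pm</span>
<span class="blue">04/19/13 7:30pm</span>
<span class="blue">Posted on 04/20/13 10:31pm</span>
"""
soup = BeautifulSoup(html_doc, "html.parser")
pattern = re.compile(r'<span class="blue">.*?(\d\d/\d\d/\d\d \d\d?:\d\d\w\w)</span>')

found = []
for tag in soup.find_all("span", {"class": "blue"}):
    # str(tag) serialises the element back to markup, so the regex gets a string
    found.extend(pattern.findall(str(tag)))
print(found)
```

Matching against get_text() instead, as in the earlier answers, avoids depending on how the tag is serialised.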