Remove all HTML content from a string in Python

I would like to remove all HTML content from a string.
I have a string
str= "I am happy with <body> <h1>This is a Heading</h1> <p>This is a paragraph.</p> </body> 3333 <body> <h1>This is a Heading</h1> <p>This is a paragraph.</p> </body> your code"
I want the final string
str= "I am happy with 3333 your code"
I have written this code to do the above task.
def removetags(input_str):
    result = ''
    startflag = 0
    start = True
    count = 0
    for ch in input_str:
        if ch == '<':
            if count != len(input_str) - 1:
                if input_str[count + 1] != '/':
                    start = True
                    startflag += 1
        elif (ch == '>') and startflag:
            if not start:
                startflag -= 1
            start = False
        elif not startflag:
            result += ch
        count += 1
    return result

print(removetags(str))
This works fine, but if the text itself contains a <, the output is wrong. So I want to do the removal with HTML parsing instead. Is there any way to do that? I found this library but couldn't find a way to do it. Thanks in advance.

from html.parser import HTMLParser

str = "I am happy with <body> <h1>This is a Heading</h1> <p>This is a paragraph.</p> </body> 3333 > <body> <h1>This is a Heading</h1> <p>This is a paragraph.</p> </body> your code"

class MyHTMLParser(HTMLParser):
    got_html_in_tags = False
    html_free_text = []
    def handle_starttag(self, tag, attrs):
        self.got_html_in_tags = True
    def handle_endtag(self, tag):
        self.got_html_in_tags = False
    def handle_data(self, data):
        if not self.got_html_in_tags:
            self.html_free_text.append(data)

parser = MyHTMLParser()
parser.feed(str)
print("".join(parser.html_free_text))
This will print I am happy with 3333 your code even with a '>' or '<' in the text.
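For reference, the same idea can be packaged as a small reusable parser that tracks tag nesting depth instead of a boolean flag. TagStripper is my own name for it, and note that void elements such as <br> (which never get an end tag) would need extra handling in handle_startendtag or a void-element list:

```python
from html.parser import HTMLParser

class TagStripper(HTMLParser):
    """Keep only text that sits outside every tag, by tracking nesting depth."""
    def __init__(self):
        super().__init__()
        self.depth = 0    # number of currently open tags
        self.parts = []

    def handle_starttag(self, tag, attrs):
        self.depth += 1

    def handle_endtag(self, tag):
        self.depth = max(0, self.depth - 1)

    def handle_data(self, data):
        if self.depth == 0:           # only keep text outside all tags
            self.parts.append(data)

    def text(self):
        return "".join(self.parts)

parser = TagStripper()
parser.feed("I am happy with <body> <h1>Heading</h1> </body> 3333 your code")
print(parser.text())
```

Because the parser handles tokenization, stray '<' or '>' characters in the data do not confuse the character-by-character bookkeeping.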

Another re solution:
re.sub(r'(<(?P<tag>[a-zA-Z0-9]+)>.*?</(?P=tag)>)', '', string)
Tests:
>>> import re
>>> string = "I am happy with <body> <h1>This is a Heading</h1> <p>This is a paragraph.</p> </body> 3333 <body> <h1>This is a Heading</h1> <p>This is a paragraph.</p> </body> your code"
>>> re.sub(r'(<(?P<tag>[a-zA-Z0-9]+)>.*?</(?P=tag)>)', '', string)
'I am happy with 3333 your code'
>>> string = "I am happy with <body> <h1>This is a Heading</h1> <p>This is a paragraph.</p> </body> 3333 > <body> <h1>This is a Heading</h1> <p>This is a paragraph.</p> </body> your code"
>>> re.sub(r'(<(?P<tag>[a-zA-Z0-9]+)>.*?</(?P=tag)>)', '', string)
'I am happy with 3333 > your code'
>>> string = "I am <a happy with <body> </body> lal"
>>> re.sub(r'(<(?P<tag>[a-zA-Z0-9]+)>.*?</(?P=tag)>)', '', string)
'I am <a happy with lal'

You can use the re module for that:
import re

str = "I am happy with <body> <h1>This is a Heading</h1> <p>This is a paragraph.</p> </body> 3333 <body> <h1>This is a Heading</h1> <p>This is a paragraph.</p> </body> your code"
comp = re.compile(r'<([\w]+)[^>]*>(.*?)</\1>')
data = comp.sub('', str)
print(data)
Maybe this helps.

Let's do this recursively ;)
Base case 1: when the text is an empty string, return an empty string.
Base case 2: when the first character of the text is an opening angle bracket ('<'), find the closing '>', record or pop the tag, and recurse on the text that remains after it.
def remove_tags(text, tags=None):
    if tags is None:  # avoid a mutable default argument shared across calls
        tags = []
    if text == '':
        return text
    if text[0] == '<':
        closing_caret_pos = text.find('>')
        tag = text[0:closing_caret_pos + 1]
        is_open_tag = '/' not in tag
        is_close_tag = not is_open_tag
        is_valid_tag = tag[1:-1].isalpha() or tag[2:-1].isalpha()
        if is_valid_tag and is_open_tag:
            tags.append(tag)
            return remove_tags(text[1:], tags)
        if is_valid_tag and is_close_tag:
            tags.pop()
            return remove_tags(text[len(tag):], tags)
    if len(tags) != 0:  # while an open tag exists, keep skipping
        return remove_tags(text[1:], tags)
    return text[0] + remove_tags(text[1:], tags)
Test runs:
text = "I am happy with <body> <h1>This is a Heading</h1> <p>This is a paragraph.</p> </body> 3333 <body> <h1>This is a Heading</h1> <p>This is a paragraph.</p> </body> your code"
print(remove_tags(text))
>
I am happy with 3333 your code
text = "x<=1 <div> cookies </div>"
print(remove_tags(text))
>
x<=1
text = "I am <a happy with <body> </body> lal"
print(remove_tags(text))
>
I am <a happy with lal

Related

Replace arbitrary HTML (subtree) within an HTML document with another HTML (subtree) with BS4 or regex

I am trying to build a function along the following lines:
import bs4

def replace(html: str, selector: str, old: str, new: str) -> str:
    soup = bs4.BeautifulSoup(html)      # likely a complete HTML document
    old_soup = bs4.BeautifulSoup(old)   # can contain HTML tags etc.
    new_soup = bs4.BeautifulSoup(new)   # can contain HTML tags etc.
    for selected in soup.select(selector):
        ### pseudo-code start
        for match in selected.find_everything(old_soup):
            match.replace_with(new_soup)
        ### pseudo-code end
    return str(soup)
I want to be able to replace an arbitrary HTML subtree below a CSS selector within a full HTML document with another arbitrary HTML subtree. selector, old and new are read as strings from a configuration file.
My document could look as follows:
before = r"""<!DOCTYPE html>
<html>
<head>
<title>No target here</head>
</head>
<body>
<h1>This is the target!</h1>
<p class="target">
Yet another <b>target</b>.
</p>
<p>
<!-- Comment -->
Foo target Bar
</p>
</body>
</html>
"""
This is supposed to work:
after = replace(
html = before,
selector = 'body', # from config text file
old = 'target', # from config text file
new = '<span class="special">target</span>', # from config text file
)
assert after == r"""<!DOCTYPE html>
<html>
<head>
<title>No target here</head>
</head>
<body>
<h1>This is the <span class="special">target</span>!</h1>
<p class="target">
Yet another <b><span class="special">target</span></b>.
</p>
<p>
<!-- Comment -->
Foo <span class="special">target</span> Bar
</p>
</body>
</html>
"""
A plain str.replace does not work because the "target" can appear literally everywhere ... I briefly considered doing this with a regular expression. I have to admit that I did not succeed, but I'd be happy to see it working. Currently, I think my best chance is to use BeautifulSoup.
I understand how to swap a specific tag. I can also replace specific text etc. However, I am really failing to replace an "arbitrary HTML subtree", as in I want to replace some HTML with some other HTML in a sane manner. In this context, I want to treat old and new really as HTML, so if old is simply a "word" that does also appear for instance in a class name, I really only want to replace it if it is content in the document, but not if it is a class name as shown above.
Any ideas how to do this?
The solution below works in three parts:
All matches of selector from html are discovered.
Then, each match (as a soup object) is recursively traversed and every child is matched against old.
If the child object is equivalent to old, then it is extracted and new is inserted into the original match at the same index as the child object.
import bs4
from bs4 import BeautifulSoup as soup

def replace(html: str, selector: str, old: str, new: str) -> str:
    def update_html(d: soup, old: soup) -> None:
        i = 0
        while (c := getattr(d, 'contents', [])[i:]):
            if isinstance((a := c[0]), bs4.element.NavigableString) and str(old) in str(a):
                a.extract()
                for j, k in enumerate((l := str(a).split(str(old)))):
                    i += 1
                    d.insert(i, soup(k, 'html.parser'))
                    if j + 1 != len(l):
                        i += 1
                        d.insert(i, soup(new, 'html.parser'))
            elif a == old:
                a.extract()
                d.insert(i, soup(new, 'html.parser'))
                i += 1
            else:
                update_html(a, old)
                i += 1
    source, o = [soup(j, 'html.parser') for j in [html, old]]
    for i in source.select(selector):
        update_html(i, o.contents[0])
    return str(source)
after = replace(
    html = before,
    selector = 'body', # from config text file
    old = 'target', # from config text file
    new = '<span class="special">target</span>', # from config text file
)
print(after)
Output:
<!DOCTYPE html>
<html>
<head>
<title>No target here</title></head>
<body>
<h1>This is the <span class="special">target</span>!</h1>
<p class="target">
Yet another <b><span class="special">target</span></b>.
</p>
<p>
<!-- Comment -->
Foo <span class="special">target</span> Bar
</p>
</body>
</html>
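An alternative sketch of the same replacement that touches only text nodes, using find_all(string=True). replace_text is a hypothetical helper name, and this assumes BeautifulSoup 4 with the built-in html.parser; comments and class attributes are left untouched because only plain NavigableStrings are rewritten:

```python
from bs4 import BeautifulSoup, NavigableString

def replace_text(html, selector, old, new):
    """Replace `old` inside text nodes under `selector` with the HTML in `new`."""
    soup = BeautifulSoup(html, 'html.parser')
    for selected in soup.select(selector):
        # Snapshot the text nodes first, since the tree is mutated below.
        for node in list(selected.find_all(string=True)):
            # `type(...) is NavigableString` skips Comment and other subclasses.
            if type(node) is not NavigableString or old not in node:
                continue
            pieces = str(node).split(old)
            for i, piece in enumerate(pieces):
                node.insert_before(NavigableString(piece))
                if i + 1 < len(pieces):
                    node.insert_before(BeautifulSoup(new, 'html.parser'))
            node.extract()
    return str(soup)

out = replace_text('<body><h1>This is the target!</h1></body>',
                   'body', 'target', '<span class="special">target</span>')
print(out)
```

This covers the common case where old is a plain string; matching a structured old subtree (e.g. `<b>target</b>`) still needs the recursive comparison shown above.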

How to get all the tags with same name inside an XML file using BeautifulSoup in Python?

The XML that I am using is of this format -
<head>
<body>
Sample Text1
</body>
<body>
Sample Text2
</body>
</head>
I am trying to collect all the tags named <body> into a single variable final_value. For that, I am using the code below -
soup = Soup(target_xml, 'html.parser')
for value in soup.find_all("body"):
    final_value = value.prettify()
Using this, I am getting only one <body> tag inside the final_value variable. How can I get both the tags inside the variable so that the output would be like this -
>> final_value
<body>
Sample Text1
</body>
<body>
Sample Text2
</body>
This should help.
Demo:
from bs4 import BeautifulSoup

target_xml = """<head>
<body>
Sample Text1
</body>
<body>
Sample Text2
</body>
</head>"""

final_value = ""
soup = BeautifulSoup(target_xml, 'html.parser')
for value in soup.find_all("body"):
    final_value += value.prettify()
print(final_value)
Output:
<body>
Sample Text1
</body>
<body>
Sample Text2
</body>
You are essentially overwriting the first value with the second one in these lines:
for value in soup.find_all("body"):
    final_value = value.prettify()
Instead, try something like this (note that final_value must start out as an empty string):
final_value = ""
for value in soup.find_all("body"):
    final_value += value.prettify()
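The accumulation can also be collapsed into a single join (a sketch, assuming BeautifulSoup 4 with the built-in html.parser):

```python
from bs4 import BeautifulSoup

target_xml = """<head>
<body>
Sample Text1
</body>
<body>
Sample Text2
</body>
</head>"""

soup = BeautifulSoup(target_xml, 'html.parser')
# Concatenate the pretty-printed form of every <body> tag.
final_value = "".join(tag.prettify() for tag in soup.find_all("body"))
print(final_value)
```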

wrap implicit section of an HTML document into section tags using lxml.etree

I'm trying to write a small function that wraps the implicit sections of an HTML document in <section> tags. I'm trying to do so with lxml.etree.
Let's say my input is:
<html>
<head></head>
<body>
<h1>title</h1>
<p>some text</p>
<h1>title</h1>
<p>some text</p>
</body>
</html>
I'd like to end up with:
<html>
<head></head>
<body>
<section>
<h1>title</h1>
<p>some text</p>
</section>
<section>
<h1>title</h1>
<p>some text</p>
</section>
</body>
</html>
Here is what I have at the moment:
import re
import lxml.etree

def outline(tree):
    pattern = re.compile(r'^h(\d)')
    section = None
    for child in tree.iterchildren():
        tag = child.tag
        if tag is lxml.etree.Comment:
            continue
        match = pattern.match(tag.lower())
        # If a header tag is found
        if match:
            depth = int(match.group(1))
            if section is not None:
                child.addprevious(section)
            section = lxml.etree.Element('section')
            section.append(child)
        else:
            if section is not None:
                section.append(child)
            else:
                pass
        if child is not None:
            outline(child)
which I call like this
outline(tree.find('body'))
But it doesn't work with subheadings at the moment, eg.:
<section>
<h1>ONE</h1>
<section>
<h3>TOO Deep</h3>
</section>
<section>
<h2>Level 2</h2>
</section>
</section>
<section>
<h1>TWO</h1>
</section>
Thanks
When it comes to transforming XML, XSLT is the best way to go; see the lxml and XSLT docs.
This is only a direction, as requested; let me know if you need further help writing that XSLT.
Here is the code I ended up with, for the record:
import re
import lxml.etree

def outline(tree, level=0):
    pattern = re.compile(r'^h(\d)')
    last_depth = None
    sections = []  # [header, <section />]
    for child in tree.iterchildren():
        tag = child.tag
        if tag is lxml.etree.Comment:
            continue
        match = pattern.match(tag.lower())
        # print("%s%s" % (level * ' ', child))
        if match:
            depth = int(match.group(1))
            # Check for None first, so an int is never compared against None.
            if last_depth is None or depth <= last_depth:
                # print("%ssection %d" % (level * ' ', depth))
                last_depth = depth
                sections.append([child, lxml.etree.Element('section')])
                continue
        if sections:
            sections[-1][1].append(child)
    for section in sections:
        outline(section[1], level=((level + 1) * 4))
        section[0].addprevious(section[1])
        section[1].insert(0, section[0])
Works pretty well for me
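For the flat case in the original question, the grouping idea can also be sketched with the standard library's xml.etree.ElementTree. wrap_sections is my own name; this sketch handles only element children (text between tags rides along as each element's tail) and assumes well-formed XML input:

```python
import xml.etree.ElementTree as ET

def wrap_sections(body):
    """Move each <h1> plus its following siblings (up to the next <h1>) into a <section>."""
    children = list(body)
    for child in children:          # detach everything first
        body.remove(child)
    section = None
    for child in children:
        if child.tag == 'h1':       # a new heading starts a new section
            section = ET.SubElement(body, 'section')
        # Children before the first heading stay directly under <body>.
        (section if section is not None else body).append(child)

root = ET.fromstring("<body><h1>title</h1><p>some text</p><h1>title</h1><p>some text</p></body>")
wrap_sections(root)
print(ET.tostring(root, encoding='unicode'))
```

Nesting by heading level (h2 inside an h1 section, etc.) would still need the depth bookkeeping shown in the lxml version above.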

How to search multiple files for textblocks and write those textblocks to another file

Here is an example of an input file:
<html xml:lang="en" lang="en" xmlns="http://www.w3.org/1999/xhtml">
<head>
</head>
<body>
HERE IS A LOT OF TEXT, THAT IS NOT INTERESTING
<br>
<div id="text"><div id="text-interesting1">11/222-AA</div>
<h2>This is the title</h2>
<P>Here is some multiline desc-<br>
cription about what is <br><br>
going on here
</div>
<div id="text2"><div id="text-interesting2">IV-VI</div>
<br>
<h1> Some really interesting text</h1>
</body>
</html>
Now I want to grab multiple blocks from this file, e.g. the text between <div id="text-interesting1"> and </div>, then between <P> and </div>, then between <div id="text-interesting2"> and </div>, and many more. The point is, there are multiple values that I want to retrieve.
I want to write those values to a file, e.g. comma separated. How can that be done?
From the example that Luke provided I made the following:
import os, re

path = 'C:/Temp/Folder1/allTexts'
listing = os.listdir(path)
for infile in listing:
    text = open(path + '/' + infile).read()

    match = re.search('<div id="text-interesting1">', text)
    if match is None:
        continue
    start = match.end()
    end = re.search('</div>', text).start()
    print(text[start:end])

    match = re.search('<h2>', text)
    if match is None:
        continue
    start = match.end()
    end = re.search('</h2>', text).start()
    print(text[start:end])

    match = re.search('<P>', text)
    if match is None:
        continue
    start = match.end()
    end = re.search('</div>', text).start()
    print(text[start:end])

    match = re.search('<div id="text-interesting2">', text)
    if match is None:
        continue
    start = match.end()
    end = re.search('</div>', text).start()
    print(text[start:end])

    match = re.search('<h1>', text)
    if match is None:
        continue
    start = match.end()
    end = re.search('</h1>', text).start()
    print(text[start:end])

    print('--------------------------------------')
Output is:
11/222-AA
This is the title
Some really interesting text
--------------------------------------
22/4444-AA
22222 This is the title2
22222222222222222222222
--------------------------------------
33/4444-AA
3333 This is the title3
333333333333333333333333
--------------------------------------
Why does that part not work?
Here's a start:
import os, re

path = 'C:/Temp/Folder1/allTexts'
listing = os.listdir(path)
for infile in listing:
    text = open(path + '/' + infile).read()
    match = re.search('<div id="text-interesting1">', text)
    if match is None:
        continue
    start = match.start()
    end = re.search('<div id="text-interesting2">', text).start()
    print(text[start:end])
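To turn the matches into comma-separated values rather than printing raw slices, each block can be captured with a non-greedy group. The field names, patterns, and sample text below are illustrative:

```python
import re

text = '<div id="text-interesting1">11/222-AA</div>\n<h2>This is the title</h2>'

# One pattern per field; group 1 captures the text between the tags.
fields = {
    'id1': r'<div id="text-interesting1">(.*?)</div>',
    'title': r'<h2>(.*?)</h2>',
}
values = []
for name, pat in fields.items():
    m = re.search(pat, text, re.DOTALL)   # DOTALL lets .*? span line breaks
    values.append(m.group(1).strip() if m else '')
print(','.join(values))
```

Writing the joined line to a file per input document then gives the comma-separated output the question asks for.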
Another strategy is to parse the XML. You will need to tidy your file up since strict XML requires matching tags, case consistency, etc. Here is an example:
from xml.etree import ElementTree
from io import StringIO
import sys

tree = ElementTree.ElementTree()
tree.parse(StringIO(sys.stdin.read()))
print("All tags:")
for e in tree.iter():
    print(e.tag)
    print(e.text)
print("Only div:")
for i in tree.find("{http://www.w3.org/1999/xhtml}body").findall("{http://www.w3.org/1999/xhtml}div"):
    print(i.text)
Run on a slight modification of your file:
<html xml:lang="en" lang="en" xmlns="http://www.w3.org/1999/xhtml">
<head>
</head>
<body>
HERE IS A LOT OF TEXT, THAT IS NOT INTERESTING
<br></br>
<div id="text"><div id="text-interesting1">11/222-AA</div>
<h2>This is the title</h2>
<p>Here is some multiline desc-<br></br>
cription about what is <br></br><br></br>
going on here</p>
</div>
<div id="text-interesting2">IV-VI</div>
<br></br>
<h1> Some really interesting text</h1>
</body>
</html>
Example output,
> cat file.xml | ./tb.py
All tags:
{http://www.w3.org/1999/xhtml}html
{http://www.w3.org/1999/xhtml}head
{http://www.w3.org/1999/xhtml}body
HERE IS A LOT OF TEXT, THAT IS NOT INTERESTING
{http://www.w3.org/1999/xhtml}br
None
{http://www.w3.org/1999/xhtml}div
None
{http://www.w3.org/1999/xhtml}div
11/222-AA
{http://www.w3.org/1999/xhtml}h2
This is the title
{http://www.w3.org/1999/xhtml}p
Here is some multiline desc-
{http://www.w3.org/1999/xhtml}br
None
{http://www.w3.org/1999/xhtml}br
None
{http://www.w3.org/1999/xhtml}br
None
{http://www.w3.org/1999/xhtml}div
IV-VI
{http://www.w3.org/1999/xhtml}br
None
{http://www.w3.org/1999/xhtml}h1
Some really interesting text
Only div:
None
IV-VI
But a lot of HTML is difficult to parse as strict XML, so this may prove hard to implement for your case.
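One detail worth noting from the output above: ElementTree prefixes every tag with its {namespace}. A small sketch of matching on the local name instead, so the namespace URI does not have to be repeated everywhere (localname is my helper name):

```python
import xml.etree.ElementTree as ET

doc = ('<html xmlns="http://www.w3.org/1999/xhtml"><body>'
       '<div id="text-interesting1">11/222-AA</div></body></html>')
root = ET.fromstring(doc)

def localname(tag):
    # ElementTree stores namespaced tags as '{uri}name'; keep only 'name'.
    return tag.rsplit('}', 1)[-1]

div_texts = [e.text for e in root.iter() if localname(e.tag) == 'div']
print(div_texts)
```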

Using BeautifulSoup to grab all the HTML between two tags

I have some HTML that looks like this:
<h1>Title</h1>
//a random amount of p/uls or tagless text
<h1> Next Title</h1>
I want to copy all of the HTML from the first h1, to the next h1. How can I do this?
This is the clear BeautifulSoup way, when the second h1 tag is a sibling of the first:
html = ""
for tag in soup.find("h1").next_siblings:
    if tag.name == "h1":
        break
    else:
        html += str(tag)
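The same sibling walk can be written compactly with itertools.takewhile; this is a sketch assuming BeautifulSoup 4 and the built-in html.parser (NavigableStrings have .name set to None, so they pass the predicate too):

```python
import itertools
from bs4 import BeautifulSoup

html = "<h1>Title</h1><p>a paragraph</p><ul><li>item</li></ul><h1>Next Title</h1>"
soup = BeautifulSoup(html, "html.parser")
first = soup.find("h1")
# Take siblings until the next <h1> is reached.
between = "".join(str(t) for t in itertools.takewhile(
    lambda t: t.name != "h1", first.next_siblings))
print(between)
```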
I have the same problem. Not sure if there is a better solution, but what I've done is use regular expressions to get the indices of the two nodes that I'm looking for. Once I have that, I extract the HTML between the two indexes and create a new BeautifulSoup object.
Example:
m = re.search(r'<h1>Title</h1>.*?<h1>', html, re.DOTALL)
s = m.start()
e = m.end() - len('<h1>')
target_html = html[s:e]
new_bs = BeautifulSoup(target_html)
Interesting question. There is no way to select it using just the DOM. You'll have to loop through all elements preceding the first h1 (inclusive) and put them into intro = str(intro), then get everything up to the 2nd h1 into chapter1. Then remove the intro from chapter1 using
chapter = chapter1.replace(intro, '')
Here is a complete, up-to-date solution:
Contents of temp.html:
<h1>Title</h1>
<p>hi</p>
//a random amount of p/uls or tagless text
<h1> Next Title</h1>
Code:
import copy
from bs4 import BeautifulSoup

with open("resources/temp.html") as file_in:
    soup = BeautifulSoup(file_in, "lxml")
print(f"Before:\n{soup.prettify()}")

first_header = soup.find("body").find("h1")
siblings_to_add = []
for curr_sibling in first_header.next_siblings:
    if curr_sibling.name == "h1":
        for curr_sibling_to_add in siblings_to_add:
            curr_sibling.insert_after(curr_sibling_to_add)
        break
    else:
        siblings_to_add.append(copy.copy(curr_sibling))
print(f"\nAfter:\n{soup.prettify()}")
Output:
Before:
<html>
<body>
<h1>
Title
</h1>
<p>
hi
</p>
//a random amount of p/uls or tagless text
<h1>
Next Title
</h1>
</body>
</html>
After:
<html>
<body>
<h1>
Title
</h1>
<p>
hi
</p>
//a random amount of p/uls or tagless text
<h1>
Next Title
</h1>
//a random amount of p/uls or tagless text
<p>
hi
</p>
</body>
</html>
