How can I add a meta tag just after the title tag in an HTML page using the Beautiful Soup library? I am coding in Python and unable to do this.
Use soup.new_tag() to create a new <meta> tag, set attributes on it, and append it to your document's <head>.
metatag = soup.new_tag('meta')
metatag.attrs['http-equiv'] = 'Content-Type'
metatag.attrs['content'] = 'text/html'
soup.head.append(metatag)
Demo:
>>> from bs4 import BeautifulSoup
>>> soup = BeautifulSoup('''\
... <html><head><title>Hello World!</title>
... </head><body>Foo bar</body></html>
... ''')
>>> metatag = soup.new_tag('meta')
>>> metatag.attrs['http-equiv'] = 'Content-Type'
>>> metatag.attrs['content'] = 'text/html'
>>> soup.head.append(metatag)
>>> print(soup.prettify())
<html>
<head>
<title>
Hello World!
</title>
<meta content="text/html" http-equiv="Content-Type"/>
</head>
<body>
Foo bar
</body>
</html>
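Note that append() puts the tag at the end of <head>, which only happens to land right after <title> here. If the head contains other tags as well, insert_after() on the title targets the position directly. A small sketch:

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup(
    '<html><head><title>Hello World!</title>'
    '<link href="favicon.ico" rel="icon"/></head><body></body></html>',
    'html.parser')

metatag = soup.new_tag('meta')
metatag.attrs['http-equiv'] = 'Content-Type'
metatag.attrs['content'] = 'text/html'

# place the tag immediately after <title>, not at the end of <head>
soup.head.title.insert_after(metatag)
```

After this, the head's children are <title>, <meta>, <link> in that order, regardless of what else the head contains.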
I am trying to build a function along the following lines:
import bs4

def replace(html: str, selector: str, old: str, new: str) -> str:
    soup = bs4.BeautifulSoup(html)      # likely complete HTML document
    old_soup = bs4.BeautifulSoup(old)   # can contain HTML tags etc
    new_soup = bs4.BeautifulSoup(new)   # can contain HTML tags etc
    for selected in soup.select(selector):
        ### pseudo-code start
        for match in selected.find_everything(old_soup):
            match.replace_with(new_soup)
        ### pseudo-code end
    return str(soup)
I want to be able to replace an arbitrary HTML subtree below a CSS selector within a full HTML document with another arbitrary HTML subtree. selector, old and new are read as strings from a configuration file.
My document could look as follows:
before = r"""<!DOCTYPE html>
<html>
<head>
<title>No target here</head>
</head>
<body>
<h1>This is the target!</h1>
<p class="target">
Yet another <b>target</b>.
</p>
<p>
<!-- Comment -->
Foo target Bar
</p>
</body>
</html>
"""
This is supposed to work:
after = replace(
    html = before,
    selector = 'body',  # from config text file
    old = 'target',  # from config text file
    new = '<span class="special">target</span>',  # from config text file
)
assert after == r"""<!DOCTYPE html>
<html>
<head>
<title>No target here</head>
</head>
<body>
<h1>This is the <span class="special">target</span>!</h1>
<p class="target">
Yet another <b><span class="special">target</span></b>.
</p>
<p>
<!-- Comment -->
Foo <span class="special">target</span> Bar
</p>
</body>
</html>
"""
A plain str.replace does not work because "target" can appear literally anywhere. I briefly considered doing this with a regular expression; I have to admit that I did not succeed, but I'd be happy to see that approach work. Currently, I think my best chance is to use BeautifulSoup.
I understand how to swap a specific tag. I can also replace specific text etc. However, I am really failing to replace an "arbitrary HTML subtree", as in I want to replace some HTML with some other HTML in a sane manner. In this context, I want to treat old and new really as HTML, so if old is simply a "word" that does also appear for instance in a class name, I really only want to replace it if it is content in the document, but not if it is a class name as shown above.
Any ideas how to do this?
The solution below works in three parts:
All matches of selector from html are discovered.
Then, each match (as a soup object) is recursively traversed and every child is matched against old.
If the child object is equivalent to old, then it is extracted and new is inserted into the original match at the same index as the child object.
import bs4
from bs4 import BeautifulSoup as soup

def replace(html: str, selector: str, old: str, new: str) -> str:
    def update_html(d: soup, old: soup) -> None:
        i = 0
        while (c := getattr(d, 'contents', [])[i:]):
            if isinstance((a := c[0]), bs4.element.NavigableString) and str(old) in str(a):
                a.extract()
                for j, k in enumerate((l := str(a).split(str(old)))):
                    i += 1
                    d.insert(i, soup(k, 'html.parser'))
                    if j + 1 != len(l):
                        i += 1
                        d.insert(i, soup(new, 'html.parser'))
            elif a == old:
                a.extract()
                d.insert(i, soup(new, 'html.parser'))
                i += 1
            else:
                update_html(a, old)
                i += 1
    source, o = [soup(j, 'html.parser') for j in [html, old]]
    for i in source.select(selector):
        update_html(i, o.contents[0])
    return str(source)
after = replace(
    html = before,
    selector = 'body',  # from config text file
    old = 'target',  # from config text file
    new = '<span class="special">target</span>',  # from config text file
)
print(after)
Output:
<!DOCTYPE html>
<html>
<head>
<title>No target here</title></head>
<body>
<h1>This is the <span class="special">target</span>!</h1>
<p class="target">
Yet another <b><span class="special">target</span></b>.
</p>
<p>
<!-- Comment -->
Foo <span class="special">target</span> Bar
</p>
</body>
</html>
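An alternative sketch of the same idea: BeautifulSoup's string search only matches text nodes, never attribute values such as class="target", so find_all(string=...) plus replace_with() covers the case where old is plain text with much less bookkeeping. This assumes old is plain text and a bs4 version >= 4.10 (replace_with with multiple arguments):

```python
import re
from bs4 import BeautifulSoup, Comment

def replace_text(html: str, selector: str, old: str, new: str) -> str:
    soup = BeautifulSoup(html, 'html.parser')
    for selected in soup.select(selector):
        # string= searches text nodes only, so class names are never touched
        for node in selected.find_all(string=re.compile(re.escape(old))):
            if isinstance(node, Comment):
                continue  # leave comment contents alone
            # parse the rewritten text so `new` becomes real tags, not escaped text
            fragment = BeautifulSoup(str(node).replace(old, new), 'html.parser')
            node.replace_with(*list(fragment.contents))
    return str(soup)

out = replace_text(
    '<body><p class="target">Yet another <b>target</b>.</p></body>',
    'body', 'target', '<span class="special">target</span>')
```

Here out keeps class="target" intact while the text node inside <b> is rewrapped, which matches the behavior asked for above.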
I'm trying to automate a process where I take a snapshot every day but change the filename to that date. For example, I'd like to reference today's file as "20200219 snapshot.png" and change it to "20200220 snapshot.png" tomorrow. The problem is, I can't use the variable filename after the img src and have to put in the hardcoded exact string.
import datetime

date = datetime.date.today().strftime('%Y%m%d')
filename = date + " snapshot.png"
html = """\
<html>
<head></head>
<body>
<img src="Directory/snapshot.png"/>
</body>
</html>
"""
You can use ElementTree to parse the HTML DOM and the find method to search for the img tag. Then you can assign the src attribute value. The attributes are returned as a dict via the attrib attribute, and you just need to look for the 'src' key:
import datetime
import xml.etree.ElementTree as et

date = datetime.datetime.now().strftime('%Y%m%d')
filename = date + " snapshot.png"

html = """\
<html>
 <head></head>
 <body>
 <img src="Directory/snapshot.png"/>
 </body>
</html>
"""
tree = et.fromstring(html)
image_attributes = tree.find('body/img').attrib
for k in image_attributes.keys():
    if 'src' in k:
        image_attributes[k] = filename
html_new = et.tostring(tree)
print(html_new)
Output:
b'<html>\n <head />\n <body>\n <img src="20200220 snapshot.png" />\n </body>\n</html>'
To pretty print this output, you can use the method provided in official docs here and just do:
et.dump(tree)
Output:
<html>
 <head />
 <body>
 <img src="20200220 snapshot.png" />
 </body>
</html>
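Since attrib is a plain dict, the loop can also be replaced by Element.set(), which assigns the attribute in one call (a minimal sketch; remember that ElementTree only accepts well-formed XML/XHTML):

```python
import xml.etree.ElementTree as et

html = '<html><head></head><body><img src="Directory/snapshot.png"/></body></html>'
tree = et.fromstring(html)
# set() creates or overwrites the attribute on the matched element
tree.find('body/img').set('src', '20200220 snapshot.png')
html_new = et.tostring(tree, encoding='unicode')
```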
Just make it an f-string: precede the string literal with f and put your variable between {} inside the string.
import datetime
date = datetime.datetime.now().strftime('%Y%m%d')
filename = date + " snapshot.png"
html = f"""\
<html>
<head></head>
<body>
<img src="Directory/{filename}"/>
</body>
</html>
"""
print(html)
Or use simple string concatenation instead
import datetime

date = datetime.datetime.now().strftime('%Y%m%d')
filename = date + " snapshot.png"

html = """\
<html>
<head></head>
<body>
<img src="Directory/"""
html += filename
html += """\"/>
</body>
</html>
"""
print(html)
In the following example I am expecting to get Foo for the <h2> text:
from io import StringIO
from html5lib import HTMLParser
fp = StringIO('''
<!DOCTYPE html>
<html xmlns="http://www.w3.org/1999/xhtml">
<body>
<h2>
<span class="section-number">1. </span>
Foo
<a class="headerlink" href="#foo">¶</a>
</h2>
</body>
</html>
''')
etree = HTMLParser(namespaceHTMLElements=False).parse(fp)
h2 = etree.findall('.//h2')[0]
h2.text
Unfortunately I get ''. Why?
Strangely, Foo is in the text:
>>> list(h2.itertext())
['1. ', 'Foo', '¶']
>>> h2.getchildren()
[<Element 'span' at 0x7fa54c6a1bd8>, <Element 'a' at 0x7fa54c6a1c78>]
>>> [node.text for node in h2.getchildren()]
['1. ', '¶']
So where is Foo?
I think you are one level too shallow in the tree: in the etree data model, an element's .text only covers the text before its first child, and text that follows a child's closing tag is stored in that child's .tail. "Foo" is therefore the tail of the <span>. Try this:
from io import StringIO
from html5lib import HTMLParser
fp = StringIO('''
<!DOCTYPE html>
<html xmlns="http://www.w3.org/1999/xhtml">
<body>
<h2>
<span class="section-number">1. </span>
Foo
<a class="headerlink" href="#foo">¶</a>
</h2>
</body>
</html>
''')
etree = HTMLParser(namespaceHTMLElements=False).parse(fp)
etree.findall('.//h2')[0][0].tail
More generally, to crawl all text and tail, try a loop like this:
for u in etree.findall('.//h2')[0]:
    print(u.text, u.tail)
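The .text/.tail split is standard ElementTree behavior, which html5lib's etree output follows. A minimal ElementTree-only illustration:

```python
import xml.etree.ElementTree as et

h2 = et.fromstring('<h2><span>1. </span>Foo<a>link</a></h2>')
# no text between <h2> and <span>, so the parent's .text is None
assert h2.text is None
# "Foo" follows </span>, so it is stored as the span's .tail
assert h2[0].tail == 'Foo'
# itertext() stitches text and tails back together in document order
assert ''.join(h2.itertext()) == '1. Foolink'
```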
Using lxml:
fp2 = '''
<!DOCTYPE html>
<html xmlns="http://www.w3.org/1999/xhtml">
<body>
<h2>
<span class="section-number">1. </span>
Foo
<a class="headerlink" href="#foo">¶</a>
</h2>
</body>
</html>
'''
import lxml.html
tree = lxml.html.fromstring(fp2)
for item in tree.xpath('//h2'):
    target = item.text_content().strip()
    print(target.split('\n')[1].strip())
Output:
Foo
See this HTML code:
<html>
<body>
<p class="fixedfonts">
LINK1
</p>
<h2>Results</h2>
<p class="fixedfonts">
LINK2
</p>
<p class="fixedfonts">
LINK3
</p>
</body>
</html>
It contains 3 links. However, I need to retrieve only the links that appear after the title Results.
I am using python with BeautifulSoup:
from bs4 import BeautifulSoup, SoupStrainer

# at this point html contains the code as string
# parse the HTML file
soup = BeautifulSoup(html.replace('\n', ''), parse_only=SoupStrainer('a'))

# kill all script and style elements
for script in soup(["script", "style"]):
    script.extract()  # rip it out

links = list()
for link in soup:
    if link.has_attr('href'):
        links.append(link['href'].replace('%20', ' '))
print(links)
With the presented code I get all the links in the document, but as I said, I only need those that come after the Results title.
Guidance is appreciated
You can solve that using the find_all_next() method:
results = soup.find("h2", text="Results")
for link in results.find_all_next("a"):
    print(link.get("href"))
Demo:
>>> from bs4 import BeautifulSoup
>>>
>>> data = """
... <html>
... <body>
... <p class="fixedfonts">
...     <a href="A.pdf">LINK1</a>
... </p>
...
... <h2>Results</h2>
...
... <p class="fixedfonts">
...     <a href="B.pdf">LINK2</a>
... </p>
...
... <p class="fixedfonts">
...     <a href="C.pdf">LINK3</a>
... </p>
... </body>
... </html>"""
>>>
>>> soup = BeautifulSoup(data, "html.parser")
>>> results = soup.find("h2", text="Results")
>>> for link in results.find_all_next("a"):
...     print(link.get("href"))
...
B.pdf
C.pdf
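If, as here, the heading and the paragraphs are siblings, a CSS sibling combinator in select() expresses the same filter declaratively (a sketch; select() relies on the soupsieve package bundled with modern BeautifulSoup):

```python
from bs4 import BeautifulSoup

data = '''<body>
<p class="fixedfonts"><a href="A.pdf">LINK1</a></p>
<h2>Results</h2>
<p class="fixedfonts"><a href="B.pdf">LINK2</a></p>
<p class="fixedfonts"><a href="C.pdf">LINK3</a></p>
</body>'''

soup = BeautifulSoup(data, 'html.parser')
# "h2 ~ p" matches only <p> siblings that come after the <h2>
links = [a['href'] for a in soup.select('h2 ~ p a')]
```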
Split the html data into two parts, before and after "Results", then process the part after it:
data = html.split("Results")
need = data[1]
So just implement that:
from bs4 import BeautifulSoup, SoupStrainer
data = html.split("Results")
need = data[1]
soup = BeautifulSoup(need.replace('\n', ''), parse_only=SoupStrainer('a'))
Tested and seemed to work.
from bs4 import BeautifulSoup, SoupStrainer
html = '''<html>
<body>
<p class="fixedfonts">
LINK1
</p>
<h2>Results</h2>
<p class="fixedfonts">
LINK2
</p>
<p class="fixedfonts">
LINK2
</p>
<p class="fixedfonts">
LINK3
</p>
</body>
</html>'''
# at this point html contains the code as string
# parse the HTML file
dat = html.split("Results")
need = dat[1]
soup = BeautifulSoup(html.replace('\n', ''), parse_only=SoupStrainer('a'))

# kill all script and style elements
for script in soup(["script", "style"]):
    script.extract()  # rip it out

links = list()
for link in soup:
    if link.has_attr('href'):
        links.append(link['href'].replace('%20', ' '))

n_links = list()
for i in set(links):
    if need.count(i) > 0:
        for x in range(1, need.count(i) + 1):
            n_links.append(i)
print(n_links)
I have a template like this:
<html><body><div id="here"></div></body></html>
and an input HTML like this
<html><body>COMPLEX HTML</body></html>
where COMPLEX HTML is a lot of sub-tags (it's clean; it validates)
I'm trying to move just the HTML inside the body tag of the input HTML into the #here div in the template, to get this
<html><body><div id="here">COMPLEX HTML</div></body></html>
I tried:
t = BeautifulSoup("<html><body><div id=\"here\"></div></body></html>")
pc = t.find("div", id="here")
s = BeautifulSoup(open("complex.html"))

# this prints every tag in body
for b in s.body.contents:
    print(b.name)

# this prints only some of the tags
for b in s.body.contents:
    print(b.name)
    pc.append(b)
pc ends up with every other tag from s.body
It's as if appending b moves the iterator forward. How do I take HTML structure from one soup and put it in another?
You could do something like this:
from bs4 import BeautifulSoup
html = """<html><body><div id="here"></div></body></html>"""
soup = BeautifulSoup(html)
div = soup.find("div", id="here")
html2 = """<html><body><script src="//ajax.googleapis.com/ajax/libs/jquery/1.7.1/jquery.min.js"></script>
<script src="//cdn.sstatic.net/Js/stub.en.js?v=283ea58c715b"></script>
<link rel="stylesheet" type="text/css" href="//cdn.sstatic.net/stackoverflow/all.css?v=71d362e7c10c">
</body></html>"""
soup1 = BeautifulSoup(html2)
value = soup1.body.extract()
div.append(value)
print(div)
And the output is :
<div id="here"><body><script src="//ajax.googleapis.com/ajax/libs/jquery/1.7.1/jquery.min.js"></script>
<script src="//cdn.sstatic.net/Js/stub.en.js?v=283ea58c715b"></script>
<link href="//cdn.sstatic.net/stackoverflow/all.css?v=71d362e7c10c" rel="stylesheet" type="text/css">
</link></body></div>
If you want only the content inside the body, you can do this instead:
#the above same lines
soup1 = BeautifulSoup(html2)
value = soup1.body.extract()
div.append(value)
# replaces a tag with whatever’s inside that tag.
div.body.unwrap()
print(div)
And the output is :
<div id="here"><script src="//ajax.googleapis.com/ajax/libs/jquery/1.7.1/jquery.min.js"></script>
<script src="//cdn.sstatic.net/Js/stub.en.js?v=283ea58c715b"></script>
<link href="//cdn.sstatic.net/stackoverflow/all.css?v=71d362e7c10c" rel="stylesheet" type="text/css">
</link></div>
OK, so append(tag) removes the tag from its original structure, so in effect the next tag gets skipped (since you are altering the structure while iterating over it).
I used this
bc = soup.body.contents
while len(bc) > 0:
    pc.append(bc[0])
This still removes bc[0] from the body, but I am no longer depending on the structure staying unchanged during iteration.
This is ok for me, because I don't need the original soup.
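Another way around the same pitfall is to iterate over a snapshot of the contents list, so that moving a node out of the source soup doesn't skip its sibling. A sketch with a small stand-in document:

```python
from bs4 import BeautifulSoup

t = BeautifulSoup('<html><body><div id="here"></div></body></html>', 'html.parser')
pc = t.find('div', id='here')
s = BeautifulSoup('<html><body><p>one</p><p>two</p><p>three</p></body></html>',
                  'html.parser')

# list() copies the sequence, so append() moving nodes out of s.body
# no longer disturbs the iteration
for b in list(s.body.contents):
    pc.append(b)
```

After the loop, every child of the source body sits inside the target div and s.body is empty, with no tags skipped.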