I have an HTML file where I would like to insert a <meta> tag between the <head> and </head> tags using Python. If I open the file in append mode, how do I get to the relevant position where the <meta> tag is to be inserted?
Use BeautifulSoup. Here's an example where a meta tag is inserted right after the title tag using insert_after():
from bs4 import BeautifulSoup as Soup
html = """
<html>
<head>
<title>Test Page</title>
</head>
<body>
<div>test</div>
</html>
"""
soup = Soup(html)
title = soup.find('title')
meta = soup.new_tag('meta')
meta['content'] = "text/html; charset=UTF-8"
meta['http-equiv'] = "Content-Type"
title.insert_after(meta)
print(soup)
prints:
<html>
<head>
<title>Test Page</title>
<meta content="text/html; charset=UTF-8" http-equiv="Content-Type"/>
</head>
<body>
<div>test</div>
</body>
</html>
You can also find the head tag and use insert() with a specified position:
head = soup.find('head')
head.insert(1, meta)
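To apply the same idea to a file on disk, append mode is not needed: the file can be read, modified in memory, and written back. A minimal sketch, assuming Python 3 and the built-in html.parser; the function name add_meta_to_file and the demo file page.html are hypothetical:

```python
import os
import tempfile

from bs4 import BeautifulSoup


def add_meta_to_file(path):
    """Parse the HTML file at `path`, append a <meta> tag to <head>, rewrite the file."""
    with open(path, encoding="utf-8") as f:
        soup = BeautifulSoup(f.read(), "html.parser")

    meta = soup.new_tag("meta")
    meta["http-equiv"] = "Content-Type"
    meta["content"] = "text/html; charset=UTF-8"
    soup.head.append(meta)  # becomes the last child of <head>

    with open(path, "w", encoding="utf-8") as f:
        f.write(str(soup))


# Demo on a throwaway file (hypothetical content).
demo_path = os.path.join(tempfile.mkdtemp(), "page.html")
with open(demo_path, "w", encoding="utf-8") as f:
    f.write("<html><head><title>Test Page</title></head><body></body></html>")

add_meta_to_file(demo_path)

with open(demo_path, encoding="utf-8") as f:
    result = f.read()
print(result)
```

The sketch assumes the document already has a <head>; soup.head would be None otherwise.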
Also see:
Add parent tags with beautiful soup
How to append a tag after a link with BeautifulSoup
You need a markup manipulation library like the one in this link: Python and HTML Processing. I would advise against just opening the file and trying to append to it.
Hope it helps.
Related
I am using the python requests package to scrape a webpage. This is the code:
import requests
from bs4 import BeautifulSoup
# Configure Settings
url = "https://mangaabyss.com/read/"
comic = "the-god-of-pro-wrestling"
# Run Scraper
page = requests.get(url + comic + "/")
soup = BeautifulSoup(page.content, 'html.parser')
print(soup.prettify())
The url it uses is "https://mangaabyss.com/read/the-god-of-pro-wrestling/"
But in the output of soup, I only get the first div and no other child elements that are inside it.
This is the output I get:
<!DOCTYPE html>
<html lang="en">
<head>
<meta charset="utf-8"/>
<link href="/favicon.ico" rel="icon"/>
<meta content="width=device-width,initial-scale=1,minimum-scale=1,maximum-scale=1,viewport-fit=cover" name="viewport"/>
<meta content="#250339" name="theme-color"/>
<title>
MANGA ABYSS
</title>
<script crossorigin="" src="/assets/index.f4dc01fb.js" type="module">
</script>
<link href="/assets/index.9b4eb8b4.css" rel="stylesheet"/>
</head>
<body>
<div id="manga-mobile-app">
</div>
</body>
</html>
The content that I want to scrape is deep inside that div.
I am looking to extract the number of chapters.
This is the selector for it:
#manga-mobile-app > div > div.comic-info-component > div.page-normal.with-margin > div.comic-deatil-box.tab-content.a-move-in-right > div.comic-episodes > div.episode-header.f-clear > div.f-left > span
Can anyone help me where I'm going wrong?
The data is loaded from an external URL, so BeautifulSoup doesn't see it. You can use the requests module to simulate this call:
import json
import requests
slug = "the-god-of-pro-wrestling"
url = "https://mangaabyss.com/circinus/Manga.Abyss.v1/ComicDetail?slug="
data = requests.get(url + slug).json()
# uncomment to print all data:
# print(json.dumps(data, indent=4))
for ch in data["data"]["chapters"]:
    print(
        ch["chapter_name"],
        "https://mangaabyss.com/read/{}/{}".format(slug, ch["chapter_slug"]),
    )
Prints:
...
Chapter 4 https://mangaabyss.com/read/the-god-of-pro-wrestling/chapter-4
Chapter 3 https://mangaabyss.com/read/the-god-of-pro-wrestling/chapter-3
Chapter 2 https://mangaabyss.com/read/the-god-of-pro-wrestling/chapter-2
Chapter 1 https://mangaabyss.com/read/the-god-of-pro-wrestling/chapter-1
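Since the original goal was the number of chapters, the same response can simply be counted. A sketch that works on an already-parsed payload so it runs without network access; chapter_links is a hypothetical helper, and sample_payload is a hypothetical stand-in that only mirrors the keys used above (data, chapters, chapter_name, chapter_slug). The real response may contain more fields:

```python
def chapter_links(payload, slug):
    """Return (chapter_name, url) pairs from a ComicDetail-style payload."""
    base = "https://mangaabyss.com/read/{}/{}"
    return [
        (ch["chapter_name"], base.format(slug, ch["chapter_slug"]))
        for ch in payload["data"]["chapters"]
    ]


# Hypothetical stand-in for requests.get(url + slug).json()
sample_payload = {
    "data": {
        "chapters": [
            {"chapter_name": "Chapter 2", "chapter_slug": "chapter-2"},
            {"chapter_name": "Chapter 1", "chapter_slug": "chapter-1"},
        ]
    }
}

links = chapter_links(sample_payload, "the-god-of-pro-wrestling")
chapter_count = len(links)
print(chapter_count)
```

With the live response, the same helper could be called on the result of requests.get(url + slug).json().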
I am working on a webscraper that scrapes a website, does some stuff to the body of the website, and outputs that into a new html file. One of the features would be to take any hyperlinks in the html file and instead run a script where the link would be an input for the script.
I want to go from this..
<!DOCTYPE html>
<html>
<head>
<meta charset="UTF-8">
<title>Scraper</title>
</head>
<body>
<a href="/wiki/Mercury_poisoning" title="Mercury poisoning">
mercury poisoning
</a>
</body>
</html>
To this....
<!DOCTYPE html>
<html>
<head>
<meta charset="UTF-8">
<title>Scraper</title>
</head>
<body>
<a onclick ='pythonScript(/wiki/Mercury_poisoning)' href="#" title="Mercury poisoning">
mercury poisoning
</a>
</body>
</html>
I did a lot of googling and I read about jQuery and ajax but do not know these tools and would prefer to do this in python. Is it possible to do this using File IO in python?
You can do something like this using BeautifulSoup:
PS: You need to install BeautifulSoup: pip install beautifulsoup4
from bs4 import BeautifulSoup as bs
html = '''<!DOCTYPE html>
<html>
<head>
<meta charset="UTF-8">
<title>Scraper</title>
</head>
<body>
<a href="/wiki/Mercury_poisoning" title="Mercury poisoning">
mercury poisoning
</a>
</body>
</html>
'''
soup = bs(html, 'html.parser')
links = soup.find_all('a')
for link in links:
    actual_link = link['href']
    link['href'] = '#'
    link['onclick'] = 'pythonScript({})'.format(actual_link)
print(soup)
Output:
<html>
<head>
<meta charset="utf-8"/>
<title>Scraper</title>
</head>
<body>
<a href="#" onclick="pythonScript(/wiki/Mercury_poisoning)" title="Mercury poisoning">
mercury poisoning
</a>
</body>
</html>
Bonus:
You can also create a new HTML file like this:
with open('new_html_file.html', 'w') as out:
    out.write(str(soup))
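One caveat about the onclick value above: the path is passed unquoted, so JavaScript would not treat it as a string. A hedged variant, assuming pythonScript is a JavaScript function defined elsewhere on the page, that uses json.dumps to turn the original href into a quoted string literal:

```python
import json

from bs4 import BeautifulSoup

html = '<a href="/wiki/Mercury_poisoning" title="Mercury poisoning">mercury poisoning</a>'
soup = BeautifulSoup(html, "html.parser")

# href=True skips anchors that have no href attribute at all
for link in soup.find_all("a", href=True):
    target = link["href"]
    link["href"] = "#"
    # json.dumps wraps the path in double quotes and escapes it for JavaScript
    link["onclick"] = "pythonScript({})".format(json.dumps(target))

rewritten = str(soup)
print(rewritten)
```

Beautiful Soup takes care of quoting the attribute itself when the document is serialized, so the double quotes inside the value do not need manual escaping.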
I am having a problem and I cannot find a way to solve it. I am trying to parse an HTML page and then replace a string using Beautiful Soup. Although the process looks correct and I do not get any errors, when I open the new HTML page I get some UTF-8 characters inside that I do not want.
Sample of working code:
#!/usr/bin/python
import codecs
from bs4 import BeautifulSoup
html_sample = """
<!DOCTYPE html>
<html><head lang="en"><meta charset="UTF-8"><meta name="viewport" content="width=device-width, initial-scale=1"></head>
<body>
<div class="date">LAST UPDATE</div>
</body>
</html>
"""
try:
    my_soup = BeautifulSoup(html_sample.decode('utf-8'), 'html.parser') # html5lib or html.parser
    forecast = my_soup.find("div", {"class": "date"})
    forecast.tag = unicode(forecast).replace('LAST UPDATE', 'TEST')
    forecast.replace_with(forecast.tag)
    # print(my_soup.prettify())
    f = codecs.open('test.html', "w", encoding='utf-8')
    f.write(my_soup.prettify().encode('utf-8'))
    f.close()
except UnicodeDecodeError as e:
    print('Error, encoding/decoding: {}'.format(e))
except IOError as e:
    print('Error Replacing: {}'.format(e))
except RuntimeError as e:
    print('Error Replacing: {}'.format(e))
And the output with utf-8 characters in the new html page:
<!DOCTYPE html>
<html>
<head lang="en">
<meta charset="utf-8">
<meta content="width=device-width, initial-scale=1" name="viewport"/>
</meta>
</head>
<body>
<div class="date">TEST</div>
</body>
</html>
I think that I have mixed up the encoding and decoding process. Could someone with more knowledge in this area elaborate? I am a total beginner at coding and encoding.
Thank you for your time and effort in advance.
There is no need to get into encoding here. You can replace the text content of a Beautiful Soup element by setting the element.string as follows:
from bs4 import BeautifulSoup
html_sample = """
<!DOCTYPE html>
<html><head lang="en"><meta charset="UTF-8"><meta name="viewport" content="width=device-width, initial-scale=1"></head>
<body>
<div class="date">LAST UPDATE</div>
</body>
</html>
"""
soup = BeautifulSoup(html_sample)
forecast = soup.find("div", {"class": "date"})
forecast.string = 'TEST'
print(soup.prettify())
Output
<!DOCTYPE html>
<html>
<head lang="en">
<meta charset="utf-8"/>
<meta content="width=device-width, initial-scale=1" name="viewport"/>
</head>
<body>
<div class="date">
TEST
</div>
</body>
</html>
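If only part of the text should change, the text node itself can be replaced. The result can then be written with an explicit file encoding, with no manual encode/decode calls. A sketch, assuming Python 3; the sample text "LAST UPDATE: pending" and the temporary output path are hypothetical:

```python
import os
import tempfile

from bs4 import BeautifulSoup

html_sample = """<!DOCTYPE html>
<html><head lang="en"><meta charset="UTF-8"></head>
<body>
<div class="date">LAST UPDATE: pending</div>
</body>
</html>
"""

soup = BeautifulSoup(html_sample, "html.parser")
date_div = soup.find("div", class_="date")

# Swap the existing text node for a copy with the substring replaced
date_div.string.replace_with(date_div.string.replace("LAST UPDATE", "TEST"))

# Let open() handle the encoding; str(soup) is already a text string
out_path = os.path.join(tempfile.mkdtemp(), "test.html")
with open(out_path, "w", encoding="utf-8") as f:
    f.write(str(soup))

with open(out_path, encoding="utf-8") as f:
    written = f.read()
print(written)
```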
I've just started tinkering with scrapy in conjunction with BeautifulSoup and I'm wondering if I'm missing something very obvious but I can't seem to figure out how to get the doctype of a returned html document from the resulting soup object.
Given the following html:
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN" "http://www.w3.org/TR/html4/strict.dtd">
<html lang="en">
<head>
<meta charset=utf-8 />
<meta name="viewport" content="width=620" />
<title>HTML5 Demos and Examples</title>
<link rel="stylesheet" href="/css/html5demos.css" type="text/css" />
<script src="js/h5utils.js"></script>
</head>
<body>
<p id="firstpara" align="center">This is paragraph <b>one</b>
<p id="secondpara" align="blah">This is paragraph <b>two</b>.
</html>
Can anyone tell me if there's a way of extracting the declared doctype from it using BeautifulSoup?
Beautiful Soup 4 has a class for DOCTYPE declarations, so you can use that to extract all the declarations at top level (though you're no doubt expecting one or none!)
import bs4

def doctype(soup):
    # Collect every top-level node that is a DOCTYPE declaration
    items = [item for item in soup.contents if isinstance(item, bs4.Doctype)]
    return items[0] if items else None
You can go through top-level elements and check each to see whether it is a declaration. Then you can inspect it to find out what kind of declaration it is:
# BS is assumed here to be the name under which the Beautiful Soup module was imported
for child in soup.contents:
    if isinstance(child, BS.Declaration):
        declaration_type = child.string.split()[0]
        if declaration_type.upper() == 'DOCTYPE':
            declaration = child
You could just fetch the first item in soup contents:
>>> soup.contents[0]
u'DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN" "http://www.w3.org/TR/html4/strict.dtd"'
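For reference, here is a self-contained run of the Doctype-based approach under Beautiful Soup 4. This is a sketch assuming Python 3 and html.parser; the function name get_doctype and the shortened sample document are hypothetical:

```python
from bs4 import BeautifulSoup, Doctype


def get_doctype(soup):
    """Return the first top-level Doctype node, or None if there is none."""
    for item in soup.contents:
        if isinstance(item, Doctype):
            return item
    return None


html = (
    '<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN" '
    '"http://www.w3.org/TR/html4/strict.dtd">\n'
    "<html><head><title>HTML5 Demos and Examples</title></head>"
    "<body></body></html>"
)

found = get_doctype(BeautifulSoup(html, "html.parser"))
missing = get_doctype(BeautifulSoup("<p>no doctype here</p>", "html.parser"))

print(found)
print(missing)
```

Doctype is a string subclass, so the returned node can be searched or compared like ordinary text.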
With BeautifulSoup I get the HTML code of a site, let's say it's this:
<!DOCTYPE html>
<html>
<head>
</head>
<body>
<h1>My First Heading</h1>
<p>My first paragraph.</p>
</body>
</html>
How can I add this line body {background-color:#b0c4de;} inside the head tag using BeautifulSoup?
Let's say the Python code is:
#!/usr/bin/python
import cgi, cgitb, urllib2, sys
from bs4 import BeautifulSoup
site = "http://www.example.com"  # urlopen needs the scheme
page = urllib2.urlopen(site)
soup = BeautifulSoup(page)
You can use:
soup.head.append('body {background-color:#b0c4de;}')
But you should create a <style> tag first.
For instance:
head = soup.head
head.append(soup.new_tag('style', type='text/css'))
head.style.append('body {background-color:#b0c4de;}')
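The same steps can be tried without fetching a page. A minimal sketch, assuming Python 3 and html.parser, that builds the <style> tag first, sets its text, and then appends it to <head>:

```python
from bs4 import BeautifulSoup

html = """<!DOCTYPE html>
<html>
<head>
</head>
<body>
<h1>My First Heading</h1>
<p>My first paragraph.</p>
</body>
</html>
"""

soup = BeautifulSoup(html, "html.parser")

# Create the tag, give it the CSS rule as its text, then attach it to <head>
style = soup.new_tag("style")
style.string = "body {background-color:#b0c4de;}"
soup.head.append(style)

print(soup.head)
```

Building the tag before attaching it avoids looking it up again through head.style, which would return the first <style> in the head if one already exists.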