Error with Beautiful Soup

I need to extract the text inside the title tag from this source:
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html dir="ltr" lang="en">
<head>
<title>Microsoft to acquire Nokia’s devices & services business, license Nokia’s patents and mapping services</title>
<meta http-equiv="X-UA-Compatible" content="IE=EmulateIE9; IE=10" />
<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
<meta id="ctl00_WtCampaignId" name="DCSext.wt_linkid" />
</title>
I am using this to extract the text:
import urllib2
from bs4 import BeautifulSoup  # assumption: the question does not show its imports

opener = urllib2.build_opener()
opener.addheaders = [('User-agent', 'Mozilla/5.0')]
ourUrl = opener.open("http://www.thehindubusinessline.com/industry-and-economy/info-tech/nokia-cannot-license-brand-nokia-post-microsoft-deal/article5156470.ece").read()
soup = BeautifulSoup(ourUrl)
print soup
dem = soup.findAll('p')
hea = soup.findAll('title')
This code correctly extracts the p tags but fails when trying to extract the title. I have included only part of the code; the rest of it works fine.

There is an error in your HTML: it contains two </title> end tags:
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html dir="ltr" lang="en">
<head>
<title>Microsoft to acquire Nokia’s devices & services business, license Nokia’s patents and mapping services</title>
<meta http-equiv="X-UA-Compatible" content="IE=EmulateIE9; IE=10" />
<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
<meta id="ctl00_WtCampaignId" name="DCSext.wt_linkid" />
</title> <!-- stray end tag: <title> was already closed above -->
So the fixed code should look like this:
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html dir="ltr" lang="en">
<head>
<title>Microsoft to acquire Nokia’s devices & services business, license Nokia’s patents and mapping services</title>
<meta http-equiv="X-UA-Compatible" content="IE=EmulateIE9; IE=10" />
<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
<meta id="ctl00_WtCampaignId" name="DCSext.wt_linkid" />
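For comparison, here is a minimal Python 3 sketch. It assumes the beautifulsoup4 package is installed and uses the standard-library "html.parser" backend; the question itself uses Python 2 and does not show its imports. The markup is a shortened stand-in for the page source, and the paragraph text is made up for illustration. With this parser the stray </title> has no matching open tag and is skipped, so the real title stays reachable.

```python
from bs4 import BeautifulSoup

# Shortened stand-in for the page source, including the stray </title>.
html = """<html dir="ltr" lang="en">
<head>
<title>Microsoft to acquire Nokia's devices and services business</title>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
</title>
</head>
<body><p>First paragraph.</p><p>Second paragraph.</p></body>
</html>"""

# Naming the parser explicitly keeps the result the same across machines.
soup = BeautifulSoup(html, "html.parser")

titles = soup.find_all("title")  # list of <title> Tag objects
title_text = titles[0].get_text(strip=True) if titles else None
paragraphs = [p.get_text() for p in soup.find_all("p")]

print(title_text)
print(paragraphs)
```

Note that find_all returns Tag objects, so get_text() (or .string) is what yields the text itself.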

Related

HTML parser - How to copy/export a text code from some 500 html pages to another 500 pages with the same link address

I want to copy/export a block of code from some 500 HTML pages to another 500 pages that have the same file names but different content.
For example, the lines below from page-1.html must be copied to the file with the same name, page-1.html, in another folder, and likewise for all the other pages.
In short, I must copy everything before <body> from file-1.html (Folder-1) to file-1.html (Folder-2). Remember that the files' content is different; the only connection is the shared name.
<!-- START HERE -->
<?php
// Use API site scope.
define('RW_SDK__API_SCOPE', 'site');
$item_id = 1; // Replace that with your rating id.
$rating_class = 'page';
?>
<!DOCTYPE html>
<html xmlns="http://www.w3.org/1999/xhtml" dir="ltr" lang="en-US">
<head>
<title>My page 1</title>
<link rel="icon" href="https://my-website.com/love.ico" sizes="192x192" />
<meta http-equiv="X-UA-Compatible" content="IE=edge"/>
<link rel="stylesheet" type="text/css" title="main" href="cars.css"/>
<meta http-equiv="Content-Language" content="en"/>
<meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1"/>
<link rel="canonical" href="https://my-website/my-page-1.html" />
<meta name="resource-type" content="document"/>
<meta name="viewport" content="width=device-width, initial-scale=1.0, maximum-scale=1.0">
<meta name="distribution" content="global"/>
<meta http-equiv="Cache-control" content="public"/>
<link rel="alternate" type="application/rss+xml" title="Latest News" href="https://feeds.feedburner.com/my-website"/>
<meta name="description" content="My content"/>
<meta name="keywords" content="my, content"/>
<meta name="Robots" content="index,follow"/>
<meta name="googlebot" content="index,follow"/>
<meta name="expire" content="never"/>
<meta name="revisit-after" content="10 days"/>
<link rel="sitemap" type="application/rss+xml" href="rss.xml" />
<link rel="image_src" type="image/jpeg" href="https://my-website.com/icon-facebook.jpg" style="display:none"/>
<meta itemprop="image" content="https://my-website.com/icon-facebook.jpg"/>
<meta property="og:image" content="https://my-website.com/icon-facebook.jpg"/>
<meta property="og:type" content="article" />
<meta property="fb:app_id" content="721561911"/>
<meta property="fb:admins" content="716441"/>
<meta name="yandex-verification" content="6b7169b283c6c9cc" />
<meta property="og:title" content="My page 1" />
<!-- END HERE -->
</head>
<body>
...other tags...

There may be more optimal ways to do this, but the PowerShell logic below appears to do the trick.
It replaces everything before </head> in each destination file with the corresponding part of the source file that has the same name.
PowerShell
$src = Get-ChildItem -Path "C:\Folder1" -Filter "*.html";
$destFld = "C:\Folder2";
$src | % { Process {
If ( Test-Path "$destFld\$($_.Name)" ) {
Clear-Variable -Name ("a","b","y","z") -ErrorAction SilentlyContinue;
$z = Get-Content $_.FullName -Raw;
$y = "$((($z -split "</head>")[0]).Trim())`r`n";
$a = Get-Content "$destFld\$($_.Name)" -Raw;
$b = (($a -split "</head>")[1]).Trim();
$y | Out-File "$destFld\$($_.Name)";
"</head>" | Out-File "$destFld\$($_.Name)" -Append;
$b | Out-File "$destFld\$($_.Name)" -Append;
}
}};
Before and After Result Examples
File-1.html (used for update content)
<!-- START HERE -->
<?php
// Use API site scope.
define('RW_SDK__API_SCOPE', 'site');
$item_id = 1; // Replace that with your rating id.
$rating_class = 'page';
?>
<!DOCTYPE html>
<html xmlns="http://www.w3.org/1999/xhtml" dir="ltr" lang="en-US">
<head>
<title>My page 1</title>
<link rel="icon" href="https://my-website.com/love.ico" sizes="192x192" />
<meta http-equiv="X-UA-Compatible" content="IE=edge"/>
<link rel="stylesheet" type="text/css" title="main" href="cars.css"/>
<meta http-equiv="Content-Language" content="en"/>
<meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1"/>
<link rel="canonical" href="https://my-website/my-page-1.html" />
<meta name="resource-type" content="document"/>
<meta name="viewport" content="width=device-width, initial-scale=1.0, maximum-scale=1.0">
<meta name="distribution" content="global"/>
<meta http-equiv="Cache-control" content="public"/>
<link rel="alternate" type="application/rss+xml" title="Latest News" href="https://feeds.feedburner.com/my-website"/>
<meta name="description" content="My content"/>
<meta name="keywords" content="my, content"/>
<meta name="Robots" content="index,follow"/>
<meta name="googlebot" content="index,follow"/>
<meta name="expire" content="never"/>
<meta name="revisit-after" content="10 days"/>
<link rel="sitemap" type="application/rss+xml" href="rss.xml" />
<link rel="image_src" type="image/jpeg" href="https://my-website.com/icon-facebook.jpg" style="display:none"/>
<meta itemprop="image" content="https://my-website.com/icon-facebook.jpg"/>
<meta property="og:image" content="https://my-website.com/icon-facebook.jpg"/>
<meta property="og:type" content="article" />
<meta property="fb:app_id" content="721561911"/>
<meta property="fb:admins" content="716441"/>
<meta name="yandex-verification" content="6b7169b283c6c9cc" />
<meta property="og:title" content="My page 1" />
<!-- END HERE -->
</head>
<body>
...other tags...
File-2.html (before update)
<!-- START HERE -->
<?php
// Use API site scope.
define('RW_SDK__API_SCOPE', 'site');
$item_id = 1; // Replace that with your rating id.
$rating_class = 'page';
?>
<!DOCTYPE html>
<html xmlns="http://www.w3.org/1999/xhtml" dir="ltr" lang="en-US">
<head>
<title>My page 1</title>
<link rel="icon" href="https://my-website.com/hate.ico" sizes="192x192" />
<meta http-equiv="X-UA-Compatible" content="IE=edge"/>
<link rel="stylesheet" type="text/css" title="main" href="bars.css"/>
<meta http-equiv="Content-Language" content="en"/>
<meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1"/>
<link rel="canonical" href="https://my-website/my-page-2.html" />
<meta name="resource-type" content="document"/>
<meta name="viewport" content="width=device-width, initial-scale=1.0, maximum-scale=1.0">
<meta name="distribution" content="global"/>
<meta http-equiv="Cache-control" content="public"/>
<link rel="alternate" type="application/rss+xml" title="Latest News" href="https://feeds.fastlearner.com/my-mess"/>
<meta name="description" content="My content"/>
<meta name="keywords" content="no, content"/>
<meta name="Robots" content="index,follow"/>
<meta name="googlebot" content="index,follow"/>
<meta name="expire" content="always"/>
<meta name="revisit-after" content="2 days"/>
<link rel="sitemap" type="application/rss+xml" href="rss.xml" />
<link rel="image_src" type="image/jpeg" href="https://my-website.com/icon-myspace.jpg" style="display:none"/>
<meta itemprop="image" content="https://my-website.com/icon-myspace.jpg"/>
<meta property="og:image" content="https://my-website.com/icon-myspace.jpg"/>
<meta property="og:type" content="article" />
<meta property="fb:app_id" content="721561022"/>
<meta property="fb:admins" content="716552"/>
<meta name="yandex-verification" content="6b7169b283c6c8dd" />
<meta property="og:title" content="My page 1" />
<!-- END HERE -->
</head>
<body>
...other tags 2...
File-2.html (after update)
<?php
// Use API site scope.
define('RW_SDK__API_SCOPE', 'site');
$item_id = 1; // Replace that with your rating id.
$rating_class = 'page';
?>
<!DOCTYPE html>
<html xmlns="http://www.w3.org/1999/xhtml" dir="ltr" lang="en-US">
<head>
<title>My page 1</title>
<link rel="icon" href="https://my-website.com/love.ico" sizes="192x192" />
<meta http-equiv="X-UA-Compatible" content="IE=edge"/>
<link rel="stylesheet" type="text/css" title="main" href="cars.css"/>
<meta http-equiv="Content-Language" content="en"/>
<meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1"/>
<link rel="canonical" href="https://my-website/my-page-1.html" />
<meta name="resource-type" content="document"/>
<meta name="viewport" content="width=device-width, initial-scale=1.0, maximum-scale=1.0">
<meta name="distribution" content="global"/>
<meta http-equiv="Cache-control" content="public"/>
<link rel="alternate" type="application/rss+xml" title="Latest News" href="https://feeds.feedburner.com/my-website"/>
<meta name="description" content="My content"/>
<meta name="keywords" content="my, content"/>
<meta name="Robots" content="index,follow"/>
<meta name="googlebot" content="index,follow"/>
<meta name="expire" content="never"/>
<meta name="revisit-after" content="10 days"/>
<link rel="sitemap" type="application/rss+xml" href="rss.xml" />
<link rel="image_src" type="image/jpeg" href="https://my-website.com/icon-facebook.jpg" style="display:none"/>
<meta itemprop="image" content="https://my-website.com/icon-facebook.jpg"/>
<meta property="og:image" content="https://my-website.com/icon-facebook.jpg"/>
<meta property="og:type" content="article" />
<meta property="fb:app_id" content="721561911"/>
<meta property="fb:admins" content="716441"/>
<meta name="yandex-verification" content="6b7169b283c6c9cc" />
<meta property="og:title" content="My page 1" />
<!-- END HERE -->
</head>
<body>
...other tags 2...
Supporting Resources
ForEach-Object
Standard Aliases for Foreach-Object: the '%' symbol, ForEach
If()
Split()
Trim()
Clear-Variable
Get-Content
Out-File
About Special Characters
a. `n : New line
b. `r : Carriage return
Where b. and a. : CRLF EOL
How-to: Change the line endings of a text file

Another solution in PowerShell, easy to understand, uses the regex \A(.*)[\s\S]+(<body>), which matches everything from the beginning of the file up to and including <body>:
$sourceFiles = Get-ChildItem 'c:\Folder1'
$destinationFolder = 'c:\Folder2'
foreach ($file in $sourceFiles) {
$sourceContent = Get-Content $file.FullName -Raw
$contentToInsert = [regex]::match($sourceContent,"\A(.*)[\s\S]+(<body>)").value
$destinationContent = Get-Content $destinationFolder\$($file.Name) -Raw
$destinationContent = $destinationContent -replace '\A(.*)[\s\S]+(<body>)',$contentToInsert
Set-Content -Path $destinationFolder\$($file.Name) -Value $destinationContent -Encoding UTF8
} #end foreach file
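The same idea can be sketched in Python with only the standard library. This is an illustration under stated assumptions, not a drop-in replacement: the function name copy_prefix and its parameters are hypothetical, the cut is made at the first <body> tag (the regex above is greedy and cuts at the last one), and files are handled as bytes so the original encoding and line endings are left alone.

```python
from pathlib import Path

MARKER = b"<body>"  # everything before this tag is taken from the source file


def copy_prefix(source_dir, dest_dir, marker=MARKER):
    """Replace the part before `marker` in each destination file with the
    matching part of the same-named source file. Returns the updated names."""
    updated = []
    for src in sorted(Path(source_dir).glob("*.html")):
        dst = Path(dest_dir) / src.name
        if not dst.is_file():
            continue  # no same-named counterpart in the destination folder
        src_bytes = src.read_bytes()
        dst_bytes = dst.read_bytes()
        src_cut = src_bytes.find(marker)
        dst_cut = dst_bytes.find(marker)
        if src_cut == -1 or dst_cut == -1:
            continue  # leave files without the marker untouched
        # Source prefix + destination remainder (which starts at the marker).
        dst.write_bytes(src_bytes[:src_cut] + dst_bytes[dst_cut:])
        updated.append(src.name)
    return updated
```

A call such as copy_prefix(r"C:\Folder1", r"C:\Folder2") would mirror the folder names used above; running it on a copy of the destination folder first is a sensible precaution.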

This is why a template engine is worth using before maintaining 500 HTML pages: all the header logic lives in a single file, and the page-specific parts live elsewhere.
As for the HTML parser: you can use any language to parse those 500 HTML pages and then generate the others. For example, there is this post where the author explains how to parse a website and export it to other formats; you could try exporting to HTML.

How to add className to body tag

I am trying to do something like the following in Plotly:
<head>
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
<meta charset="utf-8">
<meta http-equiv="X-UA-Compatible" content="IE=edge">
<meta name="viewport" content="width=device-width, initial-scale=1">
<link rel="icon" href="images/favicon.ico" type="image/ico" />
</head>
<body class="nav-md">
</body>
Specifically, I want to define a className for the body tag and add some meta info to the head tag. Could anyone please explain how I can accomplish this?

If your HTML content isn't static or if you would like to introspect or modify the templated variables, then you can override the Dash.interpolate_index method.
https://dash.plotly.com/external-resources
import dash
import dash_html_components as html


class CustomDash(dash.Dash):
    def interpolate_index(self, **kwargs):
        return """
        <!DOCTYPE html>
        <html>
            <head>
                <meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
                <meta charset="utf-8">
                <meta http-equiv="X-UA-Compatible" content="IE=edge">
                <meta name="viewport" content="width=device-width, initial-scale=1">
                <link rel="icon" href="images/favicon.ico" type="image/ico" />
            </head>
            <body class='nav-md'>
                {app_entry}
                {config}
                {scripts}
                {renderer}
            </body>
        </html>
        """.format(
            app_entry=kwargs["app_entry"],
            config=kwargs["config"],
            scripts=kwargs["scripts"],
            renderer=kwargs["renderer"],
        )


app = CustomDash()
app.layout = html.P("Hello World")

if __name__ == "__main__":
    app.run_server(debug=True)

BeautifulSoup changes > to &gt;

I need to edit some existing html files, using BeautifulSoup. A problem appears when the DOCTYPE includes an ATTLIST element.
Here's a minimal example.
from bs4 import BeautifulSoup
doc = """
<?xml version='1.0' encoding='UTF-8'?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
"http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd"
[<!ATTLIST span bodyref CDATA #IMPLIED>]>
<html xml:lang="en" xmlns="http://www.w3.org/1999/xhtml">
<head>
<meta content="application/xhtml+xml; charset=utf-8" http-equiv="Content-type"/>
<meta content="CA43667" name="dc:identifier"/>
</head>
</html>
"""
soup = BeautifulSoup(doc, features='html.parser')
print(soup.prettify())
The output is
<?xml version='1.0' encoding='UTF-8'?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
"http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd"
[<!ATTLIST span bodyref CDATA #IMPLIED>
]&gt;
<html xml:lang="en" xmlns="http://www.w3.org/1999/xhtml">
<head>
<meta content="application/xhtml+xml; charset=utf-8" http-equiv="Content-type"/>
<meta content="CA43667" name="dc:identifier"/>
</head>
</html>
As seen, the last '>' of DOCTYPE turns into an entity.
With
print(soup.prettify(formatter=None))
I get
<?xml version='1.0' encoding='UTF-8'?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
"http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd"
[<!ATTLIST span bodyref CDATA #IMPLIED>
]>
<html xml:lang="en" xmlns="http://www.w3.org/1999/xhtml">
<head>
<meta content="application/xhtml+xml; charset=utf-8" http-equiv="Content-type">
<meta content="CA43667" name="dc:identifier">
</head>
</html>
Now the DOCTYPE is fine, but the ending slashes in the "meta" elements disappear, and the document won't validate on our system. Other formatter options don't seem to work either.
Any solution for this?

Are you running the latest version of BeautifulSoup? I think you just need to update it, or the installation may be broken. Try this on your command line:
pip uninstall beautifulsoup4
pip install beautifulsoup4
When I run this:
from bs4 import BeautifulSoup
doc = """
<?xml version='1.0' encoding='UTF-8'?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
"http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd"
[<!ATTLIST span bodyref CDATA #IMPLIED>]>
<html xml:lang="en" xmlns="http://www.w3.org/1999/xhtml">
<head>
<meta content="application/xhtml+xml; charset=utf-8" http-equiv="Content-type"/>
<meta content="CA43667" name="dc:identifier"/>
</head>
</html>
"""
soup = BeautifulSoup(doc, features='html.parser')
print(soup.prettify(formatter=None))
the output is:
<?xml version='1.0' encoding='UTF-8'?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
"http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd"
[<!ATTLIST span bodyref CDATA #IMPLIED>
]>
<html xml:lang="en" xmlns="http://www.w3.org/1999/xhtml">
<head>
<meta content="application/xhtml+xml; charset=utf-8" http-equiv="Content-type"/>
<meta content="CA43667" name="dc:identifier"/>
</head>
</html>
I believe this is what you're looking for. I've tested it in an online IDE too, and the result matches my computer. Here is a link: https://onlinegdb.com/HyzXahzAE

Parsing HTML and json both using python

When we access a profile on a site like LinkedIn, the response contains both HTML and JSON text:
https://www.linkedin.com/in/aaron-jacobs-3b513261/
The useful data I want to grab is in JSON format. How can I parse it with
json.loads(data)
while ignoring the HTML part?
<!DOCTYPE html>
<html lang="en">
<head>
<script type="text/javascript" src="https://gc.kis.v2.scr.kaspersky-labs.com/9E1E45EF-3F97-184C-B471-44EF675548EA/main.js" charset="UTF-8"></script><link rel="stylesheet" crossorigin="anonymous" href="https://gc.kis.v2.scr.kaspersky-labs.com/AE845576FE44-174B-C481-79F3-FE54E1E9/abn/main.css"/><script type="application/javascript">!function(i,n){void 0!==i.addEventListener&&void 0!==i.hidden&&(n.liVisibilityChangeListener=function(){i.hidden&&(n.liHasWindowHidden=!0)},i.addEventListener("visibilitychange",n. liVisibilityChangeListener))}(document,window);</script>
<meta charset="utf-8">
<meta http-equiv="X-UA-Compatible" content="IE=edge">
<title>LinkedIn</title>
<meta name="description" content="">
<meta name="viewport" content="width=device-width, initial-scale=1.0, minimum-scale=1.0">
<meta name="theme-color" content="#0077B5">
<!---Data is Here--->
<meta name="extended/config/environment" content="%7B%22modulePrefix%22%3A%22extended%22%2C%22environment%22%3A%22production%22%2C%22datawarm%22%3A%7B%22enabled%22%3Atrue%7D%2C%22lix%22%3A%7B%22tests%22%3A%5B%22voyager.web.feed.authorDisabledComments%22%2C%22voyager.web.feed.badge%22%2C%22voyager.web.feed.channelUpdates%22%2C%22voyager.web.feed.comment-article%22%2C%22voyager.web.feed.comment-image%22%2C%22voyager.web.feed.connectionUpdates%22%2C%22voyager.web.feed.deleteReportModal%22%2C%22voyager.web.feed.feedBadgeCountInTotal%22%2C%22voyager.web.feed.followRecommendationUpdates%22%2C%22voyager.web.feed.lyndaUpdates%22%2C%22voyager.web.feed.index.disable-top-text-ad%22%2C%22voyager.web.feed.mentionedInNewsUpdate%22%2C%22voyager.web.feed.promos%22%2C%22voyager.web.feed.updateIndicatorThreshold%22%2C%22voyager.feed.web.share-via-control-panel%22%2C%22voyager.feed.web.dropshadow-on-article-box%22%2C%22voyager.feed.web.hoverable-link-text%22%2C%22voyager.feed.web.enable-sort-toggle%22%2C%22voyager.feed.web.disable-identity-module-stickiness%22%2C%22voyager.feed.client.photo-upload%22%2C%22voyager.feed.client.su-actor-subheadline%22%2C%22voyager.feed.client.su-commentary%22%2C%22voyager.feed.client.su-follow-button%22%2C%22voyager.feed.client.su-highlight-comment-see-more%22%2C%22voyager.feed.client.su-ad-choice%22%2C%22voyager.feed.client.update.cp.enabled%22%2C%22voyager.feed.client.update.cp.report.enabled%22%2C%22voyager.feed.like-on-comment%22%2C%22voyager.feed.reply-on-comment%22%2C%22voyager.web.feed.enable-share-as-message%22%2C%22voyager.me.web.content_analytics_feed_entry_shares%22%2C%22voyager.feed.web.full-width-images%22%2C%22voyager.feed.web.max-small-image-width%22%2C%22neptune.feed.web.max-small-image-width%22%2C%22publishin%22%2C%22voyager.feed.web.sharing.twitter-visibility%22%2C%22voyager.feed.web.sharing.hide-url-input%22%2C%22voyager.feed.web.extended.sharing.subaction-bar-rounded-button-theme%22%2C%22voyager.feed.web.sharing.increase-char-limit%22
%2C %22voyager.web.feed.sponsoredUpdateTracking%22%2C%22voyager.feed.client.hashtags%22%2C%22voyager.search.client.vertical-nav%22%2C%22voyager.search.web.postsVertical%22%2C%22voyager.search.web.right-rail-news-module%22%2C%22voyager.feed.web.rich-media.hide-reshare-button%22%2C%22voyager.web.feed.rmv.hideDetailAndActions%22%2C%22neptune.jobs.enabledNeptune%22%2C%22voyager.web.feed.occlusion-culling%22%2C%22voyager.web.prefetch-lazy-images%22%2C%22voyager.web.feed.editors-pick%22%2C%22voyager.web.feed.right-rail.follow-recommendations%22%2C%22voyager.web.feed.use-composition%22%2C%22voyager.web.feed.follow-page%22%2C%22voyager.web.feed.initial-fetch-update-count%22%2C%22voyager.feed.video.expand.support%22%2C%22voyager.feed.video.autoplay.support%22%2C%22voyager.feed.video.heartbeat.interval%22%2C%22voyager.feed.web.video-upload%22%2C%22voyager.feed.web.video-upload.duration-limit%22%2C%22voyager.feed.web.hide-comments-initially%22%2C%22voyager.web.feed.updateIndicatorThreshold%22%2C%22voyager.web.feed.nup%22%2C%22voyager.feed.video.provider.linkedin%22%2C%22voyager.feed.video.provider.slideshare%22%2C%22voyager.feed.video.provider.vimeo%22%2C%22voyager.feed.video.provider.youtube%22%2C%22voyager.feed.web.likers-modal.additional-paging-request%22%2C%22voyager.sharing.web.remember-visibility-settings%22%2C%22voyager.web.sharing.keep-post-button-active%22%2C%22voyager.web.feed.fie.visibleHeight%22%2C%22voyager.web.feed.perf.layered-rendering%22%2C%22voyager.web.feed.improve-feed-via-control-menu%22%2C%22voyager.jobs.web.deferJymbii%22%2C%22postal_code_location_typeahead_jserp%22%2C%22voyager.premium.web.jobs-fastGrowingCompaniesUpsell%22%2C%22voyager.search.jobs-search.web.create-search-alert-hovercard-enabled%22%2C%22voyager.premium.web.jobPosterUpsell%22%2C%22neptune.launchpad.gate%22%2C%22neptune.launchpad.one-step-flow%22%2C%22voyager.messaging.client.draft-leave-prompt%22%2C%22voyager.messaging.client.forwarding%22%2C%22voyager.messaging.client.enable-group-topc
ard-facepile%22%2C%22voyager.messaging.client.enable-image-gif-virus-scan%22%2C%22voyager.messaging.client.enable-image-unrolling%22%2C%22voyager.messaging.client.enable-image-virus-scan%22%2C%22voyager.messaging.client.enable-impression-tracking%22%2C%22voyager.messaging.client.enable-leave-web%22%2C%22voyager.messaging.client.enable-lss-unsubscribe%22%2C%22voyager.messaging.client.enable-member-actions-web%22%2C%22voyager.messaging.clien
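One possible approach, sketched with the standard library only: the meta content shown above is percent-encoded JSON (it begins with %7B%22, which decodes to {"), so it can be pulled out of the HTML, URL-decoded, and only then passed to json.loads. The class name MetaJsonExtractor and the short sample string are hypothetical stand-ins, since the dump above is truncated and cannot be parsed as it stands.

```python
import json
from html.parser import HTMLParser
from urllib.parse import unquote


class MetaJsonExtractor(HTMLParser):
    """Collect JSON payloads stored percent-encoded in <meta content="..."> attributes."""

    def __init__(self):
        super().__init__()
        self.payloads = {}  # meta name -> decoded Python object

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attributes = dict(attrs)
        name, content = attributes.get("name"), attributes.get("content")
        if not name or not content:
            return
        try:
            self.payloads[name] = json.loads(unquote(content))
        except ValueError:
            pass  # content was not JSON (for example a colour or a viewport string)


# Short stand-in for the page: the real response is far longer.
sample = (
    "<html><head><title>LinkedIn</title>"
    '<meta name="theme-color" content="#0077B5">'
    '<meta name="extended/config/environment" '
    'content="%7B%22modulePrefix%22%3A%22extended%22%2C'
    '%22environment%22%3A%22production%22%7D">'
    "</head></html>"
)

extractor = MetaJsonExtractor()
extractor.feed(sample)
config = extractor.payloads["extended/config/environment"]
print(config["environment"])
```

The HTML is never handed to json.loads directly; only the decoded attribute value is, which is why the surrounding markup does not get in the way.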

BeautifulSoup cannot parse the html tags which don't have closing element

Here is the HTML code I am working on:
<html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
<title>sdasdsadsad</title>
<link rel="alternate" media="only screen and (max-width: 640px)" href="local:80" />
<meta name="description" content="sdddsdsdsdsdsd">
<meta name="keywords" content="3333333333333333">
<meta property="og:title" content="444444444444444444444444">
<meta property="og:type" content="article">
<meta property="og:description" content="dsdsdsdsddsds">
</head>
<body></body>
</html>
I want to get the line that contains the tag <meta name="description" ...>, which has no closing </meta> element. Here is my code:
import glob, os, re, urllib2, codecs
from bs4 import BeautifulSoup
from bs4 import SoupStrainer
html_doc = """
<html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
<title>sdasdsadsad</title>
<link rel="alternate" media="only screen and (max-width: 640px)" href="local:80" />
<meta name="description" content="sdddsdsdsdsdsd">
<meta name="keywords" content="3333333333333333">
<meta property="og:title" content="444444444444444444444444">
<meta property="og:type" content="article">
<meta property="og:description" content="dsdsdsdsddsds">
</head>
<body></body>
</html>
"""
soup = BeautifulSoup(html_doc)
aa = soup.find("meta", {"name":"description"})
print aa.encode("utf-8")
When I run the Python code, the console shows:
<meta content="sdddsdsdsdsdsd" name="description">
<meta content="3333333333333333" name="keywords">
<meta content="444444444444444444444444" property="og:title">
<meta content="article" property="og:type">
<meta content="dsdsdsdsddsds" property="og:description">
</meta></meta></meta></meta></meta>
But if <meta content="sdddsdsdsdsdsd" name="description"> has a closing </meta> element, I get exactly the line:
<meta content="sdddsdsdsdsdsd" name="description"> </meta>
Could you tell me why BeautifulSoup returns all the HTML tags following <meta name="description" ...>, and how to get only the line containing that tag?
Thanks.

Use lxml as the parser and it will work; I've tested it.
from bs4 import BeautifulSoup
soup = BeautifulSoup(html_doc, 'lxml')
aa = soup.find("meta", {"name":"description"})
print aa.encode('utf-8')
# console output
<meta content="sdddsdsdsdsdsd" name="description"/>
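A Python 3 variant of the same lookup, as a minimal sketch: it assumes the beautifulsoup4 package is installed and names the parser explicitly. It uses the standard-library "html.parser" backend so that nothing else needs installing; "lxml" can be substituted if that package is available. The document is a shortened copy of the one in the question. Reading the content attribute directly avoids depending on how the tag is serialised.

```python
from bs4 import BeautifulSoup

# Shortened copy of the document from the question.
html_doc = """
<html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
<title>sdasdsadsad</title>
<meta name="description" content="sdddsdsdsdsdsd">
<meta name="keywords" content="3333333333333333">
</head>
<body></body>
</html>
"""

# Passing the parser name explicitly avoids depending on whichever
# parser happens to be installed on the machine.
soup = BeautifulSoup(html_doc, "html.parser")
tag = soup.find("meta", attrs={"name": "description"})

description = tag["content"]  # attribute access, no string slicing needed
print(description)
print(len(tag.contents))  # a void element has no children
```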
