Parsing both HTML and JSON using Python

When we try to access profiles such as LinkedIn, the response contains both HTML and JSON text:
https://www.linkedin.com/in/aaron-jacobs-3b513261/
The useful data I want to grab is in JSON format. How can I parse it with
json.loads(data)
while ignoring the HTML part?
<!DOCTYPE html>
<html lang="en">
<head>
<script type="text/javascript" src="https://gc.kis.v2.scr.kaspersky-labs.com/9E1E45EF-3F97-184C-B471-44EF675548EA/main.js" charset="UTF-8"></script><link rel="stylesheet" crossorigin="anonymous" href="https://gc.kis.v2.scr.kaspersky-labs.com/AE845576FE44-174B-C481-79F3-FE54E1E9/abn/main.css"/><script type="application/javascript">!function(i,n){void 0!==i.addEventListener&&void 0!==i.hidden&&(n.liVisibilityChangeListener=function(){i.hidden&&(n.liHasWindowHidden=!0)},i.addEventListener("visibilitychange",n. liVisibilityChangeListener))}(document,window);</script>
<meta charset="utf-8">
<meta http-equiv="X-UA-Compatible" content="IE=edge">
<title>LinkedIn</title>
<meta name="description" content="">
<meta name="viewport" content="width=device-width, initial-scale=1.0, minimum-scale=1.0">
<meta name="theme-color" content="#0077B5">
<!---Data is Here--->
<meta name="extended/config/environment" content="%7B%22modulePrefix%22%3A%22extended%22%2C%22environment%22%3A%22production%22%2C%22datawarm%22%3A%7B%22enabled%22%3Atrue%7D%2C%22lix%22%3A%7B%22tests%22%3A%5B%22voyager.web.feed.authorDisabledComments%22%2C%22voyager.web.feed.badge%22%2C%22voyager.web.feed.channelUpdates%22%2C%22voyager.web.feed.comment-article%22%2C%22voyager.web.feed.comment-image%22%2C%22voyager.web.feed.connectionUpdates%22%2C%22voyager.web.feed.deleteReportModal%22%2C%22voyager.web.feed.feedBadgeCountInTotal%22%2C%22voyager.web.feed.followRecommendationUpdates%22%2C%22voyager.web.feed.lyndaUpdates%22%2C%22voyager.web.feed.index.disable-top-text-ad%22%2C%22voyager.web.feed.mentionedInNewsUpdate%22%2C%22voyager.web.feed.promos%22%2C%22voyager.web.feed.updateIndicatorThreshold%22%2C%22voyager.feed.web.share-via-control-panel%22%2C%22voyager.feed.web.dropshadow-on-article-box%22%2C%22voyager.feed.web.hoverable-link-text%22%2C%22voyager.feed.web.enable-sort-toggle%22%2C%22voyager.feed.web.disable-identity-module-stickiness%22%2C%22voyager.feed.client.photo-upload%22%2C%22voyager.feed.client.su-actor-subheadline%22%2C%22voyager.feed.client.su-commentary%22%2C%22voyager.feed.client.su-follow-button%22%2C%22voyager.feed.client.su-highlight-comment-see-more%22%2C%22voyager.feed.client.su-ad-choice%22%2C%22voyager.feed.client.update.cp.enabled%22%2C%22voyager.feed.client.update.cp.report.enabled%22%2C%22voyager.feed.like-on-comment%22%2C%22voyager.feed.reply-on-comment%22%2C%22voyager.web.feed.enable-share-as-message%22%2C%22voyager.me.web.content_analytics_feed_entry_shares%22%2C%22voyager.feed.web.full-width-images%22%2C%22voyager.feed.web.max-small-image-width%22%2C%22neptune.feed.web.max-small-image-width%22%2C%22publishin%22%2C%22voyager.feed.web.sharing.twitter-visibility%22%2C%22voyager.feed.web.sharing.hide-url-input%22%2C%22voyager.feed.web.extended.sharing.subaction-bar-rounded-button-theme%22%2C%22voyager.feed.web.sharing.increase-char-limit%22
%2C %22voyager.web.feed.sponsoredUpdateTracking%22%2C%22voyager.feed.client.hashtags%22%2C%22voyager.search.client.vertical-nav%22%2C%22voyager.search.web.postsVertical%22%2C%22voyager.search.web.right-rail-news-module%22%2C%22voyager.feed.web.rich-media.hide-reshare-button%22%2C%22voyager.web.feed.rmv.hideDetailAndActions%22%2C%22neptune.jobs.enabledNeptune%22%2C%22voyager.web.feed.occlusion-culling%22%2C%22voyager.web.prefetch-lazy-images%22%2C%22voyager.web.feed.editors-pick%22%2C%22voyager.web.feed.right-rail.follow-recommendations%22%2C%22voyager.web.feed.use-composition%22%2C%22voyager.web.feed.follow-page%22%2C%22voyager.web.feed.initial-fetch-update-count%22%2C%22voyager.feed.video.expand.support%22%2C%22voyager.feed.video.autoplay.support%22%2C%22voyager.feed.video.heartbeat.interval%22%2C%22voyager.feed.web.video-upload%22%2C%22voyager.feed.web.video-upload.duration-limit%22%2C%22voyager.feed.web.hide-comments-initially%22%2C%22voyager.web.feed.updateIndicatorThreshold%22%2C%22voyager.web.feed.nup%22%2C%22voyager.feed.video.provider.linkedin%22%2C%22voyager.feed.video.provider.slideshare%22%2C%22voyager.feed.video.provider.vimeo%22%2C%22voyager.feed.video.provider.youtube%22%2C%22voyager.feed.web.likers-modal.additional-paging-request%22%2C%22voyager.sharing.web.remember-visibility-settings%22%2C%22voyager.web.sharing.keep-post-button-active%22%2C%22voyager.web.feed.fie.visibleHeight%22%2C%22voyager.web.feed.perf.layered-rendering%22%2C%22voyager.web.feed.improve-feed-via-control-menu%22%2C%22voyager.jobs.web.deferJymbii%22%2C%22postal_code_location_typeahead_jserp%22%2C%22voyager.premium.web.jobs-fastGrowingCompaniesUpsell%22%2C%22voyager.search.jobs-search.web.create-search-alert-hovercard-enabled%22%2C%22voyager.premium.web.jobPosterUpsell%22%2C%22neptune.launchpad.gate%22%2C%22neptune.launchpad.one-step-flow%22%2C%22voyager.messaging.client.draft-leave-prompt%22%2C%22voyager.messaging.client.forwarding%22%2C%22voyager.messaging.client.enable-group-topc
ard-facepile%22%2C%22voyager.messaging.client.enable-image-gif-virus-scan%22%2C%22voyager.messaging.client.enable-image-unrolling%22%2C%22voyager.messaging.client.enable-image-virus-scan%22%2C%22voyager.messaging.client.enable-impression-tracking%22%2C%22voyager.messaging.client.enable-leave-web%22%2C%22voyager.messaging.client.enable-lss-unsubscribe%22%2C%22voyager.messaging.client.enable-member-actions-web%22%2C%22voyager.messaging.clien
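In the dump above, the JSON is not free-standing text: it is URL-encoded (percent-encoded) inside the content attribute of a meta tag. One possible approach is to pull out that attribute, decode it, and only then hand it to json.loads. The sketch below uses only the standard library; the names MetaContentParser and extract_meta_json are made up for illustration, and the sample string is a shortened stand-in for the real page, not LinkedIn's actual response.

```python
import json
from html.parser import HTMLParser
from urllib.parse import unquote


class MetaContentParser(HTMLParser):
    """Collects the content attribute of every <meta name="..."> tag."""

    def __init__(self):
        super().__init__()
        self.meta = {}

    def handle_starttag(self, tag, attrs):
        if tag == "meta":
            attrs = dict(attrs)
            if "name" in attrs and "content" in attrs:
                self.meta[attrs["name"]] = attrs["content"]


def extract_meta_json(html_text, meta_name):
    """Return the decoded JSON stored in <meta name=meta_name content=...>."""
    parser = MetaContentParser()
    parser.feed(html_text)
    raw = parser.meta[meta_name]    # KeyError if the tag is missing
    return json.loads(unquote(raw)) # percent-decode, then parse as JSON


# Shortened, hypothetical sample in the same shape as the page above.
sample = (
    '<html><head><meta name="extended/config/environment" '
    'content="%7B%22modulePrefix%22%3A%22extended%22%2C'
    '%22environment%22%3A%22production%22%7D"></head></html>'
)
config = extract_meta_json(sample, "extended/config/environment")
print(config["environment"])
```

The same idea works with BeautifulSoup (find the meta tag, read its content attribute, unquote, json.loads); the standard-library parser is used here only to keep the sketch dependency-free.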

Related

Is there a CSS parser for Python to see applied styles to each tag in HTML?

My Python Scrapy code crawls through each HTML tag, and I need to find the CSS styles that are applied to each tag/element. I've been using Selenium, but it's very resource-intensive, so I need to find another solution. Are there any Python libraries (e.g. tinycss, cssutils) that can handle this task?
This is the HTML example.
<!DOCTYPE html>
<html lang="en">
<head>
<meta charset="UTF-8">
<meta http-equiv="X-UA-Compatible" content="IE=edge">
<meta name="viewport" content="width=device-width, initial-scale=1.0">
<link rel="stylesheet" href="styles.css">
<title>My HTML</title>
</head>
<body>
<p class="p-styles">My paragraph</p>
</body>
</html>
And below is the CSS file
.p-styles {
    font-size: large;
    color: black;
}
The goal is to get all styles (i.e. font-size: large, color: black) from the ".p-styles" rule when the crawler reaches the p tag.
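For the simple case shown here (flat rules keyed by a single class selector), the lookup can be sketched without a browser. This is only an illustration under strong assumptions: parse_simple_css and ClassStyleCollector are hypothetical helpers, the regex handles only plain "selector { prop: value; }" rules, and nothing here implements the cascade, specificity, inheritance, or combinators. Real computed styles still need a full CSS engine.

```python
import re
from html.parser import HTMLParser


def parse_simple_css(css_text):
    """Map each selector to {property: value}. Flat rules only (assumption)."""
    rules = {}
    for selector, body in re.findall(r"([^{}]+)\{([^{}]*)\}", css_text):
        decls = {}
        for decl in body.split(";"):
            if ":" in decl:
                prop, value = decl.split(":", 1)
                decls[prop.strip()] = value.strip()
        rules[selector.strip()] = decls
    return rules


class ClassStyleCollector(HTMLParser):
    """Records, for each tag with a class attribute, the matching class rules."""

    def __init__(self, rules):
        super().__init__()
        self.rules = rules
        self.found = []  # list of (tag, styles) pairs

    def handle_starttag(self, tag, attrs):
        classes = (dict(attrs).get("class") or "").split()
        styles = {}
        for cls in classes:
            # Later classes overwrite earlier ones; this is NOT real cascade order.
            styles.update(self.rules.get("." + cls, {}))
        if styles:
            self.found.append((tag, styles))


css = ".p-styles {\n  font-size: large;\n  color: black;\n}"
html_doc = '<body><p class="p-styles">My paragraph</p></body>'

collector = ClassStyleCollector(parse_simple_css(css))
collector.feed(html_doc)
print(collector.found)
```

In a Scrapy spider, the same lookup could be applied to each element's class attribute after downloading and parsing the linked stylesheet once.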

Unable to log in to the WhaleWisdom website using BeautifulSoup

I've been trying to log in to the WhaleWisdom website for the last two weeks, but I'm not able to. I have tried many libraries, such as Scrapy, Selenium, and BeautifulSoup.
from requests import Session
from bs4 import BeautifulSoup as bs

with Session() as s:
    # Fetch the login page first to obtain the CSRF token
    login_page = s.get("https://whalewisdom.com/login")
    bs_content = bs(login_page.content, "lxml")
    authenticity_token = bs_content.find("input", {"name": "authenticity_token"})["value"]
    login_data = {
        "authenticity_token": authenticity_token,
        "login": "info@example.com",
        "password": "***********",
        "commit": "Log+In",
    }
    # Submit the credentials, then request the page that requires a session
    s.post("https://whalewisdom.com/session", data=login_data)
    html_data = bs(s.get("https://whalewisdom.com/dashboard").content, "html.parser")
    print(html_data)
Here is the output:
<!DOCTYPE html>
<html lang="en">
<head>
<meta charset="utf-8"/>
<title>WhaleWisdom Dashboard</title>
<meta content="IE=edge" http-equiv="X-UA-Compatible"/>
<meta content="width=device-width,initial-scale=1.0" name="viewport"/>
<meta content="WhaleWisdom tracks 13F, Schedule 13D, and 13G EDGAR filings by hedge funds. Hedge Fund Whale Backtesting and search tools" name="description"/>
<link href="https://d27mjrcvcy56qq.cloudfront.net/images/apple-touch-icon-76x76.png" rel="apple-touch-icon" sizes="76x76"/>
<link href="https://d27mjrcvcy56qq.cloudfront.net/images/favicon-32x32.png" rel="icon" sizes="32x32" type="image/png"/>
<link href="https://d27mjrcvcy56qq.cloudfront.net/images/favicon-96x96.png" rel="icon" sizes="96x96" type="image/png"/>
<link href="https://d27mjrcvcy56qq.cloudfront.net/images/favicon-16x16.png" rel="icon" sizes="16x16" type="image/png"/>
<meta content="r4hQnHlN2H-GtcIb06YHl49VSipApmfQQWIOvZzfnAU" name="google-site-verification">
<link href="https://fonts.googleapis.com/css?family=Roboto:100,300,400,500,700|Material+Icons" rel="stylesheet" type="text/css"/>
<link href="https://cdn.jsdelivr.net/npm/font-awesome@4.7.0/css/font-awesome.min.css" rel="stylesheet"/>
<link href="https://d27mjrcvcy56qq.cloudfront.net/packs/css/whalewisdom-24fbc382.css" media="screen" rel="stylesheet">
<meta content="authenticity_token" name="csrf-param">
<meta content="XMAu/LK+dKi/zt/XSTvxIJ8jKl2x8Rx47/ZnAiN6MQCcZmSSlUrOLMeURRr54eCfEWHY8oyS8c6GYxLoIMomNQ==" name="csrf-token">
</meta></meta></link></meta></head>
<body>
<noscript>
<strong>We're sorry but the WhaleWisdom Dashboard doesn't work properly without JavaScript enabled. Please enable it to continue.</strong>
</noscript>
<div id="app"></div>
<script src="https://d27mjrcvcy56qq.cloudfront.net/packs/js/whalewisdom-4b32da19479fdebf5332.js"></script>
<script>
(function(i,s,o,g,r,a,m){i['GoogleAnalyticsObject']=r;i[r]=i[r]||function(){
(i[r].q=i[r].q||[]).push(arguments)},i[r].l=1*new Date();a=s.createElement(o),
m=s.getElementsByTagName(o)[0];a.async=1;a.src=g;m.parentNode.insertBefore(a,m)
})(window,document,'script','https://www.google-analytics.com/analytics.js','ga');
ga('create', 'UA-11651599-1', 'auto');
ga('send', 'pageview');
</script>
<script async="" charset="utf-8" src="//ads.investingchannel.com/adtags/WhaleWisdom/quotepages/970x91.js" type="text/javascript"></script>
</body>
</html>

You can see in the HTML output, at line 23, a message stating that the WhaleWisdom dashboard doesn't work properly without JavaScript:
<!DOCTYPE html>
<html lang="en">
<head>
<meta charset="utf-8"/>
<title>WhaleWisdom Dashboard</title>
<meta content="IE=edge" http-equiv="X-UA-Compatible"/>
<meta content="width=device-width,initial-scale=1.0" name="viewport"/>
<meta content="WhaleWisdom tracks 13F, Schedule 13D, and 13G EDGAR filings by hedge funds. Hedge Fund Whale Backtesting and search tools" name="description"/>
<link href="https://d27mjrcvcy56qq.cloudfront.net/images/apple-touch-icon-76x76.png" rel="apple-touch-icon" sizes="76x76"/>
<link href="https://d27mjrcvcy56qq.cloudfront.net/images/favicon-32x32.png" rel="icon" sizes="32x32" type="image/png"/>
<link href="https://d27mjrcvcy56qq.cloudfront.net/images/favicon-96x96.png" rel="icon" sizes="96x96" type="image/png"/>
<link href="https://d27mjrcvcy56qq.cloudfront.net/images/favicon-16x16.png" rel="icon" sizes="16x16" type="image/png"/>
<meta content="r4hQnHlN2H-GtcIb06YHl49VSipApmfQQWIOvZzfnAU" name="google-site-verification">
<link href="https://fonts.googleapis.com/css?family=Roboto:100,300,400,500,700|Material+Icons" rel="stylesheet" type="text/css"/>
<link href="https://cdn.jsdelivr.net/npm/font-awesome@4.7.0/css/font-awesome.min.css" rel="stylesheet"/>
<link href="https://d27mjrcvcy56qq.cloudfront.net/packs/css/whalewisdom-24fbc382.css" media="screen" rel="stylesheet">
<meta content="authenticity_token" name="csrf-param">
<meta content="XMAu/LK+dKi/zt/XSTvxIJ8jKl2x8Rx47/ZnAiN6MQCcZmSSlUrOLMeURRr54eCfEWHY8oyS8c6GYxLoIMomNQ==" name="csrf-token">
</meta></meta></link></meta></head>
<body>
----
<noscript>
<strong>We're sorry but the WhaleWisdom Dashboard doesn't work properly without JavaScript enabled. Please enable it to continue.</strong>
</noscript>
----
<div id="app"></div>
<script src="https://d27mjrcvcy56qq.cloudfront.net/packs/js/whalewisdom-4b32da19479fdebf5332.js"></script>
<script>
(function(i,s,o,g,r,a,m){i['GoogleAnalyticsObject']=r;i[r]=i[r]||function(){
(i[r].q=i[r].q||[]).push(arguments)},i[r].l=1*new Date();a=s.createElement(o),
m=s.getElementsByTagName(o)[0];a.async=1;a.src=g;m.parentNode.insertBefore(a,m)
})(window,document,'script','https://www.google-analytics.com/analytics.js','ga');
ga('create', 'UA-11651599-1', 'auto');
ga('send', 'pageview');
</script>
<script async="" charset="utf-8" src="//ads.investingchannel.com/adtags/WhaleWisdom/quotepages/970x91.js" type="text/javascript"></script>
</body>
</html>
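I think this is why it is not working: the dashboard content is rendered by JavaScript into <div id="app"></div>, and requests does not execute JavaScript, so BeautifulSoup only sees the empty page shell. I also can't test it right now because I don't use WhaleWisdom.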
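One way to check this diagnosis before reaching for a browser-automation tool is to test whether the fetched HTML is just an empty application shell. The following is a minimal sketch using only the standard library. ShellDetector and looks_js_rendered are illustrative names, and the default mount id "app" is taken from the output above; other sites may use a different id.

```python
from html.parser import HTMLParser


class ShellDetector(HTMLParser):
    """Tracks whether the mount <div> exists and whether it has any content."""

    def __init__(self, mount_id="app"):
        super().__init__()
        self.mount_id = mount_id
        self.mount_seen = False
        self.mount_has_content = False
        self._in_mount = False

    def handle_starttag(self, tag, attrs):
        if self._in_mount:
            # Any child element means the server already rendered something.
            self.mount_has_content = True
        elif tag == "div" and dict(attrs).get("id") == self.mount_id:
            self._in_mount = True
            self.mount_seen = True

    def handle_endtag(self, tag):
        # Naive: the first </div> closes the mount. That is enough here,
        # because any nested tag has already set mount_has_content.
        if self._in_mount and tag == "div":
            self._in_mount = False

    def handle_data(self, data):
        if self._in_mount and data.strip():
            self.mount_has_content = True


def looks_js_rendered(html_text, mount_id="app"):
    """True if the page has an empty mount div, i.e. content is built client-side."""
    detector = ShellDetector(mount_id)
    detector.feed(html_text)
    return detector.mount_seen and not detector.mount_has_content


shell = '<body><noscript><strong>Enable JS</strong></noscript><div id="app"></div></body>'
rendered = '<body><div id="app"><p>Dashboard</p></div></body>'
print(looks_js_rendered(shell), looks_js_rendered(rendered))
```

If the check returns True for the dashboard response, the data has to come from somewhere else, typically a JSON endpoint the page's script calls, or from a tool that actually runs the JavaScript.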

HTML parser - How to copy a block of code from some 500 HTML pages to another 500 pages with the same file names

I want to copy a block of code from some 500 HTML pages to another 500 pages that have the same file names but different content.
For example, the lines below from page-1.html must be copied to another folder, into a file with the same name, page-1.html, and likewise for all the other pages.
In fact, I must copy everything before <body> from file-1.html (Folder-1) to file-1.html (Folder-2). Remember that the files' contents are different; the only connection is the shared file name.
<!-- START HERE -->
<?php
// Use API site scope.
define('RW_SDK__API_SCOPE', 'site');
$item_id = 1; // Replace that with your rating id.
$rating_class = 'page';
?>
<!DOCTYPE html>
<html xmlns="http://www.w3.org/1999/xhtml" dir="ltr" lang="en-US">
<head>
<title>My page 1</title>
<link rel="icon" href="https://my-website.com/love.ico" sizes="192x192" />
<meta http-equiv="X-UA-Compatible" content="IE=edge"/>
<link rel="stylesheet" type="text/css" title="main" href="cars.css"/>
<meta http-equiv="Content-Language" content="en"/>
<meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1"/>
<link rel="canonical" href="https://my-website/my-page-1.html" />
<meta name="resource-type" content="document"/>
<meta name="viewport" content="width=device-width, initial-scale=1.0, maximum-scale=1.0">
<meta name="distribution" content="global"/>
<meta http-equiv="Cache-control" content="public"/>
<link rel="alternate" type="application/rss+xml" title="Latest News" href="https://feeds.feedburner.com/my-website"/>
<meta name="description" content="My content"/>
<meta name="keywords" content="my, content"/>
<meta name="Robots" content="index,follow"/>
<meta name="googlebot" content="index,follow"/>
<meta name="expire" content="never"/>
<meta name="revisit-after" content="10 days"/>
<link rel="sitemap" type="application/rss+xml" href="rss.xml" />
<link rel="image_src" type="image/jpeg" href="https://my-website.com/icon-facebook.jpg" style="display:none"/>
<meta itemprop="image" content="https://my-website.com/icon-facebook.jpg"/>
<meta property="og:image" content="https://my-website.com/icon-facebook.jpg"/>
<meta property="og:type" content="article" />
<meta property="fb:app_id" content="721561911"/>
<meta property="fb:admins" content="716441"/>
<meta name="yandex-verification" content="6b7169b283c6c9cc" />
<meta property="og:title" content="My page 1" />
<!-- END HERE -->
</head>
<body>
...other tags...

There are probably other ways to do this that may be more optimal, but below is a variation of some PowerShell logic that appears to do the trick.
For every source file that has a destination file with a matching name, it replaces everything before </head> in the destination file with the corresponding part of the source file.
PowerShell
$src = Get-ChildItem -Path "C:\Folder1" -Filter "*.html";
$destFld = "C:\Folder2";
$src | % { Process {
    If ( Test-Path "$destFld\$($_.Name)" ) {
        Clear-Variable -Name ("a","b","y","z");
        $z = Get-Content $_.FullName -Raw;
        $y = "$((($z -split "</head>")[0]).Trim())`r`n";
        $a = Get-Content "$destFld\$($_.Name)" -Raw;
        $b = (($a -split "</head>")[1]).Trim();
        $y | Out-File "$destFld\$($_.Name)";
        "</head>" | Out-File "$destFld\$($_.Name)" -Append;
        $b | Out-File "$destFld\$($_.Name)" -Append;
    }
}};
Before and After Result Examples
File-1.html (used for update content)
<!-- START HERE -->
<?php
// Use API site scope.
define('RW_SDK__API_SCOPE', 'site');
$item_id = 1; // Replace that with your rating id.
$rating_class = 'page';
?>
<!DOCTYPE html>
<html xmlns="http://www.w3.org/1999/xhtml" dir="ltr" lang="en-US">
<head>
<title>My page 1</title>
<link rel="icon" href="https://my-website.com/love.ico" sizes="192x192" />
<meta http-equiv="X-UA-Compatible" content="IE=edge"/>
<link rel="stylesheet" type="text/css" title="main" href="cars.css"/>
<meta http-equiv="Content-Language" content="en"/>
<meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1"/>
<link rel="canonical" href="https://my-website/my-page-1.html" />
<meta name="resource-type" content="document"/>
<meta name="viewport" content="width=device-width, initial-scale=1.0, maximum-scale=1.0">
<meta name="distribution" content="global"/>
<meta http-equiv="Cache-control" content="public"/>
<link rel="alternate" type="application/rss+xml" title="Latest News" href="https://feeds.feedburner.com/my-website"/>
<meta name="description" content="My content"/>
<meta name="keywords" content="my, content"/>
<meta name="Robots" content="index,follow"/>
<meta name="googlebot" content="index,follow"/>
<meta name="expire" content="never"/>
<meta name="revisit-after" content="10 days"/>
<link rel="sitemap" type="application/rss+xml" href="rss.xml" />
<link rel="image_src" type="image/jpeg" href="https://my-website.com/icon-facebook.jpg" style="display:none"/>
<meta itemprop="image" content="https://my-website.com/icon-facebook.jpg"/>
<meta property="og:image" content="https://my-website.com/icon-facebook.jpg"/>
<meta property="og:type" content="article" />
<meta property="fb:app_id" content="721561911"/>
<meta property="fb:admins" content="716441"/>
<meta name="yandex-verification" content="6b7169b283c6c9cc" />
<meta property="og:title" content="My page 1" />
<!-- END HERE -->
</head>
<body>
...other tags...
File-2.html (before update)
<!-- START HERE -->
<?php
// Use API site scope.
define('RW_SDK__API_SCOPE', 'site');
$item_id = 1; // Replace that with your rating id.
$rating_class = 'page';
?>
<!DOCTYPE html>
<html xmlns="http://www.w3.org/1999/xhtml" dir="ltr" lang="en-US">
<head>
<title>My page 1</title>
<link rel="icon" href="https://my-website.com/hate.ico" sizes="192x192" />
<meta http-equiv="X-UA-Compatible" content="IE=edge"/>
<link rel="stylesheet" type="text/css" title="main" href="bars.css"/>
<meta http-equiv="Content-Language" content="en"/>
<meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1"/>
<link rel="canonical" href="https://my-website/my-page-2.html" />
<meta name="resource-type" content="document"/>
<meta name="viewport" content="width=device-width, initial-scale=1.0, maximum-scale=1.0">
<meta name="distribution" content="global"/>
<meta http-equiv="Cache-control" content="public"/>
<link rel="alternate" type="application/rss+xml" title="Latest News" href="https://feeds.fastlearner.com/my-mess"/>
<meta name="description" content="My content"/>
<meta name="keywords" content="no, content"/>
<meta name="Robots" content="index,follow"/>
<meta name="googlebot" content="index,follow"/>
<meta name="expire" content="always"/>
<meta name="revisit-after" content="2 days"/>
<link rel="sitemap" type="application/rss+xml" href="rss.xml" />
<link rel="image_src" type="image/jpeg" href="https://my-website.com/icon-myspace.jpg" style="display:none"/>
<meta itemprop="image" content="https://my-website.com/icon-myspace.jpg"/>
<meta property="og:image" content="https://my-website.com/icon-myspace.jpg"/>
<meta property="og:type" content="article" />
<meta property="fb:app_id" content="721561022"/>
<meta property="fb:admins" content="716552"/>
<meta name="yandex-verification" content="6b7169b283c6c8dd" />
<meta property="og:title" content="My page 1" />
<!-- END HERE -->
</head>
<body>
...other tags 2...
File-2.html (after update)
<?php
// Use API site scope.
define('RW_SDK__API_SCOPE', 'site');
$item_id = 1; // Replace that with your rating id.
$rating_class = 'page';
?>
<!DOCTYPE html>
<html xmlns="http://www.w3.org/1999/xhtml" dir="ltr" lang="en-US">
<head>
<title>My page 1</title>
<link rel="icon" href="https://my-website.com/love.ico" sizes="192x192" />
<meta http-equiv="X-UA-Compatible" content="IE=edge"/>
<link rel="stylesheet" type="text/css" title="main" href="cars.css"/>
<meta http-equiv="Content-Language" content="en"/>
<meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1"/>
<link rel="canonical" href="https://my-website/my-page-1.html" />
<meta name="resource-type" content="document"/>
<meta name="viewport" content="width=device-width, initial-scale=1.0, maximum-scale=1.0">
<meta name="distribution" content="global"/>
<meta http-equiv="Cache-control" content="public"/>
<link rel="alternate" type="application/rss+xml" title="Latest News" href="https://feeds.feedburner.com/my-website"/>
<meta name="description" content="My content"/>
<meta name="keywords" content="my, content"/>
<meta name="Robots" content="index,follow"/>
<meta name="googlebot" content="index,follow"/>
<meta name="expire" content="never"/>
<meta name="revisit-after" content="10 days"/>
<link rel="sitemap" type="application/rss+xml" href="rss.xml" />
<link rel="image_src" type="image/jpeg" href="https://my-website.com/icon-facebook.jpg" style="display:none"/>
<meta itemprop="image" content="https://my-website.com/icon-facebook.jpg"/>
<meta property="og:image" content="https://my-website.com/icon-facebook.jpg"/>
<meta property="og:type" content="article" />
<meta property="fb:app_id" content="721561911"/>
<meta property="fb:admins" content="716441"/>
<meta name="yandex-verification" content="6b7169b283c6c9cc" />
<meta property="og:title" content="My page 1" />
<!-- END HERE -->
</head>
<body>
...other tags 2...
Supporting Resources
ForEach-Object
Standard Aliases for Foreach-Object: the '%' symbol, ForEach
If()
Split()
Trim()
Clear-Variable
Get-Content
Out-File
About Special Characters
a. `n : New line
b. `r : Carriage return
Where b. and a. : CRLF EOL
How-to: Change the line endings of a text file

Another PowerShell solution, easy to understand, uses the regex \A(.*)[\s\S]+(<body>), which selects everything from the beginning of the file up to and including <body>.
$sourceFiles = Get-ChildItem 'c:\Folder1'
$destinationFolder = 'c:\Folder2'
foreach ($file in $sourceFiles) {
    $sourceContent = Get-Content $file.FullName -Raw
    $contentToInsert = [regex]::match($sourceContent,"\A(.*)[\s\S]+(<body>)").value
    $destinationContent = Get-Content $destinationFolder\$($file.Name) -Raw
    $destinationContent = $destinationContent -replace '\A(.*)[\s\S]+(<body>)',$contentToInsert
    Set-Content -Path $destinationFolder\$($file.Name) -Value $destinationContent -Encoding UTF8
} #end foreach file
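Since the page this question is listed under is about Python, the same idea can be sketched there as well. This is an assumption-laden illustration, not a port of either PowerShell answer: transplant_head and sync_folder are hypothetical names, the pattern is non-greedy (it stops at the first <body ...> tag, whereas the regex above is greedy), and the encoding default is a guess that should be changed to match the files (the samples declare iso-8859-1).

```python
import re
from pathlib import Path

# Everything from the start of the text through the first <body ...> tag.
HEAD_RE = re.compile(r"\A.*?<body\b[^>]*>", re.DOTALL | re.IGNORECASE)


def transplant_head(source_text, dest_text):
    """Return dest_text with its pre-<body> part replaced by the source's."""
    src = HEAD_RE.search(source_text)
    dst = HEAD_RE.search(dest_text)
    if src is None or dst is None:
        return dest_text  # leave files without a <body> tag untouched
    return src.group(0) + dest_text[dst.end():]


def sync_folder(source_dir, dest_dir, encoding="utf-8"):
    """Apply transplant_head to every *.html file present in both folders."""
    updated = []
    for src_path in sorted(Path(source_dir).glob("*.html")):
        dst_path = Path(dest_dir) / src_path.name
        if not dst_path.is_file():
            continue  # no file with the same name in the destination
        new_text = transplant_head(
            src_path.read_text(encoding=encoding),
            dst_path.read_text(encoding=encoding),
        )
        dst_path.write_text(new_text, encoding=encoding)
        updated.append(dst_path.name)
    return updated


# In-memory demonstration; no files are touched.
print(transplant_head("<head>A</head><body>one", "<head>B</head><body>two"))
```

A call such as sync_folder("Folder-1", "Folder-2") would then process the whole set; running it on a copy of the destination folder first is a sensible precaution, since the files are overwritten in place.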

This is why you should use a template engine before building those 500 HTML pages: all the header logic lives in one single file and the page-specific parts elsewhere.
As for the HTML parser, you can use any language to parse those 500 HTML pages and then generate the others. For example, there is this post where the author explains how to parse a website and export it to other formats. You can try exporting to HTML.

How to add className to body tag

I am trying to do something like the following in Plotly Dash:
<head>
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
<meta charset="utf-8">
<meta http-equiv="X-UA-Compatible" content="IE=edge">
<meta name="viewport" content="width=device-width, initial-scale=1">
<link rel="icon" href="images/favicon.ico" type="image/ico" />
</head>
<body class="nav-md">
</body>
Specifically, I want to define a className for the body tag and add some meta info to the head tag. Could anyone please explain how I can accomplish this?

If your HTML content isn't static or if you would like to introspect or modify the templated variables, then you can override the Dash.interpolate_index method.
https://dash.plotly.com/external-resources
import dash
import dash_html_components as html  # on Dash 2.0+, prefer: from dash import html


class CustomDash(dash.Dash):
    def interpolate_index(self, **kwargs):
        # Return the full index page, filling in the pieces Dash supplies
        return """
        <!DOCTYPE html>
        <html>
            <head>
                <meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
                <meta charset="utf-8">
                <meta http-equiv="X-UA-Compatible" content="IE=edge">
                <meta name="viewport" content="width=device-width, initial-scale=1">
                <link rel="icon" href="images/favicon.ico" type="image/ico" />
            </head>
            <body class='nav-md'>
                {app_entry}
                {config}
                {scripts}
                {renderer}
            </body>
        </html>
        """.format(
            app_entry=kwargs["app_entry"],
            config=kwargs["config"],
            scripts=kwargs["scripts"],
            renderer=kwargs["renderer"],
        )


app = CustomDash()
app.layout = html.P("Hello World")

if __name__ == "__main__":
    app.run_server(debug=True)

BeautifulSoup cannot parse the html tags which don't have closing element

Here is the HTML code I am working on:
<html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
<title>sdasdsadsad</title>
<link rel="alternate" media="only screen and (max-width: 640px)" href="local:80" />
<meta name="description" content="sdddsdsdsdsdsd">
<meta name="keywords" content="3333333333333333">
<meta property="og:title" content="444444444444444444444444">
<meta property="og:type" content="article">
<meta property="og:description" content="dsdsdsdsddsds">
</head>
<body></body>
</html>
I want to get the line containing the tag <meta name="description" ...>, which has no closing </meta> element. Here is my code:
import glob, os, re, urllib2, codecs
from bs4 import BeautifulSoup
from bs4 import SoupStrainer
html_doc = """
<html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
<title>sdasdsadsad</title>
<link rel="alternate" media="only screen and (max-width: 640px)" href="local:80" />
<meta name="description" content="sdddsdsdsdsdsd">
<meta name="keywords" content="3333333333333333">
<meta property="og:title" content="444444444444444444444444">
<meta property="og:type" content="article">
<meta property="og:description" content="dsdsdsdsddsds">
</head>
<body></body>
</html>
"""
soup = BeautifulSoup(html_doc)
aa = soup.find("meta", {"name":"description"})
print aa.encode("utf-8")
When I run the Python code, the console shows:
<meta content="sdddsdsdsdsdsd" name="description">
<meta content="3333333333333333" name="keywords">
<meta content="444444444444444444444444" property="og:title">
<meta content="article" property="og:type">
<meta content="dsdsdsdsddsds" property="og:description">
</meta></meta></meta></meta></meta>
But if <meta content="sdddsdsdsdsdsd" name="description"> has a closing </meta> element, I get exactly the line I want:
<meta content="sdddsdsdsdsdsd" name="description"> </meta>
Could you tell me why BeautifulSoup returns all the HTML tags following <meta name="description" ...>, and how to get only the line containing <meta name="description" ...>?
Thanks.

Use lxml as the parser and it will work; I've tested it.
from bs4 import BeautifulSoup

# html_doc is the string defined in the question
soup = BeautifulSoup(html_doc, 'lxml')
aa = soup.find("meta", {"name": "description"})
print(aa)  # the parenthesized form runs on both Python 2 and Python 3
# console output
<meta content="sdddsdsdsdsdsd" name="description"/>
