I'm trying to parse the testObj in the html into JSON, but it includes so much formatting.
I already tried to remove the non-ascii characters in the object, but json.loads() and yaml still can't parse the string into an object.
How can I parse the string into an object?
html
<!DOCTYPE html>
<html lang="en">
<head>
<title>Sample Document</title>
</head>
<body></body>
<script>
const testObj = {
a: 1,
b: 2,
c: 3,
};
</script>
</html>
Python Script
import lxml.html
import urllib.request
import os
import json
import yaml
def removeNonAscii(str):
    return ''.join(i for i in str if ord(i)>31 and ord(i)<126)
with urllib.request.urlopen('file:///'+os.path.abspath('./test.html')) as url:
    page = url.read()
tree = lxml.html.fromstring(page)
x = tree.xpath("//script")[0].text_content()
json_str = x.strip().split('testObj = ')[1][:-1]
str = removeNonAscii(json_str)
print(str)
# >>> {a: 1,b: 2,c: 3,}
# Attempt 1 - This doesn't work as object doesn't originally have double quotes
# data = json.loads(str)
# >>> json.decoder.JSONDecodeError: Expecting property name enclosed in double quotes
# Attempt 2 - Not sure how to detect or get rid of formatting
# data = yaml.load(str, yaml.SafeLoader)
# >>> ScannerError: While scanning for the next token found character '\t' that cannot start any token
print(data.a)
# >>> Should return 1
Edit: In my actual use case, the JSON object is very large and I cannot recreate the string by hand. I need to remove the formatting and/or add double quotes to make it proper JSON so that it can be parsed, but I'm not sure how to do it. I have gotten as far as {a: 1,b: 2,c: 3,}, but it still won't parse.
If the script is as shown (not minified), you can use the following regex to extract the string and then hjson to parse the unquoted keys:
import hjson, re
html = '''
<!DOCTYPE html>
<html lang="en">
<head>
<title>Sample Document</title>
</head>
<body></body>
<script>
const testObj = {
a: 1,
b: 2,
c: 3,
};
</script>
</html>'''
s = re.search(r'const testObj = ([\s\S]+?);', html).group(1)
res = hjson.loads(s)
print(res)
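If adding a dependency such as hjson is not an option, a rough standard-library alternative is to quote the bare keys and drop trailing commas with two regular expressions before calling json.loads. This is only a sketch: the helper name js_object_to_json and the sample string are made up for illustration, and it assumes simple identifier keys and no string values that themselves contain text resembling `, key:`.

```python
import json
import re


def js_object_to_json(js_text):
    """Turn a simple JS object literal into JSON text (hypothetical helper)."""
    # Put double quotes around bare identifier keys that follow '{' or ','.
    quoted = re.sub(r'([{,]\s*)([A-Za-z_$][\w$]*)(\s*:)', r'\1"\2"\3', js_text)
    # Remove trailing commas that appear right before '}' or ']'.
    return re.sub(r',\s*([}\]])', r'\1', quoted)


# Sample input with tabs and newlines, similar to the script in the question.
raw = "{\n\ta: 1,\n\tb: 2,\n\tc: 3,\n}"
data = json.loads(js_object_to_json(raw))
print(data["a"])
```

Note that the result is a dict, so values are read with data["a"] rather than data.a. For anything beyond flat, simple objects, a tolerant parser like hjson is likely the safer choice.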
I'm scraping data from two websites, and the scraping itself works. I also want to build an API on top of the scraped data using Django, but when I try to display the data as JSON in Django, it only shows an empty list. My code snippets are below.
from django.shortcuts import render
from bs4 import BeautifulSoup
import requests
import re
import json
import time
data = []
def getURL(url):
    url = url.replace(' ', '-').lower()
    for char in url:
        if char in "?.!:;|/[]&()":
            url = url.replace(char, '-')
        if char == "'" or char == ",":
            url = url.replace(char, '')
    decodeUrl = re.sub(r'-+', '-', url)
    # check whether the URL is up or not
    parsedUrl = "http://www.tutorialbar.com/" + decodeUrl + "/"
    if requests.head(parsedUrl).status_code == 200:
        return parsedUrl
urls = ['https://www.discudemy.com/all', 'https://www.discudemy.com/all/2']
for url in urls:
    source = requests.get(url).text
    soup = BeautifulSoup(source, 'html5lib')
    # print(soup)
    for content in soup.find_all('section', class_="card"):
        # print(content)
        try:
            language = content.label.text
            header = content.div.a.text
            day = content.find('span', class_="category").text
            i = content.find('div', class_="image")
            img = i.find('amp-img')['src']
            image = img.replace('240x135', '750x422')
            description = content.find('div', class_="description").text.lstrip()
            myURL = getURL(header)
            udemyURL = requests.get(myURL).text
            udemySoup = BeautifulSoup(udemyURL, 'html5lib')
            udemylink = udemySoup.find_all('a', class_="btn_offer_block re_track_btn")[0]["href"]
            entry = {
                'language': language,
                'header': header,
                'day': day,
                'image': image,
                'description': description,
                'courselink': udemylink,
            }
            data.append(entry)
            print()
        except Exception as e:
            continue
print(json.dumps(data))
print()
print(data)
def index(req):
    return render(req, 'index.html', {'courses': json.dumps(data)})
Below is my HTML file for displaying JSON data.
<!DOCTYPE html>
<html lang="en">
<head>
<meta charset="UTF-8">
<title>UdemyCourses</title>
</head>
<body>
{{ courses }}
</body>
</html>
There is some delay while the data is being scraped, and I think that might be the problem. I don't know how to handle asynchronous programming in Python. Is there any way to achieve this? I'm a beginner. Thanks in advance.
I used python to scrape an HTML file, but the data I really need is embedded in a CDATA file.
My code:
import requests
from bs4 import BeautifulSoup
url = 'https://www.website.com'
page = requests.get(url)
soup = BeautifulSoup(page.content, 'html.parser')
results = soup.find(id='react-container')
print(results.prettify())
The output is:
<div class="react-container" id="react-container">
<script type="text/javascript">
//<![CDATA[
window.REACT_OPTS = {"components":[{"component_name":"","props":{},"router":true,"redux":true,"selector":"#react-container","ignoreMissingSelector":false}]}
I eventually want to iterate through the full CDATA dictionary to compare if certain dictionary values are present in another list.
Please let me know if you have any ideas on how I can extract and manipulate the contents of the CDATA file. Thanks!
This example prints the string inside the <script> tag and then parses the data with the re and json modules:
import re
import json
from bs4 import BeautifulSoup
txt = '''<div class="react-container" id="react-container">
<script type="text/javascript">
//<![CDATA[
window.REACT_OPTS = {"components":[{"component_name":"","props":{},"router":true,"redux":true,"selector":"#react-container","ignoreMissingSelector":false}]}
// ]]>
</script>
</div>
'''
soup = BeautifulSoup(txt, 'html.parser')
# select desired <script> tag
script_tag = soup.select_one('#react-container script')
# print contents of the <script> tag:
print(script_tag.string)
# parse the json data inside <script> tag to variable 'data'
data = json.loads( re.search(r'window\.REACT_OPTS = ({.*})', script_tag.string).group(1) )
# print data to screen:
print(json.dumps(data, indent=4))
Prints:
//<![CDATA[
window.REACT_OPTS = {"components":[{"component_name":"","props":{},"router":true,"redux":true,"selector":"#react-container","ignoreMissingSelector":false}]}
// ]]>
{
"components": [
{
"component_name": "",
"props": {},
"router": true,
"redux": true,
"selector": "#react-container",
"ignoreMissingSelector": false
}
]
}
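Once the data is a regular Python dict, the comparison mentioned in the question could look roughly like the sketch below. It uses only the standard library and the sample text from the question; the list named wanted and its entries are assumptions made up for illustration.

```python
import json
import re

# Text of the <script> tag, as printed above.
script_text = '''//<![CDATA[
window.REACT_OPTS = {"components":[{"component_name":"","props":{},"router":true,"redux":true,"selector":"#react-container","ignoreMissingSelector":false}]}
// ]]>'''

# Same extraction step as in the answer: grab the object literal and parse it.
data = json.loads(re.search(r'window\.REACT_OPTS = ({.*})', script_text).group(1))

# Hypothetical list of values to look for.
wanted = ["#react-container", "#sidebar"]

# Walk every component and keep the string values that also appear in the list.
found = [
    value
    for component in data["components"]
    for value in component.values()
    if isinstance(value, str) and value in wanted
]
print(found)
```

The isinstance check restricts the comparison to string values; drop it if nested dicts or booleans should be compared as well.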
I know this has been asked before, but I am new to scraping and Python. Any help would be very useful for my learning.
I am scraping a news site using Python with packages such as Beautiful Soup.
I am having difficulty getting the value of a JavaScript variable that is declared in a script tag and is also updated there.
Here is the part of the HTML page that I am scraping (containing only the script part):
<!-- Eliminate render-blocking JavaScript and CSS in above-the-fold content -->
<script src="https://ajax.googleapis.com/ajax/libs/jquery/1.8/jquery.min.js"></script>
<script src="https://cdnjs.cloudflare.com/ajax/libs/materialize/0.97.0/js/materialize.min.js"></script>
<script type="text/javascript" src="/dist/scripts/index.js"></script>
<script type="text/javascript" src="/dist/scripts/read.js"></script>
<script src="/dist/scripts/jquery.scrolldepth.min.js"></script>
<script type="text/javascript">
var min_news_id = "d7zlgjdu-1"; // line 1
function loadMoreNews(){
$("#load-more-btn").hide();
$("#load-more-gif").show();
$.post("/en/ajax/more_news",{'category':'politics','news_offset':min_news_id},function(data){
data = JSON.parse(data);
min_news_id = data.min_news_id||min_news_id; // line 2
$(".card-stack").append(data.html);
})
.fail(function(){alert("Error : unable to load more news");})
.always(function(){$("#load-more-btn").show();$("#load-more-gif").hide();});
}
jQuery.scrollDepth();
</script>
From the above part, I want to get the value of min_news_id in Python.
I should also get the value of the same variable if it is updated at line 2.
Here is how I am doing it:
self.pattern = re.compile('var min_news_id = (.+?);')  # or self.pattern = re.compile('min_news_id = (.+?);')
page = bs(htmlPage, "html.parser")
# find all the script tags
scripts = page.find_all("script")
for script in scripts:
    for line in script:
        scriptString = str(line)
        if "min_news_id" in scriptString:
            scriptString.replace('"', '\\"')
            print(scriptString)
            if(self.pattern.match(str(scriptString))):
                print("matched")
                data = self.pattern.match(scriptString)
                jsVariable = json.loads(data.groups()[0])
                InShortsScraper.newsOffset = jsVariable
                print(InShortsScraper.newsOffset)
But I never get the value of the variable. Is the problem with my regular expression or with something else? Please help me.
Thank You in advance.
import re
html = '''<!-- Eliminate render-blocking JavaScript and CSS in above-the-fold content -->
<script src="https://ajax.googleapis.com/ajax/libs/jquery/1.8/jquery.min.js"></script>
<script src="https://cdnjs.cloudflare.com/ajax/libs/materialize/0.97.0/js/materialize.min.js"></script>
<script type="text/javascript" src="/dist/scripts/index.js"></script>
<script type="text/javascript" src="/dist/scripts/read.js"></script>
<script src="/dist/scripts/jquery.scrolldepth.min.js"></script>
<script type="text/javascript">
var min_news_id = "d7zlgjdu-1"; // line 1
function loadMoreNews(){
$("#load-more-btn").hide();
$("#load-more-gif").show();
$.post("/en/ajax/more_news",{'category':'politics','news_offset':min_news_id},function(data){
data = JSON.parse(data);
min_news_id = data.min_news_id||min_news_id; // line 2
$(".card-stack").append(data.html);
})
.fail(function(){alert("Error : unable to load more news");})
.always(function(){$("#load-more-btn").show();$("#load-more-gif").hide();});
}
jQuery.scrollDepth();
</script>'''
finder = re.findall(r'min_news_id = .*;', html)
print(finder)
Output:
['min_news_id = "d7zlgjdu-1";', 'min_news_id = data.min_news_id||min_news_id;']
#2 OR YOU CAN USE
print(finder[0].replace('min_news_id = ', '').replace('"','').replace(';','').strip())
Output:
d7zlgjdu-1
#3 OR YOU CAN USE
finder = re.findall(r'[a-z0-9]{8}-[0-9]', html)
print(finder)
Output:
['d7zlgjdu-1']
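As for why the pattern in the question never matched: a compiled pattern's match method only matches at the very start of the string, while search scans the whole string. The sketch below shows the difference on a shortened, made-up version of the script text; it uses only the standard library.

```python
import json
import re

# Shortened stand-in for the contents of the inline <script> tag.
script = '''
var min_news_id = "d7zlgjdu-1"; // line 1
function loadMoreNews(){ min_news_id = data.min_news_id||min_news_id; }
'''

pattern = re.compile(r'var min_news_id = (.+?);')

# match() anchors at position 0, and the text starts with a newline,
# so nothing is found.
print(pattern.match(script))

# search() looks for the pattern anywhere in the text.
found = pattern.search(script)

# The captured group still carries its double quotes, so json.loads
# turns it into a plain Python string.
value = json.loads(found.group(1))
print(value)
```

Separately, scriptString.replace(...) in the question returns a new string and leaves scriptString unchanged, so that line has no effect; the quotes do not need escaping before json.loads anyway.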
You can't monitor a JavaScript variable change using BeautifulSoup. Here is how to get the next page of news using a while loop, re and json:
from bs4 import BeautifulSoup
import requests, re
page_url = 'https://inshorts.com/en/read/politics'
ajax_url = 'https://inshorts.com/en/ajax/more_news'
htmlPage = requests.get(page_url).text
# BeautifulSoup extract article summary
# page = BeautifulSoup(htmlPage, "html.parser")
# ...
# get current min_news_id
min_news_id = re.search(r'min_news_id\s+=\s+"([^"]+)', htmlPage).group(1) # result: d7zlgjdu-1
customHead = {'X-Requested-With': 'XMLHttpRequest', 'Referer': page_url}
while min_news_id:
    # change "politics" if in different category
    reqBody = {'category' : 'politics', 'news_offset' : min_news_id }
    # get Ajax next page
    ajax_response = requests.post(ajax_url, headers=customHead, data=reqBody).json() # parse string to json
    # again, do extract article summary
    page = BeautifulSoup(ajax_response["html"], "html.parser")
    # ....
    # ....
    # new min_news_id
    min_news_id = ajax_response["min_news_id"]
    # remove this to loop all page (thousand?)
    break
Thank you for the response. I finally solved it using the requests package after reading its documentation.
Here is my code:
if InShortsScraper.firstLoad == True:
    self.pattern = re.compile('var min_news_id = (.+?);')
else:
    self.pattern = re.compile('min_news_id = (.+?);')
page = None
# print("Pattern: " + str(self.pattern))
if news_offset == None:
    htmlPage = urlopen(url)
    page = bs(htmlPage, "html.parser")
else:
    self.loadMore['news_offset'] = InShortsScraper.newsOffset
    # print("payload : " + str(self.loadMore))
    try:
        r = myRequest.post(
            url = url,
            data = self.loadMore
        )
    except TypeError:
        print("Error in loading")
    InShortsScraper.newsOffset = r.json()["min_news_id"]
    page = bs(r.json()["html"], "html.parser")
#print(page)
if InShortsScraper.newsOffset == None:
    scripts = page.find_all("script")
    for script in scripts:
        for line in script:
            scriptString = str(line)
            if "min_news_id" in scriptString:
                finder = re.findall(self.pattern, scriptString)
                InShortsScraper.newsOffset = finder[0].replace('min_news_id = ', '').replace('"','').replace(';','').strip()
I'm parsing some XML with BeautifulSoup and have data that looks like:
soup.findAll('title')
<title>main title</title>
<title>other title</title>
<title>another title</title>
When iterating over the tags, I want to skip the first title. So I have:
for e in soup.findAll('title'):
    if e == '<title>main title</title>':
        pass
    else:
        print (e)
This still returns all the titles, including main title. I've tried removing the title tags as well.
Thanks for any help.
Your condition does not work because you are comparing an object of type <class 'bs4.element.Tag'> to a string, which is always False. You could convert the tag to a string and then compare.
Try this:
for e in soup.find_all("title"):
    if str(e) == '<title>main title</title>':
        pass
    else:
        print (e)
Output:
<title>other title</title>
<title>another title</title>
If you want to skip the first title, then a better solution is to slice the list:
>>> soup.findAll('title')[1:]
[<title>other title</title>, <title>another title</title>]
You can check the text attribute of the node, instead of the node itself.
from bs4 import BeautifulSoup
soup = BeautifulSoup("""<title>main title</title>
<title>other title</title>
<title>another title</title>""", "html.parser")
for e in soup.find_all("title"):
    if e.text != 'main title':
        print(e)
#<title>other title</title>
#<title>another title</title>
I have configured xampp on windows to work with python 2.7 and Pygments. My php code is highlighted properly in Pygments on the website. The code has colors, span elements, classes.
That is how it looks:
But I cannot get line numbers.
According to the tutorials I have read, this depends on the linenos value in the Python script, which should be table, inline, 1, or True.
But it does not work for me. It still gives the same final code:
<!doctype html>
<html lang="pl">
<head>
<meta charset="UTF-8">
<title>Document</title>
<link rel="stylesheet" href="gh.css">
</head>
<body>
<div class="highlight highlight-php"><pre><code><span class="nv">$name</span> <span class="o">=</span> <span class="s2">"Jaś"</span><span class="p">;</span>
<span class="k">echo</span> <span class="s2">"Zażółć gęślą jaźń, "</span> <span class="o">.</span> <span class="nv">$name</span> <span class="o">.</span> <span class="s1">'.'</span><span class="p">;</span>
<span class="k">echo</span> <span class="s2">"hehehe#jo.io"</span><span class="p">;</span>
</code></pre></div>
</html>
How to add line numbers? I put two files of the website below:
index.py
import sys
from pygments import highlight
from pygments.formatters import HtmlFormatter
# If there isn't only 2 args something weird is going on
expecting = 2;
if ( len(sys.argv) != expecting + 1 ):
    exit(128)
# Get the code
language = (sys.argv[1]).lower()
filename = sys.argv[2]
f = open(filename, 'rb')
code = f.read()
f.close()
# PHP
if language == 'php':
    from pygments.lexers import PhpLexer
    lexer = PhpLexer(startinline=True)
# GUESS
elif language == 'guess':
    from pygments.lexers import guess_lexer
    lexer = guess_lexer( code )
# GET BY NAME
else:
    from pygments.lexers import get_lexer_by_name
    lexer = get_lexer_by_name( language )
# OUTPUT
formatter = HtmlFormatter(linenos='table', encoding='utf-8', nowrap=True)
highlighted = highlight(code, lexer, formatter)
print highlighted
index.php
<?php
define('MB_WPP_BASE', dirname(__FILE__));
function mb_pygments_convert_code($matches)
{
    $pygments_build = MB_WPP_BASE . '/index.py';
    $source_code = isset($matches[3]) ? $matches[3] : '';
    $class_name = isset($matches[2]) ? $matches[2] : '';
    // Creates a temporary filename
    $temp_file = tempnam(sys_get_temp_dir(), 'MB_Pygments_');
    // Populate temporary file
    $filehandle = fopen($temp_file, "w");
    fwrite($filehandle, html_entity_decode($source_code, ENT_COMPAT, 'UTF-8'));
    fclose($filehandle);
    // Creates pygments command
    $language = $class_name ? $class_name : 'guess';
    $command = sprintf('C:\Python27/python %s %s %s', $pygments_build, $language, $temp_file);
    // Executes the command
    $retVal = -1;
    exec($command, $output, $retVal);
    unlink($temp_file);
    // Returns Source Code
    $format = '<div class="highlight highlight-%s"><pre><code>%s</code></pre></div>';
    if ($retVal == 0)
        $source_code = implode("\n", $output);
    $highlighted_code = sprintf($format, $language, $source_code);
    return $highlighted_code;
}
// This prevent throwing error
libxml_use_internal_errors(true);
// Get all pre from post content
$dom = new DOMDocument();
$dom->loadHTML(mb_convert_encoding('
<pre class="php">
<code>
$name = "Jaś";
echo "Zażółć gęślą jaźń, " . $name . \'.\';
echo "<address>hehehe#jo.io</address>";
</code>
</pre>', 'HTML-ENTITIES', "UTF-8"), LIBXML_HTML_NODEFDTD);
$pres = $dom->getElementsByTagName('pre');
foreach ($pres as $pre) {
    $class = $pre->attributes->getNamedItem('class')->nodeValue;
    $code = $pre->nodeValue;
    $args = array(
        2 => $class, // Element at position [2] is the class
        3 => $code // And element at position [3] is the code
    );
    // convert the code
    $new_code = mb_pygments_convert_code($args);
    // Replace the actual pre with the new one.
    $new_pre = $dom->createDocumentFragment();
    $new_pre->appendXML($new_code);
    $pre->parentNode->replaceChild($new_pre, $pre);
}
// Save the HTML of the new code.
$newHtml = "";
foreach ($dom->getElementsByTagName('body')->item(0)->childNodes as $child) {
    $newHtml .= $dom->saveHTML($child);
}
?>
<!doctype html>
<html lang="pl">
<head>
<meta charset="UTF-8">
<title>Document</title>
<link rel="stylesheet" href="gh.css">
</head>
<body>
<?= $newHtml ?>
</body>
</html>
Thank you
While reading the file try readlines:
f = open(filename, 'rb')
code = f.readlines()
f.close()
This way the following formatter call will receive multiple lines:
formatter = HtmlFormatter(linenos='table', encoding='utf-8', nowrap=True)
Suggestion:
A more Pythonic way of opening files is:
with open(filename, 'rb') as f:
    code = f.readlines()
That's it; the context manager closes the file for you.
Solved! The problem was nowrap=True in the HtmlFormatter call. From the Pygments documentation:
nowrap
If set to True, don’t wrap the tokens at all, not even inside a <pre> tag. This disables most other options (default: False).
http://pygments.org/docs/formatters/#HtmlFormatter