How do I extract a dictionary from a CDATA embedded in HTML? - python

I used Python to scrape an HTML page, but the data I really need is embedded in a CDATA section.
My code:
import requests
from bs4 import BeautifulSoup
url = 'https://www.website.com'
page = requests.get(url)
soup = BeautifulSoup(page.content, 'html.parser')
results = soup.find(id='react-container')
print(results.prettify())
The output is:
<div class="react-container" id="react-container">
<script type="text/javascript">
//<![CDATA[
window.REACT_OPTS = {"components":[{"component_name":"","props":{},"router":true,"redux":true,"selector":"#react-container","ignoreMissingSelector":false}]}
I eventually want to iterate through the full CDATA dictionary and check whether certain dictionary values are present in another list.
Please let me know if you have any ideas on how I can extract and manipulate the contents of the CDATA section. Thanks!

This example prints the string inside the <script> tag and then parses the data with the re and json modules:
import re
import json
from bs4 import BeautifulSoup
txt = '''<div class="react-container" id="react-container">
<script type="text/javascript">
//<![CDATA[
window.REACT_OPTS = {"components":[{"component_name":"","props":{},"router":true,"redux":true,"selector":"#react-container","ignoreMissingSelector":false}]}
// ]]>
</script>
</div>
'''
soup = BeautifulSoup(txt, 'html.parser')
# select desired <script> tag
script_tag = soup.select_one('#react-container script')
# print contents of the <script> tag:
print(script_tag.string)
# parse the json data inside <script> tag to variable 'data'
data = json.loads( re.search(r'window\.REACT_OPTS = ({.*})', script_tag.string).group(1) )
# print data to screen:
print(json.dumps(data, indent=4))
Prints:
//<![CDATA[
window.REACT_OPTS = {"components":[{"component_name":"","props":{},"router":true,"redux":true,"selector":"#react-container","ignoreMissingSelector":false}]}
// ]]>
{
    "components": [
        {
            "component_name": "",
            "props": {},
            "router": true,
            "redux": true,
            "selector": "#react-container",
            "ignoreMissingSelector": false
        }
    ]
}
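For the second part of the question (checking whether values from the parsed dictionary appear in another list), here is a minimal sketch. The list wanted_values and the helper iter_values are hypothetical names used only for illustration, and the data is rebuilt from the JSON shown above so the snippet runs on its own:

```python
import json

# Same structure as the parsed 'data' above, rebuilt here so the sketch is standalone.
data = json.loads(
    '{"components":[{"component_name":"","props":{},"router":true,'
    '"redux":true,"selector":"#react-container","ignoreMissingSelector":false}]}'
)

# Hypothetical list of values to look for; replace with your own.
wanted_values = ["#react-container", "some-other-value"]


def iter_values(obj):
    """Yield every leaf value found in nested dicts and lists."""
    if isinstance(obj, dict):
        for value in obj.values():
            yield from iter_values(value)
    elif isinstance(obj, list):
        for item in obj:
            yield from iter_values(item)
    else:
        yield obj


# Keep only the leaf values that also appear in the other list.
found = [value for value in iter_values(data) if value in wanted_values]
print(found)
```

Walking the structure recursively means you do not need to know in advance how deeply the values are nested.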

Related

Extract data from script tag [HTML] using BeautifulSoup in Python

I want to extract data from a variable that is inside a script:
<script>
var Itemlist = 'null';
var ItemData = '[{\"item_id\":\"107\",\"id\":\"79\",\"line_item_no\":\"1\",\"Amount\":\"99999.00\"}]';
</script>
I want to get item_id and Amount into Python variables.
I tried a regex, and it worked for a while, but it stopped working when the cookie session updated.
Is there another way to get those values?
I use the following to get the script from the HTML, but its position changes whenever the cookie session updates:
soup = bs(response.content, 'html.parser')
script = soup.find_all('script')[8]
So I have to change the index after find_all('script') (currently [8]) every time the cookie session updates, until I find the script I am looking for.
To get the data from the <script> you can use this example:
import re
import json
from bs4 import BeautifulSoup
html_data = """
<script>
var Itemlist = 'null';
var ItemData = '[{\"item_id\":\"107\",\"id\":\"79\",\"line_item_no\":\"1\",\"Amount\":\"99999.00\"}]';
</script>
"""
soup = BeautifulSoup(html_data, "html.parser")
data = soup.select_one("script").text
data = re.search(r"ItemData = '(.*)';", data).group(1)
data = json.loads(data)
print("Item_id =", data[0]["item_id"], "Amount =", data[0]["Amount"])
Prints:
Item_id = 107 Amount = 99999.00
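If the position of the script keeps changing, one option is to skip the index entirely and search the whole page source for the variable. The sketch below also assumes that the real page source contains the backslash-escaped quotes literally, as shown in the question; the name page_source stands in for response.text and is an assumption:

```python
import json
import re

# Raw string so the backslashes stay in the text, as they would in real page source.
page_source = r"""
<script>var unrelated = 1;</script>
<script>
var Itemlist = 'null';
var ItemData = '[{\"item_id\":\"107\",\"id\":\"79\",\"line_item_no\":\"1\",\"Amount\":\"99999.00\"}]';
</script>
"""

# Search the whole page, so the position of the <script> tag does not matter.
match = re.search(r"var ItemData = '(.*?)';", page_source)
raw = match.group(1)

# Undo the JavaScript escaping of double quotes before parsing as JSON.
# This simple replace assumes the values contain no other escape sequences.
items = json.loads(raw.replace('\\"', '"'))
print(items[0]["item_id"], items[0]["Amount"])
```

Because the regex is anchored on the variable name rather than on the tag order, it should keep working when other scripts are added or removed.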

How to extract var (values) from <script> of html using beautifulsoup

I am currently using:
import requests
from bs4 import BeautifulSoup
source = requests.get('https://www.randomwebsite.com').text
soup = BeautifulSoup(source, 'lxml')
details = soup.find('script')
This returns the following script:
<script>
var Url = "https://www.example.com";
if(Url != ''){code}
else {code
}
</script>
I want the output to be the following:
https://www.example.com
import re
text = """
<script>
var Url = "https://www.example.com";
if(Url != ''){code}
else {code
}
</script>
"""
match = re.search('Url = "(.*?)"', text)
print(match.group(1))
Output:
https://www.example.com
To print the cashback_url, you can try this script:
import re
import requests
url = 'https://tracking.earnkaro.com/visitretailer/508?id=103894&shareid=ENKR2020090345700421&dl=https%3A%2F%2Fwww.amazon.in%2Fgp%2Fproduct%2FB08645RXJ6%2Fref%3Dox_sc_act_title_1%3Fsmid%3DAT95IG9ONZD7S%26psc%3D1'
html_data = requests.get(url).text
cashback_url = re.search(r'var cashbackUrl = "(.*?)"', html_data).group(1)
print(cashback_url)
Prints:
https://www.amazon.in/gp/product/B08645RXJ6/ref=ox_sc_act_title_1?smid=AT95IG9ONZD7S&psc=1&ck&tag=EK003221-21

BeautifulSoup not extracting information from <script type="text/javascript">

I need a way to get information from a web page. That information is stored in a <script> tag and I can't find a way to extract it. Here is the last iteration of the code I used:
for link in urls:
    driver.get(link)
    #print(driver.title)
    content = driver.page_source
    soup = BeautifulSoup(content, features="html.parser")
    for a in soup.findAll(string=['script', "EM.", "productFullPrice"]):
        print (a)
        name=a.find(string=['EM.sef_name'])
        print(name);
print(a) and print(name) return nothing.
The source code I want to scrape looks like this:
<script type="text/javascript">
var EM = EM || {};
EM.CDN = 'link1';
EM.something = value;
If you want the text inside the tag, you can't just pass 'EM.' to the string argument, because it looks for a string that exactly matches 'EM.'. Passing 'script' there doesn't match the <script> tag either; it only looks for a text node equal to 'script'. To get the tag itself, pass 'script' as the tag name to findAll. The text between the script tags looks like this: "\n var EM = EM || {};\n EM.CDN = 'link1';\n EM.something = value; \n ", so none of the strings you passed is equal to it. There are a couple of ways to go about this; here is one that returns values similar to what you posted:
import bs4
html_string = '''
<script type="text/javascript">
var EM = EM || {};
EM.CDN = 'link1';
EM.something = value;
</script>
'''
wanted_strings = ["EM.", "productFullPrice"]
soup = bs4.BeautifulSoup(html_string, features="html.parser")
wanted = []
test = soup.findAll('script')
for word in wanted_strings:
    for tag in test:
        if word in tag.text:
            wanted.append(tag)
wanted
This gives you a list of the script tags that contain the strings you need:
[<script type="text/javascript">
var EM = EM || {};
EM.CDN = 'link1';
EM.something = value;
</script>]
Another way is to look up the tag and then put each line of its code in a list:
import bs4
html_string = '''
<script type="text/javascript">
var EM = EM || {};
EM.CDN = 'link1';
EM.something = value;
</script>
'''
soup = bs4.BeautifulSoup(html_string, features="html.parser")
test = soup.findAll('script')
blah = [x.strip() for x in test[0].text.split('\n') if x.strip()]
blah
This gives you something like the following, which may be easier to search depending on your use case:
['var EM = EM || {};', "EM.CDN = 'link1';", 'EM.something = value;']
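Once you have the script text, a regular expression can turn the EM.* assignments into a dictionary. This is only a sketch: script_text stands in for the text of the script tag, and the pattern assumes simple one-line assignments like the ones in the question:

```python
import re

# Stand-in for the text of the <script> tag shown above.
script_text = """
var EM = EM || {};
EM.CDN = 'link1';
EM.something = value;
"""

# Capture "EM.<name> = <value>;" pairs; quotes around the value are optional.
pairs = re.findall(r"EM\.(\w+)\s*=\s*'?([^';]*)'?;", script_text)
em_values = dict(pairs)
print(em_values)
```

You can then look up a single property with em_values.get('sef_name'), which returns None if the script does not define it.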

Getting javascript variable value while scraping with python

I know this has been asked before, but I am new to scraping and Python. Any help would be very useful for my learning.
I am scraping a news site using Python with packages such as Beautiful Soup.
I am having difficulty getting the value of a JavaScript variable that is declared in a script tag and is also updated there.
Here is the part of the HTML page I am scraping (only the script part):
<!-- Eliminate render-blocking JavaScript and CSS in above-the-fold content -->
<script src="https://ajax.googleapis.com/ajax/libs/jquery/1.8/jquery.min.js"></script>
<script src="https://cdnjs.cloudflare.com/ajax/libs/materialize/0.97.0/js/materialize.min.js"></script>
<script type="text/javascript" src="/dist/scripts/index.js"></script>
<script type="text/javascript" src="/dist/scripts/read.js"></script>
<script src="/dist/scripts/jquery.scrolldepth.min.js"></script>
<script type="text/javascript">
var min_news_id = "d7zlgjdu-1"; // line 1
function loadMoreNews(){
$("#load-more-btn").hide();
$("#load-more-gif").show();
$.post("/en/ajax/more_news",{'category':'politics','news_offset':min_news_id},function(data){
data = JSON.parse(data);
min_news_id = data.min_news_id||min_news_id; // line 2
$(".card-stack").append(data.html);
})
.fail(function(){alert("Error : unable to load more news");})
.always(function(){$("#load-more-btn").show();$("#load-more-gif").hide();});
}
jQuery.scrollDepth();
</script>
From the above part, I want to get the value of min_news_id in python.
I should also get the value of same variable if updated from line 2.
Here is how I am doing it:
self.pattern = re.compile('var min_news_id = (.+?);')  # or self.pattern = re.compile('min_news_id = (.+?);')
page = bs(htmlPage, "html.parser")
# find all the script tags
scripts = page.find_all("script")
for script in scripts:
    for line in script:
        scriptString = str(line)
        if "min_news_id" in scriptString:
            scriptString.replace('"', '\\"')
            print(scriptString)
            if(self.pattern.match(str(scriptString))):
                print("matched")
                data = self.pattern.match(scriptString)
                jsVariable = json.loads(data.groups()[0])
                InShortsScraper.newsOffset = jsVariable
                print(InShortsScraper.newsOffset)
But I never get the value of the variable. Is the problem in my regular expression or somewhere else? Please help me.
Thank You in advance.
import re
html = '''<!-- Eliminate render-blocking JavaScript and CSS in above-the-fold content -->
<script src="https://ajax.googleapis.com/ajax/libs/jquery/1.8/jquery.min.js"></script>
<script src="https://cdnjs.cloudflare.com/ajax/libs/materialize/0.97.0/js/materialize.min.js"></script>
<script type="text/javascript" src="/dist/scripts/index.js"></script>
<script type="text/javascript" src="/dist/scripts/read.js"></script>
<script src="/dist/scripts/jquery.scrolldepth.min.js"></script>
<script type="text/javascript">
var min_news_id = "d7zlgjdu-1"; // line 1
function loadMoreNews(){
$("#load-more-btn").hide();
$("#load-more-gif").show();
$.post("/en/ajax/more_news",{'category':'politics','news_offset':min_news_id},function(data){
data = JSON.parse(data);
min_news_id = data.min_news_id||min_news_id; // line 2
$(".card-stack").append(data.html);
})
.fail(function(){alert("Error : unable to load more news");})
.always(function(){$("#load-more-btn").show();$("#load-more-gif").hide();});
}
jQuery.scrollDepth();
</script>'''
finder = re.findall(r'min_news_id = .*;', html)
print(finder)
Output:
['min_news_id = "d7zlgjdu-1";', 'min_news_id = data.min_news_id||min_news_id;']
#2 OR YOU CAN USE
print(finder[0].replace('min_news_id = ', '').replace('"','').replace(';','').strip())
Output:
d7zlgjdu-1
#3 OR YOU CAN USE
finder = re.findall(r'[a-z0-9]{8}-[0-9]', html)
print(finder)
Output:
['d7zlgjdu-1']
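One likely reason the code in the question never prints "matched" is that pattern.match() only succeeds when the pattern matches at the very start of the string, while the script text begins with a newline. A small sketch of the difference, using a shortened stand-in for the script text:

```python
import re

# Shortened stand-in for the content of the <script> tag; note the leading newline.
script_text = '\nvar min_news_id = "d7zlgjdu-1"; // line 1\nfunction loadMoreNews(){}\n'
pattern = re.compile(r'var min_news_id = (.+?);')

# match() only succeeds if the pattern matches at position 0 of the string.
print(pattern.match(script_text))

# search() scans the whole string for the first match.
found = pattern.search(script_text)
value = found.group(1).strip('"')
print(value)
```

Switching from match() to search() (or findall(), as above) is usually enough when the variable can appear anywhere in the script.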
You can't monitor a JavaScript variable change using BeautifulSoup. Here is how to get the next page of news using a while loop, re and json:
from bs4 import BeautifulSoup
import requests, re
page_url = 'https://inshorts.com/en/read/politics'
ajax_url = 'https://inshorts.com/en/ajax/more_news'
htmlPage = requests.get(page_url).text
# BeautifulSoup extract article summary
# page = BeautifulSoup(htmlPage, "html.parser")
# ...
# get current min_news_id
min_news_id = re.search(r'min_news_id\s+=\s+"([^"]+)', htmlPage).group(1) # result: d7zlgjdu-1
customHead = {'X-Requested-With': 'XMLHttpRequest', 'Referer': page_url}
while min_news_id:
    # change "politics" if in different category
    reqBody = {'category' : 'politics', 'news_offset' : min_news_id }
    # get Ajax next page
    ajax_response = requests.post(ajax_url, headers=customHead, data=reqBody).json() # parse string to json
    # again, do extract article summary
    page = BeautifulSoup(ajax_response["html"], "html.parser")
    # ....
    # ....
    # new min_news_id
    min_news_id = ajax_response["min_news_id"]
    # remove this to loop all page (thousand?)
    break
Thank you for the response. I finally solved it using the requests package after reading its documentation.
Here is my code:
if InShortsScraper.firstLoad == True:
    self.pattern = re.compile('var min_news_id = (.+?);')
else:
    self.pattern = re.compile('min_news_id = (.+?);')
page = None
# print("Pattern: " + str(self.pattern))
if news_offset == None:
    htmlPage = urlopen(url)
    page = bs(htmlPage, "html.parser")
else:
    self.loadMore['news_offset'] = InShortsScraper.newsOffset
    # print("payload : " + str(self.loadMore))
    try:
        r = myRequest.post(
            url = url,
            data = self.loadMore
        )
    except TypeError:
        print("Error in loading")
    InShortsScraper.newsOffset = r.json()["min_news_id"]
    page = bs(r.json()["html"], "html.parser")
#print(page)
if InShortsScraper.newsOffset == None:
    scripts = page.find_all("script")
    for script in scripts:
        for line in script:
            scriptString = str(line)
            if "min_news_id" in scriptString:
                finder = re.findall(self.pattern, scriptString)
                InShortsScraper.newsOffset = finder[0].replace('min_news_id = ', '').replace('"','').replace(';','').strip()

Get JS var value in HTML source using BeautifulSoup in Python

I'm trying to get a JavaScript var value from an HTML source code using BeautifulSoup.
For example I have:
<script>
[other code]
var my = 'hello';
var name = 'hi';
var is = 'halo';
[other code]
</script>
I want something to return the value of the var "my" in Python
How can I achieve that?
The simplest approach is to use a regular expression pattern to both locate the element via BeautifulSoup and extract the desired substring:
import re
from bs4 import BeautifulSoup
data = """
<script>
[other code]
var my = 'hello';
var name = 'hi';
var is = 'halo';
[other code]
</script>
"""
soup = BeautifulSoup(data, "html.parser")
pattern = re.compile(r"var my = '(.*?)';$", re.MULTILINE | re.DOTALL)
script = soup.find("script", text=pattern)
print(pattern.search(script.text).group(1))
Prints hello.
Another idea would be to use a JavaScript parser and locate a variable declaration node, check the identifier to be of a desired value and extract the initializer. Example using slimit parser:
from bs4 import BeautifulSoup
from slimit import ast
from slimit.parser import Parser
from slimit.visitors import nodevisitor
data = """
<script>
var my = 'hello';
var name = 'hi';
var is = 'halo';
</script>
"""
soup = BeautifulSoup(data, "html.parser")
script = soup.find("script", text=lambda text: text and "var my" in text)
# parse js
parser = Parser()
tree = parser.parse(script.text)
for node in nodevisitor.visit(tree):
    if isinstance(node, ast.VarDecl) and node.identifier.value == 'my':
        print(node.initializer.value)
Prints hello.
The pattern in the answer, pattern = re.compile(r"var my = '(.*?)';$", re.MULTILINE | re.DOTALL), can give a wrong result. When re.MULTILINE and re.DOTALL are set at the same time, the end-of-line anchor $ has to be removed.
Tested with Python 3.6.4.
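To illustrate when the $ anchor can go wrong, here is a small sketch. It assumes the line has trailing whitespace after the semicolon (an assumption, not something shown in the question); in that case the anchored pattern cannot stop at the first line, and re.DOTALL lets the lazy group run on into the next one:

```python
import re

# Note the trailing space after the first semicolon.
script_text = "var my = 'hello'; \nvar name = 'hi';\n"

with_anchor = re.compile(r"var my = '(.*?)';$", re.MULTILINE | re.DOTALL)
without_anchor = re.compile(r"var my = '(.*?)';")

# With the anchor, the group is forced to extend to a ';' that ends a line.
print(repr(with_anchor.search(script_text).group(1)))

# Without the anchor, the lazy group stops at the first "';".
print(repr(without_anchor.search(script_text).group(1)))
```

When the semicolon really is the last character on the line, both patterns return hello, which is why the original answer works on its own sample data.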
Building on @alecxe's answer, but considering the more complex case of an array of dictionaries (or an array of flat JSON objects):
from bs4 import BeautifulSoup
from slimit import ast
from slimit.parser import Parser
from slimit.visitors import nodevisitor
data = """
<script>
var my = [{'dic1key1':1}, {'dic2key1':1}];
var name = 'hi';
var is = 'halo';
</script>
"""
soup = BeautifulSoup(data, "html.parser")
script = soup.find("script", text=lambda text: text and "var my" in text)
# parse js
parser = Parser()
tree = parser.parse(script.text)
array_items = []
for node in nodevisitor.visit(tree):
    if isinstance(node, ast.VarDecl) and node.identifier.value == 'my':
        for item in node.initializer.items:
            # slimit is not imported as a module here, so use the imported ast name
            parsed_dict = {getattr(n.left, 'value', '')[1:-1]: getattr(n.right, 'value', '')[1:-1]
                           for n in nodevisitor.visit(item)
                           if isinstance(n, ast.Assign)}
            array_items.append(parsed_dict)
print(array_items)
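As an alternative that avoids a JavaScript parser, a different technique is to cut the array literal out with a regex and evaluate it with ast.literal_eval from the standard library. This is a sketch that only works when the JavaScript literal also happens to be a valid Python literal (quoted keys, no true/false/null); script_text stands in for the text of the script tag:

```python
import ast
import re

# Stand-in for the text of the <script> tag from the example above.
script_text = """
var my = [{'dic1key1':1}, {'dic2key1':1}];
var name = 'hi';
var is = 'halo';
"""

# Grab everything from the opening '[' to the first '];' after "var my = ".
match = re.search(r"var my = (\[.*?\]);", script_text, re.DOTALL)

# literal_eval only evaluates Python literals, so it does not execute code.
array_items = ast.literal_eval(match.group(1))
print(array_items)
```

If the literal is valid JSON (double-quoted keys and strings), json.loads on the same captured group is the more natural choice.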
