I am trying to pull static_token, but I am definitely not on the right track. I pulled all the info that is JavaScript, and I figured my next step was to read it with JSON or turn it into JSON format, as this is what I was told on my previous question. I tried to use regex to pull this as well, with no luck on that end either; as pointed out in the comments, I don't know what I'm doing. Can you point me in the right direction to pull the static_token, with a brief description of what I actually need to do in these instances?
Code:
import requests
import json, re
from bs4 import BeautifulSoup
url =''' '''
response = requests.get(url)
soup1 = BeautifulSoup(response.text,'lxml')
html1 = soup1.find_all('script')[1] #.text
print(html1)
soup2 = BeautifulSoup(b'html1', 'lxml')
var = soup2.get('var static_token')
print(var)
My attempt using regex:
static_token = re.findall(('static_token":"([^"]+)'), soup.text)
print(static_token)
The source I'm trying to pull the info from:
<script type="text/javascript">var CUSTOMIZE_TEXTFIELD = 1;
var FancyboxI18nClose = 'Close';
var FancyboxI18nNext = 'Next';
var FancyboxI18nPrev = 'Previous';
var contentOnly = false;
var roundMode = 2;
var static_token = '850424cb3feab732e072a3d84e378908';
var taxEnabled = 1;
var titleDelivery = '<h3 class="page-subheading">Your delivery address</h3>';
var titleInvoice = '<h3 class="page-subheading">Your billing address</h3>';
var token = '0c7a8347a263972696d6514dcfa24741';
var txtConditionsIsNotNeeded = 'You do not need to accept the Terms of Service for this order.';
var txtDeliveryAddress = 'Delivery address';
var txtErrors = 'Error(s)';
var txtFree = 'Free';
var txtThereis = 'There is';
var txtWithTax = '(tax incl.)';
var txtWithoutTax = '(tax excl.)';
var usingSecureMode = true;
var vat_management = 0;</script>
You are confusing a lot of data types in this question. This is not the only way to approach it (or the most robust), but it is simple and can point you in the right direction.
You seem to be able to read the HTML and extract the script tags from it into html1 using BeautifulSoup.
You need to look at the documentation to understand the data types being used. I would also recommend adding statements like this to your code to help:
html1 = soup.find_all('script')
print('html1', type(html1), html1) # the type is bs4.element.ResultSet
This variable contains all of the script tags from your document. You can iterate over the tags and read the text of each one, but it is NOT a JSON-formatted type.
Once you have a string and you want part of that string, regex should be one of your first thoughts. You don't want to use regex on the original HTML; that's the whole point of the BS4 library and others like it. HTML is often mangled, messy, and not well suited to simple regular expressions. Use BeautifulSoup to parse the HTML and find the section you want, THEN use regex to parse the string.
for tag in html1:
    print('tag', type(tag), tag)  # bs4.element.Tag
    print()
    print('tag.text', type(tag.text), tag.text)  # str
    print()
    toks = re.findall(r"var static_token =\s*'(\w+)'", tag.text)
    print('toks', type(toks), toks)  # list
    print()
    if toks:  # only the script that defines static_token will match
        static_token = toks[0]
        print('static_token', type(static_token), static_token)  # str
        print()
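Once that works, you can drop the debug prints. Here is a condensed sketch of the whole pipeline (requests, then BeautifulSoup, then regex), assuming the page contains a script block like the one you posted; the URL placeholder is yours to fill in:
import re
import requests
from bs4 import BeautifulSoup

url = '...'  # placeholder: the page you are scraping
response = requests.get(url)
soup = BeautifulSoup(response.text, 'lxml')

static_token = None
for tag in soup.find_all('script'):
    match = re.search(r"var static_token\s*=\s*'(\w+)'", tag.text)
    if match:  # only one script defines static_token
        static_token = match.group(1)
        break

print(static_token)  # e.g. 850424cb3feab732e072a3d84e378908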
Is it possible to download the code of a JavaScript function on a website through requests?
What I mean:
from requests_html import HTMLSession
url = 'link'
session = HTMLSession()
r = session.get(url)
r.html.render(sleep=2)
print(r.html.text)
Output:
Hello
Some text on site
function randomname() {
if (checkbox() == true)
document.getElementById('randomcode').value = '123456789023';
else alert('Checkbox not clicked!'); }
randomname();
completef("1823947239d23dc23dsar238rt3r")
And now I want to get two values:
123456789023
1823947239d23dc23dsar238rt3r
I tried to find it with
r.html.search("checkbox");
r.html.search("completef");
but unfortunately I failed :(
Any help? Thanks :)
Given your output as:
output = """
Hello
Some text on site
function randomname() {
if (checkbox() == true)
document.getElementById('randomcode').value = '123456789023';
else alert('Checkbox not clicked!'); }
randomname();
completef("1823947239d23dc23dsar238rt3r")
"""
You can use a regular expression:
import re
randomcode = re.search(r"document\.getElementById\('randomcode'\)\.value = '(\d+)';", output).group(1)
completef = re.search(r'completef\("([a-z0-9]+)"\)', output).group(1)
Make sure to put a \ in front of every special regex character like ( and ), and then use \d+ to match digits or [a-z0-9]+ to match lowercase letters and digits.
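If you'd rather not escape every special character by hand, re.escape() can build the literal part of the pattern for you. A minimal sketch, using the same output string defined above:
import re

# re.escape() adds the backslashes for characters like . ( ) automatically
prefix = re.escape("document.getElementById('randomcode').value = '")
randomcode = re.search(prefix + r"(\d+)';", output).group(1)
print(randomcode)  # 123456789023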
I want to extract data from a variable which is inside a script:
<script>
var Itemlist = 'null';
var ItemData = '[{\"item_id\":\"107\",\"id\":\"79\",\"line_item_no\":\"1\",\"Amount\":\"99999.00\"}]';
</script>
I want the item_id and the Amount inside a variable in Python.
I tried using regex; it worked for a while, but when the cookie session updated it stopped working.
Is there any other way to get those values?
I am using this method to get the script from the HTML, but it changes when the cookie session updates:
soup = bs(response.content, 'html.parser')
script = soup.find_all('script')[8]
so I have to change the number I've put after ('script'); for now it's [8], and if the cookie session updates I have to keep changing the number until I find the script I am looking for.
To get the data from the <script> you can use this example:
import re
import json
from bs4 import BeautifulSoup
html_data = """
<script>
var Itemlist = 'null';
var ItemData = '[{\"item_id\":\"107\",\"id\":\"79\",\"line_item_no\":\"1\",\"Amount\":\"99999.00\"}]';
</script>
"""
soup = BeautifulSoup(html_data, "html.parser")
data = soup.select_one("script").text
data = re.search(r"ItemData = '(.*)';", data).group(1)
data = json.loads(data)
print("Item_id =", data[0]["item_id"], "Amount =", data[0]["Amount"])
Prints:
Item_id = 107 Amount = 99999.00
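As for the index [8] breaking when the session changes: instead of counting script tags, you can locate the one you need by its content. A sketch along the same lines, assuming the tag you want always contains the string ItemData (in older bs4 versions the keyword is text= rather than string=):
import re
import json
from bs4 import BeautifulSoup

soup = BeautifulSoup(html_data, "html.parser")

# pick the <script> whose text mentions ItemData, wherever it appears in the page
script = soup.find("script", string=re.compile("ItemData"))

data = json.loads(re.search(r"ItemData = '(.*)';", script.text).group(1))
print("Item_id =", data[0]["item_id"], "Amount =", data[0]["Amount"])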
How can you get the value of the variable ue_mid if you are trying to scrape a web page using BeautifulSoup and the function soup.select_one()?
This is what the list of variables in the source code looks like:
var ue_id = 'XXXXXXXXXXXX',
ue_mid = 'ValueToGet',
ue_navtiming = 1;
Thank you so much in advance! 🙏
It is JavaScript. You can use select_one() only to get the text from the <script> tag; after that you have to use string functions (or regex) to extract the value from that string.
html = '''<script>
var ue_id = 'XXXXXXXXXXXX',
ue_mid = 'ValueToGet',
ue_navtiming = 1;
</script>'''
from bs4 import BeautifulSoup as BS
soup = BS(html, 'html.parser')
text = soup.select_one('script').get_text()
text = text.split("ue_mid = '")[1]
text = text.split("',")[0]
print(text)
# ValueToGet
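For comparison, the same value pulled out with a regex instead of split(). A small sketch, using the soup object from the snippet above:
import re

text = soup.select_one('script').get_text()
ue_mid = re.search(r"ue_mid = '([^']+)'", text).group(1)
print(ue_mid)
# ValueToGet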
I have been playing with BeautifulSoup, which is great. My end goal is to try and just get the text from a page. I am just trying to get the text from the body, with a special case to get the title and/or alt attributes from <a> or <img> tags.
So far I have this (edited and updated current code):
soup = BeautifulSoup(page)
comments = soup.findAll(text=lambda text:isinstance(text, Comment))
[comment.extract() for comment in comments]
page = ''.join(soup.findAll(text=True))
page = ' '.join(page.split())
print page
1) What do you suggest is the best way, for my special case, to NOT exclude those attributes from the two tags I listed above? If it is too complex to do this, it isn't as important as doing #2.
2) I would like to strip <!-- --> comment tags and everything in between them. How would I go about that?
QUESTION EDIT @jathanism: Here are some comment tags that I have tried to strip, but they remain, even when I use your example:
<!-- Begin function popUp(URL) { day = new Date(); id = day.getTime(); eval("page" + id + " = window.open(URL, '" + id + "', 'toolbar=0,scrollbars=0,location=0,statusbar=0,menubar=0,resizable=0,width=300,height=330,left = 774,top = 518');"); } // End -->
<!-- var MenuBar1 = new Spry.Widget.MenuBar("MenuBar1", {imgDown:"SpryAssets/SpryMenuBarDownHover.gif", imgRight:"SpryAssets/SpryMenuBarRightHover.gif"}); //--> <!-- var MenuBar1 = new Spry.Widget.MenuBar("MenuBar1", {imgDown:"SpryAssets/SpryMenuBarDownHover.gif", imgRight:"SpryAssets/SpryMenuBarRightHover.gif"}); //--> <!-- var whichlink=0 var whichimage=0 var blenddelay=(ie)? document.images.slide.filters[0].duration*1000 : 0 function slideit(){ if (!document.images) return if (ie) document.images.slide.filters[0].apply() document.images.slide.src=imageholder[whichimage].src if (ie) document.images.slide.filters[0].play() whichlink=whichimage whichimage=(whichimage<slideimages.length-1)? whichimage+1 : 0 setTimeout("slideit()",slidespeed+blenddelay) } slideit() //-->
Straight from the documentation for BeautifulSoup, you can easily strip comments (or anything) using extract():
from BeautifulSoup import BeautifulSoup, Comment
soup = BeautifulSoup("""1<!--The loneliest number-->
<a>2<!--Can be as bad as one--><b>3""")
comments = soup.findAll(text=lambda text:isinstance(text, Comment))
[comment.extract() for comment in comments]
print soup
# 1
# <a>2<b>3</b></a>
I am still trying to figure out why it doesn't find and strip tags like this: <!-- //-->. Those slashes cause certain tags to be overlooked.
This may be a problem with the underlying SGML parser: see http://www.crummy.com/software/BeautifulSoup/documentation.html#Sanitizing%20Bad%20Data%20with%20Regexps. You can override it by using a markupMassage regex -- straight from the docs:
import re, copy
from BeautifulSoup import BeautifulSoup

badString = "Foo<!-This comment is malformed.-->Bar<br />Baz"  # example of a malformed comment

myMassage = [(re.compile('<!-([^-])'), lambda match: '<!--' + match.group(1))]
myNewMassage = copy.copy(BeautifulSoup.MARKUP_MASSAGE)
myNewMassage.extend(myMassage)
print BeautifulSoup(badString, markupMassage=myNewMassage)
# Foo<!--This comment is malformed.-->Bar<br />Baz
If you are looking for a solution in BeautifulSoup version 3, see the BS3 docs on Comment:
import re
from BeautifulSoup import BeautifulSoup

soup = BeautifulSoup("""Hello! <!--I've got to be nice to get what I want.-->""")
comment = soup.find(text=re.compile("nice"))  # grabs the comment's text node
Comment = comment.__class__                   # the Comment class, without importing it
for element in soup(text=lambda text: isinstance(text, Comment)):
    element.extract()
print soup.prettify()
If mutation isn't your bag, you can collect the non-comment text nodes instead:
[t for t in soup.find_all(text=True) if not isinstance(t, Comment)]
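Put together with your whitespace cleanup, a minimal sketch using bs4 names (assuming page holds the HTML, as in your snippet):
from bs4 import BeautifulSoup, Comment

soup = BeautifulSoup(page, 'html.parser')
# keep only text nodes that are not comments
texts = [t for t in soup.find_all(text=True) if not isinstance(t, Comment)]
cleaned = ' '.join(' '.join(texts).split())
print(cleaned)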
I'm trying to parse Oxford Dictionary in order to obtain the etymology of a given word.
import lxml.html
from urllib.request import urlopen

class SkipException(Exception):
    def __init__(self, value):
        self.value = value

try:
    doc = lxml.html.parse(urlopen('https://en.oxforddictionaries.com/definition/%s' % "good"))
except SkipException:
    doc = ''

if doc:
    table = []
    trs = doc.xpath("//div[1]/div[2]/div/div/div/div[1]/section[5]/div/p")
I cannot seem to work out how to obtain the string of text I need. I know I'm missing some lines in the code I've copied, but I don't fully understand how HTML or lxml works. I would much appreciate it if someone could show me the correct way to solve this.
You don't want to do web scraping here, especially when practically every dictionary has an API. In the case of Oxford, create an account at https://developer.oxforddictionaries.com/. Get the API credentials from your account and do something like this:
import requests
import json

api_base = 'https://od-api.oxforddictionaries.com:443/api/v1/entries/{}/{}'
language = 'en'
word = 'parachute'
headers = {
    'app_id': '',
    'app_key': ''
}

url = api_base.format(language, word)
reply = requests.get(url, headers=headers)
if reply.ok:
    reply_dict = json.loads(reply.text)
    results = reply_dict.get('results')
    if results:
        headword = results[0]
        entries = headword.get('lexicalEntries')[0].get('entries')
        if entries:
            entry = entries[0]
            senses = entry.get('senses')
            if senses:
                sense = senses[0]
                print(sense.get('short_definitions'))
Here's a sample to get you started scraping Oxford dictionary pages:
import lxml.html as lh
from urllib.request import urlopen
url = 'https://en.oxforddictionaries.com/definition/parachute'
html = urlopen(url)
root = lh.parse(html)
body = root.find("body")
elements = body.xpath("//span[@class='ind']")
for element in elements:
    print(element.text)
To find the correct search string, you need to format the HTML so you can see the structure. I used the HTML formatter at https://www.freeformatter.com/html-formatter.html. Looking at the formatted HTML, I could see the definitions were in the <span> elements with the 'ind' class attribute.
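If you'd rather not paste the page into an online formatter, lxml can pretty-print the parsed tree for you; a small sketch (printing only the first part of the output to keep it readable):
import lxml.html as lh
from urllib.request import urlopen

root = lh.parse(urlopen('https://en.oxforddictionaries.com/definition/parachute'))
# pretty_print shows the nesting, so class attributes like 'ind' are easy to spot
pretty = lh.tostring(root.getroot(), pretty_print=True, encoding='unicode')
print(pretty[:2000])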