How to get value of variable through BeautifulSoup with soup.select_one()? - python

How can you get the value of the variable ue_mid if you were trying to scrape a web page using BeautifulSoup and also using this function: soup.select_one()?
This is how the list of variables on the source code looks like:
var ue_id = 'XXXXXXXXXXXX',
ue_mid = 'ValueToGet',
ue_navtiming = 1;
Thank you so much in advance! 🙏

It is JavaScript. You can use select_one() only to get text from tag <script> and later you have to use string's functions (or regex) to extract it from string.
html = '''<script>
var ue_id = 'XXXXXXXXXXXX',
ue_mid = 'ValueToGet',
ue_navtiming = 1;
</script>'''
from bs4 import BeautifulSoup as BS
soup = BS(html, 'html.parser')
text = soup.select_one('script').get_text()
text = text.split("ue_mid = '")[1]
text = text.split("',")[0]
print(text)
# ValueToGet

Related

Extract data from script tag [HTML] using BeautifulSoup in Python

I want to Extract data from a variable which is inside of a script:
<script>
var Itemlist = 'null';
var ItemData = '[{\"item_id\":\"107\",\"id\":\"79\",\"line_item_no\":\"1\",\"Amount\":\"99999.00\"}]';
</script>
I want the item_id and the Amount inside of a variable in python
I tried using regex it worked for a while but when the cookies session updated it stopped working
Is there any other way to get those values??
I am using this method to get the script from the html but it changes when the cookie session updates
soup = bs(response.content, 'html.parser')
script = soup.find('script')[8]
so i have to change the number that i've put after ('script') for now it's [8] if cookies session updates i have to keep changing the number until i find the script i am looking for
To get the data from the <script> you can use this example:
import re
import json
from bs4 import BeautifulSoup
html_data = """
<script>
var Itemlist = 'null';
var ItemData = '[{\"item_id\":\"107\",\"id\":\"79\",\"line_item_no\":\"1\",\"Amount\":\"99999.00\"}]';
</script>
"""
soup = BeautifulSoup(html_data, "html.parser")
data = soup.select_one("script").text
data = re.search(r"ItemData = '(.*)';", data).group(1)
data = json.loads(data)
print("Item_id =", data[0]["item_id"], "Amount =", data[0]["Amount"])
Prints:
Item_id = 107 Amount = 99999.00

How to get a particular script tag?

I am trying to build a download manager script with python, The web page contains some script tags, i want to isolate a particular script, the script html5player.setVideoUrlHigh('https://*****');,
I don't know how to go about it, I was able to get all the script tags but i am unable to get the script tag with this code html5player.setVideoUrlHigh('https://*****');
Here is my python code
from urllib.request import urlopen
import re
from bs4 import BeautifulSoup
Url = '*****'
pg = urlopen(Url)
sp = BeautifulSoup(pg)
script_tag = sp.find_all('script')
# print(script_tag[1])
print(re.search("setVideoHLS\(\'(.*?)\'\)", script_tag).group(1))
The script tag i want to get is this:
<script>
logged_user = false;
var static_id_cdn = 17;
var html5player = new HTML5Player('html5video', '56420147');
if (html5player) {
html5player.setVideoTitle('passionate hotel room');
html5player.setSponsors(false);
html5player.setVideoUrlLow('https://*****');
html5player.setVideoUrlHigh('https://******');
html5player.setVideoHLS('https://****');
html5player.setThumbUrl('https://**');
html5player.setStaticDomain('***');
html5player.setHttps();
html5player.setCanUseHttps();
document.getElementById('html5video').style.minHeight = '';
html5player.initPlayer();
}
How can I get parameter from this function `html5player.setVideoUrlHigh('https://******').
You can get the script tag using this code,
import re
from bs4 import BeautifulSoup
html = """<script> logged_user = false;
var static_id_cdn = 17;
var html5player = new HTML5Player('html5video', '56420147');
if (html5player) {
html5player.setVideoTitle('passionate hotel room');
html5player.setSponsors(false);
html5player.setVideoUrlLow('https://*****');
html5player.setVideoUrlHigh('https://******');
html5player.setVideoHLS(''https://****');
html5player.setThumbUrl('https://**');
html5player.setStaticDomain('***');
html5player.setHttps();
html5player.setCanUseHttps();
document.getElementById('html5video').style.minHeight = '';
html5player.initPlayer();
}</script>"""
soup = BeautifulSoup(HTML)
txt = soup.script.get_text()
print(txt)
Output:
logged_user = false;
var static_id_cdn = 17;
var html5player = new HTML5Player('html5video', '56420147');
if (html5player) {
html5player.setVideoTitle('passionate hotel room');
html5player.setSponsors(false);
html5player.setVideoUrlLow('https://*****');
html5player.setVideoUrlHigh('https://******');
html5player.setVideoHLS(''https://****');
html5player.setThumbUrl('https://**');
html5player.setStaticDomain('***');
html5player.setHttps();
html5player.setCanUseHttps();
document.getElementById('html5video').style.minHeight = '';
html5player.initPlayer();
}
EDIT
import requests
import bs4
import re
url = 'url'
r = requests.get(url)
bs = bs4.BeautifulSoup(r.text, "html.parser")
scripts = bs.find_all('script')
src = scripts[7] #Needed script is in position 7
print(re.search("html5player.setVideoUrlHigh\(\'(.*?)\'\)", str(src)).group(1))

After finding all the links and texts using find_all in Beautiful Soup how do grab the one that you need

My example
from bs4 import BeautifulSoup
import requests
result = requests.get("https://pythonprogramming.net/parsememcparseface/")
c = result.content
soup = BeautifulSoup(c,'lxml')
patch_name = soup.find_all(["a", "p"])
u = soup.get_text()
print(u)
How do I get the text I need for I can store it in a variable for later usage.
this will return a list of a and p tag:
patch_name = soup.find_all(["a", "p"])
you can get all the text of the list :
[tag.get_text() for tag in patch_name]

What is the ideal way to use xml data in python html parsing with Beautiful Soup?

What is the ideal way to convert xml to text in python html parsing with Beautiful Soup?
When I am doing html parsing with Python 2.7 BeautifulSoup library, I can get to the step to "soup", but I have no idea how to extract the data I need, so I tried converting them all to string.
In the following example, I want to extract all number in the span tag and add them up. Is there a better way?
XML data:
http://python-data.dr-chuck.net/comments_324255.html
CODE:
import urllib2
from BeautifulSoup import *
import re
url = 'http://python-data.dr-chuck.net/comments_324255.html'
html = urllib2.urlopen(url).read()
soup = BeautifulSoup(html)
spans = soup('span')
lis = list()
span_str = str(spans)
sp = re.findall('([0-9]+)', span_str)
count = 0
for i in sp:
count = count + int(i)
print('Sum:', count)
Don't need regex:
from bs4 import BeautifulSoup
from requests import get
url = 'http://python-data.dr-chuck.net/comments_324255.html'
html = get(url).text
soup = BeautifulSoup(html, 'lxml')
count = sum(int(n.text) for n in soup.findAll('span'))
import requests, bs4
r = requests.get("http://python-data.dr-chuck.net/comments_324255.html")
soup = bs4.BeautifulSoup(r.text, 'lxml')
sum(int(span.text) for span in soup.find_all(class_="comments"))
output:
2788

how to get the text of the url while scraping webpage [duplicate]

I am trying to extract the content of a single "value" attribute in a specific "input" tag on a webpage. I use the following code:
import urllib
f = urllib.urlopen("http://58.68.130.147")
s = f.read()
f.close()
from BeautifulSoup import BeautifulStoneSoup
soup = BeautifulStoneSoup(s)
inputTag = soup.findAll(attrs={"name" : "stainfo"})
output = inputTag['value']
print str(output)
I get TypeError: list indices must be integers, not str
Even though, from the Beautifulsoup documentation, I understand that strings should not be a problem here... but I am no specialist, and I may have misunderstood.
Any suggestion is greatly appreciated!
.find_all() returns list of all found elements, so:
input_tag = soup.find_all(attrs={"name" : "stainfo"})
input_tag is a list (probably containing only one element). Depending on what you want exactly you either should do:
output = input_tag[0]['value']
or use .find() method which returns only one (first) found element:
input_tag = soup.find(attrs={"name": "stainfo"})
output = input_tag['value']
In Python 3.x, simply use get(attr_name) on your tag object that you get using find_all:
xmlData = None
with open('conf//test1.xml', 'r') as xmlFile:
xmlData = xmlFile.read()
xmlDecoded = xmlData
xmlSoup = BeautifulSoup(xmlData, 'html.parser')
repElemList = xmlSoup.find_all('repeatingelement')
for repElem in repElemList:
print("Processing repElem...")
repElemID = repElem.get('id')
repElemName = repElem.get('name')
print("Attribute id = %s" % repElemID)
print("Attribute name = %s" % repElemName)
against XML file conf//test1.xml that looks like:
<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<root>
<singleElement>
<subElementX>XYZ</subElementX>
</singleElement>
<repeatingElement id="11" name="Joe"/>
<repeatingElement id="12" name="Mary"/>
</root>
prints:
Processing repElem...
Attribute id = 11
Attribute name = Joe
Processing repElem...
Attribute id = 12
Attribute name = Mary
For me:
<input id="color" value="Blue"/>
This can be fetched by below snippet.
page = requests.get("https://www.abcd.com")
soup = BeautifulSoup(page.content, 'html.parser')
colorName = soup.find(id='color')
print(colorName['value'])
If you want to retrieve multiple values of attributes from the source above, you can use findAll and a list comprehension to get everything you need:
import urllib
f = urllib.urlopen("http://58.68.130.147")
s = f.read()
f.close()
from BeautifulSoup import BeautifulStoneSoup
soup = BeautifulStoneSoup(s)
inputTags = soup.findAll(attrs={"name" : "stainfo"})
### You may be able to do findAll("input", attrs={"name" : "stainfo"})
output = [x["stainfo"] for x in inputTags]
print output
### This will print a list of the values.
I would actually suggest you a time saving way to go with this assuming that you know what kind of tags have those attributes.
suppose say a tag xyz has that attritube named "staininfo"..
full_tag = soup.findAll("xyz")
And i wan't you to understand that full_tag is a list
for each_tag in full_tag:
staininfo_attrb_value = each_tag["staininfo"]
print staininfo_attrb_value
Thus you can get all the attrb values of staininfo for all the tags xyz
you can also use this :
import requests
from bs4 import BeautifulSoup
import csv
url = "http://58.68.130.147/"
r = requests.get(url)
data = r.text
soup = BeautifulSoup(data, "html.parser")
get_details = soup.find_all("input", attrs={"name":"stainfo"})
for val in get_details:
get_val = val["value"]
print(get_val)
You could try to use the new powerful package called requests_html:
from requests_html import HTMLSession
session = HTMLSession()
r = session.get("https://www.bbc.co.uk/news/technology-54448223")
date = r.html.find('time', first = True) # finding a "tag" called "time"
print(date) # you will have: <Element 'time' datetime='2020-10-07T11:41:22.000Z'>
# To get the text inside the "datetime" attribute use:
print(date.attrs['datetime']) # you will get '2020-10-07T11:41:22.000Z'
I am using this with Beautifulsoup 4.8.1 to get the value of all class attributes of certain elements:
from bs4 import BeautifulSoup
html = "<td class='val1'/><td col='1'/><td class='val2' />"
bsoup = BeautifulSoup(html, 'html.parser')
for td in bsoup.find_all('td'):
if td.has_attr('class'):
print(td['class'][0])
Its important to note that the attribute key retrieves a list even when the attribute has only a single value.
Here is an example for how to extract the href attrbiutes of all a tags:
import requests as rq
from bs4 import BeautifulSoup as bs
url = "http://www.cde.ca.gov/ds/sp/ai/"
page = rq.get(url)
html = bs(page.text, 'lxml')
hrefs = html.find_all("a")
all_hrefs = []
for href in hrefs:
# print(href.get("href"))
links = href.get("href")
all_hrefs.append(links)
print(all_hrefs)
You can try gazpacho:
Install it using pip install gazpacho
Get the HTML and make the Soup using:
from gazpacho import get, Soup
soup = Soup(get("http://ip.add.ress.here/")) # get directly returns the html
inputs = soup.find('input', attrs={'name': 'stainfo'}) # Find all the input tags
if inputs:
if type(inputs) is list:
for input in inputs:
print(input.attr.get('value'))
else:
print(inputs.attr.get('value'))
else:
print('No <input> tag found with the attribute name="stainfo")

Categories

Resources