Isolating class attribute from html using beautifulsoup - python

HTML:
<body class="" data-is-article="story" data-new-gr-c-s-check-loaded="14.1094.0" data-gr-ext-installed="">
How would I extract "story" as a string variable from "data-is-article" using beautiful soup?
I have tried:
type = soup.find('body', class_="data-is-article")
But I get None back.

Note: Avoid using the names of Python built-ins such as type as variable names; shadowing them can have unwanted effects later in your code.
data-is-article is not a class, it is an attribute, so simply access the attribute value of the element via .get('ATTRIBUTE NAME'):
soup.body.get('data-is-article')
or based on your selection:
soup.find('body', {'data-is-article':True}).get('data-is-article')
Example
from bs4 import BeautifulSoup
html = '''<body class="" data-is-article="story" data-new-gr-c-s-check-loaded="14.1094.0" data-gr-ext-installed="">'''
soup = BeautifulSoup(html, 'html.parser')  # name the parser explicitly to avoid a warning
print(soup.body.get('data-is-article'))  # story
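As a rough sketch of why the original attempt returned None: class_ only ever filters on the class attribute, while .get() reads any attribute from the tag. The attribute name data-missing below is made up purely to show the fallback behavior.

```python
from bs4 import BeautifulSoup

html = '<body class="" data-is-article="story"></body>'
soup = BeautifulSoup(html, "html.parser")

# class_ filters on the class attribute only, so this finds nothing.
by_class = soup.find("body", class_="data-is-article")

# Attribute values are read from the tag itself.
article_type = soup.body.get("data-is-article")

# .get() returns None (or a supplied default) when the attribute is absent;
# "data-missing" is a hypothetical name used for illustration.
missing = soup.body.get("data-missing")
fallback = soup.body.get("data-missing", "unknown")

print(by_class, article_type, missing, fallback)
```

Using .get() rather than tag['name'] avoids a KeyError when the attribute might not be present.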

Related

How to get attribute value from li tag in python BS4

How can I get the src attribute of this link tag with BS4 library?
Right now I'm using the code below to try to achieve this, but I can't get it to work.
<li class="active" id="server_0" data-embed="<iframe src='https://vk.com/video_ext.php?oid=757563422&id=456240701&hash=1d8fcd32c5b5f28b' scrolling='no' frameborder='0' width='100%' height='100%' allowfullscreen='true' webkitallowfullscreen='true' mozallowfullscreen='true' ></iframe>"><a><span><i class="fa fa-eye"></i></span> <strong>vk</strong></a></li>
I want this value: src='https://vk.com/video_ext.php?oid=757563422&id=456240701&hash=1d8fcd32c5b5f28b'
This is my code. I can access ['data-embed'], but I don't know how to extract the link:
from bs4 import BeautifulSoup as bs
import cloudscraper
scraper = cloudscraper.create_scraper()
access = "https://w.mycima.cc/play.php?vid=d4d8322b9"
response = scraper.get(access)
doc2 = bs(response.content, "lxml")
container2 = doc2.find("div", id='player').find("ul", class_="list_servers list_embedded col-sec").find("li")
link = container2['data-embed']
print(link)
Result
<Response [200]>
https://w.mycima.cc/play.php?vid=d4d8322b9
<iframe src='https://vk.com/video_ext.php?oid=757563422&id=456240701&hash=1d8fcd32c5b5f28b' scrolling='no' frameborder='0' width='100%' height='100%' allowfullscreen='true' webkitallowfullscreen='true' mozallowfullscreen='true' ></iframe>
Process finished with exit code 0
From the Beautiful Soup documentation:
You can access a tag’s attributes by treating the tag like a dictionary.
They give the example:
tag = BeautifulSoup('<b id="boldest">bold</b>', 'html.parser').b
tag['id']
# 'boldest'
For reference and further details, see: https://www.crummy.com/software/BeautifulSoup/bs4/doc/#attributes
So, for your case specifically, you could write
print(link.find("iframe")['src'])
If link is plain text rather than a soup object (attribute values such as container2['data-embed'] are returned as strings, so this is the case in your example), you can fall back on string searching, regex, or a second BeautifulSoup pass, for example:
import re
from bs4 import BeautifulSoup
link = """<Response [200]>https://w.mycima.cc/play.php?vid=d4d8322b9<iframe src='https://vk.com/video_ext.php?oid=757563422&id=456240701&hash=1d8fcd32c5b5f28b'></iframe>"""
iframe = re.search(r"<iframe.*>", link)  # grab the iframe markup out of the text
if iframe:
    soup = BeautifulSoup(iframe.group(0), "html.parser")
    print("src=" + soup.find("iframe")['src'])
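A minimal sketch of the "more Beautiful Soup" route without the regex step: since the data-embed value is itself a piece of HTML, it can be parsed as its own small document. The markup below is a simplified stand-in for the question's element, and the example.com URL is a placeholder, not the real link.

```python
from bs4 import BeautifulSoup

# Hypothetical markup modelled on the question; the URL is a placeholder.
html = """<li id="server_0" data-embed="<iframe src='https://example.com/embed?v=1'></iframe>"><a>vk</a></li>"""

outer = BeautifulSoup(html, "html.parser")
embed_markup = outer.find("li")["data-embed"]  # a plain str, not a Tag

# Parse the attribute value as its own small document.
inner = BeautifulSoup(embed_markup, "html.parser")
iframe = inner.find("iframe")
src = iframe["src"] if iframe is not None else None
print(src)
```

The None check keeps the lookup from raising if a list item happens to carry an embed value without an iframe.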

How do I get the width inside of style inside of class of span?

I'm trying to extract the value from "width: 100%", and only the number part. I'm using BeautifulSoup in Python, and this is my code for getting it:
rating_data.append(e.select_one('.review-list__rating__active'))
but it keeps giving me the whole element, which is
<span class="review-list__rating__active" style="width: 100%">
I'm trying to get the value of the width. Is that possible using BeautifulSoup in Python? Thanks, guys.
The cssutils library can help with this:
from bs4 import BeautifulSoup
import cssutils
html = """<span class="review-list__rating__active" style="width: 100%"></span>"""
soup = BeautifulSoup(html, "html.parser")
span_style = soup.select_one('.review-list__rating__active')['style']
style = cssutils.parseStyle(span_style)
print(style.width)
This displays:
100%
To get the value of an attribute, use get():
e.select_one('.review-list__rating__active').get('style')
If there is only one style property, you could use split() to get the value:
e.select_one('.review-list__rating__active').get('style').split(':')[-1].strip()
If there are multiple, you have to iterate:
for s in e.select_one('.review-list__rating__active').get('style').split(';'):
    if 'width' in s:
        print(s.split(':')[-1].strip())
Another alternative would be to use regex.
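The regex alternative could look roughly like this. It is a sketch that assumes the style value has already been read from the tag as a string; the sample style string and the pattern are illustrative, not taken from the original page.

```python
import re

# Assumed example of a style attribute value with several declarations.
style = "color: red; width: 100%; height: 4px"

# Anchor on the start of a declaration so that e.g. "min-width" is not matched,
# then capture only the numeric part of the width.
match = re.search(r"(?:^|;)\s*width\s*:\s*(\d+(?:\.\d+)?)", style)
width = float(match.group(1)) if match else None
print(width)  # 100.0
```

Capturing just the digits gives the number without the unit, which is what the question asks for.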

Retrieve content field from html span

I have the following html code inside an object:
<span itemprop="price" content="187">187,00 €</span>
My idea is to get the content of the span element (the price). In order to do so, I am doing the following:
import requests
from lxml import html
tree = html.fromstring(res.content)
prices = tree.xpath('//span[@class="price"]/text()')
print(float(prices[0].split()[0].replace(',','.')))
Here, res.content contains the span element shown above. As you can see, I am getting the price from 187,00 € (after some modifications) when it would be easier to get it from the "content" attribute of the span. I have tried using:
tree.xpath('//span[@class="price"]/content()')
But it does not work. Is there a way to retrieve this data? I am open to use any other libraries.
You can use the BeautifulSoup library for html parsing:
from bs4 import BeautifulSoup as soup
d = soup('<span itemprop="price" content="187">187,00 €</span>', 'html.parser')
content = d.find('span')['content']
Output:
'187'
To be even more specific, you can provide the itemprop value:
content = d.find('span', {'itemprop':'price'})['content']
To get the content between the tags, use .text on the tag:
content = d.find('span', {'itemprop':'price'}).text
Output:
'187,00\xa0€'
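Since the question ultimately wants a number, a small sketch of converting the attribute value might help. It assumes the content attribute always holds a plain decimal string, as it does in the snippet shown.

```python
from bs4 import BeautifulSoup

html = '<span itemprop="price" content="187">187,00 €</span>'
soup = BeautifulSoup(html, "html.parser")

span = soup.find("span", attrs={"itemprop": "price"})
# Attribute values are strings; convert only if the span and attribute exist.
has_price = span is not None and span.has_attr("content")
price = float(span["content"]) if has_price else None
print(price)  # 187.0
```

Reading the machine-readable attribute sidesteps the comma and currency symbol handling needed for the visible text.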
You can try selecting the element itself and reading the attribute with get():
prices = tree.xpath('//span[@class="price"]')
for price in prices:
    print(price.get("content"))
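Staying with lxml, XPath can also select the attribute directly with @content (content() is not an XPath function). The sketch below selects on itemprop because that is the only attribute visible in the posted snippet; the wrapping div is added purely for illustration.

```python
from lxml import html

# Markup from the question, wrapped in a div purely for illustration.
tree = html.fromstring(
    '<div><span itemprop="price" content="187">187,00 €</span></div>'
)

# @content selects the attribute value of every matching span.
values = tree.xpath('//span[@itemprop="price"]/@content')
price = float(values[0]) if values else None
print(price)  # 187.0
```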

In Python, how do I find elements that contain a specific attribute?

I'm using Python 3.7. I want to locate all the elements in my HTML page that have an attribute, "data-permalink", regardless of what its value is, even if the value is empty. However, I'm confused about how to do this. I'm using the bs4 package and tried the following
soup = BeautifulSoup(html)
soup.findAll("data-permalink")
[]
soup.findAll("a")
[<a href=" ... </a>]
soup.findAll("a.data-permalink")
[]
The attribute is normally only found in anchor tags on my page, hence my unsuccessful, "a.data-permalink" attempt. I would like to return the elements that contain the attribute.
Your selector is invalid:
soup.findAll("a.data-permalink")
A CSS selector like this belongs in .select(), not findAll(), and even there it would be wrong, because a.data-permalink means an <a> with the class data-permalink, not the attribute.
To match any element that has the attribute, use * with select():
.select('*[data-permalink]')
or pass True as the tag name if using findAll():
.findAll(True, attrs={'data-permalink' : True})
Example:
from bs4 import BeautifulSoup
html = '''<a data-permalink="a">link</a>
<b>bold</b>
<i data-permalink="i">italic</i>'''
soup= BeautifulSoup(html, 'html.parser')
permalink = soup.select('*[data-permalink]')
# or
# permalink = soup.findAll(True, attrs={'data-permalink' : True})
print(permalink)
Result (the <b> element is skipped):
[<a data-permalink="a">link</a>, <i data-permalink="i">italic</i>]
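To illustrate the "even if the value is empty" part of the question, here is a small sketch restricted to anchor tags. The three links are invented sample data.

```python
from bs4 import BeautifulSoup

# Invented sample links: empty value, non-empty value, attribute absent.
html = '''<a data-permalink="">empty value</a>
<a data-permalink="/post/1">with value</a>
<a href="/other">no attribute</a>'''
soup = BeautifulSoup(html, "html.parser")

# True matches any value of the attribute, including the empty string.
tagged = soup.find_all("a", attrs={"data-permalink": True})
texts = [a.get_text() for a in tagged]
print(texts)  # ['empty value', 'with value']
```

Passing "a" instead of True as the first argument narrows the search to anchors while keeping the same attribute test.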

HTML parsing , nested div issue using BeautifulSoup

I am trying to extract specific nested div class and the corresponding h3 attribute (salary value).
So, I have tried the search by class method
soup.find_all('div',{'class':"vac_display_field"}
which returns an empty list.
Snippet code:
<div class="vac_display_field">
<h3>
Salary
</h3>
<div class="vac_display_field_value">
£27,951 - £30,859
</div>
</div>
Example here
First make sure you've instantiated your BeautifulSoup object correctly. Should look something like this:
from bs4 import BeautifulSoup
import requests
url = 'https://www.civilservicejobs.service.gov.uk/csr/index.cgi?SID=cGFnZWNsYXNzPUpvYnMmb3duZXJ0eXBlPWZhaXImY3NvdXJjZT1jc3FzZWFyY2gmcGFnZWFjdGlvbj12aWV3dmFjYnlqb2JsaXN0JnNlYXJjaF9zbGljZV9jdXJyZW50PTEmdXNlcnNlYXJjaGNvbnRleHQ9MjczMzIwMTcmam9ibGlzdF92aWV3X3ZhYz0xNTEyMDAwJm93bmVyPTUwNzAwMDAmcmVxc2lnPTE0NzcxNTIyODItYjAyZmM4ZTgwNzQ2ZTA2NmY5OWM0OTBjMTZhMWNlNjhkZDMwZDU4NA=='
response = requests.get(url)
soup = BeautifulSoup(response.content, 'html.parser') # the 'html.parser' part is optional.
The code you used to scrape the div tags looks correct (it's missing a closing parenthesis, however). If, for some reason, it still doesn't work, try calling find_all() this way:
soup.find_all('div', class_='vac_display_field')
If you inspect the page's source, you'll find that the div tag you need is the second from the top.
Thus, your code can reflect that, using simple index notation:
Salary_info = soup.find_all(class_='vac_display_field')[1]
Then output the text:
print(Salary_info.get_text(" ", strip=True))  # Python 3 print; joins the heading and value text
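If the position of the block on the page is not reliable, an alternative sketch is to pair each h3 label with its value and look the salary up by name. The markup below is the snippet from the question; the dictionary approach is an assumption about what is convenient, not something the page requires.

```python
from bs4 import BeautifulSoup

html = """
<div class="vac_display_field">
  <h3>Salary</h3>
  <div class="vac_display_field_value">£27,951 - £30,859</div>
</div>
"""
soup = BeautifulSoup(html, "html.parser")

# Map each field's heading text to its value text.
fields = {}
for field in soup.find_all("div", class_="vac_display_field"):
    label = field.find("h3")
    value = field.find("div", class_="vac_display_field_value")
    if label is not None and value is not None:
        fields[label.get_text(strip=True)] = value.get_text(strip=True)

print(fields.get("Salary"))
```

Looking up by label keeps working if the site reorders its fields, whereas an index like [1] would silently pick a different one.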
HTH.
