In Python, how do I find elements that contain a specific attribute? - python

I'm using Python 3.7. I want to locate all the elements in my HTML page that have an attribute, "data-permalink", regardless of what its value is, even if the value is empty. However, I'm confused about how to do this. I'm using the bs4 package and tried the following
soup = BeautifulSoup(html)
soup.findAll("data-permalink")
[]
soup.findAll("a")
[<a href=" ... </a>]
soup.findAll("a.data-permalink")
[]
The attribute is normally only found in anchor tags on my page, hence my unsuccessful, "a.data-permalink" attempt. I would like to return the elements that contain the attribute.

Your selector is invalid:
soup.findAll("a.data-permalink")
A CSS selector like this belongs in the .select() method, not findAll(), and even there it would not do what you want: a.data-permalink selects <a> elements with the class data-permalink, not elements with that attribute.
To match any element that has the attribute, use * with select():
.select('*[data-permalink]')
or pass True as the tag name if using findAll():
.findAll(True, attrs={'data-permalink' : True})
Example:
from bs4 import BeautifulSoup
html = '''<a data-permalink="a">link</a>
<b>bold</b>
<i data-permalink="i">italic</i>'''
soup= BeautifulSoup(html, 'html.parser')
permalink = soup.select('*[data-permalink]')
# or
# permalink = soup.findAll(True, attrs={'data-permalink' : True})
print(permalink)
Result (the <b> element is skipped):
[<a data-permalink="a">link</a>, <i data-permalink="i">italic</i>]
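Since the question also asks about empty values, here is a small sketch suggesting that both approaches test only for the presence of the attribute. The sample markup is made up for illustration and is not taken from the asker's page.

```python
from bs4 import BeautifulSoup

# Made-up sample markup: one empty value, one element without the
# attribute, and one non-empty value.
html = '''<a data-permalink="" href="/first">first</a>
<a href="/second">second</a>
<span data-permalink="x">third</span>'''

soup = BeautifulSoup(html, 'html.parser')

# attrs={name: True} matches any element that has the attribute,
# whatever its value is.
by_find_all = soup.find_all(attrs={'data-permalink': True})

# The CSS attribute selector [name] also tests only for presence.
by_select = soup.select('[data-permalink]')

print([tag.name for tag in by_find_all])
print([tag.name for tag in by_select])
```

Omitting the tag name, as done here, should be equivalent to passing True or * when you do not want to restrict the search to anchors.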

Related

How to get attribute value from li tag in python BS4

How can I get the src attribute of this link tag with BS4 library?
Right now I'm using the code below to try to get the result, but I can't.
<li class="active" id="server_0" data-embed="<iframe src='https://vk.com/video_ext.php?oid=757563422&id=456240701&hash=1d8fcd32c5b5f28b' scrolling='no' frameborder='0' width='100%' height='100%' allowfullscreen='true' webkitallowfullscreen='true' mozallowfullscreen='true' ></iframe>"><a><span><i class="fa fa-eye"></i></span> <strong>vk</strong></a></li>
I want this value: src='https://vk.com/video_ext.php?oid=757563422&id=456240701&hash=1d8fcd32c5b5f28b'
This is my code. I can access ['data-embed'], but I don't know how to extract the link:
from bs4 import BeautifulSoup as bs
import cloudscraper
scraper = cloudscraper.create_scraper()
access = "https://w.mycima.cc/play.php?vid=d4d8322b9"
response = scraper.get(access)
doc2 = bs(response.content, "lxml")
container2 = doc2.find("div", id='player').find("ul", class_="list_servers list_embedded col-sec").find("li")
link = container2['data-embed']
print(link)
Result
<Response [200]>
https://w.mycima.cc/play.php?vid=d4d8322b9
<iframe src='https://vk.com/video_ext.php?oid=757563422&id=456240701&hash=1d8fcd32c5b5f28b' scrolling='no' frameborder='0' width='100%' height='100%' allowfullscreen='true' webkitallowfullscreen='true' mozallowfullscreen='true' ></iframe>
Process finished with exit code 0
From the beautiful soup documentation
You can access a tag’s attributes by treating the tag like a dictionary.
They give the example:
tag = BeautifulSoup('<b id="boldest">bold</b>', 'html.parser').b
tag['id']
# 'boldest'
Reference and further details,
see: https://www.crummy.com/software/BeautifulSoup/bs4/doc/#attributes
So, for your case specifically, if link is a soup object you could write
print(link.find("iframe")['src'])
If link turns out to be plain text, not a soup object (which is likely the case here, since Beautiful Soup returns attribute values as strings), you can resort to string searching, a regex, or more Beautiful Soup parsing, for example:
import re
from bs4 import BeautifulSoup
link = """<Response [200]>https://w.mycima.cc/play.php?vid=d4d8322b9<iframe src='https://vk.com/video_ext.php?oid=757563422&id=456240701&hash=1d8fcd32c5b5f28b'></iframe>"""
iframe = re.search(r"<iframe.*>", link)
if iframe:
    soup = BeautifulSoup(iframe.group(0), "html.parser")
    print("src=" + soup.find("iframe")['src'])
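As an alternative to the regex, the attribute value can be parsed as its own small HTML document. The sketch below assumes a simplified stand-in for the <li> element from the question; the example.com URL is a placeholder, not the real one.

```python
from bs4 import BeautifulSoup

# Simplified stand-in for the <li> in the question; the URL is a placeholder.
html = '''<li id="server_0" data-embed="<iframe src='https://example.com/embed?v=1' scrolling='no'></iframe>"><a>vk</a></li>'''

outer = BeautifulSoup(html, 'html.parser')

# The attribute value comes back as a plain str that happens to contain HTML.
embed_markup = outer.find('li')['data-embed']

# Parse that string a second time to reach the <iframe> inside it.
inner = BeautifulSoup(embed_markup, 'html.parser')
iframe = inner.find('iframe')
src = iframe['src'] if iframe is not None else None
print(src)
```

Parsing twice avoids depending on the exact quoting or attribute order inside the embedded markup.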

Extracting href from 'a' element with text only attribute

I am trying to build a function in a python webscraper that moves to the next page in a list of results. I am having trouble locating the element in beautiful soup as the link is found at the end of many other tags, and doesn't have any attributes such as class or ID.
Here is a snippet of the html:
<a href="http://www.url?&=page=2">
Next
</a>
I have been reading the bs4 documentation trying to understand how I can extract the URL, but I am coming up stumped. I am thinking that it could be done by either:
1. Finding the last .a['href'] in the parent element, as it is always the last one.
2. Finding the href based on the fact that it always has the text 'Next'.
I don't know how to write something that would solve either 1. or 2.
Am I along the right lines? Does anyone have any suggestions to achieve my goal? Thanks
To find <a> tag that contains text Next, you can do:
from bs4 import BeautifulSoup
txt = '''
<a href="http://www.url?&=page=2">
Next
</a>'''
soup = BeautifulSoup(txt, 'html.parser')
print(soup.select_one('a:contains("Next")')['href'])
Prints:
http://www.url?&=page=2
Or:
print(soup.find('a', string=lambda t: t and t.strip() == 'Next')['href'])
To get the last <a> tag inside some element, you can index the ResultSet with [-1]:
from bs4 import BeautifulSoup
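# Example markup; the href values are placeholders.
txt = '''
<div id="block">
    <a href="#">Some other link</a>
    <a href="http://www.url?&=page=2">Next</a>
</div>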
'''
soup = BeautifulSoup(txt, 'html.parser')
print(soup.select('div#block > a')[-1]['href'])
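For a scraper that pages through results, it may help to wrap the lookup in a function that returns None on the last page instead of raising. This is only a sketch: next_page_url is a hypothetical helper name and the example.com URLs are placeholders.

```python
from bs4 import BeautifulSoup


def next_page_url(html):
    """Return the href of the <a> whose text is 'Next', or None if absent."""
    soup = BeautifulSoup(html, 'html.parser')
    link = soup.find(
        'a',
        href=True,  # only consider anchors that actually have an href
        # The function may receive None for tags without a single string.
        string=lambda s: s is not None and s.strip() == 'Next',
    )
    return link['href'] if link is not None else None


page = '''<div id="block">
<a href="https://example.com/results?page=1">Previous</a>
<a href="https://example.com/results?page=2">
Next
</a>
</div>'''

print(next_page_url(page))
print(next_page_url('<p>last page, no link</p>'))
```

A caller can then loop while next_page_url(...) returns a value and stop when it returns None.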

How to find links within a specified class with Beautiful Soup

I'm using Beautiful Soup 4 to parse a news site for links contained in the body text. I was able to find all the paragraphs that contain the links, but paragraph.get('href') returned None for each one. I'm using Python 3.5.1. Any help is really appreciated.
from bs4 import BeautifulSoup
import urllib.request
import re
soup = BeautifulSoup("http://www.cnn.com/2016/11/18/opinions/how-do-you-deal-with-donald-trump-dantonio/index.html", "html.parser")
for paragraph in soup.find_all("div", class_="zn-body__paragraph"):
    print(paragraph.get('href'))
Do you really want this?
for paragraph in soup.find_all("div", class_="zn-body__paragraph"):
    for a in paragraph("a"):
        print(a.get('href'))
Note that paragraph.get('href') looks for an href attribute on the <div> tag you found. As there is no such attribute, it returns None. Most probably you actually need to find all <a> tags that are descendants of your <div> (this can be done with paragraph("a"), which is a shortcut for paragraph.find_all("a")) and then look at the href attribute of each <a> element.
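The same idea can be written as a single CSS selector with select(). The sketch below uses made-up markup and placeholder URLs rather than the real page.

```python
from bs4 import BeautifulSoup

# Made-up markup: only anchors inside div.zn-body__paragraph should be kept.
html = '''<div class="zn-body__paragraph">See <a href="https://example.com/a">this</a>.</div>
<div class="other"><a href="https://example.com/skip">skipped</a></div>
<div class="zn-body__paragraph">No link here.</div>
<div class="zn-body__paragraph"><a href="https://example.com/b">that</a> and <a>no href</a></div>'''

soup = BeautifulSoup(html, 'html.parser')

# "div.zn-body__paragraph a[href]" selects <a> descendants of the matching
# divs, and [href] skips anchors that have no href attribute at all.
hrefs = [a['href'] for a in soup.select('div.zn-body__paragraph a[href]')]
print(hrefs)
```

Filtering on [href] means a['href'] cannot raise KeyError, so no None check is needed.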

HTML parsing , nested div issue using BeautifulSoup

I am trying to extract specific nested div class and the corresponding h3 attribute (salary value).
So, I have tried the search by class method
soup.find_all('div',{'class':"vac_display_field"}
which returns an empty list.
Snippet code:
<div class="vac_display_field">
<h3>
Salary
</h3>
<div class="vac_display_field_value">
£27,951 - £30,859
</div>
</div>
Example here
First make sure you've instantiated your BeautifulSoup object correctly. Should look something like this:
from bs4 import BeautifulSoup
import requests
url = 'https://www.civilservicejobs.service.gov.uk/csr/index.cgi?SID=cGFnZWNsYXNzPUpvYnMmb3duZXJ0eXBlPWZhaXImY3NvdXJjZT1jc3FzZWFyY2gmcGFnZWFjdGlvbj12aWV3dmFjYnlqb2JsaXN0JnNlYXJjaF9zbGljZV9jdXJyZW50PTEmdXNlcnNlYXJjaGNvbnRleHQ9MjczMzIwMTcmam9ibGlzdF92aWV3X3ZhYz0xNTEyMDAwJm93bmVyPTUwNzAwMDAmcmVxc2lnPTE0NzcxNTIyODItYjAyZmM4ZTgwNzQ2ZTA2NmY5OWM0OTBjMTZhMWNlNjhkZDMwZDU4NA=='
response = requests.get(url)
soup = BeautifulSoup(response.content, 'html.parser') # the 'html.parser' part is optional.
Your code used to scrape the div tags looks correct (it's missing a closing parenthesis, however). If for some reason it still doesn't work, try calling find_all() this way:
soup.find_all('div', class_='vac_display_field')
If you inspect the page's source, you'll find that the div tag you need is the second from the top.
Your code can reflect that using simple index notation:
Salary_info = soup.find_all(class_='vac_display_field')[1]
Then output the text:
for info in Salary_info:
    print(info.get_text())
HTH.
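If the position of the div on the page is not guaranteed, one option is to look the field up by its <h3> heading instead of by index. This is a sketch against the snippet from the question only; whether the live page has the same structure throughout is an assumption.

```python
from bs4 import BeautifulSoup

# The snippet from the question, used as stand-in input.
html = '''
<div class="vac_display_field">
<h3>
Salary
</h3>
<div class="vac_display_field_value">
£27,951 - £30,859
</div>
</div>
'''

soup = BeautifulSoup(html, 'html.parser')

salary = None
for field in soup.find_all('div', class_='vac_display_field'):
    heading = field.find('h3')
    # Pick the field whose heading text is "Salary", wherever it appears.
    if heading is not None and heading.get_text(strip=True) == 'Salary':
        value = field.find('div', class_='vac_display_field_value')
        salary = value.get_text(strip=True) if value is not None else None
        break

print(salary)
```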

Extract href from html

I am given the following html :
<IMG border="0" SRC="SOMETHING" ALT="[DIR] "> Acaryochloris_marina_MBIC11017_> Jun 12 2013
<IMG border="0" SRC="SOMETHING" ALT="[DIR] "> Acetobacter_pasteurianus_386B_u> Aug 8 2013
and many more...
I want to extract the href from here.
Here's my python script : (page_source contains the html)
soup = BeautifulSoup(page_source)
links = soup.find_all('a',attrs={'href': re.compile("^http://")})
for tag in links:
    link = tag.get('href', None)
    if link is not None:
        print(link)
But this keeps returning the following error :
links = soup.find_all('A',attrs={'HREF': re.compile("^http://")})
TypeError: 'NoneType' object is not callable
You are using BeautifulSoup version 3, not version 4. soup.find_all is then not interpreted as a method, but as a search for the first <find_all> element. Because there is no such element, soup.find_all resolves to None.
Install BeautifulSoup 4 instead; the import is:
from bs4 import BeautifulSoup
BeautifulSoup 3 is instead imported as from BeautifulSoup import BeautifulSoup.
If you are sure you wanted to use BeautifulSoup 3 (not recommended), then use:
links = soup.findAll('a', attrs={'href': re.compile("^http://")})
As a side note, because you limit your search to <a> tags with a certain href value, there will always be an href attribute on the elements that are found. Using .get() and testing for None is entirely redundant. The following is equivalent:
links = soup.find_all('a',attrs={'href': re.compile("^http://")})
for tag in links:
    link = tag['href']
    print(link)
BeautifulSoup 4 also supports CSS selectors, which could make your query a little more readable still, removing the need for you to specify a regular expression:
for tag in soup.select('a[href^="http://"]'):
    link = tag['href']
    print(link)
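The markup in the question uses upper-case tag and attribute names. With the built-in html.parser these are lower-cased during parsing, so a lower-case selector should still match. The listing below is made up in the style of the question, with placeholder URLs.

```python
from bs4 import BeautifulSoup

# Made-up directory listing with upper-case tag and attribute names.
page_source = '''<IMG SRC="icon.gif" ALT="[DIR]"> <A HREF="http://example.com/first/">first</A>
<IMG SRC="icon.gif" ALT="[DIR]"> <A HREF="/relative/">relative</A>'''

soup = BeautifulSoup(page_source, 'html.parser')

# The quoted value keeps the selector valid even though it contains ":" and "/".
links = [tag['href'] for tag in soup.select('a[href^="http://"]')]
print(links)
```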
Why not use the split command?
Iterate over all lines of the file and do something like this:
href = line.split("HREF=\"")[1].split("\"")[0]
