BeautifulSoup and remove entire tag - python

I'm working with BeautifulSoup. When I come across an <a href=...> tag, I want the whole tag stripped away, but that's not what actually happens.
For example, if I have:
<a href="/psf-landing/">
This is a test message
</a>
But I can also have:
<a>
This is a test message
</a>
So, how can I end up with just:
This is a test message
Here is my code:
from bs4 import BeautifulSoup, Comment

soup = BeautifulSoup(content_driver, "html.parser")
# remove HTML comments first
for element in soup(text=lambda text: isinstance(text, Comment)):
    element.extract()
# then drop the href attribute from every <a> tag
for titles in soup.findAll('a'):
    del titles['href']
tree = soup.prettify()

Try the .extract() method. In your current code you are only deleting an attribute, not removing the element:
for titles in soup.findAll('a'):
    if titles.get('href') is not None:  # .get() avoids a KeyError on <a> tags without href
        titles.extract()
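For reference, here is a minimal, self-contained sketch of that approach on the question's markup; note that extract() removes each matched element together with the text inside it:
from bs4 import BeautifulSoup

html = '''<a href="/psf-landing/">
This is a test message
</a>
<a>
This is a test message
</a>'''

soup = BeautifulSoup(html, 'html.parser')
for titles in soup.find_all('a'):
    if titles.get('href') is not None:
        titles.extract()  # removes the whole <a href=...> element, text included

print(soup.prettify())
# only the href-less <a> block (and its text) remains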

Here you can see detailed examples: Dzone NLP examples
What you need is:
text = soup.get_text(strip=True)
Here is a sample:
from bs4 import BeautifulSoup
import urllib.request
response = urllib.request.urlopen('http://php.net/')
html = response.read()
soup = BeautifulSoup(html,"html5lib")
text = soup.get_text(strip=True)
print (text)

You are looking for the unwrap() method. Have a look at the following snippet:
from bs4 import BeautifulSoup

html = '''
<a href="/psf-landing/">
This is a test message
</a>'''
soup = BeautifulSoup(html, 'html.parser')
for el in soup.find_all('a', href=True):
    el.unwrap()
print(soup)
# This is a test message
Using href=True will match only the tags that have href as an attribute.
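As a quick check of that behavior, here is a small self-contained sketch reusing both variants of the markup from the question; only the anchor that actually carries an href gets unwrapped:
from bs4 import BeautifulSoup

html = '''<a href="/psf-landing/">This is a test message</a>
<a>This is a test message</a>'''

soup = BeautifulSoup(html, 'html.parser')
for el in soup.find_all('a', href=True):
    el.unwrap()  # matches only the first <a>, which has an href

print(soup)
# This is a test message
# <a>This is a test message</a>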

Related

How to get attribute value from li tag in python BS4

How can I get the src attribute of this link tag with the BS4 library?
Right now I'm using the code below, but I can't extract the value I need from this element:
<li class="active" id="server_0" data-embed="<iframe src='https://vk.com/video_ext.php?oid=757563422&id=456240701&hash=1d8fcd32c5b5f28b' scrolling='no' frameborder='0' width='100%' height='100%' allowfullscreen='true' webkitallowfullscreen='true' mozallowfullscreen='true' ></iframe>"><a><span><i class="fa fa-eye"></i></span> <strong>vk</strong></a></li>
I want this value: src='https://vk.com/video_ext.php?oid=757563422&id=456240701&hash=1d8fcd32c5b5f28b'
This is my code. I can access ['data-embed'], but I don't know how to extract the link from it:
from bs4 import BeautifulSoup as bs
import cloudscraper
scraper = cloudscraper.create_scraper()
access = "https://w.mycima.cc/play.php?vid=d4d8322b9"
response = scraper.get(access)
doc2 = bs(response.content, "lxml")
container2 = doc2.find("div", id='player').find("ul", class_="list_servers list_embedded col-sec").find("li")
link = container2['data-embed']
print(link)
Result
<Response [200]>
https://w.mycima.cc/play.php?vid=d4d8322b9
<iframe src='https://vk.com/video_ext.php?oid=757563422&id=456240701&hash=1d8fcd32c5b5f28b' scrolling='no' frameborder='0' width='100%' height='100%' allowfullscreen='true' webkitallowfullscreen='true' mozallowfullscreen='true' ></iframe>
Process finished with exit code 0
From the Beautiful Soup documentation:
You can access a tag's attributes by treating the tag like a dictionary.
They give the example:
tag = BeautifulSoup('<b id="boldest">bold</b>', 'html.parser')
tag['id']
# 'boldest'
For reference and further details, see https://www.crummy.com/software/BeautifulSoup/bs4/doc/#attributes
So, for your case specifically, you could write:
print(link.find("iframe")['src'])
If link turns out to be plain text rather than a soup object - which may be the case for your particular example, based on the comments - then you can resort to string searching, regex, or more BeautifulSoup parsing, for example:
import re
from bs4 import BeautifulSoup

link = """<Response [200]>https://w.mycima.cc/play.php?vid=d4d8322b9<iframe src='https://vk.com/video_ext.php?oid=757563422&id=456240701&hash=1d8fcd32c5b5f28b'></iframe>"""
iframe = re.search(r"<iframe.*>", link)
if iframe:
    soup = BeautifulSoup(iframe.group(0), "html.parser")
    print("src=" + soup.find("iframe")['src'])
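If you would rather stay within BeautifulSoup, here is a hedged sketch building on the question's own variables (container2 is assumed to be the <li> element found there): parse the HTML fragment stored in data-embed and read the iframe's src from it:
from bs4 import BeautifulSoup

# container2['data-embed'] holds an HTML fragment as a plain string
embed = BeautifulSoup(container2['data-embed'], "html.parser")
iframe = embed.find("iframe")
if iframe is not None and iframe.has_attr("src"):
    print(iframe["src"])
    # https://vk.com/video_ext.php?oid=757563422&id=456240701&hash=1d8fcd32c5b5f28b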

Get Href by text using Beautifulsoup

I'm using "requests" and "beautifulsoup" to search a webpage for all the href links with a specific text. I've already got it working, but if the text starts on a new line, BeautifulSoup doesn't "see" it and doesn't return that link.
soup = BeautifulSoup(webpageAdress, "lxml")
path = soup.findAll('a', href=True, text="Something3")
print(path)
Example:
Like this, it returns the href of the Something3 text:
...
<a href="page1/somethingC.aspx">Something3</a>
...
Like this, it doesn't return the href of the Something3 text:
...
<a href="page1/somethingC.aspx">
Something3</a>
...
The difference is that the href text (Something3) starts on a new line.
And I can't change the HTML code, because I'm not the webmaster of that webpage.
Any idea how I can solve that?
Note: I've already tried soup.replace('\n', ' ').replace('\r', ''), but I get the error 'NoneType' object is not callable.
You can use a regex to find any text that contains "Something3":
html = '''Something3
<a href="page1/somethingC.aspx">
Something3</a>'''
from bs4 import BeautifulSoup
import re
soup = BeautifulSoup(html, "lxml")
path = soup.findAll('a', href=True, text=re.compile("Something3"))
for link in path:
    print(link['href'])
You can use :contains pseudo class with bs4 4.7.1
from bs4 import BeautifulSoup as bs
html = '''<a href="page1/somethingC.aspx">
Something3</a>'''
soup = bs(html, 'lxml')
links = [link.text for link in soup.select('a:contains(Something3)')]
print(links)
And a solution without regex:
path = soup.select('a')
if path[0].getText().strip() == 'Something3':
    print(path)
Output:
[<a href="page1/somethingC.aspx">
Something3</a>]
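Another regex-free option (a sketch, assuming each matching <a> contains only text): pass a function as the text filter and strip the whitespace before comparing, so the leading newline no longer matters:
from bs4 import BeautifulSoup

html = '''<a href="page1/somethingC.aspx">
Something3</a>'''

soup = BeautifulSoup(html, "lxml")
# the filter function receives each tag's string; strip it before comparing
path = soup.findAll('a', href=True,
                    text=lambda t: t and t.strip() == "Something3")
for link in path:
    print(link['href'])
# page1/somethingC.aspx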

Having trouble extracting text from inside scraped html tags using beautiful soup

The code I am using to scrape the content:
from urllib import request
from bs4 import BeautifulSoup

class Scraper(object):
    # contains methods to scrape data from curse
    def scrape(url):
        req = request.Request(url, headers={"User-Agent": "Mozilla/5.0"})
        return request.urlopen(req).read()

    def lookup(page, tag, class_name):
        parsed = BeautifulSoup(page, "html.parser")
        return parsed.find_all(tag, class_=class_name)
This returns a list with entries similar to this:
<li class="title"><h4><a href="/addons/wow/world-quest-tracker">World Quest Tracker</a></h4></li>
I'm attempting to extract the text between the href tags, in this instance:
World Quest Tracker
How could I accomplish this?
Try this.
from bs4 import BeautifulSoup
html='''
<li class="title"><h4><a href="/addons/wow/world-quest-tracker">World Quest Tracker</a></h4></li>
'''
soup = BeautifulSoup(html, "lxml")
for item in soup.select(".title"):
    print(item.text)
Result:
World Quest Tracker
html_doc = '<li class="title"><h4><a href="/addons/wow/world-quest-tracker">World Quest Tracker</a></h4></li>'
soup = BeautifulSoup(html_doc, 'html.parser')
print(soup.find('a').text)
this will print
World Quest Tracker
I'm attempting to extract the text between the href tags
If you actually want the text in the href attribute, and not the text content wrapped by the <a></a> anchor (your wording is a bit unclear), use get('href'):
from bs4 import BeautifulSoup
html = '<li class="title"><h4><a href="/addons/wow/world-quest-tracker">World Quest Tracker</a></h4></li>'
soup = BeautifulSoup(html, 'lxml')
soup.find('a').get('href')
'/addons/wow/world-quest-tracker'
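For completeness, a small self-contained sketch that pulls both pieces at once, the visible text and the href, from markup like the question's (the href value is just the one shown above):
from bs4 import BeautifulSoup

html = '<li class="title"><h4><a href="/addons/wow/world-quest-tracker">World Quest Tracker</a></h4></li>'
soup = BeautifulSoup(html, 'html.parser')

for li in soup.find_all('li', class_='title'):
    link = li.find('a')
    if link is not None:
        print(link.get_text())   # World Quest Tracker
        print(link.get('href'))  # /addons/wow/world-quest-tracker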

How to further filter a result of ResultSet?

I'm trying to get a list of all hrefs in an HTML document. I'm using Beautiful Soup to parse my HTML file.
print soup.body.find_all('a', attrs={'data-tag':'Homepage Library'})[0]
The result I get is:
<a class="m0 vl" data-tag="Homepage Library" href="/video?lang=pl&format=lite&v=AZpftzD9jVs" title="abc">
text
</a>
I'm interested in the href="..." part only. So I would like the ResultSet to return only the value of href.
I'm not sure how to extend this query so that it returns just the href part.
Use attrs:
links = soup.body.find_all('a', attrs={'data-tag':'Homepage Library'})
print [link.attrs['href'] for link in links]
or, get attributes directly from the element by treating it like a dictionary:
links = soup.body.find_all('a', attrs={'data-tag':'Homepage Library'})
print [link['href'] for link in links]
DEMO:
from bs4 import BeautifulSoup
page = """<body>
<a href="link1">text1</a>
<a href="link2">text2</a>
<a href="link3">text3</a>
<a href="link4">text4</a>
</body>"""
soup = BeautifulSoup(page)
links = soup.body.find_all('a')
print [link.attrs['href'] for link in links]
prints
['link1', 'link2', 'link3', 'link4']
Hope that helps.
Finally this worked for me:
soup.body.find_all('a', attrs={'data-tag':'Homepage Library'})[0].attrs["href"]
for link in soup.find_all('a', attrs={'data-tag':'Homepage Library'}):
    print(link.get('href'))
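With bs4 4.7+ the same filter can also be written as a CSS attribute selector (a hedged alternative, not part of the original answers); using .get('href') avoids a KeyError if an anchor has no href:
from bs4 import BeautifulSoup

html = '<a class="m0 vl" data-tag="Homepage Library" href="/video?lang=pl&format=lite&v=AZpftzD9jVs" title="abc">text</a>'
soup = BeautifulSoup(html, 'html.parser')
links = soup.select('a[data-tag="Homepage Library"]')
print([link.get('href') for link in links])
# ['/video?lang=pl&format=lite&v=AZpftzD9jVs']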

Extract link from url using Beautifulsoup

I am trying to get the web link from the following, using BeautifulSoup:
<div class="alignright single">
<a href="http://www.dailyhadithonline.com/2013/07/21/hadith-on-clothing-women-should-lower-their-garments-to-cover-their-feet/" rel="next">Hadith on Clothing: Women should lower their garments to cover their feet »</a>
</div>
My code is as follows:
from bs4 import BeautifulSoup
import urllib2
url1 = "http://www.dailyhadithonline.com/2013/07/21/hadith-on-clothing-the-lower-garment-should-be-hallway-between-the-shins/"
content1 = urllib2.urlopen(url1).read()
soup = BeautifulSoup(content1)
nextlink = soup.findAll("div", {"class" : "alignright single"})
a = nextlink.find('a')
print a.get('href')
I get the following error. Please help:
a = nextlink.find('a')
AttributeError: 'ResultSet' object has no attribute 'find'
Use .find() if you want to find just one match:
nextlink = soup.find("div", {"class" : "alignright single"})
or loop over all matches:
for nextlink in soup.findAll("div", {"class" : "alignright single"}):
    a = nextlink.find('a')
    print a.get('href')
The latter part can also be expressed as:
a = nextlink.find('a', href=True)
print a['href']
where the href=True part only matches elements that have a href attribute, which means that you won't have to use a.get() because the attribute will be there (alternatively, no <a href="..."> link is found and a will be None).
For the given URL in your question, there is only one such link, so .find() is probably most convenient. It may even be possible to just use:
nextlink = soup.find('a', rel='next', href=True)
if nextlink is not None:
    print nextlink['href']
with no need to find the surrounding div. The rel="next" attribute looks enough for your specific needs.
As an extra tip: make use of the response headers to tell BeautifulSoup what encoding to use for a page; the urllib2 response object can tell you what, if any, character set the server thinks the HTML page is encoded in:
response = urllib2.urlopen(url1)
soup = BeautifulSoup(response.read(), from_encoding=response.info().getparam('charset'))
Quick demo of all the parts:
>>> import urllib2
>>> from bs4 import BeautifulSoup
>>> response = urllib2.urlopen('http://www.dailyhadithonline.com/2013/07/21/hadith-on-clothing-the-lower-garment-should-be-hallway-between-the-shins/')
>>> soup = BeautifulSoup(response.read(), from_encoding=response.info().getparam('charset'))
>>> soup.find('a', rel='next', href=True)['href']
u'http://www.dailyhadithonline.com/2013/07/21/hadith-on-clothing-women-should-lower-their-garments-to-cover-their-feet/'
You need to unpack the list, so try this instead:
nextlink = soup.findAll("div", {"class" : "alignright single"})[0]
Or since there's only one match the find method also ought to work:
nextlink = soup.find("div", {"class" : "alignright single"})
