python beautifulsoup: replace links with url in string - python

In a string containing HTML I have several links that I want to replace with the pure href value:
from bs4 import BeautifulSoup
html = "<a href='www.google.com'>foo</a> some text <a href='www.bing.com'>bar</a> some <br> text"
soup = BeautifulSoup(html, "html.parser")
tags = soup.find_all()
for tag in tags:
    if tag.has_attr('href'):
        html = html.replace(str(tag), tag['href'])
Unfortunately, this creates some issues:
The tags in the HTML use single quotes ('), but str(tag) produces a tag with double quotes (<a href="www.google.com">foo</a>), so replace() will not find a match.
<br> is serialized as <br/>, so again replace() will not find a match.
So it seems that using Python's replace() method will not give reliable results.
Is there a way to use BeautifulSoup's methods to replace a tag with a string?
Edit:
Added the value of str(tag): <a href="www.google.com">foo</a>

Relevant part of the docs: Modifying the tree
html="""
<html><head></head>
<body>
<a href='www.google.com'>foo</a> some text
<a href='www.bing.com'>bar</a> some <br> text
</body></html>"""
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'html.parser')
for a_tag in soup.find_all('a'):
    # Replace the link text with the href value (the <a> tag itself is kept)
    a_tag.string = a_tag.get('href')
print(soup)
Output:
<html><head></head>
<body>
<a href="www.google.com">www.google.com</a> some text
<a href="www.bing.com">www.bing.com</a> some <br/> text
</body></html>
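If the goal is to drop the <a> element entirely and leave only the URL as plain text, Tag.replace_with() from the same "Modifying the tree" part of the docs is one option. The following is a minimal sketch that reuses the string from the question; the variable name result is only illustrative.

```python
from bs4 import BeautifulSoup

html = "<a href='www.google.com'>foo</a> some text <a href='www.bing.com'>bar</a> some <br> text"
soup = BeautifulSoup(html, "html.parser")

# Swap every link that carries an href for the bare href string
for a_tag in soup.find_all("a", href=True):
    a_tag.replace_with(a_tag["href"])

result = str(soup)
print(result)  # www.google.com some text www.bing.com some <br/> text
```

Note that the parser still normalizes <br> to <br/> when the tree is serialized.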

Related

How do I extract the code in-between using BeautifulSoup?

I would like to extract the text 'THIS IS THE TEXT I WANT TO EXTRACT' from the snippet below. Does anyone have any suggestions? Thanks!
<span class="cw-type__h2 Ingredients-title">Ingredients</span>
<p>
THIS IS THE TEXT I WANT TO EXTRACT</p>
from bs4 import BeautifulSoup
html = """<span class="cw-type__h2 Ingredients-title">Ingredients</span><p>THIS IS THE TEXT I WANT TO EXTRACT</p>"""
soup = BeautifulSoup(html,'lxml')
print(soup.p.text)
Assuming there is likely more HTML, I would use the class of the preceding span with an adjacent sibling combinator and a p type selector to target the appropriate p tag:
from bs4 import BeautifulSoup as bs
html = '''
<span class="cw-type__h2 Ingredients-title">Ingredients</span>
<p>
THIS IS THE TEXT I WANT TO EXTRACT</p>
'''
soup = bs(html, 'lxml')
print(soup.select_one('.Ingredients-title + p').text.strip())
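The same lookup can also be sketched without CSS selectors, using find() and find_next_sibling(). This version assumes the built-in html.parser instead of lxml, and the None checks are an addition for pages where the span or paragraph might be missing.

```python
from bs4 import BeautifulSoup

html = '''
<span class="cw-type__h2 Ingredients-title">Ingredients</span>
<p>
THIS IS THE TEXT I WANT TO EXTRACT</p>
'''
soup = BeautifulSoup(html, "html.parser")

# Locate the heading span by one of its classes, then take the next <p> sibling
title = soup.find("span", class_="Ingredients-title")
paragraph = title.find_next_sibling("p") if title else None
text = paragraph.get_text(strip=True) if paragraph else None
print(text)  # THIS IS THE TEXT I WANT TO EXTRACT
```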

BeautifulSoup and remove entire tag

I'm working with BeautifulSoup. When I encounter an <a href> tag, I want the whole tag to be removed, but that is not what currently happens.
For example, if I have:
<a href="/psf-landing/">
This is a test message
</a>
Currently, I get:
<a>
This is a test message
</a>
So how can I get just:
This is a test message
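Here is my code:
from bs4 import BeautifulSoup, Comment
soup = BeautifulSoup(content_driver, "html.parser")
# Remove HTML comments
for element in soup(text=lambda text: isinstance(text, Comment)):
    element.extract()
# This only deletes the href attribute, not the tag itself
for titles in soup.findAll('a'):
    del titles['href']
tree = soup.prettify()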
Try using the .extract() method. In your case, you're only deleting an attribute:
for titles in soup.findAll('a'):
    # .get() avoids a KeyError for <a> tags without an href
    if titles.get('href') is not None:
        titles.extract()
You can see detailed examples here: Dzone NLP examples
What you need is:
text = soup.get_text(strip=True)
Here is a sample:
from bs4 import BeautifulSoup
import urllib.request
response = urllib.request.urlopen('http://php.net/')
html = response.read()
soup = BeautifulSoup(html, "html5lib")
text = soup.get_text(strip=True)
print(text)
You are looking for the unwrap() method. Have a look at the following snippet:
html = '''
<a href="/psf-landing/">
This is a test message
</a>'''
soup = BeautifulSoup(html, 'html.parser')
for el in soup.find_all('a', href=True):
    el.unwrap()
print(soup)
# This is a test message
Using href=True will match only the tags that have href as an attribute.
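To make the difference between the answers concrete, here is a small sketch comparing unwrap() and extract() on the same input. The surrounding <p> element and the variable names are assumptions added for illustration: unwrap() removes only the tag and keeps its text, while extract() removes the tag together with its contents.

```python
from bs4 import BeautifulSoup

html = '<p>before <a href="/psf-landing/">This is a test message</a> after</p>'

# unwrap(): drop the <a> tag but keep its text in place
unwrapped = BeautifulSoup(html, "html.parser")
for el in unwrapped.find_all("a", href=True):
    el.unwrap()

# extract(): drop the <a> tag and everything inside it
extracted = BeautifulSoup(html, "html.parser")
for el in extracted.find_all("a", href=True):
    el.extract()

print(unwrapped)  # <p>before This is a test message after</p>
print(extracted)  # <p>before  after</p>
```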

How to extract partial text from href using BeautifulSoup in Python

Here's my code:
for item in data:
    print(item.find_all('td')[2].find('a'))
    print(item.find('span').text.strip())
    print(item.find_all('td')[3].text)
    print(item.find_all('td')[2].find(target="_blank").string.strip())
It prints this text below.
<a href="argument_transcripts/2016/16-399_3f14.pdf"
id="ctl00_ctl00_MainEditable_mainContent_rptTranscript_ctl01_hypFile"
target="_blank">16-399. </a>
Perry v. Merit Systems Protection Bd.
04/17/17
16-399.
All I want from the href tag is this part: 16-399_3f14
How can I do that? Thanks.
You can use find_all to pull the anchor elements that have an href attribute and then parse the href values for the information you are looking for.
from bs4 import BeautifulSoup
html = '''<a href="argument_transcripts/2016/16-399_3f14.pdf"
id="ctl00_ctl00_MainEditable_mainContent_rptTranscript_ctl01_hypFile"
target="_blank">16-399. </a>'''
soup = BeautifulSoup(html, 'html.parser')
for a in soup.find_all('a', href=True):
    # Split the path on '/' and keep the last component (the file name)
    url = a['href'].split('/')
    print(url[-1])
This should output the file name part of the href:
16-399_3f14.pdf
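The question asked for 16-399_3f14 without the .pdf extension. One way to get that, sketched here with the standard library's pathlib rather than anything specific to BeautifulSoup, is to take the stem of the path; the list name stems is only illustrative.

```python
from pathlib import PurePosixPath

from bs4 import BeautifulSoup

html = '''<a href="argument_transcripts/2016/16-399_3f14.pdf"
id="ctl00_ctl00_MainEditable_mainContent_rptTranscript_ctl01_hypFile"
target="_blank">16-399. </a>'''
soup = BeautifulSoup(html, "html.parser")

# .stem is the final path component with its last suffix removed
stems = [PurePosixPath(a["href"]).stem for a in soup.find_all("a", href=True)]
print(stems)  # ['16-399_3f14']
```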

Extracting text between <br> with beautifulsoup, but without next tag

I'm using python + beautifulsoup to try to get the text between the br's. The closest I got to this was by using next_sibling in the following manner:
<html>
<body>
</a><span class="strong">Title1</span>
<p>Text1</p>
<br>The Text I want to get<br>
<p>Text I dont want</p>
</body>
</html>
for span in soup.findAll("span", {"class" : "strong"}):
    print(span.next_sibling.next_sibling.text)
But this prints:
The Text I want to getText I dont want
So the text I want comes after the first p but before the second, and I can't figure out how to extract it when there are no real tags around it, only the br's as references.
I need it to print:
The Text I want to get
Since the HTML you've provided is broken, the behavior differs depending on which parser BeautifulSoup uses.
With the lxml parser, BeautifulSoup converts the br tag into a self-closing one:
>>> soup = BeautifulSoup(data, 'lxml')
>>> print(soup)
<html>
<body>
<span class="strong">Title1</span>
<p>Text1</p>
<br/>The Text I want to get<br/>
<p>Text I dont want</p>
</body>
</html>
Note that you need lxml to be installed. If that is okay for you, find the br and get its next sibling:
from bs4 import BeautifulSoup
data = """your HTML"""
soup = BeautifulSoup(data, 'lxml')
print(soup.br.next_sibling) # prints "The Text I want to get"
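If several <br> tags may be present, a slightly more defensive variant is to walk all of them and keep only the non-empty text nodes that directly follow one. The sketch below makes two assumptions: it uses the built-in html.parser rather than lxml, and it uses the question's HTML with the stray </a> left out.

```python
from bs4 import BeautifulSoup, NavigableString

data = """<html>
<body>
<span class="strong">Title1</span>
<p>Text1</p>
<br>The Text I want to get<br>
<p>Text I dont want</p>
</body>
</html>"""

soup = BeautifulSoup(data, "html.parser")

texts = []
for br in soup.find_all("br"):
    sibling = br.next_sibling
    # Keep only bare text nodes that contain more than whitespace
    if isinstance(sibling, NavigableString) and sibling.strip():
        texts.append(sibling.strip())

print(texts)  # ['The Text I want to get']
```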
Also see:
Using beautifulsoup to extract text between line breaks (e.g. <br /> tags)
Parsing unclosed `<br>` tags with BeautifulSoup
Using Python Scrapy:
In [4]: hxs.select('//body/text()').extract()
Out[4]: [u'\n', u'\n', u'\n', u'The Text I want to get', u'\n', u'\n']

How to extract the text between some anchor tags?

I need to extract the name of the artists from an HTML page. Here's a snippet of the page:
</td>
<td class="playbuttonCell">
<a class="playbutton preview-track" href="/music/example" data-analytics-redirect="false" >
<img class="transparent_png play_icon" width="13" height="13" alt="Play" src="http://cdn.last.fm/flatness/preview/play_indicator.png" style="" />
</a>
</td>
<td class="subjectCell" title="example, played 3 times">
<div>
<a href="/music/example-artist" >Example artist name</a>
I've tried this, but it isn't doing the job.
import urllib
from bs4 import BeautifulSoup
html = urllib.urlopen('http://www.last.fm/user/Jehl/charts?rangetype=overall&subtype=artists').read()
soup = BeautifulSoup(html)
print soup('a')
for link in soup('a'):
    print html
Where am I screwing up?
You can try this:
In [1]: from bs4 import BeautifulSoup
In [2]: s = # Your string here...
In [3]: soup = BeautifulSoup(s, 'html.parser')
In [4]: for anchor in soup.find_all('a'):
   ...:     print(anchor.text)
   ...:
   ...:
here lies the text i need
Here, the find_all method returns a list that contains all matching anchor tags, after which we can print the text property to get the value between the tags.
for link in soup.select('td.subjectCell a'):
    print(link.text)
It selects (just like CSS) the a elements inside td elements that have the subjectCell class.
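As a self-contained illustration of that selector, here is a sketch run against a trimmed copy of the snippet from the question. The wrapping <table> and <tr> elements are assumptions added so that the fragment is well-formed, and the list name artists is only illustrative.

```python
from bs4 import BeautifulSoup

html = """
<table><tr>
<td class="playbuttonCell">
  <a class="playbutton preview-track" href="/music/example"></a>
</td>
<td class="subjectCell" title="example, played 3 times">
  <div><a href="/music/example-artist">Example artist name</a></div>
</td>
</tr></table>
"""
soup = BeautifulSoup(html, "html.parser")

# Only links inside td.subjectCell match; the play-button link is skipped
artists = [a.get_text(strip=True) for a in soup.select("td.subjectCell a")]
print(artists)  # ['Example artist name']
```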
spans = soup.find_all("div", {"class": "overlay tran3s"})
for span in spans:
    links = span.find_all('a')
    for link in links:
        print(link.text)
soup.findAll and link.attrs can be used to read the href attributes easily.
Working code:
soup = BeautifulSoup(html)
for link in soup.findAll('a'):
    print(link.attrs['href'])
Output:
/music/example
/music/example-artist
Regular expressions are your friend here. As an alternative to RocketDonkey's answer, which uses BeautifulSoup properly, you can parse through soup('a') with a regular expression like
>([a-zA-Z]*|[0-9]|(\w\s*)*)</a>
You can use the re.findall method to grab the text between the anchor tags directly.
