How to extract partial text from href using BeautifulSoup in Python


Here's my code:
for item in data:
    print(item.find_all('td')[2].find('a'))
    print(item.find('span').text.strip())
    print(item.find_all('td')[3].text)
    print(item.find_all('td')[2].find(target="_blank").string.strip())
It prints this text below.
<a href="argument_transcripts/2016/16-399_3f14.pdf"
id="ctl00_ctl00_MainEditable_mainContent_rptTranscript_ctl01_hypFile"
target="_blank">16-399. </a>
Perry v. Merit Systems Protection Bd.
04/17/17
16-399.
All I want from the href tag is this part: 16-399_3f14
How can I do that? Thanks.

You can use find_all to pull the anchor elements that have an href attribute, then parse each href value for the part you are looking for.
from bs4 import BeautifulSoup
html = '''<a href="argument_transcripts/2016/16-399_3f14.pdf"
id="ctl00_ctl00_MainEditable_mainContent_rptTranscript_ctl01_hypFile"
target="_blank">16-399. </a>'''
soup = BeautifulSoup(html, "html.parser")
# href=True keeps only anchors that actually carry an href attribute
for a in soup.find_all('a', href=True):
    url = a['href'].split('/')
    # the last path segment is the file name
    print(url[-1])
This should output the last path segment of the href, which contains the string you are looking for.
16-399_3f14.pdf
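The output above still carries the .pdf extension, while the question asks for 16-399_3f14 only. A minimal sketch of one way to drop it, using the standard library's posixpath helpers (the names `filename`, `stem`, and `stems` are illustrative and not part of the original answer):

```python
import posixpath
from bs4 import BeautifulSoup

html = '''<a href="argument_transcripts/2016/16-399_3f14.pdf"
id="ctl00_ctl00_MainEditable_mainContent_rptTranscript_ctl01_hypFile"
target="_blank">16-399. </a>'''

soup = BeautifulSoup(html, "html.parser")
stems = []
for a in soup.find_all("a", href=True):
    # basename keeps only the last path segment of the href
    filename = posixpath.basename(a["href"])
    # splitext separates the extension from the rest of the name
    stem, _ext = posixpath.splitext(filename)
    stems.append(stem)
print(stems)
```

posixpath is used rather than os.path because URL paths always use forward slashes, whatever the operating system.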

Related

python beautifulsoup: replace links with url in string

In a string containing HTML I have several links that I want to replace with the pure href value:
from bs4 import BeautifulSoup
html = "<a href='www.google.com'>foo</a> some text <a href='www.bing.com'>bar</a> some <br> text"
soup = BeautifulSoup(html, "html.parser")
tags = soup.find_all()
for tag in tags:
    if tag.has_attr('href'):
        html = html.replace(str(tag), tag['href'])
Unfortunately this creates some issues:
The tags in the HTML use single quotes ('), but str(tag) produces a tag with double quotes (<a href="www.google.com">foo</a>), so replace() will not find a match.
<br> is serialized as <br/>, so again replace() will not find a match.
So it seems using python's replace() method will not give reliable results.
Is there a way to use beautifulsoup's methods to replace a tag with a string?
edit:
Added value for str(tag) = <a href="www.google.com">foo</a>

Relevant part of the docs: Modifying the tree
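html="""
<html><head></head>
<body>
<a href="www.google.com">foo</a> some text
<a href="www.bing.com">bar</a> some <br> text
</body></html>"""
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'html.parser')
for a_tag in soup.find_all('a'):
    # replace the link text with the href value; the <a> element itself stays
    a_tag.string = a_tag.get('href')
print(soup)
output
<html><head></head>
<body>
<a href="www.google.com">www.google.com</a> some text
<a href="www.bing.com">www.bing.com</a> some <br/> text
</body></html>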
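Setting a_tag.string changes only the text inside each link. If the goal is to swap the whole <a> element for its bare href value, as the question asks, Beautiful Soup's replace_with method is one option. A minimal sketch, reusing the question's HTML (the name `result` is illustrative):

```python
from bs4 import BeautifulSoup

html = "<a href='www.google.com'>foo</a> some text <a href='www.bing.com'>bar</a> some <br> text"
soup = BeautifulSoup(html, "html.parser")

# find_all returns a list up front, so modifying the tree in the loop is safe
for a_tag in soup.find_all("a", href=True):
    # replace_with swaps the entire <a> element for a plain string
    a_tag.replace_with(a_tag["href"])

result = str(soup)
print(result)
```

Note that the serialized result still writes the <br> tag as <br/>, since that is how Beautiful Soup outputs void elements.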

How to select an href class tag using BeautifulSoup?

How can I select a href class tag?
Example of html code:
<a title="bla" class="example"> text </a>
So I wish to identify which a tag to grab from via either "title" or "class" then output the text within the a tags, so in this case output would be
text
Code I'm using
from bs4 import BeautifulSoup
import requests
source = requests.get('http://www.example.com').text
soup = BeautifulSoup(source, 'lxml')
for profile in soup.select(" select input here "):
    print(profile.text.encode("utf-8"))

Aside from what @Stack suggested in the comments:
soup.find_all('a', {'title': 'bla'})
soup.find_all('a', {'class': 'example'})
you can do that using CSS selectors (I see that you already have a select() call there):
soup.select("a[title=bla]")
soup.select("a.example")
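To see both selectors in action, here is a small self-contained sketch. The second anchor in the sample HTML and the names `by_title` and `by_class` are assumptions added for illustration:

```python
from bs4 import BeautifulSoup

# The first anchor comes from the question; the second is added to show filtering
html = '<a title="bla" class="example"> text </a><a title="other">skip</a>'
soup = BeautifulSoup(html, "html.parser")

# Attribute selector: anchors whose title attribute equals "bla"
by_title = [a.get_text(strip=True) for a in soup.select('a[title="bla"]')]
# Class selector: anchors carrying the class "example"
by_class = [a.get_text(strip=True) for a in soup.select("a.example")]
print(by_title, by_class)
```

get_text(strip=True) trims the surrounding whitespace, so each list holds the bare link text.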

How to find links within a specified class with Beautiful Soup

I'm using Beautiful Soup 4 to parse a news site for links contained in the body text. I was able to find all the paragraphs that contain the links, but paragraph.get('href') returned None for each of them. I'm using Python 3.5.1. Any help is really appreciated.
from bs4 import BeautifulSoup
import urllib.request
import re
url = "http://www.cnn.com/2016/11/18/opinions/how-do-you-deal-with-donald-trump-dantonio/index.html"
# BeautifulSoup expects markup, not a URL, so download the page first
soup = BeautifulSoup(urllib.request.urlopen(url).read(), "html.parser")
for paragraph in soup.find_all("div", class_="zn-body__paragraph"):
    print(paragraph.get('href'))

Do you really want this?
for paragraph in soup.find_all("div", class_="zn-body__paragraph"):
    for a in paragraph("a"):
        print(a.get('href'))
Note that paragraph.get('href') looks for an href attribute on the <div> tag you found. As there is no such attribute, it returns None. Most likely you need to find all <a> tags that are descendants of your <div> (this can be done with paragraph("a"), which is a shortcut for paragraph.find_all("a")) and then look at the href attribute of each <a> element.
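The difference can be checked without fetching the page. The sketch below uses a made-up HTML fragment with the same class name; the fragment and the names `div_hrefs` and `hrefs` are assumptions for illustration:

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for the article markup
html = """
<div class="zn-body__paragraph">See <a href="/first">one</a> and <a href="/second">two</a>.</div>
<div class="zn-body__paragraph">No links here.</div>
"""
soup = BeautifulSoup(html, "html.parser")
paragraphs = soup.find_all("div", class_="zn-body__paragraph")

# The <div> elements have no href attribute, so .get returns None for each
div_hrefs = [p.get("href") for p in paragraphs]

# The href values live on the <a> descendants of each <div>
hrefs = [a["href"] for p in paragraphs for a in p.find_all("a", href=True)]
print(div_hrefs, hrefs)
```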

How to further filter a result of ResultSet?

I'm trying to get a list of all hrefs in an HTML document. I'm using Beautiful Soup to parse my HTML file.
print(soup.body.find_all('a', attrs={'data-tag':'Homepage Library'})[0])
The result I get is:
<a class="m0 vl" data-tag="Homepage Library" href="/video?lang=pl&format=lite&v=AZpftzD9jVs" title="abc">
text
</a>
I'm interested in href="" part only. So I would like the ResultSet to return the value of href only.
I'm not sure how to extend this query, so it returns the href part.

Use attrs:
links = soup.body.find_all('a', attrs={'data-tag':'Homepage Library'})
print([link.attrs['href'] for link in links])
or, get attributes directly from the element by treating it like a dictionary:
links = soup.body.find_all('a', attrs={'data-tag':'Homepage Library'})
print([link['href'] for link in links])
DEMO:
from bs4 import BeautifulSoup
page = """<body>
<a href="link1">text1</a>
<a href="link2">text2</a>
<a href="link3">text3</a>
<a href="link4">text4</a>
</body>"""
soup = BeautifulSoup(page, "html.parser")
links = soup.body.find_all('a')
print([link.attrs['href'] for link in links])
prints
['link1', 'link2', 'link3', 'link4']
Hope that helps.

Finally this worked for me (find_all returns a ResultSet, so pick an element before reading attrs):
soup.body.find_all('a', attrs={'data-tag':'Homepage Library'})[0].attrs["href"]

for link in soup.find_all('a', attrs={'data-tag':'Homepage Library'}):
    print(link.get('href'))
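The answers above read the attribute in two ways, link['href'] and link.get('href'). They differ when an anchor has no href: indexing raises KeyError, while get returns None. A short sketch with an invented page (the second anchor and the names `with_get` and `only_present` are assumptions for illustration):

```python
from bs4 import BeautifulSoup

# Hypothetical page: the second anchor deliberately lacks an href
page = """<body>
<a data-tag="Homepage Library" href="/video?v=1">first</a>
<a data-tag="Homepage Library">no href</a>
</body>"""
soup = BeautifulSoup(page, "html.parser")
links = soup.find_all("a", attrs={"data-tag": "Homepage Library"})

# .get returns None for a missing attribute instead of raising KeyError
with_get = [link.get("href") for link in links]
# Dictionary-style access is safe once anchors without href are filtered out
only_present = [link["href"] for link in links if link.has_attr("href")]
print(with_get, only_present)
```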

How to extract the text between some anchor tags?

I need to extract the name of the artists from an HTML page. Here's a snippet of the page:
</td>
<td class="playbuttonCell">
<a class="playbutton preview-track" href="/music/example" data-analytics-redirect="false" >
<img class="transparent_png play_icon" width="13" height="13" alt="Play" src="http://cdn.last.fm/flatness/preview/play_indicator.png" style="" />
</a>
</td>
<td class="subjectCell" title="example, played 3 times">
<div>
<a href="/music/example-artist" >Example artist name</a>
I've tried this but isn't doing the job.
import urllib.request
from bs4 import BeautifulSoup
html = urllib.request.urlopen('http://www.last.fm/user/Jehl/charts?rangetype=overall&subtype=artists').read()
soup = BeautifulSoup(html, "html.parser")
print(soup('a'))
for link in soup('a'):
    print(html)
Where am I screwing up?

You can try this:
In [1]: from bs4 import BeautifulSoup
In [2]: s = # Your string here...
In [3]: soup = BeautifulSoup(s, "html.parser")
In [4]: for anchor in soup.find_all('a'):
   ...:     print(anchor.text)
   ...:
here lies the text i need
Here, the find_all method returns a list that contains all matching anchor tags, after which we can print the text property to get the value between the tags.

for link in soup.select('td.subjectCell a'):
    print(link.text)
It selects (just like CSS) the a elements inside td elements that have the subjectCell class.
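Applied to a trimmed version of the snippet from the question, the selector skips the play-button link and returns only the artist link. In this sketch the surrounding table and row tags are added so the fragment is complete, and the name `artists` is illustrative:

```python
from bs4 import BeautifulSoup

# Trimmed copy of the question's snippet, wrapped in <table><tr> for completeness
html = """
<table><tr>
<td class="playbuttonCell">
<a class="playbutton preview-track" href="/music/example"><img alt="Play" src="play_indicator.png" /></a>
</td>
<td class="subjectCell" title="example, played 3 times">
<div><a href="/music/example-artist">Example artist name</a></div>
</td>
</tr></table>
"""
soup = BeautifulSoup(html, "html.parser")

# Only anchors nested inside a td with class "subjectCell" match
artists = [a.get_text(strip=True) for a in soup.select("td.subjectCell a")]
print(artists)
```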

spans = soup.find_all("div", {"class": "overlay tran3s"})
for span in spans:
    links = span.find_all('a')
    for link in links:
        print(link.text)

soup.find_all (spelled findAll in older code) and link.attrs can be used to read the href attributes easily.
Working code:
soup = BeautifulSoup(html, "html.parser")
for link in soup.find_all('a'):
    print(link.attrs['href'])
Output:
/music/example
/music/example-artist

Regular expressions are your friend here. As an alternative to RocketDonkey's answer, which uses BeautifulSoup properly, you can run a regular expression over the markup of the anchors returned by soup('a'), for example
>([a-zA-Z]*|[0-9]|(\w\s*)*)</a>
and use the re.findall method to grab the text between the anchor tags directly.
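As a rough sketch of the regular-expression approach, the example below uses a simpler pattern than the one quoted above, swapped in for illustration. It assumes each anchor contains plain text only (no nested tags); the sample string and the name `anchor_text` are likewise illustrative. Regular expressions are fragile on real-world HTML, so treat this as a quick hack rather than a replacement for a parser.

```python
import re

html = '<td><a href="/music/example-artist" >Example artist name</a></td>'

# Capture the text between an opening <a ...> tag and its closing </a>,
# assuming the anchor holds plain text with no nested tags.
anchor_text = re.findall(r"<a\b[^>]*>([^<]*)</a>", html)
print(anchor_text)
```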
