Get Href by text using Beautifulsoup

I'm using "requests" and "beautifulsoup" to find all the href links on a webpage whose link text matches a specific string. It mostly works, but if the text starts on a new line, BeautifulSoup doesn't "see" it and doesn't return that link.
soup = BeautifulSoup(webpageAdress, "lxml")
path = soup.findAll('a', href=True, text="Something3")
print(path)
Example:
Like this, it returns the href of the Something3 text:
...
<a href="page1/somethingC.aspx">Something3</a>
...
Like this, it doesn't return the href of the Something3 text:
...
<a href="page1/somethingC.aspx">
Something3</a>
...
The difference is that the link text (Something3) is on a new line.
I can't change the HTML because I'm not the webmaster of that page.
Any idea how I can solve this?
Note: I've already tried soup.replace('\n', ' ').replace('\r', ''), but I get the error 'NoneType' object is not callable.

You can use a regex to find any text that contains "Something3":
html = '''Something3
<a href="page1/somethingC.aspx">
Something3</a>'''
from bs4 import BeautifulSoup
import re
soup = BeautifulSoup(html, "lxml")
path = soup.findAll('a', href=True, text=re.compile("Something3"))
for link in path:
    print(link['href'])
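Note that a regex match is a substring match, so the pattern above would also accept link text such as "Something30". As a minimal sketch of an exact match that still ignores surrounding whitespace, the text can be stripped before comparing. The sample markup and the helper name hrefs_by_exact_text are assumptions made for illustration, and html.parser is used only to avoid the lxml dependency:

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup: the second link has its text on a new line.
html = '''
<a href="page1/somethingA.aspx">Something1</a>
<a href="page1/somethingC.aspx">
Something3</a>
'''


def hrefs_by_exact_text(markup, wanted):
    """Return href values of <a> tags whose whitespace-stripped text equals wanted."""
    soup = BeautifulSoup(markup, "html.parser")
    matches = soup.find_all(
        lambda tag: tag.name == "a"
        and tag.has_attr("href")
        and tag.get_text(strip=True) == wanted
    )
    return [tag["href"] for tag in matches]


print(hrefs_by_exact_text(html, "Something3"))
```

Passing a function to find_all lets the comparison run on the cleaned-up text rather than on the raw string, which is why the leading newline no longer matters.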

You can use the :contains pseudo-class with bs4 4.7.1 or later:
from bs4 import BeautifulSoup as bs
html = '''<a href="page1/somethingC.aspx">
Something3</a>'''
soup = bs(html, 'lxml')
links = [link['href'] for link in soup.select('a:contains(Something3)')]
print(links)
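In newer releases of Soup Sieve (2.1 and later, the selector library behind select()), the :contains name is deprecated in favour of :-soup-contains. A minimal sketch of the same lookup with the newer spelling, assuming a recent bs4 install and using sample markup modelled on the question:

```python
from bs4 import BeautifulSoup

# Sample markup modelled on the question: the link text starts on a new line.
html = '<a href="page1/somethingC.aspx">\nSomething3</a>'

soup = BeautifulSoup(html, "html.parser")

# a[href] restricts the match to anchors that actually carry an href attribute.
hrefs = [a["href"] for a in soup.select('a[href]:-soup-contains("Something3")')]
print(hrefs)
```

Like the regex approach, this is a substring test on the element's text, so it is not affected by the line break.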

And a solution without regex:
path = soup.select('a')
if path[0].getText().strip() == 'Something3':
    print(path)
Output:
[<a href="page1/somethingC.aspx">
Something3</a>]

Related

BS4 Scraper is producing html of the entire div code, not just the href link

The code for the website is here: https://i.imgur.com/uIJO20R.png
The code I am using:
import requests
import time
from bs4 import BeautifulSoup
import sys
sys.stdout = open("links.txt", "a")
for x in range(0, 2):
    try:
        URL = f'https://link.com/{x}'
        page = requests.get(URL)
        soup = BeautifulSoup(page.content, 'html.parser')
        rows = soup.find_all('div', id='view')
        for row in rows:
            print(row.text)
        time.sleep(5)
    except:
        continue
I just want the list of links shown in the highlighted code as output. Instead, I get the entire contents of the view div, not just the href values.
Example of output that it produces:
<div id="view">
<img src="/thumbs/jpg/8f310ba6dfsdfsdfsdf.jpg" width="300"/>
...
...
When what I want it to produce is:
/watch/8f310ba6dfsdfsdfsdf
...
...

Use the following code, which finds all anchor tags under the div tag and then gets each href value:
soup = BeautifulSoup(page.content, 'html.parser')
for link in soup.find('div', id='view').find_all('a'):
    print(link['href'])
If you have bs4 4.7.1 or above, you can use the following CSS selector:
soup = BeautifulSoup(page.content, 'html.parser')
for link in soup.select('#view>a'):
    print(link['href'])
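One detail worth checking against the real page: the child combinator (>) only matches anchors that are direct children of the div, while a space matches anchors at any depth. The sketch below illustrates the difference on hypothetical markup (the real page structure is only visible in the screenshot, so the hrefs and nesting here are assumptions):

```python
from bs4 import BeautifulSoup

# Hypothetical markup: one link directly inside the div, one nested deeper.
html = '''
<div id="view">
  <a href="/watch/direct"><img src="/thumbs/jpg/direct.jpg" width="300"/></a>
  <p><a href="/watch/nested">nested link</a></p>
</div>
'''

soup = BeautifulSoup(html, "html.parser")

# "#view > a" matches only anchors that are direct children of the div.
direct = [a["href"] for a in soup.select("#view > a")]

# "#view a[href]" matches anchors with an href at any depth inside the div.
descendants = [a["href"] for a in soup.select("#view a[href]")]

print(direct)
print(descendants)
```

If the links on the target page are wrapped in extra elements, the descendant form is the safer choice.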

You are retrieving the whole content of the div tag, so if you want the links inside the div, you need to add the a tag to the CSS selector as follows:
links = soup.select('div[id="view"] a')
for link in links:
    print(link.get('href'))

By extracting the href attribute of each a tag inside the div, you can get your desired result:
rows = soup.find_all('div', id='view')
for row in rows:
    links = row.find_all('a')
    for link in links:
        print(link['href'])

BeautifulSoup and remove entire tag

I'm working with BeautifulSoup. When I encounter an <a href> tag, I want the whole tag to be removed so that only its text remains, but that is not what happens.
For example, if I have:
<a href="/psf-landing/">
This is a test message
</a>
Currently, I get:
<a>
This is a test message
</a>
So, how can I get just:
This is a test message
Here is my code:
soup = BeautifulSoup(content_driver, "html.parser")
for element in soup(text=lambda text: isinstance(text, Comment)):
    element.extract()
for titles in soup.findAll('a'):
    del titles['href']
tree = soup.prettify()

Try the .extract() method, which removes the whole tag. In your code, you're only deleting an attribute.
for titles in soup.findAll('a'):
    if titles.get('href') is not None:
        titles.extract()

You can see detailed examples in the DZone NLP examples.
What you need is:
text = soup.get_text(strip=True)
Here is a complete example:
from bs4 import BeautifulSoup
import urllib.request
response = urllib.request.urlopen('http://php.net/')
html = response.read()
soup = BeautifulSoup(html, "html5lib")
text = soup.get_text(strip=True)
print(text)

You are looking for the unwrap() method. Have a look at the following snippet:
from bs4 import BeautifulSoup
html = '''
<a href="/psf-landing/">
This is a test message
</a>'''
soup = BeautifulSoup(html, 'html.parser')
for el in soup.find_all('a', href=True):
    el.unwrap()
print(soup)
# This is a test message
Using href=True will match only the tags that have href as an attribute.
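To make the difference between unwrapping a tag and removing it concrete, here is a small sketch that applies both to the same markup. The helper name strip_links and its keep_text flag are hypothetical, and the surrounding "Before"/"after" text is added only to make the effect visible:

```python
from bs4 import BeautifulSoup

# Sample markup based on the question, with extra text around the link.
html = '<p>Before <a href="/psf-landing/">This is a test message</a> after</p>'


def strip_links(markup, keep_text=True):
    """Remove <a href> tags; keep their text (unwrap) or drop it (decompose)."""
    soup = BeautifulSoup(markup, "html.parser")
    for a in soup.find_all("a", href=True):
        if keep_text:
            a.unwrap()      # removes the tag but leaves its children in place
        else:
            a.decompose()   # removes the tag together with everything inside it
    return str(soup)


print(strip_links(html))
print(strip_links(html, keep_text=False))
```

unwrap() is the variant that matches the goal in the question, since the link text has to survive.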

How regex until last occurrence?

I am using Python and need a regex to get the contacts link from a web page. I wrote <a (.*?)>(.*?)Contacts(.*?)</a>, and the result is:
href="/ru/o-nas.html" id="menu263" title="About">About</a></li><li>Photo</li><li class="last"><a href="/ru/kontakt.html" class="last" id="menu583" title="">Contacts
but I only need the last <a ...> tag, like this:
href="/ru/kontakt.html" class="last" id="menu583" title="">Contacts
What regex pattern should I use?
Python code:
match = re.findall('<a (.*?)>(.*?)Contacts(.*?)</a>', body)
if match:
    for m in match:
        print(''.join(m))

Since you are parsing HTML, I would suggest using BeautifulSoup:
# sample html from question
html = '<li><a href="/ru/o-nas.html" id="menu263" title="About">About</a></li><li>Photo</li><li class="last"><a href="/ru/kontakt.html" class="last" id="menu583" title="">Contacts</a></li>'
from bs4 import BeautifulSoup
doc = BeautifulSoup(html, 'html.parser')
aTag = doc.find('a', id='menu583')  # id for Contacts link
print(aTag['href'])
# /ru/kontakt.html

Try BeautifulSoup:
from bs4 import BeautifulSoup
from urllib.request import urlopen
links = []
urls = ['www.u1.com', 'www.u2.om']  # ... and so on
for url in urls:
    page = urlopen(url)
    soup = BeautifulSoup(page, 'html.parser')
    for link in soup.find_all('a'):
        # link.string is None when the tag has no single text child
        if link.string and link.string.lower() == 'contact':
            links.append(link.get('href'))
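If the id of the link is not known in advance, the link can also be located by its text. The following is a sketch only: the markup is reconstructed from the regex output shown in the question, and find_link_by_text is a hypothetical helper name.

```python
from bs4 import BeautifulSoup

# Markup reconstructed from the regex output in the question (an assumption).
html = (
    '<li><a href="/ru/o-nas.html" id="menu263" title="About">About</a></li>'
    '<li>Photo</li>'
    '<li class="last"><a href="/ru/kontakt.html" class="last" '
    'id="menu583" title="">Contacts</a></li>'
)


def find_link_by_text(markup, needle):
    """Return the href of the first <a> whose text contains needle, else None."""
    soup = BeautifulSoup(markup, "html.parser")
    for a in soup.find_all("a", href=True):
        # Case-insensitive substring test on the link's own text only.
        if needle.lower() in a.get_text().lower():
            return a["href"]
    return None


print(find_link_by_text(html, "Contacts"))
```

Because each a tag is inspected on its own, the match cannot start at an earlier link and run across several tags, which is what happened with the regex.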

How do I grab all the links within an element in HTML using python?

First, please check the image below so I can better explain my question:
I am trying to take a user input to select one of the links below "Course Search By Term".... (ie. Winter 2015).
The opened HTML shows the part of the code for this webpage. I would like to grab all the href links in the element, which consists of the five term links I want. I am following the instructions from this website (www.gregreda.com/2013/03/03/web-scraping-101-with-python/), but it doesn't explain this part. Here is some code I have been trying.
from bs4 import BeautifulSoup
from urllib2 import urlopen
BASE_URL = "http://classes.uoregon.edu/"
def get_category_links(section_url):
    html = urlopen(section_url).read()
    soup = BeautifulSoup(html, "lxml")
    pldefault = soup.find("td", "pldefault")
    ul_links = pldefault.find("ul")
    category_links = [BASE_URL + ul.a["href"] for i in ul_links.findAll("ul")]
    return category_links
Any help is appreciated! Thanks. Or if you would like to see the website, it's classes.uoregon.edu/

I would keep it simple and locate all links containing 2015 in the text and term in href:
for link in soup.find_all("a",
                          href=lambda href: href and "term" in href,
                          text=lambda text: text and "2015" in text):
    print(link["href"])
Prints:
/pls/prod/hwskdhnt.p_search?term=201402
/pls/prod/hwskdhnt.p_search?term=201403
/pls/prod/hwskdhnt.p_search?term=201404
/pls/prod/hwskdhnt.p_search?term=201406
/pls/prod/hwskdhnt.p_search?term=201407
If you want full URLs, use urllib.parse.urljoin() (urlparse.urljoin() in Python 2) to join the links with a base URL:
from urllib.parse import urljoin
...
for link in soup.find_all("a",
                          href=lambda href: href and "term" in href,
                          text=lambda text: text and "2015" in text):
    print(urljoin(url, link["href"]))
This would print:
http://classes.uoregon.edu/pls/prod/hwskdhnt.p_search?term=201402
http://classes.uoregon.edu/pls/prod/hwskdhnt.p_search?term=201403
http://classes.uoregon.edu/pls/prod/hwskdhnt.p_search?term=201404
http://classes.uoregon.edu/pls/prod/hwskdhnt.p_search?term=201406
http://classes.uoregon.edu/pls/prod/hwskdhnt.p_search?term=201407
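As a small standalone illustration of why urljoin() is preferable to plain string concatenation here, the sketch below joins the BASE_URL from the question with one of the printed hrefs. The variable names are chosen for illustration only:

```python
from urllib.parse import urljoin

BASE_URL = "http://classes.uoregon.edu/"
href = "/pls/prod/hwskdhnt.p_search?term=201402"

# Plain concatenation keeps both slashes and produces a doubled "//".
concatenated = BASE_URL + href

# urljoin resolves the href against the base URL the way a browser would.
joined = urljoin(BASE_URL, href)

print(concatenated)
print(joined)
```

urljoin() also handles hrefs that are already absolute, in which case the href is returned unchanged.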

How to remove all a href tags from text

I have a script that replaces a word in an <a href> tag. However, I want to remove the link entirely, so that only the word Google remains, without a link.
from BeautifulSoup import BeautifulSoup
soup = BeautifulSoup('<p>Hello <a href="http://google.com">Google</a></p>')
for a in soup.findAll('a'):
    a['href'] = a['href'].replace("google", "mysite")
result = str(soup)
Also, can you find all the words inside an <a href> tag and put a " " before and after them? I'm not sure how to do that. I guess it should be done before the replacing.

Use del a['href'] instead, just like you would on a plain dictionary:
from bs4 import BeautifulSoup
soup = BeautifulSoup('<p>Hello <a href="http://google.com">Google</a></p>', 'html.parser')
for a in soup.find_all('a'):
    del a['href']
gives you:
>>> print(soup)
<p>Hello <a>Google</a></p>
UPDATE:
If you want to get rid of the <a> tags altogether, you can use the .unwrap() method (called .replaceWithChildren() in BeautifulSoup 3):
from bs4 import BeautifulSoup
soup = BeautifulSoup('<p>Hello <a href="http://google.com">Google</a></p>', 'html.parser')
for a in soup.find_all('a'):
    a.unwrap()
gives you:
>>> print(soup)
<p>Hello Google</p>
...and what you requested in the comment (wrapping the text content of the tag with spaces) can be achieved with:
from bs4 import BeautifulSoup
soup = BeautifulSoup('<p>Hello <a href="http://google.com">Google</a></p>', 'html.parser')
for a in soup.find_all('a'):
    del a['href']
    a.string = ' %s ' % a.text
gives you:
>>> print(soup)
<p>Hello <a> Google </a></p>
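The two requests (dropping the tag and padding its text with spaces) can also be combined in one step by replacing each tag with a padded string. This is a sketch under the assumption that bs4 is used; the helper name links_to_padded_text and the sample URL are hypothetical:

```python
from bs4 import BeautifulSoup


def links_to_padded_text(markup):
    """Replace every <a> tag with its text, padded by one space on each side."""
    soup = BeautifulSoup(markup, "html.parser")
    for a in soup.find_all("a"):
        # replace_with swaps the whole tag for a plain string in the tree.
        a.replace_with(" %s " % a.get_text())
    return str(soup)


result = links_to_padded_text('<p>Hello <a href="http://google.com">Google</a></p>')
print(result)
```

Note that the padding is added on top of any whitespace already present around the link, so the word may end up with two spaces in front of it.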

You can use bleach:
pip install bleach
Then use it like this:
import bleach
from bs4 import BeautifulSoup
soup = BeautifulSoup('hello world', 'html.parser')
clean = bleach.clean(str(soup), tags=[], strip=True)
This results in:
>>> print(clean)
hello world
See the bleach documentation for details.
