How to retrieve hrefs that contain specific text in BeautifulSoup 4? - python

My soup:
import requests
from bs4 import BeautifulSoup
page = requests.get('https://example.com')
soup = BeautifulSoup(page.text, 'html.parser')
property_list = soup.find(class_='listing-list ListingsListstyle__ListingsListContainer-cNRhPr hqtMPr')
property_link_list = property_list.find_all('a',{ "class" : "depth-listing-card-link" },string="View details")
print(property_link_list)
I just got an empty list. What I need is to retrieve all the hrefs whose link text contains "View details".
This is an example of the input:
<a class="depth-listing-card-link" href="https://example.com">View details<i class="rui-icon rui-icon-arrow-right-small Icon-cFRQJw cqwgEb"></i></a>
I am using Python 3.7.

Try changing the last 2 lines of your code to:
property_link_list = property_list.find_all('a', {"class": "depth-listing-card-link"})
for pty in property_link_list:
    if pty.text == "View details":
        print(pty['href'])
My output is:
/property/bandar-sungai-long/sale-7700845/
/property/bandar-sungai-long/sale-7700845/
/property/bandar-sungai-long/sale-4577620/
/property/bandar-sungai-long/sale-4577620/
/property/port-dickson/sale-8387235/
/property/port-dickson/sale-8387235/
etc.
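For completeness: the reason the original code returns an empty list is that string= is matched against tag.string, and tag.string is None whenever a tag has more than one child (here the <a> also contains an <i>). A minimal sketch that filters on the rendered link text instead, assuming the same page as above:
import requests
from bs4 import BeautifulSoup

page = requests.get('https://example.com')
soup = BeautifulSoup(page.text, 'html.parser')

# Filter on the rendered text of each link rather than on tag.string,
# which is None for tags with more than one child
links = [a['href']
         for a in soup.find_all('a', class_='depth-listing-card-link')
         if 'View details' in a.get_text()]
print(links)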

Related

Webscrape Beautifulsoup on website (get multiple hrefs)

I want to web scrape this webpage (carbuzz.com). I want to get the links (href) of all the car brands from "Acura" to "Volvo" (link to picture).
Currently, I only get the first entry (Acura). How do I get the remaining ones? As I just started scraping and coding, I would highly appreciate your input!
Code:
from bs4 import BeautifulSoup
import requests
import time
#Inputs/URLs to scrape:
URL2 = ('https://carbuzz.com/cars')
(response := requests.get(URL2)).raise_for_status()
soup = BeautifulSoup(response.text, 'lxml')
overview = soup.find()
car_brand = overview.find(class_='bg-make-preview')['href']
car_brand_url ='https://carbuzz.com'+car_brand
print(car_brand_url)
Output:
https://carbuzz.com/cars/acura
[Finished in 1.2s]
You can use find_all to get all of the tags with class name bg-make-preview.
soup = BeautifulSoup(response.text, 'lxml')
for elem in soup.find_all(class_='bg-make-preview'):
    car_brand_url = 'https://carbuzz.com' + elem['href']
    print(car_brand_url)
This gives us the expected output:
https://carbuzz.com/cars/acura
https://carbuzz.com/cars/alfa-romeo
https://carbuzz.com/cars/aston-martin
https://carbuzz.com/cars/audi
https://carbuzz.com/cars/bentley
https://carbuzz.com/cars/bmw
https://carbuzz.com/cars/bollinger
https://carbuzz.com/cars/bugatti
https://carbuzz.com/cars/buick
https://carbuzz.com/cars/cadillac
https://carbuzz.com/cars/caterham
https://carbuzz.com/cars/chevrolet
https://carbuzz.com/cars/chrysler
https://carbuzz.com/cars/dodge
https://carbuzz.com/cars/ferrari
https://carbuzz.com/cars/fiat
https://carbuzz.com/cars/fisker
https://carbuzz.com/cars/ford
https://carbuzz.com/cars/genesis
https://carbuzz.com/cars/gmc
https://carbuzz.com/cars/hennessey
https://carbuzz.com/cars/honda
https://carbuzz.com/cars/hyundai
https://carbuzz.com/cars/infiniti
https://carbuzz.com/cars/jaguar
https://carbuzz.com/cars/jeep
https://carbuzz.com/cars/karma
https://carbuzz.com/cars/kia
https://carbuzz.com/cars/koenigsegg
https://carbuzz.com/cars/lamborghini
https://carbuzz.com/cars/land-rover
https://carbuzz.com/cars/lexus
https://carbuzz.com/cars/lincoln
https://carbuzz.com/cars/lordstown
https://carbuzz.com/cars/lotus
https://carbuzz.com/cars/lucid
https://carbuzz.com/cars/maserati
https://carbuzz.com/cars/mazda
https://carbuzz.com/cars/mclaren
https://carbuzz.com/cars/mercedes-benz
https://carbuzz.com/cars/mini
https://carbuzz.com/cars/mitsubishi
https://carbuzz.com/cars/nissan
https://carbuzz.com/cars/pagani
https://carbuzz.com/cars/polestar
https://carbuzz.com/cars/porsche
https://carbuzz.com/cars/ram
https://carbuzz.com/cars/rimac
https://carbuzz.com/cars/rivian
https://carbuzz.com/cars/rolls-royce
https://carbuzz.com/cars/spyker
https://carbuzz.com/cars/subaru
https://carbuzz.com/cars/tesla
https://carbuzz.com/cars/toyota
https://carbuzz.com/cars/volkswagen
https://carbuzz.com/cars/volvo
https://carbuzz.com/cars/hummer
https://carbuzz.com/cars/maybach
https://carbuzz.com/cars/mercury
https://carbuzz.com/cars/pontiac
https://carbuzz.com/cars/saab
https://carbuzz.com/cars/saturn
https://carbuzz.com/cars/scion
https://carbuzz.com/cars/smart
https://carbuzz.com/cars/suzuki
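As a side note, hard-coding the 'https://carbuzz.com' prefix works, but urllib.parse.urljoin can resolve the relative hrefs against the page URL for you; a sketch of the same loop using it:
from urllib.parse import urljoin

import requests
from bs4 import BeautifulSoup

response = requests.get('https://carbuzz.com/cars')
response.raise_for_status()
soup = BeautifulSoup(response.text, 'lxml')

# Resolve each relative href against the URL the page was fetched from
for elem in soup.find_all(class_='bg-make-preview'):
    print(urljoin(response.url, elem['href']))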

Loop through div class tags in BeautifulSoup only gets the first result

So, usually what I do when I want to loop through all the elements on a webpage is just do something like:
for i in range(..):
    print(get_stuff[i])
But in this case the entire HTML is all in one element, and findAll only gets you the first one, so even if I do this:
from bs4 import BeautifulSoup
import requests
req = requests.get(f"https://jisho.org/search/%23words%20%23n%20?page=1")
soup = BeautifulSoup(req.text, 'html.parser')
concepts = soup.findAll("div",{"class":"concepts"})
tango = concepts[0].findAll("div",{"class":"concept_light clearfix"})
for _ in tango:
    tango1 = tango[0].findAll("span",{"class":"text"})[0].text
    print(tango1)
I just get the output of the first result repeated. How do I loop through all the "concept_light clearfix" tags instead? I've looked at other answers for a similar question but I didn't understand the solutions (or how to apply them to my case) at all. Please explain simply, thank you.
Try this:
from bs4 import BeautifulSoup
import requests
with requests.Session() as session:
    req = session.get("https://jisho.org/search/%23words%20%23n%20?page=1")
    req.raise_for_status()
    soup = BeautifulSoup(req.text, 'html.parser')
    for concept in soup.findAll("div", attrs={"class": "concepts"}):
        for tango in concept.find_all("div", attrs={"class": "concept_light clearfix"}):
            for span in tango.find_all("span", attrs={"class": "text"}):
                print(span.text)
You can select all tags with class="concept_light" and then select the <span class="text"> within each tag. For example:
import requests
from bs4 import BeautifulSoup
req = requests.get("https://jisho.org/search/%23words%20%23n%20?page=1")
soup = BeautifulSoup(req.content, "html.parser")
for concept in soup.select(".concept_light"):
    print(concept.select_one("span.text").get_text(strip=True))
Prints:
学校
川
手
戸
眼鏡
煙草
赤
仕事
英語
問題
部屋
子供
時間
雨
先生
年
手紙
電話
水
病気
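One caveat: select_one returns None when nothing matches, so if jisho.org's markup ever changes, the print line above will raise an AttributeError. A slightly more defensive version of the same loop:
import requests
from bs4 import BeautifulSoup

req = requests.get("https://jisho.org/search/%23words%20%23n%20?page=1")
soup = BeautifulSoup(req.content, "html.parser")
for concept in soup.select(".concept_light"):
    span = concept.select_one("span.text")
    if span is not None:  # skip any entry without a text span
        print(span.get_text(strip=True))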
You are almost there.
The issue is in the for-loop. You are looping correctly, but using only the first item of tango. This:
tango1 = tango[0].findAll("span",{"class":"text"})[0].text
The for-loop should be like this:
for i in tango:
    tango1 = i.findAll("span",{"class":"text"})[0].text.strip()
    print(tango1)
Output with the above for-loop.
学校
川
手
戸
眼鏡
煙草
赤
仕事
英語
問題
部屋
子供
時間
雨
先生
年
手紙
電話
水
病気

How to get just links of articles in list using BeautifulSoup

Hey guys, so I got as far as being able to add the a tags to a list. The problem is I just want the href link to be added to the links_with_text list and not the entire a tag. What am I doing wrong?
from bs4 import BeautifulSoup
from requests import get
import requests
URL = "https://news.ycombinator.com"
page = requests.get(URL)
soup = BeautifulSoup(page.content, 'html.parser')
results = soup.find(id = 'hnmain')
articles = results.find_all(class_="title")
links_with_text = []
for article in articles:
    link = article.find('a', href=True)
    links_with_text.append(link)
print('\n'.join(map(str, links_with_text)))
This prints exactly how I want the list to print, but I just want the href from every a tag, not the entire tag. Thank you
To get all links from https://news.ycombinator.com, you can use the CSS selector 'a.storylink'.
For example:
from bs4 import BeautifulSoup
from requests import get
import requests
URL = "https://news.ycombinator.com"
page = requests.get(URL)
soup = BeautifulSoup(page.content, 'html.parser')
links_with_text = []
for a in soup.select('a.storylink'):   # <-- find all <a> with class="storylink"
    links_with_text.append(a['href'])  # <-- note the ['href']
print(*links_with_text, sep='\n')
Prints:
https://blog.mozilla.org/futurereleases/2020/06/18/introducing-firefox-private-network-vpns-official-product-the-mozilla-vpn/
https://mxb.dev/blog/the-return-of-the-90s-web/
https://github.blog/2020-06-18-introducing-github-super-linter-one-linter-to-rule-them-all/
https://www.sciencemag.org/news/2018/11/why-536-was-worst-year-be-alive
https://www.strongtowns.org/journal/2020/6/16/do-the-math-small-projects
https://devblogs.nvidia.com/announcing-cuda-on-windows-subsystem-for-linux-2/
https://lwn.net/SubscriberLink/822568/61d29096a4012e06/
https://imil.net/blog/posts/2020/fakecracker-netbsd-as-a-function-based-microvm/
https://jepsen.io/consistency
https://tumblr.beesbuzz.biz/post/621010836277837824/advice-to-young-web-developers
https://archive.org/search.php?query=subject%3A%22The+Navy+Electricity+and+Electronics+Training+Series%22&sort=publicdate
https://googleprojectzero.blogspot.com/2020/06/ff-sandbox-escape-cve-2020-12388.html?m=1
https://apnews.com/1da061ce00eb531291b143ace0eed1c9
https://support.apple.com/library/content/dam/edam/applecare/images/en_US/appleid/android-apple-music-account-payment-none.jpg
https://standpointmag.co.uk/issues/may-june-2020/the-healing-power-of-birdsong/
https://steveblank.com/2020/06/18/the-coming-chip-wars-of-the-21st-century/
https://www.videolan.org/security/sb-vlc3011.html
https://onesignal.com/careers/2023b71d-2f44-4934-a33c-647855816903
https://www.bbc.com/news/world-europe-53006790
https://github.com/efficient/HOPE
https://everytwoyears.org/
https://www.historytoday.com/archive/natural-histories/intelligence-earthworms
https://cr.yp.to/2005-590/powerpc-cwg.pdf
https://quantum.country/
http://www.crystallography.net/cod/
https://parkinsonsnewstoday.com/2020/06/17/tiny-magnetically-powered-implant-may-be-future-of-deep-brain-stimulation/
https://spark.apache.org/releases/spark-release-3-0-0.html
https://arxiv.org/abs/1712.09624
https://www.washingtonpost.com/technology/2020/06/18/data-privacy-law-sherrod-brown/
https://blog.chromium.org/2020/06/improving-chromiums-browser.html
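Note that Hacker News has changed its markup over time, so a.storylink may stop matching. The same idea carries over to whatever selector currently wraps the titles; a sketch assuming the newer layout, where each title link is a direct child of a span with class titleline:
import requests
from bs4 import BeautifulSoup

page = requests.get("https://news.ycombinator.com")
soup = BeautifulSoup(page.content, 'html.parser')

# Assumption: each story link is the <a> directly inside span.titleline
links = [a['href'] for a in soup.select('span.titleline > a')]
print(*links, sep='\n')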

BS "find_all" method does not match all target

When I use the find_all method on this page, Beautiful Soup doesn't find all of the targets.
This code:
len(mySoup.find_all('div', {'class': 'lo-liste row'}))
Returns 1, yet there are 4.
This is the soup url.
When I looked in the source code of the given link, I found there is only one div with the class name "lo-liste row"; the other three divs have the class name "lo-liste row not-first-ligne", which is why you got only 1 as output.
Try the following code:
len(soup.findAll('div', {'class': ['lo-liste row','not-first-ligne']}))
from bs4 import BeautifulSoup
import requests
page = requests.get("https://www.ubaldi.com/offres/jeu-de-3-disques-feutres-electrolux--ma-92ca565jeud-4b9yz--727536.php")
soup = BeautifulSoup(page.content, 'html.parser')
print(len(soup.findAll('div', {'class': ['lo-liste row','not-first-ligne']})))
The find_all DOES correctly match all targets.
The first product has class=lo-liste row
The next 3 products have class=lo-liste row not-first-ligne
import requests
from bs4 import BeautifulSoup

url = 'https://www.ubaldi.com/offres/jeu-de-3-disques-feutres-electrolux--ma-92ca565jeud-4b9yz--727536.php'
response = requests.get(url)
mySoup = BeautifulSoup(response.text, 'html.parser')
for product in mySoup.find_all('div', {'class': 'lo-liste row'}):
    print(product.find('a').find_next('span').text.strip())
for product in mySoup.find_all('div', {'class': 'lo-liste row not-first-ligne'}):
    print(product.find('a').find_next('span').text.strip())
# or to combine those 2 for loops into 1:
# for product in mySoup.findAll('div', {'class': ['lo-liste row','not-first-ligne']}):
#     print(product.find('a').find_next('span').text.strip())
Output:
SOS Accessoire
Stortle
Groupe-Dragon
Asdiscount
Use select instead. It will match all 4 for that class.
items = soup.select('.lo-liste.row')
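Also worth knowing: when a tag has a multi-valued class attribute, bs4 checks class_ against each individual class, so searching on a single class finds all four as well. A quick self-contained sketch:
import requests
from bs4 import BeautifulSoup

url = 'https://www.ubaldi.com/offres/jeu-de-3-disques-feutres-electrolux--ma-92ca565jeud-4b9yz--727536.php'
soup = BeautifulSoup(requests.get(url).text, 'html.parser')

# A single class is compared against each CSS class of a tag, so this
# matches both class="lo-liste row" and "lo-liste row not-first-ligne"
print(len(soup.find_all('div', class_='lo-liste')))  # expected: 4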
Use a regular expression (via the re module) to find the elements.
from bs4 import BeautifulSoup
import requests
import re
url = 'https://www.ubaldi.com/offres/jeu-de-3-disques-feutres-electrolux--ma-92ca565jeud-4b9yz--727536.php'
html= requests.get(url)
soup = BeautifulSoup(html.text, 'html.parser')
print(len(soup.find_all('div', class_=re.compile('lo-liste row'))))
Output:
4

I want my code to not extract links with 0 seeders using python

I wrote my code, but it extracts all the links no matter what the seeders count is. Here is the code I wrote:
from bs4 import BeautifulSoup
import urllib.request
import re
class AppURLopener(urllib.request.FancyURLopener):
    version = "Mozilla/5.0"

url = input('What site you working on today, sir?\n-> ')
opener = AppURLopener()
html_page = opener.open(url)
soup = BeautifulSoup(html_page, "lxml")
pd = str(soup.findAll('td', attrs={'align':re.compile('right')}))
for link in soup.findAll('a', attrs={'href': re.compile("^magnet")}):
    if not('0' is pd[18]):
        print(link.get('href'), '\n')
And this is the HTML I am working on: https://imgur.com/a/32J9qF4
In this case it's 0 seeders, but it still gives me the magnet link. Help!
This code snippet will extract all magnet links from the page, where seeders != 0:
from bs4 import BeautifulSoup
import requests
from pprint import pprint
soup = BeautifulSoup(requests.get('https://pirateproxy.mx/browse/201/1/3').text, 'lxml')
tds = soup.select('#searchResult td.vertTh ~ td')
links = [name.select_one('a[href^=magnet]')['href'] for name, seeders, leechers in zip(tds[0::3], tds[1::3], tds[2::3]) if seeders.text.strip() != '0']
pprint(links, width=120)
Prints:
['magnet:?xt=urn:btih:aa8a1f7847a49e640638c02ce851effff38d440f&dn=Affairs.of.State.2018.BRRip.x264.AC3-Manning&tr=udp%3A%2F%2Ftracker.leechers-paradise.org%3A6969&tr=udp%3A%2F%2Fzer0day.ch%3A1337&tr=udp%3A%2F%2Fopen.demonii.com%3A1337&tr=udp%3A%2F%2Ftracker.coppersurfer.tk%3A6969&tr=udp%3A%2F%2Fexodus.desync.com%3A6969',
'magnet:?xt=urn:btih:819cb9b477462cd61ab6653ebc4a6f4e790589c3&dn=Bad.Samaritan.2018.BRRip.x264.AC3-Manning&tr=udp%3A%2F%2Ftracker.leechers-paradise.org%3A6969&tr=udp%3A%2F%2Fzer0day.ch%3A1337&tr=udp%3A%2F%2Fopen.demonii.com%3A1337&tr=udp%3A%2F%2Ftracker.coppersurfer.tk%3A6969&tr=udp%3A%2F%2Fexodus.desync.com%3A6969',
'magnet:?xt=urn:btih:843d01992aa81d52be68190ee6a733ec9eee9b13&dn=The+Darkest+Minds+2018+HDCAM-1XBET&tr=udp%3A%2F%2Ftracker.leechers-paradise.org%3A6969&tr=udp%3A%2F%2Fzer0day.ch%3A1337&tr=udp%3A%2F%2Fopen.demonii.com%3A1337&tr=udp%3A%2F%2Ftracker.coppersurfer.tk%3A6969&tr=udp%3A%2F%2Fexodus.desync.com%3A6969',
'magnet:?xt=urn:btih:09a23daa69c42003d905ecf0a1cefdb0474e7d88&dn=Insidious+The+Last+Key+2018+BRRip+x264+AAC-SSN&tr=udp%3A%2F%2Ftracker.leechers-paradise.org%3A6969&tr=udp%3A%2F%2Fzer0day.ch%3A1337&tr=udp%3A%2F%2Fopen.demonii.com%3A1337&tr=udp%3A%2F%2Ftracker.coppersurfer.tk%3A6969&tr=udp%3A%2F%2Fexodus.desync.com%3A6969',
'magnet:?xt=urn:btih:98c42d5d620b4db834c5437a75f6da6f2d158207&dn=The+Darkest+Minds+2018+HDCAM-1XBET%5BTGx%5D&tr=udp%3A%2F%2Ftracker.leechers-paradise.org%3A6969&tr=udp%3A%2F%2Fzer0day.ch%3A1337&tr=udp%3A%2F%2Fopen.demonii.com%3A1337&tr=udp%3A%2F%2Ftracker.coppersurfer.tk%3A6969&tr=udp%3A%2F%2Fexodus.desync.com%3A6969',
'magnet:?xt=urn:btih:f30ebc409b215f2a5237433d7508c7ebfabb0e16&dn=Journeyman.2017.SWESUB.BRRiP.x264.mp4&tr=udp%3A%2F%2Ftracker.leechers-paradise.org%3A6969&tr=udp%3A%2F%2Fzer0day.ch%3A1337&tr=udp%3A%2F%2Fopen.demonii.com%3A1337&tr=udp%3A%2F%2Ftracker.coppersurfer.tk%3A6969&tr=udp%3A%2F%2Fexodus.desync.com%3A6969',
...and so on.
EDIT:
The soup.select('#searchResult td.vertTh ~ td') selects all <td> siblings of the <td> tag with class vertTh, which is inside the tag with id=searchResult. There are three such siblings in each row.
The select_one('a[href^=magnet]') then selects the first link whose href begins with magnet.
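For reference, the same filter can also be written row by row, which some readers find easier to follow than the zip stride; a sketch resting on the same assumptions about the table layout:
import requests
from bs4 import BeautifulSoup

soup = BeautifulSoup(requests.get('https://pirateproxy.mx/browse/201/1/3').text, 'lxml')

# Walk each result row; the three td.vertTh ~ td cells per row are
# (name, seeders, leechers), as described above
for row in soup.select('#searchResult tr'):
    tds = row.select('td.vertTh ~ td')
    if len(tds) == 3 and tds[1].text.strip() != '0':
        link = tds[0].select_one('a[href^=magnet]')
        if link is not None:
            print(link['href'])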
