BeautifulSoup4: find multiple href links containing specific text - Python

I'm trying to filter all href links containing the string "3080". I saw some examples, but I just can't apply them to my code. Can someone tell me how to print only those links?
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
import time
import driver_functions
gpu = '3080'
url = f'https://www.alternate.de/listing.xhtml?q={gpu}'
options = webdriver.ChromeOptions()
options.add_argument('--headless')
if __name__ == '__main__':
    browser = webdriver.Chrome(options=options, service=Service('chromedriver.exe'))
    try:
        browser.get(url)
        time.sleep(2)
        html = browser.page_source
        soup = BeautifulSoup(html, 'html.parser')
        gpu_list = soup.select("a", class_="grid-container listing")
        for link in gpu_list:
            print(link['href'])
        browser.quit()
    except:
        driver_functions.browserstatus(browser)
Output

You could use a CSS attribute = value selector with the * contains operator to target hrefs, within the listings, that contain that gpu variable. You can obviously extend this selector if you find edge cases to account for; I only looked at the URL given.
gpu_links = [i['href'] for i in soup.select(f".listing [href*='{gpu}']")]
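To see how the attribute-contains selector behaves in isolation, here is a minimal sketch run against some hypothetical markup mimicking the listing structure (the real HTML would come from browser.page_source):

```python
from bs4 import BeautifulSoup

# Hypothetical markup standing in for the alternate.de listing page.
html = """
<div class="grid-container listing">
  <a href="/product/rtx-3080-founders">RTX 3080</a>
  <a href="/product/rtx-3070-ti">RTX 3070 Ti</a>
  <a href="/product/rtx-3080-ti-oc">RTX 3080 Ti</a>
</div>
"""

gpu = '3080'
soup = BeautifulSoup(html, 'html.parser')

# [href*='3080'] matches any element whose href attribute contains "3080",
# scoped here to descendants of an element carrying the "listing" class.
gpu_links = [i['href'] for i in soup.select(f".listing [href*='{gpu}']")]
print(gpu_links)  # ['/product/rtx-3080-founders', '/product/rtx-3080-ti-oc']
```

Note that the 3070 link is skipped even though it sits inside the same listing container, which is exactly the filtering the question asks for.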

Try this as your selector:
gpu_list = soup.select('#lazyListingContainer > div > div > div.grid-container.listing > a')

Related

Why can't I scrape the URLs of the displayed results of a search on a website?

I've been trying to get the URL of the first result on that kind of results page, but I can't: the HTML parser doesn't include it.
The link of the website: https://www.sobrico.com/#Prod_Live_Sobrico%5Bquery%5D=2608664131
I've tried different approaches with BeautifulSoup and requests, but nothing comes back. I'm able to scrape plenty of information when I'm on a product page like this one: https://www.sobrico.com/p/bosch-2608664131-coffret-lames-best-for-cutting-bosch-2608664131_SKU726760.html
But on a search results page, part of the code, above all the part that contains the displayed results, isn't available. I hope to get an answer; it would really help.
Here is my code:
import requests
from bs4 import BeautifulSoup
URL = "https://www.sobrico.com/#Prod_Live_Sobrico%5Bquery%5D=2608664131"
page = requests.get(URL)
soup = BeautifulSoup(page.content, features='lxml')
for link in soup("a"):
    print(link.get("href"))
You can use Selenium instead to get the first item on the search results page. Use a CSS selector to identify the item of interest, then get the URL with the get_attribute() method.
Code:
from selenium import webdriver
from selenium.webdriver.common.by import By
driver = webdriver.Chrome()
driver.get("https://www.sobrico.com/#Prod_Live_Sobrico%5Bquery%5D=2608664131")
first_item = driver.find_element(By.CSS_SELECTOR, "div#algoliasearch-hits li:nth-child(1) > article > a")
url = first_item.get_attribute("href")
print(url)
driver.close()
Output:
https://www.sobrico.com/p/bosch-2608664131-coffret-lames-best-for-cutting-bosch-2608664131_SKU726760.html
Note:
If you want to get the urls from each of the 11 items on the page, you could iterate through them:
for i in range(1, 12):
    url = driver.find_element(By.CSS_SELECTOR, f"div#algoliasearch-hits li:nth-child({i}) > article > a").get_attribute("href")
    print(url)
Output:
https://www.sobrico.com/p/bosch-2608664131-coffret-lames-best-for-cutting-bosch-2608664131_SKU726760.html
https://www.sobrico.com/p/bosch-coffret-starlockmax-4-pieces-bosch-best-of-heavy-duty-2608664132_SKUM89146.html
https://www.sobrico.com/p/bosch-expert-2608644113-lame-scie-circulaire-260x30x2-8-mm-aluminium-96-dents-bosch-expert-2608644113_SKU744566.html
https://www.sobrico.com/p/bosch-2608631319-5-lames-scie-sauteuse-hss-132-mm-metal-t318a-bosch-2608631319_SKU478349.html
https://www.sobrico.com/p/bosch-2608641185-lame-scie-circulaire-190x30x2mm-bois-24-dents-bosch-2608641185_SKU491105.html
https://www.sobrico.com/p/bosch-2608636431-lames-scie-sauteuse-t101-bif-special-for-laminate-pack-5-bosch-2608636431_SKU478354.html
https://www.sobrico.com/p/bosch-expert-2608606131-bande-abrasive-ponceuses-x440-best-for-wood-and-paint-grain-80-610mm-bosch-2608606131_SKU490187.html
https://www.sobrico.com/p/bosch-lames-scie-circulaires-bosch-construct-wood_SKUM75099.html
https://www.sobrico.com/p/bosch-expert-2608644134-lames-scie-circulaire-for-high-pressure-laminate-56-dents-bosch-expert-2608644134_SKU744579.html
https://www.sobrico.com/p/bosch-2608662431-lame-scie-oscillante-asz-32-sc-outils-multi-fonctions-bosch-2608662431_SKU777336.html
https://www.sobrico.com/p/bosch-2608664348-lame-scie-oscillante-bosch-starlock-carbure-metalmax-aiz-45-at-45mm-bosch-expert-2608664348_SKU777339.html

Can't get anything back on some web pages using Beautiful Soup

I'm trying to scrape this page of job posts. The posts are buried in a bunch of divs but ultimately contained in an unordered list. When I try to retrieve the list using find_all, I get nothing back whether I search by tag id or by class. Is there anything I'm doing wrong?
import requests
from bs4 import BeautifulSoup

url = "https://resumes.indeed.com/search?l=&q=python&searchFields=jt"
source = requests.get(url).text
soup = BeautifulSoup(source, "lxml")
match = soup.find_all("ul", class_="rezemp-ResumeSearchPage-resultsList")
print(match)
Try this instead
from bs4 import BeautifulSoup
from splinter import Browser
from webdriver_manager.chrome import ChromeDriverManager
executable_path = {'executable_path':ChromeDriverManager().install()}
browser = Browser('chrome', **executable_path, headless = False)
url = "https://resumes.indeed.com/search?l=&q=python&searchFields=jt"
browser.visit(url)
html = browser.html
soup = BeautifulSoup(html, 'html.parser')
match = soup.find_all("ul", class_="rezemp-ResumeSearchPage-resultsList")
# This is the text you get, but it's not structured; not sure if that's what you are looking for.
match[0].text
First, I noticed that the given URL doesn't have any element using the class
rezemp-ResumeSearchPage-resultsList
So I changed it to "https://resumes.indeed.com/search?l=&lmd=all&q=python&searchFields=jt", i.e. chose 'show all resumes' on that website.
Then use Selenium to load the JavaScript:
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from bs4 import BeautifulSoup

chrome_options = Options()
chrome_options.add_argument('--headless')
url = "https://resumes.indeed.com/search?l=&lmd=all&q=python&searchFields=jt"
browser = webdriver.Chrome(options=chrome_options, executable_path="***\\chromedriver.exe")
browser.get(url)
soup = BeautifulSoup(browser.page_source, "lxml")
match = soup.find_all("ul", class_="rezemp-ResumeSearchPage-resultsList")
print(match)
Got it.
You should try this:
results = []
for m in match:
    results.append(m)
print(results)

Unable to get specific tag using bs4?

I am using bs4 and Python 3.6. My problem: on a YouTube search page, I want to get the link of the first video in the results. After inspecting, I found that the id of that anchor tag is video-title, so I used that parameter to find the a tag with the following code. Since every video's anchor tag has the same id, video-title, I decided to use find instead of find_all.
from bs4 import BeautifulSoup
import requests

soup = BeautifulSoup(requests.get('https://www.youtube.com/results?search_query=unravel').text, 'lxml')
link = soup.find('a', id="video-title")
print(link)
but in return it gives
None
I have tried to get all anchor tags but that also doesn't include the tag which I want.
Can anyone tell what the problem is?
You can use the regex /watch\?v=\w+ to pull out your links; simpler than bs4 ☺
Use Selenium together with the regex for best results, since the results page is rendered with JavaScript.
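A minimal sketch of the regex approach, run against some hypothetical search-results HTML (the real page source would come from driver.page_source or requests.get(...).text):

```python
import re

# Hypothetical snippet standing in for YouTube search-results HTML.
html = """
<a href="/watch?v=abc123XYZ90" id="video-title">Unravel</a>
<a href="/watch?v=def456UVW12" id="video-title">Unravel (cover)</a>
"""

# The slash and the escaped "?" matter: \w matches a word character and an
# unescaped "?" makes the preceding character optional, so the pattern must
# be written as /watch\?v= followed by the video-id characters.
links = re.findall(r'/watch\?v=[\w-]+', html)
print(links)  # ['/watch?v=abc123XYZ90', '/watch?v=def456UVW12']
```

The character class [\w-] is used because YouTube video ids can contain letters, digits, underscores, and hyphens.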
You can try this, assuming you have selenium and lxml installed in your environment.
from selenium import webdriver
from bs4 import BeautifulSoup

def get_tag():
    chrome_options = webdriver.ChromeOptions()
    chrome_options.add_argument('--headless')
    chrome_options.add_argument('--no-sandbox')
    chrome_options.add_argument('--disable-dev-shm-usage')
    driver = webdriver.Chrome('chromedriver', chrome_options=chrome_options)
    driver.get('https://www.youtube.com/results?search_query=unravel')
    # print(driver.page_source)
    soup = BeautifulSoup(driver.page_source, 'lxml')
    atags = soup.find_all('a', {'id': 'video-title'})
    for tag in atags:
        print(tag.get('title'))
This method will print the title of each a tag whose id is video-title.

Search by class with beautiful soup (python) not working

I'm trying to scrape the date, and the minimum and maximum temperatures from the site https://www.ipma.pt/pt/otempo/prev.localidade.hora/#Porto&Gondomar.
I want to find all of the divs with the date class and all the spans with the tempMin and tempMax classes, so I wrote
from urllib.request import urlopen as uReq
from bs4 import BeautifulSoup as soup

pagina2 = "https://www.ipma.pt/pt/otempo/prev.localidade.hora/#Porto&Gondomar"
client2 = uReq(pagina2)
pagina2bs = soup(client2.read(), "html.parser")
client2.close()

data = pagina2bs.find_all("div", class_="date")
minT = pagina2bs.find_all("span", class_="tempMin")
maxT = pagina2bs.find_all("span", class_="tempMax")
but all I get are empty lists. I've compared this with similar code and I can't see where I made a mistake, since there are clearly tags with these classes.
From my perspective it has to do with the content of the pagina2bs variable; you are passing the right arguments to the find_all method. The page is rendered with JavaScript, so use Selenium to get the HTML of that website.
from bs4 import BeautifulSoup as bs
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
import html5lib
chrome_options = Options()
chrome_options.add_argument("--headless")
driver = webdriver.Chrome(options=chrome_options, executable_path='C:/Users/**USERNAME**/Desktop/chromedriver.exe')
startUrl = "https://www.ipma.pt/pt/otempo/prev.localidade.hora/#Porto&Gondomar"
driver.get(startUrl)
html = driver.page_source
soup = bs(html,features="html5lib")
divs = soup.find_all("div", class_="date")
print(divs)
Install all the needed packages and the Selenium Chrome driver, and point to the chromedriver executable in the code as I did on my machine.

Webscraping - Python - Can't find links in html

I'm trying to scrape all links from https://www.udemy.com/courses/search/?q=sql&src=ukw&lang=en, but even without selecting any element, my code retrieves no links. Please see my code below.
import bs4
import requests as rq

Link = 'https://www.udemy.com/courses/search/?q=sql&src=ukw&lang=en'
RQOBJ = rq.get(Link)
BS4OBJ = bs4.BeautifulSoup(RQOBJ.text, 'html.parser')
print(BS4OBJ)
Hope you want the links of the courses on the page; this code will help:
from selenium import webdriver
from bs4 import BeautifulSoup
import time
baseurl='https://www.udemy.com'
url="https://www.udemy.com/courses/search/?q=sql&src=ukw&lang=en"
driver = webdriver.Chrome()
driver.maximize_window()
driver.get(url)
time.sleep(5)
content = driver.page_source.encode('utf-8').strip()
soup = BeautifulSoup(content,"html.parser")
courseLink = soup.findAll("a", {"class": "card__title", 'href': True})
for link in courseLink:
    print(baseurl + link['href'])
driver.quit()
It will print:
https://www.udemy.com/the-complete-sql-bootcamp/
https://www.udemy.com/the-complete-oracle-sql-certification-course/
https://www.udemy.com/introduction-to-sql23/
https://www.udemy.com/oracle-sql-12c-become-an-sql-developer-with-subtitle/
https://www.udemy.com/sql-advanced/
https://www.udemy.com/sql-for-newbs/
https://www.udemy.com/sql-for-marketers-data-analytics-data-science-big-data/
https://www.udemy.com/sql-for-punk-analytics/
https://www.udemy.com/sql-basics-for-beginners/
https://www.udemy.com/oracle-sql-step-by-step-approach/
https://www.udemy.com/microsoft-sql-for-beginners/
https://www.udemy.com/sql-tutorial-learn-sql-with-mysql-database-beginner2expert/
The website uses JavaScript to fetch data, so you should use Selenium.
