I'm new to web scraping and can't get the prices. I can see them in the HTML printed to the terminal, but the list still comes back empty:
from selenium import webdriver
from bs4 import BeautifulSoup
driver = webdriver.Chrome()
url = "https://www.kuantokusta.pt/p/6894201/msi-geforce-rtx-3080-ventus-3x-plus-oc-lhr-10gb-gddr6"
driver.get(url)
html = driver.page_source
doc = BeautifulSoup(html , "html.parser")
print(doc.prettify())
prices = doc.find_all(text="EUR")
print(prices)
The page has a CSS class .prices, and more specifically .old-price and .new-price.
So you could select on them as follows:
prices = doc.select('.prices')
or
prices = doc.select('.old-price')
or
prices = doc.select('.new-price')
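As for why the original list is empty: find_all(text="EUR") only matches text nodes that are exactly "EUR", so a node such as "849,90 EUR" is skipped. Below is a minimal sketch of the difference. The HTML string and the prices in it are made up for illustration and are not copied from the site.

```python
import re
from bs4 import BeautifulSoup

# Hypothetical markup for illustration; the real page may differ.
html = """
<div class="prices">
  <span class="old-price">899,90 EUR</span>
  <span class="new-price">849,90 EUR</span>
</div>
"""
doc = BeautifulSoup(html, "html.parser")

# A plain string must equal the whole text node, so nothing matches here.
exact = doc.find_all(string="EUR")

# A compiled regex is searched inside each text node instead.
partial = doc.find_all(string=re.compile("EUR"))

# Selecting by CSS class returns the tags, from which the text can be read.
new_prices = [tag.get_text(strip=True) for tag in doc.select(".new-price")]

print(exact)
print([s.strip() for s in partial])
print(new_prices)
```

The class-based selectors are usually the more robust choice, since they do not depend on the currency text.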
I want to scrape the duration of TikTok videos for an upcoming project, but my code isn't working:
import requests
from bs4 import BeautifulSoup
content = requests.get('https://vm.tiktok.com/ZMFFKmx3K/').text
soup = BeautifulSoup(content, 'lxml')
data = soup.find('div', class_="tiktok-1g3unbt-DivSeekBarTimeContainer e123m2eu1")
print(data)
I'm using an example TikTok link.
I would have thought this would work. Could anyone help?
If you turn off JavaScript and inspect the element in Chrome DevTools, you will see that its value is something like 00/000. With JavaScript turned on and the video playing, the displayed time keeps increasing until the video finishes. So the real value of that element depends on JavaScript, and you need an automation tool such as Selenium to grab it. How much of the duration you capture depends on time.sleep(), because the element shows the playback position at the moment you read the page. If time.sleep() is longer than the video, the lookup returns None and you will get a NoneType error.
Example:
import time
from selenium import webdriver
from bs4 import BeautifulSoup
from selenium.webdriver.chrome.service import Service
webdriver_service = Service("./chromedriver") #Your chromedriver path
driver = webdriver.Chrome(service=webdriver_service)
url ='https://vm.tiktok.com/ZMFFKmx3K/'
driver.get(url)
driver.maximize_window()
time.sleep(25)
soup = BeautifulSoup(driver.page_source,"lxml")
data = soup.find('div', class_="tiktok-1g3unbt-DivSeekBarTimeContainer e123m2eu1")
print(data.text)
Output:
00:25/00:28
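A fixed time.sleep() either wastes time or reads the element too early. One alternative is to poll until the value looks ready; Selenium provides WebDriverWait for this purpose. The sketch below shows the same idea in plain Python so that it runs without a browser. The names wait_until and read_time, and the sample readings, are hypothetical and not part of Selenium.

```python
import time

def wait_until(predicate, timeout=10.0, interval=0.5):
    """Call predicate() until it returns a truthy value or timeout seconds pass."""
    deadline = time.monotonic() + timeout
    while True:
        result = predicate()
        if result:
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError("condition not met within timeout")
        time.sleep(interval)

# Stand-in for reading the seek bar text from the page (hypothetical values).
readings = iter(["00:00/00:00", "00:00/00:00", "00:01/00:28"])

def read_time():
    text = next(readings)
    # Treat the value as ready once the total duration is no longer zero.
    return text if not text.endswith("/00:00") else None

ready_text = wait_until(read_time, timeout=5.0, interval=0.01)
print(ready_text)
```

With a real driver, the predicate would read the element's text from the page instead of pulling from the iterator.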
The ID attached to the class name is likely randomized. Try using a regex to match the element by a class containing 'TimeContainer', followed by whatever other ID comes after it:
import requests
from bs4 import BeautifulSoup
import re
content = requests.get('https://vm.tiktok.com/ZMFFKmx3K/').text
soup = BeautifulSoup(content, 'lxml')
data = soup.find('div', {'class': re.compile(r'TimeContainer.*$')})
print(data)
Your next issue is that the page loads before the video does, so you'll get 0/0 for the time. Try Selenium instead, so you can add waits for the video to load.
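To make the regex idea concrete without a network call, here is a minimal sketch on a hard-coded snippet. The markup and the helper to_seconds are assumptions for illustration; only the class name is taken from the question.

```python
import re
from bs4 import BeautifulSoup

# Hypothetical snippet shaped like the element in the question.
html = '<div class="tiktok-1g3unbt-DivSeekBarTimeContainer e123m2eu1">00:25/00:28</div>'
soup = BeautifulSoup(html, "html.parser")

# The regex is tested against each class of the tag, so the random prefix
# and the extra generated class do not matter.
data = soup.find("div", class_=re.compile(r"TimeContainer$"))

def to_seconds(stamp):
    """Convert a mm:ss string to a number of seconds."""
    minutes, seconds = stamp.split(":")
    return int(minutes) * 60 + int(seconds)

position, total = (to_seconds(part) for part in data.get_text(strip=True).split("/"))
print(position, total)
```

The second number is the video length, which is the part of the text that does not depend on how long the script waited.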
I'm trying to scrape this page of job posts. The posts are buried in a bunch of divs but are ultimately contained in an unordered list. When I try to retrieve the list using find_all, I get nothing back, whether I use the tag id or the class. Is there anything I'm doing wrong?
import requests
from bs4 import BeautifulSoup
url = "https://resumes.indeed.com/search?l=&q=python&searchFields=jt"
source = requests.get(url).text
soup = BeautifulSoup(source, "lxml")
match = soup.find_all("ul", class_="rezemp-ResumeSearchPage-resultsList")
print(match)
Try this instead
from bs4 import BeautifulSoup
from splinter import Browser
from webdriver_manager.chrome import ChromeDriverManager
executable_path = {'executable_path':ChromeDriverManager().install()}
browser = Browser('chrome', **executable_path, headless = False)
url = "https://resumes.indeed.com/search?l=&q=python&searchFields=jt"
browser.visit(url)
html = browser.html
soup = BeautifulSoup(html, 'html.parser')
match = soup.find_all("ul", class_="rezemp-ResumeSearchPage-resultsList")
# This is the text you get, but it's not structured; not sure if that's what you are looking for.
print(match[0].text)
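If the goal is one entry per result rather than one block of text, iterating over the li children gives more structure. This is a minimal sketch on made-up markup: the item contents are hypothetical, and only the class name comes from the question.

```python
from bs4 import BeautifulSoup

# Hypothetical markup; the real result items are likely more deeply nested.
html = """
<ul class="rezemp-ResumeSearchPage-resultsList">
  <li><span>Python Developer</span> <span>Austin, TX</span></li>
  <li><span>Data Engineer</span> <span>Remote</span></li>
</ul>
"""
soup = BeautifulSoup(html, "html.parser")
match = soup.find_all("ul", class_="rezemp-ResumeSearchPage-resultsList")

# find_all returns a list (empty when nothing matches), so guard before indexing.
items = []
if match:
    for li in match[0].find_all("li"):
        # Join the text pieces of each item with a separator.
        items.append(li.get_text(" | ", strip=True))
print(items)
```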
First, I noticed that the page at the given URL doesn't have any element with the class
rezemp-ResumeSearchPage-resultsList
So I changed the URL to "https://resumes.indeed.com/search?l=&lmd=all&q=python&searchFields=jt", i.e. I chose 'show all resumes' on that website.
Then use Selenium to load the JavaScript:
import lxml
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.chrome.service import Service
from bs4 import BeautifulSoup
chrome_options = Options()
chrome_options.add_argument('--headless')
url = "https://resumes.indeed.com/search?l=&lmd=all&q=python&searchFields=jt"
# Selenium 4 takes the driver path through a Service object instead of executable_path
browser = webdriver.Chrome(service=Service("***\\chromedriver.exe"), options=chrome_options)
browser.get(url)
soup = BeautifulSoup(browser.page_source, "lxml")
match = soup.find_all("ul", class_="rezemp-ResumeSearchPage-resultsList")
print(match)
Got it
You should try this
results = []  # avoid naming this "list", which would shadow the built-in
for m in match:
    results.append(m)
print(results)
I'm trying to scrape JavaScript-rendered content with BeautifulSoup and Python. I've read a lot and come up with this code, which tries to scrape a price rendered with JavaScript on a well-known website, for example:
from bs4 import BeautifulSoup
from selenium import webdriver
url = "https://www.nespresso.com/fr/fr/order/capsules/original/"
browser = webdriver.PhantomJS(executable_path = "C:/phantomjs-2.1.1-windows/bin/phantomjs.exe")
browser.get(url)
html = browser.page_source
soup = BeautifulSoup(html, 'lxml')
soup.find("span", {'class':'ProductListElement__price'}).text
But the only result I get is '\xa0', which is the value in the page source, not the value set by JavaScript, and I don't really know what I did wrong.
Best regards
You don't need the expense of a browser. The info is in a script tag, so you can extract it with a regex and parse it with the json library:
import requests, re, json
r = requests.get('https://www.nespresso.com/fr/fr/order/capsules/original/')
p = re.compile(r'window\.ui\.push\((.*ProductList.*)\)')
data = json.loads(p.findall(r.text)[0])
products = {product['name']:product['price'] for product in data['configuration']['eCommerceData']['products']}
print(products)
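To show what the pattern captures without hitting the site, here is a minimal offline sketch. The page source below is invented (including the product names and prices), and it assumes each push() call sits on its own line, which may not hold on the live page.

```python
import json
import re

# Hypothetical page source; product names and prices are invented.
html = """
<script>
window.ui.push({"id": "Header", "configuration": {}});
window.ui.push({"id": "ProductList", "configuration": {"eCommerceData": {"products": [{"name": "Capsule A", "price": 0.41}, {"name": "Capsule B", "price": 0.43}]}}});
</script>
"""

# Capture the argument of the push() call that mentions ProductList.
pattern = re.compile(r'window\.ui\.push\((.*ProductList.*)\)')
matches = pattern.findall(html)

products = {}
if matches:  # guard: the page layout may change and the pattern may stop matching
    data = json.loads(matches[0])
    products = {
        product["name"]: product["price"]
        for product in data["configuration"]["eCommerceData"]["products"]
    }
print(products)
```

Because "." does not match a newline by default, the greedy pattern stays within one line and the captured group is the JSON object passed to push().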
Here are two ways to get the prices
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.common.by import By
url = "https://www.nespresso.com/fr/fr/order/capsules/original/"
browser = webdriver.Chrome()
browser.get(url)
html = browser.page_source
# Getting the prices using bs4
soup = BeautifulSoup(html, 'lxml')
prices = soup.select('.ProductListElement__price')
print([p.text for p in prices])
# Getting the prices using selenium (the find_elements_by_* helpers were removed in recent Selenium 4 releases)
prices = browser.find_elements(By.CLASS_NAME, "ProductListElement__price")
print([p.text for p in prices])
I want to scrape some specific data from a website using urllib and BeautifulSoup.
I'm trying to fetch the text "190.0 kg". As you can see in my code, I have tried using attrs={'class': 'col-md-7'},
but this returns the wrong result. Is there any way to specify that I want the text between the <h3> tags?
from urllib.request import urlopen
from bs4 import BeautifulSoup
# specify the url
quote_page = 'https://styrkeloft.no/live.styrkeloft.no/v2/?test-stevne'
# query the website and return the html to the variable 'page'
page = urlopen(quote_page)
# parse the html using beautiful soup
soup = BeautifulSoup(page, 'html.parser')
# take out the <div> holding the weight and get its text
weight_box = soup.find('div', attrs={'class': 'col-md-7'})
weight = weight_box.text.strip()
print(weight)
Since this content is generated dynamically, it is not in the HTML that urllib or the requests module downloads.
You can use Selenium WebDriver to get it:
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.chrome.service import Service
from bs4 import BeautifulSoup
chrome_options = Options()
chrome_options.add_argument("--headless")
chrome_driver = "path_to_chromedriver"
# Selenium 4 takes the driver path through Service and the options through options=
driver = webdriver.Chrome(service=Service(chrome_driver), options=chrome_options)
driver.get('https://styrkeloft.no/live.styrkeloft.no/v2/?test-stevne')
html = driver.page_source
soup = BeautifulSoup(html, "lxml")
current_lifter = soup.find("div", {"id":"current_lifter"})
value = current_lifter.find_all("div", {'class':'row'})[2].find_all("h3")[0].text
driver.quit()
print(value)
Just be sure to have the chromedriver executable on your machine.
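The chained indexing in that answer raises an exception as soon as the page has fewer rows than expected or has not finished rendering. Here is a minimal sketch of the same lookup with guards, run on made-up markup. The HTML and the function name third_row_heading are hypothetical.

```python
from bs4 import BeautifulSoup

# Hypothetical markup standing in for the rendered page.
html = """
<div id="current_lifter">
  <div class="row"><div class="col-md-7"><h3>Lifter name</h3></div></div>
  <div class="row"><div class="col-md-7"><h3>Squat</h3></div></div>
  <div class="row"><div class="col-md-7"><h3>190.0 kg</h3></div></div>
</div>
"""
soup = BeautifulSoup(html, "html.parser")

def third_row_heading(soup):
    """Return the <h3> text of the third .row inside #current_lifter, or None."""
    rows = soup.select("#current_lifter div.row")
    if len(rows) < 3:
        return None
    heading = rows[2].find("h3")
    return heading.get_text(strip=True) if heading else None

value = third_row_heading(soup)
print(value)
```

Returning None for a missing row makes it easy to retry after a short wait instead of crashing.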
I'm trying to scrape all links from https://www.udemy.com/courses/search/?q=sql&src=ukw&lang=en. However, even without selecting an element, my code retrieves no links. Please see my code below.
import bs4,requests as rq
Link = 'https://www.udemy.com/courses/search/?q=sql&src=ukw&lang=en'
RQOBJ = rq.get(Link)
BS4OBJ = bs4.BeautifulSoup(RQOBJ.text)
print(BS4OBJ)
I assume you want the links to the courses on the page; this code will help:
from selenium import webdriver
from bs4 import BeautifulSoup
import time
baseurl='https://www.udemy.com'
url="https://www.udemy.com/courses/search/?q=sql&src=ukw&lang=en"
driver = webdriver.Chrome()
driver.maximize_window()
driver.get(url)
time.sleep(5)
content = driver.page_source.encode('utf-8').strip()
soup = BeautifulSoup(content,"html.parser")
courseLink = soup.findAll("a", {"class": "card__title",'href': True})
for link in courseLink:
    print(baseurl + link['href'])
driver.quit()
It will print:
https://www.udemy.com/the-complete-sql-bootcamp/
https://www.udemy.com/the-complete-oracle-sql-certification-course/
https://www.udemy.com/introduction-to-sql23/
https://www.udemy.com/oracle-sql-12c-become-an-sql-developer-with-subtitle/
https://www.udemy.com/sql-advanced/
https://www.udemy.com/sql-for-newbs/
https://www.udemy.com/sql-for-marketers-data-analytics-data-science-big-data/
https://www.udemy.com/sql-for-punk-analytics/
https://www.udemy.com/sql-basics-for-beginners/
https://www.udemy.com/oracle-sql-step-by-step-approach/
https://www.udemy.com/microsoft-sql-for-beginners/
https://www.udemy.com/sql-tutorial-learn-sql-with-mysql-database-beginner2expert/
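String concatenation works as long as every href is relative. Below is a small sketch of a variant that also copes with absolute hrefs, using urljoin from the standard library. The markup is made up for illustration; only the class name card__title comes from the answer.

```python
from urllib.parse import urljoin
from bs4 import BeautifulSoup

base_url = "https://www.udemy.com"

# Hypothetical markup; the second href is already absolute.
html = """
<a class="card__title" href="/the-complete-sql-bootcamp/">Course 1</a>
<a class="card__title" href="https://www.udemy.com/sql-advanced/">Course 2</a>
<a class="card__title">No link here</a>
"""
soup = BeautifulSoup(html, "html.parser")

# href=True skips anchors that have no href attribute.
links = [
    urljoin(base_url, a["href"])
    for a in soup.find_all("a", class_="card__title", href=True)
]
print(links)
```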
The website uses JavaScript to fetch the data, so you should use Selenium.