I'm trying to extract a name from an HTML page with BeautifulSoup:
import urllib.request
from bs4 import BeautifulSoup
nightbot = 'https://nightbot.tv/t/tonyxzero/song_requests'
page = urllib.request.urlopen(nightbot)
soup = BeautifulSoup(page, 'html5lib')
list_item = soup.find('strong', attrs={'class': 'ng-binding'})
print(list_item)
But when I print list_item I get None as the reply. Is there a way to fix it?
The webpage is rendered by JavaScript, so you have to use a package like Selenium to get what you want.
You can try this:
CODE:
from bs4 import BeautifulSoup
from selenium import webdriver
driver = webdriver.Firefox()
driver.get('https://nightbot.tv/t/tonyxzero/song_requests')
html = driver.page_source
soup = BeautifulSoup(html, 'html.parser')
list_item = soup.find('strong', attrs={'class': 'ng-binding'})
print(list_item)
RESULT:
<strong class="ng-binding" ng-bind="$state.current.title">Song Requests: TONYXZERO</strong>
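Note that page_source is captured as soon as get() returns, so on a slow connection the Angular binding may not have rendered yet. A minimal sketch with an explicit wait, assuming the same strong.ng-binding selector is what you're after:
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

driver = webdriver.Firefox()
driver.get('https://nightbot.tv/t/tonyxzero/song_requests')
# Block for up to 10 s until at least one bound element exists.
WebDriverWait(driver, 10).until(
    EC.presence_of_element_located((By.CSS_SELECTOR, 'strong.ng-binding')))
soup = BeautifulSoup(driver.page_source, 'html.parser')
print(soup.find('strong', attrs={'class': 'ng-binding'}))
driver.quit()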
I was trying to scrape some data using BeautifulSoup in Python from the site 'https://www.geappliances.com/ge-appliances/kitchen/ranges/', which has some products.
import unittest, time, random
from selenium import webdriver
from webdriver_manager.firefox import GeckoDriverManager
from bs4 import BeautifulSoup
import pandas as pd
links = []
browser = webdriver.Firefox(executable_path="C:\\Users\\drivers\\geckodriver\\win64\\v0.29.1\\geckodriver.exe")
browser.get("https://www.geappliances.com/ge-appliances/kitchen/ranges/")
content = browser.page_source
soup = BeautifulSoup(content, "html.parser")
for main in soup.findAll('li', attrs={'class': 'product'}):
    name = main.find('a', href=True)
    if name is not None:
        links.append(name.get('href').strip())
print("Got links : ", len(links))
exit()
The output I get is:
Got links: 0
I printed the soup and saw that this part was not there in the soup. I have been trying to get around this problem to no avail.
Am I doing something wrong? Any suggestions are appreciated. Thanks.
Study the source of the webpage and check your findAll() call.
You should wait until the page loads; use time.sleep() to pause execution for a while.
You can try it like this:
from bs4 import BeautifulSoup
from selenium import webdriver
import time
url = 'https://www.geappliances.com/ge-appliances/kitchen/ranges/'
driver = webdriver.Chrome("chromedriver.exe")
driver.get(url)
time.sleep(5)
soup = BeautifulSoup(driver.page_source, 'lxml')
u = soup.find('ul', class_='productGrid')
for item in u.find_all('li', class_='product'):
    print(item.find('a')['href'])
driver.close()
/appliance/GE-Profile-30-Smart-Slide-In-Front-Control-Induction-Fingerprint-Resistant-Range-with-In-Oven-Camera-PHS93XYPFS
/appliance/GE-Profile-30-Electric-Pizza-Oven-PS96PZRSS
/appliance/GE-Profile-30-Smart-Slide-In-Front-Control-Gas-Double-Oven-Convection-Fingerprint-Resistant-Range-PGS960YPFS
/appliance/GE-Profile-30-Smart-Dual-Fuel-Slide-In-Front-Control-Fingerprint-Resistant-Range-P2S930YPFS
/appliance/GE-Profile-30-Smart-Slide-In-Electric-Double-Oven-Convection-Fingerprint-Resistant-Range-PS960YPFS
/appliance/GE-Profile-30-Smart-Slide-In-Front-Control-Induction-and-Convection-Range-with-No-Preheat-Air-Fry-PHS930BPTS
/appliance/GE-Profile-30-Smart-Slide-In-Fingerprint-Resistant-Front-Control-Induction-and-Convection-Range-with-No-Preheat-Air-Fry-PHS930YPFS
/appliance/GE-Profile-30-Smart-Slide-In-Front-Control-Gas-Range-with-No-Preheat-Air-Fry-PGS930BPTS
/appliance/GE-Profile-30-Smart-Slide-In-Front-Control-Gas-Fingerprint-Resistant-Range-with-No-Preheat-Air-Fry-PGS930YPFS
/appliance/GE-Profile-30-Smart-Slide-In-Electric-Convection-Range-with-No-Preheat-Air-Fry-PSS93BPTS
/appliance/GE-Profile-30-Smart-Slide-In-Electric-Convection-Fingerprint-Resistant-Range-with-No-Preheat-Air-Fry-PSS93YPFS
/appliance/GE-30-Slide-In-Front-Control-Gas-Double-Oven-Range-JGSS86SPSS
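A fixed time.sleep(5) is fragile: it wastes time on fast connections and can be too short on slow ones. A sketch of an explicit wait that could replace the sleep, assuming the same ul.productGrid markup:
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

# Wait up to 15 s for the product grid instead of sleeping a fixed 5 s.
WebDriverWait(driver, 15).until(
    EC.presence_of_element_located((By.CSS_SELECTOR, 'ul.productGrid')))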
I'm trying to webscrape a JavaScript-rendered page with BeautifulSoup and Python, and I've read a lot of things to come up with this code, where I try to scrape a price rendered with JavaScript on a well-known website, for example:
from bs4 import BeautifulSoup
from selenium import webdriver
url = "https://www.nespresso.com/fr/fr/order/capsules/original/"
browser = webdriver.PhantomJS(executable_path = "C:/phantomjs-2.1.1-windows/bin/phantomjs.exe")
browser.get(url)
html = browser.page_source
soup = BeautifulSoup(html, 'lxml')
soup.find("span", {'class':'ProductListElement__price'}).text
But the only result I get is '\xa0', which is the value in the raw source, not the JavaScript-rendered value, and I don't really know what I did wrong ...
Best regards
You don't need the expense of a browser. The info is in a script tag, so you can regex it out and handle it with the json library.
import requests, re, json
r = requests.get('https://www.nespresso.com/fr/fr/order/capsules/original/')
p = re.compile(r'window\.ui\.push\((.*ProductList.*)\)')
data = json.loads(p.findall(r.text)[0])
products = {product['name']:product['price'] for product in data['configuration']['eCommerceData']['products']}
print(products)
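If the page layout ever changes, p.findall(r.text)[0] raises an IndexError. A slightly more defensive sketch of the same idea:
m = p.search(r.text)
if m:
    data = json.loads(m.group(1))
    products = {product['name']: product['price']
                for product in data['configuration']['eCommerceData']['products']}
    print(products)
else:
    print('ProductList payload not found in the page source')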
Here are two ways to get the prices
from bs4 import BeautifulSoup
from selenium import webdriver
url = "https://www.nespresso.com/fr/fr/order/capsules/original/"
browser = webdriver.Chrome()
browser.get(url)
html = browser.page_source
# Getting the prices using bs4
soup = BeautifulSoup(html, 'lxml')
prices = soup.select('.ProductListElement__price')
print([p.text for p in prices])
# Getting the prices using selenium
prices = browser.find_elements_by_class_name("ProductListElement__price")
print([p.text for p in prices])
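Either way, the extracted text can contain the non-breaking space ('\xa0') the question ran into, so it is worth normalising it before further processing, for example:
# Replace NBSPs and trim whitespace before parsing the prices further.
clean_prices = [p.text.replace('\xa0', ' ').strip() for p in prices]
print(clean_prices)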
So I want to get the image source from this website:
https://www.pixiv.net/en/artworks/77619496
But every time I try to scrape it with bs4 I keep failing; I've tried the suggestions from other posts too but couldn't get them to work.
It keeps returning None:
import requests
import bs4
from bs4 import BeautifulSoup
url = 'https://www.pixiv.net/en/artworks/77564597'
r = requests.get(url)
soup = BeautifulSoup(r.content, 'html.parser')
x = soup.find("img")
print(x)
If you look at the network section of Chrome's debug console, or the console in the browser you are using, you should see that there are no img elements initially; the page generates them by executing JavaScript. However, on inspecting the page there is a meta element which holds the image data, and you can parse it with JSON as shown:
import requests, json
from bs4 import BeautifulSoup
url = 'https://www.pixiv.net/en/artworks/77564597'
r = requests.get(url)
soup = BeautifulSoup(r.content, 'html.parser')
x = soup.find("meta", {"id": "meta-preload-data"}).get("content")
usefulData = json.loads(x)
print(usefulData)
The sample output is here.
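If all you need are the image URLs, you can walk that JSON; the sketch below assumes the preload data keeps the URLs under illust -> artwork id -> urls, which is what the structure looked like on inspection and may change:
# Hypothetical walk of the preload JSON; the key layout is an assumption.
for illust_id, illust in usefulData.get('illust', {}).items():
    print(illust_id, illust.get('urls', {}).get('original'))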
from selenium import webdriver
import time
from bs4 import BeautifulSoup
browser = webdriver.Firefox()
url = 'https://www.pixiv.net/en/artworks/77564597'
browser.get(url)
time.sleep(3)
source = browser.page_source
soup = BeautifulSoup(source, 'html.parser')
for item in soup.findAll('div', attrs={'class': 'sc-fzXfPI fRnFme'}):
    for img in item.findAll('img', attrs={'class': 'sc-fzXfPJ lclRkv'}):
        print(img.get('src'))
Output:
https://i.pximg.net/c/250x250_80_a2/custom-thumb/img/2019/11/28/00/02/59/78026183_p0_custom1200.jpg
https://i.pximg.net/c/250x250_80_a2/img-master/img/2019/10/31/04/15/04/77564597_p0_square1200.jpg
https://i.pximg.net/c/250x250_80_a2/img-master/img/2019/08/30/07/23/45/76528190_p0_square1200.jpg
https://i.pximg.net/c/250x250_80_a2/img-master/img/2019/08/23/08/01/08/76410568_p0_square1200.jpg
https://i.pximg.net/c/250x250_80_a2/img-master/img/2019/07/24/03/41/47/75881545_p0_square1200.jpg
https://i.pximg.net/c/250x250_80_a2/img-master/img/2019/05/30/04/24/27/74969583_p0_square1200.jpg
https://i.pximg.net/c/250x250_80_a2/custom-thumb/img/2019/11/28/00/02/59/78026183_p0_custom1200.jpg
https://i.pximg.net/c/250x250_80_a2/img-master/img/2019/10/31/04/15/04/77564597_p0_square1200.jpg
https://i.pximg.net/c/250x250_80_a2/img-master/img/2019/08/30/07/23/45/76528190_p0_square1200.jpg
from bs4 import BeautifulSoup
import requests
r = requests.get('http://medicalassociation.in/doctor-search')
soup = BeautifulSoup(r.text,'lxml')
link = soup.find('table',{'class':'tab-gender'})
link1 = link.find('tbody')
link2 = link1.find_all('tr')[3]
link3 = link2.find('a', class_='user-name')
print(link3.text)
I'm not getting the links through this code. I want to extract the 'view profile' links.
The following works for me on several test runs, using just requests and select() with a class selector.
import requests
from bs4 import BeautifulSoup as bs
r = requests.get('http://medicalassociation.in/doctor-search')
soup = bs(r.content, 'lxml')
results = [item['href'] for item in soup.select(".user-name")]
print(results)
requests.get() can't render the JavaScript, so you can't see those elements in the response. You can use a WebDriver, get the page_source, and then fetch the information.
from bs4 import BeautifulSoup
from selenium import webdriver
driver = webdriver.Chrome()
driver.get("http://medicalassociation.in/doctor-search")
soup = BeautifulSoup(driver.page_source,'html.parser')
for a in soup.find_all('a', class_="user-name"):
    if a.text:
        print(a['href'])
I want to scrape some specific data from a website using urllib and BeautifulSoup.
I'm trying to fetch the text "190.0 kg". As you can see in my code, I have tried using attrs={'class': 'col-md-7'},
but this returns the wrong result. Is there any way to specify that I want it to return the text between the <h3> tags?
from urllib.request import urlopen
from bs4 import BeautifulSoup
# specify the url
quote_page = 'https://styrkeloft.no/live.styrkeloft.no/v2/?test-stevne'
# query the website and return the html to the variable 'page'
page = urlopen(quote_page)
# parse the html using beautiful soup
soup = BeautifulSoup(page, 'html.parser')
# take out the <div> of name and get its value
weight_box = soup.find('div', attrs={'class': 'col-md-7'})
name = weight_box.text.strip()
print (name)
Since this content is dynamically generated, there is no way to access that data using the requests module.
You can use selenium webdriver to accomplish this:
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from bs4 import BeautifulSoup
chrome_options = Options()
chrome_options.add_argument("--headless")
chrome_driver = "path_to_chromedriver"
driver = webdriver.Chrome(chrome_options=chrome_options,executable_path=chrome_driver)
driver.get('https://styrkeloft.no/live.styrkeloft.no/v2/?test-stevne')
html = driver.page_source
soup = BeautifulSoup(html, "lxml")
current_lifter = soup.find("div", {"id":"current_lifter"})
value = current_lifter.find_all("div", {'class':'row'})[2].find_all("h3")[0].text
driver.quit()
print(value)
Just be sure to have the chromedriver executable on your machine.
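As in the other answers, the scoreboard is filled in asynchronously, so an explicit wait before reading page_source is safer than relying on timing. A sketch that could go right after driver.get(), assuming the current_lifter id stays stable:
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

# Wait up to 10 s for the lifter panel to appear.
WebDriverWait(driver, 10).until(
    EC.presence_of_element_located((By.ID, 'current_lifter')))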