Problem with BS4 printing only certain parts of the web page - python

I am having a problem with bs4 only finding some things in html. To be specific when I try to print span.nav2__menu-link-main-text it selects it and prints it without a problem but when I try to select other part of the page it probably selects it but It doesnt want to print it out. Here is the code that prints and the code that doesnt print:
Tried using different parsers other than lxml and none worked.
#This one prints
from bs4 import BeautifulSoup
import requests
import lxml
url = 'https://osu.ppy.sh/users/18723891'
res = requests.get(url)
soup = BeautifulSoup(res.text, 'lxml')
for i in soup.select('span.nav2__menu-link-main-text'):
print(i.text)
#This one does not print
from bs4 import BeautifulSoup
import requests
import lxml
url = 'https://osu.ppy.sh/users/https://osu.ppy.sh/users/18723891'
res = requests.get(url)
soup = BeautifulSoup(res.text, 'lxml')
for i in soup.select('div.value-dispaly__value'):
print(i.text)
I expect this program to print the current value of div.value-dispaly__value
but when I start the program it prints nothing even tough I can see the value is 4000 when I inspect the page.

It seems that code you are willing to get is dynamically added to the web page by javascript.
In order to update web js part, you have to use requests render() function.

Website page is javascript request rendering to get data, so you need to use automation library like selenium. download selenium web driver as per your browser requirement.
Download selenium web driver for chrome browser:
http://chromedriver.chromium.org/downloads
Install web driver for chrome browser:
https://christopher.su/2015/selenium-chromedriver-ubuntu/
Selenium tutorial:
https://selenium-python.readthedocs.io/
Replace your code to this:
from selenium import webdriver
from bs4 import BeautifulSoup
import time
driver = webdriver.Chrome('/usr/bin/chromedriver')
driver.get('https://osu.ppy.sh/users/12008062')
time.sleep(3)
soup = BeautifulSoup(driver.page_source, 'lxml')
for i in soup.find_all('div',{"class":"value-display__value"}):
print(i.get_text())
O/P:
#47,514
#108
11d 19h 49m
44
4,000
11d 19h 49m
44
4,000
#47,514
#108
0
0

Related

How can I get information from a web site using BeautifulSoup in python?

I have to take the publication date displayed in the following web page with BeautifulSoup in python:
https://worldwide.espacenet.com/patent/search/family/054437790/publication/CN105030410A?q=CN105030410
The point is that when I search in the html code from 'inspect' the web page, I find the publication date fast, but when I search in the html code got with python, I cannot find it, even with the functions find() and find_all().
I tried this code:
import requests
from bs4 import BeautifulSoup as bs
r = requests.get('https://worldwide.espacenet.com/patent/search/family/054437790/publication/CN105030410A?q=CN105030410')
soup = bs(r.content)
soup.find_all('span', id_= 'biblio-publication-number-content')
but it gives me [], while in the 'inspect' code of the online page, there is this tag.
What am I doing wrong to have the 'inspect' code that is different from the one I get with BeautifulSoup?
How can I solve this issue and get the number?
The problem I believe is due to the content you are looking for being loaded by JavaScript after the initial page is loaded. requests will only show what the initial page content looked like before the DOM was modified by JavaScript.
For this you might try to install selenium and to then download a Selenium web driver for your specific browser. Install the driver in some directory that is in your path and then (here I am using Chrome):
from selenium import webdriver
from selenium.webdriver.common.by import By
from bs4 import BeautifulSoup as bs
options = webdriver.ChromeOptions()
options.add_experimental_option('excludeSwitches', ['enable-logging'])
driver = webdriver.Chrome(options=options)
try:
driver.get('https://worldwide.espacenet.com/patent/search/family/054437790/publication/CN105030410A?q=CN105030410')
# Wait (for up to 10 seconds) for the element we want to appear:
driver.implicitly_wait(10)
elem = driver.find_element(By.ID, 'biblio-publication-number-content')
# Now we can use soup:
soup = bs(driver.page_source, "html.parser")
print(soup.find("span", {"id": "biblio-publication-number-content"}))
finally:
driver.quit()
Prints:
<span id="biblio-publication-number-content"><span class="search">CN105030410</span>A·2015-11-11</span>
Umberto if you are looking for an html element span use the following code:
import requests
from bs4 import BeautifulSoup as bs
r = requests.get('https://worldwide.espacenet.com/patent/search/family/054437790/publication/CN105030410A?q=CN105030410')
soup = bs(r.content, 'html.parser')
results = soup.find_all('span')
[r for r in results]
if you are looking for an html with the id 'biblio-publication-number-content' use the following code
import requests
from bs4 import BeautifulSoup as bs
r = requests.get('https://worldwide.espacenet.com/patent/search/family/054437790/publication/CN105030410A?q=CN105030410')
soup = bs(r.content, 'html.parser')
soup.find_all(id='biblio-publication-number-content')
in first case you are fetching all span html elements
in second case you are fetching all elements with an id 'biblio-publication-number-content'
I suggest you look into html tags and elements for deeper understanding on how they work and what are the semantics behind them.

I'm Trying To Scrape The Duration Of Tiktok Videos But I am Getting 'None'

I want to scrape the duration of tiktok videos for an upcoming project but my code isn't working
import requests; from bs4 import BeautifulSoup
content = requests.get('https://vm.tiktok.com/ZMFFKmx3K/').text
soup = BeautifulSoup(content, 'lxml')
data = soup.find('div', class_="tiktok-1g3unbt-DivSeekBarTimeContainer e123m2eu1")
print(data)
Using an example tiktok
I would think this would work could anyone help
If you turn off JavaScript then check out the element selection in chrome devtools then you will see that the value is like 00/000 but when you will turn JS and the video is on play mode then the duration is increasing uoto finishig.So the real duration value of that element depends on Js. So you have to use an automation tool something like selenium to grab that dynamic value. And How much duration will scrape that depend on time.sleep() if you are on selenium. If time.sleep is more than the video length then it will show None typEerror.
Example:
import time
from selenium import webdriver
from bs4 import BeautifulSoup
from selenium.webdriver.chrome.service import Service
webdriver_service = Service("./chromedriver") #Your chromedriver path
driver = webdriver.Chrome(service=webdriver_service)
url ='https://vm.tiktok.com/ZMFFKmx3K/'
driver.get(url)
driver.maximize_window()
time.sleep(25)
soup = BeautifulSoup(driver.page_source,"lxml")
data = soup.find('div', class_="tiktok-1g3unbt-DivSeekBarTimeContainer e123m2eu1")
print(data.text)
Output:
00:25/00:28
the ID associated is likely randomized. Try using regex to get element by class ending in 'TimeContainer' + some other id
import requests
from bs4 import BeautifulSoup
import re
content = requests.get('https://vm.tiktok.com/ZMFFKmx3K/').text
soup = BeautifulSoup(content, 'lxml')
data = soup.find('div', {'class': re.compile(r'TimeContainer.*$')})
print(data)
you next issue is that the page loads before the video, so you'll get 0/0 for the time. try selenium instead so you can add timer waits for loading

Scraping a Javascript enabled web page in Python

I am looking to scrape the following web page, where I wish to scrape all the text on the page, including all the clickable elements.
I've attempted to use requests:
import requests
response = requests.get("https://cronoschimp.club/market/details/2424?isHonorary=false")
response.text
Which scrapes the meta-data but none of the actual data.
Is there a way to click through and get the elements in the floating boxes?
As it's a Javascript enabled web page, you can't get anything as output using requests, bs4 because they can't render javascript. So, you need an automation tool something like selenium. Here I use selenium with bs4 and it's working fine. Please, see the minimal working example as follows:
Code:
from bs4 import BeautifulSoup
import time
from selenium import webdriver
driver = webdriver.Chrome('chromedriver.exe')
driver.maximize_window()
time.sleep(8)
url = 'https://cronoschimp.club/market/details/2424?isHonorary=false'
driver.get(url)
time.sleep(20)
soup = BeautifulSoup(driver.page_source, 'lxml')
name = soup.find('div',class_="DetailsHeader_title__1NbGC").get_text(strip=True)
p= soup.find('span',class_="DetailsHeader_value__1wPm8")
price= p.get_text(strip=True) if p else "Not for sale"
print([name,price])
Output:
['Chimp #2424', 'Not for sale']

Not getting all the information when scraping bet365.com

I am having problem when trying to scrape https://www.bet365.com/ using urllib.request and BeautifulSoup.
The problem is, the code below doesn't get all the information on the page, for example players' names don't appear. Maybe another framework or configuration to extract the information?
My code is:
from bs4 import BeautifulSoup
import urllib.request
url = "https://www.bet365.com/"
try:
page = urllib.request.urlopen(url)
except:
print("An error occured.")
soup = BeautifulSoup(page, 'html.parser')
soup = str(soup)
Looking at the source code for the page in question it looks like essentially all of the data is populated by Javascript. BeautifulSoup isn't a headless client, it's just something that downloads and parses HTML, so anything that's populated with Javascript it can't see. You'd need a headless browser like selenium to scrape something like that.
You need to use selenium instead of requests, along with Beautifulsoup as well.
from selenium import webdriver
url = "https://www.bet365.com"
driver = webdriver.Chrome(executable_path=r"the_path_of_driver")
driver.get(url)
driver.maximize_window() #optional, if you want to maximize the browser
driver.implicitly_wait(60) ##Optional, Wait the loading if error
soup = BeautifulSoup(driver.page_source, 'html.parser') #get the soup

What is difference between soup of selenium and requests?

I was crawling some information from the web, but there were different results while I'm using Selenium and requests
Selenium
driver.get('https://www.jobplanet.co.kr/companies/322493/benefits/%EC%A7%80%EC%97%90%EC%9D%B4%EC%B9%98%EC%94%A8%EC%A7%80')
soup= BeautifulSoup(driver.page_source, 'html.parser')
sample= soup.find_all('div', class_='accord_hd')`
requests
response= requests.get('https://www.jobplanet.co.kr/companies/322493/benefits/%EC%A7%80%EC%97%90%EC%9D%B4%EC%B9%98%EC%94%A8%EC%A7%80')
soup= BeautifulSoup(response.content, 'html.parser')
sample= soup.find_all('div', class_='accord_hd')`
while using Selenium, it returned an empty list.
but in requests, there was a list with some strings in it.
I experienced sth similar to this before, so I wonder what's going on here
requests will obtain/return the initial html source code.
selenium will simulate/automate the browser to open the web page, which then you can pull the html source that was used to render the page.
The difference between these 2 is requests does not support that rendering/java script if the site is dynamically created. While since selenium actually opens up the browser to display the page, will allow the page to render it's contents before getting the html source.
That's the reason why you may get 2 different responses when using requests versus selenium.
However, in the particular code you have given above, I had the exact same output with using Selenium and using requests
Code:
from bs4 import BeautifulSoup
from selenium import webdriver
import requests
driver = webdriver.Chrome('C:/chromedriver_win32/chromedriver.exe')
driver.get('https://www.jobplanet.co.kr/companies/322493/benefits/%EC%A7%80%EC%97%90%EC%9D%B4%EC%B9%98%EC%94%A8%EC%A7%80')
soup= BeautifulSoup(driver.page_source, 'html.parser')
sample_selenium= soup.find_all('div', class_='accord_hd')
driver.close()
import requests
response = requests.get('https://www.jobplanet.co.kr/companies/322493/benefits/%EC%A7%80%EC%97%90%EC%9D%B4%EC%B9%98%EC%94%A8%EC%A7%80')
soup= BeautifulSoup(response.content, 'html.parser')
sample_requests= soup.find_all('div', class_='accord_hd')
print ('Selenium: %s items\nRequests: %s items' %(len(sample_selenium), len(sample_requests)))
Output:
Selenium: 11 items
Requests: 11 items

Categories

Resources