BeautifulSoup returns empty span elements? - python

I'm trying to pull prices from Binance's home page and BeautifulSoup returns empty elements for me. Binance's home page is at https://www.binance.com/en/, and the interesting block I'm trying to get text from is:
<div class="sc-62mpio-0-sc-iAyFgw iQwJlO" color="#999"><span>"/" "$" "35.49"</span></div>
On Binance's home page is a table and one of the columns is titled "Last Price". Next to the last price is the last USD price in a faded gray color and I'm trying to pull every one of those. Here's my code so far.
def grabPrices():
page = requests.get("https://www.binance.com/en")
soup = BeautifulSoup(page.text, "lxml")
prices = soup.find_all("span", {"class": None})
print(prices)
But the output is just a large array of "–" tags.

Selenium should be one way of scraping the table content you want from this biniance page. And google Selenium about its set up (pretty much by download a driver and place it in your local disk, if you are a chrome user, see this download link chrome driver). Here is my code to access the content you are interested:
from selenium import webdriver
from selenium.webdriver.support.ui import Select
import time
driver = webdriver.Chrome(executable_path=r'C:\chromedriver\chromedriver.exe')
time.sleep(3) # Allow time to launch the controlled web
driver.get('https://www.binance.com/en/')
time.sleep(3) # Allow time to load the page
sel = Selector(text=driver.page_source)
Table = sel.xpath('//*[#id="__next"]/div/main/div[4]/div/div[2]/div/div[2]/div/div[2]/div')
Table.extract() # This basically gives you all the content of the table, see follow screen shot (screen shot is truncated for display purpose)
Then if you further process the entire table content with something like:
tb_rows = Table.xpath('.//div/a//div//div//span/text()').extract()
tb_rows # Then you will get follow screen shot
At this point, the result is narrowed down to pretty much what you are interested, but notice that the lastprice's two components (number/dollar price) are stored in two tag in source page, so we can do following to combine them together and reach to the destination:
for n in range(0,len(tb_rows),2):
LastPrice = tb_rows[n] + tb_rows[n+1]
print(LastPrice) # For sure, other than print, you could store each element in a list
driver.quit() # don't forget to quit driver by the end
The final output looks like:

Related

How to scrape a page that is dynamicaly locaded?

So here's my problem. I wrote a program that is perfectly able to get all of the information I want on the first page that I load. But when I click on the nextPage button it runs a script that loads the next bunch of products without actually moving to another page.
So when I run the next loop all that happens is that I get the same content of the first one, even when the ones on the browser I'm emulating itself is different.
This is the code I run:
from selenium import webdriver
from selenium.webdriver.common.by import By
from bs4 import BeautifulSoup
import time
driver.get("https://www.my-website.com/search/results-34y1i")
soup = BeautifulSoup(driver.page_source, 'html.parser')
time.sleep(2)
# /////////// code to find total number of pages
currentPage = 0
button_NextPage = driver.find_element(By.ID, 'nextButton')
while currentPage != totalPages:
# ///////// code to find the products
currentPage += 1
button_NextPage = driver.find_element(By.ID, 'nextButton')
button_NextPage.click()
time.sleep(5)
Is there any way for me to scrape exactly what's loaded on my browser?
The issue it seems to be because you're just fetching the page 1 as shown in the next line:
driver.get("https://www.tcgplayer.com/search/magic/commander-streets-of-new-capenna?productLineName=magic&setName=commander-streets-of-new-capenna&page=1&view=grid")
But as you can see there's a query parameter called page in the url that determines which html's page you are fetching. So what you'll have to do is every time you're looping to a new page you'll have to fetch the new html content with the driver by changing the page query parameter. For example in your loop it will be something like this:
driver.get("https://www.tcgplayer.com/search/magic/commander-streets-of-new-capenna?productLineName=magic&setName=commander-streets-of-new-capenna&page={page}&view=grid".format(page = currentPage))
And after you fetch the new html structure you'll be able to access to the new elements that are present in the differente pages as you require.

Python/Selenium - How to webscrape this dropdown

My code runs fine and prints the title for all rows but the rows with dropdowns.
For example, row 4 has a dropdown if clicked. I implemented a try which would in theory initiate the dropdown, to then pull the titles.
But my click/scrape for the rows with these drop downs are not printing.
Expected output- Print all titles including the ones in dropdown.
from selenium import webdriver
from bs4 import BeautifulSoup
import time
driver = webdriver.Chrome()
driver.get('https://cslide.ctimeetingtech.com/esmo2021/attendee/confcal/session/list')
time.sleep(4)
page_source = driver.page_source
soup = BeautifulSoup(page_source,'html.parser')
productlist=soup.find_all('div',class_='card item-container session')
for property in productlist:
sessiontitle=property.find('h4',class_='session-title card-title').text
print(sessiontitle)
try:
ifDropdown=driver.find_elements_by_class_name('item-expand-action expand')
ifDropdown.click()
time.sleep(4)
newTitle=driver.find_element_by_class_name('card-title').text
print(newTitle)
except:
newTitle='none'
There were a couple of issues. First, when you locate from the driver by class and there is more than one, you need to separate them by dots, not spaces, so that the driver knows it's dealing with another class.
Second, find_elements returns a list, and the list has no .click(), so you get an error, which your except catches but assumes means there was no link to click.
I rewrote it (without soup for now) so that it instead checks (With the dot replacing space) for a link to open within the session and then loops over the new ones that appeared.
Here is what I have and tested. Note at the end this only gets the sessions and subsessions in the view. You will need to add logic to scroll and get the rest.
# stuff to initialize driver is above here, I used firefox
# Open the website page
URL = "https://cslide.ctimeetingtech.com/esmo2021/attendee/confcal/session/list"
driver.get(URL)
time.sleep(4)#time for page to populate
product_list=driver.find_elements_by_css_selector('div.card.item-container.session')
#above line gets all top level sessions
for product in product_list:
session_title=product.find_element_by_css_selector('h4.card-title').text
print(session_title)
dropdowns=product.find_elements_by_class_name('item-expand-action.expand')
#above line finds dropdown within this session, if any
if len(dropdowns)==0:#nothing more for this session
continue#move to next session
#still here, click on the dropdown, using execute because link can overlap chevron
driver.execute_script("arguments[0].scrollIntoView(true); arguments[0].click();",
dropdowns[0])
time.sleep(4)#wait for subsessions to appear
session_titles=product.find_elements_by_css_selector('h4.card-title')
session_index = 0#suppress reprinting title of master session
for session_title in session_titles:
if session_index > 0:
print(" " + session_title.text)#indent for clarity
session_index = session_index + 1
#still to do, deal with other sessions that only get paged into view when you scroll
#that is a different question

Python- WebScraping a page

My code is supposed to go into a website, navigate through 2 pages, and print out all the titles and URL/href within each row.
Currently - My code goes into these 2 pages fine, however it only prints out the first title of each page and not each title of each row.
The page does have some JavaScript, and I think maybe this is why it does not show any links/urls/hrefs within each of these rows? Ideally id like to print the URLS of each row.
from selenium import webdriver
import time
driver = webdriver.Chrome()
for x in range (1,3):
driver.get(f'https://www.abstractsonline.com/pp8/#!/9325/presentations/endometrial/{x}')
time.sleep(3)
page_source = driver.page_source
eachrow=driver.find_elements_by_xpath("//li[#class='result clearfix']")
for item in eachrow:
title=driver.find_element_by_xpath("//span[#class='bodyTitle']").text
print(title)
You're using driver inside your for loop meaning you're searching the whole page - so you will always get the same element.
You want to search from each item instead.
for item in eachrow:
title = item.find_element_by_xpath(".//span[#class='bodyTitle']").text
Also, there are no "URLs" in the rows as mentioned - when you click on a row the data-id attribute is used in the request.
<h1 class="name" data-id="1989" data-key="">
Which sends a request to https://www.abstractsonline.com/oe3/Program/9325/Presentation/694

Python beautiful soup web scraper doesn't return tag contents

I'am trying to scrape matches and their respective odds from local bookie site but every site i try my web scraper doesn't return anything rather just prints "Process finished with exit code 0" but doesn't return anything.
Can someone help me crack open the containers and get out the contents.
i have tried all the above sites for almost a month but with no success. The problem seems to be with the exact div, class or probably span element layout.
https://www.betlion.co.ug/
https://www.betpawa.ug/
https://www.premierbet.ug/
for example i tried link 2 in the code as shown
import requests
from bs4 import BeautifulSoup
url = "https://www.betpawa.ug/"
response = requests.get (url, timeout=5)
content = BeautifulSoup (response.content, "html.parser")
for match in content.findAll("div",attrs={"class":"events-container prematch", "id":"Bp-Event-591531"}):
print (match.text.strip())
i expect the program to return a list of matches, odds and all the other components of the container. however the program runs and just prints " "Process finished with exit code 0" nothing else
it looks like the base site gets loaded in two phases
Load some HTML structure for the page,
Use JavaScript to fill in the contents
You can prove this to yourself by right clicking on the page, do "view page source" and then searching for "events-container" (it is not there).
So you'll need something more powerful than requests + bs4. I have heard of folks using Selenium to do this, but I'm not familiar with it.
You should consider using urllib3 instead of requests.
from urllib.request import Request, urlopen.
- build your req:
req = Request(url, headers={'User-Agent': 'Mozilla/5.0'})
- retrieve document:
res = urlopen(req)
- parse it using bs4:
html = BeautifulSoup (res, 'html.parser')
Like Chris Curvey described, the problem is that requests can't execute the JavaScript of the page. If you print your content variable you can see that the page would display a message like: "JavaScript Required! To provide you with the best possible product, our website requires JavaScript to function..." With Selenium you control an full browser in form of an WebDriver (for eample ChromeDriver binary for the Google Chrome Browser):
from bs4 import BeautifulSoup
from selenium import webdriver
chrome_options = webdriver.ChromeOptions()
# chrome_options.add_argument('headless')
driver = webdriver.Chrome(chrome_options = chrome_options)
url = "https://www.betpawa.ug/"
driver.get(url)
page = driver.page_source
content = BeautifulSoup(page, 'html.parser')
for match in content.findAll("div",attrs={"class":"events-container"}):
print (match.text.strip())
Update:
In Line 13 the command print (match.text.strip()) simply extract only the text elements for each match-div's wich has the class-attribute "events-container".
If you want to extract more specific content you can access each match over the match variable.
You need to know:
which of the avabile information you want
and how to indentify this information inside the match-div's
structure.
in which data-type you need this information
To make it easy run the program, open the developer tools of chrome with key F12, on the left top corner you see now the icon for "select an element ...",
if you click on the icon and click in the browser on the desired element you see in the area under the icon the equivalent source.
Analyse it carefully to get the info's you need, for example:
The Title of the Football match is the first h3-Tag in the match-div
and is an string
The Odd's shown are span-tag's with the class event-odds and an
number (float/double)
Search the function you need in Google or in the reference to the package you use (BeautifulSoup4).
Let's try to get it quick and dirty by using the BeautifulSoup functions on the match variable to don't get the elements of the full site (have replaced the whitespace with tabs):
# (1) lets try to find the h3-tag
title_tags = match.findAll("h3") # use on match variable
if len(title_tags) > 0: # at least one found?
title = title_tags[0].getText() # get the text of the first one
print("Title: ", title) # show it
else:
print("no h3-tags found")
exit()
# (2) lets try to get some odds as numbers in the order in which they are displayed
odds_tags = match.findAll("span", attrs={"class":"event-odds"})
if len(odds_tags) > 2: # at least three found?
odds = [] # create an list
for tag in odds_tags: # loop over the odds_tags we found
odd = tag.getText() # get the text
print("Odd: ", odd)
# good but it is an string, you can't compare it with an number in
# python and expect an good result.
# You have to clean it and convert it:
clean_odd = odd.strip() # remove empty spaces
odd = float(clean_odd) # convert it to float
print("Odd as Number:", odd)
else:
print("something wen't wrong with the odds")
exit()
input("Press enter to try it on the next match!")

how to loop pages using beautiful soup 4 and python and selenium?

I'm Fairly new to Python and using beautiful soup first time though I have some experience with selenium. I am trying to scrape a website ("http://cbseaff.nic.in/cbse_aff/schdir_Report/userview.aspx" ) For all the affiliation number.
The problem is they are on multiple pages( 20 result on 1, total: 21,000+ result)
so, I wish to scrape these in some kind of loop that can iterate over the next page btn, the problem in URL of the web page does not change and thus there is no pattern.
Okay so for this i have tried, google sheet Import HTML/ Import XML method but due to large scale of problem it just hangs.
Next I tried python and started reading about scraping using python (I'm doing this for the first time :) ) Some-one on this platform suggested an method
(Python Requests/BeautifulSoup access to pagination)
I am trying to do the same but with little and no success.
Also, to fetch the result we have to first, query the search bar with the keyword "a" --> then click search. Only then the website show result.
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.common.exceptions import NoSuchElementException
from selenium.webdriver.common.by import By
import time
options = webdriver.ChromeOptions()
options.add_argument("headless")
driver = webdriver.Chrome(executable_path=r"C:\chromedriver.exe",options=options)
driver.get("http://cbseaff.nic.in/cbse_aff/schdir_Report/userview.aspx")
#click on the radio btn
driver.find_element(By.ID,'optlist_0').click()
time.sleep(2)
# Search the query with letter A And Click Search btn
driver.find_element(By.ID,'keytext').send_Keys("a")
driver.find_element(By.ID,'search').click()
time.sleep(2)
next_button = driver.find_element_by_id("Button1")
data = []
try:
while (next_button):
soup = BeautifulSoup(driver.page_source,'html.parser')
table = soup.find('table',{'id':'T1'}) #Main Table
table_body = table.find('tbody') #get inside the body
rows = table_body.find_all('tr') #look for all tablerow
for row in rows:
cols = row.find_all('td') # in every Tablerow, look for tabledata
for row2 in cols:
#table -> tbody ->tr ->td -><b> --> exit loop. ( only first tr is our required data, print this)
The final outcome I expect is List of all affiliation number across multiple pages.
A minor addition to the code within your while loop:
next_button = 1 #Initialise the variable for the first instance of while loop
while next_button:
#First scroll to the bottom of the page
driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
#Now locate the button & click on it
next_button = driver.find_element(By.ID,'Button1')
next_button.click()
###
###Beautiful Soup Code : Fetch the page source now & do your thing###
###
#Adjust the timing as per your requirement
time.sleep(2)
Note the fact that scrolling to the bottom of the page is important, otherwise an error will pop up claiming 'Button1' element is hidden under the footer. So with the script(in the beginning of the loop), the browser will move down to the bottom of the page. Here, it can see the 'Button1' element clearly. Now, locate the element, perform the click action & then let your Beautiful Soup take over.

Categories

Resources