BeautifulSoup in Python - DIV Contents are not displaying - python

I would like to start by saying I reviewed several solutions on this site, but none seem to be working for me.
I am simply trying to access the contents of a div tag from this website: https://play.spotify.com/chart/3S3GshZPn5WzysgDvfTywr, but the contents are not showing.
Here is the code I have so far:
from bs4 import BeautifulSoup
from selenium import webdriver
import urllib2  # Python 2; on Python 3 use urllib.request instead

browser = webdriver.Firefox()  # assumption: an existing selenium webdriver instance named `browser`

SpotifyGlobViralurl = 'https://play.spotify.com/chart/3S3GshZPn5WzysgDvfTywr'
browser.get(SpotifyGlobViralurl)
page = browser.page_source
soup = BeautifulSoup(page, 'html.parser')

# the div contents live inside an iframe, so load the source of the iframe at index 3 on the page:
iFrames = []
iframexx = soup.find_all('iframe')
response = urllib2.urlopen(iframexx[3].attrs['src'])
iframe_soup = BeautifulSoup(response, 'html.parser')
divcontents = iframe_soup.find('div', id='main-container')
I am trying to pull the contents of the 'main-container' div, but as you can see, it comes back empty when stored in the divcontents variable. However, if you visit the actual URL and inspect the elements, you will find this 'main-container' div filled with all of its contents.
I appreciate the help.

That's because the container is loaded dynamically. I've noticed you are using selenium; you have to continue using it: switch to the iframe and wait for main-container to load:
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

wait = WebDriverWait(browser, 10)

# wait for the iframe to become visible
iframe = wait.until(EC.visibility_of_element_located((By.XPATH, "//iframe[starts-with(@id, 'browse-app-spotify:app:chart:')]")))
browser.switch_to.frame(iframe)

# wait for the header in the container to appear
wait.until(EC.visibility_of_element_located((By.CSS_SELECTOR, "#main-container #header")))
container = browser.find_element_by_id("main-container")
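If you then want to keep using BeautifulSoup for the parsing, a minimal sketch (assuming the container element found above) is to feed the element's rendered markup to the parser:
from bs4 import BeautifulSoup

# parse the now-populated container markup with BeautifulSoup
container_html = container.get_attribute('innerHTML')
iframe_soup = BeautifulSoup(container_html, 'html.parser')
divcontents = iframe_soup  # the soup now holds the filled-in contents of main-container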

Related

Selenium webscraper not scraping desired tags

here are the two tags I am trying to scrape: https://i.stack.imgur.com/a1sVN.png. In case you are wondering, this is the link to that page (the tags I am trying to scrape are not behind the paywall): https://www.wsj.com/articles/chinese-health-official-raises-covid-alarm-ahead-of-lunar-new-year-holiday-11672664635
Below is the Python code I am using. Does anyone know why the tags are not being stored in paragraphs properly?
from selenium import webdriver
from selenium.webdriver.common.by import By
url = 'https://www.wsj.com/articles/chinese-health-official-raises-covid-alarm-ahead-of-lunar-new-year-holiday-11672664635'
driver = webdriver.Chrome()
driver.get(url)
paragraphs = driver.find_elements(By.CLASS_NAME, 'css-xbvutc-Paragraph e3t0jlg0')
print(len(paragraphs)) # => prints 0
So you have two problems impacting you.
You should wait for the page to load after you get() the webpage. You can do this with something like import time and time.sleep(10).
The class names on the elements you are trying to scrape change on every page load. However, the data-type='paragraph' attribute stays constant, so you are able to do:
paragraphs = driver.find_elements(By.XPATH, '//*[@data-type="paragraph"]')  # search by XPath for elements with that data attribute
print(len(paragraphs))
This prints 2 once the page has loaded.
Just to add on to @Andrew Ryan's answer, you can use an explicit wait for a shorter and more dynamic waiting time.
paragraphs = WebDriverWait(driver, 10).until(
    EC.presence_of_all_elements_located((By.XPATH, '//*[@data-type="paragraph"]'))
)
print(len(paragraphs))
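Putting both pieces together, a full sketch (same URL and data-type XPath, with an explicit wait instead of time.sleep) could look like this:
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

url = 'https://www.wsj.com/articles/chinese-health-official-raises-covid-alarm-ahead-of-lunar-new-year-holiday-11672664635'
driver = webdriver.Chrome()
driver.get(url)

# wait until the paragraph elements are present, then read their text
paragraphs = WebDriverWait(driver, 10).until(
    EC.presence_of_all_elements_located((By.XPATH, '//*[@data-type="paragraph"]'))
)
print(len(paragraphs))
for p in paragraphs:
    print(p.text)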

How do you use beautifulsoup and selenium to scrape html inside shadow dom?

I'm trying to make an automation program to scrape part of a website. But this website is made out of javascript, and the part of the website I want to scrape is in a shadow dom.
So I figured out that I should use selenium to go to that website and use this code to access elements in shadow dom
def expand_shadow_element(element):
    shadow_root = driver.execute_script('return arguments[0].shadowRoot', element)
    return shadow_root
and use
driver.page_source
to get the HTML of that website. But this code doesn't show me elements that are inside the shadow dom.
I've tried combining those two and tried
root1 = driver.find_element(By.CSS_SELECTOR, "path1")
shadow_root = expand_shadow_element(root1)
html = shadow_root.page_source
but I got
AttributeError: 'ShadowRoot' object has no attribute 'page_source'
for a response. So I think that I need to use BeautifulSoup to scrape data from that page, but I can't figure out how to combine BeautifulSoup and Selenium to scrape data from a shadow dom.
P.S. If the part I want to scrape is
<h3>apple</h3>
<p>1$</p>
<p>red</p>
I want to scrape that code exactly, not
apple
1$
red
You would use BeautifulSoup here as follows:
soup = BeautifulSoup(driver.page_source, 'lxml')
my_parts = soup.select('h3') # for example
Most likely you need to wait for an element to show up in the code, so you need to set an Implicit Wait or an Explicit Wait; once the element is loaded you can soup that page for the HTML result.
driver.implicitly_wait(15)  # in seconds
text = shadow_root.find_element(By.CSS_SELECTOR, "path2").get_attribute('innerHTML')
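Putting it together with BeautifulSoup, a minimal sketch (where "path1" and "path2" are placeholders for the real selectors) might look like:
from bs4 import BeautifulSoup

driver.implicitly_wait(15)  # in seconds
host = driver.find_element(By.CSS_SELECTOR, "path1")  # element that hosts the shadow DOM (placeholder selector)
shadow_root = driver.execute_script('return arguments[0].shadowRoot', host)
inner_html = shadow_root.find_element(By.CSS_SELECTOR, "path2").get_attribute('innerHTML')

# parse the raw markup so the tags themselves (<h3>, <p>, ...) are preserved
soup = BeautifulSoup(inner_html, 'html.parser')
print(soup.prettify())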
None of the answers solved my problem, so I tinkered with the code and this worked! The answer was get_attribute!

How to scrape a page that is dynamically loaded?

So here's my problem. I wrote a program that is perfectly able to get all of the information I want on the first page that I load. But when I click on the nextPage button it runs a script that loads the next bunch of products without actually moving to another page.
So when I run the next loop iteration, all I get is the same content as on the first page, even though what is shown in the browser I'm emulating is different.
This is the code I run:
from selenium import webdriver
from selenium.webdriver.common.by import By
from bs4 import BeautifulSoup
import time

driver = webdriver.Chrome()  # assumption: the original snippet omits the driver setup
driver.get("https://www.my-website.com/search/results-34y1i")
soup = BeautifulSoup(driver.page_source, 'html.parser')
time.sleep(2)

# /////////// code to find total number of pages

currentPage = 0
button_NextPage = driver.find_element(By.ID, 'nextButton')
while currentPage != totalPages:
    # ///////// code to find the products
    currentPage += 1
    button_NextPage = driver.find_element(By.ID, 'nextButton')
    button_NextPage.click()
    time.sleep(5)
Is there any way for me to scrape exactly what's loaded on my browser?
The issue seems to be that you're only ever fetching page 1, as shown in the following line:
driver.get("https://www.tcgplayer.com/search/magic/commander-streets-of-new-capenna?productLineName=magic&setName=commander-streets-of-new-capenna&page=1&view=grid")
But as you can see, there is a query parameter called page in the URL that determines which page's HTML you are fetching. So every time you loop to a new page, you have to fetch the new HTML content with the driver by changing the page query parameter. For example, inside your loop it would be something like this:
driver.get("https://www.tcgplayer.com/search/magic/commander-streets-of-new-capenna?productLineName=magic&setName=commander-streets-of-new-capenna&page={page}&view=grid".format(page = currentPage))
After you fetch the new HTML structure, you will be able to access the new elements that are present on the different pages, as you require.
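In other words, the loop drives the page parameter instead of clicking the button. A sketch of what that could look like (assuming totalPages has already been determined as in the original code):
from selenium import webdriver
from bs4 import BeautifulSoup
import time

driver = webdriver.Chrome()
base_url = ("https://www.tcgplayer.com/search/magic/commander-streets-of-new-capenna"
            "?productLineName=magic&setName=commander-streets-of-new-capenna"
            "&page={page}&view=grid")

totalPages = 5  # placeholder; find the real total as in the original code
for currentPage in range(1, totalPages + 1):
    driver.get(base_url.format(page=currentPage))
    time.sleep(2)  # or better, an explicit wait for the product grid
    soup = BeautifulSoup(driver.page_source, 'html.parser')
    # ///////// code to find the products in `soup` for this page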

how to use selenium to go from one url tab to another before scraping?

I have created the following code in the hope of opening a new tab with a few parameters and then scraping the data table that is on the new tab.
from selenium import webdriver
from bs4 import BeautifulSoup
import requests
import time

#Open Webpage
url = "https://www.website.com"
driver=webdriver.Chrome(executable_path=r"C:\mypathto\chromedriver.exe")
driver.get(url)
#Click Necessary Parameters
driver.find_element_by_partial_link_text('Output').click()
driver.find_element_by_xpath('//*[#id="flexOpt"]/table/tbody/tr/td[2]/input[3]').click()
driver.find_element_by_xpath('//*[#id="flexOpt"]/table/tbody/tr/td[2]/input[4]').click()
driver.find_element_by_xpath('//*[#id="repOpt"]/table[2]/tbody/tr/td[2]/input[4]').click()
time.sleep(2)
driver.find_element_by_partial_link_text('Dates').click()
driver.find_element_by_xpath('//*[#id="RangeOption"]').click()
driver.find_element_by_xpath('//*[#id="Range"]/table/tbody/tr[1]/td[2]/select/option[2]').click()
driver.find_element_by_xpath('//*[#id="Range"]/table/tbody/tr[1]/td[3]/select/option[1]').click()
driver.find_element_by_xpath('//*[#id="Range"]/table/tbody/tr[1]/td[4]/select/option[1]').click()
driver.find_element_by_xpath('//*[#id="Range"]/table/tbody/tr[2]/td[2]/select/option[2]').click()
driver.find_element_by_xpath('//*[#id="Range"]/table/tbody/tr[2]/td[3]/select/option[31]').click()
driver.find_element_by_xpath('//*[#id="Range"]/table/tbody/tr[2]/td[4]/select/option[1]').click()
time.sleep(2)
driver.find_element_by_partial_link_text('Groupings').click()
driver.find_element_by_xpath('//*[#id="availFld_DATE"]/a/img').click()
driver.find_element_by_xpath('//*[#id="availFld_LOCID"]/a/img').click()
driver.find_element_by_xpath('//*[#id="availFld_STATE"]/a/img').click()
driver.find_element_by_xpath('//*[#id="availFld_DDSO_SA"]/a/img').click()
driver.find_element_by_xpath('//*[#id="availFld_CLASS_ID"]/a/img').click()
driver.find_element_by_xpath('//*[#id="availFld_REGION"]/a/img').click()
time.sleep(2)
driver.find_element_by_partial_link_text('Run').click()
time.sleep(2)
df_url = driver.switch_to_window(driver.window_handles[0])
page = requests.get(df_url).text
soup = BeautifulSoup(page, features = 'html5lib')
soup.prettify()
However, the following error pops up when I run it.
requests.exceptions.MissingSchema: Invalid URL 'None': No schema supplied. Perhaps you meant http://None?
I will say that regardless of the parameters, the new tab always generates the same url. In other words, if the new tab creates www.website.com/b, it also creates www.website.com/b the third, fourth, etc. time, regardless of changing the parameters. Any thoughts?
The problem lies here:
df_url = driver.switch_to_window(driver.window_handles[0])
page = requests.get(df_url).text
df_url does not hold the URL of the page; switch_to_window just switches the driver's focus and returns None. To get the URL, call driver.current_url after switching windows; that returns the URL of the active window.
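For example, the end of the script could look like this (a sketch; switching to window_handles[-1] assumes the report opens in the newest tab):
# switch to the newly opened tab, then read its url
driver.switch_to.window(driver.window_handles[-1])
df_url = driver.current_url
page = requests.get(df_url).text
soup = BeautifulSoup(page, features='html5lib')
Note that requests.get opens a separate session without the browser's cookies; if the page needs them, reading driver.page_source from the switched-to tab is an alternative.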
Some other pointers:
finding elements by xpath is relatively inefficient (source)
instead of time.sleep, you can look into using explicit waits
Assign the url below the driver variable, because the webdriver needs to be created first and only then is the url loaded:
driver=webdriver.Chrome(executable_path=r"C:\mypathto\chromedriver.exe")
url = "https://www.website.com"

How to get iframe source from page_source when the id isn't on the iframe

Hello, today I want to ask how to get the link inside the page source when it doesn't have an id. I asked before how to get the link with an id, and I understand that now, but I've tried the same method with another link and was not successful, so here is my code:
from selenium import webdriver

# create a new instance of the Chrome driver
driver_path = r"C:\Users\666\Desktop\New folder (8)\chromedriver.exe"
driver = webdriver.Chrome(driver_path)

# go to the page
driver.get("https://www.gledalica.com/sa-prevodom/iron-fist-s02e01-video_02cb355f8.html")

# find the element with id "Playerholder" and the first iframe on the page
element = driver.find_element_by_id("Playerholder")
frame = driver.find_element_by_tag_name("iframe")
driver.switch_to.frame("iframe")
link = frame.get_attribute("src")
driver.quit()
There are multiple ways to get it. In this case one of the easiest is to use a CSS selector:
frame = driver.find_element_by_css_selector('#Playerholder iframe')
This looks for the element with id = "Playerholder" in the html and then looks for a child of it that is an iframe.
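Once you have that element, its src attribute holds the link you are after, e.g.:
frame = driver.find_element_by_css_selector('#Playerholder iframe')
link = frame.get_attribute('src')
print(link)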
