Selenium freezing on large pages - Python

I am scraping a very large document, and when I call:
page_source = driver.page_source
It freezes and isn't able to capture the full page source. Is there something I can do to mitigate this issue? The page is loaded via autoscroll, and I can't otherwise access the source.

You can work around it with execute_script():
driver.execute_script("return document.documentElement.outerHTML;")
You can also try scrolling the footer into view and only then getting the page source:
footer = driver.find_element_by_tag_name("footer")
driver.execute_script("arguments[0].scrollIntoView();", footer)
print(driver.page_source)
Assuming there is a footer element, of course.
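Since the page is built by autoscrolling, one possible approach, a rough sketch assuming the page stops growing once everything has loaded, is to keep scrolling until the document height stops changing and only then pull the source:

import time

# Keep scrolling until the document height stops growing, then grab the source.
last_height = driver.execute_script("return document.body.scrollHeight")
while True:
    driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
    time.sleep(2)  # pause length is an assumption; tune it for the page
    new_height = driver.execute_script("return document.body.scrollHeight")
    if new_height == last_height:
        break
    last_height = new_height

page_source = driver.execute_script("return document.documentElement.outerHTML;")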

Related

How do you use BeautifulSoup and Selenium to scrape HTML inside a shadow DOM?

I'm trying to make an automation program to scrape part of a website. But this website is built with JavaScript, and the part I want to scrape is inside a shadow DOM.
So I figured out that I should use Selenium to go to that website and use this code to access elements in the shadow DOM:
def expand_shadow_element(element):
    shadow_root = driver.execute_script('return arguments[0].shadowRoot', element)
    return shadow_root
and use
driver.page_source
to get the HTML of that website. But this code doesn't show me the elements that are inside the shadow DOM.
I've tried combining those two and tried
root1 = driver.find_element(By.CSS_SELECTOR, "path1")
shadow_root = expand_shadow_element(root1)
html = shadow_root.page_source
but I got
AttributeError: 'ShadowRoot' object has no attribute 'page_source'
for a response. So I think that I need to use BeautifulSoup to scrape data from that page, but I can't figure out how to combine BeautifulSoup and Selenium to scrape data from a shadow dom.
P.S. If the part I want to scrape is
<h3>apple</h3>
<p>1$</p>
<p>red</p>
I want to scrape that code exactly, not
apple
1$
red
You would use BeautifulSoup here as follows:
soup = BeautifulSoup(driver.page_source, 'lxml')
my_parts = soup.select('h3') # for example
Most likely you need to wait for the element to show up, so set an implicit or explicit wait; once the element has loaded, you can parse that page with BeautifulSoup for the HTML result.
driver.implicitly_wait(15)  # in seconds
text = shadow_root.find_element(By.CSS_SELECTOR, "path2").get_attribute('innerHTML')
None of the answers solved my problem, so I tinkered with the code and this worked! The answer was get_attribute!
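Putting the pieces together, a minimal sketch (the "path1" and "path2" selectors are just the placeholders from the question) that pulls the shadow DOM markup into BeautifulSoup could look like this:

from bs4 import BeautifulSoup
from selenium.webdriver.common.by import By

host = driver.find_element(By.CSS_SELECTOR, "path1")  # the shadow host (placeholder selector)
shadow_root = driver.execute_script('return arguments[0].shadowRoot', host)

# The ShadowRoot has no page_source, but elements inside it expose innerHTML.
inner = shadow_root.find_element(By.CSS_SELECTOR, "path2")
html = inner.get_attribute('innerHTML')  # raw markup, tags included

soup = BeautifulSoup(html, 'lxml')
print(soup.select('h3'))  # e.g. [<h3>apple</h3>]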

Pagination with Selenium and Python

I'm trying to scrape with Selenium and Python. The web page has a paginator in JavaScript; when I click the button, I can see that the content reloads, but when I try to get the new table information I get the same old table info. Selenium doesn't notice that the DOM info has changed. I'm aware of the stale DOM; I'm just looking for the best way to solve this problem.
for link in source.find_all('div', {'class': 'company-row d-flex'}):
    print(link.a.text, link.small.text, link.find('div', {'class': 'col-2'}).text)

# Next button (I'll make an iterator)
driver.find_element_by_xpath('//a[@href="hrefcurrentpage=2"]').click()

# Tried this and it doesn't work
# time.sleep(5)

# Here the table changes but I get the same old info
for link in source.find_all('div', {'class': 'company-row d-flex'}):
    print(link.a.text, link.small.text, link.find('div', {'class': 'col-2'}).text)
I think you are getting the same data after opening the next page, even with a delay, because you are still reading from the existing source object.
So you should re-read (reload) the source after clicking the pagination control, possibly with some delay.
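As a sketch (the selectors come from the question; waiting for the old table to go stale is an assumption about how the page updates), you could re-create the soup from a fresh driver.page_source after each click:

from bs4 import BeautifulSoup
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

def print_rows(source):
    for link in source.find_all('div', {'class': 'company-row d-flex'}):
        print(link.a.text, link.small.text, link.find('div', {'class': 'col-2'}).text)

print_rows(BeautifulSoup(driver.page_source, 'html.parser'))

old_row = driver.find_element_by_css_selector('div.company-row')
driver.find_element_by_xpath('//a[@href="hrefcurrentpage=2"]').click()

# Wait until the old row goes stale, i.e. the table has actually been replaced.
WebDriverWait(driver, 10).until(EC.staleness_of(old_row))

source = BeautifulSoup(driver.page_source, 'html.parser')  # re-read the source
print_rows(source)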

Append "view-source:" to button click url output

I'm trying to get the rendered HTML of a webpage - the Ctrl+U equivalent (in Firefox or Chrome).
Currently I must .click() to load the page, get the URL, and then load it again with view-source: prepended to the URL:
search = browser.find_elements_by_xpath('//*[@id="edit-keys"]')
button = browser.find_elements_by_xpath('//*[@id="edit-submit"]')
browser.execute_script("arguments[0].value = 'bla';", search[0])
browser.execute_script('arguments[0].target="_blank";', button[0].find_element_by_xpath('./ancestor::form'))
browser.execute_script('arguments[0].click();', button[0])
url = browser.current_url
browser.get("view-source:" + url)
Is it possible to do this without loading the URL twice?
browser.execute_script('return document.documentElement.outerHTML') does not offer the view-source: equivalent
driver.page_source also does not match view-source:
Maybe there is a way to add view-source: to browser.execute_script('arguments[0].click();', button[0])?
To get the rendered HTML, with dynamically JS-loaded elements and all, you can retrieve it with JavaScript in a simple one-liner:
rendered_source = driver.execute_script('return document.documentElement.outerHTML;')
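If the goal is just to inspect that markup, one option (the file name is arbitrary) is to write the string to disk instead of loading the URL a second time with the view-source: prefix:

rendered_source = driver.execute_script('return document.documentElement.outerHTML;')
# Save the rendered DOM to a local file for inspection (file name is an arbitrary choice).
with open('rendered.html', 'w', encoding='utf-8') as f:
    f.write(rendered_source)

Note that this captures the DOM after JavaScript has run, which is the rendered HTML asked for; it will not match the raw server response byte for byte.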

Instagram crawling with scrolling down... with Python Selenium

total_link = []
temp = ['a']
total_num = 0
while driver.find_element_by_tag_name('div'):
    driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
    Divs = driver.find_element_by_tag_name('div').text
    html = driver.page_source
    soup = BeautifulSoup(html, 'html.parser')
    my_titles = soup.select(
        'div._6d3hm > div._mck9w'
    )
    for title in my_titles:
        try:
            if title in temp:
                # print('duplicate')
                pass
            else:
                # print('not a duplicate')
                link = str(title.a.get("href"))  # grab the address!
                total_link.append(link)
                # print(link)
        except:
            pass
    print("현재 모은 개수: " + str(len(total_link)))  # "Number collected so far: "
    temp = my_titles
    time.sleep(2)
    if 'End of Results' in Divs:
        print('end')
        break
    else:
        continue
Hello, I was scraping Instagram data for tags in Korean.
My code consists of the following steps:
scroll down the page
using bs4 and requests, get the HTML
locate the time log, picture src, text, tags, and ID
select them all, and crawl them
after finishing with the HTML on the page, scroll down again
repeat until the end
By doing this, and using code from people on this site, it seemed to work...
but after a few scrolls down, at certain points, the scrolling stops with an error message showing
'읽어드리지 못합니다', or in English, 'Unable to read'.
Can I know why this error pops up and how to solve the problem?
I am using Python and Selenium.
Thank you for your answers.
Instagram tries to protect itself against malicious activity such as scraping or other automated access. This often happens when you access Instagram pages abnormally fast, so you have to add time.sleep() calls more frequently, or with longer delays.
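For example, a small sketch with a randomized pause between scrolls (the 3-7 second range is an arbitrary assumption):

import random
import time

driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
time.sleep(random.uniform(3, 7))  # randomized delay so the requests look less machine-like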

BeautifulSoup in Python - DIV contents are not displaying

I would like to start by saying I reviewed several solutions on this site, but none seem to be working for me.
I am simply trying to access the contents of a div tag from this website: https://play.spotify.com/chart/3S3GshZPn5WzysgDvfTywr, but the contents are not showing.
Here is the code I have so far:
SpotifyGlobViralurl='https://play.spotify.com/chart/3S3GshZPn5WzysgDvfTywr'
browser.get(SpotifyGlobViralurl)
page = browser.page_source
soup = BeautifulSoup(page)
#the div contents exist in an iframe, so now we call the iframe contents of the 3rd iframe on page:
iFrames=[]
iframexx = soup.find_all('iframe')
response = urllib2.urlopen(iframexx[3].attrs['src'])
iframe_soup = BeautifulSoup(response)
divcontents = iframe_soup.find('div', id='main-container')
I am trying to pull the contents of the 'main-container' div, but as you will see, it appears empty when stored in the divcontents variable created above. However, if you visit the actual URL and inspect the elements, you will find this 'main-container' div filled with all of its contents.
I appreciate the help.
That's because the container is loaded dynamically. I've noticed you are using Selenium; you have to continue using it: switch to the iframe and wait for main-container to load:
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

wait = WebDriverWait(browser, 10)
# wait for the iframe to become visible
iframe = wait.until(EC.visibility_of_element_located((By.XPATH, "//iframe[starts-with(@id, 'browse-app-spotify:app:chart:')]")))
browser.switch_to.frame(iframe)
# wait for the header inside the container to appear
wait.until(EC.visibility_of_element_located((By.CSS_SELECTOR, "#main-container #header")))
container = browser.find_element_by_id("main-container")
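From there, if you still want the contents in BeautifulSoup as in the question (the parser choice here is an assumption), you could parse the now-populated container:

from bs4 import BeautifulSoup

# Parse the container's rendered markup once the waits above have passed.
divcontents = BeautifulSoup(container.get_attribute('innerHTML'), 'html.parser')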
