I am trying to retrieve all reviewers' comments for a particular app (https://play.google.com/store/apps/details?id=com.getsomeheadspace.android&hl=en&showAllReviews=true) using selenium and beautifulsoup. I load the link using the following code:
driver = webdriver.Chrome(path)
driver.get('https://play.google.com/store/apps/details?id=com.tudasoft.android.BeMakeup&hl=en&showAllReviews=true')
The above command does not load all the reviewers' comments: it only loads the first 39 and stops there. Is there any way to load all the comments in a single go?
You can use an infinite loop that keeps scrolling the page until the Show More element is found, since the reviews are lazy-loaded. To slow down the loop I have used time.sleep(1). This gives 200 reviews on that page; if you want more, you need to click Show More again.
Some of the reviews are formatted differently, hence the try..except block. Hope this helps.
import time

from selenium import webdriver
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By

driver = webdriver.Chrome()
driver.get('https://play.google.com/store/apps/details?id=com.tudasoft.android.BeMakeup&hl=en&showAllReviews=true')

# Keep scrolling until the lazy-loaded reviews stop and "Show More" appears.
while True:
    driver.execute_script("window.scrollTo(0, document.body.scrollHeight)")
    time.sleep(1)  # slow the loop down so new content has time to load
    elements = WebDriverWait(driver, 10).until(
        EC.visibility_of_all_elements_located((By.CSS_SELECTOR, 'div.UD7Dzf')))
    if len(driver.find_elements_by_xpath("//span[text()='Show More']")) > 0:
        break

print(len(elements))

allreview = []
try:
    for review in elements:
        allreview.append(review.text)
except:
    allreview.append("format incorrect")
print(allreview)
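If you want to go past those first 200 reviews, a rough sketch (untested; the 'Show More' button text and the div.UD7Dzf selector are taken from the page as it was, so treat them as assumptions) that clicks Show More in a loop, continuing the script above:
last_count = 0
while True:
    driver.execute_script("window.scrollTo(0, document.body.scrollHeight)")
    time.sleep(1)
    # Click "Show More" whenever it appears to load the next batch of reviews.
    for button in driver.find_elements_by_xpath("//span[text()='Show More']"):
        try:
            button.click()
            time.sleep(1)
        except Exception:
            pass  # the button may be mid-render; try again on the next pass
    reviews = driver.find_elements_by_css_selector('div.UD7Dzf')
    if len(reviews) == last_count:
        break  # no new reviews appeared, assume everything is loaded
    last_count = len(reviews)
allreview = [review.text for review in reviews]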
Looks like you have to scroll down to get all the information on the page.
Try this:
driver.execute_script("window.scrollTo(0,document.body.scrollHeight)")
You may have to do that a couple of times to load all the data
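For example, a minimal sketch that keeps scrolling until the page height stops growing (the height check is an assumption about how the lazy loading behaves), reusing the driver from the question:
import time

last_height = driver.execute_script("return document.body.scrollHeight")
while True:
    driver.execute_script("window.scrollTo(0, document.body.scrollHeight)")
    time.sleep(1)  # give the lazy-loaded content time to arrive
    new_height = driver.execute_script("return document.body.scrollHeight")
    if new_height == last_height:
        break  # the page stopped growing, so assume everything is loaded
    last_height = new_height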
I am attempting to scrape basketball-reference and am running into an issue I can't seem to solve: I am trying to grab the box score element for each game played. This is something I was able to do easily with urlopen, but because other portions of the site require Selenium, I thought I would rewrite the entire process with Selenium.
The issue seems to be that even if I wait until I see the first element load using WebDriverWait, when I then move on to grabbing the elements I get nothing returned.
One thing I found interesting: if I did a full page print using my results from urlopen with something like print(uClient.read()), I would get roughly 300 more lines of html after beautifying compared to doing the same with print(driver.page_source), even with an implicit wait set to 5 minutes.
from selenium import webdriver
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
driver = webdriver.Chrome('/usr/local/bin/chromedriver')
driver.wait = WebDriverWait(driver, 10)
driver.get('https://www.basketball-reference.com/boxscores/')
driver.wait.until(EC.presence_of_element_located((By.XPATH, '//*[@id="content"]/div[3]/div[1]')))
box = driver.find_elements_by_class_name('game_summary expanded nohover')
print (box)
driver.quit()
Try the below code; it is working on my computer. Do let me know if you still face problems.
from selenium import webdriver
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC

driver = webdriver.Chrome()
driver.wait = WebDriverWait(driver, 60)
driver.get('https://www.basketball-reference.com/boxscores/')
driver.wait.until(EC.presence_of_element_located((By.XPATH, '//*[@id="content"]/div[3]/div[1]')))
boxes = driver.wait.until(
    EC.presence_of_all_elements_located((By.XPATH, "//div[@class='game_summary expanded nohover']")))
print("Number of Elements Located : ", len(boxes))
for box in boxes:
    print(box.text)
    print("-----------")
driver.quit()
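The likely reason the original find_elements_by_class_name call returned nothing is that the class-name locator accepts a single class, while game_summary expanded nohover is three separate classes; the compound string matches no element. The XPath above matches the full class attribute instead, and a CSS selector such as div.game_summary.expanded.nohover would work as well.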
Actually the site doesn't require selenium at all. All the data is there via a simple requests call (much of it is just inside HTML comments, so you would need to parse those). Secondly, you can grab the box scores quite easily with pandas:
import pandas as pd

dfs = pd.read_html('https://www.basketball-reference.com/boxscores/')
for idx, table in enumerate(dfs[:-2]):
    print(table)
    if (idx + 1) % 3 == 0:
        print("-----------")
I've been trying to parse the links ending with 20012019.csv from a webpage using the below script, but I keep getting a timeout exception error, even though it seems to me that I did things the right way.
Any insight as to where I'm going wrong will be highly appreciated.
My attempt so far:
from selenium import webdriver

url = 'https://promo.betfair.com/betfairsp/prices'

def get_info(driver, link):
    driver.get(link)
    for item in driver.find_elements_by_css_selector("a[href$='20012019.csv']"):
        print(item.get_attribute("href"))

if __name__ == '__main__':
    driver = webdriver.Chrome()
    try:
        get_info(driver, url)
    finally:
        driver.quit()
Your code is fine (I tried it and it works); the reason you get a timeout is that the default page load timeout is 60s according to this answer, and the page is huge.
Add this to your code before making the get request (to wait up to 180s before timing out):
driver.set_page_load_timeout(180)
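For context, a minimal sketch of where the call fits, using the URL and selector from the question:
from selenium import webdriver

driver = webdriver.Chrome()
driver.set_page_load_timeout(180)  # wait up to 180s for the page before raising a timeout
try:
    driver.get('https://promo.betfair.com/betfairsp/prices')
    for item in driver.find_elements_by_css_selector("a[href$='20012019.csv']"):
        print(item.get_attribute("href"))
finally:
    driver.quit()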
You were close. You have to induce WebDriverWait for the visibility of all elements located, and change the line:
for item in driver.find_elements_by_css_selector("a[href$='20012019.csv']"):
to:
for item in WebDriverWait(driver, 30).until(EC.visibility_of_all_elements_located((By.CSS_SELECTOR, "a[href$='20012019.csv']"))):
Note : You have to add the following imports :
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
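Put together with the original script, the whole thing would look like this (a sketch; the URL and selector are taken from the question):
from selenium import webdriver
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC

url = 'https://promo.betfair.com/betfairsp/prices'

def get_info(driver, link):
    driver.get(link)
    # Wait up to 30s for the anchors to become visible instead of grabbing them immediately.
    for item in WebDriverWait(driver, 30).until(
            EC.visibility_of_all_elements_located((By.CSS_SELECTOR, "a[href$='20012019.csv']"))):
        print(item.get_attribute("href"))

if __name__ == '__main__':
    driver = webdriver.Chrome()
    try:
        get_info(driver, url)
    finally:
        driver.quit()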
I've written a script using python with selenium to click on some links listed in the sidebar of google maps. When any of the items gets clicked, the related information shows up in the right-side area. The script is doing fine; however, I've used a hardcoded delay to do the job. How can I get rid of the hardcoded delay and achieve the same with an explicit wait? Thanks in advance.
Link to the site: website
The script I'm trying with:
import time
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

link = "replace_with_above_link"

driver = webdriver.Chrome()
driver.get(link)
wait = WebDriverWait(driver, 10)

for item in wait.until(EC.visibility_of_all_elements_located((By.CSS_SELECTOR, "[id^='rlimg0_']"))):
    item.location
    time.sleep(3)  # wish to try with explicit wait but can't find any idea
    item.click()
driver.quit()
I tried wait.until(EC.staleness_of(item)) instead of the hardcoded delay, but no luck.
If you want to wait until the new data is displayed after each click, you may try the below:
for item in wait.until(EC.visibility_of_all_elements_located((By.CSS_SELECTOR, "[id^='rlimg0_']"))):
    div = driver.find_element_by_xpath("//div[@class='xpdopen']")
    item.location
    item.click()
    wait.until(EC.staleness_of(div))
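This works where staleness_of(item) did not because the clicked sidebar item itself is never replaced; it is the detail panel (the div with class xpdopen, grabbed before the click) that gets re-rendered, so that is the element whose staleness signals that the new data has loaded.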
I've written a script in python in combination with selenium to parse names from a webpage. The data on that site is not javascript-generated; however, the next-page links are, so they are of no use with the requests library, which is why I have used selenium to parse the data while traversing 25 pages. The only problem I'm facing is that although my scraper is able to reach the last page by clicking through all 25 pages, it only fetches the data from the first page. Moreover, the scraper keeps running even after it has clicked the last page. The next-page links look exactly like javascript:nextPage();, and the url of the site never changes even if I click on the next page button. How can I get all the names from the 25 pages? The css selector I've used in my scraper is flawless. Thanks in advance.
Here is what I've written:
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

driver = webdriver.Chrome()
wait = WebDriverWait(driver, 10)
driver.get("https://www.hsi.com.hk/HSI-Net/HSI-Net?cmd=tab&pageId=en.indexes.hscis.hsci.constituents&expire=false&lang=en&tabs.current=en.indexes.hscis.hsci.overview_des%5Een.indexes.hscis.hsci.constituents&retry=false")

while True:
    for name in wait.until(EC.presence_of_all_elements_located((By.CSS_SELECTOR, "table.greygeneraltxt td.greygeneraltxt,td.lightbluebg"))):
        print(name.text)
    try:
        n_link = wait.until(EC.presence_of_element_located((By.CSS_SELECTOR, "a[href*='nextPage']")))
        driver.execute_script(n_link.get_attribute("href"))
    except:
        break
driver.quit()
You don't have to handle the "Next" button or somehow change the page number - all the entries are already in the page source. Try the below:
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

driver = webdriver.Chrome()
wait = WebDriverWait(driver, 10)
driver.get("https://www.hsi.com.hk/HSI-Net/HSI-Net?cmd=tab&pageId=en.indexes.hscis.hsci.constituents&expire=false&lang=en&tabs.current=en.indexes.hscis.hsci.overview_des%5Een.indexes.hscis.hsci.constituents&retry=false")

for name in wait.until(EC.presence_of_all_elements_located((By.CSS_SELECTOR, "table.greygeneraltxt td.greygeneraltxt,td.lightbluebg"))):
    print(name.get_attribute('textContent'))
driver.quit()
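Note the use of get_attribute('textContent') instead of .text: Selenium's .text only returns text that is currently visible, and since the rows for pages 2-25 are present in the DOM but hidden, .text comes back empty for them while textContent still carries the value.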
You can also try this solution if it's not mandatory for you to use Selenium:
import requests
from lxml import html

r = requests.get("https://www.hsi.com.hk/HSI-Net/HSI-Net?cmd=tab&pageId=en.indexes.hscis.hsci.constituents&expire=false&lang=en&tabs.current=en.indexes.hscis.hsci.overview_des%5Een.indexes.hscis.hsci.constituents&retry=false")
source = html.fromstring(r.content)

for name in source.xpath("//table[@class='greygeneraltxt']//td[text() and position()>1]"):
    print(name.text)
It appears this can actually be done more simply than with the current approach. After the driver.get method, you can simply use the page_source property to get the html behind the page; from there you can get the data from all 25 pages at once. To see how it's structured, just right click and "view source" in chrome.
html_string = driver.page_source
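For instance, a minimal sketch of feeding page_source to BeautifulSoup (the selector is the one from the question; whether it matches the hidden rows is an assumption about the page's markup):
from bs4 import BeautifulSoup
from selenium import webdriver

driver = webdriver.Chrome()
driver.get("https://www.hsi.com.hk/HSI-Net/HSI-Net?cmd=tab&pageId=en.indexes.hscis.hsci.constituents&expire=false&lang=en&tabs.current=en.indexes.hscis.hsci.overview_des%5Een.indexes.hscis.hsci.constituents&retry=false")

# Parse the full html once; all 25 "pages" of rows are already in it.
soup = BeautifulSoup(driver.page_source, 'html.parser')
for cell in soup.select("table.greygeneraltxt td.greygeneraltxt, td.lightbluebg"):
    print(cell.get_text(strip=True))
driver.quit()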
I am trying to scrape a website and want to get the urls and images from the Google AdSense ads on it, but it seems I am not getting any AdSense details at all.
Here is what I want: if we search "refrigerator" in google, we get some ads there which I need to fetch, as do some blogs and websites showing Google Ads (see image).
But when I inspect the page I can find the related divs and the url; when I hit that url directly, however, I only get static html data.
Here is the markup I need to fetch (see image), and here is the script I have written in Selenium and Python:
from contextlib import closing
from selenium.webdriver import Firefox  # pip install selenium
import time
from selenium.common.exceptions import TimeoutException
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

url = "http://www.compiletimeerror.com/"

# use firefox to get page with javascript generated content
with closing(Firefox()) as browser:
    browser.get(url)  # load page
    delay = 10  # seconds
    try:
        # wait for the first ad unit to be present (XPath indices start at 1)
        WebDriverWait(browser, delay).until(
            EC.presence_of_element_located((By.XPATH, "(//div[@class='pla-unit'])[1]")))
        print("Page is ready!")
        Element = browser.find_element(By.ID, value="google_image_div")
        print(Element)
        print(Element.text)
    except TimeoutException:
        print("Loading took too much time!")
But I'm still unable to get data. Please give me any reference or hint.
You need to first select the frame which contains the elements you want to work with.
select_frame("id=google_ads_frame1");
NOTE: I am not sure about the python syntax, but it should be something similar to this.
Use Selenium's switch_to.frame method to direct your browser to the iframe in your html, before selecting your element variable (untested):
from contextlib import closing
from selenium.webdriver import Firefox  # pip install selenium
import time
from selenium.common.exceptions import TimeoutException
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

url = "http://www.compiletimeerror.com/"

# use firefox to get page with javascript generated content
with closing(Firefox()) as browser:
    browser.get(url)  # load page
    delay = 10  # seconds
    try:
        WebDriverWait(browser, delay).until(
            EC.presence_of_element_located((By.XPATH, "(//div[@class='pla-unit'])[1]")))
        print("Page is ready!")
        # switch into the iframe that hosts the ad before locating elements inside it
        browser.switch_to.frame(browser.find_element_by_id('google_ads_frame1'))
        element = browser.find_element(By.ID, value="google_image_div")
        print(element)
        print(element.text)
    except TimeoutException:
        print("Loading took too much time!")
http://elementalselenium.com/tips/3-work-with-frames
A note on Python style best practices: use lowercase when declaring local variables (element vs. Element).