I'm writing a Python crawler using the Selenium library and the PhantomJS browser. I trigger a click event on a page to open a new page, and then call browser.page_source, but I get the original page's source instead of the source of the newly opened page. How can I get the source of the new page?
Here's my code:
import requests
from selenium import webdriver
url = 'https://sf.taobao.com/list/50025969__2__%D5%E3%BD%AD.htm?auction_start_seg=-1&page=150'
browser = webdriver.PhantomJS(executable_path='C:\\ProgramData\\phantomjs-2.1.1-windows\\bin\\phantomjs.exe')
browser.get(url)
browser.find_element_by_xpath("//*[@class='pai-item pai-status-done']").click()
html = browser.page_source
print(html)
browser.quit()
You need to switch to the new window first:
browser.find_element_by_xpath("//*[@class='pai-item pai-status-done']").click()
browser.switch_to.window(browser.window_handles[-1])
html = browser.page_source
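If the click opens the window asynchronously, the new handle may not exist yet when you switch. Here is a hedged sketch that waits for the second handle first (it reuses the question's url and XPath; the 10-second timeout is an arbitrary choice):
from selenium import webdriver
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

url = 'https://sf.taobao.com/list/50025969__2__%D5%E3%BD%AD.htm?auction_start_seg=-1&page=150'
browser = webdriver.PhantomJS(executable_path='C:\\ProgramData\\phantomjs-2.1.1-windows\\bin\\phantomjs.exe')
browser.get(url)
browser.find_element_by_xpath("//*[@class='pai-item pai-status-done']").click()

# Block until the click has actually produced a second window handle
WebDriverWait(browser, 10).until(EC.number_of_windows_to_be(2))
browser.switch_to.window(browser.window_handles[-1])

html = browser.page_source
browser.quit()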
I believe you need to add a wait before getting the page source. I've used an implicit wait in the code below.
from selenium import webdriver
url = 'https://sf.taobao.com/list/50025969__2__%D5%E3%BD%AD.htm?auction_start_seg=-1&page=150'
browser = webdriver.PhantomJS(executable_path='C:\\ProgramData\\phantomjs-2.1.1-windows\\bin\\phantomjs.exe')
browser.get(url)
browser.find_element_by_xpath("//*[@class='pai-item pai-status-done']").click()
browser.implicitly_wait(5)
html = browser.page_source
browser.quit()
It's better to use an explicit wait, but that requires a condition, like EC.element_to_be_clickable((By.ID, 'someid')).
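For example, a minimal sketch of the explicit-wait variant ('someid' is a placeholder locator, not a real element on the Taobao page; replace it with something that only exists once the page is ready):
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

# Reuses the browser from the snippet above; waits up to 10 seconds
# for the chosen element to become clickable before reading the source
wait = WebDriverWait(browser, 10)
wait.until(EC.element_to_be_clickable((By.ID, 'someid')))
html = browser.page_source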
Here is the site I am trying to scrape data from:
https://www.onestopwineshop.com/collection/type/red-wines
import requests
from bs4 import BeautifulSoup
url = "https://www.onestopwineshop.com/collection/type/red-wines"
response = requests.get(url)
#print(response.text)
soup = BeautifulSoup(response.content,'lxml')
That's the code I have above. The HTML content I see in the browser's inspector is different from what I get with BeautifulSoup. My guess is that they are preventing me from getting their data because they detect that I am not accessing the site with a browser. If so, is there any way to bypass that?
(Update) Attempt with selenium:
from selenium import webdriver
import time
path = r"C:\Program Files (x86)\chromedriver.exe"  # raw string so the backslashes are taken literally
# start web browser
browser=webdriver.Chrome(path)
#navigate to the page
url = "https://www.onestopwineshop.com/collection/type/red-wines"
browser.get(url)
# sleep the required amount to let the page load
time.sleep(3)
# get source code
html = browser.page_source
# close web browser
browser.close()
Update 2: (screenshot of the page loaded with devtools)
Any website whose content is loaded after the initial page load is unavailable to BS4 with your current method, because that content is fetched by an AJAX call via JavaScript, and the requests library is unable to run JS code.
To handle this you will have to look at something like Selenium, which controls a browser via Python or other languages. There is a separate driver for each browser, i.e. Firefox, Chrome, etc.
Personally I use Chrome, so the drivers can be found here:
https://chromedriver.chromium.org/downloads
Download the correct driver for your version of Chrome.
Install Selenium via pip.
Create a scrape.py file and put the driver in the same folder.
Then, to get the HTML string to use with BS4:
from selenium import webdriver
import time
# start web browser
browser=webdriver.Chrome()
#navigate to the page
browser.get('http://selenium.dev/')
# sleep the required amount to let the page load
time.sleep(2)
# get source code
html = browser.page_source
# close web browser
browser.close()
You should then be able to use the html variable with BS4.
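For example (the parser choice is up to you):
from bs4 import BeautifulSoup

# Parse the browser-rendered HTML instead of a raw requests response
soup = BeautifulSoup(html, 'lxml')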
I'll actually turn my comment into an answer, because it is a solution to your problem:
As others said, this page is loaded dynamically, but there are ways to retrieve data without running JavaScript. In your case, you want to look at the "Network" tab of your dev tools and filter the "fetch" requests.
This could be particularly interesting for you:
You don't need Selenium or BeautifulSoup at all; you can just use requests and parse the JSON, if you are good enough ;)
Here is a working cURL request: curl 'https://api.commerce7.com/v1/product/for-web?&collectionSlug=red-wines' -H 'tenant: one-stop-wine-shop'
You get an error if you don't add the tenant header.
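Translated into Python with requests, a minimal sketch (the endpoint and header come from the cURL command above; the "products" and "title" keys are assumptions about the JSON shape, so inspect the payload in your dev tools before relying on them):
import requests

# Same endpoint and tenant header as the cURL command above
url = 'https://api.commerce7.com/v1/product/for-web'
params = {'collectionSlug': 'red-wines'}
headers = {'tenant': 'one-stop-wine-shop'}

response = requests.get(url, params=params, headers=headers)
response.raise_for_status()
data = response.json()

# Assumed structure: a "products" list of objects with a "title" field
for product in data.get('products', []):
    print(product.get('title'))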
And that's it: no HTML parsing, no waiting for the page to load, no JavaScript running. Much more powerful than the Selenium solution.
I am trying to extract data from https://www.realestate.com.au/
First I create my URL based on the type of property I am looking for, and then I open the URL using the Selenium webdriver, but the page is blank!
Any idea why that happens? Is it because this website doesn't give permission for web scraping? Is there any way to scrape this website?
Here is my code:
from selenium import webdriver
from bs4 import BeautifulSoup
import time
PostCode = "2153"
propertyType = "house"
minBedrooms = "3"
maxBedrooms = "4"
page = "1"
url = "https://www.realestate.com.au/sold/property-{p}-with-{mib}-bedrooms-in-{po}/list-{pa}?maxBeds={mab}&includeSurrounding=false".format(p = propertyType, mib = minBedrooms, po = PostCode, pa = page, mab = maxBedrooms)
print(url)
# url should be "https://www.realestate.com.au/sold/property-house-with-3-bedrooms-in-2153/list-1?maxBeds=4&includeSurrounding=false"
driver = webdriver.Edge("./msedgedriver.exe") # edit the address to where your driver is located
driver.get(url)
time.sleep(3)
src = driver.page_source
soup = BeautifulSoup(src, 'html.parser')
print(soup)
You are passing the link incorrectly. Try:
driver.get("your link")
API reference: https://selenium-python.readthedocs.io/api.html?highlight=get#:~:text=ef_driver.get(%22http%3A//www.google.co.in/%22)
I did try to access realestate.com.au through Selenium, and in a different use case through Scrapy.
I even got results from Scrapy crawling by using a proper user-agent and cookie, but after a few days realestate.com.au detects Selenium / Scrapy and blocks the requests.
Additionally, it is clearly written in their terms & conditions that indexing any content on their website is strictly prohibited.
You can find more information / analysis in these questions:
Chrome browser initiated through ChromeDriver gets detected
selenium isn't loading the page
The bottom line is, you have to get past their security if you want to scrape the content.
I have permission to download some weather data from the following website:
https://www.meteobridel.com/messnetz/index3.php#
I was wondering if there is a way to automatically find the download URL behind the 'CSV' button and then download that CSV file with Python.
I tried this, but it didn't work:
from selenium import webdriver
browser = webdriver.Safari()
url = 'https://meteobridel.lu/?page_id=5'
browser.get(url)
browser.find_element_by_xpath('//*[@id="CSV"]').click()
browser.close()
Thanks already!
Try
from selenium import webdriver
browser = webdriver.Safari()
url = 'https://meteobridel.lu/?page_id=5'
browser.get(url)
browser.find_element_by_xpath("//body/div[@id='main']/div[1]/div[1]/div[1]/a[4]").click()
browser.close()
Checking the page you provided, I can't find a "CSV" ID.
Maybe try getting the button by class:
browser.find_element_by_xpath(r"//a[contains(@class, 'buttons-csv')]").click()
The element is inside an iframe, so you have to switch to it first. As the id of the frame is unique, you can switch like this:
browser.switch_to.frame("iframe")
browser.find_element_by_xpath('//span[contains(text(),"CSV")]/..').click()
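Putting it together, a sketch under the assumption that the frame's id is literally "iframe" (as in the snippet above) and that you prefer explicit waits over sleeps:
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

browser = webdriver.Safari()
browser.get('https://www.meteobridel.com/messnetz/index3.php#')

wait = WebDriverWait(browser, 10)
# Wait for the frame to load, then switch the driver's context into it
wait.until(EC.frame_to_be_available_and_switch_to_it((By.ID, 'iframe')))
# The CSV label is a <span>; the clickable element is its parent
wait.until(EC.element_to_be_clickable((By.XPATH, '//span[contains(text(),"CSV")]/..'))).click()
# Return to the top-level document afterwards
browser.switch_to.default_content()
browser.quit()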
I want to crawl reviews from IMDb using Python. It only displays 25 reviews until I click the "load more" button. I use the Python package Selenium to click the "load more" button automatically, which succeeds. But why can't I get the data after "load more", and why do I just get the first 25 reviews repeatedly?
import requests
from bs4 import BeautifulSoup
from selenium import webdriver
import time
seed = 'https://www.imdb.com/title/tt4209788/reviews'
movie_review = requests.get(seed)
PATIENCE_TIME = 60
LOAD_MORE_BUTTON_XPATH = '//*[@id="browse-itemsprimary"]/li[2]/button/span/span[2]'
driver = webdriver.Chrome('D:/chromedriver_win32/chromedriver.exe')
driver.get(seed)
while True:
    try:
        loadMoreButton = driver.find_element_by_xpath("//button[@class='ipl-load-more__button']")
        review_soup = BeautifulSoup(movie_review.text, 'html.parser')
        review_containers = review_soup.find_all('div', class_='imdb-user-review')
        print('length: ', len(review_containers))
        for review_container in review_containers:
            review_title = review_container.find('a', class_='title').text
            print(review_title)
        time.sleep(2)
        loadMoreButton.click()
        time.sleep(5)
    except Exception as e:
        print(e)
        break
print("Complete")
I want all the reviews, but now I can only get the first 25.
You have several issues in your script. A hardcoded wait is very inconsistent and certainly the worst option to use. The way you have written your scraping logic within the while True: loop slows down the parsing process by collecting the same items over and over again. Moreover, every title produces a huge line gap in the output, which needs to be stripped. I've slightly changed your script to reflect the suggestions given above.
Try this to get the required output:
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from bs4 import BeautifulSoup
URL = "https://www.imdb.com/title/tt4209788/reviews"
driver = webdriver.Chrome()
wait = WebDriverWait(driver,10)
driver.get(URL)
soup = BeautifulSoup(driver.page_source, 'lxml')
while True:
    try:
        driver.find_element_by_css_selector("button#load-more-trigger").click()
        wait.until(EC.invisibility_of_element_located((By.CSS_SELECTOR, ".ipl-load-more__load-indicator")))
        soup = BeautifulSoup(driver.page_source, 'lxml')
    except Exception:
        break

for elem in soup.find_all(class_='imdb-user-review'):
    name = elem.find(class_='title').get_text(strip=True)
    print(name)

driver.quit()
Your code is fine. Great, even. But you never fetch the 'updated' HTML for the web page after hitting the 'Load More' button. That's why you are getting the same 25 reviews listed all the time.
When you use Selenium to control the web browser, you are clicking the 'Load More' button. This creates an XHR request (more commonly called an AJAX request) that you can see in the 'Network' tab of your web browser's developer tools.
The bottom line is that JavaScript (which runs in the web browser) updates the page. But in your Python program, you only fetched the HTML for the page once, statically, using the requests library.
seed = 'https://www.imdb.com/title/tt4209788/reviews'
movie_review = requests.get(seed)  # <-- SEE HERE? This is always the same HTML. You fetched it once in the beginning.
PATIENCE_TIME = 60
To fix this problem, you need to use Selenium to get the innerHTML of the div containing the reviews, then have BeautifulSoup parse that HTML again. We want to avoid picking up the entire page's HTML again and again, because it takes computational resources to parse the updated HTML over and over.
So, find the div on the page that contains the reviews, and parse it again with BeautifulSoup. Something like this should work:
while True:
    try:
        allReviewsDiv = driver.find_element_by_xpath("//div[@class='lister-list']")
        allReviewsHTML = allReviewsDiv.get_attribute('innerHTML')
        loadMoreButton = driver.find_element_by_xpath("//button[@class='ipl-load-more__button']")
        review_soup = BeautifulSoup(allReviewsHTML, 'html.parser')
        review_containers = review_soup.find_all('div', class_='imdb-user-review')
        print('length: ', len(review_containers))
        for review_container in review_containers:
            review_title = review_container.find('a', class_='title').text
            print(review_title)
        time.sleep(2)
        loadMoreButton.click()
        time.sleep(5)
    except Exception as e:
        print(e)
        break
I am trying to write a Python script which changes the language of YouTube, so that the next time I open the browser using webdriver it shows me the YouTube page in the language I saved before. The problem is that my script is not working as expected, and I am not sure why.
Selecting the language and saving the cookies in the following code:
import pickle
import time
from selenium import webdriver

url = 'https://www.youtube.com'
print(url)
#page = requests.get(url)
#open web browser
browser = webdriver.Firefox()
#load specific url
browser.get(url)
#wait to load js
time.sleep(5)
#find language picker and click
browser.find_element_by_xpath('//*[@id="yt-picker-language-button"]').click()
#wait to open language list
time.sleep(2)
#find and click specific language
browser.find_element_by_xpath('//*[@id="yt-picker-language-footer"]/div[2]/form/div/div[1]/button[1]').click()
pickle.dump(browser.get_cookies() , open("youtubeCookies.pkl","wb"))
Loading the data from the cookies:
import pickle
import time
from selenium import webdriver

url = 'https://www.youtube.com'
print(url)
driver=webdriver.Firefox()
driver.get(url)
for cookie in pickle.load(open("youtubeCookies.pkl", "rb")):
    driver.add_cookie(cookie)
time.sleep(3)
driver.refresh()
Please guide me as to what I am doing wrong.
Thank you!