I'm trying to finish a simple script reading data from some pages. My code looks like this:
def parsePage(https):
    driver = webdriver.Chrome("path\chromedriver.exe")
    driver.get(https)
    content = driver.page_source
    soup = BeautifulSoup(content, 'html.parser')
    # All the stuff going below
Then, the function is executed about 200 times in a loop, each one for a different page.
My problem is that if one of those 200 pages fails to load, the whole script crashes. Is there a way to make the script wait for each page to load successfully and, if it doesn't load, just try again?
You can use WebDriverWait to wait until the page's document.readyState is complete (or, as in the code below, either complete or interactive):
from selenium.webdriver.support.ui import WebDriverWait
def parsePage(https):
    driver = webdriver.Chrome("path\chromedriver.exe")
    driver.get(https)
    WebDriverWait(driver, 20).until(
        lambda d: d.execute_script('return (document.readyState == "complete" || document.readyState == "interactive")'))
    content = driver.page_source
    soup = BeautifulSoup(content, 'html.parser')
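The wait above covers slow pages, but the question also asks about trying again when a page fails. A minimal sketch of a retry wrapper follows. The names load_with_retry, attempts and delay are hypothetical and not part of Selenium; load stands for any callable that raises when the page does not load, for example a function that calls driver.get(url) followed by the WebDriverWait shown above.

```python
import time


def load_with_retry(load, url, attempts=3, delay=2.0):
    """Call load(url) until it succeeds or `attempts` tries are used up.

    `load` is any callable that raises an exception when the page fails
    to load. All names here are illustrative, not part of Selenium.
    """
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    last_error = None
    for attempt in range(1, attempts + 1):
        try:
            return load(url)
        except Exception as exc:  # in real code, catch narrower exceptions
            last_error = exc
            if attempt < attempts:
                time.sleep(delay)  # brief pause before the next try
    # Every attempt failed: re-raise the last error so the caller can decide.
    raise last_error
```

In the 200-page loop, the caller could catch the final exception, log the URL, and continue with the next page instead of crashing.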
Here are the two tags I am trying to scrape: https://i.stack.imgur.com/a1sVN.png. In case you are wondering, this is the link to that page (the tags I am trying to scrape are not behind the paywall): https://www.wsj.com/articles/chinese-health-official-raises-covid-alarm-ahead-of-lunar-new-year-holiday-11672664635
Below is the Python code I am using. Does anyone know why the tags are not being stored in paragraphs?
from selenium import webdriver
from selenium.webdriver.common.by import By
url = 'https://www.wsj.com/articles/chinese-health-official-raises-covid-alarm-ahead-of-lunar-new-year-holiday-11672664635'
driver = webdriver.Chrome()
driver.get(url)
paragraphs = driver.find_elements(By.CLASS_NAME, 'css-xbvutc-Paragraph e3t0jlg0')
print(len(paragraphs)) # => prints 0
You have two problems.
1. You should wait for the page to load after you get() the webpage. You can do this with something like import time and time.sleep(10).
2. The class names of the elements you are trying to scrape change on every page load. However, the data-type='paragraph' attribute stays constant, so you can do:
paragraphs = driver.find_elements(By.XPATH, '//*[@data-type="paragraph"]') # search by XPath to find the elements with that data attribute
print(len(paragraphs))
prints: 2 after the page is loaded.
Just to add on to @Andrew Ryan's answer, you can use an explicit wait for a shorter and more dynamic waiting time.
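from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
paragraphs = WebDriverWait(driver, 10).until(
    EC.presence_of_all_elements_located((By.XPATH, '//*[@data-type="paragraph"]'))
)
print(len(paragraphs))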
My code so far: if I search for a job title on LinkedIn (for example, Cyber Analyst), it gathers the links to all the job postings on the results page.
Goal: I put these links in a list and iterate through them (this part works) to print the title of each job posting.
My code iterates through every link but does not get the job title text, which is the goal.
import time
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from webdriver_manager.chrome import ChromeDriverManager
test1=[]
options = Options()
options.headless = True
driver = webdriver.Chrome(ChromeDriverManager().install())
url = "https://www.linkedin.com/jobs/search/?currentJobId=2213597199&geoId=103644278&keywords=cyber%20analyst&location=United%20States&start=0&redirect=false"
driver.get(url)
time.sleep(2)
elements = driver.find_elements_by_class_name("result-card__full-card-link")
job_links = [e.get_attribute("href") for e in elements]
for job_link in job_links:
    test1.append(job_link) #prints all links into test1
for b in test1:
    driver.get(b)
    time.sleep(3)
    element1=driver.find_elements_by_class_name("jobs-top-card__job-title t-24")
    title=[t.get_attribute("jobs-top-card__job-title t-24") for t in element1]
    print(title)
I couldn't see the class 'jobs-top-card__job-title t-24' on the linked pages, but this gives you the job title for every href.
Change
element1=driver.find_elements_by_class_name("jobs-top-card__job-title t-24")
title=[t.get_attribute("jobs-top-card__job-title t-24") for t in element1]
to
element1=driver.find_elements_by_class_name("topcard__title")
title=[t.text for t in element1]
>>> ['Cyber Threat Intelligence Analyst']
>>> ['Jr. Python/Cyber Analyst (TS/SCI)']
>>> ['Cyber Security Analyst']
... etc.
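Every time you call driver.get(b) a new page is fetched, so its HTML is not the same as the HTML of the page from driver.get(url). I think the class "jobs-top-card__job-title t-24" belongs to the HTML of the search page loaded by driver.get(url), but as I said, that page is no longer loaded once driver.get(b) has been fetched. In addition, get_attribute() expects an attribute name such as "href", not a class name, which is why the replacement reads t.text instead.
Also, each page fetched by driver.get(b) has the same structure, so element1=driver.find_elements_by_class_name("topcard__title") will work for all of them.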
E.g. this is one of the pages fetched by driver.get(b):
This is where topcard__title is
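A side note on the original lookup: "jobs-top-card__job-title t-24" is two class names separated by a space, and Selenium's class-name locator expects a single class name. When an element really does carry several classes, a CSS selector can require all of them. A small sketch, where classes_to_css is a hypothetical helper and the class names are assumed not to need CSS escaping:

```python
def classes_to_css(class_attr):
    """Turn a space-separated class attribute value into a CSS selector.

    "a b" becomes ".a.b", which matches elements carrying both classes.
    Hypothetical helper for illustration; assumes plain class names
    with no characters that need CSS escaping.
    """
    names = class_attr.split()
    if not names:
        raise ValueError("no class names given")
    return "".join("." + name for name in names)


selector = classes_to_css("jobs-top-card__job-title t-24")
print(selector)  # .jobs-top-card__job-title.t-24
```

The resulting string can be passed to driver.find_elements_by_css_selector(selector). This only helps when the classes actually exist on the page, which the answer above found was not the case here.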
Why can't I get the data loaded after clicking the "load more" button with Selenium?
I want to crawl reviews from IMDb using Python. The page displays only 25 reviews until I click the "load more" button. I use the Python package Selenium to click the button automatically, which works. But why can't I get the data that appears after "load more"? I just get the first 25 reviews repeatedly.
import requests
from bs4 import BeautifulSoup
from selenium import webdriver
import time
seed = 'https://www.imdb.com/title/tt4209788/reviews'
movie_review = requests.get(seed)
PATIENCE_TIME = 60
LOAD_MORE_BUTTON_XPATH = '//*[@id="browse-itemsprimary"]/li[2]/button/span/span[2]'
driver = webdriver.Chrome('D:/chromedriver_win32/chromedriver.exe')
driver.get(seed)
while True:
    try:
        loadMoreButton = driver.find_element_by_xpath("//button[@class='ipl-load-more__button']")
        review_soup = BeautifulSoup(movie_review.text, 'html.parser')
        review_containers = review_soup.find_all('div', class_ ='imdb-user-review')
        print('length: ',len(review_containers))
        for review_container in review_containers:
            review_title = review_container.find('a', class_ = 'title').text
            print(review_title)
        time.sleep(2)
        loadMoreButton.click()
        time.sleep(5)
    except Exception as e:
        print(e)
        break
print("Complete")
I want all the reviews, but now I can only get the first 25.
You have several issues in your script. Hardcoded waits are unreliable and are the worst option to rely on. Because your scraping logic sits inside the while True: loop, it slows down parsing by collecting the same items over and over again. Moreover, every title is surrounded by whitespace that produces large gaps in the output and needs to be stripped. I've slightly changed your script to reflect these suggestions.
Try this to get the required output:
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from bs4 import BeautifulSoup
URL = "https://www.imdb.com/title/tt4209788/reviews"
driver = webdriver.Chrome()
wait = WebDriverWait(driver,10)
driver.get(URL)
soup = BeautifulSoup(driver.page_source, 'lxml')
while True:
    try:
        driver.find_element_by_css_selector("button#load-more-trigger").click()
        wait.until(EC.invisibility_of_element_located((By.CSS_SELECTOR,".ipl-load-more__load-indicator")))
        soup = BeautifulSoup(driver.page_source, 'lxml')
    except Exception:
        break
for elem in soup.find_all(class_='imdb-user-review'):
    name = elem.find(class_='title').get_text(strip=True)
    print(name)
driver.quit()
Your code is fine. Great even. But, you never fetch the 'updated' HTML for the web page after hitting the 'Load More' button. That's why you are getting the same 25 reviews listed all the time.
When you use Selenium to control the web browser, you are clicking the 'Load More' button. This creates an XHR request (or more commonly called AJAX request) that you can see in the 'Network' tab of your web browser's developer tools.
The bottom line is that JavaScript (which is run in the web browser) updates the page. But in your Python program, you only get the HTML once for the page statically using the Requests library.
seed = 'https://www.imdb.com/title/tt4209788/reviews'
movie_review = requests.get(seed) #<-- SEE HERE? This is always the same HTML. You fetched it once in the beginning.
PATIENCE_TIME = 60
To fix this problem, you need to use Selenium to get the innerHTML of the div box containing the reviews. Then, have BeautifulSoup parse the HTML again. We want to avoid picking up the entire page's HTML again and again because it takes computation resources to have to parse that updated HTML over and over again.
So, find the div on the page that contains the reviews, and parse it again with BeautifulSoup. Something like this should work:
while True:
    try:
        # Re-read the review list from the live browser on every pass
        allReviewsDiv = driver.find_element_by_xpath("//div[@class='lister-list']")
        allReviewsHTML = allReviewsDiv.get_attribute('innerHTML')
        loadMoreButton = driver.find_element_by_xpath("//button[@class='ipl-load-more__button']")
        review_soup = BeautifulSoup(allReviewsHTML, 'html.parser')
        review_containers = review_soup.find_all('div', class_ ='imdb-user-review')
        print('length: ',len(review_containers))
        for review_container in review_containers:
            review_title = review_container.find('a', class_ = 'title').text
            print(review_title)
        time.sleep(2)
        loadMoreButton.click()
        time.sleep(5)
    except Exception as e:
        print(e)
        break
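Because the loop above re-parses the whole review list on every pass, each title is printed again after every click. One way to handle only the newly loaded entries is to remember how many have already been processed. This is a sketch under the assumption that new reviews are always appended to the end of the list; take_new and seen are hypothetical names, and the titles are simulated rather than scraped.

```python
def take_new(items, seen):
    """Return the items not yet processed and the updated count.

    Assumes the page only ever appends to the end of the list, so
    everything past index `seen` is new. Names are illustrative.
    """
    new_items = items[seen:]
    return new_items, seen + len(new_items)


# Simulated titles as they might look after successive "load more" clicks.
snapshots = [["a", "b"], ["a", "b", "c"], ["a", "b", "c", "d", "e"]]
seen = 0
printed = []
for titles in snapshots:
    fresh, seen = take_new(titles, seen)
    printed.extend(fresh)
print(printed)  # ['a', 'b', 'c', 'd', 'e']
```

In the scraping loop, review_containers would play the role of titles, and only fresh would be printed on each pass.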
I'm writing a Python crawler using the Selenium library and the PhantomJS browser. I trigger a click event on a page to open a new page and then read browser.page_source, but I get the source of the original page instead of the newly opened one. How can I get the source of the newly opened page?
Here's my code:
import requests
from selenium import webdriver
url = 'https://sf.taobao.com/list/50025969__2__%D5%E3%BD%AD.htm?auction_start_seg=-1&page=150'
browser = webdriver.PhantomJS(executable_path='C:\\ProgramData\\phantomjs-2.1.1-windows\\bin\\phantomjs.exe')
browser.get(url)
browser.find_element_by_xpath("//*[@class='pai-item pai-status-done']").click()
html = browser.page_source
print(html)
browser.quit()
You need to switch to the new window first
browser.find_element_by_xpath("//*[@class='pai-item pai-status-done']").click()
browser.switch_to_window(browser.window_handles[-1])
html = browser.page_source
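Indexing window_handles[-1] assumes the newest window is last in the list. A slightly more defensive sketch records the handles before the click and picks whichever handle is new afterwards. find_new_handle is a hypothetical helper, not a Selenium function. Note also that recent Selenium releases spell the switch as browser.switch_to.window(handle).

```python
def find_new_handle(before, after):
    """Return the single handle present in `after` but not in `before`.

    `before` and `after` are lists of window handles, e.g. copies of
    driver.window_handles taken before and after the click.
    Returns None if no new window appeared.
    """
    known = set(before)
    new = [h for h in after if h not in known]
    if len(new) > 1:
        raise ValueError("more than one new window: %r" % new)
    return new[0] if new else None
```

Usage would be along the lines of: copy list(browser.window_handles) into before, perform the click, call find_new_handle(before, browser.window_handles), and switch to the result if it is not None before reading page_source.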
I believe you need to add a wait before getting the page source.
I've used an implicit wait in the code below.
from selenium import webdriver
url = 'https://sf.taobao.com/list/50025969__2__%D5%E3%BD%AD.htm?auction_start_seg=-1&page=150'
browser = webdriver.PhantomJS(executable_path='C:\\ProgramData\\phantomjs-2.1.1-windows\\bin\\phantomjs.exe')
browser.get(url)
browser.find_element_by_xpath("//*[@class='pai-item pai-status-done']").click()
browser.implicitly_wait(5)
html = browser.page_source
browser.quit()
It is better to use an explicit wait, but that requires a condition such as EC.element_to_be_clickable((By.ID, 'someid')).
I want to retrieve all visible content of a web page. Let's say, for example, this webpage. I am using a headless Firefox browser remotely with Selenium.
The script I am using looks like this
driver = webdriver.Remote('http://0.0.0.0:xxxx/wd/hub', desired_capabilities)
driver.get(url)
dom = BeautifulSoup(driver.page_source, parser)
f = dom.find('iframe', id='dsq-app1')
driver.switch_to_frame('dsq-app1')
s = driver.page_source
f.replace_with(BeautifulSoup(s, 'html.parser'))
with open('out.html', 'w') as fe:
    fe.write(dom.encode('utf-8'))
This is supposed to load the page, parse the DOM, and then replace the iframe with id dsq-app1 with its visible content. If I execute those commands one by one in my Python command line, it works as expected, and I can see the paragraphs with all the visible content. When I instead execute all the commands at once, either by running the script or by pasting the whole snippet into my interpreter, it behaves differently. The paragraphs are missing; the content still exists in JSON format, but that is not what I want.
Any idea why this may be happening? Something to do with replace_with, maybe?
It sounds like the DOM elements are not yet loaded when your code tries to reach them.
Try waiting for the elements to be fully loaded and only then do the replacement.
This works for you when you run it command by command because the driver then has time to load all the elements before you execute the next command.
To add to Or Duan's answer, here is what I ended up doing. Determining whether a page, or part of a page, has loaded completely is an intricate problem. I tried implicit and explicit waits, but I still ended up receiving half-loaded frames. My workaround is to check the readyState of the original document and the readyState of the iframes.
Here is a sample function
def _check_if_load_complete(driver, timeout=10):
    elapsed_time = 1
    while True:
        if (driver.execute_script('return document.readyState') == 'complete' or
                elapsed_time == timeout):
            break
        else:
            sleep(0.0001)
            elapsed_time += 1
then I used that function right after I changed the focus of the driver to the iframe
driver.switch_to_frame('dsq-app1')
_check_if_load_complete(driver, timeout=10)
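Note that in the function above, elapsed_time counts loop iterations rather than seconds, so with sleep(0.0001) a timeout of 10 gives up after about ten very short polls. A variant that honours a timeout in seconds could look like the sketch below. wait_until, predicate and poll are hypothetical names, and the readyState check is passed in as a callable so the helper does not depend on Selenium.

```python
import time


def wait_until(predicate, timeout=10.0, poll=0.5):
    """Poll predicate() until it is truthy or `timeout` seconds have passed.

    Returns True on success and False on timeout. Illustrative helper,
    not part of Selenium.
    """
    deadline = time.monotonic() + timeout
    while True:
        if predicate():
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(poll)  # wait a little before checking again
```

After switching to the iframe, it could be called as wait_until(lambda: driver.execute_script('return document.readyState') == 'complete'), with the return value telling the caller whether the frame finished loading in time.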
Try getting the page source only after the required element has been detected by ID, CSS selector, class, or link text.
You can use Selenium WebDriver's explicit wait for this.
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By
driver = webdriver.Remote('http://0.0.0.0:xxxx/wd/hub', desired_capabilities)
driver.get(url)
f = WebDriverWait(driver, 10).until(EC.presence_of_element_located((By.ID, idName)))
# 10 is the maximum time in seconds the script will wait for the given id
# idName is the id to wait for; provide it yourself
dom = BeautifulSoup(driver.page_source, parser)
f = dom.find('iframe', id='dsq-app1')
driver.switch_to_frame('dsq-app1')
s = driver.page_source
f.replace_with(BeautifulSoup(s, 'html.parser'))
with open('out.html', 'w') as fe:
    fe.write(dom.encode('utf-8'))
Let me know if this does not work.