I am using Python and Selenium to scrape a website. I go to the homepage and type in a keyword, such as 1300746-79-5. On the resulting page, I am trying to scrape the data in the "pricing" section, specifically the "SKU-Pack Size" and "Price(USD)" information. But this information seems to be JavaScript-encrypted, so I cannot see it in the source code. I am wondering how I can get it.
I have written some code that gets me to the page of interest, but I still cannot see the JavaScript-loaded information. Here is what I have so far.
from selenium import webdriver
from selenium.common.exceptions import TimeoutException
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
import pprint
# Create a new instance of the Chrome driver (raw string keeps the backslashes literal)
driver = webdriver.Chrome(r'C:\Users\Rei\Desktop\chromedriver.exe')
driver.get("http://www.sigmaaldrich.com/united-states.html")
print(driver.title)
inputElement = driver.find_element_by_name("Query")
# type in the search
inputElement.send_keys("1300746-79-5")
inputElement.submit()
Everything you have done looks correct to me.
The "SKU-Pack Size" and "Price(USD)" information is not "encrypted"; it is retrieved after a JavaScript click action. All you need to do is click the product name or the pricing link.
from selenium import webdriver
from selenium.common.exceptions import TimeoutException
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
import pprint
driver = webdriver.Chrome()
driver.get("http://www.sigmaaldrich.com/united-states.html")
print(driver.title)
inputElement = driver.find_element_by_name("Query")
# type in the search
inputElement.send_keys("1300746-79-5")
inputElement.submit()
pricing_link = driver.find_element_by_css_selector("li.priceValue a")
print(pricing_link.text)
pricing_link.click()
# then deal with the data you want
price_table = WebDriverWait(driver, 10).until(
EC.presence_of_element_located((By.CSS_SELECTOR, ".priceAvailContainer tbody"))
)
print('price_table.text: ' + price_table.text)
driver.quit()
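If you want the data in a structured form rather than one text blob, you can split the table text before the driver.quit() call. A minimal sketch, assuming each row of the pricing table renders as whitespace-separated columns with the SKU first and the USD price last (that column layout is an assumption):
# sketch only: the exact column layout of the pricing table is an assumption
for row in price_table.text.splitlines():
    columns = row.split()
    if len(columns) >= 2:
        sku, price = columns[0], columns[-1]
        print(sku, price)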
Related
I am trying to fetch data from nj.58.com using Selenium. I can access the homepage and some internal links. While navigating through the links, I noticed that the website flags me as a web crawler when I visit a specific URL, even though I interact with the links like a human would.
I have built my Selenium script up to a point, but I'm stuck because the site throws an anti-bot response back at me.
Here is what I've done:
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By
from selenium.webdriver.support.wait import WebDriverWait
from selenium.webdriver.support.ui import Select
import undetected_chromedriver as uc
import time
import pandas as pd
driver = uc.Chrome()
website = 'https://nj.58.com/'
driver.get(website)
driver.implicitly_wait(4)
wait = WebDriverWait(driver, 10)
driver.maximize_window()
wait.until(EC.element_to_be_clickable((By.CSS_SELECTOR, "#commonTopbar_ipconfig > a"))).click()
city_location = wait.until(EC.element_to_be_clickable((By.CSS_SELECTOR, '#selector-search-input')))
city_location.clear()
city_location.send_keys('南京' + Keys.RETURN)
keyword = wait.until(EC.element_to_be_clickable((By.CSS_SELECTOR, '#keyword')))
keyword.clear()
keyword.send_keys('"废纸回收"')
time.sleep(2)
search_btn = wait.until(EC.element_to_be_clickable((By.CSS_SELECTOR, '#searchbtn')))
search_btn.click()
When I click search_btn, I expect to see a list of the items I'm interested in. Instead, the site identifies me as a web crawler at this point (search_btn), even though everything up to it runs as expected.
How can I bypass this anti-bot detection at the point where I click search_btn?
I am using selenium to try to scrape data from a website (https://www.mergentarchives.com/), and I am attempting to get the innerText from this element:
<div class="x-paging-info" id="ext-gen200">Displaying reports 1 - 15 of 15</div>
This is my code so far:
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.common.action_chains import ActionChains
from selenium.webdriver.support.ui import WebDriverWait
driver = webdriver.Firefox()
driver.maximize_window()
search_url = 'https://www.mergentarchives.com/search.php'
driver.get(search_url)
assert 'Mergent' in driver.title
company_name_input = '//*[@id="ext-comp-1009"]'
search_button = '//*[@id="ext-gen287"]'
driver.implicitly_wait(10)
driver.find_element_by_xpath(company_name_input).send_keys('3com corp')
driver.find_element_by_xpath(search_button).click()
driver.implicitly_wait(20)
print(driver.find_element_by_css_selector('#ext-gen200').text)
Basically I am just filling out a search form, which works, and it takes me to a search results page where the number of results is listed in a div element. When I attempt to print the text of this element, I simply get a blank space; nothing is written and there is no error.
[Finished in 21.1s]
What am I doing wrong?
I think you may need an explicit wait:
wait = WebDriverWait(driver, 10)
info = wait.until(EC.visibility_of_element_located((By.XPATH, "//div[@class = 'x-paging-info' and @id='ext-gen200']"))).get_attribute('innerHTML')
print(info)
Imports:
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
You may need to add a check that verifies whether the search results have loaded; once they have, you can use the code below:
print(driver.find_element_by_id('ext-gen200').text)
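For instance, a minimal sketch of that check, waiting until the paging text is non-empty before reading it (the element id comes from the question):
wait = WebDriverWait(driver, 20)
# block until the paging info contains real text rather than an empty placeholder
wait.until(lambda d: d.find_element_by_id('ext-gen200').text.strip() != '')
print(driver.find_element_by_id('ext-gen200').text)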
I'm just learning how to scrape dynamic web pages using Selenium in Python. I'm currently trying to click on a link within the webpage to page forward through search results.
So far this is the code that I'm using:
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
driver = webdriver.Chrome('C:\\Users\\km13\\chromedriver.exe')
driver.get("http://www.congreso.gob.pe/pley-2016-2021")
elem = driver.find_element_by_css_selector("img[src='/Sicr/TraDocEstProc/CLProLey2016.nsf/8eac1ef603908b5105256cdf006c41b1/$Body/0.AB2?OpenElement&FieldElemFormat=gif']")
elem.click()
This is the HTML that corresponds with the element I'd like to click on:
`<img src="/Sicr/TraDocEstProc/CLProLey2016.nsf/8eac1ef603908b5105256cdf006c41b1/$Body/0.AB2?OpenElement&FieldElemFormat=gif" width="81" height="16" border="0">`
From my somewhat limited knowledge of HTML, it seems the link is actually attached to the GIF image, which is why I tried the CSS selector that targets that image. But this did not work.
Any guidance would be greatly appreciated!
Update:
I changed my code by adding the following import
from selenium.webdriver.common.by import By
And I changed the following:
elem = driver.find_element(By.CSS_SELECTOR, "img[src='/Sicr/TraDocEstProc/CLProLey2016.nsf/8eac1ef603908b5105256cdf006c41b1/$Body/0.AB2?OpenElement&FieldElemFormat=gif']")
elem.click()
Now I get an error for "no such element."
There is an iframe. You need to switch to the iframe first to access the element. Try the code below; it uses WebDriverWait to handle the dynamic element.
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
driver = webdriver.Chrome('C:\\Users\\km13\\chromedriver.exe')
driver.get("http://www.congreso.gob.pe/pley-2016-2021")
WebDriverWait(driver, 20).until(EC.frame_to_be_available_and_switch_to_it((By.NAME, 'ventana02')))
elem = WebDriverWait(driver, 10).until(EC.element_to_be_clickable(
    (By.XPATH, "//a[contains(@onclick,'A50')]/img[contains(@src,'Sicr/TraDocEstProc/CLProLey')]")))
elem.click()
EDITED
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
driver = webdriver.Chrome('C:\\Users\\km13\\chromedriver.exe')
driver.get("http://www.congreso.gob.pe/pley-2016-2021")
driver.switch_to.frame(0)
elem = WebDriverWait(driver, 10).until(EC.element_to_be_clickable((By.XPATH, "//a[contains(@onclick,'A50')]/img[contains(@src,'Sicr/TraDocEstProc/CLProLey')]")))
elem.click()
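One more detail: after you are done clicking inside the frame, switch back to the top-level document before locating anything outside it:
# return to the main document once the work inside the iframe is finished
driver.switch_to.default_content()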
I am writing a scraping script for the Upwork website and need to click through each page of job listings. Here is my Python code, which uses Selenium to crawl the site.
from bs4 import BeautifulSoup
import requests
from os.path import basename
from selenium import webdriver
import time
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.common.by import By
driver = webdriver.Chrome("./chromedriver")
driver.get("https://www.upwork.com/o/jobs/browse/c/design-creative/")
link = driver.find_element_by_link_text("Next")
while EC.elementToBeClickable(By.linkText("Next")):
wait.until(EC.element_to_be_clickable((By.linkText, "Next")))
link.click()
There are a couple of problems:
EC has no attribute elementToBeClickable. In Python you should use element_to_be_clickable
Your link is defined on the first page only, so using it on the second page will give you a StaleElementReferenceException
There is no wait variable defined in your code. I guess you mean something like
wait = WebDriverWait(driver, 10)
By has no attribute linkText. Try LINK_TEXT instead
Try the code below to get the required behavior:
from selenium.common.exceptions import TimeoutException

while True:
    try:
        WebDriverWait(driver, 10).until(EC.element_to_be_clickable((By.LINK_TEXT, "Next"))).click()
    except TimeoutException:
        break
This should allow you to click the Next button for as long as it's available.
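For completeness, a sketch of the loop with a per-page scraping step in place; the job-tile selector here is hypothetical, so substitute whatever actually matches the listings:
wait = WebDriverWait(driver, 10)
while True:
    # scrape the current page; "section.job-tile" is a hypothetical selector
    for job in driver.find_elements_by_css_selector("section.job-tile"):
        print(job.text)
    try:
        wait.until(EC.element_to_be_clickable((By.LINK_TEXT, "Next"))).click()
    except TimeoutException:
        break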
I am trying to write a script using Selenium to access Pastebin, perform a search, and print out the URL results as text. I need the visible URL results and nothing else.
<div class="gs-bidi-start-align gs-visibleUrl gs-visibleUrl-long" dir="ltr" style="word-break:break-all;">pastebin.com/VYQTSbzY</div>
Current script is:
import time
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.keys import Keys
from bs4 import BeautifulSoup
browser = webdriver.Firefox()
browser.get('http://www.pastebin.com')
search = browser.find_element_by_name('q')
search.send_keys("test")
search.send_keys(Keys.RETURN)
soup = BeautifulSoup(browser.page_source, "html.parser")
for link in soup.find_all('a'):
    print(link.get('href', None), link.get_text())
You don't actually need BeautifulSoup; Selenium itself is very powerful at locating elements:
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.keys import Keys
browser = webdriver.Firefox()
browser.get('http://www.pastebin.com')
search = browser.find_element_by_name('q')
search.send_keys("test")
search.send_keys(Keys.RETURN)
# wait for results to appear
wait = WebDriverWait(browser, 10)
results = wait.until(EC.visibility_of_element_located((By.CSS_SELECTOR, "div.gsc-resultsbox-visible")))
# grab results
for link in results.find_elements_by_css_selector("a.gs-title"):
    print(link.get_attribute("href"))
browser.close()
Prints:
http://pastebin.com/VYQTSbzY
http://pastebin.com/VYQTSbzY
http://pastebin.com/VAAQCjkj
...
http://pastebin.com/fVUejyRK
http://pastebin.com/fVUejyRK
Note the use of an Explicit Wait which helps to wait for the search results to appear.
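The output above lists each URL twice. If you only want unique links, a small de-duplication pass preserves the order and drops the repeats:
# de-duplicate while preserving the original order of results
seen = set()
for link in results.find_elements_by_css_selector("a.gs-title"):
    href = link.get_attribute("href")
    if href and href not in seen:
        seen.add(href)
        print(href)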