I am trying to get some data from a site built in React, but I cannot extract what I need. Basically, I want to get the datetime shown on the site, but my script cannot find the div.
Site URL: https://gisaid.org/phylodynamics/china-cn/
Here is my code:
from selenium import webdriver
from selenium.webdriver.chrome.options import Options as ChromeOptions
from selenium.webdriver.firefox.options import Options as FirefoxOptions
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

def config_webdriver(browser: str):
    chrome_options = ChromeOptions()
    firefox_options = FirefoxOptions()
    chrome_options.add_argument("--headless")
    firefox_options.add_argument("--headless")
    driver = webdriver.Chrome(options=chrome_options) if browser == "chrome" \
        else webdriver.Firefox(options=firefox_options)
    return driver

def get_date_from_china_phylodynamics(browser: str, url: str):
    driver = config_webdriver(browser)
    driver.get(url)
    wait_driver = WebDriverWait(driver, 20)
    try:
        element = wait_driver.until(
            EC.visibility_of_element_located(
                (By.CSS_SELECTOR,
                 "#root > div > div.mb-3.mt-2.justify-content-center.row > div")
            ))
        print(element)
    except Exception as error:
        print(error)
    driver.close()
I think you'd be better off grabbing the underlying JSON: press F12 > Network and you will see requests like these:
https://phylodynamics2.pandemicprepardness.org/charon/getDataset?prefix=SARS-CoV-2/China5
https://phylodynamics2.pandemicprepardness.org/charon/getAvailable?prefix=SARS-CoV-2/China5
You can retrieve the JSON and build objects from it; parsing HTML will break whenever the DOM changes. See for instance: https://reqbin.com/code/python/g4nr6w3u/python-parse-json-example
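For example, a minimal sketch with requests (the key names in the payload are an assumption; inspect the response in the Network tab to find where the date actually lives):

import requests

# Fetch the dataset JSON directly instead of scraping the rendered DOM.
url = ('https://phylodynamics2.pandemicprepardness.org/charon/getDataset'
       '?prefix=SARS-CoV-2/China5')
dataset = requests.get(url, timeout=30).json()

# 'meta'/'updated' is a guess at the field holding the date -- adjust
# after inspecting the actual payload.
print(dataset.get('meta', {}).get('updated'))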
Related
I'm writing a crawler for McMaster-Carr. Take the page https://www.mcmaster.com/98173A200 as an example: if I open it directly in a browser, I can view all the product data.
Because the data is loaded dynamically, I'm using Selenium + bs4.
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.common.exceptions import TimeoutException
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

if __name__ == "__main__":
    url = "https://www.mcmaster.com/98173A200"
    options = webdriver.ChromeOptions()
    options.add_argument("--enable-javascript")
    driver = webdriver.Chrome("C:/chromedriver/chromedriver.exe", options=options)
    driver.set_page_load_timeout(20)
    driver.get(url)
    soup = BeautifulSoup(driver.page_source, "html.parser")
    delay = 20
    try:
        email_input = WebDriverWait(driver, delay).until(
            EC.presence_of_element_located((By.ID, 'MainContent')))
    except TimeoutException:
        print("Timeout loading DOM!")
    print(soup)
However, when I run the code I get a login dialog, which I don't get when I open the page directly in a browser, as mentioned above.
I also tried logging in with the code below:
try:
    email_input = WebDriverWait(driver, delay).until(
        EC.presence_of_element_located((By.ID, 'Email')))
    print("Page is ready!!")
    input("Press Enter to continue...")
except TimeoutException:
    print("Loading took too much time!")

email_input.send_keys(email)
password_input = driver.find_element_by_id('Password')
password_input.send_keys(password)
login_button = driver.find_element_by_class_name("FormButton_primaryButton__1kNXY")
login_button.click()
Then it shows access restricted.
I compared the request headers in the page opened by Selenium with those in my browser and couldn't find anything wrong. I also tried other webdrivers like PhantomJS and Firefox, and got the same result.
I also tried a random user agent, using the code below:
from random_user_agent.user_agent import UserAgent
from random_user_agent.params import SoftwareName, OperatingSystem
from selenium.webdriver.chrome.options import Options

software_names = [SoftwareName.CHROME.value]
operating_systems = [OperatingSystem.WINDOWS.value, OperatingSystem.LINUX.value]
user_agent_rotator = UserAgent(software_names=software_names,
                               operating_systems=operating_systems,
                               limit=100)
user_agent = user_agent_rotator.get_random_user_agent()
chrome_options = Options()
chrome_options.add_argument('user-agent=' + user_agent)
Still the same result.
The developer tools in the page opened by Selenium showed a bunch of errors. I guess the tokenauthorization one is the key to this issue, but I don't know what I should do with it.
Any help would be appreciated!
The reason you see a login window is that you are accessing McMaster-Carr via ChromeDriver. When the server recognizes this behaviour, it requires you to sign in.
A typical login won't work unless you have been authenticated by McMaster-Carr (you need to sign an NDA; see below).
You should look into the McMaster-Carr API. With the API, you can access the database directly. However, you need to sign an NDA with McMaster-Carr before obtaining access: https://www.mcmaster.com/help/api/
I'm trying to extract data from the link below using Selenium via Python:
www.oanda.com
But I'm getting an "Unable to Locate an Element" error. In the browser console I tried this CSS selector:
document.querySelector('div.position.short-position.style-scope.position-ratios-app')
This querySelector returns the short-percentage data for the first row in the browser console (for this test), but when I use the same selector in the Python script below, it gives me an "Unable to Locate element" error or sometimes an empty string.
Please suggest a solution if there is one. I will be grateful, thanks :)
# All Imports
import time
from selenium import webdriver

# will return driver
def getDriver():
    driver = webdriver.Chrome()
    time.sleep(3)
    return driver

def getshortPercentages(driver):
    shortPercentages = []
    shortList = driver.find_elements_by_css_selector('div.position.short-position.style-scope.position-ratios-app')
    for elem in shortList:
        shortPercentages.append(elem.text)
    return shortPercentages

def getData(url):
    driver = getDriver()
    driver.get(url)
    time.sleep(5)
    # pagesource = driver.page_source
    # print("Page Source: ", pagesource)
    shortList = getshortPercentages(driver)
    print("Returned source from selector: ", shortList)

if __name__ == '__main__':
    url = "https://www.oanda.com/forex-trading/analysis/open-position-ratios"
    getData(url)
The required data is located inside an iframe, so you need to switch to the iframe before handling elements:
driver.switch_to.frame(driver.find_element_by_class_name('position-ratios-iframe'))
Also note that the data inside the iframe is dynamic, so make sure you're using an implicit/explicit wait (time.sleep(5) is IMHO not the best solution).
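Put together, a minimal sketch using explicit waits (selectors taken from your question):

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

driver = webdriver.Chrome()
driver.get('https://www.oanda.com/forex-trading/analysis/open-position-ratios')
wait = WebDriverWait(driver, 10)

# Wait for the iframe and switch into it in one step.
wait.until(EC.frame_to_be_available_and_switch_to_it(
    (By.CLASS_NAME, 'position-ratios-iframe')))

# Wait until the dynamic content inside the iframe is present.
elements = wait.until(EC.presence_of_all_elements_located(
    (By.CSS_SELECTOR, 'div.position.short-position.style-scope.position-ratios-app')))
print([elem.text for elem in elements])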
The request below finds the contest IDs for the day. I am trying to pass each ID into the driver.get URL so it will go to each individual contest URL and download each contest's CSV. I imagine you have to write a loop, but I'm not sure what that would look like with a webdriver.
import time
from selenium import webdriver
import requests
import datetime

req = requests.get('https://www.draftkings.com/lobby/getlivecontests?sport=NBA')
data = req.json()
for ids in data:
    contest = ids['id']

driver = webdriver.Chrome()  # Optional argument, if not specified will search path.
driver.get('https://www.draftkings.com/account/sitelogin/false?returnurl=%2Flobby')
time.sleep(2)  # Let DK Load!
search_box = driver.find_element_by_name('username')
search_box.send_keys('username')
search_box2 = driver.find_element_by_name('password')
search_box2.send_keys('password')
submit_button = driver.find_element_by_xpath('//*[@id="react-mobile-home"]/section/section[2]/div[3]/button/span')
submit_button.click()
time.sleep(2)  # Let Page Load, If not it will go to Account!
driver.get('https://www.draftkings.com/contest/exportfullstandingscsv/' + str(contest) + '')
Try it in the following order:
import time
from selenium import webdriver
import requests
import datetime

req = requests.get('https://www.draftkings.com/lobby/getlivecontests?sport=NBA')
data = req.json()

driver = webdriver.Chrome()  # Optional argument, if not specified will search path.
driver.get('https://www.draftkings.com/account/sitelogin/false?returnurl=%2Flobby')
time.sleep(2)  # Let DK Load!
search_box = driver.find_element_by_name('username')
search_box.send_keys('Pr0c3ss')
search_box2 = driver.find_element_by_name('password')
search_box2.send_keys('generic1!')
submit_button = driver.find_element_by_xpath('//*[@id="react-mobile-home"]/section/section[2]/div[3]/button/span')
submit_button.click()
time.sleep(2)  # Let Page Load, If not it will go to Account!

for ids in data:
    contest = ids['id']
    driver.get('https://www.draftkings.com/contest/exportfullstandingscsv/' + str(contest) + '')
You do not need to load Selenium x times to download x files. Requests and Selenium can share cookies. This means you can log in to the site with Selenium, retrieve the login details, and share them with requests or any other application. Take a moment to check out httpie (https://httpie.org/doc#sessions); it lets you manually control sessions the way requests does.
For requests look at: http://docs.python-requests.org/en/master/user/advanced/?highlight=sessions
For selenium look at: http://selenium-python.readthedocs.io/navigating.html#cookies
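A minimal sketch of that hand-off, assuming you have already logged in with Selenium:

import requests

def session_from_driver(driver):
    # Copy Selenium's auth cookies into a requests.Session.
    session = requests.Session()
    for cookie in driver.get_cookies():
        session.cookies.set(cookie['name'], cookie['value'])
    return session

# After logging in with Selenium:
# session = session_from_driver(driver)
# response = session.get(BASE_URL + contest_id)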
Looking at the webdriver block below, you can add proxies and load the browser headless or live: just comment out the headless line and it will load the browser live, which makes debugging easier and makes it easier to follow movements and changes to the site's API/HTML.
import time
from selenium import webdriver
from selenium.common.exceptions import WebDriverException
import requests
import datetime
import shutil

LOGIN = 'https://www.draftkings.com/account/sitelogin/false?returnurl=%2Flobby'
BASE_URL = 'https://www.draftkings.com/contest/exportfullstandingscsv/'
USER = ''
PASS = ''

try:
    data = requests.get('https://www.draftkings.com/lobby/getlivecontests?sport=NBA').json()
except BaseException as e:
    print(e)
    exit()

ids = [str(item['id']) for item in data]

# Webdriver block
options = webdriver.ChromeOptions()
options.add_argument('headless')
options.add_argument('window-size=800x600')
# options.add_argument('--proxy-server= IP:PORT')
# options.add_argument('--user-agent=' + USER_AGENT)
driver = webdriver.Chrome(options=options)

try:
    driver.get(LOGIN)
    driver.implicitly_wait(2)
except WebDriverException:
    exit()

def login(USER, PASS):
    '''
    Login to draftkings.
    Retrieve authentication/authorization.
    http://selenium-python.readthedocs.io/waits.html#implicit-waits
    http://selenium-python.readthedocs.io/api.html#module-selenium.common.exceptions
    '''
    search_box = driver.find_element_by_name('username')
    search_box.send_keys(USER)
    search_box2 = driver.find_element_by_name('password')
    search_box2.send_keys(PASS)
    submit_button = driver.find_element_by_xpath('//*[@id="react-mobile-home"]/section/section[2]/div[3]/button/span')
    submit_button.click()
    driver.implicitly_wait(2)
    cookies = driver.get_cookies()
    return cookies

site_cookies = login(USER, PASS)

def get_csv_files(contest_id):
    '''
    Take one contest id and download its CSV via the shared session.
    '''
    session = requests.Session()
    for cookie in site_cookies:
        session.cookies.set(cookie['name'], cookie['value'])
    try:
        _data = session.get(BASE_URL + contest_id, stream=True)
        with open(contest_id + '.csv', 'wb') as f:
            shutil.copyfileobj(_data.raw, f)
    except BaseException:
        return

# map() is lazy in Python 3, so iterate explicitly.
for contest_id in ids:
    get_csv_files(contest_id)
Will this help?

for ids in data:
    contest = ids['id']
    driver.get('https://www.draftkings.com/contest/exportfullstandingscsv/' + str(contest) + '')
Maybe it's time to decompose it a bit.
Create a few isolated functions:
0. (optional) Provide authorisation to the target URL.
1. Collect all the needed IDs (the first part of your code).
2. Export the CSV for a specific ID (the second part of your code).
3. Loop through the list of IDs and call function #2 for each.
Pass the chromedriver instance as an input argument to each of them to preserve driver state and auth cookies.
This works fine and keeps the code clear and readable; a rough skeleton is sketched below.
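A minimal sketch of that decomposition, assuming the same endpoints as above (the function bodies are placeholders to fill in from your code):

import requests
from selenium import webdriver

def authorize(driver, user, password):
    # 0. Optional: log in once; the auth cookies stay in the driver.
    ...

def collect_ids():
    # 1. First part of your code: fetch the live contest IDs.
    data = requests.get('https://www.draftkings.com/lobby/getlivecontests?sport=NBA').json()
    return [str(item['id']) for item in data]

def export_csv(driver, contest_id):
    # 2. Second part of your code: trigger one CSV export.
    driver.get('https://www.draftkings.com/contest/exportfullstandingscsv/' + contest_id)

def export_all(driver):
    # 3. Loop through the IDs, reusing the same driver (and its cookies).
    for contest_id in collect_ids():
        export_csv(driver, contest_id)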
I think you can set the URL of a contest on an <a> element in the landing page and then click on it, then repeat with the other IDs.
See my code below.
import time
import requests
from selenium import webdriver

req = requests.get('https://www.draftkings.com/lobby/getlivecontests?sport=NBA')
data = req.json()
contests = []
for ids in data:
    contests.append(ids['id'])

driver = webdriver.Chrome()  # Optional argument, if not specified will search path.
driver.get('https://www.draftkings.com/account/sitelogin/false?returnurl=%2Flobby')
time.sleep(2)  # Let DK Load!
search_box = driver.find_element_by_name('username')
search_box.send_keys('username')
search_box2 = driver.find_element_by_name('password')
search_box2.send_keys('password')
submit_button = driver.find_element_by_xpath('//*[@id="react-mobile-home"]/section/section[2]/div[3]/button/span')
submit_button.click()
time.sleep(2)  # Let Page Load, If not it will go to Account!

for id in contests:
    element = driver.find_element_by_css_selector('a')
    # Name the downloaded file after the contest id.
    script1 = "arguments[0].setAttribute('download',arguments[1]);"
    driver.execute_script(script1, element, str(id) + '.csv')
    script2 = "arguments[0].setAttribute('href',arguments[1]);"
    driver.execute_script(script2, element, 'https://www.draftkings.com/contest/exportfullstandingscsv/' + str(id))
    time.sleep(1)
    element.click()
    time.sleep(3)
Any help is appreciated in advance.
The deal is, I have been trying to scrape data from this website (https://www.mptax.mp.gov.in/mpvatweb/leftMenu.do), but direct access to it is not possible. Instead of the data I need, I get "invalid access". To reach the page, I must go to (https://www.mptax.mp.gov.in/mpvatweb/index.jsp) and then click on 'dealer search' from the dropdown menu while hovering over 'dealer information'.
I am looking for a solution in Python.
Here's something I tried. I have just started web scraping:
import requests
from bs4 import BeautifulSoup

with requests.session() as request:
    MAIN = "https://www.mptax.mp.gov.in/mpvatweb/leftMenu.do"
    INITIAL = "https://www.mptax.mp.gov.in/mpvatweb/"
    page = request.get(INITIAL)
    jsession = page.cookies["JSESSIONID"]
    print(jsession)
    print(page.headers)
    result = request.post(INITIAL, headers={"Cookie": "JSESSIONID=" + jsession + "; zoomType=0", "Referer": INITIAL})
    page1 = request.get(MAIN, headers={"Referer": INITIAL})
    soup = BeautifulSoup(page1.content, 'html.parser')
    data = soup.find_all("tr", class_="whitepapartd1")
    print(data)
The deal is, I want to scrape data about firms based on their firm name.
Thanks for telling me a way, @Arnav and @Arman; here's the final code:
from selenium import webdriver  # to work with the website
from bs4 import BeautifulSoup  # to scrape data
from selenium.webdriver.common.action_chains import ActionChains  # to initiate hovering
from selenium.webdriver.common.keys import Keys  # to input values

PROXY = "10.3.100.207:8080"  # IP:PORT or HOST:PORT
chrome_options = webdriver.ChromeOptions()
chrome_options.add_argument('--proxy-server=%s' % PROXY)

# ask for input
company_name = input("tell the company name")

# open the website
browser = webdriver.Chrome(chrome_options=chrome_options)
browser.get("https://www.mptax.mp.gov.in/mpvatweb/")

# hover to open the dropdown menu
element_to_hover_over = browser.find_element_by_css_selector("#mainsection > form:nth-child(2) > table:nth-child(1) > tbody:nth-child(1) > tr:nth-child(3) > td:nth-child(3) > a:nth-child(1)")
hover = ActionChains(browser).move_to_element(element_to_hover_over)
hover.perform()

# click on dealer search from the dropdown menu
browser.find_element_by_css_selector("#dropmenudiv > a:nth-child(1)").click()

# we are now on the leftmenu page
# click on the radio button
browser.find_element_by_css_selector("#byName").click()

# input the company name
inputElement = browser.find_element_by_css_selector("#showNameField > td:nth-child(2) > input:nth-child(1)")
inputElement.send_keys(company_name)

# submit the form
inputElement.submit()

# now we are on the dealer search page
# scrape the data
soup = BeautifulSoup(browser.page_source, "lxml")

# get the list of values we need
list = soup.find_all('td', class_="tdBlackBorder")

# check the length of 'list' and on that basis decide what to print
if len(list) != 0:
    # company name at index 9
    # TIN no. at index 10
    # registration status at index 11
    # circle name at index 15
    # store the values
    name = list[9].get_text()
    tin = list[10].get_text()
    status = list[11].get_text()
    circle = list[15].get_text()
    # make a dictionary
    Company_Details = {"TIN": tin, "Firm name": name, "Circle_Name": circle, "Registration_Status": status}
    print(Company_Details)
else:
    Company_Details = {"VAT RC No": "Not found in database"}
    print(Company_Details)

# close chrome
browser.stop_client()
browser.close()
browser.quit()
Would you mind using a browser?
You can use a browser and access the link at the XPath //*[@id="dropmenudiv"]/a[1].
You might have to download chromedriver and put it in the mentioned directory if you haven't used chromedriver before. You can also use Selenium + PhantomJS if you want to do headless browsing (without the browser opening up each time).
from selenium import webdriver

xpath = '//*[@id="dropmenudiv"]/a[1]'
browser = webdriver.Chrome('/usr/local/bin/chromedriver')
browser.set_window_size(1120, 550)
browser.get('https://www.mptax.mp.gov.in/mpvatweb')
link = browser.find_element_by_xpath(xpath)
link.click()
url = browser.current_url
Ever since the API was deprecated, it's been very hard to retrieve the Google Image Search URL using Selenium. I've scoured Stack Overflow, but most of the answers to this question are from years ago, when scraping search engines was simpler.
I'm looking for a way to return the URL of the first image in a Google search query. I've used everything in Selenium from clicks, to retrieving the innerHTML of elements, to my most recent attempt: using ActionChains to navigate to the URL of the picture and then returning the current URL.
def GoogleImager(searchterm, musedict):
    page = "http://www.google.com/"
    landing = driver.get(page)
    actions = ActionChains(driver)
    WebDriverWait(landing, '10')
    images = driver.find_element_by_link_text('Images').click()
    actions.move_to_element(images)
    searchbox = driver.find_element_by_css_selector('#lst-ib')
    WebDriverWait(searchbox, '10')
    sendsearch = searchbox.send_keys('{} "logo" {}'.format('Museum of Bad Art', 'bos') + Keys.ENTER)
    WebDriverWait(sendsearch, '10')
    logo = driver.find_element_by_xpath('//*[@id="rg_s"]/div[1]/a').click()
    WebDriverWait(logo, '10')
    logolink = driver.find_element_by_xpath('//*[@id="irc_cc"]/div[3]/div[1]/div[2]/div[2]/a')
    WebDriverWait(logolink, '10')
    actions.move_to_element(logolink).click(logolink)
    print(driver.current_url)
    return driver.current_url
I'm using this to return the first image for a museum name and city in the search.
I tried to make your code work with Google, got frustrated, and switched to Yahoo instead. I couldn't make heads or tails of your musedict access loops, so I substituted a simple dictionary for demonstration purposes:
from selenium import webdriver
from selenium.common.exceptions import NoSuchElementException
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.support.ui import WebDriverWait

museum_dictionary = {"louvre": "Paris", "prado": "Madrid"}

driver = webdriver.Firefox()

def YahooImager(searchterm):
    page = "https://images.search.yahoo.com"
    landing = driver.get(page)
    WebDriverWait(driver, 4)
    assert "Yahoo Image Search" in driver.title
    searchbox = driver.find_element_by_name("p")  # Find the query box
    city = museum_dictionary[searchterm]
    searchbox.send_keys("{} {}".format(searchterm, city) + Keys.RETURN)
    WebDriverWait(driver, 4)
    try:
        driver.find_element_by_xpath('//*[@id="resitem-0"]/a').click()
    except NoSuchElementException:
        assert 0, '//*[@id="resitem-0"]/a'
        driver.close()
    WebDriverWait(driver, 4)
    try:
        driver.find_element_by_link_text("View Image").click()
    except NoSuchElementException:
        assert 0, "View Image"
        driver.close()
    WebDriverWait(driver, 4)
    # driver.close()
    return driver.current_url

image_url = YahooImager("prado")
print(repr(image_url))
It works, but takes quite a while. (That's probably something someone who knows these libraries better could optimize -- I just wanted to see it work at all.) This example is fragile and occasionally just fails.
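Part of the fragility is that a bare WebDriverWait(driver, 4) only constructs the waiter and never actually waits; to block until an element is ready you have to chain .until() with an expected condition, for example:

from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait

# Block for up to 10 seconds until the first result link is clickable.
WebDriverWait(driver, 10).until(
    EC.element_to_be_clickable((By.XPATH, '//*[@id="resitem-0"]/a'))
).click()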