Error scraping Highcharts in Python with Selenium

I am trying to scrape Highcharts charts from two different websites.
I came across the execute_script answer in this Stack Overflow question:
How to scrape charts from a website with python?
It helped me scrape the first website, but when I use it on the second website it returns the following error:
line 27, in <module>
temp = driver.execute_script('return window.Highcharts.charts[0]'
selenium.common.exceptions.JavascriptException: Message: javascript error: Cannot read property '0' of undefined
The website is:
http://lumierecapital.com/#
You're supposed to click on the Performance button on the left to get the Highcharts chart.
Goal: I just want to scrape the Date and NAV per unit values from it.
As with the first website, this code should print out dicts with x and y as keys and the date and data as values, but it doesn't work for this one.
Here's the Python code:
from bs4 import BeautifulSoup
import requests
from selenium.webdriver.chrome.options import Options
from shutil import which
from selenium import webdriver
import time

chrome_options = Options()
chrome_options.add_argument("--headless")
chrome_path = which("chromedriver")

driver = webdriver.Chrome(executable_path=chrome_path, options=chrome_options)
driver.set_window_size(1366, 768)
driver.get("http://lumierecapital.com/#")

performance_button = driver.find_element_by_xpath("//a[@page='performance']")
performance_button.click()
time.sleep(7)

temp = driver.execute_script('return window.Highcharts.charts[0]'
                             '.series[0].options.data')
for item in temp:
    print(item)

You can use the re module to extract the values of the performance chart:
import re
import requests

url = 'http://lumierecapital.com/content_performance.html'
html_data = requests.get(url).text

for year, month, day, value, datex in re.findall(r"{ x:Date\.UTC\((\d+), (\d+), (\d+)\), y:([\d.]+), datex: '(.*?)' }", html_data):
    print('{:<10} {}'.format(datex, value))
Prints:
30/9/07 576.092
31/10/07 577.737
30/11/07 567.998
31/12/07 556.670
31/1/08 460.886
29/2/08 496.740
31/3/08 484.016
30/4/08 523.829
31/5/08 546.661
30/6/08 494.067
31/7/08 475.942
31/8/08 389.147
30/9/08 299.661
31/10/08 183.690
30/11/08 190.054
31/12/08 211.960
31/1/09 193.308
... and so on.
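If you want the values as data rather than printed text, the same matches can be loaded into a pandas DataFrame. A minimal follow-up sketch (my addition; pandas is an extra dependency not used in the answer above, and the date format is assumed from the printed output):

import re

import pandas as pd
import requests

url = 'http://lumierecapital.com/content_performance.html'
html_data = requests.get(url).text

# Same pattern as above; each match is (year, month, day, value, datex).
pattern = r"{ x:Date\.UTC\((\d+), (\d+), (\d+)\), y:([\d.]+), datex: '(.*?)' }"
rows = [{'Date': datex, 'NAV per unit': float(value)}
        for year, month, day, value, datex in re.findall(pattern, html_data)]

df = pd.DataFrame(rows)
# The datex strings look like 30/9/07, i.e. day/month/two-digit year (assumed).
df['Date'] = pd.to_datetime(df['Date'], format='%d/%m/%y')
print(df.head())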

Related

Data scraping from dynamic sites

I am scraping data from a dynamic site (https://www.mozzartbet.com/sr#/betting) using this code:
from selenium import webdriver
from selenium.webdriver.chrome.service import Service

s = Service(r'C:\webdrivers\chromedriver.exe')
driver = webdriver.Chrome(service=s)
driver.get('https://www.mozzartbet.com/sr#/betting')
driver.maximize_window()

results = driver.find_elements('xpath', '//*[@id="focus"]/section[1]/div/div[2]/div[2]/article/div/div[2]/div/div[1]/span[2]/span')
for result in results:
    print(result.text)
I want to scrape the quotes from all Premier League matches. This code worked properly once, but for some reason the next time I ran it the results list contained 0 elements, even though the same XPath returns what I want when I try it in the page's inspect panel (the path is in the code above).
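One common cause of an empty find_elements result on a dynamic page is that the rows have not rendered yet when the lookup runs. A hedged sketch of an explicit wait, assuming that is the case here (the XPath is the one from the question and may itself be brittle if the site's markup has changed):

from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

s = Service(r'C:\webdrivers\chromedriver.exe')
driver = webdriver.Chrome(service=s)
driver.get('https://www.mozzartbet.com/sr#/betting')
driver.maximize_window()

# Wait up to 20 seconds for at least one quote element to be present.
xpath = ('//*[@id="focus"]/section[1]/div/div[2]/div[2]/article'
         '/div/div[2]/div/div[1]/span[2]/span')
results = WebDriverWait(driver, 20).until(
    EC.presence_of_all_elements_located((By.XPATH, xpath))
)
for result in results:
    print(result.text)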

Using Python and Selenium to scrape hard-to-find web tables

I've been using Python and Selenium to scrape data from specific state health web pages and output the table to a local CSV.
I've had a lot of success on several other states using similar code. But, I have hit a state that is using what appears to be R to create dynamic dashboards that I can't really access using my normal methods.
I've spent a great deal of time combing through StackOverflow . . . I've checked to see if there's an iframe to switch to, but I'm just not seeing the data I want located in an iframe on the page.
I can find the table info easily enough using Chrome's "Inspect" feature. But, starting from the original URL, the data I need is not on that page and I can't find a source URL for the table. I've even used Fiddler to see if there's a call somewhere.
I can see the data, but I don't know where it lives, so I can't tell Selenium and BS4 where to access it.
The page is here: https://coronavirus.utah.gov/case-counts/
The page takes a while to load . . . I've had other states have this issue and Selenium could work through it.
The table I need looks like this (screenshot omitted):
Any help or suggestions would be appreciated.
Here is the code I've been using . . . it doesn't work here, but, the structure is very similar to that which has worked for other states.
import pandas as pd
from datetime import datetime
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
st = 'ut'
url = 'https://coronavirus.utah.gov/case-counts/'
timeout = 20
# Spawn the webpage using Selenium
driver = webdriver.Chrome(r'D:\Work\Python\utilities\chromedriver\chromedriver.exe')
driver.minimize_window()
driver.get(url)
# Let page load . . . it takes a while
wait = WebDriverWait(driver, timeout).until(EC.visibility_of_element_located((By.ID, "total-number-of-lab-confirmed-covid-19-cases-living-in-utah")))
# Now, scrape table
html = driver.find_element_by_id("total-number-of-lab-confirmed-covid-19-cases-living-in-utah")
soup = BeautifulSoup(html, 'lxml')
table = soup.find_all('table', id='#DataTables_Table_0')
df = pd.read_html(str(table))
exec(st + "_counts = df[0]")
tmp_str = f"{st}_counts.to_csv(r'D:\Work\Python\projects\Covid_WebScraping\output\{st}_covid_cnts_' + str(datetime.now().strftime('%Y_%m_%d_%H_%M_%S')) + '.csv'"
file_path = tmp_str + ", index=False)"
exec(file_path)
# Close the chrome web driver
driver.close()
I found another way to get the information I needed.
Thanks to Julian Stanley for letting me know about the Katalon Recorder product. That allowed me to see which iframe the table was in.
My old method of finding the element by CSS or XPath was causing a pickle error due to a locked thread. I have no clue how to deal with that . . . but it caused the entire project to just hang.
However, I was able to get the text/HTML of the table via an attribute. After that, I just read it with BS4 as usual.
import pandas as pd
from datetime import datetime
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.support.ui import WebDriverWait
st = 'ut'
url = 'https://coronavirus.utah.gov/case-counts/'
timeout = 20
# Spawn the webpage using Selenium
driver = webdriver.Chrome(r'D:\Work\Python\utilities\chromedriver\chromedriver.exe')
#driver.minimize_window()
driver.get(url)
# Let page load . . . it takes a while
wait = WebDriverWait(driver, timeout)
# Get name of frame (or use index=0)
frames = [frame.get_attribute('id') for frame in driver.find_elements_by_tag_name('iframe')]
# Switch to frame
#driver.switch_to_frame("coronavirus-dashboard")
driver.switch_to_frame(0)
# Now, scrape table
html = driver.find_element_by_css_selector('#DataTables_Table_0_wrapper').get_attribute('innerHTML')
soup = BeautifulSoup(html, 'lxml')
table = soup.find_all('table', id='DataTables_Table_0')
df = pd.read_html(str(table))
exec(st + "_counts = df[0]")
tmp_str = f"{st}_counts.to_csv(r'D:\Work\Python\projects\Covid_WebScraping\output\{st}_covid_cnts_' + str(datetime.now().strftime('%Y_%m_%d_%H_%M_%S')) + '.csv'"
file_path = tmp_str + ", index=False)"
exec(file_path)
# Close the chrome web driver
driver.close()
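A side note on the two exec() calls (this is an alternative sketch, not the author's code): the same save step can be written without assembling code in strings, for example with a small helper such as the hypothetical save_counts below.

from datetime import datetime
from pathlib import Path

import pandas as pd


def save_counts(df: pd.DataFrame, st: str, out_dir: str) -> Path:
    """Write a scraped table to <out_dir>/<st>_covid_cnts_<timestamp>.csv."""
    timestamp = datetime.now().strftime('%Y_%m_%d_%H_%M_%S')
    out_path = Path(out_dir) / f'{st}_covid_cnts_{timestamp}.csv'
    df.to_csv(out_path, index=False)
    return out_path


# e.g., replacing the exec() lines above:
# save_counts(df[0], st, r'D:\Work\Python\projects\Covid_WebScraping\output')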

Website not allowing me to extract data using Python

I am trying to extract data from a webpage (https://clinicaltrials.gov). I have built a scraper using Selenium and lxml and it is working fine. Once the first page is scraped I need to hit the next-page button, and after going to the next page I need to take that page's URL using driver.current_url and start scraping again.
The problem is that only the search results table changes while the URL stays the same, so whenever the driver takes the current URL (driver.current_url), the first page's results come back again and again.
Edited:
here is the code
import re
import time
import urllib.parse
import lxml.html
import pandas as pd
import requests
import urllib3
from lxml import etree
from lxml import html
from pandas import ExcelFile
from pandas import ExcelWriter
from selenium import webdriver
from selenium.webdriver.common.action_chains import ActionChains
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.support.wait import WebDriverWait
import selenium.webdriver.support.expected_conditions as EC

siteurl = 'https://clinicaltrials.gov/'
driver = webdriver.Chrome()
driver.get(siteurl)
WebDriverWait(driver, 5)
driver.maximize_window()

def advancesearch():
    driver.find_element_by_link_text('Advanced Search').click()
    driver.find_element_by_id('StartDateStart').send_keys('01/01/2016')
    driver.find_element_by_id('StartDateEnd').send_keys('12/30/2020')
    webdriver.ActionChains(driver).send_keys(Keys.ENTER).perform()
    time.sleep(3)

driver.find_element_by_xpath("//input[contains(@id, 'home-search-condition-query')]").send_keys('medicine')  # Give keyword here
advancesearch()
#driver.find_element_by_xpath("//div[contains(@class, 'dataTables_length')]//label//select//option[4]").click()
#time.sleep(8)

def nextbutton():
    driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
    time.sleep(5)
    driver.find_element_by_xpath("//a[contains(@class, 'paginate_button next')]").click()

def extractor():
    cur_url = driver.current_url
    read_url = requests.get(cur_url)
    souptree = html.fromstring(read_url.content)
    tburl = souptree.xpath("//table[contains(@id, 'theDataTable')]//tbody//tr//td[4]//a//@href")
    for tbu in tburl:
        allurl = []
        allurl.append(urllib.parse.urljoin(siteurl, tbu))
        for tb in allurl:
            get_url = requests.get(tb)
            get_soup = html.fromstring(get_url.content)
            pattern = re.compile(r"^\s+|\s*,\s*|\s+$")
            name = get_soup.xpath('//td[@headers="contactName"]//text()')
            phone = get_soup.xpath('//td[@headers="contactPhone"]//text()')
            mail = get_soup.xpath('//td[@headers="contactEmail"]//a//text()')
            artitle = get_soup.xpath('//td[@headers="contactEmail"]//a//@href')
            artit = ([x for x in pattern.split(str(artitle)) if x][-1])
            title = artit[:-2]
            for (names, phones, mails) in zip(name, phone, mail):
                fullname = names[9:]
                print(fullname, phones, mails, title, sep='\t')

while True:
    extractor()
    nextbutton()
You don't need to get the URL if the page has already changed.
You can start re-iterating once the page has reloaded after you click next. You can make the driver wait until an element is present (an explicit wait) or just wait (an implicit wait).
There are a number of changes I would probably make (for example, shorter, less fragile CSS selectors and bs4), but the two that stand out are:
1) You already have the data you need, so there is no requirement for a new URL. Simply use the driver's current page_source.
So the top of the extractor function becomes:
def extractor():
    souptree = html.fromstring(driver.page_source)
    tburl = souptree.xpath("//table[contains(@id, 'theDataTable')]//tbody//tr//td[4]//a//@href")
    # rest of code
2) To reduce iterations, I would set the results count to 100 at the start:
def advancesearch():
    driver.find_element_by_link_text('Advanced Search').click()
    driver.find_element_by_id('StartDateStart').send_keys('01/01/2016')
    driver.find_element_by_id('StartDateEnd').send_keys('12/30/2020')
    webdriver.ActionChains(driver).send_keys(Keys.ENTER).perform()
    time.sleep(3)
    WebDriverWait(driver, 5).until(EC.element_to_be_clickable((By.CSS_SELECTOR, "#theDataTable_length [value='100']"))).click()  # change to 100 results so less looping
then add the additional import:
from selenium.webdriver.common.by import By
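To avoid the fixed time.sleep(5) in nextbutton, one option in the spirit of the explicit wait mentioned above (a hedged sketch of mine, not tested against the site) is to wait for the old first table row to go stale after clicking next, reusing the selectors already used in the question:

from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.wait import WebDriverWait

def nextbutton():
    # Remember the current first result row, then click "next".
    old_row = driver.find_element_by_xpath("//table[contains(@id, 'theDataTable')]//tbody//tr[1]")
    driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
    driver.find_element_by_xpath("//a[contains(@class, 'paginate_button next')]").click()
    # Wait until that row has been replaced, i.e. the results table has refreshed.
    # Assumes the refresh replaces the tbody contents rather than updating them in place.
    WebDriverWait(driver, 15).until(EC.staleness_of(old_row))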

Scraping data by automatically accessing multiple pages based on href

I am having a problem automating pagination across multiple pages using the Selenium webdriver and Python. My code clicks through the pages automatically up to page 10, but after that it won't work: the pages from number 11 onward are not clicked.
import urllib.request
from bs4 import BeautifulSoup
import csv
import os
from selenium import webdriver
from selenium.webdriver.support.select import Select
from selenium.webdriver.common.keys import Keys
import time
import pandas as pd
import os

url = 'http://www.igrmaharashtra.gov.in/eASR/eASRCommon.aspx?hDistName=Buldhana'
chrome_path = r'C:/Users/User/AppData/Local/Programs/Python/Python36/Scripts/chromedriver.exe'
d = webdriver.Chrome(executable_path=chrome_path)
d.implicitly_wait(10)
d.get(url)

Select(d.find_element_by_name('ctl00$ContentPlaceHolder5$ddlTaluka')).select_by_value('7')
Select(d.find_element_by_name('ctl00$ContentPlaceHolder5$ddlVillage')).select_by_value('1464')

page = [page.get_attribute('href') for page in
        d.find_elements_by_css_selector(
            "#ctl00_ContentPlaceHolder5_grdUrbanSubZoneWiseRate [href*='Page$']")]

while True:
    pages = [page.get_attribute('href') for page in
             d.find_elements_by_css_selector(
                 "#ctl00_ContentPlaceHolder5_grdUrbanSubZoneWiseRate [href*='Page$']")]
    for script_page in pages:
        d.execute_script(script_page)
        #print(script_page)
#print(script_page)
Try using page indexes: check whether the next page link is available, click it, and continue. Try the following code.
from selenium import webdriver
from selenium.webdriver.support.select import Select  # needed for the Select() dropdown calls

url = 'http://www.igrmaharashtra.gov.in/eASR/eASRCommon.aspx?hDistName=Buldhana'
chrome_path = r'C:/Users/User/AppData/Local/Programs/Python/Python36/Scripts/chromedriver.exe'
d = webdriver.Chrome(executable_path=chrome_path)
d.implicitly_wait(10)
d.get(url)

Select(d.find_element_by_name('ctl00$ContentPlaceHolder5$ddlTaluka')).select_by_value('7')
Select(d.find_element_by_name('ctl00$ContentPlaceHolder5$ddlVillage')).select_by_value('1464')

i = 2
while True:
    if len(d.find_elements_by_css_selector("#ctl00_ContentPlaceHolder5_grdUrbanSubZoneWiseRate a[href*='Page${}']".format(i))) > 0:
        print(d.find_elements_by_css_selector("#ctl00_ContentPlaceHolder5_grdUrbanSubZoneWiseRate a[href*='Page${}']".format(i))[0].get_attribute('href'))
        d.find_elements_by_css_selector("#ctl00_ContentPlaceHolder5_grdUrbanSubZoneWiseRate a[href*='Page${}']".format(i))[0].click()
        i += 1
    else:
        break
Output (since i started from page 2):
javascript:__doPostBack('ctl00$ContentPlaceHolder5$grdUrbanSubZoneWiseRate','Page$2')
javascript:__doPostBack('ctl00$ContentPlaceHolder5$grdUrbanSubZoneWiseRate','Page$3')
javascript:__doPostBack('ctl00$ContentPlaceHolder5$grdUrbanSubZoneWiseRate','Page$4')
javascript:__doPostBack('ctl00$ContentPlaceHolder5$grdUrbanSubZoneWiseRate','Page$5')
javascript:__doPostBack('ctl00$ContentPlaceHolder5$grdUrbanSubZoneWiseRate','Page$6')
javascript:__doPostBack('ctl00$ContentPlaceHolder5$grdUrbanSubZoneWiseRate','Page$7')
javascript:__doPostBack('ctl00$ContentPlaceHolder5$grdUrbanSubZoneWiseRate','Page$8')
javascript:__doPostBack('ctl00$ContentPlaceHolder5$grdUrbanSubZoneWiseRate','Page$9')
javascript:__doPostBack('ctl00$ContentPlaceHolder5$grdUrbanSubZoneWiseRate','Page$10')
javascript:__doPostBack('ctl00$ContentPlaceHolder5$grdUrbanSubZoneWiseRate','Page$11')
javascript:__doPostBack('ctl00$ContentPlaceHolder5$grdUrbanSubZoneWiseRate','Page$12')
javascript:__doPostBack('ctl00$ContentPlaceHolder5$grdUrbanSubZoneWiseRate','Page$13')
javascript:__doPostBack('ctl00$ContentPlaceHolder5$grdUrbanSubZoneWiseRate','Page$14')
javascript:__doPostBack('ctl00$ContentPlaceHolder5$grdUrbanSubZoneWiseRate','Page$15')
javascript:__doPostBack('ctl00$ContentPlaceHolder5$grdUrbanSubZoneWiseRate','Page$16')
javascript:__doPostBack('ctl00$ContentPlaceHolder5$grdUrbanSubZoneWiseRate','Page$17')
javascript:__doPostBack('ctl00$ContentPlaceHolder5$grdUrbanSubZoneWiseRate','Page$18')
javascript:__doPostBack('ctl00$ContentPlaceHolder5$grdUrbanSubZoneWiseRate','Page$19')
javascript:__doPostBack('ctl00$ContentPlaceHolder5$grdUrbanSubZoneWiseRate','Page$20')
javascript:__doPostBack('ctl00$ContentPlaceHolder5$grdUrbanSubZoneWiseRate','Page$21')
javascript:__doPostBack('ctl00$ContentPlaceHolder5$grdUrbanSubZoneWiseRate','Page$22')
javascript:__doPostBack('ctl00$ContentPlaceHolder5$grdUrbanSubZoneWiseRate','Page$23')
javascript:__doPostBack('ctl00$ContentPlaceHolder5$grdUrbanSubZoneWiseRate','Page$24')
javascript:__doPostBack('ctl00$ContentPlaceHolder5$grdUrbanSubZoneWiseRate','Page$25')
javascript:__doPostBack('ctl00$ContentPlaceHolder5$grdUrbanSubZoneWiseRate','Page$26')
javascript:__doPostBack('ctl00$ContentPlaceHolder5$grdUrbanSubZoneWiseRate','Page$27')
javascript:__doPostBack('ctl00$ContentPlaceHolder5$grdUrbanSubZoneWiseRate','Page$28')
javascript:__doPostBack('ctl00$ContentPlaceHolder5$grdUrbanSubZoneWiseRate','Page$29')
javascript:__doPostBack('ctl00$ContentPlaceHolder5$grdUrbanSubZoneWiseRate','Page$30')
javascript:__doPostBack('ctl00$ContentPlaceHolder5$grdUrbanSubZoneWiseRate','Page$31')
javascript:__doPostBack('ctl00$ContentPlaceHolder5$grdUrbanSubZoneWiseRate','Page$32')
javascript:__doPostBack('ctl00$ContentPlaceHolder5$grdUrbanSubZoneWiseRate','Page$33')
javascript:__doPostBack('ctl00$ContentPlaceHolder5$grdUrbanSubZoneWiseRate','Page$34')
javascript:__doPostBack('ctl00$ContentPlaceHolder5$grdUrbanSubZoneWiseRate','Page$35')
javascript:__doPostBack('ctl00$ContentPlaceHolder5$grdUrbanSubZoneWiseRate','Page$36')
javascript:__doPostBack('ctl00$ContentPlaceHolder5$grdUrbanSubZoneWiseRate','Page$37')
javascript:__doPostBack('ctl00$ContentPlaceHolder5$grdUrbanSubZoneWiseRate','Page$38')
javascript:__doPostBack('ctl00$ContentPlaceHolder5$grdUrbanSubZoneWiseRate','Page$39')
javascript:__doPostBack('ctl00$ContentPlaceHolder5$grdUrbanSubZoneWiseRate','Page$40')
javascript:__doPostBack('ctl00$ContentPlaceHolder5$grdUrbanSubZoneWiseRate','Page$41')
javascript:__doPostBack('ctl00$ContentPlaceHolder5$grdUrbanSubZoneWiseRate','Page$42')
Process finished with exit code 0
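As a follow-up sketch (my addition, not part of the answer): if the goal is the table data itself rather than the pager hrefs, each page can be parsed with pandas.read_html from the driver's page_source before clicking the next link. This reuses the driver d from the code above and assumes the grid renders as a plain HTML table with the id used in the selector; the pager row may need to be dropped from each frame.

import pandas as pd

frames = []
i = 2
while True:
    # Parse the grid currently displayed (assumes it is an HTML table with this id).
    tables = pd.read_html(d.page_source,
                          attrs={'id': 'ctl00_ContentPlaceHolder5_grdUrbanSubZoneWiseRate'})
    frames.append(tables[0])
    links = d.find_elements_by_css_selector(
        "#ctl00_ContentPlaceHolder5_grdUrbanSubZoneWiseRate a[href*='Page${}']".format(i))
    if not links:
        break
    links[0].click()
    i += 1

all_rates = pd.concat(frames, ignore_index=True)
print(all_rates.shape)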

xpath copied from inspector returns wrong results

I'm using selenium webdriver configured with chrome and want to scrape the price from Yahoo Finance. An example page is: https://finance.yahoo.com/quote/0001.KL
I've opened the example page in a Chrome browser, used the inspector to navigate to where the price is highlighted on the page, and used the inspector's Copy XPath for the expression in my Python script.
import os
from lxml import html
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from fake_useragent import UserAgent

ua = UserAgent()

def yahoo_scrape_one(kl_stock_id):
    '''function to scrape yahoo finance for a single KLSE stock and print the price'''
    user_agent = ua.random
    chrome_driver = os.getcwd() + '/chromedriver'
    chrome_options = Options()
    chrome_options.add_argument(f'user-agent={user_agent}')  # f-string so the random user agent is actually applied
    chrome_options.add_argument('--headless')
    driver = webdriver.Chrome(chrome_options=chrome_options,
                              executable_path=chrome_driver)
    prefix = 'https://finance.yahoo.com/quote/'
    suffix = '.KL'
    url = prefix + kl_stock_id + suffix
    driver.get(url)
    tree = html.fromstring(driver.page_source)
    price = tree.xpath('//*[@id="quote-header-info"]/div[3]/div/div/span[1]/text()')
    print(price)

test_stock = "0001"
yahoo_scrape_one(test_stock)
The print output is:
['+0.01 (+1.41%)']
This appears to be information from the next span (change and percent change) not the price. Any insights on this curious behaviour would be appreciated. Any suggestions on an alternate approach would also give joy.
After running your actual script, I got the same "erroneous" output you were reporting. However, I then commented out the headless option and ran the driver again, inspecting the element within the actual Selenium browser instance, and found that the XPath for the element is different on the page generated by your script. Use the following line of code instead:
price = tree.xpath('//*[@id="quote-header-info"]/div[3]/div/span/text()')
This produces the correct output of ['0.36']
This is another way you can achieve the same output without hardcoding index:
price = tree.xpath("//*[@id='quote-market-notice']/../span")[0].text
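A hedged variant of the same idea (my sketch, reusing the driver created inside yahoo_scrape_one): skip the lxml step and read the element with Selenium plus an explicit wait, so the span has rendered before it is read. The sibling-based XPath is the one from the answer above.

from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

# Wait for the price span (the sibling of #quote-market-notice) to be present, then read it.
price = WebDriverWait(driver, 20).until(
    EC.presence_of_element_located((By.XPATH, "//*[@id='quote-market-notice']/../span"))
).text
print(price)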
