Data scraping from a dynamic site - Python

I am scraping data from a dynamic site (https://www.mozzartbet.com/sr#/betting) using this code:
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
s = Service(r'C:\webdrivers\chromedriver.exe')
driver = webdriver.Chrome(service=s)
driver.get('https://www.mozzartbet.com/sr#/betting')
driver.maximize_window()
results = driver.find_elements('xpath', '//*[@id="focus"]/section[1]/div/div[2]/div[2]/article/div/div[2]/div/div[1]/span[2]/span')
for result in results:
    print(result.text)
I want to scrape the odds for all matches in the Premier League. This code worked properly once, but for some reason, the next time I ran it, the results list was empty (it contained 0 elements), even though the same XPath returns what I want when I try it in the browser's inspect panel; the path is in the code above.
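A likely cause is that the page renders its odds asynchronously, so the elements do not yet exist when find_elements runs. A minimal sketch using an explicit wait instead of reading the list immediately (the XPath is the one from the question; the 20-second timeout is an arbitrary choice):

from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

s = Service(r'C:\webdrivers\chromedriver.exe')
driver = webdriver.Chrome(service=s)
driver.get('https://www.mozzartbet.com/sr#/betting')
driver.maximize_window()

xpath = '//*[@id="focus"]/section[1]/div/div[2]/div[2]/article/div/div[2]/div/div[1]/span[2]/span'
# Block until at least one matching element is in the DOM, then read the list
WebDriverWait(driver, 20).until(EC.presence_of_all_elements_located((By.XPATH, xpath)))
results = driver.find_elements(By.XPATH, xpath)
for result in results:
    print(result.text)

Absolute XPaths copied from the inspector are also brittle on dynamic pages; if the site reshuffles its markup between visits, the same path can legitimately match nothing.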

How to scrape price from Udemy?

Note: I didn't find a relevant working solution in any of these similar questions:
How to find price from udemy website with web scraping?
Scraping Data From Udemy, AngularJs Site Using PHP
How to GET promotional price using Udemy API?
My problem is how to scrape course prices from Udemy using Python and Selenium.
This is the link:
https://www.udemy.com/courses/development/?p=1
My attempt is below.
import time
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from webdriver_manager.chrome import ChromeDriverManager

options = Options()
driver = webdriver.Chrome(ChromeDriverManager().install())
url = "https://www.udemy.com/courses/development/?p=1"
driver.get(url)
time.sleep(2)
# data = driver.find_element('//div[@class="price-text--price-part"]')
# data = driver.find_element_by_xpath('//div[contains(@class, "price-text--price-part"]')
# data = driver.find_element_by_css_selector('div.udlite-sr-only[attrName="price-text--price-part"]')
print(data)
None of them worked for me. So, is there a way to select elements by a class that contains specific text?
In this example, the text to find is: "price-text--price-part"
The first xpath doesn't highlight any element in the DOM.
The second XPath is missing the closing bracket for contains:
//div[contains(@class, "price-text--price-part"]
should be
//div[contains(@class, "price-text--price-part")]
Try something like the below; it might work. (When I tried it, the website detected me as a bot and the price was not loaded.)
driver.get("https://www.udemy.com/courses/development/?p=1")
options = driver.find_elements_by_xpath("//div[contains(#class,'course-list--container')]/div[contains(#class,'popper')]")
for opt in options:
title = opt.find_element_by_xpath(".//div[contains(#class,'title')]").text # Use a dot in the xpath to find element within in an element.
price = opt.find_element_by_xpath(".//div[contains(#class,'price-text--price-part')]/span[2]/span").text
print(f"{title}: {price}")

Using Python and Selenium to scrape hard-to-find web tables

I've been using Python and Selenium to scrape data from specific state health web pages and output the table to a local CSV.
I've had a lot of success on several other states using similar code. But, I have hit a state that is using what appears to be R to create dynamic dashboards that I can't really access using my normal methods.
I've spent a great deal of time combing through StackOverflow . . . I've checked to see if there's an iframe to switch to, but I'm just not seeing the data I want in any iframe on the page.
I can find the table info easily enough using Chrome's "Inspect" feature. But, starting from the original URL, the data I need is not on that page and I can't find a source URL for the table. I've even used Fiddler to see if there's a call somewhere.
So, I'm not sure what to do. I can see the data, but I don't know where it lives in order to tell Selenium and BS4 where to access it.
The page is here: https://coronavirus.utah.gov/case-counts/
The page takes a while to load . . . I've had this issue with other states, and Selenium could work through it.
The table I need looks like this: [screenshot of the table omitted]
Any help or suggestions would be appreciated.
Here is the code I've been using . . . it doesn't work here, but the structure is very similar to what has worked for other states.
import pandas as pd
from datetime import datetime
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
st = 'ut'
url = 'https://coronavirus.utah.gov/case-counts/'
timeout = 20
# Spawn the webpage using Selenium
driver = webdriver.Chrome(r'D:\Work\Python\utilities\chromedriver\chromedriver.exe')
driver.minimize_window()
driver.get(url)
# Let page load . . . it takes a while
wait = WebDriverWait(driver, timeout).until(EC.visibility_of_element_located((By.ID, "total-number-of-lab-confirmed-covid-19-cases-living-in-utah")))
# Now, scrape table
html = driver.find_element_by_id("total-number-of-lab-confirmed-covid-19-cases-living-in-utah").get_attribute('innerHTML')
soup = BeautifulSoup(html, 'lxml')
table = soup.find_all('table', id='DataTables_Table_0')
df = pd.read_html(str(table))
exec(st + "_counts = df[0]")
tmp_str = f"{st}_counts.to_csv(r'D:\Work\Python\projects\Covid_WebScraping\output\{st}_covid_cnts_' + str(datetime.now().strftime('%Y_%m_%d_%H_%M_%S')) + '.csv'"
file_path = tmp_str + ", index=False)"
exec(file_path)
# Close the chrome web driver
driver.close()
I found another way to get the information I needed.
Thanks to Julian Stanley for letting me know about the Katalon Recorder product. That allowed me to see which iframe the table was in.
Using my old method of finding an element by CSS or XPath was causing a pickle error due to a locked thread. I have no clue how to deal with that . . . but it caused the entire project to just hang.
But, I was able to get the text/HTML of the table via attribute. After that, I just read it with BS4 as usual.
import pandas as pd
from datetime import datetime
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.support.ui import WebDriverWait
st = 'ut'
url = 'https://coronavirus.utah.gov/case-counts/'
timeout = 20
# Spawn the webpage using Selenium
driver = webdriver.Chrome(r'D:\Work\Python\utilities\chromedriver\chromedriver.exe')
#driver.minimize_window()
driver.get(url)
# Let page load . . . it takes a while
wait = WebDriverWait(driver, timeout)
# Get name of frame (or use index=0)
frames = [frame.get_attribute('id') for frame in driver.find_elements_by_tag_name('iframe')]
# Switch to frame
# driver.switch_to.frame("coronavirus-dashboard")
driver.switch_to.frame(0)
# Now, scrape table
html = driver.find_element_by_css_selector('#DataTables_Table_0_wrapper').get_attribute('innerHTML')
soup = BeautifulSoup(html, 'lxml')
table = soup.find_all('table', id='DataTables_Table_0')
df = pd.read_html(str(table))
exec(st + "_counts = df[0]")
tmp_str = f"{st}_counts.to_csv(r'D:\Work\Python\projects\Covid_WebScraping\output\{st}_covid_cnts_' + str(datetime.now().strftime('%Y_%m_%d_%H_%M_%S')) + '.csv'"
file_path = tmp_str + ", index=False)"
exec(file_path)
# Close the chrome web driver
driver.close()
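As an aside, the exec gymnastics for dynamic variable names can be avoided with a plain dict and ordinary string formatting; a minimal sketch under the same directory layout as the question:

from datetime import datetime

counts = {}  # state abbreviation -> DataFrame
counts[st] = df[0]
timestamp = datetime.now().strftime('%Y_%m_%d_%H_%M_%S')
out_path = rf"D:\Work\Python\projects\Covid_WebScraping\output\{st}_covid_cnts_{timestamp}.csv"
counts[st].to_csv(out_path, index=False)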

Error scraping Highcharts in Python with selenium

I am trying to scrape Highcharts data from two different websites.
I came across this execute_script answer in this Stack Overflow question:
How to scrape charts from a website with python?
It helped me scrape from the first website, but when I use it on the second website it returns the following error:
line 27, in <module>
    temp = driver.execute_script('return window.Highcharts.charts[0]'
selenium.common.exceptions.JavascriptException: Message: javascript error: Cannot read property '0' of undefined
The website is:
http://lumierecapital.com/#
You're supposed to click the Performance button on the left to get the highchart.
Goal: I just want to scrape the Date and NAV per unit values from it.
As on the first website, this code should have printed out dicts with x and y as keys and the dates and data as values, but it doesn't work for this one.
Here's the Python code:
from shutil import which
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
import time

chrome_options = Options()
chrome_options.add_argument("--headless")
chrome_path = which("chromedriver")
driver = webdriver.Chrome(executable_path=chrome_path, options=chrome_options)
driver.set_window_size(1366, 768)
driver.get("http://lumierecapital.com/#")
performance_button = driver.find_element_by_xpath("//a[@page='performance']")
performance_button.click()
time.sleep(7)
temp = driver.execute_script('return window.Highcharts.charts[0]'
                             '.series[0].options.data')
for item in temp:
    print(item)
You can use the re module to extract the values of the performance chart:
import re
import requests
url = 'http://lumierecapital.com/content_performance.html'
html_data = requests.get(url).text
for year, month, day, value, datex in re.findall(r"{ x:Date\.UTC\((\d+), (\d+), (\d+)\), y:([\d.]+), datex: '(.*?)' }", html_data):
    print('{:<10} {}'.format(datex, value))
Prints:
30/9/07 576.092
31/10/07 577.737
30/11/07 567.998
31/12/07 556.670
31/1/08 460.886
29/2/08 496.740
31/3/08 484.016
30/4/08 523.829
31/5/08 546.661
30/6/08 494.067
31/7/08 475.942
31/8/08 389.147
30/9/08 299.661
31/10/08 183.690
30/11/08 190.054
31/12/08 211.960
31/1/09 193.308
... and so on.
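For what it's worth, the JavascriptException from the original script most likely means window.Highcharts was simply not defined in the document the driver was looking at when execute_script ran (on this site the chart lives in separately loaded content). A defensive sketch that polls for the chart before reading it; the timeout and poll interval are arbitrary choices:

import time

def wait_for_highcharts(driver, timeout=10):
    # Poll until window.Highcharts exists and holds at least one chart
    deadline = time.time() + timeout
    while time.time() < deadline:
        ready = driver.execute_script(
            'return typeof window.Highcharts !== "undefined" '
            '&& window.Highcharts.charts.length > 0')
        if ready:
            return True
        time.sleep(0.5)
    return False

if wait_for_highcharts(driver):
    temp = driver.execute_script('return window.Highcharts.charts[0].series[0].options.data')
    for item in temp:
        print(item)
else:
    print('Highcharts never appeared in this window/frame')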

Clicking multiple items on one page using selenium

My main goal is to go to this specific website, click each of the products, allow enough time to scrape the data from the clicked product, then go back and click another product on the page, until all the products have been clicked through and scraped (I have not included the scraping code).
My code opens Chrome, navigates to the desired website, and generates a list of links to click by class name. This is the part I am stuck on: I believe I need a for loop to iterate through the list of links, clicking each one and going back to the original page. But I can't figure out why this won't work.
Here is my code:
import csv
import time
from selenium import webdriver
import selenium.webdriver.chrome.service as service
import requests
from bs4 import BeautifulSoup
url = "https://www.vatainc.com/infusion/adult-infusion.html?limit=all"
service = service.Service('path to chromedriver')
service.start()
capabilities = {'chrome.binary': 'path to chrome'}
driver = webdriver.Remote(service.service_url, capabilities)
driver.get(url)
time.sleep(2)
links = driver.find_elements_by_class_name('product-name')
for link in links:
    link.click()
    driver.back()
    link.click()
I have another solution to your problem.
When I tested your code it showed strange behaviour. I fixed all the problems I had by using XPath.
url = "https://www.vatainc.com/infusion/adult-infusion.html?limit=all"
driver.get(url)
links = [x.get_attribute('href') for x in driver.find_elements_by_xpath("//*[contains(#class, 'product-name')]/a")]
htmls = []
for link in links:
driver.get(link)
htmls.append(driver.page_source)
Instead of going back and forth, I saved all the links (in links) and iterated over that list.
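The likely reason the original loop fails is that driver.back() triggers a page load, which invalidates the WebElements found earlier and raises a StaleElementReferenceException on the next click. If clicking through is preferred over collecting hrefs, a sketch that re-finds the links on every iteration sidesteps the stale references:

links = driver.find_elements_by_class_name('product-name')
for i in range(len(links)):
    # Re-locate after every page load; the previous references are stale
    links = driver.find_elements_by_class_name('product-name')
    links[i].click()
    # ... scrape the product page here ...
    driver.back()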

xpath copied from inspector returns wrong results

I'm using selenium webdriver configured with chrome and want to scrape the price from Yahoo Finance. An example page is: https://finance.yahoo.com/quote/0001.KL
I've opened the example page in a Chrome browser, used the inspector to navigate to where the price is highlighted on the page, and used the inspector's Copy XPath feature to get an XPath for my Python script.
import os
from lxml import html
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from fake_useragent import UserAgent
ua = UserAgent()
def yahoo_scrape_one(kl_stock_id):
    '''Scrape Yahoo Finance for a single KLSE stock and print the price.'''
    user_agent = ua.random
    chrome_driver = os.getcwd() + '/chromedriver'
    chrome_options = Options()
    chrome_options.add_argument(f'user-agent={user_agent}')
    chrome_options.add_argument('--headless')
    driver = webdriver.Chrome(chrome_options=chrome_options,
                              executable_path=chrome_driver)
    prefix = 'https://finance.yahoo.com/quote/'
    suffix = '.KL'
    url = prefix + kl_stock_id + suffix
    driver.get(url)
    tree = html.fromstring(driver.page_source)
    price = tree.xpath('//*[@id="quote-header-info"]/div[3]/div/div/span[1]/text()')
    print(price)
test_stock = "0001"
yahoo_scrape_one(test_stock)
The print output is:
['+0.01 (+1.41%)']
This appears to be the information from the next span (the change and percent change), not the price. Any insight into this curious behaviour would be appreciated. Any suggestions for an alternate approach would also give joy.
After running your actual script, I got the same "erroneous" output you reported. However, I then commented out the headless option, ran the driver again, and inspected the element within the actual Selenium browser instance. I found that the XPath for the element is different on the page generated by your script. Use the following line of code instead:
price = tree.xpath('//*[@id="quote-header-info"]/div[3]/div/span/text()')
This produces the correct output of ['0.36']
This is another way you can achieve the same output without hardcoding the index:
price = tree.xpath("//*[@id='quote-market-notice']/../span")[0].text
