Using Python and Selenium to scrape hard-to-find web tables

I've been using Python and Selenium to scrape data from specific state health web pages and output the table to a local CSV.
I've had a lot of success on several other states using similar code. But, I have hit a state that is using what appears to be R to create dynamic dashboards that I can't really access using my normal methods.
I've spent a great deal of time combing through StackOverflow . . . I've checked to see if there's an iframe to switch to, but, I'm just not seeing the data I want located in the iframe on the page.
I can find the table info easily enough using Chrome's "Inspect" feature. But, starting from the original URL, the data I need is not on that page and I can't find a source URL for the table. I've even used Fiddler to see if there's a call somewhere.
So, I'm not sure what to do. I can see the data--but, I don't know where it is to tell Selenium and BS4 where to access it.
The page is here: https://coronavirus.utah.gov/case-counts/
The page takes a while to load . . . I've had other states have this issue and Selenium could work through it.
The table I need looks like this:
Any help or suggestions would be appreciated.
Here is the code I've been using . . . it doesn't work here, but, the structure is very similar to that which has worked for other states.
import pandas as pd
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
st = 'ut'
url = 'https://coronavirus.utah.gov/case-counts/'
timeout = 20
# Spawn the webpage using Selenium
driver = webdriver.Chrome(r'D:\Work\Python\utilities\chromedriver\chromedriver.exe')
driver.minimize_window()
driver.get(url)
# Let page load . . . it takes a while
wait = WebDriverWait(driver, timeout).until(EC.visibility_of_element_located()((By.ID, "total-number-of-lab-confirmed-covid-19-cases-living-in-utah")))
# Now, scrape table
html = driver.find_element_by_id("total-number-of-lab-confirmed-covid-19-cases-living-in-utah")
soup = BeautifulSoup(html, 'lxml')
table = soup.find_all('table', id='#DataTables_Table_0')
df = pd.read_html(str(table))
exec(st + "_counts = df[0]")
tmp_str = f"{st}_counts.to_csv(r'D:\Work\Python\projects\Covid_WebScraping\output\{st}_covid_cnts_' + str(datetime.now().strftime('%Y_%m_%d_%H_%M_%S')) + '.csv'"
file_path = tmp_str + ", index=False)"
exec(file_path)
# Close the chrome web driver
driver.close()

I found another way to get the information I needed.
Thanks to Julian Stanley for letting me know about the Katalon Recorder product. That allowed me to see what the iframe was where the table was.
Using my old method of finding an element by CSS or XPATH was causing a Pickle error due to locked thread. I have no clue how to deal with that . . . but, it caused the entire project to just hang.
But, I was able to get the text/HTML of the table via attribute. After that, I just read it with BS4 as usual.
import pandas as pd
from datetime import datetime
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.support.ui import WebDriverWait
st = 'ut'
url = 'https://coronavirus.utah.gov/case-counts/'
timeout = 20
# Spawn the webpage using Selenium
driver = webdriver.Chrome(r'D:\Work\Python\utilities\chromedriver\chromedriver.exe')
#driver.minimize_window()
driver.get(url)
# Create a wait object for the slow page.
# Note: nothing is actually waited for until wait.until(...) is called.
wait = WebDriverWait(driver, timeout)
# Get name of frame (or use index=0)
frames = [frame.get_attribute('id') for frame in driver.find_elements_by_tag_name('iframe')]
# Switch to frame
#driver.switch_to_frame("coronavirus-dashboard")
driver.switch_to_frame(0)
# Now, scrape table
html = driver.find_element_by_css_selector('#DataTables_Table_0_wrapper').get_attribute('innerHTML')
soup = BeautifulSoup(html, 'lxml')
table = soup.find_all('table', id='DataTables_Table_0')
df = pd.read_html(str(table))
exec(st + "_counts = df[0]")
tmp_str = f"{st}_counts.to_csv(r'D:\Work\Python\projects\Covid_WebScraping\output\{st}_covid_cnts_' + str(datetime.now().strftime('%Y_%m_%d_%H_%M_%S')) + '.csv'"
file_path = tmp_str + ", index=False)"
exec(file_path)
# Close the chrome web driver
driver.close()
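The two exec calls in the script above only build a variable name and a file path from st. A minimal sketch of the same step without exec follows. The helper build_output_path and the out_dir argument are hypothetical names used for illustration, not part of the original code.

```python
from datetime import datetime
from pathlib import Path


def build_output_path(state, out_dir, now=None):
    """Return a timestamped CSV path for one state.

    build_output_path and out_dir are hypothetical names for illustration.
    """
    now = now or datetime.now()
    stamp = now.strftime('%Y_%m_%d_%H_%M_%S')
    return Path(out_dir) / f'{state}_covid_cnts_{stamp}.csv'


# Keep per-state tables in a dict instead of creating variables with exec,
# e.g. counts['ut'] = df[0]
counts = {}

path = build_output_path('ut', 'output', now=datetime(2020, 5, 1, 13, 45, 0))
print(path.name)  # ut_covid_cnts_2020_05_01_13_45_00.csv
```

With the DataFrame stored as counts[st], the write would then be counts[st].to_csv(path, index=False).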

Related

How To Scrape Content With Load More Pages Using Selenium Python

I need to scrape the titles for all blog post articles via a Load More button as set by my desired range for i in range(1,3):
At present I'm only able to capture the titles for the first page even though i'm able to navigate to the next page using selenium.
Any help would be much appreciated.
from bs4 import BeautifulSoup
import pandas as pd
import requests
import time
# Selenium Routine
from requests_html import HTMLSession
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.common.action_chains import ActionChains
# Removes SSL Issues With Chrome
options = webdriver.ChromeOptions()
options.add_argument('--ignore-certificate-errors')
options.add_argument('--ignore-ssl-errors')
options.add_argument('--ignore-certificate-errors-spki-list')
options.add_argument('log-level=3')
options.add_argument('--disable-notifications')
#options.add_argument('--headless') # Comment to view browser actions
# Get website url
urls = "https://jooble.org/blog/"
r = requests.get(urls)
driver = webdriver.Chrome(executable_path="C:\webdrivers\chromedriver.exe",options=options)
driver.get(urls)
productlist = []
for i in range(1,3):
    # Get Page Information
    soup = BeautifulSoup(r.content, features='lxml')
    items = soup.find_all('div', class_ = 'post')
    print(f'LOOP: start [{len(items)}]')
    for single_item in items:
        title = single_item.find('div', class_ = 'front__news-title').text.strip()
        print('Title:', title)
        product = {
            'Title': title,
        }
        productlist.append(product)
    print()
    time.sleep(5)
    WebDriverWait(driver, 40).until(EC.element_to_be_clickable((By.XPATH,"//button[normalize-space()='Show more']"))).send_keys(Keys.ENTER)
driver.close()
# Save Results
df = pd.DataFrame(productlist)
df.to_csv('Results.csv', index=False)

Selenium's overhead is not needed in this case, because you can use requests directly to get the question-specific data via the API.
Check the network tab in your browser's devtools while you click the button, and you will find the URL that is requested to load more content. Iterate over the pages by setting the parameter &page={i}.
Example
import requests
import pandas as pd
from bs4 import BeautifulSoup
data = []
for i in range (1,3):
    url = f'https://jooble.org/blog/wp-admin/admin-ajax.php?id=&post_id=0&slug=home&canonical_url=https%3A%2F%2Fjooble.org%2Fblog%2F&posts_per_page=6&page={i}&offset=20&post_type=post&repeater=default&seo_start_page=1&preloaded=false&preloaded_amount=0&lang=en&order=DESC&orderby=date&action=alm_get_posts&query_type=standard'
    r=requests.get(url)
    if r.status_code != 200:
        print(f'Error occurred: {r.status_code} on url: {url}')
    else:
        # The JSON response carries the new posts as an HTML fragment
        soup = BeautifulSoup(str(r.json()['html']))
        for e in soup.select('.type-post'):
            data.append({
                'title':e.select_one('.front__news-title').get_text(strip=True),
                'description':e.select_one('.front__news-description').get_text(strip=True),
                'url':e.a.get('href')
            })
pd.DataFrame(data)
Output
  | title | description | url
0 | How To Become A Copywriter | If you have a flair for writing, you might consider leveraging your talents to earn some dough by working as a copywriter. The primary aim of a copywriter is to… | https://jooble.org/blog/how-to-become-a-copywriter/
1 | How to Find a Job in 48 Hours | A job search might sound scary for many people. However, it doesn't have to be challenging, long, or overwhelming. With Jooble, it is possible to find the best employment opportunities… | https://jooble.org/blog/how-to-find-a-job-in-48-hours/
2 | 17 Popular Jobs That Involve Working With Animals | If you are interested in caring for or helping animals, you can build a successful career in this field. The main thing is to find the right way. Working with… | https://jooble.org/blog/17-popular-jobs-that-involve-working-with-animals/
3 | How to Identify Phishing and Email Scam | What Phishing and Email Scam Are Cybercrime is prospering, and more and more internet users are afflicted daily. The best example of an online scam is the phishing approach -… | https://jooble.org/blog/how-to-identify-phishing-and-email-scam/
4 | What To Do After Getting Fired | For many people, thoughts of getting fired tend to be spine-chilling. No wonder, since it means your everyday life gets upside down in minutes. Who would like to go through… | https://jooble.org/blog/what-to-do-after-getting-fired/
5 | A mobile application for a job search in 69 countries has appeared | Jooble, a job search site in 69 countries, has launched the Jooble Job Search mobile app for iOS and Android. It will help the searcher view vacancies more conveniently and… | https://jooble.org/blog/a-mobile-application-for-a-job-search-in-69-countries-has-appeared/
...
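The long query string in the example is easier to adjust when it is built from a dictionary. The sketch below uses urllib.parse.urlencode from the standard library. The parameter names and values are copied from the URL in the answer, while build_page_url is a hypothetical helper name.

```python
from urllib.parse import urlencode

BASE = 'https://jooble.org/blog/wp-admin/admin-ajax.php'

# Parameters copied from the request URL shown in the answer.
PARAMS = {
    'id': '',
    'post_id': 0,
    'slug': 'home',
    'canonical_url': 'https://jooble.org/blog/',
    'posts_per_page': 6,
    'offset': 20,
    'post_type': 'post',
    'repeater': 'default',
    'seo_start_page': 1,
    'preloaded': 'false',
    'preloaded_amount': 0,
    'lang': 'en',
    'order': 'DESC',
    'orderby': 'date',
    'action': 'alm_get_posts',
    'query_type': 'standard',
}


def build_page_url(page):
    """Return the AJAX URL for one page of posts (hypothetical helper)."""
    return f"{BASE}?{urlencode({**PARAMS, 'page': page})}"


for i in range(1, 3):
    print(build_page_url(i))
```

Only the page value changes between iterations, so values such as posts_per_page can be tuned in one place. Passing the same dictionary as requests.get(BASE, params=...) should produce an equivalent request.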

Gathering data from table using Pandas and Beautiful Soup after logging in using Selenium

I'm trying to scrape data from a paginated table. The table can only be accessed by logging in to a user account. I've decided to approach this using Selenium to log in. I then hope to be able to read this into a Pandas DataFrame. I plan on using BeautifulSoup as a go between.
Here is my code:
from selenium import webdriver
import time
import pandas as pd
from bs4 import BeautifulSoup
url = "https://www.example.com/userarea"
driver = webdriver.Chrome()
time.sleep(6)
driver.get(url)
time.sleep(6)
username = driver.find_element_by_id("user")
username.clear()
username.send_keys("xyz@email.com")
password = driver.find_element_by_id("password")
password.clear()
password.send_keys('password')
driver.find_element_by_xpath('//button[]').click()
driver.find_element_by_xpath('//button[text()="Log in"]').click()
time.sleep(6)
driver.find_element_by_xpath('//span[text()="Text"]').click()
driver.find_element_by_xpath('//span[text()="Text"]').click()
html = driver.page_source
soup = BeautifulSoup(html,'html.parser')
try:
    tables = soup.find_all('th')
    print(tables) #Returns an empty list
    df = pd.read_html(str(tables))
    df.head()
except:
    driver.close()
driver.close()
Unfortunately, this is only printing an empty list. I've tried using lxml too but no joy.
Using the inspection tools it does seem that there aren't any table tags, so I tried to find all <th> tags instead (which definitely are present). Again no joy. I've not yet tried to work through the individual pages. I only mention the pagination in case it offers a clue to the issue.
Any idea what I'm missing?

Thank you to those who offered suggestions. In the end furas' suggestion was the right one: the script was running too quickly. I paused Python for 6 seconds after clicking through to the page that contains the table. The table seems to be populated by JavaScript, and I can now see the values pop into place as the script works through the pagination.
import time
#Navigate to page, then let it load using:
time.sleep(6)
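A fixed sleep is either longer than needed or too short on a slow day. The sketch below polls a condition until it holds or a timeout expires, which is the idea behind Selenium's explicit waits. The names wait_until, condition, timeout and poll are hypothetical and not part of Selenium, and table_ready is only a stand-in for a real page check.

```python
import time


def wait_until(condition, timeout=10.0, poll=0.5):
    """Call condition() until it returns a truthy value or timeout seconds pass.

    Returns the truthy value, or raises TimeoutError.
    """
    deadline = time.monotonic() + timeout
    while True:
        result = condition()
        if result:
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError(f'condition not met within {timeout} s')
        time.sleep(poll)


# Stand-in for "are the <th> cells on the page yet?": succeeds on the third call.
calls = {'n': 0}


def table_ready():
    calls['n'] += 1
    return calls['n'] >= 3


wait_until(table_ready, timeout=5, poll=0.01)
print(calls['n'])  # 3
```

With a driver, the condition could be a lambda that looks up the th elements, since an empty result list is falsy until the cells exist.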

Why am I getting inconsistent results from web scraping?

I'm having issues scraping data from a website. The issue might be with Visual Studio Code, I am using the "Code Runner" extension. This is my first time using Beautiful Soup and Selenium so the issue might also be with my code. I started last Friday and after some difficulty came up with a solution on Saturday. My code is:
import requests
from bs4 import BeautifulSoup, SoupStrainer
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
parcelID = 1014100000 #this is a random parcelID I grabbed from the site
url = 'https://www.manateepao.com/parcel/?parid={}'.format(parcelID)
driver = webdriver.Chrome()
driver.get(url)
html = driver.execute_script("return document.documentElement.outerHTML")
#was getting encoding error with print(html). replaced character that was giving me trouble
newHTML = html.replace(u"\u2715", "*")
soupFilter = SoupStrainer('div', {'id': 'ownerContentScollContainer'})
soup = BeautifulSoup(newHTML, 'html.parser', parse_only=soupFilter)
webparcelID = soup.find_all('b')
lColumn = soup.find_all('div', {'class' : 'col-sm-2 m-0 p-0 text-sm-right'})
rColumn = soup.find_all('div', {'class' : 'col-sm m-0 p-0 ml-2'})
parcel_Dict = {}
for i in range(len(lColumn)):
    parcel_Dict[i] = {lColumn[i].string: rColumn[i].string}
#This is to test if I got any results or not
print(parcel_Dict)
driver.close()
driver.quit()
What I am hoping to find each time I scrape a page is:
The Parcel ID. This is in its own bold, b, tag.
The Ownership and Mailing Address. The Ownership should always be at parcel_Dict[1] and the mailing address should always be at parcel_Dict[3].
I run the code and sometimes I get a result, and other times I get an empty dictionary.
Thank you for any help you can provide.

I solved my own issue by adding the following lines of code
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
wait = WebDriverWait(driver, 10)
wait.until(EC.presence_of_element_located((By.XPATH, "//div[@id='ownerContentScollContainer']")))
I waited until the ownerContentScollContainer element was present in the DOM before executing the rest of the code.
This post and this post helped me figure out where I might be going wrong. I used this tutorial to figure out how to use the appropriate Xpath.

how to scrape data from shopee using beautiful soup

I'm a student currently learning BeautifulSoup, and my lecturer asked me to scrape data from Shopee. However, I cannot scrape the details of the products. I'm trying to scrape data from https://shopee.com.my/shop/13377506/search?page=0&sortBy=sales. I only want the name and price of the products. Can someone tell me why I cannot scrape the data using BeautifulSoup?
Here is my code:
from requests import get
from bs4 import BeautifulSoup
url = "https://shopee.com.my/shop/13377506/search?page=0&sortBy=sales"
response= get (url)
soup=BeautifulSoup(response.text,'html.parser')
print (soup)

This question is a bit tricky for Python beginners because it needs a combination of Selenium (for browser automation) and BeautifulSoup (for HTML data extraction). The difficulty is that the product listing is built by JavaScript after the page loads, so it is absent from the raw HTML. We can tell because nothing is printed when the response is parsed with BeautifulSoup alone, for example:
for item_n in soup.find_all('div', class_='_1NoI8_ _16BAGk'):
    print(item_n.get_text())
Therefore, to extract data from a page whose DOM is built by a script, we have to use Selenium to drive a real browser, which runs the JavaScript. We also have to give the page time to render before reading it. WebDriverWait from the Selenium library helps with this.
I now present snippets of code that explain the process.
First, import the requisite libraries
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.support.ui import WebDriverWait
from selenium.common.exceptions import TimeoutException
from time import sleep
Next, initialize the settings for the browser. I'm using Chrome.
# create object for chrome options
chrome_options = Options()
base_url = 'https://shopee.com.my/shop/13377506/search?page=0&sortBy=sales'
# set chrome driver options to disable any popup's from the website
# to find local path for chrome profile, open chrome browser
# and in the address bar type, "chrome://version"
chrome_options.add_argument('disable-notifications')
chrome_options.add_argument('--disable-infobars')
chrome_options.add_argument('start-maximized')
chrome_options.add_argument('user-data-dir=C:\\Users\\username\\AppData\\Local\\Google\\Chrome\\User Data\\Default')
# To disable the message, "Chrome is being controlled by automated test software"
chrome_options.add_argument("disable-infobars")
# Pass the argument 1 to allow and 2 to block
chrome_options.add_experimental_option("prefs", {
"profile.default_content_setting_values.notifications": 2
})
# invoke the webdriver
browser = webdriver.Chrome(executable_path = r'C:/Users/username/Documents/playground_python/chromedriver.exe',
options = chrome_options)
browser.get(base_url)
delay = 5 # seconds
Next, I declare empty list variables to hold the data.
# declare empty lists
item_cost, item_init_cost, item_loc = [],[],[]
item_name, items_sold, discount_percent = [], [], []
while True:
    try:
        # Note: WebDriverWait only waits when .until(...) is called on it;
        # the sleep below is what actually gives the page time to render.
        WebDriverWait(browser, delay)
        print ("Page is ready")
        sleep(5)
        html = browser.execute_script("return document.getElementsByTagName('html')[0].innerHTML")
        #print(html)
        soup = BeautifulSoup(html, "html.parser")
        # find_all() returns a list of elements.
        # Loop over them and call get_text() on each one you need.
        for item_n in soup.find_all('div', class_='_1NoI8_ _16BAGk'):
            print(item_n.get_text())
            item_name.append(item_n.text)
        # find the price of items
        for item_c in soup.find_all('span', class_='_341bF0'):
            print(item_c.get_text())
            item_cost.append(item_c.text)
        # find initial item cost
        for item_ic in soup.find_all('div', class_ = '_1w9jLI QbH7Ig U90Nhh'):
            print(item_ic.get_text())
            item_init_cost.append(item_ic.text)
        # find total number of items sold/month
        for items_s in soup.find_all('div',class_ = '_18SLBt'):
            print(items_s.get_text())
            items_sold.append(items_s.text)
        # find item discount percent
        for dp in soup.find_all('span', class_ = 'percent'):
            print(dp.get_text())
            discount_percent.append(dp.text)
        # find item location
        for il in soup.find_all('div', class_ = '_3amru2'):
            print(il.get_text())
            item_loc.append(il.text)
        break # leave the loop once the page has been parsed
    except TimeoutException:
        print ("Loading took too much time!-Try again")
Thereafter, I use the zip function to combine the different list items.
rows = zip(item_name, item_init_cost,discount_percent,item_cost,items_sold,item_loc)
Finally, I write this data to disc,
import csv
newFilePath = 'shopee_item_list.csv'
# newline="" stops the csv module from writing blank rows on Windows
with open(newFilePath, "w", newline="") as f:
    writer = csv.writer(f)
    for row in rows:
        writer.writerow(row)
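One caveat with zip is that it stops at the shortest input. If the scraped lists differ in length (an assumption here, for instance when some items carry no discount badge), rows are dropped silently. The sketch below uses itertools.zip_longest with made-up sample values and writes to an in-memory buffer instead of a file.

```python
import csv
import io
from itertools import zip_longest

# Made-up sample data: three names but only two discounts.
item_name = ['Item A', 'Item B', 'Item C']
item_cost = ['29.90', '21.90', '16.48']
discount_percent = ['27%', '18%']

# zip would yield 2 rows here; zip_longest keeps all 3 and pads with ''.
rows = list(zip_longest(item_name, item_cost, discount_percent, fillvalue=''))

buffer = io.StringIO()
writer = csv.writer(buffer, lineterminator='\n')
writer.writerow(['name', 'cost', 'discount'])
writer.writerows(rows)
print(buffer.getvalue())
```

Padding keeps every row but does not guarantee that each value lines up with the right item. Reading the fields of one product card at a time is likely to be more robust.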
As good practice, it's wise to close the browser once the task is complete:
# close the automated browser
browser.close()
Result
Nestle MILO Activ-Go Chocolate Malt Powder (2kg)
NESCAFE GOLD Refill (170g)
Nestle MILO Activ-Go Chocolate Malt Powder (1kg)
MAGGI Hot Cup - Asam Asam Laksa (60g)
MAGGI 2-Minit Curry (79g x 5 Packs x 2)
MAGGI PAZZTA Cheese Macaroni 70g
.......
29.90
21.90
16.48
1.69
8.50
3.15
5.90
.......
RM40.70
RM26.76
RM21.40
RM1.80
RM9.62
........
9k sold/month
2.3k sold/month
1.8k sold/month
1.7k sold/month
.................
27%
18%
23%
6%
.............
Selangor
Selangor
Selangor
Selangor
Note to the readers
The OP brought to my attention that the selectors given in my answer were not working. I checked the website again after 2 days and noticed something strange: the class attribute values of the div elements had changed. I found a similar Q, but it did not help much. So for now, I conclude that the class names on the Shopee website can change again. I leave this as an open problem to solve later.
Note to the OP
Ana, the above code will work for just one page, i.e., only for the webpage https://shopee.com.my/shop/13377506/search?page=0&sortBy=sales. I invite you to further enhance your skills by working out how to scrape data for multiple pages under the sales tag. Your hint is the 1/9 seen at the top right of this page and/or the 1 2 3 4 5 links at the bottom of the page. Another hint is to look at urljoin in the urllib.parse module (urlparse in Python 2). I hope this gets you started.
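One possible way to act on the pagination hint is to rewrite the page query parameter with urllib.parse. This is only a sketch: with_page is a hypothetical helper, and n_pages = 9 is an assumption based on the 1/9 indicator, with pages numbered from 0 as in the URL above.

```python
from urllib.parse import urlsplit, urlunsplit, parse_qs, urlencode


def with_page(url, page):
    """Return url with its 'page' query parameter set to page (hypothetical helper)."""
    parts = urlsplit(url)
    query = parse_qs(parts.query)
    query['page'] = [str(page)]
    return urlunsplit(parts._replace(query=urlencode(query, doseq=True)))


base_url = 'https://shopee.com.my/shop/13377506/search?page=0&sortBy=sales'
# Assumption: 9 pages, taken from the "1/9" indicator on the page.
n_pages = 9
page_urls = [with_page(base_url, p) for p in range(n_pages)]
print(page_urls[1])  # https://shopee.com.my/shop/13377506/search?page=1&sortBy=sales
```

Each URL in page_urls can then go through the same load, wait and parse steps as the single page.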
Helpful resources
XPATH tutorial

The page content is loaded asynchronously via AJAX after the initial request, so fetching the page source with a single request does not return the data you want.
You should drive a real browser instead. Then you can get the rendered source and parse it with BeautifulSoup. See the code:
BeautifulSoup way
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from bs4 import BeautifulSoup
driver = webdriver.Chrome()  # assumes chromedriver is on PATH
driver.get("https://shopee.com.my/shop/13377506/search?page=0&sortBy=sales")
# Wait until the search results container exists in the DOM
WebDriverWait(driver, 30).until(EC.presence_of_element_located((By.CSS_SELECTOR, '.shop-search-result-view')))
html = driver.page_source
soup = BeautifulSoup(html, 'html.parser')
search = soup.select_one('.shop-search-result-view')
products = search.find_all('a')
for p in products:
    name = p.select('div[data-sqe="name"] > div')[0].get_text()
    price = p.select('div > div:nth-child(2) > div:nth-child(2)')[0].get_text()
    product = p.select('div > div:nth-child(2) > div:nth-child(4)')[0].get_text()
    print('name: ' + name)
    print('price: ' + price)
    print('product: ' + product + '\n')
However, using selenium is a good approach to get everything you want. See the example below:
Selenium Way
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
driver = webdriver.Chrome()  # assumes chromedriver is on PATH
driver.get("https://shopee.com.my/shop/13377506/search?page=0&sortBy=sales")
WebDriverWait(driver, 30).until(EC.presence_of_element_located((By.CSS_SELECTOR, '.shop-search-result-view')))
# find_element(By..., ...) works in both Selenium 3 and 4
search = driver.find_element(By.CSS_SELECTOR, '.shop-search-result-view')
products = search.find_elements(By.CSS_SELECTOR, 'a')
for p in products:
    name = p.find_element(By.CSS_SELECTOR, 'div[data-sqe="name"] > div').text
    price = p.find_element(By.CSS_SELECTOR, 'div > div:nth-child(2) > div:nth-child(2)').text
    product = p.find_element(By.CSS_SELECTOR, 'div > div:nth-child(2) > div:nth-child(4)').text
    print('name: ' + name)
    print('price: ' + price.replace('\n', ' | '))
    print('product: ' + product + '\n')

please post your code so we can help.
or you can start like this.. :)
from bs4 import BeautifulSoup as soup
from urllib.request import urlopen as uReg
my_url = "<url>"
uClient = uReg(my_url)
page_html = uClient.read()

LXML XPATH - Data returned from one site and not another

I'm just learning python and decided to play with some website scraping.
I created 1 that works, and a second, almost identical as far as I can tell, that doesn't work, and I can't figure out why.
from lxml import html
import requests
page = requests.get('https://thronesdb.com/set/Core')
tree = html.fromstring(page.content)
cards = [tree.xpath('//a[@class = "card-tip"]/text()'),tree.xpath('//td[@data-th = "Faction"]/text()'),
         tree.xpath('//td[@data-th = "Cost"]/text()'),tree.xpath('//td[@data-th = "Type"]/text()'),
         tree.xpath('//td[@data-th = "STR"]/text()'),tree.xpath('//td[@data-th = "Traits"]/text()'),
         tree.xpath('//td[@data-th = "Set"]/text()'),tree.xpath('//a[@class = "card-tip"]/@data-code')]
print(cards)
That one does what I expect (I know it's not pretty). It grabs certain elements from a table on the site.
This one returns [[]]:
from lxml import html
import requests
page = requests.get('http://www.redflagdeals.com/search/#!/q=baby%20monitor')
tree = html.fromstring(page.content)
offers = [tree.xpath('//a[@class = "offer_title"]/text()')]
print(offers)
What I expect it to do is give me a list that has the text from each offer_title element on the page.
The xpath I'm gunning at I grabbed from Firebug, which is:
/html/body/div[1]/div/div/div/section/div[2]/ul[1]/li[2]/div/h3/a
Here's the actual string from the site:
Angelcare Digital Video And Sound Monitor - $89.99 ($90.00 Off)
I have also read a few other questions, but they didn't answer how this could work the first way, but not the second. Can't post them because of the link restrictions on new accounts.
Titles:
Python - Unable to Retrieve Data From Webpage Table Using Beautiful
Soup or lxml xpath
Python lxml xpath no output
Trouble with scraping text from site using lxml / xpath()
Any help would be appreciated. I did some reading on the lxml website about xpath, but I may be missing something in the way I'm building a query.
Thanks!

The first script works because the required data is already present in the HTML that the server returns. On the second page the data is generated dynamically by JavaScript, so you cannot scrape it this way: requests only downloads the initial HTML and does not execute scripts.
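A related clue is in the URL itself. The search terms sit after the # (the fragment), which an HTTP client does not send to the server, so requests.get receives the same bare /search/ page whatever the query is. A small sketch with the standard library that splits the two URLs from the question:

```python
from urllib.parse import urlsplit

for url in ('https://thronesdb.com/set/Core',
            'http://www.redflagdeals.com/search/#!/q=baby%20monitor'):
    parts = urlsplit(url)
    # The fragment (everything after '#') is not part of the HTTP request.
    print(f'path={parts.path!r} fragment={parts.fragment!r}')
```

The first URL has an empty fragment, while the second carries the whole query in it, which suggests that script in the browser reads the fragment and fetches the results itself.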
You can try to use, for example, Selenium + PhantomJS to get required data as below:
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait as wait
driver = webdriver.PhantomJS(executable_path='/path/to/phantomJS')
driver.get('http://www.redflagdeals.com/search/#!/q=baby%20monitor')
xpath = '//a[@class = "offer_title"]'
wait(driver, 10).until(EC.element_to_be_clickable((By.XPATH, xpath)))
offers = [link.get_attribute('textContent') for link in driver.find_elements_by_xpath(xpath)]
