How To Scrape Content With Load More Pages Using Selenium Python

I need to scrape the titles of all blog post articles loaded via a Load More button, for as many pages as set by my range for i in range(1,3):.
At present I'm only able to capture the titles from the first page, even though I'm able to navigate to the next page using Selenium.
Any help would be much appreciated.
from bs4 import BeautifulSoup
import pandas as pd
import requests
import time

# Selenium Routine
from requests_html import HTMLSession
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.common.action_chains import ActionChains

# Removes SSL Issues With Chrome
options = webdriver.ChromeOptions()
options.add_argument('--ignore-certificate-errors')
options.add_argument('--ignore-ssl-errors')
options.add_argument('--ignore-certificate-errors-spki-list')
options.add_argument('log-level=3')
options.add_argument('--disable-notifications')
#options.add_argument('--headless') # Comment to view browser actions

# Get website url
urls = "https://jooble.org/blog/"
r = requests.get(urls)

driver = webdriver.Chrome(executable_path="C:\webdrivers\chromedriver.exe", options=options)
driver.get(urls)

productlist = []

for i in range(1, 3):
    # Get Page Information
    soup = BeautifulSoup(r.content, features='lxml')
    items = soup.find_all('div', class_='post')
    print(f'LOOP: start [{len(items)}]')

    for single_item in items:
        title = single_item.find('div', class_='front__news-title').text.strip()
        print('Title:', title)

        product = {
            'Title': title,
        }
        productlist.append(product)
    print()
    time.sleep(5)
    WebDriverWait(driver, 40).until(EC.element_to_be_clickable((By.XPATH, "//button[normalize-space()='Show more']"))).send_keys(Keys.ENTER)

driver.close()

# Save Results
df = pd.DataFrame(productlist)
df.to_csv('Results.csv', index=False)
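One detail worth noting: the loop above re-parses r.content, the static response from the initial requests.get() call, so the soup never changes no matter how many times the button is clicked. A minimal sketch of parsing driver.page_source instead, reusing the imports and driver set up above and assuming the same markup (since "Show more" appends posts to the same page, it clicks first and parses once at the end):

# Sketch: click "Show more" the desired number of times, then parse the live
# Selenium DOM once, because each click appends more posts to the same page.
driver.get(urls)
for i in range(1, 3):
    WebDriverWait(driver, 40).until(
        EC.element_to_be_clickable((By.XPATH, "//button[normalize-space()='Show more']"))
    ).click()
    time.sleep(5)  # give the newly appended posts time to render

soup = BeautifulSoup(driver.page_source, features='lxml')  # live DOM, not r.content
productlist = [
    {'Title': item.find('div', class_='front__news-title').text.strip()}
    for item in soup.find_all('div', class_='post')
]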

You do not need the Selenium overhead in this case, because you can use requests directly to get the data in question via the site's API.
Check the network tab in your browser's dev tools when you click the button; you will see the URL that is requested to load more content. Iterate over it and set the parameter value &page={i}.
Example
import requests
import pandas as pd
from bs4 import BeautifulSoup

data = []

for i in range(1, 3):
    url = f'https://jooble.org/blog/wp-admin/admin-ajax.php?id=&post_id=0&slug=home&canonical_url=https%3A%2F%2Fjooble.org%2Fblog%2F&posts_per_page=6&page={i}&offset=20&post_type=post&repeater=default&seo_start_page=1&preloaded=false&preloaded_amount=0&lang=en&order=DESC&orderby=date&action=alm_get_posts&query_type=standard'
    r = requests.get(url)

    if r.status_code != 200:
        print(f'Error occurred: {r.status_code} on url: {url}')
    else:
        soup = BeautifulSoup(r.json()['html'], 'html.parser')
        for e in soup.select('.type-post'):
            data.append({
                'title': e.select_one('.front__news-title').get_text(strip=True),
                'description': e.select_one('.front__news-description').get_text(strip=True),
                'url': e.a.get('href')
            })

pd.DataFrame(data)
Output

    title | description | url
0   How To Become A Copywriter | If you have a flair for writing, you might consider leveraging your talents to earn some dough by working as a copywriter. The primary aim of a copywriter is to… | https://jooble.org/blog/how-to-become-a-copywriter/
1   How to Find a Job in 48 Hours | A job search might sound scary for many people. However, it doesn't have to be challenging, long, or overwhelming. With Jooble, it is possible to find the best employment opportunities… | https://jooble.org/blog/how-to-find-a-job-in-48-hours/
2   17 Popular Jobs That Involve Working With Animals | If you are interested in caring for or helping animals, you can build a successful career in this field. The main thing is to find the right way. Working with… | https://jooble.org/blog/17-popular-jobs-that-involve-working-with-animals/
3   How to Identify Phishing and Email Scam | What Phishing and Email Scam Are Cybercrime is prospering, and more and more internet users are afflicted daily. The best example of an online scam is the phishing approach -… | https://jooble.org/blog/how-to-identify-phishing-and-email-scam/
4   What To Do After Getting Fired | For many people, thoughts of getting fired tend to be spine-chilling. No wonder, since it means your everyday life gets upside down in minutes. Who would like to go through… | https://jooble.org/blog/what-to-do-after-getting-fired/
5   A mobile application for a job search in 69 countries has appeared | Jooble, a job search site in 69 countries, has launched the Jooble Job Search mobile app for iOS and Android. It will help the searcher view vacancies more conveniently and… | https://jooble.org/blog/a-mobile-application-for-a-job-search-in-69-countries-has-appeared/
...

Related

Python log in website once for multiple site scraping

I want to scrape data from a website.
The Python code below works fine for one ID, but it logs in once per ID. So if I want to download data for 100 IDs, the code will need to log in 100 times.
Is there any way to log in just one time, followed by a scraping loop that downloads data for multiple IDs?
from selenium.webdriver.common.keys import Keys
from selenium import webdriver
from bs4 import BeautifulSoup
import lxml
from webdriver_manager.chrome import ChromeDriverManager

# URL and id ---------------------------------
company = 313055
url = 'https://www.capitaliq.com/CIQDotNet/Estimates/CIQ/consensus.aspx?CompanyId={}&dataVendorId=1'
bot = webdriver.Chrome(ChromeDriverManager().install())
bot.get(url.format(company))

# Log in -------------------------------------
bot.find_element_by_id('username').send_keys("")
pwd = bot.find_element_by_id('password')
pwd.send_keys('')
pwd.send_keys(Keys.RETURN)

# Download data ------------------------------
soup = BeautifulSoup(bot.page_source, 'lxml')
UPDATED:
So the solution is to use the main page URL and log in, then put the company ID into the search box. This way, you don't need to log in again and again.
# LOG ON TO Capitaliq.com
url = 'https://www.capitaliq.com'
bot = webdriver.Chrome(ChromeDriverManager().install())
bot.get(url)
bot.find_element_by_id('username').send_keys("")
pwd = bot.find_element_by_id('password')
pwd.send_keys('')
pwd.send_keys(Keys.RETURN)

tickers_list = [19049, 24937]
for ticker in tickers_list:
    print('**********', ticker)
    good = 0
    # Enter Firm information
    bot.find_element_by_id('SearchTopBar').send_keys(ticker)
    bot.find_element_by_id('SearchTopBar').send_keys(Keys.RETURN)
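The snippet stops after submitting the search. A rough sketch of what the scraping step inside the loop could look like, reusing the logged-in bot from above (the wait condition and the parse of page_source are assumptions, not part of the original answer):

from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

results = {}
for ticker in tickers_list:
    search_box = bot.find_element_by_id('SearchTopBar')
    search_box.clear()
    search_box.send_keys(ticker)
    search_box.send_keys(Keys.RETURN)

    # Wait until the target page has rendered; 'body' is only a placeholder locator,
    # a page-specific element id would be more robust.
    WebDriverWait(bot, 20).until(EC.presence_of_element_located((By.TAG_NAME, 'body')))

    # The session is already authenticated, so no further login is needed.
    results[ticker] = BeautifulSoup(bot.page_source, 'lxml')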

Data scraping from dynamic sites

I am scraping data from a dynamic site (https://www.mozzartbet.com/sr#/betting) using this code:
from selenium import webdriver
from selenium.webdriver.chrome.service import Service

s = Service('C:\webdrivers\chromedriver.exe')
driver = webdriver.Chrome(service=s)
driver.get('https://www.mozzartbet.com/sr#/betting')
driver.maximize_window()
results = driver.find_elements('xpath', '//*[@id="focus"]/section[1]/div/div[2]/div[2]/article/div/div[2]/div/div[1]/span[2]/span')
for result in results:
    print(result.text)
I want to scrape the quotes for all matches in the Premier League. This code worked properly once, but when I ran it the next time it produced an empty results list (0 elements), even though the same XPath returns what I want in the browser's inspect tool; the path is in the code above.
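A likely cause is that the betting content is rendered asynchronously after the initial page load, so find_elements runs before the elements exist. A minimal sketch that waits for them first (same XPath as in the question; the 30-second timeout is an assumption):

from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

s = Service(r'C:\webdrivers\chromedriver.exe')
driver = webdriver.Chrome(service=s)
driver.get('https://www.mozzartbet.com/sr#/betting')
driver.maximize_window()

xpath = '//*[@id="focus"]/section[1]/div/div[2]/div[2]/article/div/div[2]/div/div[1]/span[2]/span'

# Wait (up to an assumed 30 s) until at least one quote element is present before
# collecting the list; without this the list is often empty on a cold load.
WebDriverWait(driver, 30).until(EC.presence_of_all_elements_located((By.XPATH, xpath)))

for result in driver.find_elements(By.XPATH, xpath):
    print(result.text)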

Why am I getting inconsistent results from web scraping?

I'm having issues scraping data from a website. The issue might be with Visual Studio Code (I am using the "Code Runner" extension). This is my first time using Beautiful Soup and Selenium, so the issue might also be with my code. I started last Friday and, after some difficulty, came up with a solution on Saturday. My code is:
import requests
from bs4 import BeautifulSoup, SoupStrainer
from selenium import webdriver
from selenium.webdriver.common.keys import Keys

parcelID = 1014100000  # this is a random parcelID I grabbed from the site
url = 'https://www.manateepao.com/parcel/?parid={}'.format(parcelID)

driver = webdriver.Chrome()
driver.get(url)
html = driver.execute_script("return document.documentElement.outerHTML")

# was getting encoding error with print(html). replaced character that was giving me trouble
newHTML = html.replace(u"\u2715", "*")

soupFilter = SoupStrainer('div', {'id': 'ownerContentScollContainer'})
soup = BeautifulSoup(newHTML, 'html.parser', parse_only=soupFilter)

webparcelID = soup.find_all('b')
lColumn = soup.find_all('div', {'class': 'col-sm-2 m-0 p-0 text-sm-right'})
rColumn = soup.find_all('div', {'class': 'col-sm m-0 p-0 ml-2'})

parcel_Dict = {}
for i in range(len(lColumn)):
    parcel_Dict[i] = {lColumn[i].string: rColumn[i].string}

# This is to test if I got any results or not
print(parcel_Dict)

driver.close()
driver.quit()
What I am hoping to find each time I scrape a page is:
The Parcel ID. This is in its own bold, b, tag.
The Ownership and Mailing Address. The Ownership should always be at parcel_Dict[1] and the mailing address should always be at parcel_Dict[3].
I run the code and sometimes I get a result, and other times I get an empty dictionary.
Thank you for any help you can provide.
I solved my own issue by adding the following lines of code
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
wait = WebDriverWait(driver, 10)
wait.until(EC.presence_of_element_located((By.XPATH, "//div[@id='ownerContentScollContainer']")))
I waited until the ownerContentScrollContainer was fully loaded before proceeding to execute the rest of the code.
This post and this post helped me figure out where I might be going wrong. I used this tutorial to figure out how to use the appropriate Xpath.
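Put together, the fix simply places the wait between driver.get(url) and the execute_script call from the question; a condensed sketch reusing the same ids and selectors:

driver = webdriver.Chrome()
driver.get(url)

# Block until the container that the SoupStrainer filters on has actually rendered.
wait = WebDriverWait(driver, 10)
wait.until(EC.presence_of_element_located((By.XPATH, "//div[@id='ownerContentScollContainer']")))

# Only now grab the DOM and parse it; before the wait it was sometimes still empty.
html = driver.execute_script("return document.documentElement.outerHTML")
soup = BeautifulSoup(html.replace(u"\u2715", "*"), 'html.parser',
                     parse_only=SoupStrainer('div', {'id': 'ownerContentScollContainer'}))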

Using Python and Selenium to scrape hard-to-find web tables

I've been using Python and Selenium to scrape data from specific state health web pages and output the table to a local CSV.
I've had a lot of success on several other states using similar code. But, I have hit a state that is using what appears to be R to create dynamic dashboards that I can't really access using my normal methods.
I've spent a great deal of time combing through StackOverflow . . . I've checked to see if there's an iframe to switch to, but, I'm just not seeing the data I want located in the iframe on the page.
I can find the table info easily enough using Chrome's "Inspect" feature. But, starting from the original URL, the data I need is not on that page, and I can't find a source URL for the table. I've even used Fiddler to see if there's a call somewhere.
So, I'm not sure what to do. I can see the data--but, I don't know where it is to tell Selenium and BS4 where to access it.
The page is here: https://coronavirus.utah.gov/case-counts/
The page takes a while to load . . . I've had other states have this issue and Selenium could work through it.
The table I need looks like this:
Any help or suggestions would be appreciated.
Here is the code I've been using . . . it doesn't work here, but, the structure is very similar to that which has worked for other states.
import pandas as pd
from datetime import datetime
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
st = 'ut'
url = 'https://coronavirus.utah.gov/case-counts/'
timeout = 20
# Spawn the webpage using Selenium
driver = webdriver.Chrome(r'D:\Work\Python\utilities\chromedriver\chromedriver.exe')
driver.minimize_window()
driver.get(url)
# Let page load . . . it takes a while
wait = WebDriverWait(driver, timeout).until(EC.visibility_of_element_located((By.ID, "total-number-of-lab-confirmed-covid-19-cases-living-in-utah")))
# Now, scrape table
html = driver.find_element_by_id("total-number-of-lab-confirmed-covid-19-cases-living-in-utah")
soup = BeautifulSoup(html, 'lxml')
table = soup.find_all('table', id='#DataTables_Table_0')
df = pd.read_html(str(table))
exec(st + "_counts = df[0]")
tmp_str = f"{st}_counts.to_csv(r'D:\Work\Python\projects\Covid_WebScraping\output\{st}_covid_cnts_' + str(datetime.now().strftime('%Y_%m_%d_%H_%M_%S')) + '.csv'"
file_path = tmp_str + ", index=False)"
exec(file_path)
# Close the chrome web driver
driver.close()
I found another way to get the information I needed.
Thanks to Julian Stanley for letting me know about the Katalon Recorder product. That allowed me to see which iframe the table was in.
Using my old method of finding an element by CSS or XPath was causing a pickle error due to a locked thread. I have no clue how to deal with that, but it caused the entire project to just hang.
However, I was able to get the text/HTML of the table via an attribute. After that, I just read it with BS4 as usual.
import pandas as pd
from datetime import datetime
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.support.ui import WebDriverWait
st = 'ut'
url = 'https://coronavirus.utah.gov/case-counts/'
timeout = 20
# Spawn the webpage using Selenium
driver = webdriver.Chrome(r'D:\Work\Python\utilities\chromedriver\chromedriver.exe')
#driver.minimize_window()
driver.get(url)
# Let page load . . . it takes a while
wait = WebDriverWait(driver, timeout)
# Get name of frame (or use index=0)
frames = [frame.get_attribute('id') for frame in driver.find_elements_by_tag_name('iframe')]
# Switch to frame
#driver.switch_to_frame("coronavirus-dashboard")
driver.switch_to_frame(0)
# Now, scrape table
html = driver.find_element_by_css_selector('#DataTables_Table_0_wrapper').get_attribute('innerHTML')
soup = BeautifulSoup(html, 'lxml')
table = soup.find_all('table', id='DataTables_Table_0')
df = pd.read_html(str(table))
exec(st + "_counts = df[0]")
tmp_str = f"{st}_counts.to_csv(r'D:\Work\Python\projects\Covid_WebScraping\output\{st}_covid_cnts_' + str(datetime.now().strftime('%Y_%m_%d_%H_%M_%S')) + '.csv'"
file_path = tmp_str + ", index=False)"
exec(file_path)
# Close the chrome web driver
driver.close()
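As a side note, the exec() calls aren't required for the CSV step; keeping the DataFrame in an ordinary variable does the same job. A sketch of the equivalent write (the output directory is a placeholder):

from datetime import datetime

# Same table as above, held in a plain variable instead of being exec()'d into "ut_counts".
counts = df[0]

timestamp = datetime.now().strftime('%Y_%m_%d_%H_%M_%S')
out_path = rf'D:\Work\Python\projects\Covid_WebScraping\output\{st}_covid_cnts_{timestamp}.csv'
counts.to_csv(out_path, index=False)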

how to scrape data from shopee using beautiful soup

I'm currently a student, and I'm studying BeautifulSoup, so my lecturer asked me to scrape data from Shopee. However, I cannot scrape the details of the products. Currently, I'm trying to scrape data from https://shopee.com.my/shop/13377506/search?page=0&sortBy=sales. I only want to scrape the name and price of the products. Can someone tell me why I cannot scrape the data using BeautifulSoup?
Here is my code:
from requests import get
from bs4 import BeautifulSoup

url = "https://shopee.com.my/shop/13377506/search?page=0&sortBy=sales"
response = get(url)
soup = BeautifulSoup(response.text, 'html.parser')
print(soup)
This question is a bit tricky (for Python beginners) because it involves a combination of Selenium (for headless browsing) and BeautifulSoup (for HTML data extraction). Moreover, the problem becomes difficult because the Document Object Model (DOM) is built by JavaScript. We know JavaScript is involved because we get an empty response from the website when it is accessed with BeautifulSoup alone, e.g. from
for item_n in soup.find_all('div', class_='_1NoI8_ _16BAGk'):
    print(item_n.get_text())
Therefore, to extract data from such a webpage, whose DOM is controlled by a scripting language, we have to use Selenium for headless browsing (this tells the website that a browser is accessing it). We also have to use some sort of delay parameter, which tells the website that it is being accessed by a human. For this, the function WebDriverWait() from the Selenium library will help.
I now present snippets of code that explain the process.
First, import the requisite libraries
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.support.ui import WebDriverWait
from selenium.common.exceptions import TimeoutException
from time import sleep
Next, initialize the settings for the headless browser. I'm using chrome.
# create object for chrome options
chrome_options = Options()
base_url = 'https://shopee.com.my/shop/13377506/search?page=0&sortBy=sales'

# set chrome driver options to disable any popups from the website
# to find the local path for the chrome profile, open the chrome browser
# and in the address bar type "chrome://version"
chrome_options.add_argument('disable-notifications')
chrome_options.add_argument('--disable-infobars')
chrome_options.add_argument('start-maximized')
chrome_options.add_argument('user-data-dir=C:\\Users\\username\\AppData\\Local\\Google\\Chrome\\User Data\\Default')
# To disable the message, "Chrome is being controlled by automated test software"
chrome_options.add_argument("disable-infobars")
# Pass the argument 1 to allow and 2 to block
chrome_options.add_experimental_option("prefs", {
    "profile.default_content_setting_values.notifications": 2
})

# invoke the webdriver
browser = webdriver.Chrome(executable_path=r'C:/Users/username/Documents/playground_python/chromedriver.exe',
                           options=chrome_options)
browser.get(base_url)
delay = 5  # seconds
Next, I declare empty list variables to hold the data.
# declare empty lists
item_cost, item_init_cost, item_loc = [],[],[]
item_name, items_sold, discount_percent = [], [], []
while True:
    try:
        WebDriverWait(browser, delay)
        print("Page is ready")
        sleep(5)
        html = browser.execute_script("return document.getElementsByTagName('html')[0].innerHTML")
        #print(html)
        soup = BeautifulSoup(html, "html.parser")

        # find_all() returns an array of elements.
        # We have to go through all of them, select the one we need, and then call get_text()
        for item_n in soup.find_all('div', class_='_1NoI8_ _16BAGk'):
            print(item_n.get_text())
            item_name.append(item_n.text)

        # find the price of items
        for item_c in soup.find_all('span', class_='_341bF0'):
            print(item_c.get_text())
            item_cost.append(item_c.text)

        # find initial item cost
        for item_ic in soup.find_all('div', class_='_1w9jLI QbH7Ig U90Nhh'):
            print(item_ic.get_text())
            item_init_cost.append(item_ic.text)

        # find total number of items sold/month
        for items_s in soup.find_all('div', class_='_18SLBt'):
            print(items_s.get_text())
            items_sold.append(items_s.text)

        # find item discount percent
        for dp in soup.find_all('span', class_='percent'):
            print(dp.get_text())
            discount_percent.append(dp.text)

        # find item location
        for il in soup.find_all('div', class_='_3amru2'):
            print(il.get_text())
            item_loc.append(il.text)

        break  # it will break from the loop once the specific element is present.
    except TimeoutException:
        print("Loading took too much time! - Try again")
Thereafter, I use the zip function to combine the different list items.
rows = zip(item_name, item_init_cost,discount_percent,item_cost,items_sold,item_loc)
Finally, I write this data to disc,
import csv

newFilePath = 'shopee_item_list.csv'
with open(newFilePath, "w") as f:
    writer = csv.writer(f)
    for row in rows:
        writer.writerow(row)
As good practice, it's wise to close the headless browser once the task is complete, so I close it as follows:
# close the automated browser
browser.close()
Result
Nestle MILO Activ-Go Chocolate Malt Powder (2kg)
NESCAFE GOLD Refill (170g)
Nestle MILO Activ-Go Chocolate Malt Powder (1kg)
MAGGI Hot Cup - Asam Asam Laksa (60g)
MAGGI 2-Minit Curry (79g x 5 Packs x 2)
MAGGI PAZZTA Cheese Macaroni 70g
.......
29.90
21.90
16.48
1.69
8.50
3.15
5.90
.......
RM40.70
RM26.76
RM21.40
RM1.80
RM9.62
........
9k sold/month
2.3k sold/month
1.8k sold/month
1.7k sold/month
.................
27%
18%
23%
6%
.............
Selangor
Selangor
Selangor
Selangor
Note to the readers
The OP brought to my attention that the XPath given in my answer was no longer working. I checked the website again after two days and noticed a strange phenomenon: the class_ attribute of the div had indeed changed. I found a similar question, but it did not help much. So for now, I'm concluding that the div attributes on the Shopee website can change again. I leave this as an open problem to solve later.
Note to the OP
Ana, the above code will work for just one page, i.e. only for the webpage https://shopee.com.my/shop/13377506/search?page=0&sortBy=sales. I invite you to further enhance your skills by working out how to scrape data from multiple webpages under the sales tag. Your hint is the 1/9 seen at the top right of that page and/or the 1 2 3 4 5 links at the bottom of the page. Another hint is to look at urljoin in the urllib.parse module. Hope this gets you started.
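A sketch of that hint (the total of nine pages and the query parameters are assumptions based on what the site showed at the time, and the class names used above may change, as noted):

from urllib.parse import urlencode

base = 'https://shopee.com.my/shop/13377506/search'

for page in range(0, 9):  # "1/9" pages were visible at the time; adjust as needed
    page_url = base + '?' + urlencode({'page': page, 'sortBy': 'sales'})
    browser.get(page_url)
    # ...then repeat the WebDriverWait/sleep + BeautifulSoup extraction from above,
    # appending to the same item_name / item_cost / ... lists for every page.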
Helpful resources
XPATH tutorial
The page loads its content through asynchronous AJAX requests after the first request, so sending one request and reading the returned source will not give you the data you want.
You should simulate a browser; then you can get the page source and use BeautifulSoup on it. See the code:
BeautifulSoup way
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from bs4 import BeautifulSoup

driver = webdriver.Chrome()  # driver setup, omitted in the original snippet
driver.get("https://shopee.com.my/shop/13377506/search?page=0&sortBy=sales")
WebDriverWait(driver, 30).until(EC.presence_of_element_located((By.CSS_SELECTOR, '.shop-search-result-view')))

html = driver.page_source
soup = BeautifulSoup(html, 'html.parser')
search = soup.select_one('.shop-search-result-view')
products = search.find_all('a')

for p in products:
    name = p.select('div[data-sqe="name"] > div')[0].get_text()
    price = p.select('div > div:nth-child(2) > div:nth-child(2)')[0].get_text()
    product = p.select('div > div:nth-child(2) > div:nth-child(4)')[0].get_text()
    print('name: ' + name)
    print('price: ' + price)
    print('product: ' + product + '\n')
However, using selenium is a good approach to get everything you want. See the example below:
Selenium Way
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

driver = webdriver.Chrome()  # driver setup, omitted in the original snippet
driver.get("https://shopee.com.my/shop/13377506/search?page=0&sortBy=sales")
WebDriverWait(driver, 30).until(EC.presence_of_element_located((By.CSS_SELECTOR, '.shop-search-result-view')))

search = driver.find_element_by_css_selector('.shop-search-result-view')
products = search.find_elements_by_css_selector('a')

for p in products:
    name = p.find_element_by_css_selector('div[data-sqe="name"] > div').text
    price = p.find_element_by_css_selector('div > div:nth-child(2) > div:nth-child(2)').text
    product = p.find_element_by_css_selector('div > div:nth-child(2) > div:nth-child(4)').text
    print('name: ' + name)
    print('price: ' + price.replace('\n', ' | '))
    print('product: ' + product + '\n')
Please post your code so we can help.
Or you can start like this. :)
from bs4 import BeautifulSoup as soup
from urllib.request import urlopen as uReg

my_url = "<url>"
uClient = uReg(my_url)
page_html = uClient.read()
uClient.close()

# parse the downloaded HTML with BeautifulSoup
page_soup = soup(page_html, 'html.parser')
