I'm fairly new to Python and am using Beautiful Soup for the first time, though I have some experience with Selenium. I am trying to scrape a website ("http://cbseaff.nic.in/cbse_aff/schdir_Report/userview.aspx") for all the affiliation numbers.
The problem is that they are spread over multiple pages (20 results per page, 21,000+ results in total),
so I want to scrape them in some kind of loop that clicks the next-page button. The URL of the web page does not change between pages, so there is no URL pattern to iterate over.
So far I have tried the Google Sheets IMPORTHTML / IMPORTXML functions, but because of the size of the problem the sheet just hangs.
Next I tried Python and started reading about scraping with Python (I'm doing this for the first time :) ). Someone on this platform suggested a method
(Python Requests/BeautifulSoup access to pagination)
I am trying to do the same, but with little to no success.
Also, to fetch the results we first have to type the keyword "a" into the search bar and then click Search. Only then does the website show results.
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.common.exceptions import NoSuchElementException
from selenium.webdriver.common.by import By
import time

options = webdriver.ChromeOptions()
options.add_argument("headless")
driver = webdriver.Chrome(executable_path=r"C:\chromedriver.exe", options=options)
driver.get("http://cbseaff.nic.in/cbse_aff/schdir_Report/userview.aspx")

# click on the radio btn
driver.find_element(By.ID, 'optlist_0').click()
time.sleep(2)

# Search the query with letter A and click Search btn
driver.find_element(By.ID, 'keytext').send_keys("a")
driver.find_element(By.ID, 'search').click()
time.sleep(2)

next_button = driver.find_element(By.ID, "Button1")
data = []
try:
    while (next_button):
        soup = BeautifulSoup(driver.page_source, 'html.parser')
        table = soup.find('table', {'id': 'T1'})  # Main Table
        table_body = table.find('tbody')  # get inside the body
        rows = table_body.find_all('tr')  # look for all tablerow
        for row in rows:
            cols = row.find_all('td')  # in every Tablerow, look for tabledata
            for row2 in cols:
                # table -> tbody -> tr -> td -> <b> --> exit loop. (only first tr is our required data, print this)
                pass  # extraction step not written yet
except NoSuchElementException:
    pass  # handler assumed from the import above; the posted code stops before it
The final outcome I expect is a list of all affiliation numbers across all the pages.
A minor addition to the code within your while loop:
next_button = 1  # Initialise the variable for the first instance of while loop
while next_button:
    # First scroll to the bottom of the page
    driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
    # Now locate the button & click on it
    next_button = driver.find_element(By.ID, 'Button1')
    next_button.click()
    ###
    ### Beautiful Soup Code : Fetch the page source now & do your thing ###
    ###
    # Adjust the timing as per your requirement
    time.sleep(2)
Note that scrolling to the bottom of the page is important; otherwise an error pops up saying the 'Button1' element is hidden under the footer. With the script at the beginning of the loop, the browser moves down to the bottom of the page, where the 'Button1' element is no longer covered. Then locate the element, click it, and let Beautiful Soup take over.
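The loop above clicks before it scrapes and only ends when find_element raises. A minimal sketch of one way to order the steps (scrape the current page, then try to advance, stop when there is nothing left) is below. The names collect_pages, get_source and click_next are hypothetical, and the fake pages only stand in for a browser so the sketch runs on its own.

```python
def collect_pages(get_source, click_next, max_pages=2000):
    """Collect page sources until click_next reports no further page.

    get_source and click_next are hypothetical callables supplied by the
    caller; max_pages is a safety cap so the loop cannot run forever.
    """
    pages = []
    for _ in range(max_pages):
        pages.append(get_source())  # scrape the page currently shown
        if not click_next():        # try to advance; False means last page
            break
    return pages


# Stand-in for a browser: three fake pages and a cursor into them.
fake_pages = ["<p>page 1</p>", "<p>page 2</p>", "<p>page 3</p>"]
state = {"index": 0}


def fake_source():
    return fake_pages[state["index"]]


def fake_click_next():
    if state["index"] + 1 >= len(fake_pages):
        return False  # no next page
    state["index"] += 1
    return True


collected = collect_pages(fake_source, fake_click_next)
print(len(collected), "pages collected")
```

With Selenium, get_source could be lambda: driver.page_source, and click_next could be a function that scrolls down, finds 'Button1', clicks it, and returns False when NoSuchElementException is raised.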
So here's my problem. I wrote a program that is perfectly able to get all of the information I want on the first page that I load. But when I click on the nextPage button it runs a script that loads the next bunch of products without actually moving to another page.
So when I run the next loop, I just get the same content as the first page, even though the browser I'm driving shows different products.
This is the code I run:
from selenium import webdriver
from selenium.webdriver.common.by import By
from bs4 import BeautifulSoup
import time

driver = webdriver.Chrome()  # driver setup was not shown in the post; assumed
driver.get("https://www.my-website.com/search/results-34y1i")
soup = BeautifulSoup(driver.page_source, 'html.parser')
time.sleep(2)

# /////////// code to find total number of pages
currentPage = 0
button_NextPage = driver.find_element(By.ID, 'nextButton')
while currentPage != totalPages:
    # ///////// code to find the products
    currentPage += 1
    button_NextPage = driver.find_element(By.ID, 'nextButton')
    button_NextPage.click()
    time.sleep(5)
Is there any way for me to scrape exactly what's loaded on my browser?
The issue seems to be that you are only ever fetching page 1, as shown in this line:
driver.get("https://www.tcgplayer.com/search/magic/commander-streets-of-new-capenna?productLineName=magic&setName=commander-streets-of-new-capenna&page=1&view=grid")
As you can see, the URL has a query parameter called page that determines which page of HTML you are fetching. So every time you loop to a new page, you have to fetch the new HTML with the driver by changing the page query parameter. For example, in your loop it would be something like this:
driver.get("https://www.tcgplayer.com/search/magic/commander-streets-of-new-capenna?productLineName=magic&setName=commander-streets-of-new-capenna&page={page}&view=grid".format(page=currentPage))
After you fetch the new HTML, you will be able to access the elements that are present on the different pages, as you require.
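If you would rather not keep the whole URL as a format string, one alternative is to rewrite only the page parameter with the standard library. This is a sketch, not the only way to do it; with_page is a hypothetical helper name, and it assumes the site keeps using a query parameter called page.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def with_page(url, page):
    """Return url with its 'page' query parameter set to the given number."""
    parts = urlsplit(url)
    query = dict(parse_qsl(parts.query, keep_blank_values=True))
    query["page"] = str(page)  # overwrite (or add) the page parameter
    return urlunsplit(parts._replace(query=urlencode(query)))


base = ("https://www.tcgplayer.com/search/magic/commander-streets-of-new-capenna"
        "?productLineName=magic&setName=commander-streets-of-new-capenna"
        "&page=1&view=grid")

# URLs for pages 1 to 3; each one could be passed to driver.get(...)
urls = [with_page(base, number) for number in range(1, 4)]
for url in urls:
    print(url)
```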
My code runs fine and prints the title for all rows but the rows with dropdowns.
For example, row 4 has a dropdown if clicked. I implemented a try which would in theory initiate the dropdown, to then pull the titles.
But my click/scrape for the rows with these drop downs are not printing.
Expected output- Print all titles including the ones in dropdown.
from selenium import webdriver
from bs4 import BeautifulSoup
import time

driver = webdriver.Chrome()
driver.get('https://cslide.ctimeetingtech.com/esmo2021/attendee/confcal/session/list')
time.sleep(4)
page_source = driver.page_source
soup = BeautifulSoup(page_source, 'html.parser')
productlist = soup.find_all('div', class_='card item-container session')
for property in productlist:
    sessiontitle = property.find('h4', class_='session-title card-title').text
    print(sessiontitle)
    try:
        ifDropdown = driver.find_elements_by_class_name('item-expand-action expand')
        ifDropdown.click()
        time.sleep(4)
        newTitle = driver.find_element_by_class_name('card-title').text
        print(newTitle)
    except:
        newTitle = 'none'
There were a couple of issues. First, when you locate from the driver by class and there is more than one, you need to separate them by dots, not spaces, so that the driver knows it's dealing with another class.
Second, find_elements returns a list, and the list has no .click(), so you get an error, which your except catches but assumes means there was no link to click.
I rewrote it (without soup for now) so that it instead checks (With the dot replacing space) for a link to open within the session and then loops over the new ones that appeared.
Here is what I have and tested. Note at the end this only gets the sessions and subsessions in the view. You will need to add logic to scroll and get the rest.
# stuff to initialize driver is above here, I used firefox
# Open the website page
URL = "https://cslide.ctimeetingtech.com/esmo2021/attendee/confcal/session/list"
driver.get(URL)
time.sleep(4)  # time for page to populate
product_list = driver.find_elements_by_css_selector('div.card.item-container.session')
# above line gets all top level sessions
for product in product_list:
    session_title = product.find_element_by_css_selector('h4.card-title').text
    print(session_title)
    dropdowns = product.find_elements_by_class_name('item-expand-action.expand')
    # above line finds dropdown within this session, if any
    if len(dropdowns) == 0:  # nothing more for this session
        continue  # move to next session
    # still here, click on the dropdown, using execute because link can overlap chevron
    driver.execute_script("arguments[0].scrollIntoView(true); arguments[0].click();",
                          dropdowns[0])
    time.sleep(4)  # wait for subsessions to appear
    session_titles = product.find_elements_by_css_selector('h4.card-title')
    session_index = 0  # suppress reprinting title of master session
    for session_title in session_titles:
        if session_index > 0:
            print(" " + session_title.text)  # indent for clarity
        session_index = session_index + 1
# still to do, deal with other sessions that only get paged into view when you scroll
# that is a different question
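As a small illustration of the dots-versus-spaces point from this answer: the value of an HTML class attribute is a space-separated list, while a CSS selector joins the class names with dots. The helper below is hypothetical (it is not part of Selenium) and only shows the conversion.

```python
def class_attr_to_css(tag, class_attr):
    """Turn a space-separated class attribute into a CSS selector.

    tag may be an empty string when any element type should match.
    """
    classes = class_attr.split()  # "a b c" -> ["a", "b", "c"]
    return tag + "".join("." + name for name in classes)


session_selector = class_attr_to_css("div", "card item-container session")
dropdown_selector = class_attr_to_css("", "item-expand-action expand")
print(session_selector)
print(dropdown_selector)
```

The resulting strings are the kind of selector the code above passes to find_elements_by_css_selector.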
I am scraping a website that renders dynamically with JavaScript. The URLs don't change when hitting the > button, so I have been looking at the Network tab of the inspector, specifically the "General" section for the "Request URL" and "Request Method", as well as the "Form Data" section, looking for any sort of ID that could distinguish each successive page. However, when recording a log of clicking the > button from page to page, the "Form Data" seems to be the same each time (see images):
Currently my code doesn't incorporate this method, because I can't see it helping until I can find a unique identifier in the "Form Data" section. However, I can show my code if helpful. In essence, it just pulls the first page of data over and over again in my while loop, even though I'm using a Selenium driver and calling driver.find_elements_by_xpath("xpath of > button").click() before trying to get the data with BeautifulSoup.
(Updated code see comments)
from selenium import webdriver
from selenium.webdriver.common.by import By
from bs4 import BeautifulSoup
import pandas as pd
from pandas import *

masters_list = []


def extract_info(html_source):
    # html_source will be inner HTML of table
    global lst
    soup = BeautifulSoup(html_source, 'html.parser')
    lst = soup.find('tbody').find_all('tr')[0]
    masters_list.append(lst)
    # i am printing just id because it's id set as crypto name you have to do more scraping to get more info


chrome_driver_path = '/Users/Justin/Desktop/Python/chromedriver'
driver = webdriver.Chrome(executable_path=chrome_driver_path)
url = 'https://cryptoli.st/lists/fixed-supply'
driver.get(url)
loop = True
while loop:  # loop for extracting all 120 pages
    crypto_table = driver.find_element(By.ID, 'DataTables_Table_0').get_attribute(
        'innerHTML')  # this is for crypto data table
    extract_info(crypto_table)
    paginate = driver.find_element(
        By.ID, "DataTables_Table_0_paginate")  # all table pagination
    pages_list = paginate.find_elements(By.TAG_NAME, 'li')
    # we clicking on next arrow sign at last not on 2,3,.. etc anchor link
    next_page_link = pages_list[-1].find_element(By.TAG_NAME, 'a')
    # checking is there next page available
    if "disabled" in next_page_link.get_attribute('class'):
        loop = False
    pages_list[-1].click()  # if there next page available then click on it

df = pd.DataFrame(masters_list)
print(df)
df.to_csv("crypto_list.csv")
driver.quit()
I am using my own code to show how I get the table; I have added explanations as comments on the important lines.
from selenium import webdriver
from selenium.webdriver.common.by import By
from bs4 import BeautifulSoup


def extract_info(html_source):
    soup = BeautifulSoup(html_source, 'html.parser')  # html_source will be inner HTML of table
    lst = soup.find('tbody').find_all('tr')
    for i in lst:
        print(i.get('id'))  # i am printing just id because it's id set as crypto name you have to do more scraping to get more info


driver = webdriver.Chrome()
url = 'https://cryptoli.st/lists/fixed-supply'
driver.get(url)
loop = True
while loop:  # loop for extracting all 120 pages
    crypto_table = driver.find_element(By.ID, 'DataTables_Table_0').get_attribute('innerHTML')  # this is for crypto data table
    extract_info(crypto_table)  # prints the row ids itself and returns nothing
    paginate = driver.find_element(By.ID, "DataTables_Table_0_paginate")  # all table pagination
    pages_list = paginate.find_elements(By.TAG_NAME, 'li')
    next_page_link = pages_list[-1].find_element(By.TAG_NAME, 'a')  # we clicking on next arrow sign at last not on 2,3,.. etc anchor link
    if "disabled" in next_page_link.get_attribute('class'):  # checking is there next page available
        loop = False
    pages_list[-1].click()  # if there next page available then click on it
So the main answer to your question is: when you click the button, Selenium updates the page, and you can then use driver.page_source to get the updated HTML. Sometimes (not with this URL) the page makes an AJAX request that can take some time, so you have to wait until Selenium has loaded the full page.
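For the slow-AJAX case mentioned above, a fixed sleep can be replaced by polling until something on the page has actually changed. The sketch below is a generic, standard-library version of that idea; wait_for_change and its parameters are hypothetical names, and the fake reader only simulates a page that updates on the third read. Selenium ships its own WebDriverWait class for the same purpose.

```python
import time


def wait_for_change(read_value, old_value, timeout=10.0, poll=0.2):
    """Poll read_value() until it differs from old_value, then return it.

    Raises TimeoutError if nothing changes within `timeout` seconds.
    """
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        current = read_value()
        if current != old_value:
            return current
        time.sleep(poll)  # give the page time to update before re-reading
    raise TimeoutError("value did not change within %.1f s" % timeout)


# Fake reader: the "page" only changes on the third read.
readings = iter(["page 1", "page 1", "page 2"])
new_value = wait_for_change(lambda: next(readings), "page 1",
                            timeout=2.0, poll=0.01)
print(new_value)
```

With a real driver, read_value could return the innerHTML of the table (or the text of its first row), and old_value would be what was read just before clicking the next arrow.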
I am trying to scrape information from this link: https://www.hopkinsguides.com/hopkins/index/Johns_Hopkins_ABX_Guide/Antibiotics
This site uses jQuery. My goal is to scrape all the antibiotic names and then, for each antibiotic, scrape the "NON-FDA APPROVED USES" section, which sits on a separate page behind its own link. I hope I'm making sense.
The antibiotics are in categories that contain many subcategories, which in turn contain the rest of the antibiotics with their respective links.
My program first logs in and then clicks on the first 7 buttons to expand and show more categories. I used driver.find_element_by_xpath to expand the first layer, but I can't expand the second layer the same way (by looping through XPaths), because doing so ends up taking me to the page that holds the "NON-FDA APPROVED USES" info instead of expanding the list.
That happens because once you expand the first layer, the second layer contains both more buttons/subcategories and links that take you to the page with "NON-FDA APPROVED USES".
So if these are my XPaths:
#//*[@id="firstul"]/li[1]/a
#//*[@id="firstul"]/li[2]/a
li[1] could be a redirecting link,
li[2] could be a button that shows more links (which is what I want first).
I made a soup to separate the buttons from the links, but now I can't click on the "a" tags that I print in the for loop at the bottom.
Any ideas on how I should go about this? Thanks in advance.
Here's my code.
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
import time
from random import randint
from bs4 import BeautifulSoup

# SIGN-IN
driver = webdriver.Chrome()
driver.get("http://www.hopkinsguide.com/home")
url = "https://www.hopkinsguides.com/hopkins/index/"
assert "Hopkins" in driver.title
sign_in_button = driver.find_element_by_xpath('//*[@id="logout"]')
sign_in_button.click()
user_elem = driver.find_element_by_name('username')
pass_elem = driver.find_element_by_id('dd-password')
user_elem.send_keys("user")
time.sleep(2)
pass_elem.send_keys("pass")
time.sleep(2)
sign_in_after_input = driver.find_element_by_xpath('//*[@id="dd-login-button"]')
sign_in_after_input.click()


def expand_page():
    req = driver.get("https://www.hopkinsguides.com/hopkins/index/Johns_Hopkins_ABX_Guide/Antibiotics")
    time.sleep(randint(2, 4))
    # expand first layer
    for i in range(1, 8):
        driver.find_element_by_xpath("//*[@id='firstul']/li[" + str(i) + "]/a").click()
        time.sleep(2)
    html = driver.page_source
    soup = BeautifulSoup(html, features='lxml')
    for i in soup.find_all('a'):
        if i.get('data-path') != None:
            print(i)
    time.sleep(2)


expand_page()
To expand all the values this should work for you, this will expand all the first level values and keep checking if any child values are expandable by checking the role attribute of element recursively:
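from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait


def click_further(driver, elem):
    # wait for the child links under this node, then expand any that are buttons
    subs = WebDriverWait(driver, 5).until(lambda driver: elem.find_elements_by_xpath("./following-sibling::ul//li/a"))
    for sub in subs:
        if sub.get_attribute('role') == "button":
            sub.click()
            click_further(driver, sub)


for idx in range(1, 8):
    elem = WebDriverWait(driver, 5).until(EC.presence_of_element_located((By.XPATH, "//*[@id='firstul']/li[{}]/a".format(idx))))
    elem.click()
    click_further(driver, elem)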
I guess then you can figure out how to get the text which you want to extract from it.
I suppose you want to expand all the expandable nodes first before accessing the underlying links one by one. From what I can see of the site, the discriminating attribute would be <li class="expandable index-expand"> and <li class="index-leaf">.
You can use Selenium to locate the "expandable index-expand" classes and click the nested <a> tag first. Then, repeat the same operation for the expanded child layer each time you click. Once you no longer detect "expandable index-expand" classes in the child layer, you can proceed to grab the links from "index-leaf".
find_elements_by_class_name should do the trick
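The expand-then-collect order described above can be sketched without a browser. In the toy below, a nested dictionary stands in for the page: "expandable" nodes play the role of <li class="expandable index-expand"> and "leaf" nodes the role of <li class="index-leaf">. The data, the hrefs and the function name collect_leaves are all made up for illustration.

```python
def collect_leaves(node):
    """Depth-first walk: open every expandable node, return the leaf links."""
    if node["kind"] == "leaf":
        return [node["href"]]
    leaves = []
    # With Selenium, this is where the node's nested <a> would be clicked
    # and its newly revealed children looked up again.
    for child in node["children"]:
        leaves.extend(collect_leaves(child))
    return leaves


# Made-up tree: one category, one subcategory, three leaf links.
tree = {
    "kind": "expandable",
    "children": [
        {"kind": "leaf", "href": "/drug-a"},
        {"kind": "expandable", "children": [
            {"kind": "leaf", "href": "/drug-b"},
            {"kind": "leaf", "href": "/drug-c"},
        ]},
    ],
}

links = collect_leaves(tree)
print(links)
```

Only once every expandable node has been opened are the leaf links visited one by one, which avoids navigating away from the index page halfway through the expansion.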
I'm using selenium and BeautifulSoup to scrape data from a website (http://www.grownjkids.gov/ParentsFamilies/ProviderSearch) with a next button, which I'm clicking in a loop. I was struggling with StaleElementReferenceException previously but overcame this by looping to refind the element on the page. However, I ran into a new problem - it's able to click all the way to the end now. But when I check the csv file it's written to, even though the majority of the data looks good, there's often duplicate rows in batches of 5 (which is the number of results that each page shows).
Pictorial example of what I mean: https://www.dropbox.com/s/ecsew52a25ihym7/Screen%20Shot%202019-02-13%20at%2011.06.41%20AM.png?dl=0
I have a hunch this is due to my program re-extracting the current data on the page every time I attempt to find the next button. I was confused why this would happen, since from my understanding, the actual scraping part happens only after you break out of the inner while loop which attempts to find the next button and into the larger one. (Let me know if I'm not understanding this correctly as I'm comparatively new to this stuff.)
Additionally, the data I output after every run of my program is different. That makes sense considering the error: in the past, the StaleElementReferenceExceptions occurred at sporadic locations, so if results are duplicated every time this exception occurs, the duplications would be sporadic as well. Even worse, a different batch of results ends up being skipped each time I run the program. I cross-compared the results from 2 different outputs, and some results were present in one and not the other.
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.common.exceptions import NoSuchElementException, StaleElementReferenceException
from bs4 import BeautifulSoup
import csv

chrome_options = Options()
chrome_options.add_argument('--disable-gpu')
chrome_options.add_argument("--headless")

url = "http://www.grownjkids.gov/ParentsFamilies/ProviderSearch"
driver = webdriver.Chrome('###location###')
driver.implicitly_wait(10)
driver.get(url)

# clears text box
driver.find_element_by_class_name("form-control").clear()

# clicks on search button without putting in any parameters, getting all the results
search_button = driver.find_element_by_id("searchButton")
search_button.click()

df_list = []
headers = ["Rating", "Distance", "Program Type", "County", "License", "Program Name", "Address", "Phone", "Latitude", "Longitude"]

while True:
    # keeps on clicking next button to fetch each group of 5 results
    try:
        nextButton = driver.find_element_by_class_name("next")
        nextButton.send_keys('\n')
    except NoSuchElementException:
        break
    except StaleElementReferenceException:
        attempts = 0
        while (attempts < 100):
            try:
                nextButton = driver.find_element_by_class_name("next")
                if nextButton:
                    nextButton.send_keys('\n')
                    break
            except NoSuchElementException:
                break
            except StaleElementReferenceException:
                attempts += 1

    # finds table of center data on the page
    table = driver.find_element_by_id("results")
    html_source = table.get_attribute('innerHTML')
    soup = BeautifulSoup(html_source, "lxml")

    # iterates through centers, extracting the data
    for center in soup.find_all("div", {"class": "col-sm-7 fields"}):
        mini_list = []
        # all fields except latlong
        for row in center.find_all("div", {"class": "field"}):
            material = row.find("div", {"class": "value"})
            if material is not None:
                mini_list.append(material.getText().encode("utf8").strip())
        # parses latlong from link
        for link in center.find_all('a', href = True):
            content = link['href']
            latlong = content[34:-1].split(',')
            mini_list.append(latlong[0])
            mini_list.append(latlong[1])
        df_list.append(mini_list)

# writes content into csv
with open ('output_file.csv', "wb") as f:
    writer = csv.writer(f)
    writer.writerow(headers)
    writer.writerows(row for row in df_list if row)
Anything would help! If there's other recommendations you have about the way I've used selenium/BeautifulSoup/python in order to improve my programming for the future, I would appreciate it.
Thanks so much!
I would use selenium to grab the results count then do an API call to get the actual results. You can either, in case result count is greater than limit for pageSize argument of queryString for API, loop in batches and increment the currentPage argument until you have reached the total count, or, as I do below, simply request all results in one go. Then extract what you want from the json.
import requests
import json
from bs4 import BeautifulSoup as bs
from selenium import webdriver
initUrl = 'http://www.grownjkids.gov/ParentsFamilies/ProviderSearch'
driver = webdriver.Chrome()
driver.get(initUrl)
numResults = driver.find_element_by_css_selector('#totalCount').text
driver.quit()
newURL = 'http://www.grownjkids.gov/Services/GetProviders?latitude=40.2171&longitude=-74.7429&distance=10&county=&toddlers=false&preschool=false&infants=false&rating=&programTypes=&pageSize=' + numResults + '&currentPage=0'
data = requests.get(newURL).json()
You have a collection of dictionaries to iterate over in the response:
An example of writing out some values:
if len(data) > 0:
    for item in data:
        print(item['Name'], '\n', item['Address'])
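The batching alternative mentioned at the start of this answer (incrementing currentPage until the total count is reached) could look roughly like the sketch below. It only builds the URLs; batch_urls is a hypothetical helper, and it assumes that currentPage is zero-based and that the endpoint honours pageSize, neither of which has been verified here.

```python
from urllib.parse import urlencode


def batch_urls(base_url, total, page_size, params=None):
    """Build one URL per batch so that all `total` results are covered."""
    if page_size <= 0:
        raise ValueError("page_size must be positive")
    page_count = -(-total // page_size)  # ceiling division
    urls = []
    for page in range(page_count):
        query = dict(params or {})       # copy any fixed query parameters
        query["pageSize"] = page_size
        query["currentPage"] = page      # assumed zero-based, as in the URL above
        urls.append(base_url + "?" + urlencode(query))
    return urls


urls = batch_urls("http://www.grownjkids.gov/Services/GetProviders",
                  total=23, page_size=10, params={"distance": 10})
for url in urls:
    print(url)
```

Each URL could then be fetched with requests.get(url).json() and the returned lists concatenated.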
If you are worried about lat and long values you can grab them from one of the script tags when using selenium:
The alternate URL I use for XHR jQuery GET you can find by using dev tools (F12) on the page then refreshing the page with F5 and inspect the jquery requests made in the network tab:
You should read the HTML contents inside every iteration of the while loop. Example below:
while counter < page_number_limit:
    counter = counter + 1
    new_data = driver.page_source
    page_contents = BeautifulSoup(new_data, 'lxml')
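While the loop is being debugged, repeated batches can also be filtered out before the CSV is written. This is a stopgap rather than a fix for the stale re-reads, and it cannot recover rows that were skipped. A minimal sketch, with dedupe_rows as a hypothetical helper and made-up rows:

```python
def dedupe_rows(rows):
    """Drop repeated rows while keeping the first occurrence and the order."""
    seen = set()
    unique = []
    for row in rows:
        key = tuple(row)  # lists are not hashable, tuples are
        if key not in seen:
            seen.add(key)
            unique.append(row)
    return unique


# Made-up rows: the second page was scraped twice.
scraped = [
    ["Center A", "Mercer"],
    ["Center B", "Essex"],
    ["Center B", "Essex"],
    ["Center C", "Union"],
]
cleaned = dedupe_rows(scraped)
print(len(scraped), "rows scraped,", len(cleaned), "rows kept")
```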