Python - Web scraping a page with Selenium

My code is supposed to go to a website, navigate through its 2 pages, and print out the title and URL/href within each row.
Currently, my code reaches both pages fine, but it only prints the first title on each page rather than the title of every row.
The page does have some JavaScript, and I think that may be why it does not show any links/URLs/hrefs within these rows. Ideally I'd like to print the URL of each row.
from selenium import webdriver
import time

driver = webdriver.Chrome()

for x in range(1, 3):
    driver.get(f'https://www.abstractsonline.com/pp8/#!/9325/presentations/endometrial/{x}')
    time.sleep(3)
    page_source = driver.page_source
    eachrow = driver.find_elements_by_xpath("//li[@class='result clearfix']")
    for item in eachrow:
        title = driver.find_element_by_xpath("//span[@class='bodyTitle']").text
        print(title)

You're using driver inside your for loop, meaning you're searching the whole page, so you will always get the same (first) element.
You want to search from each item instead.
for item in eachrow:
    title = item.find_element_by_xpath(".//span[@class='bodyTitle']").text
Also, there are no "URLs" in the rows as mentioned - when you click on a row the data-id attribute is used in the request.
<h1 class="name" data-id="1989" data-key="">
Which sends a request to https://www.abstractsonline.com/oe3/Program/9325/Presentation/694
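Putting the two pieces together, here is a minimal sketch (untested against the live site) that reads each title relative to its row and builds a presentation URL from the row's data-id. The location of the h1.name element inside each row and the /oe3/Program/9325/Presentation/ endpoint pattern are assumptions based on the markup and request shown above:
for item in eachrow:
    title = item.find_element_by_xpath(".//span[@class='bodyTitle']").text
    # assumption: the data-id sits on the h1.name element inside each row, as in the markup above
    data_id = item.find_element_by_xpath(".//h1[@class='name']").get_attribute("data-id")
    url = f"https://www.abstractsonline.com/oe3/Program/9325/Presentation/{data_id}"
    print(title, url)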

Related

How to scrape a page that is dynamically loaded?

So here's my problem. I wrote a program that is perfectly able to get all of the information I want on the first page that I load. But when I click the nextPage button, it runs a script that loads the next batch of products without actually moving to another page.
So when I run the next loop iteration, all I get is the same content as the first page, even though what is shown in the browser I'm controlling is different.
This is the code I run:
from selenium import webdriver
from selenium.webdriver.common.by import By
from bs4 import BeautifulSoup
import time

driver = webdriver.Chrome()  # driver setup (omitted in the original snippet)
driver.get("https://www.my-website.com/search/results-34y1i")
soup = BeautifulSoup(driver.page_source, 'html.parser')
time.sleep(2)

# /////////// code to find total number of pages

currentPage = 0
button_NextPage = driver.find_element(By.ID, 'nextButton')
while currentPage != totalPages:
    # ///////// code to find the products
    currentPage += 1
    button_NextPage = driver.find_element(By.ID, 'nextButton')
    button_NextPage.click()
    time.sleep(5)
Is there any way for me to scrape exactly what's loaded on my browser?
The issue seems to be that you're only ever fetching page 1, as shown in the following line:
driver.get("https://www.tcgplayer.com/search/magic/commander-streets-of-new-capenna?productLineName=magic&setName=commander-streets-of-new-capenna&page=1&view=grid")
But as you can see, there's a query parameter called page in the URL that determines which page's HTML you are fetching. So every time you loop to a new page, you'll have to fetch the new HTML content with the driver by changing that page query parameter. For example, in your loop it would be something like this:
driver.get("https://www.tcgplayer.com/search/magic/commander-streets-of-new-capenna?productLineName=magic&setName=commander-streets-of-new-capenna&page={page}&view=grid".format(page = currentPage))
After you fetch the new HTML, you'll be able to access the new elements present on each page as required.
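Put together, the loop could look roughly like this sketch (the sleep duration, the hard-coded totalPages, and the product-finding step are placeholders for whatever you already use):
totalPages = 10  # placeholder: determine this from the page as in your existing code
currentPage = 1
while currentPage <= totalPages:
    driver.get(
        "https://www.tcgplayer.com/search/magic/commander-streets-of-new-capenna"
        f"?productLineName=magic&setName=commander-streets-of-new-capenna&page={currentPage}&view=grid"
    )
    time.sleep(5)  # give the new page time to render
    soup = BeautifulSoup(driver.page_source, 'html.parser')
    # ... find the products in `soup` here, as in your existing code ...
    currentPage += 1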

Python/Selenium - How to webscrape this dropdown

My code runs fine and prints the title for all rows except the rows with dropdowns.
For example, row 4 has a dropdown if clicked. I implemented a try which would, in theory, trigger the dropdown and then pull the titles.
But my click/scrape for the rows with these dropdowns prints nothing.
Expected output: print all titles, including the ones inside the dropdowns.
from selenium import webdriver
from bs4 import BeautifulSoup
import time

driver = webdriver.Chrome()
driver.get('https://cslide.ctimeetingtech.com/esmo2021/attendee/confcal/session/list')
time.sleep(4)
page_source = driver.page_source
soup = BeautifulSoup(page_source, 'html.parser')
productlist = soup.find_all('div', class_='card item-container session')

for property in productlist:
    sessiontitle = property.find('h4', class_='session-title card-title').text
    print(sessiontitle)
    try:
        ifDropdown = driver.find_elements_by_class_name('item-expand-action expand')
        ifDropdown.click()
        time.sleep(4)
        newTitle = driver.find_element_by_class_name('card-title').text
        print(newTitle)
    except:
        newTitle = 'none'
There were a couple of issues. First, when you locate elements by class name and an element has more than one class, you need to separate the classes with dots, not spaces, so that the driver knows it's dealing with another class.
Second, find_elements returns a list, and a list has no .click(), so you get an error, which your except catches but treats as meaning there was no link to click.
I rewrote it (without soup for now) so that it instead checks (with the dot replacing the space) for a link to open within the session and then loops over the new titles that appear.
Here is what I have and tested. Note that this only gets the sessions and subsessions currently in view; you will need to add logic to scroll to get the rest.
# stuff to initialize driver is above here, I used firefox
# Open the website page
URL = "https://cslide.ctimeetingtech.com/esmo2021/attendee/confcal/session/list"
driver.get(URL)
time.sleep(4)  # time for page to populate

product_list = driver.find_elements_by_css_selector('div.card.item-container.session')
# above line gets all top level sessions
for product in product_list:
    session_title = product.find_element_by_css_selector('h4.card-title').text
    print(session_title)
    dropdowns = product.find_elements_by_class_name('item-expand-action.expand')
    # above line finds dropdown within this session, if any
    if len(dropdowns) == 0:  # nothing more for this session
        continue  # move to next session
    # still here, click on the dropdown, using execute because link can overlap chevron
    driver.execute_script("arguments[0].scrollIntoView(true); arguments[0].click();",
                          dropdowns[0])
    time.sleep(4)  # wait for subsessions to appear
    session_titles = product.find_elements_by_css_selector('h4.card-title')
    session_index = 0  # suppress reprinting title of master session
    for session_title in session_titles:
        if session_index > 0:
            print(" " + session_title.text)  # indent for clarity
        session_index = session_index + 1
# still to do, deal with other sessions that only get paged into view when you scroll
# that is a different question
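If you do need the rest, one possible approach (a sketch only, not tested against this site) is to keep scrolling to the bottom of the page until no new session cards appear, and only then run the loop above over the full list:
last_count = 0
while True:
    driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
    time.sleep(2)  # wait for more sessions to load
    cards = driver.find_elements_by_css_selector('div.card.item-container.session')
    if len(cards) == last_count:
        break  # no new sessions appeared, so everything is loaded
    last_count = len(cards)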

How to loop pages using Beautiful Soup 4, Python and Selenium?

I'm fairly new to Python and am using Beautiful Soup for the first time, though I have some experience with Selenium. I am trying to scrape a website ("http://cbseaff.nic.in/cbse_aff/schdir_Report/userview.aspx") for all the affiliation numbers.
The problem is that they are spread across multiple pages (20 results per page, 21,000+ results in total),
so I wish to scrape these in some kind of loop that can iterate over the next-page button; the problem is that the URL of the page does not change, so there is no pattern to follow.
For this I have tried the Google Sheets IMPORTHTML/IMPORTXML approach, but due to the scale of the problem it just hangs.
Next I tried Python and started reading about scraping (I'm doing this for the first time :) ). Someone on this platform suggested a method
(Python Requests/BeautifulSoup access to pagination)
and I am trying to do the same, but with little to no success.
Also, to fetch the results we first have to query the search bar with the keyword "a" and then click Search; only then does the website show results.
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.common.exceptions import NoSuchElementException
from selenium.webdriver.common.by import By
import time

options = webdriver.ChromeOptions()
options.add_argument("headless")
driver = webdriver.Chrome(executable_path=r"C:\chromedriver.exe", options=options)
driver.get("http://cbseaff.nic.in/cbse_aff/schdir_Report/userview.aspx")

# click on the radio btn
driver.find_element(By.ID, 'optlist_0').click()
time.sleep(2)

# Search the query with letter A and click the Search btn
driver.find_element(By.ID, 'keytext').send_keys("a")
driver.find_element(By.ID, 'search').click()
time.sleep(2)

next_button = driver.find_element_by_id("Button1")
data = []
try:
    while next_button:
        soup = BeautifulSoup(driver.page_source, 'html.parser')
        table = soup.find('table', {'id': 'T1'})  # main table
        table_body = table.find('tbody')          # get inside the body
        rows = table_body.find_all('tr')          # look for all table rows
        for row in rows:
            cols = row.find_all('td')             # in every table row, look for table data
            for row2 in cols:
                # table -> tbody -> tr -> td -> <b> --> exit loop
                # (only the first tr is our required data, print this)
The final outcome I expect is a list of all the affiliation numbers across the multiple pages.
A minor addition to the code within your while loop:
next_button = 1  # Initialise the variable for the first instance of the while loop
while next_button:
    # First scroll to the bottom of the page
    driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
    # Now locate the button & click on it
    next_button = driver.find_element(By.ID, 'Button1')
    next_button.click()
    ###
    ### Beautiful Soup code: fetch the page source now & do your thing ###
    ###
    # Adjust the timing as per your requirement
    time.sleep(2)
Note that scrolling to the bottom of the page is important; otherwise an error will pop up claiming the 'Button1' element is hidden under the footer. With the script at the beginning of the loop, the browser moves down to the bottom of the page, where it can see the 'Button1' element clearly. Now locate the element, perform the click action, and then let Beautiful Soup take over.
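For the Beautiful Soup part of the loop, a rough sketch of collecting the affiliation numbers into the data list from your code could look like this; it assumes, as the commented-out question code suggests, that the number sits in a <b> tag inside the table with id T1:
soup = BeautifulSoup(driver.page_source, 'html.parser')
table = soup.find('table', {'id': 'T1'})
for row in table.find_all('tr'):
    bold = row.find('b')  # assumption: the affiliation number is the bold cell in each row
    if bold:
        data.append(bold.text.strip())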

BeautifulSoup returns empty span elements?

I'm trying to pull prices from Binance's home page and BeautifulSoup returns empty elements for me. Binance's home page is at https://www.binance.com/en/, and the interesting block I'm trying to get text from is:
<div class="sc-62mpio-0-sc-iAyFgw iQwJlO" color="#999"><span>"/" "$" "35.49"</span></div>
On Binance's home page is a table and one of the columns is titled "Last Price". Next to the last price is the last USD price in a faded gray color and I'm trying to pull every one of those. Here's my code so far.
import requests
from bs4 import BeautifulSoup

def grabPrices():
    page = requests.get("https://www.binance.com/en")
    soup = BeautifulSoup(page.text, "lxml")
    prices = soup.find_all("span", {"class": None})
    print(prices)
But the output is just a large array of "–" tags.
Selenium is one way of scraping the table content you want from this Binance page. Google Selenium for its setup (it pretty much comes down to downloading a driver and placing it on your local disk; if you are a Chrome user, see the ChromeDriver download page). Here is my code to access the content you are interested in:
from selenium import webdriver
from selenium.webdriver.support.ui import Select
from parsel import Selector  # Selector comes from the parsel package (Scrapy's Selector works the same way)
import time

driver = webdriver.Chrome(executable_path=r'C:\chromedriver\chromedriver.exe')
time.sleep(3)  # Allow time to launch the controlled browser
driver.get('https://www.binance.com/en/')
time.sleep(3)  # Allow time to load the page

sel = Selector(text=driver.page_source)
Table = sel.xpath('//*[@id="__next"]/div/main/div[4]/div/div[2]/div/div[2]/div/div[2]/div')
Table.extract()  # This basically gives you all of the table's content
Then if you further process the entire table content with something like:
tb_rows = Table.xpath('.//div/a//div//div//span/text()').extract()
tb_rows  # a list holding the text of every span in the table
At this point the result is narrowed down to pretty much what you are interested in, but notice that the two components of the last price (the number and the dollar value) are stored in two separate tags in the source page, so we can do the following to combine them:
for n in range(0, len(tb_rows), 2):
    LastPrice = tb_rows[n] + tb_rows[n + 1]
    print(LastPrice)  # Other than print, you could store each element in a list

driver.quit()  # don't forget to quit the driver at the end
The final output is one combined last price (number plus USD value) per line.

Need to Scrape Paginated Pages in Python Selenium

I have a selenium / python script that scrapes page titles and some other information. At the bottom of the page is a "next" button along with some pagination that loads the next 20 results or so when I click next. This all happens without a page load. I need to be able to scrape the remaining pages until the "next" button is no longer visible, which indicates there are no more results to be loaded. Below is the logic I have so far to give you an idea. I have simplified it so it's easily followed. I can scrape the first page of titles, but once the browser clicks "next" the script terminates. How do I get it to scrape the remaining pages? Thanks!
# loads web page
browser.get("URL")

# scrapes titles
deal_title = browser.find_elements_by_xpath("element xpath")
titles = []
for title in deal_title:
    titles.append(title.text)

# clicks next button
browser.find_element_by_xpath("button xpath").click()

print(title)
You need a loop to repeat the process. This should work. You might also want to put in sufficient sleeps or waits to make sure all elements on the page get loaded. And maybe try not to use XPath as much; if you can target a class or id, that would be better.
from selenium.common.exceptions import NoSuchElementException

titles = []
while True:
    deal_title = browser.find_elements_by_xpath("element xpath")
    for title in deal_title:
        titles.append(title.text)
    try:
        browser.find_element_by_xpath("xpath of the next button").click()
    except NoSuchElementException:
        break
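As a variant of the same loop (a sketch only, since the real locators aren't shown in the question), using an explicit wait and class-name locators instead of XPath could look like this; "deal-title" and "next-button" are placeholder class names you would replace with the real ones:
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.common.exceptions import NoSuchElementException

titles = []
while True:
    # wait until the titles are present before scraping ("deal-title" is a placeholder class)
    WebDriverWait(browser, 10).until(
        EC.presence_of_all_elements_located((By.CLASS_NAME, "deal-title"))
    )
    for element in browser.find_elements(By.CLASS_NAME, "deal-title"):
        titles.append(element.text)
    try:
        browser.find_element(By.CLASS_NAME, "next-button").click()
    except NoSuchElementException:
        break  # no "next" button left, so we are on the last page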
