Click multiple links and get their URLs - Python

I need to get the detail-page URLs of the match links on this webpage: https://www.sportybet.com/ke/sport/football/today
What I want:
I want to click on Man City vs PSG, copy the detail match URL and print it, then do the same for the next match, Holstein Kiel vs SV Sandhausen, and likewise for all the matches on the page.
I have this Selenium code below for just one match:
driver.find_element_by_xpath('//*[@id="importMatch"]/div[2]/div/div[3]/div[2]/div[3]').click()
get_url = driver.current_url
print(get_url)
I need help getting all the match URLs with a loop, or any better suggestion.

If I understand correctly what you are asking for, you should do the following:
links = driver.find_elements_by_xpath("//div[@class='match-league']//div[contains(@class,'market-size')]")
for link in links:
    link.click()
    time.sleep(1)
    url = driver.current_url
    print(url)
    driver.execute_script("window.history.go(-1)")
    time.sleep(2)
On the first line you collect the elements you can click to expand. Then the for loop iterates through all these elements, clicks each one, gets the URL, and goes back to the main page.
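One caveat worth adding: after window.history.go(-1) the elements collected before the first click may go stale, because the list page is re-rendered. Below is a minimal sketch that re-locates the elements on every iteration, assuming the XPath from the answer above really matches the clickable match rows:

import time
from selenium import webdriver

driver = webdriver.Chrome()
driver.get("https://www.sportybet.com/ke/sport/football/today")
time.sleep(5)  # crude wait for the dynamic content to load

match_xpath = "//div[@class='match-league']//div[contains(@class,'market-size')]"
count = len(driver.find_elements_by_xpath(match_xpath))

for i in range(count):
    # Re-locate the rows each time; references from before the
    # back-navigation would raise StaleElementReferenceException.
    rows = driver.find_elements_by_xpath(match_xpath)
    rows[i].click()
    time.sleep(1)
    print(driver.current_url)
    driver.execute_script("window.history.go(-1)")
    time.sleep(2)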

Related

How to scrape a page that is dynamically loaded?

So here's my problem. I wrote a program that is perfectly able to get all of the information I want on the first page that I load. But when I click on the nextPage button, it runs a script that loads the next batch of products without actually moving to another page.
So when I run the next loop, all I get is the same content as the first one, even though what is shown in the browser I'm emulating is different.
This is the code I run:
from selenium import webdriver
from selenium.webdriver.common.by import By
from bs4 import BeautifulSoup
import time

driver = webdriver.Chrome()  # assuming a Chrome driver; the snippet omitted the setup
driver.get("https://www.my-website.com/search/results-34y1i")
time.sleep(2)
soup = BeautifulSoup(driver.page_source, 'html.parser')

# /////////// code to find total number of pages
currentPage = 0
button_NextPage = driver.find_element(By.ID, 'nextButton')
while currentPage != totalPages:
    # ///////// code to find the products
    currentPage += 1
    button_NextPage = driver.find_element(By.ID, 'nextButton')
    button_NextPage.click()
    time.sleep(5)
Is there any way for me to scrape exactly what's loaded in my browser?
The issue seems to be that you're only fetching page 1, as shown in the following line:
driver.get("https://www.tcgplayer.com/search/magic/commander-streets-of-new-capenna?productLineName=magic&setName=commander-streets-of-new-capenna&page=1&view=grid")
But as you can see, there is a query parameter called page in the URL that determines which page of HTML you are fetching. So every time you loop to a new page, you have to fetch the new HTML content with the driver by changing the page query parameter. For example, in your loop it will be something like this:
driver.get("https://www.tcgplayer.com/search/magic/commander-streets-of-new-capenna?productLineName=magic&setName=commander-streets-of-new-capenna&page={page}&view=grid".format(page=currentPage))
After you fetch the new HTML structure, you will be able to access the new elements that are present on the different pages, as required.
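Putting that together, a minimal sketch of the loop, assuming totalPages has already been computed and that the product-finding code works on a single page:

base_url = ("https://www.tcgplayer.com/search/magic/commander-streets-of-new-capenna"
            "?productLineName=magic&setName=commander-streets-of-new-capenna"
            "&page={page}&view=grid")

for currentPage in range(1, totalPages + 1):
    # Fetch each page by changing the page query parameter
    driver.get(base_url.format(page=currentPage))
    time.sleep(5)  # crude wait; WebDriverWait would be more robust
    soup = BeautifulSoup(driver.page_source, 'html.parser')
    # ///////// code to find the products goes here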

Python - Web scraping a page

My code is supposed to go into a website, navigate through two pages, and print out all the titles and URLs/hrefs within each row.
Currently, my code goes into these two pages fine, but it only prints out the first title of each page rather than the title of each row.
The page does have some JavaScript, and I think maybe this is why it does not show any links/URLs/hrefs within each of these rows. Ideally I'd like to print the URLs of each row.
from selenium import webdriver
import time

driver = webdriver.Chrome()
for x in range(1, 3):
    driver.get(f'https://www.abstractsonline.com/pp8/#!/9325/presentations/endometrial/{x}')
    time.sleep(3)
    page_source = driver.page_source
    eachrow = driver.find_elements_by_xpath("//li[@class='result clearfix']")
    for item in eachrow:
        title = driver.find_element_by_xpath("//span[@class='bodyTitle']").text
        print(title)
You're using driver inside your for loop, meaning you're searching the whole page, so you will always get the same element.
You want to search from each item instead:
for item in eachrow:
    title = item.find_element_by_xpath(".//span[@class='bodyTitle']").text
Also, there are no "URLs" in the rows as mentioned; when you click on a row, the data-id attribute is used in the request:
<h1 class="name" data-id="1989" data-key="">
Which sends a request to https://www.abstractsonline.com/oe3/Program/9325/Presentation/694
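For completeness, a sketch of the corrected loop from the question, with the title lookup scoped to each row as the answer suggests; the data-id lookup assumes the h1 with class "name" sits inside each result row, which should be verified against the real markup:

from selenium import webdriver
import time

driver = webdriver.Chrome()
for x in range(1, 3):
    driver.get(f'https://www.abstractsonline.com/pp8/#!/9325/presentations/endometrial/{x}')
    time.sleep(3)
    for item in driver.find_elements_by_xpath("//li[@class='result clearfix']"):
        # The leading "." anchors the search to this row, not the whole page
        title = item.find_element_by_xpath(".//span[@class='bodyTitle']").text
        # Assumed location of the data-id attribute (see the h1 above)
        data_id = item.find_element_by_xpath(".//h1[@class='name']").get_attribute('data-id')
        print(title, data_id)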

How to scrape multiple links with selenium after manual login?

I am trying to automatically collect articles from a database which first requires me to log in.
I have written the following code using Selenium to open the search results page, then wait and allow me to log in. That works, and it can get the links to each item in the search results.
I then want to continue using Selenium to visit each of the links in the search results and collect the article text:
browser = webdriver.Firefox()
browser.get("LINK")
time.sleep(60)  # pause to log in manually
lnks = browser.find_elements_by_tag_name("a")[20:40]
for lnk in lnks:
    link = lnk.get_attribute('href')
    print(link)
I can't get any further. How should I then make it visit these links in turn and get the text of the articles for each one?
When I tried adding driver.get(link) to the for loop, I got a selenium.common.exceptions.StaleElementReferenceException.
On the request of the database owner, I have removed the screenshots previously posted in this post, as well as information about the database. I would like to delete the post completely, but am unable to do so.
You need to seek out bs4 tutorials, but here is a starter:
import bs4

html_source_code = browser.execute_script("return document.body.innerHTML;")
soup = bs4.BeautifulSoup(html_source_code, 'lxml')
links = soup.find_all('what-ever-the-html-code-is')
for l in links:
    print(l['href'])
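To address the StaleElementReferenceException from the question more directly: collect the plain href strings first, then navigate, so no WebElement is reused after leaving the page. A minimal sketch, carrying over the [20:40] slice and the 60-second manual-login pause from the question; the "article" tag used for the text is a hypothetical placeholder:

import time
from selenium import webdriver

browser = webdriver.Firefox()
browser.get("LINK")
time.sleep(60)  # log in manually during this pause

# Extract href strings up front; plain strings cannot go stale.
links = [a.get_attribute('href')
         for a in browser.find_elements_by_tag_name("a")[20:40]]

for link in links:
    browser.get(link)
    time.sleep(3)
    # Hypothetical selector: replace with whatever element wraps the article text
    print(browser.find_element_by_tag_name("article").text)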

WebScraping Next pages with Selenium

When I navigate to the link below and locate the pagination at the bottom of the page:
https://shop.nordstrom.com/c/sale-mens-clothing?origin=topnav&breadcrumb=Home%2FSale%2FMen%2FClothing&sort=Boosted
I am only able to scrape the first 4 or so pages, then my script stops.
I have tried with xpath, css_selector, and with the WebDriverWait options:
pages_remaining = True
page = 2  # starts at page 2 since page one is scraped already with the first loop
while pages_remaining:
    # scrape code
    try:
        wait = WebDriverWait(browser, 20)
        wait.until(EC.element_to_be_clickable((By.LINK_TEXT, str(page)))).click()
        print(browser.current_url)
        page += 1
    except TimeoutException:
        pages_remaining = False
Current results from the console:
https://shop.nordstrom.com/c/sale-mens-designer-clothing-accessories-shoes?breadcrumb=Home%2FSale%2FMen%2FDesigner&page=2&sort=Boosted
https://shop.nordstrom.com/c/sale-mens-designer-clothing-accessories-shoes?breadcrumb=Home%2FSale%2FMen%2FDesigner&page=3&sort=Boosted
https://shop.nordstrom.com/c/sale-mens-designer-clothing-accessories-shoes?breadcrumb=Home%2FSale%2FMen%2FDesigner&page=4&sort=Boosted
This solution is a BeautifulSoup one, because I am not too familiar with Selenium.
Try to create a new variable with your number of pages. As you can see, when you move to the next page the URL changes, so just manipulate the given URL. See my code example below.

from requests import get  # assuming requests' get, which the snippet calls

# Define variable pages first
pages = [str(i) for i in range(1, 53)]  # 53 because you have 52 pages

for page in pages:
    response = get("https://shop.nordstrom.com/c/sale-mens-clothing?origin=topnav&breadcrumb=Home%2FSale%2FMen%2FClothing&page=" + page + "&sort=Boosted")
    # Rest of your code

This snippet should do the job for the rest of the pages. Hope that helps, although this might not be exactly what you have been looking for.
If you have any questions, just post below. ;) Cheers.
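Filling in the "Rest of your code" placeholder, a hedged sketch of the requests + BeautifulSoup version; the product-title class name is a hypothetical placeholder to be replaced after inspecting the real page:

from bs4 import BeautifulSoup
from requests import get

base_url = ("https://shop.nordstrom.com/c/sale-mens-clothing"
            "?origin=topnav&breadcrumb=Home%2FSale%2FMen%2FClothing"
            "&page={}&sort=Boosted")

for page in range(1, 53):  # 52 pages, per the answer
    response = get(base_url.format(page))
    soup = BeautifulSoup(response.text, 'html.parser')
    # "product-title" is a made-up class; inspect the page for the real markup
    for item in soup.find_all(class_="product-title"):
        print(item.get_text(strip=True))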
You could loop through page numbers until no more results are shown, by just changing the URL:

from bs4 import BeautifulSoup
from selenium import webdriver

base_url = "https://m.shop.nordstrom.com/c/sale-mens-clothing?origin=topnav&breadcrumb=Home%2FSale%2FMen%2FClothing&page={}&sort=Boosted"
driver = webdriver.Chrome()
page = 1
soup = BeautifulSoup("", 'html.parser')

# Will loop until there are no more results
while "Looks like we don’t have exactly what you’re looking for." not in soup.text:
    print(base_url.format(page))
    # Go to page
    driver.get(base_url.format(page))
    soup = BeautifulSoup(driver.page_source, 'html.parser')
    ### your extracting code
    page += 1

Need to Scrape Paginated Pages in Python Selenium

I have a selenium / python script that scrapes page titles and some other information. At the bottom of the page is a "next" button along with some pagination that loads the next 20 results or so when I click next. This all happens without a page load. I need to be able to scrape the remaining pages until the "next" button is no longer visible, which indicates there are no more results to be loaded. Below is the logic I have so far to give you an idea. I have simplified it so it's easily followed. I can scrape the first page of titles, but once the browser clicks "next" the script terminates. How do I get it to scrape the remaining pages? Thanks!
# loads web page
browser.get("URL")

# scrapes titles
deal_title = browser.find_elements_by_xpath("element xpath")
titles = []
for title in deal_title:
    titles.append(title.text)

# clicks next button
browser.find_element_by_xpath("button xpath").click()
print(titles)
You need a loop to repeat the process. This should work. You might want to put in sufficient sleeps or waits to make sure all the elements on the page get loaded. Also, maybe try not to use XPath as much; if you can target a class or id, that would be better.
from selenium.common.exceptions import NoSuchElementException

titles = []
while True:
    deal_title = browser.find_elements_by_xpath("element xpath")
    for title in deal_title:
        titles.append(title.text)
    try:
        browser.find_element_by_xpath("xpath of the next button").click()
    except NoSuchElementException:
        break
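Building on the advice about sleeps and waits, here is the same loop sketched with an explicit wait instead of fixed sleeps; the XPaths remain the question's placeholders:

from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.common.exceptions import NoSuchElementException

titles = []
while True:
    # Wait until the title elements are present before scraping them
    WebDriverWait(browser, 10).until(
        EC.presence_of_all_elements_located((By.XPATH, "element xpath")))
    titles.extend(t.text for t in browser.find_elements_by_xpath("element xpath"))
    try:
        browser.find_element_by_xpath("xpath of the next button").click()
    except NoSuchElementException:
        break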
