Scraping news articles using Selenium Python - python

I am learning to scrape news articles from the website https://tribune.com.pk/pakistan/archives. The first step is to collect the link to every news article. The problem is that the <a> markup for each article contains two href values, and I only want the first one, which I have been unable to get.
I am attaching the HTML of that particular part.
The code I have written returns two href values, but I only want the first one:
def Url_Extraction():
    category_name = driver.find_element(By.XPATH, '//*[@id="main-section"]/h1')
    cat = category_name.text  # Save category name in variable
    print(f"{cat}")
    news_articles = driver.find_elements(By.XPATH, "//div[contains(@class,'flex-wrap')]//a")
    for element in news_articles:
        URL = element.get_attribute('href')
        print(URL)
        Url.append(URL)
        Category.append(cat)
    current_time = time.time() - start_time
    print(f'{len(Url)} urls extracted')
    print(f'{len(Category)} categories extracted')
    print(f'Current Time: {current_time / 3600:.2f} hr, {current_time / 60:.2f} min, {current_time:.2f} sec',
          flush=True)
Moreover I am able to paginate but I can't get the full article by clicking the individual links given on the main page.

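You have to modify the XPath. The double slash in //a matches every <a> descendant of the div, at any depth, while a single slash matches only <a> elements that are direct children of the div.
Instead of this -
news_articles = driver.find_elements(By.XPATH, "//div[contains(@class,'flex-wrap')]//a")
Use this -
news_articles = driver.find_elements(By.XPATH, "//div[contains(@class,'flex-wrap')]/a")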
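The difference between the two step separators can be checked without a browser. The sketch below is only an illustration: the markup and example.com URLs are hypothetical (the real page structure is not shown in the question), and the standard library's xml.etree.ElementTree stands in for Selenium. ElementTree supports only a subset of XPath, so an exact class match replaces contains().

```python
import xml.etree.ElementTree as ET

# Hypothetical markup: one article card with a headline link as a direct
# child of the div and a second link nested one level deeper.
SAMPLE = """
<body>
  <div class="flex-wrap">
    <a href="https://example.com/story/1"><span>Headline</span></a>
    <div class="meta">
      <a href="https://example.com/author/1">Author</a>
    </div>
  </div>
</body>
"""

root = ET.fromstring(SAMPLE)

# '//a' after the div: every <a> below it, at any depth.
all_descendants = [
    a.get("href") for a in root.findall(".//div[@class='flex-wrap']//a")
]

# '/a' after the div: only <a> elements that are direct children.
direct_children = [
    a.get("href") for a in root.findall(".//div[@class='flex-wrap']/a")
]

print(len(all_descendants))  # 2
print(direct_children)       # ['https://example.com/story/1']
```

If the unwanted link on the real page is also a direct child of the div, the single slash will not filter it out, and a positional predicate such as /a[1] may be needed instead.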

Related

Python Selenium - How do you extract a link from an element with no href? [duplicate]

I am trying to iterate through a series of car listings and return the links to the individual CarFax and Experian Autocheck documents for each listing.
Page I am trying to pull the links from
The XPATH for the one constant parent element across all child elements I am looking for is:
.//div[@class="display-inline-block align-self-start"]/div[1]
I initially tried to simply extract the href attribute from the child <div> and <a> tags at this XPATH: .//div[@class="display-inline-block align-self-start"]/div[1]/a[1]
This works great for some of the listings but does not work for others that have no <a> tag and instead include a <span> tag with an inline text link reading "Get AutoCheck Vehicle History".
That link functions correctly on the page, but there is no href attribute or any link I can find attached to the element in the page, and I do not know how to scrape it with Selenium. Any advice would be appreciated as I am new to Python and Selenium.
For reference, here is the code I was using to scrape the page. It eventually raises an IndexError, because only some of the listings have the <a> tag, so the number of matches is smaller than the total number of listings on the page indicated by len(name):
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.by import By

s = Service('/Users/admin/chromedriver')
driver = webdriver.Chrome(service=s)
driver.get("https://www.autotrader.com/cars-for-sale/ferrari/458-spider/beverly-hills-ca-90210?dma=&searchRadius=0&location=&isNewSearch=true&marketExtension=include&showAccelerateBanner=false&sortBy=relevance&numRecords=100")
nameList = []
autoCheckList = []
name = driver.find_elements(By.XPATH, './/h2[@class="text-bold text-size-400 text-size-sm-500 link-unstyled"]')
autoCheck = driver.find_elements(By.XPATH, './/div[@class="display-inline-block align-self-start"]/div[1]/a[1]')
for i in range(len(name)):
    nameList.append(name[i].text)
    autoCheckList.append(autoCheck[i].get_attribute('href'))

Python- WebScraping a page

My code is supposed to go into a website, navigate through 2 pages, and print out all the titles and URLs/hrefs within each row.
Currently, my code goes into these 2 pages fine, but it only prints the first title of each page rather than the title of each row.
The page does have some JavaScript, and I think maybe this is why it does not show any links/URLs/hrefs within each of these rows? Ideally I'd like to print the URL of each row.
from selenium import webdriver
import time
driver = webdriver.Chrome()
for x in range(1, 3):
    driver.get(f'https://www.abstractsonline.com/pp8/#!/9325/presentations/endometrial/{x}')
    time.sleep(3)
    page_source = driver.page_source
    eachrow = driver.find_elements_by_xpath("//li[@class='result clearfix']")
    for item in eachrow:
        title = driver.find_element_by_xpath("//span[@class='bodyTitle']").text
        print(title)
You're using driver inside your for loop, which means you're searching the whole page, so you will always get the same element.
You want to search from each item instead. Note the leading . in the XPath, which makes it relative to item.
for item in eachrow:
    title = item.find_element_by_xpath(".//span[@class='bodyTitle']").text
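The effect of where the search starts can be reproduced offline. This is a rough sketch, not the page's real markup: the two-row fragment is made up, and the standard library's xml.etree.ElementTree stands in for Selenium. Calling find on the page root plays the role of driver.find_element, and calling find on a row plays the role of item.find_element with a relative XPath.

```python
import xml.etree.ElementTree as ET

# Hypothetical fragment with two result rows, each holding its own title.
SAMPLE = """
<ul>
  <li class="result clearfix"><span class="bodyTitle">First title</span></li>
  <li class="result clearfix"><span class="bodyTitle">Second title</span></li>
</ul>
"""

page = ET.fromstring(SAMPLE)
rows = page.findall(".//li[@class='result clearfix']")

# Searching from the page root inside the loop returns the first match
# on the page every time, regardless of the current row.
from_page = [page.find(".//span[@class='bodyTitle']").text for _ in rows]

# Searching from each row returns that row's own title.
from_row = [row.find(".//span[@class='bodyTitle']").text for row in rows]

print(from_page)  # ['First title', 'First title']
print(from_row)   # ['First title', 'Second title']
```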
Also, there are no "URLs" in the rows as mentioned - when you click on a row the data-id attribute is used in the request.
<h1 class="name" data-id="1989" data-key="">
Which sends a request to https://www.abstractsonline.com/oe3/Program/9325/Presentation/694

Webscraping with varying page numbers

So I'm trying to webscrape a bunch of profiles. Each profile has a collection of videos. I'm trying to webscrape information about each video. The problem I'm running into is that each profile uploads a different number of videos, so the number of pages containing videos per profile varies. For instance, one profile has 45 pages of videos, as you can see by the html below:
<div class="pagination "><ul><li><a class="active" href="">1</a></li><li>2</li><li>3</li><li>4</li><li>5</li><li>6</li><li>7</li><li>8</li><li>9</li><li>10</li><li>11</li><li class="no-page">...<li>45</li><li><a href="#1" class="no-page next-page"><span class="mobile-hide">Next</span>
While another profile has 2 pages
<div class="pagination "><ul><li><a class="active" href="">1</a></li><li>2</li><li><a href="#1" class="no-page next-page"><span class="mobile-hide">Next</span>
My question is, how do I account for the varying number of pages? I was thinking of making a for loop with an arbitrarily large upper bound, like
for i in range(0,1000):
    new_url = 'url' + str(i)
where i accounts for the page, but I want to know if there's a more efficient way of doing this.
Thank you.
The "skeleton" of the loop can look like this:
import requests
from bs4 import BeautifulSoup

url = 'http://url/?page={page}'
page = 1
while True:
    soup = BeautifulSoup(requests.get(url.format(page=page)).content, 'html.parser')
    # ...
    # do we have next page?
    next_page = soup.select_one('.next-page')
    # no, so break from the loop
    if not next_page:
        break
    page += 1
You can use an infinite while True: loop and break out of it only when there is no next page (that is, when the last page has no tag with class="next-page").
Get the <li>...</li> elements of the <div class="pagination "><ul>
Exclude the last one by its class <li class="no-page">
Parse the "href" and build your next url destinations.
Scrape every new url destination.
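One way to act on the steps above is to read the pagination items and take the highest page number. The sketch below is an assumption-laden illustration: it relies on the page numbers appearing as the text of the <li> items, as in the HTML quoted in the question, and it uses the standard library's html.parser in place of BeautifulSoup so that it runs without third-party packages. The names PaginationLabels and last_page are made up for this example.

```python
from html.parser import HTMLParser


class PaginationLabels(HTMLParser):
    """Collect the text of every <li> element, one string per item."""

    def __init__(self):
        super().__init__()
        self.labels = []
        self._in_li = False

    def handle_starttag(self, tag, attrs):
        if tag == "li":
            self._in_li = True
            self.labels.append("")

    def handle_endtag(self, tag):
        if tag == "li":
            self._in_li = False

    def handle_data(self, data):
        if self._in_li:
            self.labels[-1] += data


def last_page(html):
    """Return the highest numeric page label, or 1 if none is found."""
    parser = PaginationLabels()
    parser.feed(html)
    parser.close()
    # Non-numeric labels such as "..." and "Next" are skipped.
    numbers = [int(label) for label in parser.labels if label.strip().isdigit()]
    return max(numbers, default=1)


# The two-page pagination block quoted in the question.
two_pages = (
    '<div class="pagination "><ul><li><a class="active" href="">1</a></li>'
    '<li>2</li><li><a href="#1" class="no-page next-page">'
    '<span class="mobile-hide">Next</span>'
)
print(last_page(two_pages))  # 2
```

With the last page known, the page URLs can be built with range(1, last_page(html) + 1) instead of guessing an upper bound. Whether numbering starts at 0 or 1 depends on the site and is not stated in the question.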
I just want to thank everyone who did answer for taking the time to answer my question. I figured out the answer - or at least what worked for me- and decided to share in case it would be helpful for anyone else.
import requests
from bs4 import BeautifulSoup

url = 'insert url'
response = requests.get(url)
soup = BeautifulSoup(response.content, 'html.parser')
# look for pagination class
page = soup.find(class_='pagination')
# create list to include all page numbers
href = []
# look for all 'li' tags as the users above suggested
links = page.findAll('li')
for link in links:
    href += [link.find('a', href=True).text]
'''
href will now include all pages and the word Next.
So for instance it will look something like this: [1,2,3...,44,Next].
I want to get 44, which will be href[-2], and then convert that to an int for
a for loop. In the for loop add + 1 because range stops at i-1 instead of
i. For instance, if you're iterating range(0,44), the last value of i will be 43,
which is why we +1
'''
for i in range(0, int(href[-2]) + 1):
    new_url = url + str(i)  # append the page number to the base url

Web scraping with python and selenium

I'm new to Stack Overflow and have been learning Python for a couple of months now. I am in the process of writing a script which logs on to a website (which I am a subscriber of) and scrapes article titles and text.
So far I have been able to log on to the website, get to the page with the article titles, and pull the titles for the first page. However, I am having trouble cycling through the pages.
from selenium import webdriver
chrome_path = r"C:\Users\user.name\Desktop\chromedriver.exe"
driver = webdriver.Chrome(chrome_path)
driver.get("http://www.WEBSITE.co.uk/")
driver.find_element_by_name("ctl00$LoginView1$Login1$UserName").send_keys('USERNAME')  # Enters username
driver.find_element_by_name("ctl00$LoginView1$Login1$Password").send_keys('PASSWORD')  # Enters password
driver.find_element_by_name("ctl00$LoginView1$Login1$Submit").click()  # Submits username/password
driver.find_element_by_xpath('//*[@id="middle_col"]/div[2]/div[1]/a[1]').click()  # Clicks on more articles
def title_scraper(max_pages):  # A loop to cycle through xpaths of various pages (?)
    page = 2  # Set at 2 for test circa 40 in total
    while page < max_pages:
        # xpath = //*[@id="ctl00_mainContentArea_ArticleListing1_gvwArticles"]/tbody/tr[11]/td/table/tbody/tr/td[1]/a - it is td[1] which increases depending on page number
        newPage = '//*[@id="ctl00_mainContentArea_ArticleListing1_gvwArticles"]/tbody/tr[11]/td/table/tbody/tr/td[' + str(page) + ']/a'
        driver.find_element_by_xpath(newPage).click()
        # Scrapes article titles, currently only does the first page
        titles = driver.find_elements_by_class_name("articletitle")
        for title in titles:
            print(title.text)
Sorry if this has already been answered, I have had no luck with online resources so far!
Update:
def title_scraper(max_pages):
    page = 2
    while page < max_pages:
        path = '//*[@id="ctl00_mainContentArea_ArticleListing1_gvwArticles"]/tbody/tr[11]/td/table/tbody/tr/td[' + str(
            max_pages) + ']/a'
        driver.find_element_by_xpath(path)
        titles = driver.find_elements_by_class_name("articletitle")
        for title in titles:
            print(title.text)
