python selenium scrape href (link) from website

I have this site https://jobs.ubs.com/TGnewUI/Search/home/HomeWithPreLoad?partnerid=25008&siteid=5012&PageType=searchResults&SearchType=linkquery&LinkID=6017#keyWordSearch=&locationSearch=
I want to scrape the link for each job role, the HTML source for one of the roles is:
<a id="Job_1" href="https://jobs.ubs.com/TGnewUI/Search/home/HomeWithPreLoad?partnerid=25008&siteid=5012&PageType=JobDetails&jobid=223876" ng-class="oQ.ClassName" class="jobProperty jobtitle" ng-click="handlers.jobClick($event, this)" ng-bind-html="$root.utils.htmlEncode(oQ.Value)">Technology Delivery Lead (IB Technology)</a>
I have tried this:
job_link = driver.find_elements_by_css_selector(".jobProperty.jobtitle ['href']")
for job_link in job_link:
    job_link = job_link.text
    print(job_link)
But it simply returns nothing. Can someone kindly help?

Why not just print out its href attribute with get_attribute()?
job_links = driver.find_elements_by_css_selector(".jobProperty.jobtitle")
for job_link in job_links:
    print(job_link.get_attribute('href'))
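Note that the find_elements_by_* helpers were removed in Selenium 4, so on a current install the same answer would look roughly like this (a sketch, assuming the same driver and page):

from selenium.webdriver.common.by import By

# Same idea with the Selenium 4 locator API.
job_links = driver.find_elements(By.CSS_SELECTOR, ".jobProperty.jobtitle")
for job_link in job_links:
    print(job_link.get_attribute('href'))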

Related

Scraping news articles using Selenium Python

I am learning to scrape news articles from the website https://tribune.com.pk/pakistan/archives. The first step is to scrape the link of every news article. The problem is that the <a> markup contains two hrefs, and I want only the first one, which I am unable to get.
I am attaching the HTML of that particular part.
The code I have written returns both href tags, but I only want the first one:
# Note: driver, Url, Category and start_time are defined elsewhere in the script.
def Url_Extraction():
    category_name = driver.find_element(By.XPATH, '//*[@id="main-section"]/h1')
    cat = category_name.text  # Save category name in variable
    print(f"{cat}")
    news_articles = driver.find_elements(By.XPATH, "//div[contains(@class,'flex-wrap')]//a")
    for element in news_articles:
        URL = element.get_attribute('href')
        print(URL)
        Url.append(URL)
        Category.append(cat)
    current_time = time.time() - start_time
    print(f'{len(Url)} urls extracted')
    print(f'{len(Category)} categories extracted')
    print(f'Current Time: {current_time / 3600:.2f} hr, {current_time / 60:.2f} min, {current_time:.2f} sec',
          flush=True)
Moreover, I am able to paginate, but I can't get the full article by clicking the individual links given on the main page.
You have to modify the below XPath.
Instead of this -
news_articles = driver.find_elements(By.XPATH, "//div[contains(@class,'flex-wrap')]//a")
Use this -
news_articles = driver.find_elements(By.XPATH, "//div[contains(@class,'flex-wrap')]/a")
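The difference is that //a matches anchors at any depth below the div, while /a matches only direct children. A minimal self-contained illustration (the sample markup here is assumed, not taken from the site):

from lxml import html

# One direct-child link plus one nested link, mimicking the two-href problem.
fragment = html.fromstring("""
<div class="flex-wrap">
  <a href="/story">main story link</a>
  <span><a href="/related">nested link</a></span>
</div>
""")

print(fragment.xpath("//div[contains(@class,'flex-wrap')]//a/@href"))  # ['/story', '/related']
print(fragment.xpath("//div[contains(@class,'flex-wrap')]/a/@href"))   # ['/story']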

Python Selenium: How to pull a link from an element with no href

I am trying to iterate through a series of car listings and return the links to the individual CarFax and Experian Autocheck documents for each listing.
Page I am trying to pull the links from
The XPATH for the one constant parent element across all child elements I am looking for is:
.//div[@class="display-inline-block align-self-start"]/div[1]
I initially tried to simply extract the href attribute from the child <div> and <a> tags at this XPath: .//div[@class="display-inline-block align-self-start"]/div[1]/a[1]
This works great for some of the listings, but fails for others that have no <a> tag and instead include a <span> tag whose inline text link reads "Get AutoCheck Vehicle History".
That link functions correctly on the page, but there is no href attribute or any link I can find attached to the element in the page and I do not know how to scrape it with Selenium. Any advice would be appreciated as I am new to Python and Selenium.
For reference, here is the code I was using to scrape the page. (It eventually raises an IndexError, because only some of the listings have the <a> tag, so the number of links found does not match the total number of listings indicated by len(name).)
s = Service('/Users/admin/chromedriver')
driver = webdriver.Chrome(service=s)
driver.get("https://www.autotrader.com/cars-for-sale/ferrari/458-spider/beverly-hills-ca-90210?dma=&searchRadius=0&location=&isNewSearch=true&marketExtension=include&showAccelerateBanner=false&sortBy=relevance&numRecords=100")
nameList = []
autoCheckList = []
name = driver.find_elements(By.XPATH, './/h2[@class="text-bold text-size-400 text-size-sm-500 link-unstyled"]')
autoCheck = driver.find_elements(By.XPATH, './/div[@class="display-inline-block align-self-start"]/div[1]/a[1]')
for i in range(len(name)):
    nameList.append(name[i].text)
    autoCheckList.append(autoCheck[i].get_attribute('href'))
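One way around the IndexError (a hedged sketch, reusing the XPaths from the question) is to scope the link lookup to each listing container so the two lists cannot drift out of alignment:

# Search for the link inside each listing; a missing <a> then yields an
# empty list for that listing instead of shifting all later indexes.
listings = driver.find_elements(By.XPATH, './/div[@class="display-inline-block align-self-start"]')
for listing in listings:
    anchors = listing.find_elements(By.XPATH, './div[1]/a[1]')
    autoCheckList.append(anchors[0].get_attribute('href') if anchors else None)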

python selenium webscraping (clicking buttons which show data and then extracting it)

So what I'm trying to do is this: on https://www.jobbank.gc.ca/jobsearch/jobsearch?sort=D&fsrc=16&fbclid=IwAR2SIG3lbY1S9lO4WilcKw6TxJAJQbFIGYTVE_tOTqYRpb43qM3uYgLWV64, open every listing; each one redirects to another page with a button ("Show how to apply") that reveals an email address when clicked. I want to scrape every job listing's title and email address. I have already scraped the titles and hrefs, but I have no idea what to do next (e.g. clicking on every job listing, then clicking "Show how to apply" and scraping the email from there). I hope you understand what I want to do (sorry for my English).
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.chrome.service import Service
import os

s = Service(r'C:\Program Files (x86)\chromedriver.exe')
driver = webdriver.Chrome(service=s)
driver.get('https://www.jobbank.gc.ca/jobsearch/jobsearch?sort=D&fsrc=16&fbclid=IwAR2SIG3lbY1S9lO4WilcKw6TxJAJQbFIGYTVE_tOTqYRpb43qM3uYgLWV64')

# Get titles of job listings
elements = []
for element in driver.find_elements(By.CLASS_NAME, 'resultJobItem'):
    title = element.find_element(By.XPATH, './/*[@class="noctitle"]').text
    if title not in elements:
        elements.append({'Title': title.split('\n')})

# Get all hrefs
links = driver.find_elements(By.XPATH, './/*[@class="results-jobs"]/article/a')
for link in links:
    elements.append({'Link': link.get_attribute('href')})
print(elements)
Looks like you can use their own API with a POST request to get the data.
You'll need to scrape the job id.
So for the job on this URL: https://www.jobbank.gc.ca/jobsearch/jobposting/35213663
I see that the job id is 1860693, so I'll need to post a request like this:
import requests
from bs4 import BeautifulSoup as BS

url = "https://www.jobbank.gc.ca/jobsearch/jobposting/35213663"
jobid = "1860693"
data = {
    'seekeractivity:jobid': f'{jobid}',
    'seekeractivity_SUBMIT': '1',
    'javax.faces.ViewState': 'stateless',
    'javax.faces.behavior.event': 'action',
    'jbfeJobId': f'{jobid}',
    'action': 'applynowbutton',
    'javax.faces.partial.event': 'click',
    'javax.faces.source': 'seekeractivity',
    'javax.faces.partial.ajax': 'true',
    'javax.faces.partial.execute': 'jobid',
    'javax.faces.partial.render': 'applynow',
    'seekeractivity': 'seekeractivity'
}
response = requests.post(url, data)
soup = BS(response.text, 'html.parser')
email = soup.a.text
print(email)
This gives me:
>> info@taylorlumber.ca
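To cover a whole results page you could repeat the same request for every posting. A hedged sketch, where job_pairs is a hypothetical list of (posting_url, job_id) tuples scraped from the search results beforehand:

# job_pairs is assumed to be scraped earlier; the payload above is reused per job.
for url, jobid in job_pairs:
    data['seekeractivity:jobid'] = jobid
    data['jbfeJobId'] = jobid
    response = requests.post(url, data)
    soup = BS(response.text, 'html.parser')
    print(soup.a.text)  # first <a> in the partial response holds the email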
I would store all the links separately.
So assume the variable all_links contains all the links. Now:
.
.
.
driver.quit()
link1 = all_links[0]  # take the first link as an example; you'd loop over all of them: for link in all_links
new_driver = webdriver.Chrome(service=s)
new_driver.get(link1)
new_driver.find_element_by_css_selector("#applynowbutton").click()
At this point the 'Show how to Apply' button has been clicked.
Unfortunately, I don't know too much about the page's HTML, but essentially at this point you can extract the email much like you extracted the links previously.
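If the revealed address is rendered as a mailto: link (an assumption about the page, not verified here), the extraction could look like:

# Assumption: after the click, the e-mail shows up as a mailto: link.
email_link = new_driver.find_element_by_css_selector('a[href^="mailto:"]')
print(email_link.get_attribute('href').replace('mailto:', ''))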
Try it like below:
You can apply scrollIntoView to each job option. When the loop reaches the end of the loaded results, click on the Show more button and continue extracting details.
import time

driver.get("https://www.jobbank.gc.ca/jobsearch/jobsearch?sort=D&fsrc=16&fbclid=IwAR2SIG3lbY1S9lO4WilcKw6TxJAJQbFIGYTVE_tOTqYRpb43qM3uYgLWV64")
i = 0
while True:
    try:
        jobs = driver.find_elements_by_xpath("//div[@class='results-jobs']/article")
        driver.execute_script("arguments[0].scrollIntoView(true);", jobs[i])
        title = jobs[i].find_element_by_xpath(".//span[@class='noctitle']").text
        link = jobs[i].find_element_by_tag_name("a").get_attribute("href")
        print(f"{i+1} - {title} : {link}")
        i += 1
        if i == 100:
            break
    except IndexError:
        driver.find_element_by_id("moreresultbutton").click()
        time.sleep(3)

How to scrape multiple links with selenium after manual login?

I am trying to automatically collect articles from a database which first requires me to login.
I have written the following code using selenium to open up the search results page, then wait and allow me to login. That works, and it can get the links to each item in the search results.
I then want to continue using Selenium to visit each of the links in the search results and collect the article text.
browser = webdriver.Firefox()
browser.get("LINK")
time.sleep(60)
lnks = browser.find_elements_by_tag_name("a")[20:40]
for lnk in lnks:
    link = lnk.get_attribute('href')
    print(link)
I can't get any further. How should I then make it visit these links in turn and get the text of the articles for each one?
When I tried adding driver.get(link) inside the for loop, I got a selenium.common.exceptions.StaleElementReferenceException.
On the request of the database owner, I have removed the screenshots previously posted in this post, as well as information about the database. I would like to delete the post completely, but am unable to do so.
You need to look at some bs4 tutorials, but here is a starter:
import bs4

html_source_code = browser.execute_script("return document.body.innerHTML;")
soup = bs4.BeautifulSoup(html_source_code, 'lxml')
links = soup.find_all('what-ever-the-html-code-is')  # placeholder tag name, replace with the real one
for l in links:
    print(l['href'])
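As for the StaleElementReferenceException mentioned in the question: the WebElements in lnks die as soon as the browser navigates away, so copy the hrefs out as plain strings before visiting anything. A sketch (the body-text selector is an assumption; a site-specific one would be better):

# Hrefs are plain strings, so they survive navigation; WebElements do not.
article_urls = [lnk.get_attribute('href') for lnk in browser.find_elements_by_tag_name('a')[20:40]]
for url in article_urls:
    browser.get(url)
    print(browser.find_element_by_tag_name('body').text[:200])  # assumed selector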

How to get info from this html using beautifulsoup?

I want to get all the social links of a company from this page. When doing
summary_div.find("div", {'class': "cp-summary__social-links"})
I am getting this:
<div class="cp-summary__social-links">
<div data-integration-name="react-component" data-payload='{"props":
{"links":[{"url":"http://www.snapdeal.com?utm_source=craft.co","icon":"web","label":"Website"},
{"url":"http://www.linkedin.com/company/snapdeal?utm_source=craft.co","icon":"linkedin","label":"LinkedIn"},
{"url":"https://instagram.com/snapdeal/?utm_source=craft.co","icon":"instagram","label":"Instagram"},
{"url":"https://www.facebook.com/Snapdeal?utm_source=craft.co","icon":"facebook","label":"Facebook"},
{"url":"https://www.crunchbase.com/organization/snapdeal?utm_source=craft.co","icon":"cb","label":"CrunchBase"},
{"url":"https://www.youtube.com/user/snapdeal?utm_source=craft.co","icon":"youtube","label":"YouTube"},
{"url":"https://twitter.com/snapdeal?utm_source=craft.co","icon":"twitter","label":"Twitter"}],
"companyName":"Snapdeal"},"name":"CompanyLinks"}' data-rwr-element="true"></div></div>
I also tried getting the children of cp-summary__social-links (which is what I actually want) and then finding all <a> tags to get the links. That does not work either.
Any idea how to do this?
Update: As Sraw suggested, I managed to get all the URLs like this:
urls = []
social_link = summary_div.find("div", {'class': "cp-summary__social-links"}).find("div", {"data-integration-name": "react-component"})
json_text = json.loads(social_link["data-payload"])
for link in json_text['props']['links']:
    urls.append(link['url'])
Thanks in advance.
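For reference, a self-contained version of the same idea, runnable against just a trimmed copy of the fragment shown above (only bs4 and the standard-library json are needed):

import json
from bs4 import BeautifulSoup

# Trimmed copy of the markup from the question; the links live as JSON in data-payload.
html = '''<div class="cp-summary__social-links"><div data-integration-name="react-component" data-payload='{"props":{"links":[{"url":"http://www.snapdeal.com?utm_source=craft.co","icon":"web","label":"Website"}],"companyName":"Snapdeal"},"name":"CompanyLinks"}' data-rwr-element="true"></div></div>'''

soup = BeautifulSoup(html, 'html.parser')
payload = json.loads(soup.find('div', {'data-integration-name': 'react-component'})['data-payload'])
print([link['url'] for link in payload['props']['links']])
# ['http://www.snapdeal.com?utm_source=craft.co']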
