I am trying to webscrape a site using Python, Selenium, and BeautifulSoup.
When I try to get all the links, it returns an invalid string.
This is what I have tried
Can someone help me please?
from time import sleep
from selenium.webdriver.common.by import By
from selenium import webdriver
driver = webdriver.Chrome()
driver.get('https://www.hirist.com/c/filter/mobile-applications-jobs-in-cochin%20kochi_trivandrum%20thiruvananthapuram-5-70_75-0-0-1-0-0-0-0-2.html?ref=homepagecat')
sleep(10)
links = driver.find_elements(by=By.XPATH, value='.//div[@class="jobfeed-wrapper multiple-wrapper"]')
for link in links:
    link.get_attribute('href')
    print(link)
The problem is your XPath selection: you select the <div>, which does not have an href attribute. Select its <a> child instead, like .//div[@class="jobfeed-wrapper multiple-wrapper"]/a, and it will work:
links = driver.find_elements(by=By.XPATH, value='.//div[@class="jobfeed-wrapper multiple-wrapper"]/a')
for link in links:
    print(link.get_attribute('href'))
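A compact variant of the same fix, collecting the hrefs into a list in one pass (a sketch, using the same locator):
# Gather every href from the <a> children in a single list comprehension
hrefs = [a.get_attribute('href')
         for a in driver.find_elements(By.XPATH, '//div[@class="jobfeed-wrapper multiple-wrapper"]/a')]
print(hrefs)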
Example
Instead of sleep(), use WebDriverWait to check that specific elements are available.
from selenium import webdriver
from selenium.webdriver.common.by import By
from webdriver_manager.chrome import ChromeDriverManager
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
url = 'https://www.hirist.com/c/filter/mobile-applications-jobs-in-cochin%20kochi_trivandrum%20thiruvananthapuram-5-70_75-0-0-1-0-0-0-0-2.html?ref=homepagecat'
driver = webdriver.Chrome(ChromeDriverManager().install())
driver.maximize_window()
driver.get(url)
wait = WebDriverWait(driver, 10)
links = wait.until(EC.presence_of_all_elements_located((By.XPATH, './/div[@class="jobfeed-wrapper multiple-wrapper"]/a')))
for link in links:
    print(link.get_attribute('href'))
Output
https://www.hirist.com/j/xforia-technologies-android-developer-javakotlin-10-15-yrs-1011605.html?ref=cl&jobpos=1&jobversion=2
https://www.hirist.com/j/firminiq-system-ios-developer-swiftobjective-c-3-10-yrs-1011762.html?ref=cl&jobpos=2&jobversion=2
https://www.hirist.com/j/firminiq-system-android-developer-kotlin-3-10-yrs-1011761.html?ref=cl&jobpos=3&jobversion=2
https://www.hirist.com/j/react-native-developer-mobile-app-designing-3-5-yrs-1009438.html?ref=cl&jobpos=4&jobversion=2
https://www.hirist.com/j/flutter-developer-iosandroid-apps-2-3-yrs-1008214.html?ref=cl&jobpos=5&jobversion=2
https://www.hirist.com/j/accubits-technologies-react-native-developer-ios-android-platforms-3-7-yrs-1003520.html?ref=cl&jobpos=6&jobversion=2
https://www.hirist.com/j/appincubator-react-native-developer-iosandroid-platform-2-7-yrs-1001957.html?ref=cl&jobpos=7&jobversion=2
You didn't declare the path to chromedriver on your computer. Check where chromedriver is, then try:
driver = webdriver.Chrome(executable_path=CHROME_DRIVER_PATH)
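In newer Selenium releases (4+), executable_path is deprecated in favor of a Service object; a minimal sketch of the equivalent, assuming the same path:
from selenium import webdriver
from selenium.webdriver.chrome.service import Service

# CHROME_DRIVER_PATH is wherever chromedriver lives on your machine
driver = webdriver.Chrome(service=Service(CHROME_DRIVER_PATH))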
Related
If you visit this site,
https://www.premierleague.com/results
you will see several match results. If you click on a match, you are directed to another page.
My question is how I can get the href (link) of each match.
links = driver.find_elements(By.XPATH, '//*[@id="mainContent"]/div[3]/div[1]')
for link in links:
    x = link.get_attribute("href")
    List.append(x)
This is what I have so far and it is not working.
I see elements like
<div data-href="//www.premierleague.com/match/66686" ...>
and you could search
//div[@data-href]
and later use get_attribute("data-href")
Full working code
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
#from webdriver_manager.chrome import ChromeDriverManager
from webdriver_manager.firefox import GeckoDriverManager
#import time
url = 'https://www.premierleague.com/results'
#driver = webdriver.Chrome(executable_path=ChromeDriverManager().install())
driver = webdriver.Firefox(executable_path=GeckoDriverManager().install())
driver.get(url)
wait = WebDriverWait(driver, 10)
#time.sleep(5)
# close popup window with "Accept All Cookies"
button = wait.until(EC.visibility_of_element_located((By.XPATH, '//button[text()="Accept All Cookies"]')))
button.click()
all_items = driver.find_elements(By.XPATH, '//div[@data-href]')
print('len(all_items):', len(all_items))
for item in all_items:
    print(item.get_attribute('data-href'))
Result:
len(all_items): 40
//www.premierleague.com/match/66686
//www.premierleague.com/match/66682
//www.premierleague.com/match/66687
//www.premierleague.com/match/66689
//www.premierleague.com/match/66691
//www.premierleague.com/match/66684
//www.premierleague.com/match/66705
//www.premierleague.com/match/66677
//www.premierleague.com/match/66674
//www.premierleague.com/match/66675
//www.premierleague.com/match/66676
//www.premierleague.com/match/66679
//www.premierleague.com/match/66672
//www.premierleague.com/match/66678
//www.premierleague.com/match/66680
//www.premierleague.com/match/66681
//www.premierleague.com/match/66673
//www.premierleague.com/match/66633
//www.premierleague.com/match/66584
//www.premierleague.com/match/66513
//www.premierleague.com/match/66637
//www.premierleague.com/match/66636
//www.premierleague.com/match/66635
//www.premierleague.com/match/66666
//www.premierleague.com/match/66670
//www.premierleague.com/match/66668
//www.premierleague.com/match/66665
//www.premierleague.com/match/66667
//www.premierleague.com/match/66669
//www.premierleague.com/match/66654
//www.premierleague.com/match/66656
//www.premierleague.com/match/66659
//www.premierleague.com/match/66657
//www.premierleague.com/match/66655
//www.premierleague.com/match/66652
//www.premierleague.com/match/66660
//www.premierleague.com/match/66661
//www.premierleague.com/match/66653
//www.premierleague.com/match/66658
//www.premierleague.com/match/66524
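Note that the data-href values are protocol-relative (they start with //); if you need full URLs you can prepend a scheme. A small sketch:
# Turn protocol-relative values like "//www.premierleague.com/match/66686"
# into absolute https URLs
full_urls = ['https:' + item.get_attribute('data-href') for item in all_items]
for url in full_urls:
    print(url)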
I am trying to scrape a website where I have to click a link. For this purpose, I am using the Selenium library with the Chrome driver.
import time
from selenium import webdriver
url = 'https://sjobs.brassring.com/TGnewUI/Search/Home/Home?partnerid=25222&siteid=5011&noback=1&fromSM=true#Applications'
browser = webdriver.Chrome()
browser.get(url)
time.sleep(3)
link = browser.find_element_by_link_text("Don't have an account yet?")
link.click()
But it is not working. Any ideas why? Is there a workaround?
You can get it done in several ways. Here is one of them. I've used the driver.execute_script() command to force the click. You should not go for hardcoded delays, as they are very inconsistent.
Modified script:
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait as wait
from selenium.webdriver.support import expected_conditions as EC
url = 'https://sjobs.brassring.com/TGnewUI/Search/Home/Home?partnerid=25222&siteid=5011&noback=1&fromSM=true#Applications'
driver = webdriver.Chrome()
driver.get(url)
item = wait(driver, 10).until(EC.presence_of_element_located((By.CSS_SELECTOR, "a[ng-click='newAccntScreen()']")))
driver.execute_script("arguments[0].click();",item)
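If you prefer a native click over execute_script(), waiting for the element to be clickable usually works as well; a sketch of that variant:
# Wait until the link is actually clickable, then click it natively
item = wait(driver, 10).until(EC.element_to_be_clickable((By.CSS_SELECTOR, "a[ng-click='newAccntScreen()']")))
item.click()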
On YouTube, I want to search for certain videos (e.g. videos on Python) and then return all the videos this search yields. Right now, if I try this, Python returns all the videos on the start page, not on the page after the search.
Current code:
from selenium import webdriver
from webdriver_manager.chrome import ChromeDriverManager
driver = webdriver.Chrome(ChromeDriverManager().install())
driver.get("http://youtube.com")
driver.find_element_by_name("search_query").send_keys("Python")
driver.find_element_by_id("search-icon-legacy").click()
links = driver.find_elements_by_id("video-title")
for x in links:
    print(x.get_attribute("href"))
What goes wrong here?
But it is better to use an explicit wait for this:
links = ui.WebDriverWait(driver, 10).until(EC.presence_of_all_elements_located((By.ID, "video-title")))
Hope it helps you!
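For completeness, the imports that one-liner assumes (a sketch):
from selenium.webdriver.support import ui
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By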
As per the discussion with @Mark:
It seems that the elements of the first YouTube page are still in the DOM...
The only fix I see is to go to the search URL directly:
driver.get("http://youtube.com/results?search_query=Python")
# driver.find_element_by_name("search_query").send_keys("Python")
# driver.find_element_by_id("search-icon-legacy").click()
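If the query can contain spaces or special characters, you can build the search URL with urllib.parse instead of hardcoding it; a minimal sketch (the query string here is hypothetical):
from urllib.parse import quote_plus

query = "Python tutorial"
# quote_plus escapes spaces and special characters for the query string
driver.get("https://www.youtube.com/results?search_query=" + quote_plus(query))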
You should use WebDriverWait, not sleep:
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.wait import WebDriverWait
from selenium.webdriver.chrome.options import Options
opt = Options()
opt.add_argument("--incognito")
driver = webdriver.Chrome(executable_path=r'C:\path\to\chromedriver.exe', chrome_options=opt)
driver.get("http://youtube.com")
driver.find_element_by_name("search_query").send_keys("Python")
driver.find_element_by_id("search-icon-legacy").click()
WebDriverWait(driver, 10).until(EC.presence_of_all_elements_located((By.ID, "video-title")))
links = driver.find_elements_by_id("video-title")
for x in links:
    print(x.get_attribute("href"))
The output:
https://www.youtube.com/watch?v=rfscVS0vtbw
https://www.youtube.com/watch?v=f79MRyMsjrQ
https://www.youtube.com/watch?v=kLZuut1fYzQ
https://www.youtube.com/watch?v=N4mEzFDjqtA
https://www.youtube.com/watch?v=Z1Yd7upQsXY
https://www.youtube.com/watch?v=hnDU1G9hWqU
https://www.youtube.com/watch?v=3cZsjOclmoM
https://www.youtube.com/watch?v=f3EbDbm8XqY
https://www.youtube.com/watch?v=2uCXIbkbDSE
https://www.youtube.com/watch?v=HXV3zeQKqGY
https://www.youtube.com/watch?v=JJmcL1N2KQs
https://www.youtube.com/watch?v=qiSCMNBIP2g
https://www.youtube.com/watch?v=7lmCu8wz8ro
https://www.youtube.com/watch?v=25ovCm9jKfA
https://www.youtube.com/watch?v=q6Mc_sAPZ2Y
https://www.youtube.com/watch?v=yE9v9rt6ziw
https://www.youtube.com/watch?v=Y8Tko2YC5hA
https://www.youtube.com/watch?v=G0rQ7AEl5LA
https://www.youtube.com/watch?v=CtbckFw0pJs
https://www.youtube.com/watch?v=sugvnHA7ElY
To return all the videos from a search with the keyword Python, you need to:
Maximize the screen so all the resultant video links get rendered within the HTML DOM.
Induce WebDriverWait for the desired elements to be visible before extracting the href attributes.
You can use the following solution:
Code Block:
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
options = webdriver.ChromeOptions()
options.add_argument("start-maximized")
options.add_argument("disable-infobars")
options.add_argument("--disable-extensions")
driver=webdriver.Chrome(chrome_options=options, executable_path=r'C:\Utility\BrowserDrivers\chromedriver.exe')
driver.get("https://www.youtube.com/")
WebDriverWait(driver, 5).until(EC.element_to_be_clickable((By.CSS_SELECTOR, "input#search"))).send_keys("Python")
driver.find_element_by_css_selector("button.style-scope.ytd-searchbox#search-icon-legacy").click()
print([my_href.get_attribute("href") for my_href in WebDriverWait(driver, 5).until(EC.visibility_of_all_elements_located((By.CSS_SELECTOR, "a.yt-simple-endpoint.style-scope.ytd-video-renderer#video-title")))])
Console Output:
['https://www.youtube.com/watch?v=rfscVS0vtbw', 'https://www.youtube.com/watch?v=7UeRnuGo-pg', 'https://www.youtube.com/watch?v=3cZsjOclmoM', 'https://www.youtube.com/watch?v=f79MRyMsjrQ', 'https://www.youtube.com/watch?v=CtbckFw0pJs', 'https://www.youtube.com/watch?v=Z1Yd7upQsXY', 'https://www.youtube.com/watch?v=kLZuut1fYzQ', 'https://www.youtube.com/watch?v=IZ0IM_T4aio', 'https://www.youtube.com/watch?v=qiSCMNBIP2g', 'https://www.youtube.com/watch?v=N0lxfilGfak', 'https://www.youtube.com/watch?v=N4mEzFDjqtA', 'https://www.youtube.com/watch?v=s3Ejdx6cIho', 'https://www.youtube.com/watch?v=Y8Tko2YC5hA', 'https://www.youtube.com/watch?v=c3FXQU3TyCU', 'https://www.youtube.com/watch?v=yE9v9rt6ziw', 'https://www.youtube.com/watch?v=yvHrNlAF0Y0', 'https://www.youtube.com/watch?v=ZDa-Z5JzLYM']
I'm trying to grab the href element of each shoe in this site:
http://www.soccerpro.com/Clearance-Soccer-Shoes-c168/
But I can't get the proper selectors right.
response.xpath('.//*[@class="newnav itemnamelink"]')
[]
Anyone know how would I do this in xpath or css?
The required links are generated dynamically, so you won't be able to scrape them from the HTML source you get with requests.get("http://www.soccerpro.com/Clearance-Soccer-Shoes-c168/").
You can use Selenium to get the required values via a browser session:
from selenium import webdriver as web
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait as wait
driver = web.Chrome()
driver.get('http://www.soccerpro.com/Clearance-Soccer-Shoes-c168/')
wait(driver, 10).until(EC.presence_of_element_located((By.XPATH, "//table[@class='getproductdisplay-innertable']")))
links = [link.get_attribute('href') for link in driver.find_elements_by_xpath('//a[@class="newnav itemnamelink"]')]
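You could also wait on the link elements themselves instead of the table; a sketch of that variant:
# Wait for all product links to be present, then collect their hrefs
links = [a.get_attribute('href') for a in wait(driver, 10).until(
    EC.presence_of_all_elements_located((By.XPATH, '//a[@class="newnav itemnamelink"]')))]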
I am trying to write a script using Selenium to access Pastebin, do a search, and print the URL results as text. I need the visible URL results and nothing else.
<div class="gs-bidi-start-align gs-visibleUrl gs-visibleUrl-long" dir="ltr" style="word-break:break-all;">pastebin.com/VYQTSbzY</div>
Current script is:
import time
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.keys import Keys
from bs4 import BeautifulSoup
browser = webdriver.Firefox()
browser.get('http://www.pastebin.com')
search = browser.find_element_by_name('q')
search.send_keys("test")
search.send_keys(Keys.RETURN)
soup = BeautifulSoup(browser.page_source, 'html.parser')
for link in soup.find_all('a'):
    print(link.get('href', None), link.get_text())
You don't actually need BeautifulSoup; Selenium itself is very powerful at locating elements:
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.keys import Keys
browser = webdriver.Firefox()
browser.get('http://www.pastebin.com')
search = browser.find_element_by_name('q')
search.send_keys("test")
search.send_keys(Keys.RETURN)
# wait for results to appear
wait = WebDriverWait(browser, 10)
results = wait.until(EC.visibility_of_element_located((By.CSS_SELECTOR, "div.gsc-resultsbox-visible")))
# grab results
for link in results.find_elements_by_css_selector("a.gs-title"):
    print(link.get_attribute("href"))
browser.close()
Prints:
http://pastebin.com/VYQTSbzY
http://pastebin.com/VYQTSbzY
http://pastebin.com/VAAQCjkj
...
http://pastebin.com/fVUejyRK
http://pastebin.com/fVUejyRK
Note the use of an Explicit Wait which helps to wait for the search results to appear.
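Since the results contain duplicates (each URL appears for both the title and the snippet link), you may want to deduplicate while preserving order; a small sketch:
# Print each href once, in first-seen order
seen = set()
for link in results.find_elements_by_css_selector("a.gs-title"):
    href = link.get_attribute("href")
    if href and href not in seen:
        seen.add(href)
        print(href)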