I'm trying to grab the href attribute of each shoe on this site:
http://www.soccerpro.com/Clearance-Soccer-Shoes-c168/
But I can't get the selector right.
response.xpath('.//*[@class="newnav itemnamelink"]')
[]
Does anyone know how I would do this in XPath or CSS?
The required links are generated dynamically, so you won't be able to scrape them from the HTML source you get with requests.get("http://www.soccerpro.com/Clearance-Soccer-Shoes-c168/").
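As a quick sanity check (a sketch; the class name comes from the question's selector), you can confirm the anchors are absent from the static HTML:
import requests

# Fetch the raw HTML without a browser; JavaScript never runs here
response = requests.get("http://www.soccerpro.com/Clearance-Soccer-Shoes-c168/")

# The product-link class from the question should not appear in the static source
print('itemnamelink' in response.text)  # expected: False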
You might use Selenium to get the required values via a browser session:
from selenium import webdriver as web
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait as wait
driver = web.Chrome()
driver.get('http://www.soccerpro.com/Clearance-Soccer-Shoes-c168/')
wait(driver, 10).until(EC.presence_of_element_located((By.XPATH, "//table[@class='getproductdisplay-innertable']")))
links = [link.get_attribute('href') for link in driver.find_elements_by_xpath('//a[@class="newnav itemnamelink"]')]
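Note that the find_elements_by_* helpers have been removed in current Selenium 4 releases; on a recent version the equivalent call is:
links = [link.get_attribute('href') for link in driver.find_elements(By.XPATH, '//a[@class="newnav itemnamelink"]')]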
Related
I am trying to web-scrape a site using Python, Selenium, and BeautifulSoup.
When I try to get all the links, it returns an invalid string.
This is what I have tried.
Can someone help me, please?
from time import sleep
from selenium.webdriver.common.by import By
from selenium import webdriver
driver = webdriver.Chrome()
driver.get('https://www.hirist.com/c/filter/mobile-applications-jobs-in-cochin%20kochi_trivandrum%20thiruvananthapuram-5-70_75-0-0-1-0-0-0-0-2.html?ref=homepagecat')
sleep(10)
links = driver.find_elements(by=By.XPATH, value='.//div[@class="jobfeed-wrapper multiple-wrapper"]')
for link in links:
link.get_attribute('href')
print(link)
The problem is your XPath selection: you select the <div>, which does not have an href attribute. Also select its first <a>, like .//div[@class="jobfeed-wrapper multiple-wrapper"]/a, and it will work:
links = driver.find_elements(by=By.XPATH, value='.//div[@class="jobfeed-wrapper multiple-wrapper"]/a')
for link in links:
print(link.get_attribute('href'))
Example
Instead of time.sleep(), use WebDriverWait to check whether the specific elements are available.
from selenium import webdriver
from selenium.webdriver.common.by import By
from webdriver_manager.chrome import ChromeDriverManager
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
url = 'https://www.hirist.com/c/filter/mobile-applications-jobs-in-cochin%20kochi_trivandrum%20thiruvananthapuram-5-70_75-0-0-1-0-0-0-0-2.html?ref=homepagecat'
driver = webdriver.Chrome(ChromeDriverManager().install())
driver.maximize_window()
driver.get(url)
wait = WebDriverWait(driver, 10)
links = wait.until(EC.presence_of_all_elements_located((By.XPATH, './/div[@class="jobfeed-wrapper multiple-wrapper"]/a')))
for link in links:
print(link.get_attribute('href'))
Output
https://www.hirist.com/j/xforia-technologies-android-developer-javakotlin-10-15-yrs-1011605.html?ref=cl&jobpos=1&jobversion=2
https://www.hirist.com/j/firminiq-system-ios-developer-swiftobjective-c-3-10-yrs-1011762.html?ref=cl&jobpos=2&jobversion=2
https://www.hirist.com/j/firminiq-system-android-developer-kotlin-3-10-yrs-1011761.html?ref=cl&jobpos=3&jobversion=2
https://www.hirist.com/j/react-native-developer-mobile-app-designing-3-5-yrs-1009438.html?ref=cl&jobpos=4&jobversion=2
https://www.hirist.com/j/flutter-developer-iosandroid-apps-2-3-yrs-1008214.html?ref=cl&jobpos=5&jobversion=2
https://www.hirist.com/j/accubits-technologies-react-native-developer-ios-android-platforms-3-7-yrs-1003520.html?ref=cl&jobpos=6&jobversion=2
https://www.hirist.com/j/appincubator-react-native-developer-iosandroid-platform-2-7-yrs-1001957.html?ref=cl&jobpos=7&jobversion=2
You didn't declare the path to chromedriver on your computer. Check where chromedriver is, then try:
driver = webdriver.Chrome(executable_path=CHROME_DRIVER_PATH)
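In Selenium 4, executable_path is deprecated in favor of the Service class; a minimal sketch of the modern equivalent (the driver path is a placeholder):
from selenium import webdriver
from selenium.webdriver.chrome.service import Service

CHROME_DRIVER_PATH = r'C:\path\to\chromedriver.exe'  # placeholder: your actual driver location
driver = webdriver.Chrome(service=Service(executable_path=CHROME_DRIVER_PATH))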
I'm quite a noob in Python and am currently building a web scraper in Selenium that should take all the product URLs from the clicked 'tab' on a web page. But my code takes the URLs from the first 'tab'. Code below. Thank you, guys. I'm starting to get kind of frustrated, lol.
Screenshot
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
import time
from lxml import html
PATH = r'C:\Program Files (x86)\chromedriver.exe'
driver = webdriver.Chrome(PATH)
url = 'https://www.alza.sk/vypredaj-akcia-zlava/e0.htm'
driver.get(url)
driver.find_element_by_xpath('//*[@id="tabs"]/ul/li[2]').click()
links = []
try:
WebDriverWait(driver, 10).until(
EC.presence_of_element_located((By.CLASS_NAME, 'blockFilter')))
link = driver.find_elements_by_xpath("//a[@class='name browsinglink impression-binded']")
for i in link:
links.append(i.get_attribute('href'))
finally:
driver.quit()
print(links)
To get the handle of the current tab:
current_tab = driver.current_window_handle
To switch between tabs:
driver.switch_to.window(driver.window_handles[1])   # the second tab
driver.switch_to.window(driver.window_handles[-1])  # the most recently opened tab
Assuming you have the link element that opens the new tab as tab_link, you should try:
from selenium.webdriver.common.action_chains import ActionChains
action = ActionChains(driver)
action.key_down(Keys.CONTROL).click(tab_link).key_up(Keys.CONTROL).perform()
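Putting it together, a minimal sketch of the Ctrl+click-then-switch pattern (the /a XPath is an assumption based on the question's tab locator, and whether the anchor actually opens a new browser tab depends on the page):
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.common.action_chains import ActionChains

driver = webdriver.Chrome()
driver.get('https://www.alza.sk/vypredaj-akcia-zlava/e0.htm')

# Ctrl+click the tab's anchor so it opens in a new browser tab
tab_link = driver.find_element(By.XPATH, '//*[@id="tabs"]/ul/li[2]/a')
ActionChains(driver).key_down(Keys.CONTROL).click(tab_link).key_up(Keys.CONTROL).perform()

# Focus the newly opened tab, then return to the original one when done
driver.switch_to.window(driver.window_handles[-1])
print(driver.current_url)
driver.switch_to.window(driver.window_handles[0])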
Also, apparently the li doesn't have a click event. Are you sure the element you are getting with '//*[@id="tabs"]/ul/li[2]' has the aria-selected property set to true, or one of these classes: ui-tabs-active ui-state-active?
If not, you should call click on the a tag inside this li.
Then you should increase the timeout parameter of your WebDriverWait to guarantee that the div is loaded.
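A sketch of both suggestions applied to the question's code (only these two lines change; the imports are already in the original script):
# Click the anchor inside the tab's <li> rather than the <li> itself
driver.find_element_by_xpath('//*[@id="tabs"]/ul/li[2]/a').click()

# Allow more time for the tab's content to load before collecting the links
WebDriverWait(driver, 30).until(
    EC.presence_of_element_located((By.CLASS_NAME, 'blockFilter')))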
I'm just learning how to scrape dynamic pages using Selenium in Python. I'm currently trying to click on a link within the webpage to page forward through search results.
So far this is the code that I'm using:
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
driver = webdriver.Chrome('C:\\Users\\km13\\chromedriver.exe')
driver.get("http://www.congreso.gob.pe/pley-2016-2021")
elem = driver.find_element_by_css_selector("img[src='/Sicr/TraDocEstProc/CLProLey2016.nsf/8eac1ef603908b5105256cdf006c41b1/$Body/0.AB2?OpenElement&FieldElemFormat=gif']")
elem.click()
This is the HTML that corresponds with the element I'd like to click on:
`<img src="/Sicr/TraDocEstProc/CLProLey2016.nsf/8eac1ef603908b5105256cdf006c41b1/$Body/0.AB2?OpenElement&FieldElemFormat=gif" width="81" height="16" border="0">`
From my somewhat limited knowledge of HTML, it seems like the link is actually embedded in the gif, which is why I tried to use the CSS selector that goes along with that image. But this did not work.
Any guidance would be greatly appreciated!
Update:
I changed my code by adding the following import:
from selenium.webdriver.common.by import By
And I changed the following:
elem = driver.find_element(By.CSS_SELECTOR, "img[src='/Sicr/TraDocEstProc/CLProLey2016.nsf/8eac1ef603908b5105256cdf006c41b1/$Body/0.AB2?OpenElement&FieldElemFormat=gif']")
elem.click()
Now I get an error for "no such element."
There is an iframe. You need to switch to the iframe first to access the element. Try the code below; it uses WebDriverWait to handle the dynamic element.
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
driver = webdriver.Chrome('C:\\Users\\km13\\chromedriver.exe')
driver.get("http://www.congreso.gob.pe/pley-2016-2021")
WebDriverWait(driver, 20).until(EC.frame_to_be_available_and_switch_to_it((By.NAME, 'ventana02')))
elem = WebDriverWait(driver, 10).until(EC.element_to_be_clickable(
(By.XPATH, "//a[contains(#onclick,'A50')]/img[contains(#src,'Sicr/TraDocEstProc/CLProLey')]")))
elem.click()
EDITED
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
driver = webdriver.Chrome('C:\\Users\\km13\\chromedriver.exe')
driver.get("http://www.congreso.gob.pe/pley-2016-2021")
driver.switch_to.frame(0)
elem = WebDriverWait(driver, 10).until(EC.element_to_be_clickable((By.XPATH, "//a[contains(@onclick,'A50')]/img[contains(@src,'Sicr/TraDocEstProc/CLProLey')]")))
elem.click()
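Once you are done inside the iframe, remember to switch back to the main document before locating elements outside it:
# Return to the top-level document after finishing with the iframe
driver.switch_to.default_content()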
I'm trying to find the twitch video IDs of all videos for a specific user. So for example on this page
https://www.twitch.tv/dyrus/videos/all
So here we have all the videos linked, but it's not quite so simple as just scraping the HTML and finding the links, since they seem to be generated dynamically.
So I heard about selenium and did something like this:
from selenium import webdriver
# Change path here obviously
driver = webdriver.Chrome('C:/Users/Jason/Downloads/chromedriver')
driver.get('https://www.twitch.tv/dyrus/videos/all')
link_element = driver.find_elements_by_xpath("//*[@href]")
for link in link_element:
print(link.get_attribute('href'))
driver.close()
This returns a bunch of links on the page, but not the videos; they lie "deeper", I think. Any input?
Thanks in advance
I would still suggest a couple of changes, as follows:
Always open the web browser in maximized mode so that all, or the majority of, the desired elements are within the viewport.
If you are on Windows, you need to append the .exe extension to the WebDriver executable name, e.g. chromedriver.exe.
While identifying elements, always try to include the class attribute in your locator strategy.
Always invoke driver.quit() at the end of your script to close and destroy the WebDriver and web client instances gracefully.
Here is your own code block with the above mentioned tweaks:
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
options = Options()
options.add_argument("start-maximized")
options.add_argument("disable-infobars")
driver = webdriver.Chrome(chrome_options=options, executable_path=r'C:\path\to\chromedriver.exe')
driver.get('https://www.twitch.tv/dyrus/videos/all')
link_elements = WebDriverWait(driver, 10).until(EC.visibility_of_all_elements_located((By.CSS_SELECTOR, "a.tw-interactive.tw-link[data-a-target='preview-card-image-link']")))
for link in link_elements:
print(link.get_attribute('href'))
driver.quit()
Console Output:
https://www.twitch.tv/videos/295314690
https://www.twitch.tv/videos/294901947
https://www.twitch.tv/videos/294472813
https://www.twitch.tv/videos/294075254
https://www.twitch.tv/videos/293617036
https://www.twitch.tv/videos/293236560
https://www.twitch.tv/videos/292800601
https://www.twitch.tv/videos/292409437
https://www.twitch.tv/videos/292328170
https://www.twitch.tv/videos/292032996
https://www.twitch.tv/videos/291625563
https://www.twitch.tv/videos/291192151
https://www.twitch.tv/videos/290824842
https://www.twitch.tv/videos/290434348
https://www.twitch.tv/videos/290021370
https://www.twitch.tv/videos/289561690
https://www.twitch.tv/videos/289495488
https://www.twitch.tv/videos/289138003
https://www.twitch.tv/videos/289110429
https://www.twitch.tv/videos/288804893
https://www.twitch.tv/videos/288784992
https://www.twitch.tv/videos/288687479
https://www.twitch.tv/videos/288432438
https://www.twitch.tv/videos/288117849
https://www.twitch.tv/videos/288004968
https://www.twitch.tv/videos/287689102
https://www.twitch.tv/videos/287451192
https://www.twitch.tv/videos/287267032
https://www.twitch.tv/videos/287017431
https://www.twitch.tv/videos/286819343
With your locator, you are returning every element on the page that contains an href attribute. You can be a little more specific than that and get what you are looking for. Switch to a CSS selector...
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
# Change path here obviously
driver = webdriver.Chrome('C:/Users/Jason/Downloads/chromedriver')
driver.get('https://www.twitch.tv/dyrus/videos/all')
links = WebDriverWait(driver, 10).until(EC.visibility_of_all_elements_located((By.CSS_SELECTOR, "a[data-a-target='preview-card-image-link']")))
for link in links:
print(link.get_attribute('href'))
driver.close()
That prints 40 links from the page.
I'm using Selenium to log in to the webpage and to get the page for scraping.
I'm able to get the page.
I have searched the HTML for a table that I want to scrape.
Here it is:
<table cellspacing="0" class=" tablehasmenu table hoverable sensors" id="table_devicesensortable">
This is the script:
rawpage = driver.page_source  # storing the webpage in a variable
souppage = BeautifulSoup(rawpage, 'html.parser')  # parsing the webpage
tbody = souppage.find('table', attrs={'id': 'table_devicesensortable'})  # scraping the table
I'm able to get the parsed webpage in the souppage variable,
but I'm not able to scrape the table and store it in the tbody variable.
The required table might be generated dynamically, so you need to wait for it to be present on the page:
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait as wait
tbody = wait(driver, 10).until(EC.presence_of_element_located((By.ID, "table_devicesensortable")))
Also note that there is no need to use BeautifulSoup, as Selenium has enough built-in methods and properties to do the same job for you, e.g.
headers = tbody.find_elements_by_tag_name("th")
rows = tbody.find_elements_by_tag_name("tr")
cells = tbody.find_elements_by_tag_name("td")
cell_values = [cell.text for cell in cells]
etc...
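For example, a minimal sketch that collects every row's cell text without BeautifulSoup (using the same wait alias and imports as above):
# Wait for the table, then pull each row's cell text with Selenium alone
tbody = wait(driver, 10).until(EC.presence_of_element_located((By.ID, "table_devicesensortable")))
rows = tbody.find_elements_by_tag_name("tr")
table_data = [[cell.text for cell in row.find_elements_by_tag_name("td")] for row in rows]
print(table_data)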
I was searching Stack Overflow for the issue and came across this post:
BeautifulSoup returning none when element definitely exists
From the answer provided by luiyezheng I got the hint that the data might be fetched dynamically. So the table might be created dynamically, and hence I was unable to find it.
So the workaround is:
before storing the webpage, I put in a delay.
The code goes like this:
import time

time.sleep(4)
rawpage = driver.page_source  # storing the webpage in a variable
souppage = BeautifulSoup(rawpage, "html.parser")  # parsing the webpage
tbody = souppage.find("table", {"id": "table_devicesensortable"})  # scraping the table
I hope it might help someone.
As per the HTML you have shared, to scrape the <table> you have to induce WebDriverWait with the expected_conditions clause set to presence_of_element_located, and to achieve that you can use either of the following code blocks:
Using class:
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
WebDriverWait(driver, 20).until(EC.presence_of_element_located((By.XPATH, "//table[@class=' tablehasmenu table hoverable sensors' and @id='table_devicesensortable']")))
rawpage = driver.page_source  # storing the webpage in a variable
souppage = BeautifulSoup(rawpage, "html.parser")  # parsing the webpage
tbody = souppage.find("table", {"class": " tablehasmenu table hoverable sensors"})  # scraping the table
Using id:
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
WebDriverWait(driver, 20).until(EC.presence_of_element_located((By.XPATH, "//table[@class=' tablehasmenu table hoverable sensors' and @id='table_devicesensortable']")))
rawpage = driver.page_source  # storing the webpage in a variable
souppage = BeautifulSoup(rawpage, "html.parser")  # parsing the webpage
tbody = souppage.find("table", {"id": "table_devicesensortable"})  # scraping the table