I am having some trouble scraping the url below:
http://102.37.123.153/Lists/eTenders/AllItems.aspx
I am using Python with Selenium, but there are many "onclick" JavaScript events to run through to get to the lowest level of information. Does anyone know how to automate this?
Thanks
import time

from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.chrome.options import Options

url = 'http://102.37.123.153/Lists/eTenders/AllItems.aspx'
chrome_options = Options()
chrome_options.add_argument("--headless")
browser = webdriver.Chrome('c:/Users/AB/Dropbox/ITProjects/Scraping/chromedriver.exe', options=chrome_options)
browser.get(url)
time.sleep(10)
source = browser.page_source
soup = BeautifulSoup(source, 'html.parser')
for link in soup.find_all('a'):
    if link.get('href') == 'javascript:':
        print(link)
You don't need Selenium for this website, you need patience. Let me explain how you'd approach it.
Click X
Y opens, click Y
Z opens, click Z.
And so on.
What happened here is that when you clicked X, an AJAX request was made to get Y; after you clicked Y, another AJAX request was made to get Z, and so on.
So you can just simulate those requests: open the network tab, see how the site crafts each request, then make the same ones in your code. Get the response, use it to build the next request, and the cycle goes on until you reach the innermost level of the tree.
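For illustration, here's a minimal sketch of that request-replay idea using the requests library; the endpoint and payload below are made-up placeholders, you'd copy the real ones from the network tab:

import requests

session = requests.Session()

# Hypothetical endpoint and payload; replace both with whatever the
# network tab shows the site sending when you click X.
resp = session.post(
    'http://102.37.123.153/Lists/eTenders/Handler.aspx',  # assumed URL
    data={'itemId': '123'},                               # assumed payload
)

# Use this response to craft the request for Y, then Z, down the tree.
print(resp.text)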
This approach has no UI and is, technically speaking, less friendly and harder to implement, but it's more efficient. On the other side, you can just select your clickable elements with Selenium, like
elem = driver.find_element_by_xpath('x')
elem.click()
and it will also work.
I'd also note that sometimes links don't trigger AJAX at all; the info is just hidden, but it's already in the source code. To know what you'll receive in your response, right-click on the website and choose "View page source", and note that this is different from "Inspect element".
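As a quick check, a sketch of fetching that raw source (what "View page source" shows, before any JavaScript runs), using the eTenders URL from the question:

import requests
from bs4 import BeautifulSoup

# The raw server response, before any JavaScript has modified the page
raw = requests.get('http://102.37.123.153/Lists/eTenders/AllItems.aspx').text
soup = BeautifulSoup(raw, 'html.parser')

# If the data is only hidden rather than loaded via AJAX, it will be here
for link in soup.find_all('a'):
    print(link)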
I am trying to crawl data from a dynamic website using Selenium. It requires an account to log in, and I must click on some links to get to the information page. After doing all these steps, I found that the source code had not changed and I could not get elements that exist on the new page. On the other hand, if I go directly to this page and log in, the source code I get is the parent page's. Can you explain to me why, and how to tackle this problem?
How I perform the click action:
from selenium.webdriver.common.by import By

element = driver.find_element(By.CLASS_NAME, "class_name")
element2 = element.find_element(By.CSS_SELECTOR, "css_element")
element2.click()
How I get the source code:
page_source = driver.execute_script("return document.body.outerHTML")
with open('a.html', 'w') as f:
f.write(page_source)
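One possible cause, and this is only a guess since the site isn't shown, is that the click opens the new page in a separate window or tab, or loads it after a delay, so the driver is still reading the old document. A sketch of handling both cases:

from selenium.webdriver.support.ui import WebDriverWait

# If the link opened a new window/tab, switch to the most recent handle
driver.switch_to.window(driver.window_handles[-1])

# Wait until the new document has finished loading before reading it
WebDriverWait(driver, 10).until(
    lambda d: d.execute_script("return document.readyState") == "complete"
)
page_source = driver.execute_script("return document.body.outerHTML")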
I am trying to web-scrape indeed.com to search for jobs using Python, with Selenium and BeautifulSoup. I want to click the next page but can't seem to figure out how to do this. I have looked at many threads, but it is unclear to me which element I am supposed to act on. Here is the web page HTML; the code marked in grey comes up when I inspect the next button.
Also, just to mention: I first tried to follow what happens to the URL when mousedown is executed. After reading the addppurlparam function, adding the strings from the function, and using that URL, I just get thrown back to page one.
Here is my code for the class with selenium meant to click on the button:
from selenium import webdriver
from selenium.webdriver import ActionChains
driver = webdriver.Chrome("C:/Users/alleballe/Downloads/chromedriver.exe")
driver.get("https://se.indeed.com/Internship-jobb")
print(driver.title)
#assert "Python" in driver.title
elem = driver.find_element_by_class_name("pagination-list")
elem = elem.find_element_by_xpath("//li/a[@aria-label='Nästa']")
print(elem)
assert "No results found." not in driver.page_source
assert elem
action = ActionChains(driver).click(elem)
action.perform()
print(elem)
driver.close()
The Indeed site is formatted so that it shows 10 results per page.
Your photo shows the wrong section of HTML. Instead, you can see that the links contain start=0 for the first page, start=10 for the second, start=20 for the third, and so on.
You could use this knowledge to write code like this:
i = 0  # initialise outside the loop so the offset actually advances
while True:
    driver.get(f'https://se.indeed.com/jobs?q=Internship&start={i}')
    # code here
    i = i + 10
But, to directly answer your question, you should do:
next_page_link = driver.find_element_by_xpath('/html/head/link[6]')
driver.get(next_page_link.get_attribute('href'))
This will find the link element and then load its href.
This works; it paginates to the next page.
driver.find_element_by_class_name("pagination-list").find_element_by_tag_name('a').click()
I am new to Selenium and I am trying to mimic user actions on a site to fetch data from a built-in HTML page on button click. I am able to populate all the field details, but the button click is not working; it looks like the JS code is not running.
I tried many options like adding wait time, ActionChains, etc., but it didn't work. I am providing the site and the code I have written.
from time import sleep

from selenium import webdriver
from selenium.webdriver import ActionChains

driver = webdriver.Chrome()
driver.get("https://www1.nseindia.com/products/content/derivatives/equities/historical_fo.htm")
driver.implicitly_wait(10)

# assigned values to all the other fields
driver.find_element_by_id('rdDateToDate').click()
Dfrom = driver.find_element_by_id('fromDate')
Dfrom.send_keys("02-Oct-2020")
Dto = driver.find_element_by_id('toDate')
Dto.send_keys("08-Oct-2020")
innerHTML = driver.execute_script("document.ready")
sleep(5)
getdata_btn = driver.find_element_by_id('getButton')
ActionChains(driver).move_to_element(getdata_btn).click().click().perform()
I recommend using a full xpath.
import time

from selenium import webdriver

chrome = webdriver.Chrome()
chrome.get("https://www1.nseindia.com/products/content/derivatives/equities/historical_fo.htm")
time.sleep(2)
print("click")
fullxpath = "/html/body/div[2]/div[3]/div[2]/div[1]/div[3]/div/div[1]/div/form/div[19]/p[2]/input"
chrome.find_element_by_xpath(fullxpath).click()
I have tried the button clicking and it worked with XPath. I thought it was because someone used the ID twice on the website, but I cannot find a duplicate, so I have no idea what's going wrong there.
Good luck :)
I have created the following code in the hope of opening a new tab with a few parameters and then scraping the data table that is on the new tab.
import time

import requests
from bs4 import BeautifulSoup
from selenium import webdriver

# Open Webpage
url = "https://www.website.com"
driver = webdriver.Chrome(executable_path=r"C:\mypathto\chromedriver.exe")
driver.get(url)
#Click Necessary Parameters
driver.find_element_by_partial_link_text('Output').click()
driver.find_element_by_xpath('//*[#id="flexOpt"]/table/tbody/tr/td[2]/input[3]').click()
driver.find_element_by_xpath('//*[#id="flexOpt"]/table/tbody/tr/td[2]/input[4]').click()
driver.find_element_by_xpath('//*[#id="repOpt"]/table[2]/tbody/tr/td[2]/input[4]').click()
time.sleep(2)
driver.find_element_by_partial_link_text('Dates').click()
driver.find_element_by_xpath('//*[#id="RangeOption"]').click()
driver.find_element_by_xpath('//*[#id="Range"]/table/tbody/tr[1]/td[2]/select/option[2]').click()
driver.find_element_by_xpath('//*[#id="Range"]/table/tbody/tr[1]/td[3]/select/option[1]').click()
driver.find_element_by_xpath('//*[#id="Range"]/table/tbody/tr[1]/td[4]/select/option[1]').click()
driver.find_element_by_xpath('//*[#id="Range"]/table/tbody/tr[2]/td[2]/select/option[2]').click()
driver.find_element_by_xpath('//*[#id="Range"]/table/tbody/tr[2]/td[3]/select/option[31]').click()
driver.find_element_by_xpath('//*[#id="Range"]/table/tbody/tr[2]/td[4]/select/option[1]').click()
time.sleep(2)
driver.find_element_by_partial_link_text('Groupings').click()
driver.find_element_by_xpath('//*[#id="availFld_DATE"]/a/img').click()
driver.find_element_by_xpath('//*[#id="availFld_LOCID"]/a/img').click()
driver.find_element_by_xpath('//*[#id="availFld_STATE"]/a/img').click()
driver.find_element_by_xpath('//*[#id="availFld_DDSO_SA"]/a/img').click()
driver.find_element_by_xpath('//*[#id="availFld_CLASS_ID"]/a/img').click()
driver.find_element_by_xpath('//*[#id="availFld_REGION"]/a/img').click()
time.sleep(2)
driver.find_element_by_partial_link_text('Run').click()
time.sleep(2)
df_url = driver.switch_to_window(driver.window_handles[0])
page = requests.get(df_url).text
soup = BeautifulSoup(page, features = 'html5lib')
soup.prettify()
However, the following error pops up when I run it.
requests.exceptions.MissingSchema: Invalid URL 'None': No schema supplied. Perhaps you meant http://None?
I will say that regardless of the parameters, the new tab always generates the same URL. In other words, if the new tab creates www.website.com/b, it also creates www.website.com/b the third, fourth, etc. time, regardless of changing the parameters. Any thoughts?
The problem lies here:
df_url = driver.switch_to_window(driver.window_handles[0])
page = requests.get(df_url).text
df_url is not referring to the URL of the page; in fact, driver.switch_to_window returns None. To get the URL, you should call driver.current_url after switching windows to get the URL of the active window.
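A minimal sketch of that fix, assuming the data table is in the newly opened tab (the last handle in window_handles):

driver.switch_to.window(driver.window_handles[-1])  # focus the new tab
df_url = driver.current_url                         # URL of the active window
page = requests.get(df_url).text
soup = BeautifulSoup(page, features='html5lib')

Note that requests.get starts a fresh session without the browser's cookies, so if the page needs them you may have to copy them over from driver.get_cookies().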
Some other pointers:
- finding elements by XPath is relatively inefficient
- instead of time.sleep, you can look into using explicit waits (see the sketch below)
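For example, a short sketch of an explicit wait; the 'Run' link here is just borrowed from the question's code:

from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait

# Wait up to 10 seconds for the link to become clickable, then click it
wait = WebDriverWait(driver, 10)
wait.until(EC.element_to_be_clickable((By.PARTIAL_LINK_TEXT, 'Run'))).click()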
Put the url assignment below the driver variable, because the webdriver has to start first before the url is loaded:
driver=webdriver.Chrome(executable_path=r"C:\mypathto\chromedriver.exe")
url = "https://www.website.com"
I am trying to automate a booking process on a travel site using splinter and am having trouble clicking on a CSS element on the page.
This is my code:
import splinter
import time
secret_deals_email = {
    'user[email]': 'adf@sad.com'
}
browser = splinter.Browser()
url = 'http://roomer-qa-1.herokuapp.com'
browser.visit(url)
click_FIND_ROOMS = browser.find_by_css('.blue-btn').first.click()
time.sleep(10)
# click_Book_button = browser.find_by_css('.book-button-row.blue-btn').first.click()
browser.fill_form(secret_deals_email)
click_get_secret_deals = browser.find_by_name('button').first.click()
time.sleep(10)
click_book_first_room_list = browser.find_by_css('.book-button-row-link').first.click()
time.sleep(5)
click_book_button_entry = browser.find_by_css('.entry-white-box.entry_box_no_refund').first.click()
The problem is that whenever I run it and the code gets to the page where I need to choose the sort of purchase I would like, I can't click any of the options on the page.
I keep getting an error that the element does not exist, no matter what I do.
http://roomer-qa-1.herokuapp.com/hotels/atlanta-hotels/ramada-plaza-atlanta-downtown-capitol-park.h30129/44389932?rate_plan_id=1&rate_plan_token=6b5aad6e9b357a3d9ff4b31acb73c620&
This is the link to the page that is causing me trouble. Please help :)
You need to wait until the element is present on the page. You can use the is_element_not_present_by_css method with a while loop to do that:
# keep looping while the element is still absent, i.e. until it appears
while browser.is_element_not_present_by_css('.entry-white-box.entry_box_no_refund'):
    time.sleep(1)
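Alternatively, splinter's is_element_present_by_css accepts a wait_time argument that polls for you; a sketch with the same selector:

# Polls the page for up to 30 seconds before giving up
if browser.is_element_present_by_css('.entry-white-box.entry_box_no_refund', wait_time=30):
    browser.find_by_css('.entry-white-box.entry_box_no_refund').first.click()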