Loop through pages of a website using Selenium - Python

I've spent quite a bit of time on this and am hoping to get some help... I'm new to Python and web scraping.
I'm accessing a website using credentials, so I won't be able to share the link, but it's fairly straightforward and I have most of the code. Using Selenium, I'm able to access the website, input my credentials, access a table, pull in the data I want, create a data frame, and go to the next page. But I would like to automatically loop through all pages (with some pauses, to be kind to the site) and append each page's data to a master list. This is what I have so far:
driver = webdriver.Chrome()
driver.get('website')
username = driver.find_element_by_id("username")
password = driver.find_element_by_id("password")
username.send_keys("username")
password.send_keys("password"+"\n")
driver.implicitly_wait(20)
table = driver.find_element_by_id('preblockBody')
information = []
job_elems = table.find_elements_by_xpath("//*[contains(@class,'pbListingTable')]")
for value in job_elems:
    #print(value.text)
    information.append(value.text)
nxt = driver.find_element_by_xpath("//a[contains(@href, 'gotoNextPage(2)')]")
driver.execute_script("arguments[0].click();", nxt)
I think the best route is to find all the links whose href contains 'gotoNextPage' and loop over them, but I'm unsure how to do so. Any help is very much appreciated.
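For what it's worth, here is a minimal sketch of that idea, untested against the real site since the markup isn't shown. It assumes the first page lists a gotoNextPage(n) link for each remaining page, and it re-finds the link after every page load to avoid stale references:
import time

information = []
# number of pages = number of gotoNextPage(...) links on the first page, plus the first page itself
page_links = driver.find_elements_by_xpath("//a[contains(@href, 'gotoNextPage')]")
num_pages = len(page_links) + 1

for page in range(num_pages):
    table = driver.find_element_by_id('preblockBody')
    for value in table.find_elements_by_xpath("//*[contains(@class,'pbListingTable')]"):
        information.append(value.text)
    if page < num_pages - 1:
        # re-find the link each time, since the old reference goes stale after the page reloads
        nxt = driver.find_element_by_xpath(
            "//a[contains(@href, 'gotoNextPage({})')]".format(page + 2))
        driver.execute_script("arguments[0].click();", nxt)
        time.sleep(5)  # pause between pages to be kind to the site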

UPDATE 1:
I've found something helpful: I can click the 'Next' link instead of a specific 'gotoNextPage' element. Here is my new code; however, it only keeps the last page of info rather than appending as it goes through the pages. This is very close!
driver = webdriver.Chrome()
driver.get('website')
username = driver.find_element_by_id("username")
password = driver.find_element_by_id("password")
username.send_keys("user name")
password.send_keys("password"+"\n")
while True:
    driver.implicitly_wait(30)
    table = driver.find_element_by_id('preblockBody')
    information = []
    job_elems = table.find_elements_by_xpath("//*[contains(@class,'pbListingTable')]")
    for value in job_elems:
        #print(value.text)
        information.append(value.text)
    try:
        driver.find_element_by_partial_link_text('Next').click()
    except:
        break
driver.quit()
print(information)
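The reason only the last page survives is that information = [] is re-created inside the while loop, so each iteration throws away everything collected so far. A minimal sketch of the fix, using the same code with the list initialized once before the loop and a short pause between pages:
import time
from selenium.common.exceptions import NoSuchElementException

information = []          # initialize once, outside the loop
while True:
    driver.implicitly_wait(30)
    table = driver.find_element_by_id('preblockBody')
    for value in table.find_elements_by_xpath("//*[contains(@class,'pbListingTable')]"):
        information.append(value.text)
    try:
        driver.find_element_by_partial_link_text('Next').click()
        time.sleep(5)     # pause between pages to be kind to the site
    except NoSuchElementException:
        break             # no 'Next' link on the last page
driver.quit()
print(information)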

Related

Stuck in hCaptcha loop

I'm using Selenium and Python plus the 2captcha API. I was able to retrieve the tokens successfully and even submit the form using JS.
The form is submitted, but the page keeps on reloading, so I cannot get past the hCaptcha loop.
Here is my code:
def Solver(self, browser):
    WebDriverWait(browser, 60).until(Ec.frame_to_be_available_and_switch_to_it((By.XPATH, '//*[@id="cf-hcaptcha-container"]/div[2]/iframe')))
    captcha = CaptchaRecaptcha()
    url = browser.current_url
    code = captcha.HCaptcha(url)
    # build the JS that writes the token into the response field and submits the challenge form
    script = ("let submitToken = (token) => {{"
              "document.querySelector('[name=h-captcha-response]').innerText = token;"
              "document.querySelector('.challenge-form').submit();"
              "}}; submitToken('{}');").format(code)
    script1 = (f"document.getElementsByName('h-captcha-response')[0].innerText='{code}'")
    print(script)
    browser.execute_script(script)
    time.sleep(5)
    browser.switch_to.parent_frame()
    time.sleep(10)
I'm using proxies in the webdriver and also switching the user agent.
Can someone please explain what I'm doing wrong, or what I should do to break out of the loop?
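Not knowing the site, one thing worth trying (this is only a sketch of an idea, not a known fix) is to replace the fixed sleeps with an explicit wait for the navigation to actually happen after the token is submitted, for example waiting for the URL to change:
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

browser.execute_script(script)
browser.switch_to.default_content()
# wait up to 30s for the page to navigate away instead of sleeping blindly
WebDriverWait(browser, 30).until(EC.url_changes(url))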

How to use Selenium to go from one URL tab to another before scraping?

I have created the following code in hopes of opening a new tab with a few parameters set and then scraping the data table that is on the new tab.
#Open Webpage
url = "https://www.website.com"
driver=webdriver.Chrome(executable_path=r"C:\mypathto\chromedriver.exe")
driver.get(url)
#Click Necessary Parameters
driver.find_element_by_partial_link_text('Output').click()
driver.find_element_by_xpath('//*[@id="flexOpt"]/table/tbody/tr/td[2]/input[3]').click()
driver.find_element_by_xpath('//*[@id="flexOpt"]/table/tbody/tr/td[2]/input[4]').click()
driver.find_element_by_xpath('//*[@id="repOpt"]/table[2]/tbody/tr/td[2]/input[4]').click()
time.sleep(2)
driver.find_element_by_partial_link_text('Dates').click()
driver.find_element_by_xpath('//*[@id="RangeOption"]').click()
driver.find_element_by_xpath('//*[@id="Range"]/table/tbody/tr[1]/td[2]/select/option[2]').click()
driver.find_element_by_xpath('//*[@id="Range"]/table/tbody/tr[1]/td[3]/select/option[1]').click()
driver.find_element_by_xpath('//*[@id="Range"]/table/tbody/tr[1]/td[4]/select/option[1]').click()
driver.find_element_by_xpath('//*[@id="Range"]/table/tbody/tr[2]/td[2]/select/option[2]').click()
driver.find_element_by_xpath('//*[@id="Range"]/table/tbody/tr[2]/td[3]/select/option[31]').click()
driver.find_element_by_xpath('//*[@id="Range"]/table/tbody/tr[2]/td[4]/select/option[1]').click()
time.sleep(2)
driver.find_element_by_partial_link_text('Groupings').click()
driver.find_element_by_xpath('//*[@id="availFld_DATE"]/a/img').click()
driver.find_element_by_xpath('//*[@id="availFld_LOCID"]/a/img').click()
driver.find_element_by_xpath('//*[@id="availFld_STATE"]/a/img').click()
driver.find_element_by_xpath('//*[@id="availFld_DDSO_SA"]/a/img').click()
driver.find_element_by_xpath('//*[@id="availFld_CLASS_ID"]/a/img').click()
driver.find_element_by_xpath('//*[@id="availFld_REGION"]/a/img').click()
time.sleep(2)
driver.find_element_by_partial_link_text('Run').click()
time.sleep(2)
df_url = driver.switch_to_window(driver.window_handles[0])
page = requests.get(df_url).text
soup = BeautifulSoup(page, features = 'html5lib')
soup.prettify()
However, the following error pops up when I run it.
requests.exceptions.MissingSchema: Invalid URL 'None': No schema supplied. Perhaps you meant http://None?
I will say that regardless of the parameters, the new tab always generates the same url. In other words, if the new tab creates www.website.com/b, it also creates www.website.com/b the third, fourth, etc. time, regardless of changing the parameters. Any thoughts?
The problem lies here:
df_url = driver.switch_to_window(driver.window_handles[0])
page = requests.get(df_url).text
df_url is not referring to the url of the page. To get that, you should call driver.current_url after switching windows to get the url of the active window.
Some other pointers:
finding elements by xpath is relatively inefficient (source)
instead of time.sleep, you can look into using explicit waits
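Putting those pointers together, here is a minimal sketch of the fix, assuming the results open in a new tab (i.e. the newest window handle); the selectors are taken from the question:
from selenium.webdriver.support.ui import WebDriverWait

driver.find_element_by_partial_link_text('Run').click()
# explicit wait for the new tab to appear instead of time.sleep
WebDriverWait(driver, 10).until(lambda d: len(d.window_handles) > 1)
driver.switch_to.window(driver.window_handles[-1])  # switch to the newest tab
df_url = driver.current_url                         # now this is the actual URL string
page = requests.get(df_url).text
soup = BeautifulSoup(page, features='html5lib')
One caveat: requests.get starts a fresh session without the Selenium session's cookies, so if the results page requires the login, you may need to copy the cookies over or simply parse driver.page_source instead.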
Also, declare the url after (below) the driver variable, since the webdriver has to be created first and only then is the url requested:
driver=webdriver.Chrome(executable_path=r"C:\mypathto\chromedriver.exe")
url = "https://www.website.com"

Unable to fetch all the necessary links during Iteration - Selenium Python

I am a newbie to Selenium with Python. I am trying to fetch the profile URLs, which appear 10 per page. Without using while, I am able to fetch all 10 URLs, but only for the first page. When I use while, it iterates, but fetches only 3 or 4 URLs per page.
I need to fetch all 10 links and keep iterating through the pages. I think I must do something about the StaleElementReferenceException.
Kindly help me solve this problem.
Given the code below.
def test_connect_fetch_profiles(self):
    driver = self.driver
    search_data = driver.find_element_by_id("main-search-box")
    search_data.clear()
    search_data.send_keys("Selenium Python")
    search_submit = driver.find_element_by_name("search")
    search_submit.click()
    noprofile = driver.find_elements_by_xpath("//*[text() = 'Sorry, no results containing all your search terms were found.']")
    self.assertFalse(noprofile)
    while True:
        wait = WebDriverWait(driver, 150)
        try:
            profile_links = wait.until(EC.presence_of_all_elements_located((By.XPATH, "//*[contains(@href,'www.linkedin.com/profile/view?id=')][text()='LinkedIn Member' or contains(@href,'Type=NAME_SEARCH')][contains(@class,'main-headline')]")))
            for each_link in profile_links:
                page_links = each_link.get_attribute('href')
                print(page_links)
                driver.implicitly_wait(15)
                appendFile = open("C:\\Users\\jayaramb\\Documents\\profile-links.csv", 'a')
                appendFile.write(page_links + "\n")
                appendFile.close()
                driver.implicitly_wait(15)
            next = wait.until(EC.visibility_of(driver.find_element_by_partial_link_text("Next")))
            if next.is_displayed():
                next.click()
            else:
                print("End of Page")
                break
        except ValueError:
            print("It seems no values to fetch")
        except NoSuchElementException:
            print("No Elements to Fetch")
        except StaleElementReferenceException:
            print("No Change in Element Location")
        else:
            break
Please let me know if there are any other effective ways to fetch the required profile URL and keep iterating through pages.
I created a similar setup which works all right for me. I've had some problems with Selenium trying to click on the next button and throwing a WebDriverException instead, likely because the next button is not in view. Hence, instead of clicking the next button, I get its href attribute and load the new page with driver.get(), thus avoiding an actual click and making the test more stable.
def test_fetch_google_links():
    links = []
    # Setup driver
    driver = webdriver.Firefox()
    driver.implicitly_wait(10)
    driver.maximize_window()
    # Visit google
    driver.get("https://www.google.com")
    # Enter search query
    search_data = driver.find_element_by_name("q")
    search_data.send_keys("test")
    # Submit search query
    search_button = driver.find_element_by_xpath("//button[@type='submit']")
    search_button.click()
    while True:
        # Find and collect all anchors
        anchors = driver.find_elements_by_xpath("//h3//a")
        links += [a.get_attribute("href") for a in anchors]
        try:
            # Find the next page button
            next_button = driver.find_element_by_xpath("//a[@id='pnnext']")
            location = next_button.get_attribute("href")
            driver.get(location)
        except NoSuchElementException:
            break
    # Do something with the links
    for l in links:
        print(l)
    print("Found {} links".format(len(links)))
    driver.quit()

Is it possible to forward Selenium/webdriver results to mechanize/Beautiful Soup?

OK, so I pretty much used webdriver to navigate to a specific page with a table of results contained in a unique div. I had to use webdriver to fill the forms and interact with the JavaScript buttons. Anyway, I need to scrape the table into a file, but I can't figure this out. Here's the code:
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
import time
# Open Firefox
driver = webdriver.Firefox()
driver.get("https://subscriber.hoovers.com/H/login/login.html")
# Login and submit
username = driver.find_element_by_id('j_username')
username.send_keys('THE_EMAIL_ADDRESS')
password = driver.find_element_by_id('j_password')
password.send_keys('THE_PASSWORD')
password.submit()
# go to "build a list" url (more like 'build-a-table' get it right guys!
driver.get('http://subscriber.hoovers.com/H/search/buildAList.html?_target0=true')
# expand industry list to reveal SIC codes form
el = driver.find_elements_by_xpath("//h2[contains(string(), 'Industry')]")[0]
action = webdriver.common.action_chains.ActionChains(driver)
action.move_to_element_with_offset(el, 5, 5)
action.click()
action.perform()
# fill sic.codes form with all the SIC codes
siccodes = driver.find_element_by_id('advancedSearchCriteria.sicCodes')
siccodes.send_keys('316998,321114,321211,321212,321213,321214,321219,321911,'
                   '321912,321918,321992,322121,322130,326122,326191,326199,327110,327120,'
                   '327212,327215,327320,327331,327332,327390,327410,327420,327910,327991,'
                   '327993,327999,331313,331315,332216,332311,332312,332321,332322,332323,'
                   '333112,333414,333415,333991,334290,335110,335121,335122,335129,335210,'
                   '335221,335222,335224,335228,335311,335312,335912,335929,335931,335932,'
                   '335999,337920,339910,339993,339994,339999,423310,423320,423330,423610,'
                   '423620,423710,423720,423730,424950,444120')
# wait 5 seconds because this is a big list to load
time.sleep(5)
# Select "Add to List" button and clickity-clickidy-CLICK!
butn = driver.find_element_by_xpath('/html/body/div[2]/div[3]/div[1]/form/div/div[3]/div/div[2]/div[1]/div[2]/p[1]/button')
action = webdriver.common.action_chains.ActionChains(driver)
action.move_to_element_with_offset(butn, 5, 5)
action.click()
action.perform()
# wait 10 seconds to add them to list
time.sleep(10)
# Now select confirm list button and wait to be forwarded to results page
butn = driver.find_element_by_xpath('/html/body/div[3]/div/div[1]/input[2]')
action = webdriver.common.action_chains.ActionChains(driver)
action.send_keys("\n")
action.move_to_element_with_offset(butn, 5, 5)
action.click()
action.perform()
# wait 10 seconds, let it load and dig up them numbah tables
time.sleep(10)
# Check that we're on the right results landing page...
print(driver.current_url)
# Good we have arrived! Now lets save this url for scrape-time!
url = driver.current_url
# Print everything... but we only need the table!!! HOWW?!?!?!?
sourcecode = driver.page_source.encode("utf-8")
# EVERYTHING AFTER THIS POINT DOESN'T WORK!!!!
All I need is to print the table out, as organized as possible, with a for loop, but it seems this works a lot better with mechanize or BeautifulSoup. So is this possible? Any suggestions? Also, sorry if my code is sloppy; I'm multitasking with deadlines and other scripts. Please help me! I will provide my login credentials if you really need them and want to help me. It's nothing too serious, just a company SIC and D-U-N-S number database, but I don't think you need it to figure this out. I know there are a few Jedi out there who can save me. :)
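Here is a minimal sketch of how the page source already captured in sourcecode could be fed to BeautifulSoup; the div id and output filename are placeholders, since the real markup isn't shown:
from bs4 import BeautifulSoup

soup = BeautifulSoup(sourcecode, 'html.parser')
# 'resultsDiv' is a placeholder -- inspect the page source for the real id/class of the unique div
container = soup.find('div', id='resultsDiv') or soup
table = container.find('table')

with open('results.csv', 'w') as out:
    for row in table.find_all('tr'):
        cells = [cell.get_text(strip=True) for cell in row.find_all(['th', 'td'])]
        out.write(','.join(cells) + '\n')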

How to store the information in a 'NoType' HTML element in Selenium - Python

I'm trying to take some information from an HTML element using Selenium with Python, and I'm unsure how to save it. I'm kind of new to programming, but literate enough to know how to write code; it's just hard to research answers and adapt them to my code. I've looked on Google and can't seem to find anything that would help me with what I specifically need.
Here is the HTML element I need to get information from:
<span id="ctl00_plnMain_rptAssigmnetsByCourse_ctl00_lblOverallAverage">99.05</span>
I need to retrieve the 99.05 and store it in a variable named "avg."
Here is my code I have for the Selenium test.
username = raw_input("Username: ")
password = raw_input("Password: ")
browser = webdriver.Firefox() # Get local session of firefox
browser.get("https://hac.mckinneyisd.net/homeaccess/default.aspx") # Load page
elem = browser.find_element_by_name("ctl00$plnMain$txtLogin") # Find the query box
elem.send_keys(username)
elem = browser.find_element_by_name("ctl00$plnMain$txtPassword") # Find the password box
elem.send_keys(password + Keys.RETURN)
time.sleep(0.2) # Let the page load
elem = browser.find_element_by_link_text("Classwork").click()
time.sleep(0.2)
???????????????
browser.close()
What should I put in the ???... to take the 99.05 from the object and save it as "avg?" I have tried:
content = elem.text("td[#id='ctl00....lblOverallAverage']"
...but I get an error saying that I can't do that because it has no type.
Try:
elem = browser.find_element_by_id("ctl00_plnMain_rptAssigmnetsByCourse_ctl00_lblOverallAverage")
avg = elem.text  # in the Python bindings, the text is a property, not a getText() method
