Is it possible to forward Selenium/webdriver results to mechanize/BeautifulSoup - Python

OK, so I used webdriver to navigate to a specific page with a table of results contained in a unique div. I had to use webdriver to fill the forms and interact with the JavaScript buttons. Now I need to scrape that table into a file, but I can't figure this part out. Here's the code:
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
import time
# Open Firefox
driver = webdriver.Firefox()
driver.get("https://subscriber.hoovers.com/H/login/login.html")
# Login and submit
username = driver.find_element_by_id('j_username')
username.send_keys('THE_EMAIL_ADDRESS')
password = driver.find_element_by_id('j_password')
password.send_keys('THE_PASSWORD')
password.submit()
# go to "build a list" url (more like 'build-a-table' get it right guys!
driver.get('http://subscriber.hoovers.com/H/search/buildAList.html?_target0=true')
# expand industry list to reveal SIC codes form
el = driver.find_elements_by_xpath("//h2[contains(string(), 'Industry')]")[0]
action = webdriver.common.action_chains.ActionChains(driver)
action.move_to_element_with_offset(el, 5, 5)
action.click()
action.perform()
# fill sic.codes form with all the SIC codes
siccodes = driver.find_element_by_id('advancedSearchCriteria.sicCodes')
siccodes.send_keys('316998,321114,321211,321212,321213,321214,321219,321911,'
'321912,321918,321992,322121,322130,326122,326191,326199,327110,327120,'
'327212,327215,327320,327331,327332,327390,327410,327420,327910,327991,'
'327993,327999,331313,331315,332216,332311,332312,332321,332322,332323,'
'333112,333414,333415,333991,334290,335110,335121,335122,335129,335210,'
'335221,335222,335224,335228,335311,335312,335912,335929,335931,335932,'
'335999,337920,339910,339993,339994,339999,423310,423320,423330,423610,'
'423620,423710,423720,423730,424950,444120')
# wait 5 seconds because this is a big list to load
time.sleep(5)
# Select "Add to List" button and clickity-clickidy-CLICK!
butn = driver.find_element_by_xpath('/html/body/div[2]/div[3]/div[1]/form/div/div[3]/div/div[2]/div[1]/div[2]/p[1]/button')
action = webdriver.common.action_chains.ActionChains(driver)
action.move_to_element_with_offset(butn, 5, 5)
action.click()
action.perform()
# wait 10 seconds to add them to list
time.sleep(10)
# Now select confirm list button and wait to be forwarded to results page
butn = driver.find_element_by_xpath('/html/body/div[3]/div/div[1]/input[2]')
action = webdriver.common.action_chains.ActionChains(driver)
action.send_keys("\n")
action.move_to_element_with_offset(butn, 5, 5)
action.click()
action.perform()
# wait 10 seconds, let it load and dig up them numbah tables
time.sleep(10)
# Check that we're on the right results landing page...
print(driver.current_url)
# Good we have arrived! Now lets save this url for scrape-time!
url = driver.current_url
# Print everything... but we only need the table!!! HOWW?!?!?!?
sourcecode = driver.page_source.encode("utf-8")
# EVERYTHING AFTER THIS POINT DOESN'T WORK!
All I need is to print the table out, as organized as possible, with a for loop, but it seems this kind of parsing works a lot better with mechanize or BeautifulSoup. So is this possible? Any suggestions? Also, sorry if my code is sloppy; I'm multitasking with deadlines and other scripts. I can provide my login credentials if you really need them and want to help. It's nothing too serious, just a company SIC and D-U-N-S number database, but I don't think you need it to figure this out. I know there are a few Jedis out there who can save me. :)
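Yes, this is possible, and you don't even need mechanize: BeautifulSoup will happily parse whatever HTML Selenium has already rendered, via driver.page_source. A minimal sketch of the hand-off (the 'searchResults' id is a placeholder; inspect the page and substitute the real id of the div that holds the table):
from bs4 import BeautifulSoup

# Parse the page Selenium already rendered; no second request or login needed.
soup = BeautifulSoup(driver.page_source, 'html.parser')

# 'searchResults' is a hypothetical id -- use the actual id of the results div.
table = soup.find('div', id='searchResults').find('table')

with open('results.txt', 'w') as f:
    for row in table.find_all('tr'):
        cells = [cell.get_text(strip=True) for cell in row.find_all(['td', 'th'])]
        f.write('\t'.join(cells) + '\n')
Because the session state lives in the browser, there is no need to replay the login or the form interaction in a second library; you parse the source Selenium hands you and write the rows out as you go.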

Related

Selenium - HTML doesn't always update after a click. Content in the browser changes, but I often get the same HTML from prior to the click

I'm trying to set up a simple webscraping script to pull every hyperlink from the discover cards on Bandcamp.
Here is my code:
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.select import Select
import time

browser = webdriver.Chrome()
all_links = []
page = 1
url = "https://bandcamp.com/?g=all&s=new&p=0&gn=0&f=digital&w=-1"
browser.get(url)
while page < 6:
    page += 1
    # wait until discover cards are loaded
    test = WebDriverWait(browser, 20).until(EC.element_to_be_clickable(
        (By.XPATH, '//*[@id="discover"]/div[9]/div[2]/div/div[1]/div/table/tbody/tr[1]/td[1]/a/div')))
    # scrape hyperlinks for each of the 8 albums shown
    titles = browser.find_elements(By.CLASS_NAME, "item-title")
    links = [title.get_attribute('href') for title in titles[-8:]]
    all_links = all_links + links
    print(links)
    # pagination - click through the page buttons as the links are scraped
    page_nums = browser.find_elements(By.CLASS_NAME, 'item-page')
    for page_num in page_nums:
        if page_num.text.isnumeric():
            if int(page_num.text) == page:
                page_num.click()
                time.sleep(20)  # I've tried multiple long wait times as well as WebDriverWaits on different elements to see if the HTML will update, but I haven't seen a positive effect
                break
I'm using print(links) to see where this is going wrong. In the Selenium browser it clicks through the pages fine. Note that pagination via the URL parameters doesn't seem possible, as the discover cards often won't load unless you click the page buttons towards the bottom of the page. BeautifulSoup and Requests don't work either, for the same reason. The print function returns the following:
['https://neubauten.bandcamp.com/album/stimmen-reste-musterhaus-7?from=discover-new', 'https://cirka1.bandcamp.com/album/time?from=discover-new', 'https://futuramusicsound.bandcamp.com/album/yoga-meditation?from=discover-new', 'https://deathsoundbatrecordings.bandcamp.com/album/real-mushrooms-dsbep092?from=discover-new', 'https://riacurley.bandcamp.com/album/take-me-album?from=discover-new', 'https://terracuna.bandcamp.com/album/el-origen-del-viento?from=discover-new', 'https://hyper-music.bandcamp.com/album/hypermusic-vol-4?from=discover-new', 'https://defisis1.bandcamp.com/album/priceless?from=discover-new']
['https://jarnosalo.bandcamp.com/album/here-lies-ancient-blob?from=discover-new', 'https://andreneitzel.bandcamp.com/album/allegasi-gold-2?from=discover-new', 'https://moonraccoon.bandcamp.com/album/prequels?from=discover-new', 'https://lolivone.bandcamp.com/album/live-at-the-berklee-performance-center?from=discover-new', 'https://nilswrasse.bandcamp.com/album/a-calling-from-the-desert-to-the-sea-original-motion-picture-soundtrack?from=discover-new', 'https://whitereaperaskingride.bandcamp.com/album/asking-for-a-ride?from=discover-new', 'https://collageeffect.bandcamp.com/album/emerald-network?from=discover-new', 'https://foxteethnj.bandcamp.com/album/through-the-blue?from=discover-new']
['https://jarnosalo.bandcamp.com/album/here-lies-ancient-blob?from=discover-new', 'https://andreneitzel.bandcamp.com/album/allegasi-gold-2?from=discover-new', 'https://moonraccoon.bandcamp.com/album/prequels?from=discover-new', 'https://lolivone.bandcamp.com/album/live-at-the-berklee-performance-center?from=discover-new', 'https://nilswrasse.bandcamp.com/album/a-calling-from-the-desert-to-the-sea-original-motion-picture-soundtrack?from=discover-new', 'https://whitereaperaskingride.bandcamp.com/album/asking-for-a-ride?from=discover-new', 'https://collageeffect.bandcamp.com/album/emerald-network?from=discover-new', 'https://foxteethnj.bandcamp.com/album/through-the-blue?from=discover-new']
['https://jarnosalo.bandcamp.com/album/here-lies-ancient-blob?from=discover-new', 'https://andreneitzel.bandcamp.com/album/allegasi-gold-2?from=discover-new', 'https://moonraccoon.bandcamp.com/album/prequels?from=discover-new', 'https://lolivone.bandcamp.com/album/live-at-the-berklee-performance-center?from=discover-new', 'https://nilswrasse.bandcamp.com/album/a-calling-from-the-desert-to-the-sea-original-motion-picture-soundtrack?from=discover-new', 'https://whitereaperaskingride.bandcamp.com/album/asking-for-a-ride?from=discover-new', 'https://collageeffect.bandcamp.com/album/emerald-network?from=discover-new', 'https://foxteethnj.bandcamp.com/album/through-the-blue?from=discover-new']
['https://finitysounds.bandcamp.com/album/kreme?from=discover-new', 'https://mylittlerobotfriend.bandcamp.com/album/amen-break?from=discover-new', 'https://electrinityband.bandcamp.com/album/rise?from=discover-new', 'https://abyssal-void.bandcamp.com/album/ritualist?from=discover-new', 'https://plataformarecs.bandcamp.com/album/v-a-david-lynch-experience?from=discover-new', 'https://hurricaneturtles.bandcamp.com/album/industrial-synth?from=discover-new', 'https://blackwashband.bandcamp.com/album/2?from=discover-new', 'https://worldwide-bitchin-records.bandcamp.com/album/wack?from=discover-new']
Each time it correctly pulls the first 8 albums on page 1; then for pages 2-4 it repeats the 8 albums of page 2, for pages 5-7 it repeats the 8 albums of page 5, and so on. Even though the page updates (and the URL changes) in the Selenium browser, Selenium doesn't seem to register any changes to the HTML, so it repeats the same titles. Any idea where I've gone wrong?
Your definition of titles, i.e.
titles = browser.find_elements(By.CLASS_NAME, "item-title")
is a bad idea, because item-title is the class of many elements on the page. Another bad idea is to pick titles[-8:]. It may sound good, because you assume that each time you click a page the new elements are appended at the end, but that is not always the case. Yours is one of those cases where elements are not added sequentially.
So let's start by considering a class exclusive to the cards, for example discover-item. Open the DevTools, press CTRL+F and enter .discover-item. When the URL is first loaded, it finds 8 results. Click the next page and it finds 16 results; click again and it finds 24. To better see what's going on, I suggest running the following each time you click the "next" button.
el = driver.find_elements(By.CSS_SELECTOR, '.discover-item')
for i, e in enumerate(el):
    print(i, e.get_attribute('innerText').replace('\n', ' - '))
In particular, when you arrive at page 3, you will see that the first item shown on page 1 (which in my case is 0 and friends - Jacob Stanley - alternative) is now printed at a different position (in my case 8 and friends - Jacob Stanley - alternative). What happened is that the items of page 3 were added at the beginning of the list, which is why titles[-8:] was a bad choice.
So a better choice is to consider all the cards each time you go to the next page, instead of only the last 8 (notice that the HTML of this site can contain no more than 24 cards), and to add all current cards to a set (since a set cannot contain duplicates, only new elements will be added).
# scroll to cards
cards = WebDriverWait(driver, 20).until(EC.visibility_of_all_elements_located((By.CSS_SELECTOR, ".discover-item")))
driver.execute_script('arguments[0].scrollIntoView({block: "center", behavior: "smooth"});', cards[0])
time.sleep(1)

items = set()
while True:
    links = driver.find_elements(By.CSS_SELECTOR, '.discover-item .item-title')
    # extract info and store it
    for idx, card in enumerate(cards):
        tit_art_gen = card.get_attribute('innerText').replace('\n', ' - ')
        href = links[idx].get_attribute('href')
        # print(idx, tit_art_gen)
        items.add(tit_art_gen + ' - ' + href)
    # click 'next' button if it is not disabled
    next_button = driver.find_element(By.XPATH, "//a[.='next']")
    if 'disabled' in next_button.get_attribute('class'):
        print('last page reached')
        break
    else:
        next_button.click()
    # wait until new elements are loaded
    cards_new = cards.copy()
    while cards_new == cards:
        cards_new = driver.find_elements(By.CSS_SELECTOR, '.discover-item')
        time.sleep(.5)
    cards = cards_new.copy()

Selenium scraping issues with a site having a popup window with endless scroll

I am trying to scrape a website that populates a list of providers. The site makes you go through a list of options, and then it finally populates the list of providers in a popup that has an endless/continuous scroll.
I have tried:
from selenium.webdriver.common.action_chains import ActionChains
element = driver.find_element_by_id("my-id")
actions = ActionChains(driver)
actions.move_to_element(element).perform()
but this code didn't work.
I tried something similar to this:
driver.execute_script("arguments[0].scrollIntoView();", list )
but this didnt move anything. it just stayed on the first 20 providers.
i tried this alternative:
main = driver.find_element_by_id('mainDiv')
recentList = main.find_elements_by_class_name('nameBold')
for list in recentList:
    driver.execute_script("arguments[0].scrollIntoView(true);", list)
    time.sleep(20)
but ended up with this error message:
selenium.common.exceptions.StaleElementReferenceException: Message: stale element reference: element is not attached to the page document
The code that worked the best was this one:
while True:
    # Scroll down to bottom
    element_inside_popup = driver.find_element_by_xpath('//*[@id="mainDiv"]')
    element_inside_popup.send_keys(Keys.END)
    # Wait to load page
    time.sleep(3)
but this is an endless scroll that I don't know how to stop, since "while True:" will always be true.
Any help with this would be great and thanks in advance.
This is my code so far:
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
import time
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support.ui import Select
import pandas as pd
PATH = '/Users/AnthemScraper/venv/chromedriver'
driver = webdriver.Chrome(PATH)
#location for the website
driver.get('https://shop.anthem.com/sales/eox/abc/ca/en/shop/plans/medical/snq?execution=e1s13')
print(driver.title)
#entering the zipcode
search = driver.find_element_by_id('demographics.zip5')
search.send_keys(90210)
#making the scraper sleep for 5 seconds while the page loads
time.sleep(5)
#entering first name and DOB then hitting next
search = driver.find_element_by_id('demographics.applicants0.firstName')
search.send_keys('juelz')
search = driver.find_element_by_id('demographics.applicants0.dob')
search.send_keys('01011990')
driver.find_element_by_xpath('//*[#id="button/shop/getaquote/next"]').click()
#hitting the next button
driver.find_element_by_xpath('//*[#id="hypertext/shop/estimatesavings/skipthisstep"]').click()
#making the scraper sleep for 2 seconds while the page loads
time.sleep(2)
#clicking the no option to view all the health plans
driver.find_element_by_xpath('//*[#id="radioNoID"]').click()
driver.find_element_by_xpath('/html/body/div[4]/div[11]/div/button[2]/span').click()
#making the scraper sleep for 2 seconds while the page loads
time.sleep(2)
driver.find_element_by_xpath('//*[#id="hypertext/shop/medical/showmemydoctorlink"]/span').click()
time.sleep(2)
#section to choose the specialist. here we are choosing all
find_specialist = driver.find_element_by_xpath('//*[@id="specializedin"]')
#this is the method for a dropdown
select_provider = Select(find_specialist)
select_provider.select_by_visible_text('All Specialties')
#choosing the distance. Here we click on 50 miles
choose_mile_radius = driver.find_element_by_xpath('//*[@id="distanceInMiles"]')
select_provider = Select(choose_mile_radius)
select_provider.select_by_visible_text('50 miles')
driver.find_element_by_xpath('/html/body/div[4]/div[11]/div/button[2]/span').click()
#handling the endless scroll
while True:
    time.sleep(20)
    # Scroll down to bottom
    element_inside_popup = driver.find_element_by_xpath('//*[@id="mainDiv"]')
    element_inside_popup.send_keys(Keys.END)
    # Wait to load page
    time.sleep(3)
#block below allows us to grab the majority of the data. we would have to split it up in pandas since this info
#is nested in with classes
time.sleep(5)
main = driver.find_element_by_id('mainDiv')
sections = main.find_elements_by_class_name('firstRow')
pcp_info = []
#print(section.text)
for pcp in sections:
    # the site stores the information inside inner classes, which makes it difficult to scrape.
    # the solution is to pull the entire text of the block and clean it afterwards.
    # innerText allows us to pull just the text inside the blocks
    first_blox = pcp.find_element_by_class_name('table_content_colone').get_attribute('innerText')
    second_blox = pcp.find_element_by_class_name('table_content_coltwo').get_attribute('innerText')
    # creating columns and rows and assigning them
    pcp_items = {
        'first_block': [first_blox],
        'second_block': [second_blox]
    }
    pcp_info.append(pcp_items)
df = pd.DataFrame(pcp_info)
print(df)
df.to_csv('yerp.csv',index=False)
#driver.quit()
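One common way to break out of that endless loop (a sketch, untested against this site, reusing the element_inside_popup from the code above) is to compare the popup's scroll height before and after each scroll and stop once it stops growing:
element_inside_popup = driver.find_element_by_xpath('//*[@id="mainDiv"]')
last_height = driver.execute_script("return arguments[0].scrollHeight;", element_inside_popup)
while True:
    element_inside_popup.send_keys(Keys.END)
    time.sleep(3)  # give the next batch of providers time to load
    new_height = driver.execute_script("return arguments[0].scrollHeight;", element_inside_popup)
    if new_height == last_height:
        break  # nothing new loaded, so the bottom has been reached
    last_height = new_height
The scraping block that follows the scroll loop can then run as written, since every provider will already be in the DOM by the time it executes.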

Python loop through pages of website using Selenium

I've spent quite a bit of time on this and I'm hoping to get some help... I'm new to Python and web scraping.
I'm accessing a website using credentials so I won't be able to share the link, but it's fairly straightforward and I have most of the code. Using Selenium, I'm able to access the website, input my credentials, access a table, pull in data I want, create a data frame, and go to the next page. But, I would like to automatically loop through all pages (with some pauses and being kind to the site) and append each page to a master. This is what I have so far:
driver = webdriver.Chrome()
driver.get('website')
username = driver.find_element_by_id("username")
password = driver.find_element_by_id("password")
username.send_keys("username")
password.send_keys("password"+"\n")
driver.implicitly_wait(20)
table = driver.find_element_by_id('preblockBody')
information = []
job_elems = table.find_elements_by_xpath("//*[contains(@class,'pbListingTable')]")
for value in job_elems:
    #print(value.text)
    information.append(value.text)
nxt = driver.find_element_by_xpath("//a[contains(@href, 'gotoNextPage(2)')]")
driver.execute_script("arguments[0].click();", nxt)
I think the best route is finding all the 'gotoNextPage' references and creating a loop, but I'm unsure how to do so. Any help is appreciated very much.
UPDATE 1:
I've found something helpful where I use 'Next' instead of clicking the specific 'gotoNextPage' element. Here is my new code; however, it only appends the last page of info rather than appending as it goes through the pages. This is very close!
driver = webdriver.Chrome()
driver.get('website')
username = driver.find_element_by_id("username")
password = driver.find_element_by_id("password")
username.send_keys("user name")
password.send_keys("password"+"\n")
while True:
    driver.implicitly_wait(30)
    table = driver.find_element_by_id('preblockBody')
    information = []
    job_elems = table.find_elements_by_xpath("//*[contains(@class,'pbListingTable')]")
    for value in job_elems:
        #print(value.text)
        information.append(value.text)
    try:
        driver.find_element_by_partial_link_text('Next').click()
    except:
        break
driver.quit()
print(information)
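The reason only the last page survives is that information = [] is re-created on every pass of the while loop, so each page wipes out the previous one. Initializing the list once, before the loop, lets every page append to the same master list; a sketch of the reordered loop:
information = []  # initialize once, before the loop
while True:
    driver.implicitly_wait(30)
    table = driver.find_element_by_id('preblockBody')
    job_elems = table.find_elements_by_xpath("//*[contains(@class,'pbListingTable')]")
    for value in job_elems:
        information.append(value.text)
    try:
        driver.find_element_by_partial_link_text('Next').click()
    except:
        break
driver.quit()
print(information)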

Selenium: webdriver.page_source not refreshing after click

I am trying to copy a web page's list of addresses for a given community service to a new document so I can geocode all of the locations in a map. Instead of being able to get a list of all the parcels at once, I can only download one page at a time, and each page is limited to 25 parcel numbers. As such, this would be extremely time consuming.
I want to develop a script that will look at the page source (everything including the 25 addresses, which are contained in a table tag), click the next page button, copy the next page, and so on until the max page is reached. Afterwards, I can format the text to be geocoding compatible.
The code below does all of this except it only copies the first page over and over again even though I can clearly see that the program has successfully navigated to the next page:
# Open chrome
br = webdriver.Chrome()
raw_input("Navigate to web page. Press enter when done: ")
pg_src = br.page_source.encode("utf")
soup = BeautifulSoup(pg_src)
max_page = 122 #int(max_page)
#open a text doc to write the results to
f = open(r'C:\Geocoding\results.txt', 'w')
# write results page by page until max page number is reached
pg_cnt = 1 # start on 1 as we should already have the first page
while pg_cnt < max_page:
    tble_elems = soup.findAll('table')
    soup = BeautifulSoup(str(tble_elems))
    f.write(str(soup))
    time.sleep(5)
    pg_cnt += 1
    # clicks the next button
    br.find_element_by_xpath("//div[@class='next button']").click()
    # give some time for the page to load
    time.sleep(5)
    # get the new page source (THIS IS THE PART THAT DOESN'T SEEM TO BE WORKING)
    page_src = br.page_source.encode("utf")
    soup = BeautifulSoup(pg_src)
f.close()
I faced the same problem.
I think it happens because some JavaScript hasn't finished loading when you grab the page source. All you need to do is wait until an element that only appears on the new page has loaded, and only then re-read the page source. The code below worked for me:
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By
delay = 10  # seconds
try:
    myElem = WebDriverWait(driver, delay).until(EC.presence_of_element_located((By.CLASS_NAME, 'legal-attribute-row')))
except:
    print("Loading took too much time!")

Selenium Clicks but Doesn't Select

I'm working on an automated web scraper for Khan Academy to make offline backups of questions, using Selenium and a Python scraper (to come later). I'm currently working on getting it to answer questions (right or wrong doesn't matter) to proceed through exercises. Unfortunately, Selenium's .click() function doesn't actually select an answer. I think it has something to do with pointing at the wrong object, but I can't tell. It currently highlights the option, but doesn't select it.
HTML for a single option (out of 4)
I made some code to reproduce the error, and hooked it up to a test account for all your debugging needs. Thanks.
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
driver = webdriver.Firefox()
# gets us to the SAT Math Exercise page
driver.get('https://www.khanacademy.org/mission/sat/tasks/5505307403747328')
# these next lines just automate logging in. Nothing special.
login = driver.find_element_by_name('identifier')
login.send_keys('stackflowtest') # look, I made a new account just for you guys
passw = driver.find_element_by_name('password')
passw.send_keys('stackoverflow')
button = driver.find_elements_by_xpath("//*[contains(text(), 'Sign in')]")
button[1].click()
driver.implicitly_wait(5) # wait for things to become visible
radio = driver.find_element_by_class_name('perseus-radio-option')
radio.click()
check = driver.find_element_by_xpath("//*[contains(text(), 'Check answer')]")
check.click()
After a trial-and-error process, I found that the actual click selection can be accomplished by pointing to the element with class "description":
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
driver = webdriver.Firefox()
# gets us to the SAT Math Exercise page
driver.get('https://www.khanacademy.org/mission/sat/tasks/5505307403747328')
# these next lines just automate logging in. Nothing special.
login = driver.find_element_by_name('identifier')
login.send_keys('stackflowtest') # look, I made a new account just for you guys
passw = driver.find_element_by_name('password')
passw.send_keys('stackoverflow')
button = driver.find_elements_by_xpath("//*[contains(text(), 'Sign in')]")
button[1].click()
driver.implicitly_wait(5) # wait for things to become visible
radio = driver.find_element_by_class_name('description')
radio.click()
check = driver.find_element_by_xpath("//*[contains(text(), 'Check answer')]")
check.click()
For people dealing with similar issues, I would recommend clicking on the very edge of the space that allows you to select and inspecting the element there. This prevents you from accidentally using one of the innermost tags.
