I am trying to scrape a website that populates a list of providers. The site makes you go through a series of options, and then it finally shows the list of providers in a pop-up that has an endless/continuous scroll.
I have tried:
from selenium.webdriver.common.action_chains import ActionChains
element = driver.find_element_by_id("my-id")
actions = ActionChains(driver)
actions.move_to_element(element).perform()
but this code didn't work.
I tried something similar to this:
driver.execute_script("arguments[0].scrollIntoView();", list )
but this didn't move anything; it just stayed on the first 20 providers.
I tried this alternative:
main = driver.find_element_by_id('mainDiv')
recentList = main.find_elements_by_class_name('nameBold')
for list in recentList :
    driver.execute_script("arguments[0].scrollIntoView(true);", list)
    time.sleep(20)
but ended up with this error message:
selenium.common.exceptions.StaleElementReferenceException: Message: stale element reference: element is not attached to the page document
The code that worked the best was this one:
while True:
    # Scroll down to bottom
    element_inside_popup = driver.find_element_by_xpath('//*[@id="mainDiv"]')
    element_inside_popup.send_keys(Keys.END)
    # Wait to load page
    time.sleep(3)
but this is an endless scroll that I don't know how to stop, since "while True:" will always be true.
Any help with this would be great and thanks in advance.
This is my code so far:
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
import time
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support.ui import Select
import pandas as pd
PATH = '/Users/AnthemScraper/venv/chromedriver'
driver = webdriver.Chrome(PATH)
#location for the website
driver.get('https://shop.anthem.com/sales/eox/abc/ca/en/shop/plans/medical/snq?execution=e1s13')
print(driver.title)
#entering the zipcode
search = driver.find_element_by_id('demographics.zip5')
search.send_keys(90210)
#making the scraper sleep for 5 seconds while the page loads
time.sleep(5)
#entering first name and DOB then hitting next
search = driver.find_element_by_id('demographics.applicants0.firstName')
search.send_keys('juelz')
search = driver.find_element_by_id('demographics.applicants0.dob')
search.send_keys('01011990')
driver.find_element_by_xpath('//*[@id="button/shop/getaquote/next"]').click()
#hitting the next button
driver.find_element_by_xpath('//*[@id="hypertext/shop/estimatesavings/skipthisstep"]').click()
#making the scraper sleep for 2 seconds while the page loads
time.sleep(2)
#clicking the no option to view all the health plans
driver.find_element_by_xpath('//*[@id="radioNoID"]').click()
driver.find_element_by_xpath('/html/body/div[4]/div[11]/div/button[2]/span').click()
#making the scraper sleep for 2 seconds while the page loads
time.sleep(2)
driver.find_element_by_xpath('//*[@id="hypertext/shop/medical/showmemydoctorlink"]/span').click()
time.sleep(2)
#section to choose the specialist. here we are choosing all
find_specialist=\
    driver.find_element_by_xpath('//*[@id="specializedin"]')
#this is the method for a dropdown
select_provider = Select(find_specialist)
select_provider.select_by_visible_text('All Specialties')
#choosing the distance. Here we click on 50 miles
choose_mile_radius=\
    driver.find_element_by_xpath('//*[@id="distanceInMiles"]')
select_provider = Select(choose_mile_radius)
select_provider.select_by_visible_text('50 miles')
driver.find_element_by_xpath('/html/body/div[4]/div[11]/div/button[2]/span').click()
#handling the endless scroll
while True:
    time.sleep(20)
    # Scroll down to bottom
    element_inside_popup = driver.find_element_by_xpath('//*[@id="mainDiv"]')
    element_inside_popup.send_keys(Keys.END)
    # Wait to load page
    time.sleep(3)
#block below allows us to grab the majority of the data. we would have to split it up in pandas since this info
#is nested in with classes
time.sleep(5)
main = driver.find_element_by_id('mainDiv')
sections = main.find_elements_by_class_name('firstRow')
pcp_info = []
#print(section.text)
for pcp in sections:
    #the site stores the information inside inner classes which make it difficult to scrape.
    #the solution would be to pull the entire text in the block and hope to clean it afterwards
    #innerText allows us to pull just the text inside the blocks
    first_blox = pcp.find_element_by_class_name('table_content_colone').get_attribute('innerText')
    second_blox = pcp.find_element_by_class_name('table_content_coltwo').get_attribute('innerText')
    #creating columns and rows and assigning them
    pcp_items = {
        'first_block' : [first_blox],
        'second_block' : [second_blox]
    }
    pcp_info.append(pcp_items)
df = pd.DataFrame(pcp_info)
print(df)
df.to_csv('yerp.csv',index=False)
#driver.quit()
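One way to stop the loop is to keep scrolling only while the number of loaded providers keeps growing. Below is a minimal sketch of that idea, not a tested solution for this particular site. The helper name `scroll_until_stable` and its parameters are made up for illustration, and the Selenium wiring in the trailing comments assumes that the `firstRow` class and the `mainDiv` id from the question identify the rows and the scrollable pop-up.

```python
import time


def scroll_until_stable(count_items, scroll_once, pause=3.0,
                        max_idle_rounds=2, max_rounds=200):
    """Scroll until the number of loaded items stops growing.

    count_items: callable returning how many items are currently loaded.
    scroll_once: callable that performs one scroll step.
    Returns the final item count.
    """
    idle_rounds = 0
    last_count = count_items()
    for _ in range(max_rounds):
        if idle_rounds >= max_idle_rounds:
            break  # nothing new appeared for several rounds, assume the end
        scroll_once()
        time.sleep(pause)  # give the page time to load more rows
        current = count_items()
        if current > last_count:
            last_count = current
            idle_rounds = 0
        else:
            idle_rounds += 1
    return last_count


# Hypothetical wiring to the code in the question (not executed here):
#
# def count_items():
#     return len(driver.find_elements_by_class_name('firstRow'))
#
# def scroll_once():
#     driver.find_element_by_xpath('//*[@id="mainDiv"]').send_keys(Keys.END)
#
# total = scroll_until_stable(count_items, scroll_once)
```

The `max_rounds` cap is only a safety net so the loop cannot run forever if the list never stops growing; `max_idle_rounds` tolerates one slow load before giving up.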
I am trying to send several items from a CSV file to a webform using python so I don't have to type it all in by hand, especially when I update the sheet later. I tried using the answer to this question and the page comes up and seems to "submit" but I get told the import failed.
My Code
from bs4 import BeautifulSoup
from requests import get
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
import time
import pandas as pd
# using Pandas to read the csv file
source_information = pd.read_csv('C:/chrome_driver/test_csv.csv', header=None, skiprows=[0])
print(source_information)
# setting the URL for BeautifulSoup to operate in
url = "https://www.roboform.com/filling-test-all-fields"
my_web_form = get(url).content
soup = BeautifulSoup(my_web_form, 'html.parser')
# creating a procedure to fill the form
def fulfill_form(first, email):
    # Setting parameters for selenium to work
    path = r'C:/chrome_driver/chromedriver.exe'
    options = webdriver.ChromeOptions()
    options.add_argument("--start-maximized")
    driver = webdriver.Chrome(path, options=options)
    driver.get(url)
    # use Chrome Dev Tools to find the names or IDs for the fields in the form
    input_first = driver.find_element_by_name('02frstname')
    input_email = driver.find_element_by_name('24emailadr')
    submit = driver.find_element_by_name('Reset')
    # input the values and hold a bit for the next action
    input_first.send_keys(first)
    time.sleep(1)
    input_email.send_keys(email)
    time.sleep(5)
    submit.click()
    time.sleep(7)
# creating a list to hold any entries should they result in an error
failed_attempts = []
# creating a loop to do the procedure and append failed cases to the list
for customer in source_information:
    try:
        fulfill_form(str(source_information[0]), str(source_information[1]))
    except:
        failed_attempts.append(source_information[0])
        pass
if len(failed_attempts) > 0:
    print("{} cases have failed".format(len(failed_attempts)))
print("Procedure concluded")
This tells me that "2 cases have failed"
I checked the output of my "source_information" and it shows the following
        0                 1
0   Corey    corey@test.com
1  Breana  breana@hello.org
Where am I going wrong?
Maybe:
submit = driver.find_element_by_name('Reset')
Should be...
submit = driver.find_element_by_xpath("//input[@type='reset' and @value='Reset']")
Based on the page source (the button doesn't have a name attribute)...
<input type="reset" value="Reset">
...and note the type reset vs the value Reset.
Then you have source_information as a dataframe so you probably want to change...
# creating a loop to do the procedure and append failed cases to the list
for customer in source_information:
    try:
        fulfill_form(str(source_information[0]), str(source_information[1]))
    except:
        failed_attempts.append(source_information[0])
        pass
To something like...
# creating a loop to do the procedure and append failed cases to the list
for customer in source_information.iterrows():
    try:
        fulfill_form(customer[1][0], customer[1][1])
    except:
        failed_attempts.append(customer[1][0])
        pass
I'd also suggest changing all your time.sleep(5) and time.sleep(7) to 1 or 2 so it runs a little quicker.
Obviously this is all from looking at the code without running your data and seeing what happens.
Additional:
I reread the question and you do have an example of test data from the failures. Running the code against that data with the changes shown above works.
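If pandas isn't otherwise needed, the same row-by-row loop can be written with the standard `csv` module. This is only a sketch: `submit_rows` is a made-up helper, the header names in the sample data are assumptions, and the lambda stands in for the Selenium `fulfill_form` function. It also prints the exception, because the bare `except:` in the question hides the actual reason each case failed.

```python
import csv
import io


def submit_rows(csv_file, fill_form):
    """Call fill_form(first, email) once per data row; return the rows that failed."""
    failed = []
    reader = csv.reader(csv_file)
    next(reader, None)  # skip the header row, like skiprows=[0] in the question
    for row in reader:
        if not row:
            continue  # ignore blank lines
        if len(row) < 2:
            failed.append(row)  # not enough columns to fill the form
            continue
        first, email = row[0].strip(), row[1].strip()
        try:
            fill_form(first, email)
        except Exception as exc:
            # Show why the row failed instead of silently swallowing the error
            print("failed for {!r}: {}".format(first, exc))
            failed.append(row)
    return failed


# Example with an in-memory file and a stand-in for the Selenium function.
# The header line "name,email" is an assumption about the CSV layout.
sample = io.StringIO("name,email\nCorey,corey@test.com\nBreana,breana@hello.org\n")
seen = []
failures = submit_rows(sample, lambda first, email: seen.append((first, email)))
print(seen)
print("{} cases have failed".format(len(failures)))
```

With a real file you would pass `open('C:/chrome_driver/test_csv.csv', newline='')` and `fulfill_form` instead of the in-memory sample and the lambda.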
I am trying to copy a web page's list of addresses for a given community service to a new document so I can geocode all of the locations in a map. I cannot get a list of all the parcels at once; I can only download one at a time, and each page is limited to 25 parcel numbers. Doing this by hand would be extremely time consuming.
I want to develop a script that will look at the page source (everything including the 25 addresses which are contained in a table tag) click the next page button, copy the next page, and so on until the max page is reached. Afterwards, I can format the text to be geocoding compatible.
The code below does all of this except it only copies the first page over and over again even though I can clearly see that the program has successfully navigated to the next page:
# Open chrome
br = webdriver.Chrome()
raw_input("Navigate to web page. Press enter when done: ")
pg_src = br.page_source.encode("utf")
soup = BeautifulSoup(pg_src)
max_page = 122 #int(max_page)
#open a text doc to write the results to
f = open(r'C:\Geocoding\results.txt', 'w')
# write results page by page until max page number is reached
pg_cnt = 1 # start on 1 as we should already have the first page
while pg_cnt < max_page:
    tble_elems = soup.findAll('table')
    soup = BeautifulSoup(str(tble_elems))
    f.write(str(soup))
    time.sleep(5)
    pg_cnt +=1
    # clicks the next button
    br.find_element_by_xpath("//div[@class='next button']").click()
    # give some time for the page to load
    time.sleep(5)
    # get the new page source (THIS IS THE PART THAT DOESN'T SEEM TO BE WORKING)
    page_src = br.page_source.encode("utf")
    soup = BeautifulSoup(pg_src)
f.close()
I faced the same problem.
I think the problem is that some of the JavaScript has not finished loading.
All you need to do is wait until the object is loaded. The code below worked for me:
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By
delay = 10 # seconds
try:
    myElem = WebDriverWait(driver, delay).until(EC.presence_of_element_located((By.CLASS_NAME, 'legal-attribute-row')))
except :
    print ("Loading took too much time!")
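It is also worth noting that the loop in the question stores the fresh source in `page_src` but then parses `pg_src` again, so the soup is always built from the first page no matter how long you wait. A rough sketch of a paging loop that re-reads the source on every pass is below. The helper `collect_pages` and its callable parameters are hypothetical, and the Selenium wiring in the comments assumes the `br` driver and the "next button" XPath from the question.

```python
def collect_pages(get_source, click_next, max_page, wait_for_change=None):
    """Return the source of pages 1..max_page, re-reading it after every click.

    get_source: callable returning the current page source.
    click_next: callable that clicks the "next page" control.
    wait_for_change: optional callable(old_source) that blocks until the
        page content differs from old_source.
    """
    pages = []
    for page_number in range(1, max_page + 1):
        source = get_source()  # always read the source of the page we are on now
        pages.append(source)
        if page_number < max_page:
            click_next()
            if wait_for_change is not None:
                wait_for_change(source)
    return pages


# Hypothetical wiring to the question's code (not executed here):
#
# pages = collect_pages(
#     get_source=lambda: br.page_source,
#     click_next=lambda: br.find_element_by_xpath(
#         "//div[@class='next button']").click(),
#     max_page=122,
#     wait_for_change=lambda old: WebDriverWait(br, 10).until(
#         lambda d: d.page_source != old),
# )
```

Each entry in `pages` can then be handed to BeautifulSoup to pull out the tables, instead of parsing one stale variable repeatedly.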
Any help is appreciated in advance.
The deal is that I have been trying to scrape data from this website (https://www.mptax.mp.gov.in/mpvatweb/leftMenu.do), but direct access to that page is not possible. Rather than the data I need, I am getting "invalid access". To access the page I must go to (https://www.mptax.mp.gov.in/mpvatweb/index.jsp), hover over 'dealer information', and then click on 'dealer search' in the dropdown menu.
I am looking for a solution in Python.
Here's something I tried. I have just started web scraping:
import requests
from bs4 import BeautifulSoup
with requests.session() as request:
    MAIN="https://www.mptax.mp.gov.in/mpvatweb/leftMenu.do"
    INITIAL="https://www.mptax.mp.gov.in/mpvatweb/"
    page=request.get(INITIAL)
    jsession=page.cookies["JSESSIONID"]
    print(jsession)
    print(page.headers)
    result=request.post(INITIAL,headers={"Cookie":"JSESSIONID="+jsession+"; zoomType=0","Referer":INITIAL})
    page1=request.get(MAIN,headers={"Referer":INITIAL})
    soup=BeautifulSoup(page1.content,'html.parser')
    data=soup.find_all("tr",class_="whitepapartd1")
    print(data)
The deal is that I want to scrape data about firms based on their firm name.
Thanks for pointing me to a way, @Arnav and @Arman. Here's the final code:
from selenium import webdriver #to work with website
from bs4 import BeautifulSoup #to scrap data
from selenium.webdriver.common.action_chains import ActionChains #to initiate hovering
from selenium.webdriver.common.keys import Keys #to input value
PROXY = "10.3.100.207:8080" # IP:PORT or HOST:PORT
chrome_options = webdriver.ChromeOptions()
chrome_options.add_argument('--proxy-server=%s' % PROXY)
#ask for input
company_name=input("tell the company name")
#import website
browser = webdriver.Chrome(chrome_options=chrome_options)
browser.get("https://www.mptax.mp.gov.in/mpvatweb/")
#perform hovering to show hovering
element_to_hover_over = browser.find_element_by_css_selector("#mainsection > form:nth-child(2) > table:nth-child(1) > tbody:nth-child(1) > tr:nth-child(3) > td:nth-child(3) > a:nth-child(1)")
hover = ActionChains(browser).move_to_element(element_to_hover_over)
hover.perform()
#click on dealer search from dropdown menu
browser.find_element_by_css_selector("#dropmenudiv > a:nth-child(1)").click()
#we are now on the leftmenu page
#click on radio button
browser.find_element_by_css_selector("#byName").click()
#input company name
inputElement = browser.find_element_by_css_selector("#showNameField > td:nth-child(2) > input:nth-child(1)")
inputElement.send_keys(company_name)
#submit form
inputElement.submit()
#now we are on dealerssearch page
#scrap data
soup=BeautifulSoup(browser.page_source,"lxml")
#get the list of values we need
list=soup.find_all('td',class_="tdBlackBorder")
#check length of 'list' and on that basis decide what to print
if(len(list)!=0):
    #company name at index=9
    #tin no. at index=10
    #registration status at index=11
    #circle name at index=15
    #store the values
    name=list[9].get_text()
    tin=list[10].get_text()
    status=list[11].get_text()
    circle=list[15].get_text()
    #make dictionary
    Company_Details={"TIN":tin ,"Firm name":name ,"Circle_Name":circle, "Registration_Status":status}
    print(Company_Details)
else:
    Company_Details={"VAT RC No":"Not found in database"}
    print(Company_Details)
#close the chrome
browser.stop_client()
browser.close()
browser.quit()
Would you mind using a browser?
You can use a browser and access the link at xpath (//*[@id="dropmenudiv"]/a[1]).
You might have to download and put chromedriver in the mentioned directory if you haven't used chromedriver before. You can also use selenium + phantomjs if you want to do headless browsing (without the browser opening up each time).
from selenium import webdriver
xpath = '//*[@id="dropmenudiv"]/a[1]'
browser = webdriver.Chrome('/usr/local/bin/chromedriver')
browser.set_window_size(1120,550)
browser.get('https://www.mptax.mp.gov.in/mpvatweb')
link = browser.find_element_by_xpath('//*[@id="dropmenudiv"]/a[1]')
link.click()
url = browser.current_url
Ok, so I pretty much used webdriver to navigate to a specific page with a table of results contained in a unique div. I had to use webdriver to fill the forms and interact with the JavaScript buttons. Anyway, I need to scrape the table into a file but I can't figure this out. Here's the code:
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
import time
# Open Firefox
driver = webdriver.Firefox()
driver.get("https://subscriber.hoovers.com/H/login/login.html")
# Login and submit
username = driver.find_element_by_id('j_username')
username.send_keys('THE_EMAIL_ADDRESS')
password = driver.find_element_by_id('j_password')
password.send_keys('THE_PASSWORD')
password.submit()
# go to "build a list" url (more like 'build-a-table' get it right guys!
driver.get('http://subscriber.hoovers.com/H/search/buildAList.html?_target0=true')
# expand industry list to reveal SIC codes form
el = driver.find_elements_by_xpath("//h2[contains(string(), 'Industry')]")[0]
action = webdriver.common.action_chains.ActionChains(driver)
action.move_to_element_with_offset(el, 5, 5)
action.click()
action.perform()
# fill sic.codes form with all the SIC codes
siccodes = driver.find_element_by_id('advancedSearchCriteria.sicCodes')
siccodes.send_keys('316998,321114,321211,321212,321213,321214,321219,321911,'
'321912,321918,321992,322121,322130,326122,326191,326199,327110,327120,'
'327212,327215,327320,327331,327332,327390,327410,327420,327910,327991,'
'327993,327999,331313,331315,332216,332311,332312,332321,332322,332323,'
'333112,333414,333415,333991,334290,335110,335121,335122,335129,335210,'
'335221,335222,335224,335228,335311,335312,335912,335929,335931,335932,'
'335999,337920,339910,339993,339994,339999,423310,423320,423330,423610,'
'423620,423710,423720,423730,424950,444120')
# wait 5 seconds because this is a big list to load
time.sleep(5)
# Select "Add to List" button and clickity-clickidy-CLICK!
butn = driver.find_element_by_xpath('/html/body/div[2]/div[3]/div[1]/form/div/div[3]/div/div[2]/div[1]/div[2]/p[1]/button')
action = webdriver.common.action_chains.ActionChains(driver)
action.move_to_element_with_offset(butn, 5, 5)
action.click()
action.perform()
# wait 10 seconds to add them to list
time.sleep(10)
# Now select confirm list button and wait to be forwarded to results page
butn = driver.find_element_by_xpath('/html/body/div[3]/div/div[1]/input[2]')
action = webdriver.common.action_chains.ActionChains(driver)
action.send_keys("\n")
action.move_to_element_with_offset(butn, 5, 5)
action.click()
action.perform()
# wait 10 seconds, let it load and dig up them numbah tables
time.sleep(10)
# Check that we're on the right results landing page...
print driver.current_url
# Good we have arrived! Now lets save this url for scrape-time!
url = driver.current_url
# Print everything... but we only need the table!!! HOWW?!?!?!?
sourcecode = driver.page_source.encode("utf-8")
# EVERYTHING AFTER THIS POINT DOESN'T WORK!!!!
All I need is to print the table out as organized as possible with a for loop, but it seems this works a lot better with mechanize or BeautifulSoup. So is this possible? Any suggestions? Also, sorry if my code is sloppy; I'm multitasking with deadlines and other scripts. Please help meehh! I will provide my login credentials if you really need them and want to help me. It's nothing too serious, just a company SIC and D-U-N-S number database, but I don't think you need it to figure this out. I know there are a few Jedi out there who can save me. :)
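For what it's worth, once `driver.page_source` is in hand the table can be pulled out without the browser. Here is a minimal sketch that uses the standard library's `html.parser` instead of BeautifulSoup, so it runs with no extra installs. It assumes Python 3 (the script above is Python 2), the `TableExtractor` and `table_rows` names are invented for this example, and the sample HTML is a placeholder rather than the real results page. Nested tables are not handled.

```python
from html.parser import HTMLParser


class TableExtractor(HTMLParser):
    """Collect the text of every <td>/<th> cell, grouped by <tr> row."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._row = None   # cells of the row currently being read
        self._cell = None  # text fragments of the cell currently being read

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None and self._row is not None:
            # Collapse internal whitespace so each cell is one clean string
            self._row.append(" ".join("".join(self._cell).split()))
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None
            self._cell = None


def table_rows(html):
    """Return a list of rows, each row a list of cell strings."""
    parser = TableExtractor()
    parser.feed(html)
    parser.close()
    return parser.rows


# Placeholder markup standing in for driver.page_source
sample_html = """
<div id="results"><table>
  <tr><th>Company</th><th>D-U-N-S</th></tr>
  <tr><td>Example Co</td><td>000000000</td></tr>
</table></div>
"""
for row in table_rows(sample_html):
    print("\t".join(row))
```

In the real script you would call `table_rows(driver.page_source)` (the unencoded string) and write each row to a file, for example with `csv.writer`.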