Python Selenium Screen Scrape - python

I am trying to screen scrape a website (snippet below).
The website takes an input, navigates to a second page, takes more inputs and finally displays a table. I fail at this step:
driver.find_element_by_xpath("//select[@id='agencies']/option[@value='13156']").click()
The error I get is:
selenium.common.exceptions.NoSuchElementException: Message: 'Unable to locate element:
This is strange, because I do see the element when I list the ids (see the commented-out "Display id tags" block). Any help/pointers, please?
(I also tried requests/RoboBrowser, but could not get the POST to work there either.)
from selenium import webdriver
from selenium import selenium
from bs4 import BeautifulSoup
driver = webdriver.Firefox()
url = 'http://www.ucrdatatool.gov/Search/Crime/Local/OneYearofData.cfm'
driver.get(url)
driver.find_element_by_xpath("//select[@id='state']/option[@value='1']").click()
#driver.find_element_by_xpath("//select[@id='groups']/option[@value='8']").click()
driver.find_element_by_xpath("//input[@type='submit' and @value='Next']").click()
driver.implicitly_wait(5) # seconds
# Display id tags
#elementsAll = driver.find_elements_by_xpath('//*[@id]')
#for elements in elementsAll:
#    print("id: ", repr(elements))
#    print("idName: ",elements.get_attribute("id"))
#    driver.implicitly_wait(5) # seconds
driver.find_element_by_xpath("//select[@id='groups']/option[@value='2']").click()
driver.find_element_by_xpath("//select[@id='year']/option[@value=1986]").click()
driver.find_element_by_xpath("//select[@id='agencies']/option[@value='13156']").click()
Update -- the code below works with Selenium. I intended to choose all options in the list box and save the query results. Thanks for the pointer, Alecxe!
select = Select(driver.find_element_by_id('agencies'))
for options in select.options:
    select.select_by_visible_text(options.text)
select = Select(driver.find_element_by_id('groups'))
for options in select.options:
    select.select_by_visible_text(options.text)
driver.find_element_by_xpath("//select[@id='year']/option[@value=1985]").click()
driver.find_element_by_xpath("//input[@type='submit' and @value='Get Table']").click()

There is no option with the value 13156 in the select with id agencies. The values range from 102 to 522; you can see them by printing:
[element.get_attribute('value') for element in driver.find_elements_by_xpath('//select[@id="agencies"]/option')]
Also, instead of finding options by value, use Select and pick options by their visible text:
from selenium.webdriver.support.ui import Select
select = Select(driver.find_element_by_id('agencies'))
print(select.options)
select.select_by_visible_text('Selma Police Dept')

Related

Selenium: Selecting from Multiple Drop-Downs at Once

I am building a web scraper that has to try a combination of multiple drop-down menu options and gather data from each combination.
So basically there're 5 drop-downs. I have to gather data from all of the possible combinations of the drop-down options. For each combination, I have to press a button to pull up the page with all the data on it. I am storing all the data in a dictionary.
This is the website: http://siops.datasus.gov.br/filtro_rel_ges_covid_municipal.php?S=1&UF=12;&Municipio=120001;&Ano=2020&Periodo=20
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import Select
# General Stuff about the website
path = '/Users/admin/desktop/projects/scraper/chromedriver'
options = Options()
options.headless = True
options.add_argument("--window-size=1920,1200")
driver = webdriver.Chrome(options=options, executable_path=path)
website = 'http://siops.datasus.gov.br/filtro_rel_ges_covid_municipal.php'
driver.get(website)
# Initial Test: printing the title
print(driver.title)
print()
# Dictionary to Store stuff in
totals = {}
# Drop Down Menus
year_select = Select(driver.find_element(By.XPATH, '//*[@id="cmbAno"]'))
uf_select = Select(driver.find_element(By.XPATH, '//*[@id="cmbUF"]'))
### THIS IS WHERE THE ERROR IS OCCURRING ###
# Choose from the drop down menus
uf_select.select_by_value('29')
year_select.select_by_value('2020')
# Submit button on the page
submit_button = driver.find_element(By.XPATH, '//*[@id="container"]/div[2]/form/div[2]/div/input[2]')
submit_button.click()
# Pulling data from the webpage
nameof = driver.find_element(By.XPATH, '//*[@id="arearelatorio"]/div[1]/div/table[1]/tbody/tr[2]').text
total_balance = driver.find_element(By.XPATH, '//*[@id="arearelatorio"]/div[1]/div/table[3]/tbody/tr[9]/td[2]').text
paid_expenses = driver.find_element(By.XPATH, '//*[@id="arearelatorio"]/div[1]/div/table[4]/tbody/tr[11]/td[4]').text
# Update Dictionary with the new info
totals.update({nameof: [total_balance, paid_expenses]})
totals.update({'this is a test': ['testing stuff']})
# Print the final Dictionary and quit
print(totals)
driver.quit()
For some reason, this code does not work even for a single combination (selecting value 29 from the UF drop-down together with value 2020 from the year drop-down). If I comment out one of the two drop-down selections, it works perfectly fine.
How do I try multiple combinations of drop-down options in a single run?
Try this instead: locate each drop-down immediately before selecting from it.
# Drop Down Menus
### THIS IS WHERE THE ERROR IS OCCURRING ###
# Choose from the drop down menus
uf_select = Select(driver.find_element(By.XPATH, '//*[@id="cmbUF"]'))
uf_select.select_by_value('29')
year_select = Select(driver.find_element(By.XPATH, '//*[@id="cmbAno"]'))
year_select.select_by_value('2020')
This works for me. With your example I get a stale element error, which means the element you located earlier is no longer attached to the page. I haven't checked, but maybe the second drop-down is re-rendered when you select from the first one, so the reference you stored beforehand is lost.
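To cover every combination rather than a single one, the same idea (re-locate each drop-down right before using it) can be wrapped in a loop. The following is only a sketch: `iter_combinations` and `apply_combination` are hypothetical helper names, the element ids `cmbUF` and `cmbAno` come from the question, and the example values other than '29' and '2020' are placeholders that would have to be read from the page.

```python
from itertools import product


def iter_combinations(choices):
    """Yield one dict per combination, e.g. {'cmbUF': '29', 'cmbAno': '2020'}.

    `choices` maps a drop-down id to the list of option values to try.
    Keys keep their insertion order, so drop-downs are filled in that order.
    """
    ids = list(choices)
    for values in product(*(choices[element_id] for element_id in ids)):
        yield dict(zip(ids, values))


def apply_combination(driver, combo):
    """Select one value in each drop-down, re-locating it every time."""
    # Imported lazily so iter_combinations stays usable without Selenium.
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support.ui import Select

    for element_id, value in combo.items():
        # A fresh lookup avoids holding a reference that went stale
        # when an earlier selection re-rendered the form.
        Select(driver.find_element(By.ID, element_id)).select_by_value(value)


# Illustrative values only; '12' is taken from the URL in the question.
example_choices = {"cmbUF": ["12", "29"], "cmbAno": ["2020"]}
example_combos = list(iter_combinations(example_choices))
```

In a real run, each pass of the loop would presumably reload the filter page with driver.get(website), call apply_combination(driver, combo), click the submit button and read the tables, since the report page replaces the form.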

Unable to obtain table info through python selenium

I am a newbie in the Python Selenium environment. I am trying to get the SQL Server version table from https://www.sqlserverversions.com
from selenium.webdriver.common.by import By
from selenium import webdriver
# define the website to scrape and path where the chromedriver is located
website = "https://www.sqlserverversions.com"
driver = webdriver.Chrome(executable_path='/Users//Downloads/chromedriver/chromedriver.exe')
# define 'driver' variable
# open Google Chrome with chromedriver
driver.get(website)
matches = driver.find_elements(By.TAG_NAME, 'tr')
for match in matches:
    b=match.find_elements(By.XPATH,"./td[1]")
    print(b.text)
It says AttributeError: 'list' object has no attribute 'text'. Am I using the right syntax and the right parameters to grab the data?
(The original post included two screenshots here: the table I am trying to read, and the element attributes I am trying to use in the code.)
Please advise what needs to change in the code to obtain the data in table format.
Thanks,
Arun
If you need data only from first table:
from selenium.webdriver.common.by import By
from selenium import webdriver
website = "https://www.sqlserverversions.com"
driver = webdriver.Chrome(executable_path='/Users//Downloads/chromedriver/chromedriver.exe')
driver.get(website)
show_service_pack_versions = True
xpath_first_table_sql_rows = "(//table[@class='tbl'])[1]//tr/td/a[starts-with(text(),'SQL Server')]//ancestor::tr"
matches = driver.find_elements(By.XPATH, xpath_first_table_sql_rows)
for match in matches:
    sql_server_a_element = match.find_element(By.XPATH, "./td/a[2]")
    print(sql_server_a_element.text)
    sql_server_rtm_version_a_element = match.find_element(By.XPATH, ".//td[@class='rtm']")
    print('RTMs:')
    print(sql_server_rtm_version_a_element.text)
    if(show_service_pack_versions):
        print('SPs:')
        sql_server_sp_version_td_elements = match.find_elements(By.XPATH, ".//td[@class='sp']")
        for td in sql_server_sp_version_td_elements:
            print('---')
            print(td.text)
    print('----------------------------------')
If you set show_service_pack_versions = False, the information about service packs is skipped.
There was a part of your code where you were calling b.text after getting the result of find_elements, which returns a list. You can only call b.text on a single WebElement (not a list of them). Here's the updated code:
from selenium.webdriver.common.by import By
from selenium import webdriver
website = "https://www.sqlserverversions.com"
driver = webdriver.Chrome(executable_path='/Users//Downloads/chromedriver/chromedriver.exe')
driver.get(website)
matches = driver.find_elements("css selector", "tr")
for match in matches[1:]:
    items = match.find_elements("css selector", "td")
    for item in items:
        print(item.text)
That will print out A LOT of rows, unless you limit the loop.
If you just need text it's simpler to do it on the browser side:
data = driver.execute_script("""
return [...document.querySelectorAll('tr')].map(tr => [...tr.querySelectorAll('td')].map(td => td.innerText))
""")
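Since the question asks for the data in table format, the nested list returned by the script above can be written out as CSV with the standard library. This is a sketch under the assumption that `data` is a list of rows, each row a list of cell strings; `rows_to_csv` is a hypothetical helper, and the sample rows are illustrative rather than scraped.

```python
import csv
import io


def rows_to_csv(rows):
    """Turn a list of rows (lists of cell text) into CSV text.

    Rows without any non-blank cell are dropped; header rows built from
    <th> cells come back as empty lists from the td-only query.
    """
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    for row in rows:
        cells = [cell.strip() for cell in row]
        if any(cells):
            writer.writerow(cells)
    return buffer.getvalue()


# Illustrative input shaped like the result of the execute_script call.
sample = [["SQL Server 2019", "15.0"], [], ["  ", ""], ["SQL Server 2017", "14.0"]]
csv_text = rows_to_csv(sample)
```

With a live driver this would be called as rows_to_csv(data), and the returned text could be written to a file.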

Selenium scraping issues with a site that has a popup window with endless scroll

I am trying to scrape a website that populates a list of providers. The site makes you go through a list of options and then finally populates a list of providers in a pop-up that has an endless/continuous scroll.
I have tried:
from selenium.webdriver.common.action_chains import ActionChains
element = driver.find_element_by_id("my-id")
actions = ActionChains(driver)
actions.move_to_element(element).perform()
but this code didn't work.
I tried something similar to this:
driver.execute_script("arguments[0].scrollIntoView();", list )
but this didn't move anything. It just stayed on the first 20 providers.
I tried this alternative:
main = driver.find_element_by_id('mainDiv')
recentList = main.find_elements_by_class_name('nameBold')
for list in recentList :
    driver.execute_script("arguments[0].scrollIntoView(true);", list)
    time.sleep(20)
but ended up with this error message:
selenium.common.exceptions.StaleElementReferenceException: Message: stale element reference: element is not attached to the page document
The code that worked the best was this one:
while True:
    # Scroll down to bottom
    element_inside_popup = driver.find_element_by_xpath('//*[@id="mainDiv"]')
    element_inside_popup.send_keys(Keys.END)
    # Wait to load page
    time.sleep(3)
but this is an endless scroll that I don't know how to stop, since "while True:" will always be true.
Any help with this would be great, and thanks in advance.
This is my code so far:
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
import time
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support.ui import Select
import pandas as pd
PATH = '/Users/AnthemScraper/venv/chromedriver'
driver = webdriver.Chrome(PATH)
#location for the website
driver.get('https://shop.anthem.com/sales/eox/abc/ca/en/shop/plans/medical/snq?execution=e1s13')
print(driver.title)
#entering the zipcode
search = driver.find_element_by_id('demographics.zip5')
search.send_keys(90210)
#making the scraper sleep for 5 seconds while the page loads
time.sleep(5)
#entering first name and DOB then hitting next
search = driver.find_element_by_id('demographics.applicants0.firstName')
search.send_keys('juelz')
search = driver.find_element_by_id('demographics.applicants0.dob')
search.send_keys('01011990')
driver.find_element_by_xpath('//*[@id="button/shop/getaquote/next"]').click()
#hitting the next button
driver.find_element_by_xpath('//*[@id="hypertext/shop/estimatesavings/skipthisstep"]').click()
#making the scraper sleep for 2 seconds while the page loads
time.sleep(2)
#clicking the no option to view all the health plans
driver.find_element_by_xpath('//*[@id="radioNoID"]').click()
driver.find_element_by_xpath('/html/body/div[4]/div[11]/div/button[2]/span').click()
#making the scraper sleep for 2 seconds while the page loads
time.sleep(2)
driver.find_element_by_xpath('//*[@id="hypertext/shop/medical/showmemydoctorlink"]/span').click()
time.sleep(2)
#section to choose the specialist. here we are choosing all
find_specialist=\
    driver.find_element_by_xpath('//*[@id="specializedin"]')
#this is the method for a dropdown
select_provider = Select(find_specialist)
select_provider.select_by_visible_text('All Specialties')
#choosing the distance. Here we click on 50 miles
choose_mile_radius=\
    driver.find_element_by_xpath('//*[@id="distanceInMiles"]')
select_provider = Select(choose_mile_radius)
select_provider.select_by_visible_text('50 miles')
driver.find_element_by_xpath('/html/body/div[4]/div[11]/div/button[2]/span').click()
#handling the endless scroll
while True:
    time.sleep(20)
    # Scroll down to bottom
    element_inside_popup = driver.find_element_by_xpath('//*[@id="mainDiv"]')
    element_inside_popup.send_keys(Keys.END)
    # Wait to load page
    time.sleep(3)
#block below allows us to grab the majority of the data. we would have to split it up in pandas since this info
#is nested in with classes
time.sleep(5)
main = driver.find_element_by_id('mainDiv')
sections = main.find_elements_by_class_name('firstRow')
pcp_info = []
#print(section.text)
for pcp in sections:
    #the site stores the information inside inner classes which make it difficult to scrape.
    #the solution would be to pull the entire text in the block and hope to clean it afterwards
    #innerText allows to pull just the text inside the blocks
    first_blox = pcp.find_element_by_class_name('table_content_colone').get_attribute('innerText')
    second_blox = pcp.find_element_by_class_name('table_content_coltwo').get_attribute('innerText')
    #creating columns and rows and assigning them
    pcp_items = {
        'first_block' : [first_blox],
        'second_block' : [second_blox]
    }
    pcp_info.append(pcp_items)
df = pd.DataFrame(pcp_info)
print(df)
df.to_csv('yerp.csv',index=False)
#driver.quit()
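One possible way to end the "while True:" loop in the code above is to stop once scrolling no longer adds rows. The sketch below is an assumption about how the popup behaves (new rows with class firstRow appear after each scroll until the list is exhausted); `scroll_until_stable` is a hypothetical helper that takes two callables, so it does not depend on any particular locator.

```python
import time


def scroll_until_stable(count_items, scroll_once, pause=3.0, max_rounds=200, patience=2):
    """Scroll until the item count stops growing.

    count_items: callable returning how many items are currently loaded.
    scroll_once: callable that triggers one scroll step.
    patience:    consecutive rounds without growth before giving up.
    max_rounds:  hard upper bound so the loop always terminates.
    """
    last = count_items()
    unchanged = 0
    for _ in range(max_rounds):
        scroll_once()
        time.sleep(pause)  # give the page time to load the next batch
        current = count_items()
        if current == last:
            unchanged += 1
            if unchanged >= patience:
                break
        else:
            unchanged = 0
            last = current
    return last


# Possible use with the driver from the question (not executed here):
# total = scroll_until_stable(
#     count_items=lambda: len(driver.find_elements_by_class_name('firstRow')),
#     scroll_once=lambda: driver.find_element_by_id('mainDiv').send_keys(Keys.END),
# )
```

Looking the elements up again inside the lambdas on every round, instead of keeping references from before the scroll, would also sidestep the StaleElementReferenceException mentioned earlier.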

web scraping a site without direct access

Any help is appreciated in advance.
I have been trying to scrape data from this website (https://www.mptax.mp.gov.in/mpvatweb/leftMenu.do), but direct access to it is not possible. Rather than the data I need, I get an "invalid access" response. To reach the page I must go to (https://www.mptax.mp.gov.in/mpvatweb/index.jsp), hover over "dealer information", and click "dealer search" in the drop-down menu.
I am looking for a solution in Python.
Here's something I tried. I have just started web scraping:
import requests
from bs4 import BeautifulSoup
with requests.session() as request:
    MAIN="https://www.mptax.mp.gov.in/mpvatweb/leftMenu.do"
    INITIAL="https://www.mptax.mp.gov.in/mpvatweb/"
    page=request.get(INITIAL)
    jsession=page.cookies["JSESSIONID"]
    print(jsession)
    print(page.headers)
    result=request.post(INITIAL,headers={"Cookie":"JSESSIONID="+jsession+"; zoomType=0","Referer":INITIAL})
    page1=request.get(MAIN,headers={"Referer":INITIAL})
    soup=BeautifulSoup(page1.content,'html.parser')
    data=soup.find_all("tr",class_="whitepapartd1")
    print(data)
What I want is to scrape data about firms based on their firm name.
Thanks for pointing me to a way, @Arnav and @Arman. Here's the final code:
from selenium import webdriver #to work with website
from bs4 import BeautifulSoup #to scrap data
from selenium.webdriver.common.action_chains import ActionChains #to initiate hovering
from selenium.webdriver.common.keys import Keys #to input value
PROXY = "10.3.100.207:8080" # IP:PORT or HOST:PORT
chrome_options = webdriver.ChromeOptions()
chrome_options.add_argument('--proxy-server=%s' % PROXY)
#ask for input
company_name=input("tell the company name")
#import website
browser = webdriver.Chrome(chrome_options=chrome_options)
browser.get("https://www.mptax.mp.gov.in/mpvatweb/")
#perform hovering to show hovering
element_to_hover_over = browser.find_element_by_css_selector("#mainsection > form:nth-child(2) > table:nth-child(1) > tbody:nth-child(1) > tr:nth-child(3) > td:nth-child(3) > a:nth-child(1)")
hover = ActionChains(browser).move_to_element(element_to_hover_over)
hover.perform()
#click on dealer search from dropdown menu
browser.find_element_by_css_selector("#dropmenudiv > a:nth-child(1)").click()
#we are now on the leftmenu page
#click on radio button
browser.find_element_by_css_selector("#byName").click()
#input company name
inputElement = browser.find_element_by_css_selector("#showNameField > td:nth-child(2) > input:nth-child(1)")
inputElement.send_keys(company_name)
#submit form
inputElement.submit()
#now we are on dealerssearch page
#scrap data
soup=BeautifulSoup(browser.page_source,"lxml")
#get the list of values we need
list=soup.find_all('td',class_="tdBlackBorder")
#check length of 'list' and on that basis decide what to print
if(len(list)!=0):
    #company name at index=9
    #tin no. at index=10
    #registration status at index=11
    #circle name at index=15
    #store the values
    name=list[9].get_text()
    tin=list[10].get_text()
    status=list[11].get_text()
    circle=list[15].get_text()
    #make dictionary
    Company_Details={"TIN":tin ,"Firm name":name ,"Circle_Name":circle, "Registration_Status":status}
    print(Company_Details)
else:
    Company_Details={"VAT RC No":"Not found in database"}
    print(Company_Details)
#close the chrome
browser.stop_client()
browser.close()
browser.quit()
Would you mind using a browser?
You can drive a browser and click the link at the XPath //*[@id="dropmenudiv"]/a[1].
You might have to download chromedriver and put it in the directory used below if you haven't used chromedriver before. You can also use selenium + phantomjs if you want to do headless browsing (without the browser opening up each time).
from selenium import webdriver
# single quotes outside, so the double quotes inside the XPath do not end the string
xpath = '//*[@id="dropmenudiv"]/a[1]'
browser = webdriver.Chrome('/usr/local/bin/chromedriver')
browser.set_window_size(1120,550)
browser.get('https://www.mptax.mp.gov.in/mpvatweb')
link = browser.find_element_by_xpath(xpath)
link.click()
url = browser.current_url

extracting more information from webdriver

I have written a code to extract the mobile models from the following website
"http://www.kart123.com/mobiles/pr?p%5B%5D=sort%3Dfeatured&sid=tyy%2C4io&ref=659eb948-c365-492c-99ef-59bd9f0427c6"
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
driver = webdriver.Firefox()
driver.get("http://www.kart123.com/mobiles/pr?p%5B%5D=sort%3Dfeatured&sid=tyy%2C4io&ref=659eb948-c365-492c-99ef-59bd9f0427c6")
elem=[]
elem=driver.find_elements_by_xpath('.//div[@class="pu-title fk-font-13"]')
for e in elem:
    print(e.text)
Everything works fine, but the problem arises at the end of the page: it shows the contents of the first page only. Could you please tell me what I can do in order to get all the models?
This will get you on your way. I would loop with sleeps in between, so that the whole page gets loaded before reading the information from it.
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
import time
driver = webdriver.Firefox()
driver.get("http://www.flipkart.com/mobiles/pr?p%5B%5D=sort%3Dfeatured&sid=tyy%2C4io&ref=659eb948-c365-492c-99ef-59bd9f0427c6")
time.sleep(3)
for i in range(5):
    driver.execute_script("window.scrollTo(0, document.body.scrollHeight);") # scroll to bottom of page
    time.sleep(2)
    driver.find_element_by_xpath('//*[@id="show-more-results"]').click() # click load more button, needs to be done until you reach the end.
elem=[]
elem=driver.find_elements_by_xpath('.//div[@class="pu-title fk-font-13"]')
for e in elem:
    print(e.text)
OK, this is going to be a major hack, but here goes... The site gets more phones as you scroll down by hitting an AJAX script that gives you 20 more each time. The script it's hitting is this:
http://www.flipkart.com/mobiles/pr?p[]=sort%3Dpopularity&sid=tyy%2C4io&start=1&ref=8aef4a5f-3429-45c9-8b0e-41b05a9e7d28&ajax=true
Notice the start parameter. You can hack this into what you want with:
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
driver = webdriver.Firefox()
# str.format is used because the URL already contains literal % escapes
# (%3D, %2C), which would break %-style string formatting
url_template = "http://www.flipkart.com/mobiles/pr?p[]=sort%3Dpopularity&sid=tyy%2C4io&start={}&ref=8aef4a5f-3429-45c9-8b0e-41b05a9e7d28&ajax=true"
num = 1
while num <= 2450:
    # This condition will need to be updated to the maximum number
    # of models you're interested in (or if you're feeling brave try to extract
    # this from the top of the page)
    driver.get(url_template.format(num))
    elem = driver.find_elements_by_xpath('.//div[@class="pu-title fk-font-13"]')
    for e in elem:
        print(e.text)
    num += 20
With start running from 1 to 2450 in steps of 20, that is 123 GET requests, so this will be quite slow...
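The list of request URLs can be built up front, which makes the request count easy to check before anything is fetched. This is a sketch: `start_offsets` and `page_urls` are hypothetical helper names, the URL is the one quoted in the answer, and 2450 is the same upper bound used in the loop above (start values 1 to 2450 in steps of 20).

```python
BASE_URL = (
    "http://www.flipkart.com/mobiles/pr?p[]=sort%3Dpopularity&sid=tyy%2C4io"
    "&start={start}&ref=8aef4a5f-3429-45c9-8b0e-41b05a9e7d28&ajax=true"
)


def start_offsets(max_models, page_size=20, first=1):
    """Return the start values first, first + page_size, ... up to max_models."""
    return list(range(first, max_models + 1, page_size))


def page_urls(max_models=2450, page_size=20):
    """Build one URL per page of results."""
    # str.format leaves the literal %3D and %2C escapes untouched.
    return [BASE_URL.format(start=s) for s in start_offsets(max_models, page_size)]


urls = page_urls()
print(len(urls))  # number of GET requests the loop would make
```

Each URL could then be passed to driver.get in turn, as in the loop above.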
You can get full source of the page and do all the analysis based on it:
page_text = driver.page_source
The page source contains the current content, including whatever was generated by JavaScript. Be careful to read it only once all the rendering is completed (for example, wait for the presence of some string that gets rendered at the end).
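Waiting for a marker string can be done with a small polling loop. The sketch below uses a hypothetical helper, `wait_for_text`, that accepts any callable returning the current source, and the marker string in the usage comment is an assumption. Selenium also ships its own WebDriverWait class for this kind of wait.

```python
import time


def wait_for_text(get_source, needle, timeout=10.0, poll=0.5):
    """Poll get_source() until it contains needle, then return that source.

    Raises TimeoutError if the text has not appeared within `timeout` seconds.
    """
    deadline = time.monotonic() + timeout
    while True:
        source = get_source()
        if needle in source:
            return source
        if time.monotonic() >= deadline:
            raise TimeoutError("text %r not found within %s s" % (needle, timeout))
        time.sleep(poll)


# Possible use with a live driver (not executed here); the marker is a guess
# at something that only shows up once the list has finished rendering:
# page_text = wait_for_text(lambda: driver.page_source, "show-more-results")
```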
