I got a Python Selenium project that does what I want (yay!) but for every instance it opens a new browser window. Is there any way to prevent that?
I've gone through the Selenium documentation, but it only refers to driver.get(url). The new window is most likely because the driver is created inside the for loop, but I can't seem to get the URL to change with the queries and params if it's outside of the for loop.
So, for example, I want to open these URLs:
https://www.google.com/search?q=site%3AParameter1+%22Query1%22
https://www.google.com/search?q=site%3AParameter2+%22Query1%22
https://www.google.com/search?q=site%3AParameter3+%22Query1%22
etc..
from selenium import webdriver
import time
from itertools import product
params = ['Parameter1', 'Parameter2', 'Parameter3', 'Parameter4']
queries = ['Query1', 'Query2', 'Query3', 'Query4',]
for (param, query) in product(params, queries):
    url = f'https://www.google.com/search?q=site%3A{param}+%22{query}%22'  # google as an example
    driver = webdriver.Chrome('G:/Python Projects/venv/Lib/site-packages/chromedriver.exe')
    driver.get(url)
    # does stuff
You are creating a new Chrome driver, and therefore a new browser window, inside the loop. Create it once before the loop and reuse it:
from itertools import product
from selenium import webdriver
params = ['Parameter1', 'Parameter2', 'Parameter3', 'Parameter4']
queries = ['Query1', 'Query2', 'Query3', 'Query4',]
driver = webdriver.Chrome(executable_path='/snap/bin/chromium.chromedriver')
for (param, query) in product(params, queries):
    url = f'https://www.google.com/search?q=site%3A{param}+%22{query}%22'
    driver.get(url)
    # driver.close()
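As a side note, if you want the single browser window to be cleaned up even when one of the page loads raises an error, a minimal sketch is to wrap the same loop in try/finally and quit the driver once at the end:

try:
    for (param, query) in product(params, queries):
        driver.get(f'https://www.google.com/search?q=site%3A{param}+%22{query}%22')
        # does stuff
finally:
    driver.quit()  # closes the one browser window and the chromedriver process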
I've been using Python and Selenium to scrape data from specific state health web pages and output the table to a local CSV.
I've had a lot of success on several other states using similar code. But, I have hit a state that is using what appears to be R to create dynamic dashboards that I can't really access using my normal methods.
I've spent a great deal of time combing through StackOverflow . . . I've checked to see if there's an iframe to switch to, but, I'm just not seeing the data I want located in the iframe on the page.
I can find the table info easily enough using Chrome's "Inspect" feature. But, starting from the original URL, the data I need is not on that page and I can't find a source URL for the table. I've even used Fiddler to see if there's a call somewhere.
So, I'm not sure what to do. I can see the data--but, I don't know where it is to tell Selenium and BS4 where to access it.
The page is here: https://coronavirus.utah.gov/case-counts/
The page takes a while to load . . . other states' pages have had this issue too, and Selenium could work through it.
The table I need is the "Total number of lab-confirmed COVID-19 cases living in Utah" table.
Any help or suggestions would be appreciated.
Here is the code I've been using . . . it doesn't work here, but the structure is very similar to what has worked for other states.
import pandas as pd
from datetime import datetime
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
st = 'ut'
url = 'https://coronavirus.utah.gov/case-counts/'
timeout = 20
# Spawn the webpage using Selenium
driver = webdriver.Chrome(r'D:\Work\Python\utilities\chromedriver\chromedriver.exe')
driver.minimize_window()
driver.get(url)
# Let page load . . . it takes a while
wait = WebDriverWait(driver, timeout).until(EC.visibility_of_element_located((By.ID, "total-number-of-lab-confirmed-covid-19-cases-living-in-utah")))
# Now, scrape table
html = driver.find_element_by_id("total-number-of-lab-confirmed-covid-19-cases-living-in-utah").get_attribute('innerHTML')
soup = BeautifulSoup(html, 'lxml')
table = soup.find_all('table', id='DataTables_Table_0')
df = pd.read_html(str(table))
counts = df[0]
timestamp = datetime.now().strftime('%Y_%m_%d_%H_%M_%S')
counts.to_csv(rf'D:\Work\Python\projects\Covid_WebScraping\output\{st}_covid_cnts_{timestamp}.csv', index=False)
# Close the chrome web driver
driver.close()
I found another way to get the information I needed.
Thanks to Julian Stanley for letting me know about the Katalon Recorder product. That allowed me to see which iframe the table was in.
Using my old method of finding an element by CSS or XPath was causing a Pickle error due to a locked thread. I have no clue how to deal with that . . . but it caused the entire project to just hang.
However, I was able to get the text/HTML of the table via an attribute. After that, I just read it with BS4 as usual.
import pandas as pd
from datetime import datetime
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.support.ui import WebDriverWait
st = 'ut'
url = 'https://coronavirus.utah.gov/case-counts/'
timeout = 20
# Spawn the webpage using Selenium
driver = webdriver.Chrome(r'D:\Work\Python\utilities\chromedriver\chromedriver.exe')
#driver.minimize_window()
driver.get(url)
# Let page load . . . it takes a while
wait = WebDriverWait(driver, timeout)
# Get name of frame (or use index=0)
frames = [frame.get_attribute('id') for frame in driver.find_elements_by_tag_name('iframe')]
# Switch to frame
# driver.switch_to.frame("coronavirus-dashboard")
driver.switch_to.frame(0)
# Now, scrape table
html = driver.find_element_by_css_selector('#DataTables_Table_0_wrapper').get_attribute('innerHTML')
soup = BeautifulSoup(html, 'lxml')
table = soup.find_all('table', id='DataTables_Table_0')
df = pd.read_html(str(table))
counts = df[0]
timestamp = datetime.now().strftime('%Y_%m_%d_%H_%M_%S')
counts.to_csv(rf'D:\Work\Python\projects\Covid_WebScraping\output\{st}_covid_cnts_{timestamp}.csv', index=False)
# Close the chrome web driver
driver.close()
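One caveat worth adding: once you have switched into the iframe, the driver keeps targeting it. If you need anything from the rest of the page afterwards, switch back to the top-level document first:

# Return to the top-level document after scraping inside the iframe
driver.switch_to.default_content()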
I am new to Python.
This is my practice code.
After logging in:
1. Determine whether the page contains a certain keyword.
2. If the page contains the keyword, execute file.exe from my local machine.
3. Refresh the page and do steps 1, 2, and 3 again and again.
After logging in, the page only reloads twice instead of again and again.
I cannot figure out where it goes wrong.
import os
import time
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
webdriver_path = r'C:\Python\chromedriver.exe'
options = Options()
driver = webdriver.Chrome(executable_path=webdriver_path, options=options)
driver.get("https://www.TestLogin.com")  # here is a login page
driver.find_element_by_id('_userId').send_keys('userid')  # here is the account for login
driver.find_element_by_id('_userPass').send_keys('userpass')  # here is the password for login
driver.find_element_by_id('btn-login').click()
driver.find_elements_by_id('table-orders')
time.sleep(3)
driver.refresh()
# Array of keywords
import_searchs = [
    'rose',
    'tulip'
]
for i in import_searchs:
    matches = driver.find_elements_by_xpath('//*[contains(text(), "' + i + '")]')
    for item in matches:
        os.system(r'C:\file.exe')
Wrapping the steps in while True works!
Thank you everybody
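For reference, a minimal sketch of that fix (assuming the same element IDs and keyword list as in the question; the login steps stay outside the loop):

import os
import time

while True:
    for i in import_searchs:
        matches = driver.find_elements_by_xpath('//*[contains(text(), "' + i + '")]')
        for item in matches:
            os.system(r'C:\file.exe')  # path from the question; adjust as needed
    time.sleep(3)
    driver.refresh()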
How do I load multiple URLs in driver.get()?
I am trying to load 3 URLs in the code below, but how do I load the other 2 URLs?
The next challenge after that is to pass authentication for all the URLs, which is the same for each of them.
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
driver = webdriver.Chrome(executable_path=r"C:/Users/RYadav/AppData/Roaming/Microsoft/Windows/Start Menu/Programs/Python 3.8/chromedriver.exe")
driver.get("https://fleet.my.salesforce.com/reportbuilder/reportType.apexp")#put here the adress of your page
elem = driver.find_element_by_xpath('//*[@id="ext-gen63"]')  # put here the content you have put in Notepad, ie the XPath
button = driver.find_element_by_id('ext-gen63')
print(elem.get_attribute("class"))
button.click()
driver.close()
Try the code below:
from selenium import webdriver

def getUrls(targeturl):
    driver = webdriver.Chrome(executable_path=r" path for chromedriver.exe")
    driver.get("http://www." + targeturl + ".com")
    # perform your tasks here
    driver.quit()

webPages = ['google', 'facebook', 'gmail']
for i in webPages:
    print(i)
    getUrls(i)
You can't load more than one URL at a time in a single WebDriver. If you want to do that, you may need the multiprocessing module. If you want an iterative solution, just create a list with every URL you need and loop through it. That way you won't have the credential problem either.
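A minimal sketch of that iterative approach with a single driver (the URL list mirrors the pages above, and the login step is a placeholder for whatever authentication the pages share):

from selenium import webdriver

urls = ['http://www.google.com', 'http://www.facebook.com', 'http://www.gmail.com']
driver = webdriver.Chrome(executable_path=r" path for chromedriver.exe")
# log in once here; the session (cookies) is reused for every driver.get()
for url in urls:
    driver.get(url)
    # perform your tasks here
driver.quit()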
I'm using selenium webdriver configured with chrome and want to scrape the price from Yahoo Finance. An example page is: https://finance.yahoo.com/quote/0001.KL
I've opened the example page in a Chrome browser, used the inspector to navigate to where the price is highlighted on the page, and used the inspector's Copy XPath for use in my Python script.
import os
from lxml import html
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from fake_useragent import UserAgent
ua = UserAgent()
def yahoo_scrape_one(kl_stock_id):
    '''function to scrape yahoo finance for a single KLSE stock'''
    user_agent = ua.random
    chrome_driver = os.getcwd() + '/chromedriver'
    chrome_options = Options()
    chrome_options.add_argument(f'user-agent={user_agent}')  # f-string so the random UA is actually applied
    chrome_options.add_argument('--headless')
    driver = webdriver.Chrome(chrome_options=chrome_options,
                              executable_path=chrome_driver)
    prefix = 'https://finance.yahoo.com/quote/'
    suffix = '.KL'
    url = prefix + kl_stock_id + suffix
    driver.get(url)
    tree = html.fromstring(driver.page_source)
    price = tree.xpath('//*[@id="quote-header-info"]/div[3]/div/div/span[1]/text()')
    print(price)

test_stock = "0001"
yahoo_scrape_one(test_stock)
the print output is
['+0.01 (+1.41%)']
This appears to be information from the next span (change and percent change), not the price. Any insights into this curious behaviour would be appreciated. Any suggestions on an alternate approach would also give joy.
After running your actual script, I got the same "erroneous" output you were reporting. However, I then commented out the headless option and ran the driver again, inspecting the element within the actual Selenium browser instance, and found that the XPath for the element is different on the page generated by your script. Use the following line of code instead:
price = tree.xpath('//*[@id="quote-header-info"]/div[3]/div/span/text()')
This produces the correct output of ['0.36']
This is another way you can achieve the same output without hardcoding the index:
price = tree.xpath("//*[@id='quote-market-notice']/../span")[0].text
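If you prefer to let Selenium locate the element directly instead of re-parsing driver.page_source with lxml, an equivalent sketch (assuming the same sibling structure around quote-market-notice) is:

price = driver.find_element_by_xpath("//*[@id='quote-market-notice']/../span").text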
I want to get a new window's URL by using Selenium, and using PhantomJS is more efficient than Firefox.
The Python code is here:
from selenium import webdriver

renren = webdriver.Firefox()
# renren = webdriver.PhantomJS()
renren.get("file:///home/xjz/Desktop/html/easy.html")
renren.execute_script("windows()")
now_handle1 = renren.current_window_handle
all_handles1 = renren.window_handles
for handle1 in all_handles1:
    if handle1 != now_handle1:
        renren.switch_to_window(handle1)
        print(renren.current_url)
        print(renren.page_source)
In script "windows()", it will open a new window for http://www.renren.com/.
When I use Firefox, I get the current URL and content of http://www.renren.com/. But when I use PhantomJS, I get "about:blank" for the URL and "" for the content. It fails when I use PhantomJS.
So how can I get the current URL when using Selenium with PhantomJS?
Thanks a lot.
You can add sleep time in your code before getting the current URL.
from selenium import webdriver
renren = webdriver.Firefox()
...
...
...
import time
time.sleep(10)  # in seconds
print(renren.current_url)
..
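If a fixed sleep turns out to be flaky, an explicit wait is a sketch worth trying: poll until the new window has navigated away from about:blank (the 10-second timeout here is an arbitrary choice):

from selenium.webdriver.support.ui import WebDriverWait

renren.switch_to_window(handle1)
# wait until the new window has finished navigating
WebDriverWait(renren, 10).until(lambda d: d.current_url != "about:blank")
print(renren.current_url)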