I am trying to scrape data from multiple pages (36) of a website, collecting the document number and the revision number of each available document into two separate lists. If I run the code block below on each page individually, it works perfectly. However, after I added the while loop to go through all 36 pages, it does loop, but only the data from the first page is saved.
#sam.gov website
url = 'https://sam.gov/search/?index=sca&page=1&sort=-modifiedDate&pageSize=25&sfm%5Bstatus%5D%5Bis_active%5D=true&sfm%5BwdPreviouslyPerformedWrapper%5D%5BpreviouslyPeformed%5D=prevPerfNo%2F'
#webdriver
driver = webdriver.Chrome(options = options_, executable_path = r'C:/Users/439528/Python Scripts/Spyder/chromedriver.exe' )
driver.get(url)
#get rid of pop up window
WebDriverWait(driver, 10).until(EC.element_to_be_clickable((By.CSS_SELECTOR, '#sds-dialog-0 > button > usa-icon > i-bs > svg'))).click()
#list of revision numbers
revision_num = []
#empty list for all the WD links
WD_num = []
substring = '2015'
current_page = 0
while True:
    current_page += 1
    if current_page == 36:
        #find all elements on page named "field name". For each one, get the text. if the text is 'Revision Number'
        #then, get the 'sibling' element, which is the actual revision number. append its text to the revision_num list.
        elements = driver.find_elements_by_class_name('sds-field__name')
        wd_links = driver.find_elements_by_class_name('usa-link')
        for i in elements:
            element = i.text
            if element == 'Revision Number':
                revision_numbers = i.find_elements_by_xpath("./following-sibling::div")
                for x in revision_numbers:
                    a = x.text
                    revision_num.append(a)
        #finding all links that have the partial text 2015 and putting the wd text into the WD_num list
        for link in wd_links:
            wd = link.text
            if substring in wd:
                WD_num.append(wd)
        print('Last Page Complete!')
        break
    else:
        #find all elements on page named "field name". For each one, get the text. if the text is 'Revision Number'
        #then, get the 'sibling' element, which is the actual revision number. append its text to the revision_num list.
        elements = driver.find_elements_by_class_name('sds-field__name')
        wd_links = driver.find_elements_by_class_name('usa-link')
        for i in elements:
            element = i.text
            if element == 'Revision Number':
                revision_numbers = i.find_elements_by_xpath("./following-sibling::div")
                for x in revision_numbers:
                    a = x.text
                    revision_num.append(a)
        #finding all links that have the partial text 2015 and putting the wd text into the WD_num list
        for link in wd_links:
            wd = link.text
            if substring in wd:
                WD_num.append(wd)
        #click on next page
        click_icon = WebDriverWait(driver, 5, 0.25).until(EC.visibility_of_element_located([By.ID,'bottomPagination-nextPage']))
        click_icon.click()
        WebDriverWait(driver, 5).until(EC.presence_of_element_located((By.ID, 'main-container')))
Things I've tried:
I added the WebDriverWait calls to slow the script down so the page can load and/or the elements can be located and clicked
I declared the empty lists outside the loop so they are not overwritten on each iteration
I have edited the while loop multiple times, either to count up to 36 (while current_page < 37) or to move the counter to the top or bottom of the loop
Any ideas? TIA.
EDIT: added screenshot of 'field name'
I have refactored your code to make it much simpler.
driver = webdriver.Chrome(options = options_, executable_path = r'C:/Users/439528/Python Scripts/Spyder/chromedriver.exe' )
revision_num = []
WD_num = []
for page in range(1,37):
    url = 'https://sam.gov/search/?index=sca&page={}&sort=-modifiedDate&pageSize=25&sfm%5Bstatus%5D%5Bis_active%5D=true&sfm%5BwdPreviouslyPerformedWrapper%5D%5BpreviouslyPeformed%5D=prevPerfNo%2F'.format(page)
    driver.get(url)
    if page==1:
        # the pop-up window only appears on the first page load
        WebDriverWait(driver, 10).until(EC.element_to_be_clickable((By.CSS_SELECTOR, '#sds-dialog-0 > button > usa-icon > i-bs > svg'))).click()
    # links whose text contains 2015 hold the WD numbers
    wd_links = WebDriverWait(driver, 10).until(EC.visibility_of_all_elements_located((By.XPATH,"//a[contains(@class,'usa-link') and contains(.,'2015')]")))
    # the div following each 'Revision Number' field name holds the revision number
    revisions = WebDriverWait(driver, 10).until(EC.visibility_of_all_elements_located((By.XPATH,"//div[@class='sds-field__name' and text()='Revision Number']/following-sibling::div")))
    for wd_link in wd_links:
        WD_num.append(wd_link.text)
    for revision in revisions:
        revision_num.append(revision.text)
print(revision_num)
print(WD_num)
If you know there are only 36 pages to iterate over, you can pass the page number in the URL.
Wait for the elements to be visible using WebDriverWait.
Construct your XPath so that it identifies the elements uniquely, without needing extra if checks.
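As a rough sketch of the first tip (passing the page number in the URL), the page URLs can be generated up front and then handed to driver.get() one by one. The template below is a shortened, illustrative version of the search URL, and page_urls is a hypothetical helper name, not part of Selenium.

```python
# Shortened, illustrative template; the real search URL carries more filter parameters.
SEARCH_URL_TEMPLATE = (
    "https://sam.gov/search/?index=sca&page={page}&sort=-modifiedDate&pageSize=25"
)


def page_urls(template, last_page, first_page=1):
    """Return one URL per page number, from first_page to last_page inclusive."""
    return [template.format(page=number) for number in range(first_page, last_page + 1)]


urls = page_urls(SEARCH_URL_TEMPLATE, last_page=36)
print(len(urls))  # 36
print(urls[0])    # the URL for page 1
```

Loading each URL directly means the script never depends on the "next page" button being clickable, which is where the original loop was getting stuck.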
console output on my terminal:
I am trying to get the details of the tyres on this page: https://eurawheels.com/fr/catalogue/INFINY-INDIVIDUAL. Each tyre comes in several FINITIONS (finishes), and the price and other details differ for each FINITION, so I would like to click on each FINITION type. The problem is that after clicking a FINITION type the links go stale, and you cannot refresh the page, because that takes you back to the starting page. So how can I avoid the stale element error without refreshing the page?
count_added = False
buttons_div = driver.find_elements_by_xpath('//div[@class="btn-group"]')
fin_buttons = buttons_div[2].find_elements_by_xpath('.//button')
fin_count = len(fin_buttons)
if fin_count > 2:
    for z in range(fin_count):
        if not count_added:
            z = z + 2 #Avoid clicking the Title
            count_added = True
        fin_buttons[z].click()
        finition = fin_buttons[z].text
        time.sleep(2)
        driver.refresh() #Cannot do this. Will take to a different page
To clarify: the stale element exception is thrown because the element is no longer attached to the DOM. In your case the stale reference is buttons_div = driver.find_elements_by_xpath('//div[@class="btn-group"]'), which is the parent of the button used in fin_buttons[z].click().
To solve this you have to "refresh" the element reference, i.e. find it again, once the DOM changes. You can do that like this:
from selenium import webdriver
from time import sleep
driver = webdriver.Chrome(executable_path="D:/chromedriver.exe")
driver.get("https://eurawheels.com/fr/catalogue/INFINY-INDIVIDUAL")
driver.maximize_window()
driver.find_elements_by_xpath("//div[@class='card-body text-center']/a")[1].click()
def click_fin_buttons(index):
    driver.find_elements_by_xpath('//div[@class="btn-group"]')[2].find_elements_by_xpath('.//button')[index].click()
def text_fin_buttons(index):
    return driver.find_elements_by_xpath('//div[@class="btn-group"]')[2].find_elements_by_xpath('.//button')[index].text
sleep(2)
count_added = False
buttons_div = driver.find_elements_by_xpath('//div[@class="btn-group"]')
fin_buttons = buttons_div[2].find_elements_by_xpath('.//button')
fin_count = len(fin_buttons)
if fin_count > 2:
    for z in range(fin_count):
        if not count_added:
            z = z + 2 #Avoid clicking the Title
            count_added = True
        click_fin_buttons(z)
        finition = text_fin_buttons(z)
        sleep(2)
        print(finition)
        #driver.refresh() #Cannot do this. Will take to a different page
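If the DOM can still change between the lookup and the click, one possible addition is a small retry wrapper around helpers such as click_fin_buttons. This is only a sketch: retry_when_stale and its parameters are hypothetical names, and the exception type is passed in so the helper itself does not depend on Selenium. With Selenium you would pass StaleElementReferenceException.

```python
import time


def retry_when_stale(action, stale_errors, attempts=3, pause=0.5):
    """Call action() and retry it if it raises one of stale_errors.

    action must look its element up again on every call; a cached
    WebElement would just go stale again on the retry.
    """
    for attempt in range(1, attempts + 1):
        try:
            return action()
        except stale_errors:
            if attempt == attempts:
                raise  # give up after the last attempt
            time.sleep(pause)


# Hypothetical usage with Selenium (not executed here):
# from selenium.common.exceptions import StaleElementReferenceException
# retry_when_stale(lambda: click_fin_buttons(z), StaleElementReferenceException)
```

The important part is that the lambda re-runs the whole lookup, so each retry works with a fresh reference.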
I am trying to iterate over a list of links on a website, but Selenium is not able to locate particular, seemingly random ones. Specifically, I am trying to click on each of the cities and extract the number of stores using a for loop, but it always skips, say, "Alameda" and some other cities, even though I see nothing different about their HTML code.
driver = webdriver.Chrome(path)
driver.set_window_size(1120, 1000)
driver.get("https://locations.traderjoes.com/ca/")
cities = driver.find_elements_by_class_name('itemlist')
for i in range(0, len(cities)):
    print(city_list[i])
    if cities[i].is_displayed():
        cities[i].click()
        num = len(driver.find_elements_by_class_name('address-left'))
        num_stores_by_city.append(num)
        driver.find_element_by_xpath('//*[@id="content"]/a[2]').click()
    else:
        time.sleep(3)
        cities[i].click()
        num = len(driver.find_elements_by_class_name('address-left'))
        num_stores_by_city.append(num)
        driver.find_element_by_xpath('//*[@id="content"]/a[2]').click()
This will determine the cities and then loop through each gathering the number of stores and adding information to a dictionary type object:
driver = webdriver.Chrome(path)
url = 'https://locations.traderjoes.com/ca/'
driver.get(url)
city_list = {}
city_index = 0
processing_cities = True
while processing_cities:
    cities = driver.find_elements_by_css_selector('.itemlist a')
    if city_index < len(cities):
        city_text = cities[city_index].text
        cities[city_index].click()
        store_locations = driver.find_elements_by_css_selector('.itemlist')
        city_list[city_text] = len(store_locations)
        driver.get(url)
        city_index += 1
    else:
        processing_cities = False
print(city_list)
One of the issues you were running into was that once you click on an element your previously found elements become stale. You need to re-find previously found elements to interact with them again.
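Another way to sidestep stale references, sketched below, is to read each link's text and href once, as plain strings, and then navigate by URL instead of clicking. This assumes the city links carry a usable href, which I have not verified for this site; collect_links and count_per_link are hypothetical helper names. The string "css selector" is the value of By.CSS_SELECTOR.

```python
def collect_links(driver, css_selector):
    """Read (text, href) for every matching link in one pass, before navigating."""
    links = driver.find_elements("css selector", css_selector)
    # Plain strings cannot go stale, unlike the WebElement objects themselves.
    return [(link.text, link.get_attribute("href")) for link in links]


def count_per_link(driver, link_selector, item_selector):
    """Visit each collected href and count the elements matching item_selector."""
    counts = {}
    for text, href in collect_links(driver, link_selector):
        driver.get(href)
        counts[text] = len(driver.find_elements("css selector", item_selector))
    return counts


# Hypothetical usage (selectors taken from the code above, not re-checked):
# driver.get('https://locations.traderjoes.com/ca/')
# print(count_per_link(driver, '.itemlist a', '.address-left'))
```

Since nothing is clicked, there is no need to go back to the index page or to re-find the list between cities.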
I have a list that is loaded dynamically by AJAX.
At first, while loading, its code looks like this:
<ul><li class="last"><a class="loading" href="#"><ins> </ins>Загрузка...</a></li></ul>
When the list is loaded, all of its li and a elements are changed, and there is always more than one li.
Like this:
<ul class="ltr">
<li id="t_b_68" class="closed" rel="simple">
<a id="t_a_68" href="javascript:void(0)">Category 1</a>
</li>
<li id="t_b_64" class="closed" rel="simple">
<a id="t_a_64" href="javascript:void(0)">Category 2</a>
</li>
...
I need to check whether the list is loaded, so I check whether it has several li elements.
So far I tried:
1) Custom waiting condition
class more_than_one(object):
    def __init__(self, selector):
        self.selector = selector
    def __call__(self, driver):
        elements = driver.find_elements_by_css_selector(self.selector)
        if len(elements) > 1:
            return True
        return False
...
try:
    query = WebDriverWait(driver, 30).until(more_than_one('li'))
except:
    print "Bad crap"
else:
    # Then load ready list
2) Custom function based on find_elements_by
def wait_for_several_elements(driver, selector, min_amount, limit=60):
    """
    This function waits until more than <min_amount> elements are found by <selector>,
    with time limit = <limit> seconds
    """
    step = 1   # polling interval, in seconds
    current_wait = 0
    while current_wait < limit:
        try:
            print "Waiting... " + str(current_wait)
            query = driver.find_elements_by_css_selector(selector)
            if len(query) > min_amount:
                print "Found!"
                return True
            else:
                time.sleep(step)
                current_wait += step
        except:
            time.sleep(step)
            current_wait += step
    return False
This doesn't work, because driver (the current element passed to this function) gets lost in the DOM. The UL isn't changed, but Selenium can't find it anymore for some reason.
3) Explicit wait. This just sucks, because some lists load instantly and some take 10+ seconds. With this technique I have to wait the maximum time on every occurrence, which is very bad for my case.
4) Also, I can't wait for a child element with XPath correctly. This one just expects the ul to appear.
try:
    print "Going to nested list..."
    #time.sleep(WAIT_TIME)
    query = WebDriverWait(driver, 30).until(EC.presence_of_element_located((By.XPATH, './/ul')))
    nested_list = child.find_element_by_css_selector('ul')
Please tell me the right way to make sure that several child elements have been loaded for a specified element.
P.S. All these checks and searches should be relative to the current element.
First and foremost the elements are AJAX elements.
Now, as per the requirement to locate all the desired elements and create a list, the simplest approach would be to induce WebDriverWait for the visibility_of_all_elements_located() and you can use either of the following Locator Strategies:
Using CSS_SELECTOR:
elements = WebDriverWait(driver, 20).until(EC.visibility_of_all_elements_located((By.CSS_SELECTOR, "ul.ltr li[id^='t_b_'] > a[id^='t_a_'][href]")))
Using XPATH:
elements = WebDriverWait(driver, 20).until(EC.visibility_of_all_elements_located((By.XPATH, "//ul[@class='ltr']//li[starts-with(@id, 't_b_')]/a[starts-with(@id, 't_a_') and starts-with(., 'Category')]")))
Note : You have to add the following imports :
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
In case your use case is to wait for a certain number of elements to be loaded, e.g. 10 elements, you can use a lambda function as follows:
Using >:
myLength = 9
WebDriverWait(driver, 20).until(lambda driver: len(driver.find_elements_by_xpath("//ul[@class='ltr']//li[starts-with(@id, 't_b_')]/a[starts-with(@id, 't_a_') and starts-with(., 'Category')]")) > int(myLength))
Using ==:
myLength = 10
WebDriverWait(driver, 20).until(lambda driver: len(driver.find_elements_by_xpath("//ul[@class='ltr']//li[starts-with(@id, 't_b_')]/a[starts-with(@id, 't_a_') and starts-with(., 'Category')]")) == int(myLength))
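The same count-based wait can also be written as a small reusable condition class instead of a lambda. The following is a sketch, not a built-in Selenium condition: at_least_n_elements is a hypothetical name, and it only relies on the object passed in having a find_elements(by, value) method, which WebDriver and WebElement both provide.

```python
class at_least_n_elements:
    """Wait condition: truthy once `locator` matches at least `minimum` elements."""

    def __init__(self, locator, minimum):
        self.locator = locator    # a (by, value) pair, e.g. (By.CSS_SELECTOR, "ul.ltr > li")
        self.minimum = minimum

    def __call__(self, driver):
        elements = driver.find_elements(*self.locator)
        # Returning the list lets WebDriverWait.until() hand the elements back.
        return elements if len(elements) >= self.minimum else False


# Hypothetical usage (not executed here):
# items = WebDriverWait(driver, 20).until(
#     at_least_n_elements((By.CSS_SELECTOR, "ul.ltr > li"), 2))
```

Because until() returns the first truthy value from the condition, the caller gets the matched elements directly and does not need a second lookup.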
You can find a relevant discussion in How to wait for number of elements to be loaded using Selenium and Python
I created AllEc which basically piggybacks on WebDriverWait.until logic.
This will wait until the timeout occurs or when all of the elements have been found.
from typing import Callable
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.common.by import By
from selenium.common.exceptions import StaleElementReferenceException
class AllEc(object):
    def __init__(self, *args: Callable, description: str = None):
        self.ecs = args
        self.description = description
    def __call__(self, driver):
        try:
            # every condition must hold; stop at the first one that does not
            for fn in self.ecs:
                if not fn(driver):
                    return False
            return True
        except StaleElementReferenceException:
            return False
# usage example:
wait = WebDriverWait(driver, timeout)
ec1 = EC.invisibility_of_element_located(locator1)
ec2 = EC.invisibility_of_element_located(locator2)
ec3 = EC.invisibility_of_element_located(locator3)
all_ec = AllEc(ec1, ec2, ec3, description="Required elements to show page has loaded.")
found_elements = wait.until(all_ec, "Could not find all expected elements")
Alternatively I created AnyEc to look for multiple elements but returns on the first one found.
class AnyEc(object):
    """
    Use with WebDriverWait to combine expected_conditions in an OR.
    Example usage:
    >>> wait = WebDriverWait(driver, 30)
    >>> either = AnyEc(expectedcondition1, expectedcondition2, expectedcondition3, etc...)
    >>> found = wait.until(either, "Cannot find any of the expected conditions")
    """
    def __init__(self, *args: Callable, description: str = None):
        self.ecs = args
        self.description = description
    def __iter__(self):
        return self.ecs.__iter__()
    def __call__(self, driver):
        for fn in self.ecs:
            try:
                rt = fn(driver)
                if rt:
                    return rt
            except TypeError as exc:
                raise exc
            except Exception as exc:
                # print(exc)
                pass
    def __repr__(self):
        return " ".join(f"{e!r}," for e in self.ecs)
    def __str__(self):
        return f"{self.description!s}"
either = AnyEc(ec1, ec2, ec3)
found_element = wait.until(either, "Could not find any of the expected elements")
Lastly, if it's possible to do so, you could try waiting for Ajax to be finished.
This is not useful in all cases -- e.g. Ajax is always active. In the cases where Ajax runs and finishes it can work. There are also some ajax libraries that do not set the active attribute, so double check that you can rely on this.
def is_ajax_complete(driver):
    # jQuery.active is the number of jQuery AJAX requests currently in flight
    rt = driver.execute_script("return jQuery.active")
    return rt == 0
wait.until(lambda driver: is_ajax_complete(driver), "Ajax did not finish")
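Since not every page loads jQuery, a slightly more defensive variant checks for it before reading jQuery.active. This is a sketch under one assumption that may not suit every page: a page without jQuery is treated as idle. is_ajax_idle and AJAX_IDLE_SCRIPT are hypothetical names.

```python
# Treats a page without jQuery as idle; that is an assumption, adjust as needed.
AJAX_IDLE_SCRIPT = "return (typeof jQuery === 'undefined') ? 0 : jQuery.active;"


def is_ajax_idle(driver):
    """True when jQuery is absent or reports no active AJAX requests."""
    return driver.execute_script(AJAX_IDLE_SCRIPT) == 0


# Hypothetical usage (not executed here):
# wait.until(is_ajax_idle, "Ajax did not finish")
```

The function takes the driver as its only argument, so it can be passed to wait.until() directly without a lambda.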
(1) You did not mention the error you get with it
(2) Since you mention
...because driver (current element passed to this function)...
I'll assume this is actually a WebElement. In this case, instead of passing the object itself to your method, simply pass the selector that finds that WebElement (in your case, the ul). If the "driver gets lost in DOM", it could be that re-creating it inside the while current_wait < limit: loop could mitigate the problem
(3) yeap, time.sleep() will only get you that far
(4) Since the li elements loaded dynamically contain class=closed, instead of (By.XPATH, './/ul'), you could try (By.CSS_SELECTOR, 'ul > li.closed') (more details on CSS Selectors here)
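To illustrate point (2), here is a rough sketch of a polling helper that takes a lookup function rather than an already-found element, so the lookup is repeated from scratch on every poll. wait_for_min_count and its parameters are hypothetical names, and the selector in the usage comment is only an example.

```python
import time


def wait_for_min_count(find, min_count, timeout=30.0, poll=0.5, ignored=()):
    """Poll find() until it returns at least min_count items or timeout passes.

    find is called afresh on every poll, so no old element reference is reused.
    Exceptions listed in `ignored` count as "nothing found yet".
    Returns the items found, or an empty list on timeout.
    """
    deadline = time.monotonic() + timeout
    while True:
        try:
            items = find()
        except ignored:
            items = []
        if len(items) >= min_count:
            return items
        if time.monotonic() >= deadline:
            return []
        time.sleep(poll)


# Hypothetical usage with Selenium (not executed here):
# items = wait_for_min_count(
#     lambda: driver.find_elements("css selector", "ul.ltr > li.closed"), 2)
```

Unlike a fixed time.sleep(), this returns as soon as enough elements exist, so fast lists are not slowed down by the slow ones.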
Keeping in mind the comments of Mr.E. and Arran, I based my list traversal entirely on CSS selectors. The tricky part was my own list structure and markers (changing classes, etc.), as well as creating the required selectors on the fly and keeping them in memory during traversal.
I dispensed with waiting for several elements by searching for anything that is not in the loading state. You may use the ":nth-child" selector as well, like here:
#in for loop with enumerate for i
selector.append(' > li:nth-child(%i)' % (i + 1)) # identify child <li> by its order pos
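As a small self-contained illustration of that ":nth-child" idea, the selector for a given child can be built as a new string from the parent parts instead of appending to and deleting from a shared list. child_selector is a hypothetical helper, not part of my solution.

```python
def child_selector(parent_parts, index):
    """Return the CSS selector for the index-th (0-based) <li> child of a list."""
    return "".join(parent_parts) + " > li:nth-child(%d)" % (index + 1)


print(child_selector(["ul.ltr"], 0))
# ul.ltr > li:nth-child(1)
print(child_selector(["ul.ltr", " > li:nth-child(2)", " > ul"], 2))
# ul.ltr > li:nth-child(2) > ul > li:nth-child(3)
```

Because the selector is a plain string, it can be kept in memory and reused to find the same list item again after the DOM has changed.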
This is my heavily commented solution code, as an example:
def parse_crippled_shifted_list(driver, frame, selector, level=1, parent_id=0, path=None):
    """
    Traversal of html list of special structure (you can't know if element has sub list unless you enter it).
    Supports start from remembered list element.
    Nested lists have classes "closed" and "last closed" when closed and "open" and "last open" when opened (on <li>).
    Elements themselves have classes "leaf" and "last leaf" in both cases.
    Nested lists situate in <li> element as <ul> list. Each <ul> appears after clicking <a> in each <li>.
    If you click <a> of leaf, page in another frame will load.
    driver - WebDriver; frame - frame of the list; selector - selector to current list (<ul>);
    level - level of depth, just for console output formatting, parent_id - id of parent category (in DB),
    path - remained path in categories (ORM objects) to target category to start with.
    """
    # Add current level list elements
    # This method selects all but loading. Just what is needed to exclude.
    selector.append(' > li > a:not([class=loading])')
    # Wait for child list to load
    try:
        query = WebDriverWait(driver, WAIT_LONG_TIME).until(
            EC.presence_of_all_elements_located((By.CSS_SELECTOR, ''.join(selector))))
    except TimeoutException:
        print "%s timed out" % ''.join(selector)
    else:
        # List is loaded
        del selector[-1]  # selector correction: delete last part aimed to get loaded content
        selector.append(' > li')
        children = driver.find_elements_by_css_selector(''.join(selector))  # fetch list elements
        # Walk the whole list
        for i, child in enumerate(children):
            del selector[-1]  # delete non-unique li tag selector
            if selector[-1] != ' > ul' and selector[-1] != 'ul.ltr':
                del selector[-1]
            selector.append(' > li:nth-child(%i)' % (i + 1))  # identify child <li> by its order pos
            selector.append(' > a')  # add 'li > a' reference to click
            child_link = driver.find_element_by_css_selector(''.join(selector))
            # If we parse freely further (no need to start from remembered position)
            if not path:
                # Open child
                try:
                    double_click(driver, child_link)
                except InvalidElementStateException as e:
                    print "\n\nERROR\n", e.msg, '\n\n'
                else:
                    # Determine its type
                    del selector[-1]  # delete changed and already useless link reference
                    # If <li> is category, it would have <ul> as child now and class="open"
                    # Check by class is priority, because <li> exists for sure.
                    current_li = driver.find_element_by_css_selector(''.join(selector))
                    # Category case - BRANCH
                    if current_li.get_attribute('class') == 'open' or current_li.get_attribute('class') == 'last open':
                        new_parent_id = process_category_case(child_link, parent_id, level)  # add category to DB
                        selector.append(' > ul')  # forward to nested list
                        # Wait for nested list to load
                        try:
                            query = WebDriverWait(driver, WAIT_LONG_TIME).until(
                                EC.presence_of_all_elements_located((By.CSS_SELECTOR, ''.join(selector))))
                        except TimeoutException:
                            print "\t" * level, "%s timed out (%i secs). Failed to load nested list." % (
                                ''.join(selector), WAIT_LONG_TIME)
                        # Parse nested list
                        else:
                            parse_crippled_shifted_list(driver, frame, selector, level + 1, new_parent_id)
                    # Page case - LEAF
                    elif current_li.get_attribute('class') == 'leaf' or current_li.get_attribute('class') == 'last leaf':
                        process_page_case(driver, child_link, level)
                    else:
                        raise Exception('Damn! Alien class: %s' % current_li.get_attribute('class'))
            # If it's required to continue from specified category
            else:
                # Check if it's required category
                if child_link.text == path[0].name:
                    # Open required category
                    try:
                        double_click(driver, child_link)
                    except InvalidElementStateException as e:
                        print "\n\nERROR\n", e.msg, '\n\n'
                    else:
                        # This element of list must be always category (have nested list)
                        del selector[-1]  # delete changed and already useless link reference
                        # If <li> is category, it would have <ul> as child now and class="open"
                        # Check by class is priority, because <li> exists for sure.
                        current_li = driver.find_element_by_css_selector(''.join(selector))
                        # Category case - BRANCH
                        if current_li.get_attribute('class') == 'open' or current_li.get_attribute('class') == 'last open':
                            selector.append(' > ul')  # forward to nested list
                            # Wait for nested list to load
                            try:
                                query = WebDriverWait(driver, WAIT_LONG_TIME).until(
                                    EC.presence_of_all_elements_located((By.CSS_SELECTOR, ''.join(selector))))
                            except TimeoutException:
                                print "\t" * level, "%s timed out (%i secs). Failed to load nested list." % (
                                    ''.join(selector), WAIT_LONG_TIME)
                            # Process this nested list
                            else:
                                last = path.pop(0)
                                if len(path) > 0:  # If more to parse
                                    print "\t" * level, "Going deeper to: %s" % ''.join(selector)
                                    parse_crippled_shifted_list(driver, frame, selector, level + 1,
                                                                parent_id=last.id, path=path)
                                else:  # Current is required
                                    print "\t" * level, "Returning target category: ", ''.join(selector)
                                    path = None
                                    parse_crippled_shifted_list(driver, frame, selector, level + 1, last.id, path=None)
                        # Page case - LEAF
                        elif current_li.get_attribute('class') == 'leaf':
                            pass
                        else:
                            print "dummy"
    del selector[-2:]
This is how I solved the problem of waiting until a certain number of posts had finished loading through AJAX:
from selenium import webdriver
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By
# create a new Chrome session
driver = webdriver.Chrome()
# navigate to your web app.
driver.get("http://my.local.web")
# get the "see more" button
seemore_button = driver.find_element_by_id("seemoreID")
# click it to trigger the AJAX load of more posts
seemore_button.click()
# Wait up to 30 sec until the AJAX call has loaded the content
WebDriverWait(driver, 30).until(EC.visibility_of_all_elements_located((By.CLASS_NAME, "post")))
# Get the list of posts
listpost = driver.find_elements_by_class_name("post")
I must be thinking about this wrong.
I want to get the contents of an element, in this case a formfield, on a page that I am accessing with Webdriver/Selenium 2
Here is my broken code:
Element=driver.find_element_by_id(ElementID)
print Element
print Element.text
here is the result:
<selenium.webdriver.remote.webelement.WebElement object at 0x9c2392c>
(Notice the blank line)
I know the element has contents, since I just put them there with the previous command using .send_keys, and I can see them on the actual web page while the script runs,
but I need to get the contents back as data.
What can I do to read this? Preferably in a generic fashion so that I can pull contents from varied types of elements.
I believe prestomanifesto was on the right track. It depends on what kind of element it is. You would need to use element.get_attribute('value') for input elements and element.text to return the text node of an element.
You could check the WebElement object with element.tag_name to find out what kind of element it is and return the appropriate value.
This should help you figure out:
driver = webdriver.Firefox()
driver.get('http://www.w3c.org')
element = driver.find_element_by_name('q')
element.send_keys('hi mom')
element_text = element.text
element_attribute_value = element.get_attribute('value')
print element
print 'element.text: {0}'.format(element_text)
print 'element.get_attribute(\'value\'): {0}'.format(element_attribute_value)
driver.quit()
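The tag_name check described above could be wrapped in a small helper along these lines. This is a sketch, not a Selenium API: read_contents and FORM_TAGS are hypothetical names, and the set of tags treated as form fields is my assumption.

```python
# Tags whose content lives in the "value" attribute (an assumed, non-exhaustive set).
FORM_TAGS = {"input", "textarea", "select"}


def read_contents(element):
    """Return the value of a form field, or the visible text of any other element."""
    if element.tag_name.lower() in FORM_TAGS:
        return element.get_attribute("value")
    return element.text


# Hypothetical usage (not executed here):
# print(read_contents(driver.find_element_by_name('q')))
```

This gives one call that works for both a text box that was just filled with send_keys and an ordinary paragraph or link.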
element.get_attribute('innerHTML')
I know that when you said "contents" you didn't mean this, but if you want to find the values of all the attributes of a WebElement, this is a pretty nifty way to do that with JavaScript from Python (b below is the WebDriver instance):
everything = b.execute_script(
'var element = arguments[0];'
'var attributes = {};'
'for (index = 0; index < element.attributes.length; ++index) {'
' attributes[element.attributes[index].name] = element.attributes[index].value };'
'var properties = [];'
'properties[0] = attributes;'
'var element_text = element.textContent;'
'properties[1] = element_text;'
'var styles = getComputedStyle(element);'
'var computed_styles = {};'
'for (index = 0; index < styles.length; ++index) {'
' var value_ = styles.getPropertyValue(styles[index]);'
' computed_styles[styles[index]] = value_ };'
'properties[2] = computed_styles;'
'return properties;', element)
you can also get some extra data with element.__dict__.
I think this is about all the data you'd ever want to get from a webelement.
My answer is based on this answer: How can I get the current contents of an element in webdriver.
It is mostly a copy-paste of it.
from selenium import webdriver
driver = webdriver.Firefox()
driver.get('http://www.w3c.org')
element = driver.find_element_by_name('q')
element.send_keys('hi mom')
element_text = element.text
element_attribute_value = element.get_attribute('value')
print (element)
print ('element.text: {0}'.format(element_text))
print ('element.get_attribute(\'value\'): {0}'.format(element_attribute_value))
element = driver.find_element_by_css_selector('.description.expand_description > p')
element_text = element.text
element_attribute_value = element.get_attribute('value')
print (element)
print ('element.text: {0}'.format(element_text))
print ('element.get_attribute(\'value\'): {0}'.format(element_attribute_value))
driver.quit()
In Java it's WebElement.getText(). Not sure about Python.