Python and Selenium: Automatically adjusting range depending on if something exists

This is my first post on Stack Overflow; this website has been very useful to me in the past, so I wanted to thank the community first and foremost. I've been learning Python over the last 2-3 weeks, just by doing my own little "projects", and I have a question that I tried searching for, but I wasn't really sure how to phrase it, so finding an answer has been difficult.
Essentially, I would like to take a user input for a Pokémon, go to a website that has information on it, and print a table of the moves that Pokémon can learn by leveling up. I've managed to get some code running; however, the issue is that each Pokémon learns a different number of moves. The code I have is:
import selenium.webdriver as webdriver

def moves(x):
    move = browser.find_element_by_xpath('//*[@id="svtabs_moves_15"]/div[1]/div[1]/div[1]/table/tbody/tr[' + str(x) + ']/td[2]/a').text
    return move

poke = input("Search for which Pokémon?: ")
browser = webdriver.PhantomJS()
browser.get("https://pokemondb.net/pokedex/" + str(poke))

for x in range(1, 50):
    print(moves(x))
If a Pokémon only learns 15 moves by level up, then on the 16th iteration of x an error is raised because that XPath doesn't exist, so I am looking for a way to modify my code so that it stops printing once the XPath no longer exists.
I was thinking of using a while True statement, but I'm not too sure how to approach it. Again, I'm very new to Python, so the code may not be the most elegant.
Thanks for reading!

Use a while loop with a try/except statement, so the loop simply stops as soon as the element is not present (call it as moves(1)):
def moves(x):
    # keep printing rows starting at x until the XPath no longer exists
    while True:
        try:
            move = browser.find_element_by_xpath('//*[@id="svtabs_moves_15"]/div[1]/div[1]/div[1]/table/tbody/tr[' + str(x) + ']/td[2]/a').text
        except:
            break
        print(move)
        x += 1

The easiest option here is to wrap the call in try/except so the loop stops cleanly when the element is missing:
...
(your code above)

x = 1
while True:  # initially was for x in range(1, 50), but a while loop is better
    try:
        print(moves(x))
    except:
        break
    x += 1

If the difference between each move is in the <tr> tag, you can locate the list of all those elements and use it to get the data you are looking for:
def moves(element):
    move = element.find_element_by_xpath('./td[2]/a').text
    return move

browser.get("https://pokemondb.net/pokedex/" + str(poke))
moves_list = browser.find_elements_by_xpath('//*[@id="svtabs_moves_15"]/div[1]/div[1]/div[1]/table/tbody/tr')
for row in moves_list:
    print(moves(row))


Selenium Painfully slow

So I have a Selenium script that scrapes data from a website. Sadly I can't share the site, but I have noticed the same issue across several scrapers I have made. I have it set up so that if any exception is hit, it just returns 'Not Found', as that is a possible outcome for the information I am looking for. When the information is found the script is extremely quick, but when it isn't, it is painfully slow.
Any suggestions to speed this up?
# (fragment of the scraping function)
try:
    form = driver.find_element_by_id('formSearchCriteria')
    form.send_keys(userID)
    searchButton = driver.find_element_by_id('phs-save-btn')
    searchButton.click()
    nextButton = driver.find_elements_by_xpath('//*[@class="jss137"]')
    nextButton[0].click()
    list = driver.find_elements_by_xpath('//*[@class="balance-field"]')
    l = str(list[1].text)
    ignore_keys = ["User ", "Identifier"]
    for ignore in ignore_keys:
        l = l.replace(ignore, "")
    return l
except:
    return 'Not Found'
The script is slow because of these two lines:
nextButton = driver.find_elements_by_xpath('//*[@class="jss137"]')
and
list = driver.find_elements_by_xpath('//*[@class="balance-field"]')
The thing is, they both return a list. If at least one element is found, all is well and it works as expected. But if nothing is found, each call waits for the full implicit wait configured in your suite.
So if you want them to take less time, I would suggest decreasing the implicit wait:
driver.implicitly_wait(10)
10 seconds or even less should work fine for you.
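As a minimal sketch of that idea (keeping the rest of the scraper unchanged; the specific timeouts are assumptions), you could also lower the implicit wait only around the lookups that may legitimately find nothing, then restore it:
# Sketch: shorten the implicit wait around optional lookups, then restore it.
driver.implicitly_wait(2)  # these elements are often absent, so fail fast here
nextButton = driver.find_elements_by_xpath('//*[@class="jss137"]')
balance_fields = driver.find_elements_by_xpath('//*[@class="balance-field"]')
driver.implicitly_wait(10)  # restore the longer wait for the rest of the script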
You can do it in the browser:
l = driver.execute_script('return [...document.querySelectorAll(".balance-field")].map(a => a.innerText)[1]')
This is potentially much faster, depending on how many elements there are.

Selenium / Python: Why isn't my Selenium find_element_by finding elements anymore after finding the first one in my for loop iterations?

Do you see something wrong with this setup?
(selenium, etc. imported earlier on)
It iterates through table_rows until it finds the first row where the "try" succeeds, then comes back from the getinfo() function called in the "try" (which clicks a link, goes to a page, gets info, and then clicks the back button to return to the original page), and then keeps iterating through the rest of table_rows.
The correct number of table_rows iterations is performed by the end, and the "try" block is triggered again (the print() before current_cell works), but find_element_by_class_name doesn't seem to pick up any more "a0" cells in the subsequent iterations, even though there are definitely some there that should be found (the print() after current_cell never prints after the very first time).
Thank you for your help. I'm new to coding and have learned a ton, but this is stumping me.
def getinfo(current_cell):
    link_in_current_cell = current_cell.find_element_by_tag_name("a")
    link_in_current_cell.click()
    waitfortable2 = WebDriverWait(driver, 5).until(
        EC.presence_of_element_located((By.CLASS_NAME, "top-edit-table"))
    )
    print("Here is the info about a0.")
    driver.back()
    return True

for row in table_rows:
    print("got the row!")
    waitfortable = WebDriverWait(driver, 5).until(
        EC.presence_of_element_located((By.XPATH, "/html/body/table[3]/tbody/tr/td[2]/form/table/tbody/tr/td/table/tbody/tr[4]/td/table/tbody/tr[1]/td[1]"))
    )
    try:
        print("we're trying!")
        current_cell = row.find_element_by_class_name("a0")
        print("we got an a0!")
        getinfo(current_cell)
    except:
        print("Not an a0 cell.")
        continue
Here is more of the code from before "for row in table_rows:" if you need it, but I don't think that part is the issue, as it still iterates through the rest of table_rows after it finds the first "a0" cell.
try:
    WebDriverWait(driver, 5).until(
        EC.presence_of_element_located((By.XPATH, "/html/body/table[3]/tbody/tr/td[2]/form/table/tbody/tr/td/table/tbody/tr[4]/td/table/tbody/tr[1]/td[1]"))
    )
    table = driver.find_element_by_xpath("/html/body/table[3]/tbody/tr/td[2]/form/table/tbody/tr/td/table/tbody/tr[4]/td/table")
    table_rows = table.find_elements_by_tag_name("tr")
    for row in table_rows:
        print("got the row!")
        ...
        ... (see code box above)
Success! I found my own workaround!
FYI: I still could not see anything wrong with my existing code, as it correctly found all the a0 cells when I commented out the function call (#getinfo(current_cell)); thank you @goalie1998 for the suggestion. And I didn't change anything in that function for this new workaround, which works correctly. So it must have something to do with Selenium getting confused when iterating through a loop that (1) tries to find_element_by something on the page (something that exists multiple times on the page, which is why you're creating the loop) and (2) clicks a link within that loop, goes to a page, goes back to the original page, and then is supposed to keep running through the iterations with the find_element_by call to get the next match on the page. Not sure why Selenium gets messed up by that, but it does. (More experienced coders, feel free to elaborate.)
Anyway, my workaround thought process, which may help some of you solve this issue for yourselves by doing something similarly, is:
(1) Find all of the links BEFORE clicking on any of them (and create a list of those links)
Instead of trying to find & click the links one-at-a-time as they show up, I decided to find all of them FIRST (BEFORE clicking on them). I changed the above code to this:
# this is where I'm storing all the links
text_link_list = []
for row in table_rows:
    print("got the row!")
    waitfortable = WebDriverWait(driver, 5).until(
        EC.presence_of_element_located((By.XPATH, "/html/body/table[3]/tbody/tr/td[2]/form/table/tbody/tr/td/table/tbody/tr[4]/td/table/tbody/tr[1]/td[1]"))
    )
    ## Get a0
    try:
        print("we're trying!")
        row.find_element_by_class_name("a0")
        print("we got an a0!")
        # this next part is just because there are also blank "a0" cells without
        # text (aka a link) in them, and I don't care about those ones.
        current_row_has_a0 = row.find_element_by_class_name("a0")
        if str(current_row_has_a0.text) != "":
            text_link_list += [current_row_has_a0.text]
            print("text added to text_link_list!")
        else:
            print("wasn't a text cell!")
    except:
        continue
(2) Iterate through that list of links, running your Selenium code that includes .click() and .back()
Now that I had my list of links, I could just iterate through it and run my .click() --> perform actions --> .back() function that I created (getinfo(), original code in the question above).
## brand new for loop, after "for row in table_rows" loop
for text in text_link_list:
    # waiting for page to load upon iterations
    table = WebDriverWait(driver, 10).until(
        EC.presence_of_element_located((By.XPATH, "/html/body/table[3]/tbody/tr/td[2]/form/table/tbody/tr/td/table/tbody/tr[4]/td/table/tbody/tr[1]/td[1]"))
    )
    # this is my .click() --> perform actions --> .back() function
    getinfo(text)
However, I just needed to make two small changes to my .getinfo() function.
One, I was now clicking on the links via their "link text", not the a0 class I was using before (need to use .find_element_by_link_text).
Two, I could now use the more basic driver.find_element_by... instead of my original table.find_element_by.... Using "table" may have worked as well, but I was worried about the reference to "table" being lost now that the .click() code runs inside my function, so I went with "driver" since it was more certain. (I'm still pretty new to coding; this may not have been necessary.)
def getinfo(text):
    link_in_current_cell = driver.find_element_by_link_text(text)
    link_in_current_cell.click()
    waitfortable2 = WebDriverWait(driver, 5).until(
        EC.presence_of_element_located((By.CLASS_NAME, "top-edit-table"))
    )
    print("Here is the info from this temporary page.")
    driver.back()
    return True
I hope that this can all be helpful to someone. I was stoked when I did it and it worked! Let me know if it helped you! <3
PS. NOTE IF HAVING STALE ERRORS / StaleElementReferenceException:
If you are iterating through your loops and clicking a link via something like driver.find_element_by... (instead of using .back()), you may run into a stale-element error, especially if you're trying to .click() on a variable that you assigned earlier or outside of the loop. You can fix this (maybe not in the most beautiful way) by redefining your variable right at the point of the loop where you want to click the link, with code like this:
my_link = WebDriverWait(driver, 10).until(
    EC.presence_of_element_located((By.LINK_TEXT, "linktext"))
)
my_link.click()
continue
"continue" is only necessary if you're wanting to end the current iteration and begin the next one, and is also not likely necessary if this is at the end of your loop
you can change that red "10" number to whatever amount of time you'd like to give the code to find the element (aka the page to reload, most likely) before the script fails
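A minimal sketch of that re-find idea inside the loop from this answer (it reuses the text_link_list and timeouts shown above; adjust to your own page):
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

for link_text in text_link_list:
    # Re-locate the link on every iteration so the reference is never stale,
    # even after the .click() / driver.back() round trip.
    link = WebDriverWait(driver, 10).until(
        EC.presence_of_element_located((By.LINK_TEXT, link_text))
    )
    link.click()
    # ... scrape the target page here ...
    driver.back()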

Selenium Python Instagram Scraping All Images in a post not working

I am writing a small code to download all images/videos in a post. Here is my code:
import urllib.request as reqq
from selenium import webdriver
import time

browser = webdriver.Chrome("D:\\Python_Files\\Programs\\chromedriver.exe")
browser.maximize_window()

url_list = ['https://www.instagram.com/p/CE9CZmsghan/']
img_urls = []
vid_urls = []
img_url = ""
vid_url = ""

for x in url_list:
    count = 0
    browser.get(x)
    while True:
        try:
            elements = browser.find_elements_by_class_name('_6CZji')
            elements[0].click()
            time.sleep(1)
        except:
            count += 1
            time.sleep(1)
            if count == 2:
                break
        try:
            vid_url = browser.find_element_by_class_name('_5wCQW').find_element_by_tag_name('video').get_attribute('src')
            vid_urls.append(vid_url)
        except:
            img_url = browser.find_element_by_class_name('KL4Bh').find_element_by_tag_name('img').get_attribute('src')
            img_urls.append(img_url)

for x in range(len(img_urls)):
    reqq.urlretrieve(img_urls[x], "D:\\instaimg" + str(x + 1) + ".jpg")
for x in range(len(vid_urls)):
    reqq.urlretrieve(vid_urls[x], "D:\\instavid" + str(x + 1) + ".mp4")

browser.close()
This code extracts all the images in the post except the last image. IMO, this code is right. Do you know why this code doesn't extract the last image? Any help would be appreciated. Thanks!
Go to the URL that you're using in the example and open the inspector, and watch very carefully how the DOM changes as you click between images. There are multiple page elements with class KL4Bh because the markup keeps the previous image, the current image, and the next image.
So find_element_by_class_name('KL4Bh') returns the first match on the page.
Ok, let's break down this loop and see what is happening:

First iteration:
- page opens
- immediately click 'next' to the second photo
- grab the first element with class 'KL4Bh' from the DOM
- the first element with that class is the first image (now the 'previous' image)

[... iterations 2, 3, 4 same as 1 ...]

Fifth iteration:
- look for a "next" button to click
- find no next button, so `elements[0]` fails with an IndexError
- grab the first element with class 'KL4Bh' from the DOM
- the first element with that class is **still the fourth image**

Sixth iteration:
- look for a "next" button to click
- find no next button, so `elements[0]` fails with an IndexError
- the error count exceeds the threshold
- exit the loop
try something like this:
n = 0
while True:
    try:
        elements = browser.find_elements_by_class_name('_6CZji')
        elements[0].click()
        time.sleep(1)
    except IndexError:
        n = 1
        count += 1
        time.sleep(1)
        if count == 2:
            break
    try:
        vid_url = browser.find_elements_by_class_name('_5wCQW')[n].find_element_by_tag_name('video').get_attribute('src')
        vid_urls.append(vid_url)
    except:
        img_url = browser.find_elements_by_class_name('KL4Bh')[n].find_element_by_tag_name('img').get_attribute('src')
        img_urls.append(img_url)
It will do the same thing as before, except that since it now uses find_elements_by_class_name and indexes into the resulting list, when it reaches the last image the index error from the failed button click also bumps the index used for the image lookup. So on the last iteration of the loop it takes the second element, which is the current image.
There are still some serious problems with this code, but it does fix the bug you are seeing. One problem at a time :)
Edit
A few things that I think would improve this code:
When using try-except blocks to catch exceptions/errors, there are a few rules that should almost always be followed:
Name specific exceptions & errors to handle, don't use unqualified except. The reason for this is that by catching every possible error, we actually suppress and obfuscate the source of bugs. The only legitimate reason to do this is to generate a custom error message, and the last line of the except-block should always be raise to allow the error to propagate. It goes against how we typically think of software errors, but when writing code, errors are your friend.
The try-except blocks are also problematic because they are being used as a conditional control structure. Sometimes it seems easier to code like this, but it is usually a sign of incomplete understanding of the libraries being used. I am specifically referring to the block that is checking for a video versus an image, although the other one could be refactored too. As a rule, when doing conditional branching, use an if statement.
Using sleep with Selenium is almost always incorrect, but it's by far the most common pitfall for new Selenium users. What happens is that the developer starts getting errors about missing elements when trying to search the DOM, and correctly concludes that the page was not fully loaded in the browser before Selenium tried to read it. But sleep is not the right approach, because waiting for a fixed time makes no guarantee that the page will be fully loaded. Selenium has a built-in mechanism to handle this, called explicit wait (along with implicit wait and fluent wait). Using an explicit wait guarantees that the page element is present before your code is allowed to proceed; a rough sketch follows below.
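As a minimal sketch of points 2 and 3 (the class names come from the question; the exact wait conditions are my assumption and may need adjusting for Instagram's markup):
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

wait = WebDriverWait(browser, 10)

# Explicit wait: proceed once the post content is present, instead of time.sleep(1).
wait.until(EC.presence_of_element_located((By.CLASS_NAME, 'KL4Bh')))

# Conditional branching with if instead of try/except: check for a video first.
videos = browser.find_elements_by_tag_name('video')
if videos:
    vid_urls.append(videos[0].get_attribute('src'))
else:
    img = browser.find_element_by_class_name('KL4Bh').find_element_by_tag_name('img')
    img_urls.append(img.get_attribute('src'))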

Element sometimes appears and sometimes does not, how to continue script either way?

My Selenium (Python) script does some work on a website that is a dashboard, i.e. the actual webpage/link does not change as I interact with elements on it. I mention this because I considered simply hopping around different links via the driver.get() command to avoid some of the issues I am having right now.
At one point, the script reaches a part of the dashboard that, for some reason, has a certain element during some of my test runs and not during others. I have a few lines that interact with this element, as it is in the way of another element that I would like to .click(). So when this element is not present, my script stops working. This is a script that repeats the same actions with small variations at the beginning, so I guess I need to integrate some sort of 'if' check.
My goal is to write something that does this:
- If this element is NOT present, skip the lines of code that interact with it and jump to the next action. If the element is present, obviously carry on.
day = driver.find_element_by_xpath('/html/body/div[4]/table/tbody/tr[4]/td[2]/a')
ActionChains(driver).move_to_element(day).click().perform()
driver.implicitly_wait(30)
lizard2 = driver.find_element_by_xpath('/html/body/div[5]/div/div[2]/div[1]/img')
ActionChains(driver).move_to_element(lizard2).perform()
x2 = driver.find_element_by_xpath('/html/body/div[5]/div/div[2]/div[2]')
ActionChains(driver).move_to_element(x2).click().perform()
driver.execute_script('window.scrollTo(0, document.body.scrollHeight)')
So the block that starts with 'lizard2' and runs through the following ActionChains line is the part that interacts with this element that is sometimes present and sometimes not as the script goes back and forth doing its task.
The top two lines are just the code that gets me to the part of the dashboard that has this randomly appearing element, and the scroll command at the end is what follows. As mentioned before, I need the middle part to be skipped, and execution to continue to the scroll part, if the 'lizard2' and 'x2' elements are not found.
I apologize if this is confusing and not very concise. I'm excited to hear what you think, and I can provide any additional information/details. Thank you!
You can simply perform your find_element_by_xpath within a try/except block like so:
from selenium.common.exceptions import NoSuchElementException
try:
    myElement = driver.find_element_by_xpath(...)
    # Element exists, do X
except NoSuchElementException:
    # Element doesn't exist, do Y
Edit: I'd just like to add that someone in the comments on your question suggested this method was 'hacky'; it's actually very Pythonic and completely standard.
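Applied to the question's code, a minimal sketch (using the XPaths from the question and assuming ActionChains is already imported as shown there; untested against the actual dashboard) might look like:
from selenium.common.exceptions import NoSuchElementException

try:
    lizard2 = driver.find_element_by_xpath('/html/body/div[5]/div/div[2]/div[1]/img')
    ActionChains(driver).move_to_element(lizard2).perform()
    x2 = driver.find_element_by_xpath('/html/body/div[5]/div/div[2]/div[2]')
    ActionChains(driver).move_to_element(x2).click().perform()
except NoSuchElementException:
    # The element isn't there on this run; skip straight to scrolling.
    pass

driver.execute_script('window.scrollTo(0, document.body.scrollHeight)')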
For reference, the same check is often wrapped in a small helper method; in the Java bindings it looks like this:
private boolean isElementPresent(By by) {
    try {
        driver.findElement(by);
        return true;
    } catch (NoSuchElementException e) {
        return false;
    }
}
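A rough Python equivalent of that helper (the function name is mine, and it assumes a module-level driver) might be:
from selenium.common.exceptions import NoSuchElementException

def is_element_present(by, value):
    # Return True if at least one matching element exists, False otherwise.
    try:
        driver.find_element(by, value)
        return True
    except NoSuchElementException:
        return False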
Answered by @Lucan (quoted above): wrap the find_element_by_xpath call in a try/except NoSuchElementException block.

How to make a while-loop run as long as an element is present on page using selenium in python?

I'm struggling with creating a while-loop which runs as long as a specific element is present on a website. What I currently have is by no means a solution I'm proud of, but it works. I would however very much appreciate some suggestions on how to change the below:
def spider():
    url = 'https://stackoverflow.com/questions/ask'
    driver.get(url)
    while True:
        try:
            unique_element = driver.find_element_by_class_name("uniqueclass")
            do_something()
        except NoSuchElementException:
            print_data_to_file(entries)
            break
        do_something_else()
As you can see, the first thing I do within the while-loop is check for a unique element which is only present on pages containing the data I'm interested in. Thus, when I reach a page without this information, I'll get a NoSuchElementException and break.
How can I achieve the above without having to use while True?
driver.find_elements won't throw an error; if the returned list is empty, it means there aren't any more elements.
def spider():
    url = 'https://stackoverflow.com/questions/ask'
    driver.get(url)
    while len(driver.find_elements_by_class_name("uniqueclass")) > 0:
        do_something()
        do_something_else()
You could also use an explicit wait with the expected condition staleness_of; however, you wouldn't be able to execute do_something() while waiting, and it's intended for short waiting periods. A rough sketch of that idea follows below.
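A minimal sketch of the staleness_of idea (reusing the driver and "uniqueclass" element from the question; the 10-second timeout is an arbitrary assumption):
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

marker = driver.find_element_by_class_name("uniqueclass")
do_something()
# Block (up to 10 s) until the previously found element has been detached from
# the DOM, e.g. after a navigation or an AJAX refresh replaces it.
WebDriverWait(driver, 10).until(EC.staleness_of(marker))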
