Python, selenium - how to refresh page until item found? - python

I am trying to refresg page until item appears but my code doesn't work (I took pattern on that: python selenium keep refreshing until item found (Chromedriver)).
Here is the code:
while True:
try:
for h1 in driver.find_elements_by_class_name("name-link"):
text = h1.text.replace('\uFEFF', "")
if "Puffy" in text:
break
except NoSuchElementException:
driver.refresh
else:
for h1 in driver.find_elements_by_class_name("name-link"):
text = h1.text.replace('\uFEFF', "")
if "Puffy" in text:
h1.click()
break
break
These fragment is because I have to find one item with the same class name and replace BOM with "" (find_element_by_partial_link_text didn't work).
for h1 in driver.find_elements_by_class_name("name-link"):
text = h1.text.replace('\uFEFF', "")
if "Puffy" in text:
break
Could someone help me? Thanks a lot.

You're trying to get list of elements (driver.find_elements_by_class_name() might return list of elements or empty list - no exceptions) - you cannot get NoSuchElementException in this case, so driver.refresh will not be executed. Try below instead
while True:
if any(["Puffy" in h1.text.replace('\uFEFF', "") for h1 in driver.find_elements_by_class_name("name-link")]):
break
else:
driver.refresh
driver.find_element_by_xpath("//*[contains(., 'Puffy')]").click()

Related

Why is no data stored in my list in Python?

I have the following code to get some data using selenium. That goes through a list with ids with a for loop and to store them in my lists (titulos = [] and ids = []. It was working fine until I added the try/except. The code would look like this:
for item in registros:
found = False
ids = []
titulos = []
try:
while true:
#code to request data
try:
error = False
error = #error message
if error is True:
break
except:
continue
except:
continue
try:
found = #if id has data
if found.is_displayed:
titulo = #locator
ids.append(item)
titulos.append(titulo)
except NoSuchElementException:
input.clear()
The first inner try block needs to be indented. Also, the error parameter will always be set to the text message so it will always be true. Try formatting your code correctly and then identifying the problem.

how to make except only one value of some in selenium?

I want to find title, address, price of some items in an online mall.
But, sometimes the address is empty and my code is break in my code(below_it's an only selenium part)
num = 1
while 1:
try:
title = browser.find_element_by_xpath('//*[#id="root"]/div[1]/section/article/div/div['+str(num)+']/div/div/a/span').text
datas_title.append(title)
address = browser.find_element_by_xpath('//*[#id="root"]/div[1]/section/article/div/div['+str(num)+']/div/div/a/div/p[2]').text
datas_address.append(address)
price = browser.find_element_by_xpath('//*[#id="root"]/div[1]/section/article/div/div['+str(num)+']/div/div/a/p').text
datas_price.append(price)
print('crowling....num = '+str(num))
num=num+1
except Exception as e:
print("finish get data...")
break
print(datas_title)
print(datas_address)
print(datas_price)
what should I do if the address is empty -> just ignore it and find the next items?
Use this so you can skip the entries with missing information:
num = 1
while 1:
try:
title = browser.find_element_by_xpath('//*[#id="root"]/div[1]/section/article/div/div['+str(num)+']/div/div/a/span').text
datas_title.append(title)
address = browser.find_element_by_xpath('//*[#id="root"]/div[1]/section/article/div/div['+str(num)+']/div/div/a/div/p[2]').text
datas_address.append(address)
price = browser.find_element_by_xpath('//*[#id="root"]/div[1]/section/article/div/div['+str(num)+']/div/div/a/p').text
datas_price.append(price)
print('crowling....num = '+str(num))
num=num+1
except:
print("an error was encountered")
continue
print(datas_title)
print(datas_address)
print(datas_price)
address = browser.find_element_by_xpath('//*[#id="root"]/div[1]/section/article/div/div['+str(num)+']/div/div/a/div/p[2]').text
if not address:
address = "None"
else:
address = address[0].text
datas_title.append(address)
You could use find_elements to check if it's empty and then proceed to do it with either value. You can than encapsulate this into a function pass it the xpath and the data_title array and your code should be repeatable.
I think you need to first check if the web element returned isn't none. And then proceed with fetching text.
You could write a function for it, and catch that exception in it.

Nested while loops does not break as expected

I have a list of links and for each link I want to check if it contains a specific sublink and add this sublink to the initial list. I have this code:
def getAllLinks():
i = 0
baseUrl = 'http://www.cdep.ro/pls/legis/'
sourcePaths = ['legis_pck.lista_anuala?an=2012&emi=3&tip=18&rep=0','legis_pck.lista_anuala?an=2020&emi=3&tip=18&rep=0&nrc=1', 'legis_pck.lista_anuala?an=2010&emi=3&tip=18&rep=0']
while i < len(sourcePaths)+1:
for path in sourcePaths:
res = requests.get(f'{baseUrl}{path}')
soup = BeautifulSoup(res.text)
next_btn = soup.find(lambda e: e.name == 'td' and '1..99' in e.text)
if next_btn:
for a in next_btn.find_all('a', href=True):
linkNextPage = a['href']
sourcePaths.append(linkNextPage)
i += 1
break
else:
i += 1
continue
break
return sourcePaths
print(getAllLinks())
The first link in the list does not contain the sublink, so it's an else case. The code does this OK. However, the second link in the list does contain the sublink, but it gets stuck here:
for a in next_btn.find_all('a', href=True):
linkNextPage = a['href']
sourcePaths.append(linkNextPage)
i += 1
The third link contains the sublink but my code does not get to look at that link. At the end I am getting a list containing the initial links plus 4 times the sublink of the second link.
I think I'm breaking incorrectly somewhere but I can't figure out how to fix it.
Remove the while. It's not needed. Change the selectors
import requests
from bs4 import BeautifulSoup
def getAllLinks():
baseUrl = 'http://www.cdep.ro/pls/legis/'
sourcePaths = ['legis_pck.lista_anuala?an=2012&emi=3&tip=18&rep=0','legis_pck.lista_anuala?an=2020&emi=3&tip=18&rep=0&nrc=1', 'legis_pck.lista_anuala?an=2010&emi=3&tip=18&rep=0']
for path in sourcePaths:
res = requests.get(f'{baseUrl}{path}')
soup = BeautifulSoup(res.text, "html.parser")
next_btn = soup.find("p",class_="headline").find("table", {"align":"center"})
if next_btn:
anchor = next_btn.find_all("td")[-1].find("a")
if anchor: sourcePaths.append(anchor["href"])
return sourcePaths
print(getAllLinks())
Output:
['legis_pck.lista_anuala?an=2012&emi=3&tip=18&rep=0', 'legis_pck.lista_anuala?an=2020&emi=3&tip=18&rep=0&nrc=1', 'legis_pck.lista_anuala?an=2010&emi=3&tip=18&rep=0', 'legis_pck.lista_anuala?an=2020&emi=3&tip=18&rep=0&nrc=100', 'legis_pck.lista_anuala?an=2010&emi=3&tip=18&rep=0&nrc=100']
Your second break statement never gets executed because the first "for" loop is already broken by the first break statement and never reaches the second break statement. Put condition which break the while loop.

If Elif issue / Selenium Python

I made the following code to scrape some website. A list of product code is itered on research bar with Selenium. If there is no result found (if driver.find_element_by_css_selector("div[class='search-did-you-mean']"):) i just clear the research bar to make another search. If there is some results (elif driver.find_element_by_css_selector("div[class='result-search']"):) I scrape it
Here is the code :
for product in product_list:
inputElement = driver.find_element_by_id("q")
inputElement.send_keys(product[0])
inputElement.send_keys(Keys.ENTER)
inputElement.click()
time.sleep(5)
if driver.find_element_by_css_selector("div[class='search-did-you-mean']"):
time.sleep(5)
clearResearch = driver.find_element_by_id("q")
WebDriverWait(driver, 10).until_not(EC.visibility_of_element_located((By.ID, "overley")))
clearResearch.send_keys(Keys.CONTROL + "a")
clearResearch.send_keys(Keys.DELETE)
elif driver.find_element_by_css_selector("div[class='result-search']"):
time.sleep(5)
item['price'] = driver.find_element_by_css_selector("span[class='sale-price']").text
item['desc'] = driver.find_element_by_css_selector("h3[class='product-name']").text
print(item)
There is no result for the first product code of the list, so it is cleared and a new code is given. Problem appears with the second item, there is results but my elif condition seems not understand as I get an Unable to locate element: div[class='search-did-you-mean'] error.
Do you know what is wrong with my code ? Thanks a lot
This is selenium behavior will throw exception if no element found, wrap it in try-except
first_product = None
try:
first_product = driver.find_element_by_css_selector("div[class='search-did-you-mean']"
except: pass
if first_product:
.....
You can use find_elements_by_css_selector and check if the returned list has elements in it
if driver.find_elements_by_css_selector("div[class='search-did-you-mean']"):
#...
elif driver.find_elements_by_css_selector("div[class='result-search']"):
#...

Recursive function gives no output

I'm scraping all the URL of my domain with recursive function.
But it outputs nothing, without any error.
#usr/bin/python
from bs4 import BeautifulSoup
import requests
import tldextract
def scrape(url):
for links in url:
main_domain = tldextract.extract(links)
r = requests.get(links)
data = r.text
soup = BeautifulSoup(data)
for href in soup.find_all('a'):
href = href.get('href')
if not href:
continue
link_domain = tldextract.extract(href)
if link_domain.domain == main_domain.domain :
problem.append(href)
elif not href == '#' and link_domain.tld == '':
new = 'http://www.'+ main_domain.domain + '.' + main_domain.tld + '/' + href
problem.append(new)
return len(problem)
return scrape(problem)
problem = ["http://xyzdomain.com"]
print(scrape(problem))
When I create a new list, it works, but I don't want to make a list every time for every loop.
You need to structure your code so that it meets the pattern for recursion as your current code doesn't - you also should not call variables the same name as libraries, e.g. href = href.get() because this will usually stop the library working as it becomes the variable, your code as it currently is will only ever return the len() as this return is unconditionally reached before: return scrap(problem).:
def Recursive(Factorable_problem)
if Factorable_problem is Simplest_Case:
return AnswerToSimplestCase
else:
return Rule_For_Generating_From_Simpler_Case(Recursive(Simpler_Case))
for example:
def Factorial(n):
""" Recursively Generate Factorials """
if n < 2:
return 1
else:
return n * Factorial(n-1)
Hello I've made a none recursive version of this that appears to get all the links on the same domain.
The code below I've tested using the problem included in the code. When I'd solved the problems with the recursive version the next problem was hitting the recursion depth limit so I rewrote it so it ran in an iterative fashion, the code and result below:
from bs4 import BeautifulSoup
import requests
import tldextract
def print_domain_info(d):
print "Main Domain:{0} \nSub Domain:{1} \nSuffix:{2}".format(d.domain,d.subdomain,d.suffix)
SEARCHED_URLS = []
problem = [ "http://Noelkd.neocities.org/", "http://youpi.neocities.org/"]
while problem:
# Get a link from the stack of links
link = problem.pop()
# Check we haven't been to this address before
if link in SEARCHED_URLS:
continue
# We don't want to come back here again after this point
SEARCHED_URLS.append(link)
# Try and get the website
try:
req = requests.get(link)
except:
# If its not working i don't care for it
print "borked website found: {0}".format(link)
continue
# Now we get to this point worth printing something
print "Trying to parse:{0}".format(link)
print "Status Code:{0} Thats: {1}".format(req.status_code, "A-OK" if req.status_code == 200 else "SOMTHINGS UP" )
# Get the domain info
dInfo = tldextract.extract(link)
print_domain_info(dInfo)
# I like utf-8
data = req.text.encode("utf-8")
print "Lenght Of Data Retrived:{0}".format(len(data)) # More info
soup = BeautifulSoup(data) # This was here before so i left it.
print "Found {0} link{1}".format(len(soup.find_all('a')),"s" if len(soup.find_all('a')) > 1 else "")
FOUND_THIS_ITERATION = [] # Getting the same links over and over was boring
found_links = [x for x in soup.find_all('a') if x.get('href') not in SEARCHED_URLS] # Find me all the links i don't got
for href in found_links:
href = href.get('href') # You wrote this seems to work well
if not href:
continue
link_domain = tldextract.extract(href)
if link_domain.domain == dInfo.domain: # JUST FINDING STUFF ON SAME DOMAIN RIGHT?!
if href not in FOUND_THIS_ITERATION: # I'ma check you out next time
print "Check out this link: {0}".format(href)
print_domain_info(link_domain)
FOUND_THIS_ITERATION.append(href)
problem.append(href)
else: # I got you already
print "DUPE LINK!"
else:
print "Not on same domain moving on"
# Count down
print "We have {0} more sites to search".format(len(problem))
if problem:
continue
else:
print "Its been fun"
print "Lets see the URLS we've visited:"
for url in SEARCHED_URLS:
print url
Which prints, after a lot of other logging loads of neocities websites!
What's happening is the script is popping a value of the list of websites yet to visit, it then gets all the links on the page which are on the same domain. If those links are to pages we haven't visited we add the link to the list of links to be visited. After we do that we pop the next page and do the same thing again until there are no pages left to visit.
Think this is what your looking for, get back to us in the comments if this doesn't work in the way that you want or if anyone can improve please leave a comment.

Categories

Resources