I want to find the title, address, and price of some items in an online mall.
But sometimes the address is empty and my code breaks (below is only the Selenium part):
num = 1
while True:
    try:
        title = browser.find_element_by_xpath('//*[@id="root"]/div[1]/section/article/div/div['+str(num)+']/div/div/a/span').text
        datas_title.append(title)
        address = browser.find_element_by_xpath('//*[@id="root"]/div[1]/section/article/div/div['+str(num)+']/div/div/a/div/p[2]').text
        datas_address.append(address)
        price = browser.find_element_by_xpath('//*[@id="root"]/div[1]/section/article/div/div['+str(num)+']/div/div/a/p').text
        datas_price.append(price)
        print('crawling....num = ' + str(num))
        num = num + 1
    except Exception as e:
        print("finish get data...")
        break
print(datas_title)
print(datas_address)
print(datas_price)
What should I do if the address is empty? Should I just ignore it and move on to the next item?
Use this so you can skip the entries with missing information:
num = 1
while True:
    try:
        title = browser.find_element_by_xpath('//*[@id="root"]/div[1]/section/article/div/div['+str(num)+']/div/div/a/span').text
        datas_title.append(title)
        address = browser.find_element_by_xpath('//*[@id="root"]/div[1]/section/article/div/div['+str(num)+']/div/div/a/div/p[2]').text
        datas_address.append(address)
        price = browser.find_element_by_xpath('//*[@id="root"]/div[1]/section/article/div/div['+str(num)+']/div/div/a/p').text
        datas_price.append(price)
        print('crawling....num = ' + str(num))
        num = num + 1
    except:
        print("an error was encountered")
        continue
print(datas_title)
print(datas_address)
print(datas_price)
address = browser.find_elements_by_xpath('//*[@id="root"]/div[1]/section/article/div/div['+str(num)+']/div/div/a/div/p[2]')
if not address:
    address = "None"
else:
    address = address[0].text
datas_address.append(address)
You could use find_elements to check whether the result is empty, and then proceed with either value. You can then encapsulate this into a function, pass it the XPath and the target list, and your code becomes repeatable.
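A minimal sketch of such a helper, assuming the same browser object and the legacy find_elements_by_xpath API used above (the helper name append_text_or_default and the address_xpath variable are just illustrative):

def append_text_or_default(browser, xpath, target_list, default="None"):
    # find_elements returns an empty list instead of raising when nothing matches
    elements = browser.find_elements_by_xpath(xpath)
    target_list.append(elements[0].text if elements else default)

# e.g. inside the loop, for the address column:
# append_text_or_default(browser, address_xpath, datas_address)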
I think you need to first check that the web element returned isn't None, and then proceed with fetching its text.
You could write a function for it and catch that exception inside it.
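For example, a rough sketch of such a function (the name safe_text is illustrative, and it assumes the legacy find_element_by_xpath API used in the question):

from selenium.common.exceptions import NoSuchElementException

def safe_text(browser, xpath, default="None"):
    # the exception is caught here, so the caller never has to handle missing elements
    try:
        return browser.find_element_by_xpath(xpath).text
    except NoSuchElementException:
        return default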
I have the following code to get some data using Selenium. It goes through a list of ids with a for loop and stores the results in my lists (titulos = [] and ids = []). It was working fine until I added the try/except. The code looks like this:
for item in registros:
    found = False
    ids = []
    titulos = []
    try:
        while true:
            #code to request data
        try:
            error = False
            error = #error message
            if error is True:
                break
        except:
            continue
    except:
        continue
    try:
        found = #if id has data
        if found.is_displayed:
            titulo = #locator
            ids.append(item)
            titulos.append(titulo)
    except NoSuchElementException:
        input.clear()
The first inner try block needs to be indented. Also, the error variable will always be set to the text of the message, so it will always be truthy. Try formatting your code correctly and then identifying the problem.
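To illustrate the second point, one way to test for an error message without the always-truthy assignment is to look the element up and check its visibility. This is only a sketch; ERROR_XPATH is a placeholder, not a locator from the original code:

ERROR_XPATH = '//div[@class="error-message"]'  # placeholder locator; replace with the real one

def error_is_shown(driver):
    # find_elements returns an empty list when the error element is absent
    errors = driver.find_elements_by_xpath(ERROR_XPATH)
    return bool(errors) and errors[0].is_displayed()

# inside the while loop:
# if error_is_shown(driver):
#     break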
I am performing web scraping via Python \ Selenium \ Chrome headless driver. I am reading the results from JSON - here is my code:
CustId = 500
while (CustId <= 510):
    print(CustId)
    # Part 1: Customer REST call:
    urlg = f'https://mywebsite/customerRest/show/?id={CustId}'
    driver.get(urlg)
    soup = BeautifulSoup(driver.page_source, "lxml")
    dict_from_json = json.loads(soup.find("body").text)
    # print(dict_from_json)
    #try:
    CustID = (dict_from_json['customerAddressCreateCommand']['customerId'])
    # Addr = (dict_from_json['customerShowCommand']['customerAddressShowCommandSet'][0]['addressDisplayName'])
    writefunction()
    CustId = CustId + 1
The issue is that sometimes 'addressDisplayName' will be present in the result set and sometimes not. If it's not, it fails with the error:
IndexError: list index out of range
Which makes sense, as it doesn't exist. How do I ignore this, though, so that if 'addressDisplayName' doesn't exist the loop just continues? I've tried using a try but the code still stops executing.
A try..except block should resolve your issue.
CustId = 500
while (CustId <= 510):
    print(CustId)
    # Part 1: Customer REST call:
    urlg = f'https://mywebsite/customerRest/show/?id={CustId}'
    driver.get(urlg)
    soup = BeautifulSoup(driver.page_source, "lxml")
    dict_from_json = json.loads(soup.find("body").text)
    # print(dict_from_json)
    CustID = (dict_from_json['customerAddressCreateCommand']['customerId'])
    try:
        Addr = (dict_from_json['customerShowCommand']['customerAddressShowCommandSet'][0]['addressDisplayName'])
    except:
        Addr = "NaN"
    CustId = CustId + 1
If you get an IndexError (with an index of 0), it means that your list is empty. So the problem is one step earlier in the path (otherwise you'd get a KeyError if 'addressDisplayName' were missing from the dict).
You can check if the list has elements:
if dict_from_json['customerShowCommand']['customerAddressShowCommandSet']:
    # get the data
Otherwise you can indeed use try..except:
try:
    # get the data
except (IndexError, KeyError):
    # handle missing data
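Put together in the context of the loop above, that could look roughly like this (using "NaN" as the fallback value, as in the other answer):

command_set = dict_from_json['customerShowCommand']['customerAddressShowCommandSet']
if command_set:
    # .get() avoids a KeyError if the key is missing from the first entry
    Addr = command_set[0].get('addressDisplayName', "NaN")
else:
    Addr = "NaN"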
I get the following IndexError when trying to run the code below (found here: https://github.com/israel-dryer/Twitter-Scraper/blob/main/twitter-scraper-tut.ipynb):
card = cards[0]
IndexError: list index out of range
Since I am a Python newbie could you help me figure this out please?
Thanks a lot!!!
cards = driver.find_elements_by_xpath('//div[@data-testid="tweet"]')
card = cards[0]

#function that collects tweets while scrolling; filters out sponsored tweets; saves tweets in Tuple
def get_tweet_data(card):
    # get username of the tweet
    username_tweet = card.find_element_by_xpath('.//span').text
    # get Twitter handle
    handle_tweet = card.find_element_by_xpath('.//span[contains(text(),"@")]').text
    # get date of post - if no date, then sponsored tweet - then do not return
    try:
        date_tweet = card.find_element_by_xpath('.//time').get_attribute('datetime')
    except NoSuchElementException:
        return
    # get text of tweet
    comment = card.find_element_by_xpath('.//div[2]/div[2]/div[1]').text
    responding = card.find_element_by_xpath('.//div[2]/div[2]/div[2]').text
    text_tweet = comment + responding
    # number of replies, retweets, likes
    reply_count = card.find_element_by_xpath('.//div[@data-testid="reply"]').text
    retweet_count = card.find_element_by_xpath('.//div[@data-testid="retweet"]').text
    like_count = card.find_element_by_xpath('.//div[@data-testid="like"]').text
    tweet = (username_tweet, handle_tweet, date_tweet, text_tweet, reply_count, retweet_count, like_count)
    return tweet

get_tweet_data(card)
It means that driver.find_elements_by_xpath('//div[@data-testid="tweet"]') did not return anything, i.e. there are no elements that match that XPath.
find_elements_by_xpath() returns a list of found elements (an empty list if no elements are found). So in this case you need to check the length of cards before getting items from it. E.g.:
cards = driver.find_elements_by_xpath('//div[@data-testid="tweet"]')
if len(cards) > 0:
    card = cards[0]
else:
    raise NoSuchElementException('No cards were found')
Maybe it would be better to use driver.find_element_by_xpath() instead of driver.find_elements_by_xpath() - it will raise a NoSuchElementException if no element is found.
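A rough sketch of that variant; falling back to None and printing a message is just one way to handle the missing-element case:

from selenium.common.exceptions import NoSuchElementException

try:
    card = driver.find_element_by_xpath('//div[@data-testid="tweet"]')
except NoSuchElementException:
    card = None
    print('No tweet cards were found on the page')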
I want to raise an exception if any mismatch is found, but the loop should also continue.
If there's any mismatch/exception error, the entire case should fail.
Can y'all check the below code and help me out here?
def test01_check_urls(self, test_setup):
    # reading the file
    Total_entries = len(old_urls)  # Total_entries = 5
    print("Total entries in the sheet: " + str(Total_entries))
    col_count = 0
    # opening urls
    while col_count < Total_entries:
        Webpage = old_urls[col_count]  # fetching data from 1st cell in the excel
        Newpage = new_urls[col_count]  # fetching data from 1st cell in the excel
        driver.get(Webpage)
        print("The old page url is: " + Webpage)
        page_title = driver.title
        print(page_title)
        Redr_page = driver.current_url
        print("The new url is: " + Redr_page)
        print("New_url from sheet:" + Newpage)
        try:
            if Redr_page == Newpage:
                print("Correct url")
        except:
            raise Exception("Url mismatch")
        col_count += 1
Have a variable url_mismatch, initially False. Instead of immediately raising an exception when there is a URL mismatch, just set this variable to True. Then when the loop ends, check the value of this variable and raise an exception if the variable is True.
However, it's not clear how your try block results in an exception. Did you possibly mean (no try block necessary):
if Redr_page == Newpage:
    print("Correct url")
else:
    raise Exception("Url mismatch")
For now I am leaving that part of the code unmodified:
url_mismatch = False
while col_count < Total_entries:
    Webpage = old_urls[col_count]  # fetching data from 1st cell in the excel
    Newpage = new_urls[col_count]  # fetching data from 1st cell in the excel
    driver.get(Webpage)
    print("The old page url is: " + Webpage)
    page_title = driver.title
    print(page_title)
    Redr_page = driver.current_url
    print("The new url is: " + Redr_page)
    print("New_url from sheet:" + Newpage)
    try:
        if Redr_page == Newpage:
            print("Correct url")
    except:
        print('Mismatch url')
        url_mismatch = True  # show we have had a mismatch
    col_count += 1

# now check for a mismatch and raise an exception if there has been one:
if url_mismatch:
    raise Exception("Url mismatch")
I am making a request to a server... for whatever reason (beyond my comprehension), the server will give me a status code of 200, but when I use Beautiful Soup to grab a list from the html, nothing is returned. It only happens on the first page of pagination.
To get around a known bug, I have to loop until the list is not empty.
This works, but it's clunky. Is there a better way to do this, given that I have to keep re-requesting until the list contains an item?
# look for attractions
attraction_list = soup.find_all(attrs={'class': 'listing_title'})

while not attraction_list:
    print('the list is empty')
    try:
        t = requests.Session()
        t.cookies.set_policy(BlockAll)
        page2 = t.get(search_url)
        print(page2.status_code)
        soup2 = BeautifulSoup(page2.content, 'html.parser')
        attraction_list = soup2.find_all(attrs={'class': 'listing_title'})
    except:
        pass
I came up with this.
attraction_list = soup.find_all(attrs={'class': 'listing_title'})

while not attraction_list:
    print('the list is empty')
    for q in range(0, 4):
        try:
            t = requests.Session()
            t.cookies.set_policy(BlockAll)
            page2 = t.get(search_url)
            print(page2.status_code)
            soup2 = BeautifulSoup(page2.content, 'html.parser')
            attraction_list = soup2.find_all(attrs={'class': 'listing_title'})
        except Exception as str_error:
            print('FAILED TO FIND ATTRACTIONS')
            time.sleep(3)
            continue
        else:
            break
It'll try up to 4 times to pull the attractions; if attraction_list ends up with a valid list, it breaks. Good enough.
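If you want it a little less clunky, the same retry logic could be factored into a small helper. This is only a sketch built on the same assumptions as the code above (search_url and the BlockAll cookie policy are defined elsewhere in your script):

import time

import requests
from bs4 import BeautifulSoup

def fetch_attractions(url, attempts=4, delay=3):
    # retry the request until listing titles appear, or give up after a few tries
    for _ in range(attempts):
        try:
            session = requests.Session()
            session.cookies.set_policy(BlockAll)  # BlockAll is assumed to be defined elsewhere
            response = session.get(url)
            soup = BeautifulSoup(response.content, 'html.parser')
            attractions = soup.find_all(attrs={'class': 'listing_title'})
            if attractions:
                return attractions
        except requests.RequestException:
            pass
        time.sleep(delay)
    return []

# usage:
# attraction_list = fetch_attractions(search_url)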