I am looping through rows, each of which has a link and an index value that I assign to it. In addition to Selenium, I am also using Beautiful Soup to check the page HTML.
The main issue is that once I have found the link index that I want to use, I execute links[index].click(), and it only works occasionally.
Error: list index out of range
When I double-check, my index still appears to be within the range of the list, but the click still fails.
# Each link is confirmed to work, but only works every other time the script is run
page_html = BeautifulSoup(driver.page_source, 'html.parser')
links = [link1, link2]
rows = page_html.find_all('tr', recursive=False)
index = 0
found = False
for row in rows:
    col = row.select('td:nth-of-type(5)')
    for string in col[0].strings:
        # If the column has a "Yes" string, use the index of this row
        if string == 'Yes':
            found = True
            break
    # Break from the outer loop if we already have the row that we want
    if found:
        break
    # If not found, continue adding to the index value
    index += 1

# This is the part of the code that does not work consistently
links[index].click()
To debug this I attempted the following:
def custom_wait(num=3):
    driver.implicitly_wait(num)
    time.sleep(num)

attempts = 0
while attempts < 10:
    custom_wait()
    try:
        links[index].click()
    except:
        PrintException()
        attempts += 1
    else:
        logger.debug("Link Successfully clicked")
        break
When I run this code, it logs that the link was successfully clicked, but it also still reports that the index is out of range.
If the page contains more than 2 rows, it's not really a surprise that it throws an exception.
The links list contains 2 values (index 0 and index 1). If the third row's column does not contain the string 'Yes', you don't break from the for loop and you increment the index variable.
So at the third row index = 2, and the links list does not have anything at index 2, hence the IndexError.
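A minimal guard for that case (a sketch, clicking only when a matching row was found and the index actually falls inside the links list, reusing the variables from the code above):

# Only click if a matching row was found and the index maps to a real link
if found and index < len(links):
    links[index].click()
else:
    logger.debug("No link available for the matched row (index=%s, links=%s)", index, len(links))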
Why don't you loop over the links instead?
found = False
for link in links:
    link.click()
    rows = page_html.find_all('tr', recursive=False)
    for row in rows:
        col = row.select('td:nth-of-type(5)')
        for string in col[0].strings:
            if string == 'Yes':
                found = True
                break
        if found:
            break
    if found:
        break
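Note that BeautifulSoup works on a static snapshot of driver.page_source, so if a click navigates away or changes the DOM, the soup would need to be rebuilt inside the loop before the rows are re-read, for example:

page_html = BeautifulSoup(driver.page_source, 'html.parser')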
I'm trying to scrape a website and collect every "meal_box meal_container row" element into a list with driver.find_elements, but for some reason I can't. I tried By.CLASS_NAME, because it seemed the logical choice, but the length of my list was 0. Then I tried By.XPATH, and the length was 1 (I understand why). I could use XPath to get them one by one, but I'd rather not if I can handle it in a for loop.
I don't know why find_elements(By.CLASS_NAME, 'print_name') works but find_elements(By.CLASS_NAME, "meal_box meal_container row") does not.
I'm new at both web scraping and stackoverflow, so if any other details are needed I can add them.
Here is my code:
meals = driver.find_elements(By.CLASS_NAME, "meal_box meal_container row")
print(len(meals))
for index, meal in enumerate(meals):
    foods = meal.find_elements(By.CLASS_NAME, 'print_name')
    print(len(foods))
    if index == 0:
        mealName = "Breakfast"
    elif index == 1:
        mealName = "Lunch"
    elif index == 2:
        mealName = "Dinner"
    else:
        mealName = "Snack"
    for index, title in enumerate(foods):
        recipe = {}
        print(title.text)
        print(mealName + "\n")
        recipe["name"] = title.text
        recipe["meal"] = mealName
Here is the screenshot of the HTML:
It looks OK, but for the class name, put a dot between the individual classes instead of a space.
Like "meal_box.meal_container.row". Try this:
meals = driver.find_elements(By.CLASS_NAME,"meal_box.meal_container.row")
Try using driver.find_element_by_css_selector (By.CSS_SELECTOR in Selenium 4) instead.
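For example, a minimal sketch with Selenium 4's By.CSS_SELECTOR, assuming the same class names as in the question and the existing driver object (space-separated classes in the HTML become dot-separated in a CSS selector):

from selenium.webdriver.common.by import By

# All three classes joined with dots match elements that carry all of them
meals = driver.find_elements(By.CSS_SELECTOR, ".meal_box.meal_container.row")
print(len(meals))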
It could be because the "meal_box meal_container row" element is nested inside another element. So try finding the outer element first and then looking for the one you need inside it.
root = driver.find_element(By.CLASS_NAME, "row")
# The compound class still has to be dot-joined (or written as a CSS selector) here as well
meals = root.find_elements(By.CSS_SELECTOR, ".meal_box.meal_container.row")
I'm trying to iterate through a list of NFL QBs (over 100) and create a list of links that I will use later.
The links follow a standard format, however if there are multiple players with the same name (such as 'Josh Allen') the link format needs to change.
I've been trying to do this with different nested while/for loops with Try/Except with little to no success. This is what I have so far:
test = ['Josh Allen', 'Lamar Jackson', 'Derek Carr']
empty_list = []
name_int = 0

for names in test:
    try:
        q_b_name = names.split()
        link1 = q_b_name[1][0].capitalize()
        link2 = q_b_name[1][0:4].capitalize() + q_b_name[0][0:2].capitalize() + f'0{name_int}'
        q_b = pd.read_html(f'https://www.pro-football-reference.com/players/{link1}/{link2}/gamelog/')
        q_b1 = q_b[0]
        # filter_stats is a function that only works with QB data
        df = filter_stats(q_b1)
        # triggers the except if the link wasn't a QB
        df.head(5)
        empty_list.append(f'https://www.pro-football-reference.com/players/{link1}/{link2}/gamelog/')
    except:
        # adds one to the variable to change the link to find the proper QB link
        name_int += 1
The result only appends the final correct link. I need to append each correct link to the empty list.
Still a beginner in Python and trying to challenge myself with different projects. Thanks!
As stated, the try/except will work in that it will try the code under the try block. If at any point within that block it fails or raises an exception/error, it goes and executes the block of code under the except.
There are better ways to go about this problem (for example, I'd use BeautifulSoup to simply check the html for the "QB" position), but since you are a beginner, I think trying to learn this process will help you understand the loops.
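As a rough sketch of that BeautifulSoup idea, assuming the player page exposes the position as plain text such as "Position: QB" (an assumption about the markup, not something verified here):

import requests
from bs4 import BeautifulSoup

def looks_like_qb(url):
    # Hypothetical check: assumes the player page contains the text "Position: QB"
    html = requests.get(url).text
    soup = BeautifulSoup(html, 'html.parser')
    return 'Position: QB' in soup.get_text()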
So what this code does:
1) It formats your player name into the link format.
2) It initializes a while loop that it will stay in until a QB table is found.
3) It gets the table.
4a) It enters a function that checks whether the table contains 'passing' stats by looking at the column headers.
4b) If it finds 'passing' in the columns, it returns True to indicate it is a "QB" type of table (keep in mind there are sometimes running backs or other positions with passing stats, but we'll ignore that). If it returns True, the while loop stops and moves on to the next name in your test list.
4c) If it returns False, it increments your name_int and checks the next link.
5) To take care of the case where it never finds a QB table, the while loop gives up after 10 attempts.
Code:
import pandas as pd

def check_stats(q_b1):
    for col in q_b1.columns:
        if 'passing' in col.lower():
            return True
    return False

test = ['Josh Allen', 'Lamar Jackson', 'Derek Carr']
empty_list = []

for names in test:
    name_int = 0
    q_b_name = names.split()
    link1 = q_b_name[1][0].capitalize()

    qbStatsInTable = False
    while qbStatsInTable == False:
        link2 = q_b_name[1][0:4].capitalize() + q_b_name[0][0:2].capitalize() + f'0{name_int}'
        url = f'https://www.pro-football-reference.com/players/{link1}/{link2}/gamelog/'
        try:
            q_b = pd.read_html(url, header=0)
            q_b1 = q_b[0]
        except Exception as e:
            print(e)
            break

        # Check if "passing" is in the table columns
        qbStatsInTable = check_stats(q_b1)
        if qbStatsInTable == True:
            print(f'{names} - Found QB Stats in {link1}/{link2}/gamelog/')
            empty_list.append(f'https://www.pro-football-reference.com/players/{link1}/{link2}/gamelog/')
        else:
            name_int += 1
            if name_int == 10:
                print(f'Did not find a link for {names}')
                break  # give up on this name after 10 attempts
Output:
print(empty_list)
['https://www.pro-football-reference.com/players/A/AlleJo02/gamelog/', 'https://www.pro-football-reference.com/players/J/JackLa00/gamelog/', 'https://www.pro-football-reference.com/players/C/CarrDe02/gamelog/']
I don't know how many usernames there are, because on each iteration the data changes and old users are replaced with new ones, so I don't know which user will be the last. How can I break the loop once no more new users are found?
n = 0
usernames = soup.find_all('div', class_='KV-D4')
while n < 10000:
    for each in usernames:
        each.get_text()
        n += 1
    if usernames[last]:
        break
If all you want to do is loop through a list of divs and break as soon as the values are no longer unique, then keep a list of the usernames you have already seen: add each item if it is unique, otherwise break.
For example:
unique_usernames = []
usernames = soup.find_all('div', class_='KV-D4')
for each in usernames:
    username = each.text
    if username in unique_usernames:
        break  # first non-unique user found
    else:
        unique_usernames.append(username)
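A set works just as well here and makes the membership check O(1); a minimal variant of the same idea, reusing the soup object from above:

seen = set()
usernames = soup.find_all('div', class_='KV-D4')
for each in usernames:
    username = each.text
    if username in seen:
        break  # first repeated user found
    seen.add(username)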
Let's say I have something as follows:
c = list(range(1,5))
When I print the items of c, I get:
1
2
3
4
I am iterating over c in a program.
Now let's say that while working on "2" I get an error, and I want to go back to 2 so that it does the following:
1
2 (error)
2
3
4
How can I do that?
I have inserted an "if" statement for when an error occurs and a "continue", but it just skips to the next item.
Thanks
for i in range(4513, 5001):
    url = "https://...{pagenum}.....xml".format(pagenum=i)
    response = requests.get(url, verify=False)
    soup = BeautifulSoup(response.text)
    g_data = soup.find_all("td", {"class": "detail_1"})
    if not g_data:
        print("List is empty")
        continue
    results = []
    print(i)
    for item in g_data:
        results.append(item.text)
    df = pd.DataFrame(np.array(results).reshape(20, 7), columns=list("abcdefg"))
    excel_reader = pd.ExcelFile('test6.xlsx')
    to_update = {"Sheet1": df}
    excel_writer = pd.ExcelWriter('test6.xlsx')
    for sheet in excel_reader.sheet_names:
        sheet_df = excel_reader.parse(sheet)
        append_df = to_update.get(sheet)
        if append_df is not None:
            sheet_df = pd.concat([sheet_df, df]).drop_duplicates()
        sheet_df.to_excel(excel_writer, sheet, index=False)
    excel_writer.save()
Usually I get a ValueError on the df line. That's because the server of the website didn't have enough time to respond, hence g_data is empty.
The error is occurring on this line at the call to reshape(...):
df=pd.DataFrame(np.array(results).reshape(20,7),columns=list("abcdefg"))
results can be an empty list, right? Then np.array(results) gives you an empty array, and when you call reshape(20,7) on it, it raises an error because the result of np.array(results) does not have 20*7 = 140 elements.
reshape(20,7) expects the input array to also have 20*7 = 140 elements, but you may be passing an array with 0 elements (given the lines preceding the df line).
You may check the description of 'reshape' parameter on https://docs.scipy.org/doc/numpy/reference/generated/numpy.reshape.html
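You can reproduce the error in isolation: reshaping an empty array to (20, 7) fails because the element counts do not match.

import numpy as np

np.array([]).reshape(20, 7)
# ValueError: cannot reshape array of size 0 into shape (20,7)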
One solution to what you want (Warning: if the server fails all the time for a certain page, you will get an infinite loop unless you want to limit the number of retries):
for i in range(4513, 5001):
    url = "https://...{pagenum}.....xml".format(pagenum=i)
    downloaded = False
    while not downloaded:
        response = requests.get(url, verify=False)
        soup = BeautifulSoup(response.text)
        g_data = soup.find_all("td", {"class": "detail_1"})
        if not g_data:
            print("List is empty")
            continue
        else:
            downloaded = True
    results = []
    print(i)
    for item in g_data:
        results.append(item.text)
    df = pd.DataFrame(np.array(results).reshape(20, 7), columns=list("abcdefg"))
    excel_reader = pd.ExcelFile('test6.xlsx')
    to_update = {"Sheet1": df}
    excel_writer = pd.ExcelWriter('test6.xlsx')
    for sheet in excel_reader.sheet_names:
        sheet_df = excel_reader.parse(sheet)
        append_df = to_update.get(sheet)
        if append_df is not None:
            sheet_df = pd.concat([sheet_df, df]).drop_duplicates()
        sheet_df.to_excel(excel_writer, sheet, index=False)
    excel_writer.save()
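If you do want to cap the retries rather than loop forever, a counter can be wrapped around the same check; a sketch with an arbitrary limit of 5 attempts, reusing url, requests and BeautifulSoup from the loop above:

attempts = 0
downloaded = False
while not downloaded and attempts < 5:
    response = requests.get(url, verify=False)
    soup = BeautifulSoup(response.text)
    g_data = soup.find_all("td", {"class": "detail_1"})
    if not g_data:
        attempts += 1
        print("List is empty, retrying")
    else:
        downloaded = True
# if downloaded is still False here, the page never returned any data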
You can loop over each item using a while loop, breaking on success and continuing to the next item:
for i in items:
    while True:
        # some code that can succeed or fail goes here
        if success:
            break
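For instance, a minimal sketch applying that pattern to the list from the question, retrying the current item until it succeeds (the error is simulated here with a random failure):

import random

c = list(range(1, 5))
for i in c:
    while True:
        try:
            if random.random() < 0.3:  # simulate an occasional transient error
                raise ValueError("transient failure")
            print(i)
        except ValueError:
            continue  # retry the same item
        else:
            break  # success, move on to the next item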
I'm writing code in Python to get all the 'a' tags from a URL using Beautiful Soup, then take the link at position 3 and follow it, repeating this process about 18 times. The code below repeats the process 4 times. When I run it, I get the same 4 links in the results, but I should get 4 different links. I think there is something wrong in my loop, specifically in the line that says y=url. I need help figuring out what the problem is.
import re
import urllib
from BeautifulSoup import *

list1 = list()
url = 'https://pr4e.dr-chuck.com/tsugi/mod/python-data/data/known_by_Fikret.html'
for i in range(4):  # repeat 4 times
    htm2 = urllib.urlopen(url).read()
    soup1 = BeautifulSoup(htm2)
    tags1 = soup1('a')
    for tag1 in tags1:
        x2 = tag1.get('href', None)
        list1.append(x2)
    y = list1[2]
    if len(x2) < 3:  # no 3rd link
        break  # exit the loop
    else:
        url = y
        print y
You keep using the third link EVER FOUND (list1[2]) in your result list. Instead you should be taking the third link OF THAT ITERATION (which is soup('a')[2]), then reassigning your url and going again.
url = 'https://pr4e.dr-chuck.com/tsugi/mod/python-data/data/known_by_Fikret.html'
result = []
for i in range(4):
    soup = BeautifulSoup(urllib.urlopen(url).read())
    links = soup('a')
    for link in links:
        result.append(link)
    try:
        third_link = links[2]['href']
    except IndexError:  # less than three links
        break
    else:
        url = third_link
        print(url)
This is actually pretty simple in a recursive function:
def get_links(url):
    soup = BeautifulSoup(urllib.urlopen(url).read())
    links = soup('a')
    if len(links) < 3:
        # base case
        return links
    else:
        # recurse on third link
        return links + get_links(links[2]['href'])
You can even modify that to make sure you don't recurse too deep
def get_links(url, times=None):
    '''Returns all <a> tags from `url` and every 3rd link, up to `times` deep

    get_links("protocol://hostname.tld", times=2) -> list

    if times is None, recurse until there are fewer than 3 links to be found
    '''
    def _get_links(url, TTL):
        soup = BeautifulSoup(urllib.urlopen(url).read())
        links = soup('a')
        if (times is not None and TTL >= times) or \
                len(links) < 3:
            # base case
            return links
        else:
            return links + _get_links(links[2]['href'], TTL + 1)
    return _get_links(url, 0)
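For example, calling it against the page from the question (same URL, with an arbitrary depth of 4):

links = get_links('https://pr4e.dr-chuck.com/tsugi/mod/python-data/data/known_by_Fikret.html', times=4)
print(len(links))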
Your current code
y= list1[2]
just keeps selecting the URL located at index 2 of list1. Since that list only gets appended to, list1[2] doesn't change. You should instead be selecting a different index each time you print if you want different URLs. I'm not sure exactly what you're trying to print, but y = list1[-1], for instance, would print the last URL added to the list on that iteration (different each time).
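A quick illustration of the difference, with hypothetical link values just to show the indexing behaviour:

list1 = []
for href in ['a.html', 'b.html', 'c.html', 'd.html']:
    list1.append(href)
    print(list1[-1])  # the link added this iteration: a.html, b.html, c.html, d.html
# whereas list1[2] is always 'c.html' once the list has at least three items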