I'm working on scraping a site that has a dropdown menu of hundreds of schools. I'm trying to go through and grab tables for only the schools from a certain district in the state. So far I have isolated the values for just those schools, but I've been unable to substitute the values stored in my dataframe/list into the XPath.
Here is my code:
ousd_list = ousd['name'].to_list()
for i in range(0,129):
    n = 0
    driver.find_element_by_xpath(('"//option[#value="',ousd_list[n],']"'))
    driver.find_elements_by_name("submit1").click()
    table = driver.find_elements_by_id("ContentPlaceHolder1_grdDisc")
    tdf = pd.read_html(table)
    tdf.to_csv(index=False)
    n += 1
    driver.get('https://dq.cde.ca.gov/dataquest/Expulsion/ExpSearchName.asp?TheYear=2018-19&cTopic=Expulsion&cLevel=School&cName=&cCounty=&cTimeFrame=S')
I suspect the issue is on the find_element_by_xpath line, but I'm not sure how else to go about it. Any advice?
The mistake is not in the scraping part but in your loop logic: since you set n = 0 at the beginning of the loop, it resets to 0 on every iteration and each pass just finds ousd_list[0].
Try:
ousd_list = ousd['name'].to_list()
for ousd_name in ousd_list:
    # XPath attribute tests use @, not #
    driver.find_element_by_xpath(f'//option[@value="{ousd_name}"]').click()
    driver.find_element_by_name("submit1").click()
    # Hand the table element's HTML to pandas for parsing
    table = driver.find_element_by_id("ContentPlaceHolder1_grdDisc")
    tdf = pd.read_html(table.get_attribute('outerHTML'))[0]
    tdf.to_csv(f'{ousd_name}.csv', index=False)  # one CSV per school
    # Return to the search page for the next school
    driver.get('https://dq.cde.ca.gov/dataquest/Expulsion/ExpSearchName.asp?TheYear=2018-19&cTopic=Expulsion&cLevel=School&cName=&cCounty=&cTimeFrame=S')
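Alternatively, Selenium ships a Select helper for dropdowns, which is usually more robust than clicking an <option> directly. A minimal sketch, assuming the options live in a single <select> on the page (the tag-name locator is an assumption; the question doesn't show the surrounding HTML):

from selenium.webdriver.support.ui import Select

# Wrap the dropdown (assumed to be the page's only <select>) and pick by value
dropdown = Select(driver.find_element_by_tag_name('select'))
dropdown.select_by_value(ousd_name)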
I'm trying to iterate through a list of NFL QBs (over 100) and create a list of links that I will use later.
The links follow a standard format, however if there are multiple players with the same name (such as 'Josh Allen') the link format needs to change.
I've been trying to do this with different nested while/for loops with Try/Except with little to no success. This is what I have so far:
test = ['Josh Allen', 'Lamar Jackson', 'Derek Carr']
empty_list = []
name_int = 0
for names in test:
    try:
        q_b_name = names.split()
        link1 = q_b_name[1][0].capitalize()
        link2 = q_b_name[1][0:4].capitalize() + q_b_name[0][0:2].capitalize() + f'0{name_int}'
        q_b = pd.read_html(f'https://www.pro-football-reference.com/players/{link1}/{link2}/gamelog/')
        q_b1 = q_b[0]
        # filter_stats is a function that only works with QB data
        df = filter_stats(q_b1)
        # triggers the except if the link wasn't a QB
        df.head(5)
        empty_list.append(f'https://www.pro-football-reference.com/players/{link1}/{link2}/gamelog/')
    except:
        # adds one to the variable to change the link, to find the proper QB link
        name_int += 1
The result only appends the final correct link. I need to append each correct link to the empty list.
Still a beginner in Python and trying to challenge myself with different projects. Thanks!
As stated, the try/except works by running the code under the try block; if at any point within that block an exception or error is raised, execution jumps to the block of code under except.
There are better ways to go about this problem (for example, I'd use BeautifulSoup to simply check the HTML for the "QB" position), but since you are a beginner, I think working through this process will help you understand the loops.
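For reference, a rough sketch of that BeautifulSoup idea; the id="meta" selector is an assumption about the player page's header block, so inspect the page to confirm it:

import requests
from bs4 import BeautifulSoup

def is_qb(player_url):
    # Fetch the player page and look for the position in its header block
    # (id="meta" is assumed here, not verified)
    soup = BeautifulSoup(requests.get(player_url).content, 'html.parser')
    meta = soup.find(id='meta')
    return meta is not None and 'QB' in meta.get_text()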
So what this code does:
1. It formats your player name into the link format.
2. It enters a while loop that keeps trying links until a QB table turns up.
3. It gets the table.
4. a) A function checks whether the table contains 'passing' stats by looking at the column headers.
   b) If it finds 'passing' in the columns, it returns True to indicate this is a "QB" type of table (keep in mind there may occasionally be running backs or other positions with passing stats, but we'll ignore that). If it returns True, the while loop stops and moves on to the next name in your test list.
   c) If it returns False, it increments your name_int and checks the next link.
5. To take care of the case where it never finds a QB table, the while loop gives up after 10 iterations.
Code:
import pandas as pd

def check_stats(q_b1):
    # True if any column header mentions passing stats
    for col in q_b1.columns:
        if 'passing' in col.lower():
            return True
    return False

test = ['Josh Allen', 'Lamar Jackson', 'Derek Carr']
empty_list = []

for names in test:
    name_int = 0
    q_b_name = names.split()
    link1 = q_b_name[1][0].capitalize()
    qbStatsInTable = False
    while not qbStatsInTable:
        link2 = q_b_name[1][0:4].capitalize() + q_b_name[0][0:2].capitalize() + f'0{name_int}'
        url = f'https://www.pro-football-reference.com/players/{link1}/{link2}/gamelog/'
        try:
            q_b = pd.read_html(url, header=0)
            q_b1 = q_b[0]
        except Exception as e:
            print(e)
            break
        # Check if "passing" is in the table columns
        qbStatsInTable = check_stats(q_b1)
        if qbStatsInTable:
            print(f'{names} - Found QB Stats in {link1}/{link2}/gamelog/')
            empty_list.append(url)
        else:
            name_int += 1
            if name_int == 10:
                print(f'Did not find a link for {names}')
                break  # give up on this name (just setting the flag to False would loop forever)
Output:
print(empty_list)
['https://www.pro-football-reference.com/players/A/AlleJo02/gamelog/', 'https://www.pro-football-reference.com/players/J/JackLa00/gamelog/', 'https://www.pro-football-reference.com/players/C/CarrDe02/gamelog/']
I need your assistance with the issue below.
My code:
def rows_to_sheet():
    line = 1
    for i in data:
        browser.get(i)
        trs = browser.find_elements_by_xpath("/html/body/table[3]/tbody/tr")
        for n, tr in enumerate(trs):
            row = [td.text for td in tr.find_elements_by_tag_name("td")]
            worksheet.write_row("A{}".format(line), row)
            line += 1
tr[1] contains the headers; it is being written to excel.xlsx by a separate function:
/html/body/table[3]/tbody/tr[1]/td
And the data I want to write to Excel comes from tr[2], tr[3], tr[4], and so on, depending on the search keyword, like this:
/html/body/table[3]/tbody/tr[2]/td
/html/body/table[3]/tbody/tr[3]/td
/html/body/table[3]/tbody/tr[4]/td
/html/body/table[3]/tbody/tr[5]/td
However, tr[1] (the header) is also written to excel.xlsx each time, since I could not exclude tr[1] from the code block.
Could anyone help me with how to skip the tr[1] iteration in my code, please?
Many thanks in advance!
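A minimal sketch of one way to do this: slice the element list so the loop starts at the second row (XPath's tr[1] is trs[0] in the Python list):

def rows_to_sheet():
    line = 1
    for i in data:
        browser.get(i)
        trs = browser.find_elements_by_xpath("/html/body/table[3]/tbody/tr")
        # trs[1:] skips the first <tr>, i.e. the header row
        for tr in trs[1:]:
            row = [td.text for td in tr.find_elements_by_tag_name("td")]
            worksheet.write_row("A{}".format(line), row)
            line += 1

Alternatively, the XPath itself can exclude the header: /html/body/table[3]/tbody/tr[position()>1].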
I'm working on an assignment for class. I need to write something that will return the first row in the table on this webpage (the Barr v. Lee row): https://www.supremecourt.gov/opinions/slipopinion/19
I've seen other questions that some might consider similar, but they don't appear to answer my question. In most of them, it looks like the table is already in hand rather than being pulled down from a website.
Or maybe I just can't see the resemblance. I've been scraping for about a week now.
Right now, I'm trying to build a loop that goes through all the div elements with an increment counter, so that the counter returns a number telling me which div holds that row and I can drill into it.
This is what I have so far:
for divs in soup_doc:
    div_counter = 0
    soup_doc.find_all('div')[div_counter]
    div_counter = div_counter + 1
    print(div_counter)
But right now, it's only returning 1 which I know isn't right. What should I do to fix this? Or is there a better way to go about getting this information?
My output should be:
63
7/14/20
20A8
Barr v. Lee
PC
591/2
In your example, the div_counter = 0 has to go in front of your loop, like this:
div_counter = 0
for divs in soup_doc:
    soup_doc.find_all('div')[div_counter]
    div_counter = div_counter + 1
    print(div_counter)
You always get 1 because you set div_counter to 0 inside your for-loop at the beginning of each iteration and then add 1.
To get the first data row, you can use the CSS selector .in tr:nth-of-type(2) td (the table's first tr is the header row):
import requests
from bs4 import BeautifulSoup
URL = "https://www.supremecourt.gov/opinions/slipopinion/19"
soup = BeautifulSoup(requests.get(URL).content, "html.parser")
for tag in soup.select('.in tr:nth-of-type(2) td'):
    print(tag.text)
Output:
63
7/14/20
20A8
Barr v. Lee
PC
591/2
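If you later need every data row rather than just the first, the same approach generalizes; a sketch, assuming the header is the table's first tr:

# Rows 2, 3, 4, ...: everything after the header row
for row in soup.select('.in tr:nth-of-type(n+2)'):
    print([td.text for td in row.select('td')])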
I am working on a project for school where I am creating a nutrition plan based on our school's nutrition menu. I am trying to create a dictionary with every item and its calorie content, but for some reason the loop I'm using gets stuck at 7 and never advances through the rest of the list to add to my dictionary. So when I search for a known key (Sour Cream), it throws an error because it was never added to the dictionary. I have also noticed it prints several numbers twice in a row and double-adds them to the dictionary.
Edit: I have discovered the double printing came from the print statement I had; still wondering about the 7, however.
from bs4 import BeautifulSoup
import urllib3
import requests

url = "https://menus.sodexomyway.com/BiteMenu/Menu?menuId=14756&locationId=11870001&whereami=http://mnsu.sodexomyway.com/dining-near-me/university-dining-center"
r = requests.get(url)
soup = BeautifulSoup(r.content, "html5lib")
allFood = soup.findAll('a', attrs={'class':'get-nutritioncalculator primary-textcolor'})
allCals = soup.findAll('a', attrs={'class':'get-nutrition primary-textcolor'})
nums = '0123456789'

def printData(charIndex):
    for char in allFood[charIndex].contents:
        print(char)
    for char in allCals[charIndex].contents:
        print(char)

def getGoals():
    userCalories = int(input("Please input calorie goal for the day (kC): "))

#Display Info (Text/RsbPi)
fullList = {}

def compileFood():
    foodCount = 0
    for food in allFood:
        print(foodCount)
        for foodName in allFood[foodCount].contents:
            fullList[foodName] = 0
            foodCount += 1
        print(foodCount)

compileFood()
print(fullList['Sour Cream'])
Any help would be great. Thanks!
OK, first, why is this happening: the food at index 7 is empty. Because it's empty, the inner for loop never runs and therefore never increases your foodCount => it is stuck at 7 forever.
So if you shift the index increase outside of the inner for loop, it will work without a problem.
But you are doing something crude here: you already iterate through the food items yet still use an additional index variable. You could solve it more cleanly this way:
def compileFood():
    for food in allFood:
        for foodName in food.contents:
            fullList[foodName] = 0
With this you don't need to care about an additional variable at all.
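If you prefer, the same thing can be written as a single dict comprehension:

# Equivalent one-liner: one entry per food name, all initialized to 0
fullList = {foodName: 0 for food in allFood for foodName in food.contents}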
Having already built a dynamic search for offers by company, which generates a link used to look up the job reviews written by previous employees, I'm now stuck on the next part: after assigning job offers, job reviews, and descriptions each to a list, I need to iterate through them and print the corresponding matches.
It all seems easy until you notice that the job offers list has a different size than the job reviews list, which leaves me at a standstill.
I'm trying the following code, which obviously gives me an error, since cargo_revisto_list is longer than nome_emprego_list; this tends to happen once you have more reviews than job offers, and the opposite happens as well.
The lists would be, for example, the following:
cargo_revisto_list = ["Business Leader","Sales Manager"]
nome_emprego_list = ["Business Leader","Sales Manager","Front-end Developer"]
opiniao_list = ["Excellent Job","Wonderful managing"]
It would be a matter of luck for them to be exactly the same size.
url = "https://www.indeed.pt/cmp/Novabase/reviews?fcountry=PT&floc=Lisboa"
comprimento_cargo_revisto = len(cargo_revisto_list) #19
comprimento_nome_emprego = len(nome_emprego_list) #10
descricoes_para_cargos_existentes = []
if comprimento_cargo_revisto > comprimento_nome_emprego:
for i in range(len(cargo_revisto_list)):
s = cargo_revisto_list[i]
for z in range(len(nome_emprego_list)):
a = nome_emprego_list[z]
if(s == a): #A Stopping here needs new way of comparing strings
c=opiniao_list[i]
descricoes_para_cargos_existentes.append(c)
elif comprimento_nome_emprego > comprimento_cargo_revisto:
for i in range(len(comprimento_nome_emprego)):
s = nome_emprego_list[i]
for z in range(len(cargo_revisto_list)):
a = cargo_revisto_list[z]
if(s == a) and a!=None:
c = opiniao_list[z]
descricoes_para_cargos_existentes.append(c)
else:
for i in range(len(cargo_revisto_list)):
s = cargo_revisto_list[i]
for z in range(len(nome_emprego_list)):
a = nome_emprego_list[z]
if(s == a):
c = (opiniao_list[i])
descricoes_para_cargos_existentes.append(c)
After solving this issue, I would need to get the exact review description that corresponds to the job offer reviewed. To do that, I would get the index in cargo_revisto_list and use that index to print the entry of opiniao_list (the review description) that matches the reviewed job, since both were added to their lists at the same time and in the same order by Beautiful Soup at scraping time.
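A minimal sketch of one way to sidestep the length comparison entirely: build a mapping from reviewed job title to its review, then look each job offer up in that mapping. This assumes, as described above, that cargo_revisto_list and opiniao_list are aligned index-for-index:

# Title -> review, relying on the two lists being scraped in the same order
reviews_por_cargo = dict(zip(cargo_revisto_list, opiniao_list))

descricoes_para_cargos_existentes = []
for nome in nome_emprego_list:
    if nome in reviews_por_cargo:
        descricoes_para_cargos_existentes.append(reviews_por_cargo[nome])

Since dict lookups don't care which list is longer, the three if/elif/else branches collapse into a single loop.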