Finding first row of a table with Beautiful Soup - python

I'm working on an assignment for class. I need to write something that will return the first row in the table on this webpage (the Barr v. Lee) row: https://www.supremecourt.gov/opinions/slipopinion/19
I've seen other questions that some might consider similar. But they don't look like they're answering my same question. Most other questions it looks like they already have the table on head, rather than pulling it down from a website already.
Or, maybe I just can't see the resemblance. I've been scraping for about a week now.
Right now, I'm trying to build a loop that will go through all the div elements with an increment counter, and have the counter return a number that tells what the div is for that row so I can drill into it.
This is what I have so far:
for divs in soup_doc:
div_counter = 0
soup_doc.find_all('div')[div_counter]
div_counter = div_counter + 1
print(div_counter)
But right now, it's only returning 1 which I know isn't right. What should I do to fix this? Or is there a better way to go about getting this information?
My output should be:
63
7/14/20
20A8
Barr v. Lee
PC
591/2

In your example the div_counter = 0 has to go in front of your loop like this:
div_counter = 0
for divs in soup_doc:
soup_doc.find_all('div')[div_counter]
div_counter = div_counter + 1
print(div_counter)
You always get 1 because you set div_counter to 0 inside of you for-loop at a beginning of each iteration and than add 1.

To get the first row, you can use a CSS Selector .in tr:nth-of-type(2) td:
import requests
from bs4 import BeautifulSoup
URL = "https://www.supremecourt.gov/opinions/slipopinion/19"
soup = BeautifulSoup(requests.get(URL).content, "html.parser")
for tag in soup.select('.in tr:nth-of-type(2) td'):
print(tag.text)
Output:
63
7/14/20
20A8
Barr v. Lee
 
PC
591/2

Related

Loop through changing xpath values w/ Selenium

I'm working on scraping a site that has a dropdown menu of hundreds of schools. I am trying to go through and grab tables for only schools from a certain district in the state. So far I have isolated the values for only those schools, but I've bee unable to replace the xpath values from what is stored in my dataframe/list.
Here is my code:
ousd_list = ousd['name'].to_list()
for i in range(0,129):
n = 0
driver.find_element_by_xpath(('"//option[#value="',ousd_list[n],']"'))
driver.find_elements_by_name("submit1").click()
table = driver.find_elements_by_id("ContentPlaceHolder1_grdDisc")
tdf = pd.read_html(table)
tdf.to_csv(index=False)
n += 1
driver.get('https://dq.cde.ca.gov/dataquest/Expulsion/ExpSearchName.asp?TheYear=2018-19&cTopic=Expulsion&cLevel=School&cName=&cCounty=&cTimeFrame=S')
I suspect the issue is on the find_element_by_xpath line, but I'm not sure how else I would go about resolving this issue. Any advice?
The mistake is not in the scraping part but your code logic, since you put n=0 in the beginning of your loop, it resets to 0 and every loop will just find your ousd_list[0].
Try,
ousd_list = ousd['name'].to_list()
for ousd_name in ousd_list :
driver.find_element_by_xpath(f'//option[#value="{ousd_name}"]')
driver.find_elements_by_name("submit1").click()
table = driver.find_elements_by_id("ContentPlaceHolder1_grdDisc")
tdf = pd.read_html(table)
tdf.to_csv(index=False)
driver.get('https://dq.cde.ca.gov/dataquest/Expulsion/ExpSearchName.asp?TheYear=2018-19&cTopic=Expulsion&cLevel=School&cName=&cCounty=&cTimeFrame=S')

How to access a specific object in a class HTML while web scraping with python

Now, I understand that this may be a simple question, but I don't know anything about HTML and I'm new to web scraping with python. I was wondering if anyone could tell me how to access this specific object in this class on this website (https://sky.lea.moe/stats/Igris/Apple). The specific object I want to access is in HTML below.
'''
Average Skill Level:
32.5 == $0
'''
My current code looks like this and prints out an empty list, and even if it did print, I only want it to print out everything from this specific line of code shown above.
import bs4
res = requests.get('https://sky.lea.moe/stats/Igris/Apple')
soup = bs4.BeautifulSoup(res.text, 'lxml')
type(soup)
skillAverageList = []
for i in soup.select('.stat-value'):
skillAverageList.append(i.text)
Any help would be appreciated, hopefully this will further help me understand HTML and python as a whole. Thanks in advance.
import requests
from bs4 import BeautifulSoup
res = requests.get('https://sky.lea.moe/stats/Igris/Apple')
soup = BeautifulSoup(res.text, 'lxml')
print(soup.find("div", {"id":"additional_stats_container"}).find_all("div",class_="additional-stat")[-2].get_text(strip=True))
Output:
Average Skill Level:32.5
elements = soup.find_all("span", class_="stat-name")
skill = [i for i in elements if "Average Skill" in i.text] #getting element that has "Average Skill" in its text
idx = elements.index(skill) #getting its index to get the value of same index from values
values = soup.find_all("span", class_="stat-value")
value = values[idx] #as told earlier index of name would be same for value
print(skill[0].text + value.text)

Count stuck at 7

I am working on a project for school where I am creating a nutrition plan based off our schools nutrition menu. I am trying to create a dictionary with every item and its calorie content but for some reason the loop im using gets stuck at 7 and will never advance the rest of the list. To add to my dictionary. So when I search for a known key (Sour Cream) it throws and error because it is never added to the dictionary. I have also noticed it prints several numbers twice in a row as well double adding them to the dictionary.
edit: have discovered the double printing was from the print statement I had - still wondering about the 7 however
from bs4 import BeautifulSoup
import urllib3
import requests
url = "https://menus.sodexomyway.com/BiteMenu/Menu?menuId=14756&locationId=11870001&whereami=http://mnsu.sodexomyway.com/dining-near-me/university-dining-center"
r = requests.get(url)
soup = BeautifulSoup(r.content, "html5lib")
allFood = soup.findAll('a', attrs={'class':'get-nutritioncalculator primary-textcolor'})
allCals = soup.findAll('a', attrs={'class':'get-nutrition primary-textcolor'})
nums = '0123456789'
def printData(charIndex):
for char in allFood[charIndex].contents:
print(char)
for char in allCals[charIndex].contents:
print(char)
def getGoals():
userCalories = int(input("Please input calorie goal for the day (kC): "))
#Display Info (Text/RsbPi)
fullList = {}
def compileFood():
foodCount = 0
for food in allFood:
print(foodCount)
for foodName in allFood[foodCount].contents:
fullList[foodName] = 0
foodCount += 1
print(foodCount)
compileFood()
print(fullList['Sour Cream'])
Any help would be great. Thanks!
Ok first why is this happening:
The reason is because the food on the index 7 is empty. Because it's empty it will never enter your for loop and therefore never increase your foodCount => it will stuck at 7 forever.
So if you would shift your index increase outside of the for loop it would work without a problem.
But you doing something crude here.
You already iterate through the food item and still use an additional variable.
You could solve it smarter this way:
def compileFood():
for food in allFood:
for foodName in food.contents:
fullList[foodName] = 0
With this you don't need to care about an additional variable at all.

Using Beautiful Soup to Find Links Before a Certain Letter

I have a BeautifulSoup problem that hopefully you can help me out with.
Currently, I have a website with a lot of links on it. The links lead to pages that contain the data of that item that is linked. If you want to check it out, it's this one: http://ogle.astrouw.edu.pl/ogle4/ews/ews.html. What I ultimately want to accomplish is to print out the links of the data that are labeled with an 'N'. It may not be apparent at first, but if you look closely on the website, some of the data have 'N' after their Star No, and others do not. Afterwards, I use that link to download a file containing the information I need on that data. The website is very convenient because the download URLs only change a bit from data to data, so I only need to change a part of the URL, as you'll see in the code below.
I currently have accomplished the data downloading part. However, this is where you come in. Currently, I need to put in the identification number of the BLG event that I desire. (This will become apparent after you view the code below.) However, the website is consistently updating over time, and having to manually search for 'N' events takes up unnecessary time. I want the Python code to be able to do it for me. My original thoughts on the subject were that I could have BeautifulSoup search through the text for all N's, but I ran into some issues on accomplishing that. I feel like I am not familiar enough with BeautifulSoup to get done what I wish to get done. Some help would be appreciated.
The code I have currently is below. I have put in a range of BLG events that have the 'N' label as an example.
#Retrieve .gz files from URLs
from urllib.request import urlopen
import urllib.request
from bs4 import BeautifulSoup
#Access website
URL = 'http://ogle.astrouw.edu.pl/ogle4/ews/ews.html'
soup = BeautifulSoup(urlopen(URL))
#Select the desired data numbers
numbers = list(range(974,998))
x=0
for i in numbers:
numbers[x] = str(i)
x += 1
print(numbers)
#Get all links and put into list
allLinks = []
for link in soup.find_all('a'):
list_links = link.get('href')
allLinks.append(list_links)
#Remove None datatypes from link list
while None in allLinks:
allLinks.remove(None)
#print(allLinks)
#Remove all links but links to data pages and gets rid of the '.html'
list_Bindices = [i for i, s in enumerate(allLinks) if 'b' in s]
print(list_Bindices)
bLinks = []
for x in list_Bindices:
bLinks.append(allLinks[x])
bLinks = [s.replace('.html', '') for s in bLinks]
#print(bLinks)
#Create a list of indices for accessing those pages
list_Nindices = []
for x in numbers:
list_Nindices.append([i for i, s in enumerate(bLinks) if x in s])
#print(type(list_Nindices))
#print(list_Nindices)
nindices_corrected = []
place = 0
while place < (len(list_Nindices)):
a = list_Nindices[place]
nindices_corrected.append(a[0])
place = place + 1
#print(nindices_corrected)
#Get the page names (without the .html) from the indices
nLinks = []
for x in nindices_corrected:
nLinks.append(bLinks[x])
#print(nLinks)
#Form the URLs for those pages
final_URLs = []
for x in nLinks:
y = "ftp://ftp.astrouw.edu.pl/ogle/ogle4/ews/2017/"+ x + "/phot.dat"
final_URLs.append(y)
#print(final_URLs)
#Retrieve the data from the URLs
z = 0
for x in final_URLs:
name = nLinks[z] + ".dat"
#print(name)
urllib.request.urlretrieve(x, name)
z += 1
#hrm = urllib.request.urlretrieve("ftp://ftp.astrouw.edu.pl/ogle/ogle4/ews/2017/blg-0974.tar.gz", "practice.gz")
This piece of code has taken me quite some time to write, as I am not a professional programmer, nor an expert in BeautifulSoup or URL manipulation in any way. In fact, I use MATLAB more than Python. As such, I tend to think in terms of MATLAB, which translates into less efficient Python code. However, efficiency is not what I am searching for in this problem. I can wait the extra five minutes for my code to finish if it means that I understand what is going on and can accomplish what I need to accomplish. Thank you for any help you can offer! I realize this is a fairly muti-faceted problem.
This should do it:
from urllib.request import urlopen
import urllib.request
from bs4 import BeautifulSoup
#Access website
URL = 'http://ogle.astrouw.edu.pl/ogle4/ews/ews.html'
soup = BeautifulSoup(urlopen(URL), 'html5lib')
Here, I'm using the html5lib to parse the url content.
Next, we'll look through the table, extracting links if the star names have a 'N' in them:
table = soup.find('table')
links = []
for tr in table.find_all('tr', {'class' : 'trow'}):
td = tr.findChildren()
if 'N' in td[4].text:
links.append('http://ogle.astrouw.edu.pl/ogle4/ews/' + td[1].a['href'])
print(links)
Output:
['http://ogle.astrouw.edu.pl/ogle4/ews/blg-0974.html', 'http://ogle.astrouw.edu.pl/ogle4/ews/blg-0975.html', 'http://ogle.astrouw.edu.pl/ogle4/ews/blg-0976.html', 'http://ogle.astrouw.edu.pl/ogle4/ews/blg-0977.html', 'http://ogle.astrouw.edu.pl/ogle4/ews/blg-0978.html',
...
]

Python script extract from HTML

I'm writing a script that scans through a set of links. Within each link the script searches a table for a row. Once found, it increments the variable total_rank which is the sum ranks found on each web page. The rank is equal to the row number.
The code looks like this and is outputting zero:
import requests
from bs4 import BeautifulSoup
import time
url_to_scrape = 'https://www.teamrankings.com/ncb/stats/'
r = requests.get(url_to_scrape)
soup = BeautifulSoup(r.text, "html.parser")
stat_links = []
for a in soup.select(".chooser-list ul"):
list_entry = a.findAll('li')
relative_link = list_entry[0].find('a')['href']
link = "https://www.teamrankings.com" + relative_link
stat_links.append(link)
total_rank = 0
for link in stat_links:
r = requests.get(link)
soup = BeautifulSoup(r.text, "html.parser")
team_rows = soup.select(".tr-table.datatable.scrollable.dataTable.no-footer table")
for row in team_rows:
if row.findAll('td')[1].text.strip() == 'Oklahoma':
rank = row.findAll('td')[0].text.strip()
total_rank = total_rank + rank
# time.sleep(1)
print total_rank
debugging team_rows is empty after the select() call thing is, I've also tried different tags. For example I've tried soup.select(".scroll-wrapper div") I've tried soup.select("#DataTables_Table_0_wrapper div") all are returning nothing
The selector
".tr-table datatable scrollable dataTable no-footer tr"
Selects a <tr> element anywhere under a <no-footer> element anywhere under a <dataTable> element....etc.
I think really "datatable scrollable dataTable no-footer" are classes on your .tr-table? So in that case, they should be joined with the first class with a period. So I believe the final correct selector is:
".tr-table.datatable.scrollable.dataTable.no-footer tr"
UPDATE: the new selector looks like this:
".tr-table.datatable.scrollable.dataTable.no-footer table"
The problem here is that the first part, .tr-table.datatable... refers to the table itself. Assuming you're trying to get the rows of this table:
<table class="tr-table datatable scrollable dataTable no-footer" id="DataTables_Table_0" role="grid">
The proper selector remains the one I originally suggested.
The #audiodude's answer is correct though the suggested selector is not working for me.
You don't need to check every single class of the table element. Here is the working selector:
team_rows = soup.select("table.datatable tr")
Also, if you need to find Oklahoma inside the table - you don't have to iterate over every row and cell in the table. Just directly search for a specific cell and get the previous containing the rank:
rank = soup.find("td", {"data-sort": "Oklahoma"}).find_previous_sibling("td").get_text()
total_rank += int(rank) # it is important to convert the row number to int
Also note that you are extracting more stats links than you should - looks like the Player Stats links should not be followed since you are focused specifically on the Team Stats. Here is one way to get Team Stats links only:
links_list = soup.find("h2", text="Team Stats").find_next_sibling("ul")
stat_links = ["https://www.teamrankings.com" + a["href"]
for a in links_list.select("ul.expand-content li a[href]")]

Categories

Resources