I'm currently writing a Python script, and part of it gets the Win Shares from the first four seasons of every player drafted into the NBA between 2005 and 2015. I've been messing around with this for almost 2 hours (getting increasingly frustrated), but I've been unable to get the Win Shares for the individual players. I'm trying to use the "Advanced" table at the following link as a test case: https://www.basketball-reference.com/players/b/bogutan01.html#advanced::none
When getting the players' names from the draft pages I had no problems, but I've tried so many iterations of the following code and have had no success in accessing the td element the stat is in.
playerSoup = BeautifulSoup(playerHtml)
playertr = playerSoup.find_all("table", id = "advanced").find("tbody").findAll("tr")
playerws = playertr.findAll("td")[21].getText()
This page uses JavaScript to add the tables, but it doesn't fetch the data from the server: all of the tables are already in the HTML, just wrapped inside comments (<!-- ... -->). (As an aside, find_all() returns a list, so chaining .find() onto it, as your first line does, raises an AttributeError.)
Using BeautifulSoup you can find all the comments, check which one contains the text "Advanced", and then parse that comment as normal HTML with BeautifulSoup.
import requests
from bs4 import BeautifulSoup
from bs4 import Comment

url = 'https://www.basketball-reference.com/players/b/bogutan01.html#advanced::none'

r = requests.get(url)
soup = BeautifulSoup(r.content, 'html.parser')

# the tables are hidden inside HTML comments, so collect every comment node
all_comments = soup.find_all(string=lambda text: isinstance(text, Comment))

for item in all_comments:
    if "Advanced" in item:
        # parse the comment's text as its own HTML document
        adv = BeautifulSoup(item, 'html.parser')
        playertr = adv.find("table", id="advanced")
        if not playertr:
            continue  # skip comments without the table - go back to `for`
        playertr = playertr.find("tbody").findAll("tr")
        playerws = adv.find_all("td")[21].getText()
        print('playertr:', playertr)
        print('playerws:', playerws)
        for row in playertr:
            if row:
                print(row.find_all('th')[0].text)
                all_td = row.find_all('td')
                print([x.text for x in all_td])
                print('--')
I am trying to scrape some specific lines and build a table from the collected data (URL attached), but I cannot get anything more precise than the entire body text, so I'm stuck.
To give an example:
I would like to arrive at the table below by scraping the details from the body content. All the details are there; any help on how to retrieve them in the form given below would be much appreciated.
My code is:
import requests
from bs4 import BeautifulSoup
# providing url
url = 'https://www.polskawliczbach.pl/wies_Baniocha'
# creating request object
req = requests.get(url)
# creating soup object
data = BeautifulSoup(req.text, 'html')
# finding all li tags in ul and printing the text within it
data1 = data.find('body')
for li in data1.find_all("li"):
    print(li.text, end=" ")
First find the ul, then find the li elements inside it. Scrape the needed data, save it in a variable, and build the table with pandas. Finally, either save the table to a CSV file or just print it.
Here's the code implementing all of the above:
from bs4 import BeautifulSoup
import requests
import pandas as pd

page = requests.get('https://www.polskawliczbach.pl/wies_Baniocha')
soup = BeautifulSoup(page.content, 'lxml')

lis = soup.find_all("ul", class_="list-group row")[1].find_all("li")[1:-1]
dic = {"name": [], "value": []}
for li in lis:
    try:
        dic["name"].append(li.find(text=True, recursive=False).strip())
        dic["value"].append(li.find("span").text.replace(" ", ""))
        print(li.find(text=True, recursive=False).strip(), li.find("span").text.replace(" ", ""))
    except:
        pass  # skip list items without a name/value pair

df = pd.DataFrame(dic)
print(df)

# If you want to save this as a file, uncomment the following line:
# df.to_csv("<FILENAME>.csv")
Additionally, if you want to scrape all of the "categories": I don't read Polish, so I don't know which fields are useful and which are not, but here's the code anyway; just change this part of the code above:
soup = BeautifulSoup(page.content, 'lxml')

dic = {"name": [], "value": []}
lis = soup.find_all("ul", class_="list-group row")
for li in lis:
    a = li.find_all("li")[1:-1]
    for b in a:
        try:
            print(b.find(text=True, recursive=False).strip(), "\t", b.find("span").text.replace(" ", "").replace(",", ""))
            dic["name"].append(b.find(text=True, recursive=False).strip())
            dic["value"].append(b.find("span").text.replace(" ", "").replace(",", ""))
        except Exception:
            pass  # skip items without the expected structure

df = pd.DataFrame(dic)
Find the main ul tag by its class and, from it, find all li tags:
main_data = data.find("ul", class_="list-group").find_all("li")[1:-1]

names = []
values = []
main_values = []
for i in main_data:
    values.append(i.find("span").get_text())
    names.append(i.find(text=True, recursive=False))
main_values.append(values)  # one row of values, matching `names` as columns
For the table representation, use the pandas module:
import pandas as pd
df=pd.DataFrame(columns=names,data=main_values)
df
Output:
   Liczba mieszkańców (2011) Kod pocztowy Numer kierunkowy
0                      1 935       05-532         (+48) 22
I have two problems. First, I get the error listed in the title, "AttributeError: 'NoneType' object has no attribute 'find_all'", whenever I run one particular line of code. Second, I also want to access one more statistic on the same website. My code is below: it gathers names from a website, trims off the excess, inserts those names into a URL, and takes two statistics. The first statistic is taken in the line that appends to skillAverageList (line 22 of my script), which is the source of the error; the second statistic is listed in HTML after my code.
import requests
from bs4 import BeautifulSoup
from pprint import pprint
import re

res = requests.get('https://plancke.io/hypixel/guild/name/GBP')
soup = BeautifulSoup(res.text, 'lxml')

memberList = []
skillAverageList = []

for i in soup.select('.playerInfo'):
    memberList.append(i.text)

memberList = [e[37:-38] for e in memberList]
members = [re.sub("[A-Z][^A-Z]+$", "", member.split(" ")[1]) for member in memberList]
print(members)

for i in range(len(memberList) + 1):
    player = memberList[i]
    skyLeaMoe = requests.get('https://sky.lea.moe/stats/' + str(player))
    skillAverageList.append(soup.find("div", {"id": "additional_stats_container"}).find_all("div", class_="additional-stat")[-2].get_text(strip=True))

pprint(skillAverageList)
Below is the second statistic that I would like to scrape from this website (in HTML). The snippet is from one specific player's page, but the code above should eventually cycle through the entire list (https://sky.lea.moe/stats/Igris/Apple).
<span class="stat-name">Total Slayer XP: </span>
<span class="stat-value">457,530</span>
I am sorry if this is a lot; I have almost no knowledge of HTML, and every attempt to learn it has been a struggle. Thanks in advance to anyone this reaches.
It seems that this site doesn't have a div with the id "additional_stats_container", so soup.find("div", {"id": "additional_stats_container"}) returns None.
Upon inspecting the HTML of this URL with a browser, I couldn't find such a div.
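As a general pattern, check the result of find() before chaining calls onto it. A minimal sketch using the id from your code (which may simply not exist on that page):
container = soup.find("div", {"id": "additional_stats_container"})
if container is None:
    print("additional_stats_container not found - skip this page")  # avoids the AttributeError
else:
    stats = container.find_all("div", class_="additional-stat")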
This script will print all names and their Total Slayer XP:
import requests
from bs4 import BeautifulSoup

url = 'https://plancke.io/hypixel/guild/name/GBP'
stats_url = 'https://sky.lea.moe/stats/'

soup = BeautifulSoup(requests.get(url).content, 'html.parser')

for a in soup.select('td a[href*="/hypixel/player/stats/"]'):
    s = BeautifulSoup(requests.get(stats_url + a['href'].split('/')[-1]).content, 'html.parser')
    total_slayer_xp = s.select_one('span:contains("Total Slayer XP") + span')
    print('{:<30} {}'.format(a.text, total_slayer_xp.text if total_slayer_xp else '-'))
Prints:
[MVP+] Igris 457,530
[VIP] Kutta 207,665
[VIP] mistercint 56,455
[MVP+] zouce 1,710,540
viellythedivelon 30
[MVP+] Louis7864 141,670
[VIP] Broadside1138 292,240
[VIP+] Babaloops 40
[VIP+] SparkleDuck9 321,290
[VIP] TooLongOfAUserNa 423,700
...etc.
As an exercise, I am supposed to use Beautiful Soup 4 to obtain course information from my school's website. I have been at this for the past few days and my code still does not work.
The first thing I ask the user for is the course catalog abbreviation. For example, ICS is the abbreviation for Information for Computer Science. Beautiful Soup 4 is then supposed to list all of the courses and how many students are enrolled in each.
While I was able to get the input portion to work, I still get errors or the program just stops.
Question: Is there a way for Beautiful Soup to accept user input so that when the user inputs ICS, the output would be a list of all courses that are related to ICS?
Here is the code and my attempt at it:
from bs4 import BeautifulSoup
import requests
import re

#get input for course
course = input('Enter the course:')

#Here is the page link
BASE_AVAILABILITY_URL = f"https://www.sis.hawaii.edu/uhdad/avail.classes?i=MAN&t=202010&s={course}"

#get request and response
page_response = requests.get(BASE_AVAILABILITY_URL)

#getting Beautiful Soup to gather the html content
page_content = BeautifulSoup(page_response.content, 'html.parser')

#getting course information
main = page_content.find_all(class_='parent clearfix')
main_p = "".join(str(x) for x in main)

#get the course anchor tags
main_q = BeautifulSoup(main_p, "html.parser")
courses = main.find('a', href=True)

#get each course name
#empty list for courses
courses_list = []
for a in courses:
    courses_list.append(a.text)

search = input('Enter the course title:')
for course in courses_list:
    if re.search(search, course, re.IGNORECASE):
        print(course)
This is the original code that was provided in a Jupyter Notebook:
import requests, bs4

BASE_AVAILABILITY_URL = f"https://www.sis.hawaii.edu/uhdad/avail.classes?i=MAN&t=202010&s={course}"

#get input for course
course = input('Enter the course:')

def scrape_availability(text):
    soup = bs4.BeautifulSoup(text)
    r = requests.get(str(BASE_AVAILABILITY_URL) + str(course))
    rows = soup.select('.listOfClasses tr')
    for row in rows[1:]:
        columns = row.select('td')
        class_name = columns[2].contents[0]
        if len(class_name) > 1 and class_name != b'\xa0':
            print(class_name)
            print(columns[4].contents[0])
            print(columns[7].contents[0])
            print(columns[8].contents[0])
What's odd is that if the user saves the HTML file, uploads it into a Jupyter Notebook, and then opens the file to be read, the courses are displayed. But for this task the user cannot save files; the script has to work from direct input alone.
The problem with your code is that page_content.find_all(class_='parent clearfix') returns an empty list []. So that's the first thing you need to change. Looking at the HTML, you'll want to be looking for <table>, <tr>, and <td> tags.
Working off what was provided in the original code, you just need to alter a few things so it flows logically. I'll point out what I changed:
import requests, bs4

BASE_AVAILABILITY_URL = f"https://www.sis.hawaii.edu/uhdad/avail.classes?i=MAN&t=202010&s={course}"

#get input for course
course = input('Enter the course:')

def scrape_availability(text):
    soup = bs4.BeautifulSoup(text) #<-- need to get the html text before creating a bs4 object. So I moved the request (line below) before this, and also adjusted the parameter of this function.
    # the rest of the code is fine
    r = requests.get(str(BASE_AVAILABILITY_URL) + str(course))
    rows = soup.select('.listOfClasses tr')
    for row in rows[1:]:
        columns = row.select('td')
        class_name = columns[2].contents[0]
        if len(class_name) > 1 and class_name != b'\xa0':
            print(class_name)
            print(columns[4].contents[0])
            print(columns[7].contents[0])
            print(columns[8].contents[0])
This will give you:
import requests, bs4

BASE_AVAILABILITY_URL = "https://www.sis.hawaii.edu/uhdad/avail.classes?i=MAN&t=202010&s="

#get input for course
course = input('Enter the course:')
url = BASE_AVAILABILITY_URL + course

def scrape_availability(url):
    r = requests.get(url)
    soup = bs4.BeautifulSoup(r.text, 'html.parser')
    rows = soup.select('.listOfClasses tr')
    for row in rows[1:]:
        columns = row.select('td')
        class_name = columns[2].contents[0]
        if len(class_name) > 1 and class_name != b'\xa0':
            print(class_name)
            print(columns[4].contents[0])
            print(columns[7].contents[0])
            print(columns[8].contents[0])

scrape_availability(url)
I wrote some code for scraping a real estate website. This is the link:
https://www.nekretnine.rs/stambeni-objekti/stanovi/lista/po-stranici/10/
From this page I can only get the location, size, and price of each apartment. Is it possible to write code that goes to each apartment's own page and scrapes values from it, since it contains much more info? Check this link:
https://www.nekretnine.rs/stambeni-objekti/stanovi/arena-bulevar-arsenija-carnojevica-97m-2-lode-energoprojekt/NkvJK0Ou5tV/
I have posted my code below. I noticed that the URL changes when I click on a specific listing. For example:
arena-bulevar-arsenija-carnojevica-97m-2-lode-energoprojekt/NkvJK0Ou5tV/
I thought about creating a for loop, but there is no way to know how the URL changes, because it has some ID at the end:
NkvJK0Ou5tV
This is the code that I have:
from bs4 import BeautifulSoup
import requests

website = "https://www.nekretnine.rs/stambeni-objekti/stanovi/lista/po-stranici/10/"

soup = requests.get(website).text
my_html = BeautifulSoup(soup, 'lxml')

lokacija = my_html.find_all('p', class_='offer-location text-truncate')
ukupna_kvadratura = my_html.find_all('p', class_='offer-price offer-price--invert')
ukupna_cena = my_html.find_all('div', class_='d-flex justify-content-between w-100')
ukupni_opis = my_html.find_all('div', class_='mt-1 mb-1 mt-lg-0 mb-lg-0 d-md-block offer-meta-info offer-adress')

for lok, kvadratura, cena_stana, sumarno in zip(lokacija, ukupna_kvadratura, ukupna_cena, ukupni_opis):
    lok = lok.text.split(',')[0]             # location
    kv = kvadratura.span.text.split(' ')[0]  # size
    jed = kvadratura.span.text.split(' ')[1] # unit of measure
    cena = cena_stana.span.text              # price
    sumarno = sumarno.text
    datum = sumarno.split('|')[0].strip()
    status = sumarno.split('|')[1].strip()
    opis = sumarno.split('|')[2].strip()
    print(lok, kv, jed, cena, datum, status, opis)
You can get the href from the div with class="placeholder-preview-box ratio-4-3"; from there you can build each listing's URL, as in the sketch below.
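A minimal sketch of that idea, assuming each such div wraps an <a> tag whose href is the listing's relative URL (worth confirming in the page markup):
from bs4 import BeautifulSoup
import requests

website = "https://www.nekretnine.rs/stambeni-objekti/stanovi/lista/po-stranici/10/"
my_html = BeautifulSoup(requests.get(website).text, 'lxml')

for box in my_html.find_all('div', class_='placeholder-preview-box ratio-4-3'):
    a = box.find('a', href=True)
    if a:
        listing_url = 'https://www.nekretnine.rs' + a['href']
        listing_html = BeautifulSoup(requests.get(listing_url).text, 'lxml')
        # scrape the extra fields of each apartment from listing_html here
        print(listing_url)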
You can iterate over the links provided by the pagination at the bottom of the page:
from bs4 import BeautifulSoup as soup
import requests

d = soup(requests.get('https://www.nekretnine.rs/stambeni-objekti/stanovi/lista/po-stranici/10/').text, 'html.parser')

def scrape_page(page):
    return [{'title': i.h2.get_text(strip=True),
             'loc': i.p.get_text(strip=True),
             'price': i.find('p', {'class': 'offer-price'}).get_text(strip=True)}
            for i in page.find_all('div', {'class': 'row offer'})]

result = [scrape_page(d)]
while d.find('a', {'class': 'pagination-arrow arrow-right'}):
    d = soup(requests.get(f'https://www.nekretnine.rs{d.find("a", {"class": "pagination-arrow arrow-right"})["href"]}').text, 'html.parser')
    result.append(scrape_page(d))
I have been scratching my head for nearly 4 days trying to find the best way to loop through a table of URLs on one website, request each URL, and scrape text from 2 different areas of the second site.
I have tried to rewrite this script multiple times, using several different approaches, but I have not been able to fully accomplish what I need.
Currently, I am able to select the first link of the table on page one, go to the new page, and select the data I need, but I can't get the code to continue looping through every link on the first page.
import requests
from bs4 import BeautifulSoup

journal_site = "https://journals.sagepub.com"
site_link = 'http://journals.sagepub.com/action/showPublications?pageSize=100&startPage='

# each page contains 100 results I need to scrape from
page_1 = '0'
page_2 = '1'
page_3 = '3'
page_4 = '4'

journal_list = site_link + page_1

r = requests.get(journal_list)
soup = BeautifulSoup(r.text, 'html.parser')

for table_row in soup.select('div.results'):
    journal_name = table_row.findAll('tr', class_='False')
    journal_link = table_row.find('a')['href']
    journal_page = journal_site + journal_link
    r = requests.get(journal_page)
    soup = BeautifulSoup(r.text, 'html.parser')
    for journal_header, journal_description in zip(soup.select('main'),
                                                   soup.select('div.journalCarouselTextText')):
        try:
            title = journal_header.h1.text.strip()
            description = journal_description.p.text.strip()
            print(title, ':', description)
        except AttributeError:
            continue
What is the best way to find the title and the description for every journal_name? Thanks in advance for the help!
Most of your code works for me; I just needed to modify the middle section of the code, leaving the parts before and after the same:
# all code same up to here
journal_list = site_link + page_1

r = requests.get(journal_list)
soup = BeautifulSoup(r.text, 'html.parser')

results = soup.find("div", {"class": "results"})
table = results.find('table')

for row in table.find_all('a', href=True):
    journal_link = row['href']
    journal_page = journal_site + journal_link
    # from here same as your code
I stopped it after it got the fourth response (title/description) of the 100 results from the first page. I'm pretty sure it will get all the expected results; it only needs to loop through the 4 subsequent pages, e.g. as sketched below.
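A minimal sketch of that outer loop, reusing the names above and assuming startPage simply increments one page at a time (verify the pagination parameters on the site):
for page in range(5):  # startPage=0 is the page already scraped above
    journal_list = site_link + str(page)
    r = requests.get(journal_list)
    soup = BeautifulSoup(r.text, 'html.parser')
    results = soup.find("div", {"class": "results"})
    if results is None:
        continue  # no results block on this page
    table = results.find('table')
    for row in table.find_all('a', href=True):
        journal_page = journal_site + row['href']
        # request journal_page and scrape title/description exactly as before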
Hope this helps.