Problem concatenating URL and scraping data - python

I am trying to append a URL in python to scrape details from the target URL.
I have the below code but it seems to be scraping the data from url1 rather than URL.
I have scraped the team names from the NFL website without any issue. The issue is with the Spotrac URL, where I am appending the team name that I scraped from the NFL site.
import requests
from bs4 import BeautifulSoup

URL = 'https://www.nfl.com/teams/'
page = requests.get(URL)
soup = BeautifulSoup(page.text, 'html.parser')

team_name = []
team_name_list = soup.find_all('h4', class_='d3-o-media-object__roofline nfl-c-custom-promo__headline')

for team in team_name_list:
    if team.find('p'):
        team_name.append(team.text)

for team in team_name:
    team = team.replace(" ", "-").lower()
    url1 = 'https://www.spotrac.com/nfl/rankings/'
    URL = url1 + str(team)
    print(URL)
    data = {
        'ajax': 'true',
        'mobile': 'false'
    }
    bs_soup = BeautifulSoup(requests.post(URL, data=data).content, 'html.parser')
    spotrac_df = pd.DataFrame(columns=['Name', 'Salary'])
    for h3 in bs_soup.select('h3'):
        spotrac_df = spotrac_df.append(pd.DataFrame({'Name': str(h3.text), 'Salary': str(h3.find_next(class_="rank-value").text)}, index=[0]), ignore_index=False)
I'm almost certain the problem is coming from the URL not appending properly. The scraping is taking the salaries etc from url1 rather than URL.
My console output (using Spyder IDE) is as below for print(URL)

The URL is appending correctly, but you have leading white space in your team names. I also made a few other changes and noted them in the code.
Lastly (and I used to do this too), creating an empty dataframe and then appending to it after each iteration isn't the best method: each append copies the whole frame, so it gets slower as the table grows. It's better to construct your rows as lists/dictionaries and, when done, call on pandas once to construct the dataframe, so I changed that as well.
import requests
from bs4 import BeautifulSoup
import pandas as pd

url = 'https://www.nfl.com/teams/'
page = requests.get(url)
soup = BeautifulSoup(page.text, 'html.parser')

team_name = []
team_name_list = soup.find_all('h4', class_='d3-o-media-object__roofline nfl-c-custom-promo__headline')

for team in team_name_list:
    if team.find('p'):
        team_name.append(team.text.strip())  #<- remove leading/trailing white space

url1 = 'https://www.spotrac.com/nfl/rankings/'  #<- since this is fixed, put it before the loop
spotrac_rows = []
for team in team_name:
    team = '-'.join(team.split()).lower()  #<- changed to split in case there's 2 spaces between city and team
    url = url1 + str(team)
    print(url)
    data = {
        'ajax': 'true',
        'mobile': 'false'
    }
    bs_soup = BeautifulSoup(requests.post(url, data=data).content, 'html.parser')
    for h3 in bs_soup.select('h3'):
        spotrac_rows.append({'Name': str(h3.text), 'Salary': str(h3.find_next(class_="rank-value").text.strip())})  #<- remove white space from the salary

spotrac_df = pd.DataFrame(spotrac_rows)
Output:
print(spotrac_df)
Name Salary
0 Chandler Jones $21,333,333
1 Patrick Peterson $13,184,588
2 D.J. Humphries $12,800,000
3 DeAndre Hopkins $12,500,000
4 Larry Fitzgerald $11,750,000
5 Jordan Hicks $10,500,000
6 Justin Pugh $10,500,000
7 Kenyan Drake $8,483,000
8 Kyler Murray $8,080,601
9 Robert Alford $7,500,000
10 J.R. Sweezy $6,500,000
11 Corey Peters $4,437,500
12 Haason Reddick $4,288,444
13 Jordan Phillips $4,000,000
14 Isaiah Simmons $3,757,101
15 Maxx Williams $3,400,000
16 Zane Gonzalez $3,259,000
17 Devon Kennard $2,500,000
18 Budda Baker $2,173,184
19 De'Vondre Campbell $2,000,000
20 Andy Lee $2,000,000
21 Byron Murphy $1,815,795
22 Christian Kirk $1,607,691
23 Aaron Brewer $1,168,750
24 Max Garcia $1,143,125
25 Andy Isabella $1,052,244
26 Mason Cole $977,629
27 Zach Allen $975,855
28 Chris Banjo $887,500
29 Jonathan Bullard $887,500
... ...
2530 Khari Blasingame $675,000
2531 Kenneth Durden $675,000
2532 Cody Hollister $675,000
2533 Joey Ivie $675,000
2534 Greg Joseph $675,000
2535 Kareem Orr $675,000
2536 David Quessenberry $675,000
2537 Derick Roberson $675,000
2538 Shaun Wilson $675,000
2539 Cole McDonald $635,421
2540 Chris Jackson $629,570
2541 Kobe Smith $614,333
2542 Aaron Brewer $613,333
2543 Cale Garrett $613,333
2544 Tommy Hudson $613,333
2545 Kristian Wilkerson $613,333
2546 Khaylan Kearse-Thomas $612,500
2547 Nick Westbrook $612,333
2548 Kyle Williams $611,833
2549 Mason Kinsey $611,666
2550 Tucker McCann $611,666
2551 Cameron Scarlett $611,666
2552 Teair Tart $611,666
2553 Brandon Kemp $611,333
2554 Wyatt Ray $610,000
2555 Josh Smith $610,000
2556 Logan Woodside $610,000
2557 Rashard Davis $610,000
2558 Avery Gennesy $610,000
2559 Parker Hesse $610,000
[2560 rows x 2 columns]


Returning only last item and splitting into columns

I'm having a couple of issues - I seem to be returning only the last item on this list. Can someone help me here please? I also want to split the df into columns, with all of the postcodes filtered into one column. Not sure where to start with this. Help much appreciated. Many thanks in advance!
import requests
from bs4 import BeautifulSoup
import pandas as pd

URL = "https://www.matki.co.uk/matki-dealers/"
page = requests.get(URL)
soup = BeautifulSoup(page.content, "html.parser")
results = soup.find(class_="dealer-overview")
company_elements = results.find_all("article")

for company_element in company_elements:
    company_info = company_element.getText(separator=u', ').replace('Find out more »', '')
    print(company_info)
    data = {company_info}

df = pd.DataFrame(data)
df.shape
df
IIUC, you need to replace the loop with:
df = pd.DataFrame({'info': [e.getText(separator=u', ')
                             .replace('Find out more »', '')
                            for e in company_elements]})
output:
info
0 ESP Bathrooms & Interiors, Queens Retail Park,...
1 Paul Scarr & Son Ltd, Supreme Centre, Haws Hil...
2 Stonebridge Interiors, 19 Main Street, Pontela...
3 Bathe Distinctive Bathrooms, 55 Pottery Road, ...
4 Draw A Bath Ltd, 68 Telegraph Road, Heswall, W...
.. ...
346 Warren Keys, Unit B Carrs Lane, Tromode, Dougl...
347 Haldane Fisher, Isle of Man Business Park, Coo...
348 David Scott (Agencies) Ltd, Supreme Centre, 11...
349 Ballycastle Homecare Ltd, 2 The Diamond, Bally...
350 Beggs & Partners, Great Patrick Street, Belfas...
[351 rows x 1 columns]
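For the second part of the question (getting the postcodes into their own column), a minimal sketch using a regex on each combined info string; the pattern is only an approximation of UK postcode formats, so treat it as a starting point:
# Pull a UK-style postcode (e.g. "BS1 4DJ") out of each info string into its own column.
# The pattern is an approximation; refine it against the real data.
df['postcode'] = df['info'].str.extract(r'([A-Z]{1,2}\d[A-Z\d]?\s*\d[A-Z]{2})', expand=False)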

python combining 2 re.findall strings in columns and rows in csv

#!/usr/bin/env python
import re
import requests
from bs4 import BeautifulSoup
import csv

page = requests.get('https://salesweb.civilview.com/Sales/SalesSearch?countyId=32')
soup = BeautifulSoup(page.text, 'html.parser')
list_ = soup.find(class_='table-striped')
list_items = list_.find_all('tr')
content = list_items

d = re.findall(r"<td>\d*/\d*/\d+</td>", str(content))
#d = re.findall(r"<td>\d*/\d*/\d+</td>|<td>\d*?\s.+\d+</td>", str(content))
a = re.findall(r"<td>\d*?\s.+\d+?.*</td>", str(content))
res = d + a

for tup in res:
    tup = re.sub("<td>", '', str(tup))
    tup = re.sub("</td>", '', str(tup))
    print(tup)
I'm getting sale dates then addresses when just printing to screen. I have tried several things to get to CSV, but I end up with all the data in 1 column or 1 row. I would like just two columns, sale dates and addresses, with all returned rows.
This is what I get just using print()
8/25/2021
9/1/2021
9/1/2021
9/1/2021
9/1/2021
9/1/2021
9/8/2021
9/8/2021
9/8/2021
9/8/2021
9/15/2021
9/15/2021
9/15/2021
9/15/2021
9/15/2021
9/15/2021
9/22/2021
9/29/2021
9/29/2021
9/29/2021
11/17/2021
4/30/3021
40 PAVILICA ROAD STOCKTON NJ 08559
129 KINGWOOD LOCKTOWN ROAD FRENCHTOWN NJ 08825
63 PHLOX COURT WHITEHOUSE STATION NJ 08889
41 WESTCHESTER TERRACE UNIT 11 CLINTON NJ 08809
461 LITTLE YORK MOUNT PLEASANT ROAD MILFORD NJ 08848
9 MAPLE AVENUE FRENCHTOWN NJ 08825
95 BARTON HOLLOW ROAD FLEMINGTON NJ 08822
27 WORMAN ROAD STOCKTON NJ 08559
30 COLD SPRINGS ROAD CALIFON NJ 07830
211 OLD CROTON ROAD FLEMINGTON NJ 08822
3 BRIAR LANE FLEMINGTON NJ 08822(VACANT)
61 N. FRANKLIN STREET LAMBERTVILLE NJ 08530
802 SPRUCE HILLS DRIVE GLEN GARDNER NJ 08826
2155 STATE ROUTE 31 GLEN GARDNER NJ 08826
80 SCHAAF ROAD BLOOMSBURY NJ 08804
9 CAMBRIDGE DRIVE MILFORD NJ 08848
5 VAN FLEET ROAD NESHANIC STATION NJ 08853
34 WASHINGTON STREET ANNANDALE NJ 08801
229 MILFORD MT PLEASANT ROAD MILFORD NJ 08848
1608 COUNTY ROAD 519 FRENCHTOWN NJ 08825
29 OLD SCHOOLHOUSE ROAD ASBURY NJ 08802
28 ROSE RUN LAMBERTVILLE NJ 08530
Any help would be great. I have been playing with this all day and can't seem to get it right no matter what I try.
My two cents:
#!/usr/bin/env python
import re
import requests
from bs4 import BeautifulSoup
import csv

separator = ','

page = requests.get('https://salesweb.civilview.com/Sales/SalesSearch?countyId=32')
soup = BeautifulSoup(page.text, 'html.parser')
list_ = soup.find(class_='table-striped')
list_items = list_.find_all('tr')
content = list_items

d = re.findall(r"<td>\d*/\d*/\d+</td>", str(content))
a = re.findall(r"<td>\d*?\s.+\d+?.*</td>", str(content))

for date, address in zip(d, a):
    print(re.sub("</td>|<td>", '', str(date)),
          separator,
          re.sub("</td>|<td>", '', str(address)))
Output, date and address are now in one row:
8/25/2021 , 40 PAVILICA ROAD STOCKTON NJ 08559
9/1/2021 , 129 KINGWOOD LOCKTOWN ROAD FRENCHTOWN NJ 08825
9/1/2021 , 63 PHLOX COURT WHITEHOUSE STATION NJ 08889
9/1/2021 , 41 WESTCHESTER TERRACE UNIT 11 CLINTON NJ 08809
9/1/2021 , 461 LITTLE YORK MOUNT PLEASANT ROAD MILFORD NJ 08848
9/1/2021 , 9 MAPLE AVENUE FRENCHTOWN NJ 08825
9/8/2021 , 95 BARTON HOLLOW ROAD FLEMINGTON NJ 08822
9/8/2021 , 27 WORMAN ROAD STOCKTON NJ 08559
9/8/2021 , 30 COLD SPRINGS ROAD CALIFON NJ 07830
9/8/2021 , 211 OLD CROTON ROAD FLEMINGTON NJ 08822
9/15/2021 , 3 BRIAR LANE FLEMINGTON NJ 08822(VACANT)
9/15/2021 , 61 N. FRANKLIN STREET LAMBERTVILLE NJ 08530
9/15/2021 , 802 SPRUCE HILLS DRIVE GLEN GARDNER NJ 08826
9/15/2021 , 2155 STATE ROUTE 31 GLEN GARDNER NJ 08826
9/15/2021 , 80 SCHAAF ROAD BLOOMSBURY NJ 08804
9/15/2021 , 9 CAMBRIDGE DRIVE MILFORD NJ 08848
9/22/2021 , 5 VAN FLEET ROAD NESHANIC STATION NJ 08853
9/29/2021 , 34 WASHINGTON STREET ANNANDALE NJ 08801
9/29/2021 , 229 MILFORD MT PLEASANT ROAD MILFORD NJ 08848
9/29/2021 , 1608 COUNTY ROAD 519 FRENCHTOWN NJ 08825
11/17/2021 , 29 OLD SCHOOLHOUSE ROAD ASBURY NJ 08802
4/30/3021 , 28 ROSE RUN LAMBERTVILLE NJ 08530
Extra, to export to CSV using pandas:
import pandas as pd

date_list = []
address_list = []

for date, address in zip(d, a):
    date_list.append(re.sub("</td>|<td>", '', str(date)))
    address_list.append(re.sub("</td>|<td>", '', str(address)))

df = pd.DataFrame([date_list, address_list]).T
df.columns = ['Date', 'Address']
df.to_csv('data.csv')
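Since csv is already imported, the standard-library writer is another option; a minimal sketch:
with open('data.csv', 'w', newline='') as f:
    writer = csv.writer(f)
    writer.writerow(['Date', 'Address'])  # header row
    for date, address in zip(d, a):
        writer.writerow([re.sub("</td>|<td>", '', str(date)),
                         re.sub("</td>|<td>", '', str(address))])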
It seems to me that instead of using two regular expressions you should rather use one with named groups. I leave it to you to try.
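For example, a sketch of that idea, assuming each row lists the sale-date cell before the address cell (any cells in between are skipped non-greedily; adjust the pattern to the real row layout):
row_pattern = re.compile(
    r"<td>(?P<date>\d+/\d+/\d+)</td>"   # sale date cell
    r".*?"                              # skip any cells in between (non-greedy)
    r"<td>(?P<address>\d[^<]*)</td>",   # address cell, assumed to start with a house number
    re.DOTALL)
for m in row_pattern.finditer(str(content)):
    print(m.group('date'), ',', m.group('address'))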
Given that you have two corresponding lists of values, the simplest way would be instead of concatenating:
res = d+a
just going through pairs of them:
for tup, tup2 in zip(d, a):
    tup = re.sub("<td>", '', str(tup))
    tup = re.sub("</td>", '', str(tup))
    tup2 = re.sub("<td>", '', str(tup2))
    tup2 = re.sub("</td>", '', str(tup2))
    print(tup, tup2)
#!/usr/bin/env python
import re
import requests
from bs4 import BeautifulSoup
import csv
import pandas as pd  # needed for the dataframe below

page = requests.get('https://salesweb.civilview.com/Sales/SalesSearch?countyId=32')
soup = BeautifulSoup(page.text, 'html.parser')
list_ = soup.find(class_='table-striped')
list_items = list_.find_all('tr')
content = list_items

d = re.findall(r"<td>\d*/\d*/\d+</td>", str(content))  # this is a list
#d = re.findall(r"<td>\d*/\d*/\d+</td>|<td>\d*?\s.+\d+</td>", str(content))
a = re.findall(r"<td>\d*?\s.+\d+?.*</td>", str(content))  # this is a list

## create a dataframe from the two lists and remove tags
df = pd.DataFrame(list(zip(d, a)), columns=['sales_date', 'address'])
for cols in df.columns:
    df[cols] = df[cols].map(lambda x: x.lstrip('<td>').rstrip('</td>'))
df.to_csv("result.csv")

How to get all data using beautifulsoup?

I am trying to scrape all the addresses for "Recent Sales" in this page:
https://www.compass.com/agents/irene-vuong/
My current code looks like:
url = 'https://www.compass.com/agents/irene-vuong/'
url = requests.get(url)
soup = BeautifulSoup(url.text, 'html')

for item in soup.findAll('div', attrs={'class': 'uc-listingCard-content'}):
    new = item.find('a', attrs={'class': 'uc-listingCard-title'})
    print(new.text)
My output is :
256-258 Wyckoff Street
1320 Glenwood Road
1473 East 55th Street
145 Winter Avenue
25-02 Brookhaven Avenue
which is the addresses of "current" listings.
My expected output is:
352 94th Street
1754 West 12th Street
2283 E 23rd st
2063 Brown Street
3423 Avenue U
2256 Stuart Street
Which are the addresses under "Recent Sales". No matter what, I only get current listing addresses, but not all listing addresses. I tried to use re.compile(r'Recent Sales') but it would not work. I'm not sure how to get to "Recent Sales".
Any help will be greatly appreciated.
+++++
I also tried to use text 'Recent Sales' as below:
for item in soup.findAll(text=re.compile(r'Recent Sales')).findNext():
    for i in item.find('div', attrs={'class':'profile-acive-listings'}):
        new = i.find('a', attrs={'class': 'uc-listingCard-title'})
        print(new.text)
But I get an error of:
AttributeError: ResultSet object has no attribute 'findNext'. You're probably treating a list of items like a single item. Did you call find_all() when you meant to call find()?
+++ Also tried to use the data-tn attribute with value recent-sales:
for item in soup.findAll('div', attrs={'data-tn':'recent-sales'}):
    new = item.findAll('a', attrs={'class': 'uc-listingCard-title'})
    print(new.text)
But it won't return anything.
You can use Selenium. It renders your page in an automated browser. From the rendered page you can then get the full HTML and retrieve your listings.
Try this:
from selenium import webdriver
from bs4 import BeautifulSoup

browser = webdriver.Firefox()
browser.get("https://www.compass.com/agents/irene-vuong/")
html = browser.page_source
soup = BeautifulSoup(html, 'html')

for item in soup.findAll('div', attrs={'class': 'uc-listingCard-content'}):
    new = item.find('a', attrs={'class': 'uc-listingCard-title'})
    print(new.text)
This prints out:
256-258 Wyckoff Street
1320 Glenwood Road
1473 East 55th Street
145 Winter Avenue
25-02 Brookhaven Avenue
352 94th Street
1754 West 12th Street
2283 E 23rd St
2063 Brown Street
3423 Avenue U
2256 Stuart Street
East 61st Street
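As a side note, if you don't want a browser window popping up while this runs, Firefox can be driven headless; a sketch, assuming a reasonably recent Selenium and geckodriver:
from selenium import webdriver
from selenium.webdriver.firefox.options import Options

options = Options()
options.headless = True  # render the page without opening a window
browser = webdriver.Firefox(options=options)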
Edit:
If you want to parse the data from the raw HTML, you have to get it from a script tag.
Try this:
import json
from bs4 import BeautifulSoup
import requests
import pandas as pd
url = 'https://www.compass.com/agents/irene-vuong/'
res = requests.get(url)
soup = BeautifulSoup(res.content, 'html')
script = soup.find_all("script")[4]
data = json.loads(script.text.split("window.__AGENT_PROFILE__ = ")[1])
data = data["data"]
df_sales = pd.DataFrame(data["closedDeals"]["sales"])
df_rentals = pd.DataFrame(data["closedDeals"]["rentals"])
This gives you Pandas dataframes with all the listing data like this.
listingIdSHA listingType location size price detailedInfo media dealInfo isOffMLS pageLink pageLinkSlug canonicalPageLink userListingCompliance
0 210837948508195937 2 {'prettyAddress': '352 94th Street', 'city': '... {'bedrooms': 4, 'bathrooms': 2.75} {'lastKnown': 1250000, 'formatted': '$1,250,000'} {'amenities': ['Driveway', 'Open Kitchen', 'Ga... [{'category': 0, 'thumbnailUrl': 'https://d278... {'disclaimer': 'No guarantee, warranty or repr... False /listing/352-94th-street-brooklyn-ny-11209/210... 352-94th-street-brooklyn-ny-11209 /listing/352-94th-street-brooklyn-ny-11209/210... {'descriptionCompliance': 0}
1 122690464561282785 2 {'prettyAddress': '1754 West 12th Street', 'ci... {'bedrooms': 4, 'bathrooms': 2} {'lastKnown': 1040000, 'formatted': '$1,040,000'} {'amenities': ['Basement', 'Private Outdoor Sp... [{'category': 0, 'thumbnailUrl': 'https://d278... {'disclaimer': 'No guarantee, warranty or repr... False /listing/1754-west-12th-street-brooklyn-ny-112... 1754-west-12th-street-brooklyn-ny-11223 /listing/1754-west-12th-street-brooklyn-ny-112... {'descriptionCompliance': 0}
2 NaN 2 {'prettyAddress': '2283 E 23rd St', 'neighborh... {'bedrooms': 3, 'bathrooms': 2} {'lastKnown': 800000, 'formatted': '$800,000'} NaN [{'category': 0, 'thumbnailUrl': 'https://d278... {'disclaimer': 'No guarantee, warranty or repr... False NaN 2283-e-23rd-st NaN NaN
3 235974146369023201 2 {'prettyAddress': '2063 Brown Street', 'city':... {'bedrooms': 3, 'bathrooms': 2} {'lastKnown': 755000, 'formatted': '$755,000'} NaN [{'category': 0, 'thumbnailUrl': 'https://d278... {'disclaimer': 'No guarantee, warranty or repr... False /listing/2063-brown-street-brooklyn-ny-11229/2... 2063-brown-street-brooklyn-ny-11229 /listing/2063-brown-street-brooklyn-ny-11229/2... {'descriptionCompliance': 0}
4 186865317970981409 2 {'prettyAddress': '3423 Avenue U', 'city': 'Br... {'bedrooms': 5, 'bathrooms': 2} {'lastKnown': 627000, 'formatted': '$627,000'} {'amenities': ['Hardwood Floors', 'Garage', 'C... [{'category': 0, 'thumbnailUrl': 'https://d278... {'disclaimer': 'No guarantee, warranty or repr... False /listing/3423-avenue-u-brooklyn-ny-11234/18686... 3423-avenue-u-brooklyn-ny-11234 /listing/3423-avenue-u-brooklyn-ny-11234/18686... {'descriptionCompliance': 0}
5 286987776170131617 2 {'prettyAddress': '2256 Stuart Street', 'city'... {'bedrooms': 3, 'bathrooms': 1} {'lastKnown': 533000, 'formatted': '$533,000'} NaN [{'category': 0, 'thumbnailUrl': 'https://d278... {'disclaimer': 'No guarantee, warranty or repr... False /listing/2256-stuart-street-brooklyn-ny-11229/... 2256-stuart-street-brooklyn-ny-11229 /listing/2256-stuart-street-brooklyn-ny-11229/...
To retrieve just the listing addresses, use this further step:
from pandas import json_normalize
df_sales = df_sales.location.apply(lambda x: dict(x))
df_sales = json_normalize(df_sales)
df_rentals = df_rentals.location.apply(lambda x: dict(x))
df_rentals = json_normalize(df_rentals)
Output:
prettyAddress city state zipCode geoId neighborhood subNeighborhoods
0 352 94th Street Brooklyn NY 11209 nyc NaN NaN
1 1754 West 12th Street Brooklyn NY 11223 nyc NaN NaN
2 2283 E 23rd St NaN NaN NaN nyc Sheepshead Bay [Sheepshead Bay]
3 2063 Brown Street Brooklyn NY 11229 nyc NaN NaN
4 3423 Avenue U Brooklyn NY 11234 nyc NaN NaN
5 2256 Stuart Street Brooklyn NY 11229 nyc NaN NaN
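A side note on the snippet above: soup.find_all("script")[4] is fragile, because the script's position shifts whenever the page layout changes. A sturdier lookup keys on the marker string instead; a sketch, assuming the page still embeds window.__AGENT_PROFILE__:
# Find the script by its content rather than by position.
script = soup.find('script', string=lambda s: s and 'window.__AGENT_PROFILE__' in s)
data = json.loads(script.string.split('window.__AGENT_PROFILE__ = ')[1])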
Edit:
You can get cleaner data like so:
df_sales = pd.DataFrame(data["closedDeals"]["sales"])
columns = ['listingIdSHA', 'listingType', 'location', 'size', 'price']
df_sales = df_sales[columns]

expanded_data = []
for column in ['location', 'size', 'price']:
    expanded = df_sales[column].apply(lambda x: dict(x))
    expanded_data.append(json_normalize(expanded))

expanded_data = pd.concat(expanded_data, axis=1)
df_sales_cleaned = pd.concat([df_sales[['listingIdSHA', 'listingType']], expanded_data], axis=1)
display(df_sales_cleaned)
Output:
listingIdSHA listingType prettyAddress city state zipCode geoId neighborhood subNeighborhoods bedrooms bathrooms lastKnown formatted
0 210837948508195937 2 352 94th Street Brooklyn NY 11209 nyc NaN NaN 4 2.75 1250000 $1,250,000
1 122690464561282785 2 1754 West 12th Street Brooklyn NY 11223 nyc NaN NaN 4 2.00 1040000 $1,040,000
2 NaN 2 2283 E 23rd St NaN NaN NaN nyc Sheepshead Bay [Sheepshead Bay] 3 2.00 800000 $800,000
3 235974146369023201 2 2063 Brown Street Brooklyn NY 11229 nyc NaN NaN 3 2.00 755000 $755,000
4 186865317970981409 2 3423 Avenue U Brooklyn NY 11234 nyc NaN NaN 5 2.00 627000 $627,000
5 286987776170131617 2 2256 Stuart Street Brooklyn NY 11229 nyc NaN NaN 3 1.00 533000 $533,000
I recently had a project where I'm using this too. Note that findAll(text=...) returns the matched text nodes themselves, so you have to navigate from each one back up to its containing section; try the code like this:
for item in soup.findAll(text=re.compile(r'Recent Sales')):
    # climb from the matched text node to its section, then pull the listing titles
    section = item.find_parent('div', {'class': 'profile-acive-listings'})
    if section:
        for link in section.find_all('a', {'class': 'uc-listingCard-title'}):
            print(link.text)

Scrape Table Data on Multiple Pages from Multiple URLs (Python & BeautifulSoup)

New coder here! I am trying to scrape web table data from multiple URLs. Each URL web-page has 1 table, however that table is split among multiple pages. My code only iterates through the table pages of the first URL and not the rest. So.. I am able to get pages 1-5 of NBA data for year 2000 only, but it stops there. How do I get my code to pull every year of data? Any help is greatly appreciated.
page = 1
year = 2000

while page < 20 and year < 2020:
    base_URL = 'http://www.espn.com/nba/salaries/_/year/{}/page/{}'.format(year, page)
    response = requests.get(base_URL, headers)
    if response.status_code == 200:
        soup = BeautifulSoup(response.content, 'html.parser')
        sal_table = soup.find_all('table', class_='tablehead')
        if len(sal_table) < 2:
            sal_table = sal_table[0]
            with open('NBA_Salary_2000_2019.txt', 'a') as r:
                for row in sal_table.find_all('tr'):
                    for cell in row.find_all('td'):
                        r.write(cell.text.ljust(30))
                    r.write('\n')
            page += 1
        else:
            print("too many tables")
    else:
        year += 1
        page = 1
I'd consider using pandas here as 1) its .read_html() function (which uses beautifulsoup under the hood) makes it easier to parse <table> tags, and 2) it can then easily write straight to file.
Also, it's a waste to iterate through 20 pages (for example, the first season you are after only has 4 pages; the rest are blank). So I'd consider adding something that says once it reaches a blank table, move on to the next season.
import pandas as pd
import requests

headers = {'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3987.132 Safari/537.36'}

results = pd.DataFrame()
year = 2000
while year < 2020:
    goToNextPage = True
    page = 1
    while goToNextPage == True:
        base_URL = 'http://www.espn.com/nba/salaries/_/year/{}/page/{}'.format(year, page)
        response = requests.get(base_URL, headers=headers)  # headers must be passed as a keyword argument
        if response.status_code == 200:
            temp_df = pd.read_html(base_URL)[0]
            temp_df.columns = list(temp_df.iloc[0,:])
            temp_df = temp_df[temp_df['RK'] != 'RK']
            if len(temp_df) == 0:
                goToNextPage = False
                year += 1
                continue
            print ('Acquiring Season: %s\tPage: %s' %(year, page))
            temp_df['Season'] = '%s-%s' %(year-1, year)
            results = results.append(temp_df, sort=False).reset_index(drop=True)
            page += 1

results.to_csv('c:/test/NBA_Salary_2000_2019.csv', index=False)
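A small design note: pd.read_html(base_URL) downloads each page a second time even though requests has already fetched it. Passing the body of the response you already have avoids the duplicate request:
temp_df = pd.read_html(response.text)[0]  # parse the page that was already fetched above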
Output:
print (results.head(25).to_string())
RK NAME TEAM SALARY Season
0 1 Shaquille O'Neal, C Los Angeles Lakers $17,142,000 1999-2000
1 2 Kevin Garnett, PF Minnesota Timberwolves $16,806,000 1999-2000
2 3 Alonzo Mourning, C Miami Heat $15,004,000 1999-2000
3 4 Juwan Howard, PF Washington Wizards $15,000,000 1999-2000
4 5 Scottie Pippen, SF Portland Trail Blazers $14,795,000 1999-2000
5 6 Karl Malone, PF Utah Jazz $14,000,000 1999-2000
6 7 Larry Johnson, F New York Knicks $11,910,000 1999-2000
7 8 Gary Payton, PG Seattle SuperSonics $11,020,000 1999-2000
8 9 Rasheed Wallace, PF Portland Trail Blazers $10,800,000 1999-2000
9 10 Shawn Kemp, C Cleveland Cavaliers $10,780,000 1999-2000
10 11 Damon Stoudamire, PG Portland Trail Blazers $10,125,000 1999-2000
11 12 Antonio McDyess, PF Denver Nuggets $9,900,000 1999-2000
12 13 Antoine Walker, PF Boston Celtics $9,000,000 1999-2000
13 14 Shareef Abdur-Rahim, PF Vancouver Grizzlies $9,000,000 1999-2000
14 15 Allen Iverson, SG Philadelphia 76ers $9,000,000 1999-2000
15 16 Vin Baker, PF Seattle SuperSonics $9,000,000 1999-2000
16 17 Ray Allen, SG Milwaukee Bucks $9,000,000 1999-2000
17 18 Anfernee Hardaway, SF Phoenix Suns $9,000,000 1999-2000
18 19 Kobe Bryant, SF Los Angeles Lakers $9,000,000 1999-2000
19 20 Stephon Marbury, PG New Jersey Nets $9,000,000 1999-2000
20 21 Vlade Divac, C Sacramento Kings $8,837,000 1999-2000
21 22 Bryant Reeves, C Vancouver Grizzlies $8,666,000 1999-2000
22 23 Tom Gugliotta, PF Phoenix Suns $8,558,000 1999-2000
23 24 Nick Van Exel, PG Denver Nuggets $8,354,000 1999-2000
24 25 Elden Campbell, C Charlotte Hornets $7,975,000 1999-2000
...

Need efficient Beautifulsoup webscrape with nested divs & spans into pandas dataframe python

I'm trying to scrape the results of a sports tournament into a pandas dataframe where each row is a different fighter's name.
Here is my code:
import re
import requests
from bs4 import BeautifulSoup
import pandas as pd

page = requests.get("http://www.bjjcompsystem.com/tournaments/1221/categories/1532871")
soup = BeautifulSoup(page.content, 'lxml')

body = list(soup.children)[1]
alldivs = list(body.children)[3]
sections = list(alldivs.children)[5]
division = list(sections.children)[1]
div_name = division.get_text().replace('\n','')
bracket = list(sections.children)[3]

data = []
div_name = division.get_text().replace('\n','')
bracket = list(sections.children)[3]

for i in bracket:
    bracket_title = [bt.get_text() for bt in bracket.select(".bracket-title")]
    location = [l.get_text() for l in bracket.select(".bracket-match-header__where")]
    time = [t.get_text() for t in bracket.select(".bracket-match-header__when")]
    fighter_rank = [fr.get_text() for fr in bracket.select(".match-card__competitor-n")]
    competitor_desc = [cd.get_text() for cd in bracket.select(".match-card__competitor-description")]
    loser_name = [ln.get_text() for ln in bracket.select(".match-competitor--loser")]
    data.append((div_name, bracket_title, location, time, fighter_rank, competitor_desc, loser_name))

df = pd.DataFrame(pd.DataFrame(data, columns=['Division','Bracket','Location','Time','Rank','Fighter','Loser']))
df
However, this results in each cell by row containing a list. I modified it to the following code:
import pandas as pd

data = []
div_name = division.get_text().replace('\n','')
bracket2 = soup.find_all('div', class_='tournament-category__brackets')

for i in bracket2:
    bracketNo = i.find_all('div', class_='bracket-title')
    section = i.find_all('div', class_='tournament-category__bracket tournament-category__bracket-15')
    for a in section:
        cats = a.find_all('div', class_='tournament-category__match')
        for j in cats:
            fight = j.find_all('div', class_='bracket-match-header')
            for k in fight:
                where = k.find('div', class_='bracket-match-header__where').get_text().replace('\n',' ')
                when = k.find('div', class_='bracket-match-header__when').get_text().replace('\n',' ')
            match = j.find_all('div', class_='match-card match-card--yellow')
            for b in match:
                rank = b.find_all('span', class_='match-card__competitor-n')
                fighter = b.find_all('div', class_='match-card__competitor-name')
                gym = b.find_all('div', class_='match-card__club-name')
                loser = b.find_all('span', class_='match-competitor--loser')
                data.append((div_name, bracketNo, when, where, rank, fighter, gym, loser,))

df1 = pd.DataFrame(pd.DataFrame(data, columns=['Division','Bracket','Time','Location','Rank','Fighter','Gym','Loser']))
df1
There is only 1 division, so this will be the same in every row. There are 5 bracket categories (1/4,2/4,3/4,4/4,finals). I want the corresponding time/location for each bracket. Each rank, fighter, and gym have two in each cell and I want this to be one per row. The sections in the dataframe are of different lengths, so that is causing some issues.
Ideally I want the dataframe to look like the following:
Division Bracket Time Location Rank Fighter Gym Loser
Master 1 Male BLACK Middle Bracket 1/4 Wed 08/21 at 10:08 AM FIGHT 1: Mat 5 16 Jeffery Bynum Hammon Caique Jiu-Jitsu None
Master 1 Male BLACK Middle Bracket 1/4 Wed 08/21 at 10:08 AM FIGHT 1: Mat 5 53 Fábio Junior Batista da Evolve MMA Fábio Junior Batista da Evolve MMA
Master 1 Male BLACK Middle Bracket 2/4 Wed 08/21 at 10:07 AM FIGHT 1: Mat 6 14 André Felipe Maciel Fre Carlson Gracie None
Master 1 Male BLACK Middle Bracket 2/4 Wed 08/21 at 10:07 AM FIGHT 1: Mat 6 50 Jerardo Linares Cleber Jiu Jitsu Jerardo Linares Cleber Jiu Jitsu
Any advice would be extremely helpful. I tried to create nested loops and follow the structure, but the HTML tree was rather complicated for me. The least amount of formatting in the df is ideal as I will later loop this over multiple pages. Thanks in advance!
EDIT: Next step - looping this program over multiple pages:
pages = [ #sample, no brackets
    'http://www.bjjcompsystem.com/tournaments/1221/categories/1533466', #example of category__bracket-1
    'http://www.bjjcompsystem.com/tournaments/1221/categories/1533387', #example of category__bracket-3
    'http://www.bjjcompsystem.com/tournaments/1221/categories/1533372', #example of category__bracket-7
    'http://www.bjjcompsystem.com/tournaments/1221/categories/1533022', #example of category__bracket-15
    'http://www.bjjcompsystem.com/tournaments/1221/categories/1532847',
    'http://www.bjjcompsystem.com/tournaments/1221/categories/1532871', #example of category__bracket-15 plus finals
    'http://www.bjjcompsystem.com/tournaments/1221/categories/1532889', #example of bracket with two losers in a match, so throws an error in fight 32 on fighter a name
    'http://www.bjjcompsystem.com/tournaments/1221/categories/1532856', #example of no winner on fight 11 so throws error on fighter b name
]
First, I define the multiple links. This is a subset of 411 different divisions.
results = pd.DataFrame()
for page in pages:
    response = requests.get(page)
    soup = BeautifulSoup(response.text, 'html.parser')
    division = soup.find('span', {'class':'category-title__label category-title__age-division'}).text.strip()
    label = soup.find('i', {'class':'fa fa-mars'}).parent.text.strip()
    belt = soup.find('i', {'class':'fa fa-belt'}).parent.text.strip()
    weight = soup.find('i', {'class':'fa fa-weight'}).parent.text.strip()
    # PARSE BRACKETS
    brackets = soup.find_all(['div', {'class':'tournament-category__bracket tournament-category__bracket-15'},
                              'div', {'class':'tournament-category__bracket tournament-category__bracket-1'},
                              'div', {'class':'tournament-category__bracket tournament-category__bracket-3'},
                              'div', {'class':'tournament-category__bracket tournament-category__bracket-7'}])
    #results = pd.DataFrame()
    for bracket in brackets:
        ...etc
Is there a way to account for different division sizes in the program? The example at the top uses 4 brackets plus finals and 15-match brackets. Other divisions have just 1 match, or 3, or 7, or a single 15-match bracket rather than multiple brackets. Without segmenting out all the links by size and re-writing the program, I'm wondering if there is an if/then statement I can add, or a try/except?
This was tricky, as some of the attributes included the loser of the match and, for some reason, others didn't. So I had to figure out a way to fill in those missing nulls.
But nonetheless I think I managed to fill it all in correctly. I just iterated through each match of each bracket, then appended them all into one table. To fill in the missing 'Loser' column, I sorted by fight number, looked at the rows with a missing "Loser", and checked which fighter fought in a later match. Obviously, if a fighter had another match later, then his opponent was the loser.
Code:
import requests
from bs4 import BeautifulSoup
import pandas as pd
import natsort as ns

pages = [ #sample, no brackets
    'http://www.bjjcompsystem.com/tournaments/1221/categories/1533466', #example of category__bracket-1
    'http://www.bjjcompsystem.com/tournaments/1221/categories/1533387', #example of category__bracket-3
    'http://www.bjjcompsystem.com/tournaments/1221/categories/1533372', #example of category__bracket-7
    'http://www.bjjcompsystem.com/tournaments/1221/categories/1533022', #example of category__bracket-15
    'http://www.bjjcompsystem.com/tournaments/1221/categories/1532847',
    'http://www.bjjcompsystem.com/tournaments/1221/categories/1532871', #example of category__bracket-15 plus finals
    'http://www.bjjcompsystem.com/tournaments/1221/categories/1532889', #example of bracket with two losers in a match, so throws an error in fight 32 on fighter a name
    'http://www.bjjcompsystem.com/tournaments/1221/categories/1532856', #example of no winner on fight 11 so throws error on fighter b name
]

for url in pages:
    try:
        response = requests.get(url)
        soup = BeautifulSoup(response.text, 'html.parser')

        division = soup.find('span', {'class':'category-title__label category-title__age-division'}).text.strip()
        label = soup.find('i', {'class':'fa fa-mars'}).parent.text.strip()
        belt = soup.find('i', {'class':'fa fa-belt'}).parent.text.strip()
        weight = soup.find('i', {'class':'fa fa-weight'}).parent.text.strip()

        # PARSE BRACKETS
        #brackets = soup.find_all('div', {'class':'tournament-category__bracket tournament-category__bracket-15'})
        brackets = soup.select('div[class*="tournament-category__bracket tournament-category__bracket-"]')

        results = pd.DataFrame()
        for bracket in brackets:
            try:
                bracketTitle = bracket.find_previous_sibling('div').text
            except:
                bracketTitle = 'Bracket 1/1'
            rows = bracket.find_all('div', {'class':'row'})
            for row in rows:
                matches = row.find_all('div', {'class':'tournament-category__match'})
                for match in matches:
                    bye = False
                    try:
                        match.find("div", {"class": "match-card__bye"}).text
                        where = match.find("div", {"class": "match-card__bye"}).text
                        when = match.find("div", {"class": "match-card__bye"}).text
                        loser = match.find("div", {"class": "match-card__bye"}).text
                        fighter_b_name = match.find("div", {"class": "match-card__bye"}).text
                        fighter_b_rank = match.find("div", {"class": "match-card__bye"}).text
                        fighter_b_club = match.find("div", {"class": "match-card__bye"}).text
                        bye = True
                    except:
                        where = match.find('div', {'class':'bracket-match-header__where'}).text
                        when = match.find('div', {'class':'bracket-match-header__when'}).text

                    fighter_a_desc = match.find_all('div', {'class':'match-card__competitor'})[0]
                    try:
                        fighter_a_name = fighter_a_desc.find('div', {'class':'match-card__competitor-name'}).text
                    except:
                        fighter_a_name = 'UNKNOWN'
                    try:
                        fighter_a_rank = fighter_a_desc.find('span', {'class':'match-card__competitor-n'}).text
                    except:
                        fighter_a_rank = 'N/A'
                    try:
                        fighter_a_club = fighter_a_desc.find('div', {'class':'match-card__club-name'}).text
                    except:
                        fighter_a_club = 'N/A'

                    cols = ['Bracket Title','Divison','Label','Belt','Weight','Where','When','Rank','Fighter','Opponent', 'Opponent Rank' ,'Gym','Loser']

                    if bye == False:
                        fighter_b_desc = match.find_all('div', {'class':'match-card__competitor'})[1]
                        try:
                            fighter_b_name = fighter_b_desc.find('div', {'class':'match-card__competitor-name'}).text
                        except:
                            fighter_b_name = 'UNKNOWN'
                        try:
                            fighter_b_rank = fighter_b_desc.find('span', {'class':'match-card__competitor-n'}).text
                        except:
                            fighter_b_rank = 'N/A'
                        try:
                            fighter_b_club = fighter_b_desc.find('div', {'class':'match-card__club-name'}).text
                        except:
                            fighter_b_club = 'N/A'
                        try:
                            loser = match.find('span', {'class':'match-card__competitor-description match-competitor--loser'}).find('div', {'class':'match-card__competitor-name'}).text
                        except:
                            loser = None
                            #print ('Loser could not be identified by html class')

                    temp_df_b = pd.DataFrame([[bracketTitle, division, label, belt, weight, where, when, fighter_b_rank, fighter_b_name, fighter_a_name, fighter_a_rank, fighter_b_club, loser]], columns=cols)
                    temp_df = pd.DataFrame([[bracketTitle, division, label, belt, weight, where, when, fighter_a_rank, fighter_a_name, fighter_b_name, fighter_b_rank, fighter_a_club, loser]], columns=cols)
                    temp_df = temp_df.append(temp_df_b, sort=True)
                    results = results.append(temp_df, sort=True).reset_index(drop=True)

        # IDENTIFY LOSERS THAT WERE NOT FOUND BY HTML ATTRIBUTES
        results['Fight Number'] = results['Where'].str.split('FIGHT ', expand=True)[1].str.split(':', expand=True)[0].fillna(0)
        results['Fight Number'] = pd.Categorical(results['Fight Number'], ordered=True, categories=ns.natsorted(results['Fight Number'].unique()))
        results = results.sort_values('Fight Number')
        results = results.drop_duplicates().reset_index(drop=True)

        for idx, row in results.iterrows():
            if row['Loser'] == None:
                idx_save = idx
                check = idx + 1
                fighter_check_name = row['Fighter']
                if fighter_check_name in list(results.loc[check:, 'Fighter']):
                    results.at[idx_save, 'Loser'] = row['Opponent']
                else:
                    results.at[idx_save, 'Loser'] = row['Fighter']

        print ('Processed url: %s' %url)
    except:
        print ('Error accessing url: %s' %url)
Output: I'm just showing the first 25 rows; 116 in total.
print (results.head(25).to_string())
Belt Bracket Title Divison Fighter Gym Label Loser Opponent Opponent Rank Rank Weight When Where Fight Number
0 BLACK Bracket 2/4 Master 1 Marcelo França Mafra CheckMat Male BYE BYE BYE 4 Middle BYE BYE 0
1 BLACK Bracket 4/4 Master 1 Dealonzio Jerome Jackson Team Lloyd Irvin Male BYE BYE BYE 5 Middle BYE BYE 0
2 BLACK Bracket 2/4 Master 1 Oliver Leys Geddes Gracie Elite Team Male BYE BYE BYE 6 Middle BYE BYE 0
3 BLACK Bracket 1/4 Master 1 Gabriel Procópio da Fonseca Brazilian Top Team Male BYE BYE BYE 9 Middle BYE BYE 0
4 BLACK Bracket 2/4 Master 1 Igor Mocaiber Peralva de Mello Cicero Costha Internacional Male BYE BYE BYE 10 Middle BYE BYE 0
5 BLACK Bracket 1/4 Master 1 Sandro Gabriel Vieira Cantagalo Team Male BYE BYE BYE 1 Middle BYE BYE 0
6 BLACK Bracket 4/4 Master 1 Paulo Cesar Schauffler de Oliveira Gracie Elite Team Male BYE BYE BYE 8 Middle BYE BYE 0
7 BLACK Bracket 3/4 Master 1 Paulo César Ledesma Atos Jiu-Jitsu Male BYE BYE BYE 7 Middle BYE BYE 0
8 BLACK Bracket 3/4 Master 1 Vitor Henrique Silva Oliveira GF Team Male BYE BYE BYE 2 Middle BYE BYE 0
9 BLACK Bracket 4/4 Master 1 Clark Rouson Gracie Gracie Allegiance Male BYE BYE BYE 3 Middle BYE BYE 0
10 BLACK Bracket 4/4 Master 1 Phillip V. Fitzpatrick CheckMat Male Jonathan M. Perrine Jonathan M. Perrine 29 45 Middle Wed 08/21 at 10:06 AM FIGHT 1: Mat 8 1
11 BLACK Bracket 2/4 Master 1 André Felipe Maciel Freire Carlson Gracie Male Jerardo Linares Jerardo Linares 50 14 Middle Wed 08/21 at 10:07 AM FIGHT 1: Mat 6 1
12 BLACK Bracket 2/4 Master 1 Jerardo Linares Cleber Jiu Jitsu Male Jerardo Linares André Felipe Maciel Freire 14 50 Middle Wed 08/21 at 10:07 AM FIGHT 1: Mat 6 1
13 BLACK Bracket 1/4 Master 1 Fábio Junior Batista da Mata Evolve MMA Male Fábio Junior Batista da Mata Jeffery Bynum Hammond 16 53 Middle Wed 08/21 at 10:08 AM FIGHT 1: Mat 5 1
14 BLACK Bracket 4/4 Master 1 Jonathan M. Perrine Gracie Humaita Male Jonathan M. Perrine Phillip V. Fitzpatrick 45 29 Middle Wed 08/21 at 10:06 AM FIGHT 1: Mat 8 1
15 BLACK Bracket 1/4 Master 1 Jeffery Bynum Hammond Caique Jiu-Jitsu Male Fábio Junior Batista da Mata Fábio Junior Batista da Mata 53 16 Middle Wed 08/21 at 10:08 AM FIGHT 1: Mat 5 1
16 BLACK Bracket 3/4 Master 1 David Benzaken Teampact Male Evan Franklin Barrett Evan Franklin Barrett 54 15 Middle Wed 08/21 at 10:07 AM FIGHT 1: Mat 7 1
17 BLACK Bracket 3/4 Master 1 Evan Franklin Barrett Zenith BJJ - Las Vegas Male Evan Franklin Barrett David Benzaken 15 54 Middle Wed 08/21 at 10:07 AM FIGHT 1: Mat 7 1
18 BLACK Bracket 2/4 Master 1 Nathan S Santos Zenith BJJ - Las Vegas Male Nathan S Santos Jose A. Llanas-Campos 30 46 Middle Wed 08/21 at 10:16 AM FIGHT 2: Mat 6 2
19 BLACK Bracket 3/4 Master 1 Javier Arroyo Team Shawn Hammonds Male Javier Arroyo Kaisar Adilevich Saulebayev 43 27 Middle Wed 08/21 at 10:18 AM FIGHT 2: Mat 7 2
20 BLACK Bracket 4/4 Master 1 Manuel Ray Gonzales II Ralph Gracie Male Steven J. Patterson Steven J. Patterson 13 49 Middle Wed 08/21 at 10:10 AM FIGHT 2: Mat 8 2
21 BLACK Bracket 2/4 Master 1 Jose A. Llanas-Campos Ribeiro Jiu-Jitsu Male Nathan S Santos Nathan S Santos 46 30 Middle Wed 08/21 at 10:16 AM FIGHT 2: Mat 6 2
22 BLACK Bracket 4/4 Master 1 Steven J. Patterson Brasa CTA Male Steven J. Patterson Manuel Ray Gonzales II 49 13 Middle Wed 08/21 at 10:10 AM FIGHT 2: Mat 8 2
23 BLACK Bracket 3/4 Master 1 Kaisar Adilevich Saulebayev Charles Gracie Jiu-Jitsu Academy Male Javier Arroyo Javier Arroyo 27 43 Middle Wed 08/21 at 10:18 AM FIGHT 2: Mat 7 2
24 BLACK Bracket 1/4 Master 1 Matthew Romino Fox Team Lloyd Irvin Male Thiago Alves Cavalcante Rodrigues Thiago Alves Cavalcante Rodrigues 33 48 Middle Wed 08/21 at 10:15 AM FIGHT 2: Mat 5 2
