I'm learning how to scrape using Beautiful Soup with Selenium, and I found a website that has multiple tables built from table tags (my first time dealing with them). I'm trying to scrape the text from each table and append each element to its respective list. First I'm trying to scrape the first table, and the rest I want to do on my own. But I cannot access the tag for some reason.
I also incorporated Selenium to access the site, because when I copy the link onto another tab, the list of tables disappears for some reason.
My code so far:
import requests
from bs4 import BeautifulSoup
import pandas as pd
import re
from selenium import webdriver
from selenium.webdriver.support.ui import Select
PATH = "C:\Program Files (x86)\chromedriver.exe"
driver = webdriver.Chrome(PATH)
targetSite = "https://www.sdvisualarts.net/sdvan_new/events.php"
driver.get(targetSite)
select_event = Select(driver.find_element_by_name('subs'))
select_event.select_by_value('All')
select_loc = Select(driver.find_element_by_name('loc'))
select_loc.select_by_value("All")
driver.find_element_by_name("submit").click()
targetSite = "https://www.sdvisualarts.net/sdvan_new/viewevents.php"
event_title = []
name = []
address = []
city = []
state = []
zipCode = []
location = []
webSite = []
fee = []
event_dates = []
opening_dates = []
description = []
try:
    page = requests.get(targetSite)
    soup = BeautifulSoup(page.text, 'html.parser')
    items = soup.find_all('table', {"class": "popdetail"})
    for i in items:
        event_title.append(item.find('b', {'class': "text"})).text.strip()
        name.append(item.find('td', {'class': "text"})).text.strip()
        address.append(item.find('td', {'class': "text"})).text.strip()
        city.append(item.find('td', {'class': "text"})).text.strip()
        state.append(item.find('td', {'class': "text"})).text.strip()
        zipCode.append(item.find('td', {'class': "text"})).text.strip()
Can someone let me know if I am doing this correctly? This is my first time dealing with a site whose elements disappear when the URL is copied into a new tab and/or window.
So far, I am unable to append any information to any of the lists.
One issue is with the for loop: you have for i in items:, but then you reference item instead of i.
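For example (a sketch using the question's own selectors, which I haven't verified against the page), the corrected loop would look like this. Note that the closing parenthesis also has to move, so that .text.strip() applies to the tag returned by .find() rather than to the None returned by list.append():
for item in items:
    event_title.append(item.find('b', {'class': 'text'}).text.strip())
    name.append(item.find('td', {'class': 'text'}).text.strip())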
And secondly, if you are using Selenium to render the page, then you should probably use Selenium to get the HTML. The site also has tables embedded within tables, so it's not as straightforward as iterating through the <table> tags. What I ended up doing was having pandas read in the tables (pd.read_html returns a list of dataframes), then iterating through those, as there is a pattern to how the dataframes are constructed.
import pandas as pd
from selenium import webdriver
from selenium.webdriver.support.ui import Select
PATH = "C:\Program Files (x86)\chromedriver.exe"
driver = webdriver.Chrome(PATH)
targetSite = "https://www.sdvisualarts.net/sdvan_new/events.php"
driver.get(targetSite)
select_event = Select(driver.find_element_by_name('subs'))
select_event.select_by_value('All')
select_loc = Select(driver.find_element_by_name('loc'))
select_loc.select_by_value("All")
driver.find_element_by_name("submit").click()
targetSite = "https://www.sdvisualarts.net/sdvan_new/viewevents.php"
event_title = []
name = []
address = []
city = []
state = []
zipCode = []
location = []
webSite = []
fee = []
event_dates = []
opening_dates = []
description = []
dfs = pd.read_html(driver.page_source)
driver.close()  # note the parentheses: driver.close alone is a no-op
for idx, table in enumerate(dfs):
    if table.iloc[0, 0] == 'Event Title':
        event_title.append(table.iloc[-1, 0])
        tempA = dfs[idx + 1]
        tempA.index = tempA[0]
        tempB = dfs[idx + 4]
        tempB.index = tempB[0]
        tempC = dfs[idx + 5]
        tempC.index = tempC[0]
        name.append(tempA.loc['Name', 1])
        address.append(tempA.loc['Address', 1])
        city.append(tempA.loc['City', 1])
        state.append(tempA.loc['State', 1])
        zipCode.append(tempA.loc['Zip', 1])
        location.append(tempA.loc['Location', 1])
        webSite.append(tempA.loc['Web Site', 1])
        fee.append(tempB.loc['Fee', 1])
        event_dates.append(tempB.loc['Dates', 1])
        opening_dates.append(tempB.loc['Opening Days', 1])
        description.append(tempC.loc['Event Description', 1])
df = pd.DataFrame({'event_title': event_title,
                   'name': name,
                   'address': address,
                   'city': city,
                   'state': state,
                   'zipCode': zipCode,
                   'location': location,
                   'webSite': webSite,
                   'fee': fee,
                   'event_dates': event_dates,
                   'opening_dates': opening_dates,
                   'description': description})
Output:
print (df.to_string())
event_title name address city state zipCode location webSite fee event_dates opening_dates description
0 The San Diego Museum of Art Welcomes a Special... San Diego Museum of Art 1450 El Prado, Balboa Park San Diego CA 92101 Central San Diego https://www.sdmart.org/ NaN Starts On 6-18-2020 Ends On 1-10-2021 Opens virtually on June 18. The work will beco... The San Diego Museum of Art is launching its f...
1 New Exhibit: Miller Dairy Remembered Lemon Grove Historical Society 3185 Olive Street, Treganza Heritage Park Lemon Grove CA 91945 Central San Diego http://www.lghistorical.org Children 12 and under free and must be accompa... Starts On 6-27-2020 Ends On 12-4-2020 Exhibit on view Saturdays 11 am to 2 pm; close... From 1926 there were cows smack in the midst o...
2 Gizmos and Shivelight Distinction Gallery 317 E. Grand Ave Escondido CA 92025 North County Inland http://www.distinctionart.com NaN Starts On 7-14-2020 Ends On 9-5-2020 08/08/20 - 09/05/20 Distinction Gallery is proud to present our so...
3 Virtual Opening - July Exhibitions Vision Art Museum 2825 Dewey Rd. Suite 100 San Diego CA 92106 Central San Diego http://www.visionsartmuseum.org Free Starts On 7-18-2020 Ends On 10-4-2020 NaN Join Visions Art Museum for a virtual exhibiti...
4 Laying it Bare: The Art of Walter Redondo and ... Fresh Paint Gallery 1020-B Prospect Street La Jolla CA 92037 Central San Diego http://freshpaintgallery.com/ NaN Starts On 8-1-2020 Ends On 9-27-2020 Tuesday through Sunday. Mondays closed. A two-person exhibit of new abstract expressio...
5 Online oil painting lessons with Concetta Antico NaN NaN NaN NaN NaN Virtual http://concettaantico.com/live-online-oil-pain... NaN Starts On 8-10-2020 Ends On 8-31-2020 NaN Anyone can learn to paint like the masters! Ov...
6 MOMENTUM: A Creative Industry Symposium Vanguard Culture Via Zoom San Diego California 92101 Virtual https://www.eventbrite.com/e/momentum-a-creati... $10 suggested donation Starts On 8-17-2020 Ends On 9-7-2020 NaN MOMENTUM: A Creative Industry Symposium Monday...
7 Virtual Locals Invitational Show Art & Frames of Coronado 936 ORANGE AVE Coronado CA 92118 0 https://www.artsteps.com/view/5eed0ad62cd0d65b... free Starts On 8-21-2020 Ends On 8-1-2021 NaN Art and Frames of Coronado invites you to our ...
8 HERE & Now R.B. Stevenson Gallery 7661 Girard Avenue, Suite 101 La Jolla California 92037 Central San Diego http://www.rbstevensongallery.com Free Starts On 8-22-2020 Ends On 9-25-2020 Tuesday through Saturday R.B.Stevenson Gallery is pleased to announce t...
9 Art Unites Learning: Normal 2.0 Art Unites NaN San Diego NaN 92116 Central San Diego https://www.facebook.com/events/956878098104971 Free Starts On 8-25-2020 Ends On 8-25-2020 NaN Please join us on Tuesday, August 25th as we: ...
10 Image Quest Sojourn; Visual Journaling for Per... Pamela Underwood Studios Virtual NaN NaN NaN Virtual http://www.pamelaunderwood.com/event/new-onlin... $595.00 Starts On 8-26-2020 Ends On 11-11-2020 NaN Create a personal Image Quest resource journal...
11 Behind The Exhibition: Southern California Con... Oceanside Museum of Art 704 Pier View Way Oceanside California 92054 Virtual https://oma-online.org/events/behind-the-exhib... No fee required. Donations recommended. Starts On 8-27-2020 Ends On 8-27-2020 NaN Join curator Beth Smith and exhibitions manage...
12 Lay it on Thick, a Virtual Art Exhibition San Diego Watercolor Society 2825 Dewey Rd Bldg #202 San Diego California 92106 0 https://www.sdws.org NaN Starts On 8-30-2020 Ends On 9-26-2020 NaN The San Diego Watercolor Society proudly prese...
13 The Forum: Marketing & Branding for Creatives Vanguard Culture Via Zoom San Diego CA 92101 South San Diego http://vanguardculture.com/ $5 suggested donation Starts On 9-1-2020 Ends On 9-1-2020 NaN Attention creative industry professionals! Joi...
14 Write or Die Solo Exhibition You Belong Here 3619 EL CAJON BLVD San Diego CA 92104 Central San Diego http://www.youbelongsd.com/upcoming-events/wri... $10 donation to benefit You Belong Here Starts On 9-4-2020 Ends On 9-6-2020 NaN Write or Die is an immersive installation and ...
15 SDVAN presents Art San Diego at Bread and Salt San Diego Visual Arts Network 1955 Julian Avenue San Digo CA 92113 Central San Diego http://www.sdvisualarts.net and https://www.br... Free Starts On 9-5-2020 Ends On 10-24-2020 NaN We are pleased to announce the four artist rec...
16 The Coming of Treganza Heritage Park Lemon Grove Historical Society 3185 Olive Street Lemon Grove CA 91945 Central San Diego http://www.lghistorical.org Free for all ages Starts On 9-10-2020 Ends On 9-10-2020 The park is open daily, 8 am to 8 pm. Covid 19... Lemon Grove\'s central city park will be renam...
17 Online oil painting course | 4 weeks NaN NaN NaN NaN NaN Virtual http://concettaantico.com/live-online-oil-pain... NaN Starts On 9-14-2020 Ends On 10-5-2020 NaN Over 4 weekly Zoom lessons, learn the techniqu...
18 Online oil painting course | 4 weeks NaN NaN NaN NaN NaN Virtual http://concettaantico.com/live-online-oil-pain... NaN Starts On 10-12-2020 Ends On 11-2-2020 NaN Over 4 weekly Zoom lessons, learn the techniqu...
19 36th Annual Mission Fed ArtWalk Mission Fed ArtWalk Ash Street San Diego California 92101 Central San Diego www.missionfedartwalk.org Free Starts On 11-7-2020 Ends On 11-8-2020 Sat and Sun Nov 7 and 8 Mission Fed ArtWalk returns to San Diego’s Lit...
20 Mingei Pop Up Workshop: My Daruma Doll New Childrens Museum 200 West Island Avenue San Diego California 92101 Central San Diego http://thinkplaycreate.org/ Free with admission Starts On 11-13-2020 Ends On 11-13-2020 NaN Join Mingei International Museum at The New Ch...
Updated: Not sure I explained it well the first time.
I have a scheduling problem, or more accurately, a "first come, first served" problem. A list of available assets is assigned a set of spaces, available in pairs (think cars:parking spots, diners:tables, teams:games). I need a rough simulation (random) that chooses the first two to arrive from the available pairs, then chooses the next two from the remaining available pairs, and so on, until all spaces are filled.
I started with teams:games to cut my teeth. The first pair is easy enough. How do I then whittle the list down to fill the next two spots from among the remaining available entities? I've tried a bunch of different things, but keep coming up short. Help appreciated.
import itertools
import numpy as np
import pandas as pd
a = ['Georgia','Oregon','Florida','Texas'], ['Georgia','Oregon','Florida','Texas']
b = [(x,y) for x,y in itertools.product(*a) if x != y]
c = pd.DataFrame(b)
c.columns = ['home', 'away']
print(c)
d = c.sample(n = 2, replace = False)
print(d)
The first result is all possible combinations. But once the first slots are filled, there can be no repeats. In the example below, once Oregon and Georgia are slotted in, the only remaining options to choose from are Florida:Texas or Texas:Florida. Obviously the sample function alone frequently produces duplicates. I will need this to scale up to dozens, then hundreds of entities:slots. Many thanks in advance!
home away
0 Georgia Oregon
1 Georgia Florida
2 Georgia Texas
3 Oregon Georgia
4 Oregon Florida
5 Oregon Texas
6 Florida Georgia
7 Florida Oregon
8 Florida Texas
9 Texas Georgia
10 Texas Oregon
11 Texas Florida
home away
3 Oregon Georgia
5 Oregon Texas
Not exactly sure what you are trying to do. But if you want to randomly pair your unique entities, you can simply randomly order them and then place them in a 2-column dataframe. I wrote this with all the US states minus one (Wyoming):
import random

import pandas as pd

states = ['Alaska','Alabama','Arkansas','Arizona','California',
'Colorado','Connecticut','District of Columbia','Delaware',
'Florida','Georgia','Hawaii','Iowa','Idaho','Illinois',
'Indiana','Kansas','Kentucky','Louisiana','Massachusetts',
'Maryland','Maine','Michigan','Minnesota','Missouri',
'Mississippi','Montana','North Carolina','North Dakota',
'Nebraska','New Hampshire','New Jersey','New Mexico',
'Nevada','New York','Ohio','Oklahoma','Oregon',
'Pennsylvania','Rhode Island','South Carolina',
'South Dakota','Tennessee','Texas','Utah','Virginia',
'Vermont','Washington','Wisconsin','West Virginia']
a = states.copy()
random.shuffle(a)  # shuffle the copy that is actually sliced below
c = pd.DataFrame({'home': a[::2], 'away': a[1::2]})
print(c)
#Output
home away
0 West Virginia Minnesota
1 New Hampshire Louisiana
2 Nevada Florida
3 Alabama Indiana
4 Delaware North Dakota
5 Georgia Rhode Island
6 Oregon Pennsylvania
7 New York South Dakota
8 Maryland Kansas
9 Ohio Hawaii
10 Colorado Wisconsin
11 Iowa Idaho
12 Illinois Missouri
13 Arizona Mississippi
14 Connecticut Montana
15 District of Columbia Vermont
16 Tennessee Kentucky
17 Alaska Washington
18 California Michigan
19 Arkansas New Jersey
20 Massachusetts Utah
21 Oklahoma New Mexico
22 Virginia South Carolina
23 North Carolina Maine
24 Texas Nebraska
Not sure if this is exactly what you were asking for though.
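If you do need the "first come, first served" whittling described in the question, here is a minimal sketch (assuming ordered home/away pairs, as in the question's itertools example): pick one pair at random, drop every remaining pair that reuses either entity, and repeat until nothing is left.
import itertools
import random

teams = ['Georgia', 'Oregon', 'Florida', 'Texas']
# All ordered home/away pairs, excluding a team playing itself.
available = [(h, a) for h, a in itertools.product(teams, repeat=2) if h != a]

schedule = []
while available:
    pick = random.choice(available)  # the next pair to "arrive" takes a slot
    schedule.append(pick)
    # Whittle down: drop every remaining pair that reuses either team.
    available = [p for p in available if pick[0] not in p and pick[1] not in p]

print(schedule)  # e.g. [('Oregon', 'Georgia'), ('Texas', 'Florida')]
This should scale to larger lists as well, since each round only filters the remaining pairs.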
If you need to schedule all the fixtures of the season, you can check this answer --> League fixture generator in python
I'm really new to web scraping and saw a few questions similar to mine, but those solutions didn't work for me. I'm trying to scrape this website: https://www.nba.com/schedule for the h4 tags, which hold the dates and times for upcoming basketball games. I'm trying to use Beautiful Soup to grab those tags, but it always returns an empty list. Here's the code I'm using right now:
import requests
from bs4 import BeautifulSoup

url = "https://www.nba.com/schedule"
result = requests.get(url)
doc = BeautifulSoup(result.text, "html.parser")
schedule = doc.find_all('h4')
I saw something in another answer about the h4 tags being inside <script> tags, and I tried to use the json module but couldn't get that to work. Thanks for your help in advance!
The data you see on the page is loaded from an external URL, so BeautifulSoup doesn't see it. To load the data you can use the following example:
import json

import requests

url = "https://cdn.nba.com/static/json/staticData/scheduleLeagueV2_1.json"
data = requests.get(url).json()

# uncomment to print all data:
# print(json.dumps(data, indent=4))

for g in data["leagueSchedule"]["gameDates"]:
    print(g["gameDate"])
    for game in g["games"]:
        print(
            game["homeTeam"]["teamCity"],
            game["homeTeam"]["teamName"],
            "-",
            game["awayTeam"]["teamCity"],
            game["awayTeam"]["teamName"],
        )
    print()
Prints:
10/3/2021 12:00:00 AM
Los Angeles Lakers - Brooklyn Nets
10/4/2021 12:00:00 AM
Toronto Raptors - Philadelphia 76ers
Boston Celtics - Orlando Magic
Miami Heat - Atlanta Hawks
Minnesota Timberwolves - New Orleans Pelicans
Oklahoma City Thunder - Charlotte Hornets
San Antonio Spurs - Utah Jazz
Portland Trail Blazers - Golden State Warriors
Sacramento Kings - Phoenix Suns
LA Clippers - Denver Nuggets
10/5/2021 12:00:00 AM
New York Knicks - Indiana Pacers
Chicago Bulls - Cleveland Cavaliers
Houston Rockets - Washington Wizards
Memphis Grizzlies - Milwaukee Bucks
...and so on.
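If you would rather work with the schedule as a dataframe, here is a short sketch (it assumes the same data dict as the code above; the dotted column names come from pandas flattening the nested keys shown there):
import pandas as pd

# One row per game; nested dicts are flattened into dotted column names.
games = pd.json_normalize(
    data["leagueSchedule"]["gameDates"], record_path="games", meta=["gameDate"]
)
print(games[["gameDate", "homeTeam.teamName", "awayTeam.teamName"]].head())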
The code that I am running (straight from sportsipy documentation):
from sportsipy.nba.teams import Teams
teams = Teams()
for team in teams:
print(team.name, team.abbreviation)
Returns the following:
The requested page returned a valid response, but no data could be found. Has the season begun, and is the data available on www.sports-reference.com?
Does anyone have any tips on moving forward with getting this information from the API?
That package's API is old/outdated. The table it's trying to parse now has a different id attribute.
A few things you can do:
Go in and edit/patch the code manually to get the correct data.
Raise the issue on the GitHub repo and wait for a fix and update.
Personally, the patch/fix is a quick and easy one, so I would just do that (but there could potentially be other tables you need to look into as well).
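To locate the file to edit, you can print the module's path first (a sketch; I'm assuming the module imports as sportsipy.nba.nba_utils, so adjust to match your install):
# The import path below is an assumption; check your site-packages if it differs.
import sportsipy.nba.nba_utils as nba_utils

print(nba_utils.__file__)  # path of the nba_utils.py to patch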
Open up nba_utils.py and change lines 85 and 86:
From:
teams_list = utils._get_stats_table(doc, 'div#all_team-stats-base')
opp_teams_list = utils._get_stats_table(doc, 'div#all_opponent-stats-base')
To:
teams_list = utils._get_stats_table(doc, '#totals-team')
opp_teams_list = utils._get_stats_table(doc, '#totals-opponent')
This will solve the current error; however, I don't know what other classes and functions may also need to be patched. There's a chance that, since this table changed slightly, others may have as well.
Output:
Charlotte Hornets CHO
Milwaukee Bucks MIL
Utah Jazz UTA
Sacramento Kings SAC
Memphis Grizzlies MEM
Los Angeles Lakers LAL
Miami Heat MIA
Indiana Pacers IND
Houston Rockets HOU
Phoenix Suns PHO
Atlanta Hawks ATL
Minnesota Timberwolves MIN
San Antonio Spurs SAS
Boston Celtics BOS
Cleveland Cavaliers CLE
Golden State Warriors GSW
Washington Wizards WAS
Portland Trail Blazers POR
Los Angeles Clippers LAC
New Orleans Pelicans NOP
Dallas Mavericks DAL
Brooklyn Nets BRK
New York Knicks NYK
Orlando Magic ORL
Philadelphia 76ers PHI
Chicago Bulls CHI
Denver Nuggets DEN
Toronto Raptors TOR
Oklahoma City Thunder OKC
Detroit Pistons DET
Another option is to just not use the API and get the data yourself. If you don't need the abbreviations, it's pretty straightforward with pandas:
import pandas as pd
url = 'https://www.basketball-reference.com/leagues/NBA_2022.html'
teams = list(pd.read_html(url)[4].dropna(subset=['Rk'])['Team'])
for team in teams:
    print(team)
If you do need the abbreviations, then it's a little trickier, but it can be achieved by using BeautifulSoup to pull them out of each team's href:
import requests
from bs4 import BeautifulSoup
url = 'https://www.basketball-reference.com/leagues/NBA_2022.html'
response = requests.get(url)
soup = BeautifulSoup(response.text, 'html.parser')
table = soup.find('table', {'id':'per_game-team'})
rows = table.find_all('td', {'data-stat':'team'})
teams = {}
for row in rows:
    if row.find('a'):
        name = row.find('a').text
        abbreviation = row.find('a')['href'].split('/')[-2]
        teams.update({name: abbreviation})

for team in teams.items():
    print(team[0], team[1])
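If you'd rather keep that mapping as a dataframe, a small convenience sketch on top of the teams dict built above:
import pandas as pd

# Two columns: full team name and its basketball-reference abbreviation.
df_teams = pd.DataFrame(list(teams.items()), columns=['Team', 'Abbreviation'])
print(df_teams)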
I have real estate properties and their details (17 columns) in a CSV file (nearly half a million entries). One of the columns provides a location, but it is a bit too detailed. I want to categorize my entries, so I want to simplify the location into more generic areas. I would have the areas I want to categorize the entries into in a list such as:
keywords = ['Downtown','Park View','Industrial District', ... ]
So ideally I would like to take an entry that has, for example, Sky Tower Downtown Los Angeles, and classify it as Downtown.
So the task is to first detect the keyword in the Location column and then append it to a new column (right beside it if possible). If no keyword is found in the entry, I would like to classify it as Other.
It would look something like this:
Date      | Record_Type    | Location                                   | Property_Type | ... | Price
19-Mar-21 | Active Listing | Sky Tower Downtown Los Angeles             | Apartment     | ... | 15000
19-Mar-21 | Active Listing | Central Park Residential Tower, 5th Avenue | Apartment     | ... | 17000
20-Mar-21 | Active Listing | Meadow Gardens, Park View                  | Villa         | ... | 125000
To something like:
Date      | Record_Type    | Location                                   | Area      | Property_Type | ... | Price
19-Mar-21 | Active Listing | Sky Tower Downtown Los Angeles             | Downtown  | Apartment     | ... | 15000
19-Mar-21 | Active Listing | Central Park Residential Tower, 5th Avenue | Other     | Apartment     | ... | 17000
20-Mar-21 | Active Listing | Meadow Gardens, Park View                  | Park View | Villa         | ... | 125000
Finally, it should save it all to a new CSV file. I would also ideally like to use pandas to read/write the CSV.
Thanks in advance!
Edit:
I have tried methods such as those in the following threads, but I get errors and I don't know what's wrong, so I'm open to fresh ideas.
How to append a new column to a CSV file using Python?
Adding new column to CSV in Python
If you have this dataframe:
Date Record_Type Location Property_Type Price
0 19-Mar-21 Active Listing Sky Tower Downtown Los Angeles Apartment 15000
1 19-Mar-21 Active Listing Central Park Residential Tower, 5th Avenue Apartment 17000
2 20-Mar-21 Active Listing Meadow Gardens, Park View Villa 125000
Then:
keywords = ["Downtown", "Park View", "Industrial District"]

df.insert(
    loc=3,
    column="Area",
    value=df["Location"].apply(
        lambda x: next((kw for kw in keywords if kw in x), "Other")
    ),
)
print(df)
This creates the Area column next to Location and prints:
Date Record_Type Location Area Property_Type Price
0 19-Mar-21 Active Listing Sky Tower Downtown Los Angeles Downtown Apartment 15000
1 19-Mar-21 Active Listing Central Park Residential Tower, 5th Avenue Other Apartment 17000
2 20-Mar-21 Active Listing Meadow Gardens, Park View Park View Villa 125000
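To round this out with the CSV read/write the question asks for, here is a sketch (the file names are placeholders):
import pandas as pd

keywords = ["Downtown", "Park View", "Industrial District"]

df = pd.read_csv("properties.csv")  # placeholder input file name
df.insert(
    loc=3,
    column="Area",
    value=df["Location"].apply(
        lambda x: next((kw for kw in keywords if kw in x), "Other")
    ),
)
df.to_csv("properties_with_area.csv", index=False)  # placeholder output name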
I have a column in my dataframe that contains the following:
Wal-Mart Stores, Inc., Clinton, IA 52732
Benton Packing, LLC, Clearfield, UT 84016
North Coast Iron Corp, Seattle, WA 98109
Messer Construction Co. Inc., Amarillo, TX 79109
Ocean Spray Cranberries, Inc., Henderson, NV 89011
W R Derrick & Co. Lexington, SC 29072
I am having problems capturing it with regex; so far my regex only works for the first two lines:
[A-Z][A-za-z-\s]+,\s{1}(Inc.|LLC)
How do I split the column into 4 additional columns? I.e., Column 1 = Company Name, Column 2 = City, Column 3 = State, Column 4 = Zipcode.
Example of the output is shown below:
Company_Name City State ZipCode
Wal-Mart Stores, Inc. Clinton IA 52732
The names are probably the trickiest part, but if you know that the structure of city, state, zip will always be the same (i.e., no extra commas), you could use rsplit to split the strings. Similarly, pandas has a str.rsplit method as well (see the sketch at the end).
df
Address
0 Wal-Mart Stores, Inc., Clinton, IA 52732
1 Benton Packing, LLC, Clearfield, UT 84016
2 North Coast Iron Corp, Seattle, WA 98109
3 Messer Construction Co. Inc., Amarillo, TX 79109
# Zip code is the last whitespace-separated token.
df['Zip'] = df.Address.map(lambda x: x.rsplit(' ', 1)[-1])
# Drop the zip, then split the remainder on its last two commas.
df['Name'], df['City'], df['State'] = zip(*df.Address.map(lambda x: x.rsplit(' ', 1)[0].rsplit(',', 2)))
df
Address Zip \
0 Wal-Mart Stores, Inc., Clinton, IA 52732 52732
1 Benton Packing, LLC, Clearfield, UT 84016 84016
2 North Coast Iron Corp, Seattle, WA 98109 98109
3 Messer Construction Co. Inc., Amarillo, TX 79109 79109
Name City State
0 Wal-Mart Stores, Inc. Clinton IA
1 Benton Packing, LLC Clearfield UT
2 North Coast Iron Corp Seattle WA
3 Messer Construction Co. Inc. Amarillo TX
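And for completeness, a sketch of the vectorized str.rsplit variant mentioned above (same assumption that the trailing ", City, ST ZIP" structure is consistent):
# Split off the zip, then split the rest on its last two commas, all vectorized.
split1 = df['Address'].str.rsplit(' ', n=1, expand=True)
df['Zip'] = split1[1]
split2 = split1[0].str.rsplit(',', n=2, expand=True)
df['Name'] = split2[0]
df['City'] = split2[1].str.strip()
df['State'] = split2[2].str.strip()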