Web-scraping. Columns instead of rows

Web-scraping. Columns instead of rows - python

I have difficulties scraping data and saving it to a consistent columns.
More specifically, the website I scrape does not have a separate tags for each and every item I scrape (except key and value).
As a result, I get a CSV file with 2 rows - key and value and corresponding text in them, whereas my idea is to get columns instead.
Is it possible to keep headers constant and append value items or it is not possible, given the specifics of the website?
Thank you in advance.
import requests
import bs4
import pandas as pd
keys = []
values = []
for pagenumber in range (0,2):
url = 'https://www.marktplaats.nl/l/auto-s/p/'
txt = requests.get(url+str(pagenumber))
soup = bs4.BeautifulSoup(txt.text, 'html.parser')
soup_table = soup.find('ul', 'mp-Listings mp-Listings--list-view')
for car in soup_table.findAll('li'):
link = car.find('a')
sub_url = 'https://www.marktplaats.nl/' + link.get('href')
sub_soup = requests.get(sub_url)
soup1 = bs4.BeautifulSoup(sub_soup.text, 'html.parser')
soup1 = soup1.find('div', {'id': 'car-attributes'})
for car_item in soup1.findAll('div', {'class': 'spec-table-item'}):
key = car_item.find('span', {'class': 'key'}).text
keys.append(key)
value = car_item.find('span', {'class': 'value'}).text
values.append(value)
auto_database = pd.DataFrame({
'key': keys,
'value': values,
})
auto_database.to_csv('auto_database.csv')
print("Successfully saved..")
Results
Merk & Model: Lako
Bouwjaar: 1996
Uitvoering: 233 C
Carrosserie: Open wagen
Kenteken: OD-31-VD
APK tot: 29 juni 2020
Prijs: € 7.500,00
Merk & Model: RAM
Bouwjaar: 2020
Carrosserie: SUV of Terreinwagen
Brandstof: LPG
Kilometerstand: 70 km
Transmissie: Automaat
Prijs: Zie omschrijving
Motorinhoud: 5.700 cc
Opties:
Wanted result
Merk & Model Bouwjaar
RAM 2020

I suggest to save all metadata per car item to a dataframe, set the keys as the index and join all intermediate dataframes to a final one.
Try this:
import requests
import bs4
import pandas as pd
frames = []
for pagenumber in range (0,2):
url = 'https://www.marktplaats.nl/l/auto-s/p/'
txt = requests.get(url+str(pagenumber))
soup = bs4.BeautifulSoup(txt.text, 'html.parser')
soup_table = soup.find('ul', 'mp-Listings mp-Listings--list-view')
for car in soup_table.findAll('li'):
link = car.find('a')
sub_url = 'https://www.marktplaats.nl/' + link.get('href')
sub_soup = requests.get(sub_url)
soup1 = bs4.BeautifulSoup(sub_soup.text, 'html.parser')
soup1 = soup1.find('div', {'id': 'car-attributes'})
tmp = []
for car_item in soup1.findAll('div', {'class': 'spec-table-item'}):
key = car_item.find('span', {'class': 'key'}).text
value = car_item.find('span', {'class': 'value'}).text
tmp.append([key, value])
frames.append(pd.DataFrame(tmp).set_index(0))
df_final = pd.concat((tmp_df for tmp_df in frames), axis=1, join='outer').reset_index()
df_final = df_final.T
df_final.columns = df_final.loc["index"].values
df_final.drop("index", inplace=True)
df_final.reset_index(inplace=True, drop=True)
df_final.to_csv('auto_database.csv')
display(df_final.head(3))
Output:
Bouwjaar: Brandstof: Kilometerstand: Transmissie: Prijs: Motorinhoud: Kenteken: Opties: Merk & Model: Carrosserie: Uitvoering: APK tot: Energielabel: Verbruik: Topsnelheid: Kosten p/m: Vermogen: APK: Datum van registratie:
0 2014 Diesel 10.000 km Automaat € 10.950,00 400 cc NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
1 2011 Diesel 25.000 km Handgeschakeld Op aanvraag 1.500 cc VR-921-X \n\nParkeersensor\nMetallic lak\nBoordcomputer... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
2 2016 Benzine 95.545 km Handgeschakeld € 230,00 1.395 cc NaN \n\nParkeersensor\nMetallic lak\nRadio\nMistla... A3 Sedan NaN NaN NaN NaN NaN NaN NaN NaN NaN

Another approach:
import requests
import bs4
import pandas as pd
cars = []
for pagenumber in range (0,2):
url = 'https://www.marktplaats.nl/l/auto-s/p/'
txt = requests.get(url+str(pagenumber))
soup = bs4.BeautifulSoup(txt.text, 'html.parser')
soup_table = soup.find('ul', 'mp-Listings mp-Listings--list-view')
for car in soup_table.findAll('li'):
link = car.find('a')
sub_url = 'https://www.marktplaats.nl/' + link.get('href')
sub_soup = requests.get(sub_url)
soup1 = bs4.BeautifulSoup(sub_soup.text, 'html.parser')
soup1 = soup1.find('div', {'id': 'car-attributes'})
attribs = {}
for car_item in soup1.findAll('div', {'class': 'spec-table-item'}):
key = car_item.find('span', {'class': 'key'}).get_text(strip=True, separator='\n')
value = car_item.find('span', {'class': 'value'}).get_text(strip=True, separator='\n')
attribs[key] = value
cars.append(attribs)
unique_keys = set(k for car in cars for k in car.keys())
data = [{k: car.get(k) for k in unique_keys} for car in cars]
auto_database = pd.DataFrame(data)
auto_database.to_csv('auto_database.csv')
print("Successfully saved..")
Produces this csv file (screenshot from Libre Office):

Related

Python Limit time to run pandas read_html

I am trying to limit the time for running dfs = pd.read_html(str(response.text)). Once it runs for more than 5 seconds, it will stop running for this url and move to running the next url. I did not find out timeout attribute in pd.readhtml. So how can I do that?
from bs4 import BeautifulSoup
import re
import requests
import os
import time
from pandas import DataFrame
import pandas as pd
from urllib.request import urlopen
headers = {'User-Agent': 'regsre#jh.edu'}
urls={'https://www.sec.gov/Archives/edgar/data/1058307/0001493152-21-003451.txt', 'https://www.sec.gov/Archives/edgar/data/1064722/0001760319-21-000006.txt'}
for url in urls:
response = requests.get(url, headers = headers)
response.raise_for_status()
time.sleep(0.1)
dfs = pd.read_html(str(response.text))
print(url)
for item in dfs:
try:
Operation=(item[0].apply(str).str.contains('Revenue') | item[0].apply(str).str.contains('profit'))
if Operation.empty:
pass
if Operation.any():
Operation_sheet=item
if not Operation.any():
CashFlows=(item[0].apply(str).str.contains('income') | item[0].apply(str).str.contains('loss'))
if CashFlows.any():
Operation_sheet=item
if not CashFlows.any():
pass

I'm not certain what the issue is, but pandas seems to get overwhelmed by this file. If we utilize BeautifulSoup to instead search for tables, prettify them, and pass those to pd.read_html(), then it seems to be able to handle things just fine.
from bs4 import BeautifulSoup
import requests
import pandas as pd
headers = {'User-Agent': 'regsre#jh.edu'}
url = 'https://www.sec.gov/Archives/edgar/data/1064722/0001760319-21-000006.txt'
r = requests.get(url, headers=headers)
soup = BeautifulSoup(r.text)
dfs = []
for table in soup.find_all('table'):
dfs.extend(pd.read_html(table.prettify()))
# Printing the first few:
for df in dfs[0:3]:
print(df, '\n')
0 1 2 3 4
0 Nevada NaN 4813 NaN 65-0783722
1 (State or other jurisdiction of NaN (Primary Standard Industrial NaN (I.R.S. Employer
2 incorporation or organization) NaN Classification Code Number) NaN Identification Number)
0
0 Ralph V. De Martino, Esq.
1 Alec Orudjev, Esq.
2 Schiff Hardin LLP
3 901 K Street, NW, Suite 700
4 Washington, DC 20001
5 Phone (202) 778-6400
6 Fax: (202) 778-6460
0 1
0 Large accelerated filer [ ] Accelerated filer [ ]
1 NaN NaN
2 Non-accelerated filer [X] Smaller reporting company [X]
3 NaN NaN
4 NaN Emerging growth company [ ]

I have to create a dataframe from webscraping in a specific way

I have to create a dataframe in python by creating a bunch of lists from a table in a wikipedia article.
code:
from urllib.request import urlopen
from bs4 import BeautifulSoup
import csv
import pandas as pd
import numpy as np
url = "https://en.wikipedia.org/wiki/Texas_Killing_Fields"
page = urlopen(url)
soup = BeautifulSoup(page, 'html.parser')
all_tables = soup.find_all('table')
all_sortable_tables = soup.find_all('table', class_='wikitable sortable')
right_table = all_sortable_tables
A = []
B = []
C = []
D = []
E = []
for row in right_table.find_all('tr'):
cells = row.find_all('td')
if len(cells) == 5:
row.strip('\n')
A.append(cells[0].find(text=True))
B.append(cells[1].find(text=True))
C.append(cells[2].find(text=True))
D.append(cells[3].find(text=True))
E.append(cells[4].find(text=True))
df = pd.DataFrame(A, columns=['Victim'])
df['Victim'] = A
df['Age'] = B
df['Residence'] = C
df['Last Seen'] = D
df['Discovered'] = E
I keep getting an attribute error "ResultSet object has no attribute 'find_all'. You're probably treating a list of elements like a single element. Did you call find_all() when you meant to call find()?"
I have tried a bunch of methods and nothing has helped me. I'm also following a tutorial the teacher gave us and its not helpful either.
tutorial: https://alanhylands.com/how-to-web-scrape-wikipedia-python-urllib-beautiful-soup-pandas/#heading-10.-loop-through-the-rows
first time here btw as a questioner.

Note: As mentioned by #ggorlen using an existig api would be the best approache. I also would recommend to use a more structured approache to store your data to avoid these bunch of lists.
data = []
for row in soup.select('table.wikitable.sortable tr:has(td)'):
data.append(
dict(
zip([h.text.strip() for h in soup.select('table.wikitable.sortable tr th')[:5]],
[c.text.strip() for c in row.select('td')][:5])
)
)
pd.DataFrame(data)
Just an alternative approach to scrape tables using pandas.read_html() cause you already imported pandas. It also uses BeautifulSoup and is doing the job for you:
import pandas as pd
df = pd.read_html('https://en.wikipedia.org/wiki/Texas_Killing_Fields')[1]
df.iloc[:,:5] ### displays only the first 5 columns as in your example
Output:
Victim
Age
Residence
Last seen
Discovered
Brenda Jones
14
Galveston, Texas
July 1, 1971
July 2, 1971
Colette Wilson
13
Alvin, Texas
June 17, 1971
November 26, 1971
Rhonda Johnson
14
Webster, Texas
August 4, 1971
January 3, 1972
Sharon Shaw
13
Webster, Texas
August 4, 1971
January 3, 1972
Gloria Gonzales
19
Houston, Texas
October 28, 1971
November 23, 1971
...

How can scrape the team names and goals from this site into a table? Ive been trying a few different methods but can't quite figure it out

import requests
from bs4 import BeautifulSoup
URL = "https://www.hockey-reference.com/leagues/NHL_2021_games.html"
page = requests.get(URL)
soup = BeautifulSoup(page.content, "html.parser")
results = soup.find(id="all_games")
table = soup.find('div', attrs = {'id':'div_games'})
print(table.prettify())

Select the table not the div to print the table:
table = soup.find('table', attrs = {'id':'games'})
print(table.prettify())
Or use pandas.read_html() to get the table and transform into a dataframe:
import pandas as pd
pd.read_html('https://www.hockey-reference.com/leagues/NHL_2021_games.html', attrs={'id':'games'})[0].iloc[:,:5]
Output:
Date
Visitor
G
Home
G.1
2021-01-13
St. Louis Blues
4
Colorado Avalanche
1
2021-01-13
Vancouver Canucks
5
Edmonton Oilers
3
2021-01-13
Pittsburgh Penguins
3
Philadelphia Flyers
6
2021-01-13
Chicago Blackhawks
1
Tampa Bay Lightning
5
2021-01-13
Montreal Canadiens
4
Toronto Maple Leafs
5
...
...
...
...
...

table = soup.find('div', attrs = {'id':'div_games'})
trs = table.find_all('tr')
gamestats = []
for tr in trs:
gamestat = {}
gamestat['home_team_name'] = tr.find('td', attrs = {'data-stat' : 'home_team_name'})
gamestat['visit_team_name'] = tr.find('td', attrs = {'data-stat' : 'visit_team_name'})
gamestats.append(gamestat)

scrape tennis results including tournament for each match row

I want to scrape tennis matches results from this website
The results table I want has the columns: tournament_name match_time player_1 player_2 player_1_score player_2_score
This is an example
tournament_name match_time player_1 player_2 p1_set1 p2_set1
Roma / Italy 11:00 Krajinovic Filip Auger Aliassime Felix 6 4
Iasi (IX) / Romania 10:00 Bourgue Mathias Martineau Matteo 6 1
I can't associate each tournament name on the id="main_tour" with each row (one row is 2 class="match" or 2 class="match1"
I tried this code:
import requests
from bs4 import BeautifulSoup
u = "http://www.tennisprediction.com/?year=2020&month=9&day=14"
headers = {'User-Agent': 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:76.0) Gecko/20100101 Firefox/76.0'}
session = requests.Session()
r = session.get(u, timeout=30, headers=headers)
# print(r.status_code)
soup = BeautifulSoup(r.content, 'html.parser')
for table in soup.select('#main_tur'):
tourn_value = [i.get_text(strip=True) for i in table.select('tr:nth-child(1)')][0].split('/')[0].strip()
tourn_name = [i.get_text(strip=True) for i in table.select('tr td#main_tour')]
row = [i.get_text(strip=True) for i in table.select('.match')]
row2 = [i.get_text(strip=True) for i in table.select('.match1')]
print(tourn_value, tourn_name)

You can use this script to save the table to CSV in your format:
import requests
import pandas as pd
from bs4 import BeautifulSoup
url = 'http://www.tennisprediction.com/?year=2020&month=9&day=14'
soup = BeautifulSoup(requests.get(url).content, 'html.parser')
all_data = []
for t in soup.select('.main_time'):
p1 = t.find_next(class_='main_player')
p2 = p1.find_next(class_='main_player')
tour = t.find_previous(id='main_tour')
scores1 = {'player_1_set{}'.format(i): s for i, s in enumerate((tag.get_text(strip=True) for tag in t.parent.select('.main_res')), 1)}
scores2 = {'player_2_set{}'.format(i): s for i, s in enumerate((tag.get_text(strip=True) for tag in t.parent.find_next_sibling().select('.main_res')), 1)}
all_data.append({
'tournament_name': ' / '.join( a.text for a in tour.select('a') ),
'match_time': t.text,
'player_1': p1.get_text(strip=True, separator=' '),
'player_2': p2.get_text(strip=True, separator=' '),
})
all_data[-1].update(scores1)
all_data[-1].update(scores2)
df = pd.DataFrame(all_data)
df.to_csv('data.csv')
print(df)
Saves data.csv:
EDIT: To add Odd, Prob columns for both players:
import requests
import pandas as pd
from bs4 import BeautifulSoup
url = 'http://www.tennisprediction.com/?year=2020&month=9&day=14'
soup = BeautifulSoup(requests.get(url).content, 'html.parser')
all_data = []
for t in soup.select('.main_time'):
p1 = t.find_next(class_='main_player')
p2 = p1.find_next(class_='main_player')
tour = t.find_previous(id='main_tour')
odd1 = t.find_next(class_='main_odds_m')
odd2 = t.parent.find_next_sibling().find_next(class_='main_odds_m')
prob1 = t.find_next(class_='main_perc')
prob2 = t.parent.find_next_sibling().find_next(class_='main_perc')
scores1 = {'player_1_set{}'.format(i): s for i, s in enumerate((tag.get_text(strip=True) for tag in t.parent.select('.main_res')), 1)}
scores2 = {'player_2_set{}'.format(i): s for i, s in enumerate((tag.get_text(strip=True) for tag in t.parent.find_next_sibling().select('.main_res')), 1)}
all_data.append({
'tournament_name': ' / '.join( a.text for a in tour.select('a') ),
'match_time': t.text,
'player_1': p1.get_text(strip=True, separator=' '),
'player_2': p2.get_text(strip=True, separator=' '),
'odd1': odd1.text,
'prob1': prob1.text,
'odd2': odd2.text,
'prob2': prob2.text
})
all_data[-1].update(scores1)
all_data[-1].update(scores2)
df = pd.DataFrame(all_data)
df.to_csv('data.csv')
print(df)

Andrej's solution is really nice and elegant. Accept his solution, but here was my go at it:
import requests
from bs4 import BeautifulSoup
import pandas as pd
url = 'http://www.tennisprediction.com/?year=2020&month=9&day=14'
response = requests.get(url)
soup = BeautifulSoup(response.text, 'html.parser')
rows=[]
for matchClass in ['match','match1']:
matches = soup.find_all('tr',{'class':'match'})
for idx, match in enumerate(matches):
if idx%2 != 0:
continue
row = {}
tourny = match.find_previous('td',{'id':'main_tour'}).text
time = match.find('td',{'class':'main_time'}).text
p1 = match.find('td',{'class':'main_player'})
player_1 = p1.text
row.update({'tournament_name':tourny,'match_time':time,'player_1':player_1})
sets = p1.find_previous('tr',{'class':'match'}).find_all('td',{'class':'main_res'})
for idx,each_set in enumerate(sets):
row.update({'p1_set%d'%(idx+1):each_set.text})
p2 = match.find_next('td',{'class':'main_player'})
player_2 = p2.text
row.update({'player_2':player_2})
sets = p2.find_next('tr',{'class':'match'}).find_all('td',{'class':'main_res'})
for idx,each_set in enumerate(sets):
row.update({'p2_set%d'%(idx+1):each_set.text})
rows.append(row)
df = pd.DataFrame(rows)
Output:
print(df.head(10).to_string())
tournament_name match_time player_1 p1_set1 p1_set2 p1_set3 p1_set4 p1_set5 player_2 p2_set1 p2_set2 p2_set3 p2_set4 p2_set5
0 Roma / Italy prize / money : 5791 000 USD 11:10 Krajinovic Filip (SRB) (26) 6 7 Krajinovic Filip (SRB) (26) 4 5
1 Roma / Italy prize / money : 5791 000 USD 13:15 Dimitrov Grigor (BGR) (20) 7 6 Dimitrov Grigor (BGR) (20) 5 1
2 Roma / Italy prize / money : 5791 000 USD 13:50 Coric Borna (HRV) (32) 6 6 Coric Borna (HRV) (32) 4 4
3 Roma / Italy prize / money : 5791 000 USD 15:30 Humbert Ugo (FRA) (42) 6 7 Humbert Ugo (FRA) (42) 3 6 (5)
4 Roma / Italy prize / money : 5791 000 USD 19:00 Nishikori Kei (JPN) (34) 6 7 Nishikori Kei (JPN) (34) 4 6 (3)
5 Roma / Italy prize / money : 5791 000 USD 22:00 Travaglia Stefano (ITA) (87) 6 7 Travaglia Stefano (ITA) (87) 4 6 (4)
6 Iasi (IX) / Romania prize / money : 100 000 USD 10:05 Menezes Joao (BRA) (189) 6 6 Menezes Joao (BRA) (189) 4 4
7 Iasi (IX) / Romania prize / money : 100 000 USD 12:05 Cretu Cezar (2001) (ROU) 2 6 6 Cretu Cezar (2001) (ROU) 6 3 4
8 Iasi (IX) / Romania prize / money : 100 000 USD 14:35 Zuk Kacper (POL) (306) 6 6 Zuk Kacper (POL) (306) 2 0
9 Roma / Italy prize / money : 3452 000 USD 11:05 Pavlyuchenkova Anastasia (RUS) (32) 6 6 6 Pavlyuchenkova Anastasia (RUS) (32) 4 7 (5) 1

Trouble merging Scraped data using Pandas and numpy in Python

I am trying to collect information from a lot of different urls and combine the data based on the year and Golfer name. As of now I am trying to write the information to csv and then match using pd.merge() but I have to use a unique name for each dataframe to merge. I tried to use a numpy array but I am stuck with the final process of getting all the separate data to be merged.
import csv
from urllib.request import urlopen
from bs4 import BeautifulSoup
import datetime
import socket
import urllib.error
import pandas as pd
import urllib
import sqlalchemy
import numpy as np
base = 'http://www.pgatour.com/'
inn = 'stats/stat'
end = '.html'
years = ['2017','2016','2015','2014','2013']
alpha = []
#all pages with links to tables
urls = ['http://www.pgatour.com/stats.html','http://www.pgatour.com/stats/categories.ROTT_INQ.html','http://www.pgatour.com/stats/categories.RAPP_INQ.html','http://www.pgatour.com/stats/categories.RARG_INQ.html','http://www.pgatour.com/stats/categories.RPUT_INQ.html','http://www.pgatour.com/stats/categories.RSCR_INQ.html','http://www.pgatour.com/stats/categories.RSTR_INQ.html','http://www.pgatour.com/stats/categories.RMNY_INQ.html','http://www.pgatour.com/stats/categories.RPTS_INQ.html']
for i in urls:
data = urlopen(i)
soup = BeautifulSoup(data, "html.parser")
for link in soup.find_all('a'):
if link.has_attr('href'):
alpha.append(base + link['href'][17:]) #may need adjusting
#data links
beta = []
for i in alpha:
if inn in i:
beta.append(i)
#no repeats
gamma= []
for i in beta:
if i not in gamma:
gamma.append(i)
#making list of urls with Statistic labels
jan = []
for i in gamma:
try:
data = urlopen(i)
soup = BeautifulSoup(data, "html.parser")
for table in soup.find_all('section',{'class':'module-statistics-off-the-tee-details'}):
for j in table.find_all('h3'):
y=j.get_text().replace(" ","").replace("-","").replace(":","").replace(">","").replace("<","").replace(">","").replace(")","").replace("(","").replace("=","").replace("+","")
jan.append([i,str(y+'.csv')])
print([i,str(y+'.csv')])
except Exception as e:
print(e)
pass
# practice url
#jan = [['http://www.pgatour.com/stats/stat.02356.html', 'Last15EventsScoring.csv']]
#grabbing data
#write to csv
row_sp = []
rows_sp =[]
title1 = []
title = []
for i in jan:
try:
with open(i[1], 'w+') as fp:
writer = csv.writer(fp)
for y in years:
data = urlopen(i[0][:-4] +y+ end)
soup = BeautifulSoup(data, "html.parser")
data1 = urlopen(i[0])
soup1 = BeautifulSoup(data1, "html.parser")
for table in soup1.find_all('table',{'id':'statsTable'}):
title.append('year')
for k in table.find_all('tr'):
for n in k.find_all('th'):
title1.append(n.get_text())
for l in title1:
if l not in title:
title.append(l)
rows_sp.append(title)
for table in soup.find_all('table',{'id':'statsTable'}):
for h in table.find_all('tr'):
row_sp = [y]
for j in h.find_all('td'):
row_sp.append(j.get_text().replace(" ","").replace("\n","").replace("\xa0"," ").replace("d",""))
rows_sp.append(row_sp)
print(row_sp)
writer.writerows([row_sp])
except Exception as e:
print(e)
pass
dfs = [df1,df2,df3] # store dataframes in one list
df_merge = reduce(lambda left,right: pd.merge(left,right,on=['v1'], how='outer'), dfs)
The urls, stat types, desired format
the ... is just all of the stuff inbetween
trying to get the data on one row
urls for below data ['http://www.pgatour.com/stats/stat.02356.html','http://www.pgatour.com/stats/stat.02568.html',...,'http://www.pgatour.com/stats/stat.111.html']
Statistics Titles
LAST 15 EVENTS - SCORING, SG: APPROACH-THE-GREEN, ..., SAND SAVE PERCENTAGE
year rankthisweek ranklastweek name events rating rounds avg
2017 2 3 Rickie Fowler 10 8.8 62 .614
TOTAL SG:APP MEASURED ROUNDS .... % # SAVES # BUNKERS TOTAL O/U PAR
26.386 43 ....70.37 76 108 +7.00

UPDATE (per comments)
This question is partly about technical methods (Pandas merge()), but it also seems like an opportunity to discuss useful workflows for data collection and cleaning. As such I'm adding a bit more detail and explanation than what is strictly required for a coding solution.
You can basically use the same approach as my original answer to get data from different URL categories. I'd recommend keeping a list of {url:data} dicts as you iterate over your URL list, and then building cleaned data frames from that dict.
There's a little legwork involved in setting up the cleaning portion, as you need to adjust for the different columns in each URL category. I've demonstrated with a manual approach, using only a few tests URLs. But if you have, say, thousands of different URL categories, then you may need to think about how to collect and organize column names programmatically. That feels out of scope for this OP.
As long as you're sure there's a year and PLAYER NAME field in each URL, the following merge should work. As before, let's assume that you don't need to write to CSV, and for now let's leave off making any optimizations to your scraping code:
First, define the url categories in urls. By url category I'm referring to the fact that http://www.pgatour.com/stats/stat.02356.html will actually be used multiple times by inserting a series of years into the url itself, e.g.: http://www.pgatour.com/stats/stat.02356.2017.html, http://www.pgatour.com/stats/stat.02356.2016.html. In this example, stat.02356.html is the url category that contains information about multiple years of player data.
import pandas as pd
# test urls given by OP
# note: each url contains >= 1 data fields not shared by the others
urls = ['http://www.pgatour.com/stats/stat.02356.html',
'http://www.pgatour.com/stats/stat.02568.html',
'http://www.pgatour.com/stats/stat.111.html']
# we'll store data from each url category in this dict.
url_data = {}
Now iterate over urls. Within the urls loop, this code is all the same as my original answer, which in turn is coming from OP - only with some variable names adjusted to reflect our new capturing method.
for url in urls:
print("url: ", url)
url_data[url] = {"row_sp": [],
"rows_sp": [],
"title1": [],
"title": []}
try:
#with open(i[1], 'w+') as fp:
#writer = csv.writer(fp)
for y in years:
current_url = url[:-4] +y+ end
print("current url is: ", current_url)
data = urlopen(current_url)
soup = BeautifulSoup(data, "html.parser")
data1 = urlopen(url)
soup1 = BeautifulSoup(data1, "html.parser")
for table in soup1.find_all('table',{'id':'statsTable'}):
url_data[url]["title"].append('year')
for k in table.find_all('tr'):
for n in k.find_all('th'):
url_data[url]["title1"].append(n.get_text())
for l in url_data[url]["title1"]:
if l not in url_data[url]["title"]:
url_data[url]["title"].append(l)
url_data[url]["rows_sp"].append(url_data[url]["title"])
for table in soup.find_all('table',{'id':'statsTable'}):
for h in table.find_all('tr'):
url_data[url]["row_sp"] = [y]
for j in h.find_all('td'):
url_data[url]["row_sp"].append(j.get_text().replace(" ","").replace("\n","").replace("\xa0"," ").replace("d",""))
url_data[url]["rows_sp"].append(url_data[url]["row_sp"])
#print(row_sp)
#writer.writerows([row_sp])
except Exception as e:
print(e)
pass
Now for each key url in url_data, rows_sp contains the data you're interested in for that particular url category.
Note that rows_sp will now actually be url_data[url]["rows_sp"] when we iterate over url_data, but the next few code blocks are from my original answer, and so use the old rows_sp variable name.
# example rows_sp
[['year',
'RANK THIS WEEK',
'RANK LAST WEEK',
'PLAYER NAME',
'EVENTS',
'RATING',
'year',
'year',
'year',
'year'],
['2017'],
['2017', '1', '1', 'Sam Burns', '1', '9.2'],
['2017', '2', '3', 'Rickie Fowler', '10', '8.8'],
['2017', '2', '2', 'Dustin Johnson', '10', '8.8'],
['2017', '2', '3', 'Whee Kim', '2', '8.8'],
['2017', '2', '3', 'Thomas Pieters', '3', '8.8'],
...
]
Writing rows_sp directly to a data frame shows that the data aren't quite in the right format:
pd.DataFrame(rows_sp).head()
0 1 2 3 4 5 6 \
0 year RANK THIS WEEK RANK LAST WEEK PLAYER NAME EVENTS RATING year
1 2017 None None None None None None
2 2017 1 1 Sam Burns 1 9.2 None
3 2017 2 3 Rickie Fowler 10 8.8 None
4 2017 2 2 Dustin Johnson 10 8.8 None
7 8 9
0 year year year
1 None None None
2 None None None
3 None None None
4 None None None
pd.DataFrame(rows_sp).dtypes
0 object
1 object
2 object
3 object
4 object
5 object
6 object
7 object
8 object
9 object
dtype: object
With a little cleanup, we can get rows_sp into a data frame with appropriate numeric data types:
df = pd.DataFrame(rows_sp, columns=rows_sp[0]).drop(0)
df.columns = ["year","RANK THIS WEEK","RANK LAST WEEK",
"PLAYER NAME","EVENTS","RATING",
"year1","year2","year3","year4"]
df.drop(["year1","year2","year3","year4"], 1, inplace=True)
df = df.loc[df["PLAYER NAME"].notnull()]
df = df.loc[df.year != "year"]
num_cols = ["RANK THIS WEEK","RANK LAST WEEK","EVENTS","RATING"]
df[num_cols] = df[num_cols].apply(pd.to_numeric)
df.head()
year RANK THIS WEEK RANK LAST WEEK PLAYER NAME EVENTS RATING
2 2017 1 1.0 Sam Burns 1 9.2
3 2017 2 3.0 Rickie Fowler 10 8.8
4 2017 2 2.0 Dustin Johnson 10 8.8
5 2017 2 3.0 Whee Kim 2 8.8
6 2017 2 3.0 Thomas Pieters 3 8.8
UPDATED CLEANING
Now that we have a series of url categories to contend with, each with a different set of fields to clean, the above section gets a little more complicated. If you only have a few pages, it may be feasible to just visually review the fields for each category, and store them, like this:
cols = {'stat.02568.html':{'columns':['year', 'RANK THIS WEEK', 'RANK LAST WEEK',
'PLAYER NAME', 'ROUNDS', 'AVERAGE',
'TOTAL SG:APP', 'MEASURED ROUNDS',
'year1', 'year2', 'year3', 'year4'],
'numeric':['RANK THIS WEEK', 'RANK LAST WEEK', 'ROUNDS',
'AVERAGE', 'TOTAL SG:APP', 'MEASURED ROUNDS',]
},
'stat.111.html':{'columns':['year', 'RANK THIS WEEK', 'RANK LAST WEEK',
'PLAYER NAME', 'ROUNDS', '%', '# SAVES', '# BUNKERS',
'TOTAL O/U PAR', 'year1', 'year2', 'year3', 'year4'],
'numeric':['RANK THIS WEEK', 'RANK LAST WEEK', 'ROUNDS',
'%', '# SAVES', '# BUNKERS', 'TOTAL O/U PAR']
},
'stat.02356.html':{'columns':['year', 'RANK THIS WEEK', 'RANK LAST WEEK',
'PLAYER NAME', 'EVENTS', 'RATING',
'year1', 'year2', 'year3', 'year4'],
'numeric':['RANK THIS WEEK', 'RANK LAST WEEK',
'EVENTS', 'RATING']
}
}
And then you can loop over url_data again and store in a dfs collection:
dfs = {}
for url in url_data:
page = url.split("/")[-1]
colnames = cols[page]["columns"]
num_cols = cols[page]["numeric"]
rows_sp = url_data[url]["rows_sp"]
df = pd.DataFrame(rows_sp, columns=rows_sp[0]).drop(0)
df.columns = colnames
df.drop(["year1","year2","year3","year4"], 1, inplace=True)
df = df.loc[df["PLAYER NAME"].notnull()]
df = df.loc[df.year != "year"]
# tied ranks (e.g. "T9") mess up to_numeric; remove the tie indicators.
df["RANK THIS WEEK"] = df["RANK THIS WEEK"].str.replace("T","")
df["RANK LAST WEEK"] = df["RANK LAST WEEK"].str.replace("T","")
df[num_cols] = df[num_cols].apply(pd.to_numeric)
dfs[url] = df
At this point, we're ready to merge all the different data categories by year and PLAYER NAME. (You could actually have merged iteratively in the cleaning loop, but I'm separating here for demonstrative purposes.)
master = pd.DataFrame()
for url in dfs:
if master.empty:
master = dfs[url]
else:
master = master.merge(dfs[url], on=['year','PLAYER NAME'])
Now master contains the merged data for each player-year. Here's a view into the data, using groupby():
master.groupby(["PLAYER NAME", "year"]).first().head(4)
RANK THIS WEEK_x RANK LAST WEEK_x EVENTS RATING \
PLAYER NAME year
Aam Hawin 2015 66 66.0 7 8.2
2016 80 80.0 12 8.1
2017 72 45.0 8 8.2
Aam Scott 2013 45 45.0 10 8.2
RANK THIS WEEK_y RANK LAST WEEK_y ROUNDS_x AVERAGE \
PLAYER NAME year
Aam Hawin 2015 136 136 95 -0.183
2016 122 122 93 -0.061
2017 56 52 84 0.296
Aam Scott 2013 16 16 61 0.548
TOTAL SG:APP MEASURED ROUNDS RANK THIS WEEK \
PLAYER NAME year
Aam Hawin 2015 -14.805 81 86
2016 -5.285 87 39
2017 18.067 61 8
Aam Scott 2013 24.125 44 57
RANK LAST WEEK ROUNDS_y % # SAVES # BUNKERS \
PLAYER NAME year
Aam Hawin 2015 86 95 50.96 80 157
2016 39 93 54.78 86 157
2017 6 84 61.90 91 147
Aam Scott 2013 57 61 53.85 49 91
TOTAL O/U PAR
PLAYER NAME year
Aam Hawin 2015 47.0
2016 43.0
2017 27.0
Aam Scott 2013 11.0
You may want to do a bit more cleaning on the merged columns, as some are duplicated across data categories (e.g. ROUNDS_x and ROUNDS_y). From what I can tell, the duplicate field names seem to contain exactly the same information, so you might just drop the _y version of each one.

Develop Reference

Python is a programming language that lets you work quickly and integrate systems more effectively.

Web-scraping. Columns instead of rows - python

Related

Python Limit time to run pandas read_html

I have to create a dataframe from webscraping in a specific way

How can scrape the team names and goals from this site into a table? Ive been trying a few different methods but can't quite figure it out

scrape tennis results including tournament for each match row

Trouble merging Scraped data using Pandas and numpy in Python

Categories

Resources