Related
I am trying to retrieve football squad data from multiple Wikipedia pages and put it in a pandas DataFrame. One example of the source is this [link][1], but I want to do this for the tournaments between 1930 and 2018.
The code I will show used to work in Python 2, and I'm trying to adapt it to Python 3. Each page contains multiple tables with 7 columns, and all of the tables have the same format.
The code used to crash, but now it runs; the only problem is that it produces an empty .csv file.
To give more context, these are the specific changes I made:
Python 2
path = os.path.join('.cache', hashlib.md5(url).hexdigest() + '.html')
Python 3
path = os.path.join('.cache', hashlib.sha256(url.encode('utf-8')).hexdigest() + '.html')
Python 2
open(path, 'w') as fd:
Python 3
open(path, 'wb') as fd:
Python 2
years = range(1930,1939,4) + range(1950,2015,4)
Python 3 (yes, here I also changed the range so I could get World Cup 2018):
years = list(range(1930,1939,4)) + list(range(1950,2019,4))
This is the whole chunk of code. If somebody can spot where the problem is and suggest a solution, I would be very thankful.
import hashlib
import os
import requests
from bs4 import BeautifulSoup
import pandas as pd

if not os.path.exists('.cache'):
    os.makedirs('.cache')

ua = 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/34.0.1847.116 Safari/15612.1.29.41.4'
session = requests.Session()

def get(url):
    '''Return cached BeautifulSoup tree for url'''
    path = os.path.join('.cache', hashlib.sha256(url.encode('utf-8')).hexdigest() + '.html')
    if not os.path.exists(path):
        print(url)
        response = session.get(url, headers={'User-Agent': ua})
        with open(path, 'wb') as fd:
            fd.write(response.text.encode('utf-8'))
    return BeautifulSoup(open(path), 'html.parser')
def squads(url):
    result = []
    soup = get(url)
    year = url[29:33]
    for table in soup.find_all('table', 'sortable'):
        if "wikitable" not in table['class']:
            country = table.find_previous("span", "mw-headline").text
            for tr in table.find_all('tr')[1:]:
                cells = [td.text.strip() for td in tr.find_all('td')]
                cells += [country, td.a.get('title') if td.a else 'none', year]
                result.append(cells)
    return result
years = list(range(1930,1939,4)) + list(range(1950,2019,4))
result = []
for year in years:
    url = "http://en.wikipedia.org/wiki/" + str(year) + "_FIFA_World_Cup_squads"
    result += squads(url)

Final_result = pd.DataFrame(result)
Final_result.to_csv('/Users/home/Downloads/data.csv', index=False, encoding='iso-8859-1')
[1]: https://en.wikipedia.org/wiki/2018_FIFA_World_Cup_squads
To get information about each team for the years 1930-2018 you can use the next example:
import requests
import pandas as pd
from bs4 import BeautifulSoup

url = "https://en.wikipedia.org/wiki/{}_FIFA_World_Cup_squads"

dfs = []
for year in range(1930, 2019):
    print(year)
    soup = BeautifulSoup(requests.get(url.format(year)).content, "html.parser")
    tables = soup.find_all(
        lambda tag: tag.name == "table"
        and tag.select_one('th:-soup-contains("Pos.")')
    )
    for table in tables:
        for tag in table.select('[style="display:none"]'):
            tag.extract()
        df = pd.read_html(str(table))[0]
        df["Year"] = year
        df["Country"] = table.find_previous(["h3", "h2"]).span.text
        dfs.append(df)

df = pd.concat(dfs)
print(df)
df.to_csv("data.csv", index=False)
Prints:
...
13 14 FW Moussa Konaté 3 April 1993 (aged 25) 28 Amiens 2018 Senegal 10.0
14 15 FW Diafra Sakho 24 December 1989 (aged 28) 12 Rennes 2018 Senegal 3.0
15 16 GK Khadim N'Diaye 5 April 1985 (aged 33) 26 Horoya 2018 Senegal 0.0
16 17 MF Badou Ndiaye 27 October 1990 (aged 27) 20 Stoke City 2018 Senegal 1.0
17 18 FW Ismaïla Sarr 25 February 1998 (aged 20) 16 Rennes 2018 Senegal 3.0
18 19 FW M'Baye Niang 19 December 1994 (aged 23) 7 Torino 2018 Senegal 0.0
19 20 FW Keita Baldé 8 March 1995 (aged 23) 19 Monaco 2018 Senegal 3.0
20 21 DF Lamine Gassama 20 October 1989 (aged 28) 36 Alanyaspor 2018 Senegal 0.0
21 22 DF Moussa Wagué 4 October 1998 (aged 19) 10 Eupen 2018 Senegal 0.0
22 23 GK Alfred Gomis 5 September 1993 (aged 24) 1 SPAL 2018 Senegal 0.0
and saves data.csv.
Just tested: you have no data because the "wikitable" class is present in every table.
You can replace "not in" with "in":
if "wikitable" in table["class"]:
    ...
And your BeautifulSoup data will be there.
Once you change this condition, you will have a problem with this line:
cells += [country, td.a.get('title') if td.a else 'none', year]
This is because td is not defined there; it only exists inside the preceding list comprehension. I'm not quite sure what the aim of these lines is, but you can define the tds beforehand and use them afterwards:
tds = tr.find_all('td')
cells += ...
In general you can add breakpoints to your code to more easily identify where the problem is.
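Putting both fixes together on a tiny offline snippet (the HTML below is a hypothetical, trimmed imitation of one Wikipedia squad table, just to show the corrected condition and the tds binding):

```python
from bs4 import BeautifulSoup

# Hypothetical, trimmed imitation of one squad table on the real page.
html = """
<span class="mw-headline">Brazil</span>
<table class="wikitable sortable">
  <tr><th>No.</th><th>Player</th></tr>
  <tr><td>1</td><td><a title="Club A">Alisson</a></td></tr>
  <tr><td>9</td><td><a title="Club B">Gabriel Jesus</a></td></tr>
</table>
"""

soup = BeautifulSoup(html, "html.parser")
result = []
year = "2018"
for table in soup.find_all("table", "sortable"):
    if "wikitable" in table["class"]:          # corrected condition: "in", not "not in"
        country = table.find_previous("span", "mw-headline").text
        for tr in table.find_all("tr")[1:]:
            tds = tr.find_all("td")            # keep the td tags, not just their text
            cells = [td.text.strip() for td in tds]
            last = tds[-1]                     # e.g. pull the link title from the last cell
            cells += [country, last.a.get("title") if last.a else "none", year]
            result.append(cells)

print(result)
```

Which cell the link should come from is a guess here; adjust `tds[-1]` to whatever column actually holds the anchor on the real pages.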
I have this text file, Masterlist.txt, which looks something like this:
S1234567A|Jan Lee|Ms|05/10/1990|Software Architect|IT Department|98785432|PartTime|3500
S1234567B|Feb Tan|Mr|10/12/1991|Corporate Recruiter|HR Corporate Admin|98766432|PartTime|1500
S1234567C|Mark Lim|Mr|15/07/1992|Benefit Specialist|HR Corporate Admin|98265432|PartTime|2900
S1234567D|Apr Tan|Ms|20/01/1996|Payroll Administrator|HR Corporate Admin|91765432|FullTime|1600
S1234567E|May Ng|Ms|25/05/1994|Training Coordinator|HR Corporate Admin|98767432|Hourly|1200
S1234567Y|Lea Law|Ms|11/07/1994|Corporate Recruiter|HR Corporate Admin|94445432|PartTime|1600
I want to reduce the Salary (the number at the end of each line) by 50%, but only for lines that contain "PartTime" and a date after 1995, and then add the reduced salaries up.
Currently I only know how to select lines containing a given keyword; my code looks like this:
f = open("Masterlist.txt", "r")
for x in f:
    if "FullTime" in x:
        print(x)
How do I extract the Salary and reduce by 50% + add it up only if the year is after 1995?
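For reference, a plain-Python sketch of the original ask (no pandas): split each line on |, take the year from the date field, and sum the halved PartTime salaries. The last sample line below is made up, since none of the rows shown above actually has a date after 1995:

```python
# Sample lines from the file; the last one is hypothetical, added so the filter matches something.
lines = [
    "S1234567A|Jan Lee|Ms|05/10/1990|Software Architect|IT Department|98785432|PartTime|3500",
    "S1234567D|Apr Tan|Ms|20/01/1996|Payroll Administrator|HR Corporate Admin|91765432|FullTime|1600",
    "S1234567Z|Zoe Foo|Ms|03/03/1997|Intern|HR Corporate Admin|91234567|PartTime|2000",
]
# In the real script this would be: lines = open("Masterlist.txt")

total = 0.0
for x in lines:
    fields = x.strip().split("|")
    year = int(fields[3].split("/")[-1])   # year from the dd/mm/yyyy field
    salary = float(fields[-1])
    if fields[7] == "PartTime" and year > 1995:
        total += salary * 0.5              # reduce by 50%, then accumulate
print(total)  # 1000.0
```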
Try using the pandas library.
From your question I suppose you want to reduce Salary by 50% if the year is before 1995 (and the Type is PartTime), and otherwise increase it by 50%.
import pandas as pd
path = r'../Masterlist.txt' # path to your .txt file
df = pd.read_csv(path, sep='|', names = [0,1,2,'Date',4,5,6,'Type', 'Salary'], parse_dates=['Date'])
# Now column Date is treated as datetime object
print(df.head())
0 1 2 Date 4 \
0 S1234567A Jan Lee Ms 1990-05-10 Software Architect
1 S1234567B Feb Tan Mr 1991-10-12 Corporate Recruiter
2 S1234567C Mark Lim Mr 1992-07-15 Benefit Specialist
3 S1234567D Apr Tan Ms 1996-01-20 Payroll Administrator
4 S1234567E May Ng Ms 1994-05-25 Training Coordinator
5 6 Type Salary
0 IT Department 98785432 PartTime 3500
1 HR Corporate Admin 98766432 PartTime 1500
2 HR Corporate Admin 98265432 PartTime 2900
3 HR Corporate Admin 91765432 FullTime 1600
4 HR Corporate Admin 98767432 Hourly 1200
df.Salary = df.apply(lambda row: row.Salary*0.5 if row['Date'].year < 1995 and row['Type'] == 'PartTime' else row.Salary + (row.Salary*0.5 ), axis=1)
print(df.Salary.head())
0 1750.0
1 750.0
2 1450.0
3 2400.0
4 600.0
Name: Salary, dtype: float64
Add some modifications to the if/else statement inside the apply function if you want something different.
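As a side note, the same rule can be vectorized with a boolean mask instead of a row-wise apply; a minimal sketch with two made-up rows shaped like the file's data:

```python
import numpy as np
import pandas as pd

# Two invented rows in the same shape as the parsed file.
df = pd.DataFrame({
    "Date": pd.to_datetime(["05/10/1990", "20/01/1996"], format="%d/%m/%Y"),
    "Type": ["PartTime", "FullTime"],
    "Salary": [3500, 1600],
})
# Same condition as the apply above: halve if before 1995 and PartTime, else add 50%.
mask = (df["Date"].dt.year < 1995) & (df["Type"] == "PartTime")
df["Salary"] = np.where(mask, df["Salary"] * 0.5, df["Salary"] * 1.5)
print(df["Salary"].tolist())  # [1750.0, 2400.0]
```

On large files this avoids calling a Python lambda once per row.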
Background
I have five years of NO2 measurement data in csv files, one file for every location and year. I have loaded all the files into pandas dataframes in the same format:
Date Hour Location NO2_Level
0 01/01/2016 00 Street 18
1 01/01/2016 01 Street 39
2 01/01/2016 02 Street 129
3 01/01/2016 03 Street 76
4 01/01/2016 04 Street 40
Goal
For each dataframe count the number of times NO2_Level is greater than 150 and output this.
So I wrote a loop that creates all the dataframes from the right directories and cleans them appropriately.
Problem
Whatever I've tried produces results I know on inspection are incorrect, e.g.:
- the count value for every location in a given year is the same (possible but unlikely)
- for a year where I know the count should be a positive number, every location returns 0
What I've tried
I have tried a lot of approaches to getting this value for each dataframe, such as making the column a series:
NO2_Level = pd.Series(df['NO2_Level'])
count = (NO2_Level > 150).sum()
Using DataFrame.count():
count = df[df['NO2_Level'] >= 150].count()
These two approaches have come closest to what I want to output.
Example to test on
data = {'Date': ['01/01/2016','01/02/2016','01/03/2016','01/04/2016','01/05/2016'], 'Hour': ['00', '01', '02', '03', '04'], 'Location': ['Street','Street','Street','Street','Street'], 'NO2_Level': [18, 39, 129, 76, 40]}
df = pd.DataFrame(data=data)
NO2_Level = pd.Series(df['NO2_Level'])
count = (NO2_Level > 150).sum()
count
Expected Outputs
So from this I'm trying to get it to output a single line for each dataframe that was made in the format Location, year, count (of condition):
Kirkstall Road,2013,47
Haslewood Close,2013,97
...
Jack Lane Hunslet,2015,158
So the above example would produce
Street, 2016, 1
Actual
Every year produces the same result for each location, and for some years (2014) the count doesn't seem to work at all, when on inspection it should be positive:
Kirkstall Road,2013,47
Haslewood Close,2013,47
Tilbury Terrace,2013,47
Corn Exchange,2013,47
Temple Newsam,2014,0
Queen Street Morley,2014,0
Corn Exchange,2014,0
Tilbury Terrace,2014,0
Haslewood Close,2015,43
Tilbury Terrace,2015,43
Corn Exchange,2015,43
Jack Lane Hunslet,2015,43
Norman Rows,2015,43
Hopefully this helps.
import pandas as pd

ddict = {
    'Date': ['2016-01-01','2016-01-01','2016-01-01','2016-01-01','2016-01-01','2016-01-02'],
    'Hour': ['00','01','02','03','04','02'],
    'Location': ['Street','Street','Street','Street','Street','Street'],
    'N02_Level': [19,39,129,76,40,151],
}
df = pd.DataFrame(ddict)
# Convert dates to datetime
df['Date'] = pd.to_datetime(df['Date'])
# Make a Year column
df['Year'] = df['Date'].apply(lambda x: x.strftime('%Y'))
# Group by location and year, count rows with N02_Level > 150
df1 = df[df['N02_Level'] > 150].groupby(['Location','Year']).size().reset_index(name='Count')
# Iterate the results
for i in range(len(df1)):
    loc = df1['Location'][i]
    yr = df1['Year'][i]
    cnt = df1['Count'][i]
    print(f'{loc},{yr},{cnt}')

# To not use f-strings:
for i in range(len(df1)):
    print('{loc},{yr},{cnt}'.format(loc=df1['Location'][i], yr=df1['Year'][i], cnt=df1['Count'][i]))
Sample data:
Date Hour Location N02_Level
0 2016-01-01 00 Street 19
1 2016-01-01 01 Street 39
2 2016-01-01 02 Street 129
3 2016-01-01 03 Street 76
4 2016-01-01 04 Street 40
5 2016-01-02 02 Street 151
Output:
Street,2016,1
Here is a solution with a randomly generated sample:
import numpy as np
import pandas as pd

def random_dates(start, end, n):
    start_u = start.value // 10 ** 9
    end_u = end.value // 10 ** 9
    return pd.to_datetime(np.random.randint(start_u, end_u, n), unit='s')

location = ['street', 'avenue', 'road', 'town', 'campaign']
df = pd.DataFrame({'Date': random_dates(pd.to_datetime('2015-01-01'), pd.to_datetime('2018-12-31'), 20),
                   'Location': np.random.choice(location, 20),
                   'NOE_level': np.random.randint(low=130, high=200, size=20)})

# Keep only the year for Date
df['Date'] = df['Date'].dt.strftime("%Y")
print(df)

df = df.groupby(['Location', 'Date'])['NOE_level'].apply(lambda x: (x > 150).sum()).reset_index(name='count')
print(df)
Example df generated:
Date Location NOE_level
0 2018 town 191
1 2017 campaign 187
2 2017 town 137
3 2016 avenue 148
4 2017 campaign 195
5 2018 town 181
6 2018 road 187
7 2018 town 184
8 2016 town 155
9 2016 street 183
10 2018 road 136
11 2017 road 171
12 2018 street 165
13 2015 avenue 193
14 2016 campaign 170
15 2016 street 132
16 2016 campaign 165
17 2015 road 161
18 2018 road 161
19 2015 road 140
output:
Location Date count
0 avenue 2015 1
1 avenue 2016 0
2 campaign 2016 2
3 campaign 2017 2
4 road 2015 1
5 road 2017 1
6 road 2018 2
7 street 2016 1
8 street 2018 1
9 town 2016 1
10 town 2017 0
11 town 2018 3
I am new to Python and pandas, and I am struggling to figure out how to pull out the 10 counties with the most water used for irrigation in 2014.
%matplotlib inline
import csv
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
data = pd.read_csv('info.csv')  # reads csv
data['Year'] = pd.to_datetime(data['Year'], format='%Y')  # converts string to datetime
data.index = data['Year']  # makes year the index
del data['Year']  # delete the duplicate year column
This is what the data looks like (this is only part of the data):
County WUCode RegNo Year SourceCode SourceID Annual CountyName
1 IR 311 2014 WELL 1 946 Adams
1 IN 311 2014 INTAKE 1 268056 Adams
1 IN 312 2014 WELL 1 48 Adams
1 IN 312 2014 WELL 2 96 Adams
1 IR 312 2014 INTAKE 1 337968 Adams
3 IR 315 2014 WELL 5 81900 Putnam
3 PS 315 2014 WELL 6 104400 Putnam
I have a couple questions:
I am not sure how to pull out only the "IR" rows in the WUCode column with pandas, and I am not sure how to print out a table with the 10 counties with the highest water usage for irrigation in 2014.
I have been able to use the .loc function to pull out the information I need, with something like this:
data.loc['2014', ['CountyName', 'Annual', 'WUCode']]
From here I am kind of lost. Help would be appreciated!
import numpy as np
import pandas as pd
import string
df = pd.DataFrame(data={"Annual": np.random.randint(20, 1000000, 1000),
"Year": np.random.randint(2012, 2016, 1000),
"CountyName": np.random.choice(list(string.ascii_letters), 1000)},
columns=["Annual", "Year", "CountyName"])
Say df looks like:
Annual Year CountyName
0 518966 2012 s
1 44511 2013 E
2 332010 2012 e
3 382168 2013 c
4 202816 2013 y
For the year 2014...
df[df['Year'] == 2014]
Group by CountyName...
df[df['Year'] == 2014].groupby("CountyName")
Look at Annual...
df[df['Year'] == 2014].groupby("CountyName")["Annual"]
Get the sum...
df[df['Year'] == 2014].groupby("CountyName")["Annual"].sum()
Sort the result descending...
df[df['Year'] == 2014].groupby("CountyName")["Annual"].sum().sort_values(ascending=False)
Take the top 10...
df[df['Year'] == 2014].groupby("CountyName")["Annual"].sum().sort_values(ascending=False).head(10)
This example prints out (your actual result may vary since my data was random):
CountyName
Q 5191814
y 4335358
r 4315072
f 3985170
A 3685844
a 3583360
S 3301817
I 3231621
t 3228578
u 3164965
This may work for you:
res = df[df['WUCode'] == 'IR'].groupby(['Year', 'CountyName'])['Annual'].sum()\
.reset_index()\
.sort_values('Annual', ascending=False)\
.head(10)
# Year CountyName Annual
# 0 2014 Adams 338914
# 1 2014 Putnam 81900
Explanation
Filter by WUCode, as required, and groupby Year and CountyName.
Use reset_index so your result is a dataframe rather than a series.
Use sort_values and extract top 10 via pd.DataFrame.head.
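As a small variation, `nlargest` can replace the sort-then-head step; a sketch on made-up numbers shaped like the sample data:

```python
import pandas as pd

# Invented rows in the same shape as the question's data.
df = pd.DataFrame({
    "Year": [2014, 2014, 2014, 2013],
    "WUCode": ["IR", "IR", "PS", "IR"],
    "CountyName": ["Adams", "Putnam", "Putnam", "Adams"],
    "Annual": [338914, 81900, 104400, 999],
})
# Filter to irrigation in 2014, total by county, then take the 10 largest.
top10 = (df[(df["WUCode"] == "IR") & (df["Year"] == 2014)]
         .groupby("CountyName")["Annual"].sum()
         .nlargest(10))
print(top10)
```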
I am trying to collect information from a lot of different urls and combine the data based on the year and golfer name. As of now I am trying to write the information to csv and then match it using pd.merge(), but then I have to use a unique name for each dataframe to merge. I tried to use a numpy array, but I am stuck on the final step of merging all the separate data.
import csv
from urllib.request import urlopen
from bs4 import BeautifulSoup
import datetime
import socket
import urllib.error
import pandas as pd
import urllib
import sqlalchemy
import numpy as np
base = 'http://www.pgatour.com/'
inn = 'stats/stat'
end = '.html'
years = ['2017','2016','2015','2014','2013']
alpha = []
#all pages with links to tables
urls = ['http://www.pgatour.com/stats.html','http://www.pgatour.com/stats/categories.ROTT_INQ.html','http://www.pgatour.com/stats/categories.RAPP_INQ.html','http://www.pgatour.com/stats/categories.RARG_INQ.html','http://www.pgatour.com/stats/categories.RPUT_INQ.html','http://www.pgatour.com/stats/categories.RSCR_INQ.html','http://www.pgatour.com/stats/categories.RSTR_INQ.html','http://www.pgatour.com/stats/categories.RMNY_INQ.html','http://www.pgatour.com/stats/categories.RPTS_INQ.html']
for i in urls:
    data = urlopen(i)
    soup = BeautifulSoup(data, "html.parser")
    for link in soup.find_all('a'):
        if link.has_attr('href'):
            alpha.append(base + link['href'][17:]) #may need adjusting
#data links
beta = []
for i in alpha:
    if inn in i:
        beta.append(i)

#no repeats
gamma = []
for i in beta:
    if i not in gamma:
        gamma.append(i)
#making list of urls with Statistic labels
jan = []
for i in gamma:
    try:
        data = urlopen(i)
        soup = BeautifulSoup(data, "html.parser")
        for table in soup.find_all('section', {'class':'module-statistics-off-the-tee-details'}):
            for j in table.find_all('h3'):
                y = j.get_text().replace(" ","").replace("-","").replace(":","").replace(">","").replace("<","").replace(")","").replace("(","").replace("=","").replace("+","")
                jan.append([i, str(y+'.csv')])
                print([i, str(y+'.csv')])
    except Exception as e:
        print(e)
        pass
# practice url
#jan = [['http://www.pgatour.com/stats/stat.02356.html', 'Last15EventsScoring.csv']]

#grabbing data
#write to csv
row_sp = []
rows_sp = []
title1 = []
title = []
for i in jan:
    try:
        with open(i[1], 'w+') as fp:
            writer = csv.writer(fp)
            for y in years:
                data = urlopen(i[0][:-4] + y + end)
                soup = BeautifulSoup(data, "html.parser")
                data1 = urlopen(i[0])
                soup1 = BeautifulSoup(data1, "html.parser")
                for table in soup1.find_all('table', {'id':'statsTable'}):
                    title.append('year')
                    for k in table.find_all('tr'):
                        for n in k.find_all('th'):
                            title1.append(n.get_text())
                    for l in title1:
                        if l not in title:
                            title.append(l)
                    rows_sp.append(title)
                for table in soup.find_all('table', {'id':'statsTable'}):
                    for h in table.find_all('tr'):
                        row_sp = [y]
                        for j in h.find_all('td'):
                            row_sp.append(j.get_text().replace(" ","").replace("\n","").replace("\xa0"," ").replace("d",""))
                        rows_sp.append(row_sp)
                        print(row_sp)
                        writer.writerows([row_sp])
    except Exception as e:
        print(e)
        pass
from functools import reduce

dfs = [df1, df2, df3]  # store dataframes in one list
df_merge = reduce(lambda left, right: pd.merge(left, right, on=['v1'], how='outer'), dfs)
The urls, stat types, and desired format are below; the ... is just all of the stuff in between. I'm trying to get the data onto one row.
urls for below data ['http://www.pgatour.com/stats/stat.02356.html','http://www.pgatour.com/stats/stat.02568.html',...,'http://www.pgatour.com/stats/stat.111.html']
Statistics Titles
LAST 15 EVENTS - SCORING, SG: APPROACH-THE-GREEN, ..., SAND SAVE PERCENTAGE
year rankthisweek ranklastweek name events rating rounds avg
2017 2 3 Rickie Fowler 10 8.8 62 .614
TOTAL SG:APP MEASURED ROUNDS .... % # SAVES # BUNKERS TOTAL O/U PAR
26.386 43 ....70.37 76 108 +7.00
UPDATE (per comments)
This question is partly about technical methods (Pandas merge()), but it also seems like an opportunity to discuss useful workflows for data collection and cleaning. As such I'm adding a bit more detail and explanation than what is strictly required for a coding solution.
You can basically use the same approach as my original answer to get data from different URL categories. I'd recommend keeping a list of {url:data} dicts as you iterate over your URL list, and then building cleaned data frames from that dict.
There's a little legwork involved in setting up the cleaning portion, as you need to adjust for the different columns in each URL category. I've demonstrated with a manual approach, using only a few test URLs. But if you have, say, thousands of different URL categories, then you may need to think about how to collect and organize column names programmatically. That feels out of scope for this OP.
As long as you're sure there's a year and PLAYER NAME field in each URL, the following merge should work. As before, let's assume that you don't need to write to CSV, and for now let's leave off making any optimizations to your scraping code:
First, define the url categories in urls. By url category I'm referring to the fact that http://www.pgatour.com/stats/stat.02356.html will actually be used multiple times by inserting a series of years into the url itself, e.g.: http://www.pgatour.com/stats/stat.02356.2017.html, http://www.pgatour.com/stats/stat.02356.2016.html. In this example, stat.02356.html is the url category that contains information about multiple years of player data.
import pandas as pd
# test urls given by OP
# note: each url contains >= 1 data fields not shared by the others
urls = ['http://www.pgatour.com/stats/stat.02356.html',
'http://www.pgatour.com/stats/stat.02568.html',
'http://www.pgatour.com/stats/stat.111.html']
# we'll store data from each url category in this dict.
url_data = {}
Now iterate over urls. Within the loop, this code is all the same as my original answer (which in turn came from the OP), only with some variable names adjusted to reflect our new capturing method.
for url in urls:
    print("url: ", url)
    url_data[url] = {"row_sp": [],
                     "rows_sp": [],
                     "title1": [],
                     "title": []}
    try:
        #with open(i[1], 'w+') as fp:
        #writer = csv.writer(fp)
        for y in years:
            current_url = url[:-4] + y + end
            print("current url is: ", current_url)
            data = urlopen(current_url)
            soup = BeautifulSoup(data, "html.parser")
            data1 = urlopen(url)
            soup1 = BeautifulSoup(data1, "html.parser")
            for table in soup1.find_all('table', {'id':'statsTable'}):
                url_data[url]["title"].append('year')
                for k in table.find_all('tr'):
                    for n in k.find_all('th'):
                        url_data[url]["title1"].append(n.get_text())
                for l in url_data[url]["title1"]:
                    if l not in url_data[url]["title"]:
                        url_data[url]["title"].append(l)
                url_data[url]["rows_sp"].append(url_data[url]["title"])
            for table in soup.find_all('table', {'id':'statsTable'}):
                for h in table.find_all('tr'):
                    url_data[url]["row_sp"] = [y]
                    for j in h.find_all('td'):
                        url_data[url]["row_sp"].append(j.get_text().replace(" ","").replace("\n","").replace("\xa0"," ").replace("d",""))
                    url_data[url]["rows_sp"].append(url_data[url]["row_sp"])
                    #print(row_sp)
                    #writer.writerows([row_sp])
    except Exception as e:
        print(e)
        pass
Now for each key url in url_data, rows_sp contains the data you're interested in for that particular url category.
Note that rows_sp will now actually be url_data[url]["rows_sp"] when we iterate over url_data, but the next few code blocks are from my original answer, and so use the old rows_sp variable name.
# example rows_sp
[['year',
'RANK THIS WEEK',
'RANK LAST WEEK',
'PLAYER NAME',
'EVENTS',
'RATING',
'year',
'year',
'year',
'year'],
['2017'],
['2017', '1', '1', 'Sam Burns', '1', '9.2'],
['2017', '2', '3', 'Rickie Fowler', '10', '8.8'],
['2017', '2', '2', 'Dustin Johnson', '10', '8.8'],
['2017', '2', '3', 'Whee Kim', '2', '8.8'],
['2017', '2', '3', 'Thomas Pieters', '3', '8.8'],
...
]
Writing rows_sp directly to a data frame shows that the data aren't quite in the right format:
pd.DataFrame(rows_sp).head()
0 1 2 3 4 5 6 \
0 year RANK THIS WEEK RANK LAST WEEK PLAYER NAME EVENTS RATING year
1 2017 None None None None None None
2 2017 1 1 Sam Burns 1 9.2 None
3 2017 2 3 Rickie Fowler 10 8.8 None
4 2017 2 2 Dustin Johnson 10 8.8 None
7 8 9
0 year year year
1 None None None
2 None None None
3 None None None
4 None None None
pd.DataFrame(rows_sp).dtypes
0 object
1 object
2 object
3 object
4 object
5 object
6 object
7 object
8 object
9 object
dtype: object
With a little cleanup, we can get rows_sp into a data frame with appropriate numeric data types:
df = pd.DataFrame(rows_sp, columns=rows_sp[0]).drop(0)
df.columns = ["year","RANK THIS WEEK","RANK LAST WEEK",
"PLAYER NAME","EVENTS","RATING",
"year1","year2","year3","year4"]
df.drop(["year1","year2","year3","year4"], 1, inplace=True)
df = df.loc[df["PLAYER NAME"].notnull()]
df = df.loc[df.year != "year"]
num_cols = ["RANK THIS WEEK","RANK LAST WEEK","EVENTS","RATING"]
df[num_cols] = df[num_cols].apply(pd.to_numeric)
df.head()
year RANK THIS WEEK RANK LAST WEEK PLAYER NAME EVENTS RATING
2 2017 1 1.0 Sam Burns 1 9.2
3 2017 2 3.0 Rickie Fowler 10 8.8
4 2017 2 2.0 Dustin Johnson 10 8.8
5 2017 2 3.0 Whee Kim 2 8.8
6 2017 2 3.0 Thomas Pieters 3 8.8
UPDATED CLEANING
Now that we have a series of url categories to contend with, each with a different set of fields to clean, the above section gets a little more complicated. If you only have a few pages, it may be feasible to just visually review the fields for each category, and store them, like this:
cols = {'stat.02568.html':{'columns':['year', 'RANK THIS WEEK', 'RANK LAST WEEK',
'PLAYER NAME', 'ROUNDS', 'AVERAGE',
'TOTAL SG:APP', 'MEASURED ROUNDS',
'year1', 'year2', 'year3', 'year4'],
'numeric':['RANK THIS WEEK', 'RANK LAST WEEK', 'ROUNDS',
'AVERAGE', 'TOTAL SG:APP', 'MEASURED ROUNDS',]
},
'stat.111.html':{'columns':['year', 'RANK THIS WEEK', 'RANK LAST WEEK',
'PLAYER NAME', 'ROUNDS', '%', '# SAVES', '# BUNKERS',
'TOTAL O/U PAR', 'year1', 'year2', 'year3', 'year4'],
'numeric':['RANK THIS WEEK', 'RANK LAST WEEK', 'ROUNDS',
'%', '# SAVES', '# BUNKERS', 'TOTAL O/U PAR']
},
'stat.02356.html':{'columns':['year', 'RANK THIS WEEK', 'RANK LAST WEEK',
'PLAYER NAME', 'EVENTS', 'RATING',
'year1', 'year2', 'year3', 'year4'],
'numeric':['RANK THIS WEEK', 'RANK LAST WEEK',
'EVENTS', 'RATING']
}
}
And then you can loop over url_data again and store in a dfs collection:
dfs = {}
for url in url_data:
page = url.split("/")[-1]
colnames = cols[page]["columns"]
num_cols = cols[page]["numeric"]
rows_sp = url_data[url]["rows_sp"]
df = pd.DataFrame(rows_sp, columns=rows_sp[0]).drop(0)
df.columns = colnames
df.drop(["year1","year2","year3","year4"], 1, inplace=True)
df = df.loc[df["PLAYER NAME"].notnull()]
df = df.loc[df.year != "year"]
# tied ranks (e.g. "T9") mess up to_numeric; remove the tie indicators.
df["RANK THIS WEEK"] = df["RANK THIS WEEK"].str.replace("T","")
df["RANK LAST WEEK"] = df["RANK LAST WEEK"].str.replace("T","")
df[num_cols] = df[num_cols].apply(pd.to_numeric)
dfs[url] = df
At this point, we're ready to merge all the different data categories by year and PLAYER NAME. (You could actually have merged iteratively in the cleaning loop, but I'm separating here for demonstrative purposes.)
master = pd.DataFrame()
for url in dfs:
    if master.empty:
        master = dfs[url]
    else:
        master = master.merge(dfs[url], on=['year','PLAYER NAME'])
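Incidentally, that accumulating loop is equivalent to the functools.reduce pattern sketched in the question; a toy version with two made-up frames standing in for the cleaned data:

```python
from functools import reduce
import pandas as pd

# Two stand-ins for the cleaned frames in dfs (names and numbers invented).
a = pd.DataFrame({"year": ["2017", "2017"], "PLAYER NAME": ["A", "B"], "RATING": [9.2, 8.8]})
b = pd.DataFrame({"year": ["2017", "2017"], "PLAYER NAME": ["A", "B"], "ROUNDS": [10, 12]})

# Pairwise inner merge on the shared keys, folded over the list of frames.
master = reduce(lambda left, right: left.merge(right, on=["year", "PLAYER NAME"]), [a, b])
print(master.shape)  # (2, 4)
```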
Now master contains the merged data for each player-year. Here's a view into the data, using groupby():
master.groupby(["PLAYER NAME", "year"]).first().head(4)
RANK THIS WEEK_x RANK LAST WEEK_x EVENTS RATING \
PLAYER NAME year
Aam Hawin 2015 66 66.0 7 8.2
2016 80 80.0 12 8.1
2017 72 45.0 8 8.2
Aam Scott 2013 45 45.0 10 8.2
RANK THIS WEEK_y RANK LAST WEEK_y ROUNDS_x AVERAGE \
PLAYER NAME year
Aam Hawin 2015 136 136 95 -0.183
2016 122 122 93 -0.061
2017 56 52 84 0.296
Aam Scott 2013 16 16 61 0.548
TOTAL SG:APP MEASURED ROUNDS RANK THIS WEEK \
PLAYER NAME year
Aam Hawin 2015 -14.805 81 86
2016 -5.285 87 39
2017 18.067 61 8
Aam Scott 2013 24.125 44 57
RANK LAST WEEK ROUNDS_y % # SAVES # BUNKERS \
PLAYER NAME year
Aam Hawin 2015 86 95 50.96 80 157
2016 39 93 54.78 86 157
2017 6 84 61.90 91 147
Aam Scott 2013 57 61 53.85 49 91
TOTAL O/U PAR
PLAYER NAME year
Aam Hawin 2015 47.0
2016 43.0
2017 27.0
Aam Scott 2013 11.0
You may want to do a bit more cleaning on the merged columns, as some are duplicated across data categories (e.g. ROUNDS_x and ROUNDS_y). From what I can tell, the duplicate field names seem to contain exactly the same information, so you might just drop the _y version of each one.
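One way to do that cleanup, assuming each `_x`/`_y` pair really is identical: drop the `_y` columns and strip the `_x` suffix (toy frame with invented values):

```python
import pandas as pd

# Toy merged frame with one duplicated column pair (values invented).
master = pd.DataFrame({
    "PLAYER NAME": ["Aam Hawin"],
    "ROUNDS_x": [95],
    "ROUNDS_y": [95],
    "RATING": [8.2],
})
# Drop the _y duplicates, then rename the surviving _x columns.
master = master.drop(columns=[c for c in master.columns if c.endswith("_y")])
master.columns = [c[:-2] if c.endswith("_x") else c for c in master.columns]
print(master.columns.tolist())  # ['PLAYER NAME', 'ROUNDS', 'RATING']
```

Verify the pairs really match first (e.g. with `(master["ROUNDS_x"] == master["ROUNDS_y"]).all()`) before dropping anything.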