I have to create a dataframe from webscraping in a specific way

I have to create a dataframe from webscraping in a specific way - python

I have to create a dataframe in python by creating a bunch of lists from a table in a wikipedia article.
code:
from urllib.request import urlopen
from bs4 import BeautifulSoup
import csv
import pandas as pd
import numpy as np
url = "https://en.wikipedia.org/wiki/Texas_Killing_Fields"
page = urlopen(url)
soup = BeautifulSoup(page, 'html.parser')
all_tables = soup.find_all('table')
all_sortable_tables = soup.find_all('table', class_='wikitable sortable')
right_table = all_sortable_tables
A = []
B = []
C = []
D = []
E = []
for row in right_table.find_all('tr'):
cells = row.find_all('td')
if len(cells) == 5:
row.strip('\n')
A.append(cells[0].find(text=True))
B.append(cells[1].find(text=True))
C.append(cells[2].find(text=True))
D.append(cells[3].find(text=True))
E.append(cells[4].find(text=True))
df = pd.DataFrame(A, columns=['Victim'])
df['Victim'] = A
df['Age'] = B
df['Residence'] = C
df['Last Seen'] = D
df['Discovered'] = E
I keep getting an attribute error "ResultSet object has no attribute 'find_all'. You're probably treating a list of elements like a single element. Did you call find_all() when you meant to call find()?"
I have tried a bunch of methods and nothing has helped me. I'm also following a tutorial the teacher gave us and its not helpful either.
tutorial: https://alanhylands.com/how-to-web-scrape-wikipedia-python-urllib-beautiful-soup-pandas/#heading-10.-loop-through-the-rows
first time here btw as a questioner.

Note: As mentioned by #ggorlen using an existig api would be the best approache. I also would recommend to use a more structured approache to store your data to avoid these bunch of lists.
data = []
for row in soup.select('table.wikitable.sortable tr:has(td)'):
data.append(
dict(
zip([h.text.strip() for h in soup.select('table.wikitable.sortable tr th')[:5]],
[c.text.strip() for c in row.select('td')][:5])
)
)
pd.DataFrame(data)
Just an alternative approach to scrape tables using pandas.read_html() cause you already imported pandas. It also uses BeautifulSoup and is doing the job for you:
import pandas as pd
df = pd.read_html('https://en.wikipedia.org/wiki/Texas_Killing_Fields')[1]
df.iloc[:,:5] ### displays only the first 5 columns as in your example
Output:
Victim
Age
Residence
Last seen
Discovered
Brenda Jones
14
Galveston, Texas
July 1, 1971
July 2, 1971
Colette Wilson
13
Alvin, Texas
June 17, 1971
November 26, 1971
Rhonda Johnson
14
Webster, Texas
August 4, 1971
January 3, 1972
Sharon Shaw
13
Webster, Texas
August 4, 1971
January 3, 1972
Gloria Gonzales
19
Houston, Texas
October 28, 1971
November 23, 1971
...

Related

Returning only last item and splitting into columns

I'm having a couple of issues - I seem to be only returning the last item on this list. Can someone help me here please? I also want to split the df into columns filtering all of the postcodes into one column. Not sure where to start with this. Help much appreciated. Many thanks in advance!
import requests
from bs4 import BeautifulSoup
import pandas as pd
URL = "https://www.matki.co.uk/matki-dealers/"
page = requests.get(URL)
soup = BeautifulSoup(page.content, "html.parser")
results = soup.find(class_="dealer-overview")
company_elements = results.find_all("article")
for company_element in company_elements:
company_info = company_element.getText(separator=u', ').replace('Find out more »', '')
print (company_info)
data = {company_info}
df = pd.DataFrame(data)
df.shape
df

IIUC, you need to replace the loop with:
df = pd.DataFrame({'info': [e.getText(separator=u', ')
.replace('Find out more »', '')
for e in company_elements]})
output:
info
0 ESP Bathrooms & Interiors, Queens Retail Park,...
1 Paul Scarr & Son Ltd, Supreme Centre, Haws Hil...
2 Stonebridge Interiors, 19 Main Street, Pontela...
3 Bathe Distinctive Bathrooms, 55 Pottery Road, ...
4 Draw A Bath Ltd, 68 Telegraph Road, Heswall, W...
.. ...
346 Warren Keys, Unit B Carrs Lane, Tromode, Dougl...
347 Haldane Fisher, Isle of Man Business Park, Coo...
348 David Scott (Agencies) Ltd, Supreme Centre, 11...
349 Ballycastle Homecare Ltd, 2 The Diamond, Bally...
350 Beggs & Partners, Great Patrick Street, Belfas...
[351 rows x 1 columns]

Python Limit time to run pandas read_html

I am trying to limit the time for running dfs = pd.read_html(str(response.text)). Once it runs for more than 5 seconds, it will stop running for this url and move to running the next url. I did not find out timeout attribute in pd.readhtml. So how can I do that?
from bs4 import BeautifulSoup
import re
import requests
import os
import time
from pandas import DataFrame
import pandas as pd
from urllib.request import urlopen
headers = {'User-Agent': 'regsre#jh.edu'}
urls={'https://www.sec.gov/Archives/edgar/data/1058307/0001493152-21-003451.txt', 'https://www.sec.gov/Archives/edgar/data/1064722/0001760319-21-000006.txt'}
for url in urls:
response = requests.get(url, headers = headers)
response.raise_for_status()
time.sleep(0.1)
dfs = pd.read_html(str(response.text))
print(url)
for item in dfs:
try:
Operation=(item[0].apply(str).str.contains('Revenue') | item[0].apply(str).str.contains('profit'))
if Operation.empty:
pass
if Operation.any():
Operation_sheet=item
if not Operation.any():
CashFlows=(item[0].apply(str).str.contains('income') | item[0].apply(str).str.contains('loss'))
if CashFlows.any():
Operation_sheet=item
if not CashFlows.any():
pass

I'm not certain what the issue is, but pandas seems to get overwhelmed by this file. If we utilize BeautifulSoup to instead search for tables, prettify them, and pass those to pd.read_html(), then it seems to be able to handle things just fine.
from bs4 import BeautifulSoup
import requests
import pandas as pd
headers = {'User-Agent': 'regsre#jh.edu'}
url = 'https://www.sec.gov/Archives/edgar/data/1064722/0001760319-21-000006.txt'
r = requests.get(url, headers=headers)
soup = BeautifulSoup(r.text)
dfs = []
for table in soup.find_all('table'):
dfs.extend(pd.read_html(table.prettify()))
# Printing the first few:
for df in dfs[0:3]:
print(df, '\n')
0 1 2 3 4
0 Nevada NaN 4813 NaN 65-0783722
1 (State or other jurisdiction of NaN (Primary Standard Industrial NaN (I.R.S. Employer
2 incorporation or organization) NaN Classification Code Number) NaN Identification Number)
0
0 Ralph V. De Martino, Esq.
1 Alec Orudjev, Esq.
2 Schiff Hardin LLP
3 901 K Street, NW, Suite 700
4 Washington, DC 20001
5 Phone (202) 778-6400
6 Fax: (202) 778-6460
0 1
0 Large accelerated filer [ ] Accelerated filer [ ]
1 NaN NaN
2 Non-accelerated filer [X] Smaller reporting company [X]
3 NaN NaN
4 NaN Emerging growth company [ ]

BeautifulSoup scraper doesn't retrieve any information

I am trying to retrieve football squads data from multiple wikipedia pages and put it in a Pandas Data frame. One example of the source is this [link][1], but I want too do this for links between 1930-2018.
The code that I will show used to work in Python 2 and I'm trying to adapt it to Python 3. The information in every page are multiple tables with 7 columns. All of the tables have the same format.
The code used to crash but now is running. The only problem is that it throws an empty .csv file.
Just to put more context I made some specific changes :
Python 2
path = os.path.join('.cache', hashlib.md5(url).hexdigest() + '.html')
Python 3
path = os.path.join('.cache', hashlib.sha256(url.encode('utf-8')).hexdigest() + '.html')
Python 2
open(path, 'w') as fd:
Python 3
open(path, 'wb') as fd:
Python 2
years = range(1930,1939,4) + range(1950,2015,4)
Python 3: Yes here I also changed the range so I could get World Cup 2018
years = list(range(1930,1939,4)) + list(range(1950,2019,4))
This is the whole chunk of code. If somebody can spot where in the world is the problem and give a solution I would be very thankful.
import hashlib
import requests
from bs4 import BeautifulSoup
import pandas as pd
if not os.path.exists('.cache'):
os.makedirs('.cache')
ua = 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/34.0.1847.116 Safari/15612.1.29.41.4'
session = requests.Session()
def get(url):
'''Return cached lxml tree for url'''
path = os.path.join('.cache', hashlib.sha256(url.encode('utf-8')).hexdigest() + '.html')
if not os.path.exists(path):
print(url)
response = session.get(url, headers={'User-Agent': ua})
with open(path, 'wb') as fd:
fd.write(response.text.encode('utf-8'))
return BeautifulSoup(open(path), 'html.parser')
def squads(url):
result = []
soup = get(url)
year = url[29:33]
for table in soup.find_all('table','sortable'):
if "wikitable" not in table['class']:
country = table.find_previous("span","mw-headline").text
for tr in table.find_all('tr')[1:]:
cells = [td.text.strip() for td in tr.find_all('td')]
cells += [country, td.a.get('title') if td.a else 'none', year]
result.append(cells)
return result
years = list(range(1930,1939,4)) + list(range(1950,2019,4))
result = []
for year in years:
url = "http://en.wikipedia.org/wiki/"+str(year)+"_FIFA_World_Cup_squads"
result += squads(url)
Final_result = pd.DataFrame(result)
Final_result.to_csv('/Users/home/Downloads/data.csv', index=False, encoding='iso-8859-1')```
[1]: https://en.wikipedia.org/wiki/2018_FIFA_World_Cup_squads

To get information about each team for years 1930-2018 you can use next example:
import requests
import pandas as pd
from bs4 import BeautifulSoup
url = "https://en.wikipedia.org/wiki/{}_FIFA_World_Cup_squads"
dfs = []
for year in range(1930, 2019):
print(year)
soup = BeautifulSoup(requests.get(url.format(year)).content, "html.parser")
tables = soup.find_all(
lambda tag: tag.name == "table"
and tag.select_one('th:-soup-contains("Pos.")')
)
for table in tables:
for tag in table.select('[style="display:none"]'):
tag.extract()
df = pd.read_html(str(table))[0]
df["Year"] = year
df["Country"] = table.find_previous(["h3", "h2"]).span.text
dfs.append(df)
df = pd.concat(dfs)
print(df)
df.to_csv("data.csv", index=False)
Prints:
...
13 14 FW Moussa Konaté 3 April 1993 (aged 25) 28 Amiens 2018 Senegal 10.0
14 15 FW Diafra Sakho 24 December 1989 (aged 28) 12 Rennes 2018 Senegal 3.0
15 16 GK Khadim N'Diaye 5 April 1985 (aged 33) 26 Horoya 2018 Senegal 0.0
16 17 MF Badou Ndiaye 27 October 1990 (aged 27) 20 Stoke City 2018 Senegal 1.0
17 18 FW Ismaïla Sarr 25 February 1998 (aged 20) 16 Rennes 2018 Senegal 3.0
18 19 FW M'Baye Niang 19 December 1994 (aged 23) 7 Torino 2018 Senegal 0.0
19 20 FW Keita Baldé 8 March 1995 (aged 23) 19 Monaco 2018 Senegal 3.0
20 21 DF Lamine Gassama 20 October 1989 (aged 28) 36 Alanyaspor 2018 Senegal 0.0
21 22 DF Moussa Wagué 4 October 1998 (aged 19) 10 Eupen 2018 Senegal 0.0
22 23 GK Alfred Gomis 5 September 1993 (aged 24) 1 SPAL 2018 Senegal 0.0
and saves data.csv (screenshot from LibreOffice):

just tested, you have no data because "wikitable" class is present in every table
You can replace "not in" by "in":
if "wikitable" in table["class"]:
...
And your BeautifulSoup data will be there
Once you changed this condition, you will have a problem with this list comprehension:
cells += [country, td.a.get('title') if td.a else 'none', year]
This is because td is not defined in this list, not quite sure what is the aim of these lines but you can define tds before and use them after:
tds = tr.find_all('td')
cells += ...
In general you can add breakpoints in your code to easily identify where is the problem

How to use pandas to pull out the counties with the largest amount of water used in a given year?

I am new to python and pandas and I am struggling to figure out how to pull out the 10 counties with the most water used for irrigation in 2014.
%matplotlib inline
import csv
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
data = pd.read_csv('info.csv') #reads csv
data['Year'] = pd.to_datetime(['Year'], format='%Y') #converts string to
datetime
data.index = data['Year'] #makes year the index
del data['Year'] #delete the duplicate year column
This is what the data looks like (this is only partial of the data):
County WUCode RegNo Year SourceCode SourceID Annual CountyName
1 IR 311 2014 WELL 1 946 Adams
1 IN 311 2014 INTAKE 1 268056 Adams
1 IN 312 2014 WELL 1 48 Adams
1 IN 312 2014 WELL 2 96 Adams
1 IR 312 2014 INTAKE 1 337968 Adams
3 IR 315 2014 WELL 5 81900 Putnam
3 PS 315 2014 WELL 6 104400 Putnam
I have a couple questions:
I am not sure how to pull out only the "IR" in the WUCode Column with pandas and I am not sure how to print out a table with the 10 counties with the highest water usage for irrigation in 2014.
I have been able to use the .loc function to pull out the information I need, with something like this:
data.loc['2014', ['CountyName', 'Annual', 'WUCode']]
From here I am kind of lost. Help would be appreciated!

import numpy as np
import pandas as pd
import string
df = pd.DataFrame(data={"Annual": np.random.randint(20, 1000000, 1000),
"Year": np.random.randint(2012, 2016, 1000),
"CountyName": np.random.choice(list(string.ascii_letters), 1000)},
columns=["Annual", "Year", "CountyName"])
Say df looks like:
Annual Year CountyName
0 518966 2012 s
1 44511 2013 E
2 332010 2012 e
3 382168 2013 c
4 202816 2013 y
For the year 2014...
df[df['Year'] == 2014]
Group by CountyName...
df[df['Year'] == 2014].groupby("CountyName")
Look at Annual...
df[df['Year'] == 2014].groupby("CountyName")["Annual"]
Get the sum...
df[df['Year'] == 2014].groupby("CountyName")["Annual"].sum()
Sort the result descending...
df[df['Year'] == 2014].groupby("CountyName")["Annual"].sum().sort_values(ascending=False)
Take the top 10...
df[df['Year'] == 2014].groupby("CountyName")["Annual"].sum().sort_values(ascending=False).head(10)
This example prints out (your actual result may vary since my data was random):
CountyName
Q 5191814
y 4335358
r 4315072
f 3985170
A 3685844
a 3583360
S 3301817
I 3231621
t 3228578
u 3164965

This may work for you:
res = df[df['WUCode'] == 'IR'].groupby(['Year', 'CountyName'])['Annual'].sum()\
.reset_index()\
.sort_values('Annual', ascending=False)\
.head(10)
# Year CountyName Annual
# 0 2014 Adams 338914
# 1 2014 Putnam 81900
Explanation
Filter by WUCode, as required, and groupby Year and CountyName.
Use reset_index so your result is a dataframe rather than a series.
Use sort_values and extract top 10 via pd.DataFrame.head.

Trouble merging Scraped data using Pandas and numpy in Python

I am trying to collect information from a lot of different urls and combine the data based on the year and Golfer name. As of now I am trying to write the information to csv and then match using pd.merge() but I have to use a unique name for each dataframe to merge. I tried to use a numpy array but I am stuck with the final process of getting all the separate data to be merged.
import csv
from urllib.request import urlopen
from bs4 import BeautifulSoup
import datetime
import socket
import urllib.error
import pandas as pd
import urllib
import sqlalchemy
import numpy as np
base = 'http://www.pgatour.com/'
inn = 'stats/stat'
end = '.html'
years = ['2017','2016','2015','2014','2013']
alpha = []
#all pages with links to tables
urls = ['http://www.pgatour.com/stats.html','http://www.pgatour.com/stats/categories.ROTT_INQ.html','http://www.pgatour.com/stats/categories.RAPP_INQ.html','http://www.pgatour.com/stats/categories.RARG_INQ.html','http://www.pgatour.com/stats/categories.RPUT_INQ.html','http://www.pgatour.com/stats/categories.RSCR_INQ.html','http://www.pgatour.com/stats/categories.RSTR_INQ.html','http://www.pgatour.com/stats/categories.RMNY_INQ.html','http://www.pgatour.com/stats/categories.RPTS_INQ.html']
for i in urls:
data = urlopen(i)
soup = BeautifulSoup(data, "html.parser")
for link in soup.find_all('a'):
if link.has_attr('href'):
alpha.append(base + link['href'][17:]) #may need adjusting
#data links
beta = []
for i in alpha:
if inn in i:
beta.append(i)
#no repeats
gamma= []
for i in beta:
if i not in gamma:
gamma.append(i)
#making list of urls with Statistic labels
jan = []
for i in gamma:
try:
data = urlopen(i)
soup = BeautifulSoup(data, "html.parser")
for table in soup.find_all('section',{'class':'module-statistics-off-the-tee-details'}):
for j in table.find_all('h3'):
y=j.get_text().replace(" ","").replace("-","").replace(":","").replace(">","").replace("<","").replace(">","").replace(")","").replace("(","").replace("=","").replace("+","")
jan.append([i,str(y+'.csv')])
print([i,str(y+'.csv')])
except Exception as e:
print(e)
pass
# practice url
#jan = [['http://www.pgatour.com/stats/stat.02356.html', 'Last15EventsScoring.csv']]
#grabbing data
#write to csv
row_sp = []
rows_sp =[]
title1 = []
title = []
for i in jan:
try:
with open(i[1], 'w+') as fp:
writer = csv.writer(fp)
for y in years:
data = urlopen(i[0][:-4] +y+ end)
soup = BeautifulSoup(data, "html.parser")
data1 = urlopen(i[0])
soup1 = BeautifulSoup(data1, "html.parser")
for table in soup1.find_all('table',{'id':'statsTable'}):
title.append('year')
for k in table.find_all('tr'):
for n in k.find_all('th'):
title1.append(n.get_text())
for l in title1:
if l not in title:
title.append(l)
rows_sp.append(title)
for table in soup.find_all('table',{'id':'statsTable'}):
for h in table.find_all('tr'):
row_sp = [y]
for j in h.find_all('td'):
row_sp.append(j.get_text().replace(" ","").replace("\n","").replace("\xa0"," ").replace("d",""))
rows_sp.append(row_sp)
print(row_sp)
writer.writerows([row_sp])
except Exception as e:
print(e)
pass
dfs = [df1,df2,df3] # store dataframes in one list
df_merge = reduce(lambda left,right: pd.merge(left,right,on=['v1'], how='outer'), dfs)
The urls, stat types, desired format
the ... is just all of the stuff inbetween
trying to get the data on one row
urls for below data ['http://www.pgatour.com/stats/stat.02356.html','http://www.pgatour.com/stats/stat.02568.html',...,'http://www.pgatour.com/stats/stat.111.html']
Statistics Titles
LAST 15 EVENTS - SCORING, SG: APPROACH-THE-GREEN, ..., SAND SAVE PERCENTAGE
year rankthisweek ranklastweek name events rating rounds avg
2017 2 3 Rickie Fowler 10 8.8 62 .614
TOTAL SG:APP MEASURED ROUNDS .... % # SAVES # BUNKERS TOTAL O/U PAR
26.386 43 ....70.37 76 108 +7.00

UPDATE (per comments)
This question is partly about technical methods (Pandas merge()), but it also seems like an opportunity to discuss useful workflows for data collection and cleaning. As such I'm adding a bit more detail and explanation than what is strictly required for a coding solution.
You can basically use the same approach as my original answer to get data from different URL categories. I'd recommend keeping a list of {url:data} dicts as you iterate over your URL list, and then building cleaned data frames from that dict.
There's a little legwork involved in setting up the cleaning portion, as you need to adjust for the different columns in each URL category. I've demonstrated with a manual approach, using only a few tests URLs. But if you have, say, thousands of different URL categories, then you may need to think about how to collect and organize column names programmatically. That feels out of scope for this OP.
As long as you're sure there's a year and PLAYER NAME field in each URL, the following merge should work. As before, let's assume that you don't need to write to CSV, and for now let's leave off making any optimizations to your scraping code:
First, define the url categories in urls. By url category I'm referring to the fact that http://www.pgatour.com/stats/stat.02356.html will actually be used multiple times by inserting a series of years into the url itself, e.g.: http://www.pgatour.com/stats/stat.02356.2017.html, http://www.pgatour.com/stats/stat.02356.2016.html. In this example, stat.02356.html is the url category that contains information about multiple years of player data.
import pandas as pd
# test urls given by OP
# note: each url contains >= 1 data fields not shared by the others
urls = ['http://www.pgatour.com/stats/stat.02356.html',
'http://www.pgatour.com/stats/stat.02568.html',
'http://www.pgatour.com/stats/stat.111.html']
# we'll store data from each url category in this dict.
url_data = {}
Now iterate over urls. Within the urls loop, this code is all the same as my original answer, which in turn is coming from OP - only with some variable names adjusted to reflect our new capturing method.
for url in urls:
print("url: ", url)
url_data[url] = {"row_sp": [],
"rows_sp": [],
"title1": [],
"title": []}
try:
#with open(i[1], 'w+') as fp:
#writer = csv.writer(fp)
for y in years:
current_url = url[:-4] +y+ end
print("current url is: ", current_url)
data = urlopen(current_url)
soup = BeautifulSoup(data, "html.parser")
data1 = urlopen(url)
soup1 = BeautifulSoup(data1, "html.parser")
for table in soup1.find_all('table',{'id':'statsTable'}):
url_data[url]["title"].append('year')
for k in table.find_all('tr'):
for n in k.find_all('th'):
url_data[url]["title1"].append(n.get_text())
for l in url_data[url]["title1"]:
if l not in url_data[url]["title"]:
url_data[url]["title"].append(l)
url_data[url]["rows_sp"].append(url_data[url]["title"])
for table in soup.find_all('table',{'id':'statsTable'}):
for h in table.find_all('tr'):
url_data[url]["row_sp"] = [y]
for j in h.find_all('td'):
url_data[url]["row_sp"].append(j.get_text().replace(" ","").replace("\n","").replace("\xa0"," ").replace("d",""))
url_data[url]["rows_sp"].append(url_data[url]["row_sp"])
#print(row_sp)
#writer.writerows([row_sp])
except Exception as e:
print(e)
pass
Now for each key url in url_data, rows_sp contains the data you're interested in for that particular url category.
Note that rows_sp will now actually be url_data[url]["rows_sp"] when we iterate over url_data, but the next few code blocks are from my original answer, and so use the old rows_sp variable name.
# example rows_sp
[['year',
'RANK THIS WEEK',
'RANK LAST WEEK',
'PLAYER NAME',
'EVENTS',
'RATING',
'year',
'year',
'year',
'year'],
['2017'],
['2017', '1', '1', 'Sam Burns', '1', '9.2'],
['2017', '2', '3', 'Rickie Fowler', '10', '8.8'],
['2017', '2', '2', 'Dustin Johnson', '10', '8.8'],
['2017', '2', '3', 'Whee Kim', '2', '8.8'],
['2017', '2', '3', 'Thomas Pieters', '3', '8.8'],
...
]
Writing rows_sp directly to a data frame shows that the data aren't quite in the right format:
pd.DataFrame(rows_sp).head()
0 1 2 3 4 5 6 \
0 year RANK THIS WEEK RANK LAST WEEK PLAYER NAME EVENTS RATING year
1 2017 None None None None None None
2 2017 1 1 Sam Burns 1 9.2 None
3 2017 2 3 Rickie Fowler 10 8.8 None
4 2017 2 2 Dustin Johnson 10 8.8 None
7 8 9
0 year year year
1 None None None
2 None None None
3 None None None
4 None None None
pd.DataFrame(rows_sp).dtypes
0 object
1 object
2 object
3 object
4 object
5 object
6 object
7 object
8 object
9 object
dtype: object
With a little cleanup, we can get rows_sp into a data frame with appropriate numeric data types:
df = pd.DataFrame(rows_sp, columns=rows_sp[0]).drop(0)
df.columns = ["year","RANK THIS WEEK","RANK LAST WEEK",
"PLAYER NAME","EVENTS","RATING",
"year1","year2","year3","year4"]
df.drop(["year1","year2","year3","year4"], 1, inplace=True)
df = df.loc[df["PLAYER NAME"].notnull()]
df = df.loc[df.year != "year"]
num_cols = ["RANK THIS WEEK","RANK LAST WEEK","EVENTS","RATING"]
df[num_cols] = df[num_cols].apply(pd.to_numeric)
df.head()
year RANK THIS WEEK RANK LAST WEEK PLAYER NAME EVENTS RATING
2 2017 1 1.0 Sam Burns 1 9.2
3 2017 2 3.0 Rickie Fowler 10 8.8
4 2017 2 2.0 Dustin Johnson 10 8.8
5 2017 2 3.0 Whee Kim 2 8.8
6 2017 2 3.0 Thomas Pieters 3 8.8
UPDATED CLEANING
Now that we have a series of url categories to contend with, each with a different set of fields to clean, the above section gets a little more complicated. If you only have a few pages, it may be feasible to just visually review the fields for each category, and store them, like this:
cols = {'stat.02568.html':{'columns':['year', 'RANK THIS WEEK', 'RANK LAST WEEK',
'PLAYER NAME', 'ROUNDS', 'AVERAGE',
'TOTAL SG:APP', 'MEASURED ROUNDS',
'year1', 'year2', 'year3', 'year4'],
'numeric':['RANK THIS WEEK', 'RANK LAST WEEK', 'ROUNDS',
'AVERAGE', 'TOTAL SG:APP', 'MEASURED ROUNDS',]
},
'stat.111.html':{'columns':['year', 'RANK THIS WEEK', 'RANK LAST WEEK',
'PLAYER NAME', 'ROUNDS', '%', '# SAVES', '# BUNKERS',
'TOTAL O/U PAR', 'year1', 'year2', 'year3', 'year4'],
'numeric':['RANK THIS WEEK', 'RANK LAST WEEK', 'ROUNDS',
'%', '# SAVES', '# BUNKERS', 'TOTAL O/U PAR']
},
'stat.02356.html':{'columns':['year', 'RANK THIS WEEK', 'RANK LAST WEEK',
'PLAYER NAME', 'EVENTS', 'RATING',
'year1', 'year2', 'year3', 'year4'],
'numeric':['RANK THIS WEEK', 'RANK LAST WEEK',
'EVENTS', 'RATING']
}
}
And then you can loop over url_data again and store in a dfs collection:
dfs = {}
for url in url_data:
page = url.split("/")[-1]
colnames = cols[page]["columns"]
num_cols = cols[page]["numeric"]
rows_sp = url_data[url]["rows_sp"]
df = pd.DataFrame(rows_sp, columns=rows_sp[0]).drop(0)
df.columns = colnames
df.drop(["year1","year2","year3","year4"], 1, inplace=True)
df = df.loc[df["PLAYER NAME"].notnull()]
df = df.loc[df.year != "year"]
# tied ranks (e.g. "T9") mess up to_numeric; remove the tie indicators.
df["RANK THIS WEEK"] = df["RANK THIS WEEK"].str.replace("T","")
df["RANK LAST WEEK"] = df["RANK LAST WEEK"].str.replace("T","")
df[num_cols] = df[num_cols].apply(pd.to_numeric)
dfs[url] = df
At this point, we're ready to merge all the different data categories by year and PLAYER NAME. (You could actually have merged iteratively in the cleaning loop, but I'm separating here for demonstrative purposes.)
master = pd.DataFrame()
for url in dfs:
if master.empty:
master = dfs[url]
else:
master = master.merge(dfs[url], on=['year','PLAYER NAME'])
Now master contains the merged data for each player-year. Here's a view into the data, using groupby():
master.groupby(["PLAYER NAME", "year"]).first().head(4)
RANK THIS WEEK_x RANK LAST WEEK_x EVENTS RATING \
PLAYER NAME year
Aam Hawin 2015 66 66.0 7 8.2
2016 80 80.0 12 8.1
2017 72 45.0 8 8.2
Aam Scott 2013 45 45.0 10 8.2
RANK THIS WEEK_y RANK LAST WEEK_y ROUNDS_x AVERAGE \
PLAYER NAME year
Aam Hawin 2015 136 136 95 -0.183
2016 122 122 93 -0.061
2017 56 52 84 0.296
Aam Scott 2013 16 16 61 0.548
TOTAL SG:APP MEASURED ROUNDS RANK THIS WEEK \
PLAYER NAME year
Aam Hawin 2015 -14.805 81 86
2016 -5.285 87 39
2017 18.067 61 8
Aam Scott 2013 24.125 44 57
RANK LAST WEEK ROUNDS_y % # SAVES # BUNKERS \
PLAYER NAME year
Aam Hawin 2015 86 95 50.96 80 157
2016 39 93 54.78 86 157
2017 6 84 61.90 91 147
Aam Scott 2013 57 61 53.85 49 91
TOTAL O/U PAR
PLAYER NAME year
Aam Hawin 2015 47.0
2016 43.0
2017 27.0
Aam Scott 2013 11.0
You may want to do a bit more cleaning on the merged columns, as some are duplicated across data categories (e.g. ROUNDS_x and ROUNDS_y). From what I can tell, the duplicate field names seem to contain exactly the same information, so you might just drop the _y version of each one.

Develop Reference

Python is a programming language that lets you work quickly and integrate systems more effectively.

I have to create a dataframe from webscraping in a specific way - python

Related

Returning only last item and splitting into columns

Python Limit time to run pandas read_html

BeautifulSoup scraper doesn't retrieve any information

How to use pandas to pull out the counties with the largest amount of water used in a given year?

Trouble merging Scraped data using Pandas and numpy in Python

Categories

Resources