how to " sort csv file" python - python

I'm trying to create a new file that contains only the data for movies with a rank above 9.
The dataset I'm analyzing contains ratings for many movies, obtained from IMDB.
The data fields are:
Votes: the number of people rating the movie
Rank: the average rating of the movie
Title: the name of the movie
Year: the year in which the movie was released
The code I tried:
import csv

filename = "IMDB.txt"
with open(filename, 'rt', encoding='utf-8-sig') as imdb_file:
    imdb_reader = csv.DictReader(imdb_file, delimiter='\t')
    with open('new file.csv', 'w', newline='') as high_rank:
        fieldnames = ['Votes', 'Rank', 'Title', 'Year']
        writer = csv.DictWriter(high_rank, fieldnames=fieldnames)
        writer.writeheader()
        for line_number, current_row in enumerate(imdb_reader):
            if float(current_row['Rank']) > 9.0:
                writer.writerow(dict(current_row))
but unfortunately it's not working. What should I do?

Based on your comment, your locale default encoding appears to be something that doesn't support the whole Unicode range. You need to specify an encoding for the output file that will handle arbitrary Unicode characters. Typically, on non-Windows systems you'd use 'utf-8'; on Windows, you might use 'utf-16' or 'utf-8-sig' (Windows programs often assume UTF-8 without an explicit signature is in the locale encoding, and misinterpret it). The fix is as simple as changing:
with open('new file.csv', 'w', newline='') as high_rank:
to:
with open('new file.csv', 'w', encoding='utf-8', newline='') as high_rank:
changing the specified encoding to whatever makes sense for your OS and use case.
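For reference, here is the question's script with that one change applied (and with the DictWriter consistently referred to as writer, fixing the csv_writer NameError in the original):

import csv

filename = "IMDB.txt"
with open(filename, 'rt', encoding='utf-8-sig') as imdb_file:
    imdb_reader = csv.DictReader(imdb_file, delimiter='\t')
    # an explicit output encoding keeps arbitrary Unicode titles intact
    with open('new file.csv', 'w', encoding='utf-8', newline='') as high_rank:
        fieldnames = ['Votes', 'Rank', 'Title', 'Year']
        writer = csv.DictWriter(high_rank, fieldnames=fieldnames)
        writer.writeheader()
        for current_row in imdb_reader:
            if float(current_row['Rank']) > 9.0:
                writer.writerow(dict(current_row))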

Let's consider you have the following sheet saved as temp.csv, and you want to keep films with a rank of 9 or above (9 included):
One simple way to do this is to use the pandas module. It gives you the opportunity to:
read .csv files with the pd.read_csv method (doc)
filter the data as you want
export the data to a new file: for .csv output, df.to_csv does the job (doc)
The code below does the job; the print(df) output shows the dataframe read from the file:
# import modules
import pandas as pd
# Path - name of your file
filename = "temp.csv"
# Read the csv file
df = pd.read_csv(filename, sep=";")
print(df)
# Votes Rank Film Year
# 0 15 16 The Shawshank Redemption 1994
# 1 2004 5 The Godfather 1972
# 2 486 13 The Godfather: Part II 1974
# 3 529 9 Il buono, il brutto, il cattivo. 1966
# 4 289 12 Pulp Fiction 1994
# 5 98 11 Inception 2010
# 6 69 18 Schindler's List 1993
# 7 3 7 Angry Men 1957
# 8 584 14 One Flew Over the Cuckoo's Nest 1975
# Filter the csv file
df_filtered = df[df["Rank"] >= 9]
print(df_filtered)
# Votes Rank Film Year
# 0 15 16 The Shawshank Redemption 1994
# 2 486 13 The Godfather: Part II 1974
# 3 529 9 Il buono, il brutto, il cattivo. 1966
# 4 289 12 Pulp Fiction 1994
# 5 98 11 Inception 2010
# 6 69 18 Schindler's List 1993
# 8 584 14 One Flew Over the Cuckoo's Nest 1975
# Name the new csv file (e.g. temp_new.csv)
new_filename = filename[:-4] + "_new" + filename[-4:]
# Export dataframe to csv file
df_filtered.to_csv(new_filename)
The new .csv (written with the to_csv defaults, so the dataframe index is kept as the first column) should look something like this:
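,Votes,Rank,Film,Year
0,15,16,The Shawshank Redemption,1994
2,486,13,The Godfather: Part II,1974
3,529,9,"Il buono, il brutto, il cattivo.",1966
4,289,12,Pulp Fiction,1994
5,98,11,Inception,2010
6,69,18,Schindler's List,1993
8,584,14,One Flew Over the Cuckoo's Nest,1975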

Related

Text file to an Excel file (tab delimited) with Python

I have a txt file that looks like this:
1000 lewis hamilton 36
1001 sebastian vettel 34
1002 lando norris 21
I want them to look like this:
I tried the solution from here, but it gave me a blank Excel file and an error when trying to open it.
There are more than one million lines, and each line contains around 10 columns.
And one last thing: I am not 100% sure they are tab delimited, because some columns look like they have more space between them than others, but when I press backspace once they stick to each other, so I guess they are.
You can use pandas read_csv to read your txt file and then save it as an Excel file with .to_excel:
df = pd.read_csv('your_file.txt', delim_whitespace=True)
df.to_excel('your_file.xlsx', index=False)
Here is some documentation:
pandas.read_csv: https://pandas.pydata.org/docs/reference/api/pandas.read_csv.html
.to_excel: https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.to_excel.html
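One caveat, given the million-plus lines mentioned in the question: an .xlsx sheet holds at most 1,048,576 rows, so a single to_excel call will fail on a frame taller than that. A rough sketch of splitting the frame across several sheets (the file names and sheet names here are just placeholders) could look like:

import pandas as pd

MAX_ROWS = 1_048_576 - 1  # xlsx row limit per sheet, minus one row for the header

df = pd.read_csv('your_file.txt', delim_whitespace=True)

# write successive slices of the frame to Sheet1, Sheet2, ...
with pd.ExcelWriter('your_file.xlsx') as writer:
    for i, start in enumerate(range(0, len(df), MAX_ROWS)):
        df.iloc[start:start + MAX_ROWS].to_excel(
            writer, sheet_name='Sheet' + str(i + 1), index=False)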
If you're not sure how the fields are separated, you can use "\s+" to split on any whitespace:
import pandas as pd
df = pd.read_csv('f1.txt', sep="\s+", header=None)
# you might need: pip install openpyxl
df.to_excel('f1.xlsx', 'Sheet1')
Example of randomly separated fields (f1.txt):
1000 lewis hamilton 2 36
1001 sebastian vettel 8 34
1002 lando norris 6 21
If some lines have more columns than the first one, causing:
ParserError: Error tokenizing data. C error: Expected 5 fields in line 5, saw 6
You can ignore those by using:
df = pd.read_csv('f1.txt', sep="\s+", header=None, error_bad_lines=False)
This is an example of data:
1000 lewis hamilton 2 36
1001 sebastian vettel 8 34
1002 lando norris 6 21
1003 charles leclerc 1 3
1004 carlos sainz ferrari 2 2
The last line will be ignored:
b'Skipping line 5: expected 5 fields, saw 6\n'
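Note that error_bad_lines was deprecated in pandas 1.3 and removed in 2.0; on recent versions the equivalent, as far as I can tell, is the on_bad_lines argument:

df = pd.read_csv('f1.txt', sep="\s+", header=None, on_bad_lines="skip")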

Python read specific value from text file and total sum

I have this text file, Masterlist.txt, which looks something like this:
S1234567A|Jan Lee|Ms|05/10/1990|Software Architect|IT Department|98785432|PartTime|3500
S1234567B|Feb Tan|Mr|10/12/1991|Corporate Recruiter|HR Corporate Admin|98766432|PartTime|1500
S1234567C|Mark Lim|Mr|15/07/1992|Benefit Specialist|HR Corporate Admin|98265432|PartTime|2900
S1234567D|Apr Tan|Ms|20/01/1996|Payroll Administrator|HR Corporate Admin|91765432|FullTime|1600
S1234567E|May Ng|Ms|25/05/1994|Training Coordinator|HR Corporate Admin|98767432|Hourly|1200
S1234567Y|Lea Law|Ms|11/07/1994|Corporate Recruiter|HR Corporate Admin|94445432|PartTime|1600
I want to reduce the Salary (the number at the end of each line) by 50%, but only for lines that contain "PartTime" and have a year after 1995, and then add the salaries up.
Currently I only know how to select the lines with "PartTime" in them, and my code looks like this:
f = open("Masterlist.txt", "r")
for x in f:
    if "PartTime" in x:
        print(x)
How do I extract the Salary, reduce it by 50%, and add it all up, only if the year is after 1995?
Try using the pandas library.
From your question I suppose you want to reduce the Salary by 50% if the year is before 1995, and otherwise increase it by 50%.
import pandas as pd
path = r'../Masterlist.txt' # path to your .txt file
df = pd.read_csv(path, sep='|', names = [0,1,2,'Date',4,5,6,'Type', 'Salary'], parse_dates=['Date'])
# Now column Date is treated as datetime object
print(df.head())
0 1 2 Date 4 \
0 S1234567A Jan Lee Ms 1990-05-10 Software Architect
1 S1234567B Feb Tan Mr 1991-10-12 Corporate Recruiter
2 S1234567C Mark Lim Mr 1992-07-15 Benefit Specialist
3 S1234567D Apr Tan Ms 1996-01-20 Payroll Administrator
4 S1234567E May Ng Ms 1994-05-25 Training Coordinator
5 6 Type Salary
0 IT Department 98785432 PartTime 3500
1 HR Corporate Admin 98766432 PartTime 1500
2 HR Corporate Admin 98265432 PartTime 2900
3 HR Corporate Admin 91765432 FullTime 1600
4 HR Corporate Admin 98767432 Hourly 1200
df.Salary = df.apply(lambda row: row.Salary*0.5 if row['Date'].year < 1995 and row['Type'] == 'PartTime' else row.Salary + (row.Salary*0.5 ), axis=1)
print(df.Salary.head())
0 1750.0
1 750.0
2 1450.0
3 2400.0
4 600.0
Name: Salary, dtype: float64
Add some modifications to the if/else statement inside the apply function if you want something different.
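If you also need the total (the "add it up" part of the question), summing the adjusted column is enough:

total = df.Salary.sum()
print(total)  # total of the adjusted salaries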

Comparing three tsv files to check for similar items and updating the differences in the original file

So I have three .tsv files, namely file1.tsv, file2.tsv and file3.tsv, and they are as follows:
file1.tsv =
ID Name
1 Abby
2 Lisa
3
4 John
5
6 Kevin
7 Joe
8 Sasha
9 Stuart
10 Amy
file2.tsv =
ID Name
8 Sasha
3 Iris
9 Stuart
file3.tsv =
ID Name
10 Amy
5 Kelly
6 Kevin
7 Joe
I need to parse the first file into IDs and Names and find the rows where an ID has an empty name. Then I have to look for those specific IDs in file2.tsv; if there is a matching name for an ID, I have to update file1.tsv with it. If I don't find matches for all the empty IDs in file2.tsv, I then have to look for them in file3.tsv, which should resolve all the remaining IDs and names, and again update file1.tsv.
The output.tsv file should have all the IDs showing their corresponding names, as follows:
ID Name
1 Abby
2 Lisa
3 Iris
4 John
5 Kelly
6 Kevin
7 Joe
8 Sasha
9 Stuart
10 Amy
Now, here is what I tried:
import csv
import sys

f1 = open("file1.tsv")
read_tsv = csv.reader(f1, delimiter="\t", quotechar='"')
with open('new.tsv', 'w', newline='') as g_output:  # to save the empty entry IDs in a separate file
    tsv_new = csv.writer(g_output, delimiter='\t')
    for row in read_tsv:
        if row[1] == '':
            print(row)
            tsv_new.writerow(row)
f1.close()

entry_ID = input('Enter ID number to find:\n')
f2 = open('file2.tsv')
f2_reader = csv.reader(f2, delimiter="\t")
with open('output.tsv', 'a', newline='') as f_output:  # to append the output to this new file
    tsv_output = csv.writer(f_output, delimiter='\t')
    for row in f2_reader:
        if entry_ID == row[0]:
            print(row)
            tsv_output.writerow(row)
f2.close()
Similarly, I would check in file3.tsv. However, it doesn't seem to work, and I don't want to look for entry IDs manually; I want to automate the process with file handling and loops. I am new to Python and this seems tricky.
This is just an example: my original three .tsv files are huge, and manually updating the IDs is not feasible.
A dictionary is your best bet: since its keys are unique, you can read each row in and use the ID as the key. Files read in later will simply overwrite any existing identical keys. You can also skip any empty rows or rows containing no name:
import csv

data = {}
for filename in ['file1.tsv', 'file2.tsv', 'file3.tsv']:
    with open(filename, newline='') as f_input:
        csv_input = csv.reader(f_input, delimiter='\t')
        header = next(csv_input)
        for row in csv_input:
            if len(row) == 2 and row[1]:
                data[int(row[0])] = row[1]  # convert the ID to an int to ensure it is numerically sorted

with open('output.tsv', 'w', newline='') as f_output:
    csv_output = csv.writer(f_output, delimiter='\t')
    csv_output.writerow(header)
    csv_output.writerows(sorted(data.items()))
This would give you an output file containing:
ID Name
1 Abby
2 Lisa
3 Iris
4 John
5 Kelly
6 Kevin
7 Joe
8 Sasha
9 Stuart
10 Amy
data could also then be used to look up an ID:
>>> print(data[9])
Stuart

Not able to pull data from Census API because it is rejecting calls

I am trying to run this script to extract data from the US Census, but the Census API is rejecting my requests. I did a bit of work on it but am stumped. Any ideas on how to deal with this?
import pandas as pd
import requests
from pandas.compat import StringIO
# Sourced from the following site https://github.com/mortada/fredapi
from fredapi import Fred
fred = Fred(api_key='xxxx')
import datetime
import sys
if sys.version_info[0] < 3:
    from StringIO import StringIO as stio
else:
    from io import StringIO as stio

year_list = '2013','2014','2015','2016','2017'
month_list = '01','02','03','04','05','06','07','08','09','10','11','12'

#############################################
# Get the total exports from the United States
#############################################
exports = pd.DataFrame()
for i in year_list:
    for s in month_list:
        try:
            link = "https://api.census.gov/data/timeseries/intltrade/exports/hs?get=CTY_CODE,CTY_NAME,ALL_VAL_MO,ALL_VAL_YR&time="
            str1 = ''.join([i])
            txt = '-'
            str2 = ''.join([s])
            total_link = link + str1 + txt + str2
            r = requests.get(total_link, headers={'User-agent': 'your bot 0.1'})
            df = pd.read_csv(StringIO(r.text))
            ##################### change starts here #####################
            # since this is already a dataframe, the list-to-dataframe approach won't work
            # Drop the total sales line
            df.drop(df.index[0])
            # Rename columns
            df.columns = ['CTY_CODE','CTY_NAME','EXPORT MTH','EXPORT YR','time','UN']
            # Change the ["1234" to 1234
            df['CTY_CODE'] = df['CTY_CODE'].str[2:-1]
            # Change the 2017-01] to 2017-01
            df['time'] = df['time'].str[:-1]
            ##################### change ends here #####################
            exports = exports.append(df, ignore_index=False)
        except:
            print(i)
            print(s)
Here you go:
import ast
import itertools
import pandas as pd
import requests

base = "https://api.census.gov/data/timeseries/intltrade/exports/hs?get=CTY_CODE,CTY_NAME,ALL_VAL_MO,ALL_VAL_YR&time="
year_list = ['2013','2014','2015','2016','2017']
month_list = ['01','02','03','04','05','06','07','08','09','10','11','12']

exports = []
rejects = []
for year, month in itertools.product(year_list, month_list):
    url = '%s%s-%s' % (base, year, month)
    r = requests.get(url, headers={'User-agent': 'your bot 0.1'})
    if r.text:
        r = ast.literal_eval(r.text)
        df = pd.DataFrame(r[2:], columns=r[0])
        exports.append(df)
    else:
        rejects.append((int(year), int(month)))

exports = pd.concat(exports).reset_index().drop('index', axis=1)
Your result looks like this:
CTY_CODE CTY_NAME ALL_VAL_MO ALL_VAL_YR time
0 1010 GREENLAND 233446 233446 2013-01
1 1220 CANADA 23170845914 23170845914 2013-01
2 2010 MEXICO 17902453702 17902453702 2013-01
3 2050 GUATEMALA 425978783 425978783 2013-01
4 2080 BELIZE 17795867 17795867 2013-01
5 2110 EL SALVADOR 207606613 207606613 2013-01
6 2150 HONDURAS 429806151 429806151 2013-01
7 2190 NICARAGUA 75752432 75752432 2013-01
8 2230 COSTA RICA 598484187 598484187 2013-01
9 2250 PANAMA 1046236431 1046236431 2013-01
10 2320 BERMUDA 47156737 47156737 2013-01
11 2360 BAHAMAS 256292297 256292297 2013-01
... ... ... ... ...
13883 0024 LAFTA 27790655209 193139639307 2017-07
13884 0025 EURO AREA 15994685459 121039479852 2017-07
13885 0026 APEC 76654291110 550552655105 2017-07
13886 0027 ASEAN 6030380132 44558200533 2017-07
13887 0028 CACM 2133048149 13333440411 2017-07
13888 1XXX NORTH AMERICA 41622877949 299981278306 2017-07
13889 2XXX CENTRAL AMERICA 4697852283 30756310800 2017-07
13890 3XXX SOUTH AMERICA 8117215081 55039567414 2017-07
13891 4XXX EUROPE 25201247938 189925038230 2017-07
13892 5XXX ASIA 38329181070 274304503490 2017-07
13893 6XXX AUSTRALIA AND OC... 2389798925 16656777753 2017-07
13894 7XXX AFRICA 1809443365 13022520158 2017-07
Walkthrough:
itertools.product iterates over the product of (year, month) combinations, joining them with your base url
if the text of the response object is not blank (periods such as 2017-12 will be blank), create a DataFrame out of the literally-evaluated text, which is a list of lists. Use the first element as columns and ignore the second element.
otherwise, add the (year, month) combo to rejects, a list of tuples of the items not found
I used exports = [] because it is much more efficient to concatenate a list of DataFrames than to repeatedly append to an existing DataFrame
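As a side note, the Census endpoint returns JSON, so r.json() should give the same list of lists without the ast.literal_eval step; a small, optional simplification of the block above:

payload = r.json()                                  # header row first, then data rows
df = pd.DataFrame(payload[2:], columns=payload[0])  # same slicing as above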

Trouble merging Scraped data using Pandas and numpy in Python

I am trying to collect information from a lot of different URLs and combine the data based on the year and golfer name. As of now I am trying to write the information to CSV and then match using pd.merge(), but I have to use a unique name for each dataframe to merge. I tried to use a numpy array, but I am stuck with the final step of getting all the separate data merged.
import csv
from urllib.request import urlopen
from bs4 import BeautifulSoup
import datetime
import socket
import urllib.error
import pandas as pd
import urllib
import sqlalchemy
import numpy as np
base = 'http://www.pgatour.com/'
inn = 'stats/stat'
end = '.html'
years = ['2017','2016','2015','2014','2013']
alpha = []
#all pages with links to tables
urls = ['http://www.pgatour.com/stats.html','http://www.pgatour.com/stats/categories.ROTT_INQ.html','http://www.pgatour.com/stats/categories.RAPP_INQ.html','http://www.pgatour.com/stats/categories.RARG_INQ.html','http://www.pgatour.com/stats/categories.RPUT_INQ.html','http://www.pgatour.com/stats/categories.RSCR_INQ.html','http://www.pgatour.com/stats/categories.RSTR_INQ.html','http://www.pgatour.com/stats/categories.RMNY_INQ.html','http://www.pgatour.com/stats/categories.RPTS_INQ.html']
for i in urls:
    data = urlopen(i)
    soup = BeautifulSoup(data, "html.parser")
    for link in soup.find_all('a'):
        if link.has_attr('href'):
            alpha.append(base + link['href'][17:]) #may need adjusting
#data links
beta = []
for i in alpha:
    if inn in i:
        beta.append(i)
#no repeats
gamma = []
for i in beta:
    if i not in gamma:
        gamma.append(i)
#making list of urls with Statistic labels
jan = []
for i in gamma:
    try:
        data = urlopen(i)
        soup = BeautifulSoup(data, "html.parser")
        for table in soup.find_all('section',{'class':'module-statistics-off-the-tee-details'}):
            for j in table.find_all('h3'):
                y = j.get_text().replace(" ","").replace("-","").replace(":","").replace(">","").replace("<","").replace(")","").replace("(","").replace("=","").replace("+","")
                jan.append([i, str(y+'.csv')])
                print([i, str(y+'.csv')])
    except Exception as e:
        print(e)
        pass
# practice url
#jan = [['http://www.pgatour.com/stats/stat.02356.html', 'Last15EventsScoring.csv']]
#grabbing data
#write to csv
row_sp = []
rows_sp = []
title1 = []
title = []
for i in jan:
    try:
        with open(i[1], 'w+') as fp:
            writer = csv.writer(fp)
            for y in years:
                data = urlopen(i[0][:-4] + y + end)
                soup = BeautifulSoup(data, "html.parser")
                data1 = urlopen(i[0])
                soup1 = BeautifulSoup(data1, "html.parser")
                for table in soup1.find_all('table',{'id':'statsTable'}):
                    title.append('year')
                    for k in table.find_all('tr'):
                        for n in k.find_all('th'):
                            title1.append(n.get_text())
                        for l in title1:
                            if l not in title:
                                title.append(l)
                    rows_sp.append(title)
                for table in soup.find_all('table',{'id':'statsTable'}):
                    for h in table.find_all('tr'):
                        row_sp = [y]
                        for j in h.find_all('td'):
                            row_sp.append(j.get_text().replace(" ","").replace("\n","").replace("\xa0"," ").replace("d",""))
                        rows_sp.append(row_sp)
                        print(row_sp)
                        writer.writerows([row_sp])
    except Exception as e:
        print(e)
        pass
dfs = [df1,df2,df3] # store dataframes in one list
df_merge = reduce(lambda left,right: pd.merge(left,right,on=['v1'], how='outer'), dfs)
The URLs, stat types, and desired format are below.
The ... is just all of the stuff in between.
I'm trying to get the data onto one row per golfer.
URLs for the data below: ['http://www.pgatour.com/stats/stat.02356.html', 'http://www.pgatour.com/stats/stat.02568.html', ..., 'http://www.pgatour.com/stats/stat.111.html']
Statistics Titles
LAST 15 EVENTS - SCORING, SG: APPROACH-THE-GREEN, ..., SAND SAVE PERCENTAGE
year rankthisweek ranklastweek name events rating rounds avg
2017 2 3 Rickie Fowler 10 8.8 62 .614
TOTAL SG:APP MEASURED ROUNDS .... % # SAVES # BUNKERS TOTAL O/U PAR
26.386 43 ....70.37 76 108 +7.00
UPDATE (per comments)
This question is partly about technical methods (Pandas merge()), but it also seems like an opportunity to discuss useful workflows for data collection and cleaning. As such I'm adding a bit more detail and explanation than what is strictly required for a coding solution.
You can basically use the same approach as my original answer to get data from different URL categories. I'd recommend keeping a list of {url:data} dicts as you iterate over your URL list, and then building cleaned data frames from that dict.
There's a little legwork involved in setting up the cleaning portion, as you need to adjust for the different columns in each URL category. I've demonstrated with a manual approach, using only a few test URLs. But if you have, say, thousands of different URL categories, then you may need to think about how to collect and organize column names programmatically. That feels out of scope for this OP.
As long as you're sure there's a year and PLAYER NAME field in each URL, the following merge should work. As before, let's assume that you don't need to write to CSV, and for now let's leave off making any optimizations to your scraping code:
First, define the url categories in urls. By url category I'm referring to the fact that http://www.pgatour.com/stats/stat.02356.html will actually be used multiple times by inserting a series of years into the url itself, e.g.: http://www.pgatour.com/stats/stat.02356.2017.html, http://www.pgatour.com/stats/stat.02356.2016.html. In this example, stat.02356.html is the url category that contains information about multiple years of player data.
import pandas as pd
# test urls given by OP
# note: each url contains >= 1 data fields not shared by the others
urls = ['http://www.pgatour.com/stats/stat.02356.html',
'http://www.pgatour.com/stats/stat.02568.html',
'http://www.pgatour.com/stats/stat.111.html']
# we'll store data from each url category in this dict.
url_data = {}
Now iterate over urls. Within the urls loop, this code is all the same as my original answer, which in turn is coming from OP - only with some variable names adjusted to reflect our new capturing method.
for url in urls:
    print("url: ", url)
    url_data[url] = {"row_sp": [],
                     "rows_sp": [],
                     "title1": [],
                     "title": []}
    try:
        #with open(i[1], 'w+') as fp:
        #writer = csv.writer(fp)
        for y in years:
            current_url = url[:-4] + y + end
            print("current url is: ", current_url)
            data = urlopen(current_url)
            soup = BeautifulSoup(data, "html.parser")
            data1 = urlopen(url)
            soup1 = BeautifulSoup(data1, "html.parser")
            for table in soup1.find_all('table',{'id':'statsTable'}):
                url_data[url]["title"].append('year')
                for k in table.find_all('tr'):
                    for n in k.find_all('th'):
                        url_data[url]["title1"].append(n.get_text())
                    for l in url_data[url]["title1"]:
                        if l not in url_data[url]["title"]:
                            url_data[url]["title"].append(l)
                url_data[url]["rows_sp"].append(url_data[url]["title"])
            for table in soup.find_all('table',{'id':'statsTable'}):
                for h in table.find_all('tr'):
                    url_data[url]["row_sp"] = [y]
                    for j in h.find_all('td'):
                        url_data[url]["row_sp"].append(j.get_text().replace(" ","").replace("\n","").replace("\xa0"," ").replace("d",""))
                    url_data[url]["rows_sp"].append(url_data[url]["row_sp"])
                    #print(row_sp)
                    #writer.writerows([row_sp])
    except Exception as e:
        print(e)
        pass
Now for each key url in url_data, rows_sp contains the data you're interested in for that particular url category.
Note that rows_sp will now actually be url_data[url]["rows_sp"] when we iterate over url_data, but the next few code blocks are from my original answer, and so use the old rows_sp variable name.
# example rows_sp
[['year',
'RANK THIS WEEK',
'RANK LAST WEEK',
'PLAYER NAME',
'EVENTS',
'RATING',
'year',
'year',
'year',
'year'],
['2017'],
['2017', '1', '1', 'Sam Burns', '1', '9.2'],
['2017', '2', '3', 'Rickie Fowler', '10', '8.8'],
['2017', '2', '2', 'Dustin Johnson', '10', '8.8'],
['2017', '2', '3', 'Whee Kim', '2', '8.8'],
['2017', '2', '3', 'Thomas Pieters', '3', '8.8'],
...
]
Writing rows_sp directly to a data frame shows that the data aren't quite in the right format:
pd.DataFrame(rows_sp).head()
0 1 2 3 4 5 6 \
0 year RANK THIS WEEK RANK LAST WEEK PLAYER NAME EVENTS RATING year
1 2017 None None None None None None
2 2017 1 1 Sam Burns 1 9.2 None
3 2017 2 3 Rickie Fowler 10 8.8 None
4 2017 2 2 Dustin Johnson 10 8.8 None
7 8 9
0 year year year
1 None None None
2 None None None
3 None None None
4 None None None
pd.DataFrame(rows_sp).dtypes
0 object
1 object
2 object
3 object
4 object
5 object
6 object
7 object
8 object
9 object
dtype: object
With a little cleanup, we can get rows_sp into a data frame with appropriate numeric data types:
df = pd.DataFrame(rows_sp, columns=rows_sp[0]).drop(0)
df.columns = ["year","RANK THIS WEEK","RANK LAST WEEK",
"PLAYER NAME","EVENTS","RATING",
"year1","year2","year3","year4"]
df.drop(["year1","year2","year3","year4"], 1, inplace=True)
df = df.loc[df["PLAYER NAME"].notnull()]
df = df.loc[df.year != "year"]
num_cols = ["RANK THIS WEEK","RANK LAST WEEK","EVENTS","RATING"]
df[num_cols] = df[num_cols].apply(pd.to_numeric)
df.head()
year RANK THIS WEEK RANK LAST WEEK PLAYER NAME EVENTS RATING
2 2017 1 1.0 Sam Burns 1 9.2
3 2017 2 3.0 Rickie Fowler 10 8.8
4 2017 2 2.0 Dustin Johnson 10 8.8
5 2017 2 3.0 Whee Kim 2 8.8
6 2017 2 3.0 Thomas Pieters 3 8.8
UPDATED CLEANING
Now that we have a series of url categories to contend with, each with a different set of fields to clean, the above section gets a little more complicated. If you only have a few pages, it may be feasible to just visually review the fields for each category, and store them, like this:
cols = {'stat.02568.html':{'columns':['year', 'RANK THIS WEEK', 'RANK LAST WEEK',
'PLAYER NAME', 'ROUNDS', 'AVERAGE',
'TOTAL SG:APP', 'MEASURED ROUNDS',
'year1', 'year2', 'year3', 'year4'],
'numeric':['RANK THIS WEEK', 'RANK LAST WEEK', 'ROUNDS',
'AVERAGE', 'TOTAL SG:APP', 'MEASURED ROUNDS',]
},
'stat.111.html':{'columns':['year', 'RANK THIS WEEK', 'RANK LAST WEEK',
'PLAYER NAME', 'ROUNDS', '%', '# SAVES', '# BUNKERS',
'TOTAL O/U PAR', 'year1', 'year2', 'year3', 'year4'],
'numeric':['RANK THIS WEEK', 'RANK LAST WEEK', 'ROUNDS',
'%', '# SAVES', '# BUNKERS', 'TOTAL O/U PAR']
},
'stat.02356.html':{'columns':['year', 'RANK THIS WEEK', 'RANK LAST WEEK',
'PLAYER NAME', 'EVENTS', 'RATING',
'year1', 'year2', 'year3', 'year4'],
'numeric':['RANK THIS WEEK', 'RANK LAST WEEK',
'EVENTS', 'RATING']
}
}
And then you can loop over url_data again and store in a dfs collection:
dfs = {}
for url in url_data:
    page = url.split("/")[-1]
    colnames = cols[page]["columns"]
    num_cols = cols[page]["numeric"]
    rows_sp = url_data[url]["rows_sp"]
    df = pd.DataFrame(rows_sp, columns=rows_sp[0]).drop(0)
    df.columns = colnames
    df.drop(["year1","year2","year3","year4"], 1, inplace=True)
    df = df.loc[df["PLAYER NAME"].notnull()]
    df = df.loc[df.year != "year"]
    # tied ranks (e.g. "T9") mess up to_numeric; remove the tie indicators.
    df["RANK THIS WEEK"] = df["RANK THIS WEEK"].str.replace("T","")
    df["RANK LAST WEEK"] = df["RANK LAST WEEK"].str.replace("T","")
    df[num_cols] = df[num_cols].apply(pd.to_numeric)
    dfs[url] = df
At this point, we're ready to merge all the different data categories by year and PLAYER NAME. (You could actually have merged iteratively in the cleaning loop, but I'm separating here for demonstrative purposes.)
master = pd.DataFrame()
for url in dfs:
    if master.empty:
        master = dfs[url]
    else:
        master = master.merge(dfs[url], on=['year','PLAYER NAME'])
Now master contains the merged data for each player-year. Here's a view into the data, using groupby():
master.groupby(["PLAYER NAME", "year"]).first().head(4)
RANK THIS WEEK_x RANK LAST WEEK_x EVENTS RATING \
PLAYER NAME year
Aam Hawin 2015 66 66.0 7 8.2
2016 80 80.0 12 8.1
2017 72 45.0 8 8.2
Aam Scott 2013 45 45.0 10 8.2
RANK THIS WEEK_y RANK LAST WEEK_y ROUNDS_x AVERAGE \
PLAYER NAME year
Aam Hawin 2015 136 136 95 -0.183
2016 122 122 93 -0.061
2017 56 52 84 0.296
Aam Scott 2013 16 16 61 0.548
TOTAL SG:APP MEASURED ROUNDS RANK THIS WEEK \
PLAYER NAME year
Aam Hawin 2015 -14.805 81 86
2016 -5.285 87 39
2017 18.067 61 8
Aam Scott 2013 24.125 44 57
RANK LAST WEEK ROUNDS_y % # SAVES # BUNKERS \
PLAYER NAME year
Aam Hawin 2015 86 95 50.96 80 157
2016 39 93 54.78 86 157
2017 6 84 61.90 91 147
Aam Scott 2013 57 61 53.85 49 91
TOTAL O/U PAR
PLAYER NAME year
Aam Hawin 2015 47.0
2016 43.0
2017 27.0
Aam Scott 2013 11.0
You may want to do a bit more cleaning on the merged columns, as some are duplicated across data categories (e.g. ROUNDS_x and ROUNDS_y). From what I can tell, the duplicate field names seem to contain exactly the same information, so you might just drop the _y version of each one.
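For instance, a rough sketch of that last cleanup, dropping the duplicated _y columns and stripping the _x suffix from the ones you keep (assuming the suffixes follow the merged output shown above):

# drop the duplicated *_y columns, then remove the _x suffix from the survivors
master = master.drop(columns=[c for c in master.columns if c.endswith("_y")])
master.columns = [c[:-2] if c.endswith("_x") else c for c in master.columns]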
