Speed optimization for loop - python

I'm trying to predict the outcome of sports games and therefore want to transform my dataframe into a form I can train a model on. Currently I use a for loop to iterate over all played games, pick the two players of each game, and check how they performed in the x games before the actual game took place. I then take the mean of the statistics of those previous games for each player and concatenate the two together. At the end I add the true outcome of the actual game so I can train a model on it.
Now I have some performance issues: my current code takes about 9 minutes to complete for 20,000 games (with ~200 variables). I already managed to get it down from 20 to 9 minutes.
I started by adding each game to a dataframe; later I changed this to appending each separate dataframe to a list and building one big dataframe from that list at the end.
I also included if statements that make the loop continue if a player did not play at least x games before the current one.
I expect this can be done much faster than 9 minutes.
Hope you guys can help me!
import pandas as pd
import numpy as np
import random
import string

letters = list(string.ascii_lowercase)
datelist = pd.date_range(start='1/1/2017', end='1/1/2019')

data = pd.DataFrame({'Date': np.random.choice(datelist, 5000),
                     'League': np.random.choice(['LeagueA', 'LeagueB'], 5000),
                     'Home_player': np.random.choice(letters, 5000),
                     'Away_player': np.random.choice(letters, 5000),
                     'Home_strikes': np.random.randint(1, 20, 5000),
                     'Home_kicks': np.random.randint(1, 20, 5000),
                     'Away_strikes': np.random.randint(1, 20, 5000),
                     'Away_kicks': np.random.randint(1, 20, 5000),
                     'Winner': np.random.randint(0, 2, 5000)})

leagues = list(data['League'].unique())
home_columns = [col for col in data if col.startswith('Home')]
away_columns = [col for col in data if col.startswith('Away')]

# Determine how many of the last x games to take statistics from
total_games = 5
final_df = []

# Make a subframe per league
for league in leagues:
    league_data = data[data.League == league]
    league_data = league_data.sort_values(by='Date').reset_index(drop=True)
    league_data = league_data.head(500)
    for i in range(0, len(league_data)):
        if i < 1:
            league_copy = league_data.sort_values(by='Date').reset_index(drop=True)
        else:
            league_copy = league_data[:-i].reset_index(drop=True)
        # Loop back from the last game
        last_game = league_copy.iloc[-1:].reset_index(drop=True)
        # Take home and away player
        Home_player = last_game.loc[0, "Home_player"]  # Pick home team
        Away_player = last_game.loc[0, 'Away_player']  # Pick away team
        # Remove the last row so the current game is not picked
        df = league_copy[:-1]
        # Now check the statistics of the games before this game was played
        Home = df[df.Home_player == Home_player].tail(total_games)  # Pick data of the home player
        # If the player did not play at least x games, then continue
        if len(Home) < total_games:
            continue
        else:
            Home = Home[home_columns].reset_index(drop=True)  # Pick all column names that start with "Home"
        # Do the same for the away player
        Away = df[df.Away_player == Away_player].tail(total_games)  # Pick data of the away player
        if len(Away) < total_games:
            continue
        else:
            Away = Away[away_columns].reset_index(drop=True)  # Pick all column names that start with "Away"
        # Now concat home and away player data
        Home_away = pd.concat([Home, Away], axis=1)
        Home_away.drop(['Away_player', 'Home_player'], inplace=True, axis=1)
        # Take the mean of all columns
        Home_away = pd.DataFrame(Home_away.mean().to_dict(), index=[0])
        # Now add the home and away players back to the dataframe
        Home_away["Home_player"] = Home_player
        Home_away["Away_player"] = Away_player
        winner = last_game.loc[0, "Winner"]
        date = last_game.loc[0, "Date"]
        Home_away['Winner'] = winner
        Home_away['Date'] = date
        final_df.append(Home_away)

final_df = pd.concat(final_df, axis=0)
final_df = final_df[['Date', 'Home_player', 'Away_player', 'Home_kicks', 'Away_kicks',
                     'Home_strikes', 'Away_strikes', 'Winner']]

This doesn't answer your question but you can leverage the package line_profiler to find the slow parts of your code.
Resource:
http://gouthamanbalaraman.com/blog/profiling-python-jupyter-notebooks.html
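As a minimal sketch of how line_profiler can be driven from a plain script (the function build_training_frame and the toy frame below are placeholders, not the question's actual code; in a notebook you could instead use the %load_ext line_profiler / %lprun magics described in the link):

import numpy as np
import pandas as pd
from line_profiler import LineProfiler

def build_training_frame(data, total_games=5):
    # Placeholder for the loop from the question; paste the real body here.
    return data.groupby('League').head(total_games)

data = pd.DataFrame({'League': np.random.choice(['LeagueA', 'LeagueB'], 100),
                     'Winner': np.random.randint(0, 2, 100)})

profiler = LineProfiler()
profiler.add_function(build_training_frame)
profiler.runcall(build_training_frame, data)
profiler.print_stats()

Profiling the loop from the question line by line gives output like this: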
Line # Hits Time Per Hit % Time Line Contents
==============================================================
2 1 35.0 35.0 0.0 letters = list(string.ascii_lowercase)
3 1 11052.0 11052.0 0.0 datelist = pd.date_range(start='1/1/2017', end='1/1/2019')
4
5 1 3483.0 3483.0 0.0 data = pd.DataFrame({'Date':np.random.choice(datelist,5000),
6 1 1464.0 1464.0 0.0 'League': np.random.choice(['LeagueA','LeagueB'], 5000),
7 1 2532.0 2532.0 0.0 'Home_player':np.random.choice(letters, 5000),
8 1 1019.0 1019.0 0.0 'Away_player':np.random.choice(letters, 5000),
9 1 693.0 693.0 0.0 'Home_strikes':np.random.randint(1,20,5000),
10 1 682.0 682.0 0.0 'Home_kicks':np.random.randint(1,20,5000),
11 1 682.0 682.0 0.0 'Away_strikes':np.random.randint(1,20,5000),
12 1 731.0 731.0 0.0 'Away_kicks':np.random.randint(1,20,5000),
13 1 40409.0 40409.0 0.0 'Winner':np.random.randint(0,2,5000)})
14
15 1 6560.0 6560.0 0.0 leagues = list(data['League'].unique())
16 1 439.0 439.0 0.0 home_columns = [col for col in data if col.startswith('Home')]
17 1 282.0 282.0 0.0 away_columns = [col for col in data if col.startswith('Away')]
18
19 # Determine to how many last x games to take statistics
20 1 11.0 11.0 0.0 total_games = 5
21 1 12.0 12.0 0.0 final_df = []
22
23 # Make subframe of league
24 3 38.0 12.7 0.0 for league in leagues:
25
26 2 34381.0 17190.5 0.0 league_data = data[data.League == league]
27 2 30815.0 15407.5 0.0 league_data = league_data.sort_values(by='Date').reset_index(drop=True)
28 # Pick the last game
29 2 5045.0 2522.5 0.0 league_data = league_data.head(500)
30 1002 14202.0 14.2 0.0 for i in range(0,len(league_data)):
31 1000 11943.0 11.9 0.0 if i < 1:
32 2 28407.0 14203.5 0.0 league_copy = league_data.sort_values(by='Date').reset_index(drop=True)
33 else:
34 998 5305364.0 5316.0 4.2 league_copy = league_data[:-i].reset_index(drop=True)
35
36 # Loop back from the last game
37 1000 4945240.0 4945.2 3.9 last_game = league_copy.iloc[-1:].reset_index(drop=True)
38
39 # Take home and away player
40 1000 1504055.0 1504.1 1.2 Home_player = last_game.loc[0,"Home_player"] # Pick home team
41 1000 899081.0 899.1 0.7 Away_player = last_game.loc[0,'Away_player'] # pick away team
42
43 # # Remove last row so current game is not picked
44 1000 2539351.0 2539.4 2.0 df = league_copy[:-1]
45
46 # Now check the statistics of the games befóre this game was played
47 1000 16428854.0 16428.9 13.0 Home = df[df.Home_player == Home_player].tail(total_games) # Pick data from home team
48
49 # If the player did not play at least x number of games, then continue
50 1000 49133.0 49.1 0.0 if len(Home) < total_games:
51 260 2867.0 11.0 0.0 continue
52 else:
53 740 12968016.0 17524.3 10.2 Home = Home[home_columns].reset_index(drop=True) # Pick all columnnames that start with "Home"
54
55
56 # Do the same for the away team
57 740 12007650.0 16226.6 9.5 Away = df[df.Away_player == Away_player].tail(total_games) # Pick data from home team
58
59 740 33357.0 45.1 0.0 if len(Away) < total_games:
60 64 825.0 12.9 0.0 continue
61 else:
62 676 11598741.0 17157.9 9.1 Away = Away[away_columns].reset_index(drop=True) # Pick all columnnames that start with "Home"
63
64
65 # Now concat home and away player data
66 676 5114022.0 7565.1 4.0 Home_away = pd.concat([Home, Away], axis=1)
67 676 9702001.0 14352.1 7.6 Home_away.drop(['Away_player','Home_player'],inplace=True,axis=1)
68
69 # Take the mean of all columns
70 676 12171184.0 18004.7 9.6 Home_away = pd.DataFrame(Home_away.mean().to_dict(),index=[0])
71
72 # Now again add home team and away team to dataframe
73 676 5112558.0 7563.0 4.0 Home_away["Home_player"] = Home_player
74 676 4880017.0 7219.0 3.8 Home_away["Away_player"] = Away_player
75
76 676 791718.0 1171.2 0.6 winner = last_game.loc[0,"Winner"]
77 676 696925.0 1031.0 0.5 date = last_game.loc[0,"Date"]
78 676 5142111.0 7606.7 4.1 Home_away['Winner'] = winner
79 676 9630466.0 14246.3 7.6 Home_away['Date'] = date
80
81 676 16125.0 23.9 0.0 final_df.append(Home_away)
82 1 5088063.0 5088063.0 4.0 final_df = pd.concat(final_df, axis=0)
83 1 18424.0 18424.0 0.0 final_df = final_df[['Date','Home_player','Away_player','Home_kicks','Away_kicks','Home_strikes','Away_strikes','Winner']]

IIUC, you can obtain the statistics of the last 5 games, including the current one, with:
# replace this with your statistic columns
stat_cols = data.columns[4:]
total_games = 5
data.groupby(['League','Home_player', 'Away_player'])[stat_cols].rolling(total_games).mean()
If you want to exclude the current one:
last_stats = data.groupby(['League','Home_player', 'Away_player']).apply(lambda x: x[stat_cols].shift().rolling(total_games).mean())
This last_stats data frame should have the same index as the original one, so you can do:
train_data = data.copy()
# backup the actual outcome
train_data['Actual'] = train_data['Winner']
# copy the average statistics
train_data[stat_cols] = last_stats
Altogether this should not take more than a minute.
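To see why the shift() matters, here is a tiny illustration on toy data (not the question's frame): the shifted rolling mean only looks at the games before the current row, while the plain rolling mean includes the current game.

import pandas as pd

s = pd.Series([10, 20, 30, 40, 50])
print(s.rolling(2).mean())          # window includes the current game
print(s.shift().rolling(2).mean())  # window covers only the two previous games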

Related

Calculate Multiple Column Growth in Python Dataframe

The data I used looks like this:
data
Subject 2000_X1 2000_X2 2001_X1 2001_X2 2002_X1 2002_X2
1 100 50 120 45 110 50
2 95 40 100 45 105 50
3 110 45 100 45 110 40
I want to calculate the growth of each variable for each year, so the result will look like this:
Subject 2001_X1_gro 2001_X2_gro 2002_X1_gro 2002_X2_gro
1 0.2 -0.1 -0.08333 0.11111
2 0.052632 0.125 0.05 0.11111
3 -0.09091 0 0.1 -0.11111
I already do this manually for each variable and each year with code like this:
data['2001_X1_gro'] = (data['2001_X1'] - data['2000_X1']) / data['2000_X1']
data['2002_X1_gro'] = (data['2002_X1'] - data['2001_X1']) / data['2001_X1']
data['2001_X2_gro'] = (data['2001_X2'] - data['2000_X2']) / data['2000_X2']
data['2002_X2_gro'] = (data['2002_X2'] - data['2001_X2']) / data['2001_X2']
Is there a way to do this more efficiently, especially if I have more years and/or more variables?
import pandas as pd
df = pd.read_csv('data.txt', sep=',', header=0)
Input
Subject 2000_X1 2000_X2 2001_X1 2001_X2 2002_X1 2002_X2
0 1 100 50 120 45 110 50
1 2 95 40 100 45 105 50
2 3 110 45 100 45 110 40
Next, a loop is created and the columns are filled:
qqq = '_gro'
new_name = ''
year = ''
for i in range(1, len(df.columns) - 2):
    year = str(int(df.columns[i][:4]) + 1) + df.columns[i][4:]
    new_name = year + qqq
    df[new_name] = (df[year] - df[df.columns[i]]) / df[df.columns[i]]
print(df)
Output
Subject 2000_X1 2000_X2 2001_X1 2001_X2 2002_X1 2002_X2 2001_X1_gro \
0 1 100 50 120 45 110 50 0.200000
1 2 95 40 100 45 105 50 0.052632
2 3 110 45 100 45 110 40 -0.090909
2001_X2_gro 2002_X1_gro 2002_X2_gro
0 -0.100 -0.083333 0.111111
1 0.125 0.050000 0.111111
2 0.000 0.100000 -0.111111
In the loop, the year is extracted from the column name, converted to int, and 1 is added to it. The value is converted back to a string and the '_Xn' suffix is appended. A new_name variable is created, to which the string '_gro' is also appended. A column with that name is created and filled with the calculated values.
If you want to compute growth over, for example, three years, then you need to add 3 instead of 1. This assumes your data columns are ordered. Also note that the loop does not go through all the elements: for i in range(1, len(df.columns) - 2). It skips the Subject column and stops short of the last two columns, so you need to know where to stop it.
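If you prefer to loop only over the variables rather than over every year/column pair, a possible alternative (a sketch, assuming the 'YYYY_Xn' column naming from the question) lets pct_change compute the year-over-year growth for each variable across its year columns:

import pandas as pd

# df is the frame from the question: a 'Subject' column plus 'YYYY_Xn' value columns
value_cols = [c for c in df.columns if c != 'Subject']
variables = sorted({c.split('_')[1] for c in value_cols})

for var in variables:
    cols = sorted(c for c in value_cols if c.endswith('_' + var))
    # growth of each year relative to the previous year; the first year has no predecessor
    growth = df[cols].pct_change(axis=1).iloc[:, 1:]
    growth.columns = [c + '_gro' for c in growth.columns]
    df = pd.concat([df, growth], axis=1)

print(df)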

Group By Median, Percentile and Percent of Total

I have a dataframe that looks like this...
ID Acuity TOTAL_ED_LOS
1 2 423
2 5 52
3 5 535
4 1 87
...
I would like to produce a table that looks like this:
Acuity Count Median Percentile_25 Percentile_75 % of total
1 234 ... 31%
2 65 ... 8%
3 56 ... 7%
4 345 ... 47%
5 35 ... 5%
I already have code that will give me everything I need except for the % of total column
def percentile(n):
    def percentile_(x):
        return np.percentile(x, n)
    percentile_.__name__ = 'percentile_%s' % n
    return percentile_

df_grp = df_merged_v1.groupby(['Acuity'])
df_grp['TOTAL_ED_LOS'].agg(['count', 'median',
                            percentile(25), percentile(75)]).reset_index()
Is there an efficient way I can add the percent of total column? The link below contains code on how to obtain the percent of total, but I'm unsure how to apply it to my code. I know that I could create two tables and then merge them, but I'm curious whether there is a cleaner way.
How to calculate count and percentage in groupby in Python
Here's one way to do it using some pandas built-in tools:
import numpy as np
import pandas as pd

# Set random number seed and create a dummy dataframe with two columns
np.random.seed(123)
df = pd.DataFrame({'activity': np.random.choice([*'ABCDE'], 40),
                   'TOTAL_ED_LDS': np.random.randint(50, 500, 40)})

# Reshape the dataframe to get one activity per column,
# then use the output from describe and transpose
df_out = df.set_index([df.groupby('activity').cumcount(), 'activity'])['TOTAL_ED_LDS']\
           .unstack().describe().T

# Calculate percent count of total count
df_out['% of Total'] = df_out['count'] / df_out['count'].sum() * 100.
df_out
Output:
count mean std min 25% 50% 75% max % of Total
activity
A 8.0 213.125000 106.810162 93.0 159.50 200.0 231.75 421.0 20.0
B 10.0 308.200000 116.105125 68.0 240.75 324.5 376.25 461.0 25.0
C 6.0 277.666667 117.188168 114.0 193.25 311.5 352.50 409.0 15.0
D 7.0 370.285714 124.724649 120.0 337.50 407.0 456.00 478.0 17.5
E 9.0 297.000000 160.812002 51.0 233.00 294.0 415.00 488.0 22.5
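For what it's worth, the percent-of-total column can also be bolted onto the asker's existing agg approach (a sketch, assuming df_merged_v1 is the frame from the question and reusing the percentile helper defined there):

import numpy as np

def percentile(n):
    # the percentile helper from the question
    def percentile_(x):
        return np.percentile(x, n)
    percentile_.__name__ = 'percentile_%s' % n
    return percentile_

out = (df_merged_v1.groupby('Acuity')['TOTAL_ED_LOS']
       .agg(['count', 'median', percentile(25), percentile(75)])
       .reset_index())
out['% of total'] = out['count'] / out['count'].sum() * 100

print(out)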

Pandas: duplicating dataframe entries while column higher or equal to 0

I have a dataframe containing clinical readings of hospital patients; for example, a similar dataframe could look like this:
heartrate pid time
0 67 151 0.0
1 75 151 1.2
2 78 151 2.5
3 99 186 0.0
In reality there are many more columns, but I will just keep those 3 to make the example more concise.
I would like to "expand" the dataset. In short, I would like to be able to give an argument n_times_back and another argument interval.
For each iteration, which corresponds to for i in range(n_times_back + 1), we do the following:
1. Create a new, unique pid [OLD ID | i]. (As long as the new pid is unique for each duplicated entry, the exact name isn't really important to me, so feel free to change this if it makes it easier.)
2. For every patient (pid), remove the rows whose time column is greater than the final time of that patient - i * interval. For example, if i * interval = 2.0 and the times associated with one pid are [0, 0.5, 1.5, 2.8], the new times will be [0, 0.5], as final time - 2.0 = 0.8.
3. Iterate.
Since I realize that explaining this textually is a bit messy, here is an example.
With the dataset above, if we let n_times_back = 1 and interval=1 then we get
heartrate pid time
0 67 15100 0.0
1 75 15100 1.2
2 78 15100 2.5
3 67 15101 0.0
4 75 15101 1.2
5 99 18600 0.0
For n_times_back = 2, the result would be
heartrate pid time
0 67 15100 0.0
1 75 15100 1.2
2 78 15100 2.5
3 67 15101 0.0
4 75 15101 1.2
5 67 15102 0.0
6 99 18600 0.0
n_times_back = 3 and above would lead to the same result as n_times_back = 2, as no patient data goes below that point in time
I have written code for this.
def expand_df(df, n_times_back, interval):
    for curr_patient in df['pid'].unique():
        patient_data = df[df['pid'] == curr_patient]
        final_time = patient_data['time'].max()
        for i in range(n_times_back + 1):
            new_data = patient_data[patient_data['time'] <= final_time - i * interval]
            new_data['pid'] = patient_data['pid'].astype(str) + str(i).zfill(2)
            new_data['pid'] = new_data['pid'].astype(int)
            # Check if there is any time index left; if not, don't add a useless entry to the dataframe
            if new_data['time'].count() > 0:
                df = df.append(new_data)
        df = df[df['pid'] != curr_patient]  # remove original patient data, now duplicated
    df.reset_index(inplace=True, drop=True)
    return df
As far as functionality goes, this code works as intended. However, it is very slow. I am working with a dataframe of 30'000 patients and the code has been running for over 2 hours now.
Is there a way to use pandas operations to speed this up? I have looked around but so far I haven't managed to reproduce this functionality with high level pandas functions
I ended up using a groupby function and breaking when no more times were available, as well as creating an "iteration" column that I merge with the "pid" column at the end.
def expand_df(group, n_times, interval):
    df = pd.DataFrame()
    final_time = group['time'].max()
    for i in range(n_times + 1):
        new_data = group[group['time'] <= final_time - i * interval]
        new_data['iteration'] = str(i).zfill(2)
        # Check if there is any time index left; if not, don't add a useless entry to the dataframe
        if new_data['time'].count() > 0:
            df = df.append(new_data)
        else:
            break
    return df

new_df = df.groupby('pid').apply(lambda x: expand_df(x, n_times_back, interval))
new_df = new_df.reset_index(drop=True)
new_df['pid'] = new_df['pid'].map(str) + new_df['iteration']

Pandas groupby two columns and only keep records satisfying condition based on count

I am trying to limit the number of actions kept for a user once the number of actions in a session reaches a threshold.
Here is the data set: (Only Few records)
user_id,session_id,item_id,rating,length,time
123,36,28,3.5,6243.0,2015-03-07 22:44:40
123,36,29,2.5,4884.0,2015-03-07 22:44:14
123,36,30,3.5,6846.0,2015-03-07 22:44:28
123,36,54,6.5,10281.0,2015-03-07 22:43:56
123,36,61,3.5,7639.0,2015-03-07 22:43:44
123,36,62,7.5,18640.0,2015-03-07 22:43:34
123,36,63,8.5,7189.0,2015-03-07 22:44:06
123,36,97,2.5,7627.0,2015-03-07 22:42:53
123,36,98,4.5,9000.0,2015-03-07 22:43:04
123,36,99,7.5,7514.0,2015-03-07 22:43:13
223,63,30,8.0,5412.0,2015-03-22 01:42:10
123,36,30,5.5,8046.0,2015-03-07 22:42:05
223,63,32,8.5,4872.0,2015-03-22 01:42:03
123,36,32,7.5,11914.0,2015-03-07 22:41:54
225,63,35,7.5,6491.0,2015-03-22 01:42:19
123,36,35,5.5,7202.0,2015-03-07 22:42:15
123,36,36,6.5,6806.0,2015-03-07 22:42:43
123,36,37,2.5,6810.0,2015-03-07 22:42:34
225,63,41,5.0,15026.0,2015-03-22 01:42:37
225,63,45,6.5,8532.0,2015-03-07 22:42:25
I can groupby the data using user_id and session_id and get a count of items a user has rated in a session:
df.groupby(['user_id', 'session_id']).agg({'item_id':'count'}).rename(columns={'item_id': 'count'})
List of items that user has rated in a session can be obtained:
df.groupby(['user_id','session_id'])['item_id'].apply(list)
The goal is to get the following: if a user has rated more than 3 items in a session, I want to pick only the first three items (keep only the first three per user per session) from the original data frame. Maybe use the time to sort the items?
I first tried to find which sessions contain more than 3 items, but I'm struggling to go beyond that:
df.groupby(['user_id', 'session_id'])['item_id'].apply(lambda x: (x > 3).count())
Example: from the original df, user 123 should keep only the first three records of session 36.
It seems like you want to use groupby with head:
In [8]: df.groupby([df.user_id, df.session_id]).head(3)
Out[8]:
user_id session_id item_id rating length time
0 123 36 28 3.5 6243.0 2015-03-07 22:44:40
1 123 36 29 2.5 4884.0 2015-03-07 22:44:14
2 123 36 30 3.5 6846.0 2015-03-07 22:44:28
10 223 63 30 8.0 5412.0 2015-03-22 01:42:10
12 223 63 32 8.5 4872.0 2015-03-22 01:42:03
14 225 63 35 7.5 6491.0 2015-03-22 01:42:19
18 225 63 41 5.0 15026.0 2015-03-22 01:42:37
19 225 63 45 6.5 8532.0 2015-03-07 22:42:25
One way is to use sort_values followed by groupby.cumcount. A method I find useful is to extract any series or MultiIndex data before applying any filtering.
The example below filters for user_id / session_id combinations with at least 3 items and only takes the first 3 in each group.
sizes = df.groupby(['user_id', 'session_id']).size()
counter = df.groupby(['user_id', 'session_id']).cumcount() + 1 # counting begins at 0
indices = df.set_index(['user_id', 'session_id']).index
df = df.sort_values('time')
res = df[(indices.map(sizes.get) >= 3) & (counter <=3)]
print(res)
user_id session_id item_id rating length time
0 123 36 28 3.5 6243.0 2015-03-07 22:44:40
1 123 36 29 2.5 4884.0 2015-03-07 22:44:14
2 123 36 30 3.5 6846.0 2015-03-07 22:44:28
14 225 63 35 7.5 6491.0 2015-03-22 01:42:19
18 225 63 41 5.0 15026.0 2015-03-22 01:42:37
19 225 63 45 6.5 8532.0 2015-03-07 22:42:25
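An equivalent, slightly shorter variant of the same idea (a sketch, not taken from either answer) sorts by time first, keeps only the user/session groups with at least 3 items via transform('size'), and then takes the first 3 rows of each group:

df_sorted = df.sort_values('time')
sizes = df_sorted.groupby(['user_id', 'session_id'])['item_id'].transform('size')
res = df_sorted[sizes >= 3].groupby(['user_id', 'session_id']).head(3)
print(res)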

Loop Through Table Rows Using BeautifulSoup

I need help looping through table rows and putting them into a list. On this website, there are three tables, each with different statistics - http://www.fangraphs.com/statsplits.aspx?playerid=15640&position=OF&season=0&split=0.4
For instance, these three tables have rows for 2016, 2017, and a total row. I would like the following:
A list of the following -->table 1 - row 1, table 2 - row 1, table 3 - row 1
A second list of the following -->table 1 - row 2, table 2 - row 2, table 3 - row 2
A third list: -->table 1 - row 3, table 2 - row 3, table 3 - row 3
I know I obviously need to create lists and use the append function; however, I am not sure how to get it to loop through just the first row of each table, then the second row of each table, and so on through every row (the number of rows will vary in each instance; this one just happens to have 3).
Any help is greatly appreciated. The code is below:
from bs4 import BeautifulSoup
import requests
import pandas as pd
import csv

idList2 = ['15640', '9256']
splitList = [0.4, 0.2, 0.3, 0.4]

for id in idList2:
    pos = 'OF'
    for split in splitList:
        url = ('http://www.fangraphs.com/statsplits.aspx?playerid=' + str(id) +
               '&position=' + str(pos) + '&season=0&split=' + str(split))
        r = requests.get(url)
        for season in range(1, 4):
            print(season)
            soup = BeautifulSoup(r.text, "html.parser")
            tableStats = soup.find("table", {"id": "SeasonSplits1_dgSeason" + str(season) + "_ctl00"})
            column_headers = [th.getText() for th in soup.findAll('th')]
            statistics = soup.find("table", {"id": "SeasonSplits1_dgSeason" + str(season) + "_ctl00"})
            tabledata = [td.getText() for td in statistics('td')]
            print(tabledata)
This will be my last attempt. It has everything you should need. I created a traceback to where the tables, rows, and columns are being scraped; this all happens in the function extract_table(). Follow the traceback markers and don't worry about any other code. Don't let the large file size worry you, it's mostly documentation and spacing.
Traceback marker: ### ... ###
Start at line 95 with traceback marker ### START HERE ###
from bs4 import BeautifulSoup as Soup
import requests
import urllib

###### GO TO LINE 95 ######

### IGNORE ###
def generate_urls (idList, splitList):
    """ Using an id list and a split list, generate a list of urls"""
    urls = []
    url = 'http://www.fangraphs.com/statsplits.aspx'
    for id in idList:
        for split in splitList:
            # The parameters used in creating the url
            url_payload = {'split': split, 'playerid': id, 'position': 'OF', 'season': 0}
            # Create the url and add it to the collection of urls
            urls += ['?'.join([url, urllib.urlencode(url_payload)])]
    return urls  # Return the list of urls

### IGNORE ###
def extract_player_name (soup):
    """ Extract the player name from the browser title """
    # Browser title contains player name, strip all but name
    player_name = repr(soup.title.text.strip('\r\n\t'))
    player_name = player_name.split(' \\xbb')[0]  # Split on ` »`
    player_name = player_name[2:]  # Erase leading characters from using `repr`
    return player_name

########## FINISH HERE ##########
def extract_table (table_id, soup):
    """ Extract data from a table, return the column headers and the table rows"""
    ### IMPORTANT: THIS CODE IS WHERE ALL THE MAGIC HAPPENS ###
    # - First: Find lowest level tag of all the data we want (container).
    #
    # - Second: Extract the table column headers, requires minimal mining
    #
    # - Third: Gather a list of tags that represent the tables rows
    #
    # - Fourth: Loop through the list of rows
    #     A): Mine all columns in the row

    ### IMPORTANT: Get A Reference To The Table ###
    # SCRAPE 1:
    table_tag = soup.find("table", {"id" : 'SeasonSplits1_dgSeason%d_ctl00' % table_id})

    # SCRAPE 2:
    columns = [th.text for th in table_tag.findAll('th')]

    # SCRAPE 3:
    rows_tags = table_tag.tbody.findAll('tr')  # All 'tr' tags in the table `tbody` tag are row tags

    ### IMPORTANT: Cycle Through Rows And Collect Column Data ###
    # SCRAPE 4:
    rows = []  # List of all table rows
    for row_tag in rows_tags:
        ### IMPORTANT: Mine All Columns In This Row || LOWEST LEVEL IN THE MINING OPERATION. ###
        # SCRAPE 4.A
        row = [col.text for col in row_tag.findAll('td')]  # `td` represents a column in a row.
        rows.append (row)  # Add this row to all the other rows of this table

    # RETURN: The column headers and the rows of this table
    return [columns, rows]

### Look Deeper ###
def extract_player (soup):
    """ Extract player data and store in a list. ['name', [columns, rows], [table2]]"""
    player = []  # A list to store data in
    # player name is first in player list
    player.append (extract_player_name (soup))
    # Each table is a list entry
    for season in range(1,4):
        ### IMPORTANT: No Table Related Data Has Been Mined Yet. START HERE ###
        ### - Line: 37
        table = extract_table (season, soup)  # `season` represents the table id
        player.append(table)  # Add this table (a list) to the player data list
    # Return the player list
    return player

##################################################
################## START HERE ####################
##################################################
###
### OBJECTIVE:
###
###  - Follow the trail of important lines that extract the data
###  - Important lines will be marked as the following `### ... ###`
###
### All this code really needs is a url and the `extract_table()` function.
###
### The `main()` function is where the journey starts
###
##################################################
##################################################
def main ():
    """ The main function is the core program code. """
    # Luckily the pages we will scrape all have the same layout, making mining easier.
    all_players = []  # A place to store all the data

    # Values used to alter the url when making requests to access more player statistics
    idList2 = ['15640', '9256']
    splitList = [0.4, 0.2, 0.3, 0.4]

    # Instead of looping through variables that dont tell a story,
    # lets create a list of urls generated from those variables.
    # This way the code is self-explanatory and is human-readable.
    urls = generate_urls(idList2, splitList)  # The creation of the url is not important right now

    # Lets scrape each url
    for url in urls:
        print url
        # First Step: get a web page via http request.
        response = requests.get (url)
        # Second step: use a parsing library to create a parsable object
        soup = Soup(response.text, "html.parser")  # Create a soup object (Once)

        ### IMPORTANT: Parsing Starts and Ends Here ###
        ### - Line: 75
        # Final Step: Given a soup object, mine player data
        player = extract_player (soup)

        # Add the new entry to the list
        all_players += [player]
    return all_players

# If this script is being run, not imported, run the `main()` function.
if __name__ == '__main__':
    all_players = main ()
    print all_players[0][0]           # Player List -> Name
    print all_players[0][1]           # Player List -> Table 1
    print all_players[0][2]           # Player List -> Table 2
    print all_players[0][3]           # Player List -> Table 3
    print all_players[0][3][0]        # Player List -> Table 3 -> Columns
    print all_players[0][3][1]        # Player List -> Table 3 -> All Rows
    print all_players[0][3][1][0]     # Player List -> Table 3 -> All Rows -> Row 1
    print all_players[0][3][1][2]     # Player List -> Table 3 -> All Rows -> Row 3
    print all_players[0][3][1][2][0]  # Player List -> Table 3 -> All Rows -> Row 3 -> Column 1
I've updated the code, separated functionality, and used lists instead of dictionaries (as requested). Lines 85+ are output testing (you can ignore them).
I see now that you're making multiple requests (4) for the same player to gather more data on them. In the last answer I provided, the code only kept the last request made. Using a list eliminates this problem.
You may want to condense the list so that there is only one entry per player.
The core of the program is on lines 65-77.
Everything above the all_players declaration on line 57 is a function to handle scraping.
UPDATED: scrape_players.py
from bs4 import BeautifulSoup as Soup
import requests

def http_get (id, split):
    """ Make a get request, return the response. """
    # Create url parameters dictionary
    payload = {'split': split, 'playerid': id, 'position': 'OF', 'season': 0}
    url = 'http://www.fangraphs.com/statsplits.aspx'
    return requests.get(url, params=payload)  # Pass payload through `requests.get()`

def extract_player_name (soup):
    """ Extract the player name from the browser title """
    # Browser title contains player name, strip all but name
    player_name = repr(soup.title.text.strip('\r\n\t'))
    player_name = player_name.split(' \\xbb')[0]  # Split on ` »`
    player_name = player_name[2:]  # Erase leading characters from using `repr`
    return player_name

def extract_table (table_id, soup):
    """ Extract data from a table, return the column headers and the table rows"""
    # SCRAPE: Get a table
    table_tag = soup.find("table", {"id" : 'SeasonSplits1_dgSeason%d_ctl00' % table_id})
    # SCRAPE: Extract table column headers
    columns = [th.text for th in table_tag.findAll('th')]
    rows = []
    # SCRAPE: Extract Table Contents
    for row in table_tag.tbody.findAll('tr'):
        rows.append ([col.text for col in row.findAll('td')])  # Gather all columns in the row
    # RETURN: [columns, rows]
    return [columns, rows]

def extract_player (soup):
    """ Extract player data and store in a list. ['name', [columns, rows], [table2]]"""
    player = []
    # player name is first in player list
    player.append (extract_player_name (soup))
    # Each table is a list entry
    for season in range(1,4):
        player.append(extract_table (season, soup))
    # Return the player list
    return player

# A list of all players
all_players = [
    #'playername',
    #[table_columns, table_rows],
    #[table_columns, table_rows],
    #[['Season', 'vs R as R'], [['2015', 'yes'], ['2016', 'no'], ['2017', 'no'],]],
]

# I dont know what these values are. Sorry!
idList2 = ['15640', '9256']
splitList = [0.4, 0.2, 0.3, 0.4]

# Scrape data
for id in idList2:
    for split in splitList:
        response = http_get (id, split)
        soup = Soup(response.text, "html.parser")  # Create a soup object (Once)
        all_players.append (extract_player (soup))
        # or all_players += [scrape_player (soup)]

# Output data
def PrintPlayerAsTable (player, show_name=True):
    if show_name: print player[0]  # First entry is the player name
    for table in player[1:]:       # All other entries are tables
        PrintTableAsTable(table)

def PrintTableAsTable (table, table_sep='\n'):
    print table_sep
    PrintRowAsTable(table[0])  # The first row in the table is the columns
    for row in table[1]:       # The second item in the table is a list of rows
        PrintRowAsTable (row)

def PrintRowAsTable (row=[], prefix='\t'):
    """ Print out the list in a table format. """
    print prefix + ''.join([col.ljust(15) for col in row])

# There are 4 entries to every player, one for each request made
PrintPlayerAsTable (all_players[0])
PrintPlayerAsTable (all_players[1], False)
PrintPlayerAsTable (all_players[2], False)
PrintPlayerAsTable (all_players[3], False)

print '\n\nScraped %d player Statistics' % len(all_players)
for player in all_players:
    print '\t- %s' % player[0]

# 4th player entry
print '\n\n'
print all_players[4][0]  # Player name
print '\n'

#print all_players[4][1]           # Table 1
print all_players[4][1][0]         # Table 1 Column Headers
#print all_players[4][1][1]        # Table 1 Rows
print all_players[4][1][1][1]      # Table 1 Rows Row 1
print all_players[4][1][1][2]      # Table 1 Rows Row 2
print all_players[4][1][1][-1]     # Table 1 Rows Last Row
print '\n'

#print all_players[4][2]           # Table 2
print all_players[4][2][0]         # Table 2 Column Headers
#print all_players[4][2][1]        # Table 2 Rows
print all_players[4][2][1][1]      # Table 2 Rows Row 1
print all_players[4][2][1][2]      # Table 2 Rows Row 2
print all_players[4][2][1][-1]     # Table 2 Rows Last Row

print '\nTable 3'
PrintRowAsTable(all_players[4][2][0], '')       # Table 3 Column Headers
PrintRowAsTable(all_players[4][2][1][1], '')    # Table 3 Rows Row 1
PrintRowAsTable(all_players[4][2][1][2], '')    # Table 3 Rows Row 2
PrintRowAsTable(all_players[4][2][1][-1], '')   # Table 3 Rows Last Row
OUTPUT:
This outputs the scraped data, so you can see how all_players is structured.
Aaron Judge
Season vs R as R G AB PA H 1B 2B 3B HR R RBI BB IBB SO HBP SF SH GDP SB CS AVG
2016 vs R as R 27 69 77 14 8 2 0 4 8 10 6 0 32 1 1 0 2 0 0 .203
2017 vs R as R 66 198 231 65 34 10 2 19 37 42 31 3 71 2 0 0 8 3 0 .328
Total vs R as R 93 267 308 79 42 12 2 23 45 52 37 3 103 3 1 0 10 3 0 .296
Season vs R as R BB% K% BB/K AVG OBP SLG OPS ISO BABIP wRC wRAA wOBA wRC+
2016 vs R as R 7.8 % 41.6 % 0.19 .203 .273 .406 .679 .203 .294 7 -1.7 .291 79
2017 vs R as R 13.4 % 30.7 % 0.44 .328 .424 .687 1.111 .359 .426 54 26.1 .454 189
Total vs R as R 12.0 % 33.4 % 0.36 .296 .386 .614 1.001 .318 .394 62 24.4 .413 162
Season vs R as R GB/FB LD% GB% FB% IFFB% HR/FB IFH% BUH% Pull% Cent% Oppo% Soft% Med% Hard% Pitches Balls Strikes
2016 vs R as R 0.74 13.2 % 36.8 % 50.0 % 0.0 % 21.1 % 7.1 % 0.0 % 50.0 % 29.0 % 21.1 % 7.9 % 42.1 % 50.0 % 327 117 210
2017 vs R as R 1.14 27.6 % 38.6 % 33.9 % 2.3 % 44.2 % 6.1 % 0.0 % 45.7 % 26.8 % 27.6 % 11.0 % 39.4 % 49.6 % 985 395 590
Total vs R as R 1.02 24.2 % 38.2 % 37.6 % 1.6 % 37.1 % 6.3 % 0.0 % 46.7 % 27.3 % 26.1 % 10.3 % 40.0 % 49.7 % 1312 512 800
Season vs R as L G AB PA H 1B 2B 3B HR R RBI BB IBB SO HBP SF SH GDP SB CS AVG
2016 vs R as L 3 0 0 0 0 0 0 0 2 0 0 0 0 0 0 0 0 0 1 .000
2017 vs R as L 20 0 0 0 0 0 0 0 13 0 0 0 0 0 0 0 0 3 1 .000
Total vs R as L 23 0 0 0 0 0 0 0 15 0 0 0 0 0 0 0 0 3 2 .000
Season vs R as L BB% K% BB/K AVG OBP SLG OPS ISO BABIP wRC wRAA wOBA wRC+
2016 vs R as L 0.0 % 0.0 % 0.00 .000 .000 .000 .000 .000 .000 0 0.0 .000  
2017 vs R as L 0.0 % 0.0 % 0.00 .000 .000 .000 .000 .000 .000 0 0.0 .000  
Total vs R as L 0.0 % 0.0 % 0.00 .000 .000 .000 .000 .000 .000 0 0.0 .000  
Season vs R as L GB/FB LD% GB% FB% IFFB% HR/FB IFH% BUH% Pull% Cent% Oppo% Soft% Med% Hard% Pitches Balls Strikes
2016 vs R as L 0.00 0.0 % 0.0 % 0.0 % 0.0 % 0.0 % 0.0 % 0.0 %             0 0 0
2017 vs R as L 0.00 0.0 % 0.0 % 0.0 % 0.0 % 0.0 % 0.0 % 0.0 %             0 0 0
Total vs R as L 0.00 0.0 % 0.0 % 0.0 % 0.0 % 0.0 % 0.0 % 0.0 %             0 0 0
Season vs L as R G AB PA H 1B 2B 3B HR R RBI BB IBB SO HBP SF SH GDP SB CS AVG
2016 vs L as R 11 15 18 1 1 0 0 0 0 0 3 0 10 0 0 0 0 0 0 .067
2017 vs L as R 26 47 61 16 9 1 1 5 9 12 13 0 16 1 0 0 2 0 0 .340
Total vs L as R 37 62 79 17 10 1 1 5 9 12 16 0 26 1 0 0 2 0 0 .274
Season vs L as R BB% K% BB/K AVG OBP SLG OPS ISO BABIP wRC wRAA wOBA wRC+
2016 vs L as R 16.7 % 55.6 % 0.30 .067 .222 .067 .289 .000 .200 0 -2.3 .164 -8
2017 vs L as R 21.3 % 26.2 % 0.81 .340 .492 .723 1.215 .383 .423 17 9.1 .496 218
Total vs L as R 20.3 % 32.9 % 0.62 .274 .430 .565 .995 .290 .387 16 6.8 .421 166
Season vs L as R GB/FB LD% GB% FB% IFFB% HR/FB IFH% BUH% Pull% Cent% Oppo% Soft% Med% Hard% Pitches Balls Strikes
2016 vs L as R 0.33 20.0 % 20.0 % 60.0 % 0.0 % 0.0 % 0.0 % 0.0 % 20.0 % 60.0 % 20.0 % 20.0 % 40.0 % 40.0 % 81 32 49
2017 vs L as R 0.73 16.1 % 35.5 % 48.4 % 0.0 % 33.3 % 0.0 % 0.0 % 29.0 % 48.4 % 22.6 % 16.1 % 35.5 % 48.4 % 295 135 160
Total vs L as R 0.67 16.7 % 33.3 % 50.0 % 0.0 % 27.8 % 0.0 % 0.0 % 27.8 % 50.0 % 22.2 % 16.7 % 36.1 % 47.2 % 376 167 209
Season vs R as R G AB PA H 1B 2B 3B HR R RBI BB IBB SO HBP SF SH GDP SB CS AVG
2016 vs R as R 27 69 77 14 8 2 0 4 8 10 6 0 32 1 1 0 2 0 0 .203
2017 vs R as R 66 198 231 65 34 10 2 19 37 42 31 3 71 2 0 0 8 3 0 .328
Total vs R as R 93 267 308 79 42 12 2 23 45 52 37 3 103 3 1 0 10 3 0 .296
Season vs R as R BB% K% BB/K AVG OBP SLG OPS ISO BABIP wRC wRAA wOBA wRC+
2016 vs R as R 7.8 % 41.6 % 0.19 .203 .273 .406 .679 .203 .294 7 -1.7 .291 79
2017 vs R as R 13.4 % 30.7 % 0.44 .328 .424 .687 1.111 .359 .426 54 26.1 .454 189
Total vs R as R 12.0 % 33.4 % 0.36 .296 .386 .614 1.001 .318 .394 62 24.4 .413 162
Season vs R as R GB/FB LD% GB% FB% IFFB% HR/FB IFH% BUH% Pull% Cent% Oppo% Soft% Med% Hard% Pitches Balls Strikes
2016 vs R as R 0.74 13.2 % 36.8 % 50.0 % 0.0 % 21.1 % 7.1 % 0.0 % 50.0 % 29.0 % 21.1 % 7.9 % 42.1 % 50.0 % 327 117 210
2017 vs R as R 1.14 27.6 % 38.6 % 33.9 % 2.3 % 44.2 % 6.1 % 0.0 % 45.7 % 26.8 % 27.6 % 11.0 % 39.4 % 49.6 % 985 395 590
Total vs R as R 1.02 24.2 % 38.2 % 37.6 % 1.6 % 37.1 % 6.3 % 0.0 % 46.7 % 27.3 % 26.1 % 10.3 % 40.0 % 49.7 % 1312 512 800
Scraped 8 player Statistics
- Aaron Judge
- Aaron Judge
- Aaron Judge
- Aaron Judge
- A.J. Pollock
- A.J. Pollock
- A.J. Pollock
- A.J. Pollock
A.J. Pollock
[u'Season', u'vs R as R', u'G', u'AB', u'PA', u'H', u'1B', u'2B', u'3B', u'HR', u'R', u'RBI', u'BB', u'IBB', u'SO', u'HBP', u'SF', u'SH', u'GDP', u'SB', u'CS', u'AVG']
[u'2013', u'vs R as R', u'115', u'270', u'295', u'70', u'52', u'12', u'2', u'4', u'25', u'21', u'21', u'1', u'54', u'1', u'0', u'3', u'4', u'3', u'1', u'.259']
[u'2014', u'vs R as R', u'71', u'215', u'232', u'66', u'42', u'17', u'3', u'4', u'21', u'14', u'15', u'0', u'41', u'2', u'0', u'0', u'3', u'7', u'1', u'.307']
[u'Total', u'vs R as R', u'395', u'1120', u'1230', u'330', u'225', u'67', u'13', u'25', u'122', u'102', u'93', u'1', u'199', u'5', u'9', u'3', u'23', u'41', u'6', u'.295']
[u'Season', u'vs R as R', u'BB%', u'K%', u'BB/K', u'AVG', u'OBP', u'SLG', u'OPS', u'ISO', u'BABIP', u'wRC', u'wRAA', u'wOBA', u'wRC+']
[u'2013', u'vs R as R', u'7.1 %', u'18.3 %', u'0.39', u'.259', u'.315', u'.363', u'.678', u'.104', u'.311', u'29', u'-3.0', u'.301', u'84']
[u'2014', u'vs R as R', u'6.5 %', u'17.7 %', u'0.37', u'.307', u'.358', u'.470', u'.828', u'.163', u'.365', u'35', u'9.6', u'.364', u'128']
[u'Total', u'vs R as R', u'7.6 %', u'16.2 %', u'0.47', u'.295', u'.349', u'.445', u'.793', u'.150', u'.337', u'168', u'30.7', u'.345', u'113']
Table 3
Season vs R as R BB% K% BB/K AVG OBP SLG OPS ISO BABIP wRC wRAA wOBA wRC+
2013 vs R as R 7.1 % 18.3 % 0.39 .259 .315 .363 .678 .104 .311 29 -3.0 .301 84
2014 vs R as R 6.5 % 17.7 % 0.37 .307 .358 .470 .828 .163 .365 35 9.6 .364 128
Total vs R as R 7.6 % 16.2 % 0.47 .295 .349 .445 .793 .150 .337 168 30.7 .345 113
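With the all_players structure above, the grouping the question originally asked for (row 1 of each table together, row 2 of each table together, and so on) can then be built by zipping the three row lists. A small sketch, written in Python 3 style for brevity:

# Each player entry is ['name', [cols, rows], [cols, rows], [cols, rows]]
player = all_players[0]
rows_per_table = [table[1] for table in player[1:]]           # the row lists of the three tables
row_groups = [list(group) for group in zip(*rows_per_table)]  # group k = row k of each table
# row_groups[0] -> [table 1 row 1, table 2 row 1, table 3 row 1], and so on
print(row_groups[0])

Note that zip stops at the shortest table, which is fine here since all three tables share the same season rows.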
