Loop Through Table Rows Using BeautifulSoup - python

I need help looping through table rows and putting them into a list. On this website, there are three tables, each with different statistics - http://www.fangraphs.com/statsplits.aspx?playerid=15640&position=OF&season=0&split=0.4
For instance, these three tables each have rows for 2016, 2017, and a Total row. I would like the following:
A first list: table 1 - row 1, table 2 - row 1, table 3 - row 1
A second list: table 1 - row 2, table 2 - row 2, table 3 - row 2
A third list: table 1 - row 3, table 2 - row 3, table 3 - row 3
I know I obviously need to create lists and use the append function; however, I am not sure how to get it to loop through just the first row of each table, then the second row of each table, and so on through each row (the number of rows will vary in each instance - this one just happens to have 3).
Any help is greatly appreciated. The code is below:
from bs4 import BeautifulSoup
import requests
import pandas as pd
import csv

idList2 = ['15640', '9256']
splitList = [0.4, 0.2, 0.3, 0.4]

for id in idList2:
    pos = 'OF'
    for split in splitList:
        url = ('http://www.fangraphs.com/statsplits.aspx?playerid=' + str(id) +
               '&position=' + str(pos) + '&season=0&split=' + str(split))
        r = requests.get(url)
        for season in range(1, 4):
            print(season)
            soup = BeautifulSoup(r.text, "html.parser")
            tableStats = soup.find("table", {"id": "SeasonSplits1_dgSeason" + str(season) + "_ctl00"})
            column_headers = [th.getText() for th in soup.findAll('th')]
            statistics = soup.find("table", {"id": "SeasonSplits1_dgSeason" + str(season) + "_ctl00"})
            tabledata = [td.getText() for td in statistics('td')]
            print(tabledata)
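For the row-grouping the question asks about, here is a minimal sketch (my own, not from the original post) that collects the rows of each of the three season tables and then zips them together, so entry i holds row i from every table. It assumes the SeasonSplits1_dgSeasonN_ctl00 id pattern used above:
from bs4 import BeautifulSoup
import requests

url = 'http://www.fangraphs.com/statsplits.aspx?playerid=15640&position=OF&season=0&split=0.4'
soup = BeautifulSoup(requests.get(url).text, "html.parser")

tables = []  # one list of rows per season table
for season in range(1, 4):
    table = soup.find("table", {"id": "SeasonSplits1_dgSeason" + str(season) + "_ctl00"})
    rows = [[td.getText() for td in tr.findAll('td')] for tr in table.tbody.findAll('tr')]
    tables.append(rows)

# zip(*tables) pairs up row i of every table; it stops at the shortest table
grouped = [list(group) for group in zip(*tables)]
# grouped[0] -> [table 1 row 1, table 2 row 1, table 3 row 1]
# grouped[1] -> [table 1 row 2, table 2 row 2, table 3 row 2], and so on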

This will be my last attempt. It has everything you should need. I created a traceback to where the tables, rows and columns are being scraped. This all happens in the function extract_table(). Follow the traceback markers and don't worry about any other code. Don't let the large file size worry you; it's mostly documentation and spacing.
Traceback marker: ### ... ###
Start at the ### START HERE ### banner above the main() function.
from bs4 import BeautifulSoup as Soup
import requests
import urllib
###### GO TO THE ### START HERE ### BANNER ABOVE main() ######
### IGNORE ###
def generate_urls (idList, splitList):
    """ Using an id list and a split list, generate a list of urls """
    urls = []
    url = 'http://www.fangraphs.com/statsplits.aspx'
    for id in idList:
        for split in splitList:
            # The parameters used in creating the url
            url_payload = {'split': split, 'playerid': id, 'position': 'OF', 'season': 0}
            # Create the url and add it to the collection of urls
            urls += ['?'.join([url, urllib.urlencode(url_payload)])]
    return urls  # Return the list of urls
### IGNORE ###
def extract_player_name (soup):
    """ Extract the player name from the browser title """
    # Browser title contains the player name; strip all but the name
    player_name = repr(soup.title.text.strip('\r\n\t'))
    player_name = player_name.split(' \\xbb')[0]  # Split on ` »`
    player_name = player_name[2:]  # Erase the leading characters added by `repr`
    return player_name
########## FINISH HERE ##########
def extract_table (table_id, soup):
    """ Extract data from a table, return the column headers and the table rows """
    ### IMPORTANT: THIS CODE IS WHERE ALL THE MAGIC HAPPENS ###
    # - First: Find the lowest-level tag containing all the data we want (the table).
    #
    # - Second: Extract the table column headers, which requires minimal mining.
    #
    # - Third: Gather a list of tags that represent the table's rows.
    #
    # - Fourth: Loop through the list of rows
    #       A): Mine all columns in the row
    ### IMPORTANT: Get A Reference To The Table ###
    # SCRAPE 1:
    table_tag = soup.find("table", {"id" : 'SeasonSplits1_dgSeason%d_ctl00' % table_id})
    # SCRAPE 2:
    columns = [th.text for th in table_tag.findAll('th')]
    # SCRAPE 3:
    rows_tags = table_tag.tbody.findAll('tr')  # All 'tr' tags in the table's `tbody` tag are row tags
    ### IMPORTANT: Cycle Through Rows And Collect Column Data ###
    # SCRAPE 4:
    rows = []  # List of all table rows
    for row_tag in rows_tags:
        ### IMPORTANT: Mine All Columns In This Row || LOWEST LEVEL IN THE MINING OPERATION. ###
        # SCRAPE 4.A
        row = [col.text for col in row_tag.findAll('td')]  # `td` represents a column in a row.
        rows.append (row)  # Add this row to all the other rows of this table
    # RETURN: The column headers and the rows of this table
    return [columns, rows]
### Look Deeper ###
def extract_player (soup):
    """ Extract player data and store it in a list: ['name', [columns, rows], [table2], ...] """
    player = []  # A list to store data in
    # The player name is first in the player list
    player.append (extract_player_name (soup))
    # Each table is a list entry
    for season in range(1, 4):
        ### IMPORTANT: No table-related data has been mined yet. START HERE ###
        ### - See: extract_table()
        table = extract_table (season, soup)  # `season` represents the table id
        player.append(table)  # Add this table (a list) to the player data list
    # Return the player list
    return player
##################################################
################## START HERE ####################
##################################################
###
### OBJECTIVE:
###
### - Follow the trail of important lines that extract the data
### - Important lines will be marked as the following `### ... ###`
###
### All this code really needs is a url and the `extract_table()` function.
###
### The `main()` function is where the journey starts
###
##################################################
##################################################
def main ():
    """ The main function is the core program code. """
    # Luckily the pages we will scrape all have the same layout, which makes mining easier.
    all_players = []  # A place to store all the data
    # Values used to alter the url when making requests to access more player statistics
    idList2 = ['15640', '9256']
    splitList = [0.4, 0.2, 0.3, 0.4]
    # Instead of looping through variables that don't tell a story,
    # let's create a list of urls generated from those variables.
    # This way the code is self-explanatory and human-readable.
    urls = generate_urls(idList2, splitList)  # The creation of the url is not important right now
    # Let's scrape each url
    for url in urls:
        print url
        # First step: get a web page via an http request.
        response = requests.get (url)
        # Second step: use a parsing library to create a parsable object
        soup = Soup(response.text, "html.parser")  # Create a soup object (once)
        ### IMPORTANT: Parsing Starts and Ends Here ###
        ### - See: extract_table()
        # Final step: given a soup object, mine the player data
        player = extract_player (soup)
        # Add the new entry to the list
        all_players += [player]
    return all_players

# If this script is being run, not imported, run the `main()` function.
if __name__ == '__main__':
    all_players = main ()
    print all_players[0][0]           # Player List -> Name
    print all_players[0][1]           # Player List -> Table 1
    print all_players[0][2]           # Player List -> Table 2
    print all_players[0][3]           # Player List -> Table 3
    print all_players[0][3][0]        # Player List -> Table 3 -> Columns
    print all_players[0][3][1]        # Player List -> Table 3 -> All Rows
    print all_players[0][3][1][0]     # Player List -> Table 3 -> All Rows -> Row 1
    print all_players[0][3][1][2]     # Player List -> Table 3 -> All Rows -> Row 3
    print all_players[0][3][1][2][0]  # Player List -> Table 3 -> All Rows -> Row 3 -> Column 1

I've updated the code, separated the functionality, and used lists instead of dictionaries (as requested). Everything after the scraping loop is output testing and can be ignored.
I see now that you're making multiple requests (4) for the same player to gather more data on them. In the last answer I provided, the code only kept the last request made; using a list eliminates this problem.
You may want to condense the list so that there is only one entry per player.
The core of the program is the scraping loop near the bottom of the script. Everything above the all_players declaration is a function that handles scraping.
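If you do want to condense the list, here is a rough sketch of my own (not part of the script below) that merges entries sharing a player name, assuming the ['name', [columns, rows], [columns, rows], ...] layout the script produces; the rows of matching tables are simply concatenated:
def condense_players(all_players):
    """ Merge all entries that share a player name into one entry. """
    merged = {}   # name -> condensed entry
    order = []    # remember the order names were first seen
    for entry in all_players:
        name, tables = entry[0], entry[1:]
        if name not in merged:
            # copy the tables so the original entries are left untouched
            merged[name] = [name] + [[table[0], list(table[1])] for table in tables]
            order.append(name)
        else:
            # extend the rows of each stored table with the rows of the matching table
            for stored, table in zip(merged[name][1:], tables):
                stored[1].extend(table[1])
    return [merged[name] for name in order]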
UPDATED: scrape_players.py
from bs4 import BeautifulSoup as Soup
import requests
def http_get (id, split):
    """ Make a get request, return the response. """
    # Create the url parameters dictionary
    payload = {'split': split, 'playerid': id, 'position': 'OF', 'season': 0}
    url = 'http://www.fangraphs.com/statsplits.aspx'
    return requests.get(url, params=payload)  # Pass payload through `requests.get()`

def extract_player_name (soup):
    """ Extract the player name from the browser title """
    # Browser title contains the player name; strip all but the name
    player_name = repr(soup.title.text.strip('\r\n\t'))
    player_name = player_name.split(' \\xbb')[0]  # Split on ` »`
    player_name = player_name[2:]  # Erase the leading characters added by `repr`
    return player_name
def extract_table (table_id, soup):
    """ Extract data from a table, return the column headers and the table rows """
    # SCRAPE: Get a table
    table_tag = soup.find("table", {"id" : 'SeasonSplits1_dgSeason%d_ctl00' % table_id})
    # SCRAPE: Extract table column headers
    columns = [th.text for th in table_tag.findAll('th')]
    rows = []
    # SCRAPE: Extract table contents
    for row in table_tag.tbody.findAll('tr'):
        rows.append ([col.text for col in row.findAll('td')])  # Gather all columns in the row
    # RETURN: [columns, rows]
    return [columns, rows]

def extract_player (soup):
    """ Extract player data and store it in a list: ['name', [columns, rows], [table2], ...] """
    player = []
    # The player name is first in the player list
    player.append (extract_player_name (soup))
    # Each table is a list entry
    for season in range(1, 4):
        player.append(extract_table (season, soup))
    # Return the player list
    return player
# A list of all players
all_players = [
    # 'playername',
    # [table_columns, table_rows],
    # [table_columns, table_rows],
    # [['Season', 'vs R as R'], [['2015', 'yes'], ['2016', 'no'], ['2017', 'no'],]],
]

# I don't know what these values are. Sorry!
idList2 = ['15640', '9256']
splitList = [0.4, 0.2, 0.3, 0.4]

# Scrape data
for id in idList2:
    for split in splitList:
        response = http_get (id, split)
        soup = Soup(response.text, "html.parser")  # Create a soup object (once)
        all_players.append (extract_player (soup))
        # or all_players += [extract_player (soup)]
# Output data
def PrintPlayerAsTable (player, show_name=True):
    if show_name: print player[0]  # First entry is the player name
    for table in player[1:]:       # All other entries are tables
        PrintTableAsTable(table)

def PrintTableAsTable (table, table_sep='\n'):
    print table_sep
    PrintRowAsTable(table[0])  # The first row in the table is the columns
    for row in table[1]:       # The second item in the table is a list of rows
        PrintRowAsTable (row)

def PrintRowAsTable (row=[], prefix='\t'):
    """ Print out the list in a table format. """
    print prefix + ''.join([col.ljust(15) for col in row])
# There are 4 entries to every player, one for each request made
PrintPlayerAsTable (all_players[0])
PrintPlayerAsTable (all_players[1], False)
PrintPlayerAsTable (all_players[2], False)
PrintPlayerAsTable (all_players[3], False)
print '\n\nScraped %d player Statistics' % len(all_players)
for player in all_players:
    print '\t- %s' % player[0]

# 5th entry in all_players (index 4)
print '\n\n'
print all_players[4][0]  # Player name
print '\n'
#print all_players[4][1]          # Table 1
print all_players[4][1][0]        # Table 1 column headers
#print all_players[4][1][1]       # Table 1 rows
print all_players[4][1][1][1]     # Table 1 rows -> second row
print all_players[4][1][1][2]     # Table 1 rows -> third row
print all_players[4][1][1][-1]    # Table 1 rows -> last row
print '\n'
#print all_players[4][2]          # Table 2
print all_players[4][2][0]        # Table 2 column headers
#print all_players[4][2][1]       # Table 2 rows
print all_players[4][2][1][1]     # Table 2 rows -> second row
print all_players[4][2][1][2]     # Table 2 rows -> third row
print all_players[4][2][1][-1]    # Table 2 rows -> last row
print '\nTable 3'
PrintRowAsTable(all_players[4][3][0], '')      # Table 3 column headers
PrintRowAsTable(all_players[4][3][1][1], '')   # Table 3 rows -> second row
PrintRowAsTable(all_players[4][3][1][2], '')   # Table 3 rows -> third row
PrintRowAsTable(all_players[4][3][1][-1], '')  # Table 3 rows -> last row
OUTPUT:
The scraped data is printed below, so you can see how the all_players list is structured.
Aaron Judge
Season vs R as R G AB PA H 1B 2B 3B HR R RBI BB IBB SO HBP SF SH GDP SB CS AVG
2016 vs R as R 27 69 77 14 8 2 0 4 8 10 6 0 32 1 1 0 2 0 0 .203
2017 vs R as R 66 198 231 65 34 10 2 19 37 42 31 3 71 2 0 0 8 3 0 .328
Total vs R as R 93 267 308 79 42 12 2 23 45 52 37 3 103 3 1 0 10 3 0 .296
Season vs R as R BB% K% BB/K AVG OBP SLG OPS ISO BABIP wRC wRAA wOBA wRC+
2016 vs R as R 7.8 % 41.6 % 0.19 .203 .273 .406 .679 .203 .294 7 -1.7 .291 79
2017 vs R as R 13.4 % 30.7 % 0.44 .328 .424 .687 1.111 .359 .426 54 26.1 .454 189
Total vs R as R 12.0 % 33.4 % 0.36 .296 .386 .614 1.001 .318 .394 62 24.4 .413 162
Season vs R as R GB/FB LD% GB% FB% IFFB% HR/FB IFH% BUH% Pull% Cent% Oppo% Soft% Med% Hard% Pitches Balls Strikes
2016 vs R as R 0.74 13.2 % 36.8 % 50.0 % 0.0 % 21.1 % 7.1 % 0.0 % 50.0 % 29.0 % 21.1 % 7.9 % 42.1 % 50.0 % 327 117 210
2017 vs R as R 1.14 27.6 % 38.6 % 33.9 % 2.3 % 44.2 % 6.1 % 0.0 % 45.7 % 26.8 % 27.6 % 11.0 % 39.4 % 49.6 % 985 395 590
Total vs R as R 1.02 24.2 % 38.2 % 37.6 % 1.6 % 37.1 % 6.3 % 0.0 % 46.7 % 27.3 % 26.1 % 10.3 % 40.0 % 49.7 % 1312 512 800
Season vs R as L G AB PA H 1B 2B 3B HR R RBI BB IBB SO HBP SF SH GDP SB CS AVG
2016 vs R as L 3 0 0 0 0 0 0 0 2 0 0 0 0 0 0 0 0 0 1 .000
2017 vs R as L 20 0 0 0 0 0 0 0 13 0 0 0 0 0 0 0 0 3 1 .000
Total vs R as L 23 0 0 0 0 0 0 0 15 0 0 0 0 0 0 0 0 3 2 .000
Season vs R as L BB% K% BB/K AVG OBP SLG OPS ISO BABIP wRC wRAA wOBA wRC+
2016 vs R as L 0.0 % 0.0 % 0.00 .000 .000 .000 .000 .000 .000 0 0.0 .000  
2017 vs R as L 0.0 % 0.0 % 0.00 .000 .000 .000 .000 .000 .000 0 0.0 .000  
Total vs R as L 0.0 % 0.0 % 0.00 .000 .000 .000 .000 .000 .000 0 0.0 .000  
Season vs R as L GB/FB LD% GB% FB% IFFB% HR/FB IFH% BUH% Pull% Cent% Oppo% Soft% Med% Hard% Pitches Balls Strikes
2016 vs R as L 0.00 0.0 % 0.0 % 0.0 % 0.0 % 0.0 % 0.0 % 0.0 %             0 0 0
2017 vs R as L 0.00 0.0 % 0.0 % 0.0 % 0.0 % 0.0 % 0.0 % 0.0 %             0 0 0
Total vs R as L 0.00 0.0 % 0.0 % 0.0 % 0.0 % 0.0 % 0.0 % 0.0 %             0 0 0
Season vs L as R G AB PA H 1B 2B 3B HR R RBI BB IBB SO HBP SF SH GDP SB CS AVG
2016 vs L as R 11 15 18 1 1 0 0 0 0 0 3 0 10 0 0 0 0 0 0 .067
2017 vs L as R 26 47 61 16 9 1 1 5 9 12 13 0 16 1 0 0 2 0 0 .340
Total vs L as R 37 62 79 17 10 1 1 5 9 12 16 0 26 1 0 0 2 0 0 .274
Season vs L as R BB% K% BB/K AVG OBP SLG OPS ISO BABIP wRC wRAA wOBA wRC+
2016 vs L as R 16.7 % 55.6 % 0.30 .067 .222 .067 .289 .000 .200 0 -2.3 .164 -8
2017 vs L as R 21.3 % 26.2 % 0.81 .340 .492 .723 1.215 .383 .423 17 9.1 .496 218
Total vs L as R 20.3 % 32.9 % 0.62 .274 .430 .565 .995 .290 .387 16 6.8 .421 166
Season vs L as R GB/FB LD% GB% FB% IFFB% HR/FB IFH% BUH% Pull% Cent% Oppo% Soft% Med% Hard% Pitches Balls Strikes
2016 vs L as R 0.33 20.0 % 20.0 % 60.0 % 0.0 % 0.0 % 0.0 % 0.0 % 20.0 % 60.0 % 20.0 % 20.0 % 40.0 % 40.0 % 81 32 49
2017 vs L as R 0.73 16.1 % 35.5 % 48.4 % 0.0 % 33.3 % 0.0 % 0.0 % 29.0 % 48.4 % 22.6 % 16.1 % 35.5 % 48.4 % 295 135 160
Total vs L as R 0.67 16.7 % 33.3 % 50.0 % 0.0 % 27.8 % 0.0 % 0.0 % 27.8 % 50.0 % 22.2 % 16.7 % 36.1 % 47.2 % 376 167 209
Season vs R as R G AB PA H 1B 2B 3B HR R RBI BB IBB SO HBP SF SH GDP SB CS AVG
2016 vs R as R 27 69 77 14 8 2 0 4 8 10 6 0 32 1 1 0 2 0 0 .203
2017 vs R as R 66 198 231 65 34 10 2 19 37 42 31 3 71 2 0 0 8 3 0 .328
Total vs R as R 93 267 308 79 42 12 2 23 45 52 37 3 103 3 1 0 10 3 0 .296
Season vs R as R BB% K% BB/K AVG OBP SLG OPS ISO BABIP wRC wRAA wOBA wRC+
2016 vs R as R 7.8 % 41.6 % 0.19 .203 .273 .406 .679 .203 .294 7 -1.7 .291 79
2017 vs R as R 13.4 % 30.7 % 0.44 .328 .424 .687 1.111 .359 .426 54 26.1 .454 189
Total vs R as R 12.0 % 33.4 % 0.36 .296 .386 .614 1.001 .318 .394 62 24.4 .413 162
Season vs R as R GB/FB LD% GB% FB% IFFB% HR/FB IFH% BUH% Pull% Cent% Oppo% Soft% Med% Hard% Pitches Balls Strikes
2016 vs R as R 0.74 13.2 % 36.8 % 50.0 % 0.0 % 21.1 % 7.1 % 0.0 % 50.0 % 29.0 % 21.1 % 7.9 % 42.1 % 50.0 % 327 117 210
2017 vs R as R 1.14 27.6 % 38.6 % 33.9 % 2.3 % 44.2 % 6.1 % 0.0 % 45.7 % 26.8 % 27.6 % 11.0 % 39.4 % 49.6 % 985 395 590
Total vs R as R 1.02 24.2 % 38.2 % 37.6 % 1.6 % 37.1 % 6.3 % 0.0 % 46.7 % 27.3 % 26.1 % 10.3 % 40.0 % 49.7 % 1312 512 800
Scraped 8 player Statistics
- Aaron Judge
- Aaron Judge
- Aaron Judge
- Aaron Judge
- A.J. Pollock
- A.J. Pollock
- A.J. Pollock
- A.J. Pollock
A.J. Pollock
[u'Season', u'vs R as R', u'G', u'AB', u'PA', u'H', u'1B', u'2B', u'3B', u'HR', u'R', u'RBI', u'BB', u'IBB', u'SO', u'HBP', u'SF', u'SH', u'GDP', u'SB', u'CS', u'AVG']
[u'2013', u'vs R as R', u'115', u'270', u'295', u'70', u'52', u'12', u'2', u'4', u'25', u'21', u'21', u'1', u'54', u'1', u'0', u'3', u'4', u'3', u'1', u'.259']
[u'2014', u'vs R as R', u'71', u'215', u'232', u'66', u'42', u'17', u'3', u'4', u'21', u'14', u'15', u'0', u'41', u'2', u'0', u'0', u'3', u'7', u'1', u'.307']
[u'Total', u'vs R as R', u'395', u'1120', u'1230', u'330', u'225', u'67', u'13', u'25', u'122', u'102', u'93', u'1', u'199', u'5', u'9', u'3', u'23', u'41', u'6', u'.295']
[u'Season', u'vs R as R', u'BB%', u'K%', u'BB/K', u'AVG', u'OBP', u'SLG', u'OPS', u'ISO', u'BABIP', u'wRC', u'wRAA', u'wOBA', u'wRC+']
[u'2013', u'vs R as R', u'7.1 %', u'18.3 %', u'0.39', u'.259', u'.315', u'.363', u'.678', u'.104', u'.311', u'29', u'-3.0', u'.301', u'84']
[u'2014', u'vs R as R', u'6.5 %', u'17.7 %', u'0.37', u'.307', u'.358', u'.470', u'.828', u'.163', u'.365', u'35', u'9.6', u'.364', u'128']
[u'Total', u'vs R as R', u'7.6 %', u'16.2 %', u'0.47', u'.295', u'.349', u'.445', u'.793', u'.150', u'.337', u'168', u'30.7', u'.345', u'113']
Table 3
Season vs R as R BB% K% BB/K AVG OBP SLG OPS ISO BABIP wRC wRAA wOBA wRC+
2013 vs R as R 7.1 % 18.3 % 0.39 .259 .315 .363 .678 .104 .311 29 -3.0 .301 84
2014 vs R as R 6.5 % 17.7 % 0.37 .307 .358 .470 .828 .163 .365 35 9.6 .364 128
Total vs R as R 7.6 % 16.2 % 0.47 .295 .349 .445 .793 .150 .337 168 30.7 .345 113

Related

Python Beautiful Soup Webscraping: Cannot get a full table to display

I am relatively new to python and this is my first web scrape. I am trying to scrape a table and can only get the first column to show up. I am using the find method instead of find_all, which I am pretty sure is what is causing this, but when I use the find_all method I cannot get any text to display. Here is the url I am scraping from: https://www.fangraphs.com/teams/mariners/stats
I am trying to get the top table (Batting Stat Leaders) to work. My code is below:
from bs4 import BeautifulSoup
import requests
import time

htmlText = requests.get('https://www.fangraphs.com/teams/mariners/stats').text
soup = BeautifulSoup(htmlText, 'lxml')
playerTable = soup.find('div', class_='team-stats-table')
input = input("Would you like to see Batting, Starting Pitching, Relief Pitching, or Fielding Stats? \n")

def BattingStats():
    print("BATTING STATS:")
    print("Player Name: ")
    for tr in playerTable.find_all('tr')[1:55]:
        tds = tr.find('td').text
        print(tds)

if input == "Batting" or "batting":
    BattingStats()
You can use list-comprehension to get text from all rows:
import requests
from bs4 import BeautifulSoup

htmlText = requests.get("https://www.fangraphs.com/teams/mariners/stats").text
soup = BeautifulSoup(htmlText, "lxml")
playerTable = soup.find("div", class_="team-stats-table")

def BattingStats():
    print("BATTING STATS:")
    print("Player Name: ")
    for tr in playerTable.find_all("tr")[1:55]:
        tds = [td.text for td in tr.select("td")]
        print(tds)

BattingStats()
Prints:
BATTING STATS:
Player Name:
Mitch Haniger 30 94 406 25 0 6.7% 23.4% .257 .291 .268 .323 .524 .358 133 0.2 16.4 -6.5 2.4
Ty France 26 89 372 9 0 7.3% 16.9% .150 .314 .276 .355 .426 .341 121 0.0 9.5 -2.6 2.0
Kyle Seager 33 97 403 18 2 8.4% 25.8% .201 .246 .215 .285 .416 .302 95 -0.3 -2.9 5.4 1.6
...
Solution with pandas:
import pandas as pd
url = "https://www.fangraphs.com/teams/mariners/stats"
df = pd.read_html(url)[7]
print(df)
Prints:
Name Age G PA HR SB BB% K% ISO BABIP AVG OBP SLG wOBA wRC+ BsR Off Def WAR
0 Mitch Haniger 30 94 406 25 0 6.7% 23.4% 0.257 0.291 0.268 0.323 0.524 0.358 133.0 0.2 16.4 -6.5 2.4
1 Ty France 26 89 372 9 0 7.3% 16.9% 0.150 0.314 0.276 0.355 0.426 0.341 121.0 0.0 9.5 -2.6 2.0
2 Kyle Seager 33 97 403 18 2 8.4% 25.8% 0.201 0.246 0.215 0.285 0.416 0.302 95.0 -0.3 -2.9 5.4 1.6
...

Multiple table header <thead> in table <table> and how to scrape data from <thead> as a table row

I'm trying to scrape data from a website, but the table has two sets of data: the first 2-3 rows are in thead and the rest are in tbody. I can easily extract data from only one of them at a time; when I try both I get errors like TypeError and AttributeError. By the way, I'm using Python.
Here is the code:
import requests
from bs4 import BeautifulSoup
import pandas as pd

url = "https://www.worldometers.info/world-population/"
r = requests.get(url)
print(r)
html = r.text
soup = BeautifulSoup(html, 'html.parser')
print(soup.title.text)
print()
print()

live_data = soup.find_all('div', id='maincounter-wrap')
print(live_data)
for i in live_data:
    print(i.text)

table_body = soup.find('thead')
table_rows = table_body.find_all('tr')
table_body_2 = soup.find('tbody')
table_rows_2 = soup.find_all('tr')

year_july1 = []
population = []
yearly_change_in_perchantage = []
yearly_change = []
median_age = []
fertillity_rate = []
density = []  # density (p/km**2)
urban_population_in_perchantage = []
urban_population = []

for tr in table_rows:
    td = tr.find_all('td')
    year_july1.append(td[0].text)
    population.append(td[1].text)
    yearly_change_in_perchantage.append(td[2].text)
    yearly_change.append(td[3].text)
    median_age.append(td[4].text)
    fertillity_rate.append(td[5].text)
    density.append(td[6].text)
    urban_population_in_perchantage.append(td[7].text)
    urban_population.append(td[8].text)

for tr in table_rows_2:
    td = tr.find_all('td')
    year_july1.append(td[0].text)
    population.append(td[1].text)
    yearly_change_in_perchantage.append(td[2].text)
    yearly_change.append(td[3].text)
    median_age.append(td[4].text)
    fertillity_rate.append(td[5].text)
    density.append(td[6].text)
    urban_population_in_perchantage.append(td[7].text)
    urban_population.append(td[8].text)

headers = ['year_july1', 'population', 'yearly_change_in_perchantage', 'yearly_change', 'median_age', 'fertillity_rate', 'density', 'urban_population_in_perchantage', 'urban_population']
data_2 = pd.DataFrame(list(zip(year_july1, population, yearly_change_in_perchantage, yearly_change, median_age, fertillity_rate, density, urban_population_in_perchantage, urban_population)), columns=headers)
print(data_2)
data_2.to_csv("C:\\Users\\data_2.csv")
You can try the code below; it generates the required data. Let me know if you need any clarification:
import requests
import pandas as pd
url = 'https://www.worldometers.info/world-population/'
html = requests.get(url).content
df_list = pd.read_html(html, header=0)
df = df_list[0]
#print(df)
df.to_csv("data.csv", index=False)
This gives me the output below:
print(df)
Year (July 1) Population ... Urban Pop % Urban Population
0 2020 7794798739 ... 56.2 % 4378993944
1 2019 7713468100 ... 55.7 % 4299438618
2 2018 7631091040 ... 55.3 % 4219817318
3 2017 7547858925 ... 54.9 % 4140188594
4 2016 7464022049 ... 54.4 % 4060652683
5 2015 7379797139 ... 54.0 % 3981497663
6 2010 6956823603 ... 51.7 % 3594868146
7 2005 6541907027 ... 49.2 % 3215905863
8 2000 6143493823 ... 46.7 % 2868307513
9 1995 5744212979 ... 44.8 % 2575505235
10 1990 5327231061 ... 43.0 % 2290228096
11 1985 4870921740 ... 41.2 % 2007939063
12 1980 4458003514 ... 39.3 % 1754201029
13 1975 4079480606 ... 37.7 % 1538624994
14 1970 3700437046 ... 36.6 % 1354215496
15 1965 3339583597 ... N.A. N.A.
16 1960 3034949748 ... 33.7 % 1023845517
17 1955 2773019936 ... N.A. N.A.
[18 rows x 9 columns]
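For reference, if you would rather stay with BeautifulSoup than switch to pandas: the TypeError/AttributeError usually comes from the fact that rows inside thead carry th cells while rows inside tbody carry td cells, so indexing td alone fails for the header-side rows. A rough sketch of my own (assuming the first table on the page is the one you want) that walks every row and accepts both cell types:
import requests
from bs4 import BeautifulSoup

url = 'https://www.worldometers.info/world-population/'
soup = BeautifulSoup(requests.get(url).text, 'html.parser')

table = soup.find('table')  # assumption: the first table on the page is the target table
all_rows = []
for tr in table.find_all('tr'):
    # a row may contain <th> cells (thead rows) or <td> cells (tbody rows)
    cells = tr.find_all(['td', 'th'])
    all_rows.append([cell.get_text(strip=True) for cell in cells])

header, data_rows = all_rows[0], all_rows[1:]
print(header)
print(data_rows[0])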

Speed optimization for loop

I'm trying to predict the outcome of sports games and therefore want to transform my dataframe in such a way that I can train a model. Currently I use a for loop to go through all played games, pick the two players of each game, and check how they performed in the x games before the actual game took place. I then take the mean of the statistics of those previous games for both players and concatenate them together. Finally I add the true outcome of the actual game so I can train a model on it.
Now I have some speed issues: my current code takes about 9 minutes to complete for 20000 games (with ~200 variables). I already managed to go from 20 to 9 minutes.
I started by adding each game to a dataframe; later I changed this to adding each separate dataframe to a list and building one big dataframe from that list at the end.
I also included if statements which make the loop continue if a player did not play at least x games.
I expect this can be made much faster than 9 minutes.
Hope you guys can help me!
import pandas as pd
import numpy as np
import random
import string

letters = list(string.ascii_lowercase)
datelist = pd.date_range(start='1/1/2017', end='1/1/2019')

data = pd.DataFrame({'Date': np.random.choice(datelist, 5000),
                     'League': np.random.choice(['LeagueA', 'LeagueB'], 5000),
                     'Home_player': np.random.choice(letters, 5000),
                     'Away_player': np.random.choice(letters, 5000),
                     'Home_strikes': np.random.randint(1, 20, 5000),
                     'Home_kicks': np.random.randint(1, 20, 5000),
                     'Away_strikes': np.random.randint(1, 20, 5000),
                     'Away_kicks': np.random.randint(1, 20, 5000),
                     'Winner': np.random.randint(0, 2, 5000)})

leagues = list(data['League'].unique())
home_columns = [col for col in data if col.startswith('Home')]
away_columns = [col for col in data if col.startswith('Away')]

# Determine from how many previous games to take statistics
total_games = 5
final_df = []

# Make a subframe per league
for league in leagues:
    league_data = data[data.League == league]
    league_data = league_data.sort_values(by='Date').reset_index(drop=True)
    # Limit to 500 games
    league_data = league_data.head(500)
    for i in range(0, len(league_data)):
        if i < 1:
            league_copy = league_data.sort_values(by='Date').reset_index(drop=True)
        else:
            league_copy = league_data[:-i].reset_index(drop=True)
        # Loop back from the last game
        last_game = league_copy.iloc[-1:].reset_index(drop=True)
        # Take the home and away player
        Home_player = last_game.loc[0, "Home_player"]  # Pick home team
        Away_player = last_game.loc[0, 'Away_player']  # Pick away team
        # Remove the last row so the current game is not picked
        df = league_copy[:-1]
        # Now check the statistics of the games played before this game
        Home = df[df.Home_player == Home_player].tail(total_games)  # Pick data from the home team
        # If the player did not play at least x games, continue
        if len(Home) < total_games:
            continue
        else:
            Home = Home[home_columns].reset_index(drop=True)  # Pick all column names that start with "Home"
        # Do the same for the away team
        Away = df[df.Away_player == Away_player].tail(total_games)  # Pick data from the away team
        if len(Away) < total_games:
            continue
        else:
            Away = Away[away_columns].reset_index(drop=True)  # Pick all column names that start with "Away"
        # Now concat home and away player data
        Home_away = pd.concat([Home, Away], axis=1)
        Home_away.drop(['Away_player', 'Home_player'], inplace=True, axis=1)
        # Take the mean of all columns
        Home_away = pd.DataFrame(Home_away.mean().to_dict(), index=[0])
        # Now add the home and away player back to the dataframe
        Home_away["Home_player"] = Home_player
        Home_away["Away_player"] = Away_player
        winner = last_game.loc[0, "Winner"]
        date = last_game.loc[0, "Date"]
        Home_away['Winner'] = winner
        Home_away['Date'] = date
        final_df.append(Home_away)

final_df = pd.concat(final_df, axis=0)
final_df = final_df[['Date', 'Home_player', 'Away_player', 'Home_kicks', 'Away_kicks', 'Home_strikes', 'Away_strikes', 'Winner']]
This doesn't answer your question but you can leverage the package line_profiler to find the slow parts of your code.
Resource:
http://gouthamanbalaraman.com/blog/profiling-python-jupyter-notebooks.html
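As a quick illustration (my own sketch, not from the answer above), a per-line report like the one below can be produced with line_profiler's Python API once the loop has been wrapped in a function; build_final_df and the tiny DataFrame here are hypothetical placeholders:
import pandas as pd
import numpy as np
from line_profiler import LineProfiler

def build_final_df(data, total_games=5):
    """Placeholder for the per-league / per-game loop from the question."""
    # ... the loop body from the question would go here ...
    return data

data = pd.DataFrame({'League': ['LeagueA'] * 10,
                     'Winner': np.random.randint(0, 2, 10)})

profiler = LineProfiler()
profiled = profiler(build_final_df)  # wrap the function so every line gets timed
profiled(data)                       # run it once under the profiler
profiler.print_stats()               # prints a per-line report like the one below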
Line # Hits Time Per Hit % Time Line Contents
==============================================================
2 1 35.0 35.0 0.0 letters = list(string.ascii_lowercase)
3 1 11052.0 11052.0 0.0 datelist = pd.date_range(start='1/1/2017', end='1/1/2019')
4
5 1 3483.0 3483.0 0.0 data = pd.DataFrame({'Date':np.random.choice(datelist,5000),
6 1 1464.0 1464.0 0.0 'League': np.random.choice(['LeagueA','LeagueB'], 5000),
7 1 2532.0 2532.0 0.0 'Home_player':np.random.choice(letters, 5000),
8 1 1019.0 1019.0 0.0 'Away_player':np.random.choice(letters, 5000),
9 1 693.0 693.0 0.0 'Home_strikes':np.random.randint(1,20,5000),
10 1 682.0 682.0 0.0 'Home_kicks':np.random.randint(1,20,5000),
11 1 682.0 682.0 0.0 'Away_strikes':np.random.randint(1,20,5000),
12 1 731.0 731.0 0.0 'Away_kicks':np.random.randint(1,20,5000),
13 1 40409.0 40409.0 0.0 'Winner':np.random.randint(0,2,5000)})
14
15 1 6560.0 6560.0 0.0 leagues = list(data['League'].unique())
16 1 439.0 439.0 0.0 home_columns = [col for col in data if col.startswith('Home')]
17 1 282.0 282.0 0.0 away_columns = [col for col in data if col.startswith('Away')]
18
19 # Determine to how many last x games to take statistics
20 1 11.0 11.0 0.0 total_games = 5
21 1 12.0 12.0 0.0 final_df = []
22
23 # Make subframe of league
24 3 38.0 12.7 0.0 for league in leagues:
25
26 2 34381.0 17190.5 0.0 league_data = data[data.League == league]
27 2 30815.0 15407.5 0.0 league_data = league_data.sort_values(by='Date').reset_index(drop=True)
28 # Pick the last game
29 2 5045.0 2522.5 0.0 league_data = league_data.head(500)
30 1002 14202.0 14.2 0.0 for i in range(0,len(league_data)):
31 1000 11943.0 11.9 0.0 if i < 1:
32 2 28407.0 14203.5 0.0 league_copy = league_data.sort_values(by='Date').reset_index(drop=True)
33 else:
34 998 5305364.0 5316.0 4.2 league_copy = league_data[:-i].reset_index(drop=True)
35
36 # Loop back from the last game
37 1000 4945240.0 4945.2 3.9 last_game = league_copy.iloc[-1:].reset_index(drop=True)
38
39 # Take home and away player
40 1000 1504055.0 1504.1 1.2 Home_player = last_game.loc[0,"Home_player"] # Pick home team
41 1000 899081.0 899.1 0.7 Away_player = last_game.loc[0,'Away_player'] # pick away team
42
43 # # Remove last row so current game is not picked
44 1000 2539351.0 2539.4 2.0 df = league_copy[:-1]
45
46 # Now check the statistics of the games befóre this game was played
47 1000 16428854.0 16428.9 13.0 Home = df[df.Home_player == Home_player].tail(total_games) # Pick data from home team
48
49 # If the player did not play at least x number of games, then continue
50 1000 49133.0 49.1 0.0 if len(Home) < total_games:
51 260 2867.0 11.0 0.0 continue
52 else:
53 740 12968016.0 17524.3 10.2 Home = Home[home_columns].reset_index(drop=True) # Pick all columnnames that start with "Home"
54
55
56 # Do the same for the away team
57 740 12007650.0 16226.6 9.5 Away = df[df.Away_player == Away_player].tail(total_games) # Pick data from home team
58
59 740 33357.0 45.1 0.0 if len(Away) < total_games:
60 64 825.0 12.9 0.0 continue
61 else:
62 676 11598741.0 17157.9 9.1 Away = Away[away_columns].reset_index(drop=True) # Pick all columnnames that start with "Home"
63
64
65 # Now concat home and away player data
66 676 5114022.0 7565.1 4.0 Home_away = pd.concat([Home, Away], axis=1)
67 676 9702001.0 14352.1 7.6 Home_away.drop(['Away_player','Home_player'],inplace=True,axis=1)
68
69 # Take the mean of all columns
70 676 12171184.0 18004.7 9.6 Home_away = pd.DataFrame(Home_away.mean().to_dict(),index=[0])
71
72 # Now again add home team and away team to dataframe
73 676 5112558.0 7563.0 4.0 Home_away["Home_player"] = Home_player
74 676 4880017.0 7219.0 3.8 Home_away["Away_player"] = Away_player
75
76 676 791718.0 1171.2 0.6 winner = last_game.loc[0,"Winner"]
77 676 696925.0 1031.0 0.5 date = last_game.loc[0,"Date"]
78 676 5142111.0 7606.7 4.1 Home_away['Winner'] = winner
79 676 9630466.0 14246.3 7.6 Home_away['Date'] = date
80
81 676 16125.0 23.9 0.0 final_df.append(Home_away)
82 1 5088063.0 5088063.0 4.0 final_df = pd.concat(final_df, axis=0)
83 1 18424.0 18424.0 0.0 final_df = final_df[['Date','Home_player','Away_player','Home_kicks','Away_kicks','Home_strikes','Away_strikes','Winner']]
IIUC, you can obtain the last 5 game statistics, including the current one by:
# replace this with you statistic columns
stat_cols = data.columns[4:]
total_games = 5
data.groupby(['League','Home_player', 'Away_player'])[stat_cols].rolling(total_games).mean()
If you want to exclude the current one:
last_stats = data.groupby(['League','Home_player', 'Away_player']).apply(lambda x: x[stat_cols].shift().rolling(total_games).mean())
This last_stats data frame should have the same index as the original one, so you can do:
train_data = data.copy()
# backup the actual outcome
train_data['Actual'] = train_data['Winner']
# copy the average statistics
train_data[stat_cols] = last_stats
All together should not take more than 1 min.

Compare some columns from some tables using python

I need to compare two values, MC and JT, from 2 tables:
EID MolIdx TEStart TEEnd TE TZone TBulkBE TBulkAE MC JT zavg vabs vzavg xyd.x xyd.y xydist nnbw vabsprev midhb
0 370 36700 36800 110 20 36150 37090 0 0 -8.25705 0.219113 -0.000800014 20.8926 41.4347 5.75852 0 4.13067 0
1 423 17950 18150 210 180 17400 18430 1 0 -4.26426 0.586578 -0.053 77.22 85.2104 22.0534 0 3.551 0
2 468 41790 42020 240 50 41360 42380 0 0 7.82681 0.181248 -0.00269566 90.0646 92.7698 5.0841 0 4.19304 0
and
EID MolIdx TEStart TEEnd TE TZone TBulkBE TBulkAE MC JT zavg vabs vzavg xyd.x xyd.y xydist nnbw vabsprev midhb
0 370 36700 36800 110 20 36150 37090 0 0 -0.846655 0.0218695 2.59898e-05 2.0724 4.1259 0.583259 10 0.412513 0
1 423 17950 18150 210 180 17400 18780 1 0 -0.453311 0.058732 -0.00526783 7.7403 8.52544 2.19627 0 0.354126 0
2 468 41790 42020 240 70 41360 42380 0 0 0.743716 0.0181613 -0.000256186 9.08777 9.21395 0.502506 0 0.419265 0
I need to do it using the csv module. I know how to do it using pandas and xlrd, but not with csv.
Desired output:
Number_of_strings MC JT
and print the rows where the values are different.
import csv

old = csv.reader(open('old.csv', 'rb'), delimiter=',')
row1 = old.next()
new = csv.reader(open('new.csv', 'rb'), delimiter=',')
row2 = new.next()
if (row1[8] == row2[8]) and (row1[9] == row2[9]):
    continue
else:
    print row1[0] + ':' + row1[8] + '!=' + row2[8]
You can try something like the following:
import csv

old = list(csv.reader(open('old.csv', 'rb'), delimiter=','))
new = list(csv.reader(open('new.csv', 'rb'), delimiter=','))
old = zip(*old)
new = zip(*new)
print ['%s-%s-%s' % (str(a), str(b), str(c)) for a, b, c in zip(old[0], new[8], old[8]) if b != c]
First, we get a list of lists. zip(*x) will transpose a list of lists. The rest should be easy to decipher ...
You can actually put whatever you want within the string ...
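A slightly fuller sketch of my own (assuming both files are comma-separated, have a header row, and that MC and JT sit at column indices 8 and 9 as in the attempt above) that prints the row number and both pairs of values wherever MC or JT differ:
import csv

# 'old.csv' and 'new.csv' are the two exported tables to compare
with open('old.csv', newline='') as f_old, open('new.csv', newline='') as f_new:
    old_rows = list(csv.reader(f_old))
    new_rows = list(csv.reader(f_new))

print('Number_of_strings MC JT')
# skip the header row and walk both files in parallel
for i, (row_old, row_new) in enumerate(zip(old_rows[1:], new_rows[1:]), start=1):
    if row_old[8] != row_new[8] or row_old[9] != row_new[9]:
        print(i, row_old[8], '->', row_new[8], '|', row_old[9], '->', row_new[9])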

How do I 'rearrange' my python code to fit a certain format?

This is my code at the moment. Please note "percentageOff" and "originalPrices" are lists with floats/ints in them.
print("Percent off:", percentageOff[0], '%')
for percentage in percentageOff[1:]:
print("\t\t\t", percentage, "%")
count = 0
for prices in originalPrices:
for percentage in percentageOff:
discount = prices * (percentage/100)
newPrice = prices - discount
count += 1
if(0 < count < 11):
print("%.2f" % newPrice)
elif(11 < count < 21):
print("\t\t\t%.2f" % newPrice)
The output is (together with the rest of the code):
**Sale Prices**
Normal Price: $9.95 $14.95 $19.95 $24.95 $29.95 $34.95 $39.95 $44.95 $49.95
______________________________________________________________________________________
Percent off: 5 %
10 %
15 %
20 %
25 %
30 %
35 %
40 %
45 %
50 %
9.45
8.96
8.46
7.96
7.46
6.96
6.47
5.97
5.47
4.97
But I want the output to be
**Sale Prices**
Normal Price: $9.95 $14.95 $19.95 $24.95 $29.95 $34.95 $39.95 $44.95 $49.95
______________________________________________________________________________________
Percent off: 5 % 9.45
10% 8.96
15% 8.46
20% 7.96
25% 7.46
30% 6.96
35% 6.47
40% 5.97
45% 5.47
50% 4.97
How can I fix my problem?
Building on @danidee's answer:
x = [1, 2, 3, 4, 5, 34]
y = [3, 4, 5, 6, 5, 4]

print("Percent off:", end="\t")
for i, j in zip(x, y):
    print(i, '\t\t', j, end="\n\t\t")
The output is:
Percent off: 1 3
2 4
3 5
4 6
5 5
34 4
The best solution without using an external library is to put the different values into lists as you iterate over and test them, and then print them in parallel, since you're testing two different conditions and only one can be true at a time.
Proof of concept:
x = [1, 2, 3, 4, 5, 34]
y = [3, 4, 5, 6, 5, 4]

for i, j in zip(x, y):
    print(i, '\t\t\t', j)
Output
1 3
2 4
3 5
4 6
5 5
34 4
I'd suggest using Python string formatting for great good!
Assuming you are using Python 3, the code could be rearranged this way (it's well commented, so hopefully no additional explanation is needed):
#!/usr/bin/env python3

percentageOff = [5, 10, 15, 20, 25, 30, 35, 40, 45, 50]
originalPrices = [9.95, 14.95, 19.95, 24.95, 29.95, 34.95, 39.95, 44.95, 49.95]

# Helper function
def new_price(price, off):
    """Return a new price, calculated as original price minus discount."""
    return price - (price * off / 100)

## Print header
price_header = "".join(["{0}$\t".format(str(p)) for p in originalPrices])
# centre string, given the string length is 78 chars
print("{0:^78}".format("Normal Price"))
print("Off % | {0}".format(price_header))
print('-' * 78)

## Print rows
for off in percentageOff:
    # padding number to 4 digits; prevent newline char at the end
    print("{0:4d}% | ".format(off), end="")
    for price in originalPrices:
        print("{0:.2f}\t".format(new_price(price, off)), end="")
    # print newline at the end of the row
    print()
This will produce an output like this:
Normal Price
Off % | 9.95$ 14.95$ 19.95$ 24.95$ 29.95$ 34.95$ 39.95$ 44.95$ 49.95$
------------------------------------------------------------------------------
5% | 9.45 14.20 18.95 23.70 28.45 33.20 37.95 42.70 47.45
10% | 8.96 13.45 17.95 22.45 26.95 31.46 35.96 40.46 44.96
15% | 8.46 12.71 16.96 21.21 25.46 29.71 33.96 38.21 42.46
20% | 7.96 11.96 15.96 19.96 23.96 27.96 31.96 35.96 39.96
25% | 7.46 11.21 14.96 18.71 22.46 26.21 29.96 33.71 37.46
30% | 6.96 10.46 13.96 17.46 20.96 24.47 27.97 31.47 34.97
35% | 6.47 9.72 12.97 16.22 19.47 22.72 25.97 29.22 32.47
40% | 5.97 8.97 11.97 14.97 17.97 20.97 23.97 26.97 29.97
45% | 5.47 8.22 10.97 13.72 16.47 19.22 21.97 24.72 27.47
50% | 4.97 7.47 9.97 12.47 14.97 17.48 19.98 22.48 24.98
Hope that helps.
