I'm trying to scrape table data off of this website:
https://www.nfl.com/standings/league/2019/REG
I have working code (below); however, the table data does not seem to be in the order I see on the website.
On the website I see (top-down):
Baltimore Ravens, Green Bay Packers, ..., Cincinnati Bengals
But in my code results, I see (top-down): Bengals, Lions, ..., Ravens
Why is soup returning the tags out of order? Does anyone know why this is happening? Thanks!
import requests
import urllib.request
from bs4 import BeautifulSoup
import pandas as pd
import lxml
url = 'https://www.nfl.com/standings/league/2019/REG'
soup = BeautifulSoup(requests.get(url).text, 'lxml')
print(soup) #not sure why soup isn't returning tags in the order I see on website
table = soup.table
headers = []
for th in table.select('th'):
    headers.append(th.text)
print(headers)
df = pd.DataFrame(columns=headers)
for sup in table.select('sup'):
    sup.decompose()  # remove sup tags so footnote markers (x, xz*, ...) don't show up in the team names
for tr in table.select('tr')[1:]:
    td_list = tr.select('td')
    td_str_list = [td_list[0].select('.d3-o-club-shortname')[0].text]
    td_str_list = td_str_list + [td.text for td in td_list[1:]]
    df.loc[len(df)] = td_str_list
print(df.to_string())
After the initial page load, the table is dynamically sorted (by JavaScript) on the PCT column, so the raw HTML is in a different order than what you see. To get the order from the website, do the same sort on your DataFrame using sort_values():
pd.read_html('https://www.nfl.com/standings/league/2019/REG')[0].sort_values(by='PCT',ascending=False)
Or based on your example:
df.sort_values(by='PCT',ascending=False)
Output:
NFL Team   W   L  T  PCT    PF   PA   Net Pts  Home       Road       Div        Pct    Conf        Pct    Non-Conf   Strk  Last 5
Ravens     14  2  0  0.875  531  282  249      7 - 1 - 0  7 - 1 - 0  5 - 1 - 0  0.833  10 - 2 - 0  0.833  4 - 0 - 0  12W   5 - 0 - 0
49ers      13  3  0  0.813  479  310  169      6 - 2 - 0  7 - 1 - 0  5 - 1 - 0  0.833  10 - 2 - 0  0.833  3 - 1 - 0  2W    3 - 2 - 0
Saints     13  3  0  0.813  458  341  117      6 - 2 - 0  7 - 1 - 0  5 - 1 - 0  0.833  9 - 3 - 0   0.75   4 - 0 - 0  3W    4 - 1 - 0
Packers    13  3  0  0.813  376  313  63       7 - 1 - 0  6 - 2 - 0  6 - 0 - 0  1      10 - 2 - 0  0.833  3 - 1 - 0  5W    5 - 0 - 0
...
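A minimal, offline illustration of the fix (the team names and PCT values below are made-up stand-ins for the scraped frame): the frame starts in raw source order, and sort_values restores the ranking shown on the site.

```python
import pandas as pd

# Stand-in for the scraped table, in raw HTML (source) order
df = pd.DataFrame({'NFL Team': ['Bengals', 'Ravens', '49ers'],
                   'PCT': [0.125, 0.875, 0.813]})

# Sort descending by PCT, as the website does after load
df = df.sort_values(by='PCT', ascending=False).reset_index(drop=True)
print(df)
```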
I have this text file with increasing values in the second column, but at some points values repeat, e.g. 0, 12, 12, 36, ... I'm referring to the rows that are separated by "0 0", then "1 0", and so on. I just want to skip these while reading the data, so that the second column keeps increasing.
Can someone tell me a way to do that in Python?
0 0 1 1 1 0 0 0
0 3 0.999551 0.998204 0.995963 2.02497e-06 8.08878e-06 1.81582e-05
0 6 0.999226 0.996908 0.993056 3.50103e-06 1.39702e-05 3.13067e-05
0 9 0.998916 0.995669 0.990283 4.90435e-06 1.95504e-05 4.3739e-05
0 12 0.998613 0.994464 0.987587 6.27845e-06 2.50041e-05 5.58512e-05
0 15 0.998309 0.993255 0.984888 7.63421e-06 3.03781e-05 6.77611e-05
0 18 0.998008 0.992055 0.982214 8.97082e-06 3.56643e-05 7.9433e-05
0 21 0.99771 0.990872 0.979581 1.03001e-05 4.09117e-05 9.09826e-05
0 24 0.997413 0.989692 0.976958 1.16094e-05 4.60742e-05 0.000102324
0 27 0.997111 0.988494 0.974298 1.29506e-05 5.13517e-05 0.000113877
0 30 0.996811 0.987306 0.971666 1.42973e-05 5.66363e-05 0.000125395
0 33 0.996514 0.986129 0.969062 1.56102e-05 6.17854e-05 0.000136606
0 36 0.99622 0.984966 0.96649 1.6868e-05 6.67128e-05 0.000147314
1 0 1 1 1 0 0 0
1 12 0.998615 0.994472 0.987606 1.24824e-05 4.97091e-05 0.000111026
1 24 0.997408 0.989674 0.976917 2.32538e-05 9.22819e-05 0.000204924
1 36 0.996216 0.98495 0.966456 3.37665e-05 0.000133547 0.000294894
1 48 0.995023 0.98024 0.956083 4.41221e-05 0.000173927 0.000381972
1 60 0.993849 0.975622 0.945978 5.45843e-05 0.000214354 0.000467853
1 72 0.992678 0.971031 0.93599 6.49638e-05 0.000254364 0.000552466
1 84 0.991501 0.966432 0.926044 7.5403e-05 0.000294247 0.000635589
1 96 0.990323 0.961846 0.916176 8.55362e-05 0.000332815 0.000715435
1 108 0.989133 0.95723 0.90631 9.602e-05 0.000372371 0.000796123
1 120 0.987925 0.952552 0.89635 0.000106095 0.000410211 0.000872709
1 132 0.986728 0.947946 0.886629 0.000116829 0.000449985 0.000951404
1 144 0.985536 0.943378 0.87706 0.000127786 0.000490311 0.00103029
1 156 0.984333 0.938787 0.867512 0.000138898 0.000531114 0.00110972
1 168 0.983124 0.93419 0.858003 0.000149945 0.000571148 0.00118605
2 0 1 1 1 0 0 0
2 60 0.993889 0.975779 0.946334 0.000122674 0.000481801 0.0010518
2 120 0.98802 0.95292 0.897129 0.000235474 0.000910013 0.0019347
2 180 0.981998 0.929939 0.849324 0.000360693 0.00136728 0.00281767
2 240 0.976087 0.907868 0.805034 0.00048759 0.00180865 0.0036021
2 300 0.970186 0.886203 0.762767 0.000606964 0.00221121 0.0042844
2 360 0.964519 0.865822 0.724262 0.000723555 0.00257783 0.0048463
2 420 0.959195 0.846993 0.689658 0.000830297 0.00290486 0.00533017
2 480 0.953931 0.828808 0.657473 0.000940967 0.00322907 0.00579317
2 540 0.948992 0.812283 0.629672 0.00105503 0.0035387 0.00617566
2 600 0.94387 0.795353 0.601452 0.00116622 0.00381699 0.00650445
2 660 0.938843 0.778862 0.57426 0.00126677 0.00406694 0.00680719
2 720 0.933909 0.762839 0.548423 0.0013606 0.0043114 0.00712883
2 780 0.929153 0.7477 0.525167 0.00145272 0.00455818 0.0074014
2 840 0.924413 0.732931 0.503387 0.00154657 0.00480149 0.00765192
2 900 0.919724 0.718536 0.482191 0.00163803 0.0050077 0.00783869
You can load the file with np.loadtxt and delete the second column with np.delete using axis=1:
import numpy as np

arr = np.loadtxt('test.txt')
arr = np.delete(arr, 1, axis=1)
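If the goal is instead to skip the rows where the second column falls back or repeats (so that it increases over the whole file), one possible sketch, shown here on a small stand-in array in place of the np.loadtxt result:

```python
import numpy as np

# Stand-in for np.loadtxt('test.txt'): first column is a block index,
# second column restarts at 0 at the head of each block
arr = np.array([[0.0, 0.0, 1.0],
                [0.0, 12.0, 0.998],
                [0.0, 36.0, 0.996],
                [1.0, 0.0, 1.0],
                [1.0, 12.0, 0.998],
                [1.0, 48.0, 0.995]])

# Keep a row only if its second-column value exceeds every value seen so far,
# so the kept second column is strictly increasing
col = arr[:, 1]
mask = col > np.maximum.accumulate(np.r_[-np.inf, col[:-1]])
kept = arr[mask]
print(kept)
```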
So I have all the data points formatted into a table where I can now start to summarise my findings.
home_goals away_goals result home_points away_points
2006-2007 1 1 D 1 1
2006-2007 1 1 D 1 1
2006-2007 2 1 H 3 0
2006-2007 2 1 H 3 0
2006-2007 3 0 H 3 0
... ... ... ... ... ... ... ...
2019 - 2020 0 2 A 0 3
2019 - 2020 5 0 H 3 0
2019 - 2020 1 3 A 0 3
2019 - 2020 3 1 H 3 0
2019 - 2020 1 1 D 1 1
My objective is to collate this into a new data frame that summarises each season under the following columns:
Season_breakdown = pd.DataFrame(columns=['Season', 'Matches Played',
                                         'Home Wins', 'Draws', 'Away Wins',
                                         'Home Points', 'Away Points'])
My current solution is to run something like this:
index_count = pd.Index(data_master.index).value_counts()
index_count
That outputs:
2007-2008 380
2013-2014 380
2010-2011 380
2017-2018 380
2015-2016 380
2016-2017 380
2009-2010 380
2014-2015 380
2012-2013 380
2006-2007 380
2018-2019 380
2011-2012 380
2008-2009 380
2019 - 2020 P1 288
2019 - 2020 P2 92
and then hardcode the results into a new data variable which I can incorporate into Season_breakdown, repeating similar steps to collate home wins, away wins, home points and away points by season.
The aim is to have something along the lines of;
Season MatchesPlayed HomeWins Draws AwayWins HomePoints AwayPoints
2006-2007 380 (sum H 6/7)  (sum D 6/7)  (sum A 6/7)  (sum h_points) (sum a_points)
2007-2008 380 (sum H 7/8)  (sum D 7/8)  (sum A 7/8)  (sum h_points) (sum a_points)
2008-2009 380 (sum H 8/9)  (sum D 8/9)  (sum A 8/9)  (sum h_points) (sum a_points)
2009-2010 380 (sum H 9/10) (sum D 9/10) (sum A 9/10) (sum h_points) (sum a_points)
Etc.
I feel like there is a far more robust way to approach this and was hoping for some insight.
Thanks
You have multi-level aggregations: points are aggregated at the season level, while wins/draws are aggregated at the combined season/result level. So one option is to aggregate in multiple steps and then concatenate/join the results:
season_points = df.groupby(level=[0]).agg({'home_points': 'sum', 'away_points': 'sum'})
season_count = df.groupby(level=[0]).result.count().rename('MatchesPlayed').to_frame()
season_results = pd.crosstab(df.index, df.result).rename(
    columns={'A': 'AwayWins', 'D': 'Draws', 'H': 'HomeWins'})
season_results.index.name = None
agg_df = pd.concat([season_count, season_results, season_points], axis=1) \
    .rename(columns={'home_points': 'HomePoints', 'away_points': 'AwayPoints'})
print(agg_df)
# MatchesPlayed AwayWins Draws HomeWins HomePoints AwayPoints
#2006-2007 5 0 2 3 11 2
#2019 - 2020 5 2 1 2 7 7
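If you'd rather build the whole summary in a single pass, pandas named aggregation (pandas >= 0.25) can handle both levels at once by turning the result codes into boolean columns first. A sketch on a tiny stand-in frame shaped like the question's data:

```python
import pandas as pd

# Small stand-in frame: season index, result code, and points columns
df = pd.DataFrame(
    {'result': ['H', 'D', 'A', 'H', 'D'],
     'home_points': [3, 1, 0, 3, 1],
     'away_points': [0, 1, 3, 0, 1]},
    index=['2006-2007', '2006-2007', '2006-2007', '2019 - 2020', '2019 - 2020'])

agg_df = (
    df.assign(HomeWins=df.result.eq('H'),   # booleans sum to counts
              Draws=df.result.eq('D'),
              AwayWins=df.result.eq('A'))
      .groupby(level=0)
      .agg(MatchesPlayed=('result', 'size'),
           HomeWins=('HomeWins', 'sum'),
           Draws=('Draws', 'sum'),
           AwayWins=('AwayWins', 'sum'),
           HomePoints=('home_points', 'sum'),
           AwayPoints=('away_points', 'sum'))
)
print(agg_df)
```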
I have been having a serious problem with my PDF file. I want to extract all the text from my PDF, but after extraction all I have is raw bytes.
You can see below an extracted part of the extracted text:
b'%PDF-1.7\r\n%\xb5\xb5\xb5\xb5\r\n1 0 obj\r\n<</Type/Catalog/Pages 2 0 R/Lang(en-US) /Metadata 89 0 R/ViewerPreferences 90 0 R>>\r\nendobj\r\n2 0 obj\r\n<</Type/Pages/Count 11/Kids[ 3 0 R 28 0 R 36 0 R 38 0 R 42 0 R 49 0 R 58 0 R 60 0 R 62 0 R 64 0 R 66 0 R] >>\r\nendobj\r\n3 0 obj\r\n<</Type/Page/Parent 2 0 R/Resources<</Font<</F1 5 0 R/F2 9 0 R/F3 12 0 R/F4 17 0 R/F5 19 0 R>>/ExtGState<</GS7 7 0 R/GS8 8 0 R>>/XObject<</Image27 27 0 R>>/ProcSet[/PDF/Text/ImageB/ImageC/ImageI] >>/Annots[ 11 0 R 24 0 R 25 0 R 26 0 R] /MediaBox[ 0 0 612 792] /Contents 4 0 R/Group<</Type/Group/S/Transparency/CS/DeviceRGB>>/Tabs/S>>\r\nendobj\r\n4 0 obj\r\n<</Filter/FlateDecode/Length 5962>>\r\nstream\r\nx\x9c\xc5][o\xe3\xc6\x92~\x1f`\xfeC?J\x81\x87!\xbby\x1d\x1c,0\x17\'9\x07\xc9\\l\x03\xd9 \xc9\x03-\xd1\x16weI!9\xe3\xf1\xbf\xdf\xfa\xaa\x9b\x17\x89\xa4\xec\x91Z\xde\x01\xac\x91\xa8&\xab\xba\xaa\xba\xee\xdd\xfa\xe7\xe5\x0b\xd7q\xf1/\xf1\xa4pEH\xafQ"E\x91\xbd|\xf1\xfb\x0fb\xf5\xf2\xc5\xdb\xab\x97/~\xfc\xc9\x13\x9e\xe7\xb8\xbe\xb8\xbay\xf9\xc2\xa3q\xae\xf0\x84\x1f\x06\x8e\xa4\xe1A\xe2$\xa1\xb8\xba\xa3q?_F\xe2\xb6\xa4g\x8a[\xfe\x14\x9bO?\xbf|\xf1\xe7\xe4\xd7\xe9+5I\xcbJ\xe0\xff/S5\xd9\xd0\xdf\x9c\xfe\xd2j\xea\xb9\x93l\xfeZL\xff\x16W\xffy\xf9\xe2\x9c`~~\xf9\xe2\x9f#\x90\x0bd\xec\x04q\x179\xc6\xc9\xa0\xa2\x80\xc2\x8f\xd3P\xbfq\xa7\x11}x\xe5O$\xbd\xc1\x07\x0fWc\x8b\xc8D\xa1\xe3\xc91d\xbe{\xd6z\x90r\x9d\xd8\x17a(\x9d\xc8\x17^\xec9I$\x12\xfa#\x17\xdb\xa1O\x1d\xa7q\x97\x82`u\x11W\xa1\x88|\x1f\xb8?\x8e\xf4\xe7\xfa\x8d\xf4\x94#\x93\x1a\xa2\nb\xc7U\x83\x98=m`\x83Z\xc0\xc4\xeb`\'\xbd\xd8\xf1\x03\xc2\xd0ud\xdc\xc3\xf0\xb7\xacJ\xb5t\xa5\xd3Wr2
The code for this is as follows:
url = 'http://www.hrecos.org//images/Data/forweb/HRTVBSH.Metadata.pdf'
response = requests.get(url, stream=True)
data = response.content
print(data)
How can I extract the text from this?
You would need to use a package to parse the PDF file and extract the text from it. For example PyPDF2 could be used as follows:
import io
import requests
import PyPDF2
url = 'http://www.hrecos.org//images/Data/forweb/HRTVBSH.Metadata.pdf'
response = requests.get(url, stream=True)
pdf = PyPDF2.PdfFileReader(io.BytesIO(response.content))
with open('output.txt', 'w') as f_output:
    for page in range(pdf.getNumPages()):
        f_output.write(pdf.getPage(page).extractText())
This would create an output.txt file starting:
Last updated:
3/30/2018
Metadata:
Tivoli Bay
South
Hydrologic
Station
Location:
Tivoli Bay
, NY
(
42.027038,
-
73.925957
)
Data collection period:
July
1996*
I have a very long list that repeats the same pattern. Here is an original example:
04:50 10 244.685 0 0
05:00 10 244.680 0 0
HBCHa 9990 Seite 762
16.02.16
Hafenpegel
Pegel müM Var 0 Pegelstand
Datum Zeit Intervall müM Q Art
Sonntag, 31. Dezember 2000 05:10 10 244.679 0 0
05:20 10 244.688 0 0
05:30 10 244.694 0 0
05:40 10 244.688 0 0
As you can see, there is one line where measurement data is merged into the string that starts with "Sonntag".
My target is:
04:50 10 244.685 0 0
05:00 10 244.680 0 0
HBCHa 9990 Seite 762
16.02.16
Hafenpegel
Pegel müM Var 0 Pegelstand
Datum Zeit Intervall müM Q Art
Sonntag, 31. Dezember 2000
05:10 10 244.679 0 0 !!
05:20 10 244.688 0 0
05:30 10 244.694 0 0
05:40 10 244.688 0 0
I managed to read the txt file into a list, here called "data_list_splitted", catch this one line in the whole txt file, split it, and extract the part with the measurements:
for i in data_list_splitted:
    if len(i) >= 40:
        ii = i.split(";")
        txt_line = "%s;%s;%s;%s;%s" % (ii[4], ii[5], ii[6], ii[7], ii[8])
But I can't figure out how to break this line apart and add the measurement values back into the running list!
I think this shouldn't be that difficult?
Any ideas?
Thank you very much!
You can create another list and insert the values into it. Note that the date part is ii[0] through ii[3]; the measurements start at ii[4] (as in your own snippet), so they must not be repeated in the date line:
new_data_list_splitted = []
for i in data_list_splitted:
    if len(i) >= 40:
        ii = i.split(";")
        txt_line = "%s;%s;%s;%s" % (ii[0], ii[1], ii[2], ii[3])  # the date part
        new_data_list_splitted.append(txt_line)
        txt_line = "%s;%s;%s;%s;%s" % (ii[4], ii[5], ii[6], ii[7], ii[8])  # the measurement part
        new_data_list_splitted.append(txt_line)
    else:
        new_data_list_splitted.append(i)
print(new_data_list_splitted)  # the measurement values now sit in their own row
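A self-contained sketch of the same idea on made-up sample lines. It tests the field count rather than the string length (a small swap from the len(i) >= 40 check above), assuming the merged line carries four date fields followed by five measurement fields:

```python
# Hypothetical sample: semicolon-separated lines; the "Sonntag" line
# has the measurement fields merged onto the date fields
data_list_splitted = [
    "05:00;10;244.680;0;0",
    "Sonntag,;31.;Dezember;2000;05:10;10;244.679;0;0",
    "05:20;10;244.688;0;0",
]

new_list = []
for line in data_list_splitted:
    fields = line.split(";")
    if len(fields) >= 9:                        # the merged date + measurement line
        new_list.append(";".join(fields[:4]))   # date part
        new_list.append(";".join(fields[4:9]))  # measurement part
    else:
        new_list.append(line)

print(new_list)
```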
I am doing my homework and I encountered a problem. I have a large matrix whose first column, Y002, is a nominal variable with 3 levels encoded as 1, 2, 3. The other two columns, V96 and V97, are numeric.
Now I want to get group means corresponding to the variable Y002. I wrote code like this:
group = data2.groupby(by=["Y002"]).mean()
Then I index to get each group mean using
group1 = group["V96"]
group2 = group["V97"]
Now I want to append these group means as new columns in the original dataframe, so that each mean matches the corresponding Y002 code (1, 2 or 3). I tried this code, but it only produces NaN:
data2["group1"] = pd.Series(group1, index=data2.index)
Hope someone could help me with this, many thanks :)
PS: Hope this makes sense. In R, we can do the same thing using
data2$group1 = with(data2, tapply(V97,Y002,mean))[data2$Y002]
But how can we implement this in Python and pandas?
You can use .transform()
import pandas as pd
import numpy as np
# your data
# ============================
np.random.seed(0)
df = pd.DataFrame({'Y002': np.random.randint(1,4,100), 'V96': np.random.randn(100), 'V97': np.random.randn(100)})
print(df)
V96 V97 Y002
0 -0.6866 -0.1478 1
1 0.0149 1.6838 2
2 -0.3757 0.9718 1
3 -0.0382 1.6077 2
4 0.3680 -0.2571 2
5 -0.0447 1.8098 3
6 -0.3024 0.8923 1
7 -2.2244 -0.0966 3
8 0.7240 -0.3772 1
9 0.3590 -0.5053 1
.. ... ... ...
90 -0.6906 1.5567 2
91 -0.6815 -0.4189 3
92 -1.5122 -0.4097 1
93 2.1969 1.1164 2
94 1.0412 -0.2510 3
95 -0.0332 -0.4152 1
96 0.0656 -0.6391 3
97 0.2658 2.4978 1
98 1.1518 -3.0051 2
99 0.1380 -0.8740 3
# processing
# ===========================
df['V96_mean'] = df.groupby('Y002')['V96'].transform(np.mean)
df['V97_mean'] = df.groupby('Y002')['V97'].transform(np.mean)
df
V96 V97 Y002 V96_mean V97_mean
0 -0.6866 -0.1478 1 -0.1944 0.0837
1 0.0149 1.6838 2 0.0497 -0.0496
2 -0.3757 0.9718 1 -0.1944 0.0837
3 -0.0382 1.6077 2 0.0497 -0.0496
4 0.3680 -0.2571 2 0.0497 -0.0496
5 -0.0447 1.8098 3 0.0053 -0.0707
6 -0.3024 0.8923 1 -0.1944 0.0837
7 -2.2244 -0.0966 3 0.0053 -0.0707
8 0.7240 -0.3772 1 -0.1944 0.0837
9 0.3590 -0.5053 1 -0.1944 0.0837
.. ... ... ... ... ...
90 -0.6906 1.5567 2 0.0497 -0.0496
91 -0.6815 -0.4189 3 0.0053 -0.0707
92 -1.5122 -0.4097 1 -0.1944 0.0837
93 2.1969 1.1164 2 0.0497 -0.0496
94 1.0412 -0.2510 3 0.0053 -0.0707
95 -0.0332 -0.4152 1 -0.1944 0.0837
96 0.0656 -0.6391 3 0.0053 -0.0707
97 0.2658 2.4978 1 -0.1944 0.0837
98 1.1518 -3.0051 2 0.0497 -0.0496
99 0.1380 -0.8740 3 0.0053 -0.0707
[100 rows x 5 columns]
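An alternative that mirrors the R tapply(...)[data2$Y002] idiom more literally is to compute the per-group means once and map them back onto the rows by the group key. A minimal sketch on a small stand-in frame:

```python
import pandas as pd

# Tiny stand-in for the question's data
df = pd.DataFrame({'Y002': [1, 2, 1, 3],
                   'V96': [1.0, 2.0, 3.0, 4.0]})

# Series of group means indexed by Y002, broadcast back via map
df['V96_mean'] = df['Y002'].map(df.groupby('Y002')['V96'].mean())
print(df)
```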