I need help with a school project. The lines I've commented out with "#" don't work with the table I scraped; I need to turn it into a data frame. Can anyone see what I'm missing, or whether I've skipped a step?
Tertiary=pd.read_html("https://en.wikipedia.org/wiki/List_of_countries_by_tertiary_education_attainment")
Tertiary=pd.DataFrame(Tertiary[1])
#Tertiary=Tertiary.drop(["Non-OECD"], axis=1, inplace=True)
print(Tertiary.dtypes)
#Tertiary["Age25-64(%)"] = pd.to_numeric(Tertiary["Age25-64(%)"])
#Tertiary["Age"] = pd.to_numeric(Tertiary["Age"])
print(Tertiary.dtypes)
print()
#print(Tertiary.describe)
print()
#print(Tertiary.isnull().sum())
#print(Tertiary)
Everything works fine for me. Note in the output below that read_html builds the columns as a two-level MultiIndex, which is why lookups like Tertiary["Age25-64(%)"] fail.
import pandas as pd
df = pd.read_html("https://en.wikipedia.org/wiki/List_of_countries_by_tertiary_education_attainment")
table = pd.DataFrame(df[1])
print(table)
print(table.columns)
Output:
Country Age 25–64 (%) Age Year Non-OECD
Country Age 25–64 (%) 25–34 (%) 35–44 (%) 45–54 (%) 55–64 (%) Year Non-OECD
0 Australia 42 48 46 38 33 2014 NaN
1 Austria 30 38 33 27 21 2014 NaN
2 Belgium 37 44 42 34 26 2014 NaN
3 Brazil 14 15 14 14 11 2013 NaN
4 Canada 54 58 61 51 45 2014 NaN
5 Chile 21 27 24 17 14 2013 NaN
6 China 17 27 15 7 2 2018 NaN
7 Colombia 22 28 23 18 16 2014 NaN
8 Costa Rica 18 21 19 17 17 2014 NaN
9 Czech Republic 22 30 21 20 15 2014 NaN
10 Denmark 36 42 41 33 29 2014 NaN
11 Estonia 38 40 39 35 36 2014 NaN
12 Finland 42 40 50 44 34 2014 NaN
13 France 32 44 39 26 20 2013 NaN
14 Germany 27 28 29 26 25 2014 NaN
15 Greece 28 39 27 26 21 2014 NaN
16 Hungary 23 32 25 20 17 2014 NaN
17 Iceland 37 41 42 36 29 2014 NaN
18 Indonesia 8 10 9 8 4 2011 NaN
19 Ireland 41 51 49 34 24 2014 NaN
20 Israel 49 46 53 48 47 2014 NaN
21 Italy 17 24 19 13 12 2014 NaN
22 Japan 48 59 53 47 35 2014 NaN
23 Latvia 30 39 31 27 23 2014 NaN
24 Lithuania 37 53 38 30 28 2014 NaN
25 Luxembourg 46 53 56 40 32 2014 NaN
26 Mexico 19 25 17 16 13 2014 NaN
27 Netherlands 34 44 38 30 27 2014 NaN
28 New Zealand 36 40 41 32 29 2014 NaN
29 Norway 42 49 49 36 32 2014 NaN
30 Poland 27 43 32 18 14 2014 NaN
31 Portugal 22 31 26 17 13 2014 NaN
32 Russia 54 58 55 53 50 2013 NaN
33 Saudi Arabia 22 26 22 18 14 2013 NaN
34 Slovakia 20 30 21 15 14 2014 NaN
35 Slovenia 29 38 35 24 18 2014 NaN
36 South Africa 7 5 7 8 7 2012 NaN
37 South Korea 45 68 56 33 17 2014 NaN
38 Spain 35 41 43 30 21 2014 NaN
39 Sweden 39 46 46 32 30 2014 NaN
40 Switzerland 40 46 45 38 31 2014 NaN
41 Turkey 17 25 16 10 10 2014 NaN
42 Taiwan[3] 45 X X X X 2015 NaN
43 United Kingdom 42 49 46 38 35 2014 NaN
44 United States 44 46 47 43 41 2014 NaN
MultiIndex([( 'Country', 'Country'),
('Age 25–64 (%)', 'Age 25–64 (%)'),
( 'Age', '25–34 (%)'),
( 'Age', '35–44 (%)'),
( 'Age', '45–54 (%)'),
( 'Age', '55–64 (%)'),
( 'Year', 'Year'),
( 'Non-OECD', 'Non-OECD')],
)
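Since the columns come back as a two-level MultiIndex, the commented-out steps in the question fail: there is no flat column named "Age25-64(%)". A minimal sketch of one way to continue, assuming the column layout shown above (flatten to the second level, then drop and convert; errors='coerce' handles the 'X' placeholders in the Taiwan row):
table.columns = table.columns.get_level_values(1)  # 'Country', 'Age 25–64 (%)', '25–34 (%)', ...
table = table.drop(columns=['Non-OECD'])           # without inplace=True, so keep the returned frame
table['25–34 (%)'] = pd.to_numeric(table['25–34 (%)'], errors='coerce')  # Taiwan's 'X' becomes NaN
print(table.dtypes)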
Related
I'm trying to scrape the list from https://www.forbes.com/best-states-for-business/list/#tab:overall
import requests_html
session= requests_html.HTMLSession()
r = session.get("https://www.forbes.com/best-states-for-business/list/#tab:overall")
r.html.render()
body=r.text.find('#list-table-body')
print(body)
This returns -1, not the table content. How can I get the actual table content?
The data is loaded dynamically from an external source via an API (you can spot the endpoint in the network tab of your browser's developer tools while the page loads), so you can grab the required data with the requests module alone and then store it in a pandas DataFrame.
import pandas as pd
import requests
headers = {'user-agent': 'Mozilla/5.0'}  # spoof a browser user agent
url = 'https://www.forbes.com/ajax/list/data?year=2019&uri=best-states-for-business&type=place'
r = requests.get(url, headers=headers)
df = pd.DataFrame(r.json())  # the endpoint returns a JSON list of records
print(df)
Output:
position rank name uri ... regulatoryEnvironment economicClimate growthProspects lifeQuality
0 1 1 North Carolina nc ... 1 13 13 16
1 2 2 Texas tx ... 21 4 1 15
2 3 3 Utah ut ... 6 8 7 9
3 4 4 Virginia va ... 3 20 24 1
4 5 5 Florida fl ... 7 3 5 18
5 6 6 Georgia ga ... 9 7 11 23
6 7 7 Tennessee tn ... 4 11 14 29
7 8 8 Washington wa ... 29 6 8 30
8 9 9 Colorado co ... 19 2 4 21
9 10 10 Idaho id ... 8 10 2 24
10 11 11 Nebraska ne ... 2 28 36 19
11 12 12 Indiana in ... 5 25 25 7
12 13 13 Nevada nv ... 14 14 6 48
13 14 14 South Dakota sd ... 13 39 20 28
14 15 15 Minnesota mn ... 16 16 27 3
15 16 16 South Carolina sc ... 17 15 12 39
16 17 17 Iowa ia ... 11 36 35 10
17 18 18 Arizona az ... 18 12 3 35
18 19 19 Massachusetts ma ... 37 5 15 4
19 20 20 Oregon or ... 36 9 9 38
20 21 21 Wisconsin wi ... 10 19 37 8
21 22 22 Missouri mo ... 25 26 18 17
22 23 23 Delaware de ... 42 37 19 43
23 24 24 Oklahoma ok ... 15 31 33 31
24 25 25 New Hampshire nh ... 32 21 22 22
25 26 26 North Dakota nd ... 22 45 26 42
26 27 27 Pennsylvania pa ... 35 23 40 12
27 28 28 New York ny ... 34 18 21 14
28 29 29 Ohio oh ... 26 22 44 2
29 30 30 Montana mt ... 28 35 17 45
30 31 31 California ca ... 40 1 10 27
31 32 32 Wyoming wy ... 12 49 23 36
32 33 33 Arkansas ar ... 20 33 39 41
33 34 34 Maryland md ... 41 27 29 26
34 35 35 Michigan mi ... 22 17 41 13
35 36 36 Kansas ks ... 24 32 42 32
36 37 37 Illinois il ... 39 30 45 11
37 38 38 Kentucky ky ... 33 41 34 25
38 39 39 New Jersey nj ... 49 29 30 5
39 40 40 Alabama al ... 27 38 31 44
40 41 41 Rhode Island ri ... 44 40 32 20
41 42 42 Mississippi ms ... 30 46 47 37
42 43 43 Connecticut ct ... 43 42 48 6
43 44 44 Maine me ... 48 34 28 34
44 45 45 Vermont vt ... 45 43 38 33
45 46 46 Louisiana la ... 47 47 46 47
46 47 47 Hawaii hi ... 38 24 49 40
47 48 48 New Mexico nm ... 46 44 15 49
48 49 49 West Virginia wv ... 50 48 50 46
49 50 50 Alaska ak ... 31 50 43 50
[50 rows x 15 columns]
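If you only need some of the fields, a short follow-up sketch (the column names are taken from the output above; adjust to taste):
df = df.set_index('rank')[['name', 'regulatoryEnvironment', 'economicClimate', 'growthProspects', 'lifeQuality']]
print(df.head())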
I have a text file which has a number of integer values like this.
20180701 20180707 52 11 1 2 4 1 0 0 10 7 1 3 1 0 4 5 2
20180708 20180714 266 8 19 3 2 9 7 25 20 17 12 9 9 27 34 54 11
20180715 20180721 654 52 34 31 20 16 12 25 84 31 38 37 38 69 66 87 14
20180722 201807281017 110 72 46 52 29 29 22 204 41 46 51 57 67 82 92 17
20180729 201808041106 276 37 11 87 20 10 8 284 54 54 72 38 49 41 53 12
20180805 20180811 624 78 19 15 55 16 8 9 172 15 31 35 38 47 29 36 21
20180812 20180818 488 63 17 7 26 10 9 7 116 17 14 39 31 34 27 64 7
20180819 20180825 91 4 7 0 4 5 1 3 16 3 4 5 10 10 7 11 1
20180826 20180901 49 2 2 1 0 4 0 1 2 0 1 4 8 2 6 6 10
I have to build one file by merging several files like this, but you can see the problem with this data: in lines 4 and 5, the first values, 1017 and 1106, are fused to the second date of the period index. Whenever I read these two lines I get the result below; the value that should be the first data column gets swallowed into the index instead of being parsed as its own field.
In [14]: fw.iloc[80,:]
Out[14]:
3 72.0
4 46.0
5 52.0
6 29.0
7 29.0
8 22.0
9 204.0
10 41.0
11 46.0
12 51.0
13 57.0
14 67.0
15 82.0
16 92.0
17 17.0
18 NaN
Name: (20180722, 201807281017), dtype: float64
I tried to fix it with indexing but failed.
The desired result is:
In [14]: fw.iloc[80,:]
Out[14]:
2 1017.0
3 110.0
4 72.0
5 46.0
6 52.0
7 29.0
8 29.0
9 22.0
10 204.0
11 41.0
12 46.0
13 51.0
14 57.0
15 67.0
16 82.0
17 92.0
18 17.0
Name: (20180722, 201807281017), dtype: float64
How can I solve this problem? I used this code to read the file:
fw = pd.read_csv('warm_patient.txt', index_col=[0,1], header=None, delim_whitespace=True)
A better fit for this would be pandas.read_fwf. For your example:
df = pd.read_fwf(filename, index_col=[0,1], header=None, widths=2*[10]+17*[4])
I don't know if the column widths can be inferred for all your data or need to be hardcoded.
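For what it's worth, read_fwf can also try to infer the widths from the first rows of the file (colspecs='infer' is the default, controlled by infer_nrows); a hedged sketch, though inference may well stumble on the fused rows, which is why explicit widths are safer here:
df = pd.read_fwf(filename, index_col=[0, 1], header=None, colspecs='infer', infer_nrows=100)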
One possibility is to construct the DataFrame manually; that way we can parse the text by splitting the values every 4 characters.
from textwrap import wrap
import pandas as pd
def read_file(f_name):
    data = []
    with open(f_name) as f:
        for line in f.readlines():
            idx1 = line[0:8]    # first date of the period
            idx2 = line[10:18]  # second date, even when a value runs into it
            # the remaining values sit in fixed 4-character slots
            points = map(lambda x: int(x.replace(" ", "")), wrap(line.rstrip()[18:], 4))
            data.append([idx1, idx2, *points])
    return pd.DataFrame(data).set_index([0, 1])
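Usage would then be, with the question's file name:
fw = read_file('warm_patient.txt')
print(fw.iloc[80, :])  # row 80 should now show 1017 as its first value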
It could be made more efficient (particularly if the text file is very long), but here's another solution.
fw = pd.read_csv('test.txt', header=None, delim_whitespace=True)
# rows whose last column is NaN are the ones where two fields were fused
for i in fw[pd.isna(fw.iloc[:, -1])].index:
    num_str = str(fw.iat[i, 1])
    # split the fused token into the 8-digit date and the trailing value
    a, b = map(int, [num_str[:-4], num_str[-4:]])
    fw.iloc[i, 3:] = fw.iloc[i, 2:-1]      # shift the data one column to the right
    fw.iloc[i, :3] = [fw.iat[i, 0], a, b]  # restore: date, date, first value
fw = fw.set_index([0, 1])
The result of print(fw) from there is
2 3 4 5 6 7 8 9 10 11 12 13 14 15 \
0 1
20180701 20180707 52 11 1 2 4 1 0 0 10 7 1 3 1 0
20180708 20180714 266 8 19 3 2 9 7 25 20 17 12 9 9 27
20180715 20180721 654 52 34 31 20 16 12 25 84 31 38 37 38 69
20180722 20180728 1017 110 72 46 52 29 29 22 204 41 46 51 57 67
20180729 20180804 1106 276 37 11 87 20 10 8 284 54 54 72 38 49
20180805 20180811 624 78 19 15 55 16 8 9 172 15 31 35 38 47
20180812 20180818 488 63 17 7 26 10 9 7 116 17 14 39 31 34
20180819 20180825 91 4 7 0 4 5 1 3 16 3 4 5 10 10
20180826 20180901 49 2 2 1 0 4 0 1 2 0 1 4 8 2
16 17 18
0 1
20180701 20180707 4 5 2.0
20180708 20180714 34 54 11.0
20180715 20180721 66 87 14.0
20180722 20180728 82 92 17.0
20180729 20180804 41 53 12.0
20180805 20180811 29 36 21.0
20180812 20180818 27 64 7.0
20180819 20180825 7 11 1.0
20180826 20180901 6 6 10.0
Here's the result of the print after applying your initial solution of fw = pd.read_csv('test.txt', index_col=[0,1], header=None, delim_whitespace=True) for comparison.
2 3 4 5 6 7 8 9 10 11 12 13 14 \
0 1
20180701 20180707 52 11 1 2 4 1 0 0 10 7 1 3 1
20180708 20180714 266 8 19 3 2 9 7 25 20 17 12 9 9
20180715 20180721 654 52 34 31 20 16 12 25 84 31 38 37 38
20180722 201807281017 110 72 46 52 29 29 22 204 41 46 51 57 67
20180729 201808041106 276 37 11 87 20 10 8 284 54 54 72 38 49
20180805 20180811 624 78 19 15 55 16 8 9 172 15 31 35 38
20180812 20180818 488 63 17 7 26 10 9 7 116 17 14 39 31
20180819 20180825 91 4 7 0 4 5 1 3 16 3 4 5 10
20180826 20180901 49 2 2 1 0 4 0 1 2 0 1 4 8
15 16 17 18
0 1
20180701 20180707 0 4 5 2.0
20180708 20180714 27 34 54 11.0
20180715 20180721 69 66 87 14.0
20180722 201807281017 82 92 17 NaN
20180729 201808041106 41 53 12 NaN
20180805 20180811 47 29 36 21.0
20180812 20180818 34 27 64 7.0
20180819 20180825 10 7 11 1.0
20180826 20180901 2 6 6 10.0
I have two scalar variables, A and B, and a DataFrame df1 with 1000 columns and 86400 rows. The table below shows just 10 of the columns for simplicity:
0 1 2 3 4 5 6 7 8 9 f
0 4.000000 23.000000 6.000000 36.000000 37.000000 33.000000 22.000000 28.000000 8.000000 14.000000 50.135
1 4.002361 23.002361 6.002361 36.002361 37.002361 33.002361 22.002361 28.002361 8.002361 14.002361 50.130
2 4.004722 23.004722 6.004722 36.004722 37.004722 33.004722 22.004722 28.004722 8.004722 14.004722 50.120
3 4.007083 23.007083 6.007083 36.007083 37.007083 33.007083 22.007083 28.007083 8.007083 14.007083 50.112
4 4.009444 23.009444 6.009444 36.009444 37.009444 33.009444 22.009444 28.009444 8.009444 14.009444 50.102
5 4.011806 23.011806 6.011806 36.011806 37.011806 33.011806 22.011806 28.011806 8.011806 14.011806 50.097
... ... ... ... ... ... ... ... ... ... ... ...
86387 207.969306 226.969306 209.969306 239.969306 240.969306 236.969306 225.969306 231.969306 211.969306 217.969306 49.920
86388 207.971667 226.971667 209.971667 239.971667 240.971667 236.971667 225.971667 231.971667 211.971667 217.971667 49.920
86389 207.974028 226.974028 209.974028 239.974028 240.974028 236.974028 225.974028 231.974028 211.974028 217.974028 49.920
86390 207.976389 226.976389 209.976389 239.976389 240.976389 236.976389 225.976389 231.976389 211.976389 217.976389 49.920
86391 207.978750 226.978750 209.978750 239.978750 240.978750 236.978750 225.978750 231.978750 211.978750 217.978750 49.917
86392 207.981111 226.981111 209.981111 239.981111 240.981111 236.981111 225.981111 231.981111 211.981111 217.981111 49.917
86393 207.983472 226.983472 209.983472 239.983472 240.983472 236.983472 225.983472 231.983472 211.983472 217.983472 49.915
86394 207.985833 226.985833 209.985833 239.985833 240.985833 236.985833 225.985833 231.985833 211.985833 217.985833 49.915
86395 207.988194 226.988194 209.988194 239.988194 240.988194 236.988194 225.988194 231.988194 211.988194 217.988194 49.915
86396 207.990556 226.990556 209.990556 239.990556 240.990556 236.990556 225.990556 231.990556 211.990556 217.990556 49.912
86397 207.992917 226.992917 209.992917 239.992917 240.992917 236.992917 225.992917 231.992917 211.992917 217.992917 49.915
86398 207.995278 226.995278 209.995278 239.995278 240.995278 236.995278 225.995278 231.995278 211.995278 217.995278 49.917
86399 207.997639 226.997639 209.997639 239.997639 240.997639 236.997639 225.997639 231.997639 211.997639 217.997639 49.917
I would like to perform a row-by-row operation:
when f > 50: add C = A/B/3600 to the values in columns 1-999.
when f < 50: subtract C = A/B/3600 from the values in columns 1-999.
cols = df1.columns[df1.columns.isin(range(0, 999))]
df1[cols] = np.where(df1[cols] > 50,
                     df1[cols].values - np.arange(len(df1))[:, None] * C,
                     df1[cols].values + np.arange(len(df1))[:, None] * C)
As you can see, the values keep increasing even when f < 50.
Any suggestions? Thank you in advance.
If performance is important, use numpy.where and add or subtract a numpy.arange-based step depending on the condition:
cols = df.columns[df.columns.isin(range(1, 1000))]
df[cols] = np.where(df[cols] > 50,
                    df[cols].values - np.arange(len(df))[:, None],
                    df[cols].values + np.arange(len(df))[:, None])
print (df)
0 1 2 3 4 5 6 7 8 9 991 992 993 994 995 996 \
0 18 8 9 5 38 11 26 25 2 30 23 18 34 1 29 34
1 18 9 10 6 39 12 27 26 3 31 24 19 35 2 30 35
2 18 10 11 7 40 13 28 27 4 32 25 20 36 3 31 36
3 18 11 12 8 41 14 29 28 5 33 26 21 37 4 32 37
4 18 12 13 9 42 15 30 29 6 34 27 22 38 5 33 38
5 18 13 14 10 43 16 31 30 7 35 28 23 39 6 34 39
6 18 14 15 11 44 17 32 31 8 36 29 24 40 7 35 40
86393 18 15 16 12 45 18 33 32 9 37 30 25 41 8 36 41
86394 18 16 17 13 46 19 34 33 10 38 31 26 42 9 37 42
86395 18 17 18 14 47 20 35 34 11 39 32 27 43 10 38 43
86396 18 18 19 15 48 21 36 35 12 40 33 28 44 11 39 44
86397 18 19 20 16 49 22 37 36 13 41 34 29 45 12 40 45
86398 18 20 21 17 50 23 38 37 14 42 35 30 46 13 41 46
86399 18 21 22 18 51 24 39 38 15 43 36 31 47 14 42 47
86400 18 22 23 19 52 25 40 39 16 44 37 32 48 15 43 48
997 998 999 f
0 25 15 2 50.135
1 26 16 3 50.130
2 27 17 4 50.120
3 28 18 5 50.112
4 29 19 6 50.102
5 30 20 7 50.097
6 31 21 8 50.095
86393 32 22 9 49.915
86394 33 23 10 49.915
86395 34 24 11 49.915
86396 35 25 12 49.912
86397 36 26 13 49.915
86398 37 27 14 49.917
86399 38 28 15 49.917
86400 39 29 16 49.915
EDIT:
A = 360000
B = 5
C = A/B/3600

cols = df.columns[df.columns.isin(range(1, 1000))]
mask = df[cols] > 50
df[cols] = np.where(mask,
                    df[cols].values - mask.cumsum().sub(1) * C,
                    df[cols].values + (~mask).cumsum().sub(1) * C)
print (df)
0 1 2 3 4 5 \
0 4.000000 23.000000 6.000000 36.000000 37.000000 33.000000
1 4.002361 43.002361 26.002361 56.002361 57.002361 53.002361
2 4.004722 63.004722 46.004722 76.004722 77.004722 73.004722
3 4.007083 83.007083 66.007083 96.007083 97.007083 93.007083
4 4.009444 103.009444 86.009444 116.009444 117.009444 113.009444
5 4.011806 123.011806 106.011806 136.011806 137.011806 133.011806
86387 207.969306 226.969306 209.969306 239.969306 240.969306 236.969306
86388 207.971667 206.971667 189.971667 219.971667 220.971667 216.971667
86389 207.974028 186.974028 169.974028 199.974028 200.974028 196.974028
86390 207.976389 166.976389 149.976389 179.976389 180.976389 176.976389
86391 207.978750 146.978750 129.978750 159.978750 160.978750 156.978750
86392 207.981111 126.981111 109.981111 139.981111 140.981111 136.981111
86393 207.983472 106.983472 89.983472 119.983472 120.983472 116.983472
86394 207.985833 86.985833 69.985833 99.985833 100.985833 96.985833
86395 207.988194 66.988194 49.988194 79.988194 80.988194 76.988194
86396 207.990556 46.990556 29.990556 59.990556 60.990556 56.990556
86397 207.992917 26.992917 9.992917 39.992917 40.992917 36.992917
86398 207.995278 6.995278 -10.004722 19.995278 20.995278 16.995278
86399 207.997639 -13.002361 -30.002361 -0.002361 0.997639 -3.002361
6 7 8 9 f
0 22.000000 28.000000 8.000000 14.000000 50.135
1 42.002361 48.002361 28.002361 34.002361 50.130
2 62.004722 68.004722 48.004722 54.004722 50.120
3 82.007083 88.007083 68.007083 74.007083 50.112
4 102.009444 108.009444 88.009444 94.009444 50.102
5 122.011806 128.011806 108.011806 114.011806 50.097
86387 225.969306 231.969306 211.969306 217.969306 49.920
86388 205.971667 211.971667 191.971667 197.971667 49.920
86389 185.974028 191.974028 171.974028 177.974028 49.920
86390 165.976389 171.976389 151.976389 157.976389 49.920
86391 145.978750 151.978750 131.978750 137.978750 49.917
86392 125.981111 131.981111 111.981111 117.981111 49.917
86393 105.983472 111.983472 91.983472 97.983472 49.915
86394 85.985833 91.985833 71.985833 77.985833 49.915
86395 65.988194 71.988194 51.988194 57.988194 49.915
86396 45.990556 51.990556 31.990556 37.990556 49.912
86397 25.992917 31.992917 11.992917 17.992917 49.915
86398 5.995278 11.995278 -8.004722 -2.004722 49.917
86399 -14.002361 -8.002361 -28.002361 -22.002361 49.917
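Note that the mask above follows the question's own attempt and compares the data columns to 50 cell by cell. If the condition is really meant to be on the f column, as the question's prose says, here is a sketch of that variant (same C and cols as above, and assuming the adjustment accumulates row by row as in the original attempt):
mask = df['f'] > 50
step = np.where(mask, C, -C)  # +C when f > 50, -C otherwise
df[cols] = df[cols].values + step.cumsum()[:, None]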
I am reading an HTML table with pd.read_html, but the result comes back as a list. I want to convert it into a pandas DataFrame so I can continue with further operations on it. I am using the following script:
import pandas as pd
import html5lib
data=pd.read_html('http://www.espn.com/nhl/statistics/player/_/stat/points/sort/points/year/2015/seasontype/2',skiprows=1)
and since my result comes back as one list, I tried to convert it into a data frame with
data1 = pd.DataFrame(data)
and the result came out as
0
0 0 1 2 3 4...
and because the result is a list, I can't apply functions such as rename, dropna, or drop.
I would appreciate any help.
I think you need to add [0] to select the first item of the list, because read_html returns a list of DataFrames.
So you can use:
import pandas as pd
data1 = pd.read_html('http://www.espn.com/nhl/statistics/player/_/stat/points/sort/points/year/2015/seasontype/2',skiprows=1)[0]
print (data1)
0 1 2 3 4 5 6 7 8 9 \
0 RK PLAYER TEAM GP G A PTS +/- PIM PTS/G
1 1 Jamie Benn, LW DAL 82 35 52 87 1 64 1.06
2 2 John Tavares, C NYI 82 38 48 86 5 46 1.05
3 3 Sidney Crosby, C PIT 77 28 56 84 5 47 1.09
4 4 Alex Ovechkin, LW WSH 81 53 28 81 10 58 1.00
5 NaN Jakub Voracek, RW PHI 82 22 59 81 1 78 0.99
6 6 Nicklas Backstrom, C WSH 82 18 60 78 5 40 0.95
7 7 Tyler Seguin, C DAL 71 37 40 77 -1 20 1.08
8 8 Jiri Hudler, LW CGY 78 31 45 76 17 14 0.97
9 NaN Daniel Sedin, LW VAN 82 20 56 76 5 18 0.93
10 10 Vladimir Tarasenko, RW STL 77 37 36 73 27 31 0.95
11 NaN PP SH NaN NaN NaN NaN NaN NaN NaN
12 RK PLAYER TEAM GP G A PTS +/- PIM PTS/G
13 NaN Nick Foligno, LW CBJ 79 31 42 73 16 50 0.92
14 NaN Claude Giroux, C PHI 81 25 48 73 -3 36 0.90
15 NaN Henrik Sedin, C VAN 82 18 55 73 11 22 0.89
16 14 Steven Stamkos, C TB 82 43 29 72 2 49 0.88
17 NaN Tyler Johnson, C TB 77 29 43 72 33 24 0.94
18 16 Ryan Johansen, C CBJ 82 26 45 71 -6 40 0.87
19 17 Joe Pavelski, C SJ 82 37 33 70 12 29 0.85
20 NaN Evgeni Malkin, C PIT 69 28 42 70 -2 60 1.01
21 NaN Ryan Getzlaf, C ANA 77 25 45 70 15 62 0.91
22 20 Rick Nash, LW NYR 79 42 27 69 29 36 0.87
23 NaN PP SH NaN NaN NaN NaN NaN NaN NaN
24 RK PLAYER TEAM GP G A PTS +/- PIM PTS/G
25 21 Max Pacioretty, LW MTL 80 37 30 67 38 32 0.84
26 NaN Logan Couture, C SJ 82 27 40 67 -6 12 0.82
27 23 Jonathan Toews, C CHI 81 28 38 66 30 36 0.81
28 NaN Erik Karlsson, D OTT 82 21 45 66 7 42 0.80
29 NaN Henrik Zetterberg, LW DET 77 17 49 66 -6 32 0.86
30 26 Pavel Datsyuk, C DET 63 26 39 65 12 8 1.03
31 NaN Joe Thornton, C SJ 78 16 49 65 -4 30 0.83
32 28 Nikita Kucherov, RW TB 82 28 36 64 38 37 0.78
33 NaN Patrick Kane, RW CHI 61 27 37 64 10 10 1.05
34 NaN Mark Stone, RW OTT 80 26 38 64 21 14 0.80
35 NaN PP SH NaN NaN NaN NaN NaN NaN NaN
36 RK PLAYER TEAM GP G A PTS +/- PIM PTS/G
37 NaN Alexander Steen, LW STL 74 24 40 64 8 33 0.86
38 NaN Kyle Turris, C OTT 82 24 40 64 5 36 0.78
39 NaN Johnny Gaudreau, LW CGY 80 24 40 64 11 14 0.80
40 NaN Anze Kopitar, C LA 79 16 48 64 -2 10 0.81
41 35 Radim Vrbata, RW VAN 79 31 32 63 6 20 0.80
42 NaN Jaden Schwartz, LW STL 75 28 35 63 13 16 0.84
43 NaN Filip Forsberg, C NSH 82 26 37 63 15 24 0.77
44 NaN Jordan Eberle, RW EDM 81 24 39 63 -16 24 0.78
45 NaN Ondrej Palat, LW TB 75 16 47 63 31 24 0.84
46 40 Zach Parise, LW MIN 74 33 29 62 21 41 0.84
10 11 12 13 14 15 16
0 SOG PCT GWG G A G A
1 253 13.8 6 10 13 2 3
2 278 13.7 8 13 18 0 1
3 237 11.8 3 10 21 0 0
4 395 13.4 11 25 9 0 0
5 221 10.0 3 11 22 0 0
6 153 11.8 3 3 30 0 0
7 280 13.2 5 13 16 0 0
8 158 19.6 5 6 10 0 0
9 226 8.9 5 4 21 0 0
10 264 14.0 6 8 10 0 0
11 NaN NaN NaN NaN NaN NaN NaN
12 SOG PCT GWG G A G A
13 182 17.0 3 11 15 0 0
14 279 9.0 4 14 23 0 0
15 101 17.8 0 5 20 0 0
16 268 16.0 6 13 12 0 0
17 203 14.3 6 8 9 0 0
18 202 12.9 0 7 19 2 0
19 261 14.2 5 19 12 0 0
20 212 13.2 4 9 17 0 0
21 191 13.1 6 3 10 0 2
22 304 13.8 8 6 6 4 1
23 NaN NaN NaN NaN NaN NaN NaN
24 SOG PCT GWG G A G A
25 302 12.3 10 7 4 3 2
26 263 10.3 4 6 18 2 0
27 192 14.6 7 6 11 2 1
28 292 7.2 3 6 24 0 0
29 227 7.5 3 4 24 0 0
30 165 15.8 5 8 16 0 0
31 131 12.2 0 4 18 0 0
32 190 14.7 2 2 13 0 0
33 186 14.5 5 6 16 0 0
34 157 16.6 6 5 8 1 0
35 NaN NaN NaN NaN NaN NaN NaN
36 SOG PCT GWG G A G A
37 223 10.8 5 8 16 0 0
38 215 11.2 6 4 12 1 0
39 167 14.4 4 8 13 0 0
40 134 11.9 4 6 18 0 0
41 267 11.6 7 12 11 0 0
42 184 15.2 4 8 8 0 2
43 237 11.0 6 6 13 0 0
44 183 13.1 2 6 15 0 0
45 139 11.5 5 3 8 1 1
46 259 12.7 3 11 5 0 0
If your dataframe ends up with columns indexed as 0, 1, 2, etc. and the headings in the first row (as above), just specify that the column names are in the first row with header=0.
Without this, pandas sees a mix of data types (text in row 1, numbers in the rest) and casts the column as object rather than, say, int64.
The full line would be:
data1 = pd.read_html(url, skiprows=1, header=0)[0]
[0] is the first table in the list of possible tables.
There are options for handling NA values as well. Check out the documentation here:
https://pandas.pydata.org/docs/reference/api/pandas.read_html.html
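Also note in the output above that this page repeats its header every ten players and inserts separator rows. A hedged sketch of one way to drop those after reading with header=0 (assuming a GP column exists once the header row takes effect) is to keep only rows whose GP value parses as a number:
data1 = data1[pd.to_numeric(data1['GP'], errors='coerce').notna()].reset_index(drop=True)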
I know this is late, but here's a better way.
I noticed that the DataFrames in the list are all part of the same table/dataset you are trying to analyze, so instead of breaking them up and then merging them back together, a better solution is to concat the list of DataFrames.
Check out the results of this code:
df = pd.concat(pd.read_html('https://www.espn.com/nhl/stats/player/_/view/goaltending'),axis=1)
output:
df.head(1)
index RK Name POS GP W L OTL GA/G SA GA SV SV% SO TOI PIM SOSA SOS SOS%
0 1 Igor ShesterkinNYR G 53 36 13 4 2.07 1622 106 1516 0.935 6 3070:32 2 28 20 0.714
I have this dataframe, and I'm trying to create a new column storing the difference in products sold, based on code and date.
For example, this is the starting dataframe:
date code sold
0 20150521 0 47
1 20150521 12 39
2 20150521 16 39
3 20150521 20 38
4 20150521 24 38
5 20150521 28 37
6 20150521 32 36
7 20150521 4 43
8 20150521 8 43
9 20150522 0 47
10 20150522 12 37
11 20150522 16 36
12 20150522 20 36
13 20150522 24 36
14 20150522 28 35
15 20150522 32 31
16 20150522 4 42
17 20150522 8 41
18 20150523 0 50
19 20150523 12 48
20 20150523 16 46
21 20150523 20 46
22 20150523 24 46
23 20150523 28 45
24 20150523 32 42
25 20150523 4 49
26 20150523 8 49
27 20150524 0 39
28 20150524 12 33
29 20150524 16 30
... ... ... ...
150 20150606 32 22
151 20150606 4 34
152 20150606 8 33
153 20150607 0 31
154 20150607 12 30
155 20150607 16 30
156 20150607 20 29
157 20150607 24 28
158 20150607 28 26
159 20150607 32 24
160 20150607 4 30
161 20150607 8 30
162 20150608 0 47
I think this could be a solution...
full_df1 = full_df[full_df.date == '20150609'].reset_index(drop=True)
full_df1['code'] = full_df1['code'].astype(float)
full_df1 = full_df1.sort_values(['code'], ascending=[False])
code date sold
8 32 20150609 33
7 28 20150609 36
6 24 20150609 37
5 20 20150609 39
4 16 20150609 42
3 12 20150609 46
2 8 20150609 49
1 4 20150609 49
0 0 20150609 50
full_df1.set_index('code')['sold'].diff().reset_index()
That gives me this output for the single date 20150609:
code difference
0 32 NaN
1 28 3
2 24 1
3 20 2
4 16 3
5 12 4
6 8 3
7 4 0
8 0 1
Is there a better way to get the same result more pythonically?
I would like to create a new column, difference, and store the data there, ending up with 4 columns: date, code, sold, difference.
This is exactly the kind of thing that pandas' groupby functionality is built for, and I highly recommend reading and working through the pandas groupby documentation.
This code replicates what you are asking for, but for every date.
df = pd.DataFrame({'date':['Mon','Mon','Mon','Tue','Tue','Tue'],'code':[10,21,30,10,21,30], 'sold':[12,13,34,10,15,20]})
df['difference'] = df.groupby('date')['sold'].diff()
df
code date sold difference
0 10 Mon 12 NaN
1 21 Mon 13 1
2 30 Mon 34 21
3 10 Tue 10 NaN
4 21 Tue 15 5
5 30 Tue 20 5
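If you also want each date processed in the same descending-code order as your manual attempt, sort first and the same groupby diff applies; a sketch using the question's column names:
full_df = full_df.sort_values(['date', 'code'], ascending=[True, False])
full_df['difference'] = full_df.groupby('date')['sold'].diff()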