How to append NBA player names with stats in Python?

I am still learning web scraping and could use some help, please. I would like to get NBA team stats for each player by combining three different dataframes: playerNames, playerStats_one, and playerStats_two.
Here is an example of the dataframe that I am looking for:
Name GP MIN PTS ...
Anthony Davis 62 34.4 26.1 ...
Lebron James 67 34.6 25.3 ...
Kyle Kuzma 61 25.0 12.8 ...
... ... ... ...
Here is my code so far:
import pandas as pd
import requests
url = 'https://www.espn.com/nba/team/stats/_/name/lal/season/2020/seasontype/2'
df = pd.read_html(url)
#goal 1 get player names
playerNames = df[0]
#goal 2 get stats
playerStats_one = df[1]
playerStats_two = df[3]
#goal 3 append or concat player stats to player name dataframe
new_df = pd.concat([playerNames, playerStats_one, playerStats_two], ignore_index=True, sort=False)
new_df2 = playerNames.append(playerStats_one, ignore_index=True)
I tried both pd.concat and append, and the output had a bunch of NaN values. Any suggestions would be greatly appreciated. Thanks in advance for any insight you may offer.
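For what it's worth, a minimal sketch of the likely fix (my own, assuming the three ESPN tables list players in the same row order): pd.concat with the default axis=0 stacks the frames on top of each other, which misaligns their columns and produces NaN, while concatenating along axis=1 places them side by side:
import pandas as pd
url = 'https://www.espn.com/nba/team/stats/_/name/lal/season/2020/seasontype/2'
tables = pd.read_html(url)
# axis=1 concatenates column-wise, aligning rows by their shared RangeIndex;
# this assumes tables 0, 1, and 3 describe the same players in the same order
new_df = pd.concat([tables[0], tables[1], tables[3]], axis=1)
print(new_df.head())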

Related

Issue with read_html in pandas

I want to read a table from Wikipedia:
import pandas as pd
caption="Edit section: 2019 inequality-adjusted HDI (IHDI) (2020 report)"
df = pd.read_html('https://en.wikipedia.org/wiki/List_of_countries_by_inequality-adjusted_Human_Development_Index',match=caption)
df
But I got this error:
"ValueError: No tables found matching pattern 'Edit section: 2019 inequality-adjusted HDI (IHDI) (2020 report)'"
This method worked for a table like the one below:
caption = "Average daily maximum and minimum temperatures for selected cities in Minnesota"
df = pd.read_html('https://en.wikipedia.org/wiki/Minnesota', match=caption)
df
But this one confuses me. How can I solve the problem?
You have a couple of problems here. The main one is that no table on the page matches the caption you're searching for: match looks for text contained in the table itself, and "Edit section: …" is only the tooltip of the section's edit link, not table content. Try matching the table's actual caption:
import pandas as pd
import requests
caption = "Table of countries by IHDI"
df = pd.read_html(
    requests.get("https://en.wikipedia.org/wiki/List_of_countries_by_inequality-adjusted_Human_Development_Index").text,
    match=caption,
)
print(df[0].head())
Output:
Rank Country ... 2019 estimates (2020 report)[4][5][6]
Rank Country ... Overall loss (%) Growth since 2010
0 1 Norway ... 6.1 0.021
1 2 Iceland ... 5.8 0.055
2 3 Switzerland ... 6.9 0.015
3 4 Finland ... 5.3 0.040
4 5 Ireland ... 7.3 0.066
[5 rows x 6 columns]
Alternatively, skip match and index the list of tables directly:
import pandas as pd
df = pd.read_html('https://en.wikipedia.org/wiki/List_of_countries_by_inequality-adjusted_Human_Development_Index')
df[2]
Or, if you wish to use the match argument:
import pandas as pd
caption="Table of countries by IHDI"
df = pd.read_html('https://en.wikipedia.org/wiki/List_of_countries_by_inequality-adjusted_Human_Development_Index',match=caption)
df[0]
Both approaches return the same IHDI table.

A warning when trying to change a value in a column in pandas

I have this dataset (the Titanic dataset):
import pandas as pd
import numpy as np
url = 'https://raw.githubusercontent.com/chrisalbon/simulated_datasets/master/titanic.csv'
df = pd.read_csv(url)
And I want to replace, in the column 'Sex', all the values 'male' with NaN. This is the code:
df['Sex'] = df['Sex'].replace('male',np.nan)
df.head(3)
Name PClass Age Sex Survived SexCode
0 Allen, Miss Elisabeth Walton 1st 29.0 female 1 1
1 Allison, Miss Helen Loraine 1st 2.0 female 0 1
2 Allison, Mr Hudson Joshua... 1st 30.0 NaN 0 0
Then I want to roll back and change the NaN values back to 'male'. I tried this:
df['Sex'][df['Sex'].isnull()]='male'
df
But I receive a warning: "A value is trying to be set on a copy of a slice from a DataFrame".
The change was made, but perhaps my logic is bad. Can you suggest a better way to code this?
The recommendation from pandas is to do the assignment with .loc as below, which gets rid of the warning:
df.loc[df['Sex'].isnull(),'Sex']='male'
df.head()

Calculate a count of groupby rows that occur within a rolling window of days in Pandas

I have the following dataframe:
import pandas as pd
#Create DF
d = {'Name': ['Jim', 'Jim', 'Jim', 'Jim', 'Jack', 'Jack'],
     'Date': ['08/01/2021', '27/01/2021', '05/02/2021', '10/02/2021', '26/01/2021', '20/02/2021']}
df = pd.DataFrame(data=d)
df['Date'] = pd.to_datetime(df.Date,format='%d/%m/%Y')
df
I would like to add a column (to this same dataframe) calculating how many rows have occurred in the last 28 days, grouped by Name. Does anyone know the most efficient way to do this over 200,000 rows, with about 1,000 different Names?
The new column values should be 1, 2, 3, 3, 1, 2. Any help would be much appreciated! Thanks!
Set the index of the dataframe to Date, then group by Name and apply a rolling count over a 28-day offset window closed on both ends:
df['count'] = (
    df.set_index('Date')
      .groupby('Name', sort=False)['Name']
      .rolling('28d', closed='both')
      .count()
      .tolist()
)
Name Date count
0 Jim 2021-01-08 1.0
1 Jim 2021-01-27 2.0
2 Jim 2021-02-05 3.0
3 Jim 2021-02-10 3.0
4 Jack 2021-01-26 1.0
5 Jack 2021-02-20 2.0
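One caveat: assigning via tolist() assumes the flattened group-by-group result lines up with the original row order, which holds here only because each name's rows are already contiguous and sorted by date. A sketch that makes the assumption explicit by sorting first (my variation, not the answerer's code; assumes pandas >= 1.0 for ignore_index):
import pandas as pd
# sort so each Name's rows are contiguous and date-ordered; the flattened
# rolling result then lines up with the rows positionally
df = df.sort_values(['Name', 'Date'], ignore_index=True)
df['count'] = (
    df.set_index('Date')
      .groupby('Name', sort=False)['Name']
      .rolling('28d', closed='both')
      .count()
      .to_numpy()
)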

pandas replace null values for a subset of columns

I have a data frame with many columns, say:
df:
name salary age title
John 100 35 eng
Bill 200 NaN adm
Lena NaN 28 NaN
Jane 120 45 eng
I want to replace the null values in salary and age, but not in the other columns. I know I can do something like this:
u = df[['salary', 'age']]
df[['salary', 'age']] = u.fillna(-1)
But this seems clumsy, as it involves copying. Is there a more efficient way to do this?
According to the pandas documentation, fillna accepts a dict mapping column names to fill values:
values = {'salary': -1, 'age': -1}
df.fillna(value=values, inplace=True)
Try this:
subset = ['salary', 'age']
df.loc[:, subset] = df.loc[:, subset].fillna(-1)
It is not so beautiful, but it works:
df.salary.fillna(-1, inplace=True)
df.age.fillna(-1, inplace=True)
df
name salary age title
0 John 100.0 35.0 eng
1 Bill 200.0 -1.0 adm
2 Lena -1.0 28.0 NaN
3 Jane 120.0 45.0 eng
I was hoping fillna() had a subset parameter like drop(); maybe I should open a feature request with pandas. For now, this is the cleanest version in my opinion:
df[["salary", "age"]] = df[["salary", "age"]].fillna(-1)
You can do:
df = df.assign(
    salary=df.salary.fillna(-1),
    age=df.age.fillna(-1),
)
if you want to chain it with other operations.

Expand pandas dataframe based on range in a column

I have a pandas dataframe like this:
Name SICs
Agric 0100-0199
Agric 0910-0919
Agric 2048-2048
Food 2000-2009
Food 2010-2019
Soda 2097-2097
The SICs column gives a range of integer values that match the Name given in the first column (although they're stored as a string).
I need to expand this DataFrame so that it has one row for each integer in the range:
Agric 100
Agric 101
Agric 102
...
Agric 199
Agric 910
Agric 911
...
Agric 919
Agric 2048
Food 2000
...
Is there a particularly good way to do this? I was going to do something like this
ranges = {i: r.split('-') for i, r in enumerate(inds['SICs'])}
ranges_expanded = {}
for r in ranges:
    ranges_expanded[r] = range(int(ranges[r][0]), int(ranges[r][1]) + 1)
but I wonder if there's a better way or perhaps a pandas feature to do this. (Also, I'm not sure this will work, as I don't yet see how to read the ranges_expanded dictionary into a DataFrame.)
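For completeness, a minimal sketch of how such a dict-style expansion could be read into a long DataFrame (my own illustration; inds is assumed to have Name and SICs columns, and the answers below offer tidier routes):
import pandas as pd
# small stand-in for the asker's inds frame
inds = pd.DataFrame({'Name': ['Agric', 'Agric', 'Food'],
                     'SICs': ['0100-0199', '0910-0919', '2000-2009']})
# expand each 'lo-hi' string into one (Name, SIC) row per integer
rows = []
for name, sic_range in zip(inds['Name'], inds['SICs']):
    lo, hi = map(int, sic_range.split('-'))
    rows.extend((name, sic) for sic in range(lo, hi + 1))
long_df = pd.DataFrame(rows, columns=['Name', 'SIC'])
print(long_df.head())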
Quick and dirty but I think this gets you to what you need:
from io import StringIO
import pandas as pd
players=StringIO(u"""Name,SICs
Agric,0100-0199
Agric,0210-0211
Food,2048-2048
Soda,1198-1200""")
df = pd.read_csv(players, sep=",")  # DataFrame.from_csv was removed from pandas; read_csv works directly
df2 = pd.DataFrame(columns=('Name', 'SIC'))
count = 0
for idx, r in df.iterrows():
    data = r['SICs'].split("-")
    for i in range(int(data[0]), int(data[1]) + 1):
        df2.loc[count] = (r['Name'], i)
        count += 1
The neatest way I found (building on from Andy Hayden's answer):
# Extract the min and max of each SIC range
df = df.set_index("Name")
df = df['SICs'].str.extract(r"(\d+)-(\d+)")
df.columns = ['min', 'max']
df = df.astype('int')
# Enumerate each range into a wide table (assumes numpy is imported as np)
enumerated_codes = [np.arange(row['min'], row['max'] + 1) for _, row in df.iterrows()]
df = pd.DataFrame.from_records(data=enumerated_codes, index=df.index)
# Convert from wide to long table
df = df.stack().reset_index(1, drop=True)
It is, however, slow due to the row-wise iterrows loop. A vectorised solution would be amazing, but I can't find one.
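In that spirit, a mostly-vectorised sketch (my own, assuming the original frame still has its Name and SICs columns): extract the endpoints, repeat each Name by the length of its range, and lay the codes out flat:
import numpy as np
import pandas as pd
ends = df['SICs'].str.extract(r'(\d+)-(\d+)').astype(int)
lengths = ends[1] - ends[0] + 1
# repeat each Name once per code in its range, then concatenate the ranges
long_df = pd.DataFrame({
    'Name': df['Name'].repeat(lengths).to_numpy(),
    'SIC': np.concatenate([np.arange(lo, hi + 1)
                           for lo, hi in zip(ends[0], ends[1])]),
})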
You can use str.extract to get strings from a regular expression:
In [11]: df
Out[11]:
Name SICs
0 Agri 0100-0199
1 Agri 0910-0919
2 Food 2000-2009
First take out the name as that's the thing we want to keep:
In [12]: df1 = df.set_index("Name")
In [13]: df1
Out[13]:
SICs
Name
Agri 0100-0199
Agri 0910-0919
Food 2000-2009
In [14]: df1['SICs'].str.extract(r"(\d+)-(\d+)")
Out[14]:
0 1
Name
Agri 0100 0199
Agri 0910 0919
Food 2000 2009
Then flatten this with stack (which adds a MultiIndex):
In [15]: df1['SICs'].str.extract(r"(\d+)-(\d+)").stack()
Out[15]:
Name
Agri 0 0100
1 0199
0 0910
1 0919
Food 0 2000
1 2009
dtype: object
If you must you can remove the 0-1 level of the MultiIndex:
In [16]: df1['SICs'].str.extract(r"(\d+)-(\d+)").stack().reset_index(1, drop=True)
Out[16]:
Name
Agri 0100
Agri 0199
Agri 0910
Agri 0919
Food 2000
Food 2009
dtype: object
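Note that this stacked Series still holds only the endpoints of each range; to get the full expansion you would still need to enumerate each pair. A short follow-up sketch (my addition, building on the extracted endpoints):
ends = df1['SICs'].str.extract(r"(\d+)-(\d+)").astype(int)
# enumerate every integer between each pair of endpoints,
# pairing each Name (the index) with its [lo, hi] row
long_df = pd.DataFrame(
    [(name, code)
     for name, (lo, hi) in zip(ends.index, ends.to_numpy())
     for code in range(lo, hi + 1)],
    columns=['Name', 'SIC'],
)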
