Pandas merge dataframes with multiple columns

I am trying to merge two dataframes and cannot figure out how, as it is not straightforward.
One data frame has match results for over 25000 games and looks like this.
The second one has team performance metrics but only for around 1500 games.
As I am not allowed to post pictures yet, here are the column names of interest:
df_match['date', 'home_team_api_id', 'away_team_api_id']
df_team_attributes['date', 'team_api_id']
Both data frames have additional columns with results or performance metrics.
To merge correctly, I need to match on date and also check whether 'team_api_id' equals either 'home_team_api_id' or 'away_team_api_id'.
This is what I have tried until now:
df_team_performance = pd.merge(df_team_attributes, df_match,
how = 'left',
left_on = ['date', 'team_api_id', 'team_api_id'],
right_on = ['date', 'home_team_api_id', 'home_team_api_id'])
I have also tried with only 2 columns, but without success.
What I would like to get is a new data frame with only the rows of the df_team_attributes and columns from both data frames.
Thank you in advance!
Added at Correlien's request:
output of print(df_match[['date', 'home_team_api_id', 'away_team_api_id', 'win_home', 'win_away', 'draw', 'win']].head(10).to_dict())
{'date': {0: '2008-08-17 00:00:00', 1: '2008-08-16 00:00:00', 2: '2008-08-16 00:00:00', 3: '2008-08-17 00:00:00', 4: '2008-08-16 00:00:00', 5: '2008-09-24 00:00:00', 6: '2008-08-16 00:00:00', 7: '2008-08-16 00:00:00', 8: '2008-08-16 00:00:00', 9: '2008-11-01 00:00:00'}, 'home_team_api_id': {0: 9987, 1: 10000, 2: 9984, 3: 9991, 4: 7947, 5: 8203, 6: 9999, 7: 4049, 8: 10001, 9: 8342}, 'away_team_api_id': {0: 9993, 1: 9994, 2: 8635, 3: 9998, 4: 9985, 5: 8342, 6: 8571, 7: 9996, 8: 9986, 9: 8571}, 'win_home': {0: 0, 1: 0, 2: 0, 3: 1, 4: 0, 5: 0, 6: 0, 7: 0, 8: 1, 9: 1}, 'win_away': {0: 0, 1: 0, 2: 1, 3: 0, 4: 1, 5: 0, 6: 0, 7: 1, 8: 0, 9: 0}, 'draw': {0: 1, 1: 1, 2: 0, 3: 0, 4: 0, 5: 1, 6: 1, 7: 0, 8: 0, 9: 0}, 'win': {0: 0, 1: 0, 2: 1, 3: 1, 4: 1, 5: 0, 6: 0, 7: 1, 8: 1, 9: 1}}
output for print(df_team_attributes[['date', 'team_api_id', 'buildUpPlaySpeed', 'buildUpPlaySpeedClass']].head(10).to_dict())
{'date': {0: '2010-02-22 00:00:00', 1: '2014-09-19 00:00:00', 2: '2015-09-10 00:00:00', 3: '2010-02-22 00:00:00', 4: '2011-02-22 00:00:00', 5: '2012-02-22 00:00:00', 6: '2013-09-20 00:00:00', 7: '2014-09-19 00:00:00', 8: '2015-09-10 00:00:00', 9: '2010-02-22 00:00:00'}, 'team_api_id': {0: 9930, 1: 9930, 2: 9930, 3: 8485, 4: 8485, 5: 8485, 6: 8485, 7: 8485, 8: 8485, 9: 8576}, 'buildUpPlaySpeed': {0: 60, 1: 52, 2: 47, 3: 70, 4: 47, 5: 58, 6: 62, 7: 58, 8: 59, 9: 60}, 'buildUpPlaySpeedClass': {0: 'Balanced', 1: 'Balanced', 2: 'Balanced', 3: 'Fast', 4: 'Balanced', 5: 'Balanced', 6: 'Balanced', 7: 'Balanced', 8: 'Balanced', 9: 'Balanced'}}

Have you tried casting your date columns to datetime before attempting the merge? The following worked for me, based on the example that you provided:
# Casting to date
df_match["date"] = pd.to_datetime(df_match["date"])
df_team_attributes["date"] = pd.to_datetime(df_team_attributes["date"])
# Merging on the date field alone
df_team_performance = pd.merge(df_team_attributes, df_match,
how = 'left',
on = 'date')
# Filtering out the required rows
result = df_team_performance.query("(team_api_id == home_team_api_id) | (team_api_id == away_team_api_id)")
Please let me know if my understanding of your question is correct.
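Merging on date alone pairs every team row with every match played on that date before the filter runs, which can get large on 25000 games. A possible alternative, sketched here on small made-up frames (the values and the helper column "side" are my own assumptions, only the column names come from the question), is to stack the home and away ids into one team_api_id column first and then do an ordinary left merge on both keys:

```python
import pandas as pd

# Small made-up frames that reuse the column names from the question.
df_match = pd.DataFrame({
    "date": ["2010-02-22", "2010-02-22", "2011-02-22"],
    "home_team_api_id": [9930, 8485, 8576],
    "away_team_api_id": [8576, 9999, 9930],
    "win_home": [1, 0, 0],
})
df_team_attributes = pd.DataFrame({
    "date": ["2010-02-22", "2010-02-22", "2011-02-22"],
    "team_api_id": [9930, 8485, 8485],
    "buildUpPlaySpeed": [60, 70, 47],
})

# Both date columns must share a dtype for the merge keys to match.
df_match["date"] = pd.to_datetime(df_match["date"])
df_team_attributes["date"] = pd.to_datetime(df_team_attributes["date"])

# One row per (match, team): each match appears once for the home side
# and once for the away side. "side" is a hypothetical helper column.
home = df_match.assign(team_api_id=df_match["home_team_api_id"], side="home")
away = df_match.assign(team_api_id=df_match["away_team_api_id"], side="away")
match_long = pd.concat([home, away], ignore_index=True)

# Keep every row of df_team_attributes and attach match columns where found.
df_team_performance = df_team_attributes.merge(
    match_long, how="left", on=["date", "team_api_id"]
)
print(df_team_performance)
```

Rows of df_team_attributes with no match on that exact date keep NaN in the match columns, which is what how='left' is meant to do.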

Related

df.apply() raises IndexingError: Unalignable boolean Series provided as indexer

I am performing df.apply() on a dataframe and I am getting the following error:
IndexingError: ('Unalignable boolean Series provided as indexer (index of the boolean Series and of the indexed object do not match).', 'occurred at index 4061')
This error comes from the following line of my df (at index 4061)
The relevant code is:
i = pd.DataFrame()
i = df1.apply(
lambda row: i.append(
df.loc[
(df1["ID"] == row["ID"])
& (df1["Date"] >= (row["Date"] + timedelta(-5)))
& (df1["Date"] <= (row["Date"] + timedelta(20)))
],
ignore_index=True,
inplace=True,
)
if row["Flag"] == 1
else None,
axis=1,
)
And an example of the first 5 rows of the df on which I am using the function:
{'ID': {1: 'A US Equity',
2: 'A US Equity',
3: 'A US Equity',
4: 'A US Equity',
5: 'A US Equity'},
'Date': {1: Timestamp('2020-12-22 00:00:00'),
2: Timestamp('2020-12-23 00:00:00'),
3: Timestamp('2020-12-24 00:00:00'),
4: Timestamp('2020-12-28 00:00:00'),
5: Timestamp('2020-12-29 00:00:00')},
'PX_Last': {1: 117.37, 2: 117.3, 3: 117.31, 4: 117.83, 5: 117.23},
'Short_Int': {1: 0.0, 2: 0.0, 3: 0.0, 4: 0.0, 5: 0.0},
'Total_Call_Volume': {1: 187.0, 2: 353.0, 3: 141.0, 4: 467.0, 5: 329.0},
'Total_Put_Volume': {1: 54.0, 2: 30.0, 3: 218.0, 4: 282.0, 5: 173.0},
'Put_OI': {1: 13354.0, 2: 13350.0, 3: 13522.0, 4: 13678.0, 5: 13785.0},
'Call_OI': {1: 8923.0, 2: 8943.0, 3: 8973.0, 4: 9075.0, 5: 9040.0},
'pct_chng': {1: -0.34810663949736975,
2: -0.059640453267451043,
3: 0.008525149190119485,
4: 0.4432699684596253,
5: -0.5092081812781091},
'Short_Int_Category': {1: nan, 2: nan, 3: nan, 4: nan, 5: nan},
'Put/Call': {1: 0.2887700534759358,
2: 0.08498583569405099,
3: 1.5460992907801419,
4: 0.6038543897216274,
5: 0.5258358662613982},
'10% + Pop Flag': {1: 0, 2: 0, 3: 0, 4: 0, 5: 0},
'10%-20% Pop Flag': {1: 0, 2: 0, 3: 0, 4: 0, 5: 0},
'20%-30% Pop Flag': {1: 0, 2: 0, 3: 0, 4: 0, 5: 0},
'30% + Pop Flag': {1: 0, 2: 0, 3: 0, 4: 0, 5: 0},
'Flag': {1: 0, 2: 0, 3: 0, 4: 0, 5: 0},
'Time_to_pop': {1: nan, 2: nan, 3: nan, 4: nan, 5: nan}}
The row at index 4061 that is causing the error is:
ID ADI US Equity
Date 2021-02-24 00:00:00
PX_Last 161.76
Short_Int 15.1847
Total_Call_Volume 52502
Total_Put_Volume 1929
Put_OI 32219
Call_OI 45557
pct_chng 2.57451
Short_Int_Category 15-20
Put/Call 0.0367415
10% + Pop Flag 0
10%-20% Pop Flag 0
20%-30% Pop Flag 0
30% + Pop Flag 0
Flag 1
Time_to_pop NaN
Name: 4061, dtype: object
How do I perform the function without getting the error mentioned above?
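One likely cause, assuming df and df1 are different frames with different indexes: the boolean mask is built from df1 but applied through df.loc, so pandas cannot align the mask with the frame being indexed. Note also that DataFrame.append never accepted inplace and was removed in pandas 2.0. A minimal sketch of the same window lookup, on made-up data and with a hypothetical helper window_rows, that builds the mask from the frame it indexes and collects the pieces with pd.concat:

```python
from datetime import timedelta

import pandas as pd

# Made-up data that mimics the ID / Date / Flag columns from the question.
df1 = pd.DataFrame({
    "ID": ["A", "A", "A", "B", "B"],
    "Date": pd.to_datetime(
        ["2021-01-07", "2021-01-10", "2021-03-01", "2021-01-05", "2021-01-06"]
    ),
    "Flag": [0, 1, 0, 0, 1],
})


def window_rows(frame, row, before=5, after=20):
    """Rows of `frame` with the same ID within [-before, +after] days of row['Date']."""
    # The mask is built from `frame` itself, so its index always aligns.
    mask = (
        (frame["ID"] == row["ID"])
        & (frame["Date"] >= row["Date"] - timedelta(days=before))
        & (frame["Date"] <= row["Date"] + timedelta(days=after))
    )
    return frame.loc[mask]


# Only flagged rows open a window; gather the pieces and concatenate once.
flagged = df1[df1["Flag"] == 1]
windows = [window_rows(df1, row) for _, row in flagged.iterrows()]
result = pd.concat(windows, ignore_index=True) if windows else df1.iloc[0:0]
print(result)
```

If the lookup really is meant to run against a second frame, pass that frame as the first argument so the mask and the indexed object are the same.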

In python pandas, count the integers in a particular column and also count all the elements in particular column

I have a huge df with multiple columns, but I am only interested in one specific column.
In the data below, I would like to read only the column 'Type 1':
import numpy as np
import pandas as pd
data = {'Type 1': {0: 1, 1: 3, 2: 5, 3: 'HH', 4: 9, 5: 11, 6: 13, 7: 15, 8: 17},
'Type 2': {0: 'AA',
1: 'BB',
2: 'np.NaN',
3: '55',
4: '3.14',
5: '-96',
6: 'String',
7: 'FFFFFF',
8: 'FEEE'},
'Type 3': {0: 0, 1: 0, 2: 0, 3: 0, 4: 0, 5: 0, 6: 0, 7: 0, 8: 0},
'Type 4': {0: '23',
1: 'fefe',
2: 'abcd',
3: 'dddd',
4: 'dad',
5: 'cfe',
6: 'cf42',
7: '321',
8: '0'},
'Type 5': {0: -120,
1: -120,
2: -120,
3: -120,
4: -120,
5: -120,
6: -120,
7: -120,
8: -120}}
df = pd.DataFrame(data)
df
int_count = df['Type 1'].count(0,numeric_only = True) # should count only cells that contain integers and return 8
total_count = df['Type 1'].count(0,numeric_only = False) # should count all the cells and return 9
I want to count only the numeric values in a particular column, e.g.:
df['Type 1'].count(0,numeric_only = True) should return 8 (excluding the string 'HH' in the 'Type 1' column)
df['Type 1'].count(0,numeric_only = False) should return 9 (the total number of cells in that column)
but df['Type 1'].count(0,numeric_only = True/False) does not work as I expect.

I would suggest the below:
int_count = len(df.loc[df['Type 1'].astype(str).str.isnumeric()])
total_count = len(df)
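One caveat with str.isnumeric() is that it only accepts strings made of digit characters, so values such as '-96' or '3.14' are not counted. If those should count as numeric, a possible alternative (a sketch, assuming "numeric" means "parseable as a number") is pd.to_numeric with errors="coerce", which turns everything unparseable into NaN:

```python
import pandas as pd

# The 'Type 1' column from the question, plus a few strings from 'Type 2'.
type1 = pd.Series([1, 3, 5, "HH", 9, 11, 13, 15, 17], name="Type 1")
type2 = pd.Series(["AA", "55", "3.14", "-96"], name="Type 2")

# Unparseable cells become NaN, so notna() marks the numeric ones.
numeric_count = pd.to_numeric(type1, errors="coerce").notna().sum()
total_count = len(type1)
print(numeric_count, total_count)

# Difference between the two approaches on signed and decimal strings.
isnumeric_count = type2.str.isnumeric().sum()
to_numeric_count = pd.to_numeric(type2, errors="coerce").notna().sum()
print(isnumeric_count, to_numeric_count)
```

Note that this counts floats and numeric strings as well, not only integers, so it answers a slightly broader question than the one asked.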

KeyError: 0 when trying to plot multiple histograms

I am having trouble plotting multiple histograms. I get the error message:
pandas\hashtable.pyx in pandas.hashtable.PyObjectHashTable.get_item (pandas\hashtable.c:12368)()
pandas\hashtable.pyx in pandas.hashtable.PyObjectHashTable.get_item (pandas\hashtable.c:12322)()
KeyError: 0
This is the code I wrote:
xaxes = ['price','bedrooms','sqft_living','sqft_lot','floors','waterfront',
'view','condition','grade','sqft_above','sqft_basement','yr_built',
'yr_renovated','zipcode','lat','long','sqft_living15','sqft_loft15']
a,b = plt.subplots(4,5)
b = b.ravel()
for idx,ax in enumerate(b):
ax.hist(file[idx])
ax.set_title(titles[idx])
ax.set_xlabel(xaxes[id])
plt.tight_layout()
Here is a sample of my data:
{'bathrooms': {0: 1.0,
1: 2.25,
2: 1.0,
3: 3.0,
4: 2.0,
5: 4.5,
6: 2.25,
7: 1.5,
8: 1.0,
9: 2.5},
'bedrooms': {0: 3, 1: 3, 2: 2, 3: 4, 4: 3, 5: 4, 6: 3, 7: 3, 8: 3, 9: 3},
'condition': {0: 3, 1: 3, 2: 3, 3: 5, 4: 3, 5: 3, 6: 3, 7: 3, 8: 3, 9: 3},
'date': {0: '20141013T000000',
1: '20141209T000000',
2: '20150225T000000',
3: '20141209T000000',
4: '20150218T000000',
5: '20140512T000000',
6: '20140627T000000',
7: '20150115T000000',
8: '20150415T000000',
9: '20150312T000000'},
'floors': {0: 1.0,
1: 2.0,
2: 1.0,
3: 1.0,
4: 1.0,
5: 1.0,
6: 2.0,
7: 1.0,
8: 1.0,
9: 2.0},
'grade': {0: 7, 1: 7, 2: 6, 3: 7, 4: 8, 5: 11, 6: 7, 7: 7, 8: 7, 9: 7},
'id': {0: 7129300520,
1: 6414100192,
2: 5631500400,
3: 2487200875,
4: 1954400510,
5: 7237550310,
6: 1321400060,
7: 2008000270,
8: 2414600126,
9: 3793500160},
'lat': {0: 47.511200000000002,
1: 47.721000000000004,
2: 47.737900000000003,
3: 47.520800000000001,
4: 47.616799999999998,
5: 47.656100000000002,
6: 47.309699999999999,
7: 47.409500000000001,
8: 47.512300000000003,
9: 47.368400000000001},
'long': {0: -122.25700000000001,
1: -122.319,
2: -122.23299999999999,
3: -122.39299999999999,
4: -122.045,
5: -122.005,
6: -122.32700000000001,
7: -122.315,
8: -122.337,
9: -122.03100000000001},
'price': {0: 221900.0,
1: 538000.0,
2: 180000.0,
3: 604000.0,
4: 510000.0,
5: 1230000.0,
6: 257500.0,
7: 291850.0,
8: 229500.0,
9: 323000.0},
'sqft_above': {0: 1180,
1: 2170,
2: 770,
3: 1050,
4: 1680,
5: 3890,
6: 1715,
7: 1060,
8: 1050,
9: 1890},
'sqft_basement': {0: 0,
1: 400,
2: 0,
3: 910,
4: 0,
5: 1530,
6: 0,
7: 0,
8: 730,
9: 0},
'sqft_living': {0: 1180,
1: 2570,
2: 770,
3: 1960,
4: 1680,
5: 5420,
6: 1715,
7: 1060,
8: 1780,
9: 1890},
'sqft_living15': {0: 1340,
1: 1690,
2: 2720,
3: 1360,
4: 1800,
5: 4760,
6: 2238,
7: 1650,
8: 1780,
9: 2390},
'sqft_lot': {0: 5650,
1: 7242,
2: 10000,
3: 5000,
4: 8080,
5: 101930,
6: 6819,
7: 9711,
8: 7470,
9: 6560},
'sqft_lot15': {0: 5650,
1: 7639,
2: 8062,
3: 5000,
4: 7503,
5: 101930,
6: 6819,
7: 9711,
8: 8113,
9: 7570},
'view': {0: 0, 1: 0, 2: 0, 3: 0, 4: 0, 5: 0, 6: 0, 7: 0, 8: 0, 9: 0},
'waterfront': {0: 0, 1: 0, 2: 0, 3: 0, 4: 0, 5: 0, 6: 0, 7: 0, 8: 0, 9: 0},
'yr_built': {0: 1955,
1: 1951,
2: 1933,
3: 1965,
4: 1987,
5: 2001,
6: 1995,
7: 1963,
8: 1960,
9: 2003},
'yr_renovated': {0: 0,
1: 1991,
2: 0,
3: 0,
4: 0,
5: 0,
6: 0,
7: 0,
8: 0,
9: 0},
'zipcode': {0: 98178,
1: 98125,
2: 98028,
3: 98136,
4: 98074,
5: 98053,
6: 98003,
7: 98198,
8: 98146,
9: 98038}}

If I understand you correctly and the variable file holds your data as a pandas DataFrame, then you have simply run into an indexing problem.
file[idx] asks the DataFrame for a column labelled idx (here the integer 0), and since your columns are named 'price', 'bedrooms' and so on, that raises KeyError: 0. Select the column by name with file[xaxes[idx]], or by position with file.iloc[:, idx]. Note also that xaxes[id] in your loop should be xaxes[idx].
Check this link for more details about indexing and selecting in pandas.
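A minimal sketch of the loop written over column names instead of integer positions. It uses only three of the columns with a few of the sample values, a 2x2 grid, and the Agg backend so it runs without a display; those choices are assumptions for illustration, not part of the original code:

```python
import matplotlib

matplotlib.use("Agg")  # non-interactive backend, so no window is needed

import matplotlib.pyplot as plt
import pandas as pd

# A few rows from the sample data, kept under the question's name `file`.
file = pd.DataFrame({
    "price": [221900.0, 538000.0, 180000.0, 604000.0],
    "bedrooms": [3, 3, 2, 4],
    "sqft_living": [1180, 2570, 770, 1960],
})
columns = ["price", "bedrooms", "sqft_living"]

fig, axes = plt.subplots(2, 2)
axes = axes.ravel()

# Pair each axis with a column name; zip stops at the shorter sequence.
for ax, name in zip(axes, columns):
    ax.hist(file[name])  # select the column by its label
    ax.set_title(name)
    ax.set_xlabel(name)

# Hide the axes that have no column to show.
for ax in axes[len(columns):]:
    ax.set_visible(False)

fig.tight_layout()
```

Pairing axes with names through zip also avoids running past the end of the list when the grid has more cells than there are columns, as with 18 names on a 4x5 grid.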

Animate a Plotly map with a sliding date bar

I'm struggling to turn this piece of code I wrote - which constructs a static heatmap - into an animated version with a date slider.
import pandas as pd
import plotly.graph_objects as go
...
fig = go.Figure(go.Densitymapbox(lat=df_heat['lat'], lon=df_heat['lon'], z=df_heat['count'],
radius=10,))
fig.update_layout(mapbox_style="carto-positron", mapbox_zoom=10, mapbox_center = {"lat": 40.7831, "lon": -73.9712},)
fig.update_layout(margin={"r":0,"t":0,"l":0,"b":0})
fig.show()
The above code successfully converts a Pandas DataFrame df_heat, which looks like the following, into a Plotly heatmap.
lat lon count
0 -62.884215 39.440236 1
1 -62.834226 39.408072 1
2 -62.811707 39.380462 1
3 -62.744564 39.489112 1
...
Static heatmap output:
df_heat is itself just an aggregated view of the following DataFrame, which also includes a date.
date lat lon count
0 2018-07-29 40.691828 -73.944609 1
1 2018-07-29 40.693601 -73.945092 1
2 2018-07-29 40.696132 -73.945178 1
3 2018-07-29 40.692726 -73.945532 1
My question is, how can I convert this DataFrame with dates into an animated plotly map, such as the ones here, here, and here, which feature a date slider as a filter.
Dummy data for testing:
df = pd.DataFrame({'datetime': {0: '2018-09-29 00:00:00', 1: '2018-07-28 00:00:00', 2: '2018-07-29 00:00:00', 3: '2018-07-29 00:00:00', 4: '2018-08-01 00:00:00', 5: '2018-08-01 00:00:00', 6: '2018-08-01 00:00:00', 7: '2018-08-05 00:00:00', 8: '2018-09-06 00:00:00', 9: '2018-09-07 00:00:00', 10: '2018-09-07 00:00:00', 11: '2018-09-08 00:00:00', 12: '2018-09-08 00:00:00', 13: '2018-09-08 00:00:00', 14: '2018-10-08 00:00:00', 15: '2018-10-10 00:00:00', 16: '2018-10-10 00:00:00', 17: '2018-10-11 00:00:00', 18: '2018-10-11 00:00:00', 19: '2018-10-11 00:00:00'},
'lat': {0: 40.6908284, 1: 40.693601, 2: 40.6951317, 3: 40.6967261, 4: 40.697593, 5: 40.6987141, 6: 40.7186497, 7: 40.7187772, 8: 40.7196151, 9: 40.7196865, 10: 40.7187408, 11: 40.7189716, 12: 40.7214273, 13: 40.7226571, 14: 40.7236955, 15: 40.7247207, 16: 40.7221074, 17: 40.7445859, 18: 40.7476252, 19: 40.7476451},
'lon': {0: -73.9336094, 1: -73.9350917, 2: -73.9351778, 3: -73.9355315, 4: -73.9366737, 5: -73.9393797, 6: -74.0011939, 7: -74.0010918, 8: -73.9887851, 9: -74.0035125, 10: -74.0250842, 11: -74.0299202, 12: -74.029886, 13: -74.027542, 14: -74.0290157, 15: -74.0291541, 16: -74.0220728, 17: -73.9442636, 18: -73.9641326, 19: -73.9533039},
'count': {0: 1, 1: 2, 2: 5, 3: 1, 4: 6, 5: 1, 6: 3, 7: 2, 8: 1, 9: 7, 10: 3, 11: 3, 12: 1, 13: 2, 14: 1, 15: 1, 16: 2, 17: 1, 18: 1, 19: 1}})

I made a few changes and added the timeline animation to your code.
Similarly to Teoretic's solution, I also used plotly.express, which makes things shorter.
Live example:
http://www.erangrinberg.de/plotly/map-animation.html
import pandas as pd
import plotly.express as px
df = pd.DataFrame({'Date': {0: '2018-09-29 00:00:00', 1: '2018-07-28 00:00:00', 2: '2018-07-29 00:00:00', 3: '2018-07-29 00:00:00', 4: '2018-08-01 00:00:00', 5: '2018-08-01 00:00:00', 6: '2018-08-01 00:00:00', 7: '2018-08-05 00:00:00', 8: '2018-09-06 00:00:00', 9: '2018-09-07 00:00:00', 10: '2018-09-07 00:00:00', 11: '2018-09-08 00:00:00', 12: '2018-09-08 00:00:00', 13: '2018-09-08 00:00:00', 14: '2018-10-08 00:00:00', 15: '2018-10-10 00:00:00', 16: '2018-10-10 00:00:00', 17: '2018-10-11 00:00:00', 18: '2018-10-11 00:00:00', 19: '2018-10-11 00:00:00'},
'lat': {0: 40.6908284, 1: 40.693601, 2: 40.6951317, 3: 40.6967261, 4: 40.697593, 5: 40.6987141, 6: 40.7186497, 7: 40.7187772, 8: 40.7196151, 9: 40.7196865, 10: 40.7187408, 11: 40.7189716, 12: 40.7214273, 13: 40.7226571, 14: 40.7236955, 15: 40.7247207, 16: 40.7221074, 17: 40.7445859, 18: 40.7476252, 19: 40.7476451},
'lon': {0: -73.9336094, 1: -73.9350917, 2: -73.9351778, 3: -73.9355315, 4: -73.9366737, 5: -73.9393797, 6: -74.0011939, 7: -74.0010918, 8: -73.9887851, 9: -74.0035125, 10: -74.0250842, 11: -74.0299202, 12: -74.029886, 13: -74.027542, 14: -74.0290157, 15: -74.0291541, 16: -74.0220728, 17: -73.9442636, 18: -73.9641326, 19: -73.9533039},
'count': {0: 1, 1: 2, 2: 5, 3: 1, 4: 6, 5: 1, 6: 3, 7: 2, 8: 1, 9: 7, 10: 3, 11: 3, 12: 1, 13: 2, 14: 1, 15: 1, 16: 2, 17: 1, 18: 1, 19: 1}})
fig = px.density_mapbox(df, lat=df['lat'],
lon=df['lon'],
z=df['count'],
radius=10,
animation_frame="Date"
)
fig.update_layout(mapbox_style="carto-positron", mapbox_zoom=10, mapbox_center = {"lat": 40.7831, "lon": -73.9712},)
fig.update_layout(margin={"r":0,"t":0,"l":0,"b":0})
fig.layout.updatemenus[0].buttons[0].args[1]["frame"]["duration"] = 600
fig.layout.updatemenus[0].buttons[0].args[1]["transition"]["duration"] = 600
fig.layout.coloraxis.showscale = True
fig.layout.sliders[0].pad.t = 10
fig.layout.updatemenus[0].pad.t= 10
fig.show()
It would be nice to see the final result, or perhaps you could share the dataset source.
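One thing to watch with the dummy data: the first row is dated 2018-09-29, and as far as I know Plotly Express creates animation frames in the order the values first appear in the data, so the slider may not run chronologically. A small preparation step, sketched below on the first four rows only (the column name "frame" is my own choice), sorts by date and builds a short label for the slider:

```python
import pandas as pd

# First four rows of the dummy data from the answer above.
df = pd.DataFrame({
    "Date": ["2018-09-29 00:00:00", "2018-07-28 00:00:00",
             "2018-07-29 00:00:00", "2018-07-29 00:00:00"],
    "lat": [40.6908284, 40.693601, 40.6951317, 40.6967261],
    "lon": [-73.9336094, -73.9350917, -73.9351778, -73.9355315],
    "count": [1, 2, 5, 1],
})

# Parse the strings, sort chronologically, and keep a compact day label.
df["Date"] = pd.to_datetime(df["Date"])
df = df.sort_values("Date").reset_index(drop=True)
df["frame"] = df["Date"].dt.strftime("%Y-%m-%d")  # hypothetical helper column

print(df[["frame", "count"]])
# Then pass animation_frame="frame" to px.density_mapbox instead of "Date".
```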

You can play with scatter_geo plot from plotly.express to get an interactive graph.
It doesn't produce the heat map, but it can make dots like on your graph.
Sample code with your dummy data:
import pandas as pd
import plotly.express as px
df = pd.DataFrame({'datetime': {0: '2018-09-29 00:00:00', 1: '2018-07-28 00:00:00', 2: '2018-07-29 00:00:00', 3: '2018-07-29 00:00:00', 4: '2018-08-01 00:00:00', 5: '2018-08-01 00:00:00', 6: '2018-08-01 00:00:00', 7: '2018-08-05 00:00:00', 8: '2018-09-06 00:00:00', 9: '2018-09-07 00:00:00', 10: '2018-09-07 00:00:00', 11: '2018-09-08 00:00:00', 12: '2018-09-08 00:00:00', 13: '2018-09-08 00:00:00', 14: '2018-10-08 00:00:00', 15: '2018-10-10 00:00:00', 16: '2018-10-10 00:00:00', 17: '2018-10-11 00:00:00', 18: '2018-10-11 00:00:00', 19: '2018-10-11 00:00:00'},
'lat': {0: 40.6908284, 1: 40.693601, 2: 40.6951317, 3: 40.6967261, 4: 40.697593, 5: 40.6987141, 6: 40.7186497, 7: 40.7187772, 8: 40.7196151, 9: 40.7196865, 10: 40.7187408, 11: 40.7189716, 12: 40.7214273, 13: 40.7226571, 14: 40.7236955, 15: 40.7247207, 16: 40.7221074, 17: 40.7445859, 18: 40.7476252, 19: 40.7476451},
'lon': {0: -73.9336094, 1: -73.9350917, 2: -73.9351778, 3: -73.9355315, 4: -73.9366737, 5: -73.9393797, 6: -74.0011939, 7: -74.0010918, 8: -73.9887851, 9: -74.0035125, 10: -74.0250842, 11: -74.0299202, 12: -74.029886, 13: -74.027542, 14: -74.0290157, 15: -74.0291541, 16: -74.0220728, 17: -73.9442636, 18: -73.9641326, 19: -73.9533039},
'count': {0: 1, 1: 2, 2: 5, 3: 1, 4: 6, 5: 1, 6: 3, 7: 2, 8: 1, 9: 7, 10: 3, 11: 3, 12: 1, 13: 2, 14: 1, 15: 1, 16: 2, 17: 1, 18: 1, 19: 1}})
fig = px.scatter_geo(df,
lat='lat',
lon='lon',
scope='usa',
color="count",
size='count',
projection="albers usa",
animation_frame="datetime",
title='Your title')
fig.update(layout_coloraxis_showscale=False)
fig.show()
You can also check this Kaggle notebook for more usage examples of this graph.

How can I scrape data from an HTML table into a Python list/dict?

I'm trying to import data from Baseball Prospectus into a Python table / dictionary (which would be better?).
Below is what I have, based on following along to Automate The Boring Stuff with Python.
I get that my method isn't properly using these functions, but I can't figure out what tools I should be using.
import requests
import webbrowser
import bs4
res = requests.get('https://legacy.baseballprospectus.com/card/70917/trea-turner')
res.raise_for_status()
webpage = bs4.BeautifulSoup(res.text)
table = webpage.select('newstat_career_log_datagrid')
list = []
for item in table:
list.append(item)
print(list)

Use a pandas DataFrame to fetch the MLB statistics table first, and then convert the DataFrame into a dictionary object. If you don't have pandas installed, you can install it with a single command.
pip install pandas
Then use the below code.
import pandas as pd
df=pd.read_html('https://legacy.baseballprospectus.com/card/70917/trea-turner')
data_dict = df[5].to_dict()
print(data_dict)
Output:
{'PA': {0: 44, 1: 324, 2: 447, 3: 740, 4: 15, 5: 1570}, '2B': {0: 1, 1: 14, 2: 24, 3: 27, 4: 1, 5: 67}, 'TEAM': {0: 'WAS', 1: 'WAS', 2: 'WAS', 3: 'WAS', 4: 'WAS', 5: 'Career'}, 'SB': {0: 2, 1: 33, 2: 46, 3: 43, 4: 4, 5: 128}, 'G': {0: 27, 1: 73, 2: 98, 3: 162, 4: 4, 5: 364}, 'HR': {0: 1, 1: 13, 2: 11, 3: 19, 4: 2, 5: 46}, 'FRAA': {0: 0.5, 1: -3.2, 2: 0.2, 3: 7.1, 4: -0.1, 5: 4.5}, 'BWARP': {0: 0.1, 1: 2.4, 2: 2.7, 3: 5.0, 4: 0.1, 5: 10.4}, 'CS': {0: 2, 1: 6, 2: 8, 3: 9, 4: 0, 5: 25}, '3B': {0: 0, 1: 8, 2: 6, 3: 6, 4: 0, 5: 20}, 'H': {0: 9, 1: 105, 2: 117, 3: 180, 4: 5, 5: 416}, 'AGE': {0: '22', 1: '23', 2: '24', 3: '25', 4: '26', 5: 'Career'}, 'OBP': {0: 0.295, 1: 0.37, 2: 0.33799999999999997, 3: 0.344, 4: 0.4, 5: 0.34700000000000003}, 'AVG': {0: 0.225, 1: 0.342, 2: 0.284, 3: 0.271, 4: 0.35700000000000004, 5: 0.289}, 'DRC+': {0: 77, 1: 128, 2: 99, 3: 107, 4: 103, 5: 108}, 'SO': {0: 12, 1: 59, 2: 80, 3: 132, 4: 5, 5: 288}, 'YEAR': {0: '2015', 1: '2016', 2: '2017', 3: '2018', 4: '2019', 5: 'Career'}, 'SLG': {0: 0.325, 1: 0.5670000000000001, 2: 0.451, 3: 0.41600000000000004, 4: 0.857, 5: 0.46}, 'DRAA': {0: -1.0, 1: 11.4, 2: 1.0, 3: 8.5, 4: 0.1, 5: 20.0}, 'HBP': {0: 0, 1: 1, 2: 4, 3: 5, 4: 0, 5: 10}, 'BRR': {0: 0.1, 1: 5.9, 2: 6.8, 3: 2.7, 4: 0.2, 5: 15.7}, 'BB': {0: 4, 1: 14, 2: 30, 3: 69, 4: 1, 5: 118}}
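On the "table or dictionary" part of the question: the default to_dict() output above is keyed by column and then by row number, which can be awkward to look things up in. Two other orientations may be easier to work with. The sketch below uses a tiny hand-made frame with a few of the values from the output instead of fetching the page:

```python
import pandas as pd

# Hand-made stand-in for the scraped table (values copied from the output above).
table = pd.DataFrame({
    "YEAR": ["2015", "2016"],
    "TEAM": ["WAS", "WAS"],
    "PA": [44, 324],
    "HR": [1, 13],
})

# A list with one dict per table row.
rows = table.to_dict(orient="records")
print(rows)

# A dict keyed by season, each value holding that season's stats.
by_year = table.set_index("YEAR").to_dict(orient="index")
print(by_year["2016"]["HR"])
```

The list of records keeps the row order of the table, while the dict keyed by YEAR allows direct lookup of a single season.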
