KeyError: 0 when trying to plot multiple histograms - python

I'm having trouble plotting multiple histograms. I get this error message:
pandas\hashtable.pyx in pandas.hashtable.PyObjectHashTable.get_item (pandas\hashtable.c:12368)()
pandas\hashtable.pyx in pandas.hashtable.PyObjectHashTable.get_item (pandas\hashtable.c:12322)()
KeyError: 0
This is the code I wrote:
xaxes = ['price','bedrooms','sqft_living','sqft_lot','floors','waterfront',
'view','condition','grade','sqft_above','sqft_basement','yr_built',
'yr_renovated','zipcode','lat','long','sqft_living15','sqft_loft15']
a,b = plt.subplots(4,5)
b = b.ravel()
for idx,ax in enumerate(b):
    ax.hist(file[idx])
    ax.set_title(titles[idx])
    ax.set_xlabel(xaxes[id])
plt.tight_layout()
Here is a sample of my data:
{'bathrooms': {0: 1.0,
1: 2.25,
2: 1.0,
3: 3.0,
4: 2.0,
5: 4.5,
6: 2.25,
7: 1.5,
8: 1.0,
9: 2.5},
'bedrooms': {0: 3, 1: 3, 2: 2, 3: 4, 4: 3, 5: 4, 6: 3, 7: 3, 8: 3, 9: 3},
'condition': {0: 3, 1: 3, 2: 3, 3: 5, 4: 3, 5: 3, 6: 3, 7: 3, 8: 3, 9: 3},
'date': {0: '20141013T000000',
1: '20141209T000000',
2: '20150225T000000',
3: '20141209T000000',
4: '20150218T000000',
5: '20140512T000000',
6: '20140627T000000',
7: '20150115T000000',
8: '20150415T000000',
9: '20150312T000000'},
'floors': {0: 1.0,
1: 2.0,
2: 1.0,
3: 1.0,
4: 1.0,
5: 1.0,
6: 2.0,
7: 1.0,
8: 1.0,
9: 2.0},
'grade': {0: 7, 1: 7, 2: 6, 3: 7, 4: 8, 5: 11, 6: 7, 7: 7, 8: 7, 9: 7},
'id': {0: 7129300520,
1: 6414100192,
2: 5631500400,
3: 2487200875,
4: 1954400510,
5: 7237550310,
6: 1321400060,
7: 2008000270,
8: 2414600126,
9: 3793500160},
'lat': {0: 47.511200000000002,
1: 47.721000000000004,
2: 47.737900000000003,
3: 47.520800000000001,
4: 47.616799999999998,
5: 47.656100000000002,
6: 47.309699999999999,
7: 47.409500000000001,
8: 47.512300000000003,
9: 47.368400000000001},
'long': {0: -122.25700000000001,
1: -122.319,
2: -122.23299999999999,
3: -122.39299999999999,
4: -122.045,
5: -122.005,
6: -122.32700000000001,
7: -122.315,
8: -122.337,
9: -122.03100000000001},
'price': {0: 221900.0,
1: 538000.0,
2: 180000.0,
3: 604000.0,
4: 510000.0,
5: 1230000.0,
6: 257500.0,
7: 291850.0,
8: 229500.0,
9: 323000.0},
'sqft_above': {0: 1180,
1: 2170,
2: 770,
3: 1050,
4: 1680,
5: 3890,
6: 1715,
7: 1060,
8: 1050,
9: 1890},
'sqft_basement': {0: 0,
1: 400,
2: 0,
3: 910,
4: 0,
5: 1530,
6: 0,
7: 0,
8: 730,
9: 0},
'sqft_living': {0: 1180,
1: 2570,
2: 770,
3: 1960,
4: 1680,
5: 5420,
6: 1715,
7: 1060,
8: 1780,
9: 1890},
'sqft_living15': {0: 1340,
1: 1690,
2: 2720,
3: 1360,
4: 1800,
5: 4760,
6: 2238,
7: 1650,
8: 1780,
9: 2390},
'sqft_lot': {0: 5650,
1: 7242,
2: 10000,
3: 5000,
4: 8080,
5: 101930,
6: 6819,
7: 9711,
8: 7470,
9: 6560},
'sqft_lot15': {0: 5650,
1: 7639,
2: 8062,
3: 5000,
4: 7503,
5: 101930,
6: 6819,
7: 9711,
8: 8113,
9: 7570},
'view': {0: 0, 1: 0, 2: 0, 3: 0, 4: 0, 5: 0, 6: 0, 7: 0, 8: 0, 9: 0},
'waterfront': {0: 0, 1: 0, 2: 0, 3: 0, 4: 0, 5: 0, 6: 0, 7: 0, 8: 0, 9: 0},
'yr_built': {0: 1955,
1: 1951,
2: 1933,
3: 1965,
4: 1987,
5: 2001,
6: 1995,
7: 1963,
8: 1960,
9: 2003},
'yr_renovated': {0: 0,
1: 1991,
2: 0,
3: 0,
4: 0,
5: 0,
6: 0,
7: 0,
8: 0,
9: 0},
'zipcode': {0: 98178,
1: 98125,
2: 98028,
3: 98136,
4: 98074,
5: 98053,
6: 98003,
7: 98198,
8: 98146,
9: 98038}}

If I understand correctly and the variable file holds your data as a pandas DataFrame, then this is simply an indexing problem.
file[idx] looks up the column labelled idx, not a row and not the column at position idx. Your columns are labelled with strings such as 'price', so there is no column labelled 0, hence KeyError: 0. Select the column by name with file[xaxes[idx]], or by position with file.iloc[:, idx].
See the pandas documentation on indexing and selecting data for more details.
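As a minimal sketch of the difference, assuming a small made-up frame with two of the columns from the question (the values are the first three rows of the sample data), an integer key fails while a name or a position works:

```python
import pandas as pd

# Tiny stand-in for the asker's `file` DataFrame (assumption: string column labels).
file = pd.DataFrame({"price": [221900.0, 538000.0, 180000.0],
                     "bedrooms": [3, 3, 2]})
xaxes = ["price", "bedrooms"]

# file[0] asks for a column labelled 0, which does not exist here.
try:
    file[0]
    raised = False
except KeyError:
    raised = True

# Either of these returns the first column as a Series.
by_name = file[xaxes[0]]       # label-based selection
by_position = file.iloc[:, 0]  # position-based selection
```

Either Series can then be passed to ax.hist inside the loop.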

Related

Pandas merge dataframes with multiple columns

I am trying to merge 2 dataframes and am having trouble figuring out how, as it is not straightforward.
One data frame has match results for over 25000 games and looks like this.
The second one has team performance metrics but only for around 1500 games.
As I am not allowed to post pictures yet, here are the column names of interest:
df_match['date', 'home_team_api_id', 'away_team_api_id']
df_team_attributes['date', 'team_api_id']
Both data frames have additional columns with results or performance metrics.
To be able to merge correctly, I need to merge by date and by looking if the 'team_api_id' matches either 'home...' or 'away_team_api_id'
This is what I have tried until now:
df_team_performance = pd.merge(df_team_attributes, df_match,
                               how = 'left',
                               left_on = ['date', 'team_api_id', 'team_api_id'],
                               right_on = ['date', 'home_team_api_id', 'home_team_api_id'])
I have also tried with only 2 columns, but without success.
What I would like to get is a new data frame with only the rows of the df_team_attributes and columns from both data frames.
Thank you in advance!
Added to request by Correlien:
output of print(df_match[['date', 'home_team_api_id', 'away_team_api_id', 'win_home', 'win_away', 'draw', 'win']].head(10).to_dict())
{'date': {0: '2008-08-17 00:00:00', 1: '2008-08-16 00:00:00', 2: '2008-08-16 00:00:00', 3: '2008-08-17 00:00:00', 4: '2008-08-16 00:00:00', 5: '2008-09-24 00:00:00', 6: '2008-08-16 00:00:00', 7: '2008-08-16 00:00:00', 8: '2008-08-16 00:00:00', 9: '2008-11-01 00:00:00'}, 'home_team_api_id': {0: 9987, 1: 10000, 2: 9984, 3: 9991, 4: 7947, 5: 8203, 6: 9999, 7: 4049, 8: 10001, 9: 8342}, 'away_team_api_id': {0: 9993, 1: 9994, 2: 8635, 3: 9998, 4: 9985, 5: 8342, 6: 8571, 7: 9996, 8: 9986, 9: 8571}, 'win_home': {0: 0, 1: 0, 2: 0, 3: 1, 4: 0, 5: 0, 6: 0, 7: 0, 8: 1, 9: 1}, 'win_away': {0: 0, 1: 0, 2: 1, 3: 0, 4: 1, 5: 0, 6: 0, 7: 1, 8: 0, 9: 0}, 'draw': {0: 1, 1: 1, 2: 0, 3: 0, 4: 0, 5: 1, 6: 1, 7: 0, 8: 0, 9: 0}, 'win': {0: 0, 1: 0, 2: 1, 3: 1, 4: 1, 5: 0, 6: 0, 7: 1, 8: 1, 9: 1}}
output for print(df_team_attributes[['date', 'team_api_id', 'buildUpPlaySpeed', 'buildUpPlaySpeedClass']].head(10).to_dict())
{'date': {0: '2010-02-22 00:00:00', 1: '2014-09-19 00:00:00', 2: '2015-09-10 00:00:00', 3: '2010-02-22 00:00:00', 4: '2011-02-22 00:00:00', 5: '2012-02-22 00:00:00', 6: '2013-09-20 00:00:00', 7: '2014-09-19 00:00:00', 8: '2015-09-10 00:00:00', 9: '2010-02-22 00:00:00'}, 'team_api_id': {0: 9930, 1: 9930, 2: 9930, 3: 8485, 4: 8485, 5: 8485, 6: 8485, 7: 8485, 8: 8485, 9: 8576}, 'buildUpPlaySpeed': {0: 60, 1: 52, 2: 47, 3: 70, 4: 47, 5: 58, 6: 62, 7: 58, 8: 59, 9: 60}, 'buildUpPlaySpeedClass': {0: 'Balanced', 1: 'Balanced', 2: 'Balanced', 3: 'Fast', 4: 'Balanced', 5: 'Balanced', 6: 'Balanced', 7: 'Balanced', 8: 'Balanced', 9: 'Balanced'}}
Have you tried casting your date columns to the correct format before attempting the merge? The following worked for me based on the example that you provided:
# Casting to date
df_match["date"] = pd.to_datetime(df_match["date"])
df_team_attributes["date"] = pd.to_datetime(df_team_attributes["date"])
# Merging on the date field alone
df_team_performance = pd.merge(df_team_attributes, df_match,
                               how = 'left',
                               on = 'date')
# Filtering out the required rows
result = df_team_performance.query("(team_api_id == home_team_api_id) | (team_api_id == away_team_api_id)")
Please let me know if my understanding of your question is correct.
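Merging on date alone pairs every team row with every match on that day before filtering, which can get large. One possible alternative, sketched here on a few made-up rows that reuse the column names from the question, is to merge twice (once against the home id, once against the away id) and stack the results. The `side` column is a hypothetical helper added for illustration, and inner joins are used, so team rows without a match are dropped.

```python
import pandas as pd

# Made-up sample rows; only the column names come from the question.
df_match = pd.DataFrame({
    "date": ["2010-02-22", "2010-02-22", "2011-02-22"],
    "home_team_api_id": [9930, 8485, 8485],
    "away_team_api_id": [8576, 9999, 9930],
    "win": [1, 0, 1],
})
df_team_attributes = pd.DataFrame({
    "date": ["2010-02-22", "2010-02-22", "2011-02-22"],
    "team_api_id": [9930, 8576, 8485],
    "buildUpPlaySpeed": [60, 60, 47],
})
for frame in (df_match, df_team_attributes):
    frame["date"] = pd.to_datetime(frame["date"])

# Teams that played at home on the date of their attribute record.
home = df_team_attributes.merge(
    df_match, how="inner",
    left_on=["date", "team_api_id"],
    right_on=["date", "home_team_api_id"],
).assign(side="home")

# Teams that played away on that date.
away = df_team_attributes.merge(
    df_match, how="inner",
    left_on=["date", "team_api_id"],
    right_on=["date", "away_team_api_id"],
).assign(side="away")

result = pd.concat([home, away], ignore_index=True)
```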

How do you sort column values in pandas to only include certain values?

Essentially my program finds the person with the most yards per carry, but it picks someone with only a couple of attempts.
I'm trying to filter out the rest of the players so that I only get people with above 200 yards so far in the season.
All of the data comes from a CSV file and so it has to be done thru pandas.
import pandas as pd
wide_receiver = pd.read_csv('nfl-flex.csv')
wide_receiver['ypc'] = wide_receiver.reyds / wide_receiver.rec
wr_ypc = wide_receiver[wide_receiver['pos'] == 'WR']['ypc'].max()
yards_leader = wide_receiver.loc[wide_receiver['ypc'] == wr_ypc]
print(yards_leader['name'])
I'm not quite sure how to filter out those players with less than 200 yards.
Output:
{'id': {0: 11706, 1: 11791, 2: 11792, 3: 11793, 4: 11810}, 'name': {0: 'Mark Ingram', 1: 'Rob Gronkowski', 2: 'Marcedes Lewis', 3: 'Jimmy Graham', 4: 'Jared Cook'}, 'fpts': {0: 100.5, 1: 90.8, 2: 26.1, 3: 21.8, 4: 90.1}, 'gp': {0: 11, 1: 6, 2: 12, 3: 9, 4: 11}, 'cmp': {0: 0, 1: 0, 2: 0, 3: 0, 4: 0}, 'att': {0: 0, 1: 0, 2: 0, 3: 0, 4: 0}, 'payds': {0: 0, 1: 0, 2: 0, 3: 0, 4: 0}, 'patd': {0: 0, 1: 0, 2: 0, 3: 0, 4: 0}, 'int': {0: 0, 1: 0, 2: 0, 3: 0, 4: 0}, 'ruatt': {0: 137, 1: 0, 2: 0, 3: 0, 4: 0}, 'ruyds': {0: 499, 1: 0, 2: 0, 3: 0, 4: 0}, 'rutd': {0: 2, 1: 0, 2: 0, 3: 0, 4: 0}, 'tar': {0: 31, 1: 39, 2: 17, 3: 12, 4: 55}, 'rec': {0: 24, 1: 29, 2: 14, 3: 6, 4: 33}, 'rzatt': {0: 22, 1: 0, 2: 0, 3: 0, 4: 0}, 'rztar': {0: 5, 1: 8, 2: 2, 3: 5, 4: 7}, 'reyds': {0: 156, 1: 378, 2: 121, 3: 98, 4: 371}, 'retd': {0: 0, 1: 4, 2: 0, 3: 1, 4: 3}, 'fuml': {0: 1, 1: 0, 2: 0, 3: 0, 4: 0}, 'putd': {0: 0, 1: 0, 2: 0, 3: 0, 4: 0}, 'krtd': {0: 0, 1: 0, 2: 0, 3: 0, 4: 0}, 'fumtd': {0: 0, 1: 0, 2: 0, 3: 0, 4: 0}, '2ptpa': {0: 0, 1: 0, 2: 0, 3: 0, 4: 0}, '2ptru': {0: 0, 1: 0, 2: 0, 3: 0, 4: 0}, '2ptre': {0: 0, 1: 0, 2: 0, 3: 0, 4: 1}, 'pct': {0: '0.00%', 1: '0.00%', 2: '0.00%', 3: '0.00%', 4: '0.00%'}, 'ruypc': {0: 3.64, 1: 0.0, 2: 0.0, 3: 0.0, 4: 0.0}, 'reypc': {0: 6.5, 1: 13.03, 2: 8.64, 3: 16.33, 4: 11.24}, 'tchs': {0: 161, 1: 29, 2: 14, 3: 6, 4: 33}, 'tyds': {0: 655, 1: 378, 2: 121, 3: 98, 4: 371}, 'team': {0: 'NOS', 1: 'TBB', 2: 'GBP', 3: 'CHI', 4: 'LAC'}, 'pos': {0: 'RB', 1: 'TE', 2: 'TE', 3: 'TE', 4: 'TE'}, 'ypc': {0: 6.5, 1: 13.03448275862069, 2: 8.642857142857142, 3: 16.333333333333332, 4: 11.242424242424242}}
You already filtered when you wrote yards_leader = wide_receiver.loc[wide_receiver['ypc'] == wr_ypc], so just apply that same concept.
import pandas as pd
sample_dict = {'id': {0: 11706, 1: 11791, 2: 11792, 3: 11793, 4: 11810}, 'name': {0: 'Mark Ingram', 1: 'Rob Gronkowski', 2: 'Marcedes Lewis', 3: 'Jimmy Graham', 4: 'Jared Cook'}, 'fpts': {0: 100.5, 1: 90.8, 2: 26.1, 3: 21.8, 4: 90.1}, 'gp': {0: 11, 1: 6, 2: 12, 3: 9, 4: 11}, 'cmp': {0: 0, 1: 0, 2: 0, 3: 0, 4: 0}, 'att': {0: 0, 1: 0, 2: 0, 3: 0, 4: 0}, 'payds': {0: 0, 1: 0, 2: 0, 3: 0, 4: 0}, 'patd': {0: 0, 1: 0, 2: 0, 3: 0, 4: 0}, 'int': {0: 0, 1: 0, 2: 0, 3: 0, 4: 0}, 'ruatt': {0: 137, 1: 0, 2: 0, 3: 0, 4: 0}, 'ruyds': {0: 499, 1: 0, 2: 0, 3: 0, 4: 0}, 'rutd': {0: 2, 1: 0, 2: 0, 3: 0, 4: 0}, 'tar': {0: 31, 1: 39, 2: 17, 3: 12, 4: 55}, 'rec': {0: 24, 1: 29, 2: 14, 3: 6, 4: 33}, 'rzatt': {0: 22, 1: 0, 2: 0, 3: 0, 4: 0}, 'rztar': {0: 5, 1: 8, 2: 2, 3: 5, 4: 7}, 'reyds': {0: 156, 1: 378, 2: 121, 3: 98, 4: 371}, 'retd': {0: 0, 1: 4, 2: 0, 3: 1, 4: 3}, 'fuml': {0: 1, 1: 0, 2: 0, 3: 0, 4: 0}, 'putd': {0: 0, 1: 0, 2: 0, 3: 0, 4: 0}, 'krtd': {0: 0, 1: 0, 2: 0, 3: 0, 4: 0}, 'fumtd': {0: 0, 1: 0, 2: 0, 3: 0, 4: 0}, '2ptpa': {0: 0, 1: 0, 2: 0, 3: 0, 4: 0}, '2ptru': {0: 0, 1: 0, 2: 0, 3: 0, 4: 0}, '2ptre': {0: 0, 1: 0, 2: 0, 3: 0, 4: 1}, 'pct': {0: '0.00%', 1: '0.00%', 2: '0.00%', 3: '0.00%', 4: '0.00%'}, 'ruypc': {0: 3.64, 1: 0.0, 2: 0.0, 3: 0.0, 4: 0.0}, 'reypc': {0: 6.5, 1: 13.03, 2: 8.64, 3: 16.33, 4: 11.24}, 'tchs': {0: 161, 1: 29, 2: 14, 3: 6, 4: 33}, 'tyds': {0: 655, 1: 378, 2: 121, 3: 98, 4: 371}, 'team': {0: 'NOS', 1: 'TBB', 2: 'GBP', 3: 'CHI', 4: 'LAC'}, 'pos': {0: 'RB', 1: 'TE', 2: 'TE', 3: 'TE', 4: 'TE'}, 'ypc': {0: 6.5, 1: 13.03448275862069, 2: 8.642857142857142, 3: 16.333333333333332, 4: 11.242424242424242}}
sample_df = pd.DataFrame(sample_dict)
filtered_sample_df = sample_df[sample_df['reyds'] > 200]
Output:
print(sample_df)
id name fpts gp cmp ... tchs tyds team pos ypc
0 11706 Mark Ingram 100.5 11 0 ... 161 655 NOS RB 6.500000
1 11791 Rob Gronkowski 90.8 6 0 ... 29 378 TBB TE 13.034483
2 11792 Marcedes Lewis 26.1 12 0 ... 14 121 GBP TE 8.642857
3 11793 Jimmy Graham 21.8 9 0 ... 6 98 CHI TE 16.333333
4 11810 Jared Cook 90.1 11 0 ... 33 371 LAC TE 11.242424
[5 rows x 33 columns]
print(filtered_sample_df)
id name fpts gp cmp ... tchs tyds team pos ypc
1 11791 Rob Gronkowski 90.8 6 0 ... 29 378 TBB TE 13.034483
4 11810 Jared Cook 90.1 11 0 ... 33 371 LAC TE 11.242424
[2 rows x 33 columns]
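Putting the yardage filter together with the position filter from the question might look like the sketch below. The player rows are invented for illustration; only the column names (name, pos, rec, reyds, ypc) and the 200-yard threshold come from the question.

```python
import pandas as pd

# Hypothetical players; "Player A" has a high ypc on very few catches.
players = pd.DataFrame({
    "name": ["Player A", "Player B", "Player C", "Player D"],
    "pos": ["WR", "WR", "WR", "TE"],
    "rec": [2, 40, 30, 20],
    "reyds": [60, 600, 300, 400],
})
players["ypc"] = players["reyds"] / players["rec"]

# Combine both conditions with & (each condition needs its own parentheses).
eligible = players[(players["pos"] == "WR") & (players["reyds"] > 200)]

# idxmax returns the index label of the row with the largest ypc.
leader = eligible.loc[eligible["ypc"].idxmax()]
print(leader["name"])
```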

Multiple plotting from multi-index dataframe

I want to plot by Step Typ: Traction and Stribeck. Each load stage should have its own plot, and within each load stage the line plots should be broken down by temperature. The y-axis is Traction (-) and the x-axis is the respective counterpart: SRR (%) for Traction and Rolling Speed (mm/s) for Stribeck. In the end, I should have four different plots.
An example of how it should look:
My attempt so far, which leads to an empty plot:
import pandas as pd
import matplotlib.pyplot as plt
data = {'Step 1': {'Step Typ': 'Traction', 'SRR (%)': {1: 8.384, 2: 9.815, 3: 7.531, 4: 10.209, 5: 7.989, 6: 7.331, 7: 5.008, 8: 2.716, 9: 9.6, 10: 7.911}, 'Traction (-)': {1: 5.602, 2: 6.04, 3: 2.631, 4: 2.952, 5: 8.162, 6: 9.312, 7: 4.994, 8: 2.959, 9: 10.075, 10: 5.498}, 'Temperature': 30, 'Load': 40}, 'Step 3': {'Step Typ': 'Traction', 'SRR (%)': {1: 2.909, 2: 5.552, 3: 5.656, 4: 9.043, 5: 3.424, 6: 7.382, 7: 3.916, 8: 2.665, 9: 4.832, 10: 3.993}, 'Traction (-)': {1: 9.158, 2: 6.721, 3: 7.787, 4: 7.491, 5: 8.267, 6: 2.985, 7: 5.882, 8: 3.591, 9: 6.334, 10: 10.43}, 'Temperature': 80, 'Load': 40}, 'Step 5': {'Step Typ': 'Traction', 'SRR (%)': {1: 4.765, 2: 9.293, 3: 7.608, 4: 7.371, 5: 4.87, 6: 4.832, 7: 6.244, 8: 6.488, 9: 5.04, 10: 2.962}, 'Traction (-)': {1: 6.656, 2: 7.872, 3: 8.799, 4: 7.9, 5: 4.22, 6: 6.288, 7: 7.439, 8: 7.77, 9: 5.977, 10: 9.395}, 'Temperature': 30, 'Load': 70}, 'Step 7': {'Step Typ': 'Traction', 'SRR (%)': {1: 9.46, 2: 2.83, 3: 3.249, 4: 9.273, 5: 8.792, 6: 9.673, 7: 6.784, 8: 3.838, 9: 8.779, 10: 4.82}, 'Traction (-)': {1: 5.245, 2: 8.491, 3: 10.088, 4: 9.988, 5: 4.886, 6: 4.168, 7: 8.628, 8: 5.038, 9: 7.712, 10: 3.961}, 'Temperature': 80, 'Load': 70}, 'Step 2': {'Step Typ': 'Stribeck', 'Rolling Speed (mm/s)': {1: 4.862, 2: 4.71, 3: 4.537, 4: 6.35, 5: 6.691, 6: 5.337, 7: 8.419, 8: 10.303, 9: 5.018, 10: 10.195}, 'Traction (-)': {1: 6.674, 2: 10.137, 3: 2.822, 4: 5.494, 5: 9.986, 6: 9.095, 7: 3.53, 8: 6.96, 9: 8.251, 10: 7.836}, 'Temperature': 30, 'Load': 40}, 'Step 4': {'Step Typ': 'Stribeck', 'Rolling Speed (mm/s)': {1: 4.04, 2: 8.288, 3: 3.731, 4: 10.137, 5: 5.32, 6: 8.504, 7: 5.917, 8: 9.677, 9: 8.641, 10: 7.685}, 'Traction (-)': {1: 9.522, 2: 4.749, 3: 3.46, 4: 3.21, 5: 5.005, 6: 9.886, 7: 8.023, 8: 5.935, 9: 8.74, 10: 5.117}, 'Temperature': 80, 'Load': 40}, 'Step 6': {'Step Typ': 'Stribeck', 'Rolling Speed (mm/s)': {1: 6.244, 2: 7.015, 3: 5.998, 4: 4.894, 5: 6.117, 6: 6.644, 7: 7.619, 8: 10.477, 9: 9.61, 10: 2.958}, 'Traction (-)': 
{1: 7.353, 2: 7.98, 3: 6.675, 4: 8.853, 5: 7.537, 6: 5.256, 7: 4.923, 8: 10.293, 9: 2.873, 10: 10.407}, 'Temperature': 30, 'Load': 70}, 'Step 8': {'Step Typ': 'Stribeck', 'Rolling Speed (mm/s)': {1: 3.475, 2: 2.756, 3: 7.809, 4: 9.449, 5: 2.72, 6: 4.133, 7: 10.139, 8: 10.0, 9: 3.71, 10: 8.267}, 'Traction (-)': {1: 6.307, 2: 2.83, 3: 9.258, 4: 3.405, 5: 9.659, 6: 6.662, 7: 6.413, 8: 6.488, 9: 7.972, 10: 6.288}, 'Temperature': 80, 'Load': 70} }
df = pd.DataFrame(data)
items = list()
series = list()
for item, d in data.items():
    items.append(item)
    series.append(pd.DataFrame.from_dict(d))
df = pd.concat(series, keys=items)
df.set_index(['Step Typ', 'Load', 'Temperature'], inplace=True)
df.loc[('Stribeck')]
for force, _ in df.groupby(level=1):
    plt.figure(figsize=(15, 12))
    for i, row in df.loc[('Traction'), force].iterrows():
        plt.ylim(0, 0.1)
        plt.ylabel('Traction Coeff (-)')
        plt.xlabel('Rolling Speed (mm/s)')
        plt.title('Title comes later', loc='left')
        plt.plot(row['Rolling Speed (mm/s)'], row['Traction (-)'], label=f"{i} - {force}")
        print(f"{i} - {force}")
    plt.show()
I have changed your plotting loop. The code below generates two plots for Traction (one for each Load value), each with two curves (one for each temperature). I have commented out the line where you set ylim(a, b), because it can produce an empty plot if the data fall outside the (a, b) range.
import pandas as pd
import matplotlib.pyplot as plt
data = {'Step 1': {'Step Typ': 'Traction', 'SRR (%)': {1: 8.384, 2: 9.815, 3: 7.531, 4: 10.209, 5: 7.989, 6: 7.331, 7: 5.008, 8: 2.716, 9: 9.6, 10: 7.911}, 'Traction (-)': {1: 5.602, 2: 6.04, 3: 2.631, 4: 2.952, 5: 8.162, 6: 9.312, 7: 4.994, 8: 2.959, 9: 10.075, 10: 5.498}, 'Temperature': 30, 'Load': 40}, 'Step 3': {'Step Typ': 'Traction', 'SRR (%)': {1: 2.909, 2: 5.552, 3: 5.656, 4: 9.043, 5: 3.424, 6: 7.382, 7: 3.916, 8: 2.665, 9: 4.832, 10: 3.993}, 'Traction (-)': {1: 9.158, 2: 6.721, 3: 7.787, 4: 7.491, 5: 8.267, 6: 2.985, 7: 5.882, 8: 3.591, 9: 6.334, 10: 10.43}, 'Temperature': 80, 'Load': 40}, 'Step 5': {'Step Typ': 'Traction', 'SRR (%)': {1: 4.765, 2: 9.293, 3: 7.608, 4: 7.371, 5: 4.87, 6: 4.832, 7: 6.244, 8: 6.488, 9: 5.04, 10: 2.962}, 'Traction (-)': {1: 6.656, 2: 7.872, 3: 8.799, 4: 7.9, 5: 4.22, 6: 6.288, 7: 7.439, 8: 7.77, 9: 5.977, 10: 9.395}, 'Temperature': 30, 'Load': 70}, 'Step 7': {'Step Typ': 'Traction', 'SRR (%)': {1: 9.46, 2: 2.83, 3: 3.249, 4: 9.273, 5: 8.792, 6: 9.673, 7: 6.784, 8: 3.838, 9: 8.779, 10: 4.82}, 'Traction (-)': {1: 5.245, 2: 8.491, 3: 10.088, 4: 9.988, 5: 4.886, 6: 4.168, 7: 8.628, 8: 5.038, 9: 7.712, 10: 3.961}, 'Temperature': 80, 'Load': 70}, 'Step 2': {'Step Typ': 'Stribeck', 'Rolling Speed (mm/s)': {1: 4.862, 2: 4.71, 3: 4.537, 4: 6.35, 5: 6.691, 6: 5.337, 7: 8.419, 8: 10.303, 9: 5.018, 10: 10.195}, 'Traction (-)': {1: 6.674, 2: 10.137, 3: 2.822, 4: 5.494, 5: 9.986, 6: 9.095, 7: 3.53, 8: 6.96, 9: 8.251, 10: 7.836}, 'Temperature': 30, 'Load': 40}, 'Step 4': {'Step Typ': 'Stribeck', 'Rolling Speed (mm/s)': {1: 4.04, 2: 8.288, 3: 3.731, 4: 10.137, 5: 5.32, 6: 8.504, 7: 5.917, 8: 9.677, 9: 8.641, 10: 7.685}, 'Traction (-)': {1: 9.522, 2: 4.749, 3: 3.46, 4: 3.21, 5: 5.005, 6: 9.886, 7: 8.023, 8: 5.935, 9: 8.74, 10: 5.117}, 'Temperature': 80, 'Load': 40}, 'Step 6': {'Step Typ': 'Stribeck', 'Rolling Speed (mm/s)': {1: 6.244, 2: 7.015, 3: 5.998, 4: 4.894, 5: 6.117, 6: 6.644, 7: 7.619, 8: 10.477, 9: 9.61, 10: 2.958}, 'Traction (-)': 
{1: 7.353, 2: 7.98, 3: 6.675, 4: 8.853, 5: 7.537, 6: 5.256, 7: 4.923, 8: 10.293, 9: 2.873, 10: 10.407}, 'Temperature': 30, 'Load': 70}, 'Step 8': {'Step Typ': 'Stribeck', 'Rolling Speed (mm/s)': {1: 3.475, 2: 2.756, 3: 7.809, 4: 9.449, 5: 2.72, 6: 4.133, 7: 10.139, 8: 10.0, 9: 3.71, 10: 8.267}, 'Traction (-)': {1: 6.307, 2: 2.83, 3: 9.258, 4: 3.405, 5: 9.659, 6: 6.662, 7: 6.413, 8: 6.488, 9: 7.972, 10: 6.288}, 'Temperature': 80, 'Load': 70} }
df = pd.DataFrame(data)
items = list()
series = list()
for item, d in data.items():
    items.append(item)
    series.append(pd.DataFrame.from_dict(d))
df = pd.concat(series, keys=items)
df.set_index(['Step Typ', 'Load', 'Temperature'], inplace=True)
for force, _ in df.groupby(level=1):
    fig, ax = plt.subplots(figsize=(8, 6))
    df_step = df.loc[('Traction'), force]
    for temperature in df_step.index.unique():
        df_temp = df_step.loc[temperature].sort_values('SRR (%)')
        # ax.set_ylim(0, 0.1)
        ax.set_ylabel('Traction Coeff (-)')
        ax.set_xlabel('SRR (%)')
        ax.set_title('Title comes later', loc='left')
        ax.plot(df_temp['SRR (%)'], df_temp['Traction (-)'], label = f'T = {df_temp.index.unique().values[0]}°C - Load = {force}')
    ax.legend(frameon = True)
    plt.show()
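The code above covers only the Traction steps. To get all four plots asked for in the question, one option is to loop over (Step Typ, Load) groups and pick the x column per step type. The sketch below assumes a flat table with one row per measurement (roughly what df.reset_index() would give) and uses made-up values; the X_COLUMN mapping and the figures dict are hypothetical helpers, and the Agg backend line is only there so the sketch runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; remove to show windows
import matplotlib.pyplot as plt
import pandas as pd

# Which column goes on the x-axis for each step type (from the question).
X_COLUMN = {"Traction": "SRR (%)", "Stribeck": "Rolling Speed (mm/s)"}

# Made-up measurements: 2 step types x 2 loads x 2 temperatures x 3 points.
rows = []
for step_typ, x_col in X_COLUMN.items():
    for load in (40, 70):
        for temperature in (30, 80):
            for x in (3.0, 1.0, 2.0):
                rows.append({"Step Typ": step_typ, "Load": load,
                             "Temperature": temperature, x_col: x,
                             "Traction (-)": 0.01 * x * load / temperature})
tidy = pd.DataFrame(rows)

figures = {}
# One figure per (step type, load) pair, one line per temperature.
for (step_typ, load), group in tidy.groupby(["Step Typ", "Load"]):
    x_col = X_COLUMN[step_typ]
    fig, ax = plt.subplots(figsize=(8, 6))
    for temperature, curve in group.groupby("Temperature"):
        curve = curve.sort_values(x_col)  # draw points left to right
        ax.plot(curve[x_col], curve["Traction (-)"],
                label=f"T = {temperature} °C")
    ax.set_xlabel(x_col)
    ax.set_ylabel("Traction (-)")
    ax.set_title(f"{step_typ}, Load = {load}", loc="left")
    ax.legend(frameon=True)
    figures[(step_typ, load)] = fig
```

Call plt.show() after the loop (with an interactive backend) to display the figures.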

how to clip pandas for a multiple column in a data frame

Here is the df:
{'Type 1': {1: 123.0,
2: 123.0,
3: 123.0,
4: 123.0,
5: 123.0,
6: 45.0,
7: 45.0,
8: 45.0,
9: 45.0,
10: 9.5,
11: 9.5,
12: 9.5,
13: 2.34,
14: 2.34,
15: 2.34},
'Type 2': {1: 0,
2: 0,
3: -90,
4: -90,
5: -90,
6: -90,
7: -90,
8: -270,
9: -270,
10: -270,
11: -270,
12: 180,
13: 180,
14: 181,
15: 181},
'Type 3': {1: 0,
2: 0,
3: 0,
4: 0,
5: 55,
6: 55,
7: 55,
8: 55,
9: 55,
10: 9,
11: 9,
12: 3,
13: 3,
14: 3,
15: 3},
'Type 4': {1: 5.0,
2: 5.0,
3: 5.0,
4: 5.0,
5: 10.0,
6: 123.0,
7: 12.0,
8: 23.0,
9: 16.0,
10: 3.14,
11: 0.0,
12: 0.0,
13: 0.0,
14: 0.0,
15: 18.0},
'Type 5': {1: 65536,
2: 65536,
3: 65536,
4: 65536,
5: 78888888,
6: 665,
7: 665,
8: 665,
9: 665,
10: 665,
11: 665,
12: 665,
13: 665,
14: 665,
15: 665},
'Type 6': {1: 3.4124,
2: 3.4124,
3: 3.4124,
4: 3.4124,
5: 3.4124,
6: 3.4124,
7: 3.4124,
8: 3.4124,
9: 3.4124,
10: 3.4124,
11: 3.4124,
12: 3.4124,
13: 3.4124,
14: 3.4124,
15: 3.4124},
'Type 7': {1: 0,
2: 0,
3: 2,
4: 2,
5: 2,
6: 1,
7: 1,
8: 1,
9: 1,
10: 10,
11: 10,
12: 9,
13: 9,
14: -5,
15: -5},
'Type 8': {1: 'convert the string to 0 and non-zero value to 1',
2: 'convert the string to 0 and non-zero value to 1',
3: 'convert the string to 0 and non-zero value to 1',
4: 'convert the string to 0 and non-zero value to 1',
5: 'convert the string to 0 and non-zero value to 1',
6: 'convert the string to 0 and non-zero value to 1',
7: 'convert the string to 0 and non-zero value to 1',
8: 'convert the string to 0 and non-zero value to 1',
9: 'convert the string to 0 and non-zero value to 1',
10: 'convert the string to 0 and non-zero value to 1',
11: 'convert the string to 0 and non-zero value to 1',
12: 'convert the string to 0 and non-zero value to 1',
13: 'convert the string to 0 and non-zero value to 1',
14: 'convert the string to 0 and non-zero value to 1',
15: 'convert the string to 0 and non-zero value to 1'},
'Type 9': {1: 0,
2: 0,
3: 0,
4: 0,
5: 0,
6: 1,
7: 1,
8: 0,
9: 0,
10: 8,
11: 8,
12: 0,
13: 0,
14: 45,
15: 45}}
Each column in the dataframe has a lower and an upper limit, as given in the lists below,
e.g.:
lower_limit = [3,-90,0,0,0,1,0,0,0] #Type 1 lower limit is 3...
upper_limit = [100,90,50,100,65535,3,1,1,1] #Type 1 upper limit is 100...
lower_limit = pd.Series(lower_limit)
upper_limit = pd.Series(upper_limit)
df.clip(lower_limit, upper_limit, axis = 1)
But this returns every element as NaN,
whereas the expected result is for each column to be clipped to the lower and upper limits given in the lists.
Using a for loop I was able to make the necessary change, but it was extremely slow when the df is huge.
I understand that clip is the faster way to make these changes, but it doesn't work as expected. What mistake am I making, and is there any other fast way to clip the columns?
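The documentation describes lower and upper as float or array-like. A Series is aligned with the DataFrame by index label, and yours are indexed 0 to 8 while the columns are named 'Type 1' to 'Type 9', so the labels never line up and you get NaN instead of clipped values. Passing plain lists avoids the alignment.
You could do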
lower_limit = [3,-90,0,0,0,1,0,'',0] #Type 1 lower limit is 3...
upper_limit = [100,90,50,100,65535,3,1,'',1] #Type 1 upper limit is 100...
df.clip(lower_limit, upper_limit, axis = 1)
but column Type 8 holds strings, so you'd get an empty column with clip. You can fix that with
lower_limit = [3,-90,0,0,0,1,0,df['Type 8'].min(),0]
upper_limit = [100,90,50,100,65535,3,1,df['Type 8'].max(),1]
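Another way to handle this, sketched below on a cut-down frame, is to keep the limits in Series indexed by the column names and clip only the numeric columns, leaving the string column untouched. The `limits` table and the three-row data are assumptions for illustration; the bounds for Type 1 and Type 2 are the ones from the question.

```python
import pandas as pd

# Cut-down version of the question's frame (values picked from its columns).
df = pd.DataFrame({
    "Type 1": [123.0, 45.0, 2.34],
    "Type 2": [0, -270, 181],
    "Type 8": ["text", "text", "text"],
})

# Limits keyed by column name, so clip can align them with the columns.
limits = pd.DataFrame({"lower": [3, -90], "upper": [100, 90]},
                      index=["Type 1", "Type 2"])

numeric_cols = limits.index
clipped = df.copy()
# axis=1 matches each limit to the column with the same label.
clipped[numeric_cols] = df[numeric_cols].clip(
    limits["lower"], limits["upper"], axis=1
)
```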

How can I scrape data from an HTML table into a Python list/dict?

I'm trying to import data from Baseball Prospectus into a Python table / dictionary (which would be better?).
Below is what I have, based on following along to Automate The Boring Stuff with Python.
I get that my method isn't properly using these functions, but I can't figure out what tools I should be using.
import requests
import webbrowser
import bs4
res = requests.get('https://legacy.baseballprospectus.com/card/70917/trea-turner')
res.raise_for_status()
webpage = bs4.BeautifulSoup(res.text)
table = webpage.select('newstat_career_log_datagrid')
list = []
for item in table:
    list.append(item)
print(list)
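Use pandas to fetch the statistics tables first, then convert the DataFrame you want into a dictionary. pd.read_html returns a list of DataFrames, one per table on the page. If you don't have pandas installed, you can install it with a single command (read_html also needs an HTML parser such as lxml).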
pip install pandas
Then use the below code.
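import pandas as pd
# read_html returns a list of DataFrames, one per <table> found on the page
tables = pd.read_html('https://legacy.baseballprospectus.com/card/70917/trea-turner')
# index 5 is the statistics table on this page
data_dict = tables[5].to_dict()
print(data_dict)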
Output:
{'PA': {0: 44, 1: 324, 2: 447, 3: 740, 4: 15, 5: 1570}, '2B': {0: 1, 1: 14, 2: 24, 3: 27, 4: 1, 5: 67}, 'TEAM': {0: 'WAS', 1: 'WAS', 2: 'WAS', 3: 'WAS', 4: 'WAS', 5: 'Career'}, 'SB': {0: 2, 1: 33, 2: 46, 3: 43, 4: 4, 5: 128}, 'G': {0: 27, 1: 73, 2: 98, 3: 162, 4: 4, 5: 364}, 'HR': {0: 1, 1: 13, 2: 11, 3: 19, 4: 2, 5: 46}, 'FRAA': {0: 0.5, 1: -3.2, 2: 0.2, 3: 7.1, 4: -0.1, 5: 4.5}, 'BWARP': {0: 0.1, 1: 2.4, 2: 2.7, 3: 5.0, 4: 0.1, 5: 10.4}, 'CS': {0: 2, 1: 6, 2: 8, 3: 9, 4: 0, 5: 25}, '3B': {0: 0, 1: 8, 2: 6, 3: 6, 4: 0, 5: 20}, 'H': {0: 9, 1: 105, 2: 117, 3: 180, 4: 5, 5: 416}, 'AGE': {0: '22', 1: '23', 2: '24', 3: '25', 4: '26', 5: 'Career'}, 'OBP': {0: 0.295, 1: 0.37, 2: 0.33799999999999997, 3: 0.344, 4: 0.4, 5: 0.34700000000000003}, 'AVG': {0: 0.225, 1: 0.342, 2: 0.284, 3: 0.271, 4: 0.35700000000000004, 5: 0.289}, 'DRC+': {0: 77, 1: 128, 2: 99, 3: 107, 4: 103, 5: 108}, 'SO': {0: 12, 1: 59, 2: 80, 3: 132, 4: 5, 5: 288}, 'YEAR': {0: '2015', 1: '2016', 2: '2017', 3: '2018', 4: '2019', 5: 'Career'}, 'SLG': {0: 0.325, 1: 0.5670000000000001, 2: 0.451, 3: 0.41600000000000004, 4: 0.857, 5: 0.46}, 'DRAA': {0: -1.0, 1: 11.4, 2: 1.0, 3: 8.5, 4: 0.1, 5: 20.0}, 'HBP': {0: 0, 1: 1, 2: 4, 3: 5, 4: 0, 5: 10}, 'BRR': {0: 0.1, 1: 5.9, 2: 6.8, 3: 2.7, 4: 0.2, 5: 15.7}, 'BB': {0: 4, 1: 14, 2: 30, 3: 69, 4: 1, 5: 118}}
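On the list-versus-dict part of the question: the default to_dict() output above is keyed by column, then by row index. If one record per season is easier to work with, orient="records" gives a list of dicts instead. A small sketch, using three columns and two rows copied from the output above rather than a live download:

```python
import pandas as pd

# Two seasons taken from the output above; a stand-in for tables[5].
stats = pd.DataFrame({
    "YEAR": ["2015", "2016"],
    "TEAM": ["WAS", "WAS"],
    "HR": [1, 13],
})

# Column-oriented: {column: {row_index: value}}
by_column = stats.to_dict()

# Row-oriented: one dict per row, convenient for looping over seasons.
by_row = stats.to_dict(orient="records")
for season in by_row:
    print(season["YEAR"], season["HR"])
```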
