I have a table with an "Id" column and several integer columns that I want to convert to categorical variables. I want to apply the conversion only to those integer columns and leave the Id column unchanged.
All the methods I have found involve dropping the Id column. How do I do this without dropping the Id column?
This is the current code I have:
df= df.loc[:, df.columns != 'Id'].apply(lambda x: x.astype('category'))
Sample dataframe:
{'Id': {0: 0, 1: 1, 2: 2, 3: 3, 4: 4},
'Foundation': {0: 2, 1: 1, 2: 2, 3: 0, 4: 2},
'GarageFinish': {0: 1, 1: 1, 2: 1, 3: 2, 4: 1},
'LandSlope': {0: 0, 1: 0, 2: 0, 3: 0, 4: 0},
'LotConfig': {0: 4, 1: 2, 2: 4, 3: 0, 4: 2},
'GarageQual': {0: 4, 1: 4, 2: 4, 3: 4, 4: 4},
'GarageCond': {0: 4, 1: 4, 2: 4, 3: 4, 4: 4},
'LandContour': {0: 3, 1: 3, 2: 3, 3: 3, 4: 3},
'Utilities': {0: 0, 1: 0, 2: 0, 3: 0, 4: 0},
'GarageType': {0: 1, 1: 1, 2: 1, 3: 5, 4: 1},
'LotShape': {0: 3, 1: 3, 2: 0, 3: 0, 4: 0},
'Alley': {0: 2, 1: 2, 2: 2, 3: 2, 4: 2},
'Street': {0: 1, 1: 1, 2: 1, 3: 1, 4: 1},
'PoolQC': {0: 3, 1: 3, 2: 3, 3: 3, 4: 3},
'Fence': {0: 4, 1: 4, 2: 4, 3: 4, 4: 4},
'MiscFeature': {0: 4, 1: 4, 2: 4, 3: 4, 4: 4},
'MSZoning': {0: 3, 1: 3, 2: 3, 3: 3, 4: 3},
'SaleType': {0: 8, 1: 8, 2: 8, 3: 8, 4: 8},
'PavedDrive': {0: 2, 1: 2, 2: 2, 3: 2, 4: 2},
'FireplaceQu': {0: 5, 1: 4, 2: 4, 3: 2, 4: 4},
'Condition1': {0: 2, 1: 1, 2: 2, 3: 2, 4: 2},
'Functional': {0: 6, 1: 6, 2: 6, 3: 6, 4: 6},
'BsmtQual': {0: 2, 1: 2, 2: 2, 3: 3, 4: 2},
'BsmtCond': {0: 3, 1: 3, 2: 3, 3: 1, 4: 3},
'BsmtExposure': {0: 3, 1: 1, 2: 2, 3: 3, 4: 0},
'BsmtFinType1': {0: 2, 1: 0, 2: 2, 3: 0, 4: 2},
'ExterQual': {0: 2, 1: 3, 2: 2, 3: 3, 4: 2},
'BsmtFinType2': {0: 5, 1: 5, 2: 5, 3: 5, 4: 5},
'MasVnrType': {0: 1, 1: 2, 2: 1, 3: 2, 4: 1},
'Exterior2nd': {0: 13, 1: 8, 2: 13, 3: 15, 4: 13},
'Heating': {0: 1, 1: 1, 2: 1, 3: 1, 4: 1},
'Neighborhood': {0: 5, 1: 24, 2: 5, 3: 6, 4: 15},
'SaleCondition': {0: 4, 1: 4, 2: 4, 3: 0, 4: 4},
'Electrical': {0: 4, 1: 4, 2: 4, 3: 4, 4: 4},
'Exterior1st': {0: 12, 1: 8, 2: 12, 3: 13, 4: 12},
'RoofMatl': {0: 1, 1: 1, 2: 1, 3: 1, 4: 1},
'RoofStyle': {0: 1, 1: 1, 2: 1, 3: 1, 4: 1},
'HouseStyle': {0: 5, 1: 2, 2: 5, 3: 5, 4: 5},
'BldgType': {0: 0, 1: 0, 2: 0, 3: 0, 4: 0},
'Condition2': {0: 2, 1: 2, 2: 2, 3: 2, 4: 2},
'KitchenQual': {0: 2, 1: 3, 2: 2, 3: 2, 4: 2},
'ExterCond': {0: 4, 1: 4, 2: 4, 3: 4, 4: 4},
'CentralAir': {0: 1, 1: 1, 2: 1, 3: 1, 4: 1},
'HeatingQC': {0: 0, 1: 0, 2: 0, 3: 2, 4: 0}}
One way to do this is by isolating the Id column and then joining the converted columns:
df = df[['Id']].join(
df.loc[:, df.columns != 'Id'].astype('category')
)
Another option, which relies on Id matching the default row index (as it does in the sample), is:
df = df.groupby('Id').transform(lambda x: pd.Categorical(x)).reset_index(names = 'id')
I think the easier way would be to use astype directly, and provide a generated dictionary.
cast_df = df.astype({col: 'category' for col in df if col != 'Id'})
It's probably more performant than the other solutions too.
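As a minimal sketch of the dictionary approach, the snippet below uses a three-column subset of the sample frame from the question and checks the resulting dtypes. The subset is only for brevity; the same comprehension should apply to the full frame.

```python
import pandas as pd

# Small subset of the sample frame from the question
df = pd.DataFrame({
    "Id": [0, 1, 2, 3, 4],
    "Foundation": [2, 1, 2, 0, 2],
    "GarageFinish": [1, 1, 1, 2, 1],
})

# Map every column except 'Id' to the 'category' dtype
cast_df = df.astype({col: "category" for col in df.columns if col != "Id"})

# 'Id' keeps its integer dtype; the other columns become categorical
print(cast_df.dtypes)
```

Because astype returns a new frame, the column order and the Id values are left untouched.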
I am trying to merge two dataframes and cannot figure out how, as it is not straightforward.
One data frame has match results for over 25000 games and looks like this.
The second one has team performance metrics but only for around 1500 games.
As I am not allowed to post pictures yet, here are the column names of interest:
df_match['date', 'home_team_api_id', 'away_team_api_id']
df_team_attributes['date', 'team_api_id']
Both data frames have additional columns with results or performance metrics.
To merge correctly, I need to match on the date and on whether 'team_api_id' equals either 'home_team_api_id' or 'away_team_api_id'.
This is what I have tried until now:
df_team_performance = pd.merge(df_team_attributes, df_match,
how = 'left',
left_on = ['date', 'team_api_id', 'team_api_id'],
right_on = ['date', 'home_team_api_id', 'home_team_api_id'])
I have also tried with only 2 columns, but without success.
What I would like to get is a new data frame with only the rows of the df_team_attributes and columns from both data frames.
Thank you in advance!
Added at the request of Correlien:
output of print(df_match[['date', 'home_team_api_id', 'away_team_api_id', 'win_home', 'win_away', 'draw', 'win']].head(10).to_dict())
{'date': {0: '2008-08-17 00:00:00', 1: '2008-08-16 00:00:00', 2: '2008-08-16 00:00:00', 3: '2008-08-17 00:00:00', 4: '2008-08-16 00:00:00', 5: '2008-09-24 00:00:00', 6: '2008-08-16 00:00:00', 7: '2008-08-16 00:00:00', 8: '2008-08-16 00:00:00', 9: '2008-11-01 00:00:00'}, 'home_team_api_id': {0: 9987, 1: 10000, 2: 9984, 3: 9991, 4: 7947, 5: 8203, 6: 9999, 7: 4049, 8: 10001, 9: 8342}, 'away_team_api_id': {0: 9993, 1: 9994, 2: 8635, 3: 9998, 4: 9985, 5: 8342, 6: 8571, 7: 9996, 8: 9986, 9: 8571}, 'win_home': {0: 0, 1: 0, 2: 0, 3: 1, 4: 0, 5: 0, 6: 0, 7: 0, 8: 1, 9: 1}, 'win_away': {0: 0, 1: 0, 2: 1, 3: 0, 4: 1, 5: 0, 6: 0, 7: 1, 8: 0, 9: 0}, 'draw': {0: 1, 1: 1, 2: 0, 3: 0, 4: 0, 5: 1, 6: 1, 7: 0, 8: 0, 9: 0}, 'win': {0: 0, 1: 0, 2: 1, 3: 1, 4: 1, 5: 0, 6: 0, 7: 1, 8: 1, 9: 1}}
output for print(df_team_attributes[['date', 'team_api_id', 'buildUpPlaySpeed', 'buildUpPlaySpeedClass']].head(10).to_dict())
{'date': {0: '2010-02-22 00:00:00', 1: '2014-09-19 00:00:00', 2: '2015-09-10 00:00:00', 3: '2010-02-22 00:00:00', 4: '2011-02-22 00:00:00', 5: '2012-02-22 00:00:00', 6: '2013-09-20 00:00:00', 7: '2014-09-19 00:00:00', 8: '2015-09-10 00:00:00', 9: '2010-02-22 00:00:00'}, 'team_api_id': {0: 9930, 1: 9930, 2: 9930, 3: 8485, 4: 8485, 5: 8485, 6: 8485, 7: 8485, 8: 8485, 9: 8576}, 'buildUpPlaySpeed': {0: 60, 1: 52, 2: 47, 3: 70, 4: 47, 5: 58, 6: 62, 7: 58, 8: 59, 9: 60}, 'buildUpPlaySpeedClass': {0: 'Balanced', 1: 'Balanced', 2: 'Balanced', 3: 'Fast', 4: 'Balanced', 5: 'Balanced', 6: 'Balanced', 7: 'Balanced', 8: 'Balanced', 9: 'Balanced'}}
Have you tried casting your date columns to the correct format and then attempting the merge? The following worked for me based on the example that you provided:
# Casting to date
df_match["date"] = pd.to_datetime(df_match["date"])
df_team_attributes["date"] = pd.to_datetime(df_team_attributes["date"])
# Merging on the date field alone
df_team_performance = pd.merge(df_team_attributes, df_match,
how = 'left',
on = 'date')
# Filtering out the required rows
result = df_team_performance.query("(team_api_id == home_team_api_id) | (team_api_id == away_team_api_id)")
Please let me know if my understanding of your question is correct.
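An alternative that avoids the large intermediate frame produced by merging on the date alone is to reshape the match data so that each match contributes one row per team, then merge on both keys. The sketch below is only an illustration: the two miniature frames are hypothetical (values chosen so that some rows match), and `match_long` and `side` are names introduced here.

```python
import pandas as pd

# Hypothetical miniature frames; the values are made up so that some rows match
df_match = pd.DataFrame({
    "date": pd.to_datetime(["2010-02-22", "2010-02-22", "2011-02-22"]),
    "home_team_api_id": [9930, 8485, 8576],
    "away_team_api_id": [8576, 9987, 9930],
    "win_home": [1, 0, 0],
})
df_team_attributes = pd.DataFrame({
    "date": pd.to_datetime(["2010-02-22", "2010-02-22", "2014-09-19"]),
    "team_api_id": [9930, 8485, 9930],
    "buildUpPlaySpeed": [60, 70, 52],
})

# One row per (match, side): both id columns collapse into 'team_api_id'
match_long = df_match.melt(
    id_vars=["date", "win_home"],
    value_vars=["home_team_api_id", "away_team_api_id"],
    var_name="side",
    value_name="team_api_id",
)

# A left merge keeps every row of df_team_attributes, matched or not
df_team_performance = df_team_attributes.merge(
    match_long, how="left", on=["date", "team_api_id"]
)
print(df_team_performance)
```

The `side` column records whether the team played at home or away. Rows of df_team_attributes without a match on that date are kept with missing values in the match columns. Note that melt drops the opponent's id unless it is carried along separately.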
Essentially, my program finds the player with the most yards per carry, but the player it returns has only a couple of attempts.
I'm trying to filter the players so that I only consider those with more than 200 yards so far this season.
All of the data comes from a CSV file, so it has to be done through pandas.
import pandas as pd
wide_receiver = pd.read_csv('nfl-flex.csv')
wide_receiver['ypc'] = wide_receiver.reyds / wide_receiver.rec
wr_ypc = wide_receiver[wide_receiver['pos'] == 'WR']['ypc'].max()
yards_leader = wide_receiver.loc[wide_receiver['ypc'] == wr_ypc]
print(yards_leader['name'])
I'm not quite sure how to filter out those players with less than 200 yards.
Output:
{'id': {0: 11706, 1: 11791, 2: 11792, 3: 11793, 4: 11810}, 'name': {0: 'Mark Ingram', 1: 'Rob Gronkowski', 2: 'Marcedes Lewis', 3: 'Jimmy Graham', 4: 'Jared Cook'}, 'fpts': {0: 100.5, 1: 90.8, 2: 26.1, 3: 21.8, 4: 90.1}, 'gp': {0: 11, 1: 6, 2: 12, 3: 9, 4: 11}, 'cmp': {0: 0, 1: 0, 2: 0, 3: 0, 4: 0}, 'att': {0: 0, 1: 0, 2: 0, 3: 0, 4: 0}, 'payds': {0: 0, 1: 0, 2: 0, 3: 0, 4: 0}, 'patd': {0: 0, 1: 0, 2: 0, 3: 0, 4: 0}, 'int': {0: 0, 1: 0, 2: 0, 3: 0, 4: 0}, 'ruatt': {0: 137, 1: 0, 2: 0, 3: 0, 4: 0}, 'ruyds': {0: 499, 1: 0, 2: 0, 3: 0, 4: 0}, 'rutd': {0: 2, 1: 0, 2: 0, 3: 0, 4: 0}, 'tar': {0: 31, 1: 39, 2: 17, 3: 12, 4: 55}, 'rec': {0: 24, 1: 29, 2: 14, 3: 6, 4: 33}, 'rzatt': {0: 22, 1: 0, 2: 0, 3: 0, 4: 0}, 'rztar': {0: 5, 1: 8, 2: 2, 3: 5, 4: 7}, 'reyds': {0: 156, 1: 378, 2: 121, 3: 98, 4: 371}, 'retd': {0: 0, 1: 4, 2: 0, 3: 1, 4: 3}, 'fuml': {0: 1, 1: 0, 2: 0, 3: 0, 4: 0}, 'putd': {0: 0, 1: 0, 2: 0, 3: 0, 4: 0}, 'krtd': {0: 0, 1: 0, 2: 0, 3: 0, 4: 0}, 'fumtd': {0: 0, 1: 0, 2: 0, 3: 0, 4: 0}, '2ptpa': {0: 0, 1: 0, 2: 0, 3: 0, 4: 0}, '2ptru': {0: 0, 1: 0, 2: 0, 3: 0, 4: 0}, '2ptre': {0: 0, 1: 0, 2: 0, 3: 0, 4: 1}, 'pct': {0: '0.00%', 1: '0.00%', 2: '0.00%', 3: '0.00%', 4: '0.00%'}, 'ruypc': {0: 3.64, 1: 0.0, 2: 0.0, 3: 0.0, 4: 0.0}, 'reypc': {0: 6.5, 1: 13.03, 2: 8.64, 3: 16.33, 4: 11.24}, 'tchs': {0: 161, 1: 29, 2: 14, 3: 6, 4: 33}, 'tyds': {0: 655, 1: 378, 2: 121, 3: 98, 4: 371}, 'team': {0: 'NOS', 1: 'TBB', 2: 'GBP', 3: 'CHI', 4: 'LAC'}, 'pos': {0: 'RB', 1: 'TE', 2: 'TE', 3: 'TE', 4: 'TE'}, 'ypc': {0: 6.5, 1: 13.03448275862069, 2: 8.642857142857142, 3: 16.333333333333332, 4: 11.242424242424242}}
You already filtered when you wrote yards_leader = wide_receiver.loc[wide_receiver['ypc'] == wr_ypc], so just apply the same concept.
import pandas as pd
sample_dict = {'id': {0: 11706, 1: 11791, 2: 11792, 3: 11793, 4: 11810}, 'name': {0: 'Mark Ingram', 1: 'Rob Gronkowski', 2: 'Marcedes Lewis', 3: 'Jimmy Graham', 4: 'Jared Cook'}, 'fpts': {0: 100.5, 1: 90.8, 2: 26.1, 3: 21.8, 4: 90.1}, 'gp': {0: 11, 1: 6, 2: 12, 3: 9, 4: 11}, 'cmp': {0: 0, 1: 0, 2: 0, 3: 0, 4: 0}, 'att': {0: 0, 1: 0, 2: 0, 3: 0, 4: 0}, 'payds': {0: 0, 1: 0, 2: 0, 3: 0, 4: 0}, 'patd': {0: 0, 1: 0, 2: 0, 3: 0, 4: 0}, 'int': {0: 0, 1: 0, 2: 0, 3: 0, 4: 0}, 'ruatt': {0: 137, 1: 0, 2: 0, 3: 0, 4: 0}, 'ruyds': {0: 499, 1: 0, 2: 0, 3: 0, 4: 0}, 'rutd': {0: 2, 1: 0, 2: 0, 3: 0, 4: 0}, 'tar': {0: 31, 1: 39, 2: 17, 3: 12, 4: 55}, 'rec': {0: 24, 1: 29, 2: 14, 3: 6, 4: 33}, 'rzatt': {0: 22, 1: 0, 2: 0, 3: 0, 4: 0}, 'rztar': {0: 5, 1: 8, 2: 2, 3: 5, 4: 7}, 'reyds': {0: 156, 1: 378, 2: 121, 3: 98, 4: 371}, 'retd': {0: 0, 1: 4, 2: 0, 3: 1, 4: 3}, 'fuml': {0: 1, 1: 0, 2: 0, 3: 0, 4: 0}, 'putd': {0: 0, 1: 0, 2: 0, 3: 0, 4: 0}, 'krtd': {0: 0, 1: 0, 2: 0, 3: 0, 4: 0}, 'fumtd': {0: 0, 1: 0, 2: 0, 3: 0, 4: 0}, '2ptpa': {0: 0, 1: 0, 2: 0, 3: 0, 4: 0}, '2ptru': {0: 0, 1: 0, 2: 0, 3: 0, 4: 0}, '2ptre': {0: 0, 1: 0, 2: 0, 3: 0, 4: 1}, 'pct': {0: '0.00%', 1: '0.00%', 2: '0.00%', 3: '0.00%', 4: '0.00%'}, 'ruypc': {0: 3.64, 1: 0.0, 2: 0.0, 3: 0.0, 4: 0.0}, 'reypc': {0: 6.5, 1: 13.03, 2: 8.64, 3: 16.33, 4: 11.24}, 'tchs': {0: 161, 1: 29, 2: 14, 3: 6, 4: 33}, 'tyds': {0: 655, 1: 378, 2: 121, 3: 98, 4: 371}, 'team': {0: 'NOS', 1: 'TBB', 2: 'GBP', 3: 'CHI', 4: 'LAC'}, 'pos': {0: 'RB', 1: 'TE', 2: 'TE', 3: 'TE', 4: 'TE'}, 'ypc': {0: 6.5, 1: 13.03448275862069, 2: 8.642857142857142, 3: 16.333333333333332, 4: 11.242424242424242}}
sample_df = pd.DataFrame(sample_dict)
filtered_sample_df = sample_df[sample_df['reyds'] > 200]
Output:
print(sample_df)
id name fpts gp cmp ... tchs tyds team pos ypc
0 11706 Mark Ingram 100.5 11 0 ... 161 655 NOS RB 6.500000
1 11791 Rob Gronkowski 90.8 6 0 ... 29 378 TBB TE 13.034483
2 11792 Marcedes Lewis 26.1 12 0 ... 14 121 GBP TE 8.642857
3 11793 Jimmy Graham 21.8 9 0 ... 6 98 CHI TE 16.333333
4 11810 Jared Cook 90.1 11 0 ... 33 371 LAC TE 11.242424
[5 rows x 33 columns]
print(filtered_sample_df)
id name fpts gp cmp ... tchs tyds team pos ypc
1 11791 Rob Gronkowski 90.8 6 0 ... 29 378 TBB TE 13.034483
4 11810 Jared Cook 90.1 11 0 ... 33 371 LAC TE 11.242424
[2 rows x 33 columns]
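To connect the filter back to the original goal, here is a rough sketch that combines the position and yardage conditions and then picks the row with the highest ypc. It uses a few columns from the sample rows above. Since the sample contains no WR rows, 'TE' stands in for the position, and the names `players`, `qualified` and `leader` are introduced only for this illustration.

```python
import pandas as pd

# A few columns from the sample rows shown above
players = pd.DataFrame({
    "name": ["Mark Ingram", "Rob Gronkowski", "Marcedes Lewis",
             "Jimmy Graham", "Jared Cook"],
    "pos": ["RB", "TE", "TE", "TE", "TE"],
    "rec": [24, 29, 14, 6, 33],
    "reyds": [156, 378, 121, 98, 371],
})
players["ypc"] = players["reyds"] / players["rec"]

# Combine both conditions with &; each condition needs its own parentheses.
# The sample has no WR rows, so 'TE' is used as a stand-in position.
qualified = players[(players["pos"] == "TE") & (players["reyds"] > 200)]

# idxmax returns the index label of the row with the highest ypc
leader = qualified.loc[qualified["ypc"].idxmax()]
print(leader["name"])
```

Without the yardage condition, the highest ypc in this sample belongs to a player with only 6 receptions, which is the situation described in the question.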
I have a Data frame in the following format:
df.columns=['Timestamp','Voltage','Temperature','string_1','string_2','string_3','string_4','string_5','string_6']
I want to run a for loop in the following manner:
for column in df[['string_1','string_2','string_3','string_4','string_5','string_6']]:
I want to run this loop without listing the column names individually, so that it covers every column whose name contains the word "string". How do I do this?
A small sample of the data:
{'Timestamp': {0: '2019-01-01 06:00:00+00:00',
1: '2019-01-01 06:05:00+00:00',
2: '2019-01-01 06:10:00+00:00',
3: '2019-01-01 06:15:00+00:00',
4: '2019-01-01 06:20:00+00:00',
5: '2019-01-01 06:25:00+00:00',
6: '2019-01-01 06:30:00+00:00'},
'Voltage': {0: 500, 1: 500, 2: 500, 3: 500, 4: 500, 5: 500, 6: 500},
'Temperature': {0: 25, 1: 25, 2: 25, 3: 25, 4: 25, 5: 25, 6: 25},
'string_1': {0: 1, 1: 2, 2: 3, 3: 4, 4: 5, 5: 6, 6: 7},
'string_2': {0: 3, 1: 4, 2: 1, 3: 2, 4: 5, 5: 8, 6: 9},
'string_3': {0: 2, 1: 4, 2: 5, 3: 5, 4: 3, 5: 4, 6: 9},
'string_4': {0: 1, 1: 4, 2: 7, 3: 2, 4: 2, 5: 3, 6: 1},
'string_5': {0: 3, 1: 4, 2: 1, 3: 2, 4: 5, 5: 8, 6: 9},
'string_6': {0: 1, 1: 2, 2: 3, 3: 4, 4: 5, 5: 6, 6: 7}}
Use DataFrame.filter:
df.filter(like='string')
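For the loop in the question, the filtered frame's columns can be iterated directly. The following is a small sketch on a reduced version of the sample data; `string_cols` and `prefixed` are names introduced here, and the sum is just a placeholder for whatever the loop body needs to do.

```python
import pandas as pd

# Reduced version of the sample data from the question
df = pd.DataFrame({
    "Voltage": [500, 500, 500],
    "string_1": [1, 2, 3],
    "string_2": [3, 4, 1],
})

# Column labels whose name contains the substring 'string'
string_cols = df.filter(like="string").columns

for column in string_cols:
    # Placeholder body: replace with the real per-column work
    print(column, df[column].sum())

# A stricter match, if only names starting with 'string_' should qualify
prefixed = df.filter(regex=r"^string_")
```

The `like` argument matches the substring anywhere in the label, so the `regex` form may be safer if other columns could also contain the word.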
I'm trying to import data from Baseball Prospectus into a Python table / dictionary (which would be better?).
Below is what I have, based on following along to Automate The Boring Stuff with Python.
I get that my method isn't properly using these functions, but I can't figure out what tools I should be using.
import requests
import webbrowser
import bs4
res = requests.get('https://legacy.baseballprospectus.com/card/70917/trea-turner')
res.raise_for_status()
webpage = bs4.BeautifulSoup(res.text)
table = webpage.select('newstat_career_log_datagrid')
list = []
for item in table:
list.append(item)
print(list)
Use pandas to read the MLB statistics table into a DataFrame first, then convert the DataFrame into a dictionary. If you don't have pandas installed, you can install it with a single command.
pip install pandas
Then use the below code.
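import pandas as pd
# read_html returns a list of DataFrames, one per table found on the page
tables = pd.read_html('https://legacy.baseballprospectus.com/card/70917/trea-turner')
# The statistics table is at index 5 in that list
data_dict = tables[5].to_dict()
print(data_dict)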
Output:
{'PA': {0: 44, 1: 324, 2: 447, 3: 740, 4: 15, 5: 1570}, '2B': {0: 1, 1: 14, 2: 24, 3: 27, 4: 1, 5: 67}, 'TEAM': {0: 'WAS', 1: 'WAS', 2: 'WAS', 3: 'WAS', 4: 'WAS', 5: 'Career'}, 'SB': {0: 2, 1: 33, 2: 46, 3: 43, 4: 4, 5: 128}, 'G': {0: 27, 1: 73, 2: 98, 3: 162, 4: 4, 5: 364}, 'HR': {0: 1, 1: 13, 2: 11, 3: 19, 4: 2, 5: 46}, 'FRAA': {0: 0.5, 1: -3.2, 2: 0.2, 3: 7.1, 4: -0.1, 5: 4.5}, 'BWARP': {0: 0.1, 1: 2.4, 2: 2.7, 3: 5.0, 4: 0.1, 5: 10.4}, 'CS': {0: 2, 1: 6, 2: 8, 3: 9, 4: 0, 5: 25}, '3B': {0: 0, 1: 8, 2: 6, 3: 6, 4: 0, 5: 20}, 'H': {0: 9, 1: 105, 2: 117, 3: 180, 4: 5, 5: 416}, 'AGE': {0: '22', 1: '23', 2: '24', 3: '25', 4: '26', 5: 'Career'}, 'OBP': {0: 0.295, 1: 0.37, 2: 0.33799999999999997, 3: 0.344, 4: 0.4, 5: 0.34700000000000003}, 'AVG': {0: 0.225, 1: 0.342, 2: 0.284, 3: 0.271, 4: 0.35700000000000004, 5: 0.289}, 'DRC+': {0: 77, 1: 128, 2: 99, 3: 107, 4: 103, 5: 108}, 'SO': {0: 12, 1: 59, 2: 80, 3: 132, 4: 5, 5: 288}, 'YEAR': {0: '2015', 1: '2016', 2: '2017', 3: '2018', 4: '2019', 5: 'Career'}, 'SLG': {0: 0.325, 1: 0.5670000000000001, 2: 0.451, 3: 0.41600000000000004, 4: 0.857, 5: 0.46}, 'DRAA': {0: -1.0, 1: 11.4, 2: 1.0, 3: 8.5, 4: 0.1, 5: 20.0}, 'HBP': {0: 0, 1: 1, 2: 4, 3: 5, 4: 0, 5: 10}, 'BRR': {0: 0.1, 1: 5.9, 2: 6.8, 3: 2.7, 4: 0.2, 5: 15.7}, 'BB': {0: 4, 1: 14, 2: 30, 3: 69, 4: 1, 5: 118}}
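On the "table or dictionary" part of the question: the output above is keyed by column, which may be awkward if you want one record per season. As a sketch, `to_dict(orient="records")` gives a list with one dictionary per row instead. The small frame below copies the first two rows of three columns from the output above, and `stats`, `by_column` and `by_row` are illustrative names.

```python
import pandas as pd

# First two rows of a few columns from the output above
stats = pd.DataFrame({
    "YEAR": ["2015", "2016"],
    "TEAM": ["WAS", "WAS"],
    "PA": [44, 324],
})

# Default orientation: {column -> {index -> value}}
by_column = stats.to_dict()

# Row orientation: a list with one dictionary per row
by_row = stats.to_dict(orient="records")
print(by_row)
```

Keeping the data as a DataFrame is often the more convenient choice for further analysis; the dictionary forms are mainly useful for serialization or plain-Python processing.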