I have a table with an "Id" column and multiple integer columns that I want to convert to categorical variables. I want to apply this transformation only to the integer columns and leave the Id column unchanged.
All the other methods I have found involve dropping the Id column. How do I do this without dropping it?
This is the code I currently have:
df= df.loc[:, df.columns != 'Id'].apply(lambda x: x.astype('category'))
Sample dataframe:
{'Id': {0: 0, 1: 1, 2: 2, 3: 3, 4: 4},
'Foundation': {0: 2, 1: 1, 2: 2, 3: 0, 4: 2},
'GarageFinish': {0: 1, 1: 1, 2: 1, 3: 2, 4: 1},
'LandSlope': {0: 0, 1: 0, 2: 0, 3: 0, 4: 0},
'LotConfig': {0: 4, 1: 2, 2: 4, 3: 0, 4: 2},
'GarageQual': {0: 4, 1: 4, 2: 4, 3: 4, 4: 4},
'GarageCond': {0: 4, 1: 4, 2: 4, 3: 4, 4: 4},
'LandContour': {0: 3, 1: 3, 2: 3, 3: 3, 4: 3},
'Utilities': {0: 0, 1: 0, 2: 0, 3: 0, 4: 0},
'GarageType': {0: 1, 1: 1, 2: 1, 3: 5, 4: 1},
'LotShape': {0: 3, 1: 3, 2: 0, 3: 0, 4: 0},
'Alley': {0: 2, 1: 2, 2: 2, 3: 2, 4: 2},
'Street': {0: 1, 1: 1, 2: 1, 3: 1, 4: 1},
'PoolQC': {0: 3, 1: 3, 2: 3, 3: 3, 4: 3},
'Fence': {0: 4, 1: 4, 2: 4, 3: 4, 4: 4},
'MiscFeature': {0: 4, 1: 4, 2: 4, 3: 4, 4: 4},
'MSZoning': {0: 3, 1: 3, 2: 3, 3: 3, 4: 3},
'SaleType': {0: 8, 1: 8, 2: 8, 3: 8, 4: 8},
'PavedDrive': {0: 2, 1: 2, 2: 2, 3: 2, 4: 2},
'FireplaceQu': {0: 5, 1: 4, 2: 4, 3: 2, 4: 4},
'Condition1': {0: 2, 1: 1, 2: 2, 3: 2, 4: 2},
'Functional': {0: 6, 1: 6, 2: 6, 3: 6, 4: 6},
'BsmtQual': {0: 2, 1: 2, 2: 2, 3: 3, 4: 2},
'BsmtCond': {0: 3, 1: 3, 2: 3, 3: 1, 4: 3},
'BsmtExposure': {0: 3, 1: 1, 2: 2, 3: 3, 4: 0},
'BsmtFinType1': {0: 2, 1: 0, 2: 2, 3: 0, 4: 2},
'ExterQual': {0: 2, 1: 3, 2: 2, 3: 3, 4: 2},
'BsmtFinType2': {0: 5, 1: 5, 2: 5, 3: 5, 4: 5},
'MasVnrType': {0: 1, 1: 2, 2: 1, 3: 2, 4: 1},
'Exterior2nd': {0: 13, 1: 8, 2: 13, 3: 15, 4: 13},
'Heating': {0: 1, 1: 1, 2: 1, 3: 1, 4: 1},
'Neighborhood': {0: 5, 1: 24, 2: 5, 3: 6, 4: 15},
'SaleCondition': {0: 4, 1: 4, 2: 4, 3: 0, 4: 4},
'Electrical': {0: 4, 1: 4, 2: 4, 3: 4, 4: 4},
'Exterior1st': {0: 12, 1: 8, 2: 12, 3: 13, 4: 12},
'RoofMatl': {0: 1, 1: 1, 2: 1, 3: 1, 4: 1},
'RoofStyle': {0: 1, 1: 1, 2: 1, 3: 1, 4: 1},
'HouseStyle': {0: 5, 1: 2, 2: 5, 3: 5, 4: 5},
'BldgType': {0: 0, 1: 0, 2: 0, 3: 0, 4: 0},
'Condition2': {0: 2, 1: 2, 2: 2, 3: 2, 4: 2},
'KitchenQual': {0: 2, 1: 3, 2: 2, 3: 2, 4: 2},
'ExterCond': {0: 4, 1: 4, 2: 4, 3: 4, 4: 4},
'CentralAir': {0: 1, 1: 1, 2: 1, 3: 1, 4: 1},
'HeatingQC': {0: 0, 1: 0, 2: 0, 3: 2, 4: 0}}
One way to do this is by isolating the Id column and then joining the converted columns:
df = df[['Id']].join(
df.loc[:, df.columns != 'Id'].astype('category')
)
Another way is to try:
df = df.groupby('Id').transform(lambda x: pd.Categorical(x)).reset_index(names = 'id')
I think the easier way would be to use astype directly, and provide a generated dictionary.
cast_df = df.astype({col: 'category' for col in df if col != 'Id'})
It's probably more performant than the other solutions too.
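As a minimal sketch of the dictionary approach, the snippet below uses three columns from the sample frame in the question (the subset is chosen only to keep the example short). It shows that "Id" keeps its integer dtype while the other columns become categorical.

```python
import pandas as pd

# Small subset of the sample frame from the question.
df = pd.DataFrame({
    "Id": [0, 1, 2, 3, 4],
    "Foundation": [2, 1, 2, 0, 2],
    "GarageFinish": [1, 1, 1, 2, 1],
})

# Build a {column: dtype} mapping for every column except "Id".
cast_df = df.astype({col: "category" for col in df.columns if col != "Id"})

# "Id" stays an integer column; the others are now categorical.
print(cast_df.dtypes)
```

astype returns a new frame, so df itself is left untouched and the column order is preserved.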
I am unable to drop the index column that pandas adds automatically. I melted a data frame, and for further processing I need to drop the index column, but I can't work out how.
Here is the data frame that is loaded into df:
{'Key': {0: 65162552161356, 1: 65162552635756, 2: 65162552843456, 3: 65162552842856, 4: 65162552736856}, '2021-04-01': {0: 31, 1: 0, 2: 281, 3: 207, 4: 55}, '2021-05-01': {0: 25, 1: 0, 2: 72, 3: 104, 4: 6}, '2021-06-01': {0: 16, 1: 0, 2: 108, 3: 32, 4: 14}, '2021-07-01': {0: 8, 1: 0, 2: 107, 3: 78, 4: 10}, '2021-08-01': {0: 21, 1: 0, 2: 80, 3: 40, 4: 9}, '2021-09-01': {0: 24, 1: 0, 2: 40, 3: 73, 4: 3}, '2021-10-01': {0: 13, 1: 0, 2: 36, 3: 79, 4: 11}, '2021-11-01': {0: 59, 1: 0, 2: 65, 3: 139, 4: 14}, '2021-12-01': {0: 51, 1: 0, 2: 41, 3: 87, 4: 10}, '2022-01-01': {0: 2, 1: 0, 2: 43, 3: 47, 4: 6}, '2022-02-01': {0: 0, 1: 0, 2: 0, 3: 63, 4: 3}, '2022-03-01': {0: 0, 1: 0, 2: 16, 3: 76, 4: 18}, '2022-04-01': {0: 0, 1: 0, 2: 37, 3: 32, 4: 8}, '2022-05-01': {0: 0, 1: 0, 2: 106, 3: 96, 4: 40}, '2022-06-01': {0: 0, 1: 0, 2: 101, 3: 75, 4: 16}, '2022-07-01': {0: 0, 1: 0, 2: 60, 3: 46, 4: 14}, '2022-08-01': {0: 0, 1: 0, 2: 73, 3: 91, 4: 13}, '2022-09-01': {0: 0, 1: 0, 2: 19, 3: 17, 4: 2}}
Can someone help me out and let me know how to make the changes? This is my code:
import pandas as pd

df = pd.read_excel('C:/X/X/X/Demand_Data_Used.xlsx')
df['Key'] = df['Key'].astype(str)
# Reshape from wide (one column per month) to long format
df = pd.melt(df, id_vars='Key', value_vars=list(df.columns[1:]), var_name='ds')
df = df.rename(columns={'Key': 'unique_id', 'value': 'y'})
df['ds'] = pd.to_datetime(df['ds'], format='%Y-%m-%d')
df = df[['ds', 'unique_id', 'y']]
df
The df data frame looks like this after the completion of this operation:
I would like it to look like this:
I know it doesn't contain the same values; I was just trying to show the expected layout. Can someone help me figure out the correct way to drop the index column?
It looks like you don't actually want to drop the index column, but instead assign another column as the index. You can do this very easily with this assignment:
df.index = df["unique_id"]
Now if you want to drop the column afterwards (the value would now show up twice essentially), you can do this as well:
df = df.drop("unique_id", axis=1)
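The two steps above can also be done in one call with set_index, which moves the column into the index and removes it from the columns. The sketch below uses a small hypothetical frame shaped like the melted result in the question; it also shows the index=False option of to_string and to_csv, in case the goal is only to leave the row labels out of the output.

```python
import pandas as pd

# Hypothetical long-format frame, shaped like the melted result in the question.
df = pd.DataFrame({
    "ds": pd.to_datetime(["2021-04-01", "2021-05-01", "2021-04-01"]),
    "unique_id": ["65162552161356", "65162552161356", "65162552635756"],
    "y": [31, 25, 0],
})

# Move "unique_id" into the index and drop it from the columns in one step.
indexed = df.set_index("unique_id")

# Alternatively, keep the default index and just omit it when printing or exporting.
text = df.to_string(index=False)
csv_text = df.to_csv(index=False)

print(indexed)
print(text)
```

A DataFrame always has an index, so it can be replaced or hidden but not removed outright.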
I am trying to merge 2 dataframes and am having trouble figuring out how, as it is not straightforward.
One data frame has match results for over 25000 games and looks like this.
The second one has team performance metrics but only for around 1500 games.
As I am not allowed to post pictures yet, here are the column names of interest:
df_match[['date', 'home_team_api_id', 'away_team_api_id']]
df_team_attributes[['date', 'team_api_id']]
Both data frames have additional columns with results or performance metrics.
To be able to merge correctly, I need to merge by date and by looking if the 'team_api_id' matches either 'home...' or 'away_team_api_id'
This is what I have tried until now:
df_team_performance = pd.merge(df_team_attributes, df_match,
how = 'left',
left_on = ['date', 'team_api_id', 'team_api_id'],
right_on = ['date', 'home_team_api_id', 'home_team_api_id'])
I have also tried with only 2 columns, but without success.
What I would like to get is a new data frame with only the rows of the df_team_attributes and columns from both data frames.
Thank you in advance!
Added at the request of Correlien:
output of print(df_match[['date', 'home_team_api_id', 'away_team_api_id', 'win_home', 'win_away', 'draw', 'win']].head(10).to_dict())
{'date': {0: '2008-08-17 00:00:00', 1: '2008-08-16 00:00:00', 2: '2008-08-16 00:00:00', 3: '2008-08-17 00:00:00', 4: '2008-08-16 00:00:00', 5: '2008-09-24 00:00:00', 6: '2008-08-16 00:00:00', 7: '2008-08-16 00:00:00', 8: '2008-08-16 00:00:00', 9: '2008-11-01 00:00:00'}, 'home_team_api_id': {0: 9987, 1: 10000, 2: 9984, 3: 9991, 4: 7947, 5: 8203, 6: 9999, 7: 4049, 8: 10001, 9: 8342}, 'away_team_api_id': {0: 9993, 1: 9994, 2: 8635, 3: 9998, 4: 9985, 5: 8342, 6: 8571, 7: 9996, 8: 9986, 9: 8571}, 'win_home': {0: 0, 1: 0, 2: 0, 3: 1, 4: 0, 5: 0, 6: 0, 7: 0, 8: 1, 9: 1}, 'win_away': {0: 0, 1: 0, 2: 1, 3: 0, 4: 1, 5: 0, 6: 0, 7: 1, 8: 0, 9: 0}, 'draw': {0: 1, 1: 1, 2: 0, 3: 0, 4: 0, 5: 1, 6: 1, 7: 0, 8: 0, 9: 0}, 'win': {0: 0, 1: 0, 2: 1, 3: 1, 4: 1, 5: 0, 6: 0, 7: 1, 8: 1, 9: 1}}
output for print(df_team_attributes[['date', 'team_api_id', 'buildUpPlaySpeed', 'buildUpPlaySpeedClass']].head(10).to_dict())
{'date': {0: '2010-02-22 00:00:00', 1: '2014-09-19 00:00:00', 2: '2015-09-10 00:00:00', 3: '2010-02-22 00:00:00', 4: '2011-02-22 00:00:00', 5: '2012-02-22 00:00:00', 6: '2013-09-20 00:00:00', 7: '2014-09-19 00:00:00', 8: '2015-09-10 00:00:00', 9: '2010-02-22 00:00:00'}, 'team_api_id': {0: 9930, 1: 9930, 2: 9930, 3: 8485, 4: 8485, 5: 8485, 6: 8485, 7: 8485, 8: 8485, 9: 8576}, 'buildUpPlaySpeed': {0: 60, 1: 52, 2: 47, 3: 70, 4: 47, 5: 58, 6: 62, 7: 58, 8: 59, 9: 60}, 'buildUpPlaySpeedClass': {0: 'Balanced', 1: 'Balanced', 2: 'Balanced', 3: 'Fast', 4: 'Balanced', 5: 'Balanced', 6: 'Balanced', 7: 'Balanced', 8: 'Balanced', 9: 'Balanced'}}
Have you tried casting your date columns to the correct type and then attempting the merge? The following worked for me based on the example that you provided:
# Casting to date
df_match["date"] = pd.to_datetime(df_match["date"])
df_team_attributes["date"] = pd.to_datetime(df_team_attributes["date"])
# Merging on the date field alone
df_team_performance = pd.merge(df_team_attributes, df_match,
how = 'left',
on = 'date')
# Filtering out the required rows
result = df_team_performance.query("(team_api_id == home_team_api_id) | (team_api_id == away_team_api_id)")
Please let me know if my understanding of your question is correct.
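An alternative sketch, not the approach above: reshape the match table so that each game gives one row per participating team, then do a left merge on both date and team_api_id. This avoids the large intermediate result of merging on date alone. The rows below are hypothetical (the sample rows in the question share no dates), and the "side" column name is an assumption introduced for illustration.

```python
import pandas as pd

# Hypothetical rows; the column names follow the question.
df_match = pd.DataFrame({
    "date": ["2010-02-22 00:00:00", "2010-02-22 00:00:00", "2011-02-22 00:00:00"],
    "home_team_api_id": [9930, 8485, 8576],
    "away_team_api_id": [8576, 9987, 9930],
    "win": [1, 0, 1],
})
df_team_attributes = pd.DataFrame({
    "date": ["2010-02-22 00:00:00", "2010-02-22 00:00:00", "2011-02-22 00:00:00"],
    "team_api_id": [9930, 8485, 8485],
    "buildUpPlaySpeed": [60, 70, 47],
})

# Make sure both date columns have the same datetime dtype.
df_match["date"] = pd.to_datetime(df_match["date"])
df_team_attributes["date"] = pd.to_datetime(df_team_attributes["date"])

# One row per (game, team): "side" records whether the team was home or away.
long_match = df_match.melt(
    id_vars=["date", "win"],
    value_vars=["home_team_api_id", "away_team_api_id"],
    var_name="side",
    value_name="team_api_id",
)

# The left merge keeps every row of df_team_attributes, matched or not.
df_team_performance = df_team_attributes.merge(
    long_match, how="left", on=["date", "team_api_id"]
)
print(df_team_performance)
```

Rows of df_team_attributes without a game on that date keep NaN in the match columns; dropping them afterwards gives the same rows as the filter-based version.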
I'm having trouble plotting multiple histograms. I get the error message:
pandas\hashtable.pyx in pandas.hashtable.PyObjectHashTable.get_item (pandas\hashtable.c:12368)()
pandas\hashtable.pyx in pandas.hashtable.PyObjectHashTable.get_item (pandas\hashtable.c:12322)()
KeyError: 0
This is the code I wrote:
xaxes = ['price','bedrooms','sqft_living','sqft_lot','floors','waterfront',
'view','condition','grade','sqft_above','sqft_basement','yr_built',
'yr_renovated','zipcode','lat','long','sqft_living15','sqft_loft15']
a,b = plt.subplots(4,5)
b = b.ravel()
for idx,ax in enumerate(b):
    ax.hist(file[idx])
    ax.set_title(titles[idx])
    ax.set_xlabel(xaxes[id])
plt.tight_layout()
Here is a sample of my data:
{'bathrooms': {0: 1.0,
1: 2.25,
2: 1.0,
3: 3.0,
4: 2.0,
5: 4.5,
6: 2.25,
7: 1.5,
8: 1.0,
9: 2.5},
'bedrooms': {0: 3, 1: 3, 2: 2, 3: 4, 4: 3, 5: 4, 6: 3, 7: 3, 8: 3, 9: 3},
'condition': {0: 3, 1: 3, 2: 3, 3: 5, 4: 3, 5: 3, 6: 3, 7: 3, 8: 3, 9: 3},
'date': {0: '20141013T000000',
1: '20141209T000000',
2: '20150225T000000',
3: '20141209T000000',
4: '20150218T000000',
5: '20140512T000000',
6: '20140627T000000',
7: '20150115T000000',
8: '20150415T000000',
9: '20150312T000000'},
'floors': {0: 1.0,
1: 2.0,
2: 1.0,
3: 1.0,
4: 1.0,
5: 1.0,
6: 2.0,
7: 1.0,
8: 1.0,
9: 2.0},
'grade': {0: 7, 1: 7, 2: 6, 3: 7, 4: 8, 5: 11, 6: 7, 7: 7, 8: 7, 9: 7},
'id': {0: 7129300520,
1: 6414100192,
2: 5631500400,
3: 2487200875,
4: 1954400510,
5: 7237550310,
6: 1321400060,
7: 2008000270,
8: 2414600126,
9: 3793500160},
'lat': {0: 47.511200000000002,
1: 47.721000000000004,
2: 47.737900000000003,
3: 47.520800000000001,
4: 47.616799999999998,
5: 47.656100000000002,
6: 47.309699999999999,
7: 47.409500000000001,
8: 47.512300000000003,
9: 47.368400000000001},
'long': {0: -122.25700000000001,
1: -122.319,
2: -122.23299999999999,
3: -122.39299999999999,
4: -122.045,
5: -122.005,
6: -122.32700000000001,
7: -122.315,
8: -122.337,
9: -122.03100000000001},
'price': {0: 221900.0,
1: 538000.0,
2: 180000.0,
3: 604000.0,
4: 510000.0,
5: 1230000.0,
6: 257500.0,
7: 291850.0,
8: 229500.0,
9: 323000.0},
'sqft_above': {0: 1180,
1: 2170,
2: 770,
3: 1050,
4: 1680,
5: 3890,
6: 1715,
7: 1060,
8: 1050,
9: 1890},
'sqft_basement': {0: 0,
1: 400,
2: 0,
3: 910,
4: 0,
5: 1530,
6: 0,
7: 0,
8: 730,
9: 0},
'sqft_living': {0: 1180,
1: 2570,
2: 770,
3: 1960,
4: 1680,
5: 5420,
6: 1715,
7: 1060,
8: 1780,
9: 1890},
'sqft_living15': {0: 1340,
1: 1690,
2: 2720,
3: 1360,
4: 1800,
5: 4760,
6: 2238,
7: 1650,
8: 1780,
9: 2390},
'sqft_lot': {0: 5650,
1: 7242,
2: 10000,
3: 5000,
4: 8080,
5: 101930,
6: 6819,
7: 9711,
8: 7470,
9: 6560},
'sqft_lot15': {0: 5650,
1: 7639,
2: 8062,
3: 5000,
4: 7503,
5: 101930,
6: 6819,
7: 9711,
8: 8113,
9: 7570},
'view': {0: 0, 1: 0, 2: 0, 3: 0, 4: 0, 5: 0, 6: 0, 7: 0, 8: 0, 9: 0},
'waterfront': {0: 0, 1: 0, 2: 0, 3: 0, 4: 0, 5: 0, 6: 0, 7: 0, 8: 0, 9: 0},
'yr_built': {0: 1955,
1: 1951,
2: 1933,
3: 1965,
4: 1987,
5: 2001,
6: 1995,
7: 1963,
8: 1960,
9: 2003},
'yr_renovated': {0: 0,
1: 1991,
2: 0,
3: 0,
4: 0,
5: 0,
6: 0,
7: 0,
8: 0,
9: 0},
'zipcode': {0: 98178,
1: 98125,
2: 98028,
3: 98136,
4: 98074,
5: 98053,
6: 98003,
7: 98198,
8: 98146,
9: 98038}}
If I understand you correctly and the variable file contains your data as a pandas DataFrame, then the problem is simply how you are indexing that DataFrame.
file[idx] asks for the column labelled idx (here the integer 0). Your columns are labelled with strings such as 'price', so no column with that label exists and pandas raises KeyError: 0. Select the column by name with file[xaxes[idx]], or by position with file.iloc[:, idx].
Check the pandas documentation on indexing and selecting data for more details.
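A minimal sketch of the difference, using three columns from the sample data in the question (the subset is only there to keep it short):

```python
import pandas as pd

# A few columns from the sample data in the question.
file = pd.DataFrame({
    "price": [221900.0, 538000.0, 180000.0],
    "bedrooms": [3, 3, 2],
    "sqft_living": [1180, 2570, 770],
})
xaxes = ["price", "bedrooms", "sqft_living"]

# file[0] looks for a column labelled 0, which does not exist.
try:
    file[0]
    raised = False
except KeyError:
    raised = True
print("file[0] raised KeyError:", raised)

# Select each column by label, or by integer position with iloc.
for idx, name in enumerate(xaxes):
    by_label = file[name]
    by_position = file.iloc[:, idx]
    print(name, by_label.tolist(), by_position.tolist())
```

The two selections agree here only because xaxes lists the columns in the same order as the frame; selecting by label is the safer choice inside the plotting loop.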
I'm trying to import data from Baseball Prospectus into a Python table / dictionary (which would be better?).
Below is what I have, based on following along to Automate The Boring Stuff with Python.
I get that my method isn't properly using these functions, but I can't figure out what tools I should be using.
import requests
import webbrowser
import bs4
res = requests.get('https://legacy.baseballprospectus.com/card/70917/trea-turner')
res.raise_for_status()
webpage = bs4.BeautifulSoup(res.text)
table = webpage.select('newstat_career_log_datagrid')
list = []
for item in table:
    list.append(item)
print(list)
Use pandas to read the MLB statistics table into a DataFrame first, and then convert the DataFrame into a dictionary. If you don't have pandas installed, you can install it with a single command (read_html also needs an HTML parser such as lxml):
pip install pandas
Then use the below code.
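import pandas as pd
# read_html returns a list with one DataFrame per <table> found on the page
tables = pd.read_html('https://legacy.baseballprospectus.com/card/70917/trea-turner')
# Table 5 is the one whose contents are shown in the output below
data_dict = tables[5].to_dict()
print(data_dict)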
Output:
{'PA': {0: 44, 1: 324, 2: 447, 3: 740, 4: 15, 5: 1570}, '2B': {0: 1, 1: 14, 2: 24, 3: 27, 4: 1, 5: 67}, 'TEAM': {0: 'WAS', 1: 'WAS', 2: 'WAS', 3: 'WAS', 4: 'WAS', 5: 'Career'}, 'SB': {0: 2, 1: 33, 2: 46, 3: 43, 4: 4, 5: 128}, 'G': {0: 27, 1: 73, 2: 98, 3: 162, 4: 4, 5: 364}, 'HR': {0: 1, 1: 13, 2: 11, 3: 19, 4: 2, 5: 46}, 'FRAA': {0: 0.5, 1: -3.2, 2: 0.2, 3: 7.1, 4: -0.1, 5: 4.5}, 'BWARP': {0: 0.1, 1: 2.4, 2: 2.7, 3: 5.0, 4: 0.1, 5: 10.4}, 'CS': {0: 2, 1: 6, 2: 8, 3: 9, 4: 0, 5: 25}, '3B': {0: 0, 1: 8, 2: 6, 3: 6, 4: 0, 5: 20}, 'H': {0: 9, 1: 105, 2: 117, 3: 180, 4: 5, 5: 416}, 'AGE': {0: '22', 1: '23', 2: '24', 3: '25', 4: '26', 5: 'Career'}, 'OBP': {0: 0.295, 1: 0.37, 2: 0.33799999999999997, 3: 0.344, 4: 0.4, 5: 0.34700000000000003}, 'AVG': {0: 0.225, 1: 0.342, 2: 0.284, 3: 0.271, 4: 0.35700000000000004, 5: 0.289}, 'DRC+': {0: 77, 1: 128, 2: 99, 3: 107, 4: 103, 5: 108}, 'SO': {0: 12, 1: 59, 2: 80, 3: 132, 4: 5, 5: 288}, 'YEAR': {0: '2015', 1: '2016', 2: '2017', 3: '2018', 4: '2019', 5: 'Career'}, 'SLG': {0: 0.325, 1: 0.5670000000000001, 2: 0.451, 3: 0.41600000000000004, 4: 0.857, 5: 0.46}, 'DRAA': {0: -1.0, 1: 11.4, 2: 1.0, 3: 8.5, 4: 0.1, 5: 20.0}, 'HBP': {0: 0, 1: 1, 2: 4, 3: 5, 4: 0, 5: 10}, 'BRR': {0: 0.1, 1: 5.9, 2: 6.8, 3: 2.7, 4: 0.2, 5: 15.7}, 'BB': {0: 4, 1: 14, 2: 30, 3: 69, 4: 1, 5: 118}}
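On the table-versus-dictionary part of the question: to_dict() with no arguments nests the data by column, as in the output above. If one dictionary per season is easier to work with, orient="records" gives a list of row dictionaries instead. The sketch below rebuilds a small frame from three columns and two rows of that output, so it runs without fetching the page.

```python
import pandas as pd

# Three columns and two rows taken from the output above.
stats = pd.DataFrame({
    "YEAR": ["2015", "2016"],
    "PA": [44, 324],
    "HR": [1, 13],
})

# Default orientation: {column: {row_index: value}}.
by_column = stats.to_dict()

# One dictionary per row, which is often easier to loop over.
by_row = stats.to_dict(orient="records")

print(by_column)
print(by_row)
```

Keeping the data as a DataFrame is usually the more convenient choice for filtering or arithmetic; the dictionary forms are mainly useful for handing the data to code that does not use pandas.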