I am having tremendous difficulty getting my data sorted. I'm at the point where I could have manually created a new .csv file in the time I have spent trying to figure this out, but I need to do this through code. I have a large dataset of baseball salaries by player going back 150 years.
This is what my dataset looks like.
I want to create a new dataframe that adds up the individual player salaries for a given team in a given year, organized by team and by year. I have come up with this:
team_salaries_groupby_team = salaries.groupby(['teamID','yearID']).agg({'salary' : ['sum']})
which outputs this: my output. On screen it looks sort of like what I want, but I want a dataframe with three columns (plus an index on the left). I can't really do the sort of analysis I want with this output.
Lastly, I have also tried this method:
new_column = salaries['teamID'] + salaries['yearID'].astype(str)
salaries['teamyear'] = new_column
teamyear = salaries.groupby(['teamyear']).agg({'salary' : ['sum']})
print(teamyear)
Another output. It adds the individual player salaries per team for a given year, but now I don't know how to separate the year back out into its own column. Help, please?
You just need to reset_index().
Here is sample code:
import pandas as pd

# DataFrame.append was removed in pandas 2.0, so build the frame from a list of dicts
salaries = pd.DataFrame([
    {'yearID': 1985, 'teamID': 'ATL', 'igID': 'NL', 'playerID': 'A', 'salary': 10000},
    {'yearID': 1985, 'teamID': 'ATL', 'igID': 'NL', 'playerID': 'B', 'salary': 20000},
    {'yearID': 1985, 'teamID': 'ATL', 'igID': 'NL', 'playerID': 'A', 'salary': 10000},
    {'yearID': 1985, 'teamID': 'ATL', 'igID': 'NL', 'playerID': 'C', 'salary': 5000},
    {'yearID': 1985, 'teamID': 'ATL', 'igID': 'NL', 'playerID': 'B', 'salary': 20000},
    {'yearID': 2016, 'teamID': 'ATL', 'igID': 'NL', 'playerID': 'A', 'salary': 100000},
    {'yearID': 2016, 'teamID': 'ATL', 'igID': 'NL', 'playerID': 'B', 'salary': 200000},
    {'yearID': 2016, 'teamID': 'ATL', 'igID': 'NL', 'playerID': 'C', 'salary': 50000},
    {'yearID': 2016, 'teamID': 'ATL', 'igID': 'NL', 'playerID': 'A', 'salary': 100000},
    {'yearID': 2016, 'teamID': 'ATL', 'igID': 'NL', 'playerID': 'B', 'salary': 200000},
])
After that, groupby and reset_index:
sample_df = salaries.groupby(['teamID', 'yearID']).salary.sum().reset_index()
Is this what you are looking for ?
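An alternative that avoids the MultiIndex in the first place is as_index=False combined with named aggregation, which gives flat columns directly. A minimal sketch on toy data (column names taken from the question):

```python
import pandas as pd

# Toy data standing in for the salaries dataset
salaries = pd.DataFrame({
    'yearID': [1985, 1985, 2016],
    'teamID': ['ATL', 'ATL', 'ATL'],
    'salary': [10000, 20000, 100000],
})

# as_index=False keeps teamID/yearID as regular columns, so no reset_index is
# needed; named aggregation avoids the nested ('salary', 'sum') column header
team_year = salaries.groupby(['teamID', 'yearID'], as_index=False).agg(salary=('salary', 'sum'))
print(team_year)
```

Either way, the result is a plain three-column dataframe ready for further analysis.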
So I have this dataset of temperatures. Each line describes the temperature in Celsius measured by hour in a day.
So, I need to compute a new variable called avg_temp_ar_mensal, which represents the average temperature of a city in a month. City in this dataset is represented as estacao and month as mes.
I'm trying to do this using pandas. The following line of code is the one I'm trying to use to solve this problem:
df2['avg_temp_ar_mensal'] = df2['temp_ar'].groupby(df2['mes', 'estacao']).mean()
The goal of this code is to store in a new column the average of the temperature of the city and month. But it doesn't work. If I try the following line of code:
df2['avg_temp_ar_mensal'] = df2['temp_ar'].groupby(df2['mes']).mean()
It works, but it is wrong: it calculates the mean across every city in the dataset, and I don't want that because it will add noise to my data. I need to separate each temperature by month and city and then calculate the mean.
The dataframe produced by groupby is smaller than the initial dataframe; that is why your code runs into an error.
There are two ways to solve this problem. The first one uses transform:
df.groupby(['mes', 'estacao'])['temp_ar'].transform('mean')
The second is to create a new dataframe dfn from the groupby and then merge it back into df:
dfn = df.groupby(['mes', 'estacao'])['temp_ar'].mean().reset_index(name='average')
df = pd.merge(df, dfn, on=['mes', 'estacao'], how='left')
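Put together on toy data (column names taken from the question), the transform approach looks like this:

```python
import pandas as pd

# Toy temperature data: estacao = city/station, mes = month
df = pd.DataFrame({
    'estacao': ['A', 'A', 'B', 'B'],
    'mes':     [1, 1, 1, 2],
    'temp_ar': [20.0, 22.0, 30.0, 10.0],
})

# transform returns a Series aligned with the original rows,
# so it can be assigned directly as a new column
df['avg_temp_ar_mensal'] = df.groupby(['mes', 'estacao'])['temp_ar'].transform('mean')
print(df)
```

Each row now carries the mean of its own (month, city) group.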
You are calling groupby on a single column when you do df2['temp_ar'].groupby(...); within a single column, there is nothing left to group by.
Instead, perform the groupby on all the columns you need, and make sure the final output is a Series aligned with the original rows (transform does this) rather than a smaller dataframe:
df['new_column'] = df.groupby(['city_column', 'month_column'])['temp_column'].transform('mean')
This should do the trick if I understand your dataset correctly. If not, please provide a reproducible version of your df.
I am currently working on a project that uses a data frame of almost 24,000 basketball games from the years 2004-2021. What I want to do in the end is make a single data frame that has only one row for each year, where the column values are the means for each category. What I have so far is a mask function that can separate by year, but I want to make a for loop that will go through the list of years, get the mean for each, and then concatenate the results into a new data frame. The code might help explain this better.
## now I want to separate this into datasets based on year, so I'll make a function;
## this will be used to separate by year. In my original dataset "SEASON" is the year.
def mask(year):
    mask = stats['SEASON'] == year
    year_mask = stats[mask]
    return year_mask
How can I make this into a loop that separates by year, finds the mean values of all columns in that year, and combines them into one data frame that should have 18 rows spanning 2004-2021?
If you are using Pandas dataframes it's best to let pandas do the work for you.
I assume you want to calculate the mean of some category in your dataframe grouped by the year. To do this we can create a function like so:
def foo(df, category):
    return df.groupby(by=["year"])[category].mean()
If you want the mean of all the categories, just use:
df.groupby(by=["year"]).mean()
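A minimal, runnable sketch of that one-liner, using the SEASON column named in the question as the year (the stat columns are made up for illustration):

```python
import pandas as pd

# Toy games data with a SEASON column, as in the question
stats = pd.DataFrame({
    'SEASON': [2004, 2004, 2005],
    'PTS':    [100, 90, 95],
    'AST':    [20, 30, 25],
})

# One row per season; each column holds that season's mean.
# numeric_only=True skips any non-numeric columns that may be present.
season_means = stats.groupby(by=['SEASON']).mean(numeric_only=True)
print(season_means)
```

No explicit loop or mask function is needed; groupby does the per-year split internally.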
I have 30 years of data collected from 385 stations. I would like to calculate the monthly average across all years for each individual station and export it to a CSV file. I am very new to coding and don't know how to execute this; please can someone help me sort this out? Herewith I have enclosed the code for one station. I need to prepare a CSV file like this for all 385 stations.
#selective column only
ap = data[data["station_id"] == 'C0A520']
ap = ap[['station_id', 'TEMP', 'YEAR', 'MONTH']]
grouped = ap.groupby(by=["YEAR","MONTH"])
monthly_mean = grouped.mean()
monthly_mean.head()
#export groupby
grouped.mean().reset_index().to_csv('D:/My_files/Research Progress/data/Temperature/final/coa520.csv')
I am assuming that your existing code works as intended and that you do not want to write the code for each of the 385 stations. This can be achieved in a simple for loop iterating over the station names:
for station in data["station_id"].unique():
    # selective column only
    ap = data[data["station_id"] == station]
    ap = ap[['station_id', 'TEMP', 'YEAR', 'MONTH']]
    grouped = ap.groupby(by=["YEAR", "MONTH"])
    # export groupby
    grouped.mean().reset_index().to_csv(f'D:/My_files/Research Progress/data/Temperature/final/{station}.csv')
(You did not use the monthly_mean variable, so I left it out.)
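If separate files per station are not strictly required, a single groupby that includes station_id computes every station's monthly means in one pass. A sketch on made-up data (the station ids and values here are invented for illustration):

```python
import pandas as pd

# Toy data standing in for the 385-station temperature dataset
data = pd.DataFrame({
    'station_id': ['C0A520', 'C0A520', 'X1B999'],
    'YEAR':  [2000, 2000, 2000],
    'MONTH': [1, 1, 1],
    'TEMP':  [10.0, 12.0, 20.0],
})

# One groupby over all stations at once; as_index=False keeps the keys as columns
monthly_mean = data.groupby(by=['station_id', 'YEAR', 'MONTH'], as_index=False)['TEMP'].mean()
print(monthly_mean)
```

The combined result could then be written to one CSV, or split per station afterwards if needed.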
I'm trying to learn how to use python and I've never used pandas before. I wanted to create a simple calculation using excel data - here's an example of the excel data:
Example Data
There are 3 columns, Unique ID, Vehicle and the Hours.
I know how to make a simple calculator in Python where I manually enter these values, but is it possible to extract data from the columns directly to do the calculation?
So ideally, it would pick up the ID itself, then the vehicle type with its pay (the pay defined within the code, e.g. Bike = 15.00), multiplied by the number of hours to give the total pay?
ID: 28392
Vehicle: Bike
Hours: 40
Total: $600
Hopefully this makes sense, thanks in advance!
First, you need to load your dataset into a pandas dataframe, which you can do with the following command:
import pandas as pd
df = pd.read_excel('name_of_file_here.xlsx',sheet_name='Sheet_name_here')
So your excel data is now a pandas dataframe called df.
If the pay rate is the same for all vehicles, you can do the following.
rate = 15
df['Pay'] = df['Hours']*rate
This creates a new column in your dataframe called 'Pay' by multiplying the rate, which is 15, by each value in the Hours column.
If, however, the rate differs between vehicle types, you can select just the matching rows with .loc and write the result into the 'Pay' column (multiplying the whole row, as in df[df['Vehicle']=='Bike']*bike_rate, would also try to multiply the string columns):
bike_rate = 15
df.loc[df['Vehicle']=='Bike', 'Pay'] = df.loc[df['Vehicle']=='Bike', 'Hours']*bike_rate
cargo_bike_rate = 20
df.loc[df['Vehicle']=='Cargo-Bike', 'Pay'] = df.loc[df['Vehicle']=='Cargo-Bike', 'Hours']*cargo_bike_rate
This selects the rows of the dataframe where Vehicle equals the type you want and operates only on those rows.
Another way, and the best way I think, is to use a function.
def calculate_pay(vehicle, hours):
    if vehicle == 'Bike':
        rate = 15
    elif vehicle == 'Cargo-Bike':
        rate = 20
    # And so on ....
    else:
        rate = 10
    return hours*rate
Then you can apply this function to your dataframe.
df['Pay'] = df.apply(lambda x: calculate_pay(x['Vehicle'],x['Hours']),axis=1)
This creates a new column in your dataframe called 'Pay' and applies a function called calculate_pay which takes inputs vehicle and hours from the dataframe and returns the pay.
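A minimal sketch tying this back to the worked example in the question (ID 28392, Bike, 40 hours); the rates are the illustrative values used above:

```python
import pandas as pd

def calculate_pay(vehicle, hours):
    # Illustrative rates from the question
    if vehicle == 'Bike':
        rate = 15
    elif vehicle == 'Cargo-Bike':
        rate = 20
    else:
        rate = 10
    return hours * rate

# Toy frame matching the worked example in the question
df = pd.DataFrame({'Id': [28392], 'Vehicle': ['Bike'], 'Hours': [40]})

# Apply the function row by row to build the Pay column
df['Pay'] = df.apply(lambda x: calculate_pay(x['Vehicle'], x['Hours']), axis=1)
print(df)  # Pay is 15 * 40 = 600, matching the $600 total above
```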
To print your results on screen, you can simply type df and press Enter if you are using a Jupyter notebook; to select the specific columns you mentioned in the comments, you can do the following.
df[['Id','Vehicle','Hours','Pay']]
To save back to excel you can do the following:
df.to_excel('output.xlsx')
I have excel data file with thousands of rows and columns.
I am using python and have started using pandas dataframes to analyze data.
What I want to do in column D is to calculate annual change for values in column C for each year for each ID.
I can use excel to do this – if the org ID is the same as in the prior row, calculate the annual change (leaving the cells highlighted in blue empty, because that's the first period for that particular ID). I don't know how to do this using python. Can anyone help?
Assuming the dataframe is already sorted:
df.groupby('ID').Cash.pct_change()
However, you can speed things up by relying on the assumption that things are sorted, because it is not necessary to group in order to calculate the percentage change from one row to the next:
df.Cash.pct_change().mask(
    df.ID != df.ID.shift()
)
These should produce the column values you are looking for. To add the column, assign the result to a column or create a new dataframe with it:
df['AnnChange'] = df.groupby('ID').Cash.pct_change()
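A small, self-contained sketch of the groupby/pct_change approach on toy data (column names from the question):

```python
import pandas as pd

# Toy data: two IDs with Cash values in consecutive periods
df = pd.DataFrame({
    'ID':   [1, 1, 1, 2, 2],
    'Cash': [100.0, 110.0, 99.0, 200.0, 300.0],
})

# The first row of each ID has no prior period, so its change is NaN,
# mirroring the blank first-period cells in the spreadsheet
df['AnnChange'] = df.groupby('ID').Cash.pct_change()
print(df)
```

The NaN rows correspond exactly to the first occurrence of each ID.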