I'm trying to learn how to use Python and I've never used pandas before. I want to do a simple calculation using Excel data. Here's an example of the data:
Example data: there are 3 columns, Unique ID, Vehicle and Hours.
I know how to make a simple calculator in Python where I manually enter these values, but is it possible to pull the data straight from the columns to do the calculation?
So ideally, it would pick up the ID, look up the pay rate for the vehicle type (the pay defined within the code, e.g. Bike = 15.00), and multiply by the number of hours to give the total pay:
ID: 28392
Vehicle: Bike
Hours: 40
Total: $600
Hopefully this makes sense, thanks in advance!
First you need to load your dataset into a pandas DataFrame, which you can do with the following:
import pandas as pd
df = pd.read_excel('name_of_file_here.xlsx', sheet_name='Sheet_name_here')
Your Excel data is now a pandas DataFrame called df.
If the pay rate is the same for all vehicles, you can do the following:
rate = 15
df['Pay'] = df['Hours']*rate
This creates a new column in your dataframe called 'Pay' by multiplying each row of the Hours column by the rate (15).
If, however, the rate differs between vehicle types, you can do the following:
bike_rate = 15
df.loc[df['Vehicle'] == 'Bike', 'Pay'] = df.loc[df['Vehicle'] == 'Bike', 'Hours'] * bike_rate
cargo_bike_rate = 20
df.loc[df['Vehicle'] == 'Cargo-Bike', 'Pay'] = df.loc[df['Vehicle'] == 'Cargo-Bike', 'Hours'] * cargo_bike_rate
This selects the rows in the dataframe where Vehicle equals whatever type you want and operates only on those rows.
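As a quick runnable sketch of the per-vehicle rates with .loc, which restricts both the row selection and the column being written (the data here is made up to stand in for the Excel sheet):

```python
import pandas as pd

# Toy data standing in for the Excel sheet (values are made up)
df = pd.DataFrame({
    'ID': [28392, 10001],
    'Vehicle': ['Bike', 'Cargo-Bike'],
    'Hours': [40, 10],
})

# Compute pay only for the rows matching each vehicle type
df.loc[df['Vehicle'] == 'Bike', 'Pay'] = df.loc[df['Vehicle'] == 'Bike', 'Hours'] * 15
df.loc[df['Vehicle'] == 'Cargo-Bike', 'Pay'] = df.loc[df['Vehicle'] == 'Cargo-Bike', 'Hours'] * 20

print(df['Pay'].tolist())  # [600.0, 200.0]
```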
Another way, and in my opinion the best, is to use a function:
def calculate_pay(vehicle, hours):
    if vehicle == 'Bike':
        rate = 15
    elif vehicle == 'Cargo-Bike':
        rate = 20
    # and so on ...
    else:
        rate = 10
    return hours * rate
Then you can apply this function to your dataframe.
df['Pay'] = df.apply(lambda x: calculate_pay(x['Vehicle'],x['Hours']),axis=1)
This creates a new column in your dataframe called 'Pay' by applying calculate_pay, which takes the Vehicle and Hours values of each row and returns the pay.
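If you prefer keeping the rates in a dict rather than if/elif branches, Series.map does the same lookup in one vectorized line. This is a small sketch with made-up data ('Scooter' is just a hypothetical type to show the default rate kicking in):

```python
import pandas as pd

df = pd.DataFrame({
    'ID': [28392, 10002, 10003],
    'Vehicle': ['Bike', 'Cargo-Bike', 'Scooter'],  # 'Scooter' is made up
    'Hours': [40, 10, 5],
})

rates = {'Bike': 15, 'Cargo-Bike': 20}

# Unknown vehicle types fall back to the default rate of 10
df['Pay'] = df['Vehicle'].map(rates).fillna(10) * df['Hours']
print(df['Pay'].tolist())  # [600.0, 200.0, 50.0]
```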
To print your results on screen you can simply type df and press Enter if you are using a Jupyter notebook; or, to select the specific columns you mentioned in the comments, you can do the following:
df[['Id','Vehicle','Hours','Pay']]
To save back to Excel you can do the following:
df.to_excel('output.xlsx')
So I have this dataset of temperatures. Each line describes the temperature in Celsius measured hour by hour over a day.
I need to compute a new variable called avg_temp_ar_mensal, which represents the average temperature of a city in a month. In this dataset the city is represented as estacao and the month as mes.
I'm trying to do this using pandas. The following line of code is the one I'm trying to use to solve this problem:
df2['avg_temp_ar_mensal'] = df2['temp_ar'].groupby(df2['mes', 'estacao']).mean()
The goal of this code is to store the average temperature per city and month in a new column, but it doesn't work. If I try the following line of code instead:
df2['avg_temp_ar_mensal'] = df2['temp_ar'].groupby(df2['mes']).mean()
It works, but it's wrong: it calculates the mean across every city in the dataset, and I don't want that because it will add noise to my data. I need to separate the temperatures by month and city and then calculate the mean.
The dataframe after groupby is smaller than the initial dataframe; that is why your code runs into an error.
There are two ways to solve this problem. The first is to use transform, which returns one value per original row:
df2['avg_temp_ar_mensal'] = df2.groupby(['mes', 'estacao'])['temp_ar'].transform('mean')
The second is to create a new dataframe dfn from the groupby, then merge it back:
dfn = df2.groupby(['mes', 'estacao'])['temp_ar'].mean().reset_index(name='average')
df2 = pd.merge(df2, dfn, on=['mes', 'estacao'], how='left')
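Both approaches give the same per-row averages. Here is a tiny sketch with made-up temperatures (column names taken from the question):

```python
import pandas as pd

df = pd.DataFrame({
    'mes':     [1, 1, 1, 2],
    'estacao': ['A', 'A', 'B', 'A'],
    'temp_ar': [20.0, 22.0, 30.0, 18.0],
})

# Way 1: transform keeps the original row count
df['avg_temp_ar_mensal'] = df.groupby(['mes', 'estacao'])['temp_ar'].transform('mean')

# Way 2: aggregate, then merge back
dfn = df.groupby(['mes', 'estacao'])['temp_ar'].mean().reset_index(name='average')
df = pd.merge(df, dfn, on=['mes', 'estacao'], how='left')

print(df['avg_temp_ar_mensal'].tolist())  # [21.0, 21.0, 30.0, 18.0]
print(df['average'].tolist())             # [21.0, 21.0, 30.0, 18.0]
```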
You are calling groupby on a single column when you do df2['temp_ar'].groupby(...). That doesn't make much sense, since within a single column there is nothing to group by.
Instead, you have to perform the groupby on all the columns you need. Also, make sure the final output is a Series aligned to the original rows, not a DataFrame:
df['new_column'] = df[['city_column', 'month_column', 'temp_column']].groupby(['city_column', 'month_column'])['temp_column'].transform('mean')
This should do the trick if I understand your dataset correctly. If not, please provide a reproducible version of your df
I am having tremendous difficulty getting my data sorted. I'm at the point where I could have manually created a new .csv file in the time I have spent trying to figure this out, but I need to do this through code. I have a large dataset of baseball salaries by player going back 150 years.
This is what my dataset looks like.
I want to create a new dataframe that adds up the individual player salaries for a given team in a given year, organized by team and by year. I have come up with this:

team_salaries_groupby_team = salaries.groupby(['teamID', 'yearID']).agg({'salary': ['sum']})

On screen the output looks sort of like what I want, but I need a dataframe with three columns (plus an index on the left); I can't really do the sort of analysis I want with this output.
Lastly, I have also tried this method:

new_column = salaries['teamID'] + salaries['yearID'].astype(str)
salaries['teamyear'] = new_column
teamyear = salaries.groupby(['teamyear']).agg({'salary': ['sum']})
print(teamyear)

It adds the individual player salaries per team for a given year, but now I don't know how to separate the year back out into its own column. Help please?
You just need to reset_index()
Here is sample code :
import pandas as pd

salaries = pd.DataFrame([
    {'yearID': 1985, 'teamID': 'ATL', 'igID': 'NL', 'playerID': 'A', 'salary': 10000},
    {'yearID': 1985, 'teamID': 'ATL', 'igID': 'NL', 'playerID': 'B', 'salary': 20000},
    {'yearID': 1985, 'teamID': 'ATL', 'igID': 'NL', 'playerID': 'A', 'salary': 10000},
    {'yearID': 1985, 'teamID': 'ATL', 'igID': 'NL', 'playerID': 'C', 'salary': 5000},
    {'yearID': 1985, 'teamID': 'ATL', 'igID': 'NL', 'playerID': 'B', 'salary': 20000},
    {'yearID': 2016, 'teamID': 'ATL', 'igID': 'NL', 'playerID': 'A', 'salary': 100000},
    {'yearID': 2016, 'teamID': 'ATL', 'igID': 'NL', 'playerID': 'B', 'salary': 200000},
    {'yearID': 2016, 'teamID': 'ATL', 'igID': 'NL', 'playerID': 'C', 'salary': 50000},
    {'yearID': 2016, 'teamID': 'ATL', 'igID': 'NL', 'playerID': 'A', 'salary': 100000},
    {'yearID': 2016, 'teamID': 'ATL', 'igID': 'NL', 'playerID': 'B', 'salary': 200000},
])
After that, groupby and reset_index:
sample_df = salaries.groupby(['teamID', 'yearID']).salary.sum().reset_index()
Is this what you are looking for?
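If you'd rather avoid the two-level column that agg({'salary': ['sum']}) produces, named aggregation gives a flat column name directly. This is a sketch on made-up rows, not the asker's actual data:

```python
import pandas as pd

salaries = pd.DataFrame({
    'yearID':   [1985, 1985, 2016, 2016],
    'teamID':   ['ATL', 'ATL', 'ATL', 'NYA'],
    'playerID': ['A', 'B', 'A', 'C'],
    'salary':   [10000, 20000, 100000, 50000],
})

# Named aggregation: the result has a plain 'total_salary' column
team_year = (salaries
             .groupby(['teamID', 'yearID'])
             .agg(total_salary=('salary', 'sum'))
             .reset_index())
print(team_year['total_salary'].tolist())  # [30000, 100000, 50000]
```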
For the following dataframe df_data, is there a way to make a new column that counts the number of vehicles over the past 24 hours, or just for the previous day?
df_data = pd.DataFrame({
    'day_of_year': [1] * 24 + [2] * 24,
    'nr_of_vehicles': [254, 154, 896, 268, 254, 501, 840, 868, 654, 684, 684, 681,
                       632, 468, 987, 134, 336, 119, 874, 658, 121, 254, 154, 896,
                       268, 254, 501, 840, 868, 654, 684, 684, 681, 632, 468, 987,
                       134, 336, 119, 874, 658, 121, 268, 254, 501, 840, 868, 654],
    'hour': list(range(24)) * 2,
})
Visual representation (nr_of_vehicles is counted per hour):
I thought of grouping the data by day_of_year by using the following
df_data_day = df_data.groupby('day_of_year').agg({'nr_of_vehicles': 'sum'})
but I don't know how to assign it back to a column correctly, because there are more rows in the original dataframe.
You were not far off: you just had to use transform instead of agg:
df_data_day = df_data.groupby('day_of_year')['nr_of_vehicles'].transform('mean')
You can even directly add a new column:
df_data['nr_by_day'] = df_data.groupby('day_of_year')['nr_of_vehicles'].transform('mean')
BTW: I used your proposed code, which computes the average, although your title says sum...
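Since the title asks for a sum, the same transform pattern works with 'sum'. Here it is on a trimmed-down version of the question's data:

```python
import pandas as pd

df_data = pd.DataFrame({
    'day_of_year':    [1, 1, 1, 2, 2, 2],
    'nr_of_vehicles': [254, 154, 896, 268, 254, 501],
    'hour':           [0, 1, 2, 0, 1, 2],
})

# transform('sum') keeps one value per original row, so it can be
# assigned straight back as a new column
df_data['nr_by_day'] = df_data.groupby('day_of_year')['nr_of_vehicles'].transform('sum')
print(df_data['nr_by_day'].tolist())  # [1304, 1304, 1304, 1023, 1023, 1023]
```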
I have found some tasks to develop my Pandas skills further, but I ran into some unexpected errors in the data files I used. I wanted to fix them myself, but I have no idea how.
Basically I have an Excel file with the columns PayType, Money and Date. In the PayType column there are 4 different payment types: car rent payment, car service fee payment, and 2 more which are not important. On every car rent payment there is an automatic service fee deduction, which happens at exactly the same time. I used a pivot table with the PayTypes as columns, as I wanted to compute the percentage of these fees.
Before Pivot Table: (screenshot)
Time difference example: (screenshot)
After Pivot Table: (screenshot)
import numpy as np
import pandas as pd
import xlrd
from pandas import Series, DataFrame
df = pd.read_excel('C:/Data.xlsx', sheet_name='Sheet1',
                   usecols=['PayType', 'Money', 'Date'])
df['Date'] = pd.to_datetime(df['Date'], format='%Y-%m-%d %H:%M:%S.%f')
df = df.pivot_table(index=['Date'], columns=['PayType']).fillna(0)
df = pd.merge_asof(df['Money', 'serviceFee'], df['Money', 'carRenting'],
                   on='Date', tolerance=pd.Timedelta('2s'))
df['Percentage'] = df['Money', 'serviceFee'] / df['Money', 'carRenting'] * 100
df['Percentage'] = df['Percentage'].abs()
df['Charges'] = np.where(df['Percentage'].notna(),
                         np.where(df['Percentage'] > 26, 'Overcharge - 30%', 'Fixed - 25%'),
                         'Null')
df.to_excel("Finale123.xlsx")
So in the pivot table, the car renting entries and the fee payments almost all happen at the same moment, so their times are equal and they end up in one row. But there are a few mistakes where the times for carRenting and the fee payment differ by just 1 or 2 seconds, and because of this difference they are split into 2 different rows.
I tried to use merge_asof, but it didn't work.
How can I merge 2 rows whose times differ by 2 seconds at most, given that this time column (Date) is also the index of the pivot table?
I had a similar problem. I needed to merge time series data of multiple sensors. The time interval of the sensor measurements are 5 seconds. The time format is yyyy:MM:dd HH:mm:ss. To do the merge, I also needed to sort the column used for the merge.
sensors_livingroom = load(filename_livingroom)
sensors_bedroom = load(filename_bedroom)
sensors_livingroom = sensors_livingroom.set_index("time")
sensors_bedroom = sensors_bedroom.set_index("time")
sensors_livingroom.index = pd.to_datetime(sensors_livingroom.index, dayfirst=True)
sensors_bedroom.index = pd.to_datetime(sensors_bedroom.index, dayfirst=True)
sensors_livingroom.sort_index(inplace=True)
sensors_bedroom.sort_index(inplace=True)
sensors = pd.merge_asof(sensors_bedroom, sensors_livingroom, on='time', direction="nearest")
In my case I wanted to merge to the nearest time value, so I set the direction parameter to nearest. In your case it seems that the time of one dataframe will always be smaller than the time of the other, so it may be better to set direction to forward or backward. See https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.merge_asof.html
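A minimal merge_asof sketch with a 2-second tolerance, mirroring the fee/rent pairing from the question (column names, amounts and timestamps are all made up):

```python
import pandas as pd

rent = pd.DataFrame({
    'Date': pd.to_datetime(['2020-01-01 10:00:00', '2020-01-01 11:00:00']),
    'rent_money': [-100.0, -200.0],
})
fee = pd.DataFrame({
    'Date': pd.to_datetime(['2020-01-01 10:00:01', '2020-01-01 11:00:02']),
    'fee_money': [25.0, 50.0],
})

# Both frames must be sorted on the merge key
merged = pd.merge_asof(fee.sort_values('Date'), rent.sort_values('Date'),
                       on='Date', direction='nearest',
                       tolerance=pd.Timedelta('2s'))
print(merged['rent_money'].tolist())  # [-100.0, -200.0]
```

Rows whose timestamps are more than 2 seconds apart would simply get NaN instead of a match.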
Can anyone please explain how the code below works? My question is: if the variable y holds only Price, how is the last line able to group by Doors? I can't follow or debug the flow. Please let me know, as I'm very new to this field.
import pandas as pd
df = pd.read_excel('http://cdn.sundog-soft.com/Udemy/DataScience/cars.xls')
y = df['Price']
y.groupby(df.Doors).mean()
import pandas as pd
df = pd.read_excel('http://cdn.sundog-soft.com/Udemy/DataScience/cars.xls')
y = df['Price']
print("The Doors")
print(df.Doors)
print("The Price")
print(y)
y.groupby(df.Doors).mean()
Try the code above and you will see: the positions (indexes) where df.Doors is 4 and the prices at those indexes in y are treated as one group whose mean is taken, and likewise the 2-door rows in df.Doors form the other group.
It works because y is a pandas Series: its values are the prices, but it keeps the index it had in df. When you do df.Doors you get a Series with different values but the same indexes (since an index belongs to the whole row). By matching the indexes, pandas can perform the groupby.
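A tiny illustration of that index alignment, with made-up prices and door counts:

```python
import pandas as pd

df = pd.DataFrame({'Price': [10000.0, 20000.0, 30000.0], 'Doors': [2, 4, 4]})
y = df['Price']  # a Series that keeps the DataFrame's index

# df.Doors is aligned to y row by row via the shared index,
# so pandas can group the prices by door count
means = y.groupby(df.Doors).mean()
print(means.to_dict())  # {2: 10000.0, 4: 25000.0}
```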
It loads the popular cars dataset into the dataframe df and assigns the Price column of the dataset to the variable y.
I would recommend you to get a general understanding of the data you loaded with the following commands:
df.info()      # shows the index range and the data type of each column
df.describe()  # shows common stats like mean or median
df.head()      # shows the first 5 rows
The groupby command groups the rows (also called observations) of the cars dataframe df by the number of doors, and shows you the average price for cars with 2 doors, 4 doors, and so on.
Check the output by adding print() around the last line of code.
Edit: sorry, I answered too fast; I thought you asked for a general explanation of the code rather than why it works.