creating new column with the sum of the past 24 hours - python
For the following dataframe df_data, is there a way to make a new column that counts the number of vehicles over the past 24 hours, or just over the previous day?
df_data = {'day_of_year': [1] * 24 + [2] * 24,
           'nr_of_vehicles': [254, 154, 896, 268, 254, 501, 840, 868, 654, 684, 684, 681,
                              632, 468, 987, 134, 336, 119, 874, 658, 121, 254, 154, 896,
                              268, 254, 501, 840, 868, 654, 684, 684, 681, 632, 468, 987,
                              134, 336, 119, 874, 658, 121, 268, 254, 501, 840, 868, 654],
           'hour': list(range(24)) * 2}
Visual representation (nr_of_vehicles is counted per hour):
I thought of grouping the data by day_of_year by using the following
df_data_day = df_data.groupby('day_of_year').agg({'nr_of_vehicles': 'sum'})
but I don't know how I could assign the result back to a new column correctly, because there are more rows in the original dataframe than in the grouped one.
You were not far off: you just have to use transform instead of agg:

df_data_day = df_data.groupby('day_of_year')['nr_of_vehicles'].transform('sum')

You can even directly add the new column:

df_data['nr_by_day'] = df_data.groupby('day_of_year')['nr_of_vehicles'].transform('sum')

Unlike agg, transform returns a result with the same index as the original dataframe, so it can be assigned straight back.
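If the literal "past 24 hours" reading is wanted rather than the calendar-day total, a trailing rolling window over the hourly rows works alongside the transform. A minimal sketch with dummy vehicle counts (the rolling-window part is my addition, not part of the answer above):

```python
import pandas as pd

# stand-in frame with the question's shape (vehicle counts are dummy values)
df = pd.DataFrame({
    'day_of_year': [1] * 24 + [2] * 24,
    'hour': list(range(24)) * 2,
    'nr_of_vehicles': list(range(48)),
})

# per-day total, broadcast back to every hourly row
df['nr_by_day'] = df.groupby('day_of_year')['nr_of_vehicles'].transform('sum')

# literal "past 24 hours": a trailing 24-row rolling sum over the hourly data
df['nr_last_24h'] = df['nr_of_vehicles'].rolling(24, min_periods=1).sum()
```

With min_periods=1 the first rows use shorter windows instead of producing NaN; drop it if you prefer NaN until a full 24 hours of data exists.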
Related
Using a Rolling Function in Pandas based on Date and a Categorical Column
I'm currently working on a dataset where I am using the rolling function in pandas to create features. The functions rely on three columns: a numeric DaysLate column from which the mean is calculated, an InvoiceDate column from which the date is derived, and a customerID column which denotes the customer of a row. I'm trying to get a rolling mean of DaysLate for the last 30 days, limited to invoices raised to a specific customerID. The following two snippets are working.

Mean of DaysLate for the last five invoices raised for the row's customer:

df["CustomerDaysLate_lastfiveinvoices"] = df.groupby("customerID").rolling(window=5, min_periods=1).\
    DaysLate.mean().reset_index().set_index("level_1").\
    sort_index()["DaysLate"]

Mean of DaysLate for all invoices raised in the last 30 days:

df = df.sort_values('InvoiceDate')
df["GlobalDaysLate_30days"] = df.rolling(window='30d', on="InvoiceDate").DaysLate.mean()

I just can't seem to find the code to get the mean of the last 30 days by customerID. Any help on the above is greatly appreciated.
Set the date column as the index, then sort to ensure ascending order, then group the sorted dataframe by customer id and calculate the 30-day rolling mean within each group:

mean_30d = (
    df
    .set_index('InvoiceDate')  # important: a DatetimeIndex is required for a '30d' window
    .sort_index()
    .groupby('customerID')
    .rolling('30d')['DaysLate'].mean()
    .reset_index(name='GlobalDaysLate_30days')
)

# merge the rolling mean back into the original dataframe
result = df.merge(mean_30d)
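To see the pattern work end to end, here is a self-contained sketch on a tiny made-up invoice frame (all IDs, dates, and DaysLate values are invented; the output column is named CustomerDaysLate_30days here, since the mean is per customer):

```python
import pandas as pd

# toy invoice data; customer IDs, dates and DaysLate values are invented
df = pd.DataFrame({
    'customerID': ['A', 'A', 'A', 'B'],
    'InvoiceDate': pd.to_datetime(['2021-01-01', '2021-01-10', '2021-03-01', '2021-01-05']),
    'DaysLate': [2.0, 4.0, 6.0, 10.0],
})

mean_30d = (
    df
    .set_index('InvoiceDate')              # a DatetimeIndex is needed for a '30d' window
    .sort_index()
    .groupby('customerID')
    .rolling('30d')['DaysLate'].mean()     # trailing 30-day mean within each customer
    .reset_index(name='CustomerDaysLate_30days')
)

# merge the per-customer rolling mean back onto the original rows
result = df.merge(mean_30d, on=['customerID', 'InvoiceDate'])
```

Customer A's invoice on 2021-03-01 falls more than 30 days after the January invoices, so its rolling mean covers only itself; the 2021-01-10 invoice averages the two January values.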
pandas: computing a new column as an average grouped by two other columns
So I have this dataset of temperatures. Each line describes the temperature in Celsius measured by hour in a day. I need to compute a new variable called avg_temp_ar_mensal which represents the average temperature of a city in a month. In this dataset the city is represented as estacao and the month as mes. I'm trying to do this using pandas. The following line of code is the one I'm trying to use to solve this problem:

df2['avg_temp_ar_mensal'] = df2['temp_ar'].groupby(df2['mes', 'estacao']).mean()

The goal of this code is to store in a new column the average temperature per city and month. But it doesn't work. If I try the following line of code:

df2['avg_temp_ar_mensal'] = df2['temp_ar'].groupby(df2['mes']).mean()

it works, but it is wrong: it averages over every city of the dataset and I don't want that, because it will cause noise in my data. I need to separate each temperature based on month and city and then calculate the mean.
The dataframe after groupby is smaller than the initial dataframe; that is why your code runs into an error. There are two ways to solve this problem. The first is to use transform, which returns a result aligned with the original rows:

df['avg_temp_ar_mensal'] = df.groupby(['mes', 'estacao'])['temp_ar'].transform('mean')

The second is to create a new dataframe dfn from the groupby and then merge it back into df:

dfn = df.groupby(['mes', 'estacao'])['temp_ar'].mean().reset_index(name='average')
df = pd.merge(df, dfn, on=['mes', 'estacao'], how='left')
You are calling groupby on a single column when you do df2['temp_ar'].groupby(...). This doesn't make much sense, since within a single column there is nothing to group by. Instead, perform the groupby on all the columns you need. Also note that a plain groupby mean is indexed by the group keys rather than by the original rows, so it will not align on assignment; transform hands back a series with the original index:

df2['new_column'] = df2.groupby(['city_column', 'month_column'])['temp_column'].transform('mean')

This should do the trick if I understand your dataset correctly. If not, please provide a reproducible version of your df.
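A minimal runnable sketch of the group-mean assignment using transform, which returns one value per original row so the assignment aligns with the index (the city/month/temperature values below are made up):

```python
import pandas as pd

# made-up hourly temperatures for two cities in the same month
df2 = pd.DataFrame({
    'estacao': ['A', 'A', 'B', 'B'],
    'mes':     [1,   1,   1,   1],
    'temp_ar': [20.0, 22.0, 30.0, 34.0],
})

# one mean per (month, city) group, broadcast back to every row
df2['avg_temp_ar_mensal'] = df2.groupby(['mes', 'estacao'])['temp_ar'].transform('mean')
```

Each city keeps its own monthly mean on every one of its rows, which is exactly the alignment that a plain groupby().mean() assignment fails to provide.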
Is there a function to get the difference between two values on a pandas dataframe timeseries?
I am messing around in the NYT covid dataset, which has total covid cases for each county, per day. I would like to find the difference in cases between each day, so I could get the number of new cases per day instead of total cases. Taking a rolling mean, or resampling every 2 days using a mean/sum/etc., all work just fine. It's just subtracting that is giving me such a headache.

Tried methods:

df.resample('2d').diff()
# 'DatetimeIndexResampler' object has no attribute 'diff'

df.resample('1d').agg(np.subtract)
# ufunc() missing 1 of 2 required positional argument(s)

df.rolling(2).diff()
# 'Rolling' object has no attribute 'diff'

df.rolling('2').agg(np.subtract)
# ufunc() missing 1 of 2 required positional argument(s)

Sample data:

pd.DataFrame(data={'state': ['Alabama', 'Alabama', 'Alabama', 'Alabama', 'Alabama'],
                   'date': [dt.date(2020, 3, 13), dt.date(2020, 3, 14), dt.date(2020, 3, 15),
                            dt.date(2020, 3, 16), dt.date(2020, 3, 17)],
                   'covid_cases': [1.2, 2.0, 2.9, 3.6, 3.9]})

Desired sample output:

pd.DataFrame(data={'state': ['Alabama', 'Alabama', 'Alabama', 'Alabama', 'Alabama'],
                   'date': [dt.date(2020, 3, 13), dt.date(2020, 3, 14), dt.date(2020, 3, 15),
                            dt.date(2020, 3, 16), dt.date(2020, 3, 17)],
                   'new_covid_cases': [np.nan, 0.8, 0.9, 0.7, 0.3]})

Recreate the sample data from the original NYT dataset:

df = pd.read_csv('https://raw.githubusercontent.com/nytimes/covid-19-data/master/us-counties.csv', parse_dates=['date'])
df.groupby(['state', 'date'])[['cases']].mean().reset_index()

Any help would be greatly appreciated! I would like to learn how to do this manually/via a function rather than finding a "new cases" dataset, as I will be working with timeseries a lot in the very near future.
Let's try this bit of complete code:

import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv('https://raw.githubusercontent.com/nytimes/covid-19-data/master/us-counties.csv')
df['date'] = pd.to_datetime(df['date'])
df_daily_state = df.groupby(['date', 'state'])['cases'].sum().unstack()
daily_new_cases_AL = df_daily_state.diff()['Alabama']
ax = daily_new_cases_AL.iloc[-30:].plot.bar(title='Last 30 days Alabama New Cases')

Output: (bar chart of the last 30 days of new cases in Alabama)

Details:
- Download the historical case records from the NYTimes github using the raw URL
- Convert the dtype of the 'date' column to datetime
- Group by the 'date' and 'state' columns, sum 'cases', and unstack the state level of the index to get dates as rows and states as columns
- Take the difference along the rows and select only the Alabama column
- Plot the last 30 days
The diff function is correct, but if you look at your error message: 'DatetimeIndexResampler' object has no attribute 'diff' in your first tried methods, it's because diff is a function available for DataFrames, not for Resamplers, so turn it back into a DataFrame by specifying how you want to resample it. If you have the total number of COVID cases for each day and want to resample it to 2 days, you probably only want to keep the latest update out of the two days, in which case something like df.resample('2d').last().diff() should work.
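Applied directly to the question's own sample data, a per-state diff produces the desired new_covid_cases column without any resampling (groupby ... diff is a variation on the answers above, shown here as a runnable sketch):

```python
import datetime as dt
import pandas as pd

# the question's sample data
df = pd.DataFrame({'state': ['Alabama', 'Alabama', 'Alabama', 'Alabama', 'Alabama'],
                   'date': [dt.date(2020, 3, 13), dt.date(2020, 3, 14), dt.date(2020, 3, 15),
                            dt.date(2020, 3, 16), dt.date(2020, 3, 17)],
                   'covid_cases': [1.2, 2.0, 2.9, 3.6, 3.9]})

# day-over-day difference within each state; the first row per state is NaN
df['new_covid_cases'] = df.groupby('state')['covid_cases'].diff()
```

Grouping by 'state' before diffing matters once several states are present, so that the first day of one state is not subtracted from the last day of another.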
How to add two specific columns and get the total as a new column using the pandas library?
I'm trying to add two columns and display their total in a new column, and also to compute the following:
- the total sum of sales in the month of Jan
- the minimum sales amount in the month of Feb
- the average (mean) sales for the month of Mar

and I am trying to create a dataframe called d2 that only contains the rows of d that don't have any missing (NaN) values. I have implemented the following code:

import pandas as pd

new_val = pd.read_csv("/Users/mayur/574_repos_2019/ml-python-class/assignments/data/assg-01-data.csv")
new_val['total'] = 'total'
new_val.to_csv('output.csv', index=False)
display(new_val)
d.head(5)  # it's not showing the top lines of the .csv data

# .CSV file sample data
# account name  street  city   state  postal-code  Jan    Feb    Mar    total
# 0118 Kerl,    3St.    Waily  Texas  28752.0      10000  62000  35000  total
# 0118 mkrt,    1Wst.   con    Texas  22751.0      12000  88200  15000  total

It's giving me the total as a word.
When you used new_val['total'] = 'total' you basically told pandas that you want a column in your DataFrame called total where every value is the string 'total'. What you need to fix is the assignment. For this I can give you a quick and dirty solution that will hopefully make a more appealing solution clearer to you. You can iterate through your DataFrame and add the two columns to get the value for the third:

for i, row in new_val.iterrows():
    new_val.loc[i, 'total'] = row['Jan'] + row['Feb'] + row['Mar']

Note that this requires the column total to have already been defined, and that writing cell by cell must go through .loc (a chained lookup like new_val.iloc[i]['total'] = ... would modify a copy and silently do nothing). This also iterates through your entire data set, so if your data set is large this is not the best option.
As mentioned by @Cavenfish, new_val['total'] = 'total' creates a column total where the value of every cell is the string 'total'. You should rather use

new_val['total'] = new_val['Jan'] + new_val['Feb'] + new_val['Mar']

For the treatment of NA values you can use the mask new_val.isna(), which generates a boolean for every cell indicating whether it is NA or not. You can then apply any logic on top of it. For your example, the below should work:

new_val.isna().sum(axis=1) == 0

It counts the NA cells per row, so it returns True only for rows that contain no NA in any column; you can use that mask to build d2 (or simply call new_val.dropna()).
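Put together on a small made-up frame mirroring the question's CSV layout (all column values below are invented), the row-wise total and the other subtasks look like this:

```python
import numpy as np
import pandas as pd

# made-up sales data in the question's layout
d = pd.DataFrame({
    'account': ['0118a', '0118b', '0119c'],
    'Jan': [10000, 12000, np.nan],
    'Feb': [62000, 88200, 5000],
    'Mar': [35000, 15000, 7000],
})

d['total'] = d['Jan'] + d['Feb'] + d['Mar']  # row-wise sum, not the string 'total'

jan_sum = d['Jan'].sum()    # total Jan sales (NaN is skipped)
feb_min = d['Feb'].min()    # minimum Feb sale
mar_mean = d['Mar'].mean()  # average Mar sales

d2 = d.dropna()             # only the rows without any missing values
```

The row with a missing Jan value gets a NaN total (NaN propagates through +), and dropna() then excludes it from d2.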
What is pandas syntax for lookup based on existing columns + row values?
I'm trying to recreate a bit of a convoluted scenario, but I will do my best to explain it:

1. Create a pandas df1 with two columns: 'Date' and 'Price' - done
2. Add two new columns, 'rollmax' and 'rollmin', where 'rollmax' is an 8-day rolling maximum and 'rollmin' is a rolling minimum - done
3. Now I need to create another column 'rollmax_date' that gets populated through a lookup rule: for row n, go to the column 'Price', parse through the values for the last 8 days and find the maximum, then get the value of the corresponding row in the column 'Date' and put it in the column 'rollmax_date'. The same logic applies for 'rollmin_date', but looking for the rolling minimum instead of the rolling maximum.
4. Then I need to find the previous 8 days' max and min for the same rolling window of 8 days that I have already found.

I did the first two and tried the third one, but I'm getting wrong results. The code below gives me dates only on the rows where df["Price"] equals df['rollmax'], but it doesn't bring all the corresponding dates from 'Date' into 'rollmax_date':

df['rollmax_date'] = df.loc[(df["Price"] == df.rollmax), 'Date']

This is an image with steps for recreating the lookup
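Since no answer is included here, one possible sketch for the lookup step (not a verified solution; the data, the integer RangeIndex assumption, and min_periods=1 are all mine): rolling(...).apply with raw=False passes each window as a Series that keeps the original row labels, so idxmax/idxmin return the label of the window's extreme, which can then be mapped back to 'Date'.

```python
import pandas as pd

# made-up prices over ten days
df = pd.DataFrame({
    'Date': pd.date_range('2021-01-01', periods=10, freq='D'),
    'Price': [3, 5, 2, 8, 1, 7, 4, 9, 6, 2],
})

df['rollmax'] = df['Price'].rolling(8, min_periods=1).max()
df['rollmin'] = df['Price'].rolling(8, min_periods=1).min()

# row label of each window's max/min (rolling.apply coerces the labels to float)
max_idx = df['Price'].rolling(8, min_periods=1).apply(lambda w: w.idxmax(), raw=False)
min_idx = df['Price'].rolling(8, min_periods=1).apply(lambda w: w.idxmin(), raw=False)

# map those labels back to the 'Date' column, position-wise
df['rollmax_date'] = df['Date'].iloc[max_idx.astype(int).to_numpy()].to_numpy()
df['rollmin_date'] = df['Date'].iloc[min_idx.astype(int).to_numpy()].to_numpy()
```

This covers step 3 only; step 4 (the max/min of the preceding 8-day window) could be approached by shifting the series by 8 rows before applying the same pattern.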