I have a pandas dataframe that looks like
Name Date Value
Sarah 11-01-2015 3
Sarah 11-02-2015 2
Sarah 11-03-2015 27
Bill 11-01-2015 42
Bill 11-02-2015 5
Bill 11-03-2015 15
.... (a couple hundred rows)
How do I get a 30-day (or x-day) rolling sum of these values, broken out by whoever is in the 'Name' column? The ideal output would have the same columns as the current dataframe, but instead of each row holding the value that person had for that day, it would hold the sum of their values over the past 30 days.
I know I can do
result = pd.rolling_sum(df, 30)
to get the rolling sum overall. But how do I return a dataframe with that rolling sum grouped by the 'Name' column?
Figured it out using grigri's group_resample function.
df = group_resample(df, date_column='Date', groupby=group_by, value_column='Value', how='sum', freq='d')
df = df.unstack(group_by).fillna(0)
result = pd.rolling_sum(df, 30)
Note that if you don't need a precise temporal window, or if your dataset has one line per [day, user] (which seems to be your case), then pandas' standard groupby is perfectly suited. See this very similar question.
Otherwise, something like:
df.groupby('Name').rolling('30D', on="Date").Value.sum()
should work.
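For reference, a minimal, self-contained sketch of that groupby + rolling approach on data shaped like the example above (the sample values are illustrative):
import pandas as pd
df = pd.DataFrame({
    'Name': ['Sarah', 'Sarah', 'Sarah', 'Bill', 'Bill', 'Bill'],
    'Date': pd.to_datetime(['2015-11-01', '2015-11-02', '2015-11-03'] * 2),
    'Value': [3, 2, 27, 42, 5, 15],
})
# The '30D' window requires Date to be a datetime, sorted within each group
df = df.sort_values(['Name', 'Date'])
result = df.groupby('Name').rolling('30D', on='Date')['Value'].sum()
print(result)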
I have a dataset containing hourly temperatures for a year, so there are 24 entries per day (one temperature per hour), and I want to find the 5 days with the highest temperatures. I am aware of the nlargest() function for finding the 5 max values, but those values all happen to fall on a single day. How do I find the 5 max values on different days?
I tried using nlargest() and .loc() but could not find a solution. Please help.
You might want to get the max per group with groupby.max, then find the top 5 with nlargest:
df.groupby(['year','month','day'])['temp'].max().nlargest(5)
You can use groupby. An example would be:
max_5_temps = df.groupby('date_column')['temperature_column'].max().nlargest(5)
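A minimal sketch of the same idea, using resample('D') in place of the groupby, which is equivalent when the timestamps live in a DatetimeIndex (the data below is made up for illustration):
import numpy as np
import pandas as pd
# Hypothetical hourly temperatures for a year
idx = pd.date_range('2023-01-01', periods=24 * 365, freq='h')
df = pd.DataFrame({'temp': np.random.default_rng(0).normal(15, 8, len(idx))}, index=idx)
# Daily maximum first, then the 5 hottest days
hottest_days = df['temp'].resample('D').max().nlargest(5)
print(hottest_days)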
I calculated the average of the values contained in a column within my df as follows:
meanBpm = df['tempo'].mean()
The average is calculated for different days of the week; for some days the value I expect is returned, while for other days it returns NaN. This happens when there is no bpm (the tempo column) for a certain day, for example because I did not listen to any songs. I would like to replace these NaNs in the output with a default value, which could be 0 or -1.
EDIT: i solved it, thanks a lot everyone for the replies
what you're looking for is -
df['tempo'].fillna(0).mean()
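Note that df['tempo'].fillna(0).mean() counts the filled zeros in the denominator, which pulls the average down. If the NaNs actually appear in a per-day result, it may be closer to the goal to fill the output of the aggregation instead; a sketch, assuming a day_of_week column (the column name and data are illustrative):
import pandas as pd
# Hypothetical data: one day of the week has no tempo values at all
df = pd.DataFrame({'day_of_week': ['Mon', 'Mon', 'Tue'],
                   'tempo': [120.0, 128.0, None]})
# Mean per day; a day whose tempos are all missing comes back as NaN
mean_bpm = df.groupby('day_of_week')['tempo'].mean()
# Replace NaN in the output with a default, without distorting the other means
mean_bpm = mean_bpm.fillna(0)
print(mean_bpm)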
Given a data frame with the start times of new time periods (work shifts), sum all sales that occur up to the next time period (shift).
import pandas as pd
df_checkpoints = pd.DataFrame({'time':[1,5,10], 'shift':['Adam','Ben','Carl']})
df_sales = pd.DataFrame({'time':[2,6,7,9,15], 'soldCount':[1,2,3,4,5]})
# This is the wanted output...
df_output = pd.DataFrame({'time':[1,5,10], 'shift':['Adam','Ben','Carl'], 'totSold':[1,9,5]})
So pd.merge_asof does what I want, except it only does a 1:1 merge. Best would be a MultiIndex dataframe with index[0] being the checkpoint and index[1] being the sales row, so that I can aggregate freely afterwards. Last resort would be an ugly O(n) loop.
Each df has a couple of million rows.
Any idea?
You can use pd.cut.
For instance, if you want to group by range, you can use it like this (note that I added 24 as the end of the last range; also, pd.cut builds right-closed intervals by default, so pass right=False if a sale at an exact shift-start time should count toward the new shift):
pd.cut(df_sales["time"], [1, 5, 10, 24])
If you want to automate this: take your checkpoints, append an end time, group on the cut, sum the sales, and reset the index for the concat.
bins = pd.concat([df_checkpoints['time'], pd.Series([24])], ignore_index=True)  # Series.append was removed in pandas 2.0
group_and_sum = df_sales.groupby(pd.cut(df_sales['time'], bins), as_index=False).sum().drop('time', axis=1)
Concatenate the two dataframes to attach the names:
pd.concat([group_and_sum,df_checkpoints],axis=1)
output
soldCount time shift
0 1 1 Adam
1 9 5 Ben
2 5 10 Carl
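Since merge_asof came up in the question, here is a sketch of an alternative that tags each sale with the most recent checkpoint and then aggregates, which avoids both the manual end-of-range sentinel and an O(n) Python loop (both frames must be sorted by time, as they are here):
import pandas as pd
df_checkpoints = pd.DataFrame({'time': [1, 5, 10], 'shift': ['Adam', 'Ben', 'Carl']})
df_sales = pd.DataFrame({'time': [2, 6, 7, 9, 15], 'soldCount': [1, 2, 3, 4, 5]})
# Tag every sale with the shift whose start time most recently precedes it
tagged = pd.merge_asof(df_sales, df_checkpoints, on='time')
# Aggregate per shift, then attach the shift start times back
totals = tagged.groupby('shift', as_index=False)['soldCount'].sum()
df_output = df_checkpoints.merge(totals, on='shift').rename(columns={'soldCount': 'totSold'})
print(df_output)  # times 1/5/10, totSold 1/9/5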
I'm trying to add columns together and display their total in a new column, along with the following:
The total sum of sales in the month of Jan
The minimum sales amount in the month of Feb
The average (mean) sales for the month of Mar
I'm also trying to create a data frame called d2 that only contains the rows of data in d that don't have any missing (NaN) values.
I have implemented the following code
import pandas as pd
new_val = pd.read_csv("/Users/mayur/574_repos_2019/ml-python-class/assignments/data/assg-01-data.csv")
new_val['total'] = 'total'
new_val.to_csv('output.csv', index=False)
display(new_val)
d.head(5)  # it's not showing the top lines of the .csv data
# .CSV file sample data
#account name street city state postal-code Jan Feb Mar total
#0118 Kerl, 3St . Waily Texas 28752.0 10000 62000 35000 total
#0118 mkrt, 1Wst. con Texas 22751.0 12000 88200 15000 total
It's giving me the word 'total' in every row instead of the sum.
When you used new_val['total'] = 'total' you basically told Pandas that you want a column in your DataFrame called total where every value is the string 'total'.
What you want to fix is the value assignment. For this I can give you a quick and dirty solution that will hopefully make a more appealing solution clearer to you.
You can iterate through your DataFrame and add the two columns to get the variable for the third.
for i, row in new_val.iterrows():
    new_val.loc[i, 'total'] = row['Jan'] + row['Feb'] + row['Mar']
Note that this iterates through your entire data set, so if your data set is large this is not the best option.
As mentioned by @Cavenfish, new_val['total'] = 'total' creates a column total where the value of every cell is the string 'total'.
You should rather use new_val['total'] = new_val['Jan'] + new_val['Feb'] + new_val['Mar']
For the treatment of NA values you can use a mask, new_val.isna(), which generates a boolean for every cell of your array indicating whether it is NA. You can then apply any logic on top of it. For your example, the below should work:
d2 = new_val[new_val.isna().sum(axis=1) == 0]
It counts the NA cells in each row and keeps only the rows that have none (new_val.dropna() is the shorthand). The same mask can be used to assign a default value to new_val['total'] whenever one of the month columns in a row is NA.
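Putting the pieces together, a sketch of the whole assignment, assuming the month columns are named Jan, Feb, and Mar as in the sample rows (the file path is the one from the question):
import pandas as pd
d = pd.read_csv("/Users/mayur/574_repos_2019/ml-python-class/assignments/data/assg-01-data.csv")
# Row-wise total of the three month columns
d['total'] = d['Jan'] + d['Feb'] + d['Mar']
# The requested aggregates
jan_sum = d['Jan'].sum()    # total sales in January
feb_min = d['Feb'].min()    # minimum sales amount in February
mar_mean = d['Mar'].mean()  # average sales in March
# d2: only the rows without any missing (NaN) values
d2 = d.dropna()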
I am trying to develop a program to convert daily data into monthly or yearly data and so on.
I have a DataFrame with datetime index and price change %:
% Percentage
Date
2015-06-02 0.78
2015-06-10 0.32
2015-06-11 0.34
2015-06-12 -0.06
2015-06-15 -0.41
...
I had success grouping by some frequency. Then I tested:
df.groupby('Date').sum()
df.groupby('Date').cumsum()
If it were a plain sum it would work fine, but the problem is that I can't sum it the percent way: (1 + x0)(1 + x1)... - 1. Then I tried:
def myfunc(values):
    p = 0
    for val in values:
        p = (1 + p) * (1 + val) - 1
    return p
df.groupby('Date').apply(myfunc)
I can't understand how apply() works. It seems to apply my function to all the data and not just to the grouped items.
Your apply is applying to all rows individually because you're grouping by the date column. Your date column looks to have unique values for each row, so each group has only one row in it. You need to use a Grouper to group by month, then use cumprod and get the last value for each group:
# make sure Date is a datetime
df["Date"] = pd.to_datetime(df["Date"])
# add one to percentages
df["% Percentage"] += 1
# use cumprod on each month group, take the last value, and subtract 1
df.groupby(pd.Grouper(key="Date", freq="M"))["% Percentage"].apply(lambda g: g.cumprod().iloc[-1] - 1)
Note, though, that this applies the percentage growth as if the steps between your rows were all the same, but it looks like the step is sometimes 8 days and sometimes 1 day. You may need to do some clean-up depending on the result you want.
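For reference, a self-contained sketch of that approach on the question's sample data, using prod() in place of cumprod().iloc[-1] (equivalent when only the final value of each group is kept, and it also avoids an IndexError on empty month groups):
import pandas as pd
df = pd.DataFrame({
    'Date': ['2015-06-02', '2015-06-10', '2015-06-11', '2015-06-12', '2015-06-15'],
    '% Percentage': [0.78, 0.32, 0.34, -0.06, -0.41],
})
df['Date'] = pd.to_datetime(df['Date'])
# Compound the returns within each month: (1 + x0)(1 + x1)... - 1
monthly = (df.groupby(pd.Grouper(key='Date', freq='M'))['% Percentage']
             .apply(lambda g: (1 + g).prod() - 1))
print(monthly)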