I have two tables: the first (vle) holds behavioral activities (many types of activities, some shown in the activity type column), and the other (UsersVle) holds users' activities. The date column represents a day and runs from 0 to 222. I want to aggregate users' activities into weeks, broken down by activity type. For example, in week 1 user 1 would have one column per activity type, each holding the total sum_clicks during that week. How can I do that in a pandas data frame using Python?
I would appreciate your help.
Derive a new field called WEEK from date. (You haven't provided enough info about date to say exactly how it maps to a week, e.g. does 1 mean Jan 1st? If it is simply a day offset, something like date // 7 would do.)
Join your two tables. Is id_site in table 2 a foreign key for id_site in table 1? If so, combined_df = table2.merge(table1, on='id_site'). Now, you should have all the fields in a single data frame.
Pivot like this:
user_summary_by_week = pd.pivot_table(
    combined_df,
    index=['id_user', 'WEEK'],
    columns='activity_type',
    aggfunc='sum',
    fill_value=0,
).reset_index(col_level=1)
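Putting the three steps together, a minimal sketch (assuming date is a plain day offset, the clicks live in a column called sum_click, and id_site links the two tables; adjust the names to your actual schema):
import pandas as pd

# assumed names: UsersVle, vle, id_site, id_user, date, sum_click, activity_type
combined_df = UsersVle.merge(vle, on='id_site')

# date is a day offset (0..222), so integer division yields a week number
combined_df['WEEK'] = combined_df['date'] // 7 + 1

user_summary_by_week = pd.pivot_table(
    combined_df,
    index=['id_user', 'WEEK'],
    columns='activity_type',
    values='sum_click',
    aggfunc='sum',
    fill_value=0,
).reset_index()
With values= set explicitly the pivot keeps a single column level, so a plain reset_index() suffices.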
I'm currently working on a dataset where I am using the pandas rolling function to create features.
The features rely on three columns: a numeric DaysLate column from which the mean is calculated, an InvoiceDate column from which the date is derived, and a customerID column which denotes the customer of a row.
I'm trying to get a rolling mean of DaysLate over the last 30 days, limited to invoices raised to a specific customerID.
The following two snippets work.
Mean of DaysLate for the last five invoices raised for the row's customer
df["CustomerDaysLate_lastfiveinvoices"] = (
    df.groupby("customerID")
      .rolling(window=5, min_periods=1)["DaysLate"]
      .mean()
      .reset_index()
      .set_index("level_1")
      .sort_index()["DaysLate"]
)
Mean of DaysLate for all invoices raised in the last 30 days
df = df.sort_values('InvoiceDate')
df["GlobalDaysLate_30days"] = df.rolling(window='30d', on='InvoiceDate')["DaysLate"].mean()
I just can't seem to find the code to get the mean of the last 30 days per customerID. Any help on the above is greatly appreciated.
Set the date column as the index, sort it to ensure ascending order, then group the sorted dataframe by customer id and calculate the 30-day rolling mean for each group.
mean_30d = (
    df
    .set_index('InvoiceDate')   # time-based rolling requires a DatetimeIndex
    .sort_index()
    .groupby('customerID')
    .rolling('30d')['DaysLate'].mean()
    .reset_index(name='GlobalDaysLate_30days')
)
# merge the rolling mean back to the original dataframe
# (joins on the shared customerID and InvoiceDate columns)
result = df.merge(mean_30d)
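One caveat: if a customer can have several invoices at the exact same timestamp, the bare merge (which joins on all shared columns) will multiply rows. A hedged variant that makes the join keys explicit and keeps a single rolling value per customer/date pair:
# the last value per (customerID, InvoiceDate) already includes every row
# at that timestamp, so keep it and drop the rest before merging
mean_30d = mean_30d.drop_duplicates(['customerID', 'InvoiceDate'], keep='last')
result = df.merge(mean_30d, on=['customerID', 'InvoiceDate'], how='left')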
I have a data frame with start_date and end_date columns (e.g. 01-02-2020). Based on these two dates a record can be daily (if start and end are one day apart), and similarly monthly, quarterly, or yearly.
There is also a value column (e.g. 3.5).
Now suppose that for a given date there is one monthly record with value 2.5, one quarterly record with 4.5, a daily record with 1.5, and one yearly record with 0.5.
Then I need to get one row per date, e.g. 01-01-2020, summing all values that cover that date (2.5 + 4.5 + 1.5 + 0.5 = 9), so 9 is the total_value on 01-01-2020.
There are years of data like this, with multiple records covering the same time period, and I need the aggregated value for each individual date, for every distinct 'name'.
I have been trying to do this in Python with no success so far. Any help is appreciated.
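Without the exact schema this can only be a sketch. Assuming columns named start_date, end_date, name and value, and that end_date is exclusive (so a daily record, whose start and end are one day apart, covers exactly one day), one approach is to expand each record into the dates it covers and then sum per name and date:
import pandas as pd

df['start_date'] = pd.to_datetime(df['start_date'], dayfirst=True)
df['end_date'] = pd.to_datetime(df['end_date'], dayfirst=True)

# one row per covered date: a daily record yields 1 date, a quarterly one ~90
df['date'] = df.apply(
    lambda r: pd.date_range(r['start_date'], r['end_date'] - pd.Timedelta(days=1)),
    axis=1,
)

result = (
    df.explode('date')
      .groupby(['name', 'date'], as_index=False)['value'].sum()
      .rename(columns={'value': 'total_value'})
)
For years of data this expansion can get large, so chunking by name or by year may be needed to keep memory in check.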
I currently have a dataframe with sales data, named "visitresult_and_outcome".
I have a column named "DATEONLY" that holds the sale date (format yyyy-mm-dd) in string format.
I now want to make two new dataframes: one for the sales made on weekends, one for the sales made on weekdays. How can I do this efficiently?
df['dayofweek'] = pd.to_datetime(df['DATEONLY']).dt.dayofweek
Since DATEONLY holds strings, convert it to datetime first; dt.dayofweek then gives the day of the week as Monday=0 through Sunday=6. Creating your two dataframes is then just a matter of boolean slicing.
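For example (a sketch, reusing the dayofweek column created above):
weekend_sales = df[df['dayofweek'] >= 5]  # Saturday (5) and Sunday (6)
weekday_sales = df[df['dayofweek'] < 5]   # Monday (0) through Friday (4)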
I have a large (10m+ rows) dataframe with three columns: sales dates (dtype: datetime64[ns]), customer names, and sales per customer. Sales dates include day, month and year in the form yyyy-mm-dd (e.g. 2019-04-19). I discovered the pandas to_period function and would like to use the period[A-MAR] dtype. As the business year (ending in March) differs from the calendar year, that is exactly what I was looking for. With to_period I can assign each sales date to the correct business year without creating extra columns.
I convert the date column as follows:
df_input['Date'] = pd.DatetimeIndex(df_input['Date']).to_period("A-MAR")
Now a peculiar issue arises when I use pivot_table to aggregate the data with margins=True. The aggfunc returns the correct values in the output table, but the results in the last row (the total created by the margins) are wrong: NaN is shown (or, in my case, 0, as I set fill_value=0). The call I use:
df_output = df_input.pivot_table(
    index="Customer",
    columns="Date",
    values="Sales",
    aggfunc={"Sales": np.sum},
    fill_value=0,
    margins=True,
)
When I do not convert the dates to a period but use a simple year (integer) instead, the margins are calculated correctly and no NaN appears in the last row of the pivot output table.
I searched all over the internet but could not find a working solution. I would like to keep working with the period dtype and just need the margins calculated correctly. I hope someone can help me out here. Thank you!
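One workaround worth trying (a sketch, not a fix for the underlying issue): cast the period column to its string representation just for the pivot. Each A-MAR period already encodes the business year, so the grouping is preserved, while pivot_table sees plain strings and fills in the margins.
pivot_input = df_input.copy()
# period -> string keeps the business-year labels but sidesteps the NaN margins
pivot_input["Date"] = pivot_input["Date"].astype(str)

df_output = pivot_input.pivot_table(
    index="Customer",
    columns="Date",
    values="Sales",
    aggfunc="sum",
    fill_value=0,
    margins=True,
)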
I have 4 columns: Date, Account #, Quantity, and Sale. I have daily data, but I want to show weekly Sales and Quantity per customer.
I have been able to group the data by week, but I also want to group it by OracleNumber and sum the Quantity and Sale columns. How would I get that to work without messing up the week format?
import pandas as pd

names = ['Date', 'OracleNumber', 'Quantity', 'Sale']
sales = pd.read_csv("CustomerSalesNVG.csv", names=names)
sales['Date'] = pd.to_datetime(sales['Date'])
grouped = sales.groupby(sales['Date'].map(lambda x: x.week))
print(grouped.head())
IIUC, you could group by both the week and the OracleNumber column by passing a list of keys to groupby and performing the sum afterwards:
sales.groupby([sales['Date'].dt.week, 'OracleNumber']).sum()
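Note that Series.dt.week was deprecated in pandas 1.1 and removed in 2.0; on recent versions the equivalent is isocalendar():
# isocalendar().week replaces the removed dt.week accessor;
# numeric_only=True keeps non-numeric columns out of the sum
sales.groupby([sales['Date'].dt.isocalendar().week, 'OracleNumber']).sum(numeric_only=True)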