Aggregation sum Pandas - python

I have the following dataframe
How can I aggregate the number of tickets (summing) for every month?
I tried:
df_res[df_res["type"]=="other"].groupby(["type","date"])["n_tickets"].sum()
The date column is of object dtype.

You need to assign the filtered rows to a new DataFrame, so the Series created by Series.dt.month has the same length as the frame you group:
# if necessary, convert to datetimes first
df_res['date'] = pd.to_datetime(df_res['date'])
df = df_res[df_res["type"]=="pax"]
# type is the same for every filtered row, so it can be omitted as a grouping key
out = df.groupby(df["date"].dt.month)["n_tickets"].sum()
# if you need a column with the constant value `pax`:
# out = df.groupby(['type', df["date"].dt.month])["n_tickets"].sum()
If you want to group by pax and non-pax:
types = np.where(df_res["type"]=="pax", 'pax', 'no pax')  # numpy imported as np
df_res.groupby([types, df_res["date"].dt.month])["n_tickets"].sum()
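For example, on a tiny made-up frame (data invented here purely for illustration, not from the original question), the grouping by month would look roughly like this:
import numpy as np
import pandas as pd

# hypothetical sample data
df_res = pd.DataFrame({
    'type': ['pax', 'pax', 'other', 'pax'],
    'date': pd.to_datetime(['2023-01-05', '2023-01-20', '2023-02-03', '2023-02-10']),
    'n_tickets': [10, 5, 7, 3],
})
types = np.where(df_res['type'] == 'pax', 'pax', 'no pax')
print(df_res.groupby([types, df_res['date'].dt.month])['n_tickets'].sum())
# no pax  2     7
# pax     1    15
#         2     3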

How to make a new dataframe with output from Pandas apply function?

I'm currently struggling with a problem where I'm trying to avoid for loops (even though they would make it easier for me to understand) and instead use the pandas approach.
The problem I'm facing is that I have a big dataframe of logs, allLogs, like:
index  message   date_time            user_id
0      message1  2023-01-01 09:00:49  123
1      message2  2023-01-01 09:00:58  123
2      message3  2023-01-01 09:01:03  125
...    etc
I'm doing analysis per user_id, for which I've written a function. This function needs a subset of the allLogs dataframe: all ids, messages and date_times per user_id. Think of it like: for each unique user_id I want to run the function.
This function calculates the time between consecutive messages and builds a Series of all those time deltas (time differences). I want to turn this into a separate dataframe holding a list/series/array of time deltas for each unique user_id.
The current function looks like this:
def makeSeriesPerUser(df):
    df = df[['message','date_time']]
    df = df.drop_duplicates(['date_time','message'])
    df = df.sort_values(by='date_time', inplace = True)
    m1 = (df['message'] == df['message'].shift(-1))
    df = df[~(m1)]
    df = (df['date_time'].shift(-1) - df['date_time'])
    df = df.reset_index(drop=True)
    seconds = m1.astype('timedelta64[s]')
    return seconds
And I use allLogs.groupby('user_id').apply(lambda x: makeSeriesPerUser(x)) to apply it to my user_id groups.
How do I, instead of returning something and adding it to the existing dataframe, make a new dataframe that holds a series of these time deltas for each unique user_id (each user has a different number of logs)?
You should just create a dict where the keys are the user IDs and the values are the relevant DataFrames per user. There is no need to keep everything in one giant DataFrame, unless you have millions of users with only a few records apiece.
First off, you should use chaining. It's much simpler to read.
Secondly, pd.DataFrame.groupby().apply can take the function itself; no lambda is required.
Your sort_values(..., inplace=True) returns None. Removing inplace=True will return the sorted DataFrame.
def makeSeriesPerUser(df):
    df = df[['message','date_time']]
    df = df.drop_duplicates(['date_time','message'])
    df = df.sort_values(by='date_time', inplace = True)
    m1 = (df['message'] == df['message'].shift(-1))
    df = df[~(m1)]
    df = (df['date_time'].shift(-1) - df['date_time'])
    df = df.reset_index(drop=True)
    seconds = m1.astype('timedelta64[s]')
    return seconds
Turns into
def extract_timedelta(df_grouped_by_user: pd.DataFrame) -> pd.Series:
    selected_columns = ['message', 'date_time']
    time_delta = (df_grouped_by_user[selected_columns]
                  .drop_duplicates(selected_columns)   # drop duplicate entries
                  ['date_time']                        # select the date_time column
                  .sort_values()                       # sort the date_time values
                  .diff()                              # difference between consecutive timestamps
                  .astype('timedelta64[s]')            # cast to seconds
                  .reset_index(drop=True)
                  )
    return time_delta
time_delta_df = df.groupby('user_id').apply(extract_timedelta)
This returns a dataframe of timedeltas and is grouped by each user_id. The grouped dataframe is actually just a series with a MultiIndex. This index is just a tuple['user_id', int].
If you want a new dataframe with users as columns, then you can do this:
data = {group_name: extract_timedelta(group_df) for group_name, group_df in messages_df.groupby('user_id')}
time_delta_df = pd.DataFrame(data)
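If you prefer to stay with the grouped series from groupby('user_id').apply(extract_timedelta), here is a short sketch (variable names assumed, and user id 123 is hypothetical) of working with its MultiIndex:
# time_delta_df is the MultiIndexed Series returned by df.groupby('user_id').apply(extract_timedelta)
one_user = time_delta_df.loc[123]        # time deltas for a single user id
wide = time_delta_df.unstack(level=0)    # users become columns; shorter series are padded with NaN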

Check if value in Date column value is month end

I am new to Python so I'm sorry if this sounds silly. I have a date column in a DataFrame. I need to check if the values in the date column are the end of the month; if yes, add one day and display the result in a new date column, and if not, just replace the day with the first of that month.
For example, if the date is 2000/3/31 then the output date will be 2000/4/1,
and if the date is 2000/3/30 then the output date will be 2000/3/1.
I can do a row-wise iteration over the column, but I was wondering if there is a pythonic way to do it.
Let's say my date column is called "Date", the new column I want to create is "Date_new", and my dataframe is df. I am trying to code it like this, but it is giving me an error:
if(df['Date'].dt.is_month_end == 'True'):
    df['Date_new'] = df['Date'] + timedelta(days = 1)
else:
    df['Date_new'] = df['Date'].replace(day=1)
I made your if statement into a function and modified it a bit so it works across the dataframe. I used the dataframe .apply method with axis=1, so the function receives one row at a time.
import pandas as pd
import datetime

df = pd.DataFrame({'Date': [datetime.datetime(2022, 1, 31), datetime.datetime(2022, 1, 20)]})
print(df)

def my_func(row):
    # with axis=1, apply passes each row to this function
    if row['Date'].is_month_end:
        return row['Date'] + datetime.timedelta(days=1)
    else:
        return row['Date'].replace(day=1)

df['Date_new'] = df.apply(my_func, axis=1)
print(df)
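As an aside, the row-wise apply can usually be avoided here. A vectorized sketch using np.where (not part of the original answer, and assuming df['Date'] is already a datetime column):
import numpy as np
import pandas as pd

is_end = df['Date'].dt.is_month_end
df['Date_new'] = np.where(
    is_end,
    df['Date'] + pd.Timedelta(days=1),                              # month end -> next day
    df['Date'] - pd.to_timedelta(df['Date'].dt.day - 1, unit='D'),  # otherwise -> first of the month
)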

Randomly selecting rows from dataframe column by date

For a given dataframe column, I would like to randomly select roughly 60% of the rows within each day and add them to a new column, add the remaining 40% to another column, multiply the 40% column by (-1), and create a new column that merges these back together for each day (so that within each day I have a 60/40 split):
I have asked the same question without the daily specification here: Randomly selecting rows from dataframe column
Example below illustrates this (although my ratio is not exactly 60/40 there):
dict0 = {'date': ['1/1/2019','1/1/2019','1/1/2019','1/2/2019','1/1/2019','1/2/2019'], 'x1': [1,2,3,4,5,6]}
df = pd.DataFrame(dict0)  # starting dataframe
df['date'] = pd.to_datetime(df['date']).dt.date
dict1 = {'date': ['1/1/2019','1/1/2019','1/1/2019','1/2/2019','1/1/2019','1/2/2019'], 'x1': [1,2,3,4,5,6], 'x2': [1,'nan',3,'nan',5,6], 'x3': ['nan',2,'nan',4,'nan','nan']}
df = pd.DataFrame(dict1)  # step 1: roughly 60% of each day's rows go to x2, the rest to x3
df['date'] = pd.to_datetime(df['date']).dt.date
dict2 = {'date': ['1/1/2019','1/1/2019','1/1/2019','1/2/2019','1/1/2019','1/2/2019'], 'x1': [1,2,3,4,5,6], 'x2': [1,'nan',3,'nan',5,6], 'x3': ['nan',-2,'nan',-4,'nan','nan']}
df = pd.DataFrame(dict2)  # step 2: the 40% column is multiplied by -1
df['date'] = pd.to_datetime(df['date']).dt.date
dict3 = {'date': ['1/1/2019','1/1/2019','1/1/2019','1/2/2019','1/1/2019','1/2/2019'], 'x1': [1,2,3,4,5,6], 'x2': [1,'nan',3,'nan',5,6], 'x3': ['nan',-2,'nan',-4,'nan','nan'], 'x4': [1,-2,3,-4,5,6]}
df = pd.DataFrame(dict3)  # step 3: x2 and x3 are merged back into x4
df['date'] = pd.to_datetime(df['date']).dt.date
You can use groupby and sample, get the index values, then create the column x4 with loc, and fillna with the column multiplied by -1, like:
# sample ~60% of rows within each date and grab their original index labels
idx = df.groupby('date').apply(lambda x: x.sample(frac=0.6)).index.get_level_values(1)
df.loc[idx, 'x4'] = df.loc[idx, 'x1']   # sampled rows keep their original value
df['x4'] = df['x4'].fillna(-df['x1'])   # the remaining rows get the negated value
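A quick sanity check (not part of the original answer, and assuming x1 holds positive values as in the example) that the per-day split is roughly 60/40:
# fraction of rows per day whose x4 kept the original positive sign
print((df['x4'] > 0).groupby(df['date']).mean())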

Filter each column by having the same value three times or more

I have a data set that has dates as an index, and each column is the name of an item with a count as the value. I'm trying to figure out how to filter each column to find where the count is zero for more than 3 consecutive days. I was thinking of using a for loop; any help is appreciated. I'm using Python for this project.
I'm fairly new to Python; so far I tried using for loops but did not get them to work.
for i in a.index:
    if a.loc[i,'name'] == 3 == df.loc[i+1,'name'] == df.loc[i+2,'name']:
        print(a.loc[i,"name"])
This raises: Cannot add integral value to Timestamp without freq.
It would be better if you included a sample dataframe and the desired output in your question; please do so next time. Without it, I have to guess what your data looks like and may not be answering your question. I assume the values are integers. Does your dataframe have a row for every day? I will assume that might not be the case and make it so that every day in the last delta days has a row. I created a sample dataframe like this:
import pandas as pd
import numpy as np
import datetime
# Here I am just creating random data from your description
delta = 365
start_date = datetime.datetime.now() - datetime.timedelta(days=delta)
end_date = datetime.datetime.now()
datetimes = [end_date - diff for diff in [datetime.timedelta(days=i) for i in range(delta,0,-1)]]
# This is the list of dates we will have in our final dataframe (includes all days)
dates = pd.Series([date.strftime('%Y-%m-%d') for date in datetimes], name='Date', dtype='datetime64[ns]')
# random integer dataframe
df = pd.DataFrame(np.random.randint(0, 5, size=(delta,4)), columns=['item' + str(i) for i in range(4)])
df = pd.concat([df, dates], axis=1).set_index('Date')
# Create a missing day
df = df.drop(df.loc['2019-08-01'].name)
# Reindex so that index has all consecutive days
df = df.reindex(index=dates)
Now that we have a sample dataframe, the rest is straightforward. I check whether a value in the dataframe is equal to 0 and then take a rolling sum with a window of 4 (i.e. more than 3). This way I can avoid for loops. The resulting dataframe has all the rows where at least one of the items had a value of 0 for 4 consecutive rows. If there is a 0 for more than window consecutive rows, it will show up as multiple rows whose dates are just one day apart. I hope that makes sense.
# custom function, as I want np.nan returned if a value does not equal test_value
def equals(df_value, test_value=0):
    return 1 if df_value == test_value else np.nan

# apply the function to every value in the dataframe,
# then for each row take the sum of the window of 4 rows ending there (>3)
df = df.applymap(equals).rolling(window=4).sum()
# if there was np.nan anywhere in the window, the sum is np.nan, so it can be dropped
# keep only the rows where there is at least 1 non-NaN value
df = df.dropna(thresh=1)
# drop all columns that don't have any values
df = df.dropna(thresh=1, axis=1)
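As a follow-up sketch (not part of the original answer): if the goal is specifically to keep only the item columns that ever have 4 or more consecutive zero-count days, the check can be done directly on the reindexed daily counts (called counts here, i.e. the dataframe as it was before the applymap step overwrote df):
zero_window = counts.eq(0).rolling(window=4).sum().eq(4)  # True wherever a 4-day all-zero window ends
cols_with_runs = zero_window.any()                        # per column: does such a run ever occur?
filtered = counts.loc[:, cols_with_runs]                  # keep only those item columns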

Row wise operations in pandas dataframe based on dates (sorting issue)

This question has two parts:
1) Is there a better way to do this?
2) If NO to #1, how can I fix my date issue?
I have a dataframe as follows
GROUP  DATE        VALUE  DELTA
A      12/20/2015  2.5    ??
A      11/30/2015  25
A      1/31/2016   8.3
B      etc         etc
B      etc         etc
C      etc         etc
C      etc         etc
This is a representation; there are close to 100 rows for each group (each row representing a unique date).
For each letter in GROUP, I want to find the change in value between successive dates. So for example for GROUP A I want the change between 11/30/2015 and 12/20/2015, which is -22.5. Currently I am doing the following:
df['DATE'] = pd.to_datetime(df['DATE'], infer_datetime_format=True)
df.sort_values('DATE', ascending=True)

df_out = []
for GROUP in df.GROUP.unique():
    x = df[df.GROUP == GROUP]
    x['VALUESHIFT'] = x['VALUE'].shift(+1)
    x['DELTA'] = x['VALUE'].sub(x['VALUESHIFT'])
    df_out.append(x)
df_out = pd.concat(df_out)
The challenge I am running into is the dates are not sorted correctly. So when the shift takes place and I calculate the delta it is not really the delta between successive dates.
Is this the right approach? If so, how can I fix my date issue? I have reviewed/tried the following to no avail:
Applying datetime format in pandas for sorting
how to make a pandas dataframe column into a datetime object showing just the date to correctly sort
doing calculations in pandas dataframe based on trailing row
Pandas - Split dataframe into multiple dataframes based on dates?
Answering my own question. This works:
df['DATE'] = pd.to_datetime(df['DATE'], infer_datetime_format=True)

df_out = []
for ID in df.GROUP.unique():
    x = df[df.GROUP == ID]
    x.sort_values('DATE', ascending=True, inplace=True)
    x['VALUESHIFT'] = x['VALUE'].shift(+1)
    x['DELTA'] = x['VALUE'].sub(x['VALUESHIFT'])
    df_out.append(x)
df_out = pd.concat(df_out)
1) Added inplace=True to the sort_values call.
2) Moved the sort inside the for loop.
3) Changed my loop variable from GROUP to ID, since GROUP is also the name of a column, which I imagine is considered sloppy.
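As an aside on part 1 of the question, the explicit loop can usually be avoided with a grouped diff. A sketch (not from the original question or answer, using the same column names):
df['DATE'] = pd.to_datetime(df['DATE'], infer_datetime_format=True)
df = df.sort_values(['GROUP', 'DATE'])
# difference between each row's VALUE and the previous date's VALUE within the same GROUP
df['DELTA'] = df.groupby('GROUP')['VALUE'].diff()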
