I have a dataframe df1, and I want to calculate the days between two dates given three conditions and create a new column DiffDays with the difference in days.
1) When Yes is 1
2) When values in Value are non-zero
3) Must be UserId specific (perhaps with groupby())
df1 = pd.DataFrame({'Date': ['02.01.2017', '03.01.2017', '04.01.2017', '05.01.2017',
                             '01.01.2017', '02.01.2017', '03.01.2017'],
                    'UserId': [1, 1, 1, 1, 2, 2, 2],
                    'Value': [0, 0, 0, 100, 0, 1000, 0],
                    'Yes': [1, 0, 0, 0, 1, 0, 0]})
For example, for UserId 1, calculate the difference between the date where Value is non-zero (05.01.2017) and the date where Yes is 1 (02.01.2017). The result is three days, which goes in row 3.
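As a quick check of that arithmetic (a minimal sketch; the dates are day-first):

import pandas as pd

# 05.01.2017 minus 02.01.2017, parsed day-first
diff = pd.to_datetime('05.01.2017', dayfirst=True) - pd.to_datetime('02.01.2017', dayfirst=True)
print(diff.days)   # 3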
Expected outcome:
         Date  UserId  Value  Yes  DiffDays
0  02.01.2017       1      0    1         0
1  03.01.2017       1      0    0         0
2  04.01.2017       1      0    0         0
3  05.01.2017       1    100    0         3
4  01.01.2017       2      0    1         0
5  02.01.2017       2   1000    0         1
6  03.01.2017       2      0    0         0
I couldn't find anything on Stackoverflow about this, and not sure how to start.
import numpy as np
import pandas as pd

def dayDiff(group):
    # Return all zeros if the group has no Yes == 1 row or no non-zero Value
    if (not (group.Yes == 1).any()) or (not (group.Value > 0).any()):
        return np.zeros(group.Date.count())
    min_date = group[group.Yes == 1].Date.iloc[0]   # first date where Yes == 1
    max_date = group[group.Value > 0].Date.iloc[0]  # first date with a non-zero Value
    delta = max_date - min_date
    # Place the day difference on the non-zero Value rows, zeros elsewhere
    return np.where(group.Value > 0, delta.days, 0)

df1.Date = pd.to_datetime(df1.Date, dayfirst=True)
DateDiff = df1.groupby('UserId').apply(dayDiff).explode().rename('DateDiff').reset_index(drop=True)
pd.concat([df1, DateDiff], axis=1)
Returns:
Date UserId Value Yes DateDiff
0 2017-01-02 1 0 1 0
1 2017-01-03 1 0 0 0
2 2017-01-04 1 0 0 0
3 2017-01-05 1 100 0 3
4 2017-01-01 2 0 1 0
5 2017-01-02 2 1000 0 1
6 2017-01-03 2 0 0 0
Although this answers your question, the date diff logic is hard to follow, especially when it comes to the placement of the DateDiff values.
Update
pd.Series.explode() was only introduced in pandas 0.25; for those using earlier versions:
df1.Date = pd.to_datetime(df1.Date, dayfirst=True)
DateDiff = (df1
            .groupby('UserId')
            .apply(dayDiff)
            .to_frame()
            .explode(0)
            .reset_index(drop=True)
            .rename(columns={0: 'DateDiff'}))
pd.concat([df1, DateDiff], axis=1)
This will yield the same results.
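Since the placement of the DateDiff values above can be hard to follow, here is a more verbose sketch of the same per-group logic (assuming the same df1; the helper name day_diff_verbose is just for illustration and is not part of the answer above):

import pandas as pd

def day_diff_verbose(g):
    # g holds one UserId's rows in their original order
    out = pd.Series(0, index=g.index)
    if (g['Yes'] == 1).any() and (g['Value'] > 0).any():
        start = g.loc[g['Yes'] == 1, 'Date'].iloc[0]   # first row with Yes == 1
        end = g.loc[g['Value'] > 0, 'Date'].iloc[0]    # first row with a non-zero Value
        # Write the day difference only on the non-zero Value rows
        out[g['Value'] > 0] = (end - start).days
    return out

df1['Date'] = pd.to_datetime(df1['Date'], dayfirst=True)
df1['DiffDays'] = df1.groupby('UserId', group_keys=False).apply(day_diff_verbose)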
Related
I have a pandas data frame that looks like this:
Count Status
Date
2021-01-01 11 1
2021-01-02 13 1
2021-01-03 14 1
2021-01-04 8 0
2021-01-05 8 0
2021-01-06 5 0
2021-01-07 2 0
2021-01-08 6 1
2021-01-09 8 1
2021-01-10 10 0
I want to calculate the difference between the initial and final value of the "Count" column before the "Status" column changes from 0 to 1 or vice-versa (for every cycle) and make a new dataframe out of these values.
The output for this example would be:
Cycle Difference
1 3
2 -6
3 2
Use GroupBy.agg on consecutive groups created by comparing shifted values and taking a cumulative sum, then subtract the first value from the last:
df = (df.groupby(df['Status'].ne(df['Status'].shift()).cumsum().rename('Cycle'))['Count']
        .agg(['first','last'])
        .eval('last - first')
        .reset_index(name='Difference'))
print (df)
Cycle Difference
0 1 3
1 2 -6
2 3 2
3 4 0
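For reference, the consecutive-group key used as the grouper looks like this on the original df from the question (a quick check):

# True where Status changes; the cumulative sum then labels each run
print(df['Status'].ne(df['Status'].shift()).cumsum().tolist())
# [1, 1, 1, 2, 2, 2, 2, 3, 3, 4]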
If you need to filter out groups with only one row, it is possible to add a GroupBy.size aggregation and then filter out those rows with DataFrame.loc:
df = (df.groupby(df['Status'].ne(df['Status'].shift()).cumsum().rename('Cycle'))['Count']
        .agg(['first','last', 'size'])
        .loc[lambda x: x['size'] > 1]
        .eval('last - first')
        .reset_index(name='Difference'))
print (df)
Cycle Difference
0 1 3
1 2 -6
2 3 2
You can use a GroupBy.agg on the groups formed of the consecutive values, then take the last value minus the first (see below for variants):
out = (df.groupby(df['Status'].ne(df['Status'].shift()).cumsum())
         ['Count'].agg(lambda x: x.iloc[-1]-x.iloc[0])
       )
output:
Status
1 3
2 -6
3 2
4 0
Name: Count, dtype: int64
If you only want to do this for groups of more than one element:
out = (df.groupby(df['Status'].ne(df['Status'].shift()).cumsum())
         ['Count'].agg(lambda x: x.iloc[-1]-x.iloc[0] if len(x)>1 else pd.NA)
         .dropna()
       )
output:
Status
1 3
2 -6
3 2
Name: Count, dtype: object
output as DataFrame:
add .rename_axis('Cycle').reset_index(name='Difference'):
out = (df.groupby(df['Status'].ne(df['Status'].shift()).cumsum())
         ['Count'].agg(lambda x: x.iloc[-1]-x.iloc[0] if len(x)>1 else pd.NA)
         .dropna()
         .rename_axis('Cycle').reset_index(name='Difference')
       )
output:
Cycle Difference
0 1 3
1 2 -6
2 3 2
Task:
Calculate the frequency of each ID for each month of 2021
Frequency formula: Activity period (Length of time between last activity and first activity) / (Number of activity Days - 1)
e.g. ID 1, Month 2: activity period (2021-02-23 - 2021-02-18 = 5 days) / (3 active days - 1) == frequency = 2.5
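For reference, that example works out like this in code (a minimal sketch using only the three February dates of ID 1):

import pandas as pd

dates = pd.to_datetime(pd.Series(['2021-02-18', '2021-02-22', '2021-02-23']))
activity_period = (dates.max() - dates.min()).days   # 5 days
active_days = dates.nunique()                        # 3 active days
frequency = activity_period / (active_days - 1)      # 5 / 2 == 2.5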
Sample:
times = [
    '2021-02-18',
    '2021-02-22',
    '2021-02-23',
    '2021-04-23',
    '2021-01-18',
    '2021-01-19',
    '2021-01-20',
    '2021-01-03',
    '2021-02-04',
    '2021-02-04'
]
id = [1, 1, 1, 1, 44, 44, 44, 46, 46, 46]
df = pd.DataFrame({'ID':id, 'Date': pd.to_datetime(times)})
df = df.reset_index(drop=True)
print(df)
ID Date
0 1 2021-02-18
1 1 2021-02-22
2 1 2021-02-23
3 1 2021-04-23
4 44 2021-01-18
5 44 2021-01-19
6 44 2021-01-20
7 46 2021-01-03
8 46 2021-02-04
9 46 2021-02-04
Desired Output:
If the frequency is negative, it should be 0.
id 01_2021 02_2021 03_2021 04_2021
0 1 0 2 0 0
1 44 1 0 0 0
2 46 0 0 0 0
Try a pivot_table with a custom aggfunc:
# Create Columns For Later
dr = pd.date_range(start=df['Date'].min(),
                   end=df['Date'].max() + pd.offsets.MonthBegin(1),
                   freq='M').map(lambda dt: dt.strftime('%m_%Y'))

new_df = (
    df.pivot_table(
        index='ID',
        # Columns are dates in MM_YYYY format
        columns=df['Date'].dt.strftime('%m_%Y'),
        # Custom Agg Function; max(1, len(x) - 1) prevents division by zero
        aggfunc=lambda x: (x.max() - x.min()) /
                          pd.offsets.Day(max(1, len(x) - 1))
    )
    # Fix Axis Names and Column Levels
    .droplevel(0, axis=1)
    .rename_axis(None, axis=1)
    # Reindex to include every month from min to max date
    .reindex(dr, axis=1)
    # Clip to exclude negatives
    .clip(lower=0)
    # Fillna with 0
    .fillna(0)
    # Reset index
    .reset_index()
)
print(new_df)
new_df:
ID 01_2021 02_2021 03_2021 04_2021
0 1 0.0 2.5 0.0 0.0
1 44 1.0 0.0 0.0 0.0
2 46 0.0 0.0 0.0 0.0
You will need to pivot the table, but first, if you only want the month and year of the date, you need to transform it:
import numpy as np

df['Date'] = df.Date.map(lambda s: "{}_{}".format(s.year, s.month))
df['counts'] = 1
df_new = pd.pivot_table(df, index=['ID'], columns=['Date'], aggfunc=np.sum)
I have a dataframe and want to create a new column based on other rows of the dataframe. My dataframe looks like
MitarbeiterID ProjektID Jahr Monat Week mean freq last
0 583 83224 2020 1 2 3.875 4 0
1 373 17364 2020 1 3 5.00 0 4
2 923 19234 2020 1 4 5.00 3 3
3 643 17364 2020 1 3 4.00 2 2
Now I want to check: if the freq of a row is zero, then check whether there is another row with the same ProjektID, Jahr and Week where the freq is not 0. If so, the new column "other" should be 1, and 0 otherwise.
So, the output should be
MitarbeiterID ProjektID Jahr Monat Week mean freq last other
0 583 83224 2020 1 2 3.875 4 0 0
1 373 17364 2020 1 3 5.00 0 4 1
2 923 19234 2020 1 4 5.00 3 3 0
3 643 17364 2020 1 3 4.00 2 2 0
This time I have no idea how to approach it. Can anyone help?
Thanks!
The following solution tests if the required conditions are True.
import io
import pandas as pd
Data
df = pd.read_csv(io.StringIO("""
MitarbeiterID ProjektID Jahr Monat Week mean freq last
0 583 83224 2020 1 2 3.875 4 0
1 373 17364 2020 1 3 5.00 0 4
2 923 19234 2020 1 4 5.00 3 3
3 643 17364 2020 1 3 4.00 2 2
"""), sep="\s\s+", engine="python")
Make a column other with all values zero.
df['other'] = 0
If ProjektID, Jahr and Week are duplicated and any of the freq values is larger than zero, then the duplicated rows (keep=False also captures the first duplicated row) where freq is zero get other set to 1. Change any() to all() if you need all values to be larger than zero.
if (df.loc[df[['ProjektID','Jahr', 'Week']].duplicated(), 'freq'] > 0).any():
    df.loc[(df[['ProjektID','Jahr', 'Week']].duplicated(keep=False)) & (df['freq'] == 0), ['other']] = 1
else:
    print("Other stays zero")
Output:
I think the best way to solve this is not to use pandas too much :-) converting things to sets and tuples should make it fast enough.
The idea is to build a set of all the triples (ProjektID, Jahr, Week) that appear in the dataset with freq != 0 and then check, for all lines with freq == 0, whether their triple belongs to this set or not. In code, I'm creating a dummy dataset with:
x = pd.DataFrame(np.random.randint(0, 2, (8, 4)), columns=['id', 'year', 'week', 'freq'])
which in my case randomly gave:
>>> x
id year week freq
0 1 0 0 0
1 0 0 0 1
2 0 1 0 1
3 0 0 1 0
4 0 1 0 0
5 1 0 0 1
6 0 0 1 1
7 0 1 1 0
Now, we want triplets only where freq != 0, so we use
x1 = x.loc[x['freq'] != 0]
triplets = {tuple(row) for row in x1[['id', 'year', 'week']].values}
Note that I'm using x1[...].values, which is not a pandas DataFrame but rather a numpy array, so each row in there can now be converted to a tuple. This is necessary because DataFrame rows, numpy arrays and lists are mutable objects that cannot be hashed, so they could not be put into the set otherwise. Using a set instead of e.g. a list (which doesn't have this restriction) is for efficiency purposes.
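A quick illustration of that hashability point (hypothetical values):

row_as_list = [0, 1, 0]
row_as_tuple = (0, 1, 0)

# {row_as_list}         # would raise TypeError: unhashable type: 'list'
print({row_as_tuple})   # {(0, 1, 0)} -- tuples are hashable, so they can go in a set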
Next, we define a boolean variable which is True if a triplet (id, year, week) belongs to the above set:
belongs = x[['id', 'year', 'week']].apply(lambda x: tuple(x) in triplets, axis=1)
We are basically done; this is the extra column you want, except that we also need to require freq == 0:
x['other'] = np.logical_and(belongs, x['freq'] == 0).astype(int)
(the final .astype(int) is to get values 0 and 1, as you asked, instead of False and True). Final result in my case:
>>> x
id year week freq other
0 1 0 0 0 1
1 0 0 0 1 0
2 0 1 0 1 0
3 0 0 1 0 1
4 0 1 0 0 1
5 1 0 0 1 0
6 0 0 1 1 0
7 0 1 1 0 0
Looks like I am too late ...:
# Use (ProjektID, Jahr, Week) as the index so rows can be compared across groups
df.set_index(['ProjektID', 'Jahr', 'Week'], drop=True, inplace=True)
df['other'] = 0
# Where freq == 0, set other to True if the same index also occurs with freq != 0
df.other.mask(df.freq == 0,
              df.freq[df.freq == 0].index.isin(df.freq[df.freq != 0].index),
              inplace=True)
df.other = df.other.astype('int')
df.reset_index(drop=False, inplace=True)
Context
I'd like to create a time series (with pandas) that counts the distinct id values that are active at each considered date, i.e. whose start and end dates span that date.
For sake of legibility, this is a simplified version of the problem.
Data
Let's define the Data this way:
df = pd.DataFrame({
    'customerId': ['1', '1', '1', '2', '2'],
    'id': ['1', '2', '3', '1', '2'],
    'startDate': ['2000-01', '2000-01', '2000-04', '2000-05', '2000-06'],
    'endDate': ['2000-08', '2000-02', '2000-07', '2000-07', '2000-08'],
})
And the period range this way:
period_range = pd.period_range(start='2000-01', end='2000-07', freq='M')
Objectives
For each customerId, there are several distinct id.
The final aim is to get, for each date of the period range and for each customerId, the count of distinct id values whose startDate and endDate satisfy the function my_date_predicate.
Simplified definition of my_date_predicate:
unset_date = pd.to_datetime("1900-01")

def my_date_predicate(date, row):
    # An endDate equal to unset_date means "no end date set yet"
    return row.startDate <= date and \
           (row.endDate == unset_date or row.endDate > date)
Expected result
I'd like a time series result like this:
date customerId customerCount
0 2000-01 1 2
1 2000-01 2 0
2 2000-02 1 1
3 2000-02 2 0
4 2000-03 1 1
5 2000-03 2 0
6 2000-04 1 2
7 2000-04 2 0
8 2000-05 1 2
9 2000-05 2 1
10 2000-06 1 2
11 2000-06 2 2
12 2000-07 1 1
13 2000-07 2 0
Question
How could I use pandas to get such result?
Here's a solution:
df.startDate = pd.to_datetime(df.startDate)
df.endDate = pd.to_datetime(df.endDate)
# One row per month in which the id is active; closed="left" excludes the end month
# (newer pandas versions use inclusive="left" instead)
df["month"] = df.apply(lambda row: pd.date_range(row["startDate"], row["endDate"],
                                                 freq="MS", closed="left"), axis=1)
df = df.explode("month")
period_range = pd.period_range(start='2000-01', end='2000-07', freq='M')
t = pd.DataFrame(period_range.to_timestamp(), columns=["month"])
customers_df = pd.DataFrame(df.customerId.unique(), columns=["customerId"])
# Cross join every month with every customer, then left-join the exploded rows
t = pd.merge(t.assign(dummy=1), customers_df.assign(dummy=1), on="dummy").drop("dummy", axis=1)
t = pd.merge(t, df, on=["customerId", "month"], how="left")
t.groupby(["month", "customerId"]).count()[["id"]].rename(columns={"id": "count"})
The result is:
count
month customerId
2000-01-01 1 2
2 0
2000-02-01 1 1
2 0
2000-03-01 1 1
2 0
2000-04-01 1 2
2 0
2000-05-01 1 2
2 1
2000-06-01 1 2
2 2
2000-07-01 1 1
2 1
Note:
For unset dates, replace the end date with the very last date you're interested in before you start the calculation.
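A minimal sketch of that note, assuming unset end dates are stored as the sentinel unset_date from the question and are replaced before the month column is built:

# Cap unset end dates; because the date_range above is closed on the left,
# using the first month *after* the period of interest keeps 2000-07 included.
df.loc[df['endDate'].eq(unset_date), 'endDate'] = pd.Timestamp('2000-08-01')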
You can do it with two pivot_table calls to get the count of id per customer (columns) for the start dates and for the end dates (index). Reindex each one with the period_range you are interested in, subtract the end pivot from the start pivot, and use cumsum to get the cumulative count of id per customerId. Finally, use stack and reset_index to bring it to the wanted shape.
#convert to period columns like period_range
df['startDate'] = pd.to_datetime(df['startDate']).dt.to_period('M')
df['endDate'] = pd.to_datetime(df['endDate']).dt.to_period('M')
#create the pivots
pvs = (df.pivot_table(index='startDate', columns='customerId', values='id',
                      aggfunc='count', fill_value=0)
         .reindex(period_range, fill_value=0)
      )
pve = (df.pivot_table(index='endDate', columns='customerId', values='id',
                      aggfunc='count', fill_value=0)
         .reindex(period_range, fill_value=0)
      )
print (pvs)
customerId 1 2
2000-01 2 0 #two ids for customer 1 start in this month
2000-02 0 0
2000-03 0 0
2000-04 1 0
2000-05 0 1 #one id for customer 2 starts in this month
2000-06 0 1
2000-07 0 0
Now you can subtract one from the other and use cumsum to get the wanted amount per date.
res = (pvs - pve).cumsum().stack().reset_index()
res.columns = ['date', 'customerId','customerCount']
print (res)
date customerId customerCount
0 2000-01 1 2
1 2000-01 2 0
2 2000-02 1 1
3 2000-02 2 0
4 2000-03 1 1
5 2000-03 2 0
6 2000-04 1 2
7 2000-04 2 0
8 2000-05 1 2
9 2000-05 2 1
10 2000-06 1 2
11 2000-06 2 2
12 2000-07 1 1
13 2000-07 2 1
Not really sure how to handle the unset_date, as I don't see what it is used for.
I am trying to apply left join to the two dataframe shown below.
outlier day season
0 11556.0 0 1
==========================================
date bikeid date2
0 1 16736 2016-06-06
1 1 16218 2016-06-13
2 1 15254 2016-06-20
3 1 16327 2016-06-27
4 1 17745 2016-07-04
5 1 16975 2016-07-11
6 1 17705 2016-07-18
7 1 16792 2016-07-25
8 1 18540 2016-08-01
9 1 17212 2016-08-08
10 1 11556 2016-08-15
11 1 17694 2016-08-22
12 1 14936 2016-08-29
outliers = pd.merge(outliers, sum_Day, how = 'left', left_on = ['outlier'], right_on = ['bikeid'])
outliers = outliers.dropna(axis=1, how='any')
trip_outlier day season
0 11556.0 0 1
As shown above, after applying the left join I dropped every column containing NaN, which gives the result above. However, the desired result should be as shown below:
trip_outlier day season date2
0 11556.0 0 1 2016-08-15
It seems the dtype of the outlier column in outliers is float. Both join columns need the same dtype.
Check it by:
print (outliers['outlier'].dtype)
print (sum_Day['bikeid'].dtype)
So use astype to convert:
outliers['outlier'] = outliers['outlier'].astype(int)
#if not int
#sum_Day['bikeid'] = sum_Day['bikeid'].astype(int)
EDIT:
If the outlier column contains NaNs, it cannot be converted to int, so it is necessary to remove the NaNs first:
outliers = outliers.dropna(subset=['outlier'])
outliers['outlier'] = outliers['outlier'].astype(int)
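Once the dtypes match, the original left join from the question should keep date2 as desired (repeated here only for completeness):

outliers = pd.merge(outliers, sum_Day, how='left',
                    left_on='outlier', right_on='bikeid')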
One way to get the desired result would be using the below code:
outliers = outliers.merge(sum_Day.rename(columns={'bikeid': 'outlier'}),
                          on='outlier', how='left')