GroupBy makes time index disappear - python

With this DataFrame:
import pandas as pd
df = pd.DataFrame([[1,1],[1,2],[1,3],[1,5],[1,7],[1,9]], index=pd.date_range('2015-01-01', periods=6), columns=['a', 'b'])
i.e.
a b
2015-01-01 1 1
2015-01-02 1 2
2015-01-03 1 3
2015-01-04 1 5
2015-01-05 1 7
2015-01-06 1 9
Using df = df.groupby(df.b // 4).last() makes the datetime index disappear. Why?
a b
b
0 1 3
1 1 7
2 1 9
Expected result instead:
a b
2015-01-03 1 3
2015-01-05 1 7
2015-01-06 1 9

With groupby, the index of the result always comes from the grouping values. In your case you could use reset_index and then set_index:
df['c'] = df.b // 4
result = df.reset_index().groupby('c').last().set_index('index')
In [349]: result
Out[349]:
a b
index
2015-01-03 1 3
2015-01-05 1 7
2015-01-06 1 9
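If you would rather not add a helper column, an alternative sketch (starting from the original df from the question, before it is overwritten, and using the same grouping key) is to grab the last index label of each group and then select those rows with .loc, which keeps the DatetimeIndex:
# Timestamp of the last row in each group, then row selection by label
last_idx = df.groupby(df.b // 4).apply(lambda g: g.index[-1])
result = df.loc[last_idx]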

Related

Pandas dataframe condition on datetime at other rows

My dataframe is shown as follows:
User Date Unit
1 A 2000-10-31 1
2 A 2001-10-31 2
3 A 2002-10-31 1
4 A 2003-10-31 2
5 B 2000-07-31 1
6 B 2000-08-31 2
7 B 2001-07-31 1
8 B 2002-06-30 1
9 B 2002-07-31 1
10 B 2002-08-31 1
I want to make the following judgement:
(1) If the 'User' had the same 'Unit' in the same month within the past two consecutive years, the row should be classified as 'Routine' with a dummy value of 1.
(2) Otherwise, the row should be classified as 0 in the 'Routine' column.
(3) If the data does not cover the two past consecutive years, the 'Routine' column should show NaN.
My desired output is:
User Date Unit Routine
1 A 2000-10-31 1 NaN
2 A 2001-10-31 2 NaN
3 A 2002-10-31 1 1
4 A 2003-10-31 2 1
5 B 2000-07-31 1 NaN
6 B 2000-08-31 2 NaN
7 B 2001-07-31 1 NaN
8 B 2002-06-30 1 0
9 B 2002-07-31 1 1
10 B 2002-08-31 1 0
The code that creates the dataframe is:
df = pd.DataFrame({'User': list('AAAABBBBBB'),
                   'Date': ['2000-10-31', '2001-10-31', '2002-10-31', '2003-10-31', '2000-07-31',
                            '2000-08-31', '2001-07-31', '2002-06-30', '2002-07-31', '2002-08-31'],
                   'Unit': [1, 2, 1, 2, 1, 2, 1, 1, 1, 1]})
df['Date'] = pd.to_datetime(df['Date'])
I want to use the groupby function since there are many users in the dataframe. Thank you.
The code:
import pandas as pd
import numpy as np
df = pd.DataFrame(
    {
        'User': list('AAAABBBBBB'),
        'Date': [
            '2000-10-31', '2001-10-31', '2002-10-31', '2003-10-31',
            '2000-07-31', '2000-08-31', '2001-07-31', '2002-06-30',
            '2002-07-31', '2002-08-31'],
        'Unit': [1, 2, 1, 2, 1, 2, 1, 1, 1, 1]})
df['Date'] = pd.to_datetime(df['Date'])

def routine(user, cdate, unit):
    # Default: not enough history -> NaN
    result = np.nan
    # The two years immediately preceding the current row's year
    two_years = [cdate.year - 1, cdate.year - 2]
    # Rows of the same user that fall within those two years
    mask = df.User == user
    mask = mask & df.Date.dt.year.isin(two_years)
    sdf = df[mask]
    years = sdf.Date.dt.year.to_list()
    got_years = all(y in years for y in two_years)
    # Both previous years present -> at least 0
    result = 0 if (sdf.shape[0] > 0) & got_years else result
    # Same month and same unit found in that history -> 1
    mask2 = (sdf.Date.dt.month == cdate.month) & (sdf.Unit == unit)
    sdf = sdf[mask2]
    result = 1 if (sdf.shape[0] > 0) & got_years else result
    return result

df['Routine'] = df.apply(
    lambda row: routine(row['User'], row['Date'], row['Unit']), axis=1)
print(df)
Output:
User Date Unit Routine
0 A 2000-10-31 1 NaN
1 A 2001-10-31 2 NaN
2 A 2002-10-31 1 1.0
3 A 2003-10-31 2 1.0
4 B 2000-07-31 1 NaN
5 B 2000-08-31 2 NaN
6 B 2001-07-31 1 NaN
7 B 2002-06-30 1 0.0
8 B 2002-07-31 1 1.0
9 B 2002-08-31 1 0.0
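Since the question asks for a groupby-based approach, here is a sketch of the same rules written per user with groupby/apply, so each row only scans its own user's history (the helper name routine_for_user and the group_keys=False argument are illustrative choices, not part of the answer above):
def routine_for_user(g):
    # g contains all rows of one user
    out = []
    for _, row in g.iterrows():
        prev_years = {row.Date.year - 1, row.Date.year - 2}
        prev = g[g.Date.dt.year.isin(prev_years)]
        if set(prev.Date.dt.year) >= prev_years:
            # Both previous years present: 1 if the same month and unit occurred, else 0
            hit = prev[(prev.Date.dt.month == row.Date.month) & (prev.Unit == row.Unit)]
            out.append(1.0 if len(hit) else 0.0)
        else:
            out.append(np.nan)
    return pd.Series(out, index=g.index)

df['Routine'] = df.groupby('User', group_keys=False).apply(routine_for_user)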

Conditional Running Count in Pandas for All Previous Rows Only

Suppose I have the following DataFrame:
df = pd.DataFrame({'Event': ['A', 'B', 'A', 'A', 'B', 'C', 'B', 'B', 'A', 'C'],
                   'Date': ['2019-01-01', '2019-02-01', '2019-03-01', '2019-03-01', '2019-02-15',
                            '2019-03-15', '2019-04-05', '2019-04-05', '2019-04-15', '2019-06-10'],
                   'Sale': [100, 200, 150, 200, 150, 100, 300, 250, 500, 400]})
df['Date'] = pd.to_datetime(df['Date'])
df
Event Date
A 2019-01-01
B 2019-02-01
A 2019-03-01
A 2019-03-01
B 2019-02-15
C 2019-03-15
B 2019-04-05
B 2019-04-05
A 2019-04-15
C 2019-06-10
I would like to obtain the following result:
Event Date Previous_Event_Count
A 2019-01-01 0
B 2019-02-01 0
A 2019-03-01 1
A 2019-03-01 1
B 2019-02-15 1
C 2019-03-15 0
B 2019-04-05 2
B 2019-04-05 2
A 2019-04-15 3
C 2019-06-10 1
where df['Previous_Event_Count'] is, for each row, the number of earlier occurrences of the same event (df['Event']) whose date (df['Date']) is strictly before that row's date. For instance,
The number of times event A takes place before 2019-01-01 is 0,
The number of times event A takes place before 2019-03-01 is 1, and
The number of times event A takes place before 2019-04-15 is 3.
I am able to obtain the desired result using this line:
df['Previous_Event_Count'] = [
    df.loc[(df.loc[i, 'Event'] == df['Event']) & (df.loc[i, 'Date'] > df['Date']),
           'Date'].count()
    for i in range(len(df))]
It works fine, although it is slow. I believe there is a better way to do it. I have tried this line:
df['Previous_Event_Count'] = df.query('Date < Date').groupby(['Event', 'Date']).cumcount()
but it produces NaNs.
groupby + rank
Dates can be treated as numeric. Use method='min' to get your counting logic.
df['PEC'] = (df.groupby('Event').Date.rank(method='min')-1).astype(int)
Event Date PEC
0 A 2019-01-01 0
1 B 2019-02-01 0
2 A 2019-03-01 1
3 A 2019-03-01 1
4 B 2019-02-15 1
5 C 2019-03-15 0
6 B 2019-04-05 2
7 B 2019-04-05 2
8 A 2019-04-15 3
9 C 2019-06-10 1
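To see why the counting logic works, here is a small sketch on the event-B rows from the example: with method='min', tied dates all receive the rank of the earliest tied position, which is one more than the number of strictly earlier dates within the group.
b_dates = df.loc[df.Event == 'B', 'Date']
print(b_dates.rank(method='min'))  # 1.0, 2.0, 3.0, 3.0 -> subtracting 1 gives 0, 1, 2, 2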
First get counts with GroupBy.size over both columns, then within the first level shift and take the cumulative sum, and finally join back to the original:
s = (df.groupby(['Event', 'Date'])
       .size()
       .groupby(level=0)
       .apply(lambda x: x.shift(1).cumsum())
       .fillna(0)
       .astype(int))
df = df.join(s.rename('Previous_Event_Count'), on=['Event', 'Date'])
print(df)
Event Date Previous_Event_Count
0 A 2019-01-01 0
1 B 2019-02-01 0
2 A 2019-03-01 1
3 A 2019-03-01 1
4 B 2019-02-15 1
5 C 2019-03-15 0
6 B 2019-04-05 2
7 B 2019-04-05 2
8 A 2019-04-15 3
9 C 2019-06-10 1
Finally, I found a better and faster way to get the desired result. It turns out to be very simple. One can try:
df['Total_Previous_Sale'] = df.groupby('Event').cumcount() \
- df.groupby(['Event', 'Date']).cumcount()
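As a quick sanity check (a sketch, assuming rows within each Event are already ordered by Date, as they are in this example), the cumcount difference can be compared against the explicit per-row count:
trick = df.groupby('Event').cumcount() - df.groupby(['Event', 'Date']).cumcount()
slow = pd.Series([((df.Event == df.Event[i]) & (df.Date < df.Date[i])).sum()
                  for i in df.index], index=df.index)
assert (trick == slow).all()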

Pandas pivoting/stacking/reshaping

I'm trying to import data to a pandas DataFrame with columns being date string, label, value. My data looks like the following (just with 4 dates and 5 labels)
from numpy import random
import numpy as np
import pandas as pd
# Creating the data
dates = ("2015-01-01", "2015-01-02", "2015-01-03", "2015-01-04")
values = [random.rand(5) for _ in range(4)]
data = dict(zip(dates,values))
So, the data is a dictionary where the keys are dates and the values are lists of values in which the position is the label.
Loading this data structure into a DataFrame
df1 = pd.DataFrame(data)
gives me the dates as columns, the label as index, and the value as the value.
An alternative loading would be
df2 = pd.DataFrame.from_dict(data, orient='index')
where the dates are index, and columns are labels.
In neither case do I manage to pivot or stack into my preferred view.
How should I approach the pivoting/stacking to get the view I want? Or should I change my data structure before loading it into a DataFrame? In particular I'd like to avoid having to create all the rows of the table beforehand with a bunch of calls to zip.
IIUC:
Option 1
pd.DataFrame.stack
pd.DataFrame(data).stack() \
    .rename('value').rename_axis(['label', 'date']).reset_index()
label date value
0 0 2015-01-01 0.345109
1 0 2015-01-02 0.815948
2 0 2015-01-03 0.758709
3 0 2015-01-04 0.461838
4 1 2015-01-01 0.584527
5 1 2015-01-02 0.823529
6 1 2015-01-03 0.714700
7 1 2015-01-04 0.160735
8 2 2015-01-01 0.779006
9 2 2015-01-02 0.721576
10 2 2015-01-03 0.246975
11 2 2015-01-04 0.270491
12 3 2015-01-01 0.465495
13 3 2015-01-02 0.622024
14 3 2015-01-03 0.227865
15 3 2015-01-04 0.638772
16 4 2015-01-01 0.266322
17 4 2015-01-02 0.575298
18 4 2015-01-03 0.335095
19 4 2015-01-04 0.761181
Option 2
comprehension
pd.DataFrame(
    [[i, d, v] for d, l in data.items() for i, v in enumerate(l)],
    columns=['label', 'date', 'value']
)
label date value
0 0 2015-01-01 0.345109
1 1 2015-01-01 0.584527
2 2 2015-01-01 0.779006
3 3 2015-01-01 0.465495
4 4 2015-01-01 0.266322
5 0 2015-01-02 0.815948
6 1 2015-01-02 0.823529
7 2 2015-01-02 0.721576
8 3 2015-01-02 0.622024
9 4 2015-01-02 0.575298
10 0 2015-01-03 0.758709
11 1 2015-01-03 0.714700
12 2 2015-01-03 0.246975
13 3 2015-01-03 0.227865
14 4 2015-01-03 0.335095
15 0 2015-01-04 0.461838
16 1 2015-01-04 0.160735
17 2 2015-01-04 0.270491
18 3 2015-01-04 0.638772
19 4 2015-01-04 0.761181
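A third possibility (a sketch, assuming the same data dict as above; the name df3 is just illustrative) is to load the dict with dates as the index and then melt the wide frame into long form:
df3 = (pd.DataFrame.from_dict(data, orient='index')
         .rename_axis('date')
         .reset_index()
         .melt(id_vars='date', var_name='label', value_name='value'))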

Generating sub data frame based on a value in an column

I have the following data frame in pandas. Now I want to generate a sub data frame whenever I see certain values in the Activity column. So for example, I want a data frame with all the data for Name A if the Activity column contains the value 3 or 5.
Name Date Activity
A 01-02-2015 1
A 01-03-2015 2
A 01-04-2015 3
A 01-04-2015 1
B 01-02-2015 1
B 01-02-2015 2
B 01-03-2015 1
B 01-04-2015 5
C 01-31-2015 1
C 01-31-2015 2
C 01-31-2015 2
So for the above data, I want to get
df_A as
Name Date Activity
A 01-02-2015 1
A 01-03-2015 2
A 01-04-2015 3
A 01-04-2015 1
df_B as
B 01-02-2015 1
B 01-02-2015 2
B 01-03-2015 1
B 01-04-2015 5
Since Name C does not have 3 or 5 in the column Activity, I do not want to get this data frame.
Also, the names in the data frame can vary with each input file.
Once I have these data frames separated, I want to plot a time series.
You can group the dataframe by column Name, apply a custom function f and then select dataframes df_A and df_B:
print(df)
Name Date Activity
0 A 2015-01-02 1
1 A 2015-01-03 2
2 A 2015-01-04 3
3 A 2015-01-04 1
4 B 2015-01-02 1
5 B 2015-01-02 2
6 B 2015-01-03 1
7 B 2015-01-04 5
8 C 2015-01-31 1
9 C 2015-01-31 2
10 C 2015-01-31 2
def f(df):
    # Keep the group only if it contains Activity 3 or 5
    if ((df['Activity'] == 3) | (df['Activity'] == 5)).any():
        return df

g = df.groupby('Name').apply(f).reset_index(drop=True)
df_A = g.loc[g.Name == 'A']
print(df_A)
Name Date Activity
0 A 2015-01-02 1
1 A 2015-01-03 2
2 A 2015-01-04 3
3 A 2015-01-04 1
df_B = g.loc[g.Name == 'B']
print(df_B)
Name Date Activity
4 B 2015-01-02 1
5 B 2015-01-02 2
6 B 2015-01-03 1
7 B 2015-01-04 5
In the end you can plot each of them with plot():
df_A.plot()
df_B.plot()
EDIT:
If you want to create the dataframes dynamically, you can find all unique values of column Name with drop_duplicates:
for name in g.Name.drop_duplicates():
    print(g.loc[g.Name == name])
Name Date Activity
0 A 2015-01-02 1
1 A 2015-01-03 2
2 A 2015-01-04 3
3 A 2015-01-04 1
Name Date Activity
4 B 2015-01-02 1
5 B 2015-01-02 2
6 B 2015-01-03 1
7 B 2015-01-04 5
You can use a dictionary comprehension to create a sub dataframe for each Name with an Activity value of 3 or 5.
active_names = df[df.Activity.isin([3, 5])].Name.unique().tolist()
dfs = {name: df.loc[df.Name == name, :] for name in active_names}
>>> dfs['A']
Name Date Activity
0 A 01-02-2015 1
1 A 01-03-2015 2
2 A 01-04-2015 3
3 A 01-04-2015 1
>>> dfs['B']
Name Date Activity
4 B 01-02-2015 1
5 B 01-02-2015 2
6 B 01-03-2015 1
7 B 01-04-2015 5
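To follow up on the time-series plotting goal, a minimal sketch (assuming matplotlib is available and that Date should go on the x-axis; the column names match the example above) could iterate over the dictionary:
import matplotlib.pyplot as plt

for name, sub in dfs.items():
    sub = sub.assign(Date=pd.to_datetime(sub['Date']))
    sub.plot(x='Date', y='Activity', title=name)
plt.show()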

split, groupby, combine in Pandas to find a difference in dates

I have a simple dataframe that looks like this:
I would like to use groupby to group by id, then find some way to difference the dates, and then bind the result back to the dataframe as a new column, so I end up with this:
The groupby is straightforward,
grouped = DF.groupby('id')
and finding the earliest date is straightforward,
mindates = grouped['date'].min()
But I'm not sure how to proceed. How do I apply the date subtraction operation, then combine?
There is a similar question here.
Thanks for reading this far.
My dataframe is:
dates = pd.to_datetime(['2015-01-01', '2015-02-01', '2015-03-01', '2015-04-01', '2015-05-01',
                        '2015-01-01', '2015-01-02', '2015-01-03', '2015-01-04', '2015-01-05'])
DF = pd.DataFrame({'id': [1, 1, 1, 1, 1, 2, 2, 2, 2, 2], 'date': dates})
cols = ['id', 'date']
DF = DF[cols]
EDIT:
Both answers below are awesome. I wish I could accept them both.
You can use apply like this:
earliest_by_id = DF.groupby('id')['date'].min()

def since_earliest(row):
    # Subtract the group's earliest date from this row's date
    return row.date - earliest_by_id[row.id]

DF['days_since_earliest'] = DF.apply(since_earliest, axis=1)
print(DF)
id date days_since_earliest
0 1 2015-01-01 0 days
1 1 2015-02-01 31 days
2 1 2015-03-01 59 days
3 1 2015-04-01 90 days
4 1 2015-05-01 120 days
5 2 2015-01-01 0 days
6 2 2015-01-02 1 days
7 2 2015-01-03 2 days
8 2 2015-01-04 3 days
9 2 2015-01-05 4 days
EDIT:
DF['days_since_earliest'] = DF.apply(since_earliest, axis=1).astype('timedelta64[D]')
print(DF)
id date days_since_earliest
0 1 2015-01-01 0
1 1 2015-02-01 31
2 1 2015-03-01 59
3 1 2015-04-01 90
4 1 2015-05-01 120
5 2 2015-01-01 0
6 2 2015-01-02 1
7 2 2015-01-03 2
8 2 2015-01-04 3
9 2 2015-01-05 4
FWIW, using transform can often be simpler (and usually faster) than apply. transform takes the result of a groupby operation and broadcasts it back up to the original index:
>>> df["dse"] = df["date"] - df.groupby("id")["date"].transform(min)
>>> df
id date dse
0 1 2015-01-01 0 days
1 1 2015-02-01 31 days
2 1 2015-03-01 59 days
3 1 2015-04-01 90 days
4 1 2015-05-01 120 days
5 2 2015-01-01 0 days
6 2 2015-01-02 1 days
7 2 2015-01-03 2 days
8 2 2015-01-04 3 days
9 2 2015-01-05 4 days
If you'd prefer integer days instead of timedelta objects, you can use the dt.days accessor:
>>> df["dse"] = df["dse"].dt.days
>>> df
id date dse
0 1 2015-01-01 0
1 1 2015-02-01 31
2 1 2015-03-01 59
3 1 2015-04-01 90
4 1 2015-05-01 120
5 2 2015-01-01 0
6 2 2015-01-02 1
7 2 2015-01-03 2
8 2 2015-01-04 3
9 2 2015-01-05 4
