I have a DataFrame with a timestamp column
d1=DataFrame({'a':[datetime(2015,1,1,20,2,1),datetime(2015,1,1,20,14,58),
datetime(2015,1,1,20,17,5),datetime(2015,1,1,20,31,5),
datetime(2015,1,1,20,34,28),datetime(2015,1,1,20,37,51),datetime(2015,1,1,20,41,19),
datetime(2015,1,1,20,49,4),datetime(2015,1,1,20,59,21)], 'b':[2,4,26,22,45,3,8,121,34]})
a b
0 2015-01-01 20:02:01 2
1 2015-01-01 20:14:58 4
2 2015-01-01 20:17:05 26
3 2015-01-01 20:31:05 22
4 2015-01-01 20:34:28 45
5 2015-01-01 20:37:51 3
6 2015-01-01 20:41:19 8
7 2015-01-01 20:49:04 121
8 2015-01-01 20:59:21 34
I can group by 15 minute intervals by doing these operations
d2=d1.set_index('a')
d3=d2.groupby(pd.TimeGrouper('15Min'))
The number of rows by group is found by
d3.size()
a
2015-01-01 20:00:00 2
2015-01-01 20:15:00 1
2015-01-01 20:30:00 4
2015-01-01 20:45:00 2
I want my original DataFrame to have a column containing a unique number that identifies the group each row belongs to. For example, the first group
2015-01-01 20:00:00
has 2 rows so the first two rows of my new column in d1 should have the number 1
the second group
2015-01-01 20:15:00
has 1 row so the third row of my new column in d1 should have the number 2
the third group
2015-01-01 20:30:00
has 4 rows so the fourth, fifth, sixth, and seventh rows of my new column in d1 should have the number 3
I want my new DataFrame to look like this
a b c
0 2015-01-01 20:02:01 2 1
1 2015-01-01 20:14:58 4 1
2 2015-01-01 20:17:05 26 2
3 2015-01-01 20:31:05 22 3
4 2015-01-01 20:34:28 45 3
5 2015-01-01 20:37:51 3 3
6 2015-01-01 20:41:19 8 3
7 2015-01-01 20:49:04 121 4
8 2015-01-01 20:59:21 34 4
Use .transform() on your groupby object with an itertools.count iterator:
from datetime import datetime
from itertools import count
import pandas as pd
d1 = pd.DataFrame({'a': [datetime(2015,1,1,20,2,1), datetime(2015,1,1,20,14,58),
datetime(2015,1,1,20,17,5), datetime(2015,1,1,20,31,5),
datetime(2015,1,1,20,34,28), datetime(2015,1,1,20,37,51),
datetime(2015,1,1,20,41,19), datetime(2015,1,1,20,49,4),
datetime(2015,1,1,20,59,21)],
'b': [2, 4, 26, 22, 45, 3, 8, 121, 34]})
d2 = d1.set_index('a')
counter = count(1)
# pd.Grouper replaces pd.TimeGrouper, which newer pandas versions no longer provide
d2['c'] = (d2.groupby(pd.Grouper(freq='15Min'))['b']
           .transform(lambda x: next(counter)))
print(d2)
Output:
b c
a
2015-01-01 20:02:01 2 1
2015-01-01 20:14:58 4 1
2015-01-01 20:17:05 26 2
2015-01-01 20:31:05 22 3
2015-01-01 20:34:28 45 3
2015-01-01 20:37:51 3 3
2015-01-01 20:41:19 8 3
2015-01-01 20:49:04 121 4
2015-01-01 20:59:21 34 4
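If your pandas version no longer has TimeGrouper, a shorter route to the same numbering is GroupBy.ngroup(), which labels groups 0, 1, 2, ... in key order. A minimal sketch on the same d2 (this data has no empty 15-minute bins, so adding 1 reproduces the column above):
# ngroup() numbers each 15-minute bin; add 1 for 1-based labels
d2['c'] = d2.groupby(pd.Grouper(freq='15Min')).ngroup() + 1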
Related
I have a dataframe (snippet below) with index in format YYYYMM and several columns of values, including one called "month" in which I've extracted the MM data from the index column.
index st us stu px month
0 202001 2616757.0 3287969.0 0.795858 2.036 01
1 201912 3188693.0 3137911.0 1.016183 2.283 12
2 201911 3610052.0 2752828.0 1.311398 2.625 11
3 201910 3762043.0 2327289.0 1.616492 2.339 10
4 201909 3414939.0 2216155.0 1.540930 2.508 09
What I want to do is make a new column called 'stavg' which takes the 5-year average of the 'st' column for the given month. For example, since the top row refers to 202001, the stavg for that row should be the average of the January values from 2019, 2018, 2017, 2016, and 2015. Going back in time by each additional year should pull the moving average back as well, such that stavg for the row for, say, 201205 should show the average of the May values from 2011, 2010, 2009, 2008, and 2007.
index st us stu px month stavg
0 202001 2616757.0 3287969.0 0.795858 2.036 01 xxx
1 201912 3188693.0 3137911.0 1.016183 2.283 12 xxx
2 201911 3610052.0 2752828.0 1.311398 2.625 11 xxx
3 201910 3762043.0 2327289.0 1.616492 2.339 10 xxx
4 201909 3414939.0 2216155.0 1.540930 2.508 09 xxx
I know how to generate new columns of data based on operations on other columns on the same row (such as dividing 'st' by 'us' to get 'stu' and extracting digits from index to get 'month') but this notion of creating a column of data based on previous values is really stumping me.
Any clues on how to approach this would be greatly appreciated!! I know that for the first five years of data, I won't be able to populate the 'stavg' column with anything, which is fine--I could use NaN there.
Try defining a function and using the apply method:
df['year'] = df['index'].astype(int) // 100
def get_stavg(df, year, month):
    # average st over the five years before `year`, same month
    df_year_month = df.query('@year - 5 <= year and year < @year and month == @month')
    return df_year_month.st.mean()
df['stavg'] = df.apply(lambda x: get_stavg(df, x['year'], x['month']), axis=1)
If you are looking for a pandas-only solution you could do something like the following.
Dummy Data
Here we create a dummy dataset with 10 years of data and only two months (Jan and Feb).
import pandas as pd
df1 = pd.DataFrame({"date":pd.date_range("2010-01-01", periods=10, freq="AS-JAN")})
df2 = pd.DataFrame({"date":pd.date_range("2010-01-01", periods=10, freq="AS-FEB")})
df1["n"] = df1.index*2
df2["n"] = df2.index*3
df = pd.concat([df1, df2]).sort_values("date").reset_index(drop=True)
df.head(10)
date n
0 2010-01-01 0
1 2010-02-01 0
2 2011-01-01 2
3 2011-02-01 3
4 2012-01-01 4
5 2012-02-01 6
6 2013-01-01 6
7 2013-02-01 9
8 2014-01-01 8
9 2014-02-01 12
Groupby + rolling mean
df["n_mean"] = df.groupby(df["date"].dt.month)["n"]\
.rolling(5).mean()\
.reset_index(0,drop=True)
date n n_mean
0 2010-01-01 0 NaN
1 2010-02-01 0 NaN
2 2011-01-01 2 NaN
3 2011-02-01 3 NaN
4 2012-01-01 4 NaN
5 2012-02-01 6 NaN
6 2013-01-01 6 NaN
7 2013-02-01 9 NaN
8 2014-01-01 8 4.0
9 2014-02-01 12 6.0
10 2015-01-01 10 6.0
11 2015-02-01 15 9.0
12 2016-01-01 12 8.0
13 2016-02-01 18 12.0
14 2017-01-01 14 10.0
15 2017-02-01 21 15.0
16 2018-01-01 16 12.0
17 2018-02-01 24 18.0
18 2019-01-01 18 14.0
19 2019-02-01 27 21.0
By construction, the first 4 values of each month are NaN, because a 5-value rolling window needs five observations.
Update
For your particular case
import pandas as pd
index = [f"{y}01" for y in range(2010, 2020)] +\
[f"{y}02" for y in range(2010, 2020)]
df = pd.DataFrame({"index":index})
df["st"] = df.index + 1
# dates/ index should be sorted
df = df.sort_values("index").reset_index(drop=True)
# extract month
df["month"] = df["index"].str[-2:]
df["st_mean"] = df.groupby("month")["st"]\
.rolling(5).mean()\
.reset_index(0,drop=True)
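Note that rolling(5) includes the current row, so st_mean above averages the current year together with the four before it. If, as in the question, you want the mean of the five preceding years only, one sketch (the st_mean_prev name is just illustrative) is to shift within each month group:
df["st_mean_prev"] = (df.groupby("month")["st"]
                        .transform(lambda s: s.rolling(5).mean().shift()))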
My goal is to fill in the missing monthly date entries per project_id with 0 in the data column.
For example
df = pd.DataFrame({
'project_id': ['A', 'A', 'A', 'B', 'B'],
'timestamp': ['2018-01-01', '2018-03-01', '2018-04-01', '2018-03-01', '2018-06-01'],
'data': [100, 28, 45, 64, 55]})
which is
project_id timestamp data
0 A 2018-01-01 100
1 A 2018-03-01 28
2 A 2018-04-01 45
3 B 2018-03-01 64
4 B 2018-06-01 55
shall become
project_id timestamp data
0 A 2018-01-01 100
1 A 2018-02-01 0
2 A 2018-03-01 28
3 A 2018-04-01 45
4 B 2018-03-01 64
5 B 2018-04-01 0
6 B 2018-05-01 0
7 B 2018-06-01 55
where indices 1, 5, and 6 are added.
My current approach:
df.groupby('project_id').apply(lambda x: x[['timestamp', 'data']].set_index('timestamp').asfreq('M', how='start', fill_value=0))
is obviously wrong, because it sets everything to 0 and resamples to the last date of each month rather than the first, even though I thought the how argument should handle that.
How do I expand/complement missing datetime entries after groupby to get a continuous time series for each group?
You are close:
df.timestamp = pd.to_datetime(df.timestamp)
# notice 'MS'
new_df = df.groupby('project_id').apply(lambda x: x[['timestamp', 'data']]
                                        .set_index('timestamp').asfreq('MS'))
new_df['data'] = df.set_index(['project_id', 'timestamp'])['data']
df = new_df.fillna(0).reset_index()
You can use groupby in combination with pandas.Grouper, applying the time grouper within each project so that the missing months appear with 0:
df['timestamp'] = pd.to_datetime(df['timestamp'])
# one time grouper per project: empty month bins sum to 0
df_new = pd.concat([
    d.set_index('timestamp').groupby(pd.Grouper(freq='MS'))[['data']].sum().assign(project_id=n)
    for n, d in df.groupby('project_id')
])
df_new = (df_new.reset_index()[['timestamp', 'project_id', 'data']]
                .sort_values(['project_id', 'timestamp']).reset_index(drop=True))
Output
print(df_new)
timestamp project_id data
0 2018-01-01 A 100
1 2018-02-01 A 0
2 2018-03-01 A 28
3 2018-04-01 A 45
4 2018-03-01 B 64
5 2018-04-01 B 0
6 2018-05-01 B 0
7 2018-06-01 B 55
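An equivalent sketch that uses GroupBy.resample instead of the concat, assuming timestamp has already been converted to datetime as above; empty month bins again sum to 0:
out = (df.set_index('timestamp')
         .groupby('project_id')['data']
         .resample('MS')
         .sum()
         .reset_index())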
What I have:
A dataframe, df, consists of 3 columns (Id, Item and Timestamp). Each subject has a unique Id, with each Item recorded at a particular date and time (Timestamp). The second dataframe, df_ref, holds the date-time range reference for slicing df: a Start and an End for each subject Id.
df:
Id Item Timestamp
0 1 aaa 2011-03-15 14:21:00
1 1 raa 2012-05-03 04:34:01
2 1 baa 2013-05-08 22:21:29
3 1 boo 2015-12-24 21:53:41
4 1 afr 2016-04-14 12:28:26
5 1 aud 2017-05-10 11:58:02
6 2 boo 2004-06-22 22:20:58
7 2 aaa 2005-11-16 07:00:00
8 2 ige 2006-06-28 17:09:18
9 2 baa 2008-05-22 21:28:00
10 2 boo 2017-06-08 23:31:06
11 3 ige 2011-06-30 13:14:21
12 3 afr 2013-06-11 01:38:48
13 3 gui 2013-06-21 23:14:26
14 3 loo 2014-06-10 15:15:42
15 3 boo 2015-01-23 02:08:35
16 3 afr 2015-04-15 00:15:23
17 3 aaa 2016-02-16 10:26:03
18 3 aaa 2016-06-10 01:11:15
19 3 ige 2016-07-18 11:41:18
20 3 boo 2016-12-06 19:14:00
21 4 gui 2016-01-05 09:19:50
22 4 aaa 2016-12-09 14:49:50
23 4 ige 2016-11-01 08:23:18
df_ref:
Id Start End
0 1 2013-03-12 00:00:00 2016-05-30 15:20:36
1 2 2005-06-05 08:51:22 2007-02-24 00:00:00
2 3 2011-05-14 10:11:28 2013-12-31 17:04:55
3 3 2015-03-29 12:18:31 2016-07-26 00:00:00
What I want:
Slice the df dataframe based on the date-time range given for each Id (groupby Id) in df_ref and concatenate the sliced data into a new dataframe. However, a subject could have more than one date-time range (in this example Id=3 has 2 date-time ranges).
df_expected:
Id Item Timestamp
0 1 baa 2013-05-08 22:21:29
1 1 boo 2015-12-24 21:53:41
2 1 afr 2016-04-14 12:28:26
3 2 aaa 2005-11-16 07:00:00
4 2 ige 2006-06-28 17:09:18
5 3 ige 2011-06-30 13:14:21
6 3 afr 2013-06-11 01:38:48
7 3 gui 2013-06-21 23:14:26
8 3 afr 2015-04-15 00:15:23
9 3 aaa 2016-02-16 10:26:03
10 3 aaa 2016-06-10 01:11:15
11 3 ige 2016-07-18 11:41:18
What I have done so far:
I referred to this post (Time series multiple slice) while writing my code. I modified the code since it does not have the groupby element that I need.
My code:
from datetime import datetime
df['Timestamp'] = pd.to_datetime(df.Timestamp, format='%Y-%m-%d %H:%M:%S')
x = pd.DataFrame()
for pid in df_ref.Id.unique():
    selection = df[(df['Id'] == pid) & (df['Timestamp'] >= df_ref['Start']) & (df['Timestamp'] <= df_ref['End'])]
    x = x.append(selection)
The above code gives this error:
ValueError: Can only compare identically-labeled Series objects
First use merge with the default inner join; it creates all Timestamp/range combinations for duplicated Id values, so each row can be compared on its own (the original ValueError comes from comparing Series with different indexes). Then filter with Series.between and DataFrame.loc, applying the condition and selecting df.columns in one step:
df1 = df.merge(df_ref, on='Id')
df2 = df1.loc[df1['Timestamp'].between(df1['Start'], df1['End']), df.columns]
print (df2)
Id Item Timestamp
2 1 baa 2013-05-08 22:21:29
3 1 boo 2015-12-24 21:53:41
4 1 afr 2016-04-14 12:28:26
7 2 aaa 2005-11-16 07:00:00
8 2 ige 2006-06-28 17:09:18
11 3 ige 2011-06-30 13:14:21
13 3 afr 2013-06-11 01:38:48
15 3 gui 2013-06-21 23:14:26
22 3 afr 2015-04-15 00:15:23
24 3 aaa 2016-02-16 10:26:03
26 3 aaa 2016-06-10 01:11:15
28 3 ige 2016-07-18 11:41:18
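If the merge produces too many rows to hold in memory (every event for an Id is paired with every one of its ranges), a plain loop over df_ref is a workable sketch; it assumes Timestamp, Start and End are all datetimes and that the ranges for a given Id do not overlap:
parts = []
for row in df_ref.itertuples(index=False):
    mask = (df['Id'] == row.Id) & df['Timestamp'].between(row.Start, row.End)
    parts.append(df[mask])
df2 = pd.concat(parts).sort_index()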
I'm trying to import data to a pandas DataFrame with columns being date string, label, value. My data looks like the following (just with 4 dates and 5 labels)
from numpy import random
import numpy as np
import pandas as pd
# Creating the data
dates = ("2015-01-01", "2015-01-02", "2015-01-03", "2015-01-04")
values = [random.rand(5) for _ in range(4)]
data = dict(zip(dates,values))
So, the data is a dictionary where the keys are dates and the values are lists in which the position (index) is the label.
Loading this data structure into a DataFrame
df1 = pd.DataFrame(data)
gives me the dates as columns, the label as index, and the value as the value.
An alternative loading would be
df2 = pd.DataFrame.from_dict(data, orient='index')
where the dates are index, and columns are labels.
In neither case do I manage to pivot or stack to my preferred view.
How should I approach the pivoting/stacking to get the view I want? Or should I change my data structure before loading it into a DataFrame? In particular I'd like to avoid having to create all the rows of the table beforehand with a bunch of calls to zip.
IIUC:
Option 1
pd.DataFrame.stack
pd.DataFrame(data).stack() \
.rename('value').rename_axis(['label', 'date']).reset_index()
label date value
0 0 2015-01-01 0.345109
1 0 2015-01-02 0.815948
2 0 2015-01-03 0.758709
3 0 2015-01-04 0.461838
4 1 2015-01-01 0.584527
5 1 2015-01-02 0.823529
6 1 2015-01-03 0.714700
7 1 2015-01-04 0.160735
8 2 2015-01-01 0.779006
9 2 2015-01-02 0.721576
10 2 2015-01-03 0.246975
11 2 2015-01-04 0.270491
12 3 2015-01-01 0.465495
13 3 2015-01-02 0.622024
14 3 2015-01-03 0.227865
15 3 2015-01-04 0.638772
16 4 2015-01-01 0.266322
17 4 2015-01-02 0.575298
18 4 2015-01-03 0.335095
19 4 2015-01-04 0.761181
Option 2
comprehension
pd.DataFrame(
[[i, d, v] for d, l in data.items() for i, v in enumerate(l)],
columns=['label', 'date', 'value']
)
label date value
0 0 2015-01-01 0.345109
1 1 2015-01-01 0.584527
2 2 2015-01-01 0.779006
3 3 2015-01-01 0.465495
4 4 2015-01-01 0.266322
5 0 2015-01-02 0.815948
6 1 2015-01-02 0.823529
7 2 2015-01-02 0.721576
8 3 2015-01-02 0.622024
9 4 2015-01-02 0.575298
10 0 2015-01-03 0.758709
11 1 2015-01-03 0.714700
12 2 2015-01-03 0.246975
13 3 2015-01-03 0.227865
14 4 2015-01-03 0.335095
15 0 2015-01-04 0.461838
16 1 2015-01-04 0.160735
17 2 2015-01-04 0.270491
18 3 2015-01-04 0.638772
19 4 2015-01-04 0.761181
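For completeness, a melt-based sketch produces the same three-column long format (same column names as above) without stacking:
pd.DataFrame(data).rename_axis('label').reset_index() \
    .melt(id_vars='label', var_name='date', value_name='value')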
I have a simple dataframe with an id column and a date column (defined under "My dataframe is:" below).
I would like to use groupby to group by id, then difference each date from the earliest date in its group, and then bind those differences back to the dataframe as a new column, so I end up with a days-since-earliest column.
The groupby is straightforward,
grouped = DF.groupby('id')
and finding the earliest date is straightforward,
earliest_dates = grouped['date'].min()
But I'm not sure how to proceed. How do I apply the date subtraction operation, then combine?
There is a similar question here.
Thanks for reading this far.
My dataframe is:
dates=pd.to_datetime(['2015-01-01', '2015-02-01', '2015-03-01', '2015-04-01', '2015-05-01', '2015-01-01', '2015-01-02', '2015-01-03', '2015-01-04', '2015-01-05'])
DF = DataFrame({'id':[1,1,1,1,1,2,2,2,2,2], 'date':dates})
cols = ['id', 'date']
DF=DF[cols]
EDIT:
Both answers below are awesome. I wish I could accept them both.
You can use apply like this:
earliest_by_id = DF.groupby('id')['date'].min()
def since_earliest(row):
    return row.date - earliest_by_id[row.id]
DF['days_since_earliest'] = DF.apply(since_earliest, axis=1)
print(DF)
id date days_since_earliest
0 1 2015-01-01 0 days
1 1 2015-02-01 31 days
2 1 2015-03-01 59 days
3 1 2015-04-01 90 days
4 1 2015-05-01 120 days
5 2 2015-01-01 0 days
6 2 2015-01-02 1 days
7 2 2015-01-03 2 days
8 2 2015-01-04 3 days
9 2 2015-01-05 4 days
edit:
DF['days_since_earliest'] = DF.apply(since_earliest, axis=1).astype('timedelta64[D]')
print(DF)
id date days_since_earliest
0 1 2015-01-01 0
1 1 2015-02-01 31
2 1 2015-03-01 59
3 1 2015-04-01 90
4 1 2015-05-01 120
5 2 2015-01-01 0
6 2 2015-01-02 1
7 2 2015-01-03 2
8 2 2015-01-04 3
9 2 2015-01-05 4
FWIW, using transform can often be simpler (and usually faster) than apply. transform takes the result of a groupby operation and broadcasts it back onto the original index:
>>> df["dse"] = df["date"] - df.groupby("id")["date"].transform(min)
>>> df
id date dse
0 1 2015-01-01 0 days
1 1 2015-02-01 31 days
2 1 2015-03-01 59 days
3 1 2015-04-01 90 days
4 1 2015-05-01 120 days
5 2 2015-01-01 0 days
6 2 2015-01-02 1 days
7 2 2015-01-03 2 days
8 2 2015-01-04 3 days
9 2 2015-01-05 4 days
If you'd prefer integer days instead of timedelta objects, you can use the dt.days accessor:
>>> df["dse"] = df["dse"].dt.days
>>> df
id date dse
0 1 2015-01-01 0
1 1 2015-02-01 31
2 1 2015-03-01 59
3 1 2015-04-01 90
4 1 2015-05-01 120
5 2 2015-01-01 0
6 2 2015-01-02 1
7 2 2015-01-03 2
8 2 2015-01-04 3
9 2 2015-01-05 4
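Recent pandas versions prefer the string alias "min" over the Python builtin in transform; a compact sketch that also converts to integer days in one step:
>>> df["dse"] = (df["date"] - df.groupby("id")["date"].transform("min")).dt.days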