Join dataframe value series to another on datetime range from 1st dataframe - python

I am trying to retrieve values from a dataframe (df2) that fall within a datetime range that is specified in another dataframe (df1).
import datetime as dt

import numpy as np
import pandas as pd

df1 = pd.DataFrame({'date': pd.date_range("20180101", periods=5)}, index=list('ABCDE'))
df1['t-2'] = df1['date'] + dt.timedelta(days=-2)
df1['t+2'] = df1['date'] + dt.timedelta(days=2)
df1:
date t-2 t+2
A 2018-01-01 2017-12-30 2018-01-03
B 2018-01-02 2017-12-31 2018-01-04
C 2018-01-03 2018-01-01 2018-01-05
D 2018-01-04 2018-01-02 2018-01-06
E 2018-01-05 2018-01-03 2018-01-07
df2 = pd.DataFrame(np.random.randint(0, 100, size=(5, 50)),
                   index=list('ABCDE'), columns=pd.date_range("20171201", periods=50))
df2:
2017-12-01 2017-12-02 2017-12-03 ... 2018-01-17 2018-01-18 2018-01-19
A 58 61 45 ... 72 77 68
B 88 94 68 ... 93 68 24
C 97 47 21 ... 22 48 89
D 44 8 62 ... 57 29 29
E 0 21 26 ... 65 46 36
I would like to take the 5 cells from df2 that fall between the dates in df1's t-2 and t+2 columns and append these to df1 so that it looks like the following:
date t-2 t+2 t-2d t-1d t t+1d t+2d
A 2018-01-01 2017-12-30 2018-01-03 21 28 25 7 6
B 2018-01-02 2017-12-31 2018-01-04 28 25 7 6 45
C 2018-01-03 2018-01-01 2018-01-05 25 7 6 45 74
D 2018-01-04 2018-01-02 2018-01-06 7 6 45 74 23
E 2018-01-05 2018-01-03 2018-01-07 6 45 74 23 57
So far I have tried any number of combinations of the following:
pd.merge(df1, df2.loc[:, df1['t-2']:df1['t+2']], how='inner', left_index=True, right_index=True)
but I receive a TypeError. Any help greatly appreciated.
TypeError: cannot do slice indexing on DatetimeIndex with these indexers [A 2017-12-30
B 2017-12-31
C 2018-01-01
D 2018-01-02
E 2018-01-03
Name: t-2, dtype: datetime64[ns]] of type Series
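The slice fails because df2.loc[:, start:end] needs scalar bounds, while df1['t-2'] and df1['t+2'] are Series (one bound per row). As a minimal sketch of one possible fix, slicing df2 row by row and relabelling the five columns relative to t (my suggestion, not part of the original question):

import datetime as dt

import numpy as np
import pandas as pd

# Rebuild the example frames from the question.
df1 = pd.DataFrame({'date': pd.date_range("20180101", periods=5)}, index=list('ABCDE'))
df1['t-2'] = df1['date'] + dt.timedelta(days=-2)
df1['t+2'] = df1['date'] + dt.timedelta(days=2)
df2 = pd.DataFrame(np.random.randint(0, 100, size=(5, 50)),
                   index=list('ABCDE'), columns=pd.date_range("20171201", periods=50))

# Slice df2 with scalar bounds, one df1 row at a time, then transpose
# so the window values line up with df1's index.
window = pd.DataFrame(
    {idx: df2.loc[idx, row['t-2']:row['t+2']].to_numpy()
     for idx, row in df1.iterrows()}
).T
window.columns = ['t-2d', 't-1d', 't', 't+1d', 't+2d']
result = df1.join(window)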

Related

How to remove all the values of a specific person from a dataframe when they are not continuous based on date time

date consumption customer_id
2018-01-01 12 111
2018-01-02 12 111
*2018-01-03* 14 111
*2018-01-05* 12 111
2018-01-06 45 111
2018-01-07 34 111
2018-01-01 23 112
2018-01-02 23 112
2018-01-03 45 112
2018-01-04 34 112
2018-01-05 23 112
2018-01-06 34 112
2018-01-01 23 113
2018-01-02 34 113
2018-01-03 45 113
2018-01-04 34 113
The values for customer 111 are not continuous: there is a missing value at 2018-01-04, so I want to remove all rows for 111 from my dataframe in pandas.
date consumption customer_id
2018-01-01 23 112
2018-01-02 23 112
2018-01-03 45 112
2018-01-04 34 112
2018-01-05 23 112
2018-01-06 34 112
2018-01-01 23 113
2018-01-02 34 113
2018-01-03 45 113
2018-01-04 34 113
I want a result like this. How is that possible in pandas?
You can compute the successive deltas and check whether any is greater than one day ('1d'):
drop = (pd.to_datetime(df['date'])
          .groupby(df['customer_id'])
          .apply(lambda s: s.diff().gt('1d').any())
        )
out = df[df['customer_id'].isin(drop[~drop].index)]
Or with groupby.filter:
df['date'] = pd.to_datetime(df['date'])
out = (df.groupby(df['customer_id'])
         .filter(lambda d: ~d['date'].diff().gt('1d').any())
       )
Output:
date consumption customer_id
6 2018-01-01 23 112
7 2018-01-02 23 112
8 2018-01-03 45 112
9 2018-01-04 34 112
10 2018-01-05 23 112
11 2018-01-06 34 112
12 2018-01-01 23 113
13 2018-01-02 34 113
14 2018-01-03 45 113
15 2018-01-04 34 113
If the dates are not necessarily increasing, also check that you cannot go back in time:
df['date'] = pd.to_datetime(df['date'])
out = (df.groupby(df['customer_id'])
         .filter(lambda d: d['date'].diff().iloc[1:].eq('1d').all())
       )
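For reference, a minimal sketch reconstructing the question's sample frame so the snippets above run as written (the values come from the table shown; the construction itself is mine):

import pandas as pd

df = pd.DataFrame({
    'date': ['2018-01-01', '2018-01-02', '2018-01-03', '2018-01-05',
             '2018-01-06', '2018-01-07', '2018-01-01', '2018-01-02',
             '2018-01-03', '2018-01-04', '2018-01-05', '2018-01-06',
             '2018-01-01', '2018-01-02', '2018-01-03', '2018-01-04'],
    'consumption': [12, 12, 14, 12, 45, 34,
                    23, 23, 45, 34, 23, 34,
                    23, 34, 45, 34],
    'customer_id': [111] * 6 + [112] * 6 + [113] * 4,
})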

Select the rows by intersection of a column

I have a DataFrame like
In [67]: df
Out[67]:
id ts
0 a 2018-01-01
1 a 2018-01-02
2 a 2018-01-03
3 a 2018-01-04
4 a 2018-01-05
5 a 2018-01-06
6 a 2018-01-07
7 a 2018-01-08
8 b 2018-01-03
9 b 2018-01-04
10 b 2018-01-05
11 b 2018-01-06
12 b 2018-01-07
13 b 2018-01-08
14 b 2018-01-09
15 b 2018-01-10
16 b 2018-01-11
How can I extract the part where a and b have the same ts?
id ts
2 a 2018-01-03
3 a 2018-01-04
4 a 2018-01-05
5 a 2018-01-06
6 a 2018-01-07
7 a 2018-01-08
8 b 2018-01-03
9 b 2018-01-04
10 b 2018-01-05
11 b 2018-01-06
12 b 2018-01-07
13 b 2018-01-08
There might be many unique ids besides a and b. I want the intersection of column ts across all of them.
What would be the expected output with an additional row of c 2018-01-04?
It would be
a 2018-01-04
b 2018-01-04
c 2018-01-04
The idea is to reshape with DataFrame.pivot_table, which produces missing values for datetimes not shared by every id; remove those with DataFrame.dropna and then filter the original DataFrame with Series.isin:
df1 = df.pivot_table(index='ts', columns='id', aggfunc='size').dropna()
df = df[df['ts'].isin(df1.index)]
print (df)
id ts
2 a 2018-01-03
3 a 2018-01-04
4 a 2018-01-05
5 a 2018-01-06
6 a 2018-01-07
7 a 2018-01-08
8 b 2018-01-03
9 b 2018-01-04
10 b 2018-01-05
11 b 2018-01-06
12 b 2018-01-07
13 b 2018-01-08
Test after adding the new c row:
df1 = df.pivot_table(index='ts', columns='id', aggfunc='size').dropna()
df = df[df['ts'].isin(df1.index)]
print (df)
id ts
3 a 2018-01-04
9 b 2018-01-04
17 c 2018-01-04
To keep only the intersecting values, you could take the groupby.size of ts and check which of these groups have a size equal to the number of unique values in id. Then use the result to index the dataframe.
Checking on the proposed dataframe, and the additional row c 2018-01-04, this would return only the intersecting dates in ts:
s = df.groupby(df.ts).size().eq(df.id.nunique())
df[df.ts.isin(s[s].index)]
id ts
3 a 2018-01-04
9 b 2018-01-04
16 c 2018-01-04
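Another possible approach (my addition, not from the thread): intersect the per-id sets of ts values directly, then filter:

from functools import reduce

# Intersect the ts values of every id group; rows whose ts is shared
# by all ids survive the filter.
common = reduce(set.intersection, (set(g) for _, g in df.groupby('id')['ts']))
out = df[df['ts'].isin(common)]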

How to resample using forward fill in Python

My Dataframe df3 looks something like this:
Id Timestamp Data Group_Id
0 1 2018-01-01 00:00:05.523 125.5 101
1 2 2018-01-01 00:00:05.757 125.0 101
2 3 2018-01-02 00:00:09.507 127.0 52
3 4 2018-01-02 00:00:13.743 126.5 52
4 5 2018-01-03 00:00:15.407 125.5 50
...
11 11 2018-01-01 00:00:07.523 125.5 120
12 12 2018-01-01 00:00:08.757 125.0 120
13 13 2018-01-04 00:00:14.507 127.0 300
14 14 2018-01-04 00:00:15.743 126.5 300
15 15 2018-01-05 00:00:19.407 125.5 350
I wanted to resample using ffill for every second so that it looks like this:
Id Timestamp Data Group_Id
0 1 2018-01-01 00:00:06.000 125.00 101
1 2 2018-01-01 00:00:07.000 125.00 101
2 3 2018-01-01 00:00:08.000 125.00 101
3 4 2018-01-02 00:00:09.000 125.00 52
4 5 2018-01-02 00:00:10.000 127.00 52
...
My code:
def resample(df):
    indexing = df[['Timestamp', 'Data']]
    indexing['Timestamp'] = pd.to_datetime(indexing['Timestamp'])
    indexing = indexing.set_index('Timestamp')
    indexing1 = indexing.resample('1S', fill_method='ffill')
    # indexing1 = indexing1.resample('D')
    return indexing1

indexing = resample(df3)
but this incurred the following error:
ValueError: cannot reindex a non-unique index with a method or limit
I don't quite understand what this error means. @jezrael, in this similar question, suggested using drop_duplicates with groupby. I am not sure what this does to the data, as it seems there are no duplicates in my data? Can someone explain this please? Thanks.
This error is caused by the following:
Id Timestamp Data Group_Id
0 1 2018-01-01 00:00:05.523 125.5 101
1 2 2018-01-01 00:00:05.757 125.0 101
When you resample both of these timestamps to the nearest second, they both become 2018-01-01 00:00:06, and pandas doesn't know which value to pick because it has two to select from. Instead, you can use an aggregation function such as last (though mean, max, or min may also be suitable) to select one of the values, and then apply the forward fill.
Example:
from io import StringIO
import pandas as pd
df = pd.read_table(StringIO(""" Id Timestamp Data Group_Id
0 1 2018-01-01 00:00:05.523 125.5 101
1 2 2018-01-01 00:00:05.757 125.0 101
2 3 2018-01-02 00:00:09.507 127.0 52
3 4 2018-01-02 00:00:13.743 126.5 52
4 5 2018-01-03 00:00:15.407 125.5 50"""), sep=r'\s\s+')
df['Timestamp'] = pd.to_datetime(df['Timestamp']).dt.round('s')
df.set_index('Timestamp', inplace=True)
df = df.resample('1S').last().ffill()
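If each Group_Id should keep its own timeline, one possible variation (an assumption on my part, not from the answer) is to resample within each group:

# Resample each Group_Id independently; the result is indexed by
# (Group_Id, Timestamp). The first bin of every group holds an
# observation, so a plain ffill never leaks across groups here.
out = df.groupby('Group_Id').resample('1S').last().ffill()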

Pandas - Sum of first X hours of datetime index

I have a dataframe with a datetime index and 100 columns.
I want to have a new dataframe with the same datetime index and columns, but the values would contain the sum of the first 10 hours of each day.
So if I had an original dataframe like this:
A B C
---------------------------------
2018-01-01 00:00:00 2 5 -10
2018-01-01 01:00:00 6 5 7
2018-01-01 02:00:00 7 5 9
2018-01-01 03:00:00 9 5 6
2018-01-01 04:00:00 10 5 2
2018-01-01 05:00:00 7 5 -1
2018-01-01 06:00:00 1 5 -1
2018-01-01 07:00:00 -4 5 10
2018-01-01 08:00:00 9 5 10
2018-01-01 09:00:00 21 5 -10
2018-01-01 10:00:00 2 5 -1
2018-01-01 11:00:00 8 5 -1
2018-01-01 12:00:00 8 5 10
2018-01-01 13:00:00 8 5 9
2018-01-01 14:00:00 7 5 -10
2018-01-01 15:00:00 7 5 5
2018-01-01 16:00:00 7 5 -10
2018-01-01 17:00:00 4 5 7
2018-01-01 18:00:00 5 5 8
2018-01-01 19:00:00 2 5 8
2018-01-01 20:00:00 2 5 4
2018-01-01 21:00:00 8 5 3
2018-01-01 22:00:00 1 5 3
2018-01-01 23:00:00 1 5 1
2018-01-02 00:00:00 2 5 2
2018-01-02 01:00:00 3 5 8
2018-01-02 02:00:00 4 5 6
2018-01-02 03:00:00 5 5 6
2018-01-02 04:00:00 1 5 7
2018-01-02 05:00:00 7 5 7
2018-01-02 06:00:00 5 5 1
2018-01-02 07:00:00 2 5 2
2018-01-02 08:00:00 4 5 3
2018-01-02 09:00:00 6 5 4
2018-01-02 10:00:00 9 5 4
2018-01-02 11:00:00 11 5 5
2018-01-02 12:00:00 2 5 8
2018-01-02 13:00:00 2 5 0
2018-01-02 14:00:00 4 5 5
2018-01-02 15:00:00 5 5 4
2018-01-02 16:00:00 7 5 4
2018-01-02 17:00:00 -1 5 7
2018-01-02 18:00:00 1 5 7
2018-01-02 19:00:00 1 5 7
2018-01-02 20:00:00 5 5 7
2018-01-02 21:00:00 2 5 7
2018-01-02 22:00:00 2 5 7
2018-01-02 23:00:00 8 5 7
So for all rows with date 2018-01-01:
The value for column A would be 68 (2+6+7+9+10+7+1-4+9+21)
The value for column B would be 50 (5+5+5+5+5+5+5+5+5+5)
The value for column C would be 22 (-10+7+9+6+2-1-1+10+10-10)
So for all rows with date 2018-01-02:
The value for column A would be 39 (2+3+4+5+1+7+5+2+4+6)
The value for column B would be 50 (5+5+5+5+5+5+5+5+5+5)
The value for column C would be 46 (2+8+6+6+7+7+1+2+3+4)
The outcome would be:
A B C
---------------------------------
2018-01-01 00:00:00 68 50 22
2018-01-01 01:00:00 68 50 22
2018-01-01 02:00:00 68 50 22
2018-01-01 03:00:00 68 50 22
2018-01-01 04:00:00 68 50 22
2018-01-01 05:00:00 68 50 22
2018-01-01 06:00:00 68 50 22
2018-01-01 07:00:00 68 50 22
2018-01-01 08:00:00 68 50 22
2018-01-01 09:00:00 68 50 22
2018-01-01 10:00:00 68 50 22
2018-01-01 11:00:00 68 50 22
2018-01-01 12:00:00 68 50 22
2018-01-01 13:00:00 68 50 22
2018-01-01 14:00:00 68 50 22
2018-01-01 15:00:00 68 50 22
2018-01-01 16:00:00 68 50 22
2018-01-01 17:00:00 68 50 22
2018-01-01 18:00:00 68 50 22
2018-01-01 19:00:00 68 50 22
2018-01-01 20:00:00 68 50 22
2018-01-01 21:00:00 68 50 22
2018-01-01 22:00:00 68 50 22
2018-01-01 23:00:00 68 50 22
2018-01-02 00:00:00 39 50 46
2018-01-02 01:00:00 39 50 46
2018-01-02 02:00:00 39 50 46
2018-01-02 03:00:00 39 50 46
2018-01-02 04:00:00 39 50 46
2018-01-02 05:00:00 39 50 46
2018-01-02 06:00:00 39 50 46
2018-01-02 07:00:00 39 50 46
2018-01-02 08:00:00 39 50 46
2018-01-02 09:00:00 39 50 46
2018-01-02 10:00:00 39 50 46
2018-01-02 11:00:00 39 50 46
2018-01-02 12:00:00 39 50 46
2018-01-02 13:00:00 39 50 46
2018-01-02 14:00:00 39 50 46
2018-01-02 15:00:00 39 50 46
2018-01-02 16:00:00 39 50 46
2018-01-02 17:00:00 39 50 46
2018-01-02 18:00:00 39 50 46
2018-01-02 19:00:00 39 50 46
2018-01-02 20:00:00 39 50 46
2018-01-02 21:00:00 39 50 46
2018-01-02 22:00:00 39 50 46
2018-01-02 23:00:00 39 50 46
I figured I'd group by date first and perform a sum and then merge the results based on the date. Is there a better/faster way to do this?
Thanks.
EDIT: I worked out this answer in the meantime:
df = df.between_time('0:00', '9:00').groupby(pd.Grouper(freq='D')).sum()
df = df.resample('1H').ffill()
You need to group by df.index.date and use transform with a lambda function to find the sum of the first 10 values:
df.loc[:,['A','B','C']] = df.groupby(df.index.date).transform(lambda x: x[:10].sum())
Or, if the column order is the same in the grouped result and the original dataframe:
df.loc[:,:] = df.groupby(df.index.date).transform(lambda x: x[:10].sum())
print(df)
A B C
2018-01-01 00:00:00 68 50 22
2018-01-01 01:00:00 68 50 22
2018-01-01 02:00:00 68 50 22
2018-01-01 03:00:00 68 50 22
2018-01-01 04:00:00 68 50 22
2018-01-01 05:00:00 68 50 22
2018-01-01 06:00:00 68 50 22
2018-01-01 07:00:00 68 50 22
2018-01-01 08:00:00 68 50 22
2018-01-01 09:00:00 68 50 22
2018-01-01 10:00:00 68 50 22
2018-01-01 11:00:00 68 50 22
2018-01-01 12:00:00 68 50 22
2018-01-01 13:00:00 68 50 22
2018-01-01 14:00:00 68 50 22
2018-01-01 15:00:00 68 50 22
2018-01-01 16:00:00 68 50 22
2018-01-01 17:00:00 68 50 22
2018-01-01 18:00:00 68 50 22
2018-01-01 19:00:00 68 50 22
2018-01-01 20:00:00 68 50 22
2018-01-01 21:00:00 68 50 22
2018-01-01 22:00:00 68 50 22
2018-01-01 23:00:00 68 50 22
2018-01-02 00:00:00 39 50 46
2018-01-02 01:00:00 39 50 46
2018-01-02 02:00:00 39 50 46
2018-01-02 03:00:00 39 50 46
2018-01-02 04:00:00 39 50 46
2018-01-02 05:00:00 39 50 46
2018-01-02 06:00:00 39 50 46
2018-01-02 07:00:00 39 50 46
2018-01-02 08:00:00 39 50 46
2018-01-02 09:00:00 39 50 46
2018-01-02 10:00:00 39 50 46
2018-01-02 11:00:00 39 50 46
2018-01-02 12:00:00 39 50 46
2018-01-02 13:00:00 39 50 46
2018-01-02 14:00:00 39 50 46
2018-01-02 15:00:00 39 50 46
2018-01-02 16:00:00 39 50 46
2018-01-02 17:00:00 39 50 46
2018-01-02 18:00:00 39 50 46
2018-01-02 19:00:00 39 50 46
2018-01-02 20:00:00 39 50 46
2018-01-02 21:00:00 39 50 46
2018-01-02 22:00:00 39 50 46
2018-01-02 23:00:00 39 50 46
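Since the question asks for a faster way, a possible vectorized sketch (my suggestion, not from the answer; it assumes complete hourly rows, so the first 10 rows of each day are exactly hours 00:00 through 09:00):

# Sum hours 00:00-09:00 per day, then broadcast each daily total back
# onto every row of that day.
daily = df.between_time('00:00', '09:00').resample('D').sum()
out = daily.reindex(df.index.normalize()).set_axis(df.index)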

Pandas: How to plot a bar chart with dataframes with labels?

I have the following dataframe df:
timestamp objectId result
0 2015-11-24 09:00:00 Stress 3
1 2015-11-24 09:00:00 Productivity 0
2 2015-11-24 09:00:00 Abilities 4
3 2015-11-24 09:00:00 Challenge 0
4 2015-11-24 10:00:00 Productivity 87
5 2015-11-24 10:00:00 Abilities 84
6 2015-11-24 10:00:00 Challenge 58
7 2015-11-24 10:00:00 Stress 25
8 2015-11-24 11:00:00 Productivity 93
9 2015-11-24 11:00:00 Abilities 93
10 2015-11-24 11:00:00 Challenge 93
11 2015-11-24 11:00:00 Stress 19
12 2015-11-24 12:00:00 Challenge 90
13 2015-11-24 12:00:00 Abilities 96
14 2015-11-24 12:00:00 Stress 94
15 2015-11-24 12:00:00 Productivity 88
16 2015-11-24 13:00:00 Productivity 12
17 2015-11-24 13:00:00 Challenge 17
18 2015-11-24 13:00:00 Abilities 89
19 2015-11-24 13:00:00 Stress 13
I would like to achieve a bar chart like the following (image omitted), where instead of a, b, c, d there would be the labels in the column objectId, the y-axis would correspond to the column result, and the x-axis would show the values grouped by the column timestamp.
I tried several things but nothing worked. This was the closest, but the plot() method doesn't take any customisation via parameters (e.g. kind='bar' doesn't work).
groups = df.groupby('objectId')
sgb = groups['result']
sgb.plot()
Any other idea?
import seaborn as sns
In [36]:
df.timestamp = df.timestamp.factorize()[0]
In [39]:
df.objectId = df.objectId.map({'Stress' : 'a' , 'Productivity' : 'b' , 'Abilities' : 'c' , 'Challenge' : 'd'})
In [41]:
df
Out[41]:
timestamp objectId result
0 0 a 3
1 0 b 0
2 0 c 4
3 0 d 0
4 1 b 87
5 1 c 84
6 1 d 58
7 1 a 25
8 2 b 93
9 2 c 93
10 2 d 93
11 2 a 19
12 3 d 90
13 3 c 96
14 3 a 94
15 3 b 88
16 4 b 12
17 4 d 17
18 4 c 89
19 4 a 13
In [40]:
sns.barplot(x = 'timestamp' , y = 'result' , hue = 'objectId' , data = df );
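For what it's worth (my note, not the answerer's): the factorize and map steps above only mimic the a-d labels of the target image; seaborn also accepts the unmodified dataframe directly:

# Plot with the original timestamp and objectId labels as-is.
sns.barplot(x='timestamp', y='result', hue='objectId', data=df)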
The answer of @NaderHisham is a very easy solution!
But just as a reference, if you for some reason cannot use seaborn, this is a pure pandas/matplotlib solution:
You need to reshape your data so the different objectIds become the columns:
In [20]: df.set_index(['timestamp', 'objectId'])['result'].unstack()
Out[20]:
objectId Abilities Challenge Productivity Stress
timestamp
09:00:00 4 0 0 3
10:00:00 84 58 87 25
11:00:00 93 93 93 19
12:00:00 96 90 88 94
13:00:00 89 17 12 13
If you make a bar plot of this, you get the desired result:
In [24]: df.set_index(['timestamp', 'objectId'])['result'].unstack().plot(kind='bar')
Out[24]: <matplotlib.axes._subplots.AxesSubplot at 0xc44a5c0>
