GroupBy - How to extract seconds from DateTime with diff() - python

I have the following dataframe:
In [372]: df_2
Out[372]:
A ID3 DATETIME
0 B-028 b76cd912ff 2014-10-08 13:43:27
1 B-054 4a57ed0b02 2014-10-08 14:26:19
2 B-076 1a682034f8 2014-10-08 14:29:01
3 B-023 b76cd912ff 2014-10-08 18:39:34
4 B-023 f88g8d7sds 2014-10-08 18:40:18
5 B-033 b76cd912ff 2014-10-08 18:44:30
6 B-032 b76cd912ff 2014-10-08 18:46:00
7 B-037 b76cd912ff 2014-10-08 18:52:15
8 B-046 db959faf02 2014-10-08 18:59:59
9 B-053 b76cd912ff 2014-10-08 19:17:48
10 B-065 b76cd912ff 2014-10-08 19:21:38
And I want to find the difference between different entries - grouped by 'ID3'.
I am trying to use transform() on a GroupBy like this:
In [379]: df_2['diff'] = df_2.sort_values(by='DATETIME').groupby('ID3')['DATETIME'].transform(lambda x: x.diff()); df_2['diff']
Out[379]:
0 NaT
1 NaT
2 NaT
3 1970-01-01 04:56:07
4 NaT
5 1970-01-01 00:04:56
6 1970-01-01 00:01:30
7 1970-01-01 00:06:15
8 NaT
9 1970-01-01 00:25:33
10 1970-01-01 00:03:50
Name: diff, dtype: datetime64[ns]
I have also tried with x.diff().astype(int) for lambda, with the exact same result.
Datatype of both 'DATETIME' and 'diff' is: datetime64[ns]
What I am trying to achieve is to have diff represented in seconds instead of a time relative to the Unix epoch.
I have figured out that I can convert df_2['diff'] to TimeDelta and then extract seconds in one chained call at this point, like this:
In [405]: df_2['diff'] = pd.to_timedelta(df_2['diff']).map(lambda x: x.total_seconds()); df_2['diff']
Out[407]:
0 NaN
1 NaN
2 NaN
3 17767.0
4 NaN
5 296.0
6 90.0
7 375.0
8 NaN
9 1533.0
10 230.0
Name: diff, dtype: float64
Is there a way to achieve this (seconds as values for df_2['diff']) in one step in the transform instead of having to take a couple of steps in the process?
Finally, I have already tried converting to Timedelta inside transform, without success.
Thanks for the help!

UPDATE: transform() from class NDFrameGroupBy(GroupBy) doesn't seem to do downcasting and works as expected:
In [220]: (df_2[['ID3','DATETIME']]
.....: .sort_values(by='DATETIME')
.....: .groupby('ID3')
.....: .transform(lambda x: x.diff().dt.total_seconds())
.....: )
Out[220]:
DATETIME
0 NaN
1 NaN
2 NaN
3 17767.0
4 NaN
5 296.0
6 90.0
7 375.0
8 NaN
9 1533.0
10 230.0
the transform() from class SeriesGroupBy(GroupBy) tries to do the following:
result = _possibly_downcast_to_dtype(result, dtype)
which could (I'm not sure) be causing your problem.
OLD answer:
try this:
In [168]: df_2.sort_values(by='DATETIME').groupby('ID3')['DATETIME'].diff().dt.total_seconds()
Out[168]:
0 NaN
1 NaN
2 NaN
3 17767.0
4 NaN
5 296.0
6 90.0
7 375.0
8 NaN
9 1533.0
10 230.0
dtype: float64
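For completeness, here is a self-contained sketch of the one-step approach on a small hypothetical frame (the data below is made up, not the question's):

```python
import pandas as pd

# Hypothetical miniature of the question's data: two IDs, interleaved times.
df = pd.DataFrame({
    'ID3': ['a', 'b', 'a', 'a'],
    'DATETIME': pd.to_datetime([
        '2014-10-08 13:00:00', '2014-10-08 13:30:00',
        '2014-10-08 13:05:00', '2014-10-08 13:07:30',
    ]),
})

# One step: sort, group, diff, and convert the timedeltas to seconds.
# Because diff() preserves the index, the result aligns back to df.
df['diff'] = (df.sort_values(by='DATETIME')
                .groupby('ID3')['DATETIME']
                .diff()
                .dt.total_seconds())
print(df['diff'].tolist())  # [nan, nan, 300.0, 150.0]
```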

Related

How to calculate monthly changes in a time series using pandas dataframe

As I am new to Python, I am probably asking for something basic for most of you. I have a df where 'Date' is the index, another column giving the month for each Date, and one data column.
Mnth TSData
Date
2012-01-05 1 192.6257
2012-01-12 1 194.2714
2012-01-19 1 192.0086
2012-01-26 1 186.9729
2012-02-02 2 183.7700
2012-02-09 2 178.2343
2012-02-16 2 172.3429
2012-02-23 2 171.7800
2012-03-01 3 169.6300
2012-03-08 3 168.7386
2012-03-15 3 167.1700
2012-03-22 3 165.9543
2012-03-29 3 165.0771
2012-04-05 4 164.6371
2012-04-12 4 164.6500
2012-04-19 4 166.9171
2012-04-26 4 166.4514
2012-05-03 5 166.3657
2012-05-10 5 168.2543
2012-05-17 5 176.8271
2012-05-24 5 179.1971
2012-05-31 5 183.7120
2012-06-07 6 195.1286
I wish to calculate monthly changes in the data set that I can later use in a boxplot. So from the table above the results I seek are:
Mnth Chng
1 -8.9 (183.77 - 192.66)
2 -14.14 (169.63 - 183.77)
3 -5 (164.63 - 169.63)
4 1.73 (166.36 - 164.63)
5 28.77 (195.13 - 166.36)
and so on...
any suggestions?
thanks :)
IIUC, starting from this as df:
Date Mnth TSData
0 2012-01-05 1 192.6257
1 2012-01-12 1 194.2714
2 2012-01-19 1 192.0086
3 2012-01-26 1 186.9729
4 2012-02-02 2 183.7700
...
20 2012-05-24 5 179.1971
21 2012-05-31 5 183.7120
22 2012-06-07 6 195.1286
you can use:
df.groupby('Mnth')['TSData'].first().diff().shift(-1)
# or
# -df.groupby('Mnth')['TSData'].first().diff(-1)
NB: the data must be sorted by date so that the desired value is used as the first item of each group (df.sort_values(by=['Mnth', 'Date'])).
output:
Mnth
1 -8.8557
2 -14.1400
3 -4.9929
4 1.7286
5 28.7629
6 NaN
Name: TSData, dtype: float64
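As a quick check, the computation can be reproduced with just the first value of each month, which is all it uses:

```python
import pandas as pd

# First observation of each month from the question's data.
df = pd.DataFrame({
    'Mnth':   [1, 2, 3, 4, 5, 6],
    'TSData': [192.6257, 183.7700, 169.6300, 164.6371, 166.3657, 195.1286],
})

# Change for month m = first(m+1) - first(m): diff() puts the change
# on the later month, shift(-1) moves it back onto month m.
chng = df.groupby('Mnth')['TSData'].first().diff().shift(-1)
print(chng)
```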
First, ensure the index is a DatetimeIndex:
df.index = pd.to_datetime(df.index)
Then it's simply a matter of using resample:
df['TSData'].resample('M').first().diff().shift(freq='-1M')
Output:
Date
2011-12-31 NaN
2012-01-31 -8.8557
2012-02-29 -14.1400
2012-03-31 -4.9929
2012-04-30 1.7286
2012-05-31 28.7629
Name: TSData, dtype: float64

How to convert a Unix Timestamp of a Pandas DataFrame Column with NaN Values to a Datetime

I have a pandas dataframe with a Unix timestamp column and some NaN values, like that:
>> df_to_datetime
0 1.571687e+09
1 1.586099e+09
2 NaN
3 1.589994e+09
4 1.593363e+09
5 1.585852e+09
6 1.580754e+09
7 1.582201e+09
8 1.576595e+09
9 1.586874e+09
Name: last_replied_at, dtype: float64
When I try that:
for i in range(len(df_to_datetime)):
    if not df_to_datetime[i]:
        pass
    else:
        df_to_datetime[i] = [datetime.utcfromtimestamp(df_to_datetime[i]).astimezone(time_zone)]
        print(df_to_datetime[i])
it returns this:
11 pass
12 else:
---> 13 df_to_datetime[i] = [datetime.utcfromtimestamp(df_to_datetime[i]).astimezone(time_zone)]
14 print(df_to_datetime[i])
15
ValueError: Invalid value NaN (not a number)
I want to convert my Unix timestamp columns to a datetime. I tried to do without if/else before, but got the same problem with NaN values...
df_to_datetime[i] = [datetime.utcfromtimestamp(df_to_datetime[i]).astimezone(time_zone), errors='coerce']
Instead of doing this in a loop with if/else, check out the apply method, with this setup:
>> df_to_datetime = pd.Series(pd.date_range(start='1/1/2018', end='1/08/2018'),
name='last_replied_at').apply(pd.Timestamp.timestamp)
>> df_to_datetime.iloc[2] = np.nan
>> print(df_to_datetime)
0 1.514765e+09
1 1.514851e+09
2 NaN
3 1.515024e+09
4 1.515110e+09
5 1.515197e+09
6 1.515283e+09
7 1.515370e+09
Name: last_replied_at, dtype: float64
>> df_to_datetime.apply(pd.to_datetime, errors='coerce', utc=True, unit='s')
0 2018-01-01 00:00:00+00:00
1 2018-01-02 00:00:00+00:00
2 NaT
3 2018-01-04 00:00:00+00:00
4 2018-01-05 00:00:00+00:00
5 2018-01-06 00:00:00+00:00
6 2018-01-07 00:00:00+00:00
7 2018-01-08 00:00:00+00:00
Name: last_replied_at, dtype: datetime64[ns, UTC]
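Note that pd.to_datetime also accepts the whole Series directly, which is simpler than the per-element apply; NaN comes out as NaT without any special handling. A sketch using the first two timestamps from the question:

```python
import numpy as np
import pandas as pd

# Float Unix timestamps with a gap, as in the question.
s = pd.Series([1.571687e9, np.nan, 1.586099e9], name='last_replied_at')

# One vectorized call: NaN -> NaT, floats -> timezone-aware datetimes.
converted = pd.to_datetime(s, unit='s', utc=True)
print(converted)
```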

How to vectorize an operation that uses previous values?

I want to do something like this:
df['indicator'] = df.at[x-1] + df.at[x-2]
or
df['indicator'] = df.at[x-1] > df.at[x-2]
I guess edge cases would be taken care of automatically, e.g. skip the first few rows.
This line should give you what you need (the numeric column in the example below is named 'count'). The first two rows of your indicator column will automatically be filled with NaN.
df['indicator'] = df['count'].shift(1) + df['count'].shift(2)
For example, if we had the following dataframe:
df = pd.DataFrame({'date': ['2017-06-01','2017-06-02','2017-06-03',
                            '2017-06-04','2017-06-05','2017-06-06'],
                   'count': [10,15,17,5,3,7]})
date count
0 2017-06-01 10
1 2017-06-02 15
2 2017-06-03 17
3 2017-06-04 5
4 2017-06-05 3
5 2017-06-06 7
Then running this line will give the below result:
df['indicator'] = df['count'].shift(1) + df['count'].shift(2)
date count indicator
0 2017-06-01 10 NaN
1 2017-06-02 15 NaN
2 2017-06-03 17 25.0
3 2017-06-04 5 32.0
4 2017-06-05 3 22.0
5 2017-06-06 7 8.0
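Putting it together, including the comparison variant the question also asked about (shift works the same way inside a boolean expression; the column name 'count' comes from the sample frame):

```python
import pandas as pd

df = pd.DataFrame({
    'date': ['2017-06-01', '2017-06-02', '2017-06-03',
             '2017-06-04', '2017-06-05', '2017-06-06'],
    'count': [10, 15, 17, 5, 3, 7],
})

# Sum of the two previous values (NaN where there is no history yet).
df['indicator'] = df['count'].shift(1) + df['count'].shift(2)

# Boolean variant: was the previous value greater than the one before it?
# Comparisons involving NaN evaluate to False, so the first rows are False.
df['rising'] = df['count'].shift(1) > df['count'].shift(2)
print(df)
```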

Calculate average of every 7 instances in a dataframe column

I have this pandas dataframe with daily asset prices:
Picture of head of Dataframe
I would like to create a pandas series (it could also be an additional column in the dataframe or some other data structure) with the weekly average asset prices. This means I need to calculate the average of every 7 consecutive values in the column and save it into a series.
Picture of how the result should look
As I am a complete newbie to Python (and programming in general, for that matter), I really have no idea how to start.
I am very grateful for every tip!
I believe you need GroupBy.transform with a grouping key created by floor division of a numpy.arange array; this is a general solution that also works with any index (e.g. a DatetimeIndex):
np.random.seed(2018)
rng = pd.date_range('2018-04-19', periods=20)
df = pd.DataFrame({'Date': rng[::-1],
'ClosingPrice': np.random.randint(4, size=20)})
#print (df)
df['weekly'] = df['ClosingPrice'].groupby(np.arange(len(df)) // 7).transform('mean')
print (df)
ClosingPrice Date weekly
0 2 2018-05-08 1.142857
1 2 2018-05-07 1.142857
2 2 2018-05-06 1.142857
3 1 2018-05-05 1.142857
4 1 2018-05-04 1.142857
5 0 2018-05-03 1.142857
6 0 2018-05-02 1.142857
7 2 2018-05-01 2.285714
8 1 2018-04-30 2.285714
9 1 2018-04-29 2.285714
10 3 2018-04-28 2.285714
11 3 2018-04-27 2.285714
12 3 2018-04-26 2.285714
13 3 2018-04-25 2.285714
14 1 2018-04-24 1.666667
15 0 2018-04-23 1.666667
16 3 2018-04-22 1.666667
17 2 2018-04-21 1.666667
18 2 2018-04-20 1.666667
19 2 2018-04-19 1.666667
Detail:
print (np.arange(len(df)) // 7)
[0 0 0 0 0 0 0 1 1 1 1 1 1 1 2 2 2 2 2 2]
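If one row per week is wanted instead of a broadcast column, the same key works with a plain aggregation (a sketch using the first 14 prices from the example above):

```python
import numpy as np
import pandas as pd

# First two 7-row blocks of the example's ClosingPrice column.
prices = pd.Series([2, 2, 2, 1, 1, 0, 0, 2, 1, 1, 3, 3, 3, 3],
                   name='ClosingPrice')

# Floor division buckets rows into consecutive blocks of 7; mean()
# aggregates to one value per block instead of broadcasting it back.
weekly = prices.groupby(np.arange(len(prices)) // 7).mean()
print(weekly)
```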

How to take log of only non-zero values in dataframe and replace 0's with NA's?

How do I take the log of non-zero values in a dataframe and replace 0's with NA's?
I have a dataframe like the one below:
time y1 y2
0 2017-08-06 00:52:00 0 10
1 2017-08-06 00:52:10 1 20
2 2017-08-06 00:52:20 2 0
3 2017-08-06 00:52:30 3 0
4 2017-08-06 00:52:40 0 5
5 2017-08-06 00:52:50 4 6
6 2017-08-06 00:53:00 6 11
7 2017-08-06 00:53:10 7 12
8 2017-08-06 00:53:20 8 0
9 2017-08-06 00:53:30 0 13
I want to take the log of all columns except the first column time; the log should be calculated only on non-zero values, and zeros should be replaced with NA's. How do I do this?
So, I tried to do something like this:
cols = df.columns.difference(['time'])
# Replacing 0's with NA's using below:
df[cols] = df[cols].mask(np.isclose(df[cols].values, 0), np.nan)
df[cols] = np.log(df[cols]) # but this will try take log of NA's also.
Please help.
The output should be a dataframe with the same time column, all zeros replaced with NA's, and the log of the remaining values of all columns except the first.
If I understand correctly, you can just replace the zeros with np.nan and then call np.log directly - it ignores NaN values just fine.
np.log(df[['y1', 'y2']].replace(0, np.nan))
Example
>>> df = pd.DataFrame({'time': pd.date_range('20170101', '20170110'),
'y1' : np.random.randint(0, 3, 10),
'y2': np.random.randint(0, 3, 10)})
>>> df
time y1 y2
0 2017-01-01 1 2
1 2017-01-02 0 1
2 2017-01-03 2 0
3 2017-01-04 0 1
4 2017-01-05 1 0
5 2017-01-06 1 1
6 2017-01-07 2 0
7 2017-01-08 1 0
8 2017-01-09 0 1
9 2017-01-10 2 1
>>> df[['log_y1', 'log_y2']] = np.log(df[['y1', 'y2']].replace(0, np.nan))
>>> df
time y1 y2 log_y1 log_y2
0 2017-01-01 1 2 0.000000 0.693147
1 2017-01-02 0 1 NaN 0.000000
2 2017-01-03 2 0 0.693147 NaN
3 2017-01-04 0 1 NaN 0.000000
4 2017-01-05 1 0 0.000000 NaN
5 2017-01-06 1 1 0.000000 0.000000
6 2017-01-07 2 0 0.693147 NaN
7 2017-01-08 1 0 0.000000 NaN
8 2017-01-09 0 1 NaN 0.000000
9 2017-01-10 2 1 0.693147 0.000000
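An equivalent single expression with where() keeps the original columns untouched (a small made-up frame for illustration):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'y1': [0, 1, 2], 'y2': [10, 20, 0]})

# where() keeps values satisfying the condition and puts NaN elsewhere,
# so np.log never sees a zero; NaN simply passes through the log.
logged = np.log(df[['y1', 'y2']].where(df[['y1', 'y2']] != 0))
print(logged)
```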
