Is there a way to compute a percentile for a DataFrame column in datetime format (Y-m-d H:M:S) so that the percentile value keeps the datetime format rather than being converted to seconds?
An example of the data, in datetime format:
df:
0 2016-07-31 08:00:00
1 2016-07-30 14:30:00
2 2006-06-24 14:15:00
3 2016-07-15 08:15:45
4 2016-08-01 23:50:00
There is a built-in method quantile that does exactly that. Let
df = pd.Series(pd.to_datetime(['2016-07-31 08:00:00', '2016-07-30 14:30:00', '2006-06-24 14:15:00', '2016-07-15 08:15:45', '2016-08-01 23:50:00']))
df
0 2016-07-31 08:00:00
1 2016-07-30 14:30:00
2 2006-06-24 14:15:00
3 2016-07-15 08:15:45
4 2016-08-01 23:50:00
dtype: datetime64[ns]
then
>>> df.quantile(0.5)
Timestamp('2016-07-30 14:30:00')
See also the official documentation
https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.quantile.html
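The same works directly on a DataFrame column, and quantile also accepts a list of percentiles. A minimal sketch, assuming the column is named 'dates' and already has datetime64 dtype:
import pandas as pd

df = pd.DataFrame({'dates': pd.to_datetime([
    '2016-07-31 08:00:00', '2016-07-30 14:30:00', '2006-06-24 14:15:00',
    '2016-07-15 08:15:45', '2016-08-01 23:50:00'])})

# returns one Timestamp per requested percentile
print(df['dates'].quantile([0.25, 0.5, 0.75]))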
The describe() method doesn't work the same way on a datetime column as it does on integer or float columns, so we can create a custom method to do the same:
import pandas as pd
from datetime import timedelta
from datetime import datetime
base = datetime.now()
date_list = [base - timedelta(days=x) for x in range(0, 20)]
df = pd.DataFrame.from_dict({'Date': date_list})
df
Date
0 2017-08-17 21:32:54.044948
1 2017-08-16 21:32:54.044948
2 2017-08-15 21:32:54.044948
3 2017-08-14 21:32:54.044948
def describe_datetime(dataframe, column, percentiles=[i/10 for i in range(1, 11)]):
    new_date = dataframe[column].dt.strftime('%Y-%m-%d').sort_values().values
    length = len(new_date)
    for percentile in percentiles:
        print(percentile, ':', new_date[int(percentile * length) - 1])
describe_datetime(df, 'Date')
output:
0.1 : 2017-07-30
0.2 : 2017-08-01
0.3 : 2017-08-03
0.4 : 2017-08-05
0.5 : 2017-08-07
0.6 : 2017-08-09
0.7 : 2017-08-11
0.8 : 2017-08-13
0.9 : 2017-08-15
1.0 : 2017-08-17
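Note that the built-in quantile gives the same deciles without a custom loop; a quick sketch using the df from above (it returns Timestamps rather than formatted strings):
df['Date'].quantile([i/10 for i in range(1, 11)])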
After trying some code, I was able to compute the percentile using the code below: I sorted the column and used its index to compute the percentile.
The dataframe is 'df'; the column with datetime format is 'dates'.
import numpy as np

date_column = list(df.sort_values('dates')['dates'])
index = range(len(date_column))  # range(0, len+1) would index out of bounds at the 100th percentile
date_column[int(np.percentile(index, 50))]
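For reference (my addition, not part of the original approach): quantile can reproduce this pick-an-actual-value behaviour via its interpolation argument, since the default linear interpolation may produce a timestamp that isn't in the data:
df['dates'].quantile(0.5, interpolation='nearest')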
Related
Say I have this dataframe:
import pandas as pd
import datetime
x = [datetime.time(23,0),datetime.time(6,0),datetime.time(18,0),datetime.time(17,0)]
y = [datetime.time(22,0),datetime.time(9,0),datetime.time(9,0),datetime.time(23,0)]
df = pd.DataFrame({'time1':x,'time2':y})
How would I compute the absolute difference, in hours, between the two columns? Subtraction doesn't work. The result should look like this:
df['abs_diff'] = [1,3,9,6]
Thanks so much!
Pandas doesn't like datetime.time objects very much; it stores the series with object dtype, so you can't really do any arithmetic on them. You can convert the data to Pandas timedeltas instead:
df['abs_diff'] = (pd.to_timedelta(df['time1'].astype(str))        # convert to timedelta
                  .sub(pd.to_timedelta(df['time2'].astype(str)))  # then you can subtract
                  .abs()                                          # take the absolute value
                  .div(pd.Timedelta('1h')))                       # and express it in hours
Output:
time1 time2 abs_diff
0 23:00:00 22:00:00 1.0
1 06:00:00 09:00:00 3.0
2 18:00:00 09:00:00 9.0
3 17:00:00 23:00:00 6.0
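Since the times here are all whole hours, a cruder alternative (a sketch of mine, not part of the answer above) is to compare the hour attributes directly:
df['abs_diff'] = (df['time1'].apply(lambda t: t.hour)
                  - df['time2'].apply(lambda t: t.hour)).abs()  # only valid for whole-hour times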
I had data where the time was in UNIX timestamp format. I used the following code to convert the time column in my dataframe from Unix format to dates.
import pandas as pd
df = pd.read_csv(r'C:\Users\My Computer\Desktop\Data Analysis\BATS_SPY, 1D.csv')
df['time'] = pd.to_datetime(df['time'],unit='s')
print(df.head())
I get the result as
time
0 1993-01-29 14:30:00
1 1993-02-01 14:30:00
2 1993-02-02 14:30:00
3 1993-02-03 14:30:00
4 1993-02-04 14:30:00
What should I do if I only want the dates (that is, I want to exclude 14:30:00 from the time)?
My data was as follows:
time
0 728317800
1 728577000
2 728663400
3 728749800
4 728836200
Floor your datetime series to whole days:
df['date'] = df['time'].dt.floor('D')
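floor('D') keeps datetime64 dtype with the time set to midnight. If you want plain Python date objects instead, a quick sketch:
df['date'] = df['time'].dt.date         # date objects, object dtype
df['date'] = df['time'].dt.normalize()  # equivalent to floor('D')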
Rookie here so please excuse my question format:
I have an event time-series dataset covering two months (columns for "date/time" and "# of events"; each row represents an hour).
I would like to highlight the 10 hours with the lowest number of events for each week. Is there a specific Pandas function for that? Thanks!
Let's say you have a dataframe df with column col as well as a datetime column.
You can simply sort the column with
import pandas as pd
df = pd.DataFrame({'col': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15],
                   'datetime': ['2019-01-01 00:00:00', '2015-02-01 00:00:00', '2015-03-01 00:00:00', '2015-04-01 00:00:00',
                                '2018-05-01 00:00:00', '2016-06-01 00:00:00', '2017-07-01 00:00:00', '2013-08-01 00:00:00',
                                '2015-09-01 00:00:00', '2015-10-01 00:00:00', '2015-11-01 00:00:00', '2015-12-01 00:00:00',
                                '2014-01-01 00:00:00', '2020-01-01 00:00:00', '2014-01-01 00:00:00']})
df = df.sort_values('col')
df = df.iloc[0:10,:]
df
Output:
col datetime
0 1 2019-01-01 00:00:00
1 2 2015-02-01 00:00:00
2 3 2015-03-01 00:00:00
3 4 2015-04-01 00:00:00
4 5 2018-05-01 00:00:00
5 6 2016-06-01 00:00:00
6 7 2017-07-01 00:00:00
7 8 2013-08-01 00:00:00
8 9 2015-09-01 00:00:00
9 10 2015-10-01 00:00:00
I know there's a function called nlargest, and it has an nsmallest counterpart: pandas.DataFrame.nsmallest
df.nsmallest(n=10, columns=['col'])
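Since you want the lowest 10 per week, you can combine nsmallest with a weekly grouper; a sketch, assuming the datetime column is parsed to real datetimes first:
df['datetime'] = pd.to_datetime(df['datetime'])
df.groupby(pd.Grouper(key='datetime', freq='W'))['col'].nsmallest(10)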
My bad. So your DatetimeIndex is an hourly sampling, and you need the hour(s) with the fewest events each week.
...
Date n_events
2020-06-06 08:00:00 3
2020-06-06 09:00:00 3
2020-06-06 10:00:00 2
...
Well I'd start by converting each hour into columns.
1. Create an Hour column that holds the hour of the day.
df['hour'] = df['date'].dt.hour
2. Pivot the hour values into columns, with n_events as the values.
So you'll then have 1 datetime index and 24 hour columns, with the values denoting #events. pandas.DataFrame.pivot_table
...
Date hour0 ... hour8 hour9 hour10 ... hour23
2020-06-06 0 3 3 2 0
...
Then you can resample it to weekly level, aggregating with sum:
df.resample('W').sum()
The last part is a bit tricky to do on the dataframe, but fairly simple if you just need the output:
for row in df.itertuples():
    print(sorted(row[1:]))
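Putting the steps together, a minimal sketch (assuming a DataFrame df with a DatetimeIndex and an n_events column, as in the snippet above):
import pandas as pd

df['hour'] = df.index.hour
wide = df.pivot_table(index=df.index.normalize(), columns='hour',
                      values='n_events', aggfunc='sum')
weekly = wide.resample('W').sum()
# the 10 hour-of-day slots with the fewest events in each week
for week, row in weekly.iterrows():
    print(week.date(), row.nsmallest(10).index.tolist())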
This table is a pandas dataframe. Can someone help me write a function that shows the probability of the price being up for 5 consecutive days in a row over the past 1000 days? Then I'd know the probability of the price going up tomorrow when the past 4 days' prices have been increasing.
Any help is appreciated.
import ccxt
import pandas as pd

binance = ccxt.binance()

def get_price(pair):
    df = binance.fetch_ohlcv(pair, timeframe="1d", limit=1000)  # limit = 30
    df = pd.DataFrame(df).rename(columns={0: "date", 1: "open", 2: "high", 3: "low", 4: "close", 5: "volume"})
    df['date'] = pd.to_datetime(df['date'], unit='ms') + pd.Timedelta(hours=8)
    df.set_index("date", inplace=True)
    return df

df = get_price("BTC/USDT")
df["daily_return"] = df.close.pct_change()
Random comment: based on context, I think you're after an empirical probability, in which case this is a simple one-liner using pandas rolling. If that's not the case, you probably need to explain what you mean by "probability", or describe in words what you're after.
df["probability_5"] = (df["daily_return"] > 0).rolling(5).mean()
df[["daily_return", "probability_5"]].head(15)
Output:
daily_return probability_5
date
2017-08-17 08:00:00 NaN NaN
2017-08-18 08:00:00 -0.041238 NaN
2017-08-19 08:00:00 0.007694 NaN
2017-08-20 08:00:00 -0.012969 NaN
2017-08-21 08:00:00 -0.017201 0.2
2017-08-22 08:00:00 0.005976 0.4
2017-08-23 08:00:00 0.018319 0.6
2017-08-24 08:00:00 0.049101 0.6
2017-08-25 08:00:00 -0.008186 0.6
2017-08-26 08:00:00 0.013260 0.8
2017-08-27 08:00:00 -0.006324 0.6
2017-08-28 08:00:00 0.017791 0.6
2017-08-29 08:00:00 0.045773 0.6
2017-08-30 08:00:00 -0.007050 0.6
2017-08-31 08:00:00 0.037266 0.6
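If you specifically want the conditional probability of an up day given that the previous 4 days were all up, a sketch along the same lines (my extension, not part of the answer above):
up = df["daily_return"] > 0
prev4_up = up.rolling(4).sum().shift(1) == 4  # previous 4 days all positive
prob = up[prev4_up].mean()                    # empirical P(up | previous 4 days up)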
Just to frame the question properly: I believe you are trying to calculate the relative frequency of n consecutive positive (or negative) days in a price series/array.
Some research:
https://medium.com/@mikeharrisny/probability-in-trading-and-finance-96344108e1d9
Please see my implementation below, using a Pandas DataFrame:
import pandas as pd

random_prices = [100, 90, 95, 98, 99, 98, 97, 100, 99, 98]
df = pd.DataFrame(random_prices, columns=['price'])

def consecutive_days_proba(pandas_series, n, positive=True):
    # Transform prices to daily returns
    daily_return = pandas_series.pct_change()
    # Drop the NA value; the series is now one element shorter than the original
    daily_return.dropna(inplace=True)
    # Count the total number of days in the new series
    total_days = len(daily_return)
    if positive:
        # count the number of n-day windows where all returns are positive
        consecutive_n = ((daily_return > 0).rolling(n).sum() == n).sum()
    else:
        # count the number of n-day windows where all returns are negative
        consecutive_n = ((daily_return < 0).rolling(n).sum() == n).sum()
    return ((consecutive_n / total_days) * 100).round(2)

consecutive_days_proba(df['price'], n=3, positive=True)
So this returns 11.11%, which is 1/9. Although the original series has a length of 10, I don't think it makes sense to count the dropped NA day as part of the base.
I've got a dataframe and want to resample certain columns (as hourly sums and means from 10-minute data) WITHIN the 3 different 'users' that exist in the dataset.
A normal resample would use code like:
import pandas as pd

df = pd.read_csv('example.csv')
df['Datetime'] = pd.to_datetime(df['date_datetime/_source'] + ' ' + df['time'])  # create datetime stamp
df = df.set_index('Datetime')
df = df.resample('1h').agg({'energy_kwh': 'sum', 'average_w': 'mean', 'norm_average_kw/kw': 'mean', 'temperature_degc': 'mean', 'voltage_v': 'mean'})
df
To get a result like (please forgive the column formatting; I have no idea how to paste this properly to make it look nice):
energy_kwh norm_average_kw/kw voltage_v temperature_degc average_w
Datetime
2013-04-30 06:00:00 0.027 0.007333 266.333333 4.366667 30.000000
2013-04-30 07:00:00 1.250 0.052333 298.666667 5.300000 192.500000
2013-04-30 08:00:00 5.287 0.121417 302.333333 7.516667 444.000000
2013-04-30 09:00:00 12.449 0.201000 297.500000 9.683333 726.000000
2013-04-30 10:00:00 26.101 0.396417 288.166667 11.150000 1450.000000
2013-04-30 11:00:00 45.396 0.460250 282.333333 12.183333 1672.500000
2013-04-30 12:00:00 64.731 0.440833 276.166667 13.550000 1541.000000
2013-04-30 13:00:00 87.095 0.562750 284.833333 13.733333 2084.500000
However, the original CSV contains a column of URLs; in the dataset of 100,000 rows there are 3 different URLs (effectively IDs). I want each to be resampled individually rather than getting one 'lump' resample across all of them (e.g. 9:00 AM on 2014-01-01 would have data for all 3 users, but each should have its own hourly sums and means).
I hope this makes sense - please let me know if I need to clarify anything.
FYI, I tried using the advice in the following 2 posts but to no avail:
Resampling a multi-index DataFrame
Resampling Within a Pandas MultiIndex
Thanks in advance
You can resample a groupby object, grouped by the URLs, as in this minimal example:
import pandas as pd
import numpy as np

df = pd.DataFrame({'Val': np.random.random(100)})
df['Datetime'] = pd.date_range('2001-01-01', periods=100, freq='5h')  # create a random dataset
df = df.set_index('Datetime')
df['Location'] = np.tile(['l0', 'l1', 'l2', 'l3', 'l4'], 20)
print(df.groupby('Location').resample('10D').agg({'Val': 'mean'}))
Val
Location Datetime
l0 2001-01-01 00:00:00 0.334183
2001-01-11 00:00:00 0.584260
l1 2001-01-01 05:00:00 0.288290
2001-01-11 05:00:00 0.470140
l2 2001-01-01 10:00:00 0.381273
2001-01-11 10:00:00 0.461684
l3 2001-01-01 15:00:00 0.703523
2001-01-11 15:00:00 0.386858
l4 2001-01-01 20:00:00 0.448857
2001-01-11 20:00:00 0.310914
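Applied to your frame, you can group by the URL column and reuse the aggregation dict from your question; a sketch, where 'url' is a placeholder for your actual column name and df already has the DatetimeIndex set above:
result = (df.groupby('url')
            .resample('1h')
            .agg({'energy_kwh': 'sum', 'average_w': 'mean',
                  'norm_average_kw/kw': 'mean', 'temperature_degc': 'mean',
                  'voltage_v': 'mean'}))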