Pandas drop rows by time duration - python

I would like to drop dataframe rows by a time condition (ignoring the date). My data contains around 100 million rows. I have around 100 columns and each column has a different sampling frequency.
I prepared the following snippet of code that takes the different sampling frequencies into account:
import pandas as pd
# leave_duration=0.01 seconds
# drop_duration=0.1 seconds
i = pd.date_range('2018-01-01', periods=1000, freq='2ms')
i=i.append(pd.date_range('2018-01-01', periods=1000, freq='3ms'))
i=i.append(pd.date_range('2018-01-01', periods=1000, freq='0.5ms'))
df = pd.DataFrame({'A': range(len(i))}, index=i)
df=df.sort_index()
print(df)
# drop by duration....
In this simple example, there is data that lasts for around 1 second and has 3 different sampling frequencies. The goal is to drop rows that span (e.g.) a 0.1 second duration and keep rows that span (e.g.) a 0.01 second duration. How can I do it with a one-liner?

With df=df.loc['2018-01-01 00:00:00.000000':'2018-01-01 00:00:00.000500'] you will get a new df whose data lie between 2018-01-01 00:00:00.000000 and 2018-01-01 00:00:00.000500.
Now you can apply your filter for the desired dates:
import pandas as pd
# leave_duration=0.01 seconds
# drop_duration=0.1 seconds
i = pd.date_range('2018-01-01', periods=1000, freq='2ms')
i=i.append(pd.date_range('2018-01-01', periods=1000, freq='3ms'))
i=i.append(pd.date_range('2018-01-01', periods=1000, freq='0.5ms'))
df = pd.DataFrame({'A': range(len(i))}, index=i)
df=df.sort_index()
print(df)
# filter data between 2018-01-01 00:00:00.000000 and 2018-01-01 00:00:00.000500
df=df.loc['2018-01-01 00:00:00.000000':'2018-01-01 00:00:00.000500']
print(df)
Output:
Before the date filter is applied:
A
2018-01-01 00:00:00.000000 0
2018-01-01 00:00:00.000000 2000
2018-01-01 00:00:00.000000 1000
2018-01-01 00:00:00.000500 2001
2018-01-01 00:00:00.001000 2002
... ...
2018-01-01 00:00:02.985000 1995
2018-01-01 00:00:02.988000 1996
2018-01-01 00:00:02.991000 1997
2018-01-01 00:00:02.994000 1998
2018-01-01 00:00:02.997000 1999
[3000 rows x 1 columns]
After the date filter is applied:
A
2018-01-01 00:00:00.000000 0
2018-01-01 00:00:00.000000 2000
2018-01-01 00:00:00.000000 1000
2018-01-01 00:00:00.000500 2001
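If you prefer not to hard-code timestamps, the same kind of slice can be written relative to the first index value. A minimal sketch, assuming "duration" here means time elapsed since the first sample (leave_duration taken from the question's comments) and that the index is sorted:
# keep rows within leave_duration of the first timestamp, drop the rest
leave_duration = pd.Timedelta(seconds=0.01)
df = df[df.index - df.index[0] <= leave_duration]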

Related

How to match time series in python?

I have two high frequency time series of 3 months worth of data.
The problem is that one goes from 15:30 to 23:00, the other from 01:00 to 00:00.
Is there any way to match the two time series, by discarding the extra data, in order to run some regression analysis?
You can use the combine_first function of pandas Series. This function keeps the values of the calling Series and fills the remaining index entries from the other Series.
The following code shows a minimal example:
import pandas as pd

idx1 = pd.date_range('2018-01-01', periods=5, freq='H')
idx2 = pd.date_range('2018-01-01 01:00', periods=5, freq='H')
ts1 = pd.Series(range(len(idx1)), index=idx1)
ts2 = pd.Series(range(len(idx2)), index=idx2)
ts1.combine_first(ts2)
This gives a Series with the content:
2018-01-01 00:00:00 0.0
2018-01-01 01:00:00 1.0
2018-01-01 02:00:00 2.0
2018-01-01 03:00:00 3.0
2018-01-01 04:00:00 4.0
2018-01-01 05:00:00 4.0
For more complex combinations you can use combine.
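As a hedged illustration of combine (the max aggregation here is just an example, not taken from the question): it applies a binary function to the aligned elements of two Series, substituting fill_value where one side has no entry for an index label.
import pandas as pd

idx1 = pd.date_range('2018-01-01', periods=5, freq='H')
idx2 = pd.date_range('2018-01-01 01:00', periods=5, freq='H')
ts1 = pd.Series(range(len(idx1)), index=idx1)
ts2 = pd.Series(range(len(idx2)), index=idx2)

# element-wise maximum; 0 is used where a timestamp exists in only one Series
combined = ts1.combine(ts2, func=max, fill_value=0)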

Is there a Pandas function to highlight a week's 10 lowest values in a time series?

Rookie here so please excuse my question format:
I got an event time series dataset for two months (columns for "date/time" and "# of events", each row representing an hour).
I would like to highlight the 10 hours with the lowest numbers of events for each week. Is there a specific Pandas function for that? Thanks!
Let's say you have a dataframe df with column col as well as a datetime column.
You can simply sort the column with
import pandas as pd
df = pd.DataFrame({'col' : [1,2,3,4,5,6,7,8,9,10,11,12,13,14,15],
'datetime' : ['2019-01-01 00:00:00','2015-02-01 00:00:00','2015-03-01 00:00:00','2015-04-01 00:00:00',
'2018-05-01 00:00:00','2016-06-01 00:00:00','2017-07-01 00:00:00','2013-08-01 00:00:00',
'2015-09-01 00:00:00','2015-10-01 00:00:00','2015-11-01 00:00:00','2015-12-01 00:00:00',
'2014-01-01 00:00:00','2020-01-01 00:00:00','2014-01-01 00:00:00']})
df = df.sort_values('col')
df = df.iloc[0:10,:]
df
Output:
col datetime
0 1 2019-01-01 00:00:00
1 2 2015-02-01 00:00:00
2 3 2015-03-01 00:00:00
3 4 2015-04-01 00:00:00
4 5 2018-05-01 00:00:00
5 6 2016-06-01 00:00:00
6 7 2017-07-01 00:00:00
7 8 2013-08-01 00:00:00
8 9 2015-09-01 00:00:00
9 10 2015-10-01 00:00:00
I know there's a function called nlargest, and there is indeed an nsmallest counterpart: pandas.DataFrame.nsmallest
df.nsmallest(n=10, columns=['col'])
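Since the question asks for the 10 lowest hours per week, nsmallest can also be combined with a weekly grouper. A minimal sketch, assuming an hourly DatetimeIndex and an event-count column named n_events (both names are hypothetical):
import pandas as pd

# hypothetical two weeks of hourly event counts
idx = pd.date_range('2020-06-01', periods=24 * 14, freq='H')
events = pd.DataFrame({'n_events': range(len(idx))}, index=idx)

# the 10 hours with the fewest events within each calendar week
lowest = events.groupby(pd.Grouper(freq='W'))['n_events'].nsmallest(10)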
My bad - so your DatetimeIndex has hourly sampling, and you need the hour(s) with the fewest events per week.
...
Date n_events
2020-06-06 08:00:00 3
2020-06-06 09:00:00 3
2020-06-06 10:00:00 2
...
Well, I'd start by converting each hour into its own column.
1. Create an hour column that holds the hour of the day:
df['hour'] = df['date'].dt.hour
2. Pivot the hour values into columns, with n_events as the values.
You'll then have 1 datetime index and 24 hour columns, with values denoting the number of events: pandas.DataFrame.pivot_table
...
Date hour0 ... hour8 hour9 hour10 ... hour23
2020-06-06 0 3 3 2 0
...
Then you can resample it to a weekly level and aggregate using sum.
df.resample('w').sum()
The last part is a bit tricky to do on the dataframe, but fairly simple if you just need the output:
for row in df.itertuples():
    print(sorted(row[1:]))
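Putting the steps above together as a hedged sketch (the column names date and n_events are assumptions, and the input is hypothetical):
import pandas as pd

# hypothetical input: one row per hour with a timestamp and an event count
idx = pd.date_range('2020-06-01', periods=24 * 14, freq='H')
df = pd.DataFrame({'date': idx, 'n_events': range(len(idx))})

# 1. hour-of-day column
df['hour'] = df['date'].dt.hour

# 2. pivot so each hour of the day becomes its own column of event counts
wide = df.pivot_table(index='date', columns='hour', values='n_events')

# 3. weekly aggregate
weekly = wide.resample('W').sum()

# 4. per week, print the 10 lowest hourly totals
for row in weekly.itertuples():
    print(row[0], sorted(row[1:])[:10])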

How to find the minimum value between two dates and put into new column using Pandas

I have 2 dataset
# df1 - minute based dataset
date Open
2018-01-01 00:00:00 1.0516
2018-01-01 00:01:00 1.0516
2018-01-01 00:02:00 1.0516
2018-01-01 00:03:00 1.0516
2018-01-01 00:04:00 1.0516
....
# df2 - daily based dataset
date_from date_to
2018-01-01 2018-01-01 02:21:00
2018-01-02 2018-01-02 01:43:00
2018-01-03 NA
2018-01-04 2018-01-04 03:11:00
2018-01-05 2018-01-05 00:19:00
For each row in df2 (date_from to date_to), I want to grab the minimum Open value in df1 and put it in a new column in df2 called min_value.
df1 is a minute-based, sorted dataset.
For the NA in date_to in df2, we can skip those rows entirely and move on to the next row.
What did I do?
First I tried to find values between the two dates, and after that I used this code:
df2['min_value'] = df1[df1['date'].dt.hour.between(df2['date_from'], df2['date_to'])].min()
I wanted to search between two dates but I am not sure if that is how to do it; it does not work. Could you please help identify what I should do?
Does this work for you?
import numpy as np
import pandas as pd

df1 = pd.DataFrame({'date': ['2018-01-01 00:00:00', '2018-01-01 00:01:00', '2018-01-01 00:02:00',
                             '2018-01-01 00:03:00', '2018-01-01 00:04:00'],
                    'Open': [1.0516, 1.0516, 1.0516, 1.0516, 1.0516]})
df2 = pd.DataFrame({'date_from': ['2018-01-01', '2018-01-02', '2018-01-03', '2018-01-04', '2018-01-05'],
                    'date_to': ['2018-01-01 02:21:00', '2018-01-02 01:43:00', np.nan,
                                '2018-01-04 03:11:00', '2018-01-05 00:19:00']})

## converting to datetime
df1['date'] = pd.to_datetime(df1['date'])
df1.set_index('date', inplace=True)
df2['date_from'] = pd.to_datetime(df2['date_from'])
df2['date_to'] = pd.to_datetime(df2['date_to'])

def func(val):
    minimum_val = np.nan
    minimum_date = np.nan
    # skip rows with a missing bound, as requested in the question
    if val['date_from'] is pd.NaT or val['date_to'] is pd.NaT:
        pass
    else:
        minimum_val = df1[val['date_from']:val['date_to']]['Open'].min()
        if not np.isnan(minimum_val):
            # first timestamp of the window (not necessarily where the minimum occurs)
            minimum_date = df1[val['date_from']:val['date_to']].reset_index().head(1)['date'].values[0]
    return pd.DataFrame({'date_from': [val['date_from']], 'date_to': [val['date_to']],
                         'Open': [minimum_val], 'min_date': [minimum_date]})

df3 = pd.concat(list(df2.apply(func, axis=1)))
The following code snippet is more readable:
import pandas as pd

def get_minimum_value(row, df):
    temp = df[(df['date'] >= row['date_from']) & (df['date'] <= row['date_to'])]
    return temp['value'].min()

df1 = pd.read_csv("test.csv")
df2 = pd.read_csv("test2.csv")
df1['date'] = pd.to_datetime(df1['date'])
df2['date_from'] = pd.to_datetime(df2['date_from'])
df2['date_to'] = pd.to_datetime(df2['date_to'])
df2['value'] = df2.apply(func=get_minimum_value, df=df1, axis=1)
Here, df2.apply() sends each row as the first argument to the get_minimum_value function. Applying this to your given data, the result is:
date_from date_to value
0 2018-01-01 2018-01-01 02:21:00 1.0512
1 2018-01-02 2018-01-02 01:43:00 NaN
2 2018-01-03 NaT NaN
3 2018-01-04 2018-01-04 03:11:00 NaN
4 2018-01-05 2018-01-05 00:19:00 NaN
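For completeness, a more compact hedged sketch of the same row-wise idea (assuming df1 has a sorted DatetimeIndex and an Open column, as in the question; window_min is a hypothetical helper name):
import numpy as np
import pandas as pd

def window_min(row):
    # skip rows with a missing bound, as the question requests
    if pd.isna(row['date_from']) or pd.isna(row['date_to']):
        return np.nan
    # label-based slice over df1's sorted DatetimeIndex
    return df1.loc[row['date_from']:row['date_to'], 'Open'].min()

df2['min_value'] = df2.apply(window_min, axis=1)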

Pandas: decompress date range to individual dates

Dataset: I have a 1 GB dataset of stocks, which have values between date ranges. There is no overlap in the date ranges and the dataset is sorted on (ticker, start_date).
>>> df.head()
start_date end_date val
ticker
AAPL 2014-05-01 2014-05-01 10.0000000000
AAPL 2014-06-05 2014-06-10 20.0000000000
GOOG 2014-06-01 2014-06-15 50.0000000000
MSFT 2014-06-16 2014-06-16 None
TWTR 2014-01-17 2014-05-17 10.0000000000
Goal: I want to decompress the dataframe so that I have individual dates instead of date ranges. For example, the AAPL rows would go from being only 2 rows to 7 rows:
>>> AAPL_decompressed.head()
val
date
2014-05-01 10.0000000000
2014-06-05 20.0000000000
2014-06-06 20.0000000000
2014-06-07 20.0000000000
2014-06-08 20.0000000000
I'm hoping there's a nice optimized method from pandas like resample that can do this in a couple lines.
A bit more than a few lines, but I think it results in what you asked:
Starting with your dataframe:
In [70]: df
Out[70]:
start_date end_date val row
ticker
AAPL 2014-05-01 2014-05-01 10 0
AAPL 2014-06-05 2014-06-10 20 1
GOOG 2014-06-01 2014-06-15 50 2
MSFT 2014-06-16 2014-06-16 NaN 3
TWTR 2014-01-17 2014-05-17 10 4
First I reshape this dataframe to a dataframe with a single date column (so every row is repeated twice, once for start_date and once for end_date), and I add a counter column called row:
In [60]: df['row'] = range(len(df))
In [61]: starts = df[['start_date', 'val', 'row']].rename(columns={'start_date': 'date'})
In [62]: ends = df[['end_date', 'val', 'row']].rename(columns={'end_date':'date'})
In [63]: df_decomp = pd.concat([starts, ends])
In [64]: df_decomp = df_decomp.set_index('row', append=True)
In [65]: df_decomp.sort_index()
Out[65]:
date val
ticker row
AAPL 0 2014-05-01 10
0 2014-05-01 10
1 2014-06-05 20
1 2014-06-10 20
GOOG 2 2014-06-01 50
2 2014-06-15 50
MSFT 3 2014-06-16 NaN
3 2014-06-16 NaN
TWTR 4 2014-01-17 10
4 2014-05-17 10
Based on this new dataframe, I can group it by ticker and row, apply a daily resample to each of these groups, and fillna (with method 'pad' to forward fill):
In [66]: df_decomp = df_decomp.groupby(level=[0,1]).apply(lambda x: x.set_index('date').resample('D').fillna(method='pad'))
In [67]: df_decomp = df_decomp.reset_index(level=1, drop=True)
The last command was to drop the now superfluous row index level.
When we access the AAPL rows, it gives your desired output:
In [69]: df_decomp.loc['AAPL']
Out[69]:
val
date
2014-05-01 10
2014-06-05 20
2014-06-06 20
2014-06-07 20
2014-06-08 20
2014-06-09 20
2014-06-10 20
I think you can do this in five steps:
1) filter the ticker column to find the stock you want
2) use pandas.bdate_range to build a list of date ranges between start and end
3) flatten this list using reduce
4) reindex your new filtered dataframe
5) fill nans using the method pad
Here is the code:
>>> import pandas as pd
>>> import datetime
>>> data = [('AAPL', datetime.date(2014, 4, 28), datetime.date(2014, 5, 2), 90),
('AAPL', datetime.date(2014, 5, 5), datetime.date(2014, 5, 9), 80),
('MSFT', datetime.date(2014, 5, 5), datetime.date(2014, 5, 9), 150),
('AAPL', datetime.date(2014, 5, 12), datetime.date(2014, 5, 16), 85)]
>>> df = pd.DataFrame(data=data, columns=['ticker', 'start', 'end', 'val'])
>>> df_new = df[df['ticker'] == 'AAPL']
>>> df_new.name = 'AAPL'
>>> df_new.index = df_new['start']
>>> df_new.index.name = 'date'
>>> df_new.index = df_new.index.to_datetime()
>>> from functools import reduce #for py3k only
>>> new_index = [pd.bdate_range(**d) for d in df_new[['start','end']].to_dict('record')]
>>> new_index_flat = reduce(pd.tseries.index.DatetimeIndex.append, new_index)
>>> df_new = df_new.reindex(new_index_flat)
>>> df_new = df_new.fillna(method='pad')
>>> df_new
ticker start end val
2014-04-28 AAPL 2014-04-28 2014-05-02 90
2014-04-29 AAPL 2014-04-28 2014-05-02 90
2014-04-30 AAPL 2014-04-28 2014-05-02 90
2014-05-01 AAPL 2014-04-28 2014-05-02 90
2014-05-02 AAPL 2014-04-28 2014-05-02 90
2014-05-05 AAPL 2014-05-05 2014-05-09 80
2014-05-06 AAPL 2014-05-05 2014-05-09 80
2014-05-07 AAPL 2014-05-05 2014-05-09 80
2014-05-08 AAPL 2014-05-05 2014-05-09 80
2014-05-09 AAPL 2014-05-05 2014-05-09 80
2014-05-12 AAPL 2014-05-12 2014-05-16 85
2014-05-13 AAPL 2014-05-12 2014-05-16 85
2014-05-14 AAPL 2014-05-12 2014-05-16 85
2014-05-15 AAPL 2014-05-12 2014-05-16 85
2014-05-16 AAPL 2014-05-12 2014-05-16 85
[15 rows x 4 columns]
Hope it helps!
Here is a hacky way to do it - I post this poor answer (remember - I can't code :-) ) because I am new to pandas and would not mind someone improving it.
This reads the file with the data originally posted, then creates a multi-index from the stock_id and the end_date. The lookup below takes the entire frame, a ticker (e.g. 'AAPL'), and a date, and uses index.searchsorted, which behaves like map::upper_bound in C++ - i.e. it finds the position where the date would be inserted, in other words the end_date closest to but not before the date in question - that row holds the value we want, and we return it.
Then I grab a cross section from a Series with this MultiIndex based on the stock_id of 'AAPL'. Next I build a flat list from the tuples of dates in the MultiIndex under the 'AAPL' key; these dates become the index and values of a Series, which I then map through the lookup to grab the desired stock price.
I know this is probably wrong...but...happy to learn.
I would not be surprised to find out that there is an easy way to inflate a data frame like this using some forward-fill interpolation method...
stocks = pd.read_csv('stocks2.csv', parse_dates=['start_date', 'end_date'], index_col='ticker')
mi = list(zip(stocks.index,
              pd.Series(list(zip(stocks['start_date'], stocks['end_date'])))
                .map(lambda z: tuple(pd.date_range(start=z[0], end=z[1])))
                .values))
mi = pd.MultiIndex.from_tuples(mi)
ticker = 'AAPL'
s = pd.Series(index=mi, data=0)
s = list(s.xs(key=ticker).index)
l = []
for dates in s:
    l.extend(dates)          # flatten the tuples of dates
s = pd.Series(index=l, data=l)
stocks_byticker = stocks[stocks.index == ticker].set_index('end_date')
print(s.map(lambda x: stocks_byticker['val'].iloc[stocks_byticker.index.searchsorted(x)]))
2014-05-01 10
2014-06-05 20
2014-06-06 20
2014-06-07 20
2014-06-08 20
2014-06-09 20
2014-06-10 20
Here's a slightly more general way to do this that expands on joris's good answer, but allows this to work with any number of additional columns:
import pandas as pd
df['join_id'] = range(len(df))
starts = df[['start_date', 'join_id']].rename(columns={'start_date': 'date'})
ends = df[['end_date', 'join_id']].rename(columns={'end_date': 'date'})
start_end = pd.concat([starts, ends]).set_index('date')
fact_table = start_end.groupby('join_id').apply(lambda x: x.resample('D').fillna(method='pad'))
del fact_table['join_id']
fact_table = fact_table.reset_index()
final = fact_table.merge(df, on='join_id', how='left')
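As a hedged aside, on more recent pandas versions (0.25+ for explode) the same decompression can be sketched without groupby/resample, assuming a frame shaped like the question's df (ticker index, start_date/end_date/val columns):
import pandas as pd

decompressed = (
    df.reset_index()
      .assign(date=lambda d: [pd.date_range(s, e).tolist()
                              for s, e in zip(d['start_date'], d['end_date'])])
      .explode('date')
      .set_index(['ticker', 'date'])[['val']]
)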

How do you convert time series data with pandas and show graph via highstock

I have some time series data (financial stock trading data):
TIMESTAMP PRICE VOLUME
1294311545 24990 1500000000
1294317813 25499 5000000000
1294318449 25499 100000000
I need to convert them to OHLC values (a JSON list) based on the price column, i.e. (open, high, low, close), and show that as an OHLC graph with the Highstock JS framework.
The output should be as following:
[{'time':'2013-09-01','open':24999,'high':25499,'low':24999,'close':25000,'volume':15000000},
{'time':'2013-09-02','open':24900,'high':25600,'low':24800,'close':25010,'volume':16000000},
{...}]
For example, my sample has 10 data points for the day 2013-09-01; the output will have one object for that day where high is the highest price of all 10 points, low is the lowest price, open is the first price of the day, close is the last price of the day, and volume is the TOTAL volume of all 10 points.
I know the Python library pandas can probably do this, but I still could not figure it out.
Update: As suggested, I used resample():
df['VOLUME'].resample('H', how='sum')
df['PRICE'].resample('H', how='ohlc')
But how do I merge the results?
At the moment you can only perform ohlc on a column/Series (will be fixed in 0.13).
First, coerce the TIMESTAMP column to pandas Timestamps:
In [11]: df.TIMESTAMP = pd.to_datetime(df.TIMESTAMP, unit='s')
In [12]: df.set_index('TIMESTAMP', inplace=True)
In [13]: df
Out[13]:
PRICE VOLUME
TIMESTAMP
2011-01-06 10:59:05 24990 1500000000
2011-01-06 12:43:33 25499 5000000000
2011-01-06 12:54:09 25499 100000000
Then resample via ohlc (here I've resampled by hour):
In [14]: df['VOLUME'].resample('H', how='ohlc')
Out[14]:
open high low close
TIMESTAMP
2011-01-06 10:00:00 1500000000 1500000000 1500000000 1500000000
2011-01-06 11:00:00 NaN NaN NaN NaN
2011-01-06 12:00:00 5000000000 5000000000 100000000 100000000
In [15]: df['PRICE'].resample('H', how='ohlc')
Out[15]:
open high low close
TIMESTAMP
2011-01-06 10:00:00 24990 24990 24990 24990
2011-01-06 11:00:00 NaN NaN NaN NaN
2011-01-06 12:00:00 25499 25499 25499 25499
You can apply to_json to any DataFrame:
In [16]: df['PRICE'].resample('H', how='ohlc').to_json()
Out[16]: '{"open":{"1294308000000000000":24990.0,"1294311600000000000":null,"1294315200000000000":25499.0},"high":{"1294308000000000000":24990.0,"1294311600000000000":null,"1294315200000000000":25499.0},"low":{"1294308000000000000":24990.0,"1294311600000000000":null,"1294315200000000000":25499.0},"close":{"1294308000000000000":24990.0,"1294311600000000000":null,"1294315200000000000":25499.0}}'
(This would probably be a straightforward enhancement for DataFrame; at the moment it's NotImplemented.)
Update: your desired output (or at least something very close to it) can be achieved as follows:
In [21]: price = df['PRICE'].resample('D', how='ohlc').reset_index()
In [22]: price
Out[22]:
TIMESTAMP open high low close
0 2011-01-06 00:00:00 24990 25499 24990 25499
Use the records orientation and the iso date_format:
In [23]: price.to_json(date_format='iso', orient='records')
Out[23]: '[{"TIMESTAMP":"2011-01-06T00:00:00.000Z","open":24990,"high":25499,"low":24990,"close":25499}]'
In [24]: price.to_json('foo.json', date_format='iso', orient='records') # save as json file
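To address the "how do I merge the results" part of the question: a hedged sketch on current pandas (where the how= argument has been replaced by methods on the resampler), joining the PRICE OHLC with the summed VOLUME per bucket, using the df built above:
ohlc = df['PRICE'].resample('D').ohlc()
volume = df['VOLUME'].resample('D').sum().rename('volume')
daily = pd.concat([ohlc, volume], axis=1).reset_index()

# records-oriented JSON, as in the answer above
daily_json = daily.to_json(date_format='iso', orient='records')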
