Finding nearest time in a DataFrame - python

I have two datasets with different time formats, like this:
df1 = pd.DataFrame( {'A': [1499503900, 1512522054, 1412525061, 1502527681, 1512532303]})
df2 = pd.DataFrame( {'B' : ['2017-12-15T11:47:58.119Z', '2017-05-31T08:27:41.943Z', '2017-06-05T14:44:56.425Z', '2017-05-30T16:24:03.175Z' , '2017-07-03T10:20:46.333Z', '2017-06-16T10:13:31.535Z' , '2017-12-15T12:26:01.347Z', '2017-06-15T16:00:41.017Z', '2017-11-28T15:25:39.016Z', '2017-08-10T08:48:01.347Z'] })
I need to find the nearest date for each value in the first dataset. It doesn't matter how far away it is; I just need the nearest time. For example:
1499503900 for '2017-07-03T10:20:46.333Z'
1512522054 for '2017-12-15T12:26:01.347Z'
1412525061 for '2017-05-31T08:27:41.943Z'
1502527681 for '2017-08-10T08:48:01.347Z'
1512532303 for '2017-06-05T14:44:56.425Z'
Here is some help.
This converts an ISO date string to an epoch timestamp:
def time1(date_text):
    date = datetime.datetime.strptime(date_text, "%Y-%m-%dT%H:%M:%S.%fZ")
    return calendar.timegm(date.utctimetuple())
x = '2017-12-15T12:26:01.347Z'
print(time1(x))
out: 1513340761
And this is for converting to ISO format:
def time_covert(time):
    seconds_since_epoch = time
    return DT.datetime.utcfromtimestamp(seconds_since_epoch).isoformat()
y = 1499503900
print(time_covert(y))
out = 2017-07-08T08:51:40
Any idea would be extremely useful.
Thank you all in advance!

Here is a quick start:
def time_covert(time):
    seconds_since_epoch = time
    return datetime.utcfromtimestamp(seconds_since_epoch)

# real time series
df2['B'] = pd.to_datetime(df2['B'])
df2.index = df2['B']
del df2['B']

for a in df1['A']:
    print(time_covert(a))
    i = np.argmin(np.abs(df2.index.to_pydatetime() - time_covert(a)))
    print(df2.iloc[i])

I would like to approach this as an algorithmic question rather than a pandas-specific one. My approach is to sort the df2 series and, for each datetime in df1, perform a binary search on the sorted df2 to get the insertion index. Then check the entries just below and at the found index to get the desired output.
Here is the code for the above procedure.
Use standard pandas datetimes for easy comparison:
df1 = pd.DataFrame( {'A': pd.to_datetime([1499503900, 1512522054, 1412525061, 1502527681, 1512532303], unit='s')})
df2 = pd.DataFrame( {'B' : pd.to_datetime(['2017-12-15T11:47:58.119Z', '2017-05-31T08:27:41.943Z', '2017-06-05T14:44:56.425Z', '2017-05-30T16:24:03.175Z' , '2017-07-03T10:20:46.333Z', '2017-06-16T10:13:31.535Z' , '2017-12-15T12:26:01.347Z', '2017-06-15T16:00:41.017Z', '2017-11-28T15:25:39.016Z', '2017-08-10T08:48:01.347Z']) })
Sort df2 by date and get the insertion positions using binary search:
df2 = df2.sort_values('B').reset_index(drop=True)
ind = df2['B'].searchsorted(df1['A'])
Now check which of the two candidates (the entry just below the insertion position and the one at it) has the minimum difference:
for index, row in df1.iterrows():
    i = ind[index]
    if i not in df2.index:
        print(df2.iloc[i-1]['B'])
    elif i-1 not in df2.index:
        print(df2.iloc[i]['B'])
    else:
        if abs(df2.iloc[i]['B'] - row['A']) > abs(df2.iloc[i-1]['B'] - row['A']):
            print(df2.iloc[i-1]['B'])
        else:
            print(df2.iloc[i]['B'])
The test outputs, for each value in df1 respectively, are: (Note: please recheck the outputs given in the question; they do not correspond to the minimum difference.)
2017-07-03 10:20:46.333000
2017-11-28 15:25:39.016000
2017-05-30 16:24:03.175000
2017-08-10 08:48:01.347000
2017-11-28 15:25:39.016000
The above procedure has a time complexity of O(N log N) for the sort and O(log N) per lookup (with N = len(df2)). If df1 is large, this is a fairly fast approach.
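For completeness, here is a sketch of an alternative (my addition, not part of either answer): pandas' built-in nearest-key join can do the same lookup once both columns are parsed as UTC datetimes so they are comparable.
import pandas as pd

df1 = pd.DataFrame({'A': pd.to_datetime([1499503900, 1512522054, 1412525061,
                                          1502527681, 1512532303], unit='s', utc=True)})
df2 = pd.DataFrame({'B': pd.to_datetime(['2017-12-15T11:47:58.119Z', '2017-05-31T08:27:41.943Z',
                                          '2017-06-05T14:44:56.425Z', '2017-05-30T16:24:03.175Z',
                                          '2017-07-03T10:20:46.333Z', '2017-06-16T10:13:31.535Z',
                                          '2017-12-15T12:26:01.347Z', '2017-06-15T16:00:41.017Z',
                                          '2017-11-28T15:25:39.016Z', '2017-08-10T08:48:01.347Z'], utc=True)})

# merge_asof needs both key columns sorted; direction='nearest' picks the closest B for each A
nearest = pd.merge_asof(df1.sort_values('A'), df2.sort_values('B'),
                        left_on='A', right_on='B', direction='nearest')
print(nearest)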

Can this pandas workflow be converted to dask?

Please be nice - I'm not a proper programmer, I'm a scientist and I've read as many docs on this as I can find (they're a bit sparse).
I'm trying to convert this pandas code into Dask because my input file is ~0.5 TB gzipped and it loads too slowly in native pandas. I have a 3 TB machine, btw.
This is an example of what I'm doing with pandas:
df = pd.DataFrame([['chr1',33329,17,'''33)'6'4?1&AB=?+..''','''X%&=E&!%,0("&"Y&!'''],
                   ['chr1',33330,15,'''6+'/7=1#><C1*'*''','''X%=E!%,("&"Y&&!'''],
                   ['chr1',33331,13,'''2*3A#/9#CC3--''','''X%E!%,("&"Y&!'''],
                   ['chr1',33332,1,'''4**(,:3)+7-#<(0-''','''X%&E&!%,0("&"Y&!'''],
                   ['chr1',33333,2,'''66(/C=*42A:.&*''','''X%=&!%0("&"&&!''']],
                  columns=['chrom','pos','depth','phred','map'])
df.loc[:,'phred'] = [(sum(map(ord,i))-len(i)*33)/len(i) for i in df.loc[:,"phred"]]
df.loc[:,"map"] = [(sum(map(ord,i)))/len(i) for i in df.loc[:,"map"]]
df = df.astype({'phred': 'int32', 'map': 'int32'})
df.query('(depth < 10) | (phred < 7) | (map < 10)', inplace=True)
for chrom, df_tmp in df.groupby('chrom'):
    df_end = df_tmp[~((df_tmp.pos.shift(0) == df_tmp.pos.shift(-1)-1))]
    df_start = df_tmp[~((df_tmp.pos.shift(0) == df_tmp.pos.shift(+1)+1))]
    for start, end in zip(df_start.pos, df_end.pos):
        print(start, end)
Gives
33332 33333
This works (to find regions of a cancer genome with no data) and it's optimised as much as I know how.
I load the real thing like:
df = pd.read_csv(
    '/Users/liamm/Downloads/test_head33333.tsv.gz',
    sep='\t',
    header=None,
    index_col=None,
    usecols=[0, 1, 3, 5, 6],
    names=['chrom', 'pos', 'depth', 'phred', 'map']
)
and I can do the same with Dask (way faster!):
df = dd.read_csv(
    '/Users/liamm/Downloads/test_head33333.tsv.gz',
    sep='\t',
    header=None,
    usecols=[0, 1, 3, 5, 6],
    compression='gzip',
    blocksize=None,
    names=['chrom', 'pos', 'depth', 'phred', 'map']
)
but I'm stuck here:
ff=[(sum(map(ord,i))-len(i)*33)/len(i) for i in df.loc[:,"phred"]]
df['phred'] = ff
Error: Column assignment doesn't support type list
Question - is this sort of thing possible? If so, are there good tutorials somewhere? I need to convert the whole block of pandas code above.
Thanks in advance!
You created list comprehensions to transform 'phred' and 'map'; I converted these list comps to functions and wrapped the functions in np.vectorize().
def func_p(p):
    return (sum(map(ord, p)) - len(p) * 33) / len(p)

def func_m(m):
    return (sum(map(ord, m))) / len(m)

vec_func_p = np.vectorize(func_p)
vec_func_m = np.vectorize(func_m)
np.vectorize() does not make the code faster, but it lets you write a function with scalar inputs and outputs and convert it into one that takes array inputs and returns array outputs.
The benefit is that we can now pass pandas Series to these functions (I also added the type conversion to this step):
df.loc[:, 'phred'] = vec_func_p(df.loc[:, 'phred']).astype(np.int32)
df.loc[:, 'map'] = vec_func_m(df.loc[:, 'map']).astype(np.int32)
Replacing the list comprehensions with these new functions gives the same results as your version (33332 33333).
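If you would rather do that step inside Dask itself, a minimal sketch (my assumption, not tested on your data) is to map the same scalar functions over the column with Series.apply; the original error came from assigning a plain Python list, whereas assigning a lazy Dask Series is supported.
import dask.dataframe as dd  # df is assumed to be the Dask DataFrame from dd.read_csv above

# apply the scalar helpers element-wise; meta tells Dask the expected name and dtype
df['phred'] = df['phred'].apply(func_p, meta=('phred', 'float64')).astype('int32')
df['map'] = df['map'].apply(func_m, meta=('map', 'float64')).astype('int32')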
#rpanai noted that you could eliminate the for loops. The following example uses groupby() (and a couple helper columns) to find the start and end position for each contiguous sequence of positions.
Using only pandas built-in functions should be compatible with Dask (and fast).
First, create a demo data frame with multiple chromosomes and multiple contiguous blocks of positions:
data1 = {'chrom': 'chrom_1',
         'pos': [1000, 1001, 1002,
                 2000, 2001, 2002, 2003]}
data2 = {'chrom': 'chrom_2',
         'pos': [30000, 30001, 30002, 30003, 30004,
                 40000, 40001, 40002, 40003, 40004, 40005]}
# DataFrame.append is deprecated in recent pandas; pd.concat does the same here
df = pd.concat([pd.DataFrame(data1), pd.DataFrame(data2)])
Second, create two helper columns:
rank is a sequential counter within each group;
key is constant for positions in a contiguous 'run' of positions.
df['rank'] = df.groupby('chrom')['pos'].rank(method='first')
df['key'] = df['pos'] - df['rank']
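To make the trick concrete, here is what the two helper columns work out to for chrom_1 in the demo data (values written out by hand; rank is returned as a float):
# pos  : 1000  1001  1002  2000  2001  2002  2003
# rank :    1     2     3     4     5     6     7
# key  :  999   999   999  1996  1996  1996  1996
# key is constant within each contiguous run, so grouping on ('chrom', 'key')
# isolates every run of consecutive positions.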
Third, group by chrom and key to create a groupby object for each contiguous block of positions, then use min and max to find the start and end values of the positions.
result = (df.groupby(['chrom', 'key'])['pos']
            .agg(['min', 'max'])
            .droplevel('key')
            .rename(columns={'min': 'start', 'max': 'end'})
          )
print(result)
           start    end
chrom
chrom_1     1000   1002
chrom_1     2000   2003
chrom_2    30000  30004
chrom_2    40000  40005

Merge two series of time intervals in pandas (intersection)

I have multiple lists of time intervals and I need to find the time intervals (intersection) that are common to all of them.
E.g.
a = [['2018-02-03 15:06:30', '2018-02-03 17:06:30'],   # each line is read as [start, end]
     ['2018-02-05 10:30:30', '2018-02-05 10:36:30'],
     ['2018-02-05 11:30:30', '2018-02-05 11:42:32']]
b = [['2018-02-03 15:16:30', '2018-02-03 18:06:30'],
     ['2018-02-04 10:30:30', '2018-02-05 10:32:30']]
c = [['2018-02-01 15:00:30', '2018-02-05 18:06:30']]
The result would be
common_intv = [['2018-02-03 15:16:30', '2018-02-03 17:06:30'],
               ['2018-02-05 10:30:30', '2018-02-05 10:32:30']]
I've found this solution, which should also work for time intervals, but I was wondering whether there is a more efficient way to do it in pandas.
The proposed solution in the link processes two lists at a time, i.e. it would first find the common intervals between a and b, put those in a variable common, then find the common intervals between common and c, and so on...
Of course a global solution (considering all the interval lists at the same time) would be even better!
You can use pandas.merge_asof in both directions to get a first selection and then carefully clean up the resulting rows. Code could be:
# build the dataframes and ensure Timestamp types
dfa = pd.DataFrame(a, columns=['start', 'end']).astype('datetime64[ns]')
dfb = pd.DataFrame(b, columns=['start', 'end']).astype('datetime64[ns]')
dfc = pd.DataFrame(c, columns=['start', 'end']).astype('datetime64[ns]')
# merge a and b
tmp = pd.concat([pd.merge_asof(dfa, dfb, on='start'),
                 pd.merge_asof(dfb, dfa, on='start')]
                ).sort_values('start').dropna()
# keep the minimum end and ensure start <= end
tmp = tmp.assign(end=np.minimum(tmp.end_x, tmp.end_y))[['start', 'end']]
tmp = tmp[tmp['start'] <= tmp['end']]
# merge c
tmp = pd.concat([pd.merge_asof(tmp, dfc, on='start'),
                 pd.merge_asof(dfc, tmp, on='start')]
                ).sort_values('start').dropna()
tmp = tmp.assign(end=np.minimum(tmp.end_x, tmp.end_y))[['start', 'end']]
tmp = tmp[tmp['start'] <= tmp['end']]
It gives as expected:
                start                 end
0 2018-02-03 15:16:30 2018-02-03 17:06:30
1 2018-02-05 10:30:30 2018-02-05 10:32:30
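If you have more than three lists, the same two-at-a-time logic can be wrapped in a helper and folded over all of them with functools.reduce; this is only a sketch built from the answer's own steps, not a separately tested global solution.
from functools import reduce

def intersect_pair(left, right):
    # pairwise intersection of two interval DataFrames sorted by 'start'
    merged = pd.concat([pd.merge_asof(left, right, on='start'),
                        pd.merge_asof(right, left, on='start')]
                       ).sort_values('start').dropna()
    merged = merged.assign(end=np.minimum(merged.end_x, merged.end_y))[['start', 'end']]
    return merged[merged['start'] <= merged['end']].reset_index(drop=True)

common = reduce(intersect_pair, [dfa, dfb, dfc])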

Finding index of a closest value from another dataframe

I have two dataframes measuring two properties from an instrument, where the depths are offset by a certain dz. Note that the example below is extremely simplified.
df1 = pd.DataFrame({'depth_1': [0.936250, 0.959990, 0.978864, 0.991288, 1.023876, 1.045801, 1.062768, 1.077090, 1.101248, 1.129754, 1.147458, 1.160193, 1.191206, 1.218595, 1.256964] })
df2 = pd.DataFrame({'depth_2': [0.620250, 0.643990, 0.662864, 0.675288, 0.707876, 0.729801, 0.746768, 0.761090, 0.785248, 0.813754, 0.831458, 0.844193, 0.875206, 0.902595, 0.940964 ] })
How do I get the index of df2.depth_2 that is closest to the first element of df1.depth_1?
Using reindex with method='nearest':
df2.reset_index().set_index('depth_2').reindex(df1.depth_1, method='nearest')['index'].unique()
Out[265]: array([14], dtype=int64)
You can use the pandas merge_asof function (you will need to sort your data first if it isn't already sorted in real life):
df1 = df1.sort_values(by='depth_1')
df2 = df2.sort_values(by='depth_2')
pd.merge_asof(df1, df2.reset_index(), left_on="depth_1", right_on="depth_2", direction="nearest")
If you just wanted that for the first value in df1, you could do the join on the top row only:
df2 = df2.sort_values(by='depth_2')
pd.merge_asof(df1.head(1), df2.reset_index(), left_on="depth_1", right_on="depth_2", direction="nearest")
Get the absolute difference between all elements of df2 and the first element of df1, then take the index of the minimum:
import pandas as pd
import numpy as np
def get_closest(df1, df2, idx):
    abs_diff = np.array([abs(df1['depth_1'][idx] - item) for item in df2['depth_2']])
    return abs_diff.argmin()
df1 = pd.DataFrame({'depth_1': [0.936250, 0.959990, 0.978864, 0.991288, 1.023876, 1.045801, 1.062768, 1.077090, 1.101248, 1.129754, 1.147458, 1.160193, 1.191206, 1.218595, 1.256964] })
df2 = pd.DataFrame({'depth_2': [0.620250, 0.643990, 0.662864, 0.675288, 0.707876, 0.729801, 0.746768, 0.761090, 0.785248, 0.813754, 0.831458, 0.844193, 0.875206, 0.902595, 0.940964 ] })
get_closest(df1,df2,0)
Output:
14
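For a single reference value you can also let pandas do the same thing directly; a one-line sketch that should give the same answer for the default RangeIndex used here:
# absolute difference to the first depth_1 value, then the index label of the minimum
closest_idx = (df2['depth_2'] - df1['depth_1'].iloc[0]).abs().idxmin()
print(closest_idx)  # 14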

pandas: setting last N rows of multi-index to Nan for speeding up groupby with shift

I am trying to speed up my groupby.apply + shift, and thanks to this previous question and answer (How to speed up Pandas multilevel dataframe shift by group?) I can prove that it does indeed speed things up when you have many groups.
From that question I now have the following code to set the first entry of each group in the multi-index to NaN, so that I can do my shift globally rather than per group.
df.iloc[df.groupby(level=0).size().cumsum()[:-1]] = np.nan
but I want to look forward, not backwards, and I need to do calculations across N rows. So I am trying to use similar code to set the last N entries of each group to NaN, but obviously I am missing some important indexing knowledge because I just can't figure it out.
I figure I want to convert this so that every entry becomes a range of positions rather than a single integer. How would I do that?
# the start of each group, ignoring the first entry
df.groupby(level=0).size().cumsum()[1:]
Test setup (for backwards shift) if you want to try it:
length = 5
groups = 3
rng1 = pd.date_range('1/1/1990', periods=length, freq='D')
frames = []
for x in range(groups):
    tmpdf = pd.DataFrame({'date': rng1,
                          'category': int(10000000 * abs(np.random.randn())),
                          'colA': np.random.randn(length),
                          'colB': np.random.randn(length)})
    frames.append(tmpdf)
df = pd.concat(frames)
df.sort_values(by=['category', 'date'], inplace=True)
df.set_index(['category', 'date'], inplace=True, drop=True)

df['tmpShift'] = df['colB'].shift(1)
df.iloc[df.groupby(level=0).size().cumsum()[:-1]] = np.nan
# Yay, this is so much faster.
df['newColumn'] = df['tmpShift'] / df['colA']
df.drop('tmpShift', axis=1, inplace=True)
Thanks!
I ended up doing it using a groupby apply as follows (and coded to work forwards or backwards):
def replace_tail(grp, col, N, value):
    if N > 0:
        grp[col][:N] = value
    else:
        grp[col][N:] = value
    return grp

df = df.groupby(level=0).apply(replace_tail, 'tmpShift', 2, np.nan)
So the final code is:
def replace_tail(grp, col, N, value):
    if N > 0:
        grp[col][:N] = value
    else:
        grp[col][N:] = value
    return grp

length = 5
groups = 3
rng1 = pd.date_range('1/1/1990', periods=length, freq='D')
frames = []
for x in range(groups):
    tmpdf = pd.DataFrame({'date': rng1,
                          'category': int(10000000 * abs(np.random.randn())),
                          'colA': np.random.randn(length),
                          'colB': np.random.randn(length)})
    frames.append(tmpdf)
df = pd.concat(frames)
df.sort_values(by=['category', 'date'], inplace=True)
df.set_index(['category', 'date'], inplace=True, drop=True)

shiftBy = -1
df['tmpShift'] = df['colB'].shift(shiftBy)
df = df.groupby(level=0).apply(replace_tail, 'tmpShift', shiftBy, np.nan)
# Yay, this is so much faster.
df['newColumn'] = df['tmpShift'] / df['colA']
df.drop('tmpShift', axis=1, inplace=True)
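For reference, the original goal (blanking the last N rows of each group so the forward shift can be done globally, without a groupby.apply) can also be reached with purely positional indexing. A minimal sketch that would replace the groupby.apply line right after the shift, assuming every group has at least N rows:
# forward-shift case: blank the last N rows of each group in one vectorized step
N = abs(shiftBy)                                    # rows to blank at the end of each group
ends = df.groupby(level=0).size().cumsum().values   # position one past the last row of each group
positions = np.concatenate([np.arange(e - N, e) for e in ends])
df.iloc[positions, df.columns.get_loc('tmpShift')] = np.nan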

Speeding up pandas array calculation

I have working code that achieves the desired calculation result, but I am currently using an algorithm that iterates over the pandas array. This is obviously slower than pure pandas DataFrame calculations. I would like some advice on how I can use pandas functions to speed up this calculation.
Code to generate dummy data
df = pd.DataFrame(index=pd.date_range(start='2014-01-01', periods=365))
df['Month'] = df.index.month
df['MTD'] = (df.index.day+0.001)/10000
This is basically a pandas DataFrame with MTD figures for some value. This is purely given so that we have some data to play with.
Needed calculation
What I need is a new DataFrame that has the starting (investment) dates as columns, populated with a few beginning-of-month values. The index is all possible dates, and the values should be the YTD figure. I am using this DataFrame as a lookup/cache for investment dates.
pseudocode
YTD = (1 + last MTD of month 1) * (1 + last MTD of month 2) * ... * (1 + MTD on the required date) - 1, over all months from the investment date up to the required date
Working function
def calculate_YTD(df):  # slow: takes 3.5s on my machine!!!!!!
    YTD_df = pd.DataFrame(index=df.index)
    for investment_date in [datetime.datetime(2014, x + 1, 1) for x in range(12)]:
        YTD_df[investment_date] = 1.0  # pre-populate with dummy floats
        for date in df.index:  # iterate over all dates in period
            h = (df[investment_date:date].groupby('Month')['MTD'].max().fillna(0) + 1).product() - 1
            YTD_df[investment_date][date] = h
    return YTD_df
I have hardcoded the investment-dates list to simplify the problem statement. On my machine this code takes 2.5 to 3.5 seconds. Any suggestions on how I can speed it up?
Here's an approach that should be reasonably quick. Quite possibly there is something faster/cleaner, but this should be an improvement.
# assuming a fixed number of investment dates, build a list
investment_dates = pd.date_range('2014-1-1', periods=12, freq='MS')

# build a table, by month, which contains the cumulative MTD
# return for each investment date. Still have to loop over the investment dates,
# but don't need to loop over each daily value
running_mtd = []
for date in investment_dates:
    curr_mo = (df[df.index >= date].groupby('Month')['MTD'].last() + 1.).cumprod()
    curr_mo.name = date
    running_mtd.append(curr_mo)

running_mtd_df = pd.concat(running_mtd, axis=1)
running_mtd_df = running_mtd_df.shift(1).fillna(1.)
# merge running mtd returns with the base dataframe
df = df.merge(running_mtd_df, left_on='Month', right_index=True)

# calculate the ytd return for each column / day, by multiplying the running
# monthly return with the current MTD value
for date in investment_dates:
    df[date] = np.where(df.index < date, np.nan, df[date] * (1. + df['MTD']) - 1.)
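Once the merged frame is built, the YTD figure for any investment date and day can be read straight off it; a small usage sketch (my addition, with example dates picked arbitrarily):
# YTD return of the 2014-03-01 investment as of 2014-06-15
# (the investment-date columns are pandas Timestamps)
print(df.loc[pd.Timestamp('2014-06-15'), pd.Timestamp('2014-03-01')])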
