I have multiple lists of time intervals and I need to find the time intervals (intersection) that are common to all of them.
E.g.
a = [['2018-02-03 15:06:30', '2018-02-03 17:06:30'], # each line is read as [start, end]
['2018-02-05 10:30:30', '2018-02-05 10:36:30'],
['2018-02-05 11:30:30', '2018-02-05 11:42:32']]
b = [['2018-02-03 15:16:30', '2018-02-03 18:06:30'],
['2018-02-04 10:30:30', '2018-02-05 10:32:30']]
c = [['2018-02-01 15:00:30', '2018-02-05 18:06:30']]
The result would be
common_intv = [['2018-02-03 15:16:30','2018-02-03 17:06:30'],
['2018-02-05 10:30:30','2018-02-05 10:32:30']]
I've found this solution that should work also for time intervals but I was wondering whether there is a more efficient way to do it in pandas.
The proposed solution in the link would process two lists at a time i.e. it would first find the common intervals between a and b, then put these common intervals inside a variable common, then find the common intervals between common and c and so on...
Of course a global solution (considering all intervals at the same time) would be even better!
You can use pandas.merge_asof in both directions to get a first selection and then carefully cleanup the resulting rows. Code could be:
import numpy as np
import pandas as pd
# build the dataframes and ensure Timestamp types
dfa = pd.DataFrame(a, columns=['start', 'end']).astype('datetime64[ns]')
dfb = pd.DataFrame(b, columns=['start', 'end']).astype('datetime64[ns]')
dfc = pd.DataFrame(c, columns=['start', 'end']).astype('datetime64[ns]')
# merge a and b
tmp = pd.concat([pd.merge_asof(dfa, dfb, on='start'),
                 pd.merge_asof(dfb, dfa, on='start')]
                ).sort_values('start').dropna()
# keep the minimum end and ensure start <= end
tmp = tmp.assign(end=np.minimum(tmp.end_x, tmp.end_y))[['start', 'end']]
tmp = tmp[tmp['start'] <= tmp['end']]
# merge c
tmp = pd.concat([pd.merge_asof(tmp, dfc, on='start'),
                 pd.merge_asof(dfc, tmp, on='start')]
                ).sort_values('start').dropna()
tmp = tmp.assign(end=np.minimum(tmp.end_x, tmp.end_y))[['start', 'end']]
tmp = tmp[tmp['start'] <= tmp['end']]
It gives as expected:
                start                 end
0 2018-02-03 15:16:30 2018-02-03 17:06:30
1 2018-02-05 10:30:30 2018-02-05 10:32:30
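For the "global" variant asked about in the question, one possible sketch is a single sweep over all endpoints instead of pairwise merges. This is not the pandas approach above but a plain-Python alternative. The function name common_intervals is hypothetical, and the sketch assumes that the intervals inside each individual list do not overlap one another. Intervals that only touch at a single instant are not reported.

```python
from datetime import datetime

def common_intervals(*interval_lists):
    """Return the intervals covered by every list (hypothetical helper).

    Assumes the intervals inside each single list do not overlap.
    """
    events = []
    for intervals in interval_lists:
        for start, end in intervals:
            events.append((datetime.fromisoformat(start), 1))   # interval opens
            events.append((datetime.fromisoformat(end), -1))    # interval closes
    # Tuples sort by time first; at equal times -1 (close) sorts before +1 (open)
    events.sort()

    needed = len(interval_lists)
    active = 0          # number of lists currently "inside" an interval
    opened_at = None
    result = []
    for moment, delta in events:
        if active == needed and delta == -1:
            # one list is about to leave: the common stretch ends here
            result.append([str(opened_at), str(moment)])
        active += delta
        if active == needed:
            opened_at = moment
    return result

a = [['2018-02-03 15:06:30', '2018-02-03 17:06:30'],
     ['2018-02-05 10:30:30', '2018-02-05 10:36:30'],
     ['2018-02-05 11:30:30', '2018-02-05 11:42:32']]
b = [['2018-02-03 15:16:30', '2018-02-03 18:06:30'],
     ['2018-02-04 10:30:30', '2018-02-05 10:32:30']]
c = [['2018-02-01 15:00:30', '2018-02-05 18:06:30']]

common_intv = common_intervals(a, b, c)
print(common_intv)
```

The sort dominates the cost, so the sweep is O(M log M) for M intervals in total, regardless of how many lists are passed in.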
I have the following code:
import pandas.util.testing as testing
df = testing.makeDataFrame()
df
With this I have created 2 dataframes, one of which has 2 fewer rows than the original.
This is df - Original
A B C D
OdhGFPa5Kw -0.686378 -1.210838 1.160708 0.903309
gelZFj4BG5 1.603112 1.852592 -0.065482 0.684566
mp3Aq5ueGD 0.254211 -0.788877 -0.626789 0.109116
pBtz9DHxUZ -0.970632 0.982661 -0.463984 -0.123727
K28pzbdYcX -1.311220 -2.121306 1.209484 -1.695901
71ZFgWaeDE 1.887420 0.337702 -0.176539 0.149089
alWOjkQ2eZ 1.997701 -0.354276 1.997802 -0.086803
This is df1 - with 2 less lines
A B C D
OdhGFPa5Kw -0.686378 -1.210838 1.160708 0.903309
gelZFj4BG5 1.603112 1.852592 -0.065482 0.684566
mp3Aq5ueGD 0.254211 -0.788877 -0.626789 0.109116
pBtz9DHxUZ -0.970632 0.982661 -0.463984 -0.123727
K28pzbdYcX -1.311220 -2.121306 1.209484 -1.695901
What I am trying to do is remove all the rows that are not common to both dataframes. To do this, I find the index labels shared by the two dataframes.
duplicates = set(df.index).intersection(df1.index)
Could you please advise how I can remove the rows whose index is not in duplicates?
If you want to remove the indices in place:
idx = df.index.difference(df1.index)
df.drop(idx, inplace=True)
If you want to create a new object:
idx = df.index.intersection(df1.index)
new_df = df.loc[idx]
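A minimal sketch of both variants on a toy frame, in case it helps to see them side by side. The frame contents and the names common and masked are made up for illustration. Index.isin is shown as an equivalent boolean-mask form.

```python
import pandas as pd

# Toy data standing in for the random frames in the question
df = pd.DataFrame({"A": [1, 2, 3, 4]}, index=["w", "x", "y", "z"])
df1 = df.iloc[:2]  # same frame with the last two rows missing

# Labels present in both frames
common = df.index.intersection(df1.index)
new_df = df.loc[common]

# Equivalent boolean-mask form: keep rows whose label also appears in df1
masked = df[df.index.isin(df1.index)]
print(new_df)
```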
I have a large dataframe (sample). I was filtering the data according to this code:
A = [f"A{i}" for i in range(50)]
B = [f"B{i}" for i in range(50)]
C = [f"C{i}" for i in range(50)]
for i in A:
    cond_A = (df[i]>= -0.0423) & (df[i]<=3)
    filt_df = df[cond_A]
for i in B:
    cond_B = (filt_df[i]>= 15) & (filt_df[i]<=20)
    filt_df2 = filt_df[cond_B]
for i in C:
    cond_C = (filt_df2[i]>= 15) & (filt_df2[i]<=20)
    filt_df3 = filt_df2[cond_B]
When I print filt_df3, I get only an empty dataframe. Why?
How can I improve the code? Are there other approaches, such as more advanced techniques?
I am not sure whether the code above does what the edit below describes.
How can I change the code so that it does?
Edit:
1. I want to remove rows based on columns (A0 - A49) using cond_A.
2. Then filter the dataframe from step 1 based on columns (B0 - B49) with cond_B.
3. Then filter the dataframe from step 2 based on columns (C0 - C49) with cond_C.
Thank you very much in advance.
It seems to me that there is an issue with the way your code uses iteration to do the filtering. For example, filt_df is overwritten in every iteration of the first loop. When the loop ends, filt_df only contains the data filtered with the condition from the last iteration. Is this what you intend to do?
If you want to do the filtering efficiently, you can try pandas.DataFrame.query (see documentation here). For example, if you want to keep only the rows in which columns B0 to B49 all contain values between 0 and 200 inclusive, you can try the Python code below (assuming that you have loaded the raw data into the variable df).
condition_list = [f'B{i} >= 0 & B{i} <= 200' for i in range(50)]
filter_str = ' & '.join(condition_list)
subset_df = df.query(filter_str)
print(subset_df)
Since the column A1 contains only -0.057 which is outside [-0.0423, 3] everything gets filtered out.
In addition, the filter is not carried over from one iteration to the next, because filt_df, filt_df2 and filt_df3 are rebuilt from the unfiltered input on every pass.
This should work:
import pandas as pd
A = [f"A{i}" for i in range(50)]
B = [f"B{i}" for i in range(50)]
C = [f"C{i}" for i in range(50)]
filt_df = df.copy()
for i in A:
    # build the mask from the frame being filtered so the indexes match
    cond_A = (filt_df[i] >= -0.0423) & (filt_df[i] <= 3)
    filt_df = filt_df[cond_A]
filt_df2 = filt_df.copy()
for i in B:
    cond_B = (filt_df2[i] >= 15) & (filt_df2[i] <= 20)
    filt_df2 = filt_df2[cond_B]
filt_df3 = filt_df2.copy()
for i in C:
    cond_C = (filt_df3[i] >= 15) & (filt_df3[i] <= 20)
    filt_df3 = filt_df3[cond_C]  # cond_C, not cond_B
print(filt_df3)
Of course, the pandas library offers many filtering tools that can be applied to multiple columns at once.
For example this:
https://stackoverflow.com/a/39820329/6139079
You can test all columns together and use DataFrame.all(axis=1) to check whether every column in a row satisfies the condition:
A = [f"A{i}" for i in range(50)]
cond_A = ((df[A] >= -0.0423) & (df[A]<=3)).all(axis=1)
B = [f"B{i}" for i in range(50)]
cond_B = ((df[B]>= 15) & (df[B]<=20)).all(axis=1)
C = [f"C{i}" for i in range(50)]
cond_C = ((df[C]>= 15) & (df[C]<=20)).all(axis=1)
Finally, chain all masks with & (bitwise AND):
filt_df = df[cond_A & cond_B & cond_C]
If you get an empty DataFrame, it means that no row satisfies all conditions.
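When the result is empty, it can help to find out which columns are responsible. The sketch below counts, per column, how many rows pass the range test. The toy data, the column list cols and the names in_range, passing_per_column and blocking are assumptions made up for this illustration.

```python
import pandas as pd

# Toy data: A1 holds only -0.057, which lies outside [-0.0423, 3]
df = pd.DataFrame({
    "A0": [0.5, 1.0, 2.0],
    "A1": [-0.057, -0.057, -0.057],
    "A2": [0.1, 4.0, 0.2],
})
cols = ["A0", "A1", "A2"]

# Boolean frame: True where the value lies inside the range
in_range = (df[cols] >= -0.0423) & (df[cols] <= 3)

# Number of rows that pass, per column
passing_per_column = in_range.sum()

# Columns for which no row at all passes; any one of them empties the result
blocking = passing_per_column[passing_per_column == 0].index.tolist()
print(passing_per_column)
print(blocking)
```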
I have two datasets with different time formats, like this:
df1 = pd.DataFrame( {'A': [1499503900, 1512522054, 1412525061, 1502527681, 1512532303]})
df2 = pd.DataFrame( {'B' : ['2017-12-15T11:47:58.119Z', '2017-05-31T08:27:41.943Z', '2017-06-05T14:44:56.425Z', '2017-05-30T16:24:03.175Z' , '2017-07-03T10:20:46.333Z', '2017-06-16T10:13:31.535Z' , '2017-12-15T12:26:01.347Z', '2017-06-15T16:00:41.017Z', '2017-11-28T15:25:39.016Z', '2017-08-10T08:48:01.347Z'] })
I need to find the nearest date in the second dataset for each value in the first dataset. It doesn't matter how far away it is; I just need the nearest time. For example:
1499503900 for '2017-07-03T10:20:46.333Z'
1512522054 for '2017-12-15T12:26:01.347Z'
1412525061 for '2017-05-31T08:27:41.943Z'
1502527681 for '2017-08-10T08:48:01.347Z'
1512532303 for '2017-06-05T14:44:56.425Z'
Here are a few helpers:
This is for converting to long format date :
import calendar
import datetime
def time1(date_text):
    date = datetime.datetime.strptime(date_text, "%Y-%m-%dT%H:%M:%S.%fZ")
    return calendar.timegm(date.utctimetuple())
x = '2017-12-15T12:26:01.347Z'
print(time1(x))
out: 1513340761
And this is for converting to ISO format:
import datetime as DT
def time_covert(time):
    seconds_since_epoch = time
    return DT.datetime.utcfromtimestamp(seconds_since_epoch).isoformat()
y = 1499503900
print(time_covert(y))
out = 2017-07-08T08:51:40
Any idea will be extremely useful.
Thank you all in advance!
Here is a quick start:
import numpy as np
import pandas as pd
from datetime import datetime
def time_covert(time):
    seconds_since_epoch = time
    return datetime.utcfromtimestamp(seconds_since_epoch)
# real time series (drop the UTC marker so the values compare with naive datetimes)
df2['B'] = pd.to_datetime(df2['B']).dt.tz_localize(None)
df2.index = df2['B']
del df2['B']
for a in df1['A']:
    print(time_covert(a))
    i = np.argmin(np.abs(df2.index.to_pydatetime() - time_covert(a)))
    print(df2.iloc[i])
I would like to approach this as an algorithmic question rather than pandas specific. My approach is to sort the "df2" series and for each DateTime in df1, perform a binary search on the sorted df2, to get the indexes of insertion. Then check the indexes just below and above the found index to get the desired output.
Here is the code for above procedure.
Use standard pandas DateTime for easy comparison
df1 = pd.DataFrame( {'A': pd.to_datetime([1499503900, 1512522054, 1412525061, 1502527681, 1512532303], unit='s')})
df2 = pd.DataFrame( {'B' : pd.to_datetime(['2017-12-15T11:47:58.119Z', '2017-05-31T08:27:41.943Z', '2017-06-05T14:44:56.425Z', '2017-05-30T16:24:03.175Z' , '2017-07-03T10:20:46.333Z', '2017-06-16T10:13:31.535Z' , '2017-12-15T12:26:01.347Z', '2017-06-15T16:00:41.017Z', '2017-11-28T15:25:39.016Z', '2017-08-10T08:48:01.347Z']) })
Sort df2 by date, and get the insertion positions using binary search
df2 = df2.sort_values('B').reset_index(drop=True)
ind = df2['B'].searchsorted(df1['A'])
Now check for the minimum difference between the index just above and just below the position of the insertion
for index, row in df1.iterrows():
    i = ind[index]
    if i not in df2.index:
        print(df2.iloc[i-1]['B'])
    elif i-1 not in df2.index:
        print(df2.iloc[i]['B'])
    else:
        if abs(df2.iloc[i]['B'] - row['A']) > abs(df2.iloc[i-1]['B'] - row['A']):
            print(df2.iloc[i-1]['B'])
        else:
            print(df2.iloc[i]['B'])
The test outputs are these, for each value in df1 respectively. (Note: Please recheck your outputs given in the question, they do not correspond to the minimum difference)
2017-07-03 10:20:46.333000
2017-11-28 15:25:39.016000
2017-05-30 16:24:03.175000
2017-08-10 08:48:01.347000
2017-11-28 15:25:39.016000
The above procedure has a time complexity of O(N log N) for sorting and O(log N) per lookup (N = len(df2)). If "df1" is large, this will be a fairly fast approach.
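The same sort-then-search idea is also available through pandas.merge_asof with direction="nearest". The following is only a sketch on a subset of the question's data. The names left, right and nearest are made up here, and both key columns are cast to naive datetime64[ns] on the assumption that the keys need matching dtypes.

```python
import pandas as pd

# Epoch seconds -> naive UTC timestamps
stamps = pd.to_datetime([1499503900, 1512522054, 1412525061], unit="s")
# ISO strings with a trailing Z -> parse as UTC, then drop the timezone
iso = pd.to_datetime(
    ["2017-12-15T11:47:58.119Z", "2017-05-30T16:24:03.175Z",
     "2017-07-03T10:20:46.333Z", "2017-11-28T15:25:39.016Z"],
    utc=True,
).tz_localize(None)

left = pd.DataFrame({"A": stamps}).astype("datetime64[ns]")
right = pd.DataFrame({"B": iso}).astype("datetime64[ns]")

# merge_asof needs both sides sorted on their keys
nearest = pd.merge_asof(
    left.sort_values("A"),
    right.sort_values("B"),
    left_on="A",
    right_on="B",
    direction="nearest",
)
print(nearest)
```

The result is ordered by A, so it has to be re-sorted if the original row order of the first dataset matters.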
I have a huge set of data, something like 100k lines, and I am trying to drop a row from a dataframe if the row, which contains a list, contains a value from another dataframe. Here's a small-scale example.
has = [['#a'], ['#b'], ['#c, #d, #e, #f'], ['#g']]
use = [1,2,3,5]
z = ['#d','#a']
df = pd.DataFrame({'user': use, 'tweet': has})
df2 = pd.DataFrame({'z': z})
tweet user
0 [#a] 1
1 [#b] 2
2 [#c, #d, #e, #f] 3
3 [#g] 5
z
0 #d
1 #a
The desired outcome would be
tweet user
0 [#b] 2
1 [#g] 5
Things I've tried:
#this seems to work for dropping #a but not #d
for a in range(df.tweet.size):
    for search in df2.z:
        if search in df.loc[a].tweet:
            df.drop(a)
#this works for my small scale example but throws an error on my big data
df['tweet'] = df.tweet.apply(', '.join)
test = df[~df.tweet.str.contains('|'.join(df2['z'].astype(str)))]
#the error being "unterminated character set at position 1343770"
#i went to check what was on that line and it returned this
basket.iloc[1343770]
user_id 17060480
tweet [#IfTheyWereBlackOrBrownPeople, #WTF]
Name: 4612505, dtype: object
Any help would be greatly appreciated.
Is ['#c, #d, #e, #f'] one string, or a list like ['#c', '#d', '#e', '#f']?
has = [['#a'], ['#b'], ['#c', '#d', '#e', '#f'], ['#g']]
use = [1,2,3,5]
z = ['#d','#a']
df = pd.DataFrame({'user': use, 'tweet': has})
df2 = pd.DataFrame({'z': z})
A simple solution would be:
screen = set(df2.z.tolist())
to_delete = list()  # collecting the labels allows a single drop, which speeds things up
for idx, row in df.iterrows():
    if set(row.tweet).intersection(screen):
        to_delete.append(idx)
df.drop(to_delete, inplace=True)
Speed comparison (for 10 000 rows):
import time
st = time.time()
screen = set(df2.z.tolist())
to_delete = list()
for idx, row in df.iterrows():
    if set(row.tweet).intersection(screen):
        to_delete.append(idx)
df.drop(to_delete, inplace=True)
print(time.time()-st)
2.142000198364258
st = time.time()
for a in df.tweet.index:
    for search in df2.z:
        if search in df.loc[a].tweet:
            df.drop(a, inplace=True)
            break
print(time.time()-st)
43.99799990653992
For me, your code works if I make several adjustments.
First, range(df.tweet.size) only matches the row labels when the index is the default 0..n-1; it is more robust to iterate over df.tweet.index.
Second, df.drop(a) returns a new DataFrame and leaves df unchanged; use inplace=True (or assign the result back) so the drop takes effect.
Third, you have #d inside a string: '#c, #d, #e, #f' is a single string, not a list, and you have to change it to a list for the membership test to work.
With those changes, the following code works fine:
has = [['#a'], ['#b'], ['#c', '#d', '#e', '#f'], ['#g']]
use = [1,2,3,5]
z = ['#d','#a']
df = pd.DataFrame({'user': use, 'tweet': has})
df2 = pd.DataFrame({'z': z})
for a in df.tweet.index:
    for search in df2.z:
        if search in df.loc[a].tweet:
            df.drop(a, inplace=True)
            break  # so if we already dropped it we no longer look whether we should drop this line
This will provide the desired result. Be aware of this potentially being not optimal due to missing vectorization.
EDIT:
You can turn the string into a list with the following:
from itertools import chain
df.tweet = df.tweet.apply(lambda l: [tag.strip() for tag in chain.from_iterable(lelem.split(",") for lelem in l)])
This applies a function to each row (assuming each row contains a list with one or more elements): split each element (which should be a string) by comma, strip the whitespace around each piece, and "flatten" all the resulting lists for that row into one.
EDIT2:
Yes, this is not really performant, but it basically does what was asked. Keep that in mind and, once it is working, try to improve your code (fewer for loops, and tricks like collecting the indices and then dropping all of them at once).
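As one possible way to cut down on the explicit loops, the membership test can be pushed into Series.map with set.isdisjoint. This is a sketch that assumes every tweet entry is already a proper list of tags. The names screen, keep and result are chosen here for illustration. It also avoids building a regular expression out of the hashtags.

```python
import pandas as pd

df = pd.DataFrame({
    "user": [1, 2, 3, 5],
    "tweet": [["#a"], ["#b"], ["#c", "#d", "#e", "#f"], ["#g"]],
})
df2 = pd.DataFrame({"z": ["#d", "#a"]})

# Tags that disqualify a row
screen = set(df2["z"])

# True where the row shares no tag with the screen set
keep = df["tweet"].map(screen.isdisjoint).astype(bool)

result = df[keep].reset_index(drop=True)
print(result)
```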
I am trying to speed up my groupby.apply + shift and
thanks to this previous question and answer: How to speed up Pandas multilevel dataframe shift by group? I can prove that it does indeed speed things up when you have many groups.
From that question I now have the following code to set the first entry in each multi-index group to NaN, so I can do my shift globally rather than per group.
df.iloc[df.groupby(level=0).size().cumsum()[:-1]] = np.nan
but I want to look forward, not backwards, and need to do calculations across N rows. So I am trying to use some similar code to set the last N entries to NaN, but obviously I am missing some important indexing knowledge as I just can't figure it out.
I figure I want to convert this so that every entry is a range rather than a single integer. How would I do that?
# the start of each group, ignoring the first entry
df.groupby(level=0).size().cumsum()[1:]
Test setup (for backwards shift) if you want to try it:
import numpy as np
import pandas as pd
length = 5
groups = 3
rng1 = pd.date_range('1/1/1990', periods=length, freq='D')
frames = []
for x in range(groups):
    tmpdf = pd.DataFrame({'date':rng1,'category':int(10000000*abs(np.random.randn())),'colA':np.random.randn(length),'colB':np.random.randn(length)})
    frames.append(tmpdf)
df = pd.concat(frames)
df.sort_values(['category','date'],inplace=True)
df.set_index(['category','date'],inplace=True,drop=True)
df['tmpShift'] = df['colB'].shift(1)
df.iloc[df.groupby(level=0).size().cumsum()[:-1]] = np.nan
# Yay this is so much faster.
df['newColumn'] = df['tmpShift'] / df['colA']
df.drop('tmpShift',axis=1,inplace=True)
Thanks!
I ended up doing it using a groupby apply as follows (and coded to work forwards or backwards):
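def replace_tail(grp,col,N,value):
    # N > 0 overwrites the first N rows of the group, otherwise the last |N| rows
    grp = grp.copy()
    col_pos = grp.columns.get_loc(col)
    if (N > 0):
        grp.iloc[:N, col_pos] = value
    else:
        grp.iloc[N:, col_pos] = value
    return grp
df = df.groupby(level=0, group_keys=False).apply(replace_tail,'tmpShift',2,np.nan)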
So the final code is:
import numpy as np
import pandas as pd
def replace_tail(grp,col,N,value):
    # N > 0 overwrites the first N rows of the group, otherwise the last |N| rows
    grp = grp.copy()
    col_pos = grp.columns.get_loc(col)
    if (N > 0):
        grp.iloc[:N, col_pos] = value
    else:
        grp.iloc[N:, col_pos] = value
    return grp
length = 5
groups = 3
rng1 = pd.date_range('1/1/1990', periods=length, freq='D')
frames = []
for x in range(groups):
    tmpdf = pd.DataFrame({'date':rng1,'category':int(10000000*abs(np.random.randn())),'colA':np.random.randn(length),'colB':np.random.randn(length)})
    frames.append(tmpdf)
df = pd.concat(frames)
df.sort_values(['category','date'],inplace=True)
df.set_index(['category','date'],inplace=True,drop=True)
shiftBy=-1
df['tmpShift'] = df['colB'].shift(shiftBy)
df = df.groupby(level=0, group_keys=False).apply(replace_tail,'tmpShift',shiftBy,np.nan)
# Yay this is so much faster.
df['newColumn'] = df['tmpShift'] / df['colA']
df.drop('tmpShift',axis=1,inplace=True)
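An alternative that stays with the "shift globally, then blank out the group edges" idea is to number the rows of each group from the end with GroupBy.cumcount(ascending=False) and mask the last N. This is a sketch on deterministic toy data rather than the random frames above. The category values and the name from_end are made up for illustration.

```python
import numpy as np
import pandas as pd

# Deterministic stand-in for the test frame: 3 groups of 5 days each
dates = pd.date_range("1990-01-01", periods=5, freq="D")
index = pd.MultiIndex.from_product([[101, 202, 303], dates],
                                   names=["category", "date"])
df = pd.DataFrame({"colB": np.arange(15, dtype=float)}, index=index)

N = 2  # look N rows ahead

# Global forward shift: fast, but leaks values across group boundaries
df["tmpShift"] = df["colB"].shift(-N)

# Position of each row counted from the end of its group (0 = last row)
from_end = df.groupby(level=0).cumcount(ascending=False)

# The last N rows of every group have no valid look-ahead value
df.loc[(from_end < N).to_numpy(), "tmpShift"] = np.nan
print(df)
```

On this toy frame the result agrees with df.groupby(level=0)["colB"].shift(-N), which does the per-group shift directly.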