I am using pandas in a Python notebook to do some data analysis. I am trying to use a simple nested loop, but it performs very badly.
The problem is that I have two tables made of two columns each, the first containing time stamps (hh:mm:ss) and the second containing some integer values.
The first table (big_table) contains 86400 rows, one for each possible timestamp in a day, and each integer value is initially set to 0.
The second table (small_table) contains fewer rows, one for each timestamp at which an actual integer value is registered.
The goal is to map the small_table integers onto the big_table integers in the rows where the timestamps match, and to carry forward the last written integer for the big_table rows whose timestamp does not appear in small_table.
I am doing this by trying to "force" a Java/C way of doing it, iterating over each element and accessing them as the [i][j] elements of a matrix.
Is there any better way of doing this using pandas/numpy?
Code:
rel_time_pointer = small_table.INTEGER.iloc[0]
for i in range(small_table.shape[0]):
    for j in range(big_table.shape[0]):
        if (small_table.time.iloc[i] == big_table.time.iloc[j]):
            rel_time_pointer = small_table.INTEGER.iloc[i]
            big_table.INTEGER.iloc[j] = rel_time_pointer
            break
        else:
            big_table.INTEGER.iloc[j] = rel_time_pointer
example:
big_table:
time INTEGER
00:00:00 0
00:00:01 0
00:00:02 0
00:00:03 0
00:00:04 0
00:00:05 0
00:00:06 0
.
.
.
23:59:59 0
small_table:
time INTEGER
00:00:03 100
00:00:05 100
big_table_after_execution:
time INTEGER
00:00:00 0
00:00:01 0
00:00:02 0
00:00:03 100
00:00:04 100
00:00:05 200
00:00:06 200
Using the @gtomer merge command:
big_table = big_table.merge(small_table, on='time', how='left')
and adding .fillna(0) at the end of the command I get:
time INTEGER_x INTEGER_y
00:00:00 0 0.0
00:00:01 0 0.0
... ... ...
with the INTEGER values of small_table in the right places of big_table_after_execution. Now I'm trying to replace the 0 values with the closest non-zero value above them (a forward fill), so that I get:
time INTEGER_x INTEGER_y
00:00:00 0 0.0
00:00:01 0 0.0
00:00:02 0 0.0
00:00:03 0 1.0
00:00:04 0 1.0
00:00:05 0 2.0
00:00:06 0 2.0
instead of:
00:00:00 0 0.0
00:00:01 0 0.0
00:00:02 0 0.0
00:00:03 0 1.0
00:00:04 0 0.0
00:00:05 0 2.0
00:00:06 0 0.0
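If the remaining step is to forward-fill the merged values (carrying each non-missing small_table value down until the next one appears), a minimal sketch, assuming the merged column ends up named INTEGER_y and the .fillna(0) is deferred until after the fill, would be:
big_table = big_table.merge(small_table, on='time', how='left')
big_table['INTEGER_y'] = big_table['INTEGER_y'].ffill().fillna(0)
The order matters: forward-fill first, then replace the remaining NaNs with 0. Once the gaps are already zeros, ffill has nothing left to propagate.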
Please try the following:
big_table_after_execution = big_table.merge(small_table, on='time', how='left')
Please post the output you get and we'll continue from there
Numpy iteration and enumeration options:
if you have a 2d np.ndarray type object, then iteration can be achieved in one line as follows:
for (i,j), value in np.ndenumerate(ndarray_object):...
This works like regular enumerate, but allows you to deconstruct the higher dimensional index into a tuple of appropriate dimensions.
You could maybe place your values into a 2d array structure from numpy and iterate through them like that?
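For illustration, a tiny self-contained example of ndenumerate (the array values here are made up):
import numpy as np
arr = np.array([[10, 20], [30, 40]])
for (i, j), value in np.ndenumerate(arr):
    print(i, j, value)  # prints: 0 0 10, 0 1 20, 1 0 30, 1 1 40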
The easiest way to modify what you already have so that it looks less 'c-like' is probably to just use regular enumerate:
for small_index, small_value in enumerate(small_table):
for big_index, big_value in enumerate(big_table):...
zip
Another option for grouping your iteration together is the zip() function, which combines iterables 1 and 2 and produces a single iterable whose length equals that of the shorter input.
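A quick illustration of that truncation behaviour (toy lists, just for the sake of the example):
list(zip([1, 2, 3], ['a', 'b']))  # [(1, 'a'), (2, 'b')], stops at the shorter iterable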
Related
I am trying to group a data set of travel durations into 5-minute intervals, starting from 0 to inf. How may I do that?
My sample dataFrame looks like:
Duration
0 00:01:37
1 00:18:19
2 00:22:03
3 00:41:07
4 00:11:54
5 00:21:34
I have used this code: df.groupby([pd.Grouper(key='Duration', freq='5T')]).size()
And I have found the following result:
Duration
00:01:37 1
00:06:37 0
00:11:37 1
00:16:37 2
00:21:37 1
00:26:37 0
00:31:37 0
00:36:37 1
00:41:37 0
Freq: 5T, dtype: int64
My expected result is:
Duration Counts
00:00:00 0
00:05:00 1
00:10:00 0
00:15:00 1
00:20:00 1
........ ...
My expectation is the index will start from 00:00:00 instead of 00:01:37.
Or, showing bins will also work for me, I mean:
Duration Counts
0-5 1
5-10 0
10-15 1
15-20 1
20-25 2
........ ...
I need your help please. Thank you.
First, you need to round your times down to the nearest 5-minute mark. Then simply count them.
I suppose this is what you are looking for -
def round_to_5min(t):
    """This function rounds a timestamp down to the nearest 5-min mark"""
    t = datetime.datetime(1991, 2, 13, t.hour, t.minute - t.minute % 5, 0)
    return t
data['new_col'] = data.Duration.map(round_to_5min).dt.time
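To then get counts per bin (the expected output above), a short sketch of an alternative, assuming the Duration column parses cleanly with pd.to_timedelta:
durations = pd.to_timedelta(data['Duration'])
counts = durations.dt.floor('5min').value_counts().sort_index()
Bins with no observations won't appear in this result; if the zero-count rows are needed, the index can be reindexed against a full 5-minute range afterwards.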
I have a df like,
stamp value
0 00:00:00 2
1 00:00:00 3
2 01:00:00 5
converting to time delta
df['stamp']=pd.to_timedelta(df['stamp'])
slicing only odd index and adding 30 mins,
odd_df=pd.to_timedelta(df[1::2]['stamp'])+pd.to_timedelta('30 min')
#print(odd_df)
1 00:30:00
Name: stamp, dtype: timedelta64[ns]
now, updating df with odd_df,
as per the documentation it should give my expected output.
expected output:
df.update(odd_df)
#print(df)
stamp value
0 00:00:00 2
1 00:30:00 3
2 01:00:00 5
What I am getting,
df.update(odd_df)
#print(df)
stamp value
0 00:30:00 00:30:00
1 00:30:00 00:30:00
2 00:30:00 00:30:00
Please help; what is wrong in this?
Try this instead:
df.loc[1::2, 'stamp'] += pd.to_timedelta('30 min')
This ensures you update just the values specified by the .loc indexer while keeping the rest of your original DataFrame. To test, run df.shape: you will get (3, 2) with the method above.
In your code here:
odd_df=pd.to_timedelta(df[1::2]['stamp'])+pd.to_timedelta('30 min')
The odd_df object only holds the part of your original DataFrame that you sliced (it is actually a Series, which is why its shape is (1,)).
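For reference, a minimal end-to-end sketch of that fix applied to the sample frame from the question:
import pandas as pd

df = pd.DataFrame({'stamp': ['00:00:00', '00:00:00', '01:00:00'],
                   'value': [2, 3, 5]})
df['stamp'] = pd.to_timedelta(df['stamp'])
df.loc[1::2, 'stamp'] += pd.to_timedelta('30 min')
# only row 1 changes: its stamp becomes 00:30:00, its value stays 3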
I have an amount of seconds in a dataframe, let's say:
s = 122
I want to convert it to the following format:
00:02:02.0000
To do that I try using to_datetime the following way:
pd.to_datetime(s, format='%H:%M:%S.%f')
However this doesn't work:
ValueError: time data 122 does not match format '%H:%M:%S.%f' (match)
I also tried using unit='ms' instead of format, but then I get the date before the time.
How can I modify my code to get the desired conversion?
It needs to be done in the dataframe using pandas if possible.
EDIT: both jezrael's and MedAli's solutions below are valid; however, jezrael's solution has the advantage of working not only with integers but also with datetime.time as input!
Use to_timedelta after converting the seconds to nanoseconds:
df = pd.DataFrame({'sec':[122,3,5,7,1,0]})
df['t'] = pd.to_timedelta(df['sec'] * 10**9)
print (df)
sec t
0 122 00:02:02
1 3 00:00:03
2 5 00:00:05
3 7 00:00:07
4 1 00:00:01
5 0 00:00:00
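As an aside, to_timedelta also accepts a unit argument, which gives the same result without the explicit multiplication by 10**9:
df['t'] = pd.to_timedelta(df['sec'], unit='s')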
You can edit your code as follows to get the desired result:
df = pd.DataFrame({'sec':[122,3,5,7,1,0]})
df['time'] = pd.to_datetime(df.sec, unit="s").dt.time
Output:
In [10]: df
Out[10]:
sec time
0 122 00:02:02
1 3 00:00:03
2 5 00:00:05
3 7 00:00:07
4 1 00:00:01
5 0 00:00:00
I have two dataframes each with a datetime column:
df_long=
mytime_long
0 00:00:01 1/10/2013
1 00:00:05 1/10/2013
2 00:00:55 1/10/2013
df_short=
mytime_short
0 00:00:02 1/10/2013
1 00:00:03 1/10/2013
2 00:00:06 1/10/2013
The timestamps are unique and can be assumed sorted in each of the two dataframes.
I would like to create a new dataframe that contains, for each value of mytime_long, the nearest (index, mytime_short) pair that is at or after it in time (hence with a non-negative timedelta).
ex.
0 (0, 00:00:02 1/10/2013)
1 (2, 00:00:06 1/10/2013)
2 (np.nan,np.nat)
write a function to get the closest index & timestamp in df_short given a timestamp
def get_closest(n):
    mask = df_short.mytime_short >= n
    ids = np.where(mask)[0]
    if ids.size > 0:
        return ids[0], df_short.mytime_short[ids[0]]
    else:
        return np.nan, np.nan
apply this function over df_long.mytime_long, to get a new data frame with the index & timestamp values in a tuple
df = df_long.mytime_long.apply(get_closest)
df
# output:
0 (0, 2013-01-10 00:00:02)
1 (2, 2013-01-10 00:00:06)
2 (nan, nan)
ilia timofeev's answer reminded me of this pandas.merge_asof function which is perfect for this type of join
df = pd.merge_asof(df_long,
                   df_short.reset_index(),
                   left_on='mytime_long',
                   right_on='mytime_short',
                   direction='forward')[['index', 'mytime_short']]
df
# output:
index mytime_short
0 0.0 2013-01-10 00:00:02
1 2.0 2013-01-10 00:00:06
2 NaN NaT
A little bit ugly, but an effective way to solve the task. The idea is to join them on timestamp and select the first "short" after each "long", if any.
#recreate data
df_long = pd.DataFrame(
    pd.to_datetime(['00:00:01 1/10/2013', '00:00:05 1/10/2013', '00:00:55 1/10/2013']),
    index=[0, 1, 2], columns=['mytime_long'])
df_short = pd.DataFrame(
    pd.to_datetime(['00:00:02 1/10/2013', '00:00:03 1/10/2013', '00:00:06 1/10/2013']),
    index=[0, 1, 2], columns=['mytime_short'])
#join by time, preserving ids
df_all = df_short.assign(inx_s=df_short.index).set_index('mytime_short').join(
    df_long.assign(inx_l=df_long.index).set_index('mytime_long'), how='outer')
#mark all "short" rows with nearest "long" id
df_all['inx_l'] = df_all.inx_l.ffill().fillna(-1)
#select "short" rows
df_short_candidate = df_all[~df_all.inx_s.isnull()].astype(int)
df_short_candidate['mytime_short'] = df_short_candidate.index
#select get minimal "short" time in "long" group,
#join back with long to recover empty intersection
df_res = df_long.join(df_short_candidate.groupby('inx_l').first())
print (df_res)
Out:
mytime_long inx_s mytime_short
0 2013-01-10 00:00:01 0.0 2013-01-10 00:00:02
1 2013-01-10 00:00:05 2.0 2013-01-10 00:00:06
2 2013-01-10 00:00:55 NaN NaT
Performance comparison on sample of 100000 elements:
186 ms to execute this implementation.
1min 3s to execute df_long.mytime_long.apply(get_closest)
UPD: but the winner is @Haleemur Ali's pd.merge_asof with 10ms
I have a rather straightforward problem I'd like to solve with more efficiency than I'm currently getting.
I have a bunch of data coming in as a set of monitoring metrics. Input data is structured as an array of tuples. Each tuple is (timestamp, value). Timestamps are integer epoch seconds, and values are normal floating point numbers. Example:
inArr = [ (1388435242, 12.3), (1388435262, 11.1), (1388435281, 12.8), ... ]
The timestamps are not always the same number of seconds apart, but it's usually close. Sometimes we get duplicate numbers submitted, sometimes we miss datapoints, etc.
My current solution takes the timestamps and:
finds the num seconds between each successive pair of timestamps;
finds the median of these delays;
creates an array of the correct size;
presumes the first time period starts at half the median value before the first timestamp (putting the measurement in the middle of the time period);
averages values that happen to go into the same time bucket;
adds data to this array according to the correct (timestamp - starttime)/median element.
if there's no value for a time range, I obviously output a None value.
Output data has to be in the format:
outArr = [ (startTime, timeStep, numVals), [ val1, val2, val3, val4, ... ] ]
I suspect this is a solved problem with Python Pandas http://pandas.pydata.org/ (or Numpy / SciPy).
Yes, my solution works, but when I'm operating on 60K datapoints it can take a tenth of a second (or more) to run. This is troublesome when I'm trying to work on large numbers of sets of data.
So, I'm looking for a solution that might run faster than my pure-Python version. I guess I'm presuming (based on a couple of previous conversations with an Argonne National Labs guy) that SciPy and Numpy are (clearing-throat) "somewhat faster" at array operations. I've looked briefly (an hour or so) at the Pandas code but it looks cumbersome to do this set of operations. Am I incorrect?
-- Edit to show expected output --
The median time between datapoints is 20 seconds, and half of that is 10 seconds. To make sure the bin boundaries fall well between the timestamps, we make the start time 10 seconds before the first datapoint. If we just made the start time the first timestamp, it would be a lot more likely that we'd get 2 timestamps in one interval.
So, 1388435242 - 10 = 1388435232. The timestep is the median, 20 seconds. The numvals here is 3.
outArr = [ (1388435232, 20, 3), [ 12.3, 11.1, 12.8 ] ]
This is the format that Graphite expects when we're graphing the output; it's not my invention. It seems common, though, to have timeseries data be in this format - a starttime, interval, and then an array of values.
Here's a sketch
Create your input series
In [24]: x = zip(pd.date_range('20130101',periods=1000000,freq='s').asi8/1000000000,np.random.randn(1000000))
In [49]: x[0]
Out[49]: (1356998400, 1.2809949462375376)
Create the frame
In [25]: df = DataFrame(x,columns=['time','value'])
Make the dates a bit random (to simulate some data)
In [26]: df['time1'] = df['time'] + np.random.randint(0,10,size=1000000)
Convert the epoch seconds to datetime64[ns] dtype
In [29]: df['time2'] = pd.to_datetime(df['time1'],unit='s')
Difference the series (to create timedeltas)
In [32]: df['diff'] = df['time2'].diff()
Looks like this
In [50]: df
Out[50]:
time value time1 time2 diff
0 1356998400 -0.269644 1356998405 2013-01-01 00:00:05 NaT
1 1356998401 -0.924337 1356998401 2013-01-01 00:00:01 -00:00:04
2 1356998402 0.952466 1356998410 2013-01-01 00:00:10 00:00:09
3 1356998403 0.604783 1356998411 2013-01-01 00:00:11 00:00:01
4 1356998404 0.140927 1356998407 2013-01-01 00:00:07 -00:00:04
5 1356998405 -0.083861 1356998414 2013-01-01 00:00:14 00:00:07
6 1356998406 1.287110 1356998412 2013-01-01 00:00:12 -00:00:02
7 1356998407 0.539957 1356998414 2013-01-01 00:00:14 00:00:02
8 1356998408 0.337780 1356998412 2013-01-01 00:00:12 -00:00:02
9 1356998409 -0.368456 1356998410 2013-01-01 00:00:10 -00:00:02
10 1356998410 -0.355176 1356998414 2013-01-01 00:00:14 00:00:04
11 1356998411 -2.912447 1356998417 2013-01-01 00:00:17 00:00:03
12 1356998412 -0.003209 1356998418 2013-01-01 00:00:18 00:00:01
13 1356998413 0.122424 1356998414 2013-01-01 00:00:14 -00:00:04
14 1356998414 0.121545 1356998421 2013-01-01 00:00:21 00:00:07
15 1356998415 -0.838947 1356998417 2013-01-01 00:00:17 -00:00:04
16 1356998416 0.329681 1356998419 2013-01-01 00:00:19 00:00:02
17 1356998417 -1.071963 1356998418 2013-01-01 00:00:18 -00:00:01
18 1356998418 1.090762 1356998424 2013-01-01 00:00:24 00:00:06
19 1356998419 1.740093 1356998428 2013-01-01 00:00:28 00:00:04
20 1356998420 1.480837 1356998428 2013-01-01 00:00:28 00:00:00
21 1356998421 0.118806 1356998427 2013-01-01 00:00:27 -00:00:01
22 1356998422 -0.935749 1356998427 2013-01-01 00:00:27 00:00:00
Calc median
In [34]: df['diff'].median()
Out[34]:
0 00:00:01
dtype: timedelta64[ns]
Calc mean
In [35]: df['diff'].mean()
Out[35]:
0 00:00:00.999996
dtype: timedelta64[ns]
Should get you started
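To push that sketch the rest of the way toward the output format in the question, here is one possible continuation, as an untested sketch rather than a drop-in solution; it reuses the question's inArr/outArr names, and the origin argument to resample assumes pandas 1.1 or newer:
import pandas as pd

df = pd.DataFrame(inArr, columns=['time', 'value'])
step = int(df['time'].diff().median())        # median gap between timestamps, in seconds
start = int(df['time'].iloc[0]) - step // 2   # half a step before the first datapoint
idx = pd.to_datetime(df['time'], unit='s')
binned = (df.set_index(idx)['value']
            .resample(f'{step}s', origin=pd.Timestamp(start, unit='s'))
            .mean())                          # duplicates in the same bucket get averaged
outArr = [(start, step, len(binned)),
          [None if pd.isna(v) else float(v) for v in binned]]
Empty buckets come out of the resample as NaN and are converted to None here, matching the question's requirement for missing time ranges.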
You can pass your inArr to a pandas Dataframe:
df = pd.DataFrame(inArr, columns=['time', 'value'])
num seconds between each successive pair of timestamps: df['time'].diff()
median delay: df['time'].diff().median()
creates an array of the correct size (I think that's taken care of?)
presumes the first time period starts at half the median value before the first timestamp (putting the measurement in the middle of the time period); I don't know what you mean here
averages values that happen to go into the same time bucket
For several of these problems it may make sense to convert your seconds to datetime and set it as the index:
In [39]: df['time'] = pd.to_datetime(df['time'], unit='s')
In [41]: df = df.set_index('time')
In [42]: df
Out[42]:
value
time
2013-12-30 20:27:22 12.3
2013-12-30 20:27:42 11.1
2013-12-30 20:28:01 12.8
Then to handle multiple values in the same time, use groupby.
In [49]: df.groupby(level='time').mean()
Out[49]:
value
time
2013-12-30 20:27:22 12.3
2013-12-30 20:27:42 11.1
2013-12-30 20:28:01 12.8
It's the same since there aren't any dupes.
Not sure what you mean about the last two.
And your desired output seems to contradict what you wanted earlier. Your values with the same timestamp should be averaged, and now you want them all? Maybe clear that up a bit.