I have a pandas dataframe with columns 'Time' in minutes and 'Value' pulled in from a data logger. The data are logged at logarithmic time intervals, meaning that the first values are logged at fractional minutes and, as time proceeds, the intervals get longer:
print(df)
      Minutes      Value
0       0.001    0.00100
1       0.005    0.04495
2       0.010    0.04495
3       0.015    0.09085
4       0.020    0.11368
..        ...        ...
561  4275.150  269.17782
562  4285.150  266.90964
563  4295.150  268.35306
564  4305.150  269.42984
565  4315.150  268.37594
I would like to linearly interpolate the 'Value' at one minute intervals from 0 to 4315 minutes.
I have attempted a few different iterations of df.interpolate(), but have not had any success. Can someone please help me out? Thank you
I think it's possible that my question was very basic or that I asked a confusing question. Either way, I just wrote a little loop to solve my problem and felt like I should share it. I am sure that this is not the most efficient way of doing what I was asking, and hopefully somebody could suggest better ways of accomplishing this. I am still very new at this whole thing.
First a few qualifying things:
The 'Value' data that I was talking about is called 'drawdown', which refers to a difference in water level from the initial starting water level inside a water well. It starts at 0.
This kind of data is often viewed on a semi-log plot, and sometimes it's easier to replace 0 with a very low number instead (e.g. 0.0001) so that it plots easily in other programs.
This code takes a .csv file with column names 'Minutes' and 'Drawdown' and compares the time values with a new reference dataframe of minutes from 0 through the end of the dataset. For each desired time it finds the 2 closest time values in the list, takes a weighted average of their values, and then creates a new csv of the integer minutes with drawdown.
Cheers!
# -*- coding: utf-8 -*-
"""
Created on Tue Sep 22 13:42:29 2020
#author: cmeyer
"""
import pandas as pd
import numpy as np
df=pd.read_csv('Read_in.csv')
length=len(df)-1
last=df.at[length,'Drawdown']
lengthpump=int(df.at[length,'Minutes'])
minutes=np.arange(0,lengthpump,1)
dfminutes=pd.DataFrame(minutes)
dfminutes.columns = ['Minutes']
for i in range(1, lengthpump, 1):
    non_uni_minutes=df['Minutes']
    uni_minutes=dfminutes.at[i,'Minutes']
    close1=non_uni_minutes[np.argsort(np.abs(non_uni_minutes-uni_minutes))[0]]
    close2=non_uni_minutes[np.argsort(np.abs(non_uni_minutes-uni_minutes))[1]]
    index1 = np.where(non_uni_minutes == close1)
    index1 = int(index1[0])
    index2 = np.where(non_uni_minutes == close2)
    index2 = int(index2[0])
    num1=df.at[index1,'Drawdown']
    num2=df.at[index2,'Drawdown']
    weight1 = 1-abs((i-close1)/i)
    weight2 = 1-abs((i-close2)/i)
    Value = (weight1*num1+weight2*num2)/(weight1+weight2)
    dfminutes.at[i,'Drawdown'] = Value
dfminutes.at[0,'Drawdown'] = 0.000001
dfminutes.at[0,'Minutes'] = 0.000001
dfminutes.to_csv('integer_minutes_drawdown.csv')
Here I implemented an efficient solution using numpy.interp. I've coded a somewhat fancy way of reading the data into a pandas.DataFrame from a string; you may use any simpler method that suits your needs, like pandas.read_csv(...).
Try the code below:
import math
import pandas as pd, numpy as np
# Here is just fancy way of reading data, use any other method of reading instead
df = pd.DataFrame([map(float, line.split()) for line in """
0.001 0.00100
0.005 0.04495
0.010 0.04495
0.015 0.09085
0.020 0.11368
4275.150 269.17782
4285.150 266.90964
4295.150 268.35306
4305.150 269.42984
4315.150 268.37594
""".splitlines() if line.strip()], columns = ['Time', 'Value'])
a = df.values
# Create array of integer x = [0 1 2 3 ... LastTimeFloor].
x = np.arange(math.floor(a[-1, 0] + 1e-6) + 1)
# Linearly interpolate
y = np.interp(x, a[:, 0], a[:, 1])
df = pd.DataFrame({'Time': x, 'Value': y})
print(df)
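For completeness, since the question specifically mentions df.interpolate: the same resampling can be done with pandas alone by reindexing onto the union of the original and whole-minute time stamps and interpolating against the index, which is linear in time. A minimal sketch, assuming the df built above with float 'Time' in minutes and 'Value':
import numpy as np
import pandas as pd

minutes = np.arange(0, 4316, dtype=float)                  # 0, 1, 2, ..., 4315
s = df.set_index('Time')['Value']
s = s.reindex(s.index.union(minutes)).rename_axis('Time')  # insert the whole-minute time stamps as NaN rows
s = s.interpolate(method='index')                          # linear interpolation in the index values (time)
result = s.loc[minutes].reset_index()                      # columns: 'Time', 'Value'
# minute 0 stays NaN because the first logged sample is at 0.001 minutes
print(result)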
What's the best way to perform an operation on a dataframe where, for every row, I need to do a selection on another dataframe?
For example:
My first dataframe has the similarity between every pair of items. For starters, I'll set every similarity to zero and calculate the correct similarities later.
import pandas as pd
import numpy as np
import scipy as sp
from scipy.spatial import distance
items = [1,2,3,4]
item_item_idx = pd.MultiIndex.from_product([items, items], names = ['from_item', 'to_item'])
item_item_df = pd.DataFrame({'similarity': np.zeros(len(item_item_idx))},
                            index = item_item_idx)
My next dataframe has the rating every user gave for every item. For the sake of simplicity, let's assume every user rated every item and generate random ratings between 1 and 5.
users = [1,2,3,4,5]
ratings_idx = pd.MultiIndex.from_product([items, users], names = ['item', 'user'])
rating_df = pd.DataFrame(
    {'rating': np.random.randint(low = 1, high = 6, size = len(users)*len(items))},
    columns = ['rating'],
    index = ratings_idx)
Now that I have the ratings, I want to update the cosine similarity between the items. What I need to do is, for every row in item_item_df, select from rating_df the vector of ratings for each of the two items, and calculate the cosine distance between them.
I want to know the least dumb way to do this. Here's what I tried so far:
==== FIRST TRY - Iterating over rows
def similarity(ii, iu):
    for index, row in ii.iterrows():
        v = iu.loc[index[0]]
        u = iu.loc[index[1]]
        row['similarity'] = distance.cosine(v, u)
    return(ii)
import time
start_time = time.time()
item_item_df = similarity(item_item_df, rating_df)
print('Time: {:f}s'.format(time.time() - start_time))
Took me 0.01002s to run this. In a problem with 10k items, I estimate it would take in the ballpark of 20 hours to run. Not good.
The thing is, I'm iterating over rows; my hope is that I can vectorize this to make it faster. I played around with df.apply() and df.map(). This is the best I've done so far:
==== SECOND TRY - index.map()
def similarity_map(idx):
    v = rating_df.loc[idx[0]]
    u = rating_df.loc[idx[1]]
    return distance.cosine(v, u)
start_time = time.time()
item_item_df['similarity'] = item_item_df.index.map(similarity_map)
print('Time: {:f}s'.format(time.time() - start_time))
Took me 0.034961s to execute. Slower than just iterating over rows.
So this was a naive attempt at vectorizing. Is it even possible to do? What other options do I have to improve the runtime?
Thanks for the attention.
For your given example I'd just pivot it into an array and move on with my life.
from sklearn.metrics.pairwise import cosine_similarity
rating_df = rating_df.reset_index().pivot(index='item', columns='user')
cs_df = pd.DataFrame(cosine_similarity(rating_df),
                     index=rating_df.index, columns=rating_df.index)
>>> cs_df
item 1 2 3 4
item
1 1.000000 0.877346 0.660529 0.837611
2 0.877346 1.000000 0.608781 0.852029
3 0.660529 0.608781 1.000000 0.758098
4 0.837611 0.852029 0.758098 1.000000
This would be more difficult with a giant, highly-sparse array. Sklearn cosine_similarity takes sparse arrays though so as long as your number of items is reasonable (since the output matrix will be dense) this should be solvable.
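To illustrate the sparse path, here is a minimal sketch on made-up sizes (1000 items x 50 users, ~1% filled); cosine_similarity accepts a scipy sparse matrix directly, but the returned similarity matrix is still a dense array:
from scipy.sparse import random as sparse_random
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical sparse ratings matrix: rows are items, columns are users
X = sparse_random(1000, 50, density=0.01, format='csr', random_state=0)

sims = cosine_similarity(X)   # sparse in, dense (1000, 1000) array out
print(sims.shape)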
Same thing but different. Work with numpy arrays. Fine for small arrays but with 10k rows you'll have some large arrays.
import numpy as np
data = rating_df.unstack().values # shape (4,5)
udotv = np.dot(data,data.T) # shape (4,4)
mag_data = np.linalg.norm(data,axis=1)
mag = mag_data * mag_data[:,None]
cos_sim = 1 - (udotv / mag)
item_item_df['sim2'] = cos_sim.flatten()
4k users and 14k items pretty much blows up my poor computer. I'm going to have to look at how sklearn.metrics.pairwise.cosine_similarity handles data that large.
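If the full dense similarity matrix really doesn't fit, one untested option (assuming a reasonably recent scikit-learn) is sklearn.metrics.pairwise_distances_chunked, which yields the cosine-distance matrix a block of rows at a time so you never hold all of it in memory:
import numpy as np
from sklearn.metrics import pairwise_distances_chunked

# Hypothetical items x users array; the user dimension is shrunk so the sketch runs quickly
X = np.random.rand(14000, 50)

for dist_chunk in pairwise_distances_chunked(X, metric='cosine', working_memory=512):
    sim_chunk = 1.0 - dist_chunk   # cosine similarity for this block of rows
    # aggregate or write sim_chunk out here instead of keeping the full 14k x 14k matrix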
I have 13 CSV files that contain billing information in an unusual format. Multiple readings are recorded every 30 minutes of the day. Five days are recorded beside each other (columns). Then the next five days are recorded under it. To make things more complicated, the day of the week, date, and billing day is shown over the first recording of KVAR each day.
The image below shows a small example. However, imagine that KW, KVAR, and KVA repeat 3 more times before continuing some 50 rows later.
My goal was to create a simple python script that would turn the data into a data frame with the columns: DATE, TIME, KW, KVAR, KVA, and DAY.
The problem is my script returns NaN data for the KW, KVAR, and KVA data after the first five days (which is correlated with a new instance of a for loop). What is weird to me is that when I try to print out the same ranges I get the data that I expect.
My code is below. I have included comments to help further explain things. I also have an example of sample output of my function.
def make_df(df):
    #starting values
    output = pd.DataFrame(columns=["DATE", "TIME", "KW", "KVAR", "KVA", "DAY"])
    time = df1.loc[3:50,0]
    val_start = 3
    val_end = 51
    date_val = [0,2]
    day_type = [1,2]
    # There are 7 row movements that need to take place.
    for row_move in range(1,8):
        day = [1,2,3]
        date_val[1] = 2
        day_type[1] = 2
        # There are 5 column movements that take place.
        # The basic idea is that I would cycle through the five days, grab their data in a temporary dataframe,
        # and then append that dataframe onto the output dataframe
        for col_move in range(1,6):
            temp_df = pd.DataFrame(columns=["DATE", "TIME", "KW", "KVAR", "KVA", "DAY"])
            temp_df['TIME'] = time
            #These are the 3 values that stop working after the first column change
            # I get the values that I expect for the first 5 days
            temp_df['KW'] = df.iloc[val_start:val_end, day[0]]
            temp_df['KVAR'] = df.iloc[val_start:val_end, day[1]]
            temp_df['KVA'] = df.iloc[val_start:val_end, day[2]]
            # These 2 values work perfectly for the entire data set
            temp_df['DAY'] = df.iloc[day_type[0], day_type[1]]
            temp_df["DATE"] = df.iloc[date_val[0], date_val[1]]
            # trouble shooting
            print(df.iloc[val_start:val_end, day[0]])
            print(temp_df)
            output = output.append(temp_df)
            # increase values for each iteration of row loop.
            # seems to work perfectly when I print the data
            day = [x + 3 for x in day]
            date_val[1] = date_val[1] + 3
            day_type[1] = day_type[1] + 3
        # increase values for each iteration of column loop
        # seems to work perfectly when I print the data
        date_val[0] = date_val[0] + 55
        day_type[0] = day_type[0] + 55
        val_start = val_start + 55
        val_end = val_end + 55
    return output
test = make_df(df1)
Below is some sample output. It shows where the data starts to break down after the fifth day (or first instance of the column shift in the for loop). What am I doing wrong?
It could be that pd.append requires matched row indices for numerical values.
import pandas as pd
import numpy as np
output = pd.DataFrame(np.random.rand(10,2), columns=['a','b']) # fake data (10 rows to match the letters below)
output['c'] = list('abcdefghij') # add a column of non-numerical entries
tmp = pd.DataFrame(columns=['a','b','c'])
tmp['a'] = output.iloc[0:2, 2]
tmp['b'] = output.iloc[3:5, 2] # generates NaN
tmp['c'] = output.iloc[0:2, 2]
output.append(tmp)
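If that is indeed the cause, the usual fix is to strip the index off the right-hand side before assigning, either with .values or reset_index(drop=True), so pandas fills by position instead of by label; the same idea would apply to the temp_df['KW'] = df.iloc[...] lines in the question. A small sketch:
import pandas as pd
import numpy as np

output = pd.DataFrame(np.random.rand(10, 2), columns=['a', 'b'])

tmp = pd.DataFrame({'a': output.iloc[0:2, 0]})             # tmp gets index 0, 1
tmp['b'] = output.iloc[3:5, 1]                             # labels 3, 4 don't align -> NaN
tmp['b_fix'] = output.iloc[3:5, 1].values                  # .values drops the labels -> no NaN
tmp['c_fix'] = output.iloc[3:5, 1].reset_index(drop=True)  # reset_index(drop=True) works too
print(tmp)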
(initial response)
What does df1 look like? Does df.iloc[val_start:val_end, day[0]] have any issue past the fifth day? The code doesn't show how you read from the csv files, or df1 itself.
My guess: if val_start:val_end gives invalid indices on the sixth day, or df1 happens to be malformed past the fifth day, df.iloc[val_start:val_end, day[0]] will return an empty Series object and possibly make its way into temp_df. iloc does not report invalid row slices, though similar column indices would trigger an IndexError.
import numpy as np
import pandas as pd
df = pd.DataFrame(np.random.rand(5,3), columns=['a','b','c'], index=np.arange(5)) # fake data
df.iloc[0:2, 1] # returns the subset
df.iloc[100:102, 1] # returns: Series([], Name: b, dtype: float64)
A little off topic, but I would recommend preprocessing the csv files rather than dealing with indexing in a Pandas DataFrame, as the original format is rather complex. Slice the data by date and later use pd.melt or pd.groupby to shape them into the format you like. Or, alternatively, try a multi-index if you stick with Pandas I/O.
I have various time series, that I want to correlate - or rather, cross-correlate - with each other, to find out at which time lag the correlation factor is the greatest.
I found various questions and answers/links discussing how to do it with numpy, but those would mean that I have to turn my dataframes into numpy arrays. And since my time series often cover different periods, I am afraid that I will run into chaos.
Edit
The issue I am having with all the numpy/scipy methods is that they seem to lack awareness of the timeseries nature of my data. When I correlate a time series that starts in, say, 1940 with one that starts in 1970, pandas corr knows this, whereas np.correlate just produces a 1020-entry array (the length of the longer series) full of nan.
The various Q's on this subject indicate that there should be a way to solve the different length issue, but so far, I have seen no indication on how to use it for specific time periods. I just need to shift by 12 months in increments of 1, for seeing the time of maximum correlation within one year.
Edit2
Some minimal sample data:
import pandas as pd
import numpy as np
dfdates1 = pd.date_range('01/01/1980', '01/01/2000', freq = 'MS')
dfdata1 = (np.random.random_integers(-30,30,(len(dfdates1)))/10.0) #My real data is from measurements, but random between -3 and 3 is fitting
df1 = pd.DataFrame(dfdata1, index = dfdates1)
dfdates2 = pd.date_range('03/01/1990', '02/01/2013', freq = 'MS')
dfdata2 = (np.random.random_integers(-30,30,(len(dfdates2)))/10.0)
df2 = pd.DataFrame(dfdata2, index = dfdates2)
Due to various processing steps, those dfs end up changed into dfs that are indexed from 1940 to 2015. This should reproduce that:
bigdates = pd.date_range('01/01/1940', '01/01/2015', freq = 'MS')
big1 = pd.DataFrame(index = bigdates)
big2 = pd.DataFrame(index = bigdates)
big1 = pd.concat([big1, df1],axis = 1)
big2 = pd.concat([big2, df2],axis = 1)
This is what I get when I correlate with pandas and shift one dataset:
In [451]: corr_coeff_0 = big1[0].corr(big2[0])
In [452]: corr_coeff_0
Out[452]: 0.030543266378853299
In [453]: big2_shift = big2.shift(1)
In [454]: corr_coeff_1 = big1[0].corr(big2_shift[0])
In [455]: corr_coeff_1
Out[455]: 0.020788314779320523
And trying scipy:
In [456]: scicorr = scipy.signal.correlate(big1,big2,mode="full")
In [457]: scicorr
Out[457]:
array([[ nan],
[ nan],
[ nan],
...,
[ nan],
[ nan],
[ nan]])
which according to whos is
scicorr ndarray 1801x1: 1801 elems, type `float64`, 14408 bytes
But I'd just like to have 12 entries.
/Edit2
The idea I have come up with, is to implement a time-lag-correlation myself, like so:
corr_coeff_0 = df1['Data'].corr(df2['Data'])
df1_1month = df1.shift(1)
corr_coeff_1 = df1_1month['Data'].corr(df2['Data'])
df1_6month = df1.shift(6)
corr_coeff_6 = df1_6month['Data'].corr(df2['Data'])
...and so on
But this is probably slow, and I am probably trying to reinvent the wheel here. Edit: The above approach seems to work, and I have put it into a loop to go through all 12 months of a year, but I would still prefer a built-in method.
As far as I can tell, there isn't a built-in method that does exactly what you are asking. But if you look at the source code for the pandas Series method autocorr, you can see you've got the right idea:
def autocorr(self, lag=1):
    """
    Lag-N autocorrelation
    Parameters
    ----------
    lag : int, default 1
        Number of lags to apply before performing autocorrelation.
    Returns
    -------
    autocorr : float
    """
    return self.corr(self.shift(lag))
So a simple time-lagged cross-correlation function would be
def crosscorr(datax, datay, lag=0):
    """ Lag-N cross correlation.
    Parameters
    ----------
    lag : int, default 0
    datax, datay : pandas.Series objects of equal length
    Returns
    ----------
    crosscorr : float
    """
    return datax.corr(datay.shift(lag))
Then if you wanted to look at the cross correlations at each month, you could do
xcov_monthly = [crosscorr(datax, datay, lag=i) for i in range(12)]
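And if you only want the single lag with the strongest relationship within the year, a short follow-up on that list (assuming datax and datay are two overlapping monthly Series such as big1[0] and big2[0] above):
import numpy as np

best_lag = int(np.nanargmax(np.abs(xcov_monthly)))   # nanargmax skips lags where the correlation is NaN
print(best_lag, xcov_monthly[best_lag])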
There is a better approach: you can create a function that shifts your dataframe first, before calling corr().
Take this dataframe as an example:
d = {'prcp': [0.1,0.2,0.3,0.0], 'stp': [0.0,0.1,0.2,0.3]}
df = pd.DataFrame(data=d)
>>> df
prcp stp
0 0.1 0.0
1 0.2 0.1
2 0.3 0.2
3 0.0 0.3
Your function to shift the other columns (except the target):
def df_shifted(df, target=None, lag=0):
    if not lag and not target:
        return df
    new = {}
    for c in df.columns:
        if c == target:
            new[c] = df[target]
        else:
            new[c] = df[c].shift(periods=lag)
    return pd.DataFrame(data=new)
Suppose that your target is comparing prcp (precipitation) with stp (atmospheric pressure).
If you do it at the present (lag 0), it will be:
>>> df.corr()
prcp stp
prcp 1.0 -0.2
stp -0.2 1.0
But if you shift all the other columns by one period and keep the target (prcp):
df_new = df_shifted(df, 'prcp', lag=-1)
>>> print df_new
prcp stp
0 0.1 0.1
1 0.2 0.2
2 0.3 0.3
3 0.0 NaN
Note that the stp column is now shifted up one period, so if you call corr(), it will be:
>>> df_new.corr()
prcp stp
prcp 1.0 1.0
stp 1.0 1.0
So you can do this with lag -1, -2, ..., -n!
To build on Andre's answer - if you only care about the (lagged) correlation to the target, but want to test various lags (e.g. to see which lag gives the highest correlation), you can do something like this:
lagged_correlation = pd.DataFrame.from_dict(
    {x: [df[target].corr(df[x].shift(-t)) for t in range(max_lag)] for x in df.columns})
This way, each row corresponds to a different lag value, and each column corresponds to a different variable (one of them is the target itself, giving the autocorrelation).
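As a self-contained usage sketch (the column names, sizes, and random data here are made up), you can then pick the lag with the largest absolute correlation per column:
import numpy as np
import pandas as pd

df = pd.DataFrame({'prcp': np.random.rand(100),
                   'stp': np.random.rand(100),
                   'temp': np.random.rand(100)})
target, max_lag = 'prcp', 12

lagged_correlation = pd.DataFrame.from_dict(
    {x: [df[target].corr(df[x].shift(-t)) for t in range(max_lag)] for x in df.columns})

best_lags = lagged_correlation.abs().idxmax()   # row label = lag with the strongest correlation per column
print(best_lags)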
I have a pandas data object - data - that is stored as a Series of Series. The first series is indexed on ID1 and the second on ID2.
ID1   ID2
1     10259     0.063979
      14166     0.120145
      14167     0.177417
      14244     0.277926
      14245     0.436048
      15021     0.624367
      15260     0.770925
      15433     0.918439
      15763     1.000000
                     ...
1453  812690    0.752274
      813000    0.755041
      813209    0.756425
      814045    0.778434
      814474    0.910647
      814475    1.000000
Length: 19726, dtype: float64
I have a function that uses values from this object for further data processing. Here is the function:
#Function
def getData(ID1, randomDraw):
    dataID2 = data[ID1]
    value = dataID2.index[np.searchsorted(dataID2, randomDraw, side='left').iloc[0]]
    return value
I use np.vectorize to apply this function on a DataFrame - dataFrame - that has about 22 million rows.
dataFrame['ID2'] = np.vectorize(getData)(dataFrame['ID1'], dataFrame['RAND'])
where ID1 and RAND are columns with values that are feeding into the function.
The code takes about 6 hours to process everything. A similar implementation in Java takes only about 6 minutes to get through 22 million rows of data.
On running a profiler on my program I find that the most expensive call is the indexing into data and the second most expensive is searchsorted.
Function Name: pandas.core.series.Series.__getitem__
Elapsed inclusive time percentage: 54.44
Function Name: numpy.core.fromnumeric.searchsorted
Elapsed inclusive time percentage: 25.49
Using data.loc[ID1] to get the data makes the program even slower. How can I make this faster? I understand that Python cannot achieve the same efficiency as Java, but 6 hours compared to 6 minutes seems like too much of a difference. Maybe I should be using different data structures/functions? I am using Python 2.7 and the PTVS IDE.
Adding a minimum working example:
import numpy as np
import pandas as pd
np.random.seed(0)
#Creating a dummy data object - Series within Series
alt = pd.Series(np.array([ 0.25, 0.50, 0.75, 1.00]), index=np.arange(1,5))
data = pd.Series([alt]*1500, index=np.arange(1,1501))
#Creating dataFrame -
nRows = 200000
d = {'ID1': np.random.randint(1500, size=nRows) + 1,
     'RAND': np.random.uniform(low=0.0, high=1.0, size=nRows)}
dataFrame = pd.DataFrame(d)
#Function
def getData(ID1, randomDraw):
    dataID2 = data[ID1]
    value = dataID2.index[np.searchsorted(dataID2, randomDraw, side='left').iloc[0]]
    return value
dataFrame['ID2'] = np.vectorize(getData)(dataFrame['ID1'], dataFrame['RAND'])
You may get better performance with this code:
>>> def getData(ts):
...     dataID2 = data[ts.name]
...     i = np.searchsorted(dataID2.values, ts.values, side='left')
...     return dataID2.index[i]
...
>>> dataFrame['ID2'] = dataFrame.groupby('ID1')['RAND'].transform(getData)
My data is organized in multi-index dataframes. I am trying to groupby the "Sweep" index and return both the min (or max) in a specific time range, along with the time at which that peak occurs.
Data looks like:
Time Primary Secondary BL LED
Sweep
Sweep1 0 0.00000 -28173.828125 -0.416565 -0.000305
1 0.00005 -27050.781250 -0.416260 0.000305
2 0.00010 -27490.234375 -0.415955 -0.002441
3 0.00015 -28222.656250 -0.416260 0.000305
4 0.00020 -28759.765625 -0.414429 -0.002136
Getting the min or max is very straightforward.
def find_groupby_peak(voltage_df, start_time, end_time, peak="min"):
    boolean_vr = (voltage_df.Time >= start_time) & (voltage_df.Time <= end_time)
    df_subset = voltage_df[boolean_vr]
    grouped = df_subset.groupby(level="Sweep")
    if peak == "min":
        peak = grouped.Primary.min()
    elif peak == "max":
        peak = grouped.max()
    return peak
Which gives (partial output):
Sweep
Sweep1 -92333.984375
Sweep10 -86523.437500
Sweep11 -85205.078125
Sweep12 -87109.375000
Sweep13 -77929.687500
But I need the time where those peaks occur as well. I know I could iterate over the output and find where in the original dataset those values occur, but that seems like a rather brute-force way to do it. I also could write a different function to apply to the grouped object that returns both the max and the time where that max occurs (at least in theory - I haven't tried to do this, but I assume it's pretty straightforward).
Other than those two options, is there a simpler way to pass the outputs from grouped.Primary.min() (i.e. the peak values) to return where in Time those values occur?
You could consider using the transform function with groupby. If you had data that looks a bit like this:
import pandas as pd
sweep = ["sweep1", "sweep1", "sweep1", "sweep1",
"sweep2", "sweep2", "sweep2", "sweep2",
"sweep3", "sweep3", "sweep3", "sweep3",
"sweep4", "sweep4", "sweep4", "sweep4"]
Time = [0.009845, 0.002186, 0.006001, 0.00265,
0.003832, 0.005627, 0.002625, 0.004159,
0.00388, 0.008107, 0.00813, 0.004813,
0.003205, 0.003225, 0.00413, 0.001202]
Primary = [-2832.013203, -2478.839133, -2100.671551, -2057.188346,
-2605.402055, -2030.195497, -2300.209967, -2504.817095,
-2865.320903, -2456.0049, -2542.132906, -2405.657053,
-2780.140743, -2351.743053, -2232.340363, -2820.27356]
s_count = [ 0, 1, 2, 3,
0, 1, 2, 3,
0, 1, 2, 3,
0, 1, 2, 3]
df = pd.DataFrame({ 'Time' : Time,
'Primary' : Primary}, index = [sweep, s_count])
Then you could write a very simple transform function that will return, for each group of data (grouped by the sweep index), the row at which the minimum value of 'Primary' is located. You would do this with simple boolean slicing. That would look like this:
def trans_function(df):
    return df[df.Primary == min(df.Primary)]
Then to use this function simply call it inside the transform method:
df.groupby(level = 0).transform(trans_function)
And that gives me the following output:
Primary Time
sweep1 0 -2832.013203 0.009845
sweep2 0 -2605.402055 0.003832
sweep3 0 -2865.320903 0.003880
sweep4 3 -2820.273560 0.001202
Obviously you could incorporate that into your function that is acting on some subset of the data, if that is what you require.
As an alternative, you could index the group by using the argmin() function. I tried to do this with transform, but it was just returning the entire dataframe. I'm not sure why that should be; it does, however, work with apply:
def trans_function2(df):
    return df.loc[df['Primary'].argmin()]
df.groupby(level = 0).apply(trans_function2)
That again gives me:
Primary Time
sweep1 -2832.013203 0.009845
sweep2 -2605.402055 0.003832
sweep3 -2865.320903 0.003880
sweep4 -2820.273560 0.001202
I'm not totally sure why this function does not work with transform - perhaps someone will enlighten us.
I do not know if this will work with your multi-index frame, but it is worth a try; working with:
>>> df
tag tick val
z C 2014-09-07 32
y C 2014-09-08 67
x A 2014-09-09 49
w A 2014-09-10 80
v B 2014-09-11 51
u B 2014-09-12 25
t C 2014-09-13 22
s B 2014-09-14 8
r A 2014-09-15 76
q C 2014-09-16 4
find the indexer using idxmax and then use .loc:
>>> i = df.groupby('tag')['val'].idxmax()
>>> df.loc[i]
tag tick val
w A 2014-09-10 80
v B 2014-09-11 51
y C 2014-09-08 67
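For the multi-index case in the question, the same idxmin/idxmax idea should carry over, since .loc then returns both the peak value and the Time at which it occurs. A small untested sketch on made-up data shaped like the question (a 'Sweep' level plus a sample number, with 'Time' and 'Primary' columns):
import pandas as pd

idx = pd.MultiIndex.from_tuples(
    [("Sweep1", 0), ("Sweep1", 1), ("Sweep2", 0), ("Sweep2", 1)],
    names=["Sweep", "n"])
df = pd.DataFrame({"Time": [0.00000, 0.00005, 0.00000, 0.00005],
                   "Primary": [-28173.8, -27050.8, -29000.0, -28500.0]},
                  index=idx)

# Restrict to the time window first, then take the index label of each group's minimum
window = df[(df.Time >= 0.0) & (df.Time <= 0.001)]
i = window.groupby(level="Sweep")["Primary"].idxmin()

# .loc on those labels returns the peak value together with its Time
peaks = df.loc[i.tolist(), ["Time", "Primary"]]
print(peaks)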