Find the average for user-defined window in pandas - python

I have a pandas dataframe that has raw heart rate data with the index of time (in seconds).
I am trying to bin the data so that I can get the average over a user-defined window (e.g. 10s) - not a rolling average, just the average of one 10s window, then of the following 10s, and so on.
import pandas as pd
hr_raw = pd.read_csv('hr_data.csv', index_col='time')
print(hr_raw)
heart_rate
time
0.6 164.0
1.0 182.0
1.3 164.0
1.6 150.0
2.0 152.0
2.4 141.0
2.9 163.0
3.2 141.0
3.7 124.0
4.2 116.0
4.7 126.0
5.1 116.0
5.7 107.0
Using the example data above, I would like to be able to set a user-defined window size (let's use 2 seconds) and produce a new dataframe with an index in 2-second increments that averages the 'heart_rate' values whose time falls into each window (continuing to the end of the dataframe).
For example:
heart_rate
time
2.0 162.40
4.0 142.25
6.0 116.25
I can only seem to find methods that bin the data into a predetermined number of bins (e.g. for making a histogram), and these only return the count/frequency.
Thanks.

A groupby should do it.
df.groupby((df.index // 2 + 1) * 2).mean()
heart_rate
time
2.0 165.00
4.0 144.20
6.0 116.25
Note that the reason for the slight difference between our answers is that the upper bound of each bin is excluded. That means a reading taken at 2.0s is counted towards the 4.0s interval. This is how it is usually done; a similar solution with the TimeGrouper will yield the same result.
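A sketch of that time-based route (reusing the df from above and assuming the index holds seconds as plain floats; note that TimeGrouper itself has since been deprecated in favour of resample/pd.Grouper):
import pandas as pd

ts = df.copy()
ts.index = pd.to_datetime(ts.index, unit='s')               # seconds -> timestamps
out = ts.resample('2s', closed='left', label='right').mean()
out.index = (out.index - pd.Timestamp(0)).total_seconds()   # back to plain seconds
print(out)                                                   # should reproduce the groupby result above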

As coldspeed pointed out, a reading at 2.0s will be counted in the 4.0s bucket. However, if you need it in the 2.0s bucket instead, you can:
import numpy as np
df.groupby(np.ceil(df.index / 2) * 2).mean()
heart_rate
time
2.0 162.40
4.0 142.25
6.0 116.25
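If you prefer to set the window via explicit bin edges (which also sidesteps the "predetermined number of bins" issue from the question), a pd.cut sketch along the same lines, labelling each bin by its right edge (assumes the df used above):
import numpy as np
import pandas as pd

window = 2                                              # seconds, user-defined
edges = np.arange(0, df.index.max() + window, window)   # 0, 2, 4, 6
bins = pd.cut(df.index, edges, labels=edges[1:])        # right-closed bins: 2.0 falls in (0, 2]
df.groupby(bins).mean()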

Related

Pandas aggregating and comparing across conditions

Say I have a dataframe df of many conditions
w_env numChances initial_cost ratio ev
0 0.5 1.0 4.0 1.2 6.800000
1 0.6 1.0 4.0 1.2 2.960000
... ... ... ... ... ...
1195 0.6 3.0 12.0 2.6 8.009467
1196 0.7 3.0 12.0 2.6 7.409467
My objective is to group the dataframe by initial_cost and ratio (averaging over w_env) and then calculate the difference in the ev column between numChances=3 and numChances=1.
Then, I want to find the initial_cost and ratio that correspond to the maximum difference
(i.e. for which initial_cost and ratio is ev (numChances==3) - ev (numChances==1) the largest).
I tried
df.groupby(["numChances","initial_cost","ratio"]).agg({"ev":"mean"})
and then planned to pivot so that I could line up the rows where numChances=1 and numChances=3, but this seems overly complicated.
Is there a simpler way to solve this problem?
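For reference, one way to line up the two numChances levels without an explicit pivot call (a rough sketch, assuming numChances only takes the values 1.0 and 3.0 as in the sample):
avg = (df.groupby(["initial_cost", "ratio", "numChances"])["ev"]
         .mean()                     # averages over w_env
         .unstack("numChances"))     # one column per numChances level
diff = avg[3.0] - avg[1.0]           # ev(numChances=3) - ev(numChances=1)
best_initial_cost, best_ratio = diff.idxmax()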

Adding column names and values to statistic output in Python?

Background:
I'm currently developing some data profiling in SQL Server. This consists of calculating aggregate statistics on the values in targeted columns.
I'm using SQL for most of the heavy lifting, but calling Python for some of the statistics that SQL is poor at calculating. I'm leveraging the Pandas package through SQL Server Machine Learning Services.
However, I'm currently developing this script in Visual Studio, so the SQL portion is irrelevant other than as background.
Problem:
My issue is that when I call one of the Python statistics functions, it produces the output as a series with the labels seemingly not part of the data. I cannot access the labels at all. I need the values of these labels, and I need to normalize the data and insert a column with static values describing which calculation was performed on that row.
Constraints:
I will need to normalize each statistic so I can union the datasets and pass the values back to SQL for further processing. All output needs to accept dynamic schemas, so no hardcoding labels etc.
Attempted solutions:
I've tried explicitly coercing output to dataframes. This just results in a series with label "0".
I've also tried adding static values to the columns. This just adds the target column name as one of the inaccessible labels, and the intended static value as part of the series.
I've searched many times for a solution, and couldn't find anything relevant to the problem.
Code and results below. Using the iris dataset as an example.
###########################
## AGG STATS TEST SCRIPT
##
###########################
#LOAD MODULES
import pandas as pds
#GET SAMPLE DATASET
iris = pds.read_csv('https://raw.githubusercontent.com/mwaskom/seaborn-data/master/iris.csv')
#CENTRAL TENDENCY
mode1 = iris.mode()
stat_mode = pds.melt(mode1)
stat_median = iris.median()
stat_median['STAT_NAME'] = 'STAT_MEDIAN' #Try to add a column with the value 'STAT_MEDIAN'
#AGGREGATE STATS
stat_describe = iris.describe()
#PRINT RESULTS
print(iris)
print(stat_median)
print(stat_describe)
###########################
## OUTPUT
##
###########################
>>> #PRINT RESULTS
... print(iris) #ORIGINAL DATASET
...
sepal_length sepal_width petal_length petal_width species
0 5.1 3.5 1.4 0.2 setosa
1 4.9 3.0 1.4 0.2 setosa
2 4.7 3.2 1.3 0.2 setosa
3 4.6 3.1 1.5 0.2 setosa
4 5.0 3.6 1.4 0.2 setosa
.. ... ... ... ... ...
145 6.7 3.0 5.2 2.3 virginica
146 6.3 2.5 5.0 1.9 virginica
147 6.5 3.0 5.2 2.0 virginica
148 6.2 3.4 5.4 2.3 virginica
149 5.9 3.0 5.1 1.8 virginica
[150 rows x 5 columns]
>>> print(stat_median) #YOU CAN SEE THAT IT INSERTED COLUMN INTO ROW LABELS, VALUE INTO RESULTS SERIES
sepal_length 5.8
sepal_width 3
petal_length 4.35
petal_width 1.3
STAT_NAME STAT_MEDIAN
dtype: object
>>> print(stat_describe) #BASIC DESCRIPTIVE STATS, NEED TO LABEL THE STATISTIC NAMES TO UNPIVOT THIS
sepal_length sepal_width petal_length petal_width
count 150.000000 150.000000 150.000000 150.000000
mean 5.843333 3.057333 3.758000 1.199333
std 0.828066 0.435866 1.765298 0.762238
min 4.300000 2.000000 1.000000 0.100000
25% 5.100000 2.800000 1.600000 0.300000
50% 5.800000 3.000000 4.350000 1.300000
75% 6.400000 3.300000 5.100000 1.800000
max 7.900000 4.400000 6.900000 2.500000
>>>
Any assistance is greatly appreciated. Thank you!

I figured it out. There's a function called reset_index that will convert the index to a column and create a new numerical index:
stat_median = pds.DataFrame(stat_median)
stat_median.reset_index(inplace=True)
stat_median = stat_median.rename(columns={'index' : 'fieldname', 0: 'value'})
stat_median['stat_name'] = 'median'
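The same reset_index idea, combined with melt, can be used to label and unpivot the describe() output (a sketch reusing stat_describe from above; the output column names are just illustrative):
stat_describe = stat_describe.reset_index().rename(columns={'index': 'stat_name'})
stat_describe_long = pds.melt(stat_describe, id_vars='stat_name',
                              var_name='fieldname', value_name='value')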

Apply Function to every group: TypeError: unhashable type: 'numpy.ndarray'

I am trying to do a curve fit for every group and get the results for c, a and b for each group.
I tried it this way:
import numpy as np
from scipy.optimize import curve_fit

x = df.T.iloc[1]
y = df.T.iloc[2]

def logifunc(x, c, a, b):
    return c / (1 + (a) * np.exp(-b*(x)))

df.groupby('Seriennummer').apply(curve_fit(logifunc, x, y, p0=[110,400,-2]))
But I get the Error:
TypeError: unhashable type: 'numpy.ndarray'
This is a part of my df with one million rows:
Seriennummer mrwSmpVWi mrwSmpP
1915 701091.0 1.8 4.0
1916 701085.0 2.0 2.0
1917 701089.0 1.7 0.0
1918 701087.0 1.8 3.0
1919 701090.0 1.8 0.0
1920 701088.0 2.4 0.0
1921 701086.0 2.7 5.0
1922 701092.0 1.1 0.0
1923 701085.0 2.0 2.0
1924 701089.0 2.0 10.0
1925 701091.0 0.8 0.0
1926 701087.0 2.3 10.0
1927 701090.0 1.6 1.0
1928 701092.0 2.2 6.0
1929 701086.0 1.5 0.0
1930 701088.0 2.1 3.0
A weird point in your code is that although you group by Seriennummer, you then, for each group, attempt to perform curve fitting on data from your full DataFrame.
To get a proper result, you should perform the curve fitting on the current group only. Something like:
import scipy.optimize as opt
result = df.groupby('Seriennummer').apply(lambda grp:
    opt.curve_fit(logifunc, grp.mrwSmpVWi, grp.mrwSmpP, p0=[110, 400, -2]))
My lambda function is something like the wrapper mentioned in the other answer, with the other parameters hard-coded in this function.
As your data sample includes only 2 rows for each group, I prepared
my own DataFrame:
Seriennummer mrwSmpVWi mrwSmpP
1915 701091.0 1.8 4.0
1916 701091.0 1.6 3.4
1917 701091.0 1.4 3.0
1918 701091.0 1.0 1.5
1919 701091.0 0.8 0.0
1920 701085.0 2.0 2.0
1921 701085.0 2.5 3.0
1922 701085.0 3.0 3.5
1923 701085.0 3.6 4.2
and ran the above code, with no error.
To print results in an easy to assess way, I ran:
for k, v in result.iteritems():
    print(f'Group {k}:\n{v[0]}\n{v[1]}')
getting:
Group 701085.0:
[ 4.66854588 24.45419288 1.47315989]
[[ 3.43664761e-01 -1.05587500e+01 -2.65359878e-01]
[-1.05587500e+01 4.60108288e+02 1.03214386e+01]
[-2.65359878e-01 1.03214386e+01 2.40785819e-01]]
Group 701091.0:
[ 3.89988734 617.72482118 5.54935645]
[[ 3.42006760e-01 -6.02519226e+02 -1.11651569e+00]
[-6.02519226e+02 2.43770095e+06 3.83083902e+03]
[-1.11651569e+00 3.83083902e+03 6.28930797e+00]]
First repeat the above procedure on my data, then on your own.
Edit following the comment as of 11:03Z
Read the documentation of scipy.optimize.curve_fit.
The description of the result (of each call) contains:
popt - Optimal values for the parameters (of the curve fitted),
pcov - The estimated covariance of popt.
If you want only popt for each group and don't care about pcov,
then the lambda function should return only the first element from its
(2-element) result:
result = df.groupby('Seriennummer').apply(lambda grp: opt.curve_fit(
logifunc, grp.mrwSmpVWi, grp.mrwSmpP, p0=[110, 400, -2])[0])
(note [0] added at the end).
A few notes:
Notice that what you are passing to the pandas GroupBy object is actually the result of invoking the curve_fit function, which returns ndarrays. The first argument of GroupBy.apply needs to be a callable that returns a pandas object (DataFrame, Series or scalar); that is the reason you are getting that error.
I am not sure exactly of what you are trying to do but I assume that it's making a curve fit for every group based on the function you have written.
If that is the case I suggest you to wrap that functionality in another function and pass it to the apply method.
def wrapper(group, *args):
    # work with the group's DataFrame here to achieve what you are looking for
    # you can also print whatever you need and export images
    # the important thing is that you return a DataFrame (or Series/scalar) back
    ...

# usage:
ohlala.groupby('Seriennummer').apply(wrapper, *your_args)
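For instance, a minimal sketch of such a wrapper, assuming the logifunc and column names from the question and keeping only the fitted parameters (the helper name fit_group is made up):
import pandas as pd
import scipy.optimize as opt

def fit_group(group):
    # fit the logistic curve to this group's rows only
    popt, _pcov = opt.curve_fit(logifunc, group.mrwSmpVWi, group.mrwSmpP,
                                p0=[110, 400, -2])
    return pd.Series(popt, index=['c', 'a', 'b'])

params = df.groupby('Seriennummer').apply(fit_group)   # one row of c, a, b per group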

Pandas Way of Weighted Average in a Large DataFrame

I have a large dataset (around 8 million rows x 25 columns) in Pandas and I am struggling to find a way to compute a weighted average over this dataframe, which in turn creates another dataframe.
Here is what my dataset looks like (a very simplified version of it):
prec temp
location_id hours
135 1 12.0 4.0
2 14.0 4.1
3 14.3 3.5
4 15.0 4.5
5 15.0 4.2
6 15.0 4.7
7 15.5 5.1
136 1 12.0 4.0
2 14.0 4.1
3 14.3 3.5
4 15.0 4.5
5 15.0 4.2
6 15.0 4.7
7 15.5 5.1
I have a multi-index on [location_id, hours]. I have around 60k locations and 140 hours for each location (making up the 8 million rows).
The rest of the data is numeric (float) or categorical. I have only included 2 columns here; normally there are around 20 columns.
What I want to do is create a new dataframe that is basically a weighted average of this dataframe. The requirements indicate that 12 of these location_ids should be averaged with specified weights to form the combined_location_id values.
For example, location_ids 1,3,5,7,9,11,13,15,17,19,21,23 with their appropriate weights (separate data coming in from another dataframe) should be weighted-averaged to form the combined_location_id CL_1's data.
That is a lot of data to handle and I wasn't able to find a completely Pandas way of solving it. Therefore, I went with a for loop approach. It is extremely slow and I am sure this is not the right way to do it:
def __weighted(self, ds, weights):
    return np.average(ds, weights=weights)

f = {'hours': 'first', 'location_id': 'first',
     'temp': lambda x: self.__weighted(x, weights), 'prec': lambda x: self.__weighted(x, weights)}

data_frames = []
for combined_location in all_combined_locations:
    mapped_location_ids = combined_location.location_ids
    weights = combined_location.weights_of_location_ids
    data_for_this_combined_location = pd.concat(df_data.loc[df_data.index.get_level_values(0) == location_id] for location_id in mapped_location_ids)
    data_grouped_by_distance = data_for_this_combined_location.groupby("hours", as_index=False)
    data_grouped_by_distance = data_grouped_by_distance.agg(f)
    data_frames.append(data_grouped_by_distance)

df_combined_location_data = pd.concat(data_frames)
df_combined_location_data.set_index(['location_id', 'hours'], inplace=True)
This works well functionally, however the performance and the memory consumption are horrible. It is taking over 2 hours on my dataset and that is currently not acceptable. The existence of the for loop is an indicator that this could be handled better.
Is there a better/faster way to implement this?
From what I saw, you can eliminate one loop (the generator over mapped_location_ids) by filtering with isin:
data_for_this_combined_location = df_data.loc[df_data.index.get_level_values(0).isin(mapped_location_ids)]
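Beyond that, the weighted average itself can be fully vectorised once the weights are in long form. A rough sketch, assuming a hypothetical mapping frame df_map with columns location_id, combined_location_id and weight (built from all_combined_locations):
import pandas as pd

# weighted average = sum(weight * value) / sum(weight), per (combined_location_id, hours)
merged = df_data.reset_index().merge(df_map, on='location_id')
value_cols = ['prec', 'temp']          # extend to all numeric columns as needed
weighted = merged[value_cols].multiply(merged['weight'], axis=0)
weighted[['combined_location_id', 'hours']] = merged[['combined_location_id', 'hours']]
weighted['weight'] = merged['weight']
sums = weighted.groupby(['combined_location_id', 'hours']).sum()
result = sums[value_cols].div(sums['weight'], axis=0)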

Row-wise calculations (Python)

Trying to run the following code to create a new column 'Median Rank':
N = data2.Rank.count()
for i in data2.Rank:
    data2['Median_Rank'] = i - 0.3/(N+0.4)
But I'm getting a constant value of 0.99802, even though my Rank column is as follows:
data2.Rank.head()
Out[464]:
4131 1.0
4173 3.0
4172 3.0
4132 3.0
5335 10.0
4171 10.0
4159 10.0
5079 10.0
4115 10.0
4179 10.0
4180 10.0
4147 10.0
4181 10.0
4175 10.0
4170 10.0
4116 24.0
4129 24.0
4156 24.0
4153 24.0
4160 24.0
5358 24.0
4152 24.0
Somebody please point out the errors in my code.
Your code isn't vectorised. Use this:
N = data2.Rank.count()
data2['Median_Rank'] = data2['Rank'] - 0.3 / (N+0.4)
The reason your code does not work is that you are assigning to the entire column on each loop iteration, so only the last i sticks and the values in data2['Median_Rank'] are guaranteed to be identical.
This occurs because every time you execute data2['Median_Rank']=i-0.3/(N+0.4) you update the entire column with the value calculated from the expression. The easiest way to do this actually doesn't need a loop:
N=data2.Rank.count()
data2['Median_Rank'] = data2.Rank-0.3/(N+0.4)
It is possible because pandas supports element-wise operations with series.
If you still want to use a for loop, you will need to use .at and iterate over the rows as follows:
for i, el in zip(data2.index, data2.Rank.values):
    data2.at[i, 'Median_Rank'] = el - 0.3/(N+0.4)
