Finding the average of three consecutive rows in pandas and groupby - python

I have a GPS data set (in CSV format) for hundreds of people and I have to study their mobility. I have managed to compute the distance between each pair of consecutive points and then the speed by simply dividing by the time increment between those two points. I did all these calculations with pandas, grouping by nickname (this is important because each person has a different trajectory and you cannot mix distances and speeds across people).
The next step is to average every three or four velocities to clean up some GPS errors. I have tried this and it works fine, but I cannot find a way to group it by nickname, since the speeds of the different users end up mixed together. Any ideas?

This can be done simply by using the index to bucket the rows:
df['bin'] = df.index // n
and then grouping by 'bin'. Wrapped in a cleaner function, here is the code:
import pandas as pd
df = pd.DataFrame({'A':[1,2,3,3,4,4,4],'B':[1,2,3,4,5,6,7]})
def n_average(df, n):
    df['bin'] = df.index // n
    grouped_df = df.groupby(['bin']).mean()
    return grouped_df

n_average(df, 3)
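To keep each nickname's rows separate, as the question asks, one hedged variation (assuming columns named 'nickname' and 'speed', which do not appear in the dummy frame above) is to number the rows within each user with cumcount before binning:
# bins restart at zero for every user, so speeds from different trajectories are never averaged together
df['bin'] = df.groupby('nickname').cumcount() // 3
smoothed = df.groupby(['nickname', 'bin'])['speed'].mean().reset_index()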

Related

pandas computing new column as an average of other two conditions

So I have this dataset of temperatures. Each line describes the temperature in Celsius measured by hour in a day.
I need to compute a new variable called avg_temp_ar_mensal, which represents the average temperature of a city in a month. In this dataset the city is represented as estacao and the month as mes.
I'm trying to do this using pandas. The following line of code is the one I'm trying to use to solve this problem:
df2['avg_temp_ar_mensal'] = df2['temp_ar'].groupby(df2['mes', 'estacao']).mean()
The goal of this code is to store in a new column the average temperature per city and month, but it doesn't work. If I try the following line of code:
df2['avg_temp_ar_mensal'] = df2['temp_ar'].groupby(df2['mes']).mean()
it works, but it is wrong: it averages over every city in the dataset, which adds noise to my data. I need to separate the temperatures by month and city and then calculate the mean.
The dataframe produced by groupby is smaller than the initial dataframe, which is why your assignment fails.
There are two ways to solve this. The first is to use transform, which returns a result aligned with the original index:
df['avg_temp_ar_mensal'] = df.groupby(['mes', 'estacao'])['temp_ar'].transform('mean')
The second is to create a new dataframe dfn from the groupby and then merge it back into df:
dfn = df.groupby(['mes', 'estacao'])['temp_ar'].mean().reset_index(name='average')
df = pd.merge(df, dfn, on=['mes', 'estacao'], how='left')
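For reference, here is a minimal self-contained sketch of both approaches on a toy frame (the values for mes, estacao and temp_ar are made up):
import pandas as pd

# toy data: two cities (estacao), a couple of months (mes), hourly temperatures (temp_ar)
df = pd.DataFrame({
    'estacao': ['A', 'A', 'A', 'B', 'B', 'B'],
    'mes':     [1,   1,   2,   1,   2,   2],
    'temp_ar': [20.0, 22.0, 25.0, 18.0, 19.0, 21.0],
})

# approach 1: transform returns one value per original row, so it can be assigned directly
df['avg_temp_ar_mensal'] = df.groupby(['mes', 'estacao'])['temp_ar'].transform('mean')

# approach 2: aggregate to one row per group, then merge the result back onto the rows
dfn = df.groupby(['mes', 'estacao'])['temp_ar'].mean().reset_index(name='average')
df = pd.merge(df, dfn, on=['mes', 'estacao'], how='left')
print(df)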
You are calling groupby on a single column when you do df2['temp_ar'].groupby(...). That doesn't make much sense, since within a single column there is nothing to group by.
Instead, perform the groupby on all the columns you need, and make sure the final output is a series aligned with the original rows (transform takes care of that):
df['new_column'] = df.groupby(['city_column', 'month_column'])['temp_column'].transform('mean')
This should do the trick if I understand your dataset correctly. If not, please provide a reproducible version of your df.

Calculate the percentage difference between two specific rows in Python pandas

I am trying to take a specific row I choose and calculate what percentage its value is away from the mean of the intended output (which is already calculated from another column), i.e. how far it deviates from that mean.
I want to handle each item individually, like so. Below I made a dataframe column to store the result:
df['pct difference'] = ((df['tertiary_tag']['price'] - df['ab roller']['mean']) / df['ab roller']['mean']) * 100
For example, if the mean is 10 and I know the item costs 8 dollars, I want to work out what percentage away from the mean that product is, and return that number for every item in the dataset.
Keep in mind, I don't want to solve this with a loop; I am sure pandas has something more practical than pct_change for calculating this kind of % difference.
I also thought about adding a column to act as an index, so I can use it to access any row and then apply whatever operation I want, for example calculating the percentage difference between two rows. Maybe by indexing on the price column?
df = df.set_index(['price'])
df.index = pd.to_datetime(df.index)
def percent_diff(df, row1, row2):
    """
    Calculating the percentage difference between two specific rows in a dataframe
    """
    return (df.loc[row1, 'value'] - df.loc[row2, 'value']) / df.loc[row2, 'value'] * 100
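As a hedged follow-up (the frame below is hypothetical, since the original data isn't shown, and 'price' stands in for the asker's value column), the same idea works both for two chosen rows and, vectorised, for every row against the mean:
import pandas as pd

# hypothetical frame with labelled rows
df = pd.DataFrame({'price': [8.0, 10.0, 12.0, 9.0]}, index=['a', 'b', 'c', 'd'])

# percentage difference between two specific rows
pct_two_rows = (df.loc['a', 'price'] - df.loc['b', 'price']) / df.loc['b', 'price'] * 100

# vectorised: how far each row deviates from the column mean, in percent
mean_price = df['price'].mean()
df['pct difference'] = (df['price'] - mean_price) / mean_price * 100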

Selecting slices of a Pandas Series based on both index and value conditions

I have a Pandas Series which contains acceleration timeseries data. My goal is to select slices of extreme force given some threshold. I was able to get part way with the following:
extremes = series.where(lambda force: abs(force - RESTING_FORCE) >= THRESHOLD, other=np.nan)
Now extremes contains all values which exceed the threshold and NaN for any that don't, maintaining the original index.
However, a secondary requirement is that nearby peaks should be merged into a single event. Visually, you can picture the three extremes on the left (two high, one low) being joined into one complete segment and the two peaks on the right being joined into another complete segment.
I've read through the entire Series reference but I'm having trouble finding methods to operate on my partial dataset. For example, if I had a method that returned an array of non-NaN index ranges, I would be able to sequentially compare each range and decide whether or not to fill in the space between with values from the original series (nearby) or leave them NaN (too far apart).
Perhaps I need to abandon the intermediate step and approach this from an entirely different angle? I'm new to Python so I'm having trouble getting very far with this. Any tips would be appreciated.
It actually wasn't that simple to come up with a vectorized solution without looping.
You'll probably need to step through the code to see the actual outcome of each method, but here is a short sketch of the idea:
Solution outline
Identify all peaks via a simple threshold filter
Get the timestamps of the peak values into a column and forward fill the gaps in between, so each valid timestamp can be compared with the previous one
Do the actual comparison via diff() to get time deltas and apply the time-delta threshold
Convert the booleans to integers and take a cumulative sum to create signal groups
Group by these signals and take the min and max timestamp of each
Example data
Here is the code with a dummy example:
%matplotlib inline
import pandas as pd
import numpy as np
size = 200
# create some dummy data
ts = pd.date_range(start="2017-10-28", freq="d", periods=size)
values = np.cumsum(np.random.normal(size=size)) + np.sin(np.linspace(0, 100, size))
series = pd.Series(values, index=ts, name="force")
series.plot(figsize=(10, 5))
Solution code
# define thresholds
threshold_value = 6
threshold_time = pd.Timedelta(days=10)
# create data frame because we'll need helper columns
df = series.reset_index()
# get all initial peaks below or above threshold
mask = df["force"].abs().gt(threshold_value)
# create a column to hold only the timestamps of the initial peaks
df.loc[mask, "ts_gap"] = df.loc[mask, "index"]
# create forward fill to enable comparison between current and next peak
df["ts_fill"] = df["ts_gap"].ffill()
# apply time delta comparison to filter only those within given time interval
df["within"] = df["ts_fill"].diff() < threshold_time
# convert boolean values into integers and
# take the cumulative sum, which creates groups of consecutive timestamps
df["signals"] = (~df["within"]).astype(int).cumsum()
# create dataframe containing start and end values
df_signal = df.dropna(subset=["ts_gap"])\
              .groupby("signals")["ts_gap"]\
              .agg(["min", "max"])
# show results
df_signal
>>> min max
signals
10 2017-11-06 2017-11-27
11 2017-12-13 2018-01-22
12 2018-02-03 2018-02-23
Finally, show the plot:
series.plot(figsize=(10, 5))
for _, (idx_min, idx_max) in df_signal.iterrows():
    series[idx_min:idx_max].plot()
Result
As you can see in the plot, peaks greater than an absolute value of 6 are merged into a single signal if their last and first timestamps lie within a range of 10 days. The thresholds here are arbitrary, just for illustration; you can change them to whatever suits your data.
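If you want the actual force values of each merged event rather than just its start and end timestamps, you can reuse the same label-based slicing as in the plot loop, e.g.:
# one slice of the original series per detected signal group
events = {sig: series[row["min"]:row["max"]] for sig, row in df_signal.iterrows()}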

How to find the alignment of two data sets in pandas

Presented as an example.
Two data sets. One collected over a 1 hour period. One collected over a 20 min period within that hour.
Each data set contains instances of events that can be transformed into single columns of true (-) or false (_), representing whether the event is occurring or not.
DS1.event:
_-__-_--___----_-__--_-__---__
DS2.event:
__--_-__--
I'm looking for a way to automate the correlation (correct me if the terminology is incorrect) of the two data sets and find the offset(s) into DS1 at which DS2 is most (top x many) likely to have occurred. This will probably end up with some matching percentage that I can then threshold to determine the validity of the match.
Such that
_-__-_--___----_-__--_-__---__
__--_-__--
DS1.start + 34min ~= DS2.start
Additional information:
DS1 was recorded at roughly 1 Hz. DS2 at roughly 30 Hz. This makes it less likely that there will be a 100% clean match.
Alternate methods (to pandas) will be appreciated, but python/pandas are what I have at my disposal.
Sounds like you just want something like a cross correlation?
I would first convert the strings to a numeric representation, replacing - and _ with 1 and 0.
You can do that with a string's replace method (e.g. signal.replace("-", "1").replace("_", "0"))
Convert them to a list or a numpy array:
event1 = [int(x) for x in signal1]
event2 = [int(x) for x in signal2]
Then calculate the cross correlation between them:
xcor = np.correlate(event1, event2, "full")
That will give you the cross correlation value at each time lag. You just want to find the largest value, and the time lag at which it happens:
nR = max(xcor)
maxLag = np.argmax(xcor) # I imported numpy as np here
Giving you something like:
Cross correlation value: 5
Lag: 20
It sounds like you're more interested in the lag value here. What the lag tells you is essentially how many time/positional shifts are required to get the maximum cross correlation value (degree of match) between your 2 signals
You might want to take a look at the docs for np.correlate and np.convolve to decide which mode (full, same, or valid) you want to use, as that's determined by the length of your data and what you want to happen when your signals are different lengths.
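Putting those pieces together, here is a minimal end-to-end sketch (the signal strings are the ones from the question; converting the sample offset into minutes would additionally require DS1's sampling rate):
import numpy as np

ds1 = "_-__-_--___----_-__--_-__---__"
ds2 = "__--_-__--"

# turn the -/_ strings into 1/0 integer sequences
event1 = [int(c) for c in ds1.replace("-", "1").replace("_", "0")]
event2 = [int(c) for c in ds2.replace("-", "1").replace("_", "0")]

# cross-correlate; "full" returns a value for every possible overlap
xcor = np.correlate(event1, event2, "full")

best = np.argmax(xcor)             # index of the strongest match
offset = best - (len(event2) - 1)  # shift of DS2 relative to the start of DS1, in samples
print("match strength:", xcor[best], "offset into DS1:", offset, "samples")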

how to vectorise Pandas calculation that is based on last x rows of data

I have fairly sophisticated prediction code with over 20 columns and millions of data points per column, using WLS. Right now I use iterrows to loop through dates and, based on those dates and the values on them, extract slices of different sizes for the calculation. It takes hours to run in production. I have simplified the code into the following:
import pandas as pd
import numpy as np
from datetime import timedelta
df=pd.DataFrame(np.random.randn(1000,2), columns=list('AB'))
df['dte'] = pd.date_range('9/1/2014', periods=1000, freq='D')
def calculateC(A, dte):
    if A > 0:  # based on the value, use a different cutoff length for the trend prediction
        depth = 10
    else:
        depth = 20
    lastyear = dte - timedelta(days=365)
    df2 = df[df.dte < lastyear].head(depth)  # use data from before the same date last year as the basis of the prediction
    return df2.B.mean()  # uses WLS in my model, but replaced with mean for simplification

for index, row in df.iterrows():
    if index > 365:
        df.loc[index, 'C'] = calculateC(row.A, row.dte)
I have read that iterrows is the main culprit, because it is not an efficient way to use pandas, and that I should use vectorized methods instead. However, I can't seem to find a way to vectorize based on conditions (dates, different lengths, and ranges of values). Is there a way?
I have good news and bad news. The good news is I have something vectorized that is about 300x faster but the bad news is that I can't quite replicate your results. But I think that you ought to be able to use the principles here to greatly speed up your code, even if this code does not actually replicate your results at the moment.
df['result'] = np.where(df['A'] > 0,
                        df.shift(365).rolling(10).B.mean(),
                        df.shift(365).rolling(20).B.mean())
The tough (slow) part of your code is this:
df2=df[df.dte<lastyear].head(depth)
However, as long as your dates are all 365 days apart, you can use code like this, which is vectorized and much faster:
df.shift(365).rolling(10).B.mean()
shift(365) replaces df.dte < lastyear and the rolling().mean() replaces head().mean(). It will be much faster and use less memory.
And actually, even if your dates aren't completely regular, you can probably resample and get this approach to work. Or, somewhat equivalently, if you make the date your index, the shift can operate on a frequency rather than on row counts (e.g. shift by 365 days, even if that is not 365 rows). It would probably be a good idea to make 'dte' your index here regardless.
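A rough sketch of that frequency-based variant (my own assumption of how it might look, not tested against the asker's data; it presumes 'dte' is unique and daily and that the 10-row window is the one wanted):
# set the date as the index so the shift can use a time offset instead of a row count
dfi = df.set_index('dte')

# trailing 10-row mean of B, then move the whole series forward by one year,
# so each date sees the mean computed from data ending a year earlier
rolled = dfi['B'].rolling(10).mean()
shifted = rolled.shift(freq=pd.Timedelta(days=365))

# align back onto the original daily dates
dfi['C'] = shifted.reindex(dfi.index)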
I would try pandas.DataFrame.apply(func, axis=1)
def calculateC2(row):
    if row.name > 365:  # row.name is the index of the row
        if row.A > 0:  # based on the value, use a different cutoff length for the trend prediction
            depth = 10
        else:
            depth = 20
        lastyear = row.dte - timedelta(days=365)
        df2 = df[df.dte < lastyear].B.head(depth)  # use data from before the same date last year as the basis of the prediction
        print(row.name, np.mean(df2))  # uses WLS in my model, but replaced with mean for simplification

df.apply(calculateC2, axis=1)
