How to downsample time series data in pandas?

I have a time series in pandas that looks like this (ordered by id):
id time value
1 0 2
1 1 4
1 2 5
1 3 10
1 4 15
1 5 16
1 6 18
1 7 20
2 15 3
2 16 5
2 17 8
2 18 10
4 6 5
4 7 6
I want to downsample the time from 1 minute to 3 minutes for each id group, taking the value as the maximum within each (id, 3-minute) window.
The output should look like this:
id time value
1 0 5
1 1 16
1 2 20
2 0 8
2 1 10
4 0 6
I tried a loop, but it takes a long time to process.
Any idea how to solve this for a large dataframe?
Thanks!

You can convert your time column to an actual timedelta, then use resample for a vectorized solution. Since you want the maximum per window, aggregate with max:
t = pd.to_timedelta(df.time, unit='T')
s = df.set_index(t).groupby('id').resample('3T').max().reset_index(drop=True)
s.assign(time=s.groupby('id').cumcount())
id time value
0 1 0 5
1 1 1 16
2 1 2 20
3 2 0 8
4 2 1 10
5 4 0 6
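For reference, a self-contained version of the same approach (a sketch built from the sample above; it resamples just the value column, and uses the 'min' alias that newer pandas prefers over 'T'):
import pandas as pd

df = pd.DataFrame({
    'id':    [1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 4, 4],
    'time':  [0, 1, 2, 3, 4, 5, 6, 7, 15, 16, 17, 18, 6, 7],
    'value': [2, 4, 5, 10, 15, 16, 18, 20, 3, 5, 8, 10, 5, 6],
})

t = pd.to_timedelta(df['time'], unit='m').rename('bin')   # minutes -> timedelta
out = df.set_index(t).groupby('id')['value'].resample('3min').max().reset_index()
out['time'] = out.groupby('id').cumcount()                # renumber windows 0, 1, 2, ...
print(out[['id', 'time', 'value']])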

Alternatively, use np.r_ and .iloc with groupby. This takes the last row of each 3-row chunk plus the final row of any partial chunk; np.unique guards against duplicating the final row when a group's length is an exact multiple of 3. (With the sample's increasing values, the last row of a chunk is also its maximum.)
import numpy as np
df.groupby('id')['value'].apply(lambda x: x.iloc[np.unique(np.r_[2:len(x):3, len(x) - 1])])
Output:
id
1 2 5
5 16
7 20
2 10 8
11 10
4 13 6
Name: value, dtype: int64
Going a little further with column naming, etc.:
df_out = df.groupby('id')['value']\
    .apply(lambda x: x.iloc[np.unique(np.r_[2:len(x):3, len(x) - 1])]).reset_index()
df_out.assign(time=df_out.groupby('id').cumcount()).drop('level_1', axis=1)
Output:
id value time
0 1 5 0
1 1 16 1
2 1 20 2
3 2 8 0
4 2 10 1
5 4 6 0
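A position-based alternative that takes the maximum of each 3-row chunk rather than its last row, in case values are not monotonically increasing within an id (a sketch, assuming rows are sorted by time within each id as in the sample):
chunk = (df.groupby('id').cumcount() // 3).rename('time')   # 0,0,0,1,1,1,2,... per id
df.groupby(['id', chunk])['value'].max().reset_index()
Here the chunk counter doubles as the new time column, so no renumbering step is needed.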

Related

Calculate difference between current row and latest row satisfying a condition

I have a pandas dataframe that looks like this:
time value group
0 1 12 1
1 2 14 1
2 3 15 2
3 4 15 1
4 5 18 2
5 6 20 1
6 7 19 2
7 8 24 2
I now want to calculate the spread between group 1 and group 2 for the latest values.
I.e., for each row I want to look at the latest value for group 1 and the latest value for group 2 and calculate (value of group 1) - (value of group 2).
In the example, the output should look like this:
time value group diff
0 1 12 1 0
1 2 14 1 0
2 3 15 2 -1
3 4 15 1 0
4 5 18 2 -3
5 6 20 1 2
6 7 19 2 1
7 8 24 2 -4
The only function I could find so far was pd.diff(), but it doesn't satisfy my needs, so I would really appreciate some help here. Thanks!
You can forward fill values for group 1 and 2 respectively first and then calculate the difference:
df['diff'] = df.value.where(df.group == 1).ffill() - df.value.where(df.group == 2).ffill()
df
time value group diff
0 1 12 1 NaN
1 2 14 1 NaN
2 3 15 2 -1.0
3 4 15 1 0.0
4 5 18 2 -3.0
5 6 20 1 2.0
6 7 19 2 1.0
7 8 24 2 -4.0
Use fillna, i.e. df['diff'] = df['diff'].fillna(0), if you need the leading NaNs to be 0 as in your expected output.
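Putting both steps together on the sample frame (a minimal sketch):
g1 = df['value'].where(df['group'] == 1).ffill()   # latest group-1 value so far
g2 = df['value'].where(df['group'] == 2).ffill()   # latest group-2 value so far
df['diff'] = (g1 - g2).fillna(0)                   # 0 until both groups have appeared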

Pandas replace values (grouping by and iteration)

Good morning,
I have a problem when trying to replace some values. My dataframe has a column "loc10p" that splits the records into 10 groups, and within each group the records are divided into subgroups. However, the subgroup numbering restarts at 1 for every group instead of continuing from the previous group's last subgroup number. For example:
c2[c2.loc10p.isin([1,2])].sort_values(['loc10p','subgrupoloc10'])[['loc10p','subgrupoloc10']]
loc10p subgrupoloc10
1 1 1
7 1 1
15 1 1
0 1 2
14 1 2
30 1 2
31 1 2
2 2 1
8 2 1
9 2 1
16 2 1
17 2 1
18 2 2
23 2 2
How can I transform that into something like the following:
loc10p subgrupoloc10
1 1 1
7 1 1
15 1 1
0 1 2
14 1 2
30 1 2
31 1 2
2 2 3
8 2 3
9 2 3
16 2 3
17 2 3
18 2 4
23 2 4
I tried a loop that separates each group into its own dataframe and then replaces the subgroup values with a counter carried over from the previous group, but it didn't replace anything:
w = 1
temporal = []
for e in range(1, 11):
    temp = c2[c2['loc10p'] == e]
    temporal.append(temp)
for e, i in zip(temporal, range(1, 9)):
    try:
        e.loc[:, 'subgrupoloc10'] = w
        w += 1
    except:
        pass
Any help will be really appreciated!!
Try with ngroup:
df['out'] = df.groupby(['loc10p','subgrupoloc10']).ngroup()+1
Out[204]:
1 1
7 1
15 1
0 2
14 2
30 2
31 2
2 3
8 3
9 3
16 3
17 3
18 4
23 4
dtype: int64
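Note that ngroup numbers groups in sorted key order by default; if your rows were not already ordered by (loc10p, subgrupoloc10) and you wanted first-appearance order instead, pass sort=False (a sketch):
df['out'] = df.groupby(['loc10p', 'subgrupoloc10'], sort=False).ngroup() + 1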
Try:
groups = (df["subgrupoloc10"] != df["subgrupoloc10"].shift()).cumsum()
df["subgrupoloc10"] = groups
print(df)
Prints:
loc10p subgrupoloc10
1 1 1
7 1 1
15 1 1
0 1 2
14 1 2
30 1 2
31 1 2
2 2 3
8 2 3
9 2 3
16 2 3
17 2 3
18 2 4
23 2 4
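This works because consecutive runs of subgrupoloc10 never repeat the same number across a group boundary in this data. A variant that is robust to that case compares both columns before taking the cumulative sum (a sketch):
key = df[['loc10p', 'subgrupoloc10']]
df['subgrupoloc10'] = key.ne(key.shift()).any(axis=1).cumsum()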

How do you parse out data from a dataframe for each ID when an adjacent column contains a certain value?

I have a large dataframe in the following format. Within each ID, I need to keep only the rows from the first occurrence of values == 1 through the end of that ID. This should reset on each ID: the slice starts at the first row of each unique id where values is 1 and ends when that id's rows terminate.
import pandas as pd

d = {'ID': [1,1,1,1,1,2,2,2,2,2,3,3,3,3,4,4,4,4,4,4,4,4,4,5,5,5,5,5],
     'values': [0,0,0,1,0,1,0,1,1,1,0,1,0,0,0,0,0,0,1,1,0,1,0,1,1,1,1,1]}
df = pd.DataFrame(data=d)
df
The desired result is:
ND = {'ID': [1,1,2,2,2,2,2,3,3,3,4,4,4,4,4,5,5,5,5,5],
      'values': [1,0,1,0,1,1,1,1,0,0,1,1,0,1,0,1,1,1,1,1]}
df_final = pd.DataFrame(ND)
df_final
IIUC,
df[df.groupby('ID')['values'].transform('cummax')==1]
Output:
ID values
3 1 1
4 1 0
5 2 1
6 2 0
7 2 1
8 2 1
9 2 1
11 3 1
12 3 0
13 3 0
18 4 1
19 4 1
20 4 0
21 4 1
22 4 0
23 5 1
24 5 1
25 5 1
26 5 1
27 5 1
Details: cummax holds the value at 1 once the first 1 appears within each ID. Comparing the result to 1 creates a boolean series, which is then used for boolean indexing.
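To see the intermediate step on the sample frame (ID 1 occupies rows 0 through 4), the transform yields:
df.groupby('ID')['values'].transform('cummax').head()
0    0
1    0
2    0
3    1
4    1
Name: values, dtype: int64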
If your values column contains only 0 and 1, you can use groupby.cummax, which replaces each 0 that comes after a 1 (per ID) with 1, and then use the result as a boolean mask:
df_ = df[df.groupby('ID')['values'].cummax().astype(bool).to_numpy()]
print(df_)
ID values
3 1 1
4 1 0
5 2 1
6 2 0
7 2 1
8 2 1
9 2 1
11 3 1
12 3 0
13 3 0
18 4 1
19 4 1
20 4 0
21 4 1
22 4 0
23 5 1
24 5 1
25 5 1
26 5 1
27 5 1

Separate DataFrame into N (almost) equal segments

Say I have a data frame that looks like this:
Id ColA
1 2
2 2
3 3
4 5
5 10
6 12
7 18
8 20
9 25
10 26
I would like my code to create a new column at the end of the DataFrame that splits the observations into 5 segments, labeled from 5 down to 1:
Id ColA Segment
1 2 5
2 2 5
3 3 4
4 5 4
5 10 3
6 12 3
7 18 2
8 20 2
9 25 1
10 26 1
I tried the following code, but it doesn't work:
df['segment'] = pd.qcut(df['Id'], 5)
I also want to know what would happen if the total number of observations is not divisible by 5.
Actually, you were closer to the answer than you think. This will work regardless of whether len(df) is a multiple of 5 or not.
bins = 5
df['Segment'] = bins - pd.qcut(df['Id'], bins).cat.codes
df
Id ColA Segment
0 1 2 5
1 2 2 5
2 3 3 4
3 4 5 4
4 5 10 3
5 6 12 3
6 7 18 2
7 8 20 2
8 9 25 1
9 10 26 1
Where,
pd.qcut(df['Id'], bins).cat.codes
0    0
1    0
2    1
3    1
4    2
5    2
6    3
7    3
8    4
9    4
dtype: int8
represents the categorical intervals returned by pd.qcut as integer codes (shown here for the 10-row frame).
Another example, for a DataFrame with 7 rows.
df = df.head(7).copy()
df['Segment'] = bins - pd.qcut(df['Id'], bins).cat.codes
df
Id ColA Segment
0 1 2 5
1 2 2 5
2 3 3 4
3 4 5 3
4 5 10 2
5 6 12 1
6 7 18 1
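Equivalently, you can hand qcut explicit labels so no arithmetic on the codes is needed (a sketch of the same idea; the result is a Categorical, so cast it if you need plain ints):
df['Segment'] = pd.qcut(df['Id'], 5, labels=[5, 4, 3, 2, 1]).astype(int)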
This should work:
df['segment'] = np.linspace(1, 6, len(df), endpoint=False, dtype=int)
It creates an array of ints between 1 and 5 of the same length as your frame (endpoint=False excludes 6, and dtype=int truncates the float values). If you want it to run from 5 down to 1 instead, just add [::-1] at the end of the line.
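For the 10-row sample this evaluates to (a quick check, assuming numpy is imported as np):
np.linspace(1, 6, 10, endpoint=False, dtype=int)
array([1, 1, 2, 2, 3, 3, 4, 4, 5, 5])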

Pandas series Max value for column based on index

I am trying to extract the max value for a column based on the index. I have this series:
Hour Values
1 0
1 3
1 1
2 0
2 5
2 4
...
23 3
23 4
23 2
24 1
24 9
24 2
and am looking to add a new column 'Max Value' that will hold the maximum of the 'Values' column for each row, based on the index (Hour):
Hour Values Max Value
1 0 3
1 3 3
1 1 3
2 0 5
2 5 5
2 4 5
...
23 3 4
23 4 4
23 2 4
24 1 9
24 9 9
24 2 9
I can do this in Excel, but I am new to pandas. The closest I have come is this scratchy effort, which is as far as I have got, but I get a syntax error on the first '=':
df['Max Value'] = 0
df['Max Value'][(df['Hour'] =1)] = df['Value'].max()
Use the transform('max') method:
In [61]: df['Max Value'] = df.groupby('Hour')['Values'].transform('max')
In [62]: df
Out[62]:
Hour Values Max Value
0 1 0 3
1 1 3 3
2 1 1 3
3 2 0 5
4 2 5 5
5 2 4 5
6 23 3 4
7 23 4 4
8 23 2 4
9 24 1 9
10 24 9 9
11 24 2 9
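If Hour is the actual index of the frame rather than a column (as "based on the index" suggests), the same idea works by grouping on the index level (a sketch):
df['Max Value'] = df.groupby(level=0)['Values'].transform('max')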
