I have asked a question on another thread (Link), but I got an incomplete answer and no one is willing to reply, which is why I am asking this modified version. Let me explain the question briefly: I want to resample the following data:
Timestamp L_x L_y L_a R_x R_y R_a
2403950 621.3 461.3 313 623.3 461.8 260
2403954 622.5 461.3 312 623.3 462.6 260
2403958 623.1 461.5 311 623.4 464 261
2403962 623.6 461.7 310 623.7 465.4 261
2403966 623.8 461.5 309 623.9 466.1 261
2403970 620.9 461.4 309 623.8 465.9 259
2403974 621.7 461.1 308 623 464.8 258
2403978 622.1 461.1 308 621.9 463.9 256
2403982 622.5 461.5 308 621 463.4 255
2403986 622.4 462.1 307 620.7 463.3 254
The table goes on like that. All the timestamps are in milliseconds, and I want to resample it into 100 ms bins:
df = df.resample('100L').mean()
The resulting table is:
Timestamp L_x L_y L_a R_x R_y R_a
2403900 621.3 461.3 313 623.3 461.8 260
2404000 622.5 461.3 312 623.3 462.6 260
2404100 623.1 461.5 311 623.4 464 261
2404200 623.6 461.7 310 623.7 465.4 261
2404300 623.8 461.5 309 623.9 466.1 261
But that is not the result I want, because the first timestamp index in the original table is 2403950. The first bin should therefore cover 2403950 to 2404050, but instead it covers 2403900 to 2404000. I want the following:
Timestamp L_x L_y L_a R_x R_y R_a
2403950 ... ... ... ... ... ...
2404050 ... ... ... ... ... ...
2404150 ... ... ... ... ... ...
2404250 ... ... ... ... ... ...
2404350 ... ... ... ... ... ...
The rest of the columns are the means of the values in the original table.
To achieve that, someone suggested that I calculate the offset, which in my case is 50 milliseconds, and do the following:
df.resample('100L', loffset='50L')
The offset only moves the labels 50 milliseconds forward; it does not change the mean values. For the first bin, for instance, it still calculates the mean of the values from 2403900 to 2404000 instead of 2403950 to 2404050.
Thanks for your help
You're looking for the base kwarg.
base : int, default 0
For frequencies that evenly subdivide 1 day, the “origin” of the aggregated intervals. For example, for ‘5min’ frequency, base could range from 0 through 4. Defaults to 0
In your case it looks like you want:
df.resample('100L', base=50)
Note: resample without a DatetimeIndex/PeriodIndex/TimedeltaIndex raises an error in recent pandas, so you should convert to a DatetimeIndex before doing this.
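Putting it together, here is a minimal sketch on the question's data (columns abbreviated); note that base was deprecated in pandas 1.1, where offset='50L' is the equivalent spelling:
import pandas as pd

# Toy frame with the question's millisecond index (columns abbreviated).
df = pd.DataFrame({'L_x': [621.3, 622.5, 623.1, 623.6]},
                  index=[2403950, 2403954, 2403958, 2403962])

df.index = pd.to_datetime(df.index * 10**6)  # milliseconds -> nanoseconds

# Older pandas: df.resample('100L', base=50).mean()
out = df.resample('100L', offset='50L').mean()  # bins start at ...950
print(out)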
I created a dataframe with the help of a pivot, and I have:
name x y z All
A 155 202 218 575
C 206 149 45 400
B 368 215 275 858
Total 729 566 538 1833
I would like to sort by column "All", not taking into account the row "Total". I am using:
df.sort_values(by = ["All"], ascending = False)
Thank you in advance!
If the Total row is the last one, you can sort the other rows and then concat the last row back on:
df = pd.concat([df.iloc[:-1, :].sort_values(by="All"), df.iloc[-1:, :]])
print(df)
Prints:
name x y z All
C 206 149 45 400
A 155 202 218 575
B 368 215 275 858
Total 729 566 538 1833
You can try the following, although it raises a FutureWarning you should be careful of (DataFrame.append was removed entirely in pandas 2.0):
df = df.iloc[:-1,:].sort_values('All',ascending=False).append(df.iloc[-1,:])
This outputs:
name x y z All
2 B 368 215 275 858
0 A 155 202 218 575
1 C 206 149 45 400
3 Total 729 566 538 1833
You can get the sorted order without Total (assuming here the last row), then index by position:
import numpy as np
idx = np.argsort(df['All'].iloc[:-1])
df2 = df.iloc[np.r_[idx[::-1], len(df)-1]]
NB: since we are only sorting an indexer here, this should be very fast.
output:
name x y z All
2 B 368 215 275 858
0 A 155 202 218 575
1 C 206 149 45 400
3 Total 729 566 538 1833
You can just ignore the last row (note that this drops the Total row from the output entirely):
df.iloc[:-1].sort_values(by = ["All"], ascending = False)
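For completeness, here is a sketch that matches the question's descending order while keeping Total pinned at the bottom, using only pd.concat (since DataFrame.append is gone in pandas 2.0):
import pandas as pd

df = pd.DataFrame({'name': ['A', 'C', 'B', 'Total'],
                   'x': [155, 206, 368, 729],
                   'y': [202, 149, 215, 566],
                   'z': [218, 45, 275, 538],
                   'All': [575, 400, 858, 1833]})

# Sort everything except the last (Total) row, then re-attach Total.
df = pd.concat([df.iloc[:-1].sort_values('All', ascending=False),
                df.iloc[-1:]])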
I have a dataset that contains NaN values, and I am attempting to fill in those values using a rolling average. My code for doing so is as follows:
df = pd.DataFrame({'vals': med_vals})
print(df[353:363])
vals
353 17682.196292
354 13796.403594
355 14880.418179
356 14139.141779
357 15397.070537
358 15108.345602
359 14286.259755
360 14962.745719
361 NaN
362 NaN
df_filled = df.fillna(df.rolling(7,min_periods = 1).mean())
print(df_filled[353:370])
vals
353 17682.196292
354 13796.403594
355 14880.418179
356 14139.141779
357 15397.070537
358 15108.345602
359 14286.259755
360 14962.745719
361 14795.663595
362 14778.712678
363 14938.605403
364 14785.783692
365 14624.502737
366 14962.745719
367 NaN
368 NaN
369 NaN
How can I make it so my code takes into account previously filled in values when calculating the rolling average?
Edit: I found a method that works, but I'm not too happy with it:
while pd.isnull(df).any().any():
    df.fillna(df.rolling(window=8, min_periods=7).mean(), inplace=True)
You are getting exactly what you asked for. When you do a rolling average, pandas uses the current cell as the right edge of the window. So, when setting cell 361:
355 356 357 358 359 360 361 362 363 364 365 366
^-------------------------^
Since 361 is a NaN, you get the average of the other six. Continuing:
355 356 357 358 359 360 361 362 363 364 365 366
    ^-------------------------^
        ^-------------------------^
            ^-------------------------^
                ^-------------------------^
                    ^-------------------------^
So, when it is computing a value for 366, it averages 360 through 366. The only cell in that range with an original value is 360, so that becomes the average. With min_periods=1 you told it that a single value in the window is enough to be valid.
You're saying there is an issue, but it is not at all clear to me what you were expecting.
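If the intent is for each filled value to feed into the later windows, one option (a sketch, not the only way) is a single forward pass over the underlying array, so every NaN becomes the mean of the seven values immediately before it, including values filled earlier in the same pass:
import numpy as np
import pandas as pd

# Toy series mirroring the question's shape: valid values, then trailing NaNs.
s = pd.Series([17682.2, 13796.4, 14880.4, 14139.1, 15397.1,
               15108.3, 14286.3, 14962.7, np.nan, np.nan, np.nan])

arr = s.to_numpy().copy()
for i in np.flatnonzero(np.isnan(arr)):
    # Mean of up to 7 preceding values; earlier fills are already in arr.
    arr[i] = np.nanmean(arr[max(0, i - 7):i])
s_filled = pd.Series(arr, index=s.index)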
I am trying to create a seaborn FacetGrid to plot the distribution of every column in my DataFrame decathlon. The data looks as such:
P100m Plj Psp Phj P400m P110h Ppv Pdt Pjt P1500
0 938 1061 773 859 896 911 880 732 757 752
1 839 975 870 749 887 878 880 823 863 741
2 814 866 841 887 921 939 819 778 884 691
3 872 898 789 878 848 879 790 790 861 804
4 892 913 742 803 816 869 1004 789 854 699
... ... ... ... ... ... ... ... ... ...
7963 755 760 604 714 812 794 482 571 539 780
7964 830 845 524 767 786 783 601 573 562 535
7965 819 804 653 840 791 699 659 461 448 632
7966 804 720 539 758 830 782 731 487 425 729
7967 687 809 692 714 565 741 804 527 738 523
I am relatively new to Python and I can't understand this error. My attempt to format the data and create the grid is as follows:
import matplotlib.pyplot as plt
import seaborn as sns
df_stacked = decathlon.stack().reset_index(1).rename({'level_1': 'column', 0: 'values'}, axis=1)
g = sns.FacetGrid(df_stacked, row = 'column')
g = g.map(plt.hist, "values")
However, I receive the following error:
ValueError: Axes instance argument was not found in a figure
Can anyone explain what exactly this error means and how I would go about fixing it?
EDIT
df_stacked looks as such:
column values
0 P100m 938
0 Plj 1061
0 Psp 773
0 Phj 859
0 P400m 896
... ...
7967 P110h 741
7967 Ppv 804
7967 Pdt 527
7967 Pjt 738
7967 P1500 523
I encountered a similar issue when running a Jupyter Notebook.
My solution involved:
Restart the notebook
Re-run the imports %matplotlib inline; import matplotlib.pyplot as plt
As you did not post a full, working example, this involves a bit of guessing.
What might be going wrong is the line g = g.map(plt.hist, "values"), because the error comes from deep within matplotlib. You can see something similar in this SO question, where a call from outside matplotlib, pylab.sca(axes[i]), triggers the same error because the axes instance is not attached to the current figure.
Likely you installed or updated something in your (conda?) environment (changes in environment paths?), and it was only picked up after the next restart.
I also wonder how you came up with plt.hist; fully typed it should resemble matplotlib.pyplot.hist. But that is guessing (waiting for your updated example code).
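For what it's worth, a self-contained version of the question's code (with made-up numbers standing in for decathlon) runs cleanly in a fresh session, which supports the stale-environment theory:
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# Made-up stand-in for the decathlon DataFrame.
decathlon = pd.DataFrame({'P100m': [938, 839, 814, 872, 892],
                          'Plj': [1061, 975, 866, 898, 913]})

# Long format: one row per (original row, column) pair.
df_stacked = (decathlon.stack()
                       .reset_index(level=1)
                       .rename({'level_1': 'column', 0: 'values'}, axis=1))

g = sns.FacetGrid(df_stacked, row='column')
g.map(plt.hist, 'values')
plt.show()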
I have a txt file which contains lots of information. I do not want its head and tail, only the numbers in the middle, which form a 1x11200 vector.
[txtpda]
LT=5.6
DATE=21.06.2018
TIME=14:11
CNT=11200
RES=0.00854518
N=5
VB=350
VT=0.5
LS=0
MEASTIME=201806211412
PICKUP=BFW-2
LC=0.8
[PROFILE]
255
256
258
264
269
273
267
258
251
255
259
262
260
256
255
260
264
266
265
263
261
263
267
275
280
280
280
280
283
284
283
277
279
280
283
285
283
282
280
280
286
288
298
299
299
299
304
303
300
297
295
296
299
301
303
301
299
296
298
299
302
303
304
307
308
312
313
314
312
311
311
310
312
310
309
305
303
299
297
294
288
280
270
266
250
242
222
213
199
180
173
...
-1062
-1063
[VALUES]
Ra;2;3;2;0.769;0;0;-1;0;-1;0
Rz;2;2;2;5.137;0;0;-1;0;-1;0
Pt;0;0;0;26.25;0;0;-1;0;-1;0
Wt;0;0;0;24.3;0;0;-1;0;-1;0
Now I am using the following method to extract the numbers:
def OpenFile():
    name = askopenfilename(parent=root)
    yvec1 = []
    # open() raises an exception on failure, so there is no need to test
    # its result; the with block also closes the file automatically.
    with open(name, 'r') as f:
        data = f.readlines()
    del data[:14]      # delete the head (the [txtpda] block up to [PROFILE])
    del data[11200:]   # delete the tail (the [VALUES] block)
    for line in data:
        for nbr in line.split():   # split() also strips the '\n'
            yvec1.append(int(nbr))
    return yvec1
I want to use numpy to manage this in an easier way. Is that possible, with np.array or something like that?
You can use the two-argument form of iter(), where you pass it a function and a sentinel value; it keeps calling the function until the sentinel is returned. You can use this to skip lines until you see '[PROFILE]\n', and then use that same form of iter() to read until '[VALUES]\n'. The function is just the one called by next(iterable), which is iterable.__next__, e.g.:
with open(name) as f:
    for _ in iter(f.__next__, '[PROFILE]\n'):   # skip until [PROFILE]
        pass
    yvec1 = [int(d) for d in iter(f.__next__, '[VALUES]\n')]
yvec1 will now contain all values between [PROFILE] and [VALUES].
An alternative and potentially quicker way to consume the first iter() is to use collections.deque() instead of the for loop, but this is likely overkill for this problem, e.g.:
deque(iter(f.__next__, '[PROFILE]\n'), maxlen=0)
Note: using with will automatically close f at the end of the block.
You can simply replace everything from the line data=f.readlines() and below with:
data = [int(line) for line in map(str.strip, f.readlines())
        if line.isdigit() or (line.startswith('-') and line[1:].isdigit())]
And data will be the list of integers you're looking for.
Just to give you the idea, this may help. s3[0] will be all the numbers between [PROFILE] and [VALUES]:
# s = your data
s = 'sjlkf slflsafj[PROFILE]9723,2974982,2987492,886[VALUES]skjlfsajlsjal'
s2 = s.split('[PROFILE]')
s3 = s2[1].split('[VALUES]')
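Since the question asks for a numpy route: given the fixed layout shown in the question (14 header lines up to and including [PROFILE], and a 5-line tail starting at [VALUES]), np.genfromtxt can do the whole job. The filename here is a placeholder:
import numpy as np

# skip_header/skip_footer counts taken from the file layout in the question.
yvec1 = np.genfromtxt('measurement.txt', dtype=int,
                      skip_header=14, skip_footer=5)
print(yvec1.shape)   # expected: (11200,)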
I am resampling the following table/data:
Timestamp L_x L_y L_a R_x R_y R_a
2403950 621.3 461.3 313 623.3 461.8 260
2403954 622.5 461.3 312 623.3 462.6 260
2403958 623.1 461.5 311 623.4 464 261
2403962 623.6 461.7 310 623.7 465.4 261
2403966 623.8 461.5 309 623.9 466.1 261
2403970 620.9 461.4 309 623.8 465.9 259
2403974 621.7 461.1 308 623 464.8 258
2403978 622.1 461.1 308 621.9 463.9 256
2403982 622.5 461.5 308 621 463.4 255
2403986 622.4 462.1 307 620.7 463.3 254
The table goes on like that.
The timestamps are in milliseconds. I did the following to resample it into 100 ms bins:
I changed the timestamp index into a datetime format
df.index = pd.to_datetime((df.index.values*1e6).astype(int))
I resampled it into 100 ms bins:
df = df.resample('100L').mean()
The resulting resampled data look like the following:
Timestamp L_x L_y L_a R_x R_y R_a
2403900 621.3 461.3 313 623.3 461.8 260
2404000 622.5 461.3 312 623.3 462.6 260
2404100 623.1 461.5 311 623.4 464 261
2404200 623.6 461.7 310 623.7 465.4 261
2404300 623.8 461.5 309 623.9 466.1 261
As we can see, the first bin is 2403900, which is 50 milliseconds behind the first timestamp index of the original table. But I wanted the bins to start from the first timestamp index of the original table, which is 2403950, like the following:
Timestamp L_x L_y L_a R_x R_y R_a
2403950 621.3 461.3 313 623.3 461.8 260
2404050 622.5 461.3 312 623.3 462.6 260
2404150 623.1 461.5 311 623.4 464 261
2404250 623.6 461.7 310 623.7 465.4 261
2404350 623.8 461.5 309 623.9 466.1 261
You can specify an offset:
df.resample('100L', loffset='50L')
UPDATE
Of course you can always calculate the offset:
offset = df.index[0] % 100
df.index = pd.to_datetime((df.index.values*1e6).astype(int))
df.resample('100L', loffset='{}L'.format(offset))
A much simpler (and general) solution is to add the base argument to your resampling call. base is expressed in subdivisions of the sampling interval, so for '100L' bins anchored 50 ms in, use base=50:
df = df.resample('100L', base=50).mean()
A dynamic solution that also works with pandas Timestamp objects (often used to index time-series data), or with strictly numerical index values, is to use the origin argument of the resample method, as such:
df = df.resample("15min", origin=df.index[0]).mean()
Where the "15min" would represent the sampling frequency and the index[0] argument essentially says:
"start sampling the desired frequency at the first value found in this DataFrame's index"
AFAIK, this works for any combination of a numerical value and a valid time-series offset alias (see here), such as "15min", "4H", "1W", etc.
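Tying this back to the question's data, here is a minimal sketch (columns abbreviated) of the modern pandas (>= 1.1) route: origin='start' anchors the first bin at the first timestamp and replaces the deprecated base/loffset arguments:
import pandas as pd

# Toy frame with the question's millisecond index (columns abbreviated).
df = pd.DataFrame({'L_x': [621.3, 622.5, 623.1, 623.6]},
                  index=[2403950, 2403954, 2403958, 2403962])

df.index = pd.to_datetime(df.index * 10**6)   # milliseconds -> nanoseconds
out = df.resample('100L', origin='start').mean()   # first bin starts at ...950
print(out)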