I am resampling the following table/data:
Timestamp L_x L_y L_a R_x R_y R_a
2403950 621.3 461.3 313 623.3 461.8 260
2403954 622.5 461.3 312 623.3 462.6 260
2403958 623.1 461.5 311 623.4 464 261
2403962 623.6 461.7 310 623.7 465.4 261
2403966 623.8 461.5 309 623.9 466.1 261
2403970 620.9 461.4 309 623.8 465.9 259
2403974 621.7 461.1 308 623 464.8 258
2403978 622.1 461.1 308 621.9 463.9 256
2403982 622.5 461.5 308 621 463.4 255
2403986 622.4 462.1 307 620.7 463.3 254
The table goes on and on like that.
The timestamps are in milliseconds. I did the following to resample it into 100-millisecond bins:
I changed the timestamp index into a datetime format:
df.index = pd.to_datetime((df.index.values*1e6).astype(int))
I resampled it in 100-millisecond bins:
df = df.resample('100L')
The resulting resampled data look like the following:
Timestamp L_x L_y L_a R_x R_y R_a
2403900 621.3 461.3 313 623.3 461.8 260
2404000 622.5 461.3 312 623.3 462.6 260
2404100 623.1 461.5 311 623.4 464 261
2404200 623.6 461.7 310 623.7 465.4 261
2404300 623.8 461.5 309 623.9 466.1 261
As we can see, the first bin starts at 2403900, which is 50 milliseconds before the first timestamp of the original table. But I wanted the bins to start at the first timestamp of the original table, which is 2403950, like the following:
Timestamp L_x L_y L_a R_x R_y R_a
2403950 621.3 461.3 313 623.3 461.8 260
2404050 622.5 461.3 312 623.3 462.6 260
2404150 623.1 461.5 311 623.4 464 261
2404250 623.6 461.7 310 623.7 465.4 261
2404350 623.8 461.5 309 623.9 466.1 261
You can specify an offset:
df.resample('100L', loffset='50L')
UPDATE
Of course you can always calculate the offset:
offset = df.index[0] % 100
df.index = pd.to_datetime((df.index.values*1e6).astype(int))
df.resample('100L', loffset='{}L'.format(offset))
A much simpler (and general) solution is to just add base=1 to your resampling function:
df = df.resample('100L', base=1)
A dynamic solution that also works with pandas Timestamp objects (often used to index timeseries data), or strictly numerical index values, is to use the origin argument of the resample method, like so:
df = df.resample("15min", origin=df.index[0])
Where the "15min" would represent the sampling frequency and the index[0] argument essentially says:
"start sampling the desired frequency at the first value found in this DataFrame's index"
AFAIK, this works for any combination of a numerical value plus a valid timeseries offset alias (see here), such as "15min", "4H", "1W", etc.
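For instance, a minimal sketch with synthetic data shaped like the table above (origin requires pandas >= 1.1; the column values here are made up):

import numpy as np
import pandas as pd

# millisecond timestamps starting at 2403950, i.e. 50ms past a round boundary
ts = np.arange(2403950, 2403990, 4)
df = pd.DataFrame({'L_x': np.random.rand(len(ts))},
                  index=pd.to_datetime(ts, unit='ms'))

# bins now start exactly at the first index value (...950, ...050, ...)
out = df.resample('100L', origin=df.index[0]).mean()

I believe origin='start' is an equivalent spelling of origin=df.index[0] in those versions.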
Related
I am trying to figure out how to renumber a certain file format and struggling to get it right.
First, a little background may help: There is a certain file format used in computational chemistry to describe the structure of a molecule with the extension .xyz. The first column is the number used to identify a specific atom (carbon, hydrogen, etc.), and the subsequent columns show what other atom numbers it is connected to. Below is a small sample of this file, but the usual file is significantly larger.
259 252
260 254
261 255
262 256
264 248 265 268
265 264 266 269 270
266 265 267 282
267 266
268 264
269 265
270 265 271 276 277
271 270 272 273
272 271 274 278
273 271 275 279
274 272 275 280
275 273 274 281
276 270
277 270
278 272
279 273
280 274
282 266 283 286
283 282 284 287 288
284 283 285 289
285 284
286 282
287 283
288 283
289 284 290 293
290 289 291 294 295
291 290 292 304
As you can see, the numbers 263 and 281 are missing. Of course, there could be many more missing numbers, so I need my script to account for this. Below is the code I have thus far; the lists missing_nums and missing_nums2 are given here, but I would normally obtain them from an earlier part of the script. The last element of missing_nums2 is where I want the numbering to finish, so in this case: 289.
missing_nums = ['263', '281']
missing_nums2 = ['281', '289']
with open("atom_nums.xyz", "r") as f2:
lines = f2.read()
for i in range(0, len(missing_nums) - 1):
if i == 0:
with open("atom_nums_out.xyz", "w") as f2:
replacement = int(missing_nums[i])
for number in range(int(missing_nums[i]) + 1, int(missing_nums2[i])):
lines = lines.replace(str(number), str(replacement))
replacement += 1
f2.write(lines)
else:
with open("atom_nums_out.xyz", "r") as f2:
lines = f2.read()
with open("atom_nums_out.xyz", "w") as f2:
replacement = int(missing_nums[i]) - (i + 1)
print(replacement)
for number in range(int(missing_nums[i]), int(missing_nums2[i])):
lines = lines.replace(str(number), str(replacement))
replacement += 1
f2.write(lines)
The problem lies in the fact that as the file gets larger, there seem to be repeats of numbers for reasons I cannot figure out. I hope somebody can help me here.
EDIT: The desired output of the code using the above sample would be
259 252
260 254
261 255
262 256
263 248 264 267
264 263 265 268 269
265 264 266 280
266 265
267 263
268 264
269 264 270 275 276
270 269 271 272
271 270 273 277
272 270 274 278
273 271 274 279
274 272 273 279
275 269
276 269
277 271
278 272
279 273
280 265 281 284
281 280 282 285 286
282 281 283 287
283 282
284 280
285 281
286 281
287 282 288 291
288 287 289 292 293
289 288 290 302
Which is, indeed, what I get as the output for this small sample, but as the number of missing numbers increases, it stops working and I get duplicate numbers. I can provide the whole file if anyone wants.
Thanks!
Assuming my interpretation of the lists missing_nums and missing_nums2 is correct, this is how I would perform the operation.
from os import rename

def fixFile(fn, mn1, mn2):
    with open(fn, "r") as fin:
        with open('tmp.txt', "w") as fout:
            for line in fin:
                for i in range(len(mn1)):
                    minN = int(mn1[i])
                    maxN = int(mn2[i])
                    # shift each number in this gap down by one; ascending
                    # order keeps a replacement from clobbering a later one
                    for nxtn in range(minN + 1, maxN + 1):
                        line = line.replace(str(nxtn), str(nxtn - 1))
                fout.write(line)
    rename('tmp.txt', fn)

missing_nums = ['263', '281']
missing_nums2 = ['281', '289']
fn = "atom_nums_out.xyz"
fixFile(fn, missing_nums, missing_nums2)
Note, I am only reading the file in one line at a time, and writing the result out one line at a time, then renaming the temp file to the original filename after all data is processed. This means significantly longer files will not chew up memory.
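For what it's worth, one pitfall with any str.replace-based approach (the code above included) is that it matches substrings, so replacing "26" would also rewrite the "26" inside "264"; that is one way duplicate numbers can creep in as files grow. A sketch of a token-based alternative, my own variant rather than either poster's code, assuming every whitespace-separated token in the file is an atom number and the missing numbers are known:

import bisect

def renumber(fn_in, fn_out, missing):
    # shift each atom number down by the count of missing numbers below it
    with open(fn_in) as fin, open(fn_out, "w") as fout:
        for line in fin:
            # operate on whole whitespace-separated tokens, never substrings
            fixed = [str(int(tok) - bisect.bisect_left(missing, int(tok)))
                     for tok in line.split()]
            fout.write(" ".join(fixed) + "\n")

missing = [263, 281]  # the missing atom numbers, sorted ascending
renumber("atom_nums.xyz", "atom_nums_out.xyz", missing)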
I am trying to create a seaborn FacetGrid to plot the distribution of every column in my DataFrame decathlon, to check for normality. The data looks like this:
P100m Plj Psp Phj P400m P110h Ppv Pdt Pjt P1500
0 938 1061 773 859 896 911 880 732 757 752
1 839 975 870 749 887 878 880 823 863 741
2 814 866 841 887 921 939 819 778 884 691
3 872 898 789 878 848 879 790 790 861 804
4 892 913 742 803 816 869 1004 789 854 699
... ... ... ... ... ... ... ... ... ...
7963 755 760 604 714 812 794 482 571 539 780
7964 830 845 524 767 786 783 601 573 562 535
7965 819 804 653 840 791 699 659 461 448 632
7966 804 720 539 758 830 782 731 487 425 729
7967 687 809 692 714 565 741 804 527 738 523
I am relatively new to Python and can't understand my error. My attempt to format the data and create the grid is as follows:
import seaborn as sns
df_stacked = decathlon.stack().reset_index(1).rename({'level_1': 'column', 0: 'values'}, axis=1)
g = sns.FacetGrid(df_stacked, row = 'column')
g = g.map(plt.hist, "values")
However, I receive the following error:
ValueError: Axes instance argument was not found in a figure
Can anyone explain what exactly this error means and how I would go about fixing it?
EDIT
df_stacked looks as such:
column values
0 P100m 938
0 Plj 1061
0 Psp 773
0 Phj 859
0 P400m 896
... ...
7967 P110h 741
7967 Ppv 804
7967 Pdt 527
7967 Pjt 738
7967 P1500 523
I encountered a similar issue when running a Jupyter Notebook.
My solution involved:
Restarting the notebook
Re-running the imports: %matplotlib inline; import matplotlib.pyplot as plt
As you did not post a full working example, it's a bit of guesswork.
What might be going wrong is the line g = g.map(plt.hist, "values"), because the error comes from deep within matplotlib. You can see this in this SO question, where another function outside matplotlib, pylab.sca(axes[i]), triggers the same error because the axes instance is not registered in the current figure.
Likely you installed or updated something in your (conda?) environment (changes in environment paths?) and it was only picked up after the next restart.
I also wonder where your plt.hist comes from ... fully typed it should resemble matplotlib.pyplot.hist ... but I'm guessing (waiting for your updated example code).
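For reference, here is a minimal, self-contained variant of the posted code that runs cleanly in a fresh session; random integers stand in for the decathlon scores:

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns

# hypothetical stand-in for the decathlon DataFrame
decathlon = pd.DataFrame(np.random.randint(400, 1100, size=(200, 3)),
                         columns=['P100m', 'Plj', 'Psp'])

# stack the columns into long form: one 'column' label and one 'values' number per row
df_stacked = (decathlon.stack()
                       .reset_index(level=1)
                       .rename({'level_1': 'column', 0: 'values'}, axis=1))

g = sns.FacetGrid(df_stacked, row='column')
g = g.map(plt.hist, 'values')
plt.show()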
I have a txt file which contains lots of information. I do not want its head and tail; I need only the numbers in the middle, which form a 1x11200 matrix.
[txtpda]
LT=5.6
DATE=21.06.2018
TIME=14:11
CNT=11200
RES=0.00854518
N=5
VB=350
VT=0.5
LS=0
MEASTIME=201806211412
PICKUP=BFW-2
LC=0.8
[PROFILE]
255
256
258
264
269
273
267
258
251
255
259
262
260
256
255
260
264
266
265
263
261
263
267
275
280
280
280
280
283
284
283
277
279
280
283
285
283
282
280
280
286
288
298
299
299
299
304
303
300
297
295
296
299
301
303
301
299
296
298
299
302
303
304
307
308
312
313
314
312
311
311
310
312
310
309
305
303
299
297
294
288
280
270
266
250
242
222
213
199
180
173
...
-1062
-1063
[VALUES]
Ra;2;3;2;0.769;0;0;-1;0;-1;0
Rz;2;2;2;5.137;0;0;-1;0;-1;0
Pt;0;0;0;26.25;0;0;-1;0;-1;0
Wt;0;0;0;24.3;0;0;-1;0;-1;0
Now I am using the following method to extract the numbers:
def OpenFile():
    name = askopenfilename(parent=root)
    f = open(name, 'r')
    originalyvec1 = []
    yvec1 = []
    if f == 0:
        print("fail to open the file")
    else:
        print("file successfully opened")
    data = f.readlines()
    for i in range(0, 14):
        del data[0]    # delete the 14-line head (strings)
    del data[11204]    # delete the tail (strings)
    del data[11203]
    del data[11202]
    del data[11201]
    del data[11200]
    for line in data:
        for nbr in line.split():    # split() also strips the trailing \n
            yvec1.append(int(nbr))
    if f.close() == 0:
        print("fail to close file")
    else:
        print("file closed")
I want to use numpy to manage it in an easy way. Is that possible?
Like np.array or something like that.
You can use an alternative form of iter(), where you pass it a function and a sentinel value; it will keep calling that function until it returns the sentinel (the 2nd argument). You can use this to skip lines until you see [PROFILE]\n, and then use that same form of iter() to read until [VALUES]\n. The function is just the one called by next(iterable), which is iterable.__next__, e.g.:
with open(name) as f:
    for _ in iter(f.__next__, '[PROFILE]\n'):   # Skip until PROFILE
        pass
    yvec1 = [int(d) for d in iter(f.__next__, '[VALUES]\n')]
yvec1 will now contain all values between [PROFILE] and [VALUES].
An alternative and potentially quicker way to consume the first iter() is to use collections.deque() instead of the for loop, but this is likely overkill for this problem, e.g.:
from collections import deque
deque(iter(f.__next__, '[PROFILE]\n'), maxlen=0)
Note: using with will automatically close f at the end of the block.
You can simply replace everything from the line data=f.readlines() and below with:
data = [int(line) for line in map(str.strip, f.readlines())
        if line.isdigit() or (line.startswith('-') and line[1:].isdigit())]
And data will be the list of integers you're looking for.
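Since the question asks for NumPy: if the head and tail really are fixed at 14 and 5 lines, as the posted code assumes, np.genfromtxt can do the slicing directly (a sketch; name is the filename from the question):

import numpy as np

# skip_header drops [txtpda] through [PROFILE] (14 lines); skip_footer drops
# [VALUES] plus the four Ra/Rz/Pt/Wt lines (5 lines)
yvec1 = np.genfromtxt(name, skip_header=14, skip_footer=5)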
Just to give you the idea, this may help.
s3[0] will be all the numbers between [PROFILE] and [VALUES]:
# s = your data
s = 'sjlkf slflsafj[PROFILE]9723,2974982,2987492,886[VALUES]skjlfsajlsjal'
s2 = s.split('[PROFILE]')
s3 = s2[1].split('[VALUES]')
I want the product of the last 12 months of data, counted back from the current row.
Date Open
21/06/11 839.9
22/06/11 853.35
23/06/11 846.55
24/06/11 874.15
27/06/11 866.7
28/06/11 878.9
29/06/11 875.7
30/06/11 888.7
01/07/11 907
04/07/11 874.4
05/07/11 869.3
06/07/11 848.85
07/07/11 858
08/07/11 873
11/07/11 854
12/07/11 847.5
13/07/11 853.05
14/07/11 863.3
15/07/11 867.7
18/07/11 871.9
19/07/11 867.5
20/07/11 886
21/07/11 875.95
22/07/11 866
25/07/11 892
26/07/11 888.25
27/07/11 875
28/07/11 855
29/07/11 840
01/08/11 838
02/08/11 827.55
03/08/11 826.75
04/08/11 828
05/08/11 799.5
08/08/11 776.7
09/08/11 753
10/08/11 785.35
11/08/11 768.35
12/08/11 783
16/08/11 760
17/08/11 760.5
18/08/11 757.7
19/08/11 731.05
22/08/11 731
23/08/11 760.35
24/08/11 764
25/08/11 761.6
26/08/11 751
29/08/11 731.1
30/08/11 765
02/09/11 796.7
05/09/11 794.5
06/09/11 783.2
07/09/11 824
08/09/11 833.5
09/09/11 852.15
12/09/11 810.35
13/09/11 813.2
14/09/11 813.9
15/09/11 833
16/09/11 850
19/09/11 825
20/09/11 823
21/09/11 850.9
22/09/11 823.95
23/09/11 773.9
26/09/11 769.2
27/09/11 774
28/09/11 799.75
29/09/11 790.5
30/09/11 803.5
03/10/11 791.2
04/10/11 784
05/10/11 772.55
07/10/11 786.7
10/10/11 804.25
11/10/11 835
12/10/11 829.4
13/10/11 850
14/10/11 842
17/10/11 867
18/10/11 825
19/10/11 825.5
20/10/11 834.85
21/10/11 840
24/10/11 848
25/10/11 855
26/10/11 879
28/10/11 899.7
31/10/11 898
01/11/11 870.5
02/11/11 855
03/11/11 867.75
04/11/11 905
08/11/11 879
09/11/11 890.05
11/11/11 859
14/11/11 891.4
15/11/11 871
16/11/11 859.1
17/11/11 845.05
18/11/11 800.3
21/11/11 800
22/11/11 788.1
23/11/11 789.9
24/11/11 775
25/11/11 769.7
28/11/11 765
29/11/11 782
30/11/11 756.7
01/12/11 799
02/12/11 797
05/12/11 808.35
07/12/11 807
08/12/11 802
09/12/11 769.9
12/12/11 760.55
13/12/11 723.9
14/12/11 738
15/12/11 731.9
16/12/11 749
19/12/11 719.2
20/12/11 741.7
21/12/11 727
22/12/11 741.35
23/12/11 760
26/12/11 747.05
27/12/11 766
28/12/11 757.7
29/12/11 733.65
30/12/11 713
02/01/12 696.8
03/01/12 712.25
04/01/12 727.4
05/01/12 715
06/01/12 697.05
07/01/12 716.7
09/01/12 714.45
10/01/12 712
11/01/12 737.9
12/01/12 747.5
13/01/12 742
16/01/12 729.95
17/01/12 716
18/01/12 762
19/01/12 789
20/01/12 790
23/01/12 755.3
24/01/12 774.6
25/01/12 788.7
27/01/12 800
30/01/12 813.9
31/01/12 804.5
01/02/12 818.9
02/02/12 835
03/02/12 830
06/02/12 845.9
07/02/12 842
08/02/12 847
09/02/12 856.75
10/02/12 850.35
13/02/12 841.1
14/02/12 846.9
15/02/12 854.2
16/02/12 831
17/02/12 822.05
21/02/12 817.5
22/02/12 848
23/02/12 832
24/02/12 833.5
27/02/12 821.8
28/02/12 789.05
29/02/12 805.05
01/03/12 811.8
02/03/12 816.25
03/03/12 811
05/03/12 812.05
06/03/12 797
07/03/12 776.55
09/03/12 775.3
12/03/12 790
13/03/12 803.45
14/03/12 828
15/03/12 818
16/03/12 780
19/03/12 781
20/03/12 756.1
21/03/12 760
22/03/12 765.9
23/03/12 743.8
26/03/12 743.9
27/03/12 738
28/03/12 730
29/03/12 718
30/03/12 729.5
02/04/12 749.35
03/04/12 744.25
04/04/12 745
09/04/12 740.05
10/04/12 746
11/04/12 739
12/04/12 733.3
13/04/12 746.05
16/04/12 747.1
17/04/12 754.8
18/04/12 750
19/04/12 753.9
20/04/12 740.05
23/04/12 725.85
24/04/12 739
25/04/12 734.1
26/04/12 737.1
27/04/12 741.3
28/04/12 739.8
30/04/12 737.5
02/05/12 747.9
03/05/12 738.5
04/05/12 733.4
07/05/12 715
08/05/12 718
09/05/12 702
10/05/12 697.25
11/05/12 693
14/05/12 698
15/05/12 679
16/05/12 675
17/05/12 680.25
18/05/12 676.9
21/05/12 686.5
22/05/12 704.6
23/05/12 685.2
24/05/12 694
25/05/12 695
28/05/12 692
29/05/12 702.2
30/05/12 699.65
31/05/12 697
01/06/12 707.35
04/06/12 677
05/06/12 696
06/06/12 704.45
07/06/12 721.05
08/06/12 718
11/06/12 732.7
12/06/12 715
13/06/12 722.25
14/06/12 716
15/06/12 718.5
18/06/12 730.35
19/06/12 717
20/06/12 738
21/06/12 734
22/06/12 713.55
25/06/12 714.2
26/06/12 717.5
27/06/12 726.4
28/06/12 724.4
29/06/12 725.1
02/07/12 735.5
03/07/12 739.95
04/07/12 740
05/07/12 734.95
06/07/12 738
09/07/12 729
10/07/12 731.45
11/07/12 733.45
12/07/12 721.9
13/07/12 720
16/07/12 720
17/07/12 724.8
18/07/12 718
19/07/12 720.2
20/07/12 722.3
23/07/12 715
24/07/12 721
25/07/12 720.4
26/07/12 720.9
27/07/12 719
30/07/12 723
31/07/12 731.6
01/08/12 740.25
02/08/12 742.1
03/08/12 735
06/08/12 748.05
07/08/12 786.05
08/08/12 785.05
09/08/12 788.9
10/08/12 777.65
13/08/12 779.5
14/08/12 787.9
16/08/12 802.05
17/08/12 817.9
21/08/12 816
22/08/12 809.2
23/08/12 810.55
24/08/12 791.75
27/08/12 786
28/08/12 786.85
29/08/12 791
30/08/12 779.75
31/08/12 780
03/09/12 768
04/09/12 763.95
05/09/12 775.25
06/09/12 766.3
07/09/12 778.7
08/09/12 793.5
10/09/12 800
11/09/12 789.5
12/09/12 793.5
13/09/12 798.1
14/09/12 813
17/09/12 848.1
18/09/12 870.2
I tried something along these lines but did not find a solution:
df['val']= df['Open'].last('12M').transform('prod')
How can I get the result?
If you just need the product of the last 12 months' values of df['Open'], then you could do something like this:
import pandas as pd
df['Date'] = pd.to_datetime(df['Date'])
df.set_index(['Date'], inplace=True)
df.sort_index(inplace=True)
df.tail(12).prod()  # note: tail(12) takes the last 12 rows, not 12 calendar months
which gives you
Open 2.843636e+34
dtype: float64
I think you can adapt the following example to get what you need:
# example with 7 days
import pandas as pd
dates = pd.date_range('1/1/2018', periods=7, freq='d')
values = [4,3,7,5,3,2,3]
df = pd.DataFrame({'col1':values}, index=dates)
# get product of last 2 days
df['col1'].last('2d').prod()
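If instead you want a value on every row (the product over the trailing year ending at that row), a time-based rolling window is one option. A sketch, assuming df has a sorted DatetimeIndex as above; '365D' stands in for "12 months", which is not itself a valid rolling window, and note that the product of a year of ~800-valued prices overflows float64, so the log-sum form is usually safer:

import numpy as np

# product of all 'Open' values within the 365 days ending at each row
df['val'] = df['Open'].rolling('365D').apply(np.prod, raw=True)

# overflow-safe variant: exponentiate the rolling sum of logs
df['val'] = np.exp(np.log(df['Open']).rolling('365D').sum())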
I have asked a question in another thread (Link), but I got an incomplete answer, and no one is willing to reply. That is why I am asking a modified question. Let me explain the question briefly: I wanted to resample the following data:
Timestamp L_x L_y L_a R_x R_y R_a
2403950 621.3 461.3 313 623.3 461.8 260
2403954 622.5 461.3 312 623.3 462.6 260
2403958 623.1 461.5 311 623.4 464 261
2403962 623.6 461.7 310 623.7 465.4 261
2403966 623.8 461.5 309 623.9 466.1 261
2403970 620.9 461.4 309 623.8 465.9 259
2403974 621.7 461.1 308 623 464.8 258
2403978 622.1 461.1 308 621.9 463.9 256
2403982 622.5 461.5 308 621 463.4 255
2403986 622.4 462.1 307 620.7 463.3 254
The table goes on and on like that. All the timestamps are in milliseconds, and I wanted to resample it into 100-millisecond ('100L') bins:
df = df.resample('100L')
The resulting table is:
Timestamp L_x L_y L_a R_x R_y R_a
2403900 621.3 461.3 313 623.3 461.8 260
2404000 622.5 461.3 312 623.3 462.6 260
2404100 623.1 461.5 311 623.4 464 261
2404200 623.6 461.7 310 623.7 465.4 261
2404300 623.8 461.5 309 623.9 466.1 261
But that is not the result I want, because the first timestamp in the original table is 2403950, so the first bin should cover 2403950 to 2404050 instead of 2403900 to 2404000, like the following:
Timestamp L_x L_y L_a R_x R_y R_a
2403950 ... ... ... ... ... ...
2404050 ... ... ... ... ... ...
2404150 ... ... ... ... ... ...
2404250 ... ... ... ... ... ...
2404350 ... ... ... ... ... ...
The rest of the columns contain the mean of the values of the original table.
So to do that, someone suggested that I calculate the offset, which in my case is 50 milliseconds, and do the following:
df.resample('100L', loffset='50L')
But loffset only moves the labels 50 milliseconds forward; it does not change the mean values. For the first bin, for instance, it still calculates the mean of the values from 2403900 to 2404000 instead of 2403950 to 2404050.
Thanks for your help
You're looking for the base kwarg.
base : int, default 0
For frequencies that evenly subdivide 1 day, the “origin” of the aggregated intervals. For example, for ‘5min’ frequency, base could range from 0 through 4. Defaults to 0
In your case it looks like you want:
df.resample('100L', base=50)
Note: resample without a DatetimeIndex/PeriodIndex/TimedeltaIndex raises an error in recent pandas, so you should convert to DatetimeIndex before doing this.
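For completeness, a sketch of the whole pipeline under that suggestion, assuming the raw index holds integer milliseconds (and noting that in pandas >= 1.1 base is deprecated in favour of offset/origin):

import pandas as pd

# turn the integer millisecond index into a DatetimeIndex
df.index = pd.to_datetime(df.index, unit='ms')

# bins now start at ...950, ...050, ... and the means are computed over them
binned = df.resample('100L', base=50).mean()

# pandas >= 1.1 spells the same thing as:
# binned = df.resample('100ms', offset='50ms').mean()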