Pandas: group dataframes into user-specified time period - python

Probably related: pandas dataframe group year index by decade
For example if I have data as follows
status bytes_sent upstream_cache_status \
timestamp
2014-05-26 23:56:30 200 356 MISS
2014-05-26 23:56:30 200 10517 -
2014-05-26 23:57:05 200 6923 MISS
2014-05-26 23:57:14 200 323 -
2014-05-26 23:57:30 200 356 MISS
2014-05-26 23:57:38 200 8107 HIT
2014-05-26 23:57:43 200 369 MISS
2014-05-26 23:57:56 304 401 HIT
2014-05-26 23:57:56 304 401 HIT
2014-05-26 23:57:56 304 387 MISS
2014-05-26 23:57:57 304 401 HIT
2014-05-26 23:57:58 304 401 HIT
2014-05-26 23:58:08 200 507 EXPIRED
2014-05-26 23:58:29 304 338 HIT
2014-05-26 23:58:31 400 409 -
2014-05-26 23:58:45 200 425 MISS
Let's say I want to group them so that each group contains the logs within a 30-second window (the window length is user-specified). How do I do that? I have seen this
df.groupby(lambda x: x.hour)
but I highly doubt it is relevant in my case

df.groupby(pd.Grouper(freq='30S', level=0)) should do; for example
>>> aggr = lambda df: df.apply(tuple)
>>> df.groupby(pd.Grouper(freq='30S', level=0)).aggregate(aggr)
status bytes_sent \
timestamp
2014-05-26 23:56:30 (200, 200) (356, 10517)
2014-05-26 23:57:00 (200, 200) (6923, 323)
2014-05-26 23:57:30 (200, 200, 200, 304, 304, 304, 304, 304) (356, 8107, 369, 401, 401, 387, 401, 401)
2014-05-26 23:58:00 (200, 304) (507, 338)
2014-05-26 23:58:30 (400, 200) (409, 425)
upstream_cache_status
timestamp
2014-05-26 23:56:30 (MISS, -)
2014-05-26 23:57:00 (MISS, -)
2014-05-26 23:57:30 (MISS, HIT, MISS, HIT, HIT, MISS, HIT, HIT)
2014-05-26 23:58:00 (EXPIRED, HIT)
2014-05-26 23:58:30 (-, MISS)
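Since the window length is user-specified, it can be passed in as a variable. The following is a minimal sketch, assuming a small made-up subset of the log above and hypothetical output column names (`requests`, `total_bytes`). The lowercase alias `"30s"` is used because newer pandas releases deprecate the uppercase `'S'`:

```python
import pandas as pd

# Small illustrative subset of the log, indexed by timestamp.
timestamps = pd.to_datetime([
    "2014-05-26 23:56:30", "2014-05-26 23:56:30", "2014-05-26 23:57:05",
    "2014-05-26 23:57:14", "2014-05-26 23:57:30", "2014-05-26 23:57:38",
])
df = pd.DataFrame(
    {"status": [200, 200, 200, 200, 200, 200],
     "bytes_sent": [356, 10517, 6923, 323, 356, 8107]},
    index=pd.DatetimeIndex(timestamps, name="timestamp"),
)

window = "30s"  # user-specified window length (assumed to be a pandas offset alias)

# One row per 30-second bucket: row count and total bytes in that bucket.
summary = df.groupby(pd.Grouper(freq=window, level=0)).agg(
    requests=("status", "size"),
    total_bytes=("bytes_sent", "sum"),
)
print(summary)
```

Any other aggregation (mean, max, a custom function) can be swapped in for `size` and `sum`.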

Related

is there any function in Python for Aggregating second wise millisecond data?

I am working on a problem set where the data is recorded at microsecond resolution. I have 4 hours of data so far, and the data set is very large because it contains a row for each microsecond-level reading. I want to aggregate these readings into their respective seconds so that the data is easier to analyse.
example:
Vibration1 Vibration2 Vibration3 Temperature Pressure Time
816 698 822 1852 710 2019-03-26 09:49:09.013650
702 690 764 2002 810 2019-03-26 09:49:09.014308
702 692 768 1888 706 2019-03-26 09:49:09.014680
696 690 704 2004 810 2019-03-26 09:49:09.015094
738 696 772 1990 710 2019-03-26 09:49:09.015682
834 692 704 2066 704 2019-03-26 09:49:09.016153
798 692 690 1892 722 2019-03-26 09:49:09.016520
696 722 708 2102 700 2019-03-26 09:49:09.016875
824 690 700 2058 718 2019-03-26 09:49:09.017213
692 702 694 2106 704 2019-03-26 09:49:09.017564
Like this, I have many rows within second 09.
I have a total of 4 hours of data. How should I group the rows by each second, keeping the hour and minute they belong to?
Please help me.
If I group by seconds, it groups all rows with the same seconds value together, irrespective of their hours and minutes.
I set the index to a DatetimeIndex and then tried the code below. It returned 60 rows, aggregated irrespective of hours and minutes.
df.groupby(df.index.minute).mean()
First, make sure your Time is a datetime object:
df.Time = pd.to_datetime(df.Time)
Then you need to resample:
df.set_index('Time').resample('1S').mean()
With your example data as df, the above results in:
Vibration1 Vibration2 Vibration3 Temperature Pressure
Time
2019-03-26 09:49:09 749.8 696.4 732.6 1996.0 729.4
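To illustrate why grouping on `df.index.second` (or `.minute`) merges rows from different hours while flooring the full timestamp does not, here is a small sketch. It assumes a made-up four-row sample in which two of the rows are moved one hour later. Grouping on the floored index is a swapped-in alternative to `resample` that returns only seconds that actually contain data, whereas `resample('1s')` also emits a row for every empty second in between:

```python
import pandas as pd

# Two readings at 09:49:09 and two at 10:49:09 (same second-of-minute, different hour).
df = pd.DataFrame({
    "Vibration1": [816, 702, 702, 696],
    "Time": ["2019-03-26 09:49:09.013650", "2019-03-26 09:49:09.014308",
             "2019-03-26 10:49:09.014680", "2019-03-26 10:49:09.015094"],
})
df["Time"] = pd.to_datetime(df["Time"])
indexed = df.set_index("Time")

# Groups on the seconds field only: both hours collapse into one row (second == 9).
by_field = indexed.groupby(indexed.index.second).mean()

# Groups on the full timestamp truncated to whole seconds: one row per real second.
by_second = indexed.groupby(indexed.index.floor("s")).mean()

print(by_field)
print(by_second)
```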
Can you change column 'Time'?
Example:
import pandas as pd
data = {
'dates': ['09:49:09.015682', '09:50:09.025682', '09:51:09.055682', '09:49:09.035682', '09:50:09.015682'],
'values': [ 1, 2, 3, 4, 5]
}
df = pd.DataFrame(data)
# Keep only the HH:MM:SS part of each string (vectorized; avoids chained assignment)
df['dates'] = df['dates'].str[:8]
print(df.groupby('dates').mean())
Output:
values
dates
09:49:09 2.5
09:50:09 3.5
09:51:09 3.0

Pandas: select by bigger than a value

My dataframe has a column called dir that takes several distinct values. I want to know how many of those values have a count above a certain threshold. For example:
df['dir'].value_counts().sort_index()
It returns a Series
0 855
20 881
40 2786
70 3777
90 3964
100 4
110 2115
130 3040
140 1
160 1697
180 1734
190 3
200 618
210 3
220 1451
250 895
270 2167
280 1
290 1643
300 1
310 1894
330 1
340 965
350 1
Name: dir, dtype: int64
Here, I want to know how many values occur more than 500 times. In this case, that is all of them except 100, 140, 190, 210, 280, 300, 330, and 350.
How can I do that?
I can get away with df['dir'].value_counts()[df['dir'].value_counts() > 500]
(df['dir'].value_counts() > 500).sum()
This gets the value counts and compares each one to 500, which yields a boolean Series. The parentheses make sure .sum() is applied to that whole boolean Series. .sum() counts each True as 1 and each False as 0, so the result is the number of values whose count exceeds 500.
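Both variants (selecting the frequent values and counting them) can be sketched on a tiny made-up column. The data and the threshold of 2 below are illustrative assumptions, not the asker's data:

```python
import pandas as pd

# Illustrative data: 0 appears 3 times, 20 appears 5 times, 100 and 140 once each.
df = pd.DataFrame({"dir": [0] * 3 + [20] * 5 + [100] + [140]})
threshold = 2  # stand-in for the 500 used in the question

counts = df["dir"].value_counts()          # compute the counts once and reuse them

frequent = counts[counts > threshold].sort_index()   # which values pass the threshold
n_frequent = int((counts > threshold).sum())         # how many values pass it

print(frequent)
print(n_frequent)
```

Storing `counts` in a variable avoids calling `value_counts()` twice, as the one-liner in the question does.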

Resampling in pandas

I asked this question in another thread (Link), but I only got an incomplete answer and no further replies, so I am posting a modified version. Briefly, I want to resample the following data:
Timestamp L_x L_y L_a R_x R_y R_a
2403950 621.3 461.3 313 623.3 461.8 260
2403954 622.5 461.3 312 623.3 462.6 260
2403958 623.1 461.5 311 623.4 464 261
2403962 623.6 461.7 310 623.7 465.4 261
2403966 623.8 461.5 309 623.9 466.1 261
2403970 620.9 461.4 309 623.8 465.9 259
2403974 621.7 461.1 308 623 464.8 258
2403978 622.1 461.1 308 621.9 463.9 256
2403982 622.5 461.5 308 621 463.4 255
2403986 622.4 462.1 307 620.7 463.3 254
The table goes on and on like that. All the timestamps are in milliseconds. And I wanted to resample it into 100L bin time.
df = df.resample('100L')
The resulting table is:
Timestamp L_x L_y L_a R_x R_y R_a
2403900 621.3 461.3 313 623.3 461.8 260
2404000 622.5 461.3 312 623.3 462.6 260
2404100 623.1 461.5 311 623.4 464 261
2404200 623.6 461.7 310 623.7 465.4 261
2404300 623.8 461.5 309 623.9 466.1 261
But that is not the result I want. The first timestamp in the original table is 2403950, so the first bin should cover 2403950 to 2404050, but instead it covers 2403900 to 2404000. I want the bins to look like the following:
Timestamp L_x L_y L_a R_x R_y R_a
2403950 ... ... ... ... ... ...
2404050 ... ... ... ... ... ...
2404150 ... ... ... ... ... ...
2404250 ... ... ... ... ... ...
2404350 ... ... ... ... ... ...
The rest of the column are the mean of the values of the original table.
To do that, someone suggested that I calculate the offset, which in my case is 50 milliseconds, and do the following:
df.resample('100L', loffset='50L')
The offset only moves the labels 50 milliseconds forward; it does not change the mean values. For the first bin, for instance, it still calculates the mean of the values from 2403900 to 2404000 instead of 2403950 to 2404050.
Thanks for your help
You're looking for the base kwarg.
base : int, default 0
For frequencies that evenly subdivide 1 day, the “origin” of the aggregated intervals. For example, for ‘5min’ frequency, base could range from 0 through 4. Defaults to 0
In your case it looks like you want:
df.resample('100L', base=50)
Note: resample without a DatetimeIndex/PeriodIndex/TimedeltaIndex raises an error in recent pandas, so you should convert to DatetimeIndex before doing this.
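In newer pandas the `base` and `loffset` keywords are no longer available (they were deprecated in pandas 1.1 and removed in 2.0); the replacement for `base` is the `offset` keyword. Here is a hedged sketch of the same idea, assuming a made-up four-row sample with millisecond timestamps and using the lowercase alias `"100ms"` in place of `'100L'`:

```python
import pandas as pd

# Illustrative sample; the integer index holds milliseconds, as in the question.
df = pd.DataFrame(
    {"L_x": [621.3, 622.5, 623.1, 623.6]},
    index=[2403950, 2403954, 2404049, 2404050],
)
df.index = pd.to_datetime(df.index, unit="ms")  # resample needs a datetime-like index

# Default: bin edges fall on multiples of 100 ms (first bin starts at 2403900).
default_bins = df.resample("100ms").mean()

# Shift the bin edges themselves by 50 ms (first bin starts at 2403950),
# so the means are computed over the shifted windows, not just relabelled.
shifted_bins = df.resample("100ms", offset="50ms").mean()

print(default_bins)
print(shifted_bins)
```

Unlike `loffset`, which only relabelled the bins, `offset` changes which rows fall into each bin: the row at 2404049 belongs to the second default bin but to the first shifted bin.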

How to resample starting from the first element in pandas?

I am resampling the following table/data:
Timestamp L_x L_y L_a R_x R_y R_a
2403950 621.3 461.3 313 623.3 461.8 260
2403954 622.5 461.3 312 623.3 462.6 260
2403958 623.1 461.5 311 623.4 464 261
2403962 623.6 461.7 310 623.7 465.4 261
2403966 623.8 461.5 309 623.9 466.1 261
2403970 620.9 461.4 309 623.8 465.9 259
2403974 621.7 461.1 308 623 464.8 258
2403978 622.1 461.1 308 621.9 463.9 256
2403982 622.5 461.5 308 621 463.4 255
2403986 622.4 462.1 307 620.7 463.3 254
The table goes on and on like that.
The timestamps are in milliseconds. I did the following to resample it into 100milliseconds bin time:
I changed the timestamp index into a datetime format
df.index = pd.to_datetime((df.index.values*1e6).astype(int))
I resampled it in 100milliseconds:
df = df.resample('100L')
The resulting resampled data look like the following:
Timestamp L_x L_y L_a R_x R_y R_a
2403900 621.3 461.3 313 623.3 461.8 260
2404000 622.5 461.3 312 623.3 462.6 260
2404100 623.1 461.5 311 623.4 464 261
2404200 623.6 461.7 310 623.7 465.4 261
2404300 623.8 461.5 309 623.9 466.1 261
As we can see, the first bin starts at 2403900, which is 50 milliseconds before the first timestamp in the original table. But I want the bins to start from the first timestamp in the original table, which is 2403950, like the following:
Timestamp L_x L_y L_a R_x R_y R_a
2403950 621.3 461.3 313 623.3 461.8 260
2404050 622.5 461.3 312 623.3 462.6 260
2404150 623.1 461.5 311 623.4 464 261
2404250 623.6 461.7 310 623.7 465.4 261
2404350 623.8 461.5 309 623.9 466.1 261
You can specify an offset:
df.resample('100L', loffset='50L')
UPDATE
Of course you can always calculate the offset:
offset = df.index[0] % 100
df.index = pd.to_datetime((df.index.values*1e6).astype(int))
df.resample('100L', loffset='{}L'.format(offset))
A much simpler (and general) solution is to just add base=1 to your resampling function:
df = df.resample('100L', base=1)
A dynamic solution that also works with pandas Timestamp objects (often used to index time series data) is to use the origin argument of the resample method:
df = df.resample("15min", origin=df.index[0])
Here "15min" is the sampling frequency, and the origin=df.index[0] argument essentially says:
"start the bins of the desired frequency at the first value found in this DataFrame's index"
AFAIK, this works for any combination of a numerical value and a valid time series offset alias (see here), such as "15min", "4H", "1W", etc.
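The effect of `origin` can be sketched on a small made-up series whose first timestamp (09:07) does not fall on a quarter-hour boundary. This assumes pandas 1.1 or later, where `origin` was introduced. The string `origin="start"` is a built-in shorthand for anchoring at the first index value:

```python
import pandas as pd

# Illustrative series; the first timestamp is deliberately off the 15-minute grid.
idx = pd.to_datetime([
    "2021-01-01 09:07", "2021-01-01 09:12",
    "2021-01-01 09:25", "2021-01-01 09:40",
])
s = pd.Series([1.0, 2.0, 3.0, 4.0], index=idx)

# Default: bins are aligned to midnight, so they start at 09:00, 09:15, 09:30.
default = s.resample("15min").sum()

# Anchored at the first observation: bins start at 09:07, 09:22, 09:37.
anchored = s.resample("15min", origin=s.index[0]).sum()

# Same anchoring without referring to the index explicitly.
anchored_start = s.resample("15min", origin="start").sum()

print(default)
print(anchored)
```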

program to calculate days of the week

This may be tricky to explain.
I have to "translate" an old BASIC program into Python.
the program is called weekdays:
10 PRINT TAB(32);"WEEKDAY"
20 PRINT TAB(15);"CREATIVE COMPUTING MORRISTOWN, NEW JERSEY"
30 PRINT:PRINT:PRINT
100 PRINT "WEEKDAY IS A COMPUTER DEMONSTRATION THAT"
110 PRINT"GIVES FACTS ABOUT A DATE OF INTEREST TO YOU."
120 PRINT
130 PRINT "ENTER TODAY'S DATE IN THE FORM: 3,24,1979 ";
140 INPUT M1,D1,Y1
150 REM THIS PROGRAM DETERMINES THE DAY OF THE WEEK
160 REM FOR A DATE AFTER 1582
170 DEF FNA(A)=INT(A/4)
180 DIM T(12)
190 DEF FNB(A)=INT(A/7)
200 REM SPACE OUTPUT AND READ IN INITIAL VALUES FOR MONTHS.
210 FOR I= 1 TO 12
220 READ T(I)
230 NEXT I
240 PRINT"ENTER DAY OF BIRTH (OR OTHER DAY OF INTEREST)";
250 INPUT M,D,Y
260 PRINT
270 LET I1 = INT((Y-1500)/100)
280 REM TEST FOR DATE BEFORE CURRENT CALENDAR.
290 IF Y-1582 <0 THEN 1300
300 LET A = I1*5+(I1+3)/4
310 LET I2=INT(A-FNB(A)*7)
320 LET Y2=INT(Y/100)
330 LET Y3 =INT(Y-Y2*100)
340 LET A =Y3/4+Y3+D+T(M)+I2
350 LET B=INT(A-FNB(A)*7)+1
360 IF M > 2 THEN 470
370 IF Y3 = 0 THEN 440
380 LET T1=INT(Y-FNA(Y)*4)
390 IF T1 <> 0 THEN 470
400 IF B<>0 THEN 420
410 LET B=6
420 LET B = B-1
430 GOTO 470
440 LET A = I1-1
450 LET T1=INT(A-FNA(A)*4)
460 IF T1 = 0 THEN 400
470 IF B <>0 THEN 490
480 LET B = 7
490 IF (Y1*12+M1)*31+D1<(Y*12+M)*31+D THEN 550
500 IF (Y1*12+M1)*31+D1=(Y*12+M)*31+D THEN 530
510 PRINT M;"/";D;"/";Y;" WAS A ";
520 GOTO 570
530 PRINT M;"/";D;"/";Y;" IS A ";
540 GOTO 570
550 PRINT M;"/";D;"/";Y;" WILL BE A ";
560 REM PRINT THE DAY OF THE WEEK THE DATE FALLS ON.
570 IF B <>1 THEN 590
580 PRINT "SUNDAY."
590 IF B<>2 THEN 610
600 PRINT "MONDAY."
610 IF B<>3 THEN 630
620 PRINT "TUESDAY."
630 IF B<>4 THEN 650
640 PRINT "WEDNESDAY."
650 IF B<>5 THEN 670
660 PRINT "THURSDAY."
670 IF B<>6 THEN 690
680 GOTO 1250
690 IF B<>7 THEN 710
700 PRINT "SATURDAY."
710 IF (Y1*12+M1)*31+D1=(Y*12+M)*31+D THEN 1120
720 LET I5=Y1-Y
730 PRINT
740 LET I6=M1-M
750 LET I7=D1-D
760 IF I7>=0 THEN 790
770 LET I6= I6-1
780 LET I7=I7+30
790 IF I6>=0 THEN 820
800 LET I5=I5-1
810 LET I6=I6+12
820 IF I5<0 THEN 1310
830 IF I7 <> 0 THEN 850
835 IF I6 <> 0 THEN 850
840 PRINT"***HAPPY BIRTHDAY***"
850 PRINT " "," ","YEARS","MONTHS","DAYS"
855 PRINT " "," ","-----","------","----"
860 PRINT "YOUR AGE (IF BIRTHDATE) ",I5,I6,I7
870 LET A8 = (I5*365)+(I6*30)+I7+INT(I6/2)
880 LET K5 = I5
890 LET K6 = I6
900 LET K7 = I7
910 REM CALCULATE RETIREMENT DATE.
920 LET E = Y+65
930 REM CALCULATE TIME SPENT IN THE FOLLOWING FUNCTIONS.
940 LET F = .35
950 PRINT "YOU HAVE SLEPT ",
960 GOSUB 1370
970 LET F = .17
980 PRINT "YOU HAVE EATEN ",
990 GOSUB 1370
1000 LET F = .23
1010 IF K5 > 3 THEN 1040
1020 PRINT "YOU HAVE PLAYED",
1030 GOTO 1080
1040 IF K5 > 9 THEN 1070
1050 PRINT "YOU HAVE PLAYED/STUDIED",
1060 GOTO 1080
1070 PRINT "YOU HAVE WORKED/PLAYED",
1080 GOSUB 1370
1085 GOTO 1530
1090 PRINT "YOU HAVE RELAXED ",K5,K6,K7
1100 PRINT
1110 PRINT TAB(16);"*** YOU MAY RETIRE IN";E;" ***"
1120 PRINT
1140 PRINT
1200 PRINT
1210 PRINT
1220 PRINT
1230 PRINT
1240 END
1250 IF D=13 THEN 1280
1260 PRINT "FRIDAY."
1270 GOTO 710
1280 PRINT "FRIDAY THE THIRTEENTH---BEWARE!"
1290 GOTO 710
1300 PRINT "NOT PREPARED TO GIVE DAY OF WEEK PRIOR TO MDLXXXII. "
1310 GOTO 1140
1320 REM TABLE OF VALUES FOR THE MONTHS TO BE USED IN CALCULATIONS.
1330 DATA 0, 3, 3, 6, 1, 4, 6, 2, 5, 0, 3, 5
1340 REM THIS IS THE CURRENT DATE USED IN THE CALCULATIONS.
1350 REM THIS IS THE DATE TO BE CALCULATED ON.
1360 REM CALCULATE TIME IN YEARS, MONTHS, AND DAYS
1370 LET K1=INT(F*A8)
1380 LET I5 = INT(K1/365)
1390 LET K1 = K1- (I5*365)
1400 LET I6 = INT(K1/30)
1410 LET I7 = K1 -(I6*30)
1420 LET K5 = K5-I5
1430 LET K6 =K6-I6
1440 LET K7 = K7-I7
1450 IF K7>=0 THEN 1480
1460 LET K7=K7+30
1470 LET K6=K6-1
1480 IF K6>0 THEN 1510
1490 LET K6=K6+12
1500 LET K5=K5-1
1510 PRINT I5,I6,I7
1520 RETURN
1530 IF K6=12 THEN 1550
1540 GOTO 1090
1550 LET K5=K5+1
1560 LET K6=0
1570 GOTO 1090
1580 REM
1590 END
This program takes the current date and a date of birth and returns some statistics, e.g. how long you have lived and how many days you have slept.
For part of the assignment, I have to explain what each variable means in the old BASIC program. In those days, variable names could only be things like A1, B3, etc.
In this program there is an array, call it
DATA = [0, 3, 3, 6, 1, 4, 6, 2, 5, 0, 3, 5]
There are 12 numbers in this array. I realized that the program reads each number and matches it to a month from Jan to Dec, and I also found out that it is used to calculate which day of the week a date falls on, e.g. Monday or Tuesday.
That is what I have found so far, but can anybody explain to me what the numbers in the DATA array mean exactly?
Thanks.
Without pulling all the code apart, it looks like each number is the weekday offset of the first day of that month relative to the first day of January...
Assume Jan 1st is a Tuesday (like 2013)...
Jan 0 Tuesday
Feb 3 Friday (Tuesday + 3)
Mar 3 Friday (Tuesday + 3)
Apr 6 Monday (Tuesday + 6)
etc...
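This seems to assume it's not a leap year; otherwise the numbers from March onwards would need to be increased by 1 to allow for the extra day. The program appears to handle leap years differently, by subtracting 1 from the result for January and February dates (lines 360-460).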
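One way to check this reading is to rebuild the table from the month lengths of a non-leap year: the number of days elapsed before the first of each month, modulo 7, reproduces the DATA values. The sketch below also transliterates BASIC lines 270-480 into Python as an illustration. The names `MONTH_OFFSETS` and `basic_weekday` are made up here, and the translation is an attempt at reading the listing rather than a verified port:

```python
from itertools import accumulate

DAYS_IN_MONTH = [31, 28, 31, 30, 31, 30, 31, 31, 30, 31, 30, 31]  # non-leap year

# Days elapsed before the 1st of each month, reduced modulo 7.
MONTH_OFFSETS = [days % 7 for days in accumulate([0] + DAYS_IN_MONTH[:-1])]
print(MONTH_OFFSETS)  # [0, 3, 3, 6, 1, 4, 6, 2, 5, 0, 3, 5]


def basic_weekday(month, day, year):
    """Transliteration of BASIC lines 270-480; returns 1 (Sunday) .. 7 (Saturday)."""
    if year < 1582:
        raise ValueError("dates before 1582 are rejected by the original program")
    i1 = (year - 1500) // 100                 # line 270: centuries since 1500
    a = i1 * 5 + (i1 + 3) / 4                 # line 300: century correction
    i2 = int(a - int(a / 7) * 7)              # line 310: correction modulo 7
    y3 = year % 100                           # lines 320-330: year within century
    a = y3 / 4 + y3 + day + MONTH_OFFSETS[month - 1] + i2   # line 340
    b = int(a - int(a / 7) * 7) + 1           # line 350
    if month <= 2:                            # lines 360-460: leap-year adjustment
        if y3 == 0:
            is_leap = (i1 - 1) % 4 == 0       # century years: 1600, 2000, ...
        else:
            is_leap = year % 4 == 0
        if is_leap:
            b -= 1
    return 7 if b == 0 else b                 # lines 470-480


print(basic_weekday(3, 24, 1979))  # 7, i.e. Saturday
```

Spot checks against known dates (for example 1 January 2000 and 29 February 2024) can be made with `datetime.date(...).weekday()`, keeping in mind that `datetime` numbers Monday as 0 while the BASIC program numbers Sunday as 1.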
