I have data in the below format.
index timestamps(s) Bytes
0 0.0 0
1 0.1 9
2 0.2 10
3 0.3 8
4 0.4 8
5 0.5 9
6 0.6 7
7 0.7 8
8 0.8 7
9 0.9 6
It is in a pandas DataFrame (though the format does not matter). I want to divide the data into smaller portions (called windows). Each portion should be of fixed duration (0.3 seconds), and I then want to compute the average of the bytes in each window. I want the start and end row indices for each window, like below:
win_start_ind = [1 4 7]
win_end_ind = [3 6 9]
I intend to then use these indices to compute the average number of bytes in each window.
I would appreciate Python code.
John Galt suggests a simple alternative that works well for your problem.
g = df.groupby(df['timestamps(s)']//0.3*0.3).Bytes.mean().reset_index()
A generic solution that would work for any datetime data involves pd.to_datetime and pd.Grouper.
df['timestamps(s)'] = pd.to_datetime(df['timestamps(s)'], format='%S.%f') # 1
g = df.groupby(pd.Grouper(key='timestamps(s)', freq='0.3S')).Bytes\
.mean().reset_index() # 2
g['timestamps(s)'] = g['timestamps(s)']\
.dt.strftime('%S.%f').astype(float) # 3
g
timestamps(s) Bytes
0 0.0 6.333333
1 0.3 8.333333
2 0.6 7.333333
3 0.9 6.000000
g.Bytes.values
array([ 6.33333333, 8.33333333, 7.33333333, 6. ])
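If you specifically need the start and end row indices for each window (as asked), a minimal sketch built on the same floor-division grouping as the first one-liner (so using the original numeric timestamps, before any datetime conversion) could look like this. Note that with the sample data it yields [0, 3, 6, 9] and [2, 5, 8, 9], since it does not skip the zero-byte first row the way the question's expected output does:

win = df['timestamps(s)'] // 0.3
grouped = df.groupby(win)
win_start_ind = grouped.apply(lambda g: g.index[0]).tolist()   # first row of each window
win_end_ind = grouped.apply(lambda g: g.index[-1]).tolist()    # last row of each window
avg_bytes = grouped.Bytes.mean().tolist()                      # average Bytes per window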
Well, here is a possible solution that is not pandas-aware, to obtain the two lists of indices as requested, assuming your data is accessible as a two-dimensional array where the first dimension is rows:
win_start_ind = []
win_end_ind = []
last = last_nonzerobyte_idx = first_ts = None
for i, ts, byt in data:  # (1)
    if not byt:
        continue
    if first_ts is None:
        first_ts = ts
    # round() guards against float error, e.g. (0.7 - 0.4) * 10 == 2.999...
    win_num = int(round((ts - first_ts) * 10) // 3)  # (2)
    if win_num >= 1 or not win_start_ind:
        if win_start_ind:
            win_end_ind.append(last_nonzerobyte_idx)
        win_start_ind.append(i)
        last = win_num
        first_ts = ts
    last_nonzerobyte_idx = i
win_end_ind.append(last_nonzerobyte_idx)
(1) This line just loops through your array and assigns each row's contents to variables; you will have to adapt it to your situation. You can also loop through the array assigning the complete row to a single variable, and on the next line extract the data you need into the relevant variables. See the DataFrame docs (N-dimensional arrays, indexing in NumPy) to tailor this code to your needs.
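As a small sketch of this step for the DataFrame in the question (my assumption; the answer deliberately leaves the iteration open), itertuples can supply the (index, timestamp, bytes) triples:

# Hedged sketch: feed the loop from the question's DataFrame.
# Selecting the two columns and calling itertuples() (which includes
# the index by default) yields triples matching `for i, ts, byt in data`.
data = df[['timestamps(s)', 'Bytes']].itertuples()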
(2) This line is what tells us when a new time window starts: if win_num is 0, we are still in the same time window; if it reaches 1, it is time to:
add the last non-zero-byte row index to win_end_ind
add the current index to win_start_ind
set first_ts to the current timestamp, so that ts - first_ts gives the time elapsed since the beginning of the current window.
I got the answer to my question using a pandas built-in function, as follows.
As I mentioned, I wanted to partition my data into fixed-duration windows (or bins). Note that I tested the function only with Unix timestamps (the timestamp values in my question above were hypothetical, for simplicity).
The solution is copied from the link, as follows:
import pandas as pd
import datetime
import numpy as np
# Create an empty dataframe
df = pd.DataFrame()
# Create a column from the timestamps series
df['timestamps'] = timestamps
# Convert that column into a datetime datatype
df['timestamps'] = pd.to_datetime(df['timestamps'])
# Set the datetime column as the index
df.index = df['timestamps']
# Create a column from the numeric Bytes series
df['Bytes'] = Bytes
# Now for my original data
# Downsample the series into 30S bins and sum the values of the Bytes
# falling into a bin.
window = df.Bytes.resample('30S').sum()
My output:
1970-01-01 00:00:00 10815752
1970-01-01 00:00:30 6159960
1970-01-01 00:01:00 40270
1970-01-01 00:01:30 44196
1970-01-01 00:02:00 48084
1970-01-01 00:02:30 47147
1970-01-01 00:03:00 45279
1970-01-01 00:03:30 40574
In the output:
First column ==> Time Windows for 30 seconds duration
Second column ==> Sum of all Bytes in the 30 seconds bin
You may also try other options of the function, such as mean, last, etc. For more details, read the documentation.
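For example, a quick sketch with the same df, swapping in other aggregations:

# Same 30-second resampling, different aggregations (a sketch, not from the link):
mean_per_bin = df.Bytes.resample('30S').mean()    # average Bytes per bin
last_per_bin = df.Bytes.resample('30S').last()    # last observation in each bin
several = df.Bytes.resample('30S').agg(['sum', 'mean', 'max'])  # several at once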
I'm looking to have a two-level index, of which one level is of type datetime and the other is int. The time level I'd like to resample to 1min, and the int level I'd like to bin into intervals of 5.
Currently I've only done the first part, but I've left the second level untouched:
x = w.groupby([pd.Grouper(level='time', freq='1min'), pd.Grouper(level=1)]).sum()
The problem is that it's not good to use bins generated from the entire range of data for pd.cut(), because most of them will be empty. I want to limit the bins to the context of each 5-second interval.
In other words, I want to replace the second argument (pd.Grouper(level=1)) with pd.cut(rows_from_level0, my_bins), where my_bins is an array, in steps of 5, spanning the values of the respective group (e.g. for [34, 54, 29, 31] -> [30, 35, 40, 45, 50, 55]).
How my_bins is computed can be seen below:
import numpy as np

def roundTo(num, base=5):
    return base * round(num / base)

arr_min = roundTo(min(arr))
arr_max = roundTo(max(arr))
dif = arr_max - arr_min
my_bins = np.linspace(arr_min, arr_max, dif // 5 + 1)
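Using the example values from above:

arr = [34, 54, 29, 31]
# roundTo(29) -> 30 and roundTo(54) -> 55, so dif == 25 and
# my_bins == np.linspace(30, 55, 6) -> array([30., 35., 40., 45., 50., 55.])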
Basically I'm not sure how to make the second level pd.cut aware of the rows from the first level index in order to produce the bins.
One way to go is to extract the level values, do some math, then groupby on that:
N = 5
df.groupby([pd.Grouper(level='datetime', freq='1min'),
df.index.get_level_values(level=1)//N * N]
).sum()
You would get something similar to this:
data
datetime lvl1
2021-01-01 00:00:00 5 9
15 1
25 4
60 9
2021-01-01 00:01:00 5 8
25 7
85 2
90 6
2021-01-01 00:02:00 0 9
70 8
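For reference, a small self-contained setup that produces a frame of this shape (the index names datetime/lvl1 and all values here are made up for illustration):

import numpy as np
import pandas as pd

# Hypothetical two-level frame, for illustration only
rng = np.random.default_rng(0)
idx = pd.MultiIndex.from_arrays(
    [pd.to_datetime(['2021-01-01 00:00:05'] * 4 + ['2021-01-01 00:01:10'] * 4),
     rng.integers(0, 100, 8)],
    names=['datetime', 'lvl1'])
df = pd.DataFrame({'data': rng.integers(0, 10, 8)}, index=idx)

N = 5
out = df.groupby([pd.Grouper(level='datetime', freq='1min'),
                  df.index.get_level_values(level=1) // N * N]).sum()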
Consider the following dataframe
df = pd.DataFrame()
df['Amount'] = [13,17,31,48]
I want to calculate for each row the std of the previous 2 values of the column "Amount". For example:
For the third row, the value should be the std of 17 and 13 (which is 2).
For the fourth row, the value should be the std of 31 and 17 (which is 7).
This is what I did:
df['std previous 2 weeks'] = df['Amount'].shift(1).rolling(2).std()
But this is not working. I thought my problem was an index problem, but the same pattern works perfectly with the sum method:
df['total amount of previous 2 weeks'] = df['Amount'].shift(1).rolling(2).sum()
PS: I know that this can be done in other ways, but I want to know the reason why this does not work (and how to fix it).
You could shift after rolling + std. Also, the degrees of freedom (ddof) defaults to 1; it seems you want it to be 0.
df['Stdev'] = df['Amount'].rolling(2).std(ddof=0).shift()
Output:
Amount Stdev
0 13 NaN
1 17 NaN
2 31 2.0
3 48 7.0
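Incidentally, the shift-then-roll order from the question was not the culprit; once ddof=0 is set it produces the same numbers:

# The original order also works with ddof=0:
df['std previous 2 weeks'] = df['Amount'].shift(1).rolling(2).std(ddof=0)
# index 2 -> population std of [13, 17] = 2.0
# index 3 -> population std of [17, 31] = 7.0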
I have a df with 2 columns. One is the timestamp in microseconds and the other is a value. It looks like this:
time score
83620 4
83621 4
83622 4
83623 4
83624 4
83625 4
83626 4
83627 4
83628 4
83629 4
83630 4
83631 4
83632 4
83633 5
83634 5
83635 5
83636 5
83637 5
83638 5
83639 6
83640 1
83641 1
83642 4
I want to convert df.time to milliseconds and aggregate df.score by the mode. It should look like this:
time score
8362 4
8363 5
8364 1
Try:
df.groupby(df['time'] // 10)['score'].apply(lambda x: x.mode()[0])
Output:
time
8362 4
8363 5
8364 1
Name: score, dtype: int64
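To get back a two-column frame like the desired output, you can reset the index (a small follow-up sketch; note that mode() can return several values on ties, and [0] keeps the smallest):

out = (df.groupby(df['time'] // 10)['score']
         .apply(lambda x: x.mode()[0])
         .reset_index())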
Two approaches:
Using resample; I have only learned about it today and have not tried it yet, but it looks powerful.
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.resample.html
My favorite way to do this:
df["milliseconds"] = np.round(df["time"] / 1000, 0) # For cutoff, consider // 1000
df = df.groupby("milliseconds").agg(score=("score", "mode")).reset_index()
If time-critical, consider doing the millisecond calculation with .apply() or a list comprehension; if you use apply, remember that lambda functions have some overhead.
For very large samples, numpy will probably be slightly faster.
resample will probably be faster than groupby, but the groupby version is quite easy.
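For completeness, a hedged sketch of that resample route (my own guess at it, since the answer only links to the docs; it starts again from the original df and treats time as true microseconds):

import numpy as np
import pandas as pd

# Build a datetime index from the microsecond timestamps, then
# resample into 1 ms bins and take the first mode in each bin.
ts = pd.to_datetime(df['time'], unit='us')
per_ms = (df.set_index(ts)['score']
            .resample('1ms')
            .apply(lambda s: s.mode()[0] if not s.empty else np.nan))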
First, convert the time column so it contains milliseconds. One microsecond is 0.001 milliseconds, so the following code converts your time column to milliseconds:
df['time'] = df['time'] / 1000
Then group by the desired column, in this case score, and specify that you want the time column aggregated by the mode. This can be done using the following code:
df.groupby(['score']).apply(pd.DataFrame.mode).reset_index(drop=True)
I'm relatively new to python, and have been trying to calculate some simple rolling weighted averages across rows in a pandas data frame. I have a dataframe of observations df and a dataframe of weights w. I create a new dataframe to hold the inner-product between these two sets of values, dot.
As w has fewer rows, I use a for loop to calculate the weighted average row by row, over a number of leading rows equal to the length of w.
More clearly, my set-up is as follows:
import pandas as pd
df = pd.DataFrame([0,1,2,3,4,5,6,7,8], index = range(0,9))
w = pd.DataFrame([0.1,0.25,0.5], index = range(0,3))
dot = pd.DataFrame(0, columns = ['dot'], index = df.index)
for i in range(0, len(df)):
    dot.loc[i] = sum(df.iloc[max(1, (i-3)):i].values * w.iloc[-min(3, (i-1)):4].values)
I would expect the result to be as follows (i.e. when i = 4)
dot.loc[4] = sum(df.iloc[max(1,(4-3)):4].values * w.iloc[-min(3,(4-1)):4].values)
print dot.loc[4] #2.1
However, when running the for loop above, I receive the error:
ValueError: operands could not be broadcast together with shapes (0,1) (2,1)
This is where I get confused: I think it must have to do with how i is passed into iloc, since I don't receive shape errors when I calculate it manually, as in the example with 4 above. However, looking at other examples and the documentation, I don't see why that's the case. Any help is appreciated.
Your first problem is that you are trying to multiply arrays of two different sizes. For example, when i=0 the different parts of your for loop return
df.iloc[max(1,(0-3)):0].values.shape
# (0,1)
w.iloc[-min(3,(0-1)):4].values.shape
# (2,1)
That is exactly the error you are getting. The easiest way I can think of to make the arrays broadcastable is to pad your dataframe with leading zeros, using concatenation:
df2 = pd.concat([pd.Series([0,0]),df], ignore_index=True)
df2
0
0 0
1 0
2 0
3 1
4 2
5 3
6 4
7 5
8 6
9 7
10 8
While you can now use your for loop (with some minor tweaking):
for i in range(len(df)):
dot.loc[i] = sum(df2.iloc[max(0,(i)):i+3].values * w.values)
A nicer way might be the one JohnE suggested: use the rolling and apply functions built into pandas, thereby getting rid of your for loop:
import numpy as np
df2.rolling(3,min_periods=3).apply(lambda x: np.dot(x,w))
0
0 NaN
1 NaN
2 0.00
3 0.50
4 1.25
5 2.10
6 2.95
7 3.80
8 4.65
9 5.50
10 6.35
You can also drop the first two padding rows and reset the index
df2.rolling(3,min_periods=3).apply(lambda x: np.dot(x,w)).drop([0,1]).reset_index(drop=True)
0
0 0.00
1 0.50
2 1.25
3 2.10
4 2.95
5 3.80
6 4.65
7 5.50
8 6.35
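One hedged caveat: newer pandas versions expect the function passed to rolling().apply() to return a scalar, while np.dot against the single-column DataFrame w returns a length-1 array. Extracting w's column (and passing raw=True for speed) avoids that:

# Variant for newer pandas: return a scalar, not a length-1 array.
# w[0] is w's single column; raw=True passes plain numpy arrays to the lambda.
df2.rolling(3, min_periods=3).apply(lambda x: np.dot(x, w[0].values), raw=True)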
I want to convert a column of a Pandas DataFrame from an object to a number (e.g., float64). The DataFrame is the following:
import pandas as pd
import numpy as np
import datetime as dt
df = pd.read_csv('data.csv')
df
ID MIN
0 201167 32:59:00
1 203124 14:23
2 101179 8:37
3 200780 5:22
4 202699 NaN
5 203117 NaN
6 202331 36:05:00
7 2561 30:43:00
I would like to convert the MIN column from type object to a number (e.g., float64). For example, 32:59:00 should become 32.983333.
I'm not sure if it's necessary as an initial step, but I can convert each NaN to 0 via:
df['MIN'] = np.where(pd.isnull(df['MIN']), '0', df['MIN'])
How can I efficiently convert the entire column? I've tried variations of dt.datetime.strptime(), df['MIN'].astype('datetime64'), and pd.to_datetime(df['MIN']) with no success.
Defining a converter function:
def str_to_number(time_str):
if not isinstance(time_str, str):
return 0
minutes, sec, *_ = [int(x) for x in time_str.split(':')]
return minutes + sec / 60
and applying it to the MIN column:
df.MIN = df.MIN.map(str_to_number)
works.
Before:
ID MIN
0 1 32:59:00
1 2 NaN
2 3 14:23
After:
ID MIN
0 1 32.983333
1 2 0.000000
2 3 14.383333
The above is for Python 3. This works for Python 2:
def str_to_number(time_str):
if not isinstance(time_str, str):
return 0
entries = [int(x) for x in time_str.split(':')]
minutes = entries[0]
sec = entries[1]
return minutes + sec / 60.0
Note the 60.0. Alternatively, use from __future__ import division to avoid the integer-division problem.
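For larger frames, a vectorized sketch (my addition, not part of the answer above; it assumes every non-null entry is colon-separated, with minutes first and seconds second):

# Split "MM:SS[:00]" strings into columns, combine minutes and seconds,
# and turn NaN rows into 0, matching the converter's behaviour.
parts = df['MIN'].str.split(':', expand=True).astype(float)
df['MIN'] = (parts[0] + parts[1] / 60).fillna(0)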