Groupby on second level index that respects the previous level - python

I'm looking to have a two-level index, of which one level is of type datetime and the other is an int. I'd like to resample the time level into 1-minute bins, and bin the int level into intervals of 5.
Currently I've only done the first part, but I've left the second level untouched:
x = w.groupby([pd.Grouper(level='time', freq='1min'), pd.Grouper(level=1)]).sum()
The problem is that it's not good to use bins generated from the entire range of the data for pd.cut(), because most of them will be empty. I want to limit the bins to the context of each 5-second interval.
In other words, I want to replace the second argument (pd.Grouper(level=1)) with pd.cut(rows_from_level0, my_bins), where my_bins is an array built from the respective 5-second group's values, in steps of 5 (e.g. for [34, 54, 29, 31] -> [30, 35, 40, 45, 50, 55]).
How my_bins is computed can be seen below:
import numpy as np

def roundTo(num, base=5):
    return base * round(num / base)

arr = [34, 54, 29, 31]          # second-level values of one group
arr_min = roundTo(min(arr))     # 30
arr_max = roundTo(max(arr))     # 55
dif = arr_max - arr_min
my_bins = np.linspace(arr_min, arr_max, dif // 5 + 1)   # [30, 35, 40, 45, 50, 55]
Basically I'm not sure how to make the second level pd.cut aware of the rows from the first level index in order to produce the bins.

One way to go is to extract the level values, do some math, then groupby on that:
N = 5
df.groupby([pd.Grouper(level='datetime', freq='1min'),
            df.index.get_level_values(level=1) // N * N]
           ).sum()
You would get something similar to this:
                          data
datetime            lvl1
2021-01-01 00:00:00 5        9
                    15       1
                    25       4
                    60       9
2021-01-01 00:01:00 5        8
                    25       7
                    85       2
                    90       6
2021-01-01 00:02:00 0        9
                    70       8
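The // N * N expression simply floors each second-level value down to the nearest multiple of N, so no explicit bin edges are needed. A quick illustration, using the example values from the question:

import numpy as np
vals = np.array([34, 54, 29, 31])
print(vals // 5 * 5)   # -> [30 50 25 30], i.e. each value is labelled by its 5-wide bucket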

Related

Create a custom percentile rank for a pandas series

I need to calculate the percentile using a specific algorithm that is not available using either pandas.rank() or numpy.rank().
The ranking algorithm is calculated as follows for a series:
rank[i] = (# of values less than i + 0.5 * # of other values equal to i) / total # of values
so if I had the following series:
s=pd.Series(data=[5,3,8,1,9,4,14,12,6,1,1,4,15])
For the first element, 5, there are 6 values less than 5 and no other values equal to 5. The rank would be (6 + 0*0.5)/13, or 6/13.
For the fourth element, 1, it would be (0 + 2*0.5)/13, or 1/13.
How could I calculate this without using a loop? I assume a combination of s.apply and/or s.where() but can't figure it out and have tried searching. I am looking to apply to the entire series at once, with the result being a series with the percentile ranks.
You could use numpy broadcasting. First convert s to a numpy column array. Then use numpy broadcasting to count the number of items less than i for each i. Then count the number of items equal to i for each i (note that we need to subtract 1, since i is always equal to itself). Finally, add them and build a Series:
tmp = s.to_numpy()
s_col = tmp[:, None]
less_than_i_count = (s_col>tmp).sum(axis=1)
eq_to_i_count = ((s_col==tmp).sum(axis=1) - 1) * 0.5
ranks = pd.Series((less_than_i_count + eq_to_i_count) / len(s), index=s.index)
Output:
0 0.461538
1 0.230769
2 0.615385
3 0.076923
4 0.692308
5 0.346154
6 0.846154
7 0.769231
8 0.538462
9 0.076923
10 0.076923
11 0.346154
12 0.923077
dtype: float64
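For reference, the same numbers can also be reproduced with pandas' built-in average ranking, since the average rank of a value is (# less than it) + (# equal to it + 1)/2; this is a hedged one-liner sketch, not part of the original answer:

ranks = (s.rank(method='average') - 1) / len(s)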

Calculate average based on available data points

Imagine I have the following data frame:
Product   Month 1   Month 2   Month 3   Month 4   Total
Stuff A         5         0         3         3      11
Stuff B        10        11         4         8      33
Stuff C         0         0        23        30      53
that can be constructed from:
df = pd.DataFrame({'Product': ['Stuff A', 'Stuff B', 'Stuff C'],
                   'Month 1': [5, 10, 0],
                   'Month 2': [0, 11, 0],
                   'Month 3': [3, 4, 23],
                   'Month 4': [3, 8, 30],
                   'Total': [11, 33, 53]})
This data frame shows the amount of units sold per product, per month.
Now, what I want to do is to create a new column called "Average" that calculates the average units sold per month. HOWEVER, notice in this example that Stuff C's values for months 1 and 2 are 0. This product was probably introduced in Month 3, so its average should be calculated based on months 3 and 4 only. Also notice that Stuff A's units sold in Month 2 were 0, but that does not mean the product was introduced in Month 3 since 5 units were sold in Month 1. That is, its average should be calculated based on all four months. Assume that the provided data frame may contain any number of months.
Based on these conditions, I have come up with the following solution in pseudo-code:
months = ["list of index names of months to calculate"]
x = len(months)
if df["Month 1"] != 0:
    df["Average"] = df["Total"] / x
elif df["Month 2"] != 0:
    df["Average"] = df["Total"] / (x - 1)
...
elif df["Month " + str(x)] != 0:
    df["Average"] = df["Total"] / 1
else:
    df["Average"] = 0
That way, the average would be calculated starting from the first month where units sold are different from 0. However, I haven't been able to translate this logical abstraction into actual working code. I couldn't manage to iterate over len(months) while maintaining the elif conditions. Or maybe there is a better, more practical approach.
I would appreciate any help, since I've been trying to crack this problem for a while with no success.
There is a numpy method, np.trim_zeros, that trims leading and/or trailing zeros. Using a list comprehension, you can iterate over the relevant DataFrame rows, trim the leading zeros, and find the average of what remains for each row.
Note that since 'Month 1' to 'Month 4' are consecutive, you can slice the columns between them using .loc.
import numpy as np
df['Average Sales'] = [np.trim_zeros(row, trim='f').mean()
                       for row in df.loc[:, 'Month 1':'Month 4'].to_numpy()]
Output:
Product Month 1 Month 2 Month 3 Month 4 Total Average Sales
0 Stuff A 5 0 3 3 11 2.75
1 Stuff B 10 11 4 8 33 8.25
2 Stuff C 0 0 23 30 53 26.50
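As a quick illustration of the trim='f' argument (trim the front only), not part of the original answer:

import numpy as np
np.trim_zeros(np.array([0, 0, 23, 30]), trim='f')   # -> array([23, 30]); leading zeros removed, trailing kept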
Try:
df = df.set_index(['Product','Total'])
df['Average'] = df.where(df.ne(0).cummax(axis=1)).mean(axis=1)
df_out=df.reset_index()
print(df_out)
Output:
Product Total Month 1 Month 2 Month 3 Month 4 Average
0 Stuff A 11 5 0 3 3 2.75
1 Stuff B 33 10 11 4 8 8.25
2 Stuff C 53 0 0 23 30 26.50
Details:
Move Product and Total into the dataframe index, so we can do the calculation on the rest of the dataframe.
First create a boolean matrix with ne (not equal to zero). Then use cummax along the rows, which means that once a non-zero value appears, the entry stays True until the end of the row; if a row starts with zeros, it stays False until the first non-zero value, then turns to True and remains True.
Next, use pd.DataFrame.where to keep only the values where that boolean matrix is True; the other values (the leading zeros) become NaN and are not used in the calculation of mean.
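A minimal sketch of that mask on a made-up two-row frame (column names assumed for illustration only):

import pandas as pd

m = pd.DataFrame({'Month 1': [5, 0], 'Month 2': [0, 0], 'Month 3': [3, 23]})
mask = m.ne(0).cummax(axis=1)
# row 0: [True, True, True]   -> every month counts, mean = (5 + 0 + 3) / 3
# row 1: [False, False, True] -> leading zeros dropped, mean = 23
print(m.where(mask).mean(axis=1))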
If you don't mind it being a little memory inefficient, you could put your dataframe into a numpy array. Numpy has a built-in function to remove zeroes from an array, and then you could use the mean function to calculate the average. It could look something like this:
import numpy as np
arr = np.array(Stuff_A_DF)
mean = arr[np.nonzero(arr)].mean()
Alternatively, you could manually extract the row to a list, then loop through to remove the zeroes.
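A hedged sketch of that manual alternative for a single row (list values taken from the 'Stuff C' example):

row = [0, 0, 23, 30]
vals = [v for v in row if v != 0]           # drop the zeros
avg = sum(vals) / len(vals) if vals else 0  # 26.5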

Pandas: row operations on a column, given one reference value on a different column

I am working with a database that looks like the below. For each fruit (just apple and pears below, for conciseness), we have:
1. yearly sales,
2. current sales,
3. monthly sales, and
4. the standard deviation of sales.
Their ordering may vary, but it's always 4 values per fruit.
dataset = {'apple_yearly_avg': [57],
           'apple_sales': [100],
           'apple_monthly_avg': [80],
           'apple_st_dev': [12],
           'pears_monthly_avg': [33],
           'pears_yearly_avg': [35],
           'pears_sales': [40],
           'pears_st_dev': [8]}
df = pd.DataFrame(dataset).T            # transpose
df = df.reset_index()                   # reset the index
df.columns = ['Description', 'Value']   # name the 2 columns
I would like to perform two sets of operations.
For the first set of operations, we isolate a fruit price, say 'pears', and subtract each average sales from current sales.
df_pear = df[df.loc[:, 'Description'].str.contains('pear')]
df_pear['temp'] = df_pear['Value'].where(df_pear.Description.str.contains('sales')).bfill()
df_pear['some_op'] = df_pear['Value'] - df_pear['temp']
The above works by creating a temporary column holding the pear_sales value of 40, backfilling it, and then using it to subtract values.
Question 1: is there a cleaner way to perform this operation without a temporary array? Also, I do get the common warning saying I should use .loc[row_indexer, col_indexer], even though the output still works.
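(On the warning: one common way to avoid SettingWithCopyWarning, assuming it comes from assigning into a filtered slice, is to take an explicit copy before adding columns; a minimal sketch, not necessarily the cleanest answer to Question 1:)

df_pear = df[df['Description'].str.contains('pear')].copy()   # explicit copy avoids the warning
df_pear['temp'] = df_pear['Value'].where(df_pear['Description'].str.contains('sales')).bfill()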
For the second set of operations, I need to add 5 rows (equal to 'new_purchases') to the bottom of the dataframe, and then fill df_pear['some_op'] with sales * (1 + std_dev * some_multiplier).
df_pear['temp2'] = df_pear['Value'].where(df_pear['Description'].str.contains('st_dev')).bfill()
new_purchases = 5
for i in range(new_purchases):
    df_pear = df_pear.append(df_pear.iloc[-1])  # appends 5 copies of the last row
counter = 1
for i in range(len(df_pear)-1, len(df_pear)-new_purchases, -1):  # backward loop from the bottom
    df_pear.some_op.iloc[i] = df_pear['temp'].iloc[0] * (1 + df_pear['temp2'].iloc[i] * counter)
    counter += 1
This 'backwards' loop achieves it, but again, I'm worried about readability, since there's another temporary column created and the indexing is rather ugly.
Thank you.
I think there is a cleaner way to perform both of your tasks, for each fruit in one go:
Add 2 columns, Fruit and Descr, as the result of splitting Description at the first "_":
df[['Fruit', 'Descr']] = df['Description'].str.split('_', n=1, expand=True)
To see the result you may print df now.
Define the following function to "reformat" the current group:
def reformat(grp):
    wrk = grp.set_index('Descr')
    sal = wrk.at['sales', 'Value']
    dev = wrk.at['st_dev', 'Value']
    avg = wrk.at['yearly_avg', 'Value']
    # Subtract (yearly) average
    wrk['some_op'] = wrk.Value - avg
    # New rows
    wrk2 = pd.DataFrame([wrk.loc['st_dev']] * 5).assign(
        some_op=[sal * (1 + dev * i) for i in range(5, 0, -1)])
    return pd.concat([wrk, wrk2])  # Old and new rows
Apply this function to each group, grouped by Fruit, drop the Fruit column and save the result back in df:
df = df.groupby('Fruit').apply(reformat)\
    .reset_index(drop=True).drop(columns='Fruit')
Now, when you print(df), the result is:
Description Value some_op
0 apple_yearly_avg 57 0
1 apple_sales 100 43
2 apple_monthly_avg 80 23
3 apple_st_dev 12 -45
4 apple_st_dev 12 6100
5 apple_st_dev 12 4900
6 apple_st_dev 12 3700
7 apple_st_dev 12 2500
8 apple_st_dev 12 1300
9 pears_monthly_avg 33 -2
10 pears_sales 40 5
11 pears_yearly_avg 35 0
12 pears_st_dev 8 -27
13 pears_st_dev 8 1640
14 pears_st_dev 8 1320
15 pears_st_dev 8 1000
16 pears_st_dev 8 680
17 pears_st_dev 8 360
Edit
I'm in doubt whether Description should also be replicated to the new rows from the "st_dev" row. If you want some other content there, set it in the reformat function, after wrk2 is created.

Find mean based on a fixed time duration

I have data in the below format.
index timestamps(s) Bytes
0 0.0 0
1 0.1 9
2 0.2 10
3 0.3 8
4 0.4 8
5 0.5 9
6 0.6 7
7 0.7 8
8 0.8 7
9 0.9 6
It is in a pandas data frame (however the format does not matter). I want to divide the data into smaller portions (called windows). Each portion should be of a fixed duration (0.3 seconds), and then I want to compute the average of the bytes in each window. I want the start and end row indices for each window, like below:
win_start_ind = [1 4 7]
win_end_ind = [3 6 9]
I then intend to use these indices to compute the average number of bytes in each window.
I would appreciate Python code.
John Galt suggests a simple alternative that works well for your problem.
g = df.groupby(df['timestamps(s)']//0.3*0.3).Bytes.mean().reset_index()
A generic solution that would work for any date data involves pd.to_datetime and pd.Grouper.
df['timestamps(s)'] = pd.to_datetime(df['timestamps(s)'], format='%S.%f')  # 1
g = df.groupby(pd.Grouper(key='timestamps(s)', freq='0.3S')).Bytes\
      .mean().reset_index()  # 2
g['timestamps(s)'] = g['timestamps(s)']\
      .dt.strftime('%S.%f').astype(float)  # 3
g
timestamps(s) Bytes
0 0.0 6.333333
1 0.3 8.333333
2 0.6 7.333333
3 0.9 6.000000
g.Bytes.values
array([ 6.33333333, 8.33333333, 7.33333333, 6. ])
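If the start and end row indices of each window are also needed, one possible sketch (assuming the integer index from the question) is to group by the same floored timestamps and take the first and last index per group:

win = df['timestamps(s)'] // 0.3
win_start_ind = df.groupby(win).head(1).index.tolist()
win_end_ind = df.groupby(win).tail(1).index.tolist()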
Well, here is a possible non-pandas solution to obtain the two lists of indices as requested, assuming your data is accessible as a two-dimensional array where the 1st dimension is rows:
win_start_ind = []
win_end_ind = []
last = last_nonzerobyte_idx = first_ts = None
for i, ts, byt in data:                       # (1)
    if not byt:
        continue
    if first_ts is None:
        first_ts = ts
    win_num = int((ts - first_ts) * 10 // 3)  # (2)
    if win_num >= 1 or not win_start_ind:
        if win_start_ind:
            win_end_ind.append(last_nonzerobyte_idx)
        win_start_ind.append(i)
        last = win_num
        first_ts = ts
    last_nonzerobyte_idx = i
win_end_ind.append(last_nonzerobyte_idx)
(1) This line just loops through your array and assigns the content of each row to variables; you have to adapt it to your situation. You can also loop through your array, assign the complete row to a single variable, and on the next line extract the data you want into the needed variables. See the dataframe docs (N-dimensional arrays, indexing in NumPy) to tailor this code to your needs.
(2) This line tells us when a new time window starts; if win_num is 0 we are still in the same time window, and if it is 1 or more it is time to:
add the last non-zero-byte row index to win_end_ind
add the current index to win_start_ind
set first_ts to the current timestamp, so that ts - first_ts gives us the relative time elapsed since the beginning of this time window.
I got the answer to my question using a pandas built-in function, as follows:
As I mentioned, I wanted to partition my data into fixed-duration windows (or bins). Note that I tested the function only with unix timestamps (the timestamp values in my question above were hypothetical, for simplicity).
The solution is copied from the Link, as follows:
import pandas as pd
import datetime
import numpy as np
# Create an empty dataframe
df = pd.DataFrame()
# Create a column from the timestamps series
df['timestamps'] = timestamps
# Convert that column into a datetime datatype
df['timestamps'] = pd.to_datetime(df['timestamps'])
# Set the datetime column as the index
df.index = df['timestamps']
# Create a column from the numeric Bytes series
df['Bytes'] = Bytes
# Now for my original data
# Downsample the series into 30S bins and sum the values of the Bytes
# falling into a bin.
window = df.Bytes.resample('30S').sum()
My output:
1970-01-01 00:00:00 10815752
1970-01-01 00:00:30 6159960
1970-01-01 00:01:00 40270
1970-01-01 00:01:30 44196
1970-01-01 00:02:00 48084
1970-01-01 00:02:30 47147
1970-01-01 00:03:00 45279
1970-01-01 00:03:30 40574
In the output:
First column ==> Time Windows for 30 seconds duration
Second column ==> Sum of all Bytes in the 30 seconds bin
You may also try more options of the function, such as mean, last, etc. For more details, read the Documentation.
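For instance, a hedged one-line variant that averages instead of summing the bytes in each 30-second bin (same df as above):

window_mean = df.Bytes.resample('30S').mean()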

is there any quick function to do looking-back calculating in pandas dataframe?

I want to implement a calculation like the following simple scenario:
value computed as the sum of daily data during the previous N days (set N = 3 in the following example)
Dataframe df: (df.index is 'date')
date value
20140718 1
20140721 2
20140722 3
20140723 4
20140724 5
20140725 6
20140728 7
......
to compute something like:
date value new
20140718 1 0
20140721 2 0
20140722 3 0
20140723 4 6 (3+2+1)
20140724 5 9 (4+3+2)
20140725 6 12 (5+4+3)
20140728 7 15 (6+5+4)
......
Currently I do this with a for loop:
df['new'] = [0] * len(df)
for idx in df.index:
    loc = df.index.get_loc(idx)
    if (loc - N) >= 0:
        tmp = df.ix[df.index[loc-3]:df.index[loc-1]]
        sum = tmp['value'].sum()
    else:
        sum = 0
    df['new'].ix[idx] = sum
But when the length of the dataframe or the value of N is very big, this calculation becomes very slow. How can I implement this faster, with a function or by other means?
Besides, what if the scenario is more complex? Thanks.
Since you want the sum of the previous three values excluding the current one, you can use rolling_apply over a window of four and sum up all but the last value.
new = rolling_apply(df, 4, lambda x:sum(x[:-1]), min_periods=4)
This is the same as shifting afterwards with a window of three:
new = rolling_apply(df, 3, sum, min_periods=3).shift()
Then
df["new"] = new["value"].fillna(0)
