Imagine I have the following data frame:
   Product  Month 1  Month 2  Month 3  Month 4  Total
   Stuff A        5        0        3        3     11
   Stuff B       10       11        4        8     33
   Stuff C        0        0       23       30     53
that can be constructed from:
import pandas as pd
df = pd.DataFrame({'Product': ['Stuff A', 'Stuff B', 'Stuff C'],
                   'Month 1': [5, 10, 0],
                   'Month 2': [0, 11, 0],
                   'Month 3': [3, 4, 23],
                   'Month 4': [3, 8, 30],
                   'Total': [11, 33, 53]})
This data frame shows the amount of units sold per product, per month.
Now, what I want to do is to create a new column called "Average" that calculates the average units sold per month. HOWEVER, notice in this example that Stuff C's values for months 1 and 2 are 0. This product was probably introduced in Month 3, so its average should be calculated based on months 3 and 4 only. Also notice that Stuff A's units sold in Month 2 were 0, but that does not mean the product was introduced in Month 3 since 5 units were sold in Month 1. That is, its average should be calculated based on all four months. Assume that the provided data frame may contain any number of months.
Based on these conditions, I have come up with the following solution in pseudo-code:
months = ["list of index names of months to calculate"]
x = len(months)
if df["Month 1"] != 0:
    df["Average"] = df["Total"] / x
elif df["Month 2"] != 0:
    df["Average"] = df["Total"] / (x - 1)
...
elif df["Month " + str(x)] != 0:
    df["Average"] = df["Total"] / 1
else:
    df["Average"] = 0
That way, the average would be calculated starting from the first month where units sold are different from 0. However, I haven't been able to translate this logical abstraction into actual working code. I couldn't manage to iterate over len(months) while maintaining the elif conditions. Or maybe there is a better, more practical approach.
I would appreciate any help, since I've been trying to crack this problem for a while with no success.
NumPy has a function, np.trim_zeros, that trims leading and/or trailing zeros. Using a list comprehension, you can iterate over the relevant DataFrame rows, trim the leading zeros and take the average of what remains in each row.
Note that since 'Month 1' to 'Month 4' are consecutive columns, you can slice the columns between them using .loc.
import numpy as np
df['Average Sales'] = [np.trim_zeros(row, trim='f').mean() for row in df.loc[:, 'Month 1':'Month 4'].to_numpy()]
Output:
Product Month 1 Month 2 Month 3 Month 4 Total Average Sales
0 Stuff A 5 0 3 3 11 2.75
1 Stuff B 10 11 4 8 33 8.25
2 Stuff C 0 0 23 30 53 26.50
Try:
df = df.set_index(['Product','Total'])
df['Average'] = df.where(df.ne(0).cummax(axis=1)).mean(axis=1)
df_out=df.reset_index()
print(df_out)
Output:
Product Total Month 1 Month 2 Month 3 Month 4 Average
0 Stuff A 11 5 0 3 3 2.75
1 Stuff B 33 10 11 4 8 8.25
2 Stuff C 53 0 0 23 30 26.50
Details:
Move Product and Total into the dataframe index, so that the calculation only involves the month columns.
First, create a boolean matrix by comparing with zero using ne. Then use cummax along the rows: once a non-zero value appears, the row stays True until its end. If a row starts with zeros, it stays False until the first non-zero value, then turns True and remains True.
Next, use pd.DataFrame.where to keep only the values where that boolean matrix is True; the other values (the leading zeros) become NaN and are ignored when calculating the mean.
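As a rough sketch of the same masking idea (not the answerer's exact code), the following keeps the original column layout and also returns 0 for a product with no sales at all, which is what the question's pseudo-code asks for. The names month_cols, started and n_active are made up for illustration, and picking columns by the 'Month ' prefix is an assumption about how the real data is labelled.

```python
import pandas as pd

df = pd.DataFrame({'Product': ['Stuff A', 'Stuff B', 'Stuff C'],
                   'Month 1': [5, 10, 0],
                   'Month 2': [0, 11, 0],
                   'Month 3': [3, 4, 23],
                   'Month 4': [3, 8, 30],
                   'Total': [11, 33, 53]})

# Assumption: every month column starts with the prefix 'Month '.
month_cols = [c for c in df.columns if c.startswith('Month ')]
months = df[month_cols]

# True from the first non-zero month onwards, row by row.
started = months.ne(0).astype(int).cummax(axis=1).astype(bool)

# Number of months that count towards the average for each product.
n_active = started.sum(axis=1)

# Divide the row total by the active months; rows with no sales get 0.
df['Average'] = months.sum(axis=1).div(n_active.where(n_active > 0)).fillna(0)
print(df)
```

Dividing by the count of active months is equivalent to taking the mean of the masked values, but it makes the "no sales at all" case explicit instead of leaving a NaN.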
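If you don't mind it being a little memory inefficient, you could put your dataframe into a NumPy array. np.nonzero returns the positions of the non-zero entries, so you can use it to drop the zeros and then call mean on what is left. Note that this drops every zero, including one in the middle of a row such as Stuff A's Month 2, so it only matches the rule in the question when zeros occur solely at the start. It could look something like this: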
import numpy as np
arr = np.array(Stuff_A_DF)
mean = arr[np.nonzero(arr)].mean()
Alternatively, you could manually extract the row to a list, then loop through to remove the zeroes.
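A minimal sketch of that manual route is below. It assumes you want to skip only the leading zeros, as the question describes, rather than every zero; the function name mean_after_first_nonzero is hypothetical.

```python
def mean_after_first_nonzero(values):
    """Average the values from the first non-zero entry onwards."""
    for i, v in enumerate(values):
        if v != 0:
            tail = values[i:]          # keep later zeros, they are real months
            return sum(tail) / len(tail)
    return 0.0                         # every value was zero


print(mean_after_first_nonzero([5, 0, 3, 3]))     # Stuff A
print(mean_after_first_nonzero([0, 0, 23, 30]))   # Stuff C
```

You could apply it to each row with something like df[month_columns].apply(lambda r: mean_after_first_nonzero(r.tolist()), axis=1), where month_columns is whatever list of month column names you have.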
I have a meteorological DataFrame, indexed by TimeStamp, and I want to find all the possible periods of 24 hours present in the DataFrame that meet these conditions:
at least 6 hours of Rainfalls with Temperature > 10°C
a minimum of 6 consecutive hours of Relative Humidity > 90%.
The hours taken in consideration may also be 'overlapped' (a period with 6 hours of both RH > 90 and Rainfalls > 0 is sufficient).
A sample DataFrame with 48 hours can be created by:
df = pd.DataFrame({'TimeStamp': pd.date_range('1/5/2015 00:00:00', periods=48, freq='H'),
'Temperature': np.random.choice( [11,12,13], 48),
'Rainfalls': [0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0.1,0.2,0.3,0.3,0.3,0.2,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0],
'RelativeHumidity': [95,95,95,95,95,95,80,80,80,80,80,80,80,80,85,85,85,85,85,85,85,85,80,80,80,80,80,80,80,80,80,80,80,80,80,80,80,80,80,80,80,80,80,80,80,80,80,80]})
df = df.set_index('TimeStamp')
As output I just want the index values (TimeStamps) at which each period with the mentioned characteristics starts. For the sample df, only the first TimeStamp should be returned.
I have tried to use the df.rolling() function but I managed to find only the 6 hours of consecutive RH > 90.
Thanks in advance for the help.
I hope I've understood your question right. This example will find all groups where Temperature > 10 and RH > 90 of minimum length of 6 and then prints the first index of these groups:
x = (df.Temperature > 10).astype(int) + (df.RelativeHumidity > 90).astype(int)
out = (
    x.groupby((x != x.shift(1)).cumsum().values)
    .apply(lambda x: x.index[0] if (x.iat[0] == 2) and len(x) > 5 else np.nan)
    .dropna()
)
print(out)
Prints:
1 2015-01-05
dtype: datetime64[ns]
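If the 24-hour window itself matters, here is a sketch under one possible reading of the question: a start time qualifies when the 24 hours beginning there contain at least 6 hours with rain and Temperature > 10, and also contain a complete run of 6 consecutive hours with RelativeHumidity > 90. The helper forward_sum and the intermediate names are made up for illustration, the temperature is fixed at 12 instead of random so the result is reproducible, and hourly data without gaps is assumed.

```python
import pandas as pd

idx = pd.date_range('2015-01-05', periods=48, freq=pd.Timedelta(hours=1))
df = pd.DataFrame({
    'Temperature': [12] * 48,  # assumption: fixed value instead of np.random.choice
    'Rainfalls': [0] * 18 + [0.1, 0.2, 0.3, 0.3, 0.3, 0.2] + [0] * 24,
    'RelativeHumidity': [95] * 6 + [80] * 8 + [85] * 8 + [80] * 26,
}, index=idx)


def forward_sum(s, n):
    """Sum of s over the n rows starting at each row (NaN if fewer than n remain)."""
    return s.iloc[::-1].rolling(n).sum().iloc[::-1]


wet_warm = ((df['Rainfalls'] > 0) & (df['Temperature'] > 10)).astype(float)
humid = (df['RelativeHumidity'] > 90).astype(float)

# Condition 1: at least 6 rainy hours above 10 degrees in the 24 hours starting here.
enough_rain = forward_sum(wet_warm, 24).ge(6)

# Condition 2: a run of 6 consecutive humid hours that fits inside the same window,
# i.e. a run that starts at one of the first 19 hours of the window.
run_starts = forward_sum(humid, 6).eq(6).astype(float)
has_humid_run = forward_sum(run_starts, 19).ge(1)

starts = df.index[(enough_rain & has_humid_run).to_numpy()]
print(starts)
```

Reversing the series before calling rolling is just a way to get forward-looking windows from a backward-looking API; windows that would run past the end of the data are treated as not qualifying.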
I am working with a database that looks like the below. For each fruit (just apple and pears below, for conciseness), we have:
1. yearly sales,
2. current sales,
3. monthly sales and
4. the standard deviation of sales.
Their ordering may vary, but it's always 4 values per fruit.
dataset = {'apple_yearly_avg': [57],
'apple_sales': [100],
'apple_monthly_avg':[80],
'apple_st_dev': [12],
'pears_monthly_avg': [33],
'pears_yearly_avg': [35],
'pears_sales': [40],
'pears_st_dev':[8]}
df = pd.DataFrame(dataset).T  # transpose
df = df.reset_index()  # move the old index into a column
df.columns = ['Description', 'Value']  # name the 2 columns
I would like to perform two sets of operations.
For the first set of operations, we isolate a fruit price, say 'pears', and subtract each average sales from current sales.
df_pear = df[df.loc[:, 'Description'].str.contains('pear')]
df_pear['temp'] = df_pear['Value'].where(df_pear.Description.str.contains('sales')).bfill()
df_pear ['some_op'] = df_pear['Value'] - df_pear['temp']
The above works, by creating a temporary column holding pear_sales of 40, backfill it and then use it to subtract values.
Question 1: Is there a cleaner way to perform this operation without a temporary column? I also get the common warning saying I should use '.loc[row_indexer, col_indexer]', even though the output is still correct.
For the second set of operations, I need to add '5' rows equal to 'new_purchases' to the bottom of the dataframe, and then fill df_pear['some_op'] with sales * (1 + std_dev * some_multiplier).
df_pear['temp2'] = df_pear['Value'].where(df_pear['Description'].str.contains('st_dev')).bfill()
new_purchases = 5
for i in range(new_purchases):
    # appends 5 copies of the last row (DataFrame.append was removed in pandas 2.0, so pd.concat is used here)
    df_pear = pd.concat([df_pear, df_pear.iloc[[-1]]])
counter = 1
for i in range(len(df_pear)-1, len(df_pear)-new_purchases, -1):  # backward loop from the bottom
    df_pear.some_op.iloc[i] = df_pear['temp'].iloc[0] * (1 + df_pear['temp2'].iloc[i] * counter)
    counter += 1
This 'backwards' loop achieves it, but again, I'm worried about readability since there's another temporary column created, and then the indexing is rather ugly?
Thank you.
I think there is a cleaner way to perform both tasks, for each fruit in one go:
Add 2 columns, Fruit and Descr, the result of splitting of Description at the first "_":
df[['Fruit', 'Descr']] = df['Description'].str.split('_', n=1, expand=True)
To see the result you may print df now.
Define the following function to "reformat" the current group:
def reformat(grp):
    wrk = grp.set_index('Descr')
    sal = wrk.at['sales', 'Value']
    dev = wrk.at['st_dev', 'Value']
    avg = wrk.at['yearly_avg', 'Value']
    # Subtract (yearly) average
    wrk['some_op'] = wrk.Value - avg
    # New rows
    wrk2 = pd.DataFrame([wrk.loc['st_dev']] * 5).assign(
        some_op=[sal * (1 + dev * i) for i in range(5, 0, -1)])
    return pd.concat([wrk, wrk2])  # Old and new rows
Apply this function to each group, grouped by Fruit, drop Fruit
column and save the result back in df:
df = df.groupby('Fruit').apply(reformat)\
.reset_index(drop=True).drop(columns='Fruit')
Now, when you print(df), the result is:
Description Value some_op
0 apple_yearly_avg 57 0
1 apple_sales 100 43
2 apple_monthly_avg 80 23
3 apple_st_dev 12 -45
4 apple_st_dev 12 6100
5 apple_st_dev 12 4900
6 apple_st_dev 12 3700
7 apple_st_dev 12 2500
8 apple_st_dev 12 1300
9 pears_monthly_avg 33 -2
10 pears_sales 40 5
11 pears_yearly_avg 35 0
12 pears_st_dev 8 -27
13 pears_st_dev 8 1640
14 pears_st_dev 8 1320
15 pears_st_dev 8 1000
16 pears_st_dev 8 680
17 pears_st_dev 8 360
Edit
I'm in doubt whether Description should also be replicated to new
rows from "st_dev" row. If you want some other content there, set it
in reformat function, after wrk2 is created.
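For Question 1 on its own (subtracting each fruit's current sales from its other values, as in the question's df_pear code), a sketch without any temporary column could look like this. It reuses the Fruit/Descr split from above; the lookup Series named sales is introduced here purely for illustration.

```python
import pandas as pd

dataset = {'apple_yearly_avg': [57],
           'apple_sales': [100],
           'apple_monthly_avg': [80],
           'apple_st_dev': [12],
           'pears_monthly_avg': [33],
           'pears_yearly_avg': [35],
           'pears_sales': [40],
           'pears_st_dev': [8]}
df = pd.DataFrame(dataset).T.reset_index()
df.columns = ['Description', 'Value']

# Split 'apple_yearly_avg' into 'apple' and 'yearly_avg'.
df[['Fruit', 'Descr']] = df['Description'].str.split('_', n=1, expand=True)

# One current-sales value per fruit, indexed by fruit name.
sales = df.loc[df['Descr'].eq('sales')].set_index('Fruit')['Value']

# Look up each row's fruit in that Series and subtract, for all fruits at once.
df['some_op'] = df['Value'] - df['Fruit'].map(sales)
print(df)
```

Because the assignment is made on df itself rather than on a filtered slice such as df_pear, the '.loc[row_indexer, col_indexer]' warning from the question should not come up here.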
I'm learning pandas and have a query about aggregate functions. Apologies for what might be a very basic question for experts on this forum :).
Here's a sample of my dataset:
EmpID Age_Range Salary
0 321 20, 35 34000
1 561 20, 35 24000
2 789 50, 65 34000
The above dataset is df, and I'm saving the average salary per employee age range into a separate dataframe (df_age). I was able to apply mean() to the Salary column to get the average salary per age range.
So basically what I want is the count of employees for each age_range.
df_age['EmpCount'] = df.groupby('Age_Range')['EmpID'].count() doesn't work, and returns a 'NaN' in my dataset.
Additionally, when I used the transform function
df_age['EmpCount'] = df.groupby('Age_Range')['EmpID'].transform('count')
it returns values, but the same value across the three age ranges - 37, which is not correct. There are a total of 100 entries in my dataset.
desired output for df_age:
0 (20, 35] 50000 27
1 (35, 50] 37000 11
2 (50, 65] 65000 30
Thanks!
If I understood your question correctly you want a new column which has count of employees for age_range. Well, you can use aggregate function to get your answer as follows:
df_age = df.set_index(['Age_Range','EmpID']).groupby(level =0).size().reset_index(name='count_of_employees')
# groupby sorts the age ranges the same way as above, so the values line up row by row
df_age['Ave_Salary'] = df.groupby('Age_Range')['Salary'].mean().to_numpy()
You can use size or len in a transform, just like you did with count:
# Dummy data
df = pd.DataFrame({"sample": ["sample1", "sample2", "sample2", "sample3", "sample3", "sample3"]})
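# Select the column with ["sample"]: attribute access (.sample) would pick up the sample() method instead
df["number_of_samples"] = df.groupby("sample")["sample"].transform("size")
df["number_of_samples_again"] = df.groupby("sample")["sample"].transform(len)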
Output:
sample number_of_samples number_of_samples_again
0 sample1 1 1
1 sample2 2 2
2 sample2 2 2
3 sample3 3 3
4 sample3 3 3
5 sample3 3 3
I have found a solution to this, but it's not neat / efficient:
df_age1 = df.groupby('Age_Range')['Salary'].mean()
df_age1 = df_age1.reset_index()
df_age1.rename(columns={'Salary':'SalAvg'}, inplace=True)
df_age2 = df.groupby('Age_Range')['EmpID'].count()
df_age2 = df_age2.reset_index()
df_age2.rename(columns={'EmpID':'EmpCount'}, inplace=True)
Then finally,
df_age = pd.merge(df_age1, df_age2, on='Age_Range')
The above approach gives me what I need, but across three dataframes. I'll obviously be ignoring df_age1 and df_age2, but I'm still on the lookout for a more efficient answer!
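One way to get both columns from a single groupby is named aggregation, sketched below on the three sample rows shown in the question (the real data has 100 rows, so the numbers here are only illustrative). The output column names SalAvg and EmpCount follow the ones used above.

```python
import pandas as pd

df = pd.DataFrame({'EmpID': [321, 561, 789],
                   'Age_Range': ['20, 35', '20, 35', '50, 65'],
                   'Salary': [34000, 24000, 34000]})

# Each keyword is an output column: (source column, aggregation to apply).
df_age = (
    df.groupby('Age_Range')
      .agg(SalAvg=('Salary', 'mean'), EmpCount=('EmpID', 'count'))
      .reset_index()
)
print(df_age)
```

This avoids the intermediate dataframes and the merge, since both aggregations share the same grouping.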
I have a data frame df like this
x
0 8.86
1 1.12
2 0.56
3 5.99
4 3.08
5 4.15
I need to perform some sort of groupby operation on x to aggregate x every time its sum reaches 10. If the index of df were a datetime object, I could use pd.Grouper as below
grouped = df.groupby(pd.Grouper(freq="min"))
grouped["x"].sum()
which would group by the datetime index and then sum x every minute. In my case I don't have a datetime target to use, so df.groupby(pd.Grouper(freq=10)) yields ValueError: Invalid frequency: 10.
The desired output dataframe, after applying groupby() and sum() operations would look like this
y
0 10.54
1 13.22
because elements 0-2 of df sum to 10.54 and elements 3-5 sum to 13.22
How can I group x by its sum, every time the sum reaches 10?
Here's one approach:
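# cumulative sum, modulo 10
s = df.x.cumsum().mod(10)
# a drop in the remainder means the cumulative sum has passed a multiple of 10
m = s.diff().lt(0)
# group by the running count of those crossings, shifted so the crossing row stays in its group
df.x.groupby(m.cumsum().shift(fill_value=0)).sum()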
x
0 10.54
1 13.22
Name: x, dtype: float64
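To see why this works, it can help to lay the intermediate values side by side. The sketch below only restates the steps above in a helper frame; the name steps and its column labels are made up for illustration.

```python
import pandas as pd

df = pd.DataFrame({'x': [8.86, 1.12, 0.56, 5.99, 3.08, 4.15]})

steps = pd.DataFrame({'x': df['x']})
steps['cumsum'] = df['x'].cumsum()                 # running total of x
steps['mod10'] = steps['cumsum'].mod(10)           # remainder after dividing by 10
steps['crossed'] = steps['mod10'].diff().lt(0)     # remainder dropped: a multiple of 10 was passed
steps['group'] = steps['crossed'].cumsum().shift(fill_value=0)  # label for each row
print(steps)

y = steps.groupby('group')['x'].sum()
print(y)
```

Note that the labels follow multiples of 10 in the overall cumulative sum rather than a total that restarts from zero after each group, so the two can differ on other data; for the sample values they give the same groups.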
You can do this with a for-loop and a running sum.
data_slices = []  # Store each slice of x
rollingSum = 0
last_t = 0
for t in range(len(df)):
    rollingSum += df['x'].iloc[t]  # Add the value at position t to the running sum
    if rollingSum >= 10:
        data_slice = df['x'].iloc[last_t:t + 1]  # Slice of x (including t) that sums to at least 10
        data_slices.append(data_slice)
        rollingSum = 0  # Reset the sum
        last_t = t + 1  # The next slice starts right after t
grouped_data = pd.Series([s.sum() for s in data_slices], name='y')  # One sum per slice