How much savings can be achieved? - python

I have a problem. I have a dataframe that gives me the savings; if I sum this column I get the total savings of X data points.
Last year I had Y data points. I now want to predict / extrapolate how much savings I can expect.
Thus I have to calculate the following:
(total savings / number of data points) * number of old data points = extrapolated savings
In my estimation, this is wrong. So how could I calculate an extrapolation of the savings for the 10 data points?
import pandas as pd
d = {'id': [1, 2, 3, 4], 'saveing': [10, 20, 30, 5]}
df = pd.DataFrame(data=d)
count_datapoints = 10
(df['saveing'].sum() / df.shape[0]) * count_datapoints
[OUT] 162.5
Dataframe:
   id  saveing
0   1       10
1   2       20
2   3       30
3   4        5
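For reference, the calculation in the question is just the column mean scaled by the target number of data points; a minimal equivalent sketch of the same computation:
# Equivalent to (sum / count) * count_datapoints
extrapolated = df['saveing'].mean() * count_datapoints
print(extrapolated)   # 162.5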

Related

Pandas dataframe from numpy array with multiindex

I'm working with a numpy array called array_test with shape (5, 359, 2). This is checked with array_test.shape. The array reflects mean and uncertainty for observations in 5 repetitions of an experiment.
The goal is to estimate the mean value of each observation across the 5 repetitions of the experiment, and to estimate the total uncertainty per observation, also as a mean across the 5 repetitions.
I would need to create a pandas dataframe from it, I believe with a multiindex in which the first level would have 5 values from the first dimension (named simply '1', '2', etc.), and a second level which would be 'mean' and 'uncertainty'.
Suggestions are more than welcome!
IIUC, you might want to aggregate in numpy, then construct a DataFrame and stack:
import numpy as np
import pandas as pd

a = np.random.random((5, 359, 2))
out = pd.DataFrame(a.mean(1), index=range(1, a.shape[0]+1),
                   columns=['mean', 'uncertainty']).stack()
Output (a Series):
1  mean           0.499102
   uncertainty    0.511757
2  mean           0.480295
   uncertainty    0.473132
3  mean           0.500507
   uncertainty    0.519352
4  mean           0.505443
   uncertainty    0.493672
5  mean           0.514302
   uncertainty    0.519299
dtype: float64
For a DataFrame:
out = pd.DataFrame(a.mean(1), index=range(1, a.shape[0]+1),
                   columns=['mean', 'uncertainty']).stack().to_frame('value')
Output:
                  value
1 mean         0.499102
  uncertainty  0.511757
2 mean         0.480295
  uncertainty  0.473132
3 mean         0.500507
  uncertainty  0.519352
4 mean         0.505443
  uncertainty  0.493672
5 mean         0.514302
  uncertainty  0.519299
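If what you ultimately need is the per-observation mean across the 5 repetitions (as stated in the question), one possible sketch is to keep the raw values in a long-format frame with a repetition/observation MultiIndex and then aggregate over the repetition level; the index names here are illustrative:
import numpy as np
import pandas as pd

a = np.random.random((5, 359, 2))
# Long format: one row per (repetition, observation) pair
idx = pd.MultiIndex.from_product(
    [range(1, a.shape[0] + 1), range(a.shape[1])],
    names=['repetition', 'observation'])
long = pd.DataFrame(a.reshape(-1, a.shape[2]), index=idx,
                    columns=['mean', 'uncertainty'])
# Mean of each observation across the 5 repetitions
per_observation = long.groupby(level='observation').mean()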
I would approach it by using a normal Dataframe, but adding columns for the observation and experiment number.
import numpy as np
import pandas as pd
a = np.random.rand(5, 10, 2)
# Get the shape
n_experiments, n_observations, n_values = a.shape
# Reshape array into a 2-dimensional array
# (stacking experiments on top of each other)
a = a.reshape(-1, n_values)
# Create Dataframe and add experiment and observation number
df = pd.DataFrame(a, columns=["mean", "uncertainty"])
# Experiment number for each row: [0, 0, ..., 0, 1, 1, ..., 1, ..., 4, 4, ..., 4]
experiment = np.repeat(range(n_experiments), n_observations)
df["experiment"] = experiment
# Observation number for each row: [0, 1, ..., 9, 0, 1, ..., 9, ...]
observation = np.tile(range(n_observations), n_experiments)
df["observation"] = observation
The Dataframe now looks like this:
print(df.head(15))
        mean  uncertainty  experiment  observation
0   0.741436     0.775086           0            0
1   0.401934     0.277716           0            1
2   0.148269     0.406040           0            2
3   0.852485     0.702986           0            3
4   0.240930     0.644746           0            4
5   0.309648     0.914761           0            5
6   0.479186     0.495845           0            6
7   0.154647     0.422658           0            7
8   0.381012     0.756473           0            8
9   0.939797     0.764821           0            9
10  0.994342     0.019140           1            0
11  0.300225     0.992146           1            1
12  0.265698     0.823469           1            2
13  0.791907     0.555051           1            3
14  0.503281     0.249237           1            4
Now you can analyze the Dataframe (with groupby and mean):
# Only the mean
print(df[['observation', 'mean', 'uncertainty']].groupby(['observation']).mean())
                 mean  uncertainty
observation
0            0.699324     0.506369
1            0.382288     0.456324
2            0.333396     0.324469
3            0.690545     0.564583
4            0.365198     0.555231
5            0.453545     0.596149
6            0.526988     0.395162
7            0.565689     0.569904
8            0.425595     0.415944
9            0.731776     0.375612
Or with more advanced aggregate functions, which are probably useful for your usecase:
# Use aggregate function to calculate not only mean, but min and max as well
print(df[['observation', 'mean', 'uncertainty']].groupby(['observation']).aggregate(['mean', 'min', 'max']))
                 mean                      uncertainty
                 mean       min       max         mean       min       max
observation
0            0.699324  0.297030  0.994342     0.506369  0.019140  0.974842
1            0.382288  0.063046  0.810411     0.456324  0.108774  0.992146
2            0.333396  0.148269  0.698921     0.324469  0.009539  0.823469
3            0.690545  0.175471  0.895190     0.564583  0.260557  0.721265
4            0.365198  0.015501  0.726352     0.555231  0.249237  0.929258
5            0.453545  0.111355  0.807582     0.596149  0.101421  0.914761
6            0.526988  0.323945  0.786167     0.395162  0.007105  0.691998
7            0.565689  0.154647  0.813336     0.569904  0.302157  0.964782
8            0.425595  0.116968  0.567544     0.415944  0.014439  0.756473
9            0.731776  0.411324  0.939797     0.375612  0.085988  0.764821
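If you then want the MultiIndexed layout described in the question, a small follow-up sketch using the columns defined above:
# Rows indexed by (experiment, observation) instead of a flat RangeIndex
df_indexed = df.set_index(["experiment", "observation"]).sort_index()
# Per-observation estimates across the experiments
print(df_indexed.groupby(level="observation").mean())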

Calculate average based on available data points

Imagine I have the following data frame:
Product    Month 1    Month 2    Month 3    Month 4    Total
Stuff A          5          0          3          3       11
Stuff B         10         11          4          8       33
Stuff C          0          0         23         30       53
that can be constructed from:
import pandas as pd

df = pd.DataFrame({'Product': ['Stuff A', 'Stuff B', 'Stuff C'],
                   'Month 1': [5, 10, 0],
                   'Month 2': [0, 11, 0],
                   'Month 3': [3, 4, 23],
                   'Month 4': [3, 8, 30],
                   'Total': [11, 33, 53]})
This data frame shows the amount of units sold per product, per month.
Now, what I want to do is to create a new column called "Average" that calculates the average units sold per month. HOWEVER, notice in this example that Stuff C's values for months 1 and 2 are 0. This product was probably introduced in Month 3, so its average should be calculated based on months 3 and 4 only. Also notice that Stuff A's units sold in Month 2 were 0, but that does not mean the product was introduced in Month 3 since 5 units were sold in Month 1. That is, its average should be calculated based on all four months. Assume that the provided data frame may contain any number of months.
Based on these conditions, I have come up with the following solution in pseudo-code:
months = ["list of index names of months to calculate"]
x = len(months)
if df["Month 1"] != 0:
    df["Average"] = df["Total"] / x
elif df["Month 2"] != 0:
    df["Average"] = df["Total"] / (x - 1)
...
elif df["Month " + str(x)] != 0:
    df["Average"] = df["Total"] / 1
else:
    df["Average"] = 0
That way, the average would be calculated starting from the first month where units sold are different from 0. However, I haven't been able to translate this logical abstraction into actual working code. I couldn't manage to iterate over len(months) while maintaining the elif conditions. Or maybe there is a better, more practical approach.
I would appreciate any help, since I've been trying to crack this problem for a while with no success.
NumPy has a method, np.trim_zeros, that trims leading and/or trailing zeros. Using a list comprehension, you can iterate over the relevant DataFrame rows, trim the leading zeros, and take the mean of what remains for each row.
Note that since 'Month 1' to 'Month 4' are consecutive, you can slice the columns between them using .loc.
import numpy as np

df['Average Sales'] = [np.trim_zeros(row, trim='f').mean()
                       for row in df.loc[:, 'Month 1':'Month 4'].to_numpy()]
Output:
   Product  Month 1  Month 2  Month 3  Month 4  Total  Average Sales
0  Stuff A        5        0        3        3     11           2.75
1  Stuff B       10       11        4        8     33           8.25
2  Stuff C        0        0       23       30     53          26.50
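If the number of month columns varies (as the question allows), one way to avoid hard-coding the slice is to select the columns by name; this assumes they all share the 'Month' name prefix:
# Select every column whose name contains 'Month', keeping their current order
month_values = df.filter(like='Month').to_numpy()
df['Average Sales'] = [np.trim_zeros(row, trim='f').mean() for row in month_values]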
Try:
df = df.set_index(['Product','Total'])
df['Average'] = df.where(df.ne(0).cummax(axis=1)).mean(axis=1)
df_out=df.reset_index()
print(df_out)
Output:
   Product  Total  Month 1  Month 2  Month 3  Month 4  Average
0  Stuff A     11        5        0        3        3     2.75
1  Stuff B     33       10       11        4        8     8.25
2  Stuff C     53        0        0       23       30    26.50
Details:
Move Product and Total into the dataframe index, so the calculation runs only on the rest of the dataframe.
First create a boolean matrix with ne(0). Then apply cummax along the rows: once a non-zero value is seen, the entry becomes True and stays True until the end of the row; leading zeros stay False until the first non-zero value.
Next, use pd.DataFrame.where to keep only the values where that boolean matrix is True; the other values (the leading zeros) become NaN and are ignored in the calculation of mean.
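A small illustration of the intermediate mask on the example data (after Product and Total have been moved into the index), to make the cummax step concrete:
mask = df.ne(0).cummax(axis=1)
# Stuff A: [ True,  True,  True,  True]   (no leading zeros, all months kept)
# Stuff B: [ True,  True,  True,  True]
# Stuff C: [False, False,  True,  True]   (the two leading zeros are masked out)
df.where(mask).mean(axis=1)   # 2.75, 8.25, 26.50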
If you don't mind it being a little memory inefficient, you could put your dataframe into a numpy array. Numpy has a built-in function to remove zeroes from an array, and then you could use the mean function to calculate the average. It could look something like this:
import numpy as np
arr = np.array(Stuff_A_DF)
mean = arr[np.nonzero(arr)].mean()
Alternatively, you could manually extract the row to a list, then loop through to remove the zeroes.
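A hedged, concrete version of that idea for a single row of the example frame. Note that np.nonzero drops all zeros, not only the leading ones, so it matches the trim-leading-zeros logic only for rows without interior zeros, such as Stuff C:
import numpy as np

# Pull one product's month values out as a flat array, then average the non-zero entries
row = df.loc[df['Product'] == 'Stuff C', 'Month 1':'Month 4'].to_numpy().ravel()
mean = row[np.nonzero(row)].mean()   # (23 + 30) / 2 == 26.5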

Pandas: efficient way to get a random subset from each row within a restricted column range

I have some numerical time-series of varying lengths stored in a wide pandas dataframe. Each row corresponds to one series and each column to a measurement time point. Because of their varying lengths, these series can have missing-value (NA) tails on the left (first time points), on the right (last time points), or both. There is always a continuous stretch without NA of a minimum length on each row.
I need to get a random subset of fixed length from each of these rows, without including any NA. Ideally, I wish to keep the original dataframe intact and to report the subsets in a new one.
I managed to obtain this output with a very inefficient for loop that goes through each row one by one, determines a start for the crop position such that NAs will not be included in the output and copies the cropped result. This works but it is extremely slow on large datasets. Here is the code:
import pandas as pd
import numpy as np
from copy import copy
def crop_random(df_in, output_length, ignore_na_tails=True):
    # Initialize new dataframe
    colnames = ['X_' + str(i) for i in range(output_length)]
    df_crop = pd.DataFrame(index=df_in.index, columns=colnames)
    # Go through all rows
    for irow in range(df_in.shape[0]):
        series = copy(df_in.iloc[irow, :])
        series = np.array(series).astype('float')
        length = len(series)
        if ignore_na_tails:
            pos_non_na = np.where(~np.isnan(series))
            # Range where the subset might start
            lo = pos_non_na[0][0]
            hi = pos_non_na[0][-1]
            left = np.random.randint(lo, hi - output_length + 2)
        else:
            left = np.random.randint(0, length - output_length)
        series = series[left : left + output_length]
        df_crop.iloc[irow, :] = series
    return df_crop
And a toy example:
df = pd.DataFrame.from_dict({'t0': [np.nan, 1, np.nan],
                             't1': [np.nan, 2, np.nan],
                             't2': [np.nan, 3, np.nan],
                             't3': [1, 4, 1],
                             't4': [2, 5, 2],
                             't5': [3, 6, 3],
                             't6': [4, 7, np.nan],
                             't7': [5, 8, np.nan],
                             't8': [6, 9, np.nan]})
# t0 t1 t2 t3 t4 t5 t6 t7 t8
# 0 NaN NaN NaN 1 2 3 4 5 6
# 1 1 2 3 4 5 6 7 8 9
# 2 NaN NaN NaN 1 2 3 NaN NaN NaN
crop_random(df, 3)
# One possible output:
# X_0 X_1 X_2
# 0 2 3 4
# 1 7 8 9
# 2 1 2 3
How could I achieve same results in a way adapted to large dataframes?
Edit: Moved my improved solution to the answer section.
I managed to speed up things quite drastically with:
def crop_random(dataset, output_length, ignore_na_tails=True):
    # Get a random range to crop for each row
    def get_range_crop(series, output_length, ignore_na_tails):
        series = np.array(series).astype('float')
        if ignore_na_tails:
            pos_non_na = np.where(~np.isnan(series))
            start = pos_non_na[0][0]
            end = pos_non_na[0][-1]
            left = np.random.randint(start,
                                     end - output_length + 2)  # +1 to make randint's upper bound inclusive; +1 for the selection span
        else:
            length = len(series)
            left = np.random.randint(0, length - output_length)
        right = left + output_length
        return left, right

    # Crop the rows to a random range; reset_index to concat without recreating new columns
    range_subset = dataset.apply(get_range_crop, args=(output_length, ignore_na_tails,), axis=1)
    new_rows = [dataset.iloc[irow, range_subset[irow][0]:range_subset[irow][1]]
                for irow in range(dataset.shape[0])]
    for row in new_rows:
        row.reset_index(drop=True, inplace=True)
    # Concatenate all rows
    dataset_cropped = pd.concat(new_rows, axis=1).T
    return dataset_cropped
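A quick check on the toy frame from the question; the crop is random, so this is just one possible output (note that the columns of the result are positional integers here rather than 'X_0', 'X_1', ...):
crop_random(df, 3)
#      0    1    2
# 0  2.0  3.0  4.0
# 1  7.0  8.0  9.0
# 2  1.0  2.0  3.0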

Work out full and part-time colleagues based off a declarable variable hour

This is more of a guidance / point-me-in-the-right-direction sort of question.
The Problemo!
I have a problem at work that I currently work out using a very, very long Excel formula.
I basically allocate a variable number of hours (let's call this h) to 500 stores.
I then declare the hours allocation for a full-time colleague and a part-time colleague (ft and pt).
The formula I have at the moment works out, based on the number of hours, how many FT colleagues can work there; once the FT allocation is exhausted (i.e. the remaining hours can no longer be divided into whole FT shifts), it then moves on to the number of PT colleagues.
In math terms: I allocate 20 hours to store A.
Store A's FT colleagues work 12 hours and the PT colleagues work 6.
Based on this, store A can accommodate 1 FT colleague and 1 PT colleague, with 2 hours left as a remainder.
I would like to do this in Python and thought it would be a good first real-ish project to work on.
Solution thus far:
What I've tried is to start fleshing out a function that takes in ft, pt and h as arguments and spits out the number of FT and PT colleagues the hours can accommodate. I would then love to append this to a pandas data frame. However, I've not been able to work this out for a while now, and I have no idea what to search for on SO.
def (full_time, part_time, hours):
    for hours in full_time:
        if hours < full_time or part_time:
            return full_time
        elif hours >= full_time
            return full_time
        elif hours >= full_time ....
What I've tried is to start fleshing out a function that takes in the ft,pt and h as arguments and spits out the number of FT and PT the number of hours can accommodate.
My understanding is that you have three input variables and three outputs. A given store with total_hours allocated has FT employees who can work ft_hours and PT employees who can each work pt_hours. You want to find the number of FT workers & PT workers to allocate, and the remainder assuming that no employees will work half-shifts.
def alloc_hours(
    ft_hours: int,
    pt_hours: int,
    total_hours: int
) -> tuple:
    """Calculate hour-allocation for given store.

    ft_hours: The number of hours a full-time emp. works.
    pt_hours: The number of hours a part-time emp. works.
    total_hours: The total hours allocated to the store.

    Returns: tuple
        1st element: num. of full-time workers.
        2nd element: num. of part-time workers.
        3rd element: remainder hours.
    """
    ft_workers, remainder = divmod(total_hours, ft_hours)
    pt_workers, remainder = divmod(remainder, pt_hours)
    return ft_workers, pt_workers, remainder
Examples:
>>> alloc_hours(12, 6, 20)
(1, 1, 2)
>>> alloc_hours(8, 6, 20)
(2, 0, 4)
>>> alloc_hours(8, 6, 24)
(3, 0, 0)
In Pandas:
import pandas as pd
data = {
    'ft_hours': [12, 8, 10, 8, 12, 10, 8, 8],
    'pt_hours': [6, 4, 6, 6, 6, 4, 4, 6],
    'total_hours': [20, 20, 24, 40, 30, 20, 10, 40]
}
data = pd.DataFrame(data)
# Pandas supports vectorization, so each of these results is a Series.
ft_workers, remainder = divmod(data['total_hours'], data['ft_hours'])
pt_workers, remainder = divmod(remainder, data['pt_hours'])
data = data.assign(
    ft_workers=ft_workers,
    pt_workers=pt_workers,
    remainder=remainder
)
Result:
>>> data
   ft_hours  pt_hours  total_hours  ft_workers  pt_workers  remainder
0        12         6           20           1           1          2
1         8         4           20           2           1          0
2        10         6           24           2           0          4
3         8         6           40           5           0          0
4        12         6           30           2           1          0
5        10         4           20           2           0          0
6         8         4           10           1           0          2
7         8         6           40           5           0          0
This answer is based on the assumption that you have an existing DataFrame that provides the three inputs. You could create new columns using the pandas apply function: apply takes your inputs, applies your function, then returns the results in the new fields.
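A minimal sketch of that apply-based variant, reusing the alloc_hours function and the data frame defined above (result_type='expand' turns the returned tuples into columns):
allocation = data[['ft_hours', 'pt_hours', 'total_hours']].apply(
    lambda row: alloc_hours(row['ft_hours'], row['pt_hours'], row['total_hours']),
    axis=1,
    result_type='expand',
)
allocation.columns = ['ft_workers', 'pt_workers', 'remainder']
print(allocation)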

Pandas conditions across multiple series

Let's say I have some data like this:
import numpy as np
import pandas as pd

category = pd.Series(np.ones(4))
job1_days = pd.Series([1, 2, 1, 2])
job1_time = pd.Series([30, 35, 50, 10])
job2_days = pd.Series([1, 3, 1, 3])
job2_time = pd.Series([10, 40, 60, 10])
job3_days = pd.Series([1, 2, 1, 3])
job3_time = pd.Series([30, 15, 50, 15])
Each entry represents an individual (so 4 people total). xxx_days represents the number of days an individual did something, and xxx_time represents the number of minutes spent doing that job on a single day.
I want to assign a 2 to category for an individual, if across all jobs they spent at least 3 days of 20 minutes each. So for example, person 1 does not meet the criteria because they only spent 2 total days with at least 20 minutes (their job 2 day count does not count toward the total because time is < 20). Person 2 does meet the criteria as they spent 5 total days (jobs 1 and 2).
After replacement, category should look like this:
[1, 2, 2, 1]
My current attempt requires a for loop, manually indexing into each series and calculating the total days where time is greater than 20. However, this approach doesn't scale well to my actual dataset. I haven't included the code here as I'd like to approach it from a Pandas perspective instead.
What's the most efficient way to do this in Pandas? The thing that stumps me is checking conditions across multiple series and acting accordingly after the summation of days.
Put days and time in two data frames with the column positions kept in correspondence, then do the calculation in a vectorized way:
import pandas as pd
time = pd.concat([job1_time, job2_time, job3_time], axis=1)
days = pd.concat([job1_days, job2_days, job3_days], axis=1)
((days * (time >= 20)).sum(1) >= 3) + 1
#0 1
#1 2
#2 2
#3 1
#dtype: int64
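If you want to write the result back into category in place (rather than rebuilding it from the boolean sum), a small follow-up sketch using Series.mask:
# Days that count toward the criterion: only where the per-day time is at least 20 minutes
qualifying_days = (days * (time >= 20)).sum(axis=1)
category = category.mask(qualifying_days >= 3, 2)
# category is now [1.0, 2.0, 2.0, 1.0]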
