How to cap a pandas column with mean values under some conditions - python

I have the following DataFrame in pandas:
ID Quantity Rate Product
1 10 70 MS
2 10 70 MS
3 100 70 MS
4 10 100 MS
5 700 65 HS
6 1100 65 HS
7 700 100 HS
I want to cap values in Quantity and Rate with their mean. For MS, if Quantity is greater than 100 and Rate is greater than 99, the value should be replaced by the mean; for HS, if Quantity is greater than 1000 and Rate is greater than 99, it should be replaced by the mean.
I am using the following approach:
mean_MS = df['Quantity'][(df['Product'] == 'MS') and (df['Quantity'] < 100)].mean()
But it does not work.
My desired dataframe would be
ID Quantity Rate Product
1 10 70 MS
2 10 70 MS
3 10 70 MS
4 10 70 MS
5 700 65 HS
6 700 65 HS
7 700 65 HS

One way to solve this:
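# Rows belonging to product MS
m1 = df['Product'] == 'MS'
# Rows whose Quantity or Rate is out of range for MS
m2 = (df['Quantity'] >= 100) | (df['Rate'] > 99)
# Replace flagged values with the mean of the in-range MS values
df.loc[m1 & m2, 'Quantity'] = df[m1 & (df['Quantity'] < 100)]['Quantity'].mean()
df.loc[m1 & m2, 'Rate'] = df[m1 & (df['Rate'] < 99)]['Rate'].mean()
# Same logic for HS, with a Quantity threshold of 1000
m3 = df['Product'] == 'HS'
m4 = (df['Quantity'] >= 1000) | (df['Rate'] > 99)
df.loc[m3 & m4, 'Quantity'] = df[m3 & (df['Quantity'] < 1000)]['Quantity'].mean()
df.loc[m3 & m4, 'Rate'] = df[m3 & (df['Rate'] < 99)]['Rate'].mean()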
Output:
ID Quantity Rate Product
0 1 10.0 70.0 MS
1 2 10.0 70.0 MS
2 3 10.0 70.0 MS
3 4 10.0 70.0 MS
4 5 700.0 65.0 HS
5 6 700.0 65.0 HS
6 7 700.0 65.0 HS
Explanation:
1. Split the problem into two sub-problems, one for MS and one for HS. Both use the same logic and differ only in the Quantity threshold.
2. Start with MS and flag its rows in m1. Where Quantity is greater than or equal to 100 or Rate is greater than 99, replace Quantity with the mean taken over the MS rows whose Quantity is within the limit, so that out-of-range values do not distort the mean.
3. Repeat the same logic for Rate.
4. Repeat steps 2 and 3 for HS, with the Quantity threshold changed from 100 to 1000.
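The steps above can be sketched as a loop over products. This is only a sketch under assumptions: the `quantity_caps` dictionary and the `RATE_CAP` constant are hypothetical names introduced here, and the columns are cast to float up front so that a non-integer mean can be stored.

```python
import pandas as pd

df = pd.DataFrame({
    "ID": [1, 2, 3, 4, 5, 6, 7],
    "Quantity": [10, 10, 100, 10, 700, 1100, 700],
    "Rate": [70, 70, 70, 100, 65, 65, 100],
    "Product": ["MS", "MS", "MS", "MS", "HS", "HS", "HS"],
})

# Hypothetical lookup of the Quantity threshold per product (not from the original post).
quantity_caps = {"MS": 100, "HS": 1000}
RATE_CAP = 99  # the same Rate threshold is used for both products

# Means are floats in general, so make the target columns float first.
df = df.astype({"Quantity": float, "Rate": float})

for product, qty_cap in quantity_caps.items():
    is_product = df["Product"] == product
    # Rows of this product with an out-of-range Quantity or Rate.
    flagged = is_product & ((df["Quantity"] >= qty_cap) | (df["Rate"] > RATE_CAP))
    # Means are computed from the in-range rows only, before anything is overwritten.
    qty_mean = df.loc[is_product & (df["Quantity"] < qty_cap), "Quantity"].mean()
    rate_mean = df.loc[is_product & (df["Rate"] < RATE_CAP), "Rate"].mean()
    df.loc[flagged, "Quantity"] = qty_mean
    df.loc[flagged, "Rate"] = rate_mean

print(df)
```

Note that the masks are combined with `&` and `|` and each comparison is parenthesized; Python's `and` does not work element-wise on a Series, which is why the attempt in the question fails.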

IIUC, you can also try the following, which fills in the mode of each product rather than the mean:
val1= df.loc[df.Product.eq('MS'),['Quantity','Rate']].mode().values
#array([[10, 70]], dtype=int64)
val2= df.loc[df.Product.eq('HS'),['Quantity','Rate']].mode().values
#array([[700, 65]], dtype=int64)
df.loc[df.Product.eq('MS')&df.Quantity.ge(100)|df.Product.eq('MS')&df.Rate.gt(99),['Quantity','Rate']] = val1
df.loc[df.Product.eq('HS')&df.Quantity.ge(1000)|df.Product.eq('HS')&df.Rate.gt(99),['Quantity','Rate']] = val2
print(df)
ID Quantity Rate Product
0 1 10 70 MS
1 2 10 70 MS
2 3 10 70 MS
3 4 10 70 MS
4 5 700 65 HS
5 6 700 65 HS
6 7 700 65 HS

Related

comparing column values in groupby in pandas

My DataFrame looks like this:
Time Name price Profit
5:25 A 150 15
5:25 B 250 10
5:25 C 200 20
5:30 A 200 25
5:30 B 150 20
5:30 C 210 25
5:35 A 180 15
5:35 B 200 30
5:35 C 200 10
5:40 A 150 20
5:40 B 260 15
5:40 C 220 10
I want the output to look like this:
Time Name price profit diff_price diff_profit
5:25 A 150 15 0 0
5:25 B 250 10 0 0
5:25 C 200 20 0 0
5:30 A 200 25 50 10
5:30 B 150 20 -100 10
5:30 C 210 25 10 5
5:35 A 180 15 20 -10
5:35 B 200 30 50 10
5:35 C 200 10 -10 -15
5:40 A 150 20 -30 5
5:40 B 260 35 60 5
5:40 C 220 15 20 5
I need to check, per group, whether the differences for A, B and C are greater than the previous values.
If the condition matches, the Name should be displayed.
From the table above, at Time 5:40 the diff_price and diff_profit of B are greater than the values at all previous Times,
so the output should print: B
My code looks like this:
df.groupby(['Time','Price'])
df['diff_price']=df.groupby(['Time','Price']).price.diff().fillna(0)
df['diff_profit']=df.groupby(['Time','Price']).profit.diff().fillna(0)
How can I then compare the values to get the desired output, B?
You could tackle this problem one group ("Name") at a time:
# Iterate over the dataframe one "Name" group at a time
for name, group_df in df.groupby("Name"):
    # Make sure that the rows are sorted by time
    group_df = group_df.sort_values("Time")
    # Calculate the difference between consecutive rows (diff = previous row - current row)
    group_df[["diff_price", "diff_profit"]] = group_df[["price", "Profit"]].shift(1) - group_df[["price", "Profit"]]
    # Fill the first value with 0 instead of NaN (as in your sample input)
    group_df = group_df.fillna(0)
    # Check whether the maximum diff_price is reached in the last row
    *previous_values, last_value = group_df["diff_price"]
    if last_value >= max(previous_values):
        print(f"Max price diff reached: {name}")
        print(group_df.tail(1))
    # Same check for diff_profit
    *previous_values, last_value = group_df["diff_profit"]
    if last_value >= max(previous_values):
        print(f"Max profit diff reached: {name}")
        print(group_df.tail(1))
This is the output I get for your sample input:
Max price diff reached: A
Time Name price Profit diff_price diff_profit
9 5:40 A 150 20 30.0 -5.0
Max profit diff reached: B
Time Name price Profit diff_price diff_profit
10 5:40 B 260 15 -60.0 15.0
IIUC, compute diff_price and diff_profit per Name, then patch the last Time group according to your condition:
df[['diff_price', 'diff_profit']] = df.groupby('Name')[['price', 'profit']] \
.diff().fillna(0)
mask = df['Time'].eq(df['Time'].max())
df.loc[mask, 'diff_profit'] = df.loc[mask, 'diff_profit'].max()
Output:
>>> df
Time Name price profit diff_price diff_profit
0 5:25 A 150 15 0.0 0.0
1 5:25 B 250 10 0.0 0.0
2 5:25 C 200 20 0.0 0.0
3 5:30 A 200 25 50.0 10.0
4 5:30 B 150 20 -100.0 10.0
5 5:30 C 210 25 10.0 5.0
6 5:35 A 180 15 -20.0 -10.0
7 5:35 B 200 30 50.0 10.0
8 5:35 C 200 10 -10.0 -15.0
9 5:40 A 150 20 -30.0 5.0
10 5:40 B 260 15 60.0 5.0
11 5:40 C 220 10 20.0 5.0
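To actually print the matching names, one possible reading of the condition is "the last difference of a Name is strictly greater than all of its earlier differences". The sketch below assumes that reading, and that the rows are already in time order; the helper `last_exceeds_previous` is a hypothetical name introduced here. It uses the question's input data.

```python
import pandas as pd

df = pd.DataFrame({
    "Time": ["5:25"] * 3 + ["5:30"] * 3 + ["5:35"] * 3 + ["5:40"] * 3,
    "Name": ["A", "B", "C"] * 4,
    "price": [150, 250, 200, 200, 150, 210, 180, 200, 200, 150, 260, 220],
    "Profit": [15, 10, 20, 25, 20, 25, 15, 30, 10, 20, 15, 10],
})

# Row-to-row change within each Name (current - previous); the first row becomes 0.
diffs = df.groupby("Name")[["price", "Profit"]].diff().fillna(0)
df["diff_price"] = diffs["price"]
df["diff_profit"] = diffs["Profit"]


def last_exceeds_previous(values):
    """Return True when the final value is strictly greater than every earlier one."""
    return len(values) > 1 and values.iloc[-1] > values.iloc[:-1].max()


# Names whose latest difference beats all of their earlier differences.
price_winners = [name for name, group in df.groupby("Name")
                 if last_exceeds_previous(group["diff_price"])]
profit_winners = [name for name, group in df.groupby("Name")
                  if last_exceeds_previous(group["diff_profit"])]

print("price:", price_winners)
print("profit:", profit_winners)
```

With the input data as posted, B qualifies on price but not on profit, so the exact condition in the question may need to be adjusted.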

Plotting in pivot table using label

My dataset:
df
Month 1 2 3 4 5 Label
Name
A 120 80.5 120 105.5 140 0
B 80 110 98.5 105 100 1
C 150 90.5 105 120 190 2
D 100 105 98.5 110 120 1
...
To draw a plot by Month, I transpose the DataFrame:
df = df.T
df
Name A B C D
Month
1 120 80 150 100
2 80.05 110 90.5 105
3 130 98.5 105 98.005
4 105.5 105 120 110
5 140 100 190 120
Label 0.00 1.0 2.0 1.000
Ultimately, I want to draw a plot where the x-axis is the month and the y-axis is the value.
However, I have two questions.
Q1.
After transposing, the dtype of the 'Label' row changes (int -> float).
Can only the 'Label' row be set to int type?
The output I want:
df = df.T
df
Name A B C D
Month
1 120 80 150 100
2 80.05 110 90.5 105
3 130 98.5 105 98.005
4 105.5 105 120 110
5 140 100 190 120
Label 0 1 2 1
Q2.
Q1 is actually a step toward Q2.
When drawing the plot, I want to group the lines by label (like hue in seaborn).
Is there a way to group like this when plotting from the pivot table above?
(Either matplotlib or seaborn is fine.)
The label above doesn't have to be int, so if Q2 can be solved, there is no need to answer Q1.
Thank you for reading.
Q2: You need to reshape the values, e.g. with DataFrame.melt, so that Label can be used as hue:
df1 = df.reset_index().melt(['Name','Label'])
print (df1)
sns.stripplot(data=df1,hue='Label',x='Name',y='value')
Q1: pandas does not support this. Each column holds a single dtype, so even if you convert the last row, Label, to int, the values are stored as floats again:
df = df.T
df.loc['Label', :] = df.loc['Label', :].astype(int)
print (df)
Name A B C D
1 120.0 80.0 150.0 100.0
2 80.5 110.0 90.5 105.0
3 120.0 98.5 105.0 98.5
4 105.5 105.0 120.0 110.0
5 140.0 100.0 190.0 120.0
Label 0.0 1.0 2.0 1.0
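If the integer labels only matter for display, one possible workaround (an assumption, not part of the answer above) is to cast everything to object dtype before transposing, so each cell keeps its own Python type. The result is no longer numeric, so it would need converting back before any arithmetic or plotting.

```python
import pandas as pd

df = pd.DataFrame(
    {
        1: [120, 80, 150, 100],
        2: [80.5, 110, 90.5, 105],
        3: [120, 98.5, 105, 98.5],
        4: [105.5, 105, 120, 110],
        5: [140, 100, 190, 120],
        "Label": [0, 1, 2, 1],
    },
    index=pd.Index(["A", "B", "C", "D"], name="Name"),
)

# A plain transpose upcasts every column to float, including the Label row.
plain = df.T

# Casting to object first keeps each cell's own type, so Label stays integer.
as_object = df.astype(object).T

print(plain.loc["Label"].tolist())
print(as_object.loc["Label"].tolist())
```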
EDIT:
df1 = df.reset_index().melt(['Name','Label'], var_name='Month')
print (df1)
Name Label Month value
0 A 0 1 120.0
1 B 1 1 80.0
2 C 2 1 150.0
3 D 1 1 100.0
4 A 0 2 80.5
5 B 1 2 110.0
6 C 2 2 90.5
7 D 1 2 105.0
8 A 0 3 120.0
9 B 1 3 98.5
10 C 2 3 105.0
11 D 1 3 98.5
12 A 0 4 105.5
13 B 1 4 105.0
14 C 2 4 120.0
15 D 1 4 110.0
16 A 0 5 140.0
17 B 1 5 100.0
18 C 2 5 190.0
19 D 1 5 120.0
sns.lineplot(data=df1,hue='Label',x='Month',y='value')
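For reference, the reshaping step can be reproduced on its own with only pandas. This is a minimal sketch that rebuilds the question's data by hand; the name `long_df` is introduced here for illustration. The plotting call is left out so the block runs without seaborn installed.

```python
import pandas as pd

# Wide table: one row per Name, one column per month, plus the Label column.
df = pd.DataFrame(
    {
        1: [120, 80, 150, 100],
        2: [80.5, 110, 90.5, 105],
        3: [120, 98.5, 105, 98.5],
        4: [105.5, 105, 120, 110],
        5: [140, 100, 190, 120],
        "Label": [0, 1, 2, 1],
    },
    index=pd.Index(["A", "B", "C", "D"], name="Name"),
)

# Long table: one row per (Name, Month) pair; Label is repeated for each month.
long_df = df.reset_index().melt(
    id_vars=["Name", "Label"], var_name="Month", value_name="value"
)

print(long_df.head())
```

Because Label is kept as an identifier column, it is never mixed with the float values and keeps its integer dtype, which is why the transpose from Q1 is not needed for plotting.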

Pandas totalling balances with date timeline from multiple sheets

I have three sheets inside one Excel spreadsheet. I am trying to obtain the output listed below, or something close to it. The goal is to know when there will be a shortage so that I can reorder in time and prevent it. Everything except the output is in one Excel file, each on a different sheet. How hard will this be to achieve, and is it possible? Note that every sheet has many other data columns, so the columns may need to be referenced by position (e.g. with iloc) or by name.
instock sheet
product someother datapoint qty
5.25 1 2 100
5.25 1 3 200
6 2 1 50
6 4 1 500
ordered
product something ordernum qty date
5 1/4 abc 52521 50 07/01/2019
5 1/4 ddd 22911 100 07/28/2019
6 eeee 72944 10 07/5/2019
promised
product order qty date
5 1/4 456 300 06/12/2019
5 1/4 789 50 06/20/2019
5 1/4 112 50 07/20/2019
6 113 800 07/22/2019
5 1/4 144 50 07/28/2019
9 155 100 08/22/2019
Output
product date onhand qtyordered commited balance shortage
5.25 06/10 300 300 n
5.25 06/12 300 300 0 n
5.25 06/20 0 50 -50 y
5.25 07/01 -50 50 0 n
6 07/05 550 10 0 560 n
5.25 07/20 0 50 -50 y
6 07/22 560 0 800 -240 y
5.25 07/28 -50 100 50 0 n
9 08/22 0 0 100 -100 y

How to create new values in a pandas dataframe column based on values from another column

I have a pandas DataFrame of values that I read in from a CSV file. I have a column labeled 'SleepQuality' whose values are floats from 0.0 to 100.0. I want to create a new column labeled 'SleepQualityGroup' where values from the original column between 0 and 49 get a value of 0 in the new column, 50 - 59 = 1, 60 - 69 = 2, 70 - 79 = 3, 80 - 89 = 4, and 90 - 100 = 5.
What would be the best formula to use in order to do this? I am stuck on the logic needed to identify all values in each range and assign to the new value.
An example of what the output would look like in the new 'SleepQualityGroup' column is below.
SleepQuality SleepQualityGroup
80.4 4
90.1 5
66.4 2
50.3 1
86.2 4
75.4 3
45.7 0
91.5 5
61.3 2
54 1
58.2 1
Use pd.cut, i.e.
df['new'] = pd.cut(df['SleepQuality'],bins=[0,50 , 60, 70 , 80 , 90,100], labels=[0,1,2,3,4,5])
Output:
SleepQuality SleepQualityGroup new
0 80.4 4 4
1 90.1 5 5
2 66.4 2 2
3 50.3 1 1
4 86.2 4 4
5 75.4 3 3
6 45.7 0 0
7 91.5 5 5
8 61.3 2 2
9 54.0 1 1
10 58.2 1 1
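One detail worth checking is how scores that fall exactly on a bin edge are grouped. By default pd.cut uses right-closed intervals, so 50.0 falls into group 0, whereas the question puts 50 in group 1. The sketch below compares the default with `right=False`; using `np.inf` as the last edge is an assumption made here so that a score of exactly 100 still lands in group 5.

```python
import numpy as np
import pandas as pd

scores = pd.Series([45.7, 50.0, 59.9, 60.0, 90.0, 100.0])
labels = [0, 1, 2, 3, 4, 5]

# Default: right-closed intervals such as (0, 50] and (50, 60].
default_groups = pd.cut(scores, bins=[0, 50, 60, 70, 80, 90, 100], labels=labels)

# right=False: left-closed intervals such as [0, 50) and [50, 60).
# The open-ended last bin [90, inf) keeps a score of exactly 100 in group 5.
left_closed_groups = pd.cut(
    scores, bins=[0, 50, 60, 70, 80, 90, np.inf], right=False, labels=labels
)

print(default_groups.astype(int).tolist())
print(left_closed_groups.astype(int).tolist())
```

For the sample data in the question both variants give the same groups, since none of the scores sits exactly on an edge.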
That's basically a binning operation. As such two tools could be used here.
Using np.searchsorted -
bins = np.arange(50,100,10)
df['SleepQualityGroup'] = bins.searchsorted(df.SleepQuality)
Using np.digitize -
df['SleepQualityGroup'] = np.digitize(df.SleepQuality, bins)
Sample output -
In [866]: df
Out[866]:
SleepQuality SleepQualityGroup
0 80.4 4
1 90.1 5
2 66.4 2
3 50.3 1
4 86.2 4
5 75.4 3
6 45.7 0
7 91.5 5
8 61.3 2
9 54.0 1
10 58.2 1
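The two NumPy calls agree on the sample data but can differ for scores that sit exactly on a bin edge. A small sketch of the difference, using hand-picked edge values that are not from the original post:

```python
import numpy as np

bins = np.arange(50, 100, 10)  # edges 50, 60, 70, 80, 90
scores = np.array([49.9, 50.0, 60.0, 90.0, 100.0])

# searchsorted defaults to side="left": a score equal to an edge stays in the lower group.
left = bins.searchsorted(scores)

# side="right" moves a score equal to an edge into the upper group.
right = bins.searchsorted(scores, side="right")

# np.digitize with its default right=False behaves like side="right" for increasing bins.
digitized = np.digitize(scores, bins)

print(left.tolist())
print(right.tolist())
print(digitized.tolist())
```

If 50.0 should map to group 1, as the ranges in the question suggest, `np.digitize` or `searchsorted(..., side="right")` would be the variant to use.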
Runtime test -
In [921]: df
Out[921]:
SleepQuality SleepQualityGroup
0 80.4 4
1 90.1 5
2 66.4 2
3 50.3 1
4 86.2 4
5 75.4 3
6 45.7 0
7 91.5 5
8 61.3 2
9 54.0 1
10 58.2 1
In [922]: df = pd.concat([df]*10000,axis=0)
# @Dark's solution using pd.cut
In [923]: %timeit df['new'] = pd.cut(df['SleepQuality'],bins=[0,50 , 60, 70 , 80 , 90,100], labels=[0,1,2,3,4,5])
1000 loops, best of 3: 1.04 ms per loop
In [926]: %timeit df['SleepQualityGroup'] = bins.searchsorted(df.SleepQuality)
1000 loops, best of 3: 591 µs per loop
In [927]: %timeit df['SleepQualityGroup'] = np.digitize(df.SleepQuality, bins)
1000 loops, best of 3: 538 µs per loop

Get maximum values relative to the current index in pandas python

Say I have a DataFrame where the data is ordered with respect to time. I have a weights column, and I want to find the maximum weight relative to the current index. For example, the max value for the 10th row would be taken from elements 11 to the end.
I ended up writing this function, but performance is a big concern.
import pandas as pd
df = pd.DataFrame({"time": [100, 200, 300, 400, 500, 600, 700, 800],
                   "weights": [120, 160, 190, 110, 34, 55, 66, 33]})
totalRows = df['time'].count()
def findMaximumValRelativeToCurrentRow(row):
    index = row.name
    if index != totalRows:
        tempDf = df[index:totalRows]
        val = tempDf['weights'].max()
        # set_value was removed in pandas 1.0; df.at[index, 'max'] = val is the modern equivalent
        df.set_value(index, 'max', val)
    else:
        df.set_value(index, 'max', row['weights'])
df.apply(findMaximumValRelativeToCurrentRow, axis=1)
print(df)
Is there any better way to do the operation than this?
You can use cummax with iloc for reverse order:
print (df['weights'].iloc[::-1])
7 33
6 66
5 55
4 34
3 110
2 190
1 160
0 120
Name: weights, dtype: int64
df['max1'] = df['weights'].iloc[::-1].cummax()
print (df)
time weights max max1
0 100 120 190.0 190
1 200 160 190.0 190
2 300 190 190.0 190
3 400 110 110.0 110
4 500 34 66.0 66
5 600 55 66.0 66
6 700 66 66.0 66
7 800 33 33.0 33
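The reversed cummax includes the current row in the maximum, which matches the function in the question. If the maximum should instead cover only the rows after the current one, as the question's wording suggests, shifting the result up by one row is one way to get it. A minimal sketch, with the column names `max_from_here` and `max_after` introduced here for illustration:

```python
import pandas as pd

df = pd.DataFrame({
    "time": [100, 200, 300, 400, 500, 600, 700, 800],
    "weights": [120, 160, 190, 110, 34, 55, 66, 33],
})

# Running maximum computed from the bottom up, then flipped back to the original order.
suffix_max = df["weights"].iloc[::-1].cummax().iloc[::-1]

# Maximum over the current row and everything after it.
df["max_from_here"] = suffix_max

# Maximum over the rows strictly after the current one; the last row has none, so it is NaN.
df["max_after"] = suffix_max.shift(-1)

print(df)
```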
