Sum again after a groupby - python

I'm struggling to sum a series after having already grouped the dataframe, and I was hoping someone could help me with an idea.
In the example below I need the sum per each "Material": material "ABC" should give me 2, and all the others, since they have only one type of sign operation, would keep the same value.
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "Material": ["M-12", "H4-LAMPE", "M-12", "H4-LAMPE",
                 "ABC", "H4-LAMPE", "ABC", "ABC"],
    "Quantity": [6, 1, 3, 5, 1, 1, 10, 9],
    "TYPE": ["+", "-", "+", "-", "+", "-", "+", "-"]})

# Build a numeric Sign column from the "+" / "-" flags
listX = []
for item in df["TYPE"]:
    if item == "+":
        listX.append(1)
    elif item == "-":
        listX.append(-1)

df["Sign"] = listX
df["MovementsQty"] = df["Quantity"] * df["Sign"]

df1 = df.groupby(["Material", "TYPE"]).sum()
df1.drop(columns=["Quantity", "Sign"], inplace=True)
print(df1)
The result is:

               MovementsQty
Material TYPE
ABC      +               11
         -               -9
H4-LAMPE -               -7
M-12     +                9

The desired result is one row per Material with the net quantity:

   Material TYPE  Quantity
0       ABC    +         2
1  H4-LAMPE    -        -7
2      M-12    +         9
I tried to sum it again and to group it differently, but I was not successful so far, and I think I need some help.
Thank you very much for your help.

You're on the right track. I've tried to improve your code: use "TYPE" to assign the sign with np.where, group by "Material" and sum, and then re-compute a "Type" column from the sign of the result.
v = (df.assign(Quantity=np.where(df.TYPE == '+', df.Quantity, -df.Quantity))
       .groupby('Material', as_index=False)[['Quantity']]
       .sum())
v.insert(1, 'Type', np.where(np.sign(v.Quantity) == 1, '+', '-'))
print(v)
Material Type Quantity
0 ABC + 2
1 H4-LAMPE - -7
2 M-12 + 9
Alternatively, you can do this with two groupby calls:
i = df.query('TYPE == "+"').groupby('Material').Quantity.sum()
j = df.query('TYPE == "-"').groupby('Material').Quantity.sum()
# Find the union of the indexes.
idx = i.index.union(j.index)
# Reindex and subtract.
v = i.reindex(idx).fillna(0).sub(j.reindex(idx).fillna(0)).reset_index()
# Insert the Type column back into the result.
v.insert(1, 'Type', np.where(np.sign(v.Quantity) == 1, '+', '-'))
print(v)
Material Type Quantity
0 ABC + 2.0
1 H4-LAMPE - -7.0
2 M-12 + 9.0
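As a side note, Series.sub accepts a fill_value, which aligns the two indexes for you and collapses the reindex/fillna dance into a single call; a small sketch reusing i and j from above:

# sub aligns on the union of the two indexes; missing entries count as 0
v = i.sub(j, fill_value=0).reset_index()
v.insert(1, 'Type', np.where(np.sign(v.Quantity) == 1, '+', '-'))
print(v)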

Here is another take (fairly similar to coldspeed's, though).
# Flip Quantity to negative according to TYPE
df.loc[df['TYPE'] == '-', 'Quantity'] *= -1
# Rebuild df as the per-Material sum of Quantity to remove duplicates
df = df.groupby('Material')['Quantity'].sum().reset_index()
# Recompute TYPE from the sign of the net quantity
df['TYPE'] = np.where(df['Quantity'] < 0, '-', '+')
print(df)
Returns:
Material Quantity TYPE
0 ABC 2 +
1 H4-LAMPE -7 -
2 M-12 9 +

map and numpy.sign
Just sum up Quantity * TYPE (with TYPE mapped to ±1) and figure out the sign afterwards.
# Map '+'/'-' to +1/-1, and keep the reverse mapping for later
d = {'+': 1, '-': -1}
r = dict(map(reversed, d.items())).get
q = df.Quantity
m = df.Material
t = df.TYPE
# Signed quantities indexed by Material, summed per index level 0
s = pd.Series((q * t.map(d)).values, m, name='MovementsQty').sum(level=0)
s.reset_index().assign(TYPE=lambda x: [*map(r, np.sign(x.MovementsQty))])
Material MovementsQty TYPE
0 M-12 9 +
1 H4-LAMPE -7 -
2 ABC 2 +
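A note on that last step: Series.sum(level=0) was deprecated in pandas 1.3 and removed in 2.0. A minimal sketch of the same idea for current pandas, assuming the sample df from the question, swaps it for an explicit groupby on the index:

import numpy as np
import pandas as pd

df = pd.DataFrame({
    "Material": ["M-12", "H4-LAMPE", "M-12", "H4-LAMPE",
                 "ABC", "H4-LAMPE", "ABC", "ABC"],
    "Quantity": [6, 1, 3, 5, 1, 1, 10, 9],
    "TYPE": ["+", "-", "+", "-", "+", "-", "+", "-"]})

d = {'+': 1, '-': -1}
# Signed quantities, indexed by Material; groupby(level=0) replaces sum(level=0)
s = (df.Quantity * df.TYPE.map(d)).set_axis(df.Material).groupby(level=0).sum()
print(s)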

Related

In Python I need to do an iterative groupby that accesses the previous "grouped value" to establish the value of each row of the aggregated column

I have the following dataset that you can replicate with this code:
import pandas as pd

number_order = [2, 2, 3, 3, 5, 5, 5, 6]
number_fakecouriers = [1, 2, 1, 2, 1, 2, 3, 3]
dictio = {"number_order": number_order, "number_fakecouriers": number_fakecouriers}
actual_table = pd.DataFrame(dictio)
What I need is to write code that, through a for loop or a groupby, generates the following result:

   number_order  number_fakecouriers
0             2                    1
1             3                    2
2             5                    3
3             6                 None
The code should perform a groupby on the column "number_order" and then take the minimum of the column "number_fakecouriers", but each time it should iteratively exclude the minimum values of "number_fakecouriers" that have already been selected. Then, in case there are no more values available, it should input a None.
This is the explanation row by row:
1) "number_order" = 2: here the value of "number_fakecouriers" is 1; it is simply the minimum value of "number_fakecouriers" where ["number_order" = 2], because it is the first value that appears
2) "number_order" = 3: here the value of "number_fakecouriers" is 2, because 1 has already been selected for ["number_order" = 2]; excluding 1, the minimum value where ["number_order" = 3] is 2
3) "number_order" = 5: here the value of "number_fakecouriers" is 3, because 1 and 2 have already been selected
4) "number_order" = 6: here the value of "number_fakecouriers" is None, because the only value of "number_fakecouriers" where ["number_order" = 6] is 3, and 3 has already been selected
Try:
def fn(x, seen):
    # Return the first value in the group that has not been used yet;
    # falls through to None (shown as NaN) when every value was seen.
    for v in x:
        if v in seen:
            continue
        seen.add(v)
        return v

out = (
    actual_table.groupby("number_order")["number_fakecouriers"]
    .apply(fn, seen=set())
    .reset_index()
)
print(out)
Prints:
number_order number_fakecouriers
0 2 1.0
1 3 2.0
2 5 3.0
3 6 NaN
Note: you can sort the dataframe before processing (if it is not sorted already):
actual_table = actual_table.sort_values(
    by=["number_order", "number_fakecouriers"]
)
...
Loop over the groupby object and record the previously seen minimum of each group:
import numpy as np

res, prev_min = [], set()
for name, group in actual_table.groupby('number_order'):
    # Values in this group that were not already picked as a minimum
    diff = set(group['number_fakecouriers']).difference(prev_min)
    if len(diff):
        m = min(diff)
        prev_min.add(m)
    else:
        m = np.nan
    res.append([name, m])
out = pd.DataFrame(res, columns=actual_table.columns)
print(out)
number_order number_fakecouriers
0 2 1.0
1 3 2.0
2 5 3.0
3 6 NaN

Count by groups (pandas)

I have 5 years of stock data. I need to do this: take years 1, 2 and 3. What is the probability that after seeing k consecutive "down days", the next day is an "up day"? For example, if k = 3, what is the probability of seeing "−, −, −, +" as opposed to seeing "−, −, −, −"? Compute this for k = 1, 2, 3. I have played with groupby and cumsum, but can't seem to get it right.
For example:
group1 = df[df['True Label'] == "-"].groupby((df['True Label'] != "-").cumsum())
Date        True Label
2019-01-02  +
2019-01-03  -
2019-01-04  +
2019-01-07  +
2019-01-08  +
Try this bit of logic:
import pandas as pd
import numpy as np
np.random.seed(123)
s = pd.Series(np.random.choice(['+','-'], 1000))
sm = s.groupby((s == '+').cumsum()).cumcount()
prob = (sm.diff() == -3).sum() / (sm == 3).sum()
prob
Output:
0.43661971830985913
Details:
Use (s == '+').cumsum() to create groups of '-' records, then groupby and cumcount the elements of each group. The first element is the '+', and cumcount starts at zero, therefore '+--' becomes 0, 1, 2. Now take the difference to find where '-' turns into '+': if it equals -3, we know the group had three minuses followed by a '+'.
Check sm == 3 to count all the times you had '---', then sum and divide.
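The question asks for k = 1, 2, 3, and the same counting trick generalizes directly; a minimal sketch reusing the series s and the cumcount sm from above:

for k in (1, 2, 3):
    # sm == k marks the k-th consecutive '-'; a diff of exactly -k marks a
    # run of exactly k minuses ending in a '+'
    prob = (sm.diff() == -k).sum() / (sm == k).sum()
    print(k, prob)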

Loop and Accumulate Sum from Pandas Column Made of Lists

Currently, my pandas data frame looks like the following:
Row_X
["Medium", "High", "Low"]
["Medium"]
My intention is to iterate through the list in each row such that:
summation = 0
for value in df["Row_X"]:
    if "High" in value:
        summation = summation + 10
    elif "Medium" in value:
        summation = summation + 5
    else:
        summation = summation + 0
Finally, I wish to apply this to each row and create a new column that looks like the following:
Row_Y
15
5
My assumption is that either np.select() or apply() can play into this, but thus far I have encountered errors implementing either.
We can do:
mapper = {'Medium': 5, 'High': 10}
df['Row_Y'] = [sum([mapper[word] for word in l if word in mapper])
               for l in df['Row_X']]
If your pandas version is > 0.25.0, we can use explode:
df['Row_Y'] = df['Row_X'].explode().map(mapper).sum(level=0)
print(df)
Row_X Row_Y
0 [Medium, High, Low] 15
1 [Medium] 5
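On pandas 2.x, where sum(level=0) has been removed, the same explode idea still works with an explicit groupby on the index; a small sketch assuming the df and mapper from above:

# Unmapped words (e.g. 'Low') become NaN and are skipped by sum()
df['Row_Y'] = df['Row_X'].explode().map(mapper).groupby(level=0).sum()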
Maybe on a cleaner side, convert each list to a Series and directly use map:
mapp = {'Medium' : 5, 'High' : 10}
df['Row_Y'] = df['Row_X'].apply(lambda x: pd.Series(x).map(mapp).sum())
df
Row_X Row_Y
0 [Medium, High, Low] 15.0
1 [Medium] 5.0
Map your function to the series
import pandas as pd
def function(x):
    # Sum a score for each word in the list
    summation = 0
    for i in x:
        if "High" in i:
            summation += 10
        elif "Medium" in i:
            summation += 5
    return summation
df = pd.DataFrame({'raw_x': [['Medium', 'High', 'Low'], ['Medium']]})
df['row_y'] = df['raw_x'].map(function)
You can do it in a shorter format with
mapping = {'High': 10, 'Medium': 5, 'Low': 0}
df['row_y'] = df['raw_x'].map(lambda x: sum([mapping[i] if i in mapping else 0 for i in x]))
print(df)
raw_x row_y
0 [Medium, High, Low] 15
1 [Medium] 5
This solution should work:
vals = {"High": 10, "Medium": 5, "Low": 0}
df['Row_Y'] = df.apply(lambda row: sum(vals[i] for i in row['Row_X']), axis=1)

Add a Rand to each row using Pandas' Assign and a Lambda Function

How do I pass a random number to a lambda function, so that it can be used by pandas' assign to add a different random number to each row?
My original attempt,
df = pd.DataFrame({'cat': [1]*10 + [0]*10,
                   'value': [3]*5 + [2]*5 + [2]*2 + [3]*8})
df.assign(cat=lambda df: df.cat + np.random.rand(1)).head(3)
Out[1]:
cat value
0 1.962857 3
1 1.962857 3
2 1.962857 3
We see here that the same random number, 0.962857, has been added to all rows. But I would like a different rand for each row. How can I do this?
Change np.random.rand(1), which returns a single scalar, to np.random.rand(len(df)), which returns an array with one value per row of the DataFrame:
print(np.random.rand(1))
[ 0.88642869]
print(np.random.rand(len(df)))
[ 0.42677701 0.89968857 0.87976326 0.07758206 0.43617027 0.03221375
0.46398119 0.14226246 0.14237448 0.22679517 0.60271752 0.85003435
0.5676184 0.87565266 0.89830548 0.27066452 0.23907483 0.73784657
0.09083235 0.98984701]
df = df.assign(cat=lambda df: df.cat + np.random.rand(len(df))).head(3)
print(df)
cat value
0 1.886429 3
1 1.426777 3
2 1.899689 3
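As an aside, newer NumPy code usually draws from the Generator API instead of the legacy np.random.rand; a minimal sketch of the same per-row idea (the seed 0 is arbitrary):

import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # seeded Generator for reproducibility
df = pd.DataFrame({'cat': [1]*10 + [0]*10,
                   'value': [3]*5 + [2]*5 + [2]*2 + [3]*8})
# rng.random(len(df)) returns one uniform draw per row
print(df.assign(cat=lambda d: d.cat + rng.random(len(d))).head(3))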

Assign values in Pandas series based on condition?

I have a dataframe df like
A B
1 2
3 4
I then want to create 2 new series
t = pd.Series()
r = pd.Series()
I was able to assign values to t using the condition cond as below:
t = "1+" + df.A.astype(str) + '+' + df.B.astype(str)
cond = df['A'] < df['B']
t[cond] = "1+" + df.loc[cond, 'B'].astype(str) + '+' + df.loc[cond, 'A'].astype(str)
But I'm having problems with r. I just want r to contain 2 where cond is satisfied and 1 otherwise.
If I just try
r = 1
r[cond] = 2
Then I get TypeError: 'int' object does not support item assignment
I figure I could just run a for loop through df and check the cases in cond through each row of df, but I was wondering if Pandas offers a more efficient way instead?
You will laugh at how easy this is:
r = cond + 1
The reason is that cond is a boolean series (True and False), and booleans evaluate to 1 and 0. Adding one coerces them to int, so True maps to 2 and False maps to 1.
df = pd.DataFrame({'A': [1, 3, 4],
'B': [2, 4, 3]})
cond = df['A'] < df['B']
>>> cond + 1
0 2
1 2
2 1
dtype: int64
When you assign 1 to r as in
r = 1
r now references the integer 1. So when you call r[cond] you're treating an integer like a series.
You want to first create a series of ones for r, the same size as cond. Something like:
r = pd.Series(np.ones(cond.shape))
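For what it's worth, np.where gives an equivalent one-liner by picking between two scalars element-wise; a small sketch assuming the cond from above:

import numpy as np
import pandas as pd

df = pd.DataFrame({'A': [1, 3, 4], 'B': [2, 4, 3]})
cond = df['A'] < df['B']
# 2 where the condition holds, 1 elsewhere, wrapped back into a Series
r = pd.Series(np.where(cond, 2, 1), index=df.index)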
