pandas python how to make a groupby with more columns - python

This is my question:
I have a file csv like this:
SELL,NUMBER,TYPE,MONTH
-1484829.72,25782,E,3
-1337196.63,26688,E,3
-1110271.83,15750,E,3
-1079426.55,16117,E,3
-964656.26,11344,D,1
-883818.81,10285,D,2
-836068.57,14668,E,3
-818612.27,13806,E,3
-765820.92,14973,E,3
-737911.62,8685,D,2
-728828.93,8975,D,1
-632200.31,12384,E
41831481.50,18425,E,2
1835587.70,33516,E,1
1910671.45,20342,E,6
1916569.50,24088,E,6
1922369.40,25101,E,1
2011347.65,23814,E,3
2087659.35,18108,D,3
2126371.86,34803,E,2
2165531.50,35389,E,3
2231818.85,37515,E,3
2282611.90,32422,E,6
2284141.50,21199,A,1
2288121.05,32497,E,6
I want to group by TYPE and sum the columns SELL and NUMBER, separating negative and positive values.
I used this command:
end_result = info.groupby(['TYPE']).agg({
    'SELL': (('negative', lambda x: x[x < 0].sum()),
             ('positive', lambda x: x[x > 0].sum())),
    'NUMBER': (('negative', lambda x: x[info['SELL'] < 0].sum()),
               ('positive', lambda x: x[info['SELL'] > 0].sum())),
})
And the result is the following:
           SELL                NUMBER
       negative   positive  negative positive
TYPE
A      -1710.60    5145.25        17        9
B        -95.40    3391.10         1       29
C      -3802.25   36428.40       191     1063
D          0.00      30.80         0        7
E     -19143.30  102175.05       687     1532
But I want to extend this groupby with the MONTH column, something like this:
                                  1                                        2
           SELL                NUMBER                 SELL               NUMBER
       negative   positive  negative positive    negative positive  negative positive
TYPE
A      -1710.60    5145.25        17        9     -xxx.xx    xx.xx        xx       xx
B        -95.40    3391.10         1       29
C      -3802.25   36428.40       191     1063
D          0.00      30.80         0        7
E     -19143.30  102175.05       687     1532
Any idea?
Thanks in advance for your help
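For reference, the sample above can be read into a DataFrame named info along these lines (the file name here is just a placeholder):
import pandas as pd

info = pd.read_csv('sales.csv')   # the CSV text shown above, saved to a file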

This should work:
import numpy as np

end_result = (
    info.groupby(['TYPE', 'MONTH', np.sign(info.SELL)])  # group by negative/positive SELL as well
        [['SELL', 'NUMBER']].sum()       # select the columns to aggregate
                                         # (redundant here, since those are the only two
                                         #  columns left once TYPE and MONTH become the index)
        .unstack([1, 2])                 # move MONTH and the sign level into the columns
        .reorder_levels([1, 0, 2], axis=1)   # MONTH on top, pos/neg as the last column level
        .sort_index(axis=1)
        .rename({-1: 'negative', 1: 'positive'}, axis=1, level=-1)
)
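np.sign is what splits the rows here: it maps each SELL value to -1, 0 or 1, so it can act as an extra grouping key. A tiny check:
import numpy as np
import pandas as pd

s = pd.Series([-1484829.72, 41831481.50, 0.0])
print(np.sign(s).tolist())   # [-1.0, 1.0, 0.0]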

This is similar to RichieV's answer; I was unaware of np.sign, which is a neat trick.
Another way to do this is to .assign a flag column with np.where that marks each row as positive or negative. Then group by all the non-numerical columns and move the second and third index levels into the columns with .unstack([1, 2]).
info = (info.assign(flag=np.where(info['SELL'] > 0, 'positive', 'negative'))
            .groupby(['TYPE', 'MONTH', 'flag'])[['SELL', 'NUMBER']].sum()
            .unstack([1, 2]))
Output omitted (multi-indexes are messy to display).

Related

How to make pandas cut have first range equal to minimum value

I have this dataframe:
lst = [0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,2,3,3,3,3,3,3,3,3,3,3,3,3,3]
ser = pd.Series(lst)
df1 = pd.DataFrame(ser, columns=['Quantity'])
When I check the unique values of the Quantity variable, I get the following distribution:
df1.groupby(['Quantity'])['Quantity'].count() / sum ( df1['Quantity'])
Quantity
0 0.741935
1 0.338710
2 0.016129
3 0.209677
Name: Quantity, dtype: float64
Because the value 2 represents only 0.016 of the data, I want to create a new categorical variable with "bins" like:
Quantity
0
1-2
3+
How the bins are created is not relevant; the rule of thumb is: if a number has low representation, it should be aggregated with other values into a class (bin).
Other example:
Quantity
0 2662035
1 1200
2 2
could be converted into:
Quantity
0
1+
You can define the bins the way you want in pandas.cut; by default the right edge of each bin is included:
import numpy as np

(pd.cut(df1['Quantity'], bins=[-1, 0, 2, np.inf], labels=['0', '1-2', '3+'])
   .value_counts()
)
Output:
0      46
1-2    22
3+     13
Name: Quantity, dtype: int64
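If the goal is to keep the binned variable as a new column rather than just counting it, a small follow-up sketch (the column name is arbitrary):
df1['Quantity_bin'] = pd.cut(df1['Quantity'], bins=[-1, 0, 2, np.inf],
                             labels=['0', '1-2', '3+'])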
Combining counts based on a threshold:
threshold = 0.05
c = df1['Quantity'].value_counts(sort=False).sort_index()
group = c.div(c.sum()).gt(threshold).cumsum()
(c.reset_index()
  .groupby(group)
  .agg({'index': lambda x: f'{x.iloc[0]}-{x.iloc[-1]}' if len(x) > 1 else str(x.iloc[0]),
        'Quantity': 'sum',
       })
  .set_index('index')
)
Output:
Quantity
index
0 46
1-2 22
3 13
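To see how the grouping key is built, a quick look at the intermediate values (assuming the df1 from the question):
threshold = 0.05
c = df1['Quantity'].value_counts(sort=False).sort_index()   # count per unique value
share = c.div(c.sum())                                      # share of each value
print(share.gt(threshold))            # which values are frequent enough to open a new bin
print(share.gt(threshold).cumsum())   # bin id: rare values stay attached to the previous bin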

Calculate average based on available data points

Imagine I have the following data frame:
Product   Month 1   Month 2   Month 3   Month 4   Total
Stuff A         5         0         3         3      11
Stuff B        10        11         4         8      33
Stuff C         0         0        23        30      53
that can be constructed from:
df = pd.DataFrame({'Product': ['Stuff A', 'Stuff B', 'Stuff C'],
                   'Month 1': [5, 10, 0],
                   'Month 2': [0, 11, 0],
                   'Month 3': [3, 4, 23],
                   'Month 4': [3, 8, 30],
                   'Total': [11, 33, 53]})
This data frame shows the amount of units sold per product, per month.
Now, what I want to do is to create a new column called "Average" that calculates the average units sold per month. HOWEVER, notice in this example that Stuff C's values for months 1 and 2 are 0. This product was probably introduced in Month 3, so its average should be calculated based on months 3 and 4 only. Also notice that Stuff A's units sold in Month 2 were 0, but that does not mean the product was introduced in Month 3 since 5 units were sold in Month 1. That is, its average should be calculated based on all four months. Assume that the provided data frame may contain any number of months.
Based on these conditions, I have come up with the following solution in pseudo-code:
months = ["list of index names of months to calculate"]
x = len(months)
if df["Month 1"] != 0:
    df["Average"] = df["Total"] / x
elif df["Month 2"] != 0:
    df["Average"] = df["Total"] / (x - 1)
...
elif df["Month " + str(x)] != 0:
    df["Average"] = df["Total"] / 1
else:
    df["Average"] = 0
That way, the average would be calculated starting from the first month where units sold are different from 0. However, I haven't been able to translate this logical abstraction into actual working code. I couldn't manage to iterate over len(months) while maintaining the elif conditions. Or maybe there is a better, more practical approach.
I would appreciate any help, since I've been trying to crack this problem for a while with no success.
There is a NumPy function, np.trim_zeros, that trims leading and/or trailing zeros. Using a list comprehension, you can iterate over the relevant DataFrame rows, trim the leading zeros, and take the mean of what remains for each row.
Note that since 'Month 1' to 'Month 4' are consecutive, you can slice the columns between them using .loc.
import numpy as np
df['Average Sales'] = [np.trim_zeros(row, trim='f').mean()
                       for row in df.loc[:, 'Month 1':'Month 4'].to_numpy()]
Output:
Product Month 1 Month 2 Month 3 Month 4 Total Average Sales
0 Stuff A 5 0 3 3 11 2.75
1 Stuff B 10 11 4 8 33 8.25
2 Stuff C 0 0 23 30 53 26.50
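A quick check of what trim='f' (front) does on a single row:
import numpy as np

row = np.array([0, 0, 23, 30])
print(np.trim_zeros(row, trim='f'))          # [23 30] - only the leading zeros are removed
print(np.trim_zeros(row, trim='f').mean())   # 26.5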
Try:
df = df.set_index(['Product','Total'])
df['Average'] = df.where(df.ne(0).cummax(axis=1)).mean(axis=1)
df_out=df.reset_index()
print(df_out)
Output:
Product Total Month 1 Month 2 Month 3 Month 4 Average
0 Stuff A 11 5 0 3 3 2.75
1 Stuff B 33 10 11 4 8 8.25
2 Stuff C 53 0 0 23 30 26.50
Details:
Move Product and Total into the dataframe index, so we can do the calculation on the rest of the dataframe.
First, create a boolean matrix by comparing the values to zero with ne. Then apply cummax along the rows: once a non-zero value appears, the mask stays True until the end of the row; if a row starts with zeros, it stays False until the first non-zero value, then turns to True and remains True.
Next, use pd.DataFrame.where to keep only the values where that boolean matrix is True; the other values (the leading zeros) become NaN and are ignored in the calculation of mean.
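A small illustration of that mask, using just the month columns of the df from the question:
import pandas as pd

months = pd.DataFrame({'Month 1': [5, 10, 0],
                       'Month 2': [0, 11, 0],
                       'Month 3': [3, 4, 23],
                       'Month 4': [3, 8, 30]})
mask = months.ne(0).cummax(axis=1)        # True from the first non-zero month onwards
print(mask)
print(months.where(mask).mean(axis=1))    # leading zeros become NaN and are skipped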
If you don't mind it being a little memory inefficient, you could put your dataframe into a numpy array. Numpy has a built-in function to remove zeroes from an array, and then you could use the mean function to calculate the average. It could look something like this:
import numpy as np
arr = np.array(Stuff_A_DF)
mean = arr[np.nonzero(arr)].mean()
Alternatively, you could manually extract the row to a list, then loop through to remove the zeroes.

Sum groups of flagged items and then find max values

I'd like to sum the values grouped by positive and negative flows and then compare them to figure out the largest negative and largest positive flows.
I think itertools is probably the way to do this but can't figure it out.
# create a data frame that shows week and value
n_rows = 30
dftest = pd.DataFrame({'week': pd.date_range('1/4/2019', periods=n_rows, freq='W'),
                       'value': np.random.randint(-100, 100, size=(n_rows))})
# flag positives and negatives
def flowFinder(row):
    if row['value'] > 0:
        return "Positive"
    else:
        return "Negative"

dftest['flag'] = dftest.apply(flowFinder, axis=1)
dftest
In this example df, you'd determine that rows 15-19 add up to 249, which is the maximum of all the positive flows. The largest negative flow is row 5 with -98.
Edit by Scott Boston
It is best if you add code that generates your dataframe instead of linking to a picture.
df = pd.DataFrame({'week': pd.date_range('2019-01-06', periods=21, freq='W'),
                   'value': [64, 43, 94, -19, 3, -98, 1, 80, -7, -43, 45, 58, 27, 29,
                             -4, 20, 97, 30, 22, 80, -95],
                   'flag': ['Positive']*3 + ['Negative'] + ['Positive'] + ['Negative'] +
                           ['Positive']*2 + ['Negative']*2 + ['Positive']*4 +
                           ['Negative'] + ['Positive']*5 + ['Negative']})
You can try this:
df.groupby((df['flag'] != df['flag'].shift()).cumsum())['value'].sum().agg(['min','max'])
Output:
min -98
max 249
Name: value, dtype: int64
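The grouping key (df['flag'] != df['flag'].shift()).cumsum() labels each consecutive run of equal flags with its own number; a quick look, using the df defined above:
runs = (df['flag'] != df['flag'].shift()).cumsum()   # increments at every change of sign
print(df.assign(run=runs).head(8))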
Using rename:
df.groupby((df['flag'] != df['flag'].shift()).cumsum())['value'].sum().agg(['min', 'max'])\
  .rename(index={'min': 'Negative', 'max': 'Positive'})
Output:
Negative -98
Positive 249
Name: value, dtype: int64
Update, per a comment on the answer:
df_out = df.groupby((df['flag'] != df['flag'].shift()).cumsum())[['value', 'week']]\
           .agg({'value': 'sum', 'week': 'last'})
df_out.loc[df_out.agg({'value':['idxmin','idxmax']}).squeeze().tolist()]
Output:
value week
flag
4 -98 2019-02-10
9 249 2019-05-19

Applying dataframe-returning function to every row of base dataframe

Toy example
Suppose that base_df is the tiny dataframe shown below:
In [221]: base_df
Out[221]:
seed
I S
0 a 0
b 1
1 a 2
b 3
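For reference, a base_df like this can be built as follows (just a sketch to make the example reproducible):
import pandas as pd

base_df = pd.DataFrame(
    {'seed': [0, 1, 2, 3]},
    index=pd.MultiIndex.from_product([[0, 1], ['a', 'b']], names=['I', 'S']),
)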
Note that base_df has a 2-level multi-index for the rows. (Part of the problem here involves "propagating" this multi-index's values in a derived dataframe.)
Now, the function fn (definition given at the end of this post) takes an integer seed as argument and returns a 1-column dataframe indexed by string keys [1]. For example:
In [222]: fn(0)
Out[222]:
F
key
01011 0.592845
10100 0.844266
In [223]: fn(1)
Out[223]:
F
key
11110 0.997185
01000 0.932557
11100 0.128124
I want to generate a new dataframe, in essence, by applying fn to every row of base_df, and concatenating the resulting dataframes vertically. More specifically, the desired result would look like this:
F
I S key
0 a 01011 0.592845
10100 0.844266
b 11110 0.997185
01000 0.932557
11100 0.128124
1 a 01101 0.185082
01110 0.931541
b 00100 0.070725
11011 0.839949
11111 0.121329
11000 0.569311
IOW, conceptually, the desired dataframe is obtained by generating one "sub-dataframe" for each row of base_df, and concatenating these sub-dataframes vertically. The sub-dataframe corresponding to each row has a 3-level multi-index. The first two levels (I and S) of this multi-index come from base_df's multi-index value for that row, while its last level (key), as well as the values for the (lone) F column come from the dataframe returned by fn for that row's seed value.
The part I'm not clear on is how to propagate the row's original multi-index value to the rows of the dataframe created by fn for that row's seed value.
IMPORTANT: I'm looking for a way to do this that is agnostic to the names of the base_df's multi-index's levels, and their number.
I tried the following
base_df.apply(lambda row: fn(row.seed), axis=1)
...but the evaluation fails with the error
ValueError: Shape of passed values is (4, 2), indices imply (4, 1)
Is there some convenient way to do what I'm trying to do?
Here's the definition of fn. Its internals are unimportant as far as this question is concerned. What matters is that it takes an integer seed as argument, and returns a dataframe, as described earlier.
import numpy
import pandas

def fn(seed, _spec='{{0:0{0:d}b}}'.format(5)):
    numpy.random.seed(int(seed))
    n = numpy.random.randint(2, 5)
    r = numpy.random.rand(n)
    k = list(map(_spec.format, numpy.random.randint(0, 31, size=n)))
    result = pandas.DataFrame(r, columns=['F'], index=k)
    result.index.name = 'key'
    return result
[1] In this example, these keys happen to correspond to the binary representation of some integer between 0 and 31, inclusive, but this fact plays no role in the question.
Option 1
groupby
base_df.groupby(level=[0, 1])['seed'].apply(fn)
F
I S key
0 a 11010 0.385245
00010 0.890244
00101 0.040484
b 01001 0.569204
11011 0.802265
00100 0.063107
1 a 00100 0.947827
00100 0.056551
11000 0.084872
b 11110 0.592641
00110 0.130423
11101 0.915945
Option 2
pd.concat
pd.concat({t.Index: fn(t.seed) for t in base_df.itertuples()})
F
key
0 a 11011 0.592845
00011 0.844266
b 00101 0.997185
01111 0.932557
00000 0.128124
1 a 01011 0.185082
10010 0.931541
b 10011 0.070725
01010 0.839949
01011 0.121329
11001 0.569311
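One difference from Option 1 is that the concatenated result loses the I and S level names; if that matters, pd.concat's names argument can restore them (a sketch, worth verifying on your pandas version):
pd.concat(
    {t.Index: fn(t.seed) for t in base_df.itertuples()},
    names=list(base_df.index.names),   # names for the new key levels; 'key' comes from fn's result
)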

How to apply different functions to a groupby object?

I have a dataframe like this:
import pandas as pd
df = pd.DataFrame({'id': [1, 2, 1, 1, 2, 1, 2, 2],
                   'min_max': ['max_val', 'max_val', 'min_val', 'min_val',
                               'max_val', 'max_val', 'min_val', 'min_val'],
                   'value': [1, 20, 20, 10, 12, 3, -10, -5]})
id min_max value
0 1 max_val 1
1 2 max_val 20
2 1 min_val 20
3 1 min_val 10
4 2 max_val 12
5 1 max_val 3
6 2 min_val -10
7 2 min_val -5
Each id has several maximal and minimal values associated with it. My desired output looks like this:
max min
id
1 3 10
2 20 -10
It contains the maximal max_val and the minimal min_val for each id.
Currently I implement that as follows:
gdf = df.groupby(by=['id', 'min_max'])['value']
max_max = gdf.max().loc[:, 'max_val']
min_min = gdf.min().loc[:, 'min_val']
final_df = pd.concat([max_max, min_min], axis=1)
final_df.columns = ['max', 'min']
What I don't like is that I have to call .max() and .min() on the grouped dataframe gdf separately, where I throw away 50% of the information (since I am not interested in the maximal min_val and the minimal max_val).
Is there a way to do this in a more straightforward manner by e.g. passing the function that should be applied to a group directly to the groupby call?
EDIT:
df.groupby('id')['value'].agg(['max','min'])
is not sufficient as there can be the case that a group has a min_val that is higher than all max_val for that group or a max_val that is lower than all min_val. Thus, one also has to group based on the column min_max.
Result for
df.groupby('id')['value'].agg(['max','min'])
max min
id
1 20 1
2 20 -10
Result for the code from above:
max min
id
1 3 10
2 20 -10
Here's a slightly tongue-in-cheek solution:
>>> df.groupby(['id', 'min_max'])['value'].apply(lambda g: getattr(g, g.name[1][:3])()).unstack()
min_max max_val min_val
id
1 3 10
2 20 -10
This applies a function that grabs the name of the real function to apply from the group key.
Obviously this wouldn't work so simply if there weren't such a simple relationship between the string "max_val" and the function name "max". It could be generalized by having a dict mapping column values to functions to apply, something like this:
func_map = {'min_val': min, 'max_val': max}
df.groupby(['id', 'min_max'])['value'].apply(lambda g: func_map[g.name[1]](g)).unstack()
Note that this is slightly less efficient than the version above, since it calls the plain Python max/min rather than the optimized pandas versions. But if you want a more generalizable solution, that's what you have to do, because there aren't optimized pandas versions of everything. (This is also more or less why there's no built-in way to do this: for most data, you can't assume a priori that your values can be mapped to meaningful functions, so it doesn't make sense to try to determine the function to apply based on the values themselves.)
One option is to do the customized aggregation with groupby.apply, since it doesn't fit the built-in aggregation scenarios well:
(df.groupby('id')
   .apply(lambda g: pd.Series({'max': g.value[g.min_max == "max_val"].max(),
                               'min': g.value[g.min_max == "min_val"].min()})))
# max min
#id
# 1 3 10
# 2 20 -10
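On pandas 0.25 or later, the same result can also be written with boolean masks plus named aggregation; a sketch, assuming the df from the question:
out = (df.assign(max_val=df['value'].where(df['min_max'] == 'max_val'),
                 min_val=df['value'].where(df['min_max'] == 'min_val'))
         .groupby('id')
         .agg(max=('max_val', 'max'), min=('min_val', 'min')))
print(out)
#     max  min
# id
# 1     3   10
# 2    20  -10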
Solution with pivot_table:
df1 = df.pivot_table(index='id', columns='min_max', values='value', aggfunc=[np.min,np.max])
df1 = df1.loc[:, [('amin','min_val'), ('amax','max_val')]]
df1.columns = df1.columns.droplevel(1)
print (df1)
amin amax
id
1 10 3
2 -10 20
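To match the desired max/min layout from the question, a small follow-up (assigning by position, so it works whether the aggregation level is labelled amin/amax or min/max in your pandas/numpy version):
df1.columns = ['min', 'max']   # first column came from the min aggregation, second from max
print(df1[['max', 'min']])
#     max  min
# id
# 1     3   10
# 2    20  -10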
