FutureWarning: Level keyword deprecated in 1.3, use groupby instead - python

I currently have a file where I create a hierarchy from the product and calculate the percentage split based on the previous level.
My code looks like this:
data = [['product1', 'product1a', 'product1aa', 10],
['product1', 'product1a', 'product1aa', 5],
['product1', 'product1a', 'product1aa', 15],
['product1', 'product1a', 'product1ab', 10],
['product1', 'product1a', 'product1ac', 20],
['product1', 'product1b', 'product1ba', 15],
['product1', 'product1b', 'product1bb', 15],
['product2', 'product2_a', 'product2_aa', 30]]
df = pd.DataFrame(data, columns = ["Product_level1", "Product_Level2", "Product_Level3", "Qty"])
prod_levels = ["Product_level1", "Product_Level2", "Product_Level3"]
df = df.groupby(prod_levels).sum("Qty")
df["Qty ratio"] = df["Qty"] / df["Qty"].sum(level=prod_levels[-2])
print(df)
This gives me this as a result:
                                               Qty  Qty ratio
Product_level1 Product_Level2 Product_Level3
product1       product1a      product1aa       30   0.500000
                               product1ab      10   0.166667
                               product1ac      20   0.333333
               product1b      product1ba       15   0.500000
                               product1bb      15   0.500000
product2       product2_a     product2_aa      30   1.000000
According to my version of pandas (1.3.2), I'm getting a FutureWarning that level is deprecated and that I should use a groupby instead.
FutureWarning: Using the level keyword in DataFrame and Series aggregations is deprecated and will be removed in a future version. Use groupby instead. df.sum(level=1) should use df.groupby(level=1).sum()
Unfortunately, I cannot seem to figure out the correct syntax to get the same results using groupby, so that this keeps working with future versions of pandas. I've tried variations of what's below, but none worked.
df["Qty ratio"] = df.groupby(["Product_level1", "Product_Level2", "Product_Level3"]).sum("Qty") / df.groupby(level=prod_levels[-1]).sum("Qty")
Can anyone suggest how I could approach this?
Thank you

The level keyword on many functions was deprecated in pandas 1.3; see the GitHub issue "Deprecate: level parameter for aggregations in DataFrame and Series" (#39983).
The following functions are affected:
any
all
count
sum
prod
max
min
mean
median
skew
kurt
sem
var
std
mad
The level argument was always rewritten internally into a groupby operation. It was deprecated to increase clarity and reduce redundancy in the library.
The general pattern: whatever level argument was passed to the aggregation should be moved to a groupby instead.
Sample Data:
import pandas as pd
df = pd.DataFrame(
    {'A': [1, 1, 2, 2],
     'B': [1, 2, 1, 2],
     'C': [5, 6, 7, 8]}
).set_index(['A', 'B'])
     C
A B
1 1  5
  2  6
2 1  7
  2  8
With aggregate over level:
df['C'].sum(level='B')
B
1 12
2 14
Name: C, dtype: int64
FutureWarning: Using the level keyword in DataFrame and Series aggregations is deprecated and will be removed in a future version. Use groupby instead.
This now becomes groupby over level:
df['C'].groupby(level='B').sum()
B
1 12
2 14
Name: C, dtype: int64
In this specific example:
df["Qty ratio"] = df["Qty"] / df["Qty"].sum(level=prod_levels[-2])
Becomes
df["Qty ratio"] = df["Qty"] / df["Qty"].groupby(level=prod_levels[-2]).sum()
Just move the level argument to groupby.
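Applied to the original example, a minimal sketch of the future-proof version (same data and grouping as in the question, only the level aggregation rewritten) might look like this:
import pandas as pd

data = [['product1', 'product1a', 'product1aa', 10],
        ['product1', 'product1a', 'product1aa', 5],
        ['product1', 'product1a', 'product1aa', 15],
        ['product1', 'product1a', 'product1ab', 10],
        ['product1', 'product1a', 'product1ac', 20],
        ['product1', 'product1b', 'product1ba', 15],
        ['product1', 'product1b', 'product1bb', 15],
        ['product2', 'product2_a', 'product2_aa', 30]]
df = pd.DataFrame(data, columns=["Product_level1", "Product_Level2", "Product_Level3", "Qty"])
prod_levels = ["Product_level1", "Product_Level2", "Product_Level3"]
df = df.groupby(prod_levels).sum()
# level=... moved into groupby(level=...); the division still aligns on Product_Level2
df["Qty ratio"] = df["Qty"] / df["Qty"].groupby(level=prod_levels[-2]).sum()
print(df)  # should reproduce the table from the question, without the FutureWarning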

Related

Finding the summation of values from two pandas dataframe column

I have a pandas dataframe like below
import pandas as pd
data = [[5, 10], [4, 20], [15, 30], [20, 15], [12, 14], [5, 5]]
df = pd.DataFrame(data, columns=['x', 'y'])
I am trying to attain the value of this expression.
I haven't got an idea how to multiply the first value in a column by the 2nd value in another column, as in the expression.
Try pd.DataFrame.shift() but I think you need to enter -1 into shift judging by the summation notation you posted. i + 1 implies using the next x or y, so shift needs to use a negative integer to shift 1 number ahead. Positive integers in shift go backwards.
Can you confirm 320 is the right answer?
0.5 * ((df.x * df.y.shift(-1)) - (df.x.shift(-1) + df.y)).sum()
>>>320
I think the code below computes the correct value in expresion_end:
import pandas as pd
data = [[5, 10], [4, 20], [15, 30], [20, 15], [12, 14], [5, 5]]
df = pd.DataFrame(data, columns=['x', 'y'])
df["x+1"]=df["x"].shift(periods=-1)
df["y+1"]=df["y"].shift(periods=-1)
df["exp"]=df["x"]*df["y+1"]-df["x+1"]*df["y"]
expresion_end=0.5*df["exp"].sum()
You can use pandas.DataFrame.shift(). You can compute shift(-1) once and use the result for both 'x' and 'y'.
>>> df_tmp = df.shift(-1)
>>> (df['x']*df_tmp['y'] - df_tmp['x']*df['y']).sum() * 0.5
-202.5
# Explanation
>>> df[['x+1', 'y+1']] = df.shift(-1)
>>> df
    x   y   x+1   y+1
0   5  10   4.0  20.0    # x*(y+1) - y*(x+1) = 5*20 - 10*4
1   4  20  15.0  30.0
2  15  30  20.0  15.0
3  20  15  12.0  14.0
4  12  14   5.0   5.0
5   5   5   NaN   NaN

making dataframe of stats output [duplicate]

Is there an easy and straightforward way to load the output from sp.stats.describe() into a DataFrame, including the value names? It doesn't seem to be a dictionary or anything similar. Of course I can manually attach the relevant column names (see below), but I was wondering whether it is possible to load it directly into a DataFrame with named columns.
import pandas as pd
import scipy as sp
import scipy.stats  # ensure scipy.stats is available; "import scipy" alone may not load it
data = pd.DataFrame({'a': [1, 2, 3, 4, 5], 'b': [1, 2, 3, 4, 5]})
a = sp.stats.describe(data['a'])
pd.DataFrame(a)
pd.DataFrame(a).transpose().rename(columns={0: 'N', 1: 'Min,Max',
2: 'Mean', 3: 'Var',
4: 'Skewness',
5: 'Kurtosis'})
You can use _fields for columns names from named tuple:
a = sp.stats.describe(data['a'])
df = pd.DataFrame([a], columns=a._fields)
print (df)
nobs minmax mean variance skewness kurtosis
0 5 (1, 5) 3.0 2.5 0.0 -1.3
Also is possible create dictionary from named tuples by _asdict:
d = sp.stats.describe(data['a'])._asdict()
df = pd.DataFrame([d], columns=d.keys())
print (df)
nobs minmax mean variance skewness kurtosis
0 5 (1, 5) 3.0 2.5 0.0 -1.3
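If you want the same summary for every column of a DataFrame, a small sketch building on the _asdict idea (the variable names here are just illustrative) could be:
import pandas as pd
from scipy import stats

data = pd.DataFrame({'a': [1, 2, 3, 4, 5], 'b': [1, 2, 3, 4, 5]})
# one row per column, built from the named tuple that describe() returns
summary = pd.DataFrame(
    [stats.describe(data[col])._asdict() for col in data.columns],
    index=data.columns
)
print(summary)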

Pandas DataFrame groupby apply and re-expand along grouped axis

Say I have a dataframe
             A   B    C   D
2019-01-01   1  10  100  12
2019-01-02   2  20  200  23
2019-01-03   3  30  300  34
And an array to group the columns by
array([0, 1, 0, 2])
I wish to group the dataframe by the array (on the column axis), apply a function, and then return a Series whose length is the number of columns, containing the result of the applied function for each column.
So, for the above (with the applied function taking the group's sum), would want to output:
A 606
B 60
C 606
D 69
dtype: int64
My best attempt:
func = lambda a: np.full(a.shape[1], np.sum(a.values))
df.groupby(groups, axis=1).apply(func)
0 [606, 606]
1 [60]
2 [69]
dtype: object
(in this example the applied function returns equal values inside a group, but this can't be guaranteed for the real case)
I cannot see how to do this with pandas grouping syntax, unless I am missing something. Could anyone lend a hand? Thanks!
Try this:
import numpy as np
import pandas as pd
groups = [0, 1, 0, 2]
df = pd.DataFrame({'A': [1, 2, 3],
                   'B': [10, 20, 30],
                   'C': [100, 200, 300],
                   'D': [12, 23, 34]})
temp = df.apply(sum).to_frame()
temp.index = pd.MultiIndex.from_arrays(
    np.stack([temp.index, groups]),
    names=("df columns", "groups")
)
temp_filter = temp.groupby(level=1).agg(sum)
result = temp.join(temp_filter, rsuffix='0'). \
    set_index(temp.index.get_level_values(0))["00"]
# df columns
# A 606
# B 60
# C 606
# D 69
# Name: 00, dtype: int64
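For the sum example in the question specifically, a shorter sketch (not the approach above, and it only works for aggregations like sum where the group result can be built from per-column results) is to aggregate the columns first and broadcast the group totals back with transform:
import numpy as np
import pandas as pd

groups = np.array([0, 1, 0, 2])
df = pd.DataFrame({'A': [1, 2, 3],
                   'B': [10, 20, 30],
                   'C': [100, 200, 300],
                   'D': [12, 23, 34]})
# per-column totals, then replace each total with the sum over its group
result = df.sum().groupby(groups).transform('sum')
print(result)
# A    606
# B     60
# C    606
# D     69
# dtype: int64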

Element-wise division with accumulated numbers in Python?

The title may come across as confusing (honestly, not quite sure how to summarize it in a sentence), so here is a much better explanation:
I'm currently handling a DataFrame A with different attributes, and I used a .groupby().count() on the data column age to create a list of occurrences:
A_sub = A.groupby(['age'])['age'].count()
A_sub returns a Series similar to the following (the values are randomly modified):
age
1 316
2 249
3 221
4 219
5 262
...
59 1
61 2
65 1
70 1
80 1
Name: age, dtype: int64
I would like to plot a list of values from element-wise division. The division I would like to perform is an element value divided by the sum of all the elements that has the index greater than or equal to that element. In other words, for example, for age of 3, it should return
221/(221+219+262+...+1+2+1+1+1)
The same calculation should apply to all the elements. Ideally, the outcome should be in a similar type/format so that it can be plotted.
Here is a quick example using numpy. A similar approach can be used with pandas. The for loop can most likely be replaced by something smarter and more efficient to compute the coefficients.
import numpy as np
ages = np.asarray([316, 249, 221, 219, 262])
coefficients = np.zeros(ages.shape)
for k, a in enumerate(ages):
    coefficients[k] = sum(ages[k:])
output = ages / coefficients
Output:
array([0.24940805, 0.26182965, 0.31481481, 0.45530146, 1. ])
EDIT: The coefficients initialization at 0 and the for loop can be replaced with:
coefficients = np.flip(np.cumsum(np.flip(ages)))
You can use the function cumsum() in pandas to get accumulated sums:
A_sub = A['age'].value_counts().sort_index(ascending=False)
(A_sub / A_sub.cumsum()).iloc[::-1]
No reason to use numpy; pandas already includes everything we need.
A_sub seems to be a Series where age is the index. That's not ideal, but it should be fine. The code below therefore operates on a Series, but it can easily be modified to work on DataFrames.
import numpy as np
import pandas as pd
s = pd.Series(data=np.random.randint(low=1, high=10, size=10), index=[0, 1, 3, 4, 5, 8, 9, 10, 11, 13], name="age")
print(s)
res = s / s[::-1].cumsum()[::-1]
res = res.rename("cumsum div")
I saw your comment about missing ages in the index. Here is how you would add the missing indexes in the range from min to max index, and then perform the division.
import numpy as np
import pandas as pd
s = pd.Series(data=np.random.randint(low=1, high=10, size=10), index=[0, 1, 3, 4, 5, 8, 9, 10, 11, 13], name="age")
s_all_idx = s.reindex(index=range(s.index.min(), s.index.max() + 1), fill_value=0)
print(s_all_idx)
res = s_all_idx / s_all_idx[::-1].cumsum()[::-1]
res = res.rename("all idx cumsum div")

How to perform chisquare tests on rows of pandas dataframes?

I have a dataframe df of the form
         class_1_frequency  class_2_frequency
group_1                 20                 10
group_2                 60                 25
..
group_n                 50                 15
Suppose class_1 has a total of 70 members and class_2 has 30.
For each row (group_1, group_2,..group_n) I want to create contingency tables (preferably dynamically) and then carry out a chisquare test to evaluate p-values.
For example, for group_1, the contingency table under the hood would look like:
                 class_1  class_2
group_1_present       20       10
group_1_absent     70-20    30-10
Also, I know scipy.stats.chi2_contingency() is the appropriate function for chisquare, but I am not able to apply it to my context. I have looked at previously discussed questions such as: here and here.
What is the most efficient way to achieve this?
You can take advantage of the apply function on pd.DataFrame. It allows you to apply arbitrary functions to columns or rows of a DataFrame. Using your example:
import numpy as np
import pandas as pd
from scipy.stats import chi2_contingency

df = pd.DataFrame([[20, 10], [60, 25], [50, 15]])
To produce the contingency tables one can use lambda and some vector operations
>>> members = np.array([70, 30])
>>> df.apply(lambda x: np.array([x, members-x]), axis=1)
0 [[20, 10], [50, 20]]
1 [[60, 25], [10, 5]]
2 [[50, 15], [20, 15]]
And this can of course be wrapped with the scipy function.
df.apply(lambda x: chi2_contingency(np.array([x, members-x])), axis=1)
This produces all possible return values, but by slicing the output you can select only the return values you want, leaving out e.g. the expected-frequency arrays. The resulting Series can also be converted to a DataFrame.
>>> s = df.apply(lambda x: chi2_contingency(np.array([x, members-x]))[:-1], axis=1)
>>> s
0 (0.056689342403628114, 0.8118072280034329, 1)
1 (0.0, 1.0, 1)
2 (3.349031920460492, 0.06724454934343391, 1)
dtype: object
>>> s.apply(pd.Series)
0 1 2
0 0.056689 0.811807 1.0
1 0.000000 1.000000 1.0
2 3.349032 0.067245 1.0
Now, I don't know about the execution efficiency of this approach, but I'd trust the people who implemented these functions, and most likely speed is not that critical here. It is at least efficient in the sense that it is easy to understand and fast to write.
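If you only need the statistic and the p-value per group, a small sketch extending the code above (the index and column labels are just illustrative) could expand the tuples into a labelled DataFrame:
import numpy as np
import pandas as pd
from scipy.stats import chi2_contingency

members = np.array([70, 30])
df = pd.DataFrame([[20, 10], [60, 25], [50, 15]],
                  index=['group_1', 'group_2', 'group_n'],
                  columns=['class_1_frequency', 'class_2_frequency'])
# keep only the chi2 statistic and the p-value from each test
results = df.apply(lambda x: chi2_contingency(np.array([x, members - x]))[:2], axis=1)
out = pd.DataFrame(results.tolist(), columns=['chi2', 'p_value'], index=df.index)
print(out)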
