Normalize within groups in Pandas - python

I have read several similar questions and cannot for the life of me find an answer that works for what I'm specifically trying to do, even though the question is very simple. I have a set of data with a grouping variable, a position, and a value at that position:
Sample Position Depth
A 1 2
A 2 3
A 3 4
B 1 1
B 2 3
B 3 2
I want to generate a new column that is an internally normalized depth as follows:
Sample Position Depth NormalizedDepth
A 1 2 0
A 2 3 0.5
A 3 4 1
B 1 1 0
B 2 3 1
B 3 2 0.5
This is essentially represented by the formula NormalizedDepth = (x - min(x))/(max(x) - min(x)), where the minimum and maximum are taken within each group.
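For group A, for example, min(x) = 2 and max(x) = 4, so the Depth of 3 at Position 2 maps to (3 - 2)/(4 - 2) = 0.5.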
I know how to do this with dplyr in R with the following:
depths %>%
  group_by(Sample) %>%
  mutate(NormalizedDepth = (Depth - min(Depth)) / (max(Depth) - min(Depth)))
I cannot figure out how to do this with pandas. I've tried grouping and applying, but nothing seems to replicate what I am looking for.

We have transform (which does the same as mutate in R's dplyr) together with np.ptp (which gets the difference between the max and the min):
import numpy as np
g = df.groupby('Sample').Depth
df['new'] = (df.Depth - g.transform('min')) / g.transform(np.ptp)
0 0.0
1 0.5
2 1.0
3 0.0
4 1.0
5 0.5
Name: Depth, dtype: float64
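For a self-contained check, here is a minimal sketch that builds the sample frame from the question and applies the same transform-based approach:
import numpy as np
import pandas as pd

# sample data from the question
df = pd.DataFrame({'Sample': ['A', 'A', 'A', 'B', 'B', 'B'],
                   'Position': [1, 2, 3, 1, 2, 3],
                   'Depth': [2, 3, 4, 1, 3, 2]})

g = df.groupby('Sample').Depth
# np.ptp gives max - min per group; an equivalent that avoids np.ptp is
# g.transform(lambda s: s.max() - s.min())
df['NormalizedDepth'] = (df.Depth - g.transform('min')) / g.transform(np.ptp)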

Group the DataFrame by the Sample Series' values, apply an anonymous function to each (split) Depth Series which performs min-max normalisation, and assign the result to the NormalizedDepth Series of the df DataFrame (note: this is unlikely to be as efficient as YOBEN_S' answer above):
import pandas as pd
df['NormalizedDepth'] = df.groupby('Sample').Depth.apply(lambda x: (x - x.min()) / (x.max() - x.min()))

Related

Pandas solution to computing the maximum of two sums of two columns?

So I have a DataFrame with (amongst others) four columns with numerical values. I want to add a column to the DataFrame that has the maximum of the two sums obtained from summing two columns at a time.
My solution so far is
from pandas import DataFrame
df = DataFrame(data={'text': ['a','b','c'], 'a':[1,2,3],'b':[2,3,4],'c':[5,4,2],'d':[-2,4,1]})
df['sum1'] = df['a'].add(df['b'])
df['sum2'] = df['c'].add(df['d'])
df['maxsum'] = df[['sum1','sum2']].max(axis=1)
which gives the desired result.
I am pretty sure, there is a more concise way to do this...
There is nothing wrong with your approach. In fact, it is the approach I would take, if for nothing more than the fact that it is easy to read and to figure out what you are doing. But if you are looking for another solution, here is one using numpy.ufunc.reduceat:
import pandas as pd
import numpy as np
# sample frame
df = pd.DataFrame(data={'text': ['a','b','c'], 'a':[1,2,3],'b':[2,3,4],'c':[5,4,2],'d':[-2,4,1]})
# we skip the first column and convert to an array - df[df.columns[1:]].values
# we specify the indices to slice - np.arange(len(df.columns[1:]))[::2]
# then find the max
df['max'] = np.max(np.add.reduceat(df[df.columns[1:]].values,
                                   np.arange(len(df.columns[1:]))[::2],
                                   axis=1),
                   axis=1)
text a b c d max
0 a 1 2 5 -2 3
1 b 2 3 4 4 8
2 c 3 4 2 1 7
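To see what reduceat is doing here, a minimal sketch on a plain array: the indices [0, 2] mark where each partial sum starts, so each row is summed over columns 0-1 and over columns 2-3.
import numpy as np

arr = np.array([[1, 2, 5, -2]])
# segment sums over columns [0:2) and [2:4) -> array([[3, 3]])
np.add.reduceat(arr, [0, 2], axis=1)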
Not that it is much more concise, but instead of your current approach you can apply a one-shot assignment:
df = df.assign(sum1=df[['a', 'b']].sum(1), sum2=df[['c', 'd']].sum(1),
               maxsum=lambda df: df[['sum1', 'sum2']].max(1))
text a b c d sum1 sum2 maxsum
0 a 1 2 5 -2 3 3 3
1 b 2 3 4 4 5 8 8
2 c 3 4 2 1 7 3 7

How to average n adjacent columns together in python pandas dataframe?

I have a dataframe that is a histogram with 2000 bins, with a column for each bin. I need to reduce it down to a quarter of the size - 500 bins.
Let's say we have the original dataframe:
A B C D E F G H
1 1 1 1 2 2 2 2
I want to reduce it to a new quarter width dataframe:
A B
1 2
where in the new dataframe, A is the average (A + B + C + D) / 4 of the first four columns of the original dataframe.
Feels like it should be easy, but can't work out how to do it! Cheers :)
Assuming you want to group the first 4 and last 4 columns (or any number of columns 4 by 4):
import numpy as np

out = df.groupby(np.arange(df.shape[1]) // 4, axis=1).mean()
output:
0 1
0 1.0 2.0
If you further want to relabel the columns A/B:
out = (df.groupby(np.arange(df.shape[1]) // 4, axis=1).mean()
         .set_axis(['A', 'B'], axis=1))
output:
A B
0 1.0 2.0
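As a self-contained sketch using the single sample row from the question (assuming the 4-by-4 grouping above):
import numpy as np
import pandas as pd

df = pd.DataFrame([[1, 1, 1, 1, 2, 2, 2, 2]], columns=list('ABCDEFGH'))
out = (df.groupby(np.arange(df.shape[1]) // 4, axis=1).mean()
         .set_axis(['A', 'B'], axis=1))
# note: groupby(..., axis=1) is deprecated in recent pandas;
# df.T.groupby(np.arange(df.shape[1]) // 4).mean().T is one alternative
print(out)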

Count rows with positive values and reset if negative

I am looking to add a column that counts consecutive positive numbers and resets the counter on finding a negative in a pandas dataframe. I might be able to loop through it with a 'for' statement, but I just know there is a better solution. I have looked at various similar posts that almost ask the same thing, but I just cannot get those solutions to work on my problem.
I have:
Slope
-25
-15
17
6
0.1
5
-3
5
1
3
-0.1
-0.2
1
-9
What I want:
Slope Count
-25 0
-15 0
17 1
6 2
0.1 3
5 4
-3 0
5 1
1 2
3 3
-0.1 0
-0.2 0
1 1
-9 0
Please keep in mind that this is a low-skill-level question. If there are multiple steps in your proposed solution, please explain each. I would like an answer, but would prefer to understand the 'how'.
You first want to mark the positions where new segments (i.e., groups) start:
>>> df['Count'] = df.Slope.lt(0)
>>> df.head(7)
Slope Count
0 -25.0 True
1 -15.0 True
2 17.0 False
3 6.0 False
4 0.1 False
5 5.0 False
6 -3.0 True
Now you need to label each group using the cumulative sum: as True is evaluated as 1 in mathematical equations, the cumulative sum will label each segment with an incrementing integer. (This is a very powerful concept in pandas!)
>>> df['Count'] = df.Count.cumsum()
>>> df.head(7)
Slope Count
0 -25.0 1
1 -15.0 2
2 17.0 2
3 6.0 2
4 0.1 2
5 5.0 2
6 -3.0 3
Now you can use groupby to access each segment, then all you need to do is generate an incrementing sequence starting at zero for each group. There are many ways to do that, I'd just use the (reset'ed) index of each group, i.e., reset the index, get the fresh RangeIndex starting at 0, and turn it into a series:
>>> df.groupby('Count').apply(lambda x: x.reset_index().index.to_series())
Count
1 0 0
2 0 0
1 1
2 2
3 3
4 4
3 0 0
1 1
2 2
3 3
4 0 0
5 0 0
1 1
6 0 0
This results in the expected counts, but note that the final index doesn't match the original dataframe, so we need another reset_index() with drop=True to discard the grouped index to put this into our original dataframe:
>>> df['Count'] = df.groupby('Count').apply(lambda x:x.reset_index().index.to_series()).reset_index(drop=True)
Et voilà:
>>> df
Slope Count
0 -25.0 0
1 -15.0 0
2 17.0 1
3 6.0 2
4 0.1 3
5 5.0 4
6 -3.0 0
7 5.0 1
8 1.0 2
9 3.0 3
10 -0.1 0
11 -0.2 0
12 1.0 1
13 -9.0 0
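As an aside, groupby.cumcount produces the same per-group incrementing sequence directly and avoids the index juggling; a minimal sketch reusing the boolean-cumsum labels from above:
import pandas as pd

df = pd.DataFrame({'Slope': [-25, -15, 17, 6, 0.1, 5, -3,
                             5, 1, 3, -0.1, -0.2, 1, -9]})
groups = df.Slope.lt(0).cumsum()             # label each segment
df['Count'] = df.groupby(groups).cumcount()  # 0, 1, 2, ... within each segment
Each negative row starts its own segment and therefore sits at position 0, which is exactly the reset behaviour requested.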
We can solve the problem by looping through all the rows and using the loc feature in pandas. Assuming that you already have a dataframe named df with a column called slope, the idea is that we sequentially add one to the previous row's count, but whenever slope_i < 0 the count is multiplied by 0 and thus reset.
df['new_col'] = 0  # just preset everything to be zero
for i in range(1, len(df)):
    df.loc[i, 'new_col'] = (df.loc[i-1, 'new_col'] + 1) * (df.loc[i, 'slope'] >= 0)
You can do this by using the groupby command. It requires some steps, which could probably be shortened, but it works this way.
First, you create a reset column by finding negative numbers:
# create reset condition
df['reset'] = df.slope.lt(0)
Then you create groups by applying cumsum() to these resets: at this point every run of positives gets a unique group value. The last line here assigns all negative numbers to group 0:
# create groups of positive values
df['group'] = df.reset.cumsum()
df.loc[df['reset'], 'group'] = 0
Now you take the groups of positives and cumsum a column of ones (there MUST be a better solution than that) to get your result. The last line again cleans up the results for negative values:
# sum ones :-D
df['count'] = 1
df['count'] = df.groupby('group')['count'].cumsum()
df.loc[df['reset'], 'count'] = 0
It is not a neat one-liner, but especially for larger datasets it should be faster than iterating through the whole dataframe.
For easier copy & paste, here is the whole thing (including some commented lines which replace the lines before them; that makes it shorter but harder to understand):
import pandas as pd
## create data
slope = [-25, -15, 17, 6, 0.1, 5, -3, 5, 1, 3, -0.1, -0.2, 1, -9]
df = pd.DataFrame(data=slope, columns=['slope'])
## create reset condition
df['reset'] = df.slope.lt(0)
## create groups of positive values
df['group'] = df.reset.cumsum()
df.loc[df['reset'], 'group'] = 0
# df['group'] = df.reset.cumsum().mask(df.reset, 0)
## sum ones :-D
df['count'] = 1
df['count'] = df.groupby('group')['count'].cumsum()
df.loc[df['reset'], 'count'] = 0
# df['count'] = df.groupby('group')['count'].cumsum().mask(df.reset, 0)
IMO, solving this problem iteratively is the only way, because there is a condition that has to be met. You can use any iterative construct, like for or while. Solving this problem with map will be troublesome, since the previous element still needs to be computed before it can be used for the current element.

Hierarchical Operation Pandas

Clearly I'm missing something simple, but I don't know what. I would like to propagate an operation by groups. Let's say something simple: I have a series with a multiindex (say, 2 levels), and I want to take the mean per group and subtract that mean at the correct index level.
Minimalist example code:
a = pd.Series({(2,1): 3., (1,2):4.,(2,3):4.})
b = a.groupby(level=0).mean()
r = a-b # this is the wrong line, b doesn't propagate to the multiindex of a
The result that I expect:
2 1 -0.5
1 2 0.0
2 3 0.5
dtype: float64
Use Series.sub with the level parameter defined for alignment:
r = a.sub(b, level=0)
print (r)
2 1 -0.5
1 2 0.0
2 3 0.5
dtype: float64
Or use GroupBy.transform to get a Series with the same index as the original a Series:
b = a.groupby(level=0).transform('mean')
r = a-b
print (r)
2 1 -0.5
1 2 0.0
2 3 0.5
dtype: float64

How to DataFrame.groupby along axis=1

I have:
df = pd.DataFrame({'A':[1, 2, -3],'B':[1,2,6]})
df
A B
0 1 1
1 2 2
2 -3 6
Q: How do I get:
A
0 1
1 2
2 1.5
using groupby() and aggregate()?
Something like,
df.groupby([0,1], axis=1).aggregate('mean')
So basically groupby along axis=1 and use row indexes 0 and 1 for grouping. (without using Transpose)
Are you looking for this?
df.mean(1)
Out[71]:
0 1.0
1 2.0
2 1.5
dtype: float64
If you do want groupby
df.groupby(['key']*df.shape[1],axis=1).mean()
Out[72]:
key
0 1.0
1 2.0
2 1.5
Grouping keys can come in 4 forms; I will only mention the first and third, which are relevant to your question. The following is from "Data Analysis Using Pandas":
Each grouping key can take many forms, and the keys do not have to be all of the same type:
• A list or array of values that is the same length as the axis being grouped
• A dict or Series giving a correspondence between the values on the axis being grouped and the group names
So you can pass an array the same length as your columns axis (the grouping axis), or a dict like the following:
df.groupby({x: 'mean' for x in df.columns}, axis=1).mean()
mean
0 1.0
1 2.0
2 1.5
Given a dataframe df as follows -
A B C
0 1 1 2
1 2 2 3
2 -3 6 1
Please use the command
df.groupby(by=lambda x: df[x].loc[0], axis=1).mean()
to get the desired output as -
1 2
0 1.0 2.0
1 2.0 3.0
2 1.5 1.0
Here, the function lambda x: df[x].loc[0] is used to map columns A and B to 1 and column C to 2. This mapping is then used to decide the grouping.
You can also use any complex function defined outside the groupby statement instead of the lambda function.
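For instance, a minimal sketch with a named function (first_row_value is a hypothetical name chosen here for illustration):
def first_row_value(col):
    # map each column label to the value in the first row of df
    return df[col].loc[0]

df.groupby(by=first_row_value, axis=1).mean()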
try this:
import numpy as np

df["A"] = np.mean(df.loc[:, ["A", "B"]], axis=1)
df.drop(columns=["B"], inplace=True)
A
0 1.0
1 2.0
2 1.5
