Python Dataframe: normalize a numerical column using lambda - python

I tried to use the following code to normalize a column in a Python DataFrame:
df['x_norm'] = df.apply(lambda x: (x['X'] - x['X'].mean()) / (x['X'].max() - x['X'].min()),axis=1)
but got the following error:
df['x_norm'] = df.apply(lambda x: (x['X'] - x['X'].mean()) / (x['X'].max() - x['X'].min()),axis=1)
AttributeError: ("'float' object has no attribute 'mean'", u'occurred at index 0')
Does anyone know what I missed here? Thanks!

I'm assuming you are using Pandas.
Instead of applying to the entire DataFrame, apply (Documentation) only to the Series 'X'; you should also pre-calculate the mean, max and min values. Something like this:
avg = df['X'].mean()
diff = df['X'].max() - df['X'].min()
new_df = df['X'].apply(lambda x: (x-avg)/diff)
If you are looking to normalize the entire DataFrame check this answer:
df.apply(lambda x: (x - np.mean(x)) / (np.max(x) - np.min(x)))
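As a side note, the same mean normalization can be done with no apply or lambda at all, since pandas broadcasts Series arithmetic; a minimal sketch with assumed sample data:

```python
import pandas as pd

df = pd.DataFrame({"X": [5, 5, 7, 6, 8, 2]})

# Fully vectorized: subtract the mean, divide by the range
df["x_norm"] = (df["X"] - df["X"].mean()) / (df["X"].max() - df["X"].min())
print(df["x_norm"])
```

This avoids calling the Python-level lambda once per row, which is usually much faster on large frames.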

If you want to normalize the values in column X so that they sum to 1:
df['x_norm'] = df.X.div(df.X.sum())
Step by step:
In [65]: df
Out[65]:
a b X
0 2 1 5
1 1 4 5
2 7 4 7
3 1 6 6
4 5 5 8
5 5 8 2
6 6 7 5
7 8 2 5
8 7 9 9
9 9 6 5
In [68]: df['x_norm'] = df.X.div(df.X.sum())
In [69]: df
Out[69]:
a b X x_norm
0 2 1 5 0.087719
1 1 4 5 0.087719
2 7 4 7 0.122807
3 1 6 6 0.105263
4 5 5 8 0.140351
5 5 8 2 0.035088
6 6 7 5 0.087719
7 8 2 5 0.087719
8 7 9 9 0.157895
9 9 6 5 0.087719
check:
In [70]: df.x_norm.sum()
Out[70]: 1.0
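Note that the two answers in this thread normalize differently: dividing by the sum makes the column sum to 1, while the mean/range formula from the question centers the column around 0. A quick comparison, with assumed sample data:

```python
import pandas as pd

df = pd.DataFrame({"X": [5, 5, 7, 6, 8, 2]})

sum_norm = df["X"] / df["X"].sum()                                     # sums to 1
rng_norm = (df["X"] - df["X"].mean()) / (df["X"].max() - df["X"].min())  # mean-centered

print(sum_norm.sum())  # 1.0
print(rng_norm.sum())  # 0.0
```

Pick whichever matches your definition of "normalize".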

Related

Why does pd.rolling and .apply() return multiple outputs from a function returning a single value?

I'm trying to create a rolling function that:
Divides two DataFrames with 3 columns in each df.
Calculate the mean of each row from the output in step 1.
Sums the averages from step 2.
This could be done with df.iterrows(), i.e. looping through each row. However, this would be inefficient when working with larger datasets. Therefore, my objective is to create a pd.rolling() function that could do this much faster.
What I would need help with is to understand why my approach below returns multiple values while the function I'm using only returns a single value.
EDIT : I have updated the question with the code that produces my desired output.
This is the test dataset I'm working with:
#import libraries
import pandas as pd
import numpy as np
#create two dataframes
values = {'column1': [7,2,3,1,3,2,5,3,2,4,6,8,1,3,7,3,7,2,6,3,8],
'column2': [1,5,2,4,1,5,5,3,1,5,3,5,8,1,6,4,2,3,9,1,4],
"column3" : [3,6,3,9,7,1,2,3,7,5,4,1,4,2,9,6,5,1,4,1,3]
}
df1 = pd.DataFrame(values)
df2 = pd.DataFrame([[2,3,4],[3,4,1],[3,6,1]])
print(df1)
print(df2)
column1 column2 column3
0 7 1 3
1 2 5 6
2 3 2 3
3 1 4 9
4 3 1 7
5 2 5 1
6 5 5 2
7 3 3 3
8 2 1 7
9 4 5 5
10 6 3 4
11 8 5 1
12 1 8 4
13 3 1 2
14 7 6 9
15 3 4 6
16 7 2 5
17 2 3 1
18 6 9 4
19 3 1 1
20 8 4 3
0 1 2
0 2 3 4
1 3 4 1
2 3 6 1
One method to achieve my desired output by looping through each row:
RunningSum = []
for index, rows in df1.iterrows():
    if index > 3:
        Div = abs((((df2 / df1.iloc[index-3+1:index+1].reset_index(drop=True).values) - 1) * 100))
        Average = Div.mean(axis=0)
        SumOfAverages = np.sum(Average)
        RunningSum.append(SumOfAverages)
#printing my desired output values
print(RunningSum)
[330.42328042328046,
212.0899470899471,
152.06349206349208,
205.55555555555554,
311.9047619047619,
209.1269841269841,
197.61904761904765,
116.94444444444444,
149.72222222222223,
430.0,
219.51058201058203,
215.34391534391537,
199.15343915343914,
159.6031746031746,
127.6984126984127,
326.85185185185185,
204.16666666666669]
However, this would be time-consuming when working with large datasets. Therefore, I've tried to create a function which applies to a pd.rolling() object.
def SumOfAverageFunction(vals):
    Div = df2 / vals.reset_index(drop=True)
    Average = Div.mean(axis=0)
    SumOfAverages = np.sum(Average)
    return SumOfAverages
RunningSum = df1.rolling(window=3,axis=0).apply(SumOfAverageFunction)
The problem here is that my function returns multiple outputs. How can I solve this?
print(RunningSum)
column1 column2 column3
0 NaN NaN NaN
1 NaN NaN NaN
2 3.214286 4.533333 2.277778
3 4.777778 3.200000 2.111111
4 5.888889 4.416667 1.656085
5 5.111111 5.400000 2.915344
6 3.455556 3.933333 5.714286
7 2.866667 2.066667 5.500000
8 2.977778 3.977778 3.063492
9 3.555556 5.622222 1.907937
10 2.750000 4.200000 1.747619
11 1.638889 2.377778 3.616667
12 2.986111 2.005556 5.500000
13 5.333333 3.075000 4.750000
14 4.396825 5.000000 3.055556
15 2.174603 3.888889 2.148148
16 2.111111 2.527778 1.418519
17 2.507937 3.500000 3.311111
18 2.880952 3.000000 5.366667
19 2.722222 3.370370 5.750000
20 2.138889 5.129630 5.666667
After reordering the operations, your calculation can be simplified:
BASE = df2.sum(axis=0) /3
BASE_series = pd.Series({k: v for k, v in zip(df1.columns, BASE)})
result = df1.rdiv(BASE_series, axis=1).sum(axis=1)
print(np.around(result[4:], 3))
Outputs:
4 5.508
5 4.200
6 2.400
7 3.000
...
If you don't want to calculate anything before index 4, then change:
df1.iloc[4:].rdiv(...
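As for why the original attempt produced one output per column: rolling(...).apply runs the supplied function separately on each column's window, not on the whole window as a frame. A minimal sketch with made-up data:

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2, 3, 4], "b": [10, 20, 30, 40]})

# rolling(...).apply passes each column's window as a separate Series,
# so the result has one column per input column
out = df.rolling(window=2).apply(lambda w: w.sum(), raw=False)
print(out)
```

So a function that aggregates across all three columns at once cannot be expressed as a plain rolling apply on the DataFrame; it has to be restructured column-wise, as the answer above does.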

Pandas - Duplicate rows on function application

I have a dataframe, and I'm trying to apply a single function to it with multiple arguments. I want the results of the function application to be stored in a new column, with each row duplicated once per argument, but I can't figure out how to do this.
Simple example:
df= pd.DataFrame({"a" : [4 ,5], "b" : [7, 8]}, index = [1, 2])
a b
1 4 7
2 5 8
Now, I want to add both the numbers 10 and 11 to column 'a', and store the results in a new column, 'c'. Sorry if this is unclear, but this is the result I'm looking for:
a b c
1 4 7 14
2 4 7 15
3 5 8 15
4 5 8 16
Is there an easy way to do this?
Use Index.repeat with numpy.tile:
df= pd.DataFrame({"a" : [4 ,5], "b" : [7, 8]}, index = [1, 2])
a = [10,11]
df1 = (df.loc[df.index.repeat(len(a))]
         .assign(c=lambda x: x.a + np.tile(a, len(df)))
         .reset_index(drop=True)
         .rename(lambda x: x + 1)
       )
Or:
df1 = df.loc[df.index.repeat(len(a))].reset_index(drop=True).rename(lambda x: x+1)
df1['c'] = df1.a + np.tile(a, len(df))
print (df1)
a b c
1 4 7 14
2 4 7 15
3 5 8 15
4 5 8 16
Another idea is use cross join:
a = [10,11]
df1 = df.assign(tmp=1).merge(pd.DataFrame({'c': a, 'tmp': 1}), on='tmp').drop('tmp', axis=1)
df1['c'] += df1.a
print (df1)
a b c
0 4 7 14
1 4 7 15
2 5 8 15
3 5 8 16
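On newer pandas (>= 1.2) the same cross join can be written without the dummy key, using how='cross'; a minimal sketch:

```python
import pandas as pd

df = pd.DataFrame({"a": [4, 5], "b": [7, 8]}, index=[1, 2])
adds = pd.DataFrame({"c": [10, 11]})

# Explicit cross join: every row of df paired with every row of adds
df1 = df.merge(adds, how="cross")
df1["c"] += df1["a"]
print(df1)
```

The result is the same as the tmp-key merge above, with a fresh 0-based index.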
Using the explode method (pandas >= 0.25.0):
df1 = df.assign(c=df.apply(lambda row: [row.a+10, row.a+11], axis=1))
df1 = df1.explode('c')
print(df1)
a b c
1 4 7 14
1 4 7 15
2 5 8 15
2 5 8 16
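One caveat with the explode approach (assuming a recent pandas): the exploded column keeps object dtype, since the lists are unpacked without dtype inference, so cast it back if you need numerics:

```python
import pandas as pd

df = pd.DataFrame({"a": [4, 5], "b": [7, 8]}, index=[1, 2])
df1 = df.assign(c=df.apply(lambda row: [row.a + 10, row.a + 11], axis=1)).explode("c")

print(df1["c"].dtype)          # object: explode does not infer a numeric dtype
df1["c"] = df1["c"].astype(int)
```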
Note that your code example doesn't do what you say (5+10 = 15, not 16).
The output from adding 10 and 11 is:
a b c
1 4 7 14
2 4 7 15
3 5 8 15
4 5 8 16
That said, here's some understandable code:
def add_x_y_to_df_col(df, incol, outcol, x, y):
    df1 = df.copy()
    df[outcol] = df[incol] + x
    df1[outcol] = df[incol] + y
    return pd.concat([df, df1], ignore_index=True)
df = add_x_y_to_df_col(df, 'a', 'c', 10, 11)
Note this returns:
a b c
0 4 7 14
1 5 8 15
2 4 7 15
3 5 8 16
If you want to sort by column a and restart the index at 1:
df = df.sort_values(by='a').reset_index(drop=True)
df.index += 1
(You could of course add that code to the function.) This gives the desired result:
a b c
1 4 7 14
2 4 7 15
3 5 8 15
4 5 8 16

Rolling min of a Pandas Series without window / cumulative minimum / expanding min

I'm looking for a way to calculate with Python Pandas rolling(*) min of a Series without window.
Let's consider the following Series
In [26]: s = pd.Series([10, 12, 14, 9, 10, 8, 16, 20])
Out[26]:
0 10
1 12
2 14
3 9
4 10
5 8
6 16
7 20
dtype: int64
I would like to get a Series like
0 10
1 10
2 10
3 9
4 9
5 8
6 8
7 8
dtype: int64
I tried
s.rolling().min()
but I'm getting the following error
TypeError: rolling() missing 1 required positional argument: 'window'
I did this
r = s.copy()
val_min = r.iloc[0]
for i, (idx, val) in enumerate(r.items()):
    if i > 0:
        if val < val_min:
            val_min = val
        else:
            r[idx] = val_min
and have a correct answer
In [30]: r
Out[30]:
0 10
1 10
2 10
3 9
4 9
5 8
6 8
7 8
dtype: int64
but I think a Pandas method probably exists (and would be much more efficient); if it doesn't, it should probably be implemented.
(*) "rolling" may not be the appropriate term, maybe it should be named instead a "local" min.
Edit: it's in fact named a cumulative minimum or expanding min
Use Series.cummin:
print(s.cummin())
0 10
1 10
2 10
3 9
4 9
5 8
6 8
7 8
dtype: int64
You can use np.minimum.accumulate:
import numpy as np
pd.Series(np.minimum.accumulate(s.values))
0 10
1 10
2 10
3 9
4 9
5 8
6 8
7 8
dtype: int64
Another way is to use s.expanding().min() (see Series.expanding):
s.expanding().min()
Output:
0 10.0
1 10.0
2 10.0
3 9.0
4 9.0
5 8.0
6 8.0
7 8.0
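A quick sanity check that the three approaches agree on the question's data (note expanding().min() returns floats while cummin preserves the int64 dtype):

```python
import pandas as pd
import numpy as np

s = pd.Series([10, 12, 14, 9, 10, 8, 16, 20])

a = s.cummin()
b = pd.Series(np.minimum.accumulate(s.values))
c = s.expanding().min()

# Compare values, not dtypes: c is float64
assert (a == b).all()
assert (a == c.astype(int)).all()
```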

Define trend pandas/python

I have dataset:
print (df['price'])
0 0.435
1 -2.325
2 -3.866
...
58 -35.876
59 -37.746
Name: price, dtype: float64
moving average:
m_a = df['price'].rolling(window=5).mean()
m_a.plot()
print(m_a)
0 NaN
1 NaN
2 NaN
3 NaN
4 -2.8976
5 -4.9628
...
58 -36.2204
59 -36.4632
M/A
How can I determine the trend for the last n rows - FLAT/UP/DOWN?
As text, or as an int result, like:
trend = gettrend(df,5)
print(trend)
>>UP
You can use something like this with np.where and expand on the logic as required:
df['Trend'] = np.where(df['m_a'] < df['m_a'].shift(), 'DOWN',
              np.where(df['m_a'] > df['m_a'].shift(), 'UP', 'FLAT'))
price m_a Trend
0 1 2 FLAT
1 2 2 FLAT
2 3 4 UP
3 4 5 UP
4 5 6 UP
5 6 7 UP
6 7 -1 DOWN
7 8 2 UP
8 6 7 UP
9 7 -6 DOWN
10 8 -7 DOWN
I'd do it this way:
Setup sample DF:
In [31]: df = pd.DataFrame(np.random.rand(20)*100, columns=['price'])
In [32]: df
Out[32]:
price
0 20.555945
1 58.312756
2 3.723192
3 22.298697
4 71.533725
5 71.257019
6 87.355602
7 55.076239
8 67.941031
9 77.437012
10 94.496416
11 16.937017
12 68.494663
13 79.112648
14 88.298477
15 59.028143
16 16.991677
17 14.835137
18 75.095696
19 95.177781
Solution:
In [33]: df['trend'] = np.sign(df['price']
...: .rolling(window=5)
...: .mean()
...: .diff()
...: .fillna(0)) \
...: .map({0:'FLAT',1:'UP',-1:'DOWN'})
...:
In [34]: df
Out[34]:
price trend
0 20.555945 FLAT
1 58.312756 FLAT
2 3.723192 FLAT
3 22.298697 FLAT
4 71.533725 FLAT
5 71.257019 UP
6 87.355602 UP
7 55.076239 UP
8 67.941031 UP
9 77.437012 UP
10 94.496416 UP
11 16.937017 DOWN
12 68.494663 UP
13 79.112648 UP
14 88.298477 UP
15 59.028143 DOWN
16 16.991677 UP
17 14.835137 DOWN
18 75.095696 DOWN
19 95.177781 UP
Plot:
In [39]: df.price.plot(figsize=(16,6))
Out[39]: <matplotlib.axes._subplots.AxesSubplot at 0xc16e4a8>
In [40]: plt.locator_params(nbins=len(df))
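Both answers label every row; the question also asked for a single verdict over the last n rows, via a gettrend(df, n) call. A minimal sketch of such a helper (gettrend here is a hypothetical function matching the asker's desired signature, comparing the mean of the last n prices against the mean of the n prices before them):

```python
import pandas as pd

def gettrend(df, n, col="price", tol=1e-9):
    # Hypothetical helper: UP/DOWN/FLAT for the last n rows,
    # judged against the preceding n rows
    tail = df[col].iloc[-n:].mean()
    prev = df[col].iloc[-2 * n:-n].mean()
    if tail > prev + tol:
        return "UP"
    if tail < prev - tol:
        return "DOWN"
    return "FLAT"

df = pd.DataFrame({"price": [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]})
print(gettrend(df, 5))  # → UP
```

The tol threshold decides how small a change still counts as FLAT; tune it to your data's scale.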

Multiply all columns in a Pandas dataframe together

Is it possible to multiply all the columns in a Pandas.DataFrame together to get a single value for every row in the DataFrame?
As an example, using
df = pd.DataFrame(np.random.randn(5,3)*10)
I want a new DataFrame df2 where df2.ix[x,0] will have the value of df.ix[x,0] * df.ix[x,1] * df.ix[x,2].
However, I do not want to hardcode this; how can I use a loop to achieve it?
I found a function df.mul(series, axis=1) but can't figure out a way to use it for my purpose.
You could use DataFrame.prod():
>>> df = pd.DataFrame(np.random.randint(1, 10, (5, 3)))
>>> df
0 1 2
0 7 7 5
1 1 8 6
2 4 8 4
3 2 9 5
4 3 8 7
>>> df.prod(axis=1)
0 245
1 48
2 128
3 90
4 168
dtype: int64
You could also apply np.prod, which is what I'd originally done, but usually when available the direct methods are faster.
>>> df = pd.DataFrame(np.random.randint(1, 10, (5, 3)))
>>> df
0 1 2
0 9 3 3
1 8 5 4
2 3 6 7
3 9 8 5
4 7 1 2
>>> df.apply(np.prod, axis=1)
0 81
1 160
2 126
3 360
4 14
dtype: int64
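And since the question asked about doing it with a loop: the loop version is just a running product over the columns, which can be written with functools.reduce (a sketch with assumed sample data; df.prod(axis=1) remains the idiomatic choice):

```python
import pandas as pd
from functools import reduce

df = pd.DataFrame({"x": [2, 3], "y": [4, 5], "z": [6, 7]})

# Multiply the columns pairwise, left to right;
# equivalent to df.prod(axis=1) for this frame
row_products = reduce(lambda acc, col: acc * df[col], df.columns[1:], df[df.columns[0]])
print(row_products.tolist())  # → [48, 105]
```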
