Let us say, I have the following data frame.
Frequency
20
14
10
8
6
2
1
I want to scale Frequency value from 0 to 1.
Is there a way to do this in Python? I have found something similar here But it doesn't serve my purpose.
I am sure there's a more standard way to do this in Python, but I use a self-defined function that you can select the range to be scaled on:
def my_scaler(min_scale_num,max_scale_num,var):
return (max_scale_num - min_scale_num) * ( (var - min(var)) / (max(var) - min(var)) ) + min_scale_num
# You can input your range
df['scaled'] = my_scaler(0,1,df['Frequency'].astype(float)) # scaled between 0,1
df['scaled2'] = my_scaler(-5,5,df['Frequency'].astype(float)) # scaled between -5,5
df
Frequency scaled scaled2
0 20 1.000000 5.000000
1 14 0.684211 1.842105
2 10 0.473684 -0.263158
3 8 0.368421 -1.315789
4 6 0.263158 -2.368421
5 2 0.052632 -4.473684
6 1 0.000000 -5.000000
Just change a, b = 10, 50 to a, b = 0, 1 in linked answer for upper and lower values for scale:
a, b = 0, 1
x, y = df.Frequency.min(), df.Frequency.max()
df['normal'] = (df.Frequency - x) / (y - x) * (b - a) + a
print (df)
Frequency normal
0 20 1.000000
1 14 0.684211
2 10 0.473684
3 8 0.368421
4 6 0.263158
5 2 0.052632
6 1 0.000000
You can use applymap to apply any function on each cell of the df.
For example:
df = pd.DataFrame([20, 14, 10, 8, 6, 2, 1], columns=["Frequency"])
min = df.min()
max = df.max()
df2 = df.applymap(lambda x: (x - min)/(max-min))
df
Frequency
0 20
1 14
2 10
3 8
4 6
5 2
6 1
df2
0 Frequency 1.0
dtype: float64
1 Frequency 0.684211
dtype: float64
2 Frequency 0.473684
dtype: float64
3 Frequency 0.368421
dtype: float64
4 Frequency 0.263158
dtype: float64
5 Frequency 0.052632
dtype: float64
6 Frequency 0.0
dtype: float64
Related
I have a dataframe like as shown below
import numpy as np
import pandas as pd
from numpy.random import default_rng
rng = default_rng(100)
cf = pd.DataFrame({'grade': rng.choice(list('ACD'),size=(8)),
'dash': rng.choice(list('PQRS'),size=(8)),
'dumeel': rng.choice(list('QWER'),size=(8)),
'dumma': rng.choice((1234),size=(8)),
'target': rng.choice([0,1],size=(8))
})
I would like to do the below
a) Find the total and %total for each of my value in the categorical columns against the target column
I tried the below but it only gets me to half way of the results.
cols = cf.select_dtypes('object')
cf.melt('target',cols).groupby(['variable','value']).size().reset_index(name='cnt of records')
How can I use the above result to compute target met and target not met details using the target column?
I expect my output to be like as shown below (note that I have shown only two columns grade and dash for sample). Code should follow the same logic for all string columns
Select your columns to flatten with melt then join the target column. Finally, group by variable and value columns and apply a dict of functions to each group.
funcs = {
'cnt of records': 'count',
'target met': lambda x: sum(x),
'target not met': lambda x: len(x) - sum(x),
'target met %': lambda x: f"{round(100 * sum(x) / len(x), 2):.2f}%",
'target not met %': lambda x: f"{round(100 * (len(x) - sum(x)) / len(x), 2):.2f}%"
}
out = df.select_dtypes('object').melt(ignore_index=False).join(df['target']) \
.groupby(['variable', 'value'])['target'].agg(**funcs).reset_index()
Output:
>>> out
variable value cnt of records target met target not met target met % target not met %
0 dash Q 2 0 2 0.00% 100.00%
1 dash R 2 2 0 100.00% 0.00%
2 dash S 4 2 2 50.00% 50.00%
3 dumeel E 3 2 1 66.67% 33.33%
4 dumeel Q 3 2 1 66.67% 33.33%
5 dumeel R 1 0 1 0.00% 100.00%
6 dumeel W 1 0 1 0.00% 100.00%
7 grade A 2 0 2 0.00% 100.00%
8 grade C 3 2 1 66.67% 33.33%
9 grade D 3 2 1 66.67% 33.33%
You can use agg after you groupby for this:
cols = cf.select_dtypes('object')
df = (
cf.melt('target', cols)
.groupby(['variable','value'])
['target']
.agg([('l', 'size'), ('s', 'sum')]) # l = length (total count of rows in this group), s = sum (total count of rows in the group where target = 1)
.pipe(lambda x: (
x.assign(
met_pct=x.s / x.l * 100,
not_met_pct=100 - (x.s / x.l * 100),
met=x.s,
not_met=x.l - x.s
)
)).reset_index()
.drop(['l', 's'], axis=1)
)
Output:
>>> df
variable value met_pct not_met_pct met not_met
0 dash P 100.000000 0.000000 1 0
1 dash Q 0.000000 100.000000 0 3
2 dash R 50.000000 50.000000 1 1
3 dash S 50.000000 50.000000 1 1
4 dumeel E 0.000000 100.000000 0 1
5 dumeel Q 100.000000 0.000000 1 0
6 dumeel R 50.000000 50.000000 2 2
7 dumeel W 0.000000 100.000000 0 2
8 grade A 0.000000 100.000000 0 1
9 grade C 50.000000 50.000000 2 2
10 grade D 33.333333 66.666667 1 2
For the DF below - in the Value Column, Product 3(i.e, 100) and Product 4 (i.e. 98) have amounts that are outliers. I want to
group by ['Class']
obtain the mean of the [Value] excluding the outlier amount
replace the outlier amount with the mean calculated in step 2.
Any suggestions of how to structure the code greatly appreciated. I have my code that works given the sample table, but I have a feeling that when I implement in the real solution it might not work.
Product,Class,Value
0 1 A 5
1 2 A 4
2 3 A 100
3 4 B 98
4 5 B 20
5 6 B 25
My code implementation:
# Establish the condition to remove the outlier rows from the DF
stds = 1.0
filtered_df = df[~df.groupby('Class')['Value'].transform(lambda x: abs((x-x.mean()) / x.std()) > stds)]
Output:
Product Class Value
0 1 A 5
1 2 A 4
4 5 B 20
5 6 B 25
# compute mean of each class without the outliers
class_means = filtered_df[['Class', 'Value']].groupby(['Class'])['Value'].mean()
Output:
Class
A 4.5
B 22.5
#extract rows in DF that are outliers and fail the test
outlier_df = df[df.groupby('Class')['Value'].transform(lambda x: abs((x-x.mean()) / x.std()) > stds)]
outlier_df
Output:
Product Class Value
2 3 A 100
3 4 B 98
#replace outlier values with computed means grouped by class
outlier_df['Value'] = np.where((outlier_df.Class == class_means.index), class_means,outlier_df.Value)
outlier_df
Output:
Product Class Value
2 3 A 4.5
3 4 B 22.5
#recombine cleaned dataframes
df_cleaned = pd.concat([filtered_df,outlier_df], axis=0 )
df_cleaned
Output:
Product Class Value
0 1 A 5.0
1 2 A 4.0
4 5 B 20.0
5 6 B 25.0
2 3 A 4.5
3 4 B 22.5
Proceed as follows:
Start from your code:
stds = 1.0
Save your lambda function under a variable:
isOutlier = lambda x: abs((x - x.mean()) / x.std()) > stds
Define the following function, to be applied to each group:
def newValue(grp):
val = grp.Value
outl = isOutlier(val)
return val.mask(outl, val[~outl].mean())
Generate new Value column:
df.Value = df.groupby('Class', group_keys=False).apply(newValue)
The result is:
Product Class Value
0 1 A 5.0
1 2 A 4.0
2 3 A 4.5
3 4 B 22.5
4 5 B 20.0
5 6 B 25.0
You even don't lose the original row order.
Edit
Or you can "incorporate" the content of your lambda function in newValue
(as you don't call it in any other place):
def newValue(grp):
val = grp.Value
outl = abs((val - val.mean()) / val.std()) > stds
return val.mask(outl, val[~outl].mean())
Suppose I have a Pandas Dataframe named df, which has the following structure:-
Column 1 Column 2 ......... Column 104
Row 1 0.01 0.55 3
Row 2 0.03 0.14 1
...
Row 100 0.75 0.56 0
What I am trying to accomplish is that for all rows which match the condition given below, I need to generate 100 more rows with a random value between 0 and 0.05 added to each row:-
is_less = df.iloc[:,-1] > 1
df_try = df[is_less]
df = df.append([df_try]*100,ignore_index=True)
The problem is that I can simply duplicate the rows in df_try to generate 100 more rows for each case, but I want to add a random value to each row as well, such that each row is different from the others but very similar.
import random
df = df.append([df_try + random.uniform(0,0.05)]*100, ignore_index=True)
What this does is to simply add the fixed random value to df_try's 100 new rows, but not a unique random value to each row. I know that this is because the above syntax does not iterate over df_try, resulting in the fixed random value being added, but is there a suitable way to add the random values iteratively over the data frame in this case?
One idea is create 2d array with same size like new appended DataFrame and add to joined lists with concat:
N = 10
arr = np.random.uniform(0,0.05, size=(N, len(df.columns)))
is_less = df.iloc[:,-1] > 1
df_try = df[is_less]
df = df.append(pd.concat([df_try]*N) + arr,ignore_index=True)
print (df)
Column 1 Column 2 Column 104
0 0.010000 0.550000 3.000000
1 0.030000 0.140000 1.000000
2 0.750000 0.560000 0.000000
3 0.024738 0.561647 3.045146
4 0.035315 0.584161 3.008656
5 0.022386 0.563025 3.033091
6 0.039175 0.588785 3.004649
7 0.049465 0.594903 3.003303
8 0.027366 0.580478 3.041745
9 0.044721 0.599853 3.001736
10 0.052849 0.589775 3.042434
11 0.033957 0.582610 3.045215
12 0.044349 0.582218 3.027665
Your solution should be changed by list comprehension if need add scalar to each df_try:
N = 10
is_less = df.iloc[:,-1] > 1
df_try = df[is_less]
df = df.append( [df_try + random.uniform(0, 0.05) for _ in range(N)], ignore_index=True)
print (df)
Column 1 Column 2 Column 104
0 0.010000 0.550000 3.000000
1 0.030000 0.140000 1.000000
2 0.750000 0.560000 0.000000
3 0.036756 0.576756 3.026756
4 0.039357 0.579357 3.029357
5 0.048746 0.588746 3.038746
6 0.040197 0.580197 3.030197
7 0.011045 0.551045 3.001045
8 0.013942 0.553942 3.003942
9 0.054658 0.594658 3.044658
10 0.025909 0.565909 3.015909
11 0.012093 0.552093 3.002093
12 0.058463 0.598463 3.048463
You can combine the copies first and create a single array containing all the random values, add them together, and then append the result to the original:
import numpy as np
n_copies = 2
df = pd.DataFrame(np.c_[np.arange(6), np.random.randint(1, 3, size=6)])
subset = df[df.iloc[:, -1] > 1]
extra = pd.concat([subset] * n_copies).add(np.random.uniform(0, 0.05, len(subset) * n_copies), axis='rows')
result = df.append(extra, ignore_index=True)
print(result)
Output:
0 1
0 0.000000 2.000000
1 1.000000 2.000000
2 2.000000 1.000000
3 3.000000 2.000000
4 4.000000 1.000000
5 5.000000 2.000000
6 0.007723 2.007723
7 1.005718 2.005718
8 3.003063 2.003063
9 5.005238 2.005238
10 0.006509 2.006509
11 1.034742 2.034742
12 3.022345 2.022345
13 5.040911 2.040911
I know rolling_mean() exists, but this is for a school project so I'm trying to avoid using rolling_mean()
I'm trying to use the following function on a dataframe series
def run_mean(array, period):
ret = np.cumsum(array, dtype=float)
ret[period:] = ret[period:] - ret[:-period]
return ret[period - 1:] / period
data['run_mean'] = run_mean(data['ratio'], 150)
But I'm getting the error 'ValueError: cannot set using a slice indexer with a different length than the value'.
Using data['run_mean'] = pd.rolling_mean(raw_data['ratio'],150) works exactly fine, what am I missing?
Fill the initial values up to period with NaN.
def run_mean(array, period): # Vector
ret = np.cumsum(array / period, dtype=float) # First divide by period to avoid overflow.
ret[period:] = ret[period:] - ret[:-period]
ret[:period - 1] = np.nan
return ret
run_mean(np.array(range(5)), 3)
Out[35]: array([ nan, nan, 1., 2., 3.])
To quote the pandas documentation,
A key difference between Series and ndarray is that operations between Series automatically align the data based on label. Thus, you can write computations without giving consideration to whether the Series involved have the same labels.
This example should illustrate what's going on:
In [1]: import numpy as np
...: import pandas as pd
In [2]: a = pd.Series(np.random.random(5))
In [3]: a
Out[3]:
0 0.740975
1 0.983654
2 0.274207
3 0.427542
4 0.874127
dtype: float64
In [4]: a[2:]
Out[4]:
2 0.274207
3 0.427542
4 0.874127
dtype: float64
In [5]: a[:-2]
Out[5]:
0 0.740975
1 0.983654
2 0.274207
dtype: float64
In [6]: a[2:] - a[:-2]
Out[6]:
0 NaN
1 NaN
2 0.0
3 NaN
4 NaN
dtype: float64
In [7]: a[2:] = _
The last statement will produce the ValueError you get.
Converting ret from a pandas Series to a numpy ndarray should give you the behaviour you're looking for.
You're mixing up the use of : in DataFrame slicing.
Solution
What you want to use is shift()
def run_mean(array, period):
ret = np.cumsum(array, dtype=float)
roll = ret - ret.shift(period).fillna(0)
return roll[(period - 1):] / period
Example Setup
import pandas as pd
import numpy as np
np.random.seed(314)
df = pd.DataFrame((np.random.rand(6, 5) * 10).astype(int), columns=list('ABCDE'))
print df
A B C D E
0 9 5 2 7 9
1 8 7 2 9 2
2 7 2 1 3 8
3 2 0 6 5 5
4 6 6 4 3 5
5 4 8 8 1 0
Observe
print df[:4]
A B C D E
0 9 5 2 7 9
1 8 7 2 9 2
2 7 2 1 3 8
3 2 0 6 5 5
print df[:-4]
A B C D E
0 9 5 2 7 9
1 8 7 2 9 2
These are not the same length.
Demonstration
A B C D E
2 8.000000 4.666667 1.666667 6.333333 6.333333
3 5.666667 3.000000 3.000000 5.666667 5.000000
4 5.000000 2.666667 3.666667 3.666667 6.000000
5 4.000000 4.666667 6.000000 3.000000 3.333333
I'm generating a number of dataframes with the same shape, and I want to compare them to one another. I want to be able to get the mean and median across the dataframes.
Source.0 Source.1 Source.2 Source.3
cluster
0 0.001182 0.184535 0.814230 0.000054
1 0.000001 0.160490 0.839508 0.000001
2 0.000001 0.173829 0.826114 0.000055
3 0.000432 0.180065 0.819502 0.000001
4 0.000152 0.157041 0.842694 0.000113
5 0.000183 0.174142 0.825674 0.000001
6 0.000001 0.151556 0.848405 0.000038
7 0.000771 0.177583 0.821645 0.000001
8 0.000001 0.202059 0.797939 0.000001
9 0.000025 0.189537 0.810410 0.000028
10 0.006142 0.003041 0.493912 0.496905
11 0.003739 0.002367 0.514216 0.479678
12 0.002334 0.001517 0.529041 0.467108
13 0.003458 0.000001 0.532265 0.464276
14 0.000405 0.005655 0.527576 0.466364
15 0.002557 0.003233 0.507954 0.486256
16 0.004161 0.000001 0.491271 0.504568
17 0.001364 0.001330 0.528311 0.468996
18 0.002886 0.000001 0.506392 0.490721
19 0.001823 0.002498 0.509620 0.486059
Source.0 Source.1 Source.2 Source.3
cluster
0 0.000001 0.197108 0.802495 0.000396
1 0.000001 0.157860 0.842076 0.000063
2 0.094956 0.203057 0.701662 0.000325
3 0.000001 0.181948 0.817841 0.000210
4 0.000003 0.169680 0.830316 0.000001
5 0.000362 0.177194 0.822443 0.000001
6 0.000001 0.146807 0.852924 0.000268
7 0.001087 0.178994 0.819564 0.000354
8 0.000001 0.202182 0.797333 0.000485
9 0.000348 0.181399 0.818252 0.000001
10 0.003050 0.000247 0.506777 0.489926
11 0.004420 0.000001 0.513927 0.481652
12 0.006488 0.001396 0.527197 0.464919
13 0.001510 0.000001 0.525987 0.472502
14 0.000001 0.000001 0.520737 0.479261
15 0.000001 0.001765 0.515658 0.482575
16 0.000001 0.000001 0.492550 0.507448
17 0.002855 0.000199 0.526535 0.470411
18 0.000001 0.001952 0.498303 0.499744
19 0.001232 0.000001 0.506612 0.492155
Then I want to get the mean of these two dataframes.
What is the easiest way to do this?
Just to clarify I want to get the mean for each particular cell when the indexes and columns of all the dataframes are exactly the same.
So in the example I gave, the average for [0,Source.0] would be (0.001182 + 0.000001) / 2 = 0.0005915.
Assuming the two dataframes have the same columns, you could just concatenate them and compute your summary stats on the concatenated frames:
import numpy as np
import pandas as pd
# some random data frames
df1 = pd.DataFrame(dict(x=np.random.randn(100), y=np.random.randint(0, 5, 100)))
df2 = pd.DataFrame(dict(x=np.random.randn(100), y=np.random.randint(0, 5, 100)))
# concatenate them
df_concat = pd.concat((df1, df2))
print df_concat.mean()
# x -0.163044
# y 2.120000
# dtype: float64
print df_concat.median()
# x -0.192037
# y 2.000000
# dtype: float64
Update
If you want to compute stats across each set of rows with the same index in the two datasets, you can use .groupby() to group the data by row index, then apply the mean, median etc.:
by_row_index = df_concat.groupby(df_concat.index)
df_means = by_row_index.mean()
print df_means.head()
# x y
# 0 -0.850794 1.5
# 1 0.159038 1.5
# 2 0.083278 1.0
# 3 -0.540336 0.5
# 4 0.390954 3.5
This method will work even when your dataframes have unequal numbers of rows - if a particular row index is missing in one of the two dataframes, the mean/median will be computed on the single existing row.
I go similar as #ali_m, but since you want one mean per row-column combination, I conclude differently:
df1 = pd.DataFrame(dict(x=np.random.randn(100), y=np.random.randint(0, 5, 100)))
df2 = pd.DataFrame(dict(x=np.random.randn(100), y=np.random.randint(0, 5, 100)))
df = pd.concat([df1, df2])
foo = df.groupby(level=1).mean()
foo.head()
x y
0 0.841282 2.5
1 0.716749 1.0
2 -0.551903 2.5
3 1.240736 1.5
4 1.227109 2.0
As per Niklas' comment, the solution to the question is panel.mean(axis=0).
As a more complete example:
import pandas as pd
import numpy as np
dfs = {}
nrows = 4
ncols = 3
for i in range(4):
dfs[i] = pd.DataFrame(np.arange(i, nrows*ncols+i).reshape(nrows, ncols),
columns=list('abc'))
print('DF{i}:\n{df}\n'.format(i=i, df=dfs[i]))
panel = pd.Panel(dfs)
print('Mean of stacked DFs:\n{df}'.format(df=panel.mean(axis=0)))
Will give the following output:
DF0:
a b c
0 0 1 2
1 3 4 5
2 6 7 8
3 9 10 11
DF1:
a b c
0 1 2 3
1 4 5 6
2 7 8 9
3 10 11 12
DF2:
a b c
0 2 3 4
1 5 6 7
2 8 9 10
3 11 12 13
DF3:
a b c
0 3 4 5
1 6 7 8
2 9 10 11
3 12 13 14
Mean of stacked DFs:
a b c
0 1.5 2.5 3.5
1 4.5 5.5 6.5
2 7.5 8.5 9.5
3 10.5 11.5 12.5
Here is a solution first unstack both dataframes so they are series with multiindexes(cluster, colnames)... then you can use Series addition and division, which automattically do the operation on the indexes, finally unstack them... here it is in code...
averages = (df1.stack()+df2.stack())/2
averages = averages.unstack()
And your done...
Or for more general purposes...
dfs = [df1,df2]
averages = pd.concat([each.stack() for each in dfs],axis=1)\
.apply(lambda x:x.mean(),axis=1)\
.unstack()
You can simply assign a label to each frame, call it group and then concat and groupby to do what you want:
In [57]: df = DataFrame(np.random.randn(10, 4), columns=list('abcd'))
In [58]: df2 = df.copy()
In [59]: dfs = [df, df2]
In [60]: df
Out[60]:
a b c d
0 0.1959 0.1260 0.1464 0.1631
1 0.9344 -1.8154 1.4529 -0.6334
2 0.0390 0.4810 1.1779 -1.1799
3 0.3542 0.3819 -2.0895 0.8877
4 -2.2898 -1.0585 0.8083 -0.2126
5 0.3727 -0.6867 -1.3440 -1.4849
6 -1.1785 0.0885 1.0945 -1.6271
7 -1.7169 0.3760 -1.4078 0.8994
8 0.0508 0.4891 0.0274 -0.6369
9 -0.7019 1.0425 -0.5476 -0.5143
In [61]: for i, d in enumerate(dfs):
....: d['group'] = i
....:
In [62]: dfs[0]
Out[62]:
a b c d group
0 0.1959 0.1260 0.1464 0.1631 0
1 0.9344 -1.8154 1.4529 -0.6334 0
2 0.0390 0.4810 1.1779 -1.1799 0
3 0.3542 0.3819 -2.0895 0.8877 0
4 -2.2898 -1.0585 0.8083 -0.2126 0
5 0.3727 -0.6867 -1.3440 -1.4849 0
6 -1.1785 0.0885 1.0945 -1.6271 0
7 -1.7169 0.3760 -1.4078 0.8994 0
8 0.0508 0.4891 0.0274 -0.6369 0
9 -0.7019 1.0425 -0.5476 -0.5143 0
In [63]: final = pd.concat(dfs, ignore_index=True)
In [64]: final
Out[64]:
a b c d group
0 0.1959 0.1260 0.1464 0.1631 0
1 0.9344 -1.8154 1.4529 -0.6334 0
2 0.0390 0.4810 1.1779 -1.1799 0
3 0.3542 0.3819 -2.0895 0.8877 0
4 -2.2898 -1.0585 0.8083 -0.2126 0
5 0.3727 -0.6867 -1.3440 -1.4849 0
6 -1.1785 0.0885 1.0945 -1.6271 0
.. ... ... ... ... ...
13 0.3542 0.3819 -2.0895 0.8877 1
14 -2.2898 -1.0585 0.8083 -0.2126 1
15 0.3727 -0.6867 -1.3440 -1.4849 1
16 -1.1785 0.0885 1.0945 -1.6271 1
17 -1.7169 0.3760 -1.4078 0.8994 1
18 0.0508 0.4891 0.0274 -0.6369 1
19 -0.7019 1.0425 -0.5476 -0.5143 1
[20 rows x 5 columns]
In [65]: final.groupby('group').mean()
Out[65]:
a b c d
group
0 -0.394 -0.0576 -0.0682 -0.4339
1 -0.394 -0.0576 -0.0682 -0.4339
Here, each group is the same, but that's only because df == df2.
Alternatively, you can throw the frames into a Panel:
In [69]: df = DataFrame(np.random.randn(10, 4), columns=list('abcd'))
In [70]: df2 = DataFrame(np.random.randn(10, 4), columns=list('abcd'))
In [71]: panel = pd.Panel({0: df, 1: df2})
In [72]: panel
Out[72]:
<class 'pandas.core.panel.Panel'>
Dimensions: 2 (items) x 10 (major_axis) x 4 (minor_axis)
Items axis: 0 to 1
Major_axis axis: 0 to 9
Minor_axis axis: a to d
In [73]: panel.mean()
Out[73]:
0 1
a 0.3839 0.2956
b 0.1855 -0.3164
c -0.1167 -0.0627
d -0.2338 -0.0450
With Pandas version 1.3.4 this works for me:
import numpy as np
df1 = pd.DataFrame(dict(x=np.random.randn(100), y=np.random.randint(0, 5, 100), z=np.random.randint(-3, 2, 100)))
df2 = pd.DataFrame(dict(x=np.random.randn(100), y=np.random.randint(0, 2, 100), z=np.random.randint(-1, 2, 100)))
pd.concat([df1, df2]).groupby(level=0).mean()