I have to 2 dataframes:
input_df
Apples Pears Peaches Grapes
12 23 0 4
10 0 0 4
12 16 12 5
6 0 0 11
coefficients_df
Fruit n w1 w2
Apples 2 0.4 40
Pears 1 0.1 43
Peaches 1 0.6 51
Grapes 2 0.5 11
I'm trying to apply an equation y = w2*(1-exp(-w1*input_df^n))to input_df. The equation takes coefficients from coefficients_df
This is what I tried:
# First map coefficients_df to input_df
merged_df = input_df.merge(coefficients_df.pivot('Fruit'), on=['Apples','Pears','Peaches','Grapes'])
# Apply function to each row
output_df = merged_df.apply(lambda x: w2*(1-exp(-w1*x^n))
Use simple index alignment:
coeff = coefficients_df.set_index('Fruit')
y = coeff['w2']*(1-np.exp(-coeff['w1']*input_df**coeff['n']))
Output:
Apples Pears Peaches Grapes
0 40.000000 38.68887 0.000000 10.996310
1 40.000000 0.00000 0.000000 10.996310
2 40.000000 34.31845 50.961924 10.999959
3 39.999978 0.00000 0.000000 11.000000
Use broadcasting to get value from df_input:
# df1 is input_df
# df2 is coefficients_df
col_idx = (df1.columns.to_numpy() == df2['Fruit'].to_numpy()[:, None]).argmax(axis=1)
x = df1.values[df1.index, col_idx]
y = df2['w2']*(1-np.exp(-df2['w1']*x**df2['n']))
Output:
>>> y
0 40.000000
1 0.000000
2 50.961924
3 11.000000
dtype: float64
Im a bit late to the party but here you go.
import numpy as np
results = {}
for col in input_df.columns:
coefs = coefficients_df.loc[coefficients_df['Fruit'] == col].iloc[0,1:]
y = np.array(coefs.w2*(1-np.exp(-coefs.w1*(input_df[col]**coefs.n))))
results[col] = y
and for me the output is
{'Apples': array([40. , 40. , 40. , 39.9999777]),
'Pears': array([38.68886972, 0. , 34.31844973, 0. ]),
'Peaches': array([ 0. , 0. , 50.96192412, 0. ]),
'Grapes': array([10.99630991, 10.99630991, 10.99995901, 11. ])}
Related
I have a dataframe like as shown below
import numpy as np
import pandas as pd
from numpy.random import default_rng
rng = default_rng(100)
cf = pd.DataFrame({'grade': rng.choice(list('ACD'),size=(8)),
'dash': rng.choice(list('PQRS'),size=(8)),
'dumeel': rng.choice(list('QWER'),size=(8)),
'dumma': rng.choice((1234),size=(8)),
'target': rng.choice([0,1],size=(8))
})
I would like to do the below
a) Find the total and %total for each of my value in the categorical columns against the target column
I tried the below but it only gets me to half way of the results.
cols = cf.select_dtypes('object')
cf.melt('target',cols).groupby(['variable','value']).size().reset_index(name='cnt of records')
How can I use the above result to compute target met and target not met details using the target column?
I expect my output to be like as shown below (note that I have shown only two columns grade and dash for sample). Code should follow the same logic for all string columns
Select your columns to flatten with melt then join the target column. Finally, group by variable and value columns and apply a dict of functions to each group.
funcs = {
'cnt of records': 'count',
'target met': lambda x: sum(x),
'target not met': lambda x: len(x) - sum(x),
'target met %': lambda x: f"{round(100 * sum(x) / len(x), 2):.2f}%",
'target not met %': lambda x: f"{round(100 * (len(x) - sum(x)) / len(x), 2):.2f}%"
}
out = df.select_dtypes('object').melt(ignore_index=False).join(df['target']) \
.groupby(['variable', 'value'])['target'].agg(**funcs).reset_index()
Output:
>>> out
variable value cnt of records target met target not met target met % target not met %
0 dash Q 2 0 2 0.00% 100.00%
1 dash R 2 2 0 100.00% 0.00%
2 dash S 4 2 2 50.00% 50.00%
3 dumeel E 3 2 1 66.67% 33.33%
4 dumeel Q 3 2 1 66.67% 33.33%
5 dumeel R 1 0 1 0.00% 100.00%
6 dumeel W 1 0 1 0.00% 100.00%
7 grade A 2 0 2 0.00% 100.00%
8 grade C 3 2 1 66.67% 33.33%
9 grade D 3 2 1 66.67% 33.33%
You can use agg after you groupby for this:
cols = cf.select_dtypes('object')
df = (
cf.melt('target', cols)
.groupby(['variable','value'])
['target']
.agg([('l', 'size'), ('s', 'sum')]) # l = length (total count of rows in this group), s = sum (total count of rows in the group where target = 1)
.pipe(lambda x: (
x.assign(
met_pct=x.s / x.l * 100,
not_met_pct=100 - (x.s / x.l * 100),
met=x.s,
not_met=x.l - x.s
)
)).reset_index()
.drop(['l', 's'], axis=1)
)
Output:
>>> df
variable value met_pct not_met_pct met not_met
0 dash P 100.000000 0.000000 1 0
1 dash Q 0.000000 100.000000 0 3
2 dash R 50.000000 50.000000 1 1
3 dash S 50.000000 50.000000 1 1
4 dumeel E 0.000000 100.000000 0 1
5 dumeel Q 100.000000 0.000000 1 0
6 dumeel R 50.000000 50.000000 2 2
7 dumeel W 0.000000 100.000000 0 2
8 grade A 0.000000 100.000000 0 1
9 grade C 50.000000 50.000000 2 2
10 grade D 33.333333 66.666667 1 2
Let us say, I have the following data frame.
Frequency
20
14
10
8
6
2
1
I want to scale Frequency value from 0 to 1.
Is there a way to do this in Python? I have found something similar here But it doesn't serve my purpose.
I am sure there's a more standard way to do this in Python, but I use a self-defined function that you can select the range to be scaled on:
def my_scaler(min_scale_num,max_scale_num,var):
return (max_scale_num - min_scale_num) * ( (var - min(var)) / (max(var) - min(var)) ) + min_scale_num
# You can input your range
df['scaled'] = my_scaler(0,1,df['Frequency'].astype(float)) # scaled between 0,1
df['scaled2'] = my_scaler(-5,5,df['Frequency'].astype(float)) # scaled between -5,5
df
Frequency scaled scaled2
0 20 1.000000 5.000000
1 14 0.684211 1.842105
2 10 0.473684 -0.263158
3 8 0.368421 -1.315789
4 6 0.263158 -2.368421
5 2 0.052632 -4.473684
6 1 0.000000 -5.000000
Just change a, b = 10, 50 to a, b = 0, 1 in linked answer for upper and lower values for scale:
a, b = 0, 1
x, y = df.Frequency.min(), df.Frequency.max()
df['normal'] = (df.Frequency - x) / (y - x) * (b - a) + a
print (df)
Frequency normal
0 20 1.000000
1 14 0.684211
2 10 0.473684
3 8 0.368421
4 6 0.263158
5 2 0.052632
6 1 0.000000
You can use applymap to apply any function on each cell of the df.
For example:
df = pd.DataFrame([20, 14, 10, 8, 6, 2, 1], columns=["Frequency"])
min = df.min()
max = df.max()
df2 = df.applymap(lambda x: (x - min)/(max-min))
df
Frequency
0 20
1 14
2 10
3 8
4 6
5 2
6 1
df2
0 Frequency 1.0
dtype: float64
1 Frequency 0.684211
dtype: float64
2 Frequency 0.473684
dtype: float64
3 Frequency 0.368421
dtype: float64
4 Frequency 0.263158
dtype: float64
5 Frequency 0.052632
dtype: float64
6 Frequency 0.0
dtype: float64
Suppose I have a Pandas Dataframe named df, which has the following structure:-
Column 1 Column 2 ......... Column 104
Row 1 0.01 0.55 3
Row 2 0.03 0.14 1
...
Row 100 0.75 0.56 0
What I am trying to accomplish is that for all rows which match the condition given below, I need to generate 100 more rows with a random value between 0 and 0.05 added to each row:-
is_less = df.iloc[:,-1] > 1
df_try = df[is_less]
df = df.append([df_try]*100,ignore_index=True)
The problem is that I can simply duplicate the rows in df_try to generate 100 more rows for each case, but I want to add a random value to each row as well, such that each row is different from the others but very similar.
import random
df = df.append([df_try + random.uniform(0,0.05)]*100, ignore_index=True)
What this does is to simply add the fixed random value to df_try's 100 new rows, but not a unique random value to each row. I know that this is because the above syntax does not iterate over df_try, resulting in the fixed random value being added, but is there a suitable way to add the random values iteratively over the data frame in this case?
One idea is create 2d array with same size like new appended DataFrame and add to joined lists with concat:
N = 10
arr = np.random.uniform(0,0.05, size=(N, len(df.columns)))
is_less = df.iloc[:,-1] > 1
df_try = df[is_less]
df = df.append(pd.concat([df_try]*N) + arr,ignore_index=True)
print (df)
Column 1 Column 2 Column 104
0 0.010000 0.550000 3.000000
1 0.030000 0.140000 1.000000
2 0.750000 0.560000 0.000000
3 0.024738 0.561647 3.045146
4 0.035315 0.584161 3.008656
5 0.022386 0.563025 3.033091
6 0.039175 0.588785 3.004649
7 0.049465 0.594903 3.003303
8 0.027366 0.580478 3.041745
9 0.044721 0.599853 3.001736
10 0.052849 0.589775 3.042434
11 0.033957 0.582610 3.045215
12 0.044349 0.582218 3.027665
Your solution should be changed by list comprehension if need add scalar to each df_try:
N = 10
is_less = df.iloc[:,-1] > 1
df_try = df[is_less]
df = df.append( [df_try + random.uniform(0, 0.05) for _ in range(N)], ignore_index=True)
print (df)
Column 1 Column 2 Column 104
0 0.010000 0.550000 3.000000
1 0.030000 0.140000 1.000000
2 0.750000 0.560000 0.000000
3 0.036756 0.576756 3.026756
4 0.039357 0.579357 3.029357
5 0.048746 0.588746 3.038746
6 0.040197 0.580197 3.030197
7 0.011045 0.551045 3.001045
8 0.013942 0.553942 3.003942
9 0.054658 0.594658 3.044658
10 0.025909 0.565909 3.015909
11 0.012093 0.552093 3.002093
12 0.058463 0.598463 3.048463
You can combine the copies first and create a single array containing all the random values, add them together, and then append the result to the original:
import numpy as np
n_copies = 2
df = pd.DataFrame(np.c_[np.arange(6), np.random.randint(1, 3, size=6)])
subset = df[df.iloc[:, -1] > 1]
extra = pd.concat([subset] * n_copies).add(np.random.uniform(0, 0.05, len(subset) * n_copies), axis='rows')
result = df.append(extra, ignore_index=True)
print(result)
Output:
0 1
0 0.000000 2.000000
1 1.000000 2.000000
2 2.000000 1.000000
3 3.000000 2.000000
4 4.000000 1.000000
5 5.000000 2.000000
6 0.007723 2.007723
7 1.005718 2.005718
8 3.003063 2.003063
9 5.005238 2.005238
10 0.006509 2.006509
11 1.034742 2.034742
12 3.022345 2.022345
13 5.040911 2.040911
I have a dataframe, and I want to iterate change the values of some rows depending on some calculations I'm doing in a loop.
So for example: if a condition is met, Then I want to change the centers, that are the values in a row of my dataframe.
This are my centers:
centers=[np.array([ 4.73478261, 3.10869565, 1.44782609, 0.20434783]),
np.array([ 5. , 2.4 , 3.2 , 1.03333333]),
np.array([ 5.135, 3.555, 1.48 , 0.275]),
np.array([ 5.52857143, 4.04285714, 1.47142857, 0.28571429]),
np.array([ 5.596, 2.664, 4.052, 1.252]),
np.array([ 6.01176471, 2.71176471, 4.94705882, 1.79411765]),
np.array([ 6.4 , 2.97058824, 4.55294118, 1.41176471]),
np.array([ 6.49090909, 2.9 , 5.37272727, 1.8 ]),
np.array([ 6.61333333, 3.16 , 5.56666667, 2.28666667]),
np.array([ 7.475, 3.125, 6.3 , 2.05 ])]
I then convert them to a dataframe
centersDf = pd.DataFrame(centers)
centersDf
and I would like to do something like,
centersDf[i]=np.array[5, 1, 0 , 2 ]
This doesn't work, but what could be the equivalent?
So, I'm recalculating the centers in my loop, and I want to update my dataframe.
centersDf = pd.DataFrame(centers)
centersDf.head()
0 1 2 3
0 4.734783 3.108696 1.447826 0.204348
1 5.000000 2.400000 3.200000 1.033333
2 5.135000 3.555000 1.480000 0.275000
3 5.528571 4.042857 1.471429 0.285714
4 5.596000 2.664000 4.052000 1.252000
centersDf.iloc[0] = np.array([5, 1, 0 , 2 ])
centersDf.head()
0 1 2 3
0 5.000000 1.000000 0.000000 2.000000
1 5.000000 2.400000 3.200000 1.033333
2 5.135000 3.555000 1.480000 0.275000
3 5.528571 4.042857 1.471429 0.285714
4 5.596000 2.664000 4.052000 1.252000
When you pass a scalar value to the __getitem__ method (that is using the []) pandas mathches up against column names. So centersDf[0] is the 0th column. You get an error because your trying to assign an array of length 4 to a column of length 10 and that makes no sense.
If you want to be able to assign by column name, create your dataframe as it's transpose
centersDf = pd.DataFrame(centers).T
Then
centersDf[0] = [5, 1, 0 , 2]
Works fine
centersDf
0 1 2 3 4 5 6 7 8 9
0 5 5.000000 5.135 5.528571 5.596 6.011765 6.400000 6.490909 6.613333 7.475
1 1 2.400000 3.555 4.042857 2.664 2.711765 2.970588 2.900000 3.160000 3.125
2 0 3.200000 1.480 1.471429 4.052 4.947059 4.552941 5.372727 5.566667 6.300
3 2 1.033333 0.275 0.285714 1.252 1.794118 1.411765 1.800000 2.286667 2.050
Otherwise, just use loc as has been suggested already.
I'm generating a number of dataframes with the same shape, and I want to compare them to one another. I want to be able to get the mean and median across the dataframes.
Source.0 Source.1 Source.2 Source.3
cluster
0 0.001182 0.184535 0.814230 0.000054
1 0.000001 0.160490 0.839508 0.000001
2 0.000001 0.173829 0.826114 0.000055
3 0.000432 0.180065 0.819502 0.000001
4 0.000152 0.157041 0.842694 0.000113
5 0.000183 0.174142 0.825674 0.000001
6 0.000001 0.151556 0.848405 0.000038
7 0.000771 0.177583 0.821645 0.000001
8 0.000001 0.202059 0.797939 0.000001
9 0.000025 0.189537 0.810410 0.000028
10 0.006142 0.003041 0.493912 0.496905
11 0.003739 0.002367 0.514216 0.479678
12 0.002334 0.001517 0.529041 0.467108
13 0.003458 0.000001 0.532265 0.464276
14 0.000405 0.005655 0.527576 0.466364
15 0.002557 0.003233 0.507954 0.486256
16 0.004161 0.000001 0.491271 0.504568
17 0.001364 0.001330 0.528311 0.468996
18 0.002886 0.000001 0.506392 0.490721
19 0.001823 0.002498 0.509620 0.486059
Source.0 Source.1 Source.2 Source.3
cluster
0 0.000001 0.197108 0.802495 0.000396
1 0.000001 0.157860 0.842076 0.000063
2 0.094956 0.203057 0.701662 0.000325
3 0.000001 0.181948 0.817841 0.000210
4 0.000003 0.169680 0.830316 0.000001
5 0.000362 0.177194 0.822443 0.000001
6 0.000001 0.146807 0.852924 0.000268
7 0.001087 0.178994 0.819564 0.000354
8 0.000001 0.202182 0.797333 0.000485
9 0.000348 0.181399 0.818252 0.000001
10 0.003050 0.000247 0.506777 0.489926
11 0.004420 0.000001 0.513927 0.481652
12 0.006488 0.001396 0.527197 0.464919
13 0.001510 0.000001 0.525987 0.472502
14 0.000001 0.000001 0.520737 0.479261
15 0.000001 0.001765 0.515658 0.482575
16 0.000001 0.000001 0.492550 0.507448
17 0.002855 0.000199 0.526535 0.470411
18 0.000001 0.001952 0.498303 0.499744
19 0.001232 0.000001 0.506612 0.492155
Then I want to get the mean of these two dataframes.
What is the easiest way to do this?
Just to clarify I want to get the mean for each particular cell when the indexes and columns of all the dataframes are exactly the same.
So in the example I gave, the average for [0,Source.0] would be (0.001182 + 0.000001) / 2 = 0.0005915.
Assuming the two dataframes have the same columns, you could just concatenate them and compute your summary stats on the concatenated frames:
import numpy as np
import pandas as pd
# some random data frames
df1 = pd.DataFrame(dict(x=np.random.randn(100), y=np.random.randint(0, 5, 100)))
df2 = pd.DataFrame(dict(x=np.random.randn(100), y=np.random.randint(0, 5, 100)))
# concatenate them
df_concat = pd.concat((df1, df2))
print df_concat.mean()
# x -0.163044
# y 2.120000
# dtype: float64
print df_concat.median()
# x -0.192037
# y 2.000000
# dtype: float64
Update
If you want to compute stats across each set of rows with the same index in the two datasets, you can use .groupby() to group the data by row index, then apply the mean, median etc.:
by_row_index = df_concat.groupby(df_concat.index)
df_means = by_row_index.mean()
print df_means.head()
# x y
# 0 -0.850794 1.5
# 1 0.159038 1.5
# 2 0.083278 1.0
# 3 -0.540336 0.5
# 4 0.390954 3.5
This method will work even when your dataframes have unequal numbers of rows - if a particular row index is missing in one of the two dataframes, the mean/median will be computed on the single existing row.
I go similar as #ali_m, but since you want one mean per row-column combination, I conclude differently:
df1 = pd.DataFrame(dict(x=np.random.randn(100), y=np.random.randint(0, 5, 100)))
df2 = pd.DataFrame(dict(x=np.random.randn(100), y=np.random.randint(0, 5, 100)))
df = pd.concat([df1, df2])
foo = df.groupby(level=1).mean()
foo.head()
x y
0 0.841282 2.5
1 0.716749 1.0
2 -0.551903 2.5
3 1.240736 1.5
4 1.227109 2.0
As per Niklas' comment, the solution to the question is panel.mean(axis=0).
As a more complete example:
import pandas as pd
import numpy as np
dfs = {}
nrows = 4
ncols = 3
for i in range(4):
dfs[i] = pd.DataFrame(np.arange(i, nrows*ncols+i).reshape(nrows, ncols),
columns=list('abc'))
print('DF{i}:\n{df}\n'.format(i=i, df=dfs[i]))
panel = pd.Panel(dfs)
print('Mean of stacked DFs:\n{df}'.format(df=panel.mean(axis=0)))
Will give the following output:
DF0:
a b c
0 0 1 2
1 3 4 5
2 6 7 8
3 9 10 11
DF1:
a b c
0 1 2 3
1 4 5 6
2 7 8 9
3 10 11 12
DF2:
a b c
0 2 3 4
1 5 6 7
2 8 9 10
3 11 12 13
DF3:
a b c
0 3 4 5
1 6 7 8
2 9 10 11
3 12 13 14
Mean of stacked DFs:
a b c
0 1.5 2.5 3.5
1 4.5 5.5 6.5
2 7.5 8.5 9.5
3 10.5 11.5 12.5
Here is a solution first unstack both dataframes so they are series with multiindexes(cluster, colnames)... then you can use Series addition and division, which automattically do the operation on the indexes, finally unstack them... here it is in code...
averages = (df1.stack()+df2.stack())/2
averages = averages.unstack()
And your done...
Or for more general purposes...
dfs = [df1,df2]
averages = pd.concat([each.stack() for each in dfs],axis=1)\
.apply(lambda x:x.mean(),axis=1)\
.unstack()
You can simply assign a label to each frame, call it group and then concat and groupby to do what you want:
In [57]: df = DataFrame(np.random.randn(10, 4), columns=list('abcd'))
In [58]: df2 = df.copy()
In [59]: dfs = [df, df2]
In [60]: df
Out[60]:
a b c d
0 0.1959 0.1260 0.1464 0.1631
1 0.9344 -1.8154 1.4529 -0.6334
2 0.0390 0.4810 1.1779 -1.1799
3 0.3542 0.3819 -2.0895 0.8877
4 -2.2898 -1.0585 0.8083 -0.2126
5 0.3727 -0.6867 -1.3440 -1.4849
6 -1.1785 0.0885 1.0945 -1.6271
7 -1.7169 0.3760 -1.4078 0.8994
8 0.0508 0.4891 0.0274 -0.6369
9 -0.7019 1.0425 -0.5476 -0.5143
In [61]: for i, d in enumerate(dfs):
....: d['group'] = i
....:
In [62]: dfs[0]
Out[62]:
a b c d group
0 0.1959 0.1260 0.1464 0.1631 0
1 0.9344 -1.8154 1.4529 -0.6334 0
2 0.0390 0.4810 1.1779 -1.1799 0
3 0.3542 0.3819 -2.0895 0.8877 0
4 -2.2898 -1.0585 0.8083 -0.2126 0
5 0.3727 -0.6867 -1.3440 -1.4849 0
6 -1.1785 0.0885 1.0945 -1.6271 0
7 -1.7169 0.3760 -1.4078 0.8994 0
8 0.0508 0.4891 0.0274 -0.6369 0
9 -0.7019 1.0425 -0.5476 -0.5143 0
In [63]: final = pd.concat(dfs, ignore_index=True)
In [64]: final
Out[64]:
a b c d group
0 0.1959 0.1260 0.1464 0.1631 0
1 0.9344 -1.8154 1.4529 -0.6334 0
2 0.0390 0.4810 1.1779 -1.1799 0
3 0.3542 0.3819 -2.0895 0.8877 0
4 -2.2898 -1.0585 0.8083 -0.2126 0
5 0.3727 -0.6867 -1.3440 -1.4849 0
6 -1.1785 0.0885 1.0945 -1.6271 0
.. ... ... ... ... ...
13 0.3542 0.3819 -2.0895 0.8877 1
14 -2.2898 -1.0585 0.8083 -0.2126 1
15 0.3727 -0.6867 -1.3440 -1.4849 1
16 -1.1785 0.0885 1.0945 -1.6271 1
17 -1.7169 0.3760 -1.4078 0.8994 1
18 0.0508 0.4891 0.0274 -0.6369 1
19 -0.7019 1.0425 -0.5476 -0.5143 1
[20 rows x 5 columns]
In [65]: final.groupby('group').mean()
Out[65]:
a b c d
group
0 -0.394 -0.0576 -0.0682 -0.4339
1 -0.394 -0.0576 -0.0682 -0.4339
Here, each group is the same, but that's only because df == df2.
Alternatively, you can throw the frames into a Panel:
In [69]: df = DataFrame(np.random.randn(10, 4), columns=list('abcd'))
In [70]: df2 = DataFrame(np.random.randn(10, 4), columns=list('abcd'))
In [71]: panel = pd.Panel({0: df, 1: df2})
In [72]: panel
Out[72]:
<class 'pandas.core.panel.Panel'>
Dimensions: 2 (items) x 10 (major_axis) x 4 (minor_axis)
Items axis: 0 to 1
Major_axis axis: 0 to 9
Minor_axis axis: a to d
In [73]: panel.mean()
Out[73]:
0 1
a 0.3839 0.2956
b 0.1855 -0.3164
c -0.1167 -0.0627
d -0.2338 -0.0450
With Pandas version 1.3.4 this works for me:
import numpy as np
df1 = pd.DataFrame(dict(x=np.random.randn(100), y=np.random.randint(0, 5, 100), z=np.random.randint(-3, 2, 100)))
df2 = pd.DataFrame(dict(x=np.random.randn(100), y=np.random.randint(0, 2, 100), z=np.random.randint(-1, 2, 100)))
pd.concat([df1, df2]).groupby(level=0).mean()