Elegant way of multiplying different constant values with different columns in Pandas - python

The objective is to multiply each column in a Pandas DataFrame by a constant value, where each column has its own constant.
For example, the columns 'a_b_c', 'dd_ee', 'ff_ff', 'abc', 'devb' are multiplied by the constants 15, 20, 15, 15, 20, respectively.
The constant values and their associated columns are stored in a dict const_val:
const_val = dict(a_b_c=15,
                 dd_ee=20,
                 ff_ff=15,
                 abc=15,
                 devb=20)
Currently, I am using a for-loop to multiply each column by its associated constant value, as shown in the code below:
for dpair in const_val:
    df[('per_a', dpair)] = df[dpair] * const_val[dpair] / reval
However, I wonder whether there is a more elegant way of doing this.
The full code is provided below
import pandas as pd
import numpy as np

np.random.seed(0)
const_val = dict(a_b_c=15,
                 dd_ee=20,
                 ff_ff=15,
                 abc=15,
                 devb=20)
df = pd.DataFrame(data=np.random.randint(5, size=(3, 6)),
                  columns=['id', 'a_b_c', 'dd_ee', 'ff_ff', 'abc', 'devb'])
reval = 6
for dpair in const_val:
    df[('per_a', dpair)] = df[dpair] * const_val[dpair] / reval
The expected output is as below
id a_b_c dd_ee ... (per_a, ff_ff) (per_a, abc) (per_a, devb)
0 4 0 3 ... 7.5 7.5 3.333333
1 3 2 4 ... 0.0 0.0 13.333333
2 2 1 0 ... 2.5 2.5 0.000000
Please note that (per_a, ff_ff), (per_a, abc), and (per_a, devb) are MultiIndex columns; their representation might differ in your environment.
P.S. I am using IntelliJ IDEA.

If you only have numbers in your DataFrame:
out = df.mul(pd.Series(const_val).reindex(df.columns, fill_value=1), axis=1)
If you have a mix of numeric and non-numeric columns:
out = df.select_dtypes('number').mul(pd.Series(const_val), axis=1).combine_first(df)
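For illustration, a minimal sketch of the mixed-dtype case (toy frame and column names assumed, not from the question):
import pandas as pd

df_mixed = pd.DataFrame({'name': ['x', 'y'], 'a_b_c': [1, 2], 'dd_ee': [3, 4]})
const_val = dict(a_b_c=15, dd_ee=20)

# multiply only the numeric columns, then pull the untouched columns back in;
# combine_first also restores any column missing from the left operand
out = (df_mixed.select_dtypes('number')
               .mul(pd.Series(const_val), axis=1)
               .combine_first(df_mixed))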
Update:
out = df.join(df[list(const_val)].mul(pd.Series(const_val), axis=1)
                                 .div(reval).add_prefix('per_a_'))
Output
id a_b_c dd_ee ff_ff abc devb per_a_a_b_c per_a_dd_ee per_a_ff_ff per_a_abc per_a_devb
0 1 4 3 0 3 0 10.0 10.000000 0.0 7.5 0.0
1 2 3 0 1 3 3 7.5 0.000000 2.5 7.5 10.0
2 3 0 1 1 1 0 0.0 3.333333 2.5 2.5 0.0

Update for multiindex/tuple column headers:
cols = pd.Index(const_val.keys())
mi = pd.MultiIndex.from_product([['per_a'], cols])
df[mi] = df[cols] * pd.Series(const_val) / reval
print(df)
Output:
id a_b_c dd_ee ff_ff abc devb (per_a, a_b_c) (per_a, dd_ee) (per_a, ff_ff) (per_a, abc) (per_a, devb)
0 4 0 3 3 3 1 0.0 10.000000 7.5 7.5 3.333333
1 3 2 4 0 0 4 5.0 13.333333 0.0 0.0 13.333333
2 2 1 0 1 1 0 2.5 0.000000 2.5 2.5 0.000000
Try this, using pandas' intrinsic data alignment to align the data by index:
cols = pd.Index(const_val.keys())
df[cols + '_per_a'] = df[cols] * pd.Series(const_val) / reval
Output:
id a_b_c dd_ee ff_ff abc devb a_b_c_per_a dd_ee_per_a ff_ff_per_a abc_per_a devb_per_a
0 4 0 3 3 3 1 0.0 10.000000 7.5 7.5 3.333333
1 3 2 4 0 0 4 5.0 13.333333 0.0 0.0 13.333333
2 2 1 0 1 1 0 2.5 0.000000 2.5 2.5 0.000000
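The key point is that multiplying a DataFrame by a Series aligns the Series index with the DataFrame columns; columns absent from the Series become NaN, which is why df[cols] is selected first. A minimal sketch of that alignment behavior (toy data, not from the question):
import pandas as pd

df_small = pd.DataFrame({'id': [1, 2], 'a_b_c': [3, 4]})
s = pd.Series({'a_b_c': 15})

print(df_small * s)             # 'id' is not in s, so that column becomes NaN
print(df_small[['a_b_c']] * s)  # restricting first keeps only matched columns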

df
id a_b_c dd_ee ff_ff abc devb
0 4 0 3 3 3 1
1 3 2 4 0 0 4
2 2 1 0 1 1 0
Make const_val into a Series:
s = pd.Series(const_val)
s
a_b_c 15
dd_ee 20
ff_ff 15
abc 15
devb 20
dtype: int64
Use broadcasting:
out = df[['id']].join(df[df.columns[1:]].mul(s))
out
id a_b_c dd_ee ff_ff abc devb
0 4 0 60 45 45 20
1 3 30 80 0 0 80
2 2 15 0 15 15 0
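To also divide by reval and keep the original columns alongside the prefixed results, matching the question's expected output (assuming reval = 6 from the question), the same broadcasting extends to:
out = df.join(df[df.columns[1:]].mul(s).div(reval).add_prefix('per_a_'))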


Is there a better way to handle NaN values?

I have an input dataframe
KPI_ID KPI_Key1 KPI_Key2 KPI_Key3
A (C602+C603) C601 75
B (C605+C606) C602 NaN
C 75 L239+C602 NaN
D (32*(C603+44)) 75 NaN
E L239 NaN C601
I have an Indicator df
99 75 C604 C602 C601 C603 C605 C606 44 L239 32
PatientID
1 1 0 1 0 1 0 0 0 1 0 1
2 0 0 0 0 0 0 1 1 0 0 0
3 1 1 1 1 0 1 1 1 1 1 1
4 0 0 0 0 0 1 0 1 0 1 0
5 1 0 1 1 1 1 0 1 1 1 1
Source:
input_df = pd.DataFrame({
    'KPI_ID': ['A', 'B', 'C', 'D', 'E'],
    'KPI_Key1': ['(C602+C603)', '(C605+C606)', '75', '(32*(C603+44))', 'L239'],
    'KPI_Key2': ['C601', 'C602', 'L239+C602', '75', np.NaN],
    'KPI_Key3': ['75', np.NaN, np.NaN, np.NaN, 'C601']})
indicator_df = pd.DataFrame({
    'PatientID': [1, 2, 3, 4, 5],
    '99': ['1', '0', '1', '0', '1'],
    '75': ['0', '0', '1', '0', '0'],
    'C604': ['1', '0', '1', '0', '1'],
    'C602': ['0', '0', '1', '0', '1'],
    'C601': ['1', '0', '0', '0', '1'],
    'C603': ['0', '0', '1', '1', '1'],
    'C605': ['0', '1', '1', '0', '0'],
    'C606': ['0', '1', '1', '1', '1'],
    '44': ['1', '0', '1', '0', '1'],
    'L239': ['0', '0', '1', '1', '1'],
    '32': ['1', '0', '1', '0', '1']}).set_index('PatientID')
My goal is to create an output df like this (by evaluating input_df against indicator_df):
final_out_df:
PatientID KPI_ID KPI_Key1 KPI_Key2 KPI_Key3
1 A 0 1 0
2 A 0 0 0
3 A 2 0 1
4 A 1 0 0
5 A 2 1 0
1 B 0 0 0
2 B 2 0 0
3 B 2 1 0
... ... ... ... ...
I am VERY close, and my logic works fine except that I am unable to handle the NaN values in the input_df. I am able to generate the output for KPI_ID 'A' since none of the three formulas (KPI_Key1, KPI_Key2, KPI_Key3 for 'A') are null, but I fail to generate it for 'B'. Is there anything I can do instead of using a dummy variable in place of NaN and creating that row in indicator_df?
Here is what I did so far:
import re

indicator_df = indicator_df.astype('int32')
final_out_df = pd.DataFrame()
out_df = pd.DataFrame(index=indicator_df.index)
out_df.reset_index(level=0, inplace=True)
# running the loop only for 'A' so it won't fail
for i in range(0, len(input_df) - 4):
    for j in ['KPI_Key1', 'KPI_Key2', 'KPI_Key3']:
        exp = input_df[j].iloc[i]
        temp_out_df = indicator_df.eval(re.sub(r'(\w+)', r'`\1`', exp)).reset_index(name=j)
        out_df['KPI_ID'] = input_df['KPI_ID'].iloc[i]
        out_df = out_df.merge(temp_out_df, on='PatientID', how='left')
    final_out_df = final_out_df.append(out_df)
    out_df = pd.DataFrame(index=indicator_df.index)
    out_df.reset_index(level=0, inplace=True)
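(For context, the re.sub call wraps every token of the formula in backticks so that DataFrame.eval treats names such as 75 or C602 as column labels rather than literals — a quick check:)
import re

print(re.sub(r'(\w+)', r'`\1`', '(C605+C606)'))  # prints (`C605`+`C606`)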
Replace NaN with None and create a dict of local variables to allow a correct evaluation with pd.eval:
def eval_kpi(row):
    kpi = row.filter(like='KPI_Key').fillna('None')
    return pd.Series(pd.eval(kpi, local_dict=row['local_vars']), index=kpi.index)

final_out_df = indicator_df.astype(int).apply(dict, axis=1) \
                           .rename('local_vars').reset_index() \
                           .merge(input_df, how='cross')
final_out_df.update(final_out_df.apply(eval_kpi, axis=1))
final_out_df = final_out_df.drop(columns='local_vars') \
                           .sort_values(['KPI_ID', 'PatientID']) \
                           .reset_index(drop=True)
Output:
>>> final_out_df
PatientID KPI_ID KPI_Key1 KPI_Key2 KPI_Key3
0 1 A 0.0 1.0 75.0
1 2 A 0.0 0.0 75.0
2 3 A 2.0 0.0 75.0
3 4 A 1.0 0.0 75.0
4 5 A 2.0 1.0 75.0
5 1 B 0.0 0.0 NaN
6 2 B 2.0 0.0 NaN
7 3 B 2.0 1.0 NaN
8 4 B 1.0 0.0 NaN
9 5 B 1.0 1.0 NaN
10 1 C 75.0 0.0 NaN
11 2 C 75.0 0.0 NaN
12 3 C 75.0 2.0 NaN
13 4 C 75.0 1.0 NaN
14 5 C 75.0 2.0 NaN
15 1 D 1408.0 75.0 NaN
16 2 D 1408.0 75.0 NaN
17 3 D 1440.0 75.0 NaN
18 4 D 1440.0 75.0 NaN
19 5 D 1440.0 75.0 NaN
20 1 E 0.0 NaN 1.0
21 2 E 0.0 NaN 0.0
22 3 E 1.0 NaN 0.0
23 4 E 1.0 NaN 0.0
24 5 E 1.0 NaN 1.0
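For context, the row-wise evaluation above relies on pd.eval's local_dict argument, which substitutes one patient's indicator values into the formula string — a minimal sketch with made-up values:
import pandas as pd

# one patient's indicators act as variables when the formula is evaluated
print(pd.eval('(C602+C603)', local_dict={'C602': 0, 'C603': 1}))  # 1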
I was able to solve it by adding:
if exp == exp:
before parsing exp through the regex.
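This works because NaN is the one common value that compares unequal to itself, so exp == exp is False exactly when exp is NaN — a quick check:
import numpy as np

exp = np.nan
print(exp == exp)  # False: NaN != NaN, so the formula is skipped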

How can I replace NaN values with data from another dataframe in Python?

from io import StringIO
import pandas as pd
x1 = """No.,col1,col2,col3,A
123,2,5,2,NaN
453,4,3,1,3
146,7,9,4,2
175,2,4,3,NaN
643,0,0,0,2
"""
x2 = """No.,col1,col2,col3,A
123,24,57,22,1
453,41,39,15,2
175,21,43,37,3
"""
df1 = pd.read_csv(StringIO(x1), sep=",")
df2 = pd.read_csv(StringIO(x2), sep=",")
How can I fill the NaN values in df1 with the values from the corresponding 'No.' rows of df2, to get
No. col1 col2 col3 A
123 2 5 2 1
453 4 3 1 3
146 7 9 4 2
175 2 4 3 3
643 0 0 0 2
I tried the following line, but nothing changed:
df1['A'].fillna(df2['A'])
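For what it's worth, two things go wrong here: fillna returns a new Series (nothing is assigned back to df1), and it aligns on the default integer index rather than on 'No.', so row 3 of df1 ('No.' 175) looks for row 3 of df2, which does not exist. A sketch of the failure mode:
# aligns index 0..4 against 0..2 positionally, and the result is discarded
filled = df1['A'].fillna(df2['A'])   # df1 row 3 (No. 175) finds no match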
Use combine_first, which is explicitly designed for this purpose:
(df1.set_index('No.')
.combine_first(df2.set_index('No.'))
.reset_index()
)
output:
No. col1 col2 col3 A
0 123 2.0 5.0 2.0 1.0
1 146 7.0 9.0 4.0 2.0
2 175 2.0 4.0 3.0 3.0
3 453 4.0 3.0 1.0 3.0
4 643 0.0 0.0 0.0 2.0
Or use fillna after setting 'No.' as the index:
(df1.set_index('No.')
.fillna(df2.set_index('No.'))
.reset_index()
)
output:
No. col1 col2 col3 A
0 123 2 5 2 1.0
1 453 4 3 1 3.0
2 146 7 9 4 2.0
3 175 2 4 3 3.0
4 643 0 0 0 2.0
Try this:
df1['A'] = df1['A'].fillna(df2.set_index('No.').reindex(df1['No.'])['A'].reset_index(drop=True))
Another way with fillna and map, where map looks each 'No.' up in df2 and returns a Series aligned with df1's index:
df1["A"] = df1["A"].fillna(df1["No."].map(df2.set_index("No.")["A"]))
>>> df1
No. col1 col2 col3 A
0 123 2 5 2 1.0
1 453 4 3 1 3.0
2 146 7 9 4 2.0
3 175 2 4 3 3.0
4 643 0 0 0 2.0

Setting subset of a pandas DataFrame by a DataFrame

I feel like this question has been asked a million times before, but I just can't seem to get it to work or find an SO post answering my question.
So I am selecting a subset of a pandas DataFrame and want to change these values individually.
I am subselecting my DataFrame like this:
df.loc[df[key].isnull(), [keys]]
which works perfectly. If I try to set all values to the same value, such as
df.loc[df[key].isnull(), [keys]] = 5
it works as well. But if I try to set it to a DataFrame, it does not; no error is produced either.
So for example I have a DataFrame:
data = [['Alex',10,0,0,2],['Bob',12,0,0,1],['Clarke',13,0,0,4],['Dennis',64,2],['Jennifer',56,1],['Tom',95,5],['Ellen',42,2],['Heather',31,3]]
df1 = pd.DataFrame(data,columns=['Name','Age','Amount_of_cars','cars_per_year','some_other_value'])
Name Age Amount_of_cars cars_per_year some_other_value
0 Alex 10 0 0.0 2.0
1 Bob 12 0 0.0 1.0
2 Clarke 13 0 0.0 4.0
3 Dennis 64 2 NaN NaN
4 Jennifer 56 1 NaN NaN
5 Tom 95 5 NaN NaN
6 Ellen 42 2 NaN NaN
7 Heather 31 3 NaN NaN
and a second DataFrame:
data = [[2/64,5],[1/56,1],[5/95,7],[2/42,5],[3/31,7]]
df2 = pd.DataFrame(data,columns=['cars_per_year','some_other_value'])
cars_per_year some_other_value
0 0.031250 5
1 0.017857 1
2 0.052632 7
3 0.047619 5
4 0.096774 7
and I would like to replace those nans with the second DataFrame
df1.loc[df1['cars_per_year'].isnull(),['cars_per_year','some_other_value']] = df2
Unfortunately this does not work, as the indexes do not match. So how do I ignore the index when setting values?
Any help would be appreciated. Sorry if this has been posted before.
This is possible only if the number of missing values equals the number of rows in df2; then assign the underlying array to prevent index alignment:
df1.loc[df1['cars_per_year'].isnull(),['cars_per_year','some_other_value']] = df2.values
print (df1)
Name Age Amount_of_cars cars_per_year some_other_value
0 Alex 10 0 0.000000 2.0
1 Bob 12 0 0.000000 1.0
2 Clarke 13 0 0.000000 4.0
3 Dennis 64 2 0.031250 5.0
4 Jennifer 56 1 0.017857 1.0
5 Tom 95 5 0.052632 7.0
6 Ellen 42 2 0.047619 5.0
7 Heather 31 3 0.096774 7.0
If not, you get an error like:
# 4 rows assigned to 5 rows
data = [[2/64,5],[1/56,1],[5/95,7],[2/42,5]]
df2 = pd.DataFrame(data,columns=['cars_per_year','some_other_value'])
df1.loc[df1['cars_per_year'].isnull(),['cars_per_year','some_other_value']] = df2.values
ValueError: shape mismatch: value array of shape (4,) could not be broadcast to indexing result of shape (5,)
Another idea is to set the index of df2 from the index of the filtered rows in df1:
df2 = df2.set_index(df1.index[df1['cars_per_year'].isnull()])
df1.loc[df1['cars_per_year'].isnull(),['cars_per_year','some_other_value']] = df2
print (df1)
Name Age Amount_of_cars cars_per_year some_other_value
0 Alex 10 0 0.000000 2.0
1 Bob 12 0 0.000000 1.0
2 Clarke 13 0 0.000000 4.0
3 Dennis 64 2 0.031250 5.0
4 Jennifer 56 1 0.017857 1.0
5 Tom 95 5 0.052632 7.0
6 Ellen 42 2 0.047619 5.0
7 Heather 31 3 0.096774 7.0
Just add .values (or .to_numpy() if using pandas v0.24+):
df1.loc[df1['cars_per_year'].isnull(),['cars_per_year','some_other_value']] = df2.values
Name Age Amount_of_cars cars_per_year some_other_value
0 Alex 10 0 0.000000 2.0
1 Bob 12 0 0.000000 1.0
2 Clarke 13 0 0.000000 4.0
3 Dennis 64 2 0.031250 5.0
4 Jennifer 56 1 0.017857 1.0
5 Tom 95 5 0.052632 7.0
6 Ellen 42 2 0.047619 5.0
7 Heather 31 3 0.096774 7.0

Pandas: Iterate by two columns for each iteration

Does anyone know how to iterate over a pandas DataFrame with two columns for each iteration?
Say I have
a b c d
5.1 3.5 1.4 0.2
4.9 3.0 1.4 0.2
4.7 3.2 1.3 0.2
4.6 3.1 1.5 0.2
5.0 3.6 1.4 0.2
5.4 3.9 1.7 0.4
So something like
for x, y in ...:
    correlation of x and y
So output will be
corr_ab corr_bc corr_cd
0.1 0.3 -0.4
You can use zip with the columns offset by one to pair adjacent columns, build a dictionary of one-element lists with Series.corr and f-strings for the column names, and pass it to the DataFrame constructor:
L = {f'corr_{col1}{col2}': [df[col1].corr(df[col2])]
     for col1, col2 in zip(df.columns, df.columns[1:])}
df = pd.DataFrame(L)
print(df)
corr_ab corr_bc corr_cd
0 0.860108 0.61333 0.888523
You can use df.corr to get the correlation of the dataframe. You then use mask to avoid repeated correlations. After that you can stack your new dataframe to make it more readable. Assuming you have data like this
0 1 2 3 4
0 11 6 17 2 3
1 3 12 16 17 5
2 13 2 11 10 0
3 8 12 13 18 3
4 4 3 1 0 18
Finding the correlation,
dataCorr = data.corr(method='pearson')
We get,
0 1 2 3 4
0 1.000000 -0.446023 0.304108 -0.136610 -0.674082
1 -0.446023 1.000000 0.563112 0.773013 -0.258801
2 0.304108 0.563112 1.000000 0.494512 -0.823883
3 -0.136610 0.773013 0.494512 1.000000 -0.545530
4 -0.674082 -0.258801 -0.823883 -0.545530 1.000000
Masking out repeated correlations,
dataCorr = dataCorr.mask(np.tril(np.ones(dataCorr.shape)).astype(bool))
We get
0 1 2 3 4
0 NaN -0.446023 0.304108 -0.136610 -0.674082
1 NaN NaN 0.563112 0.773013 -0.258801
2 NaN NaN NaN 0.494512 -0.823883
3 NaN NaN NaN NaN -0.545530
4 NaN NaN NaN NaN NaN
Stacking the correlated data
dataCorr = dataCorr.stack().reset_index()
The stacked data will look as shown
level_0 level_1 0
0 0 1 -0.446023
1 0 2 0.304108
2 0 3 -0.136610
3 0 4 -0.674082
4 1 2 0.563112
5 1 3 0.773013
6 1 4 -0.258801
7 2 3 0.494512
8 2 4 -0.823883
9 3 4 -0.545530
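If the single-row layout from the question is needed, the stacked frame can be relabeled and transposed (a sketch; the columns level_0, level_1, and 0 are what stack().reset_index() produces above):
dataCorr.columns = ['var1', 'var2', 'corr']
labels = 'corr_' + dataCorr['var1'].astype(str) + dataCorr['var2'].astype(str)
out = dataCorr.set_axis(labels, axis=0)[['corr']].T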

Normalize adjacency matrix (in pandas) with MinMaxScaler

I have an adjacency matrix (dm) of items vs items; the value between two items (e.g., item0, item1) refers to the number of times these items appear together. How can I scale all the values in pandas between 0 and 1?
from sklearn import preprocessing
scaler = preprocessing.MinMaxScaler()
However, I am not sure how to apply the scaler to the pandas DataFrame.
You can assign the resulting array back to the dataframe with loc:
df = pd.DataFrame(np.random.randint(1, 5, (5, 5)))
df
Out[277]:
0 1 2 3 4
0 2 3 2 3 1
1 2 3 4 4 2
2 2 3 4 3 2
3 1 1 2 1 4
4 4 2 2 3 1
df.loc[:,:] = scaler.fit_transform(df)
df
Out[279]:
0 1 2 3 4
0 0.333333 1.0 0.0 0.666667 0.000000
1 0.333333 1.0 1.0 1.000000 0.333333
2 0.333333 1.0 1.0 0.666667 0.333333
3 0.000000 0.0 0.0 0.000000 1.000000
4 1.000000 0.5 0.0 0.666667 0.000000
You can do the same with (df - df.min()) / (df.max() - df.min()).
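For reference, a minimal runnable sketch of that pure-pandas equivalent (note that a constant column would divide by zero, so guard against that if it can occur in your matrix):
import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.randint(1, 5, (5, 5)))

# column-wise min-max normalization without sklearn
normalized = (df - df.min()) / (df.max() - df.min())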
