Convert upper triangular matrix to lower triangular matrix in Pandas Dataframe - python

I tried using transpose and adding some twists to it, but it didn't work out.
Convert Upper:
Data :
0 1 2 3
0 5 NaN NaN NaN
1 1 NaN NaN NaN
2 0.21 0.31 0.41 0.51
3 0.32 0.42 0.52 NaN
4 0.43 0.53 NaN NaN
5 0.54 NaN NaN NaN
to:
Data :
0 1 2 3
0 5 NaN NaN NaN
1 1 NaN NaN NaN
2 0.21 NaN NaN NaN
3 0.31 0.32 NaN NaN
4 0.41 0.42 0.43 NaN
5 0.51 0.52 0.53 0.54
without affecting the first two rows.

I believe you need justify combined with sort, excluding the first 2 rows:
arr = justify(df.values[2:,:], invalid_val=np.nan, side='down', axis=0)
df.values[2:,:] = np.sort(arr, axis=1)
print (df)
0 1 2 3
0 5.00 NaN NaN NaN
1 1.00 NaN NaN NaN
2 0.21 NaN NaN NaN
3 0.31 0.32 NaN NaN
4 0.41 0.42 0.43 NaN
5 0.51 0.52 0.53 0.54
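Note: justify is not a pandas or NumPy built-in; the answers here assume a helper along the lines of the well-known NumPy justify snippet. A minimal sketch of it (the exact implementation may differ):
import numpy as np

def justify(a, invalid_val=0, axis=1, side='left'):
    # push all valid (non-invalid) values of 2D array `a` to one side:
    # 'left'/'right' for axis=1, 'up'/'down' for axis=0
    if invalid_val is np.nan:
        mask = ~np.isnan(a)
    else:
        mask = a != invalid_val
    justified_mask = np.sort(mask, axis=axis)   # False (invalid) first, True last
    if (side == 'up') | (side == 'left'):
        justified_mask = np.flip(justified_mask, axis=axis)
    out = np.full(a.shape, invalid_val, dtype=float)
    if axis == 1:
        out[justified_mask] = a[mask]
    else:
        out.T[justified_mask.T] = a.T[mask.T]
    return out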

IIUC, you can first index the dataframe from row 2 onwards and swap it with its transpose, and then use justify so that all NaNs end up at the top:
df.iloc[2:,:] = df.iloc[2:,:].T.values
pd.DataFrame(justify(df.values.astype(float), invalid_val=np.nan, side='down', axis=0))
0 1 2 3
0 5 NaN NaN NaN
1 1 NaN NaN NaN
2 0.21 NaN NaN NaN
3 0.31 0.32 NaN NaN
4 0.41 0.42 0.43 NaN
5 0.51 0.52 0.53 0.54

Calculating multiple columns using pandas mask() and diff() with multiple conditions

I have a dataframe df:
Date     Type  AVG1  AVG2  AVG3  AVG4  AVG5
2022-05  ROL1  0.33  0.45  0.12  0.96  1.33
2022-05  ROL2  1.43  0.11  0.75  1.99  3.01
2022-05  ROL3  0.11  0.32  0.55  1.26  4.22
2022-04  ROL1  1.66  0.71  0.87  5.88  1.11
2022-04  ROL2  2.31  0.89  2.20  4.36  4.87
2022-04  ROL3  5.40  1.22  4.45  0.01  0.31
And I need to create the columns AVG1_ROL1_MoM, AVG1_ROL2_MoM, AVG3_ROL1_MoM, AVG3_ROL2_MoM, and so on, where AVG1_ROL1_MoM is the month-over-month difference in AVG1 for rows where Type = ROL1:
Date     Type  AVG1  AVG2  AVG3  AVG4  AVG5  AVG1_ROL1_MoM  AVG1_ROL2_MoM
2022-05  ROL1  0.33  0.45  0.12  0.96  1.33  -1.33          NaN
2022-05  ROL2  1.43  0.11  0.75  1.99  3.01  NaN            -0.88
2022-05  ROL3  0.11  0.32  0.55  1.26  4.22  NaN            NaN
2022-04  ROL1  1.66  0.71  0.87  5.88  1.11  NaN            NaN
2022-04  ROL2  2.31  0.89  2.20  4.36  4.87  NaN            NaN
2022-04  ROL3  5.40  1.22  4.45  0.01  0.31  NaN            NaN
I tried to do that with mask() and shift(), but it didn't work:
df['AVG1_ROL1_MoM'] = df.mask(df['Type']=="ROL1", df['AVG1'] - df['AVG1'].shift(), inplace=True)
This returns an error that an axis must be defined, but when I define an axis it returns:
"Cannot do inplace boolean setting on mixed-types with a non np.nan value"
What would be the best approach for this?
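For a single pair of columns, a minimal sketch that avoids the inplace/axis problem (assuming the frame is sorted by Date descending, as in the example) is to compute the per-Type shifted difference and mask it with where; the answers below generalize this to every AVG/Type combination:
# sketch: MoM difference of AVG1 for ROL1 rows only (frame sorted by Date descending)
df['AVG1_ROL1_MoM'] = (df['AVG1'] - df.groupby('Type')['AVG1'].shift(-1)).where(df['Type'].eq('ROL1'))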
melt the dataframe to get all the values in a single column
Create the new column names
groupby to find the monthly differences
pivot to get back the original structure
merge with the original dataframe
melted = df.melt(["Date","Type"])
# build the new column names, e.g. "AVG1_ROL1_MoM"
melted["column"] = melted["variable"]+"_"+melted["Type"]+"_MoM"
# month-over-month difference within each (Type, AVG column) group
melted["diff"] = melted.groupby(["Type","variable"])["value"].diff(-1)
# pivot back to wide form (keyword arguments are required in newer pandas)
pivoted = (melted.pivot(index=["Date","Type"], columns="column", values="diff")
                 .sort_index(ascending=[False,True])
                 .reset_index())
output = df.merge(pivoted, on=["Date","Type"])
>>> output
Date Type AVG1 ... AVG5_ROL1_MoM AVG5_ROL2_MoM AVG5_ROL3_MoM
0 2022-05 ROL1 0.33 ... 0.22 NaN NaN
1 2022-05 ROL2 1.43 ... NaN -1.86 NaN
2 2022-05 ROL3 0.11 ... NaN NaN 3.91
3 2022-04 ROL1 1.66 ... NaN NaN NaN
4 2022-04 ROL2 2.31 ... NaN NaN NaN
5 2022-04 ROL3 5.40 ... NaN NaN NaN
[6 rows x 22 columns]
IIUC, you can group by the Type column, subtract each subgroup's shifted AVG values, and rename the resulting columns:
out = (df.filter(like='AVG')
         .groupby(df['Type'])
         .apply(lambda g: (g-g.shift(-1)).rename(columns=lambda col: f'{col}_{g.name}_MOM'))
      )
print(out)
AVG1_ROL1_MOM AVG2_ROL1_MOM AVG3_ROL1_MOM AVG4_ROL1_MOM AVG5_ROL1_MOM \
0 -1.33 -0.26 -0.75 -4.92 0.22
1 NaN NaN NaN NaN NaN
2 NaN NaN NaN NaN NaN
3 NaN NaN NaN NaN NaN
4 NaN NaN NaN NaN NaN
5 NaN NaN NaN NaN NaN
AVG1_ROL2_MOM AVG2_ROL2_MOM AVG3_ROL2_MOM AVG4_ROL2_MOM AVG5_ROL2_MOM \
0 NaN NaN NaN NaN NaN
1 -0.88 -0.78 -1.45 -2.37 -1.86
2 NaN NaN NaN NaN NaN
3 NaN NaN NaN NaN NaN
4 NaN NaN NaN NaN NaN
5 NaN NaN NaN NaN NaN
AVG1_ROL3_MOM AVG2_ROL3_MOM AVG3_ROL3_MOM AVG4_ROL3_MOM AVG5_ROL3_MOM
0 NaN NaN NaN NaN NaN
1 NaN NaN NaN NaN NaN
2 -5.29 -0.9 -3.9 1.25 3.91
3 NaN NaN NaN NaN NaN
4 NaN NaN NaN NaN NaN
5 NaN NaN NaN NaN NaN
out = pd.concat([df, out], axis=1)
print(out)
Date Type AVG1 AVG2 AVG3 AVG4 AVG5 AVG1_ROL1_MOM AVG2_ROL1_MOM \
0 2022-05 ROL1 0.33 0.45 0.12 0.96 1.33 -1.33 -0.26
1 2022-05 ROL2 1.43 0.11 0.75 1.99 3.01 NaN NaN
2 2022-05 ROL3 0.11 0.32 0.55 1.26 4.22 NaN NaN
3 2022-04 ROL1 1.66 0.71 0.87 5.88 1.11 NaN NaN
4 2022-04 ROL2 2.31 0.89 2.20 4.36 4.87 NaN NaN
5 2022-04 ROL3 5.40 1.22 4.45 0.01 0.31 NaN NaN
AVG3_ROL1_MOM AVG4_ROL1_MOM AVG5_ROL1_MOM AVG1_ROL2_MOM AVG2_ROL2_MOM \
0 -0.75 -4.92 0.22 NaN NaN
1 NaN NaN NaN -0.88 -0.78
2 NaN NaN NaN NaN NaN
3 NaN NaN NaN NaN NaN
4 NaN NaN NaN NaN NaN
5 NaN NaN NaN NaN NaN
AVG3_ROL2_MOM AVG4_ROL2_MOM AVG5_ROL2_MOM AVG1_ROL3_MOM AVG2_ROL3_MOM \
0 NaN NaN NaN NaN NaN
1 -1.45 -2.37 -1.86 NaN NaN
2 NaN NaN NaN -5.29 -0.9
3 NaN NaN NaN NaN NaN
4 NaN NaN NaN NaN NaN
5 NaN NaN NaN NaN NaN
AVG3_ROL3_MOM AVG4_ROL3_MOM AVG5_ROL3_MOM
0 NaN NaN NaN
1 NaN NaN NaN
2 -3.9 1.25 3.91
3 NaN NaN NaN
4 NaN NaN NaN
5 NaN NaN NaN

Filter dataframe using isna() to filter out rows that have null values in the following columns

I have a dataframe similar to this one:
id name val1_rain val2_tik val3_bon val4_tig ...
0 2349 Rivi 0.11 0.34 0.78 0.21
1 3397 Mani NaN NaN NaN NaN
2 0835 Pigi 0.34 NaN 0.32 NaN
3 5093 Tari 0.65 0.12 0.34 2.45
4 2340 Yoti NaN NaN NaN NaN
I want to drop any row that has all null values in the columns that come after the name column ([:, 2:]).
So the result output would look like this:
id name val1_rain val2_tik val3_bon val4_tig ...
0 2349 Rivi 0.11 0.34 0.78 0.21
2 0835 Pigi 0.34 NaN 0.32 NaN
3 5093 Tari 0.65 0.12 0.34 2.45
I have tried to do something like this:
df[~df.iloc[:,2:].isnull()]
but that raised an error:
ValueError: cannot reindex from a duplicate axis
First of all, I'm not sure why the error speaks about duplicate axis.
Then, I would like to find a way to keep only rows that have any value in any column after the 2nd column.
I haven't found any question similar to this.
You can filter rows where at least one non-missing value exists after the second column with DataFrame.notna and DataFrame.any:
df = df[df.iloc[:,2:].notna().any(axis=1)]
print (df)
id name val1_rain val2_tik val3_bon val4_tig
0 2349 Rivi 0.11 0.34 0.78 0.21
2 835 Pigi 0.34 NaN 0.32 NaN
3 5093 Tari 0.65 0.12 0.34 2.45
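An equivalent one-liner, sketched with DataFrame.dropna restricted to the value columns, would be:
# drop rows where every column after `name` is NaN
df = df.dropna(how='all', subset=df.columns[2:])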

Filtering dataframe groups with "or" condition

I am dealing with a dataframe such as this one:
id Xp_1 Xp_2 Xp_4 Xt_1 Xt_2 Xt_3 Mp_1 Mp_2 Mp_3 Mt_1 Mt_2 Mt_6
0 i24 NaN 0.27 NaN 0.45 0.20 0.25 0.27 NaN NaN NaN NaN NaN
1 i25 0.45 0.47 0.46 0.22 0.42 NaN 0.42 0.05 0.43 0.12 0.01 0.04
2 i11 NaN NaN 0.32 0.14 0.32 0.35 0.29 0.33 NaN NaN 0.02 0.44
3 i47 NaN 0.56 0.59 0.92 NaN 0.56 0.51 0.12 NaN 0.1 0.1 NaN
As you can see, I have something like two macro-groups (X and M), and for each macro-group two subsets (p and t). What I would like to implement is an "or" condition between the two macro-groups and an "and" condition between the subsets within each macro-group.
Basically, I'd like to keep those lines that have at least two values for each subset in at least one group.
For example:
i24 should be discarded: we only have one value for the Xp subset, and the M group doesn't have enough values either.
Entries like i11 should be kept: the condition is not satisfied for the X group, but it is satisfied for M. The same goes for i25, which satisfies the condition in both groups.
I tried this:
keep_r = (df.groupby(lambda col: col.split("_", maxsplit=1)[0], axis=1)
            .count()
            .ge(2)
            .all(axis=1))
df = df.loc[keep_r]
but it checks whether in all subsets (Xp, Xt, Mp, Mt) there are at least two values. Instead, I want to treat X and M independently.
Thank you!
We can group by two things: X/M and p/t, which are the column names' first and second characters. Then we can invoke your .count().ge(2).all(axis=1) logic, but only over the p/t level. Then we apply the "or" condition via any:
# to keep the `id` column aside
df = df.set_index("id")
# groups
c = df.columns
g = df.groupby([c.str[0], c.str[1]], axis=1)
# boolean mask
mask = (g.count()
         .ge(2)
         .all(axis=1, level=0)  # micros: and
         .any(axis=1))          # macros: or
# new df
ndf = df[mask]
to get
>>> ndf
Xp_1 Xp_2 Xp_4 Xt_1 Xt_2 Xt_3 Mp_1 Mp_2 Mp_3 Mt_1 Mt_2 Mt_6
id
i25 0.45 0.47 0.46 0.22 0.42 NaN 0.42 0.05 0.43 0.12 0.01 0.04
i11 NaN NaN 0.32 0.14 0.32 0.35 0.29 0.33 NaN NaN 0.02 0.44
i47 NaN 0.56 0.59 0.92 NaN 0.56 0.51 0.12 NaN 0.1 0.1 NaN
For illustration, before invoking all and any, we had:
>>> g.count().ge(2)
M X
p t p t
id
i24 False False False True
i25 True True True True
i11 True True False True
i47 True True True True
Then all over level 0, i.e. over p, t, reduced this one step with "and" logic:
>>> g.count().ge(2).all(axis=1, level=0)
M X
id
i24 False False
i25 True True
i11 True False
i47 True True
and finally any over the remaining M, X level reduced it to a boolean series with "or" logic, which says which rows to keep:
>>> g.count().ge(2).all(axis=1, level=0).any(axis=1)
id
i24 False
i25 True
i11 True
i47 True
dtype: bool
IIUC, try creating a MultiIndex from the column names with a str.extract pattern:
df = df.set_index('id')
df.columns = pd.MultiIndex.from_frame(df.columns.str.extract('(.)(.)_(.+)'))
0 X M
1 p t p t
2 1 2 4 1 2 3 1 2 3 1 2 6
id
i24 NaN 0.27 NaN 0.45 0.20 0.25 0.27 NaN NaN NaN NaN NaN
i25 0.45 0.47 0.46 0.22 0.42 NaN 0.42 0.05 0.43 0.12 0.01 0.04
i11 NaN NaN 0.32 0.14 0.32 0.35 0.29 0.33 NaN NaN 0.02 0.44
i47 NaN 0.56 0.59 0.92 NaN 0.56 0.51 0.12 NaN 0.10 0.10 NaN
Then groupby levels 0 and 1 to count, then apply separate logic to each level:
keep = (
df.groupby(axis=1, level=[0, 1]).count()
.ge(2).all(axis=1, level=0).any(axis=1)
)
id
i24 False
i25 True
i11 True
i47 True
dtype: bool
Then filter down and collapse MultiIndex:
df = df.loc[keep]
df.columns = df.columns.map(lambda c: f'{"".join(c[:-1])}_{c[-1]}')
df = df.reset_index()
id Xp_1 Xp_2 Xp_4 Xt_1 Xt_2 Xt_3 Mp_1 Mp_2 Mp_3 Mt_1 Mt_2 Mt_6
0 i25 0.45 0.47 0.46 0.22 0.42 NaN 0.42 0.05 0.43 0.12 0.01 0.04
1 i11 NaN NaN 0.32 0.14 0.32 0.35 0.29 0.33 NaN NaN 0.02 0.44
2 i47 NaN 0.56 0.59 0.92 NaN 0.56 0.51 0.12 NaN 0.10 0.10 NaN
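Note that groupby(axis=1) and all(axis=1, level=0), used in both answers above, are deprecated or removed in recent pandas releases. A rough equivalent for newer pandas, as a sketch built on transposing instead of column-wise groupby, might be:
dfi = df.set_index('id')
# count non-null values per (macro, subset) pair for every row:
# transpose so the columns become the index, then group by their first/second character
counts = (dfi.notna().T
             .groupby([dfi.columns.str[0], dfi.columns.str[1]])
             .sum()
             .T)  # back to one row per id, with a (macro, subset) column MultiIndex
# "and" across subsets within each macro-group, then "or" across macro-groups
keep = counts.ge(2).T.groupby(level=0).all().T.any(axis=1)
ndf = dfi[keep]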

Pandas: How to concatenate or merge groups using the groupby function and populate a single table or dataframe?

df:
name description curve tenor rates
IND 3M ZAR_3M 0.25 6.808000088
IND 2Y ZAR_3M 2 6.483012199
IND 3Y ZAR_3M 3 6.565002918
IND 4Y ZAR_3M 4 6.694129944
IND 5Y ZAR_3M 5 6.83951807
IND 3M CAD_OIS 0.25 1.738620043
BHU 6M CAD_OIS 0.5 1.718042016
IND 9M CAD_OIS 0.75 1.697247028
IND 1Y CAD_OIS 1 1.67719996
IND 18M CAD_OIS 1.5 1.631257057
IND 2Y CAD_3M 2 1.906309009
IND 3y CAD_3M 3 1.855569959
IND 4Y CAD_3M 4 1.830132961
BHU 5Y CAD_3M 5 1.817605019
BHU 6y CAD_3M 6 1.814880013
IND 7Y CAD_3M 7 1.821526051
BHU TND CZK_Curve 0.01 0.02
BHU 1WK CZK_Curve 0.03 0.0203
BHU 1M CZK_Curve 0.09 0.021
BHU 2M CZK_Curve 0.18 0.0212
BHU 3M CZK_Curve 0.26 0.0214
BHU 6M CZK_Curve 0.51 0.0212
BHU 9M CZK_Curve 0.76 0.02045
BHU 12M CZK_Curve 1.01 0.01985
BHU 2Y CZK_Curve 2.01 0.020033333
BHU 3Y CZK_Curve 3.02 0.018816667
BHU 4Y CZK_Curve 4.02 0.017666667
BHU 5Y CZK_Curve 5.02 0.016616667
BHU 6Y CZK_Curve 6.02 0.015766667
BHU 7Y CZK_Curve 7.02 0.015216667
BHU 8Y CZK_Curve 8.02 0.014616667
BHU 9Y CZK_Curve 9.02 0.014358333
Above is my dataframe (df) with 5 variables. I would like to populate the table based on 'curve' and rename the rates columns with the curve names. Following is my expected output. I tried using the groupby function to generate groups and concatenate them side by side based on 'tenor', but my code seems incomplete. Please suggest how to produce the output below.
df_tenor = df_tenor[['Tenor']].drop_duplicates()
df_tenor = df_tenor.sort_values(by=['tenor'])
gb = df.groupby('curve')
df.rename(columns={'rates': str([df.curve.unique() for g in gb])}, inplace=True)
df_final= pd.concat([g[1].merge(df_tenor, how='outer', on='Tenor') for g in gb], axis=1)
df_final.to_csv('testconcat.csv', index = False)
Use pandas.pivot_table():
pd.pivot_table(df, index='tenor', values='rates', columns='curve')
Output
curve CAD_3M CAD_OIS CZK_Curve ZAR_3M
tenor
0.01 NaN NaN 0.020000 NaN
0.03 NaN NaN 0.020300 NaN
0.09 NaN NaN 0.021000 NaN
0.18 NaN NaN 0.021200 NaN
0.25 NaN 1.738620 NaN 6.808000
0.26 NaN NaN 0.021400 NaN
0.50 NaN 1.718042 NaN NaN
0.51 NaN NaN 0.021200 NaN
0.75 NaN 1.697247 NaN NaN
0.76 NaN NaN 0.020450 NaN
1.00 NaN 1.677200 NaN NaN
1.01 NaN NaN 0.019850 NaN
1.50 NaN 1.631257 NaN NaN
2.00 1.906309 NaN NaN 6.483012
2.01 NaN NaN 0.020033 NaN
3.00 1.855570 NaN NaN 6.565003
3.02 NaN NaN 0.018817 NaN
4.00 1.830133 NaN NaN 6.694130
4.02 NaN NaN 0.017667 NaN
5.00 1.817605 NaN NaN 6.839518
5.02 NaN NaN 0.016617 NaN
6.00 1.814880 NaN NaN NaN
6.02 NaN NaN 0.015767 NaN
7.00 1.821526 NaN NaN NaN
7.02 NaN NaN 0.015217 NaN
8.02 NaN NaN 0.014617 NaN
9.02 NaN NaN 0.014358 NaN
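If the result should also end up in a CSV, as in the original attempt, a small follow-up sketch (same df as above) could be:
pivoted = pd.pivot_table(df, index='tenor', values='rates', columns='curve')
# keep `tenor` as a regular column in the file, mirroring the to_csv call in the question
pivoted.reset_index().to_csv('testconcat.csv', index=False)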

Is there a way to plot corresponding points of two data frames?

I have two dataframes with the same columns and date indices:
df1:
Date T.TO AS.TO NTR.TO ... R.TO
2016-03-03 0.1 0.02 0.04 0.02
2016-03-04 0.09 0.01 0.02 0.02
2016-03-05 0.1 0.02 0.04 0.02
...
2019-03-03 0.09 0.01 0.02 0.02
df2:
Date T.TO AS.TO NTR.TO ... R.TO
2016-03-03 0.01 0.32 0.04 0.02
2016-03-04 0.81 0.21 0.02 0.02
2016-03-05 0.01 0.12 0.04 0.02
...
2019-03-03 0.89 0.11 0.12 0.72
I want to plot all the matching points of the two dataframes on a chart; for example, the first point would correspond to 2016-03-03, T.TO (0.1, 0.01), another to 2016-03-03, AS.TO (0.02, 0.32), and so on, giving me a large number of points. I will then use these to find a line of best fit.
I know how to find the best-fit line, but I am having difficulty plotting these points directly. I tried using nested for loops and dictionaries, but I was wondering if there is a more straightforward approach.
To plot these points, you can stack both frames into long Series and scatter one against the other:
import matplotlib.pyplot as plt

plt.scatter(df1.set_index('Date').stack(), df2.set_index('Date').stack())
Output: a scatter plot of the paired values.
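Since the question also mentions a line of best fit, a short sketch combining the stacked Series with numpy.polyfit (aligning the two frames first so only points present in both are used) might look like this:
import matplotlib.pyplot as plt
import numpy as np

x = df1.set_index('Date').stack()
y = df2.set_index('Date').stack()
x, y = x.align(y, join='inner')          # keep only (date, ticker) pairs present in both

plt.scatter(x, y)
slope, intercept = np.polyfit(x, y, 1)   # degree-1 fit: the line of best fit
plt.plot(x, slope * x + intercept, color='red')
plt.show()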
If you want to drop all the data that is not common between the two dataframes, then this should work.
In [71]: df = pd.read_clipboard()
In [72]: df
Out[72]:
Date T.TO AS.TO NTR.TO ... R.TO
0 2016-03-03 0.10 0.02 0.04 0.02 NaN
1 2016-03-04 0.09 0.01 0.02 0.02 NaN
2 2016-03-05 0.10 0.02 0.04 0.02 NaN
3 ... NaN NaN NaN NaN NaN
4 2019-03-03 0.09 0.01 0.02 0.02 NaN
In [73]: df2 = pd.read_clipboard()
In [74]: df2
Out[74]:
Date T.TO AS.TO NTR.TO ... R.TO
0 2016-03-03 0.01 0.32 0.04 0.02 NaN
1 2016-03-04 0.81 0.21 0.02 0.02 NaN
2 2016-03-05 0.01 0.12 0.04 0.02 NaN
3 ... NaN NaN NaN NaN NaN
4 2019-03-03 0.89 0.11 0.12 0.72 NaN
Then df3 will only have the values that match between the two datasets:
In [75]: df3 = df[df==df2]
In [76]: df3
Out[76]:
Date T.TO AS.TO NTR.TO ... R.TO
0 2016-03-03 NaN NaN 0.04 0.02 NaN
1 2016-03-04 NaN NaN 0.02 0.02 NaN
2 2016-03-05 NaN NaN 0.04 0.02 NaN
3 ... NaN NaN NaN NaN NaN
4 2019-03-03 NaN NaN NaN NaN NaN
From there plotting is a simple matter.
