manipulating more than 2 dataframes

manipulating more than 2 dataframes - python

I have 6 different dataframes and I would like to append one after the other .
The only way I find to do so is append 2 each time, although I believe there must be a more efficient way to do this .
I am also looking forward after that to change the index and header names, that I also know how to do one by one, but I also believe there must also be an efficient way to do so.
The last problem I am facing is how to set an index with with the column that is NaN , how shall I refer to it in order to set_index?  
 
df1
 NaN     1      2      3
1   A   17.03   13.41  19.61
7   B   3.42    1.51    5.44
8   C   5.65    2.81    1.89
df2
NaN     1      2      3
1  J   1.60   2.65   1.44
5  H   26.78  27.04  21.06
df3
NaN    1      2      3
1   L   1.20   1.41   2.04
2   M   1.23   1.72   2.47
4   R  66.13  51.49  16.62
5   F     --  46.89  22.35
df4
 NaN    1      2      3
1   A   17.03   13.41  19.61
7   B   3.42    1.51    5.44
8   C   5.65    2.81    1.89
df5
NaN    1      2      3
1  J   1.60   2.65   1.44
5  H   26.78  27.04  21.06
df6
NaN    1      2      3
1   L   1.20   1.41   2.04
2   M   1.23   1.72   2.47
4   R  66.13  51.49  16.62
5   F     --  46.89  22.35

You can use concat, for select NaN column is possible use df.columns[0] with set_index and list comprehension:
dfs = [df1,df2, df3, ...]
df = pd.concat([df.set_index(df.columns[0], append=True) for df in dfs])
print (df)
1 2 3
NaN
1 A 17.03 13.41 19.61
7 B 3.42 1.51 5.44
8 C 5.65 2.81 1.89
1 J 1.6 2.65 1.44
5 H 26.78 27.04 21.06
1 L 1.20 1.41 2.04
2 M 1.23 1.72 2.47
4 R 66.13 51.49 16.62
5 F -- 46.89 22.35
EDIT:
It seems NaN values can be strings:
print (df3.columns)
Index(['NaN', '1', '2', '3'], dtype='object')
dfs = [df1,df2, df3]
df = pd.concat([df.set_index('NaN', append=True) for df in dfs])
print (df)
1 2 3
NaN
1 A 17.03 13.41 19.61
7 B 3.42 1.51 5.44
8 C 5.65 2.81 1.89
1 J 1.6 2.65 1.44
5 H 26.78 27.04 21.06
1 L 1.20 1.41 2.04
2 M 1.23 1.72 2.47
4 R 66.13 51.49 16.62
5 F -- 46.89 22.35
Or if there are np.nan for me works also:
#converting to `NaN` if necessary
#df1.columns = df1.columns.astype(float)
#df2.columns = df2.columns.astype(float)
#df3.columns = df3.columns.astype(float)
dfs = [df1,df2, df3]
df = pd.concat([df.set_index(np.nan, append=True) for df in dfs])
print (df)
1.0 2.0 3.0
nan
1 A 17.03 13.41 19.61
7 B 3.42 1.51 5.44
8 C 5.65 2.81 1.89
1 J 1.6 2.65 1.44
5 H 26.78 27.04 21.06
1 L 1.20 1.41 2.04
2 M 1.23 1.72 2.47
4 R 66.13 51.49 16.62
5 F -- 46.89 22.35

Related

Categorize the highest value output among different columns

In my dataframe I want to choose the highest value among columns A,B,C. Then I will let to categorize the highest value in my output dataframe. I would also like to include a special condition, where if all the values are negative, then the output will return as N.A.
input df:
A B C
Date
2020-01-05 3.57 5.29 6.23
2020-01-04 4.98 9.64 7.58
2020-01-03 3.79 5.25 6.26
2020-01-02 3.95 5.65 6.61
2020-01-01 -3.10 -7.20 -8.16
output df:
A B C HIGHEST_CAT
Date
2020-01-05 3.57 5.29 6.23 C
2020-01-04 4.98 9.64 7.58 B
2020-01-03 3.79 5.25 6.26 C
2020-01-02 3.95 5.65 6.61 C
2020-01-01 -3.10 -7.20 -8.16 N.A.
How could I achieve this output?

Use DataFrame.idxmax with condition for test all values bellow 0 by DataFrame.lt and DataFrame.all in numpy.where:
df['HIGHEST_CAT'] = np.where(df.lt(0).all(axis=1), np.nan, df.idxmax(axis=1))
Or in Series.mask with default np.nan, so not necessary specify:
df['HIGHEST_CAT'] = df.idxmax(axis=1).mask(df.lt(0).all(axis=1))
Or:
df.loc[df.gt(0).all(axis=1), 'HIGHEST_CAT'] = df.idxmax(axis=1)
print (df)
A B C HIGHEST_CAT
Date
2020-01-05 3.57 5.29 6.23 C
2020-01-04 4.98 9.64 7.58 B
2020-01-03 3.79 5.25 6.26 C
2020-01-02 3.95 5.65 6.61 C
2020-01-01 -3.10 -7.20 -8.16 NaN

Use df.where:
In [375]: df['HIGHEST_CAT'] = df.idxmax(axis=1).where(df.gt(0).all(axis=1))
In [376]: df
Out[376]:
A B C HIGHEST_CAT
Date
2020-01-05 3.57 5.29 6.23 C
2020-01-04 4.98 9.64 7.58 B
2020-01-03 3.79 5.25 6.26 C
2020-01-02 3.95 5.65 6.61 C
2020-01-01 -3.10 -7.20 -8.16 NaN

Merge two column header and give a new name in MultiIndex Dataframe python/Add column above column names

I have the initial dataframe :
r_id1 r_score1 rid2 r_score2
Rank
ID1 ID2
1 A-1 id-1 1.23 id-34 6.78
2 A-1 id-9 2.34 id-45 3.45
3 A-2 id-8 3.56 id-32 4.56
4 A-3 id-6 4.35 id-10 3.98
5 A-4 id-4 7.89 id-67 2.98
I want my data frame to be (Result_df):
Score_R1 Score_R2
r_id1 r_score1 rid2 r_score2
ID1 ID2
1 A-1 id-1 1.23 id-34 6.78
2 A-1 id-9 2.34 id-45 3.45
3 A-2 id-8 3.56 id-32 4.56
4 A-3 id-6 4.35 id-10 3.98
5 A-4 id-4 7.89 id-67 2.98
My dataframe is multiindex and with MultiIndex columns.
I tried this piece of code
final_df.columns = [' '.join(col).strip() for col in final_df.columns.values]
which gives me this output
ID1 ID2 r_id1 r_score1 rid2 r_score2
1 A-1 id-1 1.23 id-34 6.78
2 A-1 id-9 2.34 id-45 3.45
3 A-2 id-8 3.56 id-32 4.56
4 A-3 id-6 4.35 id-10 3.98
5 A-4 id-4 7.89 id-67 2.98
After :
cols = final_df.columns.map(''.join)
lvl = 'Score_R' + cols.str.extract('(\d+)', expand=False)
final_df.columns = [lvl, cols]
final_df.to_csv("f.csv")
Output is:
Score_R1 Score_R1 Score_R2 Score_R2
r_id1 r_score1 rid2 r_score2
ID1 ID2
1 A-1 id-1 1.23 id-34 6.78
2 A-1 id-9 2.34 id-45 3.45
3 A-2 id-8 3.56 id-32 4.56
4 A-3 id-6 4.35 id-10 3.98
5 A-4 id-4 7.89 id-67 2.98
I need to combine the column headers with same name
Score_R1 Score_R2
r_id1 r_score1 rid2 r_score2
ID1 ID2
1 A-1 id-1 1.23 id-34 6.78
2 A-1 id-9 2.34 id-45 3.45

You can use str.extract for get numbers from columns names, add prefix and last assign back with original columns for MultiIndex in columns:
print (df.columns.tolist())
[('r_id1', ''), ('r_score1', ''), ('rid2', ''), ('r_score2', '')]
cols = df.columns.map(''.join)
print (cols.tolist())
['r_id1', 'r_score1', 'rid2', 'r_score2']
lvl = 'Score_R' + cols.str.extract('(\d+)', expand=False)
print (lvl)
Index(['Score_R1', 'Score_R1', 'Score_R2', 'Score_R2'], dtype='object')
df.columns = [lvl, cols]
print (df)
Score_R1 Score_R2
r_id1 r_score1 rid2 r_score2
ID1 ID2
1 A-1 id-1 1.23 id-34 6.78
2 A-1 id-9 2.34 id-45 3.45
3 A-2 id-8 3.56 id-32 4.56
4 A-3 id-6 4.35 id-10 3.98
5 A-4 id-4 7.89 id-67 2.98
df.columns = df.columns.map('_'.join)
print (df)
Score_R1_r_id1 Score_R1_r_score1 Score_R2_rid2 Score_R2_r_score2
ID1 ID2
1 A-1 id-1 1.23 id-34 6.78
2 A-1 id-9 2.34 id-45 3.45
3 A-2 id-8 3.56 id-32 4.56
4 A-3 id-6 4.35 id-10 3.98
5 A-4 id-4 7.89 id-67 2.98
EDIT: You can replace missing values of first level to empty strings:
cols = df.columns.droplevel(-1)
lvl = 'Score_R' + cols.str.extract('(\d+)', expand=False)
print (lvl)
Index(['Score_R1', 'Score_R1', 'Score_R2', 'Score_R2'], dtype='object')
lvl = lvl.where(~lvl.duplicated(), '')
print (lvl)
Index(['Score_R1', '', 'Score_R2', ''], dtype='object')
df.columns = [lvl, cols]
print (df)
Score_R1 Score_R2
r_id1 r_score1 rid2 r_score2
ID1 ID2
1 A-1 id-1 1.23 id-34 6.78
2 A-1 id-9 2.34 id-45 3.45
3 A-2 id-8 3.56 id-32 4.56
4 A-3 id-6 4.35 id-10 3.98
5 A-4 id-4 7.89 id-67 2.98
print (df.columns)
MultiIndex([('Score_R1', 'r_id1'),
( '', 'r_score1'),
('Score_R2', 'rid2'),
( '', 'r_score2')],
)

Insert into a pandas dataframe slice of another slice from the same dataframe

I have a pandas.DataFrame like this one:
df = pd.DataFrame({'val_1': [np.nan, np.nan, np.nan, 2.34, 2.21, 2.45],
'val_2': [3.1, 3.02, 3.67, np.nan , np.nan, np.nan],
'group': [1, 1, 1, 2, 2, 2]})
df
val_1 val_2 group
0 NaN 3.10 1
1 NaN 3.02 1
2 NaN 3.67 1
3 2.34 NaN 2
4 2.21 NaN 2
5 2.45 NaN 2
I want to fill the NaN values from column val_1 that belong to group 1 with the values from the column val_1 from the group 2. I tried using:
df.loc[df['group']==1, 'val_1'] = df.loc[df['group']==2, 'val_1']
I'm expecting as a result the following:
val_1 val_2 group
0 2.34 3.10 1
1 2.21 3.02 1
2 2.45 3.67 1
3 2.34 NaN 2
4 2.21 NaN 2
5 2.45 NaN 2
But instead I'm getting this:
val_1 val_2 group
0 NaN 3.10 1
1 NaN 3.02 1
2 NaN 3.67 1
3 2.34 NaN 2
4 2.21 NaN 2
5 2.45 NaN 2
How can I perform that action properly?
The solution needs to be extensible to a larger dataframe.
Thank you in advance!

Fix your code adding the .values
df.loc[df['group']==1, 'val_1'] = df.loc[df['group']==2, 'val_1'].values
df
Out[300]:
val_1 val_2 group
0 2.34 3.10 1
1 2.21 3.02 1
2 2.45 3.67 1
3 2.34 NaN 2
4 2.21 NaN 2
5 2.45 NaN 2

Python/Pandas - Sum dataframe items if indexes have the same month

I have this two DataFrames:
Seasonal_Component:
# DataFrame that has the seasonal component of a time series
Date
2014-12 -1.08
2015-01 -0.28
2015-02 0.15
2015-03 0.46
2015-04 0.48
2015-05 0.37
2015-06 0.20
2015-07 0.15
2015-08 0.12
2015-09 -0.02
2015-10 -0.17
2015-11 -0.39
Prediction_df:
# DataFrame with the prediction of the trend of that same time series
Prediction MAPE Score
2015-11-01 7.93 1.83 1
2015-12-01 7.93 1.67 1
2016-01-01 7.92 1.71 1
2016-02-01 7.95 1.84 1
2016-03-01 7.94 1.53 1
2016-04-01 7.87 1.45 1
2016-05-01 7.91 1.53 1
2016-06-01 7.87 1.40 1
2016-07-01 7.84 1.40 1
2016-08-01 7.89 1.77 1
2016-09-01 7.87 1.99 1
What I need to do:
Check which Prediction_df index have the same months as the Seasonal_Component index and sum the correspondent seasonal component with the prediction, so the Prediction_df looks like this:
Prediction MAPE Score
2015-11-01 7,54 1.83 1
2015-12-01 6.85 1.67 1
2016-01-01 7.64 1.71 1
2016-02-01 8.10 1.84 1
2016-03-01 8.40 1.53 1
2016-04-01 8.35 1.45 1
2016-05-01 8.28 1.53 1
2016-06-01 8.07 1.40 1
2016-07-01 7.99 1.40 1
2016-08-01 8.01 1.77 1
2016-09-01 7.85 1.99 1
Anyone available to enlight my journey?
I'm already on the "almost mad" stage trying to solve this.
EDIT
Important note to make it clearer: I need to disconsider the year and consider only the month to make the sum. Something like "everytime that an April appears (doesn't matter if it is 2006 or 2025) I need to sum with the April value of the Seasonal_Component frame.

Consider a data frame merge on the date fields (month values), then a simple addition of the two fields. The date fields may require conversion from string values:
import datetime as dt
...
# IF DATES ARE REGULAR COLUMNS
seasonal_component['Date'] = pd.to_datetime(seasonal_component['Date'])
seasonal_component['Month'] = seasonal_component['Date'].dt.month
predict_df['Date'] = pd.to_datetime(predict_df['Date'])
predict_df['Month'] = predict_df['Date'].dt.month
# IF DATES ARE INDICES
seasonal_component.index = pd.to_datetime(seasonal_component.index)
seasonal_component['Month'] = seasonal_component.index.month
predict_df.index = pd.to_datetime(predict_df.index)
predict_df['Month'] = predict_df.index.month
However, think about how you need to join the two data sets (akin to SQL's join clauses):
inner (default) - keeps only records matching both
left - keeps records of predict_df and only those matching seasonal_component where predict_df is first argument
right - keeps records of seasonal_component and only those matching predict_df where predict_df is first argument
outer - keeps all records, those that match and those that don't match
Below assumes an outer join where data on both sides remain with NaNs to fill for missing values.
# MERGING DATA FRAMES
merge_df = pd.merge(predict_df, seasonal_component[['Month', 'SeasonalComponent']],
on=['Month'], how='outer')
# ADDING COLUMNS
merge_df['Prediction'] = merge_df['Prediction'] + merge_df['SeasonalComponent']
Outcome (using posted data)
Date Prediction MAPE Score Month SeasonalComponent
0 2015-11-01 7.54 1.83 1 11 -0.39
1 2015-12-01 6.85 1.67 1 12 -1.08
2 2016-01-01 7.64 1.71 1 1 -0.28
3 2016-02-01 8.10 1.84 1 2 0.15
4 2016-03-01 8.40 1.53 1 3 0.46
5 2016-04-01 8.35 1.45 1 4 0.48
6 2016-05-01 8.28 1.53 1 5 0.37
7 2016-06-01 8.07 1.40 1 6 0.20
8 2016-07-01 7.99 1.40 1 7 0.15
9 2016-08-01 8.01 1.77 1 8 0.12
10 2016-09-01 7.85 1.99 1 9 -0.02
11 NaT NaN NaN NaN 10 -0.17

Firstly separate the month from both dataframes and then merge on basis of month. Further add the required columns and create new column with desired output. Here is the code below:
import pandas as pd
import numpy as np
from pandas import DataFrame,Series
from numpy.random import randn
Seasonal_Component = DataFrame({
'Date': ['2014-12','2015-01','2015-02','2015-03','2015-04','2015-05','2015-06','2015-07','2015-08','2015-09','2015-10','2015-11'],
'Value': [-1.08,-0.28,0.15,0.46,0.48,0.37,0.20,0.15,0.12,-0.02,-0.17,-0.39]
})
Prediction_df = DataFrame({
'Date': ['2015-11-01','2015-12-01','2016-01-01','2016-02-01','2016-03-01','2016-04-01','2016-05-01','2016-06-01','2016-07-01','2016-08-01','2016-09-01'],
'Prediction': [7.93,7.93,7.92,7.95,7.94,7.87,7.91,7.87,7.84,7.89,7.87],
'MAPE':[1.83,1.67,1.71,1.84,1.53,1.45,1.53,1.40,1.40,1.77,1.99],
'Score':[1,1,1,1,1,1,1,1,1,1,1]
})
def mon_extract(date):
return date.split('-')[1]
Seasonal_Component['Month']=Seasonal_Component['Date'].apply(mon_extract)
def mon_extract(date):
return date.split('-')[1].split('-')[0]
Prediction_df['Month']=Prediction_df['Date'].apply(mon_extract)
FinalDF=pd.merge(Seasonal_Component,Prediction_df,on='Month',how='right')
FinalDF
FinalDF['PredictionF']=FinalDF['Value']+FinalDF['Prediction']
FinalDF.loc[:,['Date_y','PredictionF','MAPE','Score']]

Re-shaping Dataframe so that Column Headers are made into Rows

I am trying to reshape the dataframe below.
Tenor 2013M06D12 2013M06D13 2013M06D14 \
1 1 1.24 1.26 1.23
4 2 2.01 0.43 0.45
5 3 1.21 2.24 1.03
8 4 0.39 2.32 1.23
So, that it looks as follows. I was looking at using pivot_table, but this is sort of the opposite of what that would do as I need to convert Column Headers to rows and not the other way around. Hence, I am not sure how to proceed in order to obtain this dataframe.
Date Tenor Rate
1 2013-06-12 1 1.24
2 2013-06-13 1 1.26
4 2013-06-14 1 1.23
The code just involves reading from a CSV:
result = pd.DataFrame.read_csv("BankofEngland.csv")

I think you can do with with a melt, a sort, a date parse, and some column shuffling:
dfm = pd.melt(df, id_vars="Tenor", var_name="Date", value_name="Rate")
dfm = dfm.sort("Tenor").reset_index(drop=True)
dfm["Date"] = pd.to_datetime(dfm["Date"], format="%YM%mD%d")
dfm = dfm[["Date", "Tenor", "Rate"]]
produces
In [104]: dfm
Out[104]:
Date Tenor Rate
0 2013-06-12 1 1.24
1 2013-06-13 1 1.26
2 2013-06-14 1 1.23
3 2013-06-12 2 2.01
4 2013-06-13 2 0.43
5 2013-06-14 2 0.45
6 2013-06-12 3 1.21
7 2013-06-13 3 2.24
8 2013-06-14 3 1.03
9 2013-06-12 4 0.39
10 2013-06-13 4 2.32
11 2013-06-14 4 1.23

import pandas as pd
import numpy as np
# try to read your sample data, replace with your read_csv func
df = pd.read_clipboard()
Out[139]:
Tenor 2013M06D12 2013M06D13 2013M06D14
1 1 1.24 1.26 1.23
4 2 2.01 0.43 0.45
5 3 1.21 2.24 1.03
8 4 0.39 2.32 1.23
# reshaping
df.set_index('Tenor', inplace=True)
df = df.stack().reset_index()
df.columns=['Tenor', 'Date', 'Rate']
# suggested by DSM, use the date parser
df.Date = pd.to_datetime(df.Date, format='%YM%mD%d')
Out[147]:
Tenor Date Rate
0 1 2013-06-12 1.24
1 1 2013-06-13 1.26
2 1 2013-06-14 1.23
3 2 2013-06-12 2.01
4 2 2013-06-13 0.43
.. ... ... ...
7 3 2013-06-13 2.24
8 3 2013-06-14 1.03
9 4 2013-06-12 0.39
10 4 2013-06-13 2.32
11 4 2013-06-14 1.23
[12 rows x 3 columns]

Develop Reference

Python is a programming language that lets you work quickly and integrate systems more effectively.

manipulating more than 2 dataframes - python

Related

Categorize the highest value output among different columns

Merge two column header and give a new name in MultiIndex Dataframe python/Add column above column names

Insert into a pandas dataframe slice of another slice from the same dataframe

Python/Pandas - Sum dataframe items if indexes have the same month

Re-shaping Dataframe so that Column Headers are made into Rows

Categories

Resources