Nuances of merging multiple pandas dataframes (3+) on a key column - python

First question here, and a long one. There are a couple of things I am struggling with regarding merging and formatting my dataframes. I have some half-working solutions, but I am unsure whether they are the best possible approach for what I want.
Here are the standard formats of the dataframes I am merging with pandas.
df1 =
RT %Area RRT
0 4.83 5.257 0.509
1 6.76 0.424 0.712
2 7.27 0.495 0.766
3 7.70 0.257 0.811
4 7.79 0.122 0.821
5 9.49 92.763 1.000
6 11.40 0.681 1.201
df2=
RT %Area RRT
0 4.83 0.731 0.508
1 6.74 1.243 0.709
2 7.28 0.109 0.766
3 7.71 0.287 0.812
4 7.79 0.177 0.820
5 9.50 95.824 1.000
6 11.31 0.348 1.191
7 11.40 1.166 1.200
8 12.09 0.113 1.273
df3 = ...
Currently I am using a reduce operation on pd.merge_ordered(), as below, to merge my dataframes (3+). This roughly yields what I want and came from a previous question (pandas three-way joining multiple dataframes on columns). I am merging on RRT, and I want rows with the same RRT value to be placed on the same row; if an RRT value is unique to one dataset, I want NaN for the missing data from the other datasets.
# imports needed for the snippets below
import os
from functools import reduce
import pandas as pd

# The for loop I use to generate the list of formatted dataframes prior to merging
dfs = []
for entry in os.scandir(directory):
    if entry.path.endswith(".csv") and entry.is_file():
        entry = pd.read_csv(entry.path, header=None)
        # Block of formatting code removed
        dfs.append(entry.round(2))

dfs = [df1ar, df2ar, df3ar]
df_final = reduce(lambda left, right: pd.merge_ordered(left, right, on='RRT'), dfs)
cols = ['RRT', 'RT_x', '%Area_x', 'RT_y', '%Area_y', 'RT', '%Area']
df_final = df_final[cols]
print(df_final)
RRT RT_x %Area_x RT_y %Area_y RT %Area
0 0.508 NaN NaN 4.83 0.731 NaN NaN
1 0.509 4.83 5.257 NaN NaN 4.83 5.257
2 0.709 NaN NaN 6.74 1.243 NaN NaN
3 0.712 6.76 0.424 NaN NaN 6.76 0.424
4 0.766 7.27 0.495 7.28 0.109 7.27 0.495
5 0.811 7.70 0.257 NaN NaN 7.70 0.257
6 0.812 NaN NaN 7.71 0.287 NaN NaN
7 0.820 NaN NaN 7.79 0.177 NaN NaN
8 0.821 7.79 0.122 NaN NaN 7.79 0.122
9 1.000 9.49 92.763 9.50 95.824 9.49 92.763
10 1.191 NaN NaN 11.31 0.348 NaN NaN
11 1.200 NaN NaN 11.40 1.166 NaN NaN
12 1.201 11.40 0.681 NaN NaN 11.40 0.681
13 1.273 NaN NaN 12.09 0.113 NaN NaN
This works, but:
Can I insert a MultiIndex based on the filename of the dataframe that the data came from and place it above the corresponding columns? Like the suffix option, but related back to the filename, and for more than two sets of data. Is this better done prior to merging, and if so, how do I do it? (I've included the for loop I use to create a list of tables prior to merging.)
Is this reduce-based merge_ordered approach the simplest way of doing this?
Can I do a similar merge with pd.merge_asof() and use the tolerance value to fine-tune the merging based on the similarities between the RRT values? That is, can it be done without cutting off data from the longer dataframes?
I've tried the above and searched for answers, but I'm struggling to find the most efficient way to do everything I want.
concat = pd.concat(dfs, axis=1, keys=['A','B','C'])
concat_final = concat.round(3)
print(concat_final)
A B C
RT %Area RRT RT %Area RRT RT %Area RRT
0 4.83 5.257 0.509 4.83 0.731 0.508 4.83 5.257 0.509
1 6.76 0.424 0.712 6.74 1.243 0.709 6.76 0.424 0.712
2 7.27 0.495 0.766 7.28 0.109 0.766 7.27 0.495 0.766
3 7.70 0.257 0.811 7.71 0.287 0.812 7.70 0.257 0.811
4 7.79 0.122 0.821 7.79 0.177 0.820 7.79 0.122 0.821
5 9.49 92.763 1.000 9.50 95.824 1.000 9.49 92.763 1.000
6 11.40 0.681 1.201 11.31 0.348 1.191 11.40 0.681 1.201
7 NaN NaN NaN 11.40 1.166 1.200 NaN NaN NaN
8 NaN NaN NaN 12.09 0.113 1.273 NaN NaN NaN
I have also tried this, and I get the MultiIndex denoting which file the data came from (A, B, C are just placeholders). However, it has obviously not merged based on the RRT value like I want.
Can I apply an operation to change this into a similar format to the pd.merge_ordered() format above? Would groupby() work?
Thanks!
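A minimal sketch of one way to combine the filename-keyed MultiIndex with the RRT-based alignment: set RRT as the index in each formatted frame and let pd.concat(axis=1) align on it, using the filenames as keys. The filename handling here is an assumption about how the loop above could collect the names; it is not tested against the actual files.

# assumes the same loop as above, but also collecting filenames for the keys
dfs, names = [], []
for entry in os.scandir(directory):
    if entry.path.endswith(".csv") and entry.is_file():
        names.append(os.path.splitext(os.path.basename(entry.path))[0])
        frame = pd.read_csv(entry.path, header=None)
        # Block of formatting code removed
        dfs.append(frame.round(2))

# outer-join on RRT: rows missing from a file become NaN, filenames label the columns
aligned = pd.concat([d.set_index('RRT') for d in dfs], axis=1, keys=names)
aligned = aligned.sort_index().reset_index()

Note that pd.merge_asof() with a tolerance performs a left join, so rows of the non-left frames without a close-enough RRT match are dropped rather than kept as NaN; the concat-on-index approach above keeps every RRT value.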

Related

How to create dataframe by randomly selecting from another dataframe?

DP 1 DP 2 DP 3 DP 4 DP 5 DP 6 DP 7 DP 8 DP 9 DP 10
(0.519) (1.117) (1.152) 0.772 1.490 (0.850) (1.189) (0.759)
0.030 0.047 0.632 (0.608) (0.322) 0.939 0.346 0.651
1.290 (0.179) 0.006 0.850 (1.141) 0.758 0.682
1.500 (1.228) 1.840 (1.594) (0.282) (0.907)
(1.540) 0.689 (0.683) 0.005 0.543
(0.197) (0.664) (0.636) 0.878
(0.942) 0.764 (0.137)
0.693 1.647
0.197
I have the above dataframe.
I need the below dataframe, built by picking random values from the above dataframe:
DP 1 DP 2 DP 3 DP 4 DP 5 DP 6 DP 7 DP 8 DP 9 DP 10
(0.664) 1.290 0.682 0.030 (0.683) (0.636) (0.683) 1.840 (1.540)
1.490 (0.907) (0.850) (0.197) (1.228) 0.682 1.290 0.939
0.047 0.682 0.346 0.689 (0.137) 1.490 0.197
0.047 0.878 0.651 0.047 0.047 (0.197)
(1.141) 0.758 0.878 1.490 0.651
1.647 1.490 0.772 1.490
(0.519) 0.693 0.346
(0.137) 0.850
0.197
I've tried this code:
df2= df1.sample(len(df1))
print(df2)
But the output is:
DP1 DP2 DP3 DP4 DP5 DP6 DP7 DP8 DP9
OP8 0.735590 1.762630 NaN NaN NaN NaN NaN NaN NaN
OP7 -0.999665 0.817949 -0.147698 NaN NaN NaN NaN NaN NaN
OP2 0.031430 0.049994 0.682040 -0.667445 -0.360034 1.089516 0.426642 0.916619 NaN
OP3 1.368955 -0.191781 0.006623 0.932736 -1.277548 0.880056 0.841018 NaN NaN
OP1 -0.551065 -1.195305 -1.243199 0.847178 1.668630 -0.986300 -1.465904 -1.069986 NaN
OP4 1.592201 -1.314628 1.985683 -1.749389 -0.315828 -1.052629 NaN NaN NaN
OP6 -0.208647 -0.710424 -0.686654 0.963221 NaN NaN NaN NaN NaN
OP10 NaN NaN NaN NaN NaN NaN NaN NaN NaN
OP9 0.209244 NaN NaN NaN NaN NaN NaN NaN NaN
OP5 -1.635306 0.737937 -0.736907 0.005545 0.607974 NaN NaN NaN NaN
You can use np.random.choice() for the sampling.
Assuming df is something like this:
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'DP 1': ['(0.519)', '0.030', '1.290', '1.500', '(1.540)', '(0.197)', '(0.942)', '0.693', '0.197'],
    'DP 2': ['(1.117)', '0.047', '(0.179)', '(1.228)', '0.689', '(0.664)', '0.764', '1.647', np.nan],
    'DP 3': ['(1.152)', '0.632', '0.006', '1.840', '(0.683)', '(0.636)', '(0.137)', np.nan, np.nan],
    'DP 4': ['0.772', '(0.608)', '0.850', '(1.594)', '0.005', '0.878', np.nan, np.nan, np.nan],
    'DP 5': ['1.490', '(0.322)', '(1.141)', '(0.282)', '0.543', np.nan, np.nan, np.nan, np.nan],
    'DP 6': ['(0.850)', '0.939', '0.758', '(0.907)', np.nan, np.nan, np.nan, np.nan, np.nan],
    'DP 7': ['(1.189)', '0.346', '0.682', np.nan, np.nan, np.nan, np.nan, np.nan, np.nan],
    'DP 8': ['(0.759)', '0.651', np.nan, np.nan, np.nan, np.nan, np.nan, np.nan, np.nan],
    'DP 9': [np.nan] * 9,
    'DP 10': [np.nan] * 9,
})
# DP 1 DP 2 DP 3 DP 4 DP 5 DP 6 DP 7 DP 8 DP 9 DP 10
# 0 (0.519) (1.117) (1.152) 0.772 1.490 (0.850) (1.189) (0.759) NaN NaN
# 1 0.030 0.047 0.632 (0.608) (0.322) 0.939 0.346 0.651 NaN NaN
# 2 1.290 (0.179) 0.006 0.850 (1.141) 0.758 0.682 NaN NaN NaN
# 3 1.500 (1.228) 1.840 (1.594) (0.282) (0.907) NaN NaN NaN NaN
# 4 (1.540) 0.689 (0.683) 0.005 0.543 NaN NaN NaN NaN NaN
# 5 (0.197) (0.664) (0.636) 0.878 NaN NaN NaN NaN NaN NaN
# 6 (0.942) 0.764 (0.137) NaN NaN NaN NaN NaN NaN NaN
# 7 0.693 1.647 NaN NaN NaN NaN NaN NaN NaN NaN
# 8 0.197 NaN NaN NaN NaN NaN NaN NaN NaN NaN
First extract the choices from all non-null values of df:
choices = df.values[~pd.isnull(df.values)]
# array(['(0.519)', '(1.117)', '(1.152)', '0.772', '1.490', '(0.850)',
# '(1.189)', '(0.759)', '0.030', '0.047', '0.632', '(0.608)',
# '(0.322)', '0.939', '0.346', '0.651', '1.290', '(0.179)', '0.006',
# '0.850', '(1.141)', '0.758', '0.682', '1.500', '(1.228)', '1.840',
# '(1.594)', '(0.282)', '(0.907)', '(1.540)', '0.689', '(0.683)',
# '0.005', '0.543', '(0.197)', '(0.664)', '(0.636)', '0.878',
# '(0.942)', '0.764', '(0.137)', '0.693', '1.647', '0.197'],
# dtype=object)
Then take a np.random.choice() from choices for all non-null cells:
df = df.applymap(lambda x: np.random.choice(choices) if not pd.isnull(x) else x)
# DP 1 DP 2 DP 3 DP 4 DP 5 DP 6 DP 7 DP 8 DP 9 DP 10
# 0 (0.179) 0.682 0.758 (1.152) (0.137) (1.152) 0.939 (0.759) NaN NaN
# 1 1.500 (1.152) (0.197) 0.772 1.840 1.840 0.772 (0.850) NaN NaN
# 2 0.878 0.005 (1.540) 0.764 (0.519) 0.682 (1.152) NaN NaN NaN
# 3 0.758 (0.137) 1.840 1.647 1.647 (0.942) NaN NaN NaN NaN
# 4 0.693 (0.683) (0.759) 1.500 (0.197) NaN NaN NaN NaN NaN
# 5 0.006 (0.137) 0.764 (1.117) NaN NaN NaN NaN NaN NaN
# 6 (0.664) 0.632 (1.141) NaN NaN NaN NaN NaN NaN NaN
# 7 0.543 (0.664) NaN NaN NaN NaN NaN NaN NaN NaN
# 8 (0.137) NaN NaN NaN NaN NaN NaN NaN NaN NaN
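If reproducible draws matter, a seeded NumPy Generator can be used instead of the global np.random state (a small optional tweak, not part of the original answer):

rng = np.random.default_rng(42)  # fixed seed for repeatable sampling
df = df.applymap(lambda x: rng.choice(choices) if not pd.isnull(x) else x)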

Pivot_table from lists in a column value

I have a dataframe like:
ID Sim Items
1 0.345 [7,7]
2 0.604 [2,7,3,8,5]
3 0.082 [9,1,9,1]
I want to form a pivot_table by:
df.pivot_table(index="ID", columns="Items", values="Sim")
To do that, I have to extract the list elements in the Items column and repeat the ID and Sim values for each element in the list, so the frame becomes:
ID Sim Items
1 0.345 7
2 0.604 2
2 0.604 7
2 0.604 3
2 0.604 8
2 0.604 5
3 0.082 9
3 0.082 1
The desired pivot table:
7 2 3 8 5 1 9
1 0.345 - - - - - -
2 0.604 0.604 0.604 0.604 0.604 - -
3 - - - - - 0.082 0.082
Is there any pythonic approach for that? Or any suggestions?
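For reference, a minimal reconstruction of the example frame above (values taken from the table shown), so the snippets below can be run as-is:

import pandas as pd

df = pd.DataFrame({
    'ID': [1, 2, 3],
    'Sim': [0.345, 0.604, 0.082],
    'Items': [[7, 7], [2, 7, 3, 8, 5], [9, 1, 9, 1]],
})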
Use explode (new in pandas 0.25+) before the pivot:
df.explode('Items').pivot_table(index="ID", columns="Items", values="Sim")
Items 1 2 3 5 7 8 9
ID
1 NaN NaN NaN NaN 0.345 NaN NaN
2 NaN 0.604 0.604 0.604 0.604 0.604 NaN
3 0.082 NaN NaN NaN NaN NaN 0.082
For lower versions of pandas, you can try:
(df.drop('Items', axis=1)
   .join(pd.DataFrame(df['Items'].tolist())
           .stack(dropna=False)
           .droplevel(1)
           .rename('Items'))
   .pivot_table(index="ID", columns="Items", values="Sim"))
Items 1 2 3 5 7 8 9
ID
1 NaN NaN NaN NaN 0.345 NaN NaN
2 NaN 0.604 0.604 0.604 0.604 0.604 NaN
3 0.082 NaN NaN NaN NaN NaN 0.082
If the exact column ordering matters, use reindex with the unique Items after explode:
(df.explode('Items').pivot_table(index="ID", columns="Items", values="Sim")
   .reindex(df.explode('Items')['Items'].unique(), axis=1))
Items 7 2 3 8 5 9 1
ID
1 0.345 NaN NaN NaN NaN NaN NaN
2 0.604 0.604 0.604 0.604 0.604 NaN NaN
3 NaN NaN NaN NaN NaN 0.082 0.082
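One detail worth noting: the lists contain duplicates (e.g. [7, 7] and [9, 1, 9, 1]), so after explode the pivot_table has to aggregate them. The default aggfunc is 'mean', and since Sim is constant within each ID the duplicates do not change the result; pass aggfunc explicitly if different behaviour is wanted:

df.explode('Items').pivot_table(index="ID", columns="Items", values="Sim", aggfunc='mean')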

Calculation does not work on pandas dataframe

I am working with a dataframe such as below.
df.head()
Out[20]:
Date Price Open High ... Vol. Change % A Day % OC %
0 2016-04-25 9577.5 9650.0 9685.0 ... 306230.0 -0.83 1.79 -0.75
1 2016-04-26 9660.0 9567.5 9695.0 ... 389490.0 0.86 1.52 0.97
2 2016-04-27 9627.5 9660.0 9682.5 ... 277940.0 -0.34 1.02 -0.34
3 2016-04-28 9595.0 9625.0 9667.5 ... 75120.0 -0.34 1.36 -0.31
4 2016-04-29 9532.5 9567.5 9597.5 ... 138340.0 -0.65 0.73 -0.37
I sliced it with some conditions. As a result, I got a list of selected row positions, con_down_success, whose length is 96.
I also made a second list, shifted by one:
con_down_success_D1 = [x+1 for x in con_down_success]
What I want to do is below.
df.iloc[con_down_success_D1,:].Low/df.iloc[con_down_success,:].Price
This code is supposed to produce a calculated series, but most of the values come out as NaN, as shown below.
(df.iloc[con_down_success_D1,:].Low/df.iloc[con_down_success,:].Price).tail(12)
Out[26]:
778 0.995716
779 NaN
787 NaN
788 NaN
794 NaN
795 NaN
821 NaN
822 NaN
827 NaN
828 NaN
830 NaN
831 NaN
Both series contain actual numbers, not NaN or NA. For example, the following works with no problem:
df.iloc[831,:].Low/df.iloc[830,:].Price
Out[18]: 0.9968354430379747
Could you tell me how to handle the dataframe to show what I want?
Thanks in advance.
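A likely cause, sketched here without access to the actual data: dividing two Series aligns them on their index labels, and because con_down_success_D1 is shifted by one position, the two slices share almost no labels, so most results come out as NaN. Working on the underlying arrays sidesteps the alignment:

# illustrative only; assumes con_down_success / con_down_success_D1 as defined above
low_next = df.iloc[con_down_success_D1, :]['Low'].to_numpy()
price_at = df.iloc[con_down_success, :]['Price'].to_numpy()
ratio = pd.Series(low_next / price_at, index=con_down_success)  # attach whichever index is convenient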

using cumsum method on multi level index in pandas

I have the following multi-level dataframe (partial)
Px_last FINAL RETURN Stock_RES WANTED
Stock Date
ALKM 10/27/2016 0.0013 1 -53.85 NaN -53.85
1/17/2017 0.0009 1 111.11 NaN 57.26
1/18/2017 0.0012 1 233.33 NaN 290.60
1/23/2018 0.0012 1 16.67 NaN 307.26
1/30/2018 0.0019 1 -42.11 NaN 265.16
ANDI 12/28/2017 0.0017 1 370.59 NaN 370.59
2/14/2018 0.0324 1 20.00 NaN 390.59
APPZ 9/22/2017 0.0002 1 -50.00 NaN -50.00
12/5/2017 0.0001 1 -100.00 NaN -150.00
12/6/2017 0.0001 1 0.00 NaN -150.00
I can do a cumulative sum for the entire dataframe with the following code
df3['TTL_SUM'] = df3['RETURN'].cumsum()
But what I want to do is a cumulative sum for each stock; when I do the following, I get a column of NaN (see the dataframe above). Does anyone know what I am doing wrong here?
df3['Stock_RES'] = df3.groupby(level=0)['RETURN'].sum()
It does seem to work when I assign it to a variable, but ultimately I want to get it into the dataframe:
RESULTS = df3.groupby(level=0)['RETURN'].sum()
Can someone help me out? It seems like the same code to me, so I am not sure why it won't add directly into the dataframe.
You were using sum and not cumsum in a groupby context.
df.assign(WANTED1=df.groupby('Stock').RETURN.cumsum())
Px_last FINAL RETURN Stock_RES WANTED WANTED1
Stock Date
ALKM 10/27/2016 0.0013 1 -53.85 NaN -53.85 -53.85
1/17/2017 0.0009 1 111.11 NaN 57.26 57.26
1/18/2017 0.0012 1 233.33 NaN 290.60 290.59
1/23/2018 0.0012 1 16.67 NaN 307.26 307.26
1/30/2018 0.0019 1 -42.11 NaN 265.16 265.15
ANDI 12/28/2017 0.0017 1 370.59 NaN 370.59 370.59
2/14/2018 0.0324 1 20.00 NaN 390.59 390.59
APPZ 9/22/2017 0.0002 1 -50.00 NaN -50.00 -50.00
12/5/2017 0.0001 1 -100.00 NaN -150.00 -150.00
12/6/2017 0.0001 1 0.00 NaN -150.00 -150.00

Reindexing data frame Pandas

I am trying to split a data set for training and testing using Pandas.
data = pd.read_csv("housingdata.csv", header=None)
train = data.sample(frac=0.6)
train.reindex()
test = data.loc[~data.index.isin(train.index)]
print(train)
print(test)
When I print the data, I get:
0 1 2 3 4
9 0.17004 12.5 7.87 0 0.524
1 0.02731 0.0 7.07 0 0.469
5 0.02985 0.0 2.18 0 0.458
3 0.03237 0.0 2.18 0 0.458
7 0.14455 12.5 7.87 0 0.524
6 0.08829 12.5 7.87 0 0.524
0 1 2 3 4
0 0.00632 18.0 2.31 0 0.538
2 0.02729 0.0 7.07 0 0.469
4 0.06905 0.0 2.18 0 0.458
8 0.21124 12.5 7.87 0 0.524
As you can see, the row indices are shuffled. How do I re-index the rows in both data sets?
This, however, does not change things globally. E.g.,
train.iloc[0, 4]
still gives 0.524.
As @EdChum's comments point out, it's not exactly clear what behavior you're looking for. But if all you want to do is give both new dataframes indices going from 0, 1, 2 ... n, then you can use reset_index():
train.reset_index(inplace=True, drop=True)
test.reset_index(inplace=True, drop=True)
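Equivalently, without inplace, if you prefer reassignment:

train = train.reset_index(drop=True)
test = test.reset_index(drop=True)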
