I have a MultiIndex dataframe (table1) and I want to merge specific columns from another dataframe that is not MultiIndex (table2).
Example of table 1:
>>>   name  2020-10-21              2020-10-22          ...
Column         9    10    11    12       9    10    11    12
0       A5   2.1   2.2   2.4   2.8     5.4   3.4   1.1   7.3
1       B9   7.2   1.2  14.5   7.5     3.4   5.2   6.4   8.1
2       C3   1.1   6.5   8.4   9.1     1.1   4.3   6.5   8.7
...
Example of table 2:
>>>  name  indc control  code
0      A5  0.32     yes     1
1      C3  0.11      no     2
2     B18  0.23     yes     2
3      B9  0.45      no     3
I want to merge in the column "code" from table2, using "name" in table2 (and "index" in table1) as the key, to get the code beside the name:
>>>  index  2020-10-21                    2020-10-22          ...
Column  code       9    10    11    12       9    10    11    12
0       A5    1  2.1   2.2   2.4   2.8     5.4   3.4   1.1   7.3
1       B9    3  7.2   1.2  14.5   7.5     3.4   5.2   6.4   8.1
2       C3    2  1.1   6.5   8.4   9.1     1.1   4.3   6.5   8.7
...
I know how to merge when the columns are not a MultiIndex; then I do something like this:
df = table1.merge(table2[['code', 'name']], how='left',
                  left_on='index', right_on='name')
but now I get this error:
UserWarning: merging between different levels can give an unintended result (2 levels on the left, 1 on the right)
and then:
ValueError: 'index' is not in list
When I print the columns I can see that they look like tuples, but I don't understand why it says 'index' is not in the list, because when I print the columns of table1 I get:
Index([('index', ''), ('2020-10-22', 9), ...
so I'm a bit confused.
My end goal: to merge in the "code" column, matching "name" in table2 against "index" in table1.
For the merge to work correctly, both DataFrames need the same number of column levels. Every column key in table1 is a tuple such as ('index', ''), so the flat label 'index' cannot be found, which is exactly what the warning and the ValueError are telling you. Promote table2's columns to a two-level MultiIndex and merge on the tuple key:
df2 = table2[['code', 'name']].rename(columns={'name': 'index'})
# Add an empty second level so both frames have two column levels.
df2.columns = pd.MultiIndex.from_product([df2.columns, ['']])
df = table1.merge(df2, how='left', on=[('index', '')])
# If necessary, reorder the columns so code sits beside index.
cols = df.columns[:1].tolist() + df.columns[-1:].tolist() + df.columns[1:-1].tolist()
df = df[cols]
print(df)
  index code  2020-10-21              2020-10-22
                     9    10    11    12     9    10    11    12
0    A5    1       2.1   2.2   2.4   2.8   5.4   3.4   1.1   7.3
1    B9    3       7.2   1.2  14.5   7.5   3.4   5.2   6.4   8.1
2    C3    2       1.1   6.5   8.4   9.1   1.1   4.3   6.5   8.7
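If you only need this one column, you can also sidestep the level mismatch entirely and map the codes in by key; a minimal sketch, assuming the key column is labelled ('index', '') as above:
# Look up each name's code in table2 and attach it as a new two-level column.
table1[('code', '')] = table1[('index', '')].map(table2.set_index('name')['code'])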
I have a pandas dataframe where the first header row has distinct entries (A, B, C) and the second header row has repeating column names (open, r, close).
                A               B                C
Date         open  r close   open  r close    open  r close
2000-07-03   19.7  5  17.1  66.26  4  6.22   23.26  1  9.9
2000-07-05   49.8  2   8.3  78.81  6  4.34   39.81  5  5.1
2000-07-15   89.5  3   4.1  43.45  7  2.45    29.3  8  1.2
2000-08-13   74.7  6   7.4  34.26  8   6.4   72.26  9  5.4
2000-08-25  39.84  1   8.4  95.43  3   4.3   69.81  0  5.2
2000-08-28   61.8  4   4.2  43.81  1   2.2  129.81  6  1.3
2000-09-11  82.79  7   7.4  66.26  1   6.5   72.25  6  5.6
2000-09-16   64.8  8   8.7  73.45  5   4.7   69.45  4  5.4
2000-09-22   58.5  9   3.3  13.81  8   2.9   777.8  8  1.4
I want to extract the data for the 7th month of 2000 and find which of A, B, or C has the lowest (open - close) difference.
MY PLAN:
s=data.stack(level=0)
values = s[s.index.get_level_values(1)]['open', 'close'].reset_index()
values['Date'] = pd.to_datetime(values['Date'])
start_date = 2000-07-01
end_date = 2000-08-01
mask = (data['date'] > start_date) & (data['date'] <= end_date)
df = data.loc[mask]
df['Val_Diff'] = df['open'] - df['close']
print(df['Val_Diff'].max())
I get the error:
KeyError: "None of [Index(...)] are in the [columns]"
Why is the MultiIndex a problem for this code?
I think it's caused by the unnamed columns in the index when stack reshapes the frame vertically.
Process flow:
Flatten the MultiIndex column names.
Transform from wide to long using the wide_to_long function.
Convert the date column to datetime format for conditional extraction.
import pandas as pd
import numpy as np
import io
import datetime
data = '''
Date open r close open r close open r close
2000-07-03 19.7 5 17.1 66.26 4 6.22 23.26 1 9.9
2000-07-05 49.8 2 8.3 78.81 6 4.34 39.81 5 5.1
2000-07-15 89.5 3 4.1 43.45 7 2.45 29.3 8 1.2
2000-08-13 74.7 6 7.4 34.26 8 6.4 72.26 9 5.4
2000-08-25 39.84 1 8.4 95.43 3 4.3 69.81 0 5.2
2000-08-28 61.8 4 4.2 43.81 1 2.2 129.81 6 1.3
2000-09-11 82.79 7 7.4 66.26 1 6.5 72.25 6 5.6
2000-09-16 64.8 8 8.7 73.45 5 4.7 69.45 4 5.4
2000-09-22 58.5 9 3.3 13.81 8 2.9 777.8 8 1.4
'''
data = pd.read_csv(io.StringIO(data), sep=r'\s+')
# Rebuild the two-level header: top level is the item (A/B/C), bottom level the field.
idx = pd.MultiIndex.from_arrays([['', 'A', 'A', 'A', 'B', 'B', 'B', 'C', 'C', 'C'],
                                 ['Date', 'open', 'r', 'close', 'open', 'r', 'close', 'open', 'r', 'close']])
data.columns = idx
# Flatten to single-level names like 'open_A', 'r_A', 'close_A', ...
new_cols = [k[1] + '_' + k[0] for k in data.columns[1:]]
new_cols.insert(0, 'Date')
data.columns = new_cols
# Reshape wide to long: one row per (Date, item), with open/r/close columns.
data = pd.wide_to_long(data, ['open', 'r', 'close'], i='Date', j='item', sep='_', suffix=r'\w+')
data.reset_index(inplace=True)
data['Date'] = pd.to_datetime(data['Date'])
start_date = datetime.datetime(2000,7,1)
end_date = datetime.datetime(2000,8,1)
mask = (data.Date > start_date) & (data.Date <= end_date)
data = data.loc[mask]
data
Date item open r close
0 2000-07-03 A 19.70 5 17.10
1 2000-07-05 A 49.80 2 8.30
2 2000-07-15 A 89.50 3 4.10
9 2000-07-03 B 66.26 4 6.22
10 2000-07-05 B 78.81 6 4.34
11 2000-07-15 B 43.45 7 2.45
18 2000-07-03 C 23.26 1 9.90
19 2000-07-05 C 39.81 5 5.10
20 2000-07-15 C 29.30 8 1.20
data['Val_Diff'] = data['open'] - data['close']
print(data['Val_Diff'].max())
85.4
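If you also want to know which of A, B, or C produced that extreme difference, a short sketch continuing from the filtered frame above (the question asks for the lowest, so swap idxmax for idxmin if that is what you need):
# Locate the row holding the extreme open-close difference.
extreme = data.loc[data['Val_Diff'].idxmax(), ['Date', 'item', 'Val_Diff']]
print(extreme)  # Date 2000-07-15, item A, Val_Diff 85.4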
Using pandas, I like to use groupby with an aggregate function, e.g. mean,
and then put the results back into the original dataframe, but in the next group and not in the group itself. How can I do this in a vectorized way?
I have a pandas dataframe like this:
data = {'Group': ['A','A','B','B','B','B', 'C','C', 'D','D'],
'Value': [1.1,1.3,9.1,9.2,9.5,9.4,6.2,6.4,2.2,2.3]
}
df = pd.DataFrame(data, columns = ['Group','Value'])
print (df)
Group Value
0 A 1.1
1 A 1.3
2 B 9.1
3 B 9.2
4 B 9.5
5 B 9.4
6 C 6.2
7 C 6.4
8 D 2.2
9 D 2.3
I'd like to get this, where each group has the mean value of the previous group.
Group Value
0 A NaN
1 A NaN
2 B 1.2
3 B 1.2
4 B 1.2
5 B 1.2
6 C 9.3
7 C 9.3
8 D 6.3
9 D 6.3
I tried this, but it is missing the shift to the next group:
df.groupby('Group')['Value'].transform('mean')
Easy, use map on a groupby result:
df['Value'] = df['Group'].map(df.groupby('Group')['Value'].mean().shift())
df
Group Value
0 A NaN
1 A NaN
2 B 1.2
3 B 1.2
4 B 1.2
5 B 1.2
6 C 9.3
7 C 9.3
8 D 6.3
9 D 6.3
How It Works
Get the mean
df.groupby('Group')['Value'].mean()
Group
A 1.20
B 9.30
C 6.30
D 2.25
Name: Value, dtype: float64
Shift it down by 1
df.groupby('Group')['Value'].mean().shift()
Group
A NaN
B 1.2
C 9.3
D 6.3
Name: Value, dtype: float64
Map it back.
df['Group'].map(df.groupby('Group')['Value'].mean().shift())
0 NaN
1 NaN
2 1.2
3 1.2
4 1.2
5 1.2
6 9.3
7 9.3
8 6.3
9 6.3
Name: Group, dtype: float64
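One caveat: groupby sorts the group keys by default, so "previous group" here means previous in sorted order, which happens to coincide with the order of appearance in this example. If your groups appear unsorted, a sketch that preserves appearance order:
# sort=False keeps groups in order of first appearance, so shift() really
# picks up the group that physically precedes each one in the frame.
prev_means = df.groupby('Group', sort=False)['Value'].mean().shift()
df['Value'] = df['Group'].map(prev_means)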
You can calculate the aggregated GroupBy.mean of each group, shift it with Series.shift, and take advantage of pandas index alignment:
df.set_index('Group').assign(value=df.groupby('Group')['Value'].mean().shift()).reset_index()
Group Value value
0 A 1.1 NaN
1 A 1.3 NaN
2 B 9.1 1.2
3 B 9.2 1.2
4 B 9.5 1.2
5 B 9.4 1.2
6 C 6.2 9.3
7 C 6.4 9.3
8 D 2.2 6.3
9 D 2.3 6.3
Let's assume you have two Pandas DataFrames, one containing data for the year 2020 and the other containing data for the year 2030. Both DataFrames have the same shape, column names, and only contain numeric values. For simplicity, we'll create them as follows:
twenty = pd.DataFrame({'A':[1,1,1], 'B':[3,3,3]})
thirty = pd.DataFrame({'A':[3,3,3], 'B':[7,7,7]})
Now, the goal is to perform a linear interpolation on all values in these DataFrames to obtain a new DataFrame for the year 2025 (or whatever year we select). So, we would want to interpolate between each paired set of values, such as twenty['A'][0] and thirty['A'][0]. If we did this for the target year 2025, the result should be:
twentyfive = pd.DataFrame({'A':[2,2,2],'B':[5,5,5]})
I've attempted to use np.interp; however, as far as I can tell that is really intended for interpolation over a single array. I've also solved the problem with a more brute-force method of melting the DataFrames, adding year columns, merging them together, and then creating a new column with the interpolated values, but it's messy and long-winded.
I feel like there must be a more straightforward (and optimized) way of performing this task. Any help is appreciated.
You can try taking the average directly, if both have the same shape:
(thirty + twenty)/2
Out:
A B
0 2 5
1 2 5
2 2 5
Edit: if the dataframes do not have equal shapes, you can try merging with an inner join and grouping the columns to take the interpolated mean.
# The merge suffixes the shared names (A_x, A_y, ...); strip the suffixes so
# matching columns share a name again, then average each pair.
df = pd.merge(twenty, thirty, left_index=True, right_index=True, how='inner').rename(columns=lambda x: x.split('_')[0])
df.T.groupby(df.T.index).mean().T
Out:
A B
0 2 5
1 2 5
2 2 5
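The same elementwise arithmetic generalises from the midpoint to any target year as a weighted average; a minimal sketch (the helper name is invented for illustration):
def interp_year(lo_df, hi_df, year, y0=2020, y1=2030):
    # Linear interpolation between two same-shaped frames at the given year.
    w = (year - y0) / (y1 - y0)
    return lo_df + (hi_df - lo_df) * w

interp_year(twenty, thirty, 2025)  # identical to (thirty + twenty) / 2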
You can concat, being smart about the keys (naming them with integer years), and then groupby, allowing you to interpolate everything:
import numpy as np
import pandas as pd

df = pd.concat([twenty, thirty], keys=[20, 30], axis=1)
s = (df.groupby(df.columns.get_level_values(1), axis=1)
       .apply(lambda x: x.T.reset_index(1, drop=True).reindex(np.arange(20, 31)).interpolate())).T
20 21 22 23 24 25 26 27 28 29 30
A 0 1.0 1.2 1.4 1.6 1.8 2.0 2.2 2.4 2.6 2.8 3.0
1 1.0 1.2 1.4 1.6 1.8 2.0 2.2 2.4 2.6 2.8 3.0
2 1.0 1.2 1.4 1.6 1.8 2.0 2.2 2.4 2.6 2.8 3.0
B 0 3.0 3.4 3.8 4.2 4.6 5.0 5.4 5.8 6.2 6.6 7.0
1 3.0 3.4 3.8 4.2 4.6 5.0 5.4 5.8 6.2 6.6 7.0
2 3.0 3.4 3.8 4.2 4.6 5.0 5.4 5.8 6.2 6.6 7.0
Now if you just care about 25:
s[25].unstack(0)
A B
0 2.0 5.0
1 2.0 5.0
2 2.0 5.0
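Note that groupby(..., axis=1) is deprecated in recent pandas versions. A deprecation-free sketch of the same idea, building every interpolated year explicitly and then selecting the one you want:
import numpy as np
import pandas as pd

# One weighted average per year; the dict keys become the outer column level.
interp = pd.concat({y: twenty + (thirty - twenty) * (y - 20) / 10
                    for y in np.arange(20, 31)}, axis=1)
interp[25]  # the 2025 frame, with columns A and B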
Say I have two data frames, for example:
df1:
     x    y
0  1.1  2.1
1  3.1  5.1
df2:
     x    y
0  0.0  2.2
1  1.1  2.1
2  3.0  6.6
3  3.1  5.1
4  0.2  8.8
and I want df2 reordered so that the rows it shares with df1 come first, in df1's order, with the non-matching rows kept afterwards. How would I do that using pandas, or maybe something else?
desired output:
new_df:
     x    y
0  1.1  2.1
1  3.1  5.1
2  0.0  2.2
3  3.0  6.6
4  0.2  8.8
For rows 2-4 I don't care about the order, as long as the matching rows follow the same order as df1; the matching rows should end up at the same index positions in df1 and df2.
Is there any way to do this?
Sorry if the way I submitted this is wrong. Thanks, guys.
Just use merge with indicator=True and a right join; the default ordering puts the matched rows first:
df1.merge(df2,indicator=True,how='right')
Out[354]:
x y _merge
0 1.1 2.1 both
1 3.1 5.1 both
2 0.0 2.2 right_only
3 3.0 6.6 right_only
4 0.2 8.8 right_only
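Once the ordering looks right, drop the bookkeeping column to get the final frame:
# Same join as above, minus the _merge indicator column.
new_df = df1.merge(df2, indicator=True, how='right').drop(columns='_merge')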
Use pd.concat with drop_duplicates:
pd.concat([df1,df2]).drop_duplicates().reset_index(drop=True)
Output:
x y
0 1.1 2.1
1 3.1 5.1
2 0.0 2.2
3 3.0 6.6
4 0.2 8.8
Look at the .combine_first and .update methods:
df1.combine_first(df2)
They are explained in the pandas documentation.
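One behavioural note, since it is easy to misread: combine_first aligns on index labels, keeping df1's values where they exist and filling the remaining labels from df2. With the frames above it keeps df1's rows 0-1 and takes rows 2-4 from df2, which leaves the (3.1, 5.1) row in twice; the concat/drop_duplicates approach above avoids that.
# Rows 0-1 come from df1; rows 2-4 are filled from df2 by index alignment.
new_df = df1.combine_first(df2)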
I need to add three columns to a pandas dataframe, computed from existing data.
df
>>
n a b
0 3 1.2 1.4
1 2 2.8 3.8
2 3 2.3 2.0
3 3 1.7 5.7
4 2 6.9 4.9
5 1 3.9 19.0
6 9 2.3 8.3
7 5 8.5 3.1
8 18 6.7 7.0
9 10 5.6 6.4
I have done the following
import pandas
import numpy
def add_tests(add_df):
new_tests = """
(a+b)/n
(a*b)/n
((a+b)/n)**-1
""".split()
rows = add_df.shape[0]
cols = len(new_tests)
U = pandas.DataFrame(numpy.empty([rows, cols]), columns=new_tests)
add_df = pandas.concat([add_df, U], axis=1)
for i, row in add_df.iterrows():
# 1) good calculation:
add_df['(a+b)/n'].loc[i] = (add_df['a'].loc[i] + add_df['b'].loc[i])/ add_df['n'].loc[i]
# 2) good calculation (Both ways):
add_df['(a*b)/n'].loc[i] = (row['a'] * row['b'])/ row['n']
# 3) bad calculation
add_df['((a+b)/n)**-1'].loc[i] = row['(a+b)/n'] ** -1
pass
return add_df
I get the following warning message:
df = add_tests(df)
df
>>
C:...\pandas\core\indexing.py:141: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame
n a b (a+b)/n (a*b)/n ((a+b)/n)**-1
0 3 1.2 1.4 0.866667 0.560000 0.833333
1 2 2.8 3.8 3.300000 5.320000 0.588235
2 3 2.3 2.0 1.433333 1.533333 0.434783
3 3 1.7 5.7 2.466667 3.230000 0.178571
4 2 6.9 4.9 5.900000 16.905000 0.500000
5 1 3.9 19.0 22.900000 74.100000 0.052632
6 9 2.3 8.3 1.177778 2.121111 0.142857
7 5 8.5 3.1 2.320000 5.270000 0.263158
8 18 6.7 7.0 0.761111 2.605556 0.111111
9 10 5.6 6.4 1.200000 3.584000 0.666667
Obviously, step 3 does not work properly...
How do I do it the right way?
Fun with eval:
define tuples of (temporary column name, formula)
create a newline-separated string of assignments to pass to eval
use the dictionary to rename the temporary names back to the formula strings
ftups = [('aa', '(a+b)/n'), ('bb', '(a*b)/n'), ('cc', '((a+b)/n)**-1')]
forms = '\n'.join([' = '.join(tup) for tup in ftups])
fdict = dict(ftups)
df.eval(forms, inplace=False).rename(columns=fdict)
n a b (a+b)/n (a*b)/n ((a+b)/n)**-1
0 3 1.2 1.4 0.866667 0.560000 1.153846
1 2 2.8 3.8 3.300000 5.320000 0.303030
2 3 2.3 2.0 1.433333 1.533333 0.697674
3 3 1.7 5.7 2.466667 3.230000 0.405405
4 2 6.9 4.9 5.900000 16.905000 0.169492
5 1 3.9 19.0 22.900000 74.100000 0.043668
6 9 2.3 8.3 1.177778 2.121111 0.849057
7 5 8.5 3.1 2.320000 5.270000 0.431034
8 18 6.7 7.0 0.761111 2.605556 1.313869
9 10 5.6 6.4 1.200000 3.584000 0.833333
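(Step 3 in the question fails because row is a snapshot yielded by iterrows before the '(a+b)/n' cell was filled in, so it still holds the uninitialised np.empty values.) The same three columns can also be built with plain vectorised assignments, no loop and no eval; a minimal sketch using the column names from the question:
# Whole-column arithmetic: no iterrows, no chained indexing, no copy warning.
out = df.copy()
out['(a+b)/n'] = (out['a'] + out['b']) / out['n']
out['(a*b)/n'] = (out['a'] * out['b']) / out['n']
out['((a+b)/n)**-1'] = out['(a+b)/n'] ** -1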