How to Interpolate between all Values in Two Separate Pandas DataFrames? - python

Let's assume you have two Pandas DataFrames, one containing data for the year 2020 and the other containing data for the year 2030. Both DataFrames have the same shape, column names, and only contain numeric values. For simplicity, we'll create them as follows:
twenty = pd.DataFrame({'A':[1,1,1], 'B':[3,3,3]})
thirty = pd.DataFrame({'A':[3,3,3], 'B':[7,7,7]})
Now, the goal is to perform a linear interpolation on all values in these DataFrames to obtain a new DataFrame for the year 2025 (or whatever year we select). So, we would want to interpolate between each paired set of values, such as twenty['A'][0] and thirty['A'][0]. If we did this for the target year 2025, the result should be:
twentyfive = pd.DataFrame({'A':[2,2,2],'B':[5,5,5]})
I've attempted to use np.interp; however, that is really intended for interpolation on a given (singular) array as far as I can tell. And I've solved the problem using a more brute-force method of melting the DataFrames, adding year columns, merging them together, and then creating a new column with the interpolated values. It's a bit messy and long-winded.
I feel like there must be a more straight-forward (and optimized) way of performing this task. Any help is appreciated.

If both DataFrames have the same shape, you can take the element-wise average directly. This works here only because 2025 is the midpoint between 2020 and 2030:
(thirty + twenty)/2
Out:
A B
0 2 5
1 2 5
2 2 5
Edit: if the DataFrames do not have the same shape, you can merge them on the index with an inner join, then group by column name and take the mean.
df = pd.merge(twenty,thirty, left_index=True, right_index=True, how='inner').rename(columns=lambda x: x.split('_')[0])
df.T.groupby(df.T.index).mean().T
Out:
A B
0 2 5
1 2 5
2 2 5
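For a target year that is not the midpoint, the same element-wise idea generalizes to a weighted combination of the two frames. The following is a minimal sketch, assuming both DataFrames share the same index and columns; the helper name interpolate_frames and its parameters are hypothetical and not part of pandas:

```python
import pandas as pd

twenty = pd.DataFrame({'A': [1, 1, 1], 'B': [3, 3, 3]})
thirty = pd.DataFrame({'A': [3, 3, 3], 'B': [7, 7, 7]})


def interpolate_frames(start_df, end_df, start_year, end_year, target_year):
    """Linearly interpolate every cell between two aligned DataFrames."""
    # Fraction of the way from start_year to end_year (0.5 for the midpoint).
    weight = (target_year - start_year) / (end_year - start_year)
    # pandas aligns on index and columns, so this is applied cell by cell.
    return start_df + (end_df - start_df) * weight


twentyfive = interpolate_frames(twenty, thirty, 2020, 2030, 2025)
twentytwo = interpolate_frames(twenty, thirty, 2020, 2030, 2022)
print(twentyfive)
```

With a weight of 0.5 this reduces to the plain average shown above.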

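You can concat the two DataFrames with integer keys for the years, then group by the original column names and interpolate across every year in between:
import numpy as np
import pandas as pd
# Columns become (year, name) pairs: (20, 'A'), (20, 'B'), (30, 'A'), (30, 'B').
df = pd.concat([twenty, thirty], keys=[20,30], axis=1)
# Note: groupby(..., axis=1) is deprecated as of pandas 2.1.
s = (df.groupby(df.columns.get_level_values(1), axis=1)
       .apply(lambda x: x.T.reset_index(1, drop=True).reindex(np.arange(20,31)).interpolate())).T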
      20   21   22   23   24   25   26   27   28   29   30
A 0  1.0  1.2  1.4  1.6  1.8  2.0  2.2  2.4  2.6  2.8  3.0
  1  1.0  1.2  1.4  1.6  1.8  2.0  2.2  2.4  2.6  2.8  3.0
  2  1.0  1.2  1.4  1.6  1.8  2.0  2.2  2.4  2.6  2.8  3.0
B 0  3.0  3.4  3.8  4.2  4.6  5.0  5.4  5.8  6.2  6.6  7.0
  1  3.0  3.4  3.8  4.2  4.6  5.0  5.4  5.8  6.2  6.6  7.0
  2  3.0  3.4  3.8  4.2  4.6  5.0  5.4  5.8  6.2  6.6  7.0
Now if you just care about 25:
s[25].unstack(0)
A B
0 2.0 5.0
1 2.0 5.0
2 2.0 5.0
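Since grouping along the columns (axis=1) is deprecated in recent pandas releases, the same year-by-year table can probably be built without it. This is a sketch of one alternative, not the answerer's method: it stacks each frame into long form, lines the two years up as columns, and interpolates across each row. The variable names long, all_years and year_25 are made up for illustration.

```python
import numpy as np
import pandas as pd

twenty = pd.DataFrame({'A': [1, 1, 1], 'B': [3, 3, 3]})
thirty = pd.DataFrame({'A': [3, 3, 3], 'B': [7, 7, 7]})

# Long form: one row per (row label, column name), one column per known year.
long = pd.concat({20: twenty.stack(), 30: thirty.stack()}, axis=1)

# Add the missing years as empty columns, then fill them linearly along each row.
all_years = long.reindex(columns=np.arange(20, 31)).interpolate(axis=1)

# Restore the original layout for a single year of interest.
year_25 = all_years[25].unstack()
print(year_25)
```

The default linear method treats the columns as equally spaced, which holds here because every year from 20 to 30 is present.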

Related

Performing operations on column with nan's without removing them

I currently have a data frame like so:
treated  control
9.5      9.6
10       5
6        0
6        6
I want to get the log2 ratio between treated and control, i.e. log2(treated/control). However, math.log2() breaks because of the 0 values in the control column (a zero division). Ideally, I would like to compute the log2 ratio using method chaining, e.g. df.assign(), and simply put NaN where the ratio cannot be computed, like so:
treated  control  log_2_ratio
9.5      9.6      -0.00454
10       5        0.301
6        0        nan
6        6        0
I have managed to do this in an extremely round-about way, where I have:
made a column ratio which is treated/control
done new_df = df.dropna() on this dataframe
applied the log 2 ratio to this.
Left-joined it back to the original df.
As always, any help is very much appreciated :)
You need to replace the inf with nan:
df.assign(log_2_ratio=np.log2(df['treated'].div(df['control'])).replace(np.inf, np.nan))
Output:
treated control log_2_ratio
0 9.5 9.6 -0.015107
1 10.0 5.0 1.000000
2 6.0 0.0 NaN
3 6.0 6.0 0.000000
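To avoid the subsequent replacement, you can use an explicit condition: the product treated * control is zero whenever either value is zero, so it can serve as the mask for np.where. Note that np.where still evaluates np.log2 for every row before selecting, so NumPy may emit a runtime warning for the masked rows.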
df.assign(log_2_ratio=lambda x: np.where(x.treated * x.control, np.log2(x.treated/x.control), np.nan))
Out[22]:
treated control log_2_ratio
0 9.5 9.6 -0.015107
1 10.0 5.0 1.000000
2 6.0 0.0 NaN
3 6.0 6.0 0.000000
Stick with the numpy log functions and you'll get an inf in the cells where the divide doesn't work. That seems like a better choice than nan anyway.
>>> df["log_2_ratio"] = np.log2(df.treated/df.control)
>>> df
treated control log_2_ratio
0 9.5 9.6 -0.015107
1 10.0 5.0 1.000000
2 6.0 0.0 inf
3 6.0 6.0 0.000000
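One gap in the replace(np.inf, np.nan) approach is that a zero in treated produces -inf, which that call leaves in place. A possible variant, sketched here under the assumption that any non-positive value should yield NaN, masks the inputs before dividing so that no infinities are created in the first place:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'treated': [9.5, 10, 6, 6], 'control': [9.6, 5, 0, 6]})

result = df.assign(
    # Series.where keeps values that satisfy the condition and sets the rest to NaN,
    # so the division and the logarithm simply propagate NaN for those rows.
    log_2_ratio=lambda x: np.log2(
        x['treated'].where(x['treated'] > 0) / x['control'].where(x['control'] > 0)
    )
)
print(result)
```

This keeps the method-chaining style asked for in the question and leaves the original columns untouched.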

Merge two dataframe when one has multiIndex in pandas

I have a DataFrame with MultiIndex columns (table 1) and I want to merge specific columns from another DataFrame that does not have a MultiIndex (table 2).
Example of table 1:
>>> name 2020-10-21 2020-10-22 ...
Column 9 10 11 12 9 10 11 12
0 A5 2.1 2.2 2.4 2.8 5.4 3.4 1.1 7.3
1 B9 7.2 1.2 14.5 7.5 3.4 5.2 6.4 8.1
2 C3 1.1 6.5 8.4 9.1 1.1 4.3 6.5 8.7
...
Example of table 2:
>>>name indc control code
0 A5 0.32 yes 1
1 C3 0.11 no 2
2 B18 0.23 yes 2
3 B9 0.45 no 3
I want to merge the column "code" based on the key "name" from table 2 (and "index" from table 1) to get the code beside the name:
>>> index 2020-10-21 2020-10-22 ...
Column code 9 10 11 12 9 10 11 12
0 A5 1 2.1 2.2 2.4 2.8 5.4 3.4 1.1 7.3
1 B9 3 7.2 1.2 14.5 7.5 3.4 5.2 6.4 8.1
2 C3 2 1.1 6.5 8.4 9.1 1.1 4.3 6.5 8.7
...
I know how to merge when the columns are not a MultiIndex; in that case I do something like this:
df = table1.merge(table2[['code','name']], how = 'left',
left_on = 'index', right_on = 'name')
but now I get error:
UserWarning: merging between different levels can give an unintended
result (2 levels on the left,1 on the right) warnings.warn(msg,
UserWarning)
and then:
ValueError: 'index' is not in list
When I print the columns I can see that they are tuples, but I don't understand why it says 'index' is not in list, because when I print the columns of table 1 I get:
Index([ ('index', ''), (2020-10-22, 9)...
So I'm a bit confused.
My end goal is to merge the code column based on the columns "name" and "index".
For this to work correctly, both DataFrames need MultiIndex columns:
df2 = table2[['code','name']].rename(columns={'name':'index'})
df2.columns = pd.MultiIndex.from_product([df2.columns, ['']])
df = table1.merge(df2, how = 'left', on = [('index', '')])
#if necessary reorder columns names
cols = df.columns[:1].tolist() + df.columns[-1:].tolist() + df.columns[1:-1].tolist()
df = df[cols]
print (df)
index code 2020-10-21 2020-10-22
9 10 11 12 9 10 11 12
0 A5 1 2.1 2.2 2.4 2.8 5.4 3.4 1.1 7.3
1 B9 3 7.2 1.2 14.5 7.5 3.4 5.2 6.4 8.1
2 C3 2 1.1 6.5 8.4 9.1 1.1 4.3 6.5 8.7
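If only a single column is needed, a lookup with Series.map may be simpler than a merge, because it avoids giving table2 MultiIndex columns at all. The sketch below rebuilds a cut-down version of the two tables (two hours per date instead of four, which is an assumption made for brevity) and adds the code column under the key ('code', ''):

```python
import pandas as pd

# Cut-down table 1: names in the index, (date, hour) pairs as columns.
values = pd.DataFrame(
    [[2.1, 2.2, 5.4, 3.4],
     [7.2, 1.2, 3.4, 5.2],
     [1.1, 6.5, 1.1, 4.3]],
    index=['A5', 'B9', 'C3'],
    columns=pd.MultiIndex.from_product([['2020-10-21', '2020-10-22'], [9, 10]]),
)
# reset_index moves the names into a column labelled ('index', '').
table1 = values.reset_index()

table2 = pd.DataFrame({
    'name': ['A5', 'C3', 'B18', 'B9'],
    'indc': [0.32, 0.11, 0.23, 0.45],
    'control': ['yes', 'no', 'yes', 'no'],
    'code': [1, 2, 2, 3],
})

# Build a name -> code lookup and map it onto the name column of table1.
code_by_name = table2.set_index('name')['code']
table1[('code', '')] = table1[('index', '')].map(code_by_name)
print(table1)
```

Names missing from table2 would come back as NaN, which matches the behaviour of a left merge.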

Group by Row Element and Transpose a Pandas DataFrame

In Python, I have the following Pandas dataframe:
Factor Value
0 a 1.2
1 b 3.4
2 b 4.5
3 b 5.6
4 c 1.3
5 d 4.6
I would like to organize this where:
unique row identifiers (the factor col) become columns
Their respective values remain under the created columns
The factor values are not in any particular order.
Target:
     A    B    C    D
0  1.2  3.4  1.3  4.6
1       4.5
2       5.6
3
4
5
Use set_index and unstack with groupby:
df.set_index(['Factor', df.groupby('Factor').cumcount()])['Value'].unstack(0)
Output:
Factor a b c d
0 1.2 3.4 1.3 4.6
1 NaN 4.5 NaN NaN
2 NaN 5.6 NaN NaN
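The key step above is groupby('Factor').cumcount(), which numbers the rows within each factor (0, 1, 2, ...) and so provides the row position for the reshaped table. The same idea can be written with pivot, as in this sketch; the helper column name position is an arbitrary choice:

```python
import pandas as pd

df = pd.DataFrame({
    'Factor': ['a', 'b', 'b', 'b', 'c', 'd'],
    'Value': [1.2, 3.4, 4.5, 5.6, 1.3, 4.6],
})

# Running count within each factor: the n-th occurrence gets position n - 1.
position = df.groupby('Factor').cumcount()

# One column per factor, one row per position; missing cells become NaN.
wide = df.assign(position=position).pivot(
    index='position', columns='Factor', values='Value'
)
print(wide)
```

Only three rows result, because the largest group (b) has three values.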

How to reorder a data frame to match the order of a second data frame?

If I have two data frames for an example:
df1:
x y
0 1.1 2.1
1 3.1 5.1
df2:
x y
0 0.0 2.2
1 1.1 2.1
2 3.0 6.6
3 3.1 5.1
4 0.2 8.8
I want to reorder df2 so that the rows it has in common with df1 come first, in the same order as in df1, followed by the rows that don't match. How would I do that using pandas, or maybe something else?
desired output:
new_df:
x y
0 1.1 2.1
1 3.1 5.1
2 0.0 2.2
3 3.0 6.6
4 0.2 8.8
I don't care about the order of rows 2-4, as long as the matching rows follow the same order as df1. I want the values at the matching indexes of df1 and df2 to be equal.
any way to do this?
sorry if the way I submitted this is wrong.
thanks guys
Use merge with indicator=True and how='right':
df1.merge(df2,indicator=True,how='right')
Out[354]:
x y _merge
0 1.1 2.1 both
1 3.1 5.1 both
2 0.0 2.2 right_only
3 3.0 6.6 right_only
4 0.2 8.8 right_only
Use pd.concat with drop_duplicates:
pd.concat([df1,df2]).drop_duplicates().reset_index(drop=True)
Output:
x y
0 1.1 2.1
1 3.1 5.1
2 0.0 2.2
3 3.0 6.6
4 0.2 8.8
Look at the .combine_first & .update methods.
df1.combine_first(df2)
They are explained in the pandas documentation.
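Another option is to make the ordering explicit instead of relying on the row order a merge happens to return. This sketch tags each df1 row with its position, left-merges that tag onto df2, and sorts on it with a stable sort; the helper column name order is hypothetical, and the approach assumes the shared rows match exactly on both x and y:

```python
import pandas as pd

df1 = pd.DataFrame({'x': [1.1, 3.1], 'y': [2.1, 5.1]})
df2 = pd.DataFrame({'x': [0.0, 1.1, 3.0, 3.1, 0.2],
                    'y': [2.2, 2.1, 6.6, 5.1, 8.8]})

# Record where each row sits in df1, then attach that position to df2's rows.
ranked = df2.merge(df1.assign(order=range(len(df1))), how='left', on=['x', 'y'])

# Matched rows sort by their df1 position; unmatched rows (NaN) go last.
# A stable sort keeps the unmatched rows in their original df2 order.
new_df = (ranked.sort_values('order', kind='mergesort', na_position='last')
                .drop(columns='order')
                .reset_index(drop=True))
print(new_df)
```

Unlike the concat approach, this never adds rows that exist only in df1.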

Right way to update the data in a table?

I need to add three columns to a pandas DataFrame, computed from existing data.
df
>>
n a b
0 3 1.2 1.4
1 2 2.8 3.8
2 3 2.3 2.0
3 3 1.7 5.7
4 2 6.9 4.9
5 1 3.9 19.0
6 9 2.3 8.3
7 5 8.5 3.1
8 18 6.7 7.0
9 10 5.6 6.4
I have done the following
import pandas
import numpy
def add_tests(add_df):
    new_tests = """
    (a+b)/n
    (a*b)/n
    ((a+b)/n)**-1
    """.split()
    rows = add_df.shape[0]
    cols = len(new_tests)
    U = pandas.DataFrame(numpy.empty([rows, cols]), columns=new_tests)
    add_df = pandas.concat([df, U], axis=1)
    for i, row in add_df.iterrows():
        # 1) good calculation:
        add_df['(a+b)/n'].loc[i] = (add_df['a'].loc[i] + add_df['b'].loc[i])/ add_df['n'].loc[i]
        # 2) good calculation (Both ways):
        add_df['(a*b)/n'].loc[i] = (row['a'] * row['b'])/ row['n']
        # 3) bad calculation
        add_df['((a+b)/n)**-1'].loc[i] = row['(a+b)/n'] ** -1
        pass
    return add_df
I get the following warning message:
df = add_tests(df)
df
>>
C:...\pandas\core\indexing.py:141: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame
n a b (a+b)/n (a*b)/n ((a+b)/n)**-1
0 3 1.2 1.4 0.866667 0.560000 0.833333
1 2 2.8 3.8 3.300000 5.320000 0.588235
2 3 2.3 2.0 1.433333 1.533333 0.434783
3 3 1.7 5.7 2.466667 3.230000 0.178571
4 2 6.9 4.9 5.900000 16.905000 0.500000
5 1 3.9 19.0 22.900000 74.100000 0.052632
6 9 2.3 8.3 1.177778 2.121111 0.142857
7 5 8.5 3.1 2.320000 5.270000 0.263158
8 18 6.7 7.0 0.761111 2.605556 0.111111
9 10 5.6 6.4 1.200000 3.584000 0.666667
Obviously step 3 does not work properly ...
How to do it the right way?
Fun with eval:
Define tuples of temporary column names and formulas.
Create a newline-separated string of assignments to pass to eval.
Use a dictionary to rename the temporary columns to the formulas.
ftups = [('aa', '(a+b)/n'), ('bb', '(a*b)/n'), ('cc', '((a+b)/n)**-1')]
forms = '\n'.join([' = '.join(tup) for tup in ftups])
fdict = dict(ftups)
df.eval(forms, inplace=False).rename(columns=fdict)
n a b (a+b)/n (a*b)/n ((a+b)/n)**-1
0 3 1.2 1.4 0.866667 0.560000 1.153846
1 2 2.8 3.8 3.300000 5.320000 0.303030
2 3 2.3 2.0 1.433333 1.533333 0.697674
3 3 1.7 5.7 2.466667 3.230000 0.405405
4 2 6.9 4.9 5.900000 16.905000 0.169492
5 1 3.9 19.0 22.900000 74.100000 0.043668
6 9 2.3 8.3 1.177778 2.121111 0.849057
7 5 8.5 3.1 2.320000 5.270000 0.431034
8 18 6.7 7.0 0.761111 2.605556 1.313869
9 10 5.6 6.4 1.200000 3.584000 0.833333
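A plain vectorized version with assign avoids both the row loop and the chained indexing that triggers SettingWithCopyWarning. The likely reason step 3 failed is that iterrows yields copies of the rows, so row['(a+b)/n'] still held the uninitialized value from numpy.empty rather than the value just assigned. The sketch below uses only the first three rows of the example data:

```python
import pandas as pd

df = pd.DataFrame({'n': [3, 2, 3], 'a': [1.2, 2.8, 2.3], 'b': [1.4, 3.8, 2.0]})

# Keyword arguments to assign are evaluated in order, so the third column
# can refer to the first one created in the same call.
result = df.assign(**{
    '(a+b)/n': lambda d: (d['a'] + d['b']) / d['n'],
    '(a*b)/n': lambda d: (d['a'] * d['b']) / d['n'],
    '((a+b)/n)**-1': lambda d: d['(a+b)/n'] ** -1,
})
print(result)
```

Each expression operates on whole columns at once, and assign returns a new DataFrame instead of modifying df in place.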
