I'm trying to create a new column in a Pandas DataFrame by extracting values from another DataFrame. For each index, it should pick the column of the second DataFrame whose name matches the first DataFrame's value at that index. Here is a solution that works, but I'm looking for the best way to do it using Pandas.
import pandas as pd
dates = pd.date_range('2020-01-01', '2020-01-03', freq='d')
A = pd.DataFrame({
'i': [1,2,3],
}, index=dates)
B = pd.DataFrame({
1: [11, 12, 13],
2: [21, 22, 23],
3: [31, 32, 33],
}, index=dates)
# replace this with a more efficient method, avoiding for-loop and creating C
r = [B.loc[k, v] for k, v in A.i.items()]
C = pd.DataFrame({'B': r}, index=dates)
pd.merge(A, C, left_index=True, right_index=True)
expected result:
i B
2020-01-01 1 11
2020-01-02 2 22
2020-01-03 3 33
If I understood you correctly:
# Reshape main df with index as (date, i), remove axis name
_A = A.set_index('i', append=True).rename_axis(index=lambda x: None)
# Reshape sub df with index as (date, i), name series (column) as 'B'
_B = B.stack().rename('B')
# perform left join on indices
_A.merge(_B, how='left', left_index=True, right_index=True)
Result:
B
2020-01-01 1 11
2020-01-02 2 22
2020-01-03 3 33
You can chain the whole set into one line, but I wouldn't recommend this monstrosity:
A.set_index('i', append=True).rename_axis(index=lambda x: None) \
.merge(B.stack().rename('B'), how='left', left_index=True, right_index=True)
You can use lookup in this situation:
A['B'] = B.lookup(A.index, A['i'])
print(A)
Output:
i B
2020-01-01 1 11
2020-01-02 2 22
2020-01-03 3 33
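Note that DataFrame.lookup was deprecated in pandas 1.2 and has since been removed. Below is a minimal sketch of the replacement suggested in the deprecation notes, reusing A and B from the question and assuming both frames share the same row order:
import numpy as np
import pandas as pd

# Factorize A['i'] into integer codes plus the unique values they refer to
idx, cols = pd.factorize(A['i'])
# Reorder B's columns to the factorized order, then pick one value per row
A['B'] = B.reindex(cols, axis=1).to_numpy()[np.arange(len(A)), idx]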
How can I change the shape of my multiindexed dataframe from:
to something like this, but with all cell values, not only those of the first index:
I have tried to do it, but somehow I only receive the same dataframe as above with this code:
numbers = [100, 50, 20, 10, 5, 2, 1]
dfj = {}  # one extracted Series per number
for number in numbers:
    dfj[number] = df['First_column_value_name'].xs(key=number, level='Second_multiindex_column_name')
list_of_columns_position = []
for number in numbers:
    R_string = '{}_R'.format(number)
    list_of_columns_position.append(R_string)
df_positions_as_columns = pd.concat(dfj.values(), ignore_index=True, axis=1)
df_positions_as_columns.columns = list_of_columns_position
Split your first column into 2 parts, then join the result with the second column and finally pivot your dataframe:
Setup:
data = {'A': ['TLM_1/100', 'TLM_1/50', 'TLM_1/20',
'TLM_2/100', 'TLM_2/50', 'TLM_2/20'],
'B': [11, 12, 13, 21, 22, 23]}
df = pd.DataFrame(data)
print(df)
# Output:
A B
0 TLM_1/100 11
1 TLM_1/50 12
2 TLM_1/20 13
3 TLM_2/100 21
4 TLM_2/50 22
5 TLM_2/20 23
>>> df[['B']].join(df['A'].str.split('/', expand=True)) \
.pivot(index=0, columns=1, values='B') \
.rename_axis(index=None, columns=None) \
.add_suffix('_R')
100_R 20_R 50_R
TLM_1 11 13 12
TLM_2 21 23 22
Use a regular expression to split the label column into two columns a and b, then group by the extracted columns and unstack the grouping.
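A rough sketch of that approach, reusing the df from the setup above (the regex and the named groups a and b are just one way to do the split):
# Split the label column into two parts with a regex, then group and unstack
parts = df['A'].str.extract(r'(?P<a>[^/]+)/(?P<b>[^/]+)')
result = (df['B']
          .groupby([parts['a'], parts['b']])   # group by both extracted parts
          .first()
          .unstack('b')                        # move part b into columns
          .add_suffix('_R')
          .rename_axis(index=None, columns=None))
print(result)
#        100_R  20_R  50_R
# TLM_1     11    13    12
# TLM_2     21    23    22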
I am trying to filter multiple dataframes at once by a specific date range (for this example January 2 - January 4). I know you can filter a dataframe by date using the following code: df = (df['Date'] > 'start-date') & (df['Date'] < 'end-date'); however, when I created a list of dataframes and tried to loop over them, I am returned the original dataframe with the original date range.
Any suggestions? I have provided some example code below:
d1 = {'Date': ['2021-01-01', '2021-01-02', '2021-01-03', '2021-01-04', '2021-01-05'],
'A': [1, 2, 3, 4, 5],
'B': [6, 7, 8, 9, 10]
}
df1 = pd.DataFrame(data=d1)
d2 = {'Date': ['2021-01-01', '2021-01-02', '2021-01-03', '2021-01-04', '2021-01-05'],
'C': [11, 12, 13, 14, 15],
'D': [16, 17, 18, 19, 20]
}
df2 = pd.DataFrame(data=d2)
df_list = [df1, df2]
for df in df_list:
    df = (df['Date'] > '2021-01-01') & (df['Date'] < '2021-01-05')
df1
Output:
Date A B
0 2021-01-01 1 6
1 2021-01-02 2 7
2 2021-01-03 3 8
3 2021-01-04 4 9
4 2021-01-05 5 10
I have tried various ways to filter, such as .loc, writing functions, and creating a mask, but still can't get it to work. Another thing to note is that I am doing more formatting as part of this loop, and all the other formats are applied except this one. Any help is greatly appreciated! Thanks!
The issue here is that you're simply reassigning the variable df in your for loop without actually writing the result back into df_list.
This solves your issue:
df_list = [df1, df2]
output_list = []
for df in df_list:
    df_filter = (df['Date'] > '2021-01-01') & (df['Date'] < '2021-01-05')
    output_list.append(df.loc[df_filter])
output_list now contains the filtered dataframes.
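The same filtering can also be written as a list comprehension (just a sketch, assuming the same df_list as above):
output_list = [df.loc[(df['Date'] > '2021-01-01') & (df['Date'] < '2021-01-05')]
               for df in df_list]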
As #anky mentioned, you need to convert your 'Date' columns to datetime. In addition, inside your for loop you only build a boolean mask; you need to use it to select rows.
[...]
df1['Date'] = pd.to_datetime(df1['Date'])
df2['Date'] = pd.to_datetime(df2['Date'])
df_list_filtered = []
for df in df_list:
    df = df[(df['Date'] > '2021-01-01') & (df['Date'] < '2021-01-05')]
    df_list_filtered.append(df)
[...]
print(df_list_filtered[0])
Date A B
1 2021-01-02 2 7
2 2021-01-03 3 8
3 2021-01-04 4 9
I am trying to update a DataFrame
df1 = pd.DataFrame(data = {'A' : [1,2,3,4], 'B' : [5,6,7,8]})
by another DataFrame
df2 = pd.DataFrame(data = {'B' : [9, np.nan, 11, np.nan]}).
Now, my aim is to update df1 by df2 and overwrite all values (NaN values too) using
df1.update(df2)
In contrast to the usual use case, it is important to me that the NaN values end up in df1.
But as far as I can see, the update returns
>>> df1
A B
0 1 9
1 2 6
2 3 11
3 4 8
Is there a way to get
>>> df1
A B
0 1 9
1 2 NaN
2 3 11
3 4 NaN
without building df1 manually?
I am late to the party, but I was recently confronted with the same issue, i.e. trying to update a dataframe without ignoring NaN values the way the Pandas built-in update method does.
For two dataframes sharing the same column names, a workaround would be to concatenate both dataframes and then remove duplicates, only keeping the last entry:
import pandas as pd
import numpy as np
df1 = pd.DataFrame(data = {'A' : [1,2,3,4], 'B' : [5,6,7,8]})
df2 = pd.DataFrame(data = {'A' : [1,2,3,4], 'B' : [9, np.nan, 11, np.nan]})
frames = [df1, df2]
df_concatenated = pd.concat(frames)
df1=df_concatenated.loc[~df_concatenated.index.duplicated(keep='last')]
Depending on indexing, it might be necessary to sort the indices of the output dataframe:
df1=df1.sort_index()
To address your very specific example, for which df2 does not have a column A, you could run:
import pandas as pd
import numpy as np
df1 = pd.DataFrame(data = {'A' : [1,2,3,4], 'B' : [5,6,7,8]})
df2 = pd.DataFrame(data = {'B' : [9, np.nan, 11, np.nan]})
frames = [df1, df2]
df_concatenated = pd.concat(frames)
df1['B']=df_concatenated.loc[~df_concatenated.index.duplicated(keep='last')]['B']
It also works fine for me. You could perhaps use np.nan instead of 'nan'?
I guess you meant [9, np.nan, 11, np.nan], not the string 'nan'.
If using update() is not mandatory, then simply do df1.B = df2.B instead, so that the new df1.B will contain the NaN values.
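A minimal sketch of that plain assignment, reusing the frames from the question:
import pandas as pd
import numpy as np

df1 = pd.DataFrame(data={'A': [1, 2, 3, 4], 'B': [5, 6, 7, 8]})
df2 = pd.DataFrame(data={'B': [9, np.nan, 11, np.nan]})

df1['B'] = df2['B']  # plain assignment aligns on the index and keeps the NaN values
print(df1)
#    A     B
# 0  1   9.0
# 1  2   NaN
# 2  3  11.0
# 3  4   NaN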
DataFrame.update() only updates with non-NA values from the passed DataFrame. See the docs.
Approach 1: Drop all affected columns
I achieved this by dropping the new columns and joining the data from the replacement DataFrame:
df1 = df1.drop(columns=df2.columns).join(df2)
This tells Pandas to remove the columns from df1 that you're about to recreate using the values from df2. Note that the column order changes since the new columns are appended to the end.
Approach 2: Preserve column order
Loop over all columns in the replacement DataFrame, inserting affected columns in the target DataFrame in their original place after dropping the original. If the replacement DataFrame includes a column not in the target DataFrame, it will be appended to the end.
for col in df2.columns:
    try:
        col_pos = list(df1.columns).index(col)
        df1.drop(columns=[col], inplace=True)
        df1.insert(col_pos, col, df2[col])
    except ValueError:
        df1[col] = df2[col]
Caveat
With both of these approaches, if your indices do not match between df1 and df2, the rows missing from df2 will end up NaN in the affected columns of your output DataFrame. For comparison, update() keeps df1's original values for those rows:
df1 = pd.DataFrame(data = {'B' : [1,2,3,4,5], 'A' : [5,6,7,8,9]}) # Note the additional row
df2 = pd.DataFrame(data = {'B' : [9, np.nan, 11, np.nan]})
df1.update(df2)
Output:
>>> df1
B A
0 9.0 5
1 2.0 6
2 11.0 7
3 4.0 8
4 5.0 9
My version 1:
df1 = pd.DataFrame(data = {'B' : [1,2,3,4,5], 'A' : [5,6,7,8,9]})
df2 = pd.DataFrame(data = {'B' : [9, np.nan, 11, np.nan]})
df1 = df1.drop(columns=df2.columns).join(df2)
Output:
>>> df1
A B
0 5 9.0
1 6 NaN
2 7 11.0
3 8 NaN
4 9 NaN
My version 2:
df1 = pd.DataFrame(data = {'B' : [1,2,3,4,5], 'A' : [5,6,7,8,9]})
df2 = pd.DataFrame(data = {'B' : [9, np.nan, 11, np.nan]})
for col in df2.columns:
    try:
        col_pos = list(df1.columns).index(col)
        df1.drop(columns=[col], inplace=True)
        df1.insert(col_pos, col, df2[col])
    except ValueError:
        df1[col] = df2[col]
Output:
>>> df1
B A
0 9.0 5
1 NaN 6
2 11.0 7
3 NaN 8
4 NaN 9
A usable trick is to fill df2 with a string like 'n/a' instead of NaN, then replace 'n/a' with np.nan after the update, and convert the column type back to float:
df1 = pd.DataFrame(data = {'A' : [1,2,3,4], 'B' : [5,6,7,8]})
df2 = pd.DataFrame(data = {'B' : [9, 'n/a', 11, 'n/a']})
df1.update(df2)
df1['B'] = df1['B'].replace({'n/a':np.nan})
df1['B'] = df1['B'].apply(pd.to_numeric, errors='coerce')
Some explanation about the type conversion: after the call to replace, the result is:
A B
0 1 9.0
1 2 NaN
2 3 11.0
3 4 NaN
This looks acceptable, but actually the type of column B has changed from float to object.
df1.dtypes
will give
A int64
B object
dtype: object
To set it back to float, you can use:
df1['B'] = df1['B'].apply(pd.to_numeric, errors='coerce')
And then, you shall have the expected result:
df1.dtypes
will give the expected type:
A int64
B float64
dtype: object
pandas.DataFrame.update doesn't overwrite values with NaN, so to circumvent this, temporarily replace the NaN values with a sentinel string:
import pandas as pd
import numpy as np
df1 = pd.DataFrame(data = {'A' : [1,2,3,4], 'B' : [5,6,7,8]})
df2 = pd.DataFrame(data = {'B' : [9, np.nan, 11, np.nan]})
df2.replace(np.nan, 'NAN', inplace = True)
df1.update(df2)
df1.replace('NAN', np.nan, inplace = True)
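As with the 'n/a' trick above, this round trip through a string sentinel leaves column B with object dtype; a one-line sketch to convert it back, mirroring that earlier answer:
# B holds floats and NaN but is typed object after the replace; coerce back to float
df1['B'] = pd.to_numeric(df1['B'], errors='coerce')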
I am not sure if this is possible. I have two dataframes df1 and df2 which are presented like this:
df1             df2
id  value       id  value
a   5           a   3
c   9           b   7
d   4           c   6
f   2           d   8
                e   2
                f   1
They will have many more entries in reality than presented here. I would like to create a third dataframe df3 based on the values in df1 and df2. Any values in df1 would take precedence over values in df2 when writing to df3 (if the same id is present in both df1 and df2) so in this example I would return:
df3
id value
a 5
b 7
c 9
d 4
e 2
f 2
I have tried using df2 as the base (df2 will have all of the id's present for the whole universe) and then overwriting the value for id's that are present in df1, but cannot find the merge syntax to do this.
You could use combine_first, provided that you first make the DataFrame index id (so that the values get aligned by id):
In [80]: df1.set_index('id').combine_first(df2.set_index('id')).reset_index()
Out[80]:
id value
0 a 5.0
1 b 7.0
2 c 9.0
3 d 4.0
4 e 2.0
5 f 2.0
Since you mentioned merging, you might be interested in seeing that
you could merge df1 and df2 on id, and then use fillna to replace the NaNs in df1's value column with values from df2's value column:
df1 = pd.DataFrame({'id': ['a', 'c', 'd', 'f'], 'value': [5, 9, 4, 2]})
df2 = pd.DataFrame({'id': ['a', 'b', 'c', 'd', 'e', 'f'], 'value': [3, 7, 6, 8, 2, 1]})
result = pd.merge(df2, df1, on='id', how='left', suffixes=('_x', ''))
result['value'] = result['value'].fillna(result['value_x'])
result = result[['id', 'value']]
print(result)
yields the same result, though the first method is simpler.
I am trying to merge/join multiple DataFrames and so far I have had no luck. I've found the merge method, but it works only with two DataFrames. I also found this SO answer suggesting to do something like this:
df1.merge(df2,on='name').merge(df3,on='name')
Unfortunately it will not work in my case, because I have 20+ dataframes.
My next idea was to use join. According to the reference, when joining multiple dataframes I need to pass a list, and I can only join on the index. So I set the join column as the index for all of the dataframes (ok, that can be done programmatically easily enough) and ended up with something like this:
df.join([df1,df2,df3])
Unfortunately, this approach also failed, because the other column names are the same in all dataframes. I decided to do the last thing, that is, renaming all columns. But when I finally joined everything:
df = pd.DataFrame()
df.join([df1,df2,df3])
I received an empty dataframe. I have no more ideas on how I can join them. Can someone suggest anything else?
EDIT1:
Sample input:
import numpy as np
import pandas as pd
df1 = pd.DataFrame(np.array([
['a', 5, 19],
['b', 14, 16],
['c', 4, 9]]),
columns=['name', 'attr1', 'attr2'])
df2 = pd.DataFrame(np.array([
['a', 15, 49],
['b', 4, 36],
['c', 14, 9]]),
columns=['name', 'attr1', 'attr2'])
df1
name attr1 attr2
0 a 5 19
1 b 14 16
2 c 4 9
df2
name attr1 attr2
0 a 15 49
1 b 4 36
2 c 14 9
Expected output:
df
name attr1_1 attr2_1 attr1_2 attr2_2
0 a 5 19 15 49
1 b 14 16 4 36
2 c 4 9 14 9
Indexes might be unordered between dataframes, but it is guaranteed that they will exist.
Use pd.concat with keys, then flatten the resulting column MultiIndex:
dflist = [df1, df2]
# one key per dataframe: '1', '2', ...
keys = ["%d" % i for i in range(1, len(dflist) + 1)]
# concatenate side by side, giving a (key, attr) column MultiIndex
merged = pd.concat([df.set_index('name') for df in dflist], axis=1, keys=keys)
# flatten the MultiIndex to 'attr1_1', 'attr2_1', 'attr1_2', ...
merged.columns = merged.swaplevel(0, 1, 1).columns.to_series().str.join('_')
merged
Or
merged.reset_index()
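Run against the two sample frames from the question, this yields the expected layout (a sketch of the output; the values are strings because the frames were built from a single NumPy array):
print(merged.reset_index())
#   name attr1_1 attr2_1 attr1_2 attr2_2
# 0    a       5      19      15      49
# 1    b      14      16       4      36
# 2    c       4       9      14       9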
Use reduce:
from functools import reduce

def my_merge(df1, df2):
    return df1.merge(df2, on='name')

final_df = reduce(my_merge, df_list)
considering df_list to be a list of your dataframes
The solution from #piRSquared works for 20+ dataframes; see the following script for creating 20+ example dataframes:
import numpy as np
import pandas as pd

N = 25
dflist = []
for d in range(N):
    df = pd.DataFrame(np.random.rand(3, 2))
    df.columns = ['attr1', 'attr2']
    df['name'] = ['a', 'b', 'c']
    dflist.append(df)
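Feeding that list into the concat approach is then the same code as in the first answer (a sketch; with N = 25 the result has 50 value columns plus name):
keys = ["%d" % i for i in range(1, len(dflist) + 1)]
merged = pd.concat([df.set_index('name') for df in dflist], axis=1, keys=keys)
merged.columns = merged.swaplevel(0, 1, 1).columns.to_series().str.join('_')
print(merged.reset_index())  # columns: name, attr1_1, attr2_1, ..., attr1_25, attr2_25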