Pandas merge - bring in identical column values based on keys - python

I have three dataframes:
import pandas as pd

df = pd.DataFrame([[1, 3], [2, 4], [3, 6], [4, 12], [5, 18]], columns=['A', 'B'])
df2 = pd.DataFrame([[1, 5], [2, 6], [3, 9]], columns=['A', 'C'])
df3 = pd.DataFrame([[4, 15, "hello"], [5, 19, "yes"]], columns=['A', 'C', 'D'])
They look like this:
df
A B
0 1 3
1 2 4
2 3 6
3 4 12
4 5 18
df2
A C
0 1 5
1 2 6
2 3 9
df3
A C D
0 4 15 hello
1 5 19 yes
My merges. First merge:
f_merge = pd.merge(df, df2, on='A', how='left')
Second merge (f_merge with df3):
s_merge = pd.merge(f_merge, df3, on='A', how='left')
I get output like this:
A B C_x C_y D
0 1 3 5.0 NaN NaN
1 2 4 6.0 NaN NaN
2 3 6 9.0 NaN NaN
3 4 12 NaN 15.0 hello
4 5 18 NaN 19.0 yes
I need it like this:
A B C D
0 1 3 5.0 NaN
1 2 4 6.0 NaN
2 3 6 9.0 NaN
3 4 12 15.0 hello
4 5 18 19.0 yes
How can I achieve this output? Any suggestions would be great.

Concatenate df2 and df3 before merging:
new_df = pd.merge(df, pd.concat([df2, df3], ignore_index=True), on='A')
new_df
Out:
A B C D
0 1 3 5 NaN
1 2 4 6 NaN
2 3 6 9 NaN
3 4 12 15 hello
4 5 18 19 yes
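Note that the merge above is a default inner join; with the sample data every key in df has a match, so the result is the same. If some keys could be missing from df2/df3 and you want to keep all rows of df (as in the original attempt), a left join is the safe variant:
new_df = pd.merge(df, pd.concat([df2, df3], ignore_index=True), on='A', how='left')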

We can use combine_first, which aligns on the index and fills missing values in the caller from the other frame:
df.set_index('A', inplace=True)
df2.set_index('A').combine_first(df).combine_first(df3.set_index('A'))
B C D
A
1 3.0 5.0 NaN
2 4.0 6.0 NaN
3 6.0 9.0 NaN
4 12.0 15.0 hello
5 18.0 19.0 yes
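A small follow-up in case you want A back as a column rather than as the index: chain a reset_index() onto the same expression (this assumes df has already been indexed by A, as above):
result = (df2.set_index('A')
             .combine_first(df)
             .combine_first(df3.set_index('A'))
             .reset_index())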

Related

Pandas interpolation adding rows by group with different ranges for each group

I am trying to add rows to a DataFrame by interpolating the values in one column within each group, filling all other columns with missing values. My data looks something like this:
import pandas as pd
import random
random.seed(42)
data = {'group': ['a', 'a', 'a', 'b', 'b', 'b', 'b', 'c', 'c', 'c'],
        'value': [1, 2, 5, 3, 4, 5, 7, 4, 7, 9],
        'other': random.sample(range(1, 100), 10)}
df = pd.DataFrame(data)
print(df)
group value other
0 a 1 82
1 a 2 15
2 a 5 4
3 b 3 95
4 b 4 36
5 b 5 32
6 b 7 29
7 c 4 18
8 c 7 14
9 c 9 87
What I am trying to achieve is something like this:
group value other
a 1 82
a 2 15
a 3 NaN
a 4 NaN
a 5 NaN
b 3 95
b 4 36
b 5 32
b 6 NaN
b 7 29
c 4 18
c 5 NaN
c 6 NaN
c 7 14
c 8 NaN
c 9 87
For example, group a has a range from 1 to 5, b from 3 to 7, and c from 4 to 9.
The issue I'm having is that each group has a different range. I found something that works if a single range is assumed for all groups: reindex on the global min and max, then drop the extra rows in each group (sketched below). But since my data is fairly large, adding that many rows per group quickly becomes unfeasible.
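Roughly, the single-range idea looks like this (a minimal sketch; the names full_range and out are mine, not from the code I found):
import numpy as np

# every group is reindexed over the GLOBAL 1..9 range, producing the
# extra per-group rows that would then have to be dropped
full_range = np.arange(df['value'].min(), df['value'].max() + 1)
out = (df.set_index('value')
         .groupby('group')['other']
         .apply(lambda s: s.reindex(full_range))
         .reset_index())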
>>> import numpy as np
>>> df.groupby('group').apply(lambda x: x.set_index('value').reindex(np.arange(x['value'].min(), x['value'].max() + 1))).drop(columns='group').reset_index()
group value other
0 a 1 82.0
1 a 2 15.0
2 a 3 NaN
3 a 4 NaN
4 a 5 4.0
5 b 3 95.0
6 b 4 36.0
7 b 5 32.0
8 b 6 NaN
9 b 7 29.0
10 c 4 18.0
11 c 5 NaN
12 c 6 NaN
13 c 7 14.0
14 c 8 NaN
15 c 9 87.0
We group on the group column and then reindex each group with the range from the min to the max of its value column.
One option is the complete function from pyjanitor, which is helpful for exposing explicitly missing rows (and for abstracting the reshaping process):
# pip install pyjanitor
import pandas as pd
import janitor
new_value = {'value': lambda df: range(df.min(), df.max() + 1)}
# expose the missing values per group via the `by` parameter
df.complete(new_value, by='group', sort=True)
group value other
0 a 1 82.0
1 a 2 15.0
2 a 3 NaN
3 a 4 NaN
4 a 5 4.0
5 b 3 95.0
6 b 4 36.0
7 b 5 32.0
8 b 6 NaN
9 b 7 29.0
10 c 4 18.0
11 c 5 NaN
12 c 6 NaN
13 c 7 14.0
14 c 8 NaN
15 c 9 87.0

How to merge two tables while preserving all values?

I am relatively new to Python, and I am wondering how I can merge these two tables while preserving the values of both.
Consider these two tables:
import pandas as pd

df = pd.DataFrame([[1, 3], [2, 4], [2.5, 1], [5, 6], [7, 8]], columns=['A', 'B'])
A B
1 3
2 4
2.5 1
5 6
7 8
df2 = pd.DataFrame([[1],[2],[3],[4],[5],[6],[7],[8]], columns=['A'])
A
1
2
...
8
I want to obtain the following result:
A B
1 3
2 4
2.5 1
3 NaN
4 NaN
5 6
6 NaN
7 8
8 NaN
You can see that column A includes all values from both the first and second dataframe in an ordered manner.
I have attempted:
pd.merge(df,df2,how='outer')
pd.merge(df,df2,how='right')
But the former does not result in an ordered dataframe and the latter does not include rows that are unique to df.
Let us do concat and then drop_duplicates:
out = pd.concat([df2, df]).drop_duplicates('A', keep='last').sort_values('A')
Out[96]:
A B
0 1.0 3.0
1 2.0 4.0
2 2.5 1.0
2 3.0 NaN
3 4.0 NaN
3 5.0 6.0
5 6.0 NaN
4 7.0 8.0
7 8.0 NaN
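As an aside, the OP's outer merge was actually close: adding a sort on A produces the desired order (a sketch):
out = pd.merge(df, df2, how='outer', on='A').sort_values('A', ignore_index=True)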

Appending Pandas DataFrame in a loop

Let's say I have a df such as this:
import pandas as pd

df = pd.DataFrame({'A': [1, 2, 3, 4, 5], 'A_z': [2, 3, 4, 5, 6], 'B': [3, 4, 5, 6, 7],
                   'B_z': [4, 5, 6, 7, 8], 'C': [5, 6, 7, 8, 9], 'C_z': [6, 7, 8, 9, 10]})
Which looks like this:
A A_z B B_z C C_z
0 1 2 3 4 5 6
1 2 3 4 5 6 7
2 3 4 5 6 7 8
3 4 5 6 7 8 9
4 5 6 7 8 9 10
What I'm looking to do is create a new df and, for each letter (A, B, C), append the two columns for that letter vertically, so that it looks like this:
Letter Letter_z
0 1 2
1 2 3
2 3 4
3 4 5
4 5 6
5 3 4
6 4 5
7 5 6
8 6 7
9 7 8
10 5 6
11 6 7
12 7 8
13 8 9
14 9 10
As far as I can tell, something like this should work fine:
new_df = pd.DataFrame()  # assumed empty starting frame
for col in df.columns:
    if col[-1] != 'z':
        new_df = new_df.append(df[[col, col + '_z']])
However this results in the following mess:
A A_z B B_z C C_z
0 1.0 2.0 NaN NaN NaN NaN
1 2.0 3.0 NaN NaN NaN NaN
2 3.0 4.0 NaN NaN NaN NaN
3 4.0 5.0 NaN NaN NaN NaN
4 5.0 6.0 NaN NaN NaN NaN
0 NaN NaN 3.0 4.0 NaN NaN
1 NaN NaN 4.0 5.0 NaN NaN
2 NaN NaN 5.0 6.0 NaN NaN
3 NaN NaN 6.0 7.0 NaN NaN
4 NaN NaN 7.0 8.0 NaN NaN
0 NaN NaN NaN NaN 5.0 6.0
1 NaN NaN NaN NaN 6.0 7.0
2 NaN NaN NaN NaN 7.0 8.0
3 NaN NaN NaN NaN 8.0 9.0
4 NaN NaN NaN NaN 9.0 10.0
What am I doing wrong? Any help would be really appreciated, cheers.
EDIT:
After the kind help from jezrael, the renaming of the columns in his answer got me thinking about a way to do it with my original train of thought.
I can now also achieve the new df I want using the following:
new_df = pd.DataFrame()  # assumed empty starting frame
for col in df:
    if col[-1] != 'z':
        d = df[[col, col + '_z']].copy()  # .copy() avoids a SettingWithCopyWarning
        d.columns = ['Letter', 'Letter_z']
        new_df = new_df.append(d)
The differing column names were clearly what was causing the problem, which is something I wasn't aware of at the time. Hope this helps someone.
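A caveat for current pandas: DataFrame.append was deprecated in 1.4 and removed in 2.0, so on recent versions the same idea is written by collecting the pieces and concatenating once (a sketch under that assumption):
parts = []
for col in df.columns:
    if not col.endswith('_z'):
        d = df[[col, col + '_z']].copy()
        d.columns = ['Letter', 'Letter_z']
        parts.append(d)
new_df = pd.concat(parts, ignore_index=True)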
One idea is to use Series.str.split with expand=True to build a MultiIndex, then rename to avoid NaNs and get the final column names, reshape with DataFrame.stack, sort into the correct order with DataFrame.sort_index, and finally remove the MultiIndex:
df.columns = df.columns.str.split('_', expand=True)
df = df.rename(columns=lambda x:'Letter_z' if x == 'z' else 'Letter', level=1)
df = df.stack(0).sort_index(level=[1,0]).reset_index(drop=True)
print (df)
Letter Letter_z
0 1 2
1 2 3
2 3 4
3 4 5
4 5 6
5 3 4
6 4 5
7 5 6
8 6 7
9 7 8
10 5 6
11 6 7
12 7 8
13 8 9
14 9 10
Or, if possible, simplify the problem by reshaping all non-z values into one column and all z values into another with numpy.ravel (the transpose makes ravel read the data column by column):
m = df.columns.str.endswith('_z')
a = df.loc[:, ~m].to_numpy().T.ravel()
b = df.loc[:, m].to_numpy().T.ravel()
df = pd.DataFrame({'Letter': a, 'Letter_z': b})
print (df)
Letter Letter_z
0 1 2
1 2 3
2 3 4
3 4 5
4 5 6
5 3 4
6 4 5
7 5 6
8 6 7
9 7 8
10 5 6
11 6 7
12 7 8
13 8 9
14 9 10
You can use the function concat and a list comprehension:
cols = df.columns[~df.columns.str.endswith('_z')]
func = lambda x: 'letter_z' if x.endswith('_z') else 'letter'
pd.concat([df.filter(like=i).rename(func, axis=1) for i in cols])
or
cols = df.columns[~df.columns.str.endswith('_z')]
pd.concat([df.filter(like=i).set_axis(['letter', 'letter_z'], axis=1, inplace=False) for i in cols])
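Note that the inplace parameter of set_axis was deprecated in pandas 1.5 and removed in 2.0; on current versions the second variant is simply:
cols = df.columns[~df.columns.str.endswith('_z')]
pd.concat([df.filter(like=i).set_axis(['letter', 'letter_z'], axis=1) for i in cols])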

Move Null rows to the bottom of the dataframe

I have a dataframe:
import numpy as np
import pandas as pd

df1 = pd.DataFrame({'a': [1, 2, 10, np.nan, 5, 6, np.nan, 8],
                    'b': list('abcdefgh')})
df1
a b
0 1.0 a
1 2.0 b
2 10.0 c
3 NaN d
4 5.0 e
5 6.0 f
6 NaN g
7 8.0 h
I would like to move all the rows where a is np.nan to the bottom of the dataframe:
df2 = pd.DataFrame({'a': [1, 2, 10, 5, 6, 8, np.nan, np.nan],
                    'b': list('abcefhdg')})
df2
a b
0 1.0 a
1 2.0 b
2 10.0 c
3 5.0 e
4 6.0 f
5 8.0 h
6 NaN d
7 NaN g
I have tried this:
na = df1[df1.a.isnull()]
df1.dropna(subset=['a'], inplace=True)
df1 = df1.append(na)
df1
Is there a cleaner way to do this? Or is there a function that I can use for this?
New answer (after the OP's edit)
You were close, but you can clean up your code a bit by using the following:
df1 = pd.concat([df1[df1['a'].notnull()], df1[df1['a'].isnull()]], ignore_index=True)
print(df1)
a b
0 1.0 a
1 2.0 b
2 10.0 c
3 5.0 e
4 6.0 f
5 8.0 h
6 NaN d
7 NaN g
Old answer
Use sort_values with the na_position='last' argument. Note that this sorts by the values of a, so the original row order is not preserved (10.0 moves after 8.0):
df1 = df1.sort_values('a', na_position='last')
print(df1)
a b
0 1.0 a
1 2.0 b
4 5.0 e
5 6.0 f
7 8.0 h
2 10.0 c
3 NaN d
6 NaN g
This does not exist in pandas yet; use Series.isna with Series.argsort to get positions, and change the ordering with DataFrame.iloc:
df1 = df1.iloc[df1['a'].isna().argsort()].reset_index(drop=True)
print (df1)
a b
0 1.0 a
1 2.0 b
2 10.0 c
3 5.0 e
4 6.0 f
5 8.0 h
6 NaN d
7 NaN g
Or a pure pandas solution with a helper column and DataFrame.sort_values (kind='stable' guarantees the original order of ties is kept):
df1 = (df1.assign(tmp=df1['a'].isna())
          .sort_values('tmp', kind='stable')
          .drop('tmp', axis=1)
          .reset_index(drop=True))
print (df1)
a b
0 1.0 a
1 2.0 b
2 10.0 c
3 5.0 e
4 6.0 f
5 8.0 h
6 NaN d
7 NaN g
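On pandas 1.1+ the helper column can be skipped by sorting on a key function; a sketch (again, kind='stable' keeps the original order of the non-null rows):
df1 = df1.sort_values('a', key=lambda s: s.isna(), kind='stable').reset_index(drop=True)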

Pandas: Reshape two columns into one row

I want to reshape a pandas DataFrame from two columns into one row:
import numpy as np
import pandas as pd
df_a = pd.DataFrame({'Type': ['A', 'B', 'C', 'D', 'E'], 'Values': [2, 4, 7, 9, 3]})
df_a
Type Values
0 A 2
1 B 4
2 C 7
3 D 9
4 E 3
df_b = df_a.pivot(columns='Type', values='Values')
df_b
Which gives me this:
Type A B C D E
0 2.0 NaN NaN NaN NaN
1 NaN 4.0 NaN NaN NaN
2 NaN NaN 7.0 NaN NaN
3 NaN NaN NaN 9.0 NaN
4 NaN NaN NaN NaN 3.0
But I want it condensed into a single row, like this:
Type A B C D E
0 2.0 4.0 7.0 9.0 3.0
I believe you don't need pivot; the plain DataFrame constructor is better:
df_b = pd.DataFrame([df_a['Values'].values], columns=df_a['Type'].values)
print (df_b)
A B C D E
0 2 4 7 9 3
Or set_index with a transpose via T:
df_b = df_a.set_index('Type').T.rename({'Values': 0})
print (df_b)
Type A B C D E
0 2 4 7 9 3
Another way:
df_a['col'] = 0
df_a.set_index(['col', 'Type'])['Values'].unstack().reset_index().drop('col', axis=1)
Type A B C D E
0 2 4 7 9 3
We can fix your df_b by forward-filling and then keeping only the last row:
df_b.ffill().iloc[[-1],:]
Out[360]:
Type A B C D E
4 2.0 4.0 7.0 9.0 3.0
Or we do:
df_a.assign(key=[0]*len(df_a)).pivot(columns='Type', values='Values', index='key')
Out[366]:
Type A B C D E
key
0 2 4 7 9 3
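For completeness, another way to condense df_b (assuming at most one non-null value per column, as here): a column-wise max skips NaN by default:
df_b.max().to_frame().T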
