Comparing Python data frames

I have two data frames, e.g.,
df_1:
index A B C D
1 2 5 9 12
2 9 8 13 22
3 0 44 3 1
and
df_2:
index A C
1 12 40
2 9 13
3 16 1
4 0 21
I am looking for a way to compare these two data frames. The result should be the rows of df_1 whose values in columns A and C appear together in a row of df_2, e.g.
Final_df:
index A B C D
2 9 8 13 22
I have tried,
Final_df = pd.merge(df_1, df_2, on=['A','C'], how='left', indicator='Exist')
Final_df['Exist'] = np.where(Final_df.Exist == 'both', True, False)
Final_df = Final_df[Final_df['Exist']==True]
But it doesn't give the expected result. Any suggestions would be appreciated!

I think you just want an inner merge:
df_1.merge(df_2, on=['A', 'C'], how='inner')
A B C D
0 9 8 13 22
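Note that the inner merge discards the original index of df_1 (the matching row comes back labelled 0 instead of 2), and it would repeat a row if df_2 contained the same (A, C) pair more than once. A minimal sketch that keeps the original labels, assuming the index is unnamed so that reset_index() names the new column "index"; the names keys and final_df are only illustrative:

```python
import pandas as pd

df_1 = pd.DataFrame(
    {"A": [2, 9, 0], "B": [5, 8, 44], "C": [9, 13, 3], "D": [12, 22, 1]},
    index=[1, 2, 3],
)
df_2 = pd.DataFrame({"A": [12, 9, 16, 0], "C": [40, 13, 1, 21]}, index=[1, 2, 3, 4])

# De-duplicate the key pairs so each row of df_1 can match at most once.
keys = df_2[["A", "C"]].drop_duplicates()

final_df = (
    df_1.reset_index()               # move the index into a column named "index"
        .merge(keys, on=["A", "C"])  # inner merge is the default
        .set_index("index")          # restore the original row labels
)
print(final_df)
```

The only row kept is the one labelled 2 in df_1, which is the row shown in the expected Final_df.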

Related

How to get a scalar product of rows in dataframe with matching indexes

Let's say I have two dataframes with the same columns. The first has a unique index, the second does not:
column1 column2
a 1 2
b 4 5
c 3 3
column1 column2
a 1 2
a 4 5
c 3 3
b 1 2
b 4 5
a 3 3
How can I compute the scalar product of the rows whose index labels match? The result should be a dataframe with one column holding the scalar products (for example, the first row is 1*1+2*2=5) and the same index as the second dataframe:
result
a 5
a 14
c 18
b 14
b 41
a 9
Multiply the DataFrames, then sum along the rows:
df = df2.mul(df1).sum(axis=1).to_frame('result')
print (df)
result
a 5
a 14
a 9
b 14
b 41
c 18
If ordering is important in the output:
df = (df2.assign(a=range(len(df2)))
         .set_index('a', append=True)
         .mul(df1, level=0)
         .sum(axis=1)
         .droplevel(1)
         .to_frame('result'))
print (df)
result
a 5
a 14
c 18
b 14
b 41
a 9
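For reference, here is a self-contained sketch of the same order-preserving computation done a different way: it looks up the df1 row for every label in df2 with reindex and multiplies the underlying arrays. This is an alternative to the MultiIndex approach above, not a replacement for it, and the names aligned and result are only illustrative.

```python
import pandas as pd

df1 = pd.DataFrame({"column1": [1, 4, 3], "column2": [2, 5, 3]}, index=list("abc"))
df2 = pd.DataFrame(
    {"column1": [1, 4, 3, 1, 4, 3], "column2": [2, 5, 3, 2, 5, 3]},
    index=list("aacbba"),
)

# Look up the matching df1 row for every row of df2, in df2's order.
# This relies on df1 having a unique index.
aligned = df1.reindex(index=df2.index, columns=df2.columns)

# Multiply position by position and sum across the columns.
result = pd.DataFrame(
    {"result": (df2.to_numpy() * aligned.to_numpy()).sum(axis=1)},
    index=df2.index,
)
print(result)
```

Working on the NumPy arrays sidesteps index alignment entirely, so the row order of df2 is kept without the temporary helper level.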

How to melt multiple columns into one column?

I have this table:
a b c d e f 19-08-06 19-08-07 19-08-08 g h i
1 2 3 4 5 6 7 8 9 10 11 12
I have 34 date columns, and I want to melt them into a single column.
How can I do this in Python?
Thanks in advance
You can use df.columns.str.fullmatch to create a boolean mask for extracting the date columns, then use df.melt:
m = df.columns.str.fullmatch(r"\d{2}-\d{2}-\d{2}")
cols = df.columns[m]
df.melt(value_vars=cols, var_name='date', value_name='vals')
date vals
0 19-08-06 7
1 19-08-07 8
2 19-08-08 9
If you want to melt while keeping the other columns, try this:
df.melt(
id_vars=df.columns.difference(cols), var_name="date", value_name="vals"
)
a b c d e f g h i date vals
0 1 2 3 4 5 6 10 11 12 19-08-06 7
1 1 2 3 4 5 6 10 11 12 19-08-07 8
2 1 2 3 4 5 6 10 11 12 19-08-08 9
Here I did not use value_vars=cols, as it's done implicitly. From the documentation:
value_vars: tuple, list, or ndarray, optional
Column(s) to unpivot. If not specified, uses all columns that are
not set as id_vars.
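Putting the pieces together, a runnable sketch on the one-row table from the question might look like this. It assumes the date columns always follow the two-digit yy-mm-dd pattern; the names is_date and long_df are only illustrative. Since Index.difference returns a sorted result, the sketch selects the non-date columns with the inverted mask so that their original order is kept.

```python
import pandas as pd

df = pd.DataFrame(
    [[1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12]],
    columns=["a", "b", "c", "d", "e", "f",
             "19-08-06", "19-08-07", "19-08-08", "g", "h", "i"],
)

# Boolean mask: True for column names that look like yy-mm-dd.
is_date = df.columns.str.fullmatch(r"\d{2}-\d{2}-\d{2}")

long_df = df.melt(
    id_vars=list(df.columns[~is_date]),    # columns to keep as-is
    value_vars=list(df.columns[is_date]),  # columns to unpivot
    var_name="date",
    value_name="vals",
)
print(long_df)
```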

Is it possible to batch drop a dataframe's columns with something like slice selection?

For the following dataframe, I want to drop the columns c, d, e, f, g:
a b c d e f g h i j
0 0 1 2 3 4 5 6 7 8 9
1 10 11 12 13 14 15 16 17 18 19
So I use the following code:
import pandas as pd
import numpy as np
df = pd.DataFrame(np.arange(20).reshape(2, 10), columns=list('abcdefghij'))
df.drop(['c', 'd', 'e', 'f', 'g'], axis=1)
The problem is that my real dataframe has many more columns, and I may need to drop a lot of consecutive columns. Is there a quick way to select the columns to drop with something like 'c':'g'?
Use DataFrame.loc to select consecutive column names:
df = df.drop(df.loc[:, 'c':'g'].columns, axis=1)
print (df)
a b h i j
0 0 1 7 8 9
1 10 11 17 18 19
Or use Index.isin:
c = df.loc[:, 'c':'g'].columns
df = df.loc[:, ~df.columns.isin(c)]
If there are multiple consecutive groups, use Index.union to join the values together, then Index.isin, Index.difference or Index.drop:
c1 = df.loc[:, 'c':'g'].columns
c2 = df.loc[:, 'i':'j'].columns
df = df.loc[:, ~df.columns.isin(c1.union(c2))]
print (df)
a b h
0 0 1 7
1 10 11 17
df = pd.DataFrame(np.arange(20).reshape(2, 10), columns=list('wbcdefghij'))
print (df)
w b c d e f g h i j
0 0 1 2 3 4 5 6 7 8 9
1 10 11 12 13 14 15 16 17 18 19
c1 = df.loc[:, 'c':'g'].columns
c2 = df.loc[:, 'i':'j'].columns
#column order may change, because Index.difference sorts its result
df1 = df[df.columns.difference(c1.union(c2))]
print (df1)
b h w
0 1 7 0
1 11 17 10
#ordering is not changed
df2 = df[df.columns.drop(c1.union(c2))]
print (df2)
w b h
0 0 1 7
1 10 11 17
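The range-based dropping above can be wrapped in a small helper. This is only a sketch: drop_column_ranges is a hypothetical name, not a pandas function, and it assumes every start and stop label exists in the frame and that the column labels are unique.

```python
import numpy as np
import pandas as pd


def drop_column_ranges(df, *ranges):
    """Return a copy of df without the columns in each inclusive (start, stop) label range."""
    to_drop = []
    for start, stop in ranges:
        # Label slicing with .loc includes both endpoints.
        to_drop.extend(df.loc[:, start:stop].columns)
    # DataFrame.drop keeps the remaining columns in their original order.
    return df.drop(columns=to_drop)


df = pd.DataFrame(np.arange(20).reshape(2, 10), columns=list("wbcdefghij"))
trimmed = drop_column_ranges(df, ("c", "g"), ("i", "j"))
print(trimmed)
```

Because the helper returns a new frame, the original df is left untouched and can be reused for other selections.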

Dynamically reshape the dataframe in pandas

I have a dataframe with 4 columns and 4 rows. I need to reshape it into 2 columns and 4 rows. The 2 new columns are the sums col1 + col3 and col2 + col4. I do not want to create any other object in memory for it.
I am trying
df['A','B'] = df['A']+df['C'],df['B']+df['D']
Can it be achieved using only the drop function? Is there a simpler method?
The dynamic way of summing two columns at a time is to use groupby:
df.groupby(np.arange(len(df.columns)) % 2, axis=1).sum()
Out[11]:
0 1
0 2 4
1 10 12
2 18 20
3 26 28
You can use rename afterwards if you want to change the column names, but that would require some mapping logic.
Consider the sample dataframe df
df = pd.DataFrame(np.arange(16).reshape(4, 4), columns=list('ABCD'))
df
A B C D
0 0 1 2 3
1 4 5 6 7
2 8 9 10 11
3 12 13 14 15
One line of code
pd.DataFrame(
df.values.reshape(4, 2, 2).transpose(0, 2, 1).sum(2),
columns=df.columns[:2]
)
A B
0 2 4
1 10 12
2 18 20
3 26 28
Another line of code
df.iloc[:, :2] + df.iloc[:, 2:4].values
A B
0 2 4
1 10 12
2 18 20
3 26 28
Yet another
df.assign(A=df.A + df.C, B=df.B + df.D).drop(['C', 'D'], axis=1)
A B
0 2 4
1 10 12
2 18 20
3 26 28
This works for me:
df['A'], df['B'] = df['A'] + df['C'], df['B'] + df['D']
df = df.drop(['C','D'], axis=1)
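One caveat on the groupby answer: in recent pandas releases (2.1 and later, as far as I know) the axis=1 argument of DataFrame.groupby is deprecated. A sketch of the same idea that groups the transposed frame instead, assuming the number of columns is even and that column i should be added to column i + half; the names half, keys and summed are only illustrative:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(16).reshape(4, 4), columns=list("ABCD"))

half = df.shape[1] // 2
# Columns 0 and 2 get key 0, columns 1 and 3 get key 1.
keys = np.arange(df.shape[1]) % half

# Group the rows of the transposed frame, sum, then transpose back.
summed = df.T.groupby(keys).sum().T
# Reuse the names of the first half of the columns for the result.
summed.columns = df.columns[:half]
print(summed)
```

The double transpose costs a copy, so for large frames one of the NumPy-based answers above may be the better fit.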

Setting with enlargement - updating transaction DF

Looking for ways to achieve following updates on a dataframe:
dfb is the base dataframe that I want to update with dft transactions.
Any common index rows should be updated with values from dft.
Indexes only in dft should be appended to dfb.
Looking at the documentation, setting with enlargement looked perfect, but then I realized it only works with a single row. Is it possible to use setting with enlargement for this update, or is there another method you would recommend?
dfb = pd.DataFrame(data={'A': [11,22,33], 'B': [44,55,66]}, index=[1,2,3])
dfb
Out[70]:
A B
1 11 44
2 22 55
3 33 66
dft = pd.DataFrame(data={'A': [0,2,3], 'B': [4,5,6]}, index=[3,4,5])
dft
Out[71]:
A B
3 0 4
4 2 5
5 3 6
# Updated dfb should look like this:
dfb
Out[75]:
A B
1 11 44
2 22 55
3 0 4
4 2 5
5 3 6
You can use combine_first, then convert the float columns back to int with astype:
df = dft.combine_first(dfb).astype(int)
print (df)
A B
1 11 44
2 22 55
3 0 4
4 2 5
5 3 6
Another solution: find the indexes common to both DataFrames with Index.intersection, drop them from the first DataFrame dfb, and then use concat:
idx = dfb.index.intersection(dft.index)
print (idx)
Int64Index([3], dtype='int64')
dfb = dfb.drop(idx)
print (dfb)
A B
1 11 44
2 22 55
print (pd.concat([dfb, dft]))
A B
1 11 44
2 22 55
3 0 4
4 2 5
5 3 6
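A third variant, shown here only as a sketch, concatenates first and then keeps the last occurrence of every index label, so rows from dft win over rows from dfb. It assumes each frame's own index is unique; the names combined and updated are only illustrative.

```python
import pandas as pd

dfb = pd.DataFrame(data={"A": [11, 22, 33], "B": [44, 55, 66]}, index=[1, 2, 3])
dft = pd.DataFrame(data={"A": [0, 2, 3], "B": [4, 5, 6]}, index=[3, 4, 5])

# Stack the base rows first and the transaction rows second.
combined = pd.concat([dfb, dft])

# For repeated labels keep only the last row, which comes from dft.
updated = combined[~combined.index.duplicated(keep="last")].sort_index()
print(updated)
```

Since no missing values are introduced along the way, the integer dtype is kept without an astype call.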
