Pandas: compare one column's values against another dataframe's column and find matching rows - python

I am reading a SQL table of events and alarms from a database (df1), and I have a txt file of alarm codes and properties to watch for (df2). I want to check each value in one column of df2 against an entire column of df1 and output every matching row, in full, into another dataframe, df3.
df1
     A   B  C  D
0  100  20  1  1
1  101  30  1  1
2  102  21  2  3
3  103  15  2  3
4  104  40  2  3
df2
    0  1    2    3    4
0  21  2    2    3    3
1  40  0  NaN  NaN  NaN
Output the entire rows of df1 whose column B value matches any value in df2 column 0, into df3.
df3
     A   B  C  D
0  102  21  2  3
1  104  40  2  3
I was able to get single results using:
df1[df1['B'] == df2.iloc[0,0]]
But I need something that will do this on a larger scale.

Method 1: merge
Use merge with left_on='B' and right_on='0', then select only the df1 columns:
df1.merge(df2, left_on='B', right_on='0')[df1.columns]
A B C D
0 102 21 2 3
1 104 40 2 3
Method 2: loc
Alternatively use loc to find rows in df1 where B has a match in df2 column 0 using .isin:
df1.loc[df1.B.isin(df2['0'])]
A B C D
2 102 21 2 3
4 104 40 2 3
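For anyone who wants to try this end to end, here is a minimal sketch that rebuilds the sample frames from the question and applies the .isin approach. It assumes df2 was read without a header, so its column labels are the integers 0 to 4 rather than the string '0' used in the snippets above; adjust the label to whatever your file actually produces. The names codes and df3 are just illustrative.

```python
import pandas as pd

# Sample data from the question
df1 = pd.DataFrame({
    "A": [100, 101, 102, 103, 104],
    "B": [20, 30, 21, 15, 40],
    "C": [1, 1, 2, 2, 2],
    "D": [1, 1, 3, 3, 3],
})

# Assumption: integer column labels, as produced by reading a file with header=None
nan = float("nan")
df2 = pd.DataFrame([[21, 2, 2, 3, 3], [40, 0, nan, nan, nan]])

# Collect the codes to watch for (drop missing values and repeats)
codes = df2[0].dropna().unique()

# Keep every df1 row whose B value is one of the codes, then renumber the rows
df3 = df1[df1["B"].isin(codes)].reset_index(drop=True)
print(df3)
```

One difference worth knowing: if df2 lists the same code more than once, merge will repeat the matching df1 rows, while .isin returns each df1 row at most once.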

Related

How to get a scalar product of rows in dataframe with matching indexes

Let's say I have two dataframes with same columns, first one has unique index, second has not unique index,
column1 column2
a 1 2
b 4 5
c 3 3
column1 column2
a 1 2
a 4 5
c 3 3
b 1 2
b 4 5
a 3 3
Now how can I make a scalar product of rows where index match, the result would be a dataframe with one column (with values of scalar product, for example first row: 1*1+2*2=5) and index as in second dataframe:
result
a 5
a 14
c 18
b 14
b 41
a 9
Multiply and then sum the DataFrames:
df = df2.mul(df1).sum(axis=1).to_frame('result')
print (df)
result
a 5
a 14
a 9
b 14
b 41
c 18
If ordering is important in the output:
df = (df2.assign(a=range(len(df2)))
         .set_index('a', append=True)
         .mul(df1, level=0)
         .sum(axis=1)
         .droplevel(1)
         .to_frame('result'))
print (df)
result
a 5
a 14
c 18
b 14
b 41
a 9
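As a cross-check, here is a sketch of a different way to get the same numbers while keeping df2's row order: look up the matching df1 row for every label in df2 with .loc, then multiply the raw arrays row by row. This is not the approach from the answer above, just an alternative, and it assumes df1's index is unique and contains every label that appears in df2. The names aligned and out are illustrative.

```python
import pandas as pd

# Sample data from the question
df1 = pd.DataFrame(
    {"column1": [1, 4, 3], "column2": [2, 5, 3]},
    index=["a", "b", "c"],
)
df2 = pd.DataFrame(
    {"column1": [1, 4, 3, 1, 4, 3], "column2": [2, 5, 3, 2, 5, 3]},
    index=["a", "a", "c", "b", "b", "a"],
)

# One df1 row per df2 label, in df2's order (requires a unique df1 index)
aligned = df1.loc[df2.index]

# Row-wise dot product on the underlying arrays, so no index alignment is involved
products = (df2.to_numpy() * aligned.to_numpy()).sum(axis=1)
out = pd.DataFrame({"result": products}, index=df2.index)
print(out)
```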

How to apply a command to multiple column elements?

I have the tables below and would like to apply one command to compare and eliminate duplicate values in rows n and n + 1 in multiple dataframes (df1, df2).
Suggested command: .diff().ne(0)
How can I apply this command only to the elements of columns A, C and D, using def, lambda or apply?
df1:
A   B
22  33
22  4
3   55
1   55
df2:
C   D
5   2.3
45  33
7   33
7   11
The expected output is:
df1:
A    B
22   33
NaN  4
3    55
1    55
df2:
C    D
5    2.3
45   33
7    NaN
NaN  11
The other desired option would be to delete the duplicated lines, keeping the first number.
df1:
A            B
22           33
row deleted  row deleted
3            55
row deleted  row deleted
df2:
C            D
5            2.3
45           33
row deleted  row deleted
row deleted  row deleted
Based on this answer, you can create a mask for a single column in your dataframe (here for example for column A) with
mask1 = df['A'].shift() == df['A']
Since this shows True if there was a duplicate, you need to slice the DataFrame with the negation of the mask
df = df[~mask1]
To do this for multiple columns, make a mask for each column and use NumPy's logical_or to combine the masks. Then slice df with the final mask.
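A rough sketch of the multi-column version, assuming the sample df1 from the question and a hypothetical list cols naming the columns to check. np.logical_or.reduce combines any number of boolean arrays in one call.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"A": [22, 22, 3, 1], "B": [33, 4, 55, 55]})

cols = ["A", "B"]  # hypothetical choice: the columns to check for repeats

# One boolean array per column: True where the value equals the one in the row above
masks = [(df[c].shift() == df[c]).to_numpy() for c in cols]

# True where at least one of the checked columns repeats the previous row
dup_any = np.logical_or.reduce(masks)

# Keep only the rows where no checked column repeats
filtered = df[~dup_any]
print(filtered)
```

With this data, rows 1 and 3 are dropped because A repeats in row 1 and B repeats in row 3.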
With your suggested command, .diff().ne(0) (or .diff().eq(0)):
Option 1: set NaN to duplicate values
# For 1 column
df1.loc[df1['A'].diff().eq(0), 'A'] = np.nan
print(df1)
A B
0 22.0 33
1 NaN 4
2 3.0 55
3 1.0 55
# For multiple columns
df2 = df2.apply(lambda x: x[x.diff().ne(0)])
print(df2)
C D
0 5.0 2.3
1 45.0 33.0
2 7.0 NaN
3 NaN 11.0
Option 2: delete rows
>>> df1[df1.diff().ne(0).all(axis=1)]
A B
0 22 33
2 3 55
>>> df2[df2.diff().ne(0).all(axis=1)]
C D
0 5 2.3
1 45 33.0
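Since the question asked for something reusable with def, here is one possible sketch that wraps Option 1 in a small helper so the same call works on any dataframe and any subset of columns. The function name blank_consecutive_duplicates is made up for this example, and it assumes the chosen columns are numeric (which .diff() needs).

```python
import pandas as pd

def blank_consecutive_duplicates(df, cols):
    """Return a copy of df where, in each listed column, a value equal to
    the one directly above it is replaced by NaN."""
    out = df.copy()
    for c in cols:
        # diff() is NaN on the first row and 0 where a value repeats,
        # so ne(0) keeps the first row and every changed value
        out[c] = out[c].where(out[c].diff().ne(0))
    return out

df1 = pd.DataFrame({"A": [22, 22, 3, 1], "B": [33, 4, 55, 55]})
df2 = pd.DataFrame({"C": [5, 45, 7, 7], "D": [2.3, 33, 33, 11]})

df1_masked = blank_consecutive_duplicates(df1, ["A"])       # B is left untouched
df2_masked = blank_consecutive_duplicates(df2, ["C", "D"])
print(df1_masked)
print(df2_masked)
```

Column B keeps its repeated 55 because it is not in the list passed for df1, which matches the expected output in the question.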

Map two pandas dataframe and add a column to the first dataframe

I have posted two sample dataframes. I would like to map the values in one column of the first dataframe to the index of the second dataframe, and place the looked-up values back into the first dataframe, as shown below.
A = np.array([0,1,1,3,5,2,5,4,2,0])
B = np.array([55,75,86,98,100,111])
df1 = pd.Series(A, name='data').to_frame()
df2 = pd.Series(B, name='values_for_replacement').to_frame()
The below is the first dataframe df1
data
0 0
1 1
2 1
3 3
4 5
5 2
6 5
7 4
8 2
9 0
And the below is the second dataframe df2
values_for_replacement
0 55
1 75
2 86
3 98
4 100
5 111
The below is the output needed (Mapped with respect to the index of the df2)
data new_data
0 0 55
1 1 75
2 1 75
3 3 98
4 5 111
5 2 86
6 5 111
7 4 100
8 2 86
9 0 55
I would like to know how to achieve this using pandas functions such as map.
Looking forward to some answers. Many thanks in advance.
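One way this could be done, sketched with the sample data above: Series.map accepts another Series and looks each value up in that Series' index. This assumes every value in data exists in df2's index; values without a match would come back as NaN.

```python
import numpy as np
import pandas as pd

A = np.array([0, 1, 1, 3, 5, 2, 5, 4, 2, 0])
B = np.array([55, 75, 86, 98, 100, 111])
df1 = pd.Series(A, name="data").to_frame()
df2 = pd.Series(B, name="values_for_replacement").to_frame()

# Each value in df1['data'] is treated as a label in df2's index
df1["new_data"] = df1["data"].map(df2["values_for_replacement"])
print(df1)
```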

Pandas: Merging 2 different size dataframes with different columns along 1 shared column

I've seen other answered questions similar to this one, but to my knowledge I have yet to find a response that does exactly what I am looking for. I have 2 pandas dataframes: df1 which has 3 columns-ID, A, and B; and df2 which has 4 columns-ID, C, D, and E.
df1 has the following rows:
ID A B
0 1 200 0.5
1 1 201 0.5
2 2 99 1.1
And df2 has the following rows:
ID C D E
0 1 50 1.1250 0
1 1 52 1.1300 0
2 1 50 1.1200 0
3 2 25 0.6667 20
4 2 24 0.6667 20
I want to merge df1 and df2 on the ID column such that if a pair of rows from each dataframe has a matching ID, we combine them into a single row. Notice that the dataframes are not the same size. If one dataframe has a row with no more available matches from the other dataframe, then we fill in the missing data with NaN. How can I accomplish this merge in pandas?
So far, I have tried variations of the function pd.merge(df1, df2, on='ID', how='...'), but no matter if I put how= 'left', 'right', 'outer', or 'inner', I get a wrong result which is a dataframe with 8 rows. Below is the desired result.
Desired result:
ID A B C D E
0 1 200 0.5 50 1.1250 0
1 1 201 0.5 52 1.1300 0
2 1 NaN NaN 50 1.1200 0
3 2 99 1.1 25 0.6667 20
4 2 NaN NaN 24 0.6667 20
You need to number the rows within each ID using groupby on ID and cumcount, so that the first ID 1 row in df1 joins with the first ID 1 row in df2, the second with the second, and so on for every ID in both dataframes. Then merge on both ID and key with how='outer'.
df1k = df1.assign(key=df1.groupby('ID').cumcount())
df2k = df2.assign(key=df2.groupby('ID').cumcount())
df_out = df1k.merge(df2k, on=['ID','key'], how='outer').sort_values('ID')
Output:
ID A B key C D E
0 1 200.0 0.5 0 50 1.1250 0
1 1 201.0 0.5 1 52 1.1300 0
3 1 NaN NaN 2 50 1.1200 0
2 2 99.0 1.1 0 25 0.6667 20
4 2 NaN NaN 1 24 0.6667 20
You can also drop the 'key' column:
df_out.drop('key', axis=1)
Output:
ID A B C D E
0 1 200.0 0.5 50 1.1250 0
1 1 201.0 0.5 52 1.1300 0
3 1 NaN NaN 50 1.1200 0
2 2 99.0 1.1 25 0.6667 20
4 2 NaN NaN 24 0.6667 20
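Here is a self-contained sketch of the same idea with the sample data typed in, in case you want to run it directly. It sorts on both ID and key and resets the index so the row order does not depend on how a particular pandas version orders outer-merge results; that extra sorting is an addition for this example, not part of the answer above.

```python
import pandas as pd

df1 = pd.DataFrame({"ID": [1, 1, 2], "A": [200, 201, 99], "B": [0.5, 0.5, 1.1]})
df2 = pd.DataFrame({
    "ID": [1, 1, 1, 2, 2],
    "C": [50, 52, 50, 25, 24],
    "D": [1.1250, 1.1300, 1.1200, 0.6667, 0.6667],
    "E": [0, 0, 0, 20, 20],
})

# Number the rows within each ID: 0, 1, 2, ... per group
df1k = df1.assign(key=df1.groupby("ID").cumcount())
df2k = df2.assign(key=df2.groupby("ID").cumcount())

df_out = (
    df1k.merge(df2k, on=["ID", "key"], how="outer")
        .sort_values(["ID", "key"])
        .drop(columns="key")
        .reset_index(drop=True)
)
print(df_out)
```

Because each (ID, key) pair occurs at most once per frame, the merge pairs rows one to one instead of producing the 8-row cross product that merging on ID alone gives.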

Setting with enlargement - updating transaction DF

Looking for ways to achieve following updates on a dataframe:
dfb is the base dataframe that I want to update with dft transactions.
Any common index rows should be updated with values from dft.
Indexes only in dft should be appended to dfb.
Looking at the documentation, setting with enlargement looked perfect but then I realized it only worked with a single row. Is it possible to use setting with enlargement to do this update or is there another method that could be recommended?
dfb = pd.DataFrame(data={'A': [11,22,33], 'B': [44,55,66]}, index=[1,2,3])
dfb
Out[70]:
A B
1 11 44
2 22 55
3 33 66
dft = pd.DataFrame(data={'A': [0,2,3], 'B': [4,5,6]}, index=[3,4,5])
dft
Out[71]:
A B
3 0 4
4 2 5
5 3 6
# Updated dfb should look like this:
dfb
Out[75]:
A B
1 11 44
2 22 55
3 0 4
4 2 5
5 3 6
You can use combine_first, which gives priority to the values in dft and fills the remaining rows from dfb, and then convert any float columns back to int with astype:
dft = dft.combine_first(dfb).astype(int)
print (dft)
A B
1 11 44
2 22 55
3 0 4
4 2 5
5 3 6
Another solution is to find the indexes common to both DataFrames with Index.intersection, drop them from the first DataFrame dfb, and then use concat:
idx = dfb.index.intersection(dft.index)
print (idx)
Int64Index([3], dtype='int64')
dfb = dfb.drop(idx)
print (dfb)
A B
1 11 44
2 22 55
print (pd.concat([dfb, dft]))
A B
1 11 44
2 22 55
3 0 4
4 2 5
5 3 6
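A third variation, offered only as a sketch: concatenate first and then drop the older copy of any repeated index label with Index.duplicated(keep='last'). It assumes neither frame has duplicate index labels of its own, and that dft should win wherever both frames share a label. The names combined and updated are illustrative.

```python
import pandas as pd

dfb = pd.DataFrame({"A": [11, 22, 33], "B": [44, 55, 66]}, index=[1, 2, 3])
dft = pd.DataFrame({"A": [0, 2, 3], "B": [4, 5, 6]}, index=[3, 4, 5])

# Stack base rows first and transaction rows last
combined = pd.concat([dfb, dft])

# For a repeated label keep only the last occurrence, i.e. the row from dft
updated = combined[~combined.index.duplicated(keep="last")].sort_index()
print(updated)
```

This leaves dfb and dft unchanged and keeps the integer dtypes, since no NaN values are ever introduced.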
