Compare 2 columns and merge rows on match? - python

New to coding here and trying to make a project. I want to compare two DataFrames, and if any of the rows in the product column match, I want to copy them over to a new DataFrame. The rows in DF1 and DF2 will not be in the same position, so I want to compare, say, row 1 of DF1 against the entire column in DF2. Is there an easy solution to this?

Take a look at this: https://cmdlinetips.com/2018/02/how-to-subset-pandas-dataframe-based-on-values-of-a-column/
You can try:
df3 = df1[df1['Product'].isin(set(df2['Product']))]
Which gives:
>>> df1 = pd.DataFrame({'prod': [1, 2], 'ean': [5, 6]})
>>> df1
   prod  ean
0     1    5
1     2    6
>>> df2 = pd.DataFrame({'prod': [3, 2]})
>>> df2
   prod
0     3
1     2
>>> df1[df1['prod'].isin(set(df2['prod']))]
   prod  ean
1     2    6
To explain:
df1[...] filters the rows of df1 based on the criterion ...
I'm using a set() here so that checking whether a value from df1 appears in df2's "Product" column is fast (constant time per row).
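If you also want the matching rows combined into one frame (the "merge rows on match" part of the title), an inner merge does that in one step. A minimal sketch, assuming both frames share a 'Product' column; the sample data here is made up:

import pandas as pd

df1 = pd.DataFrame({'Product': ['apple', 'pear', 'plum'], 'ean': [5, 6, 7]})
df2 = pd.DataFrame({'Product': ['pear', 'plum'], 'price': [1.2, 0.9]})

# An inner merge keeps only rows whose 'Product' appears in both frames,
# regardless of row position, and combines the columns of both.
df3 = pd.merge(df1, df2, on='Product', how='inner')
print(df3)
#   Product  ean  price
# 0    pear    6    1.2
# 1    plum    7    0.9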

Using Numpy to filter two dataframes

I have two data frames. They are structured like this:
df a:
Letter  ID
A       3
B       4
df b:
Letter  ID  Value
A       3   100
B       4   300
B       4   100
B       4   150
A       3   200
A       3   400
I need to take, for each combo of Letter and ID in df a, the values from df b and run an outlier function on them.
Currently df a has over 40,000 rows and df b has about 4,500,000 rows, and I am doing this:
a['Results'] = a.apply(lambda x: outliers(b[(b['Letter'] == x['Letter']) & (b['ID'] == x['ID'])]['Value'].to_list()), axis=1)
As you can imagine, this is taking forever. Is there some mistake I'm making, or something that can improve this code?
I'd first aggregate every combination of [Letter, ID] in df_b into a list using .groupby, then merge with df_a and apply your outliers function afterwards. Should be faster:
df_a["results"] = df_a.merge(
df_b.groupby(["Letter", "ID"])["Value"].agg(list),
left_on=["Letter", "ID"],
right_index=True,
how="left",
)["Value"].apply(outliers)
print(df_a)
You can also first merge the datasets a and b, then group by Letter and ID and aggregate Value with the outlier function:
pd.merge(a, b, how="inner", on=['Letter', 'ID']).groupby(['Letter', 'ID'])['Value'].agg(outliers).reset_index()
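For reference, here is a self-contained sketch of the groupby-then-merge approach on the sample data above. The IQR-based outliers function is just a placeholder assumption, since the original one isn't shown:

import numpy as np
import pandas as pd

def outliers(values):
    # Placeholder: return values outside 1.5 * IQR (the real function isn't shown).
    q1, q3 = np.percentile(values, [25, 75])
    iqr = q3 - q1
    return [v for v in values if v < q1 - 1.5 * iqr or v > q3 + 1.5 * iqr]

df_a = pd.DataFrame({"Letter": ["A", "B"], "ID": [3, 4]})
df_b = pd.DataFrame({"Letter": ["A", "B", "B", "B", "A", "A"],
                     "ID": [3, 4, 4, 4, 3, 3],
                     "Value": [100, 300, 100, 150, 200, 400]})

# One groupby pass over df_b instead of one full scan per row of df_a.
grouped = df_b.groupby(["Letter", "ID"])["Value"].agg(list)
df_a["Results"] = df_a.merge(grouped, left_on=["Letter", "ID"],
                             right_index=True, how="left")["Value"].apply(outliers)
print(df_a)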

Add multi level column to dataframe

At the beginning, I'd like to add a multilevel column to an empty dataframe.
df = pd.DataFrame({"nodes": list(range(1, 5, 2))})
df.set_index("nodes", inplace=True)
So this is the dataframe to start with (still empty):
>>> df
nodes
1
3
Now I'd like to add a first multilevel column.
I tried the following:
new_df = pd.DataFrame.from_dict(dict(zip(df.index, [1, 2])), orient="index",
                                columns=["value"])
df = pd.concat([new_df], axis=1, keys=["test"])
Now the dataframe df looks like this:
>>> df
    test
   value
1      1
3      2
To add another column, I've done something similar.
new_df2 = pd.DataFrame.from_dict(dict(zip(df.index, [3, 4])), orient="index",
                                 columns=[("test2", "value2")])
df = pd.concat([df, new_df2], axis=1)
df.index.name = "nodes"
So the desired dataframe looks like this:
>>> df
       test   test2
      value  value2
nodes
1         1       3
3         2       4
This way of adding multilevel columns seems a bit strange. Is there a better way of doing so?
Create a MultiIndex on the columns by storing your DataFrames in a dict, then concat along axis=1. The keys of the dict become levels of the column MultiIndex (tuple keys add multiple levels, depending on their length; scalar keys add a single level), and the DataFrame columns stay as they are. Alignment is enforced on the row Index.
import pandas as pd

d = {}
d[('foo', 'bar')] = pd.DataFrame({'val': [1, 2, 3]}).rename_axis(index='nodes')
d[('foo2', 'bar2')] = pd.DataFrame({'val2': [4, 5, 6]}).rename_axis(index='nodes')
d[('foo2', 'bar1')] = pd.DataFrame({'val2': [7, 8, 9]}).rename_axis(index='nodes')
pd.concat(d, axis=1)
      foo  foo2
      bar  bar2  bar1
      val  val2  val2
nodes
0       1     4     7
1       2     5     8
2       3     6     9
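With scalar dict keys, the same pattern adds a single extra level; this variant reproduces the desired dataframe from the question above (a quick sketch):

d = {}
idx = pd.Index([1, 3], name="nodes")
d["test"] = pd.DataFrame({"value": [1, 2]}, index=idx)
d["test2"] = pd.DataFrame({"value2": [3, 4]}, index=idx)
df = pd.concat(d, axis=1)
print(df)
#        test  test2
#       value value2
# nodes
# 1         1      3
# 3         2      4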

Get only matching rows for groups in Pandas groupby

I have the following df:
d = {"Col1":['a','d','b','c','a','d','b','c'],
"Col2":['x','y','x','z','x','y','z','y'],
"Col3":['n','m','m','l','m','m','l','l'],
"Col4":[1,4,2,2,1,4,2,2]}
df = pd.DataFrame(d)
When I groupby on three fields, I get the result:
gb = df.groupby(['Col1', 'Col2', 'Col3'])['Col4'].agg(['sum', 'mean'])
How can I extract only the groups and rows where a row of one group matches at least one row of a different group on a subset of the grouped columns? Here, I want the rows whose (Col2, Col3) pair also appears in another group (the original question highlighted these rows in a screenshot).
Apologies if my statement is ambiguous. Any help would be appreciated.
You can reset_index, then use duplicated and a boolean mask to filter your dataframe:
gb = gb.reset_index()
gb[gb.duplicated(subset=['Col2', 'Col3'], keep=False)]
Output:
  Col1 Col2 Col3  sum  mean
0    a    x    m    1     1
2    b    x    m    2     2
3    b    z    l    2     2
5    c    z    l    2     2
Alternatively, make a table with all allowed combinations and then inner join it with this dataframe, as sketched below.
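One reading of that suggestion, assuming "allowed combinations" means the (Col2, Col3) pairs that occur in more than one group; a sketch:

import pandas as pd

d = {"Col1": ['a', 'd', 'b', 'c', 'a', 'd', 'b', 'c'],
     "Col2": ['x', 'y', 'x', 'z', 'x', 'y', 'z', 'y'],
     "Col3": ['n', 'm', 'm', 'l', 'm', 'm', 'l', 'l'],
     "Col4": [1, 4, 2, 2, 1, 4, 2, 2]}
gb = (pd.DataFrame(d)
        .groupby(["Col1", "Col2", "Col3"])["Col4"]
        .agg(["sum", "mean"])
        .reset_index())

# Table of (Col2, Col3) pairs that appear in more than one group ...
allowed = (gb.groupby(["Col2", "Col3"]).size()
             .reset_index(name="n")
             .query("n > 1")[["Col2", "Col3"]])

# ... and an inner join keeps only the rows whose pair is in that table.
print(gb.merge(allowed, on=["Col2", "Col3"], how="inner"))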

Sum of columns from two data frames that contain float values

I have two data frames.
The column names are the same in both data frames.
I want to sum the float values of the matching columns from the two dataframes, so I can use
df3 = df1.add(df2)
However, my dataframes also contain two columns of strings, and these strings get added (concatenated) too.
How can I write the code so that it does not add the strings, but only the floats in the two data frames?
The two sample dataframes are as follows:
df1 = pd.DataFrame(dict(Team=['A','B','C','D'],Value=[1,2,3,4]),index=[0,1,2,3])
df2 = pd.DataFrame(dict(Team=['A','B','C','D'],Value=[3,1,2,4]),index=[0,1,2,3])
When I used df3 = df1.add(df2), it also added the strings in column "Team", as follows:
  Team  Value
0   AA      4
1   BB      3
2   CC      5
3   DD      8
How can I write code that adds the Value but not the Team?
Thanks,
Zep
Use the team names as indices instead of integer indices:
In [2]: df1 = pd.DataFrame(dict(Team=['A','B','C','D'], Value=[1,2,3,4])).set_index('Team')
   ...: df2 = pd.DataFrame(dict(Team=['A','B','C','D'], Value=[3,1,2,4])).set_index('Team')

In [3]: df1 + df2
Out[3]:
      Value
Team
A         4
B         3
C         5
D         8
In case you have multiple other columns, just sum the columns:
total = df1['Value'] + df2['Value']
If, in addition, you need a dataframe of the same shape as df1 and df2 with Value replaced by the sum, you can do
df3 = df1.copy()
df3['Value'] = total
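If there are several numeric columns mixed in with the string columns, another option (a sketch; the extra 'Score' column here is invented for illustration) is to select the numeric columns with select_dtypes and add only those:

import pandas as pd

df1 = pd.DataFrame(dict(Team=['A', 'B', 'C', 'D'], Value=[1, 2, 3, 4], Score=[10, 20, 30, 40]))
df2 = pd.DataFrame(dict(Team=['A', 'B', 'C', 'D'], Value=[3, 1, 2, 4], Score=[5, 5, 5, 5]))

df3 = df1.copy()
num_cols = df1.select_dtypes(include='number').columns
# Add only the numeric columns; the string columns are carried over from df1.
df3[num_cols] = df1[num_cols] + df2[num_cols]
print(df3)
#   Team  Value  Score
# 0    A      4     15
# 1    B      3     25
# 2    C      5     35
# 3    D      8     45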

Matching the column names of two pandas data-frames in python

I have two pandas dataframes with names df1 and df2 such that
df1:
   a  b  c  d
   1  2  3  4
   5  6  7  8
and
df2:
   b   c
   12  13
I want the result to be like
result:
   b  c
   2  3
   6  7
Here it should be noted that a, b, c, d are the column names of the pandas dataframe. The shapes and values of the two dataframes are different. I want to match the column names of df2 with those of df1 and select all rows of df1 in the columns whose headers match the column names of df2; df2 is only used to select specific columns of df1, keeping all rows. I tried the code below, but it gives me an empty index:
df1.columns.intersection(df2.columns)
This gives the matching headers but no values. I want code that takes my two dataframes as input and compares the column headers to make the selection; I don't want to hard-code the column names.
I believe you need:
df = df1[df1.columns.intersection(df2.columns)]
Or, as @Zero pointed out in the comments:
df = df1[df1.columns & df2.columns]
(Note that & as a set operation on Index objects is deprecated in newer pandas versions, so prefer intersection.)
Or use reindex:
In [594]: df1.reindex(columns=df2.columns)
Out[594]:
   b  c
0  2  3
1  6  7
or, equivalently:
In [595]: df1.reindex(df2.columns, axis=1)
Out[595]:
   b  c
0  2  3
1  6  7
Alternatively to intersection, you can use a boolean mask over the columns (note the .loc: a plain df1[...] with a boolean array would try to filter rows and raise a length error):
df = df1.loc[:, df1.columns.isin(df2.columns)]
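Since the question asks for something that takes the two dataframes as input with no hard-coded column names, any of these one-liners can be wrapped in a small helper; a minimal sketch (the function name is made up):

import pandas as pd

def select_shared_columns(df1, df2):
    # All rows of df1, restricted to the columns that also appear in df2.
    return df1[df1.columns.intersection(df2.columns)]

df1 = pd.DataFrame({'a': [1, 5], 'b': [2, 6], 'c': [3, 7], 'd': [4, 8]})
df2 = pd.DataFrame({'b': [12], 'c': [13]})
print(select_shared_columns(df1, df2))
#    b  c
# 0  2  3
# 1  6  7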
