Python Pandas: How to merge based on an "OR" condition?

Let's say I have two dataframes, and the column names for both are:
table 1 columns:
[ShipNumber, TrackNumber, ShipDate, Quantity, Weight]
table 2 columns:
[ShipNumber, TrackNumber, AmountReceived]
I want to merge the two tables based on both ShipNumber and TrackNumber.
However, if I simply use merge in the following way (pseudo code, not real code):
tab1.merge(tab2, "left", on=['ShipNumber','TrackNumber'])
then, that means the values in both ShipNumber and TrackNumber columns from both tables MUST MATCH.
However, in my case, sometimes the ShipNumber column values will match, sometimes the TrackNumber column values will match; as long as one of the two values match for a row, I want the merge to happen.
In other words, if row 1 ShipNumber in tab 1 matches row 3 ShipNumber in tab 2, but the TrackNumber in two tables for the two records do not match, I still want to match the two rows from the two tables.
So basically this is an either/or match condition (pseudo code):
if tab1.ShipNumber == tab2.ShipNumber OR tab1.TrackNumber == tab2.TrackNumber:
then merge
I hope my question makes sense...
Any help is really really appreciated!
As suggested, I looked into this post:
Python pandas merge with OR logic
But I don't think it is quite the same issue, as the OP from that post has a mapping file, so they can simply do two merges to solve it. I don't have a mapping file; rather, I have two DataFrames with the same key columns (ShipNumber, TrackNumber).

Use merge() and concat(). Then drop any duplicate cases where both A and B match (thanks @Scott Boston for that final step).
df1 = pd.DataFrame({'A':[3,2,1,4], 'B':[7,8,9,5]})
df2 = pd.DataFrame({'A':[1,5,6,4], 'B':[4,1,8,5]})
  df1           df2
   A  B          A  B
0  3  7       0  1  4
1  2  8       1  5  1
2  1  9       2  6  8
3  4  5       3  4  5
With these data frames we should see:
df1.loc[2] matches A on df2.loc[0]
df1.loc[1] matches B on df2.loc[2]
df1.loc[3] matches both A and B on df2.loc[3]
We'll use suffixes to keep track of what matched where:
suff_A = ['_on_A_match_1', '_on_A_match_2']
suff_B = ['_on_B_match_1', '_on_B_match_2']
df = pd.concat([df1.merge(df2, on='A', suffixes=suff_A),
                df1.merge(df2, on='B', suffixes=suff_B)])
A A_on_B_match_1 A_on_B_match_2 B B_on_A_match_1 B_on_A_match_2
0 1.0 NaN NaN NaN 9.0 4.0
1 4.0 NaN NaN NaN 5.0 5.0
0 NaN 2.0 6.0 8.0 NaN NaN
1 NaN 4.0 4.0 5.0 NaN NaN
Note that the second and fourth rows are duplicate matches (for both data frames, A = 4 and B = 5). We need to remove one of those sets.
dups = (df.B_on_A_match_1 == df.B_on_A_match_2) # also could remove A_on_B_match
df.loc[~dups]
A A_on_B_match_1 A_on_B_match_2 B B_on_A_match_1 B_on_A_match_2
0 1.0 NaN NaN NaN 9.0 4.0
0 NaN 2.0 6.0 8.0 NaN NaN
1 NaN 4.0 4.0 5.0 NaN NaN
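For completeness, here is a rough sketch of the same idea applied to the question's actual column names; the ShipNumber/TrackNumber values below are invented purely for illustration:
import pandas as pd

# Invented example data; only the column names come from the question.
tab1 = pd.DataFrame({'ShipNumber': ['S1', 'S2', 'S3'],
                     'TrackNumber': ['T1', 'T2', 'T3'],
                     'Quantity': [10, 20, 30]})
tab2 = pd.DataFrame({'ShipNumber': ['S1', 'S9', 'S3'],
                     'TrackNumber': ['T9', 'T2', 'T3'],
                     'AmountReceived': [100, 200, 300]})

suff_ship = ['_ship_match_1', '_ship_match_2']
suff_track = ['_track_match_1', '_track_match_2']

merged = pd.concat([tab1.merge(tab2, on='ShipNumber', suffixes=suff_ship),
                    tab1.merge(tab2, on='TrackNumber', suffixes=suff_track)])

# Drop the duplicated rows where both keys matched, as above.
dups = merged['TrackNumber_ship_match_1'] == merged['TrackNumber_ship_match_2']
print(merged.loc[~dups])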

I would suggest this alternative way of doing the merge; it seems simpler to me.
table1["id_to_be_merged"] = table1.apply(
lambda row: row["ShipNumber"] if pd.notnull(row["ShipNumber"]) else row["TrackNumber"], axis=1)
You can add the same column in table2 as well if needed, and then use it in left_on or right_on based on your requirement.
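A minimal end-to-end sketch of that idea, with invented data and Series.where used as a vectorized stand-in for the apply above. Note this only helps when one of the key columns may be missing; it is not the full either/or match handled by the concat approach above.
import pandas as pd
import numpy as np

# Invented example data; only the column names come from the question.
table1 = pd.DataFrame({'ShipNumber': ['S1', np.nan, 'S3'],
                       'TrackNumber': ['T1', 'T2', 'T3'],
                       'Quantity': [10, 20, 30]})
table2 = pd.DataFrame({'ShipNumber': ['S1', np.nan, np.nan],
                       'TrackNumber': ['T9', 'T2', 'T3'],
                       'AmountReceived': [100, 200, 300]})

# Surrogate key: ShipNumber when present, otherwise TrackNumber (same logic as the apply above).
for t in (table1, table2):
    t['id_to_be_merged'] = t['ShipNumber'].where(t['ShipNumber'].notna(), t['TrackNumber'])

merged = table1.merge(table2, how='left', on='id_to_be_merged', suffixes=('_t1', '_t2'))
print(merged)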


Find the single non-nan value in a multi-indexed dataframe

EDIT: I noticed that I simplified my problem too much. This is probably because I assumed that the proposed solutions would work in a similar way to my original brute-force solution. I changed the MultiIndex to better show my problem. My apologies to those who have already put effort into it, thank you so much!
I have a pandas dataframe with multi-indexed columns. Let's say the column index has three levels and the second level contains the name of a color. I know that in each row, all columns that have blue in the index contain NaN except a single one, so it looks like this:
import pandas as pd
import numpy as np
iterables = [['bar', 'baz', 'foo', 'qux'], ["red", "blue", "green"], ['one', 'two']]
mi = pd.MultiIndex.from_product(iterables)
df = pd.DataFrame(np.random.randn(5, 24), columns=mi)
df[("bar", "blue","one")] = [2 , np.nan, np.nan, 3 , np.nan]
df[("baz", "blue","two")] = [np.nan, 4.4 , np.nan, np.nan, 5 ]
df[("qux", "blue","one")] = [np.nan, np.nan, 1 , np.nan, np.nan]
Output:
bar ... qux
red blue green ... red blue green
one two one two one two ... one two one two one two
0 0.046326 -0.999092 2.0 0.073113 0.958438 0.276653 ... -0.258202 -0.772636 NaN -0.639735 1.438262 -0.033578
1 0.257776 -2.499286 NaN 0.854263 -0.037380 -0.571258 ... 1.656198 -1.110911 NaN 0.757692 0.498118 1.070371
2 -0.314146 0.941367 NaN 0.265850 -0.153231 -1.092106 ... -0.208089 -0.363624 1.0 0.046457 -2.158373 0.572496
3 -1.198977 0.605490 3.0 -0.790985 0.000563 -0.958261 ... 1.339086 -1.057270 NaN -0.355639 1.050980 -1.727684
4 -0.562230 -1.721894 NaN 0.856543 -1.137364 1.185481 ... 0.986215 1.028128 NaN -0.264889 0.571484 -0.505340
Now I want to create a new dataframe that contains, for each row, the non-NaN value from the blue columns, and that also names the other levels of that MultiIndex.
word number blue
0 bar one 2.0
1 baz two 4.4
2 qux one 1.0
3 bar one 3.0
4 baz two 5.0
i.e. the word and number entries of the new dataframe should be the index levels in which the original dataframe had the non-NaN value, and the new blue column should contain those values.
I have a brute-force solution where I iterate over basically every entry, but my final dataframe will contain around 2000 columns, so that approach takes very long to run.
Select the blue columns with DataFrame.xs, then reshape with DataFrame.stack, remove the first MultiIndex level with reset_index(drop=True), and finally convert the Series to a two-column DataFrame with Series.rename_axis and Series.reset_index:
df = (df.xs('blue', axis=1, level=1)
        .stack()
        .reset_index(level=0, drop=True)
        .rename_axis('number')
        .reset_index(name='blue'))
print (df)
number blue
0 1 2.0
1 2 4.4
2 3 1.0
3 1 3.0
4 2 5.0
EDIT: The solution is similar, only first filter the columns that contain at least one NaN with DataFrame.isna and DataFrame.any via DataFrame.loc, and then use DataFrame.stack on both remaining MultiIndex levels:
df1 = (df.loc[:, df.isna().any()]
         .xs('blue', axis=1, level=1)
         .stack([0, 1])
         .reset_index(level=0, drop=True)
         .rename_axis(('word', 'number'))
         .reset_index(name='blue'))
print (df1)
word number blue
0 bar one 2.0
1 baz two 4.4
2 qux one 1.0
3 bar one 3.0
4 baz two 5.0
You could stack one single level, only keep the blue column, and drop NaN values:
result = df.stack(level=0)['blue'].reset_index(level=1).rename(columns={'level_1': 'number'}).dropna()
It gives:
number blue
0 1 2.0
1 2 4.4
2 3 1.0
3 1 3.0
4 2 5.0
For the edited question, it looks like you want to process only the columns containing NaN values and keep only the non-NaN entries. This should do the trick:
df.loc[:,df.isna().any()].stack(level=[0,2])[['blue']].dropna()
It gives:
blue
0 bar one 2.0
1 baz two 4.4
2 qux one 1.0
3 bar one 3.0
4 baz two 5.0
NB: if you keep the other columns, you will get many more results for the blue values...
You can chain two stack calls:
df.stack().stack().reset_index()
level_0 level_1 level_2 0
0 0 blue 1 2.2
1 1 blue 2 5.0
2 2 blue 1 44.0
3 3 blue 3 3.3
4 4 blue 1 1.0
5 5 blue 3 1.0

Derive multiple df from single df such that each df has no NaN values

I want to convert this table
  movie   name  rating
0   thg   John     3.0
1   thg  James     4.0
2   mol    NaN     5.0
3   mol    NaN     NaN
4   lob    NaN     NaN
into the following tables:
df1
movie name rating
0 thg John 3.0
1 thg James 4.0
df2
movie rating
2 mol 5.0
df3
movie
3 mol
4 lob
where each dataframe has no NaN values. Also, please mention how the method would change if I needed to split on blank values instead of NaN.
I think that the start of a new target DataFrame should occur not only when the number of NaN values changes (compared to the previous row), but also when this number is the same but the NaN values are in different columns.
So I propose the following formula:
dfs = [g.dropna(how='all', axis=1) for _, g in
       df.groupby(df.isna().ne(df.isna().shift()).any(axis=1).cumsum())]
You can print the partial DataFrames (any number of them) by running:
n = 0
for grp in dfs:
    print(f'\ndf No {n}:\n{grp}')
    n += 1
The advantage of my solution over the other becomes obvious when you add to the source DataFrame another row containing:
5 NaN NaN 3.0
It also contains one non-null value (just like the two previous rows).
The other solution will treat all these rows as one partial DataFrame
containing:
movie rating
3 mol NaN
4 lob NaN
5 NaN 3.0
as you can see, with NaN values, whereas my solution divides these
rows into 2 separate DataFrames, without any NaN.
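Here is a minimal runnable sketch of that comparison, assuming a reconstruction of the question's frame (column names inferred from the expected output) plus the extra row 5 described above:
import pandas as pd
import numpy as np

# Assumed reconstruction of the question's data, plus the extra row 5.
df = pd.DataFrame({'movie':  ['thg', 'thg', 'mol', 'mol', 'lob', np.nan],
                   'name':   ['John', 'James', np.nan, np.nan, np.nan, np.nan],
                   'rating': [3.0, 4.0, 5.0, np.nan, np.nan, 3.0]})

# Start a new group whenever the NaN pattern (which columns hold NaN) changes between rows.
group_id = df.isna().ne(df.isna().shift()).any(axis=1).cumsum()
dfs = [g.dropna(how='all', axis=1) for _, g in df.groupby(group_id)]

for n, grp in enumerate(dfs):
    print(f'\ndf No {n}:\n{grp}')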
Create a list of dfs with a groupby and dropna:
dfs = [g.dropna(how='all',axis=1) for _,g in df.groupby(df.isna().sum(1))]
print(dfs[0],'\n\n',dfs[1],'\n\n',dfs[2])
Or dict:
d = {f"df{e+1}": g[1].dropna(how='all',axis=1)
for e,g in enumerate(df.groupby(df.isna().sum(1)))}
print(d['df1'],'\n\n',d['df2'],'\n\n',d['df3']) #read the keys of d
movie name rating
0 thg John 3.0
1 thg James 4.0
movie rating
2 mol 5.0
movie
3 mol
4 lob

Trying to fill NaNs with fillna() and groupby()

So I basically have an Airbnb data set with a few columns. Several of them correspond to ratings of different parameters (cleanliness, location, etc.). For those columns I have a bunch of NaNs that I want to fill.
As some of those NaNs correspond to listings from the same owner, I wanted to fill some of the NaNs with the corresponding hosts' rating average for each of those columns.
For example, let's say that for host X, the average value for review_scores_location is 7. What I want to do is, in the review_scores_location column, fill all the NaN values, that correspond to the host X, with 7.
I've tried the following code:
cols=['reviews_per_month','review_scores_rating','review_scores_accuracy','review_scores_cleanliness','review_scores_checkin','review_scores_communication','review_scores_location','review_scores_value']
for i in cols:
    airbnb[i] = airbnb[i].fillna(airbnb.groupby('host_id')[i].mean())
Although it runs without returning any error, it does not fill the NaN values: when I check whether there are still any NaNs, the count hasn't changed.
What am I doing wrong?
Thanks for taking the time to read this!
The problem here is that when you pass the series airbnb.groupby('host_id')[i].mean() to fillna, pandas tries to align on the index. Since the index of airbnb.groupby('host_id')[i].mean() consists of the host_id values and not the original index values of airbnb, the fillna does not work as you expect. Several options are possible; one way is to use transform after the groupby, which aligns the per-group mean back to the original index values, so the fillna then works as expected:
for i in cols:
    airbnb[i] = airbnb[i].fillna(airbnb.groupby('host_id')[i].transform('mean'))
You can even use this method without a loop:
airbnb = airbnb.fillna(airbnb.groupby('host_id')[cols].transform('mean'))
with an example:
airbnb = pd.DataFrame({'host_id': [1, 1, 1, 2, 2, 2],
                       'reviews_per_month': [4, 5, np.nan, 9, 3, 5],
                       'review_scores_rating': [3, np.nan, np.nan, np.nan, 7, 8]})
print (airbnb)
host_id review_scores_rating reviews_per_month
0 1 3.0 4.0
1 1 NaN 5.0
2 1 NaN NaN
3 2 NaN 9.0
4 2 7.0 3.0
5 2 8.0 5.0
and you get:
cols=['reviews_per_month','review_scores_rating'] # would work with all your columns
print (airbnb.fillna(airbnb.groupby('host_id')[cols].transform('mean')))
host_id review_scores_rating reviews_per_month
0 1 3.0 4.0
1 1 3.0 5.0
2 1 3.0 4.5
3 2 7.5 9.0
4 2 7.0 3.0
5 2 8.0 5.0
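To make the alignment point concrete, here is a small sketch (reusing the toy airbnb frame above) comparing the index of the plain groupby mean with the index of transform('mean'):
# groupby(...).mean() is indexed by the host_id values, so fillna cannot align it
# with the original row labels 0..5 of airbnb:
print(airbnb.groupby('host_id')['reviews_per_month'].mean())
# host_id
# 1    4.500000
# 2    5.666667

# transform('mean') broadcasts each group's mean back onto the original row index,
# which is exactly what fillna needs:
print(airbnb.groupby('host_id')['reviews_per_month'].transform('mean'))
# 0    4.500000
# 1    4.500000
# 2    4.500000
# 3    5.666667
# 4    5.666667
# 5    5.666667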

thresh in dropna for DataFrame in pandas in python

df1 = pd.DataFrame(np.arange(15).reshape(5,3))
df1.iloc[:4,1] = np.nan
df1.iloc[:2,2] = np.nan
df1.dropna(thresh=1 ,axis=1)
It seems that no NaN value has been deleted:
0 1 2
0 0 NaN NaN
1 3 NaN NaN
2 6 NaN 8.0
3 9 NaN 11.0
4 12 13.0 14.0
If I run
df1.dropna(thresh=2, axis=1)
why does it give the following?
0 2
0 0 NaN
1 3 NaN
2 6 8.0
3 9 11.0
4 12 14.0
I just don't understand what thresh is doing here. If a column has more than one NaN value, should the column be deleted?
thresh=N requires that a column have at least N non-NaN values to survive. In the first example, all three columns have at least one non-NaN value, so all of them survive. In the second example, only columns 0 and 2 have at least two non-NaN values, so they survive, while column 1 (with only a single non-NaN value) is dropped.
Try setting thresh to 4 to get a better sense of what's happening.
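Following up on that suggestion, a quick sketch of what thresh=4 does with the same frame:
import numpy as np
import pandas as pd

df1 = pd.DataFrame(np.arange(15).reshape(5, 3))
df1.iloc[:4, 1] = np.nan
df1.iloc[:2, 2] = np.nan

# With axis=1, thresh=4 keeps only the columns with at least 4 non-NaN values.
# Column 0 has 5, column 1 has 1, column 2 has 3, so only column 0 survives.
print(df1.dropna(thresh=4, axis=1))
#     0
# 0   0
# 1   3
# 2   6
# 3   9
# 4  12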
The thresh parameter decides the minimum number of non-NaN values a row needs in order not to be dropped (or a column, when axis=1 as here).
This will search along each column and check whether the column has at least 1 non-NaN value:
df1.dropna(thresh=1 ,axis=1)
Column 1 has only one non-NaN value (13), but thresh=2 needs at least 2 non-NaN values, so that column fails the check and gets dropped:
df1.dropna(thresh=2,axis=1)
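For contrast, with the default axis=0 the same thresh counts non-NaN values per row; a quick sketch, continuing with df1 as defined in the question:
# Rows 0 and 1 have only one non-NaN value each, so thresh=2 drops them;
# rows 2-4 have at least two non-NaN values and are kept.
print(df1.dropna(thresh=2))
#     0     1     2
# 2   6   NaN   8.0
# 3   9   NaN  11.0
# 4  12  13.0  14.0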

How to drop row in Dataframe if column is NaN and there is another row where the column is not NaN

I have a pandas dataframe in python where the rows are identified by p1 & p2, but p2 is sometimes NaN:
p1 p2
0 a 1
1 a 2
2 a 3
3 b NaN
4 c 4
5 d NaN
6 d 5
The above dataframe was returned from a larger one with many duplicates by using
df.drop_duplicates(subset=["p1","p2"], keep='last')
which works for the most part, the only issue being that NaN and 5 are technically not duplicates and therefore not dropped.
How can I drop the rows (such as "d", NaN) where there is another row with the same p1 and a non-null p2 value, e.g. "d", 5? The important thing here is that "b", NaN is kept, because there are no rows with "b" and a non-null p2.
We can groupby, then ffill and bfill, then drop_duplicates:
df.assign(p2=df.groupby('p1')['p2'].apply(lambda x: x.ffill().bfill())).\
    drop_duplicates(subset=["p1", "p2"], keep='last')
Out[645]:
p1 p2
0 a 1.0
1 a 2.0
2 a 3.0
3 b NaN
4 c 4.0
6 d 5.0
This set of duplicates should essentially be the intersection of the rows which contain NaN values and the rows which contain duplicate p1 elements, unioned with those which are duplicates across both columns:
dupe_1 = df['p1'].duplicated(keep=False) & df['p2'].isnull()
dupe_2 = df.duplicated(subset=['p1','p2'])
total_dupes = dupe_1 | dupe_2
new_df = df[~total_dupes]
Note that this will fail for a dataframe such as:
p1 p2
0 a NaN
1 a NaN
as both of those rows would be removed. Thus, we must first run df.drop_duplicates(subset=['p1','p2'], inplace=True, keep='last') to remove all but one of those rows, after which the solution works fine again.
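Putting the two steps together, a small sketch; the extra ('e', NaN) rows are invented to exercise the edge case just described:
import numpy as np
import pandas as pd

# The question's frame plus two invented ('e', NaN) rows.
df = pd.DataFrame({'p1': ['a', 'a', 'a', 'b', 'c', 'd', 'd', 'e', 'e'],
                   'p2': [1, 2, 3, np.nan, 4, np.nan, 5, np.nan, np.nan]})

# Step 1: collapse exact duplicates (drop_duplicates treats the two ('e', NaN) rows as equal).
df = df.drop_duplicates(subset=['p1', 'p2'], keep='last')

# Step 2: drop NaN rows whose p1 also appears elsewhere, plus any remaining exact duplicates.
dupe_1 = df['p1'].duplicated(keep=False) & df['p2'].isnull()
dupe_2 = df.duplicated(subset=['p1', 'p2'])
new_df = df[~(dupe_1 | dupe_2)]
print(new_df)
#   p1   p2
# 0  a  1.0
# 1  a  2.0
# 2  a  3.0
# 3  b  NaN
# 4  c  4.0
# 6  d  5.0
# 8  e  NaN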
