Pandas difference between dataframes on column values - python

I couldn't find a way to have a dataframe that has the difference of 2 dataframes based on a column. So basically:
dfA = ID, val
1, test
2, other test
dfB = ID, val
2, other test
I want to have a dfC that holds the difference dfA - dfB based on column ID
dfC = ID, val
1, test

Merge the dataframes on ID:
dfMerged = dfA.merge(dfB, left_on='ID', right_on='ID', how='outer') # how defaults to 'inner'; 'outer' keeps unmatched rows
In the merged dataframe, name collisions are resolved by appending the suffixes _x and _y to columns from the left and right source dataframes, respectively.
So you'll end up with val_x and val_y here. Compare these columns however you want to. For example:
dfMerged['x_y_test'] = dfMerged.val_y == dfMerged.val_x
# gives you a boolean column comparing val_x with val_y.
Rows where this column is False (including IDs missing from dfB, where val_y is NaN) are the ones that belong in the desired dfC, so use the negated column as a mask.
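As a rough sketch of the same merge-based idea, the indicator parameter of merge can mark where each row came from, which avoids comparing the value columns by hand. The frames below are rebuilt from the question's example; the name keys_b is only illustrative.

```python
import pandas as pd

# Frames rebuilt from the example in the question.
dfA = pd.DataFrame({"ID": [1, 2], "val": ["test", "other test"]})
dfB = pd.DataFrame({"ID": [2], "val": ["other test"]})

# Only the key column of dfB is needed; drop duplicates so the
# left join cannot multiply rows of dfA.
keys_b = dfB[["ID"]].drop_duplicates()

# indicator=True adds a "_merge" column with the values
# "left_only", "right_only" or "both".
merged = dfA.merge(keys_b, on="ID", how="left", indicator=True)

# Keep the rows of dfA whose ID does not occur in dfB.
dfC = merged.loc[merged["_merge"] == "left_only", ["ID", "val"]].reset_index(drop=True)
print(dfC)
```

This is an anti-join: a left join keeps every row of dfA, and the indicator column tells you which of them found no partner in dfB.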

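Comparing the two ID columns element-wise does not work here, because the frames have different lengths and the Series cannot be aligned:
dfC = dfB[dfB["ID"] == dfA["ID"]]
Use isin instead. Negate it and apply it to dfA to keep the rows of dfA whose ID does not appear in dfB:
dfC = dfA[~dfA["ID"].isin(dfB["ID"])]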

Related

Pandas "A value is trying to be set on a copy of a slice from a DataFrame"

I'm having a bit of trouble understanding the documentation that this warning points to:
C:/Users/erasmuss/PycharmProjects/Sarah/farmdata.py:38: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead
See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
dfbreed['x'] = dfbreed.apply(testbreed, axis=1)
The code basically rearranges and cleans some data to make analysis easier.
The data is given row by row per animal, but has repetitions, blanks, and some other sparse values.
The idea is to pivot rows into columns and grab the useful data (weight by date and final BCS) per animal.
Initial DF
few snippets of the dataframe
Output Format
Output DF/csv
import pandas as pd
import numpy as np
#Function for cleaning up multiple entries of breeds
def testbreed(x):
    if x.first_valid_index() is None:
        return None
    else:
        return x[x.first_valid_index()]
#Read Data
df1 = pd.read_csv("farmdata.csv")
#Drop empty columns (axis=1)
df1.dropna(how='all', axis=1, inplace=True)
#Copy to extract Weights in DF2
df2 = df1.copy()
df2 = df2.drop(['BCS', 'Breed','Age'], axis=1)
#Pivot for ID names in DF1
df1 = df1.pivot(index='ID', columns='Date', values=['Breed','Weight', 'BCS'])
#Pivot for weights in DF2
df2 = df2.pivot(index='ID', columns='Date', values = 'Weight')
#Split out Breeds and BCS into individual dataframes w/Duplicate/missing data for each ID
df3 = df1.copy()
dfbreed = df3[['Breed']]
dfBCS = df3[['BCS']]
#Drop empty BCS columns
df1.dropna(how='all', axis=1, inplace=True)
#Shorten Breed and BCS to single Column by grabbing first value that is real. see function above
dfbreed['x'] = dfbreed.apply(testbreed, axis=1)
dfBCS['x'] = dfBCS.apply(testbreed, axis=1)
#Populate BCS and Breed into new DF
df5= pd.DataFrame(data=None)
df5['Breed'] = dfbreed['x']
df5['BCS'] = dfBCS['x']
#Join Weights
df5 = df5.join(df2)
#Write output
df5.to_csv(r'.\out1.csv')
I want to take the BCS and Breed dataframes, whose columns are multi-indexed first by Breed or BCS and then by date, find the first non-NaN value across the date columns in each row, and put it into a single column named after the field (Breed or BCS).
I had a lot of trouble getting the columns to pick the first valid values in place on the DF.
I found a workaround in a 2015 answer:
2015 Answer
which defined the function at the top.
Reading about setting a value on a copy of a slice, the warning makes sense intuitively,
but I can't seem to think of a way to make it work as a direct replacement or with index-based assignment.
Should I be looping through?
Following the second answer here, I tried
dfbreed.loc[:,'Breed'] = dfbreed['Breed'].apply(testbreed, axis=1)
dfBCS.loc[:, 'BCS'] = dfBCS['BCS'].apply(testbreed, axis=1)
which returns
ValueError: Must have equal len keys and value when setting with an iterable
I'm thinking this has something to do with the multi-index
keys come up as:
MultiIndex([('Breed', '1/28/2021'),
            ('Breed', '2/12/2021'),
            ('Breed',  '2/4/2021'),
            ('Breed', '3/18/2021'),
            ('Breed', '7/30/2021')],
           names=[None, 'Date'])
MultiIndex([('BCS', '1/28/2021'),
            ('BCS', '2/12/2021'),
            ('BCS',  '2/4/2021'),
            ('BCS', '3/18/2021'),
            ('BCS', '7/30/2021')],
           names=[None, 'Date'])
Sorry for the long question(s?)
Can anyone help me out?
Thanks.
You created dfbreed as:
dfbreed = df3[['Breed']]
So pandas treats it as a slice of the original DataFrame (limited to just this one top-level column) and cannot guarantee whether it is a view or a copy of df3's data.
When you then run dfbreed['x'] = dfbreed.apply(...), pandas emits
SettingWithCopyWarning, because the assignment might not affect the frame
you expect. Note that it is a warning, not an error: the code still runs.
To avoid the warning, create dfbreed as an "independent" DataFrame:
dfbreed = df3[['Breed']].copy()
Now dfbreed owns its data and you are free to change it.
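A minimal sketch of that fix, assuming a small made-up frame in place of the pivoted farm data (the IDs, breeds and BCS values are invented, and first_valid is just a renamed version of the question's testbreed):

```python
import numpy as np
import pandas as pd

def first_valid(row):
    """Return the first non-NaN value in a row, or None if the row is all NaN."""
    idx = row.first_valid_index()
    return None if idx is None else row[idx]

# Made-up stand-in for the pivoted frame: columns are (field, date) pairs.
df3 = pd.DataFrame(
    {
        ("Breed", "1/28/2021"): [np.nan, "Angus", np.nan],
        ("Breed", "2/4/2021"): ["Hereford", "Angus", np.nan],
        ("BCS", "1/28/2021"): [5.0, np.nan, np.nan],
        ("BCS", "2/4/2021"): [6.0, 4.5, np.nan],
    },
    index=pd.Index([101, 102, 103], name="ID"),
)

# .copy() makes each sub-frame independent of df3, so adding a column
# to it is unambiguous and no SettingWithCopyWarning is raised.
dfbreed = df3[["Breed"]].copy()
dfBCS = df3[["BCS"]].copy()

first_breed = dfbreed.apply(first_valid, axis=1)
first_bcs = dfBCS.apply(first_valid, axis=1)
dfbreed["x"] = first_breed
dfBCS["x"] = first_bcs

# Collect the per-animal results in a flat summary frame.
summary = pd.DataFrame({"Breed": first_breed, "BCS": first_bcs})
print(summary)
```

If the intermediate frames are not needed afterwards, building summary directly from the two Series avoids assigning into any slice at all.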

rsuffix for merging data in pandas

I have multiple dataframes with the same columns but different values, which look like this:
Product 1 Dataframe
Here's the code that generated them
import pandas as pd
d1 = {"Year":[2018,2019,2020],"Quantity": [10,20,30], "Price": [100,200,300]}
df_product1 = pd.DataFrame(data=d1)
d2 = {"Year":[2018,2019,2020],"Quantity": [20,20,50], "Price": [120,110,380]}
df_product2 = pd.DataFrame(data=d2)
d3 = {"Year":[2018,2019,2020],"Quantity": [40,20,70], "Price": [1000,140,380]}
df_product3 = pd.DataFrame(data=d3)
I merge two dataframes and specify suffixes like so
df_total = df_product1.merge(df_product2,on="Year", suffixes = ("_Product1","_Product2"))
And I get
First Merged Dataframe
However, when I merge another dataframe to the result above using:
df_total = df_total.merge(df_product3,on="Year", suffixes = ("_Product","_Product3"))
I get
Final Merged Dataframe
Where there is no suffix for the third product.
I would like the last two columns of the dataframe to be Quantity_Product3, Price_Product3 instead of just Quantity and Price.
Let me know if it is possible or if I need to approach the problem from a completely different angle.
Why you don't get the result you want
It's explained in the docs: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.merge.html
suffixes : list-like, default is (“_x”, “_y”)
A length-2 sequence where each element is optionally a string
indicating the suffix to add to overlapping column names in left and
right respectively. Pass a value of None instead of a string to
indicate that the column name from left or right should be left
as-is, with no suffix. At least one of the values must not be None.
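Suffixes are added only to overlapping column names. After your first merge, df_total's columns are Quantity_Product1, Price_Product1, Quantity_Product2 and Price_Product2, so nothing overlaps with the Quantity and Price columns of df_product3 and no suffix is added in the second merge.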
See this example - suffixes are added to column b, because both dataframes have a column b, but not to columns a and c, as they are unique and not in common between the two dataframes.
df1 = pd.DataFrame(columns =['a','b'], data = np.random.rand(10,2))
df2 = pd.DataFrame(columns =['b','c'], data = np.random.rand(10,2), index = np.arange(5,15))
# equivalent to an inner join on the indices
out = pd.merge(df1, df2, how ='inner', left_index = True, right_index = True)
A crude solution
Why don't you just rename the columns manually? Not elegant but effective
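One way the manual rename could look, as a sketch using the data from the question. The helper add_product_suffix is a hypothetical name introduced here, not a pandas function; it renames every column except the merge key before merging, so the merge itself never needs to add suffixes.

```python
import pandas as pd

# Data from the question.
df_product1 = pd.DataFrame({"Year": [2018, 2019, 2020], "Quantity": [10, 20, 30], "Price": [100, 200, 300]})
df_product2 = pd.DataFrame({"Year": [2018, 2019, 2020], "Quantity": [20, 20, 50], "Price": [120, 110, 380]})
df_product3 = pd.DataFrame({"Year": [2018, 2019, 2020], "Quantity": [40, 20, 70], "Price": [1000, 140, 380]})

def add_product_suffix(df, suffix, key="Year"):
    """Hypothetical helper: append suffix to every column except the merge key."""
    return df.rename(columns=lambda c: c if c == key else f"{c}{suffix}")

# After renaming, no column names overlap except the key, so merge keeps them as-is.
df_total = (
    add_product_suffix(df_product1, "_Product1")
    .merge(add_product_suffix(df_product2, "_Product2"), on="Year")
    .merge(add_product_suffix(df_product3, "_Product3"), on="Year")
)
print(df_total.columns.tolist())
```

The same helper works for any number of products, since each frame carries its own suffix before it reaches merge.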
A possible alternative
The table you are trying to build looks like a pivot. I would look into normalising all your dataframes, concatenating them, then running a pivot on the result.
Depending on your case, this may well be more convoluted and could well be overkill. I mention it because I want to bring your attention to the concepts of pivoting/unpivoting (stacking/unstacking/normalising) data.
The code below takes a df which looks similar to yours and normalises it. For simpler cases you can use pandas.melt(). I don't have the exact data of your example but this should be a good starting point.
def random_dates(start, end, n, unit='D', seed=None):
    ndays = (end - start).days + 1
    return start + pd.to_timedelta(
        np.random.randint(0, ndays, n), unit=unit
    )
df = pd.DataFrame()
mysize = 20
df['key'] = np.arange(0,mysize)
df['A_value'] = np.random.randint(0,10000,mysize)
df['A_date'] = random_dates(pd.to_datetime('2010-01-01' ), pd.to_datetime('2019-01-01'), mysize)
df['B_value'] = np.random.randint(-5000,5000,mysize)
df['B_date'] = random_dates(pd.to_datetime('2005-01-01' ), pd.to_datetime('2015-01-01'), mysize)
df['C_value'] = np.random.randint(-10000,10000,mysize)
df['C_date'] = random_dates(pd.to_datetime('2000-01-01' ), pd.to_datetime('2019-01-01'), mysize)
df2 = df.set_index('key', drop=True, verify_integrity = True)
df2 = df2.stack().reset_index()
df2.columns=['key','rawsource','rawvalue']
df2['source'] = df2['rawsource'].apply(lambda x: x[0:1])
df2['metric'] = df2['rawsource'].apply(lambda x: x[2:])
df2 = df2.drop(['rawsource'], axis = 1)
df_piv = df2.pivot_table( index=['key','source'], columns = 'metric' , values ='rawvalue', aggfunc='first' ).reset_index().rename_axis(None, axis=1)

Join in Pandas Dataframe using conditional join statement

I am trying to join two dataframes with the following data:
df1
df2
I want to join these two dataframes on the condition that if 'col2' of df2 is blank/NULL then the join should occur only on 'column1' of df1 and 'col1' of df2 but if it is not NULL/blank then the join should occur on two conditions, i.e. 'column1', 'column2' of df1 with 'col1', 'col2' of df2 respectively.
For reference the final dataframe that I wish to obtain is:
My current approach is to slice these 2 dataframes into 4 and then join them separately based on the condition. Is there any way to do this without slicing them, or a better way that I'm missing?
The idea is to rename the columns first, then left join on both columns, and finally fill the missing values by matching on column1 alone. Before using Series.map, duplicates in col1 must be removed with DataFrame.drop_duplicates so that the lookup index is unique:
df22 = df2.rename(columns={'col1':'column1','col2':'column2'})
df = df1.merge(df22, on=['column1','column2'], how='left')
s = df2.drop_duplicates('col1').set_index('col1')['col3']
df['col3'] = df['col3'].fillna(df['column1'].map(s))
EDIT: A general solution that works with multiple columns. The first part is the same left join; the second part merges on one column and uses DataFrame.combine_first to replace the missing values:
df22 = df2.rename(columns={'col1':'column1','col2':'column2'})
df = df1.merge(df22, on=['column1','column2'], how='left')
df23 = df22.drop_duplicates('column1').drop('column2', axis=1)
df = df.merge(df23, on='column1', how='left', suffixes=('','_'))
cols = df.columns[df.columns.str.endswith('_')]
df = df.combine_first(df[cols].rename(columns=lambda x: x.strip('_'))).drop(cols, axis=1)
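The question's data is not shown, so here is a small sketch of the first variant on made-up frames. It makes one extra assumption that the answer does not: the single-column fallback is restricted to df2 rows whose col2 is blank, which is one reading of the stated condition. The names exact and fallback are illustrative only.

```python
import pandas as pd

# Made-up data; only the column names come from the question.
df1 = pd.DataFrame({"column1": ["a", "a", "b"], "column2": ["x", "y", "z"]})
df2 = pd.DataFrame({"col1": ["a", "b"], "col2": ["x", None], "col3": [1.0, 2.0]})

df22 = df2.rename(columns={"col1": "column1", "col2": "column2"})

# Rows of df2 with a value in col2 must match on both columns.
exact = df22[df22["column2"].notna()]

# Rows of df2 with a blank col2 match on column1 alone (assumption, see above).
fallback = (
    df22[df22["column2"].isna()]
    .drop_duplicates("column1")
    .set_index("column1")["col3"]
)

df = df1.merge(exact, on=["column1", "column2"], how="left")
df["col3"] = df["col3"].fillna(df["column1"].map(fallback))
print(df)
```

Here ("a", "x") matches on both columns, ("b", "z") is filled through the blank-col2 row for "b", and ("a", "y") stays NaN because df2 has no blank-col2 row for "a".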

What is the difference between 'pd.concat([df1, df2], join='outer')', 'df1.combine_first(df2)', 'pd.merge(df1, df2)' and 'df1.join(df2, how='outer')'? [duplicate]

Say I have the following 2 pandas dataframes:
import pandas as pd
A = [174,-155,-931,301]
B = [943,847,510,16]
C = [325,914,501,884]
D = [-956,318,319,-83]
E = [767,814,43,-116]
F = [110,-784,-726,37]
G = [-41,964,-67,-207]
H = [-555,787,764,-788]
df1 = pd.DataFrame({"A": A, "B": B, "C": C, "D": D})
df2 = pd.DataFrame({"E": E, "B": F, "C": G, "D": H})
If I do concat with join=outer, I get the following resulting dataframe:
pd.concat([df1, df2], join='outer')
If I do df1.combine_first(df2), I get the following:
df1.set_index('B').combine_first(df2.set_index('B')).reset_index()
If I do pd.merge(df1, df2), I get the following which is identical to the result produced by concat:
pd.merge(df1, df2, on=['B','C','D'], how='outer')
And finally, if I do df1.join(df2, how='outer'), I get the following:
df1.join(df2, how='outer', on='B', lsuffix='_left', rsuffix='_right')
I don't fully understand how and why each produces different results.
concat: append one dataframe to another along the given axis (default axis=0, meaning concat along the index, i.e. put the other dataframe below the given dataframe). Data are aligned on the other axis (i.e. for the default setting, columns are aligned). This is why we get NaNs in the non-matching columns 'A' and 'E'.
combine_first: replace NaNs in dataframe by existing values in other dataframe, where rows and columns are pooled (union of rows and cols from both dataframes). In your example, there are no missing values from the beginning but they emerge due to the union operation as your indices have no common entries. The order of the rows results from the sorted combined index (df1.B and df2.B).
So if there are no missing values in your dataframe you wouldn't normally use combine_first.
merge is a database-style combination of two dataframes that offers more options on how to merge (left, right, specific columns) than concat. In your example, the data of the result are identical, but there's a difference in the index between concat and merge: when merging on columns, the dataframe indices will be ignored and a new index will be created.
join merges df1 and df2 on the given column of df1 (in the example 'B') and the index of df2. In your example this is the same as pd.merge(df1, df2, left_on='B', right_index=True, how='outer', suffixes=('_left', '_right')). As there's no match between column 'B' of df1 and the index of df2, there will be a lot of NaNs due to the outer join.
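To compare the four operations side by side, here is a short sketch on the question's data. The variable names stacked, merged, combined, joined and equiv are illustrative. As far as I understand DataFrame.join, its on argument names a column of the calling frame that is matched against the index of the other frame, which is what the last two lines check.

```python
import pandas as pd

df1 = pd.DataFrame({"A": [174, -155, -931, 301], "B": [943, 847, 510, 16],
                    "C": [325, 914, 501, 884], "D": [-956, 318, 319, -83]})
df2 = pd.DataFrame({"E": [767, 814, 43, -116], "B": [110, -784, -726, 37],
                    "C": [-41, 964, -67, -207], "D": [-555, 787, 764, -788]})

# concat: stacks the rows and keeps both original indices (0..3 twice).
stacked = pd.concat([df1, df2], join="outer")

# merge on columns: database-style join, result gets a fresh 0..n-1 index.
merged = pd.merge(df1, df2, on=["B", "C", "D"], how="outer")

# combine_first: union of rows and columns, NaNs in df1 filled from df2.
combined = df1.set_index("B").combine_first(df2.set_index("B")).reset_index()

# join with on="B": column B of df1 against the index of df2.
joined = df1.join(df2, how="outer", on="B", lsuffix="_left", rsuffix="_right")
equiv = pd.merge(df1, df2, left_on="B", right_index=True, how="outer",
                 suffixes=("_left", "_right"))

print(stacked.index.tolist())
print(merged.index.tolist())
print(joined.shape, equiv.shape)
```

The index difference between stacked and merged is the one the answer points out: the values agree, but concat keeps the source indices while merge builds a new one.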

How to merge columns interspersing the data?

I'm new to Python and pandas, and I'm trying to build a dataframe indexed by a pandas MultiIndex over two independent variables, flow and head, for 27 different design points. The data is currently organized in a single dataframe with columns for each variable and rows for each design point.
Here's how I created the MultiIndex:
flow = df.loc[0, ["Mass_Flow_Rate", "Mass_Flow_Rate.1",
"Mass_Flow_Rate.2"]]
dp = df.loc[:,"Design Point"]
index = pd.MultiIndex.from_product([dp, flow], names=
['DP','Flows'])
I then created three columns of data:
df0 = df.loc[:,"Head2D"]
df1 = df.loc[:,"Head2D.1"]
df2 = df.loc[:,"Head2D.2"]
And want to merge these into a single column of data such that I can use this command:
pc = pd.DataFrame(data, index=index)
Using the three columns with the same indexes for the rows (0-27), I want to merge the columns into a single column such that the data is interspersed. If I call the columns col1, col2 and col3 and I denote the index in parentheses such that col1(0) indicates column1 index 0, I want the data to look like:
col1(0)
col2(0)
col3(0)
col1(1)
col2(1)
col3(1)
col1(2)...
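The question is a bit unclear, but as I understand it, you are trying to do this (put the three columns side by side, then flatten row by row so the values are interleaved):
flow = df.loc[0, ["Mass_Flow_Rate", "Mass_Flow_Rate.1",
"Mass_Flow_Rate.2"]]
dp = df.loc[:,"Design Point"]
index = pd.MultiIndex.from_product([dp, flow], names=
['DP','Flows'])
df0 = df.loc[:,"Head2D"]
df1 = df.loc[:,"Head2D.1"]
df2 = df.loc[:,"Head2D.2"]  # presumably "Head2D.2" was intended for the third column
# axis=1 aligns the columns side by side; ravel() then reads row by row:
# col1(0), col2(0), col3(0), col1(1), ...
data = pd.concat([df0, df1, df2], axis=1).to_numpy().ravel()
pc = pd.DataFrame(data=data, index=index)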
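A self-contained sketch of the interleaving on made-up numbers (three design points instead of 27; the values, the column name "Head2D.2" and the output column name "Head" are assumptions for illustration):

```python
import pandas as pd

# Made-up stand-in for the original dataframe.
df = pd.DataFrame({
    "Design Point": ["DP0", "DP1", "DP2"],
    "Mass_Flow_Rate": [1.0, 1.0, 1.0],
    "Mass_Flow_Rate.1": [2.0, 2.0, 2.0],
    "Mass_Flow_Rate.2": [3.0, 3.0, 3.0],
    "Head2D": [10.0, 11.0, 12.0],
    "Head2D.1": [20.0, 21.0, 22.0],
    "Head2D.2": [30.0, 31.0, 32.0],
})

flow = df.loc[0, ["Mass_Flow_Rate", "Mass_Flow_Rate.1", "Mass_Flow_Rate.2"]]
dp = df.loc[:, "Design Point"]

# from_product varies the last level fastest: (DP0, f1), (DP0, f2), (DP0, f3), (DP1, f1), ...
index = pd.MultiIndex.from_product([dp, flow], names=["DP", "Flows"])

# Flattening the (rows x 3) block row by row yields the same ordering:
# col1(0), col2(0), col3(0), col1(1), ...
heads = df[["Head2D", "Head2D.1", "Head2D.2"]]
pc = pd.DataFrame({"Head": heads.to_numpy().ravel()}, index=index)
print(pc)
```

Converting to a NumPy array before building pc matters: passing a pandas object with its own index would make the constructor try to align it with the new MultiIndex rather than place the values positionally.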
