How to find updated rows between two pandas dataframes - python

I have two sources of the same data: one old version and one current version.
I need to find the new, updated, and deleted rows between the two.
Here is an example. A row counts as updated when the value of any column has changed from the old data.
>>> import pandas as pd
>>> import numpy as np
>>> df1 = pd.DataFrame({'id':[1,2,3,4],'b':[4,np.nan,6,12]})
>>> df2 = pd.DataFrame({'id':[2,1,3,5],'b':[np.nan,40,6,6]})
>>> df1
   id     b
0   1   4.0
1   2   NaN
2   3   6.0
3   4  12.0
>>> df2
   id     b
0   2   NaN
1   1  40.0
2   3   6.0
3   5   6.0
Here id is the primary key for the table.
I can easily find new rows by comparing the primary keys.
>>> df2[~df2.id.isin(df1.id)]
   id    b
3   5  6.0
But I am having trouble finding the updated rows in the new data source.
I tried the following:
>>> tmp = df1.merge(df2)
>>> df2[(~df2.id.isin(tmp.id)) & (df2.id.isin(df1.id))]
   id     b
1   1  40.0
This works for the given case. But when I apply the same approach to my original data frame (shape (97000, 58), where two columns combined make the primary key), it does not give the desired result: it returns rows that were not updated.
My question is: is this the right way to achieve this?
How can I improve it?

Get the intersection of the ids and simply compare using ==. This is only possible because you have identically-labeled data frames (i.e. same indexes - due to the intersection - and same columns).
ids = set(df1.id.unique()).intersection(df2.id)
d1 = df1[df1.id.isin(ids)].set_index('id').sort_index()
d2 = df2[df2.id.isin(ids)].set_index('id').sort_index()
comp = (d1 == d2) | (pd.isnull(d1) & pd.isnull(d2))
which gives a boolean data frame with True wherever the values are equal and False wherever they differ:
        b
id
1   False
2    True
3    True
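To pull just the updated rows out of df2, keep the ids whose comparison row contains any False; a short follow-up sketch (the same pattern extends to a composite key by passing the list of key columns to set_index):
# continue from the snippet above: comp is indexed by id
updated_ids = comp.index[~comp.all(axis=1)]
updated_rows = df2[df2.id.isin(updated_ids)]
print(updated_rows)
#    id     b
# 1   1  40.0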

Related

Is there a way to associate the value of a row with another row in Excel using Python

I created a df from the data of my Excel sheet, and in a specific column I have a lot of values that are the same, but some of them are different. What I want to do is find in which rows these different values are and associate each one with another value from the same row. I will give an example:
ColA     ColB
'Ship'      5
'Ship'      5
'Car'       3
'Ship'      5
'Plane'     2
Following the example, is there a way to find where the values different from 5 are, with the code giving me the respective value from ColA? In this case it would be finding 3 and 2, returning 'Car' and 'Plane', respectively.
Any help is welcome! :)
It depends on exactly what you want to do, but you could use:
a filter - to test for the value you seek
.where - to keep values only where that condition is False
Given the above dataframe the following would work:
df['different'] = df['ColB']==5
df['type'] = df['ColA'].where(df['different']==False)
print(df)
Which returns this:
    ColA  ColB  different   type
0   Ship     5       True    NaN
1   Ship     5       True    NaN
2    Car     3      False    Car
3   Ship     5       True    NaN
4  Plane     2      False  Plane
The 4th column has what you seek...
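If you only need the lookup itself (and not the helper columns), plain boolean indexing gets there in one line; a minimal sketch built from the example data:
import pandas as pd

df = pd.DataFrame({'ColA': ['Ship', 'Ship', 'Car', 'Ship', 'Plane'],
                   'ColB': [5, 5, 3, 5, 2]})

# select ColA wherever ColB differs from 5
print(df.loc[df['ColB'] != 5, 'ColA'])
# 2      Car
# 4    Plane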

How to do a full outer join excluding the intersection between two pandas dataframes?

I have two datasets with identical column headers, and I would like to remove ALL rows that are 100% identical between the two, keeping only what they do not have in common. How could I go about doing that?
Thank you for your time!
To get everything BUT the intersection of two pandas datasets, try this:
# Everything from the first except what is in the second
r1 = df1[~df1.isin(df2)]
# Everything from the second except what is in the first
r2 = df2[~df2.isin(df1)]
# concatenate and drop NaNs
result = pd.concat(
[r1, r2]
).dropna().reset_index(drop=True)
There is one caveat though: when filtering with boolean masks, your int values might turn into floats. By default, pandas replaces unwanted (False) values with the float NaN and converts the entire column to float. You can see this happening in the example below.
To circumvent this, explicitly declare the datatype when creating the dataframe.
Example
import pandas as pd
df1 = pd.read_csv("./csv1.csv") #, dtype='Int64')
print(f"csv1\n{df1}\n")
df2 = pd.read_csv("./csv2.csv") #, dtype='Int64')
print(f"csv2\n{df2}\n")
# Everything from the first except what is in the second
r1 = df1[~df1.isin(df2)]
# Everything from the second except what is in the first
r2 = df2[~df2.isin(df1)]
# concatenate and drop NaNs
result = pd.concat(
[r1, r2]
).dropna().reset_index(drop=True)
print(f"result\n{result}\n")
Input
csv1
   A  B  C
0  1  2  3
1  4  5  6
2  7  8  9

csv2
    A   B   C
0   1   2   3
1   4   5   6
2  10  11  12
Output
result
      A     B     C
0   7.0   8.0   9.0
1  10.0  11.0  12.0
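An alternative idiom that avoids the float conversion entirely is to stack both frames and drop every row that appears twice; a sketch that assumes neither frame contains internal duplicate rows:
import pandas as pd

df1 = pd.DataFrame({'A': [1, 4, 7], 'B': [2, 5, 8], 'C': [3, 6, 9]})
df2 = pd.DataFrame({'A': [1, 4, 10], 'B': [2, 5, 11], 'C': [3, 6, 12]})

# keep=False removes every duplicated row, leaving the symmetric difference;
# no boolean mask is applied, so the int64 dtypes survive
result = pd.concat([df1, df2]).drop_duplicates(keep=False).reset_index(drop=True)
print(result)
#     A   B   C
# 0   7   8   9
# 1  10  11  12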

Populating a column based on values in another column - pandas

After merging two data frames I have some gaps in my data frame that can be filled in based on neighboring rows sharing the same Unique ID (I have many more columns and rows in the DF, but I'm focusing on these three columns):
Example DF:
Unique ID | Type | Location
A         | 1    | Land
A         | NaN  | NaN
B         | 2    | sub
B         | NaN  | NaN
C         | 3    | Land
C         | 3    | Land
Ultimately I want the three columns to be filled in:
Unique ID | Type | Location
A         | 1    | Land
A         | 1    | Land
B         | 2    | sub
B         | 2    | sub
C         | 3    | Land
C         | 3    | Land
I've tried:
df.loc[df.Type.isnull(), 'Type'] = df.loc[df.Type.isnull(), 'Unique ID'].map(df.loc[df.Type.notnull()].set_index('Unique ID')['Type'])
but it throws:
InvalidIndexError: Reindexing only valid with uniquely valued Index objects
What am I missing here? - Thanks
Your example indicates that you want to forward-fill. You can do it like this (complete code):
import pandas as pd
from io import StringIO
clientdata = '''ID N T
A 1 Land
A NaN NaN
B 2 sub
B NaN NaN
C 3 Land
C 3 Land'''
df = pd.read_csv(StringIO(clientdata), sep=r'\s+')
df["N"] = df["N"].ffill()
df["T"] = df["T"].ffill()
print(df)
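Note that a plain forward-fill assumes the non-null row always comes before the gap for the same ID. A more defensive variant (a sketch using the question's original column names) fills within each ID group, which also sidesteps the InvalidIndexError raised by the duplicate-index mapping attempt:
import numpy as np
import pandas as pd

df = pd.DataFrame({'Unique ID': ['A', 'A', 'B', 'B', 'C', 'C'],
                   'Type': [1, np.nan, 2, np.nan, 3, 3],
                   'Location': ['Land', np.nan, 'sub', np.nan, 'Land', 'Land']})

# broadcast the first non-null value of each column within each ID group
for col in ['Type', 'Location']:
    df[col] = df.groupby('Unique ID')[col].transform('first')
print(df)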
The best solution is probably to just get rid of the NaN rows instead of overwriting them. Pandas has a simple command for that:
df.dropna()
Here's the documentation for it: pandas.DataFrame.dropna

comparing each value in two columns

How can I compare two columns in a dataframe and create a new column based on the difference of those two columns efficiently?
I have a feature in my table that has a lot of missing values, and I need to backfill that information by using other tables in the database that contain the same feature. I have used np.select to compare the feature in my original table with the same feature in another table, but I feel like there should be an easier method.
Eg: pd.DataFrame({'A': [1,2,3,4,np.nan], 'B':[1,np.nan,30,4,np.nan]})
I expect the new column to contain values [1,2,"different",4,np.nan]. Any help will be appreciated!
pandas.Series.combine_first or pandas.DataFrame.combine_first could be useful here. These operate like a SQL COALESCE and combine the two columns by choosing the first non-null value if one exists.
df = pd.DataFrame({'A': [1,2,3,4,np.nan], 'B':[1,np.nan,30,4,np.nan]})
C = df.A.combine_first(df.B)
C looks like:
0 1.0
1 2.0
2 3.0
3 4.0
4 NaN
Then, to capture your requirement that two different non-null values should give "different" when combined, just find those indices and update the values.
mask = ~df.A.isna() & ~df.B.isna() & (df.A != df.B)
C[mask] = 'different'
C now looks like:
0 1
1 2
2 different
3 4
4 NaN
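The two steps above can be wrapped into a small helper; a sketch (the name coalesce_or_flag is mine, not a pandas API):
import numpy as np
import pandas as pd

def coalesce_or_flag(a, b, flag='different'):
    """Coalesce two Series; flag positions where both are present but unequal."""
    out = a.combine_first(b).astype(object)  # object dtype so the flag string fits
    conflict = a.notna() & b.notna() & a.ne(b)
    out[conflict] = flag
    return out

df = pd.DataFrame({'A': [1, 2, 3, 4, np.nan], 'B': [1, np.nan, 30, 4, np.nan]})
df['C'] = coalesce_or_flag(df['A'], df['B'])
print(df['C'].tolist())  # [1.0, 2.0, 'different', 4.0, nan]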
Another way is to use pd.DataFrame.iterrows with nunique:
import pandas as pd
df['C'] = [s['A'] if s.nunique()<=1 else 'different' for _, s in df.iterrows()]
Output:
     A     B          C
0  1.0   1.0          1
1  2.0   NaN          2
2  3.0  30.0  different
3  4.0   4.0          4
4  NaN   NaN        NaN

Pandas- set values to an empty dataframe

I have initialized an empty pandas dataframe that I am now trying to fill but I keep running into the same error. This is the (simplified) code I am using
import pandas as pd
cols = list("ABC")
df = pd.DataFrame(columns=cols)
# set the values for the first two rows
df.loc[0:2,:] = [[1,2],[3,4],[5,6]]
On running the above code I get the following error:
ValueError: cannot copy sequence with size 3 to array axis with dimension 0
I am not sure what's causing this. I tried the same using a single row at a time and it works (df.loc[0,:] = [1,2,3]). I thought this would be the logical extension when I want to handle more than one row, but clearly I am wrong. What's the correct way to do this? I need to enter values for multiple rows and columns at once. I can do it using a loop, but that's not what I am looking for.
Any help would be great. Thanks
Since you already have the columns from the empty dataframe, use them in the dataframe constructor, i.e.
import numpy as np
import pandas as pd
cols = list("ABC")
df = pd.DataFrame(columns=cols)
df = pd.DataFrame(np.array([[1,2],[3,4],[5,6]]).T, columns=df.columns)
   A  B  C
0  1  3  5
1  2  4  6
Well, if you want to use loc specifically, then reindex the dataframe first and then assign, i.e.
arr = np.array([[1,2],[3,4],[5,6]]).T
df = df.reindex(np.arange(arr.shape[0]))
df.loc[0:arr.shape[0],:] = arr
   A  B  C
0  1  3  5
1  2  4  6
How about adding data by index, as below? You can call it externally as and when you receive data.
import pandas as pd

df = pd.DataFrame(columns=list("ABC"))

def add_to_df(index, data):
    # each inner list of data is a column; zip(*data) yields one row per index
    for idx, row in zip(index, zip(*data)):
        df.loc[idx] = row

# set values for the first two rows
data1 = [[1,2],[3,4],[5,6]]
index1 = [0,1]
add_to_df(index1, data1)
print(df)
print("")

# set values for the next three rows
data2 = [[7,8,9],[10,11,12],[13,14,15]]
index2 = [2,3,4]
add_to_df(index2, data2)
print(df)
Result
     A    B    C
0  1.0  3.0  5.0
1  2.0  4.0  6.0

     A     B     C
0  1.0   3.0   5.0
1  2.0   4.0   6.0
2  7.0  10.0  13.0
3  8.0  11.0  14.0
4  9.0  12.0  15.0
Looking through the documentation and some experiments, my guess is that loc only allows you to insert one key at a time. However, you can insert multiple keys first with reindex, as @Dark shows.
The .loc/[] operations can perform enlargement when setting a non-existent key for that axis.
http://pandas-docs.github.io/pandas-docs-travis/indexing.html#setting-with-enlargement
Also, when you use loc[0:2, :], you mean to select the first two rows, but there is nothing in the empty df to select: there are no rows yet, while you are trying to insert three. Thus the message:
ValueError: cannot copy sequence with size 3 to array axis with dimension 0
BTW, [[1,2],[3,4],[5,6]] will be 3 rows rather than 2.
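A minimal sketch of that one-label-at-a-time enlargement, for comparison:
import pandas as pd

df = pd.DataFrame(columns=list("ABC"))

# setting-with-enlargement accepts one new label per assignment
df.loc[0] = [1, 3, 5]
df.loc[1] = [2, 4, 6]
print(df)
#    A  B  C
# 0  1  3  5
# 1  2  4  6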
Does this get the output you're looking for:
import pandas as pd
df=pd.DataFrame({'A':[1,2],'B':[3,4],'C':[5,6]})
Output:
   A  B  C
0  1  3  5
1  2  4  6
