How to compare 2 dataframes and generate new dataframe - python

I have 2 similar dataframes, and I would like to compare each row of the 1st dataframe with the corresponding row of the 2nd based on a condition.
Based on this comparison I would like to generate a similar dataframe with a new column 'change' containing the changes, based on the following conditions:
if the rows have identical values then 'change' = 'identical'; otherwise, if the date changed, then 'change' = 'new date'.

Here is an easy workaround.
# Import the pandas library
import pandas as pd

# One dataframe
data = [['foo', 10], ['bar', 15], ['foobar', 14]]
df = pd.DataFrame(data, columns=['Name', 'Age'])

# Another similar dataframe, but foo's age is 13 this time
data = [['foo', 13], ['bar', 15], ['foobar', 14]]
df2 = pd.DataFrame(data, columns=['Name', 'Age'])

df3 = df2.copy()
# Flag rows whose Age differs between the two dataframes
for index, row in df.iterrows():
    if df.at[index, 'Age'] != df2.at[index, 'Age']:
        df3.at[index, 'Change'] = "Changed"
# Rows that were never flagged get the default label
df3["Change"] = df3["Change"].fillna("Not Changed")
print(df3)
Here is the output:
Name Age Change
0 foo 13 Changed
1 bar 15 Not Changed
2 foobar 14 Not Changed
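For larger frames, a vectorized alternative avoids the Python-level loop. This is a sketch, assuming both dataframes share the same index and only the Age column is compared:
import numpy as np
import pandas as pd

df = pd.DataFrame([['foo', 10], ['bar', 15], ['foobar', 14]], columns=['Name', 'Age'])
df2 = pd.DataFrame([['foo', 13], ['bar', 15], ['foobar', 14]], columns=['Name', 'Age'])

df3 = df2.copy()
# np.where picks the first label where the condition holds, the second otherwise
df3['Change'] = np.where(df['Age'] != df2['Age'], 'Changed', 'Not Changed')
print(df3)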


Adding array to Pandas dataframe as row

I want to add an array to an existing pandas dataframe as a row. Below is my code:
import pandas as pd
import numpy as np
data = [['tom', 10]]
df = pd.DataFrame(data, columns = ['Name', 'Age'])
print(df)
Y = [10, 100]
df.loc[0] = list(Y)
print(df)
Basically I want to add Y to df as a row without disturbing the existing rows. I also want the column names of the final df to be 'Y1' and 'Y2'.
Clearly, with the above code, the existing information of df gets replaced with Y.
Could you please help me with the right code?
Use loc, adding the values to the existing row under the additional columns:
df.loc[0, ['Y1', 'Y2']] = Y
df
Name Age Y1 Y2
0 tom 10 10.0 100.0
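If the intent is instead to append Y as a brand-new row (leaving row 0 untouched), here is a minimal sketch, assuming Y should fill the existing Name and Age columns:
import pandas as pd

df = pd.DataFrame([['tom', 10]], columns=['Name', 'Age'])
Y = [10, 100]
# len(df) is the next free label on a default RangeIndex, so this appends
df.loc[len(df)] = Y
print(df)
#   Name  Age
# 0  tom   10
# 1   10  100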

How to fill a pandas dataframe column using a value from another dataframe column

First, we can import some packages which might be useful:
import pandas as pd
import datetime
Say I now have a dataframe which has a date, name and age column.
df1 = pd.DataFrame({'date': ['10-04-2020', '04-07-2019', '12-05-2015' ], 'name': ['john', 'tim', 'sam'], 'age':[20, 22, 27]})
Now say I have another dataframe with some random columns
df2 = pd.DataFrame({'a': [1,2,3], 'b': [4,5,6]})
Question:
How can I take the age value in df1 filtered on the date (can select this value) and populate a whole new column in df2 with this value? Ideally this method should generalise for any number of rows in the dataframe.
Tried
The following is what I have tried (on a similar example), but for some reason it doesn't seem to work: it just shows NaN values in the majority of column entries, except for a few which seem to populate at random.
y = datetime.datetime(2015, 5, 12)
df2['new'] = df1[(df1['date'] == y)].age
Expected Output
Since I have filtered above based on sams age (date corresponds to the row with sams name) I would like the new column to be added to df2 with his age as all the entries (in this case 27 repeated 3 times).
df2 = pd.DataFrame({'a': [1,2,3], 'b': [4,5,6], 'new': [27, 27, 27]})
Try converting y to a string matching the date format in df1, then extract the scalar with .item():
y = datetime.datetime(2015, 5, 12).strftime('%d-%m-%Y')
df2.loc[:, 'new'] = df1.loc[df1['date'] == y, "age"].item()
# Output
a b new
0 1 4 27
1 2 5 27
2 3 6 27
You need to change the format of y to a string (matching df1's date format) and use the df.loc method:
y = datetime.datetime(2015, 5, 12)
y = y.strftime('%d-%m-%Y')
df2['new'] = int(df1.loc[df1['date'] == y, 'age'].values)
df2
Convert df1 date column to datetime type
df1['date'] = pd.to_datetime(df1.date, format='%d-%m-%Y')
Filter dataframe and get the age
req_date = '2015-05-12'
age_for_date = df1.query('date == @req_date').age.iloc[0]
NOTE: This assumes that there is only one age per date (As explained by OP in comments)
Create a new column
df2 = df2.assign(new=age_for_date)
Output
a b new
0 1 4 27
1 2 5 27
2 3 6 27
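Putting those steps together, here is a self-contained sketch of the same approach (using pd.Timestamp for the comparison, which is equivalent to the string form once the column has been converted):
import pandas as pd

df1 = pd.DataFrame({'date': ['10-04-2020', '04-07-2019', '12-05-2015'],
                    'name': ['john', 'tim', 'sam'],
                    'age': [20, 22, 27]})
df2 = pd.DataFrame({'a': [1, 2, 3], 'b': [4, 5, 6]})

# convert once, then filter on a real datetime value
df1['date'] = pd.to_datetime(df1.date, format='%d-%m-%Y')
age_for_date = df1.loc[df1['date'] == pd.Timestamp(2015, 5, 12), 'age'].iloc[0]
df2 = df2.assign(new=age_for_date)  # the scalar broadcasts down the column
print(df2)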

Can I avoid that the join column of the right data frame in a pandas merge appears in the output?

I am merging two data frames with pandas. I would like to avoid that, when joining, the output includes the join column of the right table.
Example:
import pandas as pd
age = [['tom', 10], ['nick', 15], ['juli', 14]]
df1 = pd.DataFrame(age, columns = ['Name', 'Age'])
toy = [['tom', 'GIJoe'], ['nick', 'car']]
df2 = pd.DataFrame(toy, columns = ['Name_child', 'Toy'])
df = pd.merge(df1,df2,left_on='Name',right_on='Name_child',how='left')
df.columns will give the output Index(['Name', 'Age', 'Name_child', 'Toy'], dtype='object'). Is there an easy way to obtain Index(['Name', 'Age', 'Toy'], dtype='object') instead? I can of course drop the column afterwards with del df['Name_child'], but I'd like my code to be as short as possible.
Based on @mgc's comment, you don't have to rename the columns of df2 permanently. Just pass df2 to the merge function with the column renamed on the fly; df2's own column names will remain as they are.
df = pd.merge(df1,df2.rename(columns={'Name_child': 'Name'}),on='Name', how='left')
df
Name Age Toy
0 tom 10 GIJoe
1 nick 15 car
2 juli 14 NaN
df.columns
Index(['Name', 'Age', 'Toy'], dtype='object')
df2.columns
Index(['Name_child', 'Toy'], dtype='object')
Set the index of the second dataframe to "Name_child". If you do this in the merge statement the columns in df2 remain unchanged.
df = pd.merge(df1,df2.set_index('Name_child'),left_on='Name',right_index=True,how='left')
This outputs the correct columns:
df
Name Age Toy
0 tom 10 GIJoe
1 nick 15 car
2 juli 14 NaN
df.columns
Index(['Name', 'Age', 'Toy'], dtype='object')
It seems even simpler to drop the column right afterwards.
df = (pd.merge(df1,df2,left_on='Name',right_on='Name_child',how='left')
.drop('Name_child', axis=1))
#----------------
import this
The Zen of Python, by Tim Peters
Beautiful is better than ugly.
Explicit is better than implicit.
Simple is better than complex.

How to aggregate, combining dataframes, with pandas groupby

I have a dataframe df and a column df['table'] such that each item in df['table'] is another dataframe with the same headers/number of columns. I was wondering if there's a way to do a groupby like this:
Original dataframe:
name table
Bob Pandas df1
Joe Pandas df2
Bob Pandas df3
Bob Pandas df4
Emily Pandas df5
After groupby:
name table
Bob Pandas df containing the appended df1, df3, and df4
Joe Pandas df2
Emily Pandas df5
I found this code snippet to do a groupby and lambda for strings in a dataframe, but haven't been able to figure out how to append entire dataframes in a groupby.
df['table'] = df.groupby(['name'])['table'].transform(lambda x : ' '.join(x))
I've also tried df['table'] = df.groupby(['name'])['HTML'].apply(list), but that gives me a df['table'] of all NaN.
Thanks for your help!!
Given 3 dataframes
import pandas as pd
dfa = pd.DataFrame({'a': [1, 2, 3]})
dfb = pd.DataFrame({'a': ['a', 'b', 'c']})
dfc = pd.DataFrame({'a': ['pie', 'steak', 'milk']})
Given another dataframe, with dataframes in the columns
df = pd.DataFrame({'name': ['Bob', 'Joe', 'Bob', 'Bob', 'Emily'], 'table': [dfa, dfa, dfb, dfc, dfb]})
# print the type for the first value in the table column, to confirm it's a dataframe
print(type(df.loc[0, 'table']))
[out]:
<class 'pandas.core.frame.DataFrame'>
Each group of dataframes can be combined into a single dataframe by using .groupby, aggregating a list for each group, and combining the dataframes in the list with pd.concat.
# if there is only one column, or if there are multiple columns of dataframes to aggregate
dfg = df.groupby('name').agg(lambda x: pd.concat(list(x)).reset_index(drop=True))
# display(dfg.loc['Bob', 'table'])
a
0 1
1 2
2 3
3 a
4 b
5 c
6 pie
7 steak
8 milk
# to specify a single column, or specify multiple columns, from many columns
dfg = df.groupby('name')[['table']].agg(lambda x: pd.concat(list(x)).reset_index(drop=True))
Not a duplicate
Originally, I had marked this question as a duplicate of How to group dataframe rows into list in pandas groupby, thinking the dataframes could be aggregated into a list, and then combined with pd.concat.
df.groupby('name')['table'].apply(list)
df.groupby('name').agg(list)
df.groupby('name')['table'].agg(list)
df.groupby('name').agg({'table': list})
df.groupby('name').agg(lambda x: list(x))
However, these all result in a StopIteration error, when there are dataframes to aggregate.
Here, let's create a dataframe with dataframes as column values.
First, I start with three dataframes:
import pandas as pd
# creating dataframes that we will assign to Bob and Joe; notice the b's and j's:
df1 = pd.DataFrame({'var1':[12, 34, -4, None], 'letter':['b1', 'b2', 'b3', 'b4']})
df2 = pd.DataFrame({'var1':[1, 23, 44, 0], 'letter':['j1', 'j2', 'j3', 'j4']})
df3 = pd.DataFrame({'var1':[22, -3, 7, 78], 'letter':['b5', 'b6', 'b7', 'b8']})
# let's make a list of dictionaries:
list_of_dfs = [
{'name':'Bob' ,'table':df1},
{'name':'Joe' ,'table':df2},
{'name':'Bob' ,'table':df3}
]
# construct the main dataframe:
original_df = pd.DataFrame(list_of_dfs)
print(original_df)
original_df.shape #shows (3, 2)
Now that we have created the original dataframe as the input, we will produce the resulting new dataframe. In doing so, we use groupby(), agg(), and pd.concat(). We also reset the index.
new_df = original_df.groupby('name')['table'].agg(lambda series: pd.concat(series.tolist())).reset_index()
print(new_df)
#check that Bob's table is now a concatenated table of df1 and df3:
new_df[new_df['name']=='Bob']['table'][0]
The output to the last line of code is:
var1 letter
0 12.0 b1
1 34.0 b2
2 -4.0 b3
3 NaN b4
0 22.0 b5
1 -3.0 b6
2 7.0 b7
3 78.0 b8
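If the concatenated tables don't need to live inside a dataframe at all, here is a sketch of a plain-dict alternative, reusing original_df from above (note that ignore_index=True renumbers the rows, unlike the answer above, which keeps the original indices):
# map each name to the concatenation of all its tables
tables_by_name = {name: pd.concat(group['table'].tolist(), ignore_index=True)
                  for name, group in original_df.groupby('name')}
print(tables_by_name['Bob'])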

How to update values in a specific row in a Python Pandas DataFrame?

With the nice indexing methods in Pandas I have no problems extracting data in various ways. On the other hand I am still confused about how to change data in an existing DataFrame.
In the following code I have two DataFrames and my goal is to update values in a specific row in the first df from values of the second df. How can I achieve this?
import pandas as pd
df = pd.DataFrame({'filename' : ['test0.dat', 'test2.dat'],
'm': [12, 13], 'n' : [None, None]})
df2 = pd.DataFrame({'filename' : 'test2.dat', 'n':16}, index=[0])
# this overwrites the first row but we want to update the second
# df.update(df2)
# this does not update anything
df.loc[df.filename == 'test2.dat'].update(df2)
print(df)
gives
filename m n
0 test0.dat 12 None
1 test2.dat 13 None
[2 rows x 3 columns]
but how can I achieve this:
filename m n
0 test0.dat 12 None
1 test2.dat 13 16
[2 rows x 3 columns]
So first of all, pandas updates using the index. When an update command does not update anything, check both left-hand side and right-hand side. If you don't update the indices to follow your identification logic, you can do something along the lines of
>>> df.loc[df.filename == 'test2.dat', 'n'] = df2[df2.filename == 'test2.dat'].loc[0]['n']
>>> df
Out[331]:
filename m n
0 test0.dat 12 None
1 test2.dat 13 16
If you want to do this for the whole table, I suggest a method I believe is superior to the previously mentioned ones: since your identifier is filename, set filename as your index, and then use update() as you wanted to. Both merge and the apply() approach contain unnecessary overhead:
>>> df.set_index('filename', inplace=True)
>>> df2.set_index('filename', inplace=True)
>>> df.update(df2)
>>> df
Out[292]:
m n
filename
test0.dat 12 None
test2.dat 13 16
In SQL, I would have done it in one shot as
update table1 set col1 = new_value where col1 = old_value
but in Python Pandas, we could just do this:
data = [['ram', 10], ['sam', 15], ['tam', 15]]
kids = pd.DataFrame(data, columns = ['Name', 'Age'])
kids
which will generate the following output:
Name Age
0 ram 10
1 sam 15
2 tam 15
now we can run:
kids.loc[kids.Age == 15,'Age'] = 17
kids
which will show the following output
Name Age
0 ram 10
1 sam 17
2 tam 17
which should be equivalent to the following SQL
update kids set age = 17 where age = 15
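The same loc pattern also covers the more general SQL form where the filter and the assignment target different columns, e.g. update kids set age = 11 where name = 'ram'. A sketch, reusing the kids frame above:
# filter on Name, assign to Age
kids.loc[kids.Name == 'ram', 'Age'] = 11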
If you have one large dataframe and only a few update values I would use apply like this:
import pandas as pd
df = pd.DataFrame({'filename' : ['test0.dat', 'test2.dat'],
'm': [12, 13], 'n' : [None, None]})
data = {'filename' : 'test2.dat', 'n':16}
def update_vals(row, data=data):
if row.filename == data['filename']:
row.n = data['n']
return row
# apply returns a new DataFrame, so assign it back
df = df.apply(update_vals, axis=1)
Another option is combine_first, which updates null elements with the value in the same location in other. The row and column indexes of the resulting DataFrame will be the union of the two.
df1 = pd.DataFrame({'A': [None, 0], 'B': [None, 4]})
df2 = pd.DataFrame({'A': [1, 1], 'B': [3, 3]})
df1.combine_first(df2)
A B
0 1.0 3.0
1 0.0 4.0
More information is in the pandas documentation for combine_first.
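As a sketch of how combine_first could address the original question, assuming filename uniquely identifies rows so it can serve as the alignment index:
import pandas as pd

df = pd.DataFrame({'filename': ['test0.dat', 'test2.dat'],
                   'm': [12, 13], 'n': [None, None]})
df2 = pd.DataFrame({'filename': ['test2.dat'], 'n': [16]})

# align on filename; null n values in df are filled from df2
result = (df.set_index('filename')
            .combine_first(df2.set_index('filename'))
            .reset_index())
print(result)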
There are probably a few ways to do this, but one approach would be to merge the two dataframes together on the filename column, then populate the column 'n' from the right dataframe if a match was found. The n_x, n_y in the code refer to the left/right dataframes in the merge.
In[100] : df = pd.merge(df1, df2, how='left', on='filename')
In[101] : df
Out[101]:
filename m n_x n_y
0 test0.dat 12 None NaN
1 test2.dat 13 None 16
In[102] : df['n'] = df['n_y'].fillna(df['n_x'])
In[103] : df = df.drop(['n_x','n_y'], axis=1)
In[104] : df
Out[104]:
filename m n
0 test0.dat 12 None
1 test2.dat 13 16
If you want to put anything (for example a dict) in the ii-th row, wrap it in square brackets:
df.loc[df.iloc[ii].name, 'filename'] = [{'anything': 0}]
I needed to update a few rows of the dataframe and add a suffix, conditionally, based on another column's value in the same dataframe.
Given a df with columns Feature and Entity, the goal is to update Entity for a specific feature type:
df.loc[df.Feature == 'dnb', 'Entity'] = 'duns_' + df.loc[df.Feature == 'dnb','Entity']
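A minimal self-contained example of that pattern, with made-up data:
import pandas as pd

df = pd.DataFrame({'Feature': ['dnb', 'abc', 'dnb'],
                   'Entity': ['111', '222', '333']})
# prefix only the rows whose Feature is 'dnb'
df.loc[df.Feature == 'dnb', 'Entity'] = 'duns_' + df.loc[df.Feature == 'dnb', 'Entity']
print(df)
#   Feature    Entity
# 0     dnb  duns_111
# 1     abc       222
# 2     dnb  duns_333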
