if I have these dataframes
df1 = pd.DataFrame({'index': [1,2,3,4],
'col1': ['a','b','c','d'],
'col2': ['h','e','l','p']})
df2 = pd.DataFrame({'index': [1,2,3,4],
'col1': ['a','e','f','d'],
'col2': ['h','e','lp','p']})
df1
index col1 col2
0 1 a h
1 2 b e
2 3 c l
3 4 d p
df2
index col1 col2
0 1 a h
1 2 e e
2 3 f lp
3 4 d p
I want to merge them, check whether the rows differ, and get an output like this:
index col1 col1_validation col2 col2_validation
0 1 a True h True
1 2 b False e True
2 3 c False l False
3 4 d True p True
how can I achieve that?
It looks like col1 and col2 in your "merged" dataframe are just taken from df1. In that case, you can simply compare col1 and col2 between the original data frames and add the results as new columns:
cols = ["col1", "col2"]
val_cols = ["col1_validation", "col2_validation"]
# (optional) new dataframe, so you don't mutate df1
df = df1.copy()
new_cols = (df1[cols] == df2[cols])
df[val_cols] = new_cols
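For reference, here is a complete runnable version of the above, written column by column so it behaves the same across pandas versions. Note that the element-wise comparison relies on both frames sharing the same index:

```python
import pandas as pd

df1 = pd.DataFrame({'index': [1, 2, 3, 4],
                    'col1': ['a', 'b', 'c', 'd'],
                    'col2': ['h', 'e', 'l', 'p']})
df2 = pd.DataFrame({'index': [1, 2, 3, 4],
                    'col1': ['a', 'e', 'f', 'd'],
                    'col2': ['h', 'e', 'lp', 'p']})

df = df1.copy()  # keep df1 unchanged
for col in ['col1', 'col2']:
    # element-wise comparison, aligned on the shared index
    df[f'{col}_validation'] = df1[col] == df2[col]
```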
You can merge and compare the two data frames with something similar to the following:
df1 = pd.DataFrame({'index': [1,2,3,4],
'col1': ['a','b','c','d'],
'col2': ['h','e','l','p']})
df2 = pd.DataFrame({'index': [1,2,3,4],
'col1': ['a','e','f','d'],
'col2': ['h','e','lp','p']})
# give columns unique name when merging
df1.columns = df1.columns + '_df1'
df2.columns = df2.columns + '_df2'
# merge/combine data frames
combined = pd.concat([df1, df2], axis = 1)
# add calculated columns
combined['col1_validation'] = combined['col1_df1'] == combined['col1_df2']
combined['col2_validation'] = combined['col2_df1'] == combined['col2_df2']
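An alternative sketch (assuming 'index' really is a shared key column) is to let DataFrame.merge handle the column collisions via its suffixes argument instead of renaming columns by hand:

```python
import pandas as pd

df1 = pd.DataFrame({'index': [1, 2, 3, 4],
                    'col1': ['a', 'b', 'c', 'd'],
                    'col2': ['h', 'e', 'l', 'p']})
df2 = pd.DataFrame({'index': [1, 2, 3, 4],
                    'col1': ['a', 'e', 'f', 'd'],
                    'col2': ['h', 'e', 'lp', 'p']})

# merge on the shared key; clashing column names get _df1/_df2 suffixes
combined = df1.merge(df2, on='index', suffixes=('_df1', '_df2'))
combined['col1_validation'] = combined['col1_df1'] == combined['col1_df2']
combined['col2_validation'] = combined['col2_df1'] == combined['col2_df2']
```

Unlike the positional concat, this still lines rows up correctly even if the two frames are in a different order.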
Say I have two DataFrames
df1 = pd.DataFrame({'A':[1,2], 'B':[3,4]}, index = [0,1])
df2 = pd.DataFrame({'B':[8,9], 'C':[10,11]}, index = [1,2])
I want to merge so that any values in df1 are overwritten if there is a value in df2 at that location, and any new values in df2 are added, including new rows and columns.
The result should be:
A B C
0 1 3 nan
1 2 8 10
2 nan 9 11
I've tried combine_first, but that causes only NaN values to be overwritten.
update has the issue that new rows are created rather than overwritten.
merge has many issues of its own.
I've tried writing my own function
def take_right(df1, df2, j, i):
    print(df1)
    print(df2)
    try:
        s1 = df1[j][i]
    except:
        s1 = np.NaN
    try:
        s2 = df2[j][i]
    except:
        s2 = np.NaN
    if math.isnan(s2):
        #print(s1)
        return s1
    else:
        #print(s2)
        return s2

def combine_df(df1, df2):
    rows = set(df1.index.values.tolist()) | set(df2.index.values.tolist())
    #print(rows)
    columns = set(df1.columns.values.tolist()) | set(df2.columns.values.tolist())
    #print(columns)
    df = pd.DataFrame()
    #df.columns = columns
    for i in rows:
        #df[:][i]=[]
        for j in columns:
            df = df.insert(int(i), j, take_right(df1, df2, j, i), allow_duplicates=False)
            #print(df)
    return df
This won't add new columns or rows to an empty DataFrame.
Thank you!!
One approach is to create an empty output dataframe with the union of the columns and indices from df1 and df2, and then use the df.update method to assign their values into out_df:
import pandas as pd
df1 = pd.DataFrame({'A':[1,2], 'B':[3,4]}, index = [0,1])
df2 = pd.DataFrame({'B':[8,9], 'C':[10,11]}, index = [1,2])
out_df = pd.DataFrame(
columns = df1.columns.union(df2.columns),
index = df1.index.union(df2.index),
)
out_df.update(df1)
out_df.update(df2)
out_df
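One caveat worth flagging: because out_df is created empty, its columns start with object dtype and stay that way after the updates. If numeric dtypes matter downstream, a sketch like the following (the infer_objects call is the only addition) restores them:

```python
import pandas as pd

df1 = pd.DataFrame({'A': [1, 2], 'B': [3, 4]}, index=[0, 1])
df2 = pd.DataFrame({'B': [8, 9], 'C': [10, 11]}, index=[1, 2])

out_df = pd.DataFrame(
    columns=df1.columns.union(df2.columns),
    index=df1.index.union(df2.index),
)
out_df.update(df1)      # fill in df1's values first
out_df.update(df2)      # df2's non-NaN values win where both exist
out_df = out_df.infer_objects()  # object -> float64 where possible
```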
Why does combine_first not work? Calling it on df2 (so df2's non-NaN values take priority over df1's) gives exactly the expected output:
df = df2.combine_first(df1)
print(df)
Output:
A B C
0 1.0 3 NaN
1 2.0 8 10.0
2 NaN 9 11.0
Below is my script for a generic data frame in Python using pandas. I am hoping to split a certain column into several new columns while respecting the original orientation of the items in that column (right-aligned, so shorter lists leave NaN on the left).
Please see below for clarity. Thank you in advance!
My script:
import pandas as pd
import numpy as np
df = pd.DataFrame({'col1': ['x,y,z', 'a,b', 'c']})
print(df)
Here's what I want
df = pd.DataFrame({'col1': ['x',np.nan,np.nan],
'col2': ['y','a',np.nan],
'col3': ['z','b','c']})
print(df)
Here's what I get
df = pd.DataFrame({'col1': ['x','a','c'],
'col2': ['y','b',np.nan],
'col3': ['z',np.nan,np.nan]})
print(df)
You can use the justify function from this answer with Series.str.split:
dfn = pd.DataFrame(
justify(df['col1'].str.split(',', expand=True).to_numpy(),
invalid_val=None,
axis=1,
side='right')
).add_prefix('col')
col0 col1 col2
0 x y z
1 None a b
2 None None c
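justify is not part of pandas itself; since the link to the referenced answer isn't reproduced here, a minimal sketch of such a helper (adapted from the commonly shared NumPy version, so the exact implementation in the linked answer may differ) is:

```python
import numpy as np
import pandas as pd

def justify(a, invalid_val=0, axis=1, side='left'):
    """Push the valid values of `a` to one side along `axis`,
    filling the remaining slots with `invalid_val`."""
    if invalid_val is np.nan:
        mask = ~np.isnan(a)
    else:
        mask = a != invalid_val
    justified_mask = np.sort(mask, axis=axis)   # False sorts before True
    if side in ('up', 'left'):
        justified_mask = np.flip(justified_mask, axis=axis)
    out = np.full(a.shape, invalid_val, dtype=object)
    if axis == 1:
        out[justified_mask] = a[mask]
    else:
        out.T[justified_mask.T] = a.T[mask.T]
    return out

df = pd.DataFrame({'col1': ['x,y,z', 'a,b', 'c']})
dfn = pd.DataFrame(
    justify(df['col1'].str.split(',', expand=True).to_numpy(),
            invalid_val=None, axis=1, side='right')
).add_prefix('col')
```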
Here is a way of tweaking the split:
max_delim = df['col1'].str.count(',').max() # count the max occurrence of ','
delim_to_add = max_delim - df['col1'].str.count(',') #get difference of count from max
# multiply the delimiter and add it to series, followed by split
df[['col1','col2','col3']] = (df['col1'].radd([','*i for i in delim_to_add])
.str.split(',',expand=True).replace('',np.nan))
print(df)
col1 col2 col3
0 x y z
1 NaN a b
2 NaN NaN c
Try something like
s=df.col1.str.count(',')
#(s.max()-s).map(lambda x : x*',')
#0
#1 ,
#2 ,,
Name: col1, dtype: object
(s.max()-s).map(lambda x : x*',').add(df.col1).str.split(',',expand=True)
0 1 2
0 x y z
1 a b
2 c
Suppose I have the following dataset (2 rows, 2 columns, headers are Char0 and Char1):
dataset = [['A', 'B'], ['B', 'C']]
columns = ['Char0', 'Char1']
df = pd.DataFrame(dataset, columns=columns)
I would like to one-hot encode the columns Char0 and Char1, so:
df = pd.concat([df, pd.get_dummies(df["Char0"], prefix='Char0')], axis=1)
df = pd.concat([df, pd.get_dummies(df["Char1"], prefix='Char1')], axis=1)
df.drop(['Char0', "Char1"], axis=1, inplace=True)
which results in a dataframe with column headers Char0_A, Char0_B, Char1_B, Char1_C.
Now, I would like to, for each column, have an indication for both A, B, C, and D (even though, there is currently no 'D' in the dataset). In this case, this would mean 8 columns: Char0_A, Char0_B, Char0_C, Char0_D, Char1_A, Char1_B, Char1_C, Char1_D.
Can somebody help me out?
Use get_dummies on all columns, then DataFrame.reindex with every possible column name created by itertools.product:
dataset = [['A', 'B'], ['B', 'C']]
columns = ['Char0', 'Char1']
df = pd.DataFrame(dataset, columns=columns)
vals = ['A','B','C','D']
from itertools import product
cols = ['_'.join(x) for x in product(df.columns, vals)]
print (cols)
['Char0_A', 'Char0_B', 'Char0_C', 'Char0_D', 'Char1_A', 'Char1_B', 'Char1_C', 'Char1_D']
df1 = pd.get_dummies(df).reindex(cols, axis=1, fill_value=0)
print (df1)
Char0_A Char0_B Char0_C Char0_D Char1_A Char1_B Char1_C Char1_D
0 1 0 0 0 0 1 0 0
1 0 1 0 0 0 0 1 0
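Another sketch of the same idea: declare each column as Categorical with the full A-D value set up front, so get_dummies emits a column per category even for values that never occur in the data:

```python
import pandas as pd

df = pd.DataFrame([['A', 'B'], ['B', 'C']], columns=['Char0', 'Char1'])
vals = ['A', 'B', 'C', 'D']

# fix the category set; unseen categories still get a dummy column
for c in df.columns:
    df[c] = pd.Categorical(df[c], categories=vals)

df1 = pd.get_dummies(df)
```

This avoids building the column list by hand, at the cost of converting the dtypes.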
I have some sample Python code:
import pandas as pd
ddf = pd.DataFrame({'col1' : ['A', 'A', 'B'],
'Id' : [3,1,2],
'col3': ['x','a','b']})
ddf.index=ddf['Id']
ddf.sort_values(by='Id')
The above snippet produces "FutureWarning: 'Id' is both an index level and a column label. Defaulting to column, but this will raise an ambiguity error in a future version". It does become an error under recent versions of pandas. I am quite new to Python and pandas. How do I resolve this issue?
The best approach here is to convert the Id column to the index with DataFrame.set_index, which avoids an index.name identical to one of the column names:
ddf = pd.DataFrame({'col1' : ['A', 'A', 'B'],
'Id' : [3,1,2],
'col3': ['x','a','b']})
ddf = ddf.set_index('Id')
print (ddf.index.name)
Id
print (ddf.columns)
Index(['col1', 'col3'], dtype='object')
For sorting by index, DataFrame.sort_index is the better choice:
print (ddf.sort_index())
col1 col3
Id
1 A a
2 B b
3 A x
Your solution works if you change index.name to something different:
ddf = pd.DataFrame({'col1' : ['A', 'A', 'B'],
'Id' : [3,1,2],
'col3': ['x','a','b']})
ddf.index=ddf['Id']
print (ddf.index.name)
Id
print (ddf.columns)
Index(['col1', 'Id', 'col3'], dtype='object')
Set a different index.name with DataFrame.rename_axis, or assign it directly:
ddf = ddf.rename_axis('newID')
#alternative
#ddf.index.name = 'newID'
print (ddf.index.name)
newID
print (ddf.columns)
Index(['col1', 'Id', 'col3'], dtype='object')
Now it is possible to distinguish between the index level and the column names, because sort_values works with both:
print(ddf.sort_values(by='Id'))
col1 Id col3
newID
1 A 1 a
2 B 2 b
3 A 3 x
print (ddf.sort_values(by='newID'))
#same like sorting by index
#print (ddf.sort_index())
col1 Id col3
newID
1 A 1 a
2 B 2 b
3 A 3 x
Simply add .values, which drops the index name:
ddf.index=ddf['Id'].values
ddf.sort_values(by='Id')
Out[314]:
col1 Id col3
1 A 1 a
2 B 2 b
3 A 3 x
Both your columns and your row index contain 'Id'; a simple solution is to not set the (row) index to 'Id' at all.
import pandas as pd
ddf = pd.DataFrame({'col1' : ['A', 'A', 'B'],
'Id' : [3,1,2],
'col3': ['x','a','b']})
ddf.sort_values(by='Id')
Out[0]:
col1 Id col3
1 A 1 a
2 B 2 b
0 A 3 x
Or set the index when you create the df:
ddf = pd.DataFrame({'col1' : ['A', 'A', 'B'],
'col3': ['x','a','b']},
index=[3,1,2])
ddf.sort_index()
Out[1]:
col1 col3
1 A a
2 B b
3 A x
I am using a for loop to go through two data frames in order to eventually concat them.
data_frames = []
data_frames.append(df1)
data_frames.append(df2)
For data_frame in data_frames:
data_frame['col1'] = 'Test'
if date_frame.name = df1:
data_frame['col2'] = 'Apple'
else:
data_frame['col2'] = 'Orange'
The above fails, but in essence I want data_frame['col2']'s value to depend on which dataframe the row came from: if the row is from df1, the value should be 'Apple'; otherwise it should be 'Orange'.
There are quite a few syntax errors in your code, but I believe this is what you're trying to do:
# Example Dataframes
df1 = pd.DataFrame({
'a': [1, 1, 1],
})
# With names!
df1.name = 'df1'
df2 = pd.DataFrame({
'a': [2, 2, 2],
})
df2.name = 'df2'
# Create a list of df1 & df2
data_frames = [df1, df2]
# For each data frame in list
for data_frame in data_frames:
# Set col1 to Test
data_frame['col1'] = 'Test'
# If the data_frame.name is 'df1'
if data_frame.name == 'df1':
# Set col2 to 'Apple'
data_frame['col2'] = 'Apple'
else:
# Else set 'col2' to 'Orange'
data_frame['col2'] = 'Orange'
# Print dataframes
for data_frame in data_frames:
print("{name}:\n{value}\n\n".format(name=data_frame.name, value=data_frame))
Output:
df1:
a col1 col2
0 1 Test Apple
1 1 Test Apple
2 1 Test Apple
df2:
a col1 col2
0 2 Test Orange
1 2 Test Orange
2 2 Test Orange
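One caveat: .name is an ad-hoc attribute here, and pandas does not preserve it through most operations (copies, concat, etc.). A sketch that sidesteps it entirely by keeping the frames in a dict keyed by name:

```python
import pandas as pd

df1 = pd.DataFrame({'a': [1, 1, 1]})
df2 = pd.DataFrame({'a': [2, 2, 2]})

# the dict key plays the role of the frame's name
frames = {'df1': df1, 'df2': df2}
for name, frame in frames.items():
    frame['col1'] = 'Test'
    frame['col2'] = 'Apple' if name == 'df1' else 'Orange'
```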
Let's use pd.concat with keys.
Using #AaronNBrock setup:
df1 = pd.DataFrame({
'a': [1, 1, 1],
})
df2 = pd.DataFrame({
'a': [2, 2, 2],
})
list_of_dfs = ['df1','df2']
df_out = pd.concat([eval(i) for i in list_of_dfs], keys=list_of_dfs)\
.rename_axis(['Source',None]).reset_index()\
.drop('level_1',axis=1)
print(df_out)
Output:
Source a
0 df1 1
1 df1 1
2 df1 1
3 df2 2
4 df2 2
5 df2 2
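If the eval call feels fragile, the same result can be sketched by passing the frames directly and using the names argument of pd.concat to label the key level:

```python
import pandas as pd

df1 = pd.DataFrame({'a': [1, 1, 1]})
df2 = pd.DataFrame({'a': [2, 2, 2]})

# keys tags each frame's rows; names labels that MultiIndex level 'Source'
df_out = (pd.concat([df1, df2], keys=['df1', 'df2'], names=['Source', None])
            .reset_index(level='Source')
            .reset_index(drop=True))
```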