how to render distinct columns/rows by comparing two dataframes in pandas? - python

I have two dataframes that share most of their columns, but each also has a few columns that appear in only one of them. I want to print both the distinct columns and the common columns so I get a better idea of which columns changed between the two dataframes. I found some interesting posts on SO but don't know why I got an error. The two dataframes have the following shapes:
df19.shape
(39831, 1952)
df20.shape
(39821, 1962)
here is dummy data:
df1 = pd.DataFrame([[1, 2], [1, 3], [4, 6],[11,13],[10,19],[21,23]], columns=['A', 'B'])
df2 = pd.DataFrame([[3, 4,0,7], [1, 3,9,2], [4, 6,3,8],[8,5,1,6]], columns=['A', 'B','C','D'])
current attempt
I searched SO and tried the following:
res=pd.concat([df19, df20]).loc[df19.index.symmetric_difference(df20.index)]
res.shape
(10, 1984)
This gave me the distinct rows but not the distinct columns. I also tried this, but it gave me an error:
df19.compare(df20, keep_equal=True, keep_shape=True)
How should I render distinct rows and columns by comparing two dataframes in pandas? Does anyone know an easy way to do this in Python? Any quick thoughts? Thanks
objective
I simply want to render the distinct rows or columns when comparing two dataframes, either by column name or by which rows differ. For instance, compared to df1, which columns are newly added in df2? Similarly, which rows are added in df2, and so on. Any ideas?

I would recommend selecting the columns by filtering on the column names.
common = [i for i in list(df1) if i in list(df2)]
common_df = df2[common]  # columns present in both dataframes
distinct = [i for i in list(df2) if i not in list(df1)]
distinct_df = df2[distinct]  # columns present only in df2

Thanks to @Shaido, this one worked for me:
import pandas as pd
df1=pd.read_csv(data1)
df2=pd.read_csv(data2)
df1_cols = df1.columns
df2_cols = df2.columns
common_cols = df1_cols.intersection(df2_cols)
df2_not_df1 = df2_cols.difference(df1_cols)
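As a rough sketch, here is the same idea applied to the dummy data from the question. The column part follows the answer above. The row part is an added assumption about what "distinct rows" means: rows are compared on the shared columns only, using merge with indicator=True. The names df1_not_df2, only_in_df1 and only_in_df2 are illustrative.

```python
import pandas as pd

df1 = pd.DataFrame([[1, 2], [1, 3], [4, 6], [11, 13], [10, 19], [21, 23]],
                   columns=['A', 'B'])
df2 = pd.DataFrame([[3, 4, 0, 7], [1, 3, 9, 2], [4, 6, 3, 8], [8, 5, 1, 6]],
                   columns=['A', 'B', 'C', 'D'])

# Column comparison: Index objects support set-like operations.
common_cols = df1.columns.intersection(df2.columns)
df2_not_df1 = df2.columns.difference(df1.columns)
df1_not_df2 = df1.columns.difference(df2.columns)

# Row comparison on the shared columns: the _merge column records
# whether a row came from the left frame, the right frame, or both.
merged = df1[common_cols].merge(df2[common_cols], how='outer', indicator=True)
only_in_df1 = merged.loc[merged['_merge'] == 'left_only', list(common_cols)]
only_in_df2 = merged.loc[merged['_merge'] == 'right_only', list(common_cols)]

print(list(common_cols), list(df2_not_df1), list(df1_not_df2))
print(only_in_df1)
print(only_in_df2)
```

For frames as wide as df19 and df20, comparing rows on every shared column may be slow or too strict, so restricting the merge to a few key columns could be a more practical choice.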

Related

Join two dataframes on the values present in a specific column in the name_data dataframe using koalas

I am trying to join the two dataframes shown below on the code column values present in the name_data dataframe.
I expect the resulting dataframe to contain only the rows from the team_data dataframe whose code value is present in the name_data dataframe.
I am using koalas on Databricks, and I have the following code using the join operation.
import databricks.koalas as ks
name_data= ks.DataFrame({'code':['123a', '345b', '678c'],
'id':[1, 2, 3]})
team_data = ks.DataFrame({'code':['123a', '23s', '34a'],
'id':[1, 2, 3]})
team_data_filtered = team_data.join(name_data.set_index('code'), on='code')
display(team_data_filtered)
The expected output would be to see only the following in team_data_filtered.
Code id
'123a' 1
But my code is throwing an error stating that columns overlap but no suffix specified: ['id'].
Could someone help resolve this issue?
Try adding suffix parameters:
team_data_filtered = team_data.join(name_data.set_index('code'), on='code',
                                    lsuffix='_1', rsuffix='_2')
team_data_filtered = team_data_filtered.loc[team_data_filtered.id_1 == team_data_filtered.id_2]
display(team_data_filtered)
And then to clean up the columns, if desired:
team_data_filtered.rename({'id_1':'id'}, inplace=True, axis=1)
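An alternative that avoids the overlapping id column altogether is an inner merge against only the code column of name_data. The sketch below uses plain pandas so it can run anywhere. koalas also provides a DataFrame.merge method with on and how parameters, but whether it behaves identically on Databricks is an assumption that has not been checked here.

```python
import pandas as pd

name_data = pd.DataFrame({'code': ['123a', '345b', '678c'], 'id': [1, 2, 3]})
team_data = pd.DataFrame({'code': ['123a', '23s', '34a'], 'id': [1, 2, 3]})

# Keep only the key column (deduplicated) on the right side, so the merge
# acts as a filter and cannot introduce a second 'id' column.
codes = name_data[['code']].drop_duplicates()
team_data_filtered = team_data.merge(codes, on='code', how='inner')

print(team_data_filtered)
```

Unlike the join above, this matches on code only and does not require the id values to agree.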

How to multiply all rows of a Pandas dataframe by a single row in another Pandas dataframe?

I have this Dataframe in python
and I want to multiply every row in the first dataframe by this single row in the dataframe below, treated as a vector
Some things I have tried from googling: df.mul, df.apply. But they seem to multiply the two frames together normally instead of doing a vectorized row-wise operation
Example data:
df = pd.DataFrame({'x':[1,2,3], 'y':[1,2,3]})
v1 = pd.DataFrame({'x':[2], 'y':[3]})
Multiply DataFrame with row:
df.multiply(np.array(v1), axis='columns')
If the use case requires the columns to be matched by name rather than by position:
Example:
df = pd.DataFrame([[1, 2], [3, 4]], columns=['x', 'y'])
coeffs_df = pd.DataFrame([[10, 9]], columns=['y', 'x'])
You need to convert the single-row DataFrame (coeffs_df) to a Series first, then perform the multiplication:
df.multiply(coeffs_df.iloc[0], axis='columns')
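To make the difference between the two approaches concrete, this small sketch runs both on the example above. The names by_label and by_position are illustrative only.

```python
import pandas as pd

df = pd.DataFrame([[1, 2], [3, 4]], columns=['x', 'y'])
coeffs_df = pd.DataFrame([[10, 9]], columns=['y', 'x'])

# Series operand: coefficients are aligned to df by column label,
# so x is multiplied by 9 and y by 10.
by_label = df.multiply(coeffs_df.iloc[0], axis='columns')

# Array operand: labels are discarded and values are matched by position,
# so x is multiplied by 10 and y by 9.
by_position = df.multiply(coeffs_df.to_numpy(), axis='columns')

print(by_label)
print(by_position)
```

When the column order of the two frames is known to match, both forms give the same result. The Series form is the safer choice when it might not.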

Multiple columns with the same name in Pandas

I am creating a dataframe from a CSV file. I have gone through the docs, multiple SO posts, links as I have just started Pandas but didn't get it. The CSV file has multiple columns with same names say a.
So after forming dataframe and when I do df['a'] which value will it return? It does not return all values.
Also only one of the values will have a string rest will be None. How can I get that column?
the relevant parameter is mangle_dupe_cols
from the docs
mangle_dupe_cols : boolean, default True
Duplicate columns will be specified as 'X.0'...'X.N', rather than 'X'...'X'
by default, your duplicated 'a' columns are renamed with numeric suffixes along the lines described above (recent pandas versions produce 'a', 'a.1', ..., 'a.N').
if you used mangle_dupe_cols=False, importing this csv would produce an error.
you can get all of your columns with
df.filter(like='a')
demonstration
from io import StringIO  # Python 3; the separate StringIO module exists only in Python 2
import pandas as pd
txt = """a, a, a, b, c, d
1, 2, 3, 4, 5, 6
7, 8, 9, 10, 11, 12"""
df = pd.read_csv(StringIO(txt), skipinitialspace=True)
df
df.filter(like='a')
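For reference, a self-contained version of the demonstration for Python 3 and a recent pandas might look like the sketch below. The regex passed to filter is an added assumption: like='a' matches any column whose name contains the letter a, so an anchored pattern is used here to select only 'a' and its renamed duplicates.

```python
from io import StringIO

import pandas as pd

txt = """a, a, a, b, c, d
1, 2, 3, 4, 5, 6
7, 8, 9, 10, 11, 12"""

# Duplicate header names are made unique on read.
df = pd.read_csv(StringIO(txt), skipinitialspace=True)
print(list(df.columns))

# Select 'a' and its renamed duplicates ('a.1', 'a.2', ...) only.
a_cols = df.filter(regex=r'^a(\.\d+)?$')
print(a_cols)
```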
I had a similar issue, not due to reading from csv, but I had multiple df columns with the same name (in my case 'id'). I solved it by taking df.columns and resetting the column names using a list.
In : df.columns
Out:
Index(['success', 'created', 'id', 'errors', 'id'], dtype='object')
In : df.columns = ['success', 'created', 'id1', 'errors', 'id2']
In : df.columns
Out:
Index(['success', 'created', 'id1', 'errors', 'id2'], dtype='object')
From here, I was able to call 'id1' or 'id2' to get just the column I wanted.
This is what I usually do with my gene expression dataset, where the same gene name can occur more than once because of a slightly different genetic sequence of the same gene:
First, create a list of the duplicated columns in the dataframe (that is, column names which appear more than once):
duplicated_columns_list = []
list_of_all_columns = list(df.columns)
for column in list_of_all_columns:
    # record each duplicated name only once
    if list_of_all_columns.count(column) > 1 and column not in duplicated_columns_list:
        duplicated_columns_list.append(column)
duplicated_columns_list
Then use the .index() method, which finds the first remaining occurrence of the duplicated name on each call, and append an underscore suffix to it:
for column in duplicated_columns_list:
    # the first call renames the first occurrence, so the second call finds the next one
    list_of_all_columns[list_of_all_columns.index(column)] = column + '_1'
    list_of_all_columns[list_of_all_columns.index(column)] = column + '_2'
This for loop helps me to underscore all of the duplicated columns and now every column has a distinct name.
This specific code is relevant for columns that appear exactly 2 times, but it can be modified for columns that appear even more than 2 times in your dataframe.
Finally, rename your columns with the underscored elements:
df.columns = list_of_all_columns
That's it, I hope it helps :)
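One way the loop could be generalized to names that appear more than twice is sketched below. The helper dedupe_column_names is hypothetical, not a pandas function. It numbers every occurrence of a repeated name and leaves unique names untouched.

```python
from collections import Counter

import pandas as pd


def dedupe_column_names(columns):
    """Return a copy of columns where each repeated name gets a _1, _2, ... suffix."""
    totals = Counter(columns)   # how often each name occurs overall
    seen = Counter()            # how often each name has been seen so far
    renamed = []
    for name in columns:
        if totals[name] > 1:
            seen[name] += 1
            renamed.append(f"{name}_{seen[name]}")
        else:
            renamed.append(name)
    return renamed


df = pd.DataFrame([[1, 2, 3, 4]], columns=['id', 'gene', 'id', 'id'])
df.columns = dedupe_column_names(list(df.columns))
print(list(df.columns))
```

This assumes the suffixed names do not collide with existing columns, for example a column that is already called 'id_1'.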
Similarly to JDenman6 (and related to your question), I had two df columns with the same name (named 'id').
Hence, calling
df['id']
returns 2 columns.
You can use
df.iloc[:,ind]
where ind is the positional index of the column, according to how the columns are ordered in the df. You can find the indices using:
indices = [i for i,x in enumerate(df.columns) if x == 'id']
where you replace 'id' with the name of the column you are searching for.
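Put together on a small made-up frame, the positional lookup might look like this sketch. The data and the names first_id and second_id are invented for illustration.

```python
import pandas as pd

# Two columns share the name 'id'.
df = pd.DataFrame([[1, 'ok', 10], [2, 'bad', 20]], columns=['id', 'status', 'id'])

# Label-based selection returns both columns as a DataFrame.
both = df['id']

# Positional selection picks exactly one of them.
indices = [i for i, x in enumerate(df.columns) if x == 'id']
first_id = df.iloc[:, indices[0]]
second_id = df.iloc[:, indices[1]]

print(both.shape, indices)
print(second_id.tolist())
```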

delete a part of pd.DataFrame with Python

I'm iterating over rows in my DataFrame with DataFrame.iterrows() and if a row meets certain criteria I store it in the other DataFrame. Is there a way to delete rows that appear in both of them like set.difference(another_set)?
I was asked to provide code. Since I don't know the answer to my question, I worked around the problem by creating another DataFrame to which I save the good data, instead of keeping two DataFrames and taking the difference between them.
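def test_right_chain(self, temp):
    # pd.concat is used here because DataFrame.append was removed in pandas 2.0
    temp__ = pd.DataFrame()
    temp_ = pd.DataFrame()
    key = temp["nr right"].iloc[0]
    temp_ = pd.concat([temp_, temp.iloc[[0]]])
    temp = temp[1:]
    for index, row in temp.iterrows():
        print(row)
        key_ = row['nr right']
        if abs(key_ - key) == 1:
            pass
        elif len(temp_) > 2:
            print(row)
            # concat returns a new frame, so the result has to be assigned back
            temp__ = pd.concat([temp__, temp_])
            temp_ = pd.DataFrame()
        else:
            temp_ = pd.DataFrame()
        temp_ = pd.concat([temp_, row.to_frame().T])
        key = key_
    return temp__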
You can take the intersection of both DataFrames with an inner pd.merge, keeping the index labels of the matching rows from the left DataFrame, and then retrieve the indexes of those rows. (I used the answer from this question: Compare Python Pandas DataFrames for matching rows)
import numpy as np
import pandas as pd
df1 = pd.DataFrame(np.random.rand(10,4),columns=list('ABCD'))
df2 = df1.loc[4:8].copy()  # label-based slice, inclusive of row 8 (.ix no longer exists)
df2.reset_index(drop=True,inplace=True)
df2.loc[-1] = [2, 3, 4, 5]
df2.loc[-2] = [14, 15, 16, 17]
df2.reset_index(drop=True,inplace=True)
# reset_index() carries df1's row labels through the merge as an 'index' column
df3 = df1.reset_index().merge(df2, on=['A', 'B', 'C', 'D'], how='inner').set_index('index')
Now you need indexes of rows that appear in both DataFrames:
indexes= df3.index.values
And then you just need to drop those rows from your DataFrame:
df1=df1.drop(df1.index[indexes])
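If the rows stored in the second DataFrame keep their original index labels, which is the case when whole rows are copied over, a merge may not be needed at all: the index itself can play the role of the set. The sketch below assumes unique index labels. The filter nr right > 5 is a made-up stand-in for the real criterion.

```python
import pandas as pd

df = pd.DataFrame({'nr right': [1, 2, 3, 7, 9, 10]})

# Stand-in for the rows collected while iterating (hypothetical criterion).
selected = df[df['nr right'] > 5]

# Analogue of set.difference: drop every label that also appears in selected.
remaining = df.drop(selected.index)

# Equivalent formulation using Index.difference.
remaining_alt = df.loc[df.index.difference(selected.index)]

print(remaining)
```

A boolean mask such as the one used to build selected can also replace iterrows entirely when the criterion can be expressed column-wise.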

Drop columns that aren't common between two dataframes?

I have two dataframes that have many columns in common but a few that do not exist in both. I would like to create a dataframe that only has the columns that are common to both dataframes. So for example:
list(df1)
['Survived', 'Age', 'Title_Mr', 'Title_Mrs', 'Title_Captain']
list(df2)
['Survived', 'Age', 'Title_Mr', 'Title_Mrs', 'Title_Countess']
And I would like to go to:
['Survived', 'Age', 'Title_Mr', 'Title_Mrs']
Since Title_Mr and Title_Mrs are in both df1 and df2. I've figured out how to do it by manually entering in the columns names like so:
df1 = df1.drop(['Title_Captain'], axis=1)
But I'd like to find a more robust solution where I don't have to manually enter the column names. Suggestions?
Using the comments of @linuxfan and @PadraicCunningham we can get a list of common columns:
common_cols = list(set(df1.columns).intersection(df2.columns))
Edit: @AdamHughes' answer made me consider preserving the column order. A set does not keep order, so if that is important you could iterate over df1's columns instead:
common_cols = [col for col in df1.columns if col in df2.columns]
To get another DataFrame with just those columns you use that list to select only those columns from df1:
df3 = df1[common_cols]
According to http://pandas.pydata.org/pandas-docs/stable/indexing.html:
You can pass a list of columns to [] to select columns in that order.
If a column is not contained in the DataFrame, an exception will be
raised.
df1 = df1.drop([col for col in df1.columns if col not in df2.columns], axis=1)
You don't necessarily need to drop the columns, just select the columns of interest:
In [204]:
df1 = pd.DataFrame(columns=['Survived', 'Age', 'Title_Mr', 'Title_Mrs', 'Title_Captain'])
df2 = pd.DataFrame(columns=['Survived', 'Age', 'Title_Mr', 'Title_Mrs', 'Title_Countess'])
# create a list of the common columns using set and intersection
common_cols=list(set.intersection(set(df1), set(df2)))
# use this list to perform column selection
df1[common_cols]
['Title_Mr', 'Age', 'Survived', 'Title_Mrs']
Out[204]:
Empty DataFrame
Columns: [Title_Mr, Age, Survived, Title_Mrs]
Index: []
