comparing two dataframes and finding a unique combination of columns - python

I have two DataFrames with different sizes and different numbers of columns, for example:
DF1:
index col1 col2 col3
1 AA A12 SH7B
2 Ac DJS 283
3 ZH 28S 48d
DF2:
index col1 col2 col3 col4
2 AA cc2 SH7B hd5
7 Ac DJS 283,dhb re
10 ZH 28S SJE,48d 385d
23 3V4 38D 350,eh4 sm4
44 S3 3YE 032,she 3927
So the indexes are different, and some combinations of data in the first DataFrame also appear in the second one; I want to find them. I want to iterate through the rows of the second DataFrame and build every single combination of data per row (for example, (7, Ac, DJS, 283, re) and (7, Ac, DJS, dhb, re) are two combinations for index 7, since one column holds more than one value), compare each one with the first DataFrame's rows, and print it out if the same combination appears in both DataFrames.
result:
1 Ac DJS 283
2 ZH 28S 48d
thank you

You need to split col3 of data frame 2 first, and then merge it back with data frame 1. To split col3, a common approach is to split and flatten it while using numpy.repeat to bring the other columns to the same length:
import pandas as pd
import numpy as np
from itertools import chain
# count how many repeats are needed for other columns based on commas
repeats = df2.col3.str.count(",") + 1
# repeat columns except for col3, split and flatten col3 and merge it back with df1
(df2.drop(columns='col3').apply(lambda col: np.repeat(col, repeats))
    .assign(col3=list(chain.from_iterable(df2['col3'].str.split(','))))
    .merge(df1))
# col1 col2 col4 col3
#0 Ac DJS re 283
#1 ZH 28S 385d 48d
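In pandas 0.25+ the repeat-and-flatten step can also be written with explode. A minimal sketch, with the sample frames rebuilt by hand from the question (index values taken from the example):
import pandas as pd

# Sample frames reconstructed from the question
df1 = pd.DataFrame({'col1': ['AA', 'Ac', 'ZH'],
                    'col2': ['A12', 'DJS', '28S'],
                    'col3': ['SH7B', '283', '48d']},
                   index=[1, 2, 3])
df2 = pd.DataFrame({'col1': ['AA', 'Ac', 'ZH', '3V4', 'S3'],
                    'col2': ['cc2', 'DJS', '28S', '38D', '3YE'],
                    'col3': ['SH7B', '283,dhb', 'SJE,48d', '350,eh4', '032,she'],
                    'col4': ['hd5', 're', '385d', 'sm4', '3927']},
                   index=[2, 7, 10, 23, 44])

# Split col3 into lists, expand each element onto its own row,
# then keep only the combinations that also appear in df1
result = (df2.assign(col3=df2['col3'].str.split(','))
             .explode('col3')
             .merge(df1, on=['col1', 'col2', 'col3']))
print(result)
#   col1 col2 col3  col4
# 0   Ac  DJS  283    re
# 1   ZH  28S  48d  385d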

Related

sum() on specific columns of dataframe

I cannot work out how to add a new row at the end. The last row needs to do sum() on specific columns and divide two other columns, while the DF has a filter applied so that only specific rows are summed.
df:
Categ CategID col3 col4 col5 col6
0 Cat1 1 -65.90 -100.40 -26.91 23.79
1 Cat2 2 -81.91 -15.30 -16.00 10.06
2 Cat3 3 -57.70 -18.62 0.00 0.00
I would like the output to be like so:
3 Total -123.60 -119.02 -26.91 100*(-119.02/-26.91)
col3,col4,col5 would have sum(), and col6 would be the above formula.
If [CategID]==2, then don't include in the TOTAL
I was able to get it almost as I wanted by using .query(), like so:
#tg is a list
df.loc['Total'] = df.query("CategID in @tg").sum()
But with the above I cannot make 'col6' equal to 100*(col4.sum() / col5.sum()), because every column just gets sum().
Then I tried with a Series like so, but I don't understand how to apply the .where() filter:
s = pd.Series([df['col3'].sum(),
               df['col4'].sum(),
               df['col5'].sum(),
               100*(df['col4'].sum()/df['col5'].sum())],
              index=['col3','col4','col5','col6'])
df.loc['Total'] = s.where('tag1' in tg)
using the above Series() works, until I add .where()
this gives the error:
ValueError: Array conditional must be same shape as self
So, can I accomplish this with the first method, using .query(), and just somehow modify one of the columns in the TOTAL row?
Otherwise, what am I doing wrong with .where() in the second method?
Thanks
IIUC, you can try:
s = df.mask(df['CategID'].eq(2)).drop(columns="CategID").sum()
s.loc['col6'] = 100*(s['col4'] / s['col5'])
df.loc[len(df)] = s
df = df.fillna({'Categ':'Total',"CategID":''})
print(df)
Categ CategID col3 col4 col5 col6
0 Cat1 1 -65.90 -100.40 -26.91 23.790000
1 Cat2 2 -81.91 -15.30 -16.00 10.060000
2 Cat3 3 -57.70 -18.62 0.00 0.000000
3 Total -123.60 -119.02 -26.91 442.289112
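If you would rather keep the .query() approach from your first attempt, you can build the total from the filtered frame and then overwrite col6 afterwards. A minimal sketch, where the contents of tg are assumed (every CategID except 2, per the question):
# tg is assumed to be the list of CategID values to keep
tg = [1, 3]

filtered = df.query("CategID in @tg")
# sum the plain-sum columns, then overwrite col6 with the ratio formula
df.loc['Total'] = filtered[['col3', 'col4', 'col5']].sum()
df.loc['Total', 'col6'] = 100 * (filtered['col4'].sum() / filtered['col5'].sum())
df.loc['Total', 'Categ'] = 'Total'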

I have two CSV files that I need to merge based on the intersection of the two files; I want to drop the columns that are not present in both.

For example if file 1 looks like this:
id col1 col2 col3
--------------------
1 aa bb cc
2 dd ff gg
and file 2 looks like
id col1 col2 col3 col4
---------------------------
3 qq ww ee tt
I want the output file to look like
id col1 col2 col3
-----------------------
1 aa bb cc
2 dd ff gg
3 qq ww ee
Meaning that I want to merge the files based on the intersection only, and I want to discard the columns that are not present in both files.
I tried the following attempts
df1= pd.read_csv("lastOne.csv")
df2=pd.read_csv("Normal.csv")
dfAll=pd.concat([df1, df2], axis=1, join='inner')
I also tried df1.combine_first(df2) among many others, but none of them does what I need.
You were close, but you chose the wrong axis.
axis=0 for when you want to add more rows, with similar columns
axis=1 when you want to add more columns and you have similar rows
The correct answer would be:
pd.concat([df1, df2], join='inner', axis=0)
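For reference, a quick sketch on the sample data from the question (frames rebuilt by hand here) showing why axis=0 with join='inner' gives the desired shape:
import pandas as pd

# Hypothetical reconstruction of the two files from the question
df1 = pd.DataFrame({'id': [1, 2], 'col1': ['aa', 'dd'],
                    'col2': ['bb', 'ff'], 'col3': ['cc', 'gg']})
df2 = pd.DataFrame({'id': [3], 'col1': ['qq'], 'col2': ['ww'],
                    'col3': ['ee'], 'col4': ['tt']})

# join='inner' keeps only the columns shared by both frames;
# axis=0 stacks the rows, so col4 is dropped and all three rows remain
out = pd.concat([df1, df2], join='inner', axis=0, ignore_index=True)
print(out)
#    id col1 col2 col3
# 0   1   aa   bb   cc
# 1   2   dd   ff   gg
# 2   3   qq   ww   ee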

Get only matching rows for groups in Pandas groupby

I have the following df:
d = {"Col1":['a','d','b','c','a','d','b','c'],
"Col2":['x','y','x','z','x','y','z','y'],
"Col3":['n','m','m','l','m','m','l','l'],
"Col4":[1,4,2,2,1,4,2,2]}
df = pd.DataFrame(d)
When I groupby on three fields, I get the result:
gb = df.groupby(['Col1', 'Col2', 'Col3'])['Col4'].agg(['sum', 'mean'])
How can I extract only the groups and rows where a row of one group matches at least one other row of another group on the grouped columns (Col2 and Col3 here)? The original post included a picture highlighting the desired rows: the rows to return are the ones that match each other on those columns.
Apologies if my statement is ambiguous. Any help would be appreciated.
You can reset_index, then use duplicated with a boolean index to filter your dataframe:
gb = gb.reset_index()
gb[gb.duplicated(subset=['Col2','Col3'], keep=False)]
Output:
Col1 Col2 Col3 sum mean
0 a x m 1 1
2 b x m 2 2
3 b z l 2 2
5 c z l 2 2
Make a table with all allowed combinations and then inner join it with this dataframe.
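One way to read that suggestion, assuming "allowed combinations" means the (Col2, Col3) pairs that occur under more than one Col1, is a sketch like this:
import pandas as pd

d = {"Col1": ['a', 'd', 'b', 'c', 'a', 'd', 'b', 'c'],
     "Col2": ['x', 'y', 'x', 'z', 'x', 'y', 'z', 'y'],
     "Col3": ['n', 'm', 'm', 'l', 'm', 'm', 'l', 'l'],
     "Col4": [1, 4, 2, 2, 1, 4, 2, 2]}
df = pd.DataFrame(d)

gb = df.groupby(['Col1', 'Col2', 'Col3'])['Col4'].agg(['sum', 'mean']).reset_index()

# "Allowed" table: the (Col2, Col3) pairs that appear under more than one Col1
allowed = (gb.groupby(['Col2', 'Col3'])['Col1']
             .nunique()
             .loc[lambda s: s > 1]
             .reset_index()[['Col2', 'Col3']])

# Inner join keeps only the groups whose (Col2, Col3) pair is in the allowed table
result = gb.merge(allowed, on=['Col2', 'Col3'], how='inner')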

Pandas Data frame group by one column whilst multiplying others

I am using python with pandas imported to manipulate some data from a csv file I have. Just playing around to try and learn something new.
I have the following data frame (shown as an image in the original post).
I would like to group the data by Col1 so that I get the following result (also an image): a groupby on Col1 with Col3 and Col4 multiplied together.
I have been watching some YouTube videos and reading some similar questions on Stack Overflow, but I am having trouble. So far I have the following, which involves creating a new column to hold the result of Col3 x Col4:
df['Col5'] = df.Col3 * df.Col4
gf = df.groupby(['col1', 'Col5'])
You can use a solution without creating a new column: multiply the columns and aggregate by df['Col1'] with sum (grouping by the Series directly is syntactic sugar):
gf = (df.Col3 * df.Col4).groupby(df['Col1']).sum().reset_index(name='Col2')
print (gf)
Col1 Col2
0 12345 38.64
1 23456 2635.10
2 45678 419.88
Another solution is to create an index from Col1 with set_index, multiply the columns with prod, and then sum per index level (sum(level=0) is deprecated in newer pandas, so groupby(level=0).sum() is used here):
gf = df.set_index('Col1')[['Col3','Col4']].prod(axis=1).groupby(level=0).sum().reset_index(name='Col2')
Almost, but you are grouping by too many columns in the end. Try:
gf = df.groupby('Col1')['Col5'].sum()
Or to get it as a dataframe, rather than Col1 as an index (I'm judging that this is what you want from your image), include as_index=False in your groupby:
gf = df.groupby('Col1', as_index=False)['Col5'].sum()
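Since the original frame was posted as an image, here is a minimal sketch with made-up data showing that both forms give the same grouped result:
import pandas as pd

# Made-up sample data, as the original frame is not available
df = pd.DataFrame({'Col1': [12345, 12345, 23456],
                   'Col3': [2.0, 3.0, 4.0],
                   'Col4': [10.0, 6.0, 5.0]})

# Aggregate the product directly, without a helper column ...
direct = (df.Col3 * df.Col4).groupby(df['Col1']).sum().reset_index(name='Col2')
print(direct)
#     Col1  Col2
# 0  12345  38.0
# 1  23456  20.0

# ... or create Col5 first and group by Col1 only; the grouped sums are the same
df['Col5'] = df.Col3 * df.Col4
via_col5 = df.groupby('Col1', as_index=False)['Col5'].sum()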

drop rows that have duplicated indices

I have a DataFrame where each observation is identified by an index. However, for some indices the DF contains several observations. One of them has the most updated data. I would like to drop the outdated duplicated rows based on values from some of the columns.
For example, in the following DataFrame, how can I drop the first and third rows with index = 122?
index col1 col2
122 - -
122 one two
122 - two
123 four one
124 five -
That is, I would like to get a final DF like this:
index col1 col2
122 one two
123 four one
124 five -
This seems to be a very common problem when we get data through several different retrievals over time. But I cannot figure out an efficient way of cleaning the data.
You could use groupby/transform to create a boolean mask which is True where the group count is greater than 1 and any of the values in the row equals '-'. Then you could use df.loc[~mask] to select the unmasked rows of df:
import pandas as pd
df = pd.read_table('data', sep=r'\s+')
count = df.groupby(['index'])['col1'].transform('count') > 1
mask = (df['col1'] == '-') | (df['col2'] == '-')
mask = mask & count
result = df.loc[~mask]
print(result)
yields
index col1 col2
0 122 one two
1 123 four one
2 124 five -
If the index is already a column then you can call drop_duplicates and pass keep='last' (the older take_last=True parameter has since been removed):
In [14]:
df.drop_duplicates('index', keep='last')
Out[14]:
index col1 col2
1 122 - two
2 123 four one
If it's actually your index, then you'd be better off calling reset_index first, performing the above step, and then setting the index back again, as sketched below.
There is a drop_duplicates method on Index as well, but this just removes duplicates from the index; the returned de-duplicated index does not let you select the corresponding rows back out of the df, so I recommend the above approach of calling drop_duplicates on the df itself.
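A minimal sketch of that round trip, assuming the index level is named 'index' as in the example:
# reset_index exposes the index as a column, drop_duplicates keeps the last
# (most recent) row per value, and set_index restores the original index
result = (df.reset_index()
            .drop_duplicates('index', keep='last')
            .set_index('index'))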
EDIT
Based on your new information, the easiest may be to replace the outdated data with NaN values and drop those rows:
In [36]:
df.replace('-', np.NaN).dropna()
Out[36]:
col1 col2
index
122 one two
123 four one
Another Edit
What you could do is groupby the index and take the first values of the remaining columns, then call reset_index:
In [56]:
df.groupby('index')[['col1', 'col2']].first().reset_index()
Out[56]:
index col1 col2
0 122 - -
1 123 four one
2 124 five -
