Compare 2 list columns in pandas and find the diff

Compare 2 list columns in pandas and find the diff - python

DataFrame
df = pd.DataFrame({
'Id': [1,1,1,1,2,2,3,4,4,4],
'Col_1':['AD11','BZ23','CQ45','DL36','LM34','MM23','DL35','AD11','BP23','CQ45'],
'Col_2':['AD11',nan,nan,'DL36',nan,nan,'DL35',nan,nan,'CQ45']]
}, columns=['Id','Col_1','Col_2'])
Looks Like
Original data frame looks like this
Please note that Col_1 & Col_2 has alpha numeric values and has more than one character. For eg : 'AD34' , 'EC45', etc.
After groupby and reset index
g = df.groupby('Id')['Col_1','Col_2'].agg(['unique'])
g= g.reset_index(drop=True)
g.columns = [''.join(col).strip() for col in g.columns.values]
I want to
store results that match in Match column
results that do not match No_match column
Result :
I tried to use some logic from this
post but doesnt solve my issue.
Is there a better way to do the transformation for my requirement ?
Appreciate the help.

First remove missing values from list and then use set.intersection and set.difference:
g = df.groupby('Id')[['Col_1','Col_2']].agg([lambda x: x.dropna().unique().tolist()])
g= g.reset_index(drop=True)
g.columns = [f'{a}_unique' for a, b in g.columns]
z = list(zip(g['Col_1_unique'], g['Col_2_unique']))
g['Match'] = [list(set(a).intersection(b)) for a, b in z]
g['No_Match'] = [list(set(a).difference(b)) for a, b in z]
print (g)
Col_1_unique Col_2_unique Match No_Match
0 [AD11, BZ23, CQ45, DL36] [AD11, DL36] [DL36, AD11] [CQ45, BZ23]
1 [LM34, MM23] [] [] [LM34, MM23]
2 [DL35] [DL35] [DL35] []
3 [AD11, BP23, CQ45] [CQ45] [CQ45] [AD11, BP23]

Here, my simple logic is to compare both list, by same value on same positions.
Such as, [a,b,c] & [b,a,c] so match will be [c] only.
Code:
df = pd.DataFrame({
'Id': [1,1,1,1,2,2,3,4,4,4],
'Col_1':['A','B','C','D','L','M','D','A','B','C'],
'Col_2':['A','','','D','','','D','', '', 'C']
}, columns=['Id','Col_1','Col_2'])
#In order to compare list by values and position I needed to add unique value on null value
#So the both list length would be same
df['Col_2'] = df.apply(lambda x : x.name if x.Col_2=='' else x.Col_2, axis=1)
g = df.groupby('Id')['Col_1','Col_2'].agg(['unique'])
g= g.reset_index(drop=True)
g.columns = [''.join(col).strip() for col in g.columns.values]
g['Match'] = g.apply(lambda x: [a for a, b in zip(x.Col_1unique, x.Col_2unique) if a==b], axis=1)
g['Not_Match'] = g.apply(lambda x: [a for a, b in zip(x.Col_1unique, x.Col_2unique) if a!=b], axis=1)
g
Output:
Col_1unique Col_2unique Match Not_Match
0 [A, B, C, D] [A, 1, 2, D] [A, D] [B, C]
1 [L, M] [4, 5] [] [L, M]
2 [D] [D] [D] []
3 [A, B, C] [7, 8, C] [C] [A, B]

Please try to use the below code but make it more efficient, for time being i tried the below,
import pandas as pd
df = pd.DataFrame({
'Id': [1, 1, 1, 1, 2, 2, 3, 4, 4, 4],
'Col_1': ['A', 'B', 'C', 'D', 'L', 'M', 'D', 'A', 'B', 'C'],
'Col_2': ['A', 'nan', 'nan', 'D', 'nan', 'nan', 'D', 'nan', 'nan', 'C']})
print(df)
df['Match'] = ''
df['No-Match'] = ''
for i, row in df.iterrows():
if row['Col_1'] == row['Col_2']:
df.at[i, 'Match'] = row['Col_1']
else:
df.at[i, 'No-Match'] = row['Col_1']
print(df)
g = df.groupby('Id')['Id','Col_1','Col_2','Match','No-Match'].agg(['unique'])
g= g.reset_index(drop=True)
g.columns = [''.join(col).strip() for col in g.columns.values]
print(g)
Once you run this, you will get the below output:
Idunique Col_1unique Col_2unique Matchunique No-Matchunique
0 [1] [A, B, C, D] [A, nan, D] [A, D] [B, C]
1 [2] [L, M] [nan] [] [L, M]
2 [3] [D] [D] [D] []
3 [4] [A, B, C] [nan, C] [C] [A, B]

Related

Pandas Multi-index set value based on three different condition

The objective is to create a new multiindex column based on 3 conditions of the column (B)
Condition for B
if B<0
CONDITION_B='l`
elif B<-1
CONDITION_B='L`
else
CONDITION_B='g`
Naively, I thought, we can simply create two different mask and replace the value as suggested
# Handle CONDITION_B='l` and CONDITION_B='g`
mask_2 = df.loc[:,idx[:,'B']]<0
appenddf_2=mask_2.replace({True:'g',False:'l'}).rename(columns={'A':'iv'},level=1)
and then
# CONDITION_B='L`
mask_33 = df.loc[:,idx[:,'B']]<-0.1
appenddf_2=mask_33.replace({True:'G'}).rename(columns={'A':'iv'},level=1)
As expected, this will throw an error
TypeError: sequence item 1: expected str instance, bool found
May I know how to handle the 3 different condition
Expected output
ONE TWO
B B
g L
l l
l g
g l
L L
The code to produce the error is
import pandas as pd
import numpy as np
np.random.seed(3)
arrays = [np.hstack([['One']*2, ['Two']*2]) , ['A', 'B', 'A', 'B']]
columns = pd.MultiIndex.from_arrays(arrays)
df= pd.DataFrame(np.random.randn(5, 4), columns=list('ABAB'))
df.columns = columns
idx = pd.IndexSlice
mask_2 = df.loc[:,idx[:,'B']]<0
appenddf_2=mask_2.replace({True:'g',False:'l'}).rename(columns={'A':'iv'},level=1)
mask_33 = df.loc[:,idx[:,'B']]<-0.1
appenddf_2=mask_33.replace({True:'G'}).rename(columns={'A':'iv'},level=1)

IIUC:
np.select() is ideal in this case:
conditions=[
df.loc[:,idx[:,'B']].lt(0) & df.loc[:,idx[:,'B']].gt(-1),
df.loc[:,idx[:,'B']].lt(-1),
df.loc[:,idx[:,'B']].ge(0)
]
labels=['l','L','g']
out=pd.DataFrame(np.select(conditions,labels),columns=df.loc[:,idx[:,'B']].columns)
OR
via np.where():
s=np.where(df.loc[:,idx[:,'B']].lt(0) & df.loc[:,idx[:,'B']].gt(-1),'l',np.where(df.loc[:,idx[:,'B']].lt(-1),'L','g'))
out=pd.DataFrame(s,columns=df.loc[:,idx[:,'B']].columns)
output of out:
One Two
B B
0 g L
1 l l
2 l g
3 g l
4 L L

I don't fully understand what you want to do but try something like this:
df = pd.DataFrame({'B': [ 0, -1, -2, -2, -1, 0, 0, -1, -1, -2]})
df['ONE'] = np.where(df['B'] < 0, 'l', 'g')
df['TWO'] = np.where(df['B'] < -1, 'L', df['ONE'])
df = df.set_index(['ONE', 'TWO'])
Output result:
>>> df
B
ONE TWO
g g 0
l l -1
L -2
L -2
l -1
g g 0
g 0
l l -1
l -1
L -2

Remove Duplicates from 3 deep nested list - python - sympy

Im working with sympy and I have a list with some duplicates (the order doesn't matter, I still consider them duplicates) and Im looking for a way to remove them.
The list is as fallows,
A=[[[m, b], [f, g]],
[[g, h], [f, b]],
[[f, g], [m, b]]]
I would consider [[m, b], [f, g]] and [[f, g], [m, b]] as the same and am trying to figure out a way to to make a list with them removed. It would look like this,
B=[[[m, b], [f, g]],
[[g, h], [f, b]]].
It dosnt matter which of the duplicate it keeps, so long as only 1 remains.
Ive tried using the set function but it gives out
TypeError: unhashable type: 'list' error and Im not sure sure. Any input or advice is apprecaited.

A = [[['m', 'b'], ['f', 'g']], [['g', 'h'], ['f', 'b']], [['f', 'g'], ['m', 'b']], [['l', 'k'], ['d', 'c']]]
B = []
C = []
for i in A:
for j in i:
if j not in B:
B = B + [j]
c = 0
c1 = 1
counter = int(len(B) / 2)
for k in range(counter):
C.append([B[k+c], B[k+c1]])
c = c + 1
c1 = c + 1
print(B)
print(C)

A column in my dataframe does not seem to correspond to the input List (python)

I want to assign one of the columns of my dataframe to a list. I used the code below.
listone = [['a', 'b', 'c'], ['m', 'g'], ['h'], ['y', 't', 'r']]
df['Letter combinations'] = listone
The 'Letter Combinations' column in the dataframe doesn't correspond to the list, instead seems to assign random elements to each row in the column. I was wondering if this method indexes the elements differently causing a change in the order or if there is something wrong with my code. Any help would be appreciated!
Edit: Here is my complete code
listone = [[a, b, c], [m, g], [h], [y, t, r]]
numbers = [1, 2, 3, 4]
my_matrix = {'Numbers': numbers}
sample = pd.DataFrame(my_matrix)
sample['Letter combinations'] = listone
sample
My output looks like:
```
Numbers Letter combination
0 1 [b]
1 2 [m, g]
2 3 []
3 4 [r]
```

You need to make the listone to be a series. Ie:
sample['Letter combinations'] = pd.Series(listone)
sample
Numbers Letter combinations
0 1 [a, b, c]
1 2 [m, g]
2 3 [h]
3 4 [y, t, r]

delete all nan values from list in pandas dataframe

if any elements are there along with nan then i want to keep element and want to delete nan only like
example 1 ->
index values
0 [nan,'a',nan,nan]
output should be like
index values
0 [a]
example 2->
index values
0 [nan,'a',b,c]
1 [nan,nan,nan]
output should be like
index values
0 [a,b,c]
1 []

This is one approach using df.apply.
import pandas as pd
import numpy as np
df = pd.DataFrame({"a": [[np.nan, np.nan, np.nan, "a", np.nan], [np.nan, np.nan], ["a", "b"]]})
df["a"] = df["a"].apply(lambda x: [i for i in x if str(i) != "nan"])
print(df)
Output:
a
0 [a]
1 []
2 [a, b]

You can use the fact that np.nan == np.nan evaluates to False:
df = pd.DataFrame([[0, [np.nan, 'a', 'b', 'c']],
[1, [np.nan, np.nan, np.nan]],
[2, [np.nan, 'a', np.nan, np.nan]]],
columns=['index', 'values'])
df['values'] = df['values'].apply(lambda x: [i for i in x if i == i])
print(df)
index values
0 0 [a, b, c]
1 1 []
2 2 [a]
lambda is just an anonymous function. You could also use a named function:
def remove_nan(x):
return [i for i in x if i == i]
df['values'] = df['values'].apply(remove_nan)
Related: Why is NaN not equal to NaN?

df['values'].apply(lambda v: pd.Series(v).dropna().values )

You can use pd.Series.map on df.values
import pandas as pd
my_filter = lambda x: not pd.isna(x)
df['new_values'] = df['values'].map(lambda x: list(filter(my_filter, x)))

How to print list items in python

I have written the following code:
def count():
a = 1
b = 5
c = 2
d = 8
i = 0
list1 = [a, b, c, d]
le = len(list1)
while (i < le):
x = max(list1)
print(x)
list1.remove(x)
i = i + 1
What I want to do is to print the largest number with its variable name like:
d:8
b:5
c:2
but using the above code I can only print the ascending list of numbers, not the corresponding variable names. Please suggest a way to fix this.

Use a dict instead:
In [2]: dic=dict(a=1, b=5, c=2, d=8)
In [3]: dic
Out[3]: {'a': 1, 'b': 5, 'c': 2, 'd': 8}
In [5]: sortedKeys=sorted(dic, key=dic.get, reverse=True)
In [6]: sortedKeys
Out[6]: ['d', 'b', 'c', 'a']
In [7]: for i in sortedKeys:
...: print i, dic[i]
...:
d 8
b 5
c 2
a 1

I think you can use OrderedDict()
from collections import OrderedDict
a, b, c, d = 1, 2, 3, 6
vars = {
'a' : a,
'b' : b,
'c' : c,
'd' : d
}
d_sorted_by_value = OrderedDict(sorted(vars.items(), key=x.get, reverse=True))
for k, v in d_sorted_by_value.items():
print "{}: {}".format(k,v)
List don't save variable names

Develop Reference

Python is a programming language that lets you work quickly and integrate systems more effectively.

Compare 2 list columns in pandas and find the diff - python

Related

Pandas Multi-index set value based on three different condition

Remove Duplicates from 3 deep nested list - python - sympy

A column in my dataframe does not seem to correspond to the input List (python)

delete all nan values from list in pandas dataframe

How to print list items in python

Categories

Resources