Compare 2 list columns in pandas and find the diff - python

DataFrame
import numpy as np
import pandas as pd
df = pd.DataFrame({
    'Id': [1,1,1,1,2,2,3,4,4,4],
    'Col_1':['AD11','BZ23','CQ45','DL36','LM34','MM23','DL35','AD11','BP23','CQ45'],
    'Col_2':['AD11',np.nan,np.nan,'DL36',np.nan,np.nan,'DL35',np.nan,np.nan,'CQ45']
}, columns=['Id','Col_1','Col_2'])
Looks Like
Original data frame looks like this
Please note that Col_1 and Col_2 hold alphanumeric values that are longer than one character, e.g. 'AD34', 'EC45', etc.
After groupby and reset index
g = df.groupby('Id')[['Col_1','Col_2']].agg(['unique'])
g= g.reset_index(drop=True)
g.columns = [''.join(col).strip() for col in g.columns.values]
I want to:
store the values that match in a Match column
store the values that do not match in a No_match column
Result:
I tried to adapt some logic from this post, but it doesn't solve my issue.
Is there a better way to do this transformation?
I appreciate the help.

First remove the missing values from each list, then use set.intersection and set.difference:
g = df.groupby('Id')[['Col_1','Col_2']].agg([lambda x: x.dropna().unique().tolist()])
g= g.reset_index(drop=True)
g.columns = [f'{a}_unique' for a, b in g.columns]
z = list(zip(g['Col_1_unique'], g['Col_2_unique']))
g['Match'] = [list(set(a).intersection(b)) for a, b in z]
g['No_Match'] = [list(set(a).difference(b)) for a, b in z]
print (g)
               Col_1_unique  Col_2_unique         Match      No_Match
0  [AD11, BZ23, CQ45, DL36]  [AD11, DL36]  [DL36, AD11]  [CQ45, BZ23]
1              [LM34, MM23]            []            []  [LM34, MM23]
2                    [DL35]        [DL35]        [DL35]            []
3        [AD11, BP23, CQ45]        [CQ45]        [CQ45]  [AD11, BP23]
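Because sets are unordered, Match and No_Match above can come back in a different order than the values appear in Col_1 (note [DL36, AD11] in the first row). If the original order matters, a minimal sketch along the same lines could filter each Col_1 list against a set built from Col_2. The helper names `unique_non_null` and `split_by_membership` are made up for this illustration and are not part of pandas.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'Id': [1, 1, 1, 1, 2, 2, 3, 4, 4, 4],
    'Col_1': ['AD11', 'BZ23', 'CQ45', 'DL36', 'LM34', 'MM23', 'DL35', 'AD11', 'BP23', 'CQ45'],
    'Col_2': ['AD11', np.nan, np.nan, 'DL36', np.nan, np.nan, 'DL35', np.nan, np.nan, 'CQ45'],
})


def unique_non_null(s):
    # Hypothetical helper: unique values of one group, NaN dropped,
    # in order of first appearance.
    return s.dropna().unique().tolist()


def split_by_membership(a, b):
    # Hypothetical helper: split list a into values found in b and values
    # not found in b, keeping the order of a.
    lookup = set(b)
    matched = [v for v in a if v in lookup]
    unmatched = [v for v in a if v not in lookup]
    return matched, unmatched


grouped = df.groupby('Id')
g = pd.DataFrame({
    'Col_1_unique': grouped['Col_1'].agg(unique_non_null),
    'Col_2_unique': grouped['Col_2'].agg(unique_non_null),
}).reset_index(drop=True)

pairs = [split_by_membership(a, b) for a, b in zip(g['Col_1_unique'], g['Col_2_unique'])]
g['Match'] = [matched for matched, _ in pairs]
g['No_Match'] = [unmatched for _, unmatched in pairs]
print(g)
```

As in the answer above, No_Match only lists Col_1 values that are missing from Col_2, not the reverse.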

My approach is to compare the two lists position by position, counting a value as a match only when the same value sits at the same position in both lists.
For example, [a,b,c] and [b,a,c] give a match of [c] only.
Code:
import pandas as pd
df = pd.DataFrame({
'Id': [1,1,1,1,2,2,3,4,4,4],
'Col_1':['A','B','C','D','L','M','D','A','B','C'],
'Col_2':['A','','','D','','','D','', '', 'C']
}, columns=['Id','Col_1','Col_2'])
# To compare the lists by value and position, replace each empty Col_2 value
# with a unique placeholder (the row label) so both lists have the same length
df['Col_2'] = df.apply(lambda x : x.name if x.Col_2=='' else x.Col_2, axis=1)
g = df.groupby('Id')[['Col_1','Col_2']].agg(['unique'])
g= g.reset_index(drop=True)
g.columns = [''.join(col).strip() for col in g.columns.values]
g['Match'] = g.apply(lambda x: [a for a, b in zip(x.Col_1unique, x.Col_2unique) if a==b], axis=1)
g['Not_Match'] = g.apply(lambda x: [a for a, b in zip(x.Col_1unique, x.Col_2unique) if a!=b], axis=1)
g
Output:
    Col_1unique   Col_2unique   Match  Not_Match
0  [A, B, C, D]  [A, 1, 2, D]  [A, D]     [B, C]
1        [L, M]        [4, 5]      []     [L, M]
2           [D]           [D]     [D]         []
3     [A, B, C]     [7, 8, C]     [C]     [A, B]

Please try the code below, though it could be made more efficient. For the time being, this is what I tried:
import pandas as pd
df = pd.DataFrame({
'Id': [1, 1, 1, 1, 2, 2, 3, 4, 4, 4],
'Col_1': ['A', 'B', 'C', 'D', 'L', 'M', 'D', 'A', 'B', 'C'],
'Col_2': ['A', 'nan', 'nan', 'D', 'nan', 'nan', 'D', 'nan', 'nan', 'C']})
print(df)
df['Match'] = ''
df['No-Match'] = ''
for i, row in df.iterrows():
    if row['Col_1'] == row['Col_2']:
        df.at[i, 'Match'] = row['Col_1']
    else:
        df.at[i, 'No-Match'] = row['Col_1']
print(df)
g = df.groupby('Id')[['Id','Col_1','Col_2','Match','No-Match']].agg(['unique'])
g= g.reset_index(drop=True)
g.columns = [''.join(col).strip() for col in g.columns.values]
print(g)
Once you run this, you will get the below output:
   Idunique   Col_1unique  Col_2unique  Matchunique  No-Matchunique
0       [1]  [A, B, C, D]  [A, nan, D]       [A, D]          [B, C]
1       [2]        [L, M]        [nan]           []          [L, M]
2       [3]           [D]          [D]          [D]              []
3       [4]     [A, B, C]     [nan, C]          [C]          [A, B]

Related

Pandas Multi-index set value based on three different condition

The objective is to create a new multiindex column based on 3 conditions of the column (B)
Condition for B
if B < -1:
    CONDITION_B = 'L'
elif B < 0:
    CONDITION_B = 'l'
else:
    CONDITION_B = 'g'
Naively, I thought we could simply create two different masks and replace the values as suggested:
# Handle CONDITION_B='l' and CONDITION_B='g'
mask_2 = df.loc[:,idx[:,'B']]<0
appenddf_2=mask_2.replace({True:'g',False:'l'}).rename(columns={'A':'iv'},level=1)
and then
# Handle CONDITION_B='L'
mask_33 = df.loc[:,idx[:,'B']]<-0.1
appenddf_2=mask_33.replace({True:'G'}).rename(columns={'A':'iv'},level=1)
As expected, this will throw an error
TypeError: sequence item 1: expected str instance, bool found
How can I handle the three different conditions?
Expected output
ONE TWO
  B   B
  g   L
  l   l
  l   g
  g   l
  L   L
The code to produce the error is
import pandas as pd
import numpy as np
np.random.seed(3)
arrays = [np.hstack([['One']*2, ['Two']*2]) , ['A', 'B', 'A', 'B']]
columns = pd.MultiIndex.from_arrays(arrays)
df= pd.DataFrame(np.random.randn(5, 4), columns=list('ABAB'))
df.columns = columns
idx = pd.IndexSlice
mask_2 = df.loc[:,idx[:,'B']]<0
appenddf_2=mask_2.replace({True:'g',False:'l'}).rename(columns={'A':'iv'},level=1)
mask_33 = df.loc[:,idx[:,'B']]<-0.1
appenddf_2=mask_33.replace({True:'G'}).rename(columns={'A':'iv'},level=1)
IIUC:
np.select() is ideal in this case:
conditions=[
df.loc[:,idx[:,'B']].lt(0) & df.loc[:,idx[:,'B']].gt(-1),
df.loc[:,idx[:,'B']].lt(-1),
df.loc[:,idx[:,'B']].ge(0)
]
labels=['l','L','g']
out=pd.DataFrame(np.select(conditions,labels),columns=df.loc[:,idx[:,'B']].columns)
OR
via np.where():
s=np.where(df.loc[:,idx[:,'B']].lt(0) & df.loc[:,idx[:,'B']].gt(-1),'l',np.where(df.loc[:,idx[:,'B']].lt(-1),'L','g'))
out=pd.DataFrame(s,columns=df.loc[:,idx[:,'B']].columns)
output of out:
  One Two
    B   B
0   g   L
1   l   l
2   l   g
3   g   l
4   L   L
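A variant of the np.select() approach, offered as a sketch rather than a replacement: since np.select() takes the first condition that is true, listing the stricter test (B < -1) first removes the need for the combined lt(0) & gt(-1) mask. Passing default='g' also avoids mixing the string labels with np.select's numeric default of 0. One assumption to check against your requirements: under this ordering a value of exactly -1 is labelled 'l'. The variable name `b` is introduced only for this example.

```python
import numpy as np
import pandas as pd

np.random.seed(3)
columns = pd.MultiIndex.from_arrays([['One', 'One', 'Two', 'Two'],
                                     ['A', 'B', 'A', 'B']])
df = pd.DataFrame(np.random.randn(5, 4), columns=columns)

idx = pd.IndexSlice
b = df.loc[:, idx[:, 'B']]  # only the 'B' columns of each top-level group

# Conditions are checked in order; the first one that is True wins.
conditions = [
    b.lt(-1).to_numpy(),  # B < -1      -> 'L'
    b.lt(0).to_numpy(),   # -1 <= B < 0 -> 'l'
]
labels = ['L', 'l']

# Everything else (B >= 0) falls through to the default label 'g'.
out = pd.DataFrame(np.select(conditions, labels, default='g'),
                   index=b.index, columns=b.columns)
print(out)
```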
I don't fully understand what you want to do but try something like this:
df = pd.DataFrame({'B': [ 0, -1, -2, -2, -1, 0, 0, -1, -1, -2]})
df['ONE'] = np.where(df['B'] < 0, 'l', 'g')
df['TWO'] = np.where(df['B'] < -1, 'L', df['ONE'])
df = df.set_index(['ONE', 'TWO'])
Output result:
>>> df
         B
ONE TWO
g   g    0
l   l   -1
    L   -2
    L   -2
    l   -1
g   g    0
    g    0
l   l   -1
    l   -1
    L   -2

Remove Duplicates from 3 deep nested list - python - sympy

I'm working with sympy and I have a list with some duplicates (the order doesn't matter; I still consider them duplicates), and I'm looking for a way to remove them.
The list is as follows:
A=[[[m, b], [f, g]],
[[g, h], [f, b]],
[[f, g], [m, b]]]
I consider [[m, b], [f, g]] and [[f, g], [m, b]] to be the same, and I am trying to build a list with such duplicates removed. It would look like this:
B=[[[m, b], [f, g]],
[[g, h], [f, b]]].
It doesn't matter which of the duplicates is kept, so long as only one remains.
I've tried using the set function, but it raises
TypeError: unhashable type: 'list', and I'm not sure why. Any input or advice is appreciated.
A = [[['m', 'b'], ['f', 'g']], [['g', 'h'], ['f', 'b']], [['f', 'g'], ['m', 'b']], [['l', 'k'], ['d', 'c']]]
B = []
C = []
for i in A:
    for j in i:
        if j not in B:
            B = B + [j]
c = 0
c1 = 1
counter = int(len(B) / 2)
for k in range(counter):
    C.append([B[k+c], B[k+c1]])
    c = c + 1
    c1 = c + 1
print(B)
print(C)
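Another way to look at the problem, sketched here under a few assumptions: set() fails because lists are unhashable, but a frozenset of tuples is hashable and ignores order, so it can serve as a lookup key for each group. The sketch uses plain strings in place of SymPy symbols (symbols are hashable too, so the same idea should carry over), and the function name `dedupe_unordered` is invented for this example. It treats [m, b] and [b, m] as different pairs; only the order of the pairs inside a group is ignored.

```python
def dedupe_unordered(groups):
    """Keep the first occurrence of each group, ignoring the order of its pairs."""
    seen = set()
    unique = []
    for group in groups:
        # Order-insensitive, hashable key: a frozenset of the inner pairs as tuples.
        key = frozenset(tuple(pair) for pair in group)
        if key not in seen:
            seen.add(key)
            unique.append(group)
    return unique


A = [[['m', 'b'], ['f', 'g']],
     [['g', 'h'], ['f', 'b']],
     [['f', 'g'], ['m', 'b']]]
B = dedupe_unordered(A)
print(B)  # [[['m', 'b'], ['f', 'g']], [['g', 'h'], ['f', 'b']]]
```

Unlike the loop above, this keeps each surviving group exactly as it appeared in A instead of re-pairing a flattened list.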

A column in my dataframe does not seem to correspond to the input List (python)

I want to assign one of the columns of my dataframe to a list. I used the code below.
listone = [['a', 'b', 'c'], ['m', 'g'], ['h'], ['y', 't', 'r']]
df['Letter combinations'] = listone
The 'Letter Combinations' column in the dataframe doesn't correspond to the list, instead seems to assign random elements to each row in the column. I was wondering if this method indexes the elements differently causing a change in the order or if there is something wrong with my code. Any help would be appreciated!
Edit: Here is my complete code
listone = [[a, b, c], [m, g], [h], [y, t, r]]
numbers = [1, 2, 3, 4]
my_matrix = {'Numbers': numbers}
sample = pd.DataFrame(my_matrix)
sample['Letter combinations'] = listone
sample
My output looks like:
```
Numbers Letter combination
0 1 [b]
1 2 [m, g]
2 3 []
3 4 [r]
```
You need to convert listone to a Series, i.e.:
sample['Letter combinations'] = pd.Series(listone)
sample
Numbers Letter combinations
0 1 [a, b, c]
1 2 [m, g]
2 3 [h]
3 4 [y, t, r]
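One caveat worth keeping in mind with this approach: assigning a Series aligns on index labels, not on position. The sketch below assumes a frame whose index is not the default 0..3 (the index values and the column names `by_label` and `by_position` are made up for illustration) and shows that passing the frame's own index keeps the rows lined up.

```python
import pandas as pd

listone = [['a', 'b', 'c'], ['m', 'g'], ['h'], ['y', 't', 'r']]

# Hypothetical frame with a non-default index, to make the alignment visible.
sample = pd.DataFrame({'Numbers': [1, 2, 3, 4]}, index=[10, 11, 12, 13])

# The Series gets labels 0..3, none of which exist in sample, so every row is NaN.
sample['by_label'] = pd.Series(listone)

# Reusing sample's index places each sublist on the intended row.
sample['by_position'] = pd.Series(listone, index=sample.index)

print(sample)
```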

delete all nan values from list in pandas dataframe

If a list contains other elements along with NaN, I want to keep those elements and delete only the NaN values. For example:
example 1 ->
index values
0 [nan,'a',nan,nan]
output should be like
index values
0 [a]
example 2->
index values
0 [nan,'a',b,c]
1 [nan,nan,nan]
output should be like
index values
0 [a,b,c]
1 []
This is one approach using df.apply.
import pandas as pd
import numpy as np
df = pd.DataFrame({"a": [[np.nan, np.nan, np.nan, "a", np.nan], [np.nan, np.nan], ["a", "b"]]})
df["a"] = df["a"].apply(lambda x: [i for i in x if str(i) != "nan"])
print(df)
Output:
a
0 [a]
1 []
2 [a, b]
You can use the fact that np.nan == np.nan evaluates to False:
df = pd.DataFrame([[0, [np.nan, 'a', 'b', 'c']],
[1, [np.nan, np.nan, np.nan]],
[2, [np.nan, 'a', np.nan, np.nan]]],
columns=['index', 'values'])
df['values'] = df['values'].apply(lambda x: [i for i in x if i == i])
print(df)
index values
0 0 [a, b, c]
1 1 []
2 2 [a]
lambda is just an anonymous function. You could also use a named function:
def remove_nan(x):
    return [i for i in x if i == i]
df['values'] = df['values'].apply(remove_nan)
Related: Why is NaN not equal to NaN?
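To make the difference between the two filters above concrete, here is a small sketch on a made-up list. It includes the literal string 'nan' on purpose: the i == i test keeps it (only a real NaN is unequal to itself), while the str(i) != "nan" test drops it along with the real NaN.

```python
import numpy as np

# Made-up sample: a real NaN, an ordinary string, the literal text 'nan', a float.
values = [np.nan, 'a', 'nan', 1.5]

# NaN is the only value here that is not equal to itself.
kept_by_self_equality = [v for v in values if v == v]

# Comparing string forms also removes the text 'nan'.
kept_by_str_compare = [v for v in values if str(v) != 'nan']

print(kept_by_self_equality)  # ['a', 'nan', 1.5]
print(kept_by_str_compare)    # ['a', 1.5]
```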
df['values'].apply(lambda v: pd.Series(v).dropna().values )
You can use pd.Series.map on df['values']:
import pandas as pd
my_filter = lambda x: not pd.isna(x)
df['new_values'] = df['values'].map(lambda x: list(filter(my_filter, x)))

How to print list items in python

I have written the following code:
def count():
    a = 1
    b = 5
    c = 2
    d = 8
    i = 0
    list1 = [a, b, c, d]
    le = len(list1)
    while (i < le):
        x = max(list1)
        print(x)
        list1.remove(x)
        i = i + 1
What I want to do is print each number, largest first, together with its variable name, like:
d:8
b:5
c:2
but with the above code I can only print the numbers in descending order, not the corresponding variable names. Please suggest a way to fix this.
Use a dict instead:
In [2]: dic=dict(a=1, b=5, c=2, d=8)
In [3]: dic
Out[3]: {'a': 1, 'b': 5, 'c': 2, 'd': 8}
In [5]: sortedKeys=sorted(dic, key=dic.get, reverse=True)
In [6]: sortedKeys
Out[6]: ['d', 'b', 'c', 'a']
In [7]: for i in sortedKeys:
   ...:     print(i, dic[i])
   ...:
d 8
b 5
c 2
a 1
I think you can use OrderedDict():
from collections import OrderedDict
a, b, c, d = 1, 2, 3, 6
variables = {
    'a': a,
    'b': b,
    'c': c,
    'd': d
}
# Sort the (name, value) pairs by value, largest first
d_sorted_by_value = OrderedDict(sorted(variables.items(), key=lambda item: item[1], reverse=True))
for k, v in d_sorted_by_value.items():
    print("{}: {}".format(k, v))
Lists don't store variable names.
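To spell that out: list1 = [a, b, c, d] holds only the numbers 1, 5, 2, 8, so the names have to be kept alongside the values explicitly. A minimal sketch using the values from the question, with the names stored as dict keys and printed in the name:value format that was asked for (the variable names `values` and `lines` are chosen for this example):

```python
# Keep each name next to its value instead of relying on variable names.
values = {'a': 1, 'b': 5, 'c': 2, 'd': 8}

# Sort the (name, value) pairs by value, largest first.
ordered = sorted(values.items(), key=lambda item: item[1], reverse=True)

lines = [f"{name}:{value}" for name, value in ordered]
for line in lines:
    print(line)
# d:8
# b:5
# c:2
# a:1
```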
