How to find related rows based on column value similarity with pandas - python

Given a random dataset, I need to find rows related to the first row.
|Row|Foo|Bar|Baz|Qux|
|---|---|---|---|---|
| 0 | A | A 🔴 | A | A |
| 1 | B | B | B | B |
| 2 | C | C | C | D 🟠 |
| 3 | D | A 🔴 | D | D 🟠 |
I should get rows 0, 2, and 3 as related, because 0['Bar'] == 3['Bar'] and 3['Qux'] == 2['Qux'].
I could just iterate over the columns to collect the similarities, but that would be slow and inefficient, and I would also need to iterate again whenever new similarities appear.
I hope someone can point me in the right direction, e.g. which pandas concept I should be looking at or which functions can help me solve this problem of retrieving intersecting data. Do I even need to use pandas?
Edit:
Providing the solution as suggested by @goodside. This solution loops until no new matching indices are found.
table = [
    ['A', 'A', 'A', 'A'],
    ['B', 'B', 'B', 'B'],
    ['C', 'C', 'C', 'D'],
    ['D', 'A', 'D', 'D']
]
comparators = [0]  # indices of the rows already known to be related
while True:
    for idx_row, row in enumerate(table):
        if idx_row in comparators:
            continue
        for idx_col, cell in enumerate(row):
            for comparator in comparators:
                if cell == table[comparator][idx_col]:
                    # new related row found: record it and rescan from the start
                    comparators.append(idx_row)
                    break
            else:
                continue
            break
        else:
            continue
        break
    else:
        # a full pass found no new matches, so we are done
        break

for item in comparators:
    print(table[item])

This is a graph problem. You can use networkx:
# get the list of connected nodes per column
def get_edges(s):
    return df['Row'].groupby(s).agg(frozenset)
edges = set(df.apply(get_edges).stack())
edges = list(map(set, edges))
# [{2}, {2, 3}, {0, 3}, {3}, {1}, {0}]
from itertools import pairwise, chain
# pairwise is python ≥ 3.10, see the doc for a recipe for older versions
# create the graph
import networkx as nx
G = nx.from_edgelist(chain.from_iterable(pairwise(e) for e in edges))
G.add_nodes_from(set.union(*edges))
# get the connected components
list(nx.connected_components(G))
Output: [{0, 2, 3}, {1}]
NB. You can read more on the logic to create the graph in this question of mine.
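For Python < 3.10, the pairwise recipe mentioned above (from the itertools documentation) is roughly:
from itertools import tee

def pairwise(iterable):
    # pairwise('ABCD') -> ('A','B'), ('B','C'), ('C','D')
    a, b = tee(iterable)
    next(b, None)
    return zip(a, b)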
Used input:
df = pd.DataFrame({'Row': [0, 1, 2, 3],
'Foo': ['A', 'B', 'C', 'D'],
'Bar': ['A', 'B', 'C', 'A'],
'Baz': ['A', 'B', 'C', 'D'],
'Qux': ['A', 'B', 'D', 'D']})
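If you only need the rows related to row 0 rather than every component, a short follow-up reusing the G and df built above could look like this:
# pick the connected component containing row 0, then filter the DataFrame
related = next(c for c in nx.connected_components(G) if 0 in c)
print(df[df['Row'].isin(related)])  # rows 0, 2 and 3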

Related

Count sequence within a column in pandas

I have the following problem. Suppose I have this dataframe:
import pandas as pd
d = {'Name': ['c', 'c', 'c', 'a', 'a', 'b', 'b', 'd', 'd'], 'Project': ['aa','ab','bc', 'aa', 'ab','aa', 'ab','ca', 'cb'],
'col2': [3, 4, 0, 6, 45, 6, -3, 8, -3]}
df = pd.DataFrame(data=d)
I need to add a new column that adds a number to each project per name. The desired output is:
import pandas as pd
dnew = {'Name': ['c', 'c', 'c', 'a', 'a', 'b', 'b', 'd', 'd'], 'Project': ['aa','ab','bc', 'aa', 'ab','aa', 'ab','ca', 'cb'],
'col2': [3, 4, 0, 6, 45, 6, -3, 8, -3], 'New_column': ['1', '1','1','2', '2','2','2','3','3']}
NEWdf = pd.DataFrame(data=dnew)
In other words: 'aa', 'ab', 'bc' in Project occur in the first rows, so I add 1 to the new column. 'aa', 'ab' is the second Project group from the beginning; it occurs for Names 'a' and 'b', so I add 2 for both in the new column. 'ca', 'cb' is the third project group and it occurs only for Name 'd', so I add 3 only for Name 'd'.
I tried to combine groupby with a for loop, but it did not work for me. Thanks a lot for the help!
This looks like a job for networkx, since Name and Project are related; you can use:
import networkx as nx
G=nx.from_pandas_edgelist(df, 'Name', 'Project')
l = list(nx.connected_components(G))
s = pd.Series(map(list,l)).explode()
df['new'] = df['Project'].map({v:k for k,v in s.items()}).add(1)
print(df)
Name Project col2 new
0 a aa 3 1
1 a ab 4 1
2 b bb 6 2
3 b bc 6 2
4 c aa 6 1
5 c ab 6 1
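Note that the numbering asked for in the question (1 for 'c', 2 for 'a' and 'b', 3 for 'd') follows the set of Projects per Name, so a networkx-free sketch along these lines may also work, assuming the df from the question:
# number each distinct set of Projects per Name, in order of first appearance
sets = df.groupby('Name', sort=False)['Project'].agg(frozenset)
codes = pd.factorize(sets)[0] + 1
df['New_column'] = df['Name'].map(dict(zip(sets.index, codes)))
print(df)  # New_column: 1,1,1,2,2,2,2,3,3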

Look up value in an array

Suppose I have two datasets
DS1
ArrayCol
[1,2,3,4]
[1,2,3]
DS2
Key Name
1 A
2 B
3 C
4 D
How can I look up the values in the array to map the "Name", so that I get another dataset like the following?
DS3
ColNew
[A,B,C,D]
[A,B,C]
Thanks. It's in Databricks, so any method is OK: Python, SQL, Scala...
You can try this:
ds1 = [[1, 2, 3, 4], [1, 2, 3]]
ds2 = {1: 'A', 2: 'B', 3: 'C', 4: 'D'}
new_data = [[ds2[cell] for cell in col] for col in ds1]
print(new_data)
output:
[['A', 'B', 'C', 'D'], ['A', 'B', 'C']]
Hope that helps. :)
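Since the rest of the thread is pandas-oriented, the same lookup can also be done on a DataFrame column of lists; a small sketch, with the DS1/DS2 column names assumed from the question:
import pandas as pd

ds1 = pd.DataFrame({'ArrayCol': [[1, 2, 3, 4], [1, 2, 3]]})
ds2 = pd.DataFrame({'Key': [1, 2, 3, 4], 'Name': ['A', 'B', 'C', 'D']})

# build a Key -> Name dict, then map every element of every array
lookup = dict(zip(ds2['Key'], ds2['Name']))
ds3 = pd.DataFrame({'ColNew': ds1['ArrayCol'].map(lambda arr: [lookup[k] for k in arr])})
print(ds3)
#          ColNew
# 0  [A, B, C, D]
# 1     [A, B, C]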
Let's assume your datasets are in files; then you can do something like this, making use of a dict:
f = open("ds1.txt").readlines()
g = open("ds2.txt").readlines()
u = dict(item.rstrip().split("\t") for item in g)
for i in f:
    i = i.rstrip().strip('][').split(',')
    print([u[col] for col in i])
Output
['A', 'B', 'C', 'D']
['A', 'B', 'C']

Is there a simpler way to group and count with Python?

I am grouping and counting a set of data.
df = pd.DataFrame({'key': ['A', 'B', 'A'],
'data': np.ones(3,)})
df.groupby('key').count()
outputs
     data
key
A       2
B       1
The piece of code above works, though I wonder if there is a simpler way.
'data': np.ones(3,) seems to be just a placeholder, yet indispensable.
pd.DataFrame(['A', 'B', 'A']).groupby(0).count()
outputs
A
B
My question is: is there a simpler way to do this and produce the counts of 'A' and 'B' respectively, without something like 'data': np.ones(3,)?
It doesn't have to be a pandas method, numpy or python native function are also appreciated.
Use a Series instead.
>>> import pandas as pd
>>>
>>> data = ['A', 'A', 'A', 'B', 'C', 'C', 'D', 'D', 'D', 'D', 'D']
>>>
>>> pd.Series(data).value_counts()
D 5
A 3
C 2
B 1
dtype: int64
Use a defaultdict:
from collections import defaultdict
data = ['A', 'A', 'B', 'A', 'C', 'C', 'A']
d = defaultdict(int)
for element in data:
d[element] += 1
d # output: defaultdict(int, {'A': 4, 'B': 1, 'C': 2})
There isn't any grouping here, just counting, so you can use:
from collections import Counter
Counter(['A', 'B', 'A'])
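Since numpy is also welcome here, np.unique with return_counts=True is another option; a minimal sketch:
import numpy as np

values, counts = np.unique(['A', 'B', 'A'], return_counts=True)
print(values)  # ['A' 'B']
print(counts)  # [2 1]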

How to get the unique values of a pandas column that contains lists or single values?

How can I get the unique values of a pandas column that contains lists or single values?
My column:
column | column
test | [A,B]
test | [A,C]
test | C
test | D
test | [E,B]
I want a list like this:
list = [A, B, C, D, E]
Thank you
You can apply pd.Series to split up the lists, then stack and unique.
import pandas as pd
df = pd.DataFrame({'col': [['A', 'B'], ['A', 'C'], 'C', 'D', ['E', 'B']]})
df.col.apply(pd.Series).stack().unique().tolist()
Outputs
['A', 'B', 'C', 'D', 'E']
You can use a flattening function (credit: @wim):
import collections
def flatten(l):
    for i in l:
        if isinstance(i, collections.abc.Iterable) and not isinstance(i, str):
            yield from flatten(i)
        else:
            yield i
Then use set
list(set(flatten(df.B)))
['A', 'B', 'E', 'C', 'D']
Setup
df = pd.DataFrame(dict(
B=[['A', 'B'], ['A', 'C'], 'C', 'D', ['E', 'B']]
))
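As an aside, Series.explode (pandas ≥ 0.25) keeps scalar entries as-is and expands list entries, so with the df.B setup just above the same result can likely be had in one line:
df.B.explode().unique().tolist()
# ['A', 'B', 'C', 'D', 'E']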

Python: get the indices that would sort an array of strings to match another array of strings

My question is how to get the indices of an array of strings that would sort another array.
I have these two arrays of strings:
A = np.array([ 'a', 'b', 'c', 'd' ])
B = np.array([ 'd', 'b', 'a', 'c' ])
I would like to get the indices that would sort the second one in order to match the first.
I have tried the np.argsort function, giving the second array (transformed into a list) as the order, but it doesn't seem to work.
Any help would be much appreciated.
Thanks and best regards,
Bradipo
Edit:
def sortedIndxs(arr):
    ???
such that
sortedIndxs(['d', 'b', 'a', 'c']) = [2, 1, 3, 0]
A vectorised approach is possible via numpy.searchsorted together with numpy.argsort:
import numpy as np
A = np.array(['a', 'b', 'c', 'd'])
B = np.array(['d', 'b', 'a', 'c'])
xsorted = np.argsort(B)
res = xsorted[np.searchsorted(B[xsorted], A)]
print(res)
[2 1 3 0]
Here is code that obtains a conversion rule from one arbitrary permutation to another.
creating indexTable: O(n)
examining indexTable: O(n)
Total: O(n)
A = ['a', 'b', 'c', 'd']
B = ['d', 'b', 'a', 'c']
indexTable = {k: v for v, k in enumerate(B)}
# {'d': 0, 'b': 1, 'a': 2, 'c': 3}
result = [indexTable[k] for k in A]
# [2, 1, 3, 0]
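A quick way to sanity-check either result is to index B with it and compare against A, for example:
import numpy as np

A = np.array(['a', 'b', 'c', 'd'])
B = np.array(['d', 'b', 'a', 'c'])
res = [2, 1, 3, 0]
print(B[res])                     # ['a' 'b' 'c' 'd']
print(np.array_equal(B[res], A))  # True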
