Given a random dataset, I need to find rows related to the first row.
|Row|Foo|Bar|Baz|Qux|
|---|---|---|---|---|
| 0 | A | A🔴 | A | A |
| 1 | B | B | B | B |
| 2 | C | C | C | D🟠 |
| 3 | D | A🔴 | D | D🟠 |
I should get rows 0, 2, and 3 as related, because 0['Bar'] == 3['Bar'] and 3['Qux'] == 2['Qux'] (so row 2 is related to row 0 transitively through row 3).
I could just iterate over the columns to collect the similarities, but that would be slow and inefficient, and I would have to iterate again whenever a newly matched row introduces further similarities.
I hope someone can point me in the right direction: which pandas concept should I be looking at, or which functions could help me solve this problem of retrieving intersecting data? Do I even need pandas?
Edit:
Adding the solution suggested by #goodside. This solution loops until no new matching indices are found.
table = [
    ['A', 'A', 'A', 'A'],
    ['B', 'B', 'B', 'B'],
    ['C', 'C', 'C', 'D'],
    ['D', 'A', 'D', 'D']
]

comparators = [0]  # start from the first row
while True:
    for idx_row, row in enumerate(table):
        if idx_row in comparators:
            continue
        for idx_col, cell in enumerate(row):
            for comparator in comparators:
                if cell == table[comparator][idx_col]:
                    # new match found: restart the scan with the enlarged set
                    comparators.append(idx_row)
                    break
            else:
                continue
            break
        else:
            continue
        break
    else:
        # a full pass found no new match: done
        break

for item in comparators:
    print(table[item])
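For reference, here is a more compact sketch of the same fixed-point idea (my rewrite, not part of the original answer), using a set and a changed flag instead of the chained for/else blocks:
matched = {0}
changed = True
while changed:
    changed = False
    for idx_row, row in enumerate(table):
        if idx_row in matched:
            continue
        # does this row share a cell (same column) with any matched row?
        if any(cell == table[m][idx_col]
               for idx_col, cell in enumerate(row)
               for m in matched):
            matched.add(idx_row)
            changed = True

for item in sorted(matched):
    print(table[item])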
This is a graph problem. You can use networkx:
# get the list of connected nodes per column
def get_edges(s):
    return df['Row'].groupby(s).agg(frozenset)
edges = set(df.apply(get_edges).stack())
edges = list(map(set, edges))
# [{2}, {2, 3}, {0, 3}, {3}, {1}, {0}]
from itertools import pairwise, chain
# pairwise is python ≥ 3.10, see the doc for a recipe for older versions
# create the graph
import networkx as nx
G = nx.from_edgelist(chain.from_iterable(pairwise(e) for e in edges))
G.add_nodes_from(set.union(*edges))
# get the connected components
list(nx.connected_components(G))
Output: [{0, 2, 3}, {1}]
NB. You can read more on the logic to create the graph in this question of mine.
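For Python < 3.10, the pairwise recipe mentioned in the comment above looks like this (adapted from the itertools docs):
from itertools import tee

def pairwise(iterable):
    # pairwise('ABCD') -> AB BC CD
    a, b = tee(iterable)
    next(b, None)
    return zip(a, b)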
Used input:
df = pd.DataFrame({'Row': [0, 1, 2, 3],
'Foo': ['A', 'B', 'C', 'D'],
'Bar': ['A', 'B', 'C', 'A'],
'Baz': ['A', 'B', 'C', 'D'],
'Qux': ['A', 'B', 'D', 'D']})
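To tag each original row with its component id, the components can be mapped back onto the frame (a small extension of the above, my addition):
# number each connected component, then map Row -> component id
comp = {n: i for i, c in enumerate(nx.connected_components(G)) for n in c}
df['group'] = df['Row'].map(comp)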
I have the following problem. Suppose I have this dataframe:
import pandas as pd
d = {'Name': ['c', 'c', 'c', 'a', 'a', 'b', 'b', 'd', 'd'],
     'Project': ['aa', 'ab', 'bc', 'aa', 'ab', 'aa', 'ab', 'ca', 'cb'],
     'col2': [3, 4, 0, 6, 45, 6, -3, 8, -3]}
df = pd.DataFrame(data=d)
I need to add a new column that assigns a number to each set of projects per name. The desired output is:
import pandas as pd
dnew = {'Name': ['c', 'c', 'c', 'a', 'a', 'b', 'b', 'd', 'd'],
        'Project': ['aa', 'ab', 'bc', 'aa', 'ab', 'aa', 'ab', 'ca', 'cb'],
        'col2': [3, 4, 0, 6, 45, 6, -3, 8, -3],
        'New_column': ['1', '1', '1', '2', '2', '2', '2', '3', '3']}
NEWdf = pd.DataFrame(data=dnew)
In other words: 'aa', 'ab', 'bc' in Project occur in the first rows, so I add 1 to the new column. 'aa', 'ab' is the second project set from the beginning; it occurs for Names 'a' and 'b', so I add 2 to the new column for both. 'ca', 'cb' is the third project set and it occurs only for Name 'd', so I add 3 only for Name 'd'.
I tried to combine groupby with a for loop, but it did not work for me. Thanks a lot for any help!
This looks like a networkx problem, since Name and Project are related; you can use:
import networkx as nx

# link each Name to its Projects, then find the connected components
G = nx.from_pandas_edgelist(df, 'Name', 'Project')
l = list(nx.connected_components(G))
# number each component and map every Project to its component number
s = pd.Series(map(list, l)).explode()
df['new'] = df['Project'].map({v: k for k, v in s.items()}).add(1)
print(df)
  Name Project  col2  new
0    c      aa     3    1
1    c      ab     4    1
2    c      bc     0    1
3    a      aa     6    1
4    a      ab    45    1
5    b      aa     6    1
6    b      ab    -3    1
7    d      ca     8    2
8    d      cb    -3    2
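Note that pure connectivity puts a, b and c in one component, so the 'new' column only distinguishes two groups, while the desired New_column numbers c, {a, b} and d separately. If the intent is that names sharing exactly the same set of projects get the same number, in order of first appearance, here is a hedged groupby sketch (my addition, not part of the answer above):
# Assumption: names with identical project sets share a number,
# numbered in order of first appearance.
sets = df.groupby('Name', sort=False)['Project'].agg(frozenset)
codes = {s: i + 1 for i, s in enumerate(dict.fromkeys(sets))}
df['New_column'] = df['Name'].map(sets).map(codes)
# -> [1, 1, 1, 2, 2, 2, 2, 3, 3]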
Suppose I have two datasets
DS1
ArrayCol
[1,2,3,4]
[1,2,3]
DS2
Key Name
1 A
2 B
3 C
4 D
How do I look up the values in the array and map them to "Name", so that I can have another dataset like the following?
DS3
ColNew
[A,B,C,D]
[A,B,C]
Thanks. It's in Databricks, so any method is OK: Python, SQL, Scala, …
You can try this:
ds1 = [[1, 2, 3, 4], [1, 2, 3]]
ds2 = {1: 'A', 2: 'B', 3: 'C', 4: 'D'}
new_data = [[ds2[key] for key in row] for row in ds1]
print(new_data)
output:
[['A', 'B', 'C', 'D'], ['A', 'B', 'C']]
Hope that helps. :)
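If the data is already in pandas, here is a minimal sketch of the same lookup (column and variable names assumed from the question):
import pandas as pd

ds1 = pd.DataFrame({'ArrayCol': [[1, 2, 3, 4], [1, 2, 3]]})
ds2 = pd.DataFrame({'Key': [1, 2, 3, 4], 'Name': ['A', 'B', 'C', 'D']})

# build a Key -> Name lookup Series, then map each array through it
mapping = ds2.set_index('Key')['Name']
ds3 = pd.DataFrame({'ColNew': ds1['ArrayCol'].map(lambda keys: mapping.loc[keys].tolist())})
print(ds3)
#          ColNew
# 0  [A, B, C, D]
# 1     [A, B, C]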
Let's assume your datasets are in files; then you can do something like this, making use of a dict:
f=open("ds1.txt").readlines()
g=open("ds2.txt").readlines()
u=dict(item.rstrip().split("\t") for item in g)
for i in f:
i = i.rstrip().strip('][').split(',')
print [u[col] for col in i]
Output
['A', 'B', 'C', 'D']
['A', 'B', 'C']
I am grouping and counting a set of data.
import numpy as np
import pandas as pd

df = pd.DataFrame({'key': ['A', 'B', 'A'],
                   'data': np.ones(3,)})
df.groupby('key').count()
outputs
data
key
A 2
B 1
The piece of code above works, but I wonder if there is a simpler way. The 'data': np.ones(3,) column seems to be nothing but a placeholder, yet indispensable.
pd.DataFrame(['A', 'B', 'A']).groupby(0).count()
outputs an empty DataFrame, since there are no remaining columns to count:
Empty DataFrame
Columns: []
Index: [A, B]
My question is: is there a simpler way to produce the counts of 'A' and 'B' without a placeholder like 'data': np.ones(3,)? It doesn't have to be a pandas method; numpy or native Python functions are also appreciated.
Use a Series instead.
>>> import pandas as pd
>>>
>>> data = ['A', 'A', 'A', 'B', 'C', 'C', 'D', 'D', 'D', 'D', 'D']
>>>
>>> pd.Series(data).value_counts()
D 5
A 3
C 2
B 1
dtype: int64
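A related option (my note, not part of the answer above): groupby(...).size() counts the rows of each group without needing a value column:
pd.DataFrame(['A', 'B', 'A']).groupby(0).size()
# 0
# A    2
# B    1
# dtype: int64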
Use a defaultdict:
from collections import defaultdict

data = ['A', 'A', 'B', 'A', 'C', 'C', 'A']
d = defaultdict(int)
for element in data:
    d[element] += 1
d  # output: defaultdict(int, {'A': 4, 'B': 1, 'C': 2})
There isn't any grouping here, just counting, so you can use:
from collections import Counter
Counter(['A', 'B', 'A'])
# Counter({'A': 2, 'B': 1})
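Since numpy is also welcome, np.unique can return the counts directly; a minimal sketch:
import numpy as np

values, counts = np.unique(['A', 'B', 'A'], return_counts=True)
# values -> array(['A', 'B'], dtype='<U1'), counts -> array([2, 1])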
How do I get the unique values of a pandas column that contains lists or scalar values?
my column:
column | column
test   | [A,B]
test   | [A,C]
test   | C
test   | D
test   | [E,B]
I want a list like this:
list = [A, B, C, D, E]
Thank you!
You can apply pd.Series to split up the lists, then stack and unique.
import pandas as pd
df = pd.DataFrame({'col': [['A', 'B'], ['A', 'C'], 'C', 'D', ['E', 'B']]})
df.col.apply(pd.Series).stack().unique().tolist()
Outputs
['A', 'B', 'C', 'D', 'E']
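A possibly faster alternative (my note, assuming pandas ≥ 0.25): Series.explode expands the lists and passes scalars through unchanged:
df.col.explode().unique().tolist()
# ['A', 'B', 'C', 'D', 'E']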
You can use a flattening function (credit: #wim):
import collections
def flatten(l):
    for i in l:
        if isinstance(i, collections.abc.Iterable) and not isinstance(i, str):
            yield from flatten(i)
        else:
            yield i
Then use set:
list(set(flatten(df.B)))
['A', 'B', 'E', 'C', 'D']
Setup
df = pd.DataFrame(dict(
    B=[['A', 'B'], ['A', 'C'], 'C', 'D', ['E', 'B']]
))
My question is how to get the indices of an array of strings that would sort another array.
I have these two arrays of strings:
A = np.array(['a', 'b', 'c', 'd'])
B = np.array(['d', 'b', 'a', 'c'])
I would like to get the indices that would sort the second one so that it matches the first.
I have tried the np.argsort function, giving the second array (transformed into a list) as the order, but it doesn't seem to work.
Any help would be much appreciated.
Thanks and best regards,
Bradipo
Edit:
def sortedIndxs(arr):
    ???
such that
sortedIndxs(['d', 'b', 'a', 'c']) = [2, 1, 3, 0]
A vectorised approach is possible via numpy.searchsorted together with numpy.argsort:
import numpy as np
A = np.array(['a', 'b', 'c', 'd'])
B = np.array(['d', 'b', 'a', 'c'])
xsorted = np.argsort(B)
res = xsorted[np.searchsorted(B[xsorted], A)]
print(res)
[2 1 3 0]
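As a quick check (my addition, not part of the original answer), indexing B with res reproduces A:
print(B[res])  # ['a' 'b' 'c' 'd'], i.e. the order of A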
Here is code that obtains a conversion rule from one arbitrary permutation to another.
creating indexTable: O(n)
examining indexTable: O(n)
total: O(n)
A = ['a', 'b', 'c', 'd']
B = ['d', 'b', 'a', 'c']

indexTable = {k: v for v, k in enumerate(B)}
# {'d': 0, 'b': 1, 'a': 2, 'c': 3}
result = [indexTable[k] for k in A]
# [2, 1, 3, 0]
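For completeness, a hedged sketch wrapping this into the sortedIndxs function from the edit above (assuming the target order is the sorted order of the input):
def sortedIndxs(arr):
    # position of each element within arr
    indexTable = {k: v for v, k in enumerate(arr)}
    # indices that rearrange arr into sorted order
    return [indexTable[k] for k in sorted(arr)]

print(sortedIndxs(['d', 'b', 'a', 'c']))  # [2, 1, 3, 0]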