I have a similar problem to this one (the most similar is the answer using &&). For Postgres, I would like to get the intersection of an array column and a Python list. I've tried to do that with the && operator:
query(Table.array_column.op('&&')(cast(['a', 'b'], ARRAY(Unicode)))).filter(Table.array_column.op('&&')(cast(['a', 'b'], ARRAY(Unicode))))
but it seems that op('&&') returns a boolean (which makes sense for the filter), not the intersection.
So for the table data:

id | array_column
 1 | {'7', 'xyz', 'a'}
 2 | {'b', 'c', 'd'}
 3 | {'x', 'y', 'ab'}
 4 | {'ab', 'ba', ''}
 5 | {'a', 'b', 'ab'}

I would like to get:

id | array_column
 1 | {'a'}
 2 | {'b'}
 5 | {'a', 'b'}
One way* to do this would be to unnest the array column, and then re-aggregate the rows that match the list values, grouping on id. This could be done as a subquery:
select id, array_agg(un)
from (select id, unnest(array_column) as un from tbl) t
where un in ('a', 'b')
group by id
order by id;
The equivalent SQLAlchemy construct is:
subq = sa.select(
    tbl.c.id, sa.func.unnest(tbl.c.array_column).label('col')
).subquery('s')

stmt = (
    sa.select(subq.c.id, sa.func.array_agg(subq.c.col))
    .where(subq.c.col.in_(['a', 'b']))
    .group_by(subq.c.id)
    .order_by(subq.c.id)
)
returning
(1, ['a'])
(2, ['b'])
(5, ['a', 'b'])
* There may well be more efficient ways.
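For completeness, here is a self-contained sketch of how the construct above might be wired up end to end; the table definition and connection URL are assumptions for illustration, not part of the original answer (requires SQLAlchemy 1.4+):

import sqlalchemy as sa
from sqlalchemy.dialects.postgresql import ARRAY

metadata = sa.MetaData()

# hypothetical table matching the example data (table/column names and types are assumptions)
tbl = sa.Table(
    'tbl', metadata,
    sa.Column('id', sa.Integer, primary_key=True),
    sa.Column('array_column', ARRAY(sa.Unicode)),
)

subq = sa.select(
    tbl.c.id, sa.func.unnest(tbl.c.array_column).label('col')
).subquery('s')

stmt = (
    sa.select(subq.c.id, sa.func.array_agg(subq.c.col))
    .where(subq.c.col.in_(['a', 'b']))
    .group_by(subq.c.id)
    .order_by(subq.c.id)
)

# placeholder connection string; substitute your own
engine = sa.create_engine('postgresql+psycopg2://user:pass@localhost/mydb')
with engine.connect() as conn:
    for row in conn.execute(stmt):
        print(row)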
Given a random dataset, I need to find rows related to the first row.
|Row|Foo|Bar|Baz|Qux|
|---|---|---|---|---|
| 0 | A |A🔴| A | A |
| 1 | B | B | B | B |
| 2 | C | C | C |D🟠|
| 3 | D |A🔴| D |D🟠|
I should get the related rows which are 0, 2, and 3 because 0['Bar'] == 3['Bar'] and 3['Qux'] == 2['Qux'].
I could just iterate over the columns to get the similarities, but that would be slow and inefficient, and I would also need to iterate again whenever new similarities appear.
I hope someone can point me in the right direction, e.g. which pandas concept I should be looking at or which functions could help me solve this problem of retrieving intersecting data. Do I even need to use pandas?
Edit:
Providing the solution as suggested by @goodside. This solution loops until no new matching indexes are found.
table = [
    ['A', 'A', 'A', 'A'],
    ['B', 'B', 'B', 'B'],
    ['C', 'C', 'C', 'D'],
    ['D', 'A', 'D', 'D'],
]

comparators = [0]  # start from the first row

while True:
    for idx_row, row in enumerate(table):
        if idx_row in comparators:
            continue
        for idx_col, cell in enumerate(row):
            for comparator in comparators:
                if cell == table[comparator][idx_col]:
                    comparators.append(idx_row)
                    break
            else:
                continue  # no match in this column, try the next one
            break  # a match was found for this row
        else:
            continue  # this row matched nothing, try the next row
        break  # a new row was added, restart the scan
    else:
        break  # a full pass found no new matches, stop

for item in comparators:
    print(table[item])
This is a graph problem. You can use networkx:
# get the list of connected nodes per column
def get_edges(s):
    return df['Row'].groupby(s).agg(frozenset)

edges = set(df.apply(get_edges).stack())
edges = list(map(set, edges))
# [{2}, {2, 3}, {0, 3}, {3}, {1}, {0}]
from itertools import pairwise, chain
# pairwise is python ≥ 3.10, see the doc for a recipe for older versions
# create the graph
import networkx as nx
G = nx.from_edgelist(chain.from_iterable(pairwise(e) for e in edges))
G.add_nodes_from(set.union(*edges))
# get the connected components
list(nx.connected_components(G))
Output: [{0, 2, 3}, {1}]
NB. You can read more on the logic to create the graph in this question of mine.
Used input:
df = pd.DataFrame({'Row': [0, 1, 2, 3],
                   'Foo': ['A', 'B', 'C', 'D'],
                   'Bar': ['A', 'B', 'C', 'A'],
                   'Baz': ['A', 'B', 'C', 'D'],
                   'Qux': ['A', 'B', 'D', 'D']})
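As a follow-up (my addition, not part of the original answer), a short sketch that reuses G and df from the snippet above to pull out just the rows related to row 0, which is what the question asks for:

# the connected component that contains row 0
related = next(c for c in nx.connected_components(G) if 0 in c)  # {0, 2, 3}
# select the corresponding rows of the original frame
print(df[df['Row'].isin(related)])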
This question already has answers here:
GroupBy pandas DataFrame and select most common value
I have a dataframe with data on cities and their product types, such as:
city | product_type
A    | B
A    | B
A    | D
A    | E
X    | B
X    | C
X    | C
X    | C
I want to know what the most common product type is, for each city. For the above df, it would be product B for city A and product C for city X.
I am trying to solve this by first grouping and then iterating over the groups, trying to find the product type with the maximum occurrence, but it doesn't seem to work:
d = df.groupby('city')['product_type']
prods = []
for name, group in d:
    l = [group]
    prod = max(l, key=l.count)
    prods.append(prod)
print(prods)  # this is the list of products with the highest occurrence in each city
This piece of code seems to give me ALL the product types, not just the most frequent ones.
You can try something like this:
data = pd.DataFrame({
    'city': ['A', 'A', 'A', 'A', 'X', 'X', 'X', 'X'],
    'product_type': ['B', 'B', 'D', 'E', 'B', 'C', 'C', 'C']
})

result_dict = {city: city_data.product_type.value_counts().index[0]
               for city, city_data in data.groupby('city')}
print(result_dict)
This will result in the dictionary {'A': 'B', 'X': 'C'}. Note that if more than one product has the same number of occurrences, this code will only return one of them.
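An equivalent one-liner sketch using groupby/agg (the lambda-based aggregation is my suggestion, not part of the original answer) that produces a Series indexed by city:

import pandas as pd

data = pd.DataFrame({
    'city': ['A', 'A', 'A', 'A', 'X', 'X', 'X', 'X'],
    'product_type': ['B', 'B', 'D', 'E', 'B', 'C', 'C', 'C']
})

# most common product_type per city, as a Series indexed by city
most_common = data.groupby('city')['product_type'].agg(lambda s: s.value_counts().index[0])
print(most_common.to_dict())  # {'A': 'B', 'X': 'C'}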
I have a dataframe (but it can also be just sets/lists):
Group  Letter
1      {a,b,c,d,e}
2      {b,c,d,e,f}
3      {b,c,d,f,g}
4      {a,b,c,f,g}
5      {a,c,d,e,h}
I want to add a column with the intersection of groups 1-2, 1-2-3, 1-2-3-4, and 1-2-3-4-5.
So it'll be something like this:
Group  Letter       Intersection
1      {a,b,c,d,e}  None
2      {b,c,d,e,f}  {b,c,d,e}
3      {b,c,d,f,g}  {b,c,d}
4      {a,b,c,f,g}  {b,c}
5      {a,c,d,e,h}  {c}
I've read about np.intersect1d and set.intersection, so I can do an intersection of multiple sets.
But I don't know how to do it in a smart way.
Can someone help me with this problem?
You might use itertools.accumulate for this task, as follows:
import itertools

letters = [{"a", "b", "c", "d", "e"},
           {"b", "c", "d", "e", "f"},
           {"b", "c", "d", "f", "g"},
           {"a", "b", "c", "f", "g"},
           {"a", "c", "d", "e", "h"}]

intersections = list(itertools.accumulate(letters, set.intersection))
print(intersections)
output
[{'e', 'a', 'b', 'c', 'd'}, {'b', 'e', 'c', 'd'}, {'b', 'c', 'd'}, {'b', 'c'}, {'c'}]
Note that the first element is {'e', 'a', 'b', 'c', 'd'} rather than None, so you would need to alter intersections in that regard.
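A minimal sketch of wiring the accumulated intersections back into the dataframe the question describes; the column names Group, Letter and Intersection come from the question, and replacing the first entry with None is my adjustment:

import itertools
import pandas as pd

letters = [{"a", "b", "c", "d", "e"},
           {"b", "c", "d", "e", "f"},
           {"b", "c", "d", "f", "g"},
           {"a", "b", "c", "f", "g"},
           {"a", "c", "d", "e", "h"}]

df = pd.DataFrame({'Group': range(1, 6), 'Letter': letters})
intersections = list(itertools.accumulate(df['Letter'], set.intersection))
df['Intersection'] = [None] + intersections[1:]  # first row has nothing to intersect with
print(df)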
I have a dictionary mapping strings to lists of strings, for example:
{'A':['A'], 'B':['A', 'B', 'C'], 'C':['B', 'E', 'F']}
I am looking to use this to filter a dataframe, creating new dfs where the name of each df is the key and the columns to be copied are those whose names contain any of the strings listed as the values.
So dataframe A would contain the columns of the original dataframe whose names contain 'A', and dataframe B would contain the columns whose names contain 'A', 'B', or 'C'. I know that I need to use regex filtering to select the columns but am unsure how to do this.
Use DataFrame.filter with regex, joining the values with | for a regex "or". For key C this selects columns whose names contain B, E, or F:
d = {'A':['A'], 'B':['A', 'B', 'C'], 'C':['B', 'E', 'F']}
dfs = {k:df.filter(regex='|'.join(v)) for k, v in d.items()}
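A quick sketch with a made-up df to illustrate the result; the column names here are my assumptions purely for demonstration:

import pandas as pd

# hypothetical frame whose column names contain the letters of interest
df = pd.DataFrame([[1, 2, 3, 4, 5, 6]],
                  columns=['A1', 'B2', 'C3', 'E4', 'F5', 'XY'])

d = {'A': ['A'], 'B': ['A', 'B', 'C'], 'C': ['B', 'E', 'F']}
dfs = {k: df.filter(regex='|'.join(v)) for k, v in d.items()}

print(dfs['C'].columns.tolist())  # ['B2', 'E4', 'F5']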
I know this question has already been asked here, but my question is a bit different. Let's say I have the following df:
import pandas as pd
df = pd.DataFrame({'A': ('a', 'b', 'c', 'd', 'e', 'a', 'b'), 'B': ('a', 'a', 'g', 'l', 'e', 'a', 'b'), 'C': ('b', 'b', 'g', 'a', 'e', 'a', 'b')})
myList = ['a', 'e', 'b']
I use this line to count the total number of occurrences of the elements of myList in a single df column:
print(df.query('A in @myList').A.count())
5
Now, I am trying to do the same thing by looping through the column names. Something like this:
for col in df.columns:
    print(df.query('col in @myList').col.count())
Also, I was wondering whether using query for this is the most efficient way.
Thanks for the help.
Use this:
df.isin(myList).sum()
A 5
B 5
C 6
dtype: int64
It checks every cell in the dataframe against myList and returns True or False. sum then treats those booleans as 1 or 0 and totals them for each column.
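If a single grand total across all columns is wanted instead of per-column counts, a small extension (my addition, not part of the original answer):

import pandas as pd

df = pd.DataFrame({'A': ('a', 'b', 'c', 'd', 'e', 'a', 'b'),
                   'B': ('a', 'a', 'g', 'l', 'e', 'a', 'b'),
                   'C': ('b', 'b', 'g', 'a', 'e', 'a', 'b')})
myList = ['a', 'e', 'b']

total = df.isin(myList).sum().sum()  # sum the per-column counts into one number
print(total)  # 16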