I have a dataframe (but it could also just be sets/lists):
Group Letter
1 {a,b,c,d,e}
2 {b,c,d,e,f}
3 {b,c,d,f,g}
4 {a,b,c,f,g}
5 {a,c,d,e,h}
I want to add a column with the intersection of groups 1-2, 1-2-3, 1-2-3-4, 1-2-3-4-5.
So it'll be something like this:
Group Letter Intersection
1 {a,b,c,d,e} None
2 {b,c,d,e,f} {b,c,d,e}
3 {b,c,d,f,g} {b,c,d}
4 {a,b,c,f,g} {b,c}
5 {a,c,d,e,h} {c}
I've read about np.intersect1d and set.intersection, so I can compute the intersection of multiple sets.
But I don't know how to build the running intersections in a clean way.
Can someone help me with this problem?
You might use itertools.accumulate for this task, as follows:
import itertools
letters = [{"a","b","c","d","e"},{"b","c","d","e","f"},{"b","c","d","f","g"},{"a","b","c","f","g"},{"a","c","d","e","h"}]
intersections = list(itertools.accumulate(letters, set.intersection))
print(intersections)
output
[{'e', 'a', 'b', 'c', 'd'}, {'b', 'e', 'c', 'd'}, {'b', 'c', 'd'}, {'b', 'c'}, {'c'}]
Note first element is {'e', 'a', 'b', 'c', 'd'} rather than None, so you would need to alter intersections in that regard.
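To match the desired output exactly, you can overwrite the first entry after accumulating; a minimal sketch:

```python
import itertools

letters = [{"a", "b", "c", "d", "e"}, {"b", "c", "d", "e", "f"},
           {"b", "c", "d", "f", "g"}, {"a", "b", "c", "f", "g"},
           {"a", "c", "d", "e", "h"}]

# Running intersections: element i is the intersection of groups 1..i+1
intersections = list(itertools.accumulate(letters, set.intersection))

# Group 1 has nothing before it to intersect with, so blank it out
intersections[0] = None
print(intersections)
```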
I have a large list of elements:
list_ = [a1, b, c, b, c, d, a2, c,...a3........]
And I want to remove specific elements from it: a1, a2, a3.
Suppose I can get the indexes of the elements that start with a:
a_indexes = [0, 6, ...]
Now I want to remove most of these elements starting with a, but not all of them; I want to keep 20 of them, chosen arbitrarily. How can I do so?
I know that to remove an element from a list list_ I can use:
list_.remove(list_[element_position])
But I am not sure how to work with the list of a elements.
Here's an approach that will work, if I understand the question correctly.
We have a list containing numerous items. We want to remove some elements that match a certain criterion - but not all.
So:
from random import sample
li = ['a','b','a','b','a','b','a']
dc = 'a'
keep = 1 # this is how many we want to keep in the list
if (k := li.count(dc) - keep) > 0:  # sanity check: nothing to remove otherwise
    il = [i for i, v in enumerate(li) if v == dc]  # indices of all matches
    for i in sorted(sample(il, k), reverse=True):  # highest index first
        li.pop(i)
print(li)
Note how the sample is sorted in reverse. This is important because we're popping elements by index: removing a low index first would shift every later index, so we could end up removing the wrong elements.
An example of output might be:
['b', 'b', 'a', 'b']
Suppose you have this list:
li=['d', 'a', 'c', 'a', 'g', 'b', 'f', 'a', 'c', 'g', 'e', 'f', 'e', 'g', 'b', 'b', 'c', 'e', 'a', 'd', 'g', 'd', 'd', 'a', 'c', 'e', 'a', 'c', 'f', 'a', 'b', 'a', 'a', 'f', 'b', 'd', 'd', 'b', 'f', 'a', 'd', 'g', 'd', 'b', 'e']
You can define a character to delete and a count k of how many to delete:
delete = 'a'
k = 3
Then use random.shuffle to generate a random group of k indices to delete:
import random

idx = [i for i, c in enumerate(li) if c == delete]  # all positions of `delete`
random.shuffle(idx)
idx = idx[:k]  # keep a random k of them
>>> idx
[3, 7, 31]
Then delete those indices:
new_li = [e for i, e in enumerate(li) if i not in idx]
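The shuffle-and-slice step can also be collapsed into a single random.sample call, which picks k distinct indices directly; a sketch of the same idea on a small made-up list:

```python
import random

li = ['d', 'a', 'c', 'a', 'g', 'b', 'f', 'a', 'c', 'g']
delete = 'a'
k = 2  # how many occurrences of `delete` to remove

# random.sample picks k distinct indices in one step, so no shuffle is needed
idx = set(random.sample([i for i, c in enumerate(li) if c == delete], k))
new_li = [e for i, e in enumerate(li) if i not in idx]
print(new_li)
```

Using a set for idx makes the `i not in idx` membership test O(1) per element.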
I know this question has already been asked here, but my question is a bit different. Let's say I have the following df:
import pandas as pd
df = pd.DataFrame({'A': ('a', 'b', 'c', 'd', 'e', 'a', 'b'), 'B': ('a', 'a', 'g', 'l', 'e', 'a', 'b'), 'C': ('b', 'b', 'g', 'a', 'e', 'a', 'b')})
myList = ['a', 'e', 'b']
I use this line to count the total number of occurrence of each elements of myList in my df columns:
print(df.query('A in @myList').A.count())
5
Now, I am trying to execute the same thing by looping through columns names. Something like this:
for col in df.columns:
    print(df.query('col in @myList').col.count())
Also, I was wondering if using query for this is the most efficient way?
Thanks for the help.
Use this:
df.isin(myList).sum()
A 5
B 5
C 6
dtype: int64
It checks every cell of the dataframe against myList, returning True or False, and sum() treats those booleans as 1 or 0 to produce the total for each column.
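If you still want the per-column loop from the question, isin works there too, and avoids query()'s string parsing entirely; a sketch using the question's data:

```python
import pandas as pd

df = pd.DataFrame({'A': ('a', 'b', 'c', 'd', 'e', 'a', 'b'),
                   'B': ('a', 'a', 'g', 'l', 'e', 'a', 'b'),
                   'C': ('b', 'b', 'g', 'a', 'e', 'a', 'b')})
myList = ['a', 'e', 'b']

# Vectorised: isin() yields a boolean frame, sum() totals each column
counts = df.isin(myList).sum()
print(counts)

# The explicit per-column loop the question asked about, without query()
for col in df.columns:
    print(col, df[col].isin(myList).sum())
```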
This question already has answers here:
How can I create a Set of Sets in Python?
If you have a list of sets like this:
a_list = [{'a'}, {'a'}, {'a', 'b'}, {'a', 'b'}, {'a', 'c', 'b'}, {'a', 'c', 'b'}]
How could one get the number of unique sets in the list?
I have tried:
len(set(a_list))
I am getting the error:
TypeError: unhashable type: 'set'
Desired output in this case is: 3 as there are three unique sets in the list.
You can convert each set to a tuple, which is hashable:
a_list = [{'a'}, {'a'}, {'a', 'b'}, {'a', 'b'}, {'a', 'c', 'b'}, {'a', 'c', 'b'}]
result = list(map(set, set(map(tuple, a_list))))
print(result)
print(len(result))
Output:
[{'a', 'b'}, {'a'}, {'c', 'a', 'b'}]
3
A less functional approach, perhaps a bit more readable:
result = [set(c) for c in set([tuple(i) for i in a_list])]
How about converting to a tuple:
a_list = [{'a'}, {'a'}, {'a', 'b'}, {'a', 'b'}, {'a', 'c', 'b'}, {'a', 'c', 'b'}]
print(len(set(map(tuple, a_list))))
OUTPUT:
3
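An alternative worth noting: frozenset avoids the tuple round-trip entirely, and unlike tuples it cannot be confused by two equal sets iterating in different orders; a minimal sketch:

```python
a_list = [{'a'}, {'a'}, {'a', 'b'}, {'a', 'b'}, {'a', 'c', 'b'}, {'a', 'c', 'b'}]

# frozenset is an immutable, hashable set, so it can be stored in a set
# directly; equality ignores element order by construction
unique = set(map(frozenset, a_list))
print(len(unique))
```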
I am grouping and counting a set of data.
import numpy as np
import pandas as pd

df = pd.DataFrame({'key': ['A', 'B', 'A'],
                   'data': np.ones(3)})
df.groupby('key').count()
outputs
data
key
A 2
B 1
The piece of code above works, but I wonder if there is a simpler one.
'data': np.ones(3,) seems to be just a placeholder, yet it is indispensable:
pd.DataFrame(['A', 'B', 'A']).groupby(0).count()
outputs an empty frame, since the grouping column is consumed and nothing is left to count:
Empty DataFrame
Columns: []
Index: [A, B]
My question is: is there a simpler way to produce the counts of 'A' and 'B' respectively, without a placeholder like 'data': np.ones(3,)?
It doesn't have to be a pandas method, numpy or python native function are also appreciated.
Use a Series instead.
>>> import pandas as pd
>>>
>>> data = ['A', 'A', 'A', 'B', 'C', 'C', 'D', 'D', 'D', 'D', 'D']
>>>
>>> pd.Series(data).value_counts()
D 5
A 3
C 2
B 1
dtype: int64
Use a defaultdict:
from collections import defaultdict
data = ['A', 'A', 'B', 'A', 'C', 'C', 'A']
d = defaultdict(int)
for element in data:
    d[element] += 1
d  # output: defaultdict(int, {'A': 4, 'B': 1, 'C': 2})
There's no grouping here, just counting, so you can use collections.Counter:
from collections import Counter
Counter(['A', 'B', 'A'])
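A runnable sketch of the Counter approach (note the class name is capitalized, Counter, not counter):

```python
from collections import Counter

# Counter is a dict subclass mapping each element to its count
counts = Counter(['A', 'B', 'A'])
print(counts)                 # counts per element
print(counts['A'])            # count of a single element
print(counts.most_common(1))  # most frequent element(s)
```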
Say I have 3 different items: A, B and C. I want to create a combined list containing NA copies of A, NB copies of B and NC copies of C, in random order. So the result should look like this:
finalList = [A, C, A, A, B, C, A, C,...]
Is there a clean, Pythonic way to do this using np.random? If not, any other packages besides numpy?
I don't think you need numpy for that. You can use the random builtin package:
import random
na = nb = nc = 5
l = ['A'] * na + ['B'] * nb + ['C'] * nc
random.shuffle(l)
The list l will look something like:
['A', 'C', 'A', 'B', 'C', 'A', 'C', 'B', 'B', 'B', 'A', 'C', 'B', 'C', 'A']
You can define a list of tuples. Each tuple should contain a character and desired frequency. Then you can create a list where each element is repeated with specified frequency and finally shuffle it using random.shuffle
>>> import random
>>> l = [('A',3),('B',5),('C',10)]
>>> a = [val for val, freq in l for i in range(freq)]
>>> random.shuffle(a)
>>> a
['A', 'B', 'A', 'B', 'B', 'B', 'C', 'C', 'C', 'C', 'C', 'C', 'C', 'C', 'A', 'C', 'B', 'C']
Yes, this is very much possible (and simple) with numpy. You'll have to create an array with your unique elements, repeat each element a specified number of times using np.repeat (using an axis argument makes this possible), and then shuffle with np.random.shuffle.
Here's an example with NA as 1, NB as 2, and NC as 3.
import numpy as np

a = np.array([['A', 'B', 'C']]).repeat([1, 2, 3], axis=1).squeeze()
np.random.shuffle(a)
print(a)
output (order will vary):
['B' 'C' 'A' 'C' 'B' 'C']
Note that using numpy, where you specify an array of unique elements and their repeat counts, is simpler than a pure Python implementation when you have a large number of unique elements to repeat.
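For larger counts, the same idea reads more directly with 1-D arrays, where np.repeat needs no extra axis or squeeze(); a sketch with illustrative sizes:

```python
import numpy as np

values = np.array(['A', 'B', 'C'])
counts = np.array([100, 200, 300])  # illustrative NA, NB, NC

# Element-wise repeat: 100 'A's, 200 'B's, 300 'C's, then shuffle in place
a = np.repeat(values, counts)
np.random.shuffle(a)
```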