Counting specific dictionary entries in Python

I have a Python dictionary and I want to count the number of keys that have a specific format.
The keys I want to count all have the format ‘letter, number, number’.
In my specific case the key always begins with the letter ‘A’; only the numbers change.
Example: A12, A16, A71
That is, I want to count all the entries whose keys have this AXX format (where the X’s are digits).
{'A34': 83, 'B32': 70, 'A44': 66, 'A12': 47, 'B90': 71}
I know I can count all the entries of my dictionary by using:
print(len(my_dict.keys()))
But how do I count only the entries whose keys have the specific format I need?

You can use a generator comprehension inside the sum function:
print(sum(1 for k in d.keys() if k.startswith('A') and len(k) == 3 and k[1:3].isdigit()))
This does three checks: that the key starts with A, that its length is 3, and that its last two characters are digits.
You can also use Regex:
import re
print(sum(1 for k in d.keys() if re.match('^A\\d{2}$', k)))
Both snippets output 3.
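As a small variation on the regex approach above, here is a minimal sketch that compiles the pattern once and uses re.fullmatch, so no ^ or $ anchors are needed. The names KEY_PATTERN and count_matching_keys are made up for illustration, and the dictionary is the sample from the question with the 'A12' key quoted.

```python
import re

# Sample dictionary from the question (with the 'A12' key quoted).
my_dict = {'A34': 83, 'B32': 70, 'A44': 66, 'A12': 47, 'B90': 71}

# Hypothetical helper names, not part of the original answers.
# fullmatch requires the whole key to be 'A' followed by exactly two digits.
# [0-9] is used instead of \d so that only ASCII digits are accepted.
KEY_PATTERN = re.compile(r'A[0-9]{2}')


def count_matching_keys(d, pattern=KEY_PATTERN):
    """Return how many keys of d match the pattern in full."""
    return sum(1 for key in d if pattern.fullmatch(key))


print(count_matching_keys(my_dict))  # 3
```

Iterating over the dictionary directly yields its keys, so calling .keys() is optional here.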

You can try a list comprehension. This first version counts every key that contains an 'A' anywhere:
len([key for key in my_dict if 'A' in key])
For your specific condition, try the version below. If you need to be stricter (for example, to check that the last two characters are digits), use a regex in the if clause.
len([key for key in my_dict if key.startswith('A') and len(key) == 3])
Should work!

import re
my_dict = { ... }
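# filter() returns an iterator in Python 3, so materialize it before calling len().
# [A-Z] accepts any capital letter; replace it with A to count only the A keys.
filtered = list(filter(lambda k: re.match(r"^[A-Z][0-9]{2}$", k), my_dict.keys()))
print(len(filtered))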

Go through all possibilities and check?
result = sum(f'A{i:02}' in my_dict for i in range(100))
Benchmark along with the solutions from the accepted answer, with a dict like you described your real one ("about 5000 items" and "all my A values have 2 digits. However other values that I have such as B and C values will have 3 digits."):
41.5 μs sum(f'A{i:02}' in my_dict for i in range(100))
573.5 μs sum(1 for k in my_dict.keys() if k.startswith('A') and len(k) == 3 and k[1:3].isdigit())
3546.0 μs sum(1 for k in my_dict.keys() if re.match('^A\d{2}$', k))
Benchmark code:
from timeit import repeat
setup = '''
from random import sample
from string import ascii_uppercase as letters
import re
A = [f'A{i:02}' for i in range(100)]
B2Z = [f'{letter}{i}' for letter in letters for i in range(10, 1000)]
A2Z = sample(A + sample(B2Z, 4900), 5000)
my_dict = dict.fromkeys(A2Z)
'''
E = [
"sum(f'A{i:02}' in my_dict for i in range(100))",
"sum(1 for k in my_dict.keys() if k.startswith('A') and len(k) == 3 and k[1:3].isdigit())",
"sum(1 for k in my_dict.keys() if re.match('^A\\d{2}$', k))",
]
for _ in range(3):
    for e in E:
        number = 10
        t = min(repeat(e, setup, number=number)) / number
        print('%6.1f μs ' % (t * 1e6), e)
    print()

Related

create all possible combinations with multiple variants from list

OK, so the problem is as follows:
Let's say I have a list like this: [12R,102A,102L,250L]. What I want is a list of all possible combinations that use each number only once. For the example above, the output I would like is:
[12R,102A,250L]
[12R,102L,250L]
My actual problem is a lot more complex, with many more sites. Thanks for your help.
Edit: after reading some comments, I guess this is slightly unclear. I have 3 unique numbers here, [12, 102, and 250], and for some numbers I have different variations, for example [102A, 102L]. What I need is a way to combine the different positions [12, 102, 250] with all possible variations within each. The two lists above are the only valid solutions: [12R] is not valid, and neither is [12R,102A,102L,250L]. So far I have done this with nested loops, but I have a LOT of variation within these numbers, so I can't really do that anymore.
I'll edit this again: it seems there is still some confusion, so let me extend the point I made before. What I am dealing with is DNA. 12R means the 12th position in the sequence was changed to an R, so the solution [12R,102A,250L] means that the amino acid at position 12 is R, at 102 is A, and at 250 is L.
This is why a solution like [102L, 102R, 250L] is not usable: the same position cannot be occupied by 2 different amino acids.
Thank you.
You can use a recursive generator function:
from itertools import groupby as gb
import re
def combos(d, c = []):
    if not d:
        yield c
    else:
        for a, b in d[0]:
            yield from combos(d[1:], c + [a+b])
d = ['12R', '102A', '102L', '250L']
vals = [re.findall(r'^\d+|\w+$', i) for i in d]
new_d = [list(b) for _, b in gb(sorted(vals, key=lambda x:x[0]), key=lambda x:x[0])]
print(list(combos(new_d)))
Output:
[['102A', '12R', '250L'], ['102L', '12R', '250L']]
So it works with ["10A","100B","12C","100R"] (case 1) and ['12R','102A','102L','250L'] (case 2)
import itertools as it
liste = ['12R','102A','102L','250L']
comb = []
for e in it.combinations(range(4), 3):
    e1 = liste[e[0]][:-1]
    e2 = liste[e[1]][:-1]
    e3 = liste[e[2]][:-1]
    if e1 != e2 and e2 != e3 and e3 != e1:
        comb.append([e1+liste[e[0]][-1], e2+liste[e[1]][-1], e3+liste[e[2]][-1]])
print(list(comb))
# case 1 : [['10A', '100B', '12C'], ['10A', '12C', '100R']]
# case 2 : [['12R', '102A', '250L'], ['12R', '102L', '250L']]
import re
def get_grouped_options(input):
    options = {}
    for option in input:
        m = re.match(r'([\d]+)([A-Z])$', option)
        if m:
            position = int(m.group(1))
            acid = m.group(2)
        else:
            continue
        if position not in options:
            options[position] = []
        options[position].append(acid)
    return options
def yield_all_combos(options):
    n = len(options)
    positions = list(options.keys())
    indices = [0] * n
    while True:
        yield ["{}{}".format(position, options[position][indices[i]])
               for i, position in enumerate(positions)]
        j = 0
        indices[j] += 1
        while indices[j] == len(options[positions[j]]):
            # carry
            indices[j] = 0
            j += 1
            if j == n:
                # overflow
                return
            indices[j] += 1
input = ['12R', '102A', '102L', '250L']
options = get_grouped_options(input)
for combo in yield_all_combos(options):
    print("[{}]".format(",".join(combo)))
Gives:
[12R,102A,250L]
[12R,102L,250L]
Try this:
from itertools import groupby
import re
def __genComb(arr, res=None):
    # avoid a mutable default argument: res is extended in place below
    if res is None:
        res = []
    for i in range(len(res), len(arr)):
        el=arr[i]
        if(len(el[1])==1):
            res+=el[1]
        else:
            for el_2 in el[1]:
                yield from __genComb(arr, res+[el_2])
            break
    if(len(res)==len(arr)): yield res
def genComb(arr):
    res=[(k, list(v)) for k,v in groupby(sorted(arr), key=lambda x: re.match(r"(\d*)", x).group(1))]
    yield from __genComb(res)
Sample output (using the input you provided):
test=["12R","102A","102L","250L"]
for el in genComb(test):
    print(el)
# returns:
['102A', '12R', '250L']
['102L', '12R', '250L']
I believe this is what you're looking for!
This works by:
1. generating a collection of all the postfixes each prefix can have
2. finding the total count of combinations (multiply the lengths of the sublists together)
3. rotating through each postfix by basing the read index on both its position in the collection and the absolute result index (known location in final results)
import collections
import functools
import operator
import re
# initial input
starting_values = ["12R","102A","102L","250L"]
d = collections.defaultdict(list) # use a set if duplicates are possible
for value in starting_values:
    numeric, postfix = re.match(r"(\d+)(.*)", value).groups()
    d[numeric].append(postfix)  # .* matches ""; consider (postfix or "_") to give value a size
# d is now a dictionary of lists where each key is the prefix
# and each value is a list of possible postfixes
# each set of postfixes multiplies the total combinations by its length
total_combinations = functools.reduce(
    operator.mul,
    (len(sublist) for sublist in d.values())
)
results = collections.defaultdict(list)
for results_pos in range(total_combinations):
    for index, (prefix, postfix_set) in enumerate(d.items()):
        results[results_pos].append(
            "{}{}".format(  # recombine the values
                prefix,  # numeric prefix
                postfix_set[(results_pos + index) % len(postfix_set)]
            ))
# results is now a dictionary mapping { result index: unique list }
displaying
# set width of column by longest prefix string
# need a collection for intermediate cols, but beyond scope of Q
col_width = max(len(str(k)) for k in results)
for k, v in results.items():
    print("{:<{w}}: {}".format(k, v, w=col_width))
0: ['12R', '102L', '250L']
1: ['12R', '102A', '250L']
with a more advanced input
["12R","102A","102L","250L","1234","1234A","1234C"]
0: ['12R', '102L', '250L', '1234']
1: ['12R', '102A', '250L', '1234A']
2: ['12R', '102L', '250L', '1234C']
3: ['12R', '102A', '250L', '1234']
4: ['12R', '102L', '250L', '1234A']
5: ['12R', '102A', '250L', '1234C']
You can confirm the values are indeed unique with a set
final = set(",".join(x) for x in results.values())
for f in final:
    print(f)
12R,102L,250L,1234
12R,102A,250L,1234A
12R,102L,250L,1234C
12R,102A,250L,1234
12R,102L,250L,1234A
12R,102A,250L,1234C
notes
in CPython, regexes are cached after their first compile
list member multiplier from "How can I multiply all items in a list together with Python?"
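One caveat about the index rotation above: because every group is read at (results_pos + index) modulo its own length, it only reaches every combination when the group sizes are pairwise coprime (such as 2 and 3 in the larger example). Two positions with two variants each would produce repeated rows instead of four distinct ones. A minimal alternative sketch, assuming the same "digits then variant" input format, is to group the sites by position and take the Cartesian product with itertools.product. The function name all_variant_combinations is made up for illustration.

```python
from collections import defaultdict
from itertools import product
import re


def all_variant_combinations(sites):
    """Yield one list per combination, taking exactly one variant per position."""
    groups = defaultdict(list)  # position (as a string) -> sites seen for it
    for site in sites:
        match = re.fullmatch(r"(\d+)(.*)", site)
        if match is None:
            continue  # skip anything that does not start with a number
        groups[match.group(1)].append(site)
    # product picks one entry from each group, in first-seen position order
    for combo in product(*groups.values()):
        yield list(combo)


for combo in all_variant_combinations(["12R", "102A", "102L", "250L"]):
    print(combo)
# ['12R', '102A', '250L']
# ['12R', '102L', '250L']
```

Because this is a generator, combinations are produced one at a time, which helps when there are many positions with many variants.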

Generate a dictionary of all possible Kakuro solutions

I'm just starting out with Python and had an idea to try to generate a dictionary of all the possible solutions for a Kakuro puzzle. There are a few posts out there about these puzzles, but none that show how to generate such a dictionary. What I'm after is a dictionary that has keys from 3 to 45, with each value being the tuples of integers that sum to the key (so, for example, mydict[6] = ([1,5],[2,4],[1,2,3])). It is essentially a Subset Sum Problem: https://mathworld.wolfram.com/SubsetSumProblem.html
I've had a go at this myself and have it working for tuples up to three digits long. My method requires a loop for each additional integer in the tuple, so it would require me to write some very repetitive code! Is there a better way to do this? I feel like I want to loop the creation of loops, if that is a thing.
def kakuro():
    L = [i for i in range(1,10)]
    mydict = {}
    for i in L:
        L1 = L[i:]
        for j in L1:
            if i+j in mydict:
                mydict[i+j].append((i,j))
            else:
                mydict[i+j] = [(i,j)]
            L2 = L[j:]
            for k in L2:
                if i+j+k in mydict:
                    mydict[i+j+k].append((i,j,k))
                else:
                    mydict[i+j+k] = [(i,j,k)]
    for i in sorted (mydict.keys()):
        print(i,mydict[i])
    return
My attempt, round 2 (getting better!):
def kakurodict():
    from itertools import combinations as combs
    L = [i for i in range(1,10)]
    mydict={}
    mydict2={}
    for i in L[1:]:
        mydict[i] = list(combs(L,i))
        for j in combs(L,i):
            val = sum(j)
            if val in mydict2:
                mydict2[val].append(j)
            else:
                mydict2[val] = [j]
    return mydict2
So this is written with the following assumptions.
dict[n] cannot have a list with the value [n].
Each element in the subset has to be unique.
I hope there is a better solution offered by someone else, because when we generate all subsets for values 3-45, it takes quite some time. I believe the time complexity of the subset sum generation problem is 2^n so if n is 45, it's not ideal.
import itertools
def subsetsums(max):
    if (max < 45):
        numbers = [x for x in range(1, max)]
    else:
        numbers = [x for x in range(1, 45)]
    result = [list(seq) for i in range(len(numbers), 0, -1) for seq in itertools.combinations(numbers, i) if sum(seq) == max]
    return(result)
mydict = {}
for i in range(3, 46):
    mydict[i] = subsetsums(i)
print(mydict)
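The code above searches subsets of the numbers 1 to 44, which is where the long running time comes from. If the goal is Kakuro specifically, the search space is much smaller: the keys 3 to 45 in the question correspond to sums of distinct digits 1 to 9, so there are only 2^9 = 512 subsets in total. A minimal sketch under that assumption (distinct digits 1 to 9, at least two cells per run), along the same lines as the asker's second attempt, follows. The name kakuro_sums and its parameters are made up for illustration.

```python
from collections import defaultdict
from itertools import combinations


def kakuro_sums(min_cells=2, max_cells=9):
    """Map each total to the combinations of distinct digits 1-9 that reach it."""
    digits = range(1, 10)
    table = defaultdict(list)
    for size in range(min_cells, max_cells + 1):
        # combinations() yields each digit set once, in ascending order
        for combo in combinations(digits, size):
            table[sum(combo)].append(combo)
    return dict(table)


sums = kakuro_sums()
print(sums[6])               # [(1, 5), (2, 4), (1, 2, 3)]
print(min(sums), max(sums))  # 3 45
```

Since every combination is visited exactly once and filed under its sum, the whole table is built in a single pass instead of one search per target value.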

Run a random algorithm multiple times and average over the results

I have the following random selection script:
import random
length_of_list = 200
my_list = list(range(length_of_list))
num_selections = 10
numbers = random.sample(my_list, num_selections)
It looks at a list of predetermined size and randomly selects 10 numbers. Is there a way to run this section 500 times and then get the top 10 numbers which were selected the most? I was thinking that I could feed the numbers into a dictionary and then get the top 10 numbers from there. So far, I've done the following:
for run in range(0, 500):
    numbers = random.sample(my_list, num_selections)
    for number in numbers:
        current_number = my_dict.get(number)
        key_number = number
        my_dict.update(number = number+1)
print(my_dict)
Here I want the code to take the current count assigned to that key and then add 1, but I cannot manage to make it work. It seems like the key for the dictionary update has to be a literal key; I cannot insert a variable. Also, I think this nested loop might not be very efficient, as I have to run this 500 times 1500 times 23, so I am concerned about performance. If anyone has an idea of what I should try, it would be great! Thanks
SOLUTION:
import random
from collections import defaultdict
from collections import OrderedDict
length_of_list = 50
my_list = list(range(length_of_list))
num_selections = 10
my_dict = dict.fromkeys(my_list)
di = defaultdict(int)
for run in range(0, 500):
    numbers = random.sample(my_list, num_selections)
    for number in numbers:
        di[number] += 1
def get_top_numbers(data, n, order=False):
    """Gets the top n numbers from the dictionary"""
    top = sorted(data.items(), key=lambda x: x[1], reverse=True)[:n]
    if order:
        return OrderedDict(top)
    return dict(top)
print(get_top_numbers(di, n=10))
In the line my_dict.update(number = number+1), number = ... is parsed as a keyword argument, so update() sets the string key 'number' instead of the key stored in the variable number. To update the key held in a variable, use my_dict[number] = ... or pass a dictionary, as in my_dict.update({number: ...}).
Also, dict.update doesn't accept an integer as a positional argument; it expects another dictionary. You should read the documentation for this method: https://www.tutorialspoint.com/python3/dictionary_update.htm
It says that dict.update(dict2) takes a dictionary and merges it into dict. See the example below:
dict = {'Name': 'Zara', 'Age': 17}
dict2 = {'Gender': 'female' }
dict.update(dict2)
print ("updated dict : ", dict)
Gives as result:
updated dict : {'Gender': 'female', 'Age': 17, 'Name': 'Zara'}
So much for the errors in your code. I see a good answer has already been given, so I won't repeat it.
Check out defaultdict in the collections module.
Basically, you create a defaultdict with a default value of 0, then iterate over your numbers list and increment the count for each number with += 1.
from collections import defaultdict
di = defaultdict(int)
for run in range(0, 500):
    numbers = random.sample(my_list, num_selections)
    for number in numbers:
        di[number] += 1
print(di)
You can use collections.Counter for this task, since it supports addition. You keep two counters: one holding the running total and one holding the counts for the current sample.
import collections
import random
counter = collections.Counter()
for run in range(500):
    samples = random.sample(my_list, num_selections)
    sample_counter = collections.Counter(samples)
    counter = counter + sample_counter
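To get from the tallies to the ten most frequently selected numbers, Counter also provides most_common. The following is a minimal self-contained sketch that reuses the variable names and the 500 runs from the question; num_runs, counts and top_ten are names made up for illustration.

```python
import random
from collections import Counter

length_of_list = 200
num_selections = 10
num_runs = 500  # run count taken from the question

my_list = list(range(length_of_list))
counts = Counter()
for _ in range(num_runs):
    # Counter.update adds one to the tally of each sampled number
    counts.update(random.sample(my_list, num_selections))

# list of (number, times_selected) pairs, most frequent first
top_ten = counts.most_common(10)
print(top_ten)
```

The selection is random, so the numbers printed will differ from run to run. Updating one Counter in place also avoids building a new Counter object on every iteration.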

Remove adjacent duplicates given a condition

I'm trying to write a function that will take a string, and given an integer, will remove all the adjacent duplicates larger than the integer and output the remaining string. I have this function right now that removes all the duplicates in a string, and I'm not sure how to put the integer constraint into it:
def remove_duplicates(string):
    s = set()
    list = []
    for i in string:
        if i not in s:
            s.add(i)
            list.append(i)
    return ''.join(list)
string = "abbbccaaadddd"
print(remove_duplicates(string))
This outputs
abc
What I would want is a function like
def remove_duplicates(string, int):
.....
Where, if I input int=2 for the same string, I want to keep at most 2 adjacent copies of each character rather than removing all the duplicates. The output should be
abbccaadd
I'm also concerned about run time and complexity for very large strings, so if my initial approach is bad, please suggest a different approach. Any help is appreciated!
Not sure I understand your question correctly. I think that, given m repetitions of a character, you want to remove up to k*n duplicates such that k*n < m.
You could try this, using groupby:
>>> from itertools import groupby
>>> string = "abbbccaaadddd"
>>> n = 2
>>> ''.join(c for k, g in groupby(string) for c in k * (len(list(g)) % n or n))
'abccadd'
Here, k * (len(list(g)) % n or n) means len(g) % n repetitions, or n if that number is 0.
Oh, you changed it... now my original answer with my "interpretation" of your output actually works. You can use groupby together with islice to get at most n characters from each group of duplicates.
>>> from itertools import groupby, islice
>>> string = "abbbccaaadddd"
>>> n = 2
>>> ''.join(c for _, g in groupby(string) for c in islice(g, n))
'abbccaadd'
Create groups of letters, but compute the length of each group, capped by your parameter.
Then rebuild the groups and join:
import itertools
def remove_duplicates(string,maxnb):
    groups = ((k,min(len(list(v)),maxnb)) for k,v in itertools.groupby(string))
    return "".join(itertools.chain.from_iterable(v*k for k,v in groups))
string = "abbbccaaadddd"
print(remove_duplicates(string,2))
this prints:
abbccaadd
can be a one-liner as well (cover your eyes!)
return "".join(itertools.chain.from_iterable(v*k for k,v in ((k,min(len(list(v)),maxnb)) for k,v in itertools.groupby(string))))
not sure about the min(len(list(v)),maxnb) repeat value which can be adapted to suit your needs with a modulo (like len(list(v)) % maxnb), etc...
You should avoid using int as a variable name, as it shadows the built-in int type.
Here is a vanilla function that does the job:
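def deduplicate(string: str, threshold: int) -> str:
    res = ""
    last = ""
    count = 0
    for c in string:
        if c != last:
            # start of a new run: keep its first character
            count = 1
            res += c
            last = c
        else:
            # keep further copies only while the run is below the threshold
            if count < threshold:
                res += c
                count += 1
    return res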

Finding motif in a sequence of characters

I also have a dictionary in which the keys are ids and the values are long sequences made up not only of K and M but also of other characters that are not important to me.
li = {'id1': "KKMKMKMKJASGKKKMOOGBMMMMMMMMMMMMMMMMMM",
'id2': "MMKFJDFKFGKJKMKMKMKMKMJKJHFKMKMKM"}
I want to find the "KMKMKM" motifs with a length of at least 6. The length can be even or odd, as long as it is 6 or more. The result should be a dictionary with the same keys, but with a list of motifs as the value instead of the whole sequence, as in the following example.
results = {'id1': ["KMKMKMK"], 'id2': ["KMKMKMKMKM", "KMKMKM"]}
I wrote this code, but it did not return the motifs I am interested in.
{k: re.findall(r'(?:KM){6,1000}', v) for k, v in li.items()}
This one does the job:
((?:KM){3,}K?)
Explanation:
( : group 1
(?:KM){3,} : non capture group, 3 or more times KM
K? : optional K
) : end group 1
In action:
import re
li = {'id1': "KKMKMKMKJASGKKKMOOGBMMMMMMMMMMMMMMMMMM",
'id2':"MMKFJDFKFGKJKMKMKMKMKMJKJHFKMKMKM"}
res = {k: re.findall(r'((?:KM){3,}K?)', v) for k, v in li.items()}
print(res)
Output:
{'id2': ['KMKMKMKMKM', 'KMKMKM'], 'id1': ['KMKMKMK']}
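The key difference from the attempt in the question is the repeat count: {6,1000} asks for at least six copies of "KM" (twelve characters), while {3,} asks for at least three copies (six characters). If the position of each motif is also useful, a minimal sketch with re.finditer could look like this. The names MOTIF and find_motifs are made up for illustration, and returning start offsets is an addition that the question did not ask for.

```python
import re

# Three or more "KM" repeats (six characters minimum), plus an optional trailing K.
MOTIF = re.compile(r'(?:KM){3,}K?')


def find_motifs(sequences):
    """Return {id: [(start, motif), ...]} for every non-overlapping match."""
    return {
        seq_id: [(m.start(), m.group()) for m in MOTIF.finditer(seq)]
        for seq_id, seq in sequences.items()
    }


li = {'id1': "KKMKMKMKJASGKKKMOOGBMMMMMMMMMMMMMMMMMM",
      'id2': "MMKFJDFKFGKJKMKMKMKMKMJKJHFKMKMKM"}

found = find_motifs(li)
for seq_id, hits in found.items():
    print(seq_id, hits)
```

Without a capturing group, m.group() returns the whole match, so the outer parentheses from the one-line version are not needed here.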
Is this what you are looking for:
import re
stringA = "KKMKMKMKJASGKKKMOOGBMMMMMMMMMMMMMMMMMM"
motifs = "KMKMKM"
m = re.search(motifs, stringA)
if m:
    print(motifs)
In reply to your comment below:
stringA = "KKMKMKMKJASGKKKMOOGBMMMMMMMMMMMMMMMMMM"
motifs = "KMKMKM"
i = 0
while True:
    seq = stringA[i:]
    i = i + 1
    if seq.startswith(motifs):
        print(seq)
    if len(stringA) == i:
        break
