Finding an unknown pattern in a string in Python

I am aware of the similar Stack Overflow question "String Unknown pattern Matching", but the answer there doesn't really work for me.
My problem is as follows. I get a string of characters, e.g. '1211', and what I need to do is see that 1 is the pattern repeated most often, namely 2 times in a row.
But it can also be "121212112", where 12 is repeated 3 times in a row.
But with 12221221 it is 221 that is repeated 2 times rather than 2 that repeats 3 times.
Here are some results I would like to get (the only digits ever used are 1 and 2):
>>> counter('1211')
1
>>> counter('1212')
2
>>> counter('21212')
2
The outcome I want is how many times the pattern occurs.
I have no idea how to even start looking for a pattern, since it is not known beforehand. I did some research online and didn't find anything similar.
Does anyone have an idea how to start tackling this problem? All help is welcome, and if you want more information, don't hesitate to let me know.

This is really inefficient, but you can:
1. find all substrings (https://stackoverflow.com/a/22470047/264596)
2. put them into a set to avoid duplicates
3. for each substring, find all its occurrences, and use some function to find the max (I am not sure how you choose between short strings occurring many times and long strings occurring few times)
Obviously you could use some data structure to pass through the string once and do the counting along the way, but since I am not sure what your constraints and desired output are, this is all I can offer.
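The three steps above can be sketched roughly as follows. The helper names all_substrings and count_overlapping are made up for this illustration, and counting overlapping occurrences is only one possible way to score a pattern:

```python
def all_substrings(s):
    """Return the set of distinct non-empty substrings of s."""
    return {s[i:j] for i in range(len(s)) for j in range(i + 1, len(s) + 1)}


def count_overlapping(s, sub):
    """Count occurrences of sub in s, allowing overlaps."""
    count, start = 0, s.find(sub)
    while start != -1:
        count += 1
        start = s.find(sub, start + 1)  # step by one so overlaps are found
    return count


counts = {sub: count_overlapping("21212", sub) for sub in all_substrings("21212")}
print(counts["2"], counts["212"])  # 3 2
```

How to pick a winner from counts (short and frequent versus long and rare) is still up to you.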

I agree with Jirka, not sure how you score long vs short to select the optimal results but this function will give you the menu:
#Func1
def sub_string_cts(string):
    combos = {}
    for i in range(len(string)):
        u_start = len(string) - i
        for start in range(u_start):
            c_str = string[start:i+start+1]
            if c_str in combos:
                combos[c_str] += 1
            else:
                combos[c_str] = 1
    return combos
sub_string_cts('21212')
{'2': 3,
'1': 2,
'21': 2,
'12': 2,
'212': 2,
'121': 1,
'2121': 1,
'1212': 1,
'21212': 1}
After your comment I think this is more what you're looking for:
#Func2
import re
def sub_string_cts(string):
    combos = {}
    for i in range(len(string)):
        u_start = len(string) - i
        # distinct substrings of length i+1
        substrs = set([string[start:i+start+1] for start in range(u_start)])
        for substring in substrs:
            # runs of back-to-back copies of substring; keep the longest, counted in copies
            runs = re.findall("((?:{})+)".format(re.escape(substring)), string)
            combos[substring] = max([len(run) for run in runs])//len(substring)
    return combos
sub_string_cts('21212')
{'2': 1,
'1': 1,
'21': 2,
'12': 2,
'212': 1,
'121': 1,
'2121': 1,
'1212': 1,
'21212': 1}
You could narrow that down to the 'best' candidates by collapsing on the highest occurring instance of each string length:
def max_by_len(result_dict):
    results = {}
    for k, v in result_dict.items():
        if len(k) not in results:
            results[len(k)] = {}
    for c_len in [ln for ln in results]:
        len_max_count = max([v for (k, v) in result_dict.items() if len(k) == c_len])
        for k,v in result_dict.items():
            if len(k) == c_len:
                if v == len_max_count:
                    results[c_len][k] = v
    return results
#Func1:
max_by_len(sub_string_cts('21212'))
{1: {'2': 3},
2: {'21': 2, '12': 2},
3: {'212': 2},
4: {'2121': 1, '1212': 1},
5: {'21212': 1}}
#Func2:
max_by_len(sub_string_cts('21212'))
{1: {'2': 1, '1': 1},
2: {'21': 2, '12': 2},
3: {'212': 1, '121': 1},
4: {'2121': 1, '1212': 1},
5: {'21212': 1}}
Assuming we wouldn't select '2121' or '1212' because their occurrence matches '21212' and they're shorter in length, and that similarly we wouldn't select '21' or '12' as they occur at the same frequency as '212' we could limit our viable candidates down to '2', '212', and '21212' with the following code:
def remove_lesser_patterns(result_dict):
    len_lst = sorted([k for k in result_dict], reverse=True)
    len_crosswalk = {i_len: max([v for (k,v) in result_dict[i_len].items()]) for i_len in len_lst}
    for i_len in len_lst[:-1]:
        eval_lst = [i for i in len_lst if i < i_len]
        for i in eval_lst:
            if len_crosswalk[i] <= len_crosswalk[i_len]:
                if i in result_dict:
                    del result_dict[i]
    return result_dict
#Func1
remove_lesser_patterns(max_by_len(sub_string_cts('21212')))
{1: {'2': 3}, 3: {'212': 2}, 5: {'21212': 1}}
#Func2
remove_lesser_patterns(max_by_len(sub_string_cts('21212')))
{2: {'21': 2, '12': 2}, 5: {'21212': 1}}
Results:
test_string = ["1211", "1212", "21212", "12221221"]
for string in test_string:
    print("<Input: '{}'".format(string))
    c_answer = remove_lesser_patterns(max_by_len(sub_string_cts(string)))
    print("<Output: {}\n".format(c_answer))
<Input: '1211'
<Output: {1: {'1': 2}, 4: {'1211': 1}}
# '1' is repeated twice
<Input: '1212'
<Output: {2: {'12': 2}, 4: {'1212': 1}}
# '12' is repeated twice
<Input: '21212'
<Output: {2: {'21': 2, '12': 2}, 5: {'21212': 1}}
# '21' and '12' are both repeated twice
<Input: '12221221'
<Output: {1: {'2': 3}, 3: {'221': 2}, 8: {'12221221': 1}}
# '2' is repeated 3 times, '221' is repeated twice
These functions together give you the highest occurrence count of each pattern by length. The dictionary key is the pattern length, and the value is a sub-dictionary of the most frequently occurring patterns of that length (several if tied).
Func2 requires the repeats to be sequential (back to back), whereas Func1 does not; it is strictly occurrence based.
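To make the difference between the two counting styles concrete, here is a small sketch. The helper names occurrences and longest_run are hypothetical and not part of the functions above; they only mirror what Func1 and Func2 compute for a single pattern:

```python
import re


def occurrences(s, sub):
    """Func1-style: number of (possibly overlapping) occurrences of sub in s."""
    return sum(1 for i in range(len(s) - len(sub) + 1) if s.startswith(sub, i))


def longest_run(s, sub):
    """Func2-style: largest number of back-to-back copies of sub in s."""
    runs = re.findall("(?:{})+".format(re.escape(sub)), s)
    return max((len(run) // len(sub) for run in runs), default=0)


print(occurrences("21212", "212"), longest_run("21212", "212"))    # 2 1
print(occurrences("12221221", "2"), longest_run("12221221", "2"))  # 5 3
```

In '21212' the two occurrences of '212' overlap, so they never form a run of two back-to-back copies.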
Note:
With your example:
3. But with 12221221 it is 221 that is repeated 2 times rather than 2 that repeats 3 times.
the code solves this ambiguity in your desired output (2 or 3) by giving you both:
<Input: '12221221'
<Output: {1: {'2': 3}, 3: {'221': 2}, 8: {'12221221': 1}}
# '2' is repeated 3 times, '221' is repeated twice
If you're only interested in the 2 char lengths you can easily pull those out of the max_by_len results as follows:
test_string = ["1211", "1212", "21212", "12221221"]
for string in test_string:
    print("<Input: '{}'".format(string))
    c_answer = remove_lesser_patterns({k:v for (k,v) in max_by_len(sub_string_cts(string)).items() if k == 2})
    print("<Output: {}\n".format(max([v for (k,v) in c_answer[2].items()])))
#Func2
<Input: '1211'
<Output: 1
<Input: '1212'
<Output: 2
<Input: '21212'
<Output: 2
<Input: '12221221'
<Output: 1

Related

What is the best way to compare values of list over a list of list?

The problem is as follows:
Suppose we have 3 shops, each listing different item numbers.
Each shopkeeper has the following items:
Shop 1 : [2, 3]
Shop 2 : [1, 2]
Shop 3 : [4]
A=no of shops
dict = {shop_no:[item_list]}
need = set(items that are needed)
I need items [1, 4], which I can get by visiting shop 2 and shop 3.
So my question is how to find the minimum number of shops that need to be visited.
My approach:
Bitmasking to generate all possible shop combinations, and then comparing elements.
I need a better way to compare these.
x=2**(A)
for i in range(1,x):
    count=0
    temp=[]
    for j in range(32):
        if i&(1<<j)>0:
            count+=1
            temp+=dict[j+1]
    temp=set(temp)
    #Am generating items by combining shops and then doing a set difference
    if len(need-temp)==0:
        return count
return -1
Someone suggested the Rabin-Karp algorithm to me. How can I implement that?
Here's my cheesy brute-force solution:
from itertools import combinations
from typing import Dict, Set
shops = {
1: {2, 3},
2: {1, 2},
3: {4},
}
need = {1, 4}
def shortest_visit(shops: Dict[int, Set[int]], need: Set[int]) -> Set[int]:
    # try 0 shops, then 1, ... up to all of them, so the first hit is a smallest visit
    for n in range(len(shops) + 1):
        for visit in combinations(shops.keys(), n):
            if need <= {item for shop in visit for item in shops[shop]}:
                return set(visit)
    assert False, "Some of the needed items aren't available in any shop!"
print(shortest_visit(shops, need))
It has the advantage of checking the shortest combinations first rather than brute-forcing through all of them in all cases, so if there's a short solution you'll find it relatively quickly.
You can use a recursive generator together with functools.lru_cache in order to compute the minimum number of shops required to buy a certain set of items:
from functools import lru_cache
@lru_cache()
def find_best_path(need: frozenset):
    return min(visit_shops(need), key=len)
def visit_shops(need):
    for k, items in shops.items():
        buy = items & need
        if buy == need:
            yield (k,)  # there's a single best option: just visit that shop
            break
        elif buy:
            yield (k,) + find_best_path(need - buy)
Testing on your example:
shops = {
'A': {2, 3},
'B': {1, 2},
'C': {4},
}
need = frozenset({1, 4})
print(find_best_path(need)) # ('B', 'C')
Testing on another example with multiple options:
shops = {
'A': {1, 2, 3},
'B': {4},
'C': {5},
'D': {1, 3, 5},
'E': {2, 4},
}
need = frozenset({1, 2, 3, 4, 5})
print(find_best_path(need)) # ('D', 'E')

Fast/Pythonic way to count intervals between repeated list values

I want to make a histogram of all the intervals between repeated values in a list. I wrote some code that works, but it's using a for loop with if statements. I often find that if one can manage to write a version using clever slicing and/or predefined python (numpy) methods, that one can get much faster Python code than using for loops, but in this case I can't think of any way of doing that. Can anyone suggest a faster or more pythonic way of doing this?
# make a 'histogram'/count of all the intervals between repeated values
def hist_intervals(a):
    values = sorted(set(a)) # get list of which values are in a
    # setup the dict to hold the histogram
    hist, last_index = {}, {}
    for i in values:
        hist[i] = {}
        last_index[i] = -1 # some default value
    # now go through the array and find intervals
    for i in range(len(a)):
        val = a[i]
        if last_index[val] != -1: # do nothing if it's the first time
            interval = i - last_index[val]
            if interval in hist[val]:
                hist[val][interval] += 1
            else:
                hist[val][interval] = 1
        last_index[val] = i
    return hist
# example list/array
a = [1,2,3,1,5,3,2,4,2,1,5,3,3,4]
histdict = hist_intervals(a)
print("histdict = ",histdict)
# correct answer for this example
answer = { 1: {3:1, 6:1},
2: {2:1, 5:1},
3: {1:1, 3:1, 6:1},
4: {6:1},
5: {6:1}
}
print("answer = ",answer)
Sample output:
histdict = {1: {3: 1, 6: 1}, 2: {5: 1, 2: 1}, 3: {3: 1, 6: 1, 1: 1}, 4: {6: 1}, 5: {6: 1}}
answer = {1: {3: 1, 6: 1}, 2: {2: 1, 5: 1}, 3: {1: 1, 3: 1, 6: 1}, 4: {6: 1}, 5: {6: 1}}
^ note: I don't care about the ordering in the dict, so this solution is acceptable, but I want to be able to run on really large arrays/lists and I'm suspecting my current method will be slow.
You can eliminate the setup loop by a carefully constructed defaultdict. Then you're just left with a single scan over the input list, which is as good as it gets. Here I change the resultant defaultdict back to a regular Dict[int, Dict[int, int]], but that's just so it prints nicely.
from collections import defaultdict
def count_intervals(iterable):
    # setup
    last_seen = {}
    hist = defaultdict(lambda: defaultdict(int))
    # The actual work
    for i, x in enumerate(iterable):
        if x in last_seen:
            hist[x][i-last_seen[x]] += 1
        last_seen[x] = i
    return hist
a = [1,2,3,1,5,3,2,4,2,1,5,3,3,4]
hist = count_intervals(a)
for k, v in hist.items():
    print(k, dict(v))
# 1 {3: 1, 6: 1}
# 3 {3: 1, 6: 1, 1: 1}
# 2 {5: 1, 2: 1}
# 5 {6: 1}
# 4 {6: 1}
There is an obvious change to make in terms of data structures: instead of using a dictionary of dictionaries for hist, use a defaultdict of Counter. This lets the code become:
from collections import defaultdict, Counter
# make a 'histogram'/count of all the intervals between repeated values
def hist_intervals(a):
    # setup the dict to hold the histogram
    hist, last_index = defaultdict(Counter), {}
    # now go through the array and find intervals
    for i, val in enumerate(a):
        if val in last_index:
            interval = i - last_index[val]
            hist[val].update((interval,))
        last_index[val] = i
    return hist
This should be faster, because the checks that used to be explicit if statements now happen inside defaultdict and Counter (which are implemented in C in CPython), and the code is also cleaner.
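As a quick sanity check, the defaultdict-of-Counter idea can be compared against the expected answer from the question. This is only a sketch: it repeats the function so it runs on its own, uses `+= 1` instead of update, and the names plain and expected are introduced here for illustration:

```python
from collections import Counter, defaultdict


def hist_intervals(a):
    hist, last_index = defaultdict(Counter), {}
    for i, val in enumerate(a):
        if val in last_index:
            # Counter returns 0 for missing keys, so += 1 needs no guard
            hist[val][i - last_index[val]] += 1
        last_index[val] = i
    return hist


a = [1, 2, 3, 1, 5, 3, 2, 4, 2, 1, 5, 3, 3, 4]
# convert to plain dicts so the result prints and compares like the expected answer
plain = {val: dict(counts) for val, counts in hist_intervals(a).items()}
expected = {1: {3: 1, 6: 1}, 2: {2: 1, 5: 1}, 3: {1: 1, 3: 1, 6: 1}, 4: {6: 1}, 5: {6: 1}}
print(plain == expected)  # True
```

Dict comparison ignores ordering, which matches the question's note that the order inside the result does not matter.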

How do I filter a dictionary based on the partial string matches?

I have a dictionary of values:
dic = {1: "a1+b+c", 2: "a1+c+v", 3: "a1+z+e", 4: "a2+p+a", 5: "a2+z+v", 6: "a3+q+v", ...}
I have a page in Flask, that has checkboxes for each partial string value in a dictionary, e.g. checkboxes "a", "b", "c",... etc. On the page, the checkboxes are located in groups a1, a2, a3, etc.
I need to filter the dictionary by the partial values based on the values of the selected checkboxes, for example, when selecting "c" in group a1, it would return:
1: a1+b+c
2: a1+c+v
When selecting "z" from group a2, it would return:
5: "a2+z+v"
The code that generates an error is:
sol = [k for k in dic if 'a1' in k]
Can someone point me to the right direction?
You can easily solve this with a quite short function:
def lookup(dct, *args):
    for needle in args:
        dct = {key: value for key, value in dct.items() if needle in value}
    return dct
For example:
>>> dic = {1: "a1+b+c", 2: "a1+c+v", 3: "a1+z+e", 4: "a2+p+a", 5: "a2+z+v", 6: "a3+q+v"}
>>> lookup(dic, "a1", "c")
{1: 'a1+b+c', 2: 'a1+c+v'}
However that always needs to iterate over all keys for each "needle". You can do better if you have a helper dictionary (I'll use a collections.defaultdict here) that stores all keys that match one needle (assuming + is supposed to be a delimiter in your dictionary):
from collections import defaultdict
helperdict = defaultdict(set)
for key, value in dic.items():
    for needle in value.split('+'):
        helperdict[needle].add(key)
That helperdict now contains all keys that match one particular part of a value:
>>> print(dict(helperdict))
{'z': {3, 5}, 'p': {4}, 'a1': {1, 2, 3}, 'a3': {6}, 'v': {2, 5, 6}, 'a2': {4, 5}, 'e': {3}, 'b': {1}, 'a': {4}, 'c': {1, 2}, 'q': {6}}
And using set.intersection allows you to quickly get all matches for different combinations:
>>> search = ['a2', 'z']
>>> matches = set.intersection(*[helperdict[needle] for needle in search])
>>> {match: dic[match] for match in matches}
{5: 'a2+z+v'}
It's definitely longer than the first approach and requires more external memory but if you plan to do several queries it will be much faster.
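If you go the index route, it may help to wrap it in two small functions. The following is a minimal sketch; build_index and query are names made up here, and it assumes "+" is the delimiter and that needles must equal a whole part (unlike lookup above, which also matches partial substrings):

```python
from collections import defaultdict


def build_index(dic, sep="+"):
    """Map each part of every value to the set of keys whose value contains it."""
    index = defaultdict(set)
    for key, value in dic.items():
        for part in value.split(sep):
            index[part].add(key)
    return index


def query(dic, index, *needles):
    """Return the entries whose value contains every needle as a whole part."""
    if not needles:
        return dict(dic)
    # .get avoids creating empty entries in the defaultdict for unknown needles
    matches = set.intersection(*(index.get(n, set()) for n in needles))
    return {key: dic[key] for key in sorted(matches)}


dic = {1: "a1+b+c", 2: "a1+c+v", 3: "a1+z+e", 4: "a2+p+a", 5: "a2+z+v", 6: "a3+q+v"}
index = build_index(dic)
print(query(dic, index, "a1", "c"))  # {1: 'a1+b+c', 2: 'a1+c+v'}
print(query(dic, index, "a2", "z"))  # {5: 'a2+z+v'}
```

The index is built once, so each checkbox change only costs a few set intersections.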

python trim down dictionaries in a list of dictionaries

I have the following list, which can contain multiple dictionaries of different sizes.
The keys in each dictionary are unique, but one key may exist in several dictionaries. Values are unique across dictionaries.
I want to trim down my dictionaries so that each one contains only the keys whose value is the highest among all dictionaries.
For example, the key '1258' exists in three of the four dictionaries, and it has the highest value only in the last one, so in the reconstructed list this key and its value will be in the last dictionary only.
If a key doesn't exist in any other dictionary, it remains in the dictionary it belongs to.
Here is sample data:
[{'1258': 1.0167004,
'160': 1.5989301000000002,
'1620': 1.3058813000000002,
'2571': 0.7914598,
'26': 4.554409,
'2943': 0.5072369,
'2951': 0.4955711,
'2952': 1.2380746000000002,
'2953': 1.6159719,
'2958': 0.4340355,
'2959': 0.6026906,
'2978': 0.619001,
'2985': 1.5677016,
'3075': 1.04948,
'3222': 0.9721148000000001,
'3388': 1.680108,
'341': 0.8871856,
'3443': 0.6000103,
'361': 2.6682623000000003,
'4': 5.227341,
'601': 2.2614983999999994,
'605': 0.6303175999999999,
'9': 5.0326675},
{'1457': 5.625237999999999,
'1469': 25.45585200000001,
'1470': 25.45585200000001,
'160': 0.395728,
'1620': 0.420267,
'2571': 0.449151,
'26': 0.278281,
'601': 0.384822,
'605': 5.746278700000001,
'9': 1.487241},
{'1258': 0.27440200000000003,
'1457': 0.8723639999999999,
'1620': 0.182567,
'2571': 0.197134,
'2943': 0.3461654,
'2951': 0.47372800000000004,
'2952': 0.6662919999999999,
'2953': 0.6725458,
'2958': 0.4437159,
'2959': 0.690856,
'2985': 0.8106226999999999,
'3075': 0.352618,
'3222': 0.7866500000000001,
'3388': 0.760664,
'3443': 0.129771,
'601': 0.345448,
'605': 1.909823,
'9': 0.888999},
{'1258': 1.0853083,
'160': 0.622579,
'1620': 0.7419095,
'2571': 0.9828758,
'2943': 2.254124,
'2951': 0.6294688,
'2952': 1.0965362,
'2953': 1.8409954000000002,
'2958': 0.7394122999999999,
'2959': 0.9398920000000001,
'2978': 0.672122,
'2985': 1.2385512999999997,
'3075': 0.912366,
'3222': 0.8364904,
'3388': 0.37316499999999997,
'341': 1.0399186,
'3443': 0.547093,
'361': 0.3313275,
'601': 0.5318834,
'605': 0.2909876}]
Here's one approach. I shortened your example to one that's easier to reason about.
>>> dcts = [
... {1:2, 3:4, 5:6},
... {1:3, 6:7, 8:9},
... {6:10, 8:11, 9:12}]
>>>
>>> [{k:v for k,v in d.items() if v == max(x[k] for x in dcts if k in x)} for d in dcts]
[{3: 4, 5: 6}, {1: 3}, {6: 10, 8: 11, 9: 12}]
Edit:
A more efficient version, because the max is computed only once per key (in Python 3, dictionaries that lack the key must be skipped explicitly, since None cannot be compared with numbers):
>>> from operator import or_
>>> from functools import reduce
>>> allkeys = reduce(or_, (d.keys() for d in dcts))
>>> max_vals = {k:max(d[k] for d in dcts if k in d) for k in allkeys}
>>> result = [{k:v for k,v in d.items() if v == max_vals[k]} for d in dcts]
>>> result
[{3: 4, 5: 6}, {1: 3}, {6: 10, 8: 11, 9: 12}]
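Another way to think about it is to record, for every key, which dictionary currently holds its largest value, and then rebuild the list from that record. This is only a sketch of that idea; keep_max_only is a hypothetical name, and on ties it keeps the first dictionary (the question says values are unique, so ties should not occur):

```python
def keep_max_only(dcts):
    # best[key] = (largest value seen so far, index of the dict holding it)
    best = {}
    for idx, d in enumerate(dcts):
        for key, value in d.items():
            if key not in best or value > best[key][0]:
                best[key] = (value, idx)
    # rebuild one (possibly empty) dict per input dict
    result = [{} for _ in dcts]
    for key, (value, idx) in best.items():
        result[idx][key] = value
    return result


dcts = [{1: 2, 3: 4, 5: 6}, {1: 3, 6: 7, 8: 9}, {6: 10, 8: 11, 9: 12}]
print(keep_max_only(dcts))  # [{3: 4, 5: 6}, {1: 3}, {6: 10, 8: 11, 9: 12}]
```

Each key/value pair is visited once, and the input dictionaries are left unmodified.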

Python - FOR loop not writing correct values in nested dictionaries

Hopefully someone can help me with this as none of my research has helped me. I have a simple dictionary:
mydict = {
1: {1: 'Foo'},
2: {1: 'Bar'}
}
I'm duplicating each of the key / value pairs assigning new key values:
nextKey = len(mydict) + 1
for currKey in range(len(mydict)):
    mydict[nextKey] = mydict[currKey + 1]
    nextKey += 1
Which give me a mydict of:
{
1: {1: 'Foo'},
2: {1: 'Bar'},
3: {1: 'Foo'},
4: {1: 'Bar'},
}
I now want to add a new key value pair to all of the existing nested dictionaries. The keys for each should be '2' and the values for each should increase for each nested dictionary:
newValue = 1
for key in mydict:
    mydict[key][2] = newValue
    newValue += 1
I am expecting:
{
1: {1: 'Foo', 2: 1},
2: {1: 'Bar', 2: 2},
3: {1: 'Foo', 2: 3},
4: {1: 'Bar', 2: 4},
}
But this is giving me a mydict of:
{
1: {1: 'Foo', 2: 3},
2: {1: 'Bar', 2: 4},
3: {1: 'Foo', 2: 3},
4: {1: 'Bar', 2: 4},
}
I have used the visualisation tool of the IDE I'm using and after I have run the loop to duplicate the key / value pairs it appears the new keys just reference the duplicated value rather than actually containing it, perhaps this has something to do with it?
Can anyone please help / explain?
This happens because in your first loop you are not copying the nested dictionaries but rather just adding a new reference to the same dictionaries.
To give a clearer example: if you broke out of the second loop after two iterations (with your original code), your output would be:
{1: {1: 'Foo', 2: 1},
2: {1: 'Bar', 2: 2},
3: {1: 'Foo', 2: 1},
4: {1: 'Bar', 2: 2}}
So you can fix your first loop like this:
for currKey in range(len(mydict)):
    mydict[nextKey] = mydict[currKey + 1].copy()
    nextKey += 1
The copy() method creates a new (shallow) copy of each nested dictionary, so your second loop then works on 4 different dictionaries.
Your assumption is right. Minimal example:
>>> a = {1: {"a": "A"}, 2:{"b": "B"}}
>>> a[3] = a[1]
>>> a
{1: {'a': 'A'}, 2: {'b': 'B'}, 3: {'a': 'A'}}
>>> id(a[3]) # a[1] and a[3] are the same object
140510365203264
>>> id(a[1])
140510365203264
>>> a[1]['a'] = "changed" # so changing one affects the "other one"
>>> a
{1: {'a': 'changed'}, 2: {'b': 'B'}, 3: {'a': 'changed'}}
If you want to copy objects, there are several ways to do this:
for lists: new = old[:], using the slicing notation
for dictionaries: dict.copy()
see also the copy module
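For nested structures like the one in the question, the difference between a shallow and a deep copy matters. A minimal sketch using the standard copy module (the variable names are just for illustration):

```python
import copy

original = {1: {"a": "A"}, 2: {"b": "B"}}
shallow = copy.copy(original)    # new outer dict, but the inner dicts are shared
deep = copy.deepcopy(original)   # inner dicts are copied as well

original[1]["a"] = "changed"
print(shallow[1]["a"])  # changed
print(deep[1]["a"])     # A
```

In the question, calling .copy() on each inner dictionary is enough, because the inner dictionaries only hold strings and integers.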
