I want to make a histogram of all the intervals between repeated values in a list. I wrote some code that works, but it's using a for loop with if statements. I often find that if one can manage to write a version using clever slicing and/or predefined python (numpy) methods, that one can get much faster Python code than using for loops, but in this case I can't think of any way of doing that. Can anyone suggest a faster or more pythonic way of doing this?
# make a 'histogram'/count of all the intervals between repeated values
def hist_intervals(a):
    values = sorted(set(a))  # get list of which values are in a
    # setup the dict to hold the histogram
    hist, last_index = {}, {}
    for i in values:
        hist[i] = {}
        last_index[i] = -1  # some default value
    # now go through the array and find intervals
    for i in range(len(a)):
        val = a[i]
        if last_index[val] != -1:  # do nothing if it's the first time
            interval = i - last_index[val]
            if interval in hist[val]:
                hist[val][interval] += 1
            else:
                hist[val][interval] = 1
        last_index[val] = i
    return hist
# example list/array
a = [1,2,3,1,5,3,2,4,2,1,5,3,3,4]
histdict = hist_intervals(a)
print("histdict = ",histdict)
# correct answer for this example
answer = { 1: {3:1, 6:1},
2: {2:1, 5:1},
3: {1:1, 3:1, 6:1},
4: {6:1},
5: {6:1}
}
print("answer = ",answer)
Sample output:
histdict = {1: {3: 1, 6: 1}, 2: {5: 1, 2: 1}, 3: {3: 1, 6: 1, 1: 1}, 4: {6: 1}, 5: {6: 1}}
answer = {1: {3: 1, 6: 1}, 2: {2: 1, 5: 1}, 3: {1: 1, 3: 1, 6: 1}, 4: {6: 1}, 5: {6: 1}}
^ note: I don't care about the ordering in the dict, so this solution is acceptable, but I want to be able to run on really large arrays/lists and I'm suspecting my current method will be slow.
You can eliminate the setup loop by a carefully constructed defaultdict. Then you're just left with a single scan over the input list, which is as good as it gets. Here I change the resultant defaultdict back to a regular Dict[int, Dict[int, int]], but that's just so it prints nicely.
from collections import defaultdict
def count_intervals(iterable):
    # setup
    last_seen = {}
    hist = defaultdict(lambda: defaultdict(int))
    # The actual work
    for i, x in enumerate(iterable):
        if x in last_seen:
            hist[x][i-last_seen[x]] += 1
        last_seen[x] = i
    return hist
a = [1,2,3,1,5,3,2,4,2,1,5,3,3,4]
hist = count_intervals(a)
for k, v in hist.items():
    print(k, dict(v))
# 1 {3: 1, 6: 1}
# 3 {3: 1, 6: 1, 1: 1}
# 2 {5: 1, 2: 1}
# 5 {6: 1}
# 4 {6: 1}
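If a plain nested dict is needed rather than just a nicer printout, the conversion can be done once at the end. The following is a minimal sketch that repeats count_intervals from above so it runs on its own; to_plain_dict is a hypothetical helper name, not part of the answer. It also checks the result against the expected answer given in the question.

```python
from collections import defaultdict


def count_intervals(iterable):
    """Count, per value, the gaps between consecutive occurrences."""
    last_seen = {}
    hist = defaultdict(lambda: defaultdict(int))
    for i, x in enumerate(iterable):
        if x in last_seen:
            hist[x][i - last_seen[x]] += 1
        last_seen[x] = i
    return hist


def to_plain_dict(hist):
    # Hypothetical helper: strip the defaultdict wrappers at both levels.
    return {value: dict(gaps) for value, gaps in hist.items()}


a = [1, 2, 3, 1, 5, 3, 2, 4, 2, 1, 5, 3, 3, 4]
expected = {
    1: {3: 1, 6: 1},
    2: {2: 1, 5: 1},
    3: {1: 1, 3: 1, 6: 1},
    4: {6: 1},
    5: {6: 1},
}
plain = to_plain_dict(count_intervals(a))
print(plain == expected)  # True
```

One difference from the code in the question: values that never repeat do not appear as keys here, whereas the original gives them an empty dict.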
There is an obvious change to make in terms of data structures: instead of using a dictionary of dictionaries for hist, use a defaultdict of Counter. This lets the code become
from collections import defaultdict, Counter
# make a 'histogram'/count of all the intervals between repeated values
def hist_intervals(a):
    # setup the dict to hold the histogram
    hist, last_index = defaultdict(Counter), {}
    # now go through the array and find intervals
    for i, val in enumerate(a):
        if val in last_index:
            interval = i - last_index[val]
            hist[val].update((interval,))
        last_index[val] = i
    return hist
This should be faster, as the missing-key checks move out of explicit Python if statements and into defaultdict and Counter, and it is also cleaner.
Related
I want to create a dictionary with keys of 1, 2, 3, 4
and each value to the key is [].
n = [1,2,3,4]
d = dict.fromkeys(n, [])
d[1].append(777)
print(d)
--> {1: [777], 2: [777], 3: [777], 4: [777]}
Why does this happen? Why is this not {1: [777], 2: [], 3: [], 4: []} ?
dict.fromkeys(n, []) evaluates [] only once, so every key is given a reference to the same list object. When you append through one key, the change shows up under all of them.
You can check that by using the following -
n = [1,2,3,4]
d = dict.fromkeys(n, [])
d[1] is d[2] #share memory
#True
So, one way is to build the dictionary with a comprehension instead, which creates a new list for each key -
n = [1,2,3,4]
d = {k:[] for k in n}
d[1] is d[2]
#False
Then you can append to a single key -
d[1].append(777)
{1: [777], 2: [], 3: [], 4: []}
Often a better way to do this is to use collections.defaultdict. defaultdict(list) creates a dictionary in which a missing key is given a new empty list the first time it is accessed. It's flexible, as you can choose which type the default values should be. Note that keys are only created on first access, so in the example below d contains only the key 1.
from collections import defaultdict
n = [1,2,3,4]
d = defaultdict(list)
d[1].append(777)
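The three approaches can be compared side by side. This is a small sketch; the variable names shared, separate and lazy are only for illustration.

```python
from collections import defaultdict

keys = [1, 2, 3, 4]

# dict.fromkeys evaluates [] once, so all keys share one list object.
shared = dict.fromkeys(keys, [])
shared[1].append(777)

# A dict comprehension evaluates [] once per key: independent lists.
separate = {k: [] for k in keys}
separate[1].append(777)

# defaultdict(list) builds a fresh list the first time a key is accessed,
# so keys that were never touched are absent from the dict.
lazy = defaultdict(list)
lazy[1].append(777)

print(shared)      # {1: [777], 2: [777], 3: [777], 4: [777]}
print(separate)    # {1: [777], 2: [], 3: [], 4: []}
print(dict(lazy))  # {1: [777]}
```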
I have a list operation = [5,6] and a dictionary dic = {0: None, 1: None}.
I want to replace each value of dic with the corresponding value of operation.
I tried this, but it doesn't run.
operation = [5,6]
for i in oper and val, key in dic.items():
dic_op[key] = operation[i]
Does someone have an idea ?
Other option, maybe:
operation = [5,6]
dic = {0: None, 1: None}
for idx, val in enumerate(operation):
    dic[idx] = val
dic #=> {0: 5, 1: 6}
Details for using index here: Accessing the index in 'for' loops?
zip method will do the job
operation = [5, 6]
dic = {0: None, 1: None}
for key, op in zip(dic, operation):
    dic[key] = op
print(dic) # {0: 5, 1: 6}
The above solution assumes that dic preserves insertion order (guaranteed since Python 3.7), so that the position of each element in operation lines up with the position of its key in dic.
Using zip in Python 3.7+, you could just do:
operation = [5,6]
dic = {0: None, 1: None}
print(dict(zip(dic, operation)))
# {0: 5, 1: 6}
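Note that dict(zip(dic, operation)) builds a new dictionary rather than changing dic. If the existing object has to be updated in place, one option is dict.update, which accepts an iterable of key/value pairs. A minimal sketch (alias is only there to show that the same object is modified):

```python
operation = [5, 6]
dic = {0: None, 1: None}
alias = dic  # second reference, to show the update happens in place

# zip pairs the keys (in insertion order) with the values; the keys are
# copied into a list first so dic is not iterated while it is updated.
dic.update(zip(list(dic), operation))

print(dic)           # {0: 5, 1: 6}
print(alias is dic)  # True
```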
I am well aware of following question which also exists on stack overflow String Unknown pattern Matching but the answer there doesn't really work for me.
My problem is as follows. I get a string of characters, and:
1. For '1211', I need to see that 1 is the most often repeated pattern, and that it occurs 2 times in a row.
2. But it can also be "121212112", where 12 is repeated 3 times in a row.
3. But with 12221221 it is 221 that is repeated 2 times rather than 2 that repeats 3 times.
here are some results I like to get (the only numbers ever used are 1 and 2's)
>>> counter('1211')
1
>>> counter('1212')
2
>>> counter('21212')
2
the outcome I want is how many times it occurs.
I have no idea how to even start looking for a pattern, since it is not known beforehand. I did some research online and didn't find anything similar.
Does anyone have any idea how I even start to tackle this problem ? All help is welcome and if you want more information don't hesitate to let me know.
Really inefficient, but you can:
- find all substrings (https://stackoverflow.com/a/22470047/264596)
- put them into a set to avoid duplicates
- for each substring, find all its occurrences, and use some function to find the max (I am not sure how you choose between short strings occurring many times and long strings occurring few times)
Obviously you can use some data structure to pass through the string once and do some counting on the way, but since I am not sure what your constraints and desired output are, I can give you only this.
I agree with Jirka, not sure how you score long vs short to select the optimal results but this function will give you the menu:
#Func1
def sub_string_cts(string):
    combos = {}
    for i in range(len(string)):
        u_start = len(string) - i
        for start in range(u_start):
            c_str = string[start:i+start+1]
            if c_str in combos:
                combos[c_str] += 1
            else:
                combos[c_str] = 1
    return combos
sub_string_cts('21212')
{'2': 3,
'1': 2,
'21': 2,
'12': 2,
'212': 2,
'121': 1,
'2121': 1,
'1212': 1,
'21212': 1}
After your comment I think this is more what you're looking for:
#Func2
import re
def sub_string_cts(string):
    combos = {}
    for i in range(len(string)):
        u_start = len(string) - i
        substrs = set([string[start:i+start+1] for start in range(u_start)])
        for substring in substrs:
            # longest run of back-to-back copies of substring, measured in copies
            combos[substring] = max([len(m) for m in re.findall("((?:{})+)".format(substring), string)])//len(substring)
    return combos
sub_string_cts('21212')
{'2': 1,
'1': 1,
'21': 2,
'12': 2,
'212': 1,
'121': 1,
'2121': 1,
'1212': 1,
'21212': 1}
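The core step of Func2, counting how many copies of one pattern appear back to back, can be isolated into a helper. This is a sketch only: max_consecutive_repeats is a hypothetical name, it assumes a non-empty pattern, and it adds re.escape (not in the code above) so that patterns containing regex metacharacters are matched literally. That is not needed for strings made of 1s and 2s, but it matters for general input.

```python
import re


def max_consecutive_repeats(pattern, string):
    """Longest run of back-to-back copies of pattern in string (0 if absent)."""
    # re.escape keeps characters such as '.' or '+' literal.
    runs = re.findall("((?:{})+)".format(re.escape(pattern)), string)
    return max((len(run) for run in runs), default=0) // len(pattern)


print(max_consecutive_repeats("12", "121212112"))  # 3
print(max_consecutive_repeats("221", "12221221"))  # 2
print(max_consecutive_repeats("2", "12221221"))    # 3
print(max_consecutive_repeats("a.", "a.a.ab"))     # 2
```

The first three calls correspond to the examples in the question.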
You could narrow that down to the 'best' candidates by collapsing on the highest-occurring instance of each string length:
def max_by_len(result_dict):
    results = {}
    for k, v in result_dict.items():
        if len(k) not in results:
            results[len(k)] = {}
    for c_len in [ln for ln in results]:
        len_max_count = max([v for (k, v) in result_dict.items() if len(k) == c_len])
        for k,v in result_dict.items():
            if len(k) == c_len:
                if v == len_max_count:
                    results[c_len][k] = v
    return results
#Func1:
max_by_len(sub_string_cts('21212'))
{1: {'2': 3},
2: {'21': 2, '12': 2},
3: {'212': 2},
4: {'2121': 1, '1212': 1},
5: {'21212': 1}}
#Func2:
max_by_len(sub_string_cts('21212'))
{1: {'2': 1, '1': 1},
2: {'21': 2, '12': 2},
3: {'212': 1, '121': 1},
4: {'2121': 1, '1212': 1},
5: {'21212': 1}}
Assuming we wouldn't select '2121' or '1212', because they occur as often as '21212' and are shorter, and similarly wouldn't select '21' or '12', as they occur at the same frequency as '212', we can limit the viable candidates to '2', '212', and '21212' with the following code:
def remove_lesser_patterns(result_dict):
    len_lst = sorted([k for k in result_dict], reverse=True)
    #len_lst = sorted([k for k in max_len_results])
    len_crosswalk = {i_len: max([v for (k,v) in result_dict[i_len].items()]) for i_len in len_lst}
    for i_len in len_lst[:-1]:
        eval_lst = [i for i in len_lst if i < i_len]
        for i in eval_lst:
            if len_crosswalk[i] <= len_crosswalk[i_len]:
                if i in result_dict:
                    del result_dict[i]
    return result_dict
#Func1
remove_lesser_patterns(max_by_len(sub_string_cts('21212')))
{1: {'2': 3}, 3: {'212': 2}, 5: {'21212': 1}}
#Func2
remove_lesser_patterns(max_by_len(sub_string_cts('21212')))
{2: {'21': 2, '12': 2}, 5: {'21212': 1}}
results:
test_string = ["1211", "1212", "21212", "12221221"]
for string in test_string:
    print("<Input: '{}'".format(string))
    c_answer = remove_lesser_patterns(max_by_len(sub_string_cts(string)))
    print("<Output: {}\n".format(c_answer))
<Input: '1211'
<Output: {1: {'1': 2}, 4: {'1211': 1}}
# '1' is repeated twice
<Input: '1212'
<Output: {2: {'12': 2}, 4: {'1212': 1}}
# '12' is repeated twice
<Input: '21212'
<Output: {2: {'21': 2, '12': 2}, 5: {'21212': 1}}
# '21' and '12' are both repeated twice
<Input: '12221221'
<Output: {1: {'2': 3}, 3: {'221': 2}, 8: {'12221221': 1}}
# '2' is repeated 3 times, '221' is repeated twice
These functions together give you the highest occurrence of each pattern by length. The key of the dictionary is the length, and the value is a sub-dictionary of the highest-occurring patterns (several if tied).
Func2 requires the patterns be sequential, whereas Func1 does not -- it is strictly occurrence based.
Note:
With your example:
3. But with 12221221 it is 221 that is repeated 2 times rather than 2 that repeats 3 times.
the code solves this ambiguity in your desired output (2 or 3) by giving you both:
<Input: '12221221'
<Output: {1: {'2': 3}, 3: {'221': 2}, 8: {'12221221': 1}}
# '2' is repeated 3 times, '221' is repeated twice
If you're only interested in the 2 char lengths you can easily pull those out of the max_by_len results as follows:
test_string = ["1211", "1212", "21212", "12221221"]
for string in test_string:
    print("<Input: '{}'".format(string))
    c_answer = remove_lesser_patterns({k:v for (k,v) in max_by_len(sub_string_cts(string)).items() if k == 2})
    print("<Output: {}\n".format(max([v for (k,v) in c_answer[2].items()])))
#Func2
<Input: '1211'
<Output: 1
<Input: '1212'
<Output: 2
<Input: '21212'
<Output: 2
<Input: '12221221'
<Output: 1
I have the following code that generates a nested dictionary.
import random
import numpy as np
dict1 = {}
for i in range(0,2):
    dict2 = {}
    for j in range(0,3):
        dict2[j] = random.randint(1,10)
    dict1[i] = dict2
For example it can generate the following content of dict1:
{0: {0: 7, 1: 2, 2: 5}, 1: {0: 3, 1: 10, 2: 10}}
I want to find the sub-key of the minimum value for a fixed key. For example, for the fixed key 0, the minimum value among the nested dictionary values is 2, which belongs to the sub-key 1. Therefore the result should be 1:
result=find_min(dict1[0])
result
1
How can I develop such find_min function?
You can reverse the keys and the values, then obtain the key with the minimum value:
a = {0: {0: 7, 1: 2, 2: 5}, 1: {0: 3, 1: 10, 2: 10}}
dict(zip(a[0].values(),a[0].keys())).get(min(a[0].values()))
Here we create a new dictionary whose keys and values are swapped relative to the original dictionary, e.g.
dict(zip(a[0].values(),a[0].keys()))
Out[1575]: {7: 0, 2: 1, 5: 2}
Then, from here, we obtain the minimum value in the original dictionary and use it as the key in this reversed dictionary. If several keys share a value, the reversed dictionary keeps only the last of them.
EDIT
As indicated in the comments, one can simply use the key within the min function:
min(a[0],key = a[0].get)
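Wrapped into the find_min function the question asks for, this could look as follows. A minimal sketch: the signature takes the sub-dictionary directly, as in the question's find_min(dict1[0]) call.

```python
def find_min(sub_dict):
    """Return the key whose value is smallest; ties go to the first key seen."""
    # Iterating a dict yields its keys; key=sub_dict.get ranks each by value.
    return min(sub_dict, key=sub_dict.get)


dict1 = {0: {0: 7, 1: 2, 2: 5}, 1: {0: 3, 1: 10, 2: 10}}
print(find_min(dict1[0]))  # 1
print(find_min(dict1[1]))  # 0
```

Unlike the reversed-dictionary approach, this builds no intermediate dictionary.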
To extract the sub-dict for key 0, just do:
sub_dict = dict1[0]
Then, to find the key corresponding to the minimum value:
min_value, min_key = min((value, key) for key, value in sub_dict.items())
def find_min(d, fixed_key):
    # Given a dictionary of dictionaries d, and a fixed_key, get the dictionary associated with the key
    myDict = d[fixed_key]
    # treat the dictionary keys and values as lists (in Python 3, dict views have no index method)
    # get the index of the minimum value, then use it to get the key
    sub_key = list(myDict.keys())[list(myDict.values()).index(min(myDict.values()))]
    return sub_key
dict1 = {0: {0: 7, 1: 2, 2: 5}, 1: {0: 3, 1: 10, 2: 10}}
print(dict1)
print(find_min(dict1, 0))
I have a dictionary of values:
dic = {1: "a1+b+c", 2: "a1+c+v", 3: "a1+z+e", 4: "a2+p+a", 5: "a2+z+v", 6: "a3+q+v", ...}
I have a page in Flask, that has checkboxes for each partial string value in a dictionary, e.g. checkboxes "a", "b", "c",... etc. On the page, the checkboxes are located in groups a1, a2, a3, etc.
I need to filter the dictionary by the partial values based on the values of the selected checkboxes, for example, when selecting "c" in group a1, it would return:
1: a1+b+c
2: a1+c+v
When selecting "z" from group a2, it would return:
5: "a2+z+v"
The code that generates an error is:
sol = [k for k in dic if 'a1' in k]
Can someone point me to the right direction?
You can easily solve this with a quite short function:
def lookup(dct, *args):
    for needle in args:
        dct = {key: value for key, value in dct.items() if needle in value}
    return dct
For example:
>>> dic = {1: "a1+b+c", 2: "a1+c+v", 3: "a1+z+e", 4: "a2+p+a", 5: "a2+z+v", 6: "a3+q+v"}
>>> lookup(dic, "a1", "c")
{1: 'a1+b+c', 2: 'a1+c+v'}
However that always needs to iterate over all keys for each "needle". You can do better if you have a helper dictionary (I'll use a collections.defaultdict here) that stores all keys that match one needle (assuming + is supposed to be a delimiter in your dictionary):
from collections import defaultdict
helperdict = defaultdict(set)
for key, value in dic.items():
    for needle in value.split('+'):
        helperdict[needle].add(key)
That helperdict now contains all keys that match one particular part of a value:
>>> print(dict(helperdict))
{'z': {3, 5}, 'p': {4}, 'a1': {1, 2, 3}, 'a3': {6}, 'v': {2, 5, 6}, 'a2': {4, 5}, 'e': {3}, 'b': {1}, 'a': {4}, 'c': {1, 2}, 'q': {6}}
And using set.intersection allows you to quickly get all matches for different combinations:
>>> search = ['a2', 'z']
>>> matches = set.intersection(*[helperdict[needle] for needle in search])
>>> {match: dic[match] for match in matches}
{5: 'a2+z+v'}
It's definitely longer than the first approach and requires additional memory, but if you plan to do several queries it will be much faster.
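The helper-dictionary approach can be packaged into two small functions, one to build the index once and one to query it. This is a sketch under the same assumption that + is the delimiter; build_index and query are hypothetical names.

```python
from collections import defaultdict


def build_index(dic, sep="+"):
    """Map each part of every value to the set of keys containing it."""
    index = defaultdict(set)
    for key, value in dic.items():
        for part in value.split(sep):
            index[part].add(key)
    return index


def query(dic, index, *needles):
    """Return the entries of dic whose value contains every needle as a part."""
    if not needles:
        return dict(dic)
    # index.get avoids inserting empty sets for unknown needles.
    matches = set.intersection(*(index.get(n, set()) for n in needles))
    return {key: dic[key] for key in sorted(matches)}


dic = {1: "a1+b+c", 2: "a1+c+v", 3: "a1+z+e",
       4: "a2+p+a", 5: "a2+z+v", 6: "a3+q+v"}
index = build_index(dic)
print(query(dic, index, "a1", "c"))  # {1: 'a1+b+c', 2: 'a1+c+v'}
print(query(dic, index, "a2", "z"))  # {5: 'a2+z+v'}
print(query(dic, index, "nope"))     # {}
```

One behavioural difference from lookup: the index matches whole parts, so "a" finds only entry 4, whereas the substring test in lookup would also match every value containing "a1", "a2" or "a3".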