Efficient looping algorithm in Python

I am doing the operation below, where the set of values for a key in a dictionary (dictionary_with_large_values) can have more than 1 million values. Looping over each of them takes a lot of time, and there are 7 to 8 keys with data of that size in the dictionary. Is there a more time-efficient algorithm I can use in Python? The algorithm checks whether two strings are the same after error checks, creates a mapping from a parent string to the set of similar strings (dictionary_test_1), and creates another dictionary holding the reverse mapping (dictionary_test_2).
# type of dictionary; values contain sets of strings
dictionary_with_large_values = defaultdict(set)
dictionary_test_1 = defaultdict(set)
dictionary_test_2 = defaultdict()
# method which stores data to the dictionary
# algorithm to parse the data
for k, v in dictionary_with_large_values.items():
    i = 0
    values = list(v)
    while i < len(values):
        string_data = values[i].replace(" ", "")
        j = i + 1
        while j < len(values):
            string2_data = values[j].replace(" ", "")
            # algorithm to check if normalized(string_data) == normalized(string2_data)
            data = areTheySame(string_data, string2_data)
            if data:
                dictionary_test_1[values[i]].add(values[j])
                dictionary_test_2[values[j]] = values[i]
                del values[j]
            else:
                j += 1
        i += 1
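A possible direction (a minimal sketch, not the original code): if areTheySame() amounts to comparing a normalized form of the two strings, as the replace(" ", "") calls suggest, the values can be grouped by that normalized key in a single pass, which avoids the quadratic inner loop. normalize() below is a hypothetical stand-in for whatever areTheySame() actually checks.

from collections import defaultdict

def normalize(s):
    # assumption: normalization is just "strip spaces"; adjust to mirror areTheySame()
    return s.replace(" ", "")

dictionary_test_1 = defaultdict(set)
dictionary_test_2 = {}

for k, v in dictionary_with_large_values.items():
    groups = defaultdict(list)
    for value in v:
        groups[normalize(value)].append(value)   # one pass per set of values
    for group in groups.values():
        parent = group[0]                        # first string becomes the parent
        for child in group[1:]:
            dictionary_test_1[parent].add(child)
            dictionary_test_2[child] = parent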


How to find the highest value element in a list with reference to a dictionary in Python

How do I code a function in Python which can:
iterate through a list of word strings (which may contain duplicate words), referencing a dictionary,
find the word with the highest absolute sum, and
output it along with the corresponding absolute value.
The function also has to ignore words which are not in the dictionary.
For example,
Assume the function is called H_abs_W().
Given the following list and dict:
list_1 = ['apples','oranges','pears','apples']
Dict_1 = {'apples':5.23,'pears':-7.62}
Then calling the function as:
H_abs_W(list_1,Dict_1)
Should give the output:
'apples',10.46
EDIT:
I managed to do it in the end with the code below. Looking over the answers, it turns out I could have done it in a shorter fashion, lol.
def H_abs_W(list_1, Dict_1):
    freqW = {}
    for char in list_1:
        if char in freqW:
            freqW[char] += 1
        else:
            freqW[char] = 1
    ASum_W = 0
    i_word = ''
    for a, b in freqW.items():
        x = 0
        d = Dict_1.get(a, 0)
        x = abs(float(b) * float(d))
        if x > ASum_W:
            ASum_W = x
            i_word = a
    return (i_word, ASum_W)
list_1 = ['apples', 'oranges', 'pears', 'apples']
Dict_1 = {'apples': 5.23, 'pears': -7.62}
d = {k: 0 for k in list_1}
for x in list_1:
    if x in Dict_1.keys():
        d[x] += Dict_1[x]
# pick the key with the largest accumulated absolute sum, not the largest dictionary value
m = max(d, key=lambda k: abs(d[k]))
print(m, abs(d[m]))
Try this:
key, value = sorted(Dict_1.items(), key=lambda x: x[1], reverse=True)[0]
print(f"{key}, {list_1.count(key) * value}")
# apples, 10.46
You can use Counter to calculate the frequency (number of occurrences) of each item in the list.
max(counter.values()) gives the count of the most frequently occurring element, and
max(counter, key=counter.get) gives the item in the list that is
associated with that highest count.
from collections import Counter

def H_abs_W(list_1, Dict_1):
    counter = Counter(list_1)
    count = max(counter.values())
    item = max(counter, key=counter.get)
    return item, abs(count * Dict_1.get(item))
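Note that picking the most frequently occurring item is not quite the same as picking the largest absolute sum, and Dict_1.get(item) returns None when that word is missing from the dictionary. A minimal sketch that follows the original spec (ignore words not in the dictionary, compare |count * value|) could look like this:

from collections import Counter

def H_abs_W(list_1, Dict_1):
    # count only the words that actually appear in the dictionary
    counts = Counter(w for w in list_1 if w in Dict_1)
    if not counts:
        return '', 0.0
    # pick the word whose |count * value| is largest
    word = max(counts, key=lambda w: abs(counts[w] * Dict_1[w]))
    return word, abs(counts[word] * Dict_1[word])

H_abs_W(list_1, Dict_1)
# ('apples', 10.46)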

Convert manager.dict() to list of tuples of form [[a,b,c],[q,w,e],[e,r,t].......]

I am using multiprocessing to increase the computation speed of my program, for which I used:
manager = Manager()
parallel_array_sites = manager.dict()
find_sites()
removal()
The find_sites function runs properly.
My removal function is:
global array_sites
for i in parallel_array_sites:
    array_sites.append(i)
# ---- not very relevant from here on -----
count = 0
remove_sites = {}  # dictionary which contains indices of sites to remove
for i in range(len(array_sites)):
    remove_sites[i] = 0
for i in range(len(array_sites)):
    if remove_sites[i]:
        continue
    for j in range(len(array_sites)):
        if(j > i and remove_sites[j] == 0):
            x = array_sites[i][0] - array_sites[j][0]
            y = array_sites[i][1] - array_sites[j][1]
            z = array_sites[i][2] - array_sites[j][2]
            r = math.sqrt(x*x + y*y + z*z)
            if(r < (rmin/1.1)):
                count = count + 1
                remove_sites[j] = 1
print "after removal", len(array_sites)
#print remove_sites
count = 0
for key, val in remove_sites.iteritems():
    if(val == 1):
        del array_sites[key-count]
        count = count + 1
The removal function requires me to use the tuples stored in parallel_array_sites as tuples in the list array_sites.
All the objects in parallel_array_sites are tuples of 3 elements each.
The number of entries can be fairly large, which is why I don't want to specify the size while declaring a multiprocessing.list() instead.
The loop
for i in parallel_array_sites:
    array_sites.append(i)
does not work and gives the following error:
File "/usr/lib/python2.7/multiprocessing/managers.py", line 774, in _callmethod
raise convert_to_error(kind, result)
KeyError: 1081
I'd appreciate help with any kind of changes I can make.
I used
for i in range(len(parallel_array_sites)):
    array_sites.append(parallel_array_sites[i])
instead, because
for i in parallel_array_sites:
does not work for a dictionary.
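For reference, a hedged alternative (assuming the stored objects are the 3-tuples described above): the DictProxy returned by Manager().dict() exposes keys() and values(), which return plain copies, so iterating over one of those also avoids the KeyError without assuming the keys are 0..n-1 integers.

for key in parallel_array_sites.keys():   # keys() returns an ordinary list copy
    array_sites.append(parallel_array_sites[key])

# or, when only the stored 3-tuples are needed:
array_sites.extend(parallel_array_sites.values())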

Search and replace multiple specific sequences of elements in Python list/array

I currently have 6 separate for loops which iterate over a list of numbers looking to match specific sequences of numbers within larger sequences, and replace them like this:
[...0,1,0...] => [...0,0,0...]
[...0,1,1,0...] => [...0,0,0,0...]
[...0,1,1,1,0...] => [...0,0,0,0,0...]
And their inverse:
[...1,0,1...] => [...1,1,1...]
[...1,0,0,1...] => [...1,1,1,1...]
[...1,0,0,0,1...] => [...1,1,1,1,1...]
My existing code is like this:
for i in range(len(output_array)-2):
    if output_array[i] == 0 and output_array[i+1] == 1 and output_array[i+2] == 0:
        output_array[i+1] = 0
for i in range(len(output_array)-3):
    if output_array[i] == 0 and output_array[i+1] == 1 and output_array[i+2] == 1 and output_array[i+3] == 0:
        output_array[i+1], output_array[i+2] = 0, 0
In total I'm iterating over the same output_array 6 times, using brute force checking. Is there a faster method?
# I would create a map between the string searched and the new one.
patterns = {}
patterns['010'] = '000'
patterns['0110'] = '0000'
patterns['01110'] = '00000'
# I would loop over the lists
lists = [[0,1,0,0,1,1,0,0,1,1,1,0]]
for lista in lists:
    # I would join the list elements as a string
    string_list = ''.join(map(str, lista))
    # we loop over the patterns
    for pattern, value in patterns.items():
        # if a pattern is detected, we replace it
        string_list = string_list.replace(pattern, value)
    # convert back to a list of ints (list(string_list) would give single-character strings)
    lista = [int(c) for c in string_list]
    print(lista)
While this question is related to the questions Here and Here, the question from the OP relates to fast searching for multiple sequences at once. While the accepted answer works well, we may not want to loop through all the search sequences for every sub-iteration of the base sequence.
Below is an algorithm which checks for a sequence of i ints only if the sequence of (i-1) ints is present in the base sequence.
# This is the driver function which takes in a) the search sequences and
# replacements as a dictionary and b) the full sequence list in which to search
def findSeqswithinSeq(searchSequences, baseSequence):
    seqkeys = [[int(i) for i in elem.split(",")] for elem in searchSequences]
    maxlen = max([len(elem) for elem in seqkeys])
    decisiontree = getdecisiontree(seqkeys)
    i = 0
    while i < len(baseSequence):
        (increment, replacement) = get_increment_replacement(decisiontree, baseSequence[i:i+maxlen])
        if replacement != -1:
            baseSequence[i:i+len(replacement)] = searchSequences[",".join(map(str, replacement))]
        i += increment
    return baseSequence

# the following function gives the dictionary of intermediate sequences allowed
def getdecisiontree(searchsequences):
    dtree = {}
    for elem in searchsequences:
        for i in range(len(elem)):
            if i+1 == len(elem):
                dtree[",".join(map(str, elem[:i+1]))] = True
            else:
                dtree[",".join(map(str, elem[:i+1]))] = False
    return dtree

# the following is the function that does most of the work, giving us a) how many
# positions we can skip in the search and b) whether the search seq was found
def get_increment_replacement(decisiontree, sequence):
    if str(sequence[0]) not in decisiontree:
        return (1, -1)
    for i in range(1, len(sequence)):
        key = ",".join(map(str, sequence[:i+1]))
        if key not in decisiontree:
            return (1, -1)
        elif decisiontree[key] == True:
            key = [int(i) for i in key.split(",")]
            return (len(key), key)
    return 1, -1
You can test the above code with this snippet:
if __name__ == "__main__":
    inputlist = [5,4,0,1,1,1,0,2,0,1,0,99,15,1,0,1]
    patternsandrepls = {'0,1,0': [0,0,0],
                        '0,1,1,0': [0,0,0,0],
                        '0,1,1,1,0': [0,0,0,0,0],
                        '1,0,1': [1,1,1],
                        '1,0,0,1': [1,1,1,1],
                        '1,0,0,0,1': [1,1,1,1,1]}
    print(findSeqswithinSeq(patternsandrepls, inputlist))
The proposed solution represents the sequences to be searched as a decision tree.
Because many of the search positions are skipped, we should be able to do better than O(m*n) with this method (where m is the number of search sequences and n is the length of the base sequence).
EDIT: Changed answer based on more clarity in edited question.

Python: Concatenate similar objects in List

I have a list containing strings of the form 'Country-Points'.
For example:
lst = ['Albania-10', 'Albania-5', 'Andorra-0', 'Andorra-4', 'Andorra-8', ...other countries...]
I want to calculate the average for each country without creating a new list. So the output would be (in the case above):
lst = ['Albania-7.5', 'Andorra-4.25', ...other countries...]
I would really appreciate it if anyone can help me with this.
EDIT:
This is what I've got so far. "data" is actually a dictionary, where the keys are countries and the values are lists of other countries' points given to this country (the one used as the key). Again, I'm new to Python, so I don't really know all the built-in functions.
for key in self.data:
    lst = []
    index = 0
    score = 0
    cnt = 0
    s = str(self.data[key][0]).split("-")[0]
    for i in range(len(self.data[key])):
        if s in self.data[key][i]:
            a = str(self.data[key][i]).split("-")
            score += int(float(a[1]))
            cnt += 1
            index += 1
        if i+1 != len(self.data[key]) and not s in self.data[key][i+1]:
            lst.append(s + "-" + str(float(score/cnt)))
            s = str(self.data[key][index]).split("-")[0]
            score = 0
    self.data[key] = lst
itertools.groupby with a suitable key function can help:
import itertools

def get_country_name(item):
    return item.split('-', 1)[0]

def get_country_value(item):
    return float(item.split('-', 1)[1])

def country_avg_grouper(lst):
    for ctry, group in itertools.groupby(lst, key=get_country_name):
        values = list(get_country_value(c) for c in group)
        avg = sum(values)/len(values)
        yield '{country}-{avg}'.format(country=ctry, avg=avg)

lst[:] = country_avg_grouper(lst)
The key here is that I wrote a function to do the change out of place and then I can easily make the substitution happen in place by using slice assignment.
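One usage note (a quick sketch with made-up data): groupby only merges consecutive items with the same key, so if the list is not already grouped by country, sort it by the same key function first.

lst = ['Andorra-0', 'Albania-10', 'Andorra-4', 'Albania-5', 'Andorra-8']
lst.sort(key=get_country_name)      # group identical countries together
lst[:] = country_avg_grouper(lst)
print(lst)                          # ['Albania-7.5', 'Andorra-4.0']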
I would probably do this with an intermediate dictionary.
def country(s):
    return s.split('-')[0]

def value(s):
    return float(s.split('-')[1])

def country_average(lst):
    country_map = {}
    for point in lst:
        c = country(point)
        v = value(point)
        old = country_map.get(c, (0, 0))
        country_map[c] = (old[0]+v, old[1]+1)
    return ['%s-%f' % (country, sum/count)
            for (country, (sum, count)) in country_map.items()]
It traverses the original list only once, at the expense of quite a few tuple allocations.
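A quick usage check with the sample data from the question (output order follows dictionary insertion order on current CPython):

lst = ['Albania-10', 'Albania-5', 'Andorra-0', 'Andorra-4', 'Andorra-8']
print(country_average(lst))
# ['Albania-7.500000', 'Andorra-4.000000']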

Python nested loop comparing two lists and updating a dictionary

This code works as expected but takes up a lot of memory and takes vastly longer to run than any other part of my code.
def function(input1, input2):
    mapping = []
    for item in input1:
        risks = {"A": 0, "B": 0, "C": 0, "D": 0, "E": 0}
        temp = []
        for row in input2:
            if item in row[0]:
                for key in risks.keys():
                    if row[1] == key:
                        risks[key] += 1
        temp.append(item)
        for key in risks.keys():
            temp.append(risks[key])
        mapping.append(temp)
    return mapping
I'm hoping to find a more efficient way to do this, with far less memory. input1 is a list of unique strings and input2 is a list of tuples that are not unique. There has got to be a better way to do this.
Thanks for your help.
First some test data:
import random

input1 = list(range(1000))
input2 = [
    ([random.randint(0, 1000) for _ in range(100)], random.choice("ABCDE"))
    for _ in range(10000)
]
Then my new function:
def newfunction(input1, input2):
    input1_map = {i: dict.fromkeys("ABCDE", 0) for i in input1}
    for row in input2:
        for i1 in set(row[0]):
            try:
                input1_map[i1][row[1]] += 1
            except KeyError:
                pass
    return [[i] + list(input1_map[i].values()) for i in input1]
This is of depth 2, not 3, so ends up being a lot faster for large inputs.
If row[0] does not contain duplicates, change set(row[0]) to just row[0].
If the try should never fail, remove it.
Choose better variable names. I don't know what things represent so my names here are pretty bad.
The last statement could be micro-optimised away if speed is too much of a concern, but I wouldn't expect it would matter much.
The list(input1_map[i].values()) is non-deterministic. Think about that.
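If that ordering matters, one small tweak (a sketch; newfunction_ordered is just a hypothetical name, and it assumes the caller expects the "ABCDE" column order) is to read the counts back with an explicit key order:

def newfunction_ordered(input1, input2):
    input1_map = {i: dict.fromkeys("ABCDE", 0) for i in input1}
    for row in input2:
        for i1 in set(row[0]):
            try:
                input1_map[i1][row[1]] += 1
            except KeyError:
                pass
    # read counts back in a fixed "ABCDE" order so each row's column layout is deterministic
    return [[i] + [input1_map[i][k] for k in "ABCDE"] for i in input1]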
For reference, here's the old version:
def function(input1, input2):
    mapping = []
    for item in input1:
        risks = {"A": 0, "B": 0, "C": 0, "D": 0, "E": 0}
        temp = []
        for row in input2:
            if item in row[0]:
                for key in risks.keys():
                    if row[1] == key:
                        risks[key] += 1
        temp.append(item)
        for key in risks.keys():
            temp.append(risks[key])
        mapping.append(temp)
    return mapping
And it passes my test:
function(input1, input2) == newfunction(input1, input2)
#>>> True
