Python nested loop comparing two lists and updating a dictionary

This code works as expected, but it uses a lot of memory and takes vastly longer to run than any other part of my code.
def function(input1, input2):
    mapping = []
    for item in input1:
        risks = {"A": 0, "B": 0, "C": 0, "D": 0, "E": 0}
        temp = []
        for row in input2:
            if item in row[0]:
                for key in risks.keys():
                    if row[1] == key:
                        risks[key] += 1
        temp.append(item)
        for key in risks.keys():
            temp.append(risks[key])
        mapping.append(temp)
    return mapping
I'm hoping to find a more efficient way to do this that uses far less memory. input1 is a list of unique strings and input2 is a list of tuples that are not unique. There has got to be a better way to do this.
Thanks for your help.

First some test data:
import random
input1 = list(range(1000))
input2 = [
    ([random.randint(0, 1000) for _ in range(100)], random.choice("ABCDE"))
    for _ in range(10000)
]
Then my new function:
def newfunction(input1, input2):
    input1_map = {i: dict.fromkeys("ABCDE", 0) for i in input1}
    for row in input2:
        for i1 in set(row[0]):
            try:
                input1_map[i1][row[1]] += 1
            except KeyError:
                pass
    return [[i] + list(input1_map[i].values()) for i in input1]
This has a loop nesting depth of 2 instead of 3, so it ends up being a lot faster for large inputs.
If row[0] does not contain duplicates, change set(row[0]) to just row[0].
If the try should never fail, remove it.
Choose better variable names. I don't know what things represent so my names here are pretty bad.
The last statement could be micro-optimised away if speed is too much of a concern, but I wouldn't expect it would matter much.
The order of list(input1_map[i].values()) relies on dictionary ordering, which is only guaranteed from Python 3.7 onwards; on older versions it is not deterministic. Think about that.
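If you want the output columns in a fixed, explicit order regardless of Python version, here is a minimal sketch of a variant (the name newfunction_ordered and the categories parameter are mine, not from the original code) that pins the key order instead of relying on dictionary iteration:

def newfunction_ordered(input1, input2, categories="ABCDE"):
    counts = {i: dict.fromkeys(categories, 0) for i in input1}
    for values, category in input2:
        for v in set(values):          # drop duplicates within one row
            if v in counts:            # ignore values not present in input1
                counts[v][category] += 1
    # emit the count columns in the fixed `categories` order
    return [[i] + [counts[i][c] for c in categories] for i in input1]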
For reference, here's the old version:
def function(input1, input2):
    mapping = []
    for item in input1:
        risks = {"A": 0, "B": 0, "C": 0, "D": 0, "E": 0}
        temp = []
        for row in input2:
            if item in row[0]:
                for key in risks.keys():
                    if row[1] == key:
                        risks[key] += 1
        temp.append(item)
        for key in risks.keys():
            temp.append(risks[key])
        mapping.append(temp)
    return mapping
And it passes my test:
function(input1, input2) == newfunction(input1, input2)
#>>> True

How do I use a while loop to access all the 2nd elements of lists which are the values stored in a dictionary?

If I have a dictionary like this, filled with similar lists, how can I apply a while loop to extract a list that prints that second element:
racoona_valence={}
racoona_valence={"rs13283416": ["7:87345874365-839479328749+","BOBB7"],\}
I need to print the part that says "BOBB7", i.e. the second element of each list, from a larger dictionary. There are ten key-value pairs in it, so I am starting it like so, but I'm unsure what to do because all the examples I can find don't relate to my problem:
n=10
gene_list = []
while n>0:
Any help greatly appreciated.
Well, there's a bunch of ways to do it depending on how well-structured your data is.
racoona_valence={"rs13283416": ["7:87345874365-839479328749+","BOBB7"], "rs13283414": ["7:87345874365-839479328749+","BOBB4"]}
output = []
for key in racoona_valence.keys():
output.append(racoona_valence[key][1])
print(output)
other_output = []
for key, value in racoona_valence.items():
other_output.append(value[1])
print(other_output)
list_comprehension = [value[1] for value in racoona_valence.values()]
print(list_comprehension)
n = len(racoona_valence.values()) - 1
counter = 0
gene_list = []
while counter <= n:
    gene_list.append(list(racoona_valence.values())[counter][1])
    counter += 1
print(gene_list)
Here is a list comprehension that does what you want:
second_element = [x[1] for x in racoona_valence.values()]
Here is a for loop that does what you want:
second_element = []
for value in racoona_valence.values():
    second_element.append(value[1])
Here is a while loop that does what you want:
# don't use a while loop to loop over iterables, it's a bad idea
i = 0
second_element = []
dict_values = list(racoona_valence.values())
while i < len(dict_values):
    second_element.append(dict_values[i][1])
    i += 1
Regardless of which approach you use, you can see the results by doing the following:
for item in second_element:
    print(item)
For the example that you gave, this is the output:
BOBB7

How to take only last value from a list with unique tag?

In my LIST (not a dictionary) I have these strings:
"K:60",
"M:37",
"M_4:47",
"M_5:89",
"M_6:91",
"N:15",
"O:24",
"P:50",
"Q:50",
"Q_7:89"
In the output I need to have:
"K:60",
"M_6:91",
"N:15",
"O:24",
"P:50",
"Q_7:89"
What is a possible solution?
Or, even better, how can I keep the entry with the maximum value among strings that share the same tag?
Use re.split and list comprehension as shown below. Use the fact that when the dictionary dct is created, only the last value is kept for each repeated key.
import re

lst = [
    "K:60",
    "M:37",
    "M_4:47",
    "M_5:89",
    "M_6:91",
    "N:15",
    "O:24",
    "P:50",
    "Q:50",
    "Q_7:89"
]

dct = dict([(re.split(r'[:_]', s)[0], s) for s in lst])
lst_uniq = list(dct.values())
print(lst_uniq)
# ['K:60', 'M_6:91', 'N:15', 'O:24', 'P:50', 'Q_7:89']
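To see what the key is doing, here is a quick illustration on one element from the list above: re.split(r'[:_]', s) splits on either ':' or '_', and taking index 0 gives the base tag used as the dictionary key.

import re
print(re.split(r'[:_]', "M_6:91"))      # ['M', '6', '91']
print(re.split(r'[:_]', "M_6:91")[0])   # 'M' -- the base tag used as the dict key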
Probably far from the cleanest, but here is a method that is quite easy to understand.
l = ["K:60", "M:37", "M_4:47", "M_5:89", "M_6:91", "N:15", "O:24", "P:50", "Q:50", "Q_7:89"]
reponse = []
val = []
complete_val = []
for x in l:
    if x[0] not in reponse:
        reponse.append(x[0])
        complete_val.append(x.split(':')[0])
        val.append(int(x.split(':')[1]))
    elif int(x.split(':')[1]) > val[reponse.index(x[0])]:
        val[reponse.index(x[0])] = int(x.split(':')[1])
for x in range(len(complete_val)):
    print(str(complete_val[x]) + ":" + str(val[x]))
K:60
M:91
N:15
O:24
P:50
Q:89
I do not see any straightforward technique. Other than iterating over the whole list and computing the result yourself, I do not see a built-in that can be used. I have written a version that does not require the values to be sorted in the input.
But I like the answer posted by Timur Shtatland; you can make use of that if your values are already sorted in the input.
intermediate = {}
for item in lst:  # lst is the input list from the question
    key, val = item.split(':')
    key = key.split('_')[0]
    val = int(val)
    if intermediate.get(key, (float('-inf'), None))[0] < val:
        intermediate[key] = (val, item)
ans = [x[1] for x in intermediate.values()]
print(ans)
which gives:
['K:60', 'M_6:91', 'N:15', 'O:24', 'P:50', 'Q_7:89']

Python: Concatenate similar objects in List

I have a list containing strings of the form 'Country-Points'.
For example:
lst = ['Albania-10', 'Albania-5', 'Andorra-0', 'Andorra-4', 'Andorra-8', ...other countries...]
I want to calculate the average for each country without creating a new list. So the output would be (in the case above):
lst = ['Albania-7.5', 'Andorra-4.25', ...other countries...]
Would really appreciate it if anyone can help me with this.
EDIT:
this is what I've got so far. So, "data" is actually a dictionary, where the keys are countries and the values are lists of other countries' points for this country (the one used as the key). Again, I'm new at Python so I don't really know all the built-in functions.
for key in self.data:
    lst = []
    index = 0
    score = 0
    cnt = 0
    s = str(self.data[key][0]).split("-")[0]
    for i in range(len(self.data[key])):
        if s in self.data[key][i]:
            a = str(self.data[key][i]).split("-")
            score += int(float(a[1]))
            cnt += 1
        index += 1
        if i+1 != len(self.data[key]) and not s in self.data[key][i+1]:
            lst.append(s + "-" + str(float(score/cnt)))
            s = str(self.data[key][index]).split("-")[0]
            score = 0
    self.data[key] = lst
itertools.groupby with a suitable key function can help:
import itertools

def get_country_name(item):
    return item.split('-', 1)[0]

def get_country_value(item):
    return float(item.split('-', 1)[1])

def country_avg_grouper(lst):
    for ctry, group in itertools.groupby(lst, key=get_country_name):
        values = list(get_country_value(c) for c in group)
        avg = sum(values) / len(values)
        yield '{country}-{avg}'.format(country=ctry, avg=avg)

lst[:] = country_avg_grouper(lst)
The key here is that I wrote a function to do the change out of place and then I can easily make the substitution happen in place by using slice assignment.
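For instance, assuming the country_avg_grouper defined above and the question's sample data, the in-place update via slice assignment looks like this (note that itertools.groupby expects the list to already be grouped by country, as it is here):

lst = ['Albania-10', 'Albania-5', 'Andorra-0', 'Andorra-4', 'Andorra-8']
lst[:] = country_avg_grouper(lst)   # the same list object now holds the averages
print(lst)                          # ['Albania-7.5', 'Andorra-4.0']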
I would probably do this with an intermediate dictionary.
def country(s):
    return s.split('-')[0]

def value(s):
    return float(s.split('-')[1])

def country_average(lst):
    country_map = {}
    for pair in lst:
        c = country(pair)
        v = value(pair)
        old = country_map.get(c, (0, 0))
        country_map[c] = (old[0] + v, old[1] + 1)
    return ['%s-%f' % (name, total / count)
            for (name, (total, count)) in country_map.items()]
It tries hard to traverse the original list only once, at the expense of quite a few tuple allocations.
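As a usage sketch (assuming the country_average function above and the question's sample data; note that %f prints six decimal places):

lst = ['Albania-10', 'Albania-5', 'Andorra-0', 'Andorra-4', 'Andorra-8']
print(country_average(lst))
# ['Albania-7.500000', 'Andorra-4.000000']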

Do dictionaries keep track of the point in time when an item was assigned?

I was coding a High Scores system where the user would enter a name and a score, then the program would test if the score was greater than the lowest score in high_scores. If it was, the score would be written and the lowest score deleted. Everything was working just fine, but I noticed something. The high_scores.txt file was like this:
PL1 50
PL2 50
PL3 50
PL4 50
PL5 50
PL1 was the first score added, PL2 was the second, PL3 the third, and so on. Then I tried adding another score, higher than all the others (PL6 60), and what happened was that the program assigned PL1 as the lowest score. PL6 was added and PL1 was deleted. That was exactly the behavior I wanted, but I don't understand how it happened. Do dictionaries keep track of the point in time when an item was assigned? Here's the code:
MAX_NUM_SCORES = 5

def getHighScores(scores_file):
    """Read scores from a file into a list."""
    try:
        cache_file = open(scores_file, 'r')
    except (IOError, EOFError):
        print("File is empty or does not exist.")
        return []
    else:
        lines = cache_file.readlines()
    high_scores = {}
    for line in lines:
        if len(high_scores) < MAX_NUM_SCORES:
            name, score = line.split()
            high_scores[name] = int(score)
        else:
            break
    return high_scores

def writeScore(file_, name, new_score):
    """Write score to a file."""
    if len(name) > 3:
        name = name[0:3]
    high_scores = getHighScores(file_)
    if high_scores:
        lowest_score = min(high_scores, key=high_scores.get)
        if new_score > high_scores[lowest_score] or len(high_scores) < 5:
            if len(high_scores) == 5:
                del high_scores[lowest_score]
            high_scores[name.upper()] = int(new_score)
        else:
            return 0
    else:
        high_scores[name.upper()] = int(new_score)
    write_file = open(file_, 'w')
    while high_scores:
        highest_key = max(high_scores, key=high_scores.get)
        line = highest_key + ' ' + str(high_scores[highest_key]) + '\n'
        write_file.write(line)
        del high_scores[highest_key]
    return 1

def displayScores(file_):
    """Display scores from file."""
    high_scores = getHighScores(file_)
    print("HIGH SCORES")
    if high_scores:
        while high_scores:
            highest_key = max(high_scores, key=high_scores.get)
            print(highest_key, high_scores[highest_key])
            del high_scores[highest_key]
    else:
        print("No scores yet.")

def resetScores(file_):
    open(file_, "w").close()
No. The results you got were due to arbitrary choices internal to the dict implementation that you cannot depend on always happening. (There is a subclass of dict that does keep track of insertion order, though: collections.OrderedDict. Note that since Python 3.7, regular dicts preserve insertion order as well.) I believe that with the current implementation, if you switch the order of the PL1 and PL2 lines, PL1 will probably still be deleted.
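A minimal illustration with toy data: min() with a key scans the dict in iteration order and keeps the first minimal item it sees, so with an insertion-ordered dict (OrderedDict, or a plain dict on Python 3.7+) the earliest-inserted name wins a tie on score:

from collections import OrderedDict

scores = OrderedDict()
for name in ("PL1", "PL2", "PL3"):
    scores[name] = 50       # all tied

lowest = min(scores, key=scores.get)
print(lowest)               # PL1 -- the first-inserted of the tied names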
As others noted, the order of items in the dictionary is "up to the implementation".
This answer is more a comment to your question, "how min() decides what score is the lowest?", but is much too long and format-y for a comment. :-)
The interesting thing is that both max and min can be used this way. The reason is that they (can) work on "iterables", and dictionaries are iterable:
for i in some_dict:
loops i over all the keys in the dictionary. In your case, the keys are the user names. Further, min and max allow passing a key argument to turn each candidate in the iterable into a value suitable for a binary comparison. Thus, min is pretty much equivalent to the following python code, which includes some tracing to show exactly how this works:
def like_min(iterable, key=None):
    it = iter(iterable)
    result = next(it)
    if key is None:
        min_val = result
    else:
        min_val = key(result)
    print('** initially, result is', result, 'with min_val =', min_val)
    for candidate in it:
        if key is None:
            cmp_val = candidate
        else:
            cmp_val = key(candidate)
        print('** new candidate:', candidate, 'with val =', cmp_val)
        if cmp_val < min_val:
            print('** taking new candidate')
            result = candidate
            min_val = cmp_val   # remember the new minimum for later comparisons
    return result
If we run the above on a sample dictionary d, using d.get as our key:
d = {'p': 0, 'ayyy': 3, 'b': 5, 'elephant': -17}
m = like_min(d, key=d.get)
print('like_min:', m)
** initially, result is ayyy with min_val = 3
** new candidate: p with val = 0
** taking new candidate
** new candidate: b with val = 5
** new candidate: elephant with val = -17
** taking new candidate
like_min: elephant
we find that we get the key whose value is the smallest. Of course, if multiple values are equal, the choice of "smallest" depends on the dictionary iteration order (and also whether min actually uses < or <= internally).
(Also, the method you use to "sort" the high scores to print them out is O(n²): pick the highest value, remove it from the dictionary, repeat until empty. This traverses n items, then n-1, ... then 2, then 1, i.e. n + (n-1) + ... + 2 + 1 = n(n+1)/2 steps = O(n²). Deleting the highest one is also an expensive operation, although it should still come in at or under O(n²), I think. With n=5 this is not that bad (5 * 6 / 2 = 15), but ... not elegant. :-) A sorted()-based alternative is sketched at the end of this question.)
This is pretty much what http://stromberg.dnsalias.org/~strombrg/python-tree-and-heap-comparison/ is about.
Short version: Get the treap module, which works like a sorted dictionary, and keep the keys in order. Or use the nest module to get the n greatest (or least) values automatically.
collections.OrderedDict is good for preserving insertion order, but not key order.
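As a footnote to the complexity point above, here is a minimal self-contained sketch (with hypothetical sample scores) of sorting once with sorted() instead of repeatedly calling max() and deleting:

high_scores = {'PL1': 50, 'PL2': 50, 'PL6': 60}
for name, score in sorted(high_scores.items(), key=lambda kv: kv[1], reverse=True):
    print(name, score)
# PL6 60
# PL1 50
# PL2 50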

Python: Check the occurrences in a list against a value

lst = [1,2,3,4,1]
I want to know that 1 occurs twice in this list. Is there any efficient way to do this?
lst.count(1) would return the number of times it occurs. If you're going to be counting items in a list, O(n) is what you're going to get.
The general function on the list is list.count(x), and will return the number of times x occurs in a list.
Are you asking whether every item in the list is unique?
len(set(lst)) == len(lst)
Whether 1 occurs more than once?
lst.count(1) > 1
Note that the above is not maximally efficient, because it won't short-circuit: even if 1 occurs twice, it will still count the rest of the occurrences. If you want it to short-circuit you will have to write something a little more complicated; see the sketch after this answer.
Whether the first element occurs more than once?
lst[0] in lst[1:]
How often each element occurs?
import collections
collections.Counter(lst)
Something else?
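As for the short-circuiting check mentioned above, a minimal sketch (the helper name is mine) that stops scanning as soon as a second occurrence of the target is found:

def occurs_at_least_twice(lst, target):
    seen_once = False
    for item in lst:
        if item == target:
            if seen_once:
                return True    # short-circuit: no need to scan the rest
            seen_once = True
    return False

print(occurs_at_least_twice([1, 2, 3, 4, 1], 1))  # True
print(occurs_at_least_twice([1, 2, 3, 4], 1))     # False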
For multiple occurrences, this gives you the index of each occurrence:
>>> lst = [1, 2, 3, 4, 5, 1]
>>> tgt = 1
>>> found = []
>>> for index, suspect in enumerate(lst):
...     if tgt == suspect:
...         found.append(index)
...
>>> print(len(found), "found at index:", ", ".join(map(str, found)))
2 found at index: 0, 5
If you want the count of each item in the list:
>>> lst = [1, 2, 3, 4, 5, 2, 2, 1, 5, 5, 5, 5, 6]
>>> count = {}
>>> for item in lst:
...     count[item] = lst.count(item)
...
>>> count
{1: 2, 2: 3, 3: 1, 4: 1, 5: 5, 6: 1}
def valCount(lst):
    res = {}
    for v in lst:
        try:
            res[v] += 1
        except KeyError:
            res[v] = 1
    return res

u = [x for x, y in valCount(lst).items() if y > 1]
u is now a list of all values which appear more than once.
Edit:
#katrielalex: thank you for pointing out collections.Counter, of which I was not previously aware. It can also be written more concisely using a collections.defaultdict, as demonstrated in the following tests. All three methods are roughly O(n) and reasonably close in run-time performance (using collections.defaultdict is in fact slightly faster than collections.Counter).
My intention was to give an easy-to-understand response to what seemed a relatively unsophisticated request. Given that, are there any other senses in which you consider it "bad code" or "done poorly"?
import collections
import random
import time

def test1(lst):
    res = {}
    for v in lst:
        try:
            res[v] += 1
        except KeyError:
            res[v] = 1
    return res

def test2(lst):
    res = collections.defaultdict(lambda: 0)
    for v in lst:
        res[v] += 1
    return res

def test3(lst):
    return collections.Counter(lst)

def rndLst(lstLen):
    r = random.randint
    return [r(0, lstLen) for i in range(lstLen)]

def timeFn(fn, *args):
    st = time.perf_counter()
    res = fn(*args)
    return time.perf_counter() - st

def main():
    reps = 5000
    res = []
    tests = [test1, test2, test3]
    for t in range(reps):
        lstLen = random.randint(10, 50000)
        lst = rndLst(lstLen)
        res.append([lstLen] + [timeFn(fn, lst) for fn in tests])
    res.sort()
    return res
And the results, for random lists containing up to 50,000 items, are as follows:
(Vertical axis is time in seconds, horizontal axis is number of items in list)
Another way to get all items that occur more than once:
lst = [1, 2, 3, 4, 1]
d = {}
for x in lst:
    d[x] = x in d               # becomes True once x has already been seen
print(d[1])                     # True
print(d[2])                     # False
print([x for x in d if d[x]])   # [1]
You could also sort the list, which is O(n log n), then check adjacent elements for equality, which is O(n); the result is O(n log n) overall. This has the disadvantage of requiring the entire list to be sorted before possibly bailing out when a duplicate is found.
For a large list with relatively rare duplicates, this could be about the best you can do. The best way to approach this really does depend on the size of the data involved and its nature.
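Here is a minimal sketch of that sort-then-compare-adjacent approach (the helper name is mine):

def has_duplicate(lst):
    s = sorted(lst)                                   # O(n log n)
    return any(a == b for a, b in zip(s, s[1:]))      # O(n) adjacent comparison

print(has_duplicate([1, 2, 3, 4, 1]))  # True
print(has_duplicate([1, 2, 3, 4]))     # False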
