I'm trying to change some elements of a list based on the properties of previous ones. Because I need to assign an intermediate variable, I don't think this can be done as a list comprehension. The following code, with comment, is what I'm trying to achieve:
for H in header:
    if "lower" in H.lower():
        pref="lower"
    elif "higher" in H.lower():
        pref="higher"
    if header.count(H) > 1:
        # change H inplace
        H = pref+H
The best solution I've come up with is:
for ii,H in enumerate(header):
    if "lower" in H.lower():
        pref="lower"
    elif "higher" in H.lower():
        pref="higher"
    if header.count(H) > 1:
        header[ii] = pref+H
It doesn't quite work, and feels un-pythonic to me because of the indexing. Is there a better way to do this?
Concrete example:
header = ['LowerLevel','Term','J','UpperLevel','Term','J']
desired output:
header = ['LowerLevel','LowerTerm','LowerJ','UpperLevel','UpperTerm','UpperJ']
Note that neither of my solutions works: the former never modifies header at all, and the latter only returns
header = ['LowerLevel','LowerTerm','LowerJ','UpperLevel','Term','J']
because count is wrong after the modifications.
header = ['LowerLevel','Term','J','UpperLevel','Term','J']
prefixes = ['lower', 'upper']
def prefixed(header):
    prefix = ''
    for h in header:
        for p in prefixes:
            if h.lower().startswith(p):
                # remember the prefix and strip it, so it is re-added exactly once below
                prefix, h = h[:len(p)], h[len(p):]
        yield prefix + h
print(list(prefixed(header)))
I don't really know that this is better than what you had. It's different...
$ ./lower.py
['LowerLevel', 'LowerTerm', 'LowerJ', 'UpperLevel', 'UpperTerm', 'UpperJ']
Something like this, using a generator function:
In [62]: def func(lis):
   ....:     pref=""
   ....:     for x in lis:
   ....:         if "lower" in x.lower():
   ....:             pref="Lower"
   ....:         elif "upper" in x.lower():
   ....:             pref="Upper"
   ....:         if lis.count(x)>1:
   ....:             yield pref+x
   ....:         else:
   ....:             yield x
   ....:
In [63]: list(func(header))
Out[63]: ['LowerLevel', 'LowerTerm', 'LowerJ', 'UpperLevel', 'UpperTerm', 'UpperJ']
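A variation on the same idea, as a rough sketch: count every name once up front with collections.Counter, so the counts cannot be thrown off by later modifications (the problem described in the question). The function name add_prefixes and the markers parameter are made up for illustration.

```python
from collections import Counter

def add_prefixes(header, markers=("Lower", "Upper")):
    # Count each name once, before anything is changed.
    counts = Counter(header)
    prefix = ""
    result = []
    for name in header:
        # Remember the most recent marker seen in a column name.
        for marker in markers:
            if marker.lower() in name.lower():
                prefix = marker
        # Only names that occur more than once get the prefix.
        result.append(prefix + name if counts[name] > 1 else name)
    return result

header = ['LowerLevel', 'Term', 'J', 'UpperLevel', 'Term', 'J']
print(add_prefixes(header))
```

Because it builds a new list, you can still write header[:] = add_prefixes(header) if the change has to happen in place.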
This should work for the data you presented.
from collections import defaultdict
def find_dups(seq):
    '''Finds duplicates in a sequence and returns a dict
    of value:occurrences'''
    seen = defaultdict(int)
    for curr in seq:
        seen[curr] += 1
    d = dict([(i, seen[i]) for i in seen if seen[i] > 1])
    return d
if __name__ == '__main__':
    header = ['LowerLevel','Term','J','UpperLevel','Term','J']
    d = find_dups(header)
    for i, s in enumerate(header):
        if s in d:
            # even remaining count -> first occurrence, odd -> second
            if d[s] % 2:
                pref = 'Upper'
            else:
                pref = 'Lower'
            header[i] = pref + s
            d[s] -= 1
But it gives me the creeps to suggest anything when I know so little about the entire set of data you will be working with.
good luck,
Mike
Related
In my list (not a dictionary) I have these strings:
"K:60",
"M:37",
"M_4:47",
"M_5:89",
"M_6:91",
"N:15",
"O:24",
"P:50",
"Q:50",
"Q_7:89"
In the output I need to have
"K:60",
"M_6:91",
"N:15",
"O:24",
"P:50",
"Q_7:89"
What is a possible solution?
Or maybe even: how do I take the entry with the maximum value among strings with the same tag?
Use re.split and a list comprehension as shown below. This relies on the fact that when the dictionary dct is created, only the last value is kept for each repeated key, so it assumes the entry you want comes last for each tag.
import re
lst = [
    "K:60",
    "M:37",
    "M_4:47",
    "M_5:89",
    "M_6:91",
    "N:15",
    "O:24",
    "P:50",
    "Q:50",
    "Q_7:89"
]
dct = dict([ (re.split(r'[:_]', s)[0], s) for s in lst])
lst_uniq = list(dct.values())
print(lst_uniq)
# ['K:60', 'M_6:91', 'N:15', 'O:24', 'P:50', 'Q_7:89']
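To make the "last value wins" behaviour concrete, here is a small sketch. The helper name last_per_tag is invented for this example. It also shows that the result depends on input order, so the trick only picks the maximum when the largest entry happens to come last.

```python
import re

def last_per_tag(items):
    # Hypothetical helper: map each tag (text before ':' or '_')
    # to the last string seen with that tag.
    return dict((re.split(r'[:_]', s)[0], s) for s in items)

# Building a dict from repeated keys keeps only the last value.
pairs = [("M", "M:37"), ("M", "M_6:91")]
print(dict(pairs))

# Same data in a different order gives a different winner.
print(last_per_tag(["M:37", "M_6:91"]))
print(last_per_tag(["M_6:91", "M:37"]))
```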
Probably far from the cleanest but here is a method quite easy to understand.
l = ["K:60", "M:37", "M_4:47", "M_5:89", "M_6:91", "N:15", "O:24", "P:50", "Q:50", "Q_7:89"]
reponse = []
val = []
complete_val = []
for x in l:
    if x[0] not in reponse:
        reponse.append(x[0])
        complete_val.append(x.split(':')[0])
        val.append(int(x.split(':')[1]))
    elif int(x.split(':')[1]) > val[reponse.index(x[0])]:
        val[reponse.index(x[0])] = int(x.split(':')[1])
for x in range(len(complete_val)):
    print(str(complete_val[x]) + ":" + str(val[x]))
K:60
M:91
N:15
O:24
P:50
Q:89
I do not see any straightforward technique. Other than iterating over the whole list and computing it yourself, I do not see a built-in that can be used. I have written this so that the values do not need to be sorted in your input.
But I like the answer posted by Timur Shtatland; you can make use of that if your values are already sorted in the input.
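# a is the input list of strings from the question
intermediate = {}
for item in a:
    key, val = item.split(':')
    key = key.split('_')[0]
    val = int(val)
    # keep the (value, original string) pair with the largest value per tag
    if intermediate.get(key, (float('-inf'), None))[0] < val:
        intermediate[key] = (val, item)
ans = [x[1] for x in intermediate.values()]
print(ans)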
which gives:
['K:60', 'M_6:91', 'N:15', 'O:24', 'P:50', 'Q_7:89']
I am writing Python code to remove characters that are equal at the same index in two strings. For example, remove_same('ABCDE', 'ACBDE') should reduce the arguments to BC and CB. I know that strings are immutable, so I have converted them to lists. I am getting an index out of range error.
def remove_same(l_string, r_string):
    l_list = list(l_string)
    r_list = list(r_string)
    i = 0
    while i != len(l_list):
        print(f'in {i} length is {len(l_list)}')
        while l_list[i] == r_list[i]:
            l_list.pop(i)
            r_list.pop(i)
            if i == len(l_list) - 1:
                break
        if i != len(l_list):
            i += 1
    return l_list[0] == r_list[0]
I would avoid using a while loop in that case. I think this is a better and clearer solution:
def remove_same(s1, s2):
    l1 = list(s1)
    l2 = list(s2)
    out1 = []
    out2 = []
    # walk both strings in parallel and keep only the differing characters
    for c1, c2 in zip(l1, l2):
        if c1 != c2:
            out1.append(c1)
            out2.append(c2)
    s1_out = "".join(out1)
    s2_out = "".join(out2)
    print(s1_out)
    print(s2_out)
It could be shortened using some list comprehensions, but I was trying to be as explicit as possible.
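For reference, a shortened version along those lines might look like the sketch below. Returning a tuple of two strings instead of printing them is my own choice here, not something the question asked for.

```python
def remove_same(s1, s2):
    # Keep only the positions where the two strings differ.
    pairs = [(c1, c2) for c1, c2 in zip(s1, s2) if c1 != c2]
    left = "".join(c1 for c1, _ in pairs)
    right = "".join(c2 for _, c2 in pairs)
    return left, right

print(remove_same('ABCDE', 'ACBDE'))
```

Note that zip stops at the shorter string, so any extra tail of a longer string is dropped.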
I think this is the problem:
while l_list[i] == r_list[i]:
    l_list.pop(i)
    r_list.pop(i)
Each pop shrinks the lists, so their length can drop to i or below, and then l_list[i] is no longer a valid index.
Do a dry run with l_list = ["a"] and r_list = ["a"].
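That dry run can be replayed in code. This sketch copies only the inner while loop from the question (including its break check) into a helper I have named inner_loop_dry_run, and catches the exception so the outcome can be inspected.

```python
def inner_loop_dry_run(l_list, r_list, i=0):
    """Replay only the inner while loop from the question."""
    try:
        while l_list[i] == r_list[i]:
            l_list.pop(i)
            r_list.pop(i)
            if i == len(l_list) - 1:
                break
    except IndexError as exc:
        return "IndexError: %s" % exc
    return "ok"

left, right = ["a"], ["a"]
outcome = inner_loop_dry_run(left, right)
print(outcome, left, right)
```

After the single pop both lists are empty, the break check compares 0 with -1 and does not fire, and the next evaluation of l_list[0] raises.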
It is in general not a good idea to modify a list while looping over it. Here is a cleaner, more Pythonic solution. The two strings are zipped and processed in parallel. Each pair of equal characters is discarded, and the remaining characters are arranged into new strings.
a = 'ABCDE'
b = 'ACFDE'
def remove_same(s1, s2):
    return ["".join(s) for s
            in zip(*[(x,y) for x,y in zip(s1,s2) if x!=y])]
remove_same(a, b)
#['BC', 'CF']
Here you go:
def remove_same(l_string, r_string):
    # if either string is empty, return False
    if not l_string or not r_string:
        return False
    l_list = list(l_string)
    r_list = list(r_string)
    limit = min(len(l_list), len(r_list))
    i = 0
    while i < limit:
        if l_list[i] == r_list[i]:
            l_list.pop(i)
            r_list.pop(i)
            limit -= 1
        else:
            i += 1
    return l_list[0] == r_list[0]
print(remove_same('ABCDE', 'ACBDE'))
Output:
False
My problem is with my second function.
This is my code so far....
def simi(d1,d2):
    dna_1 = d1.lower()
    dna_2 = d2.lower()
    lst = []
    i = 0
    while i < len(dna_1):
        if dna_1[i] == dna_2[i]:
            lst.append(1)
        i += 1
    return len(lst) / len(d1)
def match(list_1, d , s):
    dna = []
    for item in list_1:
        dna.append(simi(item, d))
        if max(dna) < s:
            return None
        return list_1[max(dna)]
You have two problems. First, you return inside the loop before you have tried all the elements. Second, your function simi(item, d) returns a float when it works correctly, so trying to index a list with a float will also fail. There is no way your code could do anything but raise an error or return None.
I imagine you want to keep track of the best item on each iteration and return it if its simi score is greater than s, or else return None:
def match(list_1, d , s):
    best = None
    mx = float("-inf")
    for item in list_1:
        f = simi(item, d)
        if f > mx:
            mx = f
            best = item
    return best if mx > s else None
You can also use range in simi instead of your while loop with a list comp:
def simi(d1,d2):
    dna_1 = d1.lower()
    dna_2 = d2.lower()
    lst = [1 for i in range(len(dna_1)) if dna_1[i] == dna_2[i] ]
    return len(lst) / len(dna_1)
But if you just want to add 1 each time the condition is True, you can use sum:
def simi(d1,d2):
    dna_1 = d1.lower()
    dna_2 = d2.lower()
    sm = sum(dna_1[i] == dna_2[i] for i in range(len(dna_1)))
    return sm / len(dna_1)
Using some builtins:
from functools import partial
similarity_with_sample = partial(simi, 'TACgtAcGaCGT')
Now similarity_with_sample is a function that takes one argument, and returns its similarity with 'TACgtAcGaCGT'.
Now use that as the key argument of the builtin max function:
best_match = max(list_of_samples, key=similarity_with_sample)
I'm not sure what your s variable is doing.
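If s is meant as a minimum similarity, which is my reading of the question's `if max(dna) < s` check and only an assumption, the max/key idea can be combined with it roughly like this. The simi below is a zip-based rewrite rather than the asker's original, and it assumes Python 3 division.

```python
from functools import partial

def simi(d1, d2):
    # Fraction of positions (relative to d1) where the two strings agree,
    # ignoring case. zip stops at the shorter string.
    a, b = d1.lower(), d2.lower()
    return sum(x == y for x, y in zip(a, b)) / len(a)

def match(samples, target, threshold):
    # Assumption: threshold plays the role of s in the question.
    if not samples:
        return None
    score = partial(simi, target)
    best = max(samples, key=score)
    return best if score(best) >= threshold else None

print(match(['AAAA', 'ACGT', 'ACGA'], 'ACGT', 0.5))
```

Note that partial(simi, target) fixes the target as the first argument, so the score is normalised by the target's length rather than by each sample's length.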
I have a list containing strings as ['Country-Points'].
For example:
lst = ['Albania-10', 'Albania-5', 'Andorra-0', 'Andorra-4', 'Andorra-8', ...other countries...]
I want to calculate the average for each country without creating a new list. So the output would be (in the case above):
lst = ['Albania-7.5', 'Andorra-4.25', ...other countries...]
I would really appreciate it if anyone could help me with this.
EDIT:
This is what I've got so far. "data" is actually a dictionary, where the keys are countries and the values are lists of the points other countries gave to this country (the one used as the key). Again, I'm new to Python, so I don't really know all the built-in functions.
for key in self.data:
    lst = []
    index = 0
    score = 0
    cnt = 0
    s = str(self.data[key][0]).split("-")[0]
    for i in range(len(self.data[key])):
        if s in self.data[key][i]:
            a = str(self.data[key][i]).split("-")
            score += int(float(a[1]))
            cnt+=1
            index+=1
        if i+1 != len(self.data[key]) and not s in self.data[key][i+1]:
            lst.append(s + "-" + str(float(score/cnt)))
            s = str(self.data[key][index]).split("-")[0]
            score = 0
    self.data[key] = lst
itertools.groupby with a suitable key function can help:
import itertools
def get_country_name(item):
    return item.split('-', 1)[0]
def get_country_value(item):
    return float(item.split('-', 1)[1])
def country_avg_grouper(lst) :
    for ctry, group in itertools.groupby(lst, key=get_country_name):
        values = list(get_country_value(c) for c in group)
        avg = sum(values)/len(values)
        yield '{country}-{avg}'.format(country=ctry, avg=avg)
lst[:] = country_avg_grouper(lst)
The key here is that I wrote a function to do the change out of place and then I can easily make the substitution happen in place by using slice assignment.
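To illustrate the difference between slice assignment and plain rebinding, here is a toy sketch. The generator doubled is a made-up stand-in for country_avg_grouper.

```python
def doubled(values):
    # Stand-in for any out-of-place transformation written as a generator.
    for v in values:
        yield v * 2

# Slice assignment replaces the contents of the existing list object,
# so every other reference to that list sees the change.
data = [1, 2, 3]
alias = data
data[:] = doubled(data)

# Plain assignment only rebinds the name; the old list is untouched.
rebound = [1, 2, 3]
other = rebound
rebound = list(doubled(rebound))

print(alias, other)
```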
I would probably do this with an intermediate dictionary.
def country(s):
    return s.split('-')[0]
def value(s):
    return float(s.split('-')[1])
def country_average(lst):
    country_map = {}
    for pair in lst:
        c = country(pair)
        v = value(pair)
        # running (total, count) per country
        old = country_map.get(c, (0, 0))
        country_map[c] = (old[0]+v, old[1]+1)
    return ['%s-%f' % (name, total/count)
            for (name, (total, count)) in country_map.items()]
It traverses the original list only once, at the expense of quite a few tuple allocations.
Okay, basically what I want is to compress a file by reusing code and then at runtime replace the missing code. What I've come up with is really ugly and slow, but at least it works. The problem is that the file has no specific structure, for example 'aGVsbG8=\n'; as you can see, it's base64 encoding. My function is really slow because the length of the file is 1700+ and it checks for patterns 1 character at a time. Please help me with new, better code or at least help me optimize what I've got :). Anything that helps is welcome! BTW, I have already tried compression libraries, but they didn't compress as well as my ugly function.
def c_long(inp, cap=False, b=5):
    import re,string
    if cap is False: cap = len(inp)
    es = re.escape; le=len; ref = re.findall; ran = range; fi = string.find
    c = b;inpc = inp;pattern = inpc[:b]; l=[]
    rep = string.replace; ins = list.insert
    while True:
        if c == le(inpc) and le(inpc) > b+1: c = b; inpc = inpc[1:]; pattern = inpc[:b]
        elif le(inpc) <= b: break
        if c == cap: c = b; inpc = inpc[1:]; pattern = inpc[:b]
        p = ref(es(pattern),inp)
        pattern += inpc[c]
        if le(p) > 1 and le(pattern) >= b+1:
            if l == []: l = [[pattern,le(p)+le(pattern)]]
            elif le(ref(es(inpc[:c+2]),inp))+le(inpc[:c+2]) < le(p)+le(pattern):
                x = [pattern,le(p)+le(inpc[:c+1])]
                for i in ran(le(l)):
                    if x[1] >= l[i][1] and x[0][:-1] not in l[i][0]: ins(l,i,x); break
                    elif x[1] >= l[i][1] and x[0][:-1] in l[i][0]: l[i] = x; break
                inpc = inpc[:fi(inpc,x[0])] + inpc[le(x[0]):]
                pattern = inpc[:b]
                c = b-1
        c += 1
    d = {}; c = 0
    s = ran(le(l))
    for x in l: inp = rep(inp,x[0],'{%d}' % s[c]); d[str(s[c])] = x[0]; c += 1
    return [inp,d]
def decompress(inp,l): return apply(inp.format, [l[str(x)] for x in sorted([int(x) for x in l.keys()])])
The easiest way to compress base64-encoded data is to first convert it to binary data -- this will already save 25 percent of the storage space:
>>> s = "YWJjZGVmZ2hpamtsbW5vcHFyc3R1dnd4eXo=\n"
>>> t = s.decode("base64")
>>> len(s)
37
>>> len(t)
26
In most cases, you can compress the string even further using some compression algorithm, like t.encode("bz2") or t.encode("zlib").
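The snippets above use the Python 2 string codecs. A rough Python 3 equivalent, using the standard base64 and zlib modules, could look like this:

```python
import base64
import zlib

encoded = "YWJjZGVmZ2hpamtsbW5vcHFyc3R1dnd4eXo=\n"

# Step 1: undo the base64 encoding to get the raw bytes.
raw = base64.b64decode(encoded)

# Step 2: optionally run the raw bytes through a general-purpose compressor.
packed = zlib.compress(raw)
restored = zlib.decompress(packed)

print(len(encoded), len(raw), restored == raw)
```

For an input this short and this varied, zlib is not expected to shrink it further; the second step only starts to pay off on longer data with repetition.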
A few remarks on your code: There are lots of factors that make the code hard to read: inconsistent spacing, overly long lines, meaningless variable names, unidiomatic code, etc. An example: Your decompress() function could be equivalently written as
def decompress(compressed_string, substitutions):
    subst_list = [substitutions[k] for k in sorted(substitutions, key=int)]
    return compressed_string.format(*subst_list)
Now it's already much more obvious what it does. You could go one step further: Why is substitutions a dictionary with the string keys "0", "1" etc.? Not only is it strange to use strings instead of integers -- you don't need the keys at all! A simple list will do, and decompress() will simplify to
def decompress(compressed_string, substitutions):
    return compressed_string.format(*substitutions)
You might think all this is secondary, but if you make the rest of your code equally readable, you will find the bugs in your code yourself. (There are bugs -- it crashes for "abcdefgabcdefg" and many other strings.)
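A small usage sketch of the list-based decompress(), with a hand-written compressed string and substitution list that are invented purely for illustration:

```python
def decompress(compressed_string, substitutions):
    # Each "{n}" placeholder is replaced by substitutions[n].
    return compressed_string.format(*substitutions)

# Hypothetical example data: two repeated chunks and their placeholders.
substitutions = ["aGVsbG8", "d29ybGQ"]
compressed = "{0}={1}={0}"
restored = decompress(compressed, substitutions)
print(restored)
```

This relies on the compressed text containing no literal braces, which holds for base64 data since neither brace is in its alphabet.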
Typically one would pump the program through a compression algorithm optimized for text, then run that through exec, e.g.
code="""..."""
exec(somelib.decompress(code), globals=???, locals=???)
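As a concrete sketch of that pattern, with zlib standing in for the unspecified somelib (my choice, not something the answer prescribes) and a throwaway dict used as the namespace:

```python
import zlib

# Stand-in for the real program text.
source = "result = sum(range(10))\n"

# What you would ship: the compressed bytes.
packed = zlib.compress(source.encode("utf-8"))

# At runtime: decompress and execute in an explicit namespace.
namespace = {}
exec(zlib.decompress(packed).decode("utf-8"), namespace)
print(namespace["result"])
```

Passing an explicit dict keeps the executed code from touching the caller's globals; whether that isolation is wanted depends on what the real program expects.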
It may be the case that .pyc/.pyo files are compressed already, and one could check by creating one with x="""aaaaaaaa""", then increasing the length to x="""aaaaaaaaaaaaaaaaaaaaaaa...aaaa""" and seeing if the size changes appreciably.