I'm not a Python expert, and I ran into this snippet of code which works and produces the correct answer, but I'm not sure I understand what happens in its second and third lines:
for i in range(len(motifs[0])):
best = ''.join([motifs[j][i] for j in range(len(motifs))])
profile.append([(best.count(base)+1)/float(len(best)) for base in 'ACGT'])
I was trying to replace it with something like:
for i in range(len(motifs[0])):
for j in range(len(motifs)):
best =[motifs[j][i]]
profile.append([(best.count(base)+1)/float(len(best)) for base in 'ACGT'])
and also tried to break down the last line like this:
for i in range(len(motifs[0])):
for j in range(len(motifs)):
best =[motifs[j][i]]
for base in 'ACGT':
profile.append(best.count(base)+1)/float(len(best)
I tried some more variations but none of them worked.
My question is: what do those expressions (the second and third lines of the first snippet) mean, and how would you break them down into a few lines?
Thanks :)
''.join([motifs[j][i] for j in range(len(motifs))])
is idiomatically written
''.join(m[i] for m in motifs)
so it concatenates the i'th character of every motif, in order; in other words, it builds the i'th column of the motif matrix as a string. Similarly,
[(best.count(base)+1)/float(len(best)) for base in 'ACGT']
builds a list of four values, one per base in 'ACGT': the number of times that base occurs in the column, plus a pseudocount of 1, divided by the number of motifs. For example, if the column is 'ACCA', it produces [0.75, 0.75, 0.25, 0.25].
for i in range(len(motifs[0])):
    seq = ''.join([motifs[j][i] for j in range(len(motifs))])
    profile.append([(seq.count(base)+1)/float(len(seq)) for base in 'ACGT'])
is equivalent to:
for i in range(len(motifs[0])):
    seq = ''
    for j in range(len(motifs)):
        seq += motifs[j][i]
    profile.append([(seq.count(base)+1)/float(len(seq)) for base in 'ACGT'])
which can be improved in countless ways.
For example, you can iterate over the columns directly with zip instead of indexing:
for column in zip(*motifs):
    seq = ''.join(column)
    profile.append([(seq.count(base) + 1) / float(len(seq)) for base in 'ACGT'])
I can't fully test this without your input/output conditions, but it computes the same profile.
Closest I got without being able to test it:
for i, _ in enumerate(motifs[0]):
    seq = ""
    for m in motifs:
        seq += m[i]
    tmp = []
    for base in "ACGT":
        tmp.append((seq.count(base) + 1) / float(len(seq)))
    profile.append(tmp)
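Putting it together, here is a minimal sketch of the whole computation as one function; the name build_profile and the sample motifs are mine, not from the question:

def build_profile(motifs):
    # One row per column of the motif matrix, using the question's +1 pseudocount.
    profile = []
    for column in zip(*motifs):                 # column i holds motifs[j][i] for every j
        seq = ''.join(column)
        profile.append([(seq.count(base) + 1) / float(len(seq)) for base in 'ACGT'])
    return profile

motifs = ['ACCA', 'ACGT', 'CCGT', 'ACGA']       # made-up example input
print(build_profile(motifs)[0])                 # [1.0, 0.5, 0.25, 0.25]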
I need to loop through the n lines of a file and, for every i between 1 and n-1, get the difference between the words of line(n-1) and line(n) (e.g. line[i]word[j] - line[i+1]word[j], etc.).
Input :
Hey there !
Hey thre !
What a suprise.
What a uprise.
I don't know what to do.
I don't know wt to do.
Output:
e
s
ha
The goal is to extract only the character(s) missing between the corresponding words of two consecutive lines.
I'm new to Python, so if you can guide me through writing the code, I would be more than thankful.
Without any lib:
def extract_missing_chars(s1, s2):
    # Always treat the longer string as the reference.
    if len(s1) < len(s2):
        return extract_missing_chars(s2, s1)
    i = 0
    to_return = []
    for c in s1:
        # Advance through s2 on a match; anything unmatched in s1 is a missing character.
        if i < len(s2) and s2[i] == c:
            i += 1
        else:
            to_return.append(c)
    return to_return

f = open('testfile')
l1 = f.readline()
while l1:
    l2 = f.readline()
    print(''.join(extract_missing_chars(l1, l2)))
    l1 = f.readline()
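The same driver can also be written by pairing the lines up front; this is just an alternative sketch, still assuming a file named 'testfile' like above:

with open('testfile') as f:
    lines = [line.rstrip('\n') for line in f]

for l1, l2 in zip(lines[::2], lines[1::2]):
    print(''.join(extract_missing_chars(l1, l2)))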
Your example indicates that you want comparisons between pairs of lines. This is different from defining it as line(n-1)-line(n), which would give you 5 results, not 3.
The result also depends on what you consider to be a difference. Is it positional, is it simply based on letters missing from the odd lines, or do the differences apply in both directions?
(e.g. "boat"-"tub" = "boat", "oa" or "oau" ?).
You also have to decide whether you want the differences to be case-sensitive or not.
Here's an example where computation of the differences is centralized in a function so that you can change the rules more easily. It assumes that "boat"-"tub" = "oau".
lines = """Hey there !
Hey thre !
What a suprise.
What a uprise.
I don't know what to do.
I don't know wt to do.
""".split('\n')
def differences(word1,word2):
if isinstance(word1,list):
return "".join( differences(w1,w2) for w1,w2 in zip(word1+[""]*len(word2),word2+[""]*len(word1)) )
return "".join( c*abs(word1.count(c)-word2.count(c)) for c in set(word1+word2) )
result = [ differences(line1.split(),line2.split()) for line1,line2 in zip(lines[::2],lines[1::2]) ]
# ['e', 's', 'ha']
Note that line processing for result is based on your example (not on your definition).
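For the single-word "boat"/"tub" case discussed above, a quick check (the characters come back in set order, so their order may vary between runs):

print(differences("boat", "tub"))   # e.g. 'oau' (character order may vary)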
clean_offset = len(malware)
tuple_clean = []
tuple_malware = []
for i in malware:
tuple_malware.append([malware.index(i), 0])
print(malware.index(i))
print(tuple_malware)
for j in clean:
tuple_clean.append([(clean_offset + clean.index(j)), 1])
print(clean.index(j))
print(tuple_clean)
import pdb; pdb.set_trace()
training_data_size_mal = 0.8 * len(malware)
training_data_size_clean = 0.8 * len(clean)
i increments as normal and produces the correct output; however, j remains at 0 for three loops and then jumps to 3. I don't understand this.
There is a logical error in clean.index(j).
list.index returns the index of the first match in the list.
So if the list contains duplicate values, every duplicate maps back to the same (first) index.
You can see this with the code below.
malware = [1,2,3,4,5,6,7,8,8,8,8,8,2]
clean = [1,2,3,4,4,4,4,4,4,2,4,4,4,4]
clean_offset = len(malware)
tuple_clean = []
tuple_malware = []
for i in malware:
tuple_malware.append([malware.index(i), 0])
print(malware.index(i))
print(tuple_malware)
for j in clean:
tuple_clean.append([(clean_offset + clean.index(j)), 1])
print(clean.index(j))
print(tuple_clean)
training_data_size_mal = 0.8 * len(malware)
training_data_size_clean = 0.8 * len(clean)
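To see the effect in isolation (same clean list as above):

clean = [1, 2, 3, 4, 4, 4, 4, 4, 4, 2, 4, 4, 4, 4]
print(clean.index(4))   # 3: the position of the first 4, whichever 4 you meant
print(clean.index(2))   # 1: even for the 2 at position 9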
for a in something
a is what is contained in something, not the index
for example:
for n in [1, 10, 9, 3]:
print(n)
gives
1
10
9
3
You either want
for i in range(len(malware))
or
for i, element in enumerate(malware)
at which point i is the index and element is the item at that index, so you no longer need malware.index(i).
The last one is considered best practice when you need both the index and the element inside the loop.
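A minimal sketch of that fix applied to the lists from the question (the print of the first few entries is only there to show the result):

malware = [1, 2, 3, 4, 5, 6, 7, 8, 8, 8, 8, 8, 2]
clean = [1, 2, 3, 4, 4, 4, 4, 4, 4, 2, 4, 4, 4, 4]
clean_offset = len(malware)

tuple_malware = [[i, 0] for i, _ in enumerate(malware)]               # position, label 0
tuple_clean = [[clean_offset + j, 1] for j, _ in enumerate(clean)]    # offset position, label 1
print(tuple_malware[:3])   # [[0, 0], [1, 0], [2, 0]]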
The OP has already figured this out, but in case anyone is wondering or needs a TL;DR of Barkin's comment, it's just a small correction:
replace
for i in malware
for j in clean
with
for i in range(len(malware))
for j in range(len(clean))
and at the end remove the .index() calls and use i and j directly.
I have attempted to write a program which asks the user for a string and a number (on the same line) and then prints all possible combinations of the string up to the size of the number. The output should be in all capitals, with each combination on its own line, ordered by length (shortest first) and alphabetically within each length.
My code outputs the right combinations in the right order, but it prints an empty line before the output and I'm not sure why.
from itertools import combinations
allcombo = []
S = input().strip()
inputlist = S.split()
k = int(inputlist[1])
S = inputlist[0]
#
for L in range(0, k+1):
allcombo = []
for pos in combinations(S, L):
pos = sorted(pos)
pos = str(pos).translate({ord(c): None for c in "[]()', "})
allcombo.append(pos)
allcombo = sorted(allcombo)
print(*allcombo, sep = '\n')
Input:
HACK 2
Output:
(Empty Line)
A
C
H
K
AC
AH
AK
CH
CK
HK
Also I've only been coding for about a week so if anyone would like to show me how to write this properly, I'd be very pleased.
Observe the line:
for L in range(0, k+1) # Notice that L is starting at 0.
Now, observe this line:
for pos in combinations(S, L)
So, we will have the following during our first iteration of the inner for loop:
for pos in combinations(S, 0) # combinations(S, 0) yields exactly one item: the empty tuple ().
So on that first pass the inner loop runs once with an empty tuple, which your translate call turns into an empty string; allcombo becomes [''] and printing it produces the blank line you are seeing.
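A quick way to see this for yourself:

from itertools import combinations
print(list(combinations('HACK', 0)))   # [()]  one empty tuple, not an empty sequence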
Change the following code:
for L in range(0, k+1)
to this:
for L in range(1, k+1) # Skips the zero-length combination since L starts at 1.
and this will fix your problem.
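Since you also asked how the program could be written more cleanly, here is one compact version; treat it as a sketch rather than a drop-in replacement, since I haven't run it against your exact input:

from itertools import combinations

s, k = input().split()                      # e.g. "HACK 2"
for length in range(1, int(k) + 1):
    for combo in combinations(sorted(s.upper()), length):
        print(''.join(combo))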
This is a follow up question to this response and the pseudo-code algorithm that the user posted. I didn't comment on that question because of its age. I am only interested in validating whether or not a string can be split into words. The algorithm doesn't need to actually split the string. This is the response from the linked question:
Let S[1..length(w)] be a table with Boolean entries. S[i] is true if
the word w[1..i] can be split. Then set S[1] = isWord(w[1]) and for
i = 2 to length(w) calculate
S[i] = isWord(w[1..i]) or (for some j in {2..i}: S[j-1] and
isWord(w[j..i])).
I'm translating this algorithm into simple python code, but I'm not sure if I'm understanding it properly. Code:
def is_all_words(a_string, dictionary)):
str_len = len(a_string)
S = [False] * str_len
S[0] = is_word(a_string[0], dictionary)
for i in range(1, str_len):
check = is_word(a_string[0:i], dictionary)
if (check):
S[i] = check
else:
for j in range(1, str_len):
check = (S[j - 1] and is_word(a_string[j:i]), dictionary)
if (check):
S[i] == True
break
return S
I have two related questions. 1) Is this code a proper translation of the linked algorithm into Python, and if it is, 2) Now that I have S, how do I use it to tell if the string is only comprised of words? In this case, is_word is a function that simply looks a given word up in a list. I haven't implemented it as a trie yet.
UPDATE: After updating the code to include the suggested change, it doesn't work. This is the updated code:
def is_all_words(a_string, dictionary)):
str_len = len(a_string)
S = [False] * str_len
S[0] = is_word(a_string[0], dictionary)
for i in range(1, str_len):
check = is_word(a_string[0:i], dictionary)
if (check):
S[i] = check
else:
for j in range(1, i): #THIS LINE WAS UPDATED
check = (S[j - 1] and is_word(a_string[j:i]), dictionary)
if (check):
S[i] == True
break
return S
a_string = "carrotforever"
S = is_all_words(a_string, dictionary)
print(S[len(S) - 1]) #prints FALSE
a_string = "hello"
S = is_all_words(a_string, dictionary)
print(S[len(S) - 1]) #prints TRUE
It should return True for both of these.
Here is a modified version of your code that should return good results.
Notice that your mistake was simply in the translation from pseudocode array indexing (starting at 1) to Python array indexing (starting at 0): S[0] and S[1] were populated with the same value, and the entry for the full string was never actually computed. You can easily trace this by printing the whole S array. You will find that S[3] is set to True in the first example, where it should be S[2] for the word "car".
Also, you could speed up the process by storing the indices of the composite words found so far, instead of testing each position.
def is_all_words(a_string, dictionary):
str_len = len(a_string)
S = [False] * (str_len)
# I replaced is_word function by a simple list lookup,
# feel free to replace it with whatever function you use.
# tries or suffix tree are best for this.
S[0] = (a_string[0] in dictionary)
for i in range(1, str_len):
check = a_string[0:i+1] in dictionary # i+1 instead of i
if (check):
S[i] = check
else:
for j in range(0,i+1): # i+1 instead of i
if (S[j-1] and (a_string[j:i+1] in dictionary)): # i+1 instead of i
S[i] = True
break
return S
a_string = "carrotforever"
S = is_all_words(a_string, ["a","car","carrot","for","eve","forever"])
print(S[len(a_string)-1]) #prints TRUE
a_string = "helloworld"
S = is_all_words(a_string, ["hello","world"])
print(S[len(a_string)-1]) #prints TRUE
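As for the second part of the question (how to use S): the string can be split into dictionary words exactly when the last entry of S is True, so a thin wrapper is enough (the name can_segment is mine):

def can_segment(a_string, dictionary):
    S = is_all_words(a_string, dictionary)
    return bool(S) and S[-1]

print(can_segment("carrotforever", ["a", "car", "carrot", "for", "eve", "forever"]))  # True
print(can_segment("carrotxforever", ["car", "carrot", "for", "forever"]))             # False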
For a real-world example of how to do English word segmentation, look at the source of the Python wordsegment module. It's a little more sophisticated because it uses word and phrase frequency tables but it illustrates the recursive approach. By modifying the score function you can prioritize longer matches.
Installation is easy with pip:
$ pip install wordsegment
And segment returns a list of words:
>>> import wordsegment
>>> wordsegment.segment('carrotforever')
['carrot', 'forever']
1) at first glance, looks good. One thing: for j in range(1, str_len): should be for j in range(1, i): I think
2) If S[str_len-1] == True, then the whole string consists of dictionary words only.
After all, S[i] is True iff
the whole string from 0 to i consists of a single dictionary word,
OR there exists an S[j-1] == True with j < i such that string[j:i] is a single dictionary word;
so if S[str_len-1] is True, then the whole string is composed of dictionary words.
Okay, basically what I want is to compress a file by reusing code and then, at runtime, replace the missing code. What I've come up with is really ugly and slow, but at least it works. The problem is that the file has no specific structure, for example 'aGVsbG8=\n', as you can see it's base64 encoding. My function is really slow because the length of the file is 1700+ and it checks for patterns one character at a time. Please help me with new, better code, or at least help me optimize what I've got :). Anything that helps is welcome! BTW I have already tried compression libraries, but they didn't compress as well as my ugly function.
def c_long(inp, cap=False, b=5):
import re,string
if cap is False: cap = len(inp)
es = re.escape; le=len; ref = re.findall; ran = range; fi = string.find
c = b;inpc = inp;pattern = inpc[:b]; l=[]
rep = string.replace; ins = list.insert
while True:
if c == le(inpc) and le(inpc) > b+1: c = b; inpc = inpc[1:]; pattern = inpc[:b]
elif le(inpc) <= b: break
if c == cap: c = b; inpc = inpc[1:]; pattern = inpc[:b]
p = ref(es(pattern),inp)
pattern += inpc[c]
if le(p) > 1 and le(pattern) >= b+1:
if l == []: l = [[pattern,le(p)+le(pattern)]]
elif le(ref(es(inpc[:c+2]),inp))+le(inpc[:c+2]) < le(p)+le(pattern):
x = [pattern,le(p)+le(inpc[:c+1])]
for i in ran(le(l)):
if x[1] >= l[i][1] and x[0][:-1] not in l[i][0]: ins(l,i,x); break
elif x[1] >= l[i][1] and x[0][:-1] in l[i][0]: l[i] = x; break
inpc = inpc[:fi(inpc,x[0])] + inpc[le(x[0]):]
pattern = inpc[:b]
c = b-1
c += 1
d = {}; c = 0
s = ran(le(l))
for x in l: inp = rep(inp,x[0],'{%d}' % s[c]); d[str(s[c])] = x[0]; c += 1
return [inp,d]
def decompress(inp,l): return apply(inp.format, [l[str(x)] for x in sorted([int(x) for x in l.keys()])])
The easiest way to compress base64-encoded data is to first convert it to binary data -- this will already save 25 percent of the storage space:
>>> s = "YWJjZGVmZ2hpamtsbW5vcHFyc3R1dnd4eXo=\n"
>>> t = s.decode("base64")
>>> len(s)
37
>>> len(t)
26
In most cases, you can compress the string even further using some compression algorithm, like t.encode("bz2") or t.encode("zlib").
A few remarks on your code: There are lots of factors that make the code hard to read: inconsistent spacing, overly long lines, meaningless variable names, unidiomatic code, etc. An example: Your decompress() function could be equivalently written as
def decompress(compressed_string, substitutions):
subst_list = [substitutions[k] for k in sorted(substitutions, key=int)]
return compressed_string.format(*subst_list)
Now it's already much more obvious what it does. You could go one step further: Why is substitutions a dictionary with the string keys "0", "1" etc.? Not only is it strange to use strings instead of integers -- you don't need the keys at all! A simple list will do, and decompress() will simplify to
def decompress(compressed_string, substitutions):
return compressed_string.format(*substitutions)
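For example (substitution values invented for illustration):

print(decompress("{0} says {1}", ["spam", "eggs"]))   # spam says eggs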
You might think all this is secondary, but if you make the rest of your code equally readable, you will find the bugs in your code yourself. (There are bugs -- it crashes for "abcdefgabcdefg" and many other strings.)
Typically one would pump the program through a compression algorithm optimized for text, then run that through exec, e.g.
code="""..."""
exec(somelib.decompress(code), globals=???, locals=???)
It may be the case that .pyc/.pyo files are compressed already, and one could check by creating one with x="""aaaaaaaa""", then increasing the length to x="""aaaaaaaaaaaaaaaaaaaaaaa...aaaa""" and seeing if the size changes appreciably.
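A minimal sketch of that idea using only the standard library (zlib for the compression, base64 so the blob can live inside a source file); the embedded program here is just a placeholder:

import base64, zlib

source = "print('hello from the unpacked code')"     # placeholder program text
packed = base64.b64encode(zlib.compress(source.encode('utf-8')))

# Later, at runtime: unpack and execute it.
exec(zlib.decompress(base64.b64decode(packed)).decode('utf-8'))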