Hello, I need your help with a complicated task.
Here is a file1.txt :
>Name1.1_1-40_-__Sp1
AAAAAACC-------------
>Name1.1_67-90_-__Sp1
------CCCCCCCCC------
>Name1.1_90-32_-__Sp1
--------------CCDDDDD
>Name2.1_20-89_-__Sp2
AAAAAACCCCCCCCCCC----
>Name2.1_78-200_-__Sp2
-------CCCCCCCCCCDDDD
and the idea is to create a new file called file1.txt_Hsp such as:
>Name1.1-3HSPs-__Sp1
AAAAAACCCCCCCCCCDDDDD
>Name3.1_-__Sp2
AAAAAACCCCCCCCCCC----
>Name4.1_-__Sp2
-------CCCCCCCCCCCCCC
So basically the idea is to:
Compare each sequence from the same SpN <-- (here it is very important only with the same SpN name) with each other in file1.txt.
For instance I will have to compare :
Name1.1_1-40_-__Sp1 vs Name1.1_67-90_-__Sp1
Name1.1_1-40_-__Sp1 vs Name1.1_90-32_-__Sp1
Name1.1_67-90_-__Sp1 vs Name1.1_90-32_-__Sp1
Name2.1_20-89_-__Sp2 vs Name2.1_78-200_-__Sp2
So for example, when I compare:
Name1.1_1-40_-__Sp1 vs Name1.1_67-90_-__Sp1 I get :
>Name1.1_1-40_-__Sp1
AAAAAACC-------------
>Name1.1_67-90_-__Sp1
------CCCCCCCCC------
Here I want to concatenate the two sequences if the ratio (number of letters matching another letter) / (number of letters matching a (-)) is < 0.20.
Here for example there are 21 characters, and the number of letters matching another letter = 2 (C and C).
And the number of letters that match a (-) is 13 (AAAAAA+CCCCCCC)
so
ratio = 2/13 = 0.1538462
and if this ratio is < 0.20 then I want to concatenate these 2 sequences such as:
>Name1.1-2HSPs_-__Sp1
AAAAAACCCCCCCCC------
(As you can see, the name of the new seq is now: Name1.1-2HSPs_-__Sp1, with the 2 meaning that there are 2 sequences concatenated.) So we replace the number-number part with XHSPs, X being the number of sequences concatenated.
and get the file1.txt_Hsp :
>Name1.1-2HSPs_-__Sp1
AAAAAACCCCCCCCC------
>Name1.1_90-32_-__Sp1
--------------CCDDDDD
>Name2.1_20-89_-__Sp2
AAAAAACCCCCCCCCCC----
>Name2.1_78-200_-__Sp2
-------CCCCCCCCCCDDDD
Then I do it again with Name1.1-2HSPs_-__Sp1 vs Name1.1_90-32_-__Sp1
>Name1.1-2HSPs_-__Sp1
AAAAAACCCCCCCCC------
>Name1.1_90-32_-__Sp1
--------------CCDDDDD
Where ratio = 1/20 = 0.05
Then because the ratio is < 0.20 I want to concatenate these 2 sequences such as:
>Name1.1-3HSPs_-__Sp1
AAAAAACCCCCCCCCCDDDDD
(As you can see, the name of the new seq is now: Name1.1-3HSPs_-__Sp1, with the 3 meaning that there are 3 sequences concatenated.)
file1.txt_Hsp:
>Name1.1-3HSPs_-__Sp1
AAAAAACCCCCCCCCCDDDDD
>Name2.1_20-89_-__Sp2
AAAAAACCCCCCCCCCC----
>Name2.1_78-200_-__Sp2
-------CCCCCCCCCCDDDD
Then I do it again with Name2.1_20-89_-__Sp2 vs Name2.1_78-200_-__Sp2
>Name2.1_20-89_-__Sp2
AAAAAACCCCCCCCCCC----
>Name2.1_78-200_-__Sp2
-------CCCCCCCCCCDDDD
Where ratio = 10/11 = 0.9090909
Then because the ratio is > 0.20 I do nothing and get the final file1.txt_Hsp:
>Name1.1-3HSPs_-__Sp1
AAAAAACCCCCCCCCCDDDDD
>Name2.1_20-89_-__Sp2
AAAAAACCCCCCCCCCC----
>Name2.1_78-200_-__Sp2
-------CCCCCCCCCCDDDD
Which is the final result I needed.
A simpler example would be:
>Name1.1_10-60_-__Seq1
AAA------
>Name1.1_70-120_-__Seq1
--AAAAAAA
>Name2.1_12-78_-__Seq2
--AAAAAAA
The ratio is 1/8 = 0.125 because only 1 letter is matching and 8 because 8 letters are matching with a (-)
Because the ratio < 0.20 I concatenate the two sequences Seq1 to:
>Name1.1_2HSPs_-__Seq1
AAAAAAAAA
and the new file should be :
>Name1.1_2HSPs_-__Seq1
AAAAAAAAA
>Name2.1_-__Seq2
--AAAAAAA
** Here is an example from my real data **
>YP_009186705
MMSCQSWMMKYFTKVCNRSNLALPFDQSVNPVSFSMISSHDVMLKLDDEIFYKSLNQSNL
ALPFDQSVNPVSFSMISSHDLIA
>XO009980.1_26784332-20639090_-__Agapornis_vilveti
------------------------------------------------------LNQSNL
ALPFDQSVNPVSFSMISSHDLIA
>CM009917.1_20634332-20634508_-__Neodiprion_lecontei
---CDSWMIKFFARISQMC---IKIHSKYEEVSFFLFQSK--KKKIADSHFFRSLNQDTA
-------LNTVSY----------
>XO009980.1_20634508-20634890_-__Agapornis_vilveti
MMSCQSWMMKYFTKVCNRSNLALPFDQSVNPVSFSMISSHDVMLKL--------------
-----------------------
>YUUBBOX12
MMSCQSWMMKYFTKVCNRSNLALPFDQSVNPVSFSMISSHDVMLKLDDEIFYKSLNQSNL
ALPFDQSVNPVSFSMISSHDLIA
and I should get :
>YP_009186705
MMSCQSWMMKYFTKVCNRSNLALPFDQSVNPVSFSMISSHDVMLKLDDEIFYKSLNQSNL
ALPFDQSVNPVSFSMISSHDLIA
>XO009980.1_2HSPs_-__Agapornis_vilveti
MMSCQSWMMKYFTKVCNRSNLALPFDQSVNPVSFSMISSHDVMLKL--------LNQSNL
ALPFDQSVNPVSFSMISSHDLIA
>CM009917.1_20634332-20634508_-__Neodiprion_lecontei
---CDSWMIKFFARISQMC---IKIHSKYEEVSFFLFQSK--KKKIADSHFFRSLNQDTA
-------LNTVSY----------
>YUUBBOX12
MMSCQSWMMKYFTKVCNRSNLALPFDQSVNPVSFSMISSHDVMLKLDDEIFYKSLNQSNL
ALPFDQSVNPVSFSMISSHDLIA
the ratio between XO009980.1_26784332-20639090_-__Agapornis_vilveti and XO009980.1_20634508-20634890_-__Agapornis_vilveti was : 0/75 = 0
Here, as you can see, some sequences do not have the [\d]+[-]+[\d] pattern, such as YP_009186705 or YUUBBOX12; these do not have to be concatenated, they just have to be added to the output file.
Thanks a lot for your help.
First, let's read the text files into tuples of (name, seq):
with open('seq.txt', 'r+') as f:
    lines = f.readlines()
seq_map = []
for i in range(0, len(lines), 2):
    seq_map.append((lines[i].strip('\n'), lines[i+1].strip('\n')))
#[('>Name1.1_10-60_-__Seq1', 'AAA------'),
# ('>Name1.1_70-120_-__Seq1', '--AAAAAAA'),
# ('>Name2.1_12-78_-__Seq2', '--AAAAAAA')]
#
# or
#
# [('>Name1.1_1-40_-__Sp1', 'AAAAAACC-------------'),
# ('>Name1.1_67-90_-__Sp1', '------CCCCCCCCC------'),
# ('>Name1.1_90-32_-__Sp1', '--------------CCDDDDD'),
# ('>Name2.1_20-89_-__Sp2', 'AAAAAACCCCCCCCCCC----'),
# ('>Name2.1_78-200_-__Sp2', '-------CCCCCCCCCCDDDD')]
Then we define helper functions: check_concat to decide whether to concatenate, concat_seq to merge the sequences, and concat_name to merge the names (with a nested helper, count_num, for getting the HSPs count):
import re
def count_num(x):
    num = re.findall(r'[\d]+?(?=HSPs)', x)
    count = int(num[0]) if num and 'HSPs' in x else 1
    return count
def concat_name(nx, ny):
    count, new_name = 0, []
    count += count_num(nx)
    count += count_num(ny)
    for ind, x in enumerate(nx.split('_')):
        if ind == 1:
            new_name.append('{}HSPs'.format(count))
        else:
            new_name.append(x)
    new_name = '_'.join([x for x in new_name])
    return new_name
def concat_seq(x, y):
    mash, new_seq = zip(x, y), ''
    for i in mash:
        if i.count('-') > 1:
            new_seq += '-'
        else:
            new_seq += i[0] if i[1] == '-' else i[1]
    return new_seq
def check_concat(x, y):
    mash, sim, dissim = zip(x, y), 0 ,0
    for i in mash:
        if i[0] == i[1] and '-' not in i:
            sim += 1
        if '-' in i and i.count('-') == 1:
            dissim += 1
    return False if not dissim or float(sim)/float(dissim) >= 0.2 else True
Then we will write a script to run over the tuples in sequence, checking for spn matches, then concat_checks, and taking forward the new pairing for the next comparison, adding to the final list where necessary:
tmp_seq_map = seq_map[:]
final_seq = []
for ind in range(1, len(seq_map)):
    end = True if ind == len(seq_map)-1 else False
    pair_a = tmp_seq_map[ind-1]
    pair_b = tmp_seq_map[ind]
    name_a = pair_a[0][:]
    name_b = pair_b[0][:]
    if name_a.split('__')[1] == name_b.split('__')[1]:
        if check_concat(pair_a[1], pair_b[1]):
            new_name = concat_name(pair_a[0], pair_b[0])
            new_seq = concat_seq(pair_a[1], pair_b[1])
            tmp_seq_map[ind] = (((new_name, new_seq)))
            if end:
                final_seq.append(tmp_seq_map[ind])
                end = False
        else:
            final_seq.append(pair_a)
    else:
        final_seq.append(pair_a)
    if end:
        final_seq.append(pair_b)
print(final_seq)
#[('>Name1.1_2HSPs_-__Seq1', 'AAAAAAAAA'),
# ('>Name2.1_12-78_-__Seq2', '--AAAAAAA')]
#
# or
#
#[('>Name1.1_3HSPs_-__Sp1', 'AAAAAACCCCCCCCCCDDDDD'),
# ('>Name2.1_20-89_-__Sp2', 'AAAAAACCCCCCCCCCC----'),
# ('>Name2.1_78-200_-__Sp2', '-------CCCCCCCCCCDDDD')]
Please note that I have checked for concatenation of only consecutive sequences from the text files, and that you would have to re-use the methods I've written in a different script for accounting for combinations. I leave that to your discretion.
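One possible way to account for combinations is to group the records by the text after `__` and, within each group, keep merging the first pair whose ratio is below the threshold until no pair qualifies. The following is only a sketch under stated assumptions: it targets Python 3, it inlines its own helpers (`ratio`, `merge_seq`, `merge_name`, `merge_records` are made-up names, not the functions above), it treats any header without a digits-digits coordinate block as a pass-through record, and it merges greedily, so a different merge order could in principle give a different result.

```python
import re

# coordinate block such as _1-40_ or an earlier merge marker such as _2HSPs_
COORDS = re.compile(r'_(\d+-\d+|\d+HSPs)_')

def ratio(x, y):
    """Letter/letter matches divided by letter/gap columns (None if there are none)."""
    same = sum(1 for a, b in zip(x, y) if a == b and a != '-')
    gap = sum(1 for a, b in zip(x, y) if (a == '-') != (b == '-'))
    return same / gap if gap else None

def merge_seq(x, y):
    """Column by column, keep x's letter, or y's character where x has a gap."""
    return ''.join(a if a != '-' else b for a, b in zip(x, y))

def hsp_count(name):
    m = re.search(r'_(\d+)HSPs_', name)
    return int(m.group(1)) if m else 1

def merge_name(nx, ny):
    total = hsp_count(nx) + hsp_count(ny)
    return COORDS.sub('_{}HSPs_'.format(total), nx, count=1)

def first_mergeable_pair(group, threshold):
    for i in range(len(group)):
        for j in range(i + 1, len(group)):
            r = ratio(group[i][1], group[j][1])
            if r is not None and r < threshold:
                return i, j
    return None

def merge_records(records, threshold=0.2):
    groups = {}  # insertion-ordered on Python 3.7+
    for order, (name, seq) in enumerate(records):
        if COORDS.search(name) and '__' in name:
            key = name.split('__')[1]
        else:
            key = ('passthrough', order)  # assumption: such records are never merged
        groups.setdefault(key, []).append((name, seq))
    out = []
    for group in groups.values():
        pair = first_mergeable_pair(group, threshold)
        while pair:
            i, j = pair
            group[i] = (merge_name(group[i][0], group[j][0]),
                        merge_seq(group[i][1], group[j][1]))
            del group[j]
            pair = first_mergeable_pair(group, threshold)
        out.extend(group)
    return out
```

Calling `merge_records(seq_map)` on the five-record example should give the same three records as the consecutive-pair script, while also catching pairs that are not adjacent in the file.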
Hope this helps. :)
You can do this as follows.
from collections import defaultdict
with open('lines.txt','r') as fp:
    lines=fp.readlines()
dnalist = defaultdict(list)
for i,line in enumerate(lines):
    line = line.replace('\n','')
    if i%2: #'Name' in line:
        dnalist[n].append(line)
    else:
        n = line.split('-')[-1]
That gives you a dictionary with keys being the file numbers and values being the dna sequences in a list.
def calc_ratio(str1,str2):
    n_skipped,n_matched,n_notmatched=0,0,0
    print(len(str1),len(str2))
    for i,ch in enumerate(str1):
        if ch=='-' or str2[i]=='-':
            n_skipped += 1   # at least one side is a gap
        elif ch == str2[i]:
            n_matched += 1
        else:
            n_notmatched+=1
    retval = float(n_matched)/float(n_matched+n_notmatched+n_skipped)
    print(n_matched,n_notmatched,n_skipped)
    return retval
That gets you the ratio; you might want to consider the case where characters in the sequences don't match (and neither is '-'). Here I assumed that case is no different from one of them being '-'.
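Note that calc_ratio divides by every column, whereas the question divides the letter-on-letter matches by the number of columns where a letter faces a gap. A minimal sketch of that second definition follows; the name `hsp_ratio` is made up for illustration, and ignoring mismatching letter pairs is an assumption, since the question does not say how to treat them.

```python
def hsp_ratio(str1, str2):
    matches = 0      # same letter in both sequences
    letter_gap = 0   # a letter in one sequence facing '-' in the other
    for a, b in zip(str1, str2):
        if a == '-' and b == '-':
            continue             # gap against gap counts for nothing
        if a == '-' or b == '-':
            letter_gap += 1
        elif a == b:
            matches += 1
    if letter_gap == 0:
        return None              # ratio undefined without letter/gap columns
    return matches / letter_gap

# the first pair from the question: 2 matches over 13 letter/gap columns
print(hsp_ratio('AAAAAACC-------------', '------CCCCCCCCC------'))
```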
A helper function to concatenate the strings: here I took the case of non-matching chars and put in an 'X' to mark it (if it ever happens) .
def dna_concat(str1,str2):
    outstr=[]
    for i,ch in enumerate(str1):
        if ch!=str2[i]:
            if ch == '-':
                outchar = str2[i]
            elif str2[i] == '-':
                outchar = ch
            else:
                outchar = 'X'
        else:
            outchar = ch
        outstr.append(outchar)
    outstr = ''.join(outstr)
    return outstr
And finally a loop thru the dictionary lists to get the concatenated answers, in another dictionary with filenumbers as keys and lists of concatenations as values.
answers = defaultdict(list)
for filenum,seqs in dnalist.items():
    print(seqs)
    for i,seq in enumerate(seqs):
        for seq2 in seqs[i+1:]:
            ratio = calc_ratio(seq,seq2)
            print('i {} {} ratio {}'.format(seq,seq2,ratio))
            if ratio<0.2:
                answers[filenum].append(dna_concat(seq,seq2))
                print(dna_concat(seq,seq2))
How would I count consecutive characters in Python to see the number of times each unique digit repeats before the next unique digit?
At first, I thought I could do something like:
word = '1000'
counter = 0
print range(len(word))
for i in range(len(word) - 1):
    while word[i] == word[i + 1]:
        counter += 1
        print counter * "0"
    else:
        counter = 1
        print counter * "1"
So that in this manner I could see the number of times each unique digit repeats. But this, of course, falls out of range when i reaches the last value.
In the example above, I would want Python to tell me that 1 repeats 1, and that 0 repeats 3 times. The code above fails, however, because of my while statement.
How could I do this with just built-in functions?
Consecutive counts:
You can use itertools.groupby:
s = "111000222334455555"
from itertools import groupby
groups = groupby(s)
result = [(label, sum(1 for _ in group)) for label, group in groups]
After which, result looks like:
[("1", 3), ("0", 3), ("2", 3), ("3", 2), ("4", 2), ("5", 5)]
And you could format with something like:
", ".join("{}x{}".format(label, count) for label, count in result)
# "1x3, 0x3, 2x3, 3x2, 4x2, 5x5"
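As a small sketch, the same groupby idea can be wrapped in a function and applied to the '1000' input from the question (the function name `run_lengths` is just an illustrative choice):

```python
from itertools import groupby

def run_lengths(s):
    """Return (character, run length) pairs for the consecutive runs in s."""
    return [(label, sum(1 for _ in group)) for label, group in groupby(s)]

print(run_lengths("1000"))  # [('1', 1), ('0', 3)]
```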
Total counts:
Someone in the comments is concerned that you want a total count of numbers so "11100111" -> {"1":6, "0":2}. In that case you want to use a collections.Counter:
from collections import Counter
s = "11100111"
result = Counter(s)
# {"1":6, "0":2}
Your method:
As many have pointed out, your method fails because you're looping through range(len(s)) but addressing s[i+1]. This leads to an off-by-one error when i is pointing at the last index of s, so i+1 raises an IndexError. One way to fix this would be to loop through range(len(s)-1), but it's more pythonic to generate something to iterate over.
For a string that's not absolutely huge, zip(s, s[1:]) isn't a performance issue, so you could do:
counts = []
count = 1
for a, b in zip(s, s[1:]):
    if a==b:
        count += 1
    else:
        counts.append((a, count))
        count = 1
The only problem being that the final run is never appended, so you'll have to special-case the end of the string. That can be fixed with itertools.zip_longest
import itertools
counts = []
count = 1
for a, b in itertools.zip_longest(s, s[1:], fillvalue=None):
    if a==b:
        count += 1
    else:
        counts.append((a, count))
        count = 1
If you do have a truly huge string and can't stand to hold two of them in memory at a time, you can use the itertools recipe pairwise.
def pairwise(iterable):
    """iterates pairwise without holding an extra copy of iterable in memory"""
    a, b = itertools.tee(iterable)
    next(b, None)
    return itertools.zip_longest(a, b, fillvalue=None)
counts = []
count = 1
for a, b in pairwise(s):
    ...
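Put together as a generator, the padded-pairwise idea might look like the sketch below. The names `pairwise_padded` and `run_lengths_lazy` are illustrative, and the code assumes the input never contains None itself, because None is used as the padding value that closes the last run.

```python
import itertools

def pairwise_padded(iterable):
    """Yield (item, next item) pairs; the last item is paired with None."""
    a, b = itertools.tee(iterable)
    next(b, None)
    return itertools.zip_longest(a, b, fillvalue=None)

def run_lengths_lazy(iterable):
    """Yield (item, run length) for each consecutive run, one pass, no slicing."""
    count = 1
    for a, b in pairwise_padded(iterable):
        if a == b:
            count += 1
        else:
            yield (a, count)
            count = 1

print(list(run_lengths_lazy("1000")))  # [('1', 1), ('0', 3)]
```

Because it only relies on iteration, this also works on a file-like stream or any other iterator, not just on strings.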
A solution "that way", with only basic statements:
word="100011010" #word = "1"
count=1
length=""
if len(word)>1:
    for i in range(1,len(word)):
        if word[i-1]==word[i]:
            count+=1
        else :
            length += word[i-1]+" repeats "+str(count)+", "
            count=1
    length += ("and "+word[i]+" repeats "+str(count))
else:
    i=0
    length += ("and "+word[i]+" repeats "+str(count))
print (length)
Output :
'1 repeats 1, 0 repeats 3, 1 repeats 2, 0 repeats 1, 1 repeats 1, and 0 repeats 1'
#'1 repeats 1'
Totals (without sub-groupings)
#!/usr/bin/python3 -B
charseq = 'abbcccdddd'
distros = { c:1 for c in charseq }
for c in range(len(charseq)-1):
    if charseq[c] == charseq[c+1]:
        distros[charseq[c]] += 1
print(distros)
I'll provide a brief explanation for the interesting lines.
distros = { c:1 for c in charseq }
The line above is a dictionary comprehension: it iterates over the characters in charseq and creates a key/value pair for each one, where the key is the character and the value starts at 1, since any character that appears has been seen at least once.
Then comes the loop:
for c in range(len(charseq)-1):
We go from 0 to length - 1 to avoid going out of bounds with the c+1 indexing in the loop's body.
if charseq[c] == charseq[c+1]:
distros[charseq[c]] += 1
At this point, every match we encounter we know is consecutive, so we simply add 1 to the character key. For example, if we take a snapshot of one iteration, the code could look like this (using direct values instead of variables, for illustrative purposes):
# replacing vars for their values
if charseq[1] == charseq[1+1]:
distros[charseq[1]] += 1
# this is a snapshot of a single comparison here and what happens later
if 'b' == 'b':
distros['b'] += 1
You can see the program output below with the correct counts:
➜ /tmp ./counter.py
{'b': 2, 'a': 1, 'c': 3, 'd': 4}
You only need to change len(word) to len(word) - 1. That said, you could also use the fact that False's value is 0 and True's value is 1 with sum:
sum(word[i] == word[i+1] for i in range(len(word)-1))
This produces the sum of (False, True, True, False) where False is 0 and True is 1 - which is what you're after.
If you want this to be safe you need to guard empty words (index -1 access):
sum(word[i] == word[i+1] for i in range(max(0, len(word)-1)))
And this can be improved with zip:
sum(c1 == c2 for c1, c2 in zip(word[:-1], word[1:]))
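For a concrete feel of what this sum measures, here is a tiny sketch (the function name `adjacent_equal_pairs` is made up). Note that it returns the number of adjacent positions holding equal characters over the whole string, not a per-character run length.

```python
def adjacent_equal_pairs(word):
    """Count positions i where word[i] == word[i+1]; True counts as 1 in sum()."""
    return sum(c1 == c2 for c1, c2 in zip(word, word[1:]))

print(adjacent_equal_pairs("1000"))  # 2: the pairs at positions (1,2) and (2,3)
```

Since zip stops at the shorter argument, zip(word, word[1:]) yields the same pairs as zip(word[:-1], word[1:]) and needs no guard for empty input.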
If we want to count consecutive characters without looping, we can make use of pandas:
In [1]: import pandas as pd
In [2]: sample = 'abbcccddddaaaaffaaa'
In [3]: d = pd.Series(list(sample))
In [4]: [(cat[1], grp.shape[0]) for cat, grp in d.groupby([d.ne(d.shift()).cumsum(), d])]
Out[4]: [('a', 1), ('b', 2), ('c', 3), ('d', 4), ('a', 4), ('f', 2), ('a', 3)]
The key is to find the first elements that are different from their previous values and then make proper groupings in pandas:
In [5]: sample = 'abba'
In [6]: d = pd.Series(list(sample))
In [7]: d.ne(d.shift())
Out[7]:
0 True
1 True
2 False
3 True
dtype: bool
In [8]: d.ne(d.shift()).cumsum()
Out[8]:
0 1
1 2
2 2
3 3
dtype: int32
This is my simple code for finding the maximum number of consecutive 1's in a binary string in Python 3:
count= 0
maxcount = 0
for i in str(bin(13)):
    if i == '1':
        count +=1
    elif count > maxcount:
        maxcount = count;
        count = 0
    else:
        count = 0
if count > maxcount: maxcount = count
maxcount
There is no need to count or groupby. Just note the indices where a change occurs and subtract consecutive indices.
w = "111000222334455555"
iw = [0] + [i+1 for i in range(len(w)-1) if w[i] != w[i+1]] + [len(w)]
dw = [w[i] for i in range(len(w)-1) if w[i] != w[i+1]] + [w[-1]]
cw = [ iw[j] - iw[j-1] for j in range(1, len(iw) ) ]
print(dw) # digits
['1', '0', '2', '3', '4', '5']
print(cw) # counts
[3, 3, 3, 2, 2, 5]
w = 'XXYXYYYXYXXzzzzzYYY'
iw = [0] + [i+1 for i in range(len(w)-1) if w[i] != w[i+1]] + [len(w)]
dw = [w[i] for i in range(len(w)-1) if w[i] != w[i+1]] + [w[-1]]
cw = [ iw[j] - iw[j-1] for j in range(1, len(iw) ) ]
print(dw) # characters
print(cw) # counts
['X', 'Y', 'X', 'Y', 'X', 'Y', 'X', 'z', 'Y']
[2, 1, 1, 3, 1, 1, 2, 5, 3]
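The boundary-index idea can be sketched as one function that returns both lists. The name `runs_by_boundaries` is illustrative, and the empty-string guard is an addition of this sketch, since indexing w[-1] would fail on empty input.

```python
def runs_by_boundaries(w):
    """Return (characters, counts) for consecutive runs, using change positions only."""
    if not w:
        return [], []
    # index of the first character of every run after the first one
    cuts = [i + 1 for i in range(len(w) - 1) if w[i] != w[i + 1]]
    starts = [0] + cuts
    ends = cuts + [len(w)]
    chars = [w[s] for s in starts]
    counts = [e - s for s, e in zip(starts, ends)]
    return chars, counts

print(runs_by_boundaries("111000222334455555"))
# (['1', '0', '2', '3', '4', '5'], [3, 3, 3, 2, 2, 5])
```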
A one liner that returns the amount of consecutive characters with no imports:
def f(x):s=x+" ";t=[x[1] for x in zip(s[0:],s[1:],s[2:]) if (x[1]==x[0])or(x[1]==x[2])];return {h: t.count(h) for h in set(t)}
That returns the amount of times any repeated character in a list is in a consecutive run of characters.
alternatively, this accomplishes the same thing, albeit much slower:
def A(m):t=[thing for x,thing in enumerate(m) if thing in [(m[x+1] if x+1<len(m) else None),(m[x-1] if x-1>0 else None)]];return {h: t.count(h) for h in set(t)}
In terms of performance, I ran them with
import copy
import timeit
import urllib.request
site = 'https://web.njit.edu/~cm395/theBeeMovieScript/'
s = urllib.request.urlopen(site).read(100_000)
s = str(copy.deepcopy(s))
print(timeit.timeit('A(s)',globals=locals(),number=100))
print(timeit.timeit('f(s)',globals=locals(),number=100))
which resulted in:
12.528256356999918
5.351301653001428
This method can definitely be improved, but without using any external libraries, this was the best I could come up with.
In python
your_string = "wwwwweaaaawwbbbbn"
current = ''
count = 0
for index, loop in enumerate(your_string):
    current = loop
    count = count + 1
    if index == len(your_string)-1:
        print(f"{count}{current}", end ='')
        break
    if your_string[index+1] != current:
        print(f"{count}{current}",end ='')
        count = 0
        continue
This will output
5w1e4a2w4b1n
#I wrote the code using simple loops and if statements
s='feeekksssh' #len(s) = 10
count=1 #f:1, e:3, k:2, s:3, h:1
l=[]
for i in range(1,len(s)): #range(1,10)
    if s[i-1]==s[i]:
        count = count+1
    else:
        l.append(count)
        count=1
    if i == len(s)-1: #To check the last character sequence we need to loop in reverse order
        reverse_count=1
        for i in range(-1,-(len(s)),-1): #Looping only for the last character
            if s[i] == s[i-1]:
                reverse_count = reverse_count+1
            else:
                l.append(reverse_count)
                break
print(l)
Today I had an interview and was asked the same question. I was struggling with the original solution in mind:
s = 'abbcccda'
old = ''
cnt = 0
res = ''
for c in s:
    cnt += 1
    if old != c:
        res += f'{old}{cnt}'
        old = c
        cnt = 0 # default 0 or 1 neither work
print(res)
# 1a1b2c3d1
Sadly this solution always produced unexpected results on edge cases (can anyone fix the code? Maybe I need to post another question), and in the end I ran out of time in the interview.
After the interview I calmed down and soon got what I think is a stable solution (though I like the groupby one best).
s = 'abbcccda'
olds = []
for c in s:
    if olds and c in olds[-1]:
        olds[-1].append(c)
    else:
        olds.append([c])
print(olds)
res = ''.join([f'{lst[0]}{len(lst)}' for lst in olds])
print(res)
# [['a'], ['b', 'b'], ['c', 'c', 'c'], ['d'], ['a']]
# a1b2c3d1a1
Here is my simple solution:
def count_chars(s):
    size = len(s)
    count = 1
    op = ''
    for i in range(1, size):
        if s[i] == s[i-1]:
            count += 1
        else:
            op += "{}{}".format(count, s[i-1])
            count = 1
    if size:
        op += "{}{}".format(count, s[size-1])
    return op
data_input = 'aabaaaabbaaaaax'
start = 0
end = 0
temp_dict = dict()
while start < len(data_input):
    if data_input[start] == data_input[end]:
        end = end + 1
    if end == len(data_input):
        value = data_input[start:end]
        temp_dict[value] = len(value)
        break
    if data_input[start] != data_input[end]:
        value = data_input[start:end]
        temp_dict[value] = len(value)
        start = end
print(temp_dict)
PROBLEM: we need to count consecutive characters and return characters with their count.
def countWithString(input_string:str)-> str:
    count = 1
    output = ''
    for i in range(1,len(input_string)):
        if input_string[i]==input_string[i-1]:
            count +=1
        else:
            output += f"{count}{input_string[i-1]}"
            count = 1
    # Adds the count for the last run (the else branch above does not run for it, so it would otherwise be missing from the output string)
    output += f"{count}{input_string[-1]}"
    return output
countWithString(input)
input:'aaabbbaabbcc'
output:'3a3b2a2b2c'
Time Complexity: O(n)
Space Complexity: O(n) for the output string, O(1) otherwise
temp_str = "aaaajjbbbeeeeewwjjj"
def consecutive_charcounter(input_str):
    counter = 0
    temp_list = []
    for i in range(len(input_str)):
        if i==0:
            counter+=1
        elif input_str[i]== input_str[i-1]:
            counter+=1
            if i == len(input_str)-1:
                temp_list.extend([input_str[i - 1], str(counter)])
        else:
            temp_list.extend([input_str[i-1],str(counter)])
            counter = 1
    print("".join(temp_list))
consecutive_charcounter(temp_str)
My problem is to find the consecutive '3's in a list. For example list('133233313333') . What makes it difficult is only two adjacent '3's is valid, three or more adjacent '3's are not. So '33' is valid, but triple '3's and '3333' are not valid. I tried the following at first:
try:
    if l[i] == '3' and l[i+1] == '3' and l[i+2] != '3' and l[i-1] != '3':
        record_current(i)
except IndexError:
    pass
My intention is to ignore the comparison and let it be true if there is an IndexError, but it doesn't work.
If list had a method like dict.get(), which returns None if there's a KeyError, I could write it as (l[i+2] == None or l[i+2] != '3').
If I must finish it now, I would treat the first item and the last two items separately from the other items. But is there some way to solve this problem elegantly?
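One likely reason the try/except version misbehaves is that an IndexError skips the whole test instead of treating the missing neighbour as "not a 3", and that l[i-1] with i == 0 does not raise at all: it silently reads the last element. A small sketch of a dict.get-style helper for lists follows; both `get` and `isolated_pairs` are hypothetical names introduced only for this illustration.

```python
def get(seq, i, default=None):
    """Like dict.get for sequences: return default for any out-of-range index,
    including negative ones, instead of raising or wrapping around."""
    return seq[i] if 0 <= i < len(seq) else default

def isolated_pairs(l, target='3'):
    """Start indices of runs made of exactly two adjacent targets."""
    return [i for i in range(len(l))
            if l[i] == target
            and get(l, i + 1) == target
            and get(l, i + 2) != target
            and get(l, i - 1) != target]

print(isolated_pairs(list('133233313333')))  # [1]
```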
You can do this using itertools.groupby:
>>> from operator import itemgetter
>>> from itertools import groupby
>>> s = list('1332333133334433')
>>> for k, g in groupby(enumerate(s), itemgetter(1)):
...     if k == '3':
...         ind = next(g)[0]
...         if sum(1 for _ in g) == 1:
...             print ind
...
1
14
Count the consecutive 3s !
Keep a counter which is incremented every time you meet a '3' and reset on a non-'3'; compare to 2 before a reset:
j= 0
for i in range(len(L)):
    if L[i] == '3':
        j+= 1
    else:
        if j == 2:
            print "Found at", i - j
        j= 0
if j == 2:
    print "Found at", i - j + 1 # Late fix (+ 1)
Alternatively, one may find successive runs of '3's and non-'3's. This way, one avoids testing j == 2 on every non-'3' element, at the expense of one extra loop test for every sequence of 3's:
i= 0
while i < len(L):
    # Find the next '3'
    while i < len(L) and L[i] != '3':
        i+= 1
    j= i
    # Find the next non-'3'
    while i < len(L) and L[i] == '3':
        i+= 1
    if i - j == 2:
        print "Found at", j
You are trying to check for a certain Grammar. For this, you can implement a Deterministic Finite Automaton (or DFA).
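As a rough sketch of what such an automaton could look like, the version below tracks four states (no run, one target seen, two seen, three or more seen) and reports a match whenever a run ends while in the "two seen" state. The function name and the state numbering are choices made for this illustration only.

```python
def find_exact_pairs(seq, target='3'):
    """Return start indices of runs of exactly two `target` items."""
    state = 0   # 0: outside a run, 1: one target, 2: two targets, 3: three or more
    found = []
    for i, ch in enumerate(seq):
        if ch == target:
            state = min(state + 1, 3)   # saturate: longer runs all look alike
        else:
            if state == 2:              # the run that just ended had length two
                found.append(i - 2)
            state = 0
    if state == 2:                      # a run of two at the very end of the input
        found.append(len(seq) - 2)
    return found

print(find_exact_pairs('1332333133334433'))  # [1, 14]
```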
Here is a solution that uses regular expressions:
import re
m = re.finditer('(?<!3)3{2}(?!3)', '1332333133334433')
for x in m:
    print x.span()[0]
The regular expression finds all matches for two successive threes, as long as they are not followed or preceded by a 3. The output is:
1
14
You can substitute any character for the '3' in the regular expression, to search for that letter instead.
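A hedged sketch of that substitution: the helper below (the name `exact_pair_starts` is made up) builds the pattern for an arbitrary single character and passes it through re.escape so that characters such as '.' or '*' are matched literally. It assumes `ch` is exactly one character.

```python
import re

def exact_pair_starts(text, ch):
    """Start indices where `ch` occurs exactly twice in a row in `text`."""
    c = re.escape(ch)   # make regex metacharacters literal
    pattern = '(?<!{0}){0}{{2}}(?!{0})'.format(c)
    return [m.start() for m in re.finditer(pattern, text)]

print(exact_pair_starts('1332333133334433', '3'))  # [1, 14]
print(exact_pair_starts('a..b...c', '.'))          # [1]
```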
data = "1332333133334433"
from itertools import groupby
from operator import itemgetter
result = []
for char, grp in groupby(enumerate(data), itemgetter(1)):
    groups = list(grp)
    if char == "3" and len(groups) == 2:
        result.append(groups[0][0])
print result
Output
[1, 14]
This returns True if '333' in the list
>>> l = "1332333133334433"
>>> any([(i[:3]=='333' and i[3] != '3') for i in map("".join,zip(l[:],l[1:],l[2:],l[3:]))])
True
you can see that:
>>> map("".join,zip(l[:],l[1:],l[2:],l[3:]))
['1332', '3323', '3233', '2333', '3331', '3313', '3133', '1333', '3333', '3334', '3344', '3443', '4433']
Here's a general solution for finding two consecutive letters that are the same:
def find_two_consecutive(my_str):
    prev_letter = None
    count = 1
    for index, current_letter in enumerate(my_str):
        if current_letter == prev_letter:
            count += 1
        else:
            if count == 2:
                print("Starting at index: %d" % (index - 2))
            count = 1
        prev_letter = current_letter
    if count == 2:
        # a pair that ends the string starts at the second-to-last index
        print("Starting at index: %d" % (len(my_str) - 2))
If your list really only contains one-letter elements you should use the re module:
import re
chars = list('133233313333433')
numberstr = ''.join(chars)
for match in re.finditer('(?<!3)33(?!3)', numberstr):
    print(match.start())
Result:
1
13
The pattern (?<!3)33(?!3) means: find two consecutive 3s that are neither preceded nor followed by a 3.
The documentation can be found in the Python docs for the re module.
Oh, and this:
chars = list('133233313333433')
numberstr = ''.join(chars)
should probably be just:
numberstr = '133233313333433'
EDIT: I am aware that a question with similar task was already asked in SO but I'm interested to find out the problem in this specific piece of code. I am also aware that this problem can be solved without using recursion.
The task is to write a program which will find (and print) the longest sub-string in which the letters occur in alphabetical order. If more than 1 equally long sequences were found, then the first one should be printed. For example, the output for a string abczabcd will be abcz.
I have solved this problem with recursion which seemed to pass my manual tests. However when I run an automated tests set which generate random strings, I have noticed that in some cases, the output is incorrect. For example:
if s = 'hixwluvyhzzzdgd', the output is hix instead of luvy
if s = 'eseoojlsuai', the output is eoo instead of jlsu
if s = 'drurotsxjehlwfwgygygxz', the output is dru instead of ehlw
After some time struggling, I couldn't figure out what is so special about these strings that causes the bug.
This is my code:
pos = 0
maxLen = 0
startPos = 0
endPos = 0
def last_pos(pos):
    if pos < (len(s) - 1):
        if s[pos + 1] >= s[pos]:
            pos += 1
            if pos == len(s)-1:
                return len(s)
            else:
                return last_pos(pos)
        return pos
for i in range(len(s)):
    if last_pos(i+1) != None:
        diff = last_pos(i) - i
        if diff - 1 > maxLen:
            maxLen = diff
            startPos = i
            endPos = startPos + diff
print s[startPos:endPos+1]
There are many things to improve in your code, but here are the minimum changes needed to make it work. The problem is that you should have if last_pos(i) != None: in your for loop (i instead of i+1), and you should compare diff (not diff - 1) against maxLen. Please read the other answers to learn how to do it better.
for i in range(len(s)):
    if last_pos(i) != None:
        diff = last_pos(i) - i + 1
        if diff > maxLen:
            maxLen = diff
            startPos = i
            endPos = startPos + diff - 1
Here. This does what you want. One pass, no need for recursion.
def find_longest_substring_in_alphabetical_order(s):
    groups = []
    cur_longest = ''
    prev_char = ''
    for c in s.lower():
        if prev_char and c < prev_char:
            groups.append(cur_longest)
            cur_longest = c
        else:
            cur_longest += c
        prev_char = c
    groups.append(cur_longest)  # the run that ends the string also competes
    return max(groups, key=len)
Using it:
>>> find_longest_substring_in_alphabetical_order('hixwluvyhzzzdgd')
'luvy'
>>> find_longest_substring_in_alphabetical_order('eseoojlsuai')
'jlsu'
>>> find_longest_substring_in_alphabetical_order('drurotsxjehlwfwgygygxz')
'ehlw'
Note: It will probably break on strange characters, has only been tested with the inputs you suggested. Since this is a "homework" question, I will leave you with the solution as is, though there is still some optimization to be done, I wanted to leave it a little bit understandable.
You can use nested for loops, slicing and sorted. If the string is not all lower-case then you can convert the sub-strings to lower-case before comparing using str.lower:
def solve(strs):
    maxx = ''
    for i in xrange(len(strs)):
        for j in xrange(i+1, len(strs)):
            s = strs[i:j+1]
            if ''.join(sorted(s)) == s:
                maxx = max(maxx, s, key=len)
            else:
                break
    return maxx
Output:
>>> solve('hixwluvyhzzzdgd')
'luvy'
>>> solve('eseoojlsuai')
'jlsu'
>>> solve('drurotsxjehlwfwgygygxz')
'ehlw'
Python has a powerful built-in package, itertools, and within it a wonderful function, groupby.
An intuitive use of the key function can give immense mileage.
In this particular case, you just have to keep track of order changes and group the sequence accordingly. The only exception is the boundary case, which you have to handle separately.
Code
from itertools import groupby
def find_long_cons_sub(s):
    class Key(object):
        '''
        The Key function returns
        1: For Increasing Sequence
        0: For Decreasing Sequence
        '''
        def __init__(self):
            self.last_char = None
        def __call__(self, char):
            resp = True
            if self.last_char:
                resp = self.last_char < char
            self.last_char = char
            return resp
    def find_substring(groups):
        '''
        The Boundary Case is when an increasing sequence
        starts just after the Decreasing Sequence. This causes
        the first character to be in the previous group.
        If you do not want to handle the Boundary Case
        separately, you have to make the Key function a bit
        complicated to flag the start of increasing sequence'''
        yield next(groups)
        try:
            while True:
                yield next(groups)[-1:] + next(groups)
        except StopIteration:
            pass
    # keep both increasing and decreasing groups: find_substring pairs them up
    groups = (list(g) for k, g in groupby(s, key = Key()))
    #Just determine the maximum sequence based on length
    return ''.join(max(find_substring(groups), key = len))
Result
>>> find_long_cons_sub('drurotsxjehlwfwgygygxz')
'ehlw'
>>> find_long_cons_sub('eseoojlsuai')
'jlsu'
>>> find_long_cons_sub('hixwluvyhzzzdgd')
'luvy'
Simple and easy.
Code :
s = 'hixwluvyhzzzdgd'
r,p,t = '','',''
for c in s:
    if p <= c:
        t += c
        p = c
    else:
        if len(t) > len(r):
            r = t
        t,p = c,c
if len(t) > len(r):
    r = t
print 'Longest substring in alphabetical order is: ' + r
Output :
Longest substring in alphabetical order is: luvy
Here is a single pass solution with a fast loop. It reads each character only once. Inside the loop operations are limited to
1 string comparison (1 char x 1 char)
1 integer increment
2 integer subtractions
1 integer comparison
1 to 3 integer assignments
1 string assignment
No containers are used. No function calls are made. The empty string is handled without special-case code. All character codes, including chr(0), are properly handled. If there is a tie for the longest alphabetical substring, the function returns the first winning substring it encountered. Case is ignored for purposes of alphabetization, but case is preserved in the output substring.
def longest_alphabetical_substring(string):
    start, end = 0, 0     # range of current alphabetical string
    START, END = 0, 0     # range of longest alphabetical string yet found
    prev = chr(0)         # previous character
    for char in string.lower():   # scan string ignoring case
        if char < prev:       # is character out of alphabetical order?
            start = end       # if so, start a new substring
        end += 1              # either way, increment substring length
        if end - start > END - START:   # found new longest?
            START, END = start, end     # if so, update longest
        prev = char           # remember previous character
    return string[START : END]   # return longest alphabetical substring
Result
>>> longest_alphabetical_substring('drurotsxjehlwfwgygygxz')
'ehlw'
>>> longest_alphabetical_substring('eseoojlsuai')
'jlsu'
>>> longest_alphabetical_substring('hixwluvyhzzzdgd')
'luvy'
>>>
a lot more looping, but it gets the job done
s = raw_input("Enter string")
fin=""
s_pos =0
while s_pos < len(s):
    n=1
    lng=" "
    for c in s[s_pos:]:
        if c >= lng[n-1]:
            lng+=c
            n+=1
        else :
            break
    if len(lng) > len(fin):
        fin= lng
    s_pos+=1
print "Longest string: " + fin
def find_longest_order():
    arr = []
    now_long = ''
    prev_char = ''
    for char in s.lower():
        if prev_char and char < prev_char:
            arr.append(now_long)
            now_long = char
        else:
            now_long += char
        prev_char = char
    if len(now_long) == len(s):
        return now_long
    else:
        arr.append(now_long)  # include the run that ends the string
        return max(arr, key=len)
def main():
    print 'Longest substring in alphabetical order is: ' + find_longest_order()
main()
Simple and easy to understand:
s = "abcbcd" #The original string
l = len(s) #The length of the original string
maxlenstr = s[0] #maximum length sub-string, taking the first letter of original string as value.
curlenstr = s[0] #current length sub-string, taking the first letter of original string as value.
for i in range(1,l): #in range, the l is not counted.
    if s[i] >= s[i-1]: #If current letter is greater or equal to previous letter,
        curlenstr += s[i] #add the current letter to current length sub-string
    else:
        curlenstr = s[i] #otherwise, take the current letter as current length sub-string
    if len(curlenstr) > len(maxlenstr): #if current sub-string's length is greater than max one,
        maxlenstr = curlenstr; #take current one as max one.
print("Longest substring in alphabetical order is:", maxlenstr)
s = input("insert some string: ")
start = 0
end = 0
temp = ""
while end < len(s): # end < len(s) so that a one-character input is handled too
    while end+1 <len(s) and s[end+1] >= s[end]:
        end += 1
    if len(s[start:end+1]) > len(temp):
        temp = s[start:end+1]
    end +=1
    start = end
print("longest ordered part is: "+temp)
I suppose this is a problem set question from CS6.00.1x on edX. Here is what I came up with.
s = raw_input("Enter the string: ")
longest_sub = ""
last_longest = ""
for i in range(len(s)):
    if len(last_longest) > 0:
        if last_longest[-1] <= s[i]:
            last_longest += s[i]
        else:
            last_longest = s[i]
    else:
        last_longest = s[i]
    if len(last_longest) > len(longest_sub):
        longest_sub = last_longest
print(longest_sub)
I came up with this solution
def longest_sorted_string(s):
    max_string = ''
    for i in range(len(s)):
        for j in range(i+1, len(s)+1):
            string = s[i:j]
            arr = list(string)
            if sorted(string) == arr and len(max_string) < len(string):
                max_string = string
    return max_string
Assuming this is from the edX course:
up to this question, the course has not taught anything about strings and their advanced operations in Python.
So, I would simply use loops and conditional statements.
string ="" #taking a plain string to represent the then generated string
present ="" #the present/current longest string
for i in range(len(s)): #not len(s)-1 because that totally skips last value
    j = i+1
    if j>= len(s):
        j=i #using s[i+1] simply throws an error of not having index
    if s[i] <= s[j]: #comparing the now and next value
        string += s[i] #concatenating string if above condition is satisfied
    elif len(string) != 0 and s[i] > s[j]: #don't want to lose the last value
        string += s[i] #now since s[i] > s[j] #last one will be printed
        if len(string) > len(present): #1 > 0 so from there we get to store many values
            present = string #swapping to largest string
        string = ""
if len(string) > len(present): #to swap from if statement
    present = string
if present == s[len(s)-1]: #if no alphabet is in order then first one is to be the output
    present = s[0]
print('Longest substring in alphabetical order is:' + present)
I agree with @Abhijit about the power of itertools.groupby() but I took a simpler approach to (ab)using it and avoided the boundary case problems:
from itertools import groupby
LENGTH, LETTERS = 0, 1
def longest_sorted(string):
    longest_length, longest_letters = 0, []
    key, previous_letter = 0, chr(0)
    def keyfunc(letter):
        nonlocal key, previous_letter
        if letter < previous_letter:
            key += 1
        previous_letter = letter
        return key
    for _, group in groupby(string, keyfunc):
        letters = list(group)
        length = len(letters)
        if length > longest_length:
            longest_length, longest_letters = length, letters
    return ''.join(longest_letters)
print(longest_sorted('hixwluvyhzzzdgd'))
print(longest_sorted('eseoojlsuai'))
print(longest_sorted('drurotsxjehlwfwgygygxz'))
print(longest_sorted('abcdefghijklmnopqrstuvwxyz'))
OUTPUT
> python3 test.py
luvy
jlsu
ehlw
abcdefghijklmnopqrstuvwxyz
>
s = 'azcbobobegghakl'
i=1
subs=s[0] # current run, starts at s[i-1]
subs2=s[0] # longest run found so far
while(i<len(s)):
    j=i
    while(j<len(s)):
        if(s[j]>=s[j-1]):
            subs+=s[j]
            j+=1
        else:
            break
        if(len(subs)>len(subs2)): # check while the run is still intact
            subs2=subs
    subs=s[i] # the next run starts at the current position
    i+=1
print("Longest substring in alphabetical order is:",subs2)
s = 'gkuencgybsbezzilbfg'
x = s.lower()
y = ''
z = [] #creating an empty listing which will get filled
for i in range(0,len(x)):
    if i == len(x)-1:
        y = y + str(x[i])
        z.append(y)
        break
    a = x[i] <= x[i+1]
    if a == True:
        y = y + str(x[i])
    else:
        y = y + str(x[i])
        z.append(y) # fill the list
        y = ''
# search of 1st longest string
L = len(max(z,key=len)) # key=len takes length in consideration
for i in range(0,len(z)):
    a = len(z[i])
    if a == L:
        print 'Longest substring in alphabetical order is:' + str(z[i])
        break
first_seq=s[0]
break_seq=s[0]
current = s[0]
for i in range(0,len(s)-1):
    if s[i]<=s[i+1]:
        first_seq = first_seq + s[i+1]
        if len(first_seq) > len(current):
            current = first_seq
    else:
        first_seq = s[i+1]
        break_seq = first_seq
print("Longest substring in alphabetical order is: ", current)
I am currently stuck with this program. I am attempting to determine the molecular weight of a compound given its molecular formula (only Cs, Hs, and Os). I am also unsure how to correctly use [index + 1], as I am trying to determine what the next character after "x" is, to see if it is a number or another element.
def main():
    C1 = 0
    H1 = 0
    O1 = 0
    num = 0
    chemicalFormula = input("Enter the chemical formula, or enter key to quit: ")
    while True:
        cformula = list(chemicalFormula)
        for index, x in enumerate(cformula):
            if x == 'C':
                if cformula[index + 1] == 'H' or cformula[index + 1] == 'O':
                    C1 += 1
                else:
                    for index, y in range(index + 1, 1000000000):
                        if cformula[index + 1] != 'H' or cformula[index + 1] != 'O':
                            num = int(y)
                            num = num*10 + int(cformula[index + 1])
                        else:
                            C1 += num
                            break
This is the error I keep getting:
Enter the chemical formula, or enter key to quit: C2
  File "/Users/ykasznik/Documents/ykasznikp7.py", line 46, in main
    for index, y in range(index + 1, 1000000000):
TypeError: 'int' object is not iterable
>>>
You should change this line
for index, y in range(index + 1, 1000000000):
to
for y in range(index + 1, 1000000000):
The answers provided here focus on two different aspects of solving your problem:
1. A very specific solution to your error (int is not iterable), by correcting some code.
2. A somewhat bigger perspective on how to structure your code.
Regarding 1, a comment to your question noted the issue: the syntax of tuple-unpacking in your inner loop.
An example of Tuple-unpacking would be
a,b = ['a','b']
Here, Python takes the first element of the right-hand side (RHS) and assigns it to the first name on the left-hand side (LHS), then takes the second element of the RHS and assigns it to the second name on the LHS.
Your inner loop that faults,
for index, y in range(index + 1, 1000000000),
is equivalent to trying to do
index, y = 1
Now, an integer is not a collection of elements, so this would not work.
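To make the difference concrete, here is a small sketch (Python 3, with made-up variable names) that contrasts unpacking a pair, looping over enumerate(), and the failing loop over range(). The exact wording of the TypeError message differs between Python versions, so the example only captures it rather than relying on its text.

```python
# Unpacking works when the right-hand side yields exactly two items.
first, second = ['a', 'b']

# enumerate() yields (position, item) pairs, so two loop names are fine.
pairs = [(position, item) for position, item in enumerate("C2")]

# range() yields plain ints, so asking for two names per item fails.
message = None
try:
    for index, y in range(1, 3):
        pass
except TypeError as err:
    message = str(err)

print(pairs)
print(message)
```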
Regarding 2, you should focus on the strategy of modularization, which basically means you write a function for each sub-problem. Python was almost born for this. (Note, this strategy does not necessarily mean writing Python-modules for each subproblem.)
In your case, your main goal can be divided into several sub-problems:
1. Getting the molecular sequences.
2. Splitting the input into individual sequences.
3. Splitting a sequence into its H, C, and O elements.
4. Given the number of H, C, and O atoms, calculating the molecular weight.
Step 3 and 4 are excellent candidates for independent functions, as their core problem is isolated from the remaining context.
Here, I assume we only get 1 sequence at a time, and that they can be of the form:
CH4
CHHHH
CP4H3OH
Step 3:
def GetAtoms(sequence):
    '''
    Counts the number of C's, H's and O's in sequence and returns a dictionary.
    Only works with numeric suffixes up to 9, e.g. C10H12 would not work.
    '''
    atoms = ['C','H','O'] # list of which atoms we want to count.
    res = {atom:0 for atom in atoms}
    last_c = None
    for c in sequence:
        if c in atoms:
            res[c] += 1
            last_c = c
        elif c.isdigit() and last_c is not None:
            res[last_c] += int(c) - 1
            last_c = None
        else:
            last_c = None
    return res
You can see that, regardless of how you obtain the sequence and how the molecular weight is calculated, this function works (under the stated preconditions). If you later need to extend how the atom count is obtained, the function can be changed without affecting the remaining logic.
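As an illustration of that kind of extension, here is one possible variant that lifts the single-digit limitation by using a regular expression. This is only a sketch: the name count_atoms is made up, and it assumes every element is a single uppercase letter optionally followed by a decimal count.

```python
import re

def count_atoms(sequence, atoms=('C', 'H', 'O')):
    '''
    Counts the atoms listed in `atoms`, accepting multi-digit suffixes
    such as C10H12. Symbols outside `atoms` are matched but ignored.
    '''
    res = {atom: 0 for atom in atoms}
    # Each match is a (symbol, digits) pair; digits is '' when no count follows.
    for symbol, digits in re.findall(r'([A-Z])(\d*)', sequence):
        if symbol in res:
            res[symbol] += int(digits) if digits else 1
    return res

print(count_atoms('C10H12'))
print(count_atoms('CH3OOCH3'))
```

Because it returns the same kind of dictionary as GetAtoms, the weight calculation in step 4 would not need to change.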
Step 4:
def MolecularWeight(atoms):
    # rounded atomic weights: H = 1, C = 12, O = 16
    return atoms['H']*1 + atoms['C']*12 + atoms['O']*16
Now your total logic could be this:
while True:
    chemicalFormula = input("Enter the chemical formula, or enter key to quit: ")
    if len(chemicalFormula) == 0:
        break
    print('Molecular weight of', chemicalFormula, 'is', MolecularWeight(GetAtoms(chemicalFormula)))
Here's my idea on how to solve the problem. Basically, you keep track of the current 'state' and iterate through each character exactly once, so you can't lose track of where you are or anything like that.
def getWeightFromChemical(chemical):
    chemicals = {"C" : 12, "H" : 1, "O" : 16} # rounded atomic weights, not atomic numbers
    return chemicals.get(chemical, 0)
def chemicalWeight(chemicalFormula):
    lastchemical = ""
    currentnumber = ""
    weight = 0
    for c in chemicalFormula:
        if str.isalpha(c): # prepare new chemical
            if len(lastchemical) > 0:
                weight += getWeightFromChemical(lastchemical)*int("1" if currentnumber == "" else currentnumber)
            lastchemical = c
            currentnumber = ""
        elif str.isdigit(c): # build up number for previous chemical
            currentnumber += c
    # one last check
    if len(lastchemical) > 0:
        weight += getWeightFromChemical(lastchemical)*int("1" if currentnumber == "" else currentnumber)
    return weight
By the way, can anyone see how to refactor this to not have that piece of code twice? It bugs me.
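One possible way to avoid the duplicated line, offered as a sketch rather than the only answer: append a sentinel letter to the formula so that the last element is flushed inside the loop like every other one. The names chemical_weight and ATOMIC_WEIGHTS are made up for this example, and the weights are rounded.

```python
ATOMIC_WEIGHTS = {"C": 12, "H": 1, "O": 16}  # rounded atomic weights (assumption)

def chemical_weight(formula):
    weight = 0
    symbol, digits = "", ""
    # The trailing "X" is a sentinel: reaching it flushes the final element,
    # and the sentinel itself is never added to the total.
    for c in formula + "X":
        if c.isalpha():
            if symbol:
                weight += ATOMIC_WEIGHTS.get(symbol, 0) * int(digits or "1")
            symbol, digits = c, ""
        elif c.isdigit():
            digits += c
    return weight

print(chemical_weight("C2H6O"))
```

The trade-off is that the sentinel is a small trick the reader has to notice; pulling the repeated expression into a helper function would be the more conventional alternative.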
Change
for index, y in range(index + 1, 1000000000):
to
for index, y in enumerate(range(index + 1, 1000000000)):
Although you may consider renaming your outer loop or inner loop index for clarity
for index, x in enumerate(cformula):
    if x == 'C':
        if cformula[index + 1] == 'H' or cformula[index + 1] == 'O':
            C1 += 1
        else:
            for index, y in range(index + 1, 1000000000):
This is a Really Bad Idea. You are overwriting the value of index from the outer loop with the value of index from the inner loop.
You should use a different name, say index2 for the inner loop.
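A small sketch of what the name clash does in practice (Python 3, hypothetical sample data): the inner loop rebinds index, so any code after the inner loop sees the inner value, and the outer loop then silently resets it on its next step.

```python
formula = "C12H"
seen = []
for index, x in enumerate(formula):
    before = index                      # value assigned by the outer loop
    for index in range(index + 1, len(formula)):
        pass                            # the inner loop rebinds the same name
    seen.append((before, index))        # index is now whatever the inner loop left

print(seen)
```

The outer loop still visits every position, because enumerate() keeps its own counter; only the name index is clobbered.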
Also, when you say for index, y in range(index + 1, 1000000000): you are acting as if you are expecting range() to produce a sequence of 2-tuples. But range always produces a sequence of ints.
Roger has suggested for y in range(index + 1, 1000000000): but I think you intend to get the value of y from somewhere else (it's not clear where). Maybe you want to use the second argument of enumerate() to specify the value to start from instead?
That is,
for index2, y in enumerate(whereveryoumeanttogetyfrom, index + 1):
so that index2 equals index +1 on the first step through the loop, index +2 on the second, etc.
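Here is a minimal sketch of that idea, assuming y is meant to come from the characters that follow the current symbol. The function name count_after and its return value are made up for illustration.

```python
def count_after(cformula, index):
    """Read the decimal count that follows the symbol at position index.

    Returns (count, end), where end is the position just past the digits.
    A symbol with no digits after it counts as 1.
    """
    digits = ""
    end = index + 1
    # The second argument of enumerate() makes index2 the real position in cformula.
    for index2, y in enumerate(cformula[index + 1:], index + 1):
        if not y.isdigit():
            break
        digits += y
        end = index2 + 1
    return (int(digits) if digits else 1), end

cformula = list("C12H22")
print(count_after(cformula, 0))
print(count_after(cformula, 3))
```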
Range returns either a list of int, or an iterable of int, depending on which version of Python you are using. Attempting to assign that single int into two names causes Python to attempt to iterate through that int in automated tuple unpacking.
So, change the
for index, y in range(index + 1, 1000000000):
to
for y in range(index + 1, 1000000000):
Also, you use index + 1 repeatedly, but mostly to look up the next symbol in your cformula. Since that doesn't change over the course of your outer loop, just assign it its own name once, and keep using that name:
for index, x in enumerate(cformula):
    next_index = index + 1
    next_symbol = cformula[next_index]
    if x == 'C':
        if next_symbol == 'H' or next_symbol == 'O':
            C1 += 1
        else:
            for y in range(next_index, 1000000000):
                if next_symbol != 'H' or next_symbol != 'O':
                    num = y*10 + int(next_symbol)
                else:
                    C1 += num
                    break
I've also refactored out some constants to make the code cleaner. Your inner loop as written was failing on tuple assignment, and would only be counting up the y. Also, your index would be reset again once you exited the inner loop, so you would be processing all of your digits repeatedly.
If you want to iterate over the substring after your current symbol, you could just use slice notation to get all of those characters: for subsequent in cformula[next_index:]
For example:
>>> chemical = 'CH3OOCH3'
>>> chemical[2:]
'3OOCH3'
>>> for x in chemical[2:]:
...     print x
...
3
O
O
C
H
3