Read all possible sequential substrings in Python

If I have a list of letters, such as:
word = ['W','I','N','E']
and need to get every possible sequence of substrings, of length 3 or less, e.g.:
W I N E, WI N E, WI NE, W IN E, WIN E etc.
What is the most efficient way to go about this?
Right now, I have:
word = ['W','I','N','E']
for idx,phon in enumerate(word):
    phon_seq = ""
    for p_len in range(3):
        if idx-p_len >= 0:
            phon_seq = " ".join(word[idx-(p_len):idx+1])
            print(phon_seq)
This just gives me the below, rather than the sub-sequences:
W
I
W I
N
I N
W I N
E
N E
I N E
I just can't figure out how to create every possible sequence.

Try this recursive algorithm:
def segment(word):
    def sub(w):
        if len(w) == 0:
            yield []
        # range works on both Python 2 and 3 (the original used xrange)
        for i in range(1, min(4, len(w) + 1)):
            for s in sub(w[i:]):
                yield [''.join(w[:i])] + s
    return list(sub(word))
# And if you want a list of strings:
def str_segment(word):
    return [' '.join(w) for w in segment(word)]
Output:
>>> segment(word)
[['W', 'I', 'N', 'E'], ['W', 'I', 'NE'], ['W', 'IN', 'E'], ['W', 'INE'], ['WI', 'N', 'E'], ['WI', 'NE'], ['WIN', 'E']]
>>> str_segment(word)
['W I N E', 'W I NE', 'W IN E', 'W INE', 'WI N E', 'WI NE', 'WIN E']

As there can either be a space or not in each of three positions (after W, after I and after N), you can think of this as similar to bits being 1 or 0 in a binary representation of a number ranging from 1 to 2^3 - 1.
input_word = "WINE"
for variation_number in range(1, 2 ** (len(input_word) - 1)):
    output = ''
    for position, letter in enumerate(input_word):
        output += letter
        # bit number `position` decides whether a space follows this letter
        if variation_number >> position & 1:
            output += ' '
    print(output)
Edit: To include only variations with sequences of 3 characters or less (in the general case where input_word may be longer than 4 characters), we can exclude cases where the binary representation contains 3 zeroes in a row. (We also start the range from a higher number in order to exclude the cases which would have 000 at the beginning.)
for variation_number in range(2 ** (len(input_word) - 4), 2 ** (len(input_word) - 1)):
    if '000' not in bin(variation_number):
        output = ''
        for position, letter in enumerate(input_word):
            output += letter
            if variation_number >> position & 1:
                output += ' '
        print(output)
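For what it's worth, the same bit trick can be sketched for Python 3 by checking the chunk lengths directly instead of searching bin() for '000'. This is only a sketch under my own assumptions: spaced_variations and max_len are names made up here, not part of the answer above.

```python
def spaced_variations(word, max_len=3):
    """Yield every way to cut `word` into consecutive chunks of at most `max_len` letters."""
    gaps = len(word) - 1  # one possible cut after every letter except the last
    for mask in range(2 ** max(gaps, 0)):
        chunks, start = [], 0
        for position in range(gaps):
            if mask >> position & 1:  # bit set: cut after the letter at `position`
                chunks.append(word[start:position + 1])
                start = position + 1
        chunks.append(word[start:])  # whatever follows the last cut
        if all(len(chunk) <= max_len for chunk in chunks):
            yield " ".join(chunks)


print(sorted(spaced_variations("WINE")))
# ['W I N E', 'W I NE', 'W IN E', 'W INE', 'WI N E', 'WI NE', 'WIN E']
```

Checking the lengths explicitly avoids having to reason about leading zeros, at the cost of building the chunks before rejecting them.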

My implementation for this problem.
#!/usr/bin/env python
# this is a problem of fitting partitions in the word
# we'll use itertools to generate these partitions
import itertools
word = 'WINE'
# this loop generates all possible partitions COUNTS (up to word length)
for partitions_count in range(1, len(word)+1):
    # this loop generates all possible combinations based on count
    for partitions in itertools.combinations(range(1, len(word)), r=partitions_count):
        # because of the way python splits words, we only care about the
        # difference *between* partitions, and not their distance from the
        # word's beginning
        diffs = list(partitions)
        for i in range(len(partitions)-1):
            diffs[i+1] -= partitions[i]
        # first, the whole word is up for taking by partitions
        splits = [word]
        # partition the word's remainder (what was not already "taken")
        # with each partition
        for p in diffs:
            remainder = splits.pop()
            splits.append(remainder[:p])
            splits.append(remainder[p:])
        # print the result
        print(splits)
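A shorter variant of the same idea, as a sketch: pick the cut positions with itertools.combinations, slice between neighbouring cuts, and drop any split that has a piece longer than the limit. The names split_at_cuts and max_len are hypothetical, and the length filter is my addition to match the "length 3 or less" requirement in the question.

```python
from itertools import combinations


def split_at_cuts(word, max_len=3):
    """Return all splits of `word` into consecutive pieces no longer than `max_len`."""
    results = []
    for cut_count in range(len(word)):  # 0 cuts up to len(word) - 1 cuts
        for cuts in combinations(range(1, len(word)), cut_count):
            bounds = (0,) + cuts + (len(word),)
            # slice between each pair of neighbouring boundaries
            pieces = [word[a:b] for a, b in zip(bounds, bounds[1:])]
            if all(len(piece) <= max_len for piece in pieces):
                results.append(pieces)
    return results


for pieces in split_at_cuts("WINE"):
    print(" ".join(pieces))
```

Slicing between absolute boundaries avoids converting the cut positions to differences first.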

As an alternative answer, you can use the itertools module: combinations builds the list of index pairs (i, j), groupby groups the letters with the key i <= word.index(x) <= j, and set removes duplicate results. (word.index(x) assumes that every letter in the word is distinct.)
Also note that different index pairs can give the same split. In the demo below, (0, 1) and (2, 3) both produce WI NE, so one of them has to be removed, which is what the final set does.
All in one list comprehension:
from itertools import combinations, groupby
subs=[[''.join(i) for i in j] for j in [[list(g) for k,g in groupby(word,lambda x: i<=word.index(x)<=j)] for i,j in list(combinations(range(len(word)),2))]]
set([' '.join(j) for j in subs]) # set(['WIN E', 'W IN E', 'W INE', 'WI NE', 'WINE'])
Demo in details :
>>> cl=list(combinations(range(len(word)),2))
>>> cl
[(0, 1), (0, 2), (0, 3), (1, 2), (1, 3), (2, 3)]
>>> new_l=[[list(g) for k,g in groupby(word,lambda x: i<=word.index(x)<=j)] for i,j in cl]
>>> new_l
[[['W', 'I'], ['N', 'E']], [['W', 'I', 'N'], ['E']], [['W', 'I', 'N', 'E']], [['W'], ['I', 'N'], ['E']], [['W'], ['I', 'N', 'E']], [['W', 'I'], ['N', 'E']]]
>>> last=[[''.join(i) for i in j] for j in new_l]
>>> last
[['WI', 'NE'], ['WIN', 'E'], ['WINE'], ['W', 'IN', 'E'], ['W', 'INE'], ['WI', 'NE']]
>>> set([' '.join(j) for j in last])
set(['WIN E', 'W IN E', 'W INE', 'WI NE', 'WINE'])
>>> for i in set([' '.join(j) for j in last]):
... print i
...
WIN E
W IN E
W INE
WI NE
WINE
>>>

I think it can be done like this:
word = "ABCDE"
myList = []
for i in range(1, len(word)+1,1):
    myList.append(word[:i])
    for j in range(len(word[len(word[1:]):]), len(word)-len(word[i:]),1):
        myList.append(word[j:i])
print(myList)
print(sorted(set(myList), key=myList.index))

Related

Split a string into chunks of substrings with successively increasing length

Let's say I have this string:
a = 'abcdefghijklmnopqrstuvwxyz'
And I want to split this string into chunks, like below:
['a', 'bc', 'def', 'ghij', 'klmno', 'pqrstu', 'vwxyz ']
so that every chunk has a different number of characters. For instance, the first one should have one character, the second two and so on.
If there are not enough characters in the last chunk, then I need to add spaces so it matches the length.
I tried this code so far:
print([a[i: i + i + 1] for i in range(len(a))])
But it outputs:
['a', 'bc', 'cde', 'defg', 'efghi', 'fghijk', 'ghijklm', 'hijklmno', 'ijklmnopq', 'jklmnopqrs', 'klmnopqrstu', 'lmnopqrstuvw', 'mnopqrstuvwxy', 'nopqrstuvwxyz', 'opqrstuvwxyz', 'pqrstuvwxyz', 'qrstuvwxyz', 'rstuvwxyz', 'stuvwxyz', 'tuvwxyz', 'uvwxyz', 'vwxyz', 'wxyz', 'xyz', 'yz', 'z']
Here is my desired output:
['a', 'bc', 'def', 'ghij', 'klmno', 'pqrstu', 'vwxyz ']
I don't think any one liner or for loop will look as elegant, so let's go with a generator:
from itertools import islice, count
def get_increasing_chunks(s):
    it = iter(s)
    c = count(1)
    nxt, c_ = next(it), next(c)
    while nxt:
        # pad the chunk with spaces up to its expected length
        yield nxt.ljust(c_)
        nxt, c_ = ''.join(islice(it, c_+1)), next(c)
[*get_increasing_chunks(a)]
# ['a', 'bc', 'def', 'ghij', 'klmno', 'pqrstu', 'vwxyz ']
Thanks to @Prune's comment, I managed to figure out a way to solve this:
a = 'abcdefghijklmnopqrstuvwxyz'
lst = []
c = 0
for i in range(1, len(a) + 1):
    c += i
    lst.append(c)
print([a[x: y] + ' ' * (i - len(a[x: y])) for i, (x, y) in enumerate(zip([0] + lst, lst), 1) if a[x: y]])
Output:
['a', 'bc', 'def', 'ghij', 'klmno', 'pqrstu', 'vwxyz ']
I find the triangular numbers, then do a list comprehension, and add spaces if the length is not right.
What you need is one number that controls how many characters you're going to grab (in this case, the number of iterations), a second number that remembers what the last index was, and a final check that tells the loop where to stop.
my_str = "abcdefghijklmnopqrstuvwxyz"
last_index = 0
index = 1
iter_count = 1
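while True:
    sub_string = my_str[last_index:index]
    print(sub_string)
    last_index = index
    iter_count += 1
    index = index + iter_count
    # >= rather than > so that no empty chunk is printed when the string ends exactly on a chunk boundary
    if last_index >= len(my_str):
        break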
Note that you don't need the while True loop; I was just feeling lazy.
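As a rough sketch of the same bookkeeping with an explicit stop condition instead of while True: start plays the role of last_index and size is the number of characters to grab. The function name increasing_chunks is made up for illustration, and the padding of the last chunk is taken from the question's requirement rather than from the snippet above.

```python
def increasing_chunks(text):
    """Split `text` into chunks of length 1, 2, 3, ...; the last chunk is space-padded."""
    chunks = []
    start, size = 0, 1
    while start < len(text):
        # slicing past the end is safe; ljust fills a short final chunk with spaces
        chunks.append(text[start:start + size].ljust(size))
        start += size
        size += 1
    return chunks


chunks = increasing_chunks("abcdefghijklmnopqrstuvwxyz")
print(chunks)
print([len(chunk) for chunk in chunks])  # [1, 2, 3, 4, 5, 6, 7]
```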
It seems like the split_into recipe at more_itertools can help here. This is less elegant than the answer by @cs95, but perhaps this will help others discover the utility of the itertools module.
Yield a list of sequential items from iterable of length ‘n’ for each integer ‘n’ in sizes.
>>> list(split_into([1,2,3,4,5,6], [1,2,3]))
[[1], [2, 3], [4, 5, 6]]
To use this, we need to construct a list of sizes like [1, 2, 3, 4, 5, 6, 7].
import itertools
def split_into(iterable, sizes):
    it = iter(iterable)
    for size in sizes:
        if size is None:
            yield list(it)
            return
        else:
            yield list(itertools.islice(it, size))
a = 'abcdefghijklmnopqrstuvwxyz'
sizes = [1]
while sum(sizes) <= len(a):
    next_value = sizes[-1] + 1
    sizes.append(next_value)
# sizes = [1, 2, 3, 4, 5, 6, 7]
list(split_into(a, sizes))
# [['a'],
# ['b', 'c'],
# ['d', 'e', 'f'],
# ['g', 'h', 'i', 'j'],
# ['k', 'l', 'm', 'n', 'o'],
# ['p', 'q', 'r', 's', 't', 'u'],
# ['v', 'w', 'x', 'y', 'z']]
chunks = list(map("".join, split_into(a, sizes)))
# ['a', 'bc', 'def', 'ghij', 'klmno', 'pqrstu', 'vwxyz']
# Pad last item with whitespace.
chunks[-1] = chunks[-1].ljust(sizes[-1], " ")
# ['a', 'bc', 'def', 'ghij', 'klmno', 'pqrstu', 'vwxyz ']
Here is a solution using accumulate from itertools.
>>> from itertools import accumulate
>>> from string import ascii_lowercase
>>> s = ascii_lowercase
>>> n = 0
>>> accum = 0
>>> while accum < len(s):
...     n += 1
...     accum += n
...
>>> L = [s[j:i+j] for i, j in enumerate(accumulate(range(n)), 1)]
>>> L[-1] += ' ' * (n-len(L[-1]))
>>> L
['a', 'bc', 'def', 'ghij', 'klmno', 'pqrstu', 'vwxyz ']
Update: Could also be obtained within the while loop
n = 0
accum = 0
L = []
while accum < len(s):
    n += 1
    L.append(s[accum:accum+n])
    accum += n
['a', 'bc', 'def', 'ghij', 'klmno', 'pqrstu', 'vwxyz']
Adding a little to U11-Forward's answer:
a = 'abcdefghijklmnopqrstuvwxyz'
l = list(range(len(a))) # list of numbers from 0 to len(a) - 1
triangular = [sum(l[:i+2]) for i in l] # running sums 0+1, 0+1+2, 0+1+2+3, etc. (the triangular numbers)
print([a[x: y].ljust(i, ' ') for i, (x, y) in enumerate(zip([0] + triangular, triangular), 1) if a[x: y]])
Output:
['a', 'bc', 'def', 'ghij', 'klmno', 'pqrstu', 'vwxyz ']
Find the triangular numbers, do a list comprehension and fill with spaces if the length is incorrect.
a = 'abcdefghijklmnopqrstuvwxyz'
inc = 0
output = []
for i in range(0, len(a)):
    if inc >= len(a):
        break
    # take the chunk before moving the start index forward
    output.append(a[inc: inc+i+1])
    inc = inc+i+1
print(output)
Hey, here is the snippet for your required output. I have just altered your logic. Note that the last chunk is not padded with spaces here.
Output:
['a', 'bc', 'def', 'ghij', 'klmno', 'pqrstu', 'vwxyz']

All possible substring in Python

Can anyone help me find all the possible substrings of a string using Python?
E.g:
string = 'abc'
output
a, b, c, ab, bc, abc
P.S.: I am a beginner and would appreciate a solution that is simple to understand.
You could do something like:
for length in range(len(string)):
    for index in range(len(string) - length):
        print(string[index:index+length+1])
Output:
a
b
c
ab
bc
abc
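Another way is to use combinations. Note that this yields subsequences rather than substrings, so the result also contains non-contiguous entries such as 'ac':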
from itertools import combinations
s = 'abc'
[
''.join(x)
for size in range(1, len(s) + 1)
for x in (combinations(s, size))
]
Out
['a', 'b', 'c', 'ab', 'ac', 'bc', 'abc']
Every substring contains a unique start index and a unique end index (which is greater than the start index). You can use two for loops to get all unique combinations of indices.
def all_substrings(s):
    all_subs = []
    for end in range(1, len(s) + 1):
        for start in range(end):
            all_subs.append(s[start:end])
    return all_subs
s = 'abc'
print(all_substrings(s)) # prints ['a', 'ab', 'b', 'abc', 'bc', 'c']
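The start/end observation above can also be written with itertools.combinations over the slice boundaries 0..len(s), since every pair (start, end) with start < end names exactly one non-empty substring. A minimal sketch; the function name substrings_by_index is hypothetical.

```python
from itertools import combinations


def substrings_by_index(s):
    """Return every non-empty contiguous substring of `s`, ordered by start index."""
    # combinations yields each (start, end) pair once, always with start < end
    return [s[start:end] for start, end in combinations(range(len(s) + 1), 2)]


print(substrings_by_index("abc"))  # ['a', 'ab', 'abc', 'b', 'bc', 'c']
```

Unlike combinations applied to the characters themselves, this only ever produces contiguous slices.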
You can do it like this:
def subString(s):
    for i in range(len(s)):
        for j in range(i+1,len(s)+1):
            print(s[i:j])
subString("aashu")
a
aa
aas
aash
aashu
a
as
ash
ashu
s
sh
shu
h
hu
u

List all possible words with n letters

I want to list all possible words with n letters where the first letter can be a1 or a2, the second can be b1, b2 or b3, the third can be c1 or c2, ... Here's a simple example input-output for n=2 with each letter having 2 alternatives:
input = [["a","b"],["c","d"]]
output = ["ac", "ad", "bc", "bd"]
I tried doing this recursively by creating all possible words with the first 2 letters first, so something like this:
def go(l):
    if len(l) > 2:
        head = go(l[0:2])
        tail = l[2:]
        tail.insert(0, head)
        return go(tail)
    elif len(l) == 2:
        res = []
        for i in l[0]:
            for j in l[1]:
                res.append(i+j)
        return res
    elif len(l) == 1:
        return l
    else:
        return None
However, this becomes incredibly slow for large n or many alternatives per letter. What would be a more efficient way to solve this?
Thanks
I think you just want itertools.product here:
>>> from itertools import product
>>> lst = ['ab', 'c', 'de']
>>> words = product(*lst)
>>> list(words)
[('a', 'c', 'd'), ('a', 'c', 'e'), ('b', 'c', 'd'), ('b', 'c', 'e')]
Or, if you wanted them joined into words:
>>> [''.join(word) for word in product(*lst)]
['acd', 'ace', 'bcd', 'bce']
Or, with your example:
>>> lst = [["a","b"],["c","d"]]
>>> [''.join(word) for word in product(*lst)]
['ac', 'ad', 'bc', 'bd']
Of course for very large n or very large sets of letters (size m), this will be slow. If you want to generate an exponentially large set of outputs (O(m**n)), that will take exponential time. But at least it has constant rather than exponential space (it generates one product at a time, instead of a giant list of all of them), and will be faster than what you were on your way to by a decent constant factor, and it's a whole lot simpler and harder to get wrong.
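To illustrate the constant-space point, here is a small sketch that wraps product in a generator and takes only the first few words from a very large search space. The names words, alternatives and big are hypothetical, chosen just for this example.

```python
from itertools import islice, product


def words(alternatives):
    """Lazily yield every word built by picking one letter per position."""
    for letters in product(*alternatives):
        yield "".join(letters)


# 26 alternatives for each of 10 positions: far too many words to store in a list,
# but the first three come back immediately because nothing is precomputed.
big = ["abcdefghijklmnopqrstuvwxyz"] * 10
print(list(islice(words(big), 3)))  # ['aaaaaaaaaa', 'aaaaaaaaab', 'aaaaaaaaac']
```

Calling list(words(big)) instead would try to materialise all 26**10 words, which is exactly what the lazy version avoids.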
You can use the permutations from the built-in itertools module to achieve this, like so
>>> from itertools import permutations
>>> [''.join(word) for word in permutations('abc', 2)]
['ab', 'ac', 'ba', 'bc', 'ca', 'cb']
Generating all strings of some length with given alphabet :
test.py :
def generate_random_list(alphabet, length):
    if length == 0: return []
    c = [[a] for a in alphabet[:]]
    if length == 1: return c
    c = [[x,y] for x in alphabet for y in alphabet]
    if length == 2: return c
    for l in range(2, length):
        c = [[x]+y for x in alphabet for y in c]
    return c
if __name__ == "__main__":
    for p in generate_random_list(['h','i'],2):
        print p
$ python2 test.py
['h', 'h']
['h', 'i']
['i', 'h']
['i', 'i']
Next Way :
def generate_random_list(alphabet, length):
    c = []
    for i in range(length):
        c = [[x]+y for x in alphabet for y in c or [[]]]
    return c
if __name__ == "__main__":
    for p in generate_random_list(['h','i'],2):
        print p
Next Way :
import itertools
if __name__ == "__main__":
    chars = "hi"
    count = 2
    for item in itertools.product(chars, repeat=count):
        print("".join(item))
import itertools
print([''.join(x) for x in itertools.product('hi',repeat=2)])
Next Way :
from itertools import product
#from string import ascii_letters, digits
#for i in product(ascii_letters + digits, repeat=2):
for i in product("hi",repeat=2):
    print(''.join(i))

How to get certain number of alphabets from a list?

I have a list of 26 numbers. I want to print a list of letters according to those numbers. For example, given this list (26 numbers taken from input):
[0,0,0,0,2,0,0,0,0,0,0,1,0,0,0,0,0,0,1,0,0,0,0,0,0,0]
I'd like the output to be:
[e,e,l,s]
'e' appears twice in the output because index 4 corresponds to 'e' in the English alphabet and the number at index 4 is 2. Likewise, 'l' is at index 11 and its number is 1, and the same goes for 's'. The other letters don't appear because their numbers are zero.
As another example, given this 26-number input:
[1,2,2,3,4,0,3,4,4,1,3,1,4,4,1,0,0,0,0,0,4,2,3,2,2,1]
The output should be:
[a,b,b,c,c,d,d,d,e,e,e,e,g,g,g,h,h,h,h,i,i,i,i,j,k,k,k,l,m,m,m,m,n,n,n,n,o,u,u,u,u,v,v,w,w,w,x,x,y,y,z]
Is there any way to do this in Python 3?
You can use chr(97 + item_index) to get the respective items and then multiply by the item itself:
In [40]: [j * chr(97 + i) for i, j in enumerate(lst) if j]
Out[40]: ['ee', 'l', 's']
If you want them separate you can utilize itertools module:
In [44]: from itertools import repeat, chain
In [45]: list(chain.from_iterable(repeat(chr(97 + i), j) for i, j in enumerate(lst) if j))
Out[45]: ['e', 'e', 'l', 's']
Yes, it is definitely possible in Python 3.
First, define an example list of numbers (as you did) and an empty list to store the resulting letters.
The link between index and letter is chr(97 + index): ord("a") is 97, so conversely chr(97) is "a". The first index is 0, which gives chr(97), and each later index gives the next letter of the alphabet.
Next, use a nested for-loop: the outer loop iterates over the list of numbers, and the inner loop appends the same letter as many times as the number says.
We could do result.append(chr(97 + i) * my_list[i]) in the first loop itself, but that wouldn't yield every letter separately ([a,b,b,c,c,d,d,d...]); it would look like [a,bb,cc,ddd...] instead.
my_list = [1,2,2,3,4,0,3,4,4,1,3,1,4,4,1,0,0,0,0,0,4,2,3,2,2,1]
result = []
for i in range(len(my_list)):
    if my_list[i] > 0:
        for j in range(my_list[i]):
            result.append(chr(97 + i))
    else:
        pass
print(result)
An alternative to the wonderful answer by @Kasramvd:
import string
n = [0,0,0,0,2,0,0,0,0,0,0,1,0,0,0,0,0,0,1,0,0,0,0,0,0,0]
res = [i * c for i, c in zip(n, string.ascii_lowercase) if i]
print(res) # -> ['ee', 'l', 's']
Your second example produces:
['a', 'bb', 'cc', 'ddd', 'eeee', 'ggg', 'hhhh', 'iiii', 'j', 'kkk', 'l', 'mmmm', 'nnnn', 'o', 'uuuu', 'vv', 'www', 'xx', 'yy', 'z']
Splitting the strings ('bb' to 'b', 'b') can be done with the standard schema:
[x for y in something for x in y]
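Filled in for the first example, that flattening schema could look like the sketch below. The variable names counts, grouped and flat are mine, not from the answer.

```python
import string

counts = [0,0,0,0,2,0,0,0,0,0,0,1,0,0,0,0,0,0,1,0,0,0,0,0,0,0]

# one string per letter with a non-zero count, e.g. 'ee'
grouped = [n * letter for n, letter in zip(counts, string.ascii_lowercase) if n]

# the outer loop walks over the strings, the inner loop over their characters
flat = [char for group in grouped for char in group]
print(flat)  # ['e', 'e', 'l', 's']
```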
Using a slightly different approach, which gives the characters individually as in your example:
import string
import numpy as np
a = [0,0,0,0,2,0,0,0,0,0,0,1,0,0,0,0,0,0,1,0,0,0,0,0,0,0]
alphabet_lookup = np.repeat(np.arange(len(a)), a)
letter_lookup = np.array(list(string.ascii_lowercase))
res = letter_lookup[alphabet_lookup]
print(res)
To get
['e' 'e' 'l' 's']

How to combine 3 validation checks into one loop?

For example I have a tuple such as
tup = [['P Y T F EY EN', 'p y t h o n'], ['R O K', 'r o x']]
I then separate the tuple into lists such as
lst1 = [['P', 'Y', 'T', 'F', 'EY', 'EN'], ['R', 'O', 'K']]
lst2 = [['p', 'y', 't', 'h', 'o', 'n'], ['r', 'o', 'x']]
The 3 conditions I have are as follows:
First the length of the 1st element in the tuple must be equal to that of the 2nd
for i in tup:
    if not len(tup[0].split()) == len(tup[1].split()) :
        count +=1
        break
The 2nd condition is that for every element in lst1, each character in the element must be in another document, such as a csv file:
for i in lst1:
    for j in i:
        if j not in file:
            count+=1
            break
The 3rd condition is that for every element in lst2, each character must also be in another document:
for i in lst2:
    for j in i:
        if j not in other_file:
            count+=1
            break
As you can see I want the count to increase each time one of these conditions is broken. I also don't want the counts to overlap and to skip onto the next row if a condition is broken while appending to the count.
Maybe this will help:
I am assuming the files are small enough to be read all at once:
f = open('doc1.csv', 'r') # read all of doc1.csv now
doc1 = f.read()
f.close()
f = open('doc2.csv', 'r') # read all of doc2.csv now
doc2 = f.read()
f.close()
count = 0 # count of all rows that are invalid
for item in tup:
    l1 = item[0].split() # list version of the first string
    l2 = item[1].split() # list version of the second string
    # a row is invalid if the lengths differ, if any token in l1 is not in doc1, or if any token in l2 is not in doc2
    if len(l1) != len(l2) or not all([char in doc1 for char in l1]) or not all([char in doc2 for char in l2]):
        count += 1
print(count)
First of all, there are two issues with your example:
1) tup is a list, not a tuple;
2) tup[0] = ['P Y T F EY EN', 'p y t h o n']; tup[1] = ['R O K', 'r o x'];
Both of them are lists, so you cannot call split() on them.
If you would like to calculate the total count, you can do it in one statement like the following (note that this counts every individual violation, so a single row can contribute more than once):
print(sum([ not len(i[0].split()) == len(i[1].split()) for i in tup ] +
          [ j not in file for i in lst1 for j in i ] +
          [ j not in other_file for i in lst2 for j in i ]))
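If each row should count at most once, as the question asks, the three checks can be combined in a single loop that short-circuits on the first failure. This is a sketch under assumptions: count_invalid_rows is a made-up name, and the two sets stand in for whatever the csv files actually contain.

```python
def count_invalid_rows(rows, allowed_first, allowed_second):
    """Count rows that break at least one of the three conditions (each row counts once)."""
    invalid = 0
    for first, second in rows:
        tokens_a, tokens_b = first.split(), second.split()
        # `and` stops at the first failing condition, so checks never overlap
        ok = (
            len(tokens_a) == len(tokens_b)
            and all(token in allowed_first for token in tokens_a)
            and all(token in allowed_second for token in tokens_b)
        )
        if not ok:
            invalid += 1
    return invalid


tup = [['P Y T F EY EN', 'p y t h o n'], ['R O K', 'r o x']]
phonemes = {'P', 'Y', 'T', 'F', 'EY', 'EN', 'R', 'O'}  # hypothetical contents of the first file
letters = set('pythonrox')                              # hypothetical contents of the second file
print(count_invalid_rows(tup, phonemes, letters))  # 1, because 'K' is not in phonemes
```

Using sets for the allowed tokens keeps each membership test cheap and avoids the substring matches you get from `in` on a raw file string.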
