Subwords of a string in Python

I am trying to build, in a fast way, a list of every contiguous substring of a string. I don't mean subwords (subsequences) in general. For example, from the string "ABC" I want to get:
['C', 'B', 'BC', 'A', 'AB', 'ABC']
(without "AC", which is a subword but not contiguous)
The same goes for "123":
I want to get ['3', '2', '23', '1', '12', '123'] instead of ['3', '2', '23', '1', '13', '12', '123']

Here is a simple loop and slice based generator function:
def subs(s):
    for i in range(len(s)):
        for j in range(i+1, len(s)+1):
            yield s[i:j]
>>> list(subs("ABC"))
['A', 'AB', 'ABC', 'B', 'BC', 'C']

Might be faster to extend the substrings instead of freshly slicing each:
def subs(s):
    while s:
        t = ''
        for c in s:
            t += c
            yield t
        s = s[1:]
Benchmark results for s = "z" * 5000:
8.4 seconds subs_slice
1.5 seconds subs_extend
Benchmark code:
from timeit import timeit
from collections import deque
def subs_slice(s):
    for i in range(len(s)):
        for j in range(i+1, len(s)+1):
            yield s[i:j]
def subs_extend(s):
    while s:
        t = ''
        for c in s:
            t += c
            yield t
        s = s[1:]
funcs = subs_slice, subs_extend
for func in funcs:
    print(list(func('ABCD')))
s = "z" * 5000
for _ in range(3):
    for func in funcs:
        t = timeit(lambda: deque(func(s), 0), number=1)
        print(t, func.__name__)
    print()
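Neither generator above yields the substrings in the order shown in the question (last character first). A minimal sketch, assuming that order matters: walk the start index from right to left. The name subs_reversed is hypothetical and not part of the answer.

```python
def subs_reversed(s):
    """Yield every contiguous substring of s, starting from the last character."""
    for i in reversed(range(len(s))):       # start index, right to left
        for j in range(i + 1, len(s) + 1):  # end index (exclusive)
            yield s[i:j]

print(list(subs_reversed("ABC")))  # ['C', 'B', 'BC', 'A', 'AB', 'ABC']
```

The same trick should apply to the extend-based variant by iterating over the suffixes from shortest to longest.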

For "ABC" you can just build ['C', 'B', 'BC', 'A', 'AB', 'ABC', 'AC'] and then use remove() to drop the subword from your list. E.g.:
abc_list = ['C', 'B', 'BC', 'A', 'AB', 'ABC', 'AC']
abc_list.remove('AC')
Output: ['C', 'B', 'BC', 'A', 'AB', 'ABC']
The question lacks the context needed for a full answer. Do all of your strings have 3 characters or more? How do you define what you don't need?
If all the strings are exactly 3 characters long, then you can use this:
def subwording(word: str):
    subword = word[0]+word[2]
    return subword
Then you can remove subword from your list.
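For strings of arbitrary length, hard-coding the unwanted subword does not scale. One possible generalization (an assumption about what the asker wants, not something stated in the answer) is to generate all subsequences with itertools.combinations and keep only those that occur contiguously in the string. The helper name contiguous_only is hypothetical.

```python
from itertools import combinations

def contiguous_only(s):
    """Return the subsequences of s that also occur as contiguous substrings."""
    subsequences = (
        "".join(chars)
        for length in range(1, len(s) + 1)
        for chars in combinations(s, length)
    )
    # `sub in s` is True exactly when sub appears as a contiguous run in s.
    return [sub for sub in subsequences if sub in s]

print(contiguous_only("ABC"))  # ['A', 'B', 'C', 'AB', 'BC', 'ABC']
```

This enumerates 2**n - 1 candidates for a string of length n, so the slicing generators are far cheaper; the sketch only illustrates the filtering idea.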

Related

python from ['a','b','c','d'] to ['a', 'ab', 'abc', 'abcd']

I have a list ['a','b','c','d'] and want to make another list like this: ['a', 'ab', 'abc', 'abcd'].
Thanks
Tried:
list1=['a','b','c', 'd']
for i in range(1, (len(list1)+1)):
    for j in range(1, 1+i):
        print(*[list1[j-1]], end = "")
    print()
returns:
a
ab
abc
abcd
It does print what I want, but I'm not sure how to add it to a list so that it looks like ['a', 'ab', 'abc', 'abcd'].

Use itertools.accumulate, which by default adds up the elements as it goes, like a cumulative sum. Addition (__add__) is defined for str and results in the concatenation of the strings:
assert "a" + "b" == "ab"
we can use accumulate as is:
import itertools
list1 = ["a", "b", "c", "d"]
list2 = list(itertools.accumulate(list1)) # list() because accumulate returns an iterator
print(list2) # ['a', 'ab', 'abc', 'abcd']
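accumulate also accepts an optional second argument, a two-argument function that replaces the default addition. A small sketch of that, assuming you want a separator between the pieces (the separator and the name dashed are illustrative only):

```python
from itertools import accumulate

list1 = ["a", "b", "c", "d"]
# The second argument replaces the default addition with any two-argument function.
dashed = list(accumulate(list1, lambda acc, item: acc + "-" + item))
print(dashed)  # ['a', 'a-b', 'a-b-c', 'a-b-c-d']
```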

Append to a second list in a loop:
list1=['a','b','c', 'd']
list2 = []
s = ''
for c in list1:
    s += c
    list2.append(s)
print(list2)
Output:
['a', 'ab', 'abc', 'abcd']

list1=['a','b','c', 'd']
l = []
for i in range(len(list1)):
    l.append("".join(list1[:i+1]))
print(l)
Printing stuff is useless if you want to do ANYTHING else with the data you are printing. Only use it when you actually want to display something to console.

You could form a string and slice it in a list comprehension:
s = ''.join(['a', 'b', 'c', 'd'])
out = [s[:i+1] for i, _ in enumerate(s)]
print(out)
Output:
['a', 'ab', 'abc', 'abcd']

You can do this in a list comprehension:
vals = ['a', 'b', 'c', 'd']
res = [''.join(vals[:i+1]) for i, _ in enumerate(vals)]

Code:
[''.join(list1[:i+1]) for i,l in enumerate(list1)]
Output:
['a', 'ab', 'abc', 'abcd']

Split a string into chunks of substrings with successively increasing length

Let's say I have this string:
a = 'abcdefghijklmnopqrstuvwxyz'
And I want to split this string into chunks, like below:
['a', 'bc', 'def', 'ghij', 'klmno', 'pqrstu', 'vwxyz  ']
so that every chunk has a different number of characters. For instance, the first one should have one character, the second two and so on.
If there are not enough characters in the last chunk, then I need to add spaces so it matches the length.
I tried this code so far:
print([a[i: i + i + 1] for i in range(len(a))])
But it outputs:
['a', 'bc', 'cde', 'defg', 'efghi', 'fghijk', 'ghijklm', 'hijklmno', 'ijklmnopq', 'jklmnopqrs', 'klmnopqrstu', 'lmnopqrstuvw', 'mnopqrstuvwxy', 'nopqrstuvwxyz', 'opqrstuvwxyz', 'pqrstuvwxyz', 'qrstuvwxyz', 'rstuvwxyz', 'stuvwxyz', 'tuvwxyz', 'uvwxyz', 'vwxyz', 'wxyz', 'xyz', 'yz', 'z']
Here is my desired output:
['a', 'bc', 'def', 'ghij', 'klmno', 'pqrstu', 'vwxyz  ']

I don't think any one-liner or for loop will look as elegant, so let's go with a generator:
from itertools import islice, count
def get_increasing_chunks(s):
    it = iter(s)
    c = count(1)
    nxt, c_ = next(it), next(c)
    while nxt:
        # Pad the (possibly short) chunk to its target length.
        yield nxt.ljust(c_)
        nxt, c_ = ''.join(islice(it, c_+1)), next(c)
[*get_increasing_chunks(a)]
# ['a', 'bc', 'def', 'ghij', 'klmno', 'pqrstu', 'vwxyz  ']

Thanks to @Prune's comment, I managed to figure out a way to solve this:
a = 'abcdefghijklmnopqrstuvwxyz'
lst = []
c = 0
for i in range(1, len(a) + 1):
    c += i
    lst.append(c)
print([a[x: y] + ' ' * (i - len(a[x: y])) for i, (x, y) in enumerate(zip([0] + lst, lst), 1) if a[x: y]])
Output:
['a', 'bc', 'def', 'ghij', 'klmno', 'pqrstu', 'vwxyz  ']
I find the triangular numbers, then do a list comprehension and add spaces if the length is not right.
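The chunk boundaries here are the triangular numbers T(k) = k(k+1)/2: chunk k runs from index T(k-1) to T(k). A minimal sketch of that observation using the closed form instead of a running sum; the function names triangular and increasing_chunks are hypothetical.

```python
def triangular(k):
    """Return the k-th triangular number: 0, 1, 3, 6, 10, ..."""
    return k * (k + 1) // 2

def increasing_chunks(s):
    """Split s into chunks of length 1, 2, 3, ..., padding the last with spaces."""
    chunks = []
    k = 1
    while triangular(k - 1) < len(s):
        chunk = s[triangular(k - 1):triangular(k)]
        chunks.append(chunk.ljust(k))  # only the final chunk can be short
        k += 1
    return chunks

print(increasing_chunks('abcdefghijklmnopqrstuvwxyz'))
```

For the 26-letter alphabet the last chunk has 5 letters and a target length of 7, so it is padded with two spaces.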

So what you need is one number that controls how many characters you're going to grab (in this case the iteration count), a second number that remembers what the last index was, and one last number that tells you where to stop.
my_str = "abcdefghijklmnopqrstuvwxyz"
last_index = 0
index = 1
iter_count = 1
while True:
    sub_string = my_str[last_index:index]
    print(sub_string)
    last_index = index
    iter_count += 1
    index = index + iter_count
    # >= avoids printing an empty chunk when the string ends exactly on a boundary
    if last_index >= len(my_str):
        break
Note that you don't need the while loop; I was just feeling lazy.

It seems like the split_into recipe from more_itertools can help here. This is less elegant than the answer by @cs95, but perhaps it will help others discover the utility of the itertools module.
Yield a list of sequential items from iterable of length ‘n’ for each integer ‘n’ in sizes.
>>> list(split_into([1,2,3,4,5,6], [1,2,3]))
[[1], [2, 3], [4, 5, 6]]
To use this, we need to construct a list of sizes like [1, 2, 3, 4, 5, 6, 7].
import itertools
def split_into(iterable, sizes):
    it = iter(iterable)
    for size in sizes:
        if size is None:
            yield list(it)
            return
        else:
            yield list(itertools.islice(it, size))
a = 'abcdefghijklmnopqrstuvwxyz'
sizes = [1]
# Using < (not <=) avoids an extra empty chunk when len(a) is a triangular number.
while sum(sizes) < len(a):
    next_value = sizes[-1] + 1
    sizes.append(next_value)
# sizes = [1, 2, 3, 4, 5, 6, 7]
list(split_into(a, sizes))
# [['a'],
# ['b', 'c'],
# ['d', 'e', 'f'],
# ['g', 'h', 'i', 'j'],
# ['k', 'l', 'm', 'n', 'o'],
# ['p', 'q', 'r', 's', 't', 'u'],
# ['v', 'w', 'x', 'y', 'z']]
chunks = list(map("".join, split_into(a, sizes)))
# ['a', 'bc', 'def', 'ghij', 'klmno', 'pqrstu', 'vwxyz']
# Pad last item with whitespace.
chunks[-1] = chunks[-1].ljust(sizes[-1], " ")
# ['a', 'bc', 'def', 'ghij', 'klmno', 'pqrstu', 'vwxyz  ']

Here is a solution using accumulate from itertools.
>>> from itertools import accumulate
>>> from string import ascii_lowercase
>>> s = ascii_lowercase
>>> n = 0
>>> accum = 0
>>> while accum < len(s):
...     n += 1
...     accum += n
...
>>> L = [s[j:i+j] for i, j in enumerate(accumulate(range(n)), 1)]
>>> L[-1] += ' ' * (n-len(L[-1]))
>>> L
['a', 'bc', 'def', 'ghij', 'klmno', 'pqrstu', 'vwxyz  ']
Update: The chunks could also be collected within the while loop (without the padding):
n = 0
accum = 0
L = []
while accum < len(s):
    n += 1
    L.append(s[accum:accum+n])
    accum += n
['a', 'bc', 'def', 'ghij', 'klmno', 'pqrstu', 'vwxyz']

Adding a little to U11-Forward's answer:
a = 'abcdefghijklmnopqrstuvwxyz'
l = list(range(len(a))) # numbers 0 to len(a) - 1
triangular = [sum(l[:i+2]) for i in l] # 0+1, 0+1+2, 0+1+2+3, etc.
print([a[x: y].ljust(i, ' ') for i, (x, y) in enumerate(zip([0] + triangular, triangular), 1) if a[x: y]])
Output:
['a', 'bc', 'def', 'ghij', 'klmno', 'pqrstu', 'vwxyz  ']
Find the triangular numbers, do a list comprehension and fill with spaces if the length is incorrect.

a = 'abcdefghijklmnopqrstuvwxyz'
inc = 0
output = []
for i in range(len(a)):
    if inc >= len(a):
        break
    # Take the next i+1 characters before advancing, padding the last chunk.
    output.append(a[inc: inc+i+1].ljust(i+1))
    inc = inc+i+1
print(output)
Hey, here is the snippet for your required output. I have just altered your logic.
Output:
['a', 'bc', 'def', 'ghij', 'klmno', 'pqrstu', 'vwxyz  ']

How to join subsequent digits in a Python list into a double (or more) digit number

I have the following string:
string = 'TAA15=ATT'
I make a list out of it:
string_list = list(string)
print(string_list)
and the result is:
['T', 'A', 'A', '1', '5','=', 'A', 'T', 'T']
I need to detect subsequent digits and join them into a single number, as shown below:
['T', 'A', 'A', '15','=', 'A', 'T', 'T']
I'm also quite concerned about performance. This string conversion is done thousands of times.
Thank you for any hints you can provide.

Here is a very short solution
import re
def digitsMerger(source):
    return re.findall(r'\d+|.', source)
digitsMerger('TAA15=ATT')
['T', 'A', 'A', '15', '=', 'A', 'T', 'T']

Using itertools.groupby
Ex:
from itertools import groupby
string = 'TAA15=ATT'
result = []
for k, v in groupby(string, str.isdigit):
    if k:
        result.append("".join(v))
    else:
        result.extend(v)
print(result)
Output:
['T', 'A', 'A', '15', '=', 'A', 'T', 'T']

Another regexp:
import re
s = 'TAA15=ATT'
pattern = r'\d+|\D'
m = re.findall(pattern, s)
print(m)
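Since the question mentions that the conversion runs thousands of times, one option is to compile the pattern once and reuse it. A minimal sketch; the names DIGITS_OR_CHAR and merge_digits are hypothetical, and because the re module already caches recently compiled patterns the gain may be small, so it is worth measuring on real data.

```python
import re

# Compiled once at import time and reused for every call.
DIGITS_OR_CHAR = re.compile(r"\d+|\D")

def merge_digits(text):
    """Split text into single characters, keeping runs of digits together."""
    return DIGITS_OR_CHAR.findall(text)

print(merge_digits("TAA15=ATT"))  # ['T', 'A', 'A', '15', '=', 'A', 'T', 'T']
```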

You can use regular expressions, via Python's re library:
import re
string = 'TAA15=ATT'
num = re.sub('[^0-9,]', "", string)
pos = string.find(num)
str2 = re.sub('\\d+',"", string)
str2 = re.sub('=',"", str2)
print(str2)
l = list()
for el in str2:
    l.append(el)
l.insert(pos, num)
print(l)
Basically re.sub('[^0-9,]', "", string) says: take the string, match all the characters that are not (^ means negation) digits (0-9) or commas, and substitute them with the second parameter, i.e., an empty string. So what's left is only the digits, which you can convert to an integer if needed.
If the = always comes right after the digits, then instead of
str2 = re.sub('\\d+',"", string)
str2 = re.sub('=',"", str2)
you can do
str2 = re.sub('\\d+=',"", string)

You can create a function that compares the last value seen and the next and use functools.reduce:
from functools import reduce
string_list = ['T', 'A', 'A', '1', '5', 'A', 'T', 'T']
def combine_nums(lst, nxt):
    if lst and all(map(str.isdigit, (lst[-1], nxt))):
        # Merge the digit into the previous entry instead of appending a new one.
        return lst[:-1] + [lst[-1] + nxt]
    return lst + [nxt]
print(reduce(combine_nums, string_list, []))
Results:
['T', 'A', 'A', '15', 'A', 'T', 'T']

Combinations and Permutations of characters

I am trying to come up with elegant code that creates combinations/permutations of characters from a single character:
E.g. from a single character I'd like code to create these permutations (order of the result is not important):
'a' ----> ['a', 'aa', 'A', 'AA', 'aA', 'Aa']
The not so elegant solutions I have thus far:
# this does it...
from itertools import permutations
char = 'a'
p = [char, char*2, char.upper(), char.upper()*2]
pp = [] # stores the final list of permutations
for j in range(1,3):
    for i in permutations(p,j):
        p2 = ''.join(i)
        if len(p2) < 3:
            pp.append(p2)
print(pp)
['a', 'aa', 'A', 'AA', 'aA', 'Aa']
# this also works...
char = 'a'
p = ['', char, char*2, char.upper(), char.upper()*2]
pp = [] # stores the final list of permutations
for i in permutations(p,2):
    j = ''.join(i)
    if len(j) < 3:
        pp.append(j)
print(list(set(pp)))
['a', 'aa', 'aA', 'AA', 'Aa', 'A']
# and finally... so does this:
char = 'a'
p = ['', char, char.upper()]
pp = [] # stores the final list of permutations
for i in permutations(p,2):
    pp.append(''.join(i))
print(list(set(pp)) + [char*2, char.upper()*2])
['a', 'A', 'aA', 'Aa', 'aa', 'AA']
I'm not great with lambdas, and I suspect that may be where a better solution lies.
So, could you help me find the most elegant/pythonic way to the desired result?

You can simply use the itertools.product with different repeat values to get the expected result
>>> pop = ['a', 'A']
>>> from itertools import product
>>> [''.join(item) for i in range(len(pop)) for item in product(pop, repeat=i + 1)]
['a', 'A', 'aa', 'aA', 'Aa', 'AA']
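The same idea can be wrapped in a function that starts from the single character, as in the question. A minimal sketch; the name case_variants and the max_len parameter are hypothetical additions, not part of the answer.

```python
from itertools import product

def case_variants(char, max_len=2):
    """Return all strings of length 1..max_len built from char's two cases."""
    pool = [char.lower(), char.upper()]
    return [
        "".join(item)
        for length in range(1, max_len + 1)
        for item in product(pool, repeat=length)
    ]

print(case_variants("a"))  # ['a', 'A', 'aa', 'aA', 'Aa', 'AA']
```

With max_len=3 this would yield 2 + 4 + 8 = 14 strings, since product with repeat=n produces 2**n tuples for a two-element pool.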

Python: find all possible word combinations with a sequence of characters (word segmentation)

I'm doing some word segmentation experiments like the following.
lst is a sequence of characters, and output is the list of all possible segmentations into words.
lst = ['a', 'b', 'c', 'd']
def foo(lst):
    ...
    return output
output = [['a', 'b', 'c', 'd'],
          ['ab', 'c', 'd'],
          ['a', 'bc', 'd'],
          ['a', 'b', 'cd'],
          ['ab', 'cd'],
          ['abc', 'd'],
          ['a', 'bcd'],
          ['abcd']]
I've checked combinations and permutations in the itertools library, and also tried combinatorics.
However, it seems I'm looking in the wrong place, because this is not a pure permutation or combination problem...
It seems that I can achieve this with lots of loops, but the efficiency might be low.
EDIT
The word order is important, so combinations like ['ba', 'dc'] or ['cd', 'ab'] are not valid.
The order should always be from left to right.
EDIT
@Stuart's solution doesn't work in Python 2.7.6
EDIT
@Stuart's solution does work in Python 2.7.6, see the comments below.

itertools.product should indeed be able to help you.
The idea is this:
Consider A1, A2, ..., AN separated by slabs. There will be N-1 slabs.
If there is a slab, there is a segmentation. If there is no slab, there is a join.
Thus, for a given sequence of length N, you should have 2^(N-1) such combinations.
Like this:
import itertools
lst = ['a', 'b', 'c', 'd']
combinatorics = itertools.product([True, False], repeat=len(lst) - 1)
solution = []
for combination in combinatorics:
    i = 0
    one_such_combination = [lst[i]]
    for slab in combination:
        i += 1
        if not slab: # there is a join
            one_such_combination[-1] += lst[i]
        else:
            one_such_combination += [lst[i]]
    solution.append(one_such_combination)
print(solution)
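The slab idea can also be expressed with an integer bitmask instead of tuples of booleans: each of the N-1 gaps gets one bit, and counting from 0 to 2^(N-1) - 1 visits every pattern. A minimal sketch under that reading; the function name segmentations and the choice that a 1 bit means "join" are assumptions made for illustration.

```python
def segmentations(chars):
    """Return every left-to-right segmentation of a sequence of characters."""
    if not chars:
        return []  # assumption: an empty input has no segmentations to report
    gaps = len(chars) - 1
    results = []
    for mask in range(2 ** gaps):
        words = [chars[0]]
        for j in range(gaps):
            # Bit j (counted from the left) is 1 when gap j has no slab (a join).
            joined = (mask >> (gaps - 1 - j)) & 1
            if joined:
                words[-1] += chars[j + 1]
            else:
                words.append(chars[j + 1])
        results.append(words)
    return results

result = segmentations(['a', 'b', 'c', 'd'])
print(len(result))  # 8
print(result[0])    # ['a', 'b', 'c', 'd']
print(result[-1])   # ['abcd']
```

The input list is not modified, because string concatenation rebinds the entry in words rather than changing chars.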

#!/usr/bin/env python
from itertools import combinations
a = ['a', 'b', 'c', 'd']
a = "".join(a)
cuts = []
for i in range(0,len(a)):
    cuts.extend(combinations(range(1,len(a)),i))
for i in cuts:
    last = 0
    output = []
    for j in i:
        output.append(a[last:j])
        last = j
    output.append(a[last:])
    print(output)
output:
zsh 2419 % ./words.py
['abcd']
['a', 'bcd']
['ab', 'cd']
['abc', 'd']
['a', 'b', 'cd']
['a', 'bc', 'd']
['ab', 'c', 'd']
['a', 'b', 'c', 'd']

There are 8 options, each mirroring the binary numbers 0 through 7:
000
001
010
011
100
101
110
111
Each 0 and 1 represents whether or not the 2 letters at that index are "glued" together. 0 for no, 1 for yes.
>>> lst = ['a', 'b', 'c', 'd']
... output = []
... formatstr = "{{:0{}.0f}}".format(len(lst)-1)
... for i in range(2**(len(lst)-1)):
...     output.append([])
...     s = "{:b}".format(i)
...     s = str(formatstr.format(float(s)))
...     lstcopy = lst[:]
...     for j, c in enumerate(s):
...         if c == "1":
...             lstcopy[j+1] = lstcopy[j] + lstcopy[j+1]
...         else:
...             output[-1].append(lstcopy[j])
...     output[-1].append(lstcopy[-1])
... output
[['a', 'b', 'c', 'd'],
['a', 'b', 'cd'],
['a', 'bc', 'd'],
['a', 'bcd'],
['ab', 'c', 'd'],
['ab', 'cd'],
['abc', 'd'],
['abcd']]
>>>

You can use a recursive generator:
def split_combinations(L):
    for split in range(1, len(L)):
        for combination in split_combinations(L[split:]):
            yield [L[:split]] + combination
    yield [L]
print (list(split_combinations('abcd')))
Edit. I'm not sure how well this would scale up for long strings and at what point it hits Python's recursion limit. Similarly to some of the other answers, you could also use combinations from itertools to work through every possible combination of split points.
from itertools import combinations
def split_string(s, t):
    return [s[start:finish] for start, finish in zip((None, ) + t, t + (None, ))]
def split_combinations(s):
    for i in range(len(s)):
        for split_points in combinations(range(1, len(s)), i):
            yield split_string(s, split_points)
These both seem to work as intended in Python 2.7 and Python 3.2. As @twasbrillig says, make sure you indent it as shown.
