Adding spaces to string based on list - python

I have a string s and a list of strings, arr.
The length of s is equal to the total length of strings in arr.
I need to split s into a list, such that each element in the list has the same length as the corresponding element in arr.
For example:
s = 'Pythonisanprogramminglanguage'
arr = ['lkjhgf', 'zx', 'qw', 'ertyuiopakk', 'foacdhlc']
expected == ['Python', 'is', 'an', 'programming', 'language']

It is much cleaner to use iter with next:
s = 'Pythonisanprogramminglanguage'
arr = ['lkjhgf', 'zx', 'qw', 'ertyuiopakk', 'foacdhlc']
new_s = iter(s)
result = [''.join(next(new_s) for _ in i) for i in arr]
Output:
['Python', 'is', 'an', 'programming', 'language']
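The same idea can also be written with itertools.islice, which takes the next len(word) characters from the shared iterator in one call. This is only a sketch; the function name split_like is made up for illustration and does not come from the answer above.

```python
from itertools import islice

def split_like(s, arr):
    """Cut s into pieces whose lengths match the strings in arr."""
    # One shared iterator: each islice call continues where the previous one stopped.
    it = iter(s)
    return [''.join(islice(it, len(word))) for word in arr]

s = 'Pythonisanprogramminglanguage'
arr = ['lkjhgf', 'zx', 'qw', 'ertyuiopakk', 'foacdhlc']
print(split_like(s, arr))
# ['Python', 'is', 'an', 'programming', 'language']
```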

One way would be to do this:
s = 'Pythonisanprogramminglanguage'
arr = ['lkjhgf', 'zx', 'qw', 'ertyuiopakk', 'foacdhlc']
expected = []
i = 0
for word in arr:
    expected.append(s[i:i+len(word)])
    i += len(word)
print(expected)

Using a simple for loop this can be done as follows:
s = 'Pythonisanprogramminglanguage'
arr = ['lkjhgf', 'zx', 'qw', 'ertyuiopakk', 'foacdhlc']
start_index = 0
expected = list()
for a in arr:
    expected.append(s[start_index:start_index+len(a)])
    start_index += len(a)
print(expected)

On Python 3.8 and later, an alternative approach is to use an assignment expression:
s = 'Pythonisanprogramminglanguage'
arr = ['lkjhgf', 'zx', 'qw', 'ertyuiopakk', 'foacdhlc']
i = 0
expected = [s[i:(i := i+len(word))] for word in arr]

You can use itertools.accumulate to get the positions where you want to split the string:
>>> s = 'Pythonisanprogramminglanguage'
>>> arr = ['lkjhgf', 'zx', 'qw', 'ertyuiopakk', 'foacdhlc']
>>> import itertools
>>> L = list(itertools.accumulate(map(len, arr)))
>>> L
[6, 8, 10, 21, 29]
Now if you zip the list with itself, you get the intervals:
>>> list(zip([0]+L, L))
[(0, 6), (6, 8), (8, 10), (10, 21), (21, 29)]
And you just have to use the intervals to split the string:
>>> [s[i:j] for i,j in zip([0]+L, L)]
['Python', 'is', 'an', 'programming', 'language']

The itertools module has a function named accumulate() (added in Py 3.2) which helps make this relatively easy:
from itertools import accumulate # added in Py 3.2
s = 'Pythonisanprogramminglanguage'
arr = ['lkjhgf', 'zx', 'qw', 'ertyuiopakk', 'foacdhlc']
cuts = tuple(accumulate(len(item) for item in arr))
words = [s[i:j] for i, j in zip((0,)+cuts, cuts)]
print(words) # -> ['Python', 'is', 'an', 'programming', 'language']
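On Python 3.8 and later, accumulate also accepts an initial keyword, so the leading 0 does not have to be prepended by hand. The sketch below assumes that version; the function name split_by_lengths and the length check are additions for illustration, not part of the answers above.

```python
from itertools import accumulate

def split_by_lengths(s, lengths):
    """Slice s at the running totals of lengths."""
    # initial=0 (Python 3.8+) puts the starting offset at the front of the cut list.
    cuts = list(accumulate(lengths, initial=0))
    if cuts[-1] != len(s):
        raise ValueError('lengths must add up to len(s)')
    # Pair each cut with the next one to get (start, stop) intervals.
    return [s[i:j] for i, j in zip(cuts, cuts[1:])]

s = 'Pythonisanprogramminglanguage'
arr = ['lkjhgf', 'zx', 'qw', 'ertyuiopakk', 'foacdhlc']
print(split_by_lengths(s, map(len, arr)))
# ['Python', 'is', 'an', 'programming', 'language']
```

The explicit check turns a silent truncation into an error when s and arr do not line up.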

Create a simple loop and use the length of the words as your index:
s = 'Pythonisanprogramminglanguage'
arr = ['lkjhgf', 'zx', 'qw', 'ertyuiopakk', 'foacdhlc']
ctr = 0
words = []
for x in arr:
    words.append(s[ctr:len(x) + ctr])
    ctr += len(x)
print(words)
# ['Python', 'is', 'an', 'programming', 'language']

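Here is another approach, using NumPy's cumsum (s and arr as defined in the question):
import numpy as np
ar = [0]+list(map(len, arr))
ar = np.cumsum(ar).tolist()
# pair each offset with the next one; this also works when arr contains empty strings
output_ = [s[i:j] for i, j in zip(ar, ar[1:])]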
Output :
['Python', 'is', 'an', 'programming', 'language']

One more way:
a, l = 0, []
for i in map(len, arr):
    l.append(s[a:a+i])
    a += i
print(l)
#['Python', 'is', 'an', 'programming', 'language']

Props to the answer using iter. The accumulate answers are my favorite. Here is another accumulate answer, using map instead of a list comprehension:
import itertools
s = 'Pythonisanprogramminglanguage'
arr = ['lkjhgf', 'zx', 'qw', 'ertyuiopakk', 'foacdhlc']
ticks = itertools.accumulate(map(len, arr[0:]))
words = list(map(lambda i, x: s[i:len(x) + i], (0,) + tuple(ticks), arr))
Output:
['Python', 'is', 'an', 'programming', 'language']

You could collect slices off the front of s.
output = []
for word in arr:
    i = len(word)
    chunk, s = s[:i], s[i:]
    output.append(chunk)
print(output) # -> ['Python', 'is', 'an', 'programming', 'language']

Yet another approach would be to create a regex pattern describing the desired length of words. You can replace every character by . (=any character) and surround the words with ():
arr = ['lkjhgf', 'zx', 'q', 'ertyuiopakk', 'foacdhlc']
import re
pattern = '(' + ')('.join(re.sub('.', '.', word) for word in arr) + ')'
#=> '(......)(..)(.)(...........)(........)'
If the pattern matches, you get the desired words in groups directly:
s = 'Pythonisaprogramminglanguage'
re.match(pattern, s).groups()
#=> ('Python', 'is', 'a', 'programming', 'language')
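The pattern can also be built from counted quantifiers, which keeps it short for long words. This is a sketch under a few assumptions: the function name split_with_regex is invented here, re.fullmatch is used so that a length mismatch gives None instead of a partial match, and re.DOTALL is added in case s contains newlines.

```python
import re

def split_with_regex(s, arr):
    """Return the pieces of s matching the lengths in arr, or None on a mismatch."""
    # One capturing group per word; .{n} matches exactly n characters.
    pattern = ''.join('(.{%d})' % len(word) for word in arr)
    match = re.fullmatch(pattern, s, flags=re.DOTALL)
    return list(match.groups()) if match else None

arr = ['lkjhgf', 'zx', 'q', 'ertyuiopakk', 'foacdhlc']
s = 'Pythonisaprogramminglanguage'
print(split_with_regex(s, arr))
# ['Python', 'is', 'a', 'programming', 'language']
```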

Related

Function works for small samples but not larger ones (Python)

I'm trying to make a function to see if words appear within a certain distance of one another, my code is as follows:
file_cont = [['man', 'once', 'upon', 'time', 'love',
'princess'], ['python', 'code', 'cool', 'uses', 'java'],
['man', 'help', 'test', 'weird', 'love']] #words I want to measure 'distance' between
dat = [{ind: val for val, ind in enumerate(el)} for el in file_cont]
def myfunc(w1, w2, dist, dat):
    arr = []
    for x in dat:
        i1 = x.get(w1)
        i2 = x.get(w2)
        if (i1 is not None) and (i2 is not None) and (i2 - i1 <= dist ):
            arr.append(list(x.keys())[i1:i2+1])
    return arr
It works in this instance,
myfunc("man", "love",4, dat) returns [['man', 'once', 'upon', 'time', 'love'],
['man', 'help', 'test', 'weird', 'love']] which is what I want
The problem I have is when I use a much bigger dataset (the elements of file_cont becomes thousands of words), it outputs odd results
For example I know the words 'jon' and 'snow' appear together in at least one instance in one of the elements of file_cont
When I do myfunc('jon','snow',6,dat) I get:
[[], [], ['castle', 'ward'], [], [], []]
something completely out of context, it doesn't mention 'jon' or 'snow'
What is the problem here and how would I go about fixing it?
The problem comes from the fact that your text may contain multiple occurrences of the same word, which you typically observe with larger excerpts.
Here's a minimal working example showing how the function may fail:
new_file = [['man', 'once', 'man', 'time', 'love', 'once']]
data = [{ind: val for val, ind in enumerate(el)} for el in new_file]
def myfunc(w1, w2, dist, dat):
    arr = []
    for x in dat:
        i1 = x.get(w1)
        i2 = x.get(w2)
        if (i1 is not None) and (i2 is not None) and (i2 - i1 <= dist ):
            arr.append(list(x.keys())[i1:i2+1])
    return arr
myfunc("man", "love", 4, data)
# > [['time', 'love']]
Notice that here, your dictionary will look like this:
# > [{'man': 2, 'once': 5, 'time': 3, 'love': 4}]
This is because, when creating the dictionary, each new occurrence of a word overwrites the index stored under its key with the newly observed (higher) index. The dictionary then has fewer keys than the excerpt has words, so the stored indices no longer match positions in list(x.keys()), and myfunc slices the wrong words.
A way to achieve what you want to do could be (for instance):
data = ['man', 'once', 'upon', 'man', 'time', 'love', 'princess', 'man']
w1 = 'man'
w2 = 'love'
dist = 3
def new_func(w1, w2, dist, data):
    w1_indices = [i for i, x in enumerate(data) if x == w1]
    w2_indices = [i for i, x in enumerate(data) if x == w2]
    for i in w1_indices:
        for j in w2_indices:
            if abs(i-j) < dist:
                print(data[min(i, j):max(i, j)+1])
new_func(w1, w2, dist, data)
# > ['man', 'time', 'love']
# > ['love', 'princess', 'man']
With a list of lists like in your case, you can call it once per inner list (dist is raised to 5 here, because 'man' and 'love' are 4 positions apart and the test is abs(i-j) < dist):
file_cont = [['man', 'once', 'upon', 'time', 'love', 'princess'], ['python', 'code', 'cool', 'uses', 'java'],
['man', 'help', 'test', 'weird', 'love']]
dist = 5
for x in file_cont:
    new_func(w1, w2, dist, x)
# > ['man', 'once', 'upon', 'time', 'love']
# > ['man', 'help', 'test', 'weird', 'love']
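If the matches should be collected rather than printed, the same index-based search can return them. A minimal sketch, assuming the same "fewer than dist positions apart" rule as new_func; the name find_spans is hypothetical.

```python
def find_spans(words, w1, w2, dist):
    """Return every slice of words bounded by w1 and w2 that are fewer than dist positions apart."""
    w1_indices = [i for i, x in enumerate(words) if x == w1]
    w2_indices = [i for i, x in enumerate(words) if x == w2]
    return [words[min(i, j):max(i, j) + 1]
            for i in w1_indices
            for j in w2_indices
            if abs(i - j) < dist]

file_cont = [['man', 'once', 'upon', 'time', 'love', 'princess'],
             ['python', 'code', 'cool', 'uses', 'java'],
             ['man', 'help', 'test', 'weird', 'love']]
results = [find_spans(x, 'man', 'love', 5) for x in file_cont]
print(results)
# [[['man', 'once', 'upon', 'time', 'love']], [], [['man', 'help', 'test', 'weird', 'love']]]
```

Each inner list gets its own list of matches, so an excerpt without both words simply yields an empty list.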

Split a string into chunks of substrings with successively increasing length

Let's say I have this string:
a = 'abcdefghijklmnopqrstuvwxyz'
And I want to split this string into chunks, like below:
['a', 'bc', 'def', 'ghij', 'klmno', 'pqrstu', 'vwxyz  ']
so that every chunk has a different number of characters. For instance, the first one should have one character, the second two and so on.
If there are not enough characters in the last chunk, then I need to add spaces so it matches the length.
I tried this code so far:
print([a[i: i + i + 1] for i in range(len(a))])
But it outputs:
['a', 'bc', 'cde', 'defg', 'efghi', 'fghijk', 'ghijklm', 'hijklmno', 'ijklmnopq', 'jklmnopqrs', 'klmnopqrstu', 'lmnopqrstuvw', 'mnopqrstuvwxy', 'nopqrstuvwxyz', 'opqrstuvwxyz', 'pqrstuvwxyz', 'qrstuvwxyz', 'rstuvwxyz', 'stuvwxyz', 'tuvwxyz', 'uvwxyz', 'vwxyz', 'wxyz', 'xyz', 'yz', 'z']
Here is my desired output:
['a', 'bc', 'def', 'ghij', 'klmno', 'pqrstu', 'vwxyz  ']
I don't think any one liner or for loop will look as elegant, so let's go with a generator:
from itertools import islice, count
def get_increasing_chunks(s):
    it = iter(s)
    c = count(1)
    # start with a chunk of one character ('' if s is empty)
    nxt, c_ = next(it, ''), next(c)
    while nxt:
        # pad a short final chunk with spaces up to the expected length
        yield nxt.ljust(c_)
        nxt, c_ = ''.join(islice(it, c_+1)), next(c)
[*get_increasing_chunks(a)]
# ['a', 'bc', 'def', 'ghij', 'klmno', 'pqrstu', 'vwxyz  ']
Thanks to @Prune's comment, I managed to figure out a way to solve this:
a = 'abcdefghijklmnopqrstuvwxyz'
lst = []
c = 0
for i in range(1, len(a) + 1):
    c += i
    lst.append(c)
print([a[x: y] + ' ' * (i - len(a[x: y])) for i, (x, y) in enumerate(zip([0] + lst, lst), 1) if a[x: y]])
Output:
['a', 'bc', 'def', 'ghij', 'klmno', 'pqrstu', 'vwxyz  ']
I find the triangular numbers, then do a list comprehension and add spaces if the length is not right.
So what you need is a number that controls how many characters you're going to grab (in this case the iteration count), a second number that remembers what the last index was, and one last number that tells you where to stop.
my_str = "abcdefghijklmnopqrstuvwxyz"
last_index = 0
index = 1
iter_count = 1
while True:
    sub_string = my_str[last_index:index]
    print(sub_string)
    last_index = index
    iter_count += 1
    index = index + iter_count
    # >= avoids printing an empty chunk when the string ends exactly on a boundary
    if last_index >= len(my_str):
        break
Note that you don't need the while loop; I was just feeling lazy.
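The bookkeeping described above (a chunk size that grows by one and a start index that advances by the chunk size) can be wrapped in a small function that also pads the last chunk. This is a sketch, not the answer's own code; the name increasing_chunks and the fill parameter are made up for illustration.

```python
def increasing_chunks(s, fill=' '):
    """Split s into chunks of length 1, 2, 3, ..., padding the last chunk with fill."""
    chunks = []
    start, size = 0, 1
    while start < len(s):
        # ljust only changes the final chunk, when fewer than size characters remain
        chunks.append(s[start:start + size].ljust(size, fill))
        start += size
        size += 1
    return chunks

print(increasing_chunks('abcdefghijklmnopqrstuvwxyz'))
# ['a', 'bc', 'def', 'ghij', 'klmno', 'pqrstu', 'vwxyz  ']
```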
It seems like the split_into recipe from more_itertools can help here. This is less elegant than the answer by @cs95, but perhaps it will help others discover the utility of the more_itertools module.
Yield a list of sequential items from iterable of length ‘n’ for each integer ‘n’ in sizes.
>>> list(split_into([1,2,3,4,5,6], [1,2,3]))
[[1], [2, 3], [4, 5, 6]]
To use this, we need to construct a list of sizes like [1, 2, 3, 4, 5, 6, 7].
import itertools
def split_into(iterable, sizes):
    it = iter(iterable)
    for size in sizes:
        if size is None:
            yield list(it)
            return
        else:
            yield list(itertools.islice(it, size))
a = 'abcdefghijklmnopqrstuvwxyz'
sizes = [1]
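# keep adding sizes until they cover the whole string
# (< rather than <= avoids an extra empty chunk when len(a) is a triangular number)
while sum(sizes) < len(a):
    next_value = sizes[-1] + 1
    sizes.append(next_value)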
# sizes = [1, 2, 3, 4, 5, 6, 7]
list(split_into(a, sizes))
# [['a'],
# ['b', 'c'],
# ['d', 'e', 'f'],
# ['g', 'h', 'i', 'j'],
# ['k', 'l', 'm', 'n', 'o'],
# ['p', 'q', 'r', 's', 't', 'u'],
# ['v', 'w', 'x', 'y', 'z']]
chunks = list(map("".join, split_into(a, sizes)))
# ['a', 'bc', 'def', 'ghij', 'klmno', 'pqrstu', 'vwxyz']
# Pad last item with whitespace.
chunks[-1] = chunks[-1].ljust(sizes[-1], " ")
# ['a', 'bc', 'def', 'ghij', 'klmno', 'pqrstu', 'vwxyz  ']
Here is a solution using accumulate from itertools.
>>> from itertools import accumulate
>>> from string import ascii_lowercase
>>> s = ascii_lowercase
>>> n = 0
>>> accum = 0
>>> while accum < len(s):
...     n += 1
...     accum += n
...
>>> L = [s[j:i+j] for i, j in enumerate(accumulate(range(n)), 1)]
>>> L[-1] += ' ' * (n-len(L[-1]))
>>> L
['a', 'bc', 'def', 'ghij', 'klmno', 'pqrstu', 'vwxyz  ']
Update: the same list can also be built within the while loop (here without the padding):
n = 0
accum = 0
L = []
while accum < len(s):
    n += 1
    L.append(s[accum:accum+n])
    accum += n
['a', 'bc', 'def', 'ghij', 'klmno', 'pqrstu', 'vwxyz']
Adding a little to U11-Forward's answer:
a = 'abcdefghijklmnopqrstuvwxyz'
l = list(range(len(a))) # list of numbers from 0 to len(a) - 1
triangular = [sum(l[:i+2]) for i in l] # triangular numbers: 1, 1+2, 1+2+3, 1+2+3+4, etc.
print([a[x: y].ljust(i, ' ') for i, (x, y) in enumerate(zip([0] + triangular, triangular), 1) if a[x: y]])
Output:
['a', 'bc', 'def', 'ghij', 'klmno', 'pqrstu', 'vwxyz  ']
Find the triangular numbers, do a list comprehension and fill with spaces if the length is incorrect.
a = 'abcdefghijklmnopqrstuvwxyz'
inc = 0
output = []
for i in range(len(a)):
    if inc >= len(a):
        break
    # take the next chunk of i+1 characters before advancing inc,
    # padding the last chunk with spaces
    output.append(a[inc: inc+i+1].ljust(i+1))
    inc = inc+i+1
print(output)
Hey, here is the snippet for your required output. I have just altered your logic.
Output:
['a', 'bc', 'def', 'ghij', 'klmno', 'pqrstu', 'vwxyz  ']

Add a maximum of 200 elements to an empty 2D list

I have a Python program that creates lists of 3 elements,
['a', 'was', 'mother']
and adds them on an empty list,
output_text=[]
while True:
    candidates = [t for t in lines if t[0:2] == last_two]
    if not candidates:
        break
    triplet = random.choice(candidates)
    last_two = triplet[1:3]
    output_text.append(triplet)
    print('\n Selected matching triplet: \n',triplet)
    print('\n Last two words of the matching triplet: \n',last_two)
print(output_text)
I want to create an if statement that keeps adding the 3-element lists to output_text until 200 words (total elements) are being stored.
Any ideas?
You could e.g. combine itertools.cycle and .islice, or just use modulo %:
>>> from itertools import islice, cycle
>>> lst = ['a', 'was', 'mother']
>>> list(islice(cycle(lst), 10))
['a', 'was', 'mother', 'a', 'was', 'mother', 'a', 'was', 'mother', 'a']
>>> [lst[i % len(lst)] for i in range(10)]
['a', 'was', 'mother', 'a', 'was', 'mother', 'a', 'was', 'mother', 'a']
(Technically, this does not append to an empty list but creates the list in one go, but I assume that's okay.)
The idiomatic solution would involve itertools.cycle, which is an iterator that yields items from a given iterable indefinitely, and itertools.islice to grab the first 200 items from the cycle-iterator:
from itertools import cycle, islice
words = list(islice(cycle(("a", "was", "mother")), 200))
I believe it would be easier done with a while loop, like this:
lst = ['a', 'was', 'mother']
output_text = []
while len(output_text) < 200:
    if 200 - len(output_text) >= 3:
        output_text += lst
    else:
        # fewer than 3 slots left: add only as many words as still fit
        output_text += lst[:200 - len(output_text)]
print(output_text)
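For the loop in the question itself, where whole triplets are appended, the limit can go straight into the while condition. The sketch below is a guess at the intended behaviour: it assumes that a triplet is only added if all three words still fit, and the function name build_chain, the max_words parameter and the sample lines are invented for illustration.

```python
import random

def build_chain(lines, last_two, max_words=200):
    """Chain triplets whose first two words match the previous triplet's last two."""
    output_text = []
    # stop as soon as one more 3-word triplet would exceed max_words
    while (len(output_text) + 1) * 3 <= max_words:
        candidates = [t for t in lines if t[0:2] == last_two]
        if not candidates:
            break
        triplet = random.choice(candidates)
        last_two = triplet[1:3]
        output_text.append(triplet)
    return output_text

# hypothetical data in which every triplet always has exactly one successor
lines = [['a', 'b', 'c'], ['b', 'c', 'a'], ['c', 'a', 'b']]
chain = build_chain(lines, ['a', 'b'])
print(len(chain), sum(len(t) for t in chain))
# 66 198
```

With 3-word triplets, 200 is not reachable exactly, so the chain stops at 66 triplets (198 words).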

Split an existing list based on a repeated word

I am trying to split a list into a new list. Here's the initial list:
initList =['PTE123', '', 'I', 'am', 'programmer', 'PTE345', 'based', 'word',
'title', 'PTE427', 'how', 'are', 'you']
I want to split the list at each PTExyz item, so that the new list looks like this:
newList = ['PTE123 I am programmer', 'PTE345 based word title', 'PTE427 how are you']
How should I design an algorithm for the general case, where the PTExyz items repeat?
Thank You!
The algorithm goes something like this.
Iterate over the list, keeping a temp string that starts out empty. When you find a string s that starts with PTE, append temp to the result list if it is not empty, then set temp to s. Otherwise, add s to the end of temp, skipping empty strings and anything that comes before the first PTE item. After the loop, append the final temp to the result.
ls = ['PTE123', '', 'I', 'am', 'programmer', 'PTE345', 'based', 'word', 'title', 'PTE427', 'how', 'are', 'you']
result = []
temp = ''
for s in ls:
    if s.startswith('PTE'):
        if temp != '':
            result.append(temp)
        temp = s
    else:
        # skip empty strings (they would add a double space) and words before the first PTE item
        if temp == '' or s == '':
            continue
        temp += ' ' + s
result.append(temp)
print(result)
Edit
To handle the pattern PTExyz, you can use a regular expression. In that case, replace the s.startswith('PTE') check with:
re.match(r'PTE\w{3}$', s)
I think this will work:
l = ['PTE123', '', 'I', 'am', 'programmer', 'PTE345', 'based', 'word', 'title', 'PTE427', 'how', 'are', 'you']
resultlist = []
s = ' '.join(filter(None, l))  # drop empty strings before joining
for part in s.split('PTE'):
    if part:  # skip the empty piece before the first 'PTE'
        resultlist.append('PTE' + part.strip())
print(resultlist)
This one uses a regular expression for the PTExyz pattern:
import re
l = ['PTE123', '', 'I', 'am', 'programmer', 'PTE345', 'based', 'word',
     'title', 'PTE427', 'how', 'are', 'you']
pattern = re.compile(r'PTE\d\d\d')
k = [i for i in l if pattern.match(i)]  # the PTExyz markers, in order
s = ' '.join(filter(None, l))           # drop empty strings before joining
parts = re.split(pattern, s)[1:]        # the text that follows each marker
result = [' '.join([m] + p.split()) for m, p in zip(k, parts)]
print(result)
>>> list =['PTE123', '', 'I', 'am', 'programmer', 'PTE345', 'based', 'word','title', 'PTE427', 'how', 'are', 'you']
>>> index_list =[ list.index(item) for item in list if "PTE" in item]
>>> index_list.append(len(list))
>>> index_list
[0, 5, 9, 13]
>>> [' '.join(list[index_list[i-1]:index_list[i]]) for i,item in enumerate(index_list) if item > 0 ]
Output
['PTE123 I am programmer', 'PTE345 based word title', 'PTE427 how are you']
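The answers above share one idea: a marker item opens a new group, and everything up to the next marker belongs to it. A minimal sketch of that idea as a reusable function, assuming markers can be recognised by a prefix; the name group_by_marker and the prefix parameter are hypothetical.

```python
def group_by_marker(items, prefix='PTE'):
    """Join each marker item with the non-empty items that follow it."""
    groups = []
    for item in items:
        if item.startswith(prefix):
            groups.append([item])    # a marker opens a new group
        elif item and groups:
            groups[-1].append(item)  # skip empty strings and anything before the first marker
    return [' '.join(g) for g in groups]

initList = ['PTE123', '', 'I', 'am', 'programmer', 'PTE345', 'based', 'word',
            'title', 'PTE427', 'how', 'are', 'you']
print(group_by_marker(initList))
# ['PTE123 I am programmer', 'PTE345 based word title', 'PTE427 how are you']
```

Collecting words in lists and joining once at the end avoids stray double or trailing spaces.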

Find all the strings with max length using max() function [duplicate]

This question already has answers here:
Closed 10 years ago.
Possible Duplicate:
Longest strings from list
lst = [str1, str2, str3, ...]
max(lst, key=len)
This returns only one of the strings with max length. Is there any way to do that without defining another procedure?
How about:
maxlen = len(max(l, key=len))
maxlist = [s for s in l if len(s) == maxlen]
If you want to get all the values with the max length, you probably want to sort the list by length; then you just need to take all the values until the length changes. itertools provides multiple ways to do that—takewhile, groupby, etc. For example:
>>> import itertools
>>> l = ['abc', 'd', 'ef', 'ghi', 'j']
>>> l2 = sorted(l, key=len, reverse=True)
>>> groups = itertools.groupby(l2, key=len)
>>> maxlen, maxvalues = next(groups)
>>> print(maxlen, list(maxvalues))
3 ['abc', 'ghi']
If you want a one-liner:
>>> maxlen, maxvalues = next(itertools.groupby(sorted(l, key=len, reverse=True), key=len))
>>> print(maxlen, list(maxvalues))
Of course you can always just make two passes over the list if you prefer—first to find the max length, then to find all matching values:
>>> maxlen = len(max(l, key=len))
>>> maxvalues = (value for value in l if len(value) == maxlen)
>>> print(maxlen, list(maxvalues))
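The two-pass version can be wrapped in a function that also copes with an empty list. A small sketch; the name longest_strings is made up, and the default argument of max requires Python 3.4 or later.

```python
def longest_strings(strings):
    """Return all strings that share the maximum length, in their original order."""
    # First pass: find the maximum length (0 for an empty input).
    maxlen = max(map(len, strings), default=0)
    # Second pass: keep every string of that length.
    return [s for s in strings if len(s) == maxlen]

print(longest_strings(['abc', 'd', 'ef', 'ghi', 'j']))
# ['abc', 'ghi']
```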
Just for the sake of completeness, filter is also an option:
maxlens = list(filter(lambda s: len(s) == len(max(myList, key=len)), myList))
Here is a one-pass solution, collecting longest-seen-so-far words as they are found.
def findLongest(words):
    if not words:
        return []
    worditer = iter(words)
    ret = [next(worditer)]
    cur_len = len(ret[0])
    for wd in worditer:
        len_wd = len(wd)
        if len_wd > cur_len:
            ret = [wd]
            cur_len = len_wd
        else:
            if len_wd == cur_len:
                ret.append(wd)
    return ret
Here are the results from some test lists:
tests = [
[],
"Four score and seven years ago".split(),
"To be or not to be".split(),
"Now is the winter of our discontent made glorious summer by this sun of York".split(),
]
for test in tests:
    print(test)
    print(findLongest(test))
    print()
[]
[]
['Four', 'score', 'and', 'seven', 'years', 'ago']
['score', 'seven', 'years']
['To', 'be', 'or', 'not', 'to', 'be']
['not']
['Now', 'is', 'the', 'winter', 'of', 'our', 'discontent', 'made', 'glorious', 'summer', 'by', 'this', 'sun', 'of', 'York']
['discontent']
