Dividing the string with multiple matches in Python

I have a string that has to be split on the words listed in "words":
words = ['word1', 'word2', 'word3']
text = " long statement word1 statement1 word2 statement2 word3 statement3 " # a single lined string
This is the code I'm using. Is there a simpler way to do this?
for l in words:
    if l == "word1": t1 = text.split(l)
    if l == "word2": t2 = str(t1[1]).split(l)
    if l == "word3": t3 = str(t2[1]).split(l)
print(t1[0])
print(t2[0])
print(t3[0])
The output is like:
statement
statement1
statement2
statement3

How about using itertools.groupby:
from itertools import groupby
words = ['word1', 'word2', 'word3']
text = " long statement word1 statement1 word2 statement2 word3 statement3 "
delimiters = set(words)
statements = [
    ' '.join(g) for k, g in groupby(text.split(), lambda w: w in delimiters)
    if not k
]
print(statements)
Output:
['long statement', 'statement1', 'statement2', 'statement3']

You could use a regex to solve your problem this way:
import re
words = ['word1', 'word2', 'word3']
text = " long statement word1 statement1 word2 statement2 word3 statement3 "
print(*re.split('|'.join(words),text), sep="\n")
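A slightly more defensive variant of the regex split, as a minimal sketch: it assumes the delimiter words may contain regex metacharacters (so they are passed through re.escape) and that surrounding whitespace and empty pieces are not wanted. The helper name split_on_words is hypothetical, not part of the original answer.

```python
import re

def split_on_words(text, words):
    """Split text on any of the given words and drop surrounding whitespace."""
    # Escape each word so characters such as '.' or '+' are matched literally.
    pattern = '|'.join(re.escape(w) for w in words)
    # Strip each piece and discard pieces that are empty after stripping.
    return [part.strip() for part in re.split(pattern, text) if part.strip()]

words = ['word1', 'word2', 'word3']
text = " long statement word1 statement1 word2 statement2 word3 statement3 "
print(*split_on_words(text, words), sep="\n")
```

Like the plain join above, this matches the words as substrings, so "word1" would also split inside "word10".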

Related

Group list of 4-strings into list of pairs

I have following list of strings:
['word1 word2 word3 word4', 'word5 word6 word7 word8']
(I have shown only two strings, but there can be many.)
I want to create new list which should look like this:
['word1 word2', 'word3 word4', 'word5 word6', 'word7 word8']
I tried following:
lines = ['word1 word2 word3 word4', 'word5 word6 word7 word8']
[[word1 + ' ' + word2, word3 + ' ' + word4] for line in lines for word1, word2, word3, word4 in line.split()]
But it gives following error:
ValueError: too many values to unpack (expected 4)
How do I do this in most pythonic way?
With short regex matching:
import re
lst = ['word1 word2 word3 word4', 'word5 word6 word7 word8']
res = [pair for words in lst for pair in re.findall(r'\S+ \S+', words)]
\S+ \S+ - matches 2 consecutive "words"
['word1 word2', 'word3 word4', 'word5 word6', 'word7 word8']
Modified @jsbueno's earlier answer, which was slightly incorrect:
>>> words = [item for line in lines for item in line.split()]
>>> words
['word1', 'word2', 'word3', 'word4', 'word5', 'word6', 'word7', 'word8']
>>> [words[i] + ' ' + words[i+1] for i in range(0, len(words), 2)]
['word1 word2', 'word3 word4', 'word5 word6', 'word7 word8']
Pythonic doesn't mean "fewer lines". This is easily done with a simple for loop:
result = []
for line in lines:
    words = line.split()
    result.append(' '.join(words[:2]))
    result.append(' '.join(words[2:]))
Which gives your desired result:
['word1 word2', 'word3 word4', 'word5 word6', 'word7 word8']
If you want to make this more general for strings with more words, you can write a function that will yield chunks of the desired size, and use that with str.join:
def chunks(iterable, chunk_size):
    c = []
    for item in iterable:
        c.append(item)
        if len(c) == chunk_size:
            yield c
            c = []
    if c: yield c
result = []
for line in lines:
    words = line.split()
    for chunk in chunks(words, 2):
        result.append(' '.join(chunk))
An optimized solution that pushes all the per-item work to the C layer:
from itertools import chain
lines = ['word1 word2 word3 word4', 'word5 word6 word7 word8']
words = chain.from_iterable(map(str.split, lines))
paired = list(map('{} {}'.format, words, words))
print(paired)
chain.from_iterable(map(str.split, lines)) creates an iterator of the individual words. map('{} {}'.format, words, words) maps the same iterator twice to put them back together in pairs (map(' '.join, zip(words, words)) would get the same effect, but with an additional intermediate product; feel free to test which is faster in practice). The list wrapper consumes it to produce the final result.
This beats the existing answers by avoiding all per-item work at the Python layer (no additional bytecode executed as the input grows), and avoids one of the weirdly high overhead aspects of Python (indexing and simple integer math).
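The zip-based alternative mentioned in parentheses above can be sketched as follows. The helper name pair_words is hypothetical. The key point is that the same iterator object is passed twice, so each step pulls two consecutive words from it.

```python
from itertools import chain

def pair_words(lines):
    """Join consecutive words from all lines into two-word strings."""
    # One shared iterator over every word of every line.
    words = chain.from_iterable(map(str.split, lines))
    # zip(words, words) advances the same iterator twice per pair.
    return [' '.join(pair) for pair in zip(words, words)]

lines = ['word1 word2 word3 word4', 'word5 word6 word7 word8']
print(pair_words(lines))
```

Note that with an odd total number of words, the final unpaired word is silently dropped, because zip stops as soon as the shared iterator is exhausted.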

Pythonic and efficient usage of logical and membership operators

What is the best way to check whether some words, combined with logical operators (or, and), exist in a list of strings?
Say you have a list of strings:
list_of_str = ['some phrase with word1','another phrase','other phrase with word2']
I have two cases, 1) and 2), where I would like to get the strings that contain or do not contain some words. However, I would prefer not to repeat myself as I do now with: if 'word1' not in i and 'word2' not in i and 'word3' not in i
I would like to get 1)
list_1 = [i for i in list_of_str if 'word1' not in i and 'word2' not in i and 'word3' not in i]
output: ['another phrase']
and 2)
list_2 = [i for i in list_of_str if 'word1' in i or 'word2' in i or 'word3' in i]
output: ['some phrase with word1', 'other phrase with word2']
I found that I can do this for 2), but couldn't work out how to use all() for case 1):
list_2 = [i for i in list_of_str if any(word in ['word1','word2','word3'] for word in i.split())]
output: ['some phrase with word1', 'other phrase with word2']
Also, is this the most efficient way of doing things?
You can use:
words = ['word1', 'word2', 'word23']
list_1 = [i for i in list_of_str if all(w not in i for w in words)]
list_2 = [i for i in list_of_str if any(w in i for w in words)]
I think this is a good use-case for regex alternation, if efficiency matters:
>>> import re
>>> words = ['word1', 'word2', 'word23']
>>> regex = re.compile('|'.join([re.escape(w) for w in words]))
>>> regex
re.compile('word1|word2|word23')
>>> list_of_str = ['some phrase with word1','another phrase','other phrase with word2']
>>> [phrase for phrase in list_of_str if not regex.search(phrase)]
['another phrase']
>>> [phrase for phrase in list_of_str if regex.search(phrase)]
['some phrase with word1', 'other phrase with word2']
>>>
If you think about it in sets, you want sentences from that list where the set of search words and the set of words in the sentence are either disjoint or intersect.
E.g.:
set('some phrase with word1'.split()).isdisjoint({'word1', 'word2', 'word23'})
not set('some phrase with word1'.split()).isdisjoint({'word1', 'word2', 'word23'})
# or:
set('some phrase with word1'.split()) & {'word1', 'word2', 'word23'}
So:
search_terms = {'word1', 'word2', 'word23'}
list1 = [i for i in list_of_str if set(i.split()).isdisjoint(search_terms)]
list2 = [i for i in list_of_str if not set(i.split()).isdisjoint(search_terms)]
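The set-based approach differs from the substring checks in one way worth spelling out: it compares whole whitespace-separated words, so 'word1' does not match inside 'word10'. A minimal sketch that builds both lists in one pass; the helper name partition_by_terms is hypothetical.

```python
def partition_by_terms(phrases, terms):
    """Return (phrases without any term, phrases with at least one term)."""
    terms = set(terms)
    without, with_any = [], []
    for phrase in phrases:
        # isdisjoint accepts any iterable, so the word list needs no set() call.
        if terms.isdisjoint(phrase.split()):
            without.append(phrase)
        else:
            with_any.append(phrase)
    return without, with_any

list_of_str = ['some phrase with word1', 'another phrase', 'other phrase with word2']
list_1, list_2 = partition_by_terms(list_of_str, ['word1', 'word2', 'word3'])
print(list_1)
print(list_2)
```

Whether whole-word or substring matching is the right behaviour depends on the data; punctuation attached to a word (for example 'word1,') would prevent a match here.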

python nested list comprehension in nltk

This is a question from the NLTK book, but I got stuck. Does anyone know how to write this as a nested list comprehension?
>>> words = ['attribution', 'confabulation', 'elocution',
...          'sequoia', 'tenacious', 'unidirectional']
>>> vsequences = set()
>>> for word in words:
...     vowels = []
...     for char in word:
...         if char in 'aeiou':
...             vowels.append(char)
...     vsequences.add(''.join(vowels))
>>> sorted(vsequences)
['aiuio', 'eaiou', 'eouio', 'euoia', 'oauaio', 'uiieioa']
You can do
In [75]: ["".join([char for char in word if char in 'aeiou']) for word in words]
Out[75]: ['aiuio', 'oauaio', 'eouio', 'euoia', 'eaiou', 'uiieioa']
If you need the result de-duplicated and sorted:
sorted(set(["".join([char for char in word if char in 'aeiou']) for word in words]))
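The same result can also be written with a set comprehension and a generator expression, which avoids building the intermediate lists. This is only a sketch of an equivalent spelling; the function name vowel_sequences is hypothetical.

```python
def vowel_sequences(words):
    """Return the sorted, de-duplicated vowel sequence of each word."""
    # Inner generator keeps only vowels; outer set comprehension removes duplicates.
    return sorted({''.join(ch for ch in word if ch in 'aeiou') for word in words})

words = ['attribution', 'confabulation', 'elocution',
         'sequoia', 'tenacious', 'unidirectional']
print(vowel_sequences(words))
```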

Python - Check if there's only one element of multiple lists in a string

The following code allows me to check whether exactly one element of the lists appears in ttext.
from itertools import product, chain
from string import punctuation
list1 = ['abra', 'hello', 'cfre']
list2 = ['dacc', 'ex', 'you', 'fboaf']
list3 = ['ihhio', 'oih', 'oihoihoo']
l = [list1, list2, list3]
def test(l, tt):
    counts = {word.strip(punctuation):0 for word in tt.split()}
    for word in chain(*product(*l)):
        if word in counts:
            counts[word] += 1
    if sum(v > 1 for v in counts.values()) > 1:
        return False
    return True
Output:
In [16]: ttext = 'hello my name is brian'
In [17]: test(l,ttext)
Out[17]: True
In [18]: ttext = 'hello how are you?'
In [19]: test(l,ttext)
Out[19]: False
Now, how can I do the same if the elements of the lists contain spaces, such as "I have", "you are" and "he is"?
You could add a list comprehension that goes through and splits all the words:
def test(l, tt):
    counts = {word.strip(punctuation):0 for word in tt.split()}
    splitl = [[word for item in sublist for word in item.split(' ')] for sublist in l]
    for word in chain(*product(*splitl)):
        if word in counts:
            counts[word] += 1
    if sum(v > 1 for v in counts.values()) > 1:
        return False
    return True
You can simplify a lot by just concatenating the lists using '+' rather than having a list of lists. This code also works if the string has spaces in it.
import string
list1 = ['abra', 'hello', 'cfre']
list2 = ['dacc', 'ex', 'you', 'fboaf']
list3 = ['ihhio', 'oih', 'oihoihoo']
l = list1 + list2 + list3
def test(l, tt):
    count = 0
    for word in l:
        #set of all punctuation to exclude
        exclude = set(string.punctuation)
        #remove punctuation from word
        word = ''.join(ch for ch in word if ch not in exclude)
        if word in tt:
            count += 1
    if count > 1:
        return False
    else:
        return True
You could just split all the list input by iterating through it. Something like:
words=[]
for list in l:
    for word in list:
        string=word.split()
        words.append(string)
You may consider using sets for this kind of processing.
Here is a quick implementation :
from itertools import chain
from string import punctuation
list1 = ['abra', 'hello', 'cfre']
list2 = ['dacc', 'ex', 'you', 'fboaf']
list3 = ['ihhio', 'oih', 'oihoihoo']
l = list(chain(list1, list2, list3))
words = set(w.strip(punctuation) for word in l for w in word.split()) # 1
def test(words, text):
    text_words = set(word.strip(punctuation) for word in text.split()) # 2
    return len(words & text_words) == 1 # 3
A few comments:
1. The double for-loop in the comprehension gives you all the individual words. The set makes sure each word is unique.
2. The same is done for the input sentence.
3. Set intersection gives all words in the sentence that are also in your search set. The length of that set then tells you whether there is exactly one.
Well, first, let's rewrite the function to be more natural:
from itertools import chain
def only_one_of(lists, sentence):
    found = None
    for item in chain(*lists):
        if item in sentence:
            if found: return False
            else: found = item
    return True if found is not None else False
This already works with your constraints, as it only checks whether some string item is a substring of sentence. It does not matter whether the item includes spaces or not. But it may lead to unexpected results. Imagine:
list1 = ['abra', 'hello', 'cfre']
list2 = ['dacc', 'ex', 'you', 'fboaf']
list3 = ['ihhio', 'oih', 'oihoihoo']
l = [list1, list2, list3]
only_one_of(l, 'Cadabra')
This returns True because abra is a substring of Cadabra. If this is what you want, then you're done. But if not, you need to redefine what item in sentence really means. So, let's redefine our function:
def only_one_of(lists, sentence, is_in=lambda i, c: i in c):
    found = None
    for item in chain(*lists):
        if is_in(item, sentence):
            if found: return False
            else: found = item
    return True if found is not None else False
Now the last parameter is expected to be a function that takes two strings and returns True if the first is found in the second, or False otherwise.
You usually want to check if the item is inside the sentence as a word (but a word that can contain spaces in the middle) so let's use regular expressions to do that:
import re
def inside(string, sentence):
    return re.search(r'\b%s\b' % string, sentence)
This function returns a match object (which is truthy) when string occurs in sentence as a whole word, and None otherwise (the special sequence \b in a regular expression stands for a word boundary).
So, the following code should satisfy your constraints:
import re
from itertools import chain
def inside(string, sentence):
    return re.search(r'\b%s\b' % string, sentence)
def only_one_of(lists, sentence, is_in=lambda i, c: i in c):
    found = None
    for item in chain(*lists):
        if is_in(item, sentence):
            if found: return False
            else: found = item
    return True if found is not None else False
list1 = ['abra', 'hello', 'cfre']
list2 = ['dacc', 'ex', 'you', 'fboaf']
list3 = ['ihhio', 'oih', 'oihoihoo']
list4 = ['I have', 'you are', 'he is']
l = [list1, list2, list3, list4]
only_one_of(l, 'hello my name is brian', inside) # True
only_one_of(l, 'hello how are you?', inside) # False
only_one_of(l, 'Cadabra', inside) # False
only_one_of(l, 'I have a sister', inside) # True
only_one_of(l, 'he is my ex-boyfriend', inside) # False, ex and boyfriend are two words
only_one_of(l, 'he is my exboyfriend', inside) # True, exboyfriend is only one word
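A hedged variant of the same idea, as a sketch rather than the answer's own code: it assumes list items may contain regex metacharacters, so each item goes through re.escape, and it makes inside return a plain bool. Counting matches instead of storing the found item also avoids relying on the truthiness of the item itself.

```python
import re
from itertools import chain

def inside(item, sentence):
    """True if item occurs in sentence delimited by word boundaries."""
    # re.escape makes characters such as '.' or '?' in item match literally.
    return re.search(r'\b' + re.escape(item) + r'\b', sentence) is not None

def only_one_of(lists, sentence, is_in=inside):
    """True if exactly one item from all lists is found in sentence."""
    count = 0
    for item in chain.from_iterable(lists):
        if is_in(item, sentence):
            count += 1
            if count > 1:
                return False  # stop early on the second match
    return count == 1

l = [['abra', 'hello', 'cfre'],
     ['dacc', 'ex', 'you', 'fboaf'],
     ['ihhio', 'oih', 'oihoihoo'],
     ['I have', 'you are', 'he is']]
print(only_one_of(l, 'hello my name is brian'))
print(only_one_of(l, 'hello how are you?'))
```

The \b anchors only behave as expected for items that start and end with word characters; items such as 'c++' would need a different boundary definition.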

Python: replace nth word in string

What is the easiest way in Python to replace the nth word in a string, assuming each word is separated by a space?
For example, if I want to replace the tenth word of a string and get the resulting string.
I guess you may do something like this:
nreplace=1
my_string="hello my friend"
words=my_string.split(" ")
words[nreplace]="your"
" ".join(words)
Here is another way of doing the replacement:
nreplace=1
words=my_string.split(" ")
" ".join([words[word_index] if word_index != nreplace else "your" for word_index in range(len(words))])
Let's say your string is:
my_string = "This is my test string."
You can split the string up using split():
my_list = my_string.split()
Which will set my_list to
['This', 'is', 'my', 'test', 'string.']
You can replace the 4th list item using
my_list[3] = "new"
And then put it back together with
my_new_string = " ".join(my_list)
Giving you
"This is my new string."
A solution involving list comprehension:
text = "To be or not to be, that is the question"
replace = 6
replacement = 'it'
print(' '.join([x if index != replace else replacement for index, x in enumerate(text.split())]))
The above produces:
To be or not to be, it is the question
You could use a generator expression and the string join() method:
my_string = "hello my friend"
nth = 0
new_word = 'goodbye'
print(' '.join(word if i != nth else new_word
               for i, word in enumerate(my_string.split(' '))))
Output:
goodbye my friend
Through re.sub.
>>> import re
>>> my_string = "hello my friend"
>>> new_word = 'goodbye'
>>> re.sub(r'^(\s*(?:\S+\s+){0})\S+', r'\1'+new_word, my_string)
'goodbye my friend'
>>> re.sub(r'^(\s*(?:\S+\s+){1})\S+', r'\1'+new_word, my_string)
'hello goodbye friend'
>>> re.sub(r'^(\s*(?:\S+\s+){2})\S+', r'\1'+new_word, my_string)
'hello my goodbye'
Just replace the number within the curly braces with the position of the word you want to replace, minus 1. That is, to replace the first word the number would be 0, for the second word it would be 1, and so on.
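The split/assign/join approach from the earlier answers can be wrapped in a small reusable function. This is a minimal sketch assuming, as the question states, that words are separated by a single space and that positions are zero-based; the name replace_nth_word is hypothetical.

```python
def replace_nth_word(text, n, new_word):
    """Return text with its n-th space-separated word (0-based) replaced."""
    words = text.split(' ')
    # Fail loudly instead of letting a negative index wrap around silently.
    if not 0 <= n < len(words):
        raise IndexError('word index %d out of range' % n)
    words[n] = new_word
    return ' '.join(words)

print(replace_nth_word("hello my friend", 1, "your"))
print(replace_nth_word("This is my test string.", 3, "new"))
```

Splitting on ' ' rather than calling split() with no argument preserves runs of multiple spaces, at the cost of treating the empty strings between them as words.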
