Removing list of words from a string - python

I have a list of stopwords. And I have a search string. I want to remove the words from the string.
As an example:
stopwords=['what','who','is','a','at','is','he']
query='What is hello'
Now the code should strip 'What' and 'is'. However in my case it strips 'a', as well as 'at'. I have given my code below. What could I be doing wrong?
for word in stopwords:
    if word in query:
        print(word)
        query=query.replace(word,"")
If the input query is "What is Hello", I get the output as:
wht s llo
Why does this happen?

This is one way to do it:
query = 'What is hello'
stopwords = ['what', 'who', 'is', 'a', 'at', 'is', 'he']
querywords = query.split()
resultwords = [word for word in querywords if word.lower() not in stopwords]
result = ' '.join(resultwords)
print(result)
I noticed that you want to also remove a word if its lower-case variant is in the list, so I've added a call to lower() in the condition check.

The accepted answer works when the words are separated by spaces, but real text also uses punctuation to separate words. In that case re.split is required.
Also, storing the stopwords in a set makes each membership test faster (although with only a handful of words, the cost of hashing each string can offset the gain).
My proposal:
import re
query = 'What is hello? Says Who?'
stopwords = {'what','who','is','a','at','is','he'}
resultwords = [word for word in re.split(r"\W+", query) if word.lower() not in stopwords]
print(resultwords)
output (as list of words):
['hello', 'Says', '']
There's an empty string at the end because re.split returns an empty field whenever the string ends (or starts) with a separator; here the final "?" does that. It needs filtering out. Two solutions:
resultwords = [word for word in re.split(r"\W+", query) if word and word.lower() not in stopwords]  # filter out empty words
Or add the empty string to the set of stopwords :)
stopwords = {'what','who','is','a','at','is','he',''}
Now the code prints:
['hello', 'Says']
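As a variation on the same idea, re.findall with a \w+ pattern collects the word tokens directly, so no empty strings show up in the first place. This is only a sketch: the helper name remove_stopwords is made up for illustration, and \w+ will split a word like "don't" into "don" and "t".

```python
import re

def remove_stopwords(query, stopwords):
    """Return the words of `query` that are not stopwords (case-insensitive)."""
    # Hypothetical helper, not part of the answer above.
    # re.findall(r"\w+", ...) returns only runs of word characters,
    # so punctuation never produces empty tokens.
    stopset = {w.lower() for w in stopwords}
    return [word for word in re.findall(r"\w+", query)
            if word.lower() not in stopset]

print(remove_stopwords('What is hello? Says Who?',
                       ['what', 'who', 'is', 'a', 'at', 'he']))
# ['hello', 'Says']
```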

Building on what karthikr said, try:
' '.join(filter(lambda x: x.lower() not in stopwords, query.split()))
Explanation:
query.split()  # splits query on runs of whitespace, e.g. "What is hello" -> ["What", "is", "hello"]
filter(func, iterable)  # takes a function and an iterable (list, string, etc.) and
                        # keeps only the items for which the function returns True
lambda x: x.lower() not in stopwords  # anonymous function that takes a word,
                                      # converts it to lower case, and returns True if
                                      # the word is not in stopwords
' '.join(iterable)  # joins all items of the iterable (the items must be strings)
                    # using the string in front of the dot, here ' ', as the separator,
                    # e.g. ["What", "is", "hello"] -> "What is hello"

Looking at the other answers to your question I noticed that they told you how to do what you are trying to do, but they did not answer the question you posed at the end.
If the input query is "What is Hello", I get the output as:
wht s llo
Why does this happen?
This happens because .replace() replaces the substring you give it exactly.
for example:
"My, my! Hello my friendly mystery".replace("my", "")
gives:
>>> "My, ! Hello  friendly stery"
.replace() is essentially splitting the string by the substring given as the first parameter and joining it back together with the second parameter.
"hello".replace("he", "je")
is logically similar to:
"je".join("hello".split("he"))
If you were still wanting to use .replace to remove whole words you might think adding a space before and after would be enough, but this leaves out words at the beginning and end of the string as well as punctuated versions of the substring.
"My, my! hello my friendly mystery".replace(" my ", " ")
>>> "My, my! hello friendly mystery"
"My, my! hello my friendly mystery".replace(" my", "")
>>> "My,! hello friendlystery"
"My, my! hello my friendly mystery".replace("my ", "")
>>> "My, my! hello friendly mystery"
Additionally, padding the word with spaces will not catch adjacent duplicates: the first match consumes the space between the two words, so the second occurrence no longer has a leading space to match:
"hello my my friend".replace(" my ", " ")
>>> "hello my friend"
For these reasons your accepted answer by Robby Cornelissen is the recommended way to do what you are wanting.
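If a replace-style approach is still wanted, one option is re.sub with \b word boundaries, which match only whole words and do not depend on surrounding spaces. The sketch below is an illustration rather than part of any answer here; strip_words is a hypothetical name.

```python
import re

def strip_words(text, words):
    # Hypothetical helper: remove whole-word, case-insensitive matches only.
    # re.escape guards against words that contain regex metacharacters.
    pattern = r"\b(?:" + "|".join(re.escape(w) for w in words) + r")\b"
    without = re.sub(pattern, "", text, flags=re.IGNORECASE)
    # Collapse the whitespace left behind by the removed words.
    return " ".join(without.split())

print(strip_words("What is hello", ['what', 'who', 'is', 'a', 'at', 'he']))
# hello
print(strip_words("hello my my friend", ["my"]))
# hello friend
```

Unlike the space-padded replace calls, this handles words at the start or end of the string and adjacent duplicates, while leaving "mystery" or "hello" intact.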

" ".join([x for x in query.split() if x not in stopwords])

stopwords=['for','or','to']
p='Asking for help, clarification, or responding to other answers.'
for i in stopwords:
    n=p.replace(i,'')
    p=n
print(p)

Related

How to remove words starting with capital letter in a list of strings using re.sub in python

I am using Python and I would like to remove words starting with a capital letter in a list of strings, using re.sub.
For example, given the following list:
l = ['I am John','John is going to US']
I want to get the following output, without any extra spaces for the removed words:
['am','is going to']
You can try this:
import re
output = []
for sentence in l:
    output.append(" ".join([word for word in sentence.strip().split(" ") if not re.match(r"[A-Z]", word)]))
print(output)
Output:
['am', 'is going to']
You can try
import re
l=['I am John','John is going to US']
print([re.sub(r"\s*[A-Z]\w*\s*", " ", i).strip() for i in l])
Output
['am', 'is going to']
This regex matches every word that starts with a capital letter, together with the whitespace before and after it, and replaces the match with a single space; strip() then trims the ends.
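A possible refinement, sketched under the assumption that only words beginning with a capital should go: anchoring the pattern with \b keeps it from matching a capital in the middle of a word (such as the P in "iPhone"), and normalising whitespace afterwards avoids double spaces when several capitalised words are removed from the middle of a sentence. The name drop_capitalized is hypothetical.

```python
import re

def drop_capitalized(sentence):
    # \b anchors the match to the start of a word, so "iPhone" is left alone.
    stripped = re.sub(r"\b[A-Z]\w*", "", sentence)
    # split()/join collapses any leftover runs of spaces and trims the ends.
    return " ".join(stripped.split())

l = ['I am John', 'John is going to US']
print([drop_capitalized(s) for s in l])
# ['am', 'is going to']
```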

Issue with extending a list with another list

Problem Definition
Separate each line into sentences. Assume that the following characters delimit sentences: periods ('.'), question marks ('?'), and exclamation points ('!'). These delimiters should be omitted from the returned sentences, too. Remove any leading or trailing spaces in each sentence. If, after the above, a sentence is blank (the empty string, ''), that sentence should be omitted. Return the list of sentences. The sentences must be in the same order that they appear in the file.
Here is my current code
import re
def get_sentences(doc):
    assert isinstance(doc, list)
    result = []
    for line in doc:
        result.extend(
            [sentence.strip() for sentence in re.split(r'\.|\?|\!', line) if sentence]
        )
    return result
# Demo:
get_sentences(demo_input)
Input
demo_input = [" This is a phrase; this, too, is a phrase. But this is another sentence.",
"Hark!",
" ",
"Come what may <-- save those spaces, but not these --> ",
"What did you say?Split into 3 (even without a space)? Okie dokie."]
Desired Output
["This is a phrase; this, too, is a phrase",
"But this is another sentence",
"Hark",
"Come what may <-- save those spaces, but not these -->",
"What did you say",
"Split into 3 (even without a space)",
"Okie dokie"]
However, my code produces this:
['This is a phrase; this, too, is a phrase',
'But this is another sentence',
'Hark',
'',
'Come what may <-- save those spaces, but not these -->',
'What did you say',
'Split into 3 (even without a space)',
'Okie dokie']
Question: Why am I getting that '' empty sentence in there even though my code is leaving it out?
I can solve the problem with the following code but I will have to go through the list again and I don't want to do that. I want to do it in the same pass.
import re
def get_sentences(doc):
    assert isinstance(doc, list)
    result = []
    for line in doc:
        result.extend([sentence.strip() for sentence in re.split(r'\.|\?|\!', line)])
    result = [s for s in result if s]
    return result
# Demo:
get_sentences(demo_input)
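The if sentence check runs on the unstripped piece, and a piece made only of spaces (such as the line " ") is non-empty, so it passes the filter and only becomes '' after strip(). Try using if sentence.strip(), i.e.:
for line in doc:
    result.extend([sentence.strip() for sentence in re.split(r'\.|\?|\!', line) if sentence.strip()])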
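A small sketch of the same fix that strips each piece only once, using a generator in between. The helper name split_line is made up for illustration, and the character class [.?!] is used here as an equivalent of the \.|\?|\! alternation.

```python
import re

# A string of spaces is truthy; only the stripped version is empty.
print(bool(" "), bool(" ".strip()))  # True False

def split_line(line):
    # Strip each piece once, then keep it only if something is left.
    pieces = (piece.strip() for piece in re.split(r"[.?!]", line))
    return [piece for piece in pieces if piece]

print(split_line("What did you say?Split into 3 (even without a space)? Okie dokie."))
# ['What did you say', 'Split into 3 (even without a space)', 'Okie dokie']
```

Inside get_sentences, result.extend(split_line(line)) would then do the filtering in the same pass over the document.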

Check if text within a list is within a string

I would like to check against a list of words if they are within a string.
For Example:
listofwords = ['hi','bye','yes','no']
String = 'Hi how are you'
string2 = 'I have none of the words'
String 1 is true as it contains 'hi' and string2 is false as it does not.
I have tried the following code but it always returns false.
if any(ext in String for ext in listofwords):
    print(String)
I would also like to show what the matching word was to check this is correct.
hi and Hi are different strings. Call .lower() on both sides before comparing:
if any(ext.lower() in String.lower() for ext in listofwords):
    print(String)
Update:
To print the matching words, iterate with a for loop and print each word that matches.
Example:
listofwords = ['hi','bye','yes','no']
String = 'Hi how are you'
string2 = 'I have none of the words'
for word in listofwords:
    if word.lower() in map(str.lower,String.split()): # map both of the words to lowercase before matching
        print(word)
for word in listofwords:
    if word.lower() in map(str.lower,string2.split()): # map both of the words to lowercase before matching
        print(word)
PS: This is not an optimized version. You can store the result of String.split() in a list before iterating, which saves time for larger strings. The purpose of the code is to demonstrate the use of lower case.
Python is case sensitive. Hence hi is not equal to Hi. This works:
listofwords = ['hi','bye','yes','no']
String = 'hi how are you'
string2 = 'I have none of the words'
if any(ext in String for ext in listofwords):
    print(String)
The problem is both with case-sensitivity and with using in directly with a string.
To make the search case-insensitive, convert both the string and each word to lower case. To match whole words only, also split the lower-cased string:
if any(ext.lower() in String.lower().split() for ext in listofwords):
    print(String)
Splitting avoids returning True for substrings such as no in none; a word only matches when it appears on its own. The check above now works for both String (it is printed) and string2 (it is not).
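To also report which words matched, as the question asks, the same whole-word comparison can return the matches instead of a boolean. A minimal sketch, with matching_words as a hypothetical name; note that it splits on whitespace only, so a word followed by punctuation ("Hi,") would not match.

```python
def matching_words(text, candidates):
    # Hypothetical helper: compare lower-cased whole words, not substrings.
    words_in_text = set(text.lower().split())
    return [word for word in candidates if word.lower() in words_in_text]

listofwords = ['hi', 'bye', 'yes', 'no']
print(matching_words('Hi how are you', listofwords))            # ['hi']
print(matching_words('I have none of the words', listofwords))  # []
```

An empty list is falsy, so the result can still be used directly in an if statement.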

Convert titlecase words in the string to lowercase words

I want to convert all the titlecase words (words starting with uppercase character and having rest of the characters as lowercase) in the string to the lowercase characters. For example, if my initial string is:
text = " ALL people ARE Great"
I want my resultant string to be:
"ALL people ARE great"
I tried the following but it did not work
text = text.split()
for i in text:
    if i in [word for word in a if not word.islower() and not word.isupper()]:
        text[i]= text[i].lower()
I also checked the related question "Check if string is upper, lower, or mixed case in Python". I want to iterate over my dataframe and convert each word that meets this criterion.
You could define your transform function
def transform(s):
    if len(s) == 1 and s.isupper():
        return s.lower()
    if s[0].isupper() and s[1:].islower():
        return s.lower()
    return s
text = " ALL people ARE Great"
final_text = " ".join([transform(word) for word in text.split()])
You can use str.istitle() to check whether a word is titlecased, i.e. whether its first character is uppercase and the rest are lowercase.
To get your desired result, you need to:
1. Convert your string to a list of words using str.split().
2. Transform each word using str.istitle() and str.lower() (I am using a list comprehension to iterate over the list and build a new list of words in the desired format).
3. Join the list back into a string using str.join().
For example:
>>> text = " ALL people ARE Great"
>>> ' '.join([word.lower() if word.istitle() else word for word in text.split()])
'ALL people ARE great'
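Wrapped in a function, the one-liner is easier to apply elsewhere (for instance to each cell of a dataframe column). This is a sketch with a hypothetical name, lower_titlecase. Two behaviours of str.istitle() are worth knowing before relying on it: a single uppercase letter counts as titlecase, and a word with an apostrophe such as "It's" does not, because the lowercase "s" follows a non-letter.

```python
def lower_titlecase(text):
    # Lower-case only the words that str.istitle() accepts.
    return ' '.join(w.lower() if w.istitle() else w for w in text.split())

print(lower_titlecase(" ALL people ARE Great"))  # ALL people ARE great
print("A".istitle())     # True: a lone capital counts as titlecase
print("It's".istitle())  # False: the 's' follows an apostrophe
```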

Python How to skip the part in a string marked by certain symbols?

I'm trying to reconstruct a sentence by one-to-one matching the words in a word list to a sentence:
text='acatisananimal'
words=['cat','is','an','a','animal']
for i in words:
    if i in text:
        final=text.replace(i,' '+i)
        text=final
print(final)
The expected output is:
a cat is an animal
If I run my code, the 'a' and 'an' in 'animal' will be unavoidably separated too.
So I want to sort the word list by the length, and search for the long words first.
words.sort(key=len)
words=words[::-1]
Then I would like to mark the long words with special symbols, and expect the program could skip the part I marked. For example:
acatisan%animal&
And finally I will erase the symbols. But I'm stuck here: I don't know how to make the program skip the parts between '%' and '&'. Can anyone help me? Or are there better ways to solve the spacing problem? Many thanks!
As another case, what if the text includes words that are not in the word list? How could I handle this?
text='wowwwwacatisananimal'
A more generalized approach would be to look for all valid words at the beginning, split them off and explore the rest of the letters, e.g.:
def compose(letters, words):
    q = [(letters, [])]
    while q:
        letters, result = q.pop()
        if not letters:
            return ' '.join(result)
        for word in words:
            if letters.startswith(word):
                q.append((letters[len(word):], result+[word]))
>>> words=['cat','is','an','a','animal']
>>> compose('acatisananimal', words)
'a cat is an animal'
If there are potentially multiple possible sentence compositions, it would be trivial to turn this into a generator by replacing return with yield, so that it yields all matching sentence compositions.
Contrived example (just replace return with yield):
>>> words=['adult', 'sex', 'adults', 'exchange', 'change']
>>> list(compose('adultsexchange', words))
['adults exchange', 'adult sex change']
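Written out in full, the generator variant might look like the sketch below. The name compose_all and the guard against empty words are additions for illustration; the search itself is the same stack-based one as in compose.

```python
def compose_all(letters, words):
    # Same search as compose, but yields every complete split
    # instead of returning the first one found.
    stack = [(letters, [])]
    while stack:
        rest, result = stack.pop()
        if not rest:
            yield ' '.join(result)
            continue
        for word in words:
            # Skip empty words: they would never consume any letters.
            if word and rest.startswith(word):
                stack.append((rest[len(word):], result + [word]))

words = ['adult', 'sex', 'adults', 'exchange', 'change']
print(list(compose_all('adultsexchange', words)))
# ['adults exchange', 'adult sex change']
```

If the text contains letters that no word covers, the generator simply yields nothing.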
Maybe you can replace each word with its index in the word list, so that the intermediate string looks like 3 0 1 2 4, and then convert it back to a sentence:
text='acatisananimal'
words=['cat','is','an','a','animal']
for i in sorted(words,key=len,reverse=True):
    if i in text:
        final=text.replace(i,' %s'%words.index(i))
        text=final
print(" ".join(words[int(i)] for i in final.split()))
Output:
a cat is an animal
You need a small modification in your code. Change the line
final=text.replace(i,' '+i)
to
final=text.replace(i,' '+i, 1)
This replaces only the first occurrence.
So the updated code would be
text='acatisananimal'
words=['cat','is','an','a','animal']
for i in words:
    if i in text:
        final=text.replace(i,' '+i, 1)
        text=final
print(final)
Output is:
a cat is an animal
If you only need the part that removes the symbols, a regex is what you are looking for. Import the re module and do this:
import re
# code here
print(re.sub(r'\W+', ' ', final))
I wouldn't recommend using different delimiters on either side of your matched words (% and & in your example).
It's easier to use the same delimiter on either side of each marked word and use Python's list slicing.
The solution below uses the [::n] syntax for getting every nth element of a list.
a[::2] gets even-numbered elements, a[1::2] gets the odd ones.
>>> fox = "the|quick|brown|fox|jumpsoverthelazydog"
Because they have | characters on either side, 'quick' and 'fox' are odd-numbered elements when you split the string on |:
>>> splitfox = fox.split('|')
>>> splitfox
['the', 'quick', 'brown', 'fox', 'jumpsoverthelazydog']
>>> splitfox[1::2]
['quick', 'fox']
and the rest are even:
>>> splitfox[::2]
['the', 'brown', 'jumpsoverthelazydog']
So, by enclosing known words in | characters, splitting, and scanning even-numbered elements, you're searching only those parts of the text that are not yet matched. This means you don't match within already-matched words.
from itertools import chain
def flatten(list_of_lists):
    return chain.from_iterable(list_of_lists)
def parse(source_text, words):
    words.sort(key=len, reverse=True)
    texts = [source_text, ''] # even number of elements helps zip function
    for word in words:
        new_matches_and_text = []
        for text in texts[::2]:
            new_matches_and_text.append(text.replace(word, f"|{word}|"))
        previously_matched = texts[1::2]
        # merge new matches back in
        merged = '|'.join(flatten(zip(new_matches_and_text, previously_matched)))
        texts = merged.split('|')
    # remove blank words (matches at start or end of a string)
    texts = [text for text in texts if text]
    return ' '.join(texts)
>>> parse('acatisananimal', ['cat', 'is', 'a', 'an', 'animal'])
'a cat is an animal'
>>> parse('atigerisanenormousscaryandbeautifulanimal', ['tiger', 'is', 'an', 'and', 'animal'])
'a tiger is an enormousscary and beautiful animal'
The merge code uses the zip and flatten functions to splice the new matches and old matches together. It basically works by pairing even and odd elements of the list, then "flattening" the result back into one long list ready for the next word.
This approach leaves the unrecognised words in the text.
'beautiful' and 'a' are handled well because they're on their own (i.e. next to recognised words.)
'enormous' and 'scary' are not known and, as they're next to each other, they're left stuck together.
Here's how to list the unknown words:
>>> known_words = ['cat', 'is', 'an', 'animal']
>>> sentence = parse('anayeayeisananimal', known_words)
>>> [word for word in sentence.split(' ') if word not in known_words]
['ayeaye']
I'm curious: is this a bioinformatics project?
A list comprehension is another way to do it:
result = ' '.join([word for word, _ in sorted([(k, v) for k, v in zip(words, [text.find(word) for word in words])], key=lambda x: x[1])])
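Here zip pairs each word with its position in text (found with text.find), the pairs are sorted by that position, and the words are joined with ' '. Since text.find returns only the first occurrence, this relies on each word's first occurrence being the intended one.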
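For the follow-up case where the text contains letters that are not in the word list, one more option is re.split with a capturing group built from the words, longest first. This is a sketch under that assumption, not a replacement for the search-based answers above: the regex takes the first alternative that matches at each position and does not explore other splits. The name space_out is hypothetical.

```python
import re

def space_out(text, words):
    # Longest words first, so "animal" is tried before "an" and "a".
    ordered = sorted(words, key=len, reverse=True)
    pattern = "(" + "|".join(re.escape(w) for w in ordered) + ")"
    # With a capturing group, re.split keeps the matched words and returns
    # the unmatched stretches between them as separate pieces.
    return " ".join(piece for piece in re.split(pattern, text) if piece)

words = ['cat', 'is', 'an', 'a', 'animal']
print(space_out('acatisananimal', words))        # a cat is an animal
print(space_out('wowwwwacatisananimal', words))  # wowwww a cat is an animal
```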
