I have a list of phrases (n-grams) that need to be removed from a given sentence.
removed = ['range', 'drinks', 'food and drinks', 'summer drinks']
sentence = 'Oranges are the main ingredient for a wide range of food and drinks'
I want to get:
new_sentence = 'Oranges are the main ingredient for a wide of'
I tried the approach from Remove list of phrases from string, but it doesn't work as expected ('Oranges' turns into 'Os', and 'drinks' is removed on its own instead of as part of the phrase 'food and drinks').
Does anyone know how to solve it? Thank you!
Since you want to match on whole words only, I think the first step is to turn everything into lists of words, and then iterate from longest to shortest phrase in order to find things to remove:
>>> removed = ['range', 'drinks', 'food and drinks', 'summer drinks']
>>> sentence = 'Oranges are the main ingredient for a wide range of food and drinks'
>>> words = sentence.split()
>>> for ngram in sorted([r.split() for r in removed], key=len, reverse=True):
...     for i in range(len(words) - len(ngram) + 1):
...         if words[i:i+len(ngram)] == ngram:
...             words = words[:i] + words[i+len(ngram):]
...             break
...
>>> " ".join(words)
'Oranges are the main ingredient for a wide of'
Note that there are some flaws with this simple approach -- multiple copies of the same n-gram won't be removed, and you can't simply continue that loop after modifying words (its length has changed), so if you want to handle duplicates, you'll need to batch the updates, as sketched below.
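A possible batched variant (my own sketch, not part of the original answer): mark every matching position first, then rebuild the word list in a single pass, so repeated n-grams are removed as well.

removed = ['range', 'drinks', 'food and drinks', 'summer drinks']
sentence = 'Oranges are the main ingredient for a wide range of food and drinks'

words = sentence.split()
skip = [False] * len(words)  # positions scheduled for removal

# Longest phrases first, so 'food and drinks' wins over 'drinks'
for ngram in sorted((r.split() for r in removed), key=len, reverse=True):
    for i in range(len(words) - len(ngram) + 1):
        if words[i:i + len(ngram)] == ngram and not any(skip[i:i + len(ngram)]):
            for j in range(i, i + len(ngram)):
                skip[j] = True

new_sentence = ' '.join(w for w, s in zip(words, skip) if not s)
print(new_sentence)  # Oranges are the main ingredient for a wide of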
Regular expression time!
In [116]: removed = ['range', 'drinks', 'food and drinks', 'summer drinks']
     ...: removed = sorted(removed, key=len, reverse=True)
     ...: sentence = 'Oranges are the main ingredient for a wide range of food and drinks'
     ...: new_sentence = sentence
     ...: import re
     ...: removals = [r'\b' + phrase + r'\b' for phrase in removed]
     ...: for removal in removals:
     ...:     new_sentence = re.sub(removal, '', new_sentence)
     ...: new_sentence = ' '.join(new_sentence.split())
     ...: print(sentence)
     ...: print(new_sentence)
Oranges are the main ingredient for a wide range of food and drinks
Oranges are the main ingredient for a wide of
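One caveat worth noting (my addition, not part of the answer above): if a phrase might contain regex metacharacters, escape it before building the pattern, for example:

removals = [r'\b' + re.escape(phrase) + r'\b' for phrase in removed]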
import re
removed = ['range', 'drinks', 'food and drinks', 'summer drinks']
sentence = 'Oranges are the main ingredient for a wide range of food and drinks'
# sort the removed tokens according to their length
removed = sorted(removed, key=len, reverse=True)
# replace using word boundaries
for r in removed:
    sentence = re.sub(r"\b{}\b".format(r), " ", sentence)
# replace multiple whitespaces with a single one
sentence = re.sub(' +', ' ', sentence)
I hope this helps.
First, you need to sort the removed strings by length (longest first); this way 'food and drinks' is replaced before 'drinks'.
Here you go
removed = ['range', 'drinks', 'food and drinks', 'summer drinks','are']
sentence = 'Oranges are the main ingredient for a wide range of food and drinks'
words = sentence.split()
resultwords = [word for word in words if word.lower() not in removed]
result = ' '.join(resultwords)
print(result)
Results:
Oranges the main ingredient for a wide of food and
I have two list of strings
data_1 = ['The art is performed by james john.', 'art is quite silent']
data_2 = ['The art is performed by hans.', 'art is very quite silent']
I want to remove the words common to the corresponding strings in each list and return two separate lists:
result_1 = ['james john','']
result_2 = ['hans', 'very']
I tried this way
print([' '.join(set(i.split()).difference(set(data_1))) for i in data_2])
How can I obtain results like result_1 and result_2?
You could try using numpy's setdiff1d function. Like:
import numpy as np

difference_1 = [" ".join(list(np.setdiff1d(np.array(x.split()), np.array(y.split())))) for x, y in zip(data_1, data_2)]
Using set.difference() should also work:
difference_1 = [" ".join(set(x.split()).difference(set(z.split()))) for x, z in zip(data_1, data_2)]
First, tokenize the sentences using nltk:
from nltk import word_tokenize

def list_tokenize(data):
    return [word_tokenize(sentence) for sentence in data]
Then get the common words for each sentence pair:
def get_common_words(data_1_tokenized, data_2_tokenized):
    return [
        list(set.intersection(set(sentence_1), set(sentence_2)))
        for sentence_1, sentence_2 in zip(data_1_tokenized, data_2_tokenized)
    ]
Then remove the common words
def remove_common_words(data, common_words):
    result = []
    for i in range(len(data)):
        result.append(
            " ".join([word for word in data[i] if word not in common_words[i]]))
    return result
Combined function to get unique words
def get_unique(data_1, data_2):
    data_1_tokenized = list_tokenize(data_1)
    data_2_tokenized = list_tokenize(data_2)
    common_words = get_common_words(data_1_tokenized, data_2_tokenized)
    result1 = remove_common_words(data_1_tokenized, common_words)
    result2 = remove_common_words(data_2_tokenized, common_words)
    return result1, result2
Final usage:
data_1 = ['The art is performed by james john.', 'art is quite silent']
data_2 = ['The art is performed by hans.', 'art is very quite silent']
result1,result2 = get_unique(data_1,data_2)
Results
result1=['james john', '']
result2=['hans', 'very']
I have two lists. The first contains adjectives and the second contains sentences.
I need to return a sentence if it contains an adjective from the list, and store that sentence in a dictionary with the value 'adj'. That would return (['Have a good day'], 'adj').
Or, at the very least, it could just return the sentences that contain a matching adjective.
sents_cleaned = ['have a good day', "don't forget your yellow umbrella", 'bold seagull']
adjectives = ['good', 'red', 'green', 'yellow']
This is what I've tried so far. It didn't work as expected; sorry, I'm a newbie.
for sents in sents_cleaned:
    sents = sents.strip().split(" ")
    for words in sents:
        for adj in adjectives:
            if adj in sents:
                print(sents)
The output would be ['have a good day', 'adj'],
["don't forget your yellow umbrella", 'adj']
Suppose you want to store it in a dictionary called d, with sentences as keys and adjectives as values.
The following code assumes you want only 1 adjective from each sentence. If, however, you require multiple adjectives, keeping a dictionary of string to list of strings would help, as sketched after the code.
d = dict()
for sent in sents_cleaned:
    words = sent.strip().split(" ")
    for word in words:
        if word in adjectives:
            d[sent] = word  # keep the original string as the key (a list is not hashable)
print(d)
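If you do need multiple adjectives per sentence, a dictionary mapping each sentence to a list of matches could look roughly like this (a sketch along the lines suggested above):

sents_cleaned = ['have a good day', "don't forget your yellow umbrella", 'bold seagull']
adjectives = ['good', 'red', 'green', 'yellow']

d = {}
for sent in sents_cleaned:
    matches = [word for word in sent.strip().split(" ") if word in adjectives]
    if matches:
        d[sent] = matches

print(d)  # {'have a good day': ['good'], "don't forget your yellow umbrella": ['yellow']}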
I'm not sure how to remove the "\n" at the end of the output.
Basically, I have a txt file with sentences such as:
"What does Bessie say I have done?" I asked.
"Jane, I don't like cavillers or questioners; besides, there is something truly forbidding in a child
taking up her elders in that manner.
Be seated somewhere; and until you can speak pleasantly, remain silent."
I managed to split the sentences by semicolon with this code:
import re
with open("testing.txt") as file:
    read_file = file.readlines()
for i, word in enumerate(read_file):
    low = word.lower()
    re.split(';', low)
But I am not sure how to count the words of the split sentences, as len() doesn't work:
The output of the sentences:
['"what does bessie say i have done?" i asked.\n']
['"jane, i don\'t like cavillers or questioners', ' besides, there is something truly forbidding in a
child taking up her elders in that manner.\n']
['be seated somewhere', ' and until you can speak pleasantly, remain silent."\n']
For the third sentence, for example, I am trying to count the 3 words on the left and the 8 words on the right.
Thanks for reading!
The number of words is the number of spaces plus one:
e.g.
Two spaces, three words:
World is wonderful
Code:
import re
import string

lines = []
with open('file.txt', 'r') as f:
    lines = f.readlines()

DELIMITER = ';'
word_count = []
for i, sentence in enumerate(lines):
    # Skip empty sentences
    if not sentence.strip():
        continue
    # Remove punctuation besides our delimiter ';'
    sentence = sentence.translate(str.maketrans('', '', string.punctuation.replace(DELIMITER, '')))
    # Split by our delimiter
    splitted = re.split(DELIMITER, sentence)
    # The number of words is the number of spaces plus one
    word_count.append([1 + x.strip().count(' ') for x in splitted])

# [[9], [7, 9], [7], [3, 8]]
print(word_count)
Use str.rstrip('\n') to remove the \n at the end of each sentence.
To count the words in a sentence, you can use len(sentence.split()).
To transform a list of sentences into a list of counts, you can use the map function.
So here it is:
import re

with open("testing.txt") as file:
    for i, line in enumerate(file.readlines()):
        # Ignore empty lines
        if line.strip(' ') != '\n':
            line = line.lower().rstrip('\n')
            # Split by semicolons
            parts = re.split(';', line)
            print("SENTENCES:", parts)
            counts = list(map(lambda part: len(part.split()), parts))
            print("COUNTS:", counts)
Outputs
SENTENCES: ['"what does bessie say i have done?" i asked.']
COUNTS: [9]
SENTENCES: ['"jane, i don\'t like cavillers or questioners', ' besides, there is something truly forbidding in a child ']
COUNTS: [7, 9]
SENTENCES: [' taking up her elders in that manner.']
COUNTS: [7]
SENTENCES: ['be seated somewhere', ' and until you can speak pleasantly, remain silent."']
COUNTS: [3, 8]
You'll need the library nltk
from nltk import sent_tokenize, word_tokenize
mytext = """I have a dog.
The dog is called Bob."""
for sent in sent_tokenize(mytext):
    print(len(word_tokenize(sent)))
Output
5
6
Step by step explanation:
for sent in sent_tokenize(mytext):
    print('Sentence >>>', sent)
    print('List of words >>>', word_tokenize(sent))
    print('Count words per sentence >>>', len(word_tokenize(sent)))
Output:
Sentence >>> I have a dog.
List of words >>> ['I', 'have', 'a', 'dog', '.']
Count words per sentence>>> 5
Sentence >>> The dog is called Bob.
List of words >>> ['The', 'dog', 'is', 'called', 'Bob', '.']
Count words per sentence>>> 6
import re

sentences = []  # empty list for storing the result
with open('testtext.txt') as fileObj:
    lines = [line.strip() for line in fileObj if line.strip()]  # list of lines already stripped of '\n'
    for line in lines:
        sentences += re.split(';', line)  # split lines by ';' and store the result in sentences
for sentence in sentences:
    print(sentence + ' ' + str(len(sentence.split())))  # output each sentence and its word count
Try this one:
import re

with open("testing.txt") as file:
    read_file = file.readlines()
for i, word in enumerate(read_file):
    low = word.lower()
    low = low.strip()
    low = low.replace('\n', '')
    print(re.split(';', low))
I want to edit my text like this:
arr = []
# arr is full of tokenized words from my text
For example:
"Abraham Lincoln Hotel is very beautiful place and i want to go there with
Barbara Palvin. Also there are stores like Adidas ,Nike , Reebok."
Edit: Basically I want to detect proper names and group them, using istitle() and isalpha() in a for statement like:
for i in arr:
    if arr[i].istitle() and arr[i].isalpha():
In the example, words are appended to the same arr entry until the next word no longer starts with an uppercase letter.
arr[0] + arr[1] + arr[2] = arr[0]
#Abraham Lincoln Hotel
This is what I want with my new arr:
['Abraham Lincoln Hotel'] is very beautiful place and i want to go there with ['Barbara Palvin']. ['Also'] there are stores like ['Adidas'], ['Nike'], ['Reebok'].
"Also" is not problem for me it will be usefull when i try to match with my dataset.
You could do something like this:
sentence = "Abraham Lincoln Hotel is very beautiful place and i want to go there with Barbara Palvin. Also there are stores like Adidas, Nike, Reebok."
all_words = sentence.split()
last_word_index = -100
proper_nouns = []
for idx, word in enumerate(all_words):
if(word.istitle() and word.isalpha()):
if(last_word_index == idx-1):
proper_nouns[-1] = proper_nouns[-1] + " " + word
else:
proper_nouns.append(word)
last_word_index = idx
print(proper_nouns)
This code will:
Split all the words into a list.
Iterate over all of the words:
if the last capitalized word was the previous word, it appends the current word to the last entry in the list;
otherwise it stores the word as a new entry in the list.
It also records the last index at which a capitalized word was found.
Is this what you are asking?
sentence = "Abraham Lincoln Hotel is very beautiful place and i want to go there with Barbara Palvin. Also there are stores like Adidas ,Nike , Reebok."
chars = ".!?," # Characters you want to remove from the words in the array
table = chars.maketrans(chars, " " * len(chars)) # Create a table for replacing characters
sentence = sentence.translate(table) # Replace characters with spaces
arr = sentence.split() # Split the string into an array wherever a space occurs
print(arr)
The output is:
['Abraham',
'Lincoln',
'Hotel',
'is',
'very',
'beautiful',
'place',
'and',
'i',
'want',
'to',
'go',
'there',
'with',
'Barbara',
'Palvin',
'Also',
'there',
'are',
'stores',
'like',
'Adidas',
'Nike',
'Reebok']
Note about this code: any character that is in the chars variable will be removed from the strings in the array. The explanation is in the code comments.
To remove the non-names just do this:
import string
new_arr = []
for i in arr:
    if i[0] in string.ascii_uppercase:
        new_arr.append(i)
This code will include ALL words that start with a capital letter.
To fix that you will need to change chars to:
chars = ","
And change the above code to:
import string
new_arr = []
end = ".!?"
b = 1
for i in arr:
    if i[0] in string.ascii_uppercase and arr[b-1][-1] not in end:
        new_arr.append(i)
    b += 1
And that will output:
['Abraham',
'Lincoln',
'Hotel',
'Barbara',
'Palvin.',
'Adidas',
'Nike',
'Reebok.']
I have a string and a dictionary, and I have to replace every occurrence of the dict keys in that text.
text = 'I have a smartphone and a Smart TV'
dict = {
'smartphone': 'toy',
'smart tv': 'junk'
}
If there were no spaces in the keys, I would break the text into words and compare them one by one with the dict. That looks like O(n). But now the keys can have spaces inside, so things are more complicated. Please suggest a good way to do this, and please note that the keys may not match the case of the text.
Update
I have thought of this solution, but it is not efficient: O(m*n) or more...
for k, v in dict.iteritems():
    text = text.replace(k, v)  # or regex...
If the keywords in the text are not right next to each other (keyword, other word, keyword), we may do this. It looks like O(n) to me >"<
def dict_replace(dictionary, text, strip_chars=None, replace_func=None):
    """
    Replace word or word phrase in text with keyword in dictionary.

    Arguments:
        dictionary: dict with key:value, key should be in lower case
        text: string to replace
        strip_chars: string containing characters to strip out of each word
        replace_func: function that, if given, transforms the final replacement.
                      Must take 2 params: key and value

    Return:
        string

    Example:
        my_dict = {
            "hello": "hallo",
            "hallo": "hello",  # Only one pass, don't worry
            "smart tv": "http://google.com?q=smart+tv"
        }
        dict_replace(my_dict, "hello google smart tv",
                     replace_func=lambda k, v: '[%s](%s)' % (k, v))
    """
    # First break word phrases in the dictionary into single words
    dictionary = dictionary.copy()
    for key in dictionary.keys():
        if ' ' in key:
            key_parts = key.split()
            for part in key_parts:
                # Mark single words with False
                if part not in dictionary:
                    dictionary[part] = False

    # Break text into words and compare one by one
    result = []
    words = text.split()
    words.append('')
    last_match = ''  # Last keyword (lower) match
    original = ''    # Last match in original
    for word in words:
        key_word = word.lower().strip(strip_chars) if \
            strip_chars is not None else word.lower()
        if key_word in dictionary:
            last_match = last_match + ' ' + key_word if \
                last_match != '' else key_word
            original = original + ' ' + word if \
                original != '' else word
        else:
            if last_match != '':
                # Matched a whole phrase
                if last_match in dictionary and dictionary[last_match] != False:
                    if replace_func is not None:
                        result.append(replace_func(original, dictionary[last_match]))
                    else:
                        result.append(dictionary[last_match])
                else:
                    # Only matched part of a keyword phrase
                    match_parts = last_match.split(' ')
                    match_original = original.split(' ')
                    for i in xrange(0, len(match_parts)):
                        if match_parts[i] in dictionary and \
                                dictionary[match_parts[i]] != False:
                            if replace_func is not None:
                                result.append(replace_func(match_original[i], dictionary[match_parts[i]]))
                            else:
                                result.append(dictionary[match_parts[i]])
            result.append(word)
            last_match = ''
            original = ''

    return ' '.join(result)
If your keys have no spaces:
output = [dct[i] if i in dct else i for i in text.split()]
' '.join(output)
You should use dct instead of dict so it doesn't collide with the built-in dict().
This makes use of a list comprehension and a conditional expression to filter the data.
If your keys do have spaces, you are correct:
for k, v in dct.iteritems():
    text = text.replace(k, v)
And yes, the time complexity of this will be O(m*n), as you have to iterate through the string once for each key in dct.
Convert all the dictionary keys and the input text to lower case, so the comparisons are easy. Now ...
for entry in my_dict:
    if entry in text:
        ...  # process the match
This assumes that the dictionary is small enough that scanning the text once per entry is acceptable. If, instead, the dictionary is large and the text is small, you'll need to take each word, then each 2-word phrase, and see whether they're in the dictionary.
Is that enough to get you going?
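A rough sketch of that word/phrase lookup (my own illustration, assuming the two-word keys from the question):

text = 'I have a smartphone and a Smart TV'
replacements = {'smartphone': 'toy', 'smart tv': 'junk'}

words = text.lower().split()
found = []
# Check single words, then 2-word phrases
for size in (1, 2):
    for i in range(len(words) - size + 1):
        candidate = ' '.join(words[i:i + size])
        if candidate in replacements:
            found.append(candidate)

print(found)  # ['smartphone', 'smart tv']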
You need to test all the neighbor permutations, from length 1 (each individual word) up to the full number of words (the entire string). You can generate the neighbor permutations this way:
text = 'I have a smartphone and a Smart TV'
array = text.lower().split()
key_permutations = [" ".join(array[j:j + i]) for i in range(1, len(array) + 1) for j in range(0, len(array) - (i - 1))]
>>> key_permutations
['i', 'have', 'a', 'smartphone', 'and', 'a', 'smart', 'tv', 'i have', 'have a', 'a smartphone', 'smartphone and', 'and a', 'a smart', 'smart tv', 'i have a', 'have a smartphone', 'a smartphone and', 'smartphone and a', 'and a smart', 'a smart tv', 'i have a smartphone', 'have a smartphone and', 'a smartphone and a', 'smartphone and a smart', 'and a smart tv', 'i have a smartphone and', 'have a smartphone and a', 'a smartphone and a smart', 'smartphone and a smart tv', 'i have a smartphone and a', 'have a smartphone and a smart', 'a smartphone and a smart tv', 'i have a smartphone and a smart', 'have a smartphone and a smart tv', 'i have a smartphone and a smart tv']
Now we substitute through the dictionary:
import re

for permutation in key_permutations:
    if permutation in dict:
        text = re.sub(re.escape(permutation), dict[permutation], text, flags=re.IGNORECASE)
>>> text
'I have a toy and a junk'
Though you'll likely want to try the permutations in the reverse order, longest first, so more specific phrases have precedence over individual words.
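For example, sorting the candidates by word count, longest first, before substituting (my addition, reusing key_permutations, dict and text from the snippet above):

for permutation in sorted(key_permutations, key=lambda p: len(p.split()), reverse=True):
    if permutation in dict:
        text = re.sub(re.escape(permutation), dict[permutation], text, flags=re.IGNORECASE)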
You can do this pretty easily with regular expressions.
import re
text = 'I have a smartphone and a Smart TV'
dict = {
'smartphone': 'toy',
'smart tv': 'junk'
}
for k, v in dict.iteritems():
    regex = re.compile(re.escape(k), flags=re.I)
    text = regex.sub(v, text)
It still suffers from the problem of depending on processing order of the dict keys, if the replacement value for one item is part of the search term for another item.
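One way around that ordering problem (a sketch of my own, not part of this answer): build a single alternation pattern, longest key first, and resolve each match through one lookup, so replacement values are never scanned again:

import re

text = 'I have a smartphone and a Smart TV'
replacements = {'smartphone': 'toy', 'smart tv': 'junk'}

# Longest keys first so longer phrases win over shorter overlapping ones
pattern = re.compile(
    '|'.join(re.escape(k) for k in sorted(replacements, key=len, reverse=True)),
    flags=re.IGNORECASE)

result = pattern.sub(lambda m: replacements[m.group(0).lower()], text)
print(result)  # I have a toy and a junk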