I am trying to remove suffixes but for some reason, it is not working.
Code is:
    # Stemming
    suffix_list = ['-ed', '-ing', '-s']
    for word in range(len(output)):  # loop over the indices of output
        # range returns the sequence, len checks the length
        for suffix in range(len(suffix_list)):
            # .endswith checks whether the word ends with the suffix
            if output[word].endswith(suffix_list[suffix]):
                # .removesuffix returns the word without the suffix
                print(output[word].removesuffix(suffix_list[suffix]))
    return output
print(textPreprocessing("I'm gathering herbs."))
print(textPreprocessing("When life gives you lemons, make lemonade"))
Outcome is:
gather
herb
['im', 'gathering', 'herbs']
give
lemon
['life', 'gives', 'you', 'lemons', 'make', 'lemonade']
Where it should be:
['im', 'gather', 'herbs']
['life', 'give', 'you', 'lemon', 'make', 'lemonade']
Any help?
I feel like I am missing something obvious...
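Strings are immutable, so removesuffix returns a new string and leaves the list element untouched. Assign the result back instead of only printing it:
output[word] = output[word].removesuffix(suffix_list[suffix])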
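For reference, the whole stemming step could be sketched as a small standalone function. This is only an illustration: the name strip_suffixes and the hyphen-free suffix tuple are assumptions, not part of the original textPreprocessing function, and slicing is used instead of removesuffix so that the sketch also runs on Python versions older than 3.9.

```python
def strip_suffixes(words, suffixes=("ing", "ed", "s")):
    """Return a new list with the first matching suffix removed from each word.

    Hypothetical helper: the name and the default suffixes are assumptions.
    """
    result = []
    for word in words:
        for suffix in suffixes:
            if word.endswith(suffix):
                # Slice off only the trailing suffix, then stop checking
                word = word[: -len(suffix)]
                break
        result.append(word)
    return result


print(strip_suffixes(["im", "gathering", "herbs"]))
# ['im', 'gather', 'herb']
```

Building a new list keeps the input untouched, which avoids the "printed but not stored" problem from the question.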
This will work:
str_sample = "When life gives you lemons make lemonade"
suffixes, words = ["s"], str_sample.split()
for i in range(len(suffixes)):
    for j in range(len(words)):
        if words[j].endswith(suffixes[i]):
            # Slice off only the trailing suffix; replace() would remove every occurrence
            words[j] = words[j][:-len(suffixes[i])]
print(words)
It can be compressed into a list comprehension:
str_sample = "When life gives you lemons make lemonade"
suffixes, words = ["s"], str_sample.split()
# Note: with more than one suffix this yields one entry per (word, suffix) pair
res = [words[j][:-len(suffixes[i])] if words[j].endswith(suffixes[i]) else words[j] for j in range(len(words)) for i in range(len(suffixes))]
print(res)
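One thing worth keeping in mind with these answers: str.replace removes every occurrence of the substring, not just the one at the end of the word. A small sketch of the difference, where strip_suffix is a hypothetical helper name:

```python
def strip_suffix(word, suffix):
    """Remove suffix only when word actually ends with it (hypothetical helper)."""
    if suffix and word.endswith(suffix):
        return word[: -len(suffix)]
    return word


word = "sisters"
removed_everywhere = word.replace("s", "")  # every "s" is dropped
removed_at_end = strip_suffix(word, "s")    # only the trailing "s" is dropped

print(removed_everywhere)  # iter
print(removed_at_end)      # sister
```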
I'm having an issue limiting the number of characters (not words) when running different solutions, as shown below. Any help is appreciated!
test_list = ['Running the boy gladly jumped']
test_words = test_list[0].split()
suffix_list = ['ed', 'ly', 'ing']
final_list = []
for word in test_words:
    if suffix_list[0] == word[-len(suffix_list[0]):]:
        final_list.append(word[0:-len(suffix_list[0])])
    elif suffix_list[1] == word[-len(suffix_list[1]):]:
        final_list.append(word[0:-len(suffix_list[1])])
    elif suffix_list[2] == word[-len(suffix_list[2]):]:
        final_list.append(word[0:-len(suffix_list[2])])
    else:
        final_list.append(word)
final_list = [' '.join(final_list)]
print(final_list)
If you mean to include only the first 8 characters of each word, you can do this with a list comprehension over final_list (applied before the words are joined) like so:
final_list = [word[:8] for word in final_list]
This removes suffixes and limits each resulting word to limit characters:
limit = 8
test_list = ['Running the boy gladly jumped continuously']
test_words = test_list[0].split()
suffix_list = ['ed', 'ly', 'ing']
final_list = []
for word in test_words:
    for suffix in suffix_list:
        if word.endswith(suffix):
            final_list.append(word[:-len(suffix)][:limit])
            break
    else:
        # for/else: runs only when no suffix matched (the loop did not break)
        final_list.append(word[:limit])
print(' '.join(final_list))
Prints:
Runn the boy glad jump continuo
You could use slicing to get the first 8 words (slice the list of words before joining it):
final_list = [' '.join(final_list[:8])]
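Since the answers above differ on whether the limit applies to characters or to words, here is a short sketch of both kinds of slicing. The variable names are illustrative only.

```python
words = "Running the boy gladly jumped continuously".split()

# Slicing the list limits the number of words
first_two_words = words[:2]

# Slicing each string limits the number of characters per word
truncated = [word[:8] for word in words]

print(first_two_words)  # ['Running', 'the']
print(truncated)        # ['Running', 'the', 'boy', 'gladly', 'jumped', 'continuo']
```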
Complete beginner here. I am trying to remove words from a list that start with one of the letters in another list.
So if
list_of_words = ['apple', 'apricot', 'banana', 'blueberry', 'cherry']
list_of_letters = ['a', 'c']
Then I'd like my final list to be
['banana', 'blueberry']
In case it's relevant, my actual lists are a lot longer than these two examples.
What should I do?
Thank you!
Edit:
Yes, I did try to code it, but I didn't include it in the initial post since I was sure it was completely off-base.
But what I tried was
def first_letter_program(first_letter, word_list):
    if first_letter == word_list[0]:
        return new_word_list.remove(word_list)
first_letter_program(list_of_letters, word_list)
Maybe I should've included it in the first place to show that I did at least put some effort into this.
You could try this:
result = []
for word in list_of_words:
    if word[0] not in list_of_letters:
        result.append(word)
Or wrapped in a function:
def first_letter_program(first_letter, word_list):
    result = []
    for word in word_list:  # loop over each word in word_list
        if word[0] not in first_letter:  # keep the word if its first letter is not excluded
            result.append(word)
    return result
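A more compact variant, offered as a sketch rather than the one right way: str.startswith accepts a tuple of prefixes, so the filter fits into a list comprehension. The name prefixes is introduced here for illustration.

```python
list_of_words = ['apple', 'apricot', 'banana', 'blueberry', 'cherry']
list_of_letters = ['a', 'c']

# str.startswith accepts a tuple, so one call checks all letters
prefixes = tuple(list_of_letters)
result = [word for word in list_of_words if not word.startswith(prefixes)]

print(result)  # ['banana', 'blueberry']
```

Unlike word[0], startswith does not raise an IndexError on an empty string, which may matter for longer real-world lists.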
I would like to turn each word in a string in a list into elements in a list
Example:
Input: l = ["the train was late", "I looked for mary and sam at the bus station"]
Output: l = ["the","train","was","late","I","looked","for","mary","and","sam","at","the","bus","station"]
I tried doing this:
l = []
for word in data_txt_file:
    s = word.split(" ")
    l.append(s)
But then I get:
Output: l = [["the","train","was","late",],["I","looked","for","mary","and","sam","at","the","bus","station"]]
Maybe from here I can just remove the nested lists and flatten the result. However, can I get there directly, without this intermediate splitting step?
The simplest way to keep your current code is to use the extend method that lists provide:
l = []
for word in data_txt_file:
    s = word.split(" ")
    l.extend(s)
The easy solution would be to transform the first list to a single string, so you can just tokenize it as a whole:
l = ' '.join(data_txt_file).split(' ')
Or, you could just flatten the nested list you got in your solution:
l2 = []
for e in l:
    l2 += e
This would also result in a list that just has the words as elements.
Or, if you really want to make this word by word as in your first solution:
l = []
for line in data_txt_file:
    for word in line.split(' '):
        l.append(word)
This can be done in a single line of code:
l = ",".join(l).replace(",", " ").split(" ")
Output: ['the', 'train', 'was', 'late', 'I', 'looked', 'for', 'mary', 'and', 'sam', 'at', 'the', 'bus', 'station']
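Another option, shown here only as a sketch, is itertools.chain.from_iterable from the standard library, which flattens the per-line word lists without an intermediate nested list. The sample data below is taken from the question.

```python
from itertools import chain

data_txt_file = ["the train was late",
                 "I looked for mary and sam at the bus station"]

# split() without arguments also copes with repeated spaces
l = list(chain.from_iterable(line.split() for line in data_txt_file))

print(l)
```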
I need to concatenate certain words that appear separated in a list of words, such as "computer" (below). These words appear separated in the list due to line breaks and I want to fix this.
lst=['love','friend', 'apple', 'com', 'puter']
the expected result is:
lst=['love','friend', 'apple', 'computer']
My code doesn't work. Can anyone help me to do that?
the code I am trying is:
from collections import defaultdict
import enchant
import string
words=['love', 'friend', 'car', 'apple',
       'com', 'puter', 'vi']
myit = iter(words)
dic=enchant.Dict('en_UK')
lst=[]
errors=[]
for i in words:
    if dic.check(i) is True:
        lst.append(i)
    if dic.check(i) is False:
        a= i + next(myit)
        if dic.check(a) is True:
            lst.append(a)
        else:
            continue
print (lst)
Notwithstanding the fact that this method is not very robust (you would miss "ham-burger", for example), the main error was that you looped over the list itself instead of the iterator. Here is a corrected version.
Note that I renamed the variables to give them more expressive names, and I replaced the dictionary check with a simple word in dic test against a sample vocabulary. The module you import is not part of the standard library, which makes your code as-is difficult to run for those of us who don't have it.
dic = {'love', 'friend', 'car', 'apple',
       'computer', 'banana'}
words=['love', 'friend', 'car', 'apple', 'com', 'puter', 'vi']
words_it = iter(words)
valid_words = []
for word in words_it:
    if word in dic:
        valid_words.append(word)
    else:
        try:
            # Consume the next fragment from the same iterator
            concatenated = word + next(words_it)
            if concatenated in dic:
                valid_words.append(concatenated)
        except StopIteration:
            pass
print(valid_words)
# ['love', 'friend', 'car', 'apple', 'computer']
You need the try ... except part in case the last word of the list is not in the dictionary, as next() will raise a StopIteration in this case.
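As a side note, the built-in next() also accepts a default value as its second argument, which is returned instead of raising StopIteration when the iterator is exhausted. A minimal sketch of that behaviour (the variable names are illustrative):

```python
words_it = iter(["com"])

first = next(words_it)            # 'com'
following = next(words_it, None)  # iterator is exhausted, so the default is returned

print(first, following)  # com None
```

With a default, the try ... except block can be replaced by a plain check such as if following is not None.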
The main problem with your code is that you iterate over words in the for loop and, separately, through the iterator myit. These two iterations are independent, so you cannot use next(myit) within your loop to get the word after i (also, if i is the last word, there is no next word). Your problem can be further complicated by the fact that the parts of a split word may themselves be in the dictionary (e.g. printable is a word, but so are print and able).
Assuming a simple scenario where split word parts are never in the dictionary, I think this algorithm could work better for you:
import enchant
words = ['love', 'friend', 'car', 'apple', 'com', 'puter', 'vi']
dic = enchant.Dict('en_UK')
lst = []
# The word that you are currently considering
current = ''
for i in words:
    # Add the next word
    current += i
    # If the current word is in the dictionary
    if dic.check(current):
        # Add it to the list
        lst.append(current)
        # Clear the current word
        current = ''
    # If the word is not in the dictionary we keep adding words to current
print(lst)
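For readers without enchant installed, the same idea can be sketched with a plain set as the vocabulary and an explicit index for the lookahead. The function name merge_fragments and the sample vocabulary are assumptions for illustration. Unlike the versions above, this sketch keeps fragments it cannot resolve instead of dropping them.

```python
def merge_fragments(words, vocabulary):
    """Join a fragment with its successor when the pair forms a known word.

    Hypothetical helper: vocabulary is any container supporting `in`.
    """
    merged = []
    i = 0
    while i < len(words):
        word = words[i]
        if word in vocabulary:
            merged.append(word)
            i += 1
        elif i + 1 < len(words) and word + words[i + 1] in vocabulary:
            # The two fragments form a known word: consume both
            merged.append(word + words[i + 1])
            i += 2
        else:
            # Unknown fragment with no matching partner: keep it as-is
            merged.append(word)
            i += 1
    return merged


vocabulary = {'love', 'friend', 'car', 'apple', 'computer', 'banana'}
words = ['love', 'friend', 'car', 'apple', 'com', 'puter', 'vi']
print(merge_fragments(words, vocabulary))
# ['love', 'friend', 'car', 'apple', 'computer', 'vi']
```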
I want to edit my text like this:
arr = []
# arr is full of tokenized words from my text
For example:
"Abraham Lincoln Hotel is very beautiful place and i want to go there with
Barbara Palvin. Also there are stores like Adidas ,Nike , Reebok."
Edit: Basically, I want to detect proper names and group them by using istitle() and isalpha() in a for statement like:
for i in range(len(arr)):
    if arr[i].istitle() and arr[i].isalpha():
        ...
In the example, consecutive words are merged until the next word does not start with an upper-case letter.
arr[0] + arr[1] + arr[2] = arr[0]
#Abraham Lincoln Hotel
This is what I want for my new arr:
['Abraham Lincoln Hotel'] is very beautiful place and i want to go there with ['Barbara Palvin']. ['Also'] there are stores like ['Adidas'], ['Nike'], ['Reebok'].
"Also" is not a problem for me; it will be useful when I try to match against my dataset.
You could do something like this:
sentence = "Abraham Lincoln Hotel is very beautiful place and i want to go there with Barbara Palvin. Also there are stores like Adidas, Nike, Reebok."
all_words = sentence.split()
last_word_index = -100
proper_nouns = []
for idx, word in enumerate(all_words):
    if word.istitle() and word.isalpha():
        if last_word_index == idx - 1:
            proper_nouns[-1] = proper_nouns[-1] + " " + word
        else:
            proper_nouns.append(word)
        last_word_index = idx
print(proper_nouns)
This code will:
1. Split the sentence into a list of words.
2. Iterate over all of the words and, for each capitalized, alphabetic word:
   - if the last capitalized word was the previous word, append it to the last entry in the list;
   - else store the word as a new entry in the list;
   - record the index at which the capitalized word was found.
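The same grouping of consecutive capitalized words can also be sketched with itertools.groupby from the standard library. The function name group_proper_nouns and the shortened sample sentence are assumptions for illustration. As in the loop version, a word with punctuation attached (such as "Palvin.") fails the isalpha() test and is not treated as a name.

```python
from itertools import groupby


def group_proper_nouns(sentence):
    """Join runs of consecutive title-cased, alphabetic words (hypothetical helper)."""
    def is_name(word):
        return word.istitle() and word.isalpha()

    words = sentence.split()
    # groupby yields one group per run of equal key values; keep only the "name" runs
    return [" ".join(group) for key, group in groupby(words, key=is_name) if key]


print(group_proper_nouns("Abraham Lincoln Hotel is near Barbara Palvin"))
# ['Abraham Lincoln Hotel', 'Barbara Palvin']
```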
Is this what you are asking?
sentence = "Abraham Lincoln Hotel is very beautiful place and i want to go there with Barbara Palvin. Also there are stores like Adidas ,Nike , Reebok."
chars = ".!?,"  # Characters you want to remove from the words in the array
table = str.maketrans(chars, " " * len(chars))  # Create a table mapping each character to a space
sentence = sentence.translate(table)  # Replace the characters with spaces
arr = sentence.split()  # Split the string into a list wherever whitespace occurs
print(arr)
The output is:
['Abraham',
'Lincoln',
'Hotel',
'is',
'very',
'beautiful',
'place',
'and',
'i',
'want',
'to',
'go',
'there',
'with',
'Barbara',
'Palvin',
'Also',
'there',
'are',
'stores',
'like',
'Adidas',
'Nike',
'Reebok']
Note about this code: any character listed in the chars variable will be removed from the strings in the array. The explanation is in the code comments.
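The translate step can be tried in isolation. The following sketch wraps it in a function; the name tokenize and its default argument are assumptions for illustration.

```python
def tokenize(sentence, chars=".!?,"):
    """Replace each character in chars with a space, then split on whitespace."""
    # str.maketrans with two equal-length strings maps characters one-to-one
    table = str.maketrans(chars, " " * len(chars))
    return sentence.translate(table).split()


print(tokenize("Adidas ,Nike , Reebok."))
# ['Adidas', 'Nike', 'Reebok']
```

Replacing with spaces rather than deleting keeps words separated even when the punctuation is not followed by a space, as in ",Nike".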
To remove the non-names just do this:
import string
new_arr = []
for i in arr:
    if i[0] in string.ascii_uppercase:
        new_arr.append(i)
This code will include ALL words that start with a capital letter.
To fix that you will need to change chars to:
chars = ","
And change the above code to:
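import string
new_arr = []
end = ".!?"
for idx, word in enumerate(arr):
    # A word starts a sentence if the previous word ends with ., ! or ?
    starts_sentence = idx > 0 and arr[idx - 1][-1] in end
    if word[0] in string.ascii_uppercase and not starts_sentence:
        new_arr.append(word)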
And that will output:
['Abraham',
'Lincoln',
'Hotel',
'Barbara',
'Palvin.',
'Adidas',
'Nike',
'Reebok.']