I want to edit my text like this:
arr = []
# arr is full of tokenized words from my text
For example:
"Abraham Lincoln Hotel is very beautiful place and i want to go there with
Barbara Palvin. Also there are stores like Adidas ,Nike , Reebok."
Edit: Basically I want to detect proper names and group them by using istitle() and isalpha() in a for statement like:
for word in arr:
    if word.istitle() and word.isalpha():
In the example, words keep being appended to the same entry until the next word no longer starts with an uppercase letter.
arr[0] + arr[1] + arr[2] = arr[0]
#Abraham Lincoln Hotel
This is what I want with my new arr:
['Abraham Lincoln Hotel'] is very beautiful place and i want to go there with ['Barbara Palvin']. ['Also'] there are stores like ['Adidas'], ['Nike'], ['Reebok'].
"Also" is not problem for me it will be usefull when i try to match with my dataset.
You could do something like this:
sentence = "Abraham Lincoln Hotel is very beautiful place and i want to go there with Barbara Palvin. Also there are stores like Adidas, Nike, Reebok."
all_words = sentence.split()
last_word_index = -100
proper_nouns = []
for idx, word in enumerate(all_words):
if(word.istitle() and word.isalpha()):
if(last_word_index == idx-1):
proper_nouns[-1] = proper_nouns[-1] + " " + word
else:
proper_nouns.append(word)
last_word_index = idx
print(proper_nouns)
This code will:
- split the sentence into a list of words
- iterate over the words, and for each capitalized word:
  - if the last capitalized word was the previous word, append it to the last entry in the list
  - else store the word as a new entry in the list
- record the last index at which a capitalized word was found
Is this what you are asking?
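If regular expressions are an option, the same grouping can be done in one pass. This is only a sketch (not from the answers above), assuming a "capitalized word" means an uppercase letter followed by lowercase letters:

```python
import re

sentence = ("Abraham Lincoln Hotel is very beautiful place and i want to go "
            "there with Barbara Palvin. Also there are stores like Adidas, Nike, Reebok.")

# One capitalized word, optionally followed by more capitalized words
# separated by single spaces; punctuation ends a group automatically.
proper_nouns = re.findall(r"[A-Z][a-z]*(?: [A-Z][a-z]*)*", sentence)
print(proper_nouns)
# ['Abraham Lincoln Hotel', 'Barbara Palvin', 'Also', 'Adidas', 'Nike', 'Reebok']
```

Because the pattern only allows a single space between words, "Palvin. Also" is split into two groups by the period without any explicit punctuation handling.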
sentence = "Abraham Lincoln Hotel is very beautiful place and i want to go there with Barbara Palvin. Also there are stores like Adidas ,Nike , Reebok."
chars = ".!?," # Characters you want to remove from the words in the array
table = chars.maketrans(chars, " " * len(chars)) # Create a table for replacing characters
sentence = sentence.translate(table) # Replace characters with spaces
arr = sentence.split() # Split the string into an array whereever a space occurs
print(arr)
The output is:
['Abraham',
'Lincoln',
'Hotel',
'is',
'very',
'beautiful',
'place',
'and',
'i',
'want',
'to',
'go',
'there',
'with',
'Barbara',
'Palvin',
'Also',
'there',
'are',
'stores',
'like',
'Adidas',
'Nike',
'Reebok']
Note about this code: any character that is in the chars variable will be removed from the strings in the array. The explanation is in the code comments.
To remove the non-names just do this:
import string

new_arr = []
for i in arr:
    if i[0] in string.ascii_uppercase:
        new_arr.append(i)
This code will include ALL words that start with a capital letter.
To fix that you will need to change chars to:
chars = ","
And change the above code to:
import string

new_arr = []
end = ".!?"
for idx, word in enumerate(arr):
    # Keep a capitalized word only if the previous word didn't end a sentence.
    if word[0] in string.ascii_uppercase and (idx == 0 or arr[idx - 1][-1] not in end):
        new_arr.append(word)
And that will output:
['Abraham',
'Lincoln',
'Hotel',
'Barbara',
'Palvin.',
'Adidas',
'Nike',
'Reebok.']
I have a list of strings such as
words = ['Twinkle Twinkle', 'How I wonder']
I am trying to create a function that will find and replace words in the original list. I was able to do that, except for when the user inputs single-letter words such as 'I' or 'a'.
current function
def sub(old: string, new: string, words: list):
    words[:] = [w.replace(old, new) for w in words]
With old = 'I' and new = 'ASD':
current output = ['TwASDnkle TwASDnkle', 'How ASD wonder']
intended output = ['Twinkle Twinkle', 'How ASD wonder']
This is my first post here and I have only been learning Python for a few months now, so I would appreciate any help. Thank you.
Don't use str.replace in a loop. It often doesn't do what is expected, because it replaces every substring match rather than whole words.
Instead, split the words, replace on match and join:
l = ['Twinkle Twinkle', 'How I wonder']

def sub(old: str, new: str, words: list):
    words[:] = [' '.join(new if w == old else w for w in x.split()) for x in words]

sub('I', 'ASD', l)
Output: ['Twinkle Twinkle', 'How ASD wonder']
Or use a regex with word boundaries:
import re

def sub(old, new, words):
    words[:] = [re.sub(fr'\b{re.escape(old)}\b', new, w) for w in words]

l = ['Twinkle Twinkle', 'How I wonder']
sub('I', 'ASD', l)
# ['Twinkle Twinkle', 'How ASD wonder']
NB. As @re-za pointed out, it might be better practice to return a new list rather than mutating the input; just be aware of it.
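Following that advice, a non-mutating sketch of the same word-level replacement:

```python
def sub(old: str, new: str, words: list) -> list:
    # Build and return a new list; the argument is left untouched.
    return [' '.join(new if w == old else w for w in x.split()) for x in words]

l = ['Twinkle Twinkle', 'How I wonder']
print(sub('I', 'ASD', l))  # ['Twinkle Twinkle', 'How ASD wonder']
print(l)                   # ['Twinkle Twinkle', 'How I wonder'] -- unchanged
```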
It seems like you are replacing letters, not words. I recommend splitting each sentence (string) into words on the ' ' (space) character.
output = []
I would first get each string from the list like this:
for string in words:
I would then split the strings into a list of words like this:
temp_string = '' # a temp string we will use later to reconstruct the words
for word in string.split(' '):
Then I would check to see if the word is the one we are looking for by comparing it to old, and replacing (if it matches) with new:
if word == old:
    temp_string += new + ' '
else:
    temp_string += word + ' '
Now that we have each word reconstructed or replaced (if needed) back into a temp_string we can put all the temp_strings back into the array like this:
output.append(temp_string[:-1]) # [:-1] means we omit the space at the end
It should finally look like this:
def sub(old: str, new: str, words: list):
    output = []
    for string in words:
        temp_string = ''  # a temp string we will use later to reconstruct the words
        for word in string.split(' '):
            if word == old:
                temp_string += new + ' '
            else:
                temp_string += word + ' '
        output.append(temp_string[:-1])  # [:-1] omits the trailing space
    return output
I am trying to remove suffixes but for some reason, it is not working.
Code is:
# Stemming (this block is inside my textPreprocessing function)
suffix_list = ['-ed', '-ing', '-s']
for word in range(len(output)):
    # range returns the sequence, len checks the length
    for suffix in range(len(suffix_list)):
        # .endswith checks whether the word ends with the suffix
        if output[word].endswith(suffix_list[suffix]):
            # .removesuffix removes it if it is in suffix_list
            print(output[word].removesuffix(suffix_list[suffix]))
return output
print(textPreprocessing("I'm gathering herbs."))
print(textPreprocessing("When life gives you lemons, make lemonade"))
Outcome is:
gather
herb
['im', 'gathering', 'herbs']
give
lemon
['life', 'gives', 'you', 'lemons', 'make', 'lemonade']
Where it should be:
['im', 'gather', 'herbs']
['life', 'give', 'you', 'lemon', 'make', 'lemonade']
Any help?
I feel like I am missing something obvious...
You are printing the stemmed word but never storing it back into the list. Assign it instead:
output[word] = output[word].removesuffix(suffix_list[suffix])
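Put together, a minimal sketch of the corrected loop (assuming the suffixes are meant literally, so the leading '-' markers are dropped, and using a small example list since the full textPreprocessing function isn't shown; str.removesuffix needs Python 3.9+):

```python
suffix_list = ['ed', 'ing', 's']       # no '-' markers: endswith matches literally
output = ['im', 'gathering', 'herbs']  # assumed tokenized input

for i in range(len(output)):
    for suffix in suffix_list:
        if output[i].endswith(suffix):
            # Assign the stripped word back instead of only printing it.
            output[i] = output[i].removesuffix(suffix)

print(output)  # ['im', 'gather', 'herb']
```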
This will work (slicing off the suffix rather than using str.replace, which would remove every occurrence of the substring, not just the ending):
str_sample = "When life gives you lemons make lemonade"
suffixes, words = ["s"], str_sample.split()
for i in range(len(suffixes)):
    for j in range(len(words)):
        if words[j].endswith(suffixes[i]):
            words[j] = words[j][:-len(suffixes[i])]  # slice off just the suffix
print(words)
It can be compressed to a list comprehension:
str_sample = "When life gives you lemons make lemonade"
suffixes, words = ["s"], str_sample.split()
res = [w[:-len(s)] if w.endswith(s) else w for w in words for s in suffixes]
print(res)
Suppose you have a string:
text = "coding in python is a lot of fun"
And character positions:
positions = [(0,6),(10,16),(29,32)]
These are intervals, which cover certain words within text, i.e. coding, python and fun, respectively.
Using the character positions, how could you split the text on those words, to get this output:
['coding','in','python','is a lot of','fun']
This is just an example, but it should work for any string and any list of character positions.
I'm not looking for this:
[text[i:j] for i,j in positions]
I'd flatten positions to be [0,6,10,16,29,32] and then do something like
positions.append(len(text))
prev_positions = [0] + positions

words = []
for begin, end in zip(prev_positions, positions):
    words.append(text[begin:end])
This exact code produces ['', 'coding', ' in ', 'python', ' is a lot of ', 'fun', ''], so it needs some additional work to strip the whitespace
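Completing that idea, a sketch that flattens the intervals into cut points, slices between consecutive points, and strips the whitespace and empty pieces:

```python
text = "coding in python is a lot of fun"
positions = [(0, 6), (10, 16), (29, 32)]

# Flatten the interval ends into one list of cut points.
flat = [i for pair in positions for i in pair]
cuts = [0] + flat + [len(text)]

# Slice between consecutive cut points, strip whitespace, drop empty pieces.
words = [text[a:b].strip() for a, b in zip(cuts, cuts[1:]) if text[a:b].strip()]
print(words)  # ['coding', 'in', 'python', 'is a lot of', 'fun']
```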
The code below works as expected:
text = "coding in python is a lot of fun"
positions = [(0,6),(10,16),(29,32)]
textList = []
lastIndex = 0
for i, (start, end) in enumerate(positions):
    if i > 0:
        textList.append(text[lastIndex:start])  # the gap between intervals
    textList.append(text[start:end])
    lastIndex = end + 1
print(textList)
Output: ['coding', 'in ', 'python', 'is a lot of ', 'fun']
Note: if the trailing spaces are not needed, you can strip them.
This question already has answers here:
extract each word from list of strings
(7 answers)
Closed 1 year ago.
I would like to turn each word of each string in a list into an element of a single flat list.
Example:
Input: l = ["the train was late", "I looked for mary and sam at the bus station"]
Output: l = ["the","train","was","late","I","looked","for","mary","and","sam","at","the","bus","station"]
I tried doing this:
l = []
for word in data_txt_file:
    s = word.split(" ")
    l.append(s)
But then I get:
Output: l = [["the","train","was","late"],["I","looked","for","mary","and","sam","at","the","bus","station"]]
From here I could remove the nesting and flatten it, but can I get to the flat list immediately, instead of going through this middle step of splitting?
The simplest way to go with the current code is to use the extend method provided by lists:
l = []
for word in data_txt_file:
    s = word.split(" ")
    l.extend(s)
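The difference in one glance: append adds its argument as a single element, while extend adds the argument's items one by one.

```python
l = []
l.append(['a', 'b'])
print(l)  # [['a', 'b']] -- the whole list became one nested element

l = []
l.extend(['a', 'b'])
print(l)  # ['a', 'b'] -- the items were added individually
```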
The easy solution would be to transform the first list to a single string, so you can just tokenize it as a whole:
l = ' '.join(data_txt_file).split(' ')
Or, you could just flatten the nested list you got in your solution:
l2 = []
for e in l:
    l2 += e
This would also result in a list that just has the words as elements.
Or, if you really want to make this word by word as in your first solution:
l = []
for line in data_txt_file:
    for word in line.split(' '):
        l.append(word)
This can be done in a single line of code:
l = ",".join(l).replace(",", " ").split(" ")
Output: ['the', 'train', 'was', 'late', 'I', 'looked', 'for', 'mary', 'and', 'sam', 'at', 'the', 'bus', 'station']
I have a very large list of strings like this:
list_strings = ['storm', 'squall', 'overcloud',...,'cloud_up', 'cloud_over', 'plague', 'blight', 'fog_up', 'haze']
and a very large list of lists like this:
lis_of_lis = [['the storm was good blight'], ['this is overcloud'],...,['there was a plague stormicide']]
How can I return a list counting, for each sub-list of lis_of_lis, how many of the words in list_strings appear in it? For the above example the desired output is: [2,1,1]
For example:
['storm', 'squall', 'overcloud',...,'cloud_up', 'cloud_over', 'plague', 'blight', 'fog_up', 'haze']
['the storm was good blight']
The count is 2, since storm and blight appear in the first sublist of lis_of_lis.
['storm', 'squall', 'overcloud',...,'cloud_up', 'cloud_over', 'plague', 'blight', 'fog_up', 'haze']
['this is overcloud stormicide']
The count is 1, since only overcloud appears in the second sublist; stormicide does not appear in list_strings.
['storm', 'squall', 'overcloud',...,'cloud_up', 'cloud_over', 'plague', 'blight', 'fog_up', 'haze']
['there was a plague']
The count is 1, since plague appears in the third sublist.
Hence the desired output is [2,1,1].
The problem with all the answers so far is that they count substring matches inside words instead of whole words.
You can use the sum function within a list comprehension:
[sum(1 for i in list_strings if i in sub[0]) for sub in lis_of_lis]
result = []
for sentence in lis_of_lis:
    result.append(0)
    for word in list_strings:
        if word in sentence[0]:
            result[-1] += 1
print(result)
which is the long version of
result = [sum(1 for word in list_strings if word in sentence[0]) for sentence in lis_of_lis]
This will return [2,2,1] for your example.
If you want only whole words, add spaces before and after the words / sentences:
result = []
for sentence in lis_of_lis:
    result.append(0)
    for word in list_strings:
        if ' ' + word + ' ' in ' ' + sentence[0] + ' ':
            result[-1] += 1
print(result)
or short version:
result = [sum(1 for word in list_strings if ' '+word+' ' in ' '+sentence[0]+' ') for sentence in lis_of_lis]
This will return [2,1,1] for your example.
This creates a dictionary with the words in list_strings as keys and values starting at 0. It then iterates through lis_of_lis, splits each phrase into a list of words, iterates through that, and checks whether each word is in the dictionary. If it is, 1 is added to the corresponding value.
word_count = dict()
for word in list_strings:
    word_count[word] = 0

for phrase in lis_of_lis:
    words_in_phrase = phrase[0].split()  # each phrase is a one-element list
    for word in words_in_phrase:
        if word in word_count:
            word_count[word] += 1
This will create a dictionary with the words as keys, and the frequency as values. I'll leave it to you to get the correct output out of that data structure.
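For per-sentence counts that match only whole words, splitting each sentence and testing membership in a set avoids the substring problem entirely. A sketch using the example data (with the elided '...' entries omitted):

```python
list_strings = ['storm', 'squall', 'overcloud', 'cloud_up', 'cloud_over',
                'plague', 'blight', 'fog_up', 'haze']
lis_of_lis = [['the storm was good blight'],
              ['this is overcloud stormicide'],
              ['there was a plague']]

vocab = set(list_strings)  # set membership is O(1) per lookup

# Count, per sentence, how many whole words are in the vocabulary.
counts = [sum(w in vocab for w in sub[0].split()) for sub in lis_of_lis]
print(counts)  # [2, 1, 1]
```

Because 'stormicide' is compared as a whole token, it never matches 'storm', which is exactly the behavior the question asks for.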