Simplest way to convert char offsets to word offsets - python

I have a python string and a substring of selected text. The string for example could be
stringy = "the bee buzzed loudly"
I want to select the text "bee buzzed" within this string. I have the character offsets, i.e. 4-14 for this particular string, because those are the character-level indices that the selected text lies between.
What is the simplest way to convert these to word-level indices, i.e. 1-2, because the second and third words are being selected? I have many strings that are labeled like this and I would like to convert the indices simply and efficiently. The data is currently stored in a dictionary like so:
data = {"string":"the bee buzzed loudly","start_char":4,"end_char":14}
I would like to convert it to this form
data = {"string":"the bee buzzed loudly","start_word":1,"end_word":2}
Thank you!

It seems like a tokenisation problem.
My solution would be to use a span tokenizer and then look for the token spans that fall inside your substring's span.
So, using the nltk library:
import nltk
tokenizer = nltk.tokenize.TreebankWordTokenizer()
# or tokenizer = nltk.tokenize.WhitespaceTokenizer()
stringy = 'the bee buzzed loudly'
sub_b, sub_e = 4, 14 # substring begin and end
[i for i, (b, e) in enumerate(tokenizer.span_tokenize(stringy))
 if b >= sub_b and e <= sub_e]
But this is kind of intricate.
tokenizer.span_tokenize(stringy) returns spans for each token/word it identified.
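If pulling in nltk feels heavy for this, the same span idea can be sketched with only the standard library. The version below is a minimal sketch that assumes words are whitespace-separated; the function name char_span_to_word_span is made up for illustration and is not part of nltk or of the question.

```python
import re

def char_span_to_word_span(text, start_char, end_char):
    """Return (start_word, end_word) for words lying fully inside [start_char, end_char)."""
    indices = [
        i for i, m in enumerate(re.finditer(r"\S+", text))  # one match per whitespace-separated word
        if m.start() >= start_char and m.end() <= end_char
    ]
    if not indices:
        return None  # no whole word falls inside the character span
    return indices[0], indices[-1]

data = {"string": "the bee buzzed loudly", "start_char": 4, "end_char": 14}
start_word, end_word = char_span_to_word_span(
    data["string"], data["start_char"], data["end_char"]
)
converted = {"string": data["string"], "start_word": start_word, "end_word": end_word}
print(converted)
```

Because it works on match positions rather than on the word text, repeated words in the string should not confuse it.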

Here's a simple list index approach:
# set up data
string = "the bee buzzed loudly"
words = string[4:14].split(" ") #get words from string using the character indices
stringLst = string.split(" ") #split string into words
dictionary = {"string":"", "start_word":0,"end_word":0}
#process
dictionary["string"] = string
dictionary["start_word"] = stringLst.index(words[0]) #index of the first word in words
dictionary["end_word"] = stringLst.index(words[-1]) #index of the last
print(dictionary)
{'string': 'the bee buzzed loudly', 'start_word': 1, 'end_word': 2}
Take note that this assumes the words appear in order inside the string and that the selected words do not occur earlier in it, since list.index() returns the first match.
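When the same word can occur more than once, counting the words that precede each offset avoids the lookup by word text. This is only a sketch: it assumes the selection starts at the beginning of a word and ends at the end of a word, and char_to_word_offsets is a hypothetical helper name.

```python
def char_to_word_offsets(text, start_char, end_char):
    """Convert character offsets to word offsets by counting the words before each offset."""
    start_word = len(text[:start_char].split())    # words that end before the selection starts
    end_word = len(text[:end_char].split()) - 1    # index of the last word inside the selection
    return start_word, end_word

first = char_to_word_offsets("the bee buzzed loudly", 4, 14)
# The second "the bee" is selected here; a lookup by word text would find the first one.
repeated = char_to_word_offsets("the bee saw the bee", 12, 19)
print(first, repeated)
```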

Try this code:
def char_change(dic, start_char, end_char, *arg):
    dic[arg[0]] = start_char
    dic[arg[1]] = end_char
data = {"string":"the bee buzzed loudly","start_char":4,"end_char":14}
start_char = int(input("Please enter your start character: "))
end_char = int(input("Please enter your end character: "))
char_change(data, start_char, end_char, "start_char", "end_char")
print(data)
Default Dictionary:
data = {"string":"the bee buzzed loudly","start_char":4,"end_char":14}
INPUT
Please enter your start character: 1
Please enter your end character: 2
OUTPUT Dictionary:
{'string': 'the bee buzzed loudly', 'start_char': 1, 'end_char': 2}

Related

Replace string in list then join list to form new string

I have a project where I need to do the following:
User inputs a sentence
intersect sentence with list for matching strings
replace one of the matching strings with a new string
print the original sentence featuring the replacement
fruits = ['Quince', 'Raisins', 'Raspberries', 'Rhubarb', 'Strawberries', 'Tangelo', 'Tangerines']
# Asks the user for a sentence.
random_sentence = str(input('Please enter a random sentence:\n')).title()
stripped_sentence = random_sentence.strip(',.!?')
split_sentence = stripped_sentence.split()
# Solve for single word fruit names
sentence_intersection = set(fruits).intersection(split_sentence)
# Finds and replaces at least one instance of a fruit in the sentence with “Brussels Sprouts”.
intersection_as_list = list(sentence_intersection)
intersection_as_list[-1] = 'Brussels Sprouts'
Example Input: "I would like some raisins and strawberries."
Expected Output: "I would like some raisins and Brussels Sprouts."
But I can't figure out how to join the string back together after making the replacement. Any help is appreciated!
You can do it with a regex:
(?i)Quince|Raisins|Raspberries|Rhubarb|Strawberries|Tangelo|Tangerines
This pattern will match any of your words in a case insensitive way (?i).
In Python, you can obtain that pattern by joining your fruits into a single string. Then you can use the re.sub function to replace your first matching word with "Brussels Sprouts".
import re
fruits = ['Quince', 'Raisins', 'Raspberries', 'Rhubarb', 'Strawberries', 'Tangelo', 'Tangerines']
# Asks the user for a sentence.
#random_sentence = str(input('Please enter a random sentence:\n')).title()
sentence = "I would like some raisins and strawberries."
pattern = '(?i)' + '|'.join(fruits)
replacement = 'Brussels Sprouts'
print(re.sub(pattern, replacement, sentence, count=1))
Output:
I would like some Brussels Sprouts and strawberries.
Create a set of lowercase possible word matches, then use a replacement function.
If a word is found, clear the set, so replacement works only once.
import re
fruits = ['Quince', 'Raisins', 'Raspberries', 'Rhubarb', 'Strawberries', 'Tangelo', 'Tangerines']
fruit_set = {x.lower() for x in fruits}
s = "I would like some raisins and strawberries."
def repfunc(m):
    w = m.group(1)
    if w.lower() in fruit_set:
        fruit_set.clear()
        return "Brussels Sprouts"
    else:
        return w
print(re.sub(r"(\w+)",repfunc,s))
prints:
I would like some Brussels Sprouts and strawberries.
That method has the advantage of being O(1) on lookup. If there are a lot of possible words it will beat the linear search that | performs when testing word after word.
It's simpler to replace just the first occurrence, but replacing the last occurrence, or a random occurrence is also doable. First you have to count how many fruits are in the sentence, then decide which replacement is effective in a second pass.
like this: (not very beautiful, using a lot of globals and all)
total = 0
def countfunc(m):
    global total
    w = m.group(1)
    if w.lower() in fruit_set:
        total += 1
    return w  # this pass only counts, so hand the word back unchanged
idx = 0
def repfunc(m):
    global idx
    w = m.group(1)
    if w.lower() in fruit_set:
        if total == idx+1:
            return "Brussels Sprouts"
        else:
            idx += 1
            return w
    else:
        return w
re.sub(r"(\w+)",countfunc,s)
print(re.sub(r"(\w+)",repfunc,s))
The first sub just counts how many fruits would match, then the second function replaces only when the counter matches. Here the last occurrence is selected.
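The last-occurrence case can also be sketched without globals by collecting the matches first and slicing around the final one. This swaps the two-pass callback for re.finditer plus slicing; replace_last_match is a name invented for this illustration.

```python
import re

def replace_last_match(sentence, words, replacement):
    """Replace only the last case-insensitive occurrence of any word in `words`."""
    pattern = re.compile(
        r"\b(?:" + "|".join(map(re.escape, words)) + r")\b", re.IGNORECASE
    )
    matches = list(pattern.finditer(sentence))
    if not matches:
        return sentence  # nothing to replace
    last = matches[-1]
    return sentence[:last.start()] + replacement + sentence[last.end():]

fruits = ['Quince', 'Raisins', 'Raspberries', 'Rhubarb', 'Strawberries', 'Tangelo', 'Tangerines']
result = replace_last_match(
    "I would like some raisins and strawberries.", fruits, "Brussels Sprouts"
)
print(result)
```

This keeps the linear alternation of the first answer, so for a very large word list the set lookup above may still be the faster choice.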

Translate paragraph in python

I am trying to translate a paragraph from English to my local language, for which I have written this code:
import re
from googletrans import Translator
def translate(inputvalue):
    # inputvalue is an array of english paragraphs
    translatedData = []
    trans = Translator()
    for i in inputvalue:
        # adding a space where there is no space after . or ,
        sentence = re.sub(r'(?<=[.,])(?=[^\s])', r' ', i)
        # translating from english to my local language urdu
        t = trans.translate(sentence, src='en', dest='ur')
        # appending data in translatedData array
        translatedData.append(t.text)
    # finally calling DisplayOutput function to print translated data
    DisplayOutput.output(translatedData)
The problem I am facing is that my local language is written from right to left, and googletrans is not giving proper output. It puts periods, commas, and untranslated words at the beginning or at the end. For example:
I am 6 years old. I love to draw cartoons, animals, and plants. I do not have ADHD.
it would translate this sentence as:
میری عمر 6 سال ہے،. مجھے کارٹون جانور اور پودے کھینچنا پسند ہےمجھے ADHD 6نہیں ہے.
As you can observe, it could not translate ADHD because it is just an abbreviation, so it puts it at the beginning of the sentence, and the same goes for periods, numbers, and commas.
How should I translate it so that it does not get mixed up like that?
Perhaps by putting the sentence in another array like:
['I am', '6', 'years old', '.', 'I love to draw cartoons',',', 'animals',',', 'and plants','.', 'I do not have', 'ADHD','.']
I have no idea how to build this type of array, but I believe it can solve the problem, as I could translate only the parts that contain English words and then join the list back into a string.
Kindly help me generate this type of array, or suggest any other solution.
string = "I am 6 years old. I love to draw cartoons, animals, and plants. I do not have ADHD."
arr = []
substring = ""
alpha = None
for char in string:
    alpha = char.isalpha() or char == " "
    if substring.replace(" ","").isalpha():
        if alpha:
            substring += char
        else:
            arr.append(substring)
            substring = char
    else:
        if alpha:
            arr.append(substring)
            substring = char
        else:
            substring += char # keep runs of digits/punctuation together
arr.append(substring) # the last piece is still pending after the loop
arr = [s.strip() for s in arr if s.strip()] # drop empty pieces, trim spaces
print(arr)
Loop through each character in the string, then check if it is a letter or not a letter with ".isalpha()". Then depending on the conditions of the current substring, you append to it or create a new one.
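The same split can be sketched with a single regular expression. This is an alternative to the character loop, not the answerer's method, and it assumes plain ASCII letters and single spaces between words; the name TOKEN is made up here.

```python
import re

text = "I am 6 years old. I love to draw cartoons, animals, and plants. I do not have ADHD."

# Either a run of letter-only words joined by single spaces,
# or a run of characters that are neither letters nor whitespace.
TOKEN = re.compile(r"[A-Za-z]+(?: [A-Za-z]+)*|[^A-Za-z\s]+")

pieces = TOKEN.findall(text)
print(pieces)
```

Like the loop, this cannot tell an abbreviation such as ADHD apart from ordinary words, so it stays attached to "I do not have"; separating it would need an extra rule, for example treating all-uppercase words as their own pieces.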

Pull several substrings from an input using specific characters to find them

I need to make a user created madlib where the user would input a madlib for someone else to use. The input would be something like this:
The (^noun^) and the (^adj^) (^noun^)
I need to pull out anything between (^ and ^) so I can use that word in my code to show another input prompt for completing the madlib.
input('Enter "word in-between the characters":')
This is my code right now
madlib = input("Enter (^madlib^):")
a = "(^"
b = "^)"
start = madlib.find(a) + len(a)
end = madlib.find(b)
substring = madlib[start:end]
def mad():
    if "(^" in madlib:
        substring = madlib[start:end]
        m = input("Enter " + substring + ":")
        mad = madlib.replace(madlib[start:end],m)
        return mad
print(mad())
What am I missing?
You can use re.finditer() to do this fairly cleanly by collecting the .span() of each match!
import re
# collect starting madlib
madlib_base = input('Enter madlib base with (^...^) around words like (^adj^)): ')
# list to put the collected blocks of spans and user inputs into
replacements = []
# yield every block like (^something^) by matching each end and `not ^` inbetween
for match in re.finditer(r"\(\^([^\^]+)\^\)", madlib_base):
    replacements.append({
        "span": match.span(),  # position of the match in madlib_base
        "sub_str": input(f"enter a {match.group(1)}: "),  # replacement str
    })
# replacements mapping and madlib_base can be saved for later!
def iter_replace(base_str, replacements_mapping):
    # yield alternating blocks of text and replacement
    # skip the replacement span from the text when yielding
    base_index = 0  # index in base str to begin from
    for block in replacements_mapping:
        head, tail = block["span"]  # unpack span
        yield base_str[base_index:head]  # next value up to span
        yield block["sub_str"]  # string the user gave us
        base_index = tail  # start from the end of the span
# collect the iterable into a single result string
# this can be done at the same time as the earlier loop if the input is known
result = "".join(iter_replace(madlib_base, replacements))
Demonstration
...
enter a noun: Madlibs
enter a adj: rapidly
enter a noun: house
...
>>> result
'The Madlibs and the rapidly house'
>>> replacements
[{'span': (4, 12), 'sub_str': 'Madlibs'}, {'span': (21, 28), 'sub_str': 'rapidly'}, {'span': (29, 37), 'sub_str': 'house'}]
>>> madlib_base
'The (^noun^) and the (^adj^) (^noun^)'
Your mad() function only does one substitution, and it's only called once. For your sample input with three required substitutions, you'll only ever get the first noun. In addition, mad() depends on values that are initialized outside the function, so calling it multiple times won't work (it'll keep trying to operate on the same substring, etc).
To fix it, you need to make it so that mad() does one substitution on whatever text you give it, regardless of any other state outside of the function; then you need to call it until it's substituted all the words. You can make this easier by having mad return a flag indicating whether it found anything to substitute.
def mad(text):
    start = text.find("(^")
    end = text.find("^)")
    substring = text[start+2:end] if start > -1 and end > start else ""
    if substring:
        m = input(f"Enter {substring}: ")
        return text.replace(f"(^{substring}^)", m, 1), True
    return text, False
madlib, do_mad = input("Enter (^madlib^):"), True
while do_mad:
    madlib, do_mad = mad(madlib)
print(madlib)
Enter (^madlib^):The (^noun^) and the (^adj^) (^noun^)
Enter noun: cat
Enter adj: lazy
Enter noun: dog
The cat and the lazy dog
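Both answers can also be condensed into one re.sub call with a callback, which asks for each placeholder in order. This is a sketch rather than a drop-in replacement: fill_madlib and its ask parameter are hypothetical names, and ask defaults to input so a canned function can stand in for the user when trying it out.

```python
import re

PLACEHOLDER = re.compile(r"\(\^([^\^]+)\^\)")  # matches blocks like (^noun^)

def fill_madlib(template, ask=input):
    """Replace every (^word^) block with whatever `ask` returns for that word."""
    return PLACEHOLDER.sub(lambda m: ask(f"Enter {m.group(1)}: "), template)

# Canned answers stand in for interactive input in this demonstration.
answers = iter(["cat", "lazy", "dog"])
filled = fill_madlib(
    "The (^noun^) and the (^adj^) (^noun^)", ask=lambda prompt: next(answers)
)
print(filled)
```

Calling fill_madlib(template) with no second argument prompts the user interactively instead.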

Replace a word in a String by indexing without "string replace function" -python

Is there a way to replace a word within a string without using a "string replace function," e.g., string.replace(string,word,replacement)?
[out] = forecast('This snowy weather is so cold.','cold','awesome')
out => 'This snowy weather is so awesome.'
Here the word cold is replaced with awesome.
This is from my MATLAB homework, which I am trying to do in Python. When doing this in MATLAB we were not allowed to use strrep().
In MATLAB, I can use strfind to find the index and work from there. However, I noticed that there is a big difference between lists and strings. Strings are immutable in Python, so I will likely have to import some module to change the string to a different data type before I can work with it the way I want to without using a string replace function.
just for fun :)
st = 'This snowy weather is so cold .'.split()
given_word = 'awesome'
for i, word in enumerate(st):
    if word == 'cold':
        st[i] = given_word # overwrite the matched word in place
        break # break if we found first word
print(' '.join(st))
Here's another answer that might be closer to the solution you described using MATLAB:
st = 'This snowy weather is so cold.'
given_word = 'awesome'
word_to_replace = 'cold'
n = len(word_to_replace)
index_of_word_to_replace = st.find(word_to_replace)
print(st[:index_of_word_to_replace]+given_word+st[index_of_word_to_replace+n:])
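The find-and-slice idea can be wrapped into a function shaped like the forecast call from the question. This is a minimal sketch that replaces only the first occurrence and returns the sentence unchanged when the word is missing; the parameter names are assumptions.

```python
def forecast(sentence, word, replacement):
    """Replace the first occurrence of `word` using str.find and slicing only."""
    index = sentence.find(word)
    if index == -1:  # str.find returns -1 when the word is absent
        return sentence
    return sentence[:index] + replacement + sentence[index + len(word):]

out = forecast('This snowy weather is so cold.', 'cold', 'awesome')
unchanged = forecast('This snowy weather is so cold.', 'hot', 'awesome')
print(out)
```

No conversion to another data type is needed: slicing builds a new string, so immutability is not an obstacle here.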
You can convert your string into a list object, find the index of the word you want to replace and then replace the word.
sentence = "This snowy weather is so cold"
# Split the sentence into a list of the words
words = sentence.split(" ")
# Get the index of the word you want to replace
word_to_replace_index = words.index("cold")
# Replace the target word with the new word based on the index
words[word_to_replace_index] = "awesome"
# Generate a new sentence
new_sentence = ' '.join(words)
Using Regex and a list comprehension.
import re
def strReplace(sentence, toReplace, toReplaceWith):
    return " ".join([re.sub(toReplace, toReplaceWith, i) if re.search(toReplace, i) else i for i in sentence.split()])
print(strReplace('This snowy weather is so cold.', 'cold', 'awesome'))
Output:
This snowy weather is so awesome.

Python code flow does not work as expected?

I am trying to process various texts with regex and Python's NLTK (described at http://www.nltk.org/book). I am trying to create a random text generator and I am having a slight problem. First, here is my code flow:
1. Enter a sentence as input (this is called the trigger string and is assigned to a variable).
2. Get the longest word in the trigger string.
3. Search the whole Project Gutenberg database for sentences that contain this word, regardless of upper or lower case.
4. Return the longest sentence that contains the word from step 3.
5. Append the sentences from step 1 and step 4 together.
6. Assign the sentence from step 4 as the new 'trigger' sentence and repeat the process. Note that I have to get the longest word in the second sentence, and so on.
So far, I have been able to do this only once. When I try to keep this to continue, the program only keeps printing the first sentence my search yields. It should actually look for the longest word in this new sentence and keep applying my code flow described above.
Below is my code along with a sample input/output :
Sample input
"Thane of code"
Sample output
"Thane of code Norway himselfe , with terrible numbers , Assisted by that most disloyall Traytor , The Thane of Cawdor , began a dismall Conflict , Till that Bellona ' s Bridegroome , lapt in proofe , Confronted him with selfe - comparisons , Point against Point , rebellious Arme ' gainst Arme , Curbing his lauish spirit : and to conclude , The Victorie fell on vs"
Now this should actually take the sentence that starts with 'Norway himselfe....' and look for the longest word in it and do the steps above and so on but it doesn't. Any suggestions? Thanks.
import nltk
from nltk.corpus import gutenberg
triggerSentence = raw_input("Please enter the trigger sentence: ")#get input str
split_str = triggerSentence.split()#split the sentence into words
longestLength = 0
longestString = ""
montyPython = 1
while montyPython:
    #code to find the longest word in the trigger sentence input
    for piece in split_str:
        if len(piece) > longestLength:
            longestString = piece
            longestLength = len(piece)
    listOfSents = gutenberg.sents() #all sentences of gutenberg are assigned -list of list format-
    listOfWords = gutenberg.words()# all words in gutenberg books -list format-
    # I tip my hat to Mr.Alex Martelli for this part, which helps me find the longest sentence
    lt = longestString.lower() #this line tells you whether word list has the longest word in a case-insensitive way.
    longestSentence = max((listOfWords for listOfWords in listOfSents if any(lt == word.lower() for word in listOfWords)), key = len)
    #get longest sentence -list format with every word of sentence being an actual element-
    longestSent=[longestSentence]
    for word in longestSent:#convert the list longestSentence to an actual string
        sstr = " ".join(word)
    print triggerSentence + " "+ sstr
    triggerSentence = sstr
How about this?
1. You find the longest word in the trigger.
2. You find the longest word in the longest sentence containing the word found in 1.
3. The word from 1. is the longest word of the sentence from 2.
What happens? Hint: the answer starts with "Infinite". To correct the problem, you may find a set of lower-cased words useful.
BTW, when do you think montyPython becomes False and the program finishes?
Rather than searching the entire corpus each time, it may be faster to construct a single map from word to the longest sentence containing that word. Here's my (untested) attempt to do this.
import collections
from nltk.corpus import gutenberg
def words_in(sentence):
    """Generate all words in the sentence (lower-cased)"""
    for word in sentence.split():
        word = word.strip('.,"\'-:;')
        if word:
            yield word.lower()
def make_sentence_map(sentences):
    """Construct a map from words to the longest sentence containing the word."""
    result = collections.defaultdict(str)
    for tokens in sentences:
        # gutenberg.sents() yields lists of tokens, so join them into one string
        sentence = " ".join(tokens)
        for word in words_in(sentence):
            if len(sentence) > len(result[word]):
                result[word] = sentence
    return result
def generate_random_text(sentence, sentence_map):
    while True:
        yield sentence
        longest_word = max(words_in(sentence), key=len)
        sentence = sentence_map[longest_word]
sentence_map = make_sentence_map(gutenberg.sents())
for sentence in generate_random_text('Thane of code.', sentence_map):
    print(sentence)
Mr. Hankin's answer is more elegant, but the following is more in keeping with the approach you began with:
import sys
import string
import nltk
from nltk.corpus import gutenberg
def longest_element(p):
    """return the first element of p which has the greatest len()"""
    max_len = 0
    elem = None
    for e in p:
        if len(e) > max_len:
            elem = e
            max_len = len(e)
    return elem
def downcase(p):
    """returns a list of words in p shifted to lower case"""
    return map(string.lower, p)
def unique_words():
    """it turns out unique_words was never referenced so this is here
    for pedagogy"""
    # there are 2.6 million words in the Gutenberg corpus but only ~42k unique
    # ignoring case, let's pare that down a bit
    words = set()
    for word in gutenberg.words():
        words.add(word.lower())
    print 'gutenberg.words() has', len(words), 'unique caseless words'
    return words
print 'loading gutenburg corpus...'
sentences = []
for sentence in gutenberg.sents():
    sentences.append(downcase(sentence))
trigger = sys.argv[1:]
target = longest_element(trigger).lower()
last_target = None
while target != last_target:
    matched_sentences = []
    for sentence in sentences:
        if target in sentence:
            matched_sentences.append(sentence)
    print '===', target, 'matched', len(matched_sentences), 'sentences'
    longestSentence = longest_element(matched_sentences)
    print ' '.join(longestSentence)
    trigger = longestSentence
    last_target = target
    target = longest_element(trigger).lower()
Given your sample sentence, though, it reaches a fixed point in two cycles:
$ python nltkgut.py Thane of code
loading gutenburg corpus...
=== target thane matched 24 sentences
norway himselfe , with terrible
numbers , assisted by that most
disloyall traytor , the thane of
cawdor , began a dismall conflict ,
till that bellona ' s bridegroome ,
lapt in proofe , confronted him with
selfe - comparisons , point against
point , rebellious arme ' gainst arme
, curbing his lauish spirit : and to
conclude , the victorie fell on vs
=== target bridegroome matched 1 sentences
norway himselfe , with
terrible numbers , assisted by that
most disloyall traytor , the thane of
cawdor , began a dismall conflict ,
till that bellona ' s bridegroome ,
lapt in proofe , confronted him with
selfe - comparisons , point against
point , rebellious arme ' gainst arme
, curbing his lauish spirit : and to
conclude , the victorie fell on vs
Part of the trouble with the response to the last problem is that it did what you asked, but you asked a more specific question than you wanted an answer to. Thus the response got bogged down in some rather complicated list expressions that I'm not sure you understood. I suggest that you make more liberal use of print statements and don't import code if you don't know what it does. While unwrapping the list expressions I found (as noted) that you never used the corpus wordlist. Functions are a help also.
You are assigning "split_str" outside of the loop, so it gets the original value and then keeps it. You need to assign it at the beginning of the while loop, so it changes each time.
import nltk
from nltk.corpus import gutenberg
triggerSentence = raw_input("Please enter the trigger sentence: ")#get input str
longestLength = 0
longestString = ""
montyPython = 1
while montyPython:
    #so this is run every time through the loop
    split_str = triggerSentence.split()#split the sentence into words
    #code to find the longest word in the trigger sentence input
    for piece in split_str:
        if len(piece) > longestLength:
            longestString = piece
            longestLength = len(piece)
    listOfSents = gutenberg.sents() #all sentences of gutenberg are assigned -list of list format-
    listOfWords = gutenberg.words()# all words in gutenberg books -list format-
    # I tip my hat to Mr.Alex Martelli for this part, which helps me find the longest sentence
    lt = longestString.lower() #this line tells you whether word list has the longest word in a case-insensitive way.
    longestSentence = max((listOfWords for listOfWords in listOfSents if any(lt == word.lower() for word in listOfWords)), key = len)
    #get longest sentence -list format with every word of sentence being an actual element-
    longestSent=[longestSentence]
    for word in longestSent:#convert the list longestSentence to an actual string
        sstr = " ".join(word)
    print triggerSentence + " "+ sstr
    triggerSentence = sstr
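Even with split_str recomputed on each pass, the loop has no exit condition and keeps printing the same sentence once it is its own longest match. The sketch below shows one way to stop, written for Python 3 with a tiny hand-made corpus instead of gutenberg.sents() so it runs without NLTK data; chain_sentences, longest_word, toy_corpus, and max_steps are all made up for this illustration.

```python
def longest_word(tokens):
    """Return the first longest token, lower-cased."""
    return max(tokens, key=len).lower()

def chain_sentences(trigger, corpus, max_steps=10):
    """Follow longest-word links through `corpus` until a sentence repeats."""
    chain = [trigger]
    seen = set()
    tokens = trigger.split()
    for _ in range(max_steps):
        target = longest_word(tokens)
        candidates = [s for s in corpus if target in (w.lower() for w in s)]
        if not candidates:
            break                      # no sentence contains the target word
        best = max(candidates, key=len)
        key = tuple(best)
        if key in seen:
            break                      # fixed point reached, stop instead of looping forever
        seen.add(key)
        chain.append(" ".join(best))
        tokens = best
    return chain

# Stand-in for gutenberg.sents(): a list of token lists.
toy_corpus = [
    ["the", "thane", "of", "cawdor"],
    ["a", "thane", "met", "the", "bridegroome", "today"],
    ["bridegroome", "alone"],
]
chain = chain_sentences("Thane of code", toy_corpus)
print(" ".join(chain))
```

With the real corpus, passing list(gutenberg.sents()) as corpus should behave the same way, at the cost of one full scan per step.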
