Pull several substrings from an input using specific characters to find them

I need to make a user-created madlib, where one user enters a madlib template for someone else to fill in. The input would be something like this:
The (^noun^) and the (^adj^) (^noun^)
I need to pull out everything between (^ and ^) so I can use each word in a new input prompt that completes the madlib:
input('Enter "word in-between the characters":')
This is my code right now:
madlib = input("Enter (^madlib^):")
a = "(^"
b = "^)"
start = madlib.find(a) + len(a)
end = madlib.find(b)
substring = madlib[start:end]
def mad():
    if "(^" in madlib:
        substring = madlib[start:end]
        m = input("Enter " + substring + ":")
        mad = madlib.replace(madlib[start:end],m)
        return mad
print(mad())
What am I missing?

You can use re.finditer() to do this fairly cleanly by collecting the .span() of each match!
import re
# collect starting madlib
madlib_base = input('Enter madlib base with (^...^) around words like (^adj^): ')
# list to put the collected blocks of spans and user inputs into
replacements = []
# yield every block like (^something^) by matching each end and `not ^` inbetween
for match in re.finditer(r"\(\^([^\^]+)\^\)", madlib_base):
    replacements.append({
        "span": match.span(),  # position of the match in madlib_base
        "sub_str": input(f"enter a {match.group(1)}: "),  # replacement str
    })
# replacements mapping and madlib_base can be saved for later!
def iter_replace(base_str, replacements_mapping):
    # yield alternating blocks of text and replacement
    # skip the replacement span from the text when yielding
    base_index = 0  # index in base str to begin from
    for block in replacements_mapping:
        head, tail = block["span"]  # unpack span
        yield base_str[base_index:head]  # next value up to span
        yield block["sub_str"]  # string the user gave us
        base_index = tail  # start from the end of the span
    yield base_str[base_index:]  # any text left after the last span
# collect the iterable into a single result string
# this can be done at the same time as the earlier loop if the input is known
result = "".join(iter_replace(madlib_base, replacements))
Demonstration
...
enter a noun: Madlibs
enter a adj: rapidly
enter a noun: house
...
>>> result
'The Madlibs and the rapidly house'
>>> replacements
[{'span': (4, 12), 'sub_str': 'Madlibs'}, {'span': (21, 28), 'sub_str': 'rapidly'}, {'span': (29, 37), 'sub_str': 'house'}]
>>> madlib_base
'The (^noun^) and the (^adj^) (^noun^)'

Your mad() function only does one substitution, and it's only called once. For your sample input with three required substitutions, you'll only ever get the first noun. In addition, mad() depends on values that are initialized outside the function, so calling it multiple times won't work (it'll keep trying to operate on the same substring, etc).
To fix it, you need to make it so that mad() does one substitution on whatever text you give it, regardless of any other state outside of the function; then you need to call it until it's substituted all the words. You can make this easier by having mad return a flag indicating whether it found anything to substitute.
def mad(text):
    start = text.find("(^")
    end = text.find("^)")
    substring = text[start+2:end] if start > -1 and end > start else ""
    if substring:
        m = input(f"Enter {substring}: ")
        return text.replace(f"(^{substring}^)", m, 1), True
    return text, False
madlib, do_mad = input("Enter (^madlib^):"), True
while do_mad:
    madlib, do_mad = mad(madlib)
print(madlib)
Enter (^madlib^):The (^noun^) and the (^adj^) (^noun^)
Enter noun: cat
Enter adj: lazy
Enter noun: dog
The cat and the lazy dog
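The same repeat-until-done idea can also be written as a single re.sub call with a callback, which visits every (^...^) placeholder once, from left to right. This is only a sketch: fill_madlib and its ask parameter are hypothetical names that do not appear in the answers above, and ask defaults to input so a different function can be passed in when testing.

```python
import re

# Matches (^word^); [^^]+ means "one or more characters that are not a caret".
PLACEHOLDER = re.compile(r"\(\^([^^]+)\^\)")


def fill_madlib(text, ask=input):
    """Replace each (^word^) placeholder with whatever ask() returns for it.

    `ask` is a hypothetical hook: it receives the prompt string and must
    return the replacement text. By default it is the built-in input().
    """
    # The callback is called once per placeholder, in order of appearance.
    return PLACEHOLDER.sub(lambda m: ask(f"Enter {m.group(1)}: "), text)
```

Calling fill_madlib("The (^noun^) and the (^adj^) (^noun^)") prompts three times, once per placeholder, and returns the completed sentence.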

Related

Replace string in list then join list to form new string

I have a project where I need to do the following:
User inputs a sentence
intersect sentence with list for matching strings
replace one of the matching strings with a new string
print the original sentence featuring the replacement
fruits = ['Quince', 'Raisins', 'Raspberries', 'Rhubarb', 'Strawberries', 'Tangelo', 'Tangerines']
# Asks the user for a sentence.
random_sentence = str(input('Please enter a random sentence:\n')).title()
stripped_sentence = random_sentence.strip(',.!?')
split_sentence = stripped_sentence.split()
# Solve for single word fruit names
sentence_intersection = set(fruits).intersection(split_sentence)
# Finds and replaces at least one instance of a fruit in the sentence with “Brussels Sprouts”.
intersection_as_list = list(sentence_intersection)
intersection_as_list[-1] = 'Brussels Sprouts'
Example Input: "I would like some raisins and strawberries."
Expected Output: "I would like some raisins and Brussels Sprouts."
But I can't figure out how to join the string back together after making the replacement. Any help is appreciated!

You can do it with a regex:
(?i)Quince|Raisins|Raspberries|Rhubarb|Strawberries|Tangelo|Tangerines
This pattern will match any of your words in a case insensitive way (?i).
In Python, you can obtain that pattern by joining your fruits into a single string. Then you can use the re.sub function to replace your first matching word with "Brussels Sprouts".
import re
fruits = ['Quince', 'Raisins', 'Raspberries', 'Rhubarb', 'Strawberries', 'Tangelo', 'Tangerines']
# Asks the user for a sentence.
#random_sentence = str(input('Please enter a random sentence:\n')).title()
sentence = "I would like some raisins and strawberries."
pattern = '(?i)' + '|'.join(fruits)
replacement = 'Brussels Sprouts'
print(re.sub(pattern, replacement, sentence, 1))
Output:
I would like some Brussels Sprouts and strawberries.
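One caveat with joining the raw words is that the pattern also matches inside longer words and would misbehave if a word contained regex metacharacters. A possible hardening, sketched here under the assumption that only whole words should be replaced, escapes each word and wraps the alternation in \b word boundaries. The helper name replace_first is hypothetical.

```python
import re

fruits = ['Quince', 'Raisins', 'Raspberries', 'Rhubarb', 'Strawberries', 'Tangelo', 'Tangerines']


def replace_first(sentence, words, replacement):
    """Replace the first whole-word, case-insensitive match of any word."""
    # re.escape guards against words that contain regex metacharacters.
    alternatives = "|".join(re.escape(word) for word in words)
    pattern = re.compile(rf"\b(?:{alternatives})\b", re.IGNORECASE)
    # count=1 limits the substitution to the first match only.
    return pattern.sub(replacement, sentence, count=1)


print(replace_first("I would like some raisins and strawberries.", fruits, "Brussels Sprouts"))
```

With the word boundaries in place, a sentence containing only "Tangelos" is left unchanged, because "Tangelo" is not matched inside the longer word.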

Create a set of lowercase possible word matches, then use a replacement function.
If a word is found, clear the set, so replacement works only once.
import re
fruits = ['Quince', 'Raisins', 'Raspberries', 'Rhubarb', 'Strawberries', 'Tangelo', 'Tangerines']
fruit_set = {x.lower() for x in fruits}
s = "I would like some raisins and strawberries."
def repfunc(m):
    w = m.group(1)
    if w.lower() in fruit_set:
        fruit_set.clear()
        return "Brussels Sprouts"
    else:
        return w
print(re.sub(r"(\w+)",repfunc,s))
prints:
I would like some Brussels Sprouts and strawberries.
That method has the advantage of being O(1) on lookup. If there are a lot of possible words it will beat the linear search that | performs when testing word after word.
It's simpler to replace just the first occurrence, but replacing the last occurrence or a random one is also doable. First count how many fruits are in the sentence, then decide in a second pass which match to replace.
Like this (not very beautiful, since it relies on globals):
# note: rebuild fruit_set first if the earlier repfunc already cleared it
total = 0
def countfunc(m):
    global total
    w = m.group(1)
    if w.lower() in fruit_set:
        total += 1
idx = 0
def repfunc(m):
    global idx
    w = m.group(1)
    if w.lower() in fruit_set:
        if total == idx+1:
            return "Brussels Sprouts"
        else:
            idx += 1
            return w
    else:
        return w
re.sub(r"(\w+)",countfunc,s)
print(re.sub(r"(\w+)",repfunc,s))
The first sub call only counts how many fruits match; the second replaces a match only when the running index reaches that count, so here the last occurrence is selected.
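Replacing the last occurrence can also be done without globals by collecting the matches first and slicing the sentence around the final one. The following is a sketch rather than the answerer's method; replace_last is a hypothetical helper, and it assumes whole-word, case-insensitive matching is what is wanted.

```python
import re

fruits = ['Quince', 'Raisins', 'Raspberries', 'Rhubarb', 'Strawberries', 'Tangelo', 'Tangerines']


def replace_last(sentence, words, replacement):
    """Replace only the last whole-word match of any word in `words`."""
    alternatives = "|".join(re.escape(word) for word in words)
    pattern = re.compile(rf"\b(?:{alternatives})\b", re.IGNORECASE)
    matches = list(pattern.finditer(sentence))
    if not matches:
        return sentence  # nothing to replace
    start, end = matches[-1].span()  # position of the final match
    return sentence[:start] + replacement + sentence[end:]


print(replace_last("I would like some raisins and strawberries.", fruits, "Brussels Sprouts"))
```

Picking a random occurrence instead would only require choosing a different element of matches, for example with random.choice.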

Given string, return a dictionary of all the phone numbers in that text

I just started learning dictionaries and regex and I'm having trouble creating a dictionary. In my task, area code is a combination of plus sign and three numbers. The phone number itself is a combination of 7-8 numbers. The phone number might be separated from the area code with a whitespace, but not necessarily.
def find_phone_numbers(text: str) -> dict:
    pattern = r'\+\w{3} \w{8}|\+\w{11}|\+\w{3} \w{7}|\+\w{10}|\w{8}|\w{7}'
    match = re.findall(pattern, text)
    str1 = " "
    phone_str = str1.join(match)
    phone_dict = {}
    phones = phone_str.split(" ")
    for phone in phones:
        if phone[0] == "+":
            phone0 = phone
        if phone_str[0:4] not in phone_dict.keys():
            phone_dict[phone_str[0:4]] = [phone_str[5:]]
    return phone_dict
The result should be:
print(find_phone_numbers("+372 56887364 +37256887364 +33359835647 56887364 +11 1234567 +327 1 11111111")) ->
{'+372': ['56887364', '56887364'], '+333': ['59835647'], '': ['56887364', '1234567', '11111111']}
The main problem is that phone numbers with the same area code can be written together or separately. I had an idea to use a for loop to get rid of the "tail" in the form of a phone number and only the area code will remain, but I don't understand how to get rid of the tail here +33359835647. How can this be done and is there a more efficient way?

Try this pattern, which makes the +NNN area code optional and captures the 7-8 digit number in a separate group:
import re
s = "+372 56887364 +37256887364 +33359835647 56887364 +11 1234567 +327 1 11111111"
pat = re.compile(r"(\+\d{3})?\s*(\d{7,8})")
out = {}
for pref, number in pat.findall(s):
    out.setdefault(pref, []).append(number)
print(out)
Prints:
{
"+372": ["56887364", "56887364"],
"+333": ["59835647"],
"": ["56887364", "1234567", "11111111"],
}
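Wrapped in the function signature from the question, the same approach might look like the sketch below. The constant name PHONE_PATTERN is an assumption for illustration. Note also that the pattern has no boundary checks, so it assumes the input resembles the example; a longer run of digits would be split into a 7-8 digit match plus a remainder.

```python
import re

# Optional "+NNN" area code, optional whitespace, then 7 or 8 digits.
PHONE_PATTERN = re.compile(r"(\+\d{3})?\s*(\d{7,8})")


def find_phone_numbers(text: str) -> dict:
    """Group phone numbers by area code; numbers without one go under ''."""
    phone_dict = {}
    # findall returns (area_code, number) tuples; a missing area code is "".
    for area_code, number in PHONE_PATTERN.findall(text):
        phone_dict.setdefault(area_code, []).append(number)
    return phone_dict


print(find_phone_numbers("+372 56887364 +37256887364 +33359835647 56887364 +11 1234567 +327 1 11111111"))
```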

Function to extract company register number from text string using Regex

I have a function which extracts the company register number (German: handelsregisternummer) from a given text. Although my regex for this particular problem matches the correct format (please see demo), I can not extract the correct company register number.
I want to extract HRB 142663 B but I get HRB 142663.
Most numbers are in the format HRB 123456 but sometimes there is the letter B attached to the end.
import re
def get_handelsregisternummer(string, keyword):
    # https://regex101.com/r/k6AGmq/10
    reg_1 = fr'\b{keyword}[,:]?(?:[- ](?:Nr|Nummer)[.:]*)?\s?(\d+(?: \d+)*)(?: B)?'
    match = re.compile(reg_1)
    handelsregisternummer = match.findall(string) # list of matched words
    if handelsregisternummer: # not empty
        return handelsregisternummer[0]
    else: # no match found
        handelsregisternummer = ""
        return handelsregisternummer
Example text scraped from website. Linebreaks make words attached to each other:
text_impressum = """"Berlin, HRB 142663 BVAT-ID.: DE283580648Tax Reference Number:"""
Apply function:
for keyword in ['HRB', 'HRA', 'HR B', 'HR A']:
    handelsregisternummer = get_handelsregisternummer(text_impressum, keyword=keyword)
    if handelsregisternummer: # if list is not empty anymore, then do...
        handelsregisternummer = keyword + " " + handelsregisternummer
        break
if not handelsregisternummer: # if list is empty
    handelsregisternummer = 'not specified'
handelsregisternummer_dict = {'handelsregisternummer':handelsregisternummer}
Afterwards I get:
handelsregisternummer_dict ={'handelsregisternummer': 'HRB 142663'}
But I want this:
handelsregisternummer_dict ={'handelsregisternummer': 'HRB 142663 B'}

You need to use two capturing groups in the regex to capture the keyword and the number, and just match the rest:
reg_1 = fr'\b({keyword})[,:]?(?:[- ](?:Nr|Nummer)[.:]*)?\s?(\d+(?: \d+)*(?: B)?)'
#            |_________|                                   |___________________|
Then join all the captured groups that findall returns for the first match:
if handelsregisternummer: # if list is not empty anymore, then do...
    handelsregisternummer = " ".join(handelsregisternummer)
    break
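Put together, the two-group pattern and the keyword loop could be folded into one function, roughly as sketched below. The keywords parameter and the use of re.search in place of findall are choices made for this illustration, not part of the original code. Because the scraped text runs words together, the optional " B" may also capture the first letter of a following word that happens to start with B.

```python
import re


def get_handelsregisternummer(text, keywords=("HRB", "HRA", "HR B", "HR A")):
    """Return e.g. 'HRB 142663 B', or 'not specified' when nothing matches."""
    for keyword in keywords:
        # Group 1: the keyword. Group 2: the digits plus an optional " B".
        pattern = re.compile(
            rf"\b({re.escape(keyword)})[,:]?(?:[- ](?:Nr|Nummer)[.:]*)?\s?(\d+(?: \d+)*(?: B)?)"
        )
        match = pattern.search(text)
        if match:
            return " ".join(match.groups())
    return "not specified"


text_impressum = '"Berlin, HRB 142663 BVAT-ID.: DE283580648Tax Reference Number:'
print({"handelsregisternummer": get_handelsregisternummer(text_impressum)})
```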

Simplest way to convert char offsets to word offsets

I have a python string and a substring of selected text. The string for example could be
stringy = "the bee buzzed loudly"
I want to select the text "bee buzzed" within this string. I have the character offsets, i.e. 4-14 for this particular string, because those are the character-level indices that the selected text lies between.
What is the simplest way to convert these to word-level indices, i.e. 1-2, because the second and third words are being selected? I have many strings labeled like this and would like to convert the indices simply and efficiently. The data is currently stored in a dictionary like so:
data = {"string":"the bee buzzed loudly","start_char":4,"end_char":14}
I would like to convert it to this form
data = {"string":"the bee buzzed loudly","start_word":1,"end_word":2}
Thank you!

It seems like a tokenisation problem.
My solution would be to use a span tokenizer and then look for the token spans that fall inside your substring span.
So, using the nltk library:
import nltk
tokenizer = nltk.tokenize.TreebankWordTokenizer()
# or tokenizer = nltk.tokenize.WhitespaceTokenizer()
stringy = 'the bee buzzed loudly'
sub_b, sub_e = 4, 14 # substring begin and end
[i for i, (b, e) in enumerate(tokenizer.span_tokenize(stringy))
 if b >= sub_b and e <= sub_e]
But this is kind of intricate.
tokenizer.span_tokenize(stringy) returns spans for each token/word it identified.

Here's a simple list index approach:
# set up data
string = "the bee buzzed loudly"
words = string[4:14].split(" ") # get words from the string using the character indices
stringLst = string.split(" ") #split string into words
dictionary = {"string":"", "start_word":0,"end_word":0}
#process
dictionary["string"] = string
dictionary["start_word"] = stringLst.index(words[0]) #index of the first word in words
dictionary["end_word"] = stringLst.index(words[-1]) #index of the last
print(dictionary)
{'string': 'the bee buzzed loudly', 'start_word': 1, 'end_word': 2}
Note that list.index returns the first match, so this assumes the first and last selected words do not also occur earlier in the string.
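A standard-library variant that compares spans directly, and therefore also copes with repeated words, could look like the sketch below. It assumes words are separated by whitespace; char_span_to_word_span is a hypothetical name, and a word counts as selected only when it lies entirely inside the character range.

```python
import re


def char_span_to_word_span(text, start_char, end_char):
    """Return (start_word, end_word) for the words inside the character span.

    Returns None when no whole word lies inside the span.
    """
    selected = [
        index
        # \S+ finds each run of non-whitespace characters, i.e. each word.
        for index, match in enumerate(re.finditer(r"\S+", text))
        if match.start() >= start_char and match.end() <= end_char
    ]
    if not selected:
        return None
    return selected[0], selected[-1]


data = {"string": "the bee buzzed loudly", "start_char": 4, "end_char": 14}
start_word, end_word = char_span_to_word_span(data["string"], data["start_char"], data["end_char"])
print({"string": data["string"], "start_word": start_word, "end_word": end_word})
```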

Try this code, please;
def char_change(dic, start_char, end_char, *arg):
    dic[arg[0]] = start_char
    dic[arg[1]] = end_char
data = {"string":"the bee buzzed loudly","start_char":4,"end_char":14}
start_char = int(input("Please enter your start character: "))
end_char = int(input("Please enter your end character: "))
char_change(data, start_char, end_char, "start_char", "end_char")
print(data)
Default Dictionary:
data = {"string":"the bee buzzed loudly","start_char":4,"end_char":14}
INPUT
Please enter your start character: 1
Please enter your end character: 2
OUTPUT Dictionary:
{'string': 'the bee buzzed loudly', 'start_char': 1, 'end_char': 2}

Is there a way of getting this string down to 3 words?

There are multiple problems with the code I posted below. As I said in my previous post, I'm new to coding and have some trouble finding things by myself :(
My goal is to take user input, narrow it down to 3 words by size and then sort them alphabetically. Am I doing this right?
Probably not, because it prints the words with extra quotes and commas. For example, with "i like eating cake" as input, the output is:
"'cake',", "'eating'", "'i',", "'like',"
But I want it to be:
cake, eating, like
Any help is much appreciated.
input = input(" ")
prohibited = {'this','although','and','as','because','but','even if','he','and','however','cosmos','an','a','is','what','question :','question','[',']',',','cosmo',' ',' ',' '}
processedinput = [word for word in re.split("\W+",input) if word.lower() not in prohibited]
processed = processedinput
processed.sort(key = len)
processed = re.sub('[\[\]]','',repr(processedinput)) #removes brackets
keywords = processed
keywords = keywords.split()
keywords.sort(key=str.lower)
keywords.sort()
keywords = re.sub('[\[\]]','',repr(keywords))
str(keywords)
print(keywords)

The first issue with your code is input = input(). The problem with this is that input is the name of the function you are calling, but you are overwriting input with the user's string. Consequently, if you tried to run input() again, it would fail.
The second issue is that you are misunderstanding lists. In the code below, tokens is a list, not a string. Each element in the list is a string. So there is no need to strip out brackets and such. You can simply order the list (that part of your code was correct) in reverse order of length, then print the first three words.
Code:
import re
user_input = input(" ")
prohibited = {'this','although','and','as','because','but','even if','he','and','however','cosmos','an','a','is','what','question :','question','[',']',',','cosmo',' ',' ',' '}
tokens = [word for word in re.split(r"\W+", user_input) if word.lower() not in prohibited]
tokens.sort(key=len, reverse=True)
print(tokens[0], end=', ')
print(tokens[1], end=', ')
print(tokens[2])
Input:
i like eating cake
Output:
eating, like, cake
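To also get the alphabetical order the question asks for, and to avoid an IndexError when fewer than three words survive the filter, the selection can be done with slicing followed by a second sort. This is a sketch under the assumption that "by size" means the longest words; top_keywords is a hypothetical helper, and the prohibited set is shortened here for readability.

```python
import re


def top_keywords(text, prohibited, count=3):
    """Return the `count` longest allowed words, sorted alphabetically."""
    # re.split can yield empty strings at the edges, so drop those too.
    words = [w for w in re.split(r"\W+", text) if w and w.lower() not in prohibited]
    # Slicing never raises, even when fewer than `count` words remain.
    longest = sorted(words, key=len, reverse=True)[:count]
    return sorted(longest, key=str.lower)


# Shortened version of the prohibited set from the question.
prohibited = {'this', 'although', 'and', 'as', 'because', 'but', 'he', 'however', 'an', 'a', 'is', 'what'}
print(", ".join(top_keywords("i like eating cake", prohibited)))
```

For the sample input this prints "cake, eating, like", which is the output the question asked for.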
