Splitting a sentence below a specific "space" character with regex in Python - python

I am trying to split a sentence down to a meaningful set of whole words that fits under a specific length.
string1 = "Alice is in wonderland"
string2 = "Bob is playing games on his computer"
I want a regex that matches the leading words of each sentence, as long as the result is no longer than 20 characters.
new_string1 = "Alice is in"
new_string2 = "Bob is playing games"
Is it possible to do this with regex?

This is not a good use case for regular expressions. However, the textwrap.shorten function achieves exactly that.
import textwrap
string1 = "Alice is in wonderland"
string2 = "Bob is playing games on his computer"
new_string1 = textwrap.shorten(string1, 20, placeholder="")
new_string2 = textwrap.shorten(string2, 20, placeholder="")
print(new_string1) # Alice is in
print(new_string2) # Bob is playing games
The only downside of textwrap.shorten is that it collapses runs of whitespace into single spaces. If you do not want that to happen, you can implement your own function.
def shorten(s, max_chars):
    # Special case: the string already fits within the limit
    if len(s) <= max_chars:
        return s.rstrip()
    stop = 0
    for i in range(max_chars + 1):
        # Remember the position of the most recent whitespace character
        if s[i].isspace():
            stop = i
    # Cut at that whitespace and drop any trailing whitespace
    return s[:stop].rstrip()
string1 = "Alice is in wonderland"
string2 = "Bob is playing games on his computer"
new_string1 = shorten(string1, 20)
new_string2 = shorten(string2, 20)
print(new_string1) # Alice is in
print(new_string2) # Bob is playing games
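Since the question asked specifically about regex, here is a minimal sketch of how it could be done, under the assumption that "meaningful" means cutting only at whitespace. The name shorten_with_regex is hypothetical and not part of either answer above. The pattern greedily takes up to max_chars characters and then backtracks until the next character is whitespace or the end of the string.

```python
import re

def shorten_with_regex(s, max_chars):
    # Greedily take up to max_chars characters, then backtrack until the
    # next character is whitespace or the end of the string.
    m = re.match(r".{0,%d}(?=\s|$)" % max_chars, s)
    # No match means the first word alone exceeds the limit.
    return m.group().rstrip() if m else ""

print(shorten_with_regex("Alice is in wonderland", 20))  # Alice is in
print(shorten_with_regex("Bob is playing games on his computer", 20))  # Bob is playing games
```

Like the hand-written loop, this keeps interior whitespace as it is. Note that . does not match newlines by default, so multi-line input would need the re.DOTALL flag.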

Related

Replace string in list then join list to form new string

I have a project where I need to do the following:
User inputs a sentence
intersect sentence with list for matching strings
replace one of the matching strings with a new string
print the original sentence featuring the replacement
fruits = ['Quince', 'Raisins', 'Raspberries', 'Rhubarb', 'Strawberries', 'Tangelo', 'Tangerines']
# Asks the user for a sentence.
random_sentence = str(input('Please enter a random sentence:\n')).title()
stripped_sentence = random_sentence.strip(',.!?')
split_sentence = stripped_sentence.split()
# Solve for single word fruit names
sentence_intersection = set(fruits).intersection(split_sentence)
# Finds and replaces at least one instance of a fruit in the sentence with “Brussels Sprouts”.
intersection_as_list = list(sentence_intersection)
intersection_as_list[-1] = 'Brussels Sprouts'
Example Input: "I would like some raisins and strawberries."
Expected Output: "I would like some raisins and Brussels Sprouts."
But I can't figure out how to join the string back together after making the replacement. Any help is appreciated!
You can do it with a regex:
(?i)Quince|Raisins|Raspberries|Rhubarb|Strawberries|Tangelo|Tangerines
This pattern matches any of your words case-insensitively (that is what the (?i) prefix does).
In Python, you can build that pattern by joining your fruits into a single string with |. Then you can use the re.sub function to replace the first matching word with "Brussels Sprouts".
import re
fruits = ['Quince', 'Raisins', 'Raspberries', 'Rhubarb', 'Strawberries', 'Tangelo', 'Tangerines']
# Asks the user for a sentence.
#random_sentence = str(input('Please enter a random sentence:\n')).title()
sentence = "I would like some raisins and strawberries."
pattern = '(?i)' + '|'.join(fruits)
replacement = 'Brussels Sprouts'
print(re.sub(pattern, replacement, sentence, count=1))
Output:
I would like some Brussels Sprouts and strawberries.
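One caveat with a pattern built by a plain join: it also matches inside longer words (for example "Tangelo" inside "Tangelos"), and any regex metacharacters in the word list would be interpreted as syntax. A hedged variant is sketched below, assuming whole-word matching is what is wanted. The helper name replace_first_fruit is hypothetical.

```python
import re

fruits = ['Quince', 'Raisins', 'Raspberries', 'Rhubarb',
          'Strawberries', 'Tangelo', 'Tangerines']

def replace_first_fruit(sentence, words, replacement):
    # Escape each word and wrap the alternation in word boundaries so that
    # only whole words match.
    pattern = r'\b(?:' + '|'.join(re.escape(w) for w in words) + r')\b'
    return re.sub(pattern, replacement, sentence, count=1, flags=re.IGNORECASE)

print(replace_first_fruit("I would like some raisins and strawberries.",
                          fruits, "Brussels Sprouts"))
# I would like some Brussels Sprouts and strawberries.
```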
Create a set of lowercase possible word matches, then use a replacement function.
If a word is found, clear the set, so replacement works only once.
import re
fruits = ['Quince', 'Raisins', 'Raspberries', 'Rhubarb', 'Strawberries', 'Tangelo', 'Tangerines']
fruit_set = {x.lower() for x in fruits}
s = "I would like some raisins and strawberries."
def repfunc(m):
    w = m.group(1)
    if w.lower() in fruit_set:
        # Clear the set so that only the first match is replaced
        fruit_set.clear()
        return "Brussels Sprouts"
    else:
        return w
print(re.sub(r"(\w+)", repfunc, s))
prints:
I would like some Brussels Sprouts and strawberries.
This method has the advantage of O(1) lookup per word. If there are a lot of possible words, it will beat the linear search that | alternation performs when testing one alternative after another.
It's simpler to replace just the first occurrence, but replacing the last occurrence, or a random occurrence is also doable. First you have to count how many fruits are in the sentence, then decide which replacement is effective in a second pass.
Like this (not very pretty, since it relies on globals):
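# Assumes re, s and a freshly built (not yet cleared) fruit_set from above
total = 0
def countfunc(m):
    global total
    w = m.group(1)
    if w.lower() in fruit_set:
        total += 1
    return w
idx = 0
def repfunc(m):
    global idx
    w = m.group(1)
    if w.lower() in fruit_set:
        if total == idx + 1:
            return "Brussels Sprouts"
        else:
            idx += 1
            return w
    else:
        return w
# First pass only counts the matching fruits; its result is discarded
re.sub(r"(\w+)", countfunc, s)
print(re.sub(r"(\w+)", repfunc, s))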
The first sub just counts how many fruits match; the second function then replaces only when its counter reaches that total. Here the last occurrence is selected.
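The two-pass idea can also be sketched without globals by collecting the match objects first and then splicing the replacement in by position. This is an alternative arrangement rather than the answerer's code, and the name replace_last is hypothetical.

```python
import re

def replace_last(sentence, words, replacement):
    word_set = {w.lower() for w in words}
    # Collect every word in the sentence that appears in the set.
    hits = [m for m in re.finditer(r"\w+", sentence)
            if m.group().lower() in word_set]
    if not hits:
        return sentence
    last = hits[-1]
    # Splice the replacement in at the position of the last hit.
    return sentence[:last.start()] + replacement + sentence[last.end():]

fruits = ['Quince', 'Raisins', 'Raspberries', 'Rhubarb',
          'Strawberries', 'Tangelo', 'Tangerines']
print(replace_last("I would like some raisins and strawberries.",
                   fruits, "Brussels Sprouts"))
# I would like some raisins and Brussels Sprouts.
```

Picking hits[0] or random.choice(hits) instead of hits[-1] would select the first or a random occurrence.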

How do I divide this string after the word extracted with the regex?

import re
input_text_to_checkA = "to play football in the street you need good boots" #For example
regex_patron_00A = r"\s*\¿?(?:so you can play |so you can play |to play the |to play )\s*((?:\w+\s*)+)\s*\??"
regex_patron_00B = r" \s*((?:\w+\s*)+) \s*\¿?(?:are really necessary the |are really necessary |would be needed from |would be needed |she needs |he needs |you need )\s*((?:\w+\s*)+)\s*\??"
mA = re.search(regex_patron_00A, input_text_to_checkA, re.IGNORECASE)
if mA:
    input_text_to_checkB = mA.group() # ---> "football in the street you need good boots"
    mB = re.search(regex_patron_00B, input_text_to_checkB, re.IGNORECASE)
    if mB:
        word, association = mB.groups()
        word = word.strip() # ---> "football in the street"
        association = association.strip() # ---> "good boots"
I need input_text_to_checkB to store the string without the leading part matched by the first regex and without what was stored in the word variable.

Delete based on presence

I'm trying to analyze an article to determine if a specific substring appears.
If "Bill" appears, then I want to delete the substring's parent sentence from the article, as well as every sentence following the first deleted sentence.
If "Bill" does not appear, no alterations are made to the article.
Sample Text:
stringy = """This is Bill Everest here. A long time ago in, erm, this galaxy, a game called Star Wars Episode I: Racer was a smash hit, leading to dozens of enthusiastic magazine reviews with the byline "now this is podracing!" Unfortunately, the intervening years have been unkind to the Star Wars prequels, Star Fox in the way you can rotate your craft to fit through narrow gaps.
This is Bill, signing off. Thank you for reading. And see you tomorrow!"""
Desired Result When Targeted Substring is "Bill":
stringy = """This is Bill Everest here. A long time ago in, erm, this galaxy, a game called Star Wars Episode I: Racer was a smash hit, leading to dozens of enthusiastic magazine reviews with the byline "now this is podracing!" Unfortunately, the intervening years have been unkind to the Star Wars prequels, but does that hindsight extend to this thoroughly literally-named racing tie-in? Star Fox in the way you can rotate your craft to fit through narrow gaps.
"""
This is the code so far:
if "Bill" not in stringy[-200:]:
    print(stringy)
text = stringy.rsplit("Bill")[0]
text = text.split('.')[:-1]
text = '.'.join(text) + '.'
It currently doesn't work when "Bill" appears outside of the last 200 characters, cutting off the text at the very first instance of "Bill" (the opening sentence, "This is Bill Everest here"). How can this code be altered to only select for "Bill"s in the last 200 characters?
Here's another approach that loops through each sentence using a regex. We keep a running character count, and once we're in the last 200 characters we check for 'Bill' in the sentence. If found, we exclude everything from this sentence onward.
Hope the code is readable enough.
import re
def remove_bill(stringy):
    sentences = re.findall(r'([A-Z][^\.!?]*[\.!?]\s*\n*)', stringy)
    total = len(stringy)
    count = 0
    for index, line in enumerate(sentences):
        # Check each index of 'Bill' in line
        for pos in (m.start() for m in re.finditer('Bill', line)):
            if count + pos >= total - 200:
                stringy = ''.join(sentences[:index])
                return stringy
        count += len(line)
    return stringy
stringy = remove_bill(stringy)
Here is how you can use re:
import re
stringy = """..."""
target = "Bill"
l = re.findall(r'([A-Z][^\.!?]*[\.!?])',stringy)
for i in range(len(l)-1,0,-1):
    if target in l[i] and sum([len(a) for a in l[i:]])-sum([len(a) for a in l[i].split(target)[:-1]]) < 200:
        # Keep only the sentences before the one that mentions the target
        stringy = ' '.join(l[:i])
print(stringy)
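For comparison, the same rule can be sketched with str.find, which accepts a start index and so can restrict the search to the tail of the text directly. This is a rough sketch that assumes sentences end with ., ! or ? followed by whitespace. The name truncate_at_late_mention and the window parameter are hypothetical.

```python
import re

def truncate_at_late_mention(text, target, window=200):
    # Only look for the target inside the last `window` characters.
    pos = text.find(target, max(0, len(text) - window))
    if pos == -1:
        return text
    # Find where the sentence containing the target begins: just after the
    # last sentence terminator that precedes it.
    ends = [m.end() for m in re.finditer(r"[.!?]\s+", text[:pos])]
    cut = ends[-1] if ends else 0
    return text[:cut].rstrip()

body = "Bill opens the show. " + "Filler sentence here. " * 20
article = body + "This is Bill, signing off. Thanks!"
print(truncate_at_late_mention(article, "Bill") == body.rstrip())  # True
```

An early "Bill" (such as the one in the opening sentence) is ignored because the search starts 200 characters from the end.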

I'm trying to solve for the other Acronyms

I have the following strings that I need to make acronyms for:
Institute of Electrical and Electronics Engineers
As Soon As Possible
University of California San Diego
Self Contained Underwater Breathing Apparatus
This is my code
my_string = input()
my_string2 = my_string.upper()
for x in range(0, 1, len(my_string2)):
    print(my_string2[0::15])
but it only worked for the first input. There are three more examples that this code doesn't cover. What I need is for this code to be modified so that it creates an acronym out of any input. For the first example, "Institute of Electrical and Electronics Engineers", the expected output is IEEE. Basically, the first letter of every capitalized word is kept and lowercase words are dropped.
I'm new to programming so I bet the way I did it is a bit funky, but this worked for me on my zybooks lab:
Name = input()
AStart = Name.split()
AFinal = ''
for string in AStart:
    if string[0].isupper():
        AFinal += string[0] + '.'
print(AFinal)
Here's a regex-based solution which looks for words that start with a capital letter, extracts their first letter, then joins them all together to make the acronym:
import re
strings = [
    'As Soon As Possible',
    'Institute of Electrical and Electronics Engineers',
    'University of California San Diego',
    'Self Contained Underwater Breathing Apparatus'
]
for s in strings:
    acronym = ''.join(re.findall(r'\b[A-Z]', s))
    print(acronym)
If you don't want to use regex, you can just split the strings and test the first character of each word to see if it is uppercase:
for s in strings:
    acronym = ''.join(w[0] for w in s.split(' ') if w[0].isupper())
    print(acronym)
In either case the output is:
ASAP
IEEE
UCSD
SCUBA
To run from input, use this code:
import re
s = input()
acronym = ''.join(re.findall(r'\b[A-Z]', s))
print(acronym)
Or:
s = input()
acronym = ''.join(w[0] for w in s.split(' ') if w[0].isupper())
print(acronym)
Try this:
full_string = input("Enter Text: ")
string_list = full_string.split()
acronym = ""
for string in string_list:
    acronym += string[0].upper()
print(acronym)
output:
Enter Text: This is a long string please be kind
TIALSPBK
This is what I used.
phrase = str(input()).rstrip()  # read the phrase and strip trailing whitespace
for char in phrase:  # go through every character
    if char.isupper():  # keep only the uppercase characters
        print(char, end='')  # print them all on one line
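One detail worth noting about the split-based answers: s.split(' ') produces empty strings when the input contains consecutive spaces, and indexing w[0] on an empty string raises IndexError. A small sketch that sidesteps this by calling split() with no argument follows. The function name make_acronym is hypothetical.

```python
def make_acronym(phrase):
    # split() with no argument ignores leading, trailing, and repeated
    # whitespace, so every token is non-empty and token[0] is always safe.
    return "".join(token[0] for token in phrase.split() if token[0].isupper())

print(make_acronym("Institute of Electrical and Electronics Engineers"))  # IEEE
print(make_acronym("As  Soon As Possible"))  # ASAP
```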

Count characters in string

So I'm trying to count the characters in anhCrawler and return the number of characters with and without spaces, along with the position of "DEATH STAR", in theReport. I can't get the numbers to count correctly. Please help!
anhCrawler = """Episode IV, A NEW HOPE. It is a period of civil war. \
Rebel spaceships, striking from a hidden base, have won their first \
victory against the evil Galactic Empire. During the battle, Rebel \
spies managed to steal secret plans to the Empire's ultimate weapon, \
the DEATH STAR, an armored space station with enough power to destroy \
an entire planet. Pursued by the Empire's sinister agents, Princess Leia\
races home aboard her starship, custodian of the stolen plans that can \
save her people and restore freedom to the galaxy."""
theReport = """
This text contains {0} characters ({1} if you ignore spaces).
There are approximately {2} words in the text. The phrase
DEATH STAR occurs and starts at position {3}.
"""
def analyzeCrawler(thetext):
    numchars = 0
    nospacechars = 0
    numspacechars = 0
    anhCrawler = thetext
    word = anhCrawler.split()
    for char in word:
        numchars = word[numchars]
        if numchars == " ":
            numspacechars += 1
    anhCrawler = re.split(" ", anhCrawler)
    for char in anhCrawler:
        nospacechars += 1
    numwords = len(anhCrawler)
    pos = thetext.find("DEATH STAR")
    char_len = len("DEATH STAR")
    ds = thetext[261:271]
    dspos = "[261:271]"
    return theReport.format(numchars, nospacechars, numwords, dspos)
print analyzeCrawler(theReport)
You're overthinking this problem.
Number of chars in string (returns 520):
len(anhCrawler)
Number of non-whitespace characters in string (returns 434). Calling split() with no arguments splits on whitespace and discards it, and join() glues the pieces back into a string with no whitespace:
len(''.join(anhCrawler.split()))
Finding the position of "DEATH STAR" (returns 261):
anhCrawler.find("DEATH STAR")
Here is a simplified version of your function:
import re
def analyzeCrawler2(thetext, text_to_search = "DEATH STAR"):
    # Work on the argument rather than the global anhCrawler
    numchars = len(thetext)
    nospacechars = len(re.sub(r"\s+", "", thetext))
    numwords = len(thetext.split())
    dspos = thetext.find(text_to_search)
    return theReport.format(numchars, nospacechars, numwords, dspos)
# Pass the text to analyze, not the report template
print(analyzeCrawler2(anhCrawler))
This text contains 520 characters (434 if you ignore spaces).
There are approximately 87 words in the text. The phrase
DEATH STAR occurs and starts at position 261.
I think the tricky part is removing the whitespace from the string in order to calculate the non-space character count. This can be done simply with a regular expression. The rest should be self-explanatory.
First off, you need to indent the code that's inside a function. Second... your code can be simplified to the following:
theReport = """
This text contains {0} characters ({1} if you ignore spaces).
There are approximately {2} words in the text. The phrase
DEATH STAR is the {3}th word and starts at the {4}th character.
"""
def analyzeCrawler(thetext):
    numchars = len(thetext)
    nospacechars = len(thetext.replace(' ', ''))
    numwords = len(thetext.split())
    word = 'DEATH STAR'
    charPosition = thetext.find(word)
    # split().index(word) cannot find a phrase that contains a space, so
    # count the words that come before the phrase instead
    wordPosition = len(thetext[:charPosition].split())
    return theReport.format(
        numchars, nospacechars, numwords, wordPosition, charPosition
    )
I modified the last two format arguments because it wasn't clear what you meant by dspos, although maybe it's obvious and I'm not seeing it. In any case, I included the word and char position instead. You can determine which one you really meant to include.
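The answers above differ in one detail: replace(' ', '') removes only space characters, while the split/join and \s+ approaches remove all whitespace, including newlines and tabs. For anhCrawler the two presumably agree because its line breaks are escaped, but for other texts they may not. A small sketch showing the counts side by side follows. The function name count_report and the sample string are made up for illustration.

```python
def count_report(text, phrase):
    return {
        "chars": len(text),
        # Removes only the space character
        "no_spaces": len(text.replace(" ", "")),
        # Removes every kind of whitespace (spaces, tabs, newlines)
        "no_whitespace": len("".join(text.split())),
        "words": len(text.split()),
        # -1 if the phrase does not occur
        "position": text.find(phrase),
    }

sample = "the DEATH STAR,\nan armored\tspace station"
print(count_report(sample, "DEATH STAR"))
```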
