Regular expression of a sentence - python

I'm trying to write a regular expression to represent a sentence with the following conditions: starts with a capital letter, ends with a period (and only one period can appear), and is allowed to contain a comma or semi-colon, but when it does, it must appear as (letter)(semicolon)(space) or (letter)(comma)(space).
I've got the capital letter and period down. I have the idea for the code but I think I'm not getting the syntax completely right...
In English, my expression for a sentence looks like this:
(capital letter) ((lowercase letter)(space) ((lowercase letter)(comma)(space))*
((lowercase letter)(semicolon)(space)* )* (period)
I realize this ignores the case where the first letter of the sentence is immediately followed by a comma or semicolon, but it's safe to ignore that case.
Now when I try to code this in Python, I try the following (I've added whitespace to make things easier to read):
sentence = re.compile("^[A-Z] [a-z\\s (^[a-z];\\s$)* (^[a-z],\\s$)*]* \.$")
I feel like it's a syntax issue... I'm not sure if I'm allowed to have the semicolon and comma portions inside of parentheses.
Sample inputs that match the definition:
"This is a sentence."
"Hello, world."
"Hi there; hi there."
Sample inputs that do not match the definition:
"i ate breakfast."
"This is , a sentence."
"What time is it?"

This would match what you said above.
^"[A-Z][a-z]*(\s*|[a-z]*|(?<!\s)[;,](?=\s))*[.]"$? => demo
This would match:
"This is a sentence."
"Hello, world."
"Hi there; hi there."
This won't match:
"i ate breakfast."
"This is , a sentence."
"What time is it?"
"I a ,d am."
"I a,d am."
If you don't need the " just remove it from the regex.
If you need the regex in python, try this
re.compile(r'^[A-Z][a-z]*(\s*|[a-z]*|(?<!\s)[;,](?=\s))*[.]$')
Python demo
import re
tests = ["This is a sentence.",
         "Hello, world.",
         "Hi there; hi there.",
         "i ate breakfast.",
         "This is , a sentence.",
         "What time is it?"]
# Same pattern as above, including the lookahead that requires whitespace after , or ;
rex = re.compile(r'^[A-Z][a-z]*(\s*|[a-z]*|(?<!\s)[;,](?=\s))*[.]$')
for test in tests:
    print(rex.match(test))
output
<re.Match object; span=(0, 19), match='This is a sentence.'>
<re.Match object; span=(0, 13), match='Hello, world.'>
<re.Match object; span=(0, 19), match='Hi there; hi there.'>
None
None
None

^(?!.*[;,]\S)(?!.* [;,])[A-Z][a-z\s,;]+\.$
It's easier to use lookaheads to rule out invalid sentences. See the demo:
https://regex101.com/r/vV1wW6/36#python
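As a quick check, the lookahead pattern above can be run against the sample inputs from the question. This is only a minimal sketch; the names SENTENCE and is_sentence are made up for illustration and do not come from the answer.

```python
import re

# Pattern copied from the answer above. The two negative lookaheads reject
# any input where a comma/semicolon is followed by a non-space character,
# or where a comma/semicolon is preceded by a space.
SENTENCE = re.compile(r'^(?!.*[;,]\S)(?!.* [;,])[A-Z][a-z\s,;]+\.$')


def is_sentence(text):
    """Return True if text satisfies the pattern (hypothetical helper)."""
    return SENTENCE.match(text) is not None


should_match = ["This is a sentence.", "Hello, world.", "Hi there; hi there."]
should_fail = ["i ate breakfast.", "This is , a sentence.",
               "What time is it?", "I a,d am."]

for sample in should_match + should_fail:
    print(repr(sample), is_sentence(sample))
```

Note that this pattern still accepts some inputs the question does not discuss, such as repeated spaces, so it may need tightening for other uses.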

I ended up modifying my regular expression to
"^[A-Z][a-z\s (a-z,\s)* (a-z;\s)*]*\.$"
and it ended up working just fine. Thanks for everyone's help!
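One caveat about the final pattern, shown as a hedged sketch: everything between the outer [ and ] is a single character class, so the parentheses, commas, semicolons and asterisks inside it are just literal characters that the class accepts. That means it passes the matching samples but also accepts "This is , a sentence.". The strict pattern below is an assumption of mine, not something from the thread; it also rejects repeated spaces and uppercase letters after the first character.

```python
import re

# Final pattern from the post above; the bracketed part is ONE character class.
final = re.compile(r"^[A-Z][a-z\s (a-z,\s)* (a-z;\s)*]*\.$")

# Hypothetical stricter alternative: a capital letter, lowercase letters, then
# any number of words, each preceded by a single space; a comma or semicolon
# is allowed only directly after a letter and directly before that space.
strict = re.compile(r"^[A-Z][a-z]*(?:[,;]? [a-z]+)*\.$")

samples = [
    "This is a sentence.",
    "Hello, world.",
    "Hi there; hi there.",
    "i ate breakfast.",
    "This is , a sentence.",
    "What time is it?",
    "I a,d am.",
]

for s in samples:
    print(repr(s), bool(final.match(s)), bool(strict.match(s)))
```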

Related

Python get arguments from string

I wanted to grab an argument from a string in Python...
I wanted to grab the city of this string: weather in <city>
How do I get the city? Into a new variable?
Use Regular Expressions!
If you haven't heard of them, it's quite simple. Simply import the re module, and away you go!
>>> import re
Ok, maybe that wasn't so exciting. But now you can use pattern matching. Simply define your pattern:
>>> pattern = r"^(?P<thing>.*?) in (?P<city>.*?)$"
and away you go!
>>> re.match(pattern, "weather in my city")
<_sre.SRE_Match object; span=(0, 18), match='weather in my city'>
Don't worry! This is actually something useful. Let's store this in a variable so we can use it:
>>> match = re.match(pattern, "weather in my city")
>>> match.group("city")
'my city'
Hooray!
Now, what was that crazy pattern thing about? It worked, but it just seems like magic. Let me explain:
r"" just makes Python treat (most) backslashes as literal backslashes. So, r"\n" will be an actual \ followed by an actual n, as opposed to a new-line character. This is because regular expressions have special meanings for \ characters, and it's awkward to have to write \\ all the time.
^ means "start of the string".
(?P<name>...) is a named group. Normal groups are represented by (...), and can be referenced by their number (e.g. match.group(0)). Named groups can also be referenced by number, but they can also be referenced by their name. The P stands for Python, because that's where the syntax originally came from. Neat!
. means "any character".
* means "repeated 0 or more times".
? means a few things, but when it's after a * or + it means "match as little as possible". This means that it will make the thing group have as few "any character"s as possible.
" in " means exactly what it looks like: a space followed by an i followed by an n followed by a space.
.*? again means "match as few of any character as possible", but... I'm not really sure why I wrote that, considering that
$ means "end of the string".
And yeah, they never really stop seeming like magic. (Unless you use Perl.) If you want to make your own regular expression or learn some more, have a look at the documentation for the re module.
If you have constant spaces in your string and your strings are not going to change, it's relatively easy. Just use split on your string.
x = "weather in <city>"
split_x = x.split(" ")
# will return you
["weather", "in", "<city>"]
city = split_x[2]
Look at split's docs. But suppose your city is something like "New York", then you'll have to look for some alternative because in that case, the list will be -
x = "weather in New York"
# O/P
["weather", "in", "New", "York"]
And then if you do this:
city = split_x[2]
you will get the wrong city name ("New" instead of "New York").
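One way around the multi-word problem, sketched under the assumption that the string always starts with exactly two single-space-separated words before the city, is to pass a maxsplit argument to split so that everything after the second space stays in one piece.

```python
x = "weather in New York"

# maxsplit=2 performs at most two splits, so the list has at most three
# items and the remainder of the string is kept together as the last item.
parts = x.split(" ", 2)
print(parts)   # ['weather', 'in', 'New York']

city = parts[2]
print(city)    # New York
```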
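With str.lstrip() (note that lstrip() strips any leading characters found in its argument rather than a literal prefix, so this only works when the city name does not begin with one of those characters):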
s = "weather in Las Vegas"
city_name = s.lstrip('weather in ')
print(city_name)
Prints:
Las Vegas
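To see why lstrip() can go wrong here, and one possible prefix-safe alternative, consider the sketch below. The helper name city_from and the constant PREFIX are hypothetical, chosen only for this illustration.

```python
# lstrip() treats its argument as a SET of characters to strip from the left,
# so lowercase city names starting with those characters get eaten too.
s = "weather in new york"
print(s.lstrip("weather in "))   # york

PREFIX = "weather in "


def city_from(text, prefix=PREFIX):
    """Return the text after the literal prefix, or None if it is absent."""
    if text.startswith(prefix):
        return text[len(prefix):]
    return None


print(city_from("weather in new york"))   # new york
print(city_from("weather in Las Vegas"))  # Las Vegas
```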

Regular expression in Python not catching all information

I have the following string:
a = '''"The cat is running to the door, he does not look hungry anymore".
Said my mom, whispering.'''
Note the line breaks. In python the string will be:
'The cat is running to the door, he does not look hungry anymore".\n \n Said my mom, whispering.'
I have this regular expression:
pattern = u'^("|«)(.*?)("|»)(.*?)\u000A{1,}(.*?)'
and I used as follows in Python:
>>> import re
>>> a = '''"The cat is running to the door, he does not look hungry anymore".
Said my mom, whispering.'''
>>> pattern = u'^("|«)(.*?)("|»)(.*?)\u000A{1,}(.*?)'
>>> re.search(pattern, a).groups()
>>> ('"', 'The cat is running to the door, he does not look hungry anymore', '"', '.', '')
Why is the last part (Said my mom, whispering.) not being captured by the regular expression?
I'm expecting something like this:
>>> ('"', 'The cat is running to the door, he does not look hungry anymore', '"', '.', 'Said my mom, whispering.')
Can you please clarify to me what I'm doing wrong?
Just removing the ? would be enough. And also it's better to include DOTALL modifier because dot in your regex by default won't match new line characters.
pattern = u'(?s)^("|«)(.*?)("|»)(.*?)\u000A{1,}(.*)'
Note that .*? is reluctant or non-greedy which means match any character zero or more times non-greedily. So it stops matching once it finds an empty string.
The problem with your expression is that (.*?) group is reluctant, meaning that it shall match as little text as possible. Since you do not ask for the match to "anchor" at the end of the input, the second group is empty.
Adding $ at the end of the regex will fix this problem:
pattern = u'^("|«)(.*?)("|»)(.*?)\u000A{1,}(.*?)$'
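The effect of the two suggested fixes can be compared side by side. This is a small sketch that assumes the input really does start with a double quote, as in the triple-quoted string from the question; the variable names are mine.

```python
import re

a = ('"The cat is running to the door, he does not look hungry anymore".\n'
     'Said my mom, whispering.')

# Original pattern: the trailing lazy group is satisfied by the empty string.
original = '^("|«)(.*?)("|»)(.*?)\u000A{1,}(.*?)'
# Fix 1: make the last group greedy (with DOTALL so . also matches newlines).
greedy = '(?s)^("|«)(.*?)("|»)(.*?)\u000A{1,}(.*)'
# Fix 2: keep the lazy group but anchor it at the end of the input.
anchored = '^("|«)(.*?)("|»)(.*?)\u000A{1,}(.*?)$'

for name, pattern in [("original", original),
                      ("greedy", greedy),
                      ("anchored", anchored)]:
    last_group = re.search(pattern, a).groups()[-1]
    print(name, repr(last_group))
```

With this input, only the original pattern leaves the last group empty; both fixes capture the second line.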
Your input does not start with the quote, and the regex requires it. Second, there is no linebreak pattern for the second line. Third, the lazy .*? at the end will match nothing, because it is allowed to match the empty string and will do so unless you add the anchor $ or use greedy matching.
Also, it is not efficient to use single characters in alternations, so I'd rather use a character class for such cases: ("|«) => ["«].
With the \s shorthand class, you can match not only linebreaks but also spaces, thus "trimming" the results in capture groups.
Here is my suggestion:
import re
p = re.compile(r'^(["«])?(.*?)(["»])?\.\s*(.*?)\s*(.*)')
test_str = "The cat is running to the door, he does not look hungry anymore\".\n\nSaid my mom, whispering."
print(p.search(test_str).groups())
See demo

How can I select a string using start and endpoints (characters)?

How can I select a string in python knowing the start and end points?
If the string is:
Evelin said, "Hi Dude! How are you?" and no one cared!!
Or something like this:
Jane said *aww! thats cute, we must try it!* John replied, "Okay!, but not now!!"
what I want to write is a function that selects from the " " not by counting the index,
but something that just selects the text from character to character,
"Hi Dude! How are you?" and "Okay!, but not now!!"
so how can I do this? is there a built in function ?
I know there is a built-in function in python that get the index of the given character
ie,
find("something") returns the index of the given string in the string.
or it need to loop through the string?
I'm just starting with python, sorry for a little question like this.
python 2 or 3 is just okay!! thank you so much!!
Update:
Thank you everyone for the answers, as a just beginner I just wanna stick with the built in split() function quotes = string.split('"')[1::2] just because its simple. thank you all. so much love :)
txt='''\
Evelin said, "Hi Dude! How are you?" and no one cared!!
Jane said *aww! thats cute, we must try it!* John replied, "Okay!, but not now!!"'''
import re
print(re.findall(r'"([^"]+)"', txt))
# ['Hi Dude! How are you?', 'Okay!, but not now!!']
You can use regular expressions if you don't want to use str.index():
import re
quotes = re.findall('"([^"]*)"', string)
You can easily extend this to extract other information from your strings as well.
Alternatively:
quotes = string.split('"')[1::2]
And using str.index():
first = string.index('"')
second = string.index('"', first+1)
quote = string[first+1:second]
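The regex and split approaches above agree on well-formed input but differ when a quote is left unclosed, which may matter when choosing between them. The sketch below wraps each in a small function; the function names are hypothetical.

```python
import re


def quoted_parts_regex(text):
    """Return the text between each complete pair of double quotes."""
    return re.findall(r'"([^"]*)"', text)


def quoted_parts_split(text):
    """Return every second piece after splitting on double quotes."""
    return text.split('"')[1::2]


line = 'Jane said *aww! thats cute, we must try it!* John replied, "Okay!, but not now!!"'
print(quoted_parts_regex(line))   # ['Okay!, but not now!!']
print(quoted_parts_split(line))   # ['Okay!, but not now!!']

# With an unclosed quote the two differ: the regex ignores the dangling
# piece, while the split version still returns it.
broken = 'a "b" c "d'
print(quoted_parts_regex(broken))  # ['b']
print(quoted_parts_split(broken))  # ['b', 'd']
```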
To extract a substring by characters, it is much easier to split on those characters; str.partition() and str.rpartition() efficiently split the string on the first or last occurrence of a given string:
extracted = inputstring.partition('"')[-1].rpartition('"')[0]
The combination of partitioning from the start and end gives you then the largest substring possible, leaving any embedded quotes in there.
Demo:
>>> inputstring = 'Evelin said, "Hi Dude! How are you?" and no one cared!!'
>>> inputstring.partition('"')
('Evelin said, ', '"', 'Hi Dude! How are you?" and no one cared!!')
>>> inputstring.rpartition('"')
('Evelin said, "Hi Dude! How are you?', '"', ' and no one cared!!')
>>> inputstring.partition('"')[-1].rpartition('"')[0]
'Hi Dude! How are you?'
str.index(str2) finds the index of str2 in str, which is the simplest approach:
a = 'Evelin said, "Hi Dude! How are you?" and no one cared!!'
print(a[1+a.index("\""):1+a.index("\"")+a[a.index("\"")+1:].index("\"")])
or, as Scorpion_God mentioned, you could simply use single quotes as below:
print(a[1+a.index('"'):1+a.index('"')+a[a.index('"')+1:].index('"')])
This will result in:
Hi Dude! How are you?
The quotes won't be included.

How to strip whitespace from before but not after punctuation in python

relative python newbie here. I have a text string output from a program I can't modify. For discussion lets say:
text = "This text . Is to test . How it works ! Will it! Or won't it ? Hmm ?"
I want to remove the space before the punctuation, but not remove the second space. I've been trying to do it with regex, and I know that I can match the instances I want using
match='\s[\?.!\"]\s' as my search term.
x=re.search('\s[\?\.\!\"]\s',text)
Is there a way with a re.sub to replace the search term with the leading whitespace removed? Any ideas on how to proceed?
Put a group around the text you want to keep and refer to that group by number in the replacement pattern:
re.sub(r'\s([?.!"](?:\s|$))', r'\1', text)
Note that I used a r'' raw string to avoid having to use too many backslashes; you didn't need to add quite so many, however.
I also adjusted the match for the following space; it now matches either a space or the end of the string.
Demo:
>>> import re
>>> text = "This text . Is to test . How it works ! Will it! Or won't it ? Hmm ?"
>>> re.sub(r'\s([?.!"](?:\s|$))', r'\1', text)
"This text. Is to test. How it works! Will it! Or won't it? Hmm?"
Use re.sub instead of re.search.
>>> text = "This text . Is to test . How it works ! Will it! Or won't it ? Hmm ?"
>>> re.sub(r'\s+([?.!"])', r'\1', text)
"This text. Is to test. How it works! Will it! Or won't it? Hmm?"
You don't need to escape ?, ., ! or " inside [] because these characters lose their special meaning inside a character class.
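The two re.sub patterns above give the same result on the question's text, but they are not equivalent. The following sketch uses a made-up input to show the difference: the first pattern only removes the space when the punctuation is followed by whitespace or the end of the string, while the second removes it before any listed punctuation mark.

```python
import re

# Made-up input: ".txt" is preceded by a space but followed by a letter.
text = "Open the file .txt now . Done !"

# First answer: punctuation must be followed by whitespace or end of string.
strict = re.sub(r'\s([?.!"](?:\s|$))', r'\1', text)

# Second answer: any whitespace run before the punctuation is removed.
loose = re.sub(r'\s+([?.!"])', r'\1', text)

print(strict)  # Open the file .txt now. Done!
print(loose)   # Open the file.txt now. Done!
```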

"ReplaceWith" & - but only part of it

I need to perform a search/replace on text which contains a comma which is NOT followed by a space, to change to a comma+space.
So I can find this using:
,[^\s]
But I am struggling with the replacement; I can't just use:
, (space, comma)
Or
& ,
As the match originally matches two characters.
Is there a way of saying '&' - 1 ? or '&[0]' or something which means; 'The Matched String, but only part of it' in the replacement argument ?
Another way of trying to ask this:
Can I use Regex to IDENTIFY one part of my string.
But REPLACE a (slightly different,but related) part of my string.
I could just probably replace every comma with a comma+space, but this is a little more controlled and less likely to make a change I do not need....
For example:
Original:
Hello,World.
Should become:
Hello, World.
But:
Hello, World.
Should remain as :
Hello, World.
And currently, using my (bad) pattern I have:
Original:
Hello,World
After (wrong):
Hello, orld
I'm actually using Python's (2.6) 're' module for this as it happens.
Using parentheses to capture a part of the string is one way to do it. Another possibility is to use a "lookahead assertion":
,(?=\S)
This pattern matches a comma only if it is followed by a non-whitespace character. The lookahead does not consume the character after the comma; it only uses it to decide whether or not to match the comma.
For example:
>>> re.sub(r",(?=\S)", ", ", "Hello,World! Hello, World!")
'Hello, World! Hello, World!'
Yes, use parentheses to "capture" part of the string that matches your expression. I'm not up to speed on Python's implementation, but it should give you some kind of array called match[] whose elements correspond to the captures.
Yes, you could. But why would you, in this simple case?
def insertspaceaftercomma(s):
    """Insert a space after every comma, then remove doubled whitespace after a comma (if any)."""
    return s.replace(",", ", ").replace(",  ", ", ")
seems to work:
>>> insertspaceaftercomma("Hello, World")
'Hello, World'
>>> insertspaceaftercomma("Hello,World")
'Hello, World'
>>>
You can look for a comma + non-space character and then stick a space in between them:
re.sub(r',([^\s])', r', \1', string)
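The capture-group and lookahead versions behave the same on the question's examples, but they can differ when two commas are adjacent, because the capture group consumes the character after the comma while the lookahead does not. A minimal sketch, with hypothetical function names:

```python
import re


def with_capture(s):
    # Matches the comma AND the next non-space character, then puts that
    # character back via the \1 backreference.
    return re.sub(r',(\S)', r', \1', s)


def with_lookahead(s):
    # Matches only the comma; the next character is inspected, not consumed.
    return re.sub(r',(?=\S)', ', ', s)


for sample in ["Hello,World.", "Hello, World.", "a,,b"]:
    print(repr(with_capture(sample)), repr(with_lookahead(sample)))
```

For "a,,b" the capture version yields 'a, ,b' because the second comma was consumed as part of the first match, whereas the lookahead version yields 'a, , b'.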
Try this:
import re
s1 = 'Hello,World.'
print(re.sub(r',([^\s])', r', \g<1>', s1))
# Hello, World.
s2 = 'Hello, World.'
print(re.sub(r',([^\s])', r', \g<1>', s2))
# Hello, World.
