I need to perform a search/replace on text which contains a comma which is NOT followed by a space, to change to a comma+space.
So I can find this using:
,[^\s]
But I am struggling with the replacement; I can't just use:
, (comma, space)
Or
& ,
As the match originally matches two characters.
Is there a way of saying '&' - 1, or '&[0]', or something that means 'the matched string, but only part of it' in the replacement argument?
Another way of trying to ask this:
Can I use a regex to IDENTIFY one part of my string,
but REPLACE a (slightly different, but related) part of it?
I could just probably replace every comma with a comma+space, but this is a little more controlled and less likely to make a change I do not need....
For example:
Original:
Hello,World.
Should become:
Hello, World.
But:
Hello, World.
Should remain as :
Hello, World.
And currently, using my (bad) pattern I have:
Original:
Hello,World
After (wrong):
Hello, orld
I'm actually using Python's (2.6) 're' module for this as it happens.
Using parentheses to capture a part of the string is one way to do it. Another possibility is to use a "lookahead assertion":
,(?=\S)
This pattern matches a comma only if it is followed by a non-whitespace character. It does not consume anything after the comma, but uses that information to decide whether or not to match the comma.
For example:
>>> re.sub(r",(?=\S)", ", ", "Hello,World! Hello, World!")
'Hello, World! Hello, World!'
Yes, use parentheses to "capture" the part of the string that matches your expression. In Python's re module the captures are available from the match object via group(1), group(2), and so on, and inside a re.sub replacement string as \1, \2, and so on.
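As a rough sketch of the capture approach described above (the sample text and variable names here are only illustrative), the character after the comma is captured in group 1 and put back by the replacement:

```python
import re

text = "Hello,World! Hello, World!"

# Group 1 captures the single non-whitespace character after the comma.
pattern = re.compile(r",(\S)")

# A match object exposes captures through group(), not through an array.
m = pattern.search(text)
whole_match = m.group(0)    # the full two-character match
first_capture = m.group(1)  # only the captured character

# \1 in the replacement re-inserts whatever group 1 captured.
fixed = pattern.sub(r", \1", text)
print(whole_match, first_capture, fixed)
```

Commas that are already followed by a space never match, so they are left alone.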
Yes, you could. But why would you, in this simple case?
def insertspaceaftercomma(s):
    """Insert a space after every comma, then collapse a doubled space after a comma (if any)."""
    return s.replace(",", ", ").replace(",  ", ", ")
seems to work:
>>> insertspaceaftercomma("Hello, World")
'Hello, World'
>>> insertspaceaftercomma("Hello,World")
'Hello, World'
>>>
You can look for a comma + non-space character and then stick a space in between them:
re.sub(r',([^\s])', r', \1', string)
Try this:
import re
s1 = 'Hello,World.'
re.sub(r',([^\s])', r', \g<1>', s1)
> Hello, World.
s2 = 'Hello, World.'
re.sub(r',([^\s])', r', \g<1>', s2)
> Hello, World.
I am trying to replace a specific part of my string. Every time I have a slash followed by a capital letter, I want the slash to be replaced with a tab. Like in this case:
Hello/My daugher/son
The output should look like
Hello My daugher/son
I have tried to use re.sub():
for x in a:
    x = re.sub('\/[A-Z]', '\t[A-Z]', x)
But then my output changes into:
Hello [A-Z]y daugher/son
Which is really not what I want. Is there a better way to tackle this, maybe not in regex?
You can replace /(?=[A-Z]) with \t. Notice in Python you don't need to escape / as \/
Check this Python code,
import re
s = 'Hello/My daugher/son'
print(re.sub(r'/(?=[A-Z])',r'\t',s))
Prints,
Hello My daugher/son
Alternatively, following the way you were trying to replace, you need to capture the capital letter in a group using the regex /([A-Z]) and then replace it with \t\1 to restore what was captured in group 1. Check this Python code:
import re
s = 'Hello/My daugher/son'
print(re.sub(r'/([A-Z])',r'\t\1',s))
Again prints,
Hello My daugher/son
How can I select a string in python knowing the start and end points?
If the string is:
Evelin said, "Hi Dude! How are you?" and no one cared!!
Or something like this:
Jane said *aww! thats cute, we must try it!* John replied, "Okay!, but not now!!"
what I want to write is a function that selects from the " " not by counting the index,
but something that just selects the text from character to character,
"Hi Dude! How are you?" and "Okay!, but not now!!"
So how can I do this? Is there a built-in function?
I know there is a built-in method in Python that gets the index of a given substring,
i.e.
find("something") returns the index of the given string in the string.
Or do I need to loop through the string?
I'm just starting with python, sorry for a little question like this.
python 2 or 3 is just okay!! thank you so much!!
Update:
Thank you everyone for the answers. As a beginner I am going to stick with the built-in split() approach, quotes = string.split('"')[1::2], just because it's simple. Thank you all so much :)
txt='''\
Evelin said, "Hi Dude! How are you?" and no one cared!!
Jane said *aww! thats cute, we must try it!* John replied, "Okay!, but not now!!"'''
import re
print(re.findall(r'"([^"]+)"', txt))
# ['Hi Dude! How are you?', 'Okay!, but not now!!']
You can use regular expressions if you don't want to use str.index():
import re
quotes = re.findall('"([^"]*)"', string)
You can easily extend this to extract other information from your strings as well.
Alternatively:
quotes = string.split('"')[1::2]
And using str.index():
first = string.index('"')
second = string.index('"', first+1)
quote = string[first+1:second]
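The split-based variant works because splitting on the quote character alternates between text outside and text inside the quotes, so every second piece starting at index 1 is a quoted part. A small sketch (the helper name quoted_parts is made up for illustration):

```python
def quoted_parts(text):
    """Return the pieces of text that sit between double quotes."""
    pieces = text.split('"')
    # pieces alternate: outside, inside, outside, inside, ...
    return pieces[1::2]

line = 'Jane said *aww! thats cute, we must try it!* John replied, "Okay!, but not now!!"'
print(line.split('"'))     # three pieces; the last one is an empty string
print(quoted_parts(line))  # ['Okay!, but not now!!']

# Caveat: with an unbalanced quote, the trailing fragment is returned too.
unbalanced = quoted_parts('a "b" c "d')
print(unbalanced)          # ['b', 'd']
```

If the input may contain an unmatched quote, the re.findall version is probably the safer choice, since it only returns text that has both an opening and a closing quote.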
To extract a substring by characters, it is much easier to split on those characters; str.partition() and str.rpartition() efficiently split the string on the first or last occurrence of a given string:
extracted = inputstring.partition('"')[-1].rpartition('"')[0]
The combination of partitioning from the start and from the end then gives you the largest substring possible, leaving any embedded quotes in there.
Demo:
>>> inputstring = 'Evelin said, "Hi Dude! How are you?" and no one cared!!'
>>> inputstring.partition('"')
('Evelin said, ', '"', 'Hi Dude! How are you?" and no one cared!!')
>>> inputstring.rpartition('"')
('Evelin said, "Hi Dude! How are you?', '"', ' and no one cared!!')
>>> inputstring.partition('"')[-1].rpartition('"')[0]
'Hi Dude! How are you?'
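To illustrate the "largest substring" remark, here is a sketch on a made-up input with two quoted parts, compared against the re.findall approach from the other answers:

```python
import re

# Hypothetical input with two separate quoted parts.
sample = 'She said "one" and then "two" before leaving'

# partition/rpartition: everything between the FIRST and the LAST quote.
outer = sample.partition('"')[-1].rpartition('"')[0]

# findall: each quoted part on its own.
each = re.findall(r'"([^"]*)"', sample)

print(outer)  # one" and then "two
print(each)   # ['one', 'two']
```

Which of the two is wanted depends on whether embedded quotes should be kept or treated as separators.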
str.index(str2) finds the index of str2 in str, which is the simplest approach:
a = 'Evelin said, "Hi Dude! How are you?" and no one cared!!'
print(a[1+a.index("\""):1+a.index("\"")+a[a.index("\"")+1:].index("\"")])
or, as Scorpion_God mentioned, you could simply use single quotes as below:
print(a[1+a.index('"'):1+a.index('"')+a[a.index('"')+1:].index('"')])
This will result in:
Hi Dude! How are you?
The quotes themselves are not included.
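The same index-based idea may be easier to read as a small helper. This is only a sketch: the name between is invented here, it assumes single-character delimiters, and it uses str.find (which returns -1 instead of raising) so a missing delimiter gives None:

```python
def between(text, start_char, end_char):
    """Return the text between the first start_char and the next end_char, or None."""
    start = text.find(start_char)
    if start == -1:
        return None
    # Begin the second search just past the opening delimiter.
    end = text.find(end_char, start + 1)
    if end == -1:
        return None
    return text[start + 1:end]

a = 'Evelin said, "Hi Dude! How are you?" and no one cared!!'
print(between(a, '"', '"'))  # Hi Dude! How are you?

b = 'Jane said *aww! thats cute, we must try it!* John replied, "Okay!, but not now!!"'
print(between(b, '*', '*'))  # aww! thats cute, we must try it!
```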
Let's say I have a string like:
data = 'MESSAGE: Hello world!END OF MESSAGE'
And I want to get the string between 'MESSAGE: ' and the next capitalized word. There are never any fully capitalized words in the message.
I tried to get this by using this regular expression in re.search:
re.search('MESSAGE: (.*)([A-Z]{2,})', data).group(1)
Here I would like it to output 'Hello world!', but it always returns the wrong result. It is very easy with regular expressions to find a substring that occurs between two other strings, but how do you find a substring between strings that are matches for a regular expression? I have tried making it a raw string, but that didn't seem to work.
I hope I am expressing myself well. I have extensive experience in Python but am new to regular expressions. If possible, I would like an explanation along with an example of how to make my specific example code work. Any helpful posts are greatly appreciated.
BTW, I am using Python 3.3.
Your code doesn't work but for the opposite reason:
re.search('MESSAGE: (.*)([A-Z]{2,})', data).group(1)
would match
'Hello world!END OF MESSA'
because (.*) is "greedy", i.e. it matches the most that will allow the rest (two uppercase chars) to match. You need to use a non-greedy quantifier with
re.search('MESSAGE: (.*?)([A-Z]{2,})', data).group(1)
that correctly matches
'Hello world!'
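Printing both groups side by side makes the difference visible. A quick sketch using the data string from the question:

```python
import re

data = 'MESSAGE: Hello world!END OF MESSAGE'

# Greedy: (.*) takes as much as it can, leaving only the last two
# uppercase letters for the second group.
greedy = re.search(r'MESSAGE: (.*)([A-Z]{2,})', data)

# Lazy: (.*?) stops at the first place where the second group can match.
lazy = re.search(r'MESSAGE: (.*?)([A-Z]{2,})', data)

print(greedy.groups())  # ('Hello world!END OF MESSA', 'GE')
print(lazy.groups())    # ('Hello world!', 'END')
```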
One little question mark:
re.search('MESSAGE: (.*?)([A-Z]{2,})', data).group(1)
Out[91]: 'Hello world!'
If you make the first capturing group lazy, it won't consume anything after the exclamation point.
You need your .* to be non-greedy (see the first ?), which means that it stops matching at the point where the next item could match. The second group can also be made non-capturing (see the ?:), since its contents are not used.
import re
data = 'MESSAGE: Hello world!END OF MESSAGE'
regex = r'MESSAGE: (.*?)(?:[A-Z]{2,})'
re.search(regex, data).group(1)
Returns:
'Hello world!'
Alternatively, you could use this:
regex = r'MESSAGE: (.*?)[A-Z]{2,}'
To break this down (I'll include the search line with the VERBOSE flag):
regex = r'''
    MESSAGE:\s       # first part, \s for the space (matches whitespace)
    (.*?)            # non-greedy, anything but a newline
    (?:[A-Z]{2,})    # a secondary group, but non-capturing,
                     # good for alternatives separated by a pipe, |
'''
re.search(regex, data, re.VERBOSE).group(1)
In Python, I am extracting emails from a string like so:
split = re.split(" ", string)
emails = []
pattern = re.compile(r"^[a-zA-Z0-9_\.-]+#[a-zA-Z0-9-]+.[a-zA-Z0-9-\.]+$")
for bit in split:
    result = pattern.match(bit)
    if result is not None:
        emails.append(bit)
And this works, as long as there is a space in between the emails. But this might not always be the case. For example:
Hello, foo#foo.com
would return:
foo#foo.com
but, take the following string:
I know my best friend mailto:foo#foo.com!
This would return an empty list. So the question is: how can I make it so that a regex is the delimiter to split? I would want to get
foo#foo.com
in all cases, regardless of punctuation next to it. Is this possible in Python?
By "splitting by regex" I mean that if the program encounters the pattern in a string, it will extract that part and put it into a list.
I'd say you're looking for re.findall:
>>> email_reg = re.compile(r'[a-zA-Z0-9_.-]+#[a-zA-Z0-9-]+\.[a-zA-Z0-9-.]+')
>>> email_reg.findall('I know my best friend mailto:foo#foo.com!')
['foo#foo.com']
Notice that findall can handle more than one email address:
>>> email_reg.findall('Text text foo#foo.com, text text, baz#baz.com!')
['foo#foo.com', 'baz#baz.com']
Use re.search or re.findall.
You also need to escape your expression properly (. needs to be escaped outside of character classes, not inside) and remove/replace the anchors ^ and $ (for example with \b), eg:
r"\b[a-zA-Z0-9_.+-]+#[a-zA-Z0-9-]+\.[a-zA-Z0-9-.]+\b"
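A sketch of what the \b anchors buy you, using the same '#' separator as the posts above and a made-up sentence. Without the trailing \b, a full stop right after the address is swallowed because '.' is part of the last character class:

```python
import re

# Pattern without word boundaries (as in the findall answer above).
loose = re.compile(r'[a-zA-Z0-9_.-]+#[a-zA-Z0-9-]+\.[a-zA-Z0-9-.]+')

# Same idea with \b on both sides.
bounded = re.compile(r'\b[a-zA-Z0-9_.+-]+#[a-zA-Z0-9-]+\.[a-zA-Z0-9-.]+\b')

# Hypothetical text: one address ends a sentence, one follows "mailto:".
text = 'Write to foo#foo.com. Or try mailto:baz#baz.com!'

print(loose.findall(text))    # ['foo#foo.com.', 'baz#baz.com']
print(bounded.findall(text))  # ['foo#foo.com', 'baz#baz.com']
```

The trailing \b forces the match to end on a word character, so the engine backtracks off the final dot.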
The problem I see in your regex is your use of ^, which matches the start of the string, and $, which matches the end of the string. If you remove them and then run it with your sample test case, it will work:
>>> re.findall(r"[A-Za-z0-9\._-]+#[A-Za-z0-9-]+.[A-Za-z0-9-\.]+","I know my best friend mailto:foo#foo.com!")
['foo#foo.com']
>>> re.findall(r"[A-Za-z0-9\._-]+#[A-Za-z0-9-]+.[A-Za-z0-9-\.]+","Hello, foo#foo.com")
['foo#foo.com']
>>>
When you use variables (is that the correct word?) in Python regular expressions like this: "blah (?P<value>\w+)" ("value" would be the variable), how can you make the variable's value be the text after "blah " up to the end of the line or up to a certain character, without paying any attention to the actual content of the variable? For example, this is pseudo-code for what I want:
>>> import re
>>> p = re.compile("say (?P<value>continue_until_text_after_assignment_is_recognized) endsay")
>>> m = p.match("say Hello hi yo endsay")
>>> m.group('value')
'Hello hi yo'
Note: The title is probably not understandable. That is because I didn't know how to say it. Sorry if I caused any confusion.
For that you'd want a regular expression of
"say (?P<value>.+) endsay"
The period matches any character, and the plus sign indicates that that should be repeated one or more times... so .+ means any sequence of one or more characters. When you put endsay at the end, the regular expression engine will make sure that whatever it matches does in fact end with that string.
You need to specify what you want to match if the text is, for example,
say hello there and endsay but some more endsay
If you want to match the whole hello there and endsay but some more substring, @David's answer is correct. Otherwise, to match just hello there and, the pattern needs to be:
say (?P<value>.+?) endsay
with a question mark after the plus sign to make it non-greedy (by default it's greedy, gobbling up all it possibly can while allowing an overall match; non-greedy means it gobbles as little as possible, again while allowing an overall match).
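Running both patterns on the ambiguous sentence above shows the two behaviours. A minimal sketch:

```python
import re

text = "say hello there and endsay but some more endsay"

# Greedy .+ runs to the LAST " endsay" in the string.
greedy = re.match(r"say (?P<value>.+) endsay", text)

# Non-greedy .+? stops at the FIRST " endsay".
lazy = re.match(r"say (?P<value>.+?) endsay", text)

print(greedy.group('value'))  # hello there and endsay but some more
print(lazy.group('value'))    # hello there and
```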