RegEx for ignoring parentheses in a string - python

There is a string like this:
strs = "Tierd-Branden This is (L.A.) 105 / New (Even L.A.A)"
The following code does not give me the output I expect:
import re
strs = "Tierd-Branden This is (L.A.) 105 / New (Even L.A.A)"
print(re.findall(r"[\w']+[\w\.]", strs))
I expect this:
['Tierd', 'Branden', 'This', 'is', 'L.A.', '105', 'New', 'Even', 'L.A.A']
But, I get this:
['Tierd', 'Branden', 'This', 'is', 'L.', 'A.', '105', 'New', 'Even', 'L.', 'A.']
My question is: how do I keep the dotted content of the parentheses (such as L.A.) together as a single list element?

The [\w']+[\w\.] pattern matches 1 or more word or ' chars and then a single word or . char. Hence, a dot can only be the last char of a match, so chunks such as L.A. are broken up after each dot.
I suggest using
r"\w[\w'.]*"
See the regex demo.
Details
\w - a word char
[\w'.]* - 0 or more word, ' and . chars.
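As a quick check, here is a minimal sketch that applies the suggested pattern to the string from the question. The variable name tokens is only illustrative.

```python
import re

strs = "Tierd-Branden This is (L.A.) 105 / New (Even L.A.A)"

# \w       - one word char starts the token
# [\w'.]*  - followed by any run of word chars, apostrophes and dots
tokens = re.findall(r"\w[\w'.]*", strs)
print(tokens)
# ['Tierd', 'Branden', 'This', 'is', 'L.A.', '105', 'New', 'Even', 'L.A.A']
```

Because the first char must be a word char, stray punctuation such as ( or / never starts a token.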

This regex might also return your desired output: simply list all the chars you want to allow inside the []. You can use a capturing group if you wish to refer to the match as group 1 (for example, match.group(1) in Python). You can add any other chars you need to the [], and if those chars are metachars, you can escape them with \.
([A-Za-z0-9\.]+)
You can remove the capturing group, and it might still work:
[A-Za-z0-9\.]+
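A small sketch of how the two variants might behave in Python, assuming the string from the question. The names with_group, without_group and first are made up for this example, and the dot is left unescaped because it is literal inside a character class.

```python
import re

strs = "Tierd-Branden This is (L.A.) 105 / New (Even L.A.A)"

# With a single capturing group, findall returns the group contents.
with_group = re.findall(r"([A-Za-z0-9.]+)", strs)

# Without the group, findall returns the whole matches, which are the same here.
without_group = re.findall(r"[A-Za-z0-9.]+", strs)

# With search/match objects, the group is read via .group(1).
first = re.search(r"([A-Za-z0-9.]+)", strs).group(1)

print(with_group == without_group)  # True
print(first)                        # Tierd
```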

Related

Why are there space outcome in my re.split() result

I want to extract the strings inside the brackets and single quotes of a given string, e.g. given ['this'], extract this.
However, the following example and its result keep puzzling me:
import re
target_string = "['this']['current']"
result = re.split(r'[\[|\]|\']+', target_string)
print(result)
I got
['', 'this', 'current', '']
# I expect ['this', 'current']
I really don't understand where the first and last '' in the result come from. The input target_string has no leading or trailing space, so I don't expect them to appear in the result.
Can anybody help me fix this, please?
re.split splits every time the pattern is found. Since your string starts and ends with a match of the pattern, it outputs a '' at the beginning and at the end (the empty text before the first delimiter and after the last one), so that the pieces joined with the delimiters would reproduce the original string.
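To make this concrete, here is a minimal sketch that reproduces the empty strings and then filters them out. The character class is written without the | chars, since | is just a literal inside []; the names parts and words are illustrative.

```python
import re

target_string = "['this']['current']"

# The string starts with [' and ends with '], so the text before the
# first delimiter and after the last one is empty.
parts = re.split(r"[\[\]']+", target_string)
print(parts)  # ['', 'this', 'current', '']

# Dropping the empty pieces gives the expected result.
words = [p for p in parts if p]
print(words)  # ['this', 'current']
```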
If you want to capture, why don't you use re.findall instead of re.split? The usage is very simple if you only have one word per bracket.
target_string = "['this']['current']"
re.findall(r"\w+", target_string)
output
['this', 'current']
Note the above will not work for:
['this is the', 'current']
For such a case you can use lookahead (?=...) and lookbehind (?<=...) and capture everything in a nongreedy way .+?
target_string = "['this is the', 'current']"
re.findall(r"(?<=\[').+?(?='\])", target_string)  # for this input, equivalent to r"\['(.+)'\]"
output (one string holding everything between [' and ']):
["this is the', 'current"]
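If the goal is to get each quoted item separately, one possible alternative (not part of the answer above) is to match the quotes themselves and capture what lies between them. This sketch assumes the items never contain a single quote; the helper name quoted_items is hypothetical.

```python
import re

def quoted_items(text):
    # '([^']*)' - a quote, then any run of non-quote chars (captured), then a quote
    return re.findall(r"'([^']*)'", text)

print(quoted_items("['this is the', 'current']"))  # ['this is the', 'current']
print(quoted_items("['this']['current']"))         # ['this', 'current']
```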

Get all non-word characters excluding those in a url

I am attempting to write a regex in python that will match all non-word characters (spaces, slashes, colons, etc.) excluding those that exist in a url. I know I can get all non-word characters with \W+ and I also have a regex to get urls: https*:\/\/[\w\.]+\.[a-zA-Z]*[\w\/\-]+|https*:\/\/[\w\.]+\.[a-zA-Z]*|[\w\.]+\.[a-zA-Z]*\/[\w\/\-]+ but I can't figure out a way to combine them. What would be the best way to get what I need here?
EDIT
To clarify, I am trying to split on this regex. When I use re.split() with the following regex: https*:\/\/[\w\.]+\.[a-zA-Z]*[\w\/\-]+|https*:\/\/[\w\.]+\.[a-zA-Z]*|[\w\.]+\.[a-zA-Z]*\/[\w\/\-]+|(\W) I end up with something like the following:
INPUT:
this is a test: https://www.google.com
OUTPUT:
['this', ' ', 'is', ' ', 'a', ' ', 'test', ':', '', ' ', '', None, '']
What I'm hoping to get is this:
['this', 'is', 'a', 'test', 'https://www.google.com']
This is how I'm splitting:
import re
message = 'this is a test: https://www.google.com'
re.split(r"https*:\/\/[\w\.]+\.[a-zA-Z]*[\w\/\-]+|https*:\/\/[\w\.]+\.[a-zA-Z]*|[\w\.]+\.[a-zA-Z]*\/[\w\/\-]+|(\W)", message)
You should use the reverse logic: match a URL pattern or any one or more word chars:
import re
rx = r"https*://[\w.]+\.[\w/-]*|[\w.]+\.[a-zA-Z]*/[\w/-]+|\w+"
message = 'this is a test: https://www.google.com'
print( re.findall(rx, message) )
# => ['this', 'is', 'a', 'test', 'https://www.google.com']
See the Python demo.
Note I shortened your URL pattern: you had two similar alternatives, https*:\/\/[\w\.]+\.[a-zA-Z]*[\w\/\-]+ and https*:\/\/[\w\.]+\.[a-zA-Z]*, where [a-zA-Z]* is redundant since it matches zero or more letters and the following [\w\/\-]+ pattern already requires one or more word, / or - chars. You also do not have to escape slashes, or dots inside character classes, so the unnecessary escapes are removed here.
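The same idea can be wrapped in a small helper, sketched here under the assumption that the shortened pattern is good enough for your URLs. The names URL_OR_WORD and tokenize, and the example.com URL, are made up for illustration.

```python
import re

# URL alternatives come first so that a URL is consumed whole
# before the generic \w+ alternative gets a chance to split it.
URL_OR_WORD = re.compile(r"https*://[\w.]+\.[\w/-]*|[\w.]+\.[a-zA-Z]*/[\w/-]+|\w+")

def tokenize(message):
    return URL_OR_WORD.findall(message)

print(tokenize("this is a test: https://www.google.com"))
# ['this', 'is', 'a', 'test', 'https://www.google.com']
print(tokenize("see https://example.com/a-b/c now"))
# ['see', 'https://example.com/a-b/c', 'now']
```

The order of the alternatives matters: with \w+ first, the URL would be broken into https, www, and so on.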

python re split at all space and punctuation except for the apostrophe

I want to split a string on all spaces and punctuation except for the apostrophe. Preferably, a single quote should still be used as a delimiter except when it is an apostrophe. I also want to keep the delimiters.
example string
words = """hello my name is 'joe.' what's your's"""
Here is my re pattern thus far: splitted = re.split(r"[^'-\w]",words.lower())
I tried putting the single quote after the ^ character, but it is not working.
My desired output is this: splitted = [hello,my,name,is,joe,.,what's,your's]
It might be simpler to split first without worrying about the quotes, and then clean up the resulting list:
>>> import re
>>> words = """hello my name is 'joe.' what's your's"""
>>> split_words = re.split(r"[ ,.!?]", words.lower()) # add punctuation you want to split on
>>> split_words
['hello', 'my', 'name', 'is', "'joe", "'", "what's", "your's"]
>>> [word.strip("'") for word in split_words if word.strip("'")]  # strip quotes, drop empty entries
['hello', 'my', 'name', 'is', 'joe', "what's", "your's"]
One option is to make use of lookarounds to split at the desired positions, and use a capture group for what you want to keep in the split.
After the split, you can remove the empty entries from the resulting list.
\s+|(?<=\s)'|'(?=\s)|(?<=\w)([,.!?])
The pattern matches
\s+ Match 1 or more whitespace chars
| Or
(?<=\s)' Match ' preceded by a whitespace char
| Or
'(?=\s) Match ' when followed by a whitespace char
| Or
(?<=\w)([,.!?]) Capture one of , . ! ? in group 1, when preceded by a word character
See a regex demo and a Python demo.
Example
import re
pattern = r"\s+|(?<=\s)'|'(?=\s)|(?<=\w)([,.!?])"
words = """hello my name is 'joe.' what's your's"""
result = [s for s in re.split(pattern, words) if s]
print(result)
Output
['hello', 'my', 'name', 'is', 'joe', '.', "what's", "your's"]
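To see why the filtering step is needed, this sketch looks at the unfiltered result of re.split with the same pattern. When the pattern contains a capture group, re.split also returns the group's value for every match, which is None whenever the group did not take part in the match. The names raw and result are illustrative.

```python
import re

pattern = r"\s+|(?<=\s)'|'(?=\s)|(?<=\w)([,.!?])"
words = """hello my name is 'joe.' what's your's"""

raw = re.split(pattern, words)

# raw contains None (matches where group 1 did not participate)
# and '' (adjacent delimiters with nothing between them).
print(None in raw, '' in raw)  # True True

# "if s" drops both None and '' in one go.
result = [s for s in raw if s]
print(result)
# ['hello', 'my', 'name', 'is', 'joe', '.', "what's", "your's"]
```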
I love regex golf!
words = """hello my name is 'joe.' what's your's"""
splitted = re.findall(r"\b(?:\w'\w|\w)+\b", words)
The part in the parentheses is a non-capturing group that matches either an apostrophe surrounded by word chars or a single word char.
EDIT:
This is more flexible:
re.findall(r"\b(?:(?<=\w)'(?=\w)|\w)+\b", words)
It's getting a bit unreadable at this point, though; in practice you should probably use Woodford's answer.
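A short sketch of where the two patterns differ, using a made-up input with two apostrophes in one word. On the string from the question both give the same tokens.

```python
import re

simple = r"\b(?:\w'\w|\w)+\b"
flexible = r"\b(?:(?<=\w)'(?=\w)|\w)+\b"

words = """hello my name is 'joe.' what's your's"""
print(re.findall(simple, words) == re.findall(flexible, words))  # True

# \w'\w consumes the char after the apostrophe, so a second apostrophe
# right after it cannot be matched; the lookaround version consumes only
# the apostrophe itself and keeps going.
print(re.findall(simple, "rock'n'roll"))    # ["rock'n", 'roll']
print(re.findall(flexible, "rock'n'roll"))  # ["rock'n'roll"]
```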

How to select words with apostrophe using regular expression

I am trying to separate a string into a list, but I need to keep words that are joined by apostrophes together. For example:
String="My name is Melvin_JESUS, Guatemala, Dean'Olvier, 501soy...# 1231 !"
should give me a result as:
['my', 'name', 'is', 'melvin', 'jesus', 'guatemala', '"dean'oliver"', 'soy']
I have tried the following regular expression:
my_patern= r"(?:^|(?<=\s)|-)[A-Za-z'\.]+(?=\s|\t|$|\b)"
but it doesn't give me the desired results.
You may use
(?<![^\W\d_])[^\W\d_]+(?:['.][^\W\d_]+)*(?![^\W\d_])
See the regex demo
Details
(?<![^\W\d_]) - no letter right before the match is allowed
[^\W\d_]+ - 1 or more letters
(?:['.][^\W\d_]+)* - 0 or more sequences of ' or . and then 1+ letters
(?![^\W\d_]) - no letter right after the match is allowed.
In Python, use
re.findall(r"(?<![^\W\d_])[^\W\d_]+(?:['.][^\W\d_]+)*(?![^\W\d_])", text)
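As a rough check, this sketch runs the pattern on the string from the question, lowercasing it first because the expected output is lowercase. The names LETTER_WORDS and tokens are made up for this example; note that the sample string spells the name Dean'Olvier, so that is what comes back.

```python
import re

text = "My name is Melvin_JESUS, Guatemala, Dean'Olvier, 501soy...# 1231 !"

# [^\W\d_] is "a word char that is neither a digit nor an underscore", i.e. a letter.
LETTER_WORDS = re.compile(r"(?<![^\W\d_])[^\W\d_]+(?:['.][^\W\d_]+)*(?![^\W\d_])")

tokens = LETTER_WORDS.findall(text.lower())
print(tokens)
# ['my', 'name', 'is', 'melvin', 'jesus', 'guatemala', "dean'olvier", 'soy']
```

The pattern is written as a double-quoted raw string because it contains a single quote.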

How can I split at word boundaries with regexes?

I'm trying to do this:
import re
sentence = "How are you?"
print(re.split(r'\b', sentence))
The result being
[u'How are you?']
I want something like [u'How', u'are', u'you', u'?']. How can this be achieved?
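Unfortunately, before Python 3.7, re.split could not split on empty matches.
To get around this, you would need to use findall instead of split.
Actually \b just means word boundary.
Inside a string, it is equivalent to (?<=\w)(?=\W)|(?<=\W)(?=\w); it also matches at the start or end of the string when a word char is there.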
That means, the following code would work:
import re
sentence = "How are you?"
print(re.findall(r'\w+|\W+', sentence))
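For comparison, on Python 3.7 and later re.split does accept a pattern that can match an empty string, so splitting on \b directly works there. A minimal sketch, assuming such a Python version; the names parts and tokens are illustrative.

```python
import re

sentence = "How are you?"

# Requires Python 3.7+: earlier versions ignore empty matches in re.split.
parts = re.split(r"\b", sentence)
print(parts)   # ['', 'How', ' ', 'are', ' ', 'you', '?']

# Drop the empty and whitespace-only pieces to keep words and punctuation.
tokens = [p for p in parts if p.strip()]
print(tokens)  # ['How', 'are', 'you', '?']
```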
import re
split = re.findall(r"[\w']+|[.,!?;]", "How are you?")
print(split)
Output:
['How', 'are', 'you', '?']
Regex Explanation:
"[\w']+|[.,!?;]"
1st Alternative: [\w']+
[\w']+ match a single character present in the list below
Quantifier: + Between one and unlimited times, as many times as possible, giving back as needed [greedy]
\w match any word character [a-zA-Z0-9_]
' the literal character '
2nd Alternative: [.,!?;]
[.,!?;] match a single character present in the list below
.,!?; a single character in the list .,!?; literally
Here is my approach to split on word boundaries:
re.split(r"\b\W\b", "How are you?") # Reprocess list to split on special characters.
# Result: ['How', 'are', 'you?']
and using findall on word boundaries
re.findall(r"\b\w+\b", "How are you?")
# Result: ['How', 'are', 'you']
