Get all non-word characters excluding those in a url - python

I am attempting to write a regex in python that will match all non-word characters (spaces, slashes, colons, etc.) excluding those that exist in a url. I know I can get all non-word characters with \W+ and I also have a regex to get urls: https*:\/\/[\w\.]+\.[a-zA-Z]*[\w\/\-]+|https*:\/\/[\w\.]+\.[a-zA-Z]*|[\w\.]+\.[a-zA-Z]*\/[\w\/\-]+ but I can't figure out a way to combine them. What would be the best way to get what I need here?
EDIT
To clarify, I am trying to split on this regex. When I call re.split() with the following regex: https*:\/\/[\w\.]+\.[a-zA-Z]*[\w\/\-]+|https*:\/\/[\w\.]+\.[a-zA-Z]*|[\w\.]+\.[a-zA-Z]*\/[\w\/\-]+|(\W) I end up with something like the following:
INPUT:
this is a test: https://www.google.com
OUTPUT:
['this', ' ', 'is', ' ', 'a', ' ', 'test', ':', '', ' ', '', None, '']
What I'm hoping to get is this:
['this', 'is', 'a', 'test', 'https://www.google.com']
This is how I'm splitting:
import re
message = 'this is a test: https://www.google.com'
re.split(r"https*:\/\/[\w\.]+\.[a-zA-Z]*[\w\/\-]+|https*:\/\/[\w\.]+\.[a-zA-Z]*|[\w\.]+\.[a-zA-Z]*\/[\w\/\-]+|(\W)", message)

You should use a reverse logic, match a URL pattern or any one or more word chars:
import re
rx = r"https*://[\w.]+\.[\w/-]*|[\w.]+\.[a-zA-Z]*/[\w/-]+|\w+"
message = 'this is a test: https://www.google.com'
print( re.findall(rx, message) )
# => ['this', 'is', 'a', 'test', 'https://www.google.com']
See the Python demo.
Note that I shortened your URL pattern. You had two similar alternatives, https*:\/\/[\w\.]+\.[a-zA-Z]*[\w\/\-]+ and https*:\/\/[\w\.]+\.[a-zA-Z]*, where [a-zA-Z]* is redundant: it matches zero or more letters, and the [\w\/\-]+ pattern that follows already requires one or more word, / or - chars. You also do not have to escape slashes, or dots inside character classes, so the unnecessary escapes are removed here.
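As a rough sketch of the reverse logic described above, the pattern can be wrapped in a small helper and tried on more than one input. The helper name tokenize and the second sample string are assumptions made for illustration; the pattern itself is the one from the answer.

```python
import re

# Pattern from the answer: URL-like alternatives are tried first, plain words last.
TOKEN_RX = re.compile(r"https*://[\w.]+\.[\w/-]*|[\w.]+\.[a-zA-Z]*/[\w/-]+|\w+")

def tokenize(message):
    """Return URLs and words in order of appearance (hypothetical helper)."""
    return TOKEN_RX.findall(message)

print(tokenize('this is a test: https://www.google.com'))
# ['this', 'is', 'a', 'test', 'https://www.google.com']

# Hypothetical input exercising the scheme-less URL alternative.
print(tokenize('see google.com/search-results now'))
# ['see', 'google.com/search-results', 'now']
```

The order of the alternatives matters: alternation takes the first alternative that matches at a position, so \w+ has to come last, otherwise it would consume https before the URL alternatives get a chance.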

Related

Why are there space outcome in my re.split() result

I want to extract the strings inside the brackets and single quotes of a given string, e.g. given ['this'], extract this. However, the following example and its result keep puzzling me:
import re
target_string = "['this']['current']"
result = re.split(r'[\[|\]|\']+', target_string)
print(result)
I got
['', 'this', 'current', '']
# I expect ['this', 'current']
Now I really don't understand where the first and last '' in the result are coming from. I guarantee that the input target_string has no leading or trailing space, so I don't expect them to occur in the result.
Can anybody help me fix this, please?
re.split splits every time the pattern is found. Since your string starts and ends with the pattern, it outputs a '' at the beginning and at the end, so that the output can be joined back together with the delimiters to form the original string.
If you want to capture, why not use re.findall instead of re.split? This is very simple if you only have one word per bracket.
target_string = "['this']['current']"
re.findall(r"\w+", target_string)
output
['this', 'current']
Note the above will not work for:
['this is the']['current']
For such a case you can use lookahead (?=...) and lookbehind (?<=...) and capture everything in a non-greedy way with .+?
target_string = "['this is the']['current']"
re.findall(r"(?<=\[').+?(?='\])", target_string) # this pattern is equivalent to r"\['(.+?)'\]"
output:
['this is the', 'current']
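To make the explanation of the empty strings concrete, here is a minimal sketch using the string from the question. The character class is a simplified form of the one in the question (without the | characters, which are literal inside []), and the variable names are only illustrative.

```python
import re

target_string = "['this']['current']"

# The string starts and ends with delimiter characters, so re.split
# reports an empty string before the first match and after the last one.
parts = re.split(r"[\[\]']+", target_string)
print(parts)  # ['', 'this', 'current', '']

# Dropping the empty strings gives the expected list.
cleaned = [p for p in parts if p]
print(cleaned)  # ['this', 'current']

# With a capturing group the delimiters are kept as well, and joining
# all pieces reproduces the input exactly.
with_delims = re.split(r"([\[\]']+)", target_string)
print("".join(with_delims) == target_string)  # True
```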

python re split at all space and punctuation except for the apostrophe

I want to split a string on all spaces and punctuation except for the apostrophe. Preferably, a single quote should still be used as a delimiter except when it is an apostrophe. I also want to keep the delimiters.
example string
words = """hello my name is 'joe.' what's your's"""
Here is my re pattern thus far: splitted = re.split(r"[^'-\w]",words.lower())
I tried throwing the single quote after the ^ character but it is not working.
My desired output is this: splitted = [hello,my,name,is,joe,.,what's,your's]
It might be simpler to process your list after splitting, without accounting for the quotes at first:
>>> words = """hello my name is 'joe.' what's your's"""
>>> split_words = re.split(r"[ ,.!?]", words.lower()) # add punctuation you want to split on
>>> split_words
['hello', 'my', 'name', 'is', "'joe", "'", "what's", "your's"]
>>> [word.strip("'") for word in split_words if word.strip("'")]
['hello', 'my', 'name', 'is', 'joe', "what's", "your's"]
One option is to make use of lookarounds to split at the desired positions, and use a capture group for what you want to keep in the split result.
After the split, you can remove the empty entries from the resulting list.
\s+|(?<=\s)'|'(?=\s)|(?<=\w)([,.!?])
The pattern matches
\s+ Match 1 or more whitespace chars
| Or
(?<=\s)' Match ' preceded by a whitespace char
| Or
'(?=\s) Match ' when followed by a whitespace char
| Or
(?<=\w)([,.!?]) Capture one of , . ! ? in group 1, when preceded by a word character
See a regex demo and a Python demo.
Example
import re
pattern = r"\s+|(?<=\s)'|'(?=\s)|(?<=\w)([,.!?])"
words = """hello my name is 'joe.' what's your's"""
result = [s for s in re.split(pattern, words) if s]
print(result)
Output
['hello', 'my', 'name', 'is', 'joe', '.', "what's", "your's"]
I love regex golf!
words = """hello my name is 'joe.' what's your's"""
splitted = re.findall(r"\b(?:\w'\w|\w)+\b", words)
The part in the parentheses is a group that matches either an apostrophe surrounded by word characters or a single word character.
EDIT:
This is more flexible:
re.findall(r"\b(?:(?<=\w)'(?=\w)|\w)+\b", words)
It's getting a bit unreadable at this point though. In practice you should probably use Woodford's answer.
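If the punctuation should also come back as separate items, as the question asks, one possible sketch is a findall pattern with a second alternative for punctuation. This pattern is not taken from any of the answers above; it assumes that an apostrophe is a ' with word characters on both sides, and that quote characters used as delimiters can be dropped.

```python
import re

words = """hello my name is 'joe.' what's your's"""

# \w+(?:'\w+)*  a word, optionally continued by apostrophe + word chars
# [^\w\s']      any single punctuation character other than a quote
pattern = r"\w+(?:'\w+)*|[^\w\s']"

tokens = re.findall(pattern, words)
print(tokens)
# ['hello', 'my', 'name', 'is', 'joe', '.', "what's", "your's"]
```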

Why is Python re not splitting multiple instances of punctuation?

I am trying to split inputted text at spaces, and all special characters like punctuation, while keeping the delimiters. My re pattern works exactly the way I want except that it will not split multiple instances of the punctuation.
Here is my re pattern: wordsWithPunc = re.split(r'([^-\w]+)',words)
If I have a word like "hello" with two punctuation marks after it then those punctuation marks are split but they remain as the same element. For example
"hello,-" will equal "hello",",-" but I want it to be "hello",",","-"
Another example. My name is mud!!! would be split into "My","name","is","mud","!!!" but I want it to be "My","name","is","mud","!","!","!"
You need to remove the + quantifier from your pattern so that each match is a single non-word character, something like:
import re
words = 'My name is mud!!!'
splitted = re.split(r'([^-\w])', words)
# ['My', ' ', 'name', ' ', 'is', ' ', 'mud', '!', '', '!', '', '!', '']
This will also produce 'empty' matches between non-word characters (because you're splitting on each of them), but you can mitigate that by postprocessing the result to remove empty matches:
splitted = [match for match in re.split(r'([^-\w])', words) if match]
# ['My', ' ', 'name', ' ', 'is', ' ', 'mud', '!', '!', '!']
You can further strip spaces in the list comprehension (i.e. ... if match.strip() ...) if you want to get rid of the space matches as well.
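Putting the two steps together, a minimal sketch of the whole approach could look like this. The function name split_keep_punct is an assumption for illustration; the pattern is the one from the answer.

```python
import re

def split_keep_punct(text):
    """Split on every single non-word character (except '-'), keep the
    punctuation as separate items, and drop empty or whitespace-only items."""
    return [piece for piece in re.split(r'([^-\w])', text) if piece.strip()]

print(split_keep_punct('My name is mud!!!'))
# ['My', 'name', 'is', 'mud', '!', '!', '!']
print(split_keep_punct('hello,-'))
# ['hello', ',', '-']
```

Because the character class [^-\w] excludes the hyphen, - is never a split point by itself; it only appears as its own item in the second example because the comma before it is one.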

RegEx for ignoring parentheses in a string

There is a string like this:
strs = "Tierd-Branden This is (L.A.) 105 / New (Even L.A.A)"
The following code does not give me the expected output:
import re
strs = "Tierd-Branden This is (L.A.) 105 / New (Even L.A.A)"
print(re.findall(r"[\w']+[\w\.]", strs))
I expect This:
['Tierd', 'Branden', 'This', 'is', 'L.A.', '105', 'New', 'Even', 'L.A.A']
But, I get this:
['Tierd', 'Branden', 'This', 'is', 'L.', 'A.', '105', 'New', 'Even', 'L.', 'A.']
My question is: how do I keep the content of the parentheses, including the dots, together as a single list element?
The [\w']+[\w\.] pattern matches 1 or more word or ' chars and then a word or . char. Hence, it cannot match chunks of word or ' chars that have more than 1 dot in them.
I suggest using
r"\w[\w'.]*"
Details
\w - a word char
[\w'.]* - 0 or more word, ' and . chars.
This regex might return your desired output: simply list all the characters you want to match inside the []. You can wrap it in a capturing group, if you wish, to retrieve the match as group 1. You can add any other characters you need to the [], and if any of them are metacharacters, escape them with \.
([A-Za-z0-9\.]+)
You can remove the capturing group, and it might still work:
[A-Za-z0-9\.]+
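As a quick check, the two suggested patterns can be run side by side. This is only a sketch; the string sample is a hypothetical input added to show where the two patterns behave differently.

```python
import re

strs = "Tierd-Branden This is (L.A.) 105 / New (Even L.A.A)"

# First suggestion: a word char, then any run of word chars, ' or .
by_word_start = re.findall(r"\w[\w'.]*", strs)
print(by_word_start)
# ['Tierd', 'Branden', 'This', 'is', 'L.A.', '105', 'New', 'Even', 'L.A.A']

# Second suggestion: any run of ASCII letters, digits and dots
by_char_list = re.findall(r"[A-Za-z0-9.]+", strs)
print(by_char_list == by_word_start)  # True

# The patterns differ on a dot that stands alone (hypothetical input).
sample = "end . next"
print(re.findall(r"\w[\w'.]*", sample))      # ['end', 'next']
print(re.findall(r"[A-Za-z0-9.]+", sample))  # ['end', '.', 'next']
```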

Python regex splitting additionally on _

I'm trying to split a string in python using regular expressions. This line works almost perfectly for me:
from string import punctuation
import re
row = re.findall('\w+|[{0}]+'.format(punctuation), string)
However, it doesn't split the string on instances of _ as well. For instance:
>>> string = "Hi my name is _Mark. I like apples!! Do you?!"
>>> row = re.findall('\w+|[{0}]+'.format(punctuation), string)
>>> row
['Hi', 'my', 'name', 'is', '_Mark', '.', 'I', 'like', 'apples', '!!', 'Do', 'you', '?!']
What i want is:
['Hi', 'my', 'name', 'is', '_', 'Mark', '.', 'I', 'like', 'apples', '!!', 'Do', 'you', '?!']
I've read it's because _ is considered a word character. Does anyone know how to accomplish this? Thanks for the help.
Since \w will match the underscore, you can more directly specify what you consider a character without too much more work:
re.findall('[a-zA-Z0-9]+|[{0}]+'.format(punctuation), string)
Because the left side of a disjunction will always match first if possible, you can simply include _ with the punctuation characters before you match letters:
row = re.findall(r'[{0}_]+|\w+'.format(string.punctuation), mystring)
But you can do the same without bothering with string.punctuation at all. "Punctuation" is anything that's neither a space nor a word character:
row = re.findall(r"(?:[^\s\w]|_)+|\w+", mystring)
PS. In your code sample, the string named string "shadows" the module string. Don't do that, it's bad practice and leads to bugs.
The Python docs clearly state that \w includes not only alphanumeric characters but also the underscore:
\w
When the LOCALE and UNICODE flags are not specified, matches any
alphanumeric character and the underscore; this is equivalent to the
set [a-zA-Z0-9_]. With LOCALE, it will match the set [0-9_] plus
whatever characters are defined as alphanumeric for the current
locale. If UNICODE is set, this will match the characters [0-9_] plus
whatever is classified as alphanumeric in the Unicode character
properties database.
So, as Eric pointed out in his solution, it is better to specify a set of only alphanumeric characters: [a-zA-Z0-9]
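The suggestions above can be compared in a short sketch. Two things here are assumptions rather than part of the answers: re.escape is applied to punctuation so that ], \ and - stay literal inside the character class, and the input "snake_case" is a hypothetical example of an underscore inside a word.

```python
import re
from string import punctuation

text = "Hi my name is _Mark. I like apples!! Do you?!"

# Explicit alphanumerics instead of \w, so _ falls into the punctuation class.
explicit = r"[a-zA-Z0-9]+|[{0}]+".format(re.escape(punctuation))
# Punctuation (including _) is tried before \w+.
reordered = r"(?:[^\s\w]|_)+|\w+"

expected = ['Hi', 'my', 'name', 'is', '_', 'Mark', '.', 'I', 'like',
            'apples', '!!', 'Do', 'you', '?!']
print(re.findall(explicit, text) == expected)   # True
print(re.findall(reordered, text) == expected)  # True

# The two differ when the underscore sits inside a word.
print(re.findall(explicit, "snake_case"))   # ['snake', '_', 'case']
print(re.findall(reordered, "snake_case"))  # ['snake_case']
```

Reordering the alternatives only separates an underscore that begins a token, since \w+ still swallows one in the middle of a word. Which behavior is wanted depends on the data.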
