python string split only by `/` but not `//` [duplicate] - python

I have a string like this
"yJdz:jkj8h:jkhd::hjkjh"
I want to split it using colon as a separator, but not a double colon. Desired result:
("yJdz", "jkj8h", "jkhd::hjkjh")
I'm trying with:
re.split(":{1}", "yJdz:jkj8h:jkhd::hjkjh")
but I got a wrong result.
In the meanwhile I'm escaping "::", with string.replace("::", "$$")

You could split on (?<!:):(?!:). This uses two negative lookarounds (a lookbehind and a lookahead) which assert that a valid match only has one colon, without a colon before or after it.
To explain the pattern:
(?<!:) # assert that the previous character is not a colon
: # match a literal : character
(?!:) # assert that the next character is not a colon
Both lookarounds are needed, because if there was only the lookbehind, then the regular expression engine would match the first colon in :: (because the previous character isn't a colon), and if there was only the lookahead, the second colon would match (because the next character isn't a colon).

You can do this with lookahead and lookbehind, if you want:
>>> s = "yJdz:jkj8h:jkhd::hjkjh"
>>> l = re.split("(?<!:):(?!:)", s)
>>> print l
['yJdz', 'jkj8h', 'jkhd::hjkjh']
This regex essentially says "match a : that is not followed by a : or preceded by a :"

Related

How to split a string with parentheses and spaces into a list

I want to split strings like:
(so) what (are you trying to say)
what (do you mean)
Into lists like:
[(so), what, (are you trying to say)]
[what, (do you mean)]
The code that I tried is below. In the site regexr, the regex expression match the parts that I want but gives a warning, so... I'm not a expert in regex, I don't know what I'm doing wrong.
import re
string = "(so) what (are you trying to say)?"
rx = re.compile(r"((\([\w \w]*\)|[\w]*))")
print(re.split(rx, string ))
Using [\w \w]* is the same as [\w ]* and also matches an empty string.
Instead of using split, you can use re.findall without any capture groups and write the pattern like:
\(\w+(?:[^\S\n]+\w+)*\)|\w+
\( Match (
\w+ Match 1+ word chars
(?:[^\S\n]+\w+)* Optionally repeat matching spaces and 1+ word chars
\) Match )
| Or
\w+ Match 1+ word chars
Regex demo
import re
string = "(so) what (are you trying to say)? what (do you mean)"
rx = re.compile(r"\(\w+(?:[^\S\n]+\w+)*\)|\w+")
print(re.findall(rx, string))
Output
['(so)', 'what', '(are you trying to say)', 'what', '(do you mean)']
For your two examples you can write:
re.split(r'(?<=\)) +| +(?=\()', str)
Python regex<¯\(ツ)/¯>Python code
This does not work, however, for string defined in the OP's code, which contains a question mark, which is contrary to the statement of the question in terms of the two examples.
The regular expression can be broken down as follows.
(?<=\)) # positive lookbehind asserts that location in the
# string is preceded by ')'
[ ]+ # match one or more spaces
| # or
[ ]+ # match one or more spaces
(?=\() # positive lookahead asserts that location in the
# string is followed by '('
In the above I've put each of two space characters in a character class merely to make it visible.

How to match anything in a regular expression up to a character and not including it?

For example if I have a string abc%12341%%c%9876 I would like to substitute from the last % in the string to the end with an empty string, the final output that I'm trying to get is abc%12341%%c.
I created a regular expression '.*#' to search for the last % meaning abc%12341%%c% , and then getting the index of the the last % and then just replacing it with an empty string.
I was wondering if it can be done in one line using re.sub(..)
Use the following regex pattern, and then replace with empty string:
%[^%]*$
Sample script:
inp = "abc%12341%%c%9876"
output = re.sub(r'%[^%]*$', '', inp)
print(output) # abc%12341%%c
The regex pattern says to match the final % sign, followed by zero or more non % characters, up to the end of the string. We then replace with empty string, to effectively remove this content from the input.
I think it is called lookahead matching - I will look it up if I am not too slow :-)
(?=...)
Matches if ... matches next, but doesn’t consume any of the string. This is called a lookahead assertion. For example, Isaac (?=Asimov) will match 'Isaac ' only if it’s followed by 'Asimov'

using reg exp to check if test string is of a fixed format

I want to make sure using regex that a string is of the format- "999.999-A9-Won" and without any white spaces or tabs or newline characters.
There may be 2 or 3 numbers in the range 0 - 9.
Followed by a period '.'
Again followed by 2 or 3 numbers in the range 0 - 9
Followed by a hyphen, character 'A' and a number between 0 - 9 .
This can be followed by anything.
Example: 87.98-A8-abcdef
The code I have come up until now is:
testString = "87.98-A1-help"
regCompiled = re.compile('^[0-9][0-9][.][0-9][0-9][-A][0-9][-]*');
checkMatch = re.match(regCompiled, testString);
if checkMatch:
print ("FOUND")
else:
print("Not Found")
This doesn't seem to work. I'm not sure what I'm missing and also the problem here is I'm not checking for white spaces, tabs and new line characters and also hard-coded the number for integers before and after decimal.
With {m,n} you can specify the number of times a pattern can repeat, and the \d character class matches all digits. The \S character class matches anything that is not whitespace. Using these your regular expression can be simplified to:
re.compile(r'\d{2,3}\.\d{2,3}-A\d-\S*\Z')
Note also the \Z anchor, making the \S* expression match all the way to the end of the string. No whitespace (newlines, tabs, etc.) are allowed here. If you combine this with the .match() method you assure that all characters in your tested string conform to the pattern, nothing more, nothing less. See search() vs. match() for more information on .match().
A small demonstration:
>>> import re
>>> pattern = re.compile(r'\d{2,3}\.\d{2,3}-A\d-\S*\Z')
>>> pattern.match('87.98-A1-help')
<_sre.SRE_Match object at 0x1026905e0>
>>> pattern.match('123.45-A6-no whitespace allowed')
>>> pattern.match('123.45-A6-everything_else_is_allowed')
<_sre.SRE_Match object at 0x1026905e0>
Let's look at your regular expression. If you want:
"2 or 3 numbers in the range 0 - 9"
then you can't start your regular expression with '^[0-9][0-9][.] because that will only match strings with exactly two integers at the beginning. A second issue with your regex is at the end: [0-9][-]* - if you wish to match anything at the end of the string then you need to finish your regular expression with .* instead. Edit: see Martijn Pieters's answer regarding the whitespace in the regular expressions.
Here is an updated regular expression:
testString = "87.98-A1-help"
regCompiled = re.compile('^[0-9]{2,3}\.[0-9]{2,3}-A[0-9]-.*');
checkMatch = re.match(regCompiled, testString);
if checkMatch:
print ("FOUND")
else:
print("Not Found")
Not everything needs to be enclosed inside [ and ], in particular when you know the character(s) that you wish to match (such as the part -A). Furthermore:
the notation {m,n} means: match at least m times and at most n times, and
to explicitly match a dot, you need to escape it: that's why there is \. in the regular expression above.

Python regex match literal asterisk

Given the following string:
s = 'abcdefg*'
How can I match it or any other string only made of lowercase letters and optionally ending with an asterisk? I thought the following would work, but it does not:
re.match(r"^[a-z]\*+$", s)
It gives None and not a match object.
How can I match it or any other string only made of lowercase letters and optionally ending with an asterisk?
The following will do it:
re.match(r"^[a-z]+[*]?$", s)
The ^ matches the start of the string.
The [a-z]+ matches one or more lowercase letters.
The [*]? matches zero or one asterisks.
The $ matches the end of the string.
Your original regex matches exactly one lowercase character followed by one or more asterisks.
\*? means 0-or-1 asterisk:
re.match(r"^[a-z]+\*?$", s)
re.match(r"^[a-z]+\*?$", s)
The [a-z]+ matches the sequence of lowercase letters, and \*? matches an optional literal * chatacter.
Try
re.match(r"^[a-z]*\*?$", s)
this means "a string consisting zero or more lowercase characters (hence the first asterisk), followed by zero or one asterisk (the question mark after the escaped asterisk).
Your regex means "exactly one lowercase character followed by one or more asterisks".
You forgot the + after the [a-z] match to indicate you want 1 or more of them as well (right now it's matching just one).
re.match(r"^[a-z]+\*+$", s)

Python RegEx Meaning

I'm new to python regular expressions and was wondering if someone could help me out by walking me through what this means (I'll state what I think each bit means here as well).
Thanks!
RegExp:
r'(^.*def\W*)(\w+)\W*\((.*)\):'
r'...' = python definition of regular expression within the ''
(...) = a regex term
(^. = match the beginning of any character
*def\W* = ???
(\w+) = match any of [a, z] 1 or more times
\W*\ = ? i think its the same as the line above this but from 0+ more times instead of 1 but since it matches the def\W line above (which i dont really know the meaning of) i'm not sure.
((.*)\): = match any additional character within brackets ()
thanks!
It seems like a failed attempt to match a Python function signature:
import re
regex = re.compile(r""" # r'' means that \n and the like is two chars
# '\\','n' and not a single newline character
( # begin capturing group #1; you can get it: regex.match(text).group(1)
^ # match begining of the string or a new line if re.MULTILINE is set
.* # match zero or more characters except newline (unless
# re.DOTALL is set)
def # match string 'def'
\W* # match zero or more non-\w chars i.e., [^a-zA-Z0-9_] if no
# re.LOCALE or re.UNICODE
) # end capturing group #1
(\w+) # second capturing group [a-zA-Z0-9_] one or more times if
# no above flags
\W* # see above
\( # match literal paren '('
(.*) # 3rd capturing group NOTE: `*` is greedy `.` matches even ')'
# therefore re.match(r'\((.*)\)', '(a)(b)').group(1) == 'a)(b'
\) # match literal paren ')'
: # match literal ':'
""", re.VERBOSE|re.DEBUG)
re.DEBUG flag causes the output:
subpattern 1
at at_beginning
max_repeat 0 65535
any None
literal 100
literal 101
literal 102
max_repeat 0 65535
in
category category_not_word
subpattern 2
max_repeat 1 65535
in
category category_word
max_repeat 0 65535
in
category category_not_word
literal 40
subpattern 3
max_repeat 0 65535
any None
literal 41
literal 58
more
r'..' = Python’s raw string notation for regular expression patterns; backslashes are not handled in any special way in a string literal prefixed with 'r'.
(...) = a capture group which stores the captured value in a var to be used in replacement/mathing.
^ = the start of the string.
.* = 0 or more chars of any type.
def = the literal string def
\W* = 0 or more non word chars (Anything other than a-zA-Z or _)
\w+ = 1 or more word chars (see above)
\( = escapes the (, therefore means a literal (
\) = same as above.
: = literal :
PS: I like the effort that you made to try to understand the regex. It will serve you well, much better than people asking what does r'(^.*def\W*)(\w+)\W*\((.*)\):' mean.
r'...' = python definition of regular expression within the ''
The r'' syntax has nothing to with regular expressions (or at least, not directly). The r stands for raw and is simply an indicator to Python that no string interpolation should be performed on the string.
This is often used with regular expressions so that you don't have to escape backslash (\) characters, which would otherwise be eaten by the normal string interpolation mechanism.
(^. = match the beginning of any character
I'm not sure what "beginning of any character" means. The ^ character matches the beginning of a line.
def\W = ???
def matches the characters def. For \W, take a look at pydoc re, which describes the regular expression language.
\W*
As above.
Other than the above, your interpretation seems largely correct.
r'(^.*def\W\*)(\w+)\W*((.*)):'
==================================
r' tells python this is a raw string so you don't have to double escape all the \
^ match start of string
() in each case this means group this match where each () is a different group
.* match zero or more of any characters
def match the literal 'def'
\W* match zero or more of any non word character
() more grouping of the contained expression
\w+ match one or more of word character
\W* zero or more of any non word character
\( escape the left paren
() more grouping of the contained expression
.* zero of more of any character
\) escape the right paren
: match a single colon literal
This looks like it is trying to match a python method definition. Here is a link to play with this regular expression. Yes it is powered by Ruby, but the syntax is pretty much the same across all languages, I use this site to test regexes for Python, Java and Ruby.
Raw string notation (r"text") keeps regular expressions sane. Without it, every backslash ('\') in a regular expression would have to be prefixed with another one to escape it.
( ) groups a statement, it's treated as ONE thing, so you can do ()? or ()* or ()+ if the things in the brackets need to handled together.
^ matches it if it's the beginning of the string
(Dot.) In the default mode, this matches any character except a newline.
Since it is ".*" the * means match 0 or more matches of the previous thing, which in this case is any character.
def\W* - lines beginning with the string "def" and then \W matches any non-alphanumeric characters and is equivalent to [^a-zA-Z0-9_]. Since we have the * again, this time it matches 0 or more non-alphanumeric characters.
(\w+), the + is for 1 or more of the previous, which in this case is \w, equivalent to [a-zA-Z0-9_].
7.\W*, we know that one already.
"(" - means just to match "(", differentiate it from () grouping things, same for ")/"
(.*) - match 0 or more characters inside.
: - the matched string ends with a colon at the end.
The whole thing seems to be matching a definition of a function in python i.e. "def foo(x):" will be matched. Dealing with regular expressions is hard - using tools such as http://www.pythonregex.com/ helps me to try out different things. And since RE are slightly different in different languages, it's nice to have tools for those too.
r'...' → the preferred way to define the string of a regular expression in Python
(...) → regex term
^ → matches only with the beginning of the string
So, in the first pair of parenthesis (^.def\W), firstly it matches with the string.
. → matches any character
* → repeat the previous match 0 or more times
Then .* will match anything any number of times. The following 'def' is an exact match, that only matches with itself.
\W → matches anything that is NOT a letter, nor a number, nor the underscore character.
Then \W* will match zero or more of these non-letter-number-underscore characters. The next pair of parenthesis (\w+) you got it right. In the last part \W*\((.*)\): the initial \W* means the same thing as the previous \W* . Next, \( matches with ( , then there is the group (.*) which means, as previously, anything any number of times, followed by \): that matches ): .
An example of string that is matched by this regular expression is:
thing_def = function_name (123 anything in here):

Categories

Resources