python regex pattern not working properly


Good day stackoverflow, I have a problem with my program. I want to test if the string I entered is alphanumeric or not.
def logUtb(fl, str):
    now = datetime.datetime.now()
    fl.write(now.strftime('%Y-%m-%d %H:%M') + " - " + str + "\n");
    return;
#Test alphanumeric
def testValidationAlphaNum():
    valid = re.match('[A-Za-z0-9]', '!###$#$#')
    if valid == True:
        logUtb(f, 'Alphanumeric')
    else:
        logUtb(f, 'Unknown characters')
As you can see, I entered '!###$#$#' to be tested by my regex pattern. Instead of writing "Unknown characters" to my report log, it writes "Alphanumeric". Can you please tell me what is wrong with my program? Thanks!

re.match() returns None if the string didn't match and a MatchObject if it did. So the == True test will never be satisfied. If you're really seeing the 'Alphanumeric' output, then it's not a result of the code you have posted.
In any case, you should use str.isalnum() for this:
>>> 'abc'.isalnum()
True
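To see why the == True comparison can never succeed, here is a minimal sketch (the variable names hit and miss are only illustrative) that inspects what re.match() actually returns:

```python
import re

# re.match() returns a match object on success and None on failure,
# never the literal True or False.
hit = re.match('[A-Za-z0-9]', 'abc')
miss = re.match('[A-Za-z0-9]', '!###$#$#')

print(hit is not None)  # a match object was returned
print(hit == True)      # a match object does not compare equal to True
print(miss is None)     # nothing matched at the start of the string

# str.isalnum() answers the question directly, with no regex.
print('abc123'.isalnum())
print('!###$#$#'.isalnum())
```

Keep in mind that str.isalnum() also accepts non-ASCII letters and digits, and that it returns False for the empty string.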

I agree with the others who have said that calling str.isalnum() would be a simpler option here.
However, I would like to point out a few things with regard to the regex pattern you tried. Like Alex Baldwin said, your pattern as-is will only be looking for a single alphanumeric at the beginning of the string. So, you could potentially have anything else in the rest of the string and still get a match.
What you should do instead is quantify your character class, and anchor that class to the end of the string. To test that the string contains some alphanumerics, you ought to choose the + quantifier, which looks for at least one alphanumeric. Make sure you use the $ to anchor the pattern to the end of the string, or else you could have some non-alphanumerics sneak in at the end:
re.match('[A-Za-z0-9]+$', '!###$#$#')
This will return None, of course, for the given string. The problem with using * here instead is that it will return a MatchObject even against an empty string, and I assume you want there to be at least one alphanumeric character present. Notice also that using ^ to anchor the character class to the beginning of the string is not necessary, because re.match() only ever matches at the beginning of the string. What you then want to test with your conditional is whether or not a MatchObject was returned by re.match():
valid = re.match('[A-Za-z0-9]+$', '!###$#$#')
if valid:
    logUtb(f, 'Alphanumeric')
else:
    logUtb(f, 'Unknown characters')
Additional information on the quantifiers and anchors can be found in the documentation:
http://docs.python.org/2/library/re.html
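In Python 3.4 and later, re.fullmatch() expresses the same check without an explicit anchor. A minimal sketch, assuming Python 3; the helper name is_ascii_alnum is made up for illustration:

```python
import re

def is_ascii_alnum(s):
    """Return True if s consists of one or more ASCII letters or digits."""
    # fullmatch() succeeds only if the whole string matches the pattern,
    # so neither ^ nor $ is needed.
    return re.fullmatch(r'[A-Za-z0-9]+', s) is not None

print(is_ascii_alnum('abc123'))    # True
print(is_ascii_alnum('!###$#$#'))  # False
print(is_ascii_alnum('abc!'))      # False
print(is_ascii_alnum(''))          # False
```

One difference from the $ version: $ also matches just before a trailing newline, so re.match('[A-Za-z0-9]+$', 'abc\n') succeeds, while re.fullmatch() rejects 'abc\n'.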

valid = re.match('[A-Za-z0-9]', '!###$#$#')
could be
valid = re.match(r'^\w*$', '!###$#$#')
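and work. \w matches any alphanumeric character. (Note that \w also matches the underscore, and in Python 3 it matches non-ASCII letters and digits unless re.ASCII is set.) So if you don't want those, your regex should be: ^[A-Za-z0-9]*$
Or, to exclude only the underscore, it could be ^[^\W_]*$
Either way, if valid == True must be changed to if valid for this to work.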
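A short sketch of how these patterns differ on an underscore and on the empty string (the test strings are arbitrary examples):

```python
import re

# \w matches letters, digits, and the underscore.
print(bool(re.match(r'^\w*$', 'abc_123')))           # True
print(bool(re.match(r'^[A-Za-z0-9]*$', 'abc_123')))  # False

# [^\W_] means "a word character that is not an underscore".
print(bool(re.match(r'^[^\W_]*$', 'abc123')))   # True
print(bool(re.match(r'^[^\W_]*$', 'abc_123')))  # False

# The * quantifier also accepts the empty string;
# use + to require at least one character.
print(bool(re.match(r'^\w*$', '')))  # True
print(bool(re.match(r'^\w+$', '')))  # False
```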

Related

Python Regex to Remove Special Characters from Middle of String and Disregard Anything Else

Using Python's re.sub, is there a way I can extract the first run of alphanumeric characters and disregard the rest from a string that starts with a special character and might have special characters in the middle? For example:
re.sub('[^A-Za-z0-9]','', '#my,name')
How do I just get "my"?
re.sub('[^A-Za-z0-9]','', '#my')
Here I would also want it to just return 'my'.

re.sub(".*?([A-Za-z0-9]+).*", r"\1", str)
The \1 in the replacement is equivalent to matchobj.group(1). In other words, it replaces the whole string with just what was matched by the part of the regexp inside the parentheses. $ could be added at the end of the regexp for clarity, but it is not necessary because the final .* is greedy (it matches as many characters as possible).
This solution does suffer from the problem that if the string doesn't match (which would happen if it contains no alphanumeric characters), then it will simply return the original string. It might be better to attempt a match, then test whether it actually matches, and handle separately the case that it doesn't. Such a solution might look like:
matchobj = re.match(".*?([A-Za-z0-9]+).*", str)
if matchobj:
    print(matchobj.group(1))
else:
    print("did not match")
But the question called for the use of re.sub.
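To make the no-match caveat concrete, here is a small sketch of the re.sub call on three inputs. The constant name PATTERN and the third input are illustrative additions, not part of the question:

```python
import re

PATTERN = r".*?([A-Za-z0-9]+).*"

print(re.sub(PATTERN, r"\1", "#my,name"))  # my
print(re.sub(PATTERN, r"\1", "#my"))       # my

# No alphanumerics at all: the pattern cannot match,
# so re.sub() returns the input unchanged.
print(re.sub(PATTERN, r"\1", "#,!"))       # #,!
```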

Instead of re.sub it is easier to do matching using re.search or re.findall.
Using re.search:
>>> s = '#my,name'
>>> res = re.search(r'[a-zA-Z\d]+', s)
>>> if res:
...     print(res.group())
...
my

This is not a complete answer. With re.findall, [A-Za-z]+ will give you ['my', 'name'].
Use this to further explore: https://regex101.com/
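A small sketch of the re.findall route, assuming only the first run is wanted (the names runs and first are illustrative):

```python
import re

s = '#my,name'

# findall() returns every non-overlapping run, in order.
runs = re.findall(r'[A-Za-z0-9]+', s)
print(runs)  # ['my', 'name']

# Take the first run, guarding against an input with no alphanumerics.
first = runs[0] if runs else None
print(first)  # my
```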

Regex only finds results once

I'm trying to find any text between a '>' character and a new line, so I came up with this regex:
result = re.search(">(.*)\n", text).group(1)
It works perfectly with only one result, such as:
>test1
(something else here)
Where the result, as intended, is
test1
But whenever there's more than one result, it only shows the first one, like in:
>test1
(something else here)
>test2
(something else here)
Which should give something like
test1\ntest2
But instead just shows
test1
What am I missing? Thank you very much in advance.

re.search only returns the first match, as documented:
Scan through string looking for the first location where the regular
expression pattern produces a match, and return a corresponding
MatchObject instance.
To find all the matches, use findall.
Return all non-overlapping matches of pattern in string, as a list of
strings. The string is scanned left-to-right, and matches are returned
in the order found.
Here's an example from the shell:
>>> import re
>>> re.findall(">(.*)\n", ">test1\nxxx>test2\nxxx")
['test1', 'test2']
Edit: I just read your question again and realised that you want "test1\ntest2" as output. Well, just join the list with \n:
>>> "\n".join(re.findall(">(.*)\n", ">test1\nxxx>test2\nxxx"))
'test1\ntest2'
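If the '>' always sits at the start of a line (an assumption; the original pattern matches a '>' anywhere), re.MULTILINE offers a variant that does not depend on a trailing newline. A minimal sketch:

```python
import re

text = ">test1\n(something else here)\n>test2\n(something else here)"

# With re.MULTILINE, ^ and $ match at the start and end of every line,
# so a final line without a trailing "\n" would still be found.
names = re.findall(r"^>(.*)$", text, flags=re.MULTILINE)

print(names)             # ['test1', 'test2']
print("\n".join(names))
```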

You could try:
y = re.findall(r'((?:(?:.+?)(?:(?=[\n\r][^\n\r])\n|))+)', text)
Which returns ['t1\nt2\nt3'] for 't1\nt2\nt3\n'. If you simply want the string, you can get it by:
s = y[0]
Although it seems much larger than your initial code, it will give you your desired string.
Explanation -
((?:(?:.+?)(?:(?=[\n\r][^\n\r])\n|))+) is the regex as well as the match.
(?:(?:.+?)(?:(?=[\n\r][^\n\r])\n|)) is the non-capturing group that matches any text followed by a newline, and is repeatedly found one-or-more times by the + after it.
(?:.+?) matches the actual words which are then followed by a newline.
(?:(?=[\n\r][^\n\r])\n|) is a non-capturing group with an empty alternative: if the matched text is followed by a newline that is itself followed by a non-newline character, the group consumes that newline; otherwise it matches nothing.
(?=[\n\r][^\n\r]) is a positive look-ahead which ascertains that the text found is followed by a newline or carriage return, and then some non-newline characters, which combined with the \n| after it, tells the regex to match a newline.
Granted, after typing this big mess out, the regex is pretty long and complicated, so you would be better off implementing the answers you understand, rather than this answer, which you may not. However, this seems to be the only one-line answer to get the exact output you desire.

regex in python with defined number of letters but no more

I need to check whether the string contains exactly three letters, no more, no less.
I tried:
import re
rege=r'[A-Z]{3,3}'
word='AAAD'
if re.match(rege, word):
    print('yes')
else:
    print('no')
My second try was:
import re
rege=r'[A-Z][A-Z][A-Z]'
word='AAAD'
if re.match(rege, word):
    print('yes')
else:
    print('no')
Both regex tests give the answer 'yes'. Of course I could check len(word), but this will be part of a more complicated regular expression, and I do not want to use a structure like
if(re.match(word[0:2],r'[A-Z][A-Z][A-Z]')):
    if(re.match(word[3]=='-')):
        if....:
            if....:
                ....
Thank you.

You want to use anchors:
^[a-zA-Z]{3}$
^ will match the beginning of the string, $ will match the end.

^[A-Z]{3}$
will do the magic for you
You might expect [A-Z]{3} to work, but with re.match() it only checks that the string starts with three letters, not that it consists of exactly three letters. The string may have more characters after them.
With the anchors, my regex checks the letters from the start of the string to the end.

You should use:
^[A-Z]{3}$
The ^ and $ anchors mark the beginning and the end of the line, making sure nothing else is in there.
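A sketch of the anchored check, assuming Python 3. The helper name is_three_upper is made up, and the hyphen-plus-digits pattern at the end is a hypothetical stand-in for the "more complicated" expression mentioned in the question:

```python
import re

def is_three_upper(word):
    """Return True if word is exactly three uppercase ASCII letters."""
    return re.fullmatch(r'[A-Z]{3}', word) is not None

for w in ('AAA', 'AAAD', 'AA'):
    print(repr(w), is_three_upper(w))

# $ also matches just before a trailing newline; \Z matches only at
# the very end of the string.
print(bool(re.match(r'^[A-Z]{3}$', 'AAA\n')))   # True
print(bool(re.match(r'^[A-Z]{3}\Z', 'AAA\n')))  # False

# The quantified class composes with the rest of a larger pattern
# (hypothetical format: three letters, a hyphen, then digits).
print(bool(re.fullmatch(r'[A-Z]{3}-[0-9]+', 'ABC-42')))   # True
print(bool(re.fullmatch(r'[A-Z]{3}-[0-9]+', 'ABCD-42')))  # False
```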

Why does Django not find these urls, although the regex matches?

In the python docs for regex there is the description of what the "." does:
(Dot.) In the default mode, this matches any character except a
newline. If the DOTALL flag has been specified, this matches any
character including a newline.
For a project I am doing in Django, I set up this regex:
url(r'^accounts/confirm/(.+)$', confirm,name='confirmation_view')
As far as I understand, this should match any URL that starts with 'accounts/confirm/', followed by one or more arbitrary characters. These characters are then passed to the function "confirm" as a parameter. So far, so good.
So this regex should match
accounts/confirm/fb75c6529af9246e4e048d8a4298882909dc03ee0/
just as well as
accounts/confirm/fb75c6529af9246e4e-048d8a4298882909dc03ee0/
and
accounts/confirm/fb75c6529af9246e4e=048d8a4298882909dc03ee0/
and
accounts/confirm/fb75c6529af9246e4e%20048d8a4298882909dc03ee0/
That, at least, was what I thought it would do. But it doesn't; it matches only the first one. Django keeps returning a 404 for the others, which I do not understand, because the (.+) part of the expression should mean "match one or more of any character except a newline".
edit:
As the comments and answers proved, I got the regex right. So the question is now: why is Django returning a 404 instead of the correct view? Is it doing something to the URL before passing it to that regex?

A quick test confirms this should work:
>>> import re
>>> test = ["accounts/confirm/fb75c6529af9246e4e048d8a4298882909dc03ee0/", "accounts/confirm/fb75c6529af9246e4e-048d8a4298882909dc03ee0/", "accounts/confirm/fb75c6529af9246e4e=048d8a4298882909dc03ee0/", "accounts/confirm/fb75c6529af9246e4e%20048d8a4298882909dc03ee0/"]
>>> all([re.match(r'^accounts/confirm/(.+)$', item) for item in test])
True
Adding a string that does not match makes all() return False:
>>> test.append("something else")
>>> all([re.match(r'^accounts/confirm/(.+)$', item) for item in test])
False
The problem must be elsewhere.
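For completeness, a sketch of what the capturing group would hand to the view for each of those paths. It uses only the re module, so it says nothing about how Django itself preprocesses the URL before matching:

```python
import re

pattern = re.compile(r'^accounts/confirm/(.+)$')

paths = [
    "accounts/confirm/fb75c6529af9246e4e048d8a4298882909dc03ee0/",
    "accounts/confirm/fb75c6529af9246e4e-048d8a4298882909dc03ee0/",
    "accounts/confirm/fb75c6529af9246e4e=048d8a4298882909dc03ee0/",
    "accounts/confirm/fb75c6529af9246e4e%20048d8a4298882909dc03ee0/",
]

# group(1) is everything after the prefix; because (.+) is greedy and
# unrestricted, the trailing slash is part of the captured value.
captured = [pattern.match(p).group(1) for p in paths]
for value in captured:
    print(value)
```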

Repeating a python regular expression until a certain char

I want to get all of the text until a ! appears. Example
some textwfwfdsfosjtortjk\n
sdsfsdfsdfsdfsdfsdfsdfsfsfsdfsdfsdf\n
sfsgdfgdfgdgdfgdg\n
!
The number of lines before the ! changes so I can't hardcode a reg exp like this
"+\n^.+\n^.+"
I am using re.MULTILINE, but should I be using re.DOTALL?
Thanks

Why does this need a regular expression?
index = text.find('!')
if index > -1:
    text = text[:index]  # everything before the first '!'; use text[:index + 1] to keep the '!'

So you want to match everything from the beginning of the input up to (but not including) the first ! character? This should do it:
re.match(r'[^!]*', input)
If there are no exclamation points this will match the whole string. If you want to match only strings with ! in them, add a lookahead:
re.match(r'[^!]*(?=!)', input)
The MULTILINE flag is not needed because there are no anchors (^ and $), and DOTALL isn't needed because there are no dots.
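A quick sketch of both patterns on a multi-line input and on an input with no exclamation point (the sample strings are made up):

```python
import re

text = "line one\nline two\n!rest"

# The negated class [^!] matches newlines too, so no flags are needed.
m = re.match(r'[^!]*', text)
print(repr(m.group(0)))  # 'line one\nline two\n'

# Without a '!', the first pattern still matches the whole string...
print(repr(re.match(r'[^!]*', 'no bang').group(0)))  # 'no bang'

# ...while the lookahead version does not match at all.
print(re.match(r'[^!]*(?=!)', 'no bang'))  # None
```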

Following the Python philosophy of "Easier to Ask Forgiveness Than Permission" (EAFP), I suggest you create a subroutine which is easy to understand and later maintain, should your separator change.
SEPARATOR = u"!"
def process_string(s):
    try:
        return s[:s.index(SEPARATOR)]
    except ValueError:
        return s
This function will return the string from the beginning up to, and not including, whatever you defined as separator. If the separator is not found, it will return the whole string. The function works regardless of new lines. If your separator changes, simply change SEPARATOR and you are good to go.
ValueError is the exception raised when you request the index of a character not in the string (try it in the command line: "Hola".index("1") will raise ValueError: substring not found). The workflow then assumes that most of the time you expect the SEPARATOR character to be in the string, so you attempt that first without asking for permission (testing if SEPARATOR is in the string); if you fail (the index method raises ValueError) then you ask forgiveness (return the string as originally received). This approach (EAFP) is considered Pythonic when it applies, as it does in this case.
No regular expressions needed; this is a simple problem.
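As an alternative sketch with the same behaviour, str.partition() avoids the exception handling altogether. The function name before_separator and its default argument are illustrative:

```python
def before_separator(s, sep="!"):
    """Return the part of s before the first sep, or all of s if sep is absent."""
    # partition() returns (head, sep, tail); when sep is not found it
    # returns (s, '', ''), so head is the whole string in that case.
    head, _found, _tail = s.partition(sep)
    return head

print(repr(before_separator("line one\nline two\n!rest")))  # 'line one\nline two\n'
print(repr(before_separator("no separator here")))          # 'no separator here'
```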

Look into a 'lookahead' for that particular character you're reading, and match the whole first part as a pattern instead.
I'm not sure exactly how Python's regex reader is different from Ruby, but you can play with it in rubular.com
Maybe something like:
([^!]*(?=\!))
(Just tried this, seems to work)

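This should do the job (note that it raises AttributeError if the string contains no '!', because match() then returns None):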
re.compile('(.*?)!', re.DOTALL).match(yourString).group(1)

I think you're making this more complex than it needs to be. Your reg exp just needs to say "repeat(any character except !) followed by !". Remember [^!] means "any character except !".
So, like this:
>>> import re
>>> rexp = re.compile("([^!]*)!")
>>> test = """sdasd
... asdasdsa
... asdasdasd
... asdsadsa
... !"""
>>> rexp.findall(test)
['sdasd\nasdasdsa\nasdasdasd\nasdsadsa\n']
>>>

re.DOTALL should be sufficient:
import re
text = """some textwfwfdsfosjtortjk
sdsfsdfsdfsdfsdfsdfsdfsfsfsdfsdfsdf
sfsgdfgdfgdgdfgdg
!"""
rExp = re.compile(r"(.*)!", re.S)
print(rExp.search(text).groups()[0])
some textwfwfdsfosjtortjk
sdsfsdfsdfsdfsdfsdfsdfsfsfsdfsdfsdf
sfsgdfgdfgdgdfgdg
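One caveat with a greedy (.*): if the text contains more than one '!', it captures everything up to the last one. A lazy (.*?) stops at the first. A minimal sketch with a made-up input:

```python
import re

text = "first\nsecond!third!fourth"

# Greedy: runs to the end of the text, then backtracks to the LAST '!'.
greedy = re.search(r"(.*)!", text, re.DOTALL).group(1)

# Lazy: expands one character at a time and stops at the FIRST '!'.
lazy = re.search(r"(.*?)!", text, re.DOTALL).group(1)

print(repr(greedy))  # 'first\nsecond!third'
print(repr(lazy))    # 'first\nsecond'
```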
