Regex in Django query set - python

I'm trying to write a Django query that will filter by a particular regex pattern.
I want to filter by a code that pulls out any cases where there's any non-digit character followed by a number, followed by a non-digit character (a white space is fine).
Just say some codes are AJDP8EP, jsif28EP, EROE88, oskdpoeks8.
So I want my results to return: AJDP8EP, oskdpoeks8.
This is my query, but it's not recognizing things properly. Number is a variable.
results = Book.objects.filter(author__contains = firstname,type = "Fiction").filter(code__regex = r'^(\D+)(number)(\D+)')

You cannot reference a variable by name inside a quoted regex string; the name is treated as literal text. Build the pattern by concatenating strings instead:
r"^(\D+)(" + str(number) + r")(\D+)"
str() converts your number variable to a string in case it is not one already. Note that every piece containing a backslash needs its own r prefix.
Also, as indicated in one of the comments to your question, oskdpoeks8 will not match your pattern. If you want to catch cases where number may come at the end of the code, one solution would be:
r"^(\D+)(" + str(number) + r")(\D*)"
Note the replacement of the + with an * to catch that case of zero occurrences.
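As a rough sanity check, the pattern can be tried against the sample codes with Python's re module before it goes into the query. This is only a sketch: build_code_pattern is a hypothetical helper name, and Django's __regex lookup is evaluated by the database backend, whose regex syntax may differ from Python's. The trailing $ and the re.escape call are additions not present in the answer above; without the $, EROE88 would also match, because the pattern would only need to match a prefix of the code.

```python
import re


def build_code_pattern(number):
    """Hypothetical helper: one-or-more non-digits, the number, then only non-digits."""
    # re.escape keeps the interpolated value literal if it ever contains
    # regex metacharacters; str() covers the case where number is an int.
    return r"^\D+" + re.escape(str(number)) + r"\D*$"


codes = ["AJDP8EP", "jsif28EP", "EROE88", "oskdpoeks8"]
pattern = build_code_pattern(8)
matching = [code for code in codes if re.search(pattern, code)]
print(matching)  # ['AJDP8EP', 'oskdpoeks8']
```

The same string could then be passed as code__regex=build_code_pattern(number), assuming the database in use understands \D.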

With Python 3.6+ (and a Django version that supports it) you can use f-strings to interpolate variables into a regex and use the result in queries. For example:
Car.objects.filter(
    car_code__regex=rf'^{settings.SOME_SETTING}0*(\d+)$'
)
For the OP's case:
results = Book.objects.filter(author__contains = firstname,type = "Fiction").filter(code__regex = rf'^(\D+){number}(\D+)')

Related

Edit regex strings in Python using format method

I want to develop a regex in Python where a component of the pattern is defined in a separate variable and combined to a single string on-the-fly using Python's .format() string method. A simplified example will help to clarify. I have a series of strings where the space between words may be represented by a space, an underscore, a hyphen etc. As an example:
new referral
new-referal
new - referal
new_referral
I can define a regex string to match these possibilities as:
space_sep = r'[\s\-_]+'
(The hyphen is escaped to ensure it is not interpreted as defining a character range.)
I can now build a bigger regex to match the strings above using:
myRegexStr = "new{spc}referral".format(spc = space_sep)
The advantage of this method for me is that I need to define lots of reasonably complex regexes where there may be several different commonly-occurring strings that occur multiple times and in an unpredictable order; defining commonly-used patterns beforehand makes the regexes easier to read and allows the strings to be edited very easily.
However, a problem occurs if I want to define the number of occurrences of other characters using the {m,n} or {n} structure. For example, to allow for a common typo in the spelling of 'referral', I need to allow either 1 or 2 occurrences of the letter 'r'. I can edit myRegexStr to the following:
myRegexStr = "new{spc}refer{1,2}al".format(spc = space_sep)
However, now all sorts of things break due to confusion over the use of curly braces (either a KeyError in the case of {1,2} or an IndexError: tuple index out of range in the case of {n}).
Is there a way to use the .format() string method to build longer regexes whilst still being able to define number of occurrences of characters using {n,m}?
You can double the { and } to escape them or you can use the old-style string formatting (% operator):
my_regex = "new{spc}refer{{1,2}}al".format(spc="hello")
my_regex_old_style = "new%(spc)srefer{1,2}al" % {"spc": "hello"}
print(my_regex) # newhellorefer{1,2}al
print(my_regex_old_style) # newhellorefer{1,2}al
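To tie this back to the question, here is a small sketch that plugs the asker's space_sep pattern into the doubled-brace form and checks it against the four sample strings. The names pattern, samples and results are illustrative only.

```python
import re

# Separator pattern from the question, written as a raw string.
space_sep = r"[\s\-_]+"

# {{1,2}} survives .format() as the literal quantifier {1,2}.
pattern = re.compile("new{spc}refer{{1,2}}al".format(spc=space_sep))

samples = ["new referral", "new-referal", "new - referal", "new_referral"]
results = [bool(pattern.fullmatch(s)) for s in samples]
print(pattern.pattern)  # new[\s\-_]+refer{1,2}al
print(results)          # [True, True, True, True]
```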

search substring + integer from a string in python using regular expression

I have a string
str="TMOUT=1800; export TMOUT"
I want to extract only TMOUT=1800 from above string, but 1800 is not constant it can be any integer value. For example TMOUT=18 or TMOUT=201 etc. I'm very new to regular expression.
I tried using code below
re.search("TMOUT=\d",str).
It is not working. Please help
\d matches a single digit. You want to match one or more digits, so you have to add a + quantifier:
re.search(r"TMOUT=\d+", text)
If you then want to extract the number, you have to create a group using parentheses ():
match = re.search(r"TMOUT=(\d+)", text)
number = int(match.group(1))
Or you may want to use the named group syntax (?P<name>...):
match = re.search(r"TMOUT=(?P<num>\d+)", text)
number = int(match.group("num"))
I suggest you use regex101 to test your regexes and get an explanation of what they do. Also read Python's re docs to learn about the methods of the various objects and functions available.
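Putting the pieces together, a minimal sketch might look like the following. It assumes the input has the shape shown in the question; the variable names assignment and value are illustrative. Since re.search returns None when nothing matches, the result is checked before the groups are read.

```python
import re

text = "TMOUT=1800; export TMOUT"

match = re.search(r"TMOUT=(\d+)", text)
if match:
    assignment = match.group(0)  # the whole match, e.g. "TMOUT=1800"
    value = int(match.group(1))  # just the digits, as an integer
else:
    assignment = None
    value = None

print(assignment, value)  # TMOUT=1800 1800
```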

matching regular expressions in python which contains URLs

I have a list of URLs from which I am trying to fetch just the id numbers. I am trying to solve this using a combination of urlparse and regular expressions. Here is what my function looks like:
def url_cleanup(url):
    parsed_url = urlparse(url)
    if parsed_url.query=="fref=ts":
        return 'https://www.facebook.com/'+re.sub('/', '', parsed_url.path)
    else:
        qry = parsed_url.query
        result = re.search('id=(.*)&fref=ts',qry)
        return 'https://www.facebook.com/'+result.group(1)
However, I feel that the regular expression result = re.search('id=(.*)&fref=ts',qry) fails to match some of the URLs as explained in the below example.
#1
id=10001332443221607 #No match
#2
id=6383662222426&fref=ts #matched
I tried following the suggestion in this answer by rephrasing my regular expression as id=(.*).+?(?=&fref=ts), which again matches #2 but not #1 in the above examples.
I am not sure what I am missing here. Any suggestion/hint will be much appreciated.
Your regexes are indeed wrong.
The expression id=(.*)&fref=ts only matches ids that are literally followed by &fref=ts.
The expression id=(.*).+?(?=&fref=ts) does the same thing, but with a lookahead, which is a zero-width assertion: it checks that &fref=ts follows without making it part of the match. Your match will therefore be only the id=blablabla part, but still only if it is followed by &fref=ts.
Moreover, id=(.*) will match ids made up of numbers, letters, symbols... literally anything. Using id=\d+ will match 'numbers only' ids.
So, try using
result = re.search(r'id=(\d+)', qry)
This catches just the numbers, assuming your ids are always digits, and captures (using the parentheses) only those digits for later use.
For further reference, refer to
http://www.regular-expressions.info/python.html
Your regex needs tweaking slightly. Try:
result = re.search(r'id=(\d+)(&fref=ts)?', qry)
id=(\d+) matches one or more digits following id=, and (&fref=ts)? makes the trailing &fref=ts optional. Capturing it in a group would allow you to add it back in if necessary.
You should also note that result.group(1) will throw an error if no match is found, because re.search then returns None, so you might want to change it slightly to:
result = re.search(r'id=(\d+)(&fref=ts)?', qry)
if result:
    return 'https://www.facebook.com/'+result.group(1)
else:
    # some error catch
    return None  # placeholder: handle the no-match case here
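For reference, here is one way the whole function could look with these suggestions applied. It is a sketch, not the asker's final code: it assumes Python 3 (urlparse imported from urllib.parse), the BASE constant is an illustrative name, the URLs in the usage lines are made up, and returning None when no id is found is just one possible choice.

```python
import re
from urllib.parse import urlparse

BASE = "https://www.facebook.com/"


def url_cleanup(url):
    parsed = urlparse(url)
    if parsed.query == "fref=ts":
        # No id parameter: the profile name is the path itself.
        return BASE + parsed.path.replace("/", "")
    match = re.search(r"id=(\d+)", parsed.query)
    if match is None:
        return None  # no numeric id found; the caller decides what to do
    return BASE + match.group(1)


# Hypothetical URLs covering both cases from the question.
print(url_cleanup("https://www.facebook.com/profile.php?id=10001332443221607"))
print(url_cleanup("https://www.facebook.com/profile.php?id=6383662222426&fref=ts"))
```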

Matching a complex expression in python regex

I have to create a unique textual marker in my document using Python 2.7, with the following function:
def build_textual_marker(number, id):
    return "[xxxcixxx[[_'" + str(number) + "'] [_'" + id + "']]xxxcixxx]"
The output looks like this: [xxxcixxx[[_'1'] [_'24']]xxxcixxx]
I then have to catch every occurrence of this expression in my document. I ended up with the following regular expression, but it does not seem to work:
marker_regex = "\[xxxcixxx\[(\[_*?\])\s(\[_*?\])\]xxxcixxx\]"
I was wondering how should I write the correct regex in this case?
Try using
\[xxxcixxx\[\[_'.*?'\] \[_'.*?'\]\]xxxcixxx\]
Demo: http://regexr.com/3d887
Rather than the lazy star, you might as well get along with a digit class directly (the function build_textual_marker takes a number parameter, doesn't it?):
\[xxxcixxx\[(\[_'\d+'\])\s(\[_'\d+'\])\]xxxcixxx\]
See a demo on regex101.com.
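A short sketch of how the digit-class version might be used to pull every marker out of a document. Two things here are assumptions rather than part of the answers above: the capture groups are narrowed to the digits only, and str(id) is used so that integer ids can be passed as well. The document string is made up for illustration.

```python
import re


def build_textual_marker(number, id):
    # Same layout as in the question; str(id) is an addition for convenience.
    return "[xxxcixxx[[_'" + str(number) + "'] [_'" + str(id) + "']]xxxcixxx]"


# Every literal bracket is escaped; the groups capture only the two numbers.
marker_regex = re.compile(r"\[xxxcixxx\[\[_'(\d+)'\] \[_'(\d+)'\]\]xxxcixxx\]")

document = (
    "intro " + build_textual_marker(1, 24)
    + " middle " + build_textual_marker(7, 3) + " end"
)
found = marker_regex.findall(document)
print(found)  # [('1', '24'), ('7', '3')]
```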

in python find index in list if combination of strings exist

I'm writing my first script and trying to learn python.
But I'm stuck and can't get out of this one.
I'm writing a script to change file names.
Lets say I have a string = "this.is.tEst3.E00.erfeh.ervwer.vwtrt.rvwrv"
I want the result to be string = "This Is Test3 E00"
this is what I have so far:
l = list(string)
# Transform the string into a list
for i in l:
    if "E" in l:
        p = l.index("E")
        if isinstance((p+1), int) is True:
            if isinstance((p+2), int) is True:
                delp = p+3
                a = p-3
                del l[delp:]
new = "".join(l)
new = new.replace("."," ")
print (new)
The idea is to get the index where "E" is and check whether there are 2 integers after the "E".
Then delete everything after the second integer.
However, this will not work if there is an "E" anywhere else.
At the moment the result I get is:
this is tEst
because it finds the index of the first "E" in the list and deletes everything from index+3 onwards.
I guess my question is how to get the index in the list where a combination of strings exists,
but I can't seem to find out how.
Thanks for everyone's answers.
I was going in another direction, but it is also not working.
If someone could see why, it would be awesome. It is much better to learn by doing than by just copying what others write :)
This is what I came up with:
for i in l:
    if i=="E" and isinstance((i+1), int ) is True:
        p = l.index(i)
        print (p)
Can anyone tell me why this isn't working? I get an error.
Thank you so much
Have you ever heard of a regular expression?
Check out Python's re module. Link to the docs.
Basically, you can define a "regex" that matches "E and then two digits" and gives you its index.
After that, I'd just use Python's slice notation to choose the piece of the string that you want to keep.
Then, check out the string methods str.replace to swap the periods for spaces, and str.title to put the result in title case.
An easy way is to use a regex to find up until the E followed by 2 digits criteria, with s as your string:
import re
up_until = re.match(r'(.*?E\d{2})', s).group(1)
# this.is.tEst3.E00
Then, we replace the . with a space and then title case it:
output = up_until.replace('.', ' ').title()
# This Is Test3 E00
The technique to consider using is Regular Expressions. They allow you to search for a pattern of text in a string, rather than a specific character or substring. Regular Expressions have a bit of a tough learning curve, but are invaluable to learn and you can use them in many languages, not just in Python. Here is the Python resource for how Regular Expressions are implemented:
http://docs.python.org/2/library/re.html
The pattern you are looking to match in your case is an "E" followed by two digits. In Regular Expressions (usually shortened to "regex" or "regexp"), that pattern looks like this:
E\d\d # ('\d' is the specifier for any digit 0-9)
In Python, you create a string of the regex pattern you want to match, and pass that and your file name string into the search() method of the re module. Regex patterns tend to use a lot of backslashes, so it's common in Python to prefix the regex pattern string with 'r' (a raw string), which tells the Python interpreter not to treat backslashes as the start of escape sequences. All of this together looks like this:
import re
filename = 'this.is.tEst3.E00.erfeh.ervwer.vwtrt.rvwrv'
match_object = re.search(r'E\d\d', filename)
if match_object:
    # end(0) is the index just past the end of the whole match (group 0)
    index_of_Exx = match_object.end(0)
    truncated_filename = filename[:index_of_Exx]
    # Now take care of any more processing
Regular expressions can get very detailed (and complex). In fact, you can probably accomplish your entire task of fully changing the file name using a single regex that's correctly put together. But since I don't know the full details about what sorts of weird file names might come into your program, I can't go any further than this. I will add one more piece of information: if the 'E' could possibly be lower-case, then you want to add a flag as a third argument to your pattern search which indicates case-insensitive matching. That flag is 're.I' and your search() method would look like this:
match_object = re.search(r'E\d\d', filename, re.I)
Read the documentation on Python's 're' module for more information, and you can find many great tutorials online, such as this one:
http://www.zytrax.com/tech/web/regex.htm
And before you know it you'll be a superhero. :-)
The reason why this isn't working:
for i in l:
    if i=="E" and isinstance((i+1), int ) is True:
        p = l.index(i)
        print (p)
...is because 'i' contains a character from the list 'l', not an integer. You compare it with 'E' (which works), but then try to add 1 to it, which raises a TypeError because a string and an integer cannot be added together.
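If you would like to stay with the loop-based approach rather than a regex, a sketch along these lines avoids the error by looping over positions and testing the characters that follow, instead of adding 1 to a character. The function name truncate_after_marker is made up for this example.

```python
def truncate_after_marker(name):
    """Cut name right after the first 'E' that is followed by two digits."""
    for i, ch in enumerate(name):
        following = name[i + 1:i + 3]  # the two characters after position i
        if ch == "E" and len(following) == 2 and following.isdigit():
            return name[:i + 3]
    return name  # no such marker: leave the name unchanged


original = "this.is.tEst3.E00.erfeh.ervwer.vwtrt.rvwrv"
cleaned = truncate_after_marker(original).replace(".", " ").title()
print(cleaned)  # This Is Test3 E00
```

Because the check looks at the two characters after each "E", the earlier "E" in tEst3 is skipped instead of ending the search.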
