How to parse a list in Django urlparser? - python

On Stack Overflow, you can view a list of questions with multiple tags at a URL such as http://stackoverflow.com/questions/tagged/django+python.
I'd like to do something similar in a project I'm working on, where one of the URL parameters would be a list of tags, but I'm not sure how to write a URL pattern regex that can parse it out. I'm fond of SO's way of using the + sign, but it's not a dealbreaker. I also imagine the URL parser may have to pass the whole string (foo+bar+baz) to the view as a single variable, which is fine since I can split it in the view myself; that is, I'm not expecting the URL parser to give the view an already-split list, but if it can, even better!
Right now all I have is:
url(r'^documents/tag/(?P<tag>\w+)/$', ListDocuments.as_view(), name="list_documents"),
This only pulls out a single tag, since \w+ matches [A-Za-z0-9_] but not +. I tried something like:
url(r'^documents/tag/(?P<tag>[\w+\+*])/$', ListDocuments.as_view(), name="list_documents"),
But this matched neither documents/tag/foo nor documents/tag/foo+bar.
Please assist, I'm not so great with regex, thanks!

It's not possible to do this automatically. From the documentation: "Each captured argument is sent to the view as a plain Python string, regardless of what sort of match the regular expression makes." Splitting it in the view is the way to go.
The second regex in your answer is OK, but it does allow some things you might not want (e.g. 'django+++python+'). A stricter version might be something like: (?P<tag>\w+(?:\+\w+)*). Then you can just do a simple tag.split('+') in the view without worrying about any edge cases.
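As a quick illustration (a sketch with plain re, outside of any Django code), the stricter pattern accepts django+python but rejects malformed input like django+++python+, and the split in the view is then a one-liner:
import re

strict = re.compile(r'^(?P<tag>\w+(?:\+\w+)*)$')
print(bool(strict.match('django+python')))     # True
print(bool(strict.match('django+++python+')))  # False
print('django+python'.split('+'))              # ['django', 'python']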

This works for now:
url(r'^documents/tag/(?P<tag>[A-Za-z0-9_\+]+)/$', ListDocuments.as_view(), name="list_documents"),
But I'd like to be able to get that \w back in there instead of the full list of characters like that.
[Edit]
Here we go:
url(r'^documents/tag/(?P<tag>[\w\+]+)/$', ListDocuments.as_view(), name="list_documents"),
I will still select a better answer if there is a way for the Django urlparser to give the view an actual list instead of just one big long string, but if that's not possible, this solution does work.
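For completeness, a minimal sketch of the view-side split; ListDocuments is assumed here to be a ListView, and the Document model with a tags relation is hypothetical:
# views.py (sketch)
from django.views.generic import ListView
from .models import Document  # hypothetical model with a 'tags' relation

class ListDocuments(ListView):
    model = Document

    def get_queryset(self):
        tags = self.kwargs['tag'].split('+')  # 'foo+bar+baz' -> ['foo', 'bar', 'baz']
        return Document.objects.filter(tags__name__in=tags).distinct()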

Related

REGEX working together separated by | OR. When run independently are both returning empty lists

I have written two regexes that I originally used combined with the | (or) operator. I need them both to run separately, but what should be a simple change is not working the way I expected. I have tested both regexes with online tools, and they both work 100%. When run in the code they both return: [].
For reference stringSoup is an html string.
Here was the original:
re.findall(r"(\(#([^)\s]+)\))|//.*instagram\.com/(\w+.*?)/(?:p)/g")
I need to run each re separately like so:
re.findall(r"(\(#([^)\s]+)\))/g", stringSoup)
re.findall(r"//.*instagram\.com/(\w+.*?)/(?:p)/g", stringSoup)
The first regex is to find usernames as (#username). The second is to find usernames as instagram.com/username.
The original combined regex was working fine
After separation both of these are returning empty []
I'm not really certain I understand your question and some of the inputs, but I made a sample to hopefully re-create what you're trying to do:
\(#(?P<username1>[^)]+)\) # username is after '(#' and is everything up until ')'
| # or
.*instagram\.com\/(?P<username2>[^\/]+)\/p # username is between 'instagram.com/' and the next '/'
You can view it here. You can also remove the top half or the bottom half and see that each regex will only match that specific item. Note that using something like [^\/] might be a bit crude and you can make that more specific, but the above should give you what you need in a general sense.
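If it helps, here is a rough sketch of that combined pattern in Python. Note that Python's re has no JavaScript-style /g suffix; appended to the pattern it becomes literal text to match, which may explain the empty results. The # is escaped so re.VERBOSE doesn't treat it as a comment, the leading .* is dropped since finditer scans the whole string anyway, and the sample stringSoup value is made up:
import re

pattern = re.compile(r"""
    \(\#(?P<username1>[^)]+)\)              # username after '(#', up to ')'
    |                                       # or
    instagram\.com/(?P<username2>[^/]+)/p   # username between 'instagram.com/' and '/p'
""", re.VERBOSE)

stringSoup = 'follow (#alice) or https://www.instagram.com/bob/p/abc123/'  # made-up sample
usernames = [m.group('username1') or m.group('username2') for m in pattern.finditer(stringSoup)]
print(usernames)  # ['alice', 'bob']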

Exact keyword match in string

I know this question has been asked almost a hundred times on Stack Overflow, but after doing a lot of searching and not finding my answer, I am asking it here.
I am looking to search for an exact word in strings like the ones below.
'svm_midrangedb_nonprod:svm_midrangedb_nonprod_root'
'svm_midrangedb_prod:svm_midrangedb_prod_root'
I want to search only for 'prod' but getting both 'prod' and 'nonprod' in output.
Here is the code I am using:
re.search(r"\wprod\w", in_volumes.json()[i]['name'].split(":")[2].lower())
You have to make rules to not match nonprod but match prod.
For example, maybe you can make it so that if there's an n in front of prod, you exclude it, like this: [^n]prod\w.
Or maybe some data has an n in front of prod and you want to keep it. Then you want to exclude only when there's non in front of prod, like this: \w*(?<!non)prod\w*.
It really depends on the rest of your data and see what kind of rules you can make/apply to them to get your desired data.
It's normal, because your regular expression says that you want a string containing "prod". To solve that very easily, you can do the same thing you did but as follows:
re.search(r"\w_prod\w", in_volumes.json()[i]['name'].split(":")[2].lower())
I just added the _ character that exists in your target string before "prod".
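A small sketch pulling these suggestions together on the strings from the question (note the sample names shown only have two colon-separated parts, so split(":")[1] is used here rather than [2]):
import re

names = [
    'svm_midrangedb_nonprod:svm_midrangedb_nonprod_root',
    'svm_midrangedb_prod:svm_midrangedb_prod_root',
]
for name in names:
    volume = name.split(":")[1].lower()
    match = re.search(r"(?<!non)prod", volume)  # 'prod' not preceded by 'non'
    print(volume, "->", bool(match))
# svm_midrangedb_nonprod_root -> False
# svm_midrangedb_prod_root -> True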

Generate random string from regex character set

I assume there's some beautiful Pythonic way to do this, but I haven't quite figured it out yet. Basically I'm looking to create a testing module and would like a nice simple way for users to define a character set to pull from. I could potentially concatenate a list of the various charsets associated with string, but that strikes me as a very unclean solution. Is there any way to get the charset that the regex represents?
Example:
def foo(regex_set):
    re.something(re.compile(regex_set))
foo("[a-z]")
>>> abcdefghijklmnopqrstuvwxyz
The compile is of course optional, but in my mind that's what this function would look like.
Paul McGuire, author of Pyparsing, has written an inverse regex parser, with which you could do this:
import invRegex
print(''.join(invRegex.invert('[a-z]')))
# abcdefghijklmnopqrstuvwxyz
If you do not want to install Pyparsing, there is also a regex inverter that uses only modules from the standard library with which you could write:
import inverse_regex
print(''.join(inverse_regex.ipermute('[a-z]')))
# abcdefghijklmnopqrstuvwxyz
Note: neither module can invert all regex patterns.
And there are differences between the two modules:
import invRegex
import inverse_regex
print(repr(''.join(invRegex.invert('.'))))
print(repr(''.join(inverse_regex.ipermute('.'))))
yields
'0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~'
'0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~ \t\n\r\x0b\x0c'
Here is another difference, this time pyparsing enumerates a larger set of matches:
x = list(invRegex.invert('[a-z][0-9]?.'))
y = list(inverse_regex.ipermute('[a-z][0-9]?.'))
print(len(x))
# 26884
print(len(y))
# 1100
A regex is not needed here. If you want to have users select a character set, let them just pick characters. As I said in my comment, simply listing all the characters and putting checkboxes by them would be sufficient. If you want something that is more compact, or just looks cooler, you could do something like one of these:
(example widget screenshots not reproduced here)
Of course, if you actually use this, what you come up with will undoubtedly look better than these (and they will also actually have all the letters in them, not just "A").
If you need, you could include a button to invert the selection, select all, clear selection, save selection, or anything else you need to do.
If it's just simple ranges you could manually parse it:
def range_parse(rng):
    min, max = rng.split("-")
    return "".join(chr(i) for i in range(ord(min), ord(max) + 1))

print(range_parse("a-z") + range_parse('A-Z'))
but it's gross ...
Another solution I thought of to simplify the problem:
Stick your own [ and ] on the line as part of the prompt, and disallow those characters in the input. After you scan the input and verify it doesn't contain anything matching [\[\]], you can prepend [ and append ] to the string, and use it like a regex against a string of all the characters needed ("abcdefghijklmnopqrstuvwxyz", for instance).
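A short sketch of that idea; the charset_from_user name and the letters-plus-digits reference string are my own choices:
import re
import string

def charset_from_user(spec):
    # hypothetical helper: treat the user's input as the inside of a character class
    if re.search(r"[\[\]]", spec):
        raise ValueError("'[' and ']' are not allowed in the character-set spec")
    pattern = re.compile("[" + spec + "]")
    return "".join(c for c in string.ascii_letters + string.digits if pattern.match(c))

print(charset_from_user("a-z"))  # abcdefghijklmnopqrstuvwxyz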

A simple regexp in python

My program is a simple calculator, so I need to parse the expression which the user types, to make the input more user-friendly. I know I can do it with regular expressions, but I'm not familiar enough with them.
So I need to transform an input like this:
import re
input_user = "23.40*1200*(12.00-0.01)*MM(H2O)/(8.314 *func(2*x+273.15,x))"
re.some_stuff( ,input_user) # ????
into this:
"23.40*1200*(12.00-0.01)*MM('H2O')/(8.314 *func('2*x+273.15',x))"
just adding single quotes inside the parentheses. How can I do that?
UPDATE:
To be clearer, I want to add single quotes after every occurrence of "MM(" and before the ")" that comes after it, and after every occurrence of "func(" and before the "," that comes after it.
This is the sort of thing where regexes can work, but they can potentially result in major problems unless you consider exactly what your input will be like. For example, can whatever is inside MM(...) contain parentheses of its own? Can the first expression in func( contain a comma? If the answers to both questions is no, then the following could work:
input_user2 = re.sub(r'MM\(([^\)]*)\)', r"MM('\1')", input_user)
output = re.sub(r'func\(([^,]*),', r"func('\1',", input_user2)
However, this will not work if the answer to either question is yes, and even without that could cause problems depending upon what sort of inputs you expect to receive. Essentially, the first re.sub here looks for MM( ('MM('), followed by any number (including 0) of characters that aren't a close-parenthesis ('([^)]*)') that are then stored as a group (caused by the extra parentheses), and then a close-parenthesis. It replaces that section with the string in the second argument, where \1 is replaced by the first and only group from the pattern. The second re.sub works similarly, looking for any number of characters that aren't a comma.
If the answer to either question is yes, then regexps aren't appropriate for the parsing, as your language would not be regular. The answer to this question, while discussing a different application, may give more insight into that matter.
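For reference, running both substitutions end to end on the question's input (applying the second one to the result of the first) gives the output asked for:
import re

input_user = "23.40*1200*(12.00-0.01)*MM(H2O)/(8.314 *func(2*x+273.15,x))"
step1 = re.sub(r"MM\(([^\)]*)\)", r"MM('\1')", input_user)
output = re.sub(r"func\(([^,]*),", r"func('\1',", step1)
print(output)
# 23.40*1200*(12.00-0.01)*MM('H2O')/(8.314 *func('2*x+273.15',x))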

Parsing FIX protocol in regex?

I need to parse logfiles that contain FIX protocol messages.
Each line contains header information (timestamp, logging level, endpoint), followed by a FIX payload.
I've used regex to parse the header information into named groups. E.g.:
(?P<datetime>\d{2}/\d{2}/\d{2} \d{2}:\d{2}:\d{2}.\d{6}) (?P<process_id>\d{4}/\d{1,2})\s*(?P<logging_level>\w*)\s*(?P<endpoint>\w*)\s*
I then come to the FIX payload itself (^A is the separator between each tag) e.g:
8=FIX.4.2^A9=61^A35=A...^A11=blahblah...
I need to extract specific tags from this (e.g. "A" from 35=, or "blahblah" from 11=) and ignore all the other stuff - basically I need to ignore everything before "35=A", then everything after it up to "11=blahblah", then everything after that, etc.
I do know there are libraries that might be able to parse each and every tag (http://source.kentyde.com/fixlib/overview); however, I was hoping for a simple approach using regex here if possible, since I really only need a couple of tags.
Is there a good way in regex to extract the tags I require?
Cheers,
Victor
No need to split on "\x01" then regex then filter. If you wanted just tags 34,49 and 56 (MsgSeqNum, SenderCompId and TargetCompId) you could regex:
dict(re.findall("(?:^|\x01)(34|49|56)=(.*?)\x01", raw_msg))
Simple regexes like this will work if you know your sender does not have embedded data that could cause a bug in any simple regex. Specifically:
No raw data fields (actually a combination of a data length and raw data, like RawDataLength/RawData (95/96) or XmlDataLen/XmlData (212/213))
No encoded fields for unicode strings like EncodedTextLen, EncodedText (354/355)
To handle those cases takes a lot of additional parsing. I use a custom python parser but even the fixlib code you referenced above gets these cases wrong. But if your data is clear of these exceptions the regex above should return a nice dict of your desired fields.
Edit: I've left the above regex as-is, but it should be revised so that the final match element is (?=\x01). The explanation can be found in @tropleee's answer here.
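A quick sketch of the revised pattern on a made-up message (the tag values are invented, and tags 35 and 11 from the question are included alongside 49 and 56):
import re

raw_msg = "8=FIX.4.2\x019=61\x0135=A\x0149=SENDER\x0156=TARGET\x0111=blahblah\x01"  # toy message
# trailing separator as a lookahead, so adjacent wanted fields both match
tags = dict(re.findall("(?:^|\x01)(35|49|56|11)=(.*?)(?=\x01)", raw_msg))
print(tags)  # {'35': 'A', '49': 'SENDER', '56': 'TARGET', '11': 'blahblah'}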
^A is actually \x{01}, that's just how it shows up in vim. In Perl, I had done this via a split on hex 1 and then a split on "=", where, at the second split, value [0] of the array is the Tag and value [1] is the Value.
Use a regex tool like expresso or regexbuddy.
Why don't you split on ^A and then match ([^=]+)=(.*) for each one, putting them into a hash? You could also filter with a switch that by default won't add the tags you're uninterested in and that has a fall-through for all the tags you are interested in.
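A sketch of that split-and-filter approach (toy message again; the wanted set plays the role of the switch):
import re

raw_msg = "8=FIX.4.2\x019=61\x0135=A\x0111=blahblah\x01"  # ^A shown here as \x01
wanted = {"35", "11"}  # tags of interest; everything else falls through and is dropped
fields = {}
for part in raw_msg.split("\x01"):
    m = re.match(r"([^=]+)=(.*)", part)
    if m and m.group(1) in wanted:
        fields[m.group(1)] = m.group(2)
print(fields)  # {'35': 'A', '11': 'blahblah'}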
