How to create a CSS identifier from an arbitrary string with python? - python

Working on a Django Template tag, I find myself needing to take a string and convert it into a CSS identifier so it can be part of a class attribute on an html element. The problem is the string can contain spaces which makes it useless as a CSS identifier, and it could contain punctuation as well.
My thoughts were to use a regex to rip out the good parts and then put them back together, but I can't figure out how to express the repeating group pattern. Here is what I have
to_css = re.compile(r"[^a-z_-]*([a-z0-9_-]+[^a-z0-9_]*)+", re.IGNORECASE)
@register.filter(name='as_css_class')
def as_css_class(value):
    matches = to_css.match(value)
    if matches:
        return '-'.join(matches.groups())
    return ""
The problem comes when you do this:
as_css_class("Something with a space in it")
and you get
'it'
I was hoping the + would apply to the (group), but evidently it doesn't do what I want.

You can use slugify for this:
from django.template.defaultfilters import slugify
slugify("Something with a space in it")

Your regex will match the whole string, and the only group captured will be "it" (hence the result). A capturing group keeps only the last string it captured. You can't capture an arbitrary number of strings with a single repeated group.
What you can do, however, is use the global modifier g (or simply re.findall in Python, I believe). Something like:
re.findall(r'[\w-]+', value)
and then join the result (more or less, my Python's a little rusty).
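A minimal sketch of the findall-and-join idea above, using only the standard library. The function name mirrors the question's as_css_class; the ASCII-only character class, the hyphen separator, and dropping the Django decorator are assumptions made for illustration. It does not enforce every CSS rule (for example, it will happily return a name that starts with a digit).

```python
import re

# Runs of characters that are safe to keep in a class name (assumed set).
_CSS_SAFE_RUN = re.compile(r"[A-Za-z0-9_-]+")


def as_css_class(value):
    """Join the identifier-safe runs of `value` with hyphens."""
    # findall returns every non-overlapping run, unlike a repeated
    # capturing group, which only remembers its last repetition.
    return "-".join(_CSS_SAFE_RUN.findall(value))


print(as_css_class("Something with a space in it"))
```

Spaces and punctuation act only as separators here, so consecutive separators collapse into a single hyphen.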

Does it need to be a CSS class?
<div data-something="Anything you like provided it's HTML escaped"> ... </div>
div[data-something="Anything you like provided it's HTML escaped"] {
background: red;
}
Arguably you shouldn't be shoe-horning arbitrary data into the class, since you risk clashing with an existing class. Data attributes allow you to specify information without name clashes.

Related

Python: Dynamic matching with regular expressions

I've been trying several things to use variables inside a regular expression. None seem to be capable of doing what I need.
I want to search for a consecutively repeated substring (e.g. foofoofoo) within a string (e.g. "barbarfoofoofoobarbarbar"). However, I need both the repeating substring (foo) and the number of repetitions (In this case, 3) to be dynamic, contained within variables. Since the regular expression for repeating is re{n}, those curly braces conflict with the variables I put inside the string, since they also need curly braces around.
The code should match foofoofoo, but NOT foo or foofoo.
I suspect I need to use string interpolation of some sort.
I tried stuff like
n = 3
str = "foo"
string = "barbarfoofoofoobarbarbar"
match = re.match(fr"{str}{n}", string)
or
match = re.match(fr"{str}{{n}}", string)
or escaping with
match = re.match(fr"re.escape({str}){n}", string)
but none of that seems to work. Any thoughts? It's really important both pieces of information are dynamic, and it matches only consecutive stuff. Perhaps I could use findall or finditer? No idea how to proceed.
Something I haven't tried at all is not using regular expressions, but something like
if (str*n) in string:
    match
I don't know if that would work, but if I ever need the extra functionality of regex, I'd like to be able to use it.
For the string barbarfoofoofoobarbarbar, if you wanted to capture foofoofoo, the regex would be r"(foo){3}". If you wanted to do this dynamically, you could use fr"({your_string}){{{your_number}}}".
If you want a curly brace in an f-string, you use {{ or }} and it'll be printed literally as { or }.
Also, str is not a good variable name because str is a class (the string class).
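A short sketch of the f-string approach, under a few assumptions: the helper name find_repeated is made up for illustration, re.search is used instead of re.match because re.match only matches at the start of the string, and re.escape is applied so that a substring containing regex metacharacters is treated literally.

```python
import re


def find_repeated(substring, count, text):
    """Return a match for `substring` repeated `count` times in a row, or None."""
    # {{ and }} produce literal braces, so for count=3 the pattern ends in {3}.
    pattern = rf"(?:{re.escape(substring)}){{{count}}}"
    return re.search(pattern, text)


match = find_repeated("foo", 3, "barbarfoofoofoobarbarbar")
if match:
    print(match.group(0), match.start())
```

Note that a run of four or more repetitions also contains three in a row, so this would still report a match there; excluding longer runs would need extra lookarounds.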

Regex to match MediaWiki template without certain named parameter

I’ll get to the point: I need a regex that matches any template out of a list that does not have a date parameter. Assuming that my (singleton for now) list of templates is “stub”, only the first two items below should be matched:
{{stub}}
{{stub|param}}
{{stub|date=a}}
{{stub|param|date=a}}
{{stub|date=a|param}}
{{stub|param|date=a|param}}
Note: “param” means any number of parameters there.
Additionally, it would be nice if it could also match if the date parameter is blank, but this is not required.
The current regex I have so far is
{{((?:stub|inaccurate)(?!(?:\|.*?\|)*?\|date=.*?(?:\|.*?)*?)(?:\|.*?)*?)}}
However it matches the fourth and sixth items in the list above.
Note: (?:stub|inaccurate) is just to make sure the template is either a stub or inaccurate template.
Note 2: the flavor of regex here is Python 2.7 module RE.
Since you are using Python, you have the luxury of an actual parser:
import mwparserfromhell
wikicode = mwparserfromhell.parse('{{stub|param|date=a|param}}')
for template in wikicode.filter_templates():
    if template.get('date')...
That will remain accurate even if the template contains something you would not have expected ({{stub| date=a}}, {{stub|<!--<newline>-->date=a}}, {{stub|foo={{bar}}|date=a}} etc.). The classic answer on the dangers of using regular expressions to parse complex markup applies to wikitext as well.
I think it's enough to have a negative look-ahead, which tries to match date at any position?
{{((?:stub|inaccurate)(?!.*\|date=).*)}}
If empty date parameters have a | following the equals sign, then use
{{((?:stub|inaccurate)(?!.*\|date=[^|}]).*)}}
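As a quick check of the two look-ahead patterns above, here is a sketch that runs them against the six sample templates from the question. The braces are escaped for readability, the constant names are made up, and each sample is tested on its own with fullmatch; text containing several templates on one line would need a tighter pattern than `.*`.

```python
import re

# Pattern from the answer: reject the template if "|date=" appears anywhere.
NO_DATE = re.compile(r"\{\{((?:stub|inaccurate)(?!.*\|date=).*)\}\}")
# Variant that still matches when the date parameter is present but blank.
NO_DATE_OR_BLANK = re.compile(
    r"\{\{((?:stub|inaccurate)(?!.*\|date=[^|}]).*)\}\}"
)

samples = [
    "{{stub}}",
    "{{stub|param}}",
    "{{stub|date=a}}",
    "{{stub|param|date=a}}",
    "{{stub|date=a|param}}",
    "{{stub|param|date=a|param}}",
]

# Keep only the templates that have no date parameter.
matched = [s for s in samples if NO_DATE.fullmatch(s)]
print(matched)
```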

python: regex to extract content between two text

I want a python regex expression that can pull the contents between script[" and "] but there are other "]" which worries me
expected:
{bunch of javascript here. [\"apple\"] test}
my attempt:
javascript\[\"(.*)"]
target string:
//url//script["{bunch of javascript here. [\"apple\"] test}"]|//*[@attribute="eggs"]
You can't match nested brackets with the re module since it doesn't have the recursion feature to do that. However, in your example you can skip the innermost square brackets if you choose to ignore all brackets enclosed between double quotes.
try something like this:
p = re.compile(r'script\["([^\\"]*(?:\\.[^\\"]*)*)"]', re.S)
Note: I assumed here that the predicate is only related to the "text" content of the script node (and not to an attribute, an item number, or an axis).
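To see the pattern above in action, here is a small sketch run against the target string from the question. The constant and variable names are illustrative only; the pattern itself is the one given in the answer.

```python
import re

# A double-quoted body in which a backslash escapes the next character,
# so an escaped quote (\") does not terminate the body.
SCRIPT_BODY = re.compile(r'script\["([^\\"]*(?:\\.[^\\"]*)*)"]', re.S)

target = (
    r'//url//script["{bunch of javascript here. [\"apple\"] test}"]'
    r'|//*[@attribute="eggs"]'
)

match = SCRIPT_BODY.search(target)
if match:
    # Group 1 is everything between script[" and the closing "]
    print(match.group(1))
```

Because the body stops at the first unescaped double quote, the later "] in the second predicate is never reached.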
It's very hard to understand exactly what you want to achieve because of the way you have written the question. However, if you are looking for the first instance of "] AFTER a } then try this:
\["([^}]+}.*?)"\]
This also would work:
\["(.*?}.*?)"\]

Parsing FIX protocol in regex?

I need to parse log files that contain FIX protocol messages.
Each line contains header information (timestamp, logging level, endpoint), followed by a FIX payload.
I've used regex to parse the header information into named groups. E.g.:
(?P<datetime>\d{2}/\d{2}/\d{2} \d{2}:\d{2}:\d{2}\.\d{6}) (?P<process_id>\d{4}/\d{1,2})\s*(?P<logging_level>\w*)\s*(?P<endpoint>\w*)\s*
I then come to the FIX payload itself (^A is the separator between each tag) e.g:
8=FIX.4.2^A9=61^A35=A...^A11=blahblah...
I need to extract specific tags from this (e.g. "A" from 35=, or "blahblah" from 11=), and ignore all the other stuff - basically I need to ignore anything before "35=A", and anything after up to "11=blahblah", then ignore anything after that etc.
I do know there are libraries that might be able to parse each and every tag (http://source.kentyde.com/fixlib/overview); however, I was hoping for a simple approach using regex here if possible, since I really only need a couple of tags.
Is there a good way in regex to extract the tags I require?
Cheers,
Victor
No need to split on "\x01" then regex then filter. If you wanted just tags 34,49 and 56 (MsgSeqNum, SenderCompId and TargetCompId) you could regex:
dict(re.findall("(?:^|\x01)(34|49|56)=(.*?)\x01", raw_msg))
Simple regexes like this will work if you know your sender does not have embedded data that could cause a bug in any simple regex. Specifically:
No Raw Data fields (actually a combination of data length and raw data, like RawDataLength, RawData (95/96) or XmlDataLen, XmlData (212/213))
No encoded fields for unicode strings, like EncodedTextLen, EncodedText (354/355)
To handle those cases takes a lot of additional parsing. I use a custom python parser but even the fixlib code you referenced above gets these cases wrong. But if your data is clear of these exceptions the regex above should return a nice dict of your desired fields.
Edit: I've left the above regex as-is, but it should be revised so that the final match element is (?=\x01). The explanation can be found in @tropleee's answer here.
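A sketch of the revised regex with the look-ahead, under stated assumptions: the helper name extract_tags and the sample message are invented for illustration, and the message is assumed to contain no raw-data or encoded fields. The look-ahead matters because a consumed trailing separator cannot also serve as the leading separator of the next field, so two wanted tags that sit side by side would otherwise lose the second one.

```python
import re

SOH = "\x01"  # FIX field separator, displayed as ^A in some editors


def extract_tags(raw_msg, tags):
    """Return {tag: value} for the requested tag numbers (as strings)."""
    wanted = "|".join(re.escape(tag) for tag in tags)
    # (?=SOH) checks for the trailing separator without consuming it.
    pattern = rf"(?:^|{SOH})({wanted})=(.*?)(?={SOH})"
    return dict(re.findall(pattern, raw_msg))


# Made-up sample payload; real messages carry many more fields.
raw_msg = SOH.join(
    ["8=FIX.4.2", "9=61", "35=A", "34=1", "49=SENDER", "56=TARGET", "11=blahblah"]
) + SOH

print(extract_tags(raw_msg, ["34", "49", "56"]))
print(extract_tags(raw_msg, ["35", "11"]))
```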
^A is actually \x{01}; that's just how it shows up in vim. In Perl, I had done this via a split on hex 1 and then a split on "="; after the second split, value [0] of the array is the Tag and value [1] is the Value.
Use a regex tool like expresso or regexbuddy.
Why don't you split on ^A and then match ([^=]+)=(.*) for each one, putting them into a hash? You could also filter with a switch that by default won't add the tags you're uninterested in and that has a fall through for all the tags you are interested in.
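The split-based approach might look like the following in Python. This is a sketch only: the function name parse_fix and the sample payload are made up, a set membership test stands in for the switch statement, and the same caveat about raw-data and encoded fields applies.

```python
SOH = "\x01"  # FIX field separator (^A)


def parse_fix(raw_msg, wanted):
    """Split a FIX payload into fields and keep only the wanted tags."""
    fields = {}
    for pair in raw_msg.split(SOH):
        # partition splits on the first "=" only, so values may contain "=".
        tag, sep, value = pair.partition("=")
        if sep and tag in wanted:
            fields[tag] = value
    return fields


# Made-up sample payload for illustration.
raw_msg = SOH.join(["8=FIX.4.2", "9=61", "35=A", "11=blah=blah"]) + SOH
print(parse_fix(raw_msg, {"35", "11"}))
```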

Python regex: how to extract inner data from regex

I want to extract data from such regex:
<td>[a-zA-Z]+</td><td>[\d]+.[\d]+</td><td>[\d]+</td><td>[\d]+.[\d]+</td>
I've found a related question (extract contents of regex), but in my case I should iterate somehow.
As paprika mentioned in his/her comment, you need to identify the desired parts of any matched text using ()'s to set off the capture groups. To get the contents from within the td tags, change:
<td>[a-zA-Z]+</td><td>[\d]+.[\d]+</td><td>[\d]+</td><td>[\d]+.[\d]+</td>
to:
<td>([a-zA-Z]+)</td><td>([\d]+.[\d]+)</td><td>([\d]+)</td><td>([\d]+.[\d]+)</td>
     ^^^^^^^^^           ^^^^^^^^^^^           ^^^^^           ^^^^^^^^^^^
      group 1              group 2            group 3            group 4
And then access the groups by number. (Use only the first line; the line of '^'s and the line naming the groups are just there to help you see the capture groups as delimited by the parentheses.)
dataPattern = re.compile(r"<td>([a-zA-Z]+)</td>... etc.")
match = dataPattern.search(htmlstring)
field1 = match.group(1)
field2 = match.group(2)
and so on. But you should know that using re's to crack HTML source is one of the paths toward madness. There are many potential surprises that will lurk in your input HTML, that are perfectly working HTML, but will easily defeat your re:
"<TD>" instead of "<td>"
spaces between tags, or between data and tags
"&nbsp;" spacing characters
Libraries like BeautifulSoup, lxml, or even pyparsing will make for more robust web scrapers.
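Since the question asks how to iterate, here is a sketch that combines the capture groups above with re.finditer to pull the fields out of every matching row. The sample HTML is invented for illustration, and the dots are escaped so they match only a literal decimal point (an adjustment to the pattern as posted). The caveats about parsing HTML with regular expressions still apply.

```python
import re

ROW = re.compile(
    r"<td>([a-zA-Z]+)</td>"   # group 1: name
    r"<td>(\d+\.\d+)</td>"    # group 2: decimal number
    r"<td>(\d+)</td>"         # group 3: integer
    r"<td>(\d+\.\d+)</td>"    # group 4: decimal number
)

# Made-up input with two rows.
html = (
    "<tr><td>apple</td><td>1.50</td><td>3</td><td>4.50</td></tr>"
    "<tr><td>pear</td><td>0.25</td><td>12</td><td>3.00</td></tr>"
)

# finditer yields one match object per row; groups() returns the four fields.
rows = [m.groups() for m in ROW.finditer(html)]
for name, price, quantity, total in rows:
    print(name, price, quantity, total)
```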
As the poster clarified, the <td> tags should be removed from the string.
Note that the string you've shown us is just that: a string. Only if used in the context of regular expression functions is it a regular expression (a regexp object can be compiled from it).
You could remove the <td> tags as simply as this (assuming your string is stored in s):
s.replace('<td>','').replace('</td>','')
Watch out for the gotchas however: this is really of limited use in the context of real HTML, just as others pointed out.
Further, you should be aware that whatever regular expression string is left after removing the <td> tags probably won't parse what you want: it will not automatically match the text the original pattern matched, minus the <td> tags!
