Python regex: how to extract inner data from regex - python

I want to extract the data matched by a regex like this:
<td>[a-zA-Z]+</td><td>[\d]+.[\d]+</td><td>[\d]+</td><td>[\d]+.[\d]+</td>
I've found the related question "extract contents of regex",
but in my case I should iterate somehow.

As paprika mentioned in his/her comment, you need to identify the desired parts of any matched text using ()'s to set off the capture groups. To get the contents from within the td tags, change:
<td>[a-zA-Z]+</td><td>[\d]+.[\d]+</td><td>[\d]+</td><td>[\d]+.[\d]+</td>
to:
<td>([a-zA-Z]+)</td><td>([\d]+.[\d]+)</td><td>([\d]+)</td><td>([\d]+.[\d]+)</td>
     ^^^^^^^^^           ^^^^^^^^^^^           ^^^^^           ^^^^^^^^^^^
     group 1             group 2               group 3         group 4
And then access the groups by number. (Only the first line is the pattern; the line of '^'s and the line naming the groups are there just to help you see the capture groups set off by the parentheses.)
import re
dataPattern = re.compile(r"<td>([a-zA-Z]+)</td>... etc.")
match = dataPattern.search(htmlstring)  # search() returns None if nothing matches
field1 = match.group(1)
field2 = match.group(2)
and so on. But you should know that using re's to crack HTML source is one of the paths toward madness. Many potential surprises lurk in your input HTML that are perfectly valid HTML but will easily defeat your re:
"<TD>" instead of "<td>"
spaces between tags, or between data and tags
"&nbsp;" spacing characters
Libraries like BeautifulSoup, lxml, or even pyparsing will make for more robust web scrapers.
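To iterate over every matching row, as the question asks, one option is re.finditer. The following is only a minimal sketch: the sample html string and the names row_pattern and rows are made up for illustration, and the dots are escaped as \. on the assumption that they are meant to match a literal decimal point.

```python
import re

# Hypothetical sample input; the real page's markup may differ.
html = (
    "<tr><td>apples</td><td>1.50</td><td>12</td><td>18.00</td></tr>"
    "<tr><td>pears</td><td>2.25</td><td>4</td><td>9.00</td></tr>"
)

# Same four capture groups as above, with the decimal point escaped so that
# it matches only a literal "." rather than any character.
row_pattern = re.compile(
    r"<td>([a-zA-Z]+)</td><td>(\d+\.\d+)</td><td>(\d+)</td><td>(\d+\.\d+)</td>"
)

# finditer yields one match object per row; groups() returns the four fields.
rows = [m.groups() for m in row_pattern.finditer(html)]
for name, price, count, total in rows:
    print(name, price, count, total)
```

All fields come back as strings, so convert them with float() or int() if you need numbers.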

As the poster clarified, the <td> tags should be removed from the string.
Note that the string you've shown us is just that: a string. Only if used in the context of regular expression functions is it a regular expression (a regexp object can be compiled from it).
You could remove the <td> tags as simply as this (assuming your string is stored in s):
s.replace('<td>','').replace('</td>','')
Watch out for the gotchas however: this is really of limited use in the context of real HTML, just as others pointed out.
Further, be aware that whatever regular expression string is left will probably not parse what you want: without the <td> tags, it is not going to automatically match the text it matched before.

Related

Regular expression to extract info from HTML file

I would like to use a regular expression to extract the following text from an HTML file: ">ABCDE</A></td><td>
I need to extract: ABCDE
Could anybody please help me with the regular expression that I should use?
Leaning on this, https://stackoverflow.com/a/40908001/11450166
(?<=(<A>))[A-Za-z]+(?=(<\/A>))
That expression works fine, supposing that your tag is <A> </A>.
This other one matches your input form:
(?<=(>))[A-Za-z]+(?=(<\/A>))
You can try using this regular expression in your specific example:
/">(.*)<\/A><\/td><td>/g
Tested on string:
Lorem ipsum">ABCDE</A></td><td>Lorem ipsum<td></td>Lorem ipsum
extracts:
">ABCDE</A></td><td>
Then it's all a matter of extracting the substring from each match using any programming language. This can be done by removing the first 2 characters and the last 13 characters of the string matched by the regex, so that only ABCDE remains.
I also tried:
/">([^<]*)<\/A><\/td><td>/g
It has the same effect, but it won't match text that includes additional HTML code. As far as I understand it, [^<]* is a negated character class that won't match < characters in that region, so it won't catch other tag elements inside it. This could be useful for finer control if you're searching for some specific text and need to filter out nested HTML code.
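In Python, the /.../g notation above translates to re.findall, and the capture group makes the character trimming unnecessary. A small sketch, assuming the same test string as above (the names text, pattern and values are only illustrative):

```python
import re

# The test string used above.
text = 'Lorem ipsum">ABCDE</A></td><td>Lorem ipsum<td></td>Lorem ipsum'

# Group 1 holds only the text between '">' and '</A></td><td>'.
pattern = re.compile(r'">([^<]*)</A></td><td>')

# With exactly one capture group, findall returns the group contents
# rather than the whole match.
values = pattern.findall(text)
print(values)
```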

Regex to read tags Python

I want to read elements within tags with regex, example:
<td>Stuff Here</td>
<td>stuff
</td>
I am using the following: re.findall(re.compile('<td>(.*)</td>'), str(line).strip())
How come I can read the first <td> tag, but not the second?
For the general case, you can't use regular expressions for parsing markup. The best you can do is to start using an HTML parser, there are many good options out there, IMHO Beautiful Soup is a good choice.
First of all, I assume that line contains the entire HTML document, and not just a single line as its name would imply.
One issue is that by default, . doesn't match the newline:
In [3]: re.findall('.', '\n')
Out[3]: []
You either need to remove embedded newlines (which strip() doesn't do BTW), or use re.DOTALL:
In [4]: re.findall('.', '\n', re.DOTALL)
Out[4]: ['\n']
Also, you should change the .* to .*? to make the expression non-greedy.
Another, bigger, issue is that a regex-based approach is insufficiently general to parse arbitrary HTML. See RegEx match open tags except XHTML self-contained tags for a nice discussion.
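Putting the two regex fixes together (re.DOTALL plus a non-greedy .*?), a sketch for the two cells from the question could look like the following. It is meant for small, predictable inputs only, and the names html, cell_pattern and cells are assumptions for illustration.

```python
import re

# The two cells from the question, including the embedded newline.
html = "<td>Stuff Here</td>\n<td>stuff\n</td>"

# DOTALL lets "." match newlines; ".*?" stops at the first closing tag.
cell_pattern = re.compile(r"<td>(.*?)</td>", re.DOTALL)

# strip() removes the newline left inside the second cell.
cells = [cell.strip() for cell in cell_pattern.findall(html)]
print(cells)
```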

How to create a CSS identifier from an arbitrary string with python?

Working on a Django Template tag, I find myself needing to take a string and convert it into a CSS identifier so it can be part of a class attribute on an html element. The problem is the string can contain spaces which makes it useless as a CSS identifier, and it could contain punctuation as well.
My thoughts were to use a regex to rip out the good parts and then put them back together, but I can't figure out how to express the repeating group pattern. Here is what I have
to_css = re.compile(r"[^a-z_-]*([a-z0-9_-]+[^a-z0-9_]*)+", re.IGNORECASE)
@register.filter(name='as_css_class')
def as_css_class(value):
    matches = to_css.match(value)
    if matches:
        return '-'.join(matches.groups())
    return ""
The problem comes when you do this:
as_css_class("Something with a space in it")
and you get
'it'
I was hoping the + would apply to the (group), but evidently it doesn't do what I want.
You can use slugify for this:
from django.template.defaultfilters import slugify
slugify("Something with a space in it")
Your regex will match the whole string, and the only group captured will be "it" (hence the result). A repeated capturing group keeps only the last string it captured. You can't capture an arbitrary number of strings with a single match.
What you can do, however, is use the global modifier g (or simply re.findall in Python, I believe). Something like:
re.findall(r'[\w-]+', value)
and then join the result (more or less, my Python's a little rusty).
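Spelled out as a plain function, the findall-and-join idea might look like this sketch. The Django filter registration is left out so the snippet runs on its own, and it does not guard against a result that starts with a digit, which CSS identifiers do not allow unescaped.

```python
import re

def as_css_class(value):
    """Join the runs of word characters and hyphens in value with '-'."""
    return "-".join(re.findall(r"[\w-]+", value))

print(as_css_class("Something with a space in it"))
```

An input with no usable characters gives an empty string, the same as the fallback in the original function.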
Does it need to be a CSS class?
<div data-something="Anything you like provided it's HTML escaped"> ... </div>
div[data-something="Anything you like provided it's HTML escaped"] {
background: red;
}
Arguably you shouldn't be shoe-horning arbitrary data into the class, since you risk clashing with an existing class. Data attributes allow you to specify information without name clashes.

Removing TAGS in a document

I need to find all the tags in a .txt file (an SEC filing) and remove them from the filing.
Well, as a beginner of Python, I used the following code to find the tags, but it returns None, None, ... and I don't know how to remove all the tags. My question is how to find all the tags <....> and remove all the tags so that the document contains everything but tags.
import re
tags = [re.search(r'<.+>', line) for line in mylist]
# mylist is the list of lines returned by open(filename, 'rU').readlines()
Thanks for your time.
Use something like this:
re.sub(r'<[^>]+>', '', open(filename, 'r').read())
Your current code is getting a None for each line that does not include angle-bracketed tags.
You probably want to use [^>] to make sure it matches only up to the first >.
re.sub(r'<.*?>', '', line)
Use re.sub and <.*?> expression
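Applied line by line to a list like the one in the question, the substitution could be sketched as follows. The sample lines are invented stand-ins for the output of readlines(), and tag_pattern and cleaned are illustrative names.

```python
import re

# Hypothetical lines standing in for the result of readlines().
mylist = [
    "<DOCUMENT>\n",
    "<TYPE>10-K\n",
    "Plain text line\n",
    "I can type in <b>BOLD</b>\n",
]

# "[^>]+" cannot run past the first ">", so each tag is matched separately.
tag_pattern = re.compile(r"<[^>]+>")

# sub() returns the line unchanged when it contains no tags.
cleaned = [tag_pattern.sub("", line) for line in mylist]
print(cleaned)
```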
Well, for starters, you're going to need a different regex. The one you have will select everything from the first '<' to the last '>'. So the string:
I can type in <b>BOLD</b>
would render the match:
<b>BOLD</b>
The way to fix this is to use a lazy quantifier, so to match HTML tags you should be using
<.+?>
to match HTML tags. And ultimately, you should be substituting, so:
re.sub(r'<.+?>', '', line)
Though, I suspect what you'd actually like to match is between the tags. Here's where a good lookahead can do wonders!
(?<=>).+?(?=<)
Looks crazy, but it breaks down pretty easy. Let's start with what you know:
.+?
matches a string of arbitrary length. The ? means it will match the shortest string possible. (The laziness we added before.)
(?<=...)
is a lookbehind. It literally looks behind itself without consuming what it finds.
(?=...)
is a lookahead. It works the same way as a lookbehind, but looks forward. Then with a little findall:
re.findall(r'(?<=>).+?(?=<)', line)
Now, you can iterate over the list and trim any unnecessary spaces that got left behind, which makes for some really nice output! Or, if you'd really like to use a substitution method (I know I would):
re.sub(r'\s*(?:<.+?>\s*)+', ' ', line)
the
\s*
will match any amount of whitespace attached to a tag, which you can then replace with one space, whittling down those unnerving double and triple spaces that often result from overcareful tagging. As a bonus, the
(?: ... )
is known as a non-capturing group (it won't give you smaller sub-matches in your result). It's not really necessary in this situation for your purposes, but groups are always useful things to think about, and it's good practice to capture only the ones you need. Tacking a + onto the end of that (as I did) will match as many tags as are right next to each other, collapsing them into a single space. So if the file has
This is <b> <i> overemphasized </b> </i>!
you'd get
This is overemphasized !
instead of
This is   overemphasized  !
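The difference between the two substitutions can be checked with a short sketch. It assumes the tag pattern <.+?> used earlier in this answer; the variable names are illustrative.

```python
import re

line = "This is <b> <i> overemphasized </b> </i>!"

# Removing each tag on its own leaves the surrounding spaces in place.
plain = re.sub(r"<.+?>", "", line)

# Matching runs of adjacent tags together with their whitespace, and
# replacing each run with one space, collapses the gaps.
collapsed = re.sub(r"\s*(?:<.+?>\s*)+", " ", line)

print(repr(plain))
print(repr(collapsed))
```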

Parsing FIX protocol in regex?

I need to parse log files that contain FIX protocol messages.
Each line contains header information (timestamp, logging level, endpoint), followed by a FIX payload.
I've used regex to parse the header information into named groups. E.g.:
(?P<datetime>\d{2}/\d{2}/\d{2} \d{2}:\d{2}:\d{2}\.\d{6}) (?P<process_id>\d{4}/\d{1,2})\s*(?P<logging_level>\w*)\s*(?P<endpoint>\w*)\s*
I then come to the FIX payload itself (^A is the separator between each tag) e.g:
8=FIX.4.2^A9=61^A35=A...^A11=blahblah...
I need to extract specific tags from this (e.g. "A" from 35=, or "blahblah" from 11=), and ignore all the other stuff - basically I need to ignore anything before "35=A", and anything after up to "11=blahblah", then ignore anything after that etc.
I do know there are libraries that might be able to parse each and every tag (http://source.kentyde.com/fixlib/overview); however, I was hoping for a simple approach using regex here if possible, since I really only need a couple of tags.
Is there a good way in regex to extract the tags I require?
Cheers,
Victor
No need to split on "\x01" then regex then filter. If you wanted just tags 34,49 and 56 (MsgSeqNum, SenderCompId and TargetCompId) you could regex:
dict(re.findall("(?:^|\x01)(34|49|56)=(.*?)\x01", raw_msg))
Simple regexes like this will work if you know your sender does not have embedded data that could cause a bug in any simple regex. Specifically:
No Raw Data fields (actually a combination of data length and raw data, like RawDataLength, RawData (95/96) or XmlDataLen, XmlData (212/213))
No encoded fields for unicode strings like EncodedTextLen, EncodedText (354/355)
To handle those cases takes a lot of additional parsing. I use a custom python parser but even the fixlib code you referenced above gets these cases wrong. But if your data is clear of these exceptions the regex above should return a nice dict of your desired fields.
Edit: I've left the above regex as-is, but it should be revised so that the final match element is (?=\x01). The explanation can be found in @tropleee's answer here.
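Why the lookahead matters can be seen when two wanted tags sit next to each other: the consuming version uses up the separator that the next match needs as its start. A sketch with a made-up message (the field values are hypothetical and the body length and checksum are not valid):

```python
import re

SOH = "\x01"

# Hypothetical message; tags 34, 49 and 56 are adjacent on purpose.
raw_msg = SOH.join(
    ["8=FIX.4.2", "9=61", "35=A", "34=5", "49=SENDER", "56=TARGET", "10=123"]
) + SOH

# Original form: the trailing \x01 is consumed, so the field right after a
# match has no leading separator left to match against.
consuming = dict(re.findall("(?:^|\x01)(34|49|56)=(.*?)\x01", raw_msg))

# Revised form: the lookahead checks for \x01 without consuming it.
lookahead = dict(re.findall("(?:^|\x01)(34|49|56)=(.*?)(?=\x01)", raw_msg))

print(consuming)
print(lookahead)
```

With this input the consuming version skips tag 49, while the lookahead version returns all three fields.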
^A is actually \x{01}; that's just how it shows up in vim. In Perl, I did this via a split on hex 1 and then a split on "=". After the second split, element [0] of the array is the tag and element [1] is the value.
Use a regex tool like expresso or regexbuddy.
Why don't you split on ^A and then match ([^=]+)=(.*) for each one, putting them into a hash? You could also filter with a switch that by default skips the tags you're not interested in and falls through for all the tags you are interested in.
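In Python, the split-and-filter idea can be sketched without a regex at all. The message below is invented, wanted is an assumed set of tags of interest, and the same caveat about raw data fields containing the separator applies.

```python
SOH = "\x01"

# Hypothetical payload in the shape shown in the question.
raw_msg = SOH.join(["8=FIX.4.2", "9=61", "35=A", "11=blahblah"]) + SOH

# Tags to keep; everything else is ignored.
wanted = {"35", "11"}

fields = {}
for pair in raw_msg.split(SOH):
    # partition splits on the first "=" only, so values may contain "=".
    tag, sep, value = pair.partition("=")
    if sep and tag in wanted:
        fields[tag] = value

print(fields)
```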
