parse xml like document with regex - python

I have a file that has many xml-like elements such as this one:
<document docid=1>
Preliminary Report-International Algebraic Language
Perlis, A. J. & Samelson,K.
CACM December, 1958
</document>
I need to parse the docid and the text. What's a suitable regular expression for that?
I've tried this but it doesn't work:
collectionText = open('documents.txt').read()
docsPattern = r'<document docid=(\d+)>(.)*</document>'
docTuples = re.findall(docsPattern, collectionText)
EDIT: I've modified the pattern like this:
<document docid=(\d+)>(.*)</document>
Unfortunately, this matches the whole file as a single document rather than the individual document elements.
EDIT2: The correct implementation from Ahmad's and Acorn's answer is:
collectionText = open('documents.txt').read()
docsPattern = r'<document docid=(\d+)>(.*?)</document>'
docTuples = re.findall(docsPattern, collectionText, re.DOTALL)

Your pattern is greedy, so if you have multiple <document> elements it will end up matching all of them.
You can make it non-greedy by using .*?, which means "match zero or more characters, as few as possible." The updated pattern is:
<document docid=(\d+)>(.*?)</document>
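To see the difference, here is a minimal sketch comparing the greedy and non-greedy patterns. The two-element sample text is made up for illustration, and re.DOTALL is assumed so that . can cross line breaks:

```python
import re

# Hypothetical sample with two elements (not the asker's real file).
text = (
    "<document docid=1>\nfirst body\n</document>\n"
    "<document docid=2>\nsecond body\n</document>\n"
)

# Greedy: .* runs to the LAST </document>, so both elements collapse into one match.
greedy = re.findall(r'<document docid=(\d+)>(.*)</document>', text, re.DOTALL)

# Non-greedy: .*? stops at the FIRST </document> after each opening tag.
lazy = re.findall(r'<document docid=(\d+)>(.*?)</document>', text, re.DOTALL)

print(len(greedy))                   # 1
print([docid for docid, _ in lazy])  # ['1', '2']
```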

You need to use the DOTALL option with your regular expression so that it will match over multiple lines (by default . will not match newline characters).
Also note the comments regarding greediness in Ahmad's answer.
import re
text = '''<document docid=1>
Preliminary Report-International Algebraic Language
Perlis, A. J. & Samelson,K.
CACM December, 1958
</document>'''
pattern = r'<document docid=(\d+)>(.*?)</document>'
print(re.findall(pattern, text, re.DOTALL))
In general, regular expressions are not suitable for parsing XML/HTML.
See:
RegEx match open tags except XHTML self-contained tags and http://www.codinghorror.com/blog/2009/11/parsing-html-the-cthulhu-way.html
You want to use a parser like lxml.

Seems to work for .net "xml-like" structure just FYI...
<([^<>]+)>([^<>]+)<(\/[^<>]+)>


Beautiful soup if class not like "string" or regex

I know that beautiful soup has a function to match classes based on regex that contains certain strings, based on a post here. Below is a code example from that post:
regex = re.compile('.*listing-col-.*')
for EachPart in soup.find_all("div", {"class" : regex}):
    print(EachPart.get_text())
Now, is it possible to do the opposite? Basically, find classes that do not contain a certain regex. In SQL language, it's like:
where class not like '%test%'
Thanks in advance!
This actually can be done by using Negative Lookahead
Negative Lookahead has the syntax (?!«pattern») and succeeds only if pattern does not match the text that follows the current location in the input string.
In your case, you could use the following regex to match all classes that don’t contain listing-col- in their name:
regex = re.compile('^((?!listing-col-).)*$')
Here’s the pretty simple and straightforward explanation of this regex ^((?!listing-col-).)*$:
^ asserts position at start of a line
Capturing Group ((?!listing-col-).)*
* matches the previous token between zero and unlimited times, as many times as possible, giving back as needed
Negative Lookahead (?!listing-col-).
Assert that the Regex below does not match.
listing-col- matches the characters listing-col- literally (case sensitive)
. matches any character
$ asserts position at the end of a line
Also, you may find the https://regex101.com site useful
It will help you test your patterns and show you a detailed explanation of each step. It's your best friend in writing regular expressions.
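As a quick check of the pattern above, here is a small sketch that applies it to a few class strings directly, without Beautiful Soup. The class names in the list are invented for the example:

```python
import re

regex = re.compile(r'^((?!listing-col-).)*$')

# Hypothetical class attribute values to test against.
classes = ['listing-col-left', 'sidebar', 'x listing-col-3', '']

# Keep only the values in which "listing-col-" never appears.
kept = [c for c in classes if regex.match(c)]
print(kept)  # ['sidebar', '']
```

Note that the empty string also passes, since the group is allowed to repeat zero times.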
One possible solution is utilizing regex directly.
You can refer to Regular expression to match a line that doesn't contain a word.
Or you can introduce a function to implement the logic and pass it to find_all as a parameter.
You can refer to https://beautiful-soup-4.readthedocs.io/en/latest/index.html?highlight=find_all#find-all
You can use css selector syntax with :not() pseudo class and * contains operator
data = [i.get_text() for i in soup.select('div[class]:not([class*="listing-col-"])')]

Regex quantifiers

I'm new to regex and this is stumping me.
In the following example, I want to extract facebook.com/pages/Dr-Morris-Westfried-Dermatologist/176363502456825?id=176363502456825&sk=info. I've read up on lazy quantifiers and lookbehinds but I still can't piece together the right regex. I'd expect facebook.com\/.*?sk=info to work but it captures too much. Can you guys help?
<i class="mrs fbProfileBylineIcon img sp_2p7iu7 sx_96df30"></i></span><span class="fbProfileBylineLabel"><span itemprop="address" itemscope="itemscope" itemtype="http://schema.org/PostalAddress">7508 15th Avenue, Brooklyn, New York 11228</span></span></span><span class="fbProfileBylineFragment"><span class="fbProfileBylineIconContainer"><i class="mrs fbProfileBylineIcon img sp_2p7iu7 sx_9f18df"></i></span><span class="fbProfileBylineLabel"><span itemprop="telephone">(718) 837-9004</span></span></span></div></div></div><a class="title" href="https://www.facebook.com/pages/Dr-Morris-Westfried-Dermatologist/176363502456825?id=176363502456825&sk=info" aria-label="About Dr. Morris Westfried - Dermatologist">
As much as I love regex, this is an html parsing task:
>>> from bs4 import BeautifulSoup
>>> html = .... # that whole text in the question
>>> soup = BeautifulSoup(html, 'html.parser')
>>> pred = lambda tag: tag.get('href', '').endswith('sk=info')
>>> [tag['href'] for tag in filter(pred, soup.find_all('a'))]
['https://www.facebook.com/pages/Dr-Morris-Westfried-Dermatologist/176363502456825?id=176363502456825&sk=info']
This works :)
facebook\.com\/[^>]*?sk=info
Debuggex Demo
With only .*?, the match starts at the first facebook.com and continues until sk=info. Since there's another facebook.com in between, the match spans both.
The one character that cannot appear inside the part you want is a > (or <, among other characters), so replacing "any character" with "any character except >" finds the facebook.com closest to the sk=info, as you want.
And yes, regex should only be used on HTML for basic tasks. Otherwise, use a parser.
Why your pattern doesn't work:
Your pattern doesn't work because the regex engine tries your pattern from left to right in the string.
When the regex engine meets the first facebook.com\/ in the string, and since you use .*? after, the regex engine will add to the (possible) match result all the characters (including " or > or spaces) until it finds sk=info (since . can match any characters except newlines).
This is the reason why fejese suggests to replace the dot with [^"] or aliteralmind suggests to replace it with [^>] to make the pattern fail at this position in the string (the first).
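The behaviour described above can be reproduced with a short sketch. The HTML here is a shortened, made-up stand-in for the snippet in the question, with an extra facebook.com link placed before the wanted one:

```python
import re

# Hypothetical, shortened HTML: an earlier facebook.com link precedes the target.
html = ('<a href="https://www.facebook.com/other">x</a>'
        '<a class="title" href="https://www.facebook.com/pages/Example/1?id=1&sk=info">')

# Lazy dot: starts at the FIRST facebook.com/ and runs across tags until sk=info.
lazy = re.search(r'facebook\.com/.*?sk=info', html).group(0)

# Negated class: cannot cross a double quote, so the first attempt fails
# and the engine retries from the second facebook.com/.
bounded = re.search(r'facebook\.com/[^"]*sk=info', html).group(0)

print(lazy.startswith('facebook.com/other'))  # True
print(bounded)  # facebook.com/pages/Example/1?id=1&sk=info
```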
Using an HTML parser is the easiest way to deal with HTML. However, for an occasional match or search/replace, note that while an HTML parser provides safety and simplicity, it has a cost in terms of performance, since you need to load the whole tree of your document for a single task.
The problem is that you have another facebook.com part. You can restrict the .* so that it does not match ", which keeps the match within one attribute:
facebook\.com\/[^"]*sk=info

Regex to read tags Python

I want to read elements within tags with regex, example:
<td>Stuff Here</td>
<td>stuff
</td>
I am using the following: re.findall(re.compile('<td>(.*)</td>'), str(line).strip())
How come I can read the first <td> tag, but not the second?
For the general case, you can't use regular expressions for parsing markup. The best you can do is to start using an HTML parser, there are many good options out there, IMHO Beautiful Soup is a good choice.
First of all, I assume that line contains the entire HTML document, and not just a single line as its name would imply.
One issue is that by default, . doesn't match the newline:
In [3]: re.findall('.', '\n')
Out[3]: []
You either need to remove embedded newlines (which strip() doesn't do BTW), or use re.DOTALL:
In [4]: re.findall('.', '\n', re.DOTALL)
Out[4]: ['\n']
Also, you should change the .* to .*? to make the expression non-greedy.
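Putting the two fixes together (re.DOTALL plus the non-greedy quantifier), a minimal sketch on the two cells from the question might look like this:

```python
import re

html = "<td>Stuff Here</td>\n<td>stuff\n</td>"

# Original pattern: . stops at the newline, so the second cell is missed.
first = re.findall(r'<td>(.*)</td>', html)

# With re.DOTALL and .*? each cell is matched separately, newline included.
fixed = re.findall(r'<td>(.*?)</td>', html, re.DOTALL)

print(first)  # ['Stuff Here']
print(fixed)  # ['Stuff Here', 'stuff\n']
```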
Another, bigger, issue is that a regex-based approach is insufficiently general to parse arbitrary HTML. See RegEx match open tags except XHTML self-contained tags for a nice discussion.

Python Regular Expression OR not matching

For some reason,
m = re.search('(<[^pib(strong)(br)].*?>|</[^pib(strong)]>)', '</b>')
matches the string, but
m = re.search('(</[^pib(strong)]>)', '</b>')
does not. I am trying to match all tags that are not
<p>, <b>, </p>, </b>
and so on. Am I misunderstanding something about how '|' works?
You're doing it wrong. First of all, characters between [] form a character class: [ab] will match either a or b, so in your case [^pib(strong)] will match any single character that is not a p, an i, a b, a (, etc. (note the negation from ^). Your first regex matches </b> only by coincidence: the / satisfies [^pib(strong)(br)], the .*? then consumes the b, and the > closes the match.
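A short sketch makes the character-class behaviour visible. The second pattern is not from this answer; it is one possible regex-only alternative using a negative lookahead, shown under the assumption that the tags to keep are p, i, b, strong and br, and the sample HTML is invented:

```python
import re

# [^pib(strong)] is ONE character that is none of: p i b ( s t r o n g )
char_class = re.compile(r'</[^pib(strong)]>')
class_results = {tag: bool(char_class.search(tag))
                 for tag in ['</b>', '</a>', '</s>', '</em>']}
print(class_results)
# {'</b>': False, '</a>': True, '</s>': False, '</em>': False}

# Assumed alternative: reject whole tag names with a negative lookahead.
not_allowed = re.compile(r'</?(?!(?:p|i|b|strong|br)\b)\w+[^>]*>')
sample = '<p><b>hi</b><span class="x">there</span><br></p>'
other_tags = not_allowed.findall(sample)
print(other_tags)  # ['<span class="x">', '</span>']
```

This only illustrates the matching rules; for real HTML the parser-based approach is still the safer choice.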
Also, you shouldn't be parsing html/xml with regex. Instead, use a proper xml parsing library, like lxml or beautifulsoup.
Here's a simple example with lxml:
from lxml import html
dom = html.fromstring(your_code)
illegal = set(dom.cssselect('*')) - set(dom.cssselect('p,b'))
for tag in illegal:
    do_something_with(tag)
(this is a small, probably sub-optimal example; it serves just to show you how easy it is to use such a library. Also, note that the library will wrap the code in a <p>, so you should take that into consideration)

python re.search (regex) to search words who have pattern like {{world}} only

I have an HTML file in which I have inserted custom tags like {{name}}, {{surname}}. Now I want to find the tags that exactly match the pattern {{world}}, and not {world}}, {{world}, {world}, { word }, {{ world }}, etc.
I wrote this small piece of code for it:
re.findall(r'\{(\w.+?)\}', html_string)
It returns the words that follow any of the patterns {{world}}, {world}, {world}},
which I don't want. I want to match exactly {{world}}. Can anybody please guide me?
Um, shouldn't the regex be:
'\{\{(\w.+?)\}\}'
Ok, after the comments, I understand your requirements more:
'\{\{\w+?\}\}'
should work for you.
Basically, you want {{any number of word characters including underscore}}. You don't actually need the lazy match in this case, so you may remove the ? in the expression.
Something like {{keyword1}} other stuff {{keyword2}} will not match as a whole now.
To get only the keyword without getting the {{}} use below:
'(?<=\{\{)\w+?(?=\}\})'
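Here is a minimal sketch of both patterns from this answer in action. The sample strings are made up, and the greedy \w+ is used in place of \w+? since, as noted above, laziness is not needed here:

```python
import re

s = '{{keyword1}} other stuff {{keyword2}}'
malformed = '{{ok}} {bad}} {{bad2} {bad3}'

# Whole tags, braces included; each tag is a separate match.
with_braces = re.findall(r'\{\{\w+\}\}', s)

# Lookbehind/lookahead: the braces are required but not part of the result.
names_only = re.findall(r'(?<=\{\{)\w+(?=\}\})', s)

# Tags with a missing brace on either side are skipped.
strict = re.findall(r'(?<=\{\{)\w+(?=\}\})', malformed)

print(with_braces)  # ['{{keyword1}}', '{{keyword2}}']
print(names_only)   # ['keyword1', 'keyword2']
print(strict)       # ['ok']
```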
How about this?
re.findall(r'{{(\w+)}}', html_string)
Or, if you want the curly braces included in the results:
re.findall(r'({{\w+}})', html_string)
If you're trying to accomplish html templating, though, I recommend using a good template engine.
This will match no curly braces within your result, do you want that?
'\{\{(\w[^\{\}]+?)\}\}'
http://rubular.com/r/79YwR13MS0
If you want to match doubled curly brackets, you should specify them in your regex:
re.findall(r'\{\{(\w[^}]*?)\}\}', html_string)
You say the other answers don't work, but they seem to for me:
>>> import re
>>> html_string = '{{realword}} {fake1}} {{fake2} {fake3} fake4'
>>> re.findall(r'\{\{(\w.+?)\}\}', html_string)
['realword']
If it doesn't work for you, you'll need to give more details.
Edit: How about the following? Getting rid of the dot (.) and using only \w also allows you to use greedy quantifiers and works for the example HTML from your comment:
>>> html_string = 'html>\n <head>\n </head>\n <title>\n </title>\n <body>\n <h1>\n T - Shirts\n </h1>\n <img src="March-Tshirts/skull_headphones_tshirt.jpg" />\n <img src="/March-Tshirts/star-wars-t-shirts-6.jpeg" />\n <h2>\n we - we - we\n </h2>\n {{unsubscribe}} -- {{tracking_beacon} -- {web_url}} -- {name} \n </body>\n</html>\n'
>>> re.findall(r'\{\{(\w+)\}\}', html_string)
['unsubscribe']
The \w matches alphanumeric characters and the underscore; if you need to match more characters you could add it to a set (e.g., [\w\+] to also match the plus sign).
