For some reason,
m = re.search('(<[^pib(strong)(br)].*?>|</[^pib(strong)]>)', '</b>')
matches the string, but
m = re.search('(</[^pib(strong)]>)', '</b>')
does not. I am trying to match all tags that are not
<p>, <b>, </p>, </b>
and so on. Am I misunderstanding something about how '|' works?
You're doing it wrong. First of all, characters between [] are not matched as a sequence: [ab] matches either a or b, so in your case [^pib(strong)] matches any single character that is not a p, an i, a b, a (, an s, etc. (note the negation from ^). Your first regex matches only by coincidence: in '</b>', the first alternative lets [^pib(strong)(br)] consume the / and .*? consume the b.
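To see the character-class behaviour concretely, here is a small sketch using only the standard re module. The `disallowed` pattern and the `sample` string at the end are assumptions added for illustration, not something from the question: the pattern rejects whole tag names with a negative lookahead instead of a character class, and it is still no substitute for a real parser.

```python
import re

# Inside [...] each character is its own alternative; ( and ) are literal members.
first = re.search(r'(<[^pib(strong)(br)].*?>|</[^pib(strong)]>)', '</b>')
second = re.search(r'(</[^pib(strong)]>)', '</b>')

print(first.group(0))  # first alternative matched: '/' is not in the set, '.*?' took the 'b'
print(second)          # None: after '</' the class must match 'b', which it excludes

# Hypothetical alternative: reject whole tag names with a negative lookahead.
disallowed = re.compile(r'</?(?!(?:p|i|b|strong|br)\b)[a-zA-Z][^>]*>')
sample = '<p>hi</p><div class="x"><b>x</b><br/><span>y</span></div><pre>'
found = disallowed.findall(sample)
print(found)
```

The `\b` after the alternation is what keeps `<pre>` from being mistaken for `<p>`.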
Also, you shouldn't be parsing HTML/XML with regex. Instead, use a proper parsing library, like lxml or BeautifulSoup.
Here's a simple example with lxml:
from lxml import html

dom = html.fromstring(your_code)
# every element, minus the allowed <p> and <b> elements
illegal = set(dom.cssselect('*')) - set(dom.cssselect('p,b'))
for tag in illegal:
    do_something_with(tag)
(this is a small, probably sub-optimal example; it serves just to show you how easy it is to use such a library. Also, note that the library will wrap the code in a <p>, so you should take that into consideration)
I know that beautiful soup has a function to match classes based on regex that contains certain strings, based on a post here. Below is a code example from that post:
regex = re.compile('.*listing-col-.*')
for EachPart in soup.find_all("div", {"class" : regex}):
    print(EachPart.get_text())
Now, is it possible to do the opposite? Basically, find classes that do not contain a certain regex. In SQL language, it's like:
where class not like '%test%'
Thanks in advance!
This can actually be done with a negative lookahead.
A negative lookahead has the syntax (?!«pattern») and succeeds only if pattern does not match the text that follows the current position in the input string.
In your case, you could use the following regex to match all classes that don’t contain listing-col- in their name:
regex = re.compile('^((?!listing-col-).)*$')
Here's a breakdown of the regex ^((?!listing-col-).)*$:
^ asserts position at the start of a line
Capturing group ((?!listing-col-).)*
  * matches the previous token between zero and unlimited times, as many times as possible, giving back as needed
  Negative lookahead (?!listing-col-)
    asserts that the regex below does not match
    listing-col- matches the characters listing-col- literally (case sensitive)
  . matches any character (except a newline)
$ asserts position at the end of a line
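To check the pattern outside of Beautiful Soup, here is a small sketch with plain strings. The sample class values are made up for illustration.

```python
import re

# Matches only strings in which "listing-col-" never occurs.
not_listing = re.compile(r'^((?!listing-col-).)*$')

# Made-up class values for illustration.
class_values = ["listing-col-3", "sidebar", "wide listing-col-left", "listing", ""]
kept = [value for value in class_values if not_listing.match(value)]
print(kept)
```

Note that "listing" on its own is kept, because the lookahead only rejects the full text listing-col-.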
Also, you may find the https://regex101.com site useful
It will help you test your patterns and show you a detailed explanation of each step. It's your best friend in writing regular expressions.
One possible solution is utilizing regex directly.
You can refer to Regular expression to match a line that doesn't contain a word.
Or you can introduce a function to implement the logic and pass it to find_all as a parameter.
You can refer to https://beautiful-soup-4.readthedocs.io/en/latest/index.html?highlight=find_all#find-all
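As a rough sketch of the function approach: find_all accepts a function that takes a tag and returns True or False. The helper name div_without_listing_col is made up for illustration, and the sketch assumes that tag.get("class") returns a list of class names (or None when the attribute is missing), which is how Beautiful Soup treats the multi-valued class attribute with its HTML parsers.

```python
def div_without_listing_col(tag):
    """True for <div> tags whose class attribute exists and never contains 'listing-col-'."""
    classes = tag.get("class") or []
    return (
        tag.name == "div"
        and bool(classes)
        and not any("listing-col-" in name for name in classes)
    )

# Intended usage with an existing BeautifulSoup object named `soup` (not created here):
#     for part in soup.find_all(div_without_listing_col):
#         print(part.get_text())
```

Checking every class name inside the function avoids surprises with elements that carry several classes, only one of which contains listing-col-.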
You can use CSS selector syntax with the :not() pseudo-class and the *= (contains) attribute operator:
data = [i.get_text() for i in soup.select('div[class]:not([class*="listing-col-"])')]
I'm new to regex and this is stumping me.
In the following example, I want to extract facebook.com/pages/Dr-Morris-Westfried-Dermatologist/176363502456825?id=176363502456825&sk=info. I've read up on lazy quantifiers and lookbehinds but I still can't piece together the right regex. I'd expect facebook.com\/.*?sk=info to work but it captures too much. Can you guys help?
<i class="mrs fbProfileBylineIcon img sp_2p7iu7 sx_96df30"></i></span><span class="fbProfileBylineLabel"><span itemprop="address" itemscope="itemscope" itemtype="http://schema.org/PostalAddress">7508 15th Avenue, Brooklyn, New York 11228</span></span></span><span class="fbProfileBylineFragment"><span class="fbProfileBylineIconContainer"><i class="mrs fbProfileBylineIcon img sp_2p7iu7 sx_9f18df"></i></span><span class="fbProfileBylineLabel"><span itemprop="telephone">(718) 837-9004</span></span></span></div></div></div><a class="title" href="https://www.facebook.com/pages/Dr-Morris-Westfried-Dermatologist/176363502456825?id=176363502456825&sk=info" aria-label="About Dr. Morris Westfried - Dermatologist">
As much as I love regex, this is an html parsing task:
>>> from bs4 import BeautifulSoup
>>> html = .... # that whole text in the question
>>> soup = BeautifulSoup(html, 'html.parser')
>>> pred = lambda tag: tag.get('href', '').endswith('sk=info')  # tolerate <a> tags without href
>>> [tag['href'] for tag in filter(pred, soup.find_all('a'))]
['https://www.facebook.com/pages/Dr-Morris-Westfried-Dermatologist/176363502456825?id=176363502456825&sk=info']
This works :)
facebook\.com\/[^>]*?sk=info
With only .*?, the match starts at the first facebook.com and then continues until sk=info. Since there's another facebook.com in between, the match spans both.
One character that never appears inside the part you want is > (or <, among others), so changing "anything" (.) to "anything but >" ([^>]) makes the match start at the facebook.com closest to sk=info, as you want.
And yes, regex should only be used on HTML for basic tasks. Otherwise, use a parser.
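Here is a small sketch of the difference. The `html` string is a shortened stand-in for the markup in the question, and the first link in it is a made-up extra so that there are two facebook.com occurrences.

```python
import re

# Shortened stand-in for the HTML in the question; the first link is a made-up extra.
html = (
    '<a href="https://www.facebook.com/other">x</a>'
    '<a class="title" href="https://www.facebook.com/pages/Dr-Morris-Westfried-Dermatologist/'
    '176363502456825?id=176363502456825&sk=info" aria-label="About">'
)

too_much = re.search(r'facebook\.com/.*?sk=info', html).group(0)
just_right = re.search(r'facebook\.com/[^>]*?sk=info', html).group(0)

print(too_much.startswith('facebook.com/other'))  # True: the match began at the first link
print(just_right)
```

The lazy quantifier only controls where the match ends; it does nothing about where it starts, which is why the character class is needed.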
Why your pattern doesn't work:
Your pattern doesn't work because the regex engine tries your pattern from left to right in the string.
When the regex engine meets the first facebook.com\/ in the string, the .*? that follows makes it add every character to the (possible) match result (including ", > and spaces) until it finds sk=info, since . can match any character except a newline.
This is the reason why fejese suggests replacing the dot with [^"] and aliteralmind suggests replacing it with [^>]: either change makes the pattern fail at this (first) position in the string.
Using an HTML parser is the easiest way to deal with HTML. However, for a one-off match or search/replace, note that while an HTML parser provides safety and simplicity, it has a cost in terms of performance, since you need to load the whole tree of your document for a single task.
The problem is that you have another facebook.com part earlier in the string. You can restrict the .* so that it does not match ", which forces the match to stay within one attribute:
facebook\.com\/[^"]*sk=info
I'm trying to figure out how to use regular expressions in Python to extract out certain URLs in strings. For example, I might have 'blahblahblah (a href="example.com")'. In this case I want to extract all "example.com" links. How can I do that instead of just splitting the string?
Thanks!
There is a great module called BeautifulSoup (link: http://www.crummy.com/software/BeautifulSoup/) which is great for parsing HTML. You should use this instead of using regex to get info from HTML. Here's an example of BeautifulSoup:
>>> from bs4 import BeautifulSoup
>>> html = """<p><a href="http://link.com">some HTML</a> and <a href="http://second.com">another link</a></p>"""
>>> soup = BeautifulSoup(html, 'html.parser')
>>> mylist = soup.find_all('a')
>>> for link in mylist:
...     print(link['href'])
...
http://link.com
http://second.com
Here is a link to the documentation, which is really easy to follow: http://www.crummy.com/software/BeautifulSoup/bs4/doc/
Regexes are very powerful tools, but they might not be the right tool in all circumstances (as others have already suggested). That said, here's a minimal example from the console that uses regex, as requested:
>>> import re
>>> s = 'blahblahblah (a href="example.com") another bla <a href="subdomain.example2.net">'
>>> re.findall(r'a href="(.*?)"', s)
['example.com', 'subdomain.example2.net']
Focus on r'a href="(.*?)"'. In English it translates to: find a string beginning with a href=", then save as the result every character up to the next ". The syntax is:
the () means "save only stuff in here"
the . means "any character"
the * means "any number of times"
the ? means "non-greedy", or in other terms: find the shortest string that satisfies the requirements (try without the question mark and you will see what happens).
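If you want to see what happens without the question mark, this sketch runs both variants on the same string:

```python
import re

s = 'blahblahblah (a href="example.com") another bla <a href="subdomain.example2.net">'

lazy = re.findall(r'a href="(.*?)"', s)
greedy = re.findall(r'a href="(.*)"', s)

print(lazy)    # one entry per link
print(greedy)  # a single entry that runs on to the last double quote in the string
```

The greedy version first grabs everything to the end of the string and then backs up to the last ", so both links end up inside one result.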
HTH!
Do not use regex:
Regex should not be the first tool you reach for when dealing with HTML or XML (or URLs).
If you wish to use regex anyway,
you can find several patterns that do the job, and several ways to fetch the strings you are looking for.
These patterns do the job:
r'\(a href="(.*?)"\)'
r'\(a href="(.*)"\)'
r'\(a href="(.+)"\)'
1. re.findall()
re.findall(pattern, string, flags=0)
Return all non-overlapping matches of pattern in string, as a list of
strings. The string is scanned left-to-right, and matches are returned
in the order found. If one or more groups are present in the pattern,
return a list of groups; this will be a list of tuples if the pattern
has more than one group. Empty matches are included in the result
unless they touch the beginning of another match.
import re
st = 'blahblahblah (a href="example.com") another bla <a href="polymer.edu">'
re.findall(r'\(a href="(.+)"\)', st)
2. re.search()
re.search(pattern, string, flags=0)
Scan through string looking for a location where the regular
expression pattern produces a match, and return a corresponding
MatchObject instance.
Then use the match object's group() method to go through the groups. For instance, with the regex r'\(a href="(.+?(.).+?)"\)', which also works here, you have nested groups: group 0 is the match for the whole pattern, group 1 is the match for the first parenthesized sub-pattern, (.+?(.).+?), and group 2 is the match for the inner (.).
You would use search when looking for the first occurrence of the pattern only. With your example this would be:
>>> st = 'blahblahblah (a href="example.com") another bla (a href="polymer.edu")'
>>> m=re.search(r'\(a href="(.+?(.).+?)"\)', st)
>>> m.group(1)
'example.com'
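For completeness, a short sketch that prints each group from the search above and then uses finditer, which yields one match object per occurrence instead of only the first:

```python
import re

st = 'blahblahblah (a href="example.com") another bla (a href="polymer.edu")'
pattern = re.compile(r'\(a href="(.+?(.).+?)"\)')

m = pattern.search(st)          # first occurrence only
print(m.group(0))               # the whole match, parentheses included
print(m.group(1), m.group(2))   # outer group, then the nested (.) group

# One match object per occurrence, in order.
all_links = [match.group(1) for match in pattern.finditer(st)]
print(all_links)
```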
I have a file that has many xml-like elements such as this one:
<document docid=1>
Preliminary Report-International Algebraic Language
Perlis, A. J. & Samelson,K.
CACM December, 1958
</document>
I need to parse the docid and the text. What's a suitable regular expression for that?
I've tried this but it doesn't work:
collectionText = open('documents.txt').read()
docsPattern = r'<document docid=(\d+)>(.)*</document>'
docTuples = re.findall(docsPattern, collectionText)
EDIT: I've modified the pattern like this:
<document docid=(\d+)>(.*)</document>
Unfortunately, this matches the whole file as a single result, not the individual document elements.
EDIT2: The correct implementation from Ahmad's and Acorn's answer is:
collectionText = open('documents.txt').read()
docsPattern = r'<document docid=(\d+)>(.*?)</document>'
docTuples = re.findall(docsPattern, collectionText, re.DOTALL)
Your pattern is greedy, so if you have multiple <document> elements it will end up matching all of them.
You can make it non-greedy by using .*?, which means "match zero or more characters, as few as possible." The updated pattern is:
<document docid=(\d+)>(.*?)</document>
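A small sketch of the difference, using two made-up document elements in the same shape as the question's file, and re.DOTALL as in the question's EDIT2:

```python
import re

# Two made-up <document> elements in the same shape as the question's file.
text = (
    '<document docid=1>\nFirst title\n</document>\n'
    '<document docid=2>\nSecond title\n</document>'
)

greedy = re.findall(r'<document docid=(\d+)>(.*)</document>', text, re.DOTALL)
lazy = re.findall(r'<document docid=(\d+)>(.*?)</document>', text, re.DOTALL)
no_dotall = re.findall(r'<document docid=(\d+)>(.*?)</document>', text)

print(len(greedy))  # 1: everything up to the last </document> is one match
print(lazy)         # one (docid, text) tuple per element
print(no_dotall)    # []: without DOTALL, '.' cannot cross the newlines
```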
You need to use the DOTALL option with your regular expression so that it will match over multiple lines (by default . will not match newline characters).
Also note the comments regarding greediness in Ahmad's answer.
import re
text = '''<document docid=1>
Preliminary Report-International Algebraic Language
Perlis, A. J. & Samelson,K.
CACM December, 1958
</document>'''
pattern = r'<document docid=(\d+)>(.*?)</document>'
print(re.findall(pattern, text, re.DOTALL))
In general, regular expressions are not suitable for parsing XML/HTML.
See:
RegEx match open tags except XHTML self-contained tags and http://www.codinghorror.com/blog/2009/11/parsing-html-the-cthulhu-way.html
You want to use a parser like lxml.
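To give an idea of what the parser route looks like, here is a minimal sketch. It does not use lxml: it swaps in the standard library's html.parser, which accepts the unquoted docid=1 attribute. The class name DocumentCollector and its attributes are made up for this example.

```python
from html.parser import HTMLParser


class DocumentCollector(HTMLParser):
    """Collects (docid, text) pairs from <document docid=N> ... </document> elements."""

    def __init__(self):
        super().__init__()
        self.documents = []
        self._docid = None
        self._chunks = []

    def handle_starttag(self, tag, attrs):
        if tag == "document":
            self._docid = dict(attrs).get("docid")
            self._chunks = []

    def handle_data(self, data):
        # Text may arrive in several pieces, so collect and join later.
        if self._docid is not None:
            self._chunks.append(data)

    def handle_endtag(self, tag):
        if tag == "document" and self._docid is not None:
            self.documents.append((self._docid, "".join(self._chunks).strip()))
            self._docid = None


text = '''<document docid=1>
Preliminary Report-International Algebraic Language
Perlis, A. J. & Samelson,K.
CACM December, 1958
</document>'''

collector = DocumentCollector()
collector.feed(text)
collector.close()
print(collector.documents)
```

This is more code than the regex, but it does not depend on greediness or DOTALL to find where each element ends.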
Seems to work for .net "xml-like" structure just FYI...
<([^<>]+)>([^<>]+)<(\/[^<>]+)>
I want to extract data from such regex:
<td>[a-zA-Z]+</td><td>[\d]+.[\d]+</td><td>[\d]+</td><td>[\d]+.[\d]+</td>
I've found a related question, extract contents of regex,
but in my case I should iterate somehow.
As paprika mentioned in his/her comment, you need to identify the desired parts of any matched text using ()'s to set off the capture groups. To get the contents from within the td tags, change:
<td>[a-zA-Z]+</td><td>[\d]+.[\d]+</td><td>[\d]+</td><td>[\d]+.[\d]+</td>
to:
<td>([a-zA-Z]+)</td><td>([\d]+.[\d]+)</td><td>([\d]+)</td><td>([\d]+.[\d]+)</td>
     ^^^^^^^^^           ^^^^^^^^^^^           ^^^^^           ^^^^^^^^^^^
      group 1              group 2            group 3            group 4
And then access the groups by number. (Use only the first line; the line of '^'s and the line naming the groups are just there to help you see the capture groups defined by the parentheses.)
dataPattern = re.compile(r"<td>([a-zA-Z]+)</td>... etc.")
match = dataPattern.search(htmlstring)
field1 = match.group(1)
field2 = match.group(2)
and so on. But you should know that using regexes to crack HTML source is one of the paths toward madness. Many potential surprises can lurk in your input HTML that are perfectly valid HTML but will easily defeat your regex:
"<TD>" instead of "<td>"
spaces between tags, or between data and tags
"&nbsp;" spacing characters
Libraries like BeautifulSoup, lxml, or even pyparsing will make for more robust web scrapers.
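Since the question asks about iterating, here is a minimal sketch that uses finditer with the grouped pattern to pull the four fields out of every row, followed by one of the gotchas listed above. The table rows and the `tricky` string are made up for illustration.

```python
import re

row_pattern = re.compile(
    r'<td>([a-zA-Z]+)</td><td>([\d]+.[\d]+)</td><td>([\d]+)</td><td>([\d]+.[\d]+)</td>'
)

# Made-up table rows for illustration.
html = (
    '<tr><td>apple</td><td>1.50</td><td>3</td><td>4.50</td></tr>'
    '<tr><td>pear</td><td>2.25</td><td>2</td><td>4.50</td></tr>'
)

# groups() returns all four captured fields of one match as a tuple.
rows = [m.groups() for m in row_pattern.finditer(html)]
print(rows)

# Perfectly valid HTML that defeats the pattern: upper-case tags and extra whitespace.
tricky = '<TD>apple</TD> <TD>1.50</TD> <TD>3</TD> <TD>4.50</TD>'
print(row_pattern.search(tricky))  # None
```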
As the poster clarified, the <td> tags should be removed from the string.
Note that the string you've shown us is just that: a string. Only if used in the context of regular expression functions is it a regular expression (a regexp object can be compiled from it).
You could remove the <td> tags as simply as this (assuming your string is stored in s):
s.replace('<td>','').replace('</td>','')
Watch out for the gotchas however: this is really of limited use in the context of real HTML, just as others pointed out.
Further, be aware that whatever regular expression string is left after removing the <td> tags will probably not parse what you want, i.e. it is not going to automatically match everything it matched before, minus the <td> tags!