I am doing a rather complex match with Python using re.match, which takes the form of (some_pattern_1)?(some_pattern_2)?...(.*)
On the other side of it I have a unit test with about one hundred examples I am checking, all of which send requests asynchronously to my (local, development) server. The server is in Django.
I am sometimes seeing the match apparently be non-greedy (i.e. too many things end up in the last catch-all block) and the unit test fail, but I can't really reproduce it in isolation, and I don't really have an idea what's going on.
More concretely, the relevant part of the regex is (in Python):
import re
input = "1 small shoe"
sizes = ["small", "large", "big", "huge"]
colors = ["blue", "red", "green", "yellow", "grey"]
anySize = u' |'.join(sizes)
anyColor = u' |'.join(colors)
matched_expression = re.match(
    r'\s*(?P<amount>(((\d{1,2}\.)?\d{1,3})?)\s*'
    r'(?P<size>(\b'+anySize+'\b)?)\s*'
    r'(?P<color>(\b'+anyColor+'\b)?)\s*'
    r'(?P<name>.*, input, re.UNICODE|re.IGNORECASE)
if matched_expression:
    print(matched_expression.groupdict()["amount"])
    print(matched_expression.groupdict()["size"])
    print(matched_expression.groupdict()["color"])
    print(matched_expression.groupdict()["name"])
And I am sometimes seeing this printed:
1
''
''
'small shoe'
Are there known conditions where this could happen? (And am I correct to assume that the regex match is guaranteed to be fully deterministic in principle?)
Most of the string literals you're using to build your pattern are raw literals (introduced with the r prefix), which is great: the string interpreter therefore does not give backslashes any special meaning, but instead leaves them intact for the regex parser. However, you have unfortunately not used raw literals in every case:
r'(?P<size>(\b'+anySize+'\b)?)\s*'
# ^^^^^^^^^^ this is not a raw string literal
r'(?P<color>(\b'+anyColor+'\b)?)\s*'
# ^^^^^^^^^^ and nor is this
Consequently, the backslashes in those literals have the effect described under String and Bytes literals before the interpreted string is given to the regex compiler. Accordingly, your \b boundary anchors are replaced with ASCII backspace characters!
Either use raw string literals by prefixing them with r or else be sure to escape the backslashes they contain.
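You can see the difference directly in the REPL: the non-raw literal is a single backspace character (\x08), while the raw literal keeps both characters for the regex compiler.
>>> len('\b'), '\b'
(1, '\x08')
>>> len(r'\b'), r'\b'
(2, '\\b')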
There are however also a number of other issues with your code worth noting:
As currently written, your regex won't compile due to some syntax errors. In particular, the capture groups named amount and name are not terminated due to unbalanced brackets:
r'\s*(?P<amount>(((\d{1,2}\.)?\d{1,3})?)\s*'
# + +++ - - -
There are four opening brackets, but only three closing brackets. You probably intended to write:
r'\s*(?P<amount>(((\d{1,2}\.)?\d{1,3})?))\s*'
# ^
Similarly, r'(?P<name>.*, ... should probably be r'(?P<name>.*)', ... (note also the pattern string needs to be terminated before the argument separator).
\b boundary anchors bind more tightly than | alternation, so when placed at the same level as your joined arrays they are only bound to the first and last elements of the alternatives respectively. For example, the capture group named size is currently specified by the following pattern:
(\bsmall |large |big |huge\b)?
Which is equivalent, in terms of precedence, to:
((\bsmall )|(large )|(big )|(huge\b))?
Better instead to place the boundary anchors outside of the brackets:
r'(?P<size>\b('+anySize+r')?\b)\s*'
r'(?P<color>\b('+anyColor+r')?\b)\s*'
As shown above, the whitespace in your join expressions is likely to lead to unintended consequences: anySize and anyColor require that all but the final terms in their underlying arrays are, if present, followed by a space character (in addition to those that match the \s* patterns). Better to join the arrays with '|' alone, rather than ' |':
anySize = u'|'.join(sizes)
anyColor = u'|'.join(colors)
Depending on the source of the underlying arrays, and how confident you are that they do not contain any special regex patterns, you may wish to first escape the array elements.
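Putting all of the above together, a corrected version of the snippet might look like this (a sketch: input is renamed to avoid shadowing the built-in, and re.escape guards against metacharacters in the array elements):
import re

input_str = "1 small shoe"
sizes = ["small", "large", "big", "huge"]
colors = ["blue", "red", "green", "yellow", "grey"]

# Escape each element before joining, in case any contains regex metacharacters
anySize = u'|'.join(re.escape(s) for s in sizes)
anyColor = u'|'.join(re.escape(c) for c in colors)

matched_expression = re.match(
    r'\s*(?P<amount>(((\d{1,2}\.)?\d{1,3})?))\s*'
    r'(?P<size>\b(' + anySize + r')?\b)\s*'
    r'(?P<color>\b(' + anyColor + r')?\b)\s*'
    r'(?P<name>.*)', input_str, re.UNICODE | re.IGNORECASE)

if matched_expression:
    print(matched_expression.groupdict())
# {'amount': '1', 'size': 'small', 'color': '', 'name': 'shoe'}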
A small project I got assigned is supposed to extract website URLs from given text. Here's what the most relevant portion of it looks like:
webURLregex = re.compile(r'''(
(https://|http://)
[a-zA-Z0-9.%+-\\/_]+
)''',re.VERBOSE)
This does do its job properly, but I noticed that it also includes the ','s and '.'s in the URL strings it prints. So my first question is: how do I make it exclude any punctuation symbols at the end of the string it detects?
My second question refers to the title itself (finally), but doesn't really seem to affect this particular program I'm working on: do character classes (in this case [a-zA-Z0-9.%+-\/_]+) count as groups (group[3] in this case)?
Thanks in advance.
To exclude some symbols at the end of the string you can use a negative lookbehind. For example, to disallow . and ,:
.*(?<![.,])$
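For instance, applied as a full-string check (a minimal sketch; the sample strings are made up):
import re

for s in ["http://example.com/a.png", "http://example.com/a.png,"]:
    print(s, bool(re.match(r'.*(?<![.,])$', s)))

# http://example.com/a.png True
# http://example.com/a.png, False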
Answering in reverse:
No, character classes are just shorthand for bracketed text. They don't provide groups in the same way that surrounding with parentheses would. They only allow the regular expression engine to select the specified characters -- nothing more, nothing less.
With regards to finding comma and dot: actually, I see the problem here, though the below may still be valuable, so I'll leave it. Essentially, you have this: [a-zA-Z0-9.%+-\\/_]+. The - character has special meaning: it matches everything between the two characters surrounding it, by ASCII code, so [A-a] is a valid range. It includes A-Z, but also a bunch of other characters that aren't A-Z. If you want to include a literal - in the class, it needs to be the last character: [a-zA-Z0-9.%+\\/_-]+ should work.
For the comma, I actually don't see it represented in your regex, so I can't comment specifically on that; it shouldn't be allowed anywhere in the URL. In general though, you'll just want to add more groups/more conditions.
First, break apart the URL into the specific groups you'll want (see the sketch at the end of this answer):
(scheme)://(domain)(endpoint)
Each section gets a different set of requirements: e.g. maybe domain needs to end with a slash:
[a-zA-Z0-9]+\.com/ should match any domain that uses alphanumeric characters and ends, specifically, with .com (note the \., otherwise it'll capture any single character followed by com/).
For the endpoint section, you'll probably still want to allow special characters, but if you're confident you don't want the URL to end with, say, a dot, then you could do something like [A-Za-z0-9]. Note the lack of a dot here, plus its length: only a single character. This will change the rest of your regex, so you need to think about that.
A couple of random thoughts:
If you're confident you want to match the whole line, add a $ to the end of the regex, to signify the end of the line. One possibility here is that your regex does match some portion of the text, but ignores the junk at the end, since you didn't say to read the whole line.
Regexes get complicated really fast -- they're kind of write-only code. Add some comments to help. E.g.
web_url_regex = re.compile(
    r'(http://|https://)'       # Capture the scheme name
    r'([a-zA-Z0-9.%+-\\/_]+)'   # Everything else, apparently
)
Do not try to be exhaustive in your validation -- as noted, urls are hard to validate because you can't know for sure that one is valid. But the form is pretty consistent, as laid out above: scheme, domain, endpoint (and query string)
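To make the scheme/domain/endpoint decomposition concrete, here is a rough sketch; the character sets and group names are illustrative assumptions, not a complete URL grammar:
import re

url_re = re.compile(
    r'(?P<scheme>https?)://'        # scheme
    r'(?P<domain>[a-zA-Z0-9.-]+)'   # domain (stops at the first slash)
    r'(?P<endpoint>/\S*)?'          # optional endpoint/path
)

m = url_re.match('https://example.com/images/pic.png')
if m:
    print(m.group('scheme'), m.group('domain'), m.group('endpoint'))
# https example.com /images/pic.png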
To answer the second question first: no, a character class is not a group (unless you explicitly make it into one by putting it in parentheses).
Regarding the first question of how to make it exclude the punctuation symbols at the end, the code below should answer that.
Firstly though, your regex had an issue separate from the fact that it was matching the final punctuation, namely that the last - does not appear to be intended as defining a range of characters (see the footnote below for why I believe this to be the case), but was doing so. I've moved it to the end of the character class to avoid this problem.
Now a character class to match the final character is added at the end of the regex, which is the same as the previous character class except that it does not include . (other punctuation is already excluded). So the matched pattern cannot end in a dot. The + (one or more) on the previous character class is accordingly reduced to * (zero or more).
If for any reason the exact set of characters matched needs tweaking, then the same principle can still be employed: match a single character at the end from a reduced set of possibilities, preceded by any number of characters from a wider set which includes characters that are permitted to be included but not at the end.
import re
webURLregex = re.compile(r'''(
(https://|http://)
[a-zA-Z0-9.%+\\/_-]*
[a-zA-Z0-9%+\\/_-]
)''',re.VERBOSE)
text = "... at http://www.google.com/. It says"
m = re.search(webURLregex, text)
if m:
    print(m.group())
Outputs:
http://www.google.com/
[*] The observation that the last - does not appear to be intended to define a character range is based on the fact that, if it did, the range would run from + (053 octal) to \ (134 octal), which would also include the digits and the uppercase letters, making the 0-9 and A-Z redundant.
These are the results from Python 2.7.
>>> re.sub('.*?', '-', 'abc')
'-a-b-c-'
The result I thought I should get is as follows.
>>> re.sub('.*?', '-', 'abc')
'-------'
But it's not. Why?
The best explanation of this behaviour I know of is from the regex PyPI package, which is intended to eventually replace re (although it has been this way for a long time now).
Sometimes it’s not clear how zero-width matches should be handled. For example, should .* match 0 characters directly after matching >0 characters?
Most regex implementations follow the lead of Perl (PCRE), but the re module sometimes doesn’t. The Perl behaviour appears to be the most common (and the re module is sometimes definitely wrong), so in version 1 the regex module follows the Perl behaviour, whereas in version 0 it follows the legacy re behaviour.
Examples:
# Version 0 behaviour (like re)
>>> regex.sub('(?V0).*', 'x', 'test')
'x'
>>> regex.sub('(?V0).*?', '|', 'test')
'|t|e|s|t|'
# Version 1 behaviour (like Perl)
>>> regex.sub('(?V1).*', 'x', 'test')
'xx'
>>> regex.sub('(?V1).*?', '|', 'test')
'|||||||||'
(?VX) sets the version flag in the regex. The second example is what you expect, and is supposedly what PCRE does. Python's re is somewhat nonstandard, and is kept as it is probably solely due to backwards compatibility concerns. I've found an example of something similar (with re.split).
For your new, edited question:
The .*? can match any number of characters, including zero. So what it does is it matches zero characters at every position in the string: before the "a", between the "a" and "b", etc. It replaces each of those zero-width matches with a hyphen, giving the result you see.
The regex does not try to match each character one by one; it tries to match at each position in the string. Your regex allows it to match zero characters. So it matches zero at each position and moves on to the next. You seem to be thinking that in a string like "abc" there is one position before the "b", one position "inside" the "b", and one position after "b", but there isn't a position "inside" an individual character. If it matches zero characters starting before "b", the next thing it tries is to match starting after "b". There's no way you can get a regex to match seven times in a three-character string, because there are only four positions to match at.
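You can see those four positions directly with re.finditer. Note that this is version-dependent: the spans below are from Python 2.7 (and Python 3 up to 3.6). Python 3.7 changed the handling of empty matches (they are now allowed adjacent to a previous non-empty match), so on 3.7+ re.sub('.*?', '-', 'abc') actually returns the '-------' the question expected.
>>> import re
>>> [m.span() for m in re.finditer(r'.*?', 'abc')]  # Python 2.7 / <= 3.6
[(0, 0), (1, 1), (2, 2), (3, 3)]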
Are you sure you interpreted re.sub's documentation correctly?
*?, +?, ??
The '*', '+', and '?' qualifiers are all greedy; they match as much text as possible. Sometimes this behaviour isn't desired; if the RE <.*> is matched against '<H1>title</H1>', it will match the entire string, and not just '<H1>'. Adding '?' after the qualifier makes it perform the match in non-greedy or minimal fashion; as few characters as possible will be matched. Using .*? in the previous expression will match only '<H1>'.
Adding a ? will turn the expression into a non-greedy one.
Greedy:
re.sub(".*", "-", "abc")
non-Greedy:
re.sub(".*?", "-", "abc")
Update: FWIW re.sub does exactly what it should:
>>> from re import sub
>>> sub(".*?", "-", "abc")
'-a-b-c-'
>>> sub(".*", "-", "abc")
'-'
See @BrenBarn's awesome answer on why you get -a-b-c- :)
To elaborate on Veedrac's answer, different implementations treat zero-width matches differently in FindAll (or ReplaceAll) operations. Two behaviours can be observed among implementations, and Python's re simply chooses to follow the first line of implementation.
1. Always bump along by one character on zero-width match
In Java and JavaScript, zero-width match causes the index to bump along by one character, since staying at the same index will cause an infinite loop in FindAll or ReplaceAll operations.
As a result, output of FindAll operations in such implementation can contain at most 1 match starting at a particular index.
The default Python re package probably also follows the same implementation (and it seems to also be the case for Ruby); the sketch below illustrates the idea.
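Here is a minimal sketch of that scanning loop (an illustration only, not CPython's actual implementation):
import re

def findall_bump_by_one(pattern, text):
    # Sketch of behaviour 1: advance by one character after a zero-width match.
    rx = re.compile(pattern)
    results, pos = [], 0
    while pos <= len(text):
        m = rx.match(text, pos)
        if m is None:
            pos += 1
            continue
        results.append(m.group())
        pos = m.end() + 1 if m.start() == m.end() else m.end()
    return results

print(findall_bump_by_one(r'.*?', 'abc'))
# ['', '', '', ''] -> these are the four hyphens in '-a-b-c-'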
2. Disallow zero-width match on next match at same index
In PHP, which provides a wrapper over the PCRE library, a zero-width match does not cause the index to bump along immediately. Instead, a flag (PCRE_NOTEMPTY) is set, requiring the next match (which starts at the same index) to be a non-zero-width match. If that match succeeds, the index bumps along by the (non-zero) length of the match; otherwise, it bumps along by one character.
By the way, the PCRE library does not provide a built-in FindAll or ReplaceAll operation; these are actually provided by the PHP wrapper.
As a result, the output of FindAll operations in such an implementation can contain up to 2 matches starting at the same index.
Python's regex package probably follows this line of implementation.
This line of implementation is more complex, since it requires the implementation of FindAll or ReplaceAll to keep extra state about whether to disallow a zero-width match or not. Developers also need to keep track of this extra flag when they use the low-level matching API.
I'm basing this question on an answer I gave to this other SO question, which was my specific attempt at a tokenizing regex-based iterator using more_itertools's pairwise iterator recipe.
Following is my code taken from that answer:
from more_itertools import pairwise
import re
string = "dasdha hasud hasuid hsuia dhsuai dhasiu dhaui d"
# split according to the given delimiter including segments beginning at the beginning and ending at the end
for prev, curr in pairwise(re.finditer(r"^|[ ]+|$", string)):
    print(string[prev.end(): curr.start()])  # originally I yield here
I then noticed that if the string starts or ends with delimiters (i.e. string = " dasdha hasud hasuid hsuia dhsuai dhasiu dhaui d ") then the tokenizer will print empty strings (these are actually extra matches to string start and string end) at the beginning and end of its list of token outputs. To remedy this I tried the following (quite ugly) attempts at other regexes:
"(?:^|[ ]|$)+" - This seems quite simple and like it should work, but it doesn't (and also seems to behave wildly differently on other regex engines). For some reason it wouldn't build a single match from the string's start and the delimiters following it; the string start somehow also consumes the character following it! (This is also where I see divergence from other engines. Is this a BUG, or does it have something to do with special non-corporeal characters and the or (|) operator in Python that I'm not aware of?) This solution also did nothing for the double match containing the string's end: once it matched the delimiters, it then gave another match for the string end ($) character itself.
"(?:[ ]|$|^)+" - Putting the delimiters first actually solves one of the problems: the split at the beginning doesn't contain the string start (but I don't care too much about that anyway, since I'm interested in the tokens themselves). It also matches the string start when there are no delimiters at the beginning of the string, but the string ending is still a problem.
"(^[ ]*)|([ ]*$)|([ ]+)" - This final attempt got the string start to be part of the first match (which wasn't really that much of a problem in the first place), but try as I might I couldn't get rid of the delimiter + end followed by another delimiter match problem (which yields an additional empty string). Still, I'm showing you this example (with grouping) since it shows that the ending special character $ is matched twice: once with the preceding delimiters and once by itself (two group-2 matches).
My questions are:
Why do I get such strange behavior in attempt #1?
How do I solve the end of string issue?
Am I being a tank, i.e. is there a simple way to solve this that I'm blindly missing?
Remember that the solution can't change the string, and must produce an iterable generator which iterates on the spaces between the tokens and not the tokens themselves. (This last part might seem to complicate the answer unnecessarily, since otherwise I'd have a simple answer, but if you must know (and if you don't, read no further): it's part of a bigger framework I'm building, where this yielding method is inherited by a pipeline which then constructs yielded sentences out of it in various patterns that are used to extract fields from semi-structured, classifier-driven messages.)
The problems you're having are due to the trickiness and undocumented edge cases of zero-width matches. You can resolve them by using negative lookarounds to explicitly tell Python not to produce a match for ^ or $ if the string has delimiters at the start or end:
delimiter_re = r'[\n\- ]' # newline, hyphen, or space
search_regex = r'''^(?!{0}) # string start with no delimiter
| # or
{0}+ # sequence of delimiters (at least one)
| # or
(?<!{0})$ # string end with no delimiter
'''.format(delimiter_re)
search_pattern = re.compile(search_regex, re.VERBOSE)
Note that this will produce one match in an empty string, not zero, and not separate beginning and ending matches.
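Used together with the pairwise recipe from the question, the pattern then yields exactly the tokens (a quick sketch reusing search_pattern from above; the sample string is made up):
from more_itertools import pairwise

string = " dasdha hasud  hasuid hsuia "
for prev, curr in pairwise(search_pattern.finditer(string)):
    print(repr(string[prev.end(): curr.start()]))

# 'dasdha'
# 'hasud'
# 'hasuid'
# 'hsuia'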
It may be simpler to iterate over non-delimiter sequences and use the resulting matches to locate the string components you want:
token = re.compile(r'[^\n\- ]+')
previous_end = 0
for match in token.finditer(string):
    do_something_with(string[previous_end:match.start()])
    previous_end = match.end()
do_something_with(string[previous_end:])
The extra matches you were getting at the end of the string were because after matching the sequence of delimiters at the end, the regex engine looks for matches at the end again, and finds a zero-width match for $.
The behavior you were getting at the beginning of the string for the ^|... pattern is trickier: the regex engine sees a zero-width match for ^ at the start of the string and emits it, without trying the other | alternatives. After the zero-width match, the engine needs to avoid producing that match again to avoid an infinite loop; this particular engine appears to do that by skipping a character, but the details are undocumented and the source is hard to navigate. (Here's part of the source, if you want to read it.)
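For example, under the versions in use at the time (this reflects Python 3.6-era behaviour; 3.7 changed empty-match handling), the skipped character is visible in the spans finditer produces: the 'a' at index 0 is never offered as a match.
>>> import re
>>> [m.span() for m in re.finditer(r'^|a', 'aaa')]  # Python <= 3.6
[(0, 0), (1, 2), (2, 3)]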
The behavior you were getting at the start of the string for the (?:^|...)+ pattern is even trickier. Executing this straightforwardly, the engine would look for a match for (?:^|...) at the start of the string, find ^, then look for another match, find ^ again, then look for another match ad infinitum. There's some undocumented handling that stops it from going on forever, and this handling appears to produce a zero-width match, but I don't know what that handling is.
It sounds like you're just trying to return a list of all the "words" separated by any number of delimiting chars. You could instead just use regex groups and class negation (^ inside a character class) to achieve this:
import re

# match any number of consecutive non-delim chars
string = " dasdha hasud hasuid hsuia dhsuai dhasiu dhaui d "
delimiters = r'\n\- '
regex = r'([^{0}]+)'.format(delimiters)
for match in re.finditer(regex, string):
    print(match.group(0))
output:
dasdha
hasud
hasuid
hsuia
dhsuai
dhasiu
dhaui
d
I am starting to learn Python spiders to download some pictures on the web, and I found the following code. I know some basic regex.
I know \.jpg means .jpg and | means or. What's the meaning of [^\s]*? in the first line? I am wondering why \s is used.
And what's the difference between the two regexes?
http:[^\s]*?(\.jpg|\.png|\.gif)
http://.*?(\.jpg|\.png|\.gif)
Alright, so to answer your first question, I'll break down [^\s]*?.
The square brackets ([]) indicate a character class. A character class basically means that you want to match anything in the class, at that position, one time. [abc] will match the strings a, b, and c. In this case, your character class is negated using the caret (^) at the beginning - this inverts its meaning, making it match anything but the characters in it.
\s is fairly simple - it's a common shorthand in many regex flavours for "any whitespace character". This includes spaces, tabs, and newlines.
*? is a little harder to explain. The * quantifier is fairly simple - it means "match this token (the character class in this case) zero or more times". The ?, when applied to a quantifier, makes it lazy - it will match as little as it can, going from left to right one character at a time.
In this case, what the whole pattern snippet [^\s]*? means is "match any sequence of non-whitespace characters, including the empty string". As mentioned in the comments, this can more succinctly be written as \S*?.
To answer the second part of your question, I'll compare the two regexes you give:
http:[^\s]*?(\.jpg|\.png|\.gif)
http://.*?(\.jpg|\.png|\.gif)
They both start the same way: attempting to match the protocol at the beginning of a URL and the subsequent colon (:) character. The first then matches any string that does not contain any whitespace and ends with the specified file extensions. The second, meanwhile, will match two literal slash characters (/) before matching any sequence of characters followed by a valid extension.
Now, it's obvious that both patterns are meant to match a URL, but both are incorrect. The first pattern, for instance, will match strings like
http:foo.bar.png
http:.png
Both of which are invalid. Likewise, the second pattern will permit spaces, allowing stuff like this:
http:// .jpg
http://foo bar.png
Which is equally illegal in valid URLs. A better regex for this (though I caution strongly against trying to match URLs with regexes) might look like:
https?://\S+\.(jpe?g|png|gif)
In this case, it'll match URLs starting with both http and https, as well as files that end in both variations of jpg.
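As a quick check (the sample URLs are made up; the extension group is written as non-capturing here so that findall returns whole URLs rather than just the extensions):
import re

pattern = re.compile(r'https?://\S+\.(?:jpe?g|png|gif)')
text = "See https://example.com/cat.jpeg and http://example.org/dog.png here."
print(pattern.findall(text))
# ['https://example.com/cat.jpeg', 'http://example.org/dog.png']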
I'm trying to find C-style comments in a C file, but I'm having trouble if there happens to be // inside of quotations. This is the file:
/*My function
is great.*/
int j = 0//hello world
void foo(){
    //tricky example
    cout << "This // is // not a comment\n";
}
It will match that cout line. This is what I have so far (I can match the /*...*/ comments already):
fp = open(s)
p = re.compile(r'//(.+)')
txt = p.findall(fp.read())
print(txt)
The first step is to identify the cases where // or /* must not be interpreted as the beginning of a comment, for example when they are inside a string (between quotes). To avoid content between quotes (or other things), the trick is to put them in a capture group and to insert a backreference in the replacement pattern:
pattern:
(
"(?:[^"\\]|\\[\s\S])*"
|
'(?:[^'\\]|\\[\s\S])*'
)
|
//.*
|
/\*(?:[^*]|\*(?!/))*\*/
replacement:
\1
Since quoted parts are searched for first, each time you find // or /*...*/ you can be sure that you are not inside a string.
Note that the pattern is deliberately inefficient (due to the (A|B)* subpatterns) to make it easier to understand. To make it more efficient you can rewrite it like this:
("(?=((?:[^"\\]+|\\[\s\S])*))\2"|'(?=((?:[^'\\]+|\\[\s\S])*))\3')|//.*|/\*(?=((?:[^*]+|\*(?!/))*))\4\*/
(?=(something+))\1 is only a way to emulate an atomic group (?>something+)
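Here is a small illustration of this trick with a toy pattern: once the lookahead's capture has succeeded, the engine will not backtrack into it, exactly as with an atomic group.
import re

# With a plain group, backtracking gives up one 'a' so the final 'a' can match:
print(re.search(r'(a+)a', 'aaaa'))        # matches 'aaaa'
# With the emulated atomic group, \1 consumes the whole capture and the
# engine cannot backtrack into the lookahead:
print(re.search(r'(?=(a+))\1a', 'aaaa'))  # None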
So, if you only want to find comments (and not remove them), the handiest approach is to put the comments part of the pattern in a capture group and to test whether it is empty. The following pattern has been updated (after Jonathan Leffler's comment) to handle the trigraph ??/, which is interpreted as a backslash character by the preprocessor (I assume that the code isn't written for the -trigraphs option), and to handle a backslash followed by a newline character, which allows a single logical line to be formatted over several physical lines:
fp = open(s)
p = re.compile(r'''(?x)
(?=["'/]) # trick to make it faster, a kind of anchor
(?:
"(?=((?:[^"\\?]+|\?(?!\?/)|(?:\?\?/|\\)[\s\S])*))\1" # double quotes string
|
'(?=((?:[^'\\?]+|\?(?!\?/)|(?:\?\?/|\\)[\s\S])*))\2' # single quotes string
|
(
/(?:(?:\?\?/|\\)\n)*/(?:.*(?:\?\?|\\)/\n)*.* # single line comment
|
/(?:(?:\?\?/|\\)\n)*\* # multiline comment
(?=((?:[^*]+|\*+(?!(?:(?:\?\?/|\\)\n)*/))*))\4
\*(?:(?:\?\?/|\\)\n)*/
)
)
''')
for m in p.findall(fp.read()):
    if m[2]:
        print(m[2])
These changes do not affect the pattern's efficiency, since the main work for the regex engine is to find positions that begin with a quote or a slash. This task is simplified by the presence of the lookahead at the beginning of the pattern, (?=["'/]), which allows internal optimizations to quickly find the first character.
Another optimization is the use of emulated atomic groups, which reduces backtracking to a minimum and allows the use of greedy quantifiers inside repeated groups.
NB: luckily, there is no heredoc syntax in C!
Python's re.findall method basically works the same way as most lexers do: it successively returns the longest match starting where the previous match finished. All that is required is to produce a disjunction of all the lexical patterns:
(<pattern 1>)|(<pattern 2>)|...|(<pattern n>)
Unlike most lexers, it doesn't require the matches to be contiguous, but that's not a significant difference since you can always just add (.) as the last pattern, in order to match all otherwise unmatched characters individually.
An important feature of re.findall is that if the regex has any groups, then only the groups will be returned. Consequently, you can exclude alternatives by simply leaving out the parentheses, or changing them to non-capturing parentheses:
(<pattern 1>)|(?:<unimportant pattern 2>)|(<pattern 3>)
With that in mind, let's take a look at how to tokenize C just enough to recognize comments. We need to deal with:
Single-line comments: // Comment
Multi-line comments: /* Comment */
Double-quoted string: "Might include escapes like \n"
Single-quoted character: '\t'
(See below for a few more irritating cases)
With that in mind, let's create regexen for each of the above.
Two slashes followed by anything other than a newline: //[^\n]*
This regex is tedious to explain: /[*][^*]*[*]+(?:[^/*][^*]*[*]+)*/
Note that it uses (?:...) to avoid capturing the repeated group.
A quote, then any repetition of either a character other than quote and backslash, or a backslash followed by any character whatsoever. That's not a precise definition of an escape sequence, but it's good enough to detect when a " terminates the string, which is all we care about: "(?:[^"\\]|\\.)*"
The same as (3) but with single quotes: '(?:[^'\\]|\\.)*'
Finally, the goal was to find the text of C-style comments. So we just need to avoid captures in any of the other groups. Hence:
import re

def find_comments(text):
    p = re.compile('|'.join((r"(//[^\n]*)"
                            ,r"/[*][^*]*[*]+(?:[^/*][^*]*[*]+)*/"
                            ,'"'+r"""(?:[^"\\]|\\.)*"""+'"'
                            ,r"'(?:[^'\\]|\\.)*'")))
    return [c[2:] for c in p.findall(text) if c]
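Trying it on the file from the question (a sketch; find_comments is simply the name given to the function above):
text = r'''/*My function
is great.*/
int j = 0//hello world
void foo(){
    //tricky example
    cout << "This // is // not a comment\n";
}'''

print(find_comments(text))  # ['hello world', 'tricky example']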
Above, I left out some obscure cases which are unlikely to arise:
In an #include <...> directive, the <...> is essentially a string. In theory, it could contain quotes or sequences which look like comments, but in practice you will never see:
#include </*This looks like a comment but it is a filename*/>
A line which ends with \ is continued on the next line; the \ and following newline character are simply removed from the input. This happens before any lexical scanning is performed, so the following is a perfectly legal comment (actually two comments):
/\
**************** Surprise! **************\
//////////////////////////////////////////
To make the above worse, the trigraph ??/ is the same as a \, and that replacement happens before the continuation handling.
/************************************//??/
**************** Surprise! ************??/
//////////////////////////////////////////
Outside of obfuscation contests, no-one actually uses trigraphs. But they're still in the standard. The easiest way to deal with both of these issues would be to prescan the string:
return [c[2:]
        for c in p.findall(text.replace('??/', '\\').replace('\\\n', ''))
        if c]
The only way to deal with the #include <...> issue, if you really cared about it, would be to add one more pattern, something like #include\s*<[^>\n]*>.