Loading regular expression patterns from external source?

Loading regular expression patterns from external source? - python

I have a series of regular expression patterns defined for automated processing of text. Due to the design of the program, it's better to have these patterns separate in a text file, namely a JSON file. The pattern in Python is of r'' type, but all I can provide is a string. I'd like to retain functionalities such as grouping. I'd like to have features such as entities ([A-z]), so I'm not talking about escaping everything.
I'm using Python 3.4. How do I properly load these patterns into the re module? And what kind of escaping problem should I watch out for?

I am not sure what you want but have a look at this.:
If you have a file called input.txt containing \d+
Then you can use it this way:
import re
f=open("input.txt","r")
x="asasd3243sdfdsf23234sdsdf"
print re.findall(r""+f.readline(),x)
Output:['3243', '23234']
When you use r mode you need not escape anything.

The r'' thing in Python is not a different type than simple ''. The r'' syntax simply creates a string that looks exactly like the one you typed, so the \n sequence stays as \n, and isn't turned into a new line (same thing happens to other special characters). This little r simply escapes everything you type.
Check it yourself with this two simple lines in the console:
print('test \n test')
print(r'test \n test')
print(type(r''))
print(type(''))
Now, while you read lines from JSON file, the escaping is done for you. I don't know how will you create the JSON file, but you should take a look at the json module, and the load method, that will allow you to read a JSON file.

You can use re.escape to escape the strings. However this is escaping everything and you might want some special chars. I'd just use the strings and be careful about placing \ in the right places.
BTW: If you have many regular expressions, matching might get slow. You might want to consider some alternatives such esmre.

Related

Should I use raw string by default?

I know raw string, like r'hello world', prevents escaping.
Is it a good practice to always prepend the r symbol even if the string doesn't have any escaping sequences?
Say my exception needs some string literal explanation, I need to connect to a website whose url is a string literal. They don't have backslash. Are there any performance differences between raw string and regular string?

The r sigil means "backslashes in this string are literal backslashes". Putting this sigil on a string which doesn't contain any backslashes is harmless but sometimes mildly confusing to a human reader. A better approach is probably to only use this sigil when you actually need it.
Situations where the string may not contain backslashes at the moment, but where you might expect to add one in the future, such as in regular expressions and Windows file paths, would probably qualify as useful exceptions.
re.findall(r'hello', string) # what if we switch to r'hello\.'?
with open(r'file.txt') as handle: # what if we switch to r'sub\file.txt'?
It would be easy to forget to add the r when you add a backslash, so supplying it in advance has some merit here.

You can do that in Python. But I don't recommend that because if you add something like '\n', it won't work well. You can use that in Regex and paths on Windows.

split() not producing expected results

I have a problem with python split which I can't figure out what I am missing that results in the split function not to work properly. I have been using similar splits before and they worked just fine.
content=open(file).read)()
Sep = content.split(r'Document [a-zA-Z0-9]{25}\n')
The file I am reading is something very easy as:
"I like coffee.
Document CLASSAR020181030eeat0000l
I like tea as well.
Document CLASSAR020181030eeat0000l
I like both coffee and tea."

str.split() splits using a fixed delimiter, not a regular expression. You need to use re.split().
import re
sep = re.split(r'Document [a-zA-Z0-9]{25}\n', content)

Error - regular expression syntax on string methods
content is a string. You cannot call the split method on this variable as it will invoke a string-based method that expects a separator. This separator must be a fixed string, and not a regular expression.
Solution - Use re module
You can instead use methods within the regular expression module, as you're using regular expression syntax:
import re
with open(file, 'r') as fp:
content = fp.read()
pattern = re.compile(r'Document \w{25}\n')
separated = pattern.split(content)
The with block is just best practice for opening files in python. It
is a context manager that automatically closes your file when you're
finished. You may run into problems in the future if you don't use
this.
The regular expression I have used is slightly different to yours. It
does exactly the same thing. However, \w is short for
[a-zA-Z0-9]. I.e., it matches any alphanumeric character.
We are using the split method again. However, this split method is part of the re module, not string, as our pattern variable is an re object.

python regular expression not matching file contents with re.match and re.MULTILINE flag

I'm reading in a file and storing its contents as a multiline string. Then I loop through some values I get from a django query to run regexes based on the query results values. My regex seems like it should be working, and works if I copy the values returned by the query, but for some reason isn't matching when all the parts are working together that ends like this
My code is:
with open("/path_to_my_file") as myfile:
data=myfile.read()
#read saved settings then write/overwrite them into the config
items = MyModel.objects.filter(some_id="s100009")
for item in items:
regexString = "^\s*"+item.feature_key+":"
print regexString #to verify its what I want it to be, ie debug
pq = re.compile(regexString, re.M)
if pq.match(data):
#do stuff
So basically my problem is that the regex isn't matching. When I copy the file contents into a big old string, and copy the value(s) printed by the print regexString line, it does match, so I'm thinking theres some esoteric python/django thing going on (or maybe not so esoteric as python isnt my first language).
And for examples sake, the output of print regexString is :
^\s*productDetailOn:
File contents:
productDetailOn:true,
allOff:false,
trendingWidgetOn:true,
trendingWallOn:true,
searchResultOn:false,
bannersOn:true,
homeWidgetOn:true,
}
Running Python 2.7. Also, dumped the types of both item.feature and data, and both were unicode. Not sure if that matters? Anyway, I'm starting to hit my head off the desk after working this for a couple hours, so any help is appreciated. Cheers!

According to documentation, re.match never allows searching at the beginning of a line:
Note that even in MULTILINE mode, re.match() will only match at the beginning of the string and not at the beginning of each line.
You need to use a re.search:
regexString = r"^\s*"+item.feature_key+":"
pq = re.compile(regexString, re.M)
if pq.search(data):
A small note on the raw string (r"^\s+"): in this case, it is equivalent to "\s+" because there is no \s escape sequence (like \r or \n), thus, Python treats it as a raw string literal. Still, it is safer to always declare regex patterns with raw string literals in Python (and with corresponding notations in other languages, too).

Accommodate two types of quotes in a regex

I am using a regex to replace quotes within in an input string. My data contains two 'types' of quotes -
" and “
There's a very subtle difference between the two. Currently, I am explicitly mentioning both these types in my regex
\"*\“*
I am afraid though that in future data I may get a different 'type' of quote on which my regex may fail. How many different types of quotes exist? Is there way to normalize these to just one type so that my regex won't break for unseen data?
Edit -
My input data consists of HTML files and I am escaping HTML entities and URLs to ASCII
escaped_line = HTMLParser.HTMLParser().unescape(urllib.unquote(line.decode('ascii','ignore')))
where line specifies each line in the HTML file. I need to 'ignore' the ASCII as all files in my database don't have the same encoding and I don't know the encoding prior to reading the file.
Edit2
I am unable to do so using replace function. I tried replace('"','') but it doesn't replace the other type of quote '“'. If I add it in another replace function it throws me NON-ASCII character error.
Condition
No external libraries allowed, only native python libraries could be used.

I don't think there is a "quotation marks" character class in Python's regex implementation so you'll have to do the matching yourself.
You could keep a list of common quotation mark unicode characters (here's a list for a good start) and build the part of regex that matches quotation marks programmatically.

I can only help you with the original question about quotations marks. As it turns out, Unicode defines many properties per character and these are all available though the Unicode Character Database. "Quotation mark" is one of these properties.
How many different types of quotes exist?
29, according to Unicode, see below.
The Unicode standard brings us a definitive text file on Unicode properties, PropList.txt, among which a list of quotation marks. Since Python does not support all Unicode properties in regular expressions, you cannot currently use \p{QuotationMark}. However, it's trivial to create a regular expression character class:
// placed on multiple lines for readability, remove spaces
// and then place in your regex in place of the current quotes
[\u0022 \u0027 \u00AB \u00BB
\u2018 \u2019 \u201A \u201B
\u201C \u201D \u201E \u201F
\u2039 \u203A \u300C \u300D
\u300E \u300F \u301D \u301E
\u301F \uFE41 \uFE42 \uFE43
\uFE44 \uFF02 \uFF07 \uFF62
\uFF63]
As "tchrist" pointed out above, you can save yourself the trouble by using Matthew Barnett's regex library which supports \p{QuotationMark}.

Turns out there's a much easier way to do this. Just append the literal 'u' in front of your regex you write in python.
regexp = ru'\"*\“*'
Make sure you use the re.UNICODE flag when you want to compile/search/match your regex to your string.
re.findall(regexp, string, re.UNICODE)
Don't forget to include the
#!/usr/bin/python
# -*- coding:utf-8 -*-
at the start of the source file to make sure unicode strings can be written in your source file.

Parsing in Python: what's the most efficient way to suppress/normalize strings?

I'm parsing a source file, and I want to "suppress" strings. What I mean by this is transform every string like "bla bla bla +/*" to something like "string" that is deterministic and does not contain any characters that may confuse my parser, because I don't care about the value of the strings. One of the issues here is string formatting using e.g. "%s", please see my remark about this below.
Take for example the following pseudo code, that may be the contents of a file I'm parsing. Assume strings start with ", and escaping the " character is done by "":
print(i)
print("hello**")
print("hel"+"lo**")
print("h e l l o "+
"hello\n")
print("hell""o")
print(str(123)+"h e l l o")
print(uppercase("h e l l o")+"g o o d b y e")
Should be transformed to the following result:
print(i)
print("string")
print("string"+"string")
print("string"
"string")
print("string")
print(str(123)+"string")
print(uppercase("string")+"string")
Currently I treat it as a special case in the code (i.e. detect beginning of a string, and "manually" run until its end with several sub-special cases on the way). If there's a Python library function i can use or a nice regex that may make my code more efficient, that would be great.
Few remarks:
I would like the "start-of-string" character to be a variable, e.g. ' vs ".
I'm not parsing Python code at this stage, but I plan to, and there the problem obviously becomes more complex because strings can start in several ways and must end in a way corresponding to the start. I'm not attempting to deal with this right now, but if there's any well established best practice I would like to know about it.
The thing bothering me the most about this "suppression" is the case of string formatting with the likes of '%s', that are meaningful tokens. I'm currently not dealing with this and haven't completely thought it through, but if any of you have suggestions about how to deal with this that would be great. Please note I'm not interested in the specific type or formatting of the in-string tokens, it's enough for me to know that there are tokens inside the string (how many). Remark that may be important here: my tokenizer is not nested, because my goal is quite simple (I'm not compiling anything...).
I'm not quite sure about the escaping of the start-string character. What would you say are the common ways this is implemented in most programming languages? Is the assumption of double-occurrence (e.g. "") or any set of two characters (e.g. '\"') to escape enough? Do I need to treat other cases (think of languages like Java, C/C++, PHP, C#)?

Option 1: To sanitize Python source code, try the built-in tokenize module. It can correctly find strings and other tokens in any Python source file.
Option 3: Use pygments with HTML output, and replace anything in blue (etc.) with "string". pygments supports a few dozen languages.
Option 2: For most of the languages, you can build a custom regexp substitution. For example, the following sanitizes Python source code (but it doesn't work if the source file contains """ or '''):
import re
sanitized = re.sub(r'(#.*)|\'(?:[^\'\\]+|\\.)*\'|"(?:[^"\\]+|\\.)*"',
lambda match: match.group(1) or '"string"', source_code)
The regexp above works properly even if the strings contain backslashes (\", \\, \n, \\, \\", \\\" etc. all work fine).
When you are building your regexp, make sure to match comments (so your regexp substitution won't touch strings inside comments) and regular expression literals (e.g. in Perl, Ruby and JavaScript), and pay attention you match backslashes and newlines properly (e.g. in Perl and Ruby a string can contain a newline).

Use a dedicated parser for each language — especially since people have already done that work for you. Most of the languages you mentioned have a grammar.

Nowhere do you mention that you take an approach using a lexer and parser. If in fact you do not, have a look at e.g. the tokenize module (which is probably what you want), or the 3rd party module PLY (Python Lex-Yacc). Your problem needs a systematic approach, and these tools (and others) provide it.
(Note that once you have tokenized the code, you can apply another specialized tokenizer to the contents of the strings to detect special formatting directives such as %s. In this case a regular expression may do the job, though.)

Develop Reference

Python is a programming language that lets you work quickly and integrate systems more effectively.