Working on a simple script processing text files and logs. It has to take a list of regular expressions for substitutions from the command line. For example:
./myscript.py --replace=s/foo/bar/ --replace=s#/etc/hosts#/etc/foo# --replace=#test\#email.com#root\#email.com#
Is there a simple way to pass a user-specified substitution pattern to the Python re library and have that pattern run against a string? Any elegant solution?
If possible, I'd like to avoid writing my own parser. Note that I'd like support for modifiers like /g or /i and so on.
Thanks!
You could use a space as a separator and let the shell's command-line parser do the splitting:
$ myscript --replace=foo bar \
> --replace=/etc/hosts /etc/foo gi \
> --replace=test#email.com root#email.com
Global replacement (the g flag) is the default behaviour of Python's re.sub(), so the script has to add special support for it:
#!/usr/bin/env python
import re
from argparse import ArgumentParser
from functools import partial

all_re_flags = 'Lgimsux'  # regex flags
parser = ArgumentParser(usage='%(prog)s [--replace PATTERN REPL [FLAGS]]...')
parser.add_argument('-e', '--replace', action='append', nargs='*')
args = parser.parse_args()
print(args.replace)

subs = []  # replacement functions: input string -> result
for arg in args.replace:
    count = 1  # replace only the first occurrence if no `g` flag
    if len(arg) == 2:
        pattern, repl = arg
    elif len(arg) == 3:
        pattern, repl, flags = arg
        if not set(flags) <= set(all_re_flags):  # every flag must be known
            parser.error('invalid flags %r for --replace option' % flags)
        if 'g' in flags:  # add support for `g` flag
            flags = flags.replace('g', '')
            count = 0  # replace all occurrences
        if flags:  # embed flags
            pattern = "(?%s)%s" % (flags, pattern)
    else:
        parser.error('wrong number of arguments for --replace option')
    subs.append(partial(re.compile(pattern).sub, repl, count=count))
You could use subs as follows:
input_string = 'a b a b'
for replace in subs:
    print(replace(input_string))
Example:
$ ./myscript -e 'a b' 'no flag' -e 'a B' 'with flags' ig
Output:
[['a b', 'no flag'], ['a B', 'with flags', 'ig']]
no flag a b
with flags with flags
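To illustrate why the g flag needs that special handling, here is a quick interpreter check (not part of the original answer): Python's re.sub() already replaces every match unless you limit it with count.

import re

print(re.sub('a', 'X', 'a b a b'))            # all matches replaced: 'X b X b'
print(re.sub('a', 'X', 'a b a b', count=1))   # sed's default behaviour: 'X b a b'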
As mentioned in the comments, you can use re.compile(), but apparently that only helps for matching and searching. Assuming you only have substitutions, you might do something like this:
import operator
import re
from functools import reduce  # reduce is only a builtin in Python 2

modifiers_map = {
    'i': re.IGNORECASE,
    'g': 0,  # re.sub replaces every occurrence by default, so 'g' adds nothing
    # ... add further modifiers here as needed
}

for replace in replacements:
    # Look for a generalized separator in front of a command,
    # e.g. s/foo/bar/i or s#foo#bar#
    m = re.match(r'(s?)(.)(.+?)\2(.+?)\2([ig]*)$', replace)
    if not m:
        print('Invalid command: %s' % replace)
        continue
    command, separator, query, substitution, modifiers = m.groups()
    # Convert the modifiers to flags
    flags = reduce(operator.or_, [modifiers_map[char] for char in modifiers], 0)
    # This needs a little bit of tweaking if you want to support
    # group matching (like \1, \2, etc.). This also assumes that
    # you're only getting 's' as a command
    my_text = re.sub(query, substitution, my_text, flags=flags)
Suffice it to say, this is a rough draft but I think it'd get you 90% of the way to what you're looking for.
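As a quick sanity check of how that pattern-splitting regex decomposes a command, here it is applied by hand (the command string is just an example):

import re

m = re.match(r'(s?)(.)(.+?)\2(.+?)\2([ig]*)$', 's#/etc/hosts#/etc/foo#i')
print(m.groups())  # ('s', '#', '/etc/hosts', '/etc/foo', 'i')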
Thanks for the answers. Given the complexity of the proposed solutions and the lack of a pre-baked parser in the standard library, I just went the extra mile and implemented my own parser.
It is not significantly more complex than the other proposals, see below. I just need to write tests now.
Thanks!
import re


class Replacer(object):
    def __init__(self, patterns=[]):
        self.patterns = []
        for pattern in patterns:
            self.AddPattern(pattern)

    def ParseFlags(self, flags):
        mapping = {
            'g': 0, 'i': re.I, 'l': re.L, 'm': re.M, 's': re.S, 'u': re.U, 'x': re.X,
            'd': re.DEBUG
        }
        result = 0
        for flag in flags:
            try:
                result |= mapping[flag]
            except KeyError:
                raise ValueError(
                    "Invalid flag: %s, known flags: %s" % (flag, mapping.keys()))
        return result

    def Apply(self, text):
        for regex, repl in self.patterns:
            text = regex.sub(repl, text)
        return text

    def AddPattern(self, pattern):
        separator = pattern[0]
        match = []
        for position, char in enumerate(pattern[1:], start=1):
            if char == separator:
                if pattern[position - 1] != '\\':
                    break
                match[-1] = separator
                continue
            match += char
        else:
            raise ValueError("Invalid pattern: could not find divisor.")
        replacement = []
        for position, char in enumerate(pattern[position + 1:], start=position + 1):
            if char == separator:
                if pattern[position - 1] != '\\':
                    break
                replacement[-1] = separator
                continue
            replacement += char
        else:
            raise ValueError(
                "Invalid pattern: could not find divisor '%s'." % separator)
        flags = self.ParseFlags(pattern[position + 1:])
        match = ''.join(match)
        replacement = ''.join(replacement)
        self.patterns.append((re.compile(match, flags=flags), replacement))
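A hypothetical usage sketch (not from the original post); note that each pattern string starts with its separator, e.g. '/pattern/replacement/flags', without sed's leading 's':

replacer = Replacer(['/foo/bar/', '#/etc/hosts#/etc/foo#'])
print(replacer.Apply('foo is listed in /etc/hosts'))
# -> 'bar is listed in /etc/foo'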
I am writing a lexical analyzer for certain words that are in a .txt file. I declare certain words as reserved and try to print only those selected words on the screen, but the result I get is that it takes all the words in the .txt file and prints them. I've been following the tutorial and the official PLY documentation at http://www.dabeaz.com/ply/ply.html#ply_nn6 but I still haven't achieved my goal. Could someone help me with this? Thank you very much.
import ply.lex as lex
import re
import os
import sys

reservadas = {
    'if' : 'if',
    'then' : 'then',
    'else' : 'else',
    'while' : 'while',
}

tokens = ['ID','NUMBER','PLUS','MINUS','TIMES','DIVIDE',
          'ODD','ASSIGN','NE','LT','LTE','GT','GTE',
          'LPARENT', 'RPARENT','COMMA','SEMMICOLOM',
          'DOT','UPDATE'
          ] + list(reservadas.values())

#tokens = tokens+reservadas

# reservadas = {
#     'begin':'BEGIN',
#     'end':'END',
#     'if':'IF',
#     'then':'THEN',
#     'while':'WHILE',
#     'do':'DO',
#     'call':'CALL',
#     'const':'CONST',
#     'int':'VAR',
#     'procedure':'PROCEDURE',
#     'out':'OUT',
#     'in':'IN',
#     'else':'ELSE'
# }
#tokens = tokens+list(reservadas.values())

t_ignore = '\t '

t_ignore_PLUS = r'\+'
t_ignore_MINUS = r'\-'
t_ignore_TIMES = r'\*'
t_ignore_DIVIDE = r'/'
t_ignore_ODD = r'ODD'
t_ignore_ASSIGN = r'='
t_ignore_NE = r'<>'
t_ignore_LT = r'<'
t_ignore_LTE = r'<='
t_ignore_GT = r'>'
t_ignore_GTE = r'>='
t_ignore_LPARENT = r'\('
t_ignore_RPARENT = r'\)'
t_ignore_COMMA = r','
t_ignore_SEMMICOLOM = r';'
t_ignore_DOT = r'\.'
t_ignore_UPDATE = r':='

def t_ID(t):
    r'[a-zA-Z_][a-zA-Z_0-9]*'
    t.type = reservadas.get(t.value,'ID')    # Check for reserved words
    return t

def t_newline(t):
    r'\n+'
    t.lexer.lineno += len(t.value)

#dsfjksdlgjklsdgjsdgslxcvjlk-,.
def t_COMMENT(t):
    r'\//.*'
    r'\/*.*'
    r'\*/.*'
    pass

def t_NUMBER(t):
    r'\d+'
    t.value = int(t.value)
    pass

def t_error(t):
    print ("----- '%s'" % t.value[0])
    t.lexer.skip(1)
# (not shown in this snippet: before the loop below the lexer is built and fed
#  the .txt contents, e.g. analizador = lex.lex() and analizador.input(data))
while True:
    tok = analizador.token()
    if not tok: break
    print (tok)
the output I get with the above code is:
LexToken(ID,'FSR',1,3)
LexToken(ID,'testing',1,7)
LexToken(ID,'sketch',1,15)
'---- '
'---- '
LexToken(ID,'Connect',3,28)
LexToken(ID,'one',3,36)
LexToken(ID,'end',3,40)
LexToken(ID,'of',3,44)
LexToken(ID,'FSR',3,47)
LexToken(ID,'to',3,51)
LexToken(ID,'V',3,55)
LexToken(ID,'the',3,58)
LexToken(ID,'other',3,62)
LexToken(ID,'end',3,68)
LexToken(ID,'to',3,72)
LexToken(ID,'Analog',3,75)
'---- '
.
.
.
.
LexToken(ID,'Serial',21,694)
LexToken(ID,'print',21,701)
----- '"'
LexToken(ID,'Analog',21,708)
LexToken(ID,'reading',21,715)
----- '"'
'---- '
LexToken(ID,'Serial',22,732)
LexToken(ID,'println',22,739)
LexToken(ID,'fsrReading',22,747)
'---- '
'---- '
LexToken(ID,'LEDbrightness',26,898)
LexToken(ID,'map',26,914)
LexToken(ID,'fsrReading',26,918)
'---- '
LexToken(ID,'analogWrite',28,996)
LexToken(ID,'LEDpin',28,1008)
LexToken(ID,'LEDbrightness',28,1016)
'---- '
LexToken(ID,'IF',29,1034)
'---- '
LexToken(if,'if',30,1038)
'---- '
LexToken(ID,'delay',31,1044)
'---- '
----- '}'
Press any key to continue . . .
the output I expect would be this:
LexToken(ID,'IF',29,1034)
'---- '
LexToken(if,'if',30,1038)
I am analyzing Arduino code, and all those words are inside comments. I only need it to look for the conditionals if or IF, or other reserved words like for; the main idea is that, given a list of reserved words, it identifies them and shows me only those selected.
If you want to discard tokens that are not in your 'reserved' list, adjust the t_ID function like so:
def t_ID(t):
    r'[a-zA-Z_][a-zA-Z_0-9]*'
    reserved_type = reservadas.get(t.value, False)
    if reserved_type:
        t.type = reserved_type
        return t  # Return token with reserved type
    return None   # Discard non-reserved tokens
Additionally, your comment token function is probably not doing what you intend:
def t_COMMENT(t):
    r'\//.*'
    r'\/*.*'
    r'\*/.*'
    pass
You can't use multiple rules or span a rule over multiple strings like this, because the docstring (which ply uses to get the regex) will only contain the very first string.
Secondly, I think the regex needs adjusting for comments, assuming you're tokenizing C or a C-like language. Particularly, it needs to account for the possibility that comments span multiple lines.
To fix, apply the following for dealing with comments:
def t_block_comment(tok):
r'/\*((.|\n))*?\*/'
tok.lexer.lineno += tok.value.count('\n')
return None # Discard block comments "/* comment */"
t_ignore_comment = r'//.*' # ignore inline comments "// comment"
You may also need to apply the regex multiline flag:
analizador = lex.lex(reflags=re.MULTILINE)
Lastly, your t_ignore_DIVIDE = r'/' may be preventing your comment rules from applying, too. Consider ordering this after the comment rules.
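Putting those pieces together, here is a minimal sketch of the whole lexer; the reserved-word map and the sample input are made up, and reflags is left out because the block-comment regex above already matches newlines:

import ply.lex as lex

reservadas = {'if': 'IF', 'for': 'FOR'}   # hypothetical reserved-word map
tokens = list(reservadas.values())

t_ignore = ' \t'
t_ignore_comment = r'//.*'                # discard inline comments

def t_block_comment(t):
    r'/\*(.|\n)*?\*/'
    t.lexer.lineno += t.value.count('\n') # discard block comments, keep line numbers

def t_ID(t):
    r'[a-zA-Z_][a-zA-Z_0-9]*'
    reserved_type = reservadas.get(t.value, False)
    if reserved_type:
        t.type = reserved_type
        return t                          # keep reserved words, drop everything else

def t_newline(t):
    r'\n+'
    t.lexer.lineno += len(t.value)

def t_error(t):
    t.lexer.skip(1)                       # silently skip punctuation, digits, etc.

analizador = lex.lex()
analizador.input('delay(30); /* a comment */ if (x) { for(;;){} } // trailing\n')
while True:
    tok = analizador.token()
    if not tok:
        break
    print(tok)                            # only the IF and FOR tokens come out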
s = '^^^# """#$ raw data &*823ohcneuj^^^ Important Information ^^^raw data^^^ Imp Info'
In it, I want to remove the text between the ^^^ delimiters.
The output should be "Important Information Imp Info"
You can do this with regular expressions:
import re
s = '^^^# """#$ raw data &*823ohcneuj^^^ Important Information ^^^raw data^^^ Imp Info'
important = re.compile(r'\^\^\^.*?\^\^\^').sub('', s)
The key elements in this regular expression are:
escape the ^ character since it has special meaning
use the non-greedy match .*? so that each delimited span is matched separately
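A quick check of the result (this just reruns the snippet above; squeezing out the leftover spaces with split/join afterwards is one option among many):

import re

s = '^^^# """#$ raw data &*823ohcneuj^^^ Important Information ^^^raw data^^^ Imp Info'
important = re.compile(r'\^\^\^.*?\^\^\^').sub('', s)
print(important)                    # ' Important Information  Imp Info'
print(' '.join(important.split()))  # 'Important Information Imp Info'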
def removeText(text):
    carrotCount = 0
    newText = ""
    for char in text:
        if(char == '^'):
            # Reset if we have exceeded 2 sets of carrots
            if(carrotCount == 6):
                carrotCount = 1
            else:
                carrotCount += 1
        # Check if we have reached the first '^^^'
        elif(carrotCount == 3):
            # Ignore everything between the carrots
            if(char != '^'):
                continue
            # Add the second set of carrots when we find them
            else:
                carrotCount += 1
        # Check if we have reached the end of the second ^^^
        # If we have, we have the message
        elif(carrotCount == 6):
            newText += char
    return newText
Calling it on the sample string returns " Important Information  Imp Info" (note the stray spaces left where the delimited spans were removed).
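For example, reusing the sample string from the question with the function above:

s = '^^^# """#$ raw data &*823ohcneuj^^^ Important Information ^^^raw data^^^ Imp Info'
print(removeText(s))  # -> ' Important Information  Imp Info'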
I am wondering how I could make an algorithm that parses a string for the hashtag symbol '#' and returns the full string, but wherever a word starts with a '#' symbol, it becomes a link. I am using Python with Google App Engine: webapp2 and Jinja2, and I am building a blog.
Thanks
A more efficient and complete way to find the "hashwords":
import functools

def hash_position(string):
    return string.find('#')

def delimiter_position(string, delimiters):
    positions = filter(lambda x: x >= 0, map(lambda delimiter: string.find(delimiter), delimiters))
    try:
        return functools.reduce(min, positions)
    except TypeError:
        return -1

def get_hashed_words(string, delimiters):
    maximum_length = len(string)
    current_hash_position = hash_position(string)
    string = string[current_hash_position:]
    results = []
    counter = 0
    while current_hash_position != -1:
        current_delimiter_position = delimiter_position(string, delimiters)
        if current_delimiter_position == -1:
            results.append(string)
        else:
            results.append(string[0:current_delimiter_position])
        # Update offsets and the haystack
        string = string[current_delimiter_position:]
        current_hash_position = hash_position(string)
        string = string[current_hash_position:]
    return results

if __name__ == "__main__":
    string = "Please #clarify: What do you #mean with returning somthing as a #link. #herp"
    delimiters = [' ', '.', ',', ':']
    print(get_hashed_words(string, delimiters))
Imperative code with updates of the haystack looks a little bit ugly but hey, that's what we get for (ab-)using mutable variables.
And I still have no idea what you mean by "returning something as a link".
Hope that helps.
Not sure where you get the data for the link from, but maybe something like:
[('<a href="...">%s</a>' % word) for word in input.split() if word[0] == '#']
Are you talking about twitter? Maybe this?
def get_hashtag_link(hashtag):
    if hashtag.startswith("#"):
        # the "/tags/..." href is only a placeholder (the original markup was
        # lost); point it at wherever your tag pages live
        return '<a href="/tags/%s">%s</a>' % (hashtag[1:], hashtag)

>>> get_hashtag_link("#stackoverflow")
'<a href="/tags/stackoverflow">#stackoverflow</a>'
It will return None if hashtag is not a hashtag.
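If the goal is to linkify every hashtag inside a blog post body before rendering it with Jinja2, a hypothetical helper along the same lines might look like this (the /tags/ URL scheme is an assumption, not something from the question):

import re

def linkify_hashtags(text):
    # turn every #word into an anchor pointing at a (made-up) tag page
    return re.sub(r'#(\w+)', r'<a href="/tags/\1">#\1</a>', text)

print(linkify_hashtags("Loving #python and #jinja2"))
# -> 'Loving <a href="/tags/python">#python</a> and <a href="/tags/jinja2">#jinja2</a>'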
If I have a keyword, how can I get the lexer, once it encounters that keyword, to just grab the rest of the line and return it as a string? That is, once it encounters the end of the line, return everything on that line.
Here is the line I'm looking at:
description here is the rest of my text to collect
Thus, when the lexer encounters description, I would like "here is the rest of my text to collect" returned as a string
I have the following defined, but it seems to be throwing an error:
states = (
    ('bcdescription', 'exclusive'),
)

def t_bcdescription(t):
    r'description '
    t.lexer.code_start = t.lexer.lexpos
    t.lexer.level = 1
    t.lexer.begin('bcdescription')

def t_bcdescription_close(t):
    r'\n'
    t.value = t.lexer.lexdata[t.lexer.code_start:t.lexer.lexpos+1]
    t.type = "BCDESCRIPTION"
    t.lexer.lineno += t.value.count('\n')
    t.lexer.begin('INITIAL')
    return t
This is part of the error being returned:
File "/Users/me/Coding/wm/wm_parser/ply/lex.py", line 393, in token
raise LexError("Illegal character '%s' at index %d" % (lexdata[lexpos],lexpos), lexdata[lexpos:])
ply.lex.LexError: Illegal character ' ' at index 40
Finally, if I wanted this functionality for more than one token, how could I accomplish that?
Thanks for your time
There is no big problem with your code; in fact, I just copied your code and ran it, and it works well:
import ply.lex as lex

states = (
    ('bcdescription', 'exclusive'),
)

tokens = ("BCDESCRIPTION",)

def t_bcdescription(t):
    r'\bdescription\b'
    t.lexer.code_start = t.lexer.lexpos
    t.lexer.level = 1
    t.lexer.begin('bcdescription')

def t_bcdescription_close(t):
    r'\n'
    t.value = t.lexer.lexdata[t.lexer.code_start:t.lexer.lexpos+1]
    t.type = "BCDESCRIPTION"
    t.lexer.lineno += t.value.count('\n')
    t.lexer.begin('INITIAL')
    return t

def t_bcdescription_content(t):
    r'[^\n]+'

lexer = lex.lex()

data = 'description here is the rest of my text to collect\n'
lexer.input(data)

while True:
    tok = lexer.token()
    if not tok: break
    print tok
and the result is:
LexToken(BCDESCRIPTION,' here is the rest of my text to collect\n',1,50)
So maybe you can check other parts of your code.
As for wanting this functionality for more than one token: you can simply capture words, and when a word from that token set appears, start capturing the rest of the line with the code above.
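One simple way to read that suggestion, sketched here with one rule per keyword instead of an exclusive state (the keyword set and token names are made up):

import ply.lex as lex

tokens = ("BCDESCRIPTION", "BCTITLE")

def t_BCDESCRIPTION(t):
    r'description[^\n]*'
    t.value = t.value[len('description'):].strip()  # keep only the rest of the line
    return t

def t_BCTITLE(t):
    r'title[^\n]*'
    t.value = t.value[len('title'):].strip()
    return t

def t_newline(t):
    r'\n+'
    t.lexer.lineno += len(t.value)

t_ignore = ' \t'

def t_error(t):
    t.lexer.skip(1)

lexer = lex.lex()
lexer.input('description here is the rest of my text to collect\ntitle a second keyword line\n')
while True:
    tok = lexer.token()
    if not tok:
        break
    print(tok)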
It is not obvious why you need to use a lexer/parser for this without further information.
>>> x = 'description here is the rest of my text to collect'
>>> a, b = x.split(' ', 1)
>>> a
'description'
>>> b
'here is the rest of my text to collect'
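The same split-based idea extends to several keywords (the keyword set below is made up for illustration):

keywords = {'description', 'title'}

def grab_rest(line):
    first, _, rest = line.partition(' ')
    return rest if first in keywords else None

print(grab_rest('description here is the rest of my text to collect'))
# -> 'here is the rest of my text to collect'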
########################################
# some comment
# other comment
########################################

block1 {
    value=data
    some_value=some other kind of data
    othervalue=032423432
}

block2 {
    value=data
    some_value=some other kind of data
    othervalue=032423432
}
The best way would be to use an existing format such as JSON.
Here's an example parser for your format:
from lepl import (AnyBut, Digit, Drop, Eos, Integer, Letter,
                  NON_GREEDY, Regexp, Space, Separator, Word)

# EBNF
# name = ( letter | "_" ) , { letter | "_" | digit } ;
name = Word(Letter() | '_',
            Letter() | '_' | Digit())
# words = word , space+ , word , { space+ , word } ;
# two or more space-separated words (non-greedy to allow comment at the end)
words = Word()[2::NON_GREEDY, ~Space()[1:]] > list
# value = integer | word | words ;
value = (Integer() >> int) | Word() | words
# comment = "#" , { all characters - "\n" } , ( "\n" | EOF ) ;
comment = '#' & AnyBut('\n')[:] & ('\n' | Eos())

with Separator(~Regexp(r'\s*')):
    # statement = name , "=" , value ;
    statement = name & Drop('=') & value > tuple
    # suite = "{" , { comment | statement } , "}" ;
    suite = Drop('{') & (~comment | statement)[:] & Drop('}') > dict
    # block = name , suite ;
    block = name & suite > tuple
    # config = { comment | block } ;
    config = (~comment | block)[:] & Eos() > dict

from pprint import pprint
pprint(config.parse(open('input.cfg').read()))
Output:
[{'block1': {'othervalue': 32423432,
             'some_value': ['some', 'other', 'kind', 'of', 'data'],
             'value': 'data'},
  'block2': {'othervalue': 32423432,
             'some_value': ['some', 'other', 'kind', 'of', 'data'],
             'value': 'data'}}]
Well, the data looks pretty regular. So you could do something like this (untested):
class Block(object):
    def __init__(self, name):
        self.name = name

infile = open(...)  # insert filename here
current = None
blocks = []

for line in infile:
    if line.lstrip().startswith('#'):
        continue
    elif line.rstrip().endswith('{'):
        current = Block(line.split()[0])
    elif '=' in line:
        attr, value = line.strip().split('=')
        try:
            value = int(value)
        except ValueError:
            pass
        setattr(current, attr, value)
    elif line.rstrip().endswith('}'):
        blocks.append(current)
The result will be a list of Block instances, where block.name will be the name ('block1', 'block2', etc.) and other attributes correspond to the keys in your data. So, blocks[0].value will be 'data', etc. Note that this only handles strings and integers as values.
(there is an obvious bug here if your keys can ever include 'name'. You might like to change self.name to self._name or something if this can happen)
HTH!
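A hypothetical spin through the result, assuming the loop above has just consumed the sample config:

for block in blocks:
    print(block.name, block.value, block.othervalue)
# block1 data 32423432
# block2 data 32423432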
If you do not really mean parsing, but rather text processing, and the input data is really that regular, then go with John's solution. If you really need some parsing (like there are some slightly more complex rules to the data you are getting), then depending on the amount of data you need to parse, I'd go with either pyparsing or simpleparse. I've tried both of them, but pyparsing was actually too slow for me.
You might look into something like pyparsing.
Grako (for "grammar compiler") allows you to separate the input format specification (the grammar) from its interpretation (the semantics). Here's a grammar for your input format in Grako's variety of EBNF:
(* a file contains zero or more blocks *)
file = {block} $;
(* a named block has at least one assignment statement *)
block = name '{' {assignment}+ '}';
assignment = name '=' value NEWLINE;
name = /[a-z][a-z0-9_]*/;
value = integer | string;
NEWLINE = /\n/;
integer = /[0-9]+/;
(* string value is everything until the next newline *)
string = /[^\n]+/;
To install grako, run pip install grako. To generate the PEG parser from the grammar:
$ grako -o config_parser.py Config.ebnf
To convert stdin into json using the generated config_parser module:
#!/usr/bin/env python
import json
import string
import sys
from config_parser import ConfigParser

class Semantics(object):
    def file(self, ast):
        # file = {block} $
        # all blocks should have unique names within the file
        return dict(ast)

    def block(self, ast):
        # block = name '{' {assignment}+ '}'
        # all assignment statements should use unique names
        return ast[0], dict(ast[2])

    def assignment(self, ast):
        # assignment = name '=' value NEWLINE
        # value = integer | string
        return ast[0], ast[2]  # name, value

    def integer(self, ast):
        return int(ast)

    def string(self, ast):
        return ast.strip()  # remove leading/trailing whitespace

parser = ConfigParser(whitespace='\t\n\v\f\r ', eol_comments_re="#.*?$")
ast = parser.parse(sys.stdin.read(), rule_name='file', semantics=Semantics())
json.dump(ast, sys.stdout, indent=2, sort_keys=True)
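Assuming the script above is saved as config_to_json.py (the name is made up) next to the generated config_parser module, the config file is piped in on stdin:

$ python config_to_json.py < input.cfg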
Output
{
  "block1": {
    "othervalue": 32423432,
    "some_value": "some other kind of data",
    "value": "data"
  },
  "block2": {
    "othervalue": 32423432,
    "some_value": "some other kind of data",
    "value": "data"
  }
}