Python chunking with regular expressions

In Perl, it was easy to iterate over a string to chunk it into tokens:
$key = ".foo[4][5].bar.baz";
@chunks = $key =~ m/\G\[\d+\]|\.[^][.]+/gc;
print "@chunks\n";
#>> output: .foo [4] [5] .bar .baz
# Optional error handling:
die "Malformed key at '" . substr($key, pos($key)) . "'"
    if pos($key) != length($key);
If more control is needed, that can be turned into a loop instead:
while ($key =~ m/(\G\[\d+\]|\.[^][.]+)/g) {
    push @chunks, $1; # Optionally process each one
}
I'd like to find a clean, idiomatic way to do this in Python. So far I only have this:
import re
key = ".foo[4][5].bar.baz"
rx = re.compile(r'\[\d+\]|\.[^][.]+')
chunks = []
while True:
    m = re.match(rx, key)
    if not m:
        raise ValueError(f"Malformed key at '{key}'")
    chunk = m.group(0)
    chunks.append(chunk[1:] if chunk.startswith('.') else int(chunk[1:-1]))
    key = key[m.end(0):]
    if key == '':
        break
print(chunks)
Aside from it being a lot more verbose, I don't love that because I need to destroy the string as I process it, since there doesn't seem to be an equivalent of Perl's \G anchor (pick up where the last match left off). An alternative would be to keep track of my own match position in the string in each loop, but that seems even more fiddly.
Is there some idiom I haven't found? I also tried some solution using re.finditer() but it doesn't seem to have a way to have each match start at the exact end of the previous match (e.g. re.matchiter() or somesuch).
Suggestions & discussion welcome.

Summary
There is no direct equivalent of re.matchiter() as you described it.
Two alternatives come to mind:
Create a mismatch token.
Write your own generator with the desired behavior.
Mismatch Token
The usual technique in Python is to define a MISMATCH catchall token and to raise an exception if that token is ever encountered.
Here's a working example (one that I wrote and put in the Python docs so that everyone could find it):
from typing import NamedTuple
import re

class Token(NamedTuple):
    type: str
    value: str
    line: int
    column: int

def tokenize(code):
    keywords = {'IF', 'THEN', 'ENDIF', 'FOR', 'NEXT', 'GOSUB', 'RETURN'}
    token_specification = [
        ('NUMBER',   r'\d+(\.\d*)?'),  # Integer or decimal number
        ('ASSIGN',   r':='),           # Assignment operator
        ('END',      r';'),            # Statement terminator
        ('ID',       r'[A-Za-z]+'),    # Identifiers
        ('OP',       r'[+\-*/]'),      # Arithmetic operators
        ('NEWLINE',  r'\n'),           # Line endings
        ('SKIP',     r'[ \t]+'),       # Skip over spaces and tabs
        ('MISMATCH', r'.'),            # Any other character
    ]
    tok_regex = '|'.join('(?P<%s>%s)' % pair for pair in token_specification)
    line_num = 1
    line_start = 0
    for mo in re.finditer(tok_regex, code):
        kind = mo.lastgroup
        value = mo.group()
        column = mo.start() - line_start
        if kind == 'NUMBER':
            value = float(value) if '.' in value else int(value)
        elif kind == 'ID' and value in keywords:
            kind = value
        elif kind == 'NEWLINE':
            line_start = mo.end()
            line_num += 1
            continue
        elif kind == 'SKIP':
            continue
        elif kind == 'MISMATCH':
            raise RuntimeError(f'{value!r} unexpected on line {line_num}')
        yield Token(kind, value, line_num, column)

statements = '''
    IF quantity THEN
        total := total + price * quantity;
        tax := price * 0.05;
    ENDIF;
'''

for token in tokenize(statements):
    print(token)
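Applied to the key format from the question, the mismatch-token idea might look like the sketch below. The group names INDEX, NAME and MISMATCH and the function name chunk_key are made up for this illustration; the two real alternatives are the ones from the question's pattern. Because MISMATCH matches any single character, finditer() never skips over input silently: every character ends up in some token, and a stray one raises an error.

```python
import re

# Hypothetical token names for this sketch; the first two alternatives
# are the question's pattern, the last one is the catch-all.
KEY_RE = re.compile(
    r'(?P<INDEX>\[\d+\])'      # e.g. [4]
    r'|(?P<NAME>\.[^][.]+)'    # e.g. .foo
    r'|(?P<MISMATCH>.)',       # any other single character
    re.DOTALL,
)

def chunk_key(key):
    """Split a key like '.foo[4][5].bar' into names and integer indexes."""
    chunks = []
    for mo in KEY_RE.finditer(key):
        kind, text = mo.lastgroup, mo.group()
        if kind == 'MISMATCH':
            raise ValueError(f"Malformed key at {key[mo.start():]!r}")
        chunks.append(int(text[1:-1]) if kind == 'INDEX' else text[1:])
    return chunks

print(chunk_key(".foo[4][5].bar.baz"))  # ['foo', 4, 5, 'bar', 'baz']
```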
Custom Generator
Another alternative is to write a custom generator with the desired behavior.
The match() method for compiled regular expressions allows an optional starting position for the match operation. With that tool, it isn't hard to write a custom generator that applies match() to consecutive starting positions:
def itermatch(pattern, string):
    p = re.compile(pattern)
    pos = 0
    while True:
        mo = p.match(string, pos)
        if mo is None:
            break # Or raise exception
        yield mo
        pos = mo.end()
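For the key from the question, a stricter variant of that generator could be used as sketched here. The names itermatch_strict and parse_key are hypothetical, and raising ValueError on leftover input is one possible choice for the "or raise exception" branch. The check for an empty match is an added safeguard so that a pattern able to match the empty string cannot loop forever.

```python
import re

def itermatch_strict(pattern, string):
    """Yield matches that each start exactly where the previous one ended."""
    p = re.compile(pattern)
    pos = 0
    while pos < len(string):
        mo = p.match(string, pos)
        # No match here, or an empty match that would never advance.
        if mo is None or mo.end() == pos:
            raise ValueError(f"Malformed key at {string[pos:]!r}")
        yield mo
        pos = mo.end()

def parse_key(key):
    chunks = []
    for mo in itermatch_strict(r'\[\d+\]|\.[^][.]+', key):
        chunk = mo.group(0)
        chunks.append(chunk[1:] if chunk.startswith('.') else int(chunk[1:-1]))
    return chunks

print(parse_key(".foo[4][5].bar.baz"))  # ['foo', 4, 5, 'bar', 'baz']
```

The original string is never sliced or modified; only the integer position moves, which is the closest analogue to Perl's \G that the standard re module offers.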

Related

How to parse strings into numericals for a summation function?

I want to write a summation function, but can't figure out how I would parse the bottom expression and right expression.
def summation(count: int, bottom_var: str, expression: str):
    out = 0
    # parse bottom_var into a variable and its value
    value = ···
    var_name = ···
    expression.replace(var_name, value)
    ···
I want you to be able to use the inputs the same way as in normal sigma notation, as in bottom_var is 'n=13', not '13'.
You enter an assignment for bottom_var, and an expression using the variable defined in bottom_var in expression.
Example:
summation(4, 'x=1', 'x+1')
(would return 14, as 2+3+4+5=14)

First, parse the bottom_var to get the symbol and starting value:
var, value = bottom_var.split('=')
var = var.strip()
value = eval(value)
Using split we get the two parts of the equal sign easily. Using strip we even allow any number of spaces, like x = 1.
Then, we want to create a loop from the starting value to count:
for i in range(value, count+1):
    ...
Lastly, we want to use the loop to sum the expression when each time the symbol is replaced with the current iteration's value. All in all:
def summation(count: int, bottom_var: str, expression: str):
    var, value = bottom_var.split('=')
    var = var.strip()
    value = eval(value)
    res = 0
    for i in range(value, count+1):
        res += eval(expression.replace(var, str(i)))
    return res
For example:
>>> summation(4, 'x=1', 'x+1')
14
Proposing the code in this answer, I feel the need to ask you to read about Why is using 'eval' a bad practice? and please make sure that it is OK for your application. Notice that depending on the context of the use of your code, using eval can be quite dangerous and lead to bad outcomes.
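If eval is acceptable for your use case, one variation is to hand the variable to eval through its namespace argument instead of rewriting the expression text, so that a variable named e is not also replaced inside a word like "exp". This is only a sketch under two assumptions of mine: the starting value is a plain integer, and the expression needs no built-in functions. Emptying `__builtins__` narrows what the expression can reach, but it is not a security sandbox.

```python
def summation(count, bottom_var, expression):
    """Sum `expression` for the variable in `bottom_var` up to `count`."""
    name, start = bottom_var.split('=')
    name = name.strip()
    start = int(start)  # assumption: the lower bound is a plain integer
    # Compile once, then evaluate with a different variable value each time.
    code = compile(expression, '<summation>', 'eval')
    total = 0
    for i in range(start, count + 1):
        total += eval(code, {'__builtins__': {}}, {name: i})
    return total

print(summation(4, 'x=1', 'x+1'))  # 14
```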

This is how I did it:
def summation(count,bottom_var,expression):
    begin = False
    x = ""
    v = ""
    for char in bottom_var:
        if begin:
            x += char
        if char == "=":
            begin = True
        if begin == False:
            v += char
    x = int(x)
    expression = expression.replace(v,str("x"))
    print(expression)
    for n in range(count):
        x = eval(expression)
summation(4,"d=152",'d+145*2')

Python has built-in functions that execute code given as a string: exec() and eval().
For example:
>>> exec('n=13')
>>> print(n)
13
>>> eval('n+1')
14
You can use these in your code.

How do I detect any of 4 characters in a string and then return their index?

I would understand how to do this assuming that I was only looking for one specific character, but in this instance I am looking for any of the 4 operators, '+', '-', '*', '/'. The function returns -1 if there is no operator in the passed string, txt, otherwise it returns the position of the leftmost operator. So I'm thinking find() would be optimal here.
What I have so far:
def findNextOpr(txt):
    # txt must be a nonempty string.
    if len(txt) <= 0 or not isinstance(txt, str):
        print("type error: findNextOpr")
        return "type error: findNextOpr"
    if '+' in txt:
        return txt.find('+')
    elif '-' in txt:
        return txt.find('-')
    else:
        return -1
I think if I did what I did for the '+' and '-' operators for the other operators, it wouldn't work for multiple instances of that operator in one expression. Can a loop be incorporated here?

Your current approach is not very efficient, as you iterate over txt twice for each operator (once for in and once for find()).
You could use index() instead of find() and just ignore the ValueError exception, e.g.:
def findNextOpr(txt):
    for o in '+-*/':
        try:
            return txt.index(o)
        except ValueError:
            pass
    return -1
Like your original code, this checks the operators in a fixed order, so it returns the first '+' even when another operator appears earlier in txt. You can find the leftmost operator in a single (perhaps more readable) pass by enumerate()ing txt and returning as soon as you find one, e.g.:
def findNextOpr(txt):
    for i, c in enumerate(txt):
        if c in '+-*/':
            return i
    return -1
Note: if you wanted all of the operators you could change the return to yield, and then just iterate over the generator, e.g.:
def findNextOpr(txt):
    for i, c in enumerate(txt):
        if c in '+-*/':
            yield i
In []:
for op in findNextOpr('1+2-3+4'):
    print(op)
Out[]:
1
3
5

You can improve your code a bit because you keep looking at the string a lot of times. '+' in txt actually searches through the string just like txt.find('+') does. So you can combine those easily to avoid having to search through it twice:
pos = txt.find('+')
if pos >= 0:
    return pos
But this still leaves you with a problem: it returns the position of the first operator you look for whenever that operator occurs anywhere in the string, so you don't actually get the first position at which any of the operators occurs.
So what you want to do is look for all operators separately, and then return the lowest non-negative number, since that's the first occurrence of any of the operators within the string (or -1 if none of them occurs):
plusPos = txt.find('+')
minusPos = txt.find('-')
multPos = txt.find('*')
divPos = txt.find('/')
return min((pos for pos in (plusPos, minusPos, multPos, divPos) if pos >= 0), default=-1)

First, you shouldn't be printing or returning error messages; you should be raising exceptions. TypeError and ValueError would be appropriate here. (A string that isn't long enough is the latter, not the former.)
Second, you can simply find the positions of all the operators in the string using a list comprehension, exclude results of -1, and return the lowest of the positions using min().
def findNextOpr(text, start=0):
    ops = "+-/*"
    assert isinstance(text, str), "text must be a string"
    # "text must not be empty" isn't strictly true:
    # you'll get a perfectly sensible result for an empty string
    assert text, "text must not be empty"
    op_idxs = [pos for pos in (text.find(op, start) for op in ops) if pos > -1]
    return min(op_idxs) if op_idxs else -1
I've added a start argument that can be used to find the next operator: simply pass in the index of the last-found operator, plus 1.
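The same leftmost-operator search with a start argument can also be written with a regular expression character class. This is a sketch rather than part of the answer above: the name find_next_opr and the OPERATOR_RE constant are my own, and it raises TypeError for non-string input as suggested earlier. A compiled pattern's search() accepts a starting position, so no slicing is needed.

```python
import re

# Character class matching any one of the four operators.
OPERATOR_RE = re.compile(r'[+\-*/]')

def find_next_opr(text, start=0):
    """Return the index of the leftmost operator at or after `start`, or -1."""
    if not isinstance(text, str):
        raise TypeError("text must be a string")
    mo = OPERATOR_RE.search(text, start)
    return mo.start() if mo else -1

print(find_next_opr('1+2-3+4'))     # 1
print(find_next_opr('1+2-3+4', 2))  # 3
print(find_next_opr('1234'))        # -1
```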

Search a string for a given key

I've been doing some more CodeEval challenges and came across one on the hard tab.
You are given two strings. Determine if the second string is a substring of the first (Do NOT use any substr type library function). The second string may contain an asterisk (*) which should be treated as a regular expression i.e. matches zero or more characters. The asterisk can be escaped by a \ char in which case it should be interpreted as a regular '*' character. To summarize: the strings can contain alphabets, numbers, * and \ characters.
So you are given two strings in a file that look something like this: Hello,ell. Your job is to figure out whether ell is in Hello. What I do:
I haven't quite gotten it perfect, but I did get it to the point where it passes with a score of 65% complete. It runs through the string and the key and checks whether the characters match. If the characters match, it appends the character to a list. After this it divides the length of the string by 2 and checks whether the length of the list is greater than or equal to half the length of the string. I figured half of the string length would be enough to verify whether it matches or not. Example of how it works:
h == e -> no
e == e -> yes -> list
l == e -> no
l == e -> no
...
My question is what can I do better to the point where I can verify the wildcards that are said above?
import sys
def search_string(string, key):
    """ Search a string for a specified key.
    If the key exists out put "true" if it doesn't output "false"
    >>> search_string("test", "est")
    true
    >>> search_string("testing", "rawr")
    false"""
    results = []
    for c in string:
        for ch in key:
            if c == ch:
                results.append(c)
    if len(string) / 2 < len(results) or len(string) / 2 == len(results):
        return "true"
    else:
        return "false"
if __name__ == '__main__':
    with open(sys.argv[1]) as data:
        for line in data.readlines():
            data_list = line.rstrip().split(",")
            search_key = data_list[1]
            word = data_list[0]
            print(search_string(word, search_key))

I've come up with a solution to this problem. You've said "Do NOT use any substr type library function", I'm not sure If some of the functions I used are allowed or not, so tell me if I've broken any rules :D
Hope this helps you :)
def search_string(string, key):
    key = key.replace("\\*", "<NormalStar>") # every \* becomes <NormalStar>
    key = key.split("*") # splitting up the key makes it easier to work with
    #print(key)
    point = 0 # for checking order, e.g. test = t*est, test != est*t
    found = "true" # default
    for k in key:
        k = k.replace("<NormalStar>", "*") # every <NormalStar> becomes *
        if k in string[point:]: # the next part of the key is after the part before
            point = string.index(k, point) + len(k) # search from point, then move point after this part
        else: # k not found, return false
            found = "false"
            break
    return found
print(search_string("test", "est")) # true
print(search_string("t....est", "t*est")) # true
print(search_string("n....est", "t*est")) # false
print(search_string("est....t", "t*est")) # false
print(search_string("anything", "*")) # true
print(search_string("test", "t\\*est")) # false
print(search_string("t*est", "t\\*est")) # true

Suffix search - Python

Here's the problem: given a list of strings and a document, find the shortest substring that contains all the strings in the list.
Thus for:
document = "many google employees can program because google is a technology company that can program"
searchTerms = ['google', 'program', 'can']
the output should be:
"can program because google" # 27 chars
and not:
"google employees can program" # 29 chars
"google is a technology company that can program" # 48 chars
Here's my approach:
Split the document into suffixes.
Check for all the strings in each suffix.
Return the shortest suffix that contains them.
Here's my code:
import sys
def snippetSearch(document, searchTerms):
    doc = document.split()
    suffix_array = create_suffix_array(doc)
    current = None
    current_len = sys.maxsize
    for suffix in suffix_array:
        if check_for_terms_in_array(suffix, searchTerms):
            if len(suffix) < current_len:
                current_len = len(suffix)
                current = suffix
    return ' '.join(map(str, current))
def create_suffix_array(document):
    suffix_array = []
    for i in range(len(document)):
        sub = document[i:]
        suffix_array.append(sub)
    return suffix_array
def check_for_terms_in_array(arr, terms):
    for term in terms:
        if term not in arr:
            return False
    return True
This is an online submission and it's not passing one test case. I have no idea what the test case is, though. My question is: is there anything logically incorrect with the code? Also, is there a more efficient way of doing this?

You can break this into two parts. First, finding the shortest substring that matches some property. We'll pretend we already have a function that tests for the property:
def find_shortest_ss(document, some_property):
    # First level of looping gradually increases substring length
    for x in range(len(document) + 1):
        # Second level of looping tests current length at valid positions
        for y in range(len(document) - x + 1):
            if some_property(document[y:x+y]):
                return document[y:x+y]
    # How to handle the case of no match is undefined
    raise ValueError('No matching value found')
Now the property we want to test for itself:
def contains_all_terms(terms):
    return (lambda s: all(term in s for term in terms))
This lambda expression takes some terms and will return a function which, when evaluated on a string, returns true if and only if all the terms are in the string. This is basically a more terse version of a nested function definition which you could write like this:
def contains_all_terms(terms):
    def string_contains_them(s):
        return all(term in s for term in terms)
    return string_contains_them
So we're actually just returning the handle of the function we create dynamically inside of our contains_all_terms function.
To piece this together we do like so:
>>> find_shortest_ss(document, contains_all_terms(searchTerms))
'program can google'
Some efficiency advantages which this code has:
The all builtin function has short-circuit evaluation, meaning that it will return False as soon as it finds a term that is not contained in the substring
It starts by checking all the shortest substrings, then proceeds to increase substring length one extra character length at a time. If it ever finds a satisfying substring it will exit and return that value. So you can guarantee the returned value will never be longer than necessary. It won't even be doing any operations on substrings longer than necessary.
8 lines of code, not bad I think

Well, brute force is O(n³), so why not:
from itertools import product
def find_shortest(document, search_terms):
    doc = document.split()
    substrings = (
        doc[i:j]
        for i, j in product(range(0, len(doc)), range(0, len(doc) + 1))
        if all(search_term in doc[i:j] for search_term in search_terms)
    )
    shortest = doc
    for candidate in substrings:
        if len(candidate) < len(shortest):
            shortest = candidate
    return shortest
document = 'many google employees can program can google employees because google is a technology company that writes program'
search_terms = ['google', 'program', 'can']
print(find_shortest(document, search_terms))
>>>> ['program', 'can', 'google']
You can probably do this a lot faster, though. For example, any relevant substring can only end with one of the keywords
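One way to act on that observation is a sliding window over the words: extend the window to the right one word at a time, and whenever it contains every term, shrink it from the left for as long as it stays valid. The sketch below is my own illustration, not code from the answers; shortest_snippet is a hypothetical name, and it assumes the terms are whole whitespace-separated words and that snippet length is measured in characters.

```python
from collections import Counter

def shortest_snippet(document, search_terms):
    words = document.split()
    needed = set(search_terms)
    counts = Counter()       # occurrences of each needed term in the window
    missing = len(needed)    # needed terms not yet inside the window
    best = None
    left = 0
    for right, word in enumerate(words):
        if word in needed:
            if counts[word] == 0:
                missing -= 1
            counts[word] += 1
        # Shrink from the left while the window still holds every term.
        while missing == 0 and left <= right:
            candidate = ' '.join(words[left:right + 1])
            if best is None or len(candidate) < len(best):
                best = candidate
            dropped = words[left]
            if dropped in needed:
                counts[dropped] -= 1
                if counts[dropped] == 0:
                    missing += 1
            left += 1
    if best is None:
        raise ValueError('No snippet contains all search terms')
    return best

document = ("many google employees can program because google is a "
            "technology company that can program")
print(shortest_snippet(document, ['google', 'program', 'can']))
# can program because google
```

Each word enters and leaves the window at most once, so the number of window moves is linear in the number of words; joining the candidate on every step keeps the sketch short at the cost of some extra copying.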

Instead of brute forcing all possible sub-strings, I brute forced all possible matching word positions... It should be a bit faster..
import numpy as np
from itertools import product
document = 'many google employees can program can google employees because google is a technology company that writes program'
searchTerms = ['google', 'program']
word_lists = []
for word in searchTerms:
    word_positions = []
    start = 0 #starting index of str.find()
    while True:
        start = document.find(word, start) #search through to the end of the document
        if start == -1: #no more instances
            break
        word_positions.append([start, start+len(word)]) #beginning and ending index of search term
        start += 1 #increment starting search position
    word_lists.append(word_positions) #add all search term positions to list of all search terms
minLen = len(document)
lower = 0
upper = len(document)
for p in product(*word_lists): #unpack word_lists into word_positions
    indexes = np.array(p).flatten() #take all indices into flat list
    lowerI = np.min(indexes)
    upperI = np.max(indexes)
    indexRange = upperI - lowerI #determine length of substring
    if indexRange < minLen:
        minLen = indexRange
        lower = lowerI
        upper = upperI
print(document[lower:upper])

Python - packing/unpacking by letters

I'm just starting to learn python and I have this exercise that's puzzling me:
Create a function that can pack or unpack a string of letters.
So aaabb would be packed a3b2 and vice versa.
For the packing part of the function, I wrote the following
def packer(s):
    if s.isalpha():        # Defines if unpacked
        stack = []
        for i in s:
            if s.count(i) > 1:
                if (i + str(s.count(i))) not in stack:
                    stack.append(i + str(s.count(i)))
            else:
                stack.append(i)
        print("".join(stack))
    else:
        print("Something's not quite right..")
        return False
packer("aaaaaaaaaaaabbbccccd")
This seems to work properly. But the assignment says that
if the input has (for example) the letter a after b or c, then
it should later be unpacked into its original form.
So "aaabbkka" should become a3b2k2a, not a4b2k2.
I hence figured, that I cannot use the "count()" command, since
that counts all occurrences of the item in the whole string, correct?
What would be my options here then?
On to the unpacking -
I've thought of the basics what my code needs to do -
between the " if s.isalpha():" and else, I should add an elif that
checks whether or not the string has digits in it. (I figured this would be
enough to determine whether it's the packed version or unpacked).
Create a for loop and inside of it an if sentence, which then checks for every element:
2.1. If it has a number behind it > Return (or add to an empty stack) the number times the digit
2.2. If it has no number following it > Return just the element.
Big question number 2 - how do I check whether it's a number or just another
alphabetical element following an element in the list? I guess this must be done with
slicing, but those only take integers. Could this be achieved with the index command?
Also - if this is of any relevance - so far I've basically covered lists, strings, if and for
and I've been told this exercise is doable with just those (...so if you wouldn't mind keeping this really basic)
All help appreciated for the newbie enthusiast!
SOLVED:
def packer(s):
    if s.isalpha():        # Defines if unpacked
        groups = []
        last_char = None
        for c in s:
            if c == last_char:
                groups[-1].append(c)
            else:
                groups.append([c])
            last_char = c
        return ''.join('%s%s' % (g[0], len(g)>1 and len(g) or '') for g in groups)
    else:                   # Seems to be packed
        stack = ""
        for i in range(len(s)):
            if s[i].isalpha():
                if i+1 < len(s) and s[i+1].isdigit():
                    digit = s[i+1]
                    char = s[i]
                    i += 2
                    while i < len(s) and s[i].isdigit():
                        digit += s[i]
                        i += 1
                    stack += char * int(digit)
                else:
                    stack += s[i]
            else:
                ""
        return "".join(stack)
print (packer("aaaaaaaaaaaabbbccccd"))
print (packer("a4b19am4nmba22"))
So this is my final code. Almost managed to pull it all off with just for loops and if statements.
In the end though I had to bring in the while loop to solve reading the multiple-digit numbers issue. I think I still managed to keep it simple enough. Thanks a ton millimoose and everyone else for chipping in!

A straightforward solution:
If a char is different, make a new group. Otherwise append it to the last group. Finally count all groups and join them.
def packer(s):
    groups = []
    last_char = None
    for c in s:
        if c == last_char:
            groups[-1].append(c)
        else:
            groups.append([c])
        last_char = c
    return ''.join('%s%s'%(g[0], len(g)) for g in groups)
Another approach is using re.
Regex r'(.)\1+' can match consecutive characters longer than 1. And with re.sub you can easily encode it:
import re
regex = re.compile(r'(.)\1+')
def replacer(match):
    return match.group(1) + str(len(match.group(0)))
regex.sub(replacer, 'aaabbkka')
#=> 'a3b2k2a'

I think you can use the itertools.groupby function.
For example:
import itertools
data = 'aaassaaasssddee'
groupped_data = ((c, len(list(g))) for c, g in itertools.groupby(data))
result = ''.join(c + (str(n) if n > 1 else '') for c, n in groupped_data)
Of course, one can make this code more readable by using a generator function instead of a generator expression.
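A possible generator-function version, paired with a small regex-based unpacker so the round trip can be checked, is sketched below. The names runs, pack and unpack are made up for this example, and unpack assumes the packed form consists only of ASCII letters, each optionally followed by a decimal count.

```python
import re
from itertools import groupby

def runs(s):
    """Yield (character, run length) for each run of identical characters."""
    for char, group in groupby(s):
        yield char, sum(1 for _ in group)

def pack(s):
    # A run of length 1 is written without a count.
    return ''.join(char + (str(n) if n > 1 else '') for char, n in runs(s))

def unpack(s):
    # Each letter may be followed by an optional decimal repeat count.
    return ''.join(char * int(count or 1)
                   for char, count in re.findall(r'([A-Za-z])(\d*)', s))

print(pack('aaabbkka'))   # a3b2k2a
print(unpack('a3b2k2a'))  # aaabbkka
```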

This is an implementation of the algorithm I outlined in the comments:
from itertools import takewhile, count, islice
def consume(items):
    from collections import deque
    deque(items, maxlen=0)
def ilen(items):
    result = count()
    consume(zip(items, result))
    return next(result)
def pack_or_unpack(data):
    start = 0
    result = []
    while start < len(data):
        if data[start].isdigit():
            # `data` is packed, bail
            return unpack(data)
        run = run_len(data, start)
        # append the character that might repeat
        result.append(data[start])
        if run > 1:
            # append the length of the run of characters
            result.append(str(run))
        start += run
    return ''.join(result)
def run_len(data, start):
    """Return the length of the run of identical characters starting at
    `start`"""
    return ilen(takewhile(lambda c: c == data[start],
                          islice(data, start, None)))
def unpack(data):
    result = []
    for i in range(len(data)):
        if data[i].isdigit():
            # skip digits, we'll look for them below
            continue
        # packed character
        c = data[i]
        # number of repetitions
        n = 1
        if (i+1) < len(data) and data[i+1].isdigit():
            # if the next character is a digit, grab all the digits in the
            # substring starting at i+1
            n = int(''.join(takewhile(str.isdigit, data[i+1:])))
        # append the repeated character
        result.append(c*n) # multiplying a string with a number repeats it
    return ''.join(result)
print(pack_or_unpack('aaabbc'))
print(pack_or_unpack('a3b2c'))
print(pack_or_unpack('a10'))
print(pack_or_unpack('b5c5'))
print(pack_or_unpack('abc'))
A regex-flavoured version of unpack() would be:
import re
UNPACK_RE = re.compile(r'(?P<char> [a-zA-Z]) (?P<count> \d+)?', re.VERBOSE)
def unpack_re(data):
    matches = UNPACK_RE.finditer(data)
    pairs = ((m.group('char'), m.group('count')) for m in matches)
    return ''.join(char * (int(count) if count else 1)
                   for char, count in pairs)
This code demonstrates the most straightforward (or "basic") approach of implementing that algorithm. It's not particularly elegant or idiomatic or necessarily efficient. (It would be if written in C, but Python has the caveats such as: indexing a string copies the character into a new string, and algorithms that seem to copy data excessively might be faster than trying to avoid this if the copying is done in C and the workaround was implemented with a Python loop.)
