Expanding a tree-like data structure - python

I am attempting to use Python to alter some text strings using the re module (i.e., re.sub). However, I think that my question is applicable to other languages that have regex implementations.
I have a number of strings that represent tree-like data structures. They look something like this:
(A,B)-C-D
A-B-(C,D)
A-(B,C,D-(E,F,G,H,I))
Each letter represents a branch or edge. Letters in parentheses represent branches coming into or out of another branch.
Everywhere there is a 'plain' tuple of values (a tuple containing only comma-separated single letters), I would like to take the prefix (X-) or suffix (-X) of that tuple and apply it to each of the values in the tuple.
Under this transformation, the above strings would become
(A-C,B-C)-D
A-(B-C,B-D)
A-(B,C,(D-E,D-F,D-G,D-H,D-I))
Applying the methodology repeatedly would ultimately yield
(A-C-D,B-C-D)
(A-B-C,A-B-D)
(A-B,A-C,A-D-E,A-D-F,A-D-G,A-D-H,A-D-I)
The strings in these tuples then represent the paths through the tree starting at a root and ending at a leaf.
Any help accomplishing this task using regular expressions (or other approaches) would be greatly appreciated.

You cannot do this with regular expressions alone, because you have to deal with nested structures. Instead, you could use pyparsing's nestedExpr.

The problem you are describing is one of enumerating paths within a graph.
You describe three graphs
A   B
 \ /
  C
  |
  D

  A
  |
  B
 / \
C   D

      A
    / | \
   B  C  D
       // | \\
      E F  G  H I
and for each you want to enumerate paths. This involves distributing a value across an arbitrarily nested structure. If this could be done with regexes, and I am not certain that it can, it would have to be done, I believe, in several passes.
My sense of your problem though, is that it is best solved by parsing your string into a graph structure and then enumerating the paths. If you do not want to physically build the graph, you can probably generate strings within user-supplied actions to a parser generator.
A regex-based solution would have to know how to handle both
(A,B)-C
and
(A,B,C,D,E,F,G,H)-I
You can match these strings with
\([A-Z](,[A-Z])*\)-[A-Z]
but how would you "distribute" over all submatches without some logic? Since you need this logic anyway, you might as well perform it on a real graph structure. You can also do this on a string itself, but it would be better to do this under the auspices of a parser generator which can handle context-free or context-sensitive structures.
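To make that concrete, here is a sketch of what a single distribution pass might look like with re.sub and function replacements. It only handles parenthesis-free tuples with an adjacent prefix or suffix; the nested cases and the final flattening would still need the extra logic described above, applied over several passes.

import re

def distribute(s):
    # suffix case: (A,B,...)-C  ->  (A-C,B-C,...)
    s = re.sub(r'\(([A-Z-]+(?:,[A-Z-]+)*)\)-([A-Z])',
               lambda m: '(' + ','.join(p + '-' + m.group(2)
                                        for p in m.group(1).split(',')) + ')',
               s)
    # prefix case: A-(B,C,...)  ->  (A-B,A-C,...)
    s = re.sub(r'([A-Z])-\(([A-Z-]+(?:,[A-Z-]+)*)\)',
               lambda m: '(' + ','.join(m.group(1) + '-' + p
                                        for p in m.group(2).split(',')) + ')',
               s)
    return s

print distribute('(A,B)-C-D')              # (A-C,B-C)-D
print distribute('A-(B,C,D-(E,F,G,H,I))')  # A-(B,C,(D-E,D-F,D-G,D-H,D-I))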

After posting my comment referring to pyparsing's invRegex example, I looked a little closer at your input, and it looked like you could interpret this as an infix notation, with ',' and '-' as binary operators. Pyparsing has a helper method awkwardly named operatorPrecedence that parses expressions according to a precedence of operators, with grouping in parentheses. (This has a little more smarts to it than just using the nestedExpr helper method, which matches expressions nested within grouping symbols.) So here is a getting-started version of a parser using operatorPrecedence:
data = """\
(A,B)-C-D
A-B-(C,D)
A-(B,C,D-(E,F,G,H,I))""".splitlines()
from pyparsing import alphas, oneOf, operatorPrecedence, opAssoc
node = oneOf(list(alphas))
graphExpr = operatorPrecedence(node,
    [
    ('-', 2, opAssoc.LEFT),
    (',', 2, opAssoc.LEFT),
    ])

for d in data:
    print graphExpr.parseString(d).asList()
Pyparsing actually returns a complex structure of type ParseResults which supports access to the parsed tokens as elements in a list, items in a dict, or attributes in an object. By calling asList, we just get the elements in simple list form.
The output of the above shows that we look to be on the right track:
[[['A', ',', 'B'], '-', 'C', '-', 'D']]
[['A', '-', 'B', '-', ['C', ',', 'D']]]
[['A', '-', ['B', ',', 'C', ',', ['D', '-', ['E', ',', 'F', ',', 'G', ',', 'H', ',', 'I']]]]]
Pyparsing also allows you to attach callbacks or parse actions to individual expressions, to be called at parse time. For instance, this parse action does parse-time conversion to integer:
from pyparsing import Word, nums

def toInt(tokens):
    return int(tokens[0])

integer = Word(nums).setParseAction(toInt)
When the value is returned in the ParseResults, it has already been converted to an integer.
Classes can also be specified as parse actions, and the ParseResults object is passed to the class's __init__ method and the resulting object returned. We can specify parse actions within operatorPrecedence by adding the parse action as a 4th element in each operator's descriptor tuple.
Here is base class for binary operators:
class BinOp(object):
    def __init__(self, tokens):
        self.tokens = tokens

    def __str__(self):
        return self.__class__.__name__ + str(self.tokens[0][::2])

    __repr__ = __str__
From this base class, we can derive two subclasses, one for each operator ('-' and ','):
class Path(BinOp):
    pass

class Branch(BinOp):
    pass
And add them to the operator definition tuples in operatorPrecedence:
node = oneOf(list(alphas))
graphExpr = operatorPrecedence(node,
    [
    ('-', 2, opAssoc.LEFT, Path),
    (',', 2, opAssoc.LEFT, Branch),
    ])

for d in data:
    print graphExpr.parseString(d).asList()
This gives us a nested structure of objects for each input string:
[Path[Branch['A', 'B'], 'C', 'D']]
[Path['A', 'B', Branch['C', 'D']]]
[Path['A', Branch['B', 'C', Path['D', Branch['E', 'F', 'G', 'H', 'I']]]]]
The generation of paths from this structure is left as an exercise for the OP. (The pyparsing regex inverter does this using a tangle of generators - hopefully some simple recursion will be sufficient.)
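For what it's worth, here is a minimal recursive sketch of that exercise, assuming the Path/Branch classes and the graphExpr parser defined above (the paths function is invented here, not part of pyparsing):

def paths(node):
    # A leaf is a plain node letter.
    if not isinstance(node, BinOp):
        return [node]
    operands = node.tokens[0][::2]  # operands sit at the even indices
    if isinstance(node, Path):
        # '-' chains: cross-join the paths of successive operands.
        results = ['']
        for op in operands:
            results = [r + '-' + p if r else p
                       for r in results for p in paths(op)]
        return results
    # Branch (','): the union of the operands' paths.
    return [p for op in operands for p in paths(op)]

for d in data:
    print ','.join(paths(graphExpr.parseString(d)[0]))

For the three sample inputs this prints A-C-D,B-C-D, then A-B-C,A-B-D, then A-B,A-C,A-D-E,A-D-F,A-D-G,A-D-H,A-D-I, matching the path lists in the question.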

Related

joining string from set reverses order

I have the string 'ABBA'. If I turn it into a set, I get this:
In: set('ABBA')
Out: {'A', 'B'}
However, if I join them as a string, it reverses the order:
In: ''.join(set('ABBA'))
Out: 'BA'
The same happens if I try to turn the string into a list:
In: list(set('ABBA'))
Out: ['B', 'A']
Why is this happening and how do I address it?
EDIT
The reason applying sorted doesn't work is that if I make a set out of 'CDA', it will return 'ACD', thus losing the order of the original string. My question pertains to preserving the original order of the set itself.
Sets are unordered collections, i.e., you can get a different order every time you run the command. Sets also hold only unique elements, so there is no repetition among the elements of the set.
If you run set('ABBA') a few times, you will sometimes get {'A', 'B'} and sometimes {'B', 'A'}. The same happens with the join command: the output is sometimes 'BA' and sometimes 'AB'.
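A quick way to observe this, assuming Python 3.3+, where string hash randomization is on by default:

# Run this script twice: the iteration order can differ between runs,
# because str hashes are salted per process (see PYTHONHASHSEED).
print(''.join(set('ABBA')))  # 'AB' in one run, 'BA' in another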
There is an ordered set recipe for this which is referred to from the Python 2 Documentation. This runs on Py2.6 or later and 3.0 or later without any modifications. The interface is almost exactly the same as a normal set, except that initialisation should be done with a list.
OrderedSet([1, 2, 3])
This is a MutableSet, so the signature for .union doesn't match that of set, but since it includes __or__, something similar can easily be added.
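If the goal is just order-preserving de-duplication rather than a full set type, a plain dict gives the same effect. A minimal sketch, assuming Python 3.7+ where dict insertion order is guaranteed (the ordered_unique name is made up here):

def ordered_unique(s):
    # dict keys keep first-seen order, so duplicates collapse in order
    return ''.join(dict.fromkeys(s))

print(ordered_unique('ABBA'))  # 'AB'
print(ordered_unique('CDA'))   # 'CDA'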
b = "AAEEBBCCDD"
a = set(b)#unordered
print(a)#{'B', 'D', 'C', 'A', 'E'}/{'A', 'E', 'B', 'D', 'C'}/,,,
#do not have reverses the order,only random
print(''.join(a))
print(list(a))
print(sorted(a, key=b.index))#Save original sequence(b)

Cleanest way to declare a tuple of one string

Declaring a tuple of one string using the tuple constructor produces one element per character in the string
(Pdb) tuple('VERSION',)
('V', 'E', 'R', 'S', 'I', 'O', 'N')
Declaring a tuple using a trailing comma feels like a side effect, and I feel it is easy to miss.
(Pdb) ('VERSION',)
('VERSION',)
Is there a cleaner way to make a declaration like this?
For context I'm using a tuple of tuples and I'm iterating on all of the individual values. Rather than special case the single values I'm just making them a tuple of one item.
Edit: I see I was unclear about this.
I don't personally like the declaration of ('VERSION',) so I tried
(Pdb) tuple('VERSION',)
('V', 'E', 'R', 'S', 'I', 'O', 'N')
and found that the constructor has this behavior.
I was interested to find that you can wrap the string in a list or tuple and pass that to tuple(), and that works.
(Pdb) tuple(['VERSION'])
('VERSION',)
(Pdb) tuple(('VERSION',))
('VERSION',)
Well, it's really a good question: how should we distinguish plain grouping parentheses from a tuple with one element? Look at the following example:
>>> a = (1)+(2)
>>> a
3
>>> b = (1,)+(2,)
>>> b
(1, 2)
That's the beauty of Python's syntax: the comma may look like an extra thing, but it is what distinguishes the grouping operator () from a tuple. So you should use it when you create a tuple of length one.
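A quick illustration of the difference, and of why the comma in tuple('VERSION',) above does nothing (there it is just a trailing argument comma in a call, not a tuple comma):

print(type((1)))   # <class 'int'>: parentheses alone only group
print(type((1,)))  # <class 'tuple'>: the comma makes the tuple
print(tuple('VERSION') == tuple('VERSION',))  # True: same single argument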
If you want to spell the construction out as a call to tuple, you can write tuple(['foo']) or tuple(('foo',)).
The tuple constructor takes an iterable, and, unless it's already a tuple, makes its elements the tuple's elements. There's no way around that.
The idiomatic way to define a 1-tuple is (foo,). I personally find the ,) a sign which is hard to miss, but tastes vary.
For ultimate clarity, just make your own function:
def tu(value):
    return (value,)
Now you can tu('foo') and get a 1-tuple.

String translate using dict

I want to replace letters in a string by other ones, using a dictionary created with dict, as follows:
import string
trans1 = str.maketrans("abc","cda")
trans = dict(zip("abc","cda"))
out1 = "abcabc".translate(trans1)
out = "abcabc".translate(trans)
print(out1)
print(out)
The desired output is "cdacda"
What I get is
cdacda
abcabc
Now out1 is the desired output, but out is not. I cannot figure out why this is the case. How can I use the dictionary created via dict with the translate function? That is, what do I have to change to make translate work with trans?
str.translate supports dicts perfectly well (in fact, it supports anything that supports indexing, i.e. __getitem__) – it's just that the key has to be the ordinal representation of the character, not the character itself.
Compare:
>>> "abc".translate({"a": "d"})
'abc'
>>> "abc".translate({ord("a"): "d"})
'dbc'
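So a usable table for the question's example can be built either with an ord-keyed dict comprehension, or by handing a character-keyed mapping to str.maketrans, which converts the keys for you. A short sketch:

trans = {ord(k): v for k, v in zip("abc", "cda")}
print("abcabc".translate(trans))  # cdacda

table = str.maketrans({"a": "c", "b": "d", "c": "a"})
print("abcabc".translate(table))  # cdacda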
I do not think the method translate will accept a dictionary object. Additionally, you should look at what you are creating:
>>> dict(zip("abc","cda"))
{'c': 'a', 'a': 'c', 'b': 'd'}
I do not think that is what you wanted. zip pairs off correspondingly indexed elements from the first and second argument.
You could write a work around:
def translate_from_dict(original_text, dictionary_of_translations):
    out = original_text
    for target in dictionary_of_translations:
        trans = str.maketrans(target, dictionary_of_translations[target])
        out = out.translate(trans)
    return out
trans = {"abc":"cda"}
out = translate_from_dict("abcabc",trans)
print(out)
Note the usage of the dict function to create the dictionary; read the function's documentation.
>>> dict([("abc","cda")])
{'abc': 'cda'}
Python 2's string.translate doesn't support dictionaries as arguments:
translate(s, table, deletions='')
translate(s,table [,deletions]) -> string
Return a copy of the string s, where all characters occurring
in the optional argument deletions are removed, and the
remaining characters have been mapped through the given
translation table, which must be a string of length 256. The
deletions argument is not allowed for Unicode strings.
So, you have to write your own function.
Also, revise your code, as it won't run in any Python version that I know of; it raises at least two exceptions.

Fast way of generating combinations with constraints?

I have a generator function that creates a Cartesian product of lists. The real application uses more complex objects, but they can be represented by strings:
import itertools
s1 = ['a', 'b']
s2 = ['c', 'd', 'e', 'f']
s3 = ['c', 'd', 'e', 'f']
s4 = ['g']
p = itertools.product(*[s1,s2,s3,s4])
names = [''.join(s) for s in p]
In this example, the result is 32 combinations of characters:
names
['accg', 'acdg', 'aceg', 'acfg', 'adcg', 'addg', 'adeg', 'adfg', 'aecg',
'aedg', 'aeeg', 'aefg', 'afcg', 'afdg', 'afeg', 'affg', 'bccg', 'bcdg',
'bceg', 'bcfg', 'bdcg', 'bddg', 'bdeg', 'bdfg', 'becg', 'bedg', 'beeg',
'befg', 'bfcg', 'bfdg', 'bfeg', 'bffg']
Now, let's say I have some constraints such that certain character combinations are illegal. For example, let's say that only strings that contain the regex '[ab].c' are allowed. ('a' or 'b' followed by any letter followed by 'c')
After applying these constraints, we are left with a reduced set of just 8 strings:
import re
r = re.compile('[ab].c')
filter(r.match, names)
['accg', 'adcg', 'aecg', 'afcg', 'bccg', 'bdcg', 'becg', 'bfcg']
In the real application the chains are longer, there could be thousands of combinations and applying the hundreds of constraints is fairly computationally intensive so I'm concerned about scalability.
Right now I'm going through every single combination and checking its validity. Does an algorithm/data structure exist that could speed up this process?
EDIT:
Maybe this will clarify a little: In the real application I am assembling random 2D pictures of buildings from simple basic blocks (like pillars, roof segments, windows, etc.). The constraints limit what kind of blocks (and their rotations) can be grouped together, so the resulting random image looks realistic, and not a random jumble.
A given constraint can contain many combinations of patterns. But of all those combinations, many are not valid because a different constraint would prohibit some portion of it. So in my example, one constraint would contain the full Cartesian product of characters above. And a second constraint is the '[ab].c'; this second constraint reduces the number of valid combinations of the first constraint that I need to consider.
Because these constraints are difficult to create, I am looking to visualize what all the combinations of blocks in each constraint look like, but only the valid combinations that pass all constraints. Hence my question. Thanks!
Try just providing the iterator that generates the names directly to filter, like so:
import itertools
import re
s1 = ['a', 'b']
s2 = ['c', 'd', 'e', 'f']
s3 = ['c', 'd', 'e', 'f']
s4 = ['g']
p = itertools.product(*[s1,s2,s3,s4])
r = re.compile('[ab].c')
l = filter(r.search, (''.join(s) for s in p))
print(list(l))
That way, it shouldn't assemble the full set of combinations in memory, it will only keep the ones that fit the criteria. There is probably another way as well.
EDIT:
One of the primary differences from the original, is that instead of:
[''.join(s) for s in p]
Which is a list comprehension, we use:
(''.join(s) for s in p)
Which is a generator.
The important difference here is that a list comprehension creates a list using the designated criteria and generator, while only providing the generator allows the filter to generate values as needed. The important mechanism is lazy evaluation, which really just boils down to only evaluating expressions as their values become necessary.
By switching from a list to a generator, Rob's answer saves space but not time (at least, not asymptotically). You've asked a very broad question about how to enumerate all solutions to what is essentially a constraint satisfaction problem. The big wins are going to come from enforcing local consistency, but it's difficult to give you advice without specific knowledge of your constraints.
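To make the pruning idea concrete, here is a hypothetical sketch: a depth-first product that asks a caller-supplied viability test before extending a partial prefix, so entire subtrees of the Cartesian product are skipped instead of generated and filtered. The constrained_product and viable names are invented for the example.

import re

def constrained_product(pools, is_viable):
    # Depth-first walk of the product; prune any partial prefix that
    # the viability test rejects, skipping its whole subtree.
    def extend(prefix, remaining):
        if not remaining:
            yield prefix
            return
        for item in remaining[0]:
            candidate = prefix + item
            if is_viable(candidate):
                for result in extend(candidate, remaining[1:]):
                    yield result
    return extend('', pools)

# Viability for the '[ab].c' constraint: prefixes shorter than 3
# characters cannot violate it yet; longer ones must already match.
r = re.compile('[ab].c')
def viable(prefix):
    return len(prefix) < 3 or r.match(prefix) is not None

s1, s2, s3, s4 = ['a', 'b'], list('cdef'), list('cdef'), ['g']
print(list(constrained_product([s1, s2, s3, s4], viable)))
# ['accg', 'adcg', 'aecg', 'afcg', 'bccg', 'bdcg', 'becg', 'bfcg']

Whether this beats plain filtering depends on how early the constraints can reject a prefix; a test that can only judge complete strings gains nothing.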

difflib with more than two file names

I have several file names that I am trying to compare. Here are some examples:
files = ['FilePrefix10.jpg', 'FilePrefix11.jpg', 'FilePrefix21.jpg', 'FilePrefixOoufhgonstdobgfohj#lwghkoph[]**^.jpg']
What I need to do is extract "FilePrefix" from each file name, which changes depending on the directory. I have several folders containing many jpg's. Within each folder, each jpg has a FilePrefix in common with every other jpg in that directory. I need the variable portion of the jpg file name. I am unable to predict what FilePrefix is going to be ahead of time.
I had the idea to just compare two file names using difflib (in Python) and extract FilePrefix (and subsequently the variable portion) that way. I've run into the following issue:
>>>> comp1 = SequenceMatcher(None, files[0], files[1])
>>>> comp1.get_matching_blocks()
[Match(a=0, b=0, size=11), Match(a=12, b=12, size=4), Match(a=16, b=16, size=0)]
>>>> comp1 = SequenceMatcher(None, files[1], files[2])
>>>> comp1.get_matching_blocks()
[Match(a=0, b=0, size=10), Match(a=11, b=11, size=5), Match(a=16, b=16, size=0)]
As you can see, the first size does not match up. It's confusing the ten's and digit's place, making it hard for me to match a difference between more than two files. Is there a correct way to find a minimum size among all files within the directory? Or alternatively, is there a better way to extract FilePrefix?
Thank you.
It's not that it's "confusing the ten's and digit's place", it's that in the first matchup the ten's place isn't different, so it's considered part of the matching prefix.
For your use case, there seems to be a pretty easy solution to this ambiguity: just match all adjacent pairs, and take the minimum. Like this:
from difflib import SequenceMatcher

def prefix(x, y):
    comp = SequenceMatcher(None, x, y)
    matches = comp.get_matching_blocks()
    prefix_match = matches[0]
    prefix_size = prefix_match[2]
    return prefix_size

pairs = zip(files, files[1:])
matches = (prefix(x, y) for x, y in pairs)
prefixlen = min(matches)
prefix = files[0][:prefixlen]
The prefix function is pretty straightforward, except for one thing: I used [2] instead of .size because there's an annoying bug in 2.7's difflib where the second call to get_matching_blocks may return a plain tuple instead of a namedtuple. That won't affect the code as-is, but if you add some debugging prints it will break.
Now, pairs is a sequence of all adjacent pairs of names, created by zipping together files and files[1:]. (If this isn't clear, try print(list(zip(files, files[1:]))). On Python 2, zip returns a printable list directly; on 3.x it returns a lazy iterator, hence the list call.)
Now we just want to call prefix on each of the pairs, and take the smallest value we get back. That's what min is for. (I'm passing it a generator expression, which can be a tricky concept at first—but if you just think of it as a list comprehension that doesn't build the list, it's pretty simple.)
You could obviously compact this into two or three lines while still leaving it readable:
prefixlen = min(SequenceMatcher(None, x, y).get_matching_blocks()[0][2]
                for x, y in zip(files, files[1:]))
prefix = files[0][:prefixlen]
However, it's worth considering that SequenceMatcher is probably overkill here. It's looking for the longest matches anywhere, not just the longest prefix matches, which means it's essentially O(N^3) on the length of the strings, when it only needs to be O(NM) where M is the length of the result. Plus, it's not inconceivable that there could be, say, a suffix that's longer than the longest prefix, so it would return the wrong result.
So, why not just do it manually?
def prefixes(name):
    while name:
        yield name
        name = name[:-1]

def maxprefix(names):
    first, names = names[0], names[1:]
    for prefix in prefixes(first):
        if all(name.startswith(prefix) for name in names):
            return prefix
prefixes(first) just gives you 'FilePrefix10.jpg', 'FilePrefix10.jp', 'FilePrefix10.j', etc., down to 'F'. So we just loop over those, checking whether each one is also a prefix of all of the other names, and return the first one that is.
And you can do this even faster by thinking character by character instead of prefix by prefix:
def maxprefix(names):
    for i, letters in enumerate(zip(*names)):
        if len(set(letters)) > 1:
            return names[0][:i]
Here, we're just checking whether the first character is the same in all names, then whether the second character is the same in all names, and so on. Once we find one where that fails, the prefix is all characters up to that (from any of the names).
The zip reorganizes the list of names into a list of tuples, where the first one is the first character of each name, the second is the second character of each name, and so on. That is, [('F', 'F', 'F', 'F'), ('i', 'i', 'i', 'i'), …].
The enumerate just gives us the index along with the value. So, instead of getting ('F', 'F', 'F', 'F'), you get 0, ('F', 'F', 'F', 'F'). We need that index for the last step.
Now, to check that ('F', 'F', 'F', 'F') are all the same, I just put them in a set. If they're all the same, the set will have just one element—{'F'}, then {'i'}, etc. If they're not, it'll have multiple elements—{'1', '2'}—and that's how we know we've gone past the prefix.
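Incidentally, the standard library already ships this exact character-by-character operation as os.path.commonprefix (despite the module it lives in, it is a plain string operation and does no path handling):

import os.path
files = ['FilePrefix10.jpg', 'FilePrefix11.jpg', 'FilePrefix21.jpg']
print(os.path.commonprefix(files))  # 'FilePrefix'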
The only way to be certain is to check ALL the filenames. So just iterate through them all, checking against the kept maximum matching string as you go.
You might try something like this:
files = ['FilePrefix10.jpg',
         'FilePrefix11.jpg',
         'FilePrefix21.jpg',
         'FilePrefixOoufhgonstdobgfohj#lwghkoph[]**^.jpg',
         'FileProtector354.jpg'
        ]

prefix = files[0]
max = 0
for f in files:
    for c in range(0, len(prefix) + 1):
        if prefix[:c] != f[:c]:
            prefix = f[:c - 1]
            max = c - 1
            break  # stop at the first mismatch; without this break the
                    # prefix can grow back as c keeps increasing
print prefix, max
Please pardon the 'un-Pythonicness' of the solution, but I wanted the algorithm to be obvious to any level programmer.
