difflib with more than two file names - python

I have several file names that I am trying to compare. Here are some examples:
files = ['FilePrefix10.jpg', 'FilePrefix11.jpg', 'FilePrefix21.jpg', 'FilePrefixOoufhgonstdobgfohj#lwghkoph[]**^.jpg']
What I need to do is extract "FilePrefix" from each file name, which changes depending on the directory. I have several folders containing many jpg's. Within each folder, each jpg has a FilePrefix in common with every other jpg in that directory. I need the variable portion of the jpg file name. I am unable to predict what FilePrefix is going to be ahead of time.
I had the idea to just compare two file names using difflib (in Python) and extract FilePrefix (and subsequently the variable portion) that way. I've run into the following issue:
>>> comp1 = SequenceMatcher(None, files[0], files[1])
>>> comp1.get_matching_blocks()
[Match(a=0, b=0, size=11), Match(a=12, b=12, size=4), Match(a=16, b=16, size=0)]
>>> comp1 = SequenceMatcher(None, files[1], files[2])
>>> comp1.get_matching_blocks()
[Match(a=0, b=0, size=10), Match(a=11, b=11, size=5), Match(a=16, b=16, size=0)]
As you can see, the first size does not match up. It's confusing the ten's and digit's place, making it hard for me to match a difference between more than two files. Is there a correct way to find a minimum size among all files within the directory? Or alternatively, is there a better way to extract FilePrefix?
Thank you.

It's not that it's "confusing the ten's and digit's place", it's that in the first matchup the ten's place isn't different, so it's considered part of the matching prefix.
For your use case, there seems to be a pretty easy solution to this ambiguity: just match all adjacent pairs, and take the minimum. Like this:
from difflib import SequenceMatcher
def prefix(x, y):
    # Size of the first matching block between the two names
    comp = SequenceMatcher(None, x, y)
    matches = comp.get_matching_blocks()
    prefix_match = matches[0]
    prefix_size = prefix_match[2]
    return prefix_size
pairs = zip(files, files[1:])
matches = (prefix(x, y) for x, y in pairs)
prefixlen = min(matches)
prefix = files[0][:prefixlen]
The prefix function is pretty straightforward, except for one thing: I used [2] instead of .size because there's an annoying bug in 2.7 difflib where the second call to get_matching_blocks may return a tuple instead of a namedtuple. This won't affect the code as-is, but if you add some debugging prints it will break.
Now, pairs is a list of all adjacent pairs of names, created by zipping together files and files[1:]. (If this isn't clear, try print(zip(files, files[1:])). If you're using Python 3.x, you'll need print(list(zip(files, files[1:]))) instead, because zip returns a lazy iterator instead of a printable list.)
Now we just want to call prefix on each of the pairs, and take the smallest value we get back. That's what min is for. (I'm passing it a generator expression, which can be a tricky concept at first—but if you just think of it as a list comprehension that doesn't build the list, it's pretty simple.)
You could obviously compact this into two or three lines while still leaving it readable:
prefixlen = min(SequenceMatcher(None, x, y).get_matching_blocks()[0][2]
                for x, y in zip(files, files[1:]))
prefix = files[0][:prefixlen]
However, it's worth considering that SequenceMatcher is probably overkill here. It's looking for the longest matches anywhere, not just the longest prefix matches, so it does far more work than necessary (the difflib documentation gives its worst case as quadratic in the length of the strings), when a prefix check only needs to look at the first M characters of each name, where M is the length of the result. Plus, it's not inconceivable that there could be, say, a suffix that's longer than the longest prefix, so it would return the wrong result.
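One concrete way the first-block approach can mislead, as a minimal sketch: the first matching block is not guaranteed to start at index 0 of both strings. The helper name prefix_len and the short sample strings are made up for illustration; the guard on the block's start positions is one possible fix, not part of the original answer.

```python
from difflib import SequenceMatcher

def prefix_len(x, y):
    """Length of the shared prefix of x and y, or 0 if the first block is not a prefix."""
    first = SequenceMatcher(None, x, y).get_matching_blocks()[0]
    a, b, size = first[0], first[1], first[2]
    # Only a block that starts at index 0 in both strings is a real prefix.
    return size if a == 0 and b == 0 else 0

# 'abc' and 'xbc' share 'bc', but that is a suffix, not a prefix.
first_block = tuple(SequenceMatcher(None, 'abc', 'xbc').get_matching_blocks()[0])
print(first_block)
print(prefix_len('abc', 'xbc'))
print(prefix_len('FilePrefix10.jpg', 'FilePrefix21.jpg'))
```

Without the guard, taking [0][2] on 'abc' and 'xbc' would report a prefix of length 2 even though the names differ in their first character.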
So, why not just do it manually?
def prefixes(name):
    while name:
        yield name
        name = name[:-1]
def maxprefix(names):
    first, names = names[0], names[1:]
    for prefix in prefixes(first):
        if all(name.startswith(prefix) for name in names):
            return prefix
prefixes(first) just gives you 'FilePrefix10.jpg', 'FilePrefix10.jp', 'FilePrefix10.j', etc. down to 'F'. So we just loop over those, checking whether each one is also a prefix of all of the other names, and return the first one that is.
And you can do this even faster by thinking character by character instead of prefix by prefix:
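def maxprefix(names):
    for i, letters in enumerate(zip(*names)):
        if len(set(letters)) > 1:
            return names[0][:i]
    # zip stops at the shortest name; if nothing differed up to there,
    # the shortest name itself is the common prefix
    return min(names, key=len)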
Here, we're just checking whether the first character is the same in all names, then whether the second character is the same in all names, and so on. Once we find one where that fails, the prefix is all characters up to that (from any of the names).
The zip reorganizes the list of names into a list of tuples, where the first one is the first character of each name, the second is the second character of each name, and so on. That is, [('F', 'F', 'F', 'F'), ('i', 'i', 'i', 'i'), …].
The enumerate just gives us the index along with the value. So, instead of getting ('F', 'F', 'F', 'F') you get 0, ('F', 'F', 'F', 'F'). We need that index for the last step.
Now, to check that ('F', 'F', 'F', 'F') are all the same, I just put them in a set. If they're all the same, the set will have just one element—{'F'}, then {'i'}, etc. If they're not, it'll have multiple elements—{'1', '2'}—and that's how we know we've gone past the prefix.
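For comparison, the standard library already ships a character-by-character common-prefix helper, os.path.commonprefix. The sketch below applies it to the file names from the question; the variable_parts name and the choice to strip the extension with os.path.splitext are assumptions about what "the variable portion" should look like.

```python
import os.path

files = ['FilePrefix10.jpg', 'FilePrefix11.jpg', 'FilePrefix21.jpg',
         'FilePrefixOoufhgonstdobgfohj#lwghkoph[]**^.jpg']

# commonprefix compares character by character (not path component by component).
prefix = os.path.commonprefix(files)

# Variable portion: drop the shared prefix, then the extension.
variable_parts = [os.path.splitext(name[len(prefix):])[0] for name in files]

print(prefix)
print(variable_parts)
```

Because the comparison is per character, a directory whose names all happen to start with the same digit after the real prefix would have that digit absorbed into the result, just as with the hand-written versions.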

The only way to be certain is to check ALL the filenames. So just iterate through them all, checking against the kept maximum matching string as you go.
You might try something like this:
files = ['FilePrefix10.jpg',
         'FilePrefix11.jpg',
         'FilePrefix21.jpg',
         'FilePrefixOoufhgonstdobgfohj#lwghkoph[]**^.jpg',
         'FileProtector354.jpg'
         ]
prefix = files[0]
for f in files:
    # Shrink prefix to the part it shares with f
    for c in range(len(prefix) + 1):
        if prefix[:c] != f[:c]:
            prefix = f[:c - 1]
            break  # stop at the first mismatch
maxlen = len(prefix)
print prefix, maxlen
Please pardon the 'un-Pythonicness' of the solution, but I wanted the algorithm to be obvious to any level programmer.

Related

Generating expressions from permutations of variables and operators

So, I've decided that it's time to learn regular expressions. Thus, I set out to solve various problems, and after a bit of smooth sailing, I seem to have hit a wall and need help getting unstuck.
The task:
Given a list of characters and logical operators, find all possible combinations of these characters and operators that are not gibberish.
For example, given:
my_list = ['p', 'q', '&', '|']
the output would be:
answers = ['p', 'q', 'p&q', 'p|q'...]
However, strings like 'pq&' and 'p&|' are gibberish and therefore not allowed.
Naturally, as more elements are added to my_list, the more complicated the process becomes.
My current approach:
(I'd like to learn how to solve it with regex, but I am also curious if there exists a better way, too... but again, my focus is regex)
step 1:
find all permutations of the elements such that each permutation is 3 <= x <= len(my_list) long.
step 2:
Loop over the list, and if a regex match is found, pull that element out and put it in the answers list.
(I'm not married to this 2-step approach, it is just what seemed most logical to me)
My current code, minus the regex:
import re
from itertools import permutations
my_list = ['p', 'q', '~r', 'r', '|', '&']
foo = []
answers = []
count = 3
while count < 7:
    for i in permutations(my_list, count):
        i = ''.join(k for k in i)
        foo.append(i)
    count +=1
for i in foo:
    if re.match(r'insert_regex', i):
        answers.append(i)
    else:
        None
print answers
Now, I have tried a vast slew of different regex's to get this to work (too many to list them all here) but some of the main ones are:
A straightforward approach by finding all the cases that have two letters side by side, or two operators side by side, then instead of appending 'answers', I just removed them from 'foo'. This is the regex I tried:
r'(\w\w)[&\|]{2,}'
and did not even come close.
I then decided to try and find the strings that I wanted, as opposed to the ones I did not want.
First I tested:
r'^[~\w]'
to make sure I could get the strings whose first character were a letter or a negation. This worked. I was happy.
I then tried:
r'^[~\w][&\|]'
to try and get the next logical operator; however, it only picked up strings whose first character was a letter, and ignored all of the strings whose first character was a negation.
I then tried a conditional so that if the first character was a negation, the next character would be a letter, otherwise it would be an operator:
r'^(?(~)\w|[&\|])'
but this threw me "error: bad character in group name".
I then tried to resolve this error by:
r'^(?:(~)\w|[&\|])'
But that returned only strings that started with '~' or an operator.
I then tried a slew of other things related to conditionals and groupings (2 days worth, actually), but I can't seem to find a solution. Part of the problem is that I don't know enough about regex to know where to go to find the solution, so I have kind of been wandering around the internet aimlessly.
I have run through a lot of tutorials and explanation pages, but they are all rather opaque and don't piece things together in a way that is conducive to understanding... they just sort of throw out code for you to copy and paste or mimic.
Any insights you have would be much appreciated, and as much as I would love an answer to the problem, if possible, an ELI5 explanation of what the solution does would be excellent for my own progress.
In a bitter twist of irony, it turns out that I had the solution written down (I documented all the regexes I tried), but it originally failed because I was removing strings from the original list while looping over it, instead of from a copy.
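To illustrate why removing from a list while looping over it goes wrong, here is a minimal sketch. The sample strings and the bad set are made up for the demonstration and stand in for whatever the regex would reject.

```python
items = ['p&q', 'pq&', 'p&|', 'p|q']
bad = {'pq&', 'p&|'}

# Buggy: removing from the list being iterated shifts later items left,
# so the element right after each removal is never visited.
buggy = list(items)
for s in buggy:
    if s in bad:
        buggy.remove(s)

# Safer: build a new list instead of mutating the one being looped over.
kept = [s for s in items if s not in bad]

print(buggy)
print(kept)
```

In the buggy loop 'p&|' survives because it slides into the slot that was just visited; the list comprehension has no such problem.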
If anyone is looking for a solution to the problem, the following code worked on all of my test cases (can't promise beyond that, however).
import re
from itertools import permutations
import copy
a = ['p', 'q', 'r', '~r', '|', '&']
foo = []
count = 3
while count < len(a)+1:
    for j in permutations(a, count):
        j = ''.join(k for k in j)
        foo.append(j)
    count +=1
foo_copy = copy.copy(foo)
for i in foo:
    if re.search(r'(^[&\|])|(\w\w)|(\w~)|([&\|][&\|])|([&\|]$)', i):
        foo_copy.remove(i)
    else:
        None
print foo_copy
You have a list of variables (characters), binary operators, and/or variables prefixed with a unary operator (like ~). The last case can be dealt with just like a variable.
As binary operators need a variable at either side, we can conclude that a valid expression is an alternation of variables and operators, starting and ending with a variable.
So, you could first divide the input list into two lists based on whether an item is a variable or an operator. Then you could increase the size of the output you will generate, and for each size, get the permutations of both lists and zip these in order to build a valid expression each time. This way you don't need a regular expression to verify the validity.
Here is the suggested function:
from itertools import permutations, zip_longest, chain
def expressions(my_list):
    answers = []
    variables = [x for x in my_list if x[-1].isalpha()]
    operators = [x for x in my_list if not x[-1].isalpha()]
    max_var_count = min(len(operators) + 1, len(variables))
    for var_count in range(1, max_var_count+1):
        for vars in permutations(variables, var_count):
            for ops in permutations(operators, var_count-1):
                answers.append(''.join(list(chain.from_iterable(zip_longest(vars, ops)))[:-1]))
    return answers
print(expressions(['p', 'q', '~r', 'r', '|', '&']))
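Since the question asked for a regex, the "alternation of variables and operators, starting and ending with a variable" rule can also be written as one positive pattern. This is only a sketch: the VALID pattern and the is_valid helper are hypothetical names, and it assumes variables are single lowercase letters with an optional leading ~. It checks shape only, not whether a symbol is used more often than the input list allows.

```python
import re

# An optionally negated letter, then zero or more
# (binary operator, optionally negated letter) pairs.
VALID = re.compile(r'~?[a-z](?:[&|]~?[a-z])*')

def is_valid(expr):
    # fullmatch requires the whole string to fit the pattern.
    return VALID.fullmatch(expr) is not None

for candidate in ['p', 'p&q', '~r|p&q', 'pq&', 'p&|', '&p', 'p~r']:
    print(candidate, is_valid(candidate))
```

Read left to right, the pattern mirrors the rule: one variable, then as many "operator plus variable" steps as you like, so two variables or two operators can never sit side by side.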

Python list sort creating different orders to expected output

I have a script to move items around and perform some basic functions on them. It relies on list.sort() to make sure the files are going to the right places.
For example I have 11 files:
A1_S1_ETC.ext
A2_S2_ETC.ext
...
...
A10_S10_ETC.ext
A11_S11_ETC.ext
The script asks for a path and output, from this I create two sorted lists using os and glob:
import glob
import os
pathA = raw_input()
listA = list(glob.glob(os.path.join(pathA, '*.ext')))
listA.sort()
outp = raw_input()
filen = [x.split(pathA)[1].split('_')[0] for x in listA]
filen.sort()
outp1 = [pathA + s + '/' for s in filen]
outp1.sort()
But when printed:
print listA
['A10_S10_ETC.ext', 'A11_S11_ETC.ext', 'A1_S1_ETC.ext', 'A2_S2_ETC.ext']
print outp1
['/user/path/A1/', '/user/path/A10/', '/user/path/A11/', '/user/path/A2/']
I guess it's the '_SXX' part in the file name that's impacting the sort function? I don't care how it's sorted, as long as A1 files go into A1 directory - not just for this nomenclature but for any possible string.
Is there a way to do this - perhaps by asking the list.sort function to sort until the first underscore?
Sorting strings in python is a lexicographical sort. The strings are compared lexicographically. So 'A10' and 'A11' come before 'A1_'.
You can get the expected behaviour using:
lst.sort(key=lambda x: int(x.split('_')[0][1:]))
What happens is that the sort is lexicographic, ordering characters by their ASCII codes. Here the ASCII code for '0' is 48 while the ASCII code for '_' is 95, which means that '0' < '_'.
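A quick way to see this for yourself, as a small sketch using three of the file names from the question:

```python
names = ['A1_S1_ETC.ext', 'A2_S2_ETC.ext', 'A10_S10_ETC.ext']

# Character codes decide the order: '0' sorts before '_'.
print(ord('0'), ord('_'))
print('A10' < 'A1_')

# So the plain sort puts A10 ahead of A1.
plain = sorted(names)
print(plain)
```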
What you can do to get consistency is to supply a consistent comparison function. For example:
def mycmp(s1, s2):
    s1 = s1.split(pathA)[1].split('_')[0]
    s2 = s2.split(pathA)[1].split('_')[0]
    return cmp(s1, s2)
outp1.sort(cmp=mycmp)
Here the thing is that you use the same transformation before comparing the strings.
One caveat: because the transformation strips away information, it may strip away too much to keep the elements distinct. In your case that would mean two elements of outp1 become the same anyway, so it doesn't matter here.
Otherwise you would have to apply the sort before you transform the names, which would mean not sorting filen or outp1 (because then their order would rely on the order of listA).
What you want is called natural sort. See this thread about it: Does Python have a built in function for string natural sort?
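A minimal sketch of such a natural sort key, assuming the only goal is that runs of digits compare as numbers. The natural_key name is made up here and this is one common recipe, not a built-in.

```python
import re

def natural_key(text):
    # Split into alternating non-digit and digit runs; digit runs become ints
    # so that 2 sorts before 10.
    return [int(tok) if tok.isdigit() else tok
            for tok in re.split(r'(\d+)', text)]

names = ['A10_S10_ETC.ext', 'A11_S11_ETC.ext', 'A1_S1_ETC.ext', 'A2_S2_ETC.ext']
ordered = sorted(names, key=natural_key)
print(ordered)
```

Applying the same key to both the file list and the directory list keeps the two in matching order, which is what the question actually needs.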

Mutating a List in Python

I have a list of the form
['A', 'B', 'C', 'D']
which I want to mutate into:
[('Option1','A'), ('Option2','B'), ('Option3','C'), ('Option4','D')]
I can iterate over the original list and mutate successfully, but the closest that I can come to what I want is this:
["('Option1','A')", "('Option2','B')", "('Option3','C')", "('Option4','D')"]
I need the single quotes but don't want the double quotes around each list.
[EDIT] - here is the code that I used to generate the list; although I've tried many variations. Clearly, I've turned 'element' into a string--obviously, I'm not thinking about it the right way here.
array = ['A', 'B', 'C', 'D']
listOption = 0
finalArray = []
for a in array:
    listOption += 1
    element = "('Option" + str(listOption) + "','" + a + "')"
    finalArray.append(element)
Any help would be most appreciated.
[EDIT] - a question was asked (rightly) why I need it this way. The final array will be fed to an application (Indigo home control server) to populate a drop-down list in a config dialog.
[('Option{}'.format(i+1),item) for i,item in enumerate(['A','B','C','D'])]
# EDIT FOR PYTHON 2.5
[('Option%s' % (i+1), item) for i,item in enumerate(['A','B','C','D'])]
This is how I'd do it, but honestly I'd probably try not to do this and instead want to know why I NEEDED to do this. Any time you're making a variable with a number in it (or in this case a tuple with one element of data and one element naming the data BY NUMBER) think instead how you could organize your consuming code to not need that instead.
For instance: when I started coding professionally the company I work for had an issue with files not being purged on time at a few of our locations. Not all the files, mind you, just a few. In order to provide our software developer with the information to resolve the problem, we needed a list of files from which sites the purge process was failing on.
Because I was still wet behind the ears, instead of doing something SANE like making a dictionary with keys of the files and values of the sizes, I used locals() to create new variables WITH MEANING. Don't do this -- your variables should mean nothing to anyone but future coders. Basically I had a whole bunch of variables named "J_ITEM" and "J_INV" and etc with a value 25009 and etc, one for each file, then I grouped them all together with [item for item in locals() if item.startswith("J_")]. THAT'S INSANITY! Don't do this, build a saner data structure instead.
That said, I'm interested in how you put it all together. Do you mind sharing your code by editing your question? Maybe we can work together on a better solution than this hackjob.
x = ['A','B','C','D']
option = 1
answer = []
for element in x:
    t = ('Option'+str(option),element) #Creating the tuple
    answer.append(t)
    option+=1
print answer
A tuple is different from a string, in that a tuple is an immutable list. You define it by writing:
t = (something, something_else)
You probably defined t to be a string "(something, something_else)" which is indicated by the quotations surrounding the expression.
In addition to adsmith's great answer, I would add the map way:
>>> map(lambda (index, item): ('Option{}'.format(index+1),item), enumerate(['a','b','c', 'd']))
[('Option1', 'a'), ('Option2', 'b'), ('Option3', 'c'), ('Option4', 'd')]
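The tuple-unpacking lambda above only works on Python 2. As a sketch that runs on Python 3 as well, enumerate can start counting at 1, which removes the need for the +1; the letters, options and as_text names are just for illustration.

```python
letters = ['A', 'B', 'C', 'D']

# enumerate(..., start=1) yields (1, 'A'), (2, 'B'), ...
options = [('Option{}'.format(n), item) for n, item in enumerate(letters, start=1)]
print(options)

# A real tuple versus a string that merely looks like one.
as_text = "('Option1','A')"
print(type(options[0]).__name__, type(as_text).__name__)
```

The double quotes in the question's output were Python's way of showing that each element was a str, not a tuple.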

Combining two sets and splitting them in pairs

I need to take two sets of data and produce one set of pairs (tuples) from both sets. This result set will only have one possible pair, i.e. for two sets: 1,2 and 3, 4 the result should be: ((1, 3), (2, 4)). Full exercise text can be found here: http://pastebin.com/mUaKV4G7
I need to do this using pop. Here's what I have so far:
def mating_pairs(males, females):
    pairs = set()
    tmp_males, tmp_females = males.copy(), females.copy()
    for male in tmp_males:
        for female in tmp_females:
            pairs.add(males.pop())
            pairs.add(females.pop())
    zip(pairs[::2], pairs[1::2])
    return pairs
This function works fine up to the point when it reaches:
zip(pairs[::2], pairs[1::2])
Without it, given two sets, the function combines them, but when I try to use zip to split them into pairs I get this error:
'set' object is not subscriptable
Which leads me to believe that it's somewhere returning None instead of correct result.
This function needs to work with both integers and strings (I don't think it needs to pair values in a specific order); also, both sets will have an equal number of values.
Can someone advise what I'm doing wrong?
The error tells you what is going on: pairs is a set and the expression pairs[::2] means "every 2nd element of the set". The problem is that sets have no defined order, so "every 2nd element of the set" makes no sense. As the order of elements is undefined, Python raises an exception instead of making up a random order.
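A small sketch that reproduces the error and shows that slicing only becomes possible once the set is copied into a list. The sample values are made up, and the order of as_list is whatever the set happens to yield.

```python
pairs = {1, 2, 3, 4}

try:
    pairs[::2]            # sets support neither indexing nor slicing
except TypeError as exc:
    message = str(exc)
    print(message)

# A list has positions, so slicing works (in arbitrary set order).
as_list = list(pairs)
halves = list(zip(as_list[::2], as_list[1::2]))
print(len(halves))
```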
What you probably wanted to do is either to pair up males and females in the order they appear:
def mating_pairs(males, females):
    return zip(males, females)
or all possible pairs of males and females (the product of both lists):
from itertools import product
def mating_pairs(males, females):
    return product(males, females)
Your homework seems to be to implement either zip or product :-)
If you must use pop, try popping both males and females at once, into a tuple that you add to pairs (and, by the way, I'm not sure why you make copies of your sets and destroy the originals, but I suppose you have your reasons). Also, iterating both males and females will fail to give you the answer you're looking for - rather, check the emptiness of each set as you pop from it. What you're looking for is more like this:
def mating_pairs(males, females):
    pairs = set()
    tmp_males, tmp_females = males.copy(), females.copy()
    while tmp_males and tmp_females:
        pairs.add((tmp_males.pop(),tmp_females.pop()))
    return pairs
though this would be a touch simpler if you can avoid using set.pop:
def mating_pairs(males, females):
    return set(zip(males,females))
Also, please note that this can't be a complete answer unless you are using some sort of ordered set datatype. As it is using a set, you're not guaranteed to preserve any order of the males and females that were passed in.
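If a repeatable pairing matters, one option is to impose an order yourself before zipping. This is a sketch under the assumption that the elements within each set can be compared with each other (all ints, or all strings); the mating_pairs_sorted name is hypothetical and it does not use pop.

```python
def mating_pairs_sorted(males, females):
    # sorted() turns each set into a list with a well-defined order,
    # so the same inputs always give the same pairs.
    return set(zip(sorted(males), sorted(females)))

result = mating_pairs_sorted({3, 1, 2}, {'c', 'a', 'b'})
print(result)
```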
If using pop is a forced restriction, I'd go with something like this:
def mating_pairs(males, females):
    res = set()
    males_copy, females_copy = males.copy(), females.copy()
    while males_copy and females_copy:
        res.add((males_copy.pop(), females_copy.pop()))
    return res
print mating_pairs(set([1, 2, 3]), set(['a', 'b', 'c', 'd']))
# => e.g. set([(1, 'a'), (2, 'b'), (3, 'c')]); the exact pairing depends on set ordering
But set(zip(males, females)) is much easier anyway.
I’m a bit unsure what you are trying to do. But analyzing your posted code, I guess you want to combine every male with every female once, and put tuples of each combination into a final list.
If that is the case, then what you are trying to do is creating a Cartesian product. The Python library has a nice function for this: itertools.product. You can use it like this:
import itertools
def mating_pairs(males, females):
    return set(itertools.product(males, females))
If you want to do it the manual way, then you can use two nested for loops to get all combinations. However the way you did it, by utilizing pop won’t work. What you did in your code is that you iterated over all males and females (after copying the parameters) and then you pop the items from the original sets. That way, very quickly both sets males and females will be empty, as you keep popping from them for every possible combination, without considering that you only get all combinations if you keep reusing individual items.
You could fix your code like this, without using pop:
def mating_pairs(males, females):
    pairs = set()
    # We only iterate over the items, so we don’t modify the original sets.
    for male in males:
        for female in females:
            # And instead of adding individual items one at a time, and zipping
            # them later, we just directly add tuples to the set.
            pairs.add((male, female))
    return pairs

How can I sort a complicated dictionary key

I have these really complicated data files that I have processed and as each file is processed I have used an orderedDictionary to capture the keys and values. Each orderedDictionary is appended to a list so my final result is a list of dictionaries. Because of the diversity in the data captured in these files, they have many keys in common but there are enough uncommon keys to make exporting the data to Excel more complicated than I was hoping for because I really need to push out the data in a consistent structure.
Each key has the structure like
Q_#_SUB_A_COLUMN_#_NUMB_#
so for example I have
Q_123_SUB_D_COLUMN_C_NUMB_17
We can translate the key as follows
Question 123
SubItem D
Column C
Instance 17
Because there is a SubItem D, column C and instance 17 there must be a SubItemA, Column B and Instance 16
However, one of the source files might be populated with data values (and keys) that range up to the example above, while some other source file might terminate with
Q_123_SUB_D_COLUMN_C_NUMB_13
So when I iterate through the list of dictionaries to pull all of the unique key instances, so that I can use them in csv.DictWriter as the column headings, my plan was to sort the resulting list of unique column headings. But I can't seem to make the sort work.
Specifically, I need it to sort so that the results look like
Q_122_SUB_A_COLUMN_C_NUMB_1
Q_122_SUB_B_COLUMN_C_NUMB_1
Q_123_SUB_A_COLUMN_C_NUMB_1
Q_123_SUB_B_COLUMN_C_NUMB_1
Q_123_SUB_C_COLUMN_C_NUMB_1
Q_123_SUB_D_COLUMN_C_NUMB_1
...
Q_123_SUB_A_COLUMN_C_NUMB_17
Q_123_SUB_B_COLUMN_C_NUMB_17
Q_123_SUB_C_COLUMN_C_NUMB_17
Q_123_SUB_D_COLUMN_C_NUMB_17
The big issue is that I do not know before I open any particular set of these files how many questions are answered, how many sub-questions are answered, how many columns are associated with each question or sub-question or how many instances exist of any particular combination of questions, sub-questions or columns, and I don't want to. Using Python I was able to reduce over 1,200 lines of SAS code to 95 but this last little bit before I start writing it out to a CSV file I can't seem to figure out.
Any observations would be appreciated.
My plan is to find all of the unique keys by iterating through the list of dictionaries and then sort these keys correctly so I can then create a csv file using the keys as column headings. I know that I can find the unique keys push that out and manually sort it and then read the sorted file back but that seems clumsy.
Just supply a sufficiently clever function as the key when sorting.
>>> (lambda x: tuple(y(z) for (y, z)
...     in zip((int, str, str, int),
...            x.split('_')[1::2])))('Q_122_SUB_A_COLUMN_C_NUMB_1')
(122, 'A', 'C', 1)
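Spelled out as a named function and plugged into sorted, the same idea might look like the sketch below. The sort_key name and the sample names are made up, and the priority (instance, then question, then sub-item, then column) is an assumption inferred from the desired output in the question, so reorder the tuple if that guess is wrong.

```python
def sort_key(name):
    # 'Q_123_SUB_D_COLUMN_C_NUMB_17' -> ['123', 'D', 'C', '17']
    question, sub, column, instance = name.split('_')[1::2]
    # Numeric parts become ints so that 2 sorts before 17.
    return (int(instance), int(question), sub, column)

names = ['Q_123_SUB_A_COLUMN_C_NUMB_17',
         'Q_123_SUB_B_COLUMN_C_NUMB_2',
         'Q_122_SUB_B_COLUMN_C_NUMB_1',
         'Q_123_SUB_A_COLUMN_C_NUMB_1',
         'Q_122_SUB_A_COLUMN_C_NUMB_1']

ordered = sorted(names, key=sort_key)
for name in ordered:
    print(name)
```

The resulting list can be passed as the fieldnames for csv.DictWriter so every file is written with the same column order.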
You could use a regular expression to extract the different parts of the key and use those to sort with.
e.g.,
import re
names = '''Q_122_SUB_A_COLUMN_C_NUMB_1
Q_122_SUB_B_COLUMN_C_NUMB_1
Q_123_SUB_B_COLUMN_C_NUMB_1
Q_123_SUB_A_COLUMN_C_NUMB_17
Q_123_SUB_D_COLUMN_C_NUMB_1
Q_123_SUB_B_COLUMN_C_NUMB_17
Q_123_SUB_C_COLUMN_C_NUMB_1
Q_123_SUB_C_COLUMN_C_NUMB_17
Q_123_SUB_A_COLUMN_C_NUMB_1
Q_123_SUB_D_COLUMN_C_NUMB_17'''.split()
def key(name, match=re.compile(r'Q_(\d+)_SUB_(\w+)_COLUMN_(\w+)_NUMB_(\d+)').match):
    # not sure what the actual order is, adjust the priorities accordingly
    return tuple(f(value) for f, value in zip((str, int, int, str), match(name).group(3, 4, 1, 2)))
for name in names:
    print name
names.sort(key=key)
print
for name in names:
    print name
To explain the key-extracting process: we know that the keys have a certain pattern. A regular expression works great here.
r'Q_(\d+)_SUB_(\w+)_COLUMN_(\w+)_NUMB_(\d+)'
#     ^         ^            ^          ^
#  digits    letters      letters    digits
#  group 1   group 2      group 3    group 4
In regular expressions, parts of the pattern wrapped in parens are groups. \d represents any decimal digit. + means that there should be one or more of the previous item. So \d+ means one or more decimal digits. \w matches a word character (a letter, a digit, or an underscore), which covers the letters used here.
Provided a string matches this pattern, we could get easy access to each grouping in that string using the group method. You could access multiple groups just by including more group numbers too
e.g.,
m = re.match(r'Q_(\d+)_SUB_(\w+)_COLUMN_(\w+)_NUMB_(\d+)', 'Q_122_SUB_B_COLUMN_C_NUMB_1')
# m.group(1) == '122'
# m.group(2) == 'B'
# m.group(3, 4) == ('C', '1')
This is similar to Ignacio's approach, only a lot more strict on the pattern. Once you can wrap your head around this, creating the appropriate key for sorting should be simple.
Assuming the keys are contained in a list, say keyList
list_to_sort=[]
for key in keyList:
    sortKeys=key.split('_')
    keyTuple=(sortKeys[1],sortKeys[-1],sortKeys[3],sortKeys[5],key)
    list_to_sort.append(keyTuple)
after this the items in the list are tuples that look like
('123', '17', 'D', 'C', 'Q_123_SUB_D_COLUMN_C_NUMB_17')
from operator import itemgetter
list_to_sort.sort(key=itemgetter(0,1,2,3))
I am not sure exactly what itemgetter does but this works and seems simpler, but less elegant than the other two solutions.
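For what it's worth, here is a small sketch of what itemgetter does, using a made-up row shaped like the tuples above. It also shows one thing to watch for with this approach: the pieces produced by split are strings, so numbers compare character by character unless they are converted.

```python
from operator import itemgetter

row = ('123', '17', 'D', 'C', 'Q_123_SUB_D_COLUMN_C_NUMB_17')

# itemgetter(0, 1, 2, 3) builds a function that returns
# (row[0], row[1], row[2], row[3]) for whatever row it is given.
pick = itemgetter(0, 1, 2, 3)
picked = pick(row)
print(picked)

# Strings sort lexicographically, so '17' lands before '2' ...
as_strings = sorted(['17', '2', '10'])
# ... unless the values are compared as ints.
as_numbers = sorted(['17', '2', '10'], key=int)
print(as_strings, as_numbers)
```

Wrapping the numeric pieces in int() when building keyTuple would make the sort follow numeric order.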
Notice that I arranged the keys in the tuple to sort in an order that was different from the way the parts appear in the key. That was not necessary; I could have done
for key in keyList:
    sortKeys=key.split('_')
    keyTuple=(sortKeys[1],sortKeys[3],sortKeys[5],sortKeys[7],key)
    list_to_sort.append(keyTuple)
and then done the sort like so
list_to_sort.sort(key=itemgetter(0,3,1,2))
It was easier for me to track the first one through.
