How can I sort a complicated dictionary key - python

I have these really complicated data files to process. As each file is processed I use an OrderedDict to capture the keys and values, and each OrderedDict is appended to a list, so my final result is a list of dictionaries. Because of the diversity of the data captured in these files, they have many keys in common, but there are enough uncommon keys to make exporting the data to Excel more complicated than I was hoping for, because I really need to push out the data in a consistent structure.
Each key has a structure like
Q_#_SUB_#_COLUMN_#_NUMB_#
so for example I have
Q_123_SUB_D_COLUMN_C_NUMB_17
We can translate the key as follows
Question 123
SubItem D
Column C
Instance 17
Because there is a SubItem D, a Column C and an Instance 17, there must also be a SubItem A, a Column B and an Instance 16.
However, one of the source files might be populated with data values (and keys) that range up to the example above, while some other source file might terminate with
Q_123_SUB_D_COLUMN_C_NUMB_13
So when I iterate through the list of dictionaries to pull out all of the unique keys to use as the column headings for csv.DictWriter, my plan was to sort the resulting list of unique column headings, but I can't seem to make the sort work.
Specifically, I need it to sort so that the results look like:
Q_122_SUB_A_COLUMN_C_NUMB_1
Q_122_SUB_B_COLUMN_C_NUMB_1
Q_123_SUB_A_COLUMN_C_NUMB_1
Q_123_SUB_B_COLUMN_C_NUMB_1
Q_123_SUB_C_COLUMN_C_NUMB_1
Q_123_SUB_D_COLUMN_C_NUMB_1
...
Q_123_SUB_A_COLUMN_C_NUMB_17
Q_123_SUB_B_COLUMN_C_NUMB_17
Q_123_SUB_C_COLUMN_C_NUMB_17
Q_123_SUB_D_COLUMN_C_NUMB_17
The big issue is that I do not know, before I open any particular set of these files, how many questions are answered, how many sub-questions are answered, how many columns are associated with each question or sub-question, or how many instances exist of any particular combination of questions, sub-questions or columns, and I don't want to have to know. Using Python I was able to reduce over 1,200 lines of SAS code to 95, but this last little bit before I start writing out to a CSV file is what I can't figure out.
Any observations would be appreciated.
My plan is to find all of the unique keys by iterating through the list of dictionaries, and then to sort those keys correctly so I can create a CSV file using them as column headings. I know that I could find the unique keys, push them out to a file, sort that manually, and then read the sorted file back in, but that seems clumsy.

Just supply a sufficiently clever function as the key when sorting.
>>> (lambda x: tuple(y(z) for (y, z)
...                  in zip((int, str, str, int),
...                         x.split('_')[1::2])))('Q_122_SUB_A_COLUMN_C_NUMB_1')
(122, 'A', 'C', 1)
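As a usage sketch (hedging on the field priority: the lambda compares fields in the order they appear in the key, i.e. Q, SUB, COLUMN, NUMB, so reorder the tuple if you need the question's Q, NUMB, SUB, COLUMN ordering):
keyfunc = lambda x: tuple(y(z) for (y, z)
                          in zip((int, str, str, int),
                                 x.split('_')[1::2]))

headings = ['Q_123_SUB_B_COLUMN_C_NUMB_1', 'Q_122_SUB_A_COLUMN_C_NUMB_1']
print(sorted(headings, key=keyfunc))
# ['Q_122_SUB_A_COLUMN_C_NUMB_1', 'Q_123_SUB_B_COLUMN_C_NUMB_1']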

You could use a regular expression to extract the different parts of the key and use those to sort with.
e.g.,
import re

names = '''Q_122_SUB_A_COLUMN_C_NUMB_1
Q_122_SUB_B_COLUMN_C_NUMB_1
Q_123_SUB_B_COLUMN_C_NUMB_1
Q_123_SUB_A_COLUMN_C_NUMB_17
Q_123_SUB_D_COLUMN_C_NUMB_1
Q_123_SUB_B_COLUMN_C_NUMB_17
Q_123_SUB_C_COLUMN_C_NUMB_1
Q_123_SUB_C_COLUMN_C_NUMB_17
Q_123_SUB_A_COLUMN_C_NUMB_1
Q_123_SUB_D_COLUMN_C_NUMB_17'''.split()

def key(name, match=re.compile(r'Q_(\d+)_SUB_(\w+)_COLUMN_(\w+)_NUMB_(\d+)').match):
    # not sure what the actual order is, adjust the priorities accordingly
    return tuple(f(value) for f, value in zip((str, int, int, str), match(name).group(3, 4, 1, 2)))

for name in names:
    print(name)
names.sort(key=key)
print()
for name in names:
    print(name)
To explain the key-extracting process: we know that the keys follow a certain pattern, and a regular expression works great here.
r'Q_(\d+)_SUB_(\w+)_COLUMN_(\w+)_NUMB_(\d+)'
#     ^         ^            ^          ^
#  digits    letters      letters    digits
#  group 1   group 2      group 3    group 4
In regular expressions, parts of the pattern wrapped in parentheses are groups. \d matches any decimal digit, and + means one or more of the preceding item, so \d+ means one or more decimal digits. \w matches a word character (a letter, digit or underscore), which works here because the SUB and COLUMN parts are letters.
Provided a string matches this pattern, the group method gives easy access to each captured group. You can also pull out several groups at once by passing multiple group numbers,
e.g.,
m = re.match(r'Q_(\d+)_SUB_(\w+)_COLUMN_(\w+)_NUMB_(\d+)', 'Q_122_SUB_B_COLUMN_C_NUMB_1')
# m.group(1) == '122'
# m.group(2) == 'B'
# m.group(3, 4) == ('C', '1')
This is similar to Ignacio's approach, only a lot more strict on the pattern. Once you can wrap your head around this, creating the appropriate key for sorting should be simple.
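To connect this back to the question's CSV goal, here is a minimal sketch of the last step, assuming list_of_dicts is the question's list of OrderedDicts and key is the function above (with the group priorities adjusted to taste); restval fills in the columns a given row is missing:
import csv

# union of all keys across every dictionary, sorted with the key function above
headings = sorted({k for d in list_of_dicts for k in d}, key=key)

with open('out.csv', 'w', newline='') as f:
    writer = csv.DictWriter(f, fieldnames=headings, restval='')
    writer.writeheader()
    writer.writerows(list_of_dicts)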

Assuming the keys are contained in a list, say keyList:
list_to_sort = []
for key in keyList:
    sortKeys = key.split('_')
    keyTuple = (int(sortKeys[1]), int(sortKeys[-1]), sortKeys[3], sortKeys[5], key)
    list_to_sort.append(keyTuple)
After this, the items in the list are tuples that look like
(123, 17, 'D', 'C', 'Q_123_SUB_D_COLUMN_C_NUMB_17')
from operator import itemgetter
list_to_sort.sort(key=itemgetter(0, 1, 2, 3))
itemgetter(0, 1, 2, 3) returns a callable that fetches those positions from each tuple, so the sort compares element 0 first, then 1, 2 and 3. This works and seems simpler, but less elegant, than the other two solutions. (The numeric fields are converted with int() above because sorting them as strings would put '17' before '2'.)
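A two-line demo of what itemgetter returns, purely illustrative:
from operator import itemgetter

get = itemgetter(0, 2)
print(get(('a', 'b', 'c')))  # ('a', 'c')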
Notice that I arranged the fields in the tuple to sort in an order that is different from the way they appear in the key itself. That was not necessary; I could have done
for key in keyList:
    sortKeys = key.split('_')
    keyTuple = (int(sortKeys[1]), sortKeys[3], sortKeys[5], int(sortKeys[7]), key)
    list_to_sort.append(keyTuple)
and then done the sort like so:
list_to_sort.sort(key=itemgetter(0, 3, 1, 2))
It was easier for me to track the first one through

Related

How to Filter Rows in a DataFrame Based on a Specific Number of Characters and Numbers

New Python user here, so please pardon my ignorance if my approach seems completely off.
I am having trouble filtering the rows of a column based on their character/number format.
Here's an example of the DataFrame and Series
df = {'a': [1, 2, 4, 5], 'b': [7, 8, 9, 10], 'target': ['ABC1234', 'ABC123', '123ABC', '7KZA23']}
The column I am looking to filter is the "target" column, based on its character/number combos; I am essentially trying to make a dict like the one below
{'ABC1234': counts_of_format,
 'ABC123': counts_of_format,
 '123ABC': counts_of_format,
 'any_other_format': counts_of_format}
Here's my progress so far:
col = df['target'].astype('string')
abc1234_pat = '^[A-Z]{3}[0-9]{4}'
matches = re.findall(abc1234_pat, col)
I keep getting this error:
TypeError: expected string or bytes-like object
I've double-checked the dtype and it comes back as string. I've researched the TypeError, and the only solution I can find is converting it to a string.
Any insight or suggestion on what I might be doing wrong, or if this is simply the wrong approach to this problem, will be greatly appreciated!
Thanks in advance!
I am trying to create a dict that returns how many times the different character/number combos occur. For example, how many times does 3 characters followed by 4 numbers occur, and so on.
(Your problem would have been understood earlier and more easily had you stated this in the question post itself rather than in a comment.)
By characters, you mean letters; by numbers, you mean digits.
abc1234_pat = '^[A-Z]{3}[0-9]{4}'
Since you want to count occurrences of all character/number combos, this approach of using one concrete pattern would not lead very far. I suggest transforming the targets to a canonical form which serves as the key of your desired dict, e.g. substituting every letter with C and every digit with N (using your terms).
Of the many ways to tackle this, one is using str.translate together with a class which does the said transformation.
class classify():
    def __getitem__(self, key):
        # letters map to 'C', digits to 'N'; anything else hits ord(None) and raises
        return ord('C' if chr(key).isalpha() else 'N' if chr(key).isdigit() else None)

occ = df.target.str.translate(classify()).value_counts()  # .to_dict()
Note that this will purposely raise an exception if target contains non-alphanumeric characters.
You can convert the resulting Series to a dict with .to_dict() if you like.
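For a less exotic sketch of the same canonical-form idea (my variation, not the answer's method), two regex substitutions per target also work:
import pandas as pd

df = pd.DataFrame({'target': ['ABC1234', 'ABC123', '123ABC', '7KZA23']})
canonical = (df['target']
             .str.replace(r'[A-Za-z]', 'C', regex=True)
             .str.replace(r'[0-9]', 'N', regex=True))
print(canonical.value_counts().to_dict())
# {'CCCNNNN': 1, 'CCCNNN': 1, 'NNNCCC': 1, 'NCCCNN': 1}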

Python Matching Multiple Keys/ Unique Pairs to a Value

What would be the fastest, most efficient way to map multiple values to one value? For a use-case example, say you are multiplying two numbers and you want to remember whether you have multiplied those numbers before. Instead of making a giant X-by-Y matrix and filling it out, it would be nice to query a dict to see whether dict[2, 3] == 6 or dict[3, 2] == 6. This would be especially useful for more than 2 values.
I have seen an answer similar to what I'm asking here, but would this be O(n) time or O(1)?
print value for matching multiple key
for key in responses:
    if user_message in key:
        print(responses[key])
Thanks!
Seems like the easiest way to do this is to sort the values before putting them in the dict, and to sort the x, y... values before looking them up. Note that you need to use tuples to map into a dictionary (lists are mutable, hence unhashable). As for the complexity question: a direct dict lookup on the sorted tuple is O(1) on average, whereas the loop above, which scans every key, is O(n).
the_dict = {(2, 3, 4): 24, (4, 5, 6): 120}
nums = tuple(sorted([6, 4, 5]))
if nums in the_dict:
    print(the_dict[nums])
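To round this out with the question's multiplication use case, a minimal memoization sketch, assuming operand order never matters (the function name and cache are illustrative):
from math import prod

_cache = {}

def cached_product(*nums):
    key = tuple(sorted(nums))  # canonical, hashable key
    if key not in _cache:
        _cache[key] = prod(key)
    return _cache[key]

print(cached_product(3, 2))  # computed: 6
print(cached_product(2, 3))  # same key (2, 3), answered from the cache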

How to deal with parsing an arbitrary number of lists into a dictionary

I am parsing an XMI/XML data structure into a pandas dataframe by first decomposing it into a dictionary. When I encounter a list of named tuples in my XMI, there appear to be at most two named tuples in the list (although the majority have only one).
To handle this case, I am doing the following:
if val is not None and val:
    if len(val) == 1:
        d['modifiedBegin'] = val[0].begin
        d['modifiedEnd'] = val[0].end
        d['modifiedBegin1'] = None
        d['modifiedEnd1'] = None
    else:
        d['modifiedBegin1'] = val[1].begin
        d['modifiedEnd1'] = val[1].end
My issues with this are: (a) I cannot be guaranteed that there are only two items in the list that I am decomposing, and (b) this feels cheap, ugly and just plain wrong!
I really would like to come up with a more general solution, especially given item a) above.
My data look like:
val = [Span(xmiID=105682, begin=13352, end=13358, type='org.metamap.uima.ts.Span'), Span(xmiID=105685, begin=13368, end=13374, type='org.metamap.uima.ts.Span')]
I would much rather parse this out into two separate rows in my dataframe, instead of having more columns. The major issue is that both of these tuples share common data from a larger object that looks like:
Negation(xmiID=142613, id=None, negType='nega', negTrigger='without', modifier=[Span(xmiID=105682, begin=13352, end=13358, type='org.metamap.uima.ts.Span'), Span(xmiID=105685, begin=13368, end=13374, type='org.metamap.uima.ts.Span')])
So both rows share the attributes negType and negTrigger... what is a more general way of decomposing this to insert into my dataframe? I thought of iterating through the elements when the length of the list was greater than one and then inserting into the dataframe on each iteration, but that seems messy.
My desired outcome would thus be to have a dataframe that looks like (minus the indices and other common junk):
negType  negTrigger  begin  end
nega     without     13352  13358
nega     without     13368  13374

Iterate over the Negation namedtuples; for each thing in negation.modifier, add a row using the negation's attributes and the thing's attributes, as sketched below.
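A minimal sketch of that loop, assuming negations is an iterable of Negation namedtuples shaped like the example (the variable names are illustrative):
import pandas as pd

rows = []
for neg in negations:  # assumed: iterable of Negation namedtuples
    for span in neg.modifier:
        rows.append({
            'negType': neg.negType,
            'negTrigger': neg.negTrigger,
            'begin': span.begin,
            'end': span.end,
        })

df = pd.DataFrame(rows)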
Or, instead of parsing XML to namedtuples to dictionaries, skip the middle part and create a single dictionary - {'begin': [row0, row1, ...], 'end': [row0, row1, ...], 'negTrigger': [row0, row1, ...], 'negType': [row0, row1, ...]} - from the XML.

difflib with more than two file names

I have several file names that I am trying to compare. Here are some examples:
files = ['FilePrefix10.jpg', 'FilePrefix11.jpg', 'FilePrefix21.jpg', 'FilePrefixOoufhgonstdobgfohj#lwghkoph[]**^.jpg']
What I need to do is extract "FilePrefix" from each file name, which changes depending on the directory. I have several folders containing many jpg's. Within each folder, each jpg has a FilePrefix in common with every other jpg in that directory. I need the variable portion of the jpg file name. I am unable to predict what FilePrefix is going to be ahead of time.
I had the idea to just compare two file names using difflib (in Python) and extract FilePrefix (and subsequently the variable portion) that way. I've run into the following issue:
>>> comp1 = SequenceMatcher(None, files[0], files[1])
>>> comp1.get_matching_blocks()
[Match(a=0, b=0, size=11), Match(a=12, b=12, size=4), Match(a=16, b=16, size=0)]
>>> comp1 = SequenceMatcher(None, files[1], files[2])
>>> comp1.get_matching_blocks()
[Match(a=0, b=0, size=10), Match(a=11, b=11, size=5), Match(a=16, b=16, size=0)]
As you can see, the first match size does not agree between the two comparisons. It's conflating the tens and the ones digits, making it hard for me to find the common prefix across more than two files. Is there a correct way to find a minimum size among all files within the directory? Or alternatively, is there a better way to extract FilePrefix?
Thank you.
It's not that it's "conflating the tens and the ones digits"; it's that in the first matchup the tens digit isn't different, so it's considered part of the matching prefix.
For your use case, there seems to be a pretty easy solution to this ambiguity: just match all adjacent pairs, and take the minimum. Like this:
from difflib import SequenceMatcher

def prefix(x, y):
    comp = SequenceMatcher(None, x, y)
    matches = comp.get_matching_blocks()
    prefix_match = matches[0]
    prefix_size = prefix_match[2]
    return prefix_size

pairs = zip(files, files[1:])
matches = (prefix(x, y) for x, y in pairs)
prefixlen = min(matches)
prefix = files[0][:prefixlen]
The prefix function is pretty straightforward, except for one thing: I used [2] instead of .size because there's an annoying bug in the 2.7 difflib where the second call to get_matching_blocks may return a plain tuple instead of a namedtuple. This won't affect the code as-is, but if you add some debugging prints it will break.
Now, pairs is a sequence of all adjacent pairs of names, created by zipping together files and files[1:]. (If this isn't clear, try print(list(zip(files, files[1:]))); the list call matters in Python 3.x, where zip returns a lazy iterator instead of a printable list.)
Now we just want to call prefix on each of the pairs, and take the smallest value we get back. That's what min is for. (I'm passing it a generator expression, which can be a tricky concept at first—but if you just think of it as a list comprehension that doesn't build the list, it's pretty simple.)
You could obviously compact this into two or three lines while still leaving it readable:
prefixlen = min(SequenceMatcher(None, x, y).get_matching_blocks()[0][2]
                for x, y in zip(files, files[1:]))
prefix = files[0][:prefixlen]
However, it's worth considering that SequenceMatcher is probably overkill here. It's looking for the longest matches anywhere, not just the longest prefix matches, which means it's essentially O(N^3) on the length of the strings, when it only needs to be O(NM) where M is the length of the result. Plus, it's not inconceivable that there could be, say, a suffix that's longer than the longest prefix, so it would return the wrong result.
So, why not just do it manually?
def prefixes(name):
    while name:
        yield name
        name = name[:-1]

def maxprefix(names):
    first, names = names[0], names[1:]
    for prefix in prefixes(first):
        if all(name.startswith(prefix) for name in names):
            return prefix
prefixes(first) just gives you 'FilePrefix10.jpg', 'FilePrefix10.jp', 'FilePrefix10.j', etc., down to 'F'. So we just loop over those, checking whether each one is also a prefix of all of the other names, and return the first one that is.
And you can do this even faster by thinking character by character instead of prefix by prefix:
def maxprefix(names):
    for i, letters in enumerate(zip(*names)):
        if len(set(letters)) > 1:
            return names[0][:i]
Here, we're just checking whether the first character is the same in all names, then whether the second character is the same in all names, and so on. Once we find one where that fails, the prefix is all characters up to that (from any of the names).
The zip reorganizes the list of names into a list of tuples, where the first one is the first character of each name, the second is the second character of each name, and so on. That is, [('F', 'F', 'F', 'F'), ('i', 'i', 'i', 'i'), …].
The enumerate just gives us the index along with the value. So, instead of getting ('F', 'F', 'F', 'F') you get (0, ('F', 'F', 'F', 'F')). We need that index for the last step.
Now, to check that ('F', 'F', 'F', 'F') are all the same, I just put them in a set. If they're all the same, the set will have just one element—{'F'}, then {'i'}, etc. If they're not, it'll have multiple elements—{'1', '2'}—and that's how we know we've gone past the prefix.
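For what it's worth, the standard library already implements this character-by-character comparison: os.path.commonprefix works on arbitrary strings, not just paths. A quick check:
import os.path

files = ['FilePrefix10.jpg', 'FilePrefix11.jpg', 'FilePrefix21.jpg']
print(os.path.commonprefix(files))  # 'FilePrefix'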
The only way to be certain is to check ALL the filenames. So just iterate through them all, checking against the kept maximum matching string as you go.
You might try something like this:
files = ['FilePrefix10.jpg',
         'FilePrefix11.jpg',
         'FilePrefix21.jpg',
         'FilePrefixOoufhgonstdobgfohj#lwghkoph[]**^.jpg',
         'FileProtector354.jpg'
         ]

prefix = files[0]
for f in files:
    for c in range(len(prefix)):
        if prefix[:c + 1] != f[:c + 1]:
            prefix = prefix[:c]  # keep only the part that still matches
            break
print(prefix, len(prefix))
Please pardon the 'un-Pythonicness' of the solution, but I wanted the algorithm to be obvious to any level programmer.

Comparing massive lists of dictionaries in python

I never actually thought I'd run into speed issues with Python, but I have. I'm trying to compare really big lists of dictionaries to each other based on the dictionary values. I compare two lists, with the first like so:
biglist1 = [{'transaction': 'somevalue', 'id': 'somevalue', 'date': 'somevalue' ...}, {'transaction': 'somevalue', 'id': 'somevalue', 'date': 'somevalue' ...}, ...]
With 'somevalue' standing for a user-generated string, int or decimal. Now, the second list is pretty similar, except the id-values are always empty, as they have not been assigned yet.
biglist2 = [{'transaction': 'somevalue', 'id': '', 'date': 'somevalue' ...}, {'transaction': 'somevalue', 'id': '', 'date': 'somevalue' ...}, ...]
So I want to get a list of the dictionaries in biglist2 that match the dictionaries in biglist1 for all other keys except id.
I've been doing
list_transactionnamematches = []
for item in biglist2:
    for transaction in biglist1:
        if item['transaction'] == transaction['transaction']:
            list_transactionnamematches.append(transaction)

list_datematches = []  # a fresh list; appending to the list being iterated would never terminate
for item in biglist2:
    for transaction in list_transactionnamematches:
        if item['date'] == transaction['date']:
            list_datematches.append(transaction)
... and so on, not comparing id values, until I get a final list of matches. Since the lists can be really big (around 3,000+ items each), this takes quite some time for Python to loop through.
I'm guessing this isn't really how this kind of comparison should be done. Any ideas?
Index on the fields you want to use for lookup. O(n+m)
matches = []
biglist1_indexed = {}

for item in biglist1:
    biglist1_indexed[(item["transaction"], item["date"])] = item

for item in biglist2:
    if (item["transaction"], item["date"]) in biglist1_indexed:
        matches.append(item)
This is probably thousands of times faster than what you're doing now.
What you want to do is to use correct data structures:
1. Create a dictionary mapping tuples of the other values (everything except id) in the first list's dictionaries to their ids.
2. Create two sets of those value tuples, one per list. Then use set operations to get the tuple set you want.
3. Use the dictionary from point 1 to assign ids to those tuples.
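A rough sketch of those three steps, assuming 'transaction' and 'date' stand in for all the non-id fields (field names taken from the question):
fields = ('transaction', 'date')  # extend with every key except 'id'

# 1. map each tuple of values to its id, using the first list
ids_by_values = {tuple(d[f] for f in fields): d['id'] for d in biglist1}

# 2. set intersection finds the value tuples present in both lists
set1 = {tuple(d[f] for f in fields) for d in biglist1}
set2 = {tuple(d[f] for f in fields) for d in biglist2}

# 3. assign ids to the matching tuples
for values in set1 & set2:
    print(values, '->', ids_by_values[values])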
Forgive my rusty python syntax, it's been a while, so consider this partially pseudocode
import operator

biglist1.sort(key=operator.itemgetter('date', 'transaction'))
biglist2.sort(key=operator.itemgetter('date', 'transaction'))

biglist3 = []
i1 = 0
i2 = 0
while i1 < len(biglist1) and i2 < len(biglist2):
    key1 = (biglist1[i1]['date'], biglist1[i1]['transaction'])
    key2 = (biglist2[i2]['date'], biglist2[i2]['transaction'])
    if key1 == key2:
        biglist3.append(biglist1[i1])
        i1 += 1
        i2 += 1
    elif key1 < key2:
        i1 += 1
    else:  # key1 > key2; the three comparisons are exhaustive
        i2 += 1
This sorts both lists into the same order, by (date,transaction). Then it walks through them side by side, stepping through each looking for relatively adjacent matches. It assumes that (date,transaction) is unique, and that I am not completely off my rocker with regards to tuple sorting and comparison.
In O(m*n)...
for item in biglist2:
    for transaction in biglist1:
        if (item['transaction'] == transaction['transaction'] and
                item['date'] == transaction['date'] and
                item['foo'] == transaction['foo']):
            list_transactionnamematches.append(transaction)
The approach I would probably take to this is to make a very, very lightweight class with one instance variable and one method. The instance variable is a pointer to a dictionary; the method overrides the built-in special method __hash__(self), returning a value calculated from all the values in the dictionary except id.
From there the solution seems fairly obvious: Create two initially empty dictionaries: N and M (for no-matches and matches.) Loop over each list exactly once, and for each of these dictionaries representing a transaction (let's call it a Tx_dict), create an instance of the new class (a Tx_ptr). Then test for an item matching this Tx_ptr in N and M: if there is no matching item in N, insert the current Tx_ptr into N; if there is a matching item in N but no matching item in M, insert the current Tx_ptr into M with the Tx_ptr itself as a key and a list containing the Tx_ptr as the value; if there is a matching item in N and in M, append the current Tx_ptr to the value associated with that key in M.
After you've gone through every item once, your dictionary M will contain pointers to all the transactions which match other transactions, all neatly grouped together into lists for you.
Edit: Oops! Obviously, the correct action if there is a matching Tx_ptr in N but not in M is to insert a key-value pair into M with the current Tx_ptr as the key and as the value, a list of the current Tx_ptr and the Tx_ptr that was already in N.
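A hedged sketch of that wrapper class (names are illustrative; note that __eq__ has to be defined alongside __hash__, or the dictionary probes described above will never find a match):
class TxPtr:
    """Wraps a transaction dict, hashing on every value except 'id'."""

    def __init__(self, tx_dict):
        self.tx = tx_dict
        self._key = tuple(sorted(
            (k, v) for k, v in tx_dict.items() if k != 'id'))

    def __hash__(self):
        return hash(self._key)

    def __eq__(self, other):
        return self._key == other._key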
Have a look at Psyco. It's a Python compiler that can create very fast, optimized machine code from your source.
http://sourceforge.net/projects/psyco/
While this isn't a direct solution to your code's efficiency issues, it could still help speed things up without needing to write any new code. That said, I'd still highly recommend optimizing your code as much as possible AND using Psyco to squeeze as much speed out of it as possible.
Part of their guide specifically talks about using it to speed up list, string, and numeric computation heavy functions.
http://psyco.sourceforge.net/psycoguide/node8.html
I'm also a newbie. My code is structured in much the same way as his.
for A in biglist:
    for B in biglist:
        if (A.get('somekey') != B.get('somekey')  # don't match to itself
                and len(set(A.get('list')) - set(B.get('list'))) > 10):
            pass  # [do stuff...]
This takes hours to run through a list of 10,000 dictionaries. Each dictionary contains lots of stuff, but I could potentially pull out just the ids ('somekey') and lists ('list') and rewrite it as a single dictionary of 10,000 key:value pairs.
Question: how much faster would that be? And I assume this is faster than using a list of lists, right?
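As a rough sketch of that rewrite (names assumed from the snippet above): build each set once up front, so the inner loop stops re-creating them on every comparison, which is where most of the time goes:
# id -> set of list items, built once instead of inside the nested loop
sets_by_id = {d.get('somekey'): set(d.get('list')) for d in biglist}

for id_a, set_a in sets_by_id.items():
    for id_b, set_b in sets_by_id.items():
        if id_a != id_b and len(set_a - set_b) > 10:
            pass  # [do stuff...]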
