How can I generate a unique key from 2 strings? - python

Let's say I have "username1" and "username2"
How can I combine these two usernames together to generate a unique value?
The value should be the same no matter how they are combined if username2 is entered first or vice versa the two names should always combine to become the same unique value. The string length will not work as other usernames can have the same length.
Is there a simple way to do this or a technique for this?

Sets are unordered, and Python has a hashable immutable set type, assuming your requirement for a unique key is that it can be used as a dict key:
def key(a, b):
return frozenset([a, b])
d = {}
d[key("foo", "bar")] = "baz"
print(d[key("bar", "foo")])
You can also create a sorted tuple:
def key(a, b):
return tuple(sorted([a, b]))
d = {}
d[key("foo", "bar")] = "baz"
print(d[key("bar", "foo")])

There are many ways to generate a unique value. One common (to sysops) method is the one used to get a UNIX UUID (Universally Unique IDentifier) from the routines supplied with most major languages. In case of parsing problems, use a separator that does not appear in your input strings.
If you replace my constant strings "username1" and "username2" with your variables, I believe that your problem is solved.
import uuid
unique_sep = '|' # I posit vertical bar as not appearing in any user name
import uuid
unique_ID = uuid.uuid5(uuid.NAMESPACE_X500, "username1" + unique_sep + "username2")
print(unique_ID)
Output:
b2106742-94f5-596d-a461-c977b5982d85
To preserve uniqueness from entry order, simply sort them in any convenient fashion, such as:
names_in = [username1, username2]
unique_ID = uuid.uuid5(uuid.NAMESPACE_X500, min(names_in) + unique_sep + max(names_in))

Related

Python: Sorting ip ranges which are dictionary keys

I have a dictionary which has IP address ranges as Keys (used to de-duplicate in a previous step) and certain objects as values. Here's an example
Part of the dictionary sresult:
10.102.152.64-10.102.152.95 object1:object3
10.102.158.0-10.102.158.255 object2:object5:object4
10.102.158.0-10.102.158.31 object3:object4
10.102.159.0-10.102.255.255 object6
There are tens of thousands of lines, I want to sort (correctly) by IP address in keys
I tried splitting the key based on the range separator - to get a single IP address that can be sorted as follows:
ips={}
for key in sresult:
if '-' in key:
l = key.split('-')[0]
ips[l] = key
else:
ips[1] = key
And then using code found on another post, sorting by IP address and then looking up the values in the original dictionary:
sips = sorted(ipaddress.ip_address(line.strip()) for line in ips)
for x in sips:
print("SRC: "+ips[str(x)], "OBJECT: "+" :".join(list(set(sresult[ips[str(x)]]))), sep=",")
The problem I have encountered is that when I split the original range and add the sorted first IPs as new keys in another dictionary, I de-duplicate again losing lines of data - lines 2 & 3 in the example
line 1 10.102.152.64 -10.102.152.95
line 2 10.102.158.0 -10.102.158.255
line 3 10.102.158.0 -10.102.158.31
line 4 10.102.159.0 -10.102.255.25
becomes
line 1 10.102.152.64 -10.102.152.95
line 3 10.102.158.0 -10.102.158.31
line 4 10.102.159.0 -10.102.255.25
So upon rebuilding the original dictionary using the IP address sorted keys, I have lost data
Can anyone help please?
EDIT This post now consists of three parts:
1) A bit of information about dictionaries that you will need in order to understand the rest.
2) An analysis of your code, and how you could fix it without using any other Python features.
3) What I would consider the best solution to the problem, in detail.
1) Dictionaries
Python dictionaries are not ordered. If I have a dictionary like this:
dictionary = {"one": 1, "two": 2}
And I loop through dictionary.items(), I could get "one": 1 first, or I could get "two": 2 first. I don't know.
Every Python dictionary implicitly has two lists associated with it: a list of it's keys and a list of its values. You can get them list this:
print(list(dictionary.keys()))
print(list(dictionary.values()))
These lists do have an ordering. So they can be sorted. Of course, doing so won't change the original dictionary, however.
Your Code
What you realised is that in your case you only want to sort according to the first IP address in your dictionaries keys. Therefore, the strategy that you adopted is roughly as follows:
1) Build a new dictionary, where the keys are only this first part.
2) Get that list of keys from the dictionary.
3) Sort that list of keys.
4) Query the original dictionary for the values.
This approach will, as you noticed, fail at step 1. Because as soon as you made the new dictionary with truncated keys, you will have lost the ability to differentiate between some keys that were only different at the end. Every dictionary key must be unique.
A better strategy would be:
1) Build a function which can represent you "full" ip addresses with as an ip_address object.
2) Sort the list of dictionary keys (original dictionary, don't make a new one).
3) Query the dictionary in order.
Let's look at how we could change your code to implement step 1.
def represent(full_ip):
if '-' in full_ip:
# Stylistic note, never use o or l as variable names.
# They look just like 0 and 1.
first_part = full_ip.split('-')[0]
return ipaddress.ip_address(first_part.strip())
Now that we have a way to represent the full IP addresses, we can sort them according to this shortened version, without having to actually change the keys at all. All we have to do is tell Python's sorted method how we want the key to be represented, using the key parameter (NB, this key parameter has nothing to do with key in a dictionary. They just both happened to be called key.):
# Another stylistic note, always use .keys() when looping over dictionary keys. Explicit is better than implicit.
sips = sorted(sresults.keys(), key=represent)
And if this ipaddress library works, there should be no problems up to here. The remainder of your code you can use as is.
Part 3 The best solution
Whenever you are dealing with sorting something, it's always easiest to think about a much simpler problem: given two items, how would I compare them? Python gives us a way to do this. What we have to do is implement two data model methods called
__le__
and
__eq__
Let's try doing that:
class IPAddress:
def __init__(self, ip_address):
self.ip_address = ip_address # This will be the full IP address
def __le__(self, other):
""" Is this object less than or equal to the other one?"""
# First, let's find the first parts of the ip addresses
this_first_ip = self.ip_address.split("-")[0]
other_first_ip = other.ip_address.split("-")[0]
# Now let's put them into the external library
this_object = ipaddress.ip_address(this_first_ip)
other_object = ipaddress.ip_adress(other_first_ip)
return this_object <= other_object
def __eq__(self, other):
"""Are the two objects equal?"""
return self.ip_address == other.ip_adress
Cool, we have a class. Now, the data model methods will automatically be invoked any time I use "<" or "<=" or "==". Let's check that it is working:
test_ip_1 = IPAddress("10.102.152.64-10.102.152.95")
test_ip_2 = IPAddress("10.102.158.0-10.102.158.255")
print(test_ip_1 <= test_ip_2)
Now, the beauty of these data model methods is that Pythons "sort" and "sorted" will use them as well:
dictionary_keys = sresult.keys()
dictionary_key_objects = [IPAddress(key) for key in dictionary_keys]
sorted_dictionary_key_objects = sorted(dictionary_key_objects)
# According to you latest comment, the line below is what you are missing
sorted_dictionary_keys = [object.ip_address for object in sorted_dictionary_key_objects]
And now you can do:
for key in sorted_dictionary_keys:
print(key)
print(sresults[key])
The Python data model is almost the defining feature of Python. I'd recommend reading about it.

Regular expressions matching words which contain the pattern but also the pattern plus something else

I have the following problem:
list1=['xyz','xyz2','other_randoms']
list2=['xyz']
I need to find which elements of list2 are in list1. In actual fact the elements of list1 correspond to a numerical value which I need to obtain then change. The problem is that 'xyz2' contains 'xyz' and therefore matches also with a regular expression.
My code so far (where 'data' is a python dictionary and 'specie_name_and_initial_values' is a list of lists where each sublist contains two elements, the first being specie name and the second being a numerical value that goes with it):
all_keys = list(data.keys())
for i in range(len(all_keys)):
if all_keys[i]!='Time':
#print all_keys[i]
pattern = re.compile(all_keys[i])
for j in range(len(specie_name_and_initial_values)):
print re.findall(pattern,specie_name_and_initial_values[j][0])
Variations of the regular expression I have tried include:
pattern = re.compile('^'+all_keys[i]+'$')
pattern = re.compile('^'+all_keys[i])
pattern = re.compile(all_keys[i]+'$')
And I've also tried using 'in' as a qualifier (i.e. within a for loop)
Any help would be greatly appreciated. Thanks
Ciaran
----------EDIT------------
To clarify. My current code is below. its used within a class/method like structure.
def calculate_relative_data_based_on_initial_values(self,copasi_file,xlsx_data_file,data_type='fold_change',time='seconds'):
copasi_tool = MineParamEstTools()
data=pandas.io.excel.read_excel(xlsx_data_file,header=0)
#uses custom class and method to get the list of lists from a file
specie_name_and_initial_values = copasi_tool.get_copasi_initial_values(copasi_file)
if time=='minutes':
data['Time']=data['Time']*60
elif time=='hour':
data['Time']=data['Time']*3600
elif time=='seconds':
print 'Time is already in seconds.'
else:
print 'Not a valid time unit'
all_keys = list(data.keys())
species=[]
for i in range(len(specie_name_and_initial_values)):
species.append(specie_name_and_initial_values[i][0])
for i in range(len(all_keys)):
for j in range(len(specie_name_and_initial_values)):
if all_keys[i] in species[j]:
print all_keys[i]
The table returned from pandas is accessed like a dictionary. I need to go to my data table, extract the headers (i.e. the all_keys bit), then look up the name of the header in the specie_name_and_initial_values variable and obtain the corresponding value (the second element within the specie_name_and_initial_value variable). After this, I multiply all values of my data table by the value obtained for each of the matched elements.
I'm most likely over complicating this. Do you have a better solution?
thanks
----------edit 2 ---------------
Okay, below are my variables
all_keys = set([u'Cyp26_G_R1', u'Cyp26_G_rep1', u'Time'])
species = set(['[Cyp26_R1R2_RARa]', '[Cyp26_SRC3_1]', '[18-OH-RA]', '[p38_a]', '[Cyp26_G_rep1]', '[Cyp26]', '[Cyp26_G_a]', '[SRC3_p]', '[mRARa]', '[np38_a]', '[mRARa_a]', '[RARa_pp_TFIIH]', '[RARa]', '[Cyp26_G_L2]', '[atRA]', '[atRA_c]', '[SRC3]', '[RARa_Ser369p]', '[p38]', '[Cyp26_mRNA]', '[Cyp26_G_L]', '[TFIIH]', '[Cyp26_SRC3_2]', '[Cyp26_G_R1R2]', '[MSK1]', '[MSK1_a]', '[Cyp26_G]', '[Basal_Kinases]', '[Cyp26_R1_RARa]', '[4-OH-RA]', '[Cyp26_G_rep2]', '[Cyp26_Chromatin]', '[Cyp26_G_R1]', '[RXR]', '[SMRT]'])
You don't need a regex to find common elements, set.intersection will find all elements in list2 that are also in list1:
list1=['xyz','xyz2','other_randoms']
list2=['xyz']
print(set(list2).intersection(list1))
set(['xyz'])
Also if you wanted to compare 'xyz' to 'xyz2' you would use == not in and then it would correctly return False.
You can also rewrite your own code a lot more succinctly, :
for key in data:
if key != 'Time':
pattern = re.compile(val)
for name, _ in specie_name_and_initial_values:
print re.findall(pattern, name)
Based on your edit you have somehow managed to turn lists into strings, one option is to strip the []:
all_keys = set([u'Cyp26_G_R1', u'Cyp26_G_rep1', u'Time'])
specie_name_and_initial_values = set(['[Cyp26_R1R2_RARa]', '[Cyp26_SRC3_1]', '[18-OH-RA]', '[p38_a]', '[Cyp26_G_rep1]', '[Cyp26]', '[Cyp26_G_a]', '[SRC3_p]', '[mRARa]', '[np38_a]', '[mRARa_a]', '[RARa_pp_TFIIH]', '[RARa]', '[Cyp26_G_L2]', '[atRA]', '[atRA_c]', '[SRC3]', '[RARa_Ser369p]', '[p38]', '[Cyp26_mRNA]', '[Cyp26_G_L]', '[TFIIH]', '[Cyp26_SRC3_2]', '[Cyp26_G_R1R2]', '[MSK1]', '[MSK1_a]', '[Cyp26_G]', '[Basal_Kinases]', '[Cyp26_R1_RARa]', '[4-OH-RA]', '[Cyp26_G_rep2]', '[Cyp26_Chromatin]', '[Cyp26_G_R1]', '[RXR]', '[SMRT]'])
specie_name_and_initial_values = set(s.strip("[]") for s in specie_name_and_initial_values)
print(all_keys.intersection(specie_name_and_initial_values))
Which outputs:
set([u'Cyp26_G_R1', u'Cyp26_G_rep1'])
FYI, if you had lists inside the set you would have gotten an error as lists are mutable so are not hashable.

Differentiate lists according to translations

I have a multilingual field which i used hvad package.
I have a script like below, which i use for the tastypie dehydration.
array = []
for t in bundle.obj.facilities.filter(foo_type = i.foo_type):
for field in get_translatable_fields(t.foo_type.__class__):
for translation in t.foo_type.translations.all():
value = getattr(translation, field)
array.append(value)
print array
But i get all the language translations in the same list. Do you have any idea to have different lists belong to different languages.
I just want to have different arrays that differentiate during for translation in .... iteration
You can store them in a dictionary, indexed by the translation, using a collections.defaultdict:
import collections
dict_all = collections.defaultdict(list)
for t in bundle.obj.facilities.filter(foo_type = i.foo_type):
for field in get_translatable_fields(t.foo_type.__class__):
for translation in t.foo_type.translations.all():
value = getattr(translation, field)
dict_all[translation.language_code].append(value)
If you'd like to turn it back into a regular dictionary afterwards (instead of a defaultdict):
dict_all = dict(dict_all.items())

Sort list of strings by substring using Python

I have a list of strings, each of which is an email formatted in almost exactly the same way. There is a lot of information in each email, but the most important info is the name of a facility, and an incident date.
I'd like to be able to take that list of emails, and create a new list where the emails are grouped together based on the "location_substring" and then sorted again for the "incident_date_substring" so that all of the emails from one location will be grouped together in the list in chronological order.
The facility substring can be found usually in the subject line of each email. The incident date can be found in a line in the email that starts with: "Date of Incident:".
Any ideas as to how I'd go about doing this?
Write a function that returns the two pieces of information you care about from each email:
def email_sort_key(email):
"""Find two pieces of info in the email, and return them as a tuple."""
# ...search, search...
return "location", "incident_date"
Then, use that function as the key for sorting:
emails.sort(key=email_sort_key)
The sort key function is applied to all the values, and the values are re-ordered based on the values returned from the key function. In this case, the key function returns a tuple. Tuples are ordered lexicographically: find the first unequal element, then the tuples compare as the unequal elements compare.
Your solution might look something like this:
def getLocation (mail): pass
#magic happens here
def getDate (mail): pass
#here be dragons
emails = [...] #original list
#Group mails by location
d = {}
for mail in emails:
loc = getLocation (mail)
if loc not in d: d [loc] = []
d [loc].append (mail)
#Sort mails inside each group by date
for k, v in d.items ():
d [k] = sorted (v, key = getDate)
This is something you could do:
from collections import defaultdict
from datetime import datetime
import re
mails = ['list', 'of', 'emails']
mails2 = defaultdict(list)
for mail in mails:
loc = re.search(r'Subject:.*?for\s(.+?)\n', mail).group(1)
mails2[loc].append(mail)
for m in mails2.values():
m.sort(key=lambda x:datetime.strptime(re.search(r'Date of Incident:\s(.+?)\n',
x).group(1), '%m/%d/%Y'))
Please note that this has absolutely no error handling for cases where the regexes don't match.

Pythonic way to parse list of dictionaries for a specific attribute?

I want to cross reference a dictionary and django queryset to determine which elements have unique dictionary['name'] and djangoModel.name values, respectively. The way I'm doing this now is to:
Create a list of the dictionary['name'] values
Create a list of djangoModel.name values
Generate the list of unique values by checking for inclusion in those lists
This looks as follows:
alldbTests = dbp.test_set.exclude(end_date__isnull=False) #django queryset
vctestNames = [vctest['name'] for vctest in vcdict['tests']] #from dictionary
dbtestNames = [dbtest.name for dbtest in alldbTests] #from django model
# Compare tests in protocol in fortytwo's db with protocol from vc
obsoleteTests = [dbtest for dbtest in alldbTests if dbtest.name not in vctestNames]
newTests = [vctest for vctest in vcdict if vctest['name'] not in dbtestNames]
It feels unpythonic to have to generate the intermediate list of names (lines 2 and 3 above), just to be able to check for inclusion immediately after. Am I missing anything? I suppose I could put two list comprehensions in one line like this:
obsoleteTests = [dbtest for dbtest in alldbTests if dbtest.name not in [vctest['name'] for vctest in vcdict['tests']]]
But that seems harder to follow.
Edit:
Think of the initial state like this:
# vcdict is a list of django models where the following are all true
alldBTests[0].name == 'test1'
alldBTests[1].name == 'test2'
alldBTests[2].name == 'test4'
dict1 = {'name':'test1', 'status':'pass'}
dict2 = {'name':'test2', 'status':'pass'}
dict3 = {'name':'test5', 'status':'fail'}
vcdict = [dict1, dict2, dict3]
I can't convert to sets and take the difference unless I strip things down to just the name string, but then I lose access to the rest of the model/dictionary, right? Sets only would work here if I had the same type of object in both cases.
vctestNames = dict((vctest['name'], vctest) for vctest in vcdict['tests'])
dbtestNames = dict((dbtest.name, dbtest) for dbtest in alldbTests)
obsoleteTests = [vctestNames[key]
for key in set(vctestNames.keys()) - set(dbtestNames.keys())]
newTests = [dbtestNames[key]
for key in set(dbtestNames.keys()) - set(vctestNames.keys())]
You're working with basic set operations here. You could convert your objects to sets and just find the intersection (think Venn Diagrams):
obsoleteTests = list(set([a.name for a in alldbTests]) - set(vctestNames))
Sets are really useful when comparing two lists of objects (pseudopython):
set(a) - set(b) = [c for c in a and not in b]
set(a) + set(b) = [c for c in a or in b]
set(a).intersection(set(b)) = [c for c in a and in b]
The intersection- and difference-operations of sets should help you solve your problem more elegant.
But as you're originally dealing with dicts these examples and discussion may provide some inspirations: http://code.activestate.com/recipes/59875-finding-the-intersection-of-two-dicts

Categories

Resources