I have a dictionary which has IP address ranges as Keys (used to de-duplicate in a previous step) and certain objects as values. Here's an example
Part of the dictionary sresult:
10.102.152.64-10.102.152.95 object1:object3
10.102.158.0-10.102.158.255 object2:object5:object4
10.102.158.0-10.102.158.31 object3:object4
10.102.159.0-10.102.255.255 object6
There are tens of thousands of lines, and I want to sort them correctly (numerically, not as plain strings) by the IP address in the keys.
I tried splitting the key based on the range separator - to get a single IP address that can be sorted as follows:
ips = {}
for key in sresult:
    if '-' in key:
        l = key.split('-')[0]
        ips[l] = key
    else:
        ips[key] = key
And then using code found on another post, sorting by IP address and then looking up the values in the original dictionary:
sips = sorted(ipaddress.ip_address(line.strip()) for line in ips)
for x in sips:
    print("SRC: "+ips[str(x)], "OBJECT: "+" :".join(list(set(sresult[ips[str(x)]]))), sep=",")
The problem I have encountered is that when I split the original range and add the sorted first IPs as new keys in another dictionary, I de-duplicate again losing lines of data - lines 2 & 3 in the example
line 1 10.102.152.64 -10.102.152.95
line 2 10.102.158.0 -10.102.158.255
line 3 10.102.158.0 -10.102.158.31
line 4 10.102.159.0 -10.102.255.255
becomes
line 1 10.102.152.64 -10.102.152.95
line 3 10.102.158.0 -10.102.158.31
line 4 10.102.159.0 -10.102.255.255
So upon rebuilding the original dictionary using the IP address sorted keys, I have lost data
Can anyone help please?
EDIT This post now consists of three parts:
1) A bit of information about dictionaries that you will need in order to understand the rest.
2) An analysis of your code, and how you could fix it without using any other Python features.
3) What I would consider the best solution to the problem, in detail.
1) Dictionaries
Python dictionaries are not sorted by key. If I have a dictionary like this:
dictionary = {"one": 1, "two": 2}
And I loop through dictionary.items(), I get the items in insertion order on Python 3.7 and later; on older versions the order is arbitrary, so I could get "one": 1 first, or I could get "two": 2 first. Either way, the order has nothing to do with how the keys compare to each other.
Every Python dictionary implicitly has two lists associated with it: a list of its keys and a list of its values. You can get them like this:
print(list(dictionary.keys()))
print(list(dictionary.values()))
These lists do have an ordering. So they can be sorted. Of course, doing so won't change the original dictionary, however.
2) Your Code
What you realised is that in your case you only want to sort according to the first IP address in your dictionary's keys. Therefore, the strategy that you adopted is roughly as follows:
1) Build a new dictionary, where the keys are only this first part.
2) Get that list of keys from the dictionary.
3) Sort that list of keys.
4) Query the original dictionary for the values.
This approach will, as you noticed, fail at step 1. As soon as you make the new dictionary with truncated keys, you lose the ability to differentiate between keys that differ only at the end. Every dictionary key must be unique, so a later key silently overwrites an earlier one with the same first address.
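The collision is easy to reproduce with the two overlapping keys from the question. A minimal sketch (the variable names `original_keys` and `first_ip` are mine, not from the question):

```python
# Two range keys from the question that share the same first address.
original_keys = [
    "10.102.158.0-10.102.158.255",
    "10.102.158.0-10.102.158.31",
]

ips = {}
for key in original_keys:
    first_ip = key.split("-")[0]
    ips[first_ip] = key  # the second assignment replaces the first

print(ips)
```

Only one entry survives, which is exactly the data loss described in the question.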
A better strategy would be:
1) Build a function which can represent your "full" IP address ranges as an ip_address object.
2) Sort the list of dictionary keys (original dictionary, don't make a new one).
3) Query the dictionary in order.
Let's look at how we could change your code to implement step 1.
import ipaddress

def represent(full_ip):
    # Stylistic note, never use o or l as variable names.
    # They look just like 0 and 1.
    # split() returns the whole string when there is no '-',
    # so single addresses are handled as well.
    first_part = full_ip.split('-')[0]
    return ipaddress.ip_address(first_part.strip())
Now that we have a way to represent the full IP addresses, we can sort them according to this shortened version, without having to actually change the keys at all. All we have to do is tell Python's sorted function how we want each item to be represented, using the key parameter (NB, this key parameter has nothing to do with a key in a dictionary. They just both happen to be called key.):
# Another stylistic note, always use .keys() when looping over dictionary keys. Explicit is better than implicit.
sips = sorted(sresult.keys(), key=represent)
ipaddress is part of the standard library, so there should be no problems up to here. In the remainder of your code, loop over sips and look up sresult[key] directly; the intermediate ips dictionary is no longer needed.
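Putting the steps together, here is a minimal runnable sketch using the four sample entries from the question. Sorting by the (start, end) pair rather than by the start alone is my own addition, so that ranges sharing a start address come out in a deterministic order; `range_sort_key` is a hypothetical helper name, and the values are kept as plain strings for simplicity.

```python
import ipaddress

# Sample data copied from the question, deliberately out of order.
sresult = {
    "10.102.158.0-10.102.158.255": "object2:object5:object4",
    "10.102.159.0-10.102.255.255": "object6",
    "10.102.152.64-10.102.152.95": "object1:object3",
    "10.102.158.0-10.102.158.31": "object3:object4",
}

def range_sort_key(key):
    """Turn 'a-b' (or a single address 'a') into a tuple of ip_address objects."""
    return tuple(ipaddress.ip_address(part.strip()) for part in key.split("-"))

# The original keys are sorted in place of truncated copies, so nothing is lost.
sorted_keys = sorted(sresult, key=range_sort_key)
for key in sorted_keys:
    print("SRC:", key, "OBJECT:", sresult[key])
```

Because the original keys are never shortened, both ranges starting at 10.102.158.0 survive.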
3) The best solution
Whenever you are dealing with sorting something, it's always easiest to think about a much simpler problem: given two items, how would I compare them? Python gives us a way to do this. What we have to do is implement two data model methods called
__lt__
and
__eq__
(sorting only ever uses "<", so __lt__ is the one that matters here). Let's try doing that:
import ipaddress

class IPAddress:
    def __init__(self, ip_address):
        self.ip_address = ip_address  # This will be the full IP address
    def __lt__(self, other):
        """ Is this object less than the other one?"""
        # First, let's find the first parts of the ip addresses
        this_first_ip = self.ip_address.split("-")[0]
        other_first_ip = other.ip_address.split("-")[0]
        # Now let's put them into the external library
        this_object = ipaddress.ip_address(this_first_ip)
        other_object = ipaddress.ip_address(other_first_ip)
        return this_object < other_object
    def __eq__(self, other):
        """Are the two objects equal?"""
        return self.ip_address == other.ip_address
Cool, we have a class. Now, the data model methods will automatically be invoked any time I use "<" or "==". Let's check that it is working:
test_ip_1 = IPAddress("10.102.152.64-10.102.152.95")
test_ip_2 = IPAddress("10.102.158.0-10.102.158.255")
print(test_ip_1 < test_ip_2)
Now, the beauty of these data model methods is that Python's "sort" and "sorted" will use them as well:
dictionary_keys = sresult.keys()
dictionary_key_objects = [IPAddress(key) for key in dictionary_keys]
sorted_dictionary_key_objects = sorted(dictionary_key_objects)
# According to your latest comment, the line below is what you are missing
sorted_dictionary_keys = [obj.ip_address for obj in sorted_dictionary_key_objects]
And now you can do:
for key in sorted_dictionary_keys:
    print(key)
    print(sresult[key])
The Python data model is almost the defining feature of Python. I'd recommend reading about it.
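As a hedged aside that goes slightly beyond the answer above: the standard library's functools.total_ordering decorator can derive the remaining comparison operators from __lt__ and __eq__, so "<=", ">" and ">=" work too. A minimal sketch, where `RangeKey` is a hypothetical class name and parsing the start address once in the constructor is my own choice:

```python
import functools
import ipaddress

@functools.total_ordering
class RangeKey:
    """Hypothetical wrapper around a 'start-end' key; orders by start address."""

    def __init__(self, text):
        self.text = text
        # Parse the start address once instead of on every comparison.
        self.start = ipaddress.ip_address(text.split("-")[0].strip())

    def __eq__(self, other):
        return self.text == other.text

    def __lt__(self, other):
        return self.start < other.start

keys = ["10.102.159.0-10.102.255.255", "10.102.152.64-10.102.152.95"]
ordered = [k.text for k in sorted(RangeKey(k) for k in keys)]
print(ordered)
```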
I've followed a tutorial to write a Flask REST API and have a special request about a Python code.
The offered code is following:
# data list is where my objects are stored
def put_one(name):
    list_by_id = [list for list in data_list if list['name'] == name]
    list_by_id[0]['name'] = [new_name]
    print({'list_by_id' : list_by_id[0]})
It works, which is nice, and even though I understand what line 2 is doing, I would like to rewrite it in a way that makes it clear how the function iterates over the different lists. I already have an approach, but it raises KeyError: 0
def put(name):
    list_by_id = []
    list = []
    for list in data_list:
        if(list['name'] == name):
            list_by_id = list
    list_by_id[0]['name'] = request.json['name']
    return jsonify({'list_by_id' : list_by_id[0]})
My goal with this is also to be able to put other elements that don't necessarily have the key 'name'. If I manage to rewrite the function in another way, I'll be more likely to adapt it to my needs.
I've looked for tools to convert one way of coding into the other and answers in forums before coming here and couldn't find it.
It may not be beautiful code, but it gets the job done:
def put(value):
    for i in range(len(data_list)):
        key_list = list(data_list[i].keys())
        if data_list[i][key_list[0]] == value:
            print(f"old value: {key_list[0], data_list[i][key_list[0]]}")
            # test_key is the JSON field carrying the new value; it is defined outside this snippet
            data_list[i][key_list[0]] = request.json[test_key]
            print(f"new value: {key_list[0], data_list[i][key_list[0]]}")
            break
Now it doesn't matter what the key is: with this iteration the method only changes the value when it finds it in data_list. Before, the code broke at every iteration because the keys were different and they played a role.
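For comparison, here is a minimal sketch of the same idea as an explicit loop over the dictionaries themselves, without Flask. The sample `data_list` and the helper name `replace_value` are hypothetical. Unlike the code above, it checks every key of each entry rather than only the first one, which is an assumption about what is wanted.

```python
# Hypothetical sample data: each entry is a dict with arbitrary keys.
data_list = [
    {"name": "groceries"},
    {"title": "chores"},
]

def replace_value(entries, old_value, new_value):
    """Replace the first occurrence of old_value under any key; return that entry or None."""
    for entry in entries:
        for key, current in entry.items():
            if current == old_value:
                # Assigning to an existing key does not resize the dict,
                # and we return immediately afterwards.
                entry[key] = new_value
                return entry
    return None

updated = replace_value(data_list, "chores", "homework")
print(updated)
```

Returning None when nothing matches lets the caller decide how to report a missing value, for example with a 404 response in a Flask view.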
I have a problem and I want to determine whether my approach is sound. Here is the idea:
I would create a primary dict called zips, whose keys are the zip codes (taken from a list) and whose values are nested dicts. Each nested dict would have keys for "members", "offices", and "members per office".
It would look like this:
zips = {
    90219: {
        "members": 120,
        "offices": 18,
        "membersperoffice": 28
    },
    90220: {
        "members": 423,
        "offices": 37,
        "membersperoffice": 16
    }
}
and so on and so forth.
I think I need to build the nested dicts, and then process several lists against conditionals, passing resulting values into the corresponding dicts on the fly (i.e. based on how many times a zip code exists in the list).
Is using nested dictionaries the most pythonic way of doing this? Is it cumbersome? Is there a better way?
Can someone drop me a hint about how to push key values into nested dicts from a loop? I've not been able to find a good resource describing what I'm trying to do (if this is, indeed, the best path).
Thanks.
:edit: a more specific example:
determine how many instances of a zipcode are in list called membersperzip
find corresponding nested dict with same name as zipcode, inside dict called zips
pass value to corresponding key, called "members" (or whatever key)
:edit 2:
MadPhysicist requested I give code examples. I don't even know where to start with this one and I can't find examples. All I've been able to do thus far is:
area_dict = {}
area_dict = dict.fromkeys(all_areas, 0) #make all of the zipscodes keys, add a zero in the first non-key index
dictkeys = list (area_dict.keys())
That gets me a dict with a bunch of zip codes as keys. I've discovered no way to iterate through a list and create nested dicts (yet). Hence the actual question.
Please don't dogpile me and do the usual stack overflow thing. This is not me asking anyone to do my homework. This is merely me asking someone to drop me a HINT.
:edit 3:
Ok. This is convoluted (my fault). Allow me to clarify further:
So, I have an example of what the nested dicts should look like. They'll start out empty, but I need to iterate through one of the zip code lists to create all the nested dicts... inside of zips.
This is a sample of the list that I want to use to create the nested dicts inside of the zips dict:
zips = [90272, 90049, 90401, 90402, 90403, 90404, 90291, 90292, 90290, 90094, 90066, 90025, 90064, 90073]
And this is what I want it to look like
zips {
    90272: {
        "members": ,
        "offices": ,
        "membersperoffice":
    },
    90049: {
        "members": ,
        "offices": ,
        "membersperoffice":
    }
}
....
etc, etc. ( creating a corresponding nested dict for each zipcode in the list)
After I achieve this, I have to iterate through several more zip code lists... and those would spit out the number of times a zip code appears in a given list, and then find the dict corresponding to the zip code in question, and append that value to the relevant key.
Once I figure out the first part, I can figure this second part out on my own.
Thanks again. Sorry for any confusion.
You can do something like this:
all_areas = [90219, 90220]

def code_members(zipcode):
    if zipcode == 90219:
        return dict(members=120, offices=18, membersperoffice=28)
    return dict(members=423, offices=37, membersperoffice=16)

# The function has to be defined before the comprehension that calls it.
zips = {zipcode: code_members(zipcode) for zipcode in all_areas}
"I think I need to build the nested dicts, and then process several lists against conditionals, passing resulting values into the corresponding dicts on the fly (i.e. based on how many times a zip code exists in the list)."
Using the above approach, if a zipcode appears multiple times in the all_areas list, the resulting zips dictionary will only contain one entry for that zipcode, because dictionary keys are unique.
"Is using nested dictionaries the most pythonic way of doing this? Is it cumbersome? Is there a better way?"
May I suggest making a simple object that represents the value of each zipcode. Something simple like:
Using dataclass:
import dataclasses

@dataclasses.dataclass
class ZipProperties(object):
    members: int
    offices: int
    membersperoffice: int
Using named tuple (immutable: fields cannot be reassigned after creation, use ._replace() to get a modified copy):
ZipProperties = collections.namedtuple('ZipProperties', ['members', 'offices', 'membersperoffice'])
You can then change the code_members function to this:
def code_members(zipcode):
if zipcode == 90219:
return ZipProperties(120, 18, 28)
return ZipProperties(423, 37, 16)
Addressing your concrete example:
determine how many instances of a zipcode are in list called membersperzip
find corresponding nested dict with same name as zipcode, inside dict called zips
pass value to corresponding key, called "members" (or whatever key)
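import typing

membersperzip: typing.List[typing.Tuple[int, int]] = [(90219, 54)]
for zipcode, members in membersperzip:
    if zipcode in zips:
        # Direct lookup by key; there is no need to scan every entry of zips.
        # Attribute assignment works with the dataclass version of ZipProperties.
        zips[zipcode].members = members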
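The three steps quoted above can also be sketched with collections.Counter doing the counting. This is a minimal sketch under assumptions: here `membersperzip` is a flat list with one zip code per member (the question does not show its exact shape), plain nested dicts are used instead of ZipProperties, and zip codes missing from `zips` are simply skipped.

```python
import collections

zip_list = [90272, 90049, 90401]

# One nested dict per zip code, with placeholder values, as asked for in the question.
zips = {
    code: {"members": None, "offices": None, "membersperoffice": None}
    for code in zip_list
}

# Assumed shape: one list entry per member, holding that member's zip code.
membersperzip = [90272, 90049, 90272, 90272, 99999]

for code, count in collections.Counter(membersperzip).items():
    if code in zips:  # ignore zip codes we are not tracking
        zips[code]["members"] = count

print(zips[90272])
```

The same loop can be repeated for the other lists, writing to "offices" or any other key.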
I would suggest adding each entry when you have the actual value, instead of initializing the dictionary with empty values for every key. You have a list of keys, and I do not see why you would want to put all of them into the dictionary without having values in the first place.
zips = [90272, 90049, 90401, 90402, 90403, 90404, 90291, 90292, 90290, 90094, 90066, 90025, 90064, 90073]
zips_dict = {}
for a_zip in zips:
    if a_zip not in zips_dict:
        # Initialize proper value here for members etc.
        zips_dict[a_zip] = proper_value
If you insist on initializing the dict with empty values for each key, you could use this, which still iterates through the list, but as a dict comprehension.
zips = [90272, 90049, 90401, 90402, 90403, 90404, 90291, 90292, 90290, 90094, 90066, 90025, 90064, 90073]
zips_dict = {
    x: {
        "members": None,
        "offices": None,
        "membersperoffice": None,
    } for x in zips
}
Hope this helps
So I made this method to set parameters from a text file:
def set_params(self, params, previous_response=None):
    if len(params) > 0:
        param_value_list = params.split('&')
        self.params = {
            param_value.split()[0]: json.loads(previous_response.decode())[param_value.split()[1]] if
            param_value.split()[0] == 'o' and previous_response else param_value.split()[1]
            for param_value in param_value_list
        }
When I call this method, for example like this:
apiRequest.set_params("lim 5 & status active")
# now self.params={"lim" : 5, "status" : "active"}
it works well. Now I want to be able to add the same parameter multiple times, and when that happens, set the param as a list:
apiRequest.set_params("lim 5 & status active & status other")
# I want this: self.params={"lim" : 5, "status" : ["active", "other"]}
How can I modify this method beautifully? All I can think of is kind of ugly... I am new to Python
Just write it as simply and straightforwardly as you can. That is usually the best approach. In my code, below, I made one change to your requirements: all values are lists, though some may have just one element.
In this method I apply the following choices and techniques:
1) decode and parse the previous response only once, not every time it is referenced
2) start with an empty dictionary
3) split each string only once: this is faster because it avoids redundant operations and memory allocations, and (even more importantly) it is easier to read because the code is not repetitive
4) adjust the value according to the special case
5) use setdefault() to obtain the current list of values, if present, or set a new empty list object if it is not present
6) append the new value to the list of values
def set_params(self, params, previous_response=None):
    if len(params) <= 0:
        return
    # Decode the previous response once, and only if there is one.
    previous_data = json.loads(previous_response.decode()) if previous_response else None
    self.params = {}
    for param_value in params.split('&'):
        key, value = param_value.split()
        if key == 'o' and previous_response:
            value = previous_data[value]
        values = self.params.setdefault(key, [])
        values.append(value)
# end set_params()
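To see the setdefault() technique in isolation, here is a minimal standalone sketch that drops the class and the previous-response special case; `parse_params` is a hypothetical helper name. Values stay strings, since nothing here converts "5" to a number.

```python
def parse_params(params):
    """Parse 'key value & key value' into a dict mapping each key to a list of values."""
    result = {}
    for chunk in params.split("&"):
        if not chunk.strip():
            continue  # skip empty chunks, such as the one produced by an empty string
        key, value = chunk.split()
        # setdefault returns the existing list, or stores and returns a new one.
        result.setdefault(key, []).append(value)
    return result

parsed = parse_params("lim 5 & status active & status other")
print(parsed)
```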
Debugging is twice as hard as writing the code in the first place.
Therefore, if you write the code as cleverly as possible, you are, by
definition, not smart enough to debug it.
— Brian W. Kernighan and P. J. Plauger in The Elements of Programming Style.
Reference: http://quotes.cat-v.org/programming/
I have a list containing 700,000 items and a dictionary containing 300,000 keys. Some of the 300k keys are contained within the 700k items stored in the list.
Now, I have built a simple comparison and handling loop:
# list contains about 700k lines - ids,firstname,lastname,email,lastupdate
list = open(r'myfile.csv','rb').readlines()
dictionary = {}
# dictionary contains 300k ID keys
dictionary[someID] = {'first':'john',
                      'last':'smith',
                      'email':'john.smith@gmail.com',
                      'lastupdate':datetime_object}
for line in list:
    id, firstname, lastname, email, lastupdate = line.split(',')
    lastupdate = datetime.datetime.strptime(lastupdate,'%Y-%m-%d %H:%M:%S')
    if id in dictionary.keys():
        # update dictionary[id]'s keys:values
        if lastupdate > dictionary[id]['lastupdate']:
            # update values in dictionary[id]
            pass
    else:
        # create new id inside dictionary and fill with keys:values
        pass
I wish to speed things up a little and use multiprocessing for this kind of job. For this, I thought I could split the list into four smaller lists, Pool.map each list, and check them separately with each of the four processes to create four new dictionaries. The problem is that in order to create one whole dictionary with the last updated values, I will have to repeat the process with the four newly created dictionaries, and so on.
Has anyone ever experienced such a problem and have a solution or an idea for it?
Thanks
if id in dictionary.keys():
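NO! Please No! In Python 2 this is an O(n) operation, because .keys() builds a list of all the keys and then searches it linearly!!! The right way to do it is simply
if id in dictionary:
which takes O(1) time on average!!!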
Before thinking about using multiprocessing etc. you should avoid this really inefficient operation. If the dictionary has 300k keys, that line was probably the bottleneck.
I have assumed python2; if this is not the case then you should use the python-3.x tag. In python3, key in dictionary.keys() is O(1) because .keys() now returns a view of the dict instead of a list of keys; however, it is still a bit faster to omit .keys().
I think you should start with not splitting the same line for each token over and over again:
id, firstname, lastname, email, lastupdate = line.split(',')
lastupdate = datetime.datetime.strptime(lastupdate,'%Y-%m-%d %H:%M:%S')
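Putting both suggestions together (a membership test on the dictionary itself, one split per line), the loop could look roughly like this. It is a sketch under assumptions: the rows come from an in-memory sample instead of myfile.csv, the csv module replaces the manual split, and `apply_rows` is a hypothetical helper name.

```python
import csv
import datetime
import io

# Stand-in for myfile.csv: id,firstname,lastname,email,lastupdate (made-up rows)
SAMPLE = (
    "1,john,smith,john.smith@example.com,2012-01-02 10:00:00\n"
    "2,jane,doe,jane.doe@example.com,2012-01-03 09:30:00\n"
    "1,johnny,smith,johnny.smith@example.com,2012-02-01 08:00:00\n"
)

def apply_rows(lines, records):
    """Insert new ids and overwrite existing ones only when the row is newer."""
    for user_id, first, last, email, stamp in csv.reader(lines):
        lastupdate = datetime.datetime.strptime(stamp, "%Y-%m-%d %H:%M:%S")
        current = records.get(user_id)  # one hash lookup, no .keys()
        if current is None or lastupdate > current["lastupdate"]:
            records[user_id] = {
                "first": first,
                "last": last,
                "email": email,
                "lastupdate": lastupdate,
            }
    return records

records = apply_rows(io.StringIO(SAMPLE), {})
print(records["1"]["first"])
```

It is worth timing a single-process version like this on the real file before reaching for multiprocessing.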
I currently have a structure that is a dict: each value is a list that contains numeric values. Each of these numeric lists contain what (to borrow a SQL idiom) you could call a primary key containing the first three values which are: a year, a player identifier, and a team identifier. This is the key for the dict.
So you can get a unique row by passing in values for the year, player ID, and team ID like so:
statline = stats[(2001, 'SEA', 'suzukic01')]
Which yields something like
[305, 20, 444, 330, 45]
I'd like to alter this data structure so that it can be quickly summed by any one of these three keys: you could easily get the totals for a given index in the numeric lists by passing in ONE of year, player ID, and team ID, and then the index. I want to be able to do something like
hr_total = stats[year=2001, idx=3]
Where that idx of 3 corresponds to the third column in the numeric list(s) that would be retrieved.
Any ideas?
Read up on Data Warehousing. Any book.
Read up on Star Schema Design. Any book. Seriously.
You have several dimensions: Year, Player, Team.
You have one fact: score
You want to have a structure like this.
You then want to create a set of dimension indexes like this.
import collections

years = collections.defaultdict( list )
players = collections.defaultdict( list )
teams = collections.defaultdict( list )
Your fact table rows can be a collections.namedtuple, or you can use a class like this, which registers each row in the dimension indexes as it is created.
class ScoreFact( object ):
    def __init__( self, year, player, team, score ):
        self.year= year
        self.player= player
        self.team= team
        self.score= score
        years[self.year].append( self )
        players[self.player].append( self )
        teams[self.team].append( self )
Now you can find all items in a given dimension value. It's a simple list attached to a dimension value.
years[2001] are all scores for the given year.
players['suzukic01'] are all scores for the given player.
etc. You can simply use sum() over the score attribute to add them up. A multi-dimensional query is something like this.
[ x for x in players['suzukic01'] if x.year == 2001 ]
Put your data into SQLite, and use its relational engine to do the work. You can create an in-memory database and not even have to touch the disk.
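A minimal sketch of that approach using the standard sqlite3 module. The table layout, the column names (`hits`, `hr`), the helper `total`, and all sample numbers are assumptions made up for illustration; only the key values of the first row come from the question.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # in-memory database, nothing touches the disk
conn.execute(
    "CREATE TABLE stats (year INTEGER, team TEXT, player TEXT, hits INTEGER, hr INTEGER)"
)
conn.executemany(
    "INSERT INTO stats VALUES (?, ?, ?, ?, ?)",
    [
        (2001, "SEA", "suzukic01", 10, 2),  # made-up numbers
        (2001, "SEA", "player02", 7, 3),
        (2002, "SEA", "suzukic01", 9, 4),
    ],
)

def total(column, **where):
    """Sum one column, filtered by exactly one of year/team/player."""
    if column not in {"hits", "hr"}:
        raise ValueError("unknown column")
    (field, value), = where.items()  # raises ValueError unless exactly one filter is given
    if field not in {"year", "team", "player"}:
        raise ValueError("unknown filter")
    # Identifiers are checked against the whitelists above; the value is a bound parameter.
    query = f"SELECT SUM({column}) FROM stats WHERE {field} = ?"
    return conn.execute(query, (value,)).fetchone()[0]

hr_2001 = total("hr", year=2001)
print(hr_2001)
```

Named columns replace the positional idx from the question, which tends to be easier to read than a bare index.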
The syntax stats[year=2001, idx=3] is invalid Python and there is no way you can make it work with those square brackets and "keyword arguments"; you'll need to have a function or method call in order to accept keyword arguments.
So, say we make it a function, to be called like wells(stats, year=2001, idx=3). I imagine the idx argument is mandatory (which is very peculiar given the call, but you give no indication of what it could possibly mean to omit idx) and exactly one of year, playerid, and teamid must be there.
With your current data structure, wells can already be implemented:
def wells(stats, year=None, playerid=None, teamid=None, idx=None):
    if idx is None: raise ValueError('idx must be specified')
    specifiers = [(i, x) for i, x in enumerate((year, playerid, teamid)) if x is not None]
    if len(specifiers) != 1:
        raise ValueError('Exactly one of year, playerid, teamid, must be given')
    ikey, keyv = specifiers[0]
    # On Python 2, stats.iteritems() avoids building a list; on Python 3 use .items()
    return sum(v[idx] for k, v in stats.items() if k[ikey]==keyv)
Of course, this is O(N) in the size of stats -- it must examine every entry in it. Please measure correctness and performance with this simple implementation as a baseline. An alternative solution (much speedier in use, but requiring much time for preparation) is to put three dicts of lists (one each for year, playerid, teamid) to the side of stats, each entry indicating (or copying, but I think indicating by full key may suffice) all entries of stats that match that ikey / keyv pair. But it's not clear at this time whether this implementation may not be premature, so please try first with the simple-minded idea!-)
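The precomputed alternative described above might be sketched like this: three dicts of lists kept beside stats, each mapping one key component to the full keys that contain it. `build_indexes` and `indexed_sum` are hypothetical names, the sample rows are made up, and the (year, team, player) key order follows the lookup example in the question.

```python
import collections

# Made-up sample rows keyed by (year, team, player).
stats = {
    (2001, "SEA", "suzukic01"): [1, 2, 3],
    (2001, "NYA", "player02"): [4, 5, 6],
    (2002, "SEA", "suzukic01"): [7, 8, 9],
}

def build_indexes(stats):
    """Return one dict per key position, mapping a component value to the matching full keys."""
    indexes = [collections.defaultdict(list) for _ in range(3)]
    for full_key in stats:
        for position, component in enumerate(full_key):
            indexes[position][component].append(full_key)
    return indexes

def indexed_sum(stats, indexes, position, value, idx):
    """Sum column idx over the rows whose key has `value` at `position`."""
    # .get() avoids creating an empty entry for values that were never indexed.
    return sum(stats[key][idx] for key in indexes[position].get(value, []))

indexes = build_indexes(stats)
print(indexed_sum(stats, indexes, 0, 2001, 1))
```

Each query now touches only the matching rows, at the cost of building the indexes once and keeping them in sync whenever stats changes.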
def getSum(d, year, idx):
    total = 0  # named total so the built-in sum() is not shadowed
    for key in d.keys():
        if key[0] == year:
            total += d[key][idx]
    return total
This should get you started. I have made the assumption in this code, that ONLY year will be asked for, but it should be easy enough for you to manipulate this to check for other parameters as well
Cheers