Python data structure recommendation?

I currently have a structure that is a dict: each value is a list of numeric values. Each list has what (to borrow a SQL idiom) you could call a primary key made of three values: a year, a player identifier, and a team identifier. This is the key for the dict.
So you can get a unique row by passing in a value for the year, player ID, and team ID like so:
statline = stats[(2001, 'SEA', 'suzukic01')]
Which yields something like
[305, 20, 444, 330, 45]
I'd like to alter this data structure so it can be quickly summed by any one of these three keys: you could easily slice the totals for a given index in the numeric lists by passing in ONE of year, player ID, and team ID, and then the index. I want to be able to do something like
hr_total = stats[year=2001, idx=3]
Where that idx of 3 corresponds to the column at index 3 in the numeric list(s) that would be retrieved.
Any ideas?

Read up on Data Warehousing. Any book.
Read up on Star Schema Design. Any book. Seriously.
You have several dimensions: Year, Player, Team.
You have one fact: score
You want to have a structure like this.
You then want to create a set of dimension indexes like this.
import collections
years = collections.defaultdict(list)
players = collections.defaultdict(list)
teams = collections.defaultdict(list)
Your fact table can be a collections.namedtuple. You can use something like this.
class ScoreFact(object):
    def __init__(self, year, player, team, score):
        self.year = year
        self.player = player
        self.team = team
        self.score = score
        # Register this fact under each of its dimension values.
        years[self.year].append(self)
        players[self.player].append(self)
        teams[self.team].append(self)
Now you can find all items in a given dimension value. It's a simple list attached to a dimension value.
years['2001'] are all scores for the given year.
players['SEA'] are all scores for the given player.
etc. You can simply use sum() over the score attribute to add them up. A multi-dimensional query is something like this.
[ x for x in players['SEA'] if x.year == '2001' ]
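A minimal runnable sketch of the idea above, assuming one numeric score per fact and made-up sample rows. Here ScoreFact is a namedtuple, and add_fact is a hypothetical helper (not part of the answer) that keeps the three indexes in sync:

```python
import collections

# One fact row: three dimension values plus one measure.
ScoreFact = collections.namedtuple("ScoreFact", "year player team score")

# One index per dimension: dimension value -> list of facts.
years = collections.defaultdict(list)
players = collections.defaultdict(list)
teams = collections.defaultdict(list)


def add_fact(year, player, team, score):
    """Create a fact and register it under each of its dimension values."""
    fact = ScoreFact(year, player, team, score)
    years[year].append(fact)
    players[player].append(fact)
    teams[team].append(fact)
    return fact


# Made-up sample data, for illustration only.
add_fact(2001, "suzukic01", "SEA", 8)
add_fact(2001, "boonebr01", "SEA", 37)
add_fact(2002, "suzukic01", "SEA", 8)

# Single-dimension total, then a two-dimension query.
total_2001 = sum(f.score for f in years[2001])
ichiro_2002 = sum(f.score for f in players["suzukic01"] if f.year == 2002)
```

Each lookup touches only the facts filed under one dimension value, so the cost depends on the size of that list rather than on the whole data set.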

Put your data into SQLite, and use its relational engine to do the work. You can create an in-memory database and not even have to touch the disk.
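A rough sketch of what that could look like with the standard library's sqlite3 module. The table layout, the column names c0..c4 and the sample rows are my own assumptions; the key order (year, team, player) follows the example in the question:

```python
import sqlite3

# Made-up rows in the question's layout: key tuple -> list of numbers.
stats = {
    (2001, "SEA", "suzukic01"): [305, 20, 444, 330, 45],
    (2001, "SEA", "boonebr01"): [158, 37, 623, 181, 40],
    (2002, "SEA", "suzukic01"): [157, 8, 647, 208, 62],
}

conn = sqlite3.connect(":memory:")  # lives in RAM, nothing is written to disk
conn.execute(
    "CREATE TABLE stats (year INTEGER, team TEXT, player TEXT,"
    " c0 INTEGER, c1 INTEGER, c2 INTEGER, c3 INTEGER, c4 INTEGER)"
)
conn.executemany(
    "INSERT INTO stats VALUES (?, ?, ?, ?, ?, ?, ?, ?)",
    [key + tuple(values) for key, values in stats.items()],
)

# Sum column c3 for one year; change the WHERE clause to slice by team or player.
(total,) = conn.execute(
    "SELECT SUM(c3) FROM stats WHERE year = ?", (2001,)
).fetchone()
```

If the lookups become slow on a large table, CREATE INDEX on year, team and player is the usual next step.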

The syntax stats[year=2001, idx=3] is invalid Python and there is no way you can make it work with those square brackets and "keyword arguments"; you'll need to have a function or method call in order to accept keyword arguments.
So, say we make it a function, to be called like wells(stats, year=2001, idx=3). I imagine the idx argument is mandatory (which is very peculiar given the call, but you give no indication of what it could possibly mean to omit idx) and exactly one of year, playerid, and teamid must be there.
With your current data structure, wells can already be implemented:
def wells(stats, year=None, playerid=None, teamid=None, idx=None):
    if idx is None:
        raise ValueError('idx must be specified')
    # Pair each supplied specifier with its position in the key tuple
    # (the order here must match the order used in the dict keys).
    specifiers = [(i, x) for i, x in enumerate((year, playerid, teamid)) if x is not None]
    if len(specifiers) != 1:
        raise ValueError('Exactly one of year, playerid, teamid, must be given')
    ikey, keyv = specifiers[0]
    # items() works on Python 2 and 3; iteritems() is Python 2 only.
    return sum(v[idx] for k, v in stats.items() if k[ikey] == keyv)
Of course, this is O(N) in the size of stats: it must examine every entry in it. Please measure correctness and performance with this simple implementation as a baseline. An alternative solution (much speedier in use, but requiring much time for preparation) is to put three dicts of lists (one each for year, playerid, teamid) to the side of stats, each entry indicating (or copying, but I think indicating by full key may suffice) all entries of stats that match that ikey / keyv pair. But it's not clear at this time whether this implementation may not be premature, so please try first with the simple-minded idea!-)
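For reference, the side-index alternative could be sketched roughly as follows. The names index and indexed_sum are hypothetical, the sample rows are made up, and the key order (year, team, player) is assumed from the question's example:

```python
import collections

# Made-up sample in the question's layout: (year, team, player) -> numbers.
stats = {
    (2001, "SEA", "suzukic01"): [305, 20, 444, 330, 45],
    (2001, "SEA", "boonebr01"): [158, 37, 623, 181, 40],
    (2002, "SEA", "suzukic01"): [157, 8, 647, 208, 62],
}

# index[position][value] -> list of full keys whose key[position] == value
index = [collections.defaultdict(list) for _ in range(3)]
for full_key in stats:
    for position, value in enumerate(full_key):
        index[position][value].append(full_key)


def indexed_sum(position, value, idx):
    """Sum column idx over every row whose key has `value` at `position`."""
    return sum(stats[k][idx] for k in index[position].get(value, ()))
```

Building the index is a single O(N) pass; afterwards each query only visits the matching rows. The index has to be rebuilt or updated whenever stats changes.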

def getSum(d, year, idx):
    total = 0  # named total to avoid shadowing the built-in sum()
    for key in d:
        if key[0] == year:
            total += d[key][idx]
    return total
This should get you started. This code assumes that ONLY year will be asked for, but it should be easy enough to adapt it to check for the other parameters as well.
Cheers

Related

Python: Sorting ip ranges which are dictionary keys

I have a dictionary which has IP address ranges as Keys (used to de-duplicate in a previous step) and certain objects as values. Here's an example
Part of the dictionary sresult:
10.102.152.64-10.102.152.95 object1:object3
10.102.158.0-10.102.158.255 object2:object5:object4
10.102.158.0-10.102.158.31 object3:object4
10.102.159.0-10.102.255.255 object6
There are tens of thousands of lines, and I want to sort them (correctly) by the IP address in the keys.
I tried splitting the key based on the range separator - to get a single IP address that can be sorted as follows:
ips={}
for key in sresult:
    if '-' in key:
        l = key.split('-')[0]
        ips[l] = key
    else:
        ips[1] = key
And then using code found on another post, sorting by IP address and then looking up the values in the original dictionary:
sips = sorted(ipaddress.ip_address(line.strip()) for line in ips)
for x in sips:
    print("SRC: "+ips[str(x)], "OBJECT: "+" :".join(list(set(sresult[ips[str(x)]]))), sep=",")
The problem I have encountered is that when I split the original range and add the sorted first IPs as new keys in another dictionary, I de-duplicate again losing lines of data - lines 2 & 3 in the example
line 1 10.102.152.64 -10.102.152.95
line 2 10.102.158.0 -10.102.158.255
line 3 10.102.158.0 -10.102.158.31
line 4 10.102.159.0 -10.102.255.25
becomes
line 1 10.102.152.64 -10.102.152.95
line 3 10.102.158.0 -10.102.158.31
line 4 10.102.159.0 -10.102.255.25
So upon rebuilding the original dictionary using the IP address sorted keys, I have lost data
Can anyone help please?

EDIT This post now consists of three parts:
1) A bit of information about dictionaries that you will need in order to understand the rest.
2) An analysis of your code, and how you could fix it without using any other Python features.
3) What I would consider the best solution to the problem, in detail.
1) Dictionaries
Python dictionaries are not sorted by key. If I have a dictionary like this:
dictionary = {"one": 1, "two": 2}
And I loop through dictionary.items(), the order I get has nothing to do with sorting: before Python 3.7 it was arbitrary, and from 3.7 on it is insertion order.
Every Python dictionary implicitly has two lists associated with it: a list of its keys and a list of its values. You can get them like this:
print(list(dictionary.keys()))
print(list(dictionary.values()))
These lists do have an ordering. So they can be sorted. Of course, doing so won't change the original dictionary, however.
2) Your Code
What you realised is that in your case you only want to sort according to the first IP address in your dictionary's keys. Therefore, the strategy that you adopted is roughly as follows:
1) Build a new dictionary, where the keys are only this first part.
2) Get that list of keys from the dictionary.
3) Sort that list of keys.
4) Query the original dictionary for the values.
This approach will, as you noticed, fail at step 1. Because as soon as you made the new dictionary with truncated keys, you will have lost the ability to differentiate between some keys that were only different at the end. Every dictionary key must be unique.
A better strategy would be:
1) Build a function which can represent your "full" IP addresses as an ip_address object.
2) Sort the list of dictionary keys (original dictionary, don't make a new one).
3) Query the dictionary in order.
Let's look at how we could change your code to implement step 1.
import ipaddress
def represent(full_ip):
    # Stylistic note, never use o or l as variable names.
    # They look just like 0 and 1.
    # split() returns the whole string when there is no '-', so a lone address works too.
    first_part = full_ip.split('-')[0]
    return ipaddress.ip_address(first_part.strip())
Now that we have a way to represent the full IP addresses, we can sort them according to this shortened version, without having to actually change the keys at all. All we have to do is tell Python's sorted function how we want the key to be represented, using the key parameter (NB, this key parameter has nothing to do with a key in a dictionary. They just both happen to be called key.):
# Another stylistic note, always use .keys() when looping over dictionary keys. Explicit is better than implicit.
sips = sorted(sresult.keys(), key=represent)
And if this ipaddress library works, there should be no problems up to here. Each item in sips is now one of the original keys, so the remainder of your code can look up sresult[x] directly instead of going through ips.
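Put together, the key-function approach might look like the sketch below. It goes one step beyond represent by also using the end address as a tie-breaker, which is my own assumption about the desired order; range_sort_key is a hypothetical name and the values are shortened sample data from the question:

```python
import ipaddress

# Sample keys taken from the question; values shortened for illustration.
sresult = {
    "10.102.159.0-10.102.255.255": ["object6"],
    "10.102.158.0-10.102.158.255": ["object2", "object5", "object4"],
    "10.102.152.64-10.102.152.95": ["object1", "object3"],
    "10.102.158.0-10.102.158.31": ["object3", "object4"],
}


def range_sort_key(ip_range):
    """Return (start, end) as ip_address objects; a lone address is its own end."""
    start, _, end = ip_range.partition("-")
    start = ipaddress.ip_address(start.strip())
    end = ipaddress.ip_address(end.strip()) if end else start
    return (start, end)


# The original keys are sorted in place of truncated copies, so nothing is lost.
sorted_keys = sorted(sresult, key=range_sort_key)
for key in sorted_keys:
    print("SRC: " + key, "OBJECT: " + ":".join(sresult[key]), sep=",")
```

Because the dictionary keys themselves are never shortened, the two ranges that start at 10.102.158.0 both survive.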
3) The best solution
Whenever you are dealing with sorting something, it's always easiest to think about a much simpler problem: given two items, how would I compare them? Python gives us a way to do this. Sorting compares items with "<", so what we have to do is implement two data model methods called
__lt__
and
__eq__
Let's try doing that:
class IPAddress:
    def __init__(self, ip_address):
        self.ip_address = ip_address  # This will be the full IP address
    def __lt__(self, other):
        """ Is this object less than the other one?"""
        # First, let's find the first parts of the ip addresses
        this_first_ip = self.ip_address.split("-")[0]
        other_first_ip = other.ip_address.split("-")[0]
        # Now let's put them into the external library
        this_object = ipaddress.ip_address(this_first_ip)
        other_object = ipaddress.ip_address(other_first_ip)
        return this_object < other_object
    def __eq__(self, other):
        """Are the two objects equal?"""
        return self.ip_address == other.ip_address
Cool, we have a class. Now, the data model methods will automatically be invoked any time I use "<" or "==". Let's check that it is working:
test_ip_1 = IPAddress("10.102.152.64-10.102.152.95")
test_ip_2 = IPAddress("10.102.158.0-10.102.158.255")
print(test_ip_1 < test_ip_2)
Now, the beauty of these data model methods is that Python's "sort" and "sorted" will use them as well:
dictionary_keys = sresult.keys()
dictionary_key_objects = [IPAddress(key) for key in dictionary_keys]
sorted_dictionary_key_objects = sorted(dictionary_key_objects)
# According to your latest comment, the line below is what you are missing
sorted_dictionary_keys = [obj.ip_address for obj in sorted_dictionary_key_objects]
And now you can do:
for key in sorted_dictionary_keys:
    print(key)
    print(sresult[key])
The Python data model is almost the defining feature of Python. I'd recommend reading about it.
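If you also want <=, > and >= to work on such objects, the standard library's functools.total_ordering can derive them from __eq__ plus one ordering method such as __lt__. A small sketch, with IPRange as a hypothetical class name and equality defined on the start address only:

```python
import functools
import ipaddress


@functools.total_ordering
class IPRange:
    """Wraps a 'start-end' string and orders by the start address."""

    def __init__(self, text):
        self.text = text
        self.start = ipaddress.ip_address(text.split("-")[0].strip())

    def __eq__(self, other):
        return self.start == other.start

    def __lt__(self, other):
        return self.start < other.start


a = IPRange("10.102.152.64-10.102.152.95")
b = IPRange("10.102.158.0-10.102.158.255")
```

With the decorator in place, a <= b and b >= a work without writing those methods by hand.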

python: sort list with multiple values

I have a list of CrewRecords objects.
crew_record = list[<CrewRecords instance at 0x617bb48>,
<CrewRecords instance at 0x617b9e0>,
<CrewRecords instance at 0x5755680>]
where:
class CrewRecords():
    def __init__(self):
        self.crew_id = None
        self.crew_date_of_hire = None
        self.crew_points = None
    def crew_attributes(self,crew_bag):
        ''' populate the values of crew with some values'''
        self.crew_id = crew_bag.crew.id()
        self.crew_date_of_hire = crew_bag.crew.date_of_hire()
        self.crew_points = crew_bag.crew_points()
Now, I want to write a function in Python which takes 3 arguments and sorts the list by the preferences provided, i.e.
if the user inputs the values to sort by
options:
points, date_of_hire, id
points, id, date_of_hire
date_of_hire, points, id
etc., sort based on user input.
Then the function should sort accordingly, i.e.
if the 1st option is chosen, then sort all crew by points; if 2 crew have the same points then sort by date_of_hire; if date_of_hire is also the same then sort by id.
Also, later if the sort options increase (for example, the user wants to sort by name) we should be able to easily extend the sort criteria.

Use the key keyword argument of sorted, i.e.
from operator import attrgetter
return sorted(crew_record, key=attrgetter('crew_points', 'crew_date_of_hire', 'crew_id'))
would solve your first point.
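To let the user pick the order, one possible sketch is to map option names to attribute names and pass them to attrgetter in priority order. SORT_FIELDS and sort_crew are hypothetical names, and the CrewRecords constructor here is a simplified stand-in for the class in the question:

```python
from operator import attrgetter


class CrewRecords:
    """Stripped-down stand-in for the class in the question."""

    def __init__(self, crew_id, crew_date_of_hire, crew_points):
        self.crew_id = crew_id
        self.crew_date_of_hire = crew_date_of_hire
        self.crew_points = crew_points


# Maps the option names a user may type to attribute names.
# Supporting a new criterion (e.g. name) means adding one entry here.
SORT_FIELDS = {
    "points": "crew_points",
    "date_of_hire": "crew_date_of_hire",
    "id": "crew_id",
}


def sort_crew(crew_record, *options):
    """Sort by the given options in priority order, e.g. 'points', 'id'."""
    attrs = [SORT_FIELDS[name] for name in options]
    return sorted(crew_record, key=attrgetter(*attrs))


# Made-up sample data; dates are ISO strings so they compare correctly.
crew = [
    CrewRecords(3, "2010-01-01", 50),
    CrewRecords(1, "2012-05-01", 50),
    CrewRecords(2, "2010-01-01", 70),
]
by_points_then_date = sort_crew(crew, "points", "date_of_hire", "id")
by_points_then_id = sort_crew(crew, "points", "id", "date_of_hire")
```

With several attribute names attrgetter returns a tuple, and tuples compare element by element, which gives the tie-breaking behaviour described in the question.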

Python - How to pull an attribute out of list of custom objects and convert it to a float?

I am a novice Python user so I hope I haven't missed something basic but I feel I've done my due diligence in trying to research this problem on my own so here goes.
In brief, I am writing a program that will analyze sports statistics and ultimately generate a rating for the strength of each team based on the chosen statistics.
I can successfully read in simple csv files and I'm reasonably happy with the custom object class I have created to store the statistics as attributes for each team but I am running in to an issue when I go to calculate the rating. Essentially, I need to sort all the teams by each statistic I am interested in, rank the teams by this statistic and assign a point value for the rank of each one. This will produce a cumulative rating score based on the rank for each statistic. However, I'm having some issues in getting the statistic value as a float which I think I need to do in order to sort properly.
Here's the code I've tried:
I've created a team object as seen below. This version is stripped down of most of the attributes for ease of reading but the additional attributes are all very similar.
class team(object):
    def __init__(self,teamName="",passOffYc=0, passOffAya=0):
        self.teamName = teamName
        self.passOffYc = passOffYc
        self.passOffAya = passOffAya
        self.score = 0
These objects are constructed from a csv file that has a header with the statistical categories and each row represents a team and its stats. I am reading in the file using csv.DictReader like this:
league= []
with open(passoffense) as csvfile:
    statreader = csv.DictReader(csvfile, delimiter=',')
    for row in statreader:
        newTeam = team(row["Tm"],row["Y/C"],row["AY/A"])
        print(newTeam, "added")
        league.append(newTeam)
At this point I think I have a list called league that contains a team object for each row in the csv file and the teamname, passOffYc, and passOffAya attributes have taken the values for the row elements Tm, Y/C, and AY/A. These are the team name, Yards per Catch, and Average Yards per Attempt so the second two are all decimal numbers.
To try to create the score for each team object, I'd like to sort the league list first by passOffYc, determine the rank of each team, and then repeat for passOffAya and so on for all the attributes in the full version of the program.
I've attempted two different implementations of this trying to understand attrgetter or lambda functions.
My attrgetter thoughts are something like this:
sortedTeams = sorted(league, key = attrgetter("passOffYc"))
My understanding is that this would sort the list called league according to the attribute passOffYc but the issue I'm encountering is that the sort is not producing the expected output.
If passOffYc was 19, 23, 14, 7, and 9, I am expecting the sort to result in 7, 9, 14, 19, 23. However it will end up sorting as 14, 19, 23, 7, 9. My research has led me to believe this is because the values are strings and not integers (or more accurately floats, as some values have decimals). Not quite sure how to fix this, I tried:
sortedTeams = sorted(league, key = float(attrgetter("passOffYc")))
But this gives me the error:
TypeError: float() argument must be a string or a number, not 'operator.attrgetter'
So apparently it isn't a string and instead is an operator.attrgetter object. I haven't been able to figure out how to get the key for the sorted function to the float type so I also tried using lambda functions I read about while searching:
sortedTeams = sorted(league, key = lambda team: float(team.passOffYc))
This seems very close to what I'd like to happen as it does sort properly but then I run in to a scaling problem. Since I have more than 20 attributes to sort by in the full version of my program I'd like to avoid needing to type the above statement 20 times to accommodate each attribute.
I thought of trying to define a function something to the effect of:
def score(stat):
    sortedTeams = sorted(league, key = lambda team: float(team.stat))
I thought this would allow me to pass in to the lambda function which stat I want to sort by but I then get the error:
AttributeError: 'team' object has no attribute 'stat'
I think this is because it may not be possible to pass an argument to a lambda function or that I'm not understanding something because I also tried:
sortedTeams = sorted(league, key = lambda team, stat=stat: float(team.stat))
Which resulted in the same error. Whew! If you're still reading this thank you for wading through my essay. Any ideas how I can solve this?
Once I get the sorting to work properly and can scale it I intend to enumerate the sorted lists to obtain the ranks and I'm beginning to think about strategies to address ties. Thank you again for any help!

You just need to create the original team entries with float contents:
newTeam = team(row["Tm"],float(row["Y/C"]),float(row["AY/A"]))
If instead you want to pursue the lambda approach you can use:
sortedTeams = sorted(league, key = lambda team: float(attrgetter("passOffYc")(team)))
This could similarly be used in your score function. What you were missing is that attrgetter returns a function. You can then call that function by passing it an argument (in this case team). Then that result can be passed to float. In that function you could use:
lambda team: float(attrgetter(stat)(team))

As I understand it, you want to pass the name of the desired field into the function as a string. If that is right, then instead of:
def score(stat):
    sortedTeams = sorted(league, key = lambda team: float(team.stat))
Try this:
def score(stat):
    sortedTeams = sorted(league, key = lambda team: float(getattr(team, stat)))
Some explanation. team.stat - accessing an attribute with name "stat" for object team. getattr(team, stat) - accessing an attribute with name given in the stat variable for object team.
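Extending that to the ranking step described in the question, a possible sketch follows. The Team class is a minimal stand-in, add_rank_points is a hypothetical helper, and awarding points equal to the ascending rank (with ties broken only by sort order) is my assumption, not something the question specifies:

```python
class Team:
    """Minimal stand-in for the question's team class; stats arrive as strings."""

    def __init__(self, teamName, passOffYc, passOffAya):
        self.teamName = teamName
        self.passOffYc = passOffYc
        self.passOffAya = passOffAya
        self.score = 0


def add_rank_points(league, stats):
    """For each stat name, sort ascending and add each team's rank to its score."""
    for stat in stats:
        ordered = sorted(league, key=lambda t: float(getattr(t, stat)))
        for rank, team in enumerate(ordered, start=1):
            team.score += rank


# Made-up sample values, stored as strings the way csv.DictReader delivers them.
league = [Team("A", "19", "7.1"), Team("B", "7", "6.0"), Team("C", "9.5", "8.2")]
add_rank_points(league, ["passOffYc", "passOffAya"])
```

Passing a list of attribute names means the 20 or so statistics need one list entry each, not one sorted call each.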

Python: Compare values from 2 dictionaries with the same key

Hello, I am new to Python and I have a question about dictionaries:
Let's say we have a dictionary:
Cars = {"Audi": {"Wheels": 4, "Body": 1, "Other": 20},"Ford": {"Wheels": 2, "Body": 3, "Other":10},"BMW": {"Wheels": 5, "Body": 0.5, "Other": 30}}
And another dictionary:
Materials = {"Wheels": 30, "Body": 5, "Other": 110}
I want to return the number of cars I can produce with the materials I have, so:
def production(car,Materials):
    return
production("Audi",Materials)
My output in this example should return the number 5, because there are only 5 body parts to use.
I was thinking of doing it somehow like this:
Divide the values from Materials by the values from Cars. Then write the numbers to another list, and then return the minimum of them.
More examples:
production("BMW",Materials)
3.0 # because the value of key Other is 110 and for 3 cars we need 90 Other
production("Ford",Materials)
1.0 # because the value of key Body is 5 and for 1 car we need 3 Body
I thank you in advance for everything.

If what you want is to see how many of any given car can be created without actually affecting the contents of Materials, you could write your method like so:
def number_of_units_creatable(car_key):
    required_parts = Cars[car_key]
    return min(Materials["Wheels"] // required_parts["Wheels"],
               Materials["Body"] // required_parts["Body"],
               Materials["Other"] // required_parts["Other"])
In production, you'd want to add conditional guards to check whether your Cars and Materials have all the required keys. You'll get an exception if you try to get the value for a key that doesn't exist.
This will allow you to figure out the maximum number of any given car you can create with the resources available in Materials.
I'd strongly recommend you not use nested dicts like this, though - this design would be greatly helped by creating, say, a Materials class, and storing this as your value rather than another dictionary. abarnert has a little more on this in his post.
Another note, prompted by abarnert - it's an extremely bad idea to rely on a shared, static set of keys between two separate dictionaries. What happens if you want to build, say, an armored car, and now you need a gun? Either you have to add Gun: 0 within the required attributes of every car, or you'll run into an exception. Every single car will require an entry for every single part required by each and every car in existence, and a good deal of those will signify nothing other than the fact that the car doesn't need it. As it stands, your design is both very constraining and brittle - chances are good it'll break as soon as you try to add something new.

If the set of possible materials is a static collection—that is, it can only have "Wheels", "Body", and "Other"—then you really ought to be using a class rather than a dict, as furkle's answer suggests, but you can fake it with your existing data structure, as his answer shows.
However, if the set of possible materials is open-ended, then you don't want to refer to them one by one explicitly; you want to loop over them. Something like:
for material, count in car.items():
In this case:
return min(Materials[material] // count for material, count in car.items())
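As a complete function, the loop-over-materials idea might look like this. Using materials.get(part, 0), so that a part missing from stock counts as zero, is my own addition and not part of the answer:

```python
Cars = {
    "Audi": {"Wheels": 4, "Body": 1, "Other": 20},
    "Ford": {"Wheels": 2, "Body": 3, "Other": 10},
    "BMW": {"Wheels": 5, "Body": 0.5, "Other": 30},
}
Materials = {"Wheels": 30, "Body": 5, "Other": 110}


def production(car, materials):
    """Return how many units of `car` the given materials allow."""
    required = Cars[car]
    # The scarcest part limits the whole build, hence min().
    return min(materials.get(part, 0) // count for part, count in required.items())
```

Only the parts a car actually needs are examined, so adding a new kind of part to one car does not require touching the others.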

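You can iterate over the materials and subtract what each car needs until one of them runs out:
def production(car, materials):
    count = 0
    required = Cars[car]
    # Keep building while every required part is still in stock.
    while all(materials[part] >= need for part, need in required.items()):
        for part, need in required.items():
            materials[part] -= need
        count += 1
    return count
If you don't want to change the material dict:
def production(car, materials):
    remaining = dict(materials)  # work on a copy
    count = 0
    required = Cars[car]
    while all(remaining[part] >= need for part, need in required.items()):
        for part, need in required.items():
            remaining[part] -= need
        count += 1
    return count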

ArcMap Field Calculator Program to create Unique ID's

I'm using the Field Calculator in ArcMap and
I need to create a unique ID for every storm drain in my county.
An ID Should look something like this: 16-I-003
The first number is the municipal number which is in the column/field titled "Munic"
The letter is using the letter in the column/field titled "Point"
The last number is simply just 1 to however many drains there are in a municipality.
So far I have:
rec=0
def autoIncrement():
    global rec
    pStart=1
    pInterval=1
    if(rec==0):
        rec=pStart
    else:
        rec=rec+pInterval
    return "16-I-" '{0:03}'.format(rec)
So you can see that I have manually been typing in the municipal number, the letter, and the hyphens. But I would like to use the fields: Munic and Point so I don't have to manually type them in each time it changes.
I'm a beginner when it comes to python and ArcMap, so please dumb things down a little.

I'm not familiar with ArcMap, so I can't directly help you, but you might just change your function to a generator as such:
def StormDrainIDGenerator():
    rec = 0
    while (rec < 99):
        rec += 1
        yield "16-I-" '{0:03}'.format(rec)
If you are ok with that, then parameterize the generator to accept the Munic and Point values and use them in your formatting string. You probably should also parameterize the ending value as well.
Use of a generator will allow you to drop it into any later expression that accepts an iterable, so you could create a list of such simply by saying list(StormDrainIDGenerator()).
Is your question how to get the Munic and Point values into the ID string using .format()?

I think you can use following code to do that.
rec = 0
def autoIncrement(a,b):
    global rec
    pStart=1
    pInterval=1
    if(rec==0):
        rec=pStart
    else:
        rec=rec+pInterval
    # {0} is a, {1} is b, {2:03} is rec zero-padded to three digits
    r = "{0}-{1}-{2:03}".format(a,b,rec)
    return r
and call
autoIncrement( !Munic! , !Point! )
The r = "{0}-{1}-{2:03}".format(a,b,rec) just replaces the {}s with the values of a and b, which are the values of Munic and Point passed to the function, followed by the running counter rec.
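Since the question asks for the last number to run from 1 within each municipality, a per-municipality counter may be closer to what is wanted. The sketch below is plain Python that I have not tested inside the Field Calculator; counters and make_drain_id are hypothetical names, and it assumes rows arrive in the order you want them numbered:

```python
import collections

# One running counter per municipality number.
counters = collections.defaultdict(int)


def make_drain_id(munic, point):
    """Return e.g. '16-I-003': municipality, point letter, per-municipality count."""
    counters[munic] += 1
    return "{0}-{1}-{2:03}".format(munic, point, counters[munic])
```

Called as make_drain_id( !Munic! , !Point! ), the first drain seen in municipality 16 with point I would get 16-I-001, and a drain in another municipality would start again at 001.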
