I'm trying to figure out whether I should make my output a list or a dictionary. Which one would be easier to verify when you are trying to match 3 different outputs with millions of lines?
I'm going to take two different outputs and verify three pieces for each line.
The problem is that later I'll have an output C, which I want to verify against outputs A and B. I'm leaning towards a dictionary?
For example (bold is what I'll be matching):
output A:
10.1.1.1:80 10.2.1.1:81 10.1.1.1:80 10.3.1.1:81 etc etc etc name
...
...
...
output B:
name etc etc etc etc 10.1.1.1/16 10.2.1.1/16
...
...
...
The key difference between a list and a dictionary is the way you access your data. With a dictionary you use keys, whereas with a list you use integer indexes. Which one should you use here? Sometimes this is a performance question and sometimes it's about code clarity. I don't think the latter applies to your case; performance is likely to be your greatest concern, because you're dealing with millions of lines to be processed and matched. There have been many internal optimizations for Python objects, so it's better to focus on what you're trying to achieve than on the objects themselves. I could very easily recommend dictionaries, but would that be a good recommendation for later developments in your code? I don't know; unless we work on that specific code, any recommendation would be vague.
If you care about performance, this might help:
Python: List vs Dict for look up table
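To make the trade-off concrete, here is a minimal sketch of the dictionary approach for the A/B outputs in the question. The field layout is an assumption read off the two example lines (in A the name is the last field and the first two fields are ip:port; in B the name is the first field and the last two fields are ip/prefix), and the helper names index_by_name and unmatched are hypothetical.

```python
def index_by_name(lines_b):
    """Map each name in output B to the set of bare IPs on its line."""
    index = {}
    for line in lines_b:
        fields = line.split()
        if len(fields) < 3:
            continue  # skip blank or malformed lines
        name = fields[0]
        # Assumed layout: the last two fields are "ip/prefix" entries.
        index[name] = {field.split("/")[0] for field in fields[-2:]}
    return index


def unmatched(lines_a, index_b):
    """Return (name, ip) pairs from output A that output B does not confirm."""
    problems = []
    for line in lines_a:
        fields = line.split()
        if len(fields) < 3:
            continue
        name = fields[-1]
        # Assumed layout: the first two fields are "ip:port" entries.
        ips_a = [field.split(":")[0] for field in fields[:2]]
        ips_b = index_b.get(name, set())  # one hash lookup per line of A
        for ip in ips_a:
            if ip not in ips_b:
                problems.append((name, ip))
    return problems


output_a = ["10.1.1.1:80 10.2.1.1:81 10.1.1.1:80 10.3.1.1:81 etc etc etc name"]
output_b = ["name etc etc etc etc 10.1.1.1/16 10.2.1.1/16"]

index_b = index_by_name(output_b)
problems = unmatched(output_a, index_b)
print(problems)
```

Because B is indexed once, each line of A costs a single dictionary lookup instead of a scan through a list, and an output C could later be checked against the same index.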
Since I recently started a new project, I'm stuck in the "think before you code" phase. I've always done basic coding, but I really think I now need to carefully plan how I should organize the results that are produced by my script.
It's essentially quite simple: I have a bunch of satellite data I'm extracting from Google Earth Engine, including different sensors, different acquisition modes, etc. What I would like to do is to loop through a list of "sensor-acquisition_mode" couples, request the data, do some more processing, and finally save it to a variable or file.
Suppose I have the following example:
sensors = ['landsat','sentinel1']
sentinel_modes = ['ASCENDING','DESCENDING']
sentinel_polarization = ['VV','VH']
In the end, I would like to have some sort of nested data structure that at the highest level has the elements 'landsat' and 'sentinel1'; under 'landsat' I would have a time and values matrix; under 'sentinel1' I would have the different modes and then as well the data matrices.
I've been thinking about lists, dictionaries or classes with attributes, but I really can't make up my mind, especially since I don't have that much experience.
At this stage, a little help in the right direction would be much appreciated!
Lists: Don't use lists for nested and complex data structures. You're just shooting yourself in the foot: the code you write will be specialized to the exact format you are using, and any changes or additions will be brutal to implement.
Dictionaries: Aren't bad. They nest nicely, and you can use a dictionary whose values are dictionaries to hold named info about the keys. This is probably the easiest choice.
Classes: Classes are really useful for this if you need a lot of behavior to go with the data: you want its string representation to look a certain way, you want to be able to use built-in operators for some functionality, or you just want to make the code slightly more readable or reusable.
From there, it's all your choice. If you want to go through the extra code (it's good for you) of writing them as classes, do it! Otherwise, dictionaries will get you where you need to go. Notably, the only thing a dictionary can't do is hold two entries with the same name at the same key level (dicts don't allow duplicate keys).
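As a rough sketch of the dictionary option, using the lists from the question: the 'time'/'values' layout and the extra polarization level under each mode are assumptions, and the appended sample is made up purely to show how access works.

```python
sensors = ['landsat', 'sentinel1']
sentinel_modes = ['ASCENDING', 'DESCENDING']
sentinel_polarization = ['VV', 'VH']


def empty_series():
    # Hypothetical placeholder for the "time and values matrix".
    return {'time': [], 'values': []}


results = {
    'landsat': empty_series(),
    'sentinel1': {
        mode: {pol: empty_series() for pol in sentinel_polarization}
        for mode in sentinel_modes
    },
}

# Made-up sample, only to illustrate how a result would be stored.
results['sentinel1']['ASCENDING']['VV']['time'].append('2020-01-01')
results['sentinel1']['ASCENDING']['VV']['values'].append(-12.5)

print(sorted(results['sentinel1']))
```

Adding a new sensor or mode later only means adding a key, which is the flexibility that nested lists would not give you.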
Just curious more than anything else, but why isn't a dictionary such as the one below kept in the order in which it was created? When I print out test, though, it returns the same order from then on...
test = {'one':'1', 'two':'2', 'three':'3', 'four':'4'}
It's not that I need them ordered, but it's been on my mind for a while as to what is occurring here.
The only thing I've found on this is a quote from this article:
Python uses complex algorithms to determine where the key-value pairs are stored in a dictionary.
But what are these "complex algorithms" and why?
Python needs to be able to access D[thing] quickly.
If it stores the values in the order that it receives them, then when you ask it for D[thing], it doesn't know in advance where it put that value. It has to go and find where the key thing appears and then find that value. Since it has no control over the order these are received, this would take about N/2 steps on average where N is the number of keys it's received.
But if instead it has a function (called a hash) that can turn thing into a location in memory, it can quickly take thing, calculate that value, and check that spot in memory. Of course, this comes with a bit more overhead: checking that D[thing] has actually been defined, and handling those rare cases where you have defined D[thing1] and D[thing2] and the hashes of thing1 and thing2 happen to be the same (in which case a "collision" occurs and Python has to figure out a new place to put one of them).
So for your example, you might expect that when you search for test['four'] it just goes to the last entry in a list it has stored and says "aha, that's '4'." But it can't just do that. How does it know that 'four' corresponds to the last entry of the list? The keys could have come in any order, so it would have to create some other data structure that lets it quickly tell that 'four' was the last entry. This would take a lot of overhead.
It would be possible to make it output things in the order they were entered, but that would still require additional overhead tracking the order things were entered.
For the curious: if you want an ordered dictionary, use collections.OrderedDict. (Since Python 3.7, the built-in dict also preserves insertion order.)
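The hashing idea described above can be illustrated with a toy table. This is only a sketch: ToyHashTable is a hypothetical class, it resolves collisions by chaining inside each slot, and CPython's real dict uses a different and more elaborate scheme. Small integers are used as keys because in CPython they hash to themselves, which makes the slot numbers easy to follow.

```python
class ToyHashTable:
    """Toy hash table: hash(key) picks a slot, collisions chain in a list."""

    def __init__(self, n_slots=8):
        self.slots = [[] for _ in range(n_slots)]

    def _slot(self, key):
        # The hash turns the key directly into a position, with no searching.
        return self.slots[hash(key) % len(self.slots)]

    def __setitem__(self, key, value):
        slot = self._slot(key)
        for i, (existing, _) in enumerate(slot):
            if existing == key:
                slot[i] = (key, value)  # overwrite an existing key
                return
        slot.append((key, value))       # new key (or a collision)

    def __getitem__(self, key):
        for existing, value in self._slot(key):
            if existing == key:
                return value
        raise KeyError(key)

    def keys(self):
        # Walks the slots in memory order, not in insertion order.
        return [key for slot in self.slots for key, _ in slot]


table = ToyHashTable()
for key in (10, 3, 17):        # inserted in this order
    table[key] = str(key)

print(table.keys())            # slot order: 17 (slot 1), 10 (slot 2), 3 (slot 3)
```

Lookups never scan the whole table, and the price is that walking the slots yields keys in an order unrelated to the order of insertion.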
I have two documents that are mostly the same, but with some small differences I want to ignore. Specifically, I know that one has hex values written as "0xFFFFFFFF" while the other has them just as "FFFFFFFF"
Basically, these two documents are lists of variables, their values, their location in memory, size, etc.
But another problem is that they are not in the same order either.
I tried a few things, one being to just pack them all up in two lists of lists and check whether each inner list has a counterpart in the other, but with almost 100,000 variables the time this takes is ridiculous (on the order of nearly an hour), so that isn't going to work. I'm not very seasoned in Python, or even in the Pythonic way of doing things, so I'm sorry if there is a quick and easy way to do this.
I've read a few other similar questions, but they all assume the entries are 100% identical, along with other things that aren't true in my case.
Basically, I have two .txts that have series of lines that look like:
***************************************
Variable: Var_name1
Size: 4
Address: 0x00FF00F0 .. 0x00FF00F3
Description: An awesome variable
..
***************************************
I don't care if the Descriptions are different; I just want to make sure that every variable has the same length and is in the same place, address-wise, and if there are any differences, I want to see them. I also want to be sure that every variable in one is present in the other.
And again, the addresses in the first one are written with the hex radix and in the second one without it. And they are in a different order.
--- Output ---
I don't really care about the output's format as long as it is human readable. Ideally, it'd be a .txt document that said something like:
"Var_name1 does not exist in list two"
"Var_name2 has a different size. (Size1, Size2)"
"Var_name4 is located in a different place. (Loc1, Loc2)"
COMPLETE RE-EDIT
[My initial suggestion was to use sets, but further discussion via the comments made me realize that that was nonsense, and that a dictionary was the real solution.]
You want a dictionary; keyed on variable name; and where the value is a list or a tuple or a nested dictionary or even an object, containing size and address. You can add each variable name to the dictionary and update the values as needed.
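For comparing the addresses, a regex would do it, but you can probably get by with less overhead using a plain substring test with the in operator (Python strings have no contains method), or by stripping the "0x" prefix before comparing.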
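A minimal sketch of that dictionary approach, assuming each record uses the "Variable:", "Size:" and "Address:" labels shown in the question. The function names parse, normalize_address and compare are hypothetical, and the two sample documents are invented for illustration.

```python
import re

FIELD = re.compile(r"^(Variable|Size|Address):\s*(.+?)\s*$")


def normalize_address(text):
    """Drop any 0x prefix and unify case and spacing so both styles match."""
    return " ".join(re.sub(r"\b0[xX]", "", text).upper().split())


def parse(text):
    """Return {variable_name: {"Size": ..., "Address": ...}} for one document."""
    variables = {}
    name = None
    for line in text.splitlines():
        match = FIELD.match(line.strip())
        if not match:
            continue  # separators, descriptions and other lines are ignored
        key, value = match.groups()
        if key == "Variable":
            name = value
            variables[name] = {"Size": None, "Address": None}
        elif name is not None:
            variables[name][key] = normalize_address(value) if key == "Address" else value
    return variables


def compare(first, second):
    """List human-readable differences between two parsed documents."""
    report = []
    for name, info in first.items():
        other = second.get(name)  # dictionary lookup, no scanning
        if other is None:
            report.append(f"{name} does not exist in list two")
            continue
        if info["Size"] != other["Size"]:
            report.append(f"{name} has a different size. ({info['Size']}, {other['Size']})")
        if info["Address"] != other["Address"]:
            report.append(f"{name} is located in a different place. ({info['Address']}, {other['Address']})")
    for name in second:
        if name not in first:
            report.append(f"{name} does not exist in list one")
    return report


doc_one = """\
Variable: Var_name1
Size: 4
Address: 0x00FF00F0 .. 0x00FF00F3
Description: An awesome variable
Variable: Var_name2
Size: 2
Address: 0x00FF00F4 .. 0x00FF00F5
"""
doc_two = """\
Variable: Var_name2
Size: 4
Address: 00FF00F4 .. 00FF00F5
Variable: Var_name1
Size: 4
Address: 00FF00F0 .. 00FF00F3
Description: different text
"""

report = compare(parse(doc_one), parse(doc_two))
print("\n".join(report))
```

Each document is read once and every comparison is a dictionary lookup, so the work grows roughly linearly with the number of variables rather than quadratically as with the list-of-lists approach.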
I had an interview question along these lines:
Given two lists of unordered customers, return a list of the intersection of the two lists. That is, return a list of the customers that appear in both lists.
Some things I established:
Assume each customer has a unique name
If the name is the same in both lists, it's the same customer
The names are of the form first name last name
There's no trickery of II's, Jr's, weird characters, etc.
I think the point was to find an efficient algorithm/use of data structures to do this as efficiently as possible.
My progress went like this:
Read one list in to memory, then read the other list one item at a time to see if there is a match
Alphabetize both lists then start at the top of one list and see if each item appears in the other list
Put both lists into ordered lists, then use the shorter list to check item by item (that way, it one list has 2 items, you only check those 2 items)
Put one list into a hash, and check for the existence of keys from the other list
The interviewer kept asking, "What next?", so I assume I'm missing something else.
Any other tricks to do this efficiently?
Side note, this question was in python, and I just read about sets, which seem to do this as efficiently as possible. Any idea what the data structure/algorithm of sets is?
It really doesn't matter how it's implemented ... but I believe it is implemented in C, so it is fast. set([1,2,3,4,5,6]).intersection([1,2,5,9]) is likely what they wanted.
In Python readability counts for a lot, and set operations in Python are used extensively and well vetted...
that said another pythonic way of doing it would be
list_new = [itm for itm in listA if itm in listB]
or
list_new = list(filter(lambda itm: itm in listB, listA))
Basically, I believe they were testing whether you were familiar with Python, not whether you could implement the algorithm, since they asked a question that is so well suited to Python.
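For comparison, here is a small sketch of the set-based version. The customer names are made up and common_customers is a hypothetical helper. The point is that membership tests against a set take constant time on average, whereas "itm in listB" rescans the list on every test.

```python
def common_customers(list_a, list_b):
    """Return the customers present in both lists, in the larger list's order."""
    smaller, larger = sorted((list_a, list_b), key=len)
    seen = set(smaller)                      # hash only the smaller list
    return [name for name in larger if name in seen]


list_a = ["Ann Lee", "Bob Ray", "Cy Dow"]
list_b = ["Cy Dow", "Dee Fox", "Ann Lee"]

print(common_customers(list_a, list_b))
print(set(list_a) & set(list_b))             # same result as an unordered set
```

Building the set costs one pass over one list and the filter costs one pass over the other, so the total work is roughly proportional to the combined length of the two lists.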
Put one list into a bloom filter and use that to filter the second list.
Put the filtered second list into a bloom filter and use that to filter the first list.
Sort the two lists and find the intersection by one of the methods above.
The benefit of this approach (besides letting you use a semi-obscure data structure correctly in an interview) is that it doesn't require any O(n) storage until after you have (with high probability) reduced the problem size.
The interviewer kept asking, "What next?", so I assume I'm missing something else.
Maybe they would just keep asking that until you run out of answers.
http://code.google.com/p/python-bloom-filter/ is a python implementation of bloom filters.
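The two filtering passes described above can be sketched with a small hand-rolled Bloom filter. This is not the API of the library linked above: the BloomFilter class, its default sizes and the intersect helper are all assumptions made for illustration, and the last step uses an exact set intersection in place of sorting.

```python
import hashlib


class BloomFilter:
    """Minimal Bloom filter: false positives are possible, false negatives are not."""

    def __init__(self, n_bits=1024, n_hashes=3):
        self.n_bits = n_bits
        self.n_hashes = n_hashes
        self.bits = bytearray((n_bits + 7) // 8)

    def _positions(self, item):
        # Derive n_hashes bit positions by salting one hash function.
        for seed in range(self.n_hashes):
            digest = hashlib.sha256(f"{seed}:{item}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.n_bits

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item):
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(item))


def intersect(list_a, list_b):
    filter_a = BloomFilter()
    for name in list_a:
        filter_a.add(name)
    maybe_b = [name for name in list_b if name in filter_a]   # pass 1

    filter_b = BloomFilter()
    for name in maybe_b:
        filter_b.add(name)
    maybe_a = [name for name in list_a if name in filter_b]   # pass 2

    # Exact check on the (hopefully much smaller) candidate lists.
    return sorted(set(maybe_a) & set(maybe_b))


list_a = ["Ann Lee", "Bob Ray", "Cy Dow"]
list_b = ["Cy Dow", "Dee Fox", "Ann Lee"]
print(intersect(list_a, list_b))
```

The filters only ever discard names that cannot be in the other list, so the final exact step still returns the true intersection; any false positives they let through are removed there.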
I have this somewhat big data structure that stores pairs of data. The individual data is tiny and readily hashable, and there are something like a few hundred thousand data points in there.
At first, this was a simple dict that was accessed only by keys. Later on, however, I discovered that I also needed to access it by value, that is, get the key for a certain value. Since this was done somewhat less often (~1/10) than access by key, I naïvely implemented it by simply iterating over all of the dict's items(), which proved a bit sluggish at a few hundred thousand calls per second: it is about 500 times slower.
So my next idea was to just save the reverse dict, too. This seems to be a rather inelegant solution however, so I turn to you guys for help.
Do you know any data structure in Python that stores data pairs that can be accessed by either data point of the pair?
You could try bidict.
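bidict is a third-party package. To show the underlying idea using only the standard library, here is a minimal sketch that keeps a forward and a reverse dict in sync. TwoWayDict and key_for are hypothetical names, not bidict's API, and the sketch assumes values are hashable and unique, as a one-to-one mapping requires.

```python
class TwoWayDict:
    """One-to-one mapping with constant-time lookup in both directions."""

    def __init__(self):
        self._forward = {}    # key -> value
        self._backward = {}   # value -> key

    def __setitem__(self, key, value):
        # Remove any stale pair that shares this key or this value,
        # so the two dicts never disagree.
        if key in self._forward:
            del self._backward[self._forward[key]]
        if value in self._backward:
            del self._forward[self._backward[value]]
        self._forward[key] = value
        self._backward[value] = key

    def __getitem__(self, key):
        return self._forward[key]

    def key_for(self, value):
        return self._backward[value]

    def __len__(self):
        return len(self._forward)


pairs = TwoWayDict()
pairs["a"] = 1
pairs["b"] = 2
print(pairs["a"], pairs.key_for(2))
```

This costs roughly twice the memory of a single dict, but both directions become hash lookups instead of a scan over items().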