Optimal data structure for streaming data - python

I have a stream of data of the form [id, name, act, value, type].
id is an integer, name a string, act can be 'add', 'update' or 'delete', value is an integer, and type is either L or R. An id can only be added once, updated multiple times, and then deleted. I am obviously looking for a data structure that will let me insert this data efficiently.
I also need to be able to get the highest L value by name and the lowest R value by name at any moment, as fast as possible.
I believe I will need to use a heap to get the min and max values by name in constant time. My problem is that I can't find a way to also support deleting and updating existing data at the same time.

The phrasing is a bit unclear here. Let me try and rephrase: you are looking for a good data structure such that, given a stream of operations in the form given above, you can add, delete, or update items (found using their id number). And you'd also like to maintain a few summary statistics about the whole data structure such as highest L and lowest R value.
Does this sound correct?
A dictionary of dictionaries sounds like it's probably the right answer if your id numbers don't fall within a fixed range, or a list of dictionaries if they do.
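For what it's worth, a minimal sketch of the id-keyed dictionary idea; the field names and the assumption that an update only changes the value are mine, taken from the stream format in the question:

records = {}   # id -> {'name': ..., 'value': ..., 'type': ...}

def apply_op(rec_id, name, act, value, rec_type):
    if act == 'add':
        records[rec_id] = {'name': name, 'value': value, 'type': rec_type}
    elif act == 'update':
        records[rec_id]['value'] = value   # assuming only the value changes
    elif act == 'delete':
        del records[rec_id]

Every operation is a single dictionary lookup, so adds, updates and deletes are all O(1) on average.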

Sorting makes this a different sort of problem. So you are instead looking for a way to add and remove data entries in a data structure sorted alphabetically by their string names? One common way to do this is with a binary search tree (BST). A BST gives you an insertion time complexity of O(log(n)) with n elements in the tree, and at each node you can store the other data. You can then separately maintain the highest L and lowest R values and update them each time a value is added that exceeds one of those limits. If you remove a value equal to one of the limits, you'll have to traverse the whole data structure to find the new limit value.
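A rough sketch of that limit-maintenance idea, kept separate from however the entries themselves are stored; on_add/on_delete and the records mapping are assumptions (the lowest-R case is symmetric):

highest_L = {}   # name -> highest L value currently stored for that name

def on_add(name, value, rec_type):
    if rec_type == 'L' and value > highest_L.get(name, float('-inf')):
        highest_L[name] = value

def on_delete(name, value, rec_type, records):
    # only when the removed value was the current limit do we need to
    # re-traverse that name's remaining L entries to find the new limit
    if rec_type == 'L' and highest_L.get(name) == value:
        remaining = [r['value'] for r in records.values()
                     if r['name'] == name and r['type'] == 'L']
        if remaining:
            highest_L[name] = max(remaining)
        else:
            del highest_L[name]

Lookups of the current maximum are then O(1); the cost is the occasional full traversal when a limit value is deleted.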

Remove A Specific Instance of a Partially Duplicated Entry In a List In Python 3

I am relatively new to Python. However, my needs generally only involve simple string manipulation of rigidly formatted data files. I have a specific situation that I have scoured the web trying to solve and have come up blank.
This is the situation. I have a simple list of two-part entries, formatted like this:
name = ['PAUL;25', 'MARY;60', 'PAUL;40', 'NEIL;50', 'MARY;55', 'HELEN;25', ...]
And, I need to keep only one instance of any repeated name (ignoring the number to the right of the ' ; '), keeping only the entry with the highest number, along with that highest value still attached. So the answer would look like this:
ans = ['MARY;60', 'PAUL;40', 'HELEN;25', 'NEIL;50', ...]
The order of the elements in the list is irrelevant, but the format of the ans list entries must remain the same.
I can probably figure out a way to brute force it. I have looked at 2D lists, sets, tuples, etc. But, I can't seem to find the answer. The name list has about a million entries, so I need something that is efficient. I am sure it will be painfully easy for some of you.
Thanks for any input you can provide.
Cheers.
alkemyst
Probably the best data structure for this would be a dictionary, with the entries split up (and converted to integer) and later re-joined.
Something like this:
max_score = {}
for n in name:
    # split off the numeric part and convert it to an integer
    person, score_str = n.split(';')
    score = int(score_str)
    # keep only the highest score seen for each person
    if person not in max_score or max_score[person] < score:
        max_score[person] = score

# re-join into the original 'NAME;SCORE' string format
ans = [
    '%s;%s' % (person, score)
    for person, score in max_score.items()
]
This is a fairly common structure for many functions and programs: first convert the input to an internal representation (in this case, split and convert to integer), then do the logic or calculation (in this case, uniqueness and maximum), then convert to the required output representation (in this case, string separated with ;).
In terms of efficiency, this code looks at each input item once, then at each output item once; there's unlikely to be any approach that can do better than that (certainly not formally, and likely not in practice). All of the per-item operations are constant-time and fast. It accumulates the intermediate answer in memory (in max_score), but again that is unavoidable; if memory is an issue, the input and output could be changed to iterators/generators, but the whole intermediate answer has to be accumulated in max_score before any items can be output.
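If memory did become a concern, a hedged sketch of the iterator/generator variant mentioned above (reading the input from any iterable, e.g. a file object, and yielding the output lazily; max_score itself is still built in full before anything is emitted):

def best_scores(lines):
    max_score = {}
    for n in lines:
        person, score_str = n.split(';')
        score = int(score_str)
        if person not in max_score or max_score[person] < score:
            max_score[person] = score
    # output is produced lazily, but only after the whole input is consumed
    for person, score in max_score.items():
        yield '%s;%s' % (person, score)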

How to get nth record from result set of aerospike scan() output in Python, without looping through all results?

Probably a beginner question with Python.
I am able to iterate over the results of an Aerospike DB query like this:
import aerospike

client = aerospike.client(config).connect()
scan = client.scan('namespace', 'setName')
scan.select('PK', 'expiresIn', 'clientId', 'scopes', 'roles')  # scan from aerospike

def process_result(record_tuple):
    key, metadata, record = record_tuple
    expiresIn = record.get("expiresIn")

scan.foreach(process_result)
Now, all I want to do is get the nth record from this set, without having to iterate through all.
I tried looking at Get the nth item of a generator in Python but could not make much sense of it.
Results from a scan operation come from all the nodes in the cluster, pipelined, in no particular order. In that sense, there is no difference between the first record and the Nth record in terms of ordering. There is no order.
I wrote some Medium posts on how to sort results from a scan or query:
Sorted Results from a Secondary Index Query in Aerospike — Part I
Sorted Results from a Secondary Index Query — Part II
As usual, the workaround would be to set the scan policy to return just the digests, store them as a list (or as several records with smaller lists) and paginate over those with batch reads. You can set reasonable TTLs so that this result set lives for a reasonable length of time.
I can provide sample code if needed.
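For illustration only, a rough sketch of that workaround with the Python client (collect the keys with a scan, then page over them with batch reads); the page size and the key handling are my assumptions, not the sample code offered above:

keys = []
scan = client.scan('namespace', 'setName')
scan.foreach(lambda record_tuple: keys.append(record_tuple[0]))   # keep only the keys

PAGE_SIZE = 100

def get_page(page_number):
    # batch-read one page of full records by key
    page_keys = keys[page_number * PAGE_SIZE:(page_number + 1) * PAGE_SIZE]
    return client.get_many(page_keys)

records = get_page(2)   # e.g. the third page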

Sorting based on a variable number of sort keys input after execution

I am very new to Python and my apologies if this has already been answered. I can see a lot of previous answers to 'sort' questions, but my problem seems a little different from those questions and answers.
I have a list of keys, with each key contained in a tuple, that I am trying to sort. Each key is derived from a subset of the columns in a CSV file, but this subset is determined by the user at runtime and can't be hard coded as it will vary from execution to execution. I also have a datetime value that will always form part of the key as the last item in the tuple (so there will be at least one item to sort on - even if the user provides no additional items).
The tuples to be sorted look like:
(col0, col1, .... colN, datetime)
Where col0 to colN are based on the values found in columns in a CSV file, and the 'N' can change from run to run.
In each execution, the tuples in the list will always have the same number of items in each tuple. However, they need to be able to vary from run to run based on user input.
The sort looks like:
sorted(concurrencydict.keys(), key=itemgetter(0, 1, 2))
... when I hard-code the sort based on the first three columns. The issue is that I don't know in advance of execution how many items will need to be sorted - it may be 1, 2, 3 or more.
I hope this description makes sense.
I haven't been able to think of how I can get itemgetter to accept a variable number of values.
Does anyone know whether there is an elegant way of performing a sort based on a variable number of items in python where the number of sort items is determined at run time (and not based on fixed column numbers or attribute names)?
I guess I'll turn my comment into an answer.
You can pass a variable number of arguments (unpacked from an iterable) by using the * operator in the function call. In your specific case, you can put your user-supplied selection of column numbers to sort by into a sort_columns list or tuple, then call:
sorted_keys = sorted(concurrencydict.keys(), key=itemgetter(*sort_columns))
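For example, a quick illustration with made-up keys, two columns plus the trailing datetime, sorting on whichever columns were chosen at runtime:

from operator import itemgetter
from datetime import datetime

keys = [
    ('B', 'x', datetime(2020, 1, 2)),
    ('A', 'y', datetime(2020, 1, 1)),
    ('A', 'x', datetime(2020, 1, 3)),
]

sort_columns = (0, 2)   # supplied by the user at runtime
print(sorted(keys, key=itemgetter(*sort_columns)))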

Reordering pairs of values within an array so they are in sequence

I'm trying to build a solution to properly order an array of value pairs so that they end up in the correct sequence. Consider this example in Python:
theArray = [['Dempster St','Main St'],['Dempster St','Church St'],['Emerson St','Church St']]
I need to order the array so that in the end it looks like this:
theArray = [['Emerson St','Church St'],['Church St','Dempster St'],['Dempster St','Main St']]
Some considerations:
There is no guarantee that the pairs all point in the same direction. Ex: in the example above, the second array element has its pair pointing in the opposite direction from the rest (Dempster to Church instead of Church to Dempster)
The code should be built so that it could be used in both Python and C, so ideally it should be done without any language-specific tricks
At the end, it doesn't matter in which order the final array will be built, as long as the elements follow the correct order. For example, the solution below would also work:
theArray = [['Main St','Dempster St'],['Dempster St','Church St'],['Church St','Emerson St']]
Ideas?
I managed to make it work. I compared the elements of every pair against each other using nested loops so that I could check for their uniqueness (incrementing an associated counter whenever an item was found more than once, like a refcount); at the end, the two elements with the lowest count are the beginning and end of the route. From there it was quite easy to find the remaining connections.
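For what it's worth, a minimal sketch of that endpoint-counting idea (using collections.Counter and an adjacency map rather than literal nested loops, so it is a reworking of the approach rather than a transcription):

from collections import Counter, defaultdict

def order_pairs(pairs):
    # an endpoint that appears only once is the start or end of the route
    counts = Counter(p for pair in pairs for p in pair)
    start = next(p for p, c in counts.items() if c == 1)

    # adjacency map, so each pair can be followed in either direction
    neighbours = defaultdict(list)
    for a, b in pairs:
        neighbours[a].append(b)
        neighbours[b].append(a)

    # walk from one end to the other, re-emitting the pairs in order
    ordered, previous, current = [], None, start
    while len(ordered) < len(pairs):
        nxt = next(n for n in neighbours[current] if n != previous)
        ordered.append([current, nxt])
        previous, current = current, nxt
    return ordered

theArray = [['Dempster St', 'Main St'], ['Dempster St', 'Church St'], ['Emerson St', 'Church St']]
print(order_pairs(theArray))
# prints one of the two acceptable orderings, starting from whichever end is found first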

Large sortable data structures? Dictionary or something else?

I have a large Python dictionary (65535 key:value pairs), where the keys come from range(0, 65536) and the values are integers.
The solution I found to sorting this data structure is posted here:
Sort a Python dictionary by value
That solution works, but is not necessarily very fast.
To further complicate the problem, there is a potential for me to have many (thousands) of these dictionaries that I must combine prior to sorting. I am currently combining these dictionaries by iterating over the pairs in one dictionary, doing a key lookup in the other dictionary, and adding/updating the entry as appropriate.
This makes my question two fold:
1) Is a dictionary the right data structure for this problem? Would a custom tree or something else make more sense?
2) If a dictionary is the smart, reasonable choice, what is the ideal way to combine multiple dictionaries and then sort the result?
One solution may be for me to redesign my program's flow in order to decrease the number of dictionaries being maintained to one, though this is more of a last resort.
Thanks
A dictionary populated with 65535 entries with keys from the range (0:65536) sounds suspiciously like an array. If you need sorted arrays, why are you using dictionaries?
Normally, in Python, you would use a list for this type of data. In your case, since the values are integers, you might also want to consider using the array module. You should also have a look at the heapq module since if your data can be represented in this way, there is a builtin merge function that could be used.
In any case, if you need to merge data structures and produce a sorted data structure as a result, it is best to use a merge algorithm, and one possibility for that is mergesort.
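As a small illustration of that builtin merge, assuming each dictionary's items can be kept as an already-sorted list of (key, value) tuples:

import heapq

a = sorted({3: 30, 1: 10, 5: 50}.items())
b = sorted({2: 20, 4: 40}.items())

# heapq.merge lazily interleaves inputs that are already sorted,
# producing one sorted stream without re-sorting everything
merged = list(heapq.merge(a, b))
print(merged)   # [(1, 10), (2, 20), (3, 30), (4, 40), (5, 50)]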
There's not enough information here to say which data structure you should use, because we don't know what else you're doing with it.
If you need to be able to quickly insert records into the data structure later one at a time, then you do need a tree-like data structure, which unfortunately doesn't have a standard implementation (or even a standard interface, for some operations) in Python.
If you only need to be able to do what you said--to sort existing data--then you can use lists. The sorting is quick, especially if parts of the data are already sorted, and you can use binary searching for fast lookups. However, inserting elements will be O(n) rather than the O(log n) you'll get with a tree.
Here's a simple example, converting the dicts to a list of tuples, sorting the combined results, and using the bisect module to search for items.
Note that you can have duplicate keys, showing up in more than one dict. This is handled easily: they'll be sorted together naturally, and the bisection will give you a [start, end) range containing all of those keys.
If you want to add blocks of data later, append them to the end and re-sort the list; Python's sorting is good at that, and it'll probably do much better than O(n log n).
This code assumes your keys are integers, as you said.
import bisect

dataA = { 1: 'data1', 3: 'data3', 5: 'data5', 2: 'data2' }
dataB = { 2: 'more data2', 4: 'data4', 6: 'data6' }

# combine into one list of (key, value) tuples and sort by key
combined_list = list(dataA.items()) + list(dataB.items())
combined_list.sort()
print(combined_list)

def get_range(data, value):
    # [start, end) slice of all entries whose key equals value
    lower_bound = bisect.bisect_left(data, (value, ))
    upper_bound = bisect.bisect_left(data, (value + 1, ))
    return lower_bound, upper_bound

lower_bound, upper_bound = get_range(combined_list, 2)
print(lower_bound, upper_bound)
print(combined_list[lower_bound:upper_bound])
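Following on from the note above about adding blocks of data later, roughly:

# a later block of data: append its items and re-sort; Timsort copes
# well with a list that is already mostly sorted
dataC = {1: 'more data1', 7: 'data7'}
combined_list += list(dataC.items())
combined_list.sort()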
With that quantity of data, I would bite the bullet and use the built-in sqlite3 module. Yes, you give up some Python flexibility and have to use SQL, but right now it's sorting 65k values; next it will be finding values that meet certain criteria. So instead of reinventing relational databases, just go the SQL route now.
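A minimal sketch of that route with the standard sqlite3 module; the table layout, and the assumption that combining dictionaries means summing values for duplicate keys, are mine (the upsert syntax needs SQLite 3.24+):

import sqlite3

counts = {0: 12, 1: 7, 2: 30}   # one of the many dictionaries to merge

conn = sqlite3.connect(':memory:')   # or a file on disk
conn.execute('CREATE TABLE counts (key INTEGER PRIMARY KEY, value INTEGER)')

# merge a dictionary in, adding values for keys that already exist
conn.executemany(
    'INSERT INTO counts (key, value) VALUES (?, ?) '
    'ON CONFLICT(key) DO UPDATE SET value = value + excluded.value',
    counts.items())
conn.commit()

# let the database do the sorting
rows = conn.execute('SELECT key, value FROM counts ORDER BY value DESC').fetchall()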
