Converting a list of strings to a single pattern

Converting a list of strings to a single pattern - python

I have a list of strings that follow a specific pattern. Here's an example
['ratelimiter:foobar:201401011157',
'ratelimiter:foobar:201401011158',
'ratelimiter:foobar:201401011159',
'ratelimiter:foobar:201401011200']
I'm trying to end up with a blob pattern that will represent this list like the following
'ratelimiter:foobar:201401011*
I know the first two fields ahead of time. The third field is a time stamp and I want to find the column at which they start to have different values from other values in the column.
In the example given the timestamp ranges from 2014-01-01-11:57 to 2014-01-01-12:00 and the column that's different is the third to the last column where 1 changes to 2. If I can find that, then I can slice the string to [:-3] += '*' (for this example)
Every time I try and tackle this problem I end up with loops everywhere. I just feel like there's a better way of doing this.
Or maybe someone knows a better way of doing this with redis. I'm doing this because I'm trying to get keys from redis and I don't want to make a request for every key but rather make a batch request using the pattern parameter. Maybe there's a better way of doing this but haven't found anything yet.
Thanks

Staying in the pattern thing (converting to timestamp is probably best, though), I would do that to find the longest prefix:
items = ['ratelimiter:foobar:201401011157',
'ratelimiter:foobar:201401011158',
'ratelimiter:foobar:201401011159',
'ratelimiter:foobar:201401011200']
print items[0][:[len(set(x)) == 1 for x in zip(*items)].index(False)] + '*'
# ratelimiter:foobar:201401011*
Which reads as: cut the first element of items where all nth elements of items are no longer equals.
[len(set(x)) == 1 for x in zip(*items)] will return a list of boolean being True for i if all elements at i are equal across items

This is what I will do:
convert the timestamp to numbers
find the max and min (if your list is not ordered)
take the difference between max and min and convert it back to pattern.
For example, in your case, the difference between max and min is 43. And the min is already 57, you can quickly deduct that if the min ends with ***157, the max should be ***200. And you know the pattern

You almost never want to use the '*' parameter in Redis in production because it is very slow-- much slower than making a request for each key individually in the vast majority of cases. Unless you're requesting so many keys that your bottleneck becomes the sheer amount of data you're transferring over the network (in which case you should really convert things to Lua and run the logic server-side), a pipeline is really want you want.
The reason you want a pipeline is you're probably getting hit by the costs of transferring data back and forth between your Redis server in separate hops right now. A pipeline, in contrast, queues up a bunch of commands to run against Redis, and then executes them all at once, when you're ready. Assuming you're using redis-py (if you're not, you really should be), and r is your connection to your Redis server, you can do this like so:
r = redis.Redis(...)
pipe = r.pipeline()
items = ['ratelimiter:foobar:201401011157',
'ratelimiter:foobar:201401011158',
'ratelimiter:foobar:201401011159',
'ratelimiter:foobar:201401011200']
for item in items:
pipe.get(item)
#all the values for each item you're getting from Redis will be here.
item_values = pipe.execute()
Note: this will only make one call to Redis and will be much faster than either getting each value individually or running a pattern selection.
All of the other answers so far are good Python answers, but you're dealing with a Redis problem. You need a Redis answer.

Related

How to get nth record from result set of aerospike scan() output in Python, without looping through all results?

A beginner question with python probably.
I am able to iterate over the results of aerospike db query like this -
client = aerospike.client(config).connect()
scan = client.scan('namespace', 'setName')
scan.select('PK','expiresIn','clientId','scopes','roles') # scan from aerospike
scan.foreach(process_result)
def process_result((key, metadata, record)):
expiresIn = record.get("expiresIn")
Now, all I want to do is get the nth record from this set, without having to iterate through all.
I tried looking at Get the nth item of a generator in Python but could not make much sense.

Results from scan operation come from all the nodes in the cluster, pipelined, in no particular order. In that sense, there is no difference between the first record or the Nth record in terms of ordering. There is no order.
I wrote some Medium posts on how to sort results from a scan or query:
Sorted Results from a Secondary Index Query in Aerospike — Part I
Sorted Results from a Secondary Index Query — Part II

As usual, the workaround would be to set the scan policy to return just the digests, store them as a list (or several records with smaller lists) and paginate over those wth batch reads. You can set reasonable TTLs so that this result set has a reasonable length of time.
I can provide sample code if needed.

How to impliment a binary search on a list created from a file

This is my first post, please be gentle. I'm attempting to sort some
files into ascending and descending order. Once I have sorted a file, I am storing it in a list which is assigned to a variable. The user is then to choose a file and search for an item. I get an error message....
TypeError: unorderable types; int() < list()
.....when ever I try to search for an item using the variable of my sorted list, the error occurs on line 27 of my code. From research, I know that an int and list cannot be compared, but I cant for the life of me think how else to search a large (600) list for an item.
At the moment I'm just playing around with binary search to get used to it.
Any suggestions would be appreciated.
year = []
with open("Year_1.txt") as file:
for line in file:
line = line.strip()
year.append(line)
def selectionSort(alist):
for fillslot in range(len(alist)-1,0,-1):
positionOfMax=0
for location in range(1,fillslot+1):
if alist[location]>alist[positionOfMax]:
positionOfMax = location
temp = alist[fillslot]
alist[fillslot] = alist[positionOfMax]
alist[positionOfMax] = temp
def binarySearch(alist, item):
first = 0
last = len(alist)-1
found = False
while first<=last and not found:
midpoint = (first + last)//2
if alist[midpoint] == item:
found = True
else:
if item < alist[midpoint]:
last = midpoint-1
else:
first = midpoint+1
return found
selectionSort(year)
testlist = []
testlist.append(year)
print(binarySearch(testlist, 2014))
Year_1.txt file consists of 600 items, all years in the format of 2016.
They are listed in descending order and start at 2017, down to 2013. Hope that makes sense.

Is there some reason you're not using the Python: bisect module?
Something like:
import bisect
sorted_year = list()
for each in year:
bisect.insort(sorted_year, each)
... is sufficient to create the sorted list. Then you can search it using functions such as those in the documentation.
(Actually you could just use year.sort() to sort the list in-place ... bisect.insort() might be marginally more efficient for building the list from the input stream in lieu of your call to year.append() ... but my point about using the `bisect module remains).
Also note that 600 items is trivial for modern computing platforms. Even 6,000 won't take but a few milliseconds. On my laptop sorting 600,000 random integers takes about 180ms and similar sized strings still takes under 200ms.
So you're probably not gaining anything by sorting this list in this application at that data scale.
On the other hand Python also includes a number of modules in its standard libraries for managing structured data and data files. For example you could use Python: SQLite3.
Using this you'd use standard SQL DDL (data definition language) to describe your data structure and schema, SQL DML (data manipulation language: INSERT, UPDATE, and DELETE statements) to manage the contents of the data and SQL queries to fetch data from it. Your data can be returned sorted on any column and any mixture of ascending and descending on any number of columns with the standard SQL ORDER BY clauses and you can add indexes to your schema to ensure that the data is stored in a manner to enable efficient querying and traversal (table scans) in any order on any key(s) you choose.
Because Python includes SQLite in its standard libraries, and because SQLite provides SQL client/server semantics over simple local files ... there's almost no downside to using it for structured data. It's not like you have to install and maintain additional software, servers, handle network connections to a remote database server nor any of that.

I'm going to walk through some steps before getting to the answer.
You need to post a [mcve]. Instead of telling us to read from "Year1.txt", which we don't have, you need to put the list itself in the code. Do you NEED 600 entries to get the error in your code? No. This is sufficient:
year = ["2001", "2002", "2003"]
If you really need 600 entries, then provide them. Either post the actual data, or
year = [str(x) for x in range(2017-600, 2017)]
The code you post needs to be Cut, Paste, Boom - reproduces the error on my computer just like that.
selectionSort is completely irrelevant to the question, so delete it from the question entirely. In fact, since you say the input was already sorted, I'm not sure what selectionSort is actually supposed to do in your code, either. :)
Next you say testlist = [].append(year). USE YOUR DEBUGGER before you ask here. Simply looking at the value in your variable would have made a problem obvious.
How to append list to second list (concatenate lists)
Fixing that means you now have a list of things to search. Before you were searching a list to see if 2014 matched the one thing in there, which was a complete list of all the years.
Now we get into binarySearch. If you look at the variables, you see you are comparing the integer 2014 with some string, maybe "1716", and the answer to that is useless, if it even lets you do that (I have python 2.7 so I am not sure exactly what you get there). But the point is you can't find the integer 2014 in a list of strings, so it will always return False.
If you don't have a debugger, then you can place strategic print statements like
print ("debug info: binarySearch comparing ", item, alist[midpoint])
Now here, what VBB said in comments worked for me, after I fixed the other problems. If you are searching for something that isn't even in the list, and expecting True, that's wrong. Searching for "2014" returns True, if you provide the correct list to search. Alternatively, you could force it to string and then search for it. You could force all the years to int during the input phase. But the int 2014 is not the same as the string "2014".

GAE NDB model sequential ID [duplicate]

I have to label something in a "strong monotone increasing" fashion. Be it Invoice Numbers, shipping label numbers or the like.
A number MUST NOT BE used twice
Every number SHOULD BE used when exactly all smaller numbers have been used (no holes).
Fancy way of saying: I need to count 1,2,3,4 ...
The number Space I have available are typically 100.000 numbers and I need perhaps 1000 a day.
I know this is a hard Problem in distributed systems and often we are much better of with GUIDs. But in this case for legal reasons I need "traditional numbering".
Can this be implemented on Google AppEngine (preferably in Python)?

If you absolutely have to have sequentially increasing numbers with no gaps, you'll need to use a single entity, which you update in a transaction to 'consume' each new number. You'll be limited, in practice, to about 1-5 numbers generated per second - which sounds like it'll be fine for your requirements.

If you drop the requirement that IDs must be strictly sequential, you can use a hierarchical allocation scheme. The basic idea/limitation is that transactions must not affect multiple storage groups.
For example, assuming you have the notion of "users", you can allocate a storage group for each user (creating some global object per user). Each user has a list of reserved IDs. When allocating an ID for a user, pick a reserved one (in a transaction). If no IDs are left, make a new transaction allocating 100 IDs (say) from the global pool, then make a new transaction to add them to the user and simultaneously withdraw one. Assuming each user interacts with the application only sequentially, there will be no concurrency on the user objects.

The gaetk - Google AppEngine Toolkit now comes with a simple library function to get a number in a sequence. It is based on Nick Johnson's transactional approach and can be used quite easily as a foundation for Martin von Löwis' sharding approach:
>>> from gaeth.sequences import *
>>> init_sequence('invoce_number', start=1, end=0xffffffff)
>>> get_numbers('invoce_number', 2)
[1, 2]
The functionality is basically implemented like this:
def _get_numbers_helper(keys, needed):
results = []
for key in keys:
seq = db.get(key)
start = seq.current or seq.start
end = seq.end
avail = end - start
consumed = needed
if avail <= needed:
seq.active = False
consumed = avail
seq.current = start + consumed
seq.put()
results += range(start, start + consumed)
needed -= consumed
if needed == 0:
return results
raise RuntimeError('Not enough sequence space to allocate %d numbers.' % needed)
def get_numbers(needed):
query = gaetkSequence.all(keys_only=True).filter('active = ', True)
return db.run_in_transaction(_get_numbers_helper, query.fetch(5), needed)

If you aren't too strict on the sequential, you can "shard" your incrementer. This could be thought of as an "eventually sequential" counter.
Basically, you have one entity that is the "master" count. Then you have a number of entities (based on the load you need to handle) that have their own counters. These shards reserve chunks of ids from the master and serve out from their range until they run out of values.
Quick algorithm:
You need to get an ID.
Pick a shard at random.
If the shard's start is less than its end, take it's start and increment it.
If the shard's start is equal to (or more oh-oh) its end, go to the master, take the value and add an amount n to it. Set the shards start to the retrieved value plus one and end to the retrieved plus n.
This can scale quite well, however, the amount you can be out by is the number of shards multiplied by your n value. If you want your records to appear to go up this will probably work, but if you want to have them represent order it won't be accurate. It is also important to note that the latest values may have holes, so if you are using that to scan for some reason you will have to mind the gaps.
Edit
I needed this for my app (that was why I was searching the question :P ) so I have implemented my solution. It can grab single IDs as well as efficiently grab batches. I have tested it in a controlled environment (on appengine) and it performed very well. You can find the code on github.

Take a look at how the sharded counters are made. It may help you. Also do you really need them to be numeric. If unique is satisfying just use the entity keys.

Alternatively, you could use allocate_ids(), as people have suggested, then creating these entities up front (i.e. with placeholder property values).
first, last = MyModel.allocate_ids(1000000)
keys = [Key(MyModel, id) for id in range(first, last+1)]
Then, when creating a new invoice, your code could run through these entries to find the one with the lowest ID such that the placeholder properties have not yet been overwritten with real data.
I haven't put that into practice, but seems like it should work in theory, most likely with the same limitations people have already mentioned.

Remember: Sharding increases the probability that you will get a unique, auto-increment value, but does not guarantee it. Please take Nick's advice if you MUST have a unique auto-incrment.

I implemented something very simplistic for my blog, which increments an IntegerProperty, iden rather than the Key ID.
I define max_iden() to find the maximum iden integer currently being used. This function scans through all existing blog posts.
def max_iden():
max_entity = Post.gql("order by iden desc").get()
if max_entity:
return max_entity.iden
return 1000 # If this is the very first entry, start at number 1000
Then, when creating a new blog post, I assign it an iden property of max_iden() + 1
new_iden = max_iden() + 1
p = Post(parent=blog_key(), header=header, body=body, iden=new_iden)
p.put()
I wonder if you might also want to add some sort of verification function after this, i.e. to ensure the max_iden() has now incremented, before moving onto the next invoice.
Altogether: fragile, inefficient code.

I'm thinking in using the following solution: use CloudSQL (MySQL) to insert the records and assign the sequential ID (maybe with a Task Queue), later (using a Cron Task) move the records from CloudSQL back to the Datastore.
The entities also can have a UUID, so we can map the entities from the Datastore in CloudSQL, and also have the sequential ID (for legal reasons).

Speed up lookup item in list (via Python)

I have a very large list, and I have to run a lot of lookups for this list.
To be more specific I work on a large (> 11 Gb) textfile for processing, but there are items which are appear more than once, and I have only process them first when they are appearing.
If the pattern shows up, I process it, and put it to a list. If the item appears again, I check for it in the list, and if it is, then I just pass to process, like this:
[...]
if boundary.match(line):
if closedreg.match(logentry):
closedthreads.append(threadid)
elif threadid in closedthreads:
pass
else:
[...]
the code itself is far from optimal. My main problem is that the 'closedthreads' list contains a few million items, and the whole operation just start to be slower and slower.
I think it could be help to sort the list (or use a 'sorted list' object) after every append() but I am not sure about this.
What is the most elegant sollution?

You can simply use a set or a hash table which marks if given id already appeared. It should solve your problem with O(1) time complexity for adding and finding an item.

Using a set instead of a list will give you O(1) lookup time, although there may be other ways to optimize this that will work better for your particular data.
closedthreads = set()
# ...
if boundary.match(line):
if closedreg.match(logentry):
closedthreads.add(threadid)
elif threadid in closedthreads:
pass
else:

Do you need to preserve ordering?
If not - use a set.
If you do - use an OrderedDict. OrderedDict lets you store values associated with it as well (example, process results)
But... do you need to preserve the original values at all? You might look at the 'dbm' module if you absolutely do (or buy a lot of memory!) or, instead of storing the actual text, store SHA-1 digests, or something like that. If all you want to do is make sure you don't run the same element twice, that might work.

How to implement "autoincrement" on Google AppEngine

I have to label something in a "strong monotone increasing" fashion. Be it Invoice Numbers, shipping label numbers or the like.
A number MUST NOT BE used twice
Every number SHOULD BE used when exactly all smaller numbers have been used (no holes).
Fancy way of saying: I need to count 1,2,3,4 ...
The number Space I have available are typically 100.000 numbers and I need perhaps 1000 a day.
I know this is a hard Problem in distributed systems and often we are much better of with GUIDs. But in this case for legal reasons I need "traditional numbering".
Can this be implemented on Google AppEngine (preferably in Python)?

If you absolutely have to have sequentially increasing numbers with no gaps, you'll need to use a single entity, which you update in a transaction to 'consume' each new number. You'll be limited, in practice, to about 1-5 numbers generated per second - which sounds like it'll be fine for your requirements.

If you drop the requirement that IDs must be strictly sequential, you can use a hierarchical allocation scheme. The basic idea/limitation is that transactions must not affect multiple storage groups.
For example, assuming you have the notion of "users", you can allocate a storage group for each user (creating some global object per user). Each user has a list of reserved IDs. When allocating an ID for a user, pick a reserved one (in a transaction). If no IDs are left, make a new transaction allocating 100 IDs (say) from the global pool, then make a new transaction to add them to the user and simultaneously withdraw one. Assuming each user interacts with the application only sequentially, there will be no concurrency on the user objects.

The gaetk - Google AppEngine Toolkit now comes with a simple library function to get a number in a sequence. It is based on Nick Johnson's transactional approach and can be used quite easily as a foundation for Martin von Löwis' sharding approach:
>>> from gaeth.sequences import *
>>> init_sequence('invoce_number', start=1, end=0xffffffff)
>>> get_numbers('invoce_number', 2)
[1, 2]
The functionality is basically implemented like this:
def _get_numbers_helper(keys, needed):
results = []
for key in keys:
seq = db.get(key)
start = seq.current or seq.start
end = seq.end
avail = end - start
consumed = needed
if avail <= needed:
seq.active = False
consumed = avail
seq.current = start + consumed
seq.put()
results += range(start, start + consumed)
needed -= consumed
if needed == 0:
return results
raise RuntimeError('Not enough sequence space to allocate %d numbers.' % needed)
def get_numbers(needed):
query = gaetkSequence.all(keys_only=True).filter('active = ', True)
return db.run_in_transaction(_get_numbers_helper, query.fetch(5), needed)

If you aren't too strict on the sequential, you can "shard" your incrementer. This could be thought of as an "eventually sequential" counter.
Basically, you have one entity that is the "master" count. Then you have a number of entities (based on the load you need to handle) that have their own counters. These shards reserve chunks of ids from the master and serve out from their range until they run out of values.
Quick algorithm:
You need to get an ID.
Pick a shard at random.
If the shard's start is less than its end, take it's start and increment it.
If the shard's start is equal to (or more oh-oh) its end, go to the master, take the value and add an amount n to it. Set the shards start to the retrieved value plus one and end to the retrieved plus n.
This can scale quite well, however, the amount you can be out by is the number of shards multiplied by your n value. If you want your records to appear to go up this will probably work, but if you want to have them represent order it won't be accurate. It is also important to note that the latest values may have holes, so if you are using that to scan for some reason you will have to mind the gaps.
Edit
I needed this for my app (that was why I was searching the question :P ) so I have implemented my solution. It can grab single IDs as well as efficiently grab batches. I have tested it in a controlled environment (on appengine) and it performed very well. You can find the code on github.

Take a look at how the sharded counters are made. It may help you. Also do you really need them to be numeric. If unique is satisfying just use the entity keys.

Alternatively, you could use allocate_ids(), as people have suggested, then creating these entities up front (i.e. with placeholder property values).
first, last = MyModel.allocate_ids(1000000)
keys = [Key(MyModel, id) for id in range(first, last+1)]
Then, when creating a new invoice, your code could run through these entries to find the one with the lowest ID such that the placeholder properties have not yet been overwritten with real data.
I haven't put that into practice, but seems like it should work in theory, most likely with the same limitations people have already mentioned.

Remember: Sharding increases the probability that you will get a unique, auto-increment value, but does not guarantee it. Please take Nick's advice if you MUST have a unique auto-incrment.

I implemented something very simplistic for my blog, which increments an IntegerProperty, iden rather than the Key ID.
I define max_iden() to find the maximum iden integer currently being used. This function scans through all existing blog posts.
def max_iden():
max_entity = Post.gql("order by iden desc").get()
if max_entity:
return max_entity.iden
return 1000 # If this is the very first entry, start at number 1000
Then, when creating a new blog post, I assign it an iden property of max_iden() + 1
new_iden = max_iden() + 1
p = Post(parent=blog_key(), header=header, body=body, iden=new_iden)
p.put()
I wonder if you might also want to add some sort of verification function after this, i.e. to ensure the max_iden() has now incremented, before moving onto the next invoice.
Altogether: fragile, inefficient code.

I'm thinking in using the following solution: use CloudSQL (MySQL) to insert the records and assign the sequential ID (maybe with a Task Queue), later (using a Cron Task) move the records from CloudSQL back to the Datastore.
The entities also can have a UUID, so we can map the entities from the Datastore in CloudSQL, and also have the sequential ID (for legal reasons).

Develop Reference

Python is a programming language that lets you work quickly and integrate systems more effectively.