What is the best way to create a Python script that, each time it is called, will ALWAYS generate a new UNIQUE ID (auto-incremental)?
You run the script and it tells you 1; then you close the script, open it again, and it tells you 2.
The purpose is to create a script that will be used across the board, and this ID will be used to track the latest changes and so on.
P.S. I'm not talking about making a function which will do it.
import uuid

# uuid1() combines the host's MAC address with a timestamp, so every call returns a new, unique value
uniqueid = uuid.uuid1()
Since you didn't provide any code, I will also not provide any code.
Solution 1: Unique ID
1) TIME: create a function that gives you a timestamp
2) ID: create a function that generates a long string of random numbers and letters
This is of course 'risky', because there is a chance you will generate an already existing ID, but statistically speaking it is, as they say, 'impossible even though it is possible'.
Save it in a file or somewhere else (a rough sketch follows below).
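A minimal sketch of this idea; the ID format, the ids.txt file name and appending each generated ID to that file are my assumptions, not part of the original suggestion:

import random
import string
import time

def generate_id():
    # The timestamp keeps IDs roughly ordered; the random suffix makes a
    # collision extremely unlikely (though not strictly impossible).
    timestamp = time.strftime('%Y%m%d%H%M%S')
    suffix = ''.join(random.choices(string.ascii_letters + string.digits, k=12))
    return timestamp + '-' + suffix

if __name__ == '__main__':
    new_id = generate_id()
    with open('ids.txt', 'a') as id_file:  # append, so earlier IDs are kept
        id_file.write(new_id + '\n')
    print(new_id)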
Solution 2: offset - incremental
1) have a file with a 0 in it.
2) open the file, read the line, convert it to an integer, increment it by 1, and write it back to the file.
Note:
Your title is misleading. One moment you talk about a UNIQUE ID, the next moment you are talking about an offset. A unique ID and counting how many times a Python script has run are quite contradictory ideas.
I assume you have a script that generates some result every time it is executed. Then you need a value that (1) distinguishes one result from another and (2) shows which result came last. Is that right? If so, we have many options here. In the simplest case (a script always running on the same machine) I would suggest two options.
Save a count to a file
In this case, you would have a file and would read the number from it:
try:
    with open('count.txt') as count_file:
        content = count_file.read()
        count = int(content)
except Exception:
    count = 0
After doing whatever your script does, you would write to the file the value you've read, but incremented:
with open('count.txt', 'w') as count_file:
    count_file.write(str(count + 1))
Save the timestamp
A simpler option, however, is not to increment a value but to get a timestamp. You could use time.time(), which returns the number of seconds since the Unix epoch:
>>> import time
>>> time.time()
1547479233.9383247
You will always know which result came later than the others. Personally, however, I would rather format the current time; it is easier to read and reason about:
>>> from datetime import datetime
>>> datetime.now().strftime('%Y%m%d%H%M%S')
'20190114132407'
Those are basic ideas; you may need to pay attention to corner cases and possible failures (especially with the file-based solution). That said, I guess those are quite viable first steps.
A technical note
What you want here is for a program to remember a piece of information between two or more executions, and we have a technical term for that: the information should be persistent. Since you asked for an autoincrementing feature, you want a persistent count. I suspect, however, you do not need that if you use the timestamp option. It is up to you to decide what to do here.
I had the same situation. I ended up creating a CSV file so that I could map variable names.
import sys
import pandas as pd

def itemId_generator(itemIdLocation):
    # Import the current value of ItemId from the csv file
    df = pd.read_csv(itemIdLocation)
    # Current ItemId in the csv file; this is the value that gets returned
    ItemId = df.loc[0, 'ItemId']
    # Cap the maximum ItemId and wrap around to 1
    if ItemId >= 10000:
        df.loc[0, 'ItemId'] = 1
    elif ItemId < 10000:
        # Update the column value
        df.loc[0, 'ItemId'] = df.loc[0, 'ItemId'] + 1
    else:
        print("Invalid value returned")
        sys.exit()
    # Write the new ItemId back into the file
    df.to_csv(itemIdLocation, index=False)
    # .item() converts the numpy integer to a plain Python int
    return str(ItemId.item())
If there is any chance of the file being accessed concurrently, it is best to lock the file. Keep trying if the file is locked.
http://tilde.town/~cristo/file-locking-in-python.html
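As a rough illustration (this is my sketch, not code from the linked page): on POSIX systems you could lock the counter file with fcntl and retry while it is held; the count.txt file name, retry count and delay are arbitrary choices.

import fcntl
import time

def next_count(path='count.txt', retries=50, delay=0.1):
    open(path, 'a').close()  # make sure the file exists
    for _ in range(retries):
        with open(path, 'r+') as count_file:
            try:
                # Exclusive, non-blocking lock; raises OSError if someone else holds it.
                fcntl.flock(count_file, fcntl.LOCK_EX | fcntl.LOCK_NB)
            except OSError:
                time.sleep(delay)  # the file is locked; wait a bit and try again
                continue
            content = count_file.read().strip()
            count = int(content) + 1 if content else 1
            count_file.seek(0)
            count_file.truncate()
            count_file.write(str(count))
            return count  # the lock is released when the file is closed
    raise RuntimeError('could not acquire a lock on %s' % path)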
Old answer:
You could store it as an environment variable on the system. If not set, initialise to 1. Else increment it by 1.
Related
I am a beginner in programming and I use Python. I have code that calculates the rotation period of a star, but I have to change the star ID each time, which takes a lot of effort and time.
Can I change the star ID automatically?
from lightkurve import search_lightcurvefile
lcf = search_lightcurvefile('201691589').download() ## star Id = 201691589
lc = lcf.PDCSAP_FLUX.remove_nans()
pg = lc.to_periodogram()
Prot = pg.frequency_at_max_power**-1
print(Prot)
I saved all the star IDs that I want to use in a text file (starID.txt) with 10000 lines, and I want to calculate the rotation period (Prot) automatically, so that the code takes the star IDs from the txt file one by one, does the calculations, then saves the star_ID and Prot in a csv file (two columns: 'star_ID', 'Prot'). Can you please help me do it?
This should get you close but I don't have a bunch of star IDs handy, nor is this in my field.
The main points:
Use the csv module for reading and writing files.
When you have code that you need to call many times (well, oftentimes even just once for a logical grouping), you want to consider packaging it into a function.
There are other pointers here for you to research. If I hadn't tried to make things a little more succinct than basic loops, the code would be quite long, and I tried not to make it too terse either. Hopefully it's enough for you to follow up on.
from lightkurve import search_lightcurvefile
import csv
# You need to read the file and get the star IDs. I'm taking a guess here that
# the file has a single column of IDs
with open('starID.txt') as infile:
    reader = csv.reader(infile)
    # Below is where my guess matters. A "list comprehension" assuming a single
    # column, so I just take the first value of each row.
    star_ids = [item[0] for item in reader]
def data_reader(star_id):
    """
    Read the periodogram for a star_id.
    Returns [star_id, periodogram]
    """
    lcf = search_lightcurvefile(star_id).download()
    lc = lcf.PDCSAP_FLUX.remove_nans()
    pg = lc.to_periodogram()
    Prot = pg.frequency_at_max_power**-1
    return [star_id, Prot]
# Now start calling the function on your list of star IDs and storing the result
results = []
for id_number in star_ids:
    individual_result = data_reader(id_number)  # Call the function
    results.append(individual_result)           # Add its result to the collection
# Now write the data out
with open('star_results.csv', 'w', newline='') as outfile:
    writer = csv.writer(outfile)
    writer.writerows(results)
You said 'beginner' so this is an answer for someone writing programs almost for the first time.
In Python the straightforward way to change the star id each time is to write a function and call it multiple times with different star ids. Taking the code you have and turning it into a function without changing the behavior at all might look like this:
from lightkurve import search_lightcurvefile

def prot_for_star(star_id):
    lcf = search_lightcurvefile(star_id).download()
    pg = lcf.PDCSAP_FLUX.remove_nans().to_periodogram()
    return pg.frequency_at_max_power**-1

# Now use map to call the function for each star id in the list
star_ids = ['201691589', '201692382']
prots = list(map(prot_for_star, star_ids))
print(prots)
This could be inefficient code though. I don't know what this lightkurve package does exactly, so there could be additional ways to save time. If you need to do more than one thing with each lcf object, you might need your functions to be structured differently. Or if creating a periodogram is CPU intensive, and you end up generating the same ones multiple times, there could be ways to save time doing that.
But this is the basic idea of using abstraction to avoid repeating the same lines of code over and over again.
Combining the original star id with its period of rotation can be achieved like this. This is a bit of functional programming magic.
# interleave the star ids with their periods of rotation
for pair in zip(star_ids, prots):
    # separate the id and prot with a comma for csv format
    print(','.join(str(value) for value in pair))
The output of the Python script can then be stored in a csv file.
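If you would rather write the file from inside the script than redirect the printed output, a small sketch with the csv module could look like this; it reuses star_ids and prots from above, and the star_prots.csv file name is just an example.

import csv

with open('star_prots.csv', 'w', newline='') as outfile:
    writer = csv.writer(outfile)
    writer.writerow(['star_ID', 'Prot'])       # header row
    for star_id, prot in zip(star_ids, prots):
        writer.writerow([star_id, str(prot)])  # str() in case Prot is an astropy quantity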
csv data:
c1,v1,c2,v2,Time
13.9,412.1,29.7,177.2,14:42:01
13.9,412.1,29.7,177.2,14:42:02
13.9,412.1,29.7,177.2,14:42:03
13.9,412.1,29.7,177.2,14:42:04
13.9,412.1,29.7,177.2,14:42:05
0.1,415.1,1.3,-0.9,14:42:06
0.1,408.5,1.2,-0.9,14:42:07
13.9,412.1,29.7,177.2,14:42:08
0.1,413.4,1.3,-0.9,14:42:09
0.1,413.8,1.3,-0.9,14:42:10
My current code that I have:
import pandas as pd
import csv
import datetime as dt
#Read .csv file, get timestamp and split it into date and time separately
Data = pd.read_csv('filedata.csv', parse_dates=['Time_Stamp'], infer_datetime_format=True)
Data['Date'] = Data.Time_Stamp.dt.date
Data['Time'] = Data.Time_Stamp.dt.time
#print (Data)
print (Data['Time_Stamp'])
Data['Time_Stamp'] = pd.to_datetime(Data['Time_Stamp'])
#Read timestamp within a certain range
mask = (Data['Time_Stamp'] > '2017-06-12 10:48:00') & (Data['Time_Stamp']<= '2017-06-12 11:48:00')
june13 = Data.loc[mask]
#print (june13)
What I'm trying to do is to read the data in blocks of 5 seconds, and if one of those 5 seconds of c1 data is 10.0 or above, replace that value of c1 with 0.
I'm still new to python and I could not find examples for this. May I have some assistance as this problem is way beyond my python programming skills for now. Thank you!
I don't know the modules for working with csv files, so my answer might look primitive, and I'm not quite sure what you are trying to accomplish here, but have you thought of dealing with the file textually?
From what I get, you want to read every c1, check the value and modify it.
To read and modify the file, you could do:
with open('filedata.csv', 'r+') as csv_file:
    lines = csv_file.readlines()
    # For each data line, isolate the first value and check - and modify - it if needed.
    # I'm seriously not sure, you might have wanted to read only one out of five lines.
    # For that, just do a while loop with an index which increments through lines by 5.
    for i, line in enumerate(lines[1:], start=1):  # skip the header line
        values = line.split(',')  # split comma-separated values
        # Check the condition and apply the needed change.
        if float(values[0]) >= 10:
            values[0] = "0"  # directly as a string
        # Transform the list back into a single string and store it back.
        lines[i] = ",".join(values)
    # Rewrite the file.
    csv_file.seek(0)
    csv_file.writelines(lines)
    csv_file.truncate()
# Here you are ready to use the file just like you were already doing.
# Of course, the above code could be put in a function for known advantages.
(I don't have python here, so I couldn't test it and typos might be there.)
If you only need the dataframe without the file being modified:
Pretty much the same to be honest.
Instead of the file-writing at the end, you could do :
from io import StringIO  # pandas wants a file-like object rather than a plain string
# Above code here, but without the file-rewriting lines at the end.
Data = pd.read_csv(
    StringIO("".join(lines)),  # readlines() keeps the newlines, so just concatenate
    parse_dates=['Time_Stamp'],
    infer_datetime_format=True
)
This should give you the Data you have, with changed values where needed.
Hope this wasn't completely off. Also, some people might find this approach horrible; we already have working modules to do that kind of thing, so why bother dealing with the rough raw data ourselves? Personally, I think it is often much easier to understand how the text representation of a file can be used than to learn every external module I will ever need. Your opinion might differ.
Also, this code might result in lower performance, as we need to iterate through the text twice (pandas does it again when reading). However, I don't think you would get a faster result by reading the csv as you already do and then iterating through the data anyway to check the condition. (You might save a cast per checked c1 value, but the difference is small, and iterating through a pandas dataframe may well be slower than iterating through a list, depending on the state of its current optimisation.)
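For comparison, a fully vectorised pandas version avoids any explicit Python loop. This is only a sketch of the simple per-row condition (it ignores the 5-second grouping) and assumes the column really is named c1 as in the sample data and that the file has already been read into Data as in the question.

# Set c1 to 0 wherever it is 10.0 or above, with no explicit loop.
Data.loc[Data['c1'] >= 10.0, 'c1'] = 0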
Of course, if you don't really need the pandas dataframe format, you could do it completely manually; it would take only a few more lines (or not, to be honest) and shouldn't be slower, as the number of iterations would be minimised: you could check the conditions on the data at the same time as you read it. It's getting late and I'm sure you can figure that out by yourself, so I won't code it in my great editor (known as Stack Overflow); ask if there's anything!
I have to label something in a "strong monotone increasing" fashion. Be it Invoice Numbers, shipping label numbers or the like.
A number MUST NOT BE used twice
Every number SHOULD BE used only when all smaller numbers have already been used (no holes).
Fancy way of saying: I need to count 1,2,3,4 ...
The number space I have available is typically 100,000 numbers, and I need perhaps 1,000 a day.
I know this is a hard problem in distributed systems and we are often much better off with GUIDs. But in this case, for legal reasons, I need "traditional numbering".
Can this be implemented on Google AppEngine (preferably in Python)?
If you absolutely have to have sequentially increasing numbers with no gaps, you'll need to use a single entity, which you update in a transaction to 'consume' each new number. You'll be limited, in practice, to about 1-5 numbers generated per second - which sounds like it'll be fine for your requirements.
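A minimal sketch of that single-entity approach, written here with the ndb API; the Counter model and the 'invoice_counter' key name are placeholders of mine, not part of the answer.

from google.appengine.ext import ndb

class Counter(ndb.Model):
    value = ndb.IntegerProperty(default=0)

@ndb.transactional
def next_number():
    # Every caller contends on this one entity, which is what limits
    # throughput to a few allocations per second.
    key = ndb.Key(Counter, 'invoice_counter')
    counter = key.get() or Counter(key=key, value=0)
    counter.value += 1
    counter.put()
    return counter.value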
If you drop the requirement that IDs must be strictly sequential, you can use a hierarchical allocation scheme. The basic idea/limitation is that transactions must not affect multiple storage groups.
For example, assuming you have the notion of "users", you can allocate a storage group for each user (creating some global object per user). Each user has a list of reserved IDs. When allocating an ID for a user, pick a reserved one (in a transaction). If no IDs are left, make a new transaction allocating 100 IDs (say) from the global pool, then make a new transaction to add them to the user and simultaneously withdraw one. Assuming each user interacts with the application only sequentially, there will be no concurrency on the user objects.
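A hedged sketch of that scheme with the ndb API; the model names, the block size of 100 and the assumption that the per-user and global entities already exist are mine, not the answer's.

from google.appengine.ext import ndb

class GlobalPool(ndb.Model):
    next_free = ndb.IntegerProperty(default=1)

class UserSequence(ndb.Model):
    reserved = ndb.IntegerProperty(repeated=True)  # IDs handed to this user but not yet consumed

@ndb.transactional
def _take_reserved(user_seq_key):
    # Touches only the user's entity group.
    seq = user_seq_key.get()
    if not seq.reserved:
        return None
    allocated = seq.reserved.pop(0)
    seq.put()
    return allocated

@ndb.transactional
def _reserve_block(pool_key, block_size=100):
    # A separate transaction against the global pool.
    pool = pool_key.get()
    block = list(range(pool.next_free, pool.next_free + block_size))
    pool.next_free += block_size
    pool.put()
    return block

@ndb.transactional
def _add_block_and_take(user_seq_key, block):
    # Another transaction on the user: store the new block and withdraw one ID.
    seq = user_seq_key.get()
    seq.reserved.extend(block)
    allocated = seq.reserved.pop(0)
    seq.put()
    return allocated

def allocate_id(user_seq_key, pool_key):
    allocated = _take_reserved(user_seq_key)
    if allocated is None:
        block = _reserve_block(pool_key)
        allocated = _add_block_and_take(user_seq_key, block)
    return allocated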
The gaetk - Google AppEngine Toolkit now comes with a simple library function to get a number in a sequence. It is based on Nick Johnson's transactional approach and can be used quite easily as a foundation for Martin von Löwis' sharding approach:
>>> from gaetk.sequences import *
>>> init_sequence('invoice_number', start=1, end=0xffffffff)
>>> get_numbers('invoice_number', 2)
[1, 2]
The functionality is basically implemented like this:
def _get_numbers_helper(keys, needed):
    results = []
    for key in keys:
        seq = db.get(key)
        start = seq.current or seq.start
        end = seq.end
        avail = end - start
        consumed = needed
        if avail <= needed:
            seq.active = False
            consumed = avail
        seq.current = start + consumed
        seq.put()
        results += range(start, start + consumed)
        needed -= consumed
        if needed == 0:
            return results
    raise RuntimeError('Not enough sequence space to allocate %d numbers.' % needed)

def get_numbers(needed):
    query = gaetkSequence.all(keys_only=True).filter('active = ', True)
    return db.run_in_transaction(_get_numbers_helper, query.fetch(5), needed)
If you aren't too strict on the sequential, you can "shard" your incrementer. This could be thought of as an "eventually sequential" counter.
Basically, you have one entity that is the "master" count. Then you have a number of entities (based on the load you need to handle) that have their own counters. These shards reserve chunks of ids from the master and serve out from their range until they run out of values.
Quick algorithm (a rough sketch follows the steps):
You need to get an ID.
Pick a shard at random.
If the shard's start is less than its end, take its start and increment it.
If the shard's start is equal to (or, uh-oh, greater than) its end, go to the master, take its value and add an amount n to it. Set the shard's start to the retrieved value plus one and its end to the retrieved value plus n.
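Here is a rough ndb sketch of those steps; the model names, the shard count and the block size n are made up, and the author's implementation mentioned in the edit below is more careful than this.

import random
from google.appengine.ext import ndb

NUM_SHARDS = 10
BLOCK_SIZE = 100  # the "n" from the description

class MasterCount(ndb.Model):
    next_value = ndb.IntegerProperty(default=1)

class Shard(ndb.Model):
    start = ndb.IntegerProperty(default=0)  # next value this shard will hand out
    end = ndb.IntegerProperty(default=0)    # one past the last value it may hand out

@ndb.transactional(xg=True)
def next_id():
    # Pick a shard at random.
    shard_key = ndb.Key(Shard, 'shard-%d' % random.randint(0, NUM_SHARDS - 1))
    shard = shard_key.get() or Shard(key=shard_key)
    if shard.start >= shard.end:
        # The shard is exhausted: reserve a new block from the master counter.
        master_key = ndb.Key(MasterCount, 'master')
        master = master_key.get() or MasterCount(key=master_key)
        shard.start = master.next_value
        shard.end = master.next_value + BLOCK_SIZE
        master.next_value += BLOCK_SIZE
        master.put()
    allocated = shard.start
    shard.start += 1
    shard.put()
    return allocated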
This can scale quite well; however, the amount you can be off by is the number of shards multiplied by your n value. If you want your records to appear to go up, this will probably work, but if you want them to represent order it won't be accurate. It is also important to note that the latest values may have holes, so if you are using them to scan for some reason you will have to mind the gaps.
Edit
I needed this for my app (that was why I was searching the question :P ) so I have implemented my solution. It can grab single IDs as well as efficiently grab batches. I have tested it in a controlled environment (on appengine) and it performed very well. You can find the code on github.
Take a look at how the sharded counters are made. It may help you. Also, do you really need them to be numeric? If uniqueness is enough, just use the entity keys.
Alternatively, you could use allocate_ids(), as people have suggested, then create these entities up front (i.e. with placeholder property values).
first, last = MyModel.allocate_ids(1000000)
keys = [Key(MyModel, id) for id in range(first, last+1)]
Then, when creating a new invoice, your code could run through these entries to find the one with the lowest ID such that the placeholder properties have not yet been overwritten with real data.
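A hedged sketch of that lookup with ndb, assuming a hypothetical filled flag on each pre-created placeholder entity; nothing here is from the original answer.

from google.appengine.ext import ndb

class MyModel(ndb.Model):
    filled = ndb.BooleanProperty(default=False)  # real invoice data not written yet
    # ... the real invoice properties go here ...

def lowest_free_placeholder():
    # The placeholders were created with sequential integer ids, so ordering
    # by key returns the one with the smallest id first.
    return MyModel.query(MyModel.filled == False).order(MyModel._key).get()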
I haven't put that into practice, but seems like it should work in theory, most likely with the same limitations people have already mentioned.
Remember: sharding increases the probability that you will get a unique, auto-increment value, but does not guarantee it. Please take Nick's advice if you MUST have a unique auto-increment.
I implemented something very simplistic for my blog, which increments an IntegerProperty, iden, rather than the Key ID.
I define max_iden() to find the maximum iden integer currently being used. This function scans through all existing blog posts.
def max_iden():
    max_entity = Post.gql("order by iden desc").get()
    if max_entity:
        return max_entity.iden
    return 1000  # If this is the very first entry, start at number 1000
Then, when creating a new blog post, I assign it an iden property of max_iden() + 1
new_iden = max_iden() + 1
p = Post(parent=blog_key(), header=header, body=body, iden=new_iden)
p.put()
I wonder if you might also want to add some sort of verification function after this, i.e. to ensure the max_iden() has now incremented, before moving onto the next invoice.
Altogether: fragile, inefficient code.
I'm thinking of using the following solution: use CloudSQL (MySQL) to insert the records and assign the sequential ID (maybe with a Task Queue), and later (using a Cron Task) move the records from CloudSQL back to the Datastore.
The entities can also have a UUID, so we can map the entities from the Datastore to CloudSQL, and still have the sequential ID (for legal reasons).
I am trying to find an alternative, faster method to running the Frequency command on a single variable and writing the number of times the value appears in the dataset to a new variable. My current setup uses syntax and writes the output to a new SAV file (OMS send), which takes several hours to run.
I am looking for some sample code that might show how this can be done with spss.Cursor, where it first reads the variable I want the frequency of, saves the count of each value to a list, and then writes that count to a new variable within the current dataset.
I understand how the read and write cursors work, but I am having trouble counting the number of times each value occurs and storing that in a list which is then written to the new variable. I have read through the SPSS/Python plugin manual and haven't been able to find the solution. Thanks!
Have you considered the AGGREGATE command with MODE = ADDVARIABLES? For example:
AGGREGATE OUTFILE = * MODE = ADDVARIABLES
/BREAK = var1
/var1_n = n.