First, I would like to apologize in advance. I just started learning this a few months ago, so I need things broken down completely. I have a project using Python and DataJoint (which makes SQL code shorter to write) where I need to create an airline that has at least 7 airports, different planes, and so on. Then I need to populate the tables with passenger reservations. Here is what I have so far.
@schema
class Seat(dj.Lookup):
    definition = """
    aircraft_seat : varchar(25)
    """
    contents = [["F_Airbus_1A"],["F_Airbus_1B"],["F_Airbus_2A"],["F_Airbus_2B"],["F_Airbus_3A"],
                ["F_Airbus_3B"],["F_Airbus_4A"],["F_Airbus_4B"],["F_Airbus_5A"],["F_Airbus_5B"],
                ["B_Airbus_6A"],["B_Airbus_6B"],["B_Airbus_6C"],["B_Airbus_6D"],["B_Airbus_7A"],
                ["B_Airbus_7B"],["B_Airbus_7C"],["B_Airbus_7D"],["B_Airbus_8A"],["B_Airbus_8B"],
                ["B_Airbus_8C"],["B_Airbus_8D"],["B_Airbus_9A"],["B_Airbus_9B"],
This keeps going leaving me with a total of 144 seats on each plane.
@schema
class Flight(dj.Manual):
    definition = """
    flight_no : int
    ---
    economy_price : decimal(6,2)
    departure : datetime
    arrival : datetime
    ---
    origin_code : int
    dest_code : int
    """
@schema
class Passenger(dj.Manual):
    definition = """
    passenger_id : int
    ---
    full_name : varchar(40)
    ssn : varchar(20)
    """
@schema
class Reservation(dj.Manual):
    definition = """
    -> Flight
    -> Seat
    ---
    -> Passenger
    """
Then I populate flights and passengers:
Flight.insert((dict(flight_no = i,
                    economy_price = round(random.randint(100, 1000), 2),
                    departure = faker.date_time_this_month(),
                    arrival = faker.date_time_this_month(),
                    origin_code = random.randint(1,7),
                    dest_code = random.randint(1,7)))
              for i in range(315))
Passenger.insert(((dict(passenger_id=i, full_name=faker.name(),
                        ssn = faker.ssn()))
                  for i in range(10000)), skip_duplicates = True)
Lastly I create the transaction:
def reserve(passenger_id, origin_code, dest_code, departure):
    with dj.conn().transaction:
        available_seats = ((Seat * Flight - Reservation) & Passenger &
                           {'passenger_id':passenger_id}).fetch(as_dict=True)
        try:
            choice = random.choice(available_seats)
        except IndexError:
            raise IndexError(f'Sorry, no seats available for {departure}')
        name = (Passenger & {'passenger_id': passenger_id}).fetch1('full_name')
        print('Success. Reserving seat {aircraft_seat} at ticket_price {economy_price} for {name}'.format(name=name, **choice))
        Reservation.insert1(dict(choice, passenger_id=passenger_id), ignore_extra_fields=True)
reserve(random.randint(1,1000), random.randint(1,7),
        random.randint(1,7),random.choice('departure'))
Output[]: Success. Reserving seat E_Yak242_24A at ticket_price 410.00 for Cynthia Erickson
Reservation()
Output[]: flight_no aircraft_seat passenger_id
66 B_Yak242_7A 441
So I am required to have 10.5 flights a day with the planes at least 75% full which leaves me needing over 30000 reservations. Is there a way to do this like 50 at a time? I have been searching for an answer and have not been able to find a solution. Thank you.
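As a rough check of the numbers above, here is a minimal sketch of the arithmetic. It assumes a 30-day month (which matches the 315 flights inserted earlier) and the 144 seats per plane mentioned above; the variable names are only illustrative.

```python
import math

flights_per_day = 10.5
days = 30               # assumption: a 30-day month, i.e. 315 flights in total
seats_per_plane = 144   # seat count per plane stated in the question
min_load_factor = 0.75  # planes must be at least 75% full

total_flights = flights_per_day * days
min_reservations = math.ceil(total_flights * seats_per_plane * min_load_factor)
print(total_flights, min_reservations)
```

Under these assumptions the minimum lands a little above the 30,000 figure quoted in the question.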
One of the maintainers for DataJoint here. First off, I'd like to say thanks for trying out DataJoint; curious as to how you found out about the project.
Forewarning, this will be a long post, but I feel it is a good opportunity to clear up a few things. Regarding the problem in question, I am not sure I fully understand its nature, but let me follow up on several points. I recommend reading this answer in its entirety before determining how best to proceed for your case.
TL;DR: Compute tables are your friend.
Multi-threading
Since it has come up in the comments it is worth addressing that as of 2021-01-06, DataJoint is not completely thread-safe (at least from the perspective of sharing connections). It is unfortunate but it is mainly due to a standing issue with PyMySQL which is a principal dependency of DataJoint. That said, if you initiate a new connection on each thread or process you should not run into any issues. However, this is an expensive workaround and can't be combined with transactions since they require that operations be conducted within a single connection. Speaking of which...
Compute Tables and Job Reservation
Compute tables are one noticeable omission from your attempt above. A Compute table provides a mechanism to associate its entities with those in an upstream parent table, with additional processing prior to insert (defined in a make method in your Compute table class). It is invoked by calling the populate method, which calls the make method for each new entry. Calls to your make method are transaction-constrained and should achieve what you are looking for. See here in the docs for more details on its use.
Also, for additional performance gains, there is another feature called Job Reservation which provides a means to pool together multiple workers to process large data sets (using populate) in an organized, distributed manner. I don't feel it is required here but worth mentioning and ultimately up to how you view the results below. You may find out more on this feature here in our docs.
Schema Design
Based on my understanding of your initial design, I have some suggestions how we can improve the flow of the data to increase clarity, performance, and also to provide specific examples on how we can use the power of Compute tables. Running as illustrated below on my local setup, I was able to process your requirement of 30k reservations in 29m54s with 2 different plane model types, 7 airports, 10k possible passengers, 550 available flights. Minimum 75% seating capacity was not verified only because I didn't see you attempt this yet, though if you see how I am assigning seats you will notice that it is almost there. :)
Disclaimer: I should note that the below design is still a large oversimplification of the actual real-world challenge to orchestrate proper travel reservations. Considerable assumptions were taken mainly for the benefit of education as opposed to submitting a full, drop-in solution. As such, I have explicitly chosen to avoid using longblob for the below solution so that it is easier to follow along. In reality, a proper solution would likely include more advanced topics for further performance gains e.g. longblob, _update, etc.
That said, let's begin by considering the following:
import datajoint as dj # conda install -c conda-forge datajoint or pip install datajoint
import random
from faker import Faker # pip install Faker
faker = Faker()
Faker.seed(0) # Pin down randomizer between runs
schema = dj.Schema('commercial_airtravel') # instantiate a workable database
@schema
class Plane(dj.Lookup):
    definition = """
    # Defines manufacturable plane model types
    plane_type : varchar(25) # Name of plane model
    ---
    plane_rows : int # Number of rows in plane model i.e. range(1, plane_rows + 1)
    plane_columns : int # Number of columns in plane model; to extract letter we will need these indices
    """
    contents = [('B_Airbus', 37, 4), ('F_Airbus', 40, 5)] # Since new entries to this table should happen infrequently, this is a good candidate for a Lookup table
@schema
class Airport(dj.Lookup):
    definition = """
    # Defines airport locations that can serve as origin or destination
    airport_code : int # Airport's unique identifier
    ---
    airport_city : varchar(25) # Airport's city
    """
    contents = [(i, faker.city()) for i in range(1, 8)] # Also a good candidate for Lookup table
@schema
class Passenger(dj.Manual):
    definition = """
    # Defines users who have registered accounts with airline i.e. passenger
    passenger_id : serial # Passenger's unique identifier; serial simply means an auto-incremented, unsigned bigint
    ---
    full_name : varchar(40) # Passenger's full name
    ssn : varchar(20) # Passenger's Social Security Number
    """
Passenger.insert((dict(full_name=faker.name(),
                       ssn = faker.ssn()) for _ in range(10000))) # Insert a random set of passengers
@schema
class Flight(dj.Manual):
    definition = """
    # Defines specific planes assigned to a route
    flight_id : serial # Flight's unique identifier
    ---
    -> Plane # Flight's plane model specs; this will simply create a relation to Plane table but not have the constraint of uniqueness
    flight_economy_price : decimal(6,2) # Flight's fare price
    flight_departure : datetime # Flight's departure time
    flight_arrival : datetime # Flight's arrival time
    -> Airport.proj(flight_origin_code='airport_code') # Flight's origin; by using proj in this way we may rename the relation in this table
    -> Airport.proj(flight_dest_code='airport_code') # Flight's destination
    """
plane_types = Plane().fetch('plane_type') # Fetch available plane model types
Flight.insert((dict(plane_type = random.choice(plane_types),
                    flight_economy_price = round(random.randint(100, 1000), 2),
                    flight_departure = faker.date_time_this_month(),
                    flight_arrival = faker.date_time_this_month(),
                    flight_origin_code = random.randint(1, 7),
                    flight_dest_code = random.randint(1, 7))
               for _ in range(550))) # Insert a random set of flights; for simplicity we are not verifying that flight_departure < flight_arrival
@schema
class BookingRequest(dj.Manual):
    definition = """
    # Defines one-way booking requests initiated by passengers
    booking_id : serial # Booking Request's unique identifier
    ---
    -> Passenger # Passenger who made request
    -> Airport.proj(flight_origin_code='airport_code') # Booking Request's desired origin
    -> Airport.proj(flight_dest_code='airport_code') # Booking Request's desired destination
    """
BookingRequest.insert((dict(passenger_id = random.randint(1, 10000),
                            flight_origin_code = random.randint(1, 7),
                            flight_dest_code = random.randint(1, 7))
                       for i in range(30000))) # Insert a random set of booking requests
@schema
class Reservation(dj.Computed):
    definition = """
    # Defines booked reservations
    -> BookingRequest # Association to booking request
    ---
    flight_id : int # Flight's unique identifier
    reservation_seat : varchar(25) # Reservation's assigned seat
    """
    def make(self, key):
        # Determine booking request's details
        full_name, flight_origin_code, flight_dest_code = (BookingRequest * Passenger & key).fetch1(
            'full_name', 'flight_origin_code', 'flight_dest_code')
        # Determine possible flights to satisfy booking
        possible_flights = (Flight * Plane *
                            Airport.proj(flight_dest_city='airport_city',
                                         flight_dest_code='airport_code') &
                            dict(flight_origin_code=flight_origin_code,
                                 flight_dest_code=flight_dest_code)).fetch(
            'flight_id', 'plane_rows', 'plane_columns',
            'flight_economy_price', 'flight_dest_city', as_dict=True)
        # Iterate until we find a vacant flight and extract details
        for flight_meta in possible_flights:
            # Determine seat capacity
            all_seats = set((f'{r}{l}' for rows, letters in zip(*[[[n if i==0 else chr(n + 64)
                                                                    for n in range(1, el + 1)]]
                                                                  for i, el in enumerate((flight_meta['plane_rows'],
                                                                                          flight_meta['plane_columns']))])
                             for r in rows
                             for l in letters))
            # Determine unavailable seats
            taken_seats = set((Reservation & dict(flight_id=flight_meta['flight_id'])).fetch('reservation_seat'))
            try:
                # Randomly choose one of the available seats
                reserved_seat = random.choice(list(all_seats - taken_seats))
                # You may uncomment the below line if you wish to print the success message per processed record
                # print(f'Success. Reserving seat {reserved_seat} at ticket_price {flight_meta["flight_economy_price"]} for {full_name}.')
                # Insert new reservation
                self.insert1(dict(key, flight_id=flight_meta['flight_id'], reservation_seat=reserved_seat))
                return
            except IndexError:
                pass
        raise IndexError(f'Sorry, no seats available departing to {flight_meta["flight_dest_city"]}')
Reservation.populate(display_progress=True) # This is how we process new booking requests to assign a reservation; you may invoke this as often as necessary
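The seat-capacity expression inside make is dense, so here is a minimal standalone sketch of what it computes: every row number from 1 to plane_rows paired with every column letter starting at 'A'. The helper name all_seat_labels is hypothetical and not part of DataJoint; it is only meant to make the labelling scheme easier to follow.

```python
def all_seat_labels(plane_rows: int, plane_columns: int) -> set:
    """Build labels such as '1A' or '37D' for every seat in a plane model."""
    return {f"{row}{chr(64 + col)}"            # chr(65) is 'A', chr(66) is 'B', ...
            for row in range(1, plane_rows + 1)
            for col in range(1, plane_columns + 1)}

# 'B_Airbus' from the Plane lookup table has 37 rows and 4 columns
seats = all_seat_labels(37, 4)
print(len(seats))
print(sorted(seats, key=lambda s: (int(s[:-1]), s[-1]))[:5])
```

Subtracting the set of taken seats from this set, as make does, leaves the seats that are still free on a flight.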
Syntax and Convention Nits
Lastly, just some minor feedback in your provided code. Regarding table definitions, you should only use --- once in the definition to identify a clear distinction between primary key attributes and secondary attributes (See your Flight table). Unexpectedly, this did not throw an error in your case but should have done so. I will file an issue since this appears to be a bug.
Though transaction is exposed on dj.conn(), it is quite rare to need to invoke it directly. DataJoint handles this internally to spare the user the management overhead. However, the option is still available should it be needed for corner cases. For your case, I would avoid invoking it directly and recommend using Computed (or Imported) tables instead.
Let's say I have an assembly like this:
MainProduct:
-Product1 (Instance of Part1)
-Product2 (Instance of Part2)
-Product3 (Instance of Part2)
-Product4 (Instance of Part3)
...
Now, I want to copy/paste a feature from Product3 into another one.
But I run into problems when selecting the feature programmatically, because there are 2 instances of the part of that feature.
I can't control which feature will be selected by CATIA.ActiveDocument.Selection.Add(myExtractReference)
Catia always selects the feature from Product2 instead of the feature from Product3. So the position of the pasted feature will be wrong!
Does anybody know this problem and have a solution for it?
Edit:
The feature reference which I want to copy already exists as a variable because it was newly created (an extract of selected geometry)
I was able to get help elsewhere, but I still want to share my solution. It's written in Python, but in VBA it's almost the same.
The key is to access CATIA.Selection.Item(1).LeafProduct in order to know where the initial selection was made.
import win32com.client
import pycatia
CATIA = win32com.client.dynamic.DumbDispatch('CATIA.Application')
c_doc = CATIA.ActiveDocument
c_sel = c_doc.Selection
c_prod = c_doc.Product
# New part where the feature should be pasted
new_prod = c_prod.Products.AddNewComponent("Part", "")
new_part_doc = new_prod.ReferenceProduct.Parent
# from user selection
sel_obj = c_sel.Item(1).Value
sel_prod_by_user = c_sel.Item(1).LeafProduct # reference to the actual product where the selection was made
doc_from_sel = sel_prod_by_user.ReferenceProduct.Parent # part doc from selection
hb = doc_from_sel.Part.HybridBodies.Add() # new hybrid body for the extract. will be deleted later on
extract = doc_from_sel.Part.HybridShapeFactory.AddNewExtract(sel_obj)
hb.AppendHybridShape(extract)
doc_from_sel.Part.Update()
# Add the extract to the selection and copy it
c_sel.Clear()
c_sel.Add(extract)
sel_prod_by_catia = c_sel.Item(1).LeafProduct # reference to the product where Catia makes the selection
c_sel_copy() # will call Selection.Copy from VBA. Buggy in Python.
# Paste the extract into the new part in a new hybrid body
c_sel.Clear()
new_hb = new_part_doc.Part.HybridBodies.Item(1)
c_sel.Add(new_hb)
c_sel.PasteSpecial("CATPrtResultWithOutLink")
new_part_doc.Part.Update()
new_extract = new_hb.HybridShapes.Item(new_hb.HybridShapes.Count)
# Redo changes in the part, where the selection was made
c_sel.Clear()
c_sel.Add(hb)
c_sel.Delete()
# Create axis systems from Position object of sel_prod_by_user and sel_prod_by_catia
prod_list = [sel_prod_by_user, sel_prod_by_catia]
axs_list = []
for prod in prod_list:
    # conversion to pycatia's Position object, necessary in order to use Position.GetComponents
    pc_pos = pycatia.in_interfaces.position.Position(prod.Position)
    ax_comp = pc_pos.get_components()
    axs = new_part_doc.Part.AxisSystems.Add()
    axs.PutOrigin(ax_comp[9:12])
    axs.PutXAxis(ax_comp[0:3])
    axs.PutYAxis(ax_comp[3:6])
    axs.PutZAxis(ax_comp[6:9])
    axs_list.append(axs)
new_part_doc.Part.Update()
# Translate the extract from axis system derived from sel_prod_by_catia to sel_prod_by_user
extract_ref = new_part_doc.Part.CreateReferenceFromObject(new_extract)
tgt_ax_ref = new_part_doc.Part.CreateReferenceFromObject(axs_list[0])
ref_ax_ref = new_part_doc.Part.CreateReferenceFromObject(axs_list[1])
new_extract_translated = new_part_doc.Part.HybridShapeFactory.AddNewAxisToAxis(extract_ref, ref_ax_ref, tgt_ax_ref)
new_hb.AppendHybridShape(new_extract_translated)
new_part_doc.Part.Update()
I would suggest a different approach. Instead of adding references you get from somewhere (by name, probably), add the actual instance of the part to the selection while iterating through all the products. Or use instance names to get the correct part.
Here is a simple VBA example that iterates over a one-level tree and runs a select, copy, and paste scenario.
If you want to copy features, you have to dive deeper into the Instance objects.
Public Sub CatMain()
    Dim ActiveDoc As ProductDocument
    Dim ActiveSel As Selection
    If TypeOf CATIA.ActiveDocument Is ProductDocument Then 'of all the checks that people are using I think this one is most elegant and reliable
        Set ActiveDoc = CATIA.ActiveDocument
        Set ActiveSel = ActiveDoc.Selection
    Else
        Exit Sub
    End If
    Dim Instance As Product
    For Each Instance In ActiveDoc.Product.Products 'object oriented for ideal for us in this scenario
        If Instance.Products.Count = 0 Then 'beware that products without parts have also 0 items and are therefore mistaken for parts
            Call ActiveSel.Add(Instance)
        End If
    Next
    Call ActiveSel.Copy
    Call ActiveSel.Clear
    Dim NewDoc As ProductDocument
    Set NewDoc = CATIA.Documents.Add("CATProduct")
    Set ActiveSel = NewDoc.Selection
    Call ActiveSel.Add(NewDoc.Product)
    Call ActiveSel.Paste
    Call ActiveSel.Clear
End Sub
As mentioned in the title, I have a BigQuery table with 18 million rows, nearly half of which are useless. I am supposed to assign a topic/niche to each row based on an important column (one that has details about a product/website). I tested the NLP API on a sample of 10,000 rows and it did wonders, but my standard approach iterates over newarr (the important details column I obtain by querying my BigQuery table), sending only one cell at a time, awaiting the response from the API, and appending it to the results array.
Ideally, I want to run this operation on all 18 million rows in the minimum time. My per-minute quota has been increased to 3,000 API requests, so that is the maximum I can make, but I can't figure out how to send batches of 3,000 rows one after another, one batch per minute.
for x in newarr:
    i += 1
    results.append(sample_classify_text(x))
sample_classify_text is a function taken straight from the documentation:
#this function will return category for the text
from google.cloud import language_v1
def sample_classify_text(text_content):
    """
    Classifying Content in a String
    Args:
      text_content The text content to analyze. Must include at least 20 words.
    """
    client = language_v1.LanguageServiceClient()
    # text_content = 'That actor on TV makes movies in Hollywood and also stars in a variety of popular new TV shows.'
    # Available types: PLAIN_TEXT, HTML
    type_ = language_v1.Document.Type.PLAIN_TEXT
    # Optional. If not specified, the language is automatically detected.
    # For list of supported languages:
    # https://cloud.google.com/natural-language/docs/languages
    language = "en"
    document = {"content": text_content, "type_": type_, "language": language}
    response = client.classify_text(request = {'document': document})
    #return response.categories
    # Loop through classified categories returned from the API
    for category in response.categories:
        # Get the name of the category representing the document.
        # See the predefined taxonomy of categories:
        # https://cloud.google.com/natural-language/docs/categories
        x = format(category.name)
        return x
        # Get the confidence. Number representing how certain the classifier
        # is that this category represents the provided text.
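For illustration, the per-minute batching described above could be sketched as follows. This is only a sketch under stated assumptions: classify_in_batches and its parameters are hypothetical names, the classify argument stands in for a function such as sample_classify_text, and a thread pool is used so that requests within one batch do not wait on each other. The sleep and clock arguments exist only so the pacing can be exercised without real waiting.

```python
import time
from concurrent.futures import ThreadPoolExecutor

def classify_in_batches(texts, classify, batch_size=3000, window_seconds=60.0,
                        max_workers=32, sleep=time.sleep, clock=time.monotonic):
    """Send at most batch_size requests per window and keep results in input order."""
    results = []
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        for start in range(0, len(texts), batch_size):
            window_start = clock()
            batch = texts[start:start + batch_size]
            # pool.map runs the calls concurrently but yields results in input order
            results.extend(pool.map(classify, batch))
            remaining = window_seconds - (clock() - window_start)
            more_batches = start + batch_size < len(texts)
            if more_batches and remaining > 0:
                sleep(remaining)  # wait out the rest of the quota window
    return results

# Tiny demonstration with a stand-in classifier and a fake clock (no real waiting)
fake_now = [0.0]
naps = []
def fake_sleep(seconds):
    naps.append(seconds)
    fake_now[0] += seconds

demo = classify_in_batches(["a", "b", "c", "d", "e"], str.upper, batch_size=2,
                           sleep=fake_sleep, clock=lambda: fake_now[0])
print(demo, naps)
```

With the real API, classify would be the classification function and batch_size the per-minute quota; error handling and retries are left out of this sketch.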
I'm trying to read data from a CSV file and then process it in different ways. (For starters, just the average.)
Data
(OneDrive) https://1drv.ms/u/s!ArLDiUd-U5dtg0teQoKGguBA1qt9?e=6wlpko
The data looks like this:
ID; Property1; Property2; Property3...
1; ....
1; ...
1; ...
2; ...
2; ...
3; ...
...
Every line is a GPS point. All points with same ID together (for example 1) produce one Route. The routes are not of the same length and some IDs are skipped. So it isn't a seamless increase of numbers.
I may need to add, that the points are ALWAYS the same set of meters apart from each other. And I don't need the XY information currently.
Wanted Result
In the end I want something like this:
[ID, AVG_Property1, AVG_Property2...] [1, 1.00595, 2.9595, ...] [2,1.50606, 1.5959, ...]
What I got so far
import os
import numpy
import pandas as pd
data = pd.read_csv(os.path.join('C:\\data' ,'data.csv'), sep=';')
# [id, len, prop1, prop2, ...]
routes = numpy.zeros((data.size, 10)) # 10 properties
sums = numpy.zeros(8)
nr_of_entries = 0
current_id = 1
for index, row in data.iterrows():
    if(int(row['id']) != current_id): #after the last point of the route
        routes[current_id-1][0] = current_id
        routes[current_id-1][1] = nr_of_entries #how many points are in this route?
        routes[current_id-1][2] = sums[0] / nr_of_entries
        routes[current_id-1][3] = sums[1] / nr_of_entries
        routes[current_id-1][4] = sums[2] / nr_of_entries
        routes[current_id-1][5] = sums[3] / nr_of_entries
        routes[current_id-1][6] = sums[4] / nr_of_entries
        routes[current_id-1][7] = sums[5] / nr_of_entries
        routes[current_id-1][8] = sums[6] / nr_of_entries
        routes[current_id-1][9] = sums[7] / nr_of_entries
        current_id = int(row['id'])
        sums = numpy.zeros(8)
        nr_of_entries = 0
    sums[0] += row[3]
    sums[1] += row[4]
    sums[2] += row[5]
    sums[3] += row[6]
    sums[4] += row[7]
    sums[5] += row[8]
    sums[6] += row[9]
    sums[7] += row[10]
    nr_of_entries = nr_of_entries + 1
routes
My problem
1.) The way I did it, I have to copy and paste the same code for every other processing approach, since, as stated, I need to process the data in multiple different ways. The average is just an example.
2.) The reading of the data is clumsy and fails when IDs are missing.
3.) I'm a C# developer, so my approach would be to create a class 'Route' which holds all the points and then provides methods such as 'calculate average for prop 1'. This way I could also tweak the data if needed (extreme values, for example). But I have no idea how this would be done in Python or whether it is a reasonable approach in this language.
4.) Is there a more elegant way to iterate through the original CSV and get Route ID 1, then Route ID 2, and so on? Maybe something like LINQ queries in C#?
Thanks for any help.
Here is a solution and some ideas you can use. The example features multiple options for the same issue, so you have to choose which one fits your purpose best. Also, it is Python 3.7; you didn't specify a version, so I hope this works.
import pandas as pd
class Route(object):
    """description of class"""
    def __init__(self, id, rawdata): # on startup
        self.id = id
        self.rawdata = rawdata
        self.avg_Prop1 = self.calculate_average('Prop1')
        self.sum_Prop4 = None
    def calculate_average(self, Prop_Name): #selfreference for first argument in class method
        return self.rawdata[Prop_Name].mean()
    def give_Prop_data(self, Prop_Name): #return the Propdata as list
        return self.rawdata[Prop_Name].tolist()
    def any_function(self, my_function, Prop_Name): #not sure what dataframes support so turning it into a list first
        return my_function(self.rawdata[Prop_Name].tolist())
#end of class definition
data = pd.read_csv('testdata.csv', sep=';')
# [id, len, prop1, prop2, ...]
route_list = [] #List of all the objects created from the route class
for i in data.id.unique():
    print('Current id:', i,' with ',len(data[data['id']==i]),'entries')
    route_list.append(Route(i,data[data['id']==i]))
#created the Prop1 average in initialization of route so just accessing attribute
print(route_list[1].avg_Prop1)
for current_route in route_list:
    print('Route ',current_route.id , ' Properties :')
    for i in current_route.rawdata.columns[1:]: #for all except the first (id)
        print(i, ' has average ', current_route.calculate_average(i)) #i is the string of the column not just an id
#or pass any function that you want
route_list[1].sum_Prop4 = (route_list[1].any_function(sum,'Prop4'))
print(route_list[1].sum_Prop4)
#which is equivalent to
print(sum(route_list[1].rawdata['Prop4']))
To address your individual problems out of order:
For 2. and 4.) Looping only over the existing IDs (data.id.unique()) solves the problem. I have no idea what LINQ queries are, but I assume they are similar. In general, Python has a great way of looping over objects (like for current_route in route_list), which is worth looking into if you want to use it a little more.
For 1. and 3.) Again, looping solves the issue. I created a class in the example, mostly to show the syntax for classes. The benefits and drawbacks of using classes should be the same in Python as in C#.
As it is right now, the class probably isn't great, but this depends on how you want to use it. If the class should just be a practical way of storing and accessing data, it shouldn't have the methods, because you don't need an individual average method for each route. Then you can just access its data and use it in a function, as in sum(route_list[1].rawdata['Prop4']). If, however, different calculations are necessary depending on the data (the number of rows, for example), it might come in handy to use the method calculate_average and differentiate in there.
Another example would be the use of the attributes. If you need the average for Prop1 every time, creating it at initialization seems a good idea; otherwise I wouldn't bother always calculating it.
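As a complement to the loop over data.id.unique(), the per-route averages can also be computed in one step with pandas' groupby, which is probably the closest thing to a LINQ-style query here. This is a minimal sketch on made-up sample data; the column names id, Prop1 and Prop2 are assumptions and need to be adapted to the real file.

```python
import io
import pandas as pd

# Hypothetical sample in the question's layout (note the skipped id 3)
csv_text = """id;Prop1;Prop2
1;1.0;2.0
1;3.0;4.0
2;5.0;6.0
4;7.0;8.0
4;9.0;10.0
"""
data = pd.read_csv(io.StringIO(csv_text), sep=";")

# One row per route id, one column per property; missing ids are simply absent
averages = data.groupby("id").mean()
print(averages)

# Several statistics at once, so no code has to be copied per processing approach
summary = data.groupby("id").agg(["mean", "min", "max", "count"])
print(summary)
```

Iterating with for route_id, points in data.groupby("id") also yields each route's rows as its own DataFrame, which could be passed to a class like Route if one is still wanted.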
I hope this helps!
I am using the library OBD-Python and when I tried to get a VIN number from my vehicle even following the Custom Commands documentation, I received this message:
[obd.obd] 'b'0902': VIN NUMBER' is not supported
Date: 2018-07-09 14:48:30.428588 -- VIN NUMBER: None.
import datetime
import obd
from obd import OBDCommand, Unit
from obd.protocols import ECU
from obd.utils import bytes_to_int
def vin(messages):
    """ decoder for RPM messages """
    d = messages[0].data # only operate on a single message
    d = d[2:] # chop off mode and PID bytes
    v = bytes_to_int(d) / 4.0 # helper function for converting byte arrays to ints
    return v * Unit.VIN # construct a Pint Quantity
c = OBDCommand("VIN", # name
               "VIN NUMBER", # description
               b"0902", # command
               17, # number of return bytes to expect
               vin, # decoding function
               ECU.ENGINE, # (optional) ECU filter
               True) # (optional) allow a "01" to be added for speed
o = obd.OBD()
o.supported_commands.add(c)
o.query(c)
print('Data: ' + str(datetime.datetime.now()) + ' -- VIN NUMBER: '+str(o.query(c)))
What am I doing wrong?
You're not doing anything wrong. Almost all commands defined by SAE J1979 are optional: vendors can choose to implement them or not. In the case of your vehicle, it looks like the vendor decided against it.
Some vehicle manufacturers respond with all 0xFF in the bytes. They may do this to thwart third-party OBD2 scan tool providers whose tools can only be used on a limited number of vehicles, where increasing that number requires purchasing more licenses. Filling the VIN with all 0xFF means that this trick no longer works, so their service centers can use third-party OBD2 scan tools without having to keep buying additional VIN licenses as the fleet of vehicles they service grows. Just my thoughts.
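To tell a real VIN apart from the all-0xFF filler described above, a decoder can check the raw bytes before converting them to text. This is a minimal sketch in plain Python, not python-OBD's API: decode_vin is a hypothetical helper, and it assumes the response payload ends with the 17 VIN characters encoded as ASCII, with any header bytes in front of them.

```python
from typing import Optional

def decode_vin(payload: bytes) -> Optional[str]:
    """Return the VIN as text, or None when the ECU sent only 0xFF filler."""
    vin_bytes = payload[-17:]            # a VIN is 17 characters long
    if len(vin_bytes) < 17:
        return None                      # response too short to contain a VIN
    if all(b == 0xFF for b in vin_bytes):
        return None                      # manufacturer blanked the VIN
    return vin_bytes.decode("ascii", errors="replace")

# The three leading bytes are an assumed example header, not taken from a real car
print(decode_vin(b"\x49\x02\x01" + b"1HGCM82633A004352"))
print(decode_vin(b"\x49\x02\x01" + b"\xff" * 17))
```

A function along these lines could replace the RPM-style arithmetic in the vin decoder from the question, since a VIN is text rather than a number.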
The Python docs state that uuid1 uses the current time to form the UUID value, but I could not find a reference that ensures UUID1 values are sequential.
>>> import uuid
>>> u1 = uuid.uuid1()
>>> u2 = uuid.uuid1()
>>> u1 < u2
True
>>>
But not always:
>>> def test(n):
...     old = uuid.uuid1()
...     print(old)
...     for x in range(n):
...         new = uuid.uuid1()
...         if old >= new:
...             print("OOops")
...             break
...         old = new
...     print(new)
>>> test(1000000)
fd4ae687-3619-11e1-8801-c82a1450e52f
OOops
00000035-361a-11e1-bc9f-c82a1450e52f
UUIDs Not Sequential
No, standard UUIDs are not meant to be sequential.
Apparently some attempts were made with GUIDs (Microsoft's twist on UUIDs) to make them sequential to help with performance in certain database scenarios. But being sequential is not the intent of UUIDs.
http://en.wikipedia.org/wiki/Globally_unique_identifier
MAC Is Last, Not First
No, in standard UUIDs, the MAC address is not the first component. The MAC address is the last component in a Version 1 UUID.
http://en.wikipedia.org/wiki/Universally_unique_identifier
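The field layout can be inspected with Python's standard uuid module. This minimal sketch parses the two values printed in the question and shows that the node (MAC address) is the last field, and that comparing the UUIDs themselves disagrees with comparing their embedded timestamps, because the low bits of the time come first.

```python
import uuid

# The two consecutive values printed by the test in the question
old = uuid.UUID("fd4ae687-3619-11e1-8801-c82a1450e52f")
new = uuid.UUID("00000035-361a-11e1-bc9f-c82a1450e52f")

print(old.version, new.version)   # both are version 1 (time-based)
print(hex(old.time_low))          # first group: low 32 bits of the timestamp
print(hex(old.node))              # last group: the node, i.e. the MAC address

# Comparing UUID objects compares their 128-bit integers, which start with time_low
print(old > new)
# The full 60-bit timestamp, reassembled by the uuid module, is still increasing
print(old.time < new.time)
```

So the timestamps in the question's output did increase; only the comparison of the whole UUID value went backwards when time_low wrapped around.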
Do Not Assume Which Type Of UUID
The various versions of UUIDs are meant to be compatible with each other. So it may be unreasonable to expect that you always have Version 1 UUIDs. Other programmers may use other versions.
Specification
Read the UUID spec, RFC 4122, by the IETF. It is a fairly short document.
From the python UUID docs:
Generate a UUID from a host ID, sequence number, and the current time. If node is not given, getnode() is used to obtain the hardware address. If clock_seq is given, it is used as the sequence number; otherwise a random 14-bit sequence number is chosen.
From this, I infer that the MAC address is first, then a (possibly random) sequence number, then the current time. So I would not expect these to be guaranteed to be monotonically increasing, even for UUIDs generated by the same machine/process.
I stumbled upon a probable answer in Cassandra/Python from http://doanduyhai.wordpress.com/2012/07/05/apache-cassandra-tricks-and-traps/
Lexicographic TimeUUID ordering
Cassandra provides, among all the primitive types, support for UUID values of type 1 (time and server based) and type 4 (random).
The primary use of UUID (Unique Universal IDentifier) is to obtain a really unique identifier in a potentially distributed environment.
Cassandra does support version 1 UUID. It gives you an unique identifier by combining the computer’s MAC address and the number of 100-nanosecond intervals since the beginning of the Gregorian calendar.
As you can see the precision is only 100 nanoseconds, but fortunately it is mixed with a clock sequence to add randomness. Furthermore the MAC address is also used to compute the UUID so it’s very unlikely that you face collision on one cluster of machine, unless you need to process a really really huge volume of data (don’t forget, not everyone is Twitter or Facebook).
One of the most relevant use cases for UUID, and especially TimeUUID, is to use it as a column key. Since Cassandra column keys are sorted, we can take advantage of this feature to have a natural ordering for our column families.
The problem with the default com.eaio.uuid.UUID provided by the Hector client is that it’s not easy to work with. As an ID you may need to bring this value from the server up to the view layer, and that’s the gotcha.
Basically, com.eaio.uuid.UUID overrides toString() to give a String representation of the UUID. However, this String formatting cannot be sorted lexicographically…
Below are some TimeUUID generated consecutively:
8e4cab00-c481-11e1-983b-20cf309ff6dc at some t1
2b6e3160-c482-11e1-addf-20cf309ff6dc at some t2 with t2 > t1
“2b6e3160-c482-11e1-addf-20cf309ff6dc”.compareTo(“8e4cab00-c481-11e1-983b-20cf309ff6dc”) gives -6 meaning that “2b6e3160-c482-11e1-addf-20cf309ff6dc” is less/before “8e4cab00-c481-11e1-983b-20cf309ff6dc” which is incorrect.
The current textual display of TimeUUID is split as follows:
time_low – time_mid – time_high_and_version – variant_and_sequence – node
If we re-order it starting with time_high_and_version, we can then sort it lexicographically:
time_high_and_version – time_mid – time_low – variant_and_sequence – node
The utility class is given below:
public static String reorderTimeUUId(String originalTimeUUID)
{
    StringTokenizer tokens = new StringTokenizer(originalTimeUUID, "-");
    if (tokens.countTokens() == 5)
    {
        String time_low = tokens.nextToken();
        String time_mid = tokens.nextToken();
        String time_high_and_version = tokens.nextToken();
        String variant_and_sequence = tokens.nextToken();
        String node = tokens.nextToken();
        return time_high_and_version + '-' + time_mid + '-' + time_low + '-' + variant_and_sequence + '-' + node;
    }
    return originalTimeUUID;
}
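Since the question is about Python, the same reordering can be sketched there as well. The function name reorder_time_uuid is only illustrative; it mirrors the Java utility above and is applied to the two example TimeUUIDs from the article.

```python
def reorder_time_uuid(original: str) -> str:
    """Move the high time bits to the front so that string order follows time order."""
    parts = original.split("-")
    if len(parts) != 5:
        return original  # not in the expected five-group format, leave untouched
    time_low, time_mid, time_high_and_version, variant_and_sequence, node = parts
    return "-".join([time_high_and_version, time_mid, time_low,
                     variant_and_sequence, node])

earlier = "8e4cab00-c481-11e1-983b-20cf309ff6dc"  # generated at t1
later = "2b6e3160-c482-11e1-addf-20cf309ff6dc"    # generated at t2 > t1

print(sorted([earlier, later]))                         # plain string order
print(sorted([earlier, later], key=reorder_time_uuid))  # time order
```

Using the reordered string only as a sort key keeps the original UUID text intact for storage and display.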
The TimeUUIDs become:
11e1-c481-8e4cab00-983b-20cf309ff6dc
11e1-c482-2b6e3160-addf-20cf309ff6dc
Now we get:
"11e1-c481-8e4cab00-983b-20cf309ff6dc".compareTo("11e1-c482-2b6e3160-addf-20cf309ff6dc") = -1
Calling uuid.uuid1() without arguments gives non-sequential results (see the answer by @basil-bourque), but it can easily be made sequential if you set the clock_seq or node arguments (because in that case uuid1 uses the Python implementation, which guarantees a unique and sequential timestamp part of the UUID within the current process):
import time
from uuid import uuid1, getnode
from random import getrandbits
_my_clock_seq = getrandbits(14)
_my_node = getnode()
def sequential_uuid(node=None):
    return uuid1(node=node, clock_seq=_my_clock_seq)
def alt_sequential_uuid(clock_seq=None):
    return uuid1(node=_my_node, clock_seq=clock_seq)
if __name__ == '__main__':
    from itertools import count
    old_n = uuid1() # "Native"
    old_s = sequential_uuid() # Sequential
    native_conflict_index = None
    t_0 = time.time()
    for x in count():
        new_n = uuid1()
        new_s = sequential_uuid()
        if old_n > new_n and not native_conflict_index:
            native_conflict_index = x
        if old_s >= new_s:
            print("OOops: non-sequential results for `sequential_uuid()`")
            break
        if (x >= 10*0x3fff and time.time() - t_0 > 30) or (native_conflict_index and x > 2*native_conflict_index):
            print('No issues for `sequential_uuid()`')
            break
        old_n = new_n
        old_s = new_s
    print(f'Conflicts for `uuid.uuid1()`: {bool(native_conflict_index)}')
    print(f"Tries: {x}")
Multiple processes issues
BUT if you are running parallel processes on the same machine, then:
node, which defaults to uuid.getnode(), will be the same for all the processes;
clock_seq has a small chance of being the same for some processes (a chance of 1/16384).
That might lead to conflicts! This is a general concern when using uuid.uuid1 in parallel processes on the same machine, unless you have access to SafeUUID from Python 3.7.
If you make sure to also set node to a unique value for each parallel process that runs this code, then conflicts should not happen.
Even if you are using SafeUUID and set a unique node, it's still possible to get non-sequential IDs if they are generated in different processes.
If some lock-related overhead is acceptable, then you can store clock_seq in some external atomic storage (for example, in a "locked" file) and increment it with each call: this lets you keep the same value for node on all parallel processes and also makes the IDs sequential. For cases when all parallel processes are subprocesses created using multiprocessing, clock_seq can be "shared" using multiprocessing.Value.