First, I would like to apologize in advance: I just started learning this a few months ago, so I need things broken down completely. I have a project using Python and DataJoint (which makes the SQL code shorter to write) where I need to model an air-travel system with at least 7 airports, different planes, and so on. Then I need to populate the tables with passenger reservations. Here is what I have so far.
@schema
class Seat(dj.Lookup):
definition = """
aircraft_seat : varchar(25)
"""
contents = [["F_Airbus_1A"],["F_Airbus_1B"],["F_Airbus_2A"],["F_Airbus_2B"],["F_Airbus_3A"],
["F_Airbus_3B"],["F_Airbus_4A"],["F_Airbus_4B"],["F_Airbus_5A"],["F_Airbus_5B"],
["B_Airbus_6A"],["B_Airbus_6B"],["B_Airbus_6C"],["B_Airbus_6D"],["B_Airbus_7A"],
["B_Airbus_7B"],["B_Airbus_7C"],["B_Airbus_7D"],["B_Airbus_8A"],["B_Airbus_8B"],
["B_Airbus_8C"],["B_Airbus_8D"],["B_Airbus_9A"],["B_Airbus_9B"],
This keeps going, leaving me with a total of 144 seats on each plane.
@schema
class Flight(dj.Manual):
definition = """
flight_no : int
---
economy_price : decimal(6,2)
departure : datetime
arrival : datetime
---
origin_code : int
dest_code : int
"""
@schema
class Passenger(dj.Manual):
definition = """
passenger_id : int
---
full_name : varchar(40)
ssn : varchar(20)
"""
@schema
class Reservation(dj.Manual):
definition = """
-> Flight
-> Seat
---
-> Passenger
"""
Then I populate flights and passengers:
Flight.insert((dict(flight_no = i,
economy_price = round(random.randint(100, 1000), 2),
departure = faker.date_time_this_month(),
arrival = faker.date_time_this_month(),
origin_code = random.randint(1,7),
dest_code = random.randint(1,7)))
for i in range(315))
Passenger.insert(((dict(passenger_id=i, full_name=faker.name(),
ssn = faker.ssn()))
for i in range(10000)), skip_duplicates = True)
Lastly I create the transaction:
def reserve(passenger_id, origin_code, dest_code, departure):
with dj.conn().transaction:
available_seats = ((Seat * Flight - Reservation) & Passenger &
{'passenger_id':passenger_id}).fetch(as_dict=True)
try:
choice = random.choice(available_seats)
except IndexError:
raise IndexError(f'Sorry, no seats available for {departure}')
name = (Passenger & {'passenger_id': passenger_id}).fetch1('full_name')
        print('Success. Reserving seat {aircraft_seat} at ticket_price {economy_price} '
              'for {name}'.format(name=name, **choice))
Reservation.insert1(dict(choice, passenger_id=passenger_id), ignore_extra_fields=True)
reserve(random.randint(1,1000), random.randint(1,7),
random.randint(1,7),random.choice('departure'))
Output[]: Success. Reserving seat E_Yak242_24A at ticket_price 410.00 for Cynthia Erickson
Reservation()
Output[]: flight_no aircraft_seat passenger_id
66 B_Yak242_7A 441
So I am required to have 10.5 flights a day with the planes at least 75% full, which leaves me needing over 30,000 reservations. Is there a way to do this, say, 50 at a time? I have been searching for an answer and have not been able to find a solution. Thank you.
One of the maintainers for DataJoint here. First off, I'd like to say thanks for trying out DataJoint; curious as to how you found out about the project.
Forewarning: this will be a long post, but I feel it is a good opportunity to clear up a few things. Regarding the problem in question, I am not sure I fully understand its nature, but let me follow up on several points. I recommend reading this answer in its entirety before deciding how best to proceed in your case.
TL;DR: Compute tables are your friend.
Multi-threading
Since it has come up in the comments it is worth addressing that as of 2021-01-06, DataJoint is not completely thread-safe (at least from the perspective of sharing connections). It is unfortunate but it is mainly due to a standing issue with PyMySQL which is a principal dependency of DataJoint. That said, if you initiate a new connection on each thread or process you should not run into any issues. However, this is an expensive workaround and can't be combined with transactions since they require that operations be conducted within a single connection. Speaking of which...
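For illustration, here is a minimal sketch of that workaround with one connection per process. The worker name and batch contents are hypothetical; dj.conn(reset=True) forces a fresh connection (verify the flag against your DataJoint version):
import multiprocessing as mp
import datajoint as dj

def process_batch(batch_ids):  # hypothetical worker
    dj.conn(reset=True)  # discard any inherited connection; open a new one in this process
    # ... perform inserts/fetches for batch_ids here ...
    return len(batch_ids)

if __name__ == '__main__':
    batches = [list(range(i, i + 50)) for i in range(0, 200, 50)]
    with mp.Pool(4) as pool:
        pool.map(process_batch, batches)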
Compute Tables and Job Reservation
Compute tables are one noticeable omission from your attempt at a solution. A Compute table associates its entities with those in an upstream parent table, with additional processing prior to insert (defined in a make method in your Compute table class); it is invoked by calling the populate method, which calls make for each new entry. Calls to your make method are transaction-constrained and should achieve what you are looking for. See here in the docs for more details on its use.
Also, for additional performance gains, there is another feature called Job Reservation which provides a means to pool together multiple workers to process large data sets (using populate) in an organized, distributed manner. I don't feel it is required here, but it is worth mentioning; ultimately it is up to how you view the results below. You may find out more on this feature here in our docs.
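As a minimal sketch (using the Reservation table from the design below), enabling it is just a flag on populate:
Reservation.populate(reserve_jobs=True)  # safe to invoke from multiple workers concurrently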
Schema Design
Based on my understanding of your initial design, I have some suggestions on how we can improve the flow of the data to increase clarity and performance, and to provide specific examples of how we can use the power of Compute tables. Running the code illustrated below on my local setup, I was able to process your requirement of 30k reservations in 29m54s, with 2 different plane model types, 7 airports, 10k possible passengers, and 550 available flights. The minimum 75% seating capacity was not verified only because I didn't see you attempt this yet, though if you look at how I am assigning seats you will notice that it is almost there. :)
Disclaimer: I should note that the design below is still a large oversimplification of the real-world challenge of orchestrating proper travel reservations. Considerable assumptions were made, mainly for the benefit of education as opposed to providing a full, drop-in solution. As such, I have explicitly chosen to avoid using longblob in the solution below so that it is easier to follow along. In reality, a proper solution would likely include more advanced topics for further performance gains, e.g. longblob, _update, etc.
That said, let's begin by considering the following:
import datajoint as dj # conda install -c conda-forge datajoint or pip install datajoint
import random
from faker import Faker # pip install Faker
faker = Faker()
Faker.seed(0) # Pin down randomizer between runs
schema = dj.Schema('commercial_airtravel') # instantiate a workable database
@schema
class Plane(dj.Lookup):
definition = """
# Defines manufacturable plane model types
plane_type : varchar(25) # Name of plane model
---
plane_rows : int # Number of rows in plane model i.e. range(1, plane_rows + 1)
plane_columns : int # Number of columns in plane model; to extract letter we will need these indices
"""
contents = [('B_Airbus', 37, 4), ('F_Airbus', 40, 5)] # Since new entries to this table should happen infrequently, this is a good candidate for a Lookup table
@schema
class Airport(dj.Lookup):
definition = """
# Defines airport locations that can serve as origin or destination
airport_code : int # Airport's unique identifier
---
airport_city : varchar(25) # Airport's city
"""
contents = [(i, faker.city()) for i in range(1, 8)] # Also a good candidate for Lookup table
@schema
class Passenger(dj.Manual):
definition = """
# Defines users who have registered accounts with airline i.e. passenger
passenger_id : serial # Passenger's unique identifier; serial simply means an auto-incremented, unsigned bigint
---
full_name : varchar(40) # Passenger's full name
ssn : varchar(20) # Passenger's Social Security Number
"""
Passenger.insert((dict(full_name=faker.name(),
ssn = faker.ssn()) for _ in range(10000))) # Insert a random set of passengers
@schema
class Flight(dj.Manual):
definition = """
# Defines specific planes assigned to a route
flight_id : serial # Flight's unique identifier
---
-> Plane # Flight's plane model specs; this will simply create a relation to Plane table but not have the constraint of uniqueness
flight_economy_price : decimal(6,2) # Flight's fare price
flight_departure : datetime # Flight's departure time
flight_arrival : datetime # Flight's arrival time
-> Airport.proj(flight_origin_code='airport_code') # Flight's origin; by using proj in this way we may rename the relation in this table
-> Airport.proj(flight_dest_code='airport_code') # Flight's destination
"""
plane_types = Plane().fetch('plane_type') # Fetch available plane model types
Flight.insert((dict(plane_type = random.choice(plane_types),
flight_economy_price = round(random.randint(100, 1000), 2),
flight_departure = faker.date_time_this_month(),
flight_arrival = faker.date_time_this_month(),
flight_origin_code = random.randint(1, 7),
flight_dest_code = random.randint(1, 7))
for _ in range(550))) # Insert a random set of flights; for simplicity we are not verifying that flight_departure < flight_arrival
@schema
class BookingRequest(dj.Manual):
definition = """
# Defines one-way booking requests initiated by passengers
booking_id : serial # Booking Request's unique identifier
---
-> Passenger # Passenger who made request
-> Airport.proj(flight_origin_code='airport_code') # Booking Request's desired origin
-> Airport.proj(flight_dest_code='airport_code') # Booking Request's desired destination
"""
BookingRequest.insert((dict(passenger_id = random.randint(1, 10000),
flight_origin_code = random.randint(1, 7),
flight_dest_code = random.randint(1, 7))
for i in range(30000))) # Insert a random set of booking requests
@schema
class Reservation(dj.Computed):
definition = """
# Defines booked reservations
-> BookingRequest # Association to booking request
---
flight_id : int # Flight's unique identifier
reservation_seat : varchar(25) # Reservation's assigned seat
"""
def make(self, key):
# Determine booking request's details
full_name, flight_origin_code, flight_dest_code = (BookingRequest * Passenger & key).fetch1('full_name',
'flight_origin_code',
'flight_dest_code')
# Determine possible flights to satisfy booking
possible_flights = (Flight * Plane *
Airport.proj(flight_dest_city='airport_city',
flight_dest_code='airport_code') &
dict(flight_origin_code=flight_origin_code,
flight_dest_code=flight_dest_code)).fetch('flight_id',
'plane_rows',
'plane_columns',
'flight_economy_price',
'flight_dest_city',
as_dict=True)
# Iterate until we find a vacant flight and extract details
for flight_meta in possible_flights:
# Determine seat capacity
            rows = range(1, flight_meta['plane_rows'] + 1)
            letters = [chr(n + 64) for n in range(1, flight_meta['plane_columns'] + 1)]  # 1 -> 'A', 2 -> 'B', ...
            all_seats = {f'{r}{l}' for r in rows for l in letters}
# Determine unavailable seats
taken_seats = set((Reservation & dict(flight_id=flight_meta['flight_id'])).fetch('reservation_seat'))
try:
# Randomly choose one of the available seats
reserved_seat = random.choice(list(all_seats - taken_seats))
# You may uncomment the below line if you wish to print the success message per processed record
# print(f'Success. Reserving seat {reserved_seat} at ticket_price {flight_meta["flight_economy_price"]} for {full_name}.')
# Insert new reservation
self.insert1(dict(key, flight_id=flight_meta['flight_id'], reservation_seat=reserved_seat))
return
except IndexError:
pass
        # flight_meta would be undefined here if no flights matched, so report the requested destination
        raise IndexError(f'Sorry, no seats available for destination {flight_dest_code}')
Reservation.populate(display_progress=True) # This is how we process new booking requests to assign a reservation; you may invoke this as often as necessary
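On your "50 at a time" question specifically: populate can also be throttled. Assuming a reasonably recent DataJoint version, it accepts a max_calls argument, so you can process booking requests in batches:
Reservation.populate(max_calls=50, display_progress=True)  # handle at most 50 booking requests per call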
Syntax and Convention Nits
Lastly, just some minor feedback on your provided code. Regarding table definitions, you should only use --- once in a definition, to mark the distinction between primary key attributes and secondary attributes (see your Flight table). Unexpectedly, this did not throw an error in your case but should have done so; I will file an issue since this appears to be a bug.
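For reference, your original Flight definition with a single separator would read:
@schema
class Flight(dj.Manual):
    definition = """
    flight_no : int
    ---
    economy_price : decimal(6,2)
    departure : datetime
    arrival : datetime
    origin_code : int
    dest_code : int
    """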
Though transaction is exposed on dj.conn(), it is quite rare to need to invoke it directly. DataJoint handles this internally, sparing the user the overhead of managing transactions; the option is still available should it be needed for corner cases. For your case, I would avoid invoking it directly and recommend using Computed (or Imported) tables instead.
I was wondering if there's a good way to find the next available gap to create a network block given a list of existing ones?
For example, I have these networks in my list:
[
'10.0.0.0/24',
'10.0.0.0/20',
'10.10.0.0/20',
]
and then someone comes along and asks: "Do you have enough space for one /22 for me?"
I'd like to be able to suggest something along the lines of:
"Here's a space: x.x.x.x/22" (x.x.x.x is something that comes before 10.0.0.0)
or
"Here's a space: x.x.x.x/22" (x.x.x.x is something in between 10.0.0.255 and 10.10.0.0)
or
"Here's a space: x.x.x.x/22" (x.x.x.x is something that comes after 10.10.15.255)
I'd really appreciate any suggestions.
The ipaddress library is good for this sort of use case. You can use the IPv4Network class to define subnet ranges, and the IPv4Address objects it can return can be converted into integers for comparison.
What I do below:
Establish your given list as a list of IPv4Networks
Determine the size of the block we're looking for
Iterate through the list, computing the amount of space between consecutive blocks, and checking if our wanted block fits.
You could also return an IPv4Network with the subnet built into it, instead of an IPv4Address, but I'll leave that as an exercise to the reader.
from ipaddress import IPv4Network, IPv4Address

networks = [
    IPv4Network('10.0.0.0/24'),
    IPv4Network('10.0.0.0/20'),
    IPv4Network('10.10.0.0/20'),
]  # assumed sorted by network address

wanted = 22
wanted_size = 2 ** (32 - wanted)  # number of addresses in a /22

space_found = None
for i in range(1, len(networks)):
    previous_network_end = int(networks[i - 1].broadcast_address)  # last address of the earlier block
    next_network_start = int(networks[i].network_address)
    free_space_size = next_network_start - previous_network_end - 1  # addresses strictly between blocks
    if free_space_size >= wanted_size:
        space_found = IPv4Address(previous_network_end + 1)  # first available address
        break
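As a follow-up sketch for that exercise: IPv4Network accepts an (address, prefix) tuple, but it requires the host bits to be zero, so the first free address has to be rounded up to a /22 boundary before constructing the network:
# Round the first free address up to the next multiple of wanted_size,
# then re-check that the aligned block still fits in the gap.
aligned_start = (previous_network_end + wanted_size) // wanted_size * wanted_size
if next_network_start - aligned_start >= wanted_size:
    space_found = IPv4Network((aligned_start, wanted))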
I've recently started using PyModbus and have found it very easy to do basic polling with their ModbusTCPClient and read_holding_registers function.
I'm interested now in the best ways to structure a more complex logger - non-consecutive registers, different function codes, different Endian encoding, etc.
For example - to avoid a separate 'read_holding_registers' call for each tag of a device, I have built a function that groups all consecutive tag registers to reduce the number of calls.
I'm planning to implement a similar thing for BinaryPayloadDecoders - group by registers with the same byteorder and wordorder to reduce the number of decoder instances.
def polldevicesfast(client, device, taglist):
#loop through tags, order by address, group consecutive addresses in single reads, merge resulting lists, decode
orderedtaglist = sorted(taglist, key = lambda i: i['address'])
callgroups = sorttogroups(orderedtaglist)
allreturns = []
results = []
for acall in callgroups:
areturn = client.read_holding_registers(acall['start'], (1 + (acall['end'] - acall['start'])), unit=device['device_id'])
allreturns = allreturns + areturn.registers
decoder = BinaryPayloadDecoder.fromRegisters(allreturns, byteorder=Endian.Big, wordorder=Endian.Big)
for tag in orderedtaglist:
results.append({'tagname': tag['name'], 'value': str(tag['autoScaling']['slope'] * mydecoder(tag['dataType'], decoder)), 'unit': tag['unit']})
client.close()
return results
None of this is extremely complicated; it just seems like there should already be an accepted standard or template for this somewhere, which I can't seem to find in any of their documentation online.
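For reference, since sorttogroups is used above but not shown, here is a minimal sketch of what such a grouping helper could look like, assuming each tag occupies a single register:
def sorttogroups(orderedtaglist):
    # Merge tags with consecutive register addresses into single read ranges
    groups = []
    for tag in orderedtaglist:
        if groups and tag['address'] <= groups[-1]['end'] + 1:
            groups[-1]['end'] = max(groups[-1]['end'], tag['address'])
        else:
            groups.append({'start': tag['address'], 'end': tag['address']})
    return groups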
So we can generate a unique id with str(uuid.uuid4()), which is 36 characters long.
Is there another method to generate a unique ID which is shorter in terms of characters?
EDIT:
If ID is usable as primary key then even better
Granularity should be better than 1ms
This code could be distributed, so we can't rely on time alone for uniqueness.
If this is for use as a primary key field in a db, consider just using an auto-incrementing integer instead.
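For instance, a minimal sketch with SQLite (any database with auto-increment works similarly):
import sqlite3

con = sqlite3.connect(':memory:')
con.execute('CREATE TABLE items (id INTEGER PRIMARY KEY AUTOINCREMENT, payload TEXT)')
cur = con.execute("INSERT INTO items (payload) VALUES ('hello')")
print(cur.lastrowid)  # 1 -- the database assigns the short, unique ID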
str(uuid.uuid4()) is 36 chars but it has four useless dashes (-) in it, and it's limited to 0-9 a-f.
Better uuid4 in 32 chars:
>>> uuid.uuid4().hex
'b327fc1b6a2343e48af311343fc3f5a8'
Or just b64 encode and slice some urandom bytes (up to you to guarantee uniqueness):
>>> base64.b64encode(os.urandom(32))[:8]
b'iR4hZqs9'
TLDR
Most of the time it's better to work with numbers internally and encode them to short IDs externally. So here's a function for Python 3, PowerShell & VBA that will convert an int32 to an alphanumeric ID. Use it like this:
int32_to_id(225204568)
'F2AXP8'
For distributed code use ULIDs: https://github.com/mdipierro/ulid
They are much longer but unique across different machines.
How short are the IDs?
It will encode about half a billion IDs in 6 characters so it's as compact as possible while still using only non-ambiguous digits and letters.
How can I get even shorter IDs?
If you want even more compact IDs/codes/Serial Numbers, you can easily expand the character set by just changing the chars="..." definition. For example if you allow all lower and upper case letters you can have 56 billion IDs within the same 6 characters. Adding a few symbols (like ~!@#$%^&*()_+-=) gives you 208 billion IDs.
So why didn't you go for the shortest possible IDs?
The character set I'm using in my code has an advantage: It generates IDs that are easy to copy-paste (no symbols so double clicking selects the whole ID), easy to read without mistakes (no look-alike characters like 2 and Z) and rather easy to communicate verbally (only upper case letters). Sticking to numeric digits only is your best option for verbal communication but they are not compact.
I'm convinced: show me the code
Python 3
def int32_to_id(n):
if n==0: return "0"
chars="0123456789ACEFHJKLMNPRTUVWXY"
length=len(chars)
result=""
remain=n
while remain>0:
pos = remain % length
remain = remain // length
result = chars[pos] + result
return result
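The VBA section below ships its own decoder, so for symmetry here is a minimal matching decoder in Python:
def id_to_int32(id_str):
    # Inverse of int32_to_id: decode an alphanumeric ID back to its integer
    chars = "0123456789ACEFHJKLMNPRTUVWXY"
    result = 0
    for c in id_str:
        result = result * len(chars) + chars.index(c)
    return result
>>> id_to_int32('F2AXP8')
225204568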
PowerShell
function int32_to_id($n){
$chars="0123456789ACEFHJKLMNPRTUVWXY"
$length=$chars.length
$result=""; $remain=[int]$n
do {
$pos = $remain % $length
$remain = [int][Math]::Floor($remain / $length)
$result = $chars[$pos] + $result
} while ($remain -gt 0)
$result
}
VBA
Function int32_to_id(n)
Dim chars$, length, result$, remain, pos
If n = 0 Then int32_to_id = "0": Exit Function
chars$ = "0123456789ACEFHJKLMNPRTUVWXY"
length = Len(chars$)
result$ = ""
remain = n
Do While (remain > 0)
pos = remain Mod length
remain = Int(remain / length)
result$ = Mid(chars$, pos + 1, 1) + result$
Loop
int32_to_id = result
End Function
Function id_to_int32(id$)
Dim chars$, length, result, remain, pos, value, power
chars$ = "0123456789ACEFHJKLMNPRTUVWXY"
length = Len(chars$)
result = 0
power = 1
For pos = Len(id$) To 1 Step -1
result = result + (InStr(chars$, Mid(id$, pos, 1)) - 1) * power
power = power * length
Next
id_to_int32 = result
End Function
Public Sub test_id_to_int32()
Dim i
For i = 0 To 28 ^ 3
If id_to_int32(int32_to_id(i)) <> i Then Debug.Print "Error, i=", i, "int32_to_id(i)", int32_to_id(i), "id_to_int32('" & int32_to_id(i) & "')", id_to_int32(int32_to_id(i))
Next
Debug.Print "Done testing"
End Sub
Yes. Just use the current UTC millis. This number never repeats.
const uniqueID = new Date().getTime();
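In Python, the question's language, the equivalent would be this sketch (with the same caveat as below):
import time

unique_id = int(time.time() * 1000)  # current UTC millis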
EDIT
If you have the rather seldom requirement to produce more than one ID within the same millisecond, this method is of no use as this number's granularity is 1ms.
I need to look at our Redis cache and see what the size of our largest stored value is. I am experienced with Python or can use the redis-cli directly. Is there a way to iterate all the keys in the database so I can then inspect the size of each value?
It looks like SCAN is the way to iterate through the keys, but I'm still working out how to put this to use to get the sizes of the values and store a maximum as I go.
Since you mentioned redis-cli as an option: it has a built-in function that does pretty much what you ask for (and much more).
redis-cli --bigkeys
# Scanning the entire keyspace to find biggest keys as well as
# average sizes per key type. You can use -i 0.1 to sleep 0.1 sec
# per 100 SCAN commands (not usually needed).
Here is a sample of the summary output:
Sampled 343799 keys in the keyspace!
Total key length in bytes is 9556361 (avg len 27.80)
Biggest string found '530f8dc7c7b3b:39:string:87929' has 500 bytes
Biggest list found '530f8d5a17b26:9:list:11211' has 500 items
Biggest set found '530f8da856e1e:75:set:65939' has 500 members
Biggest hash found '530f8d619a0af:86:hash:16911' has 500 fields
Biggest zset found '530f8d4a9ce31:45:zset:315' has 500 members
68559 strings with 17136672 bytes (19.94% of keys, avg size 249.96)
68986 lists with 17326343 items (20.07% of keys, avg size 251.16)
68803 sets with 17236635 members (20.01% of keys, avg size 250.52)
68622 hashs with 17272144 fields (19.96% of keys, avg size 251.70)
68829 zsets with 17241902 members (20.02% of keys, avg size 250.50)
You can view a complete output example here
Here is a solution which uses redis's built-in scripting capabilities:
local longest
local cursor = "0"
repeat
local ret = redis.call("scan", cursor)
cursor = ret[1]
for _, key in ipairs(ret[2]) do
local length = redis.pcall("strlen", key)
if type(length) == "number" then
if longest == nil or length > longest then
longest = length
end
end
end
until cursor == "0"
return longest
This should run faster than the Python code you provide, Ben Roberts, particularly because the Lua script uses STRLEN over GET + Python's len.
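To run the script, assuming it is saved as longest.lua (the filename is just for illustration), use redis-cli's --eval option:
redis-cli --eval longest.lua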
I think I've figured out a basic idea, using the redis-py library
import redis

r = redis.StrictRedis(...)
max_len = 0
for k in r.scan_iter():
    try:
        val_len = r.strlen(k)
    except redis.ResponseError:
        # not a string value; STRLEN only applies to strings
        continue
    if val_len > max_len:
        max_len = val_len
print(max_len)