Query for bit based values - python

I have a problem getting values out of PyTables. The values are bit-based, but stored as integers.
One column in my table is an Int32Column() named 'Value'. In this column I store integer values where every bit has a different meaning, so when I need the information held in a particular bit I read the value from the table and do some bit manipulation on it. What I don't know is how to write a query that selects rows based on those bits.
For example, I want all values in the Value column where the first bit == 1 and the third bit == 1.
How can I write that query?
I'm trying with a mask:
[ x['Value'] for x in table.where('((Value & mask) == mask)')]
but I'm getting an exception:
NotImplementedError: unsupported operand types for *and*: int, int
Processing the query must be very fast, because there will be a large number of rows in the future. One restriction is that the values must stay as ints in the table, because I receive them from the server in int format. I hope someone has a better solution.

For future reference.
I had a similar problem and solved it in the following way. Since the usual bit-shift operators (<<, >>) are not available in the query conditions, and the & and | operators are interpreted as logical rather than bitwise, one has to improvise.
To check whether a value VAL has its n-th bit set, we can shift the interesting bit down to position 0 by dividing by 2**n; that bit then determines the parity of the result, which can be checked with the modulus operator.
So one can do something like the following to check, for example, whether bit 25 is set and bit 16 is unset:
table.where("((VAL/(2**25))%2==1) & ((VAL/(2**16))%2==0)")
Not elegant, but working for now.
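To spell out what that condition computes, here is the same parity trick in plain Python (a sketch only; the // makes the integer division explicit, and the trick needs the division to discard the lower bits):

def bit_is_set(value, n):
    # Dividing by 2**n moves bit n down to the ones place; % 2 then reads it.
    return (value // 2**n) % 2 == 1

val = (1 << 25) | (1 << 3)   # bits 25 and 3 set, bit 16 clear
print(bit_is_set(val, 25))   # True
print(bit_is_set(val, 16))   # False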

Are the values 161137531201111100, 1.611375312011111e+17 equal?

I am trying to manipulate a dataframe. One of the values in a list that I use to append a column to the dataframe is 161137531201111100. I then created a dictionary whose keys are the unique values of this column, and I use this dictionary in further operations. This code used to run perfectly before.
However, after trying this code on another data I had the following error:
KeyError: 1.611375312011111e+17
which means that this value is not a key of the dictionary. I tried to trace the code and everything seemed to be okay. However, when I opened the csv file of the dataframe I built, I found that the value causing the problem is 161137531201111000, which is not in the list (and of course not a key in the dictionary) I used to create this column of the dataframe. This seems weird, and I don't know the reason. Why would a number be saved in a different form?
And how can I save it as it is in all phases? Also, why did it change in the csv?
No, unfortunately they are not equal:
print(1.611375312011111e+17 == 161137531201111000)  # False
The problem lies in the way floating-point numbers are handled by computers in general, and by most programming languages, including Python.
Always use integers (and not "too large" ones) when doing computations if you want exact results.
See Is floating point math broken? for a generic explanation that you definitely must know as a programmer, even if it's not specific to Python.
(And be aware that Python does a rather good job of keeping precision on integers; that unfortunately won't help with floating-point numbers.)
And just for the sake of "fun" with floating point numbers, 1.611375312011111e+17 is actually equal to the integer 161137531201111104!
print(format(1.611375312011111e+17, ".60g"))  # shows 161137531201111104
print(1.611375312011111e+17 == 161137531201111104) # True
a = dict()
a[1.611375312011111e+17] = "hello"
#print(a[161137531201111100]) # Key error, as in question
print(a[161137531201111104]) # This one shows "hello" properly!
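As a side note on the follow-up question ("how can I save it as it is in all phases?"): below is a minimal sketch, with a made-up column name big_id, of reading the CSV so the column never passes through float64 (it is the silent int-to-float promotion, for instance when missing values appear, that rounds the value).

import pandas as pd
from io import StringIO

csv_data = StringIO("big_id\n161137531201111100\n")

# Forcing the column to str (or to pandas' nullable "Int64" dtype) keeps the
# exact digits; a float64 column could only hold the rounded 161137531201111104.
df = pd.read_csv(csv_data, dtype={"big_id": str})
print(df["big_id"].iloc[0])   # 161137531201111100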

Transforming type Int64Index into an integer index in Python

I'm quite new to Python, but I have to complete an assignment and am now struggling with a problem. I am trying to get the index of an element in a table A when some other parameter from this table A matches a value in a list B. Table A also already contains a column "index" where all elements are numbered from 0 to the end. Moreover, the values in tableA.parameter1 and listB can coincide only once; multiple matches are not possible. So to derive the necessary index I use this line:
t=tableA.index[tableA.parameter1==listB[numberObservation]]
However, what I get as a result is something like:
Int64Index([2], dtype='int64')
If I use the variable t in this Int64Index format, it doesn't suit the further code I have to work with. I actually need just 2 as an integer, without all the redundant rest.
Can somebody please help me to circumvent my problem? I am in total despair and would be grateful for any help.
Try .tolist()
t=tableA.index[tableA.parameter1==listB[numberObservation]].tolist()
This should return
t = [2]
a list "without all the redundant rest" :)
What package is giving you Int64Index? This looks vaguely numpy-ish, but numpy arrays define __index__ so a single element array of integer values will seamlessly operate as indices for sequence lookup.
Regardless, assuming t is supposed to be exactly one value, and it's a sequence type itself, you can just do:
t, = tableA.index[tableA.parameter1==listB[numberObservation]]
That trailing comma changes the line from straight assignment to iterable unpacking; it expects the right hand side to produce an iterable with exactly one value, and that one value is unpacked into t. If the iterable has 0 or 2+ values, you'll get a ValueError.
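A minimal sketch of the unpacking approach with pandas (the column and list names mirror the question, but the data itself is made up):

import pandas as pd

# Made-up data: parameter1 matches listB[numberObservation] in exactly one row.
tableA = pd.DataFrame({"parameter1": [7, 5, 9]})
listB = [9]
numberObservation = 0

t, = tableA.index[tableA.parameter1 == listB[numberObservation]]
print(t)   # 2 -- a single scalar; a ValueError is raised on 0 or 2+ matches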

Matlab checking for a value in a database

I have the following Boolean statement in Python:
db_connection.query(
    'select storage_time from traces where id=' + trace_id
).dictresult()[0]['storage_time'] == None
It basically checks if there is a value in storage_time and I would like to do the same thing in Matlab, but I can't find anything equivalent to None.
Could you please help me out?
Thanks
The equivalent of Python's None values in a database is NULL. And since you connect to your database via the Matlab Database Toolbox, you need to specify how NULL values retrieved from the database are to be represented in Matlab. This can be done by setting 'NullNumberRead' via the setdbprefs function from the Matlab Database Toolbox. For instance, you can do
setdbprefs('NullNumberRead','NaN')
or
setdbprefs('NullNumberRead','0')
Unfortunately, there is no guarantee that the NULL representation value specified this way won't be confused with real non-NULL values returned by your query (it is your own responsibility in this case to make sure the query never returns NaNs or zeros, respectively, among the non-NULL values in its results).
But if you have to connect to PostgreSQL, as far as I know there exists at least one Matlab-PostgreSQL connector that respects NULLs in a fully consistent manner: the high-performance PostgreSQL client library PgMex.
In PostgreSQL both a value itself and its elements (for array types) can be NULL, which makes representing NULLs in Matlab less trivial than expected.
To illustrate how PgMex represents NULLs in Matlab, let us consider the following example. Suppose you have to retrieve the results of a query returning one field myfield of float8[] type with two tuples, and suppose the value of myfield for the first tuple is NULL as a whole, while for the second tuple the corresponding value equals {0,NULL,NaN}. The results are obtained as follows (we assume that the argument of the very first connect command below is filled in properly and that a table mytable containing myfield of float8[] type among its fields already exists in the respective database):
% Create the database connection
dbConn=com.allied.pgmex.pgmexec('connect',[...
'host=<yourhost> dbname=<yourdb> port=<yourport> '...
'user=<your_postgres_username> password=<your_postgres_password>']);
pgResult=com.allied.pgmex.pgmexec('exec',dbConn,...
'select myfield from mytable'); % perform the query
SMyField=com.allied.pgmex.pgmexec('getf',pgResult,...
'%float8[]',0); % retrieve the results
Here SMyField is a structure with three fields: valueVec, isNullVec and isValueNullVec.
isValueNullVec is a logical column array [true;false], meaning that the entire value for the first tuple is NULL, while the value for the second tuple is not NULL as a whole. isNullVec is the column cell array {[];[false,true,false]}, which indicates that only the second element of the array that makes up the value of myfield for the second tuple is NULL. Finally, valueVec is the column cell array {[];[0 0 NaN]}. Only the first and third elements of the second cell are relevant, in contrast to the second element of that array: isNullVec clearly indicates that this second element is NULL, so its zero value does not matter (some default value is chosen for each particular data type).
As for your example, the corresponding Matlab code may look as follows (we assume that dbConn obtained as above already exists, that the query below is correct, that the storage_time field is of timestamp type, and that the variable trace_id is already defined):
pgResult=com.allied.pgmex.pgmexec('exec',dbConn,...
['select storage_time from traces where id=' trace_id]); % perform the query (strings concatenated via [ ])
SStorageTime=com.allied.pgmex.pgmexec('getf',pgResult,...
'%timestamp',0); % retrieve the results
% check that the value for the first tuple is not NULL
isStorageTime=~SStorageTime.isValueNullVec(1);
Hence it is sufficient to check only isValueNullVec.
EDIT: There are free academic licenses for PgMex.
MATLAB's Database Toolbox has preferences for how to handle NULL values, and depending on those settings you can get different values back. See SETDBPREFS for details; you can change the preferences in the GUI as well.
By default you will get NaN if you read the data as numeric, and 'NULL' strings if you read it as strings. In the numeric case, check for NaN with the ISNAN function:
null_idx = isnan(fetcheddata);
For strings use STRCMP:
null_idx= strcmp(upper(fetcheddata), 'NULL');
In addition, if you fetch the data as a cell array, you may need to process it with CELLFUN or convert it to a matrix with CELL2MAT.
Matlab idiom is usually to use the isempty() function.
isempty(somefunction(someargument))
returns true if somefunction(someargument) returns an empty result, and false otherwise.
I have not worked with the Matlab DB toolbox much, so I'm not sure what the full translation of your Python statement is.
If you use this query you can check for True or False instead of None:
trace_id_exists = db_connection.query("""\
    select exists (
        select 1
        from traces
        where id = %s
    ) "exists"
    """ % trace_id
).dictresult()[0]['exists']

if trace_id_exists:
    ...
You could also return something else like 1 or 0.

How to tell MySQL to trigger a Hash before Update/Insert?

I am struggling to create a trigger within MySQL, so that every time I insert a value into a column named title, a hash is created and stored in the column title_hash. Since I don't know how this works, I found this code while googling:
CREATE TRIGGER insertModelHash
BEFORE INSERT ON products
FOR EACH ROW
SET NEW.model_hash = CONV(RIGHT(MD5(NEW.products_model), 16), 16, 10)
The MySQL reference tells me that this means:
Create a trigger called insertModelHash...
... before inserting a row into table products...
use the functions MD5, RIGHT, CONV on the column products_model in every new row I intend to insert.
The third point needs more explanation:
I guess that NEW is some sort of identifier for new rows, so NEW.products_model points to the column products_model in the current (new) row.
Then MD5 is applied. Since I want to use SHA-2, it seems obvious to change MD5(NEW.products_model), 16) ===> SHA2(NEW.products_model), 224).
And now I struggle: Why does this guy use CONV(RIGHT(...)...)? Is this really necessary?
Additional information: Right now, I am doing
hashlib.sha224(title).hexdigest()
in Python and store this value.
I appreciate any suggestions/explanations!
To answer your three questions:
The NEW keyword references the 'pseudo-table' for the record that would be inserted. On an update trigger, you can access both 'NEW' and 'OLD', and on a delete, just 'OLD'.
Yes, MD5 is used to create the hash. However, in your question, you have part of the parameter to the 'RIGHT' function included. It's only MD5(NEW.products_model) (there's no , 16). And yes, you can substitute SHA2 for MD5, if it's available (it's only available if MySQL is configured for SSL support).
The RIGHT(string, number) simply takes the right 'number' characters from 'string'.
The CONV() function is used to convert a number between bases. The combination of these last two functions takes the right 16 characters of the hash, and converts them from base 16 (hex) to base 10 (decimal).
And the answer is no, you don't need them if all you want to do is store the hash itself.
NEW.model_hash = SHA2(NEW.products_model, 224)
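For comparison with the hashlib.sha224(title).hexdigest() call from the question: MySQL's SHA2(expr, 224) returns the same lowercase hex digest for the same input bytes, so the value computed in Python and the value computed by the trigger should match. A tiny sketch (the title string is made up):

import hashlib

title = "Example product title"  # made-up value

# Same lowercase hex string that SELECT SHA2('Example product title', 224)
# returns in MySQL, assuming the bytes fed in are identical (UTF-8 here).
print(hashlib.sha224(title.encode("utf-8")).hexdigest())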

Data Structure for storing a sorting field to efficiently allow modifications

I'm using Django and PostgreSQL, but I'm not absolutely tied to the Django ORM if there's a better way to do this with raw SQL or database specific operations.
I've got a model that needs sequential ordering. Lookup operations will generally retrieve the entire list in order. The most common operation on this data is to move a row to the bottom of a list, with a subset of the intervening items bubbling up to replace the previous item like this:
(operation on A, with subset B, C, E)
A -> B
B -> C
C -> E
D -> D
E -> A
Notice how D does not move.
In general, the subset of items will not be more than about 50 items, but the base list may grow to tens of thousands of entries.
The most obvious way of implementing this is with a simple integer order field. This seems suboptimal. It requires the compromise of making the position ordering column non-unique, where non-uniqueness is only required for the duration of a modification operation. To see this, imagine the minimal operation using A with subset B:
oldpos = B.pos
B.pos = A.pos
A.pos = oldpos
Even though you've stored the position, at the second line you've violated the uniqueness constraint. Additionally, this method makes atomicity problematic - your read operation has to happen before the write, during which time your records could change. Django's default transaction handling documentation doesn't address this, though I know it should be possible in the SQL using the "REPEATABLE READ" level of transaction locking.
I'm looking for alternate data structures that suit this use pattern more closely. I've looked at this question for ideas.
One proposal there is the Dewey decimal style solution, which makes insert operations occur numerically between existing values, so inserting A between B and C results in:
A=1 -> B=2
B=2 -> A=2.5
C=3 -> C=3
This solves the column uniqueness problem, but introduces the issue that the column must be a float with a specified number of decimals. Either I over-estimate and store far more data than I need, or the system becomes limited by whatever arbitrary decimal length I impose. Furthermore, I don't expect usage to be evenly distributed over the database - some keys are going to be moved far more often than others, making this solution hit the limit sooner. I could solve this problem by periodically re-numbering the database, but it seems that a good data structure should avoid needing this.
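To make that precision limit concrete, here is a tiny sketch in plain Python (not tied to Django or PostgreSQL) of the fractional-insert idea: each new position is the midpoint of its two neighbours, so repeated inserts into the same gap halve the spacing, and a double-precision float runs out of room after roughly fifty insertions.

def position_between(prev_pos, next_pos):
    # New key sits halfway between its neighbours.
    return (prev_pos + next_pos) / 2.0

pos = 3.0
for n in range(60):
    pos = position_between(2.0, pos)   # keep inserting just after key 2.0
    if pos == 2.0:                     # float precision exhausted
        print("ran out of distinct positions after", n + 1, "insertions")
        break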
Another structure I've considered is the linked list (and variants). This has the advantage of making modification straightforward, but I'm not certain of its properties with respect to SQL - ordering such a list in the SQL query seems like it would be painful, and extracting a non-sequential subset of the list has terrible retrieval properties.
Beyond this, there are B-Trees, various Binary Trees, and so on. What do you recommend for this data structure? Is there a standard data structure for this solution in SQL? Is the initial idea of going with sequential integers really going to have scaling issues, or am I seeing problems where there are none?
Preferred solutions:
A linked list would be the usual way to achieve this. A query to return the items in order is trivial in Oracle, but I'm not sure how you would do it in PostgreSQL.
Another option would be to implement this using the ltree module for postgresql.
Less graceful (and write-heavy) solution:
Start a transaction. "select for update" within scope for row-level locks. Move the target record to position 0, update the target's succeeding records by +1 where their position is higher than the target's original position (or vice versa), and then update the target to its new position - a single additional write over what would be needed without a unique constraint. Commit :D
Simple (yet still write-heavy) solution if you can wait for Postgresql 8.5 (Alpha is available) :)
Wrap it in a transaction, select for update in scope, and use a deferred constraint (postgresql 8.5 has support for deferred unique constraints like Oracle).
A temp table and a transaction should maintain atomicity and the unique constraint on sort order. Restating the problem, you want to go from:
A 10  ->  B 10
B 25  ->  C 25
C 26  ->  E 26
E 34  ->  A 34
Where there can be any number of items in between each row. So, first you read in the records and create a list [['A',10],['B',25],['C',26],['E',34]]. Through some pythonic magic you shift the identifiers around (a sketch of that shuffle follows the UPDATE below) and insert them into a temp table:
create temporary table reorder (
    id varchar(20),      -- whatever
    sort_order number,
    primary key (id)
);
Now for the update:
update XYZ
set sort_order = (select sort_order from reorder where xyz.id = reorder.id)
where id in (select id from reorder);
I'm only assuming pgsql can handle that query. If it can, it will be atomic.
Optionally, create table REORDER as a permanent table and the transaction will ensure that attempts to reorder the same record twice will be serialized.
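For concreteness, here is one way the "pythonic magic" shuffling step mentioned above might look (a sketch only; the identifiers rotate while the sort_order values stay where they are):

# Rows read from the database as (id, sort_order), already in order.
rows = [('A', 10), ('B', 25), ('C', 26), ('E', 34)]

ids = [r[0] for r in rows]
orders = [r[1] for r in rows]

# Each following id takes over the previous id's slot and the first id drops
# to the bottom of the subset; the sort_order values themselves never change.
rotated_ids = ids[1:] + ids[:1]

reorder_rows = list(zip(rotated_ids, orders))
print(reorder_rows)   # [('B', 10), ('C', 25), ('E', 26), ('A', 34)]

These (id, sort_order) pairs are what gets inserted into the temporary reorder table before the UPDATE runs.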
EDIT: There are some transaction issues. You might need to implement both of my ideas. If two processes both want to update item B (for example) there can be issues. So, assume all order values are even:
1. Begin the transaction.
2. Increment all the orders being used by 1. This puts row-level write locks on all the rows you are going to update.
3. Select the data you just updated. If any sort_order fields are even, some other process has added a record that matches your criteria. You can either abort the transaction and restart, or just drop that record and finish the operation using only the records that were updated in step 2. The "right" thing to do depends on what you need this code to accomplish.
4. Fill your temporary reorder table as above, using the proper even sort_orders.
5. Update the main table as above.
6. Drop the temporary table.
7. Commit the transaction.
Step 2 ensures that if two lists overlap, only the first one will have access to the row
in question until the transaction completes:
update XYZ set sort_order = sort_order + 1
where -- whatever your select criteria are

select * from XYZ
where -- same select criteria
order by sort_order
Alternatively, you can add a control field to the table to get the same effect, and then you don't need to play with the sort_order field. The benefit of using the sort_order field is that indexing by a BIT field or a LOCK_BY_USERID field that is usually null tends to have poor performance, since the index is meaningless 99% of the time. SQL engines don't like indexes that spend most of their time empty.
It seems to me that your real problem is the need to lock a table for the duration of a transaction. I don't immediately see a good way to solve this problem in a single operation, hence the need for locking.
So the question is whether you can do this in a "Django way" as opposed to using straight SQL. Searching "django lock table" turned up some interesting links, including this snippet; there are many others that implement similar behavior.
A straight SQL linked-list style solution can be found in this Stack Overflow post; it appeared logical and succinct to me, but again it's two operations.
I'm very curious to hear how this turns out and what your final solution is, be sure to keep us updated!
Why not use a simple character field of some length, say a maximum of 16 (or 255), initially?
Start by labeling things aaa through zzz (that should be 17576 entries). (You could also add 0-9, uppercase letters, and symbols as an optimization.)
As items are added, they can go to the end, up to the maximum length you allow (zzza, zzzaa, zzzaaa, zzzaab, zzzaac, zzzaad, etc.).
This should be reasonably simple to program, and it's very similar to the Dewey Decimal system.
Yes, you will need to rebalance it occasionally, but that should be a simple operation. The simplest approach is two passes: pass 1 would set the new ordering tag to '0' (or any character earlier than the first character) followed by the new tag of the appropriate length, and pass 2 would remove the '0' from the front.
Obviously, you could do the same thing with floats and rebalance regularly; this is just a variation on that. The one advantage is that most databases will allow you to set a ridiculously large maximum size for the character field, large enough to make it very, very unlikely that you would run out of digits for the ordering, and also unlikely that you would ever have to modify the schema, while not wasting a lot of space.
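As a rough illustration of the label space (a sketch only, not production code), generating the initial three-letter tags in Python confirms the 17576 figure:

import itertools
import string

# All three-letter lowercase tags in lexicographic order: 'aaa' .. 'zzz'.
labels = [''.join(t) for t in itertools.product(string.ascii_lowercase, repeat=3)]
print(len(labels))              # 17576 == 26 ** 3
print(labels[0], labels[-1])    # aaa zzz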
You can solve the renumbering issue by making the order column an integer that is always even. When you move the data, you change the order field to the new sort value + 1, and then do a quick update to convert all the odd order fields back to even:
update table set sort_order = bitand(sort_order, '0xFFFFFFFE')
where sort_order <> bitand(sort_order, '0xFFFFFFFE')
Thus you can keep the uniqueness of sort_order as a constraint
EDIT: Okay, looking at the question again, I've started a new answer.
