I have the following query, where some of the values I am trying to select with can be empty and therefore default to None.
So I've come up with something like this:
db.cursor.execute(
'''
SELECT
s.prod_tires_size_id as size_id
,s.rim_mm
,s.rim_inch
,s.width_mm
,s.width_inch
,s.aspect_ratio
,s.diameter_inch
FROM product_tires.sizes s
WHERE
s.rim_mm = %(rim_mm)s
AND s.rim_inch = %(rim_inch)s
AND s.width_mm = %(width_mm)s
AND s.width_inch = %(width_inch)s
AND s.aspect_ratio = %(aspect_ratio)s
AND s.diameter_inch = %(diameter_inch)s
''', {
'rim_mm': data['RIM_MM'] or None,
'rim_inch': data['RIM_INCH'] or None,
'width_mm': data['WIDTH_MM'] or None,
'width_inch': data['WIDTH_INCH'] or None,
'aspect_ratio': data['ASPECT_RATIO'] or None,
'diameter_inch': data['OVL_DIAMETER'] or None,
}
)
However, = NULL does not work.
If I use IS, then it will not match the values I am providing.
How can I solve this problem?
Generate the query using Python:
k = {'rim_mm': 'RIM_MM',
     'rim_inch': 'RIM_INCH',
     'width_mm': 'WIDTH_MM',
     'width_inch': 'WIDTH_INCH',
     'aspect_ratio': 'ASPECT_RATIO',
     'diameter_inch': 'OVL_DIAMETER',
     }
where = []
for column, key in k.items():
    if data[key]:
        where.append("%s = %%(%s)s" % (column, key))
    else:
        where.append("%s IS NULL" % column)
sql = "your select where " + " AND ".join(where)
cursor.execute(sql, data)
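For completeness, a minimal sketch of how the generated WHERE clause plugs back into the original SELECT, assuming a psycopg2-style driver where unused keys in the parameter mapping are simply ignored:
sql = '''
    SELECT
        s.prod_tires_size_id AS size_id,
        s.rim_mm, s.rim_inch,
        s.width_mm, s.width_inch,
        s.aspect_ratio, s.diameter_inch
    FROM product_tires.sizes s
    WHERE ''' + " AND ".join(where)

# Keys with empty values become "column IS NULL" and their placeholders are not
# referenced, so the whole data dict can still be passed as the parameter mapping.
db.cursor.execute(sql, data)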
I am doing a migration from Cassandra on an AWS machine to Astra Cassandra and there are some problems:
I cannot insert new data into Astra Cassandra with a column which is around 2 million characters and 1.77 MB (and I have bigger data to insert, around 20 million characters). Does anyone know how to address the problem?
I am inserting it via a Python app (cassandra-driver==3.17.0) and this is the error stack I get:
start.sh[5625]: [2022-07-12 15:14:39,336]
INFO in db_ops: error = Error from server: code=1500
[Replica(s) failed to execute write]
message="Operation failed - received 0 responses and 2 failures: UNKNOWN from 0.0.0.125:7000, UNKNOWN from 0.0.0.181:7000"
info={'consistency': 'LOCAL_QUORUM', 'required_responses': 2, 'received_responses': 0, 'failures': 2}
If I use half of those characters, it works.
New Astra Cassandra CQL Console table description:
token#cqlsh> describe mykeyspace.series;
CREATE TABLE mykeyspace.series (
type text,
name text,
as_of timestamp,
data text,
hash text,
PRIMARY KEY ((type, name, as_of))
) WITH additional_write_policy = '99PERCENTILE'
AND bloom_filter_fp_chance = 0.01
AND caching = {'keys': 'ALL', 'rows_per_partition': 'NONE'}
AND comment = ''
AND compaction = {'class': 'org.apache.cassandra.db.compaction.UnifiedCompactionStrategy'}
AND compression = {'chunk_length_in_kb': '64', 'class': 'org.apache.cassandra.io.compress.LZ4Compressor'}
AND crc_check_chance = 1.0
AND default_time_to_live = 0
AND gc_grace_seconds = 864000
AND max_index_interval = 2048
AND memtable_flush_period_in_ms = 0
AND min_index_interval = 128
AND read_repair = 'BLOCKING'
AND speculative_retry = '99PERCENTILE';
Old Cassandra table description:
ansible#cqlsh> describe mykeyspace.series;
CREATE TABLE mykeyspace.series (
type text,
name text,
as_of timestamp,
data text,
hash text,
PRIMARY KEY ((type, name, as_of))
) WITH bloom_filter_fp_chance = 0.01
AND caching = {'keys': 'ALL', 'rows_per_partition': 'NONE'}
AND comment = ''
AND compaction = {'class': 'org.apache.cassandra.db.compaction.SizeTieredCompactionStrategy', 'max_threshold': '32', 'min_threshold': '4'}
AND compression = {'chunk_length_in_kb': '64', 'class': 'org.apache.cassandra.io.compress.LZ4Compressor'}
AND crc_check_chance = 1.0
AND dclocal_read_repair_chance = 0.1
AND default_time_to_live = 0
AND gc_grace_seconds = 864000
AND max_index_interval = 2048
AND memtable_flush_period_in_ms = 0
AND min_index_interval = 128
AND read_repair_chance = 0.0
AND speculative_retry = '99PERCENTILE';
Data sample:
{"type": "OP", "name": "book", "as_of": "2022-03-17", "data": [{"year": 2022, "month": 3, "day": 17, "hour": 0, "quarter": 1, "week": 11, "wk_year": 2022, "is_peak": 0, "value": 1.28056854009628e-08}, .... ], "hash": "84421b8d934b06488e1ac464bd46e83ccd2beea5eb2f9f2c52428b706a9b2a10"}
where this JSON contains 27,000 entries inside the data array, like:
{"year": 2022, "month": 3, "day": 17, "hour": 0, "quarter": 1, "week": 11, "wk_year": 2022, "is_peak": 0, "value": 1.28056854009628e-08}
Python part of the code:
def insert_to_table(self, table_name, **kwargs):
    try:
        ...
        elif table_name == "series":
            self.session.execute(
                self.session.prepare("INSERT INTO series (type, name, as_of, data, hash) VALUES (?, ?, ?, ?, ?)"),
                (
                    kwargs["type"],
                    kwargs["name"],
                    kwargs["as_of"],
                    kwargs["data"],
                    kwargs["hash"],
                ),
            )
            return True
    except Exception as error:
        current_app.logger.error('src/db/db_ops.py insert_to_table() table_name = %s error = %s', table_name, error)
        return False
Big thanks!
You are hitting the configured limit for the maximum mutation size. On Cassandra this defaults to 16 MB, while on Astra DB at the moment it is 4 MB (it's possible that it will be raised, but performing inserts with very large cell sizes is still strongly discouraged).
A more agile approach to storing this data would be to revise your data model and split the big row with the huge string into several rows, each containing a single item of the 27,000 or so entries. With proper use of partitioning, you would still be able to retrieve the whole contents with a single query (paginated between the database and the driver for your convenience, which would help avoid the annoying timeouts that may arise when reading such large individual rows).
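As an illustration only (table and column names below are hypothetical, not taken from the original schema), the split could look roughly like this, keeping (type, name, as_of) as the partition key and adding a clustering column for the position of each entry:
from datetime import datetime

# One row per element of the "data" array instead of one multi-MB text cell.
session.execute("""
    CREATE TABLE IF NOT EXISTS mykeyspace.series_entries (
        type text,
        name text,
        as_of timestamp,
        entry_no int,
        entry text,
        PRIMARY KEY ((type, name, as_of), entry_no)
    )
""")

# All ~27,000 entries of one series still come back with a single, paginated query.
rows = session.execute(
    "SELECT entry_no, entry FROM mykeyspace.series_entries "
    "WHERE type = %s AND name = %s AND as_of = %s",
    ("OP", "book", datetime(2022, 3, 17)),
)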
Incidentally, I suggest you create the prepared statement only once, outside of the insert_to_table function (caching it or something). In the insert function you then simply call self.session.execute(already_prepared_statement, (value1, value2, ...)), which would noticeably improve your performance.
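A minimal sketch of that suggestion (class and attribute names are illustrative, not from the original app):
INSERT_SERIES_QUERY = "INSERT INTO series (type, name, as_of, data, hash) VALUES (?, ?, ?, ?, ?)"

class SeriesDb:
    def __init__(self, session):
        self.session = session
        # Prepare once, at construction time.
        self.insert_series_stmt = session.prepare(INSERT_SERIES_QUERY)

    def insert_series(self, **kwargs):
        # No re-preparation on the hot path; only the bind values change.
        self.session.execute(
            self.insert_series_stmt,
            (kwargs["type"], kwargs["name"], kwargs["as_of"], kwargs["data"], kwargs["hash"]),
        )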
A last point: I believe the drivers are able to connect to Astra DB only starting from version 3.24.0, so I'm not sure how you are using version 3.17. I don't think version 3.17 knows of the cloud argument to the Cluster constructor. In any case, I suggest you upgrade the drivers to the latest version (currently 3.25.0).
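For reference, with a recent driver the Astra connection goes through the cloud argument roughly like this (bundle path and credentials are placeholders):
from cassandra.cluster import Cluster
from cassandra.auth import PlainTextAuthProvider

# Placeholders: use your own secure connect bundle and application token credentials.
cluster = Cluster(
    cloud={"secure_connect_bundle": "/path/to/secure-connect-yourdb.zip"},
    auth_provider=PlainTextAuthProvider("client_id", "client_secret"),
)
session = cluster.connect("mykeyspace")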
There's something not quite right with the details you posted in your question.
In the schema you posted, the data column is of type text:
data text,
But the sample data you posted looks like you are inserting key/value pairs, which oddly seem to be formatted like a CQL collection type.
If it were really a string, it would be formatted as:
... "data": "key1: value1, key2: value2, ...", ...
Review your data and code then try again. Cheers!
Finally, the short solution that made the data migration from the in-house Cassandra DB to Astra Cassandra (DataStax) possible was to use compression (zlib).
So the "data" field of each entry was compressed in Python with zlib and then stored in Cassandra to reduce the size of the entries.
def insert_to_table(self, table_name, **kwargs):
    try:
        if table_name == 'series_as_of':
            ...
        elif table_name == 'series':
            list_of_bytes = bytes(json.dumps(kwargs["data"]), 'latin1')
            compressed_data = zlib.compress(list_of_bytes)
            a = str(compressed_data, 'latin1')
            self.session.execute(
                self.session.prepare(INSERT_SERIES_QUERY),
                (
                    kwargs["type"],
                    kwargs["name"],
                    kwargs["as_of"],
                    a,
                    kwargs["hash"],
                ),
            )
            ....
Then, when reading the entries, a decompression step is needed:
def get_series(self, query_type, **kwargs):
    try:
        if query_type == "data":
            execution = self.session.execute_async(
                self.session.prepare(GET_SERIES_DATA_QUERY),
                (kwargs["type"], kwargs["name"], kwargs["as_of"]),
                timeout=30,
            )
            decompressed = None
            a = None
            for row in execution.result():
                my_bytes = row[0].encode('latin1')
                decompressed = zlib.decompress(my_bytes)
                a = str(decompressed, encoding="latin1")
            return a
        ...
The data looks like this now in Astra Cassandra :
token#cqlsh> select * from series where type = 'OP' and name = 'ZTP_PFM_H_LUX_PHY' and as_of = '2022-09-30';
type | name | as_of | data | hash
------+-------------------+---------------------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+------------------------------------------------------------------
OP | ZTP_PFM_H_LUX_PHY | 2022-09-30 00:00:00.000000+0000 | x\x9c-\x8dÉ\nÂ#\x10D\x7fEúl`z\x16Mü\x80\x90\x83â\x1c\x14\F\x86îYÈÅ\x05\x92\x8b\x88ÿnÂx*ªÞ\x83\x82\x8f\x83ñýJ\x0e6\x0b\x07{ë`9å\x83îÿår°Þ¶;ßùíñämw.\x02\rþ\x99\x8b!\x85\x94\x95h*%\n\x8a4ÒL®·¹õ4ôÅÓÙ¨\x10\të \x99H\x04¡\x8cf6¹!\x95\x02'\x93"Jb\x1dë\x84ÈT¯U\x90\x19\x11W8\x1dp£\x8d\x83/ü\x00Ó<0\x99 | 4f53cda18c2baa0c0354bb5f9a3ecbe5ed12ab4d8e11ba873c2f11161202b945
(1 rows)
The long-term solution is to rearrange the data as @Stefano Lottini said.
I have the following query in my Django code:
cond = self.text_query(field.id, f'''
EXISTS(
SELECT *
FROM {self.fieldvalue_db_view}
WHERE entry_id = {self.db_table}.id AND
{self.fieldvalue_db_view}.field_id = {field.id} AND
{self.fieldvalue_db_view}.{text_search_column} SIMILAR TO %(pattern)s
)
''', dict(pattern='%' + val + '%'))
When
val = '%%john smith%%' or val = '%%john%smith%%'
it does not return a result like 'John smith', but if
val = '%%john%smith%|%smith%john%%'
it returns results with 'John smith'.
How can this case sensitivity be solved?
Thanks,
You can use ILIKE:
EXISTS(
SELECT
1
FROM
{ self.fieldvalue_db_view }
WHERE
entry_id = { self.db_table }.id
AND { self.fieldvalue_db_view }.field_id = { field.id }
AND { self.fieldvalue_db_view }.{ text_search_column } ILIKE %(pattern)s
)
Also, you don't need to return all columns when you are checking for the existence of a row; instead select a constant, which is cheaper.
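Applied to the query from the question, the condition could look roughly like this (the SELECT 1 reflects the "select a constant" suggestion; the surrounding text_query call is unchanged):
cond = self.text_query(field.id, f'''
    EXISTS(
        SELECT 1
        FROM {self.fieldvalue_db_view}
        WHERE entry_id = {self.db_table}.id
          AND {self.fieldvalue_db_view}.field_id = {field.id}
          AND {self.fieldvalue_db_view}.{text_search_column} ILIKE %(pattern)s
    )
''', dict(pattern='%' + val + '%'))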
I have a dictionary in Python that I created from a JSON file. Now I need to pass its values to insert into a PostgreSQL database.
dictionary
if i['trailers']:
    a = [
        {'url': i['images'][0]['url'], 'type': i['images'][0]['type']},
        {'url': i['images'][1]['url'], 'type': i['images'][1]['type']},
        {'url': i['trailers'][0]['url'], 'type': 'Trailer'},
        {'url': i['trailers'][1]['url'], 'type': 'Trailer'},
    ]
else:
    a = [
        {'url': i['images'][0]['url'], 'type': i['images'][0]['type']},
        {'url': i['images'][1]['url'], 'type': i['images'][1]['type']},
    ]
length = len(a)
Here I created the dictionary. If there is anything inside trailers it goes to case A, else it goes to case B. In the B case, trailers doesn't exist. Then I get the length of the dictionary.
Now, I will try to insert these elements into the table media, that depends on movies. Their relation is movie(1):media(n).
INSERT INTO media
for x in range(length):
    query = ("""INSERT INTO media VALUES (%s, %s, %(url)s, %(type)s);""")
    data = (media_id, media_movie_id)
    cur.execute(query, data)
    conn.commit()
    media_id += 1
Here is what I'm trying to do. Since a movie can have many media, I'll create a for loop to move through all the elements and insert them into the table, with their id being incremented.
The problem is, I don't know how to do this quite right in Python, since I always create a query and a data tuple and then cur.execute it, and the example I found used an entire dictionary, without any other kind of value.
So, if anyone has this kind of problem, the solution is actually simple.
I remade my dict in something like this:
i['trailers'] = i.get('trailers') or []
dictionary = [{'url': x['url'], 'type': x['type']} for x in i['images'] + i['trailers']]
This solution was made by @minboost here.
Then the insertion is something like this:
for i, dic in enumerate(dictionary):
    query = ("""
        INSERT INTO media (id, movie_id, url, type)
        VALUES (%s, %s, %s, %s);
    """)
    data = (media_id, media_movie_id, dictionary[i]['url'], dictionary[i]['type'])
    cur.execute(query, data)
    conn.commit()
All working perfectly. :)
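As a possible refinement (not part of the original answer), the per-row loop can be collapsed into a single executemany call with one commit at the end; enumerate supplies the incrementing offsets:
query = """
    INSERT INTO media (id, movie_id, url, type)
    VALUES (%s, %s, %s, %s);
"""
params = [
    (media_id + offset, media_movie_id, item['url'], item['type'])
    for offset, item in enumerate(dictionary)
]
cur.executemany(query, params)
conn.commit()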
I am working with Python 2.7 to extract data from a JSON API and push it into a SQL Server table.
I am having trouble inserting data into the database where some of the entries returned from the JSON response are missing a section of the dictionary, i.e. "CustomFields": 90% of the entries have information; however, 10% don't, therefore I get an index error.
For example:
"CustomFields":[
],
vs
"CustomFields":[
{
"Type":"foo",
"Name":"foo",
"Value":"foo"
},
{
"Type":"foo",
"Name":"foo",
"Value":"foo"
},
{
"Type":"foo",
"Name":"foo",
"Value":"foo"
},
What would I change so that if an index is missing, 'NULL' entries are inserted into the database instead?
response = '*API URL*'
json_response = json.loads(urllib2.urlopen(response).read())
conn = pypyodbc.connect(r'Driver={SQL Server};Server=*Address*;Database=*DataBase*;Trusted_Connection=yes;')
conn.autocommit = False
c = conn.cursor()
c.executemany("INSERT INTO phil_targetproccess (ResourceType, Id, Name, StartDate, EndDate, TimeSpent, CreateDate, ModifyDate, LastStateChangeDate, ProjectName, EntityStateName, RequestTypeName, AssignedTeamMember#1, Area, SubArea, BusinessTeam) VALUES (?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?)" ,
[(x['ResourceType'],
x['Id'],
x['Name'],
(parse_date(x['StartDate'])),
(parse_date(x['EndDate'])),
x['TimeSpent'],
(parse_date(x['CreateDate'])),
(parse_date(x['ModifyDate'])),
(parse_date(x['LastStateChangeDate'])),
x['Project']['Name'],
x['EntityState']['Name'],
x['RequestType']['Name'],
y['GeneralUser']['FirstName']+' '+y['GeneralUser']['LastName'],
x['CustomFields'][0]['Value'],
x['CustomFields'][1]['Value'],
x['CustomFields'][2]['Value'])
for x in json_response['Items']
for y in x['Assignments']['Items']])
Many thanks.
I think your issue is here:
x['CustomFields'][0]['Value'],
x['CustomFields'][1]['Value'],
x['CustomFields'][2]['Value']
When CustomFields has no elements (or fewer than three).
Try:
x['CustomFields'][0]['Value'] if len(x['CustomFields']) > 0 else '',
x['CustomFields'][1]['Value'] if len(x['CustomFields']) > 1 else '',
x['CustomFields'][2]['Value'] if len(x['CustomFields']) > 2 else '',
You can use the get method to check whether that value in CustomFields
is available. If so, check the length of the list and then get the value of the dictionary in that list using the same get method.
For example:
customfield_value = (x['CustomFields'][0]).get("Value",None) if len(x['CustomFields'])>0 else None
This will return None if the value is not present at index 0. You can follow the same approach for getting values from the other two indices. If you didn't understand, please comment and I'll explain further.
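If the same pattern is needed for several indices, a small hypothetical helper keeps the list comprehension readable:
def custom_field_value(item, index, default=None):
    """Return item['CustomFields'][index]['Value'], or default when the index or key is missing."""
    fields = item.get('CustomFields') or []
    if len(fields) > index:
        return fields[index].get('Value', default)
    return default

# Usage inside the list comprehension:
#   custom_field_value(x, 0), custom_field_value(x, 1), custom_field_value(x, 2)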
Final Script. Thanks for the help!
c.executemany("INSERT INTO phil_targetproccess (ResourceType, Id, Name, StartDate, EndDate, TimeSpent, CreateDate, "
"ModifyDate, LastStateChangeDate, ProjectName, EntityStateName, RequestTypeName, AssignedTeamMember1, "
"AssignedTeamMember2, AssignedTeamMember3, AssignedTeamMember4, Area, SubArea, BusinessTeam) VALUES (?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?)",
[(x['ResourceType'],
x['Id'],
x['Name'],
(parse_date(x['StartDate'])),
(parse_date(x['EndDate'])),
x['TimeSpent'],
(parse_date(x['CreateDate'])),
(parse_date(x['ModifyDate'])),
(parse_date(x['LastStateChangeDate'])),
x['Project']['Name'],
x['EntityState']['Name'],
x['RequestType']['Name'],
x['Assignments']['Items'][0]['GeneralUser']['FirstName'] + ' ' + x['Assignments']['Items'][0]['GeneralUser']['LastName'] if len(x['Assignments']['Items']) > 0 else None,
x['Assignments']['Items'][1]['GeneralUser']['FirstName'] + ' ' + x['Assignments']['Items'][1]['GeneralUser']['LastName'] if len(x['Assignments']['Items']) > 1 else None,
x['Assignments']['Items'][2]['GeneralUser']['FirstName'] + ' ' + x['Assignments']['Items'][2]['GeneralUser']['LastName'] if len(x['Assignments']['Items']) > 2 else None,
x['Assignments']['Items'][3]['GeneralUser']['FirstName'] + ' ' + x['Assignments']['Items'][3]['GeneralUser']['LastName'] if len(x['Assignments']['Items']) > 3 else None,
x['CustomFields'][0]['Value'] if len(x['CustomFields']) > 0 else '',
x['CustomFields'][1]['Value'] if len(x['CustomFields']) > 1 else '',
x['CustomFields'][2]['Value'] if len(x['CustomFields']) > 2 else '')
for x in json_response['Items']])
I have a table Images with id and name. I want to query its previous image and next image in the database using SQLAlchemy. How can I do it in only one query?
sel = select([images.c.id, images.c.name]).where(images.c.id == id)
res = engine.connect().execute(sel)
#How to obtain its previous and next row?
...
Suppose it is possible that some rows have been deleted, i.e., the ids may not be continuous. For example,
Table: Images
------------
id | name
------------
1 | 'a.jpg'
2 | 'b.jpg'
4 | 'd.jpg'
------------
prev_image = your_session.query(Images).order_by(Images.id.desc()).filter(Images.id < id).first()
next_image = your_session.query(Images).order_by(Images.id.asc()).filter(Images.id > id).first()
# previous
prv = select([images.c.id, images.c.name]).where(images.c.id < id).order_by(images.c.id.desc()).limit(1)
res = engine.connect().execute(prv)
for row in res:
    print(row.id, row.name)
# next
nxt = select([images.c.id, images.c.name]).where(images.c.id > id).order_by(images.c.id).limit(1)
res = engine.connect().execute(nxt)
for row in res:
    print(row.id, row.name)
This can be accomplished in a "single" query by taking the UNION of two queries, one to select the previous and target records and one to select the next record (unless the backend is SQLite, which does not permit an ORDER BY before the final statement in a UNION):
import sqlalchemy as sa
...
with engine.connect() as conn:
    target = 3
    query1 = sa.select(tbl).where(tbl.c.id <= target).order_by(tbl.c.id.desc()).limit(2)
    query2 = sa.select(tbl).where(tbl.c.id > target).order_by(tbl.c.id.asc()).limit(1)
    res = conn.execute(query1.union(query2))
    for row in res:
        print(row)
producing
(2, 'b.jpg')
(3, 'c.jpg')
(4, 'd.jpg')
Note that we could make the second query the same as the first, apart from reversing the inequality
query2 = sa.select(tbl).where(tbl.c.id >= target).order_by(tbl.c.id.asc()).limit(2)
and we would get the same result as the union would remove the duplicate target row.
If the requirement were to find the surrounding rows for a selection of rows, we could use the lag and lead window functions, if they are supported.
# Works in PostgreSQL, MariaDB and SQLite, at least.
with engine.connect() as conn:
    query = sa.select(
        tbl.c.id,
        tbl.c.name,
        sa.func.lag(tbl.c.name).over(order_by=tbl.c.id).label('prev'),
        sa.func.lead(tbl.c.name).over(order_by=tbl.c.id).label('next'),
    )
    res = conn.execute(query)
    for row in res:
        print(row._mapping)
Output:
{'id': 1, 'name': 'a.jpg', 'prev': None, 'next': 'b.jpg'}
{'id': 2, 'name': 'b.jpg', 'prev': 'a.jpg', 'next': 'c.jpg'}
{'id': 3, 'name': 'c.jpg', 'prev': 'b.jpg', 'next': 'd.jpg'}
{'id': 4, 'name': 'd.jpg', 'prev': 'c.jpg', 'next': 'e.jpg'}
{'id': 5, 'name': 'e.jpg', 'prev': 'd.jpg', 'next': 'f.jpg'}
{'id': 6, 'name': 'f.jpg', 'prev': 'e.jpg', 'next': None}
To iterate through your records, I think this is what you're looking for:
for row in res:
    print(row.id)
    print(row.name)