Is using SQL MERGE with only one table possible in Python? [duplicate]

I have a table containing a student-grade relationship:
Student  Grade  StartDate   EndDate
1        1      09/01/2009  NULL
2        2      09/01/2010  NULL
2        1      09/01/2009  06/15/2010
I am trying to write a stored procedure that takes Student, Grade, and StartDate, and I would like it to:
1. check that these values are not duplicates,
2. insert the record if it is not a duplicate, and
3. if there is an existing record for that student with EndDate = NULL, update that record's EndDate with the StartDate of the new record.
For instance, if I call the procedure and pass in 1, 2, 09/01/2010, I'd like to end up with:
Student  Grade  StartDate   EndDate
1        2      09/01/2010  NULL
1        1      09/01/2009  09/01/2010
2        2      09/01/2010  NULL
2        1      09/01/2009  06/15/2010
This sounds like I could use MERGE, except that I am passing literal values and I need to perform more than one action. I also have a wicked headache this morning and can't seem to think clearly, so I am fixating on this MERGE solution. If there is a more obvious way to do this, don't be afraid to point it out.

You can use a MERGE even if you are passing literal values. Here's an example for your issue:
CREATE PROCEDURE InsertStudentGrade(@Student INT, @Grade INT, @StartDate DATE)
AS
BEGIN;
    MERGE StudentGrade AS tbl
    USING (SELECT @Student AS Student, @Grade AS Grade, @StartDate AS StartDate) AS row
    ON tbl.Student = row.Student AND tbl.Grade = row.Grade
    WHEN NOT MATCHED THEN
        INSERT (Student, Grade, StartDate)
        VALUES (row.Student, row.Grade, row.StartDate)
    WHEN MATCHED AND tbl.EndDate IS NULL AND tbl.StartDate != row.StartDate THEN
        UPDATE SET
            tbl.StartDate = row.StartDate;
END;
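
Since the question asks about doing this from Python, here is a minimal sketch of calling the procedure above with pyodbc; the DSN name "school" and the ODBC setup are assumptions, not part of the original answer:

import pyodbc

# Assumes a SQL Server ODBC data source named "school"; adjust the connection string as needed.
conn = pyodbc.connect("DSN=school")
cursor = conn.cursor()
cursor.execute(
    "EXEC InsertStudentGrade @Student = ?, @Grade = ?, @StartDate = ?",
    (1, 2, "2010-09-01"),
)
conn.commit()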

I prefer the following; it is cleaner and easier to read and modify.
MERGE Definition.tdSection AS Target
USING (
    SELECT *
    FROM (VALUES
        (1, 1, 'Administrator', 1, GETDATE(), NULL, CURRENT_USER, GETDATE()),
        (2, 1, 'Admissions',    1, GETDATE(), NULL, CURRENT_USER, GETDATE()),
        (3, 1, 'BOM',           1, GETDATE(), NULL, CURRENT_USER, GETDATE()),
        (4, 1, 'CRC',           1, GETDATE(), NULL, CURRENT_USER, GETDATE()),
        (5, 1, 'ICM',           1, GETDATE(), NULL, CURRENT_USER, GETDATE()),
        (6, 1, 'System',        1, GETDATE(), NULL, CURRENT_USER, GETDATE()),
        (7, 1, 'Therapy',       1, GETDATE(), NULL, CURRENT_USER, GETDATE())
    ) AS s (SectionId, BusinessProcessId, Description, Sequence,
            EffectiveStartDate, EffectiveEndDate, ModifiedBy, ModifiedDateTime)
) AS Source
ON Target.SectionId = Source.SectionId
WHEN NOT MATCHED THEN
    INSERT (SectionId, BusinessProcessId, Description, Sequence,
            EffectiveStartDate, EffectiveEndDate, ModifiedBy, ModifiedDateTime)
    VALUES (Source.SectionId, Source.BusinessProcessId, Source.Description, Source.Sequence,
            Source.EffectiveStartDate, Source.EffectiveEndDate, Source.ModifiedBy, Source.ModifiedDateTime);

Simply:
--Arrange
CREATE TABLE dbo.Product
(
    Id INT IDENTITY PRIMARY KEY,
    Name VARCHAR(40)
)
GO

--Act
MERGE INTO dbo.Product AS Target
USING
(
    --Here is the trick :)
    VALUES
        (1, N'Product A'),
        (2, N'Product B'),
        (3, N'Product C'),
        (4, N'Product D')
)
AS Source (Id, Name)
ON Target.Id = Source.Id
WHEN NOT MATCHED BY TARGET THEN
    INSERT (Name)
    VALUES (Source.Name);
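
The same trick also works as a parameterized query sent straight from Python, which is what the question title is after. A minimal sketch, assuming pyodbc, SQL Server, and a placeholder DSN; none of this is part of the answer above:

import pyodbc

conn = pyodbc.connect("DSN=mydb")  # placeholder data source name
merge_sql = """
MERGE INTO dbo.Product AS Target
USING (VALUES (?, ?)) AS Source (Id, Name)
ON Target.Id = Source.Id
WHEN NOT MATCHED BY TARGET THEN
    INSERT (Name) VALUES (Source.Name);
"""
cursor = conn.cursor()
cursor.execute(merge_sql, (5, "Product E"))  # parameters feed the single-row VALUES source
conn.commit()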

Related

Clickhouse-sqlalchemy types.Nested insert

I have a problem inserting data into a Nested column.
I use map_imperatively.
Columns of other types are filled; only the nested column remains empty.
My code:
import attr
from sqlalchemy import (
    create_engine, Column, MetaData, insert
)
from sqlalchemy.orm import registry
from clickhouse_sqlalchemy import (
    Table, make_session, types, engines,
)

uri = 'clickhouse+native://localhost/default'
engine = create_engine(uri)
session = make_session(engine)
metadata = MetaData(bind=engine)
mapper = registry()

@attr.dataclass
class NestedAttr:
    key1: int
    key2: int
    key3: int

@attr.dataclass
class NestedInObject:
    id: int
    name: str
    nested_attr: NestedAttr

nested_test = Table(
    'nested_test', metadata,
    Column(name='id', type_=types.Int8, primary_key=True),
    Column(name='name', type_=types.String),
    Column(
        name='nested_attr',
        type_=types.Nested(
            Column(name='key1', type_=types.Int8),
            Column(name='key2', type_=types.Int8),
            Column(name='key3', type_=types.Int8),
        )
    ),
    engines.Memory()
)

mapper.map_imperatively(
    NestedInObject,
    nested_test
)

nested_test.create()

values = [
    {
        'id': 1,
        'name': 'name',
        'nested_attr.key1': [1, 2],
        'nested_attr.key2': [1, 2],
        'nested_attr.key3': [1, 2],
    }
]

session.execute(insert(NestedInObject), values)
I don't get an error, but the nested columns are empty.
I tried different data and checked the data types in the database, but I don't understand why the columns are left empty.

SQLAlchemy insert values into reflected table results in NULL entries all across

The following code results in None (NULL) across the row in every attempt. The query.values() code below is just a shortened line so as to keep things less complicated. Additionally, I have problems inserting a dict as JSON in the address fields, but that's another question.
CREATE TABLE public.customers (
id SERIAL,
email character varying(255) NULL,
name character varying(255) NULL,
phone character varying(16) NULL,
address jsonb NULL,
shipping jsonb NULL,
currency character varying(3) NULL,
metadata jsonb[] NULL,
created bigint NULL,
uuid uuid DEFAULT uuid_generate_v4() NOT NULL,
PRIMARY KEY (uuid)
);
from sqlalchemy import *
from sqlalchemy.orm import Session

# Create engine, metadata, & session
engine = create_engine('postgresql://postgres:password@database/db', future=True)
metadata = MetaData(bind=engine)
session = Session(engine)

# Create Table
customers = Table('customers', metadata, autoload_with=engine)
query = customers.insert()
query.values(email="test@test.com",
             name="testy testarosa",
             phone="+12125551212",
             address='{"city": "Cities", "street": "123 Main St", '
                     '"state": "CA", "zip": "10001"}')
session.execute(query)
session.commit()
session.close()

# Now to see results
stmt = text("SELECT * FROM customers")
response = session.execute(stmt)
for result in response:
    print(result)
# Results in None in the fields I explicitly attempted
(1, None, None, None, None, None, None, None, 1, None, None, None, None, UUID('9112a420-aa36-4498-bb56-d4129682681c'))
Calling query.values() returns a new insert instance rather than modifying the existing instance in place. This return value must be assigned to a variable, otherwise it will have no effect.
You could build the insert iteratively:
query = customers.insert()
query = query.values(...)
session.execute(query)
or chain the calls as Karolus K. suggests in their answer.
query = customers.insert().values(...)
Regarding the address column, you are inserting a dict already serialised as JSON. This value gets serialised again during insertion, so the value in the database ends up looking like this:
test# select address from customers;
address
══════════════════════════════════════════════════════════════════════════════════════════════════════════════════════
"{\"city\": \"Cities\", \"street\": \"123 Main St\", \"state\": \"CA\", \"zip\": \"10001\"}"
(1 row)
and is not amenable to being queried as a JSON object (because it's a JSON-encoded string):
test# select address->'state' AS state from customers;
state
═══════
¤
(1 row)
You might find it better to pass the raw dict instead, resulting in this value being stored in the database:
test# select address from customers;
address
════════════════════════════════════════════════════════════════════════════
{"zip": "10001", "city": "Cities", "state": "CA", "street": "123 Main St"}
(1 row)
which is amenable to being queried as a JSON object:
test# select address->'state' AS state from customers;
state
═══════
"CA"
(1 row)
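
For example, a minimal sketch reusing the customers table and session from the question, passing the raw dict so it is serialised exactly once by the JSONB column type:

query = customers.insert().values(
    email="test@test.com",
    name="testy testarosa",
    phone="+12125551212",
    address={"city": "Cities", "street": "123 Main St",
             "state": "CA", "zip": "10001"},  # raw dict, not a pre-serialised JSON string
)
session.execute(query)
session.commit()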
I am not sure what you mean by
"The query.values() code below is just a shortened line so as to keep things less complicated."
so maybe I am not understanding the issue properly.
In any case, the problem here is that you execute the insert() and the values() separately, while they are meant to be "chained".
Doing something like:
query = customers.insert().values(email="test@test.com", name="testy testarosa", phone="+12125551212", address='{"city": "Cities", "street": "123 Main St", "state": "CA", "zip": "10001"}')
will work.
Documentation: https://docs.sqlalchemy.org/en/14/core/selectable.html#sqlalchemy.sql.expression.TableClause.insert
PS: I did not face any issues with the JSON field either. Perhaps it's something with the PG version?

Save dataframe in Postgresql Database with SERIAL Autogenerated ID

I have a dataframe like the following:
     word  classification  counter
0   house  noun            2
1   the    article         2
2   white  adjective       1
3   yellow adjective       1
I would like to store in Postgresql table with the following definition:
CREATE TABLE public.word_classification (
id SERIAL,
word character varying(100),
classification character varying(10),
counter integer,
start_date date,
end_date date
);
ALTER TABLE public.word_classification OWNER TO postgres;
The current basic configuration I have is as follows:
from sqlalchemy import create_engine
import pandas as pd
# Postgres username, password, and database name
POSTGRES_ADDRESS = 'localhost' ## INSERT YOUR DB ADDRESS IF IT'S NOT ON PANOPLY
POSTGRES_PORT = '5432'
POSTGRES_USERNAME = 'postgres' ## CHANGE THIS TO YOUR PANOPLY/POSTGRES USERNAME
POSTGRES_PASSWORD = 'BVict31C' ## CHANGE THIS TO YOUR PANOPLY/POSTGRES PASSWORD
POSTGRES_DBNAME = 'local-sandbox-dev' ## CHANGE THIS TO YOUR DATABASE NAME
# A long string that contains the necessary Postgres login information
postgres_str = ('postgresql://{username}:{password}@{ipaddress}:{port}/{dbname}'
                .format(username=POSTGRES_USERNAME,
                        password=POSTGRES_PASSWORD,
                        ipaddress=POSTGRES_ADDRESS,
                        port=POSTGRES_PORT,
                        dbname=POSTGRES_DBNAME))
# Create the connection
cnx = create_engine(postgres_str)
data=[['the','article',0],['house','noun',1],['yellow','adjective',2],
['the','article',4],['house','noun',5],['white','adjective',6]]
df = pd.DataFrame(data, columns=['word','classification','position'])
df_db = pd.DataFrame(columns=['word','classification','counter','start_date','end_date'])
count_series=df.groupby(['word','classification']).size()
new_df = count_series.to_frame(name = 'counter').reset_index()
df_db = new_df.to_sql('word_classification',cnx,if_exists='append',chunksize=1000)
I would like to insert into the table as I am able to do with SQL syntax:
insert into word_classification (word, classification, counter) values ('hello', 'world', 1);
Currently, I am getting an error when inserting into the table because I am passing the index:
(psycopg2.errors.UndefinedColumn) column "index" of relation "word_classification" does not exist
LINE 1: INSERT INTO word_classification (index, word, classification...
^
[SQL: INSERT INTO word_classification (index, word, classification, counter) VALUES (%(index)s, %(word)s, %(classification)s, %(counter)s)]
[parameters: ({'index': 0, 'word': 'house', 'classification': 'noun', 'counter': 2}, {'index': 1, 'word': 'the', 'classification': 'article', 'counter': 2}, {'index': 2, 'word': 'white', 'classification': 'adjective', 'counter': 1}, {'index': 3, 'word': 'yellow', 'classification': 'adjective', 'counter': 1})]
I have been searching for ways to avoid passing the index, with no luck.
Thanks for your help.
Turn off the index when storing to the database, as follows:
df_db = new_df.to_sql('word_classification',cnx,if_exists='append',chunksize=1000, index=False)

Parameterized query with pyodbc and mysql8 returns 0 for columns with int data types

Python: 2.7.12
pyodbc: 4.0.24
OS: Ubuntu 16.04
DB: MySQL 8
driver: MySQL 8
Expected behaviour: the result set should have numbers in columns with the int data type
Actual behaviour: all of the columns with the int data type contain 0s (if a parameterised query is used)
Here are the queries:
1.
cursor.execute("SELECT * FROM TABLE where id =7")
Result set:
[(7, 1, None, 1, u'An', u'Zed', None, u'Ms', datetime.datetime(2016, 12, 20, 0, 0), u'F', u'Not To Be Disclosed', None, None, u'SPRING', None, u'4000', datetime.datetime(2009, 5, 20, 18, 55), datetime.datetime(2019, 1, 4, 14, 25, 58, 763000), 0, None, None, None, bytearray(b'\x00\x00\x00\x00\x01(n\xba'))]
2.
cursor.execute("SELECT * FROM patients where patient_id=?", [7])`
or
cursor.execute("SELECT * FROM patients where patient_id=?", ['7'])
or
cursor.execute("SELECT * FROM patients where patient_id IN ", [7])
Result set:
[(0, 0, None, 0, u'An', u'Zed', None, u'Ms', datetime.datetime(2016, 12, 20, 0, 0), u'F', u'Not To Be Disclosed', None, None, u'SPRING', None, u'4000', datetime.datetime(2009, 5, 20, 18, 55), datetime.datetime(2019, 1, 4, 14, 25, 58, 763000), 0, None, None, None, bytearray(b'\x00\x00\x00\x00\x01(n\xba'))]
The rest of the result set is okay, except that the columns with the int data type all have 0s if a parameterized query is used.
It seems like it should have worked without issues. Can I get some help here?
Edit: Here's the schema of the table:
CREATE TABLE `patients` (
`lastname` varchar(30) DEFAULT NULL,
`known_as` varchar(30) DEFAULT NULL,
`title` varchar(50) DEFAULT NULL,
`dob` datetime DEFAULT NULL,
`sex` char(1) DEFAULT NULL,
`address1` varchar(30) DEFAULT NULL,
`address2` varchar(30) DEFAULT NULL,
`address3` varchar(30) DEFAULT NULL,
`city` varchar(30) DEFAULT NULL,
`state` varchar(16) DEFAULT NULL,
`postcode` char(4) DEFAULT NULL,
`datecreated` datetime NOT NULL,
`dateupdated` datetime(6) DEFAULT NULL,
`isrep` tinyint(1) DEFAULT NULL,
`photo` longblob,
`foreign_images_imported` tinyint(1) DEFAULT NULL,
`ismerged` tinyint(1) DEFAULT NULL,
`rowversion` varbinary(8) DEFAULT NULL,
PRIMARY KEY (`patient_id`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_0900_ai_ci;
You have encountered this bug in MySQL Connector/ODBC.
EDIT: The bug has now been fixed.
The following (Python 3) test code verifies that MySQL Connector/ODBC returns zero (incorrect), while mysqlclient returns the correct value:
import MySQLdb  # from mysqlclient
import pyodbc

host = 'localhost'
user = 'root'
passwd = 'whatever'
db = 'mydb'
port = 3307
charset = 'utf8mb4'

use_odbc = False  # or True

print(f'{"" if use_odbc else "not "}using ODBC ...')
if use_odbc:
    connection_string = (
        f'DRIVER=MySQL ODBC 8.0 ANSI Driver;'
        f'SERVER={host};UID={user};PWD={passwd};DATABASE={db};PORT={port};'
        f'charset={charset};'
    )
    cnxn = pyodbc.connect(connection_string)
    print(f'{cnxn.getinfo(pyodbc.SQL_DRIVER_NAME)}, version {cnxn.getinfo(pyodbc.SQL_DRIVER_VER)}')
else:
    cnxn = MySQLdb.connect(
        host=host, user=user, passwd=passwd, db=db, port=port, charset=charset
    )

int_value = 123
crsr = cnxn.cursor()
crsr.execute("CREATE TEMPORARY TABLE foo (id varchar(10) PRIMARY KEY, intcol int, othercol longblob)")
crsr.execute(f"INSERT INTO foo (id, intcol) VALUES ('Alfa', {int_value})")
sql = f"SELECT intcol, othercol FROM foo WHERE id = {'?' if use_odbc else '%s'}"
crsr.execute(sql, ('Alfa',))
result = crsr.fetchone()[0]
print(f'{"pass" if result == int_value else "FAIL"} -- expected: {repr(int_value)} ; actual: {repr(result)}')
Console output with use_odbc = True:
using ODBC ...
myodbc8a.dll, version 08.00.0018
FAIL -- expected: 123 ; actual: 0
Console output with use_odbc = False:
not using ODBC ...
pass -- expected: 123 ; actual: 123
FWIW, I just posted a question where I was seeing this in version 3.1.14 of the ODBC connector, but NOT in version 3.1.10.

How to improve performance of pymongo queries

I inherited an old Mongo database. Let's focus on the following two collections (removed most of their content for better readability):
Collection user:
db.user.find_one({"email": "user@host.com"})
{'lastUpdate': datetime.datetime(2016, 9, 2, 11, 40, 13, 160000),
 'creationTime': datetime.datetime(2016, 6, 23, 7, 19, 10, 6000),
 '_id': ObjectId('576b8d6ee4b0a37270b742c7'),
 'email': 'user@host.com'}
Collection entry (one user to many entries):
db.entry.find_one({"userId": _id})
{'date_entered': datetime.datetime(2015, 2, 7, 0, 0),
 'creationTime': datetime.datetime(2015, 2, 8, 14, 41, 50, 701000),
 'lastUpdate': datetime.datetime(2015, 2, 9, 3, 28, 2, 115000),
 '_id': ObjectId('54d775aee4b035e584287a42'),
 'userId': '576b8d6ee4b0a37270b742c7',
 'data': 'test'}
As you can see, there is no DBRef between the two.
What I would like to do is to count the total number of entries, and the number of entries updated after a given date.
To do this I used Python's pymongo library. The code below gets me what I need, but it is painfully slow.
import time
from datetime import datetime

from pymongo import MongoClient

client = MongoClient('mongodb://foobar/')
db = client.userdata

# First I need to fetch all user ids. Otherwise the db cursor will time out after some time.
user_ids = []  # build a list of tuples (email, id)
for user in db.user.find():
    user_ids.append((user['email'], str(user['_id'])))

date = datetime(2016, 1, 1)

for user_id in user_ids:
    email, _id = user_id
    t0 = time.time()
    query = {"userId": _id}
    no_of_all_entries = db.entry.find(query).count()
    query = {"userId": _id, "lastUpdate": {"$gte": date}}
    no_of_entries_this_year = db.entry.find(query).count()
    t1 = time.time()
    print("delay ", round(t1 - t0, 2))
    print(email, no_of_all_entries, no_of_entries_this_year)
It takes around 0.83 seconds to run both db.entry.find queries on my laptop, and 0.54 seconds on an AWS server (not the MongoDB server).
With ~20,000 users it takes a painful 3 hours to get all the data.
Is that the kind of latency you'd expect to see in Mongo? What can I do to improve this? Bear in mind that MongoDB is fairly new to me.
Instead of running the two counts for every user separately, you can get both aggregates for all users at once with db.collection.aggregate().
And instead of a list of (email, userId) tuples, we build a dictionary, as it is easier to use for looking up the corresponding email.
user_emails = {str(user['_id']): user['email'] for user in db.user.find()}

date = datetime(2016, 1, 1)
entry_counts = db.entry.aggregate([
    {"$group": {
        "_id": "$userId",
        "count": {"$sum": 1},
        "count_this_year": {
            "$sum": {
                "$cond": [{"$gte": ["$lastUpdate", date]}, 1, 0]
            }
        }
    }}
])

for entry in entry_counts:
    print(user_emails.get(entry['_id']),
          entry['count'],
          entry['count_this_year'])
I'm pretty sure getting the user's email address into the result could be done as well, but I'm not a Mongo expert either.
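
If you do want the email inside the aggregation itself, one possible sketch is a $lookup stage; this is my assumption rather than part of the answer above, and it needs MongoDB 4.0+ because userId stores the string form of the user's ObjectId, so $toString is required for the join:

from datetime import datetime

date = datetime(2016, 1, 1)  # same cutoff as above; `db` is the pymongo database from the question
entry_counts = db.entry.aggregate([
    {"$group": {
        "_id": "$userId",
        "count": {"$sum": 1},
        "count_this_year": {
            "$sum": {"$cond": [{"$gte": ["$lastUpdate", date]}, 1, 0]}
        }
    }},
    # Join each group back to the user collection to pick up the email.
    {"$lookup": {
        "from": "user",
        "let": {"uid": "$_id"},
        "pipeline": [
            {"$match": {"$expr": {"$eq": [{"$toString": "$_id"}, "$$uid"]}}},
            {"$project": {"_id": 0, "email": 1}}
        ],
        "as": "user"
    }}
])

for entry in entry_counts:
    email = entry["user"][0]["email"] if entry["user"] else None
    print(email, entry["count"], entry["count_this_year"])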
