I am close to finishing an ORM for RethinkDB in Python and I got stuck at writing tests, particularly those involving the save(), get() and delete() operations. What is the recommended way to test whether my ORM does what it is supposed to do when saving, getting or deleting a document?
Right now, for each test in my suite I create a database, populate it with all the tables needed by the test models (this takes a lot of time, almost 5 seconds per test!), run the operation on my model (e.g. save()) and then manually run a query against the database (using RethinkDB's Python driver) to check whether everything has been updated.
This doesn't feel right; maybe there is another way to write these tests, or maybe I can design them without running so many queries against the database. Any idea on how I can improve this, or a suggestion on how this really should be done?
You can create all your databases/tables just once for all your tests.
You can also use the raw data directory:
- Start RethinkDB
- Create all your databases/tables
- Save the resulting data directory; this is your seed.
Before each test, copy the seed data directory, start RethinkDB on the copy, and when the test is done, delete the copied directory.
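Here's a rough sketch of that recipe as a pytest fixture; the seed directory name, ports, and the fixed sleep are all illustrative (a real version should poll the driver port instead of sleeping):

import os
import shutil
import subprocess
import tempfile
import time
import pytest
import rethinkdb as r  # pre-2.4 driver import style

SEED_DIR = 'rethinkdb_seed'  # hypothetical data directory prepared once with all tables
DRIVER_PORT = 28016          # hypothetical port, kept clear of any local instance

@pytest.fixture
def rethinkdb_conn():
    tmp = tempfile.mkdtemp()
    data_dir = os.path.join(tmp, 'data')
    shutil.copytree(SEED_DIR, data_dir)  # fresh copy of the seeded data directory
    proc = subprocess.Popen(['rethinkdb', '--directory', data_dir,
                             '--driver-port', str(DRIVER_PORT),
                             '--cluster-port', str(DRIVER_PORT + 1),
                             '--no-http-admin'])
    time.sleep(2)  # crude startup wait
    try:
        yield r.connect(port=DRIVER_PORT)
    finally:
        proc.terminate()
        proc.wait()
        shutil.rmtree(tmp)  # throw away the copied data directory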
I have a Python script to import data from raw csv/xlsx files. For these I use Pandas to load the files, do some light transformation, and save the result to an sqlite3 database. This part is fast (as fast as any method I've tried). After that, I run some queries against these tables to build some intermediate datasets, which I execute through the function below.
More information: I am using Anaconda/Python3 (3.9) on Windows 10 Enterprise.
UPDATE:
Just as information for anybody reading this, I ended up going back to using standalone Python (still with JupyterLab though), and I no longer have this issue. So I'm not sure whether it was a problem with something Anaconda does or just with the versions of the various libraries in that particular (latest available) Anaconda distribution. My script now runs more or less in the time I would expect, using Python 3.11 and the versions pulled in by pip for Pandas (1.5.3) and sqlite (3.38.4).
Python function for running sqlite3 queries:
def runSqliteScript(destConnString, queryString):
    '''Runs an sqlite script given a connection and a query string.
    Note: despite the name, destConnString is a sqlite3.Connection object.
    '''
    try:
        print('Trying to execute sql script: ')
        print(queryString)
        cursorTmp = destConnString.cursor()
        cursorTmp.executescript(queryString)
    except Exception as e:
        print('Error caught: {}'.format(e))
Because somebody asked, here is the function that creates destConnString (it's called something else at the actual call site, but it's the same type).
import sqlite3

def createSqliteDb(db_file):
    ''' Creates an sqlite database at the directory/file name specified.
    '''
    conSqlite = None
    try:
        conSqlite = sqlite3.connect(db_file)
        return conSqlite
    except sqlite3.Error as e:
        print('Error {} when trying to create {}'.format(e, db_file))
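For completeness, the two helpers are used together roughly like this (the file name and query are placeholders):

conn = createSqliteDb('intermediate_datasets.db')  # returns a sqlite3.Connection
queryString = 'select 1;'                          # stands in for one of the real queries below
runSqliteScript(conn, queryString)
conn.close()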
Example of one of the queries (I commented out the journal mode/synchronous pragmas after they didn't seem to help at all):
-- PRAGMA journal_mode = WAL;
-- PRAGMA synchronous = NORMAL;
BEGIN;
drop table if exists tbl_1110_cop_omd_fmd;
COMMIT;
BEGIN;
create table tbl_1110_cop_omd_fmd as
select
siteId,
orderNumber,
familyGroup01, familyGroup02,
count(*) as countOfLines
from tbl_0000_ob_trx_for_frazelle
where 1 = 1
-- and dateCreated between datetime('now', '-365 days') and datetime('now', 'localtime') -- temporarily commented due to no date in file
group by siteId,
orderNumber,
familyGroup01, familyGroup02
order by dateCreated asc
;
COMMIT
;
Here is a list of everything I have tried. Unfortunately, no matter what combination of things I try, I end up hitting one bottleneck or another. There seems to be some kind of write bottleneck from Python to sqlite3, yet pandas' to_sql method doesn't seem to be affected by it.
- I tried wrapping all my queries in BEGIN/COMMIT statements. I put these inline with the query, though I'd be interested in knowing whether that is the correct way to do it (see the sketch below). This seemed to have no effect.
- I tried setting the journal mode to WAL and synchronous to NORMAL, again to no effect.
- I tried running the queries in an in-memory database:
  - First, I tried creating everything from scratch in the in-memory database. The tables didn't create any faster, and saving the in-memory database out to disk (via the backup method) seems to be a bottleneck.
  - Next, I tried creating views instead of tables (again, everything from scratch in memory). The views created really quickly and, weirdly, querying them was very fast; but saving the in-memory database (backup method) is still a bottleneck.
- I tried writing views straight to the database file (not in-memory). Unfortunately, the views take as long as making the tables when run from Python/sqlite.
I don't really want to do anything strictly in-memory for the database creation, as this Python script is used for different sets of data, some of which could have too many rows for an in-memory setup. The only thing I have left to try is to take the in-memory from-scratch setup, make views instead of tables, read ALL the in-memory db tables with pandas (read_sql), then write ALL the tables to a file db with pandas (to_sql). Hoping there is something easy to try to resolve this problem.
connOBData = sqlite3.connect('file:cachedb?mode=memory&cache=shared', uri=True)  # note: '&' and uri=True; without them sqlite3 treats the whole string as a literal file name
These take approximately 1,000 times longer, or more, than if I run the same queries directly in DB Browser (an sqlite frontend). The queries aren't that complex and run fine (in ~2-4 seconds) in DB Browser. All told, if I run all the queries in a row in DB Browser, they finish in 1-2 minutes; run through the Python script, it literally takes close to 20 hours. I'd expect the queries to finish in approximately the same time as they do in DB Browser.
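One note on the BEGIN/COMMIT experiment above: by default Python's sqlite3 module manages transactions itself, and executescript() first commits any pending transaction before running the script. A minimal sketch of the autocommit variant, where the inline BEGIN/COMMIT pairs are the only transaction control (the file name is a placeholder):

import sqlite3

conn = sqlite3.connect('intermediate_datasets.db')  # placeholder file name
conn.isolation_level = None  # autocommit: the module opens no implicit transactions

# The script's own BEGIN/COMMIT pairs now delimit the transactions.
conn.executescript('''
BEGIN;
drop table if exists tbl_1110_cop_omd_fmd;
COMMIT;
''')
conn.close()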
I have a couple of SQL statements stored as files which get executed by a Python script. The database is hosted in Snowflake and I use Snowflake SQLAlchemy to connect to it.
How can I test those statements? I don't want to execute them, I just want to check if they could be executable.
One very basic check would be whether it is valid standard SQL. A better answer would be something that also handles Snowflake-specific statements like
copy into s3://example from table ...
The best answer would be something that also checks permissions, e.g. for SELECT statements, whether the table is visible/readable.
An in-memory sqlite database is one option, but if you are executing raw SQL queries against Snowflake in your code, your tests may fail wherever the same syntax isn't valid against sqlite. Recording your HTTP requests against a test Snowflake database and then replaying them for your unit tests suits this purpose better. There are two very good libraries that do this; check them out:
vcrpy
betamax
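A minimal sketch of the vcrpy approach with Snowflake SQLAlchemy; the connection URL and cassette path are illustrative, and real credentials are only needed on the first (recording) run, after which the test replays offline:

import vcr
from sqlalchemy import create_engine, text

ENGINE_URL = 'snowflake://user:pass@account/database/schema'  # illustrative URL

@vcr.use_cassette('fixtures/select_one.yaml')  # records on first run, replays after
def test_select_one():
    engine = create_engine(ENGINE_URL)
    with engine.connect() as conn:
        assert conn.execute(text('SELECT 1')).scalar() == 1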
We do run integration tests on our Snowflake databases. We maintain clones of our production databases; for example, one of our production databases is called data_lake, and we maintain a clone, refreshed nightly, called data_lake_test, which we run our integration tests against.
Like Tim Biegeleisen mentioned, a "true" unit test would mock the response, but our integration tests do run real Snowflake queries on the cloned test databases. There is the possibility that a test drastically alters the test database, but since we run integration tests only during our CI/CD process, it is rare for two tests to conflict.
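The nightly refresh itself can be a single statement, since Snowflake clones are zero-copy. A sketch of such a job (the connection URL is illustrative; the database names match the example above):

from sqlalchemy import create_engine, text

engine = create_engine('snowflake://user:pass@account')  # illustrative URL

# CREATE OR REPLACE ... CLONE rebuilds the test database from production each night.
with engine.begin() as conn:
    conn.execute(text('CREATE OR REPLACE DATABASE data_lake_test CLONE data_lake'))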
I very much like this idea; however, I can suggest a workaround, as I often have to check my syntax and need help there. If you plan on using the Snowflake interface, I would recommend putting LIMIT 10 or LIMIT 1 on the SELECT statements you need to validate.
Another tip: talk to a Snowflake representative about a trial if you are just getting started. They will also have a lot of tips for the more specific queries you are seeking to validate.
And finally, based on some comments, make sure your SQL is ANSI-standard, and live in https://docs.snowflake.net/manuals/index.html for reference.
As far as the validity of the SQL statement is a concern, you can run EXPLAIN on the statement: it should give you an error if the syntax is incorrect or if you do not have permission to access the object/database. That said, there are still some exceptions you cannot run EXPLAIN for, like the USE command, which I do not think is needed for validation anyway.
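A sketch of that idea using snowflake-connector-python: run EXPLAIN on each statement and treat an error as a failed validation (conn is assumed to be an open snowflake.connector connection):

from snowflake.connector.errors import ProgrammingError

def validate_statement(conn, sql):
    '''Return (True, None) if EXPLAIN accepts the statement, else (False, message).'''
    try:
        conn.cursor().execute('EXPLAIN ' + sql)
        return True, None
    except ProgrammingError as e:
        # raised both for syntax errors and for objects you cannot access
        return False, str(e)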
Hope this helps.
I am working on testing database accessor methods that use DB-API 2.0, in Python with Pytest. Automated testing is new to me, and I can't seem to figure out what should be done when testing a database with fixtures. I would like to check whether getting fields from a table is successful. To get the same result every time, I intend to add a row entry before each of these tests and delete the row after each test that depends on it. The terms I have heard are 'setUp' and 'tearDown', although I have also read that using yield is the newer syntax.
My conceptual question whose answer I would like to figure out before writing code is:
What happens when the 'tearDown' portion of the fixture fails? How do I return the database to the same state, without the added row entry? Is there a way of recovering from this? I still need the rest of the data in the database.
I read this article [with unittest] that explains what runs when set-up and tear-down methods fail, but it falls short of providing an answer to my question.
A common practice is to run each test in its own database transaction, so that no matter what happens, any changes are rolled back and the database returns to a clean state.
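With a plain DB-API 2.0 connection that can look like the sketch below; sqlite3 stands in for your driver, and the users table is made up. Because the teardown is a single rollback() call, there is no multi-step cleanup left to fail:

import sqlite3
import pytest

@pytest.fixture
def db_conn():
    conn = sqlite3.connect('app.db')  # placeholder database
    try:
        yield conn        # the test runs inside the driver's implicit transaction
    finally:
        conn.rollback()   # undo everything the test did, even if the test failed
        conn.close()

def test_added_row_is_visible(db_conn):
    cur = db_conn.cursor()
    cur.execute('INSERT INTO users (name) VALUES (?)', ('alice',))
    cur.execute('SELECT name FROM users WHERE name = ?', ('alice',))
    assert cur.fetchone() == ('alice',)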
I have an ORM app that uses SQLAlchemy, Alembic for migrations, and Pytest for testing. In my tests, I have the database as a fixture. Before I used migrations, I dropped all the tables and recreated them for each testing session.
Now that I am using migrations, I want to use Alembic in creating my fixtures too, because I believe that mimics a production environment more closely. (Is that a good rationale?)
One way to do it is to downgrade() all the way down and upgrade() back up each time. I don't really like this, but I might be wrong.
Another would be to drop_all() and create_all() for unit tests, and just write another test that stamps the database with head and tests an upgrade and downgrade.
Is there another good/standard way to integrate migrations with fixtures so I do not have to use drop_tables?
Or is there a way, after drop_tables, to stamp the db as "tail" or empty, without explicitly using the migration hash for revision 0 (since that creates dependencies)? Something like alembic downgrade -1 that will make it go back to year 0. Thank you.
I recommend starting a temporary database instance each time, e.g. with testing.mysqld or testing.postgresql. The advantage of this approach is that you're guaranteed to start fresh each time; the success of your tests will not depend on external factors. The downside is the extra handful of seconds that it takes to start the instance.
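A sketch of that with testing.postgresql, running the real Alembic migrations against the throwaway instance (the alembic.ini path is illustrative):

import pytest
import testing.postgresql
from alembic import command
from alembic.config import Config
from sqlalchemy import create_engine

@pytest.fixture
def migrated_db():
    with testing.postgresql.Postgresql() as pg:  # fresh instance on a random port
        cfg = Config('alembic.ini')              # illustrative path
        cfg.set_main_option('sqlalchemy.url', pg.url())
        command.upgrade(cfg, 'head')             # exercise the real migrations
        yield create_engine(pg.url())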
If you insist on using an existing database instance, you can, like you said, use create_all() + alembic stamp head. However, instead of doing drop_all(), simply drop the entire database (or schema, in the case of PostgreSQL) and recreate it.
If you insist on using drop_all(), you can drop the alembic_version table to tell alembic that the current version is "tail".
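Both variants can be done programmatically; a sketch, where the model import and connection URL are illustrative:

from alembic import command
from alembic.config import Config
from sqlalchemy import create_engine, text

from myapp.models import Base  # illustrative: your declarative base

engine = create_engine('postgresql:///testdb')  # illustrative URL
cfg = Config('alembic.ini')

# Variant 1: build the schema directly, then tell alembic it is already at "head".
Base.metadata.create_all(engine)
command.stamp(cfg, 'head')

# Variant 2: after drop_all(), also drop alembic's bookkeeping table so the
# database reads as unversioned ("tail").
with engine.begin() as conn:
    conn.execute(text('DROP TABLE IF EXISTS alembic_version'))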
I am building a database interface using Python's peewee module. I am trying to figure out how to insert data into an existing database where I do not know the schema.
My idea is to use playhouse.reflection.Introspector to find out the database schema, then use that information to create model classes that can be used to insert data into the existing database.
So far I've gotten to:
introspector = Introspector.from_database(database)
models = introspector.generate_models()
I don't know where to go from there.
1) Can I create database objects in this manner? What is the next step?
2) Is there an easier way to do this?
peewee includes an introspection tool called pwiz that can (basically) introspect a database and produce model definitions. It is run as a command-line script and dumps the model definitions to stdout, so invocation is like any other unix tool. Here is an example from the docs:
python -m pwiz -e postgresql my_postgres_db > mymodels.py
From there edit mymodels.py to get what you need.
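If you do want to stay at runtime instead, the Introspector route from the question gets you usable model classes directly. A sketch (the database file, the person table, and its name column are made up):

from peewee import SqliteDatabase
from playhouse.reflection import Introspector

database = SqliteDatabase('existing.db')  # illustrative database
introspector = Introspector.from_database(database)
models = introspector.generate_models()   # dict mapping table name -> Model class

Person = models['person']                 # hypothetical table
Person.create(name='Huey')                # insert a row through the generated model
print([p.name for p in Person.select()])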
You could do this on the fly, but it would require a few steps and is hackish (not to mention pointless if you really don't know anything about the schema):
Run pwiz as an os command
Read it to pick out the model names
Import whatever you find
BUT
If you really don't know the schema to start with, then you have no idea what the semantics of the database are anyway, which means whatever you find is literally meaningless. Unless you at least know some schema/table/column names you are hunting for (in which case you do know something about the schema), there isn't really much you can do about inserting data (not in a sane way), though you could certainly dump data from the db. But if you just wanted a database dump, then pg_dump would have been easier.
I suspect this is actually an X-Y problem. What problem is it you are trying to solve by using this technique? What effect is it supposed to achieve within the context of your system?
If you want to create a GUI, check out the sqlite_web project. It uses Peewee to create a web-based SQLite database manager.