Use mock MongoDB server for unit test - python

I have to implement nosetests for Python code that uses a MongoDB store. Is there a Python library that lets me initialize a mock in-memory MongoDB server?
I am using continuous integration, so I want my tests to be independent of any running MongoDB server.
Is there a way to mock a MongoDB server in memory so the code can be tested without connecting to a real Mongo server?
Thanks in advance!

You could try: https://github.com/vmalloc/mongomock, which aims to be a small library for mocking pymongo collection objects for testing purposes.
However, I'm not sure the cost of just running a real mongod would be prohibitive compared to making sure some mocking library is feature-complete.
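For instance, a mongomock-backed test might look like this (a minimal sketch; recent mongomock versions expose a MongoClient stand-in, and the database, collection, and field names here are illustrative):
import mongomock

client = mongomock.MongoClient()
collection = client.db.collection
collection.insert_one({'name': 'Ada'})
# queries run against the in-memory store, no mongod required
assert collection.find_one({'name': 'Ada'})['name'] == 'Ada'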

I don't know about Python, but I had a similar concern with C#. I decided to just run a real instance of Mongo on my workstation, pointed at an empty directory. It's not great because the code isn't isolated, but it's fast and easy.
Only the data access layer actually calls Mongo during the tests; the rest can rely on mocks of the data access layer. I didn't feel faking Mongo was worth the effort when what I really want to verify is that the interaction with Mongo is correct anyway.
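The same idea translates to Python: mock the data access layer with the standard library's unittest.mock and keep Mongo out of most tests entirely (a sketch; the store object and find_user method are illustrative names, not from any library):
from unittest import mock

def greet(store, user_id):
    # code under test depends on the data access layer, not on Mongo
    user = store.find_user(user_id)
    return "Hello, %s!" % user['name']

store = mock.Mock()
store.find_user.return_value = {'name': 'Ada'}
assert greet(store, 1) == "Hello, Ada!"
store.find_user.assert_called_once_with(1)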

You can use Ming, which ships an in-memory MongoDB implementation ("mim") as a drop-in replacement for a pymongo connection.
import ming
mg = ming.create_datastore('mim://')
mg.conn # is the connection
mg.db # is a db with no name
mg.conn.somedb.somecol
# >> mim.Collection(mim.Database(somedb), somecol)
col = mg.conn.somedb.somecol
col.insert({'a': 1})
# >> ObjectId('5216ac3fe0323a1218f4e9aa')
col.find().count()
# >> 1

I am also using pymongo, and MockupDB works very well for my purposes (integration tests).
Using it is as simple as:
from mockupdb import *
server = MockupDB()
port = server.run()
from pymongo import MongoClient
client = MongoClient(server.uri)
import module_i_want_to_patch
module_i_want_to_patch.client = client
You can check the official MockupDB tutorial for the full request/response workflow.
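The snippet above only wires the client up; in a test you also script the server's replies. A sketch based on the MockupDB tutorial (the collection name and reply document are illustrative):
from mockupdb import MockupDB, go
from pymongo import MongoClient

server = MockupDB(auto_ismaster=True)  # answer the driver's handshake automatically
server.run()
client = MongoClient(server.uri)

# run the client call on a background thread, then script the server's reply
future = go(client.db.collection.find_one, {'_id': 1})
request = server.receives()
request.ok(cursor={'id': 0, 'firstBatch': [{'_id': 1}]})
assert future() == {'_id': 1}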


Why do we need airflow hooks?

The docs say:
Hooks are interfaces to external platforms and databases like Hive, S3, MySQL, Postgres, HDFS, and Pig. Hooks implement a common interface when possible, and act as a building block for operators.
But why do we need them?
I want to select data from one Postgres DB and store it in another one. Can I use, for example, the psycopg2 driver inside a Python script run by a PythonOperator, or does Airflow need to know what exactly I'm doing inside the script for some reason, so that I have to use PostgresHook instead of plain psycopg2?
You should just use PostgresHook. Instead of using psycopg2 directly, like so:
import psycopg2

conn = psycopg2.connect(host=host, dbname=dbname, user=user, password=password)
cur = conn.cursor()
cur.execute(query)
data = cur.fetchall()
You can just write:
# the import path varies by Airflow version; in Airflow 2 it lives under
# airflow.providers.postgres.hooks.postgres
from airflow.hooks.postgres_hook import PostgresHook

postgres = PostgresHook('connection_id')
data = postgres.get_pandas_df(query)
The hook pulls its credentials from Airflow's connection store, which supports encryption.
So using hooks is cleaner, safer and easier.
While it is possible to just hardcode the connection details in your script and run it, hooks let you manage those connections from within the Airflow UI instead.
Have a look at "Automate AWS Tasks Thanks to Airflow Hooks" to learn a bit more about how to use hooks.
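For the asker's concrete task, moving rows between two Postgres databases, a hedged sketch with two hooks might look like this (the connection ids, table, and query are illustrative, and the import path again depends on your Airflow version):
from airflow.hooks.postgres_hook import PostgresHook

def copy_users():
    src = PostgresHook(postgres_conn_id='source_db')  # illustrative connection id
    dst = PostgresHook(postgres_conn_id='dest_db')    # illustrative connection id
    rows = src.get_records('SELECT id, name FROM users')
    dst.insert_rows(table='users', rows=rows)  # DbApiHook helper; inserts in batches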

Django redis LPUSH / RPUSH

I am using the django-redis backend and the django.core.cache.cache module.
The Django cache module does not support pushing to lists or otherwise manipulating Redis data structures in place.
The implied way to update a list through the Django cache module is:
my_list = cache.get('my_list')
my_list.append('my value')
cache.set('my_list', my_list)
This approach is not efficient because the entire list is being loaded into the application server's memory.
Redis has support for the LPUSH / RPUSH commands to dynamically update a list. However, it doesn't look like these methods are available in the django cache module.
The official python redis client seems to implement these methods.
Is there any reason why Django doesn't offer this functionality? I'm asking out of curiosity; possibly I missed some detail?
django-redis does support raw client and command access; for that you have to get hold of the raw client instead of going through the Django cache API.
https://github.com/jazzband/django-redis#raw-client-access
In some situations your application requires access to a raw Redis client to use some advanced features that aren't exposed by the Django cache interface. To avoid storing another setting for creating a raw connection, django-redis exposes functions with which you can obtain a raw client reusing the cache connection string: get_redis_connection(alias).
Code example:
>>> from django_redis import get_redis_connection
>>> con = get_redis_connection("default")
>>> con
<redis.client.StrictRedis object at 0x2dc4510>
>>> con.lpush('mylist',1)
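From there the full list API is available. Continuing the session above (return values assume 'mylist' started empty):
>>> con.rpush('mylist', 2)
2
>>> con.lrange('mylist', 0, -1)
[b'1', b'2']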

Setup Cassandra DB in django using cqlengine but without using django-cassandra-engine

I'm a Django beginner and have developed one app using MySQL as the primary DB, but in my next project I have to use Cassandra via https://github.com/cqlengine/cqlengine, without using https://github.com/r4fek/django-cassandra-engine (which is a wrapper over cqlengine?).
I don't have any clue how to start. I mean, how and where should I create the DB connection and then create models in the models.py file?
Should I create the connection in the __init__.py file? In views.py? What would be the most efficient way?
It would be great (for future readers too) if someone provided a simple configuration and a model.
The twissandra demo should be a good example of how to build an app using Cassandra and Django.
In this implementation there is no models.py and the connection is maintained in the file cass.py.
You'll see cass.py also hosts all the functions required to return data from the C* database and build the objects used by the system. This is where you would swap out the API requests for your cqlengine code.
I hope these resources get you pointed in the right direction.
Rustyrazorblade shows an easy way to accomplish this in his CQLEngine tutorial branch.
You can easily setup the connection by doing something like this in your_app_project/models/connection.py:
from cqlengine import management
from cqlengine.connection import setup

def connect():
    setup(["127.0.0.1", "127.0.1.1", "127.0.1.2"], "tutorial", retry_connect=True)
    management.create_keyspace("tutorial", replication_factor=1, strategy_class="SimpleStrategy")
In this example: "tutorial" is the keyspace we are using; strategy_class is the replication strategy your C* instance uses; replication_factor is the number of replicas stored throughout the ring; 127.0.0.1 is a Cassandra cluster node IP address (you can pass a list or a string); and retry_connect specifies whether to attempt to reconnect after a connection failure.
From here it is very easy for new C* users to get confused. You can call this any time before syncing the C* tables or issuing a C* query.
So, you'll want to do something like:
from cqlengine.management import sync_table
from models.connection import connect
from models.somemodels import MyCassandraModel
# This will fire off our previously setup 'connect' method
connect()
# This will setup the Model as a table in your C* DB
sync_table(MyCassandraModel)
You can even drop this into manage.py, just as long as that CQLEngine setup() is properly executed.
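For completeness, a minimal sketch of what the model referenced above might look like (the file path, class name, and fields are illustrative, not from the tutorial):
# models/somemodels.py
import uuid
from cqlengine import columns
from cqlengine.models import Model

class MyCassandraModel(Model):
    id = columns.UUID(primary_key=True, default=uuid.uuid4)
    name = columns.Text()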

Python database WITHOUT using Django (for Heroku)

To my surprise, I haven't found this question asked elsewhere. Short version, I'm writing an app that I plan to deploy to the cloud (probably using Heroku), which will do various web scraping and data collection. The reason it'll be in the cloud is so that I can have it be set to run on its own every day and pull the data to its database without my computer being on, as well as so the rest of the team can access the data.
I used to use AWS's SimpleDB and DynamoDB, but I found SDB's storage limits too small and DDB's poor querying ability a problem, so I'm looking for a database system (SQL or NoSQL) that can store arbitrary-length values (and ideally arbitrary data structures) and that can be queried on any field.
I've found many database solutions for Heroku, such as ClearDB, but all of the information I've seen has shown how to set up Django to access the database. Since this is intended to be script and not a site, I'd really prefer not to dive into Django if I don't have to.
Is there any kind of database that I can hook up to in Heroku with Python without using Django?
You can get a database provided by Heroku without requiring your app to use Django. To do so:
heroku addons:add heroku-postgresql:dev
If you need a larger, more dedicated database, you can examine the plans at Heroku Postgres.
Within your requirements.txt you'll want to add:
psycopg2
Then you can connect/interact with it similar to the following:
import os
import psycopg2
from urllib.parse import urlparse  # Python 2: import urlparse and also call urlparse.uses_netloc.append('postgres')

url = urlparse(os.environ['DATABASE_URL'])
conn = psycopg2.connect("dbname=%s user=%s password=%s host=%s" % (url.path[1:], url.username, url.password, url.hostname))
cur = conn.cursor()
query = "SELECT ...."
cur.execute(query)
data = cur.fetchall()
I'd use MongoDB. Heroku has support for it, so I think it will be really easy to start and scale out: https://addons.heroku.com/mongohq
As for Python: MongoDB is a really easy database to work with. The schema is flexible and maps really well onto Python dictionaries, which is a big plus.
You can use PyMongo
from pymongo import MongoClient  # Connection was removed in PyMongo 3

connection = MongoClient()
# Get your DB
db = connection.my_database
# Get your collection
cars = db.cars
# Create a document
import datetime
car = {"brand": "Ford",
       "model": "Mustang",
       "date": datetime.datetime.utcnow()}
# Insert it
cars.insert_one(car)  # insert() in PyMongo 2.x
Pretty simple, huh?
Hope it helps.
EDIT:
As Endophage mentioned, another good option for interfacing with Mongo is mongoengine. If you have lots of data to store, you should take a look at that.
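A hedged sketch of what the mongoengine route might look like (the database name, document class, and fields are illustrative):
import datetime
from mongoengine import Document, StringField, DateTimeField, connect

connect('my_database')  # connects to localhost by default

class Car(Document):
    brand = StringField(required=True)
    model = StringField()
    date = DateTimeField(default=datetime.datetime.utcnow)

Car(brand='Ford', model='Mustang').save()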
I did this recently with Flask. (https://github.com/HexIce/flask-heroku-sqlalchemy).
There are a couple of gotchas:
1. If you don't use Django you may have to set up your database yourself by doing:
heroku addons:add shared-database
(Or whichever database you want to use, the others cost money.)
2. The database URL is stored in Heroku in the "DATABASE_URL" environment variable.
In Python you can get it by doing:
dburl = os.environ['DATABASE_URL']
What you do to connect to the database from there is up to you; one option is SQLAlchemy.
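For example, a minimal SQLAlchemy sketch against that variable might look like this (note the scheme rewrite: Heroku hands out postgres:// URLs, while newer SQLAlchemy expects postgresql://):
import os
from sqlalchemy import create_engine, text

# SQLAlchemy 1.4+ rejects the legacy postgres:// scheme
db_url = os.environ['DATABASE_URL'].replace('postgres://', 'postgresql://', 1)
engine = create_engine(db_url)
with engine.connect() as conn:
    print(conn.execute(text('SELECT 1')).scalar())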
Create a standalone Heroku Postgres database. http://postgres.heroku.com

Database testing in python, postgresql

How do you unit test your Python DAL that uses PostgreSQL?
In SQLite you could create an in-memory database for every test, but this cannot be done for PostgreSQL.
I want a library that can set up a database and clean it up once the tests are done.
I am using Sqlalchemy as my ORM.
pg_tmp(1) is a utility intended to make this task easy. Here is how you might start up a new connection with SQLAlchemy:
from subprocess import check_output
from sqlalchemy import create_engine
url = check_output(['pg_tmp', '-t']).decode('utf-8').strip()  # check_output returns bytes on Python 3
engine = create_engine(url)
This will spin up a new database that is automatically destroyed in 60 seconds. If a connection is open pg_tmp will wait until all active connections are closed.
Have you tried testing.postgresql?
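Its documented usage is a context manager that starts a throwaway server and tears it down on exit; roughly:
import testing.postgresql
from sqlalchemy import create_engine

with testing.postgresql.Postgresql() as postgresql:
    engine = create_engine(postgresql.url())
    # ... create tables and exercise your DAL against this engine ...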
You can use nose to write your tests, then just use SQLAlchemy to create and clean the test database in your setup/teardown methods, as sketched below.
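A sketch of that setup/teardown idea with nose's module-level fixtures (Base and the test DB URL are illustrative; Base would be your SQLAlchemy declarative base):
from sqlalchemy import create_engine
from myapp.models import Base  # hypothetical declarative base for your models

engine = create_engine('postgresql://localhost/test_db')  # illustrative test DB

def setup():
    # nose runs this once per module
    Base.metadata.create_all(engine)

def teardown():
    # and this after the module's tests finish
    Base.metadata.drop_all(engine)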
There's QuickPiggy too, which is capable of cleaning up after itself.
From the docs:
A makeshift PostgreSQL instance can be obtained quite easily:
import quickpiggy
import psycopg2

pig = quickpiggy.Piggy(volatile=True, create_db='somedb')
conn = psycopg2.connect(pig.dsnstring())
