To my surprise, I haven't found this question asked elsewhere. Short version, I'm writing an app that I plan to deploy to the cloud (probably using Heroku), which will do various web scraping and data collection. The reason it'll be in the cloud is so that I can have it be set to run on its own every day and pull the data to its database without my computer being on, as well as so the rest of the team can access the data.
I used to use AWS's SimpleDB and DynamoDB, but I found SDB's storage limitations to be to small and DDB's poor querying ability to be a problem, so I'm looking for a database system (SQL or NoSQL) that can store arbitrary-length values (and ideally arbitrary data structures) and that can be queried on any field.
I've found many database solutions for Heroku, such as ClearDB, but all of the information I've seen has shown how to set up Django to access the database. Since this is intended to be script and not a site, I'd really prefer not to dive into Django if I don't have to.
Is there any kind of database that I can hook up to in Heroku with Python without using Django?
You can get a database provided from Heroku without requiring your app to use Django. To do so:
heroku addons:add heroku-postgresql:dev
If you need a larger more dedicated database, you can examine the plans at Heroku Postgres
Within your requirements.txt you'll want to add:
psycopg2
Then you can connect/interact with it similar to the following:
import psycopg2
import os
import urlparse
urlparse.uses_netloc.append('postgres')
url = urlparse.urlparse(os.environ['DATABASE_URL'])
conn = psycopg2.connect("dbname=%s user=%s password=%s host=%s " % (url.path[1:], url.username, url.password, url.hostname))
cur = conn.cursor()
query = "SELECT ...."
cur.execute(query)
I'd use MongoDB. Heroku has support for it, so I think it will be really easy to start and scale out: https://addons.heroku.com/mongohq
About Python: MongoDB is a really easy database. The schema is flexible and fits really well with Python dictionaries. That's something really good.
You can use PyMongo
from pymongo import Connection
connection = Connection()
# Get your DB
db = connection.my_database
# Get your collection
cars = db.cars
# Create some objects
import datetime
car = {"brand": "Ford",
"model": "Mustang",
"date": datetime.datetime.utcnow()}
# Insert it
cars.insert(car)
Pretty simple, uh?
Hope it helps.
EDIT:
As Endophage mentioned, another good option for interfacing with Mongo is mongoengine. If you have lots of data to store, you should take a look at that.
I did this recently with Flask. (https://github.com/HexIce/flask-heroku-sqlalchemy).
There are a couple of gotchas:
1. If you don't use Django you may have to set up your database yourself by doing:
heroku addons:add shared-database
(Or whichever database you want to use, the others cost money.)
2. The database URL is stored in Heroku in the "DATABASE_URL" environment variable.
In python you can get it by doing.
dburl = os.environ['DATABASE_URL']
What you do to connect to the database from there is up to you, one option is SQLAlchemy.
Create a standalone Heroku Postgres database. http://postgres.heroku.com
Related
Doc says:
Hooks are interfaces to external platforms and databases like Hive, S3, MySQL, Postgres, HDFS, and Pig. Hooks implement a common interface when possible, and act as a building block for operators. Ref
But why do we need them?
I want to select data from one Postgres DB, and store to another one. Can I use, for example, psycopg2 driver inside python script, which runs by a python operator, or airflow should know for some reason what exactly I'm doing inside script, so, I need to use PostgresHook instead of just psycopg2 driver?
You should use just PostresHook. Instead of using psycopg2 as so:
conn = f'{pass}:{server}#host etc}'
cur = conn.cursor()
cur.execute(query)
data = cur.fetchall()
You can just type:
postgres = PostgresHook('connection_id')
data = postgres.get_pandas_df(query)
Which can also make use of encryption of connections.
So using hooks is cleaner, safer and easier.
While it is possible to just hardcode the connections in your script and run it, the power of hooks will allow to edit environment variables from within the UI.
Have a look at "Automate AWS Tasks Thanks to Airflow Hooks" to learn a bit more about how to use hooks.
I'm a Django beginner and have developed 1 app using mysql as primary DB, but in my next project I have to use Cassandra db using https://github.com/cqlengine/cqlengine but do not use https://github.com/r4fek/django-cassandra-engine (which is a wrapper over cqlengine?).
I dont have any clue How do I start? I mean how and where should I create db connection and then create models in models.py file?
Should I create connection in init.py file?in views.py? what would be the most efficient way?
would be great(for future readers too) if someone provide a simple configuration and a model.
The twissandra demo should be a good example of how to build an app using Cassandra and Django.
In this implementation there is no models.py and the connection is maintained in the file cass.py.
You'll see cass.py also hosts all the functions required to return data from the C* database and make objects which are used by the system. This is where you would swap out the api requests with your CqlEngine code.
I hope these resources get you pointed in the right direction
Rustyrazorblade shows an easy way to accomplish this via his CQLEngine tutorial branch HERE.
You can easily setup the connection by doing something like this in your_app_project/models/connection.py:
from cqlengine import management
from cqlengine.connection import setup
def connect():
setup(["127.0.0.1", "127.0.1.1", "127.0.1.2"], "tutorial", retry_connect=True)
management.create_keyspace("tutorial", replication_factor=1, strategy_class="SimpleStrategy")
In this example: "tutorial" is the keyspace we are using, strategy_class is the replication strategy your C* instance is using, replication_factor is the amount of replications that will be stored throughout the ring, 127.0.0.1 is a Cassandra cluster node IP address (you can pass this a list or a string) and retry_connect specifies whether or not you would like it to attempt to reconnect if there is a connection failure.
From here, it is very easy for new C* users to get confused. You can call this anytime Before syncing the C* tables or using a C* query.
So, you'll want to do something like:
from cqlengine.management import sync_table
from models.connection import connect
from models.somemodels import MyCassandraModel
# This will fire off our previously setup 'connect' method
connect()
# This will setup the Model as a table in your C* DB
sync_table(MyCassandraModel)
You can even drop this into manage.py, just as long as that CQLEngine setup() is properly executed.
I am using pymongo to connect to mongodb in my code. I am writing a google analytic kind of application. My db structure is like that for each new website I create a new db. So when someone registers a website I create a new db with that name, however when unregistering the website I wish the database to be deleted. I remove all the collection but still the database could not be removed
And as such the list of databases is growing very large. When I do
client = MongoClient(host=MONGO_HOST,port=27017,max_pool_size=200)
client.database_names()
I see a more than a 1000 list of apps. Many of them are just empty databases. Is there a way that I remove the mongo databases ?
Use drop_database method:
client = MongoClient(host=MONGO_HOST,port=27017,max_pool_size=200)
client.drop_database("database_name")
I have a PostgreSQL schema that resides in a schema.sql file that gets run each time a database connection is initiated in Python. It looks something like:
CREATE TABLE IF NOT EXISTS users (
id SERIAL PRIMARY KEY,
facebook_id TEXT NOT NULL,
name TEXT NOT NULL,
access_token TEXT,
created TIMESTAMP WITH TIME ZONE NOT NULL DEFAULT NOW()
);
The app is deployed on Heroku, using their PostgreSQL and everything works as expected.
Now, what if I want to change a bit the structure of my users table? How can I do this the easiest and the best way? I thought of writing an ALTER... line in schema.sql for each change I want to produce in the database, but I don't think this is the best approach, since after some time the schema file will be full of ALTERs and it will slow down my app.
What's the indicated way to deploy changes made to a database?
Running a hard-coded script on each connection is not a great way to handle schema management.
You need to either manage the schema manually, or use a full-fledged tool that keeps a schema version identifier in the database, checks that, and applies a script to upgrade to the next schema version if it's different to the latest one. Rails calls this "migrations" and it kind-of works. If you're using Django it has schema management too.
If you're not using a framework like that, I suggest just writing your own schema upgrade scripts. Add a "schema_version" table with a single row. SELECT it when the app first starts after a redeploy and if it's lower than the current version the app knows about, apply the update script(s) in order, eg schema_1_to_2, schema_2_to_3, etc.
I don't recommend doing this on connect, do it on app start, or better, as a special maintenance command. If you do it on every connection you'll have multiple connections trying to make the same changes and you'll land up with duplicated columns and all sorts of other mess.
I support several django apps on heroku with Postgres. I just connect via PgAdmin and run my scripts when changes are required. I don't see any need for running a script every time a connection is made.
I've looked over Google Cloud SQL's documentation and various searches, but I can't find out whether it is possible to use SQLAlchemy with Google Cloud SQL, and if so, what the connection URI should be.
I'm looking to use the Flask-SQLAlchemy extension and need the connection string like so:
mysql://username:password#server/db
I saw the Django example, but it appears the configuration uses a different style than the connection string. https://developers.google.com/cloud-sql/docs/django
Google Cloud SQL documentation:
https://developers.google.com/cloud-sql/docs/developers_guide_python
Update
Google Cloud SQL now supports direct access, so the MySQLdb dialect can now be used. The recommended connection via the mysql dialect is using the URL format:
mysql+mysqldb://root#/<dbname>?unix_socket=/cloudsql/<projectid>:<instancename>
mysql+gaerdbms has been deprecated in SQLAlchemy since version 1.0
I'm leaving the original answer below in case others still find it helpful.
For those who visit this question later (and don't want to read through all the comments), SQLAlchemy now supports Google Cloud SQL as of version 0.7.8 using the connection string / dialect (see: docs):
mysql+gaerdbms:///<dbname>
E.g.:
create_engine('mysql+gaerdbms:///mydb', connect_args={"instance":"myinstance"})
I have proposed an update to the mysql+gaerdmbs:// dialect to support both of Google Cloud SQL APIs (rdbms_apiproxy and rdbms_googleapi) for connecting to Cloud SQL from a non-Google App Engine production instance (ex. your development workstation). The change will also modify the connection string slightly by including the project and instance as part of the string, and not require being passed separately via connect_args.
E.g.
mysql+gaerdbms:///<dbname>?instance=<project:instance>
This will also make it easier to use Cloud SQL with Flask-SQLAlchemy or other extension where you don't explicitly make the create_engine() call.
If you are having trouble connecting to Google Cloud SQL from your development workstation, you might want to take a look at my answer here - https://stackoverflow.com/a/14287158/191902.
Yes,
If you find any bugs in SA+Cloud SQL, please let me know. I wrote the dialect code that was integrated into SQLAlchemy. There's a bit of silly business about how Cloud SQL bubbles up exceptions, so there might be some loose ends there.
For those who prefer PyMySQL over MySQLdb (which is suggested in the accepted answer), the SQLAlchemy connection strings are:
For Production
mysql+pymysql://<USER>:<PASSWORD>#/<DATABASE_NAME>?unix_socket=/cloudsql/<PUT-SQL-INSTANCE-CONNECTION-NAME-HERE>
Please make sure to
Add the SQL instance to your app.yaml:
beta_settings:
cloud_sql_instances: <PUT-SQL-INSTANCE-CONNECTION-NAME-HERE>
Enable the SQL Admin API as it seems to be necessary:
https://console.developers.google.com/apis/api/sqladmin.googleapis.com/overview
For Local Development
mysql+pymysql://<USER>:<PASSWORD>#localhost:3306/<DATABASE_NAME>
given that you started the Cloud SQL Proxy with:
cloud_sql_proxy -instances=<PUT-SQL-INSTANCE-CONNECTION-NAME-HERE>=tcp:3306
it is doable, though I haven't used Flask at all so I'm not sure about establishing the connection through that. I got it working through Pyramid and submitted a patch to SQLAlchemy (possibly to the wrong repo) here:
https://bitbucket.org/sqlalchemy/sqlalchemy/pull-request/2/added-a-dialect-for-google-app-engines
That has since been replaced and accepted into SQLAlchemy as
http://www.sqlalchemy.org/trac/ticket/2484
I don't think it's made it way to a release though.
There are some issues with Google SQL throwing different exceptions so we had issues with things like deploying a database automatically. You also need to disable connection pooling using NullPool as mentioned in the second patch.
We've since moved to using the datastore through NDB so I haven't followed the progess of these fixes for a while..
PostgreSQL, pg8000 and flask_sqlalchemy
Adding information in case someone is on the lookout how to use flask_sqlalchemy with PostgreSQL: Using pg8000 as driver, the working connection string is
postgres+pg8000://<db_user>:<db_pass>#/<db_name>