I'm currently working on something where data is produced, with a fair amount of processing, within SQL Server, and I ultimately need to manipulate it within Python to complete the task. It seems to me I have a couple of different options:
(1) Run the SQL code from within Python, manipulate output
(2) Create an SP in SSMS, run the SP from within Python, manipulate output
(3) ?
The second seems cleanest, but I wonder if there's a better way to achieve my objective without needing to create a stored procedure every time I need SQL data in Python. Copying the entirety of the SQL code into Python seems similarly kludgy, particularly for larger or complex queries.
For those with more experience working between the two: can you offer any suggestions on workflow?
There is no silver bullet.
It really depends on the specifics of what you're doing. What amount of data are we talking? Is it even feasible to stream it all over the network, through Python, and back? How much more load can the database server handle? How complex are the manipulations you consider doing in Python? How proficient are you and your team in SQL, and in Python?
I've seen both approaches in production, and one slight advantage that sometimes gets overlooked is that when you have all the SQL nicely formatted inside your Python program, it's automatically under some Version Control, and you can check who edited what last and is thus to blame for the latest SNAFU ;-)
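For what it's worth, a minimal sketch of options (1) and (2), assuming pyodbc and pandas are available; the connection string, .sql file, and stored procedure name are placeholders, not anything from the question:

    # Minimal sketch, assuming pyodbc + pandas; the connection string, query
    # file and stored procedure name below are placeholders.
    import pyodbc
    import pandas as pd

    conn = pyodbc.connect(
        "DRIVER={ODBC Driver 17 for SQL Server};SERVER=myserver;"
        "DATABASE=mydb;Trusted_Connection=yes"
    )

    # Option (1): keep the query in a .sql file under version control
    with open("reports/monthly_totals.sql") as f:
        df = pd.read_sql(f.read(), conn)

    # Option (2): call a stored procedure created in SSMS
    df2 = pd.read_sql("EXEC dbo.MonthlyTotals @Year = ?", conn, params=[2023])

    # ...manipulate df / df2 with pandas from here

Either way the heavy lifting stays in SQL Server and Python only sees the result set.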
I'm working on a framework for Digital Forensic Investigators to use to compare files with each other for my Master's capstone project. However, I ran into a bit of a snag...
I'm trying to implement multiprocessing on the comparisons since using a single core seems to be really slow. The trouble I'm having, however, is when the code goes to enter information into an SQLite database. It will occasionally get a "Database is locked" error when two cores complete at nearly the same time.
So, the simple side of my question: is it unsafe to perform database operations within a multiprocessing environment, given the errors I'm encountering? If not, is there a method of going about this that is safe and won't result in random errors?
Thanks!
Your problem is that you are trying to have multiple writers access a toy database -- i.e. sqlite -- which is stored in a single file. Using Lock might help, but it's going to kill your multiprocess throughput because of all the waiting-for-the-lock time. In essence, the lock choke point will serialize your program.
Setting up either MySQL or Postgres on almost any platform is straightforward, and there are several excellent Python modules for accessing them. Using one of those will completely eliminate this problem.
Update for an extended response to comment:
I always ask clients / students, "What problem are you trying to solve?" I'm assuming that you are not trying to create a database system, simply to use one. SQLite3 is fine for a well-defined set of problems, but multiprocess access is not one of them. I could veer off into asking what aspect of your project requires multiprocess access, but I'll assume that you have already determined that this is needed. I don't know either your programming skills or your understanding of how a database works, so forgive me if the following is a bit basic.
Normally you need a database (my preference is Postgres), and a Python module that understands all of the fiddly details of how to talk to that database. Then you need to know what it is you want the DBMS to do for you. The Good News is that you are hardly the first to go down this path.
The Postgres Wiki is full of good stuff. See their page on Python Drivers. Psycopg2 is the category leader and runs on Win/Linux/Mac. Also check out PyPI, the Python Package Index, for many well-written extensions.
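A minimal psycopg2 sketch; the connection parameters and table are placeholders:

    # Minimal psycopg2 sketch; connection parameters and table are placeholders.
    import psycopg2

    conn = psycopg2.connect(dbname="forensics", user="app",
                            password="secret", host="localhost")
    with conn:                       # commits on success, rolls back on error
        with conn.cursor() as cur:
            cur.execute(
                "INSERT INTO file_hashes (path, sha256) VALUES (%s, %s)",
                ("/evidence/img001.dd", "ab12..."),
            )
            cur.execute("SELECT count(*) FROM file_hashes")
            print(cur.fetchone()[0])
    conn.close()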
If you want to stay more object-oriented, as opposed to writing straight SQL, you might want to look at an ORM like SQLAlchemy. This is another category leader that is well-maintained and widely deployed.
The value of using an ORM is that you can (mostly) keep your head in ObjectLand, where most of your problem lives, and not get tangled up in the cognitive dissonance created by object-oriented programming vs. relational database management, which are two very different views of the world of data.
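As a rough illustration, a minimal SQLAlchemy ORM sketch (assuming SQLAlchemy 1.4+; the model and column names are made up):

    # Minimal SQLAlchemy ORM sketch; model and column names are illustrative.
    from sqlalchemy import create_engine, Column, Integer, String
    from sqlalchemy.orm import declarative_base, sessionmaker

    Base = declarative_base()

    class FileRecord(Base):
        __tablename__ = "file_records"
        id = Column(Integer, primary_key=True)
        path = Column(String, nullable=False)
        sha256 = Column(String)

    engine = create_engine("postgresql+psycopg2://app:secret@localhost/forensics")
    Base.metadata.create_all(engine)

    Session = sessionmaker(bind=engine)
    with Session() as session:
        session.add(FileRecord(path="/evidence/img001.dd", sha256="ab12..."))
        session.commit()
        found = session.query(FileRecord).filter_by(path="/evidence/img001.dd").first()

You write and query Python objects; SQLAlchemy generates the SQL.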
If you need more help, email me. My address is in my profile.
You can make use of Lock. Take a look at https://docs.python.org/2/library/multiprocessing.html#synchronization-between-processes
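A minimal sketch of that approach, passing a shared lock to each worker so only one process writes at a time; the table and the comparison itself are stand-ins for whatever your framework actually does:

    # Minimal sketch: serialize SQLite writes with a shared multiprocessing.Lock.
    # Table name and comparison logic are stand-ins, not from the original project.
    import sqlite3
    from multiprocessing import Pool, Lock

    def init_worker(shared_lock):
        global lock
        lock = shared_lock

    def compare_and_store(pair):
        a, b = pair
        score = abs(hash(a) - hash(b)) % 100      # stand-in for the real comparison
        with lock:                                # only one process writes at a time
            conn = sqlite3.connect("results.db", timeout=30)
            with conn:                            # commits (or rolls back) the insert
                conn.execute(
                    "INSERT INTO comparisons (file_a, file_b, score) VALUES (?, ?, ?)",
                    (a, b, score),
                )
            conn.close()

    if __name__ == "__main__":
        conn = sqlite3.connect("results.db")
        conn.execute("CREATE TABLE IF NOT EXISTS comparisons "
                     "(file_a TEXT, file_b TEXT, score INTEGER)")
        conn.close()

        pairs = [("f1", "f2"), ("f1", "f3"), ("f2", "f3")]
        with Pool(initializer=init_worker, initargs=(Lock(),)) as pool:
            pool.map(compare_and_store, pairs)

As the other answer points out, holding the lock around every write serializes that part of the work, so keep the locked section as small as possible.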
I am finding Neo4j slow to add nodes and relationships/arcs/edges when using the REST API via py2neo for Python. I understand that this is due to each REST API call executing as a single self-contained transaction.
Specifically, adding a few hundred pairs of nodes with relationships between them takes a number of seconds, running on localhost.
What is the best approach to significantly improve performance whilst staying with Python?
Would using bulbflow and Gremlin be a way of constructing a bulk insert transaction?
Thanks!
There are several ways to do a bulk create with py2neo, each making only a single call to the server.
1. Use the create method to build a number of nodes and relationships in a single batch.
2. Use a Cypher CREATE statement.
3. Use the new WriteBatch class (just released this week) to manually make a batch of nodes and relationships (this is really just a manual version of 1).
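For example, option 2 as a single round trip; py2neo's API has changed a lot between releases, so treat this as a sketch for recent versions (older releases expose the same idea through graph_db.create() or WriteBatch):

    # Sketch of option 2: one parameterized Cypher CREATE for all pairs.
    # Assumes a recent py2neo (Graph.run) and Neo4j's $-style parameters.
    from py2neo import Graph

    graph = Graph("bolt://localhost:7687", auth=("neo4j", "password"))  # placeholder credentials

    rows = [{"a": i, "b": i + 1} for i in range(300)]   # a few hundred pairs
    graph.run(
        "UNWIND $rows AS row "
        "CREATE (x:Thing {id: row.a})-[:LINKED_TO]->(y:Thing {id: row.b})",
        rows=rows,
    )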
If you have some code, I'm happy to look at it and make suggestions on performance tweaks. There are also quite a few tests you may be able to get inspiration from.
Cheers,
Nige
Neo4j's write performance is slow unless you are doing a batch insert.
The Neo4j batch importer (https://github.com/jexp/batch-import) is the fastest way to load data into Neo4j. It's a Java utility, but you don't need to know any Java because you're just running the executable. It handles typed data and indexes, and it imports from a CSV file.
To use it with Bulbs (http://bulbflow.com/) Models, use the model get_bundle() method to get the data, index name, and index keys, which is prepared for insert, and then output the data to a CSV file. Or if you don't want to model your data, just output your data from Python to the CSV file.
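For the second route (dumping your data straight from Python to a CSV file), the standard library is enough; the columns below are only illustrative, and the exact header conventions the importer expects are described in its README:

    # Sketch: dump Python records to a CSV file for a bulk import.
    # Column names are illustrative; check the importer's README for the
    # exact header format it expects.
    import csv

    records = [
        {"name": "report.doc", "sha256": "ab12...", "size": 1024},
        {"name": "image.jpg",  "sha256": "cd34...", "size": 2048},
    ]

    with open("nodes.csv", "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=["name", "sha256", "size"])
        writer.writeheader()
        writer.writerows(records)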
Will that work for you?
There are so many old answers to this question online that it took me forever to realize there's an import tool that comes with Neo4j. It's very fast and the best tool I was able to find.
Here's a simple example if we want to import student nodes:
bin/neo4j-import --into [path-to-your-neo4j-directory]/data/graph.db --nodes students
The students file contains data that looks like this, for example:
studentID:Id(Student),name,year:int,:LABEL
1111,Amy,2000,Student
2222,Jane,2012,Student
3333,John,2013,Student
Explanation:
The header explains how the data below it should be interpreted.
studentID is a property with type Id(Student).
name is of type string, which is the default.
year is an integer.
:LABEL is the label you want for these nodes; in this case it is "Student".
Here's the documentation for it: http://neo4j.com/docs/stable/import-tool-usage.html
Note: I realize the question specifically mentions Python, but another useful answer here also mentions a non-Python solution.
Well, I myself needed massive performance from Neo4j. I ended up doing the following things to improve graph performance.
1. Ditched py2neo, since there were a lot of issues with it. Besides, it is very convenient to use the REST endpoint provided by Neo4j; just make sure to use request sessions.
2. Used raw Cypher queries for bulk inserts instead of any OGM (Object-Graph Mapper). That is crucial if you need a high-performance system.
3. Performance was still not enough for my needs, so I ended up writing a custom system that merges 6-10 queries together using WITH * and UNION clauses. That improved performance by a factor of 3 to 5.
4. Use a larger transaction size, with at least 1000 queries per transaction.
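A sketch of points 1, 2 and 4 combined: a requests.Session posting a batch of parameterized Cypher statements to Neo4j's transactional HTTP endpoint in a single transaction. The URL and $-parameter syntax assume a 3.x-era server; older and newer versions differ slightly:

    # Sketch: batch many Cypher statements into one HTTP transaction.
    # Endpoint and parameter syntax assume a Neo4j 3.x-era server.
    import requests

    session = requests.Session()
    session.auth = ("neo4j", "password")          # placeholder credentials

    statements = [
        {"statement": "CREATE (n:Thing {id: $id})", "parameters": {"id": i}}
        for i in range(1000)                      # ~1000 queries per transaction
    ]

    resp = session.post(
        "http://localhost:7474/db/data/transaction/commit",
        json={"statements": statements},
    )
    resp.raise_for_status()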
To insert a bulk of nodes into Neo4j at very high speed, use the Batch Inserter:
http://neo4j.com/docs/stable/batchinsert-examples.html
In my case I'm working in Java.
We are still pretty new to Postgres and came from Microsoft SQL Server.
We want to write some stored procedures now. Well, after struggling to get something more complicated than a hello world working in PL/pgSQL, we decided that if we are going to learn a new language, we might as well learn Python, because we got the same query working in it in about 15 minutes (note: none of us actually knows Python).
So I have some questions about it in comparison to PL/pgSQL.
Is PL/Python slower than PL/pgSQL?
Is there any kind of "good" reference for how to write good stored procedures using it? Five short pages in the Postgres documentation doesn't really tell us enough.
What about query preparation? Should it always be used?
If we use the SD and GD dictionaries for a lot of query plans, will they ever get too full or have a negative impact on the server? Will old values be deleted automatically if they get too full?
Is there any hope of it becoming a trusted language?
Also, our stored procedure usage is extremely light. Right now we only have four, but we are still trying to convert little bits of code over from SQL Server-specific syntax (such as variables, which can't be used in Postgres outside of stored procedures).
Depends on what operations you're doing.
Well, combine that with a general Python documentation, and that's about what you have.
No. Again, depends on what you're doing. If you're only going to run a query once, no point in preparing it separately.
If you are using persistent connections, it might. But they get cleared out whenever a connection is closed.
Not likely. Sandboxing is broken in Python and AFAIK nobody is really interested in fixing it. I heard someone say that python-on-parrot may be the most viable way, once we have pl/parrot (which we don't yet).
Bottom line though - if your stored procedures are going to do database work, use pl/pgsql. Only use pl/python if you are going to do non-database stuff, such as talking to external libraries.
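As a concrete illustration of the query-preparation and SD points above, a minimal PL/Python sketch; the function and table names are hypothetical, and plan caching like this only pays off if the function is called repeatedly within a session:

    -- Minimal PL/Python sketch: cache a prepared plan in the SD dictionary.
    -- Function and table names are hypothetical.
    CREATE FUNCTION count_adults(min_age integer) RETURNS bigint AS $$
        # SD persists for the lifetime of the session, so prepare the plan once
        if "count_adults_plan" not in SD:
            SD["count_adults_plan"] = plpy.prepare(
                "SELECT count(*) AS n FROM people WHERE age >= $1", ["integer"])
        rv = plpy.execute(SD["count_adults_plan"], [min_age])
        return rv[0]["n"]
    $$ LANGUAGE plpythonu;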
I'm starting on a new scientific project which has a lot of data (millions of entries) I'd like to store in an easily and quickly accessible format. I've come across a number of different potential options, but I'm not sure how to pick amongst them. My data can probably just be stored as a dictionary, or potentially a dictionary of dictionaries. Some potential considerations:
Speed. I can't load all the data off disk every time I start a new script, and I'd like as quick access to random entries as possible.
Ease-of-use. This is Python. The storage should feel like Python.
Stability/maturity. I'd like something that's currently supported, although something that works well but is still in development would be fine.
Ease of installation. My sysadmin should be able to get this running on our cluster.
I don't really care that much about the size of the storage, but it could be a consideration if an option is really terrible on this front. Also, if it matters, I'll most likely be creating the database once, and thereafter only reading from it.
Some potential options that I've started looking at (see this post):
pyTables
ZopeDB
shove
shelve
redis
durus
Any suggestions on which of these might be better for my purposes? Any better ideas? Some of these have a back-end; any suggestions on which file-system back-end would be best?
Might want to give mongodb a shot - the PyMongo library works with dictionaries and supports most Python types. Easy to install, very performant + scalable. MongoDB (and PyMongo) is also used in production at some big names.
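A minimal PyMongo sketch (recent PyMongo; database and collection names are placeholders):

    # Minimal PyMongo sketch; database and collection names are placeholders.
    from pymongo import MongoClient

    client = MongoClient("localhost", 27017)
    db = client.science_project
    entries = db.entries

    entries.insert_one({"run": 42, "params": {"alpha": 0.1}, "result": [1.0, 2.5, 3.7]})
    doc = entries.find_one({"run": 42})        # fast random access by key
    entries.create_index("run")                # index the lookup field for speed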
An RDBMS.
Nothing is more reliable than using tables in a well-known RDBMS. PostgreSQL comes to mind.
That automatically gives you some choices for the future like clustering. Also you automatically have a lot of tools to administer your database, and you can use it from other software written in virtually any language.
It is really fast.
On the "feel like Python" point, I might add that you can use an ORM. A strong name is SQLAlchemy, maybe with the Elixir "extension".
Using SQLAlchemy you can let your user/sysadmin choose which database backend they want to use. Maybe they already have MySQL installed - no problem.
RDBMSs are still the best choice for data storage.
I'm working on such a project and I'm using SQLite.
SQLite stores everything in one file and is part of Python's standard library. Hence, installation and configuration come virtually for free (ease of installation).
You can easily manage the database file with small Python scripts or via various tools. There is also a Firefox plugin (ease of installation / ease-of-use).
I find it very convenient to use SQL to filter/sort/manipulate/... the data, although I'm not an SQL expert. (ease-of-use)
I'm not sure if SQLite is the fastest DB system for this kind of work, and it lacks some features you might need, e.g. stored procedures.
Anyway, SQLite works for me.
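For reference, a minimal sqlite3 sketch of the one-file setup; table and column names are made up:

    # Minimal sqlite3 sketch; table and column names are illustrative.
    import sqlite3

    conn = sqlite3.connect("experiment.db")
    conn.execute("CREATE TABLE IF NOT EXISTS entries (key TEXT PRIMARY KEY, value REAL)")
    with conn:                                  # the block commits as one transaction
        conn.executemany(
            "INSERT OR REPLACE INTO entries (key, value) VALUES (?, ?)",
            [("run-1", 0.37), ("run-2", 1.25)],
        )
    row = conn.execute("SELECT value FROM entries WHERE key = ?", ("run-1",)).fetchone()
    conn.close()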
If you really just need dictionary-like storage, some of the new key/value or column stores like Cassandra or MongoDB might provide a lot more speed than you'd get with a relational database. Of course, if you decide to go with an RDBMS, SQLAlchemy is the way to go (disclaimer: I am its creator), but your desired feature list seems to lean in the direction of "I just want a dictionary that feels like Python" - if you aren't interested in relational queries or strong ACIDity, those facets of RDBMS will probably feel cumbersome.
SQLite -- it comes with Python, is fast, widely available and easy to maintain.
If you only need simple (dict like) access mechanisms and need efficiency for processing a lot of data, then HDF5 might be a good option. If you are going to be using numpy then it is really worth considering.
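For instance, a small sketch using h5py as one HDF5 binding; the file and dataset names are arbitrary:

    # Small HDF5 sketch via h5py; file and dataset names are arbitrary.
    import numpy as np
    import h5py

    data = np.random.rand(1_000_000, 3)

    with h5py.File("entries.h5", "w") as f:
        f.create_dataset("measurements", data=data, compression="gzip")

    with h5py.File("entries.h5", "r") as f:
        subset = f["measurements"][5000:5010]   # reads only the requested slice from disk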
Go with an RDBMS; it is reliable, scalable and fast.
If you need a more scalable solution and don't need the features of an RDBMS, you can go with a key-value store like CouchDB, which has a good Python API.
The NEMO collaboration (building an underwater cosmic neutrino detector) had many of the same problems, and they used MySQL and PostgreSQL without major issues.
It really depends on what you're trying to do. An RDBMS is designed for relational data, so if your data is relational, then use one of the various SQL options. But it sounds like your data is more oriented towards a key-value store with very fast random GET operations. If that's the case, compare the benchmarks of the various key-stores, focusing on the GET speed. The ideal key-value store will keep or cache requests in memory, and be able to handle many GET requests concurrently. You may actually want to create your own benchmark suite so you can effectively compare random concurrent GET operations.
Why do you need a cluster? Is the size of each value very large? If not, you shouldn't need a cluster to handle storage of a million entries. But if you're storing large blobs of data, that matters, and you may need something that easily supports read slaves and/or transparent partitioning. Some of the key-value stores are document oriented and/or optimized for storing larger values. Redis is technically more storage efficient for larger values due to the indexing overhead required for fast GETs, but that doesn't necessarily mean it's slower. In fact, the extra indexing makes lookups faster.
You're the only one that can truly answer this question, and I strongly recommend putting together a custom benchmark suite to test available options with actual usage scenarios. The data you get from that will give you more insight than anything else.
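A skeleton for that kind of benchmark, timing random concurrent GETs; get_value() is a stand-in you would replace with the client call of whichever store you are testing:

    # Skeleton benchmark for random concurrent GETs; get_value() is a stand-in
    # for the client call of whichever store is being tested.
    import random
    import time
    from concurrent.futures import ThreadPoolExecutor

    KEYS = [f"key-{i}" for i in range(100_000)]

    def get_value(key):
        return key   # replace with e.g. a Redis GET or a MongoDB find_one

    def worker(n_requests):
        for _ in range(n_requests):
            get_value(random.choice(KEYS))

    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=8) as pool:
        for _ in range(8):
            pool.submit(worker, 10_000)
    elapsed = time.perf_counter() - start
    print(f"{8 * 10_000 / elapsed:.0f} GETs/sec")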
Yes, this is as stupid a situation as it sounds like. Due to some extremely annoying hosting restrictions and unresponsive tech support, I have to use a CSV file as a database.
While I can use MySQL with PHP, I can't use it with the Python backend of my program because of install issues with the host. I can't use SQLite with PHP because of more install issues, but I can use it from Python since the sqlite3 module is built in.
Anyways, now, the question: is it possible to update values SQL-style in a CSV database? Or should I keep on calling the help desk?
Don't walk, run to get a new host immediately. If your host won't even get you the most basic of free databases, it's time for a change. There are many fish in the sea.
At the very least I'd recommend an xml data store rather than a csv. My blog uses an xml data provider and I haven't had any issues with performance at all.
Take a look at this: http://www.netpromi.com/kirbybase_python.html
Keep calling on the help desk.
While you can use a CSV file as a database, it's generally a bad idea. You would have to implement your own locking, searching, and updating, and be very careful with how you write it out to make sure that it isn't erased in case of a power outage or other abnormal shutdown. There will be no transactions, no query language unless you write your own, etc.
I couldn't imagine this ever being a good idea. The current mess I've inherited writes vital billing information to CSV and updates it after projects are complete. It runs horribly and thousands of dollars are missed each month. Given the restrictions you're under, I'd consider finding better hosting.
You can probably use sqlite3 for a more real database. It's hard to imagine hosting that won't allow you to use it, since it ships as a Python module.
Don't even think of using CSV; your data will be corrupted and lost faster than you can say "s#&t".
"Anyways, now, the question: is it possible to update values SQL-style in a CSV database?"
Technically, it's possible. However, it can be hard.
If both PHP and Python are writing the file, you'll need to use OS-level locking to assure that they don't overwrite each other. Each part of your system will have to lock the file, rewrite it from scratch with all the updates, and unlock the file.
This means that PHP and Python must load the entire file into memory before rewriting it.
There are a couple of ways to handle the OS locking.
1. Use the same file and actually use some OS lock module. Both processes have the file open at all times.
2. Write to a temp file and do a rename. This means each program must open and read the file for each transaction. Very safe and reliable. A little slow. (See the sketch below.)
Or.
You can rearchitect it so that only Python writes the file. The front-end reads the file when it changes, and drops off little transaction files to create a work queue for Python. In this case, you don't have multiple writers -- you have one reader and one writer -- and life is much, much simpler.
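Going back to option 2 above (write to a temp file and do a rename), a sketch of what that looks like in Python; the column names are made up:

    # Sketch of the temp-file-and-rename approach: rewrite the whole CSV to a
    # temporary file, then atomically swap it in. Column names are illustrative.
    import csv
    import os
    import tempfile

    def update_row(path, key, new_status):
        with open(path, newline="") as f:
            reader = csv.DictReader(f)
            fieldnames = reader.fieldnames
            rows = list(reader)

        for row in rows:
            if row["id"] == key:
                row["status"] = new_status

        fd, tmp_path = tempfile.mkstemp(dir=os.path.dirname(path) or ".")
        with os.fdopen(fd, "w", newline="") as tmp:
            writer = csv.DictWriter(tmp, fieldnames=fieldnames)
            writer.writeheader()
            writer.writerows(rows)
        os.replace(tmp_path, path)   # atomic on POSIX; near-atomic on Windows

You would still want OS-level locking around the read-modify-write if both PHP and Python can write.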
I'd keep calling the help desk. You don't want to use CSV for data if it's relational at all. It's going to be a nightmare.
I agree. Tell them that 5 random strangers agree that you being forced into a corner to use CSV is absurd and unacceptable.
If I understand you correctly: you need to access the same database from both python and php, and you're screwed because you can only use mysql from php, and only sqlite from python?
Could you further explain this? Maybe you could use xml-rpc or plain http requests with xml/json/... to get the php program to communicate with the python program (or the other way around?), so that only one of them directly accesses the db.
If this is not the case, I'm not really sure what the problem is.
It's technically possible. For example, Perl has DBD::CSV that provides a driver that runs SQL queries on the CSV file.
That being said, why not run off a SQLite database on your server?
What about postgresql? I've found that quite nice to work with, and python supports it well.
But I really would look for another provider unless it's really not an option.
Disagreeing with the noble colleagues, I often use DBD::CSV from Perl. There are good reasons to do it. Foremost is that data updates are made simple using a spreadsheet. As a bonus, since I am using SQL queries, the application can be easily upgraded to a real database engine. Bear in mind these were extremely small databases in single-user applications.
So, rephrasing the question: is there a Python module equivalent to Perl's DBD::CSV?
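One rough stand-in, using only the standard library: load the CSV into an in-memory sqlite3 table, run SQL against it, and write it back out. The file and column names below are placeholders:

    # Rough stand-in for Perl's DBD::CSV: SQL over a CSV via an in-memory
    # SQLite table. File and column names are placeholders.
    import csv
    import sqlite3

    with open("data.csv", newline="") as f:
        reader = csv.DictReader(f)
        fieldnames = reader.fieldnames
        rows = [tuple(r[c] for c in fieldnames) for r in reader]

    conn = sqlite3.connect(":memory:")
    cols = ", ".join(f'"{c}"' for c in fieldnames)
    placeholders = ", ".join("?" for _ in fieldnames)
    conn.execute(f"CREATE TABLE data ({cols})")
    conn.executemany(f"INSERT INTO data VALUES ({placeholders})", rows)

    conn.execute('UPDATE data SET "status" = ? WHERE "id" = ?', ("paid", "42"))

    with open("data.csv", "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(fieldnames)
        writer.writerows(conn.execute("SELECT * FROM data"))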