Use case for Amazon SimpleDB - python

I'm starting to build out a project using MySQL and now starting to think that SimpleDB might be more appropriate. (My reason for potentially using SimpleDB over another NoSQL solution is that it's easy to use with EC2).
I have a series of spiders scraping information on widgets using the Python framework, Scrapy, and the Django ORM to put the results into a MySQL db. I'll be building out a website that makes use of this data. I'm thinking that SimpleDB might be more appropriate because:
Some of the sites have fields specific to them, so the schema may be subject to change as I come across them. SimpleDB obviously allows for a lot more flexibility here.
I'm going to be collecting info on around 5 million widgets a year. My sense is that MySQL can handle this, but figuring out the indexes might be a hassle. SimpleDB promises consistent performance at scale.
The cons I can see are that writing queries will be more complex, that I'll need to pre-aggregate more, and my general unfamiliarity with NoSQL.
Questions:
Which option would you recommend?
How would you approach integrating Python/Django with SimpleDB? Is django-nonrel worth looking at?
Are there any other issues I'll likely encounter with SimpleDB?
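On the "queries will be more complex" point: with boto (the usual Python library for AWS), SimpleDB queries are issued as SELECT-style expression strings. A minimal sketch of building such an expression follows; the `widgets` domain and attribute names are made up for illustration, and the quoting rule assumed here is SimpleDB's doubled single quote:

```python
def build_select(domain, conditions, limit=None):
    """Build a SimpleDB SELECT expression from a dict of attribute filters.

    SimpleDB quotes values with single quotes; embedded quotes are
    escaped by doubling them.
    """
    where = " and ".join(
        "`%s` = '%s'" % (attr, str(value).replace("'", "''"))
        for attr, value in sorted(conditions.items())
    )
    query = "select * from `%s`" % domain
    if where:
        query += " where " + where
    if limit is not None:
        query += " limit %d" % limit
    return query

# With boto, this string would be passed to domain.select(query);
# here we only build it.
print(build_select("widgets", {"color": "red", "site": "example.com"}, limit=10))
```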

Related

Sharing an ORM between languages

I am making a database with data in it. That database has two customers: 1) a .NET web server that makes the data visible to users somehow, someway; 2) a Python data miner that creates the data and populates the tables.
I have several options. I can use the .NET Entity Framework to create the database and then reverse-engineer it on the Python side, or I can do the reverse. I can also just write raw SQL statements in one or the other system, or both. What are the possible pitfalls of doing it one way or the other? I'm worried, for example, that if I use the Python ORM to create the tables, then I'm going to have a hard time in the .NET space...
I love questions like that.
Here is what you have to consider: your web site has to be fast, and the bottleneck of most web sites is the database. The answer to your question would be: make it easy for .NET to work with SQL. That will require a little more work on the Python side, like specifying the names of tables and perhaps columns. I think Django and SQLAlchemy are both good for that.
Another solution could be to have a bridge between the database holding the gathered data and the database used to display it. In the background you can have a task/job that migrates collected data to your main database. That is also an option and will make your job easier; at least all the database-specific and strange code will go into that third component.
I worked with .NET for quite a long time before I switched to Python, and what you should know is that whichever strategy you choose, it will be possible to work with the data in both languages and ORMs. Do the hardest part of the job in the language you know better. If you are a Python developer, pick Python to sort out the right names for tables and columns.
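The bridge idea can be sketched in a few lines. Here sqlite stands in for both databases, and the table and column names (`scraped_widgets`, `widgets`, a `migrated` flag) are hypothetical:

```python
import sqlite3

def migrate(staging_path, main_path):
    """Copy newly gathered rows from the staging database into the
    main (display) database, marking them as migrated."""
    staging = sqlite3.connect(staging_path)
    main = sqlite3.connect(main_path)
    rows = staging.execute(
        "SELECT id, name, price FROM scraped_widgets WHERE migrated = 0"
    ).fetchall()
    main.executemany(
        "INSERT OR REPLACE INTO widgets (id, name, price) VALUES (?, ?, ?)",
        rows,
    )
    staging.executemany(
        "UPDATE scraped_widgets SET migrated = 1 WHERE id = ?",
        [(r[0],) for r in rows],
    )
    main.commit()
    staging.commit()
    staging.close()
    main.close()
    return len(rows)
```

Run on a schedule, this keeps the strange scraper-side schema entirely out of the database the .NET side reads.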

Building a DSL query language

I'm working on a project (written in Django) which has only a few entities, but many rows for each entity.
In my application I have several static "reports", written directly in plain SQL. The users can also search the database via a generic filter form. Since the target audience is really tech-savvy, and at some point the filter won't fit their needs, I'm thinking about creating a query language for my database, like YQL or Jira's advanced search.
I found http://sourceforge.net/projects/littletable/ and http://www.quicksort.co.uk/DeeDoc.html, but it seems that they only operate on in-memory objects. Since the database can be too large to hold in memory, I would prefer that the query be translated into SQL (or, better, a Django query) before doing the actual work.
Are there any libraries or best practices for doing this?
Writing such a DSL is actually surprisingly easy with PLY, and better yet, there's already an example of doing just what you want, in Django. Django has a handy construct called a Q object which makes the Django querying side of things fairly easy.
At DjangoCon EU 2012, Matthieu Amiguet gave a session entitled Implementing Domain-specific Languages in Django Applications in which he went through the process, right down to implementing such a DSL as you desire. His slides, which include all you need, are available on his website. The final code (linked to from the last slide, anyway) is available at http://www.matthieuamiguet.ch/media/misc/djangocon2012/resources/compiler.html.
Reinout van Rees also produced some good comments on that session. (He normally does!) These cover a little of the missing context.
The examples given there are something very similar to YQL and JQL:
groups__name="XXX" AND NOT groups__name="YYY"
(modified > 1/4/2011 OR NOT state__name="OK") AND groups__name="XXX"
It can also be tweaked very easily; for example, you might want to use groups.name rather than groups__name (I would). This modification could be made fairly trivially (allow . in the FIELD token, by modifying t_FIELD, and then replacing . with __ before constructing the Q object in p_expression_ID).
So, that satisfies simple querying; it also gives you a good starting point should you wish to make a more complex DSL.
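To give a feel for what's involved, here is a hand-rolled sketch (not the PLY grammar from the talk, and it only supports `=` comparisons): tokenize the input, parse it with ordinary recursive descent, and translate `.` to `__` so each leaf could later become a Django Q object:

```python
import re

# One token per match: parens, keywords, or a field="value" comparison.
TOKEN = re.compile(r'\s*(\(|\)|AND\b|OR\b|NOT\b|[\w.]+\s*=\s*"[^"]*")')

def tokenize(text):
    tokens, pos = [], 0
    while pos < len(text):
        m = TOKEN.match(text, pos)
        if not m:
            raise ValueError("bad input at %r" % text[pos:])
        tokens.append(m.group(1).strip())
        pos = m.end()
    return tokens

class Parser:
    def __init__(self, tokens):
        self.tokens, self.i = tokens, 0

    def peek(self):
        return self.tokens[self.i] if self.i < len(self.tokens) else None

    def eat(self):
        tok = self.peek()
        self.i += 1
        return tok

    def expr(self):
        node = self.term()
        while self.peek() in ("AND", "OR"):
            node = (self.eat(), node, self.term())
        return node

    def term(self):
        if self.peek() == "NOT":
            self.eat()
            return ("NOT", self.term())
        if self.peek() == "(":
            self.eat()
            node = self.expr()
            if self.eat() != ")":
                raise ValueError("unbalanced parentheses")
            return node
        field, value = self.eat().split("=", 1)
        # groups.name becomes groups__name; with Django available, each
        # FILTER leaf would become Q(**{field: value}).
        return ("FILTER", field.strip().replace(".", "__"),
                value.strip().strip('"'))

def parse(text):
    return Parser(tokenize(text)).expr()

print(parse('groups.name="XXX" AND NOT groups.name="YYY"'))
```

Walking the resulting tree and combining Q objects with `&`, `|` and `~` is then straightforward.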
I've faced exactly this problem - a large database which needs searching. I made some static reports and several fancy filters using django (very easy with django) just like you have.
However the power users were clamouring for more. I decided that there already was a DSL that they all knew - SQL. The question was how to make it secure enough.
So I used django permissions to give the power users permission to make SQL queries in a new table. I then made a view for the not-quite-so-power users to use these queries. I made them take optional parameters. The queries were run using Python's lower level DB-API which django is using under the hood for its ORM anyway.
The real trick was opening a read only database connection to run these queries just to make sure that no updates were ever run. I made a read only connection by creating a different user in the database with lower permissions and opening a specific connection for that in the view.
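The same effect can be sketched with sqlite, whose URI mode gives a read-only connection directly; with a server database you would instead connect as a user granted only SELECT, as described above:

```python
import os
import sqlite3
import tempfile

path = os.path.join(tempfile.mkdtemp(), "reports.db")

# Normal read-write connection, used by the application itself.
rw = sqlite3.connect(path)
rw.execute("CREATE TABLE widgets (name TEXT)")
rw.execute("INSERT INTO widgets VALUES ('sprocket')")
rw.commit()
rw.close()

# Read-only connection for running user-supplied queries.
ro = sqlite3.connect("file:%s?mode=ro" % path, uri=True)
print(ro.execute("SELECT name FROM widgets").fetchone()[0])  # sprocket

try:
    ro.execute("INSERT INTO widgets VALUES ('gadget')")
except sqlite3.OperationalError as exc:
    print("write rejected:", exc)
```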
TL;DR - SQL is the way to go!
Depending on the form of your data, the types of queries your users need to use, and the frequency that your data is updated, an alternative to the pure SQL solution suggested by Nick Craig-Wood is to index your data in Solr and then run queries against it.
Solr is an added layer of complexity (configuration, data synchronization) but it is super-fast, can handle large datasets, and provides a (relatively) intuitive query language.
You could actually write your own SQL-ish language using pyparsing. There is even a fairly thorough example you could extend.

Full text searching and Python

Can someone help me out with suggestions for a full-text search engine that supports Python?
Right now we have a MySQL database in place and I'd like to add the ability to have a full-text search engine index some of the text in some of the tables in this database. This text data would be used by a web application to search for the corresponding records in the database. For instance, index the customer name information in our customer table, full text search that with the web application to get the MySQL record for the customer.
I've looked (briefly) at Lucene, Swish-E, MongoDB, and a few others, but I'm not sure what would be a good choice for me considering a few things:
I'm not a Java guy (though I've been programming for a long time),
we only want to search a relatively small set of data,
we're looking to index text in a MySQL database,
and would like that index to be updated in semi-realtime.
Any hints, tips or pointers would be greatly appreciated!
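As a point of reference for what any of these engines provides, here is the customer-name scenario with a full-text index. sqlite's FTS5 module is used purely as a stand-in (it is not MySQL, and the table layout is invented): the index tokenizes the text, so a search matches whole words case-insensitively rather than doing a LIKE scan.

```python
import sqlite3

db = sqlite3.connect(":memory:")
# FTS5 virtual table standing in for a full-text index over the
# customer name column.
db.execute("CREATE VIRTUAL TABLE customer_idx USING fts5(customer_id, name)")
db.executemany(
    "INSERT INTO customer_idx VALUES (?, ?)",
    [("1", "Alice Johnson"), ("2", "Bob Johansson"), ("3", "Carol Smith")],
)
# Matches "Alice Johnson" but not "Bob Johansson": tokens, not substrings.
for row in db.execute(
    "SELECT customer_id, name FROM customer_idx WHERE customer_idx MATCH ?",
    ("johnson",),
):
    print(row)
```

In the MySQL setup described in the question, the `customer_id` column is what you would use to fetch the full record from the source table.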
Have a look at Whoosh. I've heard it doesn't scale up terribly well (maybe that's fixed now) but for small collections, it might be useful.
For a scalable solution, consider using Lucene with PyLucene or Jython.
Building PyLucene a few months ago was one of the most painful experiences I've had. The project won't get much traction, IMHO, if it's that hard to build.
With a few other folks having the same itch to scratch, we started https://code.google.com/a/apache-extras.org/p/pylucene-extra/ to gather prebuilt PyLucene and JCC eggs for various combinations of operating system, Python version and Java runtime. It has not been very active lately, though.
Whoosh might be a good fit, or you may want to have a look at Sphinx, Elasticsearch or Haystack (caveat: I have not worked with any of these).
Or maybe try accessing Solr from Python (there are a few APIs), which might be much easier than using PyLucene. Bear in mind that Lucene will still need a JVM to run, of course.
Since you don't have huge scalability needs, I would focus on simple usage and community support rather than performance and scale. Hope it helps.
Solr is a great wrapper around Lucene that greatly simplifies things. It doesn't require any Java tinkering for most tasks; you just need to configure some XML files. It does run as a separate process, though, so it may complicate your deployment.
I have had great results with pysolr, but really, you could write your own Python communication library, since Solr speaks REST; it is really simple to send and retrieve data as either XML or JSON.
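A sketch of such a minimal client, using only the standard library; the host, core name, and field in the example are assumptions, and `wt=json` asks Solr to return JSON:

```python
import json
import urllib.parse
import urllib.request

def solr_select_url(base, query, rows=10):
    """Build the URL for a Solr /select request returning JSON.

    `base` is the Solr core URL, e.g.
    "http://localhost:8983/solr/customers" (hypothetical host and core).
    """
    params = urllib.parse.urlencode({"q": query, "rows": rows, "wt": "json"})
    return "%s/select?%s" % (base, params)

def solr_search(base, query, rows=10):
    """Run the search and return the matching documents."""
    with urllib.request.urlopen(solr_select_url(base, query, rows)) as resp:
        return json.load(resp)["response"]["docs"]

print(solr_select_url("http://localhost:8983/solr/customers", "name:smith"))
```

Indexing is the same story in reverse: POST documents to the core's `/update` handler.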

Can Rails & Django (now) query more than one database at a time?

EDIT: Since you are asking for specifics, consider a photo-sharing site (like Flickr or Picasa; I know that one uses PHP while the other uses Python), for instance. If it proves to be successful, it needs to scale enormously. I hope this is specific enough.
It's been some time since I heard any discussion of this, and since I am in the process of deciding between Ruby and Python for a web project, here come the questions:
[1] Can current versions of Rails (Ruby) and Django (Python) query more than one database at a time?
[2] I also read on SO that "If your focus is building web-sites or web-applications go Ruby" (because it has fully featured, web-focused Rails). But that was about 2 years ago. What's the state of Python web framework Django today? Is it head-to-head with Rails now?
EDIT: [3] I don't know if I can ask this here, but it's really surprising how quickly the Stack Exchange sites load. Do SE sites still use the same technology mentioned here? If not, does anyone have an update?
There's nothing in either of the languages that would prevent you from connecting to more than one database at a time. The real question is why would you want to?
The reason the StackOverflow sites are so fast is not so much the choice of technology as the way it's applied. Database optimization techniques are largely independent of the platform involved; they rest on common-sense principles and proven methods of scaling.
Ruby on Rails offers a number of methods for connecting to multiple databases, though by that you might mean connecting to a system that's divided into shards, into multi-tenant partitions, or one where different kinds of data are stored in different databases. All of these approaches are supported, but they are quite different in implementation.
You should post a new question with an outline of the problem you're trying to solve if you want a specific answer.
Multi-database support exists in Django. In our Django project we have models pulling data from Postgres, MySQL, Oracle, and MS SQL Server (there are various issues depending on the database, but it all generally works). From what I've read, RoR supports multiple databases as well. Each framework comes with its own set of strengths and weaknesses which you have to evaluate against your particular needs and requirements. I don't think anyone can give you a (valid/useful) general answer without knowing the specifics of your situation.
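For reference, Django's multi-database support boils down to naming each connection in settings and selecting one per query. A sketch (the ENGINE paths are real Django backend names; the aliases and database names are illustrative):

```python
# settings.py: one entry per connection; "default" is used when no
# database is selected explicitly.
DATABASES = {
    "default": {
        "ENGINE": "django.db.backends.postgresql_psycopg2",
        "NAME": "main",
    },
    "legacy": {
        "ENGINE": "django.db.backends.mysql",
        "NAME": "legacy_data",
    },
}

# Elsewhere in the project, pick a connection per query with using():
#
#     Widget.objects.using("legacy").filter(name="sprocket")
#
# or route models to databases automatically via DATABASE_ROUTERS.
```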

Is there any python web app framework that provides database abstraction layer for SQL and NoSQL?

Is it even possible to create an abstraction layer that can accommodate both relational and non-relational databases? The purpose of this layer is to minimize repetition and allow a web application to use any kind of database by changing or modifying code in just one place (i.e., the abstraction layer). The part that sits on top of the abstraction layer must not need to worry about whether the underlying database is relational (SQL), non-relational (NoSQL), or whatever new kind of database may come out in the future.
There's a Summer of Code project going on right now to add non-relational support to Django's ORM. It seems to be going well and chances are good that it will be merged into core in time for Django 1.3.
You could use stock Django and Django-nonrel ( http://www.allbuttonspressed.com/projects/django-nonrel ) together to get a fairly unified experience. Some limits apply, though; read the docs carefully, remembering Spolsky's "all abstractions are leaky".
You may also want to check web2py; it supports relational databases and GAE in its core.
Regarding App Engine, all existing attempts limit you in some way (web2py doesn't support transactions or namespaces and probably many other stuff, for example). If you plan to work with GAE, use what GAE provides and forget looking for a SQL-NoSQL holy grail. Existing solutions are inevitably limited and affect performance negatively.
Thank you for all the answers. To summarize them: currently only web2py and Django support this kind of abstraction.
It is not about a SQL-NoSQL holy grail; using an abstraction can make apps more flexible. Let's assume you started a project using NoSQL and later on need to switch over to SQL. It is desirable that you only make changes to the code in a few spots instead of all over the place. In some cases it does not really matter whether you store the data in a relational or non-relational db: for example, user profiles, text content for a dynamic page, or blog entries.
I know there must be a trade-off in using an abstraction, but my question is more about existing solutions and technical insight than about the consequences.
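The "change code in one place" goal can also be met without framework support, by putting storage behind a small hand-written interface. A sketch (the profile-store API is invented, and a plain dict stands in for a document store):

```python
import sqlite3

class SqlProfileStore:
    """Relational backend for user profiles."""

    def __init__(self, path=":memory:"):
        self.db = sqlite3.connect(path)
        self.db.execute(
            "CREATE TABLE IF NOT EXISTS profiles (user TEXT PRIMARY KEY, bio TEXT)"
        )

    def save(self, user, bio):
        self.db.execute(
            "INSERT OR REPLACE INTO profiles VALUES (?, ?)", (user, bio)
        )
        self.db.commit()

    def get(self, user):
        row = self.db.execute(
            "SELECT bio FROM profiles WHERE user = ?", (user,)
        ).fetchone()
        return row[0] if row else None

class DictProfileStore:
    """Stand-in for a document/key-value (NoSQL) backend."""

    def __init__(self):
        self.data = {}

    def save(self, user, bio):
        self.data[user] = bio

    def get(self, user):
        return self.data.get(user)

# Application code only sees the shared save/get interface, so swapping
# backends means changing the one line where the store is constructed.
for store in (SqlProfileStore(), DictProfileStore()):
    store.save("alice", "hello")
    print(store.get("alice"))  # hello
```

The trade-off mentioned above shows up here too: the shared interface can only expose operations both backends support.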
