Ok, so this isn't exactly a question that I expect a full answer for but here goes...
I am currently using a Python driver to fire data at a Mongo instance and all is well in the world. Now I want to be able to pull data from Mongo and evaluate each record in the collection. I need to pass into this evaluation a script that will look at the row of data and return true if a condition is met, i.e.
(pseudo code)
foreach (row in resultSet)
    if (row.Name == "Chris") return true
return false
Now the script that I use to evaluate each item in the row should be sandboxed somehow, with limited functionality/security privileges.
In other words, the code will be evaled, and I don't want it to have the rights to, e.g., include external libraries, call remote servers or access any files on the server etc...
With this in mind, I know that Mongo uses something called SpiderMonkey (which I gather is a JS evaluator) to write queries. Would it be possible to take the result of a Mongo call and pass it to an evaluated JavaScript function using SpiderMonkey (somehow) to achieve what I am after? If so, would this be safe enough?
To be honest, I'm writing this question and I realize it's sounding a lot like one of those "please help, how to code the world" type questions, but any pointers would be helpful.
Have you looked at $where clauses in MongoDB? Seems like those would pretty much give you exactly what you're looking for. In PyMongo it would look something like:
db.foo.find().where("some javascript function that will get applied to each document matched by the find")
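For instance, a minimal sketch (the collection and field names are illustrative):

import pymongo

client = pymongo.MongoClient()
db = client.test  # hypothetical database name

# The JavaScript predicate is evaluated server-side against each
# document matched by find(); only documents where it returns true
# come back to the client.
for doc in db.foo.find().where('this.Name == "Chris"'):
    print(doc)

Since the predicate runs inside the server's JavaScript engine, it has no file or network APIs available, which is roughly the sandboxing the question describes (the trade-off being that it runs against every matched document, so it can be slow).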
Related
I am building a little interface where I would like users to be able to write out their entire SQL statement and then see the data that is returned. However, I don't want a user to be able to do anything funny, e.g. delete from user_table;. Actually, the only thing I would like users to be able to do is to run select statements. I know there aren't specific users for SQLite, so I am thinking what I am going to have to do is have a set of rules that reject certain queries. Maybe a regex string or something (regex scares me a little bit). Any ideas on how to accomplish this?
def input_is_safe(input):
    input = input.lower()
    if "select" not in input:
        return False
    # more stuff
    return True
I can suggest a different approach to your problem: restrict access to your database by opening it read-only. That way, even when users try to execute delete/update queries, they will not be able to damage your data.
Here is how to open a read-only connection in Python:

import sqlite3

db = sqlite3.connect('file:/path/to/database?mode=ro', uri=True)
Python's sqlite3 execute() method will only execute a single SQL statement, so if you ensure that all statements start with the SELECT keyword, you are reasonably protected from dumb stuff like SELECT 1; DROP TABLE USERS. But you should check sqlite's SQL syntax to ensure there is no way to embed a data definition or data modification statement as a subquery.
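A quick sketch of that combination (a check, not a complete defense):

import sqlite3

conn = sqlite3.connect(':memory:')

def run_select(sql):
    # Reject anything that doesn't start with the SELECT keyword.
    if not sql.lstrip().lower().startswith('select'):
        raise ValueError('only SELECT statements are allowed')
    # execute() runs a single statement only, so the second statement in
    # "SELECT 1; DROP TABLE users" raises an exception instead of running.
    return conn.execute(sql).fetchall()

print(run_select('SELECT 1'))  # [(1,)]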
My personal opinion is that if "regex scares you a little bit", you might as well just put your computer in a box and mail it off to <stereotypical country of hackers>. Letting untrusted users write SQL code is playing with fire, and you need to know what you're doing or you'll get fried.
Open the database as read only, to prevent any changes.
Many statements, such as PRAGMA or ATTACH, can be dangerous. Use an authorizer callback (C docs) to allow only SELECTs.
Queries can run for a long time, or generate a large amount of data. Use a progress handler to abort queries that run for too long.
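A minimal sketch combining all three ideas (the path, deadline, and instruction count are illustrative):

import sqlite3
import time

# Open the database read-only so nothing can be modified.
conn = sqlite3.connect('file:/path/to/database?mode=ro', uri=True)

def authorizer(action, arg1, arg2, db_name, trigger):
    # Permit reading columns and running SELECTs; deny everything else
    # (PRAGMA, ATTACH, DDL, ...). Real queries may need a few more
    # actions allowed, e.g. SQL function calls.
    if action in (sqlite3.SQLITE_SELECT, sqlite3.SQLITE_READ):
        return sqlite3.SQLITE_OK
    return sqlite3.SQLITE_DENY

conn.set_authorizer(authorizer)

def run_with_timeout(sql, seconds=2.0):
    deadline = time.monotonic() + seconds
    # The handler runs every N virtual-machine instructions; returning
    # a truthy value aborts the query with an OperationalError.
    conn.set_progress_handler(lambda: time.monotonic() > deadline, 10000)
    try:
        return conn.execute(sql).fetchall()
    finally:
        conn.set_progress_handler(None, 0)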
I am new to Python and Pyramid and I am trying to figure out a way to print out some object values that I am using in a view callable, to get a better idea of how things are working. More specifically, I want to see what is coming out of a SQLAlchemy query:
DBSession.query(User).filter(User.name.like('%'+request.matchdict['search']+'%'))
I need to take that query and then look up what Office a user belongs to by the office_id attribute that is part of the User object. I was thinking of looping through the users that come up from that query and doing another query to look up the office information (in the offices table). I need to build a dictionary that includes some User information and some Office information then return it to the browser as json.
Is there a way that I can experiment with different attempts at this while viewing my output, without having to rely on the browser? I am more of a front-end developer, so when I am writing JavaScript I just view my outputs using console.log(output).
console.log(output) is to JavaScript
as
????? is to Python (specifically pyramid view callable)
Hope the question is not dumb. Just trying to learn. Appreciate anyone's help.
This is a good reason to experiment with pshell, Pyramid's interactive python interpreter. From within pshell you can tinker with things on the command-line and see what they will do before adding them to your application.
http://docs.pylonsproject.org/projects/pyramid/en/1.4-branch/narr/commandline.html#the-interactive-shell
Of course, you can always use "print" and things will show up in the console. SQLAlchemy also has the sqlalchemy.echo ini option that you can turn on to see all queries. And finally, it sounds like you just need to do a join but maybe aren't familiar with how to write complex database queries, so I'd suggest you look into that before resorting to writing separate queries. Likely a single query can return what you need.
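For example, a rough sketch of the join approach, to go inside the view callable (it assumes User.office_id is a foreign key to Office.id; the attribute names are illustrative):

rows = (
    DBSession.query(User, Office)
    .join(Office, User.office_id == Office.id)
    .filter(User.name.like('%' + request.matchdict['search'] + '%'))
    .all()
)

# Build the JSON-friendly structure from both objects at once.
results = [{'user': user.name, 'office': office.name}
           for user, office in rows]
return results  # with a 'json' renderer, Pyramid serializes this for you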
I have written a short Python script which takes a text and does a few things with it. For example, it has a function which counts the words in the text and returns the number.
How can I run this script within django?
I want to take that text from the view (textfield or something) and return a result back to the view.
I want to use Django only to give the script a web interface. And it is only for me, maybe for a few people, not for a big audience. No deployment.
Edit: When I first thought the solution would be "Django", I asked for it explicitly. That was of course a mistake because of my ignorance of WSGI. Unfortunately nobody advised me of this mistake.
First off, is your heart really set on it being Django? If not I'd advise that Django, whilst an awesome framework, is a bit much for your needs. You don't really need full stack.
You might want to look at Flask instead, which is a Python micro-framework (and dead easy to use)
However, since you asked about Django...
You can create a custom Django command (docs here) that calls your script,
that can be called from a view as described in this question.
This has the added benefit of allowing you to run your script via the Django management.py script too. Which means you can keep any future scripts related to this project nice and uniform.
For getting the results of your script running, you can get them from the same bit of code that calls the command (the part described in the last link), or you can write large result sets to a file and process that file. Which you choose would really depend on the size of your result set and if you want to do anything else with it afterwards.
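A minimal sketch of such a command (the paths and names are illustrative, and it assumes your word-counting function lives in myapp/textutils.py):

# myapp/management/commands/countwords.py
from django.core.management.base import BaseCommand

from myapp.textutils import count_words  # hypothetical home of your script

class Command(BaseCommand):
    help = 'Count the words in the given text'

    def add_arguments(self, parser):
        parser.add_argument('text')

    def handle(self, *args, **options):
        self.stdout.write(str(count_words(options['text'])))

You can then run python manage.py countwords "some text" from the shell, or invoke it from a view with django.core.management.call_command('countwords', 'some text').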
What nobody told me here, since I asked about Django:
What I really needed was a simple solution called WSGI. In order to make your Python script accessible from the web browser you don't need Django, nor Flask. A much easier solution is something like Werkzeug or CherryPy.
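To illustrate, a bare-bones WSGI sketch using only the standard library (the count_words body stands in for your script):

from wsgiref.simple_server import make_server

def count_words(text):
    return len(text.split())

def app(environ, start_response):
    # Read the POSTed text from the request body and return the count.
    size = int(environ.get('CONTENT_LENGTH') or 0)
    text = environ['wsgi.input'].read(size).decode('utf-8')
    body = str(count_words(text)).encode('utf-8')
    start_response('200 OK', [('Content-Type', 'text/plain')])
    return [body]

if __name__ == '__main__':
    make_server('localhost', 8000, app).serve_forever()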
After following the Django tutorial, as suggested in a comment above, you'll want to create a view that has a text field and a submit button. On submission of the form, your view can run the script that you wrote (either imported from another file or copied and pasted; importing is probably preferable if it's complicated, but yours sounds like it's just a few lines), then return the number that you calculated. If you want to get really fancy, you could do this with some JavaScript and an AJAX request, but if you're just starting, do it with a simple form first.
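A bare-bones sketch of that view (the template and names are illustrative):

from django.http import HttpResponse
from django.shortcuts import render

def count_words(text):
    return len(text.split())

def word_count(request):
    if request.method == 'POST':
        text = request.POST.get('text', '')
        return HttpResponse('Word count: %d' % count_words(text))
    # word_count.html just needs a <form method="post"> (with
    # {% csrf_token %}), a <textarea name="text"> and a submit button.
    return render(request, 'word_count.html')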
I've started looking into unittest when using Google App Engine, and it seems a bit tricky from what I've read, since you can't (and aren't supposed to) run your tests against the datastore.
I've written an abstract class to emulate a datastore model class. It works pretty nicely, returning mock-up data on get, all, fetch and so on (only tried on a small scale), producing db.Model-like results.
The one thing I haven't found a satisfying solution for is how to differentiate which model class to use. I want to use the mock-ups for unit tests and the actual db.Model when the webapp is running.
My current solution looks like this in my .py containing all db.Models:
if 'SERVER_SOFTWARE' in os.environ:
    class dbTest(db.Model):
        content = db.StringProperty()
        comments = db.ListProperty(str)
else:
    class dbTest(Abstract):
        content = 'Test'
        comments = ['test1', 'test2']
And it kind of feels like it could break any minute. Is this the way to go, or could one combine these into one class that uses db.Model when it's available and falls back to the mock-up otherwise?
Check out gaetestbed (docs). It stubs out the datastore (and all the other services like memcache) and makes testing from the command line very easy. It ensures a clean environment before every test runs.
I personally think it is nicer than the other solutions I have seen.
Instead of messing with your models.py I would go with gaeunit.
I've used it with success in a couple of projects and the features I like are:
Just one file to add to your project (gaeunit.py) and you are almost done
Gaeunit isolates the test datastore from the development store (i.e. tests don't pollute your development db)
Since you can't (and aren't supposed to)
run your tests against the datastore.
This is not true. You can and should use the local datastore implementation as a test harness - there's no reason to waste your time creating mocks for every datastore behaviour. You can use a tool such as noseGAE or gaeunit, as suggested by other posters, but if you want to set it up yourself, see this snippet.
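For instance, a rough sketch using the SDK's testbed module (the model is illustrative):

import unittest
from google.appengine.ext import db, testbed

class DbTest(db.Model):
    content = db.StringProperty()

class DatastoreTestCase(unittest.TestCase):
    def setUp(self):
        # Activate a fresh in-memory datastore stub for each test.
        self.testbed = testbed.Testbed()
        self.testbed.activate()
        self.testbed.init_datastore_v3_stub()

    def tearDown(self):
        self.testbed.deactivate()

    def test_put_and_query(self):
        DbTest(content='Test').put()
        self.assertEqual(1, DbTest.all().count())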
There's more than one problem you're trying to address here...
First, for running tests with GAE emulation you can take a look at gaeunit, which I like best. If you don't want to run them from the browser then you can take a look at noseGAE (part of nose). This should give you command-line testing.
Second, regarding your comment about 'creating an overhead of dependencies', it sounds like you're searching for a good unit testing and mocking framework. These will let you mock out the database for tests which don't need to hit it. Try mox and mockito for Python.
I'm making an app that has a need for reverse searches. By this, I mean that users of the app will enter search parameters and save them; then, when any new objects get entered onto the system, if they match the existing search parameters that a user has saved, a notification will be sent, etc.
I am having a hard time finding solutions for this type of problem.
I am using Django and thinking of building the searches and pickling them using Q objects as outlined here: http://www.djangozen.com/blog/the-power-of-q
The way I see it, when a new object is entered into the database, I will have to load every single saved query from the db and somehow run it against this one new object to see if it would match that search query... This doesn't seem ideal - has anyone tackled such a problem before?
At the database level, many databases offer 'triggers'.
Another approach is to have timed jobs that periodically fetch all items from the database that have a last-modified date since the last run; then these get filtered and alerts issued. You can perhaps put some of the filtering into the query statement in the database. However, this is a bit trickier if notifications need to be sent if items get deleted.
You can also put triggers manually into the code that submits data to the database, which is perhaps more flexible and certainly doesn't rely on specific features of the database.
A nice way for the triggers and the alerts to communicate is through message queues - queues such as RabbitMQ and other AMQP implementations will scale with your site.
The amount of effort you use to solve this problem is directly related to the number of stored queries you are dealing with.
Over 20 years ago we handled stored queries by treating them as minidocs and indexing them based on all of the must-have and may-have terms. A new doc's term list was used as a sort of query against this "database of queries", and that built a list of possibly interesting searches to run; then only those searches were run against the new docs. This may sound convoluted, but when there are more than a few stored queries (say anywhere from 10,000 to 1,000,000 or more) and you have a complex query language that supports a hybrid of Boolean and similarity-based searching, it substantially reduced the number we had to execute as full-on queries -- often no more than 10 or 15 queries.
One thing that helped was that we were in control of the horizontal and the vertical of the whole thing. We used our query parser to build a parse tree and that was used to build the list of must/may have terms we indexed the query under. We warned the customer away from using certain types of wildcards in the stored queries because it could cause an explosion in the number of queries selected.
Update for comment:
Short answer: I don't know for sure.
Longer answer: We were dealing with a custom-built text search engine, and part of its query syntax allowed slicing the doc collection in certain ways very efficiently, with special emphasis on date_added. We played a lot of games because we were ingesting 4-10,000,000 new docs a day and running them against up to 1,000,000+ stored queries on DEC Alphas with 64MB of main memory. (This was in the late 80's/early 90's.)
I'm guessing that filtering on something equivalent to date_added could be used in combination with the date of the last time you ran your queries, or maybe the highest id at last query run time. If you need to re-run the queries against a modified record, you could use its id as part of the query.
For me to get any more specific, you're going to have to get a lot more specific about exactly what problem you are trying to solve and the scale of the solution you are trying to accomplish.
If you stored the type(s) of object(s) involved in each stored search as a generic relation, you could add a post-save signal to all involved objects. When the signal fires, it looks up only the searches that involve its object type and runs those. That probably will still run into scaling issues if you have a ton of writes to the db and a lot of saved searches, but it would be a straightforward Django approach.
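A rough sketch of that approach (SavedSearch, its load_q() helper, and notify() are hypothetical stand-ins for your pickled-Q storage and notification code):

from django.contrib.contenttypes.models import ContentType
from django.db.models.signals import post_save
from django.dispatch import receiver

@receiver(post_save)
def run_saved_searches(sender, instance, created, **kwargs):
    if not created:
        return
    content_type = ContentType.objects.get_for_model(sender)
    # Only consider searches stored against this model's type.
    for search in SavedSearch.objects.filter(content_type=content_type):
        q = search.load_q()  # unpickle the stored Q object
        # Apply the saved Q, but only against the newly created row.
        if sender.objects.filter(q, pk=instance.pk).exists():
            notify(search.user, instance)  # hypothetical notification hook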