Full text searching and Python

Can someone help me out with some suggestions for a full-text search engine that supports Python?
Right now we have a MySQL database in place and I'd like to add the ability to have a full-text search engine index some of the text in some of the tables in this database. This text data would be used by a web application to search for the corresponding records in the database. For instance, we'd index the customer name information in our customer table, then full-text search it from the web application to get the MySQL record for that customer.
I've looked (briefly) at Lucene, Swish-E, MongoDB and a few others, but I'm not sure what would be a good choice for me considering a couple of things:
I'm not a Java guy (though I've been programming for a long time),
we only want to search a relatively small set of data,
we're looking to index text in a MySQL database,
and would like that index to be updated in semi-realtime.
Any hints, tips or pointers would be greatly appreciated!

Have a look at Whoosh. I've heard it doesn't scale up terribly well (maybe that's fixed now) but for small collections, it might be useful.
For a scalable solution, consider using Lucene with PyLucene or Jython.

Building pylucene a few months ago was one of the most painful experiences I had. The project won't get any traction IMHO if it's so hard to build.
With a few other folks having the same itch to scratch, we started https://code.google.com/a/apache-extras.org/p/pylucene-extra/ to gather prebuilt pylucene and jcc eggs for several combos of operating system, Python version and Java runtime. It has not been very active lately, though.
Whoosh might be a good fit, or you may want to have a look at Sphinx, ElasticSearch or Haystack (caveat: I haven't worked with any of these).
Or maybe try to access Solr via Python (there are a few client APIs), which might be much easier than using pylucene. Keep in mind that Lucene will still need a JVM to run, of course.
Since you don't have huge scalability needs, I would focus on simple usage and community support rather than performance and scale. Hope it helps.

Solr is a great wrapper around Lucene and greatly simplifies things. It doesn't require any Java tinkering for most tasks; you just need to configure some XML files. It does run as a separate process, though, so this may complicate your deployment.
I have had great results with pysolr, but really, you could write your own Python communication library, since Solr speaks a REST-style HTTP API: it is really simple to send and retrieve data in either XML or JSON.
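For illustration, a minimal pysolr sketch might look like this (the core name and field names are made up, and it assumes a Solr instance on the default local port):

import pysolr

# Point at a local Solr core; "customers" is a hypothetical core name.
solr = pysolr.Solr("http://localhost:8983/solr/customers", timeout=10)

# Index a document; "id" and "name" are assumed to exist in the core's schema.
solr.add([{"id": "42", "name": "Acme Widgets Inc."}], commit=True)

# Search; the results object iterates over matching documents as dicts.
for doc in solr.search("name:acme"):
    print(doc["id"], doc["name"])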

Related

Searching with the pyramid framework

I'm trying to implement a search function for my website, which runs on Pyramid, and I was wondering what the most efficient way of approaching this problem is. I am currently looking into Whoosh and MySQL full-text search with SQLAlchemy. I need a fast and simple implementation, and I'm wondering which one would be the best choice.
I tried using full-text search with the native database for a while and it was just too much work to keep things working across sqlite, mysql, and pgsql. I ported all the search code over to Whoosh and have been really happy ever since. It performs well for small workloads, is pure Python, and there's no server to set up.
You just implement it almost like writing and updating a file on disk. From what I've read it does well into the single millions of documents. I'm using it with some 18k documents and an index size of around 100MB. There's a lot of flexibility to implement various tokenizers and other configuration with it. I really suggest people start there, and if they outgrow Whoosh, then look at starting up extra processes with ElasticSearch, Lucene/Solr, and the like.
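To give a flavour of that file-like style, here is a minimal Whoosh sketch (the schema and field names are invented for illustration):

import os
from whoosh import index
from whoosh.fields import ID, TEXT, Schema
from whoosh.qparser import QueryParser

# Store the database primary key alongside the indexed text.
schema = Schema(id=ID(stored=True, unique=True), content=TEXT)

# The index lives in a plain directory on disk.
if not os.path.exists("indexdir"):
    os.mkdir("indexdir")
ix = index.create_in("indexdir", schema)

# Indexing really is just writing documents, much like writing a file.
writer = ix.writer()
writer.add_document(id="1", content="full text searching with whoosh")
writer.commit()

# Searching: parse a query string and pull back the stored ids.
with ix.searcher() as searcher:
    query = QueryParser("content", ix.schema).parse("whoosh")
    for hit in searcher.search(query):
        print(hit["id"])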
You can see how I've got it implemented here:
https://github.com/mitechie/Bookie/blob/develop/bookie/models/fulltext.py
and I update it using SQLAlchemy event hooks:
https://github.com/mitechie/Bookie/blob/develop/bookie/models/__init__.py#L663
and you can judge a basic implementation of it by searching:
https://bmark.us/search
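The event-hook approach from the second link can be approximated like this (Bmark and update_fulltext_index are hypothetical stand-ins for the real Bookie code):

from sqlalchemy import Column, Integer, String, event
from sqlalchemy.orm import declarative_base

Base = declarative_base()

class Bmark(Base):
    # Hypothetical model standing in for Bookie's bookmark table.
    __tablename__ = "bmarks"
    bid = Column(Integer, primary_key=True)
    description = Column(String)

def update_fulltext_index(bmark):
    # Placeholder: here you would write bmark.description into the Whoosh index.
    pass

@event.listens_for(Bmark, "after_insert")
def index_new_bookmark(mapper, connection, target):
    # Re-index the record whenever a new row is persisted.
    update_fulltext_index(target)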
I'm a huge fan of ElasticSearch. It's the easiest to set up, maintain, and work with.
I generally use requests.
to index (POST lets Elasticsearch assign a document ID; a PUT would need an explicit ID in the URL):
requests.post("http://localhost:9200/myindex/category/", data=json.dumps(document))
to search:
requests.get("http://localhost:9200/myindex/category/_search?q=" + somequery)
You can get much more in-depth searching using the query DSL:
http://www.elasticsearch.org/guide/reference/query-dsl/
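As a taste of the DSL, the same kind of search can be sent as a JSON body (the index, type and field names here are the same toy placeholders as above):

import requests

# A simple match query; "title" is a hypothetical field.
query = {"query": {"match": {"title": "some search terms"}}}

# requests' json= parameter serializes the body and sets the content type.
response = requests.post("http://localhost:9200/myindex/category/_search", json=query)
print(response.json())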

Use case for Amazon SimpleDB

I'm starting to build out a project using MySQL and now starting to think that SimpleDB might be more appropriate. (My reason for potentially using SimpleDB over another NoSQL solution is that it's easy to use with EC2).
I have a series of spiders scraping information on widgets using the Python framework, Scrapy, and the Django ORM to put the results into a MySQL db. I'll be building out a website that makes use of this data. I'm thinking that SimpleDB might be more appropriate because:
Some of the sites have fields specific to them and so the schema may be subject to change when I come across these. SimpleDB obviously allows for a lot more flexibility here
I'm going to be collecting info on around 5m widgets a year. My sense is that MySQL can handle this but figuring out the indexes might be a hassle. SimpleDB will offer assured performance at scale
The cons I can see are that writing queries will be more complex, I'll need to pre-aggregate more, and I'm generally unfamiliar with NoSQL.
Questions:
Which option would you recommend?
How would you approach integrating Python/Django with SimpleDB? Is django-nonrel worth looking at?
Are there any other issues I'll likely encounter with SimpleDB?
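For context on the integration question, plain boto (2.x-style API) is usually the shortest path; a rough sketch, with made-up domain and attribute names:

import boto

# Credentials come from the environment or the boto config file.
conn = boto.connect_sdb()
domain = conn.create_domain("widgets")

# Items are schemaless attribute dicts, so per-site fields are easy to add.
item = domain.new_item("widget-123")
item["name"] = "Sprocket"
item["source_site"] = "example.com"
item.save()

# Queries use SimpleDB's SQL-ish select syntax.
for result in domain.select("select * from widgets where name = 'Sprocket'"):
    print(result.name, dict(result))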

Encrypting a Sqlite db file that will be bundled in a pyexe file

I have been working on an analytical tool to help interpret and analyze a database that is bundled within the package. It is very important for us to secure the database so that it can only be accessed with our software. What is the best way of achieving this in Python?
I am aware that there may not be a definitive solution, but deterrence is what really matters here.
Thank you very much.
Someone has gotten Python and SQLCipher working together by rebuilding SQLCipher as a DLL and replacing Python's sqlite3.dll here.
This question comes up on the SQLite users mailing list about once a month.
No matter how much encryption etc you do, if the database is on the client machine then the key to decrypt will also be on the machine at some point. An attacker will be able to get that key since it is their machine.
A better way of looking at this is in terms of money - how much would a bad guy need to spend in order to get the data. This will generally be a few hundred dollars at most. And all it takes is any one person to get the key and they can then publish the database for everyone.
So either go for a web service as mentioned by Donal or just spend a few minutes obfuscating the database. For example if you use APSW then you can write a VFS in a few lines that XORs the database content so regular SQLite will not open it, nor will a file viewer show the normal SQLite header. (There is example code in APSW showing how to do this.)
Consequently anyone who does get at the database content had to knowingly go out of their way to do so.
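Adapted from memory of the APSW example, such a VFS looks roughly like this (a sketch, not the canonical code; see the APSW docs for the real thing):

import apsw

def xor_bytes(data):
    # Trivial obfuscation only: XOR every byte with a fixed value.
    return bytes(b ^ 0xA5 for b in data)

class ObfuscatedVFSFile(apsw.VFSFile):
    def xRead(self, amount, offset):
        return xor_bytes(super().xRead(amount, offset))
    def xWrite(self, data, offset):
        super().xWrite(xor_bytes(data), offset)

class ObfuscatedVFS(apsw.VFS):
    def __init__(self, name="obfuscated", base=""):
        self.base = base
        super().__init__(name, base)
    def xOpen(self, name, flags):
        return ObfuscatedVFSFile(self.base, name, flags)

# Keep a reference so the VFS stays registered.
obfuscated = ObfuscatedVFS()
# Regular SQLite tools will no longer recognise this file.
connection = apsw.Connection("mydb", vfs="obfuscated")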

Django, Turbo Gears, Web2Py, which is better for what?

I've got a project in mind that makes it worth finally taking the plunge into programming.
After reading a lot of stuff, here and elsewhere, I'm set on making Python the language I learn for now, over C# or Java. What convinced me the most was actually Paul Graham's excursions on programming languages and Lisp, though Arc is in the experimental stage, which wouldn't help me do this web app right now.
As for building a web app fast, I've checked out Django, Turbo Gears and Web2Py. In spite of spending a lot of time reading, I still have no clue which one I should use.
1) Django certainly has the nicest online presence and a nicely done onsite tutorial; they sure know how to show off their thing.
2) Web2Py attracted me with its no-install-needed approach and the claim of making Django look complicated. But when you dig around on their website, you quickly find content that hasn't been updated in years, with broken external links... There are ghosts on that website that make someone not intimately familiar with the project worry that it might be flatlining.
3) Turbo Gears ...I guess it's modular too. People who wrote about it loved it... I couldn't find anything specific that might make it special compared to Django.
I haven't decided on an IDE yet, though I read all the answers to the Intellisense code completion post here. Showing extra code snippets would be cool too for noobs like me, but I suppose I should choose my web frame work first and then pick an editor that will work well with it.
Since probably no framework is hands down the best at everything, I will give some specifics on the app I want to build:
It will use MySQL, it needs register/sign-in, and there will be a load of simple math operations on data from input and SQL queries. I've completed a functional prototype in Excel, so I know exactly what I want to build, which I hope will help me overcome my noobness. It'll be a small app, nothing big.
And I don't want to see any HTML while building it ;-)
PS: thanks to the people running Stackoverflow, found this place just at the right moment too!
You should look at the web2py online documentation (http://web2py.com/book). It comes with Role Based Access Control (the most general access control mechanism) and it is very granular: you can grant access for specific operations on specific records. It comes with a web-based IDE, but you can use WingIDE, Eclipse or PyCharm too. It comes with a helper system that lets you generate HTML without writing HTML. Here is an example of a complete app that lets users register, log in and post messages:
db.define_table('message', Field('body'), Field('author', db.auth_user))

@auth.requires_login()
def index():
    db.message.author.default = auth.user.id
    db.message.author.writable = False
    return dict(form=crud.create(db.message),
                messages=db(db.message.id > 0).select())
The web2py project is very active as you can see from the list of changes http://code.google.com/p/web2py/source/list
If you have web2py related questions I strongly suggest you join the web2py mailing list:
http://groups.google.com/group/web2py/topics
We are very active and your questions will be answered very quickly.
I have to say, as a not particularly skilled developer, the speed at which I have been able to create using web2py has blown my mind. In large part that's due to the amazing community and Massimo's core value of making the framework accessible.
When I started I had written zero lines of Python and had never heard of web2py. I've been at it seriously for about a month and have progressed (in my usual fashion) from asking questions that no one could answer (because they didn't make any sense) to coding for hours at a time without picking up a book or asking a question.
I'm really impressed.
I've had positive experiences with Django.
Built-in authentication and easy-to-use extensions for registration
Very good documentation
You'll probably write your HTML templates mostly in base.html and then just use template inheritance (note: you'll need to write at least a little bit of HTML)
In contrast to Turbogears, Django is more 'out-of-the-box'
I don't have any experience with web2py, but from my impression, it tries to do a little too much 'out-of-the-box'
If you decide to go with Django, make sure that you use its Generic Views. They will save you from writing lots of code, both Python and HTML.
Also, unless there is a very specific reason for you to use MySQL, I advise you to switch to PostgreSQL. Django is much more oriented towards PostgreSQL and it's a much better database anyway.
The online Django documentation is great, this is what put it apart from all the other frameworks. I also recommend the book Practical Django Projects by James Bennett
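To illustrate the generic-views point, a read-only list page in current Django takes only a few lines (the model, template and URL names below are placeholders):

# views.py
from django.views.generic import ListView
from .models import Customer  # hypothetical model

class CustomerListView(ListView):
    model = Customer
    template_name = "customers/list.html"
    paginate_by = 25

# urls.py
from django.urls import path
from .views import CustomerListView

urlpatterns = [
    path("customers/", CustomerListView.as_view(), name="customer-list"),
]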
Django: Heard it has the best administrative interface, but it uses its own ORM, i.e. it doesn't use SQLAlchemy.
Web2py: Didn't research this.
Turbogears2: Uses SQLAlchemy by default and Catwalk for the admin interface, but the documentation isn't as great.
I chose Turbogears2 because it uses popular components, so I didn't have to learn anything new...
I've used both web2py and RoR extensively, and while RoR has gained a lot of popularity and support in the past few years, web2py is simpler, cleaner, less "magical", and yet also offers more (useful) out-of-the-box functionality. I'd say that web2py has more potential than RoR, but it is a relatively new framework and does not yet have the maturity of RoR. (Despite that, though, I'd choose web2py over RoR any day...)
If you "don't want to see any HTML while building it" then you can forget Django. It is not focused on "point-click-done," it is focused on pros going from concept to production in the shortest time possible. The hierarchical nature of the templating language can lead to some very clean overall site layouts. I use Django for all of my larger sites and I love it.
Although it's written in PHP, not Python, you might take a look at the major new version of WordPress that came out about 2 or 3 months ago. In 3.0 they have come a long way from being a "blogs only" environment and there are tons of ready-made templates for it. Of course, if you want to tweak a template, well, there's that nasty old HTML again. I am considering using it for my smaller clients that can't deal with the administration of a dedicated server and the like that tends to come with a Django site.
Update:
Ah, I missed the semi-joke -- I was up too early and that tends to make me tone deaf to humor. As far as using templates from existing sites, I have done this quite successfully with a couple of sites, both static ones and ones originally driven by well-written PHP scripts. I recommend a careful reading of the {% extends %} and {% include %} docs. Both take either a string literal or a variable. I have used the latter method and it can be quite useful for a site that has a strong hierarchy distinguished by style changes across branches.
It is also worth the time to understand the search order for templates -- it can be used to good effect, but it can be puzzling if you don't grok it. See the template-related items in the settings.py file for this and other useful goodies.
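As a tiny illustration of the literal-vs-variable point (file names are placeholders):

base.html -- the shared layout with an overridable block:
<html><body>{% block content %}{% endblock %}</body></html>

page.html -- extends by string literal (a variable such as branch_base_template, set in the view, works in the same position):
{% extends "base.html" %}
{% block content %}<p>Hello from a branch page</p>{% endblock %}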

Graph databases and RDF triplestores: storage of graph data in python

I need to develop a graph database in Python (I would enjoy it if anybody wanted to join me in the development; I already have a bit of code, but I would gladly discuss it).
I did my research on the internet. In Java, neo4j is a candidate, but I was not able to find anything about actual disk storage. In Python, there are many graph data models (see this pre-PEP proposal), but none of them satisfies my need to store and retrieve from disk.
I do know about triplestores, however. Triplestores are basically RDF databases, so a graph data model could be mapped to RDF and stored; one example is Sesame. But I am generally uneasy about this solution (mainly due to lack of experience). The fact is that, in any case, you have to convert between the in-memory graph representation and the RDF representation, unless the client code wants to hack on the RDF document directly, which is rather unlikely. It would be like handling DB tuples directly instead of creating an object.
What is the state of the art for storage and retrieval (a la DBMS) of graph data in Python at the moment? Would it make sense to start developing an implementation, hopefully with the help of someone interested in it, and in collaboration with the proposers of the Graph API PEP? Please note that this is going to be part of my job for the next months, so my contribution to this eventual project is pretty damn serious ;)
Edit: I also found directededge, but it appears to be a commercial product
I have used both Jena, which is a Java framework, and Allegrograph (Lisp, Java and Python bindings). Jena has sister projects for storing graph data and has been around a long, long time. Allegrograph is quite good and has a free edition; I would suggest it because it is easy to install, free and fast, and you could be up and going in no time. The power you would get from learning a little RDF and SPARQL may very well be worth your while. If you already know SQL then you are off to a great start. Being able to query your graph using SPARQL would yield some great benefits. Serializing to RDF triples is easy, and some of the file formats are super easy (NT for instance). I'll give an example. Let's say you have the following graph, as node-edge-node ids:
1 <- 2 -> 3
3 <- 4 -> 5
these are already in subject-predicate-object form, so just slap some URI notation on them, load them into the triplestore and query at will via SPARQL. Here they are in NT format:
<http://mycompany.com#1> <http://mycompany.com#2> <http://mycompany.com#3> .
<http://mycompany.com#3> <http://mycompany.com#4> <http://mycompany.com#5> .
Now query for all nodes two hops from node 1:
SELECT ?node
WHERE {
<http://mycompany.com#1> ?p1 ?o1 .
?o1 ?p2 ?node .
}
This would of course yield <http://mycompany.com#5>.
Another candidate would be Mulgara, written in pure Java. Since you seem more interested in Python though I think you should take a look at Allegrograph first.
I think the solution really depends on exactly what it is you want to do with the graph once you have managed to store it on disk/in database, and this is a little unclear in your question. However, a couple of things you might wish to consider are:
if you just want to persist the graph without using any of the features or properties you might expect from an rdbms solution (such as ACID), then how about just pickling the objects into a flat file? Very rudimentary, but like I say, it depends on exactly what you want to achieve (see the sketch after this list).
ZODB is an object database for Python (a spin-off from the Zope project, I think). I can't say I've had much experience with it in a high-performance environment, but, bar a few restrictions, it does allow you to store Python objects natively.
if you wish to pursue RDF, there is the RDFAlchemy project, which might help alleviate some of your concerns about converting from your graph to RDF structures, and which I think has Sesame as part of its stack.
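As a concrete version of the pickling option above, a minimal sketch (the graph here is just a dict of adjacency lists):

import pickle

# A toy graph as adjacency lists; any picklable structure works.
graph = {"a": ["b", "c"], "b": ["c"], "c": []}

# Persist to disk...
with open("graph.pickle", "wb") as f:
    pickle.dump(graph, f)

# ...and load it back.
with open("graph.pickle", "rb") as f:
    restored = pickle.load(f)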
There are some other persistence tools detailed on the Python site which may be of interest; however, I spent quite a while looking into this area last year, and ultimately I found there wasn't a native Python solution that met my requirements.
The most success I had was using MySQL with a custom ORM, and I posted a couple of relevant links in an answer to this question. Additionally, if you want to contribute to an RDBMS project: when I spoke to someone from Open Query about a graph storage engine for MySQL, they seemed interested in getting active participation in their project.
Sorry I can't give a more definitive answer, but I don't think there is one... If you do start developing your own implementation, I'd be interested to keep up-to-date with how you get on.
Greetings from your Sirius Cybernetics Intelligent Agent!
Some useful links...
Programming the Semantic Web
SEMANTIC PROGRAMMING
RDFLib Python Library for RDF
Hmm, maybe you should take a look at CubicWeb
Regarding Neo4j, did you notice the existing Python bindings? As for the disk storage, take a look at this thread on the mailing list.
For graphdbs in Python, the Hypergraph Database Management System project was recently started on SourceForge by Maurice Ling.
Redland (http://librdf.org) is probably the solution you're looking for. It has Python bindings too.
RDFLib is a Python library that you can use. Using harschware's example:
Create a test.nt file like below:
<http://mycompany.com#1> <http://mycompany.com#2> <http://mycompany.com#3> .
<http://mycompany.com#3> <http://mycompany.com#4> <http://mycompany.com#5> .
To query for all nodes two hops from node 1 in RDFLib:
from rdflib import Graph

g = Graph()
g.parse("test.nt", format="nt")

qres = g.query(
    """SELECT ?node
       WHERE {
         <http://mycompany.com#1> ?p1 ?o1 .
         ?o1 ?p2 ?node .
       }"""
)
for row in qres:
    print(row.node)
Should return the answer <http://mycompany.com#5>.
