How to customize the full-text search function provided by PostgreSQL - python

I am trying to customize the PostgreSQL full-text search functionality so that it tolerates misspellings.
For example, I can enter "Stouffers" and get a match on Stouffer's frozen foods, but if I leave off one of the "f"s and spell it "Stoufers" I don't get a match. That's one of the things the customized text search is supposed to handle: it converts all the text into a phonetic-type code and searches on that.
How can I achieve this?
I found some advice saying that I need to write a custom parser in C to do this, but my C is very weak.

Maybe you can try using the ISpell dictionary for PostgreSQL, or Tsearch2, which has a spelling-correction module.
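If you mainly want the phonetic-code behaviour and would rather avoid a custom C parser, another option worth naming is PostgreSQL's contrib extension fuzzystrmatch, which exposes soundex/metaphone functions you can call from Python. This is only a sketch: the connection string, the products table, and its brand column are made-up placeholders.

```python
# Sketch: phonetic matching with PostgreSQL's fuzzystrmatch extension.
# Assumes CREATE EXTENSION fuzzystrmatch; has been run.
# Table/column names and connection details are placeholders.
import psycopg2

conn = psycopg2.connect("dbname=mydb user=me")
cur = conn.cursor()

search_term = "Stoufers"  # misspelled input

# dmetaphone() maps "Stouffer's" and "Stoufers" to the same phonetic code,
# so the misspelling still matches.
cur.execute(
    """
    SELECT brand, name
    FROM products
    WHERE dmetaphone(brand) = dmetaphone(%s)
    """,
    (search_term,),
)
print(cur.fetchall())
```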

Or use a standalone search engine such as Solr or Xapian, which give you stemming, spelling correction, phonetic matching, etc.
Django-haystack gives you access to both of them.


Is there a way to scan a document and copy specific information into a word template?

Essentially I need to be able to scan an invoice, pull only the names, and then insert those names into a Word template for printing. Preferably the solution will open multiple Word documents at a time so the user only has to hit print. The main issue is that there needs to be minimal interaction from the user's perspective. I'm strong in Python and weak in Java, if that helps.
You can achieve that in Java using Tesseract OCR and Apache POI
https://github.com/tesseract-ocr/tesseract
https://poi.apache.org/
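Since the asker mentions being stronger in Python, the same pipeline can also be sketched there with pytesseract (a Python wrapper around Tesseract) and python-docx in place of Apache POI. The file names and the naive name-matching regex below are placeholders; a real invoice layout needs its own extraction logic.

```python
# Sketch: OCR an invoice and drop extracted names into Word documents.
# Requires Tesseract to be installed; file names and the regex are placeholders.
import re

import pytesseract
from PIL import Image
from docx import Document

# 1. OCR the scanned invoice.
text = pytesseract.image_to_string(Image.open("invoice.png"))

# 2. Pull out the names. "Firstname Lastname" is only an illustration.
names = re.findall(r"\b[A-Z][a-z]+ [A-Z][a-z]+\b", text)

# 3. Write each name into a copy of a Word template, ready to print.
for i, name in enumerate(names):
    doc = Document("template.docx")  # hypothetical template
    doc.add_paragraph(name)          # or fill a specific placeholder instead
    doc.save(f"output_{i}.docx")
```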

How to create a dynamic form with python using translated text as input?

I have an original text that I want to translate. I normally do it manually, but I know I could save a lot of time by automatically translating the most frequent words and expressions.
I will figure out how to translate simple words; that is not the problem. I have read some books on Python and I think this can be done with string manipulation.
But I am lost about how to create the output file.
The output file will contain:
short empty forms ready to be filled wherever there is text that has not been translated
the translated words wherever they were in the original file
In the output file I will fill in the empty forms manually; after pressing Tab, the cursor should jump to the next empty form.
I am lost here. I know how to make forms in HTML, but the language I am used to is Python.
I would like to know what Python modules I could use. I need some guidance on this.
Can you recommend a book or a tool that explains how to do something similar to this?
This is what I want to do, assuming I have managed to create a simple database to translate colors from Spanish to English.
The first step contains the original file.
The second step contains the automatic translation.
In the third step I complete the manual translation.
After finishing, everything is grouped into a normal txt file ready to be used.
I think it is quite clear. I don't expect people to tell me the code to do this, I just need to know what tools could be used to achieve my goal.
Thanks for editing.
To create an interface that works with a web browser, Flask is a good Python framework for building web forms. There are tutorials available.
One option for storing the data would be an SQLite file. That may be more than you need, so I'd recommend starting with a CSV file. Python's standard library covers both (csv and sqlite3).
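A minimal sketch of the Flask idea, assuming the untranslated words arrive as a plain Python list; the route, field names, and sample words are made up for illustration. Tabbing in the browser moves between the generated input fields.

```python
# Minimal Flask sketch: one text box per untranslated word.
from flask import Flask, request, render_template_string

app = Flask(__name__)

# Placeholder data: words the automatic pass could not translate.
untranslated = ["azul", "rojo", "verde"]

FORM = """
<form method="post">
  {% for word in words %}
    <label>{{ word }}: <input name="{{ word }}"></label><br>
  {% endfor %}
  <button type="submit">Save</button>
</form>
"""

@app.route("/", methods=["GET", "POST"])
def translate():
    if request.method == "POST":
        # Collect the manual translations typed into the form.
        translations = {w: request.form.get(w, "") for w in untranslated}
        return str(translations)  # in practice, write these to CSV/SQLite
    return render_template_string(FORM, words=untranslated)

if __name__ == "__main__":
    app.run(debug=True)
```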

Search Engine for a single DB column

I'm looking for a search engine that I can point to a column in my database that supports advanced functions like spelling correction and "close to" results.
Right now I'm just using
SELECT <column> FROM <table> WHERE <colname> LIKE '%<searchterm>%'
and I'm missing some results particularly when users misspell items.
I've written some code to fix misspellings by running them through a spellchecker, but thought there might be a better out-of-the-box option. Google turns up lots of options for indexing and searching an entire site, whereas I really just need to index and search this one table column.
Apache Solr is a great search engine that provides:
1. N-Gram indexing (search not only for complete strings but also for partial substrings, which helps greatly in getting similar results).
2. An out-of-the-box spell corrector based on edit distance, which gives you a "did you mean chicago" suggestion when the user types in "chicaog".
3. A fuzzy search option out of the box (fuzzy searches help you get close matches for your query; for example, a user typing GA-123 could also get VMDEO-123 as a result).
4. A "More Like This" component for finding similar records, complementing the options above.
Solr (based on the Lucene search library) is open source, is steadily becoming the de facto standard in the search industry, and is excellent for database searches (indexing a database column, as you describe, is a cakewalk for Solr). Lucene and Solr are used by many Fortune 500 companies as well as internet giants.
The Sphinx search engine is also great (I love it too, as it has a very low footprint and is written in C++), but to put it simply, Solr is much more popular.
Python support and APIs are available for both. However, Sphinx runs as its own server process, whereas Solr is accessed over HTTP. So for Solr you simply call the Solr URL from your Python program, get back results, and send them to your front end for rendering; it's as simple as that.
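To make "call the Solr URL from your Python program" concrete, here is a rough sketch using the requests library; the host, core name, and field name are placeholders, not something from the question.

```python
# Sketch: query Solr over HTTP from Python and hand the documents to the front end.
# Host, core name ("products"), and field name ("name") are placeholders.
import requests

params = {
    "q": "name:chicaog~",  # trailing ~ requests a fuzzy (close) match
    "wt": "json",          # ask Solr for JSON output
    "rows": 10,
}
resp = requests.get("http://localhost:8983/solr/products/select", params=params)
docs = resp.json()["response"]["docs"]

for doc in docs:
    print(doc.get("name"))
```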
So far so good. Coming to your question:
First you should ask yourself whether you really require a search engine. Search engines are good for all the use cases mentioned above, but they are really built for searching across huge amounts of full-text data or millions of rows of tabular data. Algorithms like "did you mean", similar records, and spell correction can be written on top of a plain database query. Before settling on Solr, also look up (1) Peter Norvig's spell corrector and (2) n-gram indexing. It is quite possible that a few lines of code will give you exactly what you are looking for, as in the sketch below.
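As a taste of that "few lines of code" route, the standard library's difflib already gives edit-distance-style close matches over a single column; the candidate list below is made up and would normally come from a SELECT DISTINCT over the column.

```python
# Sketch: "did you mean" over a single column without a search engine.
import difflib

# Placeholder data: in practice, load the distinct values of the column.
column_values = ["chicago", "boston", "new york", "san francisco"]

query = "chicaog"
suggestions = difflib.get_close_matches(query, column_values, n=3, cutoff=0.6)
print(suggestions)  # ['chicago']
```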
I leave it up to you to decide :)
I would suggest looking into open-source technologies like Sphinx Search.
Before going down the Solr/Sphinx route for full-text indexing, which adds complexity and its own overhead, you can try the built-in full-text engine in PostgreSQL if you are using that database. It's easy to set up and performs better than LIKE queries.
Check out https://github.com/hcarvalhoalves/django-tsearch2
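A rough illustration of the built-in PostgreSQL full-text search mentioned above, called from Python; the table and column names are placeholders.

```python
# Sketch: PostgreSQL full-text search from Python (placeholder table/column names).
# A GIN index on to_tsvector('english', body) makes this fast on larger tables.
import psycopg2

conn = psycopg2.connect("dbname=mydb user=me")
cur = conn.cursor()

cur.execute(
    """
    SELECT title
    FROM articles
    WHERE to_tsvector('english', body) @@ plainto_tsquery('english', %s)
    """,
    ("search term",),
)
print(cur.fetchall())
```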

Solr Search Spell check and Stemming configuration without using text file

I need some information on Solr search. Below is the problem statement:
Problem Statement
I need to implement spell-check functionality (like Google's "did you mean").
I also need stemming of search words, e.g. dose, dossier, dosing: if someone searches for "dose", results for "dossier" and "dosing" should also be returned.
Requirement
I need to implement both pieces of functionality without using any manually maintained text file such as spellcheck.txt for spell check or synonym.txt for stemming. I want this to be configured through the search engine itself, using some general English dictionary.
My Understanding
Solr does not provide any dictionary. Spell check can be implemented by providing a text file for it.
For stemming we also need to upload a txt file.
These files need to be referenced in Solr's schema.xml and maintained manually.
I need to confirm whether there is any other way to configure a general dictionary with Solr, or any other way to achieve these requirements through Solr configuration changes without using text files.
You can use the DirectSolrSpellChecker, which builds its suggestions from the index itself, so no dictionary files are needed.
You don't need text files for stemming either, just an analyzer with a stemming filter.
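For the spell-check half, here is a rough sketch of the query side once DirectSolrSpellChecker is configured, using Python's requests; the core name and the /spell handler path are assumptions that depend on your solrconfig.xml.

```python
# Sketch: ask Solr's spellcheck component for corrections.
# DirectSolrSpellChecker derives suggestions from the indexed terms,
# so no spellcheck.txt is involved. Core name and handler path are placeholders.
import requests

params = {
    "q": "dosier",              # misspelled input
    "spellcheck": "true",
    "spellcheck.collate": "true",
    "wt": "json",
}
resp = requests.get("http://localhost:8983/solr/docs/spell", params=params)
print(resp.json().get("spellcheck", {}).get("suggestions"))
```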

Design help for static content with fixed keywords search framework

I am trying to work out a solution for detecting traceability between source code and documentation. The most important use case is that the user needs to see a collection of source-code tokens (sorted by relevance to the documentation) that can be traced back to the documentation. She won't be bothered about the code format, but somehow needs to see an "identifier - documentation" mapping to get the idea of traceability.
I take the tokens from the source code files and somehow split the concatenated identifiers (SimpleMAXAnalyzer becomes "simple max analyzer"), which then act as search terms against the documentation. Search frameworks seem best suited for this specific task: drilling down into documents to locate things using powerful information-retrieval algorithms. Whoosh looked like a really great Python search library, with a number of analyzers and filters.
Though the problem is similar to search, it differs in that the user is not actually performing any search. So am I solving the problem the right way? Given that everything is static and needs to be computed only once, am I using the wrong tool (a search framework) for the job?
I'm not sure I understand your use case. The user sees the source code and has some way of jumping from a token to the appropriate part, or to a listing of the possible parts, of the documentation, right?
Then a search tool seems to be the right tool for the job, although you could precompute every possible search (there is only a limited number of identifiers in the source, so you can calculate all possible references to the docs in advance).
Or are there any "canonical" parts of the documentation for every identifier? Then maybe some kind of index would be a better choice.
Maybe you could clarify your use case a bit further.
Edit: Maybe an alphabetical index of the documentation could be a step to the solution. Then you can look up the pages/chapters/sections for every token of the source, where all or most of its components are mentioned.
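If you do go with a search library, the precomputation mentioned above is cheap to sketch: split each identifier, run it once as a query against a Whoosh index of the documentation, and store the ranked hits. The index directory, the stored "section" field, and the identifier list below are placeholders.

```python
# Sketch: precompute an identifier -> documentation-section mapping with Whoosh.
# Assumes the documentation was already indexed into "doc_index" with a stored
# "section" field and a "content" field; identifiers would come from the source.
import re

from whoosh.index import open_dir
from whoosh.qparser import QueryParser, OrGroup

def split_identifier(identifier):
    """SimpleMAXAnalyzer -> 'Simple MAX Analyzer', used as search terms."""
    parts = re.findall(r"[A-Z]+(?![a-z])|[A-Z][a-z]+|[a-z]+|\d+", identifier)
    return " ".join(parts)

identifiers = ["SimpleMAXAnalyzer", "TokenFilter"]  # placeholder token list

ix = open_dir("doc_index")
mapping = {}
with ix.searcher() as searcher:
    parser = QueryParser("content", ix.schema, group=OrGroup)
    for ident in identifiers:
        query = parser.parse(split_identifier(ident))
        hits = searcher.search(query, limit=5)
        mapping[ident] = [hit["section"] for hit in hits]

print(mapping)
```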
