I am currently working on a web application where, ideally, I would be able to support a search bar over the documents that are going to be stored for users. Each of these documents is going to be anything from a small snippet up to a decently-sized article (I don't imagine any document will be larger than a few KB of text for search purposes).
As I have been reading about the proper ways of using RethinkDB, one of the bits of information that has stuck out as worrying to me is the performance of doing something such as a filter on non-indexed data, where I have seen people mention multiple minutes spent in one of those calls. Considering that, over the long run, I expect there to be at least 10,000+ documents (and in the really long run, 100,000+, 1,000,000+, etc.), is there a way to search those documents with sub-second (preferably tens of milliseconds) response time using the standard RethinkDB API? Or am I going to have to come up with a separate scheme that allows for quick search through clever use of indexes? Or would I be better off using another database that provides that capability?
If you don't use an index, your query is going to have to look at every document in your table, so it will get slower as your table gets larger. 10,000 documents should be reasonable to search through on fast hardware, but you probably can't do it in 10s of milliseconds, and millions of documents will probably be slow to search through.
You may want to look into elasticsearch as a way to do this: http://www.rethinkdb.com/docs/elasticsearch/
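If you do want to stay within RethinkDB, one "clever use of indexes" is to precompute a list of lowercase keywords for each document when you write it and put a multi-index on that field. A rough sketch with the Python driver; the table name, field names, and connection details are all assumptions:

import rethinkdb as r  # legacy driver import; newer drivers use RethinkDB()

conn = r.connect("localhost", 28015, db="app")  # assumed connection details

# One-time setup: a multi-index over a precomputed "keywords" array field.
r.table("documents").index_create("keywords", multi=True).run(conn)
r.table("documents").index_wait("keywords").run(conn)

# Indexed lookup: jumps straight to matching documents, no table scan.
results = list(
    r.table("documents").get_all("rethinkdb", index="keywords").run(conn)
)

# Compare with the unindexed version, which has to scan every document:
# r.table("documents").filter(
#     lambda doc: doc["body"].match("(?i)rethinkdb")
# ).run(conn)

This only gives you exact keyword matching (no stemming, ranking, or phrase search), which is why Elasticsearch is usually the suggestion once search becomes a real feature.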
I work in Python and have to generate spreadsheets frequently to share my data with programming-naive colleagues. I routinely embed large blocks of text, explaining the contents of the spreadsheet and how it was generated, into the first page of these spreadsheets. I don't like relying on an associated document to explain definitions, criteria, algorithms, and reliability when I send my results out into the world.
It's really awkward to edit and store the long strings that make up these blocks of text. I'd love to store them in dedicated files that I can work with using a tool INTENDED to edit large blocks of text. I'm wondering how other people deal with this kind of situation. JSON files? YAML? Some obvious built-in functionality in Python I don't know about?
This is obviously a very open-ended question. I'm sure there are a lot of different approaches and solutions out there. It's a difficult thing to search for online, as there are a lot of obfuscating factors when you search for things like 'python large strings' or 'python text files'. I'm hoping to hear about a number of different approaches.
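To make the question a bit more concrete, the kind of setup I'm imagining is keeping each block in its own plain-text file next to the code and reading it in when the spreadsheet is generated; a rough sketch (directory and file names made up):

from pathlib import Path

# Hypothetical layout: one .txt file per explanatory block, version-controlled
# alongside the script that builds the spreadsheet.
DOCS_DIR = Path(__file__).parent / "spreadsheet_docs"

def load_block(name):
    """Read one block of explanatory text, e.g. load_block("methods")."""
    return (DOCS_DIR / (name + ".txt")).read_text(encoding="utf-8")

# Usage when writing the first sheet (spreadsheet library left unspecified):
# worksheet["A1"] = load_block("methods")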
I am working on a project for which I want to extract the timelines of around 500 different twitter users (I am using this for historical analysis, so I'll only need to retrieve them all once- no need to update with incoming tweets).
While I know the Twitter API only allows the last 3,200 tweets to be retrieved, when I use the basic UserTimeline method of the R twitteR package I only seem to fetch about 20 tweets each time I try (for users who have significantly more recent tweets than that). Is this because of rate limiting, or because I am doing something wrong?
Does anyone have tips for doing this most efficiently? I realize it might take a lot of time because of rate limiting; is there a way of automating/iterating this process in R?
I am quite stuck, so thank you very much for any help/tips you may have!
(I have some experience using the Twitter API/twitteR package to extract tweets using a certain hashtag over a couple of days. I have basic Python skills, if it turns out to be easier/quicker to do in Python).
It looks like the twitteR documentation suggests using the maxID argument for pagination. So when you get the first batch of results, you could use the minimum ID in that set minus one as the maxID for the next request, until you get no more results back (meaning you've gotten to the beginning of a user's timeline).
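If you end up doing this in Python instead (you mention that's an option), the same minimum-ID-minus-one loop might look roughly like this with tweepy; the credentials are placeholders and tweepy is just one common client choice:

import tweepy

# Placeholder credentials; fill in your own app's keys.
auth = tweepy.OAuthHandler("CONSUMER_KEY", "CONSUMER_SECRET")
auth.set_access_token("ACCESS_TOKEN", "ACCESS_SECRET")
api = tweepy.API(auth, wait_on_rate_limit=True)  # sleeps through rate limits

def fetch_timeline(screen_name, per_page=200):
    """Page backwards through a user's timeline until the API stops
    returning tweets (i.e. the ~3,200-tweet cap or the account's start)."""
    tweets = []
    max_id = None
    while True:
        kwargs = {"screen_name": screen_name, "count": per_page}
        if max_id is not None:
            kwargs["max_id"] = max_id
        batch = api.user_timeline(**kwargs)
        if not batch:
            break
        tweets.extend(batch)
        max_id = min(t.id for t in batch) - 1  # next page ends before this ID
    return tweets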
I'm writing a web crawler with the ultimate goal of creating a map of the path the crawler has taken. While I have no idea at what rate other, and most definitely better, crawlers pull down pages, mine clocks in at about 2,000 pages per minute.
The crawler works on a recursive backtracking algorithm which I have limited to a depth of 15.
Furthermore, in order to prevent my crawler from endlessly revisiting pages, it stores the URL of each page it has visited in a list, and checks that list for the next candidate URL.
for href in tempUrl:
    ...
    if href not in urls:
        collect(href, parent, depth + 1)
This method seems to become a problem by the time it has pulled down around 300,000 pages; at that point the crawler has slowed to an average of about 500 pages per minute.
So my question is: what is another way of achieving the same functionality while improving its efficiency?
I've thought that decreasing the size of each entry might help, so instead of appending the entire URL, I append the first two and the last two characters of each URL as a string. This, however, hasn't helped.
Is there a way I could do this with sets or something?
Thanks for the help
edit: As a side note, my program is not yet multithreaded. I figured I should resolve this bottleneck before I get into learning about threading.
Perhaps you could use a set instead of a list for the urls that you have seen so far.
Simply replace your 'list of crawled URLs' with a 'set of crawled URLs'. Sets are optimised for membership tests (using the same hashing algorithm that dictionaries use), so they're a heck of a lot faster. A lookup in a list is done with a linear search, so it's not particularly fast. You won't need to change the actual code that does the lookup.
Check this out.
In [3]: timeit.timeit("500 in t", "t = list(range(1000))")
Out[3]: 10.020853042602539
In [4]: timeit.timeit("500 in t", "t = set(range(1000))")
Out[4]: 0.1159818172454834
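A minimal sketch of what the change might look like in your crawler, reusing the names from your snippet (urls, collect); the page-fetching part is left as a comment:

urls = set()  # was: urls = []

def collect(url, parent, depth):
    if depth > 15 or url in urls:  # 'in' on a set is an O(1) hash lookup
        return
    urls.add(url)  # was: urls.append(url)
    # ... fetch the page here, then recurse over the links it contains:
    # for href in extract_links(page):   # extract_links is your own code
    #     collect(href, url, depth + 1)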
I had a similar problem and ended up profiling various methods (list/file/set/sqlite) for memory vs. time. See these two posts:
Searching for a string in a large text file - profiling various methods in python
sqlite database design with millions of 'url' strings - slow bulk import from csv
In the end, sqlite was the best choice. You can also store a hash of each URL to reduce the size.
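If you try the sqlite route, a minimal sketch of the idea; the file, table, and function names here are made up for illustration:

import hashlib
import sqlite3

conn = sqlite3.connect("seen_urls.db")
conn.execute("CREATE TABLE IF NOT EXISTS seen (h BLOB PRIMARY KEY)")

def mark_seen(url):
    """Record a URL; returns True if it was new, False if already seen.
    Storing a fixed-size SHA-1 digest keeps the table small."""
    digest = hashlib.sha1(url.encode("utf-8")).digest()
    cur = conn.execute("INSERT OR IGNORE INTO seen VALUES (?)", (digest,))
    conn.commit()
    return cur.rowcount == 1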
Use a dict with the urls as keys instead (O(1) access time).
But a set will also work. See
http://wiki.python.org/moin/TimeComplexity
We're rewriting a website used by one of our clients. The user traffic on it is very low, less than 100 unique visitors a week. It's basically just a nice interface to their data in our databases. It allows them to query and filter on different sets of data of theirs.
We're rewriting the site in Python, re-using the same Oracle database that the data currently lives in. The current version is written in an old, old version of Coldfusion. One thing the Coldfusion version does well, though, is display tons of database records on a single page. It's capable of displaying hundreds of thousands of rows at once without crashing the browser. It uses a Java applet, and it looks like the contents of the rows are perhaps compressed and passed in through the HTML or something: there is a large block of data in the HTML that isn't displayed; it's just rendered by the Java applet.
I've tried several JavaScript solutions but they all hinge on the fact that the data will be present in an HTML table or something along those lines. This causes browsers to freeze and run out of memory.
Does anyone know of any solutions to this situation? Our client loves the ability to scroll through all of this data without clicking a "next page" link.
I have done just what you are describing using the following (which works very well):
jQuery Datatables
It enables you to do 'fetch as you scroll' pagination, so you can disable the pagination arrows in favor of a 'forever' scroll.
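With server-side processing the backend only ever answers small paged queries, so the browser never holds more than one screenful of rows. A rough sketch of what the Django side might look like against the draw/start/length parameters that DataTables 1.10+ sends; the model and column names are made up:

from django.http import JsonResponse
from myapp.models import Record  # hypothetical model over the Oracle table

def records_json(request):
    draw = int(request.GET.get("draw", 1))
    start = int(request.GET.get("start", 0))
    length = int(request.GET.get("length", 100))

    qs = Record.objects.order_by("id")
    total = qs.count()
    page = qs[start:start + length]  # translated into a paged SQL query

    return JsonResponse({
        "draw": draw,
        "recordsTotal": total,
        "recordsFiltered": total,  # no search filter applied in this sketch
        "data": [[r.id, r.name, r.value] for r in page],  # hypothetical columns
    })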
Give jQuery scroll a try. Instead of an image scroll, you need a data scroll: populate the divs with data instead of images.
http://www.smoothdivscroll.com/#quickdemo
It should work, I hope.
You've got a great client anyway :-)
Something related to your question:
http://www.9lessons.info/2009/07/load-data-while-scroll-with-jquery-php.html
http://api.jquery.com/scroll/
I'm using Open Rico's LiveGrid in a project to display a table with thousands of rows as an endless scrolling table, and it has been working really well so far. The table requests data on demand as you scroll through the rows. The parameters are sent as simple GET parameters, and the response you have to create on the server side is simple XML. It should be possible to implement a data backend for a Rico LiveGrid in Python.
Most people, in this case, would use a framework. The best documented and most popular framework in Python is Django. It has good database support (including Oracle), and you'll have the easiest time getting help using it since there's such an active Django community.
You can try some other frameworks, but if you're tied to Python I'd recommend Django.
Of course, Jython (if it's an option) would make your job very easy. You could take the existing Java framework you have and just use Jython to build a frontend (and continue to use your Java applet and Java classes and Java server).
The memory problem is an interesting one; I'd be curious to see what you come up with.
Have you tried jqGrid? It can be buggy at times, but overall it's one of the better JavaScript grids. It's fairly efficient in dealing with large datasets. It also has a feature whereby the grid retrieves data asynchronously in chunks, but still allows continuous scrolling. It just asks for more data as the user scrolls down to it.
I did something like this a while ago and successfully implemented YUI's DataTable combined with Django:
http://developer.yahoo.com/yui/datatable/
This gives you column sorting, pagination, scrolling and so on. It also allows you to use a variety of data sources such as JSON or XML.
I'm working on a project that is quite search-oriented. Basically, users will add content to the site, and this content should be immediately available in the search results. The project is still in development.
Up until now, I've been using Haystack with Xapian. One thing I'm worried about is the performance of the website once a lot of content is available. Indexing will have to occur very frequently if I want to emulate real-time search.
I was reading up on MongoDB recently. I haven't found a satisfying answer to my question, but I have the feeling that MongoDB might be of help for the real-time search indexing issue I expect to encounter. Is this correct? In other words, would the search functionality available in MongoDB be more suited for a real-time search function?
The content that will be available on the site is large unstructured text (including HTML) and related data (prices, tags, datetime info).
Thanks in advance,
Laundro
I don't know much about MongoDB, but I've been using Sphinx Search with great success: a simple, powerful and very fast tool for full-text indexing and search. It also provides a Python wrapper out of the box.
It would be easier to pick up if Haystack provided bindings for it; unfortunately, Sphinx bindings are still on the wish list.
Nevertheless, setting Sphinx up is so quick (I did it in a few hours, for an existing in-production Django-based CRM) that maybe you can give it a try before switching to a more generic solution.
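For reference, querying Sphinx from Python with the bundled sphinxapi client looks roughly like this; the index name and server details are assumptions:

from sphinxapi import SphinxClient, SPH_MATCH_EXTENDED2

client = SphinxClient()
client.SetServer("localhost", 9312)       # default searchd port
client.SetMatchMode(SPH_MATCH_EXTENDED2)  # extended query syntax
client.SetLimits(0, 20)                   # offset, limit

result = client.Query("user query here", "content_index")  # index name assumed
if result and result["matches"]:
    for match in result["matches"]:
        print(match["id"], match["weight"])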
MongoDB is not really a dedicated full-text search engine. Based on their full text search docs, you can only create an array of tags that duplicates the string data or other columns, which with many elements (hundreds or thousands) can make inserts very expensive.
I agree with Tomasz: Sphinx Search can be used for what you need. Use real-time indexes if you want it to be truly real-time, or delta indexes if several seconds of delay are acceptable.