I have around 80,000 text files and I want to be able to do an advanced search on them.
Let's say I have two lists of keywords and I want to return all the files that include at least one of the keywords in the first list and at least one in the second list.
Is there already a library that does this? I don't want to rewrite it if one exists.
Since you need to search the documents multiple times, you most likely want to index the text files to make such searches as fast as possible.
Implementing a reasonable index yourself is certainly possible, but a quick search led me to:
https://pypi.python.org/pypi/Whoosh/
http://pythonhosted.org/Whoosh/
Take a look at the documentation. It should hopefully be rather trivial to achieve the desired behaviour.
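For example, a minimal sketch with Whoosh (the index directory, file paths, and keyword lists below are placeholders) could index the files once and express your requirement as "(any of list one) AND (any of list two)":

```python
import os
from whoosh import index
from whoosh.fields import Schema, ID, TEXT
from whoosh.qparser import QueryParser

# Build the index once; rebuilding it for every search would defeat the purpose.
schema = Schema(path=ID(stored=True, unique=True), content=TEXT)
os.makedirs("indexdir", exist_ok=True)
ix = index.create_in("indexdir", schema)

writer = ix.writer()
for path in ["file1.txt", "file2.txt"]:          # your 80,000 files
    with open(path, errors="ignore") as handle:
        writer.add_document(path=path, content=handle.read())
writer.commit()

list_one = ["foo", "bar"]                        # placeholder keyword lists
list_two = ["baz", "qux"]
query_text = "({}) AND ({})".format(" OR ".join(list_one), " OR ".join(list_two))

with ix.searcher() as searcher:
    query = QueryParser("content", ix.schema).parse(query_text)
    for hit in searcher.search(query, limit=None):
        print(hit["path"])
```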
I get the feeling you may want to use MapReduce-style processing for the search. It should be very scalable, and Python has MapReduce packages available.
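If you want to stay with the standard library rather than a full MapReduce package, a simple map/reduce-style scan might look like this (the keyword lists and file handling are placeholders of my own, not a specific package's API):

```python
from multiprocessing import Pool

# Placeholder keyword lists; a file matches if it contains at least one
# keyword from each list.
LIST_ONE = {"foo", "bar"}
LIST_TWO = {"baz", "qux"}

def matches(path):
    """Map step: return the path if the file satisfies both keyword lists."""
    with open(path, errors="ignore") as handle:
        text = handle.read().lower()
    if any(k in text for k in LIST_ONE) and any(k in text for k in LIST_TWO):
        return path
    return None

def search(paths):
    """Reduce step: keep only the paths that matched."""
    with Pool() as pool:
        return [p for p in pool.map(matches, paths) if p]
```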
With the emergence of new TLDs (.club, .jobs, etc.), what is the current best practice for extracting/parsing domains from text? My typical approach is regex; however, given that things like file names with extensions will trigger false positives, I need something more restrictive.
I noticed even Google sometimes does not properly recognize whether I'm searching for a file name or want to go to a domain. This appears to be a rather challenging problem. Machine learning could potentially be an approach to understanding the context surrounding a string, but unless there is a library that already does this, I won't bother getting too fancy.
One approach I'm thinking of is, after regexing, querying http://data.iana.org/TLD/tlds-alpha-by-domain.txt, which holds a static list of current TLDs, and using it as a filter. Any suggestions?
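A rough sketch of the filter I have in mind (the candidate regex and helper names are my own illustration, not an existing library):

```python
import re
import urllib.request

# Regex out domain-like candidates, then keep only those whose last
# label is a real TLD from the IANA list.
TLD_URL = "http://data.iana.org/TLD/tlds-alpha-by-domain.txt"
CANDIDATE = re.compile(r"\b[a-z0-9-]+(?:\.[a-z0-9-]+)+\b", re.IGNORECASE)

def load_tlds():
    """Fetch the IANA TLD list (the first line is a version comment)."""
    with urllib.request.urlopen(TLD_URL) as resp:
        lines = resp.read().decode("ascii").splitlines()
    return {line.strip().lower() for line in lines if line and not line.startswith("#")}

def extract_domains(text, tlds):
    """Return regex candidates whose final label is a known TLD."""
    return [m.group(0) for m in CANDIDATE.finditer(text)
            if m.group(0).rsplit(".", 1)[-1].lower() in tlds]
```

This would drop "notes.txt" (txt is not a TLD) while keeping "example.club", though some real TLDs such as .zip still collide with common file extensions, which is exactly the false-positive problem described above.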
This is not an easy problem, and it depends on the context in which you need to extract the domain names and on the rate of false positives and negatives you can accept. You can indeed use the list of currently existing TLDs, but this list changes, so you need to make sure you are working with a recent enough version of it.
You are hitting issues covered by the Universal Acceptance movement, which tries to ensure that all TLDs (whatever their length, date of creation, or characters used) are treated equally.
They provide a document about "Linkification", which includes, as a sub-problem, extracting links and hence domains, among other things. Have a look at their documentation: https://uasg.tech/wp-content/uploads/2017/06/UASG010-Quick-Guide-to-Linkification.pdf
So this could give you some ideas, as well as their Quick Guide at https://uasg.tech/wp-content/uploads/2016/06/UASG005-160302-en-quickguide-digital.pdf
I am currently working on a script to process some log files (a few MB). I am quite new to Python, and up to now my method has been to extract some information, write it to another text file, and so on. I would then compare the different text files and work that way. Even though I delete the intermediate text files once I have my final output, I find the approach a bit messy.
As I have become more acquainted with lists, I am now trying to use them instead of text files to store and manage data.
I was wondering what the best method is. Should I use lists more instead of text files, or does it not really matter? I would tend to think lists are better for obvious reasons, but I wanted to make sure. I hope it is not too much of a silly question. Thanks
EDIT
Quick example: I used to create two text files from the log files and then compare those text files; now I am doing the same thing with lists.
Lists in Python have many methods and features that make them very flexible and manageable.
There are also other types similar to lists, such as generators.
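As a rough illustration of the in-memory approach (the log format and file names here are assumptions), you can collect values into lists and compare them with set operations instead of writing intermediate files:

```python
def extract_ids(log_path):
    """Collect the first field of every non-empty line of a log file."""
    ids = []
    with open(log_path) as handle:
        for line in handle:
            parts = line.split()
            if parts:
                ids.append(parts[0])
    return ids

ids_a = extract_ids("server_a.log")   # placeholder file names
ids_b = extract_ids("server_b.log")

# Set operations replace the manual comparison of two intermediate text files.
only_in_a = set(ids_a) - set(ids_b)
in_both = set(ids_a) & set(ids_b)
```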
Best of luck.
This isn't so much a specific problem as something I am looking for a more "Pythonic", philosophical answer to. Namely, what's the best way to keep track of unique items and ensure duplicates don't arise?
For example, I am writing a script to scrape a website for links to songs on SoundCloud so I can automatically download them. If I want to automate this program with, say, cron, what's the most efficient way to ensure that I am downloading only content I don't have already?
Or if I downloaded images, how could I make sure that there aren't any duplicates, or have some sort of process that searches for and removes duplicates efficiently?
Kind of open ended, so contribute as little or as much as you please.
Thanks.
Use a dict or set. Consider computing a checksum of each item. This brings you toward what's known as content-addressable storage (CAS), where the checksum is stored as if it were the item's "name", and a separate index maps things like file names or song names to the checksums or data blocks. The problem with the CAS approach in your particular case is that you may not be able to get a checksum computed on the remote side for new content; that's how programs like rsync avoid copying duplicate data.
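As a rough sketch of the checksum idea for files you already have locally (the directory layout is an assumption):

```python
import hashlib
import os

def file_digest(path, chunk_size=1 << 16):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

def find_duplicates(directory):
    """Group file paths by checksum and report the groups with duplicates."""
    seen = {}
    for name in os.listdir(directory):
        path = os.path.join(directory, name)
        if os.path.isfile(path):
            seen.setdefault(file_digest(path), []).append(path)
    return {digest: paths for digest, paths in seen.items() if len(paths) > 1}
```

Keeping the set of known digests (or song URLs) in a small persistent file or database lets a cron job skip anything it has already downloaded.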
I am trying to work out a solution for detecting traceability between source code and documentation. The most important use case is that the user needs to see a collection of source code tokens (sorted by relevance to the documentation) that can be traced back to the documentation. She won't be bothered about the code format, but somehow needs to see an "identifier-documentation" mapping to get the idea of traceability.
I take the tokens from source code files and split the concatenated identifiers (SimpleMAXAnalyzer becomes "simple max analyzer"), which then act as search terms on the documentation. Search frameworks are well suited to this specific task: drilling down into documents to locate things using powerful information-retrieval algorithms. Whoosh looked like a really great Python search library, with a number of analyzers and filters.
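For reference, the splitting step itself does not need a search framework; a small regex along these lines (my own sketch, not part of Whoosh) already turns the identifiers into search terms:

```python
import re

def split_identifier(identifier):
    """Split a concatenated identifier into lower-case terms,
    e.g. 'SimpleMAXAnalyzer' -> ['simple', 'max', 'analyzer']."""
    parts = re.findall(r"[A-Z]+(?=[A-Z][a-z]|\b)|[A-Z]?[a-z]+|\d+", identifier)
    return [part.lower() for part in parts]

print(split_identifier("SimpleMAXAnalyzer"))  # ['simple', 'max', 'analyzer']
```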
Though the problem is similar to search, it differs in that the user is not actually performing any search. So am I solving the problem the right way? Given that everything is static and needs to be computed only once, am I using the wrong tool (a search framework) for the job?
I'm not sure I understand your use case. The user sees the source code and has some way of jumping from a token to the appropriate part (or a listing of the possible parts) of the documentation, right?
Then a search tool seems to be the right tool for the job, although you could precompile every possible search (there is only a limited number of identifiers in the source, so you can calculate all possible references to the docs in advance).
Or are there any "canonical" parts of the documentation for every identifier? Then maybe some kind of index would be a better choice.
Maybe you could clarify your use case a bit further.
Edit: Maybe an alphabetical index of the documentation could be a step toward the solution. Then you can look up the pages/chapters/sections for every token of the source where all or most of its components are mentioned.
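Something like the following sketch of that precomputed lookup might make the idea concrete (the doc_sections structure and the crude tokenisation are assumptions on my part):

```python
from collections import defaultdict

def build_reverse_index(doc_sections):
    """Map every word of the documentation to the sections mentioning it.
    doc_sections is assumed to be {section_title: section_text}."""
    index = defaultdict(set)
    for title, text in doc_sections.items():
        for word in text.lower().split():
            index[word.strip(".,()")].add(title)
    return index

def sections_for_identifier(terms, index):
    """Sections that mention all components of a split identifier."""
    hits = [index.get(term, set()) for term in terms]
    return set.intersection(*hits) if hits else set()
```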
I'm working on a project that is quite search-oriented. Basically, users will add content to the site, and this content should be immediately available in the search results. The project is still in development.
Up until now, I've been using Haystack with Xapian. One thing I'm worried about is the performance of the website once a lot of content is available. Indexing will have to occur very frequently if I want to emulate real-time search.
I was reading up on MongoDB recently. I haven't found a satisfying answer to my question, but I have the feeling that MongoDB might be of help for the real-time search indexing issue I expect to encounter. Is this correct? In other words, would the search functionality available in MongoDB be more suited for a real-time search function?
The content that will be available on the site is large unstructured text (including HTML) and related data (prices, tags, datetime info).
Thanks in advance,
Laundro
I don't know much about MongoDB, but I'm using Sphinx Search with great success: a simple, powerful, and very fast tool for full-text indexing and search. It also provides a Python wrapper out of the box.
It would be easier to pick up if Haystack provided bindings for it; unfortunately, Sphinx bindings are still on the wish list.
Nevertheless, setting Sphinx up is so quick (I did it in a few hours, for an existing in-production Django-based CRM) that maybe you can give it a try before switching to a more generic solution.
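For what it's worth, querying a Sphinx index from Python is also only a few lines, assuming the classic sphinxapi.py client that ships with Sphinx (the host, port, and index name below are placeholders):

```python
import sphinxapi  # sphinxapi.py shipped with the Sphinx distribution

client = sphinxapi.SphinxClient()
client.SetServer("localhost", 9312)          # searchd host and port

result = client.Query("user search terms", "content_index")
if result is None:
    print("Query failed:", client.GetLastError())
else:
    for match in result["matches"]:
        # Each match carries the document id and its relevance weight.
        print(match["id"], match["weight"])
```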
MongoDB is not really a dedicated full-text search engine. Based on their full-text search docs, you can only create an array of tags that duplicates the string data or other fields, and with many elements (hundreds or thousands) this can make inserts very expensive.
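To illustrate, that tag-array workaround with pymongo looks roughly like this (the database, collection, and field names are my own placeholders):

```python
from pymongo import MongoClient

client = MongoClient()
posts = client.mydb.posts

text = "Large unstructured text with prices, tags and datetime info..."
# Duplicate the words of the text into a keyword array on each document.
keywords = sorted({word.lower().strip(".,") for word in text.split()})

posts.insert_one({"body": text, "keywords": keywords})
posts.create_index("keywords")  # multikey index over the array

# Find documents containing all of the given keywords.
results = posts.find({"keywords": {"$all": ["prices", "tags"]}})
```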
I agree with Tomasz: Sphinx Search can be used for what you need. Use real-time indexes if you want it to be truly real time, or delta indexes if a delay of several seconds is acceptable.