I'm attempting some word analysis on a large collection of tweets.
I'm pulling tweets based on a search query, and I then want to find the keywords that appear often and that are related to the original query.
I'm not quite sure how to go about this in a reasonably effective manner, though. At the moment I just remove stopwords and count the words that occur most often, but that's more basic than I'd like.
Does anyone have any suggestions for this sort of thing (or even links to any reading on the topic)?
Any help greatly appreciated.
(my implementation is in Python, if that's relevant)
For semantic reasoning about the content of a tweet, you should definitely try NLTK (the Natural Language Toolkit). It's capable of quite sophisticated analysis of text.
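For example, here's a minimal sketch of going beyond raw frequency counts with NLTK: it removes stopwords and then ranks bigram collocations, which tend to surface query-related phrases rather than isolated common words. The `tweets` list is a placeholder for your own data.

```python
import nltk
from nltk.corpus import stopwords
from nltk.collocations import BigramAssocMeasures, BigramCollocationFinder

# One-time downloads (safe to re-run)
nltk.download("punkt")
nltk.download("stopwords")

# Placeholder: substitute the tweet texts you pulled for your query
tweets = ["example tweet text about python nlp", "another tweet about nlp libraries"]

stop = set(stopwords.words("english"))
tokens = [
    w.lower()
    for tweet in tweets
    for w in nltk.word_tokenize(tweet)
    if w.isalpha() and w.lower() not in stop
]

# Plain frequency counts (what you already have)
freq = nltk.FreqDist(tokens)
print(freq.most_common(20))

# Bigram collocations scored by pointwise mutual information:
# pairs of words that co-occur more often than chance would predict
finder = BigramCollocationFinder.from_words(tokens)
finder.apply_freq_filter(3)  # ignore rare pairs (tune for your corpus size)
print(finder.nbest(BigramAssocMeasures.pmi, 20))
```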
Here's the approach I have in mind:
Break the data paragraphs into sentences, so that each sentence is a string.
Find the keywords in the question (nouns, verbs and adjectives), perhaps using POS tagging. Lemmatization might be necessary.
Search through all the data strings to find the one which contains all (or most) of these keywords. That sentence is most probably the answer we're looking for.
But I find the implementation difficult, as I'm new to Python and unaware of the necessary libraries. I'd be grateful if someone could guide me down this path, or point out a better way to do it.
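A rough sketch of how those three steps might look with NLTK; the question text and data paragraph below are placeholders, and the keyword-overlap scoring is just one simple way to rank sentences:

```python
import nltk
from nltk.stem import WordNetLemmatizer

# One-time downloads
nltk.download("punkt")
nltk.download("averaged_perceptron_tagger")
nltk.download("wordnet")

question = "What causes rain to form?"  # placeholder question
paragraph = "Rain forms when water vapour condenses. Clouds hold droplets of water."  # placeholder data

lemmatizer = WordNetLemmatizer()

def keywords(text):
    """Lemmatized nouns, verbs and adjectives from the text."""
    tagged = nltk.pos_tag(nltk.word_tokenize(text))
    return {
        lemmatizer.lemmatize(word.lower())
        for word, tag in tagged
        if tag.startswith(("NN", "VB", "JJ"))
    }

# Step 1: one string per sentence
sentences = nltk.sent_tokenize(paragraph)

# Steps 2-3: score each sentence by keyword overlap with the question
q_words = keywords(question)
best = max(sentences, key=lambda s: len(keywords(s) & q_words))
print(best)
```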
Apologies for the vague nature of this question but I'm honestly not quite sure where to start and thought I'd ask here for guidance.
As an exercise, I've downloaded several academic papers and stored them as plain text in a MongoDB database.
I'd like to write a search feature (in Python, R, whatever) that, when you enter text, returns the most relevant articles. Clearly, "relevant" is really hard; that's what Google got so right.
However, I'm not looking for it to be perfect. Just to get something. A few thoughts I had were:
1) Simple MongoDB full text search
2) Implement Lucene Search
3) Tag them (unsure how, though) and then return them sorted by the number of matching tags?
Is there a solution someone has used that's out of the box and works fairly well? I can always optimize the search feature later -- for now I just want all the pieces to move together...
Thanks!
Is there a solution someone has used that's out of the box and works fairly well?
It depends on how you define well, but in simple terms I'd say no; there is no single, precise definition of "fairly well". A lot of challenges intrinsic to the problem arise when one tries to implement a good search algorithm. Those challenges lie in:
diversity of user needs: users in different fields have different intentions and, as a result, different expectations from a search result page;
diversity of natural languages, if you are trying to implement multi-language search (German has a lot of noun compounds, Russian has enormous inflectional variability, etc.).
There are some algorithms that are proven to work better than others, though, and so are good starting points. TF-IDF and BM25 are the two most popular.
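As a concrete starting point, here's a minimal TF-IDF ranking sketch using scikit-learn; the documents are placeholders, and in your case they would come out of MongoDB:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import linear_kernel

# Placeholder corpus; in practice, fetch the plain-text papers from MongoDB
docs = [
    "Deep learning for natural language processing",
    "Bayesian methods in clinical trials",
    "A survey of information retrieval models",
]

vectorizer = TfidfVectorizer(stop_words="english")
doc_vectors = vectorizer.fit_transform(docs)

def search(query, top_k=3):
    """Return indices of the top_k documents most similar to the query."""
    query_vec = vectorizer.transform([query])
    # TF-IDF vectors are L2-normalized by default, so a dot product
    # (linear kernel) is exactly cosine similarity
    scores = linear_kernel(query_vec, doc_vectors).ravel()
    return scores.argsort()[::-1][:top_k]

for i in search("retrieval models"):
    print(docs[i])
```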
I can always optimize the search feature later -- for now I just want all the pieces to move together...
MongoDB or any RDBMS with full-text indexing support is good enough for a proof of concept, but if you need to optimize for search performance, you will need an inverted index (Solr/Lucene). From Solr/Lucene you will get the ability to control:
how exactly words are stemmed (this is important for avoiding understemming/overstemming problems);
what counts as a word. Is "supercomputer" one word? What about "stackoverflow" or "OutOfBoundsException"?
synonyms and word expansion (should "O2" be found for an "oxygen" query?);
how exactly search is performed: which words can be ignored during search, which ones are required to be found, and which ones are required to be found near each other (think of search phrases like "not annealed" or "without expansion").
This is just what comes to mind first.
So if you are planning to work these things out, I definitely recommend Lucene as a framework, or Solr/Elasticsearch as a search system if you need to build a proof of concept fast. If not, MongoDB or an RDBMS will work well.
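For the proof-of-concept route, here's roughly what MongoDB's built-in full-text search looks like from Python with pymongo; the database, collection and field names are made up for illustration:

```python
from pymongo import MongoClient, TEXT

client = MongoClient("mongodb://localhost:27017")
papers = client["research"]["papers"]  # hypothetical database/collection names

# One-time setup: a text index over the stored plain text
papers.create_index([("body", TEXT)])

# Rank results by MongoDB's built-in relevance score
cursor = papers.find(
    {"$text": {"$search": "gene expression"}},
    {"score": {"$meta": "textScore"}, "title": 1},
).sort([("score", {"$meta": "textScore"})]).limit(10)

for doc in cursor:
    print(doc.get("title"), doc["score"])
```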
I have a lot of text documents containing company and personal names, and aligned versions of the same documents in which those names have been manually anonymized (each name replaced with a single unique character).
I want to use this corpus to train a system to perform automatic anonymization on unseen documents, i.e. simply replacing words with a character. The primary problem is recognizing which words should be anonymized; the secondary problem is replacing those words with a unique character. I can do the secondary problem.
Python is preferred and I'm thinking sklearn must contain the necessary tools.
How would I go about this? There are many articles on Stack Overflow about supervised learning, but I'm not sure they match my situation. I suspect this is a fairly simple problem to solve, and I'm not necessarily looking for a complete solution, but some starting pointers would be nice. Any insight on which algorithms would work better is also much appreciated.
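One common way to frame this is per-token binary classification: the aligned documents give you a label for every token (anonymized or not), and simple surface features often go a long way. A minimal sketch with scikit-learn; the training sentences and features below are illustrative, and a sequence model such as a CRF would likely do better:

```python
from sklearn.feature_extraction import DictVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

def token_features(tokens, i):
    """Simple surface features for the i-th token of a sentence."""
    w = tokens[i]
    return {
        "lower": w.lower(),
        "is_title": w.istitle(),
        "is_upper": w.isupper(),
        "prev": tokens[i - 1].lower() if i > 0 else "<S>",
        "next": tokens[i + 1].lower() if i < len(tokens) - 1 else "</S>",
    }

# Illustrative training data; derive the real labels by aligning the
# original documents with their manually anonymized counterparts
sentences = [["John", "works", "at", "Acme", "Corp"]]
labels = [[1, 0, 0, 1, 1]]  # 1 = token was anonymized

X = [token_features(s, i) for s in sentences for i in range(len(s))]
y = [label for ls in labels for label in ls]

model = make_pipeline(DictVectorizer(), LogisticRegression(max_iter=1000))
model.fit(X, y)

# Anonymize an unseen sentence
test = ["Mary", "joined", "Acme"]
preds = model.predict([token_features(test, i) for i in range(len(test))])
print([("X" if p else w) for w, p in zip(test, preds)])
```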
I'm trying to do named entity recognition in Python using NLTK.
I want to extract a personal list of skills.
I have the list of skills and would like to search for them in a requisition and tag the skills.
I noticed that NLTK has NER tags for predefined categories like Person, Location, etc.
Is there an external gazetteer tagger in Python I can use?
Any ideas on how to do this in a more sophisticated way than a plain search for terms (which are sometimes multi-word)?
Thanks,
Assaf
I haven't used NLTK much recently, but if you have words that you know are skills, you don't need to do NER; just a text search.
Maybe use Lucene or some other search library to find the text and then annotate it? That's a lot of work, but if you are working with a lot of data it might be worth it. Alternatively, you could hack together a regex search, which will be slower but will probably work fine for smaller amounts of data and will be much easier to implement.
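For instance, a regex-based gazetteer match that handles multi-word skills by trying the longest entries first; the skill list here is a placeholder:

```python
import re

# Placeholder gazetteer; sort longest-first so "machine learning" wins
# over a hypothetical shorter entry like "machine"
skills = ["machine learning", "project management", "python", "sql"]
skills.sort(key=len, reverse=True)

pattern = re.compile(
    r"\b(" + "|".join(re.escape(s) for s in skills) + r")\b",
    re.IGNORECASE,
)

text = "Looking for a Python developer with machine learning experience."
for match in pattern.finditer(text):
    print(match.group(1), match.span())
```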
Have a look at RegexpTagger and eventually RegexpParser; I think that's exactly what you are looking for.
You can create your own POS tags, i.e. map skills to a tag, and then easily define a grammar.
Some sample code for the tagger is in this pdf.
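Roughly, that could look like this; the skill patterns and the custom SKILL tag are made up for illustration:

```python
import nltk

nltk.download("punkt")

# Map known skill words to a custom SKILL tag; everything else gets OTHER
tagger = nltk.RegexpTagger([
    (r"^(python|java|sql|hadoop)$", "SKILL"),
    (r".*", "OTHER"),
])

# Chunk one or more adjacent SKILL tokens together,
# so multi-word skills come out as a single group
parser = nltk.RegexpParser("SKILLS: {<SKILL>+}")

# Lowercase before tagging so the patterns above stay simple
tokens = [w.lower() for w in nltk.word_tokenize("Experience with Python and SQL is required")]
tree = parser.parse(tagger.tag(tokens))

for subtree in tree.subtrees(filter=lambda t: t.label() == "SKILLS"):
    print(" ".join(word for word, tag in subtree.leaves()))
```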
I want to get related [things/questions] in my app, similar to what StackOverflow does when you tab out of the Title field.
I can think of only one way to do it, which I think might be fast enough:
Do a search for the title in the corpus of titles of all [things], and return the first x matches. We can use whatever search is being used for site search.
What other ways are there to do this that are fast enough? This is going to fire on tab-out, so heavy server-side processing is not feasible.
I'm just looking for the general approach, but I'm using MySQL and Django, so if your answer uses those, all the better.
[I cannot think of good tags for it, so please feel free to edit them]
You're looking at a content-based recommendation algorithm. AFAICT, StackOverflow's version looks at the tags and the words in the title, and finds questions that share some of these. It can be implemented as a nearest-neighbour search in a space where documents are represented as TF-IDF vectors.
Implementation-wise, go with any Django search engine that supports stemming, stopwords, non-strict matches, and tf-idf weights. Algorithmic complexity isn't high (just a few index lookups), so it doesn't matter if it's written in Python.
If you can't find a search engine that does everything you want, you can still leave the stemming and stopwords to the engine: call it on individual words and do your own tf-idf scoring on top, with a score that favors shared tags.
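For illustration, a nearest-neighbour lookup over TF-IDF title vectors with scikit-learn; the titles are placeholders, and a real setup would fold tags into the score as well:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.neighbors import NearestNeighbors

titles = [
    "How to parse JSON in Python",
    "Parsing JSON with Django views",
    "MySQL query optimization tips",
]

vectorizer = TfidfVectorizer(stop_words="english")
matrix = vectorizer.fit_transform(titles)

# Brute-force cosine search is fine for modest corpora
nn = NearestNeighbors(n_neighbors=2, metric="cosine").fit(matrix)

def related(new_title):
    """Return the titles most similar to the one just typed."""
    _, idx = nn.kneighbors(vectorizer.transform([new_title]))
    return [titles[i] for i in idx[0]]

print(related("django json parsing"))
```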