So I have a basic question which I did not find answered in the documentation.
Say I want to do text classification (sentiment analysis) on an article about a particular topic.
I already have the plain text, stripped of the HTML, via Python libraries. Is it better to analyze each sentence and then combine the results, or to just pass the whole text as one string and get an already combined result? (I have already tried the whole-text option with flair and it worked quite well, I guess.)
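To make the two options concrete, this is roughly what I have in mind (a rough, untested sketch; I'm assuming flair's pre-trained "en-sentiment" model and NLTK's sentence splitter, which may not be the right choices):

from flair.models import TextClassifier
from flair.data import Sentence
from nltk.tokenize import sent_tokenize  # assumes the NLTK "punkt" data is downloaded

classifier = TextClassifier.load("en-sentiment")  # flair's pre-trained sentiment model

def whole_text_sentiment(text):
    # Option A: score the whole article as one string
    s = Sentence(text)
    classifier.predict(s)
    label = s.labels[0]
    return label.value, label.score

def per_sentence_sentiment(text):
    # Option B: score each sentence, then combine (here: a simple majority vote)
    labels = []
    for sent in sent_tokenize(text):
        s = Sentence(sent)
        classifier.predict(s)
        labels.append(s.labels[0].value)
    positives = labels.count("POSITIVE")
    return "POSITIVE" if positives >= len(labels) / 2 else "NEGATIVE", labels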
The next thing would be how to check whether the sentences are actually about the topic in question.
If you could give me some guidelines or hints on how to approach these problems, I would be happy.
Here's the approach I have in mind:
Break the data paragraph into sentences, so that each sentence is a separate string.
Find the keywords in the question (nouns, verbs, and adjectives), perhaps using POS tagging. Lemmatization might be necessary.
Search through all the data strings to find the one which has all/most of these keywords. That sentence would most probably be the answer we're looking for.
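A rough, untested sketch of what I imagine, with NLTK doing the POS tagging and lemmatization (I'm not confident these are the right calls, so treat the details as assumptions):

import nltk  # needs the punkt, averaged_perceptron_tagger and wordnet data packages
from nltk import pos_tag, word_tokenize, sent_tokenize
from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()
KEEP_TAGS = ("NN", "VB", "JJ")  # tag prefixes for nouns, verbs and adjectives

def keywords(text):
    # lemmatized nouns/verbs/adjectives of a question or sentence
    words = set()
    for word, tag in pos_tag(word_tokenize(text.lower())):
        if tag.startswith(KEEP_TAGS):
            words.add(lemmatizer.lemmatize(word))
    return words

def best_sentence(paragraph, question):
    # the sentence sharing the most keywords with the question
    q_keys = keywords(question)
    return max(sent_tokenize(paragraph), key=lambda s: len(q_keys & keywords(s)))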
But I find the implementation difficult, as I'm new to Python and unaware of the necessary commands/libraries. I would be grateful if someone could guide me down this path or point me to a better way to do it.
I'm trying to write a Python program that will decide whether a given post is about the topic of volunteering. My data sets are small (only the posts, which are examined one by one), so approaches like LDA do not yield results.
My end goal is a simple True/False, a post is about the topic or not.
I'm trying this approach:
Using Google's word2vec model, I'm creating a "cluster" of words that are similar to the word: "volunteer".
CLUSTER = [x[0] for x in MODEL.most_similar_cosmul("volunteer", topn=120)]
Getting the posts and translating them to English using Google Translate.
Cleaning the translated posts using NLTK (removing stopwords and punctuation, and lemmatizing the post).
Making a bag of words (BOW) out of the translated, cleaned post.
This stage is difficult for me. I want to calculate a "distance" / "similarity" / something that will help me get the True/False answer that I'm looking for, but I can't think of a good way to do that.
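The best rough idea I have for this so far is something like the sketch below, comparing the cleaned post to the cluster either by word overlap or with gensim's n_similarity, with thresholds I made up (so this is an assumption, not something that works):

# MODEL is the gensim KeyedVectors loaded from Google's word2vec binary,
# CLUSTER is the list built with most_similar_cosmul above
def is_about_volunteering(clean_tokens, threshold=0.5):
    # clean_tokens: the cleaned, lemmatized tokens of one translated post
    in_vocab = [t for t in clean_tokens if t in MODEL.key_to_index]  # .vocab in older gensim
    if not in_vocab:
        return False
    # option 1: fraction of post tokens that fall inside the "volunteer" cluster
    overlap = len(set(in_vocab) & set(CLUSTER)) / len(set(in_vocab))
    # option 2: cosine similarity between the post and the cluster, as word sets
    similarity = MODEL.n_similarity(in_vocab, CLUSTER)
    return overlap > 0.1 or similarity > threshold  # both cut-offs picked arbitrarily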
Thank you for your suggestions and help in advance.
You are attempting to intuitively improvise a set of steps that, in the end, will classify these posts into the two categories, "volunteering" and "not-volunteering".
You should look for online examples that do "text classification" similar to your task, work through them (with their original demo data) to understand them, then adapt them incrementally to work with your data instead.
At some point, word2vec might be a helpful contributor to your task - but I wouldn't start with it. Similarly, eliminating stop-words, performing lemmatization, etc. might eventually be helpful, but need not be important up front.
You'll typically want to start by acquiring (by hand-labeling if necessary) a training set of text for which you know the "volunteering" or "not-volunteering" value (known labels).
Then, create some feature-vectors for the texts - a simple starting approach that offers a quick baseline for later improvements is a "bag of words" representation.
Then, feed those representations, with the known labels, to some existing classification algorithm. The popular scikit-learn package in Python offers many. That is: you don't yet need to worry about choosing ways to calculate a "distance" / "similarity" / something to guide your own ad hoc classifier. Just feed the labeled data into one (or many) existing classifiers, and check how well they do. Many will use various kinds of similarity/distance calculations internally - but that comes automatically from choosing & configuring the algorithm.
Finally, when you have something working start-to-finish, no matter how modest the results, try alternate ways of preprocessing the text (stop-word removal, lemmatization, etc.), featurizing the text, and alternate classifiers/algorithm parameterizations - to compare results, and thus discover what works well given your specific data, goals, and practical constraints.
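For example, a minimal start-to-finish version of the above could look roughly like this (the handful of labeled texts are made-up placeholders, and the vectorizer/classifier are just one reasonable default each, not the "right" choice):

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import train_test_split

# hand-labeled examples: 1 = volunteering, 0 = not-volunteering (placeholders - you'd want many more)
texts = [
    "We need volunteers for the beach cleanup this weekend",
    "Looking for people to help serve meals at the shelter",
    "Join our unpaid mentoring program for local students",
    "Two-bedroom apartment for rent in the city centre",
    "Concert tickets on sale starting Friday",
    "New phone model announced with a bigger screen",
]
labels = [1, 1, 1, 0, 0, 0]

X_train, X_test, y_train, y_test = train_test_split(texts, labels, test_size=0.33, random_state=0)

# bag-of-words features feeding a simple classifier
model = make_pipeline(CountVectorizer(), MultinomialNB())
model.fit(X_train, y_train)

print("held-out accuracy:", model.score(X_test, y_test))
print(model.predict(["Come volunteer at the animal shelter on Saturday"]))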
The scikit-learn "Working With Text Data" guide is worth reviewing & working-through, and their "Choosing the right estimator" map is useful for understanding the broad terrain of alternate techniques and major algorithms, and when different ones apply to your task.
Also, scikit-learn contributors/educators like Jake Vanderplas (github.com/jakevdp) and Olivier Grisel (github.com/ogrisel) have many online notebooks/tutorials/archived-video-presentations which step through all the basics, often including text-classification problems much like yours.
I'm new to programming and do not have much experience yet. I understand some Python code, but not in detail.
I have an Excel file which contains log files of problems people encountered. The description of each problem is pasted as an email (so it's a bunch of text). I want to analyze all of these texts (almost 1,000 rows in Excel) at once, and I think Python can do this.
The type of analysis I want to do is sentiment analysis (positive, neutral, negative), or I want to extract the main problem from the text. I don't know if the second one is possible.
I copied the emails listed in the Excel file into a .txt file, so now every line is one message. How can I use Python to analyze every single line as one message and show me the sentiment or the main problem?
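To show what I mean, this is roughly the loop I'm imagining, with NLTK's VADER just standing in for whatever sentiment tool turns out to be right (the file name is made up):

from nltk.sentiment import SentimentIntensityAnalyzer  # needs nltk.download("vader_lexicon")

sia = SentimentIntensityAnalyzer()

with open("messages.txt", encoding="utf-8") as f:  # one message per line
    for number, line in enumerate(f, start=1):
        message = line.strip()
        if not message:
            continue
        scores = sia.polarity_scores(message)  # {'neg': ..., 'neu': ..., 'pos': ..., 'compound': ...}
        if scores["compound"] >= 0.05:
            label = "positive"
        elif scores["compound"] <= -0.05:
            label = "negative"
        else:
            label = "neutral"
        print(number, label, scores["compound"])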
I'd appreciate the help.
Sentiment analysis is a fairly large problem in computer science/natural language processing. How specific did you want to get?
I'd recommend looking into Text-Processing for simple sentiment analysis.
Their API docs are here: http://text-processing.com/docs/sentiment.html. It will return a simple pos and neg score for your text.
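Per those docs, calling it is roughly a single POST with the text as a form field (endpoint and response shape are as described on that page, so double-check them there):

import requests

response = requests.post(
    "http://text-processing.com/api/sentiment/",
    data={"text": "This product broke after two days, very disappointed."},
)
result = response.json()
print(result["label"])        # "neg", "neutral" or "pos"
print(result["probability"])  # per-class probabilities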
If you want anything more specific, I'd recommend looking into IBM Watson, specifically Natural Language Understanding: https://www.ibm.com/watson/developercloud/natural-language-understanding.html
Apologies for the vague nature of this question but I'm honestly not quite sure where to start and thought I'd ask here for guidance.
As an exercise, I've downloaded several academic papers and stored them as plain text in a mongoDB database.
I'd like to write a search feature (using Python, R, whatever) where you enter text and it returns the most relevant articles. Clearly, "relevant" is really hard -- that's what Google got so right.
However, I'm not looking for it to be perfect. Just to get something. A few thoughts I had were:
1) Simple MongoDB full text search
2) Implement Lucene Search
3) Tag them (unsure how, though) and then return them sorted by the number of matching tags?
Is there a solution someone has used that's out of the box and works fairly well? I can always optimize the search feature later -- for now I just want all the pieces to move together...
Thanks!
Is there a solution someone has used that's out of the box and works fairly well?
It depends on how you define "well", but in simple terms, I'd say no. There is just no single, precise definition of "fairly well". A lot of challenges intrinsic to a particular problem arise when one tries to implement a good search algorithm. Those challenges lie in:
diversity of user needs: users in different fields have different intentions and, as a result, different expectations of a search results page;
diversity of natural languages, if you are trying to implement multi-language search (German has a lot of noun compounds, Russian has enormous inflectional variability, etc.).
There are some algorithms that are proven to work better than others, though, and are thus good starting points. TF-IDF and BM25 are the two most popular.
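As a rough illustration, plain TF-IDF ranking over a list of article texts fits in a few lines with scikit-learn (the documents here are placeholders for whatever you pull out of your store):

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

documents = [
    "Deep learning methods for image classification ...",
    "Bayesian approaches to clinical trial design ...",
    "Graph algorithms for social network analysis ...",
]  # placeholder article texts

vectorizer = TfidfVectorizer(stop_words="english")
doc_matrix = vectorizer.fit_transform(documents)

def search(query, top_n=5):
    # rank documents by cosine similarity between the query and document TF-IDF vectors
    query_vec = vectorizer.transform([query])
    scores = cosine_similarity(query_vec, doc_matrix).ravel()
    ranked = scores.argsort()[::-1][:top_n]
    return [(documents[i], scores[i]) for i in ranked]

print(search("bayesian statistics in clinical trials"))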
I can always optimize the search feature later -- for now I just want all the pieces to move together...
MongoDB or any RDBMS with full-text indexing support is good enough for a proof of concept, but if you need to optimize for search performance, you will need an inverted index (Solr/Lucene). From Solr/Lucene you will get the ability to manage:
how exactly words are stemmed (this is important for solving understemming/overstemming problems);
what counts as a word. Is "supercomputer" one word? What about "stackoverflow" or "OutOfBoundsException"?
synonyms and word expansion (should "O2" be found for an "oxygen" query?);
how exactly search is performed: which words can be ignored during search, which ones are required to be found, and which ones are required to be found near each other (think of the search phrases "not annealed" or "without expansion").
This is just what comes to mind first.
So if you are planning to work these things out, I definitely recommend Lucene as a framework, or Solr/Elasticsearch as a search system if you need to build a proof of concept fast. If not, MongoDB or an RDBMS will work well.
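For the MongoDB proof-of-concept route, the built-in text index is about this much code with pymongo (the database, collection and field names are assumptions about how the papers are stored):

from pymongo import MongoClient, TEXT

client = MongoClient("mongodb://localhost:27017")
papers = client["research"]["papers"]  # assumed database/collection names

# one-time setup: a full-text index over the stored plain text
papers.create_index([("body", TEXT)])

# query, sorted by MongoDB's built-in relevance score
cursor = papers.find(
    {"$text": {"$search": "convolutional neural networks"}},
    {"title": 1, "score": {"$meta": "textScore"}},
).sort([("score", {"$meta": "textScore"})]).limit(10)

for doc in cursor:
    print(doc.get("title"), doc["score"])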
I am doing a project on software plagiarism detection. I intend to do it for the C language. For that I am supposed to create a token generator and a parser, but I don't know where to start. Can anyone help me out with this?
I created a database of tokens and separated the tokens from my program. The next thing I want to do is compare two programs to find out whether one is plagiarized or not. For that I need to create a syntax analyzer. I don't know where to start.
I.e., I want to create a parser for C programs in Python.
If you want to create a parser in Python you can look at these libraries:
PLY
pyparsing
and Lepl - new but very powerful
Building a real C parser by yourself is a really big task.
I suggest you either find one that is already done, e.g. pycparser, or define a really simple subset of C that is easily parsed.
You'll have plenty of work to do for your plagiarism detector after you are done parsing C.
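For example, with pycparser you can go from C source to an AST in a few lines and walk it for whatever features you want to fingerprint (note that pycparser expects already-preprocessed C, i.e. no unresolved #include directives):

from pycparser import c_parser, c_ast

code = """
int add(int a, int b) {
    int result = a + b;
    return result;
}
"""

parser = c_parser.CParser()
ast = parser.parse(code, filename="<sample>")

class DeclCollector(c_ast.NodeVisitor):
    # walk the AST and collect declared names - one possible fingerprint for comparison
    def __init__(self):
        self.names = []
    def visit_Decl(self, node):
        self.names.append(node.name)
        self.generic_visit(node)

collector = DeclCollector()
collector.visit(ast)
print(collector.names)  # e.g. ['add', 'a', 'b', 'result']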
I'm not sure you need to parse the token stream to detect the features you're looking for. In fact, it's probably going to complicate things more than anything.
What you're really looking for are sequences of original source code that have a very strong similarity to a suspect code sample being tested. This sounds very similar to the purpose of a Bayes classifier, like those used in spam filtering and language detection.
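That said, if you already have the two programs reduced to token streams, an even simpler way to get a first similarity score (not a Bayes classifier, just a cheap stand-in to start with) is Python's difflib:

from difflib import SequenceMatcher

def token_similarity(tokens_a, tokens_b):
    # ratio in [0, 1] of how much the two token sequences overlap
    return SequenceMatcher(None, tokens_a, tokens_b).ratio()

# the token streams would come from your tokenizer; these two are made up
original = ["int", "ID", "(", "int", "ID", ")", "{", "return", "ID", "+", "1", ";", "}"]
suspect = ["int", "ID", "(", "int", "ID", ")", "{", "int", "ID", "=", "ID", "+", "1", ";", "return", "ID", ";", "}"]

score = token_similarity(original, suspect)
print("similarity:", round(score, 2))
if score > 0.8:  # threshold chosen arbitrarily
    print("flag for manual review")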