I am new to PySpark. Can anyone help me with a PySpark implementation of sentiment analysis? I have already written the plain Python implementation below. What changes need to be made?
import nltk
import sys
reload(sys)
sys.setdefaultencoding("utf-8")
from nltk.classify import NaiveBayesClassifier
def format_sentence(sent):
    return({word: True for word in nltk.word_tokenize(sent)})
#print(format_sentence("The cat is very cute"))
pos = []
with open("./pos_tweets.txt") as f:
    for i in f:
        pos.append([format_sentence(i), 'pos'])
neg = []
with open("./neg_tweets.txt") as fp:
    for i in fp:
        neg.append([format_sentence(i), 'neg'])
# next, split labeled data into the training and test data
training = pos[:int((.8)*len(pos))] + neg[:int((.8)*len(neg))]
test = pos[int((.8)*len(pos)):] + neg[int((.8)*len(neg)):]
classifier = NaiveBayesClassifier.train(training)
example1 = "no!"
print(classifier.classify(format_sentence(example1)))
The pattern would typically be:
Convert your data into a Spark DataFrame:
df = spark.read.csv('./neg_tweets.txt')
You can do the train/test split here:
df.randomSplit([0.8, 0.2])
Find a suitable model. If naive Bayes works for you, it will look something like this:
from pyspark.mllib.classification import NaiveBayes, NaiveBayesModel
Otherwise, there may not be a model built into spark.ml/mllib specifically for sentiment analysis, and you may need to look for external projects.
Iterate on the model and its tuning parameters.
You can run an evaluator for the metrics you decide are important to your problem. Some examples for binary classification problems are here:
https://spark.apache.org/docs/2.2.0/mllib-evaluation-metrics.html#binary-classification
metrics = BinaryClassificationMetrics(predictionAndLabels)
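Putting those steps together, here is a minimal sketch of what the DataFrame-based version could look like. It is not the only way to do it: the function name train_and_evaluate, the column names, the label encoding (1.0 for positive, 0.0 for negative) and the choice of Tokenizer plus HashingTF as features are all assumptions made for illustration. The file names come from the question, and it assumes one tweet per line.

```python
def train_and_evaluate(spark, pos_path="./pos_tweets.txt", neg_path="./neg_tweets.txt"):
    """Train a naive Bayes sentiment model with spark.ml and return (model, AUC).

    `spark` is an existing SparkSession. Everything here is a sketch under the
    assumptions stated above, not a tuned implementation.
    """
    # Imported here so the sketch can be loaded without PySpark installed.
    from pyspark.ml import Pipeline
    from pyspark.ml.classification import NaiveBayes
    from pyspark.ml.evaluation import BinaryClassificationEvaluator
    from pyspark.ml.feature import HashingTF, Tokenizer
    from pyspark.sql.functions import lit

    # spark.read.text yields one row per line in a single column named "value".
    pos = spark.read.text(pos_path).withColumn("label", lit(1.0))
    neg = spark.read.text(neg_path).withColumn("label", lit(0.0))
    data = pos.union(neg).withColumnRenamed("value", "text")

    # 80/20 split, mirroring the slicing in the original Python code.
    train, test = data.randomSplit([0.8, 0.2], seed=42)

    pipeline = Pipeline(stages=[
        Tokenizer(inputCol="text", outputCol="words"),      # lowercases, splits on whitespace
        HashingTF(inputCol="words", outputCol="features"),  # term-frequency vectors
        NaiveBayes(featuresCol="features", labelCol="label"),
    ])
    model = pipeline.fit(train)

    # Area under the ROC curve on the held-out 20%.
    evaluator = BinaryClassificationEvaluator(labelCol="label", metricName="areaUnderROC")
    return model, evaluator.evaluate(model.transform(test))
```

To classify a new sentence, you would build a one-row DataFrame with a "text" column and pass it to model.transform.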
I found this post while looking for a way to identify and clean abbreviations within my dataframe. The code works well for my use case.
However, I'm dealing with a large data set and was wondering if there is a more efficient way to apply this without running into memory issues.
To run the code snippet at all, I sampled 10% of the original dataset, and it runs perfectly. If I run it on the full dataset, my laptop locks up.
Below is the updated version of the original code:
import spacy
from scispacy.abbreviation import AbbreviationDetector
nlp = spacy.load("en_core_web_sm")
nlp.max_length = 43793966
abbreviation_pipe = AbbreviationDetector(nlp)
nlp.add_pipe(abbreviation_pipe)
text = [nlp(text, disable = ['ner', 'parser','tagger']) for text in train.text]
text = ' '.join([str(elem) for elem in text])
doc = nlp(text)
# Print the abbreviation and its definition
print("Abbreviation", "\t", "Definition")
for abrv in doc._.abbreviations:
    print(f"{abrv} \t ({abrv.start}, {abrv.end}) {abrv._.long_form}")
I have a university project to build a decision tree. I already have the code that creates the tree, but I want to print it. Can anyone help me?
#IMPORT ALL NECESSARY LIBRARIES
import Chefboost as chef
import pandas as pd
archivo = input("INSERT FILE NAMED FOLLOWED BY .CSV:\n")
# READ THE DATA SET FROM THE CSV FILE
df = pd.read_csv(str(archivo))
df.columns = ['ph', 'soil_temperature', 'soil_moisture', 'illuminance', 'env_temperature','env_humidity','Decision']
# print(df.head(10)) #UNCOMMENT IF WANT FIRST 10 ROWS PRINTED OUT
config = {'algorithm':'ID3'} # CONFIGURE THE ALGORITHM. CHOOSE BETWEEN ID3, C4.5, CART, Regression
model = chef.fit(df.copy(), config) #CREATE THE DECISION TREE BASED ON THE CONFIGURATION ABOVE
resultados = pd.DataFrame(columns = ["Real", "Predicción"]) #CREATE AN EMPTY PANDAS DATAFRAME
# SAVE ALL REAL VS ESTIMATED VALUES IN THE ABOVE DATAFRAME
for i in range(1,372):
    l = []
    l.append(df.iloc[i]['Decision'])
    feature = df.iloc[i]
    prediction = chef.predict(model, feature)
    l.append(prediction)
    resultados.loc[i] = l
    print(l)
Not knowing the Chefboost library, I can't directly answer your question, but when I am working with a new library, I will often use a few tools to help me understand what the library is giving me. Use dir(object) to get a listing of the attributes and methods of the object.
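As a rough illustration of that kind of inspection, here is a small sketch. The Example class is a hypothetical stand-in for whatever object the library hands back, and describe is a helper name I made up; only dir, getattr, callable and type are real built-ins.

```python
def describe(obj):
    """Split the public names reported by dir(obj) into methods and plain data."""
    names = [n for n in dir(obj) if not n.startswith("_")]
    methods = [n for n in names if callable(getattr(obj, n))]
    data = [n for n in names if n not in methods]
    return methods, data


class Example:  # hypothetical stand-in for the object returned by a library
    depth = 3

    def predict(self, row):
        return "yes"


methods, data = describe(Example())
print(type(Example()).__name__)  # Example
print("methods:", methods)       # methods: ['predict']
print("data:", data)             # data: ['depth']
```

Calling help(obj) on the same object prints its docstrings, which is often the quickest way to find out whether a library offers anything for displaying a model.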
You might also get a little more specific about what you want to see when you "Print the decision tree." Are you trying to print the model, or the predictions? What trouble are you having or what errors are you seeing?
Hope this helps.
The built-in classifier in textblob is pretty dumb. It's trained on movie reviews, so I created a huge set of examples in my context (57,000 stories, categorized as positive or negative) and then trained it using nltk. I tried using textblob to train it but it always failed:
with open('train.json', 'r') as fp:
    cl = NaiveBayesClassifier(fp, format="json")
That would run for hours and end in a memory error.
I looked at the source and found it was just using nltk and wrapping that, so I used that instead, and it worked.
The nltk training set needs to be a list of tuples. The first element of each tuple is a Counter mapping each word in the text to its frequency; the second element is 'pos' or 'neg' for the sentiment.
>>> train_set = [(Counter(i["text"].split()),i["label"]) for i in data[200:]]
>>> test_set = [(Counter(i["text"].split()),i["label"]) for i in data[:200]] # withholding 200 examples for testing later
>>> cl = nltk.NaiveBayesClassifier.train(train_set) # <-- this is the same thing textblob was using
>>> print("Classifier accuracy percent:",(nltk.classify.accuracy(cl, test_set))*100)
('Classifier accuracy percent:', 66.5)
>>> cl.show_most_informative_features(75)
Then I pickled it.
with open('storybayes.pickle','wb') as f:
    pickle.dump(cl,f)
Now I took this pickled file and reopened it to get the nltk classifier (<class 'nltk.classify.naivebayes.NaiveBayesClassifier'>), and tried to feed it into textblob. Instead of
from textblob.classifiers import NaiveBayesClassifier
blob = TextBlob("I love this library", analyzer=NaiveBayesAnalyzer())
I tried:
blob = TextBlob("I love this library", analyzer=myclassifier)
Traceback (most recent call last):
  File "<pyshell#116>", line 1, in <module>
    blob = TextBlob("I love this library", analyzer=cl4)
  File "C:\python\lib\site-packages\textblob\blob.py", line 369, in __init__
    parser, classifier)
  File "C:\python\lib\site-packages\textblob\blob.py", line 323, in _initialize_models
    BaseSentimentAnalyzer, BaseBlob.analyzer)
  File "C:\python\lib\site-packages\textblob\blob.py", line 305, in _validated_param
    .format(name=name, cls=base_class_name))
ValueError: analyzer must be an instance of BaseSentimentAnalyzer
What now? I looked at the source, and both are classes, but they are not quite the same.
I wasn't able to establish for certain that an nltk corpus cannot work with textblob. It would surprise me if it couldn't, since textblob imports the nltk functions in its source code and is basically a wrapper.
But what I did conclude after many hours of testing is that nltk offers a better built-in sentiment corpus called "vader" that outperformed all of my trained models.
import nltk
nltk.download('vader_lexicon') # do this once: grab the trained model from the web
from nltk.sentiment.vader import SentimentIntensityAnalyzer
Analyzer = SentimentIntensityAnalyzer()
Analyzer.polarity_scores("I find your lack of faith disturbing.")
{'neg': 0.491, 'neu': 0.263, 'pos': 0.246, 'compound': -0.4215}
CONCLUSION: NEGATIVE
vader_lexicon and the nltk code do a lot more parsing of negation in sentences in order to negate positive words. For example, when Darth Vader says "lack of faith", that flips the sentiment of "faith" to its opposite.
I explained it here, with examples of the better results:
https://chewychunks.wordpress.com/2018/06/19/sentiment-analysis-discovering-the-best-way-to-sort-positive-and-negative-feedback/
That replaces this textblob implementation:
from textblob import TextBlob
from textblob.sentiments import NaiveBayesAnalyzer
TextBlob("I find your lack of faith disturbing.", analyzer=NaiveBayesAnalyzer())
{'neg': 0.182, 'pos': 0.817, 'combined': 0.635}
CONCLUSION: POSITIVE
The vader nltk classifier also has additional documentation here on using it for sentiment analysis: http://www.nltk.org/howto/sentiment.html
TextBlob always crashed my computer with as few as 5,000 training examples.
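If you want a single label rather than the four VADER scores shown above, one common approach is to threshold the compound score. The sketch below assumes the conventional cut-off of 0.05 suggested in the VADER project's own documentation; the function name label_from_scores is made up for illustration, and the input dictionary is the output quoted earlier in this answer.

```python
def label_from_scores(scores, threshold=0.05):
    """Map a VADER polarity_scores() dict to 'positive', 'negative' or 'neutral'."""
    compound = scores["compound"]
    if compound >= threshold:
        return "positive"
    if compound <= -threshold:
        return "negative"
    return "neutral"


# Output of Analyzer.polarity_scores("I find your lack of faith disturbing.")
scores = {'neg': 0.491, 'neu': 0.263, 'pos': 0.246, 'compound': -0.4215}
print(label_from_scores(scores))  # negative
```

The threshold is worth tuning on your own data, since short feedback texts often cluster near zero.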
Going over the error message, it seems the analyzer must inherit from the abstract class BaseSentimentAnalyzer. As mentioned in the docs here, this class must implement the analyze(text) function. However, while checking the docs of NLTK's implementation, I could not find this method in its main documentation here or in its parent class ClassifierI here. Hence, I believe the two implementations cannot be combined unless you add an analyze function around NLTK's classifier to make it compatible with TextBlob's interface.
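A minimal sketch of such a wrapper might look like the following. Treat it as an untested idea rather than a confirmed recipe: the class name NLTKClassifierAnalyzer is made up, the import path textblob.base is my assumption about where BaseSentimentAnalyzer lives, and the Counter features simply mirror how the classifier in the question was trained. A stand-in base class is defined when textblob is not installed so the sketch still runs.

```python
from collections import Counter

try:
    # Assumed location of the real base class when textblob is installed.
    from textblob.base import BaseSentimentAnalyzer
except ImportError:
    # Stand-in so the sketch runs without textblob (illustration only).
    class BaseSentimentAnalyzer(object):
        def __init__(self):
            self._trained = False


class NLTKClassifierAnalyzer(BaseSentimentAnalyzer):
    """Hypothetical adapter exposing an NLTK-style classifier via analyze(text)."""

    def __init__(self, classifier):
        super().__init__()
        self._classifier = classifier
        self._trained = True  # the wrapped classifier was trained beforehand

    def analyze(self, text):
        # Build the same features used for training: word -> count.
        features = Counter(text.split())
        return self._classifier.classify(features)
```

With the unpickled classifier from the question, the call would then be something like TextBlob("I love this library", analyzer=NLTKClassifierAnalyzer(cl)).sentiment, which should return 'pos' or 'neg' instead of raising the ValueError.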
Another, more forward-looking solution is to use spaCy to build the model instead of textblob or nltk. This is new to me, but it seems a lot easier to use and more powerful:
https://spacy.io/usage/spacy-101#section-lightning-tour
"spaCy is the Ruby on Rails of natural language processing."
import spacy
import random
nlp = spacy.load('en') # loads the trained starter model here
train_data = [("Uber blew through $1 million", {'entities': [(0, 4, 'ORG')]})] # better model stuff
with nlp.disable_pipes(*[pipe for pipe in nlp.pipe_names if pipe != 'ner']):
    optimizer = nlp.begin_training()
    for i in range(10):
        random.shuffle(train_data)
        for text, annotations in train_data:
            nlp.update([text], [annotations], sgd=optimizer)
nlp.to_disk('/model')
I am working on a text classification problem in python using sklearn. I have created the model and saved it in a pickle.
Below is the code I used in sklearn.
vectorizerPipe = Pipeline([('tfidf', TfidfVectorizer(lowercase=True,
                                                     stop_words='english')),
                           ('classification', OneVsRestClassifier(LinearSVC(penalty='l2', loss='hinge'))),])
prd = vectorizerPipe.fit(features_used, labels_used)
f = open(file_path, 'wb')
pickle.dump(prd, f)
Is there any way to use this same pickle to get the output in DataFrame-based Apache Spark rather than RDD-based? I have gone through the following articles but didn't find a proper way to implement it.
what-is-the-recommended-way-to-distribute-a-scikit-learn-classifier-in-spark
how-to-do-prediction-with-sklearn-model-inside-spark
-> I found both of these questions on StackOverflow and found them useful.
deploy-a-python-model-more-efficiently-over-spark
I am a beginner in machine learning, so pardon me if the explanation is naive. Any related example or implementation would be helpful.
Convert the RDD to a Spark DataFrame, like:
import spark.implicits._
val testDF = rdd.map { line =>
  (line._1, line._2)
}.toDF("col1", "col2")
Today I started writing a script that trains LDA models on large corpora (at least 30M sentences) using the gensim library.
Here is the current code that I am using:
from gensim import corpora, models, similarities, matutils
def train_model(fname):
    logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)
    dictionary = corpora.Dictionary(line.lower().split() for line in open(fname))
    print "DOC2BOW"
    corpus = [dictionary.doc2bow(line.lower().split()) for line in open(fname)]
    print "running LDA"
    lda = gensim.models.ldamodel.LdaModel(corpus=corpus, id2word=dictionary, num_topics=100, update_every=1, chunksize=10000, passes=1)
Running this script on a small corpus (2M sentences), I realized that it needs about 7GB of RAM.
And when I try to run it on the larger corpora, it fails because of the memory issue.
The problem is obviously due to the fact that I am loading the corpus using this command:
corpus = [dictionary.doc2bow(line.lower().split()) for line in open(fname)]
But, I think there is no other way because I would need it for calling the LdaModel() method:
lda = gensim.models.ldamodel.LdaModel(corpus=corpus, id2word=dictionary, num_topics=100, update_every=1, chunksize=10000, passes=1)
I searched for a solution to this problem but I could not find anything helpful.
I would imagine this is a common problem, since these models are mostly trained on very large corpora (usually Wikipedia documents), so there should already be a solution for it.
Any ideas about this issue and the solution for it?
Consider wrapping your corpus up as an iterable and passing that instead of a list (a generator will not work).
From the tutorial:
class MyCorpus(object):
    def __iter__(self):
        for line in open(fname):
            # assume there's one document per line, tokens separated by whitespace
            yield dictionary.doc2bow(line.lower().split())
corpus = MyCorpus()
lda = gensim.models.ldamodel.LdaModel(corpus=corpus,
                                      id2word=dictionary,
                                      num_topics=100,
                                      update_every=1,
                                      chunksize=10000,
                                      passes=1)
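To see why a generator will not work while an object with __iter__ does, here is a small sketch that needs nothing but the standard library. The assumption is that the training code walks over the corpus more than once; the names LineCorpus and line_generator are made up, and the two-line temporary file stands in for the real corpus.

```python
import os
import tempfile


class LineCorpus:
    """Restartable corpus: every call to __iter__ reopens the file."""

    def __init__(self, path):
        self.path = path

    def __iter__(self):
        with open(self.path) as handle:
            for line in handle:
                yield line.lower().split()


def line_generator(path):
    """One-shot generator: once exhausted it yields nothing more."""
    with open(path) as handle:
        for line in handle:
            yield line.lower().split()


# Tiny stand-in corpus with one document per line.
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as tmp:
    tmp.write("Human machine interface\nGraph of trees\n")
    path = tmp.name

corpus = LineCorpus(path)
gen = line_generator(path)

# Count the documents seen on two consecutive passes over each.
restartable_passes = [len(list(corpus)) for _ in range(2)]
generator_passes = [len(list(gen)) for _ in range(2)]
os.remove(path)

print(restartable_passes)  # [2, 2]
print(generator_passes)    # [2, 0]
```

Either way only one line is held in memory at a time, which is what avoids the 7GB list from the question.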
Additionally, Gensim has several different corpus formats readily available, which can be found in the API reference. You might consider using TextCorpus, which should fit your format nicely already:
corpus = gensim.corpora.TextCorpus(fname)
lda = gensim.models.ldamodel.LdaModel(corpus=corpus,
                                      id2word=corpus.dictionary, # TextCorpus can build the dictionary for you
                                      num_topics=100,
                                      update_every=1,
                                      chunksize=10000,
                                      passes=1)