I have a dataset where each document has a corresponding score/rating:
dataset = [
{"text":"I don't like this small device", "rating":"2"},
{"text":"Really love this large device", "rating":"5"},
....
]
In addition, I have categories (variables) defined as term lists extracted from the text field of the same dataset:
x1 = [short, slim, small, shrink]
x2 = [big,huge,large]
So, how can I do a linear regression with the term lists above as multiple independent variables (or rather with variables representing the presence of any word from the corresponding term list, since each term in the lists is unique) and the rating as the dependent variable? In other words,
how could I evaluate the impact of the term lists on the rating with sklearn?
I used TfidfVectorizer to derive the document-term matrix. If possible, please provide a simple code snippet or example.
Given the discussion in the comments, it seems that the interpretation should be that each list defines a binary variable whose value depends on whether or not any words from the list appear in the text in question. So, let us first change the texts so that the words actually appear:
dataset = [
{"text": "I don't like this large device", "rating": "2"},
{"text": "Really love this small device", "rating": "5"},
{"text": "Some other text", "rating": "3"}
]
To simplify our work, we'll then load this data into a data frame, change the ratings to be integers, and create the relevant variables:
import pandas as pd

df = pd.DataFrame(dataset)
df['rating'] = df['rating'].astype(int)
df['text'] = df['text'].str.split().apply(set)
x1 = ['short', 'slim', 'small', 'shrink']
x2 = ['big', 'huge', 'large']
df['x1'] = df.text.apply(lambda x: x.intersection(x1)).astype(bool)
df['x2'] = df.text.apply(lambda x: x.intersection(x2)).astype(bool)
That is, at this point df is the following data frame:
   rating                                   text     x1     x2
0       2  {this, large, don't, like, device, I}  False   True
1       5    {this, small, love, Really, device}   True  False
2       3                    {other, Some, text}  False  False
With this, we can create the relevant model, and check what the coefficients end up being:
from sklearn.linear_model import LinearRegression

model = LinearRegression()
model.fit(df[['x1', 'x2']], df.rating)
print(model.coef_) # array([ 2., -1.])
print(model.intercept_) # 3.0
As also mentioned in the comments, this model will produce at most four distinct ratings, one for each of the combinations of x1 and x2 being True or False. In this case, it just so happens that all possible outputs are integers, but in general they need not be, nor need they be confined to the interval of interest. Given the ordinal nature of the ratings, this is really a case for some sort of ordinal regression (cf. e.g. mord).
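To make the at-most-four-outputs point concrete, here is a small check that evaluates the fitted model above on every True/False combination of the two indicators:
import itertools

# Predict for all four combinations of the two indicator variables
combos = pd.DataFrame(list(itertools.product([False, True], repeat=2)), columns=['x1', 'x2'])
combos['predicted_rating'] = model.predict(combos[['x1', 'x2']])
print(combos)  # predictions are intercept + 2*x1 - 1*x2, i.e. 3, 2, 5 and 4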
I'm using LDA with gensim for topic modeling. My data has 23 documents, and I want separate topics/words for each document, but gensim is giving topics for the entire set of documents together. How can I get them for individual docs?
dictionary = corpora.Dictionary(doc_clean)
# Converting the list of documents (corpus) into a Document-Term Matrix
# using the dictionary prepared above.
corpus = [dictionary.doc2bow(doc) for doc in doc_clean]
# Creating the object for LDA model using gensim library
Lda = gensim.models.ldamodel.LdaModel
# Running and training the LDA model on the document-term matrix.
ldamodel = Lda(corpus, num_topics=3, id2word = dictionary, passes=50)
result=ldamodel.print_topics(num_topics=3, num_words=3)
This is the output I'm getting:
[(0, '0.011*"plex" + 0.010*"game" + 0.009*"racing"'),
(1, '0.008*"app" + 0.008*"live" + 0.007*"share"'),
(2, '0.015*"device" + 0.009*"file" + 0.008*"movie"')]
print_topics() returns a list of topics, together with the words loading onto each topic and the weights of those words.
If you want the topic loadings per document, instead, you need to use get_document_topics().
From the gensim documentation:
get_document_topics(bow, minimum_probability=None, minimum_phi_value=None, per_word_topics=False)
Get the topic distribution for the given document.
Parameters:
bow (corpus : list of (int, float)) – The document in BOW format.
minimum_probability (float) – Topics with an assigned probability lower than this threshold will be discarded.
minimum_phi_value (float) – If per_word_topics is True, this represents a lower bound on the term probabilities that are included.
If set to None, a value of 1e-8 is used to prevent 0s.
per_word_topics (bool) – If True, this function will also return two extra lists as explained in the “Returns” section.
Returns:
list of (int, float) – Topic distribution for the whole document. Each element in the list is a pair of a topic’s id, and the probability that was assigned to it.
list of (int, list of (int, float), optional – Most probable topics per word. Each element in the list is a pair of a word’s id, and a list of topics sorted by their relevance to this word. Only returned if per_word_topics was set to True.
list of (int, list of float), optional – Phi relevance values, multiplied by the feature length, for each word-topic combination. Each element in the list is a pair of a word’s id and a list of the phi values between this word and each topic. Only returned if per_word_topics was set to True.
get_term_topics() and get_topic_terms() may also be potentially interesting for you.
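For example, a minimal sketch (reusing corpus and ldamodel from your snippet) that prints the topic distribution of every individual document:
for doc_id, bow in enumerate(corpus):
    # (topic_id, probability) pairs for this single document
    print(doc_id, ldamodel.get_document_topics(bow, minimum_probability=0.0))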
If I understand you correctly, you need to put the entire thing inside a loop and do print_topics():
Your documents example:
doc1 = "Brocolli is good to eat. My brother likes to eat good brocolli, but not my mother."
doc2 = "My mother spends a lot of time driving my brother around to baseball practice."
doc3 = "Some health experts suggest that driving may cause increased tension and blood pressure."
doc_set = [doc1, doc2, doc3]
Now your loop must iterate through your doc_set:
for doc in doc_set:
    ##### after all the cleaning in these steps, put the tokens of this single document into doc_clean #####
    dictionary = corpora.Dictionary(doc_clean)
    corpus = [dictionary.doc2bow(d) for d in doc_clean]
    ##### set the num_topics you want for each document, I set one for now #####
    ldamodel = gensim.models.ldamodel.LdaModel(corpus, num_topics=1, id2word=dictionary, passes=20)
    for topic in ldamodel.print_topics():
        print(topic)
    print('\n')
Sample output:
(0, '0.200*"brocolli" + 0.200*"eat" + 0.200*"good" + 0.133*"brother" + 0.133*"like" + 0.133*"mother"')
(0, '0.097*"brocolli" + 0.097*"eat" + 0.097*"good" + 0.097*"mother" + 0.097*"brother" + 0.065*"lot" + 0.065*"spend" + 0.065*"practic" + 0.065*"around" + 0.065*"basebal"')
(0, '0.060*"drive" + 0.060*"eat" + 0.060*"good" + 0.060*"mother" + 0.060*"brocolli" + 0.060*"brother" + 0.040*"pressur" + 0.040*"health" + 0.040*"caus" + 0.040*"increas"')
I am trying to get the doc2vec function to work in Python 3.
I have the following code:
tekstdata = [[ index, str(row["StatementOfTargetFiguresAndPoliciesForTheUnderrepresentedGender"])] for index, row in data.iterrows()]
def prep(x):
    low = x.lower()
    return word_tokenize(low)

def cleanMuch(data, clean):
    output = []
    for x, y in data:
        z = clean(y)
        output.append([str(x), z])
    return output

tekstdata = cleanMuch(tekstdata, prep)

def tagdocs(docs):
    output = []
    for x, y in docs:
        output.append(gensim.models.doc2vec.TaggedDocument(y, x))
    return output

tekstdata = tagdocs(tekstdata)
print(tekstdata[100])
vectorModel = gensim.models.doc2vec.Doc2Vec(tekstdata, size=100, window=4, min_count=3, iter=2)
ranks = []
second_ranks = []
for x, y in tekstdata:
    print(x)
    print(y)
    inferred_vector = vectorModel.infer_vector(y)
    sims = vectorModel.docvecs.most_similar([inferred_vector], topn=1001, restrict_vocab=None)
    rank = [docid for docid, sim in sims].index(y)
    ranks.append(rank)
Everything works, as far as I can understand, until the rank line.
The error I get is that the tag is not in my list, e.g. the documents I am putting in do not have '10' in the list:
File "C:/Users/Niels Helsø/Documents/github/Speciale/Test/Data prep.py", line 59, in <module>
rank = [docid for docid, sim in sims].index(y)
ValueError: '10' is not in list
It seems to me that it is the most_similar function that does not work.
The model trains on my data (1000 documents) and builds a vocab which is tagged.
The documentation I have mainly used is this:
Gensim documentation
Tutorial
I hope that someone can help. If any additional info is needed, please let me know.
Best,
Niels
If you're getting ValueError: '10' is not in list, you can rely on the fact that '10' is not in the list. So have you looked at the list, to see what is there, and if it matches what you expect?
It's not clear from your code excerpts that tagdocs() is ever called, and thus unclear what form tekstdata is in when provided to Doc2Vec. The intent is a bit convoluted, and there's nothing to display what the data appears as in its raw, original form.
But perhaps the tags you are supplying to TaggedDocument are not the required list-of-tags, but rather a simple string, which will be interpreted as a list-of-characters. As a result, even if you're supplying a tags of '10', it will be seen as ['1', '0'] – and len(vectorModel.doctags) will be just 10 (for the 10 single-digit strings).
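For example, a minimal sketch of the fix, mirroring your tagdocs() but wrapping the tag in a one-element list:
def tagdocs(docs):
    output = []
    for x, y in docs:
        # tags must be a list of tags; [str(x)] keeps '10' as a single tag instead of ['1', '0']
        output.append(gensim.models.doc2vec.TaggedDocument(words=y, tags=[str(x)]))
    return output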
Separate comments on your setup:
1000 documents is pretty small for Doc2Vec, where most published results use tens-of-thousands to millions of documents
an iter of 10-20 is more common in Doc2Vec work (and even larger values might be helpful with smaller datasets)
infer_vector() often works better with non-default values in its optional parameters, especially a steps that's much larger (20-200) or a starting alpha that's more like the bulk-training default (0.025); see the sketch below
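A rough, illustrative sketch only (doc_words stands for the token list of one document; the parameter values are examples, not recommendations):
# Illustrative values: more inference passes and a bulk-training-like starting alpha
inferred_vector = vectorModel.infer_vector(doc_words, steps=100, alpha=0.025)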
I am trying to train a naive bayes classifier and I am having troubles with the data. I plan to use it for extractive text summarization.
Example_Input: It was a sunny day. The weather was nice and the birds were singing.
Example_Output: The weather was nice and the birds were singing.
I have a dataset that I plan to use and in every document there is at least 1 sentence for summary.
I decided to use sklearn but I don't know how to represent the data that I have. Namely X and y.
from sklearn.naive_bayes import MultinomialNB
clf = MultinomialNB().fit(X, y)
The closest I can think of is to make it like this:
X = [
'It was a sunny day. The weather was nice and the birds were singing.',
'I like trains. Hi, again.'
]
y = [
[0,1],
[1,0]
]
where the target values mean 1 - included in the summary and 0 - not included. This unfortunately gives a bad-shape exception because y is expected to be a 1-d array. I cannot think of a way of representing it as such, so please help.
btw, I don't use the string values in X directly but represent them as vectors with CountVectorizer and TfidfTransformer from sklearn.
As per your requirement, you are classifying the data. That means you need to separate each sentence to predict its class.
For example, instead of using:
X = [
'It was a sunny day. The weather was nice and the birds were singing.',
'I like trains. Hi, again.'
]
Use it as following:
X = [
'It was a sunny day.',
'The weather was nice and the birds were singing.',
'I like trains.',
'Hi, again.'
]
Use NLTK's sentence tokenizer to achieve this.
Now, for the labels, use two classes, say 1 for yes (included in the summary) and 0 for no. Since sklearn expects y to be a 1-d array, use a flat list:
y = [0, 1, 1, 0]
Now, use this data to fit and predict the way you want!
Hope it helps!
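A minimal end-to-end sketch along these lines (the sentences and labels are the illustrative ones above; sent_tokenize needs NLTK's punkt data):
from nltk.tokenize import sent_tokenize
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

docs = [
    'It was a sunny day. The weather was nice and the birds were singing.',
    'I like trains. Hi, again.'
]
# Split the documents into individual sentences, which become the classification units
sentences = [sent for doc in docs for sent in sent_tokenize(doc)]
y = [0, 1, 1, 0]  # 1 = included in the summary, 0 = not included (illustrative labels)

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(sentences)
clf = MultinomialNB().fit(X, y)
print(clf.predict(vectorizer.transform(['The birds were singing again.'])))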
I am working on a keyword extraction problem. Consider the very general case:
from sklearn.feature_extraction.text import TfidfVectorizer
tfidf = TfidfVectorizer(tokenizer=tokenize, stop_words='english')
t = """Two Travellers, walking in the noonday sun, sought the shade of a widespreading tree to rest. As they lay looking up among the pleasant leaves, they saw that it was a Plane Tree.
"How useless is the Plane!" said one of them. "It bears no fruit whatever, and only serves to litter the ground with leaves."
"Ungrateful creatures!" said a voice from the Plane Tree. "You lie here in my cooling shade, and yet you say I am useless! Thus ungratefully, O Jupiter, do men receive their blessings!"
Our best blessings are often the least appreciated."""
tfs = tfidf.fit_transform(t.split(" "))
query = 'tree cat travellers fruit jupiter'
response = tfidf.transform([query])
feature_names = tfidf.get_feature_names()
for col in response.nonzero()[1]:
    print(feature_names[col], ' - ', response[0, col])
and this gives me
(0, 28) 0.443509712811
(0, 27) 0.517461475101
(0, 8) 0.517461475101
(0, 6) 0.517461475101
tree - 0.443509712811
travellers - 0.517461475101
jupiter - 0.517461475101
fruit - 0.517461475101
which is good. For any new document that comes in, is there a way to get the top n terms with the highest tfidf score?
You have to do a little bit of a song and dance to get the matrices as numpy arrays instead, but this should do what you're looking for:
import numpy as np

feature_array = np.array(tfidf.get_feature_names())
tfidf_sorting = np.argsort(response.toarray()).flatten()[::-1]
n = 3
top_n = feature_array[tfidf_sorting][:n]
This gives me:
array([u'fruit', u'travellers', u'jupiter'],
dtype='<U13')
The argsort call is really the useful one; here are the docs for it. We have to do [::-1] because argsort only supports sorting small to large. We call flatten to reduce the dimensions to 1d so that the sorted indices can be used to index the 1d feature array. Note that including the call to flatten will only work if you're testing one document at a time.
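If you do transform several documents at once, here is a quick sketch of the same idea applied row by row (reusing feature_array and n from above):
dense = response.toarray()  # one row per transformed document
for row in dense:
    top_n_idx = np.argsort(row)[::-1][:n]
    print(feature_array[top_n_idx])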
Also, on another note, did you mean something like tfs = tfidf.fit_transform(t.split("\n")) (or "\n\n" if the paragraphs are separated by blank lines)? Otherwise, each whitespace-separated term in the multiline string is being treated as a "document". Splitting on newlines instead means that we are actually looking at four documents (one per paragraph), which makes more sense when you think about tfidf.
Solution using sparse matrix itself (without .toarray())!
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
tfidf = TfidfVectorizer(stop_words='english')
corpus = [
'I would like to check this document',
'How about one more document',
'Aim is to capture the key words from the corpus',
'frequency of words in a document is called term frequency'
]
X = tfidf.fit_transform(corpus)
feature_names = np.array(tfidf.get_feature_names())
new_doc = ['can key words in this new document be identified?',
'idf is the inverse document frequency caculcated for each of the words']
responses = tfidf.transform(new_doc)
def get_top_tf_idf_words(response, top_n=2):
    sorted_nzs = np.argsort(response.data)[:-(top_n+1):-1]
    return feature_names[response.indices[sorted_nzs]]

print([get_top_tf_idf_words(response, 2) for response in responses])
# [array(['key', 'words'], dtype='<U9'),
#  array(['frequency', 'words'], dtype='<U9')]
Here is quick code for that (documents is a list of strings):
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

def get_tfidf_top_features(documents, n_top=10):
    tfidf_vectorizer = TfidfVectorizer(max_df=0.95, min_df=2, stop_words='english')
    tfidf = tfidf_vectorizer.fit_transform(documents)
    importance = np.argsort(np.asarray(tfidf.sum(axis=0)).ravel())[::-1]
    tfidf_feature_names = np.array(tfidf_vectorizer.get_feature_names())
    return tfidf_feature_names[importance[:n_top]]
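A quick illustrative usage (the documents list here is made up; note that this ranks terms by their summed tf-idf over the whole corpus rather than per document, and with a tiny corpus only a few terms survive the min_df/max_df thresholds):
docs = [
    'the cat sat on the mat',
    'the dog sat on the log',
    'the cat and the dog sat together'
]
print(get_tfidf_top_features(docs, n_top=3))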
I have a question concerning scikit-learn.
Is it possible to merge a multi-dimensional feature list into one feature vector?
For example:
I have results from an application analysis and I would like to represent an application with one feature vector.
In case of network traffic, an analysis result looks like the following:
traffic = [
{
"http_body": "http_body_data", "length": 1024
},
{
"http_body2": "http_body_data2", "length": 2048
},
... and many more
]
So each dict in the traffic list describes one network activity of a specific application.
I would like to generate a feature vector which contains all this information for one application, to be able to build a model out of the analysis results from a variety of applications.
How can I do this with scikit-learn?
Thank you in advance!
If every application sends responses of the same length (e.g. the first has length 1024, the second 2048, etc.), you can just join all the results into one vector.
For example, if the responses in traffic are serialized lists (for example, JSON):
import json

def merge_feature_vector(traffic):
    result = []
    for idx, data in enumerate(traffic):
        result.extend(json.loads(data['http_body%s' % idx]))
    return result
Another approach is to use sklearn's FeatureHasher.
For example:
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction import FeatureHasher

def encode_traffic(traffic):
    result = {}
    for idx, data in enumerate(traffic):
        result['http_body%s' % idx] = data['http_body%s' % idx]
    return result
...
features = [encode_traffic(traffic) for traffic in train]
h = FeatureHasher(n_features=10)
features = h.fit_transform(features)
features = features.toarray()  # optional: RandomForest also accepts the sparse matrix directly
clf = RandomForestClassifier(n_estimators=100)
clf.fit(features, labels)  # labels are the class labels for the applications in train
By the way, it mostly depends on what you have in http_body.
You can't have non-numeric features; the value of a feature should be a number.
So what do you do if you have: ip:127.0.0.1, ip:192.168.0.1, ip:220.220.220.220?
You create three features: ip_127.0.0.1, ip_192.168.0.1 and ip_220.220.220.220, and set the value of the first one to 1 and the other two to 0 if the value of ip is 127.0.0.1.
If ip can have, say, more than 10 values, you just create features for the 10 most common ones plus an ip_other feature, and set that one for all other samples that have a different IP address.
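A minimal sketch of that one-hot idea using sklearn's DictVectorizer (the IP values are the ones from the example above):
from sklearn.feature_extraction import DictVectorizer

samples = [{'ip': '127.0.0.1'}, {'ip': '192.168.0.1'}, {'ip': '220.220.220.220'}]
vec = DictVectorizer(sparse=False)
X = vec.fit_transform(samples)  # one binary column per distinct ip value
print(vec.get_feature_names())  # ['ip=127.0.0.1', 'ip=192.168.0.1', 'ip=220.220.220.220']
print(X)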