TF-IDF for data filtering - python

I have a list of raw documents, already filtered and with English stopwords removed:
rawDocuments = ['sport british english sports american english includes forms competitive physical activity games casual organised ...', 'disaster serious disruption occurring relatively short time functioning community society involving ...', 'government system group people governing organized community often state case broad associative definition ...', 'technology science craft greek τέχνη techne art skill cunning hand λογία logia collection techniques ...']
and I've used
from sklearn.feature_extraction.text import TfidfVectorizer
sklearn_tfidf = TfidfVectorizer(norm='l2', min_df=0, use_idf=True, smooth_idf=False, sublinear_tf=False)
sklearn_representation = sklearn_tfidf.fit_transform(rawDocuments)
But I got a
<4x50 sparse matrix of type '<class 'numpy.float64'>'
with 51 stored elements in Compressed Sparse Row format>
and I can't interpret the result. So, am I using the right tool, or do I need to change my approach?
My goal is to get the relevant words in each document, in order to compute a cosine similarity against the words in a query document.
Thank you in advance.

Very often, the Pandas module can be used to better visualize your data:
Demo:
import pandas as pd
df = pd.SparseDataFrame(sklearn_tfidf.fit_transform(rawDocuments),
                        columns=sklearn_tfidf.get_feature_names(),
                        default_fill_value=0)
Result:
In [85]: df
Out[85]:
activity american art associative british ... system techne techniques technology time
0 0.25 0.25 0.000000 0.000000 0.25 ... 0.000000 0.000000 0.000000 0.000000 0.000000
1 0.00 0.00 0.000000 0.000000 0.00 ... 0.000000 0.000000 0.000000 0.000000 0.308556
2 0.00 0.00 0.000000 0.282804 0.00 ... 0.282804 0.000000 0.000000 0.000000 0.000000
3 0.00 0.00 0.288675 0.000000 0.00 ... 0.000000 0.288675 0.288675 0.288675 0.000000
[4 rows x 48 columns]
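To address the stated goal directly, the fitted vectorizer can also score a query document against the corpus (a minimal sketch; the query string is hypothetical and would need the same stopword filtering as rawDocuments):
from sklearn.metrics.pairwise import cosine_similarity

query = ['competitive physical activity games']          # hypothetical preprocessed query
query_vec = sklearn_tfidf.transform(query)               # reuse the fitted vocabulary
similarities = cosine_similarity(query_vec, sklearn_representation).ravel()
print(similarities)   # one score per document; the largest is the closest match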


Pandas GroupBy to calculate weighted percentages meeting a certain condition

I have a dataframe with survey data like so, with each row being a different respondent.
weight race Question_1 Question_2 Question_3
0.9 white 1 5 4
1.1 asian 5 4 3
0.95 white 2 1 5
1.25 black 5 4 3
0.80 other 4 5 2
Each question is on a scale from 1 to 5 (there are several more questions in the actual data). For each question, I am trying to calculate the percentage of respondents who responded with a 5, grouped by race and weighted by the weight column.
I believe that the code below works for calculating the percentage who responded with a 5 for each question, grouped by race. But I do not know how to weight it by the weight column.
df.groupby('race').apply(lambda x: ((x == 5).sum()) / x.count())
I am new to pandas. Could someone please explain how to do this? Thanks for any help.
Edit: The desired output for the above dataframe would look something like this. Obviously the real data has far more respondents (rows) and many more questions.
Question_1 Question_2 Question_3
white 0.00 0.49 0.51
black 1.00 0.00 0.00
asian 1.00 0.00 0.00
other 0.00 1.00 0.00
Thank you.
Here is a solution that defines a custom function and applies it to each question column, then concatenates the results into a dataframe:
def wavg(x, col):
    return (x['weight'] * (x[col] == 5)).sum() / x['weight'].sum()

grouped = df.groupby('race')
pd.concat([grouped.apply(wavg, col) for col in df.columns if col.startswith('Question')], axis=1)\
  .rename(columns={num: f'Question_{num+1}' for num in range(3)})
Output:
Question_1 Question_2 Question_3
race
asian 1.0 0.000000 0.000000
black 1.0 0.000000 0.000000
other 0.0 1.000000 0.000000
white 0.0 0.486486 0.513514
Here's how you could do it for question 1. You can easily generalize it for the other questions.
import numpy as np

# Define a dummy indicating a '5' response
df['Q1'] = np.where(df['Question_1'] == 5, 1, 0)
# Create a weighted version of the above dummy
df['Q1_w'] = df['Q1'] * df['weight']
# Compute the sum by race
ds = df.groupby(['race'])[['Q1_w', 'weight']].sum()
# Compute the weighted average
ds['avg'] = ds['Q1_w'] / ds['weight']
Basically, you first take, by race, the sum of the weights and the sum of the weighted 5-dummy, and then divide the latter by the former. This gives you the weighted average.
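If you want the same weighted percentage for every question at once, a small loop over the question columns gives the same table as the first answer (a sketch, assuming the columns are named Question_*):
import pandas as pd

question_cols = [c for c in df.columns if c.startswith('Question')]
out = {}
for col in question_cols:
    weighted_fives = (df[col] == 5) * df['weight']        # weight only the 5-responses
    out[col] = weighted_fives.groupby(df['race']).sum() / df['weight'].groupby(df['race']).sum()
result = pd.DataFrame(out)
print(result)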

token-pattern for numbers in tfidfvectorizer sklearn in python

I need to calculate the TF-IDF matrix for a few sentences. The sentences include both numbers and words.
I am using the code below to do so:
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
data1 = ['1/8 wire', '4 tube', '1-1/4 brush']
dataset = pd.DataFrame(data1, columns=['des'])
vectorizer1 = TfidfVectorizer(lowercase=False)
tf_idf_matrix = pd.DataFrame(vectorizer1.fit_transform(dataset['des']).toarray(),
                             columns=vectorizer1.get_feature_names())
The TfidfVectorizer is considering only words as its vocabulary, i.e.
Out[3]: ['brush', 'tube', 'wire']
but I need the numbers to be part of the tokens as well. Expected:
Out[3]: ['brush', 'tube', 'wire','1/8','4','1-1/4']
After reading the TfidfVectorizer documentation, I understand that I have to change the token_pattern or tokenizer parameters, but I don't see how to change them so that numbers and punctuation are considered.
Can anyone please tell me how to change these parameters?
You're right that token_pattern requires a custom regex pattern. Pass a regex that treats any run of one or more non-whitespace characters as a single token:
tfidf = TfidfVectorizer(lowercase=False, token_pattern=r'\S+')
tf_idf_matrix = pd.DataFrame(
    tfidf.fit_transform(dataset['des']).toarray(),
    columns=tfidf.get_feature_names()
)
print(tf_idf_matrix)
1-1/4 1/8 4 brush tube wire
0 0.000000 0.707107 0.000000 0.000000 0.000000 0.707107
1 0.000000 0.000000 0.707107 0.000000 0.707107 0.000000
2 0.707107 0.000000 0.000000 0.707107 0.000000 0.000000
Alternatively, you can explicitly specify in the token_pattern parameter the symbols you would like to parse:
token_pattern_ = r'([a-zA-Z0-9-/]{1,})'
where {1,} indicates the minimum number of characters a token should contain. Then you pass this as the token_pattern parameter:
tfidf = TfidfVectorizer(token_pattern = token_pattern_)
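For completeness, this is how the second pattern could be applied to the data from the question (a usage sketch; the comment shows the vocabulary the pattern should produce):
tfidf = TfidfVectorizer(lowercase=False, token_pattern=token_pattern_)
tf_idf_matrix = pd.DataFrame(tfidf.fit_transform(dataset['des']).toarray(),
                             columns=tfidf.get_feature_names())
print(tfidf.get_feature_names())   # expected to include '1/8', '4' and '1-1/4' alongside the words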

TfIDf Vectorizer weights

Hi, I have lemmatized text in the format shown by lemma below. I want to get the TF-IDF score for each word. This is the function that I wrote:
import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
lemma=["'Ah", 'yes', u'say', 'softly', 'Harry',
'Potter', 'Our', 'new', 'celebrity', 'You',
'learn', 'subtle', 'science', 'exact', 'art',
'potion-making', u'begin', 'He', u'speak', 'barely',
'whisper', 'caught', 'every', 'word', 'like',
'Professor', 'McGonagall', 'Snape', 'gift',
u'keep', 'class', 'silent', 'without', 'effort',
'As', 'little', 'foolish', 'wand-waving', 'many',
'hardly', 'believe', 'magic', 'I', 'dont', 'expect', 'really',
'understand', 'beauty']
def Tfidf_Vectorize(lemmas_name):
    vect = TfidfVectorizer(stop_words='english', ngram_range=(1, 2))
    vect_transform = vect.fit_transform(lemmas_name)

    # First approach: create a dataframe of weights & feature names
    vect_score = np.asarray(vect_transform.mean(axis=0)).ravel().tolist()
    vect_array = pd.DataFrame({'term': vect.get_feature_names(), 'weight': vect_score})
    vect_array.sort_values(by='weight', ascending=False, inplace=True)

    # Second approach: get the feature names sorted by max score
    vect_fn = np.array(vect.get_feature_names())
    sorted_tfidf_index = vect_transform.max(0).toarray()[0].argsort()
    print('Largest Tfidf:\n{}\n'.format(vect_fn[sorted_tfidf_index[:-11:-1]]))

    return vect_array

tf_dataframe = Tfidf_Vectorize(lemma)
print(tf_dataframe.iloc[:5, :])
The output I am getting by:
print('Largest Tfidf:\n{}\n'.format(vect_fn[sorted_tfidf_index[:-11:-1]]))
is
Largest Tfidf:
[u'yes' u'fools' u'fury' u'gale' u'ghosts' u'gift' u'glory' u'glow' u'good'
u'granger']
The result of tf_dataframe
term weight
261 snape 0.027875
238 say 0.022648
211 potter 0.013937
181 mind 0.010453
123 harry 0.010453
60 dark 0.006969
75 dumbledore 0.006969
311 voice 0.005226
125 head 0.005226
231 ron 0.005226
Shouldn't both approaches lead to the same top features? I just want to calculate the tf-idf scores and get the top 5 features/weights. What am I doing wrong?
I am not sure what I am looking at here, but I have the feeling that you're using TfidfVectorizer incorrectly. However, please correct me in case I got the wrong idea of what you're trying to do.
So, what you need is a list of documents, which you feed to fit_transform(). From that you can construct a matrix where, for example, each column represents a document and each row a word. One cell in that matrix is the tf-idf score of word i in document j.
Here's an example:
import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer

documents = [
    "This is a document.",
    "This is another document with slightly more text.",
    "Whereas this is yet another document with even more text than the other ones.",
    "This document is awesome and also rather long.",
    "The car he drove was red."
]

document_names = ['Doc {:d}'.format(i) for i in range(len(documents))]

def get_tfidf(docs, ngram_range=(1, 1), index=None):
    vect = TfidfVectorizer(stop_words='english', ngram_range=ngram_range)
    tfidf = vect.fit_transform(docs).todense()
    return pd.DataFrame(tfidf, columns=vect.get_feature_names(), index=index).T

print(get_tfidf(documents, ngram_range=(1, 2), index=document_names))
Which will give you:
Doc 0 Doc 1 Doc 2 Doc 3 Doc 4
awesome 0.0 0.000000 0.000000 0.481270 0.000000
awesome long 0.0 0.000000 0.000000 0.481270 0.000000
car 0.0 0.000000 0.000000 0.000000 0.447214
car drove 0.0 0.000000 0.000000 0.000000 0.447214
document 1.0 0.282814 0.282814 0.271139 0.000000
document awesome 0.0 0.000000 0.000000 0.481270 0.000000
document slightly 0.0 0.501992 0.000000 0.000000 0.000000
document text 0.0 0.000000 0.501992 0.000000 0.000000
drove 0.0 0.000000 0.000000 0.000000 0.447214
drove red 0.0 0.000000 0.000000 0.000000 0.447214
long 0.0 0.000000 0.000000 0.481270 0.000000
ones 0.0 0.000000 0.501992 0.000000 0.000000
red 0.0 0.000000 0.000000 0.000000 0.447214
slightly 0.0 0.501992 0.000000 0.000000 0.000000
slightly text 0.0 0.501992 0.000000 0.000000 0.000000
text 0.0 0.405004 0.405004 0.000000 0.000000
text ones 0.0 0.000000 0.501992 0.000000 0.000000
The two methods you show for getting words and their respective scores calculate the mean over all documents and fetch the max score of each word, respectively.
So let's do this and compare the two methods:
df = get_tfidf(documents, ngram_range=(1,2), index=document_names)
print(pd.DataFrame([df.mean(1), df.max(1)], index=['score_mean', 'score_max']).T)
We can see that the scores are of course different.
score_mean score_max
awesome 0.096254 0.481270
awesome long 0.096254 0.481270
car 0.089443 0.447214
car drove 0.089443 0.447214
document 0.367353 1.000000
document awesome 0.096254 0.481270
document slightly 0.100398 0.501992
document text 0.100398 0.501992
drove 0.089443 0.447214
drove red 0.089443 0.447214
long 0.096254 0.481270
ones 0.100398 0.501992
red 0.089443 0.447214
slightly 0.100398 0.501992
slightly text 0.100398 0.501992
text 0.162002 0.405004
text ones 0.100398 0.501992
Note:
You can convince yourself that this does the same as calling mean/max on the sparse matrix returned by the TfidfVectorizer:
vect = TfidfVectorizer(stop_words='english', ngram_range=(1,2))
tfidf = vect.fit_transform(documents)
print(tfidf.max(0))
print(tfidf.mean(0))
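If the lemma list really represents a single text, one (hypothetical) way to line it up with the above is to join the tokens back into a document string and vectorize a list of such strings, so that fit_transform() sees documents rather than isolated words:
# Sketch only: the second string is a made-up placeholder so the IDF has
# more than one document to work with.
docs = [' '.join(lemma),
        'placeholder text for a second document']
print(get_tfidf(docs, ngram_range=(1, 2), index=['lemma doc', 'other doc']))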

Writing XY coordinates from an ASCII file with no feature part ID using Python

I need to read an ASCII file containing X and Y coordinates as well as a Z value using Python. These will be written as features in a feature class in ArcMap. The points make up polygons, and each feature is separated by a row containing '999.0 999.0 999.0', as shown in the example. I'm wondering what the best way is to separate each feature, as there is no feature ID column.
329462.713287 8981177.910780 0.000000
331660.441771 8981187.405700 0.000000
331669.945462 8978975.695090 0.000000
329472.340912 8978966.180280 0.000000
329462.713287 8981177.910780 0.000000
999.0 999.0 999.0
297517.590475 8981318.596530 0.000000
299715.649732 8981329.876880 0.000000
299726.953175 8979117.630860 0.000000
297529.017922 8979106.326860 0.000000
297517.590475 8981318.596530 0.000000
999.0 999.0 999.0
Simply iterate over the data line by line, check whether the line contains your magic triplet, and when you hit that line, increase the feature index.
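A minimal sketch of that approach in plain Python (the file name is hypothetical; writing the polygons into an ArcMap feature class with arcpy is left out):
features = []     # list of features, each a list of (x, y, z) tuples
current = []      # points of the feature currently being read

with open('coordinates.txt') as f:            # hypothetical input file
    for line in f:
        parts = line.split()
        if not parts:
            continue
        x, y, z = (float(v) for v in parts[:3])
        if (x, y, z) == (999.0, 999.0, 999.0):
            # sentinel row: close the current feature and start a new one
            if current:
                features.append(current)
            current = []
        else:
            current.append((x, y, z))

if current:                                    # trailing feature without a sentinel
    features.append(current)

print(len(features), 'features read')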

Is there any nipype interface for avscale (FSL script)?

I am trying to use nipype to analyze transformation matrices that were created by FSL.
FSL has a script called "avscale" that analyzes those transformation matrices (*.mat files).
I was wondering whether nipype has an interface that wraps that script and makes it possible to work with its output.
Thanks
Based on the docs and the current source, the answer is no. avscale has also not been mentioned on the nipy-devel mailing list since at least last February. It's possible that Nipype already wraps something else that does this (perhaps with a MATLAB wrapper?). You could try opening an issue or asking on the mailing list.
Since you're trying to use Python anyway (with nipype and all), maybe the philosophy of the nipype project is that you should just use numpy/scipy for this? That's just a guess; I don't know the functions needed to replicate this output with those tools. It's also possible that no one has gotten around to adding it yet.
For the uninitiated, avscale takes this affine matrix:
1.00614 -8.39414e-06 0 -0.757356
0 1.00511 -0.00317841 -0.412038
0 0.0019063 1.00735 -0.953364
0 0 0 1
and yields this or similar output:
Rotation & Translation Matrix:
1.000000 0.000000 0.000000 -0.757356
0.000000 0.999998 -0.001897 -0.412038
0.000000 0.001897 0.999998 -0.953364
0.000000 0.000000 0.000000 1.000000
Scales (x,y,z) = 1.006140 1.005112 1.007354
Skews (xy,xz,yz) = -0.000008 0.000000 -0.001259
Average scaling = 1.0062
Determinant = 1.01872
Left-Right orientation: preserved
Forward half transform =
1.003065 -0.000004 -0.000000 -0.378099
0.000000 1.002552 -0.001583 -0.206133
0.000000 0.000951 1.003669 -0.475711
0.000000 0.000000 0.000000 1.000000
Backward half transform =
0.996944 0.000004 0.000000 0.376944
0.000000 0.997452 0.001575 0.206357
0.000000 -0.000944 0.996343 0.473777
0.000000 0.000000 0.000000 1.000000
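If you do end up reproducing parts of this with numpy, a rough sketch might look like the following (assumptions: the .mat file is a plain-text 4x4 FLIRT affine, the file name is hypothetical, and the scales are approximated as the column norms of the 3x3 part, which is not guaranteed to match avscale's exact decomposition):
import numpy as np

affine = np.loadtxt('example_func2highres.mat')   # hypothetical FLIRT .mat file
A = affine[:3, :3]                                # 3x3 linear part

scales = np.linalg.norm(A, axis=0)                # approximate per-axis scale factors
determinant = np.linalg.det(A)

print('Scales (x,y,z) =', scales)
print('Average scaling =', scales.mean())
print('Determinant =', determinant)
print('Left-Right orientation:', 'preserved' if determinant > 0 else 'swapped')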
