NLTK classifier precision and recall are always None (0) - Python

I have used Python NLTK library and the Naive Bayes classifier to detect if a string should be tagged "php" or not, based on training data (Stackoverflow questions in fact).
The classifier seems to find interesting features:
Most Informative Features
contains-word-isset = True True : False = 125.6 : 1.0
contains-word-echo = True True : False = 28.1 : 1.0
contains-word-php = True True : False = 17.1 : 1.0
contains-word-this- = True True : False = 16.0 : 1.0
contains-word-mysql = True True : False = 14.3 : 1.0
contains-word-_get = True True : False = 11.7 : 1.0
contains-word-foreach = True True : False = 7.6 : 1.0
Features are defined as follows:
def features(question):
    features = {}
    for token in detectorTokens:
        featureName = "contains-word-" + token
        features[featureName] = (token in question)
    return features
But it seems the classifier never tags a string as a "php" question.
Even a simple string like "is this a php question?" is classified as False.
Can anyone help me understand this phenomenon?
Here is some partial code (I have 3 or 4 pages of code, so this is just a small part):
classifier = nltk.NaiveBayesClassifier.train(train_set)
cross_valid_accuracy = nltk.classify.accuracy(classifier, cross_valid_set)

refsets = collections.defaultdict(set)
testsets = collections.defaultdict(set)
for i, (feats, label) in enumerate(cross_valid_set):
    refsets[label].add(i)
    observed = classifier.classify(feats)
    testsets[observed].add(i)

print 'Precision:', nltk.metrics.precision(refsets['pos'], testsets['pos'])
print 'Recall:', nltk.metrics.recall(refsets['pos'], testsets['pos'])
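For reference, a minimal sketch of the same metric lookups, under the assumption (not confirmed by the posted code) that the labels in cross_valid_set are the booleans True/False rather than 'pos'/'neg':

import collections
import nltk.metrics

refsets = collections.defaultdict(set)
testsets = collections.defaultdict(set)
for i, (feats, label) in enumerate(cross_valid_set):
    refsets[label].add(i)                        # indices grouped by true label
    testsets[classifier.classify(feats)].add(i)  # indices grouped by predicted label

# Query the metrics with label values that actually occur in the data;
# a key that never occurs (e.g. 'pos') gives empty sets and the metrics return None.
print('Precision:', nltk.metrics.precision(refsets[True], testsets[True]))
print('Recall:', nltk.metrics.recall(refsets[True], testsets[True]))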

Related

create new dataframe field using lambda function

I am trying to create new columns based on conditions on other columns (the data frame is already aggregated by user).
This is a sample of the data frame:
event_names                      country
["deleteobject", "getobject"]    ["us"]
["getobject"]                    ["ca"]
["deleteobject", "putobject"]    ["ch"]
I want to create 3 new columns:
was data deleted?
was data downloaded?
did the events come from my whitelisted countries?
WHITELISTED_COUNTRIES = ["us", "sg"]
like this:
event_names                      country  was_data_deleted?  was_data_downloaded?  whitelisted_country?
["deleteobject","getobject"]     ["us"]   True               True                  True
["getobject"]                    ["ca"]   False              True                  False
["deleteobject","putobject"]     ["ch"]   True               False                 False
This is what I tried so far:
result_df['was_data_deleted'] = result_df['event_name'].apply(lambda x:True if any("delete" in x for i in x) else False)
result_df['was_data_downloaded'] = result_df['event_name'].apply(lambda x:True if "getObject" in i for i in x else False)
result_df['strange_countries'] = result_df['country'].apply(lambda x:False if any(x in WHITELISTED_COUNTRIES for x in result_df['country']) else False)
I get the error "SyntaxError: invalid syntax".
Any ideas? Thanks!
df['was_data_deleted'] = df['event_names'].apply(lambda x: 'deleteobject' in x)
df['was_data_downloaded'] = df['event_names'].apply(lambda x: 'getobject' in x)
df['whitelisted_country'] = df['country'].apply(lambda x: x[0] in WHITELISTED_COUNTRIES)
print(df)
Prints:
                 event_names country  was_data_deleted  was_data_downloaded  whitelisted_country
0  [deleteobject, getobject]    [us]              True                 True                 True
1                [getobject]    [ca]             False                 True                False
2  [deleteobject, putobject]    [ch]              True                False                False
You can simplify your lambda functions by removing the if-else and the explicit True/False, because the comparisons already return booleans:
WHITELISTED_COUNTRIES = ["us", "sg"]

# check each event name for the substring "delete"
f1 = lambda x: any("delete" in i for i in x)
result_df['was_data_deleted'] = result_df['event_names'].apply(f1)

# check for the string "getobject"
f2 = lambda x: "getobject" in x
result_df['was_data_downloaded'] = result_df['event_names'].apply(f2)

# check whether any country in the list is whitelisted
f3 = lambda x: any(y in WHITELISTED_COUNTRIES for y in x)
result_df['strange_countries'] = result_df['country'].apply(f3)

print(result_df)
event_names country was_data_deleted was_data_downloaded \
0 [deleteobject, getobject] [us] True True
1 [getobject] [ca] False True
2 [deleteobject, putobject] [ch] True False
strange_countries
0 True
1 False
2 False
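As a side note, if each country list always holds exactly one element (an assumption based on the sample rows above), the whitelist column can also be computed without a lambda; a sketch:

# take the first element of each country list and test membership in the whitelist
df['whitelisted_country'] = df['country'].str[0].isin(WHITELISTED_COUNTRIES)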

Removing words when using the Naive Bayes NLTK classifier for survey data

I have a CSV file with survey data and I wish to perform a sentiment analysis on it.
I am using Naive Bayes to show the most informative features, but the output is not giving meaningful insight. It outputs irrelevant words such as "level" or "of", so I tried to manually create a list of stop words to remove, but I don't think it is working properly because they are still there. Here is my code:
import csv
from collections import Counter
import nltk
from nltk.corpus import stopwords

with open('/Users/Alessandra/Desktop/Dissertation Data/Survey Coding Inst.csv', 'r') as f:
    reader = csv.reader(f, delimiter='\t')
    alist = []
    iterreader = iter(reader)
    next(iterreader)
    c = Counter()
    for row in iterreader:
        clean_rows = row[0].replace(",", " ").rsplit()
        clean_symbols = row[0].replace("-", "").rsplit()
        remove_words = ['of', 'Level', 'study', 'How', 'many', 'SC', '2.', '1.', '3.', '4.', '5.', '6.', '7.', '8.',
                        '9.', '10.', '11.', '12.', '13.', '14.', '15.', 'Gender', 'inconvenience', 'times', 'Agree',
                        'Experience', 'Interrupted', 'Workflow', 'Unable', 'Yes', 'No', 'Statement', 'Safety',
                        'non-UCL', 'people', 'guards', 'Stronglee', 'Disagree', 'Neutral', 'Somewhat', 'on', 'if',
                        'too', '-', 'i', '1', '2']
        # alist.append(clean_rows)
        # alist.append(clean_symbols)
        c.update(clean_rows)
        c.update(clean_symbols)

    alist.append(c)
    word_count = Counter(c)
    mostWcommon = word_count.most_common()
    for i in alist:
        if i in remove_words:
            mostWcommon.remove(i)
    print(mostWcommon)

all_words = nltk.FreqDist(w.lower() for w in alist[0])
word_features = list(all_words)[:100]
english_stop_words = stopwords.words('english')

def remove_stop_words(corpus):
    removed_stop_words = []
    for review in corpus:
        removed_stop_words.append(' '.join([word for word in review[0].split() if word not in english_stop_words]))
    return removed_stop_words

no_stop_words = remove_stop_words(mostWcommon)

def document_features(document):
    document_words = set(document)
    features = {}
    for word in word_features:
        features['contains {}'.format(word)] = (word in document_words)
    return features

featuresets = [(document_features(d), c) for (d, c) in mostWcommon]
train_set, test_set = featuresets[100:], featuresets[:100]
classifier = nltk.NaiveBayesClassifier.train(train_set)
classifier.show_most_informative_features(5)
OUTPUT:
Most Informative Features
contains i = True 3 : 2 = 1.6 : 1.0
contains 1 = True 1 : 3 = 1.5 : 1.0
contains i = False 2 : 3 = 1.3 : 1.0
contains 2 = True 1 : 3 = 1.2 : 1.0
contains - = True 2 : 1 = 1.2 : 1.0
contains 1 = False 2 : 1 = 1.2 : 1.0
contains 2 = False 2 : 1 = 1.1 : 1.0
contains - = False 1 : 3 = 1.0 : 1.0
contains 5. = False 1 : 4 = 1.0 : 1.0
contains disagree = False 1 : 4 = 1.0 : 1.0
Data looks like this:
('Yes', 194), ('No', 173), ('agree', 61), ('Agree', 57), ('to', 48), ('UG', 47), ('Strongly', 38), ('and', 36), ('unlikely', 36), ('Female', 34), ('-', 34),....)
As you can see, even most_common is not picking up the manual removal, hence displaying less meaningful data... Any suggestions would be appreciated.
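One approach that usually works (a sketch, not the poster's code; the token list and the lower-cased remove_words set here are illustrative) is to filter the individual tokens before counting, rather than trying to remove items from the (word, count) tuples that most_common() returns:

from collections import Counter

# illustrative, lower-cased subset of the unwanted words
remove_words = {'of', 'level', 'gender', 'agree', 'yes', 'no', '-', 'i', '1', '2'}

tokens = ['Yes', 'No', 'agree', 'Agree', 'to', 'UG', 'Strongly', 'Female', '-']

# lower-case each token before comparing, so 'Level' and 'level' are both caught
filtered = [t.lower() for t in tokens if t.lower() not in remove_words]
word_count = Counter(filtered)
print(word_count.most_common())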

Better way to replace my function?

I have attached a JSON data link for download:
json data
Currently I have written the following function to get each level of children data into a combined dataframe:
def get_children(catMapping):
    level4 = json_normalize(catMapping['SuccessResponse']['Body'],
                            ['children', 'children', 'children', 'children', ['children']])
    level3 = json_normalize(catMapping['SuccessResponse']['Body'],
                            ['children', 'children', 'children', ['children']])
    level2 = json_normalize(catMapping['SuccessResponse']['Body'],
                            ['children', 'children', ['children']])
    level1 = json_normalize(catMapping['SuccessResponse']['Body'],
                            ['children', ['children']])
    level0 = json_normalize(catMapping['SuccessResponse']['Body'],
                            ['children'])
    combined = pd.concat([level0, level1, level2, level3, level4])
    combined = combined.reset_index(drop=True)
    return combined
It looks like this is not the recommended way, but I am unable to write a function that can traverse each level.
Can you please help me with a better function?
Here is a function that recursively iterates over all items:
import pandas as pd
import ast

with open(r"data.json", "r") as f:
    data = ast.literal_eval(f.read())

def nest_iter(items):
    for item in items:
        children_ids = [o["categoryId"] for o in item["children"]]
        ret_item = item.copy()
        ret_item["children"] = children_ids
        yield ret_item
        yield from nest_iter(item["children"])

df = pd.DataFrame(nest_iter(data['SuccessResponse']['Body']))
The result:
      categoryId                        children   leaf         name    var
....
4970    10001244                              []   True     Business  False
4971    10001245                              []   True       Casual  False
4972    10001246                              []   True      Fashion  False
4973    10001247                              []   True       Sports  False
4974        7756  [7761, 7758, 7757, 7759, 7760]  False        Women  False
4975        7761                              []   True  Accessories  False
4976        7758                              []   True     Business  False
4977        7757                              []   True       Casual  False
4978        7759                              []   True      Fashion  False
4979        7760                              []   True       Sports  False
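If some nodes in the file omit the children key entirely (an assumption about the data, since the linked file is not reproduced here), a slightly more defensive variant of the same generator could look like this:

def nest_iter(items):
    for item in items:
        # tolerate nodes without a "children" key (assumption about the data)
        children = item.get("children", [])
        ret_item = item.copy()
        ret_item["children"] = [o.get("categoryId") for o in children]
        yield ret_item
        yield from nest_iter(children)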

pandas custom file format parsing

I have data in the following format:
1_engineer_grade1 |Boolean IsMale IsNorthAmerican IsFromUSA |Name blah
2_lawyer_grade7 |Boolean IsFemale IsAlive |Children 2
I need to convert this into a dataframe with the following columns:
id  job       grade  Bool.IsMale  Bool.IsFemale  Bool.IsAlive  Bool.IsNorthAmerican  Bool.IsFromUSA  Name  Children
1   engineer  1      True         False          False         True                  True            blah  NaN
2   lawyer    7      False        True           True          True                  False           NaN   2
I could preprocess this data in python and then call pd.DataFrame on this, but I was wondering if there was a better way of doing this?
UPDATE: I ended up doing the following. If there are obvious optimizations, please let me know.
with open(vwfile, encoding='latin-1') as f:
    data = []
    for line in f:
        line = [x.strip() for x in line.strip().split('|')]
        # line == [
        #     "1_engineer_grade1",
        #     "Boolean IsMale IsNorthAmerican IsFromUSA",
        #     "Name blah"
        # ]
        ident, job, grade = line[0].split("_")
        features = line[1:]
        bools = {
            "IsMale": False,
            "IsFemale": False,
            "IsNorthAmerican": False,
            "IsFromUSA": False,
            "IsAlive": False,
        }
        others = {}
        for category in features:
            if category.startswith("Boolean "):
                for feature in category.split(' ')[1:]:
                    bools[feature] = True
            else:
                feature = category.split(" ")
                # feature == ["Name", "blah"]
                others[feature[0]] = feature[1]
        featuredict = {
            'ident': ident,
            'job': job,
            'grade': grade,
        }
        featuredict.update(bools)
        featuredict.update(others)
        data.append(featuredict)
df = pd.DataFrame(data)
UPDATE-2: A million-line file took about 55 seconds to process.
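For comparison, here is a hedged sketch of a variant that builds the Bool.-prefixed columns of the desired output directly; the two sample lines are inlined for illustration, and in practice they would come from the file as in the update above:

import re
import pandas as pd

# sample lines inlined for illustration
lines = [
    "1_engineer_grade1 |Boolean IsMale IsNorthAmerican IsFromUSA |Name blah",
    "2_lawyer_grade7 |Boolean IsFemale IsAlive |Children 2",
]

records = []
for line in lines:
    head, *fields = [part.strip() for part in line.split("|")]
    ident, job, grade = head.split("_")
    rec = {"id": ident, "job": job, "grade": re.sub(r"\D", "", grade)}
    for field in fields:
        name, *values = field.split()
        if name == "Boolean":
            # one Bool.<flag> column per listed flag
            rec.update({"Bool." + v: True for v in values})
        else:
            # free-form fields such as "Name blah" or "Children 2"
            rec[name] = values[0]
    records.append(rec)

df = pd.DataFrame(records)
bool_cols = [c for c in df.columns if c.startswith("Bool.")]
df[bool_cols] = df[bool_cols].fillna(False)  # absent flags default to False
print(df)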

Set limit feature_importances_ in DataFrame Pandas

I want to set a limit on my feature_importances_ output using a DataFrame.
Below is my code (referenced from this blog):
train = df_visualization.sample(frac=0.9,random_state=639)
test = df_visualization.drop(train.index)
train.to_csv('train.csv',encoding='utf-8')
test.to_csv('test.csv',encoding='utf-8')
train_dis = train.iloc[:,:66]
train_val = train_dis.values
train_in = train_val[:,:65]
train_out = train_val[:,65]
test_dis = test.iloc[:,:66]
test_val = test_dis.values
test_in = test_val[:,:65]
test_out = test_val[:,65]
dt = tree.DecisionTreeClassifier(random_state=59,criterion='entropy')
dt = dt.fit(train_in,train_out)
score = dt.score(train_in,train_out)
test_predicted = dt.predict(test_in)
# Print the feature ranking
print("Feature ranking:")
print (DataFrame(dt.feature_importances_, columns = ["Imp"], index = train.iloc[:,:65].columns).sort_values(['Imp'], ascending = False))
My problem now is that it displays all 65 features.
Output :
Imp
wbc 0.227780
age 0.100949
gcs 0.069359
hr 0.069270
rbs 0.053418
sbp 0.052067
Intubation-No 0.050729
... ...
Babinski-Normal 0.000000
ABG-Metabolic Alkolosis 0.000000
ABG-Respiratory Acidosis 0.000000
Reflexes-Unilateral Hyperreflexia 0.000000
NS-No 0.000000
For example, I want the top 5 features only.
Expected output:
Imp
wbc 0.227780
age 0.100949
gcs 0.069359
hr 0.069270
rbs 0.053418
Update:
I found a way to display them using itertuples.
display = pd.DataFrame(dt.feature_importances_, columns=["Imp"],
                       index=train.iloc[:, :65].columns).sort_values(['Imp'], ascending=False)
x = 0
for row, col in display.itertuples():
    if x < 5:
        print(row, "=", col)
    else:
        break
    x += 1
Output :
Feature ranking:
wbc = 0.227780409582
age = 0.100949241154
gcs = 0.0693593476192
hr = 0.069270425399
rbs = 0.0534175402602
But I want to know whether this is an efficient way to get the output.
Try this:
indices = np.argsort(dt.feature_importances_)[::-1]
for i in range(5):
    print(" %s = %s" % (feature_cols[indices[i]], dt.feature_importances_[indices[i]]))
