NLTK classifier precision and recall are always None (0) - Python

I have used Python NLTK library and the Naive Bayes classifier to detect if a string should be tagged "php" or not, based on training data (Stackoverflow questions in fact).
The classifier seems to find interesting features:
Most Informative Features
contains-word-isset = True True : False = 125.6 : 1.0
contains-word-echo = True True : False = 28.1 : 1.0
contains-word-php = True True : False = 17.1 : 1.0
contains-word-this- = True True : False = 16.0 : 1.0
contains-word-mysql = True True : False = 14.3 : 1.0
contains-word-_get = True True : False = 11.7 : 1.0
contains-word-foreach = True True : False = 7.6 : 1.0
Features are defined as follows:
def features(question):
    features = {}
    for token in detectorTokens:
        featureName = "contains-word-" + token
        features[featureName] = (token in question)
    return features
But it seems the classifier never tags a string as a "php" question.
Even a simple string like "is this a php question?" is classified as False.
Can anyone help me understand this phenomenon?
Here is some partial code (I have 3 or 4 pages of code, so this is just a small part):
classifier = nltk.NaiveBayesClassifier.train(train_set)
cross_valid_accuracy = nltk.classify.accuracy(classifier, cross_valid_set)

refsets = collections.defaultdict(set)
testsets = collections.defaultdict(set)
for i, (feats, label) in enumerate(cross_valid_set):
    refsets[label].add(i)
    observed = classifier.classify(feats)
    testsets[observed].add(i)

print 'Precision:', nltk.metrics.precision(refsets['pos'], testsets['pos'])
print 'Recall:', nltk.metrics.recall(refsets['pos'], testsets['pos'])
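For reference, a minimal sketch of the same metric lookups, under the assumption (not confirmed by the posted code) that the labels in cross_valid_set are the booleans True/False rather than 'pos'/'neg':

import collections
import nltk.metrics

refsets = collections.defaultdict(set)
testsets = collections.defaultdict(set)
for i, (feats, label) in enumerate(cross_valid_set):
    refsets[label].add(i)                        # indices grouped by true label
    testsets[classifier.classify(feats)].add(i)  # indices grouped by predicted label

# Query the metrics with label values that actually occur in the data;
# a key that never occurs (e.g. 'pos') gives empty sets and the metrics return None.
print('Precision:', nltk.metrics.precision(refsets[True], testsets[True]))
print('Recall:', nltk.metrics.recall(refsets[True], testsets[True]))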

Related

create new dataframe field using lambda function

I am trying to create new columns based on conditions on other columns (the data frame is already aggregated by user).
This is a sample of the data frame:
event_names                      country
["deleteobject", "getobject"]    ["us"]
["getobject"]                    ["ca"]
["deleteobject", "putobject"]    ["ch"]
I want to create 3 new columns:
was data deleted?
was data downloaded?
did the events come from my whitelisted countries?
WHITELISTED_COUNTRIES = ["us", "sg"]
like this:
event_names                      country  was_data_deleted?  was_data_downloaded?  whitelisted_country?
["deleteobject","getobject"]     ["us"]   True               True                  True
["getobject"]                    ["ca"]   False              True                  False
["deleteobject","putobject"]     ["ch"]   True               False                 False
This is what I tried so far:
result_df['was_data_deleted'] = result_df['event_name'].apply(lambda x:True if any("delete" in x for i in x) else False)
result_df['was_data_downloaded'] = result_df['event_name'].apply(lambda x:True if "getObject" in i for i in x else False)
result_df['strange_countries'] = result_df['country'].apply(lambda x:False if any(x in WHITELISTED_COUNTRIES for x in result_df['country']) else False)
I get the error "SyntaxError: invalid syntax".
Any ideas? Thanks!
df['was_data_deleted'] = df['event_names'].apply(lambda x: 'deleteobject' in x)
df['was_data_downloaded'] = df['event_names'].apply(lambda x: 'getobject' in x)
df['whitelisted_country'] = df['country'].apply(lambda x: x[0] in WHITELISTED_COUNTRIES)
print(df)
Prints:
                 event_names country  was_data_deleted  was_data_downloaded  whitelisted_country
0  [deleteobject, getobject]    [us]              True                 True                 True
1                [getobject]    [ca]             False                 True                False
2  [deleteobject, putobject]    [ch]              True                False                False
You can simplify your lambda functions by removing the if-else and the explicit True/False, because the comparisons already return booleans:
WHITELISTED_COUNTRIES = ["us", "sg"]

# check each event name for the substring "delete"
f1 = lambda x: any("delete" in i for i in x)
result_df['was_data_deleted'] = result_df['event_names'].apply(f1)

# check for the string "getobject"
f2 = lambda x: "getobject" in x
result_df['was_data_downloaded'] = result_df['event_names'].apply(f2)

# check whether any country in the list is whitelisted
f3 = lambda x: any(y in WHITELISTED_COUNTRIES for y in x)
result_df['strange_countries'] = result_df['country'].apply(f3)

print(result_df)
event_names country was_data_deleted was_data_downloaded \
0 [deleteobject, getobject] [us] True True
1 [getobject] [ca] False True
2 [deleteobject, putobject] [ch] True False
strange_countries
0 True
1 False
2 False
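As a side note, if each country list always holds exactly one element (an assumption based on the sample rows above), the whitelist column can also be computed without a lambda; a sketch:

# take the first element of each country list and test membership in the whitelist
df['whitelisted_country'] = df['country'].str[0].isin(WHITELISTED_COUNTRIES)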

Removing words when using the Naive Bayes NLTK classifier for survey data

I have a CSV file with survey data and I wish to perform a sentiment analysis on it.
I am using Naive Bayes to show the most informative features, but the output is not giving meaningful insight. It outputs irrelevant words such as "level" or "of", so I tried to manually create a list of stop words to remove, but I don't think it is working properly because they are still there. Here is my code:
import csv
from collections import Counter
import nltk
from nltk.corpus import stopwords

with open('/Users/Alessandra/Desktop/Dissertation Data/Survey Coding Inst.csv', 'r') as f:
    reader = csv.reader(f, delimiter='\t')
    alist = []
    iterreader = iter(reader)
    next(iterreader)
    c = Counter()
    for row in iterreader:
        clean_rows = row[0].replace(",", " ").rsplit()
        clean_symbols = row[0].replace("-", "").rsplit()
        remove_words = ['of', 'Level', 'study', 'How', 'many', 'SC', '2.', '1.', '3.', '4.', '5.', '6.', '7.', '8.',
                        '9.', '10.', '11.', '12.', '13.', '14.', '15.', 'Gender', 'inconvenience', 'times', 'Agree',
                        'Experience', 'Interrupted', 'Workflow', 'Unable', 'Yes', 'No', 'Statement', 'Safety',
                        'non-UCL', 'people', 'guards', 'Stronglee', 'Disagree', 'Neutral', 'Somewhat', 'on', 'if',
                        'too', '-', 'i', '1', '2']
        # alist.append(clean_rows)
        # alist.append(clean_symbols)
        c.update(clean_rows)
        c.update(clean_symbols)

    alist.append(c)
    word_count = Counter(c)
    mostWcommon = word_count.most_common()
    for i in alist:
        if i in remove_words:
            mostWcommon.remove(i)
    print(mostWcommon)

all_words = nltk.FreqDist(w.lower() for w in alist[0])
word_features = list(all_words)[:100]
english_stop_words = stopwords.words('english')

def remove_stop_words(corpus):
    removed_stop_words = []
    for review in corpus:
        removed_stop_words.append(' '.join([word for word in review[0].split() if word not in english_stop_words]))
    return removed_stop_words

no_stop_words = remove_stop_words(mostWcommon)

def document_features(document):
    document_words = set(document)
    features = {}
    for word in word_features:
        features['contains {}'.format(word)] = (word in document_words)
    return features

featuresets = [(document_features(d), c) for (d, c) in mostWcommon]
train_set, test_set = featuresets[100:], featuresets[:100]
classifier = nltk.NaiveBayesClassifier.train(train_set)
classifier.show_most_informative_features(5)
OUTPUT:
Most Informative Features
contains i = True 3 : 2 = 1.6 : 1.0
contains 1 = True 1 : 3 = 1.5 : 1.0
contains i = False 2 : 3 = 1.3 : 1.0
contains 2 = True 1 : 3 = 1.2 : 1.0
contains - = True 2 : 1 = 1.2 : 1.0
contains 1 = False 2 : 1 = 1.2 : 1.0
contains 2 = False 2 : 1 = 1.1 : 1.0
contains - = False 1 : 3 = 1.0 : 1.0
contains 5. = False 1 : 4 = 1.0 : 1.0
contains disagree = False 1 : 4 = 1.0 : 1.0
Data looks like this:
('Yes', 194), ('No', 173), ('agree', 61), ('Agree', 57), ('to', 48), ('UG', 47), ('Strongly', 38), ('and', 36), ('unlikely', 36), ('Female', 34), ('-', 34),....)
As you can see, even most_common is not picking up the manual removal, hence displaying less meaningful data... Any suggestions would be appreciated.
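One approach that usually works (a sketch, not the poster's code; the token list and the lower-cased remove_words set here are illustrative) is to filter the individual tokens before counting, rather than trying to remove items from the (word, count) tuples that most_common() returns:

from collections import Counter

# illustrative, lower-cased subset of the unwanted words
remove_words = {'of', 'level', 'gender', 'agree', 'yes', 'no', '-', 'i', '1', '2'}

tokens = ['Yes', 'No', 'agree', 'Agree', 'to', 'UG', 'Strongly', 'Female', '-']

# lower-case each token before comparing, so 'Level' and 'level' are both caught
filtered = [t.lower() for t in tokens if t.lower() not in remove_words]
word_count = Counter(filtered)
print(word_count.most_common())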

Better way to replace my function?

I have attached a JSON data link for download:
json data
Currently I have written the following function to get each level of children data into a combined dataframe:
def get_children(catMapping):
    level4 = json_normalize(catMapping['SuccessResponse']['Body'],
                            ['children', 'children', 'children', 'children', ['children']])
    level3 = json_normalize(catMapping['SuccessResponse']['Body'],
                            ['children', 'children', 'children', ['children']])
    level2 = json_normalize(catMapping['SuccessResponse']['Body'],
                            ['children', 'children', ['children']])
    level1 = json_normalize(catMapping['SuccessResponse']['Body'],
                            ['children', ['children']])
    level0 = json_normalize(catMapping['SuccessResponse']['Body'],
                            ['children'])
    combined = pd.concat([level0, level1, level2, level3, level4])
    combined = combined.reset_index(drop=True)
    return combined
It looks like this is not the recommended way, but I am unable to write a function that can traverse each level.
Can you please help me with a better function?
Here is a function that recursively iterates over all items:
import pandas as pd
import ast

with open(r"data.json", "r") as f:
    data = ast.literal_eval(f.read())

def nest_iter(items):
    for item in items:
        children_ids = [o["categoryId"] for o in item["children"]]
        ret_item = item.copy()
        ret_item["children"] = children_ids
        yield ret_item
        yield from nest_iter(item["children"])

df = pd.DataFrame(nest_iter(data['SuccessResponse']['Body']))
The result:
      categoryId                        children   leaf         name    var
....
4970    10001244                              []   True     Business  False
4971    10001245                              []   True       Casual  False
4972    10001246                              []   True      Fashion  False
4973    10001247                              []   True       Sports  False
4974        7756  [7761, 7758, 7757, 7759, 7760]  False        Women  False
4975        7761                              []   True  Accessories  False
4976        7758                              []   True     Business  False
4977        7757                              []   True       Casual  False
4978        7759                              []   True      Fashion  False
4979        7760                              []   True       Sports  False
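If some nodes in the file omit the children key entirely (an assumption about the data, since the linked file is not reproduced here), a slightly more defensive variant of the same generator could look like this:

def nest_iter(items):
    for item in items:
        # tolerate nodes without a "children" key (assumption about the data)
        children = item.get("children", [])
        ret_item = item.copy()
        ret_item["children"] = [o.get("categoryId") for o in children]
        yield ret_item
        yield from nest_iter(children)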

pandas custom file format parsing

I have data in the following format:
1_engineer_grade1 |Boolean IsMale IsNorthAmerican IsFromUSA |Name blah
2_lawyer_grade7 |Boolean IsFemale IsAlive |Children 2
I need to convert this into a dataframe with the following columns:
id  job       grade  Bool.IsMale  Bool.IsFemale  Bool.IsAlive  Bool.IsNorthAmerican  Bool.IsFromUSA  Name  Children
1   engineer  1      True         False          False         True                  True            blah  NaN
2   lawyer    7      False        True           True          True                  False           NaN   2
I could preprocess this data in python and then call pd.DataFrame on this, but I was wondering if there was a better way of doing this?
UPDATE: I ended up doing the following. If there are obvious optimizations, please let me know.
with open(vwfile, encoding='latin-1') as f:
    data = []
    for line in f:
        line = [x.strip() for x in line.strip().split('|')]
        # line == [
        #     "1_engineer_grade1",
        #     "Boolean IsMale IsNorthAmerican IsFromUSA",
        #     "Name blah"
        # ]
        ident, job, grade = line[0].split("_")
        features = line[1:]
        bools = {
            "IsMale": False,
            "IsFemale": False,
            "IsNorthAmerican": False,
            "IsFromUSA": False,
            "IsAlive": False,
        }
        others = {}
        for category in features:
            if category.startswith("Boolean "):
                for feature in category.split(' ')[1:]:
                    bools[feature] = True
            else:
                feature = category.split(" ")
                # feature == ["Name", "blah"]
                others[feature[0]] = feature[1]
        featuredict = {
            'ident': ident,
            'job': job,
            'grade': grade,
        }
        featuredict.update(bools)
        featuredict.update(others)
        data.append(featuredict)
df = pd.DataFrame(data)
UPDATE-2: A million-line file took about 55 seconds to process.
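For comparison, here is a hedged sketch of a variant that builds the Bool.-prefixed columns of the desired output directly; the two sample lines are inlined for illustration, and in practice they would come from the file as in the update above:

import re
import pandas as pd

# sample lines inlined for illustration
lines = [
    "1_engineer_grade1 |Boolean IsMale IsNorthAmerican IsFromUSA |Name blah",
    "2_lawyer_grade7 |Boolean IsFemale IsAlive |Children 2",
]

records = []
for line in lines:
    head, *fields = [part.strip() for part in line.split("|")]
    ident, job, grade = head.split("_")
    rec = {"id": ident, "job": job, "grade": re.sub(r"\D", "", grade)}
    for field in fields:
        name, *values = field.split()
        if name == "Boolean":
            # one Bool.<flag> column per listed flag
            rec.update({"Bool." + v: True for v in values})
        else:
            # free-form fields such as "Name blah" or "Children 2"
            rec[name] = values[0]
    records.append(rec)

df = pd.DataFrame(records)
bool_cols = [c for c in df.columns if c.startswith("Bool.")]
df[bool_cols] = df[bool_cols].fillna(False)  # absent flags default to False
print(df)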

Set limit feature_importances_ in DataFrame Pandas

I want to set a limit on my feature_importances_ output using a DataFrame.
Below is my code (referenced from this blog):
train = df_visualization.sample(frac=0.9,random_state=639)
test = df_visualization.drop(train.index)
train.to_csv('train.csv',encoding='utf-8')
test.to_csv('test.csv',encoding='utf-8')
train_dis = train.iloc[:,:66]
train_val = train_dis.values
train_in = train_val[:,:65]
train_out = train_val[:,65]
test_dis = test.iloc[:,:66]
test_val = test_dis.values
test_in = test_val[:,:65]
test_out = test_val[:,65]
dt = tree.DecisionTreeClassifier(random_state=59,criterion='entropy')
dt = dt.fit(train_in,train_out)
score = dt.score(train_in,train_out)
test_predicted = dt.predict(test_in)
# Print the feature ranking
print("Feature ranking:")
print (DataFrame(dt.feature_importances_, columns = ["Imp"], index = train.iloc[:,:65].columns).sort_values(['Imp'], ascending = False))
My problem now is that it displays all 65 features.
Output :
Imp
wbc 0.227780
age 0.100949
gcs 0.069359
hr 0.069270
rbs 0.053418
sbp 0.052067
Intubation-No 0.050729
... ...
Babinski-Normal 0.000000
ABG-Metabolic Alkolosis 0.000000
ABG-Respiratory Acidosis 0.000000
Reflexes-Unilateral Hyperreflexia 0.000000
NS-No 0.000000
For example, I want the top 5 features only.
Expected output:
Imp
wbc 0.227780
age 0.100949
gcs 0.069359
hr 0.069270
rbs 0.053418
Update:
I found a way to display them using itertuples.
display = pd.DataFrame(dt.feature_importances_, columns=["Imp"],
                       index=train.iloc[:, :65].columns).sort_values(['Imp'], ascending=False)
x = 0
for row, col in display.itertuples():
    if x < 5:
        print(row, "=", col)
    else:
        break
    x += 1
Output :
Feature ranking:
wbc = 0.227780409582
age = 0.100949241154
gcs = 0.0693593476192
hr = 0.069270425399
rbs = 0.0534175402602
But I want to know whether this is an efficient way to get the output.
Try this:
indices = np.argsort(dt.feature_importances_)[::-1]
for i in range(5):
    print(" %s = %s" % (feature_cols[indices[i]], dt.feature_importances_[indices[i]]))
