I learned how to customize Stanford NER (Named Entity Recognizer) in Java from here:
http://nlp.stanford.edu/software/crf-faq.shtml#a
But I am developing my project in Python, and here I need to train my classifier with some custom entities.
I searched a lot for a solution but could not find any. Any idea? If it is not possible, is there any other way to train my classifier with custom entities, e.g. with nltk or another Python library?
EDIT: Code addition
This is what I did to set up and test Stanford NER which worked nicely:
from nltk.tag.stanford import StanfordNERTagger
path_to_model = "C:\..\stanford-ner-2016-10-31\classifiers\english.all.3class.distsim.crf.ser"
path_to_jar = "C:\..\stanford-ner-2016-10-31\stanford-ner.jar"
nertagger=StanfordNERTagger(path_to_model, path_to_jar)
query="Show me the best eye doctor in Munich"
print(nertagger.tag(query.split()))
This code worked successfully. Then I downloaded the sample austen.prop file together with the jane-austen-emma-ch1.tsv and jane-austen-emma-ch2.tsv files, and put them in a custom folder inside the Stanford NER directory. I modified the jane-austen-emma-ch1.tsv file with my custom entity tags. The austen.prop file references the jane-austen-emma-ch1.tsv file. I then modified the code above as follows, but it is not working:
from nltk.tag.stanford import StanfordNERTagger
path_to_model = "C:\..\stanford-ner-2016-10-31\custom/austen.prop"
path_to_jar = "C:\..\stanford-ner-2016-10-31\stanford-ner.jar"
nertagger=StanfordNERTagger(path_to_model, path_to_jar)
query="Show me the best eye doctor in Munich"
print(nertagger.tag(query.split()))
But this code is producing the following error:
Exception in thread "main" edu.stanford.nlp.io.RuntimeIOException: java.io.StreamCorruptedException: invalid stream header: 236C6F63
at edu.stanford.nlp.ie.AbstractSequenceClassifier.loadClassifierNoExceptions(AbstractSequenceClassifier.java:1507)
at edu.stanford.nlp.ie.crf.CRFClassifier.main(CRFClassifier.java:3017)
Caused by: java.io.StreamCorruptedException: invalid stream header: 236C6F63
at java.io.ObjectInputStream.readStreamHeader(ObjectInputStream.java:808)
at java.io.ObjectInputStream.<init>(ObjectInputStream.java:301)
at edu.stanford.nlp.ie.AbstractSequenceClassifier.loadClassifier(AbstractSequenceClassifier.java:1462)
at edu.stanford.nlp.ie.AbstractSequenceClassifier.loadClassifier(AbstractSequenceClassifier.java:1494)
at edu.stanford.nlp.ie.AbstractSequenceClassifier.loadClassifierNoExceptions(AbstractSequenceClassifier.java:1505)
... 1 more
raise OSError('Java command failed : ' + str(cmd))
OSError: Java command failed : ['C:\\Program Files\\Java\\jdk1.8.0_111\\bin\\java.exe', '-mx1000m', '-cp', 'C:/Users/HP/Desktop/Downloads1/Compressed/stanford-ner-2016-10-31/stanford-ner-2016-10-31\\stanford-ner-3.7.0-javadoc.jar;C:/Users/HP/Desktop/Downloads1/Compressed/stanford-ner-2016-10-31/stanford-ner-2016-10-31\\stanford-ner-3.7.0-sources.jar;C:/Users/HP/Desktop/Downloads1/Compressed/stanford-ner-2016-10-31/stanford-ner-2016-10-31\\stanford-ner-3.7.0.jar;C:/Users/HP/Desktop/Downloads1/Compressed/stanford-ner-2016-10-31/stanford-ner-2016-10-31\\stanford-ner.jar;C:/Users/HP/Desktop/Downloads1/Compressed/stanford-ner-2016-10-31/stanford-ner-2016-10-31\\lib\\joda-time.jar;C:/Users/HP/Desktop/Downloads1/Compressed/stanford-ner-2016-10-31/stanford-ner-2016-10-31\\lib\\jollyday-0.4.9.jar;C:/Users/HP/Desktop/Downloads1/Compressed/stanford-ner-2016-10-31/stanford-ner-2016-10-31\\lib\\stanford-ner-resources.jar', 'edu.stanford.nlp.ie.crf.CRFClassifier', '-loadClassifier', 'C:/Users/HP/Desktop/Downloads1/Compressed/stanford-ner-2016-10-31/stanford-ner-2016-10-31/custom/austen.prop', '-textFile', 'C:\\Users\\HP\\AppData\\Local\\Temp\\tmppk8_741f', '-outputFormat', 'slashTags', '-tokenizerFactory', 'edu.stanford.nlp.process.WhitespaceTokenizer', '-tokenizerOptions', '"tokenizeNLs=false"', '-encoding', 'utf8']
The Stanford NER classifier is a Java program; NLTK's module is only an interface to the Java executable. So you train a model exactly as you did before (or as you saw done in the link you provided).
In your code, you are confusing the training of a model with its use to chunk new text. The .prop file contains instructions for training a new model; it is not itself a model. This is what I recommend:
1. Forget about python/nltk for the moment, and train a new model from the Windows command line (CMD prompt or whatever): follow the how-to you mention in your question to generate a serialized model (a .ser.gz file) from your .prop file, named ner-model.ser.gz or whatever you decide to call it.
2. In your python code, set the path_to_model variable to point to the .ser.gz file you generated in step 1.
If you really want to control the training process from python, you could use the subprocess module to issue the appropriate command line commands. But it sounds like you don't really need this; just try to understand what these steps do so that you can carry them out properly.
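If you do go down the subprocess route, here is a rough sketch under assumptions taken from your question and the Stanford how-to: the install lives in the stanford-ner-2016-10-31 folder, your austen.prop sits in the custom subfolder, and its serializeTo property writes a file called ner-model.ser.gz (both names are only the ones used in the how-to).
import subprocess
from nltk.tag.stanford import StanfordNERTagger

ner_dir = r"C:\..\stanford-ner-2016-10-31"  # your Stanford NER install folder

# Step 1: train a CRF model from the .prop file (this is what the how-to runs on the command line).
subprocess.run(
    ["java", "-cp", ner_dir + r"\stanford-ner.jar",
     "edu.stanford.nlp.ie.crf.CRFClassifier",
     "-prop", ner_dir + r"\custom\austen.prop"],
    check=True,
)

# Step 2: load the serialized model that training produced, not the .prop file.
path_to_model = ner_dir + r"\custom\ner-model.ser.gz"  # whatever name serializeTo specifies
path_to_jar = ner_dir + r"\stanford-ner.jar"
nertagger = StanfordNERTagger(path_to_model, path_to_jar)
print(nertagger.tag("Show me the best eye doctor in Munich".split()))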
I'm loading this object detection model in python. I can load it with the following lines of code:
import tflite_runtime.interpreter as tflite
model_path = 'path_to_model_file.tf'
interpreter = tflite.Interpreter(model_path)
I'm able to perform inferences on this without any problem. However, the labels are supposed to be included in the metadata, according to the model's documentation, but I can't extract them.
The closest I got was when following this:
from tflite_support import metadata as _metadata
displayer = _metadata.MetadataDisplayer.with_model_file(model_path)
export_json_file = "extracted_metadata.json"
json_file = displayer.get_metadata_json()
# Optional: write out the metadata as a json file
with open(export_json_file, "w") as f:
    f.write(json_file)
but the very first line of code fails with this error: AttributeError: 'int' object has no attribute 'tobytes'.
How can I extract the labels?
If you only care about the label file, you can simply run a command like unzip model_path on Linux or Mac. A TFLite model with metadata is essentially a zip file. See the public introduction for more details.
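Since it is just a zip archive, Python's standard zipfile module works too; a small sketch (the model filename below is only a placeholder):
import zipfile

model_path = "lite-model_ssd_mobilenet_v1_1_metadata_2.tflite"  # placeholder filename

with zipfile.ZipFile(model_path) as z:
    print(z.namelist())           # lists the associated files packed with the model
    z.extractall("model_assets")  # unpacks them (including any label file) into a folder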
Your code snippet to extract metadata works on my end. Make sure to double check model_path. It should be a string, such as "lite-model_ssd_mobilenet_v1_1_metadata_2.tflite".
If you'd like to read label files in an Android app, here is the sample code to do so.
According to Chaquopy Not able to download Resource, I'm not sure whether that problem was ever solved.
So here is the question in the nltk context.
After including one of the nltk.download lines:
nltk.download('popular')
or
nltk.download('punkt')
or
nltk.download('all')
I get this stack trace:
2020-08-26 13:33:45.742 19765-19765/com.pro.useyournotes E/ExceptionTag: com.chaquo.python.PyException: BadZipFile: File is not a zip file
com.chaquo.python.PyException: BadZipFile: File is not a zip file
at <python>.zipfile._RealGetContents(zipfile.py:1335)
at <python>.zipfile.__init__(zipfile.py:1268)
at <python>.nltk.data.__init__(data.py:936)
at <python>.nltk.compat._decorator(compat.py:41)
at <python>.nltk.data.__init__(data.py:396)
at <python>.nltk.compat._decorator(compat.py:41)
at <python>.nltk.data.find(data.py:544)
at <python>.nltk.data.find(data.py:557)
at <python>.nltk.tag.perceptron.__init__(perceptron.py:168)
at <python>.nltk.tag._get_tagger(__init__.py:106)
at <python>.nltk.tag.pos_tag_sents(__init__.py:178)
at <python>.uyn_pre_processing.pre_processing(uyn_pre_processing.py:88)
at <python>.uyn_analysis_workflow.analyse_new_data(uyn_analysis_workflow.py:62)
at <python>.uyn_main.main(uyn_main.py:266)
at <python>.chaquopy_java.call(chaquopy_java.pyx:285)
at <python>.chaquopy_java.Java_com_chaquo_python_PyObject_callAttrThrows(chaquopy_java.pyx:257)
at com.chaquo.python.PyObject.callAttrThrows(Native Method)
at com.chaquo.python.PyObject.callAttr(PyObject.java:209)
at com.pro.useyournotes.MainActivity.getPythonHello(MainActivity.kt:70)
at com.pro.useyournotes.MainActivity.onCreate(MainActivity.kt:59)
at android.app.Activity.performCreate(Activity.java:7136)
at android.app.Activity.performCreate(Activity.java:7127)
at android.app.Instrumentation.callActivityOnCreate(Instrumentation.java:1271)
at android.app.ActivityThread.performLaunchActivity(ActivityThread.java:2893)
at android.app.ActivityThread.handleLaunchActivity(ActivityThread.java:3048)
at android.app.servertransaction.LaunchActivityItem.execute(LaunchActivityItem.java:78)
at android.app.servertransaction.TransactionExecutor.executeCallbacks(TransactionExecutor.java:108)
at android.app.servertransaction.TransactionExecutor.execute(TransactionExecutor.java:68)
at android.app.ActivityThread$H.handleMessage(ActivityThread.java:1808)
at android.os.Handler.dispatchMessage(Handler.java:106)
at android.os.Looper.loop(Looper.java:193)
at android.app.ActivityThread.main(ActivityThread.java:6669)
at java.lang.reflect.Method.invoke(Native Method)
at com.android.internal.os.RuntimeInit$MethodAndArgsCaller.run(RuntimeInit.java:493)
at com.android.internal.os.ZygoteInit.main(ZygoteInit.java:858)
The code where this error occurs is:
tagged_words=nltk.pos_tag_sents(tokenized_sentences)
at <python>.uyn_pre_processing.pre_processing(uyn_pre_processing.py:88)
I also don't know where the nltk files are placed. Earlier, when I was just programming on the Python side, I only remember using the import nltk command. Hopefully someone has already found a solution for using nltk.
I was able to reproduce something similar on the emulator. In my case the root cause was that the download failed with a DECRYPTION_FAILED_OR_BAD_RECORD_MAC error, leaving behind an incomplete ZIP file.
This appears to be a low-level problem with the emulator which isn't specific to Python. If you can confirm you have the same problem (by seeing DECRYPTION_FAILED_OR_BAD_RECORD_MAC in the nltk.download logcat output), then please add a star on the Android issue tracker here, to help encourage them to fix it.
You can work around this by calling nltk.download repeatedly in a loop until it returns True. To save time, you should probably only download the data files you need. You can find out what these are by simply calling the corresponding function and looking at the error message, e.g.:
>>> nltk.pos_tag_sents([["hello", "world"]])
...
LookupError:
**********************************************************************
  Resource averaged_perceptron_tagger not found.
  Please use the NLTK Downloader to obtain the resource:

  >>> import nltk
  >>> nltk.download('averaged_perceptron_tagger')
Then you can add this to your code:
while not nltk.download('averaged_perceptron_tagger'):
    print("Retrying download")
This succeeded after a few iterations, and I was then able to call nltk.pos_tag_sents successfully.
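Putting the two pieces together, a minimal sketch of the workaround (using the resource name from the error message above):
import nltk

# Keep retrying until the download actually succeeds on the emulator.
while not nltk.download('averaged_perceptron_tagger'):
    print("Retrying download")

print(nltk.pos_tag_sents([["hello", "world"]]))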
Add this to your python script:
while not nltk.download('punkt'):
    print("Retrying download punkt")
Also in your AndroidManifest don't forget to add these permissions:
<uses-permission android:name="android.permission.INTERNET" />
<uses-permission android:name="android.permission.ACCESS_NETWORK_STATE" />
I am trying to load a TF1 model, using hub and following this guide.
This model has a sentencepiece model which comes with it:
spm_path = m.signatures['spm_path']
<tensorflow.python.eager.wrap_function.WrappedFunction at 0x1341129e8>
If I execute this function:
spm_path()
{'default': <tf.Tensor: id=5905, shape=(), dtype=string, numpy=b'SAVEDMODEL-ASSET'>}
However, if I use the output b'SAVEDMODEL-ASSET' to load my sentencepiece model, I get the following error:
sp.Load(b'SAVEDMODEL-ASSET')
OSError: Not found: "SAVEDMODEL-ASSET": No such file or directory Error #2
The issue is that I am not sure where this asset is located: where does hub store downloaded modules?
I can find the following: os.environ['TFHUB_CACHE_DIR'] = '/tmp/tfhub', but this is not enough for me to locate the actual file on my machine and pass in the correct file.
It's a bit clunky but here's one way to do it:
uselite = hub.load("https://tfhub.dev/google/universal-sentence-encoder-lite/2")
sp = sentencepiece.SentencePieceProcessor()
sp.load(uselite.asset_paths[0].asset_path.numpy())
asset_paths contains only one item, and it's the path to the spm model.
I had the same problem when trying to load the assets needed for tokenization with the ALBERT model. I found that there is an asset list in m.asset_paths. It holds Asset objects, and you can access each path with the .asset_path property. The problem is that you need to check the paths of the assets to find the one that you need. Maybe there is a better way, but I don't know it.
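For example, something like the following loop can pick out the asset you want by filename; the .model suffix check is only an assumption, so print the paths first to see what your model actually ships with (m is the hub.load result from the question):
import sentencepiece

sp = sentencepiece.SentencePieceProcessor()

# Inspect every asset bundled with the loaded SavedModel and pick the sentencepiece file.
for asset in m.asset_paths:
    path = asset.asset_path.numpy().decode("utf-8")
    print(path)
    if path.endswith(".model"):  # assumed naming for the sentencepiece file
        sp.Load(path)
        break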
I am new to python and word2vec and keep getting a "you must first build vocabulary before training the model" error. What is wrong with my code?
Here is my code:
file_object=open("SupremeCourt.txt","w")
from gensim.models import word2vec
data = word2vec.Text8Corpus('SupremeCourt.txt')
model = word2vec.Word2Vec(data, size=200)
out=model.most_similar()
print(out[1])
print(out[2])
I can see a couple of things wrong in your code: the file is opened in write mode, and the model you trained doesn't contain the word for which you want to find the most similar words.
I would suggest either loading a predefined model, such as the Google News vectors, into gensim, or building your own word2vec model properly, so that you won't get the error.
The usage of most_similar in gensim is out = model.most_similar("word-name"), for example:
file_object=open("SupremeCourt.txt","r")
from gensim.models import word2vec
data = word2vec.Text8Corpus('SupremeCourt.txt')
model = word2vec.Word2Vec(data, size=200)#use google news vectors here
out=model.most_similar("word")
print(out)
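If you go the pretrained route instead, a minimal sketch (assuming you have already downloaded the GoogleNews-vectors-negative300.bin file) would be:
from gensim.models import KeyedVectors

# Load the pretrained Google News vectors instead of training on your own text.
kv = KeyedVectors.load_word2vec_format("GoogleNews-vectors-negative300.bin", binary=True)
print(kv.most_similar("court"))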
You're opening that file in write mode with this line:
file_object = open("SupremeCourt.txt", "w")
By doing this, you're erasing the contents of your file, so that when you try to pass the file to the model for training, there is no data to read. That's why that error is thrown.
Remove that line (and also restore your file contents), and it'll work.
I was following the instructions on this link ("http://radimrehurek.com/2014/03/tutorial-on-mallet-in-python/"), but I came across an error when I tried to train the model:
model = models.LdaMallet(mallet_path, corpus, num_topics =10, id2word = corpus.dictionary)
IOError: [Errno 2] No such file or directory: 'c:\\users\\brlu\\appdata\\local\\temp\\c6a13a_state.mallet.gz'
Please share any thoughts you might have.
Thanks.
This can happen for two reasons:
1. You have a space in your mallet path.
2. There is no MALLET_HOME environment variable.
In my case I forgot to import gensim's mallet wrapper. The following code resolved the error.
import os
from gensim.models.wrappers import LdaMallet
os.environ['MALLET_HOME'] = 'C:/.../mallet-2.0.8/'
A more detailed explanation can be found here:
https://github.com/RaRe-Technologies/gensim/issues/2137
Make sure that mallet works properly from the command line.
Look in your folder 'c:\users\brlu\appdata\local\temp\...': if there are some files there, you can deduce at which step the mallet wrapper fails. Try that step at the command line.
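As a rough sketch of that sanity check from Python (the install path is a placeholder, and on Windows the launcher is the .bat file):
import os
import subprocess

os.environ['MALLET_HOME'] = r'C:\mallet-2.0.8'   # placeholder install path
mallet_bat = r'C:\mallet-2.0.8\bin\mallet.bat'

# Running mallet with no arguments should print its list of commands;
# if this fails, the gensim wrapper will fail too.
result = subprocess.run([mallet_bat], capture_output=True, text=True)
print(result.stdout or result.stderr)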
I had similar problems with gensim + MALLET on Windows:
Make sure that MALLET_HOME is set
Escape slashes when setting mallet_path in Python:
mallet_path = 'c:\\mallet-2.0.7\\bin\\mallet'
LDA_model = gensim.models.LdaMallet(mallet_path, ...)
Also, it might be useful to modify line 142 in Python\Lib\site-packages\gensim\models\ldamallet.py: change --token-regex '\S+' to --token-regex \"\S+\"
Hope it helps
Try the following:
import tempfile
tempfile.tempdir = 'some_other_non_system_temp_directory'
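Combined with the MALLET_HOME fix above, a rough sketch would be (all paths are placeholders, and corpus is the corpus from the question):
import os
import tempfile
from gensim.models.wrappers import LdaMallet

os.environ['MALLET_HOME'] = r'C:\mallet-2.0.8'   # no spaces in the path
tempfile.tempdir = r'C:\mallet_temp'             # an existing, writable, non-system temp dir

mallet_path = r'C:\mallet-2.0.8\bin\mallet'
model = LdaMallet(mallet_path, corpus, num_topics=10, id2word=corpus.dictionary)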