Chaquopy problems with nltk and download - python

According to Chaquopy Not able to download Resource, I'm not sure whether the problem was ever solved.
So here is the question in the nltk context.
After including one of the following nltk.download lines:
nltk.download('popular')
or
nltk.download('punkt')
or
nltk.download('all')
I get this stack trace:
2020-08-26 13:33:45.742 19765-19765/com.pro.useyournotes E/ExceptionTag: com.chaquo.python.PyException: BadZipFile: File is not a zip file
com.chaquo.python.PyException: BadZipFile: File is not a zip file
at <python>.zipfile._RealGetContents(zipfile.py:1335)
at <python>.zipfile.__init__(zipfile.py:1268)
at <python>.nltk.data.__init__(data.py:936)
at <python>.nltk.compat._decorator(compat.py:41)
at <python>.nltk.data.__init__(data.py:396)
at <python>.nltk.compat._decorator(compat.py:41)
at <python>.nltk.data.find(data.py:544)
at <python>.nltk.data.find(data.py:557)
at <python>.nltk.tag.perceptron.__init__(perceptron.py:168)
at <python>.nltk.tag._get_tagger(__init__.py:106)
at <python>.nltk.tag.pos_tag_sents(__init__.py:178)
at <python>.uyn_pre_processing.pre_processing(uyn_pre_processing.py:88)
at <python>.uyn_analysis_workflow.analyse_new_data(uyn_analysis_workflow.py:62)
at <python>.uyn_main.main(uyn_main.py:266)
at <python>.chaquopy_java.call(chaquopy_java.pyx:285)
at <python>.chaquopy_java.Java_com_chaquo_python_PyObject_callAttrThrows(chaquopy_java.pyx:257)
at com.chaquo.python.PyObject.callAttrThrows(Native Method)
at com.chaquo.python.PyObject.callAttr(PyObject.java:209)
at com.pro.useyournotes.MainActivity.getPythonHello(MainActivity.kt:70)
at com.pro.useyournotes.MainActivity.onCreate(MainActivity.kt:59)
at android.app.Activity.performCreate(Activity.java:7136)
at android.app.Activity.performCreate(Activity.java:7127)
at android.app.Instrumentation.callActivityOnCreate(Instrumentation.java:1271)
at android.app.ActivityThread.performLaunchActivity(ActivityThread.java:2893)
at android.app.ActivityThread.handleLaunchActivity(ActivityThread.java:3048)
at android.app.servertransaction.LaunchActivityItem.execute(LaunchActivityItem.java:78)
at android.app.servertransaction.TransactionExecutor.executeCallbacks(TransactionExecutor.java:108)
at android.app.servertransaction.TransactionExecutor.execute(TransactionExecutor.java:68)
at android.app.ActivityThread$H.handleMessage(ActivityThread.java:1808)
at android.os.Handler.dispatchMessage(Handler.java:106)
at android.os.Looper.loop(Looper.java:193)
at android.app.ActivityThread.main(ActivityThread.java:6669)
at java.lang.reflect.Method.invoke(Native Method)
at com.android.internal.os.RuntimeInit$MethodAndArgsCaller.run(RuntimeInit.java:493)
at com.android.internal.os.ZygoteInit.main(ZygoteInit.java:858)
The code where this error occurs is:
tagged_words=nltk.pos_tag_sents(tokenized_sentences)
at <python>.uyn_pre_processing.pre_processing(uyn_pre_processing.py:88)
I also don't know where the nltk files are placed. Earlier, when I was programming only on the Python side, I only remember using the import nltk command. Hopefully someone has already found a solution for using nltk.

I was able to reproduce something similar on the emulator. In my case the root cause was that the download failed with a DECRYPTION_FAILED_OR_BAD_RECORD_MAC error, leaving behind an incomplete ZIP file.
This appears to be a low-level problem with the emulator which isn't specific to Python. If you can confirm you have the same problem (by seeing DECRYPTION_FAILED_OR_BAD_RECORD_MAC in the nltk.download logcat output), then please add a star on the Android issue tracker here, to help encourage them to fix it.
You can work around this by calling nltk.download repeatedly in a loop until it returns True. To save time, you should probably only download the data files you need. You can find out what these are by simply calling the corresponding function and looking at the error message, e.g.:
>>> nltk.pos_tag_sents([["hello", "world"]])
...
LookupError:
**********************************************************************
Resource averaged_perceptron_tagger not found.
Please use the NLTK Downloader to obtain the resource:
>>> import nltk
>>> nltk.download('averaged_perceptron_tagger')
Then you can add this to your code:
while not nltk.download('averaged_perceptron_tagger'):
    print("Retrying download")
This succeeded after a few iterations, and I was then able to call nltk.pos_tag_sents successfully.
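For completeness, here is a fuller sketch of that workaround as I would wire it up (the ensure_nltk_data helper, its resource list and the retry cap are my own assumptions, not part of Chaquopy or NLTK):
import nltk

def ensure_nltk_data(resources=("punkt", "averaged_perceptron_tagger"), max_retries=10):
    # nltk.download returns True on success and False on failure,
    # so retry each resource up to max_retries times.
    for resource in resources:
        for _ in range(max_retries):
            if nltk.download(resource):
                break
            print("Retrying download of", resource)
        else:
            raise RuntimeError("Could not download NLTK resource: " + resource)

ensure_nltk_data()
print(nltk.pos_tag_sents([["hello", "world"]]))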

Add this to your Python script:
while not nltk.download('punkt'):
    print("Retrying download punkt")
Also, don't forget to add these permissions in your AndroidManifest.xml:
<uses-permission android:name="android.permission.INTERNET" />
<uses-permission android:name="android.permission.ACCESS_NETWORK_STATE" />

Related

Tabula font error in reading table from PDF

I saw that a lot of people had similar issues, but not this one, and many of the similar issues do not have an applicable solution, unfortunately.
I am getting this warning from tabula, and when I look at the result or test the length of what it extracts, there is nothing there. Here is the message:
Got stderr: Apr 12, 2022 5:34:12 PM org.apache.pdfbox.pdmodel.font.PDTrueTypeFont <init>
WARNING: Using fallback font 'Helvetica-Oblique' for 'CenturyGothic-Italic'
All I am using is:
table = tabula.read_pdf(pdf_path, pages=page, multiple_tables=True)
Any ideas??
The correct approach would be to install the missing fonts, as recommended in the answer here:
Using fallback font while parsing file content using pdfbox - can it cause mistakes?
However, for my application, which reads PDF files from a Docker container, installing extra fonts in the OS might be unnecessary. Because what you see in the logs is only a warning, the missing fonts do not really impact the parsing of the PDF.
To remove these warnings from any logging in tabula-py, I just added silent=True to the arguments in the method call as follows:
table_df = tabula.read_pdf(
    input_path=pdf_file,
    output_format="dataframe",
    pages="all",
    silent=True,
)
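As a quick sanity check (just a sketch; example.pdf is a placeholder path), read_pdf with multiple_tables=True returns a list of DataFrames, so you can verify that something was actually extracted:
import tabula

pdf_file = "example.pdf"  # placeholder path
tables = tabula.read_pdf(pdf_file, pages="all", multiple_tables=True, silent=True)
if not tables:
    print("No tables extracted from", pdf_file)
for i, df in enumerate(tables):
    print("Table", i, "has shape", df.shape)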

SentiStrength: [WinError 2] The system cannot find the file specified

I tried to use SentiStrength with Python to classify text sentiment.
import sentistrength
from sentistrength import PySentiStr
senti = PySentiStr()
senti.setSentiStrengthPath('C:/Users/xx/SentiStrengthCom.jar')
senti.setSentiStrengthLanguageFolderPath('C:/Users/xx/SentStrength_Data_Sept2011/')
str_arr = ['What a lovely day', 'What a bad day']
result = senti.getSentiment(str_arr, score='scale')
However, when I try to execute the last line, I get the error [WinError 2] The system cannot find the file specified. Yet the file is found by the system, as there is no error message when running the code below.
import os

SentiStrengthLocation = "C:/Users/xx/SentiStrengthCom.jar"
SentiStrengthLanguageFolder = "C:/Users/xx/SentStrength_Data_Sept2011/"
if not os.path.isfile(SentiStrengthLocation):
    print("SentiStrength not found at: ", SentiStrengthLocation)
if not os.path.isdir(SentiStrengthLanguageFolder):
    print("SentiStrength data folder not found at: ", SentiStrengthLanguageFolder)
I am really looking forward to your help! Thank you a lot!
Also, do you have any recommendations about how to perform a good sentiment analysis on Python?
Edit: I tried it on Colab and there it works. Is it possible that there are admin rights that make it impossible to access the file?
According to this comment on a GitHub issue, is it possible that you don't have Java installed? The package might be throwing the error because of that.
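A quick way to check that from Python (just a diagnostic sketch; it assumes PySentiStr launches the .jar via a java executable found on PATH, which is an assumption on my part):
import shutil
import subprocess

# Assumption: PySentiStr runs SentiStrengthCom.jar via a local `java` executable.
java_path = shutil.which("java")
if java_path is None:
    print("No java executable found on PATH -- install a JRE/JDK first.")
else:
    print("Using Java at:", java_path)
    subprocess.run(["java", "-version"])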

Tableau SDK TableException (40200)

Issue: Error being thrown: tableausdk.Exceptions.TableauException: TableauException (40200): The system cannot find the path specified.
- OS::mkdir(CreateDirectory path="C:\PATH\Tableau-SDK\tdetmp2A0E0E5E")
I am attempting to create a Tableau extract from Oracle data using Python and the Tableau SDK.
The code seems to run correctly if the extract already exists (although the produced .tde is unreadable).
According to the Tableau community, I should be able to create an extract from any source data without the extract already existing...
Any idea why this is occurring?
tde_path = r'C:\PATH\test.tde'
tde_file = Extract(path=tde_path) ## ERROR Thrown here
The reason now seems obvious... the error itself had the answer:
OS::mkdir(CreateDirectory path="C:\PATH\Tableau-SDK\tdetmp2A0E0E5E")
To solve the issue: the directory C:\PATH\Tableau-SDK\ did not exist. After creating the directory, the code ran without error.
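In code, that boils down to creating the directory up front before constructing the extract (a sketch only; the import path for Extract and the exact directories are assumptions based on the error message above):
import os
from tableausdk.Extract import Extract  # assumption: Extract comes from the tableausdk package

# Make sure the SDK's working directory from the error message exists
# before asking the SDK to create the extract.
os.makedirs(r"C:\PATH\Tableau-SDK", exist_ok=True)

tde_path = r"C:\PATH\test.tde"
tde_file = Extract(path=tde_path)
tde_file.close()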

How to customize Stanford NER in python?

I learned how to customize Stanford NER (Named Entity Recognizer) in Java from here:
http://nlp.stanford.edu/software/crf-faq.shtml#a
But I am developing my project with Python, and I need to train my classifier with some custom entities.
I searched a lot for a solution but could not find any. Any idea? If it is not possible, is there any other way to train my classifier with custom entities, i.e., with nltk or others in Python?
EDIT: Code addition
This is what I did to set up and test Stanford NER which worked nicely:
from nltk.tag.stanford import StanfordNERTagger
path_to_model = "C:\..\stanford-ner-2016-10-31\classifiers\english.all.3class.distsim.crf.ser"
path_to_jar = "C:\..\stanford-ner-2016-10-31\stanford-ner.jar"
nertagger=StanfordNERTagger(path_to_model, path_to_jar)
query="Show me the best eye doctor in Munich"
print(nertagger.tag(query.split()))
This code worked successfully. Then I downloaded the sample austen.prop file and both the jane-austen-emma-ch1.tsv and jane-austen-emma-ch2.tsv files, and put them in a custom folder inside the Stanford NER directory. I modified the jane-austen-emma-ch1.tsv file with my custom entity tags. The austen.prop file links to the jane-austen-emma-ch1.tsv file. Now, I modified the above code as follows, but it is not working:
from nltk.tag.stanford import StanfordNERTagger
path_to_model = "C:\..\stanford-ner-2016-10-31\custom/austen.prop"
path_to_jar = "C:\..\stanford-ner-2016-10-31\stanford-ner.jar"
nertagger=StanfordNERTagger(path_to_model, path_to_jar)
query="Show me the best eye doctor in Munich"
print(nertagger.tag(query.split()))
But this code is producing the following error:
Exception in thread "main" edu.stanford.nlp.io.RuntimeIOException: java.io.StreamCorruptedException: invalid stream header: 236C6F63
raise OSError('Java command failed : ' + str(cmd))
at edu.stanford.nlp.ie.AbstractSequenceClassifier.loadClassifierNoExceptions(AbstractSequenceClassifier.java:1507)
at edu.stanford.nlp.ie.crf.CRFClassifier.main(CRFClassifier.java:3017)
Caused by: java.io.StreamCorruptedException: invalid stream header: 236C6F63
OSError: Java command failed : ['C:\\Program Files\\Java\\jdk1.8.0_111\\bin\\java.exe', '-mx1000m', '-cp', 'C:/Users/HP/Desktop/Downloads1/Compressed/stanford-ner-2016-10-31/stanford-ner-2016-10-31\\stanford-ner-3.7.0-javadoc.jar;C:/Users/HP/Desktop/Downloads1/Compressed/stanford-ner-2016-10-31/stanford-ner-2016-10-31\\stanford-ner-3.7.0-sources.jar;C:/Users/HP/Desktop/Downloads1/Compressed/stanford-ner-2016-10-31/stanford-ner-2016-10-31\\stanford-ner-3.7.0.jar;C:/Users/HP/Desktop/Downloads1/Compressed/stanford-ner-2016-10-31/stanford-ner-2016-10-31\\stanford-ner.jar;C:/Users/HP/Desktop/Downloads1/Compressed/stanford-ner-2016-10-31/stanford-ner-2016-10-31\\lib\\joda-time.jar;C:/Users/HP/Desktop/Downloads1/Compressed/stanford-ner-2016-10-31/stanford-ner-2016-10-31\\lib\\jollyday-0.4.9.jar;C:/Users/HP/Desktop/Downloads1/Compressed/stanford-ner-2016-10-31/stanford-ner-2016-10-31\\lib\\stanford-ner-resources.jar', 'edu.stanford.nlp.ie.crf.CRFClassifier', '-loadClassifier', 'C:/Users/HP/Desktop/Downloads1/Compressed/stanford-ner-2016-10-31/stanford-ner-2016-10-31/custom/austen.prop', '-textFile', 'C:\\Users\\HP\\AppData\\Local\\Temp\\tmppk8_741f', '-outputFormat', 'slashTags', '-tokenizerFactory', 'edu.stanford.nlp.process.WhitespaceTokenizer', '-tokenizerOptions', '"tokenizeNLs=false"', '-encoding', 'utf8']
at java.io.ObjectInputStream.readStreamHeader(ObjectInputStream.java:808)
at java.io.ObjectInputStream.<init>(ObjectInputStream.java:301)
at edu.stanford.nlp.ie.AbstractSequenceClassifier.loadClassifier(AbstractSequenceClassifier.java:1462)
at edu.stanford.nlp.ie.AbstractSequenceClassifier.loadClassifier(AbstractSequenceClassifier.java:1494)
at edu.stanford.nlp.ie.AbstractSequenceClassifier.loadClassifierNoExceptions(AbstractSequenceClassifier.java:1505)
... 1 more
The Stanford NER classifier is a Java program; NLTK's module is only an interface to the Java executable. So you train a model exactly as you did before (or as you saw done in the link you provide).
In your code, you are confusing the training of a model with its use to chunk new text. The .prop file contains instructions for training a new model; it is not itself a model. This is what I recommend:
1. Forget about Python/nltk for the moment and train a new model from the Windows command line (CMD prompt or whatever): follow the how-to you mention in your question to generate, from your .prop file, a serialized model (.ser file) named ner-model.ser.gz or whatever you decide to call it.
2. In your Python code, set the path_to_model variable to point to the .ser file you generated in step 1.
If you really want to control the training process from Python, you could use the subprocess module to issue the appropriate command-line commands. But it sounds like you don't really need this; just try to understand what these steps do so that you can carry them out properly.
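If you do go the subprocess route, a sketch might look like the following (the training command follows the Stanford CRF FAQ; the ner-model.ser.gz name depends on the serializeTo entry in your .prop file, and the paths are placeholders matching the question):
import subprocess
from nltk.tag.stanford import StanfordNERTagger

ner_dir = r"C:\..\stanford-ner-2016-10-31"  # placeholder path, as in the question

# Step 1: train a model from the .prop file (normally done once).
subprocess.run([
    "java", "-cp", ner_dir + r"\stanford-ner.jar",
    "edu.stanford.nlp.ie.crf.CRFClassifier",
    "-prop", ner_dir + r"\custom\austen.prop",
], check=True)

# Step 2: point the tagger at the serialized model the training run produced.
nertagger = StanfordNERTagger(ner_dir + r"\custom\ner-model.ser.gz",
                              ner_dir + r"\stanford-ner.jar")
print(nertagger.tag("Show me the best eye doctor in Munich".split()))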

Saving NLTK Alignments

I am using NLTK 3.2 and I was wondering how you save NLTK alignments. I have found this link: How to save Python NLTK alignment models for later use?, but it seems that there is no align() method. Also, I figured out that nltk.align has been renamed to nltk.translate, but I still cannot access the align() method. Thanks!
Yeah, you are right. The method align became private in the current version. So, if you want to use that method, you have to modify the source code.
To modify the source code, you have to get to the directory of the file. You can find that directory by:
Open your terminal
Start the Python interpreter, then type these commands:
>>> import nltk
>>> nltk.translate.ibm1.__file__
Now, go to that directory and find the file ibm1.py. Open the file and rename the method __align to align; it's the last method in the file.
CAUTION:
The align method now returns an Alignment object, rather than the AlignedSent that earlier versions returned.
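Also note that you may not need align at all: judging by the NLTK doctests, training an IBM model already stores the best alignment on each AlignedSent in the bitext. A minimal sketch with toy data:
from nltk.translate import AlignedSent, IBMModel1

# Toy parallel corpus; after training, each sentence pair's .alignment
# attribute is filled in by the model.
bitext = [
    AlignedSent(['das', 'haus'], ['the', 'house']),
    AlignedSent(['das', 'buch'], ['the', 'book']),
    AlignedSent(['ein', 'buch'], ['a', 'book']),
]
ibm1 = IBMModel1(bitext, 5)
print(bitext[0].alignment)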
