Tabula font error in reading table from PDF

I have seen many people report similar issues, but none exactly like this one, and unfortunately most of those have no applicable solution.
I am getting the following warning from tabula, and when I look at the result or check the length of what it extracts, there is nothing there. Here is the message:
Got stderr: Apr 12, 2022 5:34:12 PM org.apache.pdfbox.pdmodel.font.PDTrueTypeFont <init>
WARNING: Using fallback font 'Helvetica-Oblique' for 'CenturyGothic-Italic'
All I am using is:
table = tabula.read_pdf(pdf_path, pages=page, multiple_tables=True)
Any ideas?

The correct approach would be to install the missing fonts, as recommended in the answer here:
Using fallback font while parsing file content using pdfbox - can it cause mistakes?
However, for my application, which reads PDF files inside a Docker container, installing extra fonts in the OS might be unnecessary. Because what you see in the logs is only a warning, the missing fonts do not really affect the parsing of the PDF.
To remove these warnings from the tabula-py output, I just added silent=True to the arguments of the method call, as follows:
table_df = tabula.read_pdf(
    input_path=pdf_file,
    output_format="dataframe",
    pages="all",
    silent=True,
)
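If you would rather install the missing fonts than silence the message, it helps to know which fonts the PDF asked for. The sketch below is only an illustration: it assumes you have the stderr text as a string (for example, copied from the container logs) and that the lines follow the PDFBox format quoted in the question. The names missing_fonts and FALLBACK_RE are made up for this example and are not part of tabula-py.

```python
import re

# Pattern for the PDFBox message format quoted in the question:
#   WARNING: Using fallback font 'Helvetica-Oblique' for 'CenturyGothic-Italic'
FALLBACK_RE = re.compile(r"Using fallback font '([^']+)' for '([^']+)'")


def missing_fonts(stderr_text):
    """Return sorted, de-duplicated (requested_font, fallback_font) pairs."""
    pairs = {
        (requested, fallback)
        for fallback, requested in FALLBACK_RE.findall(stderr_text)
    }
    return sorted(pairs)


log = (
    "Got stderr: Apr 12, 2022 5:34:12 PM "
    "org.apache.pdfbox.pdmodel.font.PDTrueTypeFont <init>\n"
    "WARNING: Using fallback font 'Helvetica-Oblique' for 'CenturyGothic-Italic'\n"
)
print(missing_fonts(log))
```

For the log in the question this reports that CenturyGothic-Italic was requested and Helvetica-Oblique was substituted.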

Related

Can't catch python warning with warnings.catch_warnings()

I am using the Camelot library in Python to read tables from PDFs.
If no table is recognized but something else is found (such as text), the library gives a warning: UserWarning: No tables found in table area 1 [stream.py:365].
My idea was to catch this warning with the warnings.catch_warnings() function.
This is my code:
with warnings.catch_warnings(record=True) as w:
    # reading tables from pdf
    parsed_tables = camelot.read_pdf(
        tmp_file.name,
        pages=page,
        flavor="stream",
        row_tol=row_tol,
        table_areas=["30,480,790,100"],
        surpress_stdout=False
    )
    # warnings.warn("TEST")
    print("warning", w)
My problem is that the variable w is always empty. If I uncomment the "TEST" warning, that warning does appear in w (so it works with my own warning).
I searched the library for warning filters but didn't find any.
I also tried adding warnings.filterwarnings("default") or warnings.simplefilter("always").
Why can't I catch this warning? Is it because it occurs in the library and not in my code?
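For reference, here is a minimal sketch of the pattern the question is aiming for, using only the standard library. It does not call Camelot: parse_area is a hypothetical stand-in for a library function that emits a UserWarning. With record=True and the filter reset to "always" inside the context, a warning raised in called code is normally recorded. If the list stays empty with a real library, possible explanations are that the library installs its own filter around the call or reports the message some other way than through the warnings module. Neither is confirmed for Camelot here.

```python
import warnings


def parse_area():
    """Hypothetical stand-in for a library call that emits a UserWarning."""
    warnings.warn("No tables found in table area 1", UserWarning)
    return []


with warnings.catch_warnings(record=True) as caught:
    # Reset the filter inside the context so the warning is always recorded,
    # even if the same warning was already shown earlier in the process.
    warnings.simplefilter("always")
    tables = parse_area()

messages = [str(w.message) for w in caught]
print(messages)
```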

Chaquopy problems with nltk and download

Judging by "Chaquopy Not able to download Resource", I'm not sure whether that problem was ever solved.
So here is the question in the nltk context.
After including one of these nltk.download lines:
nltk.download('popular')
or
nltk.download('punkt')
or
nltk.download('all')
I get this stack trace:
2020-08-26 13:33:45.742 19765-19765/com.pro.useyournotes E/ExceptionTag: com.chaquo.python.PyException: BadZipFile: File is not a zip file
com.chaquo.python.PyException: BadZipFile: File is not a zip file
at <python>.zipfile._RealGetContents(zipfile.py:1335)
at <python>.zipfile.__init__(zipfile.py:1268)
at <python>.nltk.data.__init__(data.py:936)
at <python>.nltk.compat._decorator(compat.py:41)
at <python>.nltk.data.__init__(data.py:396)
at <python>.nltk.compat._decorator(compat.py:41)
at <python>.nltk.data.find(data.py:544)
at <python>.nltk.data.find(data.py:557)
at <python>.nltk.tag.perceptron.__init__(perceptron.py:168)
at <python>.nltk.tag._get_tagger(__init__.py:106)
at <python>.nltk.tag.pos_tag_sents(__init__.py:178)
at <python>.uyn_pre_processing.pre_processing(uyn_pre_processing.py:88)
at <python>.uyn_analysis_workflow.analyse_new_data(uyn_analysis_workflow.py:62)
at <python>.uyn_main.main(uyn_main.py:266)
at <python>.chaquopy_java.call(chaquopy_java.pyx:285)
at <python>.chaquopy_java.Java_com_chaquo_python_PyObject_callAttrThrows(chaquopy_java.pyx:257)
at com.chaquo.python.PyObject.callAttrThrows(Native Method)
at com.chaquo.python.PyObject.callAttr(PyObject.java:209)
at com.pro.useyournotes.MainActivity.getPythonHello(MainActivity.kt:70)
at com.pro.useyournotes.MainActivity.onCreate(MainActivity.kt:59)
at android.app.Activity.performCreate(Activity.java:7136)
at android.app.Activity.performCreate(Activity.java:7127)
at android.app.Instrumentation.callActivityOnCreate(Instrumentation.java:1271)
at android.app.ActivityThread.performLaunchActivity(ActivityThread.java:2893)
at android.app.ActivityThread.handleLaunchActivity(ActivityThread.java:3048)
at android.app.servertransaction.LaunchActivityItem.execute(LaunchActivityItem.java:78)
at android.app.servertransaction.TransactionExecutor.executeCallbacks(TransactionExecutor.java:108)
at android.app.servertransaction.TransactionExecutor.execute(TransactionExecutor.java:68)
at android.app.ActivityThread$H.handleMessage(ActivityThread.java:1808)
at android.os.Handler.dispatchMessage(Handler.java:106)
at android.os.Looper.loop(Looper.java:193)
at android.app.ActivityThread.main(ActivityThread.java:6669)
at java.lang.reflect.Method.invoke(Native Method)
at com.android.internal.os.RuntimeInit$MethodAndArgsCaller.run(RuntimeInit.java:493)
at com.android.internal.os.ZygoteInit.main(ZygoteInit.java:858)
The code where this error occurs is:
tagged_words=nltk.pos_tag_sents(tokenized_sentences)
at <python>.uyn_pre_processing.pre_processing(uyn_pre_processing.py:88)
I also don't know where the nltk files are placed. Earlier, when I only programmed on the Python side, I only remember using the import nltk command. Hopefully someone has already found a solution for using nltk.

I was able to reproduce something similar on the emulator. In my case the root cause was that the download failed with a DECRYPTION_FAILED_OR_BAD_RECORD_MAC error, leaving behind an incomplete ZIP file.
This appears to be a low-level problem with the emulator which isn't specific to Python. If you can confirm you have the same problem (by seeing DECRYPTION_FAILED_OR_BAD_RECORD_MAC in the nltk.download logcat output), then please add a star on the Android issue tracker here, to help encourage them to fix it.
You can work around this by calling nltk.download repeatedly in a loop until it returns True. To save time, you should probably download only the data files you need. You can find out what these are by simply calling the corresponding function and looking at the error message, e.g.:
>>> nltk.pos_tag_sents([["hello", "world"]])
...
LookupError:
**********************************************************************
Resource averaged_perceptron_tagger not found.
Please use the NLTK Downloader to obtain the resource:
>>> import nltk
>>> nltk.download('averaged_perceptron_tagger')
Then you can add this to your code:
while not nltk.download('averaged_perceptron_tagger'):
    print("Retrying download")
This succeeded after a few iterations, and I was then able to call nltk.pos_tag_sents successfully.
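An unbounded loop like this will spin forever if the network is down, so a capped variant may be safer. The following is a sketch only: retry_until_true and its parameters are hypothetical names, and the flaky download is simulated so that the example runs without nltk or a network connection.

```python
import time


def retry_until_true(action, max_attempts=5, delay_seconds=0.0):
    """Call action() until it returns a truthy value; return the attempt count.

    Raises RuntimeError when every attempt fails, so the caller never loops forever.
    """
    for attempt in range(1, max_attempts + 1):
        if action():
            return attempt
        print(f"Retrying download (attempt {attempt} failed)")
        time.sleep(delay_seconds)
    raise RuntimeError(f"gave up after {max_attempts} attempts")


# Simulated flaky download: fails twice, then succeeds.
outcomes = iter([False, False, True])
attempts_used = retry_until_true(lambda: next(outcomes))
print(attempts_used)
```

With nltk, the call would presumably look like retry_until_true(lambda: nltk.download('averaged_perceptron_tagger')), relying on the behaviour described above, where nltk.download returns a true value once the download succeeds.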

Add this to your Python script:
while not nltk.download('punkt'):
    print("Retrying download punkt")
Also, don't forget to add these permissions to your AndroidManifest:
<uses-permission android:name="android.permission.INTERNET" />
<uses-permission android:name="android.permission.ACCESS_NETWORK_STATE" />

Fatal error reading PNG image file: Not a PNG file in Ubuntu 20.04 LTS

I am trying to download an image using the requests module in Python. The download works, but when I try to open the image, I get "Fatal error reading PNG image file: Not a PNG file". The code I used to download it is:
import requests
img_url = "http://dimik.pub/wp-content/uploads/2020/02/javaWeb.jpg"
r = requests.get(img_url)
with open("java_book.png","wb") as f:
    f.write(r.content)
I run my code in the terminal with python3 s.py (s.py is the name of the file).
Is something wrong in my code, or is it something in my operating system (Ubuntu 20.04 LTS)?

import requests
response = requests.get("https://devnote.in/wp-content/uploads/2020/04/devnote.png")
file = open("sample_image.png", "wb")
file.write(response.content)
print (response.content)
file.close()
The server behind https://devnote.in/wp-content/uploads/2020/04/devnote.png is protected by mod_security, so the request returns an error page like this instead of the image:
<html><head><title>Not Acceptable!</title></head><body><h1>Not Acceptable!</h1><p>An appropriate representation of the requested resource could not be found on this server. This error was generated by Mod_Security.</p></body></html>
Disable mod_security using .htaccess on an Apache server
If you control the server, mod_security can easily be disabled with the help of .htaccess.
<IfModule mod_security.c>
  SecFilterEngine Off
  SecFilterScanPOST Off
</IfModule>

It's because you tried to save javaWeb.jpg (A JPG file) as java_book.png (A PNG file).

To see what we are working with, I tried replicating the issue. Here is what I found.
1.) The file you are attempting to open is the ENTIRE HTML document. I can support this because !DOCTYPE html appears at the beginning of the file written by your 'wb' (write binary) call.
<---------------------------------------------- WE ARE AT AN IMPASSE
From here we have a few options to solve our problem.
a.) We could simply download the image from the web page and place it in a local folder, or wherever you want it. This is by far the easiest option, because the file can then be opened later without much extra work. I'm on a Windows machine, but Ubuntu should have no problem doing this either (unless your Ubuntu has no GUI, which can be fixed with startx if supported).
b.) If you have to pull the image directly from the site itself, you could try something using BeautifulSoup, as in this answer here. Honestly, I've never really used the latter option, since downloading and moving the file is much more effective.

You just need to save the image as a JPG.
import requests
img_url = "http://dimik.pub/wp-content/uploads/2020/02/javaWeb.jpg"
r = requests.get(img_url)
with open("java_book.jpg","wb") as f:
    f.write(r.content)
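The answers above disagree on whether the download is a JPG or an HTML page. One way to settle that is to inspect the first bytes of the payload before choosing a file name. The sketch below is an illustration under stated assumptions: sniff_extension is a hypothetical helper, it recognises only the PNG and JPEG signatures plus an HTML prefix, and the sample byte strings stand in for r.content.

```python
PNG_MAGIC = b"\x89PNG\r\n\x1a\n"  # fixed 8-byte PNG signature
JPEG_MAGIC = b"\xff\xd8\xff"      # start-of-image marker at the front of JPEG files


def sniff_extension(data):
    """Guess a file extension from the first bytes of a downloaded payload."""
    if data.startswith(PNG_MAGIC):
        return ".png"
    if data.startswith(JPEG_MAGIC):
        return ".jpg"
    head = data[:256].lstrip().lower()
    if head.startswith(b"<!doctype html") or head.startswith(b"<html"):
        return ".html"  # an error or landing page, not an image
    return None  # unknown format


print(sniff_extension(b"\xff\xd8\xff\xe0" + b"\x00" * 16))
print(sniff_extension(b"<html><head><title>Not Acceptable!</title>"))
```

In the downloading script, calling sniff_extension(r.content) before opening the output file would show whether the server sent an image at all.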

Yeah, it's a full HTML document.

Can LibreOffice/OpenOffice programmatically add passwords to existing .docx/.xlsx/.pptx files?

TL;DR version - I need to programmatically add a password to .docx/.xlsx/.pptx files using LibreOffice and it doesn't work, and no errors are reported back either, my request to add a password is simply ignored, and a password-less version of the same file is saved.
In-depth:
I'm trying to script the ability to password-protect existing .docx/.xlsx/.pptx files using LibreOffice.
I'm using 64-bit LibreOffice 6.2.5.2 which is the latest version at the time of writing, on Windows 8.1 64-bit Professional.
Whilst I can do this manually via the UI (specifically, I open the "plain" document, do "Save As", tick "Save with Password", and enter the password there), I cannot get this to work via any kind of automation. I've been trying via Python/Uno, but to no avail. Although the code below correctly opens and saves the document, my attempt to add a password is completely ignored. Curiously, the file size shrinks from 12 KB to 9 KB when I do this.
Here is my code:
import socket
import uno
import sys
localContext = uno.getComponentContext()
resolver = localContext.ServiceManager.createInstanceWithContext("com.sun.star.bridge.UnoUrlResolver", localContext)
ctx = resolver.resolve( "uno:socket,host=localhost,port=2002;urp;StarOffice.ComponentContext" )
smgr = ctx.ServiceManager
desktop = smgr.createInstanceWithContext( "com.sun.star.frame.Desktop",ctx)
from com.sun.star.beans import PropertyValue
properties=[]
oDocB = desktop.loadComponentFromURL ("file:///C:/Docs/PlainDoc.docx","_blank",0, tuple(properties) )
sp=[]
sp1=PropertyValue()
sp1.Name='FilterName'
sp1.Value='MS Word 2007 XML'
sp.append(sp1)
sp2=PropertyValue()
sp2.Name='Password'
sp2.Value='secret'
sp.append(sp2)
oDocB.storeToURL("file:///C:/Docs/PasswordDoc.docx",sp)
oDocB.dispose()
I've had great results using Python/Uno to open password-protected files, but I cannot get it to protect a previously unprotected document. I've tried enabling the macro recorder and recording my actions - it recorded the following LibreOffice BASIC code:
sub SaveDoc
rem ----------------------------------------------------------------------
rem define variables
dim document as object
dim dispatcher as object
rem ----------------------------------------------------------------------
rem get access to the document
document = ThisComponent.CurrentController.Frame
dispatcher = createUnoService("com.sun.star.frame.DispatchHelper")
rem ----------------------------------------------------------------------
dim args1(2) as new com.sun.star.beans.PropertyValue
args1(0).Name = "URL"
args1(0).Value = "file:///C:/Docs/PasswordDoc.docx"
args1(1).Name = "FilterName"
args1(1).Value = "MS Word 2007 XML"
args1(2).Name = "EncryptionData"
args1(2).Value = Array(Array("OOXPassword","secret"))
dispatcher.executeDispatch(document, ".uno:SaveAs", "", 0, args1())
end sub
Even when I try to run that, it...saves an unprotected document, with no password encryption. I've even tried converting the macro above into the equivalent Python code, but to no avail either. I don't get any errors, it simply doesn't protect the document.
Finally, out of desperation, I've even tried other approaches that don't include LibreOffice, for example, using the Apache POI library as per the following existing StackOverflow question:
Python or LibreOffice Save xlsx file encrypted with password
...but I just get an error saying "Error: Could not find or load main class org.python.util.jython". I've tried upgrading my JDK, tweaking the paths used in the example, i.e. had an "intelligent" go, but still no joy. I suspect the error above is trivial to fix, but I'm not a Java developer and lack the experience in this area.
Does anyone have any solution? Do you have some LibreOffice code that can do this (password-protect .docx/.xlsx/.pptx files)? Or OpenOffice for that matter, I'm not precious about which package I use. Or something else entirely!
NOTE: I appreciate this is trivial using full-fat Microsoft Office, but thanks to Microsoft's licensing restrictions, is a complete no-go for this project - I have to use an alternative.

The following example is from page 40 (file page 56) of Useful Macro Information For OpenOffice.org by Andrew Pitonyak (http://www.pitonyak.org/AndrewMacro.odt). The document is directed to OpenOffice.org Basic but is generally applicable to LibreOffice as well. The example differs from the macro recorder version primarily in its use of the documented API rather than dispatch calls.
5.8.3. Save a document with a password
To save a document with a password, you must set the “Password” attribute.
Listing 5.19: Save a document using a password.
Sub SaveDocumentWithPassword
  Dim args(0) As New com.sun.star.beans.PropertyValue
  Dim sURL$
  args(0).Name ="Password"
  args(0).Value = "test"
  sURL=ConvertToURL("/andrew0/home/andy/test.odt")
  ThisComponent.storeToURL(sURL, args())
End Sub
The argument name is case sensitive, so “password” will not work.
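The Basic listing relies on ConvertToURL, while the Python code in the question hard-codes file:/// URLs. A rough Python counterpart for that one step is sketched below. It is not a LibreOffice API: to_file_url is a hypothetical helper built on the standard library, and it assumes an absolute path (either POSIX-style or with a Windows drive letter). It does not address whether the filter honours the Password property.

```python
import re
from urllib.parse import quote


def to_file_url(path):
    """Convert an absolute filesystem path to a file:// URL."""
    normalized = path.replace("\\", "/")
    if re.match(r"^[A-Za-z]:/", normalized):
        normalized = "/" + normalized  # drive-letter path: C:/Docs -> /C:/Docs
    elif not normalized.startswith("/"):
        raise ValueError(f"expected an absolute path, got {path!r}")
    # Percent-encode spaces and other special characters, keeping "/" and ":".
    return "file://" + quote(normalized, safe="/:")


print(to_file_url("C:/Docs/PasswordDoc.docx"))
print(to_file_url("/andrew0/home/andy/test.odt"))
```

The first call reproduces the URL used in the question's storeToURL call, and the second corresponds to the path in Listing 5.19.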

Suppress warnings in Hachoir

I'm using hachoir-parser to grab the duration of a large set of video files. (I'm resetting the "Last modified" date based on the file's timestamp, plus its duration.) I'm using code adapted from this question.
The problem I'm running into is that hachoir reports four warnings for each file, and this is cluttering up my output. I still get my duration from the file, so I'd like to know how to suppress these warnings in the output, if possible.
Python isn't really my strong suit, so I'm not sure where to look, and the documentation for hachoir seems pretty sparse on error reporting. I'd prefer not to resort to grepping the lines out of my script's output.
Edit: Running python -W ignore set_last_modified.py results in the same [warn] lines being printed.
[warn] [/headers/stream[2]/stream_fmt] Can't get field "stream_hdr" from /headers/stream[2]
[warn] [/headers/stream[2]/stream_fmt] [Autofix] Fix parser error: stop parser, add padding
[warn] [/headers/stream[3]/stream_fmt] Can't get field "stream_hdr" from /headers/stream[3]
[warn] [/headers/stream[3]/stream_fmt] [Autofix] Fix parser error: stop parser, add padding

You can use the -W option to suppress warnings in python.
python -W ignore my_file.py
Edit: since you've already tried the above, you could try the following.
import warnings
# add the following before you call the function that gives warnings.
warnings.filterwarnings("ignore")
# run your function here
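Note that both suggestions above only affect messages issued through Python's warnings module. The fact that python -W ignore left the [warn] lines in place suggests that they are written by the library's own logging instead. The sketch below illustrates the difference with a hypothetical noisy_parse function, not with hachoir itself, and it assumes the library writes to sys.stderr, which is not confirmed here.

```python
import contextlib
import io
import sys
import warnings


def noisy_parse():
    """Hypothetical parser: one message via warnings, one written straight to stderr."""
    warnings.warn("reported through the warnings module")
    print("[warn] reported by the library's own logger", file=sys.stderr)
    return 42


captured = io.StringIO()
with warnings.catch_warnings(), contextlib.redirect_stderr(captured):
    warnings.simplefilter("ignore")  # comparable to running with -W ignore
    result = noisy_parse()

# Only the directly written line got past the warnings filter.
print(repr(captured.getvalue()))
```

Redirecting stderr as shown is a blunt way of hiding such output, since it also hides genuine errors. A switch provided by the library itself is preferable when one exists.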

I found the solution by checking the issues page for the project on BitBucket.
https://bitbucket.org/haypo/hachoir/issues/54/control-log-level-whith-the-python-api
from hachoir_core import config as HachoirConfig
HachoirConfig.quiet = True
