Python: how to use Tika with an existing jar file without downloading it again

I'm using Tika and I realized that the jar file is downloaded again on every run and placed in the Temp folder:
Retrieving http://search.maven.org/remotecontent?filepath=org/apache/tika/tika-server/1.19/tika-server-1.19.jar to C:\Users\asus\AppData\Local\Temp\tika-server.jar.
Retrieving http://search.maven.org/remotecontent?filepath=org/apache/tika/tika-server/1.19/tika-server-1.19.jar.md5 to C:\Users\asus\AppData\Local\Temp\tika-server.jar.md5.
The problem is that the jar file is around 60 MB, so the download takes some time.
This is the code I'm using:
from tika import parser

def get_pdf_text(path):
    parsed = parser.from_file(path)
    return parsed['content']
The only workaround I found is this :
1 - Manually running the jar using java -jar tika-server-x.x.jar --port xxxx
2 - Using tika.TikaClientOnly = True
3 - Replacing parser.from_file(path) with parser.from_file(path, '/path/to/server')
But I don't want to run the jar file manually. It would be better if Python could start the jar automatically and set up Tika with it, without redownloading.

To resolve this problem, add an environment variable for the Tika server jar that specifies the folder containing the Tika jar file:
TIKA_SERVER_JAR = 'PATH_OF_FOLDER_CONTAINING_TIKA_SERVER_JAR'

If you don't want to add an environment variable, you can change the directory where tika looks for the tika-server.jar file with the code below.
from tika import tika
tika.TikaJarPath = r'TIKA_SERVER_PATH'
In TIKA_SERVER_PATH, the jar file must be named tika-server.jar (the name must not include the version), and the .md5 file must be there as well. If the .md5 file does not match tika-server.jar, this method doesn't work: tika will delete your file and download the default version.

Here is what worked for me:
import os
os.environ['TIKA_SERVER_JAR'] = "<path_to_jar_and_md5>/tika-server.jar"
os.environ['TIKA_PATH'] = "<path_to_jar_and_md5_again>"
These are read when the library is imported, so import the parser afterwards, and re-import it if you change them.

After trying almost everything and debugging the tika.py library code, I found that you must set both of these variables for this hack to work:
TIKA_SERVER_JAR="/path_to_tika_server/tika-server.jar"
TIKA_PATH="/path_to_tika_server"
You also need to provide a .md5 signature file, because since Tika version 1.18 no .md5 file is published (a sha512 signature is provided instead, see https://archive.apache.org/dist/tika/). So you need to trick the library into accepting your downloaded file.
Or someone could just patch python library :)

I am wondering how to get the .md5 file for tika-server.jar, since no .md5 file is provided and only a sha512 signature is available.
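One way to get a matching .md5 file is to compute it yourself from the jar you already have. This is only a sketch: `write_md5_sidecar` is a made-up helper name, and it assumes the library compares the contents of tika-server.jar.md5 with the hex MD5 digest of the jar, which is what the answers above suggest.

```python
import hashlib
from pathlib import Path


def write_md5_sidecar(jar_path):
    """Write <jar_path>.md5 containing the hex MD5 digest of the jar."""
    jar = Path(jar_path)
    digest = hashlib.md5()
    with jar.open("rb") as fh:
        # Read in chunks so a ~60 MB jar is not loaded into memory at once.
        for chunk in iter(lambda: fh.read(1 << 20), b""):
            digest.update(chunk)
    sidecar = jar.with_name(jar.name + ".md5")
    # Assumption: the file contents are compared to the bare hex digest,
    # so no trailing newline is written.
    sidecar.write_text(digest.hexdigest())
    return sidecar
```

Calling `write_md5_sidecar("/path_to_tika_server/tika-server.jar")` once, before importing the parser, should leave both files side by side in the folder the variables point to.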

Related

Pytesseract Failed loading language 'chi-sim'

I am working on python tesseract package with sample code like the follows:
import pytesseract
from PIL import Image
tessdata_dir_config = "--tessdata-dir \"/opt/homebrew/Cellar/tesseract-lang/4.1.0/share/tessdata/\""
image = Image.open("dataset/test.jpeg")
text = pytesseract.image_to_string(image, lang = "chi-sim", config = tessdata_dir_config)
print(text)
And I received the following error message:
pytesseract.pytesseract.TesseractError: (1, 'Error opening data file /opt/homebrew/Cellar/tesseract-lang/4.1.0/share/tessdata/chi-sim.traineddata Please make sure the TESSDATA_PREFIX environment variable is set to your "tessdata" directory. Failed loading language 'chi-sim' Tesseract couldn't load any languages! Could not initialize tesseract.')
From my understanding, the error occurs when reading the file chi-sim.traineddata (Simplified Chinese). The attempts I have made to fix it are described below.
My development environment is an M1 Mac running macOS, and I installed tesseract and tesseract-lang from Homebrew. I am pretty sure that the path specified above is exactly where the data files are located, since when I call
print(pytesseract.get_languages(config = ""))
I get a long list of languages printed, including chi-sim.
Further, if we just use English instead of Chinese, the following code can successfully recognize the English texts in an image:
text = pytesseract.image_to_string(image)
I've tried to specify environment variable TESSDATA_PREFIX in multiple ways, including:
Using config parameter as in the original code.
Adding global environment variable in PyCharm.
Adding the following line in the code
os.environ["TESSDATA_PREFIX"] = "tesseract/4.1.1/share/tessdata/"
Adding the following line to bash_profile in terminal
export TESSDATA_PREFIX=/opt/homebrew/Cellar/tesseract-lang/4.1.0/share/tessdata/
But unfortunately, none of these works.
It seems as if my file chi-sim.traineddata is, somehow, broken, so I directly downloaded the trained data file from GitHub (https://github.com/tesseract-ocr/tessdata/blob/master/chi_sim.traineddata), hit the "Download" button on the right, and placed the downloaded file in the tesseract-lang and original tesseract directory (where eng.traineddata is located). Yes, I've tried both, but neither works.
Are there any potential solutions to this issue?
The code works for me on Linux if I use lang="chi_sim" with _ instead of -, because the file downloaded from the server is named chi_sim.traineddata, also with _ instead of -.
If I rename the file to chi-sim.traineddata, then I can use lang="chi-sim" (with - instead of _).
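To check which spelling is actually on disk before calling pytesseract, something like the following might help. It is a sketch using only the standard library; `available_languages` and `resolve_language` are hypothetical helper names, and the only assumption is that language data lives in files named <code>.traineddata.

```python
from pathlib import Path


def available_languages(tessdata_dir):
    """Return the language codes that have a .traineddata file in tessdata_dir."""
    return sorted(p.stem for p in Path(tessdata_dir).glob("*.traineddata"))


def resolve_language(requested, tessdata_dir):
    """Map a requested code such as 'chi-sim' to the name actually on disk."""
    langs = available_languages(tessdata_dir)
    # Try the code as given first, then the underscore spelling.
    for candidate in (requested, requested.replace("-", "_")):
        if candidate in langs:
            return candidate
    raise FileNotFoundError(f"no traineddata for {requested!r} in {tessdata_dir}")
```

The value returned by `resolve_language("chi-sim", tessdata_dir)` can then be passed as the lang argument of pytesseract.image_to_string.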

How do I access a file for reading/writing in a different (non-current) directory?

I am working on the listener portion of a backdoor program (for an ETHICAL hacking course) and I would like to be able to read files from any part of my linux system and not just from within the directory where my listener python script is located - however, this has not proven to be as simple as specifying a typical absolute path such as "~/Desktop/test.txt"
So far my code is able to read files and upload them to the virtual machine where my reverse backdoor script is actively running. But this is only when I read and upload files that are in the same directory as my listener script (aptly named listener.py). Code shown below.
import base64

def read_file(self, path):
    # Read the file as bytes and return its base64-encoded contents
    with open(path, "rb") as file:
        return base64.b64encode(file.read())
As I've mentioned previously, the above function only works if I try to open and read a file that is in the same directory as the script that the above code belongs to, meaning that path in the above content is a simple file name such as "picture.jpg"
I would like to be able to read a file from any part of my filesystem while maintaining the same functionality.
For example, I would love to be able to specify "~/Desktop/another_picture.jpg" as the path so that the contents of "another_picture.jpg" from my "~/Desktop" directory are base64 encoded for further processing and eventual upload.
Any and all help is much appreciated.
Edit 1:
My script where all the code is contained, "listener.py", is located in /root/PycharmProjects/virus_related/reverse_backdoor/. Within this directory is a file that, for simplicity's sake, we can call "picture.jpg". The same file, "picture.jpg", is also located on my desktop, absolute path = "/root/Desktop/picture.jpg"
When I try read_file("picture.jpg"), there are no problems, the file is read.
When I try read_file("/root/Desktop/picture.jpg"), the file is not read and my terminal becomes stuck.
Edit 2:
I forgot to note that I am using the latest version of Kali Linux and Pycharm.
I have run "realpath picture.jpg" and it has yielded the path "/root/Desktop/picture.jpg"
Upon running read_file("/root/Desktop/picture.jpg"), I encounter the same problem where my terminal becomes stuck.
[FINAL EDIT aka Problem solved]:
Based on the answer suggesting trying to read a file like "../file", I realized that the code was fully functional, because read_file("../file") worked without any flaws, indicating that my Python script had no trouble locating the given path. Once the file was read, it was uploaded to the machine running my backdoor where, curiously, it ended up in the parent directory of the script. It was then that I realized that the problem lay in the handling of paths in the backdoor script rather than in my listener.py.
Credit is also due to the commenter who pointed out that "~" does not count as a valid path element. Once I reached the conclusion mentioned just above, I attempted read_file("~/Desktop/picture.jpg"), which failed. But with a quick modification, read_file("/root/Desktop/picture.jpg") executed successfully, and the file was uploaded to the same directory as my backdoor script on my target machine once I implemented some quick-fix code.
My apologies for not being so specific; efforts to aid were certainly confounded by the unmentioned complexity of my situation and I would like to personally thank everyone who chipped in.
This was my first whole-hearted attempt to reach out to the stackoverflow community for help and I have not been disappointed. Cheers!
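The point about "~" can be handled in the reading function itself: open() does not expand "~", but os.path.expanduser does. A minimal sketch, where `read_file_b64` is a hypothetical standalone version of the read_file method above:

```python
import base64
import os


def read_file_b64(path):
    """Read a file and return its contents base64-encoded.

    open() does not expand "~", so the path is expanded explicitly first.
    """
    full_path = os.path.abspath(os.path.expanduser(path))
    with open(full_path, "rb") as fh:
        return base64.b64encode(fh.read())
```

With this, "~/Desktop/picture.jpg", "/root/Desktop/picture.jpg" and "picture.jpg" are all accepted; the last one is still resolved against the current working directory.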
A solution I found is putting "../" before the filename if the file sits one level above the current directory.
test.py (in a directory directly inside "Desktop", e.g. /Desktop/test):
with open("../test.txt", "r") as test:
    print(test.readlines())
test.txt (in the directory "/Desktop"):
Hi!
Hello!
Result (readlines() keeps each line's trailing newline):
['Hi!\n', 'Hello!\n']
This is likely the simplest solution. I found it because I always use "cd ../" in the terminal.
This lets you work not only with the current file but with every other file in the same directory as the script.
import os
from PIL import Image

# Directory that contains this script
path = os.path.dirname(os.path.abspath(__file__))
for filename in os.listdir(path):
    full_path = os.path.join(path, filename)
    if not os.path.isfile(full_path):
        continue  # skip subdirectories
    with open(full_path, 'rb') as f:
        content = f.read()
    print(filename, len(content))
    try:
        im = Image.open(full_path)
        im.show()
    except IOError:
        print('The following file is not an image type:', filename)

How to pass all files (with given name patterns) to python program in PyCharm using the parameters field in run/debug configurations?

I have a bunch of .html files in a directory that I am reading into a python program using PyCharm. I am using the (*) star operator in the following way in the parameters field of the run/debug configuration dialog box in PyCharm:
*.html
, but this doesn't work. I get the following error:
IOError: [Errno 2] No such file or directory: '*.html'
at the line where I open the file to read into my program. I think it's reading "*.html" literally as a file name. I'd appreciate your help in teaching me how to use the star operator in this case.
Addendum:
I'm pretty new to Python and Pycharm. I'm running my script using the following configuration options:
Now, I've tried different variations of parameters here, like '*.html', "*.html", and just *.html. I also tried glob.glob('*.html'), but the code takes it literally and thinks that the file name itself is "glob.glob('*.html')" and throws an error. I think this is more of a Pycharm thing than understanding bash or python. I guess what I want is to make Pycharm pass all the files of the directory through that parameters field in the picture. Is there some way for me to specify to Pycharm NOT to consider the string of parameters literally?
The way the files are being handled is by running a for loop through the sys.argv list and calling a function on each file. The function simply uses the open() method to read the contents of the file into a string so I can pull patterns out of the text. Hope that fleshes out the problem a bit better.
Filename expansion is a feature of bash. So if you call your Python script from the Linux command line, it will work, just as if you had typed out all of the filenames as arguments to your script. PyCharm doesn't have this feature, so you will have to do the expansion yourself in your Python script using glob.
import glob
import sys
files = glob.glob(sys.argv[-1])
To keep compatibility between bash and PyCharm, you can use something like this:
import glob
globs = ['*.html', '*.css', 'script.js']
files = []
for g in globs:
    files.extend(glob.glob(g))
I have multiple arguments so this is what I did to allow for compatibility:
I have an argparse argument that returns an array of image file names. I check it as follows.
images = args["images"]
if len(images) == 1 and '*' in images[0]:
    import glob
    images = glob.glob(images[0])
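Putting the answers above together, a script can expand every argument itself so that it behaves the same whether bash already expanded the pattern or PyCharm passed it through literally. This is a sketch; `expand_args` is a hypothetical helper name.

```python
import glob
import sys


def expand_args(args):
    """Expand wildcard patterns; keep arguments without matches unchanged."""
    files = []
    for arg in args:
        matches = sorted(glob.glob(arg))
        # An existing literal file name matches itself; an argument that
        # matches nothing is passed through as given.
        files.extend(matches if matches else [arg])
    return files


if __name__ == "__main__":
    for name in expand_args(sys.argv[1:]):
        print(name)
```

With *.html in the PyCharm parameters field, the loop over sys.argv can then iterate over `expand_args(sys.argv[1:])` instead. Patterns are resolved relative to the working directory set in the run configuration.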

Create a SFX archive using python

I am looking for some help with a Python script to create a self-extracting archive (SFX), basically an exe file like the ones WinRAR can create.
I want to archive a folder with password protection and also split it into volumes of 3900 MB so that it can easily be burned to a disk.
I know WinRAR has command-line parameters to create an archive, but I am not sure how to call it from Python. Any help on this would be appreciated.
Here are main things I want:
Archive Format - RAR
Compression Method Normal
Split Volume size, 3900 MB
Password protection
I looked up everywhere but don't seem to find anything around this functionality.
You could have a look at rarfile
Alternatively use something like:
from subprocess import call
cmdlineargs = "command -switch1 -switchN archive files.. path_to_extract"
call(["WinRAR"] + cmdlineargs.split())
Note that the arguments in the second line are only an example; you will need to substitute the correct command-line arguments.
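For the requirements listed in the question, the argument list could look roughly like this. It is a sketch, not a tested recipe: it assumes the command-line `rar` tool that ships with WinRAR is on PATH, and `build_rar_command` and `create_sfx` are hypothetical helper names. The switches (a, -sfx, -m3, -v, -hp, -r) are taken from the RAR command-line manual, so check them against the version you have installed.

```python
import subprocess


def build_rar_command(archive, folder, password, volume_mb=3900, exe="rar"):
    """Build an argument list for the RAR command-line tool (not executed here)."""
    return [
        exe,                  # assumption: "rar" (or Rar.exe) is on PATH
        "a",                  # add files to an archive
        "-sfx",               # create a self-extracting archive
        "-m3",                # normal compression
        f"-v{volume_mb}m",    # split into volumes of volume_mb megabytes
        f"-hp{password}",     # encrypt file data and headers with the password
        "-r",                 # recurse into subfolders
        archive,
        folder,
    ]


def create_sfx(archive, folder, password):
    """Run the command and raise CalledProcessError if the tool fails."""
    subprocess.run(build_rar_command(archive, folder, password), check=True)
```

Passing a list to subprocess.run avoids the cmdlineargs.split() step, which would break on paths that contain spaces.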

Python3:Save File to Specified Location

I have a rather simple program that writes HTML code ready for use.
It works fine, except that if one were to run the program from the Python command line, as is the default, the HTML file that is created is created where python.exe is, not where the program I wrote is. And that's a problem.
Do you know a way of getting the .write() function to write a file to a specific location on the disc (e.g. C:\Users\User\Desktop)?
Extra cool-points if you know how to open a file browser window.
The first problem is probably that you are not including the full path when you open the file for writing. For details on opening a web browser, read this fine manual.
import os
import webbrowser

target_dir = r"C:\full\path\to\where\you\want\it"
filename = "output.html"  # example name; use your own
fullname = os.path.join(target_dir, filename)
with open(fullname, "w") as f:
    f.write("<html>....</html>")

# Open the written file in the default browser
url = "file://" + fullname.replace("\\", "/")
webbrowser.open(url, True, True)
BTW: the code is the same in python 2.6.
I'll admit I don't know Python 3, so I may be wrong, but in Python 2, you can just check the __file__ variable in your module to get the name of the file it was loaded from. Just create your file in that same directory (preferably using os.path.dirname and os.path.join to remain platform-independent).
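The __file__ approach works in Python 3 as well. A minimal sketch using pathlib, where `path_next_to` and `write_html` are hypothetical helper names and the script path is passed in explicitly so the functions can be reused:

```python
from pathlib import Path


def path_next_to(script_file, name):
    """Return the path of `name` inside the directory that holds script_file."""
    return Path(script_file).resolve().parent / name


def write_html(script_file, name, html):
    """Write html to a file placed beside script_file and return its path."""
    target = path_next_to(script_file, name)
    target.write_text(html, encoding="utf-8")
    return target
```

Inside the program, `write_html(__file__, "page.html", "<html>....</html>")` would create page.html next to the script regardless of where Python was started from.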
