Pytesseract Failed loading language 'chi-sim'


I am using the pytesseract package with sample code like the following:
import pytesseract
from PIL import Image
tessdata_dir_config = "--tessdata-dir \"/opt/homebrew/Cellar/tesseract-lang/4.1.0/share/tessdata/\""
image = Image.open("dataset/test.jpeg")
text = pytesseract.image_to_string(image, lang = "chi-sim", config = tessdata_dir_config)
print(text)
And I received the following error message:
pytesseract.pytesseract.TesseractError: (1, 'Error opening data file /opt/homebrew/Cellar/tesseract-lang/4.1.0/share/tessdata/chi-sim.traineddata Please make sure the TESSDATA_PREFIX environment variable is set to your "tessdata" directory. Failed loading language 'chi-sim' Tesseract couldn't load any languages! Could not initialize tesseract.')
As I understand it, the error occurs when Tesseract tries to read chi-sim.traineddata (the Simplified Chinese data file). Below are the attempts I have made to resolve it.
My development environment is macOS on an M1 machine, and I installed tesseract and tesseract-lang from Homebrew. I am fairly sure that the path specified above is where the data files are located, because when I call
print(pytesseract.get_languages(config = ""))
I get a long list of languages printed, including chi-sim.
Further, if we just use English instead of Chinese, the following code can successfully recognize the English texts in an image:
text = pytesseract.image_to_string(image)
I've tried to specify the TESSDATA_PREFIX environment variable in several ways, including:
Using the config parameter, as in the original code.
Adding a global environment variable in PyCharm.
Adding the following line to the code:
os.environ["TESSDATA_PREFIX"] = "tesseract/4.1.1/share/tessdata/"
Adding the following line to bash_profile in the terminal:
export TESSDATA_PREFIX=/opt/homebrew/Cellar/tesseract-lang/4.1.0/share/tessdata/
Unfortunately, none of these worked.
It seems as if my file chi-sim.traineddata is, somehow, broken, so I directly downloaded the trained data file from GitHub (https://github.com/tesseract-ocr/tessdata/blob/master/chi_sim.traineddata), hit the "Download" button on the right, and placed the downloaded file in the tesseract-lang and original tesseract directory (where eng.traineddata is located). Yes, I've tried both, but neither works.
Are there any potential solutions to this issue?

The code works for me on Linux if I use lang="chi_sim" (with _ instead of -), because the file downloaded from the server is named chi_sim.traineddata, also with an underscore.
If I rename the file to chi-sim.traineddata, then I can use lang="chi-sim" (with - instead of _).
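The point above, that the lang value must match the .traineddata file name exactly, can be checked before calling Tesseract. The following is only a sketch: resolve_lang is a hypothetical helper (not part of pytesseract), and the temporary directory stands in for the real tessdata folder.

```python
import os
import tempfile


def resolve_lang(tessdata_dir, lang):
    """Return the language code whose .traineddata file exists, or None.

    Hypothetical helper: it tries the code as given, then the hyphen and
    underscore spellings, and reports whichever one is actually on disk.
    """
    candidates = [lang, lang.replace("-", "_"), lang.replace("_", "-")]
    for code in candidates:
        if os.path.isfile(os.path.join(tessdata_dir, code + ".traineddata")):
            return code
    return None


# Demo with a throwaway directory standing in for the real tessdata folder.
with tempfile.TemporaryDirectory() as demo_dir:
    open(os.path.join(demo_dir, "chi_sim.traineddata"), "wb").close()
    print(resolve_lang(demo_dir, "chi-sim"))
    print(resolve_lang(demo_dir, "deu"))
```

The returned code is what would be passed as lang to pytesseract.image_to_string; a None result means no matching data file was found in that directory.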

Related

Python - images to video [duplicate]

cv2.imread is always returning NoneType.
I am using python version 2.7 and OpenCV 2.4.6 on 64 bit Windows 7.
Maybe it's some kind of bug or permissions issue because the exact same installation of python and cv2 packages in another computer works correctly. Here's the code:
im = cv2.imread("D:\testdata\some.tif",CV_LOAD_IMAGE_COLOR)
I downloaded OpenCV from http://www.lfd.uci.edu/~gohlke/pythonlibs/#opencv. Any clue would be appreciated.

First, make sure the path is valid, not containing any single backslashes. Check the other answers, e.g. https://stackoverflow.com/a/26954461/463796.
If the path is fixed but the image is still not loading, it might indeed be an OpenCV bug that is not resolved yet, as of 2013. cv2.imread is not working properly under Win32 for me either.
In the meantime, use LoadImage, which should work fine.
im = cv2.cv.LoadImage("D:/testdata/some.tif", CV_LOAD_IMAGE_COLOR)

In my case the problem was the spaces in the path. After I moved the images to a path with no spaces it worked.

Try changing the direction of the slashes
im = cv2.imread("D:/testdata/some.tif",CV_LOAD_IMAGE_COLOR)
or add r to the beginning of the string
im = cv2.imread(r"D:\testdata\some.tif",CV_LOAD_IMAGE_COLOR)
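Why the slash direction and the r prefix matter can be seen without OpenCV at all. This small sketch only inspects the strings; the shortened path D:\testdata is used for illustration.

```python
plain = "D:\testdata"     # "\t" is parsed as a TAB character
raw = r"D:\testdata"      # raw string: the backslash stays a backslash
forward = "D:/testdata"   # forward slashes need no escaping

print(repr(plain))
print(repr(raw))
print("contains a tab:", "\t" in plain, "\t" in raw, "\t" in forward)
```

Only the first string contains a tab, which is why the path in the question never matches a real file.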

I also ran into the same issue on Ubuntu 18.04.
cv2.imread(path)
I solved it by changing the path argument from a relative file path to an absolute file path.
Hope this is useful.

I just stumbled upon this one.
The solution is very simple but not intuitive.
If you use relative paths, you can use either '\' or '/', as in test\pic.jpg or test/pic.jpg respectively.
If you use absolute paths, you should only use '/', as in /.../test/pic.jpg on Unix or C:/.../test/pic.jpg on Windows.
To be on the safe side, use for root, _, files in os.walk(<path>): in combination with abs_path = os.path.join(root, file). Calling imread afterwards, as in img = ocv.imread(abs_path), then receives a well-formed path built by the OS-aware helpers.
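The os.walk approach described above could look roughly like this. It is a sketch: iter_image_paths and its default extension list are hypothetical, and the imread call is left to the caller.

```python
import os


def iter_image_paths(root_dir, extensions=(".jpg", ".jpeg", ".png", ".tif")):
    """Yield absolute paths of files under root_dir with an image extension."""
    for root, _, files in os.walk(root_dir):
        for name in sorted(files):
            # Compare case-insensitively so "A.JPG" is also picked up.
            if name.lower().endswith(extensions):
                yield os.path.abspath(os.path.join(root, name))
```

Each yielded path can then be passed to imread; because the separators come from os.path.join, no hand-written slashes or escapes are involved.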

In case no one has mentioned it in this question, another workaround is to read the image with plt (presumably matplotlib.pyplot) and then reverse the channel order to get BGR format.
img=plt.imread(img_path)
print(img.shape)
img=img[...,::-1]
it has been mentioned in
cv2.imread does not read jpg files

This took a long time to resolve. First, make sure that the file is in the directory, and check its real extension: even though Windows Explorer says the file is "JPEG", the extension may actually be "JPG". The first print statement is key to making sure that the file actually exists. I am a total beginner, so if the code sucks, so be it.
The code just imports a picture and displays it. If the code finds the file, True will be printed in the Python window.
import cv2
import sys
import numpy as np
import os
image_path= "C:/python27/test_image.jpg"
print(os.path.exists(image_path)) # True if the file was found
CV_LOAD_IMAGE_COLOR = 1 # set flag to 1 to give colour image
# CV_LOAD_IMAGE_COLOR = 0 # set flag to 0 instead to give a grayscale one
img = cv2.imread(image_path,CV_LOAD_IMAGE_COLOR)
print(img.shape)
cv2.namedWindow('Display Window') ## create window for display
cv2.imshow('Display Window', img) ## Show image in the window
cv2.waitKey(0) ## Wait for keystroke
cv2.destroyAllWindows() ## Destroy all windows

I had a similar problem; renaming the image so that its name used only English letters worked for me. It also didn't work with a numeric name (e.g. 1.jpg).

My OS is Windows 10. I noticed that imread is very sensitive to the path. None of the recommendations about slashes worked for me, so here is how I solved the problem: I placed the file in the project folder and typed:
img = cv2.imread("MyImageName.jpg", 0)
So without any path and folder, just file name. And that worked for me.
Also try different files from different sources and in different formats.

I spent some time on this only to find that the error was caused by a broken image file in my case. So please manually check your file to make sure it is valid and can be opened by common image viewers.

I had a similar issue; changing the direction of the slashes worked:
Change / to \

In my case, changing the file names to the Latin alphabet helped.
Instead of renaming all the files, I wrote a simple wrapper that renames a file to a random GUID before loading it and renames it back right after the load.
import logging
import os
import uuid
import cv2
logger = logging.getLogger(__name__)  # use your logger here
def wrap_file_rename(my_path, function):
    # Temporarily rename the file to a random ASCII-only name, call function, then restore it
    new_full_name = os.path.join(os.path.dirname(my_path), str(uuid.uuid4()))
    os.rename(my_path, new_full_name)
    try:
        return function(new_full_name)
    except Exception as error:
        logger.error(error)
    finally:
        os.rename(new_full_name, my_path)  # always restore the original name
def my_image_read(my_path, param=None):
    return wrap_file_rename(my_path, lambda p: cv2.imread(p) if param is None else cv2.imread(p, param))

Sometimes the file is corrupted. If it exists and cv2.imread returns None, this may be the case.
Try opening the file from the file explorer and see if that works.
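Since cv2.imread returns None without saying why, a few filesystem checks before the call can narrow things down. This is a sketch only; diagnose_image_path is a hypothetical helper, and an "ok" result does not prove that the image data itself can be decoded.

```python
import os


def diagnose_image_path(path):
    """Return a short reason why a file might fail to load, or "ok"."""
    if not os.path.exists(path):
        return "missing"
    if not os.path.isfile(path):
        return "not a file"
    if not os.access(path, os.R_OK):
        return "not readable"      # e.g. a permissions problem
    if os.path.getsize(path) == 0:
        return "empty"             # zero bytes: certainly not a valid image
    return "ok"


print(diagnose_image_path("definitely_missing_image.jpg"))
```

If the result is "ok" and imread still returns None, a corrupted or unsupported file becomes the more likely explanation.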

I've run into this. Turns out the PIL module provides this functionality.
Similarly, numpy.imread and scipy.misc.imread both didn't exist until I installed PIL
In my configuration (win7 python2.7), that was done as follows:
cd /c/python27/scripts
easy_install PIL

h5py.File(path) doesn't recognize folder path

I am in my project folder, called "project". I have two neural network h5 files: one is "project/my_folder/my_model_1.h5", and I also copied it to "project/my_model_2.h5". I opened a Jupyter Notebook whose working directory is the "project" folder.
import h5py
f = h5py.File("my_model_2.h5") # has NO Issue
but
f = h5py.File("my_folder/my_model_1.h5") # OSError
It says OSError: Unable to open file (unable to open file: name = 'my_folder/my_model_1.h5', errno = 2, error message = 'No such file or directory', flags = 0, o_flags = 0)
Interestingly, I only have this issue on my Mac; I don't encounter it when I do the same thing on a Linux machine.
Please let me know if you know how to fix this. Thank you in advance.

So it looks like a hidden invalid character accidentally got copied when I copied and pasted the file path from the Mac folder system. Take a look at the code in the screenshot.
Line 92 is the path name I copied and pasted directly from the Mac folder.
Line 93 is the path I typed out letter by letter; with it there is no error and the .h5 file loads properly. It's similar to an issue that someone spotted at this link: Invalid character in identifier
I simply copied the offending code into PyCharm, and the unwelcome character was exposed.
So the solution, for Mac users: be careful about copying text straight from the folder system. If something looks obviously weird, try typing every letter into the text editor yourself.
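A hidden character like this can also be exposed from Python rather than from an IDE. The sketch below lists every non-ASCII character in a string; suspicious_chars is a hypothetical helper, and the zero-width space in the sample is an assumption, since the answer does not say which character was actually pasted.

```python
import unicodedata


def suspicious_chars(text):
    """Return (index, code point, name) for every non-ASCII character."""
    return [
        (i, "U+%04X" % ord(ch), unicodedata.name(ch, "UNNAMED"))
        for i, ch in enumerate(text)
        if ord(ch) > 127
    ]


# Hypothetical pasted path with an invisible zero-width space after the slash.
pasted = "my_folder/\u200bmy_model_1.h5"
print(suspicious_chars(pasted))
```

An empty list means the path contains only plain ASCII; note that legitimately non-ASCII file names would be reported as well.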

Specifying the absolute path using the os module worked on Windows:
file_name = os.path.dirname(__file__) +'\\my_folder\\my_model_1.h5'
f = h5py.File(file_name)
Don't forget to import os, though.

pandas.read_csv FileNotFoundError: File b'\xe2\x80\xaa<etc>' despite correct path

I'm trying to load a .csv file using the pd.read_csv() function, but I get an error despite the file path being correct and using raw strings.
import pandas as pd
df = pd.read_csv('‪C:\\Users\\user\\Desktop\\datafile.csv')
df = pd.read_csv(r'‪C:\Users\user\Desktop\datafile.csv')
df = pd.read_csv('C:/Users/user/Desktop/datafile.csv')
All of these give the error below:
FileNotFoundError: File b'\xe2\x80\xaaC:/Users/user/Desktop/tutorial.csv' (or the relevant path) does not exist.
Only when I copy the file into the working directory does it load correctly.
Is anyone aware of what might be causing the error?
I had previously loaded other datasets with full filepaths without any problems and I'm currently only encountering issues since I've re-installed my python (via Anaconda package installer).
Edit:
I've found the issue that was causing the problem.
When I was copying the filepath over from the file properties window, I unwittingly copied another character that seems invisible.
Assigning that copied string also gives a Unicode error.
Deleting that invisible character made all of the above code work.

Try this and see if it works. This is independent of the path you provide.
pd.read_csv(r'C:\Users\aiLab\Desktop\example.csv')
Here r is a prefix that marks the literal as a raw string, so put it in front of your string literal.
https://www.journaldev.com/23598/python-raw-string:
Python raw string is created by prefixing a string literal with ‘r’
or ‘R’. Python raw string treats backslash (\) as a literal character.
This is useful when we want to have a string that contains backslash
and don’t want it to be treated as an escape character.

$10 says your file path is correct with respect to the location of the .py file, but incorrect with respect to the location from which you call python
For example, let's say script.py is located in ~/script/, and file.csv is located in ~/. Let's say script.py contains
import pandas
df = pandas.read_csv('../file.csv') # correct path from ~/script/ where script.py resides
If from ~/ you run python script/script.py, you will get the FileNotFound error. However, if from ~/script/ you run python script.py, it will work.
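One way to make the script independent of where python is launched from is to resolve data paths against the script's own location. This is a sketch under that assumption; resolve_from_script is a hypothetical helper, and in a real script its first argument would be __file__.

```python
from pathlib import Path


def resolve_from_script(script_file, relative):
    """Resolve `relative` against the folder containing `script_file`,
    not against the current working directory."""
    return (Path(script_file).resolve().parent / relative).resolve()
```

Inside script.py this would be used as pandas.read_csv(resolve_from_script(__file__, "../file.csv")), which points at the same file whether the script is started from ~/ or from ~/script/.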

I know the following is a silly mistake, but it could be the problem with your file.
I renamed the file manually from adfa123 to abc.csv. The file's extension was hidden, so after renaming, the actual file name became abc.csv.csv. I then removed the extra .csv from the name and everything was fine.
Hope this helps someone else.
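Because a hidden extension is easy to miss in a file manager, listing the real names from Python can reveal it. A minimal sketch, with find_name_variants as a hypothetical helper:

```python
from pathlib import Path


def find_name_variants(directory, expected_name):
    """List files that start with expected_name but carry an extra suffix,
    for example abc.csv.csv or abc.csv.txt when abc.csv was expected."""
    return sorted(
        p.name
        for p in Path(directory).iterdir()
        if p.is_file()
        and p.name != expected_name
        and p.name.startswith(expected_name + ".")
    )
```

A non-empty result for the name you are trying to open suggests the file was saved with a doubled or hidden extension.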

import pandas as pd
path1 = 'C:\\Users\\Dell\\Desktop\\Data\\Train_SU63ISt.csv'
path2 = 'C:\\Users\\Dell\\Desktop\\Data\\Test_0qrQsBZ.csv'
df1 = pd.read_csv(path1)
df2 = pd.read_csv(path2)
print(df1)
print(df2)

On Windows systems, you should try os.path.normcase.
It normalizes the case of a pathname. On Unix and Mac OS X, this returns the path unchanged; on case-insensitive filesystems, it converts the path to lowercase. On Windows, it also converts forward slashes to backward slashes. It raises a TypeError if the type of path is not str or bytes (directly or indirectly through the os.PathLike interface).
import os
import pandas as pd
script_dir = os.getcwd()
file = 'example_file.csv'
data = pd.read_csv(os.path.normcase(os.path.join(script_dir, file)))

If you are using a Windows machine, try checking the file extension.
There is a high possibility that the file was saved as fileName.csv.txt instead of fileName.csv.
You can check this by selecting the "File name extensions" checkbox under folder options.
The code below worked for me:
import pandas as pd
df = pd.read_csv(r"C:\Users\vj_sr\Desktop\VJS\PyLearn\DataFiles\weather_data.csv")
If the file is named fileName.csv.txt, rename it to fileName.csv.
Hope it works,
Good luck

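Try using os.path.join to create the file path. Note that the drive needs its trailing separator; os.path.join('C:', 'Users') yields the drive-relative path C:Users rather than C:\Users.
import os
import pandas as pd
f_path = os.path.join('C:\\', 'Users', 'user', 'Desktop', 'datafile.csv')
df = pd.read_csv(f_path)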

I was trying to read a CSV file from a folder on my C:\ drive, but it raised escape, type, and Unicode errors. This code works:
Just put r in front of the path string when reading it.
rank = pd.read_csv (r'C:\Users\DELL\Desktop\datasets\iris.csv')
df=pd.DataFrame(rank)

There is another question: how to delete the characters that seem invisible.
My solution is to copy the file path from the file window instead of the properties window.
That causes no problem, except that you then have to complete the file path yourself.

Experienced the same issue. Path was correct.
Changing the file name seems to solve the problem.
Old file name: Season 2017/2018 Premier League.csv
New file name: test.csv
Possibly the cause was the whitespace or the "/".

I had the same problem when running the file with the interactive functionality provided by Visual Studio. I switched to running it on the native command line and it worked for me.

For my particular issue, the failure to load the file correctly was due to an "invisible" character that was introduced when I copied the filepath from the security tab of the file properties in windows.
This character is e2 80 aa, the UTF-8 encoding of U+202A, the left-to-right embedding symbol. It can be easily removed by erasing (hitting backspace or delete) when you've located the character (leftmost char in the string).
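Instead of hunting for the character by hand, it can be removed programmatically. The sketch below drops every character in the Unicode "Cf" (format) category, which includes U+202A; strip_format_chars is a hypothetical helper, and the sample path mirrors the one in the question.

```python
import unicodedata


def strip_format_chars(path):
    """Remove invisible Unicode format characters (category Cf) from a path."""
    return "".join(ch for ch in path if unicodedata.category(ch) != "Cf")


copied = "\u202aC:/Users/user/Desktop/datafile.csv"
print(copied.encode("utf-8")[:3])   # the three bytes seen in the error message
print(strip_format_chars(copied))
```

The cleaned string can then be passed to pd.read_csv as usual.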
Note: I chose to answer because the answers here do not answer my question and I believe a few folks (as seen in the comments) might meet the same situation as I did. There also seems to be new answers every now and then since I did not mark this question as resolved.

I had a similar problem when I was using JupyterLab + Anaconda and typing in the IDE through my browser.
My problem was simpler: there was a very subtle typo. I didn't have to use a raw string specifically; either escaping the backslashes or using an r string worked for me :). If you use Anaconda with JupyterLab, I believe the kernel starts in the folder you open the notebook from, i.e. the working directory is that top-level folder.

data = pd.read_csv('C:\\Users\username\Python\mydata.csv')
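This worked for me. Note the double "\\" in "C:\\", whereas the rest of the folders use only a single "\". Be aware that this relies on the single backslashes not forming escape sequences: in Python 3, a folder name starting with u (as in \username) is parsed as a Unicode escape and raises a SyntaxError, so doubling every backslash or using a raw string is safer.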

cv2.imread always returns None type - using cv virtual environment

I am using a virtual environment, which I called cv. I am trying to read a .cr2 raw image into a numpy array using OpenCV.
Using:
import cv2
img = cv2.imread("raw.cr2")
print img
Returns:
None
Always.
I believe the problem is in the path of raw.cr2, which apparently cannot be found. I have tried passing the absolute path of the file to imread. My file is in the home folder (~), where I run python from. I know the path is the issue because if I run os.path.exists(path), it always returns False.
Lastly, I also tried reading raw.cr2 using scipy.misc:
img = scipy.misc.imread(path)
Returns:
IOError: cannot identify image file 'raw.cr2'

Don't know if you ever solved this. I recently encountered the same problem (using ArchLinux) and found it was a permissions issue. Had to chown the images I wanted to use. Silly me.

Annoying python tesseract error Error opening data file ./tessdata/eng.traineddata

I'm bumping into this error that's driving me a little bit crazy with the python wrapper for tesseract which is a python module called tesseract.
Here's the python code I am trying to run :
img = cv2.imread(image, 0)
api = tesseract.TessBaseAPI()
api.Init(".","eng",tesseract.OEM_DEFAULT)
api.SetPageSegMode(tesseract.PSM_AUTO)
tesseract.SetCvImage(img,api)
url = api.GetUTF8Text()
conf=api.MeanTextConf()
print('Extracted URL : ' + url)
api.End()
and this is what I get:
Error opening data file ./tessdata/eng.traineddata
Please make sure the TESSDATA_PREFIX environment variable is set to the parent directory of your "tessdata" directory.
Failed loading language 'eng'
Tesseract couldn't load any languages!
I don't understand why it is doing this since I have the TESSDATA_PREFIX env variable correctly set to the correct path to my tesseract installation (with the trailing slash).
When I try to run Tesseract directly from powershell (I'm on windows 7 btw), by doing:
tesseract.exe .\data\test.tif -psm 7 out
it works like a charm !
Also when I call Tesseract with Popen in my python script it works fine but I don't like the idea of me not being able to grab the OCR'd text directly from stdout. Indeed, there seems to be no other choice than providing Tesseract with an output filename and then to fopen and read from that file. I feel it's going to be pretty awful to deal with temporary text files just to get the output of the OCR...
Help?

The first parameter to api.Init should be TESSDATA_PREFIX.

Get the location of your tessdata folder by typing in the command prompt:
$ brew list tesseract
In my case:
/usr/local/Cellar/tesseract/3.05.01/bin/tesseract
/usr/local/Cellar/tesseract/3.05.01/include/tesseract/ (27 files)
/usr/local/Cellar/tesseract/3.05.01/lib/libtesseract.3.dylib
/usr/local/Cellar/tesseract/3.05.01/lib/pkgconfig/tesseract.pc
/usr/local/Cellar/tesseract/3.05.01/lib/ (2 other files)
/usr/local/Cellar/tesseract/3.05.01/share/man/ (11 files)
/usr/local/Cellar/tesseract/3.05.01/share/tessdata/ (28 files)
now
tessdata_dir_config = r'--tessdata-dir "/usr/local/Cellar/tesseract/3.05.01/share/tessdata"'
txt= image_to_string(img,lang='eng',config=tessdata_dir_config)
