pytesseract tessedit_char_whitelist not accepting quote - python

I have started working with pytesseract in Python. When I include a single or double quote in the character whitelist:
from PIL import Image
import pytesseract
import numpy as np
tesseract_config = r"""-c tessedit_char_whitelist=0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ#'<>(){};:"""
tesseract_language = "eng"
text = pytesseract.image_to_string(Image.open('res/outc001.jpg'), lang=tesseract_language, config=tesseract_config)
print text
it returns
Traceback (most recent call last):
File "main.py", line 15, in <module>
text = pytesseract.image_to_string(Image.open('res/outc001.jpg'), lang=tesseract_language, config=tesseract_config).split('\n')
File "/usr/local/lib/python2.7/dist-packages/pytesseract/pytesseract.py", line 193, in image_to_string
return run_and_get_output(image, 'txt', lang, config, nice)
File "/usr/local/lib/python2.7/dist-packages/pytesseract/pytesseract.py", line 140, in run_and_get_output
run_tesseract(**kwargs)
File "/usr/local/lib/python2.7/dist-packages/pytesseract/pytesseract.py", line 106, in run_tesseract
command += shlex.split(config)
File "/usr/lib/python2.7/shlex.py", line 279, in split
return list(lex)
File "/usr/lib/python2.7/shlex.py", line 269, in next
token = self.get_token()
File "/usr/lib/python2.7/shlex.py", line 96, in get_token
raw = self.read_token()
File "/usr/lib/python2.7/shlex.py", line 172, in read_token
raise ValueError, "No closing quotation"
ValueError: No closing quotation
I have searched for ways to escape single and double quotes, and none of them worked.
When I run tesseract directly with
tesseract res/outc001.jpg tesseract_out/out001 -c "tessedit_char_whitelist=0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ#\'\"<>(){};:"
it works just fine.

pytesseract uses shlex.split() to separate the config arguments.
The escape character for shlex is \, so a literal quote in the config string must be preceded by \ (written as \\ in a normal Python string literal).
If you want only ' in the whitelist:
tesseract_config = "-c tessedit_char_whitelist=blahblah\\'"
If you want only ":
tesseract_config = '-c tessedit_char_whitelist=blahblah\\"'
If you want both ' and ":
tesseract_config = '''-c tessedit_char_whitelist=blahblah\\'\\"'''
or
tesseract_config = """-c tessedit_char_whitelist=blahblah\\"\\'"""
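To see what shlex does with these strings, here is a minimal sketch that uses only the standard library. It does not call pytesseract or tesseract, and the shortened whitelist `abc` is a placeholder. The last part assumes Python 3.3 or later, where `shlex.quote()` is available; building the config that way is an alternative not mentioned in the answer above.

```python
import shlex

# Backslash-escaped quotes survive shlex.split() as literal characters.
config = "-c tessedit_char_whitelist=abc\\'\\\""
tokens = shlex.split(config)
print(tokens)

# An unescaped quote opens a quoted section that never closes.
try:
    shlex.split("-c tessedit_char_whitelist=abc'")
    unescaped_error = None
except ValueError as exc:
    unescaped_error = str(exc)
print(unescaped_error)

# Alternative (assumption: Python 3.3+): let shlex.quote() do the escaping.
whitelist_arg = "tessedit_char_whitelist=abc'\""
quoted_config = "-c " + shlex.quote(whitelist_arg)
round_trip = shlex.split(quoted_config)
print(round_trip)
```

If the round trip gives back the original argument unchanged, the quoted string should be safe to pass as `config`, since the traceback shows pytesseract applying the same `shlex.split()` call.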

Can I read parquet from HTTP(s) octet-stream?

A backend endpoint returns a Parquet file as an octet-stream.
In pandas I can do something like this:
result = requests.get("https://..../file.parquet")
df = pd.read_parquet(io.BytesIO(result.content))
Can I do the same in Dask somehow?
This code:
dd.read_parquet("https://..../file.parquet")
raises an exception (obviously, because the response is a bytes-like object):
File "to_parquet_dask.py", line 153, in <module>
main(*parser.parse_args())
File "to_parquet_dask.py", line 137, in main
download_parquet(
File "to_parquet_dask.py", line 121, in download_parquet
dd.read_parquet(
File "/home/bc30138/Documents/CODE/flexydrive/driver_style/.venv/lib/python3.8/site-packages/dask/dataframe/io/parquet/core.py", line 313, in read_parquet
read_metadata_result = engine.read_metadata(
File "/home/bc30138/Documents/CODE/flexydrive/driver_style/.venv/lib/python3.8/site-packages/dask/dataframe/io/parquet/fastparquet.py", line 733, in read_metadata
parts, pf, gather_statistics, base_path = _determine_pf_parts(
File "/home/bc30138/Documents/CODE/flexydrive/driver_style/.venv/lib/python3.8/site-packages/dask/dataframe/io/parquet/fastparquet.py", line 148, in _determine_pf_parts
elif fs.isdir(paths[0]):
File "/home/bc30138/Documents/CODE/flexydrive/driver_style/.venv/lib/python3.8/site-packages/fsspec/asyn.py", line 88, in wrapper
return sync(self.loop, func, *args, **kwargs)
File "/home/bc30138/Documents/CODE/flexydrive/driver_style/.venv/lib/python3.8/site-packages/fsspec/asyn.py", line 69, in sync
raise result[0]
File "/home/bc30138/Documents/CODE/flexydrive/driver_style/.venv/lib/python3.8/site-packages/fsspec/asyn.py", line 25, in _runner
result[0] = await coro
File "/home/bc30138/Documents/CODE/flexydrive/driver_style/.venv/lib/python3.8/site-packages/fsspec/implementations/http.py", line 418, in _isdir
return bool(await self._ls(path))
File "/home/bc30138/Documents/CODE/flexydrive/driver_style/.venv/lib/python3.8/site-packages/fsspec/implementations/http.py", line 195, in _ls
out = await self._ls_real(url, detail=detail, **kwargs)
File "/home/bc30138/Documents/CODE/flexydrive/driver_style/.venv/lib/python3.8/site-packages/fsspec/implementations/http.py", line 150, in _ls_real
text = await r.text()
File "/home/bc30138/Documents/CODE/flexydrive/driver_style/.venv/lib/python3.8/site-packages/aiohttp/client_reqrep.py", line 1082, in text
return self._body.decode(encoding, errors=errors) # type: ignore
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x90 in position 7: invalid start byte
Update
With the fsspec change from @mdurant's answer, I got this error:
ValueError: Cannot seek streaming HTTP file
So I prepended "simplecache::" to my URL, and now I get the following:
Traceback (most recent call last):
File "to_parquet_dask.py", line 161, in <module>
main(*parser.parse_args())
File "to_parquet_dask.py", line 145, in main
download_parquet(
File "to_parquet_dask.py", line 128, in download_parquet
dd.read_parquet(
File "/home/bc30138/Documents/CODE/flexydrive/driver_style/.venv/lib/python3.8/site-packages/dask/dataframe/io/parquet/core.py", line 313, in read_parquet
read_metadata_result = engine.read_metadata(
File "/home/bc30138/Documents/CODE/flexydrive/driver_style/.venv/lib/python3.8/site-packages/dask/dataframe/io/parquet/fastparquet.py", line 733, in read_metadata
parts, pf, gather_statistics, base_path = _determine_pf_parts(
File "/home/bc30138/Documents/CODE/flexydrive/driver_style/.venv/lib/python3.8/site-packages/dask/dataframe/io/parquet/fastparquet.py", line 185, in _determine_pf_parts
pf = ParquetFile(
File "/home/bc30138/Documents/CODE/flexydrive/driver_style/.venv/lib/python3.8/site-packages/fastparquet/api.py", line 127, in __init__
raise ValueError("Opening directories without a _metadata requires"
ValueError: Opening directories without a _metadata requiresa filesystem compatible with fsspec
Temporary workaround
This may be dirty and suboptimal, but it works:
import io
import dask
import dask.dataframe as dd
import pandas as pd
import requests

@dask.delayed
def parquet_from_http(url, token):
    # Download the whole file, then parse it from memory.
    result = requests.get(
        url,
        headers={'Authorization': token}
    )
    return pd.read_parquet(io.BytesIO(result.content))

# url, token and meta are defined elsewhere in the script.
delayed_download = parquet_from_http(url, token)
df = dd.from_delayed(delayed_download, meta=meta)
P.S. The meta argument is necessary in this approach. Without it, Dask calls the function twice, once to infer the metadata and once to compute the result, so two requests are made.
This is not an answer, but I believe the following change in fsspec will fix your problem. If you would be willing to try and confirm, we can make this a patch.
--- a/fsspec/implementations/http.py
+++ b/fsspec/implementations/http.py
@@ -472,7 +472,10 @@ class HTTPFileSystem(AsyncFileSystem):
     async def _isdir(self, path):
         # override, since all URLs are (also) files
-        return bool(await self._ls(path))
+        try:
+            return bool(await self._ls(path))
+        except (FileNotFoundError, ValueError):
+            return False
(we can put this in a branch, if that makes it easier for you to install)
-edit-
The second problem (which is the same thing in both parquet engines) stems from the server either not providing the size of the file, or not allowing range-gets. The parquet format requires random access to the data to be able to read. The only way to get around this (short of improving the server) is to copy the whole file locally, e.g., by prepending "simplecache::" to your URL.

'File does not exist' when passing a non-ASCII path to xarray.open_dataset

I have a problem when attempting to open a .nc file. For my college work I need to work with some data stored in .nc files, so I decided to give the 'xarray' library a go. The files are located on OneDrive. When I pass the 'open_dataset' function a path that contains non-ASCII characters, the following error occurs:
import xarray as xr
path1 = (r'C:\Users\myname\OneDrive - Prirodoslovno-matematički fakultet'
'\DACCIWA\DATA\Sodar\Save_KIT_CM_20160702.nc')
ds = xr.open_dataset(path1)
Traceback (most recent call last):
File "C:\Users\petar\Desktop\Geofizika\5. Godina\KB - Research opportunity\test.py", line 9, in <module>
ds = xr.open_dataset(path1)
File "C:\Users\petar\Anaconda3\lib\site-packages\xarray\backends\api.py", line 499, in open_dataset
filename_or_obj, group=group, lock=lock, **backend_kwargs
File "C:\Users\petar\Anaconda3\lib\site-packages\xarray\backends\netCDF4_.py", line 389, in open
return cls(manager, group=group, mode=mode, lock=lock, autoclose=autoclose)
File "C:\Users\petar\Anaconda3\lib\site-packages\xarray\backends\netCDF4_.py", line 335, in __init__
self.format = self.ds.data_model
File "C:\Users\petar\Anaconda3\lib\site-packages\xarray\backends\netCDF4_.py", line 398, in ds
return self._acquire()
File "C:\Users\petar\Anaconda3\lib\site-packages\xarray\backends\netCDF4_.py", line 392, in _acquire
with self._manager.acquire_context(needs_lock) as root:
File "C:\Users\petar\Anaconda3\lib\contextlib.py", line 112, in __enter__
return next(self.gen)
File "C:\Users\petar\Anaconda3\lib\site-packages\xarray\backends\file_manager.py", line 183, in acquire_context
file, cached = self._acquire_with_cache_info(needs_lock)
File "C:\Users\petar\Anaconda3\lib\site-packages\xarray\backends\file_manager.py", line 201, in _acquire_with_cache_info
file = self._opener(*self._args, **kwargs)
File "netCDF4\_netCDF4.pyx", line 2135, in netCDF4._netCDF4.Dataset.__init__
File "netCDF4\_netCDF4.pyx", line 1752, in netCDF4._netCDF4._ensure_nc_success
FileNotFoundError: [Errno 2] No such file or directory: b'C:\\Users\\petar\\OneDrive - Prirodoslovno-matemati\xc4\x8dki fakultet\\DACCIWA\\DATA\\Sodar\\Save_KIT_CM_20160702.nc'
I am confused since the file definitely is there (In the code above I replaced my name in the path with "myname", which does not contain non-ASCII characters). At first I thought this had to do something with OneDrive, but I created a folder on it with a path that does not contain non-ASCII characters, and it opens those no problem.
What I tried (although this was really just shotgunning, not familiar with encodings and such):
- input string as raw string (as you do to escape the slashes)
I noticed in the last line that the path string is preceded by the letter "b". Apparently this means that the string is a "byte literal" and can only contain ASCII characters. If so, why does xarray convert or interpret the string as a byte literal? How would I go about opening the file?
Thanks for help!
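On the "b" question: a bytes literal is displayed using ASCII characters plus \x escapes, but it can hold any byte values. The minimal sketch below only shows that the bytes in the error message are the UTF-8 encoding of the folder name from the question. It does not open any file and does not establish why the lookup fails.

```python
# The folder name from the question, as a normal (Unicode) string.
folder = "OneDrive - Prirodoslovno-matematički fakultet"

# Encoding to UTF-8 turns the non-ASCII "č" (U+010D) into two bytes, C4 8D.
encoded = folder.encode("utf-8")
print(encoded)

# Decoding with the same codec recovers the original text.
decoded = encoded.decode("utf-8")
print(decoded == folder)
```

So the `\xc4\x8d` in the error message is the same character "č", just shown in its encoded form.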

ansible unexpected exception no escaped character

I am fairly new to Ansible and to configuration management tools in general. I've been playing around with it for the last two days and for the life of me I can't get past typing out ansible testserver. It comes back with an error message that says Unexpected Exception: No escaped character. The full error message is:
mac-dgarcia:playbooks dgarcia$ ansible testserver -i hosts -m ping -vvv
Using /Users/dgarcia/Documents/Playbooks/ansible.cfg as config file
Unexpected Exception: No escaped character
the full traceback was:
Traceback (most recent call last):
File "/Users/dgarcia/Documents/Playbooks/ansible/bin/ansible", line 79, in <module>
sys.exit(cli.run())
File "/Users/dgarcia/Documents/Playbooks/ansible/lib/ansible/cli/adhoc.py", line 106, in run
inventory = Inventory(loader=loader, variable_manager=variable_manager, host_list=self.options.inventory)
File "/Users/dgarcia/Documents/Playbooks/ansible/lib/ansible/inventory/__init__.py", line 135, in __init__
self.parser = InventoryParser(filename=host_list)
File "/Users/dgarcia/Documents/Playbooks/ansible/lib/ansible/inventory/ini.py", line 45, in __init__
self._parse()
File "/Users/dgarcia/Documents/Playbooks/ansible/lib/ansible/inventory/ini.py", line 49, in _parse
self._parse_base_groups()
File "/Users/dgarcia/Documents/Playbooks/ansible/lib/ansible/inventory/ini.py", line 107, in _parse_base_groups
tokens = shlex.split(line)
File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/shlex.py", line 279, in split
return list(lex)
File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/shlex.py", line 269, in next
token = self.get_token()
File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/shlex.py", line 96, in get_token
raw = self.read_token()
File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/shlex.py", line 191, in read_token
raise ValueError, "No escaped character"
ValueError: No escaped character
Have searched everywhere I can on Google and came back with nothing. Any ideas?
I had the same issue and changing my host file to have only one line fixed the problem.
My host file looks like the following:
testserver ansible_ssh_host=128.0.0.1 ansible_ssh_port=2222 \
ansible_ssh_user=vagrant \
ansible_ssh_private_key_file=/home/bibryam/Desktop/.vagrant/machines/fabric/virtualbox/private_key
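The traceback shows the INI inventory parser calling shlex.split() on each line separately. The following minimal sketch shows why a trailing line-continuation backslash would trigger this error. It assumes the failing inventory entry was split across lines with trailing backslashes (the question does not show the file), uses only the standard library, and borrows the host values from the answer above.

```python
import shlex

# One physical line of a multi-line entry, ending with a continuation backslash.
continued_line = "testserver ansible_ssh_host=128.0.0.1 ansible_ssh_port=2222 \\"

try:
    shlex.split(continued_line)
    error_message = None
except ValueError as exc:
    # The backslash has no following character to escape.
    error_message = str(exc)
print(error_message)

# The same entry written on a single line splits cleanly.
single_line = (
    "testserver ansible_ssh_host=128.0.0.1 "
    "ansible_ssh_port=2222 ansible_ssh_user=vagrant"
)
tokens = shlex.split(single_line)
print(tokens)
```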

Upload with wtforms - unexpected end of regular expression

I am trying this code from the docs:
class Form(Form):
    image = FileField(u'Image File', validators=[Regexp(u'^[^/\\]\.jpg$')])

    def validate_image(form, field):
        if field.data:
            field.data = re.sub(r'[^a-z0-9_.-]', '_', field.data)
Here is the error:
Traceback (most recent call last):
File "tornadoexample2-1.py", line 111, in <module>
class Form(Form):
File "tornadoexample2-1.py", line 119, in Form
image = FileField(u'Image File', validators=[Regexp(u'^[^/\\]\.jpg$')])
File "/usr/local/lib/python2.7/dist-packages/wtforms/validators.py", line 256, in __init__
regex = re.compile(regex, flags)
File "/usr/lib/python2.7/re.py", line 190, in compile
return _compile(pattern, flags)
File "/usr/lib/python2.7/re.py", line 242, in _compile
raise error, v # invalid expression
sre_constants.error: unexpected end of regular expression
Any idea what the problem is?
The regular expression in Regexp(u'^[^/\\]\.jpg$') is invalid.
Try running this and you will get the same exception:
import re
re.compile(u'^[^/\\]\.jpg$')
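In a normal string literal, \\ becomes a single backslash, so the regex engine sees [^/\] and reads \] as an escaped bracket, which leaves the character class unclosed. You need to escape the backslash twice inside the [] brackets.
So you can rewrite it as u'^[^/\\\\]\.jpg$' or as a raw string ur'^[^/\\]\.jpg$' (the ur prefix exists only in Python 2; in Python 3 use r'^[^/\\]\.jpg$').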
Hope this helps.
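The difference between the two spellings can be checked directly. This is a minimal sketch for Python 3, where the error class is `re.error` and its message may differ from the Python 2 traceback in the question. The test filenames are placeholders.

```python
import re

# Written without a raw string, "\\" is one backslash, so the regex engine
# sees [^/\] and treats \] as an escaped bracket: the class never closes.
broken = "^[^/\\]\\.jpg$"
try:
    re.compile(broken)
    broken_compiles = True
except re.error:
    broken_compiles = False

# In a raw string, \\ reaches the regex engine as an escaped backslash.
fixed = re.compile(r"^[^/\\]\.jpg$")
matches_plain = bool(fixed.match("a.jpg"))
matches_slash = bool(fixed.match("/.jpg"))
print(broken_compiles, matches_plain, matches_slash)
```

Note that the pattern as written accepts exactly one character before ".jpg", since the character class has no quantifier.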

Error with pydub in python

I have successfully imported pydub.
I am running this code:
from pydub import AudioSegment
song = AudioSegment.from_mp3("c:\mks.mp3")
first_ten_seconds = song[:10000]
song.export("d:\mks.mp3", format="mp3")
It gives the following error:
python "C:\Users\mKs\Desktop\mks2.py"
Process started >>>
Traceback (most recent call last):
File "C:\Users\mKs\Desktop\mks2.py", line 2, in <module>
song=AudioSegment.from_mp3("c:\mks.mp3");
File "C:\Python27\lib\site-packages\pydub-0.5.2-py2.7.egg\pydub\audio_segment.py", line 194, in from_mp3
return cls.from_file(file, 'mp3')
File "C:\Python27\lib\site-packages\pydub-0.5.2-py2.7.egg\pydub\audio_segment.py", line 189, in from_file
return cls.from_wav(output)
File "C:\Python27\lib\site-packages\pydub-0.5.2-py2.7.egg\pydub\audio_segment.py", line 206, in from_wav
return cls(data=file)
File "C:\Python27\lib\site-packages\pydub-0.5.2-py2.7.egg\pydub\audio_segment.py", line 33, in __init__
raw = wave.open(StringIO(data), 'rb')
File "C:\Python27\lib\wave.py", line 498, in open
return Wave_read(f)
File "C:\Python27\lib\wave.py", line 163, in __init__
self.initfp(f)
File "C:\Python27\lib\wave.py", line 128, in initfp
self._file = Chunk(file, bigendian = 0)
File "C:\Python27\lib\chunk.py", line 63, in __init__
raise EOFError
EOFError
I would love to get help on this topic
The only issue I see with your code is the trailing ";" at the end of the last 3 lines. Please remove those and see if you still get the error.
In addition, make sure you have ffmpeg (http://www.ffmpeg.org/) installed. It is required to support all of the non-WAV file formats.
ADDED:
I think you have broken module dependencies in your python installation.
I have tried code that you provided above with python 2.7.2. It worked fine for me:
>>> from pydub import AudioSegment
>>> song = AudioSegment.from_wav('goodbye.wav')
>>> first_ten_seconds = song[:10000]
>>> song.export('goodbye1.wav',format='wav')
<open file 'goodbye1.wav', mode 'wb+' at 0x10cf2b270>
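As a quick way to check the ffmpeg requirement mentioned above, this minimal sketch looks for an ffmpeg executable on PATH using only the standard library. It assumes Python 3.3 or later (`shutil.which` does not exist in Python 2.7, which the question uses) and that pydub locates ffmpeg through PATH. `find_ffmpeg` is a hypothetical helper name, not part of pydub.

```python
import shutil


def find_ffmpeg():
    """Return the full path of the ffmpeg executable on PATH, or None."""
    return shutil.which("ffmpeg")


ffmpeg_path = find_ffmpeg()
if ffmpeg_path is None:
    print("ffmpeg not found on PATH")
else:
    print("ffmpeg found at", ffmpeg_path)
```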
