How to upload TXT files to Google AutoML Natural Language?

I'm trying to upload some files to annotate and train inside AutoML to extract entities, but I keep getting errors in the process. When I used PDF files and created the JSONL files with their corresponding CSV file, I had no problems. After that I applied OCR to some scanned documents, followed the guide, and used this script:
https://cloud.google.com/natural-language/automl/docs/scripts/input_helper_v2.py?_gac=1.88279529.1589302044.CjwKCAjwkun1BRAIEiwA2mJRWeRJRDpqVkcxBu3um5PTXjp1KaVgPITbHqK9OunNca4lrI_MlZfVthoC748QAvD_BwE&_ga=2.177424421.-1405970329.1585427512
I installed Python 2.7 in a new Anaconda environment, but when I run the script it fails:
Uploading 57 files (including csv and local PDF files) to gs://mybucket ...
Traceback (most recent call last):
File "C:\Users\USUARIO\Anaconda3\envs\jsonl\Scripts\gsutil-script.py", line 5, in <module>
from gslib.__main__ import main
File "C:\Users\USUARIO\Anaconda3\envs\jsonl\lib\site-packages\gslib\__main__.py", line 66, in <module>
import httplib2
File "C:\Users\USUARIO\Anaconda3\envs\jsonl\lib\site-packages\httplib2\__init__.py", line 482
print("%s:" % h, end=" ", file=self._fp)
^
SyntaxError: invalid syntax
Traceback (most recent call last):
File "input_helper_v2.py", line 686, in <module>
main()
File "input_helper_v2.py", line 678, in main
UploadFiles(annotated_files, FLAGS.target_gcs_directory)
File "input_helper_v2.py", line 651, in UploadFiles
subprocess.check_call(cmd, shell=True)
File "C:\Users\USUARIO\Anaconda3\envs\jsonl\lib\subprocess.py", line 190, in check_call
raise CalledProcessError(retcode, cmd)
subprocess.CalledProcessError: Command 'gsutil -m cp c:\users\usuario\appdata\local\temp\tmprxwqnj\analisis_parte_1.jsonl c:\users\usuario\appdata\local\temp\tmprxwqnj\c_kredit_parte_2.jsonl c:\users\usuario\appdata\local\temp\tmprxwqnj\c_av_villas_parte_1.jsonl c:\users\usuario\appdata\local\temp\tmprxwqnj\cifin_parte_4.jsonl c:\users\usuario\appdata\local\temp\tmprxwqnj\secretaria_popayan_parte_6.jsonl c:\users\usuario\appdata\local\temp\tmprxwqnj\fiduprevisora_parte_1.jsonl c:\users\usuario\appdata\local\temp\tmprxwqnj\entrevista_parte_2.jsonl c:\users\usuario\appdata\local\temp\tmprxwqnj\data_credito_mini_parte_2.jsonl c:\users\usuario\appdata\local\temp\tmprxwqnj\c_kredit_parte_1.jsonl c:\users\usuario\appdata\local\temp\tmprxwqnj\c_colpatria_parte_2.jsonl c:\users\usuario\appdata\local\temp\tmprxwqnj\secretaria_popayan_parte_5.jsonl c:\users\usuario\appdata\local\temp\tmprxwqnj\fopep_parte_1.jsonl c:\users\usuario\appdata\local\temp\tmprxwqnj\c_ck_tr_parte_4.jsonl c:\users\usuario\appdata\local\temp\tmprxwqnj\apc_parte_2.jsonl c:\users\usuario\appdata\local\temp\tmprxwqnj\secretaria_popayan_parte_7.jsonl c:\users\usuario\appdata\local\temp\tmprxwqnj\c_ck_tr_parte_5.jsonl c:\users\usuario\appdata\local\temp\tmprxwqnj\secretaria_popayan_parte_1.jsonl c:\users\usuario\appdata\local\temp\tmprxwqnj\data_credito_mini_parte_1.jsonl c:\users\usuario\appdata\local\temp\tmprxwqnj\terminacion_proceso_embargo_parte_1.jsonl c:\users\usuario\appdata\local\temp\tmprxwqnj\secretaria_popayan_parte_10.jsonl c:\users\usuario\appdata\local\temp\tmprxwqnj\cifin_parte_1.jsonl c:\users\usuario\appdata\local\temp\tmprxwqnj\apc_parte_1.jsonl c:\users\usuario\appdata\local\temp\tmprxwqnj\entrevista_parte_3.jsonl c:\users\usuario\appdata\local\temp\tmprxwqnj\c_av_villas_parte_3.jsonl c:\users\usuario\appdata\local\temp\tmprxwqnj\fomaq_parte_1.jsonl c:\users\usuario\appdata\local\temp\tmprxwqnj\data_credito_mini_parte_3.jsonl 
c:\users\usuario\appdata\local\temp\tmprxwqnj\secretaria_popayan_parte_3.jsonl c:\users\usuario\appdata\local\temp\tmprxwqnj\cifin_parte_3.jsonl c:\users\usuario\appdata\local\temp\tmprxwqnj\secretaria_parte_2.jsonl c:\users\usuario\appdata\local\temp\tmprxwqnj\c_av_villas_parte_4.jsonl c:\users\usuario\appdata\local\temp\tmprxwqnj\entrevista_parte_4.jsonl c:\users\usuario\appdata\local\temp\tmprxwqnj\c_ck_tr_parte_2.jsonl c:\users\usuario\appdata\local\temp\tmprxwqnj\c_colpatria_parte_1.jsonl c:\users\usuario\appdata\local\temp\tmprxwqnj\c_kredit_parte_3.jsonl c:\users\usuario\appdata\local\temp\tmprxwqnj\c_gnb_parte_2.jsonl c:\users\usuario\appdata\local\temp\tmprxwqnj\secretaria_popayan_parte_9.jsonl c:\users\usuario\appdata\local\temp\tmprxwqnj\soporte_pago_parte_1.jsonl c:\users\usuario\appdata\local\temp\tmprxwqnj\entrevista_parte_1.jsonl c:\users\usuario\appdata\local\temp\tmprxwqnj\secretaria_popayan_parte_2.jsonl c:\users\usuario\appdata\local\temp\tmprxwqnj\c_gnb_parte_1.jsonl c:\users\usuario\appdata\local\temp\tmprxwqnj\c_av_villas_parte_2.jsonl c:\users\usuario\appdata\local\temp\tmprxwqnj\colpensiones_parte_1.jsonl c:\users\usuario\appdata\local\temp\tmprxwqnj\cedulas_parte_1.jsonl c:\users\usuario\appdata\local\temp\tmprxwqnj\secretaria_parte_1.jsonl c:\users\usuario\appdata\local\temp\tmprxwqnj\c_ck_tr_parte_3.jsonl c:\users\usuario\appdata\local\temp\tmprxwqnj\paz_y_salvo_parte_1.jsonl c:\users\usuario\appdata\local\temp\tmprxwqnj\levantamiento_embargo_parte_1.jsonl c:\users\usuario\appdata\local\temp\tmprxwqnj\secretaria_parte_3.jsonl c:\users\usuario\appdata\local\temp\tmprxwqnj\secretaria_popayan_parte_8.jsonl c:\users\usuario\appdata\local\temp\tmprxwqnj\secretaria_popayan_parte_4.jsonl c:\users\usuario\appdata\local\temp\tmprxwqnj\ape_parte1.jsonl c:\users\usuario\appdata\local\temp\tmprxwqnj\c_ck_tr_parte_1.jsonl c:\users\usuario\appdata\local\temp\tmprxwqnj\analisis_parte_2.jsonl 
c:\users\usuario\appdata\local\temp\tmprxwqnj\cifin_parte_2.jsonl c:\users\usuario\appdata\local\temp\tmprxwqnj\cifin_parte_5.jsonl c:\users\usuario\appdata\local\temp\tmprxwqnj\data_credito_mini_parte_4.jsonl c:\users\usuario\appdata\local\temp\tmprxwqnj\dataset.csv gs://mybucket' returned non-zero exit status 1
If someone could help, I'll be grateful...
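The first traceback points at the likely cause: line 482 of the installed httplib2 calls print with `end=` and `file=` keyword arguments, which Python 2.7 cannot parse unless the module opts in to the print function. A minimal sketch of that check follows. The helper name is hypothetical, and it only tests how the current interpreter parses such a statement, not gsutil itself.

```python
import sys


def parses_print_keywords():
    """Return True if this interpreter can compile print(..., end=, file=)."""
    # Same shape as the httplib2 line in the traceback, with stand-in names.
    # compile() only parses the source, so the undefined names do not matter.
    source = 'print("%s:" % h, end=" ", file=fp)'
    try:
        compile(source, "<httplib2-check>", "exec")
    except SyntaxError:
        return False
    return True


if __name__ == "__main__":
    print("Python %d.%d" % sys.version_info[:2])
    print("print() keyword syntax parses: %s" % parses_print_keywords())
```

If this reports False inside the environment that gsutil runs in, that would be consistent with the traceback: the httplib2 release installed there expects Python 3. Installing an httplib2 release that still supports Python 2.7, or running gsutil under Python 3, are the two directions worth trying.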

Related

BadRarFile when extracting single file using RarFile in Python

I need to extract a single file (~10 kB) from many very large RAR files (>1 GB). The code below shows a basic version of how I'm doing this.
from rarfile import RarFile
rar_file='D:\\File.rar'
file_of_interest='Folder 1/Subfolder 2/File.dat'
output_folder='D:/Output'
rardata = RarFile(rar_file)
rardata.extract(file_of_interest, output_folder)
rardata.close()
However, the extract instruction is returning the following error: rarfile.BadRarFile: Failed the read enough data: req=16384 got=52
When I open the file using WinRAR, I can extract the file successfully, so I'm sure the file isn't corrupted.
I've found some similar questions, but not a definite answer that worked for me.
Can someone help me to solve this error?
Additional info:
Windows 10 build 1909
Spyder 5.0.0
Python 3.8.1
Complete traceback of the error:
Traceback (most recent call last):
File "D:\Test\teste_rar_2.py", line 27, in <module>
rardata.extract(file_of_interest, output_folder)
File "C:\Users\bernard.kusel\AppData\Local\Continuum\anaconda3\lib\site-packages\rarfile.py", line 826, in extract
return self._extract_one(inf, path, pwd, True)
File "C:\Users\bernard.kusel\AppData\Local\Continuum\anaconda3\lib\site-packages\rarfile.py", line 912, in _extract_one
return self._make_file(info, dstfn, pwd, set_attrs)
File "C:\Users\bernard.kusel\AppData\Local\Continuum\anaconda3\lib\site-packages\rarfile.py", line 927, in _make_file
shutil.copyfileobj(src, dst)
File "C:\Users\bernard.kusel\AppData\Local\Continuum\anaconda3\lib\shutil.py", line 79, in copyfileobj
buf = fsrc.read(length)
File "C:\Users\bernard.kusel\AppData\Local\Continuum\anaconda3\lib\site-packages\rarfile.py", line 2197, in read
raise BadRarFile("Failed the read enough data: req=%d got=%d" % (orig, len(data)))
BadRarFile: Failed the read enough data: req=16384 got=52

Unable to write textclip with moviepy due to errors with Imagemagick

I am trying to write a TextClip into a video with moviepy. It always worked before, but it stopped working after I reinstalled ffmpeg (I am not sure the reinstall caused it, but I mention it in case it did). Some of the TextClip objects contain only spaces.
I got the following error message:
Traceback (most recent call last):
File "C:\Users\USER\AppData\Local\Packages\PythonSoftwareFoundation.Python.3.8_qbz5n2kfra8p0\LocalCache\local-packages\Python38\site-packages\moviepy\video\VideoClip.py", line 1137, in __init__
subprocess_call(cmd, logger=None)
File "C:\Users\USER\AppData\Local\Packages\PythonSoftwareFoundation.Python.3.8_qbz5n2kfra8p0\LocalCache\local-packages\Python38\site-packages\moviepy\tools.py", line 54, in subprocess_call
raise IOError(err.decode('utf8'))
OSError: magick.exe: no images for write '-write' 'PNG32:C:\Users\USER\AppData\Local\Temp\tmpmnf0fkb5.png' at CLI arg 14 # error/operation.c/CLINoImageOperator/4893.
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "media_main_user.py", line 79, in <module>
insert_audio_and_subtitles(input_clip,'output.mp4','text.mp3',subtitles,fontsize=subtitles_font_size,
File "D:\UserData\Desktop\Project\影片剪輯\關鍵字版本\make_media.py", line 318, in insert_audio_and_subtitles
annotated_clips = [annotate(video.subclip(from_t, to_t), txt) for (from_t, to_t), txt in subtitles]
File "D:\UserData\Desktop\Project\影片剪輯\關鍵字版本\make_media.py", line 318, in <listcomp>
annotated_clips = [annotate(video.subclip(from_t, to_t), txt) for (from_t, to_t), txt in subtitles]
File "D:\UserData\Desktop\Project\影片剪輯\關鍵字版本\make_media.py", line 311, in annotate
txtclip = TextClip(txt, fontsize=fontsize, font=font, color=txt_color)
File "C:\Users\USER\AppData\Local\Packages\PythonSoftwareFoundation.Python.3.8_qbz5n2kfra8p0\LocalCache\local-packages\Python38\site-packages\moviepy\video\VideoClip.py", line 1146, in __init__
raise IOError(error)
OSError: MoviePy Error: creation of None failed because of the following error:
magick.exe: no images for write '-write' 'PNG32:C:\Users\USER\AppData\Local\Temp\tmpmnf0fkb5.png' at CLI arg 14 # error/operation.c/CLINoImageOperator/4893.
.
.This error can be due to the fact that ImageMagick is not installed on your computer, or (for Windows users) that you didn't specify the path to the ImageMagick binary in file conf.py, or that the path you specified is incorrect
I have tried to solve the problem in several ways. I checked the config-default.py file in the moviepy installation directory, and the ImageMagick path it points to is correct. I reinstalled ImageMagick, which did not help. I also edited the policy.xml file in the ImageMagick directory, changing rights from none to read|write, but that did not work either. Can anyone suggest other possible solutions?
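One detail in the question may matter: some of the TextClip texts are pure spaces, and "no images for write" is the kind of message one would expect if ImageMagick was asked to render text that produces no image. As a sketch under that assumption (not a confirmed diagnosis), blank subtitles could be filtered out before any TextClip is built. The `subtitles` structure mirrors the `((from_t, to_t), txt)` pairs visible in the traceback; the helper name is hypothetical.

```python
def drop_blank_subtitles(subtitles):
    """Keep only subtitle entries whose text has visible characters."""
    return [(span, txt) for span, txt in subtitles if txt and txt.strip()]


subtitles = [((0, 2), "Hello"), ((2, 4), "   "), ((4, 6), "")]
print(drop_blank_subtitles(subtitles))  # [((0, 2), 'Hello')]
```

The filtered list can then be passed to the existing list comprehension in place of `subtitles`, so that `annotate` is never called with whitespace-only text.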

In /etc/ImageMagick-6/policy.xml, try commenting out the following line altogether:
<policy domain="path" rights="none" pattern="@*" />
Like so:
<!-- <policy domain="path" rights="none" pattern="@*" /> -->
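For anyone who prefers to script that change, here is a minimal sketch that wraps the active path policy with rights="none" in an XML comment. It works on the file contents as a string; the function name is hypothetical, and it assumes the policy element sits on a single line, as in the answer above.

```python
def comment_out_path_policy(policy_xml):
    """Wrap each active <policy domain="path" rights="none" .../> line in a comment."""
    out = []
    for line in policy_xml.splitlines():
        stripped = line.strip()
        # Lines that are already commented start with "<!--" and are skipped.
        is_target = (
            stripped.startswith("<policy")
            and 'domain="path"' in stripped
            and 'rights="none"' in stripped
        )
        out.append("<!-- " + stripped + " -->" if is_target else line)
    return "\n".join(out)
```

Reading the file, passing its text through this function, and writing the result back would apply the edit. Keeping a backup of policy.xml first is sensible, and writing under /etc usually needs elevated privileges.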

Pychecker index error on first run

I've just installed pychecker on Windows 7 Pro using "python setup.py install". When I run it on my script using the command:
c:\Python26\Scripts\pychecker -#100 finaltest17.py
I get the following error/traceback:
C:\Users\....\ToBeReleased>C:\Python26\python.exe C:\Python26\Lib\site-packages\pychecker\checker.py -#100 finaltest17.py
Processing module finaltest17 (finaltest17.py)...
Caught exception importing module finaltest17:
File "C:\Python26\Lib\site-packages\pychecker\pcmodules.py", line 533, in setupMainCode()
self.moduleName, self.moduleDir)
File "C:\Python26\Lib\site-packages\pychecker\pychecker\utils.py", line 184, in findModule()
handle, filename, smt = _q_find_module(p, path)
File "C:\Python26\Lib\site-packages\pychecker\pychecker\utils.py", line 162, in _q_find_module()
if not cfg().quixote:
File "C:\Python26\Lib\site-packages\pychecker\pychecker\utils.py", line 39, in cfg()
return _cfg[-1]
IndexError: list index out of range
Traceback (most recent call last):
File "C:\Python26\Lib\site-packages\pychecker\checker.py", line 364, in <module>
sys.exit(main(sys.argv))
File "C:\Python26\Lib\site-packages\pychecker\checker.py", line 337, in main
importWarnings = processFiles(files, _cfg, _print_processing)
File "C:\Python26\Lib\site-packages\pychecker\checker.py", line 270, in processFiles
loaded = pcmodule.load()
File "C:\Python26\Lib\site-packages\pychecker\pcmodules.py", line 477, in load
return utils.cfg().ignoreImportErrors
File "C:\Python26\Lib\site-packages\pychecker\pychecker\utils.py", line 39, in cfg
return _cfg[-1]
IndexError: list index out of range
If anyone could point me in the right direction that would be great.
Thanks
Stewart

Problem resolved.
I found the following support request on SourceForge. It says that pychecker.bat needs short-format (8.3) paths and filenames, not the long format that newer versions of Windows allow.
https://sourceforge.net/p/pychecker/support-requests/7/#96cb

Package python script with pandas using PEX

I have a simple Python script that depends on pandas. I need to package it with pex so that it can be executed without installing dependencies.
import sys
import csv
import argparse
import pandas as pd
class myLogic():
    def __init__(self):
        pass
    def loadData(self, data_file):
        # the input file is pipe-delimited
        return pd.read_csv(data_file, delimiter="|")
    # command line interaction interface
    def processInputArguments(self, args):
        parser = argparse.ArgumentParser(description="my logic")
        # transactions file name
        parser.add_argument('-td',
                            '--data',
                            type=str,
                            dest='data',
                            help='data file location'
                            )
        options = parser.parse_args(args)
        return vars(options)
    def main(self):
        options = self.processInputArguments(sys.argv[1:])
        data_file = options["data"]
        data = self.loadData(data_file)
        print(data.head())
if __name__ == '__main__':
    ml = myLogic()
    ml.main()
I am trying to use pex to do that, so I did the following:
pex pandas -e myprogram.myLogic:main -o test1.pex
But I am getting this error when running the generated pex file:
Traceback (most recent call last):
File ".bootstrap/_pex/pex.py", line 317, in execute
File ".bootstrap/_pex/pex.py", line 250, in _wrap_coverage
File ".bootstrap/_pex/pex.py", line 282, in _wrap_profiling
File ".bootstrap/_pex/pex.py", line 360, in _execute
File ".bootstrap/_pex/pex.py", line 418, in execute_entry
File ".bootstrap/_pex/pex.py", line 435, in execute_pkg_resources
File ".bootstrap/pkg_resources.py", line 2088, in load
ImportError: No module named myLogic
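The ImportError is consistent with how entry points of the form `module:attribute` are read: everything before the colon is treated as a module path, so `myprogram.myLogic:main` asks for a module named `myprogram.myLogic`, not for the class `myLogic` inside `myprogram`. A small sketch of that resolution using only the standard library (the helper names are hypothetical, and this is not pex's own code):

```python
import importlib


def split_entry_point(spec):
    """Split 'pkg.module:attr.path' into (module_name, attribute_path)."""
    module_name, _, attr_path = spec.partition(":")
    return module_name, attr_path


def resolve_entry_point(spec):
    """Import the module part, then walk the dotted attribute part."""
    module_name, attr_path = split_entry_point(spec)
    target = importlib.import_module(module_name)
    for name in filter(None, attr_path.split(".")):
        target = getattr(target, name)
    return target


print(split_entry_point("myprogram.myLogic:main"))  # ('myprogram.myLogic', 'main')
print(resolve_entry_point("json:dumps")([1, 2]))     # [1, 2]
```

Under this reading, a module-level `main()` function in myprogram.py, referenced as `myprogram:main`, would be a resolvable target, provided myprogram.py itself is also packaged into the pex file rather than only pandas.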
I also tried packaging with the -c (switch for script) using the following command:
pex pandas -c myprogram.py -o test2.pex
But that also fails with an error:
Traceback (most recent call last):
File "/usr/local/bin/pex", line 11, in <module>
sys.exit(main())
File "/usr/local/lib/python2.7/dist-packages/pex/bin/pex.py", line 509, in main
pex_builder = build_pex(reqs, options, resolver_options_builder)
File "/usr/local/lib/python2.7/dist-packages/pex/bin/pex.py", line 486, in build_pex
pex_builder.set_script(options.script)
File "/usr/local/lib/python2.7/dist-packages/pex/pex_builder.py", line 214, in set_script
script, ', '.join(self._distributions)))
TypeError: sequence item 0: expected string, DistInfoDistribution found

The only option that has worked for me so far is to create an interpreter with pex that includes pandas and then ship it together with the Python script. This can be done as follows:
pex pandas -o my_interpreter.pex
But this fails when the Python used for the build is a UCS4 build and the Python used to run it is a UCS2 build.

HTTP header error using the Python SDK for Azure

I am getting started with the Microsoft Azure SDK for Python (https://github.com/Azure/azure-sdk-for-python), but I have run into problems.
I am using Scientific Linux and I installed the SDK for Python 3.4 with the following step:
(inside the SDK directory)
python setup.py install
After that I created a simple script just to test the connection:
from azure.storage import BlobService
blob_service = BlobService(account_name='thename', account_key='Mxxxxxxx3w==' )
blob_service.create_container('testcontainer')
for i in blob_service.list_containers():
    print(i.name)
following this documentation:
http://blogs.msdn.com/b/tconte/archive/2013/04/17/how-to-interact-with-windows-azure-blob-storage-from-linux-using-python.aspx
http://azure.microsoft.com/en-us/documentation/articles/storage-python-how-to-use-blob-storage/#large-blobs
but it is not working; I always receive the same error:
python3 test.py
Traceback (most recent call last):
File "/usr/local/lib/python3.4/site-packages/azure-0.9.0-py3.4.egg/azure/storage/storageclient.py", line 143, in _perform_request
File "/usr/local/lib/python3.4/site-packages/azure-0.9.0-py3.4.egg/azure/storage/storageclient.py", line 132, in _perform_request_worker
File "/usr/local/lib/python3.4/site-packages/azure-0.9.0-py3.4.egg/azure/http/httpclient.py", line 247, in perform_request
azure.http.HTTPError: The value for one of the HTTP headers is not in the correct format.
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "test.py", line 21, in <module>
blob_service.create_container('testcontainer')
File "/usr/local/lib/python3.4/site-packages/azure-0.9.0-py3.4.egg/azure/storage/blobservice.py", line 192, in create_container
File "/usr/local/lib/python3.4/site-packages/azure-0.9.0-py3.4.egg/azure/__init__.py", line 905, in _dont_fail_on_exist
File "/usr/local/lib/python3.4/site-packages/azure-0.9.0-py3.4.egg/azure/storage/blobservice.py", line 189, in create_container
File "/usr/local/lib/python3.4/site-packages/azure-0.9.0-py3.4.egg/azure/storage/storageclient.py", line 150, in _perform_request
File "/usr/local/lib/python3.4/site-packages/azure-0.9.0-py3.4.egg/azure/storage/__init__.py", line 889, in _storage_error_handler
File "/usr/local/lib/python3.4/site-packages/azure-0.9.0-py3.4.egg/azure/__init__.py", line 929, in _general_error_handler
azure.WindowsAzureError: Unknown error (The value for one of the HTTP headers is not in the correct format.)
<?xml version="1.0" encoding="utf-8"?><Error><Code>InvalidHeaderValue</Code><Message>The value for one of the HTTP headers is not in the correct format.
RequestId:b37c5584-0001-002b-24b8-c2c245000000
Time:2014-11-19T14:54:38.9378626Z</Message><HeaderName>x-ms-version</HeaderName><HeaderValue>2012-02-12</HeaderValue></Error>
Thanks in advance and best regards.

I have this exact same issue. I believe it's a library bug, but the authors haven't commented on it yet.
The response looks as if it were reporting the service version, but it is actually naming the request header whose value was rejected. Its value should be "2014-02-14"; you can apply the fix shown in https://github.com/Azure/azure-sdk-for-python/pull/289 .
Hopefully this will be fixed and nobody will ever read this answer. Cheers!
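To see which header the service rejected without reading the raw XML by eye, the error body quoted in the question can be parsed with the standard library. This is a minimal sketch; the function name is hypothetical and the body is copied from the traceback above.

```python
import xml.etree.ElementTree as ET

ERROR_BODY = """<?xml version="1.0" encoding="utf-8"?><Error><Code>InvalidHeaderValue</Code><Message>The value for one of the HTTP headers is not in the correct format.
RequestId:b37c5584-0001-002b-24b8-c2c245000000
Time:2014-11-19T14:54:38.9378626Z</Message><HeaderName>x-ms-version</HeaderName><HeaderValue>2012-02-12</HeaderValue></Error>"""


def rejected_header(error_body):
    """Return (code, header name, header value) from a storage error body."""
    # Encode to bytes so the parser honours the encoding declaration itself.
    root = ET.fromstring(error_body.encode("utf-8"))
    return (
        root.findtext("Code"),
        root.findtext("HeaderName"),
        root.findtext("HeaderValue"),
    )


print(rejected_header(ERROR_BODY))
# ('InvalidHeaderValue', 'x-ms-version', '2012-02-12')
```

The HeaderValue shown is the value the client sent, which matches the answer's reading that the SDK's x-ms-version is the part that needs to change.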
