'Error -5 while decompressing data' in Spark, in PyPDF2 lib - python

Does anybody know why this occurs:
Py4JJavaError: An error occurred while calling z:org.apache.spark.api.python.PythonRDD.collectAndServe.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 1089.0 failed 4 times, most recent failure: Lost task 0.3 in stage 1089.0 (TID 1951, ip-10-0-208-38.ec2.internal): org.apache.spark.api.python.PythonException: Traceback (most recent call last):
File "/home/ubuntu/databricks/spark/python/pyspark/worker.py", line 101, in main
process()
File "/home/ubuntu/databricks/spark/python/pyspark/worker.py", line 96, in process
serializer.dump_stream(func(split_index, iterator), outfile)
File "/home/ubuntu/databricks/spark/python/pyspark/serializers.py", line 236, in dump_stream
vs = list(itertools.islice(iterator, batch))
File "<ipython-input-762-e1c8f006c3c2>", line 4, in getPdfData
File "<ipython-input-762-e1c8f006c3c2>", line 85, in extractData
File "./addedFile3142227314340912289bfddea6e_2a71_413e_b077_a4493987f9d9_PyPDF2-12df7.egg/PyPDF2/pdf.py", line 2566, in extractText
content = ContentStream(content, self.pdf)
File "./addedFile3142227314340912289bfddea6e_2a71_413e_b077_a4493987f9d9_PyPDF2-12df7.egg/PyPDF2/pdf.py", line 2644, in __init__
stream = BytesIO(b_(stream.getData()))
File "./addedFile3142227314340912289bfddea6e_2a71_413e_b077_a4493987f9d9_PyPDF2-12df7.egg/PyPDF2/generic.py", line 837, in getData
decoded._data = filters.decodeStreamData(self)
File "./addedFile3142227314340912289bfddea6e_2a71_413e_b077_a4493987f9d9_PyPDF2-12df7.egg/PyPDF2/filters.py", line 346, in decodeStreamData
data = FlateDecode.decode(data, stream.get("/DecodeParms"))
File "./addedFile3142227314340912289bfddea6e_2a71_413e_b077_a4493987f9d9_PyPDF2-12df7.egg/PyPDF2/filters.py", line 111, in decode
data = decompress(data)
File "./addedFile3142227314340912289bfddea6e_2a71_413e_b077_a4493987f9d9_PyPDF2-12df7.egg/PyPDF2/filters.py", line 49, in decompress
return zlib.decompress(data)
error: Error -5 while decompressing data: incomplete or truncated stream
I process PDF files on the workers using PyPDF2: I create a PyPDF2 PDF reader object, call getDocumentInfo(), and call extractText() on the page objects. I'm not explicitly using the zlib module, which is where this 'compression' error usually occurs according to the oracle called the internet.
My code runs perfectly fine for a smaller number of PDFs (around 500 or so) stored in an RDD (3 workers), but when I scale it up to 5000 or more it goes wrong. Any ideas?
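For illustration, here is a minimal sketch of the kind of per-document guard that keeps a single truncated stream from failing the whole Spark stage. It assumes the PDFs are available as bytes in an RDD; extract_pdf_text and pdf_bytes_rdd are hypothetical names, not the asker's code.

import io
import zlib
from PyPDF2 import PdfFileReader

def extract_pdf_text(pdf_bytes):
    # Parse one PDF on a worker; return None instead of raising when a
    # FlateDecode stream is incomplete or truncated (zlib error -5).
    try:
        reader = PdfFileReader(io.BytesIO(pdf_bytes))
        reader.getDocumentInfo()
        pages = (reader.getPage(i) for i in range(reader.getNumPages()))
        return "\n".join(page.extractText() for page in pages)
    except zlib.error:
        return None

# texts = pdf_bytes_rdd.map(extract_pdf_text).filter(lambda t: t is not None)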

Related

Has anyone experienced random file access errors when working with luigi in Windows?

When working with luigi on Windows 10, the following error is sometimes thrown:
Traceback (most recent call last):
File "D:\Users\myuser\PycharmProjects\project\venv\lib\site-packages\luigi\worker.py", line 192, in run
new_deps = self._run_get_new_deps()
File "D:\Users\myuser\PycharmProjects\project\venv\lib\site-packages\luigi\worker.py", line 130, in _run_get_new_deps
task_gen = self.task.run()
File "project.py", line 15000, in run
data_frame = pd.read_excel(self.input()["tables"].open(),segment)
File "D:\Users\myuser\PycharmProjects\project\venv\lib\site-packages\pandas\io\excel.py", line 191, in read_excel
io = ExcelFile(io, engine=engine)
File "D:\Users\myuser\PycharmProjects\project\venv\lib\site-packages\pandas\io\excel.py", line 247, in __init__
self.book = xlrd.open_workbook(file_contents=data)
File "D:\Users\myuser\PycharmProjects\project\venv\lib\site-packages\xlrd\__init__.py", line 115, in open_workbook
zf = zipfile.ZipFile(timemachine.BYTES_IO(file_contents))
File "C:\Python27\lib\zipfile.py", line 793, in __init__
self._RealGetContents()
File "C:\Python27\lib\zipfile.py", line 862, in _RealGetContents
raise BadZipfile("Bad magic number for central directory")
BadZipfile: Bad magic number for central directory
This error seems to happen only on Windows, and more frequently when a task needs to access the same file more than once, or when the same file is used as a Target by different tasks, even when no parallel workers are running. The code runs without errors on Linux.
Has anyone else experienced this behavior?
I'm trying to create pandas DataFrames from Excel file sheets, but sometimes I get a BadZipFile error instead.
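One workaround sketch, assuming the Target points at a local .xlsx file (the helper name read_excel_once is made up for illustration): read the file from disk once, then hand pandas an in-memory copy, so the workbook is not reopened while another handle may still be live.

import io
import pandas as pd

def read_excel_once(target, sheet):
    # Read the luigi Target's file into memory in one go, then parse the copy;
    # this avoids opening the same file on disk more than once per task.
    with open(target.path, "rb") as handle:
        raw = handle.read()
    return pd.read_excel(io.BytesIO(raw), sheet)

# Inside the task's run():
# data_frame = read_excel_once(self.input()["tables"], segment)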

Getting an error while training Resnet50 on Imagenet at 14th Epoch

I am training ResNet-50 on ImageNet using the script provided by PyTorch (with a slight, trivial tweak for my purposes). However, I am getting the following error after 14 epochs of training. I have allocated 4 GPUs on the server I'm using to run this. Any pointers as to what this error is about would be appreciated. Thanks a lot!
Epoch: [14][5000/5005] Time 1.910 (2.018) Data 0.000 (0.191) Loss 2.6954 (2.7783) Total 2.6954 (2.7783) Reg 0.0000 Prec#1 42.969 (40.556) Prec#5 64.844 (65.368)
Test: [0/196] Time 86.722 (86.722) Loss 1.9551 (1.9551) Prec#1 51.562 (51.562) Prec#5 81.641 (81.641)
Traceback (most recent call last):
File "main_group.py", line 549, in <module>
File "main_group.py", line 256, in main
File "main_group.py", line 466, in validate
if args.gpu is not None:
File "/home/users/oiler/anaconda3/envs/ml/lib/python3.7/site-packages/torch/utils/data/dataloader.py", line 801, in __next__
return self._process_data(data)
File "/home/users/oiler/anaconda3/envs/ml/lib/python3.7/site-packages/torch/utils/data/dataloader.py", line 846, in _process_data
data.reraise()
File "/home/users/oiler/anaconda3/envs/ml/lib/python3.7/site-packages/torch/_utils.py", line 385, in reraise
raise self.exc_type(msg)
OSError: Caught OSError in DataLoader worker process 11.
Original Traceback (most recent call last):
File "/home/users/oiler/anaconda3/envs/ml/lib/python3.7/site-packages/torch/utils/data/_utils/worker.py", line 178, in _worker_loop
data = fetcher.fetch(index)
File "/home/users/oiler/anaconda3/envs/ml/lib/python3.7/site-packages/torch/utils/data/_utils/fetch.py", line 44, in fetch
data = [self.dataset[idx] for idx in possibly_batched_index]
File "/home/users/oiler/anaconda3/envs/ml/lib/python3.7/site-packages/torch/utils/data/_utils/fetch.py", line 44, in <listcomp>
data = [self.dataset[idx] for idx in possibly_batched_index]
File "/home/users/oiler/anaconda3/envs/ml/lib/python3.7/site-packages/torchvision/datasets/folder.py", line 138, in __getitem__
sample = self.loader(path)
File "/home/users/oiler/anaconda3/envs/ml/lib/python3.7/site-packages/torchvision/datasets/folder.py", line 174, in default_loader
return pil_loader(path)
File "/home/users/oiler/anaconda3/envs/ml/lib/python3.7/site-packages/torchvision/datasets/folder.py", line 155, in pil_loader
with open(path, 'rb') as f:
OSError: [Errno 5] Input/output error: '/data/users2/oiler/github/imagenet-data/val/n02102973/ILSVRC2012_val_00009130.JPEG'
It is difficult to tell what the problem is just by looking at the error you have posted.
All we know is that there was an issue reading the file at '/data/users2/oiler/github/imagenet-data/val/n02102973/ILSVRC2012_val_00009130.JPEG'.
Try the following:
Confirm the file actually exists.
Confirm that it is in fact a valid JPEG and not corrupted (by viewing it).
Confirm that you can open it with Python and also load it manually with PIL (see the sketch after this list).
If none of that works, try deleting the file. Do you get the same error on another file in the folder?
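A quick way to run those checks on the exact file from the traceback (just a sketch; adjust the path if needed):

from PIL import Image

path = '/data/users2/oiler/github/imagenet-data/val/n02102973/ILSVRC2012_val_00009130.JPEG'

# 1. Can the OS read it at all? An I/O error here points at the disk or mount.
with open(path, 'rb') as f:
    data = f.read()
print(len(data), 'bytes read')

# 2. Is it a structurally valid JPEG that PIL can decode?
img = Image.open(path)
img.verify()             # raises if the file is corrupted
img = Image.open(path)   # verify() leaves the image unusable, so reopen
img.load()               # force a full decode, like the DataLoader does
print(img.size, img.mode)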

ImportError when running NuPIC model on PySpark

I am trying to run NuPIC on PySpark, but I am getting an ImportError. Does anyone have any ideas on how to fix it?
The code runs fine when I don't use PySpark, but now I am trying to run it from a Spark Dataset.
I am running it from the source code in my directory, since installing the NuPIC package causes some other errors.
Thank you for your help!
I am trying to run this function
input_data.rdd.foreach(lambda row: iterateRDD(row, model))

def iterateRDD(record, model):
    modelInput = record.asDict(False)
    modelInput["value"] = float(modelInput["value"])
    modelInput["timestamp"] = datetime.datetime.strptime(modelInput["timestamp"], "%Y-%m-%d %H:%M:%S")
    print "modelInput", modelInput
    result = model.run(modelInput)
    anomalyScore = result.inferences['anomalyScore']
    print "Anomaly score is", anomalyScore
However, I get this error and don't understand it.
File "C:/Users/rakshit.trn/Documents/Nupic/nupic-master/examples/anomaly.py", line 100, in runAnomaly
input_data.rdd.foreach(lambda row: iterateRDD(row, model))
File "C:\Python\Python27\lib\site-packages\pyspark\rdd.py", line 789, in foreach
self.mapPartitions(processPartition).count()  # Force evaluation
File "C:\Python\Python27\lib\site-packages\pyspark\rdd.py", line 1055, in count
return self.mapPartitions(lambda i: [sum(1 for _ in i)]).sum()
File "C:\Python\Python27\lib\site-packages\pyspark\rdd.py", line 1046, in sum
return self.mapPartitions(lambda x: [sum(x)]).fold(0, operator.add)
File "C:\Python\Python27\lib\site-packages\pyspark\rdd.py", line 917, in fold
vals = self.mapPartitions(func).collect()
File "C:\Python\Python27\lib\site-packages\pyspark\rdd.py", line 816, in collect
sock_info = self.ctx._jvm.PythonRDD.collectAndServe(self._jrdd.rdd())
File "C:\Python\Python27\lib\site-packages\py4j\java_gateway.py", line 1257, in __call__
answer, self.gateway_client, self.target_id, self.name)
File "C:\Python\Python27\lib\site-packages\pyspark\sql\utils.py", line 63, in deco
return f(*a, **kw)
File "C:\Python\Python27\lib\site-packages\py4j\protocol.py", line 328, in get_return_value
format(target_id, ".", name), value)
py4j.protocol.Py4JJavaError: An error occurred while calling z:org.apache.spark.api.python.PythonRDD.collectAndServe.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 2.0 failed 1 times, most recent failure: Lost task 0.0 in stage 2.0 (TID 2, localhost, executor driver): org.apache.spark.api.python.PythonException: Traceback (most recent call last):
File "D:\spark-2.4.3-bin-hadoop2.7\python\lib\pyspark.zip\pyspark\worker.py", line 364, in main
File "D:\spark-2.4.3-bin-hadoop2.7\python\lib\pyspark.zip\pyspark\worker.py", line 69, in read_command
File "D:\spark-2.4.3-bin-hadoop2.7\python\lib\pyspark.zip\pyspark\serializers.py", line 172, in _read_with_length
return self.loads(obj)
File "D:\spark-2.4.3-bin-hadoop2.7\python\lib\pyspark.zip\pyspark\serializers.py", line 583, in loads
return pickle.loads(obj)
ImportError: No module named frameworks.opf.htm_prediction_model
I guess that NuPIC is not able to access the frameworks/opf/htm_prediction_model.py file.
You might be running an old version of NuPIC. See https://discourse.numenta.org/t/warning-0-7-0-breaking-changes/2200 and check what version you are using (https://discourse.numenta.org/t/how-to-check-what-version-of-nupic-is-installed/1045)
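A quick way to check which NuPIC the interpreter actually picks up (run it on the driver and, if possible, on the workers; just a sketch):

import pkg_resources
import nupic

print(nupic.__file__)  # is this the source tree or an installed package?
try:
    print(pkg_resources.get_distribution("nupic").version)
except pkg_resources.DistributionNotFound:
    print("nupic is not pip-installed; running from source")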

pymssql.OperationalError

I am using Python and pymssql for the SQL connection.
Basically I am fetching some data from 'table1', doing some mathematical calculations, and inserting the results into 'table2'.
I tested my Python script with a limited number of records from 'table1' and it worked fine.
Then I started the script with real-time data, i.e. a really huge amount of data (122,457 records), and ran into the issue below:
Traceback (most recent call last):
File "trakryt_main_mod.py", line 312, in <module>
create_visitwalkby_table()
File "trakryt_main_mod.py", line 126, in create_visitwalkby_table
scenarios.one_beacon(cursor,orgid)
File "C:\Nikhil\trak_reporting\python\scenarios.py", line 260, in one_beacon
close_visit()
File "C:\Nikhil\trak_reporting\python\scenarios.py", line 207, in close_visit
conn = pymssql.connect(server,user,password,db,port=port)
File "pymssql.pyx", line 645, in pymssql.connect (pymssql.c:10301)
File "pymssql.pyx", line 247, in pymssql.Connection.__init__ (pymssql.c:3771)
pymssql.OperationalError: Cannot start transaction: (20004, 'DB-Lib error message 20004, severity 9:\nRead from the server failed\nNet-Lib error during Unknown error (10054)\n')
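Not a confirmed fix, but one defensive pattern for long runs like this is to retry the connection when the server drops it mid-run (error 10054) and to set explicit timeouts. A sketch, reusing the question's variable names (server, user, password, db, port):

import time
import pymssql

def connect_with_retry(server, user, password, db, port, retries=3, delay=5):
    # Retry pymssql.connect a few times before giving up; login_timeout and
    # timeout are pymssql keyword arguments, in seconds.
    for attempt in range(retries):
        try:
            return pymssql.connect(server, user, password, db,
                                   port=port, login_timeout=30, timeout=300)
        except pymssql.OperationalError:
            if attempt == retries - 1:
                raise
            time.sleep(delay)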

MemoryError using CVS2SVN

I'm trying to migrate a CVS repository to SVN using cvs2svn. After dealing with some errors during the first pass (CollectRevsPass), I'm now in the fourth pass and this error appears:
...
c:\Users\Andres\Desktop\copa\copa\proyectosAMEG\JGA\KnapsackJG2A\src\operators\KnapsackSelection.java,v
Done
Time for pass1 (CollectRevsPass): 68.47 seconds.
----- pass 2 (CleanMetadataPass) -----
Converting metadata to UTF8...
Done
Time for pass2 (CleanMetadataPass): 0.437 seconds.
----- pass 3 (CollateSymbolsPass) -----
Checking for forced tags with commits...
Done
Time for pass3 (CollateSymbolsPass): 0.015 seconds.
----- pass 4 (FilterSymbolsPass) -----
Filtering out excluded symbols and summarizing items...
Traceback (most recent call last):
File "C:\Users\Andres\Downloads\cvs2svn-2.4.0.tar\dist\cvs2svn-2.4.0\cvs2svn-2
.4.0\cvs2svn", line 70, in <module>
svn_main(os.path.basename(sys.argv[0]), sys.argv[1:])
File "C:\Users\Andres\Downloads\cvs2svn-2.4.0.tar\dist\cvs2svn-2.4.0\cvs2svn-2
.4.0\cvs2svn_lib\main.py", line 113, in svn_main
main(progname, run_options, pass_manager)
File "C:\Users\Andres\Downloads\cvs2svn-2.4.0.tar\dist\cvs2svn-2.4.0\cvs2svn-2
.4.0\cvs2svn_lib\main.py", line 96, in main
pass_manager.run(run_options)
File "C:\Users\Andres\Downloads\cvs2svn-2.4.0.tar\dist\cvs2svn-2.4.0\cvs2svn-2
.4.0\cvs2svn_lib\pass_manager.py", line 181, in run
the_pass.run(run_options, stats_keeper)
File "C:\Users\Andres\Downloads\cvs2svn-2.4.0.tar\dist\cvs2svn-2.4.0\cvs2svn-2
.4.0\cvs2svn_lib\passes.py", line 505, in run
revision_collector.process_file(cvs_file_items)
File "C:\Users\Andres\Downloads\cvs2svn-2.4.0.tar\dist\cvs2svn-2.4.0\cvs2svn-2
.4.0\cvs2svn_lib\checkout_internal.py", line 615, in process_file
_Sink(self, cvs_file_items),
File "C:\Users\Andres\Downloads\cvs2svn-2.4.0.tar\dist\cvs2svn-2.4.0\cvs2svn-2
.4.0\cvs2svn_lib\rcsparser.py", line 68, in parse
return selected_parser().parse(file, sink)
File "C:\Users\Andres\Downloads\cvs2svn-2.4.0.tar\dist\cvs2svn-2.4.0\cvs2svn-2
.4.0\cvs2svn_rcsparse\common.py", line 477, in parse
self.parse_rcs_deltatext()
File "C:\Users\Andres\Downloads\cvs2svn-2.4.0.tar\dist\cvs2svn-2.4.0\cvs2svn-2
.4.0\cvs2svn_rcsparse\common.py", line 450, in parse_rcs_deltatext
self.sink.set_revision_info(revision, log, text)
File "C:\Users\Andres\Downloads\cvs2svn-2.4.0.tar\dist\cvs2svn-2.4.0\cvs2svn-2
.4.0\cvs2svn_lib\checkout_internal.py", line 539, in set_revision_info
text_record, self._rcs_stream.get_text()
File "C:\Users\Andres\Downloads\cvs2svn-2.4.0.tar\dist\cvs2svn-2.4.0\cvs2svn-2
.4.0\cvs2svn_lib\checkout_internal.py", line 601, in _writeout
self._delta_db[text_record.id] = text
File "C:\Users\Andres\Downloads\cvs2svn-2.4.0.tar\dist\cvs2svn-2.4.0\cvs2svn-2
.4.0\cvs2svn_lib\indexed_database.py", line 94, in __setitem__
s = self.serializer.dumps(item)
File "C:\Users\Andres\Downloads\cvs2svn-2.4.0.tar\dist\cvs2svn-2.4.0\cvs2svn-2
.4.0\cvs2svn_lib\serializer.py", line 138, in dumps
return marshal.dumps(zlib.compress(self.wrapee.dumps(object), 9))
MemoryError
I checked, and there is enough free memory when this error appears. Just before the error appears, the 'python' process's memory usage increases sharply.
Does anyone know what I can do?
cvs2svn 2.4
Python 2.7
This error can also indicate that the process has run out of virtual address space. If you are running cvs2svn in 32-bit mode, then the process only has something like 2 GiB or 4 GiB of addresses that it can use, regardless of whether the computer has free RAM. If this is the case, try running the program in 64-bit mode.
If you are already running in 64-bit mode, then try increasing RAM or swap space or running the program on a beefier computer.
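To see which case applies, you can ask the Python interpreter that runs cvs2svn how wide its pointers are:

import platform
import struct

print(platform.architecture()[0])                    # '32bit' or '64bit'
print("%d-bit pointers" % (struct.calcsize("P") * 8))  # pointer size in bits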
