Are there any limits on saving results to S3 from SageMaker Processing? - python

※ I used Google Translate; if you have any questions, let me know!
I am trying to run a Python script on four huge datasets using SageMaker Processing. My current situation is as follows:
the script runs fine with 3 of the datasets
it fails with only 1 dataset (the biggest one, which has the same structure as the others)
for all 4 datasets the script itself finishes (so I suspect the error happens on the S3 side, i.e. when the SageMaker result is copied to S3)
The error I get is this InternalServerError:
Traceback (most recent call last):
File "sagemaker_train_and_predict.py", line 56, in <module>
outputs=outputs
File "{xxx}/sagemaker_constructor.py", line 39, in run
outputs=outputs
File "{masked}/.pyenv/versions/3.6.8/lib/python3.6/site-packages/sagemaker/processing.py", line 408, in run
self.latest_job.wait(logs=logs)
File "{masked}/.pyenv/versions/3.6.8/lib/python3.6/site-packages/sagemaker/processing.py", line 723, in wait
self.sagemaker_session.logs_for_processing_job(self.job_name, wait=True)
File "{masked}/.pyenv/versions/3.6.8/lib/python3.6/site-packages/sagemaker/session.py", line 3111, in logs_for_processing_job
self._check_job_status(job_name, description, "ProcessingJobStatus")
File "{masked}/.pyenv/versions/3.6.8/lib/python3.6/site-packages/sagemaker/session.py", line 2615, in _check_job_status
actual_status=status,
sagemaker.exceptions.UnexpectedStatusException: Error for Processing job sagemaker-vm-train-and-predict-2020-04-12-04-15-40-655: Failed. Reason: InternalServerError: We encountered an internal error. Please try again.

There may be an issue transferring the output data to S3 if the output is generated at a high rate and its size is too large.
You can 1) try to slow down writing the output a bit, or 2) call S3 from your algorithm container to upload the output directly using the boto3 client (https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/s3.html).
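For option 2, here is a minimal sketch (not the SageMaker SDK's built-in upload mechanism; the bucket name, key prefix, and local output directory below are placeholders) that walks the container's output directory and uploads each file with boto3:
import os
import boto3

s3 = boto3.client("s3")
output_dir = "/opt/ml/processing/output"   # local output dir configured for the job (assumed)
bucket = "my-results-bucket"               # placeholder bucket
prefix = "processing-job-output"           # placeholder key prefix

for root, _, files in os.walk(output_dir):
    for name in files:
        local_path = os.path.join(root, name)
        key = "/".join([prefix, os.path.relpath(local_path, output_dir)])
        s3.upload_file(local_path, bucket, key)  # boto3 handles multipart upload for large files
Uploading each result file as soon as it is finished also spreads the transfer over the job's runtime instead of producing one large burst at the end.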

Related

Why is np.load not able to load a file saved using np.save() without allow_pickle=True

I am trying to save a 2D array as an .npy file using np.save(). It saves without any error and without using any pickling. But when I load the file, I get the following error traceback:
Traceback (most recent call last):
File "final_model.py", line 144, in <module>
a=create_bags('abnormal')
File "final_model.py", line 130, in create_bags
video_feature=np.load(DATASET_ROOT+'train/features/'+flag+'/'+file)
File "/home/aditya_vartak_quantiphi_com/anaconda3/envs/v/lib/python3.8/site-packages/numpy/lib/npyio.py", line 452, in load
return format.read_array(fid, allow_pickle=allow_pickle,
File "/home/aditya_vartak_quantiphi_com/anaconda3/envs/v/lib/python3.8/site-packages/numpy/lib/format.py", line 739, in read_array
raise ValueError("Object arrays cannot be loaded when "
ValueError: Object arrays cannot be loaded when allow_pickle=False
I researched this on the internet, only to find one statement saying something like:
The array might not have been saved properly; that's why np.load is treating it as an object array.
So what I did was write a test for that file (let's call it error_file) as follows:
import numpy as np

f_b = feature_extractor(file_loc)   # extract features for the problematic file
np.save(target_path, f_b)           # save them as .npy
feature = np.load(target_path)      # reload immediately to verify
print(feature, feature.shape)
This gives the expected results without throwing an error,
but when I use it inside a function that takes all the feature .npy files one by one and loads each to print its contents, execution stops at exactly the point where it encounters
error_file, with the traceback above.
The test code should have replaced the wrongly formed feature .npy with a correct one, so the error doesn't seem to be with the file itself but with np.save.
Addendum: the function works for all .npy files before it, even though they all went through the same procedure during feature creation.
Any help would be great
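One way to narrow this down (a diagnostic sketch; the file path is a placeholder) is to load the failing file with allow_pickle=True and inspect its dtype. An object dtype usually means the features had inconsistent shapes before np.save was ever called, so NumPy stored a pickled object array rather than a numeric 2-D array:
import numpy as np

arr = np.load("path/to/error_file.npy", allow_pickle=True)  # placeholder path
print(arr.dtype, arr.shape)
# dtype('O') here means the value passed to np.save was a ragged/object array,
# e.g. a list of per-frame feature vectors with different lengths.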

Dask Distributed: Reading .csv from HDFS

I'm performance testing Dask using "Distributed Pandas on a Cluster with Dask DataFrames" as a guide.
In Matthew's example, he has a 20GB file and 64 workers (8 physical nodes).
In my case, I have an 82GB file and 288 workers (12 physical nodes; there's an HDFS data node on each).
On all 12 nodes, I can access HDFS and execute a simple Python script that displays info on a file:
import pyarrow as pa
fs = pa.hdfs.connect([url], 8022)
print(str(fs.info('/path/to/file.csv')))
If I create a single-node cluster (only 24 workers) using only the machine running Dask Scheduler, I can read the .csv from HDFS and print the length:
import dask
import dask.dataframe as dd
from dask.distributed import Client
client = Client()
dask.config.set(hdfs_backend='pyarrow')
df = dd.read_csv('hdfs://[url]:8022/path/to/file.csv')
df = client.persist(df)
print(str(len(df)))
That last line gives 1046250873 (nice!) and takes 3m17s to run.
However, when I use the full cluster, that last line calling len(df) dies and I get this error:
KilledWorker: ("('pandas_read_text-read-block-from-delayed-9ad3beb62f0aea4a07005d5c98749d7e', 1201)", 'tcp://[url]:42866')
This is similar to an issue mentioned here which has a solution here involving Dask Yarn and a config (?) that looks like: worker_env={'ARROW_LIBHDFS_DIR': ...}
However, I'm not using Yarn, although my guess is that the Dask Workers are somehow not configured with the HDFS/Arrow paths they need in order to connect.
I don't see any documentation on this, hence my question as to what I'm missing.
Edit:
Here's the error traceback I'm seeing in the output of the Dask Workers:
distributed.protocol.pickle - INFO - Failed to deserialize b'\x80\x04\x95N\x05\x00\x00\x00\x00\x00\x00(\x8c\x14dask.dataframe.utils\x94\x8c\ncheck_meta\x94\x93\x94(\x8c\x12dask
.compatibility\x94\x8c\x05apply\x94\x93\x94\x8c\x15dask.dataframe.io.csv\x94\x8c\x10pandas_read_text\x94\x93\x94]\x94(\x8c\x11pandas.io.parsers\x94\x8c\x08read_csv\x94\x93\x94(
\x8c\x0fdask.bytes.core\x94\x8c\x14read_block_from_file\x94\x93\x94h\r\x8c\x08OpenFile\x94\x93\x94(\x8c\x12dask.bytes.pyarrow\x94\x8c\x17PyArrowHadoopFileSystem\x94\x93\x94)\x8
1\x94}\x94(\x8c\x02fs\x94\x8c\x0cpyarrow.hdfs\x94\x8c\x10HadoopFileSystem\x94\x93\x94(\x8c\r10.255.200.91\x94MV\x1fNN\x8c\x07libhdfs\x94Nt\x94R\x94\x8c\x08protocol\x94\x8c\x04h
dfs\x94ub\x8c\x1a/path/to/file.csv\x94\x8c\x02rb\x94NNNt\x94R\x94K\x00J\x00\x90\xd0\x03C\x01\n\x94t\x94C\x12animal,weight,age\n\x94\x8c\x08builtins\x94\x8c\x04dict\x94
\x93\x94]\x94\x86\x94h*]\x94(]\x94(\x8c\x06animal\x94\x8c\x05numpy\x94\x8c\x05dtype\x94\x93\x94\x8c\x02O8\x94K\x00K\x01\x87\x94R\x94(K\x03\x8c\x01|\x94NNNJ\xff\xff\xff\xffJ\xff
\xff\xff\xffK?t\x94be]\x94(\x8c\x06weight\x94h2\x8c\x02i8\x94K\x00K\x01\x87\x94R\x94(K\x03\x8c\x01<\x94NNNJ\xff\xff\xff\xffJ\xff\xff\xff\xffK\x00t\x94be]\x94(\x8c\x03age\x94h<e
e\x86\x94]\x94(h/h9h#eeh*]\x94(]\x94(\x8c\x0cwrite_header\x94\x89e]\x94(\x8c\x07enforce\x94\x89e]\x94(\x8c\x04path\x94Nee\x86\x94t\x94\x8c\x11pandas.core.frame\x94\x8c\tDataFra
me\x94\x93\x94)\x81\x94}\x94(\x8c\x05_data\x94\x8c\x15pandas.core.internals\x94\x8c\x0cBlockManager\x94\x93\x94)\x81\x94(]\x94(\x8c\x18pandas.core.indexes.base\x94\x8c\n_new_In
dex\x94\x93\x94hW\x8c\x05Index\x94\x93\x94}\x94(\x8c\x04data\x94\x8c\x15numpy.core.multiarray\x94\x8c\x0c_reconstruct\x94\x93\x94h0\x8c\x07ndarray\x94\x93\x94K\x00\x85\x94C\x01
b\x94\x87\x94R\x94(K\x01K\x03\x85\x94h5\x89]\x94(h/h9h#et\x94b\x8c\x04name\x94Nu\x86\x94R\x94hY\x8c\x19pandas.core.indexes.range\x94\x8c\nRangeIndex\x94\x93\x94}\x94(hjN\x8c\x0
5start\x94K\x00\x8c\x04stop\x94K\x00\x8c\x04step\x94K\x01u\x86\x94R\x94e]\x94(h`hbK\x00\x85\x94hd\x87\x94R\x94(K\x01K\x02K\x00\x86\x94h<\x89C\x00\x94t\x94bh`hbK\x00\x85\x94hd\x
87\x94R\x94(K\x01K\x01K\x00\x86\x94h5\x89]\x94t\x94be]\x94(hYh[}\x94(h]h`hbK\x00\x85\x94hd\x87\x94R\x94(K\x01K\x02\x85\x94h5\x89]\x94(h9h#et\x94bhjNu\x86\x94R\x94hYh[}\x94(h]h`
hbK\x00\x85\x94hd\x87\x94R\x94(K\x01K\x01\x85\x94h5\x89]\x94h/at\x94bhjNu\x86\x94R\x94e}\x94\x8c\x060.14.1\x94}\x94(\x8c\x04axes\x94hV\x8c\x06blocks\x94]\x94(}\x94(\x8c\x06valu
es\x94hy\x8c\x08mgr_locs\x94h(\x8c\x05slice\x94\x93\x94K\x01K\x03K\x01\x87\x94R\x94u}\x94(h\x9dh\x7fh\x9eh\xa0K\x00K\x01K\x01\x87\x94R\x94ueust\x94b\x8c\x04_typ\x94\x8c\tdatafr
ame\x94\x8c\t_metadata\x94]\x94ub\x8c\x0cfrom_delayed\x94t\x94.'
Traceback (most recent call last):
File "/usr/lib/python3.6/site-packages/distributed/protocol/pickle.py", line 59, in loads
return pickle.loads(x)
File "/usr/lib64/python3.6/site-packages/pyarrow/hdfs.py", line 38, in __init__
self._connect(host, port, user, kerb_ticket, driver, extra_conf)
File "pyarrow/io-hdfs.pxi", line 89, in pyarrow.lib.HadoopFileSystem._connect
File "pyarrow/error.pxi", line 83, in pyarrow.lib.check_status
pyarrow.lib.ArrowIOError: Unable to load libjvm
Again, I can use pyarrow to successfully read a file from HDFS from any of the 12 nodes.
Looking at the traceback my guess is that PyArrow isn't correctly installed on the worker nodes. I might ask on the PyArrow issue tracker to see if they can help you to diagnose that traceback.
Ho boy! After building libhdfs3 from scratch, deploying it to part of the cluster, and finding the exact same result (ImportError: Can not find the shared library: libhdfs3.so), I realized the issue is that I've been starting the Dask workers via pssh, so they aren't picking up the environment variables they should.
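For anyone hitting the same thing, a quick diagnostic sketch (the scheduler address is a placeholder) is to ask every worker what it actually sees for the variables pyarrow needs to locate libjvm/libhdfs:
import os
from dask.distributed import Client

client = Client("tcp://scheduler:8786")  # placeholder scheduler address

def check_env():
    # Report the variables pyarrow/libhdfs typically rely on
    return {name: os.environ.get(name)
            for name in ("JAVA_HOME", "HADOOP_HOME", "ARROW_LIBHDFS_DIR", "LD_LIBRARY_PATH")}

print(client.run(check_env))  # one dict per worker
Workers that return None for these values were started without the environment they need, which is exactly what happened here with pssh.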

KeyError: 'max_overlaps' in TensorFlow version of Faster R-CNN

I tried to run a Python Faster R-CNN based on TensorFlow, cloned from https://github.com/smallcorgi/Faster-RCNN_TF
I built a dataset myself and rewrote the data API to make the dataset fit. The images in the dataset are all composed of meaningless background and text.
I have a .txt file that records the text location in an image, such as
ID_card/train/3.jpg 1 209 39 261 89
And my goal is to find text in a new image.
But when I run
python ./tools/train_net.py --device cpu --device_id 1 --solver VGG_CNN_M_1024 --weight ./data/pretrain_model/VGG_imagenet.npy --imdb ID_card_train --network IDcard_train
I got this KeyError: 'max_overlaps'
and here is the terminal record and error traceback.
Traceback (most recent call last):
File "./tools/train_net.py", line 97, in <module>
max_iters=args.max_iters)
File"/Users/jay_fu/tasks/catpatch/ClickCatpatch/tools/../lib/fast_rcnn/train.py", line 259, in train_net
roidb = filter_roidb(roidb)
File"/Users/jay_fu/tasks/catpatch/ClickCatpatch/tools/../lib/fast_rcnn/train.py", line 250, in filter_roidb
filtered_roidb = [entry for entry in roidb if is_valid(entry)]
File"/Users/jay_fu/tasks/catpatch/ClickCatpatch/tools/../lib/fast_rcnn/train.py", line 239, in is_valid
overlaps = entry['max_overlaps']
KeyError: 'max_overlaps'
I googled this and tried deleting the /cache folder, but it didn't work. The next time I ran the code, the folder and .pkl file were created again and the same error came out.
Some other answers said to delete another folder, lib/datasets/VOCdevkit-matlab-wrapper;
however, smallcorgi/Faster-RCNN_TF does not contain this folder, so that lead goes nowhere.
I wonder what happened in my code and what could cause this error. I have no idea what to do.
Can anyone give me some help, a solution, or even just an idea?
Edit:
Following #VegardKT's suggestion, I ran the demo and it works fine.
The terminal shows success and figures 1-5 show up.
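For reference, a diagnostic sketch to check whether the cached roidb ever received the 'max_overlaps' field (the cache path and file name below are assumptions based on how the repo usually names its cache):
import pickle

# Assumed cache location/name; adjust to whatever .pkl the training run creates
with open("data/cache/ID_card_train_gt_roidb.pkl", "rb") as f:
    roidb = pickle.load(f)

print(len(roidb), "entries")
print(sorted(roidb[0].keys()))  # 'max_overlaps' should appear after roidb preparation
missing = [i for i, entry in enumerate(roidb) if "max_overlaps" not in entry]
print("entries missing max_overlaps:", missing[:10])
If the key is missing from every entry, the custom data API is likely skipping the roidb preparation step that adds 'max_overlaps', rather than the cache simply being stale.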

Sound files in PsychoPy won't load

I'm currently working on building an experiment in PsychoPy (v1.82.01 stand-alone). I started on the project several months ago with an older version of PsychoPy.
It worked great and I ran some pilot subjects. We have since adjusted the stimulus sounds and it won't run.
It looks like there is an issue with referencing the sound file, but I can’t figure out what’s going on.
I recreated the first part of the experiment with a single file rather than a loop so that it would be easier to debug. The sound file is referenced using:
study_sound = sound.Sound(u'2001-1.ogg', secs=-1)
When I run it, I get this output:
Running: /Users/dkbjornn/Desktop/Test/test.py
2016-04-29 14:05:43.164 python[65267:66229207] ApplePersistenceIgnoreState: Existing state will not be touched. New state will be written to /var/folders/9f/3kr6zwgd7rz95bcsfw41ynw40000gp/T/org.psychopy.PsychoPy2.savedState
0.3022 WARNING Movie2 stim could not be imported and won't be available
sndinfo: failed to open the file.
Traceback (most recent call last):
File "/Users/dkbjornn/Desktop/Test/test.py", line 84, in <module>
study_sound = sound.Sound(u'2001-1.ogg', secs=-1)
File "/Applications/PsychoPy2.app/Contents/Resources/lib/python2.7/psychopy/sound.py", line 380, in __init__
self.setSound(value=value, secs=secs, octave=octave, hamming=hamming)
File "/Applications/PsychoPy2.app/Contents/Resources/lib/python2.7/psychopy/sound.py", line 148, in setSound
self._setSndFromFile(value)
File "/Applications/PsychoPy2.app/Contents/Resources/lib/python2.7/psychopy/sound.py", line 472, in _setSndFromFile
start=self.startTime, stop=self.stopTime)
File "/Applications/PsychoPy2.app/Contents/Resources/lib/python2.7/pyolib/tables.py", line 1420, in setSound
saved data to u'/Users/dkbjornn/Desktop/Test/data/99_test_2016_Apr_29_1405_1.csv'
_size, _dur, _snd_sr, _snd_chnls, _format, _type = sndinfo(path)
TypeError: 'NoneType' object is not iterable
The important thing here is the sndinfo: failed to open the file. message. Most likely, it cannot find your file on the disk. Check the following:
Is the file 2001-1.ogg in the same folder as your experiment? Not in a subfolder? Or have you accidentally changed your path, e.g. using os.chdir?
Is it actually called 2001-1.ogg? Any differences in uppercase/lowercase, spaces, etc. all count.
Alternatively, there's something in the particular way the .ogg was saved that causes the problem, even though the Sound class can read a large set of different sound codecs. Try exporting the sound file in other formats, e.g. .mp3 or .wav.
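A minimal sanity check along those lines (a sketch; the file name comes from the question, the path handling is an assumption) is to confirm the working directory and build an absolute path before constructing the Sound:
import os
from psychopy import sound

sound_name = '2001-1.ogg'
print('cwd:', os.getcwd())                    # has os.chdir moved us away from the experiment folder?
print('exists:', os.path.isfile(sound_name))  # can the file be seen from here at all?

# Resolving the path relative to the script itself avoids surprises from os.chdir
sound_path = os.path.join(os.path.dirname(os.path.abspath(__file__)), sound_name)
study_sound = sound.Sound(sound_path, secs=-1)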

py2neo - Neo4j - System Error - Create Batch Nodes/Relationships

Attempting to batch create nodes & relationships - the batch creation is failing - traceback at the end of the post.
Note: the code works with a smaller subset of nodes - it fails once it gets into a massive number of relationships; it's unclear at what limit this occurs.
Wondering if I need to increase ulimit above 40,000 open files.
I read somewhere that people were running into XStream issues with the REST API while doing batch creates - unclear whether the problem is on the py2neo end of the spectrum, on the Neo4j server tuning/configuration, or on the Python end.
Any guidance would be greatly appreciated.
One cluster within the data set ends up with around 625525 relationships out of 700+ nodes.
Total Relationships will be 1M+ - utilizing an Apple Macbook Pro Retina with x86_64 - Ubuntu 13.04, SSD, 8GB memory.
Neo4j: configured auto_indexing & auto_relationships set to ON
Nodes clustered/grouped via Python pandas DataFrame.groupby()
Nodes: contain 3 properties
Relationships Properties: 1 -> IN & Out Relationships created
ulimit set to 40,000 files open
Code
https://github.com/alienone/OSINT/blob/master/MANDIANTAPT/spitball.py
Operating System: Ubuntu 13.04
Python version: 2.7.5
py2neo Version: 1.5.1
Java version: 1.7.0_25-b15
Neo4j version: Community Edition 1.9.2
Traceback
Traceback (most recent call last):
File "/home/alienone/Programming/Python/OSINT/MANDIANTAPT/spitball.py", line 63, in
main()
File "/home/alienone/Programming/Python/OSINT/MANDIANTAPT/spitball.py", line 59, in main
graph_db.create(*sorted_nodes)
File "/home/alienone/.pythonbrew/pythons/Python-2.7.5/lib/python2.7/site-packages/py2neo/neo4j.py", line 420, in create
return batch.submit()
File "/home/alienone/.pythonbrew/pythons/Python-2.7.5/lib/python2.7/site-packages/py2neo/neo4j.py", line 2123, in submit
for response in self._submit()
File "/home/alienone/.pythonbrew/pythons/Python-2.7.5/lib/python2.7/site-packages/py2neo/neo4j.py", line 2092, in submit
for id, request in enumerate(self.requests)
File "/home/alienone/.pythonbrew/pythons/Python-2.7.5/lib/python2.7/site-packages/py2neo/rest.py", line 428, in _send
return self._client().send(request)
File "/home/alienone/.pythonbrew/pythons/Python-2.7.5/lib/python2.7/site-packages/py2neo/rest.py", line 365, in send
return Response(request.graph_db, rs.status, request.uri, rs.getheader("Location", None), rs_body)
File "/home/alienone/.pythonbrew/pythons/Python-2.7.5/lib/python2.7/site-packages/py2neo/rest.py", line 279, in init
raise SystemError(body)
SystemError: None
Process finished with exit code 1
I had a similar issue. One way to deal with it is to call batch.submit() on chunks of your data rather than the whole data set. This is slower, of course, but splitting one million nodes into chunks of 5000 is still faster than adding every node separately.
I use a small helper class to do this; note that all my nodes are indexed: https://gist.github.com/anonymous/6293739
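For anyone who prefers not to pull in the gist, the core idea is just a chunked wrapper around the batch create (a sketch assuming the py2neo 1.x-style graph_db.create(*entities) call used in the question):
def create_in_chunks(graph_db, entities, chunk_size=5000):
    # Submit one REST batch per chunk instead of a single huge batch
    created = []
    for start in range(0, len(entities), chunk_size):
        chunk = entities[start:start + chunk_size]
        created.extend(graph_db.create(*chunk))
    return created
One caveat: if any relationship entries refer to nodes by their position within the same batch, those references must stay inside a single chunk, or the nodes need to be created and looked up first, which appears to be why the helper above relies on indexed nodes.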
