I am attempting to upload my cleaned and k-fold-split data to S3 so that I can use SageMaker to build a model from it (since SageMaker expects training and test data in S3). However, whenever I attempt to upload the CSV to S3, the code runs but I don't see the file in S3.
I have tried changing which folder I access in SageMaker and uploading different types of files, none of which works. In addition, I have tried the approaches in similar Stack Overflow posts without success.
Also note that I am able to upload my CSV to S3 manually, just not through SageMaker automatically.
The code below is what I currently have to upload to S3, copied directly from the AWS documentation on uploading files with SageMaker.
import io
import csv
import boto3
#key = "{}/{}/examples".format(prefix,data_partition_name)
#url = 's3n://{}/{}'.format(bucket, key)
name = boto3.Session().resource('s3').Bucket('nc-demo-sagemaker').name
print(name)
boto3.Session().resource('s3').Bucket('nc-demo-sagemaker').upload_file('train', '/')
print('Done writing to {}'.format('sagemaker bucket'))
I expect that when I run this code snippet, the training and test data are uploaded to the folder I want, ready for use in creating SageMaker models.
I always upload files from a SageMaker notebook instance to S3 using the code below. It uploads the entire contents of the specified folder to S3; alternatively, you can specify a single file to upload.
import sagemaker
s3_path_to_data = sagemaker.Session().upload_data(bucket='my_awesome_bucket',
                                                  path='local/path/data/train',
                                                  key_prefix='my_crazy_project_name/data/train')
I hope this helps!
The issue may be due to a lack of proper S3 permissions for your SageMaker notebook.
Your IAM user has its own permissions attached, which is what lets you upload the CSV manually via the S3 console.
SageMaker notebooks, however, run with their own IAM execution role, which requires you to explicitly add S3 permissions. You can see this in the SageMaker console; the default IAM role is prefixed with SageMaker-XXX. You can either edit this SageMaker-created IAM role or attach an existing IAM policy that includes read/write permissions for S3.
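To see which execution role your notebook is actually using (so you can check its S3 permissions in the IAM console), here is a minimal sketch, assuming it runs inside the SageMaker notebook instance:
import sagemaker

# Prints the ARN of the IAM execution role attached to this notebook.
# Look this role up in the IAM console and confirm it allows s3:GetObject /
# s3:PutObject (or has the AmazonS3FullAccess managed policy) on your bucket.
role_arn = sagemaker.get_execution_role()
print(role_arn)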
Import the sagemaker library and use a SageMaker session to upload and download files to/from the S3 bucket.
import sagemaker
sagemaker_session = sagemaker.Session(default_bucket='MyBucket')
upload_data = sagemaker_session.upload_data(path='local_file_path', key_prefix='my_prefix')
print('upload_data : {}'.format(upload_data))
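For the download direction, a hedged counterpart sketch, assuming a SageMaker Python SDK version that provides Session.download_data; the bucket, prefix, and local path below are placeholders:
import sagemaker

sagemaker_session = sagemaker.Session(default_bucket='MyBucket')

# Download everything under the given S3 prefix into a local directory.
sagemaker_session.download_data(path='./downloaded_data',
                                bucket='MyBucket',
                                key_prefix='my_prefix')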
I want to access .dcm (DICOM) files stored in a container on ADLS Gen2 from a PySpark notebook on Azure Synapse Analytics. I'm using pydicom to access the files but I'm getting an error that the file does not exist. Please have a look at the code below.
To create the file path I'm using pathlib's Path:
Path(path_to_dicoms_dir).joinpath('stage_2_train_images/%s.dcm' % pid)
where pid is the id of the dcm image.
To read the .dcm image I'm using one of the following:
d = pydicom.read_file(data['dicom'])
OR
d = pydicom.dcmread(data['dicom'])
where data['dicom'] is the path.
I've checked the path and there is no issue with it; the file exists and all the access rights are in place, since I'm able to access other files in the directory just above the one containing these .dcm files. But those other files are CSVs, not DICOM.
Error:
FileNotFoundError: [Errno 2] No such file or directory: 'abfss:/#.dfs.core.windows.net//stage_2_train_images/stage_2_train_images/003d8fa0-6bf1-40ed-b54c-ac657f8495c5.dcm'
Questions I have in mind:
Is this the right storage solution for such image data? If not, should I use Blob Storage instead?
Is it some issue with the pydicom library, and am I missing a setting to tell pydicom that this is an ADLS link?
Or should I change the approach entirely and use Databricks to run my notebooks instead?
Or can someone otherwise help me with this issue?
Is this the right storage solution for such image data? If not, should I use Blob Storage instead?
The ADLS Gen2 storage account works perfectly fine with Synapse, so there is no need to use blob storage.
It seems like pydicom is not taking the path correctly.
You need to mount the ADLS Gen2 account in Synapse so that pydicom treats the path as an attached drive instead of a URL.
Follow Microsoft's tutorial, How to mount Gen2/Blob Storage, to do the same.
First create a Linked Service in Synapse that stores your ADLS Gen2 account connection details, then use the code below in the notebook to mount the storage account:
mssparkutils.fs.mount(
    "abfss://mycontainer@<accountname>.dfs.core.windows.net",
    "/test",
    {"linkedService": "mygen2account"}
)
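After the mount succeeds, one possible way to open a DICOM file through it; a hedged sketch, where the stage_2_train_images folder and the pid variable come from the question and getMountPath resolves the mount to a local path:
import pydicom
from notebookutils import mssparkutils

# Resolve the mount point to the local path that backs it, e.g. /synfs/<jobId>/test,
# so pydicom can open the file like a normal filesystem path.
local_root = mssparkutils.fs.getMountPath("/test")

# pid is the DICOM image id from the question; the folder layout assumes the
# files live under stage_2_train_images/ in the mounted container.
d = pydicom.dcmread(f"{local_root}/stage_2_train_images/{pid}.dcm")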
Upon deploying a custom PyTorch model with the boto3 client in Python, I noticed that a new S3 bucket had been created, with no visible objects in it. Is there a reason for this?
The bucket that contains my model has the keyword "sagemaker" in its name, so I don't see any issue there.
Here is the code that I used for deployment:
remote_model = PyTorchModel(
    name=model_name,
    model_data=model_url,
    role=role,
    sagemaker_session=sess,
    entry_point="inference.py",
    # image=image,
    framework_version="1.5.0",
    py_version='py3'
)
remote_predictor = remote_model.deploy(
    instance_type='ml.g4dn.xlarge',
    initial_instance_count=1,
    # update_endpoint=True,  # comment out or set to False if the endpoint doesn't exist
    endpoint_name=endpoint_name,  # define a unique endpoint name; if omitted, SageMaker will generate one based on the container used
    wait=True
)
It was likely created as a default bucket by the SageMaker Python SDK. Note that the code you posted is not boto3 (the general AWS Python SDK) but sagemaker, the SageMaker-specific Python SDK, which is higher-level than boto3.
The SageMaker Python SDK uses S3 in multiple places, for example to stage training code when using a Framework Estimator and to stage inference code when deploying with a Framework Model (your case). It gives you control over which S3 location to use, but if you don't specify one, it may use an automatically generated default bucket if it has the permissions to do so.
To control the code-staging S3 location, you can use the code_location parameter on either your PyTorch estimator (training) or your PyTorchModel (serving), as in the sketch below.
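For example, a hedged sketch that pins both the session's default bucket and the code-staging location to a bucket you already own; the bucket name and key prefix are placeholders, while model_url and role come from the question:
import sagemaker
from sagemaker.pytorch import PyTorchModel

# Use an existing bucket as the session default so the SDK does not
# auto-create one; "my-existing-bucket" is a placeholder.
sess = sagemaker.Session(default_bucket="my-existing-bucket")

remote_model = PyTorchModel(
    model_data=model_url,               # from the question
    role=role,
    entry_point="inference.py",
    framework_version="1.5.0",
    py_version="py3",
    sagemaker_session=sess,
    # Stage the repacked inference code under this prefix instead of the default bucket.
    code_location="s3://my-existing-bucket/code",
)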
I'm looking for a Pythonic way to copy a file from AWS S3 to GCS.
I do not want to open/read the file and then use the blob.upload_from_string() method. I want to transfer it 'as-is'.
I cannot use gsutil. The scope of the libraries I'm working with is gcloud and boto3 (I also experimented with s3fs).
Here is a simple example (that seems to work) using the blob.upload_from_string() method, which I'm trying to avoid because I don't want to open/read the file. I fail to make it work with the blob.upload_from_file() method because the GCS API requires an accessible, readable, file-like object, which I fail to provide properly.
What am I missing? Suggestions?
import boto3
from gcloud import storage
from oauth2client.service_account import ServiceAccountCredentials

GSC_Token_File = 'path/to/GSC_token'

s3 = boto3.client('s3', region_name='MyRegion')  # I'm running from AWS Lambda, no authentication required

# Load the service account credentials from the key file path above
gcs_credentials = ServiceAccountCredentials.from_json_keyfile_name(GSC_Token_File)
gcs_storage_client = storage.Client(credentials=gcs_credentials, project='MyGCP_project')
gcs_bucket = gcs_storage_client.get_bucket('MyGCS_bucket')

# Read the S3 object into memory as a string...
s3_file_to_load = str(s3.get_object(Bucket='MyS3_bucket', Key='path/to/file_to_copy.txt')['Body'].read().decode('utf-8'))

# ...and write it to GCS
blob = gcs_bucket.blob('file_to_copy.txt')
blob.upload_from_string(s3_file_to_load)
So I poked around a bit more and came across this article, which eventually led me to this solution: apparently the GCS API can be called using the AWS boto3 SDK.
Please mind the HMAC key prerequisite, which can easily be created using these instructions.
import boto3

# I'm using a GCP Service Account, so my HMAC key was created accordingly.
# An HMAC key for a User Account can be created just as well.
service_Access_key = 'YourAccessKey'
service_Secret = 'YourSecretKey'

# Reminder: I am copying from S3 to GCS
s3_client = boto3.client('s3', region_name='MyRegion')
gcs_client = boto3.client(
    "s3",  # yes, the "s3" client, just pointed at the GCS endpoint
    region_name="auto",
    endpoint_url="https://storage.googleapis.com",
    aws_access_key_id=service_Access_key,
    aws_secret_access_key=service_Secret,
)

file_to_transfer = s3_client.get_object(Bucket='MyS3_bucket', Key='path/to/file_to_copy.txt')
gcs_client.upload_fileobj(file_to_transfer['Body'], 'MyGCS_bucket', 'file_to_copy.txt')
I understand you're trying to move files from S3 to GCS using Python in an AWS Lambda function. There is one thing I'd like to clarify about the statement "I don't want to open/read the file": when the file is downloaded from S3 you are indeed reading it and writing it somewhere, be it into an in-memory string or into a temporary file. In that sense, it actually doesn't matter which of blob.upload_from_file() or blob.upload_from_string() is used, as they're equivalent; the first reads from a file and the second doesn't because the data has already been read into memory. Therefore my suggestion would be to keep the code as it is; I don't see a benefit in changing it.
Anyway, the file-based approach should be possible with something along the lines below (untested, as I have no S3 to check):
# From the S3 boto docs: https://boto3.amazonaws.com/v1/documentation/api/latest/guide/s3-example-download-file.html
s3.download_file('BUCKET_NAME', 'OBJECT_NAME', 'FILE_NAME')
# upload_from_file() expects a file-like object, so open the downloaded file first
with open('FILE_NAME', 'rb') as f:
    blob.upload_from_file(f)
Finally, it is worth mentioning the Storage Transfer Service, which is intended for moving huge amounts of data from S3 to GCS. If that sounds like your use case, you may take a look at its code samples for Python; a sketch follows below.
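A rough, hedged sketch of what such a one-time transfer job could look like, assuming the google-cloud-storage-transfer package; the project, bucket, and credential values below are all placeholders:
from datetime import datetime
from google.cloud import storage_transfer

client = storage_transfer.StorageTransferServiceClient()

# Run once, today: schedule start and end on the same day.
today = datetime.utcnow()
one_time_schedule = {"day": today.day, "month": today.month, "year": today.year}

transfer_job = client.create_transfer_job(
    {
        "transfer_job": {
            "project_id": "my-gcp-project",                      # placeholder
            "description": "One-time S3 -> GCS copy",
            "status": storage_transfer.TransferJob.Status.ENABLED,
            "schedule": {
                "schedule_start_date": one_time_schedule,
                "schedule_end_date": one_time_schedule,
            },
            "transfer_spec": {
                "aws_s3_data_source": {
                    "bucket_name": "MyS3_bucket",                # placeholder
                    "aws_access_key": {
                        "access_key_id": "YourAccessKey",        # placeholder
                        "secret_access_key": "YourSecretKey",    # placeholder
                    },
                },
                "gcs_data_sink": {"bucket_name": "MyGCS_bucket"},  # placeholder
            },
        }
    }
)
print("Created transfer job: {}".format(transfer_job.name))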
How do I set up direct private bucket access for Tensorflow?
After running
from tensorflow.python.lib.io import file_io
and running print file_io.stat('s3://my/private/bucket/file.json') I end up with an error -
NotFoundError: Object s3://my/private/bucket/file.json does not exist
However, the same line on a public object works without an error:
print file_io.stat('s3://ryft-public-sample-data/wikipedia-20150518.bin')
There appears to be an article on support here: https://github.com/tensorflow/examples/blob/master/community/en/docs/deploy/s3.md
However, I end up with the same error after exporting the variables shown.
I have awscli set up with all credentials, and boto3 can view and download the file in question. I am wondering how I can get Tensorflow to have S3 access directly when the bucket is private.
I had the same problem when trying to access files in a private S3 bucket from a SageMaker notebook. The mistake I made was trying to use credentials I had obtained from boto3, which apparently are not valid outside of it.
The solution was not to specify credentials at all (in that case the role attached to the machine is used) and instead just specify the region name (for some reason it wasn't read from the ~/.aws/config file), as follows:
import boto3
import os
session = boto3.Session()
os.environ['AWS_REGION']=session.region_name
NOTE: when debugging this error it was useful to look at the CloudWatch logs, as the S3 client's logs were printed only there and not in the Jupyter notebook.
There I first saw that:
when I specified the credentials from boto3, the error was: The AWS Access Key Id you provided does not exist in our records.
when I accessed the bucket without the AWS_REGION environment variable set, I got The bucket you are attempting to access must be addressed using the specified endpoint. Please send all future requests to this endpoint., which apparently is common when the bucket's region isn't specified (see 301 Moved Permanently after S3 uploading).
I'm using an S3-compatible service, which means my dynamic storage is not hosted on AWS. I found a couple of Python scripts that upload files to AWS S3. I would like to do the same, but I need to be able to set my own host URL. How can that be done?
You can use the Boto3 library (https://boto3.readthedocs.io/en/latest/) for all your S3 needs in Python. To use a custom S3-compatible host instead of AWS, set the endpoint_url argument when constructing an S3 resource object, e.g.:
import boto3
session = boto3.session.Session(...)
s3 = session.resource("s3", endpoint_url="http://...", ...)
You can use Amazon Route 53.
Please refer to:
http://docs.aws.amazon.com/AmazonS3/latest/dev/website-hosting-custom-domain-walkthrough.html