I am building a Beam pipeline on Google Cloud Dataflow.
I am getting an error that Cloud Dataflow does not have permission to write to a temp directory.
This is confusing, since Dataflow clearly has the ability to write to the bucket: it created the staging folder.
Why would I be able to write to a staging folder, but not a temp folder?
I am running from within a docker container on a compute engine. I am fully authenticated with my service account.
PROJECT=$(gcloud config list project --format "value(core.project)")
BUCKET=gs://$PROJECT-testing
python tests/prediction/run.py \
--runner DataflowRunner \
--project $PROJECT \
--staging_location $BUCKET/staging \
--temp_location $BUCKET/temp \
--job_name $PROJECT-deepmeerkat \
--setup_file tests/prediction/setup.py
EDIT
In response to @alex amato:
Does the bucket belong to the project or is it owned by another project?
Yes, when I go the home screen for the project, this is one of four buckets listed. I commonly upload data and interact with other google cloud services (cloud vision API) from this bucket.
Would you please provide the full error message.
"(8d8bc4d7fc4a50bd): Failed to write a file to temp location 'gs://api-project-773889352370-testing/temp/api-project-773889352370-deepmeerkat.1498771638.913123'. Please make sure that the bucket for this directory exists, and that the project under which the workflow is running has the necessary permissions to write to it."
"8d8bc4d7fc4a5f8f): Workflow failed. Causes: (8d8bc4d7fc4a526c): One or more access checks for temp location or staged files failed. Please refer to other error messages for details. For more information on security and permissions, please see https://cloud.google.com/dataflow/security-and-permissions."
Can you confirm that there isn't already an existing GCS object which matches the name of the GCS folder path you are trying to use?
Yes, there is no folder named temp in the bucket.
Could you please verify that the permissions you have match the member you run as?
Bucket permissions have global admin, which matches my gcloud auth.
@chamikara was correct. Despite inheriting credentials from my service account, Cloud Dataflow needs its own credentials.
Can you also give access to the cloudservices account (<project-number>@developer.gserviceaccount.com), as mentioned in cloud.google.com/dataflow/security-and-permissions?
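If it helps, here is a rough sketch (my own, not from the original answer) of granting that account write access to the bucket with the google-cloud-storage client; the bucket name and role are placeholders you would need to adjust:
from google.cloud import storage

client = storage.Client()
bucket = client.get_bucket('api-project-XXXX-testing')   # placeholder bucket name
policy = bucket.get_iam_policy()
# Grant the cloudservices account permission to create and delete objects in the bucket.
policy['roles/storage.objectAdmin'].add('serviceAccount:<project-number>@developer.gserviceaccount.com')
bucket.set_iam_policy(policy)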
I ran into the same issue with a different cause: I had set an object retention policy, which prevents manual deletions. Since renaming triggers a deletion, this error happened.
Therefore, if anyone runs into a similar issue, investigate your temp bucket's properties and consider lifting any retention policies.
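As a hedged aside, here is a minimal sketch of how you might inspect and clear such a retention policy with the google-cloud-storage client (the bucket name is a placeholder, and this only works if the policy has not been locked):
from google.cloud import storage

client = storage.Client()
bucket = client.get_bucket('my-dataflow-temp-bucket')   # placeholder temp bucket
print(bucket.retention_period)    # retention period in seconds, or None if no policy is set
bucket.retention_period = None    # clear the retention policy (fails if the policy is locked)
bucket.patch()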
I got similar errors while moving from DirectRunner to DataflowRunner:
Staged package XXX.jar at location 'gs://YYY/staging/XXX.jar' is inaccessible.
After playing with the permissions, this is what worked for me:
in the Storage browser, I clicked Edit Bucket Permissions (for the specific bucket) and added the right Storage permission for the member ZZZ-compute@developer.gserviceaccount.com.
I hope this saves time for other users as well.
Related
I have mounted a storage bucket on a VM using the command:
gcsfuse my-bucket /path/to/mount
After this I'm able to read files from the bucket in Python using Pandas, but I'm not able to write files nor create new folders. I have tried with Python and from the terminal using sudo but get the same error.
I have also tried Using the key_file from the bucket:
sudo mount -t gcsfuse -o implicit_dirs,allow_other,uid=1000,gid=1000,key_file=Notebooks/xxxxxxxxxxxxxx10b3464a1aa9.json <BUCKET> <PATH>
It does not throw errors when I run the command, but I am still not able to write to the bucket.
I have also tried:
gcloud auth login
But still have the same issue.
I ran into the same thing a while ago, which was really confusing. You have to set the correct access scope for the virtual machine so that anyone using the VM is able to call the storage API. The documentation shows that the default access scope for storage on a VM is read-only:
When you create a new Compute Engine instance, it is automatically
configured with the following access scopes:
Read-only access to Cloud Storage:
https://www.googleapis.com/auth/devstorage.read_only
All you have to do is change this scope so that you are also able to write to storage buckets from the VM. You can find an overview of the different scopes here. To apply the new scope to your VM, you first have to shut it down. Then, from your local machine, execute the following command:
gcloud compute instances set-scopes INSTANCE_NAME \
--scopes=storage-rw \
--zone=ZONE
You can do the same thing from the Cloud Console if you go to the settings of your VM, scroll all the way down, and choose "Set access for each API". You have the same options when you create the VM for the first time.
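Once the new scope is active, a quick way to confirm write access from inside the VM is a small test like the sketch below (using the google-cloud-storage client; the bucket and object names are placeholders):
from google.cloud import storage

client = storage.Client()                 # uses the VM's default service account
bucket = client.bucket('my-bucket')       # placeholder: the bucket you mounted with gcsfuse
blob = bucket.blob('write-test.txt')
blob.upload_from_string('hello')          # raises a 403 error while the scope is still read-only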
I have a Python ML process which connects to BigQuery using a local JSON file that the env variable GOOGLE_APPLICATION_CREDENTIALS points to (the file contains my keys supplied by Google; see authentication getting-started).
When running it locally it works great.
I'm now looking to deploy my model through Google's ML Engine, specifically using the shell command gcloud ml-engine jobs submit training.
However, after I ran my process and looked at the logs in console.cloud.google.com/logs/viewer, I saw that gcloud can't access BigQuery and I'm getting the following error:
google.auth.exceptions.DefaultCredentialsError: File:
/Users/yehoshaphatschellekens/Desktop/google_cloud_xgboost/....-.....json was not found.
Currently I don't think that gcloud ml-engine jobs submit training takes the JSON file with it (I thought gcloud automatically has access to BigQuery; I guess not).
One optional workaround is to save my personal .json into my Python dependencies in the other sub-package folder (see packaging-trainer) and import it.
Is this solution feasible/safe?
Is there any other workaround to this issue?
What I did eventually was upload the JSON to a Cloud Storage bucket and then copy it into my project each time I launch the ML Engine training process:
os.system('gsutil cp gs://secured_bucket.json .')
os.environ[ "GOOGLE_APPLICATION_CREDENTIALS"] = "......json"
the path should be absolute and with backslashes in Windows:
GOOGLE_APPLICATION_CREDENTIALS="C:\Users\username\Downloads\[FILE_NAME].json"
Set it this way in your Python code (a raw string keeps the backslashes from being treated as escapes):
os.environ['GOOGLE_APPLICATION_CREDENTIALS'] = r"C:\PATH.JSON"
Example with the Google Translate API here.
In AWS, you can assign a role to a VM, which then authorizes the instance when it makes queries to the AWS SDK. I am looking for similar functionality in Azure, or something that would enable me to do close to that.
I found this post which suggests that this is not possible in the way AWS does it. Are there any workarounds for this? I really don't want the system administrator to have to login to the instance and give their Azure Active Directory credentials to authorize it.
Excellent question :). I would suggest waiting a few days; we have something in progress that seems to fit your need. I created this issue for tracking.
The simplest approach would be to create Service Principal credentials for these VMs. To do that, execute a post-deployment script that installs the CLI and runs "az ad sp create-for-rbac --sdk-auth > ~/mycredentials.json". Then your SDK script can start by reading this credentials file.
The "create-for-rbac" command already exists if you want to look at it (--sdk-auth is the new option coming), so you can see that you can specify all the scopes and permissions needed in this command.
(I own the Azure SDK for Python at MS)
I maintain this Django web app where users congregate and chat with one another. They can post pictures too if they want. I process these photos (i.e. optimize their size) and store them on an Amazon S3 bucket (like a 'container' in Azure Storage). To do that, I set up the bucket on Amazon, and included the following configuration code in my settings.py:
DEFAULT_FILE_STORAGE = 'storages.backends.s3boto.S3BotoStorage'
AWS_S3_FORCE_HTTP_URL = True
AWS_QUERYSTRING_AUTH = False
AWS_SECRET_ACCESS_KEY = os.environ.get('awssecretkey')
AWS_ACCESS_KEY_ID = os.environ.get('awsaccesskeyid')
AWS_S3_CALLING_FORMAT='boto.s3.connection.OrdinaryCallingFormat'
AWS_STORAGE_BUCKET_NAME = 'dakadak.in'
Additionally Boto 2.38.0 and django-storages 1.1.8 are installed in my virtual environment. Boto is a Python package that provides interfaces to Amazon Web Services, whereas django-storages is a collection of custom storage backends for Django.
I now want to stop using Amazon S3 and instead migrate to Azure Storage. That is, I need to migrate all existing files from S3 to Azure Storage, and then configure my Django app to save all new static assets to Azure Storage.
I can't find precise documentation on what I'm trying to achieve, though I do know django-storages supports Azure. Has anyone done this kind of migration before and can point out where I need to begin and which steps to follow to get everything up and running?
Note: Ask me for more information if you need it.
In my experience, migrating a Django web app from Amazon S3 to Azure Storage takes two steps.
The first step is moving all files from S3 to Azure Blob Storage. There are two ways you can try.
Use the tools for S3 and Azure Storage to move files from S3 to a local directory and from there to Azure Blob Storage.
The tools below can help move S3 files to a local directory.
AWS Command Line Interface (https://aws.amazon.com/cli/): aws s3 cp s3://BUCKET/FOLDER localfolder --recursive
S3cmd Tools (http://s3tools.org/): s3cmd -r sync s3://BUCKET/FOLDER localfolder
S3 Browser (http://s3browser.com/): this is a GUI client.
For moving local files to Azure Blob Storage, you can use the AzCopy command-line utility for high-performance uploading, downloading, and copying of data to and from Microsoft Azure Blob, File, and Table storage; please refer to the document https://azure.microsoft.com/en-us/documentation/articles/storage-use-azcopy/.
Example: AzCopy /Source:C:\myfolder /Dest:https://myaccount.blob.core.windows.net/mycontainer
Alternatively, migrate programmatically with the Amazon S3 and Azure Blob Storage APIs in a language you are familiar with, like Python or Java. Please refer to their API docs: https://azure.microsoft.com/en-us/documentation/articles/storage-python-how-to-use-blob-storage/ and http://docs.aws.amazon.com/AmazonS3/latest/API/APIRest.html.
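For instance, a minimal sketch of the programmatic route (using boto 2.x and the older azure-storage BlobService client; the account names and keys are placeholders):
import boto
from azure.storage.blob import BlobService   # older azure-storage client

s3 = boto.connect_s3(aws_access_key_id='awsaccesskeyid', aws_secret_access_key='awssecretkey')
bucket = s3.get_bucket('dakadak.in')                                   # the existing S3 bucket
blob_service = BlobService(account_name='myaccount', account_key='account_key')

for key in bucket.list():
    data = key.get_contents_as_string()                                     # download each object
    blob_service.put_block_blob_from_bytes('mycontainer', key.name, data)   # re-upload to Azure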
The second step is following the document http://django-storages.readthedocs.org/en/latest/backends/azure.html to re-configure the Django settings.py file. django-storages will then automatically store any uploads you make in your storage container.
DEFAULT_FILE_STORAGE='storages.backends.azure_storage.AzureStorage'
AZURE_ACCOUNT_NAME='myaccount'
AZURE_ACCOUNT_KEY='account_key'
AZURE_CONTAINER='mycontainer'
You can find these settings on the Azure Portal, as @neolursa said.
Edit: (screenshots showing where to find these settings on the old and new Azure portals)
The link you've shared has the configuration for Django to Azure Blobs. When I did this, all I had to do was go to the Azure portal, get the access keys under the storage account, then create a container and give the container name to the Django configuration. That should be enough, although I did this a while ago.
For the second part, migrating the current files from the S3 bucket to Blob storage, there are a couple of tools you can use:
If you are using visual studio, you can find the blobs under the Server Explorer after entering your azure account credentials in Visual Studio.
Alternatively you can use third party tools like Azure Storage Explorer or Cloudberry explorer.
Hope this helps!
azure.servicemanagement.servicemanagementservice.py contains:
def create_deployment(self, service_name, deployment_slot, name,
package_url, label, configuration,
start_deployment=False,
treat_warnings_as_error=False,
extended_properties=None):
What are package_url and configuration? The method comment indicates:
package_url:
A URL that refers to the location of the service package in the
Blob service. The service package can be located either in a
storage account beneath the same subscription or a Shared Access
Signature (SAS) URI from any storage account.
....
configuration:
The base-64 encoded service configuration file for the deployment.
All over the internet there are references to Visual Studio and PowerShell for creating those files. What do they look like? Can I create them manually? Can the azure module create them? Why is this Microsoft service so confusing and so lacking in documentation?
I am using https://pypi.python.org/pypi/azure Python Azure SDK. I am running Mac OS X on my dev box, so I don't have Visual Studio or cspack.exe.
Any help appreciated. Thank you.
According to your description, it looks like you are trying to use the Python Azure SDK to create a cloud service deployment. Here is the documentation on how to use the create_deployment function.
Can I create them manually? Can the azure module create them?
If you mean you want to know how to create an Azure deployment package of your Python app, then based on my experience there are several options you can leverage.
If you have Visual Studio, you can create a cloud project from the project templates (in VS: create new project -> Cloud) and then package the project with one click.
Without VS, you can use the Microsoft Azure PowerShell cmdlets or the cspack command-line tool to create a deployment package. A similar question can be found at: Django project to Azure Cloud Services without Visual Studio.
After packaging the project, you will have a .cspkg file.
For your better reference, I have uploaded the testing project at:
https://onedrive.live.com/redir?resid=7B27A151CFCEAF4F%21143283
As to the 'configuration' parameter, it means the base-64 encoded service configuration file (.cscfg) for the deployment.
In Python, we can set up the 'configuration' via the code below:
import base64
configuration = base64.b64encode(open('E:\\TestProjects\\Python\\YourProjectFolder\\ServiceConfiguration.Cloud.cscfg', 'rb').read())
Hope the above info provides a quick clarification. Now, let's go back to the Python SDK itself and see how we can use the create_deployment function to create a cloud service deployment accordingly.
Firstly, I'd suggest you refer to https://azure.microsoft.com/en-us/documentation/articles/cloud-services-python-how-to-use-service-management/ to get a basic idea of what Azure Service Management is and how it works.
In general, we can make the create_deployment function work via 5 steps:
Create your project's deployment package and set up a configuration file (.cscfg); for a quick test, you can use the one that I have uploaded.
Store your project's deployment package in a Microsoft Azure Blob Storage account under the same subscription as the hosted service to which the package is being uploaded, and get the blob file's URL (or use a Shared Access Signature (SAS) URI from any storage account). You can use Azure Storage Explorer to upload the package file, and it will then be shown on the Azure portal; a Python upload sketch follows after this list.
Use OpenSSL to create your management certificate. You need to create two certificates, one for the server (a .cer file) and one for the client (a .pem file); the article mentioned just now provides detailed instructions: https://azure.microsoft.com/en-us/documentation/articles/cloud-services-python-how-to-use-service-management/
Screenshot of my created certificates
Then upload the .cer certificate to the Azure portal: SETTINGS -> management certificates tab -> click the upload button (at the bottom of the page).
Create a cloud service in Azure and keep its name in mind.
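If you would rather upload the package from Python than from Azure Storage Explorer, here is a hedged sketch using the older azure-storage BlobService client (the container, account, and file names are placeholders):
from azure.storage.blob import BlobService   # older azure-storage client

blob_service = BlobService(account_name='myaccount', account_key='account_key')
blob_service.put_block_blob_from_path(
    'deployments',                                        # placeholder container
    'MyProject.cspkg',                                    # blob name
    'E:\\TestProjects\\Python\\MyProject.cspkg')          # local package path
package_url = blob_service.make_blob_url('deployments', 'MyProject.cspkg')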
Create another project to test the Azure SDK's create_deployment; code snippet for your reference:
import base64
from azure.servicemanagement import ServiceManagementService

subscription_id = 'Your subscription ID, which can be found in the Azure portal'
certificate_path = 'E:\\YourFolder\\mycert.pem'
sms = ServiceManagementService(subscription_id, certificate_path)

def TestForCreateADeployment():
    service_name = "Your Cloud Service Name"
    deployment_name = "name"
    slot = 'Production'
    package_url = ".cspkg file's URL (from your blob storage)"
    configuration = base64.b64encode(open('E:\\TestProjects\\Python\\YourProjectFolder\\ServiceConfiguration.Cloud.cscfg', 'rb').read())
    label = service_name
    result = sms.create_deployment(service_name,
                                   slot,
                                   deployment_name,
                                   package_url,
                                   label,
                                   configuration)
    operation = sms.get_operation_status(result.request_id)
    print('Operation status: ' + operation.status)

TestForCreateADeployment()
(Screenshot of the running result.)