I'm running into the following error when trying to load a table from google cloud storage:
BadRequest: 400 Load configuration must specify at least one source URI (POST https://www.googleapis.com/bigquery/v2/projects/fansidata/jobs)
Meanwhile, my URI is valid (i.e. I can see it in the GCS web app):
uris = ['gs://my-bucket-name/datastore_backup_analytics_2016_12_21_2_User/1569751766512529035929A5AA9742/output-0']
job_name = 'Load_User'
destinationTable = dataset.table('Transfer')
job = bigquery_client.load_table_from_storage(job_name, destinationTable, uris)
job.begin()
I could be wrong, but it looks like load_table_from_storage in the Python API expects a single string for the third argument as opposed to a list. If you want to match multiple files, you can use a * at the end. For example,
uri = 'gs://my-bucket-name/datastore_backup_analytics_2016_12_21_2_User/1569751766512529035929A5AA9742/output-*'
job_name = 'Load_User'
destinationTable = dataset.table('Transfer')
job = bigquery_client.load_table_from_storage(job_name, destinationTable, uri)
job.begin()
Client.load_table_from_storage takes one or more source URIs (that is what *source_uris means in Python). E.g.:
job = client.load_table_from_storage(
    'load-job-123', my_table_object,
    'gs://my-bucket-name/table_one',
    'gs://my-bucket-name/table_two')
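Note that in both snippets begin() only starts the load job asynchronously. If you want to block until it finishes, a rough sketch based on that generation of the client (method names may differ in newer versions):
import time

job.begin()
while job.state != 'DONE':
    time.sleep(1)
    job.reload()  # refresh the job status from the BigQuery API
if job.errors:
    print('Load failed: {}'.format(job.errors))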
I'm using the Python ibm-cloud-sdk in an attempt to iterate all resources in a particular IBM Cloud account. My trouble has been that pagination doesn't appear to "work for me". When I pass in the "next_url" I still get the same list coming back from the call.
Here is my test code. I successfully print many of my COS instances, but I only seem to be able to print the first page... Maybe I've been looking at this too long and just missed something obvious. Does anyone have any clue why I can't retrieve the next page?
try:
    ####### authenticate and set the service url
    auth = IAMAuthenticator(RESOURCE_CONTROLLER_APIKEY)
    service = ResourceControllerV2(authenticator=auth)
    service.set_service_url(RESOURCE_CONTROLLER_URL)

    ####### Retrieve the resource instance listing
    r = service.list_resource_instances().get_result()

    ####### get the row count and resources list
    rows_count = r['rows_count']
    resources = r['resources']

    while rows_count > 0:
        print('Number of rows_count {}'.format(rows_count))
        next_url = r['next_url']
        for i, resource in enumerate(resources):
            type = resource['id'].split(':')[4]
            if type == 'cloud-object-storage':
                instance_name = resource['name']
                instance_id = resource['guid']
                crn = resource['crn']
                print('Found instance id : name - {} : {}'.format(instance_id, instance_name))

        ############### this is SUPPOSED to get the next page
        r = service.list_resource_instances(start=next_url).get_result()
        rows_count = r['rows_count']
        resources = r['resources']
except Exception as e:
    Error = 'Error : {}'.format(e)
    print(Error)
    exit(1)
From looking at the API documentation for listing resource instances, the value of next_url includes the URL path and the start query parameter with its token.
To retrieve the next page, you would only need to pass in the parameter start with the token as value. IMHO this is not ideal.
I typically do not use the SDK, but a simple Python request. Then, I can use the endpoint (base) URL + next_url as the full URI.
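For example, a hedged sketch of that approach with plain requests (not tested; the endpoint URL and the iam_token variable are assumptions you would need to fill in for your account):
import requests

base = 'https://resource-controller.cloud.ibm.com'
headers = {'Authorization': 'Bearer {}'.format(iam_token)}

url = base + '/v2/resource_instances'
while url:
    r = requests.get(url, headers=headers).json()
    for resource in r['resources']:
        print(resource['name'])
    next_url = r.get('next_url')  # e.g. '/v2/resource_instances?start=<token>'
    url = base + next_url if next_url else None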
If you stick with the SDK, use urllib.parse to extract the query parameter. Not tested, but something like:
from urllib.parse import urlparse, parse_qs

o = urlparse(next_url)
q = parse_qs(o.query)
r = service.list_resource_instances(start=q['start'][0]).get_result()
Could you use the Search API for listing the resources in your account rather than the resource controller? The search index is set up for exactly that operation, whereas paginating results from the resource controller seems much more brute force.
https://cloud.ibm.com/apidocs/search#search
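If you want to try that route, a hedged sketch with the GlobalSearchV2 client from the ibm-platform-services package (the query string and field names are assumptions on my part):
from ibm_cloud_sdk_core.authenticators import IAMAuthenticator
from ibm_platform_services import GlobalSearchV2

search = GlobalSearchV2(authenticator=IAMAuthenticator(RESOURCE_CONTROLLER_APIKEY))
result = search.search(
    query='service_name:cloud-object-storage',  # assumed query syntax
    fields=['name', 'crn'],
    limit=100).get_result()
for item in result['items']:
    print(item['name'])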
InvalidS3ObjectException when calling the AnalyzeDocument operation: Unable to get object metadata from S3. Check object key, region and/or access permissions.
I keep getting this error. Over. And. Over. This program worked with my test cases of what I'm bringing in, the JSON with a {"body": "imagename.jpg"}. But the very moment I try to utilize the actual code my JS brings in, I get this error. The thing that confuses me is that I've checked the regions and they are fine. I went into my account and created users with full access to all AWS and S3 features and used those logins, I've used my root account, everything. All I'm trying to do is access an image from my S3 bucket. Why won't it work? Below is my code. It works if I use the test case I provided above, but the moment I try to use the website it's connected to, it doesn't work.
import boto3

def main(event, context):
    key_map, value_map, block_map = get_kv_map(event)  # Take map variables in to get the key and value map we need.
It goes to this function...
def get_kv_map(event):
    filePath = event
    fileExt = filePath.get('body')

    s3 = boto3.resource('s3', region_name='us-east-1')
    bucket = s3.Bucket('react-images-ex')
    obj = bucket.Object(bucket)

    client = boto3.client('textract')  # We utilize boto3's textract lib
    response = client.analyze_document(
        Document={'S3Object': {'Bucket': 'react-images-ex', 'Name': fileExt}},
        FeatureTypes=['FORMS'])

    # Get the text blocks
    blocks = response['Blocks']  # We make a blocks variable that will be the blocks we find in the document

    # get key and value maps
    key_map = {}
    value_map = {}
    block_map = {}
    for block in blocks:  # Traverse the blocks found in the document
        block_id = block['Id']  # Set variable for blockId to the Id's found on that block location
        block_map[block_id] = block  # Make the block map at that ID be the block variable
        if block['BlockType'] == "KEY_VALUE_SET":  # if we see that the type of block we're on is a key and value set pair, we check if it's a key or not. If it's not a key, we know it's a value. We send it to the respective map.
            if 'KEY' in block['EntityTypes']:
                key_map[block_id] = block
            else:
                value_map[block_id] = block
    return key_map, value_map, block_map  # Return the maps we need after they're filled.
I have been told before this code is fine, and it should work. So why exactly is it that I get this error?
Based on the comments.
The issue with body was that it was a JSON string, not an actual JSON object.
The solution was to parse the string into json:
import json

fileExt = json.loads(filePath.get('body'))
Try awscli to see if you can access the image in s3:
aws s3 ls s3://react-images-ex/<some-fileExt>
Either you are parsing the fileExt wrongly, or you don't have S3 permission to access the file. The awscli command will help to verify this.
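If you prefer to check from Python instead of the CLI, a small sketch with boto3 head_object (it assumes fileExt already holds the object key):
import boto3
from botocore.exceptions import ClientError

s3 = boto3.client('s3', region_name='us-east-1')
try:
    s3.head_object(Bucket='react-images-ex', Key=fileExt)  # fails if the key, region or permissions are wrong
    print('Object is reachable: {}'.format(fileExt))
except ClientError as e:
    print('Cannot access object: {}'.format(e))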
How can the number of blobs returned from the ContainerClient.list_blobs() method be limited?
The Azure Blob service REST API docs mention a maxresults parameter, but it seems it is not honored by list_blobs(maxresults=123).
A combination of itertools.islice and the results_per_page parameter (which translates to the REST maxresults parameter) will do the trick:
import itertools

from azure.storage.blob import BlobServiceClient

service: BlobServiceClient = BlobServiceClient.from_connection_string(cstr)
cc = service.get_container_client("foo")
n = 42
for b in itertools.islice(cc.list_blobs(results_per_page=n), n):
print(b.name)
Please use by_page() on the ItemPaged class:
pages = container_client.list_blobs(results_per_page=123).by_page()
first_page = next(pages)
items_in_page = list(first_page)   # this will give you 123 results on the first page
second_page = next(pages)          # this will raise StopIteration if there's no second page
items_in_page = list(second_page)  # this will give you 123 results on the second page
There's no way to do this currently with the SDK. The maxresults parameter really means "max results per page"; if you have more blobs than this, list_blobs will make multiple calls to the REST API until the listing is complete.
You could call the API directly and ignore pages after the first, but that would require you to handle the details of authentication, parsing the response, etc.
I am using Python with python-kubernetes and a minikube running locally, i.e. there are no cloud issues.
I am trying to create a job and provide it with data to run on. I would like to provide it with a mount of a directory with my local machine data.
I am using this example and trying to add a mount volume
This is my code after adding the keyword volume_mounts (I tried multiple places, multiple keywords and nothing works)
from os import path

import yaml

from kubernetes import client, config

JOB_NAME = "pi"


def create_job_object():
    # Configure Pod template container
    container = client.V1Container(
        name="pi",
        image="perl",
        volume_mounts=["/home/user/data"],
        command=["perl", "-Mbignum=bpi", "-wle", "print bpi(2000)"])
    # Create and configure a spec section
    template = client.V1PodTemplateSpec(
        metadata=client.V1ObjectMeta(labels={"app": "pi"}),
        spec=client.V1PodSpec(restart_policy="Never",
                              containers=[container]))
    # Create the specification of deployment
    spec = client.V1JobSpec(
        template=template,
        backoff_limit=0)
    # Instantiate the job object
    job = client.V1Job(
        api_version="batch/v1",
        kind="Job",
        metadata=client.V1ObjectMeta(name=JOB_NAME),
        spec=spec)
    return job


def create_job(api_instance, job):
    # Create job
    api_response = api_instance.create_namespaced_job(
        body=job,
        namespace="default")
    print("Job created. status='%s'" % str(api_response.status))


def update_job(api_instance, job):
    # Update container image
    job.spec.template.spec.containers[0].image = "perl"
    # Update the job
    api_response = api_instance.patch_namespaced_job(
        name=JOB_NAME,
        namespace="default",
        body=job)
    print("Job updated. status='%s'" % str(api_response.status))


def delete_job(api_instance):
    # Delete job
    api_response = api_instance.delete_namespaced_job(
        name=JOB_NAME,
        namespace="default",
        body=client.V1DeleteOptions(
            propagation_policy='Foreground',
            grace_period_seconds=5))
    print("Job deleted. status='%s'" % str(api_response.status))


def main():
    # Configs can be set in Configuration class directly or using helper
    # utility. If no argument provided, the config will be loaded from
    # default location.
    config.load_kube_config()
    batch_v1 = client.BatchV1Api()
    # Create a job object with client-python API. The job we
    # created is same as the `pi-job.yaml` in the /examples folder.
    job = create_job_object()
    create_job(batch_v1, job)
    update_job(batch_v1, job)
    delete_job(batch_v1)


if __name__ == '__main__':
    main()
I get this error
HTTP response body:
{"kind":"Status","apiVersion":"v1","metadata":{},"status":"Failure","message":"Job in version \"v1\" cannot be handled as a Job: v1.Job.Spec: v1.JobSpec.Template: v1.PodTemplateSpec.Spec: v1.PodSpec.Containers: []v1.Container: v1.Container.VolumeMounts: []v1.VolumeMount: readObjectStart: expect { or n, but found \", error found in #10 byte of ...|ounts\": [\"/home/user|..., bigger context ...| \"image\": \"perl\", \"name\": \"pi\", \"volumeMounts\": [\"/home/user/data\"]}], \"restartPolicy\": \"Never\"}}}}|...","reason":"BadRequest","code":400}
What am I missing here?
Is there another way to expose data to the job?
edit: trying to use client.V1VolumeMount
I am trying to add this code, and to pass the mount object to different init functions, e.g.:
mount = client.V1VolumeMount(mount_path="/data", name="shai")
client.V1Container
client.V1PodTemplateSpec
client.V1JobSpec
client.V1Job
Under multiple keywords, it all results in errors. Is this the correct object to use? How shall I use it, if at all?
edit: trying to pass volume_mounts as a list with the following code suggested in the answers:
def create_job_object():
    # Configure Pod template container
    container = client.V1Container(
        name="pi",
        image="perl",
        volume_mounts=["/home/user/data"],
        command=["perl", "-Mbignum=bpi", "-wle", "print bpi(2000)"])
    # Create and configure a spec section
    template = client.V1PodTemplateSpec(
        metadata=client.V1ObjectMeta(labels={"app": "pi"}),
        spec=client.V1PodSpec(restart_policy="Never",
                              containers=[container]))
    # Create the specification of deployment
    spec = client.V1JobSpec(
        template=template,
        backoff_limit=0)
    # Instantiate the job object
    job = client.V1Job(
        api_version="batch/v1",
        kind="Job",
        metadata=client.V1ObjectMeta(name=JOB_NAME),
        spec=spec)
    return job
And still getting a similar error
kubernetes.client.rest.ApiException: (422) Reason: Unprocessable
Entity HTTP response headers: HTTPHeaderDict({'Content-Type':
'application/json', 'Date': 'Tue, 06 Aug 2019 06:19:13 GMT',
'Content-Length': '401'}) HTTP response body:
{"kind":"Status","apiVersion":"v1","metadata":{},"status":"Failure","message":"Job.batch
\"pi\" is invalid:
spec.template.spec.containers[0].volumeMounts[0].name: Not found:
\"d\"","reason":"Invalid","details":{"name":"pi","group":"batch","kind":"Job","causes":[{"reason":"FieldValueNotFound","message":"Not
found:
\"d\"","field":"spec.template.spec.containers[0].volumeMounts[0].name"}]},"code":422}
The V1Container call expects a list of V1VolumeMount objects for the volume_mounts parameter, but you passed in a list of strings:
Code:
def create_job_object():
    volume_mount = client.V1VolumeMount(
        mount_path="/home/user/data"
        # other optional arguments, see the volume mount doc link below
    )
    # Configure Pod template container
    container = client.V1Container(
        name="pi",
        image="perl",
        volume_mounts=[volume_mount],
        command=["perl", "-Mbignum=bpi", "-wle", "print bpi(2000)"])
    # Create and configure a spec section
    template = client.V1PodTemplateSpec(
        metadata=client.V1ObjectMeta(labels={"app": "pi"}),
        spec=client.V1PodSpec(restart_policy="Never",
                              containers=[container]))
....
references:
https://github.com/kubernetes-client/python/blob/master/kubernetes/docs/V1Container.md
https://github.com/kubernetes-client/python/blob/master/kubernetes/docs/V1VolumeMount.md
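One more thing to watch out for: a volume mount has to reference a volume declared in the pod spec under the same name. Since the goal is to expose a local directory on minikube, a hostPath volume is one option. A hedged sketch (not tested; the volume name "data" and the paths are illustrative assumptions):
# hostPath volume on the minikube node, plus a mount that references it by name
volume = client.V1Volume(
    name="data",
    host_path=client.V1HostPathVolumeSource(path="/home/user/data"))
volume_mount = client.V1VolumeMount(name="data", mount_path="/data")

container = client.V1Container(
    name="pi",
    image="perl",
    volume_mounts=[volume_mount],
    command=["perl", "-Mbignum=bpi", "-wle", "print bpi(2000)"])
template = client.V1PodTemplateSpec(
    metadata=client.V1ObjectMeta(labels={"app": "pi"}),
    spec=client.V1PodSpec(restart_policy="Never",
                          containers=[container],
                          volumes=[volume]))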
I have a JSON document in my database that I want to modify frequently from my python program, once every 25 seconds. I know how to upload a document to the database and read a document from it, but I do not know how to modify/replace a document.
This link shows the functions offered in the Python module. I see the ReplaceDocument function, but it takes in a document link. But how can I get the document link? Where am I supposed to look for this information?
Thanks.
It sounds like you have already resolved it. Just as a summary, here is the code.
# Query a document
query = { 'query': 'SELECT * FROM <collection name> ....'}
docs = client.QueryDocuments(coll_link, query)
doc = list(docs)[0]
# Get the document link from attribute `_self`
doc_link = doc['_self']
# Modify the document
.....
# Replace the document via document link
client.ReplaceDocument(doc_link, doc)
April 2020
If you are reading MS Azure's Quickstart guide and following a supporting git repo, note that there might be some differences.
For example,
from azure.cosmos import exceptions, CosmosClient, PartitionKey
endpoint = 'endpoint'
key = 'key'
db_name = 'cosmos-db-name'
container_name = 'container-name'
client = CosmosClient(endpoint, key)
db = client.create_database_if_not_exists(id=db_name)
container = db.create_container_if_not_exists(id=container_name, partition_key=PartitionKey(path="/.."), offer_throughput=456)
...
# Replace item
container.replace_item(doc_link, doc)
When it comes to doc_link and doc in the case above, I encountered an error when I used doc['_self']. By using the primary key (the id) of the doc instead, the document is updated.
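For example, a small sketch of the id-based flow with the v4 SDK (the item id, partition key value, and field name are assumptions):
# Read the document by id and partition key, modify it, then replace it by id
doc = container.read_item(item='my-item-id', partition_key='my-partition-value')
doc['some_field'] = 'new value'
container.replace_item(item=doc['id'], body=doc)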