App Engine to GCS | Uploading Issue - python

I have some files that I'm receiving from the Evernote API (via getResource) and writing to Google Cloud Storage with the following code:
gcs_file = gcs.open(filename, 'w', content_type=res.mime,
retry_params=write_retry_params)
# Retrieve the binary data and write to GCS
resource_file = note_store.getResource(res.guid, True, False, False, False)
gcs_file.write(resource_file.data.body)
gcs_file.close()
For even some types of documents, it still works. But there are certain documents that GCS throws this in the logs:
Unable to fetch URL: https://storage.googleapis.com/evernoteresources/5db799f1-c03c-4056-812a-6d77bad55261/Sleep Away.mp3
and
Got exception while contacting GCS. Will retry in 0.11 seconds.
There doesn't seem to be any pattern to these errors. It happens with documents, sounds, pictures, whatever - some of these document types work and some don't. It isn't due to size (since some small work and some large do).
Any ideas?
Here's the full stack trace, though I'm not sure it will help.
Encountered unexpected error from ProtoRPC method implementation: TimeoutError (('Request to Google Cloud Storage timed out.', DownloadError('Unable to fetch URL: https://storage.googleapis.com/evernoteresources/78413585-2266-4426-b08c-71d6c224f266/Evernote Snapshot 20130512 124546.jpg',)))
Traceback (most recent call last):
File "/python27_runtime/python27_lib/versions/1/protorpc/wsgi/service.py", line 181, in protorpc_service_app
response = method(instance, request)
File "/python27_runtime/python27_lib/versions/1/google/appengine/ext/endpoints/api_config.py", line 972, in invoke_remote
return remote_method(service_instance, request)
File "/python27_runtime/python27_lib/versions/1/protorpc/remote.py", line 412, in invoke_remote_method
response = method(service_instance, request)
File "/base/data/home/apps/s~quinector/2a.368528733040360018/endpoints.py", line 61, in get_note_details
url = tools.registerResource(note_store, req.note_guid, r)
File "/base/data/home/apps/s~quinector/2a.368528733040360018/GlobalUtilities.py", line 109, in registerResource
retry_params=write_retry_params)
File "/base/data/home/apps/s~quinector/2a.368528733040360018/cloudstorage/cloudstorage_api.py", line 69, in open
return storage_api.StreamingBuffer(api, filename, content_type, options)
File "/base/data/home/apps/s~quinector/2a.368528733040360018/cloudstorage/storage_api.py", line 526, in __init__
status, headers, _ = self._api.post_object(path, headers=headers)
File "/base/data/home/apps/s~quinector/2a.368528733040360018/cloudstorage/rest_api.py", line 41, in sync_wrapper
return future.get_result()
File "/python27_runtime/python27_lib/versions/1/google/appengine/ext/ndb/tasklets.py", line 325, in get_result
self.check_success()
File "/python27_runtime/python27_lib/versions/1/google/appengine/ext/ndb/tasklets.py", line 368, in _help_tasklet_along
value = gen.throw(exc.__class__, exc, tb)
File "/base/data/home/apps/s~quinector/2a.368528733040360018/cloudstorage/storage_api.py", line 84, in do_request_async
'Request to Google Cloud Storage timed out.', e)
TimeoutError: ('Request to Google Cloud Storage timed out.', DownloadError('Unable to fetch URL: https://storage.googleapis.com/evernoteresources/78413585-2266-4426-b08c-71d6c224f266/Evernote Snapshot 20130512 124546.jpg',))

This is a bug in the gcs client code. It should properly handle the filename. The fact it is using http request to GCS should be "hidden". This will be fixed soon. Thanks!
Note if you quote the filename yourself to work around this bug, the filename will be double quoted after the fix. Sorry.

Thank you Brian! The problem was the spaces in the filenames. I just used urllib2.quote() to get those out of there and it works like a charm.

Related

Need to get proper URL for payloads for Airflow connected with Azure

I have four files main.py, jobs.zip, libs.zip & params.yaml and these I have stored on Azure Storage Account Container.
Now I have this code which is making a payload and will try to run a spark job using that payload. And that payload will be having the location link of these 4 files.
hook = AzureSynapseHook(
azure_synapse_conn_id=self.azure_synapse_conn_id, spark_pool=self.spark_pool
)
payload = SparkBatchJobOptions(
name=f"{self.job_name}_{self.app_id}",
file=f"abfss://{Variable.get('ARTIFACT_BUCKET')}#{Variable.get('ARTIFACT_ACCOUNT')}.dfs.core.windows.net/{self.env}/{SPARK_DIR}/main.py",
arguments=self.job_args,
python_files=[
f"abfss://{Variable.get('ARTIFACT_BUCKET')}#{Variable.get('ARTIFACT_ACCOUNT')}.dfs.core.windows.net/{self.env}/{SPARK_DIR}/jobs.zip",
f"abfss://{Variable.get('ARTIFACT_BUCKET')}#{Variable.get('ARTIFACT_ACCOUNT')}.dfs.core.windows.net/{self.env}/{SPARK_DIR}/libs.zip",
],
files=[
f"abfss://{Variable.get('ARTIFACT_BUCKET')}#{Variable.get('ARTIFACT_ACCOUNT')}.dfs.core.windows.net/{self.env}/{SPARK_DIR}/params.yaml"
],
)
self.log.info("Executing the Synapse spark job.")
response = hook.run_spark_job(payload=payload)
I have checked the location link that is correct but when I run this on airflow it throws an error related to the payload which I think it is trying to say that it is not able to grab the links.
Traceback (most recent call last):
File "/usr/local/lib/python3.9/site-packages/azure/core/pipeline/transport/_base.py", line 579, in format_url
base = self._base_url.format(**kwargs).rstrip("/")
KeyError: 'endpoint'
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/usr/local/airflow/dags/operators/spark/__init__.py", line 36, in execute
return self.executor.execute()
File "/usr/local/airflow/dags/operators/spark/azure.py", line 60, in execute
response = hook.run_spark_job(payload=payload)
File "/usr/local/lib/python3.9/site-packages/airflow/providers/microsoft/azure/hooks/synapse.py", line 144, in run_spark_job
job = self.get_conn().spark_batch.create_spark_batch_job(payload)
File "/usr/local/lib/python3.9/site-packages/azure/synapse/spark/operations/_spark_batch_operations.py", line 163, in create_spark_batch_job
request = self._client.post(url, query_parameters, header_parameters, **body_content_kwargs)
File "/usr/local/lib/python3.9/site-packages/azure/core/pipeline/transport/_base.py", line 659, in post
request = self._request(
File "/usr/local/lib/python3.9/site-packages/azure/core/pipeline/transport/_base.py", line 535, in _request
request = HttpRequest(method, self.format_url(url))
File "/usr/local/lib/python3.9/site-packages/azure/core/pipeline/transport/_base.py", line 582, in format_url
raise ValueError(err_msg.format(key.args[0]))
ValueError: The value provided for the url part endpoint was incorrect, and resulted in an invalid url
I also want to know the difference of abfss and wasbs and where should i upload my files so that the code will be able to grab the links ?
Maybe I am uploading the files at wrong place.
You have something wrong in the connection self.azure_synapse_conn_id, where the host (Synapse Workspace URL) is not valid, here is an example of the connection:
Connection(
conn_id=DEFAULT_CONNECTION_CLIENT_SECRET,
conn_type="azure_synapse",
host="https://testsynapse.dev.azuresynapse.net",
login="clientId",
password="clientSecret",
extra=json.dumps(
{
"extra__azure_synapse__tenantId": "tenantId",
"extra__azure_synapse__subscriptionId": "subscriptionId",
}
),
)
For the difference between abfss and wasbs, here is a detailed answer about the topic.

Azure ML: Upload File to Step Run's Output - Authentication Error

During a PythonScriptStep in an Azure ML Pipeline, I'm saving a model as joblib pickle dump to a directory in a Blob Container in the Azure Blob Storage which I've created during the setup of the Azure ML Workspace. Afterwards I'm trying to upload this model file to the step run's output directory using
Run.upload_file (name, path_or_stream)
(for the function's documentation, see https://learn.microsoft.com/en-us/python/api/azureml-core/azureml.core.run(class)?view=azure-ml-py#upload-file-name--path-or-stream--datastore-name-none-)
Some time ago when I created the script using the azureml-sdk version 1.18.0, everything worked fine. Now, I've updated the script's functionalities and upgraded the azureml-sdk to version 1.33.0 during the process and the upload function now runs into the following error:
Traceback (most recent call last):
File "/opt/miniconda/lib/python3.7/site-packages/azureml/_file_utils/upload.py", line 64, in upload_blob_from_stream
validate_content=True)
File "/opt/miniconda/lib/python3.7/site-packages/azureml/_restclient/clientbase.py", line 93, in execute_func_with_reset
return ClientBase._execute_func_internal(backoff, retries, module_logger, func, reset_func, *args, **kwargs)
File "/opt/miniconda/lib/python3.7/site-packages/azureml/_restclient/clientbase.py", line 367, in _execute_func_internal
left_retry = cls._handle_retry(back_off, left_retry, total_retry, error, logger, func)
File "/opt/miniconda/lib/python3.7/site-packages/azureml/_restclient/clientbase.py", line 399, in _handle_retry
raise error
File "/opt/miniconda/lib/python3.7/site-packages/azureml/_restclient/clientbase.py", line 358, in _execute_func_internal
response = func(*args, **kwargs)
File "/opt/miniconda/lib/python3.7/site-packages/azureml/_vendor/azure_storage/blob/blockblobservice.py", line 614, in create_blob_from_stream
initialization_vector=iv
File "/opt/miniconda/lib/python3.7/site-packages/azureml/_vendor/azure_storage/blob/_upload_chunking.py", line 98, in _upload_blob_chunks
range_ids = [f.result() for f in futures]
File "/opt/miniconda/lib/python3.7/site-packages/azureml/_vendor/azure_storage/blob/_upload_chunking.py", line 98, in <listcomp>
range_ids = [f.result() for f in futures]
File "/opt/miniconda/lib/python3.7/concurrent/futures/_base.py", line 435, in result
return self.__get_result()
File "/opt/miniconda/lib/python3.7/concurrent/futures/_base.py", line 384, in __get_result
raise self._exception
File "/opt/miniconda/lib/python3.7/concurrent/futures/thread.py", line 57, in run
result = self.fn(*self.args, **self.kwargs)
File "/opt/miniconda/lib/python3.7/site-packages/azureml/_vendor/azure_storage/blob/_upload_chunking.py", line 210, in process_chunk
return self._upload_chunk_with_progress(chunk_offset, chunk_bytes)
File "/opt/miniconda/lib/python3.7/site-packages/azureml/_vendor/azure_storage/blob/_upload_chunking.py", line 224, in _upload_chunk_with_progress
range_id = self._upload_chunk(chunk_offset, chunk_data)
File "/opt/miniconda/lib/python3.7/site-packages/azureml/_vendor/azure_storage/blob/_upload_chunking.py", line 269, in _upload_chunk
timeout=self.timeout,
File "/opt/miniconda/lib/python3.7/site-packages/azureml/_vendor/azure_storage/blob/blockblobservice.py", line 1013, in _put_block
self._perform_request(request)
File "/opt/miniconda/lib/python3.7/site-packages/azureml/_vendor/azure_storage/common/storageclient.py", line 432, in _perform_request
raise ex
File "/opt/miniconda/lib/python3.7/site-packages/azureml/_vendor/azure_storage/common/storageclient.py", line 357, in _perform_request
raise ex
File "/opt/miniconda/lib/python3.7/site-packages/azureml/_vendor/azure_storage/common/storageclient.py", line 343, in _perform_request
HTTPError(response.status, response.message, response.headers, response.body))
File "/opt/miniconda/lib/python3.7/site-packages/azureml/_vendor/azure_storage/common/_error.py", line 115, in _http_error_handler
raise ex
azure.common.AzureHttpError: Server failed to authenticate the request. Make sure the value of Authorization header is formed correctly including the signature. ErrorCode: AuthenticationFailed
<?xml version="1.0" encoding="utf-8"?><Error><Code>AuthenticationFailed</Code><Message>Server failed to authenticate the request. Make sure the value of Authorization header is formed correctly including the signature.
RequestId:5d4e1b7e-c01e-0070-0d47-9bf8a0000000
Time:2021-08-27T13:30:02.2685991Z</Message><AuthenticationErrorDetail>Signature did not match. String to sign used was rcw
2021-08-27T13:19:56Z
2021-08-28T13:29:56Z
/blob/mystorage/azureml/ExperimentRun/dcid.98d11a7b-2aac-4bc0-bd64-bb4d72e0e0be/outputs/models/Model.pkl
2019-07-07
b
</AuthenticationErrorDetail></Error>
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/mnt/batch/tasks/shared/LS_root/jobs/.../azureml-setup/context_manager_injector.py", line 243, in execute_with_context
runpy.run_path(sys.argv[0], globals(), run_name="__main__")
File "/opt/miniconda/lib/python3.7/runpy.py", line 263, in run_path
pkg_name=pkg_name, script_name=fname)
File "/opt/miniconda/lib/python3.7/runpy.py", line 96, in _run_module_code
mod_name, mod_spec, pkg_name, script_name)
File "/opt/miniconda/lib/python3.7/runpy.py", line 85, in _run_code
exec(code, run_globals)
File "401_AML_Pipeline_Time_Series_Model_Training_Azure_ML_CPU.py", line 318, in <module>
main()
File "401_AML_Pipeline_Time_Series_Model_Training_Azure_ML_CPU.py", line 286, in main
path_or_stream=model_path)
File "/opt/miniconda/lib/python3.7/site-packages/azureml/core/run.py", line 53, in wrapped
return func(self, *args, **kwargs)
File "/opt/miniconda/lib/python3.7/site-packages/azureml/core/run.py", line 1989, in upload_file
datastore_name=datastore_name)
File "/opt/miniconda/lib/python3.7/site-packages/azureml/_restclient/artifacts_client.py", line 114, in upload_artifact
return self.upload_artifact_from_path(artifact, *args, **kwargs)
File "/opt/miniconda/lib/python3.7/site-packages/azureml/_restclient/artifacts_client.py", line 107, in upload_artifact_from_path
return self.upload_artifact_from_stream(stream, *args, **kwargs)
File "/opt/miniconda/lib/python3.7/site-packages/azureml/_restclient/artifacts_client.py", line 99, in upload_artifact_from_stream
content_type=content_type, session=session)
File "/opt/miniconda/lib/python3.7/site-packages/azureml/_restclient/artifacts_client.py", line 88, in upload_stream_to_existing_artifact
timeout=TIMEOUT, backoff=BACKOFF_START, retries=RETRY_LIMIT)
File "/opt/miniconda/lib/python3.7/site-packages/azureml/_file_utils/upload.py", line 71, in upload_blob_from_stream
raise AzureMLException._with_error(azureml_error, inner_exception=e)
azureml._common.exceptions.AzureMLException: AzureMLException:
Message: Encountered authorization error while uploading to blob storage. Please check the storage account attached to your workspace. Make sure that the current user is authorized to access the storage account and that the request is not blocked by a firewall, virtual network, or other security setting.
StorageAccount: mystorage
ContainerName: azureml
StatusCode: 403
InnerException Server failed to authenticate the request. Make sure the value of Authorization header is formed correctly including the signature. ErrorCode: AuthenticationFailed
<?xml version="1.0" encoding="utf-8"?><Error><Code>AuthenticationFailed</Code><Message>Server failed to authenticate the request. Make sure the value of Authorization header is formed correctly including the signature.
RequestId:5d4e1b7e-c01e-0070-0d47-9bf8a0000000
Time:2021-08-27T13:30:02.2685991Z</Message><AuthenticationErrorDetail>Signature did not match. String to sign used was rcw
2021-08-27T13:19:56Z
2021-08-28T13:29:56Z
/blob/mystorage/azureml/ExperimentRun/dcid.98d11a7b-2aac-4bc0-bd64-bb4d72e0e0be/outputs/models/Model.pkl
2019-07-07
b
</AuthenticationErrorDetail></Error>
ErrorResponse
{
"error": {
"code": "UserError",
"message": "Encountered authorization error while uploading to blob storage. Please check the storage account attached to your workspace. Make sure that the current user is authorized to access the storage account and that the request is not blocked by a firewall, virtual network, or other security setting.\n\tStorageAccount: mystorage\n\tContainerName: azureml\n\tStatusCode: 403",
"inner_error": {
"code": "Auth",
"inner_error": {
"code": "Authorization"
}
}
}
}
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "401_AML_Pipeline_Time_Series_Model_Training_Azure_ML_CPU.py", line 318, in <module>
main()
File "401_AML_Pipeline_Time_Series_Model_Training_Azure_ML_CPU.py", line 286, in main
path_or_stream=model_path)
File "/opt/miniconda/lib/python3.7/site-packages/azureml/core/run.py", line 53, in wrapped
return func(self, *args, **kwargs)
File "/opt/miniconda/lib/python3.7/site-packages/azureml/core/run.py", line 1989, in upload_file
datastore_name=datastore_name)
File "/opt/miniconda/lib/python3.7/site-packages/azureml/_restclient/artifacts_client.py", line 114, in upload_artifact
return self.upload_artifact_from_path(artifact, *args, **kwargs)
File "/opt/miniconda/lib/python3.7/site-packages/azureml/_restclient/artifacts_client.py", line 107, in upload_artifact_from_path
return self.upload_artifact_from_stream(stream, *args, **kwargs)
File "/opt/miniconda/lib/python3.7/site-packages/azureml/_restclient/artifacts_client.py", line 99, in upload_artifact_from_stream
content_type=content_type, session=session)
File "/opt/miniconda/lib/python3.7/site-packages/azureml/_restclient/artifacts_client.py", line 88, in upload_stream_to_existing_artifact
timeout=TIMEOUT, backoff=BACKOFF_START, retries=RETRY_LIMIT)
File "/opt/miniconda/lib/python3.7/site-packages/azureml/_file_utils/upload.py", line 71, in upload_blob_from_stream
raise AzureMLException._with_error(azureml_error, inner_exception=e)
UserScriptException: UserScriptException:
Message: Encountered authorization error while uploading to blob storage. Please check the storage account attached to your workspace. Make sure that the current user is authorized to access the storage account and that the request is not blocked by a firewall, virtual network, or other security setting.
StorageAccount: mystorage
ContainerName: azureml
StatusCode: 403
InnerException AzureMLException:
Message: Encountered authorization error while uploading to blob storage. Please check the storage account attached to your workspace. Make sure that the current user is authorized to access the storage account and that the request is not blocked by a firewall, virtual network, or other security setting.
StorageAccount: mystorage
ContainerName: azureml
StatusCode: 403
InnerException Server failed to authenticate the request. Make sure the value of Authorization header is formed correctly including the signature. ErrorCode: AuthenticationFailed
<?xml version="1.0" encoding="utf-8"?><Error><Code>AuthenticationFailed</Code><Message>Server failed to authenticate the request. Make sure the value of Authorization header is formed correctly including the signature.
RequestId:5d4e1b7e-c01e-0070-0d47-9bf8a0000000
Time:2021-08-27T13:30:02.2685991Z</Message><AuthenticationErrorDetail>Signature did not match. String to sign used was rcw
2021-08-27T13:19:56Z
2021-08-28T13:29:56Z
/blob/mystorage/azureml/ExperimentRun/dcid.98d11a7b-2aac-4bc0-bd64-bb4d72e0e0be/outputs/models/Model.pkl
2019-07-07
b
</AuthenticationErrorDetail></Error>
ErrorResponse
{
"error": {
"code": "UserError",
"message": "Encountered authorization error while uploading to blob storage. Please check the storage account attached to your workspace. Make sure that the current user is authorized to access the storage account and that the request is not blocked by a firewall, virtual network, or other security setting.\n\tStorageAccount: verovisionstorage\n\tContainerName: azureml\n\tStatusCode: 403",
"inner_error": {
"code": "Auth",
"inner_error": {
"code": "Authorization"
}
}
}
}
ErrorResponse
{
"error": {
"code": "UserError",
"message": "Encountered authorization error while uploading to blob storage. Please check the storage account attached to your workspace. Make sure that the current user is authorized to access the storage account and that the request is not blocked by a firewall, virtual network, or other security setting.\n\tStorageAccount: mystorage\n\tContainerName: azureml\n\tStatusCode: 403"
}
}
As far as I can tell from the code of the azureml.core.Run class and the subsequent function calls, the Run object tries to upload the file to the step run's output directory using SAS-Token-Authentication (which fails). This documentation article is linked in the code (but I don't know if this relates to the issue): https://learn.microsoft.com/en-us/rest/api/storageservices/create-service-sas#service-sas-example
Did anybody encounter this error as well and knows what causes it or how it can be resolved?
Best,
Jonas
We’ve seen the before, it’s annoying. I think the answer is to go to the data stores page of the AML Studio UI and manually enter the storage account key again.

using Google BigQuery through Python script

I want to make some very easy tasks on BigQuery via a python script. I found this package which does not work well. Indeed, when I try this code:
from bigquery import get_client
project_id = 'txxxxxxxxxxxxxxxxxx9'
# Service account email address as listed in the Google Developers Console.
service_account = '7xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx.apps.googleusercontent.com'
# PKCS12 or PEM key provided by Google.
key = '/home/fxxxxxxxxxxxx/Dropbox/access_keys/google_storage/xxxxxxxxxxxxxxxxxxxxx.pem'
client = get_client(project_id, service_account=service_account, private_key_file=key, readonly=True)
# Submit an async query.
results = client.get_table_schema('newdataset', 'newtable2')
print('results')
I get this error:
/home/xxxxxx/anaconda3/envs/snakes/bin/python2.7 /home/xxxxxx/Dropbox/Prog/bigQuery_daily_import/src/main.py
Traceback (most recent call last):
File "/home/xxxxxx/Dropbox/Prog/bigQuery_daily_import/src/main.py", line 9, in <module>
client = get_client(project_id, service_account=service_account, private_key_file=key, readonly=True)
File "/home/xxxxxx/anaconda3/envs/snakes/lib/python2.7/site-packages/bigquery/client.py", line 83, in get_client
readonly=readonly)
File "/home/xxxxxx/anaconda3/envs/snakes/lib/python2.7/site-packages/bigquery/client.py", line 101, in _get_bq_service
service = build('bigquery', 'v2', http=http)
File "/home/xxxxxx/anaconda3/envs/snakes/lib/python2.7/site-packages/oauth2client/util.py", line 142, in positional_wrapper
return wrapped(*args, **kwargs)
File "/home/xxxxxx/anaconda3/envs/snakes/lib/python2.7/site-packages/googleapiclient/discovery.py", line 196, in build
cache)
File "/home/xxxxxx/anaconda3/envs/snakes/lib/python2.7/site-packages/googleapiclient/discovery.py", line 242, in _retrieve_discovery_doc
resp, content = http.request(actual_url)
File "/home/xxxxxx/anaconda3/envs/snakes/lib/python2.7/site-packages/oauth2client/client.py", line 565, in new_request
self._refresh(request_orig)
File "/home/xxxxxx/anaconda3/envs/snakes/lib/python2.7/site-packages/oauth2client/client.py", line 835, in _refresh
self._do_refresh_request(http_request)
File "/home/xxxxxx/anaconda3/envs/snakes/lib/python2.7/site-packages/oauth2client/client.py", line 862, in _do_refresh_request
body = self._generate_refresh_request_body()
File "/home/xxxxxx/anaconda3/envs/snakes/lib/python2.7/site-packages/oauth2client/client.py", line 1541, in _generate_refresh_request_body
assertion = self._generate_assertion()
File "/home/xxxxxx/anaconda3/envs/snakes/lib/python2.7/site-packages/oauth2client/client.py", line 1670, in _generate_assertion
private_key, self.private_key_password), payload)
File "/home/xxxxxx/anaconda3/envs/snakes/lib/python2.7/site-packages/oauth2client/_pycrypto_crypt.py", line 121, in from_string
pkey = RSA.importKey(parsed_pem_key)
File "/home/xxxxxx/anaconda3/envs/snakes/lib/python2.7/site-packages/Crypto/PublicKey/RSA.py", line 665, in importKey
return self._importKeyDER(der)
File "/home/xxxxxx/anaconda3/envs/snakes/lib/python2.7/site-packages/Crypto/PublicKey/RSA.py", line 588, in _importKeyDER
raise ValueError("RSA key format is not supported")
ValueError: RSA key format is not supported
Process finished with exit code 1
My question: is there a tutorial in python which shows how to communicate easily with BigQuery: importing a dataset from google storage or S3, querying something, exporting the result to google storage.
A lot depends on your environment, and once you've figure that out everything should be super simple. I see the only problem on the error log you pasted is figuring out authentication.
Python pandas has had support for BigQuery for a while:
http://pandas.pydata.org/pandas-docs/stable/generated/pandas.io.gbq.read_gbq.html
And I did a video with the creators of the module:
https://www.youtube.com/watch?v=gLeTDUMb7HY
Now, the simplest and fastest way these days to launch an Jupyter notebook with all of the Google Cloud goodies you mention is our new Google Datalab project:
https://cloud.google.com/datalab/
The only Datalab caveat is that it works on cloud servers, but if you want a fully managed Jupyter/IPython environment, totally secure, persistent, and ready to handle BigQuery, storage, etc... try it out.
Meanwhile, if you are writing a web application look at how other web applications solve this task.
For example, re:dash code to connect to BigQuery:
https://github.com/EverythingMe/redash/blob/master/redash/query_runner/big_query.py

Fetching a lot of urls in python with google app engine

In my subclass of RequestHandler, I am trying to fetch range of urls:
class GetStats(webapp2.RequestHandler):
def post(self):
lastpage = 50
for page in range(1, lastpage):
tmpurl = url + str(page)
response = urllib2.urlopen(tmpurl, timeout=5)
html = response.read()
# some parsing html
heap.append(result_of_parsing)
self.response.write(heap)
But it works with ~ 30 urls (page is loading long but it is works).
In case more than 30 I am getting an error:
Error: Server Error
The server encountered an error and could not complete your request.
Please try again in 30 seconds.
Is there any way to fetch a lot of urls? May be more optimal or smth?
Up to several hundreds of pages?
Update:
I am using BeautifulSoup to parse every single page. I found this traceback in gae logs:
Traceback (most recent call last):
File "/base/data/home/runtimes/python27/python27_lib/versions/1/google/appengine/runtime/wsgi.py", line 267, in Handle
result = handler(dict(self._environ), self._StartResponse)
File "/base/data/home/runtimes/python27/python27_lib/versions/third_party/webapp2-2.5.2/webapp2.py", line 1529, in __call__
rv = self.router.dispatch(request, response)
File "/base/data/home/runtimes/python27/python27_lib/versions/third_party/webapp2-2.5.2/webapp2.py", line 1278, in default_dispatcher
return route.handler_adapter(request, response)
File "/base/data/home/runtimes/python27/python27_lib/versions/third_party/webapp2-2.5.2/webapp2.py", line 1102, in __call__
return handler.dispatch()
File "/base/data/home/runtimes/python27/python27_lib/versions/third_party/webapp2-2.5.2/webapp2.py", line 570, in dispatch
return method(*args, **kwargs)
File "/base/data/home/apps/s~gae/1.379703839015039430/main.py", line 68, in post
heap = get_times(tmp_url, 160)
File "/base/data/home/apps/s~gae/1.379703839015039430/main.py", line 106, in get_times
soup = BeautifulSoup(html)
File "libs/bs4/__init__.py", line 168, in __init__
self._feed()
File "libs/bs4/__init__.py", line 181, in _feed
self.builder.feed(self.markup)
File "libs/bs4/builder/_htmlparser.py", line 56, in feed
super(HTMLParserTreeBuilder, self).feed(markup)
File "/base/data/home/runtimes/python27/python27_dist/lib/python2.7/HTMLParser.py", line 114, in feed
self.goahead(0)
File "/base/data/home/runtimes/python27/python27_dist/lib/python2.7/HTMLParser.py", line 155, in goahead
startswith = rawdata.startswith
DeadlineExceededError
It's failing because you only have 60 seconds to return a response to the user and I'm going to guess it's taking longer then that.
You will want to use this: https://cloud.google.com/appengine/articles/deferred
to create a task that has a 10 minute time out. Then you can return instantly to the user and they can "pick up" the results at a later time via another handler (that you create). If collecting all the URLs takes longer then 10 minutes you'll have to split them up into further tasks.
See this: https://cloud.google.com/appengine/articles/deadlineexceedederrors
to understand why you cannot go longer then 60 seconds.
Edit:
Might come from Appengine quotas and limits.
Sorry for previous answer:
As this looks like a protection from server for avoiding ddos or scrapping from one client. You have few options:
Waiting between a certain number of queries before continuing.
Making request from several clients who has different IP address and sending information back to your main script (might be costly to rent different server for this..).
You could also watch if website as api to access the data you need.
You should also take care as the sitowner could block/blacklist your IP if he decides your request are not good.

Internal Server Error in Web App: Google Latitude API in Google App Engine using Python

I am new to python and Google App Engine. I am currently working on a location based web application, I have used the following sample code http://code.google.com/p/latitudesample/
I was able to run the program in the localhost, however when I deploy the program and try to access the website I get an Internal Server Error Message.
I check the App Engine log file and it is as follows:
argument 2 to map() must support iteration
Traceback (most recent call last):File"/base/python_runtime/python_lib/versions/1/google/appengine/ext/webapp/__init__.py", line 700, in __call__
handler.get(*groups)
File "/base/data/home/apps/s~latitudesocial/1.351807357650160822/main.py", line 57, in get
self.request.host_url + OAUTH_CALLBACK_PATH, parameters)
File "/base/data/home/apps/s~latitudesocial/1.351807357650160822/oauth_webapp.py", line 28, in redirect_to_authorization_page
request_token = helper.GetRequestToken(callback_url, parameters)
File "/base/data/home/apps/s~latitudesocial/1.351807357650160822/oauth_appengine.py", line 78, in GetRequestToken
None)
File "/base/data/home/apps/s~latitudesocial/1.351807357650160822/oauth.py", line 262, in sign_request
self.build_signature(signature_method, consumer, token))
File "/base/data/home/apps/s~latitudesocial/1.351807357650160822/oauth.py", line 266, in build_signature
return signature_method.build_signature(self, consumer, token)
File "/base/data/home/apps/s~latitudesocial/1.351807357650160822/oauth.py", line 632, in build_signature
oauth_request, consumer, token)
File "/base/data/home/apps/s~latitudesocial/1.351807357650160822/oauth.py", line 623, in build_signature_base_string
key = '%s&' % escape(consumer.secret)
File "/base/data/home/apps/s~latitudesocial/1.351807357650160822/oauth.py", line 50, in escape
return urllib.quote(s, safe='~')
File "/base/python_runtime/python_dist/lib/python2.5/urllib.py", line 1214, in quote
res = map(safe_map.__getitem__, s)
TypeError: argument 2 to map() must support iteration
I have been trying to figure this out myself, but I have made no progress so far.
I am guessing there is something wrong with line 623 in oauth.py
If you could, please help me locate the error.
$key = '%s&' % escape(consumer.secret)
The change I've made to the main.py is:
class SetMyOauth(webapp.RequestHandler):
def get(self):
Config.set('oauth_consumer_key', 'myconsumerkey'),
Config.set('oauth_consumer_secret', 'myconsumersecret'),
self.response.out.write ("""key and secret set""")
(Update 17/7/2011)
Upon further inquiry I came across the following error when running the debugger
C:\Python25\lib\threading.py:699: RuntimeWarning: tp_compare didn't return -1 or -2 for exception
return _active[_get_ident()]
Exception exceptions.SystemError: 'error return without exception set' in <generator object at 0x03C41378> ignored
and it highlighted the following piece of code:
def escape(s):
"""Escape a URL including any /."""
return urllib.quote(s, safe='~')
which is the same code that is logged in the log file.
However when I run the code in using the development server the application works fine with no errors.

Categories

Resources