GCS broken pipe error when trying to upload large files - python

I'm uploading a file to GCS after extracting a .csv.gz to .csv; the file grows from about 500 MB to around 5 GB. Extracting the .csv.gz to a temporary path works, but the upload of that file to GCS fails with the following error:
[2019-11-11 13:59:58,180] {models.py:1796} ERROR - [Errno 32] Broken pipe
Traceback (most recent call last):
File "/usr/local/lib/airflow/airflow/models.py", line 1664, in _run_raw_task
result = task_copy.execute(context=context)
File "/home/airflow/gcs/dags/operators/s3_to_gcs_transform_operator.py", line 220, in execute
gcs_hook.upload(dest_gcs_bucket, dest_gcs_object, target_file, gzip=True)
File "/home/airflow/gcs/dags/hooks/gcs_hook_conn.py", line 208, in upload
.insert(bucket=bucket, name=object, media_body=media)
File "/opt/python3.6/lib/python3.6/site-packages/googleapiclient/_helpers.py", line 130, in positional_wrapper
return wrapped(*args, **kwargs)
File "/opt/python3.6/lib/python3.6/site-packages/googleapiclient/http.py", line 835, in execute
method=str(self.method), body=self.body, headers=self.headers)
File "/opt/python3.6/lib/python3.6/site-packages/googleapiclient/http.py", line 179, in _retry_request
raise exception
File "/opt/python3.6/lib/python3.6/site-packages/googleapiclient/http.py", line 162, in _retry_request
resp, content = http.request(uri, method, *args, **kwargs)
File "/opt/python3.6/lib/python3.6/site-packages/google_auth_httplib2.py", line 198, in request
uri, method, body=body, headers=request_headers, **kwargs)
File "/usr/local/lib/airflow/airflow/contrib/hooks/gcp_api_base_hook.py", line 155, in new_request
redirections, connection_type)
File "/opt/python3.6/lib/python3.6/site-packages/httplib2/__init__.py", line 1924, in request
cachekey)
File "/opt/python3.6/lib/python3.6/site-packages/httplib2/__init__.py", line 1595, in _request
conn, request_uri, method, body, headers)
File "/opt/python3.6/lib/python3.6/site-packages/httplib2/__init__.py", line 1502, in _conn_request
conn.request(method, request_uri, body, headers)
File "/opt/python3.6/lib/python3.6/http/client.py", line 1239, in request
self._send_request(method, url, body, headers, encode_chunked)
File "/opt/python3.6/lib/python3.6/http/client.py", line 1285, in _send_request
self.endheaders(body, encode_chunked=encode_chunked)
File "/opt/python3.6/lib/python3.6/http/client.py", line 1234, in endheaders
self._send_output(message_body, encode_chunked=encode_chunked)
File "/opt/python3.6/lib/python3.6/http/client.py", line 1065, in _send_output
self.send(chunk)
File "/opt/python3.6/lib/python3.6/http/client.py", line 986, in send
self.sock.sendall(data)
File "/opt/python3.6/lib/python3.6/ssl.py", line 975, in sendall
v = self.send(byte_view[count:])
File "/opt/python3.6/lib/python3.6/ssl.py", line 944, in send
return self._sslobj.write(data)
File "/opt/python3.6/lib/python3.6/ssl.py", line 642, in write
return self._sslobj.write(data)
BrokenPipeError: [Errno 32] Broken pipe
From what I understood, the error could be due to the following:
Your server process has received a SIGPIPE writing to a socket. This
usually happens when you write to a socket fully closed on the other
(client) side. This might be happening when a client program doesn't
wait till all the data from the server is received and simply closes a
socket (using close function).
But I have no idea whether this is the issue or how I can fix this. Can someone help?

You should try uploading big files in chunks:
from google.cloud import storage

# 128 MB per request; chunk_size must be a multiple of 256 KB
CHUNK_SIZE = 128 * 1024 * 1024

client = storage.Client()
bucket = client.bucket('destination')
# Setting chunk_size makes the client perform a resumable upload,
# sending the file in CHUNK_SIZE pieces instead of one giant request.
blob = bucket.blob('really-big-blob', chunk_size=CHUNK_SIZE)
blob.upload_from_filename('/path/to/really-big-file')
You can also look into Parallel Composite Uploads.
Similar SO question link.
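If you'd rather keep the googleapiclient-based hook that appears in your traceback instead of switching to google-cloud-storage, you can get a similar effect by making the MediaFileUpload chunked and resumable. A rough sketch only; the bucket, object and file names are placeholders, and the service object stands in for the authorized storage client your hook already builds:
from googleapiclient.discovery import build
from googleapiclient.http import MediaFileUpload

# Placeholders standing in for the hook's own arguments.
bucket = 'dest-gcs-bucket'
object_name = 'path/in/bucket/file.csv'
target_file = '/tmp/extracted.csv'

service = build('storage', 'v1')  # your hook already has an authorized service

media = MediaFileUpload(
    target_file,
    mimetype='application/octet-stream',
    chunksize=128 * 1024 * 1024,  # 128 MB per request, a multiple of 256 KB
    resumable=True,
)
request = service.objects().insert(bucket=bucket, name=object_name, media_body=media)

response = None
while response is None:
    # next_chunk() sends one chunk per call; `response` stays None
    # until the final chunk has been accepted by GCS.
    status, response = request.next_chunk()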

Related

httplib2.socks.HTTPError: (403, b'Forbidden') python apache-beam dataflow

I work in a Google Cloud environment where I don't have internet access. I'm trying to launch a Dataflow job and I'm using a proxy to access the internet.
When I run a simple wordcount.py with Dataflow I get this error:
WARNING:apache_beam.utils.retry:Retry with exponential backoff: waiting for 4.750968074377858 seconds before retrying _uncached_gcs_file_copy because we caught exception: httplib2.socks.HTTPError: (403, b'Forbidden')
Traceback for above exception (most recent call last):
File "/opt/py38/lib64/python3.8/site-packages/apache_beam/utils/retry.py", line 275, in wrapper
return fun(*args, **kwargs)
File "/opt/py38/lib64/python3.8/site-packages/apache_beam/runners/dataflow/internal/apiclient.py", line 631, in _uncached_gcs_file_copy
self.stage_file(to_folder, to_name, f, total_size=total_size)
File "/opt/py38/lib64/python3.8/site-packages/apache_beam/runners/dataflow/internal/apiclient.py", line 735, in stage_file
response = self._storage_client.objects.Insert(request, upload=upload)
File "/opt/py38/lib64/python3.8/site-packages/apache_beam/io/gcp/internal/clients/storage/storage_v1_client.py", line 1152, in Insert
return self._RunMethod(
File "/opt/py38/lib64/python3.8/site-packages/apitools/base/py/base_api.py", line 728, in _RunMethod
http_response = http_wrapper.MakeRequest(
File "/opt/py38/lib64/python3.8/site-packages/apitools/base/py/http_wrapper.py", line 359, in MakeRequest
retry_func(ExceptionRetryArgs(http, http_request, e, retry,
File "/opt/py38/lib64/python3.8/site-packages/apache_beam/io/gcp/gcsio_overrides.py", line 45, in retry_func
return http_wrapper.HandleExceptionsAndRebuildHttpConnections(retry_args)
File "/opt/py38/lib64/python3.8/site-packages/apitools/base/py/http_wrapper.py", line 304, in HandleExceptionsAndRebuildHttpConnections
raise retry_args.exc
File "/opt/py38/lib64/python3.8/site-packages/apitools/base/py/http_wrapper.py", line 348, in MakeRequest
return _MakeRequestNoRetry(
File "/opt/py38/lib64/python3.8/site-packages/apitools/base/py/http_wrapper.py", line 397, in _MakeRequestNoRetry
info, content = http.request(
File "/opt/py38/lib64/python3.8/site-packages/google_auth_httplib2.py", line 209, in request
self.credentials.before_request(self._request, method, uri, request_headers)
File "/opt/py38/lib64/python3.8/site-packages/google/auth/credentials.py", line 134, in before_request
self.refresh(request)
File "/opt/py38/lib64/python3.8/site-packages/google/auth/compute_engine/credentials.py", line 111, in refresh
self._retrieve_info(request)
File "/opt/py38/lib64/python3.8/site-packages/google/auth/compute_engine/credentials.py", line 87, in _retrieve_info
info = _metadata.get_service_account_info(
File "/opt/py38/lib64/python3.8/site-packages/google/auth/compute_engine/_metadata.py", line 234, in get_service_account_info
return get(request, path, params={"recursive": "true"})
File "/opt/py38/lib64/python3.8/site-packages/google/auth/compute_engine/_metadata.py", line 150, in get
response = request(url=url, method="GET", headers=_METADATA_HEADERS)
File "/opt/py38/lib64/python3.8/site-packages/google_auth_httplib2.py", line 119, in __call__
response, data = self.http.request(
File "/opt/py38/lib64/python3.8/site-packages/httplib2/__init__.py", line 1701, in request
(response, content) = self._request(
File "/opt/py38/lib64/python3.8/site-packages/httplib2/__init__.py", line 1421, in _request
(response, content) = self._conn_request(conn, request_uri, method, body, headers)
File "/opt/py38/lib64/python3.8/site-packages/httplib2/__init__.py", line 1343, in _conn_request
conn.connect()
File "/opt/py38/lib64/python3.8/site-packages/httplib2/__init__.py", line 1026, in connect
self.sock.connect((self.host, self.port) + sa[2:])
File "/opt/py38/lib64/python3.8/site-packages/httplib2/socks.py", line 504, in connect
self.__negotiatehttp(destpair[0], destpair[1])
File "/opt/py38/lib64/python3.8/site-packages/httplib2/socks.py", line 465, in __negotiatehttp
raise HTTPError((statuscode, statusline[2]))
My service account has these roles:
BigQuery Data Editor
BigQuery User
Dataflow Developer
Dataflow Worker
Service Account User
Storage Admin
The instance has the Cloud API access scope: Allow full access to all Cloud APIs.
What is the problem?
Based on the comment from #luca, the above error was solved by using an internal proxy that allows access to the internet. Add --no_use_public_ip to the command and set no_proxy="metadata.google.internal,www.googleapis.com,dataflow.googleapis.com,bigquery.googleapis.com".
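As a rough illustration of that setup (a sketch only: the project, region and bucket values are placeholders, the hostnames and the flag are quoted verbatim from the fix above, and the exact flag spelling may depend on your SDK version):
import os
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

# Keep Google API and metadata-server traffic off the outbound proxy
# during job submission; hostnames copied verbatim from the fix above.
os.environ["no_proxy"] = (
    "metadata.google.internal,www.googleapis.com,"
    "dataflow.googleapis.com,bigquery.googleapis.com"
)

options = PipelineOptions([
    "--runner=DataflowRunner",
    "--project=my-project",                # placeholder
    "--region=europe-west1",               # placeholder
    "--temp_location=gs://my-bucket/tmp",  # placeholder
    "--no_use_public_ip",                  # flag quoted from the fix above
])

with beam.Pipeline(options=options) as p:
    # A trivial stand-in for the wordcount transforms.
    p | beam.Create(["hello", "world"]) | beam.Map(print)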

error uploading file to google drive with python

I wrote some code to upload (create and update) a file to Google Drive.
On Windows 10 with Python 3.9 it works, but on Windows Server 2008 with Python 3.8 it gives me an error. (Just to note, 3.8 is the last Python version that supports Windows 2008.)
If I try to list files from Google Drive it works; the problem is only with creating or updating a file.
I suspect it's related to Windows 2008 and SSL, maybe?
The error is this:
C:\backupmgr>python drive.py
Traceback (most recent call last):
File "drive.py", line 112, in <module>
envia_zip('sexta.7z')
File "drive.py", line 104, in envia_zip
file = service.files().create(body=file_metadata, media_body=media).execute(
)
File "C:\Users\Administrador\AppData\Local\Programs\Python\Python38\lib\site-p
ackages\googleapiclient\_helpers.py", line 130, in positional_wrapper
return wrapped(*args, **kwargs)
File "C:\Users\Administrador\AppData\Local\Programs\Python\Python38\lib\site-p
ackages\googleapiclient\http.py", line 923, in execute
resp, content = _retry_request(
File "C:\Users\Administrador\AppData\Local\Programs\Python\Python38\lib\site-p
ackages\googleapiclient\http.py", line 222, in _retry_request
raise exception
File "C:\Users\Administrador\AppData\Local\Programs\Python\Python38\lib\site-p
ackages\googleapiclient\http.py", line 191, in _retry_request
resp, content = http.request(uri, method, *args, **kwargs)
File "C:\Users\Administrador\AppData\Local\Programs\Python\Python38\lib\site-p
ackages\google_auth_httplib2.py", line 218, in request
response, content = self.http.request(
File "C:\Users\Administrador\AppData\Local\Programs\Python\Python38\lib\site-p
ackages\httplib2\__init__.py", line 1720, in request
(response, content) = self._request(
File "C:\Users\Administrador\AppData\Local\Programs\Python\Python38\lib\site-p
ackages\httplib2\__init__.py", line 1440, in _request
(response, content) = self._conn_request(conn, request_uri, method, body, he
aders)
File "C:\Users\Administrador\AppData\Local\Programs\Python\Python38\lib\site-p
ackages\httplib2\__init__.py", line 1363, in _conn_request
conn.request(method, request_uri, body, headers)
File "C:\Users\Administrador\AppData\Local\Programs\Python\Python38\lib\http\c
lient.py", line 1252, in request
self._send_request(method, url, body, headers, encode_chunked)
File "C:\Users\Administrador\AppData\Local\Programs\Python\Python38\lib\http\c
lient.py", line 1298, in _send_request
self.endheaders(body, encode_chunked=encode_chunked)
File "C:\Users\Administrador\AppData\Local\Programs\Python\Python38\lib\http\c
lient.py", line 1247, in endheaders
self._send_output(message_body, encode_chunked=encode_chunked)
File "C:\Users\Administrador\AppData\Local\Programs\Python\Python38\lib\http\c
lient.py", line 1046, in _send_output
self.send(chunk)
File "C:\Users\Administrador\AppData\Local\Programs\Python\Python38\lib\http\c
lient.py", line 968, in send
self.sock.sendall(data)
File "C:\Users\Administrador\AppData\Local\Programs\Python\Python38\lib\ssl.py
", line 1204, in sendall
v = self.send(byte_view[count:])
File "C:\Users\Administrador\AppData\Local\Programs\Python\Python38\lib\ssl.py
", line 1173, in send
return self._sslobj.write(data)
socket.timeout: The write operation timed out
Well, it works now. As #DaImTo pointed out via issue #632 in the Google API GitHub repo, it is not a problem with the API. The problem is that the socket module has a low default timeout. The PC with Windows Server 2008 that I was using is very slow and was hitting this default timeout, so I just had to raise the default timeout by inserting this code at the beginning of the script:
import socket
socket.setdefaulttimeout(600)
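If you'd rather not change the process-wide socket default, a scoped alternative is to give only the Drive client a longer httplib2 timeout. This is a sketch, not the accepted fix; creds stands in for the authorized credentials object the script already builds:
import httplib2
import google_auth_httplib2
from googleapiclient.discovery import build

# Wrap an httplib2 client with a 600 s timeout in the authorized transport,
# then hand it to the Drive service instead of passing credentials directly.
authed_http = google_auth_httplib2.AuthorizedHttp(creds, http=httplib2.Http(timeout=600))
service = build('drive', 'v3', http=authed_http)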

How to connect to hdfs from the docker container?

My goal is to read a file from HDFS in Airflow and do further manipulations with it.
After researching, I found that the URL I need to use is as follows:
df = pd.read_parquet('http://localhost:9870/webhdfs/v1/hadoop_files/sample_2022_01.parquet?op=OPEN'),
where localhost / 172.20.80.1 / computer-name.mshome.net can be used interchangeably,
9870 - the namenode port,
hadoop_files/sample_2022_01.parquet - my folder and file created in the root.
I can access and read the file locally in PyCharm, but I am unable to get the same result inside Airflow in Docker. I tried using local HDFS and HDFS hosted in Docker, and changing the host to host.docker.internal, but I get the same error.
Stack trace:
[2022-06-12, 17:52:45 UTC] {taskinstance.py:1889} ERROR - Task failed with exception
Traceback (most recent call last):
File "/usr/local/lib/python3.7/urllib/request.py", line 1350, in do_open
encode_chunked=req.has_header('Transfer-encoding'))
File "/usr/local/lib/python3.7/http/client.py", line 1281, in request
self._send_request(method, url, body, headers, encode_chunked)
File "/usr/local/lib/python3.7/http/client.py", line 1327, in _send_request
self.endheaders(body, encode_chunked=encode_chunked)
File "/usr/local/lib/python3.7/http/client.py", line 1276, in endheaders
self._send_output(message_body, encode_chunked=encode_chunked)
File "/usr/local/lib/python3.7/http/client.py", line 1036, in _send_output
self.send(msg)
File "/usr/local/lib/python3.7/http/client.py", line 976, in send
self.connect()
File "/usr/local/lib/python3.7/http/client.py", line 948, in connect
(self.host,self.port), self.timeout, self.source_address)
File "/usr/local/lib/python3.7/socket.py", line 728, in create_connection
raise err
File "/usr/local/lib/python3.7/socket.py", line 716, in create_connection
sock.connect(sa)
OSError: [Errno 113] No route to host
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/home/airflow/.local/lib/python3.7/site-packages/airflow/operators/python.py", line 207, in execute
branch = super().execute(context)
File "/home/airflow/.local/lib/python3.7/site-packages/airflow/operators/python.py", line 171, in execute
return_value = self.execute_callable()
File "/home/airflow/.local/lib/python3.7/site-packages/airflow/operators/python.py", line 189, in execute_callable
return self.python_callable(*self.op_args, **self.op_kwargs)
File "/opt/airflow/dags/includes/parquet_dag/main.py", line 15, in main
df_parquet = read('hdfs://localhost:9000/hadoop_files/sample_2022_01.parquet')
File "/opt/airflow/dags/includes/parquet_dag/utils.py", line 29, in read
df = pd.read_parquet('http://172.20.80.1:9870/webhdfs/v1/hadoop_files/sample_2022_01.parquet?op=OPEN')
File "/home/airflow/.local/lib/python3.7/site-packages/pandas/io/parquet.py", line 500, in read_parquet
**kwargs,
File "/home/airflow/.local/lib/python3.7/site-packages/pandas/io/parquet.py", line 236, in read
mode="rb",
File "/home/airflow/.local/lib/python3.7/site-packages/pandas/io/parquet.py", line 102, in _get_path_or_handle
path_or_handle, mode, is_text=False, storage_options=storage_options
File "/home/airflow/.local/lib/python3.7/site-packages/pandas/io/common.py", line 614, in get_handle
storage_options=storage_options,
File "/home/airflow/.local/lib/python3.7/site-packages/pandas/io/common.py", line 312, in _get_filepath_or_buffer
with urlopen(req_info) as req:
File "/home/airflow/.local/lib/python3.7/site-packages/pandas/io/common.py", line 212, in urlopen
return urllib.request.urlopen(*args, **kwargs)
File "/usr/local/lib/python3.7/urllib/request.py", line 222, in urlopen
return opener.open(url, data, timeout)
File "/usr/local/lib/python3.7/urllib/request.py", line 525, in open
response = self._open(req, data)
File "/usr/local/lib/python3.7/urllib/request.py", line 543, in _open
'_open', req)
File "/usr/local/lib/python3.7/urllib/request.py", line 503, in _call_chain
result = func(*args)
File "/usr/local/lib/python3.7/urllib/request.py", line 1378, in http_open
return self.do_open(http.client.HTTPConnection, req)
File "/usr/local/lib/python3.7/urllib/request.py", line 1352, in do_open
raise URLError(err)
urllib.error.URLError: <urlopen error [Errno 113] No route to host>
With host.docker.internal:
urllib.error.URLError: <urlopen error [Errno 99] Cannot assign requested address>
You need to use an address that is routable from inside the Airflow Docker container.
If Hadoop is inside a Docker container as well, check its IP address using docker inspect CONTAINER (doc). If Hadoop is on localhost, you can set network_mode: "host" (doc).
Also, there is an important caveat if you are on macOS with the Docker Desktop app, which is basically a virtual machine; in that case you need some extra settings (check this, for example).
where localhost/172.20.80.1/computer-name.mshome.net can be interchangeably used,
They shouldn't be interchangeable inside a Docker network.
From Airflow, you could use Docker service names, not IP addresses, and ensure the containers are in the same bridge network (not host mode, which only works on Linux). host.docker.internal isn't correct either, since you're trying to reach another container, not your host.
https://docs.docker.com/network/bridge/
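A minimal sketch of the service-name approach, assuming the HDFS namenode container is called namenode in the same Compose bridge network (the service name is illustrative; the path is the one from your question):
import pandas as pd

# 'namenode' is a hypothetical Docker Compose service name for the HDFS
# namenode container; the Airflow containers must share its bridge network.
URL = "http://namenode:9870/webhdfs/v1/hadoop_files/sample_2022_01.parquet?op=OPEN"

df = pd.read_parquet(URL)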
I'd also recommend using the Airflow Spark operators to actually read Parquet from HDFS, using Spark rather than Pandas or WebHDFS. You can convert Spark dataframes to Pandas if needed.

Google Cloud API: EOFError: Compressed file ended before the end-of-stream marker was reached

I have some really simple Python code that I'm using to do a GCP Google Vault export. However, about 4 out of every 5 runs return an EOFError. The stack trace is:
Traceback (most recent call last):
File "src/initiate_job.py", line 57, in <module>
results = service.matters().list(pageSize=10).execute()
File "/Users/ipq500/opt/miniconda3/lib/python3.7/site-packages/googleapiclient/_helpers.py", line 134, in positional_wrapper
return wrapped(*args, **kwargs)
File "/Users/ipq500/opt/miniconda3/lib/python3.7/site-packages/googleapiclient/http.py", line 901, in execute
headers=self.headers,
File "/Users/ipq500/opt/miniconda3/lib/python3.7/site-packages/googleapiclient/http.py", line 177, in _retry_request
resp, content = http.request(uri, method, *args, **kwargs)
File "/Users/ipq500/opt/miniconda3/lib/python3.7/site-packages/google_auth_httplib2.py", line 190, in request
self._request, method, uri, request_headers)
File "/Users/ipq500/opt/miniconda3/lib/python3.7/site-packages/google/auth/credentials.py", line 133, in before_request
self.refresh(request)
File "/Users/ipq500/opt/miniconda3/lib/python3.7/site-packages/google/oauth2/service_account.py", line 359, in refresh
access_token, expiry, _ = _client.jwt_grant(request, self._token_uri, assertion)
File "/Users/ipq500/opt/miniconda3/lib/python3.7/site-packages/google/oauth2/_client.py", line 153, in jwt_grant
response_data = _token_endpoint_request(request, token_uri, body)
File "/Users/ipq500/opt/miniconda3/lib/python3.7/site-packages/google/oauth2/_client.py", line 105, in _token_endpoint_request
response = request(method="POST", url=token_uri, headers=headers, body=body)
File "/Users/ipq500/opt/miniconda3/lib/python3.7/site-packages/google_auth_httplib2.py", line 117, in __call__
url, method=method, body=body, headers=headers, **kwargs)
File "/Users/ipq500/opt/miniconda3/lib/python3.7/site-packages/httplib2/__init__.py", line 1994, in request
cachekey,
File "/Users/ipq500/opt/miniconda3/lib/python3.7/site-packages/httplib2/__init__.py", line 1651, in _request
conn, request_uri, method, body, headers
File "/Users/ipq500/opt/miniconda3/lib/python3.7/site-packages/httplib2/__init__.py", line 1621, in _conn_request
content = _decompressContent(response, content)
File "/Users/ipq500/opt/miniconda3/lib/python3.7/site-packages/httplib2/__init__.py", line 460, in _decompressContent
content = gzip.GzipFile(fileobj=io.BytesIO(new_content)).read()
File "/Users/ipq500/opt/miniconda3/lib/python3.7/gzip.py", line 276, in read
return self._buffer.read(size)
File "/Users/ipq500/opt/miniconda3/lib/python3.7/gzip.py", line 482, in read
raise EOFError("Compressed file ended before the "
EOFError: Compressed file ended before the end-of-stream marker was reached
The code that throws this is just:
service = build('vault', 'v1', credentials=delegated_credentials)
results = service.matters().list(pageSize=10).execute()
I tried to add some defensive programming wherein I simply retry anything that throws this error -- but now I'm hitting rate limits. So, I have to really debug why I'm getting this error.
Any and all help would be appreciated!
Based on this question and this conversation, it seems that the compressed content being read is damaged, so the end-of-stream error is expected. In this case I suggest verifying that the credentials exist (take the sample as a reference) and doing a double check before performing any task. Do the same with the file itself, since it is possible that writing it had not finished, including while it is being compressed or decompressed.
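On the retry side, since a bare retry loop is what is tripping your rate limits, spacing the retries out with exponential backoff may help. This is a hypothetical helper, not part of the Vault API; service is the client from your snippet:
import random
import time

def execute_with_backoff(request, max_attempts=5):
    """Retry a googleapiclient request with exponential backoff.

    Backing off between attempts keeps the retries from hammering
    the API and tripping rate limits.
    """
    for attempt in range(max_attempts):
        try:
            return request.execute()
        except EOFError:
            if attempt == max_attempts - 1:
                raise
            # Sleep 1, 2, 4, 8... seconds plus a little jitter.
            time.sleep((2 ** attempt) + random.random())

results = execute_with_backoff(service.matters().list(pageSize=10))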

Python dataflow job fails to submit

We have a kubernetes cron job on GCP that submits several copies of the same Python dataflow job, each in its own container. Whenever we need a new copy of the job, we just add it to the spec->jobTemplate->spec->template->spec->containers part of the cron job yaml and adjust the dataflow job parameters as needed. This usually works fine, but the latest copy we tried to add does not work. The existing copies are all still working as expected. The job seems to fail on submission to GCP, and the error message is not very helpful:
Traceback (most recent call last):
File "/app/job.py", line 117, in <module>
newness.pipeline.run_dataflow(sys.argv)
File "/app/newness/pipeline.py", line 480, in run_dataflow
result = pipe.run()
File "/usr/local/lib/python2.7/dist-packages/apache_beam/pipeline.py", line 403, in run
self.to_runner_api(), self.runner, self._options).run(False)
File "/usr/local/lib/python2.7/dist-packages/apache_beam/pipeline.py", line 416, in run
return self.runner.run_pipeline(self)
File "/usr/local/lib/python2.7/dist-packages/apache_beam/runners/dataflow/dataflow_runner.py", line 389, in run_pipeline
self.dataflow_client.create_job(self.job), self)
File "/usr/local/lib/python2.7/dist-packages/apache_beam/utils/retry.py", line 184, in wrapper
return fun(*args, **kwargs)
File "/usr/local/lib/python2.7/dist-packages/apache_beam/runners/dataflow/internal/apiclient.py", line 504, in create_job
return self.submit_job_description(job)
File "/usr/local/lib/python2.7/dist-packages/apache_beam/utils/retry.py", line 184, in wrapper
return fun(*args, **kwargs)
File "/usr/local/lib/python2.7/dist-packages/apache_beam/runners/dataflow/internal/apiclient.py", line 551, in submit_job_description
response = self._client.projects_locations_jobs.Create(request)
File "/usr/local/lib/python2.7/dist-packages/apache_beam/runners/dataflow/internal/clients/dataflow/dataflow_v1b3_client.py", line 578, in Create
config, request, global_params=global_params)
File "/usr/local/lib/python2.7/dist-packages/apitools/base/py/base_api.py", line 731, in _RunMethod
return self.ProcessHttpResponse(method_config, http_response, request)
File "/usr/local/lib/python2.7/dist-packages/apitools/base/py/base_api.py", line 737, in ProcessHttpResponse
self.__ProcessHttpResponse(method_config, http_response, request))
File "/usr/local/lib/python2.7/dist-packages/apitools/base/py/base_api.py", line 604, in __ProcessHttpResponse
http_response, method_config=method_config, request=request)
apitools.base.py.exceptions.HttpError: <exception str() failed>
The job does not appear in the dataflow console at all.
The previous lines of the container logs look like:
2019-10-13T03:57:47.725542287Z Successfully downloaded apache-beam
2019-10-13T03:58:17.125601280Z INFO:root:Staging SDK sources from PyPI to gs://gcs-bucket-name/staging/newness-boosting-c2898.1570936519.827087/dataflow_python_sdk.tar
2019-10-13T03:58:17.324843623Z INFO:root:Starting GCS upload to gs://gcs-bucket-name/staging/newness-boosting-c2898.1570936519.827087/dataflow_python_sdk.tar...
2019-10-13T03:58:24.825657227Z INFO:root:Completed GCS upload to gs://gcs-bucket-name/staging/newness-boosting-c2898.1570936519.827087/dataflow_python_sdk.tar
2019-10-13T03:58:25.225646529Z INFO:root:Downloading binary distribtution of the SDK from PyPi
2019-10-13T03:58:25.225716554Z INFO:root:Executing command: ['/usr/bin/python', '-m', 'pip', 'download', '--dest', '/tmp/tmpk5TfMS', 'apache-beam==2.8.0', '--no-deps', '--only-binary', ':all:', '--python-version', '27', '--implementation', 'cp', '--abi', 'cp27mu', '--platform', 'manylinux1_x86_64']
2019-10-13T03:59:33.926186906Z Collecting apache-beam==2.8.0
2019-10-13T03:59:52.125678183Z Using cached https://files.pythonhosted.org/packages/0f/63/ea5453ba656d060936acf41d2ec057f23aafd69649e2129ac66fdda67d48/apache_beam-2.8.0-cp27-cp27mu-manylinux1_x86_64.whl
2019-10-13T04:00:11.525435891Z Saved /tmp/tmpk5TfMS/apache_beam-2.8.0-cp27-cp27mu-manylinux1_x86_64.whl
2019-10-13T04:00:12.025054706Z Successfully downloaded apache-beam
2019-10-13T04:00:26.726190542Z INFO:root:Staging binary distribution of the SDK from PyPI to gs://gcs-bucket-name/staging/newness-boosting-c2898.1570936519.827087/apache_beam-2.8.0-cp27-cp27mu-manylinux1_x86_64.whl
2019-10-13T04:00:26.825618945Z INFO:root:Starting GCS upload to gs://gcs-bucket-name/staging/newness-boosting-c2898.1570936519.827087/apache_beam-2.8.0-cp27-cp27mu-manylinux1_x86_64.whl...
2019-10-13T04:00:33.725522899Z INFO:root:Completed GCS upload to gs://gcs-bucket-name/staging/newness-boosting-c2898.1570936519.827087/apache_beam-2.8.0-cp27-cp27mu-manylinux1_x86_64.whl
2019-10-13T04:06:14.525017097Z Traceback (most recent call last):
...
Why is this job failing to submit? Are there any other logs we can look at to see the cause of this failure?
(Most of our dataflow jobs are written in Java, where the error messages are usually more helpful.)
UPDATE: Running job locally (Windows) with apache-beam 2.16 has the same issue but more logging detail:
...
INFO:root:Starting GCS upload to gs://gcs-bucket-name/staging/newness-boosting-c2898.1571606418.971000/apache_beam-2.16.0-cp27-cp27mu-manylinux1_x86_64.whl...
INFO:root:Completed GCS upload to gs://gcs-bucket-name/staging/newness-boosting-c2898.1571606418.971000/apache_beam-2.16.0-cp27-cp27mu-manylinux1_x86_64.whl in 3 seconds.
WARNING:root:Discarding unparseable args: ['job.py', '--days_history=30']
WARNING:root:Discarding unparseable args: ['job.py', '--days_history=30']
WARNING:root:Retry with exponential backoff: waiting for 2.64795143823 seconds before retrying submit_job_description because we caught exception: error: [Errno 10053] An established connection was aborted by the software in your host machine
Traceback for above exception (most recent call last):
File "C:\Python27\lib\site-packages\apache_beam\utils\retry.py", line 206, in wrapper
return fun(*args, **kwargs)
File "C:\Python27\lib\site-packages\apache_beam\runners\dataflow\internal\apiclient.py", line 593, in submit_job_description
response = self._client.projects_locations_jobs.Create(request)
File "C:\Python27\lib\site-packages\apache_beam\runners\dataflow\internal\clients\dataflow\dataflow_v1b3_client.py", line 657, in Create
config, request, global_params=global_params)
File "C:\Python27\lib\site-packages\apitools\base\py\base_api.py", line 729, in _RunMethod
http, http_request, **opts)
File "C:\Python27\lib\site-packages\apitools\base\py\http_wrapper.py", line 346, in MakeRequest
check_response_func=check_response_func)
File "C:\Python27\lib\site-packages\apitools\base\py\http_wrapper.py", line 396, in _MakeRequestNoRetry
redirections=redirections, connection_type=connection_type)
File "C:\Python27\lib\site-packages\oauth2client\transport.py", line 169, in new_request
redirections, connection_type)
File "C:\Python27\lib\site-packages\oauth2client\transport.py", line 169, in new_request
redirections, connection_type)
File "C:\Python27\lib\site-packages\httplib2\__init__.py", line 1694, in request
(response, content) = self._request(conn, authority, uri, request_uri, method, body, headers, redirections, cachekey)
File "C:\Python27\lib\site-packages\httplib2\__init__.py", line 1434, in _request
(response, content) = self._conn_request(conn, request_uri, method, body, headers)
File "C:\Python27\lib\site-packages\httplib2\__init__.py", line 1390, in _conn_request
response = conn.getresponse()
File "C:\Python27\lib\httplib.py", line 1121, in getresponse
response.begin()
File "C:\Python27\lib\httplib.py", line 438, in begin
version, status, reason = self._read_status()
File "C:\Python27\lib\httplib.py", line 394, in _read_status
line = self.fp.readline(_MAXLINE + 1)
File "C:\Python27\lib\socket.py", line 480, in readline
data = self._sock.recv(self._rbufsize)
File "C:\Python27\lib\ssl.py", line 754, in recv
return self.read(buflen)
File "C:\Python27\lib\ssl.py", line 641, in read
v = self._sslobj.read(len)
... retries 4 times total ...
Traceback (most recent call last):
File "job.py", line 117, in <module>
newness.pipeline.run_dataflow(sys.argv)
File "C:\Users\LeeW\Desktop\newness\newness\pipeline.py", line 480, in run_dataflow
result = pipe.run()
File "C:\Python27\lib\site-packages\apache_beam\pipeline.py", line 407, in run
self._options).run(False)
File "C:\Python27\lib\site-packages\apache_beam\pipeline.py", line 420, in run
return self.runner.run_pipeline(self, self._options)
File "C:\Python27\lib\site-packages\apache_beam\runners\dataflow\dataflow_runner.py", line 485, in run_pipeline
self.dataflow_client.create_job(self.job), self)
File "C:\Python27\lib\site-packages\apache_beam\utils\retry.py", line 206, in wrapper
return fun(*args, **kwargs)
File "C:\Python27\lib\site-packages\apache_beam\runners\dataflow\internal\apiclient.py", line 546, in create_job
return self.submit_job_description(job)
File "C:\Python27\lib\site-packages\apache_beam\utils\retry.py", line 219, in wrapper
raise_with_traceback(exn, exn_traceback)
File "C:\Python27\lib\site-packages\apache_beam\utils\retry.py", line 206, in wrapper
return fun(*args, **kwargs)
File "C:\Python27\lib\site-packages\apache_beam\runners\dataflow\internal\apiclient.py", line 593, in submit_job_description
response = self._client.projects_locations_jobs.Create(request)
File "C:\Python27\lib\site-packages\apache_beam\runners\dataflow\internal\clients\dataflow\dataflow_v1b3_client.py", line 657, in Create
config, request, global_params=global_params)
File "C:\Python27\lib\site-packages\apitools\base\py\base_api.py", line 729, in _RunMethod
http, http_request, **opts)
File "C:\Python27\lib\site-packages\apitools\base\py\http_wrapper.py", line 346, in MakeRequest
check_response_func=check_response_func)
File "C:\Python27\lib\site-packages\apitools\base\py\http_wrapper.py", line 396, in _MakeRequestNoRetry
redirections=redirections, connection_type=connection_type)
File "C:\Python27\lib\site-packages\oauth2client\transport.py", line 169, in new_request
redirections, connection_type)
File "C:\Python27\lib\site-packages\oauth2client\transport.py", line 169, in new_request
redirections, connection_type)
File "C:\Python27\lib\site-packages\httplib2\__init__.py", line 1694, in request
(response, content) = self._request(conn, authority, uri, request_uri, method, body, headers, redirections, cachekey)
File "C:\Python27\lib\site-packages\httplib2\__init__.py", line 1434, in _request
(response, content) = self._conn_request(conn, request_uri, method, body, headers)
File "C:\Python27\lib\site-packages\httplib2\__init__.py", line 1390, in _conn_request
response = conn.getresponse()
File "C:\Python27\lib\httplib.py", line 1121, in getresponse
response.begin()
File "C:\Python27\lib\httplib.py", line 438, in begin
version, status, reason = self._read_status()
File "C:\Python27\lib\httplib.py", line 394, in _read_status
line = self.fp.readline(_MAXLINE + 1)
File "C:\Python27\lib\socket.py", line 480, in readline
data = self._sock.recv(self._rbufsize)
File "C:\Python27\lib\ssl.py", line 754, in recv
return self.read(buflen)
File "C:\Python27\lib\ssl.py", line 641, in read
v = self._sslobj.read(len)
socket.error: [Errno 10053] An established connection was aborted by the software in your host machine
Which version of Beam Python SDK are you using?
