AWS S3 bucket boto - python

I have written the following code to create an AWS S3 bucket using boto:
from boto.s3.connection import S3Connection
conn = S3Connection()
bucket = conn.create_bucket('mybucket1')
But when I run this code I get the following error:
Traceback (most recent call last):
File "prob1.py", line 3, in <module>
bucket = conn.create_bucket('mybucket1')
File "/usr/local/lib/python2.7/dist-packages/boto/s3/connection.py", line 619, in create_bucket
data=data)
File "/usr/local/lib/python2.7/dist-packages/boto/s3/connection.py", line 671, in make_request
retry_handler=retry_handler
File "/usr/local/lib/python2.7/dist-packages/boto/connection.py", line 1071, in make_request
retry_handler=retry_handler)
File "/usr/local/lib/python2.7/dist-packages/boto/connection.py", line 913, in _mexe
self.is_secure)
File "/usr/local/lib/python2.7/dist-packages/boto/connection.py", line 705, in get_http_connection
return self.new_http_connection(host, port, is_secure)
File "/usr/local/lib/python2.7/dist-packages/boto/connection.py", line 747, in new_http_connection
connection = self.proxy_ssl(host, is_secure and 443 or 80)
File "/usr/local/lib/python2.7/dist-packages/boto/connection.py", line 792, in proxy_ssl
int(self.proxy_port)), timeout)
File "/usr/lib/python2.7/socket.py", line 575, in create_connection
raise err
socket.timeout: timed out
I am not using any proxy server.
Help me debug this code. Thanks in advance.

Your code is perfectly fine.
The error is a timeout, which suggests a networking issue, such as a port being blocked by corporate IT. Also note that the traceback passes through boto's proxy_ssl method, so boto appears to have picked up proxy settings from somewhere (for example an http_proxy environment variable or your boto config file) even though you say you aren't using a proxy; that is worth checking.
Try it from another network (e.g. from home) and you'll likely find that it works correctly. It is then a matter of tracking down whoever runs your network to figure out what is blocking your connection.
Alternatively, create an Amazon EC2 instance, connect to it (if possible), and run your code from there.
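As a quick check (a minimal sketch, not part of the original answer), you can print the proxy-related environment variables that boto honours before connecting:

import os
from boto.s3.connection import S3Connection

# boto reads proxy settings from these environment variables (and from
# the [Boto] section of ~/.boto), so print them to see whether a proxy
# is being picked up without your knowledge.
for var in ('http_proxy', 'https_proxy', 'HTTP_PROXY', 'HTTPS_PROXY'):
    print(var, '=', os.environ.get(var))

conn = S3Connection()  # credentials come from env vars or ~/.boto
bucket = conn.create_bucket('mybucket1')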

Related

How to connect to HDFS from a Docker container?

My goal is to read a file from HDFS in Airflow and do further manipulations.
After researching, I found that the URL I need to use is as follows:
df = pd.read_parquet('http://localhost:9870/webhdfs/v1/hadoop_files/sample_2022_01.parquet?op=OPEN'),
where localhost/172.20.80.1/computer-name.mshome.net can be used interchangeably,
9870 is the namenode port, and
hadoop_files/sample_2022_01.parquet is my folder and file created in the root.
I can access and read the file locally in PyCharm, but I am unable to get the same result inside Airflow in Docker. I tried using a local HDFS and an HDFS hosted in Docker, and changing the host to host.docker.internal, but I get the same error.
Stack trace:
[2022-06-12, 17:52:45 UTC] {taskinstance.py:1889} ERROR - Task failed with exception
Traceback (most recent call last):
File "/usr/local/lib/python3.7/urllib/request.py", line 1350, in do_open
encode_chunked=req.has_header('Transfer-encoding'))
File "/usr/local/lib/python3.7/http/client.py", line 1281, in request
self._send_request(method, url, body, headers, encode_chunked)
File "/usr/local/lib/python3.7/http/client.py", line 1327, in _send_request
self.endheaders(body, encode_chunked=encode_chunked)
File "/usr/local/lib/python3.7/http/client.py", line 1276, in endheaders
self._send_output(message_body, encode_chunked=encode_chunked)
File "/usr/local/lib/python3.7/http/client.py", line 1036, in _send_output
self.send(msg)
File "/usr/local/lib/python3.7/http/client.py", line 976, in send
self.connect()
File "/usr/local/lib/python3.7/http/client.py", line 948, in connect
(self.host,self.port), self.timeout, self.source_address)
File "/usr/local/lib/python3.7/socket.py", line 728, in create_connection
raise err
File "/usr/local/lib/python3.7/socket.py", line 716, in create_connection
sock.connect(sa)
OSError: [Errno 113] No route to host
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/home/airflow/.local/lib/python3.7/site-packages/airflow/operators/python.py", line 207, in execute
branch = super().execute(context)
File "/home/airflow/.local/lib/python3.7/site-packages/airflow/operators/python.py", line 171, in execute
return_value = self.execute_callable()
File "/home/airflow/.local/lib/python3.7/site-packages/airflow/operators/python.py", line 189, in execute_callable
return self.python_callable(*self.op_args, **self.op_kwargs)
File "/opt/airflow/dags/includes/parquet_dag/main.py", line 15, in main
df_parquet = read('hdfs://localhost:9000/hadoop_files/sample_2022_01.parquet')
File "/opt/airflow/dags/includes/parquet_dag/utils.py", line 29, in read
df = pd.read_parquet('http://172.20.80.1:9870/webhdfs/v1/hadoop_files/sample_2022_01.parquet?op=OPEN')
File "/home/airflow/.local/lib/python3.7/site-packages/pandas/io/parquet.py", line 500, in read_parquet
**kwargs,
File "/home/airflow/.local/lib/python3.7/site-packages/pandas/io/parquet.py", line 236, in read
mode="rb",
File "/home/airflow/.local/lib/python3.7/site-packages/pandas/io/parquet.py", line 102, in _get_path_or_handle
path_or_handle, mode, is_text=False, storage_options=storage_options
File "/home/airflow/.local/lib/python3.7/site-packages/pandas/io/common.py", line 614, in get_handle
storage_options=storage_options,
File "/home/airflow/.local/lib/python3.7/site-packages/pandas/io/common.py", line 312, in _get_filepath_or_buffer
with urlopen(req_info) as req:
File "/home/airflow/.local/lib/python3.7/site-packages/pandas/io/common.py", line 212, in urlopen
return urllib.request.urlopen(*args, **kwargs)
File "/usr/local/lib/python3.7/urllib/request.py", line 222, in urlopen
return opener.open(url, data, timeout)
File "/usr/local/lib/python3.7/urllib/request.py", line 525, in open
response = self._open(req, data)
File "/usr/local/lib/python3.7/urllib/request.py", line 543, in _open
'_open', req)
File "/usr/local/lib/python3.7/urllib/request.py", line 503, in _call_chain
result = func(*args)
File "/usr/local/lib/python3.7/urllib/request.py", line 1378, in http_open
return self.do_open(http.client.HTTPConnection, req)
File "/usr/local/lib/python3.7/urllib/request.py", line 1352, in do_open
raise URLError(err)
urllib.error.URLError: <urlopen error [Errno 113] No route to host>
With host.docker.internal:
urllib.error.URLError: <urlopen error [Errno 99] Cannot assign requested address>
You need to use an address that is routable from inside the Airflow Docker container.
If Hadoop is inside a Docker container as well, check its IP address using docker inspect CONTAINER (doc). If Hadoop is on localhost, you can set network_mode: "host" (doc).
There is also an important caveat if you are on macOS and using the Docker Desktop app, which is basically a virtual machine; in that case you need some extra settings, check this, for example.
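For example (a sketch, assuming the Hadoop container is named hadoop), this prints the container's IP on each of its networks:

docker inspect -f '{{range .NetworkSettings.Networks}}{{.IPAddress}} {{end}}' hadoop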
where localhost/172.20.80.1/computer-name.mshome.net can be interchangeably used,
They shouldn't be interchangeable inside a Docker network.
From Airflow, you could use Docker service names, not IP addresses, and ensure the containers are in the same bridge network (not host mode, which only works on Linux). host.docker.internal isn't correct either, since you're trying to reach another container, not your host.
https://docs.docker.com/network/bridge/
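For example (a sketch, where namenode is a hypothetical compose service name for the Hadoop container on the same bridge network as Airflow):

df = pd.read_parquet('http://namenode:9870/webhdfs/v1/hadoop_files/sample_2022_01.parquet?op=OPEN')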
I'd also recommend using the Airflow Spark operators to actually read Parquet from HDFS, using Spark rather than Pandas or WebHDFS. You can convert Spark dataframes to Pandas if needed.
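A rough sketch of that approach (the script path, task id, connection id, and namenode address are assumptions, not from the original answer):

from airflow.providers.apache.spark.operators.spark_submit import SparkSubmitOperator

read_parquet = SparkSubmitOperator(
    task_id='read_parquet',
    application='/opt/airflow/dags/includes/parquet_dag/spark_read.py',  # hypothetical Spark job
    conn_id='spark_default',
)

where spark_read.py would call something like spark.read.parquet('hdfs://namenode:9000/hadoop_files/sample_2022_01.parquet') and, if a Pandas dataframe is needed, .toPandas().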

GCS broken pipe error when trying to upload large files

I'm trying to upload a .csv.gz file to GCS after extracting it to .csv; the file size changes from 500MB to around 5GB. I'm able to extract the .csv.gz file to a temporary path, but it fails when I try to upload that file to GCS. I get the following error:
[2019-11-11 13:59:58,180] {models.py:1796} ERROR - [Errno 32] Broken pipe
Traceback (most recent call last):
File "/usr/local/lib/airflow/airflow/models.py", line 1664, in _run_raw_task
result = task_copy.execute(context=context)
File "/home/airflow/gcs/dags/operators/s3_to_gcs_transform_operator.py", line 220, in execute
gcs_hook.upload(dest_gcs_bucket, dest_gcs_object, target_file, gzip=True)
File "/home/airflow/gcs/dags/hooks/gcs_hook_conn.py", line 208, in upload
.insert(bucket=bucket, name=object, media_body=media)
File "/opt/python3.6/lib/python3.6/site-packages/googleapiclient/_helpers.py", line 130, in positional_wrapper
return wrapped(*args, **kwargs)
File "/opt/python3.6/lib/python3.6/site-packages/googleapiclient/http.py", line 835, in execute
method=str(self.method), body=self.body, headers=self.headers)
File "/opt/python3.6/lib/python3.6/site-packages/googleapiclient/http.py", line 179, in _retry_request
raise exception
File "/opt/python3.6/lib/python3.6/site-packages/googleapiclient/http.py", line 162, in _retry_request
resp, content = http.request(uri, method, *args, **kwargs)
File "/opt/python3.6/lib/python3.6/site-packages/google_auth_httplib2.py", line 198, in request
uri, method, body=body, headers=request_headers, **kwargs)
File "/usr/local/lib/airflow/airflow/contrib/hooks/gcp_api_base_hook.py", line 155, in new_request
redirections, connection_type)
File "/opt/python3.6/lib/python3.6/site-packages/httplib2/__init__.py", line 1924, in request
cachekey,
File "/opt/python3.6/lib/python3.6/site-packages/httplib2/__init__.py", line 1595, in _request
conn, request_uri, method, body, headers
File "/opt/python3.6/lib/python3.6/site-packages/httplib2/__init__.py", line 1502, in _conn_request
conn.request(method, request_uri, body, headers)
File "/opt/python3.6/lib/python3.6/http/client.py", line 1239, in request
self._send_request(method, url, body, headers, encode_chunked)
File "/opt/python3.6/lib/python3.6/http/client.py", line 1285, in _send_request
self.endheaders(body, encode_chunked=encode_chunked)
File "/opt/python3.6/lib/python3.6/http/client.py", line 1234, in endheaders
self._send_output(message_body, encode_chunked=encode_chunked)
File "/opt/python3.6/lib/python3.6/http/client.py", line 1065, in _send_output
self.send(chunk)
File "/opt/python3.6/lib/python3.6/http/client.py", line 986, in send
self.sock.sendall(data)
File "/opt/python3.6/lib/python3.6/ssl.py", line 975, in sendall
v = self.send(byte_view[count:])
File "/opt/python3.6/lib/python3.6/ssl.py", line 944, in send
return self._sslobj.write(data)
File "/opt/python3.6/lib/python3.6/ssl.py", line 642, in write
return self._sslobj.write(data)
BrokenPipeError: [Errno 32] Broken pipe
From what I understood, the error could be due to the following:
Your server process has received a SIGPIPE writing to a socket. This
usually happens when you write to a socket fully closed on the other
(client) side. This might be happening when a client program doesn't
wait till all the data from the server is received and simply closes a
socket (using close function).
But I have no idea whether this is the issue or how I can fix this. Can someone help?
You should try to upload big files in chunks:
from google.cloud import storage

# Upload in 128 MB chunks so the file goes up as a resumable upload in
# multiple requests instead of one long-lived request that can hit a
# broken pipe. chunk_size must be a multiple of 256 KB.
CHUNK_SIZE = 128 * 1024 * 1024

client = storage.Client()
bucket = client.bucket('destination')
blob = bucket.blob('really-big-blob', chunk_size=CHUNK_SIZE)
blob.upload_from_filename('/path/to/really-big-file')
You can also check Parallel Composite Uploads.
Similar SO question link.
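For reference, parallel composite uploads can be enabled from gsutil (a sketch; the file path and bucket name are placeholders):

gsutil -o GSUtil:parallel_composite_upload_threshold=150M cp /path/to/really-big-file gs://destination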

httplib2.ServerNotFoundError: Unable to find the server at accounts.google.com

I have been struggling with this problem for a while now and have scoured the internet, but I am still unable to solve it.
I am trying to run a Python script that accesses a Google Sheet from my local machine at work (so there may be firewall issues). Whenever I am connected to a guest network, the script works fine. Whenever I try to use my work internal network (with or without proxy settings), I get the following error:
Traceback (most recent call last):
File "C:\Python36-32\lib\site-packages\httplib2\__init__.py", line 995, in _conn_request
conn.connect()
File "C:\Python36-32\lib\http\client.py", line 1392, in connect
super().connect()
File "C:\Python36-32\lib\http\client.py", line 936, in connect
(self.host,self.port), self.timeout, self.source_address)
File "C:\Python36-32\lib\socket.py", line 704, in create_connection
for res in getaddrinfo(host, port, 0, SOCK_STREAM):
File "C:\Python36-32\lib\socket.py", line 745, in getaddrinfo
for res in _socket.getaddrinfo(host, port, family, type, proto, flags):
socket.gaierror: [Errno 11004] getaddrinfo failed
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "C:\Program Files\JetBrains\PyCharm Community Edition 2018.1.1\helpers\pydev\pydevd.py", line 1664, in <module>
main()
File "C:\Program Files\JetBrains\PyCharm Community Edition 2018.1.1\helpers\pydev\pydevd.py", line 1658, in main
globals = debugger.run(setup['file'], None, None, is_module)
File "C:\Program Files\JetBrains\PyCharm Community Edition 2018.1.1\helpers\pydev\pydevd.py", line 1068, in run
pydev_imports.execfile(file, globals, locals) # execute the script
File "C:\Program Files\JetBrains\PyCharm Community Edition 2018.1.1\helpers\pydev\_pydev_imps\_pydev_execfile.py", line 18, in execfile
exec(compile(contents+"\n", file, 'exec'), glob, loc)
File "C:/Git/datalab-notebooks/2017-vin-profitability/US/Finance_Mapping_test.py", line 23, in <module>
client = gspread.authorize(credentials)
File "C:\Python36-32\lib\site-packages\gspread\__init__.py", line 38, in authorize
client.login()
File "C:\Python36-32\lib\site-packages\gspread\client.py", line 52, in login
self.auth.refresh(http)
File "C:\Python36-32\lib\site-packages\oauth2client\client.py", line 668, in refresh
self._refresh(http.request)
File "C:\Python36-32\lib\site-packages\oauth2client\client.py", line 873, in _refresh
self._do_refresh_request(http_request)
File "C:\Python36-32\lib\site-packages\oauth2client\client.py", line 905, in _do_refresh_request
self.token_uri, method='POST', body=body, headers=headers)
File "C:\Python36-32\lib\site-packages\httplib2\__init__.py", line 1322, in request
(response, content) = self._request(conn, authority, uri, request_uri, method, body, headers, redirections, cachekey)
File "C:\Python36-32\lib\site-packages\httplib2\__init__.py", line 1072, in _request
(response, content) = self._conn_request(conn, request_uri, method, body, headers)
File "C:\Python36-32\lib\site-packages\httplib2\__init__.py", line 1002, in _conn_request
raise ServerNotFoundError("Unable to find the server at %s" % conn.host)
httplib2.ServerNotFoundError: Unable to find the server at accounts.google.com
After some research I get the impression that it is due to my proxy settings, but I can't figure out what I need to do to fix it. I also understand that the Python library socks may help, but once again I'm stumped.
Any help would be much appreciated.
Thanks in advance.
While this was asked a long time ago, I wanted to share the fix that worked for me.
The error occurred when I passed the proxy host with the protocol scheme included:
httplib2.ProxyInfo(
    proxy_type=httplib2.socks.PROXY_TYPE_HTTP,
    proxy_host='https://proxy.prod.com',
)
Removing the protocol solved the problem:
httplib2.ProxyInfo(
    proxy_type=httplib2.socks.PROXY_TYPE_HTTP,
    proxy_host='proxy.prod.com',
)
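A fuller sketch of how this can be wired up (the proxy host and port here are assumptions, not from the original answer):

import httplib2

proxy_info = httplib2.ProxyInfo(
    proxy_type=httplib2.socks.PROXY_TYPE_HTTP,
    proxy_host='proxy.prod.com',  # hostname only, no scheme
    proxy_port=3128,              # assumption: your proxy's port
)
http = httplib2.Http(proxy_info=proxy_info)
# oauth2client credentials can then be wrapped around this proxied
# connection, e.g. http = credentials.authorize(http)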

How to correctly upload and get blobs with the Google Cloud Python client library multithreadedly?

I am trying to download an mp3 file from a server and then upload it to Google Storage with the following code:
from google.cloud import storage

storage_client = storage.Client()  # was missing from the original snippet
bucket = storage_client.get_bucket('bucketname')
blob = bucket.blob(uploadKey)
blob.upload_from_filename(flacFileName)
Note that the downloads and uploads are multithreaded, and in some of the threads I get the following exception:
Traceback (most recent call last):
File "/usr/lib/python2.7/threading.py", line 801, in __bootstrap_inner
self.run()
File "/usr/lib/python2.7/threading.py", line 754, in run
self.__target(*self.__args, **self.__kwargs)
File "/home/tamim/upworkProjects/speech_recong/upload_to_storage.py", line 51, in downloadMp3AndUploadToBucket
blob.upload_from_filename(flacFileName)
File "/home/tamim/.local/lib/python2.7/site-packages/google/cloud/storage/blob.py", line 719, in upload_from_filename
file_obj, content_type=content_type, client=client)
File "/home/tamim/.local/lib/python2.7/site-packages/google/cloud/storage/blob.py", line 673, in upload_from_file
connection.http, request, retries=num_retries)
File "/home/tamim/.local/lib/python2.7/site-packages/google/cloud/streaming/http_wrapper.py", line 384, in make_api_request
redirections=redirections)
File "/home/tamim/.local/lib/python2.7/site-packages/google/cloud/streaming/http_wrapper.py", line 347, in _make_api_request_no_retry
redirections=redirections, connection_type=connection_type)
File "/usr/lib/python2.7/contextlib.py", line 24, in __exit__
self.gen.next()
File "/home/tamim/.local/lib/python2.7/site-packages/google/cloud/streaming/http_wrapper.py", line 107, in _httplib2_debug_level
http.connections[connection_key].set_debuglevel(old_level)
KeyError: 'https:accounts.google.com'
I also get [SSL: WRONG_VERSION_NUMBER] wrong version number (_ssl.c:2178) exceptions. These exceptions are erratic, and I have none of these problems if I don't use multiple threads. So, how do I correctly upload files to Google Storage from multiple threads?
I have looked at https://github.com/GoogleCloudPlatform/google-cloud-python/issues/1214 and https://developers.google.com/api-client-library/python/guide/thread_safety, but I couldn't find an answer for how to do it with the google-cloud-storage client library.
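The thread-safety guide linked above boils down to not sharing one HTTP connection across threads. A minimal sketch of that approach (files_to_upload is a hypothetical list of (key, filename) pairs; the bucket name comes from the question):

import threading
from google.cloud import storage

def upload_file(upload_key, file_name):
    # Each thread builds its own client, and therefore its own HTTP
    # connection, so threads never share httplib2 state.
    client = storage.Client()
    bucket = client.get_bucket('bucketname')
    bucket.blob(upload_key).upload_from_filename(file_name)

threads = [threading.Thread(target=upload_file, args=(key, name))
           for key, name in files_to_upload]
for t in threads:
    t.start()
for t in threads:
    t.join()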

py2neo SocketError: Connection refused, but curl works

I'm trying to get a Flask/Neo4j app set up on a remote Ubuntu server, and I've run into a problem that I haven't been able to figure out. My app uses py2neo, but when it tries to connect to the graph, the app crashes and the Neo4j process seems to stop. I've tried connecting in a Python shell like this:
test = Graph('http://localhost:7474/db/data/',username='neo4j',password='myPassword')
which fails, and also renders Neo4j inoperative until I restart it. However, these return 200 responses (and the web interface also works):
curl -u neo4j http://localhost:7474/db/data/
requests.get('http://localhost:7474/db/data/', auth=('neo4j','myPassword'))
I've tried to provide more information than this similar question, because it seems like the connection works from everywhere but py2neo.
Here's the full traceback:
Traceback (most recent call last):
File "/home/deploy/toponimika/toponimikaenv/lib/python3.5/site-packages/py2neo/database/__init__.py", line 318, in __new__
inst = cls.__instances[key]
KeyError: (<class 'py2neo.database.Graph'>, <ServerAddress settings={'http_port': 7474, 'host': 'localhost'}>, 'data')
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/home/deploy/toponimika/toponimikaenv/lib/python3.5/site-packages/py2neo/packages/httpstream/http.py", line 322, in submit
response = send()
File "/home/deploy/toponimika/toponimikaenv/lib/python3.5/site-packages/py2neo/packages/httpstream/http.py", line 317, in send
http.request(xstr(method), xstr(uri.absolute_path_reference), body, headers)
File "/usr/lib/python3.5/http/client.py", line 1106, in request
self._send_request(method, url, body, headers)
File "/usr/lib/python3.5/http/client.py", line 1151, in _send_request
self.endheaders(body)
File "/usr/lib/python3.5/http/client.py", line 1102, in endheaders
self._send_output(message_body)
File "/usr/lib/python3.5/http/client.py", line 934, in _send_output
self.send(msg)
File "/usr/lib/python3.5/http/client.py", line 877, in send
self.connect()
File "/home/deploy/toponimika/toponimikaenv/lib/python3.5/site-packages/py2neo/packages/httpstream/http.py", line 80, in connect
self.source_address)
File "/usr/lib/python3.5/socket.py", line 711, in create_connection
raise err
File "/usr/lib/python3.5/socket.py", line 702, in create_connection
sock.connect(sa)
ConnectionRefusedError: [Errno 111] Connection refused
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/home/deploy/toponimika/toponimikaenv/lib/python3.5/site-packages/py2neo/database/__init__.py", line 327, in __new__
use_bolt = version_tuple(inst.__remote__.get().content["neo4j_version"]) >= (3,)
File "/home/deploy/toponimika/toponimikaenv/lib/python3.5/site-packages/py2neo/database/http.py", line 154, in get
response = self.__base.get(headers=headers, redirect_limit=redirect_limit, **kwargs)
File "/home/deploy/toponimika/toponimikaenv/lib/python3.5/site-packages/py2neo/packages/httpstream/http.py", line 966, in get
return self.__get_or_head("GET", if_modified_since, headers, redirect_limit, **kwargs)
File "/home/deploy/toponimika/toponimikaenv/lib/python3.5/site-packages/py2neo/packages/httpstream/http.py", line 943, in __get_or_head
return rq.submit(redirect_limit=redirect_limit, **kwargs)
File "/home/deploy/toponimika/toponimikaenv/lib/python3.5/site-packages/py2neo/packages/httpstream/http.py", line 433, in submit
http, rs = submit(self.method, uri, self.body, self.headers)
File "/home/deploy/toponimika/toponimikaenv/lib/python3.5/site-packages/py2neo/packages/httpstream/http.py", line 362, in submit
raise SocketError(code, description, host_port=uri.host_port)
py2neo.packages.httpstream.http.SocketError: Connection refused
Anything I might try to figure out what's going on would be appreciated.
Changed to http://username:password@localhost:7474/db/data/ and it works!
Example:
test = Graph('http://username:password@localhost:7474/db/data/')
I had the same issue and solved it with a simple upgrade of the py2neo package:
pip install --upgrade py2neo
