Is there any effective way to import a CSV/text file from an Amazon S3 bucket into MS SQL Server 2012/2014 with BCP or Python (without using SSIS)?
I have found this answer: Read a file line by line from S3 using boto?, but I am not sure whether it is an effective and secure approach compared with downloading the file and then using BCP.
Thanks.
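For what it's worth, here is a minimal sketch of the streaming approach the linked answer hints at, assuming boto3 and pyodbc; the bucket, key, table, and connection string are hypothetical placeholders, and plain BCP on a downloaded file may still be faster for very large loads.
import codecs
import csv

import boto3
import pyodbc

# Stream the object from S3 without writing it to disk first
s3 = boto3.client('s3')
obj = s3.get_object(Bucket='my-bucket', Key='data.csv')
rows = csv.reader(codecs.getreader('utf-8')(obj['Body']))

# Bulk-insert into SQL Server (placeholder table and connection string)
conn = pyodbc.connect('DRIVER={ODBC Driver 17 for SQL Server};SERVER=myserver;DATABASE=mydb;UID=user;PWD=pass')
cur = conn.cursor()
cur.fast_executemany = True  # batches the parameterized INSERTs
cur.executemany('INSERT INTO dbo.my_table (col1, col2) VALUES (?, ?)', list(rows))
conn.commit()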
Related
I am trying to upload a file from an SFTP server to a GCS bucket using a Cloud Function, but this code is not working. I am able to connect over SFTP, but when I try to upload the file to the GCS bucket it fails; the requirement is to use a Cloud Function with Python.
Any help will be appreciated. Here is the sample code I am trying. This code works except for sftp.get("test_report.csv", bucket_destination). Please help.
destination_bucket = "gs://test-bucket/reports"
with pysftp.Connection(host, username, password=sftp_password) as sftp:
    print("Connection successfully established ... ")
    # Switch to a remote directory
    sftp.cwd('/test/outgoing/')
    bucket_destination = "destination_bucket"
    sftp.cwd('/test/outgoing/')
    if sftp.exists("test_report.csv"):
        sftp.get("test_report.csv", bucket_destination)
    else:
        print("doesnt exist")
pysftp cannot work with GCS directly.
In my opinion, you cannot actually upload a file directly from SFTP to GCS anyway, at least not from code running on yet another machine. But you can transfer the file without storing it on the intermediate machine, by combining pysftp Connection.open (or better, Paramiko SFTPClient.open) with the GCS API Blob.upload_from_file. That's what many actually mean by "directly".
client = storage.Client(credentials=credentials, project='myproject')
bucket = client.get_bucket('mybucket')
blob = bucket.blob('test_report.csv')
with sftp.open('test_report.csv', bufsize=32768) as f:
    blob.upload_from_file(f)
For the rest of the GCP code, see How to upload a file to Google Cloud Storage on Python 3?
For the purpose of bufsize, see Reading file opened with Python Paramiko SFTPClient.open method is slow.
Consider not using pysftp; it's a dead project. Use Paramiko directly (the code will be mostly the same). See pysftp vs. Paramiko.
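For completeness, a hedged sketch of the same transfer done with Paramiko directly instead of pysftp; the host, credentials, bucket, and paths are taken from the question and are assumptions:
import paramiko
from google.cloud import storage

# Hypothetical connection details -- adjust to your environment
ssh = paramiko.SSHClient()
ssh.set_missing_host_key_policy(paramiko.AutoAddPolicy())
ssh.connect(host, username=username, password=sftp_password)
sftp = ssh.open_sftp()

client = storage.Client()
blob = client.get_bucket('mybucket').blob('test_report.csv')

# Stream the remote file straight into the blob, no temporary file on disk
with sftp.open('/test/outgoing/test_report.csv', bufsize=32768) as f:
    blob.upload_from_file(f)

sftp.close()
ssh.close()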
I got the solution based on your reply, thanks. Here is the code:
bucket = client.get_bucket(destination_bucket)
blob = bucket.blob(destination_folder + filename)
with sftp.open(sftp_filename, bufsize=32768) as f:
    blob.upload_from_file(f)
This is exactly what we built SFTP Gateway for. We have a lot of customers that still want to use SFTP, but we needed to write files directly to Google Cloud Storage. Files don't get saved temporarily on another machine; the data is streamed directly from the SFTP client (Python, FileZilla, or any other client) straight to GCS.
https://console.cloud.google.com/marketplace/product/thorn-technologies-public/sftp-gateway?project=thorn-technologies-public
Full disclosure: this is our product and we use it for all our consulting clients. We are happy to help you get it set up if you want to try it.
I have a use case where I need to build an S3 navigator that allows users to browse S3 files and view them without being given any sort of AWS access, so users need not have AWS credentials configured on their systems.
The approach I tried is to create a Python app using tkinter and to access S3 through an API Gateway proxy to S3. All of this works fine for txt files in S3, but reading feather files with the call below throws the following error:
s3_data = pd.read_feather("https://<api_gateway>/final/s3?key=naxi143/data.feather")
File "C:\Users\<User>\AppData\Local\Packages\PythonSoftwareFoundation.Python.3.8_qbz5n2kfra8p0\LocalCache\local-packages\Python38\site-packages\pandas\io\feather_format.py", line 130, in read_feather
return feather.read_feather(
File "C:\Users\<User>\AppData\Local\Packages\PythonSoftwareFoundation.Python.3.8_qbz5n2kfra8p0\LocalCache\local-packages\Python38\site-packages\pyarrow\feather.py", line 218, in read_feather
return (read_table(source, columns=columns, memory_map=memory_map)
File "C:\Users\<User>\AppData\Local\Packages\PythonSoftwareFoundation.Python.3.8_qbz5n2kfra8p0\LocalCache\local-packages\Python38\site-packages\pyarrow\feather.py", line 239, in read_table
reader = _feather.FeatherReader(source, use_memory_map=memory_map)
File "pyarrow\_feather.pyx", line 75, in pyarrow._feather.FeatherReader.__cinit__
File "pyarrow\error.pxi", line 143, in pyarrow.lib.pyarrow_internal_check_status
File "pyarrow\error.pxi", line 114, in pyarrow.lib.check_status
OSError: Verification of flatbuffer-encoded Footer failed.
Not sure if some settings are misconfigured on the API Gateway side.
Is there any other way to make this work without involving AWS credentials?
Update
It looks like API Gateway has a payload limit of 10 MB, which puts this solution out of scope for me, as most of my data is larger than that. Is there any other way to achieve the same thing without using AWS credentials?
The Intake server can be used as a data gateway, if you wish, and Intake's plugins allow communication with S3 natively via fsspec/s3fs. Intake deals in datasets, not files, so you would want to find the correct invocation for each dataset you want to read (i.e., the set of arguments that pandas would normally take) and write descriptions and metadata before launching the server.
There is no feather driver, however (unlike parquet), although one would be easy to write. The intake-dremio package, for instance, already interfaces with arrow transport directly.
I think the solution you're looking for is API Gateway + pre-signed S3 URLs + 303 HTTP redirects + CORS. That securely gets around the API Gateway payload limit because it uses a signed redirect to the S3 object itself. Here is a really good explanation of how to set that up:
https://advancedweb.hu/how-to-solve-cors-problems-when-redirecting-to-s3-signed-urls/
It essentially comes down to setting some headers in the REST call to configure CORS so that a client is allowed to follow a 303 redirect to a different domain (CORS is protection against cross-site-scripting types of attacks). But because there are security implications, I suggest reading the whole article and understanding what you're allowing, rather than just copying the header names and values.
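For illustration, a hedged sketch of the redirect side: a Lambda proxy integration behind API Gateway that returns a 303 redirect to a short-lived pre-signed S3 URL. The bucket name and query parameter are assumptions, and the CORS headers still need to match what the article describes.
import boto3

s3 = boto3.client('s3')
BUCKET = 'my-data-bucket'  # assumption

def handler(event, context):
    key = event['queryStringParameters']['key']
    url = s3.generate_presigned_url(
        'get_object',
        Params={'Bucket': BUCKET, 'Key': key},
        ExpiresIn=300,  # the signed URL expires after 5 minutes
    )
    return {
        'statusCode': 303,
        'headers': {
            'Location': url,
            'Access-Control-Allow-Origin': '*',  # tighten this in practice
        },
        'body': '',
    }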
Situation: my AWS Lambda analyzes a given file and returns cleaned data.
Input: path of the file given by the user
Output: data dictionary
Currently, in my Lambda I:
1. save the file from the local PC to an S3 bucket
2. load it from S3 into my Lambda
3. analyze the file
4. delete it from S3
Can I simplify the process by loading the file directly into the Lambda's memory ("cache"), i.e.:
1. load the file from the local PC into my Lambda
2. analyze the file
No, you cannot directly load the file from the local PC into the Lambda or its tmp storage.
But if you want, you can use a Storage Gateway, which automatically syncs any file from a physical drive (your local PC) into S3. This would help you eliminate step 1 (save the file from the local PC to S3).
First of all, you might be using the wrong pattern. Just upload the file to S3 using the AWS SDK and handle it in a Lambda triggered by the S3 ObjectCreated event.
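A minimal sketch of that pattern, assuming the Lambda is subscribed to the bucket's s3:ObjectCreated:* notifications; the analysis step is just a placeholder.
import json

import boto3

s3 = boto3.client('s3')

def handler(event, context):
    results = []
    # One record per object-created notification
    for record in event['Records']:
        bucket = record['s3']['bucket']['name']
        key = record['s3']['object']['key']
        body = s3.get_object(Bucket=bucket, Key=key)['Body'].read()
        # ... analyze the file here and build the cleaned-data dictionary ...
        results.append({'key': key, 'size': len(body)})
    return {'statusCode': 200, 'body': json.dumps(results)}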
I have a URL (https://example.com/myfile.txt) of a file and I want to upload it to my bucket (gs://my-sample-bucket) on Google Cloud Storage.
What I am currently doing is:
Downloading the file to my system using the requests library.
Uploading that file to my bucket using a Python function.
Is there any way I can upload the file directly using the URL?
You can use the urllib2 or requests library to get the file over HTTP, then use your existing Python code to upload it to Cloud Storage. Something like this should work:
import urllib2
from google.cloud import storage
from google.cloud.storage import Blob

client = storage.Client()
filedata = urllib2.urlopen('http://example.com/myfile.txt')
datatoupload = filedata.read()
bucket = client.get_bucket('bucket-id-here')
blob = Blob("myfile.txt", bucket)
blob.upload_from_string(datatoupload)
It still downloads the file into memory on your system, but I don't think there's a way to tell Cloud Storage to fetch the URL for you.
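A hedged Python 3 variant of the same idea using requests, streaming the response body into the blob so the whole file does not have to sit in memory; the bucket and file names are the ones from the question.
import requests
from google.cloud import storage

client = storage.Client()
blob = client.get_bucket('my-sample-bucket').blob('myfile.txt')

# stream=True keeps the body as a file-like object instead of loading it all at once
with requests.get('https://example.com/myfile.txt', stream=True) as resp:
    resp.raise_for_status()
    resp.raw.decode_content = True  # transparently handle gzip/deflate encoding
    blob.upload_from_file(resp.raw)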
There is a way to do this using a Cloud Storage Transfer job, but depending on your use case, it may or may not be worth it. You would need to create a transfer job that transfers a URL list.
I marked this question as a duplicate of this one.
I have a JSON file named '203456_instancef9_code323.json' in my C:\temp\testfiles directory and I want to copy it to my Amazon S3 bucket 'input-derived-files' using Python and the boto library, but it keeps throwing exceptions saying the file does not exist. I have a valid access key ID and secret key and can establish a connection to AWS. Could someone help me with the best code to script this, please? Many thanks for your contribution.
Here is the code that you need, based on boto3; it is the latest boto library and is maintained. You need to make sure that you use forward slashes in the directory path. I have tested this code on Windows and it works.
import boto3

s3 = boto3.resource('s3')
s3.meta.client.upload_file('C:/temp/testfiles/203456_instancef9_code323.json',
                           'input-derived-files', '203456_instancef9_code323.json')