I have some datasets in a GitHub repo and I want to move them to S3 using Python without saving anything locally.
This is my public source repo:
https://github.com/statsbomb/open-data/tree/master/data
I have seen boto3 work, but I have to save the file in my workspace before uploading it to S3.
This is too much data to download, so I want to move it directly to S3 and then start wrangling the data.
import requests
import boto3

s3 = boto3.client('s3')
bucket_name = 'your_bucket_name'

# competitions.json is a single top-level file in the repo; events, matches
# and lineups are directories of per-match files (see the sketch after this block)
datasets = ['competitions']
dataset_dirs = ['events', 'matches', 'lineups']

# Stream each top-level dataset from GitHub straight into S3, no local copy
for dataset in datasets:
    url = f'https://github.com/statsbomb/open-data/blob/master/data/{dataset}.json?raw=true'
    response = requests.get(url, stream=True)
    response.raise_for_status()
    # response.raw is a file-like stream, so boto3 can upload it directly
    s3.upload_fileobj(response.raw, bucket_name, f'{dataset}.json')
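The events, matches and lineups entries are directories in that repo, so each contained file has to be discovered and streamed individually. A minimal sketch using the public GitHub contents API is below; the endpoint and field names are standard GitHub API, but large directories can be truncated or rate-limited, so treat this as a starting point rather than a complete mirror (the git trees API, or cloning, would be more robust for a full copy).
import requests
import boto3

s3 = boto3.client('s3')
bucket_name = 'your_bucket_name'
dataset_dirs = ['events', 'matches', 'lineups']
api_base = 'https://api.github.com/repos/statsbomb/open-data/contents/data'

for directory in dataset_dirs:
    # the contents API returns a list of {'name', 'type', 'download_url', ...} entries
    listing = requests.get(f'{api_base}/{directory}').json()
    for entry in listing:
        if entry.get('type') != 'file':
            continue  # matches/ has a further level of sub-folders needing the same treatment
        response = requests.get(entry['download_url'], stream=True)
        response.raise_for_status()
        # keep the repo-relative path as the S3 key, e.g. events/<match_id>.json
        s3.upload_fileobj(response.raw, bucket_name, f"{directory}/{entry['name']}")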
Related
I would like to upload a really large file (3 TB) from a URL directly to S3, but I want to upload it in parts, similar to splitting the file on my machine and uploading the parts to S3.
I have the following script that uploads the whole file from a URL to S3. How can I modify it to upload the file in parts (file.aa, file.ab, file.ac, etc.)?
import requests
import boto3
url = "https://example.com/file.tar"
r = requests.get(url, stream=True)
session = boto3.Session()
s3 = session.resource('s3')
bucket_name = 'bucket-name'
key = 'file.tar'  # key is the name of the file in your bucket
bucket = s3.Bucket(bucket_name)
bucket.upload_fileobj(r.raw, key)
I would recommend reading this:
https://medium.com/@niyazi_erd/aws-s3-multipart-upload-with-python-and-boto3-9d2a0ef9b085
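For reference, here is a minimal sketch of the approach that article describes: streaming the download and pushing it to S3 with boto3's low-level multipart calls. The part size, URL and bucket/key are placeholders. S3 allows at most 10,000 parts and requires every part except the last to be at least 5 MB, so a 3 TB object needs parts of roughly 300 MB or larger; each part is buffered in memory here, and a production version would also want retries.
import requests
import boto3

url = "https://example.com/file.tar"
bucket_name = 'bucket-name'
key = 'file.tar'
part_size = 512 * 1024 * 1024  # >= 5 MB per part, and at most 10,000 parts in total

s3 = boto3.client('s3')
mpu = s3.create_multipart_upload(Bucket=bucket_name, Key=key)
upload_id = mpu['UploadId']
parts = []

try:
    with requests.get(url, stream=True) as r:
        r.raise_for_status()
        part_number = 1
        while True:
            # r.raw.read() can return short reads, so accumulate a full part first
            buf = b''
            while len(buf) < part_size:
                chunk = r.raw.read(part_size - len(buf))
                if not chunk:
                    break
                buf += chunk
            if not buf:
                break
            resp = s3.upload_part(Bucket=bucket_name, Key=key, UploadId=upload_id,
                                  PartNumber=part_number, Body=buf)
            parts.append({'ETag': resp['ETag'], 'PartNumber': part_number})
            part_number += 1
    s3.complete_multipart_upload(Bucket=bucket_name, Key=key, UploadId=upload_id,
                                 MultipartUpload={'Parts': parts})
except Exception:
    # abort so the incomplete parts do not keep accruing storage charges
    s3.abort_multipart_upload(Bucket=bucket_name, Key=key, UploadId=upload_id)
    raise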
I'm trying to read files residing in an S3 bucket and upload them to another S3 bucket through pre-signed URLs. Uploading a single file works fine, but when I try to upload a list of files from a bucket, the data doesn't get uploaded: the response code is always 200, yet the data never appears in the target S3 location, especially with large files.
My Python code is given below.
import boto3
import requests
from io import BytesIO
s3 = boto3.client('s3')
for file in file_list:  # list of object keys in the source bucket
    # each file gets its own pre-signed URL from this helper
    pre_url = getPreSinedUrl()
    # read the source object fully into memory
    data_in_bytes = s3.get_object(Bucket="my-bucket", Key=file)['Body'].read()
    res = requests.put(pre_url, data=BytesIO(data_in_bytes), headers=my_header)
    print(res.status_code)
Any idea why the files are not getting uploaded to S3?
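Without seeing getPreSinedUrl(), one thing worth checking is how the URL is generated: it must be signed for put_object, and any headers that were part of the signature (Content-Type, for example) have to be sent unchanged in my_header. The sketch below is only an assumption about what that helper might look like, with placeholder bucket, key and expiry values.
import boto3
import requests

s3 = boto3.client('s3')

def get_presigned_put_url(bucket: str, key: str, expires: int = 3600) -> str:
    # sign a PUT for this exact bucket/key; add ContentType to Params only
    # if the same Content-Type header is sent with the upload request
    return s3.generate_presigned_url(
        'put_object',
        Params={'Bucket': bucket, 'Key': key},
        ExpiresIn=expires,
    )

url = get_presigned_put_url('target-bucket', 'some/key.json')
res = requests.put(url, data=b'...')
res.raise_for_status()  # surfaces signature or permission errors explicitly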
import tinys3

conn = tinys3.Connection(S3_ACCESS_KEY, S3_SECRET_KEY)
f = open('sample.zip', 'rb')
conn.upload('sample.zip', f, bucketname)
I can upload the file to my bucket (test) via the code above, but I want to upload it directly to test/images/example. I am open to moving over to boto, but I can't seem to import boto.s3 in my environment.
I have looked through How to upload a file to directory in S3 bucket using boto but none of the tinys3 examples show this.
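S3 has no real directories; the "folder" is just a prefix on the object key. So reusing the exact tinys3 call from the question with a prefixed key should be enough (a sketch, assuming bucketname still refers to the test bucket and images/example is the desired prefix):
import tinys3

conn = tinys3.Connection(S3_ACCESS_KEY, S3_SECRET_KEY)
# the "folder" path lives inside the key; nothing needs to be created first
conn.upload('images/example/sample.zip', open('sample.zip', 'rb'), bucketname)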
import boto3
client = boto3.client('s3', region_name='ap-southeast-2')
client.upload_file('/tmp/foo.txt', 'my-bucket', 'test/images/example/foo.txt')
The following worked for me
from boto3.s3.transfer import S3Transfer
from boto3 import client

client_obj = client('s3',
                    aws_access_key_id='my_aws_access_key_id',
                    aws_secret_access_key='my_aws_secret_access_key')
transfer = S3Transfer(client_obj)
transfer.upload_file(src_file,
                     'my_s3_bucket_name',
                     dst_file,
                     extra_args={'ContentType': "application/zip"})
I'm using Boto3 to upload files to S3, and everything works fine. The process I follow is this:
I get a base64 image from a JavaScript FileReader object and send it to the server via AJAX. There I decode the base64 image and generate a random name for the Key argument.
import json
import base64
import boto3
data = json.loads(message['text'])
dec = base64.b64decode(data['image'])
s3 = boto3.resource('s3')
s3.Bucket('bucket_name').put_object(Key='random_generated_name.png', Body=dec,
                                    ContentType='image/png', ACL='public-read')
This works fine, but with respect to performance, is there a better way to do it?
I used this and I believe it's more effective and Pythonic.
import boto3
s3 = boto3.client('s3')
bucket = 'your-bucket-name'
file_name = 'location-of-your-file'
key_name = 'name-of-file-in-s3'
s3.upload_file(file_name, bucket, key_name)
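upload_file expects a path on disk, though. For data that is already in memory, like the decoded base64 image above, wrapping the bytes in BytesIO and using upload_fileobj avoids writing a temporary file while still going through boto3's managed transfer. A sketch with placeholder bucket and key names:
import base64
from io import BytesIO
import boto3

s3 = boto3.client('s3')
dec = base64.b64decode(data['image'])  # the bytes decoded from the AJAX payload
s3.upload_fileobj(BytesIO(dec), 'bucket_name', 'random_generated_name.png',
                  ExtraArgs={'ContentType': 'image/png', 'ACL': 'public-read'})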
I am working on a process to dump files from a Redshift database, and I would prefer not to download the files locally in order to process the data. I saw that Java has a StreamingObject class that does what I want, but I haven't seen anything similar in boto3.
If you have a mybucket S3 bucket, which contains a beer key, here is how to download and fetch the value without storing it in a local file:
import boto3
s3 = boto3.resource('s3')
print(s3.Object('mybucket', 'beer').get()['Body'].read())
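For large dumps, .read() pulls the whole object into memory. The Body returned above is a botocore StreamingBody, so it can also be consumed incrementally; a small sketch with the same placeholder bucket and key (process is a stand-in for whatever per-record handling you need):
import boto3

s3 = boto3.resource('s3')
body = s3.Object('mybucket', 'beer').get()['Body']  # botocore StreamingBody

# iterate line by line instead of loading the whole object into memory
for line in body.iter_lines():
    process(line)  # hypothetical per-line handler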
smart_open is a Python 3 library for efficient streaming of very large files from/to storages such as S3, GCS, Azure Blob Storage, HDFS, WebHDFS, HTTP, HTTPS, SFTP, or local filesystem.
https://pypi.org/project/smart-open/
import json
import boto3
import smart_open

client = boto3.client(service_name='s3',
                      aws_access_key_id=AWS_ACCESS_KEY_ID,
                      aws_secret_access_key=AWS_SECRET_KEY)
url = 's3://.............'
fin = smart_open.open(url, 'r', transport_params={'client': client})
for line in fin:
    data = json.loads(line)
    print(data)
fin.close()
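smart_open works in the other direction too, which fits the original goal of writing to S3 without a local copy; a sketch, assuming the same client and a placeholder bucket/key:
import smart_open

# 'w' opens a streaming upload to the given S3 key
with smart_open.open('s3://your-bucket/some/key.txt', 'w',
                     transport_params={'client': client}) as fout:
    fout.write('some content')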
This may or may not be relevant to what you want to do, but for my situation one thing that worked well was using tempfile:
import tempfile
import boto3
bucket_name = '[BUCKET_NAME]'
key_name = '[OBJECT_KEY_NAME]'
s3 = boto3.resource('s3')
temp = tempfile.NamedTemporaryFile()
s3.Bucket(bucket_name).download_file(key_name, temp.name)
# do what you will with your file...
temp.close()
I actually use this solution:
import boto3

s3_client = boto3.client('s3')

def get_content_from_s3(bucket: str, key: str) -> str:
    """Return the S3 object's content as a string, without saving it locally.

    param: bucket, s3 bucket
    param: key, path to the file, f.i. folder/subfolder/file.txt
    """
    s3_file = s3_client.get_object(Bucket=bucket, Key=key)['Body'].read()
    return s3_file.decode('utf-8').strip()
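Called, for example, like this (bucket and key are placeholders):
content = get_content_from_s3('my-bucket', 'folder/subfolder/file.txt')
print(content)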