How To Use boto3 To Retrieve S3 File Size - python

I'm new to Python and boto, and I'm currently trying to write a DAG that will check an S3 file's size given the bucket location and file name. How can I take the file location (s3://bucket-info/folder/filename) and get the size of the file? If the file size is greater than 0 KB, I will need to fail the job.
Thank you for your time

You can use boto3's head_object for this.
Here's something that will get you the size. Replace the bucket and key with your own values:
import boto3
client = boto3.client(service_name='s3', use_ssl=True)
response = client.head_object(
    Bucket='bucketname',
    Key='full/path/to/file.jpg'
)
print(response['ContentLength'])
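Since the question starts from a full s3:// URL, here is a minimal sketch (with a hypothetical helper name and the placeholder URL from the question) that splits the URL into bucket and key, reads ContentLength, and fails the job when the size is greater than 0:
import boto3
from urllib.parse import urlparse

def get_s3_object_size(s3_url):
    # Split s3://bucket-info/folder/filename into bucket and key.
    parsed = urlparse(s3_url)
    bucket, key = parsed.netloc, parsed.path.lstrip('/')
    response = boto3.client('s3').head_object(Bucket=bucket, Key=key)
    return response['ContentLength']

size = get_s3_object_size('s3://bucket-info/folder/filename')
if size > 0:
    raise ValueError(f"File is {size} bytes; failing the job")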

You can also get a list of all objects if multiple files need to be checked. For a given bucket, run list_objects_v2 and then iterate through the response's 'Contents'. For example:
s3_client = boto3.client('s3')
response_contents = s3_client.list_objects_v2(
    Bucket='name_of_bucket'
).get('Contents')
you'll get a list of dictionaries like this:
[{'Key': 'path/to/object1', 'LastModified': datetime, 'ETag': '"some etag"', 'Size': 2600, 'StorageClass': 'STANDARD'}, {'Key': 'path/to/object2', 'LastModified': 'datetime', 'ETag': '"some etag"', 'Size': 454, 'StorageClass': 'STANDARD'}, ... ]
Notice that each dictionary in the list contains a 'Size' key, which is the size of that particular object in bytes. The list is iterable:
for rc in response_contents:
    if rc.get('Key') == 'path/to/file':
        print(f"Size: {rc.get('Size')}")
You'll get the sizes of all the files you're interested in:
Size: 2600
Size: 454
Size: 2600
...
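Note that a single list_objects_v2 call returns at most 1,000 keys; if the bucket holds more, a paginator will walk every page. A minimal sketch, assuming the same name_of_bucket:
import boto3

s3_client = boto3.client('s3')
paginator = s3_client.get_paginator('list_objects_v2')

# Iterate over every page of results and print each key with its size in bytes.
for page in paginator.paginate(Bucket='name_of_bucket'):
    for obj in page.get('Contents', []):
        print(f"{obj['Key']}: {obj['Size']}")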

Related

How to filter Amazon EBS snapshots by image (AMI) ID?

I would like to get all Amazon EBS snapshots that are associated with a certain AMI (image).
Is that possible?
I can filter by tag, e.g.:
previous_snapshots = ec2.describe_snapshots(Filters=[{'Name': 'tag:SnapAndDelete', 'Values': ['True']}])['Snapshots']
for snapshot in previous_snapshots:
    print('Deleting snapshot {}'.format(snapshot['SnapshotId']))
    ec2.delete_snapshot(SnapshotId=snapshot['SnapshotId'])
Is there a filter to use an ImageId instead?
Or would it be possible to get a list of all snapshots associated with an image ID using describe_images? I would be happy with either of the two, e.g.:
images = ec2.describe_images(Owners=['self'])['Images']
Thanks
When calling describe_images(), Amazon EBS Snapshots are referenced in the BlockDeviceMappings section:
{
    'Images': [
        {
            'CreationDate': 'string',
            'ImageId': 'string',
            'Platform': 'Windows',
            'BlockDeviceMappings': [
                {
                    'DeviceName': 'string',
                    'VirtualName': 'string',
                    'Ebs': {
                        'Iops': 123,
                        'SnapshotId': 'string',   <--- This is the Amazon EBS Snapshot
                        'VolumeSize': 123,
                        'VolumeType': 'standard'|'io1'|'io2'|'gp2'|'sc1'|'st1'|'gp3'
                        ...
Therefore, if you wish to retrieve all Amazon EBS Snapshots for a given AMI you can use code like this:
import boto3
ec2_client = boto3.client('ec2')
response = ec2_client.describe_images(ImageIds=['ami-1234'])
first_ami = response['Images'][0]
snapshots = [device['Ebs']['SnapshotId'] for device in first_ami['BlockDeviceMappings'] if 'Ebs' in device]
for snapshot in snapshots:
    print(snapshot)
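If the end goal is the same cleanup as the tag-based loop in the question, here is a sketch (using the placeholder 'ami-1234' from above) that deregisters the AMI and then deletes its snapshots; a snapshot that is still referenced by a registered AMI cannot be deleted, which is why the image is deregistered first:
import boto3

ec2_client = boto3.client('ec2')

response = ec2_client.describe_images(ImageIds=['ami-1234'])
first_ami = response['Images'][0]
snapshots = [device['Ebs']['SnapshotId']
             for device in first_ami['BlockDeviceMappings'] if 'Ebs' in device]

# Deregister the AMI first, then delete the snapshots it referenced.
ec2_client.deregister_image(ImageId=first_ami['ImageId'])
for snapshot_id in snapshots:
    print('Deleting snapshot {}'.format(snapshot_id))
    ec2_client.delete_snapshot(SnapshotId=snapshot_id)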

S3 show buckets last modified

I'm trying to list the last modified file in S3 buckets for a report, but the report is showing the first modified file (i.e. when the first file was uploaded, not the last).
I'm using this:
top_level_folders[folder]['modified'] = obj.last_modified
and adding to the report here:
report.add_row([folder[1]['name'], folder[1]['objects'],
                str(round(folder[1]['size'], 2)), status, folder[1]['modified']])
I've tried adding
=obj.last_modified, reverse=True but keep getting invalid syntax errors.
This is what the report looks like:
I'm not exactly sure what you're doing when it comes to writing to the report, but the code below will return a list of dictionaries with the name of each bucket and the time its most recently modified file was last modified. E.g.,
[
    {
        'Folder': 'bucket_1',
        'Last Modified': '2021-11-30 13:10:32+00:00'
    },
    {
        'Folder': 'bucket_2',
        'Last Modified': '2021-09-27 17:18:27+00:00'
    }
]
import datetime
import boto3

s3_client = boto3.client('s3',
                         aws_access_key_id="AKXXXXXXXXXXXXXXXXXX",
                         aws_secret_access_key="YYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYY",
                         region_name="eu-west-2")

def find_last_modified_file_in_bucket(bucket_name: str) -> datetime.datetime:
    last_modified = []
    for bucket_object in s3_client.list_objects(Bucket=bucket_name)["Contents"]:
        last_modified.append(bucket_object["LastModified"])
    return max(last_modified)

def fetch_last_modified() -> list[dict]:
    last_modified_file_by_bucket: list[dict] = []
    for bucket_name in [bucket["Name"] for bucket in s3_client.list_buckets()["Buckets"]]:
        latest_time_of_last_modified_file = find_last_modified_file_in_bucket(bucket_name)
        last_modified_file_by_bucket.append(
            {
                "Folder": bucket_name,
                "Last Modified": str(latest_time_of_last_modified_file)
            }
        )
    return last_modified_file_by_bucket
Without the source code or knowledge of the report and folder types, I can't say with certainty how you would use the above code to update the report, but it will likely come down to iterating over the list returned by fetch_last_modified(). E.g.,
def update_report(report: Report, folder_with_last_modified: list):
    for folder in folder_with_last_modified:
        report.add_row(folder['Folder'], folder['Last Modified'])

folder_with_last_modified = fetch_last_modified()
update_report(report, folder_with_last_modified)
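As an aside, the key=... and reverse=True arguments the question was reaching for belong to sorted(), not to the attribute itself. Here is a sketch of picking the most recently modified object per bucket with the resource API (the bucket name is a placeholder):
import boto3

s3 = boto3.resource('s3')
bucket = s3.Bucket('name_of_bucket')  # placeholder bucket name

# Sort all objects by their LastModified timestamp, newest first.
objects_newest_first = sorted(bucket.objects.all(),
                              key=lambda obj: obj.last_modified,
                              reverse=True)
if objects_newest_first:
    print(objects_newest_first[0].key, objects_newest_first[0].last_modified)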

JDownloader API json.decoder.JSONDecodeError

I am using the Python API of JDownloader, myjdapi.
With device.linkgrabber.query_links() I got the following object:
{'enabled': True, 'name': 'EQJ_X8gUcAMQX13.jpg', 'packageUUID': 1581524887390, 'uuid': 1581524890696, 'url': 'https://pbs.twimg.com/media/x.jpg?name=orig', 'availability': 'ONLINE'}
Now I want to move it to the download list with the function:
device.linkgrabber.move_to_downloadlist('1581524890696', '1581524887390')
The move_to_downloadlist function (from the GitHub repo) says:
def move_to_downloadlist(self, link_ids, package_ids):
    """
    Moves packages and/or links to download list.
    :param package_ids: Package UUID's.
    :type: list of strings.
    :param link_ids: Link UUID's.
    """
    params = [link_ids, package_ids]
    resp = self.device.action(self.url + "/moveToDownloadlist", params)
    return resp
But I always get json.decoder.JSONDecodeError: Expecting value: line 1 column 1 (char 0).
The official API says it's a 200 error, and the reason can be anything.
How can I fix that?
The parameter names are link_ids and package_ids, both plural. That is a good indication that lists are expected here, not single values.
Try this:
device.linkgrabber.move_to_downloadlist(['1581524890696'], ['1581524887390'])
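If you want to build those lists straight from query_links(), here is a sketch based on the 'uuid' and 'packageUUID' fields shown above (device is assumed to be an already-connected myjdapi device):
# Collect link and package UUIDs from the linkgrabber query,
# then move them all to the download list in one call.
links = device.linkgrabber.query_links()
link_ids = [link['uuid'] for link in links]
package_ids = list({link['packageUUID'] for link in links})
device.linkgrabber.move_to_downloadlist(link_ids, package_ids)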

Python code breaks when attempting to download larger zipped CSV file, works fine on smaller file

While working with small zip files (about 8 MB) containing 25 MB of CSV files, the code below works exactly as it should. As soon as I attempt to download larger files (a 45 MB zip file containing a 180 MB CSV), the code breaks and I get the following error message:
(venv) ufulu#ufulu awr % python get_awr_ranking_data.py
https://api.awrcloud.com/v2/get.php?action=get_topsites&token=REDACTED&project=REDACTED Client+%5Bw%5D&fileName=2017-01-04-2019-10-09
Traceback (most recent call last):
File "get_awr_ranking_data.py", line 101, in <module>
getRankingData(project['name'])
File "get_awr_ranking_data.py", line 67, in getRankingData
processRankingdata(rankDateData['details'])
File "get_awr_ranking_data.py", line 79, in processRankingdata
domain.append(row.split("//")[-1].split("/")[0].split('?')[0])
AttributeError: 'float' object has no attribute 'split'
My goal is to download data for 170 projects and save the data to a SQLite DB.
Please bear with me as I am a novice in the field of programming and Python. I would greatly appreciate any help fixing the code below, as well as any other suggestions and improvements to make the code more robust and pythonic.
Thanks in advance
from dotenv import dotenv_values
import requests
import pandas as pd
from io import BytesIO
from zipfile import ZipFile
from sqlalchemy import create_engine

# SQL Alchemy setup
engine = create_engine('sqlite:///rankingdata.sqlite', echo=False)

# Excerpt from the initial API call
data = {'projects': [{'name': 'Client1',
                      'id': '168',
                      'frequency': 'daily',
                      'depth': '5',
                      'kwcount': '80',
                      'last_updated': '2019-10-01',
                      'keywordstamp': 1569941983},
                     {
                         "depth": "5",
                         "frequency": "ondemand",
                         "id": "194",
                         "kwcount": "10",
                         "last_updated": "2019-09-30",
                         "name": "Client2",
                         "timestamp": 1570610327
                     },
                     {
                         "depth": "5",
                         "frequency": "ondemand",
                         "id": "196",
                         "kwcount": "100",
                         "last_updated": "2019-09-30",
                         "name": "Client3",
                         "timestamp": 1570610331
                     }
                     ]}

# Setup
api_url = 'https://api.awrcloud.com/v2/get.php?action='
urls = []        # processed URLs
urlbacklog = []  # URLs that didn't return a downloadable file

# API call to receive the URL containing the downloadable zip and CSV
def getRankingData(project):
    action = 'get_dates'
    response = requests.get(''.join([api_url, action]),
                            params=dict(token=dotenv_values()['AWR_API'],
                                        project=project))
    response = response.json()
    action2 = 'topsites_export'
    rankDateData = requests.get(''.join([api_url, action2]),
                                params=dict(token=dotenv_values()['AWR_API'],
                                            project=project,
                                            startDate=response['details']['dates'][0]['date'],
                                            stopDate=response['details']['dates'][-1]['date']))
    rankDateData = rankDateData.json()
    print(rankDateData['details'])
    urls.append(rankDateData['details'])
    processRankingdata(rankDateData['details'])

# API call to download and unzip the CSV data and process it in pandas
def processRankingdata(url):
    content = requests.get(url)
    # {"response_code":25,"message":"Export in progress. Please come back later"}
    if "response_code" not in content:
        f = ZipFile(BytesIO(content.content))
        # print(f.namelist()) to get all filenames in the zip
        with f.open(f.namelist()[0], 'r') as g:
            rankingdatadf = pd.read_csv(g)
        rankingdatadf = rankingdatadf[rankingdatadf['Search Engine'].str.contains("Google")]
        domain = []
        for row in rankingdatadf['URL']:
            domain.append(row.split("//")[-1].split("/")[0].split('?')[0])
        rankingdatadf['Domain'] = domain
        rankingdatadf['Domain'] = rankingdatadf['Domain'].str.replace('www.', '')
        rankingdatadf = rankingdatadf.drop(columns=['Title', 'Meta description', 'Snippet', 'Page'])
        print(rankingdatadf['Search Engine'][0])
        writeData(rankingdatadf)
    else:
        urlbacklog.append(url)
        pass

# Finally write the data to the database
def writeData(rankingdatadf):
    table_name_from_file = project['name']
    check = engine.has_table(table_name_from_file)
    print(check)  # boolean
    if check == False:
        rankingdatadf.to_sql(table_name_from_file, con=engine)
        print(project['name'] + ' ...Done')
    else:
        print(project['name'] + ' ... already in DB')

for project in data['projects']:
    getRankingData(project['name'])
The problem seems to be the split call on a float and not necessarily the download. Try changing line 79
from
domain.append(row.split("//")[-1].split("/")[0].split('?')[0])
to
domain.append(str(str(str(row).split("//")[-1]).split("/")[0]).split('?')[0])
It looks like you're trying to parse the network location portion of the URL here; you can also use urllib.parse to make this easier instead of chaining all the splits:
from urllib.parse import urlparse
...
for row in rankingdatadf['URL']:
    domain.append(urlparse(row).netloc)
I think a malformed URL is causing you issues; try this to diagnose the problem:
for row in rankingdatadf['URL']:
    try:
        domain.append(urlparse(row).netloc)
    except Exception:
        exit(row)
Looks like you figured it out above: you have a database entry with a NULL value for the URL field. I'm not sure what your fidelity requirements for this data set are, but you might want to enforce database rules for the URL field, or use pandas to drop rows where URL is NaN:
rankingdatadf = rankingdatadf.dropna(subset=['URL'])
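Putting the two suggestions together (drop rows with a missing URL, then keep only the network location), here is a minimal sketch of the domain-extraction step with a hypothetical stand-in DataFrame:
import pandas as pd
from urllib.parse import urlparse

# Hypothetical stand-in for the DataFrame read from the zipped CSV.
rankingdatadf = pd.DataFrame({'URL': ['https://www.example.com/page?x=1', None]})

# Drop rows with a missing URL first, then extract the network location.
rankingdatadf = rankingdatadf.dropna(subset=['URL'])
rankingdatadf['Domain'] = [urlparse(row).netloc for row in rankingdatadf['URL']]
rankingdatadf['Domain'] = rankingdatadf['Domain'].str.replace('www.', '', regex=False)
print(rankingdatadf)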

Unable to get just subfolder objects from s3 aws

I am using this function to get data from S3:
s3 = boto3.resource('s3')
s3client = boto3.client('s3')
Bucket = s3.Bucket('ais-django')
obj = s3.Object('ais-django', 'Event/')
list = s3client.list_objects_v2(Bucket='ais-django', Prefix='Event/')
for s3_key in list:
    filename = s3_key['Key']
When I use a prefix for the Event folder (the path is like 'ais-django/Event/'), it gives abnormal output like this:
{
    'IsTruncated': False,
    'Prefix': 'Event/',
    'ResponseMetadata': {
        'HTTPHeaders': {
            'date': 'Mon, 11 Jun 2018 12:42:35 GMT',
            'content-type': 'application/xml',
            'transfer-encoding': 'chunked',
            'x-amz-bucket-region': 'us-east-1',
            'x-amz-request-id': '94ADDB21361252F3',
            'server': 'AmazonS3',
            'x-amz-id-2': 'IVuVQuB2V7nClm5FaX4FRbt6brS3gAiuwpERnZxknIWoZLH65LerURwmoynKW5sv37VP6FdbYho='
        },
        'RequestId': '94ADDB21361252F3',
        'RetryAttempts': 0,
        'HostId': 'IVuVQuB2V7nClm5FaX4FRbt6brS3gAiuwpERnZxknIWoZLH65LerURwmoynKW5sv37VP6FdbYho=',
        'HTTPStatusCode': 200
    },
    'MaxKeys': 1000,
    'Name': 'ais-django',
    'KeyCount': 0
}
while without the prefix, when I call it like this:
list = s3client.list_objects_v2(Bucket='ais-django')['Contents']
it gives a list of all objects.
So how can I get all the objects in a specific folder?
This is the way you should do it :)
import boto3
s3 = boto3.resource('s3')
bucket = s3.Bucket('ais-django')
for o in bucket.objects.filter(Prefix='Event/test-event'):
    print(o.key)
The result will contain Event/test-event/ itself, because there is no folder system in AWS S3; everything is an object, so Event/test-event/ as well as Event/test-event/image.jpg are both considered objects.
If you want only the contents, i.e. images only, you can do it like this:
import boto3
s3 = boto3.resource('s3')
bucket = s3.Bucket('ais-django')
for o in bucket.objects.filter(Prefix='Event/test-event'):
    filename = o.key
    if filename.endswith(".jpeg") or filename.endswith(".jpg") or filename.endswith(".png"):
        print(o.key)
In this case we get Event/test-event/18342087_1323920084341024_7613721308394107132_n.jpg as a result, since we are filtering the results and this is the only image object in my bucket right now.
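If you'd rather keep the client API from the question, the same prefix filtering works with list_objects_v2; a minimal sketch:
import boto3

s3client = boto3.client('s3')
response = s3client.list_objects_v2(Bucket='ais-django', Prefix='Event/')

# 'Contents' is absent from the response when no keys match the prefix,
# so default to an empty list before iterating.
for s3_key in response.get('Contents', []):
    print(s3_key['Key'])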
