Amazon S3 query using regular expression - python

I have the following folder structure in S3.
myBucket/20190313/20190313_100000/_SUCCESS
I need to check whether there is a _SUCCESS file present.
The query I am currently using:
date = '20190313'
bucket = s3Resource.Bucket('myBucket')
objs = list(bucket.objects.filter(Prefix=date + '/'))
I do not know what will be inside the date folder, but it is in yyyymmdd_hhmmss format.
Is there a way to query for that specific "_SUCCESS" file, if only "myBucket/20190313/" is known?

The API docs say that you can't use a regex:
Limits the response to keys that begin with the specified prefix. You can use prefixes to separate a bucket into different groupings of keys. (You can think of using prefix to make groups in the same way you would use a folder in a file system.)

It's a long shot, depending on your object keys, but you could combine the prefix and the delimiter properties.
e.g.:
Prefix = date+'/'
Delimiter = '_'
As I'm sure that you are aware, there is no folder structure in S3 keys, rather a unique string to identify the object. The use of a delimiter of '/' creates a hierarchy, or more of a virtual folder structure.
Use of the delimiter property will change the virtual folder structure from '/' to '_'. Provided that you don't use underscores in other keys, it will return the strings between the end of the prefix and the next occurrence of the delimiter, i.e. up to the '_' in '_SUCCESS'.
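A minimal boto3 sketch of that Prefix + Delimiter combination (bucket name taken from the question; whether the roll-up actually isolates '_SUCCESS' depends on your other keys, as noted above):
import boto3

client = boto3.client('s3')
resp = client.list_objects_v2(Bucket='myBucket', Prefix='20190313/', Delimiter='_')
# With a '_' delimiter, keys are rolled up at the first '_' after the prefix
# and reported under CommonPrefixes instead of Contents.
for cp in resp.get('CommonPrefixes', []):
    print(cp['Prefix'])
If the roll-up does not isolate '_SUCCESS' for your key layout, a plain client-side fallback is to keep the existing objects.filter(Prefix=date + '/') call and test each key with key.endswith('/_SUCCESS').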

Related

How can I bypass the boto3 function not allowing uppercase letters in bucket name?

I need to get a list of S3 keys from a bucket. I have a script that works great with boto3. The issue is that the bucket name I'm using has uppercase letters, and this throws an error with boto3.
Looking at connect to bucket having uppercase letter, that approach uses boto, but I'd like to use boto3, and OrdinaryCallingFormat() doesn't seem to be an option for boto3.
Or, I could adapt the script to work for boto, but I'm not sure how to do that. I tried:
s3 = boto.connect_s3(aws_access_key_id, aws_secret_access_key,
                     calling_format=OrdinaryCallingFormat())
Bucketname = s3.get_bucket('b-datasci/x-DI-S')
but that gave the error boto.exception.S3ResponseError: S3ResponseError: 404 Not Found.
With this attempt:
conn = boto.s3.connect_to_region(
    'us-east-1',
    aws_access_key_id=AWS_ACCESS_KEY_ID,
    aws_secret_access_key=AWS_SECRET_ACCESS_KEY,
    calling_format=boto.s3.connection.OrdinaryCallingFormat(),
)
bucket = conn.get_bucket(Bucketname)
I got the error: xml.sax._exceptions.SAXParseException: <unknown>:1:0: no element found
How can I get this to work with boto, or else integrate OrdinaryCallingFormat() into boto3, to list the keys?
The problem is not an uppercase character; it's the slash.
A bucket name must match this regex:
^[a-zA-Z0-9.\-_]{1,255}$
To list objects in bucket "mybucket" with prefix "b-datasci/x-DI-S", you're going too low-level. Use this instead:
import boto3

client = boto3.client('s3')
response = client.list_objects_v2(Bucket="mybucket", Prefix="b-datasci/x-DI-S")
for obj in response.get('Contents', []):
    ...
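Note that list_objects_v2 returns at most 1000 keys per response; for larger listings, use the list_objects_v2 paginator (as in the boto3 sketch further below) so no keys are missed.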
As pointed out by @John Rotenstein, all bucket names must be DNS-compliant, so a slash is not allowed in a bucket name.
In fact the 2nd piece of code above does work, when I added this: for key in bucket.get_all_keys(prefix='s-DI-S/', delimiter='/') and took what was really the prefix off the Bucketname.
From the OP's comment above, it seems the OP may have confused the S3 object store with a file directory system. S3 objects are never stored in a hierarchy; the so-called "prefix" is simply a text filter that returns the objects whose key begins with that prefix.
In the OP's case, the bucket is called b-datasci, and there are many objects with the prefix x-DI-S stored under that bucket.
In addition, the delimiter argument to get_all_keys does not do what the OP thinks. If you check the AWS documentation on Listing Keys Hierarchically Using a Prefix and Delimiter, you will learn that it is just a secondary filter, i.e.:
The prefix and delimiter parameters limit the kind of results returned by a list operation. Prefix limits results to only those keys that begin with the specified prefix, and delimiter causes list to roll up all keys that share a common prefix into a single summary list result.
P.S.: The OP should use boto3, because boto has been deprecated by AWS for more than two years.
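For completeness, a boto3 sketch of the same listing (bucket and prefix taken from the question; credentials are assumed to come from the environment):
import boto3

s3 = boto3.client('s3')
# The bucket name stands alone; the rest of the path goes into Prefix.
paginator = s3.get_paginator('list_objects_v2')
for page in paginator.paginate(Bucket='b-datasci', Prefix='x-DI-S/'):
    for obj in page.get('Contents', []):
        print(obj['Key'])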

How to retrieve bucket prefixes in a filesystem style using boto3

Doing something like the following:
import boto3

s3 = boto3.resource('s3')
bucket = s3.Bucket('a_dummy_bucket')
bucket.objects.all()
Will return all the objects under 'a_dummy_bucket' bucket, like:
test1/blah/blah/afile45645.zip
test1/blah/blah/afile23411.zip
test1/blah/blah/afile23411.zip
[...] 2500 files
test2/blah/blah/afile.zip
[...] 2500 files
test3/blah/blah/afile.zip
[...] 2500 files
Is there any way of getting, in this case, 'test1','test2', 'test3', etc... without paginating over all results?
To reach 'test2' I need 3 paginated calls, each with 1000 keys, just to learn that there is a 'test2', and then another 3 calls with 1000 keys to reach 'test3', and so on.
How can I get all these prefixes without paginating over all results?
Thanks
I believe the Common Prefixes are what you are looking for, which can be retrieved using this example:
import boto3
client = boto3.client('s3')
paginator = client.get_paginator('list_objects')
result = paginator.paginate(Bucket='my-bucket', Delimiter='/')
for prefix in result.search('CommonPrefixes'):
    print(prefix.get('Prefix'))
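With the example keys from the question, this prints test1/, test2/, and test3/: one entry per top-level prefix, without listing the thousands of keys underneath each one.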
The AWS documentation for GET Bucket says the following regarding CommonPrefixes:
A response can contain CommonPrefixes only if you specify a delimiter. When you do, CommonPrefixes contains all (if there are any) keys between Prefix and the next occurrence of the string specified by delimiter. In effect, CommonPrefixes lists keys that act like subdirectories in the directory specified by Prefix. For example, if prefix is notes/ and delimiter is a slash (/), in notes/summer/july, the common prefix is notes/summer/. All of the keys rolled up in a common prefix count as a single return when calculating the number of returns. See MaxKeys.
Type: String
Ancestor: ListBucketResult
There is no way to avoid pagination. You can specify fewer than the default page size of 1000 keys, but not more than 1000. If you think paginating and finding the prefixes is messy, try the AWS CLI, which does all the pagination for you internally but returns only the results you want:
aws s3 ls s3://a_dummy_bucket/
PRE test1/
PRE test2/
PRE test3/

Use regex to isolate information in filename

I have various files that are formatted like so;
file_name_twenty_13503295.txt
where file_name_twenty is a description of the contents, and 13503295 is the id.
I want two different regexes; one to get the description out of the filename, and one to get the id.
Here are some other rules that the filenames follow:
The filename will never contain spaces or uppercase characters
the id will always come directly before the extension
the id will always follow an underscore
the description may sometimes have numbers in it; for example, in this filename: part_1_of_file_324980332.txt, part_1_of_file is the description, and 324980332 is the id.
I've been toiling for a while and can't seem to figure out a regex to solve this. I'm using python, so any limitations thereof with its regex engine follow.
rsplit once on the last underscore, after splitting the extension off:
s = "file_name_twenty_13503295.txt"
name, id = s.rsplit(".",1)[0].rsplit("_", 1)
print(name, id)
file_name_twenty 13503295
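If the OP really wants regexes, a sketch of a single pattern with two named groups covering both rules (the filename is just the example from the question):
import re

pattern = re.compile(r'^(?P<desc>.+)_(?P<id>\d+)\.[^.]+$')
m = pattern.match('part_1_of_file_324980332.txt')
if m:
    print(m.group('desc'), m.group('id'))  # part_1_of_file 324980332
The greedy (?P<desc>.+) backtracks only as far as the last underscore before the trailing digits, so descriptions that contain numbers or underscores are still captured whole.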

Parsing directories and detecting unexpected blanks

I'm trying to parse some directories and identify folders which do not follow a specific correct pattern. For example:
Correct: Level1\\Level2\\Level3\\Level4_ID\\Date\\Hour\\file.txt
Incorrect: Level1\\Level2\\Level3\\Level4\\Date\\Hour\\file.txt
Notice that the incorrect one does not have the _ID. My final goal is to parse the data, replacing the '\\' with a delimiter, in order to import it into MS Excel:
Level1;Level2;Level3;Level4;ID;Date;Hour;file.txt
Level1;Level2;Level3;Level4; ;Date;Hour;file.txt
I successfully parsed all the correct data with these steps:
Let files be a list of all my directories:
for i in arange(len(files)):
    processed_str = files[i].replace(" ", "").replace("_", "\\")
    processed_str = processed_str.split("\\")
My issue is detecting whether or not the Level4 folder has an ID after the underscore using the same script, since "files" contains both correct and incorrect directories.
The problem is that, since the incorrect one does not have the ID, after performing split("\\") I end up with the columns mixed, without a blank between Level4 and Date:
Level1;Level2;Level3;Level4;Date;Hour;file.txt
Thanks,
Do the "_ID" check after splitting the directories; that way you don't lose information. Assuming the directory names themselves don't contain escaped backslashes and that the ID field is always in level 4 (counting from 1), this should do it:
for i in arange(len(files)):
    parts = files[i].split("\\")
    if parts[3].endswith("_ID"):
        parts[3] = parts[3][:-len("_ID")]  # keep the "Level4" part
        parts.insert(4, "ID")              # the ID gets its own column
    else:
        parts.insert(4, " ")               # blank column when the ID is missing
    final = ";".join(parts)

How can I get a list of all file names having same prefix on Amazon S3?

I am using boto and Python to store and retrieve files to and from Amazon S3.
I need to get the list of files present in a directory. I know there is no concept of directories in S3, so I am phrasing my question as: how can I get a list of all file names having the same prefix?
For example, let's say I have the following files:
Brad/files/pdf/abc.pdf
Brad/files/pdf/abc2.pdf
Brad/files/pdf/abc3.pdf
Brad/files/pdf/abc4.pdf
mybucket/files/pdf/new/
mybucket/files/pdf/new/abc.pdf
mybucket/files/pdf/2011/
When I call foo("Brad"), it should return a list like this-
files/pdf/abc.pdf
files/pdf/abc2.pdf
files/pdf/abc3.pdf
files/pdf/abc4.pdf
What is the best way to do it?
user3's approach is a pure client-side solution. I think it works well at a small scale. If you have millions of objects in one bucket, you may pay for many requests and bandwidth fees.
Alternatively, you can use the delimiter and prefix parameters provided by the GET Bucket API to achieve your requirement. There are many examples in the documentation; see http://docs.aws.amazon.com/AmazonS3/latest/API/RESTBucketGET.html
Needless to say, you can use boto to achieve this.
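A minimal boto sketch of that server-side prefix listing (bucket name and credentials are assumptions; the filtering happens on the server, so only matching keys are transferred):
import boto

conn = boto.connect_s3()
bucket = conn.get_bucket('mybucket')
for key in bucket.list(prefix='Brad/'):
    # Strip the leading "Brad/" to match the desired output in the question.
    print(key.name[len('Brad/'):])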
You can use startswith and a list comprehension for this purpose, as below:
paths = ['Brad/files/pdf/abc.pdf', 'Brad/files/pdf/abc2.pdf', 'Brad/files/pdf/abc3.pdf',
         'Brad/files/pdf/abc4.pdf', 'mybucket/files/pdf/new/', 'mybucket/files/pdf/new/abc.pdf',
         'mybucket/files/pdf/2011/']

def foo(m):
    return [p for p in paths if p.startswith(m + '/')]

print foo('Brad')
output:
['Brad/files/pdf/abc.pdf', 'Brad/files/pdf/abc2.pdf', 'Brad/files/pdf/abc3.pdf', 'Brad/files/pdf/abc4.pdf']
Using split and filter:
def foo(m):
    return filter(lambda x: x.split('/')[0] == m, paths)
