AWS Lambda - Combine multiple CSV files from S3 into one file - python

I am trying to understand and learn how to get all my files from a specific bucket into one CSV file. The files are like logs: they are always in the same format and are kept in the same bucket. I have this code to access and read them:
bucket = s3_resource.Bucket(bucket_name)
for obj in bucket.objects.all():
    x = obj.get()['Body'].read().decode('utf-8')
    print(x)
It does print them, with separation between the individual files and with the column headers.
My question is: how can I modify my loop to get them all into just one CSV file?

You should create a file in /tmp/ and write the contents of each object into that file.
Then, when all files have been read, upload the file (or do whatever you want to do with it).
output = open('/tmp/outfile.txt', 'w')
bucket = s3_resource.Bucket(bucket_name)
for obj in bucket.objects.all():
    output.write(obj.get()['Body'].read().decode('utf-8'))
output.close()
Please note that there is a limit of 512MB in the /tmp/ directory.
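If you also want the combined file to be a valid CSV with a single header row, here is a minimal sketch of the whole flow, assuming every file shares the same header (the bucket name and destination key are placeholders, not part of the original question):
import boto3

s3_resource = boto3.resource('s3')
bucket_name = 'my-log-bucket'  # placeholder bucket name
bucket = s3_resource.Bucket(bucket_name)

header_written = False
with open('/tmp/combined.csv', 'w') as output:
    for obj in bucket.objects.all():
        lines = obj.get()['Body'].read().decode('utf-8').splitlines()
        if not lines:
            continue
        if not header_written:
            output.write(lines[0] + '\n')   # keep the header from the first file only
            header_written = True
        for line in lines[1:]:              # skip each subsequent file's header row
            output.write(line + '\n')

# upload the combined file back to S3 (destination key is a placeholder)
s3_resource.meta.client.upload_file('/tmp/combined.csv', bucket_name, 'combined/combined.csv')
If the files do not all share the same header, plain concatenation as shown above in the answer is all you need.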

Related

Read content of a file located under subfolders of S3 in Python

I'm trying to read a file's content (not to download it) from an S3 bucket. The problem is that the file is located under a multi-level folder. For instance, the full path could be s3://s3-bucket/folder-1/folder-2/my_file.json. How can I get that specific file instead of using my iterative approach that lists all objects?
Here is the code that I want to change:
import boto3

s3 = boto3.resource('s3')
bucket = s3.Bucket('s3-bucket')
for obj in bucket.objects.all():
    key = obj.key
    if key == 'folder-1/folder-2/my_file.json':
        return obj.get()['Body'].read()
Can it be done in a simpler, more direct way?
Yes - there is no need to enumerate the bucket.
Read the file directly using s3.Object, providing the bucket name as the 1st parameter & the object key as the 2nd parameter.
"Folders" don't really exist in S3 - Amazon S3 doesn't use hierarchy to organize its objects and files. For the sake of organizational simplicity, the Amazon S3 console shows "folders" as a means of grouping objects but they are ultimately baked into your object key.
This should work:
import boto3
s3 = boto3.resource('s3')
obj = s3.Object("s3-bucket", "folder-1/folder-2/my_file.json")
body = obj.get()['Body'].read()
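If you then need the parsed JSON rather than raw bytes, a short follow-up (this assumes the object actually contains valid JSON):
import json

data = json.loads(body)  # body is the bytes returned above; json.loads accepts bytes
print(data)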

How can I write the output inside a folder of an S3 bucket using a Lambda function

I am coding in Python in AWS Lambda and I need to write the output that shows up in the output logs to a folder of an S3 bucket. The code is,
client = boto3.client('s3')
output_data = b'Checksums are matching for file'
client.put_object(Body=output_data, Bucket='bucket', Key='key')
Now I want to write Checksums are matching for file and then append the file names to it. How can I do that?
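One minimal sketch of how this could look: S3 "folders" are just key prefixes, so writing under a prefixed key places the object "inside" a folder in the console. The folder prefix and file name below are hypothetical placeholders, not part of the original code.
import boto3

client = boto3.client('s3')

file_name = 'example.csv'  # hypothetical file name
output_data = f'Checksums are matching for file {file_name}'.encode('utf-8')

# a key of the form '<folder>/<name>' shows up under that folder in the S3 console
client.put_object(Body=output_data,
                  Bucket='bucket',
                  Key=f'checksum-results/{file_name}.txt')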

Unzip a file and save its content into a database

I am building a website using Django where the user can upload a .zip file. I do not know how many subfolders the file has or which types of files it contains.
I want to:
1) Unzip the file
2) Get all the files in the unzipped directory (which might contain nested subfolders)
3) Save these files (the content, not the path) into the database.
I managed to unzip the file and to output the file paths.
However, this is not exactly what I want, because I do not care about the file paths but about the files themselves.
In addition, since I am saving the unzipped files into media/documents, if different users upload different zip files and they are all unzipped there, the media/documents folder will become huge and it will be impossible to know who uploaded what.
Unzipping the .zip file
myFile = request.FILES.get('my_uploads')
with ZipFile(myFile, 'r') as zipObj:
    zipObj.extractall('media/documents/')
Getting the file paths in the subfolders
x = [i[2] for i in os.walk('media/documents/')]
file_names = []
for t in x:
    for f in t:
        file_names.append(f)
views.py  # It is not perfect, it is just an idea. I am just debugging.
def homeupload(request):
    if request.method == "POST":
        my_entity = Uploading()
        # my_entity.my_uploads = request.FILES["my_uploads"]
        myFile = request.FILES.get('my_uploads')
        with ZipFile(myFile, 'r') as zipObj:
            zipObj.extractall('media/documents/')
        x = [i[2] for i in os.walk('media/documents/')]
        file_names = []
        for t in x:
            for f in t:
                file_names.append(f)
        my_entity.save()
You really don't have to clutter up your filesystem when using a ZipFile: it has methods that let you read the files stored in the zip directly into memory, and you can then save those objects to a database.
Specifically, we can use .infolist() or .namelist() to get a list of all the files in the zip, and .read() to actually get their contents:
with ZipFile(myFile, 'r') as zipObj:
    file_objects = [zipObj.read(item) for item in zipObj.namelist()]
Now file_objects is a list of bytes objects holding the content of all the files. I didn't bother saving the names or file paths because you said it was unnecessary, but that can be done too. To see what else you can do, check out what actually gets returned from infolist().
If you want to save these bytes objects to your database, that is usually possible if your database supports it (most do). If you instead want these files as plain text rather than bytes, you just have to convert them first with something like .decode():
with ZipFile(myFile, 'r') as zipObj:
    file_objects = [zipObj.read(item).decode() for item in zipObj.namelist()]
Notice that we didn't save any files to the filesystem at any point, so there is nothing to worry about regarding a lot of user-uploaded files cluttering up your system. However, the database storage on your disk will increase accordingly.
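As a concrete illustration of the database step, here is a minimal sketch assuming a hypothetical Django model with a BinaryField (the model name and fields are placeholders, not part of the original question):
from django.db import models

class ExtractedFile(models.Model):  # hypothetical model
    name = models.CharField(max_length=255)
    content = models.BinaryField()
Then, in the view, the in-memory contents can be saved directly:
with ZipFile(myFile, 'r') as zipObj:
    for item in zipObj.namelist():
        if item.endswith('/'):   # skip directory entries inside the zip
            continue
        ExtractedFile.objects.create(name=item, content=zipObj.read(item))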

How to append multiple files into one in Amazon's s3 using Python and boto3?

I have a bucket in Amazon's S3 called test-bucket. Within this bucket, json files look like this:
test-bucket
  | continent
    | country
      | <filename>.json
Essentially, the keys follow the pattern continent/country/<filename>.json. Within each country, there are about 100k files, each containing a single dictionary, like this:
{"data":"more data", "even more data":"more data", "other data":"other other data"}
Different files have different lengths. What I need to do is compile all these files together into a single file, then re-upload that file into s3. The easy solution would be to download all the files with boto3, read them into Python, then append them using this script:
import json

def append_to_file(data, filename):
    with open(filename, "a") as f:
        json.dump(data, f)
        f.write("\n")
However, I do not know all the filenames (the names are a timestamp). How can I read all the files in a folder, e.g. Asia/China/*, then append them to a file, with the filename being the country?
Optimally, I don't want to have to download all the files into local storage. If I could load these files into memory, that would be great.
EDIT: to make things clearer: files on S3 aren't stored in folders; the file path is just set up to look like a folder. All files are stored under test-bucket.
The answer to this is fairly simple. You can list all the files in the bucket, using a Prefix filter to narrow the listing down to a "subdirectory". If you have a list of the continents and countries in advance, you can reduce the returned list that way. The returned keys include the full prefix, so you can filter the list of object names down to the ones you want.
import re
import boto3

s3 = boto3.resource('s3')
bucket_obj = s3.Bucket(bucketname)
all_s3keys = list(obj.key for obj in bucket_obj.objects.filter(Prefix=job_prefix))

if file_pat:
    filtered_s3keys = [key for key in all_s3keys if bool(re.search(file_pat, key))]
else:
    filtered_s3keys = all_s3keys
The code above will return all the files under the prefix provided, with their complete key. So if you provide Prefix='Asia/China/', it will return a list of only the files with that prefix. In some cases, I take a second step and filter the file names in that 'subdirectory' before using the full key to access the files.
The second step is to download all the files:
import concurrent.futures

with concurrent.futures.ThreadPoolExecutor(max_workers=MAX_THREADS) as executor:
    executor.map(lambda s3key: bucket_obj.download_file(s3key, local_filepath, Config=CUSTOM_CONFIG),
                 filtered_s3keys)
For simplicity, I skipped showing that the code also generates a local_filepath for each downloaded file, so that each file ends up with the name and location you actually want.
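If you would rather avoid local storage entirely, as the question mentions, here is a minimal in-memory sketch, assuming the combined result fits in memory and that each object contains a single JSON record (the prefix and destination key are placeholders):
import boto3

s3 = boto3.resource('s3')
bucket_obj = s3.Bucket('test-bucket')

country_prefix = 'Asia/China/'  # placeholder prefix
combined_lines = []
for obj in bucket_obj.objects.filter(Prefix=country_prefix):
    body = obj.get()['Body'].read().decode('utf-8').strip()
    if body:
        combined_lines.append(body)

# upload one newline-delimited JSON file for the country
bucket_obj.put_object(Key='combined/China.json',
                      Body='\n'.join(combined_lines).encode('utf-8'))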

How to get objects from a folder in an S3 bucket

I am trying to traverse all objects inside a specific folder in my S3 bucket. The code I already have is as follows:
s3 = boto3.resource('s3')
bucket = s3.Bucket('bucket-name')
for obj in bucket.objects.filter(Prefix='folder/'):
    do_stuff(obj)
I need to use boto3.resource and not client. This code is not getting any objects at all although I have a bunch of text files in the folder. Can someone advise?
Try adding the Delimiter attribute: Delimiter='/', since you are filtering objects. The rest of the code looks fine.
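For example, applied to the code from the question (this just adds the delimiter the answer suggests):
for obj in bucket.objects.filter(Prefix='folder/', Delimiter='/'):
    do_stuff(obj)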
I had to make sure to skip the first key. If the folder was created in the S3 console, a zero-byte placeholder object with the folder's key exists and shows up as the first "file", and that may not be what you want.
for video_item in source_bucket.objects.filter(Prefix="my-folder-name/", Delimiter='/'):
    if video_item.key == 'my-folder-name/':
        continue
    do_something(video_item.key)
