S3 appending random string in file name - python

I have an S3 folder with a CSV file stored in it. I'm trying to download the last modified file. I'm using this script to get the last modified file:
s3_client = boto3.client('s3', aws_access_key_id=s3_extra_data['aws_access_key_id'],
                         aws_secret_access_key=s3_extra_data['aws_secret_access_key'])

response = s3_client.list_objects_v2(Bucket='test', Prefix='file_r/')
all = response['Contents']
latest = max(all, key=lambda x: x['LastModified'])

response = s3_client.list_objects_v2(Bucket='test', Prefix=str(latest["Key"])[:-52].lower())
all = response['Contents']
latest = max(all, key=lambda x: x['LastModified'])

print("LATEST ->" + str(latest["Key"])[:-52].lower())
print("PATH ->" + str(latest["Key"]))

s3_client.download_file("test", latest["Key"], str(latest["Key"]))
This code lists my last modified object. The file name is part-00000-40f267f2-38dc-4bab-811c-4c3052fdb1ba-c000.csv and it is inside the file_r folder.
However, when I call s3_client.download_file I get the following error:
'file_r/part-00000-40f267f2-38dc-4bab-811c-4c3052fdb1ba-c000.csv.8cEebaeb'
When I print my path and my file I get the correct values:
LATEST -> file_r/part
PATH -> file_r/part-00000-40f267f2-38dc-4bab-811c-4c3052fdb1ba-c000.csv
Why is the value .8cEebaeb appended after the .csv extension when the PATH is correct?
Any thoughts on that?

I had this issue when the local folder was not created.
folder1/folder2/filename.foo
If folder1 or folder2 does not exist locally, boto3 returns an error:
FileNotFoundError: [Errno 2] No such file or directory: 'folder1/folder2/filename.foo.F89bdcAc'

To solve the problem I changed the code to:
s3_client = boto3.client('s3', aws_access_key_id=s3_extra_data['aws_access_key_id'],
                         aws_secret_access_key=s3_extra_data['aws_secret_access_key'])

response = s3_client.list_objects_v2(Bucket='test', Prefix='file_r/')
all = response['Contents']
latest = max(all, key=lambda x: x['LastModified'])
s3_client.download_file("test", latest["Key"], "FILE_NAME")
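If you want to keep the original key as the local path instead of a fixed "FILE_NAME", a minimal sketch is to create the missing local directory first (bucket name and prefix are the ones from the question; the random suffix in the error comes from the temporary file boto3 writes during the download and renames on completion, which fails when the target directory does not exist):
import os
import boto3

s3_client = boto3.client('s3')

response = s3_client.list_objects_v2(Bucket='test', Prefix='file_r/')
latest = max(response['Contents'], key=lambda x: x['LastModified'])

local_path = latest['Key']  # e.g. 'file_r/part-...-c000.csv'
os.makedirs(os.path.dirname(local_path), exist_ok=True)  # make sure 'file_r/' exists locally

s3_client.download_file('test', latest['Key'], local_path)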

Please note that this issue can sometimes happen on Windows machines because of the number of characters in the folder path, including the file name. I faced a similar issue: saving to the same folder path kept giving me the error until I reduced the path to 228 characters.
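A quick way to check whether you are hitting that limit is to measure the final local path and fall back to a shorter name if needed (a sketch reusing the names from the question's code; the classic MAX_PATH ceiling on Windows is 260 characters unless long paths are enabled, and the temporary suffix uses up a few more):
import os

local_path = os.path.abspath(str(latest["Key"]))
print(len(local_path))          # paths near 260 characters are suspect on Windows
if len(local_path) > 228:       # margin observed above, leaves room for the temp suffix
    local_path = "latest.csv"   # fall back to a short local name

s3_client.download_file("test", latest["Key"], local_path)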

trying to use boto copy to s3 unless file exists

In my code below, fn2 is the local file and "my_bucket_object.key" is the list of files in my S3 bucket.
I am looking at my local files, taking the latest one by creation date, and then looking at the bucket. I only want to copy the latest one there (this is working), but not if it already exists. What is happening is that, even if the file is already in the bucket, the latest file still gets copied, overwriting the one in the bucket with the same name.
The filename of the latest file is "bats6.csv".
I figured that specifying 'in' and 'not in' conditions would ensure that the file does not get copied if one with the same name is already there, but this isn't working.
Here is the code. Thanks a lot.
import boto3
import botocore
import glob, os
import datetime

exchanges = ["bats"]

for ex in exchanges:
    csv_file_list = glob.glob(f"H:\\SQL_EXPORTS\\eod\\candles\\{ex}\\*.csv")
    latest_file = max(csv_file_list, key=os.path.getctime)
    path = f'H:\\SQL_EXPORTS\\eod\\candles\\{ex}\\'
    fileBaseName = os.path.basename(latest_file).split('.')[0]
    fn = path + fileBaseName + ".csv"
    fn2 = fileBaseName + ".csv"
    print(fn2)

    s3 = boto3.resource('s3')
    my_bucket = s3.Bucket(f'eod-candles-{ex}')
    for my_bucket.object in my_bucket.objects.all():
        print(my_bucket.object.key)
        if fn2 not in my_bucket.object.key:
            #s3.meta.client.upload_file(fn, my_bucket, fn2)
            s3.meta.client.upload_file(fn, f'eod-candles-{ex}', fn2)
        elif fn2 in my_bucket.object.key:
            print("file already exists")
You could make a list of the object keys and then check whether the file already exists:
object_keys = [object.key for object in my_bucket.objects.all()]
if fn2 not in object_keys:
    s3.meta.client.upload_file(fn, f'eod-candles-{ex}', fn2)
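An alternative, as a rough sketch rather than part of the answer above, is to ask S3 for the single key with head_object instead of listing every object, treating a 404 as "not present" (it reuses s3, ex, fn and fn2 from the question's code):
import botocore

def key_exists(client, bucket, key):
    # Return True if the object exists, False on a 404, re-raise anything else.
    try:
        client.head_object(Bucket=bucket, Key=key)
        return True
    except botocore.exceptions.ClientError as e:
        if e.response['Error']['Code'] == '404':
            return False
        raise

if not key_exists(s3.meta.client, f'eod-candles-{ex}', fn2):
    s3.meta.client.upload_file(fn, f'eod-candles-{ex}', fn2)
else:
    print("file already exists")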

File is not found when I try to upload the files to S3 using boto3

I'm following a simple tutorial on YouTube about how to automatically upload files to S3 using Python, and I'm getting this error:
FileNotFoundError: [WinError 2] The system cannot find the file specified: 'age.csv'
This does not make sense to me, because the files are there. For example, my code looks like:
client = boto3.client('s3',
                      aws_access_key_id=access_key,
                      aws_secret_access_key=secret_access_key)

path = 'C:/Users/User/Desktop/python/projects/AWS-Data-Processing/example_data'

for file in os.listdir(path):
    upload_file_bucket = 'my-uploaded-data'
    print(file)
    if '.txt' in file:
        upload_file_key_txt = 'txt/' + str(file)
        client.upload_file(file, upload_file_bucket, upload_file_key_txt)
        print("txt")
    elif '.csv' in file:
        upload_file_key_csv = 'csv/' + str(file)
        client.upload_file(file, upload_file_bucket, upload_file_key_csv)
        print("csv")
When I comment out the line
client.upload_file(file, upload_file_bucket, upload_file_key_txt)
it prints out either "txt" or "csv", and when I cut the loop down to just reading the files, such as:
for file in os.listdir(path):
    upload_file_bucket = 'my-uploaded-data'
    print(file)
it successfully prints out the file names. So I don't understand why I get an error saying the file does not exist when it clearly does. It seems contradictory and I need some help understanding this error.
I read a post suggesting I might need to install the AWS CLI, which I did, but it didn't help. I'm guessing the problem lies in the upload_file function, but I just don't understand how there can be "no file"?
Any advice will be appreciated!
The upload_file function takes a full file path, not just a name. It cannot guess your directory, so you need to prepend it or use a different way of iterating over the files.
Source: https://boto3.amazonaws.com/v1/documentation/api/latest/guide/s3-uploading-files.html
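For example, a short sketch of the same loop with the directory prepended via os.path.join (client, path and bucket name as in the question):
import os

for file in os.listdir(path):
    full_path = os.path.join(path, file)  # full path, not just the file name
    if file.endswith('.txt'):
        client.upload_file(full_path, 'my-uploaded-data', 'txt/' + file)
    elif file.endswith('.csv'):
        client.upload_file(full_path, 'my-uploaded-data', 'csv/' + file)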

PermissionError: Forbidden when reading files from aws s3

I am working in Python and a Jupyter notebook, and I am trying to read parquet files from an AWS S3 bucket and convert them into a single pandas dataframe.
The bucket and folders are arranged like:
The bucket name: mybucket
First Folder: 123
Second Folder: Parquets.parquet
file1.snappy.parquet
file2.snappy.parquet
....
I am getting the full path with:
bucket = s3.Bucket(name='mybucket')
keys = []
for key in bucket.objects.all():
    keys.append("s3://mybucket/" + key.key)
And then reading them with:
count = 0
keys = keys[2:]
df_list = []
for obj in bucket.objects.all():
    subsrc = obj.Object()
    key = obj.key
    path = keys[count]
    obj_df = pd.read_parquet(path)
    df_list.append(obj_df)
    count += 1
df = pd.concat(df_list)
But that is giving me:
PermissionError: Forbidden
pointing to the line 'obj_df = pd.read_parquet(path)'
I know I have full s3 access, so that should not be the issue. Thank you so much!
This is probably because the path to the data is incorrect.
(In the code above, you're doing pd.read_parquet(path) where path = keys[count], but I'm pretty sure those are only the keys, which do not include the bucket name.)
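If the credentials that pandas/s3fs picks up are not the same ones your boto3 session uses, another option is to fetch the bytes through boto3 itself and parse them in memory. A sketch under that assumption, using the bucket and folder layout shown in the question and requiring pyarrow or fastparquet:
import io

import boto3
import pandas as pd

s3 = boto3.resource('s3')
bucket = s3.Bucket('mybucket')

df_list = []
for obj in bucket.objects.filter(Prefix='123/Parquets.parquet/'):
    if not obj.key.endswith('.parquet'):
        continue  # skip the 'folder' placeholder keys
    body = obj.get()['Body'].read()                     # object bytes via boto3
    df_list.append(pd.read_parquet(io.BytesIO(body)))   # parsed in memory

df = pd.concat(df_list, ignore_index=True)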

How to fix the problem with reading text files of a list of files from a folder for creating rows of dataframe

I want to export the data from the text files into a pandas dataframe. I have created a list of those files from a specific folder path. The code worked just fine two or three times and I was able to export the text files into a dataframe, but some time later it started raising a FileNotFoundError. 'link1.txt' is the first file in the desired folder.
path = "F:/study/folder0/"
dir_list = os.listdir(path) #list of the files in folder
length = len(dir_list)
for i in range(length):
text = pd.read_csv(dir_list[i], sep = " ", header = None)
text['new'] = text.apply(' '.join, axis=1)
The error that I get:
FileNotFoundError: [Errno 2] File b'link1.txt' does not exist: b'link1.txt'
File not found mainly happens because the path you provided is wrong or the file doesn't actually exist.
When you do pd.read_csv(dir_list[i], sep=" ", header=None) you are passing each file/folder name rather than the full path of the file, so from the current directory where you are running the Python code the file doesn't exist.
One solution is to prepend the path to each element of dir_list:
path = "c:\\somedir\\somefolder\\"
dir_list2 = []
for x in range(length):
    dir_list2.append(path + str(dir_list[x]))
The problem for you here could be that the path you used has forward slashes (/) while Windows uses backslashes for navigating.
Use this solution for your case:
path = "F:/study/folder0/"
dir_list = os.listdir(path)  # list of the files in the folder
length = len(dir_list)
for i in range(length):
    text = pd.read_csv("F:\\study\\folder0\\" + str(dir_list[i]), sep=" ", header=None)
    text['new'] = text.apply(' '.join, axis=1)
Thank you, hope it helped.
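For completeness, a small sketch that sidesteps the slash question entirely by joining the folder and file names with os.path.join (same folder path as in the question; the frames list and final concat are just one way to collect the rows):
import os

import pandas as pd

path = "F:/study/folder0/"
frames = []
for name in os.listdir(path):
    full_path = os.path.join(path, name)   # handles / and \ on Windows
    text = pd.read_csv(full_path, sep=" ", header=None)
    text['new'] = text.apply(' '.join, axis=1)
    frames.append(text)

df = pd.concat(frames, ignore_index=True)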

Boto3 folder sync under new S3 'folder'

So, before anyone tells me about the flat structure of S3, I already know, but the fact is you can create 'folders' in S3. My objective with this Python code is to create a new folder, named with the date of running plus the user's input appended (this is the createS3Folder function), and then sync a local directory into that folder.
The problem is that my upload_files function creates a new folder in S3 that exactly emulates the folder structure of my local set up.
Can anyone suggest how I would just sync the folder into the newly created one without changing names?
import sys
import boto3
import datetime
import os

teamName = raw_input("Please enter the name of your project: ")
bucketFolderName = ""

def createS3Folder():
    date = (datetime.date.today().strftime("%Y") + "." +
            datetime.date.today().strftime("%B") + "." +
            datetime.date.today().strftime("%d"))
    date1 = datetime.date.today()
    date = str(date1) + "/"  # in order to generate a 'folder', you must put "/" at the end of the key

    bucketFolderName = date + teamName + "/"

    client = boto3.client('s3')
    client.put_object(Bucket='MY_BUCKET', Key=bucketFolderName)

    upload_files('/Users/local/directory/to/sync')

def upload_files(path):
    session = boto3.Session()
    s3 = session.resource('s3')
    bucket = s3.Bucket('MY_BUCKET')

    for subdir, dirs, files in os.walk(path):
        for file in files:
            full_path = os.path.join(subdir, file)
            with open(full_path, 'rb') as data:
                bucket.put_object(Key=bucketFolderName, Body=data)

def main():
    createS3Folder()

if __name__ == "__main__":
    main()
Your upload_files() function is uploading to:
bucket.put_object(Key=bucketFolderName, Body=data)
This means that the filename ("Key") on S3 will be the name of the 'folder'. It should be:
bucket.put_object(Key=bucketFolderName + '/' + file, Body=data)
The Key is the full path of the destination object, including the filename (not just a 'directory').
In fact, there is no need to create the 'folder' beforehand -- just upload to the desired Key.
If you are feeling lazy, use the AWS Command-Line Interface (CLI) aws s3 sync command to do it for you!
"the fact is you can create 'folders' in S3"
No, you can't.
You can create an empty object that looks like a folder in the console, but it is still not a folder, it still has no meaning, it is still unnecessary, and if you delete it via the API, all the files you thought were "in" the folder will still be in the bucket. (If you delete it from the console, all the contents are deleted from the bucket, because the console explicitly deletes every object starting with that key prefix.)
The folder you are creating is not a container and cannot have anything inside it, because S3 does not have folders that are containers.
If you want to store a file cat.png and make it look like it's in the hat/ folder, you simply set the object key to hat/cat.png. This has exactly the same effect as what you observe in the console, whether or not the hat/ folder was explicitly created.
To do what you want, you simply build the desired object key for each object with string manipulation, including your common prefix ("folder name") and / delimiters. Any folder structure the / delimiters imply will be displayed in the console as a result, as sketched below.
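A rough sketch of an upload_files along those lines, building each key as the generated prefix plus the file's path relative to the synced directory (the bucket name and paths are the placeholders from the question):
import os

import boto3

def upload_files(path, bucket_name, prefix):
    # Key = common prefix + path relative to the synced folder, with / separators.
    s3 = boto3.resource('s3')
    bucket = s3.Bucket(bucket_name)
    for subdir, dirs, files in os.walk(path):
        for file in files:
            full_path = os.path.join(subdir, file)
            key = prefix + os.path.relpath(full_path, path).replace(os.sep, '/')
            bucket.upload_file(full_path, key)

# e.g. upload_files('/Users/local/directory/to/sync', 'MY_BUCKET',
#                   str(datetime.date.today()) + "/" + teamName + "/")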
