I have a list of folders to process, and I want to get the latest file in each sub-directory.
For example, I want to find files whose names contain Received in each sub-directory. However, each sub-directory might contain multiple files with Received in their name, so I only want the latest one, meaning a single file per sub-folder.
Root_path = r"R:\test"
How can I achieve this? Please use Python.
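A minimal sketch, assuming "latest" means the most recently modified file and that the match on "Received" is case-sensitive (swap os.path.getmtime for os.path.getctime if you mean creation time):

import os

Root_path = r"R:\test"

for dirpath, dirnames, filenames in os.walk(Root_path):
    # collect every file in this sub-directory whose name contains "Received"
    received = [os.path.join(dirpath, f) for f in filenames if "Received" in f]
    if received:
        latest = max(received, key=os.path.getmtime)  # newest by modification time
        print(latest)  # process the single latest file for this sub-folder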
I want to read multiple wav files in a loop rather than reading each one individually. I am using the os library for pathing, and when I try to read the files with the code below I get a file-not-found error.
The files are in different sub-folders inside the main folder (the GTZAN Dataset).
eg:
r = wav.read(directory+folder+r"\\"+file)
Since I've used the os library for pathing in this project, I would like the answer in os pathing format rather than hard-coded paths.
Note: I've declared directory in my code; it contains the original path, in which each folder contains multiple wav files.
I am able to get the folders, list them, and also list the wav files in each subfolder, but in the end I am not able to read them.
The above code is the only thing I've tried, since I am using the os library pathing format.
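A minimal sketch, assuming wav is scipy.io.wavfile and that directory is the GTZAN root (you have already declared it; the path below is a placeholder). Building paths with os.path.join instead of concatenating strings with r"\\" is usually what fixes the file-not-found error:

import os
from scipy.io import wavfile as wav

directory = r"path\to\genres"  # hypothetical; you already declare this in your code

for folder in os.listdir(directory):
    folder_path = os.path.join(directory, folder)
    if not os.path.isdir(folder_path):
        continue
    for file in os.listdir(folder_path):
        if file.endswith(".wav"):
            rate, data = wav.read(os.path.join(folder_path, file))
            # rate is the sample rate, data the samples; process them here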
I have a folder structure in the following way:
I need to go to each folder where the csv files are present, add a header to them, then go to the next folder, add a header, and so on. How do I iterate through the folders and make these changes to the csv files using Python?
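A minimal sketch, assuming the csv files currently have no header row and that the column names below are placeholders to replace with your own:

import csv
import os

root = r"path\to\root_folder"      # hypothetical root of the folder structure
header = ["col1", "col2", "col3"]  # placeholder column names

for dirpath, dirnames, filenames in os.walk(root):
    for name in filenames:
        if name.endswith(".csv"):
            path = os.path.join(dirpath, name)
            with open(path, newline="") as f:
                rows = list(csv.reader(f))        # read the existing rows
            with open(path, "w", newline="") as f:
                writer = csv.writer(f)
                writer.writerow(header)           # prepend the header row
                writer.writerows(rows)            # write the original rows back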
I have incoming files in 'source-bucket'
After processing, I archive files into another bucket, 'archive-bucket', under the current date folder.
eg:
gs://archive-bucket/module1/2021-06-25/source_file_20210622.csv
gs://archive-bucket/module1/2021-06-26/source_file_20210623.csv
gs://archive-bucket/module1/2021-06-27/source_file_20210624.csv
Every time I process a file, I want to check whether it has already been processed by checking if it is present in the archive folder.
duplicate_check = GoogleCloudStoragePrefixSensor(
    task_id=f'detect_duplicate_{task_name}',
    bucket=ARCHIVE_BUCKET,
    prefix=f'module1/{DATE}/{source_file_name}')
This approach only allows checking the folder for a particular date.
How do I check whether 'source_file_<>.csv' is already present anywhere under 'gs://archive-bucket/module1/<any date folder>/'?
If the file is present in any date folder in the archive path, I need to fail further processing.
How can that be achieved?
I do not think you can do it easily. You could probably play with the "delimiter" parameter (https://github.com/googleapis/google-cloud-python/issues/920#issuecomment-230214336) to achieve something similar. Maybe you can even try to set the delimiter to be your file name and look for the "module1/" prefix. I am not sure about the efficiency of that, though.
The problem is that GCS is NOT a filesystem with folders. The "/" is just a convenience for treating it as a directory, and the UI lets you "browse" it in a similar way, but in fact GCS objects are not stored in subfolders: the full name of the object is its only identifier, and there is nothing like "folders" in GCS. So you can only match object names, and only matching by prefix is efficient. If you have a lot of files, any other kind of matching might be slow.
What I recommend, maybe, is to keep a separate path where you create empty objects corresponding to the file names already processed, for example a "processed/file.name" path without any further structure. Then you can check for the presence of the file name there. This will be fairly efficient (but might not be atomic, depending on how your processing looks).
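A minimal sketch of that marker-object idea, assuming the google-cloud-storage client library; the bucket name comes from the question and the file name below is just an example:

from google.cloud import storage

client = storage.Client()
bucket = client.bucket("archive-bucket")        # bucket name from the question
source_file_name = "source_file_20210622.csv"   # example file name

marker = bucket.blob(f"processed/{source_file_name}")  # flat "processed/" path, no date folders
if marker.exists():
    raise ValueError(f"{source_file_name} was already processed")  # fail further processing

# ... process and archive the file here ...

marker.upload_from_string("")  # create the empty marker object for next time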
From your requirement, what I understand is that you want to move files from the source bucket to another bucket once they are processed, and you want to make sure each file was moved to the destination bucket successfully.
The best way to do it is one of the following:
1) Maintain a small SQL table and insert the path of each processed file into it with a "Processed" state; whenever the state is processed, move those files to the destination bucket. From this table you can always check which files have been processed and moved to the destination bucket.
2) Another approach: if task1 is the task that processes the files, have task2 pass the processed files to a BashOperator; with gsutil you can move the files easily and also check in that script whether each one was actually moved.
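A minimal sketch of that BashOperator idea, assuming Airflow 2's BashOperator; the bucket names come from the question and the file name is an example. gsutil stat exits non-zero if the object is missing, which fails the task:

from airflow.operators.bash import BashOperator

# task2: move the processed file with gsutil, then verify it landed in the archive bucket
archive_file = BashOperator(
    task_id="archive_processed_file",
    bash_command=(
        "gsutil mv gs://source-bucket/{{ params.file }} "
        "gs://archive-bucket/module1/{{ ds }}/{{ params.file }} && "
        "gsutil stat gs://archive-bucket/module1/{{ ds }}/{{ params.file }}"
    ),
    params={"file": "source_file_20210622.csv"},  # example file name
)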
I have a folder (folder_1) that consists of many different subfolders; these subfolders (subfolder_1, subfolder_2, etc.) each contain 1 csv file. I'd like to delete all the subfolders and keep only the csv files. Is there a way to achieve this without specifying each subfolder?
Maybe there is a way to make an exception for csv files using shutil?
Thanks in advance.
What you can do is:
Loop over the subfolders.
Check this post.
Get the csv file path(s) (replace .txt with .csv from the previous post).
Use shutil.move("path/to/current/file.foo", "path/to/new/destination/for/file.foo") to move each file one level up.
Use shutil.rmtree() to remove the folder (check here too); a sketch putting these steps together follows.
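Putting the steps together, a minimal sketch, assuming folder_1 is the parent folder and that every csv should end up directly inside it:

import os
import shutil

root = "folder_1"  # hypothetical path to the parent folder

for entry in os.listdir(root):
    sub = os.path.join(root, entry)
    if os.path.isdir(sub):
        for name in os.listdir(sub):
            if name.endswith(".csv"):
                # move the csv one level up, next to the subfolders
                shutil.move(os.path.join(sub, name), os.path.join(root, name))
        shutil.rmtree(sub)  # remove the now-emptied subfolder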
I have used Selenium with Python to download a zip file daily, but I am currently facing a few issues after downloading it to my local Downloads folder.
Is it possible to use Python to read those files dynamically, given that the date in the name is always different? Can we simply add a wildcard (*)? I am trying to move the file from the Downloads folder to another folder, but it always requires me to name the file entirely.
How do I unzip a file and look for specific files inside it? Let's say those files always start with names like "ABC202103xx.csv".
Much appreciated for your help! Any sample code would be truly appreciated!
Not knowing the exact name of a file in a local folder should usually not be a problem. You can just list all the filenames in the local folder and then use a for loop to find the filename you need. For example, let's assume you have downloaded a zip file into a Downloads folder and you know it is named "file-X.zip", with X being any date.
import os

for filename in os.listdir("Downloads"):
    if filename.startswith("file-") and filename.endswith(".zip"):
        filename_you_are_looking_for = filename
        break
To unzip files, I will refer you to this stackoverflow thread. Again, to look for specific files in there, you can use os.listdir.
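A minimal sketch combining both parts, assuming the standard-library zipfile module and that "ABC" is the fixed prefix of the files you want inside the archive:

import os
import zipfile

downloads = "Downloads"  # hypothetical path to the Downloads folder

# Find the downloaded zip without knowing its exact (dated) name.
zip_name = next(f for f in os.listdir(downloads)
                if f.startswith("file-") and f.endswith(".zip"))

with zipfile.ZipFile(os.path.join(downloads, zip_name)) as zf:
    # Keep only the members whose names start with "ABC" and end with ".csv".
    wanted = [m for m in zf.namelist()
              if os.path.basename(m).startswith("ABC") and m.endswith(".csv")]
    zf.extractall(path="extracted", members=wanted)  # extract just those files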