Every week I get two files with the following naming pattern.
EMEA_{sample}_Tracker_{year}_KW{week}
E.g.
EMEA_G_Tracker_2019_KW52.xlsx
EMEA_BC_Tracker_2019_KW52.xlsx
The next week's files would look like this:
EMEA_G_Tracker_2020_KW1.xlsx
EMEA_BC_Tracker_2020_KW1.xlsx
Placeholders:
sample = G or BC
year = current year [YYYY]
week = calendar week [0 - ~52]
The only changes are made in the placeholders, everything else will stay the same.
How can I extract these values from the filename and check if the filename has this format?
Right now I only read all the filenames using os.walk():
from os import walk

path_files = "Files/"
files = []
for (_, _, filenames) in walk(path_files):
    files.extend(filenames)
    break
If filename is the name of the file you've got:
import re
result = re.match(r'EMEA_(.*?)_Tracker_(\d+)_KW(\d+)', filename)
sample, year, week = result.groups()
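If a name might not fit the pattern, re.match returns None, so it is worth guarding before unpacking (a small sketch building on the line above):
result = re.match(r'EMEA_(.*?)_Tracker_(\d+)_KW(\d+)', filename)
if result:
    sample, year, week = result.groups()
else:
    # filename does not follow the expected format
    sample = year = week = None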
Here is an example of how to collect all files matching your pattern into a list using regex and list comprehension. Then you can use the list as you wish in later code.
import os
import re
# Compile the regular expression pattern.
re_emea = re.compile(r'^EMEA_(G|BC)_Tracker_20\d{2}_KW\d{1,2}\.xlsx$')
# Set path to be searched.
path = '/home/username/Desktop/so/emea_files'
# Collect all filenames matching the pattern into a list.
files = [f for f in os.listdir(path) if re_emea.match(f)]
# View the results.
print(files)
All files in the directory:
['EMEA_G_Tracker_2020_KW2.xlsx',
'other_file_3.txt',
'EMEA_G_Tracker_2020_KW1.xlsx',
'other_file_2.txt',
'other_file_5.txt',
'other_file_4.txt',
'EMEA_BC_Tracker_2019_KW52.xlsx',
'other_file_1.txt',
'EMEA_G_Tracker_2019_KW52.xlsx',
'EMEA_BC_Tracker_2020_KW2.xlsx',
'EMEA_BC_Tracker_2020_KW1.xlsx']
The results from pattern matching:
['EMEA_G_Tracker_2020_KW2.xlsx',
'EMEA_G_Tracker_2020_KW1.xlsx',
'EMEA_BC_Tracker_2019_KW52.xlsx',
'EMEA_G_Tracker_2019_KW52.xlsx',
'EMEA_BC_Tracker_2020_KW2.xlsx',
'EMEA_BC_Tracker_2020_KW1.xlsx']
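If you also need the placeholder values out of the matched names, the same compiled pattern can carry capture groups for the year and week as well (a sketch extending the pattern above):
re_emea = re.compile(r'^EMEA_(G|BC)_Tracker_(20\d{2})_KW(\d{1,2})\.xlsx$')
for f in files:
    m = re_emea.match(f)
    if m:
        sample, year, week = m.groups()
        print(sample, year, week)  # e.g. G 2020 2 for EMEA_G_Tracker_2020_KW2.xlsx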
Hope this helps! If not, just give me a shout.
Related
How do I search for every string in a list that starts with a specific string? For example:
path = (r"C:\Users\Example\Desktop")
desktop = os.listdir(path)
print(desktop)
#['faf.docx', 'faf.txt', 'faad.txt', 'gas.docx']
So my question is: how do I filter out every file that starts with "fa"?
For this specific case, involving filenames in one directory, you can use globbing:
import glob
import os
path = (r"C:\Users\Example\Desktop")
pattern = os.path.join(path, 'fa*')
files = glob.glob(pattern)
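Note that glob.glob returns the joined paths (directory plus filename); if only the bare names are needed, they can be stripped back out:
names = [os.path.basename(p) for p in files]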
This list comprehension filters out all items that start with "fa" and stores them in a separate list (iterating over the desktop list from the question, not over path):
filtered = [item for item in desktop if item.startswith("fa")]
All strings have a .startswith() method!
results = []
for value in os.listdir(path):
    if value.startswith("fa"):
        results.append(value)
I have a folder with hundreds of files named like:
"2017_05_S2B_7VEG_20170528_0_L2A_B01.tif"
Convention:
year_month_ID_zone_date_0_L2A_B01.tif (the "_0_L2A_B01.tif" suffix and the "zone" part never change)
What I need is to iterate through every file and build a path based on its name in order to download it.
For example:
name = "2017_05_S2B_7VEG_20170528_0_L2A_B01.tif"
path = "2017/5/S2B_7VEG_20170528_0_L2A/B01.tif"
The path convention needs to be: path = year/month/ID_zone_date_0_L2A/B01.tif
I thought of making a loop which would "cut" my string into several parts every time it encounters a "_" character, then stitch the different parts in the right order to create my path name.
I tried this but it didn't work:
import re

filename = "2017_05_S2B_7VEG_20170528_0_L2A_B01.tif"
try:
    found = re.search('_(.+?)_', filename).group(1)
except AttributeError:
    # _ not found in the original string
    found = ''  # apply your error handling
How could I achieve that in Python?
Since you only have one separator character, you may as well simply use Python's built-in split function, re-joining the middle pieces to get the layout you described:
import os

items = filename.split('_')
year, month = items[:2]
middle = '_'.join(items[2:-1])  # 'S2B_7VEG_20170528_0_L2A'
band = items[-1]                # 'B01.tif'
path = os.path.join(year, month, middle, band)  # '2017/05/S2B_7VEG_20170528_0_L2A/B01.tif' on POSIX
Try the following code snippet:
import re

filename = "2017_05_S2B_7VEG_20170528_0_L2A_B01.tif"
found = re.sub(r'(\d+)_(\d+)_(.*)_(.*)\.tif', r'\1/\2/\3/\4.tif', filename)
print(found)  # prints 2017/05/S2B_7VEG_20170528_0_L2A/B01.tif
No need for a regex -- you can just use split().
filename = "2017_05_S2B_7VEG_20170528_0_L2A_B01.tif"
parts = filename.split("_")
year = parts[0]
month = parts[1]
Maybe you can do it like this:
from os import listdir, mkdir
from os.path import isfile, join, isdir

my_path = 'your_source_dir'
files_name = [f for f in listdir(my_path) if isfile(join(my_path, f))]

def create_dir(files_name):
    for file in files_name:
        year = file.split('_')[0]   # e.g. '2017'
        month = file.split('_')[1]  # e.g. '05'
        if not isdir(year):
            mkdir(year)
        if not isdir(join(year, month)):
            mkdir(join(year, month))
        ### your download code
filename = "2017_05_S2B_7VEG_20170528_0_L2A_B01.tif"
temp = filename.split('_')
result = "/".join(temp)
print(result)
result is
2017/05/S2B/7VEG/20170528/0/L2A/B01.tif
I have a folder with a lot of csv files with different names.
I want to work only with the files whose names are made up of numbers only,
though I have no information about the range of the numbers in the file names.
for example, I have
['123.csv', 'not.csv', '75839.csv', '2.csv', 'bad.csv', '23bad8.csv']
and I would like to only work with ['123.csv', '75839.csv', '2.csv']
I tried the following code:
for f in file_list:
    if f.startswith('1' or '2' or '3' ..... or '9'):
        # do something
but this does not solve the problem when the file name starts with a number but still includes letters or other symbols later.
You can use Regex to do the following:
import re
lst_of_files = ['temo1.csv', '12321.csv', '123123.csv', 'fdao123.csv', '12312asdv.csv', '123otk123.csv', '123.txt']
pattern = re.compile(r'^[0-9]+\.csv$')
newlst = [filename for filename in lst_of_files if pattern.match(filename)]
print(newlst)  # ['12321.csv', '123123.csv']
You can do it this way:
file_list = ["123.csv", "not.csv", "75839.csv", "2.csv", "bad.csv", "23bad8.csv"]
for f in file_list:
    name, ext = f.rsplit(".", 1)  # split at the rightmost dot
    if name.isnumeric():
        print(f)
Output is
123.csv
75839.csv
2.csv
One of the approaches:
import re
lst_of_files = ['temo1.csv', '12321.csv', '123123.csv', 'fdao123.csv', '12312asdv.csv', '123otk123.csv', '123.txt', '876.csv']
for f in lst_of_files:
    if re.search(r'^[0-9]+\.csv$', f):
        print(f)
Output:
12321.csv
123123.csv
876.csv
I have a number of html files in a directory. I am trying to store the filenames in a list so that I can use it later to compare with another list.
E.g. Prod224_0055_00007464_20170930.html is one of the filenames. From the filename, I want to extract '00007464' and store this value in a list and repeat the same for all the other files in the directory. How do I go about doing this? I am new to Python and any help would be greatly appreciated!
Please let me know if you need more information to answer the question.
Split the filename on underscores and select the third element (index 2).
>>> 'Prod224_0055_00007464_20170930.html'.split('_')[2]
'00007464'
In context that might look like this:
nums = [f.split('_')[2] for f in os.listdir(dir) if f.endswith('.html')]
You may try this (assuming you are in the folder with the files):
import os

num_list = []
r, d, files = next(os.walk('.'))
for f in files:
    parts = f.split('_')  # now `parts` contains ['Prod224', '0055', '00007464', '20170930.html']
    print(parts[2])  # this outputs '00007464'
    num_list.append(parts[2])
Assuming you have a certain pattern for your files, you can use a regex:
>>> import re
>>> s = 'Prod224_0055_00007464_20170930.html'
>>> desired_number = re.findall(r"\d+", s)[2]
>>> desired_number
'00007464'
Using a regex will help you get not only the specific number you want, but also the other numbers in the file name.
This will work if the name of your files follows the pattern "[some text][number]_[number]_[desired_number]_[a date].html". After getting the number, I think it will be very simple to use the append method to add that number to any list you want.
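Put together over a whole directory, that could look something like this (a sketch; 'html_dir' is a placeholder for your actual folder):
import os
import re

nums = []
for f in os.listdir('html_dir'):  # placeholder directory
    if f.endswith('.html'):
        nums.append(re.findall(r"\d+", f)[2])  # the third number is the one wanted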
OK, so I'm writing a module that will take in some command line arguments. One of the arguments, fundCodes, will be a list of funds: ['PUSFF', 'AGE', 'AIR']
My module has to search through files in a directory and look for files matching a certain format:
def file_match(self, fundCodes):
    # Get a list of the files
    files = set(os.listdir(self.unmappedDir))
    # loop through all the files and search for a matching file
    for check_fund in fundCodes:
        # set a file pattern
        file_match = 'unmapped_positions_{fund}_{start}_{end}.csv'.format(fund=check_fund, start=self.startDate, end=self.endDate)
        # Yet to be used...
        file_trade_match = 'unmapped_trades_{fund}_{start}_{end}.csv'.format(fund=check_fund, start=self.startDate, end=self.endDate)
        # look in the unmappedDir and see if there's a file with that name
        if file_match in files:
            # if there's a match, load unmapped positions as etl
            filename = os.path.join(self.unmappedDir, file_match)
            return self.read_file(filename)
        else:
            Logger.error('No file found with those dates/funds')
I'm trying to figure out the best way to search through the directory for two different formats.
Examples of the formats would be:
unmapped_trades_AGE_2018-07-01_2018-07-11.csv and
unmapped_positions_AGE_2018-07-01_2018-07-11.csv
I'm thinking I just need to assign each match to a variable and in my last iteration check if there's a file equal to either value right? It seems redundant though. Any other suggestions?
Just do two in tests. If you need both files to exist you can use and:
if file_match in files and file_trade_match in files:
    # do something
else:
    # log error
If you just want to process either file, you can do:
if file_match in files:
    # do something
elif file_trade_match in files:
    # do something else
else:
    # log error
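If the repetition mentioned in the question is a concern, the two name templates can also be checked in one loop inside the same method (a sketch reusing the names from the question's code):
templates = ['unmapped_positions_{fund}_{start}_{end}.csv',
             'unmapped_trades_{fund}_{start}_{end}.csv']
for check_fund in fundCodes:
    for template in templates:
        candidate = template.format(fund=check_fund, start=self.startDate, end=self.endDate)
        if candidate in files:
            return self.read_file(os.path.join(self.unmappedDir, candidate))
Logger.error('No file found with those dates/funds')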
I would use regular expressions for this, e.g.
import re
import os
search_pattern = r'unmapped_{}_(\w+)_([0-9\-]+)_([0-9\-]+)\.csv'
data_types = ['trades', 'positions']
pattern_dict = {data_type: search_pattern.format(data_type) for data_type in data_types}
def find_matching_files(search_dir, fund_codes):
    if not os.path.isdir(search_dir):
        raise ValueError('search_dir does not specify a directory')
    search_files = os.listdir(search_dir)
    matching_files = {data_type: [] for data_type in pattern_dict}
    for fname in search_files:
        for data_type, pattern in pattern_dict.items():
            m = re.match(pattern, fname)
            if m is not None and m.group(1) in fund_codes:
                matching_files[data_type].append(fname)
    return matching_files
print(find_matching_files('file_location/', ['PUSFF', 'AGE', 'AIR']))
where file_location/ is the directory to search, and a dictionary of the matching files separated into data types is returned
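A possible way to consume the returned dictionary would be something like this (a sketch; the directory and the read step are placeholders, with read_file taken from the question's code):
matches = find_matching_files('file_location/', ['PUSFF', 'AGE', 'AIR'])
for data_type, fnames in matches.items():
    for fname in fnames:
        full_path = os.path.join('file_location/', fname)
        print(data_type, full_path)  # e.g. hand full_path to read_file() here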