iterating through specific files in folder with name matching pattern in python

I have a folder with a lot of CSV files with different names.
I want to work only with the files whose names consist of digits only,
though I have no information about the range of the numbers in the file names.
For example, I have
['123.csv', 'not.csv', '75839.csv', '2.csv', 'bad.csv', '23bad8.csv']
and I would like to work only with ['123.csv', '75839.csv', '2.csv'].
I tried the following code:
for f in file_list:
    if f.startwith('1' or '2' or '3' ..... or '9'):
        # do something
but this does not solve the problem if the file name starts with a number but still includes letters or other symbols later.

You can use a regex to do this:
import re
lst_of_files = ['temo1.csv', '12321.csv', '123123.csv', 'fdao123.csv', '12312asdv.csv', '123otk123.csv', '123.txt']
# Anchor at both ends and escape the dot so that only all-digit names ending in .csv match.
pattern = re.compile(r'^[0-9]+\.csv$')
newlst = [filename for filename in lst_of_files if pattern.match(filename)]
print(newlst)  # ['12321.csv', '123123.csv']

You can do it this way:
file_list = ["123.csv", "not.csv", "75839.csv", "2.csv", "bad.csv", "23bad8.csv"]
for f in file_list:
    name, ext = f.rsplit(".", 1)  # split at the rightmost dot
    if name.isnumeric():
        print(f)
Output:
123.csv
75839.csv
2.csv

One of the approaches:
import re
lst_of_files = ['temo1.csv', '12321.csv', '123123.csv', 'fdao123.csv', '12312asdv.csv', '123otk123.csv', '123.txt', '876.csv']
for f in lst_of_files:
    # escape the dot and anchor the end so names like '123.csvx' are rejected
    if re.search(r'^[0-9]+\.csv$', f):
        print(f)
Output:
12321.csv
123123.csv
876.csv
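
The regex approach can be wrapped in a small reusable function. This is only a sketch: the names NUMERIC_CSV and numeric_csv_files are made up for illustration, and it assumes "numbers only" means ASCII digits 0-9. re.fullmatch is used so that no explicit ^ and $ anchors are needed.

```python
import re

# Hypothetical names; the pattern mirrors the regex used in the answers.
NUMERIC_CSV = re.compile(r'[0-9]+\.csv')

def numeric_csv_files(names):
    """Return the names that consist only of ASCII digits followed by '.csv'."""
    return [name for name in names if NUMERIC_CSV.fullmatch(name)]

file_list = ['123.csv', 'not.csv', '75839.csv', '2.csv', 'bad.csv', '23bad8.csv']
print(numeric_csv_files(file_list))  # ['123.csv', '75839.csv', '2.csv']
```

To apply it to a real folder, pass the result of os.listdir(folder) as names.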

Related

Python grab substring between two specific characters

I have a folder with hundreds of files named like:
"2017_05_S2B_7VEG_20170528_0_L2A_B01.tif"
Convention:
year_month_ID_zone_date_0_L2A_B01.tif ("_0_L2A_B01.tif", and "zone" never change)
What I need is to iterate through every file and build a path based on their name in order to download them.
For example:
name = "2017_05_S2B_7VEG_20170528_0_L2A_B01.tif"
path = "2017/5/S2B_7VEG_20170528_0_L2A/B01.tif"
The path convention needs to be: path = year/month/ID_zone_date_0_L2A/B01.tif
I thought of making a loop which would "cut" my string into several parts every time it encounters a "_" character, then stitch the different parts in the right order to create my path name.
I tried this but it didn't work:
import re
filename = "2017_05_S2B_7VEG_20170528_0_L2A_B01.tif"
try:
    found = re.search('_(.+?)_', filename).group(1)
except AttributeError:
    # _ not found in the original string
    found = ''  # apply your error handling
How can I achieve that in Python?

Since you only have one separator character, you may as well simply use Python's built-in str.split method:
import os
items = filename.split('_')
year, month = items[:2]
new_filename = '_'.join(items[2:])
path = os.path.join(year, month, new_filename)

Try the following code snippet:
import re
filename = "2017_05_S2B_7VEG_20170528_0_L2A_B01.tif"
found = re.sub(r'(\d+)_(\d+)_(.*)_(.*)\.tif', r'\1/\2/\3/\4.tif', filename)
print(found)  # prints 2017/05/S2B_7VEG_20170528_0_L2A/B01.tif

No need for a regex -- you can just use split().
filename = "2017_05_S2B_7VEG_20170528_0_L2A_B01.tif"
parts = filename.split("_")
year = parts[0]
month = parts[1]
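
The split idea can be carried through to the full path from the question. The sketch below is one possible way, not the only one: build_path is a hypothetical name, and it assumes every file name has at least four underscore-separated fields. It splits twice from the left for the year and month, and once from the right for the band.

```python
def build_path(filename):
    """Turn 'year_month_..._band.tif' into 'year/month/.../band.tif'."""
    year, month, rest = filename.split("_", 2)  # peel off the first two fields
    folder, band = rest.rsplit("_", 1)          # peel off the last field
    # int() drops the leading zero, matching the "2017/5/..." example in the question.
    return "/".join([year, str(int(month)), folder, band])

name = "2017_05_S2B_7VEG_20170528_0_L2A_B01.tif"
print(build_path(name))  # 2017/5/S2B_7VEG_20170528_0_L2A/B01.tif
```

If the month should keep its leading zero, use month directly instead of str(int(month)).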

Maybe you can do it like this:
from os import listdir, makedirs
from os.path import isfile, join
my_path = 'your_source_dir'
files_name = [f for f in listdir(my_path) if isfile(join(my_path, f))]
def create_dir(files_name):
    for file in files_name:
        # the first two underscore-separated fields are the year and the month
        year, month = file.split('_', 2)[:2]
        # create <my_path>/<year>/<month> if it does not exist yet
        makedirs(join(my_path, year, month), exist_ok=True)
        ### your download code

filename = "2017_05_S2B_7VEG_20170528_0_L2A_B01.tif"
temp = filename.split('_')
result = "/".join(temp)
print(result)
result is
2017/05/S2B/7VEG/20170528/0/L2A/B01.tif

Verify the format of a filename in Python

Every week I get two files with following pattern.
EMEA_{sample}_Tracker_{year}_KW{week}
E.g.
EMEA_G_Tracker_2019_KW52.xlsx
EMEA_BC_Tracker_2019_KW52.xlsx
Next files would look like these
EMEA_G_Tracker_2020_KW1.xlsx
EMEA_BC_Tracker_2020_KW1.xlsx
Placeholders:
sample = G or BC
year = current year [YYYY]
week = calendar week [0 - ~52]
The only changes are made in the placeholders, everything else will stay the same.
How can I extract these values from the filename and check if the filename has this format?
Right now I only read all files using os.walk():
from os import walk
path_files = "Files/"
files = []
for (_, _, filenames) in walk(path_files):
    files.extend(filenames)
    break

If filename is the name of the file you've got:
import re
result = re.match(r'EMEA_(.*?)_Tracker_(\d+)_KW(\d+)', filename)
sample, year, week = result.groups()
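
Note that re.match returns None when the name does not fit, so calling .groups() on the result would then raise an error. A slightly more defensive sketch is shown below; TRACKER and parse_tracker are hypothetical names, and the pattern assumes the .xlsx extension and the G/BC samples from the examples in the question.

```python
import re

# Hypothetical pattern name; the named groups follow the placeholders in the question.
TRACKER = re.compile(
    r'EMEA_(?P<sample>G|BC)_Tracker_(?P<year>\d{4})_KW(?P<week>\d{1,2})\.xlsx'
)

def parse_tracker(filename):
    """Return (sample, year, week), or None if the name does not fit the format."""
    m = TRACKER.fullmatch(filename)
    if m is None:
        return None
    return m['sample'], int(m['year']), int(m['week'])

print(parse_tracker('EMEA_BC_Tracker_2019_KW52.xlsx'))  # ('BC', 2019, 52)
print(parse_tracker('EMEA_X_Tracker_2019_KW52.xlsx'))   # None
```

A None result doubles as the format check, so the same function both validates and extracts.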

Here is an example of how to collect all files matching your pattern into a list using regex and list comprehension. Then you can use the list as you wish in later code.
import os
import re
# Compile the regular expression pattern.
re_emea = re.compile(r'^EMEA_(G|BC)_Tracker_20\d{2}_KW\d{1,2}\.xlsx$')
# Set path to be searched.
path = '/home/username/Desktop/so/emea_files'
# Collect all filenames matching the pattern into a list.
files = [f for f in os.listdir(path) if re_emea.match(f)]
# View the results.
print(files)
All files in the directory:
['EMEA_G_Tracker_2020_KW2.xlsx',
'other_file_3.txt',
'EMEA_G_Tracker_2020_KW1.xlsx',
'other_file_2.txt',
'other_file_5.txt',
'other_file_4.txt',
'EMEA_BC_Tracker_2019_KW52.xlsx',
'other_file_1.txt',
'EMEA_G_Tracker_2019_KW52.xlsx',
'EMEA_BC_Tracker_2020_KW2.xlsx',
'EMEA_BC_Tracker_2020_KW1.xlsx']
The results from pattern matching:
['EMEA_G_Tracker_2020_KW2.xlsx',
'EMEA_G_Tracker_2020_KW1.xlsx',
'EMEA_BC_Tracker_2019_KW52.xlsx',
'EMEA_G_Tracker_2019_KW52.xlsx',
'EMEA_BC_Tracker_2020_KW2.xlsx',
'EMEA_BC_Tracker_2020_KW1.xlsx']
Hope this helps! If not, just give me a shout.

Extracting numbers from a filename string in python

I have a number of html files in a directory. I am trying to store the filenames in a list so that I can use it later to compare with another list.
Eg: Prod224_0055_00007464_20170930.html is one of the filenames. From the filename, I want to extract '00007464' and store this value in a list and repeat the same for all the other files in the directory. How do I go about doing this? I am new to Python and any help would be greatly appreciated!
Please let me know if you need more information to answer the question.

Split the filename on underscores and select the third element (index 2).
>>> 'Prod224_0055_00007464_20170930.html'.split('_')[2]
'00007464'
In context that might look like this:
nums = [f.split('_')[2] for f in os.listdir(dir) if f.endswith('.html')]
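
The one-liner raises an IndexError if the directory holds an .html file without enough underscores. A guarded variant might look like the sketch below; third_field is a hypothetical name, and it assumes valid names have at least four underscore-separated fields, as in the example.

```python
def third_field(names, suffix=".html"):
    """Collect the third underscore-separated field of each matching name."""
    fields = []
    for name in names:
        parts = name.split("_")
        # Skip other file types and names that do not have at least four fields.
        if name.endswith(suffix) and len(parts) >= 4:
            fields.append(parts[2])
    return fields

names = ["Prod224_0055_00007464_20170930.html", "index.html", "notes.txt"]
print(third_field(names))  # ['00007464']
```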

You may try this (assuming you are in the folder with the files):
import os
num_list = []
r, d, files = next(os.walk('.'))
for f in files:
    parts = f.split('_')  # now `parts` contains ['Prod224', '0055', '00007464', '20170930.html']
    print(parts[2])  # this outputs '00007464'
    num_list.append(parts[2])

Assuming you have a certain pattern for your files, you can use a regex:
>>> import re
>>> s = 'Prod224_0055_00007464_20170930.html'
>>> desired_number = re.findall(r"\d+", s)[2]
>>> desired_number
'00007464'
Using a regex gives you not only the specific number you want, but also the other numbers in the file name.
This will work if your file names follow the pattern "[some text][number]_[number]_[desired_number]_[a date].html". After getting the number, you can use the list's append method to add it to any list you want.

select files from path

I have files in a particular path and need to select them one by one based on the file name (yyyymmdd.faifb1p16m2.nc), where yyyy is the year, mm is the month, and dd is the day. I wrote code like this:
import os
results = []
base_dir = 'C:/DATA2013'
os.chdir(base_dir)
files = os.listdir('C:/DATA2013')
for f in files:
    results += [each for each in os.listdir('C:/DATA2013')
                if each.endswith('.faifb1p16m2.nc')]
What should I do next if I want to select only the files for January, then February, and so on? Thank you.

You can do:
x = [i for i in results if i[4:6] == '01']
This lists all file names for January, assuming that all your files have the format you described in the question.
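
To handle January, then February, and so on in one pass, the same slice can serve as a grouping key. The following is a sketch under the same assumption (names start with yyyymmdd); group_by_month and the sample names are made up for illustration.

```python
from collections import defaultdict

def group_by_month(names):
    """Map 'mm' to the list of names whose characters 4-5 equal that month."""
    by_month = defaultdict(list)
    for name in names:
        by_month[name[4:6]].append(name)
    return by_month

# Made-up sample names in the yyyymmdd.faifb1p16m2.nc format.
results = ['20130105.faifb1p16m2.nc', '20130212.faifb1p16m2.nc', '20130120.faifb1p16m2.nc']
months = group_by_month(results)
print(months['01'])  # ['20130105.faifb1p16m2.nc', '20130120.faifb1p16m2.nc']
print(months['02'])  # ['20130212.faifb1p16m2.nc']
```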

Two regexes:
\d{4}(?:\d?|\d{2})(?:\d?|\d{2})\.faifb1p16m2\.nc
\d{8}\.faifb1p16m2\.nc
Sample data:
20140131.faifb1p16m2.nc
2014131.faifb1p16m2.nc
201412.faifb1p16m2.nc
201411.faifb1p16m2.nc
20141212.faifb1p16m2.nc
2014121.faifb1p16m2.nc
201411.faifb1p16m2.nc
The first regex will match all 7 of those entries. The second regex will match only entries 1 and 5. I probably made the regexes more complicated than I needed to.
You're going to want the second regex; I'm listing the first only as an example.
from glob import glob
import re
re1 = r'\d{4}(?:\d?|\d{2})(?:\d?|\d{2})\.faifb1p16m2\.nc'
re2 = r'\d{8}\.faifb1p16m2\.nc'
l = [f for f in glob('*.faifb1p16m2.nc') if re.search(re1, f)]
m = [f for f in glob('*.faifb1p16m2.nc') if re.search(re2, f)]
print(l)
print()
print(m)
# Then, suppose you want to filter and select everything with '12' in the list m
print(list(filter(lambda x: x[4:6] == '12', m)))
As another similar solution shows you can ditch glob for os.listdir(), so:
l = [f for f in glob('*.faifb1p16m2.nc') if re.search(re1, f)]
Becomes:
l = [f for f in os.listdir() if re.search(re1, f)]
And then the rest of the code is great. One of the great things about using glob is that you can use iglob which is just like glob, but as an iterator, which can help with performance when going through a directory with lots of files.
One more thing, here's another stackoverflow post with an overview of python's infamous lambda feature. It's often used for the functions map, reduce, filter, and so on.

To validate filenames, you could use the datetime.strptime() method:
#!/usr/bin/env python
import os
from datetime import datetime
from glob import glob
suffix = '.faifb1p16m2.nc'
def parse_date(path):
    try:
        return datetime.strptime(os.path.basename(path), '%Y%m%d' + suffix)
    except ValueError:
        return None  # failed to parse
paths_by_month = [[] for _ in range(12 + 1)]
for path in glob(r'C:\DATA2013\*' + suffix):  # for each nc-file in the directory
    date = parse_date(path)
    paths_by_month[date and date.month or 0].append(path)
print(paths_by_month[2])  # February paths
print(paths_by_month[0])  # paths with unrecognized date

Try this:
import os
base_dir = 'C://local'
for f in os.listdir(base_dir):
    if '.faifb1p16m2.nc' in f and f[4:6] == '01':  # set the month in this string
        print(f)

How to read filenames in a folder and access them in an alphabetical and increasing number order?

I would like to ask how to efficiently handle accessing of filenames in a folder in the right order (alphabetical and increasing in number).
For example, I have the following files in a folder: apple1.dat, apple2.dat, apple10.dat, banana1.dat, banana2.dat, banana10.dat. I would like to read the contents of the files such that apple1.dat will be read first and banana10.dat will be read last.
Thanks.
This is what I did so far.
from glob import glob
files = glob('*.dat')
for name in files:
    pass  # I read the files here in order
But as pointed out, apple10.dat comes before apple2.dat.

from glob import glob
import os
files_list = glob(os.path.join(my_folder, '*.dat'))
for a_file in sorted(files_list):
    # do whatever with the file
    # 'open' or 'with' statements depending on your python version
    pass

Try this one:
import os
def get_sorted_files(Directory):
    filenamelist = []
    for root, dirs, files in os.walk(Directory):
        for name in files:
            fullname = os.path.join(root, name)
            filenamelist.append(fullname)
    return sorted(filenamelist)

You have to cast the numbers to int first. Doing it the long way requires breaking each name into its string and number parts, casting the number to int, and then sorting. Perhaps someone else has a shorter or more efficient way.
def split_in_two(str_in):
    ## go from right to left until a letter is found
    ## assume first letter of name is not a digit
    for ctr in range(len(str_in)-1, 0, -1):
        if not str_in[ctr].isdigit():
            return str_in[:ctr+1], str_in[ctr+1:]  ## ctr+1 = first digit
    ## default for no letters found
    return str_in, "0"
files = ['apple1.dat', 'apple2.dat', 'apple10.dat', 'apple11.dat',
         'banana1.dat', 'banana10.dat', 'banana2.dat']
print(sorted(files))  ## sorted as you say
sort_numbers = []
for f in files:
    ## split off '.dat'
    no_ending = f[:-4]
    str_1, str_2 = split_in_two(no_ending)
    sort_numbers.append([str_1, int(str_2), ".dat"])
sort_numbers.sort()
print(sort_numbers)
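
A shorter way to get the same ordering is to pass a "natural sort" key to sorted(). The sketch below is one common variant, not part of the answer above: natural_key is a hypothetical name, and it assumes the numbers in the names are runs of ASCII digits. re.split with a capturing group keeps the digit runs, which always land at the odd indices of the result.

```python
import re

def natural_key(name):
    """Split a name into text and integer chunks so that 2 sorts before 10."""
    parts = re.split(r'([0-9]+)', name)
    # Odd indices hold the captured digit runs; even indices hold the text between them.
    return [int(p) if i % 2 else p.lower() for i, p in enumerate(parts)]

files = ['apple10.dat', 'banana2.dat', 'apple1.dat',
         'banana10.dat', 'apple2.dat', 'banana1.dat']
print(sorted(files, key=natural_key))
# ['apple1.dat', 'apple2.dat', 'apple10.dat', 'banana1.dat', 'banana2.dat', 'banana10.dat']
```

The same key works on the list returned by glob('*.dat'), so the files can be opened in this order inside the loop.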
