Python grab substring between two specific characters - python

I have a folder with hundreds of files named like:
"2017_05_S2B_7VEG_20170528_0_L2A_B01.tif"
Convention:
year_month_ID_zone_date_0_L2A_B01.tif ("_0_L2A_B01.tif", and "zone" never change)
What I need is to iterate through every file and build a path based on their name in order to download them.
For example:
name = "2017_05_S2B_7VEG_20170528_0_L2A_B01.tif"
path = "2017/5/S2B_7VEG_20170528_0_L2A/B01.tif"
The path convention needs to be: path = year/month/ID_zone_date_0_L2A/B01.tif
I thought of making a loop which would "cut" my string into several parts every time it encounters a "_" character, then stitch the different parts in the right order to create my path name.
I tried this but it didn't work:
import re
filename =
"2017_05_S2B_7VEG_20170528_0_L2A_B01.tif"
try:
found = re.search('_(.+?)_', filename).group(1)
except AttributeError:
# _ not found in the original string
found = '' # apply your error handling
How could I achieve that on Python ?

Since you only have one separator character, you may as well simply use Python's built in split function:
import os
items = filename.split('_')
year, month = items[:2]
new_filename = '_'.join(items[2:])
path = os.path.join(year, month, new_filename)

Try the following code snippet
filename = "2017_05_S2B_7VEG_20170528_0_L2A_B01.tif"
found = re.sub('(\d+)_(\d+)_(.*)_(.*)\.tif', r'\1/\2/\3/\4.tif', filename)
print(found) # prints 2017/05/S2B_7VEG_20170528_0_L2A/B01.tif

No need for a regex -- you can just use split().
filename = "2017_05_S2B_7VEG_20170528_0_L2A_B01.tif"
parts = filename.split("_")
year = parts[0]
month = parts[1]

Maybe you can do like this:
from os import listdir, mkdir
from os.path import isfile, join, isdir
my_path = 'your_soure_dir'
files_name = [f for f in listdir(my_path) if isfile(join(my_path, f))]
def create_dir(files_name):
for file in files_name:
month = file.split('_', '1')[0]
week = file.split('_', '2')[1]
if not isdir(my_path):
mkdir(month)
mkdir(week)
### your download code

filename = "2017_05_S2B_7VEG_20170528_0_L2A_B01.tif"
temp = filename.split('_')
result = "/".join(temp)
print(result)
result is
2017/05/S2B/7VEG/20170528/0/L2A/B01.tif

Related

searching specific string in list

How to search for every string in a list that starts with a specific string like:
path = (r"C:\Users\Example\Desktop")
desktop = os.listdir(path)
print(desktop)
#['faf.docx', 'faf.txt', 'faad.txt', 'gas.docx']
So my question is: how do i filter from every file that starts with "fa"?
For this specific cases, involving filenames in one directory, you can use globbing:
import glob
import os
path = (r"C:\Users\Example\Desktop")
pattern = os.path.join(path, 'fa*')
files = glob.glob(pattern)
This code filters all items out that start with "fa" and stores them in a separate list
filtered = [item for item in path if item.startswith("fa")]
All strings have a .startswith() method!
results = []
for value in os.listdir(path):
if value.startswith("fa"):
results.append(value)

iterating through specific files in folder with name matching pattern in python

I have a folder with a lot of csv files with different names.
I want to work only with the files that their name is made up of numbers only,
though I have no information of the range of the numbers in the title of the files.
for example, I have
['123.csv', 'not.csv', '75839.csv', '2.csv', 'bad.csv', '23bad8.csv']
and I would like to only work with ['123.csv', '75839.csv', '2.csv']
I tried the following code:
for f in file_list:
if f.startwith('1' or '2' or '3' ..... or '9'):
# do something
but this does not some the problem if the file name starts with a number but still includes letters or other symbols later.
You can use Regex to do the following:
import re
lst_of_files = ['temo1.csv', '12321.csv', '123123.csv', 'fdao123.csv', '12312asdv.csv', '123otk123.csv', '123.txt']
pattern = re.compile('^[0-9]+.csv')
newlst = [re.findall(pattern, filename) for filename in lst_of_files if len(re.findall(pattern, filename)) > 0]
print(newlst)
You can do it this way:
file_list = ["123.csv", "not.csv", "75839.csv", "2.csv", "bad.csv", "23bad8.csv"]
for f in file_list:
name, ext = f.rsplit(".", 1) # split at the rightmost dot
if name.isnumeric():
print(f)
Output is
123.csv
75839.csv
2.csv
One of the approaches:
import re
lst_of_files = ['temo1.csv', '12321.csv', '123123.csv', 'fdao123.csv', '12312asdv.csv', '123otk123.csv', '123.txt', '876.csv']
for f in lst_of_files:
if re.search(r'^[0-9]+.csv', f):
print (f)
Output:
12321.csv
123123.csv
876.csv

Verify the format of a filename in Python

Every week I get two files with following pattern.
EMEA_{sample}_Tracker_{year}_KW{week}
E.g.
EMEA_G_Tracker_2019_KW52.xlsx
EMEA_BC_Tracker_2019_KW52.xlsx
Next files would look like these
EMEA_G_Tracker_2020_KW1.xlsx
EMEA_BC_Tracker_2020_KW1.xlsx
Placeholders:
sample = G or BC
year = current year [YYYY]
week = calendar week [0 - ~52]
The only changes are made in the placeholders, everything else will stay the same.
How can I extract these values from the filename and check if the filename has this format?
Right now I only read all files using os.walk():
path_files = "Files/"
files = []
for (_, _, filenames) in walk(path_files):
files.extend(filenames)
break
If filename is the name of the file you've got:
import re
result = re.match(r'EMEA_(.*?)_Tracker_(\d+)_KW(\d+)', filename)
sample, year, week = result.groups()
Here is an example of how to collect all files matching your pattern into a list using regex and list comprehension. Then you can use the list as you wish in later code.
import os
import re
# Compile the regular expression pattern.
re_emea = re.compile('^EMEA_(G|BC)_Tracker_20\d{2}_KW\d{1,2}.xlsx$')
# Set path to be searched.
path = '/home/username/Desktop/so/emea_files'
# Collect all filenames matching the pattern into a list.
files = [f for f in os.listdir(path) if re_emea.match(f)]
# View the results.
print(files)
All files in the directory:
['EMEA_G_Tracker_2020_KW2.xlsx',
'other_file_3.txt',
'EMEA_G_Tracker_2020_KW1.xlsx',
'other_file_2.txt',
'other_file_5.txt',
'other_file_4.txt',
'EMEA_BC_Tracker_2019_KW52.xlsx',
'other_file_1.txt',
'EMEA_G_Tracker_2019_KW52.xlsx',
'EMEA_BC_Tracker_2020_KW2.xlsx',
'EMEA_BC_Tracker_2020_KW1.xlsx']
The results from pattern matching:
['EMEA_G_Tracker_2020_KW2.xlsx',
'EMEA_G_Tracker_2020_KW1.xlsx',
'EMEA_BC_Tracker_2019_KW52.xlsx',
'EMEA_G_Tracker_2019_KW52.xlsx',
'EMEA_BC_Tracker_2020_KW2.xlsx',
'EMEA_BC_Tracker_2020_KW1.xlsx']
Hope this helps! If not, just give me a shout.

How to append the grand-parent folder name to the filename?

I have directory where multiple folders exist and within each folder,a file exist inside another folder. Below is the structure
C:\users\TPCL\New\20190919_xz.txt
C:\users\TPCH\New\20190919_abc.txt
Objective:
I want to rename the file names like below:
C:\users\TPCL\New\20190919_xz_TPCL.txt
C:\users\TPCH\New\20190919_abc_TPCH.txt
My Approach:
for root,dirs,filename in os.walk('C\users\TPCL\New'):
prefix = os.path.basename(root)
for f in filename:
os.rename(os.path.join(root,f),os.path.join(root,"{}_{}".format(f,prefix)))
The above approach is yielding the following result:
C:\users\TPCL\New\20190919_xz_New.txt
C:\users\TPCH\New\20190919_abc_New.txt
So the question is: How to get the grand-parent folder name get appended, instead of parent folder name?
You need to use both dirname and basename to do this.
Use os.path.dirname to get the directory name (excluding the last part) and
then use os.path.basename to get the last part of the pathname.
Replace
prefix = os.path.basename(root)
with
os.path.basename(os.path.dirname(root))
Please refer this:
https://docs.python.org/3.7/library/os.path.html#os.path.basename
https://docs.python.org/3.7/library/os.path.html#os.path.dirname
Using PurePath from pathlib you can get the parts of the path. If the path contains the filename its grand-parent folder will be at index -3.
In [23]: from pathlib import PurePath
In [24]: p = r'C:\users\TPCL\New\20190919_xz_TPCL.txt'
In [25]: g = PurePath(p)
In [26]: g.parts
Out[26]: ('C:\\', 'users', 'TPCL', 'New', '20190919_xz_TPCL.txt')
In [27]: g.parts[-3]
Out[27]: 'TPCL'
If the path does not contain the filename the grand=parent would be at index -2.
Your process would look something like this:
import os.path
from pathlib import PurePath
for root,dirs,fnames in os.walk(topdirectory):
#print(root)
try:
addition = PurePath(root).parts[-2]
for name in fnames:
n,ext = os.path.splitext(name)
newname = n + '_' + addition + ext
print(name, os.path.join(root,newname))
except IndexError:
pass
I added the try/except to filter out paths that don't have grand-parents - it isn't necessary if you know it isn't needed.
You can split the path string using '\' and then count back to what would be considered the grandparent
directory for any given file and then append it. For example, if you have
filename = "dir1\dir2\file.txt"
splitPaths = filename.split('\') // gives you ['dir1', 'dir2', 'file.txt']
the last entry is the file name, the second to last is the parent, and the third to last is the grandparent and so on. You can then append whichever directory you want to the end of the string.

Python split url to find image name and extension

I am looking for a way to extract a filename and extension from a particular url using Python
lets say a URL looks as follows
picture_page = "http://distilleryimage2.instagram.com/da4ca3509a7b11e19e4a12313813ffc0_7.jpg"
How would I go about getting the following.
filename = "da4ca3509a7b11e19e4a12313813ffc0_7"
file_ext = ".jpg"
try:
# Python 3
from urllib.parse import urlparse
except ImportError:
# Python 2
from urlparse import urlparse
from os.path import splitext, basename
picture_page = "http://distilleryimage2.instagram.com/da4ca3509a7b11e19e4a12313813ffc0_7.jpg"
disassembled = urlparse(picture_page)
filename, file_ext = splitext(basename(disassembled.path))
Only downside with this is that your filename will contain a preceding / which you can always remove yourself.
Try with urlparse.urlsplit to split url, and then os.path.splitext to retrieve filename and extension (use os.path.basename to keep only the last filename) :
import urlparse
import os.path
picture_page = "http://distilleryimage2.instagram.com/da4ca3509a7b11e19e4a12313813ffc0_7.jpg"
print os.path.splitext(os.path.basename(urlparse.urlsplit(picture_page).path))
>>> ('da4ca3509a7b11e19e4a12313813ffc0_7', '.jpg')
filename = picture_page.split('/')[-1].split('.')[0]
file_ext = '.'+picture_page.split('.')[-1]
# Here's your link:
picture_page = "http://distilleryimage2.instagram.com/da4ca3509a7b11e19e4a12313813ffc0_7.jpg"
#Here's your filename and ext:
filename, ext = (picture_page.split('/')[-1].split('.'))
When you do picture_page.split('/'), it will return a list of strings from your url split by a /.
If you know python list indexing well, you'd know that -1 will give you the last element or the first element from the end of the list.
In your case, it will be the filename: da4ca3509a7b11e19e4a12313813ffc0_7.jpg
Splitting that by delimeter ., you get two values:
da4ca3509a7b11e19e4a12313813ffc0_7 and jpg, as expected, because they are separated by a period which you used as a delimeter in your split() call.
Now, since the last split returns two values in the resulting list, you can tuplify it.
Hence, basically, the result would be like:
filename,ext = ('da4ca3509a7b11e19e4a12313813ffc0_7', 'jpg')
os.path.splitext will help you extract the filename and extension once you have extracted the relevant string from the URL using urlparse:
fName, ext = os.path.splitext('yourImage.jpg')
This is the easiest way to find image name and extension using regular expression.
import re
import sys
picture_page = "http://distilleryimage2.instagram.com/da4ca3509a7b11e19e4a12313813ffc0_7.jpg"
regex = re.compile('(.*\/(?P<name>\w+)\.(?P<ext>\w+))')
print regex.search(picture_page).group('name')
print regex.search(picture_page).group('ext')
>>> import re
>>> s = 'picture_page = "http://distilleryimage2.instagram.com/da4ca3509a7b11e19e4a12313813ffc0_7.jpg"'
>>> re.findall(r'\/([a-zA-Z0-9_]*)\.[a-zA-Z]*\"$',s)[0]
'da4ca3509a7b11e19e4a12313813ffc0_7'
>>> re.findall(r'([a-zA-Z]*)\"$',s)[0]
'jpg'

Categories

Resources