how to read file name with two variable parts - python

There is a folder with multiple Excel files in it. They are named systematically, like this:
a_b_12_2021043036548.xlsx
The a_b_12_ part is fixed.
20210430 is the date and changes every day.
36548 also changes every day, and there is no rule for it, other than that it is always five digits.
I have to read this Excel file every day from another script and save it as a dataframe. How can I do this?
I tried the following lines, but both attempts failed:
datetime_format = datetime.datetime(2021, 4, 30) # I just want to change the date here to read the related excel
x = datetime_format.strftime("%Y%m%d")
file1 = r'C:/report/FG/a_b_12_' + x + * + '.xlsx' #failed
file1 = r'C:/report/FG/a_b_12_' + x + r'[\d\]+' + '.xlsx' #failed

Use glob.glob:
from glob import glob
pattern = rf'C:/report/FG/a_b_12_{x}*.xlsx'
matched_files = glob(pattern)
If your assumption holds and there is exactly one such file, matched_files[0] will be it.
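A minimal sketch of the same idea wrapped in a function, in case it helps to fail loudly when the "exactly one file" assumption does not hold. The name find_daily_report and the temporary folder are made up for illustration; the five-digit suffix is taken from the question.

```python
import datetime
import glob
import os
import tempfile


def find_daily_report(folder, day, prefix="a_b_12_"):
    """Return the single .xlsx file for `day`; raise if there are 0 or 2+ matches."""
    stamp = day.strftime("%Y%m%d")
    # glob has no {5} repetition, so the [0-9] class is repeated five times.
    name_pattern = prefix + stamp + "[0-9]" * 5 + ".xlsx"
    matches = glob.glob(os.path.join(glob.escape(folder), name_pattern))
    if len(matches) != 1:
        raise FileNotFoundError(f"expected 1 file for {stamp}, found {len(matches)}")
    return matches[0]


# Demonstration in a throwaway folder with empty placeholder files.
with tempfile.TemporaryDirectory() as folder:
    for name in ("a_b_12_2021043036548.xlsx", "a_b_12_2021042911111.xlsx"):
        open(os.path.join(folder, name), "w").close()
    found = os.path.basename(find_daily_report(folder, datetime.date(2021, 4, 30)))

print(found)  # a_b_12_2021043036548.xlsx
```

The returned path can then be passed to pandas.read_excel to get the dataframe.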

import datetime
datetime_format = datetime.datetime(2021, 4, 30)
x = datetime_format.strftime("%Y%m%d")
dailyrunningnumber = 36548
# build the full name when the running number is known
file1 = 'C:/report/FG/a_b_12_{}{}.xlsx'.format(x, dailyrunningnumber)
Edited:
import glob
# list the paths of all .xlsx files that start with the fixed prefix
f_glob = 'C:/report/FG/a_b_12_*.xlsx'
f_names = glob.glob(f_glob)
print(f_names)

Assuming the second part is always a date, and that you know the date, just add a glob after that.
In Python, * in isolation is multiplication; the glob in your code needs to be a quoted string, like "*" or '*', just like you quote the rest of the file name.
import datetime
import glob
datetime_format = datetime.datetime(2021, 4, 30)
x = datetime_format.strftime("%Y%m%d")
matching_files = glob.glob('C:/report/FG/a_b_12_' + x + '*.xlsx')
for file1 in matching_files:
    ...  # process file1 here
If you are certain that the glob will always match a single file, of course, just use file1 = matching_files[0]
The temporary variables are nice for legibility, but really not particularly useful, so this can be refactored to
datetime_format = datetime.datetime(2021, 4, 30)
for file1 in glob.glob('C:/report/FG/a_b_12_' + datetime_format.strftime("%Y%m%d") + '*.xlsx'):
    ...  # process file1 here
You need an r'...' string when your string contains literal backslashes, but since you are using forward slash for the Windows directory separator (which is altogether more convenient anyway) I removed the r.
If you really want a regex solution, try
import re
import os
pattern = re.compile(r'^a_b_12_(\d{8})(\d{5})\.xlsx$')
for file in os.scandir('C:/report/FG/'):
    matched = pattern.match(file.name)
    if matched:
        file1 = file.path  # full path; file.name is only the base name
        date = matched.group(1)
        suffix = matched.group(2)
        break  # or whatever
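To see what the two capture groups pick out, here is a small sketch that applies such a pattern to plain strings, with no folder needed. It assumes the 8-digit date and 5-digit running number from the question; parse_report_name is a hypothetical helper name.

```python
import re

# Fixed prefix, 8-digit date, 5-digit running number, .xlsx extension.
PATTERN = re.compile(r"a_b_12_(\d{8})(\d{5})\.xlsx")


def parse_report_name(name):
    """Return (date, running_number) as strings, or None if `name` does not fit."""
    matched = PATTERN.fullmatch(name)
    return matched.groups() if matched else None


good = parse_report_name("a_b_12_2021043036548.xlsx")
bad = parse_report_name("a_b_12_20210430365.xlsx")  # suffix too short
print(good, bad)  # ('20210430', '36548') None
```

Unlike a bare * wildcard, this rejects names whose variable part is not exactly 13 digits.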

Related

Read in a csv file using wildcards

I need to read in a csv file daily, but certain numbers in the file name change each day. The filename with directory included is C:\siglocal\pairoffs\\logs_20220804_084056_9500_capped_delta_for_singlestockdelta.csv
I tried the line below, where I put an asterisk after the _08 in the file path. The nine characters after that point (4056_9500 in this example) change daily, and the last part of the file name (_capped_delta_for_singlestockdelta.csv) stays the same.
Any ideas what I need to do here?
df = pd.read_csv(r'C:\siglocal\pairoffs\\logs_20220804_08*' + '_capped_delta_for_singlestockdelta.csv')

I do not see how this is a pandas problem. If I understand correctly, you are looking for a way to build a string from variables. You can use str.format() for that:
r'C:\siglocal\pairoffs\\logs_20220804_08{0}_capped_delta_for_singlestockdelta.csv'.format(day)
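str.format only helps when the changing part is already known. When it is not, one option is to expand the wildcard with glob first and pass the resulting path to pd.read_csv, which, as far as I know, does not expand * in a local path itself. The sketch below assumes the time stamp in the name has a fixed width, so the alphabetically last match is also the latest one; latest_log and the second file name are made up for illustration.

```python
import glob
import os
import tempfile

SUFFIX = "_capped_delta_for_singlestockdelta.csv"


def latest_log(folder, day="20220804"):
    """Return the alphabetically last logs_<day>_*<SUFFIX> file, or None."""
    pattern = os.path.join(glob.escape(folder), f"logs_{day}_*{SUFFIX}")
    # Assumption: fixed-width time stamps, so string order equals time order.
    return max(glob.glob(pattern), default=None)


# Demonstration in a throwaway folder with empty placeholder files.
with tempfile.TemporaryDirectory() as folder:
    for stamp in ("084056_9500", "091502_0001"):  # second stamp is hypothetical
        open(os.path.join(folder, f"logs_20220804_{stamp}{SUFFIX}"), "w").close()
    newest = os.path.basename(latest_log(folder))
    missing = latest_log(folder, day="20220805")

print(newest)
print(missing)
```

A result of None means no file matched, which is worth checking before calling pd.read_csv on it.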

Perhaps use os.walk(...) and a regular expression to evaluate the files in the folder. Here's one possible implementation:
import os
import re
# define the folder where the files are located
src_folder = r"C:\_temp"
# define the regular expression to filter the files;
# the group captures the part that changes, e.g. 4056_9500
file_regex = (r"logs_20220804_08([0-9]{4}_[0-9]{4})"
              r"_capped_delta_for_singlestockdelta\.csv")
for dir_path, dir_names, file_names in os.walk(src_folder):
    # Each iteration contains:
    # dir_path - current folder for the iteration
    # dir_names - list of folders in the dir_path.
    # file_names - list of files in the dir_path.
    for file_name in file_names:
        print("Evaluating file({}) in folder({})"
              .format(file_name, dir_path))
        match_obj = re.match(file_regex, file_name, re.I)
        # match_obj will be None if there isn't a match
        if match_obj:
            print("{}File({}) matches our regular expression."
                  .format(" " * 5, file_name))
            print("{}Changing number value is: {}"
                  .format(" " * 5, match_obj.group(1)))
        else:
            print("{}No match for file ({})"
                  .format(" " * 5, file_name))

Python grab substring between two specific characters

I have a folder with hundreds of files named like:
"2017_05_S2B_7VEG_20170528_0_L2A_B01.tif"
Convention:
year_month_ID_zone_date_0_L2A_B01.tif ("_0_L2A_B01.tif", and "zone" never change)
What I need is to iterate through every file and build a path based on their name in order to download them.
For example:
name = "2017_05_S2B_7VEG_20170528_0_L2A_B01.tif"
path = "2017/5/S2B_7VEG_20170528_0_L2A/B01.tif"
The path convention needs to be: path = year/month/ID_zone_date_0_L2A/B01.tif
I thought of making a loop which would "cut" my string into several parts every time it encounters a "_" character, then stitch the different parts in the right order to create my path name.
I tried this, but it didn't work:
import re
filename = "2017_05_S2B_7VEG_20170528_0_L2A_B01.tif"
try:
    found = re.search('_(.+?)_', filename).group(1)
except AttributeError:
    # _ not found in the original string
    found = '' # apply your error handling
How could I achieve that in Python?

Since you only have one separator character, you may as well simply use Python's built-in str.split method:
import os
items = filename.split('_')
year, month = items[:2]
new_filename = '_'.join(items[2:])
path = os.path.join(year, month, new_filename)
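The question also asks for the last field (B01.tif) to become the file name inside a folder named after the middle fields, and for the month without its leading zero. A minimal sketch of that with split and rsplit only; build_path is a hypothetical name, and "/" is used as the separator on the assumption that the result is a download path rather than a local Windows path.

```python
def build_path(filename):
    """Turn year_month_<middle>_<band>.tif into year/month/<middle>/<band>.tif."""
    year, month, rest = filename.split("_", 2)  # split off the first two fields
    folder, band = rest.rsplit("_", 1)          # split off the last field
    # int() drops the leading zero, so "05" becomes "5" as in the question.
    return "/".join([year, str(int(month)), folder, band])


path = build_path("2017_05_S2B_7VEG_20170528_0_L2A_B01.tif")
print(path)  # 2017/5/S2B_7VEG_20170528_0_L2A/B01.tif
```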

Try the following code snippet
import re
filename = "2017_05_S2B_7VEG_20170528_0_L2A_B01.tif"
found = re.sub(r'(\d+)_(\d+)_(.*)_(.*)\.tif', r'\1/\2/\3/\4.tif', filename)
print(found) # prints 2017/05/S2B_7VEG_20170528_0_L2A/B01.tif

No need for a regex -- you can just use split().
filename = "2017_05_S2B_7VEG_20170528_0_L2A_B01.tif"
parts = filename.split("_")
year = parts[0]
month = parts[1]

Maybe you can do it like this:
from os import listdir, makedirs
from os.path import isfile, join
my_path = 'your_source_dir'
files_name = [f for f in listdir(my_path) if isfile(join(my_path, f))]
def create_dir(files_name):
    for file in files_name:
        # the first two '_'-separated fields are the year and the month
        year, month = file.split('_', 2)[:2]
        # create year/month under my_path if it does not exist yet
        makedirs(join(my_path, year, month), exist_ok=True)
        ### your download code

filename = "2017_05_S2B_7VEG_20170528_0_L2A_B01.tif"
temp = filename.split('_')
result = "/".join(temp)
print(result)
The result is:
2017/05/S2B/7VEG/20170528/0/L2A/B01.tif

How to add a fixed number to the integer part of a filename?

Using Python, I need to add 100 to the integer part of some filenames to rename the files. The files look like this: 0000000_6dee7e249cf3.log where 6dee7e249cf3 is a random number. At the end I should have:
0000000_6dee7e249cf3.log should change to 0000100_6dee7e249cf3.log
0000001_12b2bb88d493.log should change to 0000101_12b2bb88d493.log
etc, etc…
I can print the initial files using:
initial: glob('{0:07d}_*[a-z]*'.format(NUM))
but the final files returns an empty list:
final: glob('{0:07d}_*[a-z]*'.format(NUM+100))
Moreover, I cannot rename initial to final using os.rename, because it cannot take the list returned by the glob function.

I've included your regex search. It looks like glob doesn't handle regex, but re does
import os
import re
# for all files in the current directory
for f in os.listdir('./'):
    # if the name starts with 7 digits (re.match anchors at the start)
    if re.match('[0-9]{7}', f):
        lead_int = int(f.split('_')[0])
        # if the leading integer is less than 100
        if lead_int < 100:
            # rename this file with leading integer + 100
            os.rename(f, '%07d_%s' % (lead_int + 100, f.split('_')[-1]))

Split the file name value using '_' separator and use those two values to reconstruct your file name.
s = name.split('_')
n2 = str(int(s[0]) + 100)
new_name = s[0][:len(s[0]) - len(n2)] + n2 + '_' + s[1]
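The same reconstruction can be sketched with str.zfill, which pads the new number back to the original width. shift_name is a hypothetical helper; it only computes the new name and leaves the actual renaming to the caller.

```python
def shift_name(name, offset=100):
    """Add `offset` to the zero-padded integer before the first underscore."""
    number, rest = name.split("_", 1)
    # zfill pads back to the original width (7 digits in the question).
    return str(int(number) + offset).zfill(len(number)) + "_" + rest


first = shift_name("0000000_6dee7e249cf3.log")
second = shift_name("0000001_12b2bb88d493.log")
print(first)   # 0000100_6dee7e249cf3.log
print(second)  # 0000101_12b2bb88d493.log
```

os.rename takes one source and one destination path, so it would be called once per file, for example os.rename(old, shift_name(old)) for each old in the list returned by glob.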

How to change names of a list of numpy files?

I have a list of NumPy files and need to change their names. Assume that I have this list of files:
AES_Trace=1_key=hexaNumber_Plaintext=hexaNumber_Ciphertext=hexaNumber.npy
AES_Trace=2_key=hexaNumber_Plaintext=hexaNumber_Ciphertext=hexaNumber.npy
AES_Trace=3_key=hexaNumber_Plaintext=hexaNumber_Ciphertext=hexaNumber.npy
What I need to change is the trace number in each file name. As a result I must have:
AES_Trace=100001_key=hexaNumber_Plaintext=hexaNumber_Ciphertext=hexaNumber.npy
AES_Trace=100002_key=hexaNumber_Plaintext=hexaNumber_Ciphertext=hexaNumber.npy
AES_Trace=100003_key=hexaNumber_Plaintext=hexaNumber_Ciphertext=hexaNumber.npy
I have tried:
import os
import numpy as np
import struct
path_For_Numpy_Files='C:\\Users\\user\\My_Test_Traces\\1000_Traces_npy'
os.chdir(path_For_Numpy_Files)
list_files_Without_Sort=os.listdir(os.getcwd())
list_files_Sorted=sorted((list_files_Without_Sort),key=os.path.getmtime)
for file in list_files_Sorted:
    print(file)
    os.rename(file, file[11]+100000)
I don't think this is a good solution: it doesn't work, and it gives me this error:
os.rename(file,file[11]+100000)
IndexError: string index out of range

Your file variable is a str, so you can't add an int like 100000 to it. Split the name at the first '=' and rebuild it instead:
>>> file = 'Tracenumber=01_Pltx5=23.npy'
>>> '{}=1000{}'.format(*file.split('=', 1))
'Tracenumber=100001_Pltx5=23.npy'
So you can instead use
os.rename(file, '{}=1000{}'.format(*file.split('=', 1)))

I'm sure that you can do this in one line, or with regex but I think that clarity is more valuable. Try this:
import os
path = 'C:\\Users\\user\\My_Test_Traces\\1000_Traces_npy'
file_names = os.listdir(path)
for file in file_names:
    start = file[0:file.index("Trace=")+6]
    end = file[file.index("_key"):]
    num = file[len(start): file.index(end)]
    new_name = start + str(100000+int(num)) + end
    os.rename(os.path.join(path, file), os.path.join(path, new_name))
This works for numbers with any digit count, whereas the other answer prepends a fixed 1000 and so only gives the right result for two-digit numbers.
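For comparison, a regex version of the same renumbering could look like the sketch below. renumber is a hypothetical helper, and the hex values in the sample name are invented stand-ins for the hexaNumber placeholders in the question.

```python
import re


def renumber(name, offset=100000):
    """Add `offset` to the integer that follows 'Trace=' in `name`."""
    return re.sub(
        r"(?<=Trace=)\d+",                        # digits right after 'Trace='
        lambda m: str(int(m.group()) + offset),   # replace them with the sum
        name,
        count=1,
    )


old = "AES_Trace=12_key=0a_Plaintext=1b_Ciphertext=2c.npy"  # invented hex values
new = renumber(old)
print(new)  # AES_Trace=100012_key=0a_Plaintext=1b_Ciphertext=2c.npy
```

Because the sum is computed arithmetically, the number of digits in the original trace number does not matter.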

Shutil multiple files after reading with "pydicom"

What I basically want is for myvar to vary between 1 and 280 so that I can use it to read the files with pydicom, i.e. I want to read the files /data/lfs2/model-mie/inputDataTest/subj2/mp2rage/0-280_tfl3d1.IMA. Then, if the gender is M, I want to copy them into a folder with shutil. It doesn't seem to be working with count.
Thanks for the help!
from pydicom import dicomio
myvar = str(count(0))
import shutil
file = "/data/lfs2/model-mie/inputDataTest/subj2/mp2rage/" + myvar + "_tfl3d1.IMA"
ds = dicomio.read_file(file)
gender = ds.PatientSex
print(gender)
if gender == "M":
    shutil.copy(file, "/mnt/nethomes/s4232182/Desktop/New")

I think the range() function should do what you want, something like this:
import shutil
from pydicom import dicomio
for i in range(281):
    filename = "/data/lfs2/model-mie/inputDataTest/subj2/mp2rage/" + str(i) + "_tfl3d1.IMA"
    ds = dicomio.read_file(filename)
    if ds.get('PatientSex') == "M":
        shutil.copy(filename, "/mnt/nethomes/s4232182/Desktop/New")
I've also used ds.get() to avoid problems if the dataset does not contain a PatientSex data element.
In one place in your question, the numbering is 1-280, in another it is 0-280. If the former, then use range(1, 281) instead.
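If some numbers in the range might have no file, the loop can skip them instead of failing. The sketch below is a generic version under that assumption: copy_matching is a hypothetical helper, and the keep argument is a stand-in for the pydicom check, which in the real script would be something like lambda p: dicomio.read_file(p).get('PatientSex') == "M".

```python
import os
import shutil
import tempfile


def copy_matching(src_dir, dst_dir, indices, keep):
    """Copy <i>_tfl3d1.IMA for each i in `indices` when keep(path) is true."""
    os.makedirs(dst_dir, exist_ok=True)
    copied = []
    for i in indices:
        path = os.path.join(src_dir, f"{i}_tfl3d1.IMA")
        if not os.path.isfile(path):
            continue  # a gap in the numbering is skipped, not an error
        if keep(path):
            shutil.copy(path, dst_dir)
            copied.append(i)
    return copied


# Demonstration with empty placeholder files and a stand-in predicate.
with tempfile.TemporaryDirectory() as root:
    src, dst = os.path.join(root, "src"), os.path.join(root, "dst")
    os.makedirs(src)
    for i in (0, 1, 3):
        open(os.path.join(src, f"{i}_tfl3d1.IMA"), "w").close()
    copied = copy_matching(src, dst, range(5),
                           keep=lambda p: not p.endswith("1_tfl3d1.IMA"))
    copied_names = sorted(os.listdir(dst))

print(copied, copied_names)
```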
