Read in a csv file using wildcards

I need to read in a csv file daily but certain numbers in the file name will change each day. The filename with directory included is C:\siglocal\pairoffs\\logs_20220804_084056_9500_capped_delta_for_singlestockdelta.csv
I tried the code below, putting an asterisk after the _08 in the file path. The 9 characters after this part of the file name (8 digits and an underscore) change daily, and the last part of the file name (_capped_delta_for_singlestockdelta.csv) stays the same.
Any ideas what I need to do here?
df = pd.read_csv(r'C:\siglocal\pairoffs\\logs_20220804_08*' + '_capped_delta_for_singlestockdelta.csv')

I do not see how this is a pandas problem. If I understand correctly, you are looking for a way to build a string from variables. For that you can use .format():
r'C:\siglocal\pairoffs\\logs_20220804_08{0}_capped_delta_for_singlestockdelta.csv'.format(day)
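Since the question asks for a wildcard match rather than string formatting, another option is the standard-library glob module, which expands the pattern into the real file name before pandas sees it. A minimal sketch, assuming exactly one matching file is expected each day; the helper name find_daily_file and its parameters are hypothetical and not part of the original post:

```python
import glob
import os


def find_daily_file(folder, date_prefix,
                    suffix="_capped_delta_for_singlestockdelta.csv"):
    """Return the single file matching logs_<date_prefix>*<suffix> in folder.

    find_daily_file is a hypothetical helper; the naming scheme is taken
    from the question.
    """
    pattern = os.path.join(folder, "logs_" + date_prefix + "*" + suffix)
    matches = sorted(glob.glob(pattern))
    if len(matches) != 1:
        raise FileNotFoundError(
            "expected exactly one match for {!r}, found {}".format(
                pattern, len(matches)))
    return matches[0]
```

With pandas this would be used as df = pd.read_csv(find_daily_file(r'C:\siglocal\pairoffs', '20220804_08')). Raising on zero or several matches is a design choice here; picking the newest match instead would also be reasonable.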

Perhaps use os.walk(...) and a regular expression to evaluate the files in the folder. Here's one possible implementation:
import os
import re
# define the folder where the files are located
src_folder = r"C:\_temp"
# define the regular expression to filter the files
# (four digits, an underscore, four digits; the dot before csv is escaped)
file_regex = r"logs_20220804_08([0-9][0-9][0-9][0-9]_[0-9][0-9][0-9][0-9])" \
    + r"_capped_delta_for_singlestockdelta\.csv"
for dir_path, dir_names, file_names in os.walk(src_folder):
    # Each iteration contains:
    # dir_path - current folder for the iteration
    # dir_names - list of folders in the dir_path.
    # file_names - list of files in the dir_path.
    for file_name in file_names:
        print("Evaluating file({}) in folder({})"
              .format(file_name, dir_path))
        match_obj = re.match(file_regex, file_name, re.M | re.I)
        # match_obj will be None if there isn't a match
        if match_obj:
            print("{}File({}) matches our regular expression."
                  .format(" " * 5, file_name))
            print("{}Changing number value is: {}"
                  .format(" " * 5, match_obj.group(1)))
        else:
            print("{}No match for file ({})"
                  .format(" " * 5, file_name))

Related

Python grab substring between two specific characters

I have a folder with hundreds of files named like:
"2017_05_S2B_7VEG_20170528_0_L2A_B01.tif"
Convention:
year_month_ID_zone_date_0_L2A_B01.tif ("_0_L2A_B01.tif", and "zone" never change)
What I need is to iterate through every file and build a path based on their name in order to download them.
For example:
name = "2017_05_S2B_7VEG_20170528_0_L2A_B01.tif"
path = "2017/5/S2B_7VEG_20170528_0_L2A/B01.tif"
The path convention needs to be: path = year/month/ID_zone_date_0_L2A/B01.tif
I thought of making a loop which would "cut" my string into several parts every time it encounters a "_" character, then stitch the different parts in the right order to create my path name.
I tried this but it didn't work:
import re
filename = "2017_05_S2B_7VEG_20170528_0_L2A_B01.tif"
try:
    found = re.search('_(.+?)_', filename).group(1)
except AttributeError:
    # _ not found in the original string
    found = ''  # apply your error handling
How could I achieve that in Python?

Since you only have one separator character, you may as well simply use Python's built in split function:
import os
items = filename.split('_')
year, month = items[:2]
new_filename = '_'.join(items[2:])
path = os.path.join(year, month, new_filename)
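Note that the question's target path drops the leading zero from the month and puts the band file (B01.tif) in its own path component, which the snippet above does not do. A minimal sketch of that variant, assuming every name follows the stated convention; build_path is a hypothetical helper name, and forward slashes are assumed because the result is meant as a download path:

```python
def build_path(name):
    """Turn year_month_ID_zone_date_0_L2A_B01.tif into
    year/month/ID_zone_date_0_L2A/B01.tif (hypothetical helper)."""
    parts = name.split("_")
    year = parts[0]
    month = str(int(parts[1]))       # "05" -> "5", as in the question's example
    folder = "_".join(parts[2:-1])   # everything between the month and the band
    band_file = parts[-1]            # e.g. "B01.tif"
    # Assumption: the path is used remotely, so it is joined with "/"
    return "/".join([year, month, folder, band_file])


print(build_path("2017_05_S2B_7VEG_20170528_0_L2A_B01.tif"))
# prints 2017/5/S2B_7VEG_20170528_0_L2A/B01.tif
```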

Try the following code snippet:
import re
filename = "2017_05_S2B_7VEG_20170528_0_L2A_B01.tif"
found = re.sub(r'(\d+)_(\d+)_(.*)_(.*)\.tif', r'\1/\2/\3/\4.tif', filename)
print(found) # prints 2017/05/S2B_7VEG_20170528_0_L2A/B01.tif

No need for a regex -- you can just use split().
filename = "2017_05_S2B_7VEG_20170528_0_L2A_B01.tif"
parts = filename.split("_")
year = parts[0]
month = parts[1]

Maybe you can do it like this:
from os import listdir, makedirs
from os.path import isfile, join
my_path = 'your_source_dir'
files_name = [f for f in listdir(my_path) if isfile(join(my_path, f))]
def create_dir(files_name):
    for file in files_name:
        # the first two underscore-separated fields are the year and the month
        # (maxsplit must be an int, not a string)
        year, month = file.split('_', 2)[:2]
        # create year/month under my_path if it does not exist yet
        makedirs(join(my_path, year, month), exist_ok=True)
        ### your download code

filename = "2017_05_S2B_7VEG_20170528_0_L2A_B01.tif"
temp = filename.split('_')
result = "/".join(temp)
print(result)
The result is:
2017/05/S2B/7VEG/20170528/0/L2A/B01.tif

Verify the format of a filename in Python

Every week I get two files with following pattern.
EMEA_{sample}_Tracker_{year}_KW{week}
E.g.
EMEA_G_Tracker_2019_KW52.xlsx
EMEA_BC_Tracker_2019_KW52.xlsx
Next files would look like these
EMEA_G_Tracker_2020_KW1.xlsx
EMEA_BC_Tracker_2020_KW1.xlsx
Placeholders:
sample = G or BC
year = current year [YYYY]
week = calendar week [0 - ~52]
The only changes are made in the placeholders, everything else will stay the same.
How can I extract these values from the filename and check if the filename has this format?
Right now I only read all files using os.walk():
from os import walk
path_files = "Files/"
files = []
for (_, _, filenames) in walk(path_files):
    files.extend(filenames)
    break

If filename is the name of the file you've got:
import re
result = re.match(r'EMEA_(.*?)_Tracker_(\d+)_KW(\d+)', filename)
sample, year, week = result.groups()
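The snippet above raises AttributeError when the name does not match, because re.match returns None in that case. To both check the format and extract the values, a sketch along these lines could work; parse_tracker_name is a hypothetical helper, and the .xlsx extension, four-digit year and one- or two-digit week are assumptions taken from the examples in the question:

```python
import re

# Assumed format: EMEA_{G|BC}_Tracker_{YYYY}_KW{week}.xlsx
TRACKER_RE = re.compile(r"EMEA_(G|BC)_Tracker_(\d{4})_KW(\d{1,2})\.xlsx")


def parse_tracker_name(filename):
    """Return (sample, year, week) or None if the name has the wrong format."""
    m = TRACKER_RE.fullmatch(filename)
    if m is None:
        return None
    sample, year, week = m.groups()
    return sample, int(year), int(week)


print(parse_tracker_name("EMEA_BC_Tracker_2020_KW1.xlsx"))  # ('BC', 2020, 1)
print(parse_tracker_name("other_file_1.txt"))               # None
```

Using fullmatch rather than match means trailing junk after the extension is rejected as well.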

Here is an example of how to collect all files matching your pattern into a list using regex and list comprehension. Then you can use the list as you wish in later code.
import os
import re
# Compile the regular expression pattern.
re_emea = re.compile(r'^EMEA_(G|BC)_Tracker_20\d{2}_KW\d{1,2}\.xlsx$')
# Set path to be searched.
path = '/home/username/Desktop/so/emea_files'
# Collect all filenames matching the pattern into a list.
files = [f for f in os.listdir(path) if re_emea.match(f)]
# View the results.
print(files)
All files in the directory:
['EMEA_G_Tracker_2020_KW2.xlsx',
'other_file_3.txt',
'EMEA_G_Tracker_2020_KW1.xlsx',
'other_file_2.txt',
'other_file_5.txt',
'other_file_4.txt',
'EMEA_BC_Tracker_2019_KW52.xlsx',
'other_file_1.txt',
'EMEA_G_Tracker_2019_KW52.xlsx',
'EMEA_BC_Tracker_2020_KW2.xlsx',
'EMEA_BC_Tracker_2020_KW1.xlsx']
The results from pattern matching:
['EMEA_G_Tracker_2020_KW2.xlsx',
'EMEA_G_Tracker_2020_KW1.xlsx',
'EMEA_BC_Tracker_2019_KW52.xlsx',
'EMEA_G_Tracker_2019_KW52.xlsx',
'EMEA_BC_Tracker_2020_KW2.xlsx',
'EMEA_BC_Tracker_2020_KW1.xlsx']
Hope this helps! If not, just give me a shout.

Loop through a directory to search for two possible matching files

Ok so I'm writing a module that will take in some command line arguments, one of the arguments: fundCodes will be an array of funds: ['PUSFF', 'AGE', 'AIR']
My module has to search through files in a directory and look for files matching a certain format:
def file_match(self, fundCodes):
    # Get a list of the files
    files = set(os.listdir(self.unmappedDir))
    # loop through all the files and search for matching file
    for check_fund in fundCodes:
        # set a file pattern
        file_match = 'unmapped_positions_{fund}_{start}_{end}.csv'.format(fund=check_fund, start=self.startDate, end=self.endDate)
        # Yet to be used...
        file_trade_match = 'unmapped_trades_{fund}_{start}_{end}.csv'.format(fund=check_fund, start=self.startDate, end=self.endDate)
        # look in the unmappeddir and see if there's a file with that name
        if file_match in files:
            # if there's a match, load unmapped positions as etl
            filename = os.path.join(self.unmappedDir, file_match)
            return self.read_file(filename)
        else:
            Logger.error('No file found with those dates/funds')
I'm trying to figure out the best way to search through the directory for two different formats.
Examples of the formats would be:
unmapped_trades_AGE_2018-07-01_2018-07-11.csv and
unmapped_positions_AGE_2018-07-01_2018-07-11.csv
I'm thinking I just need to assign each match to a variable and in my last iteration check if there's a file equal to either value right? It seems redundant though. Any other suggestions?

Just do two in tests. If you need both files to exist you can use and:
if file_match in files and file_trade_match in files:
    # do something
else:
    # log error
If you just want to process either file, you can do:
if file_match in files:
    # do something
elif file_trade_match in files:
    # do something else
else:
    # log error
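If writing out the two checks feels redundant, the expected names can also be generated in a loop over the two prefixes. A minimal sketch, assuming the same naming scheme as in the question; expected_files and find_present are hypothetical names:

```python
def expected_files(fund, start, end, kinds=("positions", "trades")):
    """Build the expected file name for each kind of unmapped file."""
    return {
        kind: "unmapped_{}_{}_{}_{}.csv".format(kind, fund, start, end)
        for kind in kinds
    }


def find_present(files, fund, start, end):
    """Return {kind: file name} for the expected files that exist in files."""
    return {
        kind: name
        for kind, name in expected_files(fund, start, end).items()
        if name in files
    }


files = {"unmapped_trades_AGE_2018-07-01_2018-07-11.csv", "notes.txt"}
print(find_present(files, "AGE", "2018-07-01", "2018-07-11"))
# {'trades': 'unmapped_trades_AGE_2018-07-01_2018-07-11.csv'}
```

An empty result means neither file exists, and a result with both keys means both do, so the caller can decide between the "and" and "either" behaviour.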

I would use regular expressions for this, e.g.
import re
import os
# raw string so the backslashes reach the regex engine; the dot before csv is escaped
search_pattern = r'unmapped_{}_([\w]+)_([0-9\-]+)_([0-9\-]+)\.csv'
data_types = ['trades', 'positions']
pattern_dict = {data_type: search_pattern.format(data_type) for data_type in data_types}
def find_matching_files(search_dir, fund_codes):
    if not os.path.isdir(search_dir):
        raise ValueError('search_dir does not specify a directory')
    search_files = os.listdir(search_dir)
    matching_files = {data_type: [] for data_type in pattern_dict}
    for fname in search_files:
        for data_type, pattern in pattern_dict.items():
            m = re.match(pattern, fname)
            if m is not None and m.group(1) in fund_codes:
                matching_files[data_type].append(fname)
    return matching_files
print(find_matching_files('file_location/', ['PUSFF', 'AGE', 'AIR']))
where file_location/ is the directory to search, and a dictionary of the matching files, separated by data type, is returned.

Python: moving file to a newly created directory

I've got my script creating a bunch of files (the number varies depending on inputs) and I want to move certain files into certain folders based on their filenames.
So far I've got the following, but although the directories are being created, no files are being moved. I'm not sure if the logic in the final for loop makes any sense.
In the code below I'm trying to move all .png files ending in _01 into the sub_frame_0 folder.
Additionally, is there some way to increment both the file endings (_01 to _02 etc.) and the destination folder (from sub_frame_0 to sub_frame_1 to sub_frame_2 and so on)?
for index, i in enumerate(range(num_sub_frames+10)):
    path = os.makedirs('./sub_frame_{}'.format(index))
# Slice layers into sub-frames and add to appropriate directory
list_of_files = glob.glob('*.tif')
for fname in list_of_files:
    image_slicer.slice(fname, num_sub_frames)  # Slices the .tif frames into .png sub-frames
list_of_sub_frames = glob.glob('*.png')
for i in list_of_sub_frames:
    if i == '*_01.png':
        shutil.move(os.path.join(os.getcwd(), '*_01.png'), './sub_frame_0/')

As you said, the logic of the final loop does not make sense.
if i == '*_01.png':
It would evaluate something like 'image_01.png' == '*_01.png' and always be false.
A regexp would be the general way to go, but for this simple case you can just slice the number from the file name.
for i in list_of_sub_frames:
    frame = int(i[-6:-4]) - 1
    shutil.move(os.path.join(os.getcwd(), i), './sub_frame_{}/'.format(frame))
If i = 'image_01.png' then i[-6:-4] would take '01', convert it to integer and then just subtract 1 to follow your schema.
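The slice i[-6:-4] assumes a two-digit suffix directly before the extension. A slightly more defensive sketch reads the trailing number with a regular expression instead; target_folder is a hypothetical helper, and the _NN.png naming is assumed from the question:

```python
import re

# Trailing "_<digits>.png" at the end of the file name
SUFFIX_RE = re.compile(r"_(\d+)\.png$")


def target_folder(file_name):
    """Map 'image_01.png' to './sub_frame_0/', or return None if no suffix."""
    m = SUFFIX_RE.search(file_name)
    if m is None:
        return None
    return "./sub_frame_{}/".format(int(m.group(1)) - 1)


print(target_folder("image_01.png"))   # ./sub_frame_0/
print(target_folder("image_12.png"))   # ./sub_frame_11/
print(target_folder("image.png"))      # None
```

The returned folder can then be passed to shutil.move as the destination, skipping files for which None is returned.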

A simple fix would be to check whether the file name i ends with '_01.png' and to change the shutil.move call to include i, the file name. (It's also worth mentioning that i is not a good name for a file path.)
list_of_sub_frames = glob.glob('*.png')
for i in list_of_sub_frames:
    # the asterisk is only a wildcard for glob; compare the literal ending here
    if i.endswith('_01.png'):
        shutil.move(os.path.join(os.getcwd(), i), './sub_frame_0/')
Additionally is [there some way] to increment both the file endings _01 to _02 etc., and the destn folder ie. from sub_frame_0 to sub_frame_1 to sub_frame_2 and so on.
You could create file names doing something as simple as this:
for i in range(10):
    # simple string building
    file_name = 'sub_frame_'+str(i)
    folder_name = 'folder_sub_frame_'+str(i)

Here is a complete example using regular expressions. This also implements the incrementing of file names/destination folders
import os
import glob
import shutil
import re
import image_slicer  # third-party package used in the question
num_sub_frames = 3
# No need to enumerate range list without start or step
for index in range(num_sub_frames+10):
    os.makedirs('./sub_frame_{0:02}'.format(index))
# Slice layers into sub-frames and add to appropriate directory
list_of_files = glob.glob('*.tif')
for fname in list_of_files:
    image_slicer.slice(fname, num_sub_frames)  # Slices the .tif frames into .png sub-frames
list_of_sub_frames = glob.glob('*.png')
for name in list_of_sub_frames:
    m = re.search(r'(?P<fname>.+?)_(?P<num>\d+)\.png', name)
    if m:
        num = int(m.group('num'))+1
        newname = '{0}_{1:02}.png'.format(m.group('fname'), num)
        newpath = os.path.join('./sub_frame_{0:02}/'.format(num), newname)
        print(m.group() + ' -> ' + newpath)
        shutil.move(os.path.join(os.getcwd(), m.group()), newpath)

Extract first (and last) part of file name and copy them to new directory

I am trying to write a Windows batch script to extract the first and last parts of a filename.
I have multiple files named like this:
"John Doe_RandomWalk_4202.m"
"Tim Meyer_plot_3c_4163.pdf"
I would like to make directories like so:
Directory "John Doe" contains "RandomWalk.m"
Directory "Tim Meyer" contains "plot_3c.pdf"
They seem to follow this pattern: "FirstName LastName_filename_[number].extension"
I'm not very experienced with regex. I'm trying to do this with a Windows batch script, but I am open to solutions in another language such as Python.
Sorry for not including my attempt earlier. Here is what I came up with; it's rather messy:
import os, re
reg_exp = re.compile(r'_\d\d')
filename = "John Doe_RandomWalk_4202.m"
extension = filename.split('.')[-1]
directory_name = filename.split('_')[0]
desired_filename = filename.split('_')[1]
final_filename = desired_filename + '.' + extension
Thanks

If neither firstname nor lastname can contain an underscore, then you don't need regular expressions.
#!/usr/bin/python
import collections, os, shutil
directory_structure = collections.defaultdict(list)
for orig_filename in list_of_your_files:
    name, *filename, extension = orig_filename.split("_")
    extension = "." + extension.split(".")[-1]
    filename = '_'.join(filename) + extension
    directory_structure[name].append((filename,orig_filename))
for directory, filenames in directory_structure.items():
    try:
        os.mkdir(directory)
    except OSError:
        pass # directory already exists
    for filename in filenames:
        newfile, oldfile = filename
        shutil.copyfile(oldfile, os.path.join(directory,newfile))
If you're doing this using absolute paths, this becomes a little more difficult because you'll have to use os.path to strip off the filename from the rest of the path, then join it back together for the shutil.copyfile, but I don't see anything about absolute paths in your question.
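For absolute paths, a sketch of the same idea with os.path might look like this; split_target is a hypothetical helper, and it assumes the "Name_filename_number.extension" pattern from the question:

```python
import os


def split_target(path):
    """Return (directory to create, new file name) for one source path."""
    folder, base = os.path.split(path)          # strip the file name off the path
    stem, extension = os.path.splitext(base)    # e.g. "John Doe_RandomWalk_4202", ".m"
    name, *middle, _number = stem.split("_")    # drop the trailing number
    return os.path.join(folder, name), "_".join(middle) + extension


target_dir, new_name = split_target(
    os.path.join("data", "Tim Meyer_plot_3c_4163.pdf"))
print(target_dir, new_name)
```

The returned directory can be created with os.makedirs(target_dir, exist_ok=True) before copying the file to os.path.join(target_dir, new_name).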

Since you were originally hoping for a batch implementation and I'm terrible at Python...
@echo off
setlocal enabledelayedexpansion
set "source_dir=C:\path\to\where\your\files\are"
set "target_dir=C:\path\to\where\your\files\will\be"
:: Get a list of all files in the source directory
for /F "tokens=1,* delims=_" %%A in ('dir /b "%source_dir%"') do (
    set "folder_name=%%A"
    set name_part=%%~nB
    set file_ext=%%~xB
    REM Make a new directory based on the name if it does not exist
    if not exist "!target_dir!\!folder_name!" mkdir "!target_dir!\!folder_name!"
    REM Drop the last token from the name_part and store the new value in the new_filename variable
    call :dropLastToken !name_part! new_filename
    REM If you want to move instead of copy, change "copy" to "move"
    copy "!source_dir!\!folder_name!_!name_part!!file_ext!" "!target_dir!\!folder_name!\!new_filename!!file_ext!"
)
:: End the script so that the function doesn't get called at the very end with no parameters
exit /b
:dropLastToken
setlocal enabledelayedexpansion
set f_name=%1
:: Replace underscores with spaces for later splitting
set f_name=!f_name:_= !
:: Get the last token
for %%D in (!f_name!) do set last_token=%%D
:: Remove the last_token substring from new_filename
set f_name=!f_name: %last_token%=!
:: Put the underscores back
set f_name=!f_name: =_!
endlocal&set %2=%f_name%

Since we don't know the structure of the filenames beforehand, a regex might suit you better. Here's an implementation in Python:
import os
import re
# Grab everything in this directory that is a file
files = [x for x in os.listdir(".") if os.path.isfile(x)]
# A dictionary of name: filename pairs.
output = {}
for f in files:
    # Match against the parts we need:
    #   ^             --> starts with
    #   ([a-zA-Z]+)   --> one or more letters (group 1)
    #   \s+           --> followed by one or more spaces
    #   ([a-zA-Z]+)_  --> another block of letters, followed
    #                     by an underscore (group 2)
    #   (\w+)         --> a block of alphanumeric characters or underscores (group 3)
    #   _\d+\.        --> underscore, one or more digits, and a period
    #   (.+)          --> one or more characters (not EOL) (group 4)
    #   $             --> end of string
    m = re.match(r"^([a-zA-Z]+)\s+([a-zA-Z]+)_(\w+)_\d+\.(.+)$", f)
    if m:
        # If we match, grab the parts we need and stuff it in our dict
        name = m.group(1) + " " + m.group(2)
        filename = m.group(3) + "." + m.group(4)
        output[name] = filename
for name in output:
    print('Directory "{}" contains "{}"'.format(name, output[name]))
Note that this isn't optimally compact, but is relatively easy to read. You should also be able to do:
import os
import re
output = {name: filename for (name, filename) in [
    (m.group(1) + " " + m.group(2), m.group(3) + "." + m.group(4))
    for m in [
        re.match(r"^(\w+)\s+([a-zA-Z]+)_(\w+)_\d+\.(.+)$", f)
        for f in os.listdir(".") if os.path.isfile(f)
    ]
    if m
]}

With my limited python experience I came up with this. It works, though probably not the best way:
import os, shutil
list_of_your_files = [f for f in os.listdir('.') if os.path.isfile(f)]
for filename in list_of_your_files:
    extension = filename.split('.')[-1]
    directory = filename.split('_')[0]
    desired_filename = filename.split('_')[1]
    final_filename = desired_filename + '.' + extension
    try:
        os.mkdir(directory)
    except OSError:
        pass # directory already exists
    shutil.copyfile(filename, os.path.join(directory,final_filename))
