How to check set of files conform to a naming scheme

How to check set of files conform to a naming scheme - python

I have a bunch of files (TV episodes, although that is fairly arbitrary) that I want to check match a specific naming/organisation scheme..
Currently: I have three arrays of regex, one for valid filenames, one for files missing an episode name, and one for valid paths.
Then, I loop though each valid-filename regex, if it matches, append it to a "valid" dict, if not, do the same with the missing-ep-name regexs, if it matches this I append it to an "invalid" dict with an error code (2:'missing epsiode name'), if it matches neither, it gets added to invalid with the 'malformed name' error code.
The current code can be found here
I want to add a rule that checks for the presence of a folder.jpg file in each directory, but to add this would make the code substantially more messy in it's current state..
How could I write this system in a more expandable way?
The rules it needs to check would be..
File is in the format Show Name - [01x23] - Episode Name.avi or Show Name - [01xSpecial02] - Special Name.avi or Show Name - [01xExtra01] - Extra Name.avi
If filename is in the format Show Name - [01x23].avi display it a 'missing episode name' section of the output
The path should be in the format Show Name/season 2/the_file.avi (where season 2 should be the correct season number in the filename)
each Show Name/season 1/ folder should contain "folder.jpg"
.any ideas? While I'm trying to check TV episodes, this concept/code should be able to apply to many things..
The only thought I had was a list of dicts in the format:
checker = [
{
'name':'valid files',
'type':'file',
'function':check_valid(), # runs check_valid() on all files
'status':0 # if it returns True, this is the status the file gets
}

I want to add a rule that checks for
the presence of a folder.jpg file in
each directory, but to add this would
make the code substantially more messy
in it's current state..
This doesn't look bad. In fact your current code does it very nicely, and Sven mentioned a good way to do it as well:
Get a list of all the files
Check for "required" files
You would just have have add to your dictionary a list of required files:
checker = {
...
'required': ['file', 'list', 'for_required']
}
As far as there being a better/extensible way to do this? I am not exactly sure. I could only really think of a way to possibly drop the "multiple" regular expressions and build off of Sven's idea for using a delimiter. So my strategy would be defining a dictionary as follows (and I'm sorry I don't know Python syntax and I'm a tad to lazy to look it up but it should make sense. The /regex/ is shorthand for a regex):
check_dict = {
'delim' : /\-/,
'parts' : [ 'Show Name', 'Episode Name', 'Episode Number' ],
'patterns' : [/valid name/, /valid episode name/, /valid number/ ],
'required' : ['list', 'of', 'files'],
'ignored' : ['.*', 'hidden.txt'],
'start_dir': '/path/to/dir/to/test/'
}
Split the filename based on the delimiter.
Check each of the parts.
Because its an ordered list you can determine what parts are missing and if a section doesn't match any pattern it is malformed. Here the parts and patterns have a 1 to 1 ratio. Two arrays instead of a dictionary enforces the order.
Ignored and required files can be listed. The . and .. files should probably be ignored automatically. The user should be allowed to input "globs" which can be shell expanded. I'm thinking here of svn:ignore properties, but globbing is natural for listing files.
Here start_dir would be default to the current directory but if you wanted a single file to run automated testing of a bunch of directories this would be useful.
The real loose end here is the path template and along the same lines what path is required for "valid files". I really couldn't come up with a solid idea without writing one large regular expression and taking groups from it... to build a template. It felt a lot like writing a TextMate language grammar. But that starts to stray on the ease of use. The real problem was that the path template was not composed of parts, which makes sense but adds complexity.
Is this strategy in tune with what you were thinking of?

maybe you should take the approach of defaulting to: "the filename is correct" and work from there to disprove that statement:
with the fact that you only allow filenames with: 'show name', 'season number x episode number' and 'episode name', you know for certain that these items should be separated by a "-" (dash) so you have to have 2 of those for a filename to be correct.
if that checks out, you can use your code to check that the show name matches the show name as seen in the parent's parent folder (case insensitive i assume), the season number matches the parents folder numeric value (with or without an extra 0 prepended).
if however you don't see the correct amount of dashes you instantly know that there is something wrong and stop before the rest of the tests etc.
and separately you can check if the file folder.jpg exists and take the necessary actions. or do that first and filter that file from the rest of the files in that folder.

Related

Extract text from a config file [duplicate]

This question already has answers here:
Parse key value pairs in a text file
(7 answers)
Closed 1 year ago.
I'm using a config file to inform my Python script of a few key-values, for use in authenticating the user against a website.
I have three variables: the URL, the user name, and the API token.
I've created a config file with each key on a different line, so:
url:<url string>
auth_user:<user name>
auth_token:<API token>
I want to be able to extract the text after the key words into variables, also stripping any "\n" that exist at the end of the line. Currently I'm doing this, and it works but seems clumsy:
with open(argv[1], mode='r') as config_file:
lines = config_file.readlines()
for line in lines:
url_match = match('jira_url:', line)
if url_match:
jira_url = line[9:].split("\n")[0]
user_match = match('auth_user:', line)
if user_match:
auth_user = line[10:].split("\n")[0]
token_match = match('auth_token', line)
if token_match:
auth_token = line[11:].split("\n")[0]
Can anybody suggest a more elegant solution? Specifically it's the ... = line[10:].split("\n")[0] lines that seem clunky to me.
I'm also slightly confused why I can't reuse my match object within the for loop, and have to create new match objects for each config item.

you could use a .yml file and read values with yaml.load() function:
import yaml
with open('settings.yml') as file:
settings = yaml.load(file, Loader=yaml.FullLoader)
now you can access elements like settings["url"] and so on

If the format is always <tag>:<value> you can easily parse it by splitting the line at the colon and filling up a custom dictionary:
config_file = open(filename,"r")
lines = config_file.readlines()
config_file.close()
settings = dict()
for l in lines:
elements = l[:-1].split(':')
settings[elements[0]] = ':'.join(elements[1:])
So, you get a dictionary that has the tags as keys and the values as values. You can then just refer to these dictionary entries in your pogram.
(e.g.: if you need the auth_token, just call settings["auth_token"]

if you can add 1 line for config file, configparser is good choice
https://docs.python.org/3/library/configparser.html
[1] config file : 1.cfg
[DEFAULT] # configparser's config file need section name
url:<url string>
auth_user:<user name>
auth_token:<API token>
[2] python scripts
import configparser
config = configparser.ConfigParser()
config.read('1.cfg')
print(config.get('DEFAULT','url'))
print(config.get('DEFAULT','auth_user'))
print(config.get('DEFAULT','auth_token'))
[3] output
<url string>
<user name>
<API token>
also configparser's methods is useful
whey you can't guarantee config file is always complete

You have a couple of great answers already, but I wanted to step back and provide some guidance on how you might approach these problems in the future. Getting quick answers sometimes prevents you from understanding how those people knew about the answers in the first place.
When you zoom out, the first thing that strikes me is that your task is to provide config, using a file, to your program. Software has the remarkable property of solve-once, use-anywhere. Config files have been a problem worth solving for at least 40 years, so you can bet your bottom dollar you don't need to solve this yourself. And already-solved means someone has already figured out all the little off-by-one and edge-case dramas like stripping line endings and dealing with expected input. The challenge of course, is knowing what solution already exists. If you haven't spent 40 years peeling back the covers of computers to see how they tick, it's difficult to "just know". So you might have a poke around on Google for "config file format" or something.
That would lead you to one of the most prevalent config file systems on the planet - the INI file. Just as useful now as it was 30 years ago, and as a bonus, looks not too dissimilar to your example config file. Then you might search for "read INI file in Python" or something, and come across configparser and you're basically done.
Or you might see that sometime in the last 30 years, YAML became the more trendy option, and wouldn't you know it, PyYAML will do most of the work for you.
But none of this gets you any better at using Python to extract from text files in general. So zooming in a bit, you want to know how to extract parts of lines in a text file. Again, this problem is an age-old problem, and if you were to learn about this problem (rather than just be handed the solution), you would learn that this is called parsing and often involves tokenisation. If you do some research on, say "parsing a text file in python" for example, you would learn about the general techniques that work regardless of the language, such as looping over lines and splitting each one in turn.
Zooming in one more step closer, you're looking to strip the new line off the end of the string so it doesn't get included in your value. Once again, this ain't a new problem, and with the right keywords you could dig up the well-trodden solutions. This is often called "chomping" or "stripping", and with some careful search terms, you'd find rstrip() and friends, and not have to do awkward things like splitting on the '\n' character.
Your final question is about re-using the match object. This is much harder to research. But again, the "solution" wont necessarily show you where you went wrong. What you need to keep in mind is that the statements in the for loop are sequential. To think them through you should literally execute them in your mind, one after one, and imagine what's happening. Each time you call match, it either returns None or a Match object. You never use the object, except to check for truthiness in the if statement. And next time you call match, you do so with different arguments so you get a new Match object (or None). Therefore, you don't need to keep the object around at all. You can simply do:
if match('jira_url:', line):
jira_url = line[9:].split("\n")[0]
if match('auth_user:', line):
auth_user = line[10:].split("\n")[0]
and so on. Not only that, if the first if triggered then you don't need to bother calling match again - it will certainly not trigger any of other matches for the same line. So you could do:
if match('jira_url:', line):
jira_url = line[9:].rstrip()
elif match('auth_user:', line):
auth_user = line[10:].rstrip()
and so on.
But then you can start to think - why bother doing all these matches on the colon, only to then manually split the string at the colon afterwards? You could just do:
tokens = line.rstrip().split(':')
if token[0] == 'jira_url':
jira_url = token[1]
elif token[0] == 'auth_user':
auth_user = token[1]
If you keep making these improvements (and there's lots more to make!), eventually you'll end up re-writing configparse, but at least you'll have learned why it's often a good idea to use an existing library where practical!

Use regex to isolate information in filename

I have various files that are formatted like so;
file_name_twenty_135032952.txt
where file_name_twenty is a description of the contents, and 13503295 is the id.
I want two different regexes; one to get the description out of the filename, and one to get the id.
Here are some other rules that the filenames follow:
The filename will never contains spaces, or uppercase characters
the id will always come directly before the extension
the id will always follow an underscore
the description may sometimes have numbers in it; for example, in this filename: part_1_of_file_324980332.txt, part_1_of_file is the description, and 324980332 is the id.
I've been toiling for a while and can't seem to figure out a regex to solve this. I'm using python, so any limitations thereof with its regex engine follow.

rsplit once on an underscore and to remove the extension from id.
s = "file_name_twenty_13503295.txt"
name, id = s.rsplit(".",1)[0].rsplit("_", 1)
print(name, id)
file_name_twenty 13503295

Parsing directories and detecting unexpected blanks

I'm trying to parse some directories and identifying folders witch do not have a specific correct pattern. Let's exemplify:
Correct: Level1\\Level2\\Level3\\Level4_ID\\Date\\Hour\\file.txt
Incorrect: Level1\\Level2\\Level3\\Level4\\Date\\Hour\\file.txt
Notice that the incorrect one does not have the _ID. My final desired goal is parse the data replacing the '\' for a delimiter to import for MS excel:
Level1;Level2;Level3;Level4;ID;Date;Hour;file.txt
Level1;Level2;Level3;Level4; ;Date;Hour;file.txt
I had successfully parsed all the correct data making this steps:
Let files be a list of my all directories
for i in arange(len(files)):
processed_str = files[i].replace(" ", "").replace("_", "\\")
processed_str = processed_str.split("\\")
My issue is detecting whether or not Level4 folder does have an ID after the underscore using the same script, since "files" contains both correct and incorrect directories.
The problem is that since the incorrect one does not have the ID, after performing split("\") I end up having the columns mixed without a blanck between Level4 and Date:
Level1;Level2;Level3;Level4;Date;Hour;file.txt
Thanks,

Do the "_ID" check after splitting the directories, that way you don't loose information. Assuming the directory names themselves don't contain escaped backslashes and that the ID field is always in level 4 (counting from 1), this should do it:
for i in arange(len(files)):
parts = files[i].split("\\")
if parts[3].endswith("_ID"):
parts.insert(4, parts[3][:-len("_ID")])
else:
parts.insert(4, " ")
final = ";".join(parts)

Incremental Saves

I am trying to write up a script on incremental saves but there are a few hiccups that I am running into.
If the file name is "aaa.ma", I will get the following error - ValueError: invalid literal for int() with base 10: 'aaa' # and it does not happens if my file is named "aaa_0001"
And this happens if I wrote my code in this format: Link
As such, to rectify the above problem, I input in an if..else.. statement - Link, it seems to have resolved the issue on hand, but I was wondering if there is a better approach to this?
Any advice will be greatly appreciated!

Use regexes for better flexibility especially for file rename scripts like these.
In your case, since you know that the expected filename format is "some_file_name_<increment_number>", you can use regexes to do the searching and matching for you. The reason we should do this is because people/users may are not machines, and may not stick to the exact naming conventions that our scripts expect. For example, the user may name the file aaa_01.ma or even aaa001.ma instead of aaa_0001 that your script currently expects. To build this flexibility into your script, you can use regexes. For your use case, you could do:
# name = lastIncFile.partition(".")[0] # Use os.path.split instead
name, ext = os.path.splitext(lastIncFile)
import re
match_object = re.search("([a-zA-Z]*)_*([0-9]*)$", name)
# Here ([a-zA-Z]*) would be group(1) and would have "aaa" for ex.
# and ([0-9]*) would be group(2) and would have "0001" for ex.
# _* indicates that there may be an _, or not.
# The $ indicates that ([0-9]*) would be the LAST part of the name.
padding = 4 # Try and parameterize as many components as possible for easy maintenance
default_starting = 1
verName = str(default_starting).zfill(padding) # Default verName
if match_object: # True if the version string was found
name = match_object.group(1)
version_component = match_object.group(2)
if version_component:
verName = str(int(version_component) + 1).zfill(padding)
newFileName = "%s_%s.%s" % (name, verName, ext)
incSaveFilePath = os.path.join(curFileDir, newFileName)
Check out this nice tutorial on Python regexes to get an idea what is going on in the above block. Feel free to tweak, evolve and build the regex based on your use cases, tests and needs.
Extra tips:
Call cmds.file(renameToSave=True) at the beginning of the script. This will ensure that the file does not get saved over itself accidentally, and forces the script/user to rename the current file. Just a safety measure.
If you want to go a little fancy with your regex expression and make them more readable, you could try doing this:
match_object = re.search("(?P<name>[a-zA-Z]*)_*(?P<version>[0-9]*)$", name)
name = match_object.group('name')
version_component = match_object('version')
Here we use the ?P<var_name>... syntax to assign a dict key name to the matching group. Makes for better readability when you access it - mo.group('version') is much more clearer than mo.group(2).
Make sure to go through the official docs too.
Save using Maya's commands. This will ensure Maya does all it's checks while and before saving:
cmds.file(rename=incSaveFilePath)
cmds.file(save=True)
Update-2:
If you want space to be checked here's an updated regex:
match_object = re.search("(?P<name>[a-zA-Z]*)[_ ]*(?P<version>[0-9]*)$", name)
Here [_ ]* will check for 0 - many occurrences of _ or (space). For more regex stuff, trying and learn on your own is the best way. Check out the links on this post.
Hope this helps.

Exact match in strings in python

I am trying to find a sub-string in a string, but I am not achieving the results I want.
I have several strings that contains the direction to different directories:
'/Users/mymac/Desktop/test_python/result_files_Sample_8_11/logs',
'/Users/mymac/Desktop/test_python/result_files_Sample_8_1/logs',
'/Users/mymac/Desktop/test_python/result_files_Sample_8_9/logs'
Here is the part of my code here I am trying to find the exact match to the sub-string:
for name in sample_names:
if (dire.find(name)!=-1):
for files in os.walk(dire):
for file in files:
list_files=[]
list_files.append(file)
file_dict[name]=list_files
Everything works fine except that when it looks for Sample_8_1 in the string that contains the directory, the if condition also accepts the name Sample_8_11. How can I make it so that it makes an exact match to prevent from entering the same directory more than once?

You could try searching for sample_8_1/ (i.e., include the following slash). I guess given your code that would be dire.find(name+'/'). This just a quick and dirty approach.

Assuming that dire is populated with absolute path names
for name in sample_names:
if name in dire:
...
e.g.
samples = ['/home/msvalkon/work/tmp_1',
'/home/msvalkon/work/tmp_11']
dirs = ['/home/msvalkon/work/tmp_11']
for name in samples:
if name in dirs:
print "Entry %s matches" % name
Entry /home/msvalkon/work/tmp_11 matches

Develop Reference

Python is a programming language that lets you work quickly and integrate systems more effectively.