Exact match in strings in python

I am trying to find a sub-string in a string, but I am not getting the results I want.
I have several strings that contain the paths to different directories:
'/Users/mymac/Desktop/test_python/result_files_Sample_8_11/logs',
'/Users/mymac/Desktop/test_python/result_files_Sample_8_1/logs',
'/Users/mymac/Desktop/test_python/result_files_Sample_8_9/logs'
Here is the part of my code where I am trying to find the exact match to the sub-string:
for name in sample_names:
    if (dire.find(name)!=-1):
        for files in os.walk(dire):
            for file in files:
                list_files=[]
                list_files.append(file)
                file_dict[name]=list_files
Everything works fine except that when it looks for Sample_8_1 in the string that contains the directory, the if condition also accepts the name Sample_8_11. How can I make it match exactly, so that the same directory is not entered more than once?

You could try searching for Sample_8_1/ (i.e., include the trailing slash). Given your code, that would be dire.find(name+'/'). This is just a quick and dirty approach.
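A slightly stricter variant of the same idea is to compare against whole path components instead of raw substrings. This is only a sketch: the helper name contains_sample is made up for illustration, and it assumes the sample name always sits at the end of a folder name, as in the paths from the question.

```python
from pathlib import PurePosixPath

def contains_sample(directory, sample_name):
    """Return True if one folder name in `directory` ends with `sample_name`.

    Hypothetical helper: assumes folder names look like
    'result_files_<sample_name>', as in the question.
    """
    return any(part.endswith(sample_name)
               for part in PurePosixPath(directory).parts)

dirs = [
    '/Users/mymac/Desktop/test_python/result_files_Sample_8_11/logs',
    '/Users/mymac/Desktop/test_python/result_files_Sample_8_1/logs',
    '/Users/mymac/Desktop/test_python/result_files_Sample_8_9/logs',
]
matches = [d for d in dirs if contains_sample(d, 'Sample_8_1')]
print(matches)
```

Because the comparison is done per folder name, Sample_8_1 no longer matches the Sample_8_11 directory.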

Assuming that dire is a list populated with absolute path names:
for name in sample_names:
    if name in dire:
        ...
e.g.
samples = ['/home/msvalkon/work/tmp_1',
           '/home/msvalkon/work/tmp_11']
dirs = ['/home/msvalkon/work/tmp_11']
for name in samples:
    if name in dirs:
        print("Entry %s matches" % name)
Entry /home/msvalkon/work/tmp_11 matches

Related

Python - Regex - Match anything except

I'm trying to get my regular expression to work but can't figure out what I'm doing wrong. I am trying to find any file that is NOT in a specific format. For example all files are dates that are in this format MM-DD-YY.pdf (ex. 05-13-17.pdf). I want to be able to find any files that are not written in that format.
I can create a regex to find those with:
(\d\d-\d\d-\d\d\.pdf)
I tried using the negative lookahead so it looked like this:
(?!\d\d-\d\d-\d\d\.pdf)
That works in not finding those anymore but it doesn't find the files that are not like it.
I also tried adding a .* after the group but then that finds the whole list.
(?!\d\d-\d\d-\d\d\.pdf).*
I'm searching through a small list right now for testing:
05-17-17.pdf Test.pdf 05-48-2017.pdf 03-14-17.pdf
Is there a way to accomplish what I'm looking for?
Thanks!

You can try this:
import re
s = "Test.docx 04-05-2017.docx 04-04-17.pdf secondtest.pdf"
new_data = re.findall(r"[a-zA-Z]+\.[a-zA-Z]+|\d{1,}-\d{1,}-\d{4}\.[a-zA-Z]+", s)
Output:
['Test.docx', '04-05-2017.docx', 'secondtest.pdf']
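Another way to read the question is "keep every name that does not fully match MM-DD-YY.pdf". A minimal sketch of that, assuming the names are separated by whitespace as in the question's test list (DATE_PDF and not_matching are illustrative names, not from the original posts):

```python
import re

# The asker's own pattern, compiled once. fullmatch() requires the whole
# name to fit the pattern, so no lookahead is needed.
DATE_PDF = re.compile(r"\d\d-\d\d-\d\d\.pdf")

def not_matching(names):
    """Return the names that are NOT in MM-DD-YY.pdf form."""
    return [name for name in names if not DATE_PDF.fullmatch(name)]

names = "05-17-17.pdf Test.pdf 05-48-2017.pdf 03-14-17.pdf".split()
odd = not_matching(names)
print(odd)
```

Negating a positive match per file name tends to be easier to reason about than a negative lookahead applied to the whole list.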

First find all the names that match, then remove them from your list separately. The firstFindtheMatching function first finds the matching names using the re library:
def firstFindtheMatching(listoffiles):
    """
    :listoffiles: space-separated string of file names to check against the format
    :final_string: the names that match the format 01-01-17.pdf (MM-DD-YY.pdf), joined into one str. (ALSO) I'm returning the listoffiles so that you can see the whole output in one place, but you really won't need that.
    """
    import re
    matchednames = re.findall(r"\d{1,2}-\d{1,2}-\d{1,2}\.pdf", listoffiles)
    # connect all output in one string for simpler handling using sets
    final_string = ' '.join(matchednames)
    return (final_string, listoffiles)
Here is the output:
('05-08-17.pdf 04-08-17.pdf 08-09-16.pdf', '05-08-17.pdf Test.pdf 04-08-17.pdf 08-09-16.pdf 08-09-2016.pdf some-all-letters.pdf')
set(['08-09-2016.pdf', 'some-all-letters.pdf', 'Test.pdf'])
I used the main below, in case you would like to reproduce the results. The good thing about doing it this way is that you can add more regexes to firstFindtheMatching(). It helps you keep things separate.
def main():
    filenames = "05-08-17.pdf Test.pdf 04-08-17.pdf 08-09-16.pdf 08-09-2016.pdf some-all-letters.pdf"
    [matchednames, alllist] = firstFindtheMatching(filenames)
    print(matchednames, alllist)
    notcommon = set(filenames.split()) - set(matchednames.split())
    print(notcommon)

if __name__ == '__main__':
    main()

Navigating a "directory" structure in a text file

I am making a Python script which will allow, among other things, downloading files from an S3 filestore. I'm using the boto module to do this. As a first step, I get a list of files in a user-specified bucket. I'm storing that list in a temporary text file. Although S3 doesn't really have directories, we fake it the same way as everyone else by prepending a fake path to the filename. So, suppose I have the following in my bucket:
2015-04-12/logs/east/01.gz
2015-04-12/logs/east/02.gz
2015-04-12/logs/west/01.gz
2015-04-12/logs/west/02.gz
2015-04-12/summary
2015-04-13/logs/east/01.gz
2015-04-13/logs/east/02.gz
2015-04-13/logs/west/01.gz
2015-04-13/logs/west/02.gz
2015-04-13/summary
README
This is a very, very short version of the file. The real one is about 35,000 lines, so it needs to be presented to the user in a manageable way. I'm looking for suggestions on how to go about this. The way I've attempted has worked well, except that it assumed that everything would share a common directory path length. As you can see, that's no longer true. I'm assured that more variations will be coming, so I'd like to accommodate essentially arbitrary directory/file structures.
My method was, in effect, to extract the leftmost part of each path (that is, the top-level directory), create a uniq'd list of those, and present that to the user to choose. Then, when they choose, take everything starting with their choice and extract the second part of the path (if it existed), uniq those and present them to the user. When they choose, concatenate their first selection, a /, and their second selection, and repeat until there's no more path left. This is unwieldy and it's hard to say, for example, "this directory contains both files and directories."
How would you go about this? I'm having a hard time wrapping my head around this without creating an awkward presentation and spaghettified code. Thank you.

If I understand your question correctly, you want to be able to "drill down" into a list of path-like strings, correct?
If so, I'd suggest the newer pathlib module in the standard library. The code I'll show allows you to do something like this:
Current path:
1: 2015-04-12/
2: 2015-04-13/
3: README
? 2
Current path: 2015-04-13
1: logs/
2: summary
? 1
Current path: 2015-04-13/logs
1: east/
2: west/
? 2
Current path: 2015-04-13/logs/west
1: 01.gz
2: 02.gz
? 1
You have selected: 2015-04-13/logs/west/01.gz
Now for the code... First, we import pathlib and convert our list of strings to a list of pathlib.Path objects:
import pathlib
paths = (
"""
2015-04-12/logs/east/01.gz
2015-04-12/logs/east/02.gz
2015-04-12/logs/west/01.gz
2015-04-12/logs/west/02.gz
2015-04-12/summary
2015-04-13/logs/east/01.gz
2015-04-13/logs/east/02.gz
2015-04-13/logs/west/01.gz
2015-04-13/logs/west/02.gz
2015-04-13/summary
README""").split()
paths = [pathlib.Path(p) for p in paths]
Now I'll want to make some helper functions. First is a menu function that asks the user to select an entry from a list of choices. This will return an element of the list:
def menu(choices):
    for i, choice in enumerate(choices, start=1):
        message = '{}: {}'.format(i, choice)
        print(message)

    while True:
        try:
            selection = choices[int(input('? ')) - 1]
        except (ValueError, IndexError):
            message = 'Invalid selection: must be between 1 and {}.'
            print(message.format(len(choices)))
        else:
            return selection
We'll need a list of choices to give to that function, so we'll make a path_choices function which does as much. We give this function a container of full paths and the current path that the user has selected. It then returns the "next steps" that the user can take. For example, if we have a list of possibilities: ['foo/apple', 'foo/banana/one.txt', 'foo/orange/pear/summary.txt'], and curpath is foo, then this function will return {'apple', 'banana/', 'orange/'}. Note that the directories have trailing slashes, which is nice.
def path_choices(possibilities, curpath):
    choices = set()
    for path in possibilities:
        parts = path.relative_to(curpath).parts
        root = parts[0]
        if len(parts) > 1:
            root += '/'
        choices.add(root)
    return choices
Lastly, we'll have a simple function to filter a container of paths, only returning paths which start with curpath, and which aren't in fact equal to curpath:
def filter_paths(possibilities, curpath):
    for path in possibilities:
        if path != curpath and str(path).startswith(str(curpath)):
            yield path
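One caveat with a plain string prefix test is that a curpath of logs/east would also accept logs/east2/01.gz. A component-aware variant can be sketched with the parents attribute. The function name filter_paths_by_component is made up here, and the sketch assumes the paths are pathlib.PurePosixPath objects (S3 keys use forward slashes):

```python
import pathlib

def filter_paths_by_component(possibilities, curpath):
    """Yield only the paths that lie strictly below curpath.

    Assumption: every item in `possibilities` is a PurePosixPath.
    A path is never its own parent, so curpath itself is excluded.
    """
    curpath = pathlib.PurePosixPath(curpath)
    for path in possibilities:
        if curpath in path.parents:
            yield path

sample = [pathlib.PurePosixPath(p) for p in
          ['logs/east/01.gz', 'logs/east2/01.gz', 'logs/east']]
below_east = [str(p) for p in filter_paths_by_component(sample, 'logs/east')]
print(below_east)
```

An empty curpath still works in this sketch, because every relative path has '.' as its outermost parent.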
After this, it's just a matter of gluing these functions together:
curpath = ''
possibilities = paths
while possibilities:
    print('Current path: {}'.format(curpath))
    choices = sorted(path_choices(possibilities, curpath))
    selection = menu(choices)
    if curpath:
        curpath /= selection
    else:
        curpath = pathlib.Path(selection)
    possibilities = list(filter_paths(possibilities, curpath))
    print()
print('You have selected: ', curpath)

Use regex to isolate information in filename

I have various files that are formatted like so:
file_name_twenty_13503295.txt
where file_name_twenty is a description of the contents, and 13503295 is the id.
I want two different regexes; one to get the description out of the filename, and one to get the id.
Here are some other rules that the filenames follow:
The filename will never contain spaces or uppercase characters
The id will always come directly before the extension
The id will always follow an underscore
The description may sometimes have numbers in it; for example, in this filename: part_1_of_file_324980332.txt, part_1_of_file is the description, and 324980332 is the id.
I've been toiling for a while and can't seem to figure out a regex to solve this. I'm using python, so any limitations thereof with its regex engine follow.

Use rsplit once on the dot to remove the extension, then rsplit once on an underscore to separate the description from the id.
s = "file_name_twenty_13503295.txt"
name, id = s.rsplit(".",1)[0].rsplit("_", 1)
print(name, id)
file_name_twenty 13503295
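If a regex is still preferred, as the question asked, the same split can be sketched with one pattern and two named groups. The pattern and the split_filename helper below are illustrative, built only from the rules listed in the question (lowercase names, no spaces, id after the last underscore and directly before the extension):

```python
import re

# Greedy description, then the last underscore, then an all-digit id,
# then the extension.
FILENAME = re.compile(r"(?P<description>[a-z0-9_]+)_(?P<id>\d+)\.[a-z0-9]+")

def split_filename(filename):
    """Return (description, id) or raise ValueError if the name does not fit."""
    match = FILENAME.fullmatch(filename)
    if match is None:
        raise ValueError("unexpected filename: " + filename)
    return match.group("description"), match.group("id")

print(split_filename("file_name_twenty_13503295.txt"))
print(split_filename("part_1_of_file_324980332.txt"))
```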

Parsing directories and detecting unexpected blanks

I'm trying to parse some directories and identify folders which do not follow a specific correct pattern. For example:
Correct: Level1\\Level2\\Level3\\Level4_ID\\Date\\Hour\\file.txt
Incorrect: Level1\\Level2\\Level3\\Level4\\Date\\Hour\\file.txt
Notice that the incorrect one does not have the _ID. My final goal is to parse the data, replacing each '\' with a delimiter so it can be imported into MS Excel:
Level1;Level2;Level3;Level4;ID;Date;Hour;file.txt
Level1;Level2;Level3;Level4; ;Date;Hour;file.txt
I have successfully parsed all the correct data with these steps:
Let files be a list of all my directories
for i in arange(len(files)):
    processed_str = files[i].replace(" ", "").replace("_", "\\")
    processed_str = processed_str.split("\\")
My issue is detecting whether or not Level4 folder does have an ID after the underscore using the same script, since "files" contains both correct and incorrect directories.
The problem is that since the incorrect one does not have the ID, after performing split("\\") I end up with the columns shifted, without a blank between Level4 and Date:
Level1;Level2;Level3;Level4;Date;Hour;file.txt
Thanks,

Do the "_ID" check after splitting the directories; that way you don't lose information. Assuming the directory names themselves don't contain escaped backslashes and that the ID field is always in level 4 (counting from 1), this should do it:
for i in arange(len(files)):
    parts = files[i].split("\\")
    if parts[3].endswith("_ID"):
        parts.insert(4, parts[3][:-len("_ID")])
    else:
        parts.insert(4, " ")
    final = ";".join(parts)
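If ID in the question stands for an arbitrary identifier rather than the literal text "_ID", a variant of the same split-first idea can be sketched with str.partition. This is an assumption about the data: the helper to_columns and its id_level parameter are invented for illustration, and the ID is taken to be whatever follows the first underscore in the fourth folder name.

```python
def to_columns(path, id_level=3):
    """Split a backslash-separated path into ';'-separated columns.

    Assumption: the folder at index `id_level` is either 'Name_ID' or just
    'Name'. When the ID is missing, a single blank is written in its column.
    """
    parts = path.split("\\")
    name, _, ident = parts[id_level].partition("_")
    # Replace the one folder entry with two columns: name and ID (or blank).
    parts[id_level:id_level + 1] = [name, ident or " "]
    return ";".join(parts)

good = to_columns(r"Level1\Level2\Level3\Level4_ID\Date\Hour\file.txt")
bad = to_columns(r"Level1\Level2\Level3\Level4\Date\Hour\file.txt")
print(good)
print(bad)
```

Only the fourth component is split on the underscore, so underscores elsewhere in the path (for example in the file name) are left alone.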

How to check set of files conform to a naming scheme

I have a bunch of files (TV episodes, although that is fairly arbitrary) that I want to check against a specific naming/organisation scheme.
Currently: I have three arrays of regex, one for valid filenames, one for files missing an episode name, and one for valid paths.
Then, I loop through each valid-filename regex. If it matches, I append the file to a "valid" dict. If not, I do the same with the missing-ep-name regexes; if one of those matches, I append the file to an "invalid" dict with an error code (2:'missing episode name'). If it matches neither, it gets added to invalid with the 'malformed name' error code.
The current code can be found here
I want to add a rule that checks for the presence of a folder.jpg file in each directory, but adding this would make the code substantially more messy in its current state.
How could I write this system in a more expandable way?
The rules it needs to check would be..
File is in the format Show Name - [01x23] - Episode Name.avi or Show Name - [01xSpecial02] - Special Name.avi or Show Name - [01xExtra01] - Extra Name.avi
If the filename is in the format Show Name - [01x23].avi, display it in a 'missing episode name' section of the output
The path should be in the format Show Name/season 2/the_file.avi (where season 2 should be the correct season number in the filename)
Each Show Name/season 1/ folder should contain "folder.jpg"
Any ideas? While I'm trying to check TV episodes, this concept/code should be able to apply to many things.
The only thought I had was a list of dicts in the format:
checker = [
    {
        'name':'valid files',
        'type':'file',
        'function':check_valid(), # runs check_valid() on all files
        'status':0 # if it returns True, this is the status the file gets
    }
]

"I want to add a rule that checks for the presence of a folder.jpg file in each directory, but to add this would make the code substantially more messy in its current state."
This doesn't look bad. In fact your current code does it very nicely, and Sven mentioned a good way to do it as well:
Get a list of all the files
Check for "required" files
You would just have to add a list of required files to your dictionary:
checker = {
    ...
    'required': ['file', 'list', 'for_required']
}
As far as there being a better/extensible way to do this? I am not exactly sure. I could only really think of a way to possibly drop the "multiple" regular expressions and build off of Sven's idea for using a delimiter. So my strategy would be defining a dictionary as follows (and I'm sorry, I don't know Python syntax and I'm a tad too lazy to look it up, but it should make sense. The /regex/ is shorthand for a regex):
check_dict = {
    'delim' : /\-/,
    'parts' : [ 'Show Name', 'Episode Name', 'Episode Number' ],
    'patterns' : [/valid name/, /valid episode name/, /valid number/ ],
    'required' : ['list', 'of', 'files'],
    'ignored' : ['.*', 'hidden.txt'],
    'start_dir': '/path/to/dir/to/test/'
}
Split the filename based on the delimiter.
Check each of the parts.
Because it's an ordered list, you can determine which parts are missing, and if a section doesn't match its pattern it is malformed. Here the parts and patterns have a 1 to 1 ratio. Using two arrays instead of a dictionary enforces the order.
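In actual Python, the split-and-check idea might look roughly like the sketch below. Everything in it is an assumption made for illustration: the PARTS list, its patterns, the " - " delimiter and the check_name helper are not from the original code, and the parts are listed in the order they appear in the file name (show name, episode number, episode name).

```python
import re

# Hypothetical label/pattern pairs, kept in one ordered list so that
# position i of the split file name is checked against pattern i.
PARTS = [
    ("show name", re.compile(r"[\w' ]+")),
    ("episode number",
     re.compile(r"\[\d{2}x(?:\d{2}|Special\d{2}|Extra\d{2})\]")),
    ("episode name", re.compile(r"[\w' ]+")),
]

def check_name(filename, delim=" - "):
    """Return a list of problems found in one file name (empty if valid)."""
    stem = filename.rsplit(".", 1)[0]      # drop the extension
    pieces = stem.split(delim)
    problems = []
    for index, (label, pattern) in enumerate(PARTS):
        if index >= len(pieces):
            problems.append("missing " + label)
        elif not pattern.fullmatch(pieces[index]):
            problems.append("malformed " + label)
    if len(pieces) > len(PARTS):
        problems.append("unexpected extra parts")
    return problems

print(check_name("Show Name - [01x23] - Episode Name.avi"))
print(check_name("Show Name - [01x23].avi"))
```

Adding a new rule then means adding one entry to the list rather than another loop over regexes.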
Ignored and required files can be listed. The . and .. files should probably be ignored automatically. The user should be allowed to input "globs" which can be shell expanded. I'm thinking here of svn:ignore properties, but globbing is natural for listing files.
Here start_dir would default to the current directory, but if you wanted a single file to run automated testing of a bunch of directories, this would be useful.
The real loose end here is the path template and along the same lines what path is required for "valid files". I really couldn't come up with a solid idea without writing one large regular expression and taking groups from it... to build a template. It felt a lot like writing a TextMate language grammar. But that starts to stray on the ease of use. The real problem was that the path template was not composed of parts, which makes sense but adds complexity.
Is this strategy in tune with what you were thinking of?

Maybe you should take the approach of defaulting to "the filename is correct" and work from there to disprove that statement:
Given that you only allow filenames with 'show name', 'season number x episode number' and 'episode name', you know for certain that these items should be separated by a "-" (dash), so you have to have 2 of those for a filename to be correct.
If that checks out, you can use your code to check that the show name matches the show name as seen in the parent's parent folder (case insensitive, I assume), and that the season number matches the parent folder's numeric value (with or without an extra 0 prepended).
If, however, you don't see the correct number of dashes, you instantly know that there is something wrong and can stop before the rest of the tests.
Separately, you can check whether the file folder.jpg exists and take the necessary actions, or do that first and filter that file out from the rest of the files in that folder.
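A rough sketch of this "assume correct, then disprove" flow is shown below. The function names and message strings are invented for illustration, and the sketch assumes paths of the form Show Name/season N/file.avi with " - " as the separator. Directory contents are passed in as a plain list so the example runs without touching the filesystem.

```python
def find_problems(path):
    """Return reasons why `path` breaks the scheme (empty list if none)."""
    show_dir, season_dir, filename = path.split("/")[-3:]
    pieces = filename.rsplit(".", 1)[0].split(" - ")
    if len(pieces) != 3:
        # Wrong number of separators: stop before the remaining tests.
        return ["expected 2 separators, found %d" % (len(pieces) - 1)]
    problems = []
    show, number, _episode = pieces
    if show.lower() != show_dir.lower():
        problems.append("show name does not match folder")
    season_in_name = number.strip("[]").split("x")[0]
    season_in_dir = season_dir.lower().replace("season", "").strip()
    # int() makes '01' and '1' compare equal.
    if not (season_in_name.isdigit() and season_in_dir.isdigit()
            and int(season_in_name) == int(season_in_dir)):
        problems.append("season number does not match folder")
    return problems

def missing_folder_jpg(files_in_dir):
    """Separate check: True if the directory listing lacks folder.jpg."""
    return "folder.jpg" not in files_in_dir

print(find_problems("Show Name/season 1/Show Name - [01x23] - Episode Name.avi"))
print(missing_folder_jpg(["Show Name - [01x23] - Episode Name.avi"]))
```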
