How to recognize the folder path and remove it? - python

I'm doing a log parsing server and receive a list of full path of logs. Now I know the format of folder name must be timestamp like 12-28-2020_11-34-22-026.
Since I don't know where does user put the log folder. How should I recognize what is the folder path and remove it?
INPUT:
files = [
'/usr/local/home/username/Downloads/12-28-2020_11-34-22-026/log.DEBUG',
'/usr/local/home/username/Downloads/12-28-2020_11-34-22-026/log.INFO',
'/usr/local/home/username/Downloads/12-28-2020_11-34-22-026/summary.txt',
'/usr/local/home/username/Downloads/12-28-2020_11-34-22-026/UnitTest/ResultSheet_2020-12-28_11-34-37.txt',
'/usr/local/home/username/Downloads/12-28-2020_11-34-22-026/UnitTest/Emulator8080/logcat_emulator_8080_12-28-2020_11-34-24-826.txt',
]
OUTPUT:
files = [
'/12-28-2020_11-34-22-026/log.DEBUG',
'/12-28-2020_11-34-22-026/log.INFO',
'/12-28-2020_11-34-22-026/summary.txt',
'/12-28-2020_11-34-22-026/UnitTest/ResultSheet_2020-12-28_11-34-37.txt',
'/12-28-2020_11-34-22-026/UnitTest/Emulator8080/logcat_emulator_8080_12-28-2020_11-34-24-826.txt',
]

You want to check what (first) part of a path matches some pattern.
The pattern you're matching as a regex:
re.compile(r'^\d{2}-\d{2}-\d{4}_\d{2}-\d{2}-\d{2}-\d{3}$')
Python's standard library pathlib has a very robust way of splitting the path using the Path class' .parts property, so you can find the match:
import re
from pathlib import Path
files = [
'/usr/local/home/username/Downloads/12-28-2020_11-34-22-026/log.DEBUG',
'/usr/local/home/username/Downloads/12-28-2020_11-34-22-026/log.INFO',
'/usr/local/home/username/Downloads/12-28-2020_11-34-22-026/summary.txt',
'/usr/local/home/username/Downloads/12-28-2020_11-34-22-026/UnitTest/ResultSheet_2020-12-28_11-34-37.txt',
'/usr/local/home/username/Downloads/12-28-2020_11-34-22-026/UnitTest/Emulator8080/logcat_emulator_8080_12-28-2020_11-34-24-826.txt',
]
rgx = re.compile(r'^\d{2}-\d{2}-\d{4}_\d{2}-\d{2}-\d{2}-\d{3}$')
log_files = []
for fn in files:
for i, part in enumerate(Path(fn).parts):
if rgx.match(part):
# str() because you're asking for strings, but could just leave them as paths
log_files.append(str(Path(*Path(fn).parts[i:])))
break
print(log_files)
Output:
['12-28-2020_11-34-22-026\\log.DEBUG', '12-28-2020_11-34-22-026\\log.INFO', '12-28-2020_11-34-22-026\\summary.txt', '12-28-2020_11-34-22-026\\UnitTest\\ResultSheet_2020-12-28_11-34-37.txt', '12-28-2020_11-34-22-026\\UnitTest\\Emulator8080\\logcat_emulator_8080_12-28-2020_11-34-24-826.txt']
What this part log_files.append(str(Path(*Path(fn).parts[i:]))) does:
appends a new result to log_files
the result is the string representation str() of the Path() created from the parts of Path(fn) from index i and onwards.
Path() expects the parts as separate arguments, so the list is spread with the spreading operator *.
after the match is found and the remaining parts combined and appended, the loop can stop (break), since even if the pattern occurs again, you don't want it to match again for that path.
Mind you, in your desired outcome, you have all the paths start with a / - I think that's a mistake, as that would suggest the paths start in the root, while they are paths relative to some other folders. But of course you can add the / if you need it somehow.

Related

Python RE Directories and slashes

Let's say I have a string that is a root directory that has been entered
'C:/Users/Me/'
Then I use os.listdir() and join with it to create a list of subdirectories.
I end up with a list of strings that are like below:
'C:/Users/Me/Adir\Asubdir\'
and so on.
I want to split the subdirectories and capture each directory name as its own element. Below is one attempt. I am seemingly having issues with the \ and / characters. I assume \ is escaping, so '[\\/]' to me that says look for \ or / so then '[\\/]([\w\s]+)[\\/]' as a match pattern should look for any word between two slashes... but the output is only ['/Users/'] and nothing else is matched. So I then I add a escape for the forward slash.
'[\\\/]([\w\s]+)[\\\/]'
However, my output then only becomes ['Users','ADir'] so that is confusing the crud out of me.
My question is namely how do I tokenize each directory from a string using both \ and / but maybe also why is my RE not working as I expect?
Minimal Example:
import re, os
info = re.compile('[\\\/]([\w ]+)[\\\/]')
root = 'C:/Users/i12500198/Documents/Projects/'
def getFiles(wdir=os.getcwd()):
files = (os.path.join(wdir,file) for file in os.listdir(wdir)
if os.path.isfile(os.path.join(wdir,file)))
return list(files)
def getDirs(wdir=os.getcwd()):
dirs = (os.path.join(wdir,adir) for adir in os.listdir(wdir)
if os.path.isdir(os.path.join(wdir,adir)))
return list(dirs)
def walkSubdirs(root,below=[]):
subdirs = getDirs(root)
for aDir in subdirs:
below.append(aDir)
walkSubdirs(aDir,below)
return below
subdirs = walkSubdirs(root)
for aDir in subdirs:
files = getFiles(aDir)
for f in files:
finfo = info.findall(f)
print(f)
print(finfo)
I want to split the subdirectories and capture each directory name as its own element
Instead of regular expressions, I suggest you use one of Python's standard functions for parsing filesystem paths.
Here is one using pathlib:
from pathlib import Path
p = Path("C:/Users/Me/ADir\ASub Dir\2 x 2 Dir\\")
p.parts
#=> ('C:\\', 'Users', 'Me', 'ADir', 'ASub Dir\x02 x 2 Dir')
Note that the behaviour of pathlib.Path depends on the system running Python. Since I'm on a Linux machine, I actually used pathlib.PureWindowsPath here. I believe the output should be accurate for those of you on Windows.

how to list the images paths in the directory in python?

I created one folder called dataset, then in this folder i created subfolder called subfold1, subfold2
names=[]
for users in os.listdir("dataset"):
names.append(users)
print(names)
Output:
['subfold1','subfold2']
In the subfold1 , i have 5 images and subfold2 also i have 5 images
Then, i want to list the images paths which i have inside the subfold1 and subfol2?
path= []
for name in names:
for image in os.listdir("dataset/{}".format(name)):
path_string = os.path.join("dataset/{}".format(name), image)
path.append(path_string)
print(path)
My output is
['dataset/subfold1\\1_1.jpg', 'dataset/subfold1\\1_2.jpg', 'dataset/subfold1\\1_3.jpg', 'dataset/subfold1\\1_4.jpg', 'dataset/subfold1\\1_5.jpg', 'dataset/subfold2\\2_1.jpg', 'dataset/subfold3\\2_2.jpg', 'dataset/subfold2\\2_3.jpg', 'dataset/subfold2\\2_4.jpg', 'dataset/subfold2\\2_5.jpg']
I want the correct paths
You code works correctly in Linux.
However you may want to simplify it by using os.walk. Please see below:
new_names = []
for dirpath, dirnames, filenames in os.walk('dataset'):
for filename in filenames:
new_names.append(os.path.join(dirpath, filename))
print(new_names)
which gives me following output:
['dataset/subfold2/93.jpg', 'dataset/subfold2/99.jpg', 'dataset/subfold2/97.jpg', 'dataset/subfold1/3.jpg', 'dataset/subfold1/2.jpg', 'dataset/subfold1/1.jpg']
I think you're in windows OS. As you know in Windows \ is the address separator.
And as \ is the escape character in Python (it will be followed by another character indicating a special character, for example, \t means tab), thus the \\ means \, and your addresses are totally correct and you can change the / to \\ in compliance with the Windows rule. BTW, I strongly suggest you to apply the pathlib library. It is more convenient and powerful.
from pathlib import Path
p = Path('MyPictures')
for image in p.iterdir():
print(image)
quick solution;
use os.path.normpath for normalizing a path (modifying every separator to os.path.sep AND adapting the path to the operating system)
paths = [
'dataset/subfold1\\1_1.jpg',
'dataset/subfold1\\1_2.jpg',
'dataset/subfold1\\1_3.jpg',
'dataset/subfold1\\1_4.jpg',
'dataset/subfold1\\1_5.jpg',
'dataset/subfold2\\2_1.jpg',
'dataset/subfold3\\2_2.jpg',
'dataset/subfold2\\2_3.jpg',
'dataset/subfold2\\2_4.jpg',
'dataset/subfold2\\2_5.jpg'
]
import os
paths = list(map(os.path.normpath, paths))
>>> paths
out
['dataset\\subfold1\\1_1.jpg',
'dataset\\subfold1\\1_2.jpg',
'dataset\\subfold1\\1_3.jpg',
'dataset\\subfold1\\1_4.jpg',
'dataset\\subfold1\\1_5.jpg',
'dataset\\subfold2\\2_1.jpg',
'dataset\\subfold3\\2_2.jpg',
'dataset\\subfold2\\2_3.jpg',
'dataset\\subfold2\\2_4.jpg',
'dataset\\subfold2\\2_5.jpg']
extra info:
you cant get rid of this \\ from this 'dataset/subfold1\\1_1.jpg' because that is the string __repr__ of the element, and when you do __repr__ you see double backslash because its escaped. if you will actually print the value on the screen you will see just one \
quick demo:
print('dataset\\subfold1\\1_1.jpg')
out
dataset\subfold1\1_1.jpg
if you really want to join the paths with / then make your own join function 4 paths (also i dont recommend this, i made that in the past and i realised that it was worthless, because os.path is handling everything for you)
but if you are on windows and you really want to have linux path separator you can try this:
paths = [
'dataset/subfold1\\1_1.jpg',
'dataset/subfold1\\1_2.jpg',
'dataset/subfold1\\1_3.jpg',
'dataset/subfold1\\1_4.jpg',
'dataset/subfold1\\1_5.jpg',
'dataset/subfold2\\2_1.jpg',
'dataset/subfold3\\2_2.jpg',
'dataset/subfold2\\2_3.jpg',
'dataset/subfold2\\2_4.jpg',
'dataset/subfold2\\2_5.jpg'
]
paths = [path.replace("\\", "/") for path in paths]
>>> paths
out
['dataset/subfold1/1_1.jpg',
'dataset/subfold1/1_2.jpg',
'dataset/subfold1/1_3.jpg',
'dataset/subfold1/1_4.jpg',
'dataset/subfold1/1_5.jpg',
'dataset/subfold2/2_1.jpg',
'dataset/subfold3/2_2.jpg',
'dataset/subfold2/2_3.jpg',
'dataset/subfold2/2_4.jpg',
'dataset/subfold2/2_5.jpg']`

Why are os.path.join() on a list and os.path.sep.join() on a list not functionally identical?

I'm working on a program that needs to split and rejoin some file paths, and I'm not sure why os.path.join(*list) and os.path.sep.join(list) produce different results when there is a drive letter present in the separated path.
import os
path = 'C:\\Users\\choglan\\Desktop'
separatedPath = path.split(os.path.sep)
# ['C:', 'Users', 'choglan', 'Desktop']
path = os.path.sep.join(separatedPath)
# C:\\Users\\choglan\\Desktop
print(path)
path = os.path.join(*separatedPath)
# C:Users\\choglan\\Desktop
print(path)
Why does this happen? And should I just use os.path.sep.join(list) for my program even though os.path.join(*list) seems to be more commonly used?
os.path.join is not intended to be the inverse of path.split(os.path.sep). If you read the docs, you'll find a description of a much more complicated process than just sticking os.path.sep between the arguments. The most relevant part is the following:
On Windows... Note that since there is a current directory for each drive, os.path.join("c:", "foo") represents a path relative to the current directory on drive C: (c:foo), not c:\foo.
You should probably be using pathlib.PurePath(path).parts rather than path.split(os.path.sep).
os.path.sep isn't an independent object with its own methods, it's a string. So the join method on it just pastes together the strings with that character between them.
>>> type(os.path.sep)
<class 'str'>
You can use join from any string.
>>> '|'.join(separatedPath)
'C:|Users|choglan|Desktop'

How to manipulate directory paths to work across multiple OS?

I'm writing a python script which has to internally create output path from the input path. However, I am facing issues to create the path which I can use irrespective of OS.
I have tried to use os.path.join and it has its own limitations.
Apart from that, I think simple string concatenation is not the way to go.
Pathlib can be an option but I am not allowed to use it.
import os
inputpath = "C:\projects\django\hereisinput"
lastSlash = left.rfind("\\")
# This won't work as os path join stops at a slash
outputDir = os.path.join(left[:lastSlash], "\internal\morelevel\outputpath")
OR
OutDir = left[:lastSlash] + "\internal\morelevel\outputpath"
Expected output path :
C:\projects\django\internal\morelevel\outputpath
Also, the above code doesn't do it OS Specific where in Linux, the slash will be different.
Is os.sep() some option ?
From the documentation os.path.join can join "one or more path components...". So you could split "\internal\morelevel\outputpath" up into each of its components and pass all of them to your os.path.join function instead. That way you don't need to "hard-code" the separator between the path components. For example:
paths = ("internal", "morelevel", "outputpath")
outputDir = os.path.join(left[:lastSlash], *paths)
Remember that the backslash (\) is a special character in Python so your strings containing singular backslashes wouldn't work as you expect them to! You need to escape them with another \ in front.
This part of your code lastSlash = left.rfind("\\") might also not work on any operating system. You could rather use something like os.path.split to get the last part of the path that you need. For example, _, lastSlash = os.path.split(left).
Assuming your original path is "C:\projects\django\hereisinput", your other part of the path as "internal\morelevel\outputpath" (notice this is a relative path, not absolute), you could always move your primary back one folder (or more) and then append the second part. Do note that your first path needs to contain only folders and can be absolute or relative, while your second path must always be relative for this hack to work
path_1 = r"C:\projects\django\hereisinput"
path_2 = r"internal\morelevel\outputpath"
path_1_one_folder_down = os.path.join(path_1, os.path.pardir)
final_path = os.path.join(path_1_one_folder_down, path_2)
'C:\\projects\\django\\hereisinput\\..\\internal\\morelevel\\outputpath'

os.path.join producing an extra forward slash

I am trying to join an absolute path and variable folder path depending on the variable run. However when I use the following code it inserts a forward slash after a string, which I don't require. How can I remove the slash after Folder_?
import os
currentwd = os.getcwd()
folder = '001'
run_folder = os.path.join(currentwd, 'Folder_', folder)
print run_folder
The output I get using this code is:
/home/xkr/Workspace/Folder_/001
You are asking os.path.join() to take multiple path elements and join them. It is doing its job.
Don't use os.path.join() to produce filenames; just use concatenation:
run_folder = os.path.join(currentwd, 'Folder_' + folder)
or use string formatting; the latter can give you such nice features such as automatic padding of integers:
folder = 1
run_folder = os.path.join(currentwd, 'Folder_{:03d}'.format(folder))
That way you can increment folder past 10 or 100 and still have the correct number of leading zeros.
Note that you don't have to use os.getcwd(); you could also use os.path.abspath(), it'll make relative paths absolute based on the current working directory:
run_folder = os.path.abspath('Folder_' + folder)

Categories

Resources