I am looking for a way to extract a filename and extension from a particular url using Python
lets say a URL looks as follows
picture_page = "http://distilleryimage2.instagram.com/da4ca3509a7b11e19e4a12313813ffc0_7.jpg"
How would I go about getting the following.
filename = "da4ca3509a7b11e19e4a12313813ffc0_7"
file_ext = ".jpg"
try:
# Python 3
from urllib.parse import urlparse
except ImportError:
# Python 2
from urlparse import urlparse
from os.path import splitext, basename
picture_page = "http://distilleryimage2.instagram.com/da4ca3509a7b11e19e4a12313813ffc0_7.jpg"
disassembled = urlparse(picture_page)
filename, file_ext = splitext(basename(disassembled.path))
Only downside with this is that your filename will contain a preceding / which you can always remove yourself.
Try with urlparse.urlsplit to split url, and then os.path.splitext to retrieve filename and extension (use os.path.basename to keep only the last filename) :
import urlparse
import os.path
picture_page = "http://distilleryimage2.instagram.com/da4ca3509a7b11e19e4a12313813ffc0_7.jpg"
print os.path.splitext(os.path.basename(urlparse.urlsplit(picture_page).path))
>>> ('da4ca3509a7b11e19e4a12313813ffc0_7', '.jpg')
filename = picture_page.split('/')[-1].split('.')[0]
file_ext = '.'+picture_page.split('.')[-1]
# Here's your link:
picture_page = "http://distilleryimage2.instagram.com/da4ca3509a7b11e19e4a12313813ffc0_7.jpg"
#Here's your filename and ext:
filename, ext = (picture_page.split('/')[-1].split('.'))
When you do picture_page.split('/'), it will return a list of strings from your url split by a /.
If you know python list indexing well, you'd know that -1 will give you the last element or the first element from the end of the list.
In your case, it will be the filename: da4ca3509a7b11e19e4a12313813ffc0_7.jpg
Splitting that by delimeter ., you get two values:
da4ca3509a7b11e19e4a12313813ffc0_7 and jpg, as expected, because they are separated by a period which you used as a delimeter in your split() call.
Now, since the last split returns two values in the resulting list, you can tuplify it.
Hence, basically, the result would be like:
filename,ext = ('da4ca3509a7b11e19e4a12313813ffc0_7', 'jpg')
os.path.splitext will help you extract the filename and extension once you have extracted the relevant string from the URL using urlparse:
fName, ext = os.path.splitext('yourImage.jpg')
This is the easiest way to find image name and extension using regular expression.
import re
import sys
picture_page = "http://distilleryimage2.instagram.com/da4ca3509a7b11e19e4a12313813ffc0_7.jpg"
regex = re.compile('(.*\/(?P<name>\w+)\.(?P<ext>\w+))')
print regex.search(picture_page).group('name')
print regex.search(picture_page).group('ext')
>>> import re
>>> s = 'picture_page = "http://distilleryimage2.instagram.com/da4ca3509a7b11e19e4a12313813ffc0_7.jpg"'
>>> re.findall(r'\/([a-zA-Z0-9_]*)\.[a-zA-Z]*\"$',s)[0]
'da4ca3509a7b11e19e4a12313813ffc0_7'
>>> re.findall(r'([a-zA-Z]*)\"$',s)[0]
'jpg'
Related
I have a folder with hundreds of files named like:
"2017_05_S2B_7VEG_20170528_0_L2A_B01.tif"
Convention:
year_month_ID_zone_date_0_L2A_B01.tif ("_0_L2A_B01.tif", and "zone" never change)
What I need is to iterate through every file and build a path based on their name in order to download them.
For example:
name = "2017_05_S2B_7VEG_20170528_0_L2A_B01.tif"
path = "2017/5/S2B_7VEG_20170528_0_L2A/B01.tif"
The path convention needs to be: path = year/month/ID_zone_date_0_L2A/B01.tif
I thought of making a loop which would "cut" my string into several parts every time it encounters a "_" character, then stitch the different parts in the right order to create my path name.
I tried this but it didn't work:
import re
filename =
"2017_05_S2B_7VEG_20170528_0_L2A_B01.tif"
try:
found = re.search('_(.+?)_', filename).group(1)
except AttributeError:
# _ not found in the original string
found = '' # apply your error handling
How could I achieve that on Python ?
Since you only have one separator character, you may as well simply use Python's built in split function:
import os
items = filename.split('_')
year, month = items[:2]
new_filename = '_'.join(items[2:])
path = os.path.join(year, month, new_filename)
Try the following code snippet
filename = "2017_05_S2B_7VEG_20170528_0_L2A_B01.tif"
found = re.sub('(\d+)_(\d+)_(.*)_(.*)\.tif', r'\1/\2/\3/\4.tif', filename)
print(found) # prints 2017/05/S2B_7VEG_20170528_0_L2A/B01.tif
No need for a regex -- you can just use split().
filename = "2017_05_S2B_7VEG_20170528_0_L2A_B01.tif"
parts = filename.split("_")
year = parts[0]
month = parts[1]
Maybe you can do like this:
from os import listdir, mkdir
from os.path import isfile, join, isdir
my_path = 'your_soure_dir'
files_name = [f for f in listdir(my_path) if isfile(join(my_path, f))]
def create_dir(files_name):
for file in files_name:
month = file.split('_', '1')[0]
week = file.split('_', '2')[1]
if not isdir(my_path):
mkdir(month)
mkdir(week)
### your download code
filename = "2017_05_S2B_7VEG_20170528_0_L2A_B01.tif"
temp = filename.split('_')
result = "/".join(temp)
print(result)
result is
2017/05/S2B/7VEG/20170528/0/L2A/B01.tif
In python, suppose I have a path like this:
/folderA/folderB/folderC/folderD/
How can I get just the folderD part?
Use os.path.normpath, then os.path.basename:
>>> os.path.basename(os.path.normpath('/folderA/folderB/folderC/folderD/'))
'folderD'
The first strips off any trailing slashes, the second gives you the last part of the path. Using only basename gives everything after the last slash, which in this case is ''.
With python 3 you can use the pathlib module (pathlib.PurePath for example):
>>> import pathlib
>>> path = pathlib.PurePath('/folderA/folderB/folderC/folderD/')
>>> path.name
'folderD'
If you want the last folder name where a file is located:
>>> path = pathlib.PurePath('/folderA/folderB/folderC/folderD/file.py')
>>> path.parent.name
'folderD'
You could do
>>> import os
>>> os.path.basename('/folderA/folderB/folderC/folderD')
UPDATE1: This approach works in case you give it /folderA/folderB/folderC/folderD/xx.py. This gives xx.py as the basename. Which is not what you want I guess. So you could do this -
>>> import os
>>> path = "/folderA/folderB/folderC/folderD"
>>> if os.path.isdir(path):
dirname = os.path.basename(path)
UPDATE2: As lars pointed out, making changes so as to accomodate trailing '/'.
>>> from os.path import normpath, basename
>>> basename(normpath('/folderA/folderB/folderC/folderD/'))
'folderD'
Here is my approach:
>>> import os
>>> print os.path.basename(
os.path.dirname('/folderA/folderB/folderC/folderD/test.py'))
folderD
>>> print os.path.basename(
os.path.dirname('/folderA/folderB/folderC/folderD/'))
folderD
>>> print os.path.basename(
os.path.dirname('/folderA/folderB/folderC/folderD'))
folderC
I was searching for a solution to get the last foldername where the file is located, I just used split two times, to get the right part. It's not the question but google transfered me here.
pathname = "/folderA/folderB/folderC/folderD/filename.py"
head, tail = os.path.split(os.path.split(pathname)[0])
print(head + " " + tail)
I like the parts method of Path for this:
grandparent_directory, parent_directory, filename = Path(export_filename).parts[-3:]
log.info(f'{t: <30}: {num_rows: >7} Rows exported to {grandparent_directory}/{parent_directory}/{filename}')
If you use the native python package pathlib it's really simple.
>>> from pathlib import Path
>>> your_path = Path("/folderA/folderB/folderC/folderD/")
>>> your_path.stem
'folderD'
Suppose you have the path to a file in folderD.
>>> from pathlib import Path
>>> your_path = Path("/folderA/folderB/folderC/folderD/file.txt")
>>> your_path.name
'file.txt'
>>> your_path.parent
'folderD'
During my current projects, I'm often passing rear parts of a path to a function and therefore use the Path module. To get the n-th part in reverse order, I'm using:
from typing import Union
from pathlib import Path
def get_single_subpath_part(base_dir: Union[Path, str], n:int) -> str:
if n ==0:
return Path(base_dir).name
for _ in range(n):
base_dir = Path(base_dir).parent
return getattr(base_dir, "name")
path= "/folderA/folderB/folderC/folderD/"
# for getting the last part:
print(get_single_subpath_part(path, 0))
# yields "folderD"
# for the second last
print(get_single_subpath_part(path, 1))
#yields "folderC"
Furthermore, to pass the n-th part in reverse order of a path containing the remaining path, I use:
from typing import Union
from pathlib import Path
def get_n_last_subparts_path(base_dir: Union[Path, str], n:int) -> Path:
return Path(*Path(base_dir).parts[-n-1:])
path= "/folderA/folderB/folderC/folderD/"
# for getting the last part:
print(get_n_last_subparts_path(path, 0))
# yields a `Path` object of "folderD"
# for second last and last part together
print(get_n_last_subparts_path(path, 1))
# yields a `Path` object of "folderc/folderD"
Note that this function returns a Pathobject which can easily be converted to a string (e.g. str(path))
path = "/folderA/folderB/folderC/folderD/"
last = path.split('/').pop()
str = "/folderA/folderB/folderC/folderD/"
print str.split("/")[-2]
I want to insert a directory name in the middle of a given file path, like this:
directory_name = 'new_dir'
file_path0 = 'dir1/dir2/dir3/dir4/file.txt'
file_path1 = some_func(file_path0, directory_name, position=2)
print(file_path1)
>>> 'dir1/dir2/new_dir/dir3/dir4/file.txt'
I looked through the os.path and pathlib packages, but it looks like they don't have a function that allows for inserting in the middle of a file path. I tried:
import sys,os
from os.path import join
path_ = file_path0.split(os.sep)
path_.insert(2, 'new_dir')
print(join(path_))
but this results in the error
"expected str, bytes or os.PathLike object, not list"
Does anyone know standard python functions that allow such inserting in the middle of a file path? Alternatively - how can I turn path_ to something that can be processed by os.path. I am new to pathlib, so maybe I missed something out there
Edit: Following the answers to the question I can suggest the following solutions:
1.) As Zach Favakeh suggests and as written in this answer just correct my code above to join(*path_) by using the 'splat' operator * and everything is solved.
2.) As suggested by buran you can use the pathlib package, in very short it results in:
from pathlib import PurePath
path_list = list(PurePath(file_path0).parts)
path_list.insert(2, 'new_dir')
file_path1 = PurePath('').joinpath(*path_list)
print(file_path1)
>>> 'dir1/dir2/new_dir/dir3/dir4/file.txt'
Take a look at pathlib.PurePath.parts. It will return separate components of the path and you can insert at desired position and construct the new path
>>> from pathlib import PurePath
>>> file_path0 = 'dir1/dir2/dir3/dir4/file.txt'
>>> p = PurePath(file_path0)
>>> p.parts
('dir1', 'dir2', 'dir3', 'dir4', 'file.txt')
>>> spam = list(p.parts)
>>> spam.insert(2, 'new_dir')
>>> new_path = PurePath('').joinpath(*spam)
>>> new_path
PurePosixPath('dir1/dir2/new_dir/dir3/dir4/file.txt')
This will work with path as a str as well as with pathlib.Path objects
Since you want to use join on a list to produce the pathname, you should do the following using the "splat" operator: Python os.path.join() on a list
Edit: You could also take your np array and concatenate its elements into a string using np.array2string, using '/' as your separator parameter:https://docs.scipy.org/doc/numpy-1.14.0/reference/generated/numpy.array2string.html
Hope this helps.
Solution using regex. The regex will create groups of the following
[^\/]+ - non-'/' characters(i.e. directory names)
\w+\.\w+ - word characters then '.' then word characters (i.e. file name)
import re
directory_name = 'new_dir'
file_path0 = 'dir1/dir2/dir3/dir4/file.txt'
position = 2
regex = re.compile(r'([^\/]+|\w+\.\w+)')
tokens = re.findall(regex, file_path0)
tokens.insert(position, directory_name)
file_path1 = '/'.join(tokens)
Result:
'dir1/dir2/new_dir/dir3/dir4/file.txt'
Your solution has only one flaw. After inserting the new directory in the path list path_.insert(2, 'new_dir')you need to call os.path.join(*path_) to get the new modified path. The error that you get is because you are passing a list as parameter to the join function, but you have to unpack it.
In my case, I knew the portion of path that would precede the insertion point (i.e., "root"). However, the position of the insertion point was not constant due to the possibility of having varying number of path components in the root path. I used Path.relative_to() to break the full path to yield an insertion point for the new_dir.
from pathlib import Path
directory_name = Path('new_dir')
root = Path('dir1/dir2/')
file_path0 = Path('dir1/dir2/dir3/dir4/file.txt')
# non-root component of path
chld = file_path0.relative_to(root)
file_path1 = root / directory_name / chld
print(file_path1)
Result:
'dir1/dir2/new_dir/dir3/dir4/file.txt'
I made a try with your need:
directory_name = '/new_dir'
file_path0 = 'dir1/dir2/dir3/dir4/file.txt'
before_the_newpath = 'dir1/dir2'
position = file_path0.split(before_the_newpath)
new_directory = before_the_newpath + directory_name + position[1]
Hope it helps.
I want to create a list of all the filepath names that match a specific string e.g. "04_DEM" so I can do further processing on the files inside those directories?
e.g.
INPUT
C:\directory\NewZealand\04DEM\DEM_CD23_1232.tif
C:\directory\Australia\04DEM\DEM_CD23_1233.tif
C:\directory\NewZealand\05DSM\DSM_CD23_1232.tif
C:\directory\Australia\05DSM\DSM_CD23_1232.tif
WANTED OUTPUT
C:\directory\NewZealand\04DEM\
C:\directory\Australia\04DEM\
This makes sure that only those files are processed, as some other files in the directories also have the same string "DEM" included in their filename, which I do not want to modify.
This is my bad attempt due to being a rookie with Py code
import os
for dirnames in os.walk('D:\Canterbury_2017Copy'):
print dirnames
if dirnames=='04_DEM' > listofdirectoriestoprocess.txt
print "DONE CHECK TEXT FILE"
You can use os.path for this:
import os
lst = [r'C:\directory\NewZealand\04DEM\DEM_CD23_1232.tif',
r'C:\directory\Australia\04DEM\DEM_CD23_1233.tif',
r'C:\directory\NewZealand\05DSM\DSM_CD23_1232.tif',
r'C:\directory\Australia\05DSM\DSM_CD23_1232.tif']
def filter_paths(lst, x):
return [os.path.split(i)[0] for i in lst if os.path.normpath(i).split(os.sep)[3] == x]
res = list(filter_paths(lst, '04DEM'))
# ['C:\\directory\\NewZealand\\04DEM',
# 'C:\\directory\\Australia\\04DEM']
Use in to check if a required string is in another string.
This is one quick way:
new_list = []
for path in path_list:
if '04DEM' in path:
new_list.append(path)
Demo:
s = 'C:/directory/NewZealand/04DEM/DEM_CD23_1232.tif'
if '04DEM' in s:
print(True)
# True
Make sure you use / or \\ as directory separator instead of \ because the latter escapes characters.
First, you select via regex using re, and then use pathlib:
import re
import pathlib
pattern = re.compile('04DEM')
# You use pattern.search() if s is IN the string
# You use pattern.match() if s COMPLETELY matches the string.
# Apply the correct function to your use case.
files = [s in list_of_files if pattern.search(s)]
all_pruned_paths = set()
for p in files:
total = ""
for d in pathlib.Path(p):
total = os.path.join(total, d)
if pattern.search(s):
break
all_pruned_paths.add(total)
result = list(all_pruned_paths)
This is more robust than using in because you might need to form more complicated queries in the future.
I basically want to call only the part of my string that falls before the "."
For example, if my filename is sciPHOTOf105w0.fits, I want to call "sciPHOTOf105w" as its own string so that I can use it to name a new file that's related to it. How do you do this? I can't just use numeral values "ex. if my file name is 'file', file[5:10]." I need to be able to collect everything up to the dot without having to count, because the file names can be of different lengths.
You can also use os.path like so:
>>> from os.path import splitext
>>> splitext('sciPHOTOf105w0.fits') # separates extension from file name
('sciPHOTOf105w0', '.fits')
>>> splitext('sciPHOTOf105w0.fits')[0]
'sciPHOTOf105w0'
If your file happens to have a longer path, this approach will also account for your full path.
import os.path
filename = "sciPHOTOf105w0.fits"
root, ext = os.path.splitext(filename)
print "root is: %s" % root
print "ext is: %s" % ext
result:
>root is: sciPHOTOf105w0
>ext is: .fits
In [33]: filename = "sciPHOTOf105w0.fits"
In [34]: filename.rpartition('.')[0]
Out[34]: 'sciPHOTOf105w0'
In [35]: filename.rsplit('.', 1)[0]
Out[35]: 'sciPHOTOf105w0'
You can use .index() on a string to find the first occurence of a substring.
>>> filename = "sciPHOTOf105w0.fits"
>>> filename.index('.')
14
>>> filename[:filename.index('.')]
'sciPHOTOf105w0'