Question about paths in Python - python

Let's say I have directory paths looking like this:
this/is/the/basedir/path/a/include
this/is/the/basedir/path/b/include
this/is/the/basedir/path/a
this/is/the/basedir/path/b
In Python, how can I split these paths up so they look like this instead:
a/include
b/include
a
b
If I run os.path.split(path)[1], I get:
include
include
a
b
What should I try here? Should I be looking at a regex, or can this be done without one? Thanks in advance.
EDIT ALL: I solved it using regular expressions, damn handy tool :)

Perhaps something like this, depends on how hardcoded your prefix is:
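import os
def removePrefix(path, prefix):
    # Compare the paths component by component rather than as raw strings
    plist = path.split(os.sep)
    pflist = prefix.split(os.sep)
    # Drop as many leading components as the prefix has
    rest = plist[len(pflist):]
    return os.path.join(*rest)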
Usage:
print(removePrefix("this/is/the/basedir/path/b/include", "this/is/the/basedir/path"))
b/include
This assumes you're on a platform where the directory separator (os.sep) really is the forward slash.
This code tries to handle paths as something a little more high-level than mere strings. It's not optimal, though: you could (or should) do more cleaning and canonicalization to be safer.
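On Python 3.4 and later, pathlib offers a similar component-wise approach. The following is only a minimal sketch, not part of the answer above: strip_prefix is a hypothetical helper name, and PurePosixPath is used on the assumption that the paths always use forward slashes.

```python
from pathlib import PurePosixPath

def strip_prefix(path, prefix):
    """Return path relative to prefix, comparing whole components."""
    # relative_to raises ValueError if path is not located under prefix
    return str(PurePosixPath(path).relative_to(prefix))

print(strip_prefix("this/is/the/basedir/path/b/include", "this/is/the/basedir/path"))
# prints: b/include
```

Unlike slicing by component count, this fails loudly when the path does not actually start with the prefix.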

Maybe something like this:
import os.path
result = []
prefix = os.path.commonprefix(list_of_paths)
for path in list_of_paths:
    result.append(os.path.relpath(path, prefix))
This works only in Python 2.6 and later, where os.path.relpath was added. Note that os.path.commonprefix compares strings character by character, so the prefix it returns may not end on a directory boundary.
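On Python 3.5 and later, os.path.commonpath compares whole path components instead of characters, which avoids cutting a directory name in half. A sketch under that assumption (posixpath is used so the separators are always forward slashes, and the alpha/alps paths are made up for illustration):

```python
import posixpath

paths = [
    "this/is/the/basedir/path/a/include",
    "this/is/the/basedir/path/b/include",
    "this/is/the/basedir/path/a",
    "this/is/the/basedir/path/b",
]

# commonpath compares whole components, never partial directory names
base = posixpath.commonpath(paths)
result = [posixpath.relpath(p, base) for p in paths]
print(result)  # ['a/include', 'b/include', 'a', 'b']

# Made-up paths showing the difference between the two functions
similar = ["base/path/alpha", "base/path/alps"]
print(posixpath.commonprefix(similar))  # base/path/alp
print(posixpath.commonpath(similar))    # base/path
```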

What about partition?
It splits the string at the first occurrence of sep and returns a 3-tuple containing the part before the separator, the separator itself, and the part after the separator. If the separator is not found, it returns a 3-tuple containing the string itself, followed by two empty strings.
data = """this/is/the/basedir/path/a/include
this/is/the/basedir/path/b/include
this/is/the/basedir/path/a
this/is/the/basedir/path/b"""
for line in data.splitlines():
    print(line.partition("this/is/the/basedir/path/")[2])
# output
a/include
b/include
a
b
Updated for the new comment by author:
It looks like you need rsplit, treating directories differently depending on whether the path ends with "include" or not:
import os.path
data = """this/is/the/basedir/path/a/include
this/is/the/basedir/path/b/include
this/is/the/basedir/path/a
this/is/the/basedir/path/b"""
for line in data.splitlines():
    if line.endswith('include'):
        print('/'.join(line.rsplit("/", 2)[-2:]))
    else:
        print(os.path.split(line)[1])
        # or just
        # print(line.rsplit("/", 1)[-1])
# output
a/include
b/include
a
b

While the criterion is not 100% clear, it seems from the OP's comment that the key issue is specifically whether the path's last component ends in "include". If that is the case, and to avoid going wrong when the last component is e.g. "dontinclude" (as another answer does by trying string matching instead of path matching), I suggest:
def lastpart(apath):
    # os.path.split only returns (head, tail), so split on every separator instead
    pieces = apath.split('/')
    final = -1
    if pieces[-1] == 'include':
        final = -2
    return '/'.join(pieces[final:])
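The same idea can be sketched with pathlib, which compares whole components so that a directory named "dontinclude" is not mistaken for "include". This is an illustration only: last_part is a hypothetical name, PurePosixPath assumes forward-slash paths, and the x/dontinclude path is made up.

```python
from pathlib import PurePosixPath

def last_part(path):
    """Keep the last component, or the last two if the final one is 'include'."""
    parts = PurePosixPath(path).parts
    keep = 2 if parts and parts[-1] == "include" else 1
    return "/".join(parts[-keep:])

for p in ["this/is/the/basedir/path/a/include",
          "this/is/the/basedir/path/b",
          "x/dontinclude"]:
    print(last_part(p))
# a/include
# b
# dontinclude
```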

Related

Select the second specific string if found more than one in a string variable [duplicate]

I'm looking for a simple method of identifying the last position of a string inside another string ... for instance. If I had: file = C:\Users\User\Desktop\go.py
and I wanted to crop this so that file = go.py
Normally I would have to run C:\Users\User\Desktop\go.py through a loop + find statement, and every time it encountered a \ it would ask ... is this the last \ in the string? ... Once I found the last \ I would then do file = file[last\:len(file)]
I'm curious to know if there is a faster neater way to do this.. preferably without a loop.
Something like file = [file('\',last):len(file)]
If there is nothing like what I've shown above ... then can we place the loop inside the [:] somehow. Something like file = [for i in ...:len(file)]
thanks :)
If it is only about file paths, you can use os.path.basename:
>>> import os
>>> os.path.basename(file)
'go.py'
Or if you are not running the code on Windows, you have to use ntpath instead of os.path.
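To illustrate the point about ntpath, here is a small sketch that gives the same result on any platform. It is not from the answer above; it just compares the Windows and POSIX rules on the path from the question.

```python
import ntpath
import posixpath
from pathlib import PureWindowsPath

# Raw string so the backslashes are taken literally
file = r'C:\Users\User\Desktop\go.py'

print(ntpath.basename(file))       # go.py
print(PureWindowsPath(file).name)  # go.py
# POSIX rules do not treat the backslash as a separator
print(posixpath.basename(file) == file)  # True
```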
You could split the string into a list then get the last index of the list.
Sample:
>>> file = r'C:\Users\User\Desktop\go.py'
>>> print(file.split('\\')[-1])
go.py
I agree with Felix on that file paths should be handled using os.path.basename. However, you might want to have a look at the built in string function rpartition.
>>> file = r'C:\Users\User\Desktop\go.py'
>>> before, separator, after = file.rpartition('\\')
>>> before
'C:\\Users\\User\\Desktop'
>>> separator
'\\'
>>> after
'go.py'
There's also the rfind function which gives you the last index of a substring.
>>> file.rfind('\\')
21
I realize that I'm a bit late to the party, but since this is one of the top results when searching for e.g. "find last in str python" on Google, I think it might help someone to add this information.
For the general purpose case (as the OP said they like the generalisation of the split solution)...... try the rfind(str) function.
"myproject-version2.4.5.customext.zip".rfind(".")
edit: apologies, I hadn't realized how old this thread was... :-/
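A short sketch of how rfind can be combined with slicing for the general case. The helper name after_last is hypothetical, and the check for -1 covers the case where the separator does not occur at all.

```python
def after_last(text, sep):
    """Return the part of text after the last sep, or all of text if sep is absent."""
    idx = text.rfind(sep)
    if idx == -1:
        return text
    return text[idx + len(sep):]

print(after_last("myproject-version2.4.5.customext.zip", "."))  # zip
print(after_last(r"C:\Users\User\Desktop\go.py", "\\"))         # go.py
print(after_last("no-separator-here", "."))                     # no-separator-here
```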
For pathname manipulations you want to be using os.path.
For this specific problem you want to use the os.path.basename(path) function which will return the last component of a path, or an empty string if the path ends in a slash (ie. the path of a folder rather than a file).
import os.path
print(os.path.basename(r"C:\Users\User\Desktop\go.py"))
Gives:
go.py

Control order of pathlib and string concatenation

I have a directory I want to save files to, saved as a Path object called dir. I want to autogenerate files names at that path using string concatenation.
The only way I can get this to work in a single line is just through string concatenation:
dir = Path('./Files')
constantString = 'FileName'
changingString = '_001'
path2newfile = dir.as_posix() + '/' + constantString + changingString
print(path2newfile)  # Files/FileName_001
... which is overly verbose and not platform independent.
What I'd want to do is use pathlib's / operator for easy manipulation of the new file path that is also platform independent. This would require ensuring that the string concatenation happens first, but the only way I know how to do that is to set a (pointless) variable:
filename = constantString + changingString
path2newfile = dir / filename
But I honestly don't see why this should have to take two lines.
If I instead use "actual" strings (i.e. string literals rather than variables containing strings), I can do something like this:
path2newfile = dir / 'Filename' '_001'
But this doesn't work with variables.
path2newfile = dir / constantString changingString
# SyntaxError: invalid syntax
So I think the base question is how do I control the order of operators in python? Or at least make the concatenation operator + act before the Path operator /.
Keep in mind this is a MWE. My actual problem is a bit more complicated and has to be repeated several times in the code.
Just use parentheses around your string concatenation:
path2newfile = dir / (constantString + changingString)
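The parentheses are needed because / binds more tightly than + in Python, so dir / a + b is evaluated as (dir / a) + b, and a Path cannot be added to a str. A minimal sketch (the variable is renamed to out_dir here to avoid shadowing the built-in dir, and PurePosixPath is used so the printed result is the same on every platform):

```python
from pathlib import PurePosixPath

out_dir = PurePosixPath('Files')
constantString = 'FileName'
changingString = '_001'

# Concatenate first, then join onto the directory
print(out_dir / (constantString + changingString))  # Files/FileName_001

# Without parentheses this is (out_dir / constantString) + changingString
try:
    out_dir / constantString + changingString
except TypeError as exc:
    print('TypeError:', exc)
```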
Have you considered using Python f-strings?
It seems like your real-world example has a "template-y" feel to it, so something like:
path / f"constant part {variable_part}"
may work.
Use os.path.join().
It's both platform-independent and you can plug the desired path parts as arguments.

python 3 regex not finding confirmed matches

So I'm trying to parse a bunch of citations from a text file using the re module in python 3.4 (on, if it matters, a mac running mavericks). Here's some minimal code. Note that there are two commented lines: they represent two alternative searches. (Obviously, the little one, r'Rawls', is the one that works)
def makeRefList(reffile):
    print(reffile)
    # namepattern = r'(^[A-Z1][A-Za-z1]*-?[A-Za-z1]*),.*( \(?\d\d\d\d[a-z]?[.)])'
    # namepattern = r'Rawls'
    refsTuplesList = re.findall(namepattern, reffile, re.MULTILINE)
    print(refsTuplesList)
The string in question is ugly, and so I stuck it in a gist: https://gist.github.com/paultopia/6c48c398a42d4834f2ae
As noted, the search string r'Rawls' produces expected output ['Rawls', 'Rawls']. However, the other search string just produces an empty list.
I've confirmed this regex (partially) works using the regex101 tester. Confirmation here: https://regex101.com/r/kP4nO0/1 -- this matches what I expect it to match. Since it works in the tester, it should work in the code, right?
(n.b. I copied the text from terminal output from the first print command, then manually replaced \n characters in the string with carriage returns for regex101.)
One possible issue is that python has appended the bytecode flag (is the little b called a "flag?") to the string. This is an artifact of my attempt to convert the text from utf-8 to ascii, and I haven't figured out how to make it go away.
Yet re clearly is able to parse strings in that form. I know this because I'm converting two text files from utf-8 to ascii, and the following code works perfectly fine on the other string, converted from the other text file, which also has a little b in front of it:
def makeCiteList(citefile):
    print(citefile)
    citepattern = r'[\s(][A-Z1][A-Za-z1]*-?[A-Za-z1]*[ ,]? \(?\d\d\d\d[a-z]?[\s.,)]'
    rawCitelist = re.findall(citepattern, citefile)
    cleanCitelist = cleanup(rawCitelist)
    finalCiteList = list(set(cleanCitelist))
    print(finalCiteList)
    return(finalCiteList)
The other chunk of text, which the code immediately above matches correctly: https://gist.github.com/paultopia/a12eba2752638389b2ee
The only hypothesis I can come up with is that the first, broken, regex expression is puking on the combination of newline characters and the string being treated as a byte object, even though a) I know the regex is correct for newlines (because, confirmation from the linked regex101), and b) I know it's matching the strings (because, confirmation from the successful match on the other string).
If that's true, though, I don't know what to do about it.
Thus, questions:
1) Is my hypothesis right that it's the combination of newlines and b that blows up my regex? If not, what is?
2) How do I fix that?
a) replace the newlines with something in the string?
b) rewrite the regex somehow?
c) somehow get rid of that b and make it into a normal string again? (how?)
thanks!
Addition
In case this is a problem I need to fix upstream, here's the code I'm using to get the text files and convert to ascii, replacing non-ascii characters:
This function gets called on UTF-8 .txt files saved by TextWrangler in Mavericks:
def makeCorpoi(citefile, reffile):
    citebox = open(citefile, 'r')
    refbox = open(reffile, 'r')
    citecorpus = citebox.read()
    refcorpus = refbox.read()
    citebox.close()
    refbox.close()
    corpoi = [str(citecorpus), str(refcorpus)]
    return corpoi
and then this function gets called on each element of the list the above function returns.
def conv2ASCII(bigstring):
    def convHandler(error):
        return ('1FOREIGN', error.start + 1)
    codecs.register_error('foreign', convHandler)
    bigstring = bigstring.encode('ascii', 'foreign')
    stringstring = str(bigstring)
    return stringstring
Aah. I've tracked it down and answered my own question. Apparently one needs to call the decode method on the encoded bytes. The following code produces an actual string, with newlines and everything, out the other end (though now I have to fix a bunch of other bugs before I can figure out if the final output is as expected):
def conv2ASCII(bigstring):
    def convHandler(error):
        return ('1FOREIGN', error.start + 1)
    codecs.register_error('foreign', convHandler)
    bigstring = bigstring.encode('ascii', 'foreign')
    newstring = bigstring.decode('ascii', 'foreign')
    return newstring
Apparently the str() function doesn't do the same job, for reasons that are mysterious to me. This is despite an answer to "How to make new line commands work in a .txt file opened from the internet?", which suggests that it does.
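A likely explanation, sketched here with a made-up two-line sample rather than the text from the gists: in Python 3, calling str() on a bytes object returns its printable representation, including the b prefix and the quotes, with each newline turned into the two characters backslash and n. With no real newlines left, ^ under re.MULTILINE can only match at the very start of the string, which now begins with a lowercase b.

```python
import re

# Made-up sample standing in for the reference list
raw = 'Rawls, J. (1971). A Theory of Justice.\nSen, A. (2009). The Idea of Justice.'
encoded = raw.encode('ascii')

as_repr = str(encoded)             # repr-style text: b'...' with a literal backslash-n
as_text = encoded.decode('ascii')  # real str containing a real newline

pattern = r'^[A-Z][A-Za-z]*'
print(re.findall(pattern, as_repr, re.MULTILINE))  # []
print(re.findall(pattern, as_text, re.MULTILINE))  # ['Rawls', 'Sen']
```

This would also explain why the other pattern still worked on its b-prefixed string: it does not rely on ^ anchoring at line starts.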

How to use filename.find()?

I am new to python, and am trying to understand a script that has the following lines:
dotInd = fileName.find(".")
if dotInd <> -1:
    newFC = fileName[0:dotInd]
    outFC = newFC + "_buffer"
else:
    outFC = fileName + "_buffer"
I have not been able to find what fileName.find(".") is doing, or what the condition dotInd <> -1 means
(Confused about the <> thing)
Any help would be appreciated. Also, is there a place where you can find a list of what all Python functions do? Thanks
fileName is an identifier, and refers to an object of type str. You are looking for str.find(). The method returns -1 if the sought-after text is not found, a position otherwise.
<> is an archaic and deprecated way of spelling != (it was removed in Python 3), so it tests if the '.' has been found; if so, the returned position is used to slice the string, removing everything from the '.' to the end.
The code could be better written as:
outFC = fileName.partition('.')[0] + '_buffer'
which will result in the same output without str.find() and testing the output. See the str.partition() function documentation for more information.
It would be more correct still to use os.path.splitext() function to prevent splitting on a leading . (signifying a hidden file on POSIX systems):
import os.path
outFC = os.path.splitext(fileName)[0] + '_buffer'
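To make the difference between the three variants concrete, here is a small comparison. The function names and the sample file names are made up for illustration; the first function is the original logic with != in place of <>.

```python
import os.path

def with_find(fileName):
    # Original logic: cut at the first '.', if there is one
    dotInd = fileName.find(".")
    return (fileName[:dotInd] if dotInd != -1 else fileName) + "_buffer"

def with_partition(fileName):
    return fileName.partition(".")[0] + "_buffer"

def with_splitext(fileName):
    # Only strips the last extension and leaves a leading dot alone
    return os.path.splitext(fileName)[0] + "_buffer"

for name in ["roads.shp", "roads", "data.tar.gz", ".hidden"]:
    print(name, with_find(name), with_partition(name), with_splitext(name))
# roads.shp roads_buffer roads_buffer roads_buffer
# roads roads_buffer roads_buffer roads_buffer
# data.tar.gz data_buffer data_buffer data.tar_buffer
# .hidden _buffer _buffer .hidden_buffer
```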

How can I split a URL string up into separate parts in Python?

I decided that I'll learn Python tonight :)
I know C pretty well (wrote an OS in it), so I'm not a noob in programming, so everything in Python seems pretty easy, but I don't know how to solve this problem:
let's say I have this address:
http://example.com/random/folder/path.html
Now how can I create two strings from this, one containing the "base" name of the server, so in this example it would be
http://example.com/
and another containing the thing without the last filename, so in this example it would be
http://example.com/random/folder/
Also I of course know the possibility to just find the third and last slash respectively, but is there a better way?
Also it would be cool to have the trailing slash in both cases, but I don't care since it can be added easily.
So is there a good, fast, effective solution for this? Or is there only "my" solution, finding the slashes?
The urlparse module in Python 2.x (or urllib.parse in Python 3.x) would be the way to do it.
>>> from urllib.parse import urlparse
>>> url = 'http://example.com/random/folder/path.html'
>>> parse_object = urlparse(url)
>>> parse_object.netloc
'example.com'
>>> parse_object.path
'/random/folder/path.html'
>>> parse_object.scheme
'http'
>>>
If you wanted to do more work on the path of the file under the URL, you can use the posixpath module:
>>> from posixpath import basename, dirname
>>> basename(parse_object.path)
'path.html'
>>> dirname(parse_object.path)
'/random/folder'
After that, you can use posixpath.join to glue the parts together.
Note: Windows users will choke on the path separator in os.path. The posixpath module documentation has a special reference to URL manipulation, so all's good.
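Putting those pieces together for the two strings asked about in the question, a Python 3 sketch might look like this. The helper name base_and_dir is hypothetical, and the sketch drops any query string or fragment, which the example URL does not have.

```python
import posixpath
from urllib.parse import urlsplit, urlunsplit

def base_and_dir(url):
    """Return (server base URL, directory URL), both with a trailing slash."""
    parts = urlsplit(url)
    base = urlunsplit((parts.scheme, parts.netloc, '/', '', ''))
    directory = posixpath.dirname(parts.path)
    if not directory.endswith('/'):
        directory += '/'
    return base, urlunsplit((parts.scheme, parts.netloc, directory, '', ''))

print(base_and_dir('http://example.com/random/folder/path.html'))
# ('http://example.com/', 'http://example.com/random/folder/')
```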
If this is the extent of your URL parsing, Python's inbuilt rpartition will do the job:
>>> URL = "http://example.com/random/folder/path.html"
>>> Segments = URL.rpartition('/')
>>> Segments[0]
'http://example.com/random/folder'
>>> Segments[2]
'path.html'
From Pydoc, str.rpartition:
Splits the string at the last occurrence of sep, and returns a 3-tuple containing the part before the separator, the separator itself, and the part after the separator. If the separator is not found, return a 3-tuple containing two empty strings, followed by the string itself
What this means is that rpartition does the searching for you, and splits the string at the last (right most) occurrence of the character you specify (in this case / ). It returns a tuple containing:
(everything to the left of char , the character itself , everything to the right of char)
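One caveat worth seeing in action: rpartition knows nothing about URLs, so when the URL has no path it splits inside the "//" after the scheme. A quick sketch (the variable names are made up):

```python
full = "http://example.com/random/folder/path.html".rpartition('/')
bare = "http://example.com".rpartition('/')
none = "path.html".rpartition('/')

print(full)  # ('http://example.com/random/folder', '/', 'path.html')
print(bare)  # ('http:/', '/', 'example.com')
print(none)  # ('', '', 'path.html')
```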
I have no experience with Python, but I found the urlparse module, which should do the job.
In Python, a lot of operations are done using lists. The urlparse module mentioned by Sebasian Dietz may well solve your specific problem, but if you're generally interested in Pythonic ways to find slashes in strings, for example, try something like this:
url = 'http://example.com/random/folder/path.html'
# Create a list of each bit between slashes
slashparts = url.split('/')
# Now join back the first three sections 'http:', '' and 'example.com'
basename = '/'.join(slashparts[:3]) + '/'
# All except the last one
dirname = '/'.join(slashparts[:-1]) + '/'
print('slashparts = %s' % slashparts)
print('basename = %s' % basename)
print('dirname = %s' % dirname)
The output of this program is this:
slashparts = ['http:', '', 'example.com', 'random', 'folder', 'path.html']
basename = http://example.com/
dirname = http://example.com/random/folder/
The interesting bits are split, join, the slice notation array[A:B] (including negatives for offsets-from-the-end) and, as a bonus, the % operator on strings to give printf-style formatting.
It seems like the posixpath module mentioned in sykora's answer is not available in my Python setup (Python 2.7.3).
As per this article, it seems that the "proper" way to do this would be using...
urlparse.urlparse and urlparse.urlunparse can be used to detach and reattach the base of the URL
The functions of os.path can be used to manipulate the path
urllib.url2pathname and urllib.pathname2url (to make path name manipulation portable, so it can work on Windows and the like)
So for example (not including reattaching the base URL)...
>>> import urlparse, urllib, os.path
>>> os.path.dirname(urllib.url2pathname(urlparse.urlparse("http://example.com/random/folder/path.html").path))
'/random/folder'
You can use Python's library furl:
import furl
f = furl.furl("http://example.com/random/folder/path.html")
print(str(f.path))  # '/random/folder/path.html'
print(str(f.path).split("/"))  # ['', 'random', 'folder', 'path.html']
To access the word after the first "/", use:
str(f.path).split("/")[1]  # 'random'
