How to get a basename of .tar.gz file in python?


I would like to get the basename of a .tar.gz file in Python.
So from "foo/bar/alice.tar.gz" I want "alice".
What I have so far is:
from pathlib import Path
url = "foo/bar/alice.tar.gz"
name = Path(Path(url).stem).stem
print(name)
# alice
Is there a smoother way to do so? What if my url is something like "foo/bar/alice.tar.gz.tar.gz.tar.gz"?
Thanks in advance.

I think this will do the job:
>>> import os
>>> base = os.path.basename("foo/bar/alice.tar.gz")
>>> base
'alice.tar.gz'
>>> base.split('.')
['alice', 'tar', 'gz']
>>> base.split('.')[0]
'alice'

You can use str.partition():
result = Path(url).stem.partition('.')[0]
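Cutting at the first dot also truncates names that contain dots of their own, such as my.data.v2.tar.gz. A minimal sketch that instead peels off only known archive suffixes, however many times they repeat, might look like this. The helper name archive_basename and the ARCHIVE_SUFFIXES tuple are assumptions made for illustration, not part of any library:

```python
from pathlib import PurePosixPath

# Assumption: the suffixes that count as "archive" extensions for your files.
ARCHIVE_SUFFIXES = (".tar", ".gz")


def archive_basename(path):
    """Return the final path component with trailing archive suffixes removed."""
    name = PurePosixPath(path).name
    while True:
        for suffix in ARCHIVE_SUFFIXES:
            # Never strip the whole name (e.g. a file literally called ".gz").
            if name.endswith(suffix) and len(name) > len(suffix):
                name = name[: -len(suffix)]
                break
        else:
            # No known suffix left at the end of the name.
            return name


print(archive_basename("foo/bar/alice.tar.gz"))                # alice
print(archive_basename("foo/bar/alice.tar.gz.tar.gz.tar.gz"))  # alice
print(archive_basename("my.data.v2.tar.gz"))                   # my.data.v2
```

If your names never contain extra dots, the shorter partition('.') version is probably all you need.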

Related

Concise way to split a path so that it includes the filename and two directories up in Python?

What is the most concise way to split a path so that it includes the filename and two directories up in Python?
>>> import os
>>> path = r'/absolute/path/to/file.txt'
>>> os.path.dirname(path)
gives:
/absolute/path/to
while:
>>> from pathlib import Path
>>> path = r'/absolute/path/to/file.txt'
>>> Path(path).parents[1]
gives:
/absolute/path
What would be the most concise strategy to give me:
to/file.txt
?

>>> import os, pathlib
>>> os.path.join(*pathlib.Path(path).parts[-2:])
'to/file.txt'

This is one way.
path = r'/absolute/path/to/file.txt'
res = '/'.join(path.split('/')[-2:])
print(res)
# to/file.txt
A less concise but more portable alternative, which normalizes the path and splits on the platform's own separator:
import os
res = os.path.join(*os.path.normpath(path).split(os.sep)[-2:])
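The same idea can be wrapped in a small function using pathlib. This is only a sketch: last_parts is a hypothetical helper name, and PurePosixPath is chosen on the assumption that the input always uses forward slashes, as in the question:

```python
from pathlib import PurePosixPath


def last_parts(path, n=2):
    """Return the last n components of a POSIX-style path, joined with '/'."""
    parts = PurePosixPath(path).parts
    return str(PurePosixPath(*parts[-n:]))


print(last_parts("/absolute/path/to/file.txt"))       # to/file.txt
print(last_parts("/absolute/path/to/file.txt", n=3))  # path/to/file.txt
```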

take urls from lines of a file in python

This is a line of a file, and I want to extract only the URL after the word uri and the URL after smallPictureUrl so I can use them later, but I cannot find a proper way to do it.
The asterisks stand for text, numbers, or both. They differ in every line, so they follow no pattern I could take advantage of.
{"bigPictureUrl":"https://fbcdn-profile-a.akamaihd.net/*-*-*/*.*.*.*/*/*.jpg",
"timelineCoverPhoto":"{\"focus\":{\"x\":0.5,\"y\":0.49137931034483},\"photo\":{\"__type__
\":{\"name\":\"Photo\"},\"image_lowres\":{\"uri\":\"https://fbcdn-*-*-*.*.*/*-*-*/*.jpg
\",\"width\":180,\"height\":135}}}",
"subscribeStatus":"IS_SUBSCRIBED","smallPictureUrl":"https://fbcdn-profile-a.akamaihd.net/*-*-*/*.*.*.*/*/*.jpg",
In something simpler, like:
{"displayName":"Jim Test","firstName":"*","lastName":"*"}
I managed to extract the name (for example, Jim Test) after displayName using re.search('(?<="displayName":")(\w+) (\w+)',line), but the other fields are much more complicated. Any direction or advice would be appreciated.
A full line looks exactly like this:
{"bigPictureUrl":"https://fbcdn-profile-a.akamaihd.net/hprofile-ak-prn2/*.*.*.*/s200x200/*_*_*_*.jpg","timelineCoverPhoto":"{\"focus\":{\"x\":0.5,\"y\":0.40652557319224},\"photo\":{\"__type__\":{\"name\":\"Photo\"},\"image_lowres\":{\"uri\":\"https://fbcdn-photos-h-a.akamaihd.net/hphotos-ak-prn2/*_*_*_a.jpg\",\"width\":180,\"height\":120}}}","subscribeStatus":"IS_SUBSCRIBED","smallPictureUrl":"https://fbcdn-profile-a.akamaihd.net/hprofile-ak-prn2/*.*.*.*/s100x100/*_*_*_a.jpg","contactId":"**==","contactType":"USER","friendshipStatus":"ARE_FRIENDS","graphApiWriteId":"contact_*:*:*","hugePictureUrl":"https://fbcdn-profile-a.akamaihd.net/hprofile-ak-prn2/*.*.*.*/s720x720/*_*_*_*.jpg","profileFbid":"*","isMobilePushable":"NO","lookupKey":null,"name":{"displayName":"* *","firstName":"*","lastName":"*"},"nameSearchTokens":["*","*"],"phones":[],"phoneticName":{"displayName":null,"firstName":null,"lastName":null},"isMemorialized":false,"communicationRank":0.4183731,"canViewerSendGift":false,"canMessage":true}

The value associated with timelineCoverPhoto seems to be stringified JSON, so you could do something admittedly ugly like this:
import json
s = {
"subscribeStatus": "IS_SUBSCRIBED",
"bigPictureUrl": "https://fbcdn-profile-a.akamaihd.net/*-*-*/*.*.*.*/*/*.jpg",
"timelineCoverPhoto": "{\"focus\":{\"x\":0.5,\"y\":0.49137931034483},\"photo\":{\"__type__\":{\"name\":\"Photo\"},\"image_lowres\":{\"uri\":\"https://fbcdn-*-*-*.*.*/*-*-*/*.jpg \",\"width\":180,\"height\":135}}}",
"smallPictureUrl": "https://fbcdn-profile-a.akamaihd.net/*-*-*/*.*.*.*/*/*.jpg"
}
j = json.loads(s['timelineCoverPhoto'])
print("uri:", j['photo']['image_lowres']['uri'])
uri: https://fbcdn-*-*-*.*.*/*-*-*/*.jpg
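When each line of the file is itself a JSON object, the same trick can be applied twice: once for the line and once for the stringified timelineCoverPhoto value. The sketch below assumes every line is valid JSON with the keys shown in the question. extract_urls is a hypothetical helper name, and the sample line is built with json.dumps from made-up example.com URLs purely so the block runs on its own:

```python
import json


def extract_urls(line):
    """Return (cover uri, smallPictureUrl) from one JSON-encoded line."""
    record = json.loads(line)
    # timelineCoverPhoto holds JSON that was serialized into a string.
    cover = json.loads(record["timelineCoverPhoto"])
    uri = cover["photo"]["image_lowres"]["uri"].strip()
    return uri, record["smallPictureUrl"]


# Made-up sample data with the same nesting as the question's lines.
sample_cover = {"photo": {"image_lowres": {"uri": "https://example.com/cover.jpg", "width": 180}}}
sample_line = json.dumps({
    "timelineCoverPhoto": json.dumps(sample_cover),
    "smallPictureUrl": "https://example.com/small.jpg",
})

print(extract_urls(sample_line))
# ('https://example.com/cover.jpg', 'https://example.com/small.jpg')
```

To process a whole file you would call extract_urls on each line in a loop.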

#See: http://daringfireball.net/2010/07/improved_regex_for_matching_urls
import re, urllib.request
GRUBER_URLINTEXT_PAT = re.compile(r'(?i)\b((?:https?://|www\d{0,3}[.]|[a-z0-9.\-]+[.][a-z]{2,4}/)(?:[^\s()<>]+|\(([^\s()<>]+|(\([^\s()<>]+\)))*\))+(?:\(([^\s()<>]+|(\([^\s()<>]+\)))*\)|[^\s`!()\[\]{};:\'".,<>?\xab\xbb\u201c\u201d\u2018\u2019]))')
for line in urllib.request.urlopen("http://daringfireball.net/misc/2010/07/url-matching-regex-test-data.text"):
    # urlopen yields bytes in Python 3, so decode each line before matching
    print([mgroups[0] for mgroups in GRUBER_URLINTEXT_PAT.findall(line.decode('utf-8'))])

If you are not okay with using json, how about this?
>>> import re
>>> print(mytext)
{"bigPictureUrl":"https://fbcdn-profile-a.akamaihd.net/*-*-*/*.*.*.*/*/*.jpg",
"timelineCoverPhoto":"{"focus":{"x":0.5,"y":0.49137931034483},"photo":{"__type__
":{"name":"Photo"},"image_lowres":{"uri":"https://fbcdn-*-*-*.*.*/*-*-*/*.jpg
","width":180,"height":135}}}",
"subscribeStatus":"IS_SUBSCRIBED","smallPictureUrl":"https://fbcdn-profile-a.akamaihd.net/*-*-*/*.*.*.*/*/*.jpg",
>>> uri = re.findall(r'uri\"\:\"[\'"]?([^\'" >]+)', mytext) #gets the uri
>>> smallpicurl = re.findall(r'smallPictureUrl\"\:\"[\'"]?([^\'" >]+)', mytext) # gets the smallPictureUrl
>>> ''.join(uri).rstrip()
'https://fbcdn-*-*-*.*.*/*-*-*/*.jpg' # uri
>>> ''.join(smallpicurl).rstrip()
'https://fbcdn-profile-a.akamaihd.net/*-*-*/*.*.*.*/*/*.jpg' # smallPictureUrl

Python split url to find image name and extension

I am looking for a way to extract the filename and extension from a particular URL using Python.
Let's say a URL looks as follows:
picture_page = "http://distilleryimage2.instagram.com/da4ca3509a7b11e19e4a12313813ffc0_7.jpg"
How would I go about getting the following?
filename = "da4ca3509a7b11e19e4a12313813ffc0_7"
file_ext = ".jpg"

try:
    # Python 3
    from urllib.parse import urlparse
except ImportError:
    # Python 2
    from urlparse import urlparse
from os.path import splitext, basename
picture_page = "http://distilleryimage2.instagram.com/da4ca3509a7b11e19e4a12313813ffc0_7.jpg"
disassembled = urlparse(picture_page)
filename, file_ext = splitext(basename(disassembled.path))
Because basename drops everything up to and including the last /, filename holds just the name, with no leading slash to remove yourself.
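Parsing the URL first matters most when it carries a query string or fragment, since those would otherwise end up in the extension. A minimal Python 3 sketch of the approach above wrapped in a function follows. split_url_filename is a hypothetical name and the example.com URL is made up for illustration:

```python
from os.path import basename, splitext
from urllib.parse import urlparse


def split_url_filename(url):
    """Return (filename, extension) for the last path segment of a URL."""
    # urlparse(...).path excludes the query string and the fragment.
    path = urlparse(url).path
    return splitext(basename(path))


print(split_url_filename(
    "http://distilleryimage2.instagram.com/da4ca3509a7b11e19e4a12313813ffc0_7.jpg"
))  # ('da4ca3509a7b11e19e4a12313813ffc0_7', '.jpg')
print(split_url_filename("http://example.com/a/b/pic.jpg?size=large#top"))
# ('pic', '.jpg')
```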

Try urlsplit (urllib.parse.urlsplit in Python 3, urlparse.urlsplit in Python 2) to split the URL, then os.path.splitext to retrieve the filename and extension (use os.path.basename to keep only the last path component):
from urllib.parse import urlsplit
import os.path
picture_page = "http://distilleryimage2.instagram.com/da4ca3509a7b11e19e4a12313813ffc0_7.jpg"
print(os.path.splitext(os.path.basename(urlsplit(picture_page).path)))
# ('da4ca3509a7b11e19e4a12313813ffc0_7', '.jpg')

filename = picture_page.split('/')[-1].split('.')[0]
file_ext = '.'+picture_page.split('.')[-1]

# Here's your link:
picture_page = "http://distilleryimage2.instagram.com/da4ca3509a7b11e19e4a12313813ffc0_7.jpg"
# Here's your filename and ext:
filename, ext = picture_page.split('/')[-1].split('.')
When you call picture_page.split('/'), it returns a list of the strings in your URL that were separated by /.
Indexing a Python list with -1 gives you the last element, i.e. the first element counting from the end.
In your case, that is the filename: da4ca3509a7b11e19e4a12313813ffc0_7.jpg
Splitting that on the delimiter ., you get two values:
da4ca3509a7b11e19e4a12313813ffc0_7 and jpg, as expected, because they are separated by the period you passed as the delimiter in your split() call.
Since this last split returns exactly two values (the filename contains a single period), you can unpack the resulting list into two names.
So the result is equivalent to:
filename, ext = ('da4ca3509a7b11e19e4a12313813ffc0_7', 'jpg')
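The two-value unpacking raises a ValueError as soon as the filename contains no period or more than one. A sketch that splits only on the last period, using str.rpartition, avoids that. split_name_ext is a hypothetical helper name, and the example.com URLs are made up for illustration:

```python
def split_name_ext(url):
    """Split the last URL segment into (name, ext) at its final period."""
    last_segment = url.rsplit("/", 1)[-1]
    name, dot, ext = last_segment.rpartition(".")
    if not dot:
        # No period at all: the whole segment is the name.
        return last_segment, ""
    return name, ext


print(split_name_ext(
    "http://distilleryimage2.instagram.com/da4ca3509a7b11e19e4a12313813ffc0_7.jpg"
))  # ('da4ca3509a7b11e19e4a12313813ffc0_7', 'jpg')
print(split_name_ext("http://example.com/files/archive.tar.gz"))  # ('archive.tar', 'gz')
print(split_name_ext("http://example.com/files/README"))          # ('README', '')
```

Like the answer above, this returns the extension without its leading period and does not strip query strings.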

os.path.splitext will help you extract the filename and extension once you have extracted the relevant string from the URL using urlparse:
fName, ext = os.path.splitext('yourImage.jpg')

One way to find the image name and extension is with a regular expression.
import re
picture_page = "http://distilleryimage2.instagram.com/da4ca3509a7b11e19e4a12313813ffc0_7.jpg"
regex = re.compile(r'(.*/(?P<name>\w+)\.(?P<ext>\w+))')
match = regex.search(picture_page)
print(match.group('name'))
print(match.group('ext'))

>>> import re
>>> s = 'picture_page = "http://distilleryimage2.instagram.com/da4ca3509a7b11e19e4a12313813ffc0_7.jpg"'
>>> re.findall(r'\/([a-zA-Z0-9_]*)\.[a-zA-Z]*\"$',s)[0]
'da4ca3509a7b11e19e4a12313813ffc0_7'
>>> re.findall(r'([a-zA-Z]*)\"$',s)[0]
'jpg'

How to join absolute and relative urls?

I have two urls:
url1 = "http://127.0.0.1/test1/test2/test3/test5.xml"
url2 = "../../test4/test6.xml"
How can I get an absolute url for url2?

You should use urlparse.urljoin:
>>> import urlparse
>>> urlparse.urljoin(url1, url2)
'http://127.0.0.1/test1/test4/test6.xml'
With Python 3 (where urlparse is renamed to urllib.parse) you could use it as follows:
>>> import urllib.parse
>>> urllib.parse.urljoin(url1, url2)
'http://127.0.0.1/test1/test4/test6.xml'

If your relative path consists of multiple parts, you have to join them separately, since urljoin would replace the relative path, not join it. The easiest way to do that is to use posixpath.
>>> import urllib.parse
>>> import posixpath
>>> url1 = "http://127.0.0.1"
>>> url2 = "test1"
>>> url3 = "test2"
>>> url4 = "test3"
>>> url5 = "test5.xml"
>>> url_path = posixpath.join(url2, url3, url4, url5)
>>> urllib.parse.urljoin(url1, url_path)
'http://127.0.0.1/test1/test2/test3/test5.xml'
See also: How to join components of a path when you are constructing a URL in Python

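from urllib.parse import urljoin
es = ['http://127.0.0.1', 'test1', 'test4', 'test6.xml']
base = ''
for e in es:
    # A trailing "/" on base makes urljoin append e instead of replacing the last segment.
    base = urljoin(base + '/' if base else base, e)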

You can use reduce to achieve Shikhar's method in a cleaner fashion.
>>> import urllib.parse
>>> from functools import reduce
>>> reduce(urllib.parse.urljoin, ["http://moc.com/", "path1/", "path2/", "path3/"])
'http://moc.com/path1/path2/path3/'
Note that with this method each fragment should have a trailing forward slash and no leading forward slash, to indicate that it is a path fragment being joined.
This is more correct and informative: it tells you that path1/ is a URI path fragment, not the full path (e.g. /path1/) or an unknown (e.g. path1). An unknown could be either, but it is handled as a full path.
If you need to add / to a fragment lacking it, you could do:
uri = uri if uri.endswith("/") else f"{uri}/"
To learn more about URI resolution, Wikipedia has some nice examples.
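The slash rules described above can be seen directly with a few calls, and the normalization can be folded into the reduce step. This is a sketch only: join_url_parts is a hypothetical helper, and it assumes every part is meant as a relative path segment, so leading slashes are stripped rather than treated as absolute paths:

```python
from functools import reduce
from urllib.parse import urljoin

# Trailing slash on the base: the fragment is appended.
print(urljoin("http://moc.com/path1/", "path2/"))   # http://moc.com/path1/path2/
# No trailing slash: the last segment of the base is replaced.
print(urljoin("http://moc.com/path1", "path2/"))    # http://moc.com/path2/
# Leading slash on the fragment: it is resolved from the site root.
print(urljoin("http://moc.com/path1/", "/path2/"))  # http://moc.com/path2/


def join_url_parts(base, *parts):
    """Append each part to base as a relative path segment."""
    def step(acc, part):
        if not acc.endswith("/"):
            acc += "/"
        return urljoin(acc, part.lstrip("/"))
    return reduce(step, parts, base)


print(join_url_parts("http://moc.com", "path1", "/path2/", "file.xml"))
# http://moc.com/path1/path2/file.xml
```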
Updates
Just noticed Peter Perron commented about reduce on Shikhar's answer, but I'll leave this here then to demonstrate how that's done.
Updated wikipedia URL

For python 3.0+ the correct way to join urls is:
from urllib.parse import urljoin
urljoin('https://10.66.0.200/', '/api/org')
# output : 'https://10.66.0.200/api/org'

>>> from urlparse import urljoin
>>> url1 = "http://www.youtube.com/user/khanacademy"
>>> url2 = "/user/khanacademy"
>>> urljoin(url1, url2)
'http://www.youtube.com/user/khanacademy'
Simple.

Python: Regex help

str = r"a\b\c\dsdf\matchthis\erwe.txt"
I want a regex that matches the last folder name, "matchthis".

Without using regex, just do:
>>> import os
>>> my_str = "a/b/c/dsdf/matchthis/erwe.txt"
>>> my_dir_path = os.path.dirname(my_str)
>>> my_dir_path
'a/b/c/dsdf/matchthis'
>>> my_dir_name = os.path.basename(my_dir_path)
>>> my_dir_name
'matchthis'

It is better to use os.path.split(path), since it handles the platform's own separator for you (so the backslash example below only splits as intended on Windows). You'll have to call it twice to get the final directory:
import os
path_file = r"a\b\c\dsdf\matchthis\erwe.txt"
path, file_name = os.path.split(path_file)
path, dir_name = os.path.split(path)
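If the backslash-separated string has to be parsed on a non-Windows machine too, pathlib.PureWindowsPath applies Windows path rules regardless of the platform the code runs on. A minimal sketch, where parent_folder_name is a hypothetical helper name:

```python
from pathlib import PureWindowsPath


def parent_folder_name(path):
    """Return the name of the folder that directly contains the file."""
    # PureWindowsPath accepts both "\\" and "/" as separators on any OS.
    return PureWindowsPath(path).parent.name


print(parent_folder_name(r"a\b\c\dsdf\matchthis\erwe.txt"))  # matchthis
print(parent_folder_name("a/b/c/dsdf/matchthis/erwe.txt"))   # matchthis
```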

>>> str = "a\\b\\c\\dsdf\\matchthis\\erwe.txt"
>>> str.split("\\")[-2]
'matchthis'

x = r"a\b\c\d\match\something.txt"
match = x.split('\\')[-2]

>>> import re
>>> print(re.match(r".*\\(.*)\\[^\\]*", r"a\b\c\dsdf\matchthis\erwe.txt").groups())
('matchthis',)
As @chrisaycock and @rafe-kettler pointed out, use x.split('\\') if you can. It is faster, more readable, and more Pythonic. Only reach for a regex if you really need one.
EDIT:
Actually, os.path is best, since it follows the conventions of whichever platform (Unix, Windows, etc.) the code runs on.
