I am trying to build a URL by joining some dynamic components. I thought of using something like os.path.join(), but for URLs. From my research, urlparse.urljoin() does something similar. However, it looks like it only takes two arguments at a time.
I have the following so far, which works but looks repetitive:
a = urlparse.urljoin(environment, schedule_uri)
b = urlparse.urljoin(a, str(events_to_hours))
c = urlparse.urljoin(b, str(events_from_date))
d = urlparse.urljoin(c, str(api_version))
e = urlparse.urljoin(d, str(id))
url = e + '.json'
Output = http://example.com/schedule/12/20160322/v1/1.json
The above works and I tried to make it shorter this way:
url_join_items = [environment, schedule_uri, str(events_to_hours),
str(events_from_date), str(api_version), str(id), ".json"]
new_url = ""
for url_items in url_join_items:
    new_url = urlparse.urljoin(new_url, url_items)
Output: http://example.com/schedule/.json
But the second implementation does not work. Please suggest how to fix this, or a better way of doing it.
EDIT 1:
The output from the reduce solution looks like this (unfortunately):
Output: http://example.com/schedule/.json
Using join
Have you tried simply "/".join(url_join_items)? HTTP URLs always use the forward slash as the path separator. You might have to manually set up the prefix "https://" and the suffix, though.
Something like:
url = "https://{}.json".format("/".join(url_join_items))
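A minimal sketch of the plain-join idea, assuming the first part already carries the scheme and host. The helper name build_url and its suffix parameter are made up for illustration; the values are the ones from the question:

```python
def build_url(base, *parts, suffix=""):
    """Hypothetical helper: join URL pieces with exactly one slash between them."""
    # Strip slashes at the seams so parts may or may not carry their own.
    pieces = [base.rstrip("/")] + [str(part).strip("/") for part in parts]
    return "/".join(pieces) + suffix


url = build_url("http://example.com", "schedule", 12, 20160322, "v1", 1, suffix=".json")
print(url)  # http://example.com/schedule/12/20160322/v1/1.json
```

Converting each part with str() inside the helper avoids the repeated str(...) calls at the call site.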
Using reduce and urljoin
Here is a related question on SO that explains to some degree the thinking behind the implementation of urljoin. Your use case does not appear to be the best fit.
When using reduce and urljoin, I'm not sure it will do what the question intends, which is semantically like os.path.join, but for urls. Consider the following:
from urllib.parse import urljoin
from functools import reduce
parts_1 = ["a","b","c","d"]
parts_2 = ["https://","server.com","somedir","somefile.json"]
parts_3 = ["https://","server.com/","somedir/","somefile.json"]
out1 = reduce(urljoin, parts_1)
print(out1)
d
out2 = reduce(urljoin, parts_2)
print(out2)
https:///somefile.json
out3 = reduce(urljoin, parts_3)
print(out3)
https:///server.com/somedir/somefile.json
Note that with the exception of the extra "/" after the https prefix, the third output is probably closest to what the asker intends, except we've had to do all the work of formatting the parts with the separator.
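If you still want to build on urljoin, one possible sketch is to give the base a trailing slash at every step, so urljoin appends instead of replacing the last path segment. The helper name join_url is hypothetical, and the sketch assumes every segment is a plain relative path piece (no scheme, no "..", no leading "//"):

```python
from functools import reduce
from urllib.parse import urljoin


def join_url(base, *segments):
    """Hypothetical helper: fold urljoin over the segments, os.path.join style."""
    def step(url, segment):
        # A base ending in "/" makes urljoin append the segment;
        # without it, urljoin replaces the last path segment of the base.
        return urljoin(url.rstrip("/") + "/", str(segment).lstrip("/"))

    return reduce(step, segments, base)


url = join_url("http://example.com", "schedule", 12, 20160322, "v1", "1.json")
print(url)  # http://example.com/schedule/12/20160322/v1/1.json
```

This also shows why the loop in the question ends up with http://example.com/schedule/.json: each join against a base without a trailing slash overwrites the previous segment.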
I also needed something similar and came up with this solution:
from urllib.parse import urljoin, quote_plus
def multi_urljoin(*parts):
    return urljoin(parts[0], "/".join(quote_plus(part.strip("/"), safe="/") for part in parts[1:]))
print(multi_urljoin("https://server.com", "path/to/some/dir/", "2019", "4", "17", "some_random_string", "image.jpg"))
This prints 'https://server.com/path/to/some/dir/2019/4/17/some_random_string/image.jpg'
Here's a slightly silly but workable solution, given that parts is a list of URL parts in order:
my_url = '/'.join(parts).replace('//', '/').replace(':/', '://')
I wish replace had a "from" option, but it does not, hence the second replace, which restores the double slash in https://.
The nice thing is that you mostly don't have to worry about parts already having (or not having) slashes. Note that replace('//', '/') makes a single pass, so a run of three or more slashes is shortened but not fully collapsed.
A simple solution would be:
import re
def url_join(*parts: str) -> str:
    line = '/'.join(parts)
    line = re.sub('/{2,}', '/', line)
    return re.sub(':/', '://', line)
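As a possible variant of the regex approach, a negative lookbehind can leave the slashes after the scheme alone, so a single substitution is enough. The function name url_join_single_pass is made up for this sketch, and it assumes the only colon followed by slashes is the one in the scheme:

```python
import re


def url_join_single_pass(*parts: str) -> str:
    """Hypothetical variant: join parts and collapse repeated slashes in one pass."""
    line = "/".join(parts)
    # Collapse any run of two or more slashes, unless the run starts right
    # after a colon, which keeps the "//" of "https://" intact.
    return re.sub(r"(?<!:)/{2,}", "/", line)


print(url_join_single_pass("https://server.com/", "/somedir/", "somefile.json"))
# https://server.com/somedir/somefile.json
```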
I'm confused when trying to replace specific text in Python.
My code is:
Image = "/home/user/Picture/Image-1.jpg"
Image2 = Image.replace("-1", "_s", 1)
print(Image)
print(Image2)
Output:
/home/user/Picture/Image-1.jpg
/home/user/Picture/Image_s.jpg
The output I want from Image2 is:
/home/user/Picture/Image-1_s.jpg
You are replacing the -1 with _s.
If you want to keep the -1 as well, you can just include it in the replacement:
Image = "/home/user/Picture/Image-1.jpg"
Image2 = Image.replace("-1", "-1_s", 1)
print(Image)
print(Image2)
Output
/home/user/Picture/Image-1.jpg
/home/user/Picture/Image-1_s.jpg
If the digits can be variable, you can also use a pattern with for example 2 capture groups, and then use those capture groups in the replacement with _s in between
import re
pattern = r"(/home/user/Picture/Image-\d+)(\.jpg)\b"
s = "/home/user/Picture/Image-1.jpg\n/home/user/Picture/Image-2.jpg"
print(re.sub(pattern, r"\1_s\2", s))
Output
/home/user/Picture/Image-1_s.jpg
/home/user/Picture/Image-2_s.jpg
Or for example only taking the /Image- into account and then use the full match in the replacement instead of using capture groups:
import re
pattern = r"/Image-\d+(?=\.jpg)\b"
s = "/home/user/Picture/Image-1.jpg\n/home/user/Picture/Image-2.jpg"
print(re.sub(pattern, r"\g<0>_s", s))
Output
/home/user/Picture/Image-1_s.jpg
/home/user/Picture/Image-2_s.jpg
The code you wrote behaves as I would expect from reading it. How to change it to do what you want is a different matter. You don't necessarily need replace here; instead, consider appending, since what you are really after is adding something to the end of the path, just before the extension.
We can make the code a little more generic by letting it append anything to the end of the file name. The steps are (for other readers: yes, there are more foolproof ways to do this; this sticks to a simple example):
split the string at . so that you end up with a list containing:
["/home/user/Picture/Image-1", "jpg"]
Append what you need to the first element, so you end up with:
"/home/user/Picture/Image-1_s"
Use join to rebuild the string, with . as the separator:
".".join(["/home/user/Picture/Image-1_s", "jpg"])
You will finally get:
/home/user/Picture/Image-1_s.jpg
Coding the above, we can have it work as follows:
>>> Image1 = "/home/user/Picture/Image-1.jpg"
>>> img_split = Image1.split(".")
>>> img_split
['/home/user/Picture/Image-1', 'jpg']
>>> img_split[0] = img_split[0] + "_s"
>>> img_split
['/home/user/Picture/Image-1_s', 'jpg']
>>> final_path = ".".join(img_split)
>>> final_path
'/home/user/Picture/Image-1_s.jpg'
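One of the more foolproof ways hinted at above is os.path.splitext, which splits only at the last dot of the file name, so dots elsewhere in the path are left alone. This is a small sketch; the helper name add_suffix is made up for illustration:

```python
import os.path


def add_suffix(path, suffix):
    """Hypothetical helper: insert `suffix` just before the file extension."""
    # splitext returns (root, ext), where ext is the last ".xxx" of the file name.
    root, ext = os.path.splitext(path)
    return root + suffix + ext


print(add_suffix("/home/user/Picture/Image-1.jpg", "_s"))
# /home/user/Picture/Image-1_s.jpg
print(add_suffix("/home/user.name/Picture/Image-1.jpg", "_s"))
# /home/user.name/Picture/Image-1_s.jpg
```

With split("."), the second path would be cut at user.name as well, and the suffix would end up in the wrong place.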
A more idiomatic approach is to use Python's pathlib module:
from pathlib import Path
Image1 = "/home/user/Picture/Image-1.jpg"
p = Path(Image1)
# you have access to all the parts you need. Like the path to the file:
p.parent # outputs PosixPath('/home/user/Picture')
# The name of the file without extension
p.stem # outputs 'Image-1'
# The extension of the file
p.suffix # outputs '.jpg'
# Finally, rename the file on disk using the rename method (the file must exist)!
p.rename(p.parent / f"{p.stem}_s{p.suffix}")
# On Python 3.8+, rename returns the new path:
# PosixPath('/home/user/Picture/Image-1_s.jpg')
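If you only need the new path string and do not want to touch the file system, a sketch using with_name computes it without renaming anything. PurePosixPath is used here only so the example behaves the same on any platform:

```python
from pathlib import PurePosixPath

p = PurePosixPath("/home/user/Picture/Image-1.jpg")

# with_name swaps the final path component; stem and suffix come from the original.
new_path = p.with_name(f"{p.stem}_s{p.suffix}")

print(new_path)  # /home/user/Picture/Image-1_s.jpg
```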
The replace function replaces "-1" with "_s".
If you want the output to be: /home/user/Picture/Image-1_s.jpg
You should replace "-1" with "-1_s".
Try:
Image = "/home/user/Picture/Image-1.jpg"
Image2 = Image.replace("-1", "-1_s")
print(Image)
print(Image2)
Try this.
I think you should insert the string at a certain position rather than replace:
Image = "/home/user/Picture/Image-1.jpg"
Image2 = Image[:26]+ '_s' + Image[26:]
print(Image2)
The output:
/home/user/Picture/Image-1_s.jpg
I get the desired output with the following code:
row='s3://bucket-name/qwe/2022/02/24/qwe.csv'
new_row = row.split('s3://bucket-name/')[1]
print(new_row)
qwe/2022/02/24/qwe.csv
I want to achieve this while having the bucket name saved in a variable, like this:
bucket_name="bucket-name"
new_row = row.split('s3://'+bucket_name+'/')[1]
This doesn't work (says invalid syntax).
Is there another way I can define this or will I have to use a different function to split?
Oops, you have missed the quotes:
bucket_name='bucket-name'
new_row = row.split('s3://'+bucket_name+'/')[1]
Output
'qwe/2022/02/24/qwe.csv'
You can also do it like this:
row='s3://bucket-name/qwe/2022/02/24/qwe.csv'
bucket_name='bucket-name'
new_row = row.split(f"""s3://{bucket_name}/""")[1]
I don't see any advantage to split when you could just slice the url to get the part you want.
>>> row='s3://bucket-name/qwe/2022/02/24/qwe.csv'
>>> bucket_name = "bucket-name"
>>> row[len("s3://" + bucket_name + "/"):]
'qwe/2022/02/24/qwe.csv'
But since this is a URL, you will have a more robust solution if you parse the URL. You can use the parts to verify that you got the string you want, and it will deal with other issues such as appended query strings.
from urllib.parse import urlsplit
row='s3://bucket-name/qwe/2022/02/24/qwe.csv'
parts = urlsplit(row)
if parts.scheme != "s3":
    raise ValueError("not s3 bucket")
if parts.netloc != "bucket-name":
    raise ValueError("not my bucket")
print(parts.path[1:])
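To tie this back to the question, where the bucket name lives in a variable, the same checks can be wrapped in a small function. This is only a sketch; the name s3_key is made up for illustration:

```python
from urllib.parse import urlsplit


def s3_key(url, bucket_name):
    """Hypothetical helper: return the object key of an s3:// URL in the given bucket."""
    parts = urlsplit(url)
    if parts.scheme != "s3":
        raise ValueError("not s3 bucket")
    if parts.netloc != bucket_name:
        raise ValueError("not my bucket")
    # The path starts with "/", which is not part of the key.
    return parts.path.lstrip("/")


bucket_name = "bucket-name"
print(s3_key("s3://bucket-name/qwe/2022/02/24/qwe.csv", bucket_name))
# qwe/2022/02/24/qwe.csv
```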
I am trying to run a scraper I found online but receive a ValueError: too many values to unpack on this line of code
k, v = piece.split("=")
This line is part of this function
def format_url(url):
    # make sure URLs aren't relative, and strip unnecessary query args
    u = urlparse(url)
    scheme = u.scheme or "https"
    host = u.netloc or "www.amazon.com"
    path = u.path
    if not u.query:
        query = ""
    else:
        query = "?"
        for piece in u.query.split("&"):
            k, v = piece.split("=")
            if k in settings.allowed_params:
                query += "{k}={v}&".format(**locals())
        query = query[:-1]
    return "{scheme}://{host}{path}{query}".format(**locals())
If you have any input it would be appreciated, thank you.
Instead of parsing the URLs yourself, you can use the urlparse.parse_qs function:
>>> from urlparse import urlparse, parse_qs
>>> URL = 'https://someurl.com/with/query_string?i=main&mode=front&sid=12ab&enc=+Hello'
>>> parsed_url = urlparse(URL)
>>> parse_qs(parsed_url.query)
{'i': ['main'], 'enc': [' Hello'], 'mode': ['front'], 'sid': ['12ab']}
This happens because one of the pieces contains two or more '=' characters. In that case split returns a list of three or more elements, which cannot be unpacked into two variables.
You can solve that problem by splitting at most once, by passing an additional maxsplit parameter to the .split(..) call:
k, v = piece.split("=",1)
But we still have no guarantee that there is an '=' in the piece string at all.
We can, however, use the urllib.parse module in Python 3.x (urlparse in Python 2.x):
from urllib.parse import urlparse, parse_qsl
purl = urlparse(url)
quer = parse_qsl(purl.query)
for k,v in quer:
    # ...
    pass
Now that we have decoded the query string as a list of key-value tuples, we can process them separately. I would advise building up the URL with urllib as well.
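A rough sketch of what the whole function could look like when urllib does both the parsing and the building. ALLOWED_PARAMS is an assumed stand-in for settings.allowed_params from the question, and its contents are invented for the example:

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Assumption: stand-in for settings.allowed_params; the real values are not shown.
ALLOWED_PARAMS = {"node", "page"}


def format_url(url, allowed_params=ALLOWED_PARAMS):
    """Make the URL absolute and keep only the allowed query parameters."""
    u = urlsplit(url)
    scheme = u.scheme or "https"
    host = u.netloc or "www.amazon.com"
    # parse_qsl copes with values containing "=" and with pieces that have none.
    pairs = [(k, v) for k, v in parse_qsl(u.query, keep_blank_values=True)
             if k in allowed_params]
    # urlunsplit adds the "?" only when the query string is non-empty.
    return urlunsplit((scheme, host, u.path, urlencode(pairs), ""))


print(format_url("/s?page=2&ref=a=b&node=123"))
# https://www.amazon.com/s?page=2&node=123
```

One difference from the original to keep in mind: urlencode re-encodes the values, so a query string may come back with different percent-escaping than it went in with.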
You haven't shown any basic debugging: what is piece at the problem point? If it has more than a single = in the string, the split operation will return more than 2 values -- hence your error message.
If you want to split on only the first =, then use index to get the location, and grab the slices you need:
pos = piece.index('=')
k = piece[:pos]
v = piece[pos+1:]
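Note that index raises ValueError when the piece contains no = at all. A possible alternative is str.partition, which splits on the first separator and never raises. The helper name split_pair is made up for this sketch:

```python
def split_pair(piece):
    """Hypothetical helper: split 'key=value' on the first '=' only."""
    # partition returns (head, separator, tail); separator and tail are
    # empty strings when "=" does not occur in the piece.
    key, _sep, value = piece.partition("=")
    return key, value


print(split_pair("k=v"))      # ('k', 'v')
print(split_pair("a=b=c"))    # ('a', 'b=c')
print(split_pair("flag"))     # ('flag', '')
```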
This is a line of a file, and I want to extract only the URL after the word uri and the URL after smallPictureUrl to use them later, but I cannot find a proper way.
The asterisks stand for text, numbers, or both. They are different in every line like this one, so they cannot be helpful: they have no pattern to take advantage of.
{"bigPictureUrl":"https://fbcdn-profile-a.akamaihd.net/*-*-*/*.*.*.*/*/*.jpg",
"timelineCoverPhoto":"{\"focus\":{\"x\":0.5,\"y\":0.49137931034483},\"photo\":{\"__type__
\":{\"name\":\"Photo\"},\"image_lowres\":{\"uri\":\"https://fbcdn-*-*-*.*.*/*-*-*/*.jpg
\",\"width\":180,\"height\":135}}}",
"subscribeStatus":"IS_SUBSCRIBED","smallPictureUrl":"https://fbcdn-profile-a.akamaihd.net/*-*-*/*.*.*.*/*/*.jpg",
In something simpler, like:
{"displayName":"Jim Test","firstName":"*","lastName":"*"}
I managed to extract the name, for example Jim Test after displayName, using re.search('(?<="displayName":")(\w+) (\w+)',line), but the other case is very complicated. Any direction or advice would be appreciated.
A line looks exactly like this:
{"bigPictureUrl":"https://fbcdn-profile-a.akamaihd.net/hprofile-ak-prn2/*.*.*.*/s200x200/*_*_*_*.jpg","timelineCoverPhoto":"{\"focus\":{\"x\":0.5,\"y\":0.40652557319224},\"photo\":{\"__type__\":{\"name\":\"Photo\"},\"image_lowres\":{\"uri\":\"https://fbcdn-photos-h-a.akamaihd.net/hphotos-ak-prn2/*_*_*_a.jpg\",\"width\":180,\"height\":120}}}","subscribeStatus":"IS_SUBSCRIBED","smallPictureUrl":"https://fbcdn-profile-a.akamaihd.net/hprofile-ak-prn2/*.*.*.*/s100x100/*_*_*_a.jpg","contactId":"**==","contactType":"USER","friendshipStatus":"ARE_FRIENDS","graphApiWriteId":"contact_*:*:*","hugePictureUrl":"https://fbcdn-profile-a.akamaihd.net/hprofile-ak-prn2/*.*.*.*/s720x720/*_*_*_*.jpg","profileFbid":"*","isMobilePushable":"NO","lookupKey":null,"name":{"displayName":"* *","firstName":"*","lastName":"*"},"nameSearchTokens":["*","*"],"phones":[],"phoneticName":{"displayName":null,"firstName":null,"lastName":null},"isMemorialized":false,"communicationRank":0.4183731,"canViewerSendGift":false,"canMessage":true}
The value associated with timelineCoverPhoto seems to be stringified JSON, so you could do something admittedly ugly like this:
import json
s = {
"subscribeStatus": "IS_SUBSCRIBED",
"bigPictureUrl": "https://fbcdn-profile-a.akamaihd.net/*-*-*/*.*.*.*/*/*.jpg",
"timelineCoverPhoto": "{\"focus\":{\"x\":0.5,\"y\":0.49137931034483},\"photo\":{\"__type__\":{\"name\":\"Photo\"},\"image_lowres\":{\"uri\":\"https://fbcdn-*-*-*.*.*/*-*-*/*.jpg \",\"width\":180,\"height\":135}}}",
"smallPictureUrl": "https://fbcdn-profile-a.akamaihd.net/*-*-*/*.*.*.*/*/*.jpg"
}
j = json.loads(s.get('timelineCoverPhoto'))
print "uri:", j.get('photo').get('image_lowres').get('uri')
uri: https://fbcdn-*-*-*.*.*/*-*-*/*.jpg
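The same idea can be sketched for a whole line of the file, assuming each line is one complete JSON object like the full example in the question. The sample line below is built with placeholder URLs, and the function name extract_urls is made up for illustration:

```python
import json

# Hypothetical sample line with placeholder URLs. As in the question, the
# timelineCoverPhoto value is itself a JSON document stored as a string.
cover = {"photo": {"image_lowres": {"uri": "https://example.com/cover.jpg",
                                    "width": 180, "height": 135}}}
line = json.dumps({
    "bigPictureUrl": "https://example.com/big.jpg",
    "timelineCoverPhoto": json.dumps(cover),
    "smallPictureUrl": "https://example.com/small.jpg",
})


def extract_urls(line):
    """Return (uri, smallPictureUrl) from one JSON line."""
    record = json.loads(line)
    # Second decode: the cover photo is stringified JSON inside the record.
    cover_photo = json.loads(record["timelineCoverPhoto"])
    return cover_photo["photo"]["image_lowres"]["uri"], record["smallPictureUrl"]


uri, small_picture_url = extract_urls(line)
print(uri)                # https://example.com/cover.jpg
print(small_picture_url)  # https://example.com/small.jpg
```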
#See: http://daringfireball.net/2010/07/improved_regex_for_matching_urls
import re, urllib
GRUBER_URLINTEXT_PAT = re.compile(ur'(?i)\b((?:https?://|www\d{0,3}[.]|[a-z0-9.\-]+[.][a-z]{2,4}/)(?:[^\s()<>]+|\(([^\s()<>]+|(\([^\s()<>]+\)))*\))+(?:\(([^\s()<>]+|(\([^\s()<>]+\)))*\)|[^\s`!()\[\]{};:\'".,<>?\xab\xbb\u201c\u201d\u2018\u2019]))')
for line in urllib.urlopen("http://daringfireball.net/misc/2010/07/url-matching-regex-test-data.text"):
    print [ mgroups[0] for mgroups in GRUBER_URLINTEXT_PAT.findall(line) ]
If you are not okay with using json, how about this?
>>> print mytext
{"bigPictureUrl":"https://fbcdn-profile-a.akamaihd.net/*-*-*/*.*.*.*/*/*.jpg",
"timelineCoverPhoto":"{"focus":{"x":0.5,"y":0.49137931034483},"photo":{"__type__
":{"name":"Photo"},"image_lowres":{"uri":"https://fbcdn-*-*-*.*.*/*-*-*/*.jpg
","width":180,"height":135}}}",
"subscribeStatus":"IS_SUBSCRIBED","smallPictureUrl":"https://fbcdn-profile-a.akamaihd.net/*-*-*/*.*.*.*/*/*.jpg",
>>> uri = re.findall(r'uri\"\:\"[\'"]?([^\'" >]+)', mytext) #gets the uri
>>> smallpicurl = re.findall(r'smallPictureUrl\"\:\"[\'"]?([^\'" >]+)', mytext) # gets the smallPictureUrl
>>> ''.join(uri).rstrip()
'https://fbcdn-*-*-*.*.*/*-*-*/*.jpg' # uri
>>> ''.join(smallpicurl).rstrip()
'https://fbcdn-profile-a.akamaihd.net/*-*-*/*.*.*.*/*/*.jpg' # smallPictureUrl
I have a section of a log file that looks like this:
"/log?action=End&env=123&id=8000&cat=baseball"
"/log?action=start&get=3210&rsa=456&key=golf"
I want to parse out each section so the results would look like this:
('/log?action=', 'End', 'env=123', 'id=8000', 'cat=baseball')
('/log?action=', 'start', 'get=3210', 'rsa=456', 'key=golf')
I've looked into regex and matching, but a lot of my logs have different sequences which leads me to believe that it is not possible. Any suggestions?
This is clearly a fragment of a URL, so the best way to parse it is to use URL parsing tools. The stdlib comes with urlparse, which does exactly what you want.
For example:
>>> import urlparse
>>> s = "/log?action=End&env=123&id=8000&cat=baseball"
>>> bits = urlparse.urlparse(s)
>>> variables = urlparse.parse_qs(bits.query)
>>> variables
{'action': ['End'], 'cat': ['baseball'], 'env': ['123'], 'id': ['8000']}
If you really want to get the format you asked for, you can use parse_qsl instead, and then join the key-value pairs back together. I'm not sure why you want the /log to be included in the first query variable, or the first query variable's value to be separate from its variable, but even that is doable if you insist:
>>> variables = urlparse.parse_qsl(s)
>>> result = (variables[0][0] + '=', variables[0][1]) + tuple(
'='.join(kv) for kv in variables[1:])
>>> result
('/log?action=', 'End', 'env=123', 'id=8000', 'cat=baseball')
If you're using Python 3.x, just change the urlparse to urllib.parse, and the rest is exactly the same.
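For reference, a Python 3 sketch of the same parsing, assuming each log line is wrapped in double quotes as in the question. The function name parse_log_line is made up for illustration:

```python
from urllib.parse import parse_qsl, urlsplit


def parse_log_line(line):
    """Hypothetical helper: return (path, [(key, value), ...]) for one log line."""
    # Drop the surrounding quotes, then let urlsplit separate path and query.
    parts = urlsplit(line.strip().strip('"'))
    # parse_qsl keeps the pairs in their original order.
    return parts.path, parse_qsl(parts.query)


path, pairs = parse_log_line('"/log?action=End&env=123&id=8000&cat=baseball"')
print(path)   # /log
print(pairs)  # [('action', 'End'), ('env', '123'), ('id', '8000'), ('cat', 'baseball')]
```

Because the pairs come back as a list, it does not matter that different log lines carry different keys in different orders.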
You can split a couple times:
s = '/log?action=End&env=123&id=8000&cat=baseball'
L = s.split("&")
L[0:1]=L[0].split("=")
Output:
['/log?action', 'End', 'env=123', 'id=8000', 'cat=baseball']
It's a bit hard to say without knowing what the domain of possible inputs is, but here's a guess at what will work for you:
log = "/log?action=End&env=123&id=8000&cat=baseball\n/log?action=start&get=3210&rsa=456&key=golf"
logLines = [line.split("&") for line in log.split('\n')]
logLines = [tuple(line[0].split("=")+line[1:]) for line in logLines]
print logLines
OUTPUT:
[('/log?action', 'End', 'env=123', 'id=8000', 'cat=baseball'),
('/log?action', 'start', 'get=3210', 'rsa=456', 'key=golf')]
This assumes that you don't really need the "=" at the end of the first string.