Splitting a string based on another variable - python

I get the desired output with the following code:
row='s3://bucket-name/qwe/2022/02/24/qwe.csv'
new_row = row.split('s3://bucket-name/')[1]
print(new_row)
qwe/2022/02/24/qwe.csv
I want to achieve this while having the bucket name saved in a variable, like this:
bucket_name="bucket-name"
new_row = row.split('s3://'+bucket_name+'/')[1]
This doesn't work (says invalid syntax).
Is there another way I can define this or will I have to use a different function to split?

Oops, you have missed the quotes:
bucket_name='bucket-name'
new_row = row.split('s3://'+bucket_name+'/')[1]
Output:
'qwe/2022/02/24/qwe.csv'

You can also do like this:
row='s3://bucket-name/qwe/2022/02/24/qwe.csv'
bucket_name='bucket-name'
new_row = row.split(f"""s3://{bucket_name}/""")[1]

I don't see any advantage to split when you could just slice the url to get the part you want.
>>> row='s3://bucket-name/qwe/2022/02/24/qwe.csv'
>>> bucket_name = "bucket-name"
>>> row[len("s3://" + bucket_name + "/"):]
'qwe/2022/02/24/qwe.csv'
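On Python 3.9+, str.removeprefix expresses the same slice directly; a minimal sketch using the same row and bucket_name as above:

```python
row = 's3://bucket-name/qwe/2022/02/24/qwe.csv'
bucket_name = "bucket-name"

# removeprefix returns the string unchanged if the prefix is absent,
# so there is no IndexError to worry about (Python 3.9+)
new_row = row.removeprefix("s3://" + bucket_name + "/")
print(new_row)  # qwe/2022/02/24/qwe.csv
```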
But since this is a URL, you will have a more robust solution if you parse the URL. You can use the parts to verify that you got the string you want, and it will deal with other issues such as appended query strings.
from urllib.parse import urlsplit
row='s3://bucket-name/qwe/2022/02/24/qwe.csv'
parts = urlsplit(row)
if parts.scheme != "s3":
    raise ValueError("not s3 bucket")
if parts.netloc != "bucket-name":
    raise ValueError("not my bucket")
print(parts.path[1:])

How to print after a keyword in Python?

I have the following string in Python:
b'{"personId":"65a83de6-b512-4410-81d2-ada57f18112a","persistedFaceIds":["792b31df-403f-4378-911b-8c06c06be8fa"],"name":"waqas"}'
I want to print the value next to the keyword "name", so that my output is
waqas
Note that waqas can change to any other name, so I want to print whatever name follows the keyword "name", using string operations or a regex.
First you need to decode the string, since it is a bytes object (note the b prefix). Then use ast.literal_eval to build the dictionary, which you can then access by key:
>>> s = b'{"personId":"65a83de6-b512-4410-81d2-ada57f18112a","persistedFaceIds":["792b31df-403f-4378-911b-8c06c06be8fa"],"name":"waqas"}'
>>> import ast
>>> ast.literal_eval(s.decode())['name']
'waqas'
It is likely you should be reading your data into your program in a different manner than you are doing now.
If I assume your data is inside a JSON file, try something like the following, using the built-in json module:
import json
with open(filename) as fp:
    data = json.load(fp)
print(data['name'])
If you want a more algorithmic way to extract the value of name:
s = b'{"personId":"65a83de6-b512-4410-81d2-ada57f18112a",\
"persistedFaceIds":["792b31df-403f-4378-911b-8c06c06be8fa"],\
"name":"waqas"}'
s = s.decode("utf-8")
key = '"name":"'
start = s.find(key) + len(key)
stop = s.find('"', start + 1)
extracted_string = s[start : stop]
print(extracted_string)
output
waqas
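Since the question also asks about regex, here is a minimal sketch that captures the value following "name" (it assumes the value itself contains no quote characters, as in the sample):

```python
import re

s = b'{"personId":"65a83de6-b512-4410-81d2-ada57f18112a","persistedFaceIds":["792b31df-403f-4378-911b-8c06c06be8fa"],"name":"waqas"}'

# capture everything between the quotes that follow "name":
m = re.search(r'"name":"([^"]*)"', s.decode("utf-8"))
if m:
    print(m.group(1))  # waqas
```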
You can convert the string into a dictionary with json.loads()
import json
mystring = b'{"personId":"65a83de6-b512-4410-81d2-ada57f18112a","persistedFaceIds":["792b31df-403f-4378-911b-8c06c06be8fa"],"name":"waqas"}'
mydict = json.loads(mystring)
print(mydict["name"])
# output 'waqas'
First you need to convert the bytes object into a proper JSON string. The b marks a bytes literal, so slicing off the first character will not remove it; decode the bytes instead. Suppose you have a variable x:
import json
x = x.decode("utf-8")
data = json.loads(x)  # convert the JSON string into a dictionary
print(data["name"])

Python ValueError: too many values to unpack for crawler

I am trying to run a scraper I found online but receive a ValueError: too many values to unpack on this line of code
k, v = piece.split("=")
This line is part of this function
def format_url(url):
    # make sure URLs aren't relative, and strip unnecessary query args
    u = urlparse(url)
    scheme = u.scheme or "https"
    host = u.netloc or "www.amazon.com"
    path = u.path
    if not u.query:
        query = ""
    else:
        query = "?"
        for piece in u.query.split("&"):
            k, v = piece.split("=")
            if k in settings.allowed_params:
                query += "{k}={v}&".format(**locals())
        query = query[:-1]
    return "{scheme}://{host}{path}{query}".format(**locals())
If you have any input it would be appreciated, thank you.
Instead of parsing the urls yourself, you can use urlparse.parse_qs function:
>>> from urlparse import urlparse, parse_qs
>>> URL = 'https://someurl.com/with/query_string?i=main&mode=front&sid=12ab&enc=+Hello'
>>> parsed_url = urlparse(URL)
>>> parse_qs(parsed_url.query)
{'i': ['main'], 'enc': [' Hello'], 'mode': ['front'], 'sid': ['12ab']}
This is due to the fact that one of the pieces contains two or more '=' characters. In that case the split returns a list of three or more elements, which cannot be unpacked into two variables.
You can solve that problem by splitting at most once, by adding an additional parameter to the .split(..) call:
k, v = piece.split("=", 1)
But even then we have no guarantee that there is an '=' in the piece string at all.
We can however use the urllib.parse module in python-3.x (urlparse in python-2.x):
from urllib.parse import urlparse, parse_qsl

purl = urlparse(url)
quer = parse_qsl(purl.query)
for k, v in quer:
    # ...
    pass
Now we have decoded the query string into a list of key-value tuples that we can process separately. I would advise building the URL back up with urllib as well.
You haven't shown any basic debugging: what is piece at the problem point? If it has more than a single = in the string, the split operation will return more than 2 values -- hence your error message.
If you want to split on only the first =, then use index to get the location, and grab the slices you need:
pos = piece.index('=')
k = piece[:pos]
v = piece[pos+1:]
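str.partition is another option here: it splits on the first '=' only and always returns a 3-tuple, so there is nothing to unpack incorrectly even when '=' appears more than once. A sketch (with a made-up piece, not the original crawler data):

```python
piece = "url=https://www.amazon.com/dp?x=1"  # a piece containing several '=' signs

# partition splits on the first '=' only and always yields exactly 3 parts
k, _, v = piece.partition("=")
print(k)  # url
print(v)  # https://www.amazon.com/dp?x=1
```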

Python: Join multiple components to build a URL

I am trying to build a URL by joining some dynamic components. I thought of using something like os.path.join(), BUT for URLs, in my case. From research I found that urlparse.urljoin() does the same thing. However, it looks like it only takes two arguments at a time.
I have the following so far which works but looks repetitive:
a = urlparse.urljoin(environment, schedule_uri)
b = urlparse.urljoin(a, str(events_to_hours))
c = urlparse.urljoin(b, str(events_from_date))
d = urlparse.urljoin(c, str(api_version))
e = urlparse.urljoin(d, str(id))
url = e + '.json'
Output = http://example.com/schedule/12/20160322/v1/1.json
The above works and I tried to make it shorter this way:
url_join_items = [environment, schedule_uri, str(events_to_hours),
                  str(events_from_date), str(api_version), str(id), ".json"]
new_url = ""
for url_items in url_join_items:
    new_url = urlparse.urljoin(new_url, url_items)
Output: http://example.com/schedule/.json
But the second implementation does not work. Please suggest me how to fix this or the better way of doing it.
EDIT 1:
The output from the reduce solution looks like this (unfortunately):
Output: http://example.com/schedule/.json
Using join
Have you tried simply "/".join(url_join_items)? HTTP URLs always use the forward slash as the separator. You might have to manually set up the "https://" prefix and the suffix, though.
Something like:
url = "https://{}.json".format("/".join(url_join_items))
Using reduce and urljoin
Here is a related question on SO that explains to some degree the thinking behind the implementation of urljoin. Your use case does not appear to be the best fit.
When using reduce and urljoin, I'm not sure it will do what the question intends, which is semantically like os.path.join, but for urls. Consider the following:
from urllib.parse import urljoin
from functools import reduce
parts_1 = ["a","b","c","d"]
parts_2 = ["https://","server.com","somedir","somefile.json"]
parts_3 = ["https://","server.com/","somedir/","somefile.json"]
out1 = reduce(urljoin, parts_1)
print(out1)
d
out2 = reduce(urljoin, parts_2)
print(out2)
https:///somefile.json
out3 = reduce(urljoin, parts_3)
print(out3)
https:///server.com/somedir/somefile.json
Note that with the exception of the extra "/" after the https prefix, the third output is probably closest to what the asker intends, except we've had to do all the work of formatting the parts with the separator.
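If you do want reduce with urljoin, the parts need normalizing first, since urljoin treats a base that does not end in "/" as a file and replaces its last segment. A sketch, assuming hypothetical parts:

```python
from functools import reduce
from urllib.parse import urljoin

parts = ["https://server.com", "somedir", "somefile.json"]

# every part except the last must end with "/" so that urljoin appends
# to the path instead of replacing the final segment
normalized = [p if p.endswith("/") else p + "/" for p in parts[:-1]] + parts[-1:]
url = reduce(urljoin, normalized)
print(url)  # https://server.com/somedir/somefile.json
```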
I also needed something similar and came up with this solution:
from urllib.parse import urljoin, quote_plus
def multi_urljoin(*parts):
    return urljoin(parts[0], "/".join(quote_plus(part.strip("/"), safe="/") for part in parts[1:]))
print(multi_urljoin("https://server.com", "path/to/some/dir/", "2019", "4", "17", "some_random_string", "image.jpg"))
This prints 'https://server.com/path/to/some/dir/2019/4/17/some_random_string/image.jpg'
Here's a slightly silly but workable solution, given that parts is a list of URL parts in order:
my_url = '/'.join(parts).replace('//', '/').replace(':/', '://')
I wish replace had an option to start from a given position, but it does not, hence the second call to recover the https:// double slash.
Nice thing is you don't have to worry about parts already having (or not having) any slashes
Simple solution will be:
def url_join(*parts: str) -> str:
    import re
    line = '/'.join(parts)
    line = re.sub('/{2,}', '/', line)
    return re.sub(':/', '://', line)

How to fetch a single item out of a long string?

I have a very long string as the output of a function, as follows:
tmp = <"last seen":1568,"reviews [{"id":15869,"author":"abnbvg","changes":........>
How can I fetch "id":15869 out of it?
The string content looks like JSON, so either use the json module or use a regular expression to extract the specific string you need.
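For instance, a regex along these lines would pull out the id (a sketch against a shortened sample of the string, assuming the number is unquoted as in the question):

```python
import re

tmp = '"last seen":1568,"reviews":[{"id":15869,"author":"abnbvg"}]'  # shortened sample

# capture the digits that follow "id":
m = re.search(r'"id":(\d+)', tmp)
if m:
    print(m.group(1))  # 15869
```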
The data looks like a JSON string. Use:
try:
    import json
except ImportError:
    import simplejson as json
tmp = '"last seen":1568,"reviews":[{"id":15869,"author":"abnbvg"}]'
data = json.loads('{{{}}}'.format(tmp))
>>> print data
{u'reviews': [{u'id': 15869, u'author': u'abnbvg'}], u'last seen': 1568}
>>> print data['reviews'][0]['id']
15869
Note that I wrapped the string in { and } to make a dictionary. You might not have to do that if the actual JSON string is already encapsulated with braces.
If id is the only thing you need from the string, and it will always look something like {"id":15869,"author":"abnbvg"..., then you can go with a simple string split instead of a JSON conversion.
tmp = '"last seen":1568,"reviews" : [{"id":15869,"author":"abnbvg","changes":........'
tmp1 = tmp.split('"id":', 1)[1]
id = tmp1.split(",", 1)[0]
Please note that the tmp1 line may raise IndexError if there is no "id" key in the string. You could use index -1 instead of 1 to sidestep the error, but handling the exception instead lets you report that "id" is not found:
try:
    tmp1 = tmp.split('"id":', 1)[1]
    id = tmp1.split(",", 1)[0]
except IndexError:
    print "id key is not present in the json"
    id = None
If you really do need more variables from the JSON string, please go with mhawke's solution of converting the JSON to a dictionary and reading the values. You can use ast.literal_eval:
from ast import literal_eval
tmp = '"last seen":1568,"reviews" : [{"id":15869,"author":"abnbvg","changes":........'
tmp_dict = literal_eval("""{%s}"""%(tmp))
print tmp_dict["reviews"][0]["id"]
In the second case, if you need to collect all the "id" values in a list, this will help:
id_list = []
for id_dict in tmp_dict["reviews"]:
    id_list.append(id_dict["id"])
print id_list
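The same loop can also be written as a list comprehension; a sketch with a made-up tmp_dict of the same shape:

```python
tmp_dict = {"reviews": [{"id": 15869}, {"id": 15870}]}  # illustrative structure

# collect every "id" value from the list of review dictionaries
id_list = [id_dict["id"] for id_dict in tmp_dict["reviews"]]
print(id_list)  # [15869, 15870]
```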

Problem splitting a string

I wrote a program to read a registry entry from a file.
And the entry looks like this:
reg='HKEY_LOCAL_MACHINE\SOFTWARE\TT\Tools\SYS\exePath' #it means rootKey=HKEY_LOCAL_MACHINE, subKey='SOFTWARE\TT\Tools\SYS', property=exePath
I want to read this entry from the file and break it into rootKey, subKey and property.
Apparently, I can do it this way:
rootKey = reg.split('\\', 1)[0]
subKey = reg.split('\\', 1)[1].rsplit('\\', 1)[0]  # might be a stupid way
property = reg.rsplit('\\', 1)[1]
Maybe the entry is a stupid one, but any better way to break it into parts like above?
import re
t=re.search(r"(.+?)\\(.+)\\(.+)", reg)
t.groups()
('HKEY_LOCAL_MACHINE', 'SOFTWARE\\TT\\Tools\\SYS', 'exePath')
How about doing the following? There's no need to call .split() so many times, anyway...
s = reg.split('\\')
property = s.pop()
root_key = s.pop(0)
sub_key = '\\'.join(s)
I like to use partition over split when I can, because partition ensures each of the returned tuple elements is a string.
root_key, _, s = reg.partition("\\")
sub_key, _, property = s.rpartition("\\")  # note: rpartition; the separator is the middle element
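A quick illustration of why partition is safer than split for this: even when the separator is missing, you still get three strings back. A sketch using the registry path from the question:

```python
reg = 'HKEY_LOCAL_MACHINE\\SOFTWARE\\TT\\Tools\\SYS\\exePath'

# partition splits on the first backslash, rpartition on the last;
# the middle element of each 3-tuple is the separator itself
root_key, _, s = reg.partition("\\")
sub_key, _, prop = s.rpartition("\\")
print(root_key)  # HKEY_LOCAL_MACHINE
print(sub_key)   # SOFTWARE\TT\Tools\SYS
print(prop)      # exePath

# with no separator present, partition still returns a 3-tuple
print("no-backslash-here".partition("\\"))  # ('no-backslash-here', '', '')
```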
