Take URLs from lines of a file in Python

This is one line of a file. I want to extract only the URL after the word uri and the URL after smallPictureUrl so I can use them later, but I cannot find a proper way to do it.
The asterisks stand for text, digits, or both. They differ on every such line, so they are not helpful: they do not form a pattern I could take advantage of.
{"bigPictureUrl":"https://fbcdn-profile-a.akamaihd.net/*-*-*/*.*.*.*/*/*.jpg",
"timelineCoverPhoto":"{\"focus\":{\"x\":0.5,\"y\":0.49137931034483},\"photo\":{\"__type__
\":{\"name\":\"Photo\"},\"image_lowres\":{\"uri\":\"https://fbcdn-*-*-*.*.*/*-*-*/*.jpg
\",\"width\":180,\"height\":135}}}",
"subscribeStatus":"IS_SUBSCRIBED","smallPictureUrl":"https://fbcdn-profile-a.akamaihd.net/*-*-*/*.*.*.*/*/*.jpg",
For something simpler, like:
{"displayName":"Jim Test","firstName":"*","lastName":"*"}
I managed to extract the name (for example Jim Test) after displayName using re.search('(?<="displayName":")(\w+) (\w+)',line), but the other values are much more complicated. Any direction or advice would be appreciated.
A full line looks exactly like this:
{"bigPictureUrl":"https://fbcdn-profile-a.akamaihd.net/hprofile-ak-prn2/*.*.*.*/s200x200/*_*_*_*.jpg","timelineCoverPhoto":"{\"focus\":{\"x\":0.5,\"y\":0.40652557319224},\"photo\":{\"__type__\":{\"name\":\"Photo\"},\"image_lowres\":{\"uri\":\"https://fbcdn-photos-h-a.akamaihd.net/hphotos-ak-prn2/*_*_*_a.jpg\",\"width\":180,\"height\":120}}}","subscribeStatus":"IS_SUBSCRIBED","smallPictureUrl":"https://fbcdn-profile-a.akamaihd.net/hprofile-ak-prn2/*.*.*.*/s100x100/*_*_*_a.jpg","contactId":"**==","contactType":"USER","friendshipStatus":"ARE_FRIENDS","graphApiWriteId":"contact_*:*:*","hugePictureUrl":"https://fbcdn-profile-a.akamaihd.net/hprofile-ak-prn2/*.*.*.*/s720x720/*_*_*_*.jpg","profileFbid":"*","isMobilePushable":"NO","lookupKey":null,"name":{"displayName":"* *","firstName":"*","lastName":"*"},"nameSearchTokens":["*","*"],"phones":[],"phoneticName":{"displayName":null,"firstName":null,"lastName":null},"isMemorialized":false,"communicationRank":0.4183731,"canViewerSendGift":false,"canMessage":true}

The value associated with timelineCoverPhoto seems to be stringified JSON, so you could do something admittedly ugly like this:
import json
s = {
"subscribeStatus": "IS_SUBSCRIBED",
"bigPictureUrl": "https://fbcdn-profile-a.akamaihd.net/*-*-*/*.*.*.*/*/*.jpg",
"timelineCoverPhoto": "{\"focus\":{\"x\":0.5,\"y\":0.49137931034483},\"photo\":{\"__type__\":{\"name\":\"Photo\"},\"image_lowres\":{\"uri\":\"https://fbcdn-*-*-*.*.*/*-*-*/*.jpg \",\"width\":180,\"height\":135}}}",
"smallPictureUrl": "https://fbcdn-profile-a.akamaihd.net/*-*-*/*.*.*.*/*/*.jpg"
}
j = json.loads(s.get('timelineCoverPhoto'))
print "uri:", j.get('photo').get('image_lowres').get('uri')
uri: https://fbcdn-*-*-*.*.*/*-*-*/*.jpg
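Building on the stringified-JSON observation, here is a minimal Python 3 sketch that parses a whole line once and pulls out both values the question asks for. The sample line and the helper name extract_urls are made up for illustration, and the sketch assumes each line is a complete JSON object, as in the full example line from the question.

```python
import json

def extract_urls(line):
    """Return (uri, smallPictureUrl) from one JSON line (hypothetical helper)."""
    record = json.loads(line)
    # timelineCoverPhoto holds a JSON document encoded as a string,
    # so it needs a second json.loads call.
    cover = json.loads(record["timelineCoverPhoto"])
    uri = cover["photo"]["image_lowres"]["uri"]
    return uri, record["smallPictureUrl"]

# Made-up sample line with the same shape as the data in the question.
sample = json.dumps({
    "smallPictureUrl": "https://example.com/small.jpg",
    "timelineCoverPhoto": json.dumps(
        {"photo": {"image_lowres": {"uri": "https://example.com/cover.jpg"}}}
    ),
})

print(extract_urls(sample))
```

Parsing the outer object first and the nested string second avoids any dependence on the order of keys or on how the quotes are escaped.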

#See: http://daringfireball.net/2010/07/improved_regex_for_matching_urls
import re, urllib
GRUBER_URLINTEXT_PAT = re.compile(ur'(?i)\b((?:https?://|www\d{0,3}[.]|[a-z0-9.\-]+[.][a-z]{2,4}/)(?:[^\s()<>]+|\(([^\s()<>]+|(\([^\s()<>]+\)))*\))+(?:\(([^\s()<>]+|(\([^\s()<>]+\)))*\)|[^\s`!()\[\]{};:\'".,<>?\xab\xbb\u201c\u201d\u2018\u2019]))')
for line in urllib.urlopen("http://daringfireball.net/misc/2010/07/url-matching-regex-test-data.text"):
    print [ mgroups[0] for mgroups in GRUBER_URLINTEXT_PAT.findall(line) ]

If you are not okay with using json, how about this?
>>> print mytext
{"bigPictureUrl":"https://fbcdn-profile-a.akamaihd.net/*-*-*/*.*.*.*/*/*.jpg",
"timelineCoverPhoto":"{"focus":{"x":0.5,"y":0.49137931034483},"photo":{"__type__
":{"name":"Photo"},"image_lowres":{"uri":"https://fbcdn-*-*-*.*.*/*-*-*/*.jpg
","width":180,"height":135}}}",
"subscribeStatus":"IS_SUBSCRIBED","smallPictureUrl":"https://fbcdn-profile-a.akamaihd.net/*-*-*/*.*.*.*/*/*.jpg",
>>> uri = re.findall(r'uri\"\:\"[\'"]?([^\'" >]+)', mytext) #gets the uri
>>> smallpicurl = re.findall(r'smallPictureUrl\"\:\"[\'"]?([^\'" >]+)', mytext) # gets the smallPictureUrl
>>> ''.join(uri).rstrip()
'https://fbcdn-*-*-*.*.*/*-*-*/*.jpg' # uri
>>> ''.join(smallpicurl).rstrip()
'https://fbcdn-profile-a.akamaihd.net/*-*-*/*.*.*.*/*/*.jpg' # smallPictureUrl

Related

Get the full word(s) by knowing only just a part of it

I am searching through a text file line by line and I want to get back all strings that contain the prefix AAAXX1234. For example, my text file has these lines:
Hello my ID is [123423819::AAAXX1234_3412] #I want that(AAAXX1234_3412)
Hello my ID is [738281937::AAAXX1234_3413:AAAXX1234_4212] #I want both of them(AAAXX1234_3413, AAAXX1234_4212)
Hello my ID is [123423819::XXWWF1234_3098] #I don't care about that
The code I have so far just checks whether the line starts with "Hello my ID is":
with open(file_hrd,'r',encoding='utf-8') as hrd:
    hrd=hrd.readlines()
    for line in hrd:
        if line.startswith("Hello my ID is"):
            pass #do something
Try this:
import re
with open(file_hrd,'r',encoding='utf-8') as hrd:
    res = []
    for line in hrd:
        # raw string so that \d is passed to the regex engine unchanged
        res += re.findall(r'AAAXX1234_\d+', line)
print(res)
Output:
['AAAXX1234_3412', 'AAAXX1234_3413', 'AAAXX1234_4212']
I’d suggest parsing your lines and extracting the information into meaningful parts. That way, you can use a simple startswith on the ID part of each line. This also lets you control where the prefixes are found, e.g. in case a line contains additional data that could, in theory, also contain something that looks like an ID.
Something like this:
if line.startswith('Hello my ID is '):
    idx_start = line.index('[')
    idx_end = line.index(']', idx_start)
    idx_separator = line.index(':', idx_start, idx_end)
    num = line[idx_start + 1:idx_separator]
    ids = line[idx_separator + 2:idx_end].split(':')
    print(num, ids)
This would give you the following output for your three example lines:
123423819 ['AAAXX1234_3412']
738281937 ['AAAXX1234_3413', 'AAAXX1234_4212']
123423819 ['XXWWF1234_3098']
With that information, you can then check the ids for a prefix:
if any(x.startswith('AAAXX1234') for x in ids):
    print('do something')
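Putting the parsing and the prefix check together, a minimal sketch could look like the following. The helper name ids_with_prefix is hypothetical, and the sketch assumes every relevant line has the form "Hello my ID is [number::id:id...]" shown in the question.

```python
def ids_with_prefix(line, prefix="AAAXX1234"):
    """Return the IDs inside [...] that start with prefix (hypothetical helper)."""
    if not line.startswith("Hello my ID is "):
        return []
    start = line.index("[")
    end = line.index("]", start)
    # The bracket content looks like "<number>::<id>[:<id>...]".
    _number, _, id_part = line[start + 1:end].partition("::")
    return [i for i in id_part.split(":") if i.startswith(prefix)]

lines = [
    "Hello my ID is [123423819::AAAXX1234_3412]",
    "Hello my ID is [738281937::AAAXX1234_3413:AAAXX1234_4212]",
    "Hello my ID is [123423819::XXWWF1234_3098]",
]
for line in lines:
    print(ids_with_prefix(line))
```

Because only the bracketed part is inspected, text elsewhere on the line that happens to look like an ID is ignored.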
Using regular expressions through the re module and its findall() function should be enough:
import re
with open('file.txt') as file:
    prefix = 'AAAXX1234'
    lines = file.read().splitlines()
    output = list()
    for line in lines:
        # rf'' keeps the f-string interpolation and passes \d through as a regex escape
        output.extend(re.findall(rf'{prefix}_[\d]+', line))
You can do it with findall and the regex r'AAAXX1234_[0-9]+'. It finds every part of the string that starts with AAAXX1234_ and then grabs all of the digits after it. Change + to * if you want it to match 'AAAXX1234_' on its own as well.
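To illustrate the difference between + and * in that pattern, here is a small sketch. The sample text is made up and deliberately includes a bare AAAXX1234_ with no digits after it.

```python
import re

# Made-up text: one full ID, one bare prefix, one unrelated ID.
text = "[1::AAAXX1234_3412:AAAXX1234_] [2::XXWWF1234_3098]"

# "+" requires at least one digit after the underscore.
with_plus = re.findall(r"AAAXX1234_[0-9]+", text)
# "*" also accepts the bare prefix with no digits.
with_star = re.findall(r"AAAXX1234_[0-9]*", text)

print(with_plus)
print(with_star)
```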

Get list from string with exec in python

I have:
"[15765,22832,15289,15016,15017]"
I want:
[15765,22832,15289,15016,15017]
What should I do to convert this string to list?
P.S. The post was edited without my permission and lost an important part: the line that looks like a list has type 'bytes'. It is not a string.
P.S. №2. My initial code was:
import urllib.request, re
f = urllib.request.urlopen("http://www.finam.ru/cache/icharts/icharts.js")
lines = f.readlines()
for line in lines:
    m = re.match(r'var\s+(\w+)\s*=\s*\[\s*(.+)\s*\];', line.decode('windows-1251'))
    if m is not None:
        varname = m.group(1)
        if varname == "aEmitentIds":
            aEmitentIds = line #its type is 'bytes', not 'string'
I need to get a list from line.
The line from the web page looks like:
[15765, 22832, 15289, 15016, 15017]
Assuming s is your string, you can just use split and then cast each number to integer:
s = [int(number) for number in s[1:-1].split(',')]
For detailed information about the split function, see the Python 3 documentation for str.split.
What you have is a stringified list. You could use a json parser to parse that information into the corresponding list
import json
test_str = "[15765,22832,15289,15016,15017]"
l = json.loads(test_str) # List that you need.
Or another way to do this would be to use ast
import ast
test_str = "[15765,22832,15289,15016,15017]"
data = ast.literal_eval(test_str)
The result is
[15765, 22832, 15289, 15016, 15017]
To understand why using eval() is bad practice you could refer to this answer
You can also use regex to pull out numeric values from the string as follows:
import re
lst = "[15765,22832,15289,15016,15017]"
lst = [int(number) for number in re.findall('\d+',lst)]
Output of the above code is,
[15765, 22832, 15289, 15016, 15017]
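Since the question notes that the line is bytes rather than str, a minimal Python 3 sketch of the whole path might look like the following. The sample bytes value is made up to match the variable name in the question. The sketch assumes the line has the form var NAME = [...]; and decodes it with the same windows-1251 codec the question uses.

```python
import json
import re

# Made-up stand-in for one line read from the response (type bytes).
raw = b"var aEmitentIds = [15765, 22832, 15289, 15016, 15017];"

text = raw.decode("windows-1251")  # bytes -> str
match = re.match(r"var\s+(\w+)\s*=\s*(\[.*\]);", text)
ids = []
if match is not None and match.group(1) == "aEmitentIds":
    # The bracketed part is valid JSON, so json.loads can parse it.
    ids = json.loads(match.group(2))

print(ids)
```

Capturing the brackets as part of the group means the captured text can be handed to a parser directly, without rebuilding the list by hand.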

How do I extract text from double quotes and add it to a string? (Python 3.x)

import re
response_text = '{"captchaData":"/9j/4AAQSkZJRgABAQAAAQABAAD//gATYWJmNjUxYjM1ZjA3ZWRiMgD/2wBDAAgGBgcGBQgHBwcJCQgKDBQNDAsLDBkSEw8UHRofHh0aHBwgJC4nICIsIxwcKDcpLDAxNDQ0Hyc5PTgyPC4zNDL/2wBDAQkJCQwLDBgNDRgyIRwhMjIyMjIyMjIyMjIyMjIyMjIyMjIyMjIyMjIyMjIyMjIyMjIyMjIyMjIyMjIyMjIyMjL/wAARCABGAMgDASIAAhEBAxEB/8QAHwAAAQUBAQEBAQEAAAAAAAAAAAECAwQFBgcICQoL/8QAtRAAAgEDAwIEAwUFBAQAAAF9AQIDAAQRBRIhMUEGE1FhByJxFDKBkaEII0KxwRVS0fAkM2JyggkKFhcYGRolJicoKSo0NTY3ODk6Q0RFRkdISUpTVFVWV1hZWmNkZWZnaGlqc3R1dnd4eXqDhIWGh4iJipKTlJWWl5iZmqKjpKWmp6ipqrKztLW2t7i5usLDxMXGx8jJytLT1NXW19jZ2uHi4+Tl5ufo6erx8vP09fb3+Pn6/8QAHwEAAwEBAQEBAQEBAQAAAAAAAAECAwQFBgcICQoL/8QAtREAAgECBAQDBAcFBAQAAQJ3AAECAxEEBSExBhJBUQdhcRMiMoEIFEKRobHBCSMzUvAVYnLRChYkNOEl8RcYGRomJygpKjU2Nzg5OkNERUZHSElKU1RVVldYWVpjZGVmZ2hpanN0dXZ3eHl6goOEhYaHiImKkpOUlZaXmJmaoqOkpaanqKmqsrO0tba3uLm6wsPExcbHyMnK0tPU1dbX2Nna4uPk5ebn6Onq8vP09fb3+Pn6/9oADAMBAAIRAxEAPwD3+iiigAooooAKKKKACikzRmgBaKTNGaAFopM0ZoAWikzRmgBaKTNGaAFopM0ZoAWikzRmgBaKTNGaAFopM0tABRRRQAUUUUAFFFFABRRRQAhooNJQAtFQPe2sb7HuYVf+60gBqZWDAFSCD0IouK6YUUtFAwopKWgBKKKWgBKWkrzTxzqmvafrkdlb6iyW90AYljUKVycYJ61lVqqnHmaMMRXVGHO1c9MpKbGgjiSMEkKoGT1NOrU3FooooASnU2nDpQAUUUUAFFFFABRRRQAUUUUAIetct4/1K70zw00lmzI8kqxtIvVVIP8Ahj8a6k9agu7W3vrV7e6iWWFxhkYZBqKkXKLSdjOtBzpuMXZs5Hw3omna14GgSe3RpZ1fdMwy4fcRuz1rhNJ8S6v4Uv57Xd5qxs0bwSklQwOMj0rv7TX4ptQHh/w1BEqQA752HyIAecDvya4jx/pEum63HPJMZjdx72kKhcuODwPbH5151bSCnT3jo2ePiVy041KT1jo2jc0Lx7qt/r4trmBGDoyRwRrg+Z1GSfoa0YvF+qaX4oGma/FAkMo3I8X8IOcc9xkYrh9JuDD4y0m6Bx50sLMfUthXP57q6rx+BF4v0K4A5yv6SA/1ohVn7Ny5tUwpV6vsnPmd4v8ABmrc+O5NP1uG11DSpLa0mwUmdvm2k43Ef5xXZM6Km9mAXrknArzr4rxjydLlxyGkXP12/wCFc5qevXPiPUdO04TOtoBDEVB+8xA3E+vOa2eJdKcoy12sdEsZKhUnCfvbW+Z7DHqFlK+yO8t3b+6sqk/zqxWa/h/S5LEWZsofKC7RhQCPcHrn3rjfDmv3GjeJbvw/qVw0ttGziGWQ5KYGRk+hFdEqrg0p9TsnXdOUVUW/XzPQJp4bdN88scSf3nYKP1rzbxXIl98SNFijdXjHkAlTkH94Sf0rX8LSnxTql/rV6m+CJ/JtInGVQdSceuMc/Wub8i1tPiuwUrHa28hlP91AI95+nOa569TnhF9GzjxVV1acWtnJHrVLXJaNqmpeK5ri6gnax0yJ/Li2IDJIe5JOQB+FV9X8Rah4U1e2ivpRe6fcDiQqFkTB56cH8q6PbxUebp3Ot4q
Cjzv4e52lLTUdZI1dDlWAII7inVsdIUo6U2nDpQAUUUUAFFFFABRRRQAUUUUAIetNkXfGy5xkEZpxpKAPFPC+pnwp4pk/tBHRNrQzALkryOcd+RW58Qbz+2dKs721tZxawyEefImwNuHYHnHHWvSXs7WWUSyW0LyDo7Rgn86W4tYbu2e3njV4nGCrAEflXGsLJU3T5tDzY4GapSo82j8jwjTiJNW0JEOWEkakDsfOY/1Fdv4/Xz/Fnh+3H3mdR+cgH9Kuj4Z2UF3HdWmo3MUsbiRCyKwBByOMCpdV8I6rfeI4NZXUreR7d1MUUkRUBVOQMgmsI0Kkabi1u0c0MLWhSlBx3a7bIzPixIBDpUXctK35bf8AGuH8NLv8T6YP+nmPP/fQruvGHhfxH4g1FZ1jtDDCpSJElOcZzk5A5rAXSb7QPFun32o2n2eB5w5KsGROeeR+dZ14Sdbna0ujLFU5yxPtGmldHsleEeLbnzfF+pyxN0lKZHsNp/ka9b13xHa6VpzPDKk93IMW8MZ3","captchaMime":"image/jpeg","captchaToken":"ALXfmJpxoaxq6LYBXm-kJzIl0Yd5mHG1XbttsBX-EKxMYtYNIc6uTv89fmRxeWZGEgpi2L9sjXYlkm6Vplav_wy2KjdB5J4j3i5fB6CEuPOMXIjEql6mPBJ8-YJTCpOzzk8kOcW5nuBbuLOdMVyVxquLbWjqLZzHeN0iT4Jm4SIZ9mQNfapNGkE","status":"CAPTCHA"}'
love = '"captchaData":"mydata"'
session_token_temp = re.search(r'(\"captchaData\":\")(\w*)', response_text).group()
session_token = str(session_token_temp)
I want to extract the values of captchaData and captchaToken and store them in strings like this:
extracted_data = (value_of_captchaData)
extracted_data2 = (value_of_captchaToken)
It sounds like what you are actually trying to do is to parse JSON. JSON is a format that is often used to represent data on the web.
If you are using requests (which it sounds like from the tag), you can use .json() to parse the result. Otherwise use the built-in json module.
import requests
r = requests.get("https://httpbin.org/get")
data = r.json()
or
import json
data = json.loads('{"key": "value"}')
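Applied to a response shaped like the one in the question, the json approach might look like the sketch below. The response text here is a shortened, made-up stand-in with the same keys. Note that the regex in the question cannot work as written, because \w does not match the / and + characters that occur in base64 data.

```python
import json

# Shortened, made-up response with the same keys as in the question.
response_text = (
    '{"captchaData":"/9j/4AAQ","captchaMime":"image/jpeg",'
    '"captchaToken":"ALXfmJpx-kJz_Il0","status":"CAPTCHA"}'
)

data = json.loads(response_text)
extracted_data = data["captchaData"]
extracted_data2 = data["captchaToken"]
print(extracted_data, extracted_data2)
```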
If all you want is to remove a given character before and after a string, you can use strip
'"some string"'.strip('"') # 'some string'
You can use ast.literal_eval instead of regex:
import ast
response_text = '{"captchaData":"/9j/4AAQSkZJRgABAQAAAQABAAD//gATYWJmNjUxYjM1ZjA3ZWRiMgD/2wBDAAgGBgcGBQgHBwcJCQgKDBQNDAsLDBkSEw8UHRofHh0aHBwgJC4nICIsIxwcKDcpLDAxNDQ0Hyc5PTgyPC4zNDL/2wBDAQkJCQwLDBgNDRgyIRwhMjIyMjIyMjIyMjIyMjIyMjIyMjIyMjIyMjIyMjIyMjIyMjIyMjIyMjIyMjIyMjIyMjL/wAARCABGAMgDASIAAhEBAxEB/8QAHwAAAQUBAQEBAQEAAAAAAAAAAAECAwQFBgcICQoL/8QAtRAAAgEDAwIEAwUFBAQAAAF9AQIDAAQRBRIhMUEGE1FhByJxFDKBkaEII0KxwRVS0fAkM2JyggkKFhcYGRolJicoKSo0NTY3ODk6Q0RFRkdISUpTVFVWV1hZWmNkZWZnaGlqc3R1dnd4eXqDhIWGh4iJipKTlJWWl5iZmqKjpKWmp6ipqrKztLW2t7i5usLDxMXGx8jJytLT1NXW19jZ2uHi4+Tl5ufo6erx8vP09fb3+Pn6/8QAHwEAAwEBAQEBAQEBAQAAAAAAAAECAwQFBgcICQoL/8QAtREAAgECBAQDBAcFBAQAAQJ3AAECAxEEBSExBhJBUQdhcRMiMoEIFEKRobHBCSMzUvAVYnLRChYkNOEl8RcYGRomJygpKjU2Nzg5OkNERUZHSElKU1RVVldYWVpjZGVmZ2hpanN0dXZ3eHl6goOEhYaHiImKkpOUlZaXmJmaoqOkpaanqKmqsrO0tba3uLm6wsPExcbHyMnK0tPU1dbX2Nna4uPk5ebn6Onq8vP09fb3+Pn6/9oADAMBAAIRAxEAPwD3+iiigAooooAKKKKACikzRmgBaKTNGaAFopM0ZoAWikzRmgBaKTNGaAFopM0ZoAWikzRmgBaKTNGaAFopM0tABRRRQAUUUUAFFFFABRRRQAhooNJQAtFQPe2sb7HuYVf+60gBqZWDAFSCD0IouK6YUUtFAwopKWgBKKKWgBKWkrzTxzqmvafrkdlb6iyW90AYljUKVycYJ61lVqqnHmaMMRXVGHO1c9MpKbGgjiSMEkKoGT1NOrU3FooooASnU2nDpQAUUUUAFFFFABRRRQAUUUUAIetct4/1K70zw00lmzI8kqxtIvVVIP8Ahj8a6k9agu7W3vrV7e6iWWFxhkYZBqKkXKLSdjOtBzpuMXZs5Hw3omna14GgSe3RpZ1fdMwy4fcRuz1rhNJ8S6v4Uv57Xd5qxs0bwSklQwOMj0rv7TX4ptQHh/w1BEqQA752HyIAecDvya4jx/pEum63HPJMZjdx72kKhcuODwPbH5151bSCnT3jo2ePiVy041KT1jo2jc0Lx7qt/r4trmBGDoyRwRrg+Z1GSfoa0YvF+qaX4oGma/FAkMo3I8X8IOcc9xkYrh9JuDD4y0m6Bx50sLMfUthXP57q6rx+BF4v0K4A5yv6SA/1ohVn7Ny5tUwpV6vsnPmd4v8ABmrc+O5NP1uG11DSpLa0mwUmdvm2k43Ef5xXZM6Km9mAXrknArzr4rxjydLlxyGkXP12/wCFc5qevXPiPUdO04TOtoBDEVB+8xA3E+vOa2eJdKcoy12sdEsZKhUnCfvbW+Z7DHqFlK+yO8t3b+6sqk/zqxWa/h/S5LEWZsofKC7RhQCPcHrn3rjfDmv3GjeJbvw/qVw0ttGziGWQ5KYGRk+hFdEqrg0p9TsnXdOUVUW/XzPQJp4bdN88scSf3nYKP1rzbxXIl98SNFijdXjHkAlTkH94Sf0rX8LSnxTql/rV6m+CJ/JtInGVQdSceuMc/Wub8i1tPiuwUrHa28hlP91AI95+nOa569TnhF9GzjxVV1acWtnJHrVLXJaNqmpeK5ri6gnax0yJ/Li2IDJIe5JOQB+FV9X8Rah4U1e2ivpRe6fcDiQqFkTB56cH8q6PbxUebp3Ot4q
Cjzv4e52lLTUdZI1dDlWAII7inVsdIUo6U2nDpQAUUUUAFFFFABRRRQAUUUUAIetNkXfGy5xkEZpxpKAPFPC+pnwp4pk/tBHRNrQzALkryOcd+RW58Qbz+2dKs721tZxawyEefImwNuHYHnHHWvSXs7WWUSyW0LyDo7Rgn86W4tYbu2e3njV4nGCrAEflXGsLJU3T5tDzY4GapSo82j8jwjTiJNW0JEOWEkakDsfOY/1Fdv4/Xz/Fnh+3H3mdR+cgH9Kuj4Z2UF3HdWmo3MUsbiRCyKwBByOMCpdV8I6rfeI4NZXUreR7d1MUUkRUBVOQMgmsI0Kkabi1u0c0MLWhSlBx3a7bIzPixIBDpUXctK35bf8AGuH8NLv8T6YP+nmPP/fQruvGHhfxH4g1FZ1jtDDCpSJElOcZzk5A5rAXSb7QPFun32o2n2eB5w5KsGROeeR+dZ14Sdbna0ujLFU5yxPtGmldHsleEeLbnzfF+pyxN0lKZHsNp/ka9b13xHa6VpzPDKk93IMW8MZ3","captchaMime":"image/jpeg","captchaToken":"ALXfmJpxoaxq6LYBXm-kJzIl0Yd5mHG1XbttsBX-EKxMYtYNIc6uTv89fmRxeWZGEgpi2L9sjXYlkm6Vplav_wy2KjdB5J4j3i5fB6CEuPOMXIjEql6mPBJ8-YJTCpOzzk8kOcW5nuBbuLOdMVyVxquLbWjqLZzHeN0iT4Jm4SIZ9mQNfapNGkE","status":"CAPTCHA"}'
new_data = ast.literal_eval(response_text)
print(new_data["captchaData"])
print(new_data['captchaToken'])

Python: Join multiple components to build a URL

I am trying to build a URL by joining some dynamic components. I thought of using something like os.path.join(), but for URLs. From research I found that urlparse.urljoin() does the same thing. However, it looks like it only takes two arguments at a time.
I have the following so far which works but looks repetitive:
a = urlparse.urljoin(environment, schedule_uri)
b = urlparse.urljoin(a, str(events_to_hours))
c = urlparse.urljoin(b, str(events_from_date))
d = urlparse.urljoin(c, str(api_version))
e = urlparse.urljoin(d, str(id))
url = e + '.json'
Output = http://example.com/schedule/12/20160322/v1/1.json
The above works and I tried to make it shorter this way:
url_join_items = [environment, schedule_uri, str(events_to_hours),
                  str(events_from_date), str(api_version), str(id), ".json"]
new_url = ""
for url_items in url_join_items:
    new_url = urlparse.urljoin(new_url, url_items)
Output: http://example.com/schedule/.json
But the second implementation does not work. Please suggest how to fix this, or a better way of doing it.
EDIT 1:
The output from the reduce solution looks like this (unfortunately):
Output: http://example.com/schedule/.json
Using join
Have you tried simply "/".join(url_join_items)? Doesn't HTTP always use the forward slash? You might have to add the prefix "https://" and the suffix manually, though.
Something like:
url = "https://{}.json".format("/".join(url_join_items))
Using reduce and urljoin
Here is a related question on SO that explains to some degree the thinking behind the implementation of urljoin. Your use case does not appear to be the best fit.
When using reduce and urljoin, I'm not sure it will do what the question intends, which is semantically like os.path.join, but for urls. Consider the following:
from urllib.parse import urljoin
from functools import reduce
parts_1 = ["a","b","c","d"]
parts_2 = ["https://","server.com","somedir","somefile.json"]
parts_3 = ["https://","server.com/","somedir/","somefile.json"]
out1 = reduce(urljoin, parts_1)
print(out1)
d
out2 = reduce(urljoin, parts_2)
print(out2)
https:///somefile.json
out3 = reduce(urljoin, parts_3)
print(out3)
https:///server.com/somedir/somefile.json
Note that with the exception of the extra "/" after the https prefix, the third output is probably closest to what the asker intends, except we've had to do all the work of formatting the parts with the separator.
I also needed something similar and came up with this solution:
from urllib.parse import urljoin, quote_plus
def multi_urljoin(*parts):
    return urljoin(parts[0], "/".join(quote_plus(part.strip("/"), safe="/") for part in parts[1:]))
print(multi_urljoin("https://server.com", "path/to/some/dir/", "2019", "4", "17", "some_random_string", "image.jpg"))
This prints 'https://server.com/path/to/some/dir/2019/4/17/some_random_string/image.jpg'
Here's a slightly silly but workable solution, given that parts is a list of URL parts in order:
my_url = '/'.join(parts).replace('//', '/').replace(':/', '://')
I wish replace had a "from" option, but it does not, hence the second replace, which restores the double slash in https://.
The nice thing is that you don't have to worry about whether the parts already have slashes.
A simple solution would be:
def url_join(*parts: str) -> str:
    import re
    line = '/'.join(parts)
    line = re.sub('/{2,}', '/', line)
    return re.sub(':/', '://', line)
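A variant that avoids touching the scheme at all is to strip slashes from each piece before joining, so there is nothing to repair afterwards. This is only a sketch: the helper name join_url is hypothetical, it does no URL quoting, and it assumes none of the parts is empty.

```python
def join_url(base, *parts):
    """Join path parts onto base with single slashes (hypothetical helper)."""
    # Strip slashes from each piece so exactly one separator is inserted;
    # the "//" after the scheme is untouched because only the end of base is trimmed.
    pieces = [base.rstrip("/")] + [str(p).strip("/") for p in parts]
    return "/".join(pieces)

url = join_url("http://example.com/", "schedule", 12, 20160322, "v1", 1) + ".json"
print(url)
```

Converting each part with str() means numeric components such as the ID can be passed in directly, as in the question.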

Python: parsing sections of a log file

I have a section of a log file that looks like this:
"/log?action=End&env=123&id=8000&cat=baseball"
"/log?action=start&get=3210&rsa=456&key=golf"
I want to parse out each section so the results would look like this:
('/log?action=', 'End', 'env=123', 'id=8000', 'cat=baseball')
('/log?action=', 'start', 'get=3210', 'rsa=456', 'key=golf')
I've looked into regex and matching, but a lot of my logs have different sequences which leads me to believe that it is not possible. Any suggestions?
This is clearly a fragment of a URL, so the best way to parse it is to use URL parsing tools. The stdlib comes with urlparse, which does exactly what you want.
For example:
>>> import urlparse
>>> s = "/log?action=End&env=123&id=8000&cat=baseball"
>>> bits = urlparse.urlparse(s)
>>> variables = urlparse.parse_qs(bits.query)
>>> variables
{'action': ['End'], 'cat': ['baseball'], 'env': ['123'], 'id': ['8000']}
If you really want to get the format you asked for, you can use parse_qsl instead, and then join the key-value pairs back together. I'm not sure why you want the /log to be included in the first query variable, or the first query variable's value to be separate from its variable, but even that is doable if you insist:
>>> variables = urlparse.parse_qsl(s)
>>> result = (variables[0][0] + '=', variables[0][1]) + tuple(
...     '='.join(kv) for kv in variables[1:])
>>> result
('/log?action=', 'End', 'env=123', 'id=8000', 'cat=baseball')
If you're using Python 3.x, just change the urlparse to urllib.parse, and the rest is exactly the same.
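For reference, a minimal Python 3 sketch of the same idea, using the first log line from the question. It uses urlsplit to isolate the query string, then shows both the dict form and the ordered list of pairs.

```python
from urllib.parse import urlsplit, parse_qs, parse_qsl

s = "/log?action=End&env=123&id=8000&cat=baseball"

query = urlsplit(s).query
variables = parse_qs(query)   # dict mapping each key to a list of values
pairs = parse_qsl(query)      # list of (key, value) tuples in original order
print(variables)
print(pairs)
```

parse_qsl is the better fit when the order of the parameters matters or differs between log lines, since it preserves the sequence in which they appear.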
You can split a couple times:
s = '/log?action=End&env=123&id=8000&cat=baseball'
L = s.split("&")
L[0:1]=L[0].split("=")
Output:
['/log?action', 'End', 'env=123', 'id=8000', 'cat=baseball']
It's a bit hard to say without knowing what the domain of possible inputs is, but here's a guess at what will work for you:
log = "/log?action=End&env=123&id=8000&cat=baseball\n/log?action=start&get=3210&rsa=456&key=golf"
logLines = [line.split("&") for line in log.split('\n')]
logLines = [tuple(line[0].split("=")+line[1:]) for line in logLines]
print logLines
OUTPUT:
[('/log?action', 'End', 'env=123', 'id=8000', 'cat=baseball'),
('/log?action', 'start', 'get=3210', 'rsa=456', 'key=golf')]
This assumes that you don't really need the "=" at the end of the first string.
