I have a section of a log file that looks like this:
"/log?action=End&env=123&id=8000&cat=baseball"
"/log?action=start&get=3210&rsa=456&key=golf"
I want to parse out each section so the results would look like this:
('/log?action=', 'End', 'env=123', 'id=8000', 'cat=baseball')
('/log?action=', 'start', 'get=3210', 'rsa=456', 'key=golf')
I've looked into regex and matching, but a lot of my logs have different sequences which leads me to believe that it is not possible. Any suggestions?
This is clearly a fragment of a URL, so the best way to parse it is to use URL parsing tools. The stdlib comes with urlparse, which does exactly what you want.
For example:
>>> import urlparse
>>> s = "/log?action=End&env=123&id=8000&cat=baseball"
>>> bits = urlparse.urlparse(s)
>>> variables = urlparse.parse_qs(bits.query)
>>> variables
{'action': ['End'], 'cat': ['baseball'], 'env': ['123'], 'id': ['8000']}
If you really want to get the format you asked for, you can use parse_qsl instead, and then join the key-value pairs back together. I'm not sure why you want the /log to be included in the first query variable, or the first query variable's value to be separate from its variable, but even that is doable if you insist:
>>> variables = urlparse.parse_qsl(s)
>>> result = (variables[0][0] + '=', variables[0][1]) + tuple(
'='.join(kv) for kv in variables[1:])
>>> result
('/log?action=', 'End', 'env=123', 'id=8000', 'cat=baseball')
If you're using Python 3.x, just change the urlparse to urllib.parse, and the rest is exactly the same.
You can split a couple times:
s = '/log?action=End&env=123&id=8000&cat=baseball'
L = s.split("&")
L[0:1]=L[0].split("=")
Output:
['/log?action', 'End', 'env=123', 'id=8000', 'cat=baseball']
It's a bit hard to say without knowing what the domain of possible inputs is, but here's a guess at what will work for you:
log = "/log?action=End&env=123&id=8000&cat=baseball\n/log?action=start&get=3210&rsa=456&key=golf"
logLines = [line.split("&") for line in log.split('\n')]
logLines = [tuple(line[0].split("=")+line[1:]) for line in logLines]
print logLines
OUTPUT:
[('/log?action', 'End', 'env=123', 'id=8000', 'cat=baseball'),
('/log?action', 'start', 'get=3210', 'rsa=456', 'key=golf')]
This assumes that you don't really need the "=" at the end of the first string.
Related
I have:
"[15765,22832,15289,15016,15017]"
I want:
[15765,22832,15289,15016,15017]
What should I do to convert this string to list?
P.S. Post was edited without my permission and it lost important part. The type of line that looks like list is 'bytes'. This is not string.
P.S. №2. My initial code was:
import urllib.request, re
f = urllib.request.urlopen("http://www.finam.ru/cache/icharts/icharts.js")
lines = f.readlines()
for line in lines:
m = re.match('var\s+(\w+)\s*=\s*\[\\s*(.+)\s*\]\;', line.decode('windows-1251'))
if m is not None:
varname = m.group(1)
if varname == "aEmitentIds":
aEmitentIds = line #its type is 'bytes', not 'string'
I need to get list from line
line from web page looks like
[15765, 22832, 15289, 15016, 15017]
Assuming s is your string, you can just use split and then cast each number to integer:
s = [int(number) for number in s[1:-1].split(',')]
For detailed information about split function:
Python3 split documentation
What you have is a stringified list. You could use a json parser to parse that information into the corresponding list
import json
test_str = "[15765,22832,15289,15016,15017]"
l = json.loads(test_str) # List that you need.
Or another way to do this would be to use ast
import ast
test_str = "[15765,22832,15289,15016,15017]"
data = ast.literal_eval(test_str)
The result is
[15765, 22832, 15289, 15016, 15017]
To understand why using eval() is bad practice you could refer to this answer
You can also use regex to pull out numeric values from the string as follows:
import re
lst = "[15765,22832,15289,15016,15017]"
lst = [int(number) for number in re.findall('\d+',lst)]
Output of the above code is,
[15765, 22832, 15289, 15016, 15017]
I am trying to run a scraper I found online but receive a ValueError: too many values to unpack on this line of code
k, v = piece.split("=")
This line is part of this function
def format_url(url):
# make sure URLs aren't relative, and strip unnecssary query args
u = urlparse(url)
scheme = u.scheme or "https"
host = u.netloc or "www.amazon.com"
path = u.path
if not u.query:
query = ""
else:
query = "?"
for piece in u.query.split("&"):
k, v = piece.split("=")
if k in settings.allowed_params:
query += "{k}={v}&".format(**locals())
query = query[:-1]
return "{scheme}://{host}{path}{query}".format(**locals())
If you have any input it would be appreciated, thank you.
Instead of parsing the urls yourself, you can use urlparse.parse_qs function:
>>> from urlparse import urlparse, parse_qs
>>> URL = 'https://someurl.com/with/query_string?i=main&mode=front&sid=12ab&enc=+Hello'
>>> parsed_url = urlparse(URL)
>>> parse_qs(parsed_url.query)
{'i': ['main'], 'enc': [' Hello '], 'mode': ['front'], 'sid': ['12ab']}
(source)
This is due to the fact that one of the pieces contains two or more '=' characters. In that case you thus return a list of three or more elements. And you cannot assign it to the two values.
You can solve that problem, by splitting at most one '=' by adding an additional parameter to the .split(..) call:
k, v = piece.split("=",1)
But now we still do not have guarantees that there is an '=' in the piece string anyway.
We can however use the urllib.parse module in python-3.x (urlparse in python-2.x):
from urllib.parse import urlparse, parse_qsl
purl = urlparse(url)
quer = parse_qsl(purl.query)
for k,v in quer:
# ...
pass
Now we have decoded the query string as a list of key-value tuples we can process separately. I would advice to build up a URL with the urllib as well.
You haven't shown any basic debugging: what is piece at the problem point? If it has more than a single = in the string, the split operation will return more than 2 values -- hence your error message.
If you want to split on only the first =, then use index to get the location, and grab the slices you need:
pos = piece.index('=')
k = piece[:pos]
v = piece[pos+1:]
I am trying to build a URL by joining some dynamic components. I thought of using something like os.path.join() BUT for URLs in my case. From research I found urlparse.urljoin() does the same thing. However, it looks like it only take two arguments at one time.
I have the following so far which works but looks repetitive:
a = urlparse.urljoin(environment, schedule_uri)
b = urlparse.urljoin(a, str(events_to_hours))
c = urlparse.urljoin(b, str(events_from_date))
d = urlparse.urljoin(c, str(api_version))
e = urlparse.urljoin(d, str(id))
url = e + '.json'
Output = http://example.com/schedule/12/20160322/v1/1.json
The above works and I tried to make it shorter this way:
url_join_items = [environment, schedule_uri, str(events_to_hours),
str(events_from_date), str(api_version), str(id), ".json"]
new_url = ""
for url_items in url_join_items:
new_url = urlparse.urljoin(new_url, url_items)
Output: http://example.com/schedule/.json
But the second implementation does not work. Please suggest me how to fix this or the better way of doing it.
EDIT 1:
The output from the reduce solution looks like this (unfortunately):
Output: http://example.com/schedule/.json
Using join
Have you tried simply "/".join(url_join_items). Does not http always use the forward slash? You might have to manually setup the prefix "https://" and the suffix, though.
Something like:
url = "https://{}.json".format("/".join(url_join_items))
Using reduce and urljoin
Here is a related question on SO that explains to some degree the thinking behind the implementation of urljoin. Your use case does not appear to be the best fit.
When using reduce and urljoin, I'm not sure it will do what the question intends, which is semantically like os.path.join, but for urls. Consider the following:
from urllib.parse import urljoin
from functools import reduce
parts_1 = ["a","b","c","d"]
parts_2 = ["https://","server.com","somedir","somefile.json"]
parts_3 = ["https://","server.com/","somedir/","somefile.json"]
out1 = reduce(urljoin, parts_1)
print(out1)
d
out2 = reduce(urljoin, parts_2)
print(out2)
https:///somefile.json
out3 = reduce(urljoin, parts_3)
print(out3)
https:///server.com/somedir/somefile.json
Note that with the exception of the extra "/" after the https prefix, the third output is probably closest to what the asker intends, except we've had to do all the work of formatting the parts with the separator.
I also needed something similar and came up with this solution:
from urllib.parse import urljoin, quote_plus
def multi_urljoin(*parts):
return urljoin(parts[0], "/".join(quote_plus(part.strip("/"), safe="/") for part in parts[1:]))
print(multi_urljoin("https://server.com", "path/to/some/dir/", "2019", "4", "17", "some_random_string", "image.jpg"))
This prints 'https://server.com/path/to/some/dir/2019/4/17/some_random_string/image.jpg'
Here's a bit silly but workable solution, given that parts is a list of URL parts in order
my_url = '/'.join(parts).replace('//', '/').replace(':/', '://')
I wish replace would have a from option but it does not hence the second one is to recover https:// double slash
Nice thing is you don't have to worry about parts already having (or not having) any slashes
Simple solution will be:
def url_join(*parts: str) -> str:
import re
line = '/'.join(parts)
line = re.sub('/{2,}', '/', line)
return re.sub(':/', '://', line)
This is a line of a file and I want to take only the url after the word uri and the url after smallPictureUrl to use it later but i can not find a proper way
The asterisks represent text or numbers or both together and the are different in every line who looks like this so they can not be helpfull, the have not a pattern to take advantage of it
{"bigPictureUrl":"https://fbcdn-profile-a.akamaihd.net/*-*-*/*.*.*.*/*/*.jpg",
"timelineCoverPhoto":"{\"focus\":{\"x\":0.5,\"y\":0.49137931034483},\"photo\":{\"__type__
\":{\"name\":\"Photo\"},\"image_lowres\":{\"uri\":\"https://fbcdn-*-*-*.*.*/*-*-*/*.jpg
\",\"width\":180,\"height\":135}}}",
"subscribeStatus":"IS_SUBSCRIBED","smallPictureUrl":"https://fbcdn-profile-a.akamaihd.net/*-*-*/*.*.*.*/*/*.jpg",
in something more simple like:
{"displayName":"Jim Test","firstName":"*","lastName":"*"}
i managed to take the name for example Jim Test after displayName with using the re.search('(?<="displayName":")(\w+) (\w+)',line) but for the other is very complicated if you can give me any direction or advice .
a line is exactly like this
{"bigPictureUrl":"https://fbcdn-profile-a.akamaihd.net/hprofile-ak-prn2/*.*.*.*/s200x200/*_*_*_*.jpg","timelineCoverPhoto":"{\"focus\":{\"x\":0.5,\"y\":0.40652557319224},\"photo\":{\"__type__\":{\"name\":\"Photo\"},\"image_lowres\":{\"uri\":\"https://fbcdn-photos-h-a.akamaihd.net/hphotos-ak-prn2/*_*_*_a.jpg\",\"width\":180,\"height\":120}}}","subscribeStatus":"IS_SUBSCRIBED","smallPictureUrl":"https://fbcdn-profile-a.akamaihd.net/hprofile-ak-prn2/*.*.*.*/s100x100/*_*_*_a.jpg","contactId":"**==","contactType":"USER","friendshipStatus":"ARE_FRIENDS","graphApiWriteId":"contact_*:*:*","hugePictureUrl":"https://fbcdn-profile-a.akamaihd.net/hprofile-ak-prn2/*.*.*.*/s720x720/*_*_*_*.jpg","profileFbid":"*","isMobilePushable":"NO","lookupKey":null,"name":{"displayName":"* *","firstName":"*","lastName":"*"},"nameSearchTokens":["*","*"],"phones":[],"phoneticName":{"displayName":null,"firstName":null,"lastName":null},"isMemorialized":false,"communicationRank":0.4183731,"canViewerSendGift":false,"canMessage":true}
The value associated with timelineCoverPhoto seems to be stringified JSON, so you could do something admittedly ugly like this:
import json
s = {
"subscribeStatus": "IS_SUBSCRIBED",
"bigPictureUrl": "https://fbcdn-profile-a.akamaihd.net/*-*-*/*.*.*.*/*/*.jpg",
"timelineCoverPhoto": "{\"focus\":{\"x\":0.5,\"y\":0.49137931034483},\"photo\":{\"__type__\":{\"name\":\"Photo\"},\"image_lowres\":{\"uri\":\"https://fbcdn-*-*-*.*.*/*-*-*/*.jpg \",\"width\":180,\"height\":135}}}",
"smallPictureUrl": "https://fbcdn-profile-a.akamaihd.net/*-*-*/*.*.*.*/*/*.jpg"
}
j = json.loads(s.get('timelineCoverPhoto'))
print "uri:", j.get('photo').get('image_lowres').get('uri')
uri: https://fbcdn-*-*-*.*.*/*-*-*/*.jpg
#See: http://daringfireball.net/2010/07/improved_regex_for_matching_urls
import re, urllib
GRUBER_URLINTEXT_PAT = re.compile(ur'(?i)\b((?:https?://|www\d{0,3}[.]|[a-z0-9.\-]+[.][a-z]{2,4}/)(?:[^\s()<>]+|\(([^\s()<>]+|(\([^\s()<>]+\)))*\))+(?:\(([^\s()<>]+|(\([^\s()<>]+\)))*\)|[^\s`!()\[\]{};:\'".,<>?\xab\xbb\u201c\u201d\u2018\u2019]))')
for line in urllib.urlopen("http://daringfireball.net/misc/2010/07/url-matching-regex-test-data.text"):
print [ mgroups[0] for mgroups in GRUBER_URLINTEXT_PAT.findall(line) ]
If you not okay with using json, how about this ?
>>> print mytext
{"bigPictureUrl":"https://fbcdn-profile-a.akamaihd.net/*-*-*/*.*.*.*/*/*.jpg",
"timelineCoverPhoto":"{"focus":{"x":0.5,"y":0.49137931034483},"photo":{"__type__
":{"name":"Photo"},"image_lowres":{"uri":"https://fbcdn-*-*-*.*.*/*-*-*/*.jpg
","width":180,"height":135}}}",
"subscribeStatus":"IS_SUBSCRIBED","smallPictureUrl":"https://fbcdn-profile-a.akamaihd.net/*-*-*/*.*.*.*/*/*.jpg",
>>> uri = re.findall(r'uri\"\:\"[\'"]?([^\'" >]+)', mytext) #gets the uri
>>> smallpicurl = re.findall(r'smallPictureUrl\"\:\"[\'"]?([^\'" >]+)', mytext) # gets the smallPictureUrl
>>> ''.join(uri).rstrip()
'https://fbcdn-*-*-*.*.*/*-*-*/*.jpg' # uri
>>> ''.join(smallpicurl).rstrip()
'https://fbcdn-profile-a.akamaihd.net/*-*-*/*.*.*.*/*/*.jpg' # smallPictureUrl
Going to re-word the question.
Basically I'm wondering what is the easiest way to manipulate a string formatted like this:
Safety/Report/Image/489
or
Safety/Report/Image/490
And sectioning off each word seperated by a slash(/), and storing each section(token) into a store so I can call it later. (Reading in about 1200 cells from a CSV file).
The answer for your question:
>>> mystring = "Safety/Report/Image/489"
>>> mystore = mystring.split('/')
>>> mystore
['Safety', 'Report', 'Image', '489']
>>> mystore[2]
'Image'
>>>
If you want to store data from more than one string, then you have several options depending on how do you want to organize it. For example:
liststring = ["Safety/Report/Image/489",
"Safety/Report/Image/490",
"Safety/Report/Image/491"]
dictstore = {}
for line, string in enumerate(liststring):
dictstore[line] = string.split('/')
print dictstore[1][3]
print dictstore[2][3]
prints:
490
491
In this case you can use in the same way a dictionary or a list (a list of lists) for storage. In case each string has a especial identifier (one better than the line number), then the dictionary is the option to choose.
I don't quite understand your code and don't have too much time to study it, but I thought that the following might be helpful, at least if order isn't important ...
in_strings = ['Safety/Report/Image/489',
'Safety/Report/Image/490',
'Other/Misc/Text/500'
]
out_dict = {}
for in_str in in_strings:
level1, level2, level3, level4 = in_str.split('/')
out_dict.setdefault(level1, {}).setdefault(
level2, {}).setdefault(
level3, []).append(level4)
print out_dict
{'Other': {'Misc': {'Text': ['500']}}, 'Safety': {'Report': {'Image': ['489', '490']}}}
If your csv is line seperated:
#do something to load the csv
split_lines = [x.strip() for x in csv_data.split('\n')]
for line_data in split_lines:
split_parts = [x.strip() for x in line_data.split('/')]
# do something with individual part data
# such as some_variable = split_parts[1] etc
# if using indexes, I'd be sure to catch for index errors in case you
# try to go to index 3 of something with only 2 parts
check out the python csv module for some importing help (I'm not too familiar).