I have a long string containing attributes. To parse it, I am trying to extract the 'lists' from the string, but I'm having trouble, particularly with multi-dimensional (nested) lists.
An Example String:
'a="foo",c=[d="test",f="bar",g=[h="some",i="text"],j="over"],k="here",i=[j="baz"]'
I would like to extract
c=[d="test",f="bar",g=[h="some",i="text"],j="over"]
and
i=[j="baz"]
from this string.
Is this possible using regex?
I've tried numerous different regex, this is my most recent one:
([^\W0-9]\w*=\[.*\])
This string looks like a JSON object, with a few differences. My plan is to turn this into a JSON string, then parse it. After that, it is a matter of picking out what you want:
import json
import re
def str2obj(the_string):
    # Quote each key: a= becomes "a":
    out = re.sub(r"(\w+)=", r'"\1":', the_string)
    # Square brackets become JSON object braces
    out = out.replace("[", "{").replace("]", "}")
    out = "{%s}" % out
    out = json.loads(out)
    return out
string_object = 'a="foo",c=[d="test",f="bar",g=[h="some",i="text"],j="over"],k="here",i=[j="baz"]'
json_object = str2obj(string_object)
print(json_object)
assert json_object["a"] == "foo"
assert json_object["c"] == {
    'd': 'test',
    'f': 'bar',
    'g': {'h': 'some', 'i': 'text'},
    'j': 'over'
}
assert json_object["k"] == "here"
assert json_object["i"] == {"j": "baz"}
Output:
{'a': 'foo', 'c': {'d': 'test', 'f': 'bar', 'g': {'h': 'some', 'i': 'text'}, 'j': 'over'}, 'k': 'here', 'i': {'j': 'baz'}}
Notes
The re.sub call replaces a= with "a":
The replace calls turn the square brackets into curly braces
There is no error checking in the code; I assume the input is valid in terms of balanced brackets
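If only the top-level list chunks are wanted as raw text (as the question asks), a regex alone struggles because the brackets nest. A minimal sketch that tracks bracket depth instead is shown below; `extract_lists` is a hypothetical helper name, and it assumes that quoted values never contain brackets or commas.

```python
def extract_lists(text):
    """Return the top-level name=[...] chunks of text by tracking bracket depth."""
    chunks = []
    depth = 0
    start = 0  # index where the current top-level item begins
    for pos, ch in enumerate(text):
        if ch == "[":
            depth += 1
        elif ch == "]":
            depth -= 1
            if depth < 0:
                raise ValueError(f"unbalanced ']' at index {pos}")
        elif ch == "," and depth == 0:
            # A comma outside all brackets ends the current item.
            item = text[start:pos]
            if item.endswith("]"):
                chunks.append(item)
            start = pos + 1
    if depth != 0:
        raise ValueError("unbalanced '['")
    last = text[start:]
    if last.endswith("]"):
        chunks.append(last)
    return chunks


s = 'a="foo",c=[d="test",f="bar",g=[h="some",i="text"],j="over"],k="here",i=[j="baz"]'
for chunk in extract_lists(s):
    print(chunk)
```

Unlike the JSON conversion, this also reports unbalanced brackets instead of assuming the input is valid.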
Related
Suppose I have a list in the form;
lst = ["5kxn"] # 1 string only for example
5k denotes 5000 and xn denotes n times, the processed list should be;
[5*1e3 for i in range(n)] # float values
#Not this literally but a list of n 5000's.
I am aware I can do this using non-regex methods, but that could be bug-prone, and my regex skills are not good enough to come up with a method for this conversion
Here is a dictionary of multipliers:
replace_dict = {'a': '1e-18', 'f': '1e-15', 'p': '1e-12',
'n': '1e-9', 'u': '1e-6', 'm': '1e-3',
'c': '1e-2', 'd': '1e-1', 'da': '1e1',
'h': '1e2', 'k': '1e3', 'M': '1e6',
'G': '1e9', 'T': '1e12', "P": '1e15',
'E': '1e18'}
Desired output is a list. For example ["2kx1","3kx2","4k"] will be [2000.0,3000.0,3000.0,4000.0]
import re
replace_dict = {'a': '1e-18', 'f': '1e-15', 'p': '1e-12',
'n': '1e-9', 'u': '1e-6', 'm': '1e-3',
'c': '1e-2', 'd': '1e-1', 'da': '1e1',
'h': '1e2', 'k': '1e3', 'M': '1e6',
'G': '1e9', 'T': '1e12', "P": '1e15',
'E': '1e18'}
def str_to_list(list_str):
    regex = re.compile(r"([0-9]+)([^x]+)(x[0-9]+)?")
    list_numbers = []
    for string in list_str:
        parsed = re.findall(regex, string)[0]
        n = 1 if parsed[2] == '' else int(parsed[2].replace('x', ''))
        list_numbers += [float(parsed[0]) * eval(replace_dict[parsed[1]])] * n
    return list_numbers
result = str_to_list(["2kx1","3kx2","4k"])
print(result) # [2000.0, 3000.0, 3000.0, 4000.0]
A bit of explanation:
([0-9]+): captures what comes before the unit prefix, e.g. k.
([^x]+): captures the unit prefix (anything that is not an "x"). This could be refined so it only accepts the letters defined in replace_dict. The + is needed because of the 'da' prefix.
(x[0-9]+)?: captures the multiplier, e.g. x2, if exists.
The method re.findall returns a list, in this case containing a tuple with the groups captured by the regex, e.g.: [('22', 'k', 'x2')] for "22kx2". We take [0] to work directly with the tuple.
If the "xn" is missing, re.findall will return an empty string for the group (x[0-9]+)? since there's not any match, e.g.: [('2', 'k', '')] for "2k". That's why n is 1 if that string is empty, else we discard the x (replacing it with an empty string, i.e. replace('x', '')), so we only take the number, e.g. 2 in "x2".
Finally, we concatenate the resulting list to list_numbers, e.g. list_numbers += [2 * eval("1e3")] * 2 in case of "2kx2".
Hope that was clear enough :)
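The refinement mentioned above (accepting only the prefixes defined in replace_dict) could look roughly like the sketch below. It is one possible variant, not the answer's own code: `MULTIPLIERS` and `str_to_list_strict` are hypothetical names, the multipliers are stored as floats so that eval() is not needed, and unparseable items are assumed to warrant a ValueError.

```python
import re

# Same prefixes as replace_dict, stored as floats (assumption: no eval needed).
MULTIPLIERS = {'a': 1e-18, 'f': 1e-15, 'p': 1e-12,
               'n': 1e-9, 'u': 1e-6, 'm': 1e-3,
               'c': 1e-2, 'd': 1e-1, 'da': 1e1,
               'h': 1e2, 'k': 1e3, 'M': 1e6,
               'G': 1e9, 'T': 1e12, 'P': 1e15,
               'E': 1e18}

# Longest prefixes first, so "da" is tried before "d".
_PREFIXES = "|".join(sorted(map(re.escape, MULTIPLIERS), key=len, reverse=True))
_PATTERN = re.compile(rf"([0-9]+)({_PREFIXES})(?:x([0-9]+))?")


def str_to_list_strict(items):
    numbers = []
    for item in items:
        match = _PATTERN.fullmatch(item)
        if match is None:
            raise ValueError(f"cannot parse {item!r}")
        value, prefix, count = match.groups()
        # The repeat count is optional and defaults to 1.
        repeat = int(count) if count is not None else 1
        numbers.extend([float(value) * MULTIPLIERS[prefix]] * repeat)
    return numbers


print(str_to_list_strict(["2kx1", "3kx2", "4k"]))  # [2000.0, 3000.0, 3000.0, 4000.0]
```

Using fullmatch means trailing junk such as "2kx1abc" is rejected rather than silently ignored.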
Here is a regex based approach which ultimately uses the eval() function to evaluate a string expression to generate a list:
import re

def get_list(inp):
    replace_dict = {'a': '1e-18', 'f': '1e-15', 'p': '1e-12',
                    'n': '1e-9', 'u': '1e-6', 'm': '1e-3',
                    'c': '1e-2', 'd': '1e-1', 'da': '1e1',
                    'h': '1e2', 'k': '1e3', 'M': '1e6',
                    'G': '1e9', 'T': '1e12', 'P': '1e15',
                    'E': '1e18'}
    # Groups: number, unit prefix, repeat count (a number or a variable name)
    parts = re.findall(r'^(\d+)(\w+)x(\w+)$', inp)
    expr = "[" + parts[0][0] + "*" + replace_dict[parts[0][1]] + " for i in range(" + parts[0][2] + ")]"
    return expr
expr = get_list("5kxn")
print(expr) # [5*1e3 for i in range(n)]
n = 5
lst = eval(expr)
print(lst) # [5000.0, 5000.0, 5000.0, 5000.0, 5000.0]
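Since eval() executes whatever string it is given, it may be worth sketching a variant that builds the list directly and takes the value of n as an explicit argument. This is an alternative under stated assumptions, not the approach above: `expand`, `MULTIPLIERS`, and the `variables` parameter are hypothetical names introduced for illustration.

```python
import re

# Same prefixes as replace_dict, stored as floats instead of strings.
MULTIPLIERS = {'a': 1e-18, 'f': 1e-15, 'p': 1e-12,
               'n': 1e-9, 'u': 1e-6, 'm': 1e-3,
               'c': 1e-2, 'd': 1e-1, 'da': 1e1,
               'h': 1e2, 'k': 1e3, 'M': 1e6,
               'G': 1e9, 'T': 1e12, 'P': 1e15,
               'E': 1e18}


def expand(spec, variables=None):
    """Expand e.g. "5kxn" into a list, looking up n in variables."""
    variables = variables or {}
    match = re.fullmatch(r"(\d+)([A-Za-z]+?)x(\w+)", spec)
    if match is None:
        raise ValueError(f"cannot parse {spec!r}")
    number, prefix, count = match.groups()
    if prefix not in MULTIPLIERS:
        raise ValueError(f"unknown prefix {prefix!r}")
    # The count is either a literal number or the name of a supplied variable.
    repeat = int(count) if count.isdigit() else variables[count]
    return [int(number) * MULTIPLIERS[prefix]] * repeat


print(expand("5kxn", {"n": 5}))  # [5000.0, 5000.0, 5000.0, 5000.0, 5000.0]
```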
I started to translate Morse code to English and there is an issue. Here is my code:
morse_dict = {
'a': '.-',
'b': '-...',
'c': '-.-.',
'd': '-..',
'e': '.',
'f': '..-.',
'g': '--.',
'h': '....',
'i': '..',
'j': '.---',
'k': '-.-',
'l': '.-..',
'm': '--',
'n': '-.',
'o': '---',
'p': '.--.',
'q': '--.-',
'r': '.-.',
's': '...',
't': '-',
'u': '..-',
'v': '...-',
'w': '.--',
'x': '-..-',
'y': '-.--',
'z': '--..',
}
def morse_decrypt(message):
    m1 = message.split()
    new_str = []
    letter = ''
    for i,n in morse_dict.items():
        if n in m1:
            letter = str(i)
            new_str.append(letter)
    return ''.join(new_str)
print(morse_decrypt('... --- ...'))
>>>os
But when I use the function, it prints each character only once. I don't know what the problem is. What am I doing wrong?
Your morse_dict translates alphabetical letters into Morse code letters. But you want the reverse since you're trying to decrypt rather than encrypt. Either rewrite your dictionary or use
morse_to_alpha = dict(map(reversed, morse_dict.items()))
to flip the key-value pairs.
Once you've done that, then you can look up each message chunk in the translation dictionary (rather than the other way around):
def morse_decrypt(message):
    morse_to_alpha = dict(map(reversed, morse_dict.items()))
    return "".join(map(morse_to_alpha.get, message.split()))
This still breaks encapsulation. morse_to_alpha should be made into a parameter so you're not accessing global state, and it's wasteful to flip the dict for every translation. I'll leave these adjustments for you to handle.
It's also unclear how to handle errors; this raises a (not particularly clearly named) exception if the Morse code is invalid.
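One way those adjustments might look is sketched below: the reversed table is built once and passed in as a parameter, and an unknown code raises a ValueError. The names `MORSE_DICT` and `MORSE_TO_ALPHA` and the choice of ValueError are assumptions made for this illustration.

```python
MORSE_DICT = {
    'a': '.-', 'b': '-...', 'c': '-.-.', 'd': '-..', 'e': '.',
    'f': '..-.', 'g': '--.', 'h': '....', 'i': '..', 'j': '.---',
    'k': '-.-', 'l': '.-..', 'm': '--', 'n': '-.', 'o': '---',
    'p': '.--.', 'q': '--.-', 'r': '.-.', 's': '...', 't': '-',
    'u': '..-', 'v': '...-', 'w': '.--', 'x': '-..-', 'y': '-.--',
    'z': '--..',
}

# Flip the mapping once, at import time, instead of on every call.
MORSE_TO_ALPHA = {code: letter for letter, code in MORSE_DICT.items()}


def morse_decrypt(message, table):
    """Translate space-separated Morse codes using the given code-to-letter table."""
    try:
        return "".join(table[code] for code in message.split())
    except KeyError as err:
        # Re-raise with a clearer message naming the offending code.
        raise ValueError(f"unknown Morse code: {err.args[0]!r}") from None


print(morse_decrypt('... --- ...', MORSE_TO_ALPHA))  # sos
```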
You have a dictionary with the letter as the key and the code as the value.
Python dictionaries are looked up by key, not by value (unfortunately), but there is a way around this, as you probably found.
Pull the dictionary into items for letter and code like you were doing, but make the outer for loop iterate over the codes in the message and do the dictionary lookup inside it. :)
def morse_decrypt(message):
    global morse_dict
    msgList = message.split(" ")
    msgEnglish = ""
    for codeLookup in msgList:
        for letter, code in morse_dict.items():
            if(code == codeLookup):
                msgEnglish += letter
    return msgEnglish
print(morse_decrypt('... --- ...'))
references:
Get key by value in dictionary
https://www.w3schools.com/python/python_dictionaries.asp
Why 'tz': 'America/New_York' and 'tz':\s+'\w+\/\w+' give different counts of the occurrences of the substring in the string?
The string in which I am trying to count the occurrences of the substring has been extracted from a JSON file.
[{'c': 'US', 'nk': , 'tz': 'America/New_York', 'gr': 'MA', 'g': 'A6qOVH', 'h': 'wfLQtf', 'l': 'orofrog', 'al': 'en-S,en;q=0.8', 'hh': '1.usa.gov', 'r': 'http://www.facebook.com/l/7AQEFzjSi/1.usa.gov/wfLQtf', 'u': 'http://www.ncbi.nlm.nih.gov/pubmed/22415991', 't': 1331923247, 'hc': 31822918, 'cy': 'Danvers', 'll': [42.576698, -70.954903]}]
import re
d=open("filepath")
str1=d.read()
list1=re.findall(r"('tz':\s+'\w+\/\w+')",str1,re.I|re.M)
w=open("newfilepath","w")
for i in list1:
    w.writelines(i)
    w.writelines("\n")
Don't use a regexp, use ast.literal_eval() to parse a Python literal.
import ast
with open("filepath") as d:
    str1 = d.read()
list1 = ast.literal_eval(str1)
with open("newfilepath", "w") as w:
    for i in list1:
        w.write("'tz': " + i['tz'] + "\n")
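To see why the two patterns can give different counts, it may help to compare the regex against the parsed records on a small sample. The sketch below uses made-up data (the `text` string is hypothetical, not taken from the file) and assumes some records have an empty 'tz', a 'tz' with more than one slash, or no 'tz' key at all.

```python
import ast
import re
from collections import Counter

# Hypothetical sample standing in for the contents of the file.
text = (
    "[{'c': 'US', 'tz': 'America/New_York'}, "
    "{'c': 'US', 'tz': 'America/Indiana/Indianapolis'}, "
    "{'c': 'US', 'tz': ''}, "
    "{'c': 'BR'}]"
)

# \w+\/\w+ only matches values of the exact form word/word.
regex_hits = re.findall(r"'tz':\s+'\w+\/\w+'", text)

# Parsing the literal gives access to every 'tz' value, whatever its shape.
records = ast.literal_eval(text)
tz_counts = Counter(rec['tz'] for rec in records if 'tz' in rec)

print(len(regex_hits))          # 1
print(sum(tz_counts.values()))  # 3
print(tz_counts['America/New_York'])  # 1
```

Checking `'tz' in rec` also avoids a KeyError for records that lack the key.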
I need to parse JSON like this:
{
"entity": " a=123455 b=234234 c=S d=CO e=1 f=user1 timestamp=null",
"otherField": "text"
}
I want to get values for a, b, c, d, e, timestamp separately. Is there a better way than assigning the entity value to a string, then parsing with REGEX?
There is nothing in the JSON standard that parses that value for you; you'll have to do this in Python.
It could be easier to just split that string on whitespace, then on =:
entities = dict(keyvalue.split('=', 1) for keyvalue in data['entity'].split())
This results in:
>>> data = {'entity': " a=123455 b=234234 c=S d=CO e=1 f=user1 timestamp=null"}
>>> dict(keyvalue.split('=', 1) for keyvalue in data['entity'].split())
{'a': '123455', 'c': 'S', 'b': '234234', 'e': '1', 'd': 'CO', 'f': 'user1', 'timestamp': 'null'}
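Put together with the JSON decoding step, the whole thing might look like the sketch below. The `raw` string reproduces the document from the question; mapping the text "null" to None is an assumption about what is wanted, not something JSON does for you here, since the value is just part of a string.

```python
import json

raw = '{"entity": " a=123455 b=234234 c=S d=CO e=1 f=user1 timestamp=null", "otherField": "text"}'

data = json.loads(raw)

# Split on whitespace first, then split each pair on the first '=' only.
entities = dict(pair.split('=', 1) for pair in data['entity'].split())

# Assumption: the literal text "null" should become None.
entities = {key: (None if value == 'null' else value)
            for key, value in entities.items()}

print(entities['a'])          # 123455
print(entities['timestamp'])  # None
```

All other values stay strings; converting '123455' to an int would be a further, separate decision.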
What about this:
>>> dic = dict(item.split("=") for item in s['entity'].strip().split(" "))
>>> dic
{'a': '123455', 'c': 'S', 'b': '234234', 'e': '1', 'd': 'CO', 'f': 'user1', 'timestamp':'null'}
>>> dic['a']
'123455'
>>> dic['b']
'234234'
>>> dic['c']
'S'
>>> dic['d']
'CO'
>>>
I have a dictionary I want to compare to my string: for each key in the dictionary that matches a character in the string, I wish to convert that character to the corresponding dictionary value
I want to compare my dictionary to my string character by character and when they match replace the strings character with the value of the dictionary's match e.g. if A is in the string it will match to A in the dictionary and be replaced with T which is written to the file line2_u_rev_comp. However the error KeyError: '\n' occurs instead. What is this signaling and how can it be removed?
REV_COMP = {
'A': 'T',
'T': 'A',
'C': 'G',
'G': 'C',
'N': 'N',
'U': 'A'
}
tbl = REV_COMP
line2_u_rev_comp = [tbl[k] for k in line2_u_rev[::-1]]
''.join(line2_u_rev_comp)
'\n' means new line, and you can get rid of it (and other extraneous whitespace) using str.strip, e.g.:
line2_u_rev_comp = [tbl[k] for k in line2_u_rev.strip()[::-1]]
line2_u_rev_comp = [tbl.get(k,k) ... ]
This either gets the value from the dictionary or, if the key is missing, returns the character itself.
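The two suggestions so far (stripping the newline and falling back to the character itself) can be combined. A minimal sketch, where `reverse_complement` is a hypothetical function name and unknown characters are assumed to pass through unchanged:

```python
REV_COMP = {
    'A': 'T',
    'T': 'A',
    'C': 'G',
    'G': 'C',
    'N': 'N',
    'U': 'A'
}


def reverse_complement(seq, table=REV_COMP):
    # strip() drops the trailing '\n'; get(base, base) leaves unknown characters as-is.
    return ''.join(table.get(base, base) for base in seq.strip()[::-1])


print(reverse_complement("ATGC\n"))  # GCAT
```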
The problem is that tbl[k] is evaluated without checking whether the key exists in the dict; if it does not, you need to return k itself.
You also need to reverse the list again, since your for statement iterates over the string in reverse.
Try this code:
line2_u_rev = "MY TEST IS THIS"
REV_COMP = {
'A': 'T',
'T': 'A',
'C': 'G',
'G': 'C',
'N': 'N',
'U': 'A'
}
tbl = REV_COMP
line2_u_rev_comp = [tbl[k] if k in tbl else k for k in line2_u_rev[::-1]][::-1]
print(''.join(line2_u_rev_comp))
Output:
MY AESA IS AHIS