Python Regex: US phone number parsing

Python Regex: US phone number parsing - python

I am a complete newbie in Regex.
I need to parse US phone numbers in a different format into 3 strings: area code (no '()'), next 3 digits, last 4 digits. No '-'.
I also need to reject (message Error):
916-111-1111 ('-' after the area code)
(916)111 -1111 (white space before '-')
( 916)111-1111 (any space inside of area code) - (916 ) - must be
rejected too
(a56)111-1111 (any non-digits inside of area code)
lack of '()' for the area code
it should OK: ' (916) 111-1111 ' (spaces anywhere except as above)
here is my regex:
^\s*\(?(\d{3})[\)\-][\s]*?(\d{3})[-]?(\d{4})\s*$
This took me 2 days.
It did not fail 916-111-1111 (availability of '-' after area code). I am sure there are some other deficiencies.
I would appreciate your help very much. Even hints.
Valid:
'(916) 111-1111'
'(916)111-1111 '
' (916) 111-1111'
INvalid:
'916-111-1111' - no () or '-' after area code
'(916)111 -1111' - no space before '-'
'( 916)111-1111' - no space inside ()
'(abc) 111-11i1' because of non-digits

You can do this:
import re
r = r'\((\d{3})\)\s*?(\d{3})\-(\d{4,5})'
l = ['(916) 111-11111', '(916)111-1111 ', ' (916) 111-1111', '916-111-1111', '(916)111 -1111', '( 916)111-1111', '(abc) 111-11i1']
print([re.findall(r, x) for x in l])
# [[('916', '111', '11111')], [('916', '111', '1111')], [('916', '111', '1111')], [], [], [], []]

You can simplify the regex if you consider (1) providing easy user interface rather than asking users to modify inputs or (2) taking the numbers to a backend storage as follows:
"(\d{1,3})\D*(\d{3})\D*(\d{4})"
Since you want to print the error message, the found regex groups should be rechecked according to the error messages as follows:
Code:
import re
def get_failed_reason(s):
space_regex = r"(\s+)"
area_code_regex = r"\s*(\D*)(\d{1,3})(\D*)(\d{3})(\D*)-(\d{4})"
results = re.findall(area_code_regex, s)
if 0 == len(results):
area_code_alpha_regex = r"\((\D+)\)"
results = re.findall(area_code_alpha_regex, s)
if len(results) > 0:
return "because of non-digits"
return "no matches"
results = results[0]
space_results = re.findall(space_regex, results[0])
if 0 == len(space_results):
space_results = re.findall(space_regex, results[2])
if 0 != len(space_results):
return "no space inside ()"
alpha_code_regex = r"(\D+)"
alpha_results = re.findall(alpha_code_regex, results[0])
if 0 == len(alpha_results):
alpha_results = re.findall(alpha_code_regex, results[2])
if 0 != len(alpha_results):
if "(" not in results[0] or ")" not in results[2]:
return "no () or '-' after area code"
if 0 != len(results[-2]):
return "no space before '-'"
return "because of non-digits in area code"
return "unspecified"
if __name__ == '__main__':
phone_numbers = ["916-111-1111", "(916)111-1111", "(916)111 -1111 ", " (916) 111-1111", "(916 )111-1111", "( 916)111-1111", "- (916 )111-1111", "(a56)111-1111", "(56a)111-1111", "(916) 111-1111 ", "(abc) 111-1111"]
valid_regex = r"\s*(\()(\d{1,3})(\))(\D*)(\d{3})([^\d\s]*)-(\d{4})"
for phone_number_str in phone_numbers:
results = re.findall(valid_regex, phone_number_str)
if 0 == len(results):
reason = get_failed_reason(phone_number_str)
phone_number_str = f"[{phone_number_str}]"
print(f"[main] Failed:\t{phone_number_str: <30}- {reason}")
continue
area_code = results[0][1]
first_number = results[0][4]
second_number = results[0][6]
phone_number_str = f"[{phone_number_str}]"
print(f"[main] Valid:\t{phone_number_str: <30}- Area code: {area_code}, First number: {first_number}, Second number: {second_number}")
Result:
[main] Failed: [916-111-1111] - no () or '-' after area code
[main] Valid: [(916)111-1111] - Area code: 916, First number: 111, Second number: 1111
[main] Failed: [(916)111 -1111 ] - no space before '-'
[main] Valid: [ (916) 111-1111] - Area code: 916, First number: 111, Second number: 1111
[main] Failed: [(916 )111-1111] - no space inside ()
[main] Failed: [( 916)111-1111] - no space inside ()
[main] Failed: [- (916 )111-1111] - no space inside ()
[main] Failed: [(a56)111-1111] - because of non-digits in area code
[main] Failed: [(56a)111-1111] - because of non-digits in area code
[main] Valid: [(916) 111-1111 ] - Area code: 916, First number: 111, Second number: 1111
[main] Failed: [(abc) 111-1111] - because of non-digits
Note: D represents non-digit characters.

Related

How to parse Log file to object list

I'm working with data tipe Log (ROS).
Multiple objects are saved in Log file like this:
header:
seq: 2
stamp:
secs: 1596526199
nsecs: 140017032
frame_id: ''
level: 2
name: "/replicator_node"
msg: "Replicator node dumping to /tmp/replicator_dumps"
file: "replicator_node.py"
function: "__init__"
line: 218
topics: [/move_mongodb_entries/status, /move_mongodb_entries/goal, /move_mongodb_entries/result,
/move_mongodb_entries/cancel, /rosout, /move_mongodb_entries/feedback]
header:
seq: 2
stamp:
secs: 1596526198
nsecs: 848793029
frame_id: ''
level: 2
name: "/mongo_server"
msg: "2020-08-04T09:29:58.848+0200 [initandlisten] connection accepted from 127.0.0.1:58672\
\ #1 (1 connection now open)"
file: "mongodb_server.py"
function: "_mongo_loop"
line: 139
topics: [/rosout]
As you can see not everything is in same line as it's name.
I want to pars it to have object list - so I could access it like that:
object[1].msg would give me:
"2020-08-04T09:29:58.848+0200 [initandlisten] connection accepted from 127.0.0.1:58672 #1 (1 connection now open)"
Also, sometimes file name is something like: \home\nfoo\foo.py which results in log file as:
file: "\home
foo\foo.py"

It's an interesting exercise... Assuming that the structure is really consistent for all log entries, you can try something like this - pretty convoluted, but it works for the example in the question:
ros = """[your log above]"""
def manage_lists_2(log_ind, list_1, list_2, mystr):
if log_ind == 0:
list_1.append(mystr.split(':')[0].strip())
list_2[-log_ind].append(mystr.split(':')[1].strip())
m_keys2 = []
m_key_vals2 = [[],[]]
header_keys2 = []
header_key_vals2 = [[],[]]
stamp_keys2 = []
stamp_key_vals2 = [[],[]]
for log in logs:
for l in log.splitlines():
if l[0]!=" ":
items = [m_keys2, m_key_vals2]
elif l[0:3] != " ":
items = [header_keys2, header_key_vals2]
else:
items = [stamp_keys2, stamp_key_vals2]
manage_lists_2(logs.index(log), items[0], items[1], l)
for val in m_key_vals2:
for a, b, in zip(m_keys2,val):
print(a, ": ",b)
if a == "header":
for header_key in header_keys2:
print('\t',header_key,':',header_key_vals2[m_keys2.index(a)][header_keys2.index(header_key)])
if header_key == "stamp":
for stamp_key in stamp_keys2:
print('\t\t',stamp_key,':',stamp_key_vals2[m_keys2.index(a)][stamp_keys2.index(stamp_key)])
print('---')
Output:
header :
seq : 2
stamp :
secs : 1596526199
nsecs : 140017032
frame_id : 'one id'
level : 2
name : "/replicator_node"
msg : "Replicator node dumping to /tmp/replicator_dumps"
file : "replicator_node.py"
function : "__init__"
line : 218
topics : [/move_mongodb_entries/status, /move_mongodb_entries/goal, /move_mongodb_entries/result, /move_mongodb_entries/cancel, /rosout, /move_mongodb_entries/feedback]
---
header :
seq : 2
stamp :
secs : 1596526199
nsecs : 140017032
frame_id : 'one id'
level : 3
name : "/mongo_server"
msg : "2020-08-04T09
file : "mongodb_server.py"
function : "_mongo_loop"
line : 139
topics : [/rosout]
Having gone through that, I would recommend that - if you are going to do this on a regular basis - you find a way to store the data in xml format; it's a natural fit for it.

Regular Expression with python need to select closest section

Need some help with RegEx over python.
I have this text:
part101_add(
name = "part101-1",
dev2_serial = "dev_l622_01",
serial_port = "/dev/tty-part101-1",
yok_serial = "YT8388"
)
yok_tar_add("YT8388", None)
part2_add(
name = "part2-1",
serial_number = "SERIALNUMBER",
serial_port = "/dev/tty-part2-1",
yok_serial = "YT03044",
yok_port_board = "N"
)
yok_tar_add("YT03044", None)
I need to select all part*_add and its content.
for example:
part101_add:
name = "part101-1",
dev2_serial = "dev_l622_01",
serial_port = "/dev/tty-part101-1",
yok_serial = "YT8388"
part2_add:
serial_number = "SERIALNUMBER",
serial_port = "/dev/tty-part2-1",
yok_serial = "YT03044",
yok_port_board = "N"
problem is that im unable to separate the results.
when using this pattern:
regex = r"(.*?_add)\([\s\S.]*\)"
Thanks for your help.

I would precise the pattern to only match at the start and end of the line, and use a lazy quantifier with [\s\S]:
r"(?m)^(part\d+_add)\([\s\S]*?\)$"
See this regex demo
Details:
(?m) - an inline re.MULTILINE modifier version to make ^ match the line start and $ to match the line end
^ - start of a line
(part\d+_add) - Group 1 capturing part, 1+ digits, _add
\( - a literal (
[\s\S]*? - any 0+ chars, as few as possible up to
\)$ - a ) at the end of the line.

ignore certain word on if condition

How to create an if condition on a string that fails if there is anything other than "Ignore keyword" followed by "foo- "
Eg: following two strings should pass and fail respectively:
success = '\nrandom stuff,foo- Ignore keyword\nfoo+ Ignore
keyword\n random stuff;\nfoo- Ignore keyword'
fail = '\nrandom stuff,foo- Ignore keyword\nfoo+ Ignore
keyword\n random stuff;\nfoo- Ignore keyword\nfoo- this
should fail'
I was trying along this line and wasn't able to make it work:
In [80]: if 'foo- ' in fail or re.search('.*Ignore.*keyword', fail):
print 'fail'
....:
fail
In [81]: if 'foo- ' in success or re.search('.*Ignore.*keyword', success):
print 'fail'
....:
fail

Now it does exactly what were you asking for:
success = '\nrandom stuff,foo- Ignore keyword\nfoo+ Ignore keyword\n random stuff;\nfoo- Ignore keyword'
fail = '\nrandom stuff,foo- Ignore keyword\nfoo+ Ignore keyword\n random stuff;\nfoo- Ignore keyword\nfoo- this should fail'
# to be successful only "Ignore keyword" after "foo- "
elem_to_search = "foo- "
keyword = "Ignore keyword"
def check_sentence(elem_to_search, keyword, string_to_check):
index = 0
keyword_first = len(elem_to_search)
keyword_last = len(elem_to_search) + len(keyword)
while index < len(string_to_check):
index = string_to_check.find(elem_to_search, index)
if index == -1:
break
elif string_to_check[index + keyword_first : index + keyword_last] == keyword:
index += len(elem_to_search)
else:
print "fail"
break
check_sentence("foo- ", "Ignore keyword", success) will pass
check_sentence("foo- ", "Ignore keyword", fail) will print fail.

Determine if a line has parenthesis or quotes that are not closed in python

Im looking for a simple way to open a file, and search each line to see if that line has unclosed parens and quotes. If the line has unclosed parens/quotes, i want to print that line to a file. I know i could do it with an ugly blob of if/for statements, but i know that python probably has a better way with either the re module (which i know nothing about) or something else but i dont know the language well enough to do so.
Thanks!
Edit: some example lines. May be easier to read if you copy this into notepad or something and turn off word wrap (some lines can be quite long). Also, there are over 100k lines in the file so something efficent would be great!
SL ID=0X14429A0B TY=STANDARD OWN=0X429A03 EXT=22 SLTK=0X1C429A0B MP=0X684003F0 SUB=0X24400007
RT ID=0X18429A19 TY=CALONSC OWN=0X14429A0B EXLP=0X14429A0C CMDS=(N:0X8429A04,C:0X14429A0B) SGCC=2 REL=1 DESC="AURANT YD-INDSTRY LD" ATIS=T
RT ID=0X18429A1A TY=CALONSC OWN=0X14429A0B EXLP=0X14429A08 CMDS=(R:0X8429A04,N:0X8429A05,C:0X14429A0B) SGCC=2 REL=2 DESC="AURANT YD TO TRK.1" ATIS=T
RT ID=0X18429A1B TY=CALONSC OWN=0X14429A0B EXLP=0X14429A0A CMDS=(R:0X8429A04,R:0X8429A05,C:0X14429A0B) SGCC=2 REL=3 DESC="AURANT YD TO TRK.2" ATIS=T
SL ID=0X14429A0C TY=STANDARD OWN=0X429A03 EXT=24 SLTK=0X1C429A0B MP=0X684003F1 SUB=0X24400007
RT ID=0X18429A1C TY=CALONSC OWN=0X14429A0C EXLP=0X14429A0B CMDS=(N:0X8429A04,C:0X14429A0C) SGCC=2 REL=1 DESC="AURANT YD-INDSTRY LD" ATIS=T
TK ID=0X1C429A08 TY=BLKTK OWN=0X429A03 EXT=12 LRMP=0X6C40BDAF LEN=5837 FSPD=60 PSPD=65 QUAL=TRK.1 MAXGE=0 MAXGW=0 JAL=4 ALT=12 SUB=0X24400007 RULES=(CTC:B:UP:0X24400007:485.7305:486.8359:T) LLON=-118.1766772 RLON=-118.1620059 LLAT=34.06838375 RLAT=34.07811764 LELE=416.6983 RELE=425.0596 ULAD=NO URAD=NO
PT ID=0X20429A0F TY=STANDARD OWN=0X1C429A08 LTK=0X1C40006C RTK=0X1C429A0C REL=1 LEN=1 LQUAL="TRK.1" RQUAL="TRK.1"
PTK OWN=0X1C429A08 PID=0X1C429A13

If you don't think there will be backwards unmatched parens (i.e. ")(") you can do this:
with open("myFile.txt","r") as readfile, open("outFile.txt","w") as outfile:
for line in readfile:
if line.count("(") != line.count(")") or line.count('"') % 2 != 0:
outfile.write(line)
Otherwise you will have to count them one at a time to check for mismatches, like this:
with open("myFile.txt","r") as readfile, open("outFile.txt","w") as outfile:
for line in readfile:
count = 0
for char in line:
if char == ")":
count -= 1
elif char == "(":
count += 1
if count < 0:
break
if count != 0 or text.count('"') % 2 != 0:
outfile.write(line)
I can't think of any better way to handle it. Python doesn't support recursive regular expressions, so a regular expression solution is right out.
One more thing about this: Given your data, it might be better to put that into a function and split your strings, which is easy to do with a regex, like this:
import re
splitre = re.compile(".*?=(.*?)(?:(?=\s*?\S*?=)|(?=\s*$))")
with open("myFile.txt","r") as readfile, open("outFile.txt","w") as outfile:
for line in readfile:
def matchParens(text):
count = 0
for char in text:
if char == ")":
count -= 1
elif char == "(":
count += 1
if count < 0:
break
return count != 0 or text.count('"') % 2 != 0
if any(matchParens(text) for text in splitre.findall(line)):
outfile.write(line)
The reason why that might be better is that it checks each value pair individually, that way if you have an open paren in one value pair and a close paren in a later one, it won't think that there are no unbalanced parens.

It may seem like overkill to use a parser package, but it's pretty quick:
text = """\
SL ID=0X14429A0B TY=STANDARD OWN=0X429A03 EXT=22 SLTK=0X1C429A0B MP=0X684003F0 SUB=0X24400007
RT ID=0X18429A19 TY=CALONSC OWN=0X14429A0B EXLP=0X14429A0C CMDS=(N:0X8429A04,C:0X14429A0B) SGCC=2 REL=1 DESC="AURANT YD-INDSTRY LD" ATIS=T
RT ID=0X18429A1A TY=CALONSC OWN=0X14429A0B EXLP=0X14429A08 CMDS=(R:0X8429A04,N:0X8429A05,C:0X14429A0B) SGCC=2 REL=2 DESC="AURANT YD TO TRK.1" ATIS=T
RT ID=0X18429A1B TY=CALONSC OWN=0X14429A0B EXLP=0X14429A0A CMDS=(R:0X8429A04,R:0X8429A05,C:0X14429A0B) SGCC=2 REL=3 DESC="AURANT YD TO TRK.2" ATIS=T
SL ID=0X14429A0C TY=STANDARD OWN=0X429A03 EXT=24 SLTK=0X1C429A0B MP=0X684003F1 SUB=0X24400007
RT ID=0X18429A1C TY=CALONSC OWN=0X14429A0C EXLP=0X14429A0B CMDS=(N:0X8429A04,C:0X14429A0C) SGCC=2 REL=1 DESC="AURANT YD-INDSTRY LD" ATIS=T
TK ID=0X1C429A08 TY=BLKTK OWN=0X429A03 EXT=12 LRMP=0X6C40BDAF LEN=5837 FSPD=60 PSPD=65 QUAL=TRK.1 MAXGE=0 MAXGW=0 JAL=4 ALT=12 SUB=0X24400007 RULES=(CTC:B:UP:0X24400007:485.7305:486.8359:T) LLON=-118.1766772 RLON=-118.1620059 LLAT=34.06838375 RLAT=34.07811764 LELE=416.6983 RELE=425.0596 ULAD=NO URAD=NO
PT ID=0X20429A0F TY=STANDARD OWN=0X1C429A08 LTK=0X1C40006C RTK=0X1C429A0C REL=1 LEN=1 LQUAL="TRK.1" RQUAL="TRK.1"
PTK OWN=0X1C429A08 PID=0X1C429A13 GOOD
PTK OWN=0X1C429A(08 PID=0X1C429A13 BAD
PTK OWN=0X1C429A08 )PID=0X1C429A13 BAD
PTK OWN=0X1C(42(9A))08 PID=0X1C429A13 GOOD
PTK OWN=0X1C(42(9A))08 PID=0X1C42(9A13 BAD
PTK OWN=0X1C(42(9A))08 PID=0X1C42"("9A13 GOOD
"""
from pyparsing import nestedExpr, quotedString
paired_exprs = nestedExpr('(',')') | quotedString
for i, line in enumerate(text.splitlines(), start=1):
# use pyparsing expression to strip out properly nested quotes/parentheses
stripped_line = paired_exprs.suppress().transformString(line)
# if there are any quotes or parentheses left, they were not
# properly nested
if any(unwanted in stripped_line for unwanted in '()"\''):
print i, ':', line
Prints:
10 : PTK OWN=0X1C429A(08 PID=0X1C429A13 BAD
11 : PTK OWN=0X1C429A08 )PID=0X1C429A13 BAD
13 : PTK OWN=0X1C(42(9A))08 PID=0X1C42(9A13 BAD

Just extract all the interesting symbols from a line.
Push the opening symbols onto stack and pop from the stack whenever you get a
closing symbol.
If the stack is clean, the symbols are balanced. If
the stack underflows or doesn't get fully unwound you have unbalanced line.
Sample code for checking a line follows - I've inserted a stray bracket into the first line.
d = """SL ID=0X14429A0B TY=STANDARD OWN=0X429A(03 EXT=22 SLTK=0X1C429A0B MP=0X684003F0 SUB=0X24400007
RT ID=0X18429A19 TY=CALONSC OWN=0X14429A0B EXLP=0X14429A0C CMDS=(N:0X8429A04,C:0X14429A0B) SGCC=2 REL=1 DESC="AURANT YD-INDSTRY LD" ATIS=T
RT ID=0X18429A1A TY=CALONSC OWN=0X14429A0B EXLP=0X14429A08 CMDS=(R:0X8429A04,N:0X8429A05,C:0X14429A0B) SGCC=2 REL=2 DESC="AURANT YD TO TRK.1" ATIS=T
RT ID=0X18429A1B TY=CALONSC OWN=0X14429A0B EXLP=0X14429A0A CMDS=(R:0X8429A04,R:0X8429A05,C:0X14429A0B) SGCC=2 REL=3 DESC="AURANT YD TO TRK.2" ATIS=T
SL ID=0X14429A0C TY=STANDARD OWN=0X429A03 EXT=24 SLTK=0X1C429A0B MP=0X684003F1 SUB=0X24400007
RT ID=0X18429A1C TY=CALONSC OWN=0X14429A0C EXLP=0X14429A0B CMDS=(N:0X8429A04,C:0X14429A0C) SGCC=2 REL=1 DESC="AURANT YD-INDSTRY LD" ATIS=T
TK ID=0X1C429A08 TY=BLKTK OWN=0X429A03 EXT=12 LRMP=0X6C40BDAF LEN=5837 FSPD=60 PSPD=65 QUAL=TRK.1 MAXGE=0 MAXGW=0 JAL=4 ALT=12 SUB=0X24400007 RULES=(CTC:B:UP:0X24400007:485.7305:486.8359:T) LLON=-118.1766772 RLON=-118.1620059 LLAT=34.06838375 RLAT=34.07811764 LELE=416.6983 RELE=425.0596 ULAD=NO URAD=NO
PT ID=0X20429A0F TY=STANDARD OWN=0X1C429A08 LTK=0X1C40006C RTK=0X1C429A0C REL=1 LEN=1 LQUAL="TRK.1" RQUAL="TRK.1"
PTK OWN=0X1C429A08 PID=0X1C429A13"""
def unbalanced(line):
close_symbols = {'"' : '"', '(': ")", '[': ']', "'" : "'"}
syms = [x for x in line if x in '\'"[]()']
stack = []
for s in syms:
try:
if len(stack) > 0 and s == close_symbols[stack[-1]]:
stack.pop()
else:
stack.append(s)
except: # catches stack underflow or closing symbol lookup
return True
return len(stack) != 0
print unbalanced("hello 'there' () []")
print unbalanced("hello 'there\"' () []")
print unbalanced("][")
lines = d.splitlines() # in your case you can do open("file.txt").readlines()
print [line for line in lines if unbalanced(line)]
For large files, you don't want to read all the files into memory so use fragment like this instead:
with open("file.txt") as infile:
for line in infile:
if unbalanced(line):
print line

Regex - If you're lines contain no nested parentheses, the solution is pretty straightforward:
for line in myFile:
if re.search(r"\([^\(\)]*($|\()", line):
#this line contains unbalanced parentheses.
If you're working with the possibility of nested statements, it gets a little more complicated:
for line in myFile:
paren_stack = []
for char in line:
if char == '(':
paren_stack.append(char)
elif char == ')':
if paren_stack:
paren_stack.pop()
else:
#this line contains unbalanced parentheses.

check this code
from tokenize import *
def syntaxCheck(line):
def readline():
yield line
yield ''
par,quo,dquo = 0,0,0
count = { '(': (1,0,0),')': (-1,0,0),"'": (0,1,0),'"':(0,0,1) }
for countPar, countQuo,countDQuo in (
count.get(token,(0,0))+(token,) for _,token,_,_,_ in tokenize(readline().__next__)):
par += countPar
quo ^= countQuo
dquo ^= countDQuo
return par,quo,dquo
note that parentheses inside closed quotes doesnt count, as it counted as single string token.

I would just do something like:
for line in open(file, r):
if line.count('"') % 2 != 0 or line.count('(') != line.count(')'):
print(line)
But I can't be sure that'll suit your needs exactly.
More robust:
for line in open(file, r):
paren_count = 0
paren_count_start_quote = 0
quote_open = False
for char in line:
if char == ')':
paren_count -= 1
elif char == '(':
paren_count += 1
elif char == '"':
quote_open = not quote_open
if quote_open:
paren_count_start_quote = paren_count
elif paren_count != paren_count_start_quote:
print(line)
break
if paren_count < 0:
break
if quote_open or paren_count != 0:
print(line)
Didn't test the robust one, should work I think. It now makes sure things like: ( " ) ", where a set of parens closed inside a quote print the line.

Should the parens and quotes be closed on each line? If that is the case, you could do a simple count on the parenthesis and quotes. If it's even, they're matched. If it's odd, one is missing. Put that logic in a function, dump the lines of the text file into an array, and call map to execute the function for each string in the array.
My python's rusty, but that's how I would do it assuming everything "should" be on the same line.

Well my solution may not be as fancy, but I say you just count the number of parenthesis and quotes. If it doesn't come out to an even number, you know you're missing something!

Python: load text as python object [duplicate]

This question already has answers here:
How to convert raw javascript object to a dictionary?
(6 answers)
Closed 9 months ago.
I have a such text to load: https://sites.google.com/site/iminside1/paste
I'd prefer to create a python dictionary from it, but any object is OK. I tried pickle, json and eval, but didn't succeeded. Can you help me with this?
Thanks!
The results:
a = open("the_file", "r").read()
json.loads(a)
ValueError: Expecting property name: line 1 column 1 (char 1)
pickle.loads(a)
KeyError: '{'
eval(a)
File "<string>", line 19
from: {code: 'DME', airport: "Домодедово", city: 'Москва', country: 'Россия', terminal: ''},
^
SyntaxError: invalid syntax

Lifted almost straight from the pyparsing examples page:
# read text from web page
import urllib
page = urllib.urlopen("https://sites.google.com/site/iminside1/paste")
html = page.read()
page.close()
start = html.index("<pre>")+len("<pre>")+3 #skip over 3-byte header
end = html.index("</pre>")
text = html[start:end]
print text
# parse dict-like syntax
from pyparsing import (Suppress, Regex, quotedString, Word, alphas,
alphanums, oneOf, Forward, Optional, dictOf, delimitedList, Group, removeQuotes)
LBRACK,RBRACK,LBRACE,RBRACE,COLON,COMMA = map(Suppress,"[]{}:,")
integer = Regex(r"[+-]?\d+").setParseAction(lambda t:int(t[0]))
real = Regex(r"[+-]?\d+\.\d*").setParseAction(lambda t:float(t[0]))
string_ = Word(alphas,alphanums+"_") | quotedString.setParseAction(removeQuotes)
bool_ = oneOf("true false").setParseAction(lambda t: t[0]=="true")
item = Forward()
key = string_
dict_ = LBRACE - Optional(dictOf(key+COLON, item+Optional(COMMA))) + RBRACE
list_ = LBRACK - Optional(delimitedList(item)) + RBRACK
item << (real | integer | string_ | bool_ | Group(list_ | dict_ ))
result = item.parseString(text,parseAll=True)[0]
print result.data[0].dump()
print result.data[0].segments[0].dump(indent=" ")
print result.data[0].segments[0].flights[0].dump(indent=" - ")
print result.data[0].segments[0].flights[0].flightLegs[0].dump(indent=" - - ")
for seg in result.data[6].segments:
for flt in seg.flights:
fltleg = flt.flightLegs[0]
print "%(airline)s %(airlineCode)s %(flightNo)s" % fltleg,
print "%s -> %s" % (fltleg["from"].code, fltleg["to"].code)
Prints:
[['index', 0], ['serviceClass', '??????'], ['prices', [3504, ...
- eTicketing: true
- index: 0
- prices: [3504, 114.15000000000001, 89.769999999999996]
- segments: [[['indexSegment', 0], ['stopsCount', 0], ['flights', ...
- serviceClass: ??????
[['indexSegment', 0], ['stopsCount', 0], ['flights', [[['index', 0], ...
- flights: [[['index', 0], ['time', 'PT2H45M'], ['minAvailSeats', 9], ...
- indexSegment: 0
- stopsCount: 0
- [['index', 0], ['time', 'PT2H45M'], ['minAvailSeats', 9], ['flight...
- - flightLegs: [[['flightNo', '309'], ['eTicketing', 'true'], ['air...
- - index: 0
- - minAvailSeats: 9
- - stops: []
- - time: PT2H45M
- - [['flightNo', '309'], ['eTicketing', 'true'], ['airplane', 'Boe...
- - - airline: ?????????
- - - airlineCode: UN
- - - airplane: Boeing 737-500
- - - availSeats: 9
- - - classCode: I
- - - eTicketing: true
- - - fareBasis: IPROW
- - - flightClass: ECONOMY
- - - flightNo: 309
- - - from: - - [['code', 'DME'], ['airport', '??????????'], ...
- - - airport: ??????????
- - - city: ??????
- - - code: DME
- - - country: ??????
- - - terminal:
- - - fromDate: 2010-10-15
- - - fromTime: 10:40:00
- - - time:
- - - to: - - [['code', 'TXL'], ['airport', 'Berlin-Tegel'], ...
- - - airport: Berlin-Tegel
- - - city: ??????
- - - code: TXL
- - - country: ????????
- - - terminal:
- - - toDate: 2010-10-15
- - - toTime: 11:25:00
airBaltic BT 425 SVO -> RIX
airBaltic BT 425 SVO -> RIX
airBaltic BT 423 SVO -> RIX
airBaltic BT 423 SVO -> RIX
EDIT: fixed grouping and expanded output dump to show how to access individual key fields of results, either by index (within list) or as attribute (within dict).

If you really have to load the bulls... this data is (see my comment), you's propably best of with a regex adding missing quotes. Something like r"([a-zA-Z_][a-zA-Z_0-9]*)\s*\:" to find things to quote and r"\'\1\'\:" as replacement (off the top of my head, I have to test it first).
Edit: After some troulbe with backward-references in Python 3.1, I finally got it working with these:
>>> pattern = r"([a-zA-Z_][a-zA-Z_0-9]*)\s*\:"
>>> test = '{"foo": {bar: 1}}'
>>> repl = lambda match: '"{}":'.format(match.group(1))
>>> eval(re.sub(pattern, repl, test))
{'foo': {'bar': 1}}

Till now with help of delnan and a little investigation I can load it into dict with eval:
pattern = r"\b(?P<word>\w+):"
x = re.sub(pattern, '"\g<word>":',open("the_file", "r").read())
y = x.replace("true", '"true"')
d = eval(y)
Still looking for more efficient and maybe simpler solution.. I don't like to use "eval" for some reasons.

Extension of the DominiCane's version:
import re
quote_keys_regex = re.compile(r'([\{\s,])(\w+)(:)')
def js_variable_to_python(js_variable):
"""Convert a javascript variable into JSON and then load the value"""
# when in_string is not None, it contains the character that has opened the string
# either simple quote or double quote
in_string = None
# cut the string:
# r"""{ a:"f\"irst", c:'sec"ond'}"""
# becomes
# ['{ a:', '"', 'f\\', '"', 'irst', '"', ', c:', "'", 'sec', '"', 'ond', "'", '}']
l = re.split(r'(["\'])', js_variable)
# previous part (to check the escape character antislash)
previous_p = ""
for i, p in enumerate(l):
# parse characters inside a ECMA string
if in_string:
# we are in a JS string: replace the colon by a temporary character
# so quote_keys_regex doesn't have to deal with colon inside the JS strings
l[i] = l[i].replace(':', chr(1))
if in_string == "'":
# the JS string is delimited by simple quote.
# This is not supported by JSON.
# simple quote delimited string are converted to double quote delimited string
# here, inside a JS string, we escape the double quote
l[i] = l[i].replace('"', r'\"')
# deal with delimieters and escape character
if not in_string and p in ('"', "'"):
# we are not in string
# but p is double or simple quote
# that's the start of a new string
# replace simple quote by double quote
# (JSON doesn't support simple quote)
l[i] = '"'
in_string = p
continue
if p == in_string:
# we are in a string and the current part MAY close the string
if len(previous_p) > 0 and previous_p[-1] == '\\':
# there is an antislash just before: the JS string continue
continue
# the current p close the string
# replace simple quote by double quote
l[i] = '"'
in_string = None
# update previous_p
previous_p = p
# join the string
s = ''.join(l)
# add quote arround the key
# { a: 12 }
# becomes
# { "a": 12 }
s = quote_keys_regex.sub(r'\1"\2"\3', s)
# replace the surogate character by colon
s = s.replace(chr(1), ':')
# load the JSON and return the result
return json.loads(s)
It deals only with int, null and string. I don't know about float.
Note that the usage chr(1): the code doesn't work if this character in js_variable.

Develop Reference

Python is a programming language that lets you work quickly and integrate systems more effectively.

Python Regex: US phone number parsing - python

Related

How to parse Log file to object list

Regular Expression with python need to select closest section

ignore certain word on if condition

Determine if a line has parenthesis or quotes that are not closed in python

Python: load text as python object [duplicate]

Categories

Resources