pyparsing a field that may or may not contain values - python

I have a dataset that resembles the following:
Capture MICR - Serial: Pos44: Trrt: 32904 Acct: Tc: 2064 Opt4: Split:
The problem I am having is that I can't figure out how to properly write a capture for the "Capture MICR - Serial" field. This field could either be blank or contain an alphanumeric value of varying length (I have the same problem with the other fields, which could also be either populated or blank).
I have tried some variations of the following, but am still coming up short.
pp.Literal("Capture MICR - Serial:") + pp.White(" ", min=1, max=0) + (pp.Word(pp.printables) ^ pp.White(" ", min=1, max=0))("crd_micr_serial") + pp.FollowedBy(pp.Literal("Pos44:"))
I think that part of the problem is that the Or matches a parse for the longest match, which in this case could be a long whitespace character, with a single alphanumeric, but I would still want to capture the single value.
Thanks for everyone's help.

The simplest way to parse text like "A: valueA B: valueB C: valueC" is to use pyparsing's SkipTo class:
a_expr = "A:" + SkipTo("B:")
b_expr = "B:" + SkipTo("C:")
c_expr = "C:" + SkipTo(LineEnd())
line_parser = a_expr + b_expr + c_expr
I'd like to enhance this just a bit more:
add a parse action to strip off leading and trailing whitespace
add a results name to make it easy to get the results after the line has been parsed
Here is how that simple parser looks:
from pyparsing import LineEnd, SkipTo
NL = LineEnd()
a_expr = "A:" + SkipTo("B:").addParseAction(lambda t: [t[0].strip()])('A')
b_expr = "B:" + SkipTo("C:").addParseAction(lambda t: [t[0].strip()])('B')
c_expr = "C:" + SkipTo(NL).addParseAction(lambda t: [t[0].strip()])('C')
line_parser = a_expr + b_expr + c_expr
line_parser.runTests("""
A: 100 B: Fred C:
A: B: a value with spaces C: 42
""")
Gives:
A: 100 B: Fred C:
['A:', '100', 'B:', 'Fred', 'C:', '']
- A: '100'
- B: 'Fred'
- C: ''
A: B: a value with spaces C: 42
['A:', '', 'B:', 'a value with spaces', 'C:', '42']
- A: ''
- B: 'a value with spaces'
- C: '42'
I try to avoid copy/paste code when I can, and would rather automate the "A is followed by B" and "C is followed by end-of-line" logic with a list describing the different prompt strings, and then walk that list to build each sub-expression:
import pyparsing as pp
def make_prompt_expr(s):
    '''Define the expression for prompts as 'ABC:' '''
    return pp.Combine(pp.Literal(s) + ':')
def make_field_value_expr(next_expr):
    '''Define the expression for the field value as SkipTo(what comes next)'''
    return pp.SkipTo(next_expr).addParseAction(lambda t: [t[0].strip()])
def make_name(s):
    '''Convert prompt string to identifier form for results names'''
    return ''.join(s.split()).replace('-','_')
# use split to easily define list of prompts in order - makes it easy to update later if new prompts are added
prompts = "Capture MICR - Serial/Pos44/Trrt/Acct/Tc/Opt4/Split".split('/')
# keep a list of all the prompt-value expressions
exprs = []
# get a list of this-prompt, next-prompt pairs
for this_, next_ in zip(prompts, prompts[1:] + [None]):
    field_name = make_name(this_)
    if next_ is not None:
        next_expr = make_prompt_expr(next_)
    else:
        next_expr = pp.LineEnd()
    # define the prompt-value expression for the current prompt string and add to exprs
    this_expr = make_prompt_expr(this_) + make_field_value_expr(next_expr)(field_name)
    exprs.append(this_expr)
# define a line parser as the And of all of the generated exprs
line_parser = pp.And(exprs)
line_parser.runTests("""\
Capture MICR - Serial: Pos44: Trrt: 32904 Acct: Tc: 2064 Opt4: Split:
Capture MICR - Serial: 1729XYZ Pos44: Trrt: 32904 Acct: Tc: 2064 Opt4: XXL Split: 50
""")
Gives:
Capture MICR - Serial: Pos44: Trrt: 32904 Acct: Tc: 2064 Opt4: Split:
['Capture MICR - Serial:', '', 'Pos44:', '', 'Trrt:', '32904', 'Acct:', '', 'Tc:', '2064', 'Opt4:', '', 'Split:', '']
- Acct: ''
- CaptureMICR_Serial: ''
- Opt4: ''
- Pos44: ''
- Split: ''
- Tc: '2064'
- Trrt: '32904'
Capture MICR - Serial: 1729XYZ Pos44: Trrt: 32904 Acct: Tc: 2064 Opt4: XXL Split: 50
['Capture MICR - Serial:', '1729XYZ', 'Pos44:', '', 'Trrt:', '32904', 'Acct:', '', 'Tc:', '2064', 'Opt4:', 'XXL', 'Split:', '50']
- Acct: ''
- CaptureMICR_Serial: '1729XYZ'
- Opt4: 'XXL'
- Pos44: ''
- Split: '50'
- Tc: '2064'
- Trrt: '32904'
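For comparison, roughly the same split can be sketched with only the standard library's re module. This is a swapped-in technique, not the pyparsing approach above. The names PROMPTS and split_fields are hypothetical, and the sketch assumes the prompts always appear in this fixed order (Python 3).

```python
import re

# Prompts in the order they are assumed to appear on every line.
PROMPTS = ["Capture MICR - Serial", "Pos44", "Trrt", "Acct", "Tc", "Opt4", "Split"]

def split_fields(line, prompts=PROMPTS):
    """Return {prompt: value}, with '' for blank fields, or None if the line does not fit."""
    # Each prompt is followed by a lazy capture that stops at the next prompt.
    parts = [re.escape(p) + r":\s*(.*?)\s*" for p in prompts]
    pattern = re.compile("".join(parts) + r"$")
    m = pattern.match(line.strip())
    if m is None:
        return None
    return dict(zip(prompts, m.groups()))

blank = split_fields("Capture MICR - Serial: Pos44: Trrt: 32904 Acct: Tc: 2064 Opt4: Split:")
filled = split_fields("Capture MICR - Serial: 1729XYZ Pos44: Trrt: 32904 Acct: Tc: 2064 Opt4: XXL Split: 50")
```

Unlike the pyparsing version, this gives no named-results object or error location; it either matches the whole line or returns None.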

Does this do what you want?
I used Combine merely so that both arms of the Or would produce similar results, i.e., with 'Pos44:' at the end of the result string, where it can be plucked off. I'm unhappy about resorting to a regex.
>>> import pyparsing as pp
>>> record_A = 'Capture MICR - Serial: Pos44: Trrt: 32904 Acct: Tc: 2064 Opt4: Split:'
>>> record_B = 'Capture MICR - Serial: 76ZXP67 Pos44: Trrt: 32904 Acct: Tc: 2064 Opt4: Split:'
>>> parser_fragment = pp.Combine(pp.White()+pp.Literal('Pos44:'))
>>> parser = pp.Literal('Capture MICR - Serial:')+pp.Or([parser_fragment,pp.Regex('.*?(?:Pos44\:)')])
>>> parser.parseString(record_A)
(['Capture MICR - Serial:', ' Pos44:'], {})
>>> parser.parseString(record_B)
(['Capture MICR - Serial:', '76ZXP67 Pos44:'], {})

Related

Python Regex: US phone number parsing

I am a complete newbie in Regex.
I need to parse US phone numbers in different formats into 3 strings: area code (no '()'), next 3 digits, last 4 digits. No '-'.
I also need to reject (message Error):
916-111-1111 ('-' after the area code)
(916)111 -1111 (white space before '-')
( 916)111-1111 (any space inside the area code); (916 ) must be rejected too
(a56)111-1111 (any non-digits inside of area code)
lack of '()' for the area code
It should accept: ' (916) 111-1111 ' (spaces anywhere except as above)
Here is my regex:
^\s*\(?(\d{3})[\)\-][\s]*?(\d{3})[-]?(\d{4})\s*$
This took me 2 days.
It did not reject 916-111-1111 (a '-' after the area code). I am sure there are some other deficiencies.
I would appreciate your help very much. Even hints.
Valid:
'(916) 111-1111'
'(916)111-1111 '
' (916) 111-1111'
Invalid:
'916-111-1111' - no () or '-' after area code
'(916)111 -1111' - no space before '-'
'( 916)111-1111' - no space inside ()
'(abc) 111-11i1' because of non-digits
You can do this:
import re
r = r'\((\d{3})\)\s*?(\d{3})\-(\d{4,5})'
l = ['(916) 111-11111', '(916)111-1111 ', ' (916) 111-1111', '916-111-1111', '(916)111 -1111', '( 916)111-1111', '(abc) 111-11i1']
print([re.findall(r, x) for x in l])
# [[('916', '111', '11111')], [('916', '111', '1111')], [('916', '111', '1111')], [], [], [], []]
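The pattern above is unanchored and allows a fifth trailing digit. If the goal is to enforce the rules in the question exactly, a stricter variant might look like the sketch below. PHONE_RE and parse_us_phone are hypothetical names, and the rules encoded are my reading of the question (Python 3).

```python
import re

# Assumed rules, taken from the question: parentheses are required, no spaces
# inside them, optional spaces after ')', and no space before the '-'.
PHONE_RE = re.compile(r"\s*\((\d{3})\)\s*(\d{3})-(\d{4})\s*")

def parse_us_phone(text):
    """Return (area, prefix, line) as strings, or None if the text breaks a rule."""
    m = PHONE_RE.fullmatch(text)
    return m.groups() if m else None

valid = ["(916) 111-1111", "(916)111-1111 ", " (916) 111-1111"]
invalid = ["916-111-1111", "(916)111 -1111", "( 916)111-1111",
           "(916 )111-1111", "(abc) 111-11i1", "(916) 111-11111"]
parsed = [parse_us_phone(s) for s in valid]
rejected = [parse_us_phone(s) for s in invalid]
```

Using fullmatch means the whole input has to fit the pattern, so stray characters before or after the number cause a rejection rather than a partial match.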
You can simplify the regex if you would rather (1) offer a forgiving user interface instead of asking users to fix their input, or (2) just extract the digits for backend storage:
"(\d{1,3})\D*(\d{3})\D*(\d{4})"
Since you want to print an error message, the captured groups have to be rechecked against each failure condition, as follows:
Code:
import re
def get_failed_reason(s):
    space_regex = r"(\s+)"
    area_code_regex = r"\s*(\D*)(\d{1,3})(\D*)(\d{3})(\D*)-(\d{4})"
    results = re.findall(area_code_regex, s)
    if 0 == len(results):
        area_code_alpha_regex = r"\((\D+)\)"
        results = re.findall(area_code_alpha_regex, s)
        if len(results) > 0:
            return "because of non-digits"
        return "no matches"
    results = results[0]
    space_results = re.findall(space_regex, results[0])
    if 0 == len(space_results):
        space_results = re.findall(space_regex, results[2])
    if 0 != len(space_results):
        return "no space inside ()"
    alpha_code_regex = r"(\D+)"
    alpha_results = re.findall(alpha_code_regex, results[0])
    if 0 == len(alpha_results):
        alpha_results = re.findall(alpha_code_regex, results[2])
    if 0 != len(alpha_results):
        if "(" not in results[0] or ")" not in results[2]:
            return "no () or '-' after area code"
        if 0 != len(results[-2]):
            return "no space before '-'"
        return "because of non-digits in area code"
    return "unspecified"
if __name__ == '__main__':
    phone_numbers = ["916-111-1111", "(916)111-1111", "(916)111 -1111 ", " (916) 111-1111", "(916 )111-1111", "( 916)111-1111", "- (916 )111-1111", "(a56)111-1111", "(56a)111-1111", "(916) 111-1111 ", "(abc) 111-1111"]
    valid_regex = r"\s*(\()(\d{1,3})(\))(\D*)(\d{3})([^\d\s]*)-(\d{4})"
    for phone_number_str in phone_numbers:
        results = re.findall(valid_regex, phone_number_str)
        if 0 == len(results):
            reason = get_failed_reason(phone_number_str)
            phone_number_str = f"[{phone_number_str}]"
            print(f"[main] Failed:\t{phone_number_str: <30}- {reason}")
            continue
        area_code = results[0][1]
        first_number = results[0][4]
        second_number = results[0][6]
        phone_number_str = f"[{phone_number_str}]"
        print(f"[main] Valid:\t{phone_number_str: <30}- Area code: {area_code}, First number: {first_number}, Second number: {second_number}")
Result:
[main] Failed: [916-111-1111] - no () or '-' after area code
[main] Valid: [(916)111-1111] - Area code: 916, First number: 111, Second number: 1111
[main] Failed: [(916)111 -1111 ] - no space before '-'
[main] Valid: [ (916) 111-1111] - Area code: 916, First number: 111, Second number: 1111
[main] Failed: [(916 )111-1111] - no space inside ()
[main] Failed: [( 916)111-1111] - no space inside ()
[main] Failed: [- (916 )111-1111] - no space inside ()
[main] Failed: [(a56)111-1111] - because of non-digits in area code
[main] Failed: [(56a)111-1111] - because of non-digits in area code
[main] Valid: [(916) 111-1111 ] - Area code: 916, First number: 111, Second number: 1111
[main] Failed: [(abc) 111-1111] - because of non-digits
Note: \D represents non-digit characters.

pattern to dictionary of lists Python

I have a file like this
module1 instance1(.wire1 (connectionwire1), .wire2 (connectionwire2),.... ,wire100 (connectionwire100)) ; module 2 instance 2(.wire1 (newconnectionwire1), .wire2 (newconnectionwire2),.... ,wire99 (newconnectionwire99))
The wires are repeated across modules. There can be many modules.
I want to build a dictionary like this (not every wire in 2nd module is a duplicate).
[wire1:[(module1, instance1, connection1), (module2, instance2,newconnection1), wire2:[(module1 instance1 connection2),(module2, instance2,newconnection1)]... wire99:module2, instance2, connection99), ]
I am splitting the string on ; then splitting on , and then ( to get wire and connectionwire strings . I am not sure how to fill the data structure though so the wire is the key and module, instancename and connection are elements.
Goal- get this datastructure- [ wire: (module, instance, connectionwire) ]
filedata=file.read()
realindex=list(find_pos(filedata,';'))
tempindex=0
for l in realindex:
    module=filedata[tempindex:l]
    modulename=module.split()[0]
    openbracketindex=module.find("(")
    closebracketindex=module.strip("\n").find(");")
    instancename=module[:openbracketindex].split()[1]
    tempindex=l
    tempwires=module[openbracketindex:l+1]
    #got to split wires on commas
    for tempw in tempwires.split(","):
        wires=tempw
        listofwires.append(wires)
Using the re module.
import re
from collections import defaultdict
s = "module1 instance1(.wire1 (connectionwire1), .wire2 (connectionwire2), .wire100 (connectionwire100)) ; module2 instance2(.wire1 (newconnectionwire1), .wire2 (newconnectionwire2), wire99 (newconnectionwire99))"
d = defaultdict(list)
module_pattern = r'(\w+)\s(\w+)\(([^;]+)'
mod_rex = re.compile(module_pattern)
wire_pattern = r'\.(\w+)\s\(([^\)]+)'
wire_rex = re.compile(wire_pattern)
for match in mod_rex.finditer(s):
    #print '\n'.join(match.groups())
    module, instance, wires = match.groups()
    for match in wire_rex.finditer(wires):
        wire, connection = match.groups()
        #print '\t', wire, connection
        d[wire].append((module, instance, connection))
for k, v in d.items():
    print k, ':', v
Produces
wire1 : [('module1', 'instance1', 'connectionwire1'), ('module2', 'instance2', 'newconnectionwire1')]
wire2 : [('module1', 'instance1', 'connectionwire2'), ('module2', 'instance2', 'newconnectionwire2')]
wire100 : [('module1', 'instance1', 'connectionwire100')]
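Note that wire99 is missing from this output because it has no leading dot in the sample string. If that is a typo in the input rather than a different construct (an assumption on my part), a variant with an optional dot might look like the sketch below. It is written for Python 3, and index_wires, MODULE_RE and WIRE_RE are hypothetical names.

```python
import re
from collections import defaultdict

# module name, instance name, then everything up to the next ';'
MODULE_RE = re.compile(r"(\w+)\s+(\w+)\s*\(([^;]+)")
# The leading dot is optional here, which is an assumption about the input.
WIRE_RE = re.compile(r"\.?(\w+)\s*\(([^)]+)\)")

def index_wires(text):
    """Map each wire name to a list of (module, instance, connection) tuples."""
    wires = defaultdict(list)
    for module, instance, body in MODULE_RE.findall(text):
        for wire, connection in WIRE_RE.findall(body):
            wires[wire].append((module, instance, connection))
    return dict(wires)

sample = ("module1 instance1(.wire1 (c1), .wire2 (c2)) ; "
          "module2 instance2(.wire1 (n1), wire99 (n99))")
table = index_wires(sample)
```

defaultdict(list) is what removes the need to check whether a wire has been seen before; the final dict() call only makes the result print like an ordinary dictionary.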
The answer provided by wwii using re is correct. I'm sharing an example of how you can solve your problem using the pyparsing module, which makes parsers readable and easy to write.
from pyparsing import Word, alphanums, Optional, ZeroOrMore, Literal, Group, OneOrMore
from collections import defaultdict
s = 'module1 instance1(.wire1 (connectionwire1), .wire2 (connectionwire2), .wire100 (connectionwire100)) ; module2 instance2(.wire1 (newconnectionwire1), .wire2 (newconnectionwire2), .wire99 (newconnectionwire99))'
connection = Word(alphanums)
wire = Word(alphanums)
module = Word(alphanums)
instance = Word(alphanums)
dot = Literal(".").suppress()
comma = Literal(",").suppress()
lparen = Literal("(").suppress()
rparen = Literal(")").suppress()
semicolon = Literal(";").suppress()
wire_connection = Group(dot + wire("wire") + lparen + connection("connection") + rparen + Optional(comma))
wire_connections = Group(OneOrMore(wire_connection))
module_instance = Group(module("module") + instance("instance") + lparen + ZeroOrMore(wire_connections("wire_connections")) + rparen + Optional(semicolon))
module_instances = OneOrMore(module_instance)
results = module_instances.parseString(s)
# create a dict
d = defaultdict(list)
for r in results:
    m = r['module']
    i = r['instance']
    for wc in r['wire_connections']:
        w = wc['wire']
        c = wc['connection']
        d[w].append((m, i, c))
print d
Output:
defaultdict(<type 'list'>, {'wire1': [('module1', 'instance1', 'connectionwire1'), ('module2', 'instance2', 'newconnectionwire1')], 'wire2': [('module1', 'instance1', 'connectionwire2'), ('module2', 'instance2', 'newconnectionwire2')], 'wire100': [('module1', 'instance1', 'connectionwire100')], 'wire99': [('module2', 'instance2', 'newconnectionwire99')]})

Python - how to parse this with regex correctly? its parsing all the E.164 but except the local format

It's working for 0032, 32, +32 but not for 0487365060 (which is valid too)
to_user = "0032487365060"
# ^(?:\+|00)(\d+)$ Parse the 0032, 32, +32 & 0487365060
match = re.search(r'^(?:\+|00)(\d+)$', to_user)
to_user = "32487365060"
match = re.search(r'^(?:\+|00)(\d+)$', to_user)
to_user = "+32487365060"
match = re.search(r'^(?:\+|00)(\d+)$', to_user)
Not working:
to_user = "0487365060"
match = re.search(r'^(?:\+|00)(\d+)$', to_user)
Your last example doesn't work because it does not start with either + or 00. If that is optional you need to mark it as such:
r'^(?:\+|00)?(\d+)$'
Note that neither does your second example match; it doesn't start with + or 00 either.
Demo:
>>> import re
>>> samples = ('0032487365060', '32487365060', '+32487365060', '0487365060')
>>> pattern = re.compile(r'^(?:\+|00)?(\d+)$')
>>> for sample in samples:
...     match = pattern.search(sample)
...     if match is not None:
...         print 'matched:', match.group(1)
...     else:
...         print 'Sample {} did not match'.format(sample)
...
matched: 32487365060
matched: 32487365060
matched: 32487365060
matched: 0487365060
Taking account of the question AND the comment, and in the absence of more info (particularly on the length of the sequence of digits that must follow the 32 part, and whether it is always 32 or may be another sequence), what I finally understand you want can be obtained with:
import re
for to_user in ("0032487365060",
                "32487365060",
                "+32487365060",
                "0487365060"):
    m = re.sub('^(?:\+32|0032|32|0)(\d{9})$','32\\1', to_user)
    print m
Something like this, based on @eyquem's method: it converts any international number written with + or 00 into the form without the prefix, and only for Belgian local numbers does it default to 32 plus the number:
import re
for to_user in (# Belgium
                "0032487365060",
                "32487365060",
                "+32487365060",
                "0487365060",
                # USA
                "0012127773456",
                "12127773456",
                "+12127773456",
                # UK
                "004412345678",
                "4412345678",
                "+4412345678"):
    m = re.sub('^(?:\+|00|32|0)(\d{9})$','32\\1', to_user)
    m = m.replace("+","")
    m = re.sub('^(?:\+|00)(\d+)$', '\\1', m)
    print m
Output:
32487365060
32487365060
32487365060
32487365060
12127773456
12127773456
12127773456
4412345678
4412345678
4412345678
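The chained substitutions above can be hard to follow, so here is a sketch of the same normalization as one function with explicit cases (Python 3). The name normalize and the default_cc parameter are hypothetical, and the rule that a single leading 0 marks a national number is an assumption that holds for the Belgian examples here but not for every country.

```python
import re

def normalize(number, default_cc="32"):
    """Return the digits with a country code and without a '+' or '00' prefix."""
    # International form: '+' or '00' followed by country code and number.
    m = re.fullmatch(r"(?:\+|00)(\d+)", number)
    if m:
        return m.group(1)
    # Assumed national form: a single leading 0, replaced by the default country code.
    m = re.fullmatch(r"0(\d+)", number)
    if m:
        return default_cc + m.group(1)
    # Bare digits are assumed to carry a country code already.
    if re.fullmatch(r"\d+", number):
        return number
    raise ValueError("not a phone number: %r" % number)

belgian = [normalize(n) for n in ("0032487365060", "32487365060", "+32487365060", "0487365060")]
others = [normalize(n) for n in ("0012127773456", "12127773456", "+4412345678")]
```

The order of the checks matters: the '00' prefix has to be tested before the single-0 case, otherwise 0032... would be treated as a national number.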
Why not use the phonenumbers library?
>>> import phonenumbers
>>> phonenumbers.parse("0487365060", "BE")
PhoneNumber(country_code=32, national_number=487365060, extension=None, italian_leading_zero=None, number_of_leading_zeros=None, country_code_source=0, preferred_domestic_carrier_code=None)
and the other three are OK too
>>> phonenumbers.parse("0032487365060", "BE")
PhoneNumber(country_code=32, national_number=487365060, extension=None, italian_leading_zero=None, number_of_leading_zeros=None, country_code_source=0, preferred_domestic_carrier_code=None)
>>> phonenumbers.parse("+320487365060", "BE")
PhoneNumber(country_code=32, national_number=487365060, extension=None, italian_leading_zero=None, number_of_leading_zeros=None, country_code_source=0, preferred_domestic_carrier_code=None)
>>> phonenumbers.parse("320487365060", "BE")
PhoneNumber(country_code=32, national_number=487365060, extension=None, italian_leading_zero=None, number_of_leading_zeros=None, country_code_source=0, preferred_domestic_carrier_code=None)

Can't get the UNICODE chars

I have run into a problem while trying to get the Unicode characters and put them in a list. The problem is that I'm getting the hex codes of the symbols and not the symbols themselves.
Can anyone help me with that?
My code:
KeysLst = []
for i in range(1000, 1100):
    char = unichr(i)
    KeysLst.append(char)
print KeysLst
Output:
[u'\u03e8', u'\u03e9', u'\u03ea', u'\u03eb', u'\u03ec', u'\u03ed', u'\u03ee', u'\u03ef', u'\u03f0', u'\u03f1', u'\u03f2', u'\u03f3', u'\u03f4', u'\u03f5', u'\u03f6', u'\u03f7', u'\u03f8', u'\u03f9', u'\u03fa', u'\u03fb', u'\u03fc', u'\u03fd', u'\u03fe', u'\u03ff', u'\u0400', u'\u0401', u'\u0402', u'\u0403', u'\u0404', u'\u0405', u'\u0406', u'\u0407', u'\u0408', u'\u0409', u'\u040a', u'\u040b', u'\u040c', u'\u040d', u'\u040e', u'\u040f', u'\u0410', u'\u0411', u'\u0412', u'\u0413', u'\u0414', u'\u0415', u'\u0416', u'\u0417', u'\u0418', u'\u0419', u'\u041a', u'\u041b', u'\u041c', u'\u041d', u'\u041e', u'\u041f', u'\u0420', u'\u0421', u'\u0422', u'\u0423', u'\u0424', u'\u0425', u'\u0426', u'\u0427', u'\u0428', u'\u0429', u'\u042a', u'\u042b', u'\u042c', u'\u042d', u'\u042e', u'\u042f', u'\u0430', u'\u0431', u'\u0432', u'\u0433', u'\u0434', u'\u0435', u'\u0436', u'\u0437', u'\u0438', u'\u0439', u'\u043a', u'\u043b', u'\u043c', u'\u043d', u'\u043e', u'\u043f', u'\u0440', u'\u0441', u'\u0442', u'\u0443', u'\u0444', u'\u0445', u'\u0446', u'\u0447', u'\u0448', u'\u0449', u'\u044a', u'\u044b']
You did get unicode characters.
However, Python is showing you unicode literal escapes, to make debugging easier. Those u'\u03e8' values are still one-character unicode strings though.
Try printing the individual values in your list:
>>> print KeysLst[0]
Ϩ
>>> print KeysLst[1]
ϩ
>>> KeysLst[0]
u'\u03e8'
>>> KeysLst[1]
u'\u03e9'
The unicode escape representation is used for any codepoint outside of the printable ASCII range:
>>> u'A'
u'A'
>>> u'\n'
u'\n'
>>> u'\x86'
u'\x86'
>>> u'\u0025'
u'%'
When you print a list, you get the repr of the elements inside the list (surrounded by brackets and delimited by a comma).
If you are trying to print the unicode glyphs, try
KeysLst = []
for i in range(1000, 1100):
    char = unichr(i)
    KeysLst.append(char)
for char in KeysLst:
    print char,
which yields
Ϩ ϩ Ϫ ϫ Ϭ ϭ Ϯ ϯ ϰ ϱ ϲ ϳ ϴ ϵ ϶ Ϸ ϸ Ϲ Ϻ ϻ ϼ Ͻ Ͼ Ͽ Ѐ Ё Ђ Ѓ Є Ѕ І Ї Ј Љ Њ Ћ Ќ Ѝ Ў Џ А Б В Г Д Е Ж З И Й К Л М Н О П Р С Т У Ф Х Ц Ч Ш Щ Ъ Ы Ь Э Ю Я а б в г д е ж з и й к л м н о п р с т у ф х ц ч ш щ ъ ы
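The snippets above are Python 2. As a sketch for readers on Python 3 (an assumption about the reader, not something the question uses), chr replaces unichr, and the built-in ascii() reproduces the escaped form while joining the strings gives the glyphs:

```python
# Python 3: chr() replaces unichr(), and every str is already Unicode.
keys = [chr(i) for i in range(1000, 1003)]

# ascii() escapes non-ASCII characters, much like the Python 2 list repr did.
escaped = ascii(keys)

# Joining the strings (or printing them one at a time) gives the glyphs themselves.
joined = " ".join(keys)

print(escaped)
print(joined)
```

Either way the list holds the same one-character strings; only the way they are displayed differs.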

Python: load text as python object [duplicate]

This question already has answers here:
How to convert raw javascript object to a dictionary?
(6 answers)
Closed 9 months ago.
I have a such text to load: https://sites.google.com/site/iminside1/paste
I'd prefer to create a Python dictionary from it, but any object is OK. I tried pickle, json and eval, but didn't succeed. Can you help me with this?
Thanks!
The results:
a = open("the_file", "r").read()
json.loads(a)
ValueError: Expecting property name: line 1 column 1 (char 1)
pickle.loads(a)
KeyError: '{'
eval(a)
File "<string>", line 19
from: {code: 'DME', airport: "Домодедово", city: 'Москва', country: 'Россия', terminal: ''},
^
SyntaxError: invalid syntax
Lifted almost straight from the pyparsing examples page:
# read text from web page
import urllib
page = urllib.urlopen("https://sites.google.com/site/iminside1/paste")
html = page.read()
page.close()
start = html.index("<pre>")+len("<pre>")+3 #skip over 3-byte header
end = html.index("</pre>")
text = html[start:end]
print text
# parse dict-like syntax
from pyparsing import (Suppress, Regex, quotedString, Word, alphas,
    alphanums, oneOf, Forward, Optional, dictOf, delimitedList, Group, removeQuotes)
LBRACK,RBRACK,LBRACE,RBRACE,COLON,COMMA = map(Suppress,"[]{}:,")
integer = Regex(r"[+-]?\d+").setParseAction(lambda t:int(t[0]))
real = Regex(r"[+-]?\d+\.\d*").setParseAction(lambda t:float(t[0]))
string_ = Word(alphas,alphanums+"_") | quotedString.setParseAction(removeQuotes)
bool_ = oneOf("true false").setParseAction(lambda t: t[0]=="true")
item = Forward()
key = string_
dict_ = LBRACE - Optional(dictOf(key+COLON, item+Optional(COMMA))) + RBRACE
list_ = LBRACK - Optional(delimitedList(item)) + RBRACK
item << (real | integer | string_ | bool_ | Group(list_ | dict_ ))
result = item.parseString(text,parseAll=True)[0]
print result.data[0].dump()
print result.data[0].segments[0].dump(indent=" ")
print result.data[0].segments[0].flights[0].dump(indent=" - ")
print result.data[0].segments[0].flights[0].flightLegs[0].dump(indent=" - - ")
for seg in result.data[6].segments:
    for flt in seg.flights:
        fltleg = flt.flightLegs[0]
        print "%(airline)s %(airlineCode)s %(flightNo)s" % fltleg,
        print "%s -> %s" % (fltleg["from"].code, fltleg["to"].code)
Prints:
[['index', 0], ['serviceClass', '??????'], ['prices', [3504, ...
- eTicketing: true
- index: 0
- prices: [3504, 114.15000000000001, 89.769999999999996]
- segments: [[['indexSegment', 0], ['stopsCount', 0], ['flights', ...
- serviceClass: ??????
[['indexSegment', 0], ['stopsCount', 0], ['flights', [[['index', 0], ...
- flights: [[['index', 0], ['time', 'PT2H45M'], ['minAvailSeats', 9], ...
- indexSegment: 0
- stopsCount: 0
- [['index', 0], ['time', 'PT2H45M'], ['minAvailSeats', 9], ['flight...
- - flightLegs: [[['flightNo', '309'], ['eTicketing', 'true'], ['air...
- - index: 0
- - minAvailSeats: 9
- - stops: []
- - time: PT2H45M
- - [['flightNo', '309'], ['eTicketing', 'true'], ['airplane', 'Boe...
- - - airline: ?????????
- - - airlineCode: UN
- - - airplane: Boeing 737-500
- - - availSeats: 9
- - - classCode: I
- - - eTicketing: true
- - - fareBasis: IPROW
- - - flightClass: ECONOMY
- - - flightNo: 309
- - - from: - - [['code', 'DME'], ['airport', '??????????'], ...
- - - airport: ??????????
- - - city: ??????
- - - code: DME
- - - country: ??????
- - - terminal:
- - - fromDate: 2010-10-15
- - - fromTime: 10:40:00
- - - time:
- - - to: - - [['code', 'TXL'], ['airport', 'Berlin-Tegel'], ...
- - - airport: Berlin-Tegel
- - - city: ??????
- - - code: TXL
- - - country: ????????
- - - terminal:
- - - toDate: 2010-10-15
- - - toTime: 11:25:00
airBaltic BT 425 SVO -> RIX
airBaltic BT 425 SVO -> RIX
airBaltic BT 423 SVO -> RIX
airBaltic BT 423 SVO -> RIX
EDIT: fixed grouping and expanded output dump to show how to access individual key fields of results, either by index (within list) or as attribute (within dict).
If you really have to load the bulls... this data is (see my comment), you're probably best off with a regex that adds the missing quotes. Something like r"([a-zA-Z_][a-zA-Z_0-9]*)\s*\:" to find things to quote and r"\'\1\'\:" as replacement (off the top of my head, I have to test it first).
Edit: After some trouble with backward-references in Python 3.1, I finally got it working with these:
>>> import re
>>> pattern = r"([a-zA-Z_][a-zA-Z_0-9]*)\s*\:"
>>> test = '{"foo": {bar: 1}}'
>>> repl = lambda match: '"{}":'.format(match.group(1))
>>> eval(re.sub(pattern, repl, test))
{'foo': {'bar': 1}}
So far, with the help of delnan and a little investigation, I can load it into a dict with eval:
pattern = r"\b(?P<word>\w+):"
x = re.sub(pattern, '"\g<word>":',open("the_file", "r").read())
y = x.replace("true", '"true"')
d = eval(y)
I'm still looking for a more efficient and maybe simpler solution. I don't like to use "eval" for some reasons.
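One way to avoid eval while keeping the same idea is ast.literal_eval, which only accepts literals. The sketch below is Python 3 and rests on several assumptions: keys are bare identifiers, no string value contains an identifier followed by a colon (a URL would break it), and the words true, false and null never appear inside strings. loose_object_to_python is a hypothetical name.

```python
import ast
import re

# Python spellings of the JavaScript literals.
JS_LITERALS = {"true": "True", "false": "False", "null": "None"}

def loose_object_to_python(text):
    """Quote bare keys, map JS literals, then parse safely (a sketch, not a full JS parser)."""
    # Quote identifiers that are directly followed by a colon.
    quoted = re.sub(r"\b([A-Za-z_]\w*)\s*:", r'"\1":', text)
    # Replace true/false/null with their Python equivalents.
    quoted = re.sub(r"\b(true|false|null)\b", lambda m: JS_LITERALS[m.group(1)], quoted)
    return ast.literal_eval(quoted)

data = loose_object_to_python("""{"foo": {bar: 1}, ok: true, name: 'x', prices: [3504, 114.15]}""")
```

Unlike eval, literal_eval raises an error on anything that is not a plain literal, so a malformed or hostile input cannot execute code.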
An extension of DominiCane's version:
import json
import re
quote_keys_regex = re.compile(r'([\{\s,])(\w+)(:)')
def js_variable_to_python(js_variable):
    """Convert a javascript variable into JSON and then load the value"""
    # when in_string is not None, it contains the character that has opened the string
    # either simple quote or double quote
    in_string = None
    # cut the string:
    # r"""{ a:"f\"irst", c:'sec"ond'}"""
    # becomes
    # ['{ a:', '"', 'f\\', '"', 'irst', '"', ', c:', "'", 'sec', '"', 'ond', "'", '}']
    l = re.split(r'(["\'])', js_variable)
    # previous part (to check the escape character antislash)
    previous_p = ""
    for i, p in enumerate(l):
        # parse characters inside a ECMA string
        if in_string:
            # we are in a JS string: replace the colon by a temporary character
            # so quote_keys_regex doesn't have to deal with colon inside the JS strings
            l[i] = l[i].replace(':', chr(1))
            if in_string == "'":
                # the JS string is delimited by simple quote.
                # This is not supported by JSON.
                # simple quote delimited string are converted to double quote delimited string
                # here, inside a JS string, we escape the double quote
                l[i] = l[i].replace('"', r'\"')
        # deal with delimiters and escape character
        if not in_string and p in ('"', "'"):
            # we are not in string
            # but p is double or simple quote
            # that's the start of a new string
            # replace simple quote by double quote
            # (JSON doesn't support simple quote)
            l[i] = '"'
            in_string = p
            continue
        if p == in_string:
            # we are in a string and the current part MAY close the string
            if len(previous_p) > 0 and previous_p[-1] == '\\':
                # there is an antislash just before: the JS string continue
                continue
            # the current p close the string
            # replace simple quote by double quote
            l[i] = '"'
            in_string = None
        # update previous_p
        previous_p = p
    # join the string
    s = ''.join(l)
    # add quote around the key
    # { a: 12 }
    # becomes
    # { "a": 12 }
    s = quote_keys_regex.sub(r'\1"\2"\3', s)
    # replace the surrogate character by colon
    s = s.replace(chr(1), ':')
    # load the JSON and return the result
    return json.loads(s)
It deals only with int, null and string. I don't know about float.
Note the use of chr(1): the code doesn't work if this character appears in js_variable.
