Python: load text as python object [duplicate] - python

This question already has answers here:
How to convert raw javascript object to a dictionary?
(6 answers)
Closed 9 months ago.
I have a such text to load: https://sites.google.com/site/iminside1/paste
I'd prefer to create a python dictionary from it, but any object is OK. I tried pickle, json and eval, but didn't succeeded. Can you help me with this?
Thanks!
The results:
a = open("the_file", "r").read()
json.loads(a)
ValueError: Expecting property name: line 1 column 1 (char 1)
pickle.loads(a)
KeyError: '{'
eval(a)
File "<string>", line 19
from: {code: 'DME', airport: "Домодедово", city: 'Москва', country: 'Россия', terminal: ''},
^
SyntaxError: invalid syntax

Lifted almost straight from the pyparsing examples page:
# read text from web page
import urllib
page = urllib.urlopen("https://sites.google.com/site/iminside1/paste")
html = page.read()
page.close()
start = html.index("<pre>")+len("<pre>")+3 #skip over 3-byte header
end = html.index("</pre>")
text = html[start:end]
print text
# parse dict-like syntax
from pyparsing import (Suppress, Regex, quotedString, Word, alphas,
alphanums, oneOf, Forward, Optional, dictOf, delimitedList, Group, removeQuotes)
LBRACK,RBRACK,LBRACE,RBRACE,COLON,COMMA = map(Suppress,"[]{}:,")
integer = Regex(r"[+-]?\d+").setParseAction(lambda t:int(t[0]))
real = Regex(r"[+-]?\d+\.\d*").setParseAction(lambda t:float(t[0]))
string_ = Word(alphas,alphanums+"_") | quotedString.setParseAction(removeQuotes)
bool_ = oneOf("true false").setParseAction(lambda t: t[0]=="true")
item = Forward()
key = string_
dict_ = LBRACE - Optional(dictOf(key+COLON, item+Optional(COMMA))) + RBRACE
list_ = LBRACK - Optional(delimitedList(item)) + RBRACK
item << (real | integer | string_ | bool_ | Group(list_ | dict_ ))
result = item.parseString(text,parseAll=True)[0]
print result.data[0].dump()
print result.data[0].segments[0].dump(indent=" ")
print result.data[0].segments[0].flights[0].dump(indent=" - ")
print result.data[0].segments[0].flights[0].flightLegs[0].dump(indent=" - - ")
for seg in result.data[6].segments:
for flt in seg.flights:
fltleg = flt.flightLegs[0]
print "%(airline)s %(airlineCode)s %(flightNo)s" % fltleg,
print "%s -> %s" % (fltleg["from"].code, fltleg["to"].code)
Prints:
[['index', 0], ['serviceClass', '??????'], ['prices', [3504, ...
- eTicketing: true
- index: 0
- prices: [3504, 114.15000000000001, 89.769999999999996]
- segments: [[['indexSegment', 0], ['stopsCount', 0], ['flights', ...
- serviceClass: ??????
[['indexSegment', 0], ['stopsCount', 0], ['flights', [[['index', 0], ...
- flights: [[['index', 0], ['time', 'PT2H45M'], ['minAvailSeats', 9], ...
- indexSegment: 0
- stopsCount: 0
- [['index', 0], ['time', 'PT2H45M'], ['minAvailSeats', 9], ['flight...
- - flightLegs: [[['flightNo', '309'], ['eTicketing', 'true'], ['air...
- - index: 0
- - minAvailSeats: 9
- - stops: []
- - time: PT2H45M
- - [['flightNo', '309'], ['eTicketing', 'true'], ['airplane', 'Boe...
- - - airline: ?????????
- - - airlineCode: UN
- - - airplane: Boeing 737-500
- - - availSeats: 9
- - - classCode: I
- - - eTicketing: true
- - - fareBasis: IPROW
- - - flightClass: ECONOMY
- - - flightNo: 309
- - - from: - - [['code', 'DME'], ['airport', '??????????'], ...
- - - airport: ??????????
- - - city: ??????
- - - code: DME
- - - country: ??????
- - - terminal:
- - - fromDate: 2010-10-15
- - - fromTime: 10:40:00
- - - time:
- - - to: - - [['code', 'TXL'], ['airport', 'Berlin-Tegel'], ...
- - - airport: Berlin-Tegel
- - - city: ??????
- - - code: TXL
- - - country: ????????
- - - terminal:
- - - toDate: 2010-10-15
- - - toTime: 11:25:00
airBaltic BT 425 SVO -> RIX
airBaltic BT 425 SVO -> RIX
airBaltic BT 423 SVO -> RIX
airBaltic BT 423 SVO -> RIX
EDIT: fixed grouping and expanded output dump to show how to access individual key fields of results, either by index (within list) or as attribute (within dict).

If you really have to load the bulls... this data is (see my comment), you's propably best of with a regex adding missing quotes. Something like r"([a-zA-Z_][a-zA-Z_0-9]*)\s*\:" to find things to quote and r"\'\1\'\:" as replacement (off the top of my head, I have to test it first).
Edit: After some troulbe with backward-references in Python 3.1, I finally got it working with these:
>>> pattern = r"([a-zA-Z_][a-zA-Z_0-9]*)\s*\:"
>>> test = '{"foo": {bar: 1}}'
>>> repl = lambda match: '"{}":'.format(match.group(1))
>>> eval(re.sub(pattern, repl, test))
{'foo': {'bar': 1}}

Till now with help of delnan and a little investigation I can load it into dict with eval:
pattern = r"\b(?P<word>\w+):"
x = re.sub(pattern, '"\g<word>":',open("the_file", "r").read())
y = x.replace("true", '"true"')
d = eval(y)
Still looking for more efficient and maybe simpler solution.. I don't like to use "eval" for some reasons.

Extension of the DominiCane's version:
import re
quote_keys_regex = re.compile(r'([\{\s,])(\w+)(:)')
def js_variable_to_python(js_variable):
"""Convert a javascript variable into JSON and then load the value"""
# when in_string is not None, it contains the character that has opened the string
# either simple quote or double quote
in_string = None
# cut the string:
# r"""{ a:"f\"irst", c:'sec"ond'}"""
# becomes
# ['{ a:', '"', 'f\\', '"', 'irst', '"', ', c:', "'", 'sec', '"', 'ond', "'", '}']
l = re.split(r'(["\'])', js_variable)
# previous part (to check the escape character antislash)
previous_p = ""
for i, p in enumerate(l):
# parse characters inside a ECMA string
if in_string:
# we are in a JS string: replace the colon by a temporary character
# so quote_keys_regex doesn't have to deal with colon inside the JS strings
l[i] = l[i].replace(':', chr(1))
if in_string == "'":
# the JS string is delimited by simple quote.
# This is not supported by JSON.
# simple quote delimited string are converted to double quote delimited string
# here, inside a JS string, we escape the double quote
l[i] = l[i].replace('"', r'\"')
# deal with delimieters and escape character
if not in_string and p in ('"', "'"):
# we are not in string
# but p is double or simple quote
# that's the start of a new string
# replace simple quote by double quote
# (JSON doesn't support simple quote)
l[i] = '"'
in_string = p
continue
if p == in_string:
# we are in a string and the current part MAY close the string
if len(previous_p) > 0 and previous_p[-1] == '\\':
# there is an antislash just before: the JS string continue
continue
# the current p close the string
# replace simple quote by double quote
l[i] = '"'
in_string = None
# update previous_p
previous_p = p
# join the string
s = ''.join(l)
# add quote arround the key
# { a: 12 }
# becomes
# { "a": 12 }
s = quote_keys_regex.sub(r'\1"\2"\3', s)
# replace the surogate character by colon
s = s.replace(chr(1), ':')
# load the JSON and return the result
return json.loads(s)
It deals only with int, null and string. I don't know about float.
Note that the usage chr(1): the code doesn't work if this character in js_variable.

Related

Python Regex: US phone number parsing

I am a complete newbie in Regex.
I need to parse US phone numbers in a different format into 3 strings: area code (no '()'), next 3 digits, last 4 digits. No '-'.
I also need to reject (message Error):
916-111-1111 ('-' after the area code)
(916)111 -1111 (white space before '-')
( 916)111-1111 (any space inside of area code) - (916 ) - must be
rejected too
(a56)111-1111 (any non-digits inside of area code)
lack of '()' for the area code
it should OK: ' (916) 111-1111 ' (spaces anywhere except as above)
here is my regex:
^\s*\(?(\d{3})[\)\-][\s]*?(\d{3})[-]?(\d{4})\s*$
This took me 2 days.
It did not fail 916-111-1111 (availability of '-' after area code). I am sure there are some other deficiencies.
I would appreciate your help very much. Even hints.
Valid:
'(916) 111-1111'
'(916)111-1111 '
' (916) 111-1111'
INvalid:
'916-111-1111' - no () or '-' after area code
'(916)111 -1111' - no space before '-'
'( 916)111-1111' - no space inside ()
'(abc) 111-11i1' because of non-digits
You can do this:
import re
r = r'\((\d{3})\)\s*?(\d{3})\-(\d{4,5})'
l = ['(916) 111-11111', '(916)111-1111 ', ' (916) 111-1111', '916-111-1111', '(916)111 -1111', '( 916)111-1111', '(abc) 111-11i1']
print([re.findall(r, x) for x in l])
# [[('916', '111', '11111')], [('916', '111', '1111')], [('916', '111', '1111')], [], [], [], []]
You can simplify the regex if you consider (1) providing easy user interface rather than asking users to modify inputs or (2) taking the numbers to a backend storage as follows:
"(\d{1,3})\D*(\d{3})\D*(\d{4})"
Since you want to print the error message, the found regex groups should be rechecked according to the error messages as follows:
Code:
import re
def get_failed_reason(s):
space_regex = r"(\s+)"
area_code_regex = r"\s*(\D*)(\d{1,3})(\D*)(\d{3})(\D*)-(\d{4})"
results = re.findall(area_code_regex, s)
if 0 == len(results):
area_code_alpha_regex = r"\((\D+)\)"
results = re.findall(area_code_alpha_regex, s)
if len(results) > 0:
return "because of non-digits"
return "no matches"
results = results[0]
space_results = re.findall(space_regex, results[0])
if 0 == len(space_results):
space_results = re.findall(space_regex, results[2])
if 0 != len(space_results):
return "no space inside ()"
alpha_code_regex = r"(\D+)"
alpha_results = re.findall(alpha_code_regex, results[0])
if 0 == len(alpha_results):
alpha_results = re.findall(alpha_code_regex, results[2])
if 0 != len(alpha_results):
if "(" not in results[0] or ")" not in results[2]:
return "no () or '-' after area code"
if 0 != len(results[-2]):
return "no space before '-'"
return "because of non-digits in area code"
return "unspecified"
if __name__ == '__main__':
phone_numbers = ["916-111-1111", "(916)111-1111", "(916)111 -1111 ", " (916) 111-1111", "(916 )111-1111", "( 916)111-1111", "- (916 )111-1111", "(a56)111-1111", "(56a)111-1111", "(916) 111-1111 ", "(abc) 111-1111"]
valid_regex = r"\s*(\()(\d{1,3})(\))(\D*)(\d{3})([^\d\s]*)-(\d{4})"
for phone_number_str in phone_numbers:
results = re.findall(valid_regex, phone_number_str)
if 0 == len(results):
reason = get_failed_reason(phone_number_str)
phone_number_str = f"[{phone_number_str}]"
print(f"[main] Failed:\t{phone_number_str: <30}- {reason}")
continue
area_code = results[0][1]
first_number = results[0][4]
second_number = results[0][6]
phone_number_str = f"[{phone_number_str}]"
print(f"[main] Valid:\t{phone_number_str: <30}- Area code: {area_code}, First number: {first_number}, Second number: {second_number}")
Result:
[main] Failed: [916-111-1111] - no () or '-' after area code
[main] Valid: [(916)111-1111] - Area code: 916, First number: 111, Second number: 1111
[main] Failed: [(916)111 -1111 ] - no space before '-'
[main] Valid: [ (916) 111-1111] - Area code: 916, First number: 111, Second number: 1111
[main] Failed: [(916 )111-1111] - no space inside ()
[main] Failed: [( 916)111-1111] - no space inside ()
[main] Failed: [- (916 )111-1111] - no space inside ()
[main] Failed: [(a56)111-1111] - because of non-digits in area code
[main] Failed: [(56a)111-1111] - because of non-digits in area code
[main] Valid: [(916) 111-1111 ] - Area code: 916, First number: 111, Second number: 1111
[main] Failed: [(abc) 111-1111] - because of non-digits
Note: D represents non-digit characters.

Convert lvm.conf to python dict using pyparsing

I'm trying to convert lvm.conf to python (JSON like) object.
LVM (Logical Volume Management) configuration file looks like this:
# Configuration section config.
# How LVM configuration settings are handled.
config {
# Configuration option config/checks.
# If enabled, any LVM configuration mismatch is reported.
# This implies checking that the configuration key is understood by
# LVM and that the value of the key is the proper type. If disabled,
# any configuration mismatch is ignored and the default value is used
# without any warning (a message about the configuration key not being
# found is issued in verbose mode only).
checks = 1
# Configuration option config/abort_on_errors.
# Abort the LVM process if a configuration mismatch is found.
abort_on_errors = 0
# Configuration option config/profile_dir.
# Directory where LVM looks for configuration profiles.
profile_dir = "/etc/lvm/profile"
}
local {
}
log {
verbose=0
silent=0
syslog=1
overwrite=0
level=0
indent=1
command_names=0
prefix=" "
activation=0
debug_classes=["memory","devices","activation","allocation","lvmetad","metadata","cache","locking","lvmpolld","dbus"]
}
I'd like to get Python dict, like this:
{ "section_name"":
{"value1" : 1,
"value2" : "some_string",
"value3" : [list, of, strings]}... and so on.}
The parser function:
def parseLvmConfig2(path="/etc/lvm/lvm.conf"):
try:
EQ, LBRACE, RBRACE, LQ, RQ = map(pp.Suppress, "={}[]")
comment = pp.Suppress("#") + pp.Suppress(pp.restOfLine)
configSection = pp.Word(pp.alphas + "_") + LBRACE
sectionKey = pp.Word(pp.alphas + "_")
sectionValue = pp.Forward()
entry = pp.Group(sectionKey + EQ + sectionValue)
real = pp.Regex(r"[+-]?\d+\.\d*").setParseAction(lambda x: float(x[0]))
integer = pp.Regex(r"[+-]?\d+").setParseAction(lambda x: int(x[0]))
listval = pp.Regex(r'(?:\[)(.*)?(?:\])').setParseAction(lambda x: eval(x[0]))
pp.dblQuotedString.setParseAction(pp.removeQuotes)
struct = pp.Group(pp.ZeroOrMore(entry) + RBRACE)
sectionValue << (pp.dblQuotedString | real | integer | listval)
parser = pp.ZeroOrMore(configSection + pp.Dict(struct))
res = parser.parseFile(path)
print(res)
except (pp.ParseBaseException, ) as e:
print("lvm.conf bad format {0}".format(e))
The result is messy and the question is, how to make pyparsing do the job, without additional logic?
UPDATE(SOLVED):
For anyone who wants to understand pyparsing better, please check #PaulMcG explanation below. (Thanks for pyparsing, Paul! )
import pyparsing as pp
def parseLvmConf(conf="/etc/lvm/lvm.conf", res_type="dict"):
EQ, LBRACE, RBRACE, LQ, RQ = map(pp.Suppress, "={}[]")
comment = "#" + pp.restOfLine
integer = pp.nums
real = pp.Word(pp.nums + "." + pp.nums)
pp.dblQuotedString.setParseAction(pp.removeQuotes)
scalar_value = real | integer | pp.dblQuotedString
list_value = pp.Group(LQ + pp.delimitedList(scalar_value) + RQ)
key = pp.Word(pp.alphas + "_", pp.alphanums + '_')
key_value = pp.Group(key + EQ + (scalar_value | list_value))
struct = pp.Forward()
entry = key_value | pp.Group(key + struct)
struct <<= pp.Dict(LBRACE + pp.ZeroOrMore(entry) + RBRACE)
parser = pp.Dict(pp.ZeroOrMore(entry))
parser.ignore(comment)
try:
#return lvm.conf as dict
if res_type == "dict":
return parser.parseFile(conf).asDict()
# return lvm.conf as list
elif res_type == "list":
return parser.parseFile(conf).asList()
else:
#return lvm.conf as ParseResults
return parser.parseFile(conf)
except (pp.ParseBaseException,) as e:
print("lvm.conf bad format {0}".format(e))
Step 1 should always be to at least rough out a BNF for the format you are going to parse. This really helps organize your thoughts, and gets you thinking about the structure and data you are parsing, before starting to write actual code.
Here is a BNF that I came up with for this config (it looks like a Python string because that makes it easy to paste into your code for future reference - but pyparsing does not work with or require such strings, they are purely a design tool):
BNF = '''
key_struct ::= key struct
struct ::= '{' (key_value | key_struct)... '}'
key_value ::= key '=' (scalar_value | list_value)
key ::= word composed of alphas and '_'
list_value ::= '[' scalar_value [',' scalar_value]... ']'
scalar_value ::= real | integer | double-quoted-string
comment ::= '#' rest-of-line
'''
Notice that the opening and closing {}'s and []'s are at the same level, rather than having an opener in one expression and a closer in another.
This BNF also will allow for structs nested within structs, which is not strictly required in the sample text you posted, but since your code looked to be supporting that, I included it.
Translating to pyparsing is pretty straightforward from here, working bottom-up through the BNF:
EQ, LBRACE, RBRACE, LQ, RQ = map(pp.Suppress, "={}[]")
comment = "#" + pp.restOfLine
integer = ppc.integer #pp.Regex(r"[+-]?\d+").setParseAction(lambda x: int(x[0]))
real = ppc.real #pp.Regex(r"[+-]?\d+\.\d*").setParseAction(lambda x: float(x[0]))
pp.dblQuotedString.setParseAction(pp.removeQuotes)
scalar_value = real | integer | pp.dblQuotedString
# `delimitedList(expr)` is a shortcut for `expr + ZeroOrMore(',' + expr)`
list_value = pp.Group(LQ + pp.delimitedList(scalar_value) + RQ)
key = pp.Word(pp.alphas + "_", pp.alphanums + '_')
key_value = pp.Group(key + EQ + (scalar_value | list_value))
struct = pp.Forward()
entry = key_value | pp.Group(key + struct)
struct <<= (LBRACE + pp.ZeroOrMore(entry) + RBRACE)
parser = pp.ZeroOrMore(entry)
parser.ignore(comment)
Running this code:
try:
res = parser.parseString(lvm_source)
# print(res.dump())
res.pprint()
return res
except (pp.ParseBaseException, ) as e:
print("lvm.conf bad format {0}".format(e))
Gives this nested list:
[['config',
['checks', 1],
['abort_on_errors', 0],
['profile_dir', '/etc/lvm/profile']],
['local'],
['log',
['verbose', 0],
['silent', 0],
['syslog', 1],
['overwrite', 0],
['level', 0],
['indent', 1],
['command_names', 0],
['prefix', ' '],
['activation', 0],
['debug_classes',
['memory',
'devices',
'activation',
'allocation',
'lvmetad',
'metadata',
'cache',
'locking',
'lvmpolld',
'dbus']]]]
I think the format you would prefer is one where you can access the values as keys in a nested dict or in a hierarchical object. Pyparsing has a class called Dict that will do this at parse time, so that the results names are automatically assigned for nested subgroups. Change these two lines to have their sub-entries automatically dict-ified:
struct <<= pp.Dict(LBRACE + pp.ZeroOrMore(entry) + RBRACE)
parser = pp.Dict(pp.ZeroOrMore(entry))
Now if we call dump() instead of pprint(), we'll see the hierarchical naming:
[['config', ['checks', 1], ['abort_on_errors', 0], ['profile_dir', '/etc/lvm/profile']], ['local'], ['log', ['verbose', 0], ['silent', 0], ['syslog', 1], ['overwrite', 0], ['level', 0], ['indent', 1], ['command_names', 0], ['prefix', ' '], ['activation', 0], ['debug_classes', ['memory', 'devices', 'activation', 'allocation', 'lvmetad', 'metadata', 'cache', 'locking', 'lvmpolld', 'dbus']]]]
- config: [['checks', 1], ['abort_on_errors', 0], ['profile_dir', '/etc/lvm/profile']]
- abort_on_errors: 0
- checks: 1
- profile_dir: '/etc/lvm/profile'
- local: ''
- log: [['verbose', 0], ['silent', 0], ['syslog', 1], ['overwrite', 0], ['level', 0], ['indent', 1], ['command_names', 0], ['prefix', ' '], ['activation', 0], ['debug_classes', ['memory', 'devices', 'activation', 'allocation', 'lvmetad', 'metadata', 'cache', 'locking', 'lvmpolld', 'dbus']]]
- activation: 0
- command_names: 0
- debug_classes: ['memory', 'devices', 'activation', 'allocation', 'lvmetad', 'metadata', 'cache', 'locking', 'lvmpolld', 'dbus']
- indent: 1
- level: 0
- overwrite: 0
- prefix: ' '
- silent: 0
- syslog: 1
- verbose: 0
You can then access the fields as res['config']['checks'] or res.log.indent.

python difflib character diff with unifed contextual format

I need to display character difference per line in a unix unified diff like style. Is there a way to do that using difflib?
I can get "unified diff" and "character per line diff" separately using difflib.unified_diff and difflib.Differ() (ndiff) respectively, but how can I combine them?
This is what I am looking for:
#
# This is difflib.unified
#
>>> print ''.join(difflib.unified_diff('one\ntwo\nthree\n'.splitlines(1), 'ore\ntree\nemu\n'.splitlines(1), 'old', 'new'))
--- old
+++ new
## -1,3 +1,3 ##
-one
-two
-three
+ore
+tree
+emu
>>>
#
# This is difflib.Differ
#
>>> print ''.join(difflib.ndiff('one\ntwo\nthree\n'.splitlines(1), 'ore\ntree\nemu\n'.splitlines(1))),
- one
? ^
+ ore
? ^
- two
- three
? -
+ tree
+ emu
>>>
#
# I want the merge of above two, something like this...
#
>>> print ''.join(unified_with_ndiff('one\ntwo\nthree\n'.splitlines(1), 'ore\ntree\nemu\n'.splitlines(1))),
--- old
+++ new
## -1,3 +1,3 ##
- one
? ^
+ ore
? ^
- two
- three
? -
+ tree
+ emu
>>>
Found the answer on my own after digging into the source code of difflib.
'''
# mydifflib.py
#author: Amit Barik
#summary: Overrides difflib.Differ to present the user with unified format (for Python 2.7).
Its basically merging of difflib.unified_diff() and difflib.Differ.compare()
'''
from difflib import SequenceMatcher
from difflib import Differ
class UnifiedDiffer(Differ):
def unified_diff(self, a, b, fromfile='', tofile='', fromfiledate='',
tofiledate='', n=3, lineterm='\n'):
r"""
Compare two sequences of lines; generate the resulting delta, in unified
format
Each sequence must contain individual single-line strings ending with
newlines. Such sequences can be obtained from the `readlines()` method
of file-like objects. The delta generated also consists of newline-
terminated strings, ready to be printed as-is via the writeline()
method of a file-like object.
Example:
>>> print ''.join(Differ().unified_diff('one\ntwo\nthree\n'.splitlines(1),
... 'ore\ntree\nemu\n'.splitlines(1)),
... 'old.txt', 'new.txt', 'old-date', 'new-date'),
--- old.txt old-date
+++ new.txt new-date
## -1,5 +1,5 ##
context1
- one
? ^
+ ore
? ^
- two
- three
? -
+ tree
+ emu
context2
"""
started = False
for group in SequenceMatcher(None,a,b).get_grouped_opcodes(n):
if not started:
fromdate = '\t%s' % fromfiledate if fromfiledate else ''
todate = '\t%s' % tofiledate if tofiledate else ''
yield '--- %s%s%s' % (fromfile, fromdate, lineterm)
yield '+++ %s%s%s' % (tofile, todate, lineterm)
started = True
i1, i2, j1, j2 = group[0][1], group[-1][2], group[0][3], group[-1][4]
yield "## -%d,%d +%d,%d ##%s" % (i1+1, i2-i1, j1+1, j2-j1, lineterm)
for tag, i1, i2, j1, j2 in group:
if tag == 'replace':
for line in a[i1:i2]:
g = self._fancy_replace(a, i1, i2, b, j1, j2)
elif tag == 'equal':
for line in a[i1:i2]:
g = self._dump(' ', a, i1, i2)
if n > 0:
for line in g:
yield line
continue
elif tag == 'delete':
for line in a[i1:i2]:
g = self._dump('-', a, i1, i2)
elif tag == 'insert':
for line in b[j1:j2]:
g = self._dump('+', b, j1, j2)
else:
raise ValueError, 'unknown tag %r' % (tag,)
for line in g:
yield line
def main():
# Test
a ='context1\none\ntwo\nthree\ncontext2\n'.splitlines(1)
b = 'context1\nore\ntree\nemu\ncontext2\n'.splitlines(1)
x = UnifiedDiffer().unified_diff(a, b, 'old.txt', 'new.txt', 'old-date', 'new-date', n=1)
print ''.join(x)
if __name__ == '__main__':
main()

Match words before square bracket and extract the value

Hi i have an log data
IPADDRESS['192.168.0.10'] - - PlATFORM['windows'] - - BROWSER['chrome'] - - EPOCH['1402049518'] - - HEXA['0x3c0593ee']
I want to extract the corresponding value inside the bracket using regular expression
for example (this does not work)
ipaddress = re.findall(r"\[(IPADDRESS[^]]*)",output)
ouput:
ipaddress = 192.168.0.10
You can simply get all the elements as a dictionary like this
print dict(re.findall(r'([a-zA-Z]+)\[\'(.*?)\'\]', data))
Output
{'BROWSER': 'chrome',
'EPOCH': '1402049518',
'HEXA': '0x3c0593ee',
'IPADDRESS': '192.168.0.10',
'PlATFORM': 'windows'}
s1 = "IPADDRESS['192.168.0.10'] - - PlATFORM['windows'] - - BROWSER['chrome'] - - EPOCH['1402049518'] - - HEXA['0x3c0593ee']"
import re
print re.findall('\[\'(\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3})\'\]',s1)[0]
192.168.0.10
To get all in a list:
s1 = "IPADDRESS['192.168.0.10'] - - PlATFORM['windows'] - - BROWSER['chrome'] - - EPOCH['1402049518'] - - HEXA['0x3c0593ee']"
print re.findall('\[\'(.*?)\'\]',s1)
['192.168.0.10', 'windows', 'chrome', '1402049518', '0x3c0593ee']
result=re.findall('\[\'(.*?)\'\]',s1)
ipaddress, platform, browser, epoch, hexa = result # assign all variables
print "ipaddress = {}, platform = {}, browser = {}, epoch = {}, hexa = {}".format(ipaddress,platform,browser,epoch,hexa)
ipaddress = 192.168.0.10, platform = windows, browser = chrome, epoch = 1402049518, hexa = 0x3c0593ee

Parsing significant whitespace with a PEG in Python (specificly Parsley)

I'm creating a syntax that supports significant whitespace (most like the "Z" lisp variant than Python or yaml, but same idea)
I came across this article on how to do significant whitespace parsing in a pegasus a PEG parser for C#
But I've been less than successful at converting that to parsley, looks like the #STATE# variable in Pegasus follows backtracking in some way.
This is the closest I've gotten to a simple parser, If I use the version of indent with look ahead it can't parse children, and if I use the version without, it can't parse siblings.
If this is a limitation of parsley and I need to use PyPEG or Parsimonious or something, I'm open to that, but it seems like if the internal indent variable could follow the PEGs internal backtracking this would all work.
import parsley
def indent(s):
s['i'] += 2
print('indent i=%d' % s['i'])
def deindent(s):
s['i'] -= 2
print('deindent i=%d' % s['i'])
grammar = parsley.makeGrammar(r'''
id = <letterOrDigit+>
eol = '\n' | end
nots = anything:x ?(x != ' ')
node = I:i id:name eol !(fn_print(_state['i'], name)) -> i, name
#I = !(' ' * _state['i'])
I = (' '*):spaces ?(len(spaces) == _state['i'])
#indent = ~~(!(' ' * (_state['i'] + 2)) nots) -> fn_indent(_state)
#deindent = ~~(!(' ' * (_state['i'] - 2)) nots) -> fn_deindent(_state)
indent = -> fn_indent(_state)
deindent = -> fn_deindent(_state)
child_list = indent (ntree+):children deindent -> children
ntree = node:parent (child_list?):children -> parent, children
nodes = ntree+
''', {
'_state': {'i': 0},
'fn_indent': indent,
'fn_deindent': deindent,
'fn_print': print,
})
test_string = '\n'.join((
'brother',
' brochild1',
#' gchild1',
#' brochild2',
#' grandchild',
'sister',
#' sischild',
#'brother2',
))
nodes = grammar(test_string).nodes()

Categories

Resources