Following the syntax from jsonpath_ng:
path = '$.data.objects[*].currencies[*].name'
related = '$.data.objects[4].currencies[0].name'
unrelated = '$.data.objects[4].currencies[0].value'
I want to compare two strings representing JSON paths in Python and determine whether they are equivalent when indexes are ignored. For example, related would match path, whereas unrelated would not.
Is there a cleaner way to do this other than regex? I am already using jsonpath_ng in this module but cannot see support for this functionality.
To be clear: this operation is independent of any subsequent JSON object referencing; I just want to determine whether the paths themselves are equivalent.
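For reference, the regex route being avoided here is only a single substitution. A minimal sketch of the intended equivalence, assuming concrete indexes only ever appear as digits in square brackets:

import re

def normalize(p):
    # Replace every concrete index such as [4] with the wildcard [*],
    # so that index choice is ignored when comparing paths.
    return re.sub(r'\[\d+\]', '[*]', p)

path = '$.data.objects[*].currencies[*].name'
related = '$.data.objects[4].currencies[0].name'
unrelated = '$.data.objects[4].currencies[0].value'

print(normalize(related) == normalize(path))    # True
print(normalize(unrelated) == normalize(path))  # False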
I am working with a Python library that I do not have the source code for and which consists of compiled *.pyd modules. The library has attributes that appear to be constants (integers) which set various parameters in what I assume is a wrapped C-library.
For example:
# pixet is the library name
pixet.PX_TPX3_OPM_TOATOT # is equal to integer 0
pixet.PX_TPX3_OPM_EVENT_ITOT # is equal to integer 2
Since I do not have the source or the header files, I am not sure what all the possibilities are, and they do not show up in dir(pixet). Other than some kind of silly brute-force attempt at all possible attribute names, is there a way I can get a list of the integer constants, or are they hidden in the compiled code?
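One hedged avenue, a heuristic rather than a real introspection API: names registered with the C API often survive as literal strings in the binary, so scanning the .pyd for a known prefix can recover candidates. The PX_ prefix comes from the examples above; the path to the .pyd file is an assumption.

import re

# Scan the compiled extension for printable strings that look like
# constant names (heuristic; may miss names or report false positives).
with open('pixet.pyd', 'rb') as f:
    data = f.read()

names = sorted(set(re.findall(rb'PX_[A-Z0-9_]+', data)))
for name in names:
    print(name.decode('ascii'))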
tl;dr: Looking to parse a large set of filenames that are a concatenation of two names (container + child) to recover the original two names, where nomenclature is inconsistent. Python library suggestions or any other guidance appreciated.
I am looking for a way to parse strings for information where the nomenclature and formatting of information within those strings will likely be inconsistent to some degree.
Background
Industry: Automation controls
Problem to be solved:
Time-series data is exported from an automation system, with a single data point saved to a single .csv file. (Example: if the control system were an environmental control system, the point might be the measured temperature of a room taken at 15-minute intervals.) An environment may have anywhere from a few dozen to several thousand points exporting to CSV files. The points are normally stored in the following structure: points are contained within a controller, controllers are integrated under a management system, and occasionally management systems are integrated into another management system. The resulting structure is a simple hierarchical tree.
The filenames associated with the CSV files are assembled from the path structure of each point as follows: Directories are created for the management systems (nested if necessary) and under those directories are the CSV files where the filename is a concatenation of the controller name and the point name.
I have written a Python script that processes a monthly export of the CSV files (currently about 5500 of them and growing) into a structured data store, and another that assembles spreadsheets for others to review. Currently, I am using some really ugly regular expressions and even uglier string.find()s, with a list of static string values that I have hand-entered, to parse out controller names and point names for each file so that they can be inserted into the structured data store.
Unfortunately, as mentioned above, the nomenclature used in these environments is rarely consistent. Point names vary widely. The point referenced above might be known as ROOMTEMP, RM_T, RM-T, ROOM-T, ZN_T, ZNT, RMT or several other possibilities. This applies to almost any point contained within a controller. Controller names are also somewhat inconsistent: they may be named for the type of device they control, the geographic location of the device, or even an asset number associated with the device.
I would very much like to get out of the business of hand writing regular expressions to parse file names every time a new location is added. I would like to write code that reads in filenames and looks for patterns across the filenames and then makes a recommendation for parsing the controller and point name out of each filename. I already have an interface where I can assign controller name and point name to each point object by hand so if there are errors with the parse I can modify the results. Ideally, the patterns created by the existing objects would influence the suggested names of new files being parsed.
Some examples of filenames are as follows:
UNIT1254_SAT.csv, UNIT1254_RMT.csv, UNIT1254_fil.csv, AHU_5311_CLG_O.csv, QE239-01_DISCH_STPT.csv, HX_E2_CHW_Return.csv, Plant_RM221_CHW_Sys_Enable.csv, TU_E7_Actual Clg Setpoint.csv, 1725_ROOMTEMP.csv, 1725_DA_T.csv, 1725_RA_T.csv
The order will always be consistent where it is a concatenation of controller name and then point name. There will most likely be a consistent character used to separate controller name from point name (normally an underscore, but occasionally a dash or some other character.)
Does anyone have any recommendations on how to get started with parsing these file names? I've thought through a few ideas but keep shelving them before implementation because I keep finding potential performance issues or failure points. The rest of my code is working pretty much the way I need it to; I just haven't figured out an efficient or useful way to pull the correct names out of the filenames. Unfortunately, it is not an option to modify the names on the control system side to be consistent.
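As one possible starting point, a minimal sketch (using filenames from the examples above) that guesses the dominant separator by counting candidate characters across all names; the candidate set '_-. ' is an assumption:

from collections import Counter

names = ['UNIT1254_SAT', 'AHU_5311_CLG_O', 'QE239-01_DISCH_STPT',
         '1725_ROOMTEMP', 'HX_E2_CHW_Return']

# Count how often each candidate separator character appears.
counts = Counter(ch for n in names for ch in n if ch in '_-. ')
print(counts.most_common(1))  # the most frequent candidate separator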
I don't know if the following code will help you, but I hope it'll give you at least some idea.
Considering that a filename such as "QE239-01_STPT_1725_ROOMTEMP_DA" can contain the following names
'QE239-01'
'QE239-01_STPT'
'QE239-01_STPT_1725'
'QE239-01_STPT_1725_ROOMTEMP'
'QE239-01_STPT_1725_ROOMTEMP_DA'
'STPT'
'STPT_1725'
'STPT_1725_ROOMTEMP'
'STPT_1725_ROOMTEMP_DA'
'1725'
'1725_ROOMTEMP'
'1725_ROOMTEMP_DA'
'ROOMTEMP'
'ROOMTEMP_DA'
'DA'
as being possible elements (container name or point name) of the filename,
I defined the function treat() to return this list from the name.
Then the code treats all the filenames to find all the possible elements of filenames.
The function is based on the idea that, in the chosen example, the element ROOMTEMP can't directly follow the element STPT: STPT_ROOMTEMP isn't a possible name here, since 1725 sits between those two elements.
Then, with the help of a function in the difflib module, I try to group elements that have some similarity, in order to detect patterns under which several name elements can be gathered.
You will need to experiment with the value passed to the cutoff parameter to find what gives the most interesting results for you.
It's far from perfect, certainly, but I haven't understood all aspects of your problem.
s = \
"""UNIT1254_SAT
UNIT1254_RMT
UNIT1254_fil
AHU_5311_CLG_O
QE239-01_DISCH_STPT
HX_E2_CHW_Return
Plant_RM221_CHW_Sys_Enable
TU_E7_Actual Clg Setpoint
1725_ROOMTEMP
1725_DA_T
1725_RA_T
UNT147_ROOMTEMP
TRU_EZ_RM_T
HXX_V2_RM-T
RHXX_V2_ROOM-T
SIX8_ZN_T
Plint_RP228_ZNT
SOHO79_EZ_RMT"""
li = s.split('\n')
print(li)
print('- - - - - - - - - - - - - - - - - ')

import difflib
from pprint import pprint

def treat(name):
    # Return every contiguous run of underscore-separated tokens in name,
    # i.e. all candidate container/point elements.
    lu = name.split('_')
    W = []
    while lu:
        W.extend('_'.join(lu[0:x]) for x in range(1, len(lu) + 1))
        lu.pop(0)
    return W

if 0:  # set to 1 to see treat() applied to a single example
    q = "QE239-01_STPT_1725_ROOMTEMP_DA"
    pprint(treat(q))

print('==========================================')

WALL = []
for t in li:
    WALL.extend(treat(t))
pprint(WALL)

# Group candidate elements that resemble one another; n is set very
# high so that no close match is dropped.
for x in WALL:
    j = set(difflib.get_close_matches(x, WALL, n=9000000, cutoff=0.7))
    if len(j) > 1:
        print(j, '\n')
Python modules can be parsed with ast.parse.
With the tree that ast.parse returns, nodes can be iterated and attributes like lineno and name can be accessed.
Currently, the way I determine the length (number of lines) of a function is to find the node X that corresponds to the function and the node Y right after it; then Y.lineno - X.lineno is the length (possibly with blank lines included).
But this method is not fast enough for me. Are there any tools I can use for this kind of static analysis?
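For reference, on Python 3.8+ every AST node also records end_lineno, so a function's span can be read off directly without peeking at the next node; a minimal sketch:

import ast

source = '''\
def f(x):
    y = x + 1
    return y
'''

tree = ast.parse(source)
for node in ast.walk(tree):
    if isinstance(node, ast.FunctionDef):
        # end_lineno points at the last line of the function body.
        print(node.name, node.end_lineno - node.lineno + 1)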
I've just started using yaml and I love it. However, the other day I came across a case that seemed really odd and I am not sure what is causing it. I have a list of file path locations and another list of file path destinations. I create a dictionary out of them and then use yaml to dump it out to read later (I work with artists and use yaml so that it is human readable as well).
Sorry for the long lists:
source = ['/data/job/maze/build/vehicle/blackhawk/blackhawkHelicopter/work/data/map/tasks/model/v026_03/blackhawk_diff.exr', '/data/job/maze/build/vehicle/blackhawk/blackhawkHelicopter/work/data/map/tasks/model/v026_03/blackhawk_maskTapeFloor.1051.exr', '/data/job/maze/build/vehicle/blackhawk/blackhawkHelicopter/work/data/map/tasks/model/v026_03/blackhawk_maskBurnt.1031.exr']
dest = ['/data/job/maze/build/vehicle/blackhawk/blackhawkHelicopter/work/data/map/tasks/texture/v0006/blackhawk_diff_diffuse_v0006.exr', '/data/job/maze/build/vehicle/blackhawk/blackhawkHelicopter/work/data/map/tasks/texture/v0006/blackhawk_maskTapeFloor_diffuse_v0006.1051.exr', '/data/job/maze/build/vehicle/blackhawk/blackhawkHelicopter/work/data/map/tasks/texture/v0006/blackhawk_maskBurnt_diffuse_v0006.1031.exr']
import yaml

dictionary = dict(zip(source, dest))
print yaml.dump(dictionary)
This is the output that I get:
{/data/job/maze/build/vehicle/blackhawk/blackhawkHelicopter/work/data/map/tasks/model/v026_03/blackhawk_diff.exr: /data/job/maze/build/vehicle/blackhawk/blackhawkHelicopter/work/data/map/tasks/texture/v0006/blackhaw
k_diff_diffuse_v0006.exr,
/data/job/maze/build/vehicle/blackhawk/blackhawkHelicopter/work/data/map/tasks/model/v026_03/blackhawk_maskBurnt.1031.exr: /data/job/maze/build/vehicle/blackhawk/blackhawkHelicopter/work/data/map/tasks/texture/v00
06/blackhawk_maskBurnt_diffuse_v0006.1031.exr,
? /data/job/maze/build/vehicle/blackhawk/blackhawkHelicopter/work/data/map/tasks/model/v026_03/blackhawk_maskTapeFloor.1051.exr
: /data/job/maze/build/vehicle/blackhawk/blackhawkHelicopter/work/data/map/tasks/texture/v0006/blackhawk_maskTapeFloor_diffuse_v0006.1051.exr}
It loads back in fine with yaml.load, but this is not useful for artists to be able to edit if need be.
This is the first question in the FAQ.
By default, PyYAML chooses the style of a collection depending on whether it has nested collections. If a collection has nested collections, it will be assigned the block style. Otherwise it will have the flow style.
If you want collections to be always serialized in the block style, set the parameter default_flow_style of dump() to False.
So:
>>> print yaml.dump(dictionary, default_flow_style=False)
/data/job/maze/build/vehicle/blackhawk/blackhawkHelicopter/work/data/map/tasks/model/v026_03/blackhawk_diff.exr: /data/job/maze/build/vehicle/blackhawk/blackhawkHelicopter/work/data/map/tasks/texture/v0006/blackhawk_diff_diffuse_v0006.exr
/data/job/maze/build/vehicle/blackhawk/blackhawkHelicopter/work/data/map/tasks/model/v026_03/blackhawk_maskBurnt.1031.exr: /data/job/maze/build/vehicle/blackhawk/blackhawkHelicopter/work/data/map/tasks/texture/v0006/blackhawk_maskBurnt_diffuse_v0006.1031.exr
? /data/job/maze/build/vehicle/blackhawk/blackhawkHelicopter/work/data/map/tasks/model/v026_03/blackhawk_maskTapeFloor.1051.exr
: /data/job/maze/build/vehicle/blackhawk/blackhawkHelicopter/work/data/map/tasks/texture/v0006/blackhawk_maskTapeFloor_diffuse_v0006.1051.exr
Still not exactly beautiful, but when you have strings longer than 80 characters as keys, it's about as good as you can reasonably expect.
If you model (part of) the filesystem hierarchy in your object hierarchy, or create aliases (or dynamic aliasers) for parts of the tree, etc., the YAML will look a lot nicer. But that's something you have to actually do at the object-model level; as far as YAML is concerned, those long paths full of repeated prefixes are just strings.
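A hedged sketch of that object-model idea: split shared path prefixes into nested mappings before dumping. nest() is a hypothetical helper, not part of PyYAML, and the sample paths are shortened stand-ins:

import yaml

def nest(flat):
    # Turn {'/a/b/c': value} into {'a': {'b': {'c': value}}}.
    root = {}
    for src, dst in flat.items():
        parts = src.strip('/').split('/')
        node = root
        for part in parts[:-1]:
            node = node.setdefault(part, {})
        node[parts[-1]] = dst
    return root

flat = {'/data/job/maze/blackhawk_diff.exr': '/data/job/maze/blackhawk_diff_v0006.exr'}
print(yaml.dump(nest(flat), default_flow_style=False))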
So, valid JSON must be an Object or Array, correct? I was expecting the following code to throw an exception, but it does not:
>>> import json
>>> json.loads("245235")
245235
That's not invalid JSON*. Number is a valid JSON type, just like object (see http://en.wikipedia.org/wiki/JSON#Data_types.2C_syntax_and_example). Any of these types can appear on its own, although object and array are probably the most common top-level types.
*according to the python implementation
EDIT:
As pointed out in a deleted (not sure why) answer, the python docs suggest that the JSON RFC does require the top level object to be of array or object type, but that the json module doesn't enforce this. Since a lot of what I know about JSON has come from working with the python json module, I don't know how portable this behavior is.
As requested, this is noted at http://docs.python.org/2/library/json.html#standard-compliance:
This module does not comply with the RFC in a strict fashion, implementing some extensions that are valid JavaScript but not valid JSON. In particular:
Top-level non-object, non-array values are accepted and output;
Infinite and NaN number values are accepted and output;
Repeated names within an object are accepted, and only the value of the last name-value pair is used.
JSON data can have a wide range of types, including strings, numbers, and booleans.
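A quick demonstration of that leniency: each of these scalar documents parses without error in Python's json module, even though the RFC describes objects and arrays at the top level:

import json

for doc in ['245235', '"a string"', 'true', 'NaN']:
    # NaN is one of the documented extensions beyond strict JSON.
    print(repr(json.loads(doc)))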