Python regular expressions (regex), convert JSON to text file for parsing - python

I annotated some video frames with the VGG annotator, which exports the annotations as JSON, and I want to parse that file to extract the values I need (the x, y coordinates).
I have looked at other posts on this site, but nothing seems to match my case because the length of the filename changes with the frame number: 0 to 9, then 10 to 99, 100 to 999 and 1000 to 9999, gaining one digit each time.
I have tried import glob with wildcard ranges, single characters and asterisks.
My code now:
#Edited
while count < 1200:
    x = data[key]['regions']['0']['shape_attributes']['cx']
    y = data[key]['regions']['0']['shape_attributes']['cy']
    pts = (x, y)
    xy.append(pts)
    count += 1
with open("coordinates.txt", "w") as f:
    f.write(str(xy))  # write() needs a string, not a list
JSON looks like:
        "shape_attributes": {
          "name": "point",
          "cx": 400,
          "cy": 121
        },
        "region_attributes": {}
      }
    }
  },
  "frame48.jpg78647": {
    "fileref": "",
    "size": 78647,
    "filename": "frame48.jpg",
    "base64_img_data": "",
    "file_attributes": {},
    "regions": {
      "0": {
        "shape_attributes": {
          "name": "point",
          "cx": 404,
          "cy": 114
        },
        "region_attributes": {}
      }
    }
Edit: I am going to convert the JSON to a .txt file and parse that to get my values, as I have no idea how to do it directly.
I tried converting the JSON to a string and parsing the string as below:
xy.append(re.findall(r'\b\d\d\d\b', datatxt))
This did the job: it collects only the x, y coordinates (3-digit ints) into a list. I am going to convert that to a list of (x, y) tuples and write it to a text file for later use as labels for a neural network that tracks the coordinates of a tennis ball in televised tennis matches.

You can't wildcard keys in a dictionary. Do you actually care about the keys at all? Are there entries you want to ignore, or are you happy to have any or all of them?
If the keys are unimportant, take data.values(), which gives you the inner dictionaries (wrap it in list() if you need to slice it), and go through the first 1,200 entries of that.
If some keys are not in the format you give, loop through the keys and check that each one matches first:
for key in data.keys():
    # raw string, and an escaped dot so that only a literal ".jpg" matches
    m = re.match(r'frame(\d+)\.jpg(\d+)$', key)
    if not m:
        continue
    f1, f2 = map(int, m.groups())
    if f1 < 0 or f1 > 1199 or f2 < 10000 or f2 > 99999:
        continue
    x = data[key]['regions']['0']['shape_attributes']['cx']
    y = data[key]['regions']['0']['shape_attributes']['cy']
    ...
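As a rough sketch of how the key check above could be carried through to a sorted list of coordinates: the sample data dictionary, the "notes" key and the names extract_points and save_points are assumptions made for illustration, not part of the annotator's output or the original answer.

```python
import re

# Hypothetical sample shaped like the export shown in the question.
data = {
    "frame48.jpg78647": {
        "regions": {"0": {"shape_attributes": {"name": "point", "cx": 404, "cy": 114}}}
    },
    "frame5.jpg70012": {
        "regions": {"0": {"shape_attributes": {"name": "point", "cx": 400, "cy": 121}}}
    },
    "notes": {},  # a key that does not follow the frame pattern and is skipped
}

KEY_PATTERN = re.compile(r"frame(\d+)\.jpg(\d+)$")


def extract_points(data, max_frame=1199):
    """Return (frame, x, y) tuples for matching keys, sorted by frame number."""
    points = []
    for key, entry in data.items():
        m = KEY_PATTERN.match(key)
        if not m:
            continue
        frame = int(m.group(1))
        if frame > max_frame:
            continue
        attrs = entry["regions"]["0"]["shape_attributes"]
        points.append((frame, attrs["cx"], attrs["cy"]))
    return sorted(points)


def save_points(points, path):
    """Write one 'frame x y' line per point."""
    with open(path, "w") as f:
        for frame, x, y in points:
            f.write(f"{frame} {x} {y}\n")


points = extract_points(data)
print(points)
```

Calling save_points(points, "coordinates.txt") would then write the labels to disk. Sorting on the numeric frame keeps frame5 ahead of frame48, which a plain string sort of the keys would not.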

Related

How to print out a value in a json, with only 1 'searchstring'

payload = {
    "data": {
        "name": "John",
        "surname": "Doe"
    }
}
print(payload["data"]["name"])
I want to print out the value of 'name' inside the json. I know the way to do it like above. But is there also a way to print out the value of 'name' with only 1 'search string'?
I'm looking for something like this
print(payload["data:name"])
Output:
John
If you were dealing with nested attributes of an object I would suggest operator.attrgetter; however, the itemgetter in the same module does not seem to support nested key access. It is fairly easy to implement something similar, though:
payload = {
    "data": {
        "name": "John",
        "surname": "Doe",
        "address": {
            "postcode": "667"
        }
    }
}
def get_key_path(d, path):
    # Remember latest object
    obj = d
    # For each key in the given list of keys
    for key in path:
        # Look up that key in the last object
        if key not in obj:
            raise KeyError(f"Object {obj} has no key {key}")
        # now we know the key exists, replace
        # last object with obj[key] to move to
        # the next level
        obj = obj[key]
    return obj
print(get_key_path(payload, ["data"]))
print(get_key_path(payload, ["data", "name"]))
print(get_key_path(payload, ["data", "address", "postcode"]))
Output:
$ python3 ~/tmp/so.py
{'name': 'John', 'surname': 'Doe', 'address': {'postcode': '667'}}
John
667
You can always decide on a separator character later and use a single string instead of a list for path; however, you need to make sure this character does not appear in any valid key. For example, using |, the only change you need to make in get_key_path is:
def get_key_path(d, path):
    obj = d
    for key in path.split("|"): # Here
        ...
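Put together, the separator variant might look like the sketch below. The sep parameter and the isinstance check are additions for illustration; the rest follows the function above.

```python
payload = {
    "data": {
        "name": "John",
        "surname": "Doe",
        "address": {"postcode": "667"},
    }
}


def get_key_path(d, path, sep="|"):
    """Follow a separator-delimited path of keys through nested dicts."""
    obj = d
    for key in path.split(sep):
        # Fail clearly if we hit a leaf early or the key is missing.
        if not isinstance(obj, dict) or key not in obj:
            raise KeyError(f"No key {key!r} while resolving {path!r}")
        obj = obj[key]
    return obj


print(get_key_path(payload, "data|name"))
print(get_key_path(payload, "data|address|postcode"))
print(get_key_path(payload, "data:name", sep=":"))
```

Making the separator a parameter lets the caller pick one that cannot occur in the keys of a given document.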
There isn't really a way you can do this by using the 'search string'. You can use the get() method, but like getting it using the square brackets, you will have to first parse the dictionary inside the data key.
You could try creating your own function that uses something like:
str.split(sep=None, maxsplit=-1)
Return a list of the words in the string, using sep as the delimiter string. If maxsplit is given, at most maxsplit splits are done (thus, the list will have at most maxsplit+1 elements). If maxsplit is not specified or -1, then there is no limit on the number of splits (all possible splits are made).
def get_leaf_value(d, search_string):
    if ":" not in search_string:
        return d[search_string]
    next_d, next_search_string = search_string.split(':', 1)
    # recurse into the nested dictionary with the rest of the path
    return get_leaf_value(d[next_d], next_search_string)
payload = {
    "data": {
        "name": "John",
        "surname": "Doe"
    }
}
print(payload["data"]["name"])
print(get_leaf_value(payload, "data:name"))
Output:
John
John
This approach will only work if your data is completely nested dictionaries like in your example (i.e., no lists in non-leaf nodes) and : is not part of any keys obviously.
Here is an alternative. Maybe an overkill, it depends.
jq uses a single "search string" - an expression called 'jq program' by the author - to extract and transform data. It is a powerful tool meaning the jq program can be quite complex. Reading a good tutorial is almost a must.
import pyjq
payload = ... as posted in the question ...
expr = '.data.name'
name = pyjq.one(expr, payload) # "John"
The original project (written in C) is located here. The Python jq libraries are built on top of that C code.

reading from a json file using python

I'm trying to read from this JSON file and print the values. I can't work out how to print all the values from the first dictionary in the list.
I want to print the following:
website: https://www.amazon.com/Apple-iPhone-GSM-Unlocked-64GB/dp/B07
price: 382,76
How can I do it?
JSON file:
[
  {
    "website": "https://www.amazon.com/Apple-iPhone-GSM-Unlocked-64GB/dp/B078P5BK5G",
    "price": "382,76"
  },
  {
    "website": "https://www.ebay.com/itm/Apple-iPhone-8-Plus-GSM-Unlocked-64GB-Gold-Renewed-Gold-64-GB-Gold-64-GB-/143340730792",
    "price": "609,15"
  }
]
Python code:
Tried this
import json
with open('./result.json') as json_file:
    data = json.load(json_file)
    for p in data:
        print(p["price"])
Output is the prices of the products:
382,76
609,15
Instead of printing the prices it should print the values in the first dict in the list. Any good tips on how to do this?
You are looping over the list of dictionaries. If you want to loop over the values of the first dictionary, you first need to get the first element, and loop over that one.
first_dict = data[0]
for value in first_dict.values():
    print(value)
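Since the desired output in the question also shows the key names, a variant using items() may be closer to what was asked. This is a small sketch that parses the JSON from an inline string (the raw variable is an assumption) rather than from result.json, so it runs on its own.

```python
import json

# Inline copy of the file contents from the question.
raw = """
[
  {
    "website": "https://www.amazon.com/Apple-iPhone-GSM-Unlocked-64GB/dp/B078P5BK5G",
    "price": "382,76"
  },
  {
    "website": "https://www.ebay.com/itm/Apple-iPhone-8-Plus-GSM-Unlocked-64GB-Gold-Renewed-Gold-64-GB-Gold-64-GB-/143340730792",
    "price": "609,15"
  }
]
"""

data = json.loads(raw)
first_dict = data[0]

# items() yields (key, value) pairs, so both can be printed together.
lines = [f"{key}: {value}" for key, value in first_dict.items()]
print("\n".join(lines))
```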

Check for number of occurrences in json element matching a criteria, python

I have the following code, which shows me the number of "Names" elements in a JSON file:
import json
with open('names.json') as f:
    item_dict = json.load(f)
print(len(item_dict['Names']))
What I want to do is find the number of times a specific attribute in a Names element is equal to a specific word.
"Names": [
  {
    "PId": 2,
    "Name": "John",
    "Surname": "Snow"
  }
I want to find how many Names elements in the file have Surname = Snow.
Help is appreciated
You can use a list comprehension
Ex:
print(sum([1 for i in item_dict['Names'] if i["Surname"] == "Snow"]))
Output:
1
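A slightly more general sketch of the same count is shown below. The count_matching helper and the second and third sample records are made up for illustration; only the John Snow record comes from the question. A generator expression avoids building a temporary list, and get() tolerates records that lack the field.

```python
# Sample data: the first record is from the question, the others are invented.
item_dict = {
    "Names": [
        {"PId": 2, "Name": "John", "Surname": "Snow"},
        {"PId": 3, "Name": "Arya", "Surname": "Stark"},
        {"PId": 4, "Name": "Jane", "Surname": "Snow"},
        {"PId": 5, "Name": "NoSurname"},
    ]
}


def count_matching(records, field, wanted):
    """Count the records whose field equals wanted; missing fields never match."""
    return sum(1 for record in records if record.get(field) == wanted)


print(count_matching(item_dict["Names"], "Surname", "Snow"))
```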

Get the parent key by matching the value using Regular Expression

Consider the JSON object below. I need to get the parent key by matching the value using a regular expression.
{
  "PRODUCT": {
    "attribs": {
      "U1": {
        "name": "^U.*1$"
      },
      "U2": {
        "name": "^U.*2$"
      },
      "U3": {
        "name": "^U.*3$"
      },
      "U4": {
        "name": "^U.*4$"
      },
      "U5": {
        "name": "^U.*5$"
      },
      "P1": {
        "name": "^P.*1$"
      }
    }
  }
}
I will be passing a string like "U10001"; it should return the key (U1) by matching the regular expression (^U.*1$).
If I pass a string like "P200001", it should return the key (P1) by matching the regular expression (^P.*1$).
I am looking for some help with this. Any help is appreciated.
I'm not sure how you are getting your JSON, but you added python as a tag, so I'm assuming that at some point you will have it stored as a string in your code.
First decode the string into a python dict.
import json
my_dict = json.loads(my_json)["PRODUCT"]["attribs"]
If the JSON is formatted as above you should get a dict with keys as your U1, U2, etc.
Now you can use filter in python to apply your regular expression logic, and re to do the actual matching.
import re
test_string = "U10001"
def re_filter(item):
    # item is a (key, value) pair; the value's "name" holds the pattern
    return re.match(item[1]["name"], test_string)
result = filter(re_filter, my_dict.items())
# Just get the matching attribute names
print([i[0] for i in result])
I haven't run the code, so it might need some syntax fixing, but this should give you the general idea. Of course, you will need to make it more generic to allow multiple products.
How about this:
import re
my_dict = {...}
def get_key(dict_, test):
    return next(k for k, v in dict_.items() if re.match(v['name'], test))
test = "U10001"
result = get_key(my_dict['PRODUCT']['attribs'], test)
print(result) # U1
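One thing to watch with next() on a bare generator is that it raises StopIteration when nothing matches. A hedged sketch that returns a default instead is shown below; find_keys, find_first_key and the trimmed attribs dictionary are illustrative names and data, not part of the answers above.

```python
import re

# A trimmed copy of the "attribs" dictionary from the question.
attribs = {
    "U1": {"name": "^U.*1$"},
    "U2": {"name": "^U.*2$"},
    "P1": {"name": "^P.*1$"},
}


def find_keys(attribs, text):
    """Return every key whose "name" pattern matches text."""
    return [key for key, spec in attribs.items() if re.match(spec["name"], text)]


def find_first_key(attribs, text, default=None):
    """Return the first matching key, or default when no pattern matches."""
    return next(iter(find_keys(attribs, text)), default)


print(find_first_key(attribs, "U10001"))
print(find_first_key(attribs, "P200001"))
print(find_first_key(attribs, "X999"))
```

Keeping find_keys separate also makes it easy to detect the case where more than one pattern matches the same string.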
Can you please elaborate on exactly what you want to design? Here's a quick way to return the desired key.
import re
def getKey(string):
    return re.search(r'^(.\d)\d+', string).group(1)
If you want to loop over the whole JSON, load it into a dictionary and then loop over the "PRODUCT" -> "attribs" dictionary to get the required key:
import json, re
with open('../file/path/here') as f:
    d = json.load(f)
patents = d['PRODUCT']['attribs']
for key, val in patents.items():
    m = re.search(r'^(.\d)\d+', val['name'])
    if m is None:  # no match, e.g. when "name" holds a pattern such as "^U.*1$"
        continue
    patent_group = m.group(1)  # U1, U2, U3, ... or P1, P2, P3, ...
    # do whatever with patent_group (U1/P1 etc.)

Create (json-) array from browser query string

From the browser's geolocation API query, I get this:
browser=opera&sensor=true&wifi=mac:B0-48-7A-99-BD-86|ss:-72|ssid:Baldur WLAN|age:4033|chan:6&wifi=mac:00-24-FE-A7-BA-94|ss:-83|ssid:wlan23-k!17|age:4033|chan:10&wifi=mac:90-F6-52-3F-60-64|ss:-95|ssid:Baldur WLAN|age:4033|chan:13&device=mcc:262|mnc:7|rt:3&cell=id:15479311|lac:21905|mcc:262|mnc:7|ss:-107|ta:0&location=lat:52.398529|lng:13.107570
I would like to access every single value in a locally structured form. My approach is to build a more deeply nested JSON structure than I get by splitting on "&" first and "=" afterwards, which only yields a flat array of all values in the query. Another approach, applying the regex (\w+)=(.*) after splitting on "&", ends at the same depth, but I need the details inside each value accessible as typed data.
The resulting array should look like:
{
  "browser": ["opera"],
  ...
  "location": [{
    "lat": 52.398529,
    "lng": 13.107570
  }],
  ...
  "wifi": [{
    "mac": "00-24-FE-A7-BA-94",
    "ss": -83,
    ...
  },
  {
    "mac": "00-24-FE-A7-BA-94",
    "ss": -83,
    ...
  }]
Or something similar that I can parse with an additional json library to access the values using python. Can anyone help with this?
Here is a solution that goes through a dictionary:
import re
import json
def str_to_dict(s, sepfield, sepkv, infields=None):
    """Transform a string into a dictionary.
    s: the string to transform
    sepfield: the field separator
    sepkv: the key/value separator
    infields: a function to be applied to the values
    If infields is defined, each key maps to a list of the parsed values
    that share that key; otherwise the value is associated to the key as it is."""
    pattern = "([^%s%s]*?)%s([^%s]*)" % (sepkv, sepfield, sepkv, sepfield)
    matches = re.findall(pattern, s)
    if infields is None:
        return dict(matches)
    else:
        r = dict()
        for k, v in matches:
            parsedval = infields(v)
            if k not in r:
                r[k] = []
            r[k].append(parsedval)
        return r
def second_level_parsing(x):
    return x if x.find("|") == -1 else str_to_dict(x, "|", ":")
# s is the query string from the question
json.dumps(str_to_dict(s, "&", "=", second_level_parsing))
You can easily extend this to multiple levels. Note that the behaviour differs depending on whether the infields function is defined, in order to match the output you asked for.
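An alternative sketch, swapping the regex for the standard library's urllib.parse.parse_qs and adding the numeric conversion the question asks for, could look like the following. The helper names to_number, parse_field and parse_query are assumptions for illustration, and the query string is a shortened copy of the one in the question.

```python
import json
from urllib.parse import parse_qs


def to_number(text):
    """Return text as int or float when it parses as one, else unchanged."""
    for cast in (int, float):
        try:
            return cast(text)
        except ValueError:
            pass
    return text


def parse_field(value):
    """Turn 'k1:v1|k2:v2' into a dict; leave plain values as they are."""
    if ":" not in value:
        return value
    pairs = (part.partition(":") for part in value.split("|"))
    return {key: to_number(val) for key, _, val in pairs}


def parse_query(query):
    """Map each query key to a list of parsed values (repeated keys are kept)."""
    return {
        key: [parse_field(v) for v in values]
        for key, values in parse_qs(query).items()
    }


query = (
    "browser=opera&sensor=true"
    "&wifi=mac:B0-48-7A-99-BD-86|ss:-72|ssid:Baldur WLAN|age:4033|chan:6"
    "&wifi=mac:00-24-FE-A7-BA-94|ss:-83|ssid:wlan23-k!17|age:4033|chan:10"
    "&location=lat:52.398529|lng:13.107570"
)
result = parse_query(query)
print(json.dumps(result, indent=2))
```

Note that to_number converts anything that parses as a number, so an SSID made only of digits would come back as an int; a per-key conversion table would be safer if that matters.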
