Match regex in python and return key - python

I have a nested dictionary and I am having trouble matching a regular expression against its values. I need to iterate through the values and return the key whose value the regex matched.
The nested dictionary looks like this:
user_info = {
    'user1': {'name': 'Aby',
              'surname': 'Clark',
              'description': 'Hi contact me by phone +1 548 5455 55 or facebook.com/aby.clark'},
    'user2': {'name': 'Marta',
              'surname': 'Bishop',
              'description': 'Nice to meet you text me'},
    'user3': {'name': 'Janice',
              'surname': 'Valinise',
              'description': 'You can contact me by phone +1 457 555667'},
    'user4': {'name': 'Helen',
              'surname': 'Bush',
              'description': 'You can contact me by phone +1 778 65422'},
    'user5': {'name': 'Janice',
              'surname': 'Valinise',
              'description': 'You can contact me by phone +1 457 5342327 or email janval#yahoo.com'}}
So I need to iterate through the values of the dictionary with a regex, find a match, and return the key where the match happened.
The first problem I faced was extracting the values from the nested dictionary, but I solved it with:
for key in user_info.keys():
    for values in user_info[key].values():
        print(values)
This gives me the values from the nested dictionary. So is there a way to run a regex over these values so that it finds a match and returns the key where the match happened?
I tried the following:
for key in user_info.keys():
    for values in user_info.[key].values():
        #this regex match the email
        email = re.compile(r'(^[a-zA-Z0-9_.+-]+#[a-zA-Z0-9-]+\.[a-zA-Z0-9-.]+$)'.format(pattern), re.IGNORECASE|re.MULTILINE)
        match = re.match(email)
        if match is not None:
            print ("No values.")
        if found:
            return match
Am I doing something wrong? I have been wrestling with this question for a week...
Could you please tell me what's wrong and give me some tips on how to solve this #!4fd... please. Thank you!
P.S. And yes, I didn't find a similar issue on Stack Overflow or Google. I've tried.

Looks like you want to extract the emails from the JSON values while also returning the matched key. Here are 2 solutions. The first one is similar to yours and the second one is generalized to any JSON with arbitrary levels.
Two for loops
import re
user_info = {
"user1": {
"name": "Aby",
"surname": "Clark",
"description": "Hi contact me by phone +1 548 5455 55or facebook.com/aby.clark"
},
"user2": {
"name": "Marta",
"surname": "Bishop",
"description": "Nice to meet you text me"
},
"user3": {
"name": "Janice",
"surname": "Valinise",
"description": "You can contact me by phone +1 457 555667"
},
"user4": {
"name": "Helen",
"surname": "Bush",
"description": "You can contact me by phone +1 778 65422"
},
"user5": {
"name": "Janice",
"surname": "Valinise",
"description": "You can contact me by phone +1 457 5342327 or email janval#yahoo.com",
}
}
matches = []
for user, info in user_info.items():
    for key, value in info.items():
        # Raw string, so the "\." escape reaches the regex engine unchanged
        emails = re.findall(r"([a-zA-Z0-9_.+-]+#[a-zA-Z0-9-]+\.[a-zA-Z0-9-.]+)", value)
        if emails:
            matches.append((f'{user}.{key}', emails))
print(matches)
# -> [('user5.description', ['janval#yahoo.com'])]
The recursive approach for arbitrary JSON
import re
user_info = {
"user1": {
"name": "Aby",
"surname": "Clark",
"description": "Hi contact me by phone +1 548 5455 55or janval#yahoo.com",
"friends": [
{
"name": "Aby",
"surname": "Clark",
"description": "Hi contact me by phone +1 548 5455 55or janval#yahoo.com",
}
]
}
}
def traverse(obj, keys=()):
    # String leaf: report its dotted path if it contains any emails
    if isinstance(obj, str):
        emails = re.findall(r"([a-zA-Z0-9_.+-]+#[a-zA-Z0-9-]+\.[a-zA-Z0-9-.]+)", obj)
        return [('.'.join(keys), emails)] if emails else []
    # Dict: recurse into each value, extending the path with its key
    if isinstance(obj, dict):
        return [match for key, value in obj.items() for match in traverse(value, [*keys, key])]
    # List: recurse into each item, extending the path with its index
    if isinstance(obj, list):
        return [match for i, value in enumerate(obj) for match in traverse(value, [*keys, str(i)])]
    return []
print(traverse(user_info, []))
# -> [('user1.description', ['janval#yahoo.com']), ('user1.friends.0.description', ['janval#yahoo.com'])]

You can try using search instead of match, as follows. re.match only matches at the beginning of the string, whereas re.search scans the whole string:
for key in user_info.keys():
    for values in user_info[key].values():
        email = re.search(r'([a-zA-Z0-9_.+-]+#[a-zA-Z0-9-]+\.[a-zA-Z0-9-.]+)+', values)
        if email is not None:
            print(key)
This code prints every key whose inner values contain a match.
Note that the code you tried never used values at all.
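If the goal is to get the keys back as a value rather than print them, the same search can be wrapped in a function. This is only a minimal sketch: keys_matching and the trimmed sample dictionary are hypothetical names introduced for illustration, and the pattern keeps the '#' separator that appears in the sample data.

```python
import re

# Pattern taken from the answers; the sample data uses '#' where a real
# email address would have '@'.
EMAIL_RE = re.compile(r"[a-zA-Z0-9_.+-]+#[a-zA-Z0-9-]+\.[a-zA-Z0-9-.]+")


def keys_matching(nested, pattern):
    """Return the outer keys whose inner string values contain a match."""
    return [
        outer_key
        for outer_key, inner in nested.items()
        if any(isinstance(v, str) and pattern.search(v) for v in inner.values())
    ]


# Trimmed-down, hypothetical version of the user_info dictionary
sample = {
    "user2": {"name": "Marta", "description": "Nice to meet you text me"},
    "user5": {"name": "Janice", "description": "email janval#yahoo.com"},
}

print(keys_matching(sample, EMAIL_RE))
# -> ['user5']
```

Compiling the pattern once outside the loop avoids rebuilding it for every value, which was one of the issues in the attempted code.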

Related

Referencing Values in a List (syntax issue?) [duplicate]

I wrote some code to get data from a web API. I was able to parse the JSON data from the API, but the result I get looks quite complex. Here is one example:
>>> my_json
{'name': 'ns1:timeSeriesResponseType', 'declaredType': 'org.cuahsi.waterml.TimeSeriesResponseType', 'scope': 'javax.xml.bind.JAXBElement$GlobalScope', 'value': {'queryInfo': {'creationTime': 1349724919000, 'queryURL': 'http://waterservices.usgs.gov/nwis/iv/', 'criteria': {'locationParam': '[ALL:103232434]', 'variableParam': '[00060, 00065]'}, 'note': [{'value': '[ALL:103232434]', 'title': 'filter:sites'}, {'value': '[mode=LATEST, modifiedSince=null]', 'title': 'filter:timeRange'}, {'value': 'sdas01', 'title': 'server'}]}}, 'nil': False, 'globalScope': True, 'typeSubstituted': False}
Looking through this data, I can see the specific data I want: the 1349724919000 value that is labelled as 'creationTime'.
How can I write code that directly gets this value?
I don't need any searching logic to find this value. I can see what I need when I look at the response; I just need to know how to translate that into specific code to extract the specific value, in a hard-coded way. I read some tutorials, so I understand that I need to use [] to access elements of the nested lists and dictionaries; but I can't figure out exactly how it works for a complex case.
More generally, how can I figure out what the "path" is to the data, and write the code for it?
For reference, let's see what the original JSON would look like, with pretty formatting:
>>> print(json.dumps(my_json, indent=4))
{
"name": "ns1:timeSeriesResponseType",
"declaredType": "org.cuahsi.waterml.TimeSeriesResponseType",
"scope": "javax.xml.bind.JAXBElement$GlobalScope",
"value": {
"queryInfo": {
"creationTime": 1349724919000,
"queryURL": "http://waterservices.usgs.gov/nwis/iv/",
"criteria": {
"locationParam": "[ALL:103232434]",
"variableParam": "[00060, 00065]"
},
"note": [
{
"value": "[ALL:103232434]",
"title": "filter:sites"
},
{
"value": "[mode=LATEST, modifiedSince=null]",
"title": "filter:timeRange"
},
{
"value": "sdas01",
"title": "server"
}
]
}
},
"nil": false,
"globalScope": true,
"typeSubstituted": false
}
That lets us see the structure of the data more clearly.
In the specific case, first we want to look at the corresponding value under the 'value' key in our parsed data. That is another dict; we can access the value of its 'queryInfo' key in the same way, and similarly the 'creationTime' from there.
To get the desired value, we simply put those accesses one after another:
my_json['value']['queryInfo']['creationTime'] # 1349724919000
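For the more general question of how to figure out the path, one option is to search the parsed data for the key name and record every key and index passed along the way. The helper below is an illustrative sketch, not something provided by the json module; find_key_paths and the trimmed sample are assumed names.

```python
def find_key_paths(obj, target_key, path=()):
    """Yield every path (tuple of keys/indexes) that ends at target_key."""
    if isinstance(obj, dict):
        for key, value in obj.items():
            if key == target_key:
                yield path + (key,)
            # Keep descending: the same key may appear deeper as well
            yield from find_key_paths(value, target_key, path + (key,))
    elif isinstance(obj, list):
        for index, value in enumerate(obj):
            yield from find_key_paths(value, target_key, path + (index,))


# Hypothetical, trimmed-down version of the parsed response
sample = {
    "value": {
        "queryInfo": {
            "creationTime": 1349724919000,
            "note": [{"value": "sdas01", "title": "server"}],
        }
    }
}

print(list(find_key_paths(sample, "creationTime")))
# -> [('value', 'queryInfo', 'creationTime')]
```

Each element of a returned path corresponds to one pair of square brackets in the final access expression.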
"I just need to know how to translate that into specific code to extract the specific value, in a hard-coded way."
If you access the API again, the new data might not match the code's expectation. You may find it useful to add some error handling. For example, use .get() to access dictionaries in the data, rather than indexing:
name = my_json.get('name') # will return None if 'name' doesn't exist
Another way is to test for a key explicitly:
if 'name' in my_json:
    name = my_json['name']
else:
    pass
However, these approaches may fail if further accesses are required. A placeholder result of None isn't a dictionary or a list, so attempts to access it that way will fail again (with TypeError). Since "Simple is better than complex" and "it's easier to ask for forgiveness than permission", the straightforward solution is to use exception handling:
try:
    creation_time = my_json['value']['queryInfo']['creationTime']
except (TypeError, KeyError):
    print("could not read the creation time!")
    # or substitute a placeholder, or raise a new exception, etc.
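If the same pattern is needed for several values, the try/except can be folded into a small helper. This is a sketch under stated assumptions: get_path is a hypothetical name, and IndexError is caught in addition to the exceptions above so that list indexes that are out of range also fall back to the default.

```python
def get_path(data, path, default=None):
    """Follow a sequence of keys/indexes; return default if any step fails."""
    current = data
    try:
        for step in path:
            current = current[step]
    except (KeyError, IndexError, TypeError):
        return default
    return current


# Hypothetical, trimmed-down version of the parsed response
sample = {
    "value": {
        "queryInfo": {
            "creationTime": 1349724919000,
            "note": [{"title": "server"}],
        }
    }
}

print(get_path(sample, ["value", "queryInfo", "creationTime"]))
# -> 1349724919000
print(get_path(sample, ["value", "missing", "creationTime"]))
# -> None
```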
Here is an example of loading a single value from simple JSON data, and converting back and forth to JSON:
import json
# start with a plain Python dict
data = {"test1": "1", "test2": "2", "test3": "3"}
# serialize the dict to a JSON string
json_str = json.dumps(data)
# parse the JSON string back into a dict
resp = json.loads(json_str)
# print the parsed data
print(resp)
# extract a single value from the parsed data
print(resp['test1'])
Try this.
Here, I fetch only statecode from the COVID API (a JSON array).
import requests
r = requests.get('https://api.covid19india.org/data.json')
x = r.json()['statewise']
for i in x:
    print(i['statecode'])
Try this:
from functools import reduce
import re
def deep_get_imps(data, key: str):
    split_keys = re.split("[\\[\\]]", key)
    out_data = data
    for split_key in split_keys:
        if split_key == "":
            return out_data
        elif isinstance(out_data, dict):
            out_data = out_data.get(split_key)
        elif isinstance(out_data, list):
            try:
                sub = int(split_key)
            except ValueError:
                return None
            else:
                length = len(out_data)
                out_data = out_data[sub] if -length <= sub < length else None
        else:
            return None
    return out_data
def deep_get(dictionary, keys):
    return reduce(deep_get_imps, keys.split("."), dictionary)
Then you can use it like below:
res = {
"status": 200,
"info": {
"name": "Test",
"date": "2021-06-12"
},
"result": [{
"name": "test1",
"value": 2.5
}, {
"name": "test2",
"value": 1.9
},{
"name": "test1",
"value": 3.1
}]
}
>>> deep_get(res, "info")
{'name': 'Test', 'date': '2021-06-12'}
>>> deep_get(res, "info.date")
'2021-06-12'
>>> deep_get(res, "result")
[{'name': 'test1', 'value': 2.5}, {'name': 'test2', 'value': 1.9}, {'name': 'test1', 'value': 3.1}]
>>> deep_get(res, "result[2]")
{'name': 'test1', 'value': 3.1}
>>> deep_get(res, "result[-1]")
{'name': 'test1', 'value': 3.1}
>>> deep_get(res, "result[2].name")
'test1'

Printing pair of a dict

I'm new to Python but always trying to learn.
Today I got this error while trying to select a key from a dictionary:
print(data['town'])
KeyError: 'town'
My code:
import requests
defworld = "Pacera"
defcity = 'Svargrond'
requisicao = requests.get(f"https://api.tibiadata.com/v2/houses/{defworld}/{defcity}.json")
data = requisicao.json()
print(data['town'])
The JSON/dict looks like this:
{
"houses": {
"town": "Venore",
"world": "Antica",
"type": "houses",
"houses": [
{
"houseid": 35006,
"name": "Dagger Alley 1",
"size": 57,
"rent": 2665,
"status": "rented"
}, {
"houseid": 35009,
"name": "Dream Street 1 (Shop)",
"size": 94,
"rent": 4330,
"status": "rented"
},
...
]
},
"information": {
"api_version": 2,
"execution_time": 0.0011,
"last_updated": "2017-12-15 08:00:00",
"timestamp": "2017-12-15 08:00:02"
}
}
The question is, how to print the pairs?
Thanks
You have to access the town object by accessing the houses field first, since there is nesting.
You want print(data['houses']['town']).
To avoid your first error, do
print(data["houses"]["town"])
(since it's {"houses": {"town": ...}}, not {"town": ...}).
To e.g. print all of the names of the houses, do
for house in data["houses"]["houses"]:
    print(house["name"])
As answered, you must use data['houses']['town']. To avoid raising an error when a key is missing, you can do:
houses = data.get('houses', None)
if houses is not None:
    print(houses.get('town', None))
.get is a dict method that takes two parameters: the first is the key, and the second is the default value to return if the key isn't found.
So in your example, data.get('town', None) returns None because town isn't a key in data.
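The two lookups can also be chained by passing an empty dict as the default of the first .get. The snippet below is a small sketch on a hypothetical, trimmed-down copy of the response; it assumes that "houses", when present, is itself a dict.

```python
# Hypothetical, trimmed-down copy of the API response
data = {"houses": {"town": "Venore", "world": "Antica"}}

# The {} default keeps the second .get from being called on None
town = data.get("houses", {}).get("town")
missing = data.get("information", {}).get("timestamp")

print(town)     # -> Venore
print(missing)  # -> None
```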

Create partial dict from recursively nested field list

After parsing a URL parameter for partial responses, e.g. ?fields=name,id,another(name,id),date, I'm getting back an arbitrarily nested list of strings, representing the individual keys of a nested JSON object:
['name', 'id', ['another', ['name', 'id']], 'date']
The goal is to map that parsed 'graph' of keys onto an original, larger dict and just retrieve a partial copy of it, e.g.:
input_dict = {
"name": "foobar",
"id": "1",
"another": {
"name": "spam",
"id": "42",
"but_wait": "there is more!"
},
"even_more": {
"nesting": {
"why": "not?"
}
},
"date": 1584567297
}
should simplify to:
output_dict = {
"name": "foobar",
"id": "1",
"another": {
"name": "spam",
"id": "42"
},
"date": 1584567297,
}
So far, I've glanced over nested defaultdicts, addict and glom, but the mappings they take as inputs are not compatible with my list (I might have missed something, of course), and I end up with garbage.
How can I do this programmatically, and accounting for any nesting that might occur?
You can use:
def rec(d, f):
    result = {}
    for i in f:
        if isinstance(i, list):
            result[i[0]] = rec(d[i[0]], i[1])
        else:
            result[i] = d[i]
    return result
f = ['name', 'id', ['another', ['name', 'id']], 'date']
rec(input_dict, f)
output:
{'name': 'foobar',
'id': '1',
'another': {'name': 'spam', 'id': '42'},
'date': 1584567297}
The assumption here is that, in a nested list, the first element is a valid key of the enclosing level and the second element lists valid keys of the nested dict stored under that key.
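If the requested fields might not all exist in the data, a more tolerant variant can skip missing keys instead of raising KeyError. This is only a sketch: partial_copy is a hypothetical name, and silently dropping unknown fields is an assumption about the desired behavior, not something stated in the question.

```python
def partial_copy(source, fields):
    """Copy only the requested fields, skipping any that are missing."""
    result = {}
    for field in fields:
        if isinstance(field, list):
            # Nested request of the form [key, [subfield, ...]]
            key, subfields = field
            if isinstance(source.get(key), dict):
                result[key] = partial_copy(source[key], subfields)
        elif field in source:
            result[field] = source[field]
    return result


# Hypothetical sample data and field list
input_dict = {
    "name": "foobar",
    "id": "1",
    "another": {"name": "spam", "id": "42", "but_wait": "there is more!"},
    "date": 1584567297,
}
fields = ["name", ["another", ["id"]], "not_there"]

print(partial_copy(input_dict, fields))
# -> {'name': 'foobar', 'another': {'id': '42'}}
```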

Scan through nested json response for a specific value without knowing names

I have a massive (over 3000 lines) JSON response I am receiving from a UI. I need to scan through the entire response and search for the value "Not Answered". Once I find this value, I then need to get the other info around that response.
The response is heavily nested, up to 7 layers. I do know that the value has a key of "value", but that key is in the response multiple times. The number of nests and items under the first "value" key can be different for each call.
This is an example of a small piece of what the response can look like. I would need to find each instance of the value "Not Answered". I am not showing the other data within the response under the value keys.
{
"data": {
"reviewData": [
0: {
"value": [
1: {
"value": "Answer"
},
2: {
"value": "Not Answered"
},
3: {
"value": "Answer"
}
]
},
1: {
"value": [
1: {
"value": "Not Answered"
},
2: {
"value": "Not Answered"
},
3: {
"value": "Answer"
}
]
}
]
}
}
I understand that I could just put this all into a string and use regex but that wouldn't assist in getting the other data that I need. Thanks for any help!
You can use a generator to walk through your dictionary and yield the paths that lead to 'Not Answered' values in form of tuples:
def walk(obj):
    for key, value in obj.items():
        if isinstance(value, dict):
            yield from ((key,) + x for x in walk(value))
        elif value == 'Not Answered':
            yield (key,)
For your example this gives the following output:
[('data', 'reviewData', '0', 'value', '2', 'value'),
('data', 'reviewData', '1', 'value', '1', 'value'),
('data', 'reviewData', '1', 'value', '2', 'value')]
If you need access to the surrounding information you can reduce the provided paths to any depth by using __getitem__ on the nested dicts:
from functools import reduce
for path in walk(test_dict):
    info = reduce(lambda obj, key: obj[key], path[:-1], test_dict)
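A parsed JSON response usually contains lists as well as dicts, which a dict-only generator does not descend into. The variant below is a sketch that treats list indexes as path elements too; walk_json and the small sample are hypothetical names, and the sample assumes that reviewData and value are real JSON arrays.

```python
from functools import reduce


def walk_json(obj, target="Not Answered"):
    """Yield the path to every value equal to target, through dicts and lists."""
    if isinstance(obj, dict):
        items = obj.items()
    elif isinstance(obj, list):
        items = enumerate(obj)
    else:
        return  # scalar leaf: nothing to descend into
    for key, value in items:
        if value == target:
            yield (key,)
        else:
            for tail in walk_json(value, target):
                yield (key,) + tail


# Hypothetical sample shaped like the response in the question
sample = {
    "data": {
        "reviewData": [
            {"value": [{"value": "Answer"}, {"value": "Not Answered", "id": 7}]}
        ]
    }
}

paths = list(walk_json(sample))
print(paths)
# -> [('data', 'reviewData', 0, 'value', 1, 'value')]

# Drop the last step to get the object that holds the matching value
parent = reduce(lambda obj, key: obj[key], paths[0][:-1], sample)
print(parent)
# -> {'value': 'Not Answered', 'id': 7}
```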
You can use a recursive function like this one:
def find_not_answered_object(obj, sink):
    for key in obj.keys():
        if obj[key] == "Not Answered":
            sink.append(obj)
        elif isinstance(obj[key], dict):
            find_not_answered_object(obj[key], sink)
    return sink
response = {}  # your JSON response
print(find_not_answered_object(response, []))

How do I check for duplicate entries before I add an entry to the dictionary

Given I have the following dictionary, which stores key (entry_id), value (entry_body, entry_title) pairs.
"entries": {
"1": {
"body": "ooo",
"title": "jack"
},
"2": {
"body": "ooo",
"title": "john"
}
}
How do I check whether the title of an entry that I want to add to the dictionary already exists?
For example, this is the new entry that I want to add:
{
"body": "nnnn",
"title": "jack"
}
Have you thought about changing your data structure? Without context, the IDs of the entries seem a little useless. Your question suggests you only want to store unique titles, so why not make them your keys?
Example:
"entries": {
"jack": "ooo",
"john": "ooo"
}
That way you can do an efficient if newname in entries membership test.
EDIT:
Based on your comment you can still preserve the IDs by extending the data structure:
"entries": {
"jack": {
"body": "ooo",
"id": 1
},
"john": {
"body": "ooo",
"id": 2
}
}
I agree with @Christian König's answer: your data structure seems like it could be made clearer and more efficient. Still, if you need a solution for this setup in particular, here's one that will work, and it automatically adds new integer keys to the entries dict.
I've added an extra case to show both a rejection and an accepted update.
def existing_entry(e, d):
    return [True for entry in d["entries"].values() if entry["title"] == e["title"]]
def update_entries(e, entries):
    if not any(existing_entry(e, entries)):
        current_keys = [int(x) for x in list(entries["entries"].keys())]
        next_key = str(max(current_keys) + 1)
        entries["entries"][next_key] = e
        print("Updated:", entries)
    else:
        print("Existing entry found.")
update_entries(new_entry_1, data)
update_entries(new_entry_2, data)
Output:
Existing entry found.
Updated:
{'entries':
{'1': {'body': 'ooo', 'title': 'jack'},
'2': {'body': 'ooo', 'title': 'john'},
'3': {'body': 'qqqq', 'title': 'jill'}
}
}
Data:
data = {"entries": {"1": {"body": "ooo", "title": "jack"},"2": {"body": "ooo","title": "john"}}}
new_entry_1 = {"body": "nnnn", "title": "jack"}
new_entry_2 = {"body": "qqqq", "title": "jill"}
This should work?
entry_dict = {
"1": {"body": "ooo", "title": "jack"},
"2": {"body": "ooo", "title": "john"}
}
def does_title_exist(title):
    for entry_id, sub_dict in entry_dict.items():
        if sub_dict["title"] == title:
            print("Title %s already in dictionary at entry %s" % (title, entry_id))
            return True
    return False
print("Does the title exist? %s" % does_title_exist("jack"))
As Christian suggests above, this seems like an inefficient data structure for the job. If you just need index IDs, a list may be better.
I think to achieve this, one will have to traverse the dictionary.
'john' in [the_dict[en]['title'] for en in the_dict]
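The same traversal can be written with any(), which stops at the first matching title instead of building a full list. The functions below are a sketch: title_exists and add_entry are hypothetical names, and the entry IDs are assumed to be numeric strings as in the question.

```python
def title_exists(entries, title):
    """Return True if any stored entry already has this title."""
    return any(entry.get("title") == title for entry in entries.values())


def add_entry(entries, new_entry):
    """Add new_entry under the next numeric ID unless its title is taken."""
    if title_exists(entries, new_entry["title"]):
        return False
    next_id = str(max((int(key) for key in entries), default=0) + 1)
    entries[next_id] = new_entry
    return True


entries = {
    "1": {"body": "ooo", "title": "jack"},
    "2": {"body": "ooo", "title": "john"},
}

print(add_entry(entries, {"body": "nnnn", "title": "jack"}))  # -> False
print(add_entry(entries, {"body": "qqqq", "title": "jill"}))  # -> True
print(sorted(entries))  # -> ['1', '2', '3']
```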
