PySpark recursive key search - python

I have a deeply nested JSON-esque structure that I need to search for a given key at all levels (up to 7) for all occurrences. There is data always present at level 0 that I need to associate with each occurrence of the search_key found at any level. I've tried pushing this data through the recursive calls and appending it on return, but I have run into heap and unhashable type issues when I move this from standard Python to a PySpark RDD.
My search function is below:
def search(input, search_key, results):
    if input:
        for i in input:
            if isinstance(i, list):
                search(i, search_key, results)
            elif isinstance(i, dict):
                for k, v in i.iteritems():
                    if k == search_key:
                        results.append(i)
                        continue
                    elif isinstance(v, list):
                        search(v, search_key, results)
                    elif isinstance(v, dict):
                        search(v, search_key, results)
    return results
And I have been calling it with:
origin_rdd = sc.parallelize(origin)
concept_lambda = lambda c: search(c, term, [])
results = origin_rdd.flatMap(concept_lambda)
Can anybody suggest a way to capture the top level data and have it as part of each object in results? Results could be 0 to n, so each result is a product of the 7 keys always present in the top level and one occurrence of the search term. I then want to transform each row in the resulting RDD into a PySpark Row for use with a PySpark DataFrame. I have not found a good way to start with a DataFrame instead of an RDD, or to apply the search function to a DataFrame column, because the structure is highly dynamic in its schema, but I'm happy to hear suggestions if somebody thinks that's a better route.

I was able to solve my problem by slicing and passing the base through while using deepcopy when I had a hit on my search. Somebody else trying to do something similar could tweak the slices below.
origin_rdd = sc.parallelize(origin)
concept_lambda = lambda r: search(r[-1], r[0:9], term, [])
results = origin_rdd.flatMap(concept_lambda)
The search function
import copy
from six import iteritems  # assumption: the post does not show where iteritems comes from
def search(input, row_base, search_key, results):
    if input:
        for i in input:
            if isinstance(i, list):
                search(i, row_base, search_key, results)
            if isinstance(i, dict):
                for k, v in iteritems(i):
                    if k == search_key:
                        # copy the base so every hit gets its own row
                        row = copy.deepcopy(row_base)
                        row.append(i)
                        results.append(row)
                        continue
                    elif isinstance(v, list):
                        search(v, row_base, search_key, results)
                    elif isinstance(v, dict):
                        search(v, row_base, search_key, results)
    return results
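The same idea, passing the top-level fields down and copying them on each hit, can also be sketched as a Python 3 generator. This is only an illustration under assumptions: find_with_base, the sample record, and its field layout are hypothetical and not part of the original post. Unlike the function above, this sketch also descends into dict values directly.

```python
def find_with_base(node, base, search_key):
    """Yield base + [hit] for every dict under node that contains search_key."""
    if isinstance(node, dict):
        if search_key in node:
            # list(base) is a shallow copy, so rows do not share one list
            yield list(base) + [node]
        for value in node.values():
            yield from find_with_base(value, base, search_key)
    elif isinstance(node, list):
        for item in node:
            yield from find_with_base(item, base, search_key)


# Hypothetical record: two top-level fields followed by the nested payload.
record = ["id-1", "2020-01-01",
          [{"concept": "a", "children": [{"concept": "b"}, {"other": 1}]}]]

rows = list(find_with_base(record[-1], record[:-1], "concept"))
for row in rows:
    print(row)
```

With the RDD from the answer, something like origin_rdd.flatMap(lambda r: find_with_base(r[-1], r[0:9], term)) should behave the same way, since flatMap accepts any iterable per record; that call has not been run against a cluster here.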

Related

how to recursively check a nested dictionary keys and return specific value

I have a function with a nested dictionary and a specific number (in the code that I wrote it equals 1).
I need to write a recursive function that goes over the dictionary and returns only the values mapped to the keys that equal the specific chosen number.
Here is what I wrote:
def nested_get(d, key):
    res=[]
    for i in d.keys():
        if i == key:
            res.append(d[i])
            return res
        if type(d[i]) is dict:
            another = nested_get(d[i], key)
            if another is not None:
                return res + another
    return []
print(nested_get({1:{1:"c",2:"b"},2:"b"},1))
I need it to return ['c'] but instead it returns [{1:'c',2:'b'}]
Your program never makes it to the if type(d[i]) is dict: because it returns in the first if statement in the first iteration of the for loop.
Check if the value stored inside the key is a dict first, and don't return until after the if statements. You don't need the return [] at the end because res will be empty if nothing was appended to it.
def nested_get(d, key):
    res=[]
    for i in d.keys():
        if type(d[i]) is dict:
            res.extend(nested_get(d[i], key))
        else:
            if i == key:
                res.append(d[i])
    return res
print(nested_get({1:{1:"c",2:"b"},2:"b"},1))
print(nested_get( {1:{1:"a",2:"b"},2:{1:{1:"c",2:"b"},2:"b"}},1))
More compact and elegant version:
def nested_get(d, key):
    res = []
    for k, v in d.items():
        if isinstance(v, dict):
            res += nested_get(v, key)
        elif k == key:
            res.append(v)
    return res
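The versions above only descend into dict values. If the structure can also contain lists of dicts (an assumption, since the question only shows dicts), a variant along these lines might help. The name nested_get_all is hypothetical.

```python
def nested_get_all(obj, key):
    """Collect non-container values stored under key, descending into dicts and lists."""
    found = []
    if isinstance(obj, dict):
        for k, v in obj.items():
            if isinstance(v, (dict, list)):
                # containers are searched, never returned as a value
                found += nested_get_all(v, key)
            elif k == key:
                found.append(v)
    elif isinstance(obj, list):
        for item in obj:
            found += nested_get_all(item, key)
    return found


print(nested_get_all({1: [{1: "c"}, {2: "b"}], 2: "b"}, 1))
print(nested_get_all({1: {1: "a", 2: "b"}, 2: {1: {1: "c", 2: "b"}, 2: "b"}}, 1))
```

As in the answers above, a key whose value is itself a container is searched rather than returned.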

How to detect last call of a recursive function?

I have a list of complex dictionaries like this:
data = [
    {
        "l_1_k": 1,
        "l_1_ch": [
            {
                "l_2_k": 2,
                "l_2_ch": [...more levels]
            },
            {
                "l_2_k": 3,
                "l_2_ch": [...more levels]
            }
        ]
    },
    ...more items
]
I'm trying to flatten this structure to a list of rows like this:
list = [
    { "l_1_k": 1, "l_2_k": 2, ... },
    { "l_1_k": 1, "l_2_k": 3, ... },
]
I need this list to build a pandas data frame.
So, I'm doing a recursion for each nesting level, and at the last level I'm trying to append to rows list.
def build_dict(d, row_dict, rows):
    # d is the data dictionary at each nesting level
    # row_dict is the final row dictionary
    # rows is the final list of rows
    for key, value in d.items():
        if not isinstance(value, list):
            row_dict[key] = value
        else:
            for child in value:
                build_dict(child, row_dict, rows)
    rows.append(row_dict) # <- How to detect the last recursion and call the append
I'm calling this function like this:
rows = []
for row in data:
    build_dict(d=row, row_dict={}, rows=rows)
My question is how to detect the last call of this recursive function if I do not know how many nesting levels there are. With the current code, the row is duplicated at each nesting level.
Or, is there a better approach to obtain the final result?
After looking up some ideas, the solution I have in mind is this:
Declare the following function, taken from here:
def find_depth(d):
    if isinstance(d, dict):
        return 1 + (max(map(find_depth, d.values())) if d else 0)
    return 0
In your function, increment every time you go deeper as follows:
def build_dict(d, row_dict, rows, depth=0):
    # depth = 1 for the beginning
    for key, value in d.items():
        if not isinstance(value, list):
            row_dict[key] = value
        else:
            for child in value:
                build_dict(child, row_dict, rows, depth + 1)
Finally, test whether you have reached the maximum depth; if so, append the row at the end of the function. This needs an extra parameter, max_depth:
def build_dict(d, row_dict, rows, max_depth, depth=0):
    # depth = 1 for the beginning
    for key, value in d.items():
        if not isinstance(value, list):
            row_dict[key] = value
        else:
            for child in value:
                build_dict(child, row_dict, rows,max_depth, depth + 1)
    if depth == max_depth:
        rows.append(row_dict)
Call the function as:
build_dict(d=row, row_dict={}, rows=rows, max_depth=find_depth(data))
Do keep in mind since I don't have a data-set I can use, there might be a syntax error or two in there, but the approach should be fine.
I don't think it is good practice to play with a mutable default argument in a function signature.
Also, I think that the function in the recursion loop should never be aware of the level it is in. That's the point of recursion. Instead, you need to think about what the function should return, and when it should exit the recursion to climb back to the zeroth level. On the climb back, higher-level function calls handle the return values of lower-level function calls.
Here is the code that I think will work. I am not sure it is optimal in terms of computing time.
edit: fixed return list of dicts instead of dict only
def build_dict(d):
    """
    returns a list when there is no lowerlevel list of dicts.
    """
    lst = []
    for key, value in d.items():
        if not isinstance(value, list):
            lst.append([{key: value}])
        else:
            lst_lower_levels = []
            for child in value:
                lst_lower_levels.extend(build_dict(child))
            new_lst = []
            for elm in lst:
                for elm_ll in lst_lower_levels:
                    lst_of_dct = elm + elm_ll
                    new_lst.append([{k: v for d in lst_of_dct for k, v in d.items()}])
            lst = new_lst
    return lst
rows = []
for row in data:
    rows.extend(build_dict(d=row))
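A different way to express "emit a row only at the deepest level" is to let each call decide for itself whether it has children. The following is a minimal sketch, not either answerer's code: flatten_rows and the sample data are hypothetical, and it assumes every list value holds child dicts while every other value is a scalar column.

```python
def flatten_rows(node, inherited=None):
    """Yield one flat dict per leaf, merging the scalar values of all ancestors."""
    row = dict(inherited or {})  # fresh copy, so sibling branches do not share state
    children = []
    for key, value in node.items():
        if isinstance(value, list):
            children.extend(c for c in value if isinstance(c, dict))
        else:
            row[key] = value
    if not children:
        yield row  # nothing deeper: this row is complete
        return
    for child in children:
        yield from flatten_rows(child, row)


# Hypothetical two-level sample shaped like the data in the question.
data = [
    {"l_1_k": 1, "l_1_ch": [{"l_2_k": 2, "l_2_ch": []},
                            {"l_2_k": 3, "l_2_ch": []}]},
]
rows = [row for item in data for row in flatten_rows(item)]
print(rows)
```

Because each call works on its own copy of the row, no depth counter is needed and rows are not duplicated per level. The resulting list of dicts can be passed to pandas.DataFrame(rows).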

Python return key from value, but its a list in the dictionary

So I'm quite new to Python and maybe I've searched for the wrong words on Google...
My current problem:
In Python you can return the key for a value when it's mentioned in a dictionary.
One thing I wonder: is it possible to return the key if the value I look up is part of a list of values for that key?
So my test script is the following
MainDict={'FAQ':['FAQ','faq','Faq']}
def key_return(X):
    for Y, value in MainDict.items():
        if X == value:
            return Y
    return "Key doesnt exist"
print(key_return(['FAQ', 'faq', 'Faq']))
print(key_return('faq'))
So I can only return the key if I ask for the whole list.
How can I return the key if I ask for just one value of that list, as in the second print? With the current code I get "Key doesnt exist" as the answer.
You can check to see if a value in the dict is a list, and if it is check to see if the value you're searching for is in the list.
MainDict = {'FAQ':['FAQ','faq','Faq']}
def key_return(X):
    for key, value in MainDict.items():
        if X == value:
            return key
        if isinstance(value, list) and X in value:
            return key
    return "Key doesnt exist"
print(key_return(['FAQ', 'faq', 'Faq']))
print(key_return('faq'))
Note: You should also consider making MainDict a parameter that you pass to key_return instead of a global variable.
You can do this using next and a simple comprehension:
next(k for k, v in MainDict.items() if x == v or x in v)
So your code would look like:
MainDict = {'FAQ':['FAQ','faq','Faq']}
def key_return(x):
    return next(k for k, v in MainDict.items() if x == v or x in v)
print(key_return(['FAQ', 'faq', 'Faq']))
#FAQ
print(key_return('faq'))
#FAQ
You can create a dict that maps from values in the lists to keys in MainDict:
MainDict={'FAQ':['FAQ','faq','Faq']}
back_dict = {value: k for k,values in MainDict.items() for value in values}
Then rewrite key_return to use this dict:
def key_return(X):
    return back_dict[X]
print(key_return('faq'))
The line back_dict = {value: k for k,values in MainDict.items() for value in values} is a dictionary comprehension expression, which is equivalent to:
back_dict = {}
for k,values in MainDict.items():
    for value in values:
        back_dict[value] = k
This approach is more time-efficient than looping over every item of MainDict every time you search, since it only requires a single lookup rather than a loop.
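The reverse-dictionary idea can be wrapped so that a missing value returns a default instead of raising KeyError. This is a small sketch; build_reverse_index, main_dict, and the default parameter are hypothetical names. It only supports hashable lookups, so asking for the whole list, as in the first print of the question, would raise TypeError here.

```python
def build_reverse_index(mapping):
    """Map every list element back to the key whose list contains it."""
    return {value: key for key, values in mapping.items() for value in values}


main_dict = {"FAQ": ["FAQ", "faq", "Faq"]}
reverse_index = build_reverse_index(main_dict)


def key_return(value, default="Key doesnt exist"):
    # dict.get avoids a KeyError when the value was never indexed
    return reverse_index.get(value, default)


print(key_return("faq"))
print(key_return("missing"))
```

The index has to be rebuilt whenever main_dict changes, which is the price of the constant-time lookup.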

Advice on making my JSON find-all-nested-occurrences method cleaner

I am parsing an unknown nested JSON object; I do not know the structure or depth ahead of time. I am trying to search through it to find a value. This is what I came up with, but I find it fugly. Could anybody let me know how to make this more Pythonic and cleaner?
def find(d, key):
    if isinstance(d, dict):
        for k, v in d.iteritems():
            try:
                if key in str(v):
                    return 'found'
            except:
                continue
            if isinstance(v, dict):
                for key,value in v.iteritems():
                    try:
                        if key in str(value):
                            return "found"
                    except:
                        continue
                    if isinstance(v, dict):
                        find(v)
                    elif isinstance(v, list):
                        for x in v:
                            find(x)
    if isinstance(d, list):
        for x in d:
            try:
                if key in x:
                    return "found"
            except:
                continue
            if isinstance(v, dict):
                find(v)
            elif isinstance(v, list):
                for x in v:
                    find(x)
    else:
        if key in str(d):
            return "found"
        else:
            return "Not Found"
It is generally more "Pythonic" to use duck typing; i.e., just try to search for your target rather than using isinstance. See What are the differences between type() and isinstance()?
However, your need for recursion makes it necessary to recurse the values of the dictionaries and the elements of the list. (Do you also want to search the keys of the dictionaries?)
The in operator works on strings, lists, and dictionaries, so there is no need to separate the dictionaries from the lists when testing for membership. Assuming you don't want to match the target as a substring, do use isinstance(..., basestring) per the previous link. To test whether your target is among the values of a dictionary, test for membership in your_dictionary.values(). See Get key by value in dictionary
Because the dictionary values might be lists or dictionaries, I still might test for dictionary and list types the way you did, but I mention that you can cover both list elements and dictionary keys with a single statement because you ask about being Pythonic, and using an overloaded operator like in across two types is typical of Python.
Recursion is necessary here, but I wouldn't name the function find. It is not a built-in function, so nothing is actually shadowed, but it is the name of a string method (str.find), and the recursive call becomes less readable because another programmer might mistake it for that method.
To test for numeric types, use numbers.Number as described at How can I check if my python object is a number?
Also, there is a solution to a variation of your problem at https://gist.github.com/douglasmiranda/5127251 . I found that before posting because ColdSpeed's regex suggestion in the comment made me wonder if I were leading you down the wrong path.
So something like
import numbers
# basestring exists only on Python 2; use str on Python 3
def recursively_search(object_from_json, target):
    if isinstance(object_from_json, (basestring, numbers.Number)):
        return object_from_json==target # the recursion base cases
    elif isinstance(object_from_json, list):
        for element in object_from_json:
            if recursively_search(element, target):
                return True # quit at first match
    elif isinstance(object_from_json, dict):
        if target in object_from_json:
            return True # among the keys
        else:
            for value in object_from_json.values():
                if recursively_search(value, target):
                    return True # quit upon finding first match
    else:
        print ("recursively_search() did not anticipate type ",type(object_from_json))
        return False
    return False # no match found among the list elements, dict keys, nor dict values
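Since the question title asks for all nested occurrences rather than the first one, a generator that reports where each match sits may be closer to the goal. This is a hedged Python 3 sketch, not the answerer's code: find_all, the path tuple, and the sample doc are hypothetical, and a match is counted when either a dict key or a scalar value equals the target.

```python
def find_all(obj, target, path=()):
    """Yield the path (tuple of keys and indexes) to every key or scalar equal to target."""
    if isinstance(obj, dict):
        for key, value in obj.items():
            if key == target:
                yield path + (key,)
            yield from find_all(value, target, path + (key,))
    elif isinstance(obj, list):
        for index, item in enumerate(obj):
            yield from find_all(item, target, path + (index,))
    elif obj == target:
        yield path  # scalar base case


doc = {"a": {"b": "x", "x": [1, "x"]}, "c": ["x"]}
paths = list(find_all(doc, "x"))
print(paths)
```

An empty result means "not found", so no separate flag or string return value is needed.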

Recursive function prints but does not return [duplicate]

This question already has answers here:
Why does my recursive function return None?
(4 answers)
Closed 6 months ago.
I have a function that takes a key and traverses nested dicts to return the value regardless of its depth. However, I can only get the value to print, not return. I've read the other questions on this issue and have tried (1) implementing yield and (2) appending the value to a list and then returning the list.
def get_item(data,item_key):
    # data=dict, item_key=str
    if isinstance(data,dict):
        if item_key in data.keys():
            print data[item_key]
            return data[item_key]
        else:
            for key in data.keys():
                # recursion
                get_item(data[key],item_key)
item = get_item(data,'aws:RequestId')
print item
Sample data:
data = OrderedDict([(u'aws:UrlInfoResponse', OrderedDict([(u'#xmlns:aws', u'http://alexa.amazonaws.com/doc/2005-10-05/'), (u'aws:Response', OrderedDict([(u'#xmlns:aws', u'http://awis.amazonaws.com/doc/2005-07-11'), (u'aws:OperationRequest', OrderedDict([(u'aws:RequestId', u'4dbbf7ef-ae87-483b-5ff1-852c777be012')])), (u'aws:UrlInfoResult', OrderedDict([(u'aws:Alexa', OrderedDict([(u'aws:TrafficData', OrderedDict([(u'aws:DataUrl', OrderedDict([(u'#type', u'canonical'), ('#text', u'infowars.com/')])), (u'aws:Rank', u'1252')]))]))])), (u'aws:ResponseStatus', OrderedDict([(u'#xmlns:aws', u'http://alexa.amazonaws.com/doc/2005-10-05/'), (u'aws:StatusCode', u'Success')]))]))]))])
When I execute, the desired value prints, but does not return:
>>>52c7e94b-dc76-2dd6-1216-f147d991d6c7
>>>None
What is happening? Why isn't the function breaking and returning the value when it finds it?
A simple fix, you have to find a nested dict that returns a value. You don't need to explicitly use an else clause because the if returns. You also don't need to call .keys():
def get_item(data, item_key):
    if isinstance(data, dict):
        if item_key in data:
            return data[item_key]
        for key in data:
            found = get_item(data[key], item_key)
            if found:
                return found
    return None # Explicit vs Implicit
>>> get_item(data, 'aws:RequestId')
'4dbbf7ef-ae87-483b-5ff1-852c777be012'
One of the design principles of Python is EAFP (Easier to Ask for Forgiveness than Permission), which means that exceptions are used more commonly than in other languages. The above rewritten in EAFP style:
def get_item(data, item_key):
    try:
        return data[item_key]
    except KeyError:
        for key in data:
            found = get_item(data[key], item_key)
            if found:
                return found
    except (TypeError, IndexError):
        pass
    return None
As other people commented, you need a return statement in the else block, too. You have two if blocks, so you would need two more return statements. Here is code that does what you may want:
from collections import OrderedDict
def get_item(data,item_key):
    result = []
    if isinstance(data, dict):
        for key in data:
            if key == item_key:
                print data[item_key]
                result.append(data[item_key])
            # recursion
            result += get_item(data[key],item_key)
        return result
    return result
Your else block needs to return the value if it finds it.
I've made a few other minor changes to your code. You don't need to do
if item_key in data.keys():
Instead, you can simply do
if item_key in data:
Similarly, you don't need
for key in data.keys():
You can iterate directly over a dict (or any class derived from a dict) to iterate over its keys:
for key in data:
Here's my version of your code, which should run on Python 2.7 as well as Python 3.
from __future__ import print_function
from collections import OrderedDict
def get_item(data, item_key):
    if isinstance(data, dict):
        if item_key in data:
            return data[item_key]
        for val in data.values():
            v = get_item(val, item_key)
            if v is not None:
                return v
data = OrderedDict([(u'aws:UrlInfoResponse',
OrderedDict([(u'#xmlns:aws', u'http://alexa.amazonaws.com/doc/2005-10-05/'), (u'aws:Response',
OrderedDict([(u'#xmlns:aws', u'http://awis.amazonaws.com/doc/2005-07-11'), (u'aws:OperationRequest',
OrderedDict([(u'aws:RequestId', u'4dbbf7ef-ae87-483b-5ff1-852c777be012')])), (u'aws:UrlInfoResult',
OrderedDict([(u'aws:Alexa',
OrderedDict([(u'aws:TrafficData',
OrderedDict([(u'aws:DataUrl',
OrderedDict([(u'#type', u'canonical'), ('#text', u'infowars.com/')])),
(u'aws:Rank', u'1252')]))]))])), (u'aws:ResponseStatus',
OrderedDict([(u'#xmlns:aws', u'http://alexa.amazonaws.com/doc/2005-10-05/'),
(u'aws:StatusCode', u'Success')]))]))]))])
item = get_item(data, 'aws:RequestId')
print(item)
output
4dbbf7ef-ae87-483b-5ff1-852c777be012
Note that this function returns None if the isinstance(data, dict) test fails, or if the for loop fails to return. It's generally a good idea to ensure that every possible return path in a recursive function has an explicit return statement, as that makes it clearer what's happening, but IMHO it's ok to leave those returns implicit in this fairly simple function.
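One consequence of using None (or truthiness, as in the `if found:` versions above) to signal "not found" is that a stored value of None, 0, or an empty string can be mistaken for a miss. A possible way around that is a private sentinel object. The sketch below is an assumption-laden illustration: _MISSING, _search, and the default parameter are hypothetical names, and every return path is explicit.

```python
_MISSING = object()  # hypothetical sentinel meaning "key not found anywhere"


def _search(data, item_key):
    """Return the first value stored under item_key, or _MISSING."""
    if isinstance(data, dict):
        if item_key in data:
            return data[item_key]
        for value in data.values():
            found = _search(value, item_key)
            if found is not _MISSING:
                return found
    return _MISSING


def get_item(data, item_key, default=None):
    """Public wrapper that hides the sentinel from callers."""
    found = _search(data, item_key)
    return default if found is _MISSING else found


nested = {"a": {"b": {"count": 0}}, "c": None}
print(get_item(nested, "count"))
print(get_item(nested, "zzz", "n/a"))
```

The identity check `is not _MISSING` is what lets falsy values such as 0 propagate back up through the recursion.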
