Python - grouping values in dictionary

Here is the JSON I'm receiving from the Smartsheet API:
{"rows":
[
{
"id":1315072712697732,
"cells":
[
{"columnId":3691535201003396,"value":"MyBooks","displayValue":"MyBooks"},
{"columnId":8195134828373892},
{"columnId":876785433896836,"value":"2018 Year","displayValue":"2018 Year"},
{"columnId":5380385061267332,"value":"http://google.com","displayValue":"http://google.com"}
]
},
{
"id":5818672340068228,
"cells":
[
{"columnId":3691535201003396,"value":"MyBooks","displayValue":"MyBooks"},
{"columnId":8195134828373892},
{"columnId":876785433896836,"value":"2019 Year","displayValue":"2019 Year"},
{"columnId":5380385061267332,"value":"http://google.com","displayValue":"http://google.com"}
]
},
{
"id":6381622293489540,
"cells":
[
{"columnId":3691535201003396,"value":"MyMovies","displayValue":"MyMovies"},
{"columnId":8195134828373892},
{"columnId":876785433896836,"value":"2027 Year","displayValue":"2027 Year"},
{"columnId":5380385061267332,"value":"http://google.com","displayValue":"http://google.com"}
]
},
{
"id":6100147316778884,
"cells":
[
{"columnId":3691535201003396,"value":"MyMovies","displayValue":"MyMovies"},
{"columnId":8195134828373892},
{"columnId":876785433896836,"value":"2035 Year","displayValue":"2035 Year"},
{"columnId":5380385061267332,"value":"http://google.com","displayValue":"http://google.com"}
]
},
{
"id":8351947130464132,
"cells":
[
{"columnId":3691535201003396,"value":"MyHobbies","displayValue":"MyHobbies"},
{"columnId":8195134828373892},
{"columnId":876785433896836,"value":"2037 Year","displayValue":"2037 Year"},
{"columnId":5380385061267332,"value":"http://google.com","displayValue":"http://google.com"}
]
}]}
Here is a piece of my Python code:
import json

s = json.loads(myJson)
my_dictionary = []
for element in s['rows']:
    my_dictionary.append({'category': element['cells'][0]['displayValue'],
                          'categoryId': element['cells'][1]['columnId'],
                          'pages': [
                              {'pageName': element['cells'][2]['displayValue'],
                               'pageURL': element['cells'][3]['displayValue']
                               }
                          ]})
As a result I get a list of dictionaries with all the data I need (without the unnecessary stuff), except for one thing: I want to group it by category value. The output I want to achieve should look similar to this:
"category": "MyBooks",
"categoryId": "8195134828373892",
"pages":
[
{"pageName": "2018 Year", "pageURL": "http://google.com"},
{"pageName": "2019 Year", "pageURL": "http://google.com"}
]
How can I do this?

You can do it with the following code:
from collections import defaultdict

d = defaultdict(list)
for element in my_dictionary:
    # merge all pages that share the same (categoryId, category) pair into one list
    d[(element['categoryId'], element['category'])] += element['pages']

result = []
for element in sorted(d, key=lambda k: k[1]):  # sort by category name
    result.append({
        'category': element[1],
        'categoryId': element[0],
        # sort by page name in pages list
        'pages': sorted(d[element], key=lambda e: e['pageName'])
    })
print(result)
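The same grouping can also be done in a single pass over the rows, without building the intermediate list first. The following is a minimal sketch, assuming the payload has already been parsed (for example with json.loads) and that every row lists its four cells in the same order as in the question. The rows literal is trimmed sample data, and the names rows, pages_by_category and grouped are illustrative.

```python
from collections import defaultdict

# Three rows trimmed from the question's payload (already parsed from JSON).
rows = [
    {"cells": [
        {"columnId": 3691535201003396, "displayValue": "MyBooks"},
        {"columnId": 8195134828373892},
        {"columnId": 876785433896836, "displayValue": "2018 Year"},
        {"columnId": 5380385061267332, "displayValue": "http://google.com"},
    ]},
    {"cells": [
        {"columnId": 3691535201003396, "displayValue": "MyBooks"},
        {"columnId": 8195134828373892},
        {"columnId": 876785433896836, "displayValue": "2019 Year"},
        {"columnId": 5380385061267332, "displayValue": "http://google.com"},
    ]},
    {"cells": [
        {"columnId": 3691535201003396, "displayValue": "MyMovies"},
        {"columnId": 8195134828373892},
        {"columnId": 876785433896836, "displayValue": "2027 Year"},
        {"columnId": 5380385061267332, "displayValue": "http://google.com"},
    ]},
]

# Collect the pages of each (category name, category column id) pair.
pages_by_category = defaultdict(list)
for row in rows:
    cells = row["cells"]
    key = (cells[0]["displayValue"], cells[1]["columnId"])
    pages_by_category[key].append(
        {"pageName": cells[2]["displayValue"], "pageURL": cells[3]["displayValue"]}
    )

# Rebuild one dictionary per category, in first-seen order.
grouped = [
    {"category": name, "categoryId": column_id, "pages": pages}
    for (name, column_id), pages in pages_by_category.items()
]
print(grouped)
```

Because a dict keeps insertion order, the categories come out in the order they first appear in the rows; sort the keys if alphabetical order is needed.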

Related

Grouping duplicate objects into an array of values

I need to group a set of objects that share a duplicate value.
Input
{
"errors": [
{
"check_no": "1.4",
"security": "IAM"
},
{
"check_no": "1.3",
"security": "SLM"
},
{
"check_no": "1.4",
"security": "EKM"
}
]
}
Here the check_no values in the array are 1.4, 1.3, 1.4. I need to group the entries that share a check_no into an array under an additional key, as shown below.
Output
{
"errors": [
{
"check_no": "1.4",
"values": [
{
"security": "IAM"
},
{
"security": "EKM"
}
]
},
{
"check_no": "1.3",
"values": [
{
"security": "SLM"
}
]
}
]
}
First you need to collect the "security" entries (such as "security": "EKM") per "check_no". A defaultdict is well suited for that. Then rebuild the result from the grouped data:
from collections import defaultdict

# v is the parsed input dict shown above
grouped_check_no = defaultdict(list)
for error in v['errors']:
    grouped_check_no[error['check_no']].append({"security": error['security']})
result = {'errors': [{"check_no": k, "values": vals} for k, vals in grouped_check_no.items()]}
grouped_check_no is
{'1.4': [{'security': 'IAM'}, {'security': 'EKM'}], '1.3': [{'security': 'SLM'}]}
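For completeness, here is a self-contained sketch of the same idea that can be run as is. It assumes the input has already been parsed into a dict (named payload here, an illustrative name) and it keeps every field except check_no in the grouped entries, which is a small generalization of the answer above rather than part of it.

```python
from collections import defaultdict

payload = {"errors": [
    {"check_no": "1.4", "security": "IAM"},
    {"check_no": "1.3", "security": "SLM"},
    {"check_no": "1.4", "security": "EKM"},
]}

grouped = defaultdict(list)
for error in payload["errors"]:
    # Everything except the grouping key goes into the grouped entry.
    rest = {key: value for key, value in error.items() if key != "check_no"}
    grouped[error["check_no"]].append(rest)

# One output object per distinct check_no, in first-seen order.
result = {
    "errors": [
        {"check_no": check_no, "values": values}
        for check_no, values in grouped.items()
    ]
}
print(result)
```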

How to transform a flattened data to a structured json?

This is the flattened input data:
['a-ab-aba-abaa-abaaa', 'a-ab-aba-abab', 'a-ac-aca-acaa', 'a-ac-aca-acab']
This is the target output I need:
[
{
"title": "a",
"children": [
{
"title": "ab",
"children": [
{
"title": "aba",
"children": [
{
"title": "abaa",
"children": [
{
"title": "abaaa"
}
]
},
{
"title": "abab"
}
]
}
]
},
{
"title": "ac",
"children": [
{
"title": "aca",
"children": [
{
"title": "acaa"
},
{
"title": "acab"
}
]
}
]
}
]
}
]
I thought I could use deeply nested for loops to generate this JSON data, but that is too difficult because the number of levels can be greater than 10, so I don't think nested for loops will work here. Is there an algorithm or an existing package I can use to achieve this?
I would be grateful if you shared your thoughts. God bless you!
Here is a recursive solution using itertools. I don't know if this is efficient enough for your purpose, but it works. It transforms your list of strings into a list of lists, divides that into sublists that share the same first key, and then builds the dict and repeats with the first key removed.
from itertools import groupby
from pprint import pprint

data = ['a-ab-aba-abaa-abaaa', 'a-ab-aba-abab', 'a-ac-aca-acaa', 'a-ac-aca-acab']
components = [x.split("-") for x in data]

def build_dict(component_list):
    key = lambda x: x[0]
    # groupby only groups adjacent items, so sort by the same key first
    component_list = sorted(component_list, key=key)
    # divide into lists with the same first key
    sublists = groupby(component_list, key)
    result = []
    for name, values in sublists:
        value = {}
        value["title"] = name
        # recurse on the remaining components, skipping exhausted paths
        value["children"] = build_dict([x[1:] for x in values if x[1:]])
        result.append(value)
    return result

pprint(build_dict(components))
Output:
[{'children': [{'children': [{'children': [{'children': [{'children': [],
'title': 'abaaa'}],
'title': 'abaa'},
{'children': [], 'title': 'abab'}],
'title': 'aba'}],
'title': 'ab'},
{'children': [{'children': [{'children': [], 'title': 'acaa'},
{'children': [], 'title': 'acab'}],
'title': 'aca'}],
'title': 'ac'}],
'title': 'a'}]
To convert this structure to JSON you can use json.dumps from the json module. I hope my explanation is clear.
Here is a start:
def populate_levels(dct, levels):
    if levels:
        if levels[0] not in dct:
            dct[levels[0]] = {}
        populate_levels(dct[levels[0]], levels[1:])

def create_final(dct):
    final = []
    for title in dct:
        final.append({"title": title, "children": create_final(dct[title])})
    return final

data = ['a-ab-aba-abaa-abaaa', 'a-ab-aba-abab', 'a-ac-aca-acaa', 'a-ac-aca-acab']
template = {}
for item in data:
    populate_levels(template, item.split('-'))
final = create_final(template)
I couldn't see a clean way of doing it all at once, so I created this in-between template dictionary. Right now, if a 'node' has no children, its corresponding dict will contain 'children': []. You can change this behavior in the create_final function if you like.
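One possible way to make that change, so that leaf nodes carry no "children" key as in the target output of the question, is sketched below. This is an illustrative variant rather than the answer's own code: the names build_tree and to_nodes are hypothetical, and the intermediate dictionary is filled with a loop instead of recursion.

```python
import json

def to_nodes(nested):
    """Convert {title: {child_title: ...}} into a list of title/children dicts."""
    nodes = []
    for title, children in nested.items():
        node = {"title": title}
        if children:
            # Only add the key when there is at least one child.
            node["children"] = to_nodes(children)
        nodes.append(node)
    return nodes

def build_tree(paths, sep="-"):
    """Build the nested structure from dash-separated path strings."""
    nested = {}
    for path in paths:
        level = nested
        for title in path.split(sep):
            # Reuse the existing branch or start a new one.
            level = level.setdefault(title, {})
    return to_nodes(nested)

data = ['a-ab-aba-abaa-abaaa', 'a-ab-aba-abab', 'a-ac-aca-acaa', 'a-ac-aca-acab']
tree = build_tree(data)
print(json.dumps(tree, indent=4))
```

The conversion step is still recursive, but its depth equals the number of levels in a path, so a dozen or so levels stays far below Python's default recursion limit.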
You can use collections.defaultdict:
import json
from collections import defaultdict

def get_struct(d):
    _d = defaultdict(list)
    for a, *b in d:
        _d[a].append(b)
    return [{'title': a, 'children': get_struct(filter(None, b))} for a, b in _d.items()]

data = ['a-ab-aba-abaa-abaaa', 'a-ab-aba-abab', 'a-ac-aca-acaa', 'a-ac-aca-acab']
print(json.dumps(get_struct([i.split('-') for i in data]), indent=4))
Output:
[
{
"title": "a",
"children": [
{
"title": "ab",
"children": [
{
"title": "aba",
"children": [
{
"title": "abaa",
"children": [
{
"title": "abaaa",
"children": []
}
]
},
{
"title": "abab",
"children": []
}
]
}
]
},
{
"title": "ac",
"children": [
{
"title": "aca",
"children": [
{
"title": "acaa",
"children": []
},
{
"title": "acab",
"children": []
}
]
}
]
}
]
}
]

Create a list of new urls contained in objects in python

I have two json databases. If there is a new value in the "img_url" (one in the last json that isn't in the other), I want to print the url or place it in a variable. The goal is just to find a list of the new values.
Input json:
last_data = [
{
"objectID": 16240,
"results": [
{
"img_url": "https://img.com/1.jpg"
},
{
"img_url": "https://img.com/2.jpg"
},
{
"img_url": "https://img.com/30.jpg"
}
]
},
{
"objectID": 16242,
"results": [
{
"img_url": "https://img.com/1.jpg"
},
{
"img_url": "https://img.com/2.jpg"
},
{
"img_url": "https://img.com/3.jpg"
}
]
},
# ...
# multiple other objectIDs
]
Second input:
second_data =[
{
"objectID": 16240,
"results": [
{
"img_url": "https://img.com/1.jpg"
},
{
"img_url": "https://img.com/2.jpg"
}
]
},
{
"objectID": 16242,
"results": [
{
"img_url": "https://img.com/1.jpg"
},
{
"img_url": "https://img.com/2.jpg"
}
]
},
# ...
# multiple other objectIDs
]
I want to output only the https://img.com/30.jpg and https://img.com/3.jpg URLs (it can be a list, because I have multiple objects) or place them in a variable.
My code:
# last file
for item_last in last_data:
    results_last = item_last["results"]
    if results_last is not []:
        for result_last in results_last:
            ccv_last = result_last["img_url"]
            # second file
            for item_second in second_data:
                results_second = item_second["results"]
                if results_second is not []:
                    # loop in results
                    for result_second in results_second:
                        ccv_second = result_second["img_url"]
                        if ccv_last != ccv_second and ccv_last is not None:
                            print(ccv_last)
If you are trying to find the difference between two lists, here it is.
I have slightly modified your code to get the expected result.
# last file
ccv_last = []
for item_last in last_data:
    results_last = item_last["results"]
    if results_last:
        for result_last in results_last:
            ccv_last.append(result_last["img_url"])
# second file
ccv_second = []
for item_second in second_data:
    results_second = item_second["results"]
    if results_second:
        for result_second in results_second:
            ccv_second.append(result_second["img_url"])
# URLs present in the last file but not in the second one (set order is not guaranteed)
diff_list = list(set(ccv_last) - set(ccv_second))
Output:
['https://img.com/30.jpg', 'https://img.com/3.jpg']
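The set difference above treats all URLs as one pool, so a URL that is new for one objectID but already known under another would not be reported. If that distinction matters, the comparison can be done per objectID. The following is a minimal sketch under that assumption, using trimmed sample data; the helper name urls_by_object and the variable names are illustrative.

```python
def urls_by_object(data):
    """Map each objectID to the set of img_url values in its results."""
    return {
        item["objectID"]: {result["img_url"] for result in item.get("results", [])}
        for item in data
    }

last_data = [
    {"objectID": 16240, "results": [
        {"img_url": "https://img.com/1.jpg"},
        {"img_url": "https://img.com/2.jpg"},
        {"img_url": "https://img.com/30.jpg"},
    ]},
    {"objectID": 16242, "results": [
        {"img_url": "https://img.com/1.jpg"},
        {"img_url": "https://img.com/2.jpg"},
        {"img_url": "https://img.com/3.jpg"},
    ]},
]
second_data = [
    {"objectID": 16240, "results": [
        {"img_url": "https://img.com/1.jpg"},
        {"img_url": "https://img.com/2.jpg"},
    ]},
    {"objectID": 16242, "results": [
        {"img_url": "https://img.com/1.jpg"},
        {"img_url": "https://img.com/2.jpg"},
    ]},
]

known = urls_by_object(second_data)
# For every object, keep the URLs that were not known for that same object.
new_by_object = {
    object_id: sorted(urls - known.get(object_id, set()))
    for object_id, urls in urls_by_object(last_data).items()
}
# Flatten into one sorted list when the objectID is not needed.
all_new = sorted(url for urls in new_by_object.values() for url in urls)
print(new_by_object)
print(all_new)
```

Sorting the results makes the output deterministic, which a plain list(set(...)) does not guarantee.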
However, you could change your results model slightly for better performance, as shown below.
If no further keys are planned for the dictionaries in the results list, you probably just want a list of URLs, so you can change the list of dicts to a plain list
from
...
"results": [
{
"img_url": "https://img.com/1.jpg"
},
{
"img_url": "https://img.com/2.jpg"
}
]
...
to just a list of URLs
...
"img_url_results": ["https://img.com/1.jpg","https://img.com/2.jpg"]
...
With this change you can skip one for loop.
# last file
ccv_last = []
for item_last in last_data:
    if item_last.get('img_url_results'):
        ccv_last.extend(item_last["img_url_results"])

Getting 0 records while parsing a JSON file if the key attribute does not exist

I have a few static key columns (EmployeeId, type) and a few columns that come from the first for loop.
In the second for loop, values should be appended to the existing data frame columns only if a specific key is present; otherwise the columns fetched by the first for loop should stay as they are.
First for loop output:
"EmployeeId","type","KeyColumn","Start","End","Country","Target","CountryId","TargetId"
"Emp1","Metal","1212121212","2000-06-17","9999-12-31","","","",""
After the second for loop I get the output below:
"EmployeeId","type","KeyColumn","Start","End","Country","Target","CountryId","TargetId"
"Emp1","Metal","1212121212","2000-06-17","9999-12-31","","AMAZON","1",""
"Emp1","Metal","1212121212","2000-06-17","9999-12-31","","FLIPKART","2",""
With this code, when the Employee tag is available I get the 2 records above. Some JSON files have no Employee tag, and for those the output should stay the same as the first loop output, with all the key fields populated and the remaining columns null.
But with my code I get 0 records. Please tell me if my approach is wrong.
I am new to Python, so I am sorry if the question is not clear. Please find the sample data in the link below.
Please find the code below:
for i in range(len(json_file['enty'])):
    temp = {}
    temp['EmployeeId'] = json_file['enty'][i]['id']
    temp['type'] = json_file['enty'][i]['type']
    for key in json_file['enty'][i]['data']['attributes'].keys():
        try:
            temp[key] = json_file['enty'][i]['data']['attributes'][key]['values'][0]['value']
        except:
            temp[key] = None
    for key in json_file['enty'][i]['data']['attributes'].keys():
        if(key == 'Employee'):
            for j in range(len(json_file['enty'][i]['data']['attributes']['Employee']['group'])):
                for key in json_file['enty'][i]['data']['attributes']['Employee']['group'][j].keys():
                    try:
                        temp[key] = json_file['enty'][i]['data']['attributes']['Employee']['group'][j][key]['values'][0]['value']
                    except:
                        temp[key] = None
                temp_df = pd.DataFrame([temp])
                df = pd.concat([df, temp_df], sort=True)
# Rearranging columns
df = df[['EmployeeId', 'type'] + [col for col in df.columns if col not in ['EmployeeId', 'type']]]
# Writing the dataset
df[columns_list].to_csv("Test22.csv", index=False, quotechar='"', quoting=1)
If the Employee tag is not available I get 0 records as output, but I expect 1 record, as produced by the first for loop.
The JSON structure is quite complicated, so I tried to simplify the data collection from it. The result is a list of flat dicts. The code handles the case where 'Employee' is not found.
import copy
d = {
"enty": [
{
"id": "Emp1",
"type": "Metal",
"data": {
"attributes": {
"KeyColumn": {
"values": [
{
"value": 1212121212
}
]
},
"End": {
"values": [
{
"value": "2050-12-31"
}
]
},
"Start": {
"values": [
{
"value": "2000-06-17"
}
]
},
"Employee": {
"group": [
{
"Target": {
"values": [
{
"value": "AMAZON"
}
]
},
"CountryId": {
"values": [
{
"value": "1"
}
]
}
},
{
"Target": {
"values": [
{
"value": "FLIPKART"
}
]
},
"CountryId": {
"values": [
{
"value": "2"
}
]
}
}
]
}
}
}
}
]
}
emps = []
for e in d['enty']:
    entry = {'id': e['id'], 'type': e['type']}
    for x in ["KeyColumn", "Start", "End"]:
        entry[x] = e['data']['attributes'][x]['values'][0]['value']
    if e['data']['attributes'].get('Employee'):
        # one output record per group, each starting from a copy of the base entry
        for grp in e['data']['attributes']['Employee']['group']:
            clone = copy.deepcopy(entry)
            for x in ['Target', 'CountryId']:
                clone[x] = grp[x]['values'][0]['value']
            emps.append(clone)
    else:
        # no 'Employee' attribute: keep the single base record
        emps.append(entry)
# TODO write to csv
for emp in emps:
    print(emp)
output
{'End': '2050-12-31', 'Target': 'AMAZON', 'KeyColumn': 1212121212, 'Start': '2000-06-17', 'CountryId': '1', 'type': 'Metal', 'id': 'Emp1'}
{'End': '2050-12-31', 'Target': 'FLIPKART', 'KeyColumn': 1212121212, 'Start': '2000-06-17', 'CountryId': '2', 'type': 'Metal', 'id': 'Emp1'}
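The remaining CSV step, including the case where 'Employee' is missing, could look roughly like the sketch below. It is an illustration under stated assumptions, not the answer's code: the helper name flatten is hypothetical, the second entity (Emp2, without an Employee attribute) is made-up sample data, the column list is shortened, and the CSV is written to an in-memory buffer instead of a file.

```python
import csv
import io

def flatten(entity):
    """Return one flat row per Employee group, or a single row if there is none."""
    attrs = entity["data"]["attributes"]
    base = {"EmployeeId": entity["id"], "type": entity["type"]}
    for name, attr in attrs.items():
        if name != "Employee":
            base[name] = attr["values"][0]["value"]
    groups = attrs.get("Employee", {}).get("group", [])
    if not groups:
        # No Employee attribute: still emit the base record.
        return [base]
    rows = []
    for group in groups:
        row = dict(base)  # shallow copy is enough, the values are scalars
        for name, attr in group.items():
            row[name] = attr["values"][0]["value"]
        rows.append(row)
    return rows

entities = [
    {"id": "Emp1", "type": "Metal", "data": {"attributes": {
        "KeyColumn": {"values": [{"value": 1212121212}]},
        "Employee": {"group": [
            {"Target": {"values": [{"value": "AMAZON"}]},
             "CountryId": {"values": [{"value": "1"}]}},
            {"Target": {"values": [{"value": "FLIPKART"}]},
             "CountryId": {"values": [{"value": "2"}]}},
        ]},
    }}},
    # Hypothetical entity without an Employee attribute.
    {"id": "Emp2", "type": "Metal", "data": {"attributes": {
        "KeyColumn": {"values": [{"value": 3434343434}]},
    }}},
]

rows = [row for entity in entities for row in flatten(entity)]
columns = ["EmployeeId", "type", "KeyColumn", "Target", "CountryId"]

buffer = io.StringIO()
# restval fills columns that a row does not have with an empty string.
writer = csv.DictWriter(buffer, fieldnames=columns, restval="", quoting=csv.QUOTE_ALL)
writer.writeheader()
writer.writerows(rows)
print(buffer.getvalue())
```

To write a real file, open it with open(path, "w", newline="") and pass that file object to csv.DictWriter instead of the buffer.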

I have a JSON file with 1500 keys, what's the best way to iterate through them?

Each key has a list of strings in them that I use to compare to another list. The dictionary is very nested so I use a recursive function to get the data of each key.
But it takes a long time to get through the entire list. Is there a faster way?
This is the code:
import re

# `porter` is assumed to be a stemmer instance defined elsewhere
# (for example an NLTK PorterStemmer); it is not shown in the question.
def get_industry(industry_data, industry_category):
    category_list = list()
    for category in industry_category:
        for key, item in category.items():
            r = re.compile('|'.join([r'\b%s\b' % porter.stem("".join(w.split())) for w in item['key_list']]), flags=re.I)
            words_found = r.findall(industry_data)
            if words_found:
                category_list.extend([key])
                new_list = get_industry(' '.join(words_found), item["Subcategories"])
                category_list.extend(new_list)
    return category_list
This is an example of a JSON file.
[
{
"Agriculture": {
"Subcategories": [
{
"Fruits ": {
"Subcategories": [
{
"Fresh Fruits": {
"Subcategories": [
{
"Apricots": {
"Subcategories": [],
"key_list": [
"Apricots"
]
}
},
{
"Tamarinds": {
"Subcategories": [],
"key_list": [
"Tamarinds"
]
}
}
],
"key_list": [
"loganberries",
"medlars"
]
}
}
],
"key_list": [
"lemons",
"tangelos"
]
}
},
{
"Vegetables ": {
"Subcategories": [
{
"Beetroot": {
"Subcategories": [],
"key_list": [
"Beetroot"
]
}
},
{
"Wasabi": {
"Subcategories": [],
"key_list": [
"Wasabi"
]
}
}
],
"key_list": [
"kohlrabies",
"wasabi "
]
}
}
],
"key_list": [
"wasabi",
"batatas"
]
}
}
]
This is an example of a list I want to compare it with.
["lemons","wasabi","washroom","machine","grapefruit","about","city"]
The answer should return this list:
["Agriculture","Vegetables","Wasabi"]
Comparing list to list and returning the categories takes about 3-5 seconds per operation. I heard that using Pandas would significantly increase the speed.
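One direction that may be worth trying before reaching for Pandas is to avoid compiling a regular expression for every category on every call. The sketch below swaps the regex and stemming for exact, case-insensitive set intersection, which is a simplification and an assumption about what counts as a match. The function name match_categories and the trimmed tree are illustrative, and whether this is fast enough for 1500 keys has not been measured here.

```python
def match_categories(words, categories):
    """Return the names of categories whose key_list shares a word with `words`.

    Subcategories are only searched with the words that matched their parent,
    mirroring the recursion in the question.
    """
    found = []
    for category in categories:
        for name, item in category.items():
            # Normalize keys: strip stray spaces and ignore case.
            keys = {key.strip().lower() for key in item["key_list"]}
            hits = words & keys
            if hits:
                found.append(name.strip())
                found.extend(match_categories(hits, item["Subcategories"]))
    return found

# Trimmed version of the example JSON from the question.
tree = [{
    "Agriculture": {
        "Subcategories": [
            {"Fruits ": {"Subcategories": [], "key_list": ["lemons", "tangelos"]}},
            {"Vegetables ": {
                "Subcategories": [
                    {"Beetroot": {"Subcategories": [], "key_list": ["Beetroot"]}},
                    {"Wasabi": {"Subcategories": [], "key_list": ["Wasabi"]}},
                ],
                "key_list": ["kohlrabies", "wasabi "],
            }},
        ],
        "key_list": ["wasabi", "batatas"],
    }
}]

words = ["lemons", "wasabi", "washroom", "machine", "grapefruit", "about", "city"]
matches = match_categories({word.lower() for word in words}, tree)
print(matches)
```

If stemming is required, the stems of each key_list could be computed once when the JSON is loaded and stored as sets, so that each lookup remains a set intersection.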
