JSON nested list to Pandas dataframe - python

I have a json file which looks like this:
"Aveiro": {
"Albergaria-a-Velha": {
"candidates": [
{
"effectiveCandidates": [
"JOSÉ OLIVEIRA SANTOS"
],
"party": "B.E.",
"votes": {
"absoluteMajority": 0,
"acronym": "B.E.",
"constituenctyCounter": 1,
"mandates": 0,
"percentage": 1.34,
"presidents": 0,
"validVotesPercentage": 1.4,
"votes": 179
}
},
{
"effectiveCandidates": [
"ANTÓNIO AUGUSTO AMARAL LOUREIRO E SANTOS"
],
"party": "CDS-PP",
"votes": {
"absoluteMajority": 1,
"acronym": "CDS-PP",
"constituenctyCounter": 1,
"mandates": 5,
"percentage": 59.7,
"presidents": 1,
"validVotesPercentage": 62.5,
"votes": 7970
}
},
{
"effectiveCandidates": [
"CARLOS MANUEL DA COSTA SERVEIRA VASQUES"
],
"party": "CH",
"votes": {
"absoluteMajority": 0,
"acronym": "CH",
"constituenctyCounter": 1,
"mandates": 0,
"percentage": 1.87,
"presidents": 0,
"validVotesPercentage": 1.95,
"votes": 249
}
},
{
"effectiveCandidates": [
"RODRIGO MANUEL PEREIRA MARQUES LOURENÇO"
],
"party": "PCP-PEV",
"votes": {
"absoluteMajority": 0,
"acronym": "PCP-PEV",
"constituenctyCounter": 1,
"mandates": 0,
"percentage": 1.57,
"presidents": 0,
"validVotesPercentage": 1.65,
"votes": 210
}
},
{
"effectiveCandidates": [
"DELFINA LISBOA MARTINS DA CUNHA"
],
"party": "PPD/PSD",
"votes": {
"absoluteMajority": 0,
"acronym": "PPD/PSD",
"constituenctyCounter": 1,
"mandates": 2,
"percentage": 24.23,
"presidents": 0,
"validVotesPercentage": 25.37,
"votes": 3235
}
},
{
"effectiveCandidates": [
"JESUS MANUEL VIDINHA TOMÁS"
],
"party": "PS",
"votes": {
"absoluteMajority": 0,
"acronym": "PS",
"constituenctyCounter": 1,
"mandates": 0,
"percentage": 6.82,
"presidents": 0,
"validVotesPercentage": 7.14,
"votes": 910
}
}
],
"parentTerritoryName": "Aveiro",
"territoryKey": "LOCAL-010200",
"territoryName": "Albergaria-a-Velha",
"total_votes": {
"availableMandates": 0,
"blankVotes": 377,
"blankVotesPercentage": 2.82,
"displayMessage": null,
"hasNoVoting": false,
"nullVotes": 221,
"nullVotesPercentage": 1.66,
"numberParishes": 6,
"numberVoters": 13351,
"percentageVoters": 59.48
}
},
The full file is here for reference
I thought that this code would work
import pandas as pd
from pandas import json_normalize
import json
with open('autarquicas_2021.json') as f:
    data = json.load(f)
df = pd.json_normalize(data)
However this is returning the following:
df.head()
Aveiro.Albergaria-a-Velha.candidates ... Évora.Évora.total_votes.percentageVoters
0 [{'effectiveCandidates': ['JOSÉ OLIVEIRA SANTO... ... 49.84
[1 rows x 4312 columns]
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1 entries, 0 to 0
Columns: 4312 entries, Aveiro.Albergaria-a-Velha.candidates to Évora.Évora.total_votes.percentageVoters
dtypes: bool(308), float64(924), int64(1540), object(1540)
memory usage: 31.7+ KB
None
For some reason the code is not working, and my research has led me to no solutions, as it seems that every json file has a mind of its own.
Any help would be much appreciated. Thank you in advance!
Disclaimer: This is for an open source project to bring more transparency into local elections in Portugal. It will not be used for commercial, or for profit projects.

You can use json_normalize after a small transformation of the original JSON.
First, convert the JSON into a list of records.
I am assuming "Aveiro" is a city and "Albergaria-a-Velha" is a district. Apologies for my unfamiliarity with the area; if this is wrong, please rename the keys.
res = [{**info, 'city': city, 'district': district} for city, districts in data.items() for district, info in districts.items()]
This transforms the original key-value JSON into a list of objects.
[{
"city": "Aveiro",
"district": "Albergaria-a-Velha",
"candidates": [{
...
}]
Then use json_normalize.
df = pd.json_normalize(res, record_path=['candidates'], meta=['total_votes', 'city', 'district'])
Further expanding the nested object total_votes.
df = pd.concat([df, pd.json_normalize(df['total_votes'])], axis=1)
>>> df.iloc[0]
effectiveCandidates [JOSÉ OLIVEIRA SANTOS]
party B.E.
votes.absoluteMajority 0
votes.acronym B.E.
votes.constituenctyCounter 1
votes.mandates 0
votes.percentage 1.34
votes.presidents 0
votes.validVotesPercentage 1.4
votes.votes 179
total_votes {'availableMandates': 0, 'blankVotes': 377, 'b...
city Aveiro
district Albergaria-a-Velha
availableMandates 0
blankVotes 377
blankVotesPercentage 2.82
displayMessage None
hasNoVoting False
nullVotes 221
nullVotesPercentage 1.66
numberParishes 6
numberVoters 13351
percentageVoters 59.48
Name: 0, dtype: object
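As a variation on the steps above, here is a minimal sketch that pulls the total_votes fields directly through nested meta paths instead of a second json_normalize call. The data dict is a shortened, hand-made stand-in for the real file (only a few keys kept), and the names records, city and district are illustrative assumptions.

```python
import pandas as pd

# Shortened stand-in for the real file: same nesting, far fewer keys.
data = {
    "Aveiro": {
        "Albergaria-a-Velha": {
            "candidates": [
                {"party": "B.E.", "votes": {"mandates": 0, "votes": 179}},
                {"party": "CDS-PP", "votes": {"mandates": 5, "votes": 7970}},
            ],
            "total_votes": {"blankVotes": 377, "nullVotes": 221},
        }
    }
}

# Turn the two levels of keys into ordinary fields of each record.
records = [
    {**info, "city": city, "district": district}
    for city, districts in data.items()
    for district, info in districts.items()
]

# One row per candidate; nested meta paths become dotted column names.
df = pd.json_normalize(
    records,
    record_path=["candidates"],
    meta=[
        "city",
        "district",
        ["total_votes", "blankVotes"],
        ["total_votes", "nullVotes"],
    ],
)
print(df.columns.tolist())
```

Each meta path has to exist in every record, so this only fits if all districts carry the same total_votes keys.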

Recursive Approach:
I usually use this function (a recursive approach) to do that kind of thing:
# Function for flattening
# json
def flatten_json(y):
    out = {}
    def flatten(x, name =''):
        # If the Nested key-value
        # pair is of dict type
        if type(x) is dict:
            for a in x:
                flatten(x[a], name + a + '_')
        # If the Nested key-value
        # pair is of list type
        elif type(x) is list:
            i = 0
            for a in x:
                flatten(a, name + str(i) + '_')
                i += 1
        else:
            out[name[:-1]] = x
    flatten(y)
    return out
You can call flatten_json for flattening your nested json.
# Driver code
print(flatten_json(data))
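To make the naming scheme concrete, here is a small self-contained sketch of the same recursive idea applied to one candidate-like record. The sample dict and its values are made up for illustration; the function mirrors the one above but uses isinstance and enumerate.

```python
def flatten_json(nested):
    """Flatten nested dicts/lists into one dict with '_'-joined keys."""
    out = {}

    def flatten(value, name=""):
        if isinstance(value, dict):
            for key in value:
                flatten(value[key], name + key + "_")
        elif isinstance(value, list):
            # List positions become part of the key: parent_0, parent_1, ...
            for i, item in enumerate(value):
                flatten(item, name + str(i) + "_")
        else:
            # Drop the trailing '_' added by the last recursion step.
            out[name[:-1]] = value

    flatten(nested)
    return out


# Hypothetical sample record, shaped like one entry of "candidates".
sample = {
    "party": "B.E.",
    "votes": {"votes": 179},
    "effectiveCandidates": ["A", "B"],
}
flat = flatten_json(sample)
print(flat)
```

Note that flattening the whole file this way still yields a single very wide record, so it is usually applied per record before building a DataFrame.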
Library-based approach:
from flatten_json import flatten
unflat_json = {'user':
    {'foo':
        {'UserID': 123456,
         'Email': 'foo#mail.com',
         'friends': ['Johnny', 'Mark', 'Tom']
        }
    }
}
flat_json = flatten(unflat_json)
print(flat_json)

Related

pandas.to_json suppress indentation for lists as values

I have a DataFrame with lists in one column.
I want to pretty print the data as JSON.
How can I use indentation without the list values in each cell being indented as well?
An example:
df = pd.DataFrame(range(3))
df["lists"] = [list(range(i+1)) for i in range(3)]
print(df)
output:
0 lists
0 0 [0]
1 1 [0, 1]
2 2 [0, 1, 2]
Now I want to print the data as JSON using:
print(df.to_json(orient="index", indent=2))
output:
{
"0":{
"0":0,
"lists":[
0
]
},
"1":{
"0":1,
"lists":[
0,
1
]
},
"2":{
"0":2,
"lists":[
0,
1,
2
]
}
}
desired output:
{
"0":{
"0":0,
"lists":[0]
},
"1":{
"0":1,
"lists":[0,1]
},
"2":{
"0":2,
"lists":[0,1,2]
}
}
If strict JSON output is not a concern, you can temporarily convert the list column to strings when printing the DataFrame:
print(df.astype({'lists':'str'}).to_json(orient="index", indent=2))
{
"0":{
"0":0,
"lists":"[0]"
},
"1":{
"0":1,
"lists":"[0, 1]"
},
"2":{
"0":2,
"lists":"[0, 1, 2]"
}
}
If you don't want to see the quote marks, you can use a regex to remove them:
import re
result = re.sub(r'("lists":)"([^"]*)"', r"\1 \2",
                df.astype({'lists':'str'}).to_json(orient="index", indent=2))
print(result)
{
"0":{
"0":0,
"lists": [0]
},
"1":{
"0":1,
"lists": [0, 1]
},
"2":{
"0":2,
"lists": [0, 1, 2]
}
}
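A different route, sketched here under the assumption that the frame is small enough to format by hand: round-trip through to_json, then write each field with json.dumps, which keeps lists on one line when no indent is given. The names rows, pad and blocks are illustrative, and the separators differ slightly from pandas' own output (a space after each colon).

```python
import json
import pandas as pd

df = pd.DataFrame(range(3))
df["lists"] = [list(range(i + 1)) for i in range(3)]

# Let pandas do the type conversion, then work with plain Python objects.
rows = json.loads(df.to_json(orient="index"))

pad = "  "  # two-space indent, matching indent=2 in the question
blocks = []
for index_label, row in rows.items():
    # json.dumps without indent keeps each list value on a single line.
    fields = ",\n".join(
        f"{pad * 2}{json.dumps(key)}: {json.dumps(value)}"
        for key, value in row.items()
    )
    blocks.append(f"{pad}{json.dumps(index_label)}: {{\n{fields}\n{pad}}}")

result = "{\n" + ",\n".join(blocks) + "\n}"
print(result)
```

Unlike the regex approach, this does not depend on the column being named "lists", and the result stays valid JSON.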

How to get specific data from JSON object in Python

I have a dict stored under the variable parsed:
{
"8119300029": {
"store": 4,
"total": 4,
"web": 4
},
"8119300030": {
"store": 2,
"total": 2,
"web": 2
},
"8119300031": {
"store": 0,
"total": 0,
"web": 0
},
"8119300032": {
"store": 1,
"total": 1,
"web": 1
},
"8119300033": {
"store": 0,
"total": 0,
"web": 0
},
"8119300034": {
"store": 2,
"total": 2,
"web": 2
},
"8119300036": {
"store": 0,
"total": 0,
"web": 0
},
"8119300037": {
"store": 0,
"total": 0,
"web": 0
},
"8119300038": {
"store": 2,
"total": 2,
"web": 2
},
"8119300039": {
"store": 3,
"total": 3,
"web": 3
},
"8119300040": {
"store": 3,
"total": 3,
"web": 3
},
"8119300041": {
"store": 0,
"total": 0,
"web": 0
}
}
I am trying to get the "web" value from each JSON entry but can only get the key values.
for x in parsed:
    print(x["web"])
I tried doing this ^ but kept getting this error: "string indices must be integers". Can somebody explain why this is wrong?
Because your x variable is the dict key, not the value stored under it. Use the key to look up the inner dict:
for x in parsed:
    print(parsed[x]['web'])
A little information on your parsed data there: this is basically a dictionary of dictionaries. I won't go into too much of the nitty gritty but it would do well to read up a bit on json: https://www.w3schools.com/python/python_json.asp
In your example, for x in parsed is iterating through the keys of the parsed dictionary, e.g. 8119300029, 8119300030, etc. So x is a key (in this case, a string), not a dictionary. The reason you're getting an error about not indexing with an integer is because you're trying to index a string -- for example x[0] would give you the first character 8 of the key 8119300029.
If you need to get each web value, then you need to access that key in the parsed[x] dictionary:
for x in parsed:
    print(parsed[x]["web"])
Output:
4
2
0
...
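An equivalent way to write the loop, as a short sketch on a two-entry subset of the data from the question: items() yields each key together with its inner dict, and values() yields only the inner dicts. The names web_by_id and web_values are illustrative.

```python
# Two entries copied from the question's data.
parsed = {
    "8119300029": {"store": 4, "total": 4, "web": 4},
    "8119300030": {"store": 2, "total": 2, "web": 2},
}

# items() gives (key, inner dict) pairs, so no second lookup is needed.
for item_id, entry in parsed.items():
    print(item_id, entry["web"])

# Collect the results instead of printing them.
web_by_id = {item_id: entry["web"] for item_id, entry in parsed.items()}
web_values = [entry["web"] for entry in parsed.values()]
```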

Speech to Text - Map speaker label to corresponding transcript in JSON response

Every so often comes a piece of JSON data that presents a challenge that can take hours to extract desired information from. I have the below JSON response produced from a Speech To Text API engine.
It shows the transcript, the utterance of each word with timestamps, and the speaker label for each of the two speakers (speaker 0 and speaker 2) in the conversation.
{
"results": [
{
"alternatives": [
{
"timestamps": [
[
"the",
6.18,
6.63
],
[
"weather",
6.63,
6.95
],
[
"is",
6.95,
7.53
],
[
"sunny",
7.73,
8.11
],
[
"it's",
8.21,
8.5
],
[
"time",
8.5,
8.66
],
[
"to",
8.66,
8.81
],
[
"sip",
8.81,
8.99
],
[
"in",
8.99,
9.02
],
[
"some",
9.02,
9.25
],
[
"cold",
9.25,
9.32
],
[
"beer",
9.32,
9.68
]
],
"confidence": 0.812,
"transcript": "the weather is sunny it's time to sip in some cold beer "
}
],
"final": "True"
},
{
"alternatives": [
{
"timestamps": [
[
"sure",
10.52,
10.88
],
[
"that",
10.92,
11.19
],
[
"sounds",
11.68,
11.82
],
[
"like",
11.82,
12.11
],
[
"a",
12.32,
12.96
],
[
"plan",
12.99,
13.8
]
],
"confidence": 0.829,
"transcript": "sure that sounds like a plan"
}
],
"final": "True"
}
],
"result_index":0,
"speaker_labels": [
{
"from": 6.18,
"to": 6.63,
"speaker": 0,
"confidence": 0.475,
"final": "False"
},
{
"from": 6.63,
"to": 6.95,
"speaker": 0,
"confidence": 0.475,
"final": "False"
},
{
"from": 6.95,
"to": 7.53,
"speaker": 0,
"confidence": 0.475,
"final": "False"
},
{
"from": 7.73,
"to": 8.11,
"speaker": 0,
"confidence": 0.499,
"final": "False"
},
{
"from": 8.21,
"to": 8.5,
"speaker": 0,
"confidence": 0.472,
"final": "False"
},
{
"from": 8.5,
"to": 8.66,
"speaker": 0,
"confidence": 0.472,
"final": "False"
},
{
"from": 8.66,
"to": 8.81,
"speaker": 0,
"confidence": 0.472,
"final": "False"
},
{
"from": 8.81,
"to": 8.99,
"speaker": 0,
"confidence": 0.472,
"final": "False"
},
{
"from": 8.99,
"to": 9.02,
"speaker": 0,
"confidence": 0.472,
"final": "False"
},
{
"from": 9.02,
"to": 9.25,
"speaker": 0,
"confidence": 0.472,
"final": "False"
},
{
"from": 9.25,
"to": 9.32,
"speaker": 0,
"confidence": 0.472,
"final": "False"
},
{
"from": 9.32,
"to": 9.68,
"speaker": 0,
"confidence": 0.472,
"final": "False"
},
{
"from": 10.52,
"to": 10.88,
"speaker": 2,
"confidence": 0.441,
"final": "False"
},
{
"from": 10.92,
"to": 11.19,
"speaker": 2,
"confidence": 0.364,
"final": "False"
},
{
"from": 11.68,
"to": 11.82,
"speaker": 2,
"confidence": 0.372,
"final": "False"
},
{
"from": 11.82,
"to": 12.11,
"speaker": 2,
"confidence": 0.372,
"final": "False"
},
{
"from": 12.32,
"to": 12.96,
"speaker": 2,
"confidence": 0.383,
"final": "False"
},
{
"from": 12.99,
"to": 13.8,
"speaker": 2,
"confidence": 0.428,
"final": "False"
}
]
}
Forgive indentation issues(if any) but the JSON is valid and I've been trying to map each transcript with its corresponding speaker label.
I want something like the output below. The full JSON is about 20,000 lines, and it's a nightmare to extract the speaker label based on timestamps and word utterances and to put it together with the transcript.
[
{
"transcript": "the weather is sunny it's time to sip in some cold beer ",
"speaker" : 0
},
{
"transcript": "sure that sounds like a plan",
"speaker" : 2
}
]
What I've tried so far:
The JSON data is stored in a file named example.json. I have been able to put each word and its corresponding timestamp and speaker label in a list of tuples(see output below):
import json
# with open('C:\\Users\\%USERPROFILE%\\Desktop\\example.json', 'r') as f:
#     data = json.load(f)
l1 = []
l2 = []
l3 = []
for i in data['results']:
    for j in i['alternatives'][0]['timestamps']:
        l1.append(j)
for m in data['speaker_labels']:
    l2.append(m)
for q in l1:
    for n in l2:
        if q[1]==n['from']:
            l3.append((q[0],n['speaker'], q[1], q[2]))
print(l3)
This gives the Output:
[('the', 0, 6.18, 6.63),
('weather', 0, 6.63, 6.95),
('is', 0, 6.95, 7.53),
('sunny', 0, 7.73, 8.11),
("it's", 0, 8.21, 8.5),
('time', 0, 8.5, 8.66),
('to', 0, 8.66, 8.81),
('sip', 0, 8.81, 8.99),
('in', 0, 8.99, 9.02),
('some', 0, 9.02, 9.25),
('cold', 0, 9.25, 9.32),
('beer', 0, 9.32, 9.68),
('sure', 2, 10.52, 10.88),
('that', 2, 10.92, 11.19),
('sounds', 2, 11.68, 11.82),
('like', 2, 11.82, 12.11),
('a', 2, 12.32, 12.96),
('plan', 2, 12.99, 13.8)]
But now I am not sure how to associate words together based on timestamp comparison and "bucket" each set of words to form the transcript again with its speaker label.
I've also managed to get the transcripts in a list, but how do I extract the speaker label for each transcript from the above list? The speaker labels (speaker 0 and speaker 2) are unfortunately assigned per word; I wish they had been assigned per transcript instead.
l4 = []
for i in data['results']:
    l4.append(i['alternatives'][0]['transcript'])
This gives the Output:
["the weather is sunny it's time to sip in some cold beer ",'sure that sounds like a plan']
I've tried to explain the problem as best as I can but I am open to any feedback and will make changes if necessary. Also, I am pretty sure there is a better way to solve this problem rather than make several lists, any help is much appreciated.
For a larger dataset, refer to the pastebin. I hope this dataset can be helpful in bench-marking for performance. I can provide an even larger dataset as and when available or if required.
As I am dealing with large JSON data, performance is an important factor, similarly accurately achieving speaker isolation in overlapping transcriptions is another requirement.
Using pandas, here's how I tackled it just now.
This assumes the data is stored in a dictionary called data.
import pandas as pd
labels = pd.DataFrame.from_records(data['speaker_labels'])
transcript_tstamps = pd.DataFrame.from_records(
    [t for r in data['results']
       for a in r['alternatives']
       for t in a['timestamps']],
    columns=['word', 'from', 'to']
)
# this list comprehension more-efficiently de-nests the dictionary into
# records that can be used to create a DataFrame
df = labels.merge(transcript_tstamps)
# produces a dataframe of speakers to words based on timestamps from & to
# since I knew I wanted to merge on the from & to columns,
# I named the columns thus when I created the transcript_tstamps data frame
# like this:
confidence final from speaker to word
0 0.475 False 6.18 0 6.63 the
1 0.475 False 6.63 0 6.95 weather
2 0.475 False 6.95 0 7.53 is
3 0.499 False 7.73 0 8.11 sunny
4 0.472 False 8.21 0 8.50 it's
5 0.472 False 8.50 0 8.66 time
6 0.472 False 8.66 0 8.81 to
7 0.472 False 8.81 0 8.99 sip
8 0.472 False 8.99 0 9.02 in
9 0.472 False 9.02 0 9.25 some
10 0.472 False 9.25 0 9.32 cold
11 0.472 False 9.32 0 9.68 beer
12 0.441 False 10.52 2 10.88 sure
13 0.364 False 10.92 2 11.19 that
14 0.372 False 11.68 2 11.82 sounds
15 0.372 False 11.82 2 12.11 like
16 0.383 False 12.32 2 12.96 a
17 0.428 False 12.99 2 13.80 plan
After the speaker and word data are joined, successive words by the same speaker must be grouped together to derive the current speaker. For instance, if the speaker array looked like [2,2,2,2,0,0,0,2,2,2,0,0,0,0], we would need to group the first four 2s together, then the next three 0s, then the three 2s, and then the remaining 0s.
Sort the data by ['from', 'to'] and then set up a helper column for this called current_speaker, like this:
df = df.sort_values(['from', 'to'])
df['current_speaker'] = (df.speaker.shift() != df.speaker).cumsum()
From here, group by current_speaker, aggregate the words into a sentence, and convert to JSON. There's a little additional renaming to fix the output JSON keys:
transcripts = df.groupby('current_speaker').agg({
    'word': lambda x: ' '.join(x),
    'speaker': 'min'
}).rename(columns={'word': 'transcript'})
transcripts[['speaker', 'transcript']].to_json(orient='records')
# produces the following output (indentation added by me for legibility):
'[{"speaker":0,
"transcript":"the weather is sunny it\'s time to sip in some cold beer"},
{"speaker":2,
"transcript":"sure that sounds like a plan"}]'
To add data about when each transcript starts and ends, you can add the min of from and the max of to to the groupby:
transcripts = df.groupby('current_speaker').agg({
    'word': lambda x: ' '.join(x),
    'speaker': 'min',
    'from': 'min',
    'to': 'max'
}).rename(columns={'word': 'transcript'})
Additionally (though this doesn't apply to this example data set), you should perhaps pick the alternative with the highest confidence for each time slice.
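The run-labelling step in this answer (shift, compare, cumsum) can be checked in isolation. The sketch below applies it to the speaker array used as an example in the explanation; the variable names speaker and run_id are illustrative.

```python
import pandas as pd

# Speaker array from the explanation above.
speaker = pd.Series([2, 2, 2, 2, 0, 0, 0, 2, 2, 2, 0, 0, 0, 0])

# True wherever the speaker differs from the previous row (the first row
# compares against NaN, so it always starts a new run).
starts_new_run = speaker.shift() != speaker

# A running count of run starts gives every consecutive run its own id.
run_id = starts_new_run.cumsum()
print(run_id.tolist())
# [1, 1, 1, 1, 2, 2, 2, 3, 3, 3, 4, 4, 4, 4]
```

Grouping by run_id rather than by speaker is what keeps the two separate turns of speaker 2 from being merged into one transcript.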
This is what I tried using JS.
See if a similar approach works for you in Python.
var resultTimestampLen = 0;
arrLen = JSON.parse(sTot_resuts.results.length);
for(var i = 0; i<arrLen; i++){
    speakerLablefrom = sTot_resuts.speaker_labels[resultTimestampLen].from;
    speakerLabelto = sTot_resuts.speaker_labels[resultTimestampLen].to;
    speakerId = sTot_resuts.speaker_labels[resultTimestampLen].speaker;
    var findSpeaker = new Array();
    findSpeaker = sTot_resuts.results[i].alternatives[0].timestamps[0];
    var timeStampFrom = findSpeaker[1];
    var timeStampto = findSpeaker[2];
    if(timeStampFrom === speakerLablefrom && timeStampto === speakerLabelto){
        console.log('Speaker '+sTot_resuts.speaker_labels[resultTimestampLen].speaker + ' ' + sTot_resuts.results[i].alternatives[0].transcript);
        var resultsTimestamp = new Array();
        resultsTimestamp = sTot_resuts.results[i].alternatives[0].timestamps.length;
        resultTimestampLen = resultsTimestamp+resultTimestampLen;
    }else{
        console.log('resultTimestampLen '+resultTimestampLen + 'speakerLablefrom '+speakerLablefrom + 'speakerLabelto '+speakerLabelto + 'timeStampFrom '+timeStampFrom + 'timeStampto '+timeStampto);
    }
}
I did it by putting the words into a dict keyed by their timestamps, and then matching them to their speakers:
times = {}
for r in data['results']:
    for word in r['alternatives'][0]['timestamps']:
        times[(word[1], word[2])] = word[0]
transcripts = {}
for r in data['speaker_labels']:
    speaker = r['speaker']
    if speaker in transcripts:
        transcripts[speaker].append(times[(r['from'], r['to'])])
    else:
        transcripts[speaker] = [times[(r['from'], r['to'])]]
print([{'speaker': k, 'transcript': ' '.join(transcripts[k])} for k in transcripts])
It runs on the example provided 1,000,000 times in ~12.34 seconds, so hopefully it's fast enough for what you want.
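If the same speaker talks more than once, a per-speaker dict merges all of their turns into one transcript. A possible pure-Python alternative, sketched here with itertools.groupby, keeps consecutive runs separate. The words and labels lists are a shortened, partly invented sample (the third turn by speaker 0 is not in the question's data), and exact timestamp matching is assumed.

```python
from itertools import groupby

# Shortened sample: (word, start, end) triples as in "timestamps".
# The last word is an invented third turn, added for illustration only.
words = [
    ("the", 6.18, 6.63),
    ("weather", 6.63, 6.95),
    ("sure", 10.52, 10.88),
    ("that", 10.92, 11.19),
    ("okay", 11.5, 11.9),
]
labels = [
    {"from": 6.18, "to": 6.63, "speaker": 0},
    {"from": 6.63, "to": 6.95, "speaker": 0},
    {"from": 10.52, "to": 10.88, "speaker": 2},
    {"from": 10.92, "to": 11.19, "speaker": 2},
    {"from": 11.5, "to": 11.9, "speaker": 0},
]

# Look up the speaker by the exact (from, to) pair of each word.
speaker_at = {(lab["from"], lab["to"]): lab["speaker"] for lab in labels}
tagged = [
    (speaker_at[(start, end)], word)
    for word, start, end in sorted(words, key=lambda w: w[1])
]

# groupby only merges neighbouring items, so each turn stays separate.
turns = [
    {"speaker": spk, "transcript": " ".join(word for _, word in run)}
    for spk, run in groupby(tagged, key=lambda pair: pair[0])
]
print(turns)
```

This is a single pass over the words after sorting, which should scale reasonably for large responses, though it has not been benchmarked against the other answers.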

csv to json with column data that needs to be grouped

I have a CSV file in a format similar to this
order_id, customer_name, item_1_id, item_1_quantity, Item_2_id, Item_2_quantity, Item_3_id, Item_3_quantity
1, John, 4, 1, 24, 4, 16, 1
2, Paul, 8, 3, 41, 1, 33, 1
3, Andrew, 1, 1, 34, 4, 8, 2
I want to export to json, currently I am doing this.
df = pd.read_csv('simple.csv')
print ( df.to_json(orient = 'records') )
And the output is
[
{
"Item_2_id": 24,
"Item_2_quantity": 4,
"Item_3_id": 16,
"Item_3_quantity": 1,
"customer_name": "John",
"item_1_id": 4,
"item_1_quantity": 1,
"order_id": 1
},
......
However, I would like the output to be
[
{
"customer_name": "John",
"order_id": 1,
"items": [
{ "id": 4, "quantity": 1 },
{ "id": 24, "quantity": 4 },
{ "id": 16, "quantity": 1 },
]
},
......
Any suggestions on a good way to do this?
In this particular project, there will not be more than 5 items per order.
Try the following:
import pandas as pd
import json
output_lst = []
## specify the first row as header
df = pd.read_csv('simple.csv', header=0)
## iterate through all the rows
for index, row in df.iterrows():
    order = {}
    items_lst = []
    ## column_list is a list of column headers
    column_list = df.columns.values
    for i, col_name in enumerate(column_list):
        ## for the first 2 columns simply copy the value into the dictionary
        if i<2:
            element = row[col_name]
            if isinstance(element, str):
                ## strip if it is a string type value
                element = element.strip()
            order[col_name] = element
        elif "_id" in col_name:
            ## i+1 is used assuming that the item_quantity comes right after the corresponding item_id for each item
            item_dict = {"id":row[col_name], "quantity":row[column_list[i+1]]}
            items_lst.append(item_dict)
    order["items"] = items_lst
    output_lst.append(order)
print(json.dumps(output_lst))
If you run the above script with the simple.csv described in the question, you get the following output:
[
{
"order_id": 1,
"items": [
{
"id": 4,
"quantity": 1
},
{
"id": 24,
"quantity": 4
},
{
"id": 16,
"quantity": 1
}
],
" customer_name": "John"
},
{
"order_id": 2,
"items": [
{
"id": 8,
"quantity": 3
},
{
"id": 41,
"quantity": 1
},
{
"id": 33,
"quantity": 1
}
],
" customer_name": "Paul"
},
{
"order_id": 3,
"items": [
{
"id": 1,
"quantity": 1
},
{
"id": 34,
"quantity": 4
},
{
"id": 8,
"quantity": 2
}
],
" customer_name": "Andrew"
}
]
Source DF:
In [168]: df
Out[168]:
order_id customer_name item_1_id item_1_quantity Item_2_id Item_2_quantity Item_3_id Item_3_quantity
0 1 John 4 1 24 4 16 1
1 2 Paul 8 3 41 1 33 1
2 3 Andrew 1 1 34 4 8 2
Solution:
In [169]: %paste
import re
x = df[['order_id','customer_name']].copy()
x['id'] = \
pd.Series(df.loc[:, df.columns.str.contains(r'item_.*?_id',
flags=re.I)].values.tolist(),
index=df.index)
x['quantity'] = \
pd.Series(df.loc[:, df.columns.str.contains(r'item_.*?_quantity',
flags=re.I)].values.tolist(),
index=df.index)
x.to_json(orient='records')
## -- End pasted text --
Out[169]: '[{"order_id":1,"customer_name":"John","id":[4,24,16],"quantity":[1,4,1]},{"order_id":2,"customer_name":"Paul","id":[8,41,33],"quantity":[3,1,1]},{"order_id":3,"customer_name":"Andrew","id":[1,34,8],"quantity":[1,4,2]}]'
Intermediate helper DF:
In [82]: x
Out[82]:
order_id customer_name id quantity
0 1 John [4, 24, 16] [1, 4, 1]
1 2 Paul [8, 41, 33] [3, 1, 1]
2 3 Andrew [1, 34, 8] [1, 4, 2]
j = df.set_index(['order_id','customer_name']) \
.groupby(lambda x: x.split('_')[-1], axis=1) \
.agg(lambda x: x.values.tolist()) \
.reset_index() \
.to_json(orient='records')
import json
Beautified result:
In [122]: print(json.dumps(json.loads(j), indent=2))
[
{
"order_id": 1,
"customer_name": "John",
"id": [
4,
24,
16
],
"quantity": [
1,
4,
1
]
},
{
"order_id": 2,
"customer_name": "Paul",
"id": [
8,
41,
33
],
"quantity": [
3,
1,
1
]
},
{
"order_id": 3,
"customer_name": "Andrew",
"id": [
1,
34,
8
],
"quantity": [
1,
4,
2
]
}
]
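The answers above return parallel id and quantity lists. To get the list of {"id", "quantity"} objects the question asks for, one option is sketched below using only the standard library. It assumes the columns follow the item_N_id / item_N_quantity pattern (case-insensitively), that every item cell is filled, and it uses an inline two-item sample instead of simple.csv.

```python
import csv
import io
import json

# Inline stand-in for simple.csv (two item pairs instead of three).
csv_text = """order_id, customer_name, item_1_id, item_1_quantity, Item_2_id, Item_2_quantity
1, John, 4, 1, 24, 4
2, Paul, 8, 3, 41, 1
"""

orders = []
# skipinitialspace drops the blank that follows each comma.
reader = csv.DictReader(io.StringIO(csv_text), skipinitialspace=True)
for raw in reader:
    # Normalise header case so item_1_id and Item_2_id match one pattern.
    row = {key.lower(): value for key, value in raw.items()}
    items = []
    n = 1
    while f"item_{n}_id" in row:
        items.append({
            "id": int(row[f"item_{n}_id"]),
            "quantity": int(row[f"item_{n}_quantity"]),
        })
        n += 1
    orders.append({
        "customer_name": row["customer_name"],
        "order_id": int(row["order_id"]),
        "items": items,
    })

print(json.dumps(orders, indent=2))
```

If some orders have fewer items and leave cells empty, the int() calls would need a guard that skips blank values.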

Take the first n dictionaries of a specific key in a sorted list

I am writing a script which calculates the distance in miles between an order's shipping address and each store location for a specific chain of stores. So far, I have created a sorted list of dictionaries (sorted by order_id and then distance). It looks like this:
[
{
"order_id": 1,
"distance": 10,
"storeID": 1112
},
{
"order_id": 1,
"distance": 20,
"storeID": 1116
},
{
"order_id": 1,
"distance": 30,
"storeID": 1134
},
{
"order_id": 1,
"distance": 40,
"storeID": 1133
},
{
"order_id": 2,
"distance": 6,
"storeID": 1112
},
{
"order_id": 2,
"distance": 12,
"storeID": 1116
},
{
"order_id": 2,
"distance": 18,
"storeID": 1134
},
{
"order_id": 2,
"distance": 24,
"storeID": 1133
}
]
From here, I would like to find the two closest stores for each order_id, as well as their distances.
What I'd ultimately want to end up with is a list that looks like this:
[
{
"order_id": 1,
"closet_store_distance": 10,
"closest_store_id": 1112,
"second_closet_store_distance": 20,
"second_closest_store_id": 1116
},
{
"order_id": 2,
"closet_store_distance": 6,
"closest_store_id": 1112,
"second_closet_store_distance": 12,
"second_closest_store_id": 1116
}
]
I am unsure of how to loop through each order_id in this list and select the two closest stores. Any help is appreciated.
Try something like this. I assumed the initial data was in a file called sample.txt.
import json
from operator import itemgetter
def make_order(stores, id):
    return {
        "order_id": id,
        "closet_store_distance": stores[0][1],
        "closest_store_id": stores[0][0],
        "second_closet_store_distance": stores[1][1],
        "second_closest_store_id": stores[1][0]
    }
def main():
    with open('sample.txt', 'r') as data_file:
        data = json.loads(data_file.read())
    id1 = {}
    id2 = {}
    for i in data:
        if i["order_id"] == 1:
            id1[i["storeID"]] = i["distance"]
        else:
            id2[i["storeID"]] = i["distance"]
    top1 = sorted(id1.items(), key=itemgetter(1))
    top2 = sorted(id2.items(), key=itemgetter(1))
    with open('results.json', 'w') as result_file:
        order1 = make_order(top1, 1)
        order2 = make_order(top2, 2)
        json.dump([order1, order2], result_file, indent=3, separators=(',', ': '))
if __name__ == '__main__':
    main()
The resulting file looks like:
[
{
"second_closest_store_id": 1116,
"closet_store_distance": 10,
"closest_store_id": 1112,
"order_id": 1,
"second_closet_store_distance": 20
},
{
"second_closest_store_id": 1116,
"closet_store_distance": 6,
"closest_store_id": 1112,
"order_id": 2,
"second_closet_store_distance": 12
}
]
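The code above handles exactly two order ids. Since the list is already sorted by order_id and then distance, a more general sketch can take the first two entries of each group with itertools.groupby. It assumes every order has at least two stores; the rows list is a shortened copy of the question's data, and the output keys are spelled as in the question.

```python
from itertools import groupby, islice
from operator import itemgetter

# Shortened copy of the question's list, sorted by order_id then distance.
rows = [
    {"order_id": 1, "distance": 10, "storeID": 1112},
    {"order_id": 1, "distance": 20, "storeID": 1116},
    {"order_id": 1, "distance": 30, "storeID": 1134},
    {"order_id": 2, "distance": 6, "storeID": 1112},
    {"order_id": 2, "distance": 12, "storeID": 1116},
    {"order_id": 2, "distance": 18, "storeID": 1134},
]

closest = []
for order_id, group in groupby(rows, key=itemgetter("order_id")):
    # The group is already distance-sorted, so the first two are the nearest.
    first, second = islice(group, 2)
    closest.append({
        "order_id": order_id,
        "closet_store_distance": first["distance"],
        "closest_store_id": first["storeID"],
        "second_closet_store_distance": second["distance"],
        "second_closest_store_id": second["storeID"],
    })

print(closest)
```

If the input were not pre-sorted, it would need sorted(rows, key=itemgetter("order_id", "distance")) first, because groupby only merges neighbouring entries.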
A nice readable answer (but using one of my free libraries.):
from PLOD import PLOD
order_store_list = [
{
"order_id": 1,
"distance": 10,
"storeID": 1112
},
{
"order_id": 1,
"distance": 20,
"storeID": 1116
},
{
"order_id": 1,
"distance": 30,
"storeID": 1134
},
{
"order_id": 1,
"distance": 40,
"storeID": 1133
},
{
"order_id": 2,
"distance": 6,
"storeID": 1112
},
{
"order_id": 2,
"distance": 12,
"storeID": 1116
},
{
"order_id": 2,
"distance": 18,
"storeID": 1134
},
{
"order_id": 2,
"distance": 24,
"storeID": 1133
}
]
#
# first, get the order_ids (place in a dictionary to ensure uniqueness)
#
order_id_keys = {}
for entry in order_store_list:
    order_id_keys[entry["order_id"]] = True
#
# next, get the two closest stores per order_id
#
closest_stores = []
for order_id in order_id_keys:
    top_two = PLOD(order_store_list).eq("order_id", order_id).sort("distance").returnList(limit=2)
    closest_stores.append({
        "order_id": order_id,
        "closet_store_distance": top_two[0]["distance"],
        "closest_store_id": top_two[0]["storeID"],
        "second_closet_store_distance": top_two[1]["distance"],
        "second_closest_store_id": top_two[1]["storeID"]
    })
#
# sort by order_id again (if that is important)
#
closest_stores = PLOD(closest_stores).sort("order_id").returnList()
This example assumes the production order_store_list will fit in memory. If you are using a larger dataset, I strongly recommend using a database and python library for that database.
My PLOD library is free and open source (MIT), but requires Python 2.7. I'm about two weeks away from a Python 3.5 release. See https://pypi.python.org/pypi/PLOD/0.1.7
