I tried to create a JSON object but I made a mistake somewhere. I'm reading some data from a CSV file (center is a string; lat and lng are floats).
My code:
data = []
data.append({
    'id': 'id',
    'is_city': False,
    'name': center,
    'county': center,
    'cluster': i,
    'cluster2': i,
    'avaible': True,
    'is_deleted': False,
    'coordinates': ('{%s,%s}' % (lat, lng))
})
json_data = json.dumps(data)
print(json_data)
It outputs this:
[{
    "county": "County",
    "is_city": false,
    "is_deleted": false,
    "name": "name",
    "cluster": 99,
    "cluster2": 99,
    "id": "id",
    "coordinates": "{41.0063945,28.9048234}",
    "avaible": true
}]
This is what I want:
{
    "id": "id",
    "is_city": false,
    "name": "name",
    "county": "county",
    "cluster": 99,
    "cluster2": 99,
    "coordinates": [
        41.0870185,
        29.0235126
    ],
    "available": true,
    "isDeleted": false
}
You are defining coordinates to be a string of the specified format. There is no way json can encode that as a list; you are saying one thing when you want another.
Similarly, if you don't want the top-level dictionary to be the only element in a list, don't define it to be an element in a list.
data = {
    'id': 'id',
    'is_city': False,
    'name': name,
    'county': county,
    'cluster': i,
    'cluster2': i,
    'available': True,
    'is_deleted': False,
    'coordinates': [lat, lng]
}
I don't know how you defined center, or how you expected it to hold the value 'name' and the value 'county' at essentially the same time. I have declared two new variables for these values; you will need to adapt your code accordingly. I also fixed the typo in "available", which Python was never going to correct for you. (Note also that the boolean literals in Python are False and True, not false and true.)
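To get the pretty-printed layout shown in your expected output, pass an indent argument to json.dumps:
json_data = json.dumps(data, indent=4)
print(json_data)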
You can use pprint for pretty printing in Python, but it must be applied to an object, not to a string.
In your case json_data is a string that represents a JSON object, so you need to load it back into an object before you pprint it (or just use the data variable, since it already holds this object in your example).
for example try to run:
pprint.pprint(json.loads(json_data))
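A self-contained illustration (the sample data here is made up):
import json
import pprint

data = [{'id': 'id', 'coordinates': [41.0063945, 28.9048234]}]
json_data = json.dumps(data)

pprint.pprint(json.loads(json_data))  # parse the string back into an object first
pprint.pprint(data)                   # or pretty-print the original object directly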
I am building a simple function but I am stuck on an error. I am trying to sort a JSON array based on the datetime defined in each item, but the array also contains some None and empty-string dates like "". So it shows
KeyError: 'date'
when it sees a None or empty date value.
I am trying to push these values (the items whose date is None or an empty string) to the end of the sorted JSON array.
example_response = [
    {
        "id": 2959,
        "original_language": "Permanent Job",
        "date": "2012-10-26",
        "absent": False
    },
    {
        "id": 8752,
        "original_language": "Intern Job",
        "date": "",
        "absent": True
    },
    {
        "adult": False,
        "id": 1300,
        "title": "Training Job",
        "date": "2020-07-25",
        "absent": False
    },
    {
        "adult": False,
        "id": 7807,
        "title": "Training Job",
        "absent": False
    },
]
program.py
def sorting_function(response):
    if response == True:
        sorted_data = sorted(example_response, key=lambda x: datetime.strptime(x['date'], "%Y-%m-%d"))
        print(sorted_data)
        return sorted_data
As you can see in example_response, one dict has an empty-string date and one doesn't have "date" at all.
When I run this function it shows KeyError: 'date'
What I have tried?
I have also tried using
sorted_data = sorted(example_response, key=lambda x: (x['date'] is None, x['date'] == "", x['date'], datetime.strptime(x['date'], "%Y-%m-%d")))
But it still shows a KeyError.
Any help would be much appreciated.
Don't call strptime if x['date'] is None
If the key is
lambda x: (x['date'] is None, datetime.strptime(x['date'], "%Y-%m-%d"))
Then the pair will be computed for all values, which means strptime will be called on all x['date'], including those that are None.
I suggest using a conditional, in order to only call strptime if x['date'] is not None:
lambda x: (0, datetime.strptime(x['date'], "%Y-%m-%d")) if x['date'] is not None else (1, 0)
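This works because tuples compare element by element: the 0/1 flag in the first slot settles the order, so the mismatched second slots (a datetime versus the integer 0) are never compared with each other:
from datetime import datetime

(0, datetime(2012, 10, 26)) < (1, 0)  # True: 0 < 1 decides; slot two is never reached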
Use x.get('date') instead of x['date'] if x might be missing the 'date' key
If x is a dict that doesn't have a 'date', then attempting to access x['date'] will always cause a KeyError, even for something as simple as x['date'] is None.
Instead, you can use dict.get, which doesn't cause errors. If a value is missing, dict.get will return None, or another value which you can provide as a second argument:
x = { "id": 2959, "original_language": "Permanent Job" }
print(x['date'])
# KeyError
print(x.get('date'))
# None
print(x.get('date', 42))
# 42
Finally, the key function for the sort becomes (using a truthiness check rather than is not None, so that the empty-string dates in the example data also take the fallback path):
lambda x: (0, datetime.strptime(x.get('date'), "%Y-%m-%d")) if x.get('date') else (1, 0)
Note that if the key function becomes too complex, it might be better to write it using def instead of lambda:
def key(x):
    date = x.get('date')
    if not date:  # covers a missing key, None and the empty string
        return (1, 0)
    else:
        return (0, datetime.strptime(date, "%Y-%m-%d"))
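Putting it together with the question's data (assuming from datetime import datetime, which the strptime calls imply):
from datetime import datetime

sorted_data = sorted(example_response, key=key)
# valid dates come first in chronological order; blank and missing dates go last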
Dictionaries have a very useful get() function which you could utilise thus:
example_response = [
    {
        "id": 2959,
        "original_language": "Permanent Job",
        "date": "2012-10-26",
        "absent": False
    },
    {
        "id": 8752,
        "original_language": "Intern Job",
        "date": "",
        "absent": True
    },
    {
        "adult": False,
        "id": 1300,
        "title": "Training Job",
        "date": "2020-07-25",
        "absent": False
    },
    {
        "adult": False,
        "id": 7807,
        "title": "Training Job",
        "absent": False
    }
]
example_response.sort(key=lambda d: d.get('date', ''))
print(example_response)
In this case, missing or empty 'date' values would precede any other dates.
Output:
[{'id': 8752, 'original_language': 'Intern Job', 'date': '', 'absent': True}, {'adult': False, 'id': 7807, 'title': 'Training Job', 'absent': False}, {'id': 2959, 'original_language': 'Permanent Job', 'date': '2012-10-26', 'absent': False}, {'adult': False, 'id': 1300, 'title': 'Training Job', 'date': '2020-07-25', 'absent': False}]
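If, as the question asks, the blank entries should instead come last, the same idea works with a two-part key, since False sorts before True (a sketch; missing and empty dates end up grouped together at the end):
example_response.sort(key=lambda d: (not d.get('date'), d.get('date') or ''))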
You're nearly on the right track, but you need to find a way to not evaluate the date string when it is invalid (key not present, or the value is the empty string).
The nice thing about dates is that chronological order is the same as lexicographical order (for ISO-8601 date formats -- %Y-%m-%d). So you don't actually have to convert them to dates or datetimes -- just sort them as strings.
That takes care of items in the sequence which have date keys. But what about ones where the date key is not present? There are three options.
1. Use a default value, e.g. the empty string. However, this means no-date items will be mixed together with empty-date items.
2. Use a fixed-length tuple where the first item indicates whether the date key is present, with a default for the value when it is not, e.g. (False, ''), (True, '') and (True, '2022-10-03'). These values sort in the order given.
3. Use a variable-length tuple. Tuples of different lengths have a total ordering iff their shared elements are comparable, much like strings do (e.g. car sorts before care). So we can use () to represent a no-date item, ('',) to represent an empty-date item and ('2022-10-03',) to represent a normal date string.
Using the third possibility you can do:
sorted(
    example_response,
    key=lambda item: (item['date'],) if 'date' in item else ()
)
This ensures that items lacking a date key sort separately from items whose date value is the empty string, and both groups sort before all valid dates.
The keys and their sort order for your example would be:
[(), ('',), ('2012-10-26',), ('2020-07-25',)]
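For comparison, the second option (the fixed-length tuple) can be written directly off the key's presence; 'date' in item is False for no-date items, so they sort first (a sketch):
sorted(example_response, key=lambda item: ('date' in item, item.get('date', '')))
The keys here would be (False, ''), (True, ''), (True, '2012-10-26') and (True, '2020-07-25').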
I'm parsing some XML data, doing some logic on it, and trying to display the results in an HTML table. The dictionary, after filling, looks like this:
{
    "general_info": {
        "name": "xxx",
        "description": "xxx",
        "language": "xxx",
        "prefix": "xxx",
        "version": "xxx"
    },
    "element_count": {
        "folders": 23,
        "conditions": 72,
        "listeners": 1,
        "outputs": 47
    },
    "external_resource_count": {
        "total": 9,
        "extensions": {
            "jar": 8,
            "json": 1
        },
        "paths": {
            "/lib": 9
        }
    },
    "complexity": {
        "over_1_transition": {
            "number": 4,
            "percentage": 30.769
        },
        "over_1_trigger": {
            "number": 2,
            "percentage": 15.385
        },
        "over_1_output": {
            "number": 4,
            "percentage": 30.769
        }
    }
}
Then I'm using pandas to convert the dictionary into a table, like so:
data_frame = pandas.DataFrame.from_dict(data=extracted_metrics, orient='index').stack().to_frame()
The result is a table that is mostly correct:
While the first and second levels render correctly, the categories with a sub-sub-category get written into the cell as a string rather than expanded into a further column. I've also tried using stack(level=1), but it raises "IndexError: Too many levels: Index has only 1 level, not 2". I've also tried converting it to a Series, with no luck. It seems to render only "complete" columns. Is there a way of filling up the empty spaces in the dictionary before processing?
How can I get, for example, external_resource_count -> extensions to have two daughter rows jar and json, with an additional column for the values, so that the final table looks like this:
Extra credit if anyone can tell me how to get rid of the first row with the index numbers. Thanks!
The way you load the dataframe is correct, but you should rename the 0 column to a proper name.
# this function extracts all the keys from your nested dicts
def explode_and_filter(df, filterdict):
    return [df[col].apply(lambda x: x.get(k) if type(x) == dict else x).rename(f'{k}')
            for col, nested in filterdict.items()
            for k in nested]

data_frame = pd.DataFrame.from_dict(data=extracted_metrics, orient='index').stack().to_frame(name='somecol')

# let's separate the rows where a dict is present & explode only those rows
mask = data_frame.somecol.apply(lambda x: type(x) == dict)
expp = explode_and_filter(data_frame[mask],
                          {'somecol': ['jar', 'json', '/lib', 'number', 'percentage']})

# here we concat the exploded series to a frame
exploded_df = pd.concat(expp, axis=1).stack().to_frame(name='somecol2') \
    .reset_index(level=2).rename(columns={'level_2': 'somecol'})

# and now we concat the rows with dict elements with the rows with non-dict elements
out = pd.concat([data_frame[~mask], exploded_df])
The output dataframe looks like this
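As an aside, if a flat index is acceptable, pd.json_normalize offers a much shorter route; a sketch, assuming the dict is in extracted_metrics:
import pandas as pd

flat = pd.json_normalize(extracted_metrics)  # one row; dotted column names like 'external_resource_count.extensions.jar'
table = flat.T                               # transpose: one metric per row, values in a single column
print(table.to_string(header=False))         # header=False hides the 0 column label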
I'm trying to run some analysis on some data and ran into some questions while parsing the data in a CSV file.
This is the raw data in one cell:
{"completed": true, "attempts": 1, "item_state": {"1": {"correct": true, "zone": "zone-7"}, "0": {"correct": true, "zone": "zone-2"}, "2": {"correct": true, "zone": "zone-12"}}, "raw_earned": 1.0}
Formatted for clarity:
{
    "completed": true,
    "attempts": 1,
    "item_state": {
        "1": {
            "correct": true,
            "zone": "zone-7"
        },
        "0": {
            "correct": true,
            "zone": "zone-2"
        },
        "2": {
            "correct": true,
            "zone": "zone-12"
        }
    },
    "raw_earned": 1.0
}
I want to extract only the zone information after each number (1, 0, 2) and put the results (zone-7, zone-2, zone-12) in separate columns. How can I do that using R or Python?
It looks like a dictionary, and when it is stored as an element in a CSV it is stored as a string. In Python you can use ast.literal_eval(), which parses strings into Python data types like lists and dictionaries. (Note that literal_eval expects Python literals such as True/False; if the cell really contains lowercase true/false as shown, json.loads is the appropriate parser instead.)
If the cell you mentioned is indexed [i,j],
import pandas as pd
import ast

df = pd.read_csv(filename)
a = ast.literal_eval(df.loc[i][j])
b = pd.io.json.json_normalize(a)

output = []
for i in range(df.shape[0]):
    c = ast.literal_eval(df.iloc[i][j])
    temp = pd.DataFrame({'key': c['item_state'].keys(),
                         'zone': [x['zone'] for x in c['item_state'].values()]})
    temp['row_n'] = i
    output.append(temp)
output2 = pd.concat(output)  # concatenate the list of frames, not the last `temp`
If [i,j] is your cell:
a in the above code is the dictionary given in your example.
b is a flattened dictionary containing all its key/value pairs.
The rest of the code extracts only the zone values.
If you want to apply this to more than one cell, use the loop; otherwise use only the content inside the loop.
output is a list of data frames, each of which has the item_state key and zone value as columns, plus a row number for identification.
output2 is the concatenated data frame.
ast - Abstract Syntax Trees
In Python, you can use the json library to do something like this:
import json

d = json.loads(raw_cell_data)  # load the data into a Python dict
results = {}
for key, value in d['item_state'].items():
    results[key] = value['zone']
And then you can use results to print to a CSV.
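For instance, a minimal sketch of that CSV step (the file name zones.csv is just an example):
import csv
import json

d = json.loads(raw_cell_data)
results = {key: value['zone'] for key, value in d['item_state'].items()}

with open('zones.csv', 'w', newline='') as f:
    writer = csv.DictWriter(f, fieldnames=list(results))
    writer.writeheader()      # columns named '1', '0', '2'
    writer.writerow(results)  # one row: zone-7, zone-2, zone-12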
The initial situation is a bit unclear: what you show looks like JSON, but you mention it is in a CSV.
Assuming you have a CSV where the individual fields are strings containing JSON data, you can extract the zone information by using the csv and json packages:
set up a for loop to iterate over the rows of the CSV (see the csv docs for more detail),
and then use the json module to extract the zone from each string.
import csv
import json

# to get ss from a csv:
# my_csv = csv.reader( ... )
# for row in my_csv:
#     ss = row[N]

ss = '{"completed": true, "attempts": 1, "item_state": {"1": {"correct": true, "zone": "zone-7"}, "0": {"correct": true, "zone": "zone-2"}, "2": {"correct": true, "zone": "zone-12"}}, "raw_earned": 1.0}'
jj = json.loads(ss)
for vv in jj['item_state'].values():
    print(vv['zone'])
Parse the cell value as JSON and then you can access any element you like:
import csv
import json

column_index = 0
state_keys = ['1', '0', '2']

with open('data.csv') as f:
    reader = csv.reader(f, delimiter=';')
    for row in reader:
        obj = json.loads(row[column_index])  # renamed from `object`, which shadows a builtin
        state = obj['item_state']
        # Show all values under item_state in the order they appear:
        for key, value in state.items():
            print(value['zone'])
        # Show only the keys in state_keys, in the order they are defined in the list:
        for key in state_keys:
            print(state[key]['zone'])
Something like this (not tested, as you have not provided a sufficient sample):
import csv
import json

with open('data.csv') as fr:
    rows = list(csv.reader(fr))

for row in rows:
    data = json.loads(row[0])
    new_col_data = [v['zone'] for v in data['item_state'].values()]
    row.append(", ".join(new_col_data))

with open('new_data.csv', 'w') as fw:
    writer = csv.writer(fw)
    writer.writerows(rows)
In the R package rjson, the function fromJSON is simple to use.
Either of the following ways of reading the JSON string will produce the same result.
library("rjson")
x <- '{"completed": true, "attempts": 1, "item_state": {"1": {"correct": true, "zone": "zone-7"}, "0": {"correct": true, "zone": "zone-2"}, "2": {"correct": true, "zone": "zone-12"}}, "raw_earned": 1.0}'
json <- fromJSON(json_str = x)
# if the string is in a file, say, "so.json"
#json <- fromJSON(file = "so.json")
json is an object of class "list"; make a data frame out of it:
result <- data.frame(zone_num = names(json$item_state))
result <- cbind(result, do.call(rbind.data.frame, json$item_state)[2])
result
# zone_num zone
#1 1 zone-7
#0 0 zone-2
#2 2 zone-12
Get item_state, take the zone from each of its values, append the keys and values to empty lists, and finally create the new columns from those lists:
zone_val = []
zone_key = []
for k, v in d['item_state'].items():
    zone_val.append(v['zone'])
    zone_key.append(k)

# create one new column per key, holding its zone value
# (the original assigned keys to key-named columns and values to value-named
# columns, which looks like a typo)
for k, v in zip(zone_key, zone_val):
    DF[k] = v
In Python, it looks like each cell's data is a dictionary that also contains dictionaries, i.e. nested dictionaries.
If this cell's data were referenced as a variable cell_data, then you can get into the inner "item_state" dictionary with:
cell_data["item_state"]
this will return
{"1": {"correct": true, "zone": "zone-7"}, "0": {"correct": true, "zone": "zone-2"}, "2": {"correct": true, "zone": "zone-12"}}
Then you can do the same operation one level deeper by asking for the "1" dictionary:
cell_data["item_state"]["1"]
returns:
{'correct': True, 'zone': 'zone-7'}
Then once more:
cell_data["item_state"]["1"]["zone"]
returns
'zone-7'
So to bring it all together, you could get what you want with the following:
your_list = list( cell_data["item_state"][i]['zone'] for i in ["1","0","2"] )
returns:
['zone-7', 'zone-2', 'zone-12']
I have a DataFrame (df) like this:
PointID Time geojson
---- ---- ----
36F 2016-04-01T03:52:30 {'type': 'Point', 'coordinates': [3.961389, 43.123]}
36G 2016-04-01T03:52:50 {'type': 'Point', 'coordinates': [3.543234, 43.789]}
The geojson column contains data in geoJSON format (essentially, a Python dict).
I want to create a new column in geoJSON format which includes the time coordinate. In other words, I want to inject the time information into the geoJSON info.
For a single value, I can successfully do:
oldjson = df.iloc[0]['geojson']
newjson = [oldjson['coordinates'][0], oldjson['coordinates'][1], df.iloc[0]['Time']]
For a single parameter, I successfully used DataFrame.apply in combination with lambda (thanks to SO: related question).
But now I have two parameters, and I want to use it on the whole DataFrame. As I am not confident with the .apply syntax and lambda, I do not know if this is even possible. I would like to do something like this:
def inject_time(geojson, time):
    """
    Injects the time dimension into geoJSON coordinates. Expects a dict in geoJSON Point format.
    """
    geojson['coordinates'] = [geojson['coordinates'][0], geojson['coordinates'][1], time]
    return geojson

df["newcolumn"] = df["geojson"].apply(lambda x: inject_time(x, df['time']))
...but that does not work, because the function would inject the whole series.
EDIT:
I figured that the format of the timestamped geoJSON should be something like this:
TimestampedGeoJson({
    "type": "FeatureCollection",
    "features": [
        {
            "type": "Feature",
            "geometry": {
                "type": "LineString",
                "coordinates": [[-70, -25], [-70, 35], [70, 35]],
            },
            "properties": {
                "times": [1435708800000, 1435795200000, 1435881600000]
            }
        }
    ]
})
So the time element is in the properties element, but this does not change the problem much.
You need DataFrame.apply with axis=1 for processing by rows:
df['new'] = df.apply(lambda x: inject_time(x['geojson'], x['Time']), axis=1)
# temporarily display long strings in the column
with pd.option_context('display.max_colwidth', 100):
    print(df['new'])
0 {'type': 'Point', 'coordinates': [3.961389, 43.123, '2016-04-01T03:52:30']}
1 {'type': 'Point', 'coordinates': [3.543234, 43.789, '2016-04-01T03:52:50']}
Name: new, dtype: object
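An equivalent way to build the column without apply, if the lambda feels opaque, is a plain list comprehension over the two columns (a sketch):
df['new'] = [inject_time(g, t) for g, t in zip(df['geojson'], df['Time'])]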
I have JSON in the following format. My requirement is: if the "id" field is the same across entries, the remaining fields should be combined into lists, keeping the keys the same. I tried looping over it and referring to other sample code, but I couldn't get the required result. When I tried adding values to a new dictionary based on the 'id' field, the result was either only the last value or something like this:
[
    {
        "time": " all dates ",
        "author_id": "alll ",
        "id_number": "all id_number",
        "id": "all idd"
    }
]
Received JSON:
data = [
    {
        "time": "2015/03/27",
        "author_id": "abc_123",
        "id": "4585",
        "id_number": 123
    },
    {
        "time": "2015/03/30",
        "author_id": "abc_123",
        "id": "7776",
        "id_number": 122
    },
    {
        "time": "2015/03/22",
        "author_id": "abc_123",
        "id": "8449",
        "id_number": 111
    },
    {
        "time": "2012/03/30",
        "author_id": "def_456",
        "id": "4585",
        "id_number": 90
    }
]
Required Output:
new_data = [
    {
        "time": [
            "2015/03/27",
            "2012/03/30"
        ],
        "author_id": [
            "abc_123",
            "def_456"
        ],
        "id": "4585",
        "id_number": [
            123,
            90
        ]
    },
    {
        "time": "2015/03/30",
        "author_id": "abc_123",
        "id": "7776",
        "id_number": 122
    },
    {
        "time": "2015/03/27 05:22:42",
        "author_id": "abc_123",
        "id": "8449",
        "id_number": 111
    }
]
A first step could be to create a more regular structure: map each id to a dictionary in which every key maps to a list of the corresponding values, merging the original dictionaries that share an id.
Then, in a second step, build the result list from the values of that id-to-merged-dictionary mapping, using the length of the value lists to decide whether to copy the dictionary over as-is or to take the single element out of each list while copying. And that's it.
#!/usr/bin/env python
# coding: utf8
from __future__ import absolute_import, division, print_function
from collections import defaultdict
from functools import partial
from pprint import pprint


def main():
    records = [
        {
            'time': '2015/03/27',
            'author_id': 'abc_123',
            'id': '4585',
            'id_number': 123
        },
        {
            'time': '2015/03/30',
            'author_id': 'abc_123',
            'id': '7776',
            'id_number': 122
        },
        {
            'time': '2015/03/22',
            'author_id': 'abc_123',
            'id': '8449',
            'id_number': 111
        },
        {
            'time': '2012/03/30',
            'author_id': 'def_456',
            'id': '4585',
            'id_number': 90
        }
    ]

    id2record = defaultdict(partial(defaultdict, list))
    for record in records:
        merged_record = id2record[record['id']]
        for key, value in record.iteritems():
            merged_record[key].append(value)

    result = list()
    for record in id2record.itervalues():
        if len(record['id']) == 1:
            result.append(dict((k, vs[0]) for k, vs in record.iteritems()))
        else:
            record['id'] = record['id'][0]
            result.append(dict(record))

    pprint(result)


if __name__ == '__main__':
    main()
If you can change the requirements for the output I would suggest getting rid of the irregularity in the values. Code for processing the result has to deal with both cases — single values and list/array with values — which just makes it a little more complicated than it has to be.
Update: Fixed a problem in the code. The id value should always be a single value and never a list.
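For readers on Python 3, where dict.iteritems() and dict.itervalues() no longer exist, the same approach translates directly (a sketch of the identical merging logic, not a new method):
from collections import defaultdict

def merge_records(records):
    id2record = defaultdict(lambda: defaultdict(list))
    for record in records:
        merged_record = id2record[record['id']]
        for key, value in record.items():
            merged_record[key].append(value)

    result = []
    for record in id2record.values():
        if len(record['id']) == 1:
            result.append({k: vs[0] for k, vs in record.items()})
        else:
            record['id'] = record['id'][0]
            result.append(dict(record))
    return result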