Prevent Pandas to_json() from wrapping lists in single quotes - python

I am trying to generate a JSON blob that contains lists built from Series in my dataframe. When I run to_json(orient='values') it appropriately transforms each Series into a list of just its values, but then wraps that list in quotes, making the data difficult to parse for anyone who handles it downstream. Is there a good way to convert the Series to a JSON object without wrapping the lists in quotes?
Code:
# Create empty JSON blob
output_json = {}

# Add to JSON output
output_json['count_time_fitness_timeseries'] = {
    'x': df.fitness_date.to_json(orient='values'),
    'y': df.minutes_fitness.to_json(orient='values'),
    'labels': df.minutes_fitness_hhmm.to_json(orient='values')
}
print(output_json)
Example:
{
'count_time_fitness_timeseries': {
'x': '["2020-04-01","2020-04-02","2020-04-03","2020-04-04","2020-04-05","2020-04-06","2020-04-07","2020-04-08","2020-04-09","2020-04-10","2020-04-11","2020-04-12","2020-04-13","2020-04-14","2020-04-15","2020-04-16","2020-04-17","2020-04-18","2020-04-19","2020-04-20","2020-04-21","2020-04-22","2020-04-23","2020-04-24","2020-04-25","2020-04-26","2020-04-27","2020-04-28","2020-04-29","2020-04-30","2020-05-01","2020-05-02","2020-05-03","2020-05-04","2020-05-05","2020-05-06","2020-05-07","2020-05-08","2020-05-09","2020-05-10","2020-05-11","2020-05-12","2020-05-13","2020-05-14","2020-05-15","2020-05-16","2020-05-17","2020-05-18","2020-05-19","2020-05-20","2020-05-21","2020-05-22","2020-05-23","2020-05-24","2020-05-25","2020-05-26","2020-05-27","2020-05-28","2020-05-29","2020-05-30","2020-05-31","2020-06-01","2020-06-02","2020-06-03","2020-06-04","2020-06-05","2020-06-06","2020-06-07","2020-06-08","2020-06-09","2020-06-10","2020-06-11","2020-06-12","2020-06-13","2020-06-14","2020-06-15","2020-06-16","2020-06-17","2020-06-18","2020-06-19","2020-06-20","2020-06-21","2020-06-22","2020-06-23","2020-06-24","2020-06-25","2020-06-26","2020-06-27","2020-06-28","2020-06-29","2020-06-30","2020-07-01"]',
'y': '[null,null,null,null,null,null,null,null,null,null,null,null,null,null,null,null,null,null,null,null,null,null,null,null,null,461.0,421.0,519.0,502.0,511.0,513.0,496.0,480.0,364.0,498.0,467.0,477.0,431.0,419.0,471.0,391.0,481.0,494.0,506.0,464.0,474.0,383.0,385.0,470.0,465.0,574.0,473.0,431.0,497.0,null,482.0,492.0,494.0,469.0,395.0,427.0,346.0,416.0,461.0,486.0,451.0,533.0,null,462.0,461.0,477.0,458.0,484.0,389.0,null,472.0,462.0,486.0,489.0,483.0,426.0,453.0,489.0,467.0,474.0,451.0,450.0,470.0,null,247.0,502.0,464.0]',
'labels': '[null,null,null,null,null,null,null,null,null,null,null,null,null,null,null,null,null,null,null,null,null,null,null,null,null,"7:41","7:01","8:39","8:22","8:31","8:33","8:16","8:00","6:04","8:18","7:47","7:57","7:11","6:59","7:51","6:31","8:01","8:14","8:26","7:44","7:54","6:23","6:25","7:50","7:45","9:34","7:53","7:11","8:17",null,"8:02","8:12","8:14","7:49","6:35","7:07","5:46","6:56","7:41","8:06","7:31","8:53",null,"7:42","7:41","7:57","7:38","8:04","6:29",null,"7:52","7:42","8:06","8:09","8:03","7:06","7:33","8:09","7:47","7:54","7:31","7:30","7:50",null,"4:07","8:22","7:44"]'
    }
}

Try this:
import json
# .... your code
print(json.dumps(output_json))
Note that to_json() returns a JSON-formatted string, so dumping the dict as-is still leaves each list embedded as one quoted string; decode the strings first, as in the sketch below.
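A minimal sketch of that fix, assuming df and the three columns from the question: decode each to_json() string back into a Python list with json.loads(), then serialize the whole blob once:
import json

output_json = {
    'count_time_fitness_timeseries': {
        'x': json.loads(df.fitness_date.to_json(orient='values')),
        'y': json.loads(df.minutes_fitness.to_json(orient='values')),
        'labels': json.loads(df.minutes_fitness_hhmm.to_json(orient='values'))
    }
}
# The values are now real lists, so they are emitted as JSON arrays, not quoted strings:
print(json.dumps(output_json))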

Related

Deeply nested json - a list within a dictionary to Pandas DataFrame

I'm trying to parse nested json results.
data = {
    "results": [
        {
            "components": [
                {
                    "times": {
                        "periods": [
                            {
                                "fromDayOfWeek": 0,
                                "fromHour": 12,
                                "fromMinute": 0,
                                "toDayOfWeek": 4,
                                "toHour": 21,
                                "toMinute": 0,
                                "id": 156589,
                                "periodId": 20855
                            }
                        ]
                    }
                }
            ]
        }
    ]
}
I can get to and create dataframes for "results" and "components" lists, but cannot get to "periods" due to the "times" dict. So far I have this:
df = pd.json_normalize(data, record_path = ['results','components'])
I need a separate "periods" dataframe with the included column names and values. I would appreciate your help on this. Thank you!
I. results
II. components
III. times
IV. periods
pd.json_normalize should be the correct way:
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.json_normalize.html
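A minimal sketch of that against the data dict above: record_path also accepts nested paths, so one call can walk all four levels and return the "periods" dataframe directly:
import pandas as pd

periods = pd.json_normalize(
    data,
    record_path=['results', 'components', ['times', 'periods']]
)
print(periods)
#    fromDayOfWeek  fromHour  fromMinute  toDayOfWeek  toHour  toMinute      id  periodId
# 0              0        12           0            4      21         0  156589     20855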
There are 4 levels of nesting. There can be x components in results and y periods in times - though that depth of nesting may be over-engineering.
The simplest way of getting at the data is plain indexing:
print(data['a']['b']['c']['d'])
In your case, note that "results" and "components" are lists, so you need an index (or a loop) at those levels:
print(data['results'][0]['components'][0]['times']['periods'])
You can collect one property from every period with a piece of code like this:
def GetPropertyFromPeriods(prop):
    propertyList = []
    for result in data['results']:
        for component in result['components']:
            for period in component['times']['periods']:
                propertyList.append(period[prop])
    return propertyList
This gives you access to one property inside periods (fromDayOfWeek, fromHour, fromMinute, ...).
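A quick usage check against the sample data above:
print(GetPropertyFromPeriods('fromHour'))
# [12]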
After converting the JSON values, transform them into a pandas dataframe:
print(pd.DataFrame(data, columns=["columnA", "columnB"]))
If stuck:
How to Create a table with data from JSON output in Python
Python - How to convert JSON File to Dataframe
pandas documentation:
pandas.DataFrame.from_dict
pandas.json_normalize

map json values / dictionary to different functions

I currently have a json file structured like so:
[
    {
        "jumpcloud-group-name": "Gsuite-Team",
        "jumpcloud-group-id": "abcde123455d2f4",
        "google-group-name": "test#somewebsite.com"
    },
    {
        "jumpcloud-group-name": "Gsuite-Team2",
        "jumpcloud-group-id": "abcde12345asdasdaasdasd",
        "google-group-name": "test1#somewebsite.com"
    }
]
I want to map the different values of the same keys to different functions.
Example: jumpcloud-group-id to group_id and google-group-name to groupKey.
*Both fields need to be strings, and I have already used json.load to import the JSON file into a dictionary (a list of dicts). I have tried using a for loop; however, I am confused about how to map everything when each dict has the same keys.
def jumpcloud_group_membership_ids():
    group_id = '5d2fsassfdasdasde9aa0ec'

def main():
    groupKey = 'test#domain.com'
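A minimal sketch of the mapping loop, assuming the parsed list of dicts is in groups and that the two functions from the question are refactored to accept the value as a parameter (the parameters and the groups.json filename are assumptions):
import json

with open('groups.json') as f:  # hypothetical filename
    groups = json.load(f)

for group in groups:
    # str() guarantees both fields are passed as strings, as required
    jumpcloud_group_membership_ids(str(group['jumpcloud-group-id']))
    main(str(group['google-group-name']))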

Flatten list of json objects into table with column for each object in Databricks

I have a json file that looks like this
[
  {"id": 1,
   "properties": [{"propertyname": "propertyone",
                   "propertyvalue": 5},
                  {"propertyname": "propertytwo",
                   "propertyvalue": 7}]},
  {"id": 2,
   "properties": [{"propertyname": "propertyone",
                   "propertyvalue": 3},
                  {"propertyname": "propertytwo",
                   "propertyvalue": 8}]}
]
I was able to load the file in Databricks and parse it, getting a column called properties that contains the array in the data. The next step is to flatten this column and get one column for each object in the array, named from propertyname and holding the value. Is there any native way of doing this in Databricks?
Most JSON structures I have worked with in the past are of a {name: value} format, which is straightforward to parse, but the format I'm dealing with is giving me some headaches.
Any suggestions? I would prefer to use built-in functionality, but if there is a way of doing it in Python I can also write a UDF.
EDIT
This is the output I am looking for.
Write the sample data to storage (one JSON record per line, so Spark reads it as JSON Lines; inputpath is a DBFS directory of your choice):
data = """
{"id": 1, "properties": [{"propertyname": "propertyone", "propertyvalue": 5}, {"propertyname": "propertytwo", "propertyvalue": 7}]}
{"id": 2, "properties": [{"propertyname": "propertyone", "propertyvalue": 3}, {"propertyname": "propertytwo", "propertyvalue": 8}]}
"""
dbutils.fs.put(inputpath + "/x.json", data, True)
Read the JSON data:
df = spark.read.format("json").load(inputpath)
Explode the array, so that each property becomes its own row:
from pyspark.sql.functions import explode

dfe = df.select("id", explode("properties").alias("p")) \
    .select("id", "p.propertyname", "p.propertyvalue")
Finally, with pivot you get the key-value pairs as columns:
display(dfe.groupby('id').pivot('propertyname').agg({'propertyvalue': 'first'}))
See also the examples in this Notebook on how to implement transformations on complex datatypes.
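For the "way of doing it in python" route, a pandas-only sketch of the same flatten-and-pivot (a local read, assuming the data fits in memory; x.json is the JSON Lines file written above):
import json
import pandas as pd

with open('x.json') as f:
    records = [json.loads(line) for line in f if line.strip()]

flat = pd.json_normalize(records, record_path='properties', meta='id')
wide = flat.pivot(index='id', columns='propertyname', values='propertyvalue').reset_index()
print(wide)
# propertyname  id  propertyone  propertytwo
# 0              1            5            7
# 1              2            3            8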

pandas change the order of columns

In my project, which uses Flask, I get JSON (via a REST API) containing data that I should convert to a pandas DataFrame.
The JSON looks like:
{
    "entity_data": [
        {"id": 1, "store": "a", "marker": "a"}
    ]
}
I get the JSON and extract the data:
params = request.json
entity_data = params.pop('entity_data')
and then I convert the data into a pandas dataframe:
entity_ids = pd.DataFrame(entity_data)
the result looks like this:
   id marker store
0   1      a     a
This is not the original order of the columns; I'd like the column order to match the dictionary.
Any help?
Use OrderedDict for an ordered dictionary
You should not assume dictionaries are ordered. While dictionaries have been insertion-ordered since Python 3.7, whether libraries maintain this order when reading JSON into a dictionary, or when converting the dictionary to a Pandas dataframe, should not be assumed.
The most reliable solution is to use collections.OrderedDict from the standard library:
import json
import pandas as pd
from collections import OrderedDict
params = """{
"entity_data":[
{"id": 1, "store": "a", "marker": "a"}
]
}"""
# in your app, replace the params string with request.json
data = json.loads(params, object_pairs_hook=OrderedDict)
entity_data = data.pop('entity_data')
df = pd.DataFrame(entity_data)
print(df)
# id store marker
# 0 1 a a
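As a hedged aside: on Python 3.7+ with a recent pandas, plain dicts preserve insertion order end to end, so the OrderedDict detour is often no longer needed:
import json
import pandas as pd

params = '{"entity_data": [{"id": 1, "store": "a", "marker": "a"}]}'
df = pd.DataFrame(json.loads(params)['entity_data'])
print(df)
#    id store marker
# 0   1     a      a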
Just add the column names parameter.
entity_ids = pd.DataFrame(entity_data, columns=["id","store","marker"])
Assuming you have access to the JSON sender, you can send the order in the JSON itself, like:
{
    "order": ["id", "store", "marker"],
    "entity_data": {"id": [1, 2], "store": ["a", "b"], "marker": ["a", "b"]}
}
Then create the DataFrame with the columns specified, as said by Chiheb.K.:
import pandas as pd

params = request.json
entity_data = params.pop('entity_data')
order = params.pop('order')
entity_df = pd.DataFrame(entity_data, columns=order)
If you cannot explicitly specify the order in the JSON, see this answer on specifying object_pairs_hook in JSONDecoder to get an OrderedDict and then create the DataFrame.

Import nested MongoDB to Pandas

I have a Collection with heavily nested docs in MongoDB that I want to flatten and import into Pandas. There are some nested dicts, but also a list of dicts that I want to transform into columns (see examples below for details).
I already have a function that works for smaller batches of documents. But the solution (I found it in the answer to this question) uses json. The problem with the json.loads operation is that it fails with a MemoryError on bigger selections from the Collection.
I tried many solutions suggesting other json parsers (e.g. ijson), but for different reasons none of them solved my problem. The only way left, if I want to keep the transformation via json, would be chunking bigger selections into smaller groups of documents and iterating over the parsing.
At this point I thought - and that is my main question here - maybe there is a smarter way to do the unnesting without taking the detour through json, directly in MongoDB or in Pandas or somehow combined?
This is a shortened example Doc:
{
    '_id': ObjectId('5b40fcc4affb061b8871cbc5'),
    'eventId': 2,
    'sId': 6833,
    'stage': {
        'value': 1,
        'Name': 'FirstStage'
    },
    'quality': [
        {
            'type': {
                'value': 2,
                'Name': 'Color'
            },
            'value': '124'
        },
        {
            'type': {
                'value': 7,
                'Name': 'Length'
            },
            'value': 'Short'
        },
        {
            'type': {
                'value': 15,
                'Name': 'Printed'
            }
        }
    ]
}
This is what a successful dataframe representation would look like (I skipped the columns '_id' and 'sId' for readability):
   eventId  stage.value  stage.name  q_color  q_length  q_printed
1        2            1  FirstStage      124     Short          1
My code so far (which runs into memory problems - see above):
import json
import pandas as pd
from bson import json_util
from pandas import json_normalize

def load_events(filter='sId', id=6833, all=False):
    if all:
        print('Loading all events.')
        cursor = events.find()
    else:
        print('Loading events with %s equal to %s.' % (filter, id))
        print('Filtering...')
        cursor = events.find({filter: id})
    print('Loading...')
    l = list(cursor)
    print('Parsing json...')
    sanitized = json.loads(json_util.dumps(l))
    print('Parsing quality...')
    for ev in sanitized:
        for q in ev['quality']:
            name = 'q_' + str(q['type']['Name'])
            value = q.pop('value', 1)
            ev[name] = value
        ev.pop('quality', None)
    normalized = json_normalize(sanitized)
    df = pd.DataFrame(normalized)
    return df
You don't need to convert the nested structures using JSON parsers. Just create your dataframe from the record list:
df = pandas.DataFrame(list(cursor))
and afterwards use pandas in order to unpack your lists and dictionaries:
import pandas
from itertools import chain
import numpy

df = pandas.DataFrame(list(cursor))  # the record list from above
df['stage.value'] = df['stage'].apply(lambda cell: cell['value'])
df['stage.name'] = df['stage'].apply(lambda cell: cell['Name'])
df['q_'] = df['quality'].apply(lambda cell: [(m['type']['Name'], m['value'] if 'value' in m.keys() else 1) for m in cell])
df['q_'] = df['q_'].apply(lambda cell: dict((k, v) for k, v in cell))
keys = set(chain(*df['q_'].apply(lambda column: column.keys())))
for key in keys:
    column_name = 'q_{}'.format(key).lower()
    df[column_name] = df['q_'].apply(lambda cell: cell[key] if key in cell.keys() else numpy.nan)
df.drop(['stage', 'quality', 'q_'], axis=1, inplace=True)
I use three steps to unpack the nested data types. First, the names and values are used to create a flat list of pairs (tuples). In the second step, a dictionary built from the tuples takes its keys from the 1st and its values from the 2nd position of each tuple. Then all existing property names are extracted once using a set, and each property gets a new column in a loop. Inside the loop, the value of each pair is mapped to the respective column cells.
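A self-contained sketch of those three steps run end to end on the sample document from the question (docs stands in for list(cursor); the '_id' field is left out so the snippet runs without bson):
import pandas
import numpy
from itertools import chain

docs = [{
    'eventId': 2,
    'sId': 6833,
    'stage': {'value': 1, 'Name': 'FirstStage'},
    'quality': [
        {'type': {'value': 2, 'Name': 'Color'}, 'value': '124'},
        {'type': {'value': 7, 'Name': 'Length'}, 'value': 'Short'},
        {'type': {'value': 15, 'Name': 'Printed'}},  # no 'value': defaults to 1
    ],
}]

df = pandas.DataFrame(docs)
df['stage.value'] = df['stage'].apply(lambda cell: cell['value'])
df['stage.name'] = df['stage'].apply(lambda cell: cell['Name'])
df['q_'] = df['quality'].apply(lambda cell: {m['type']['Name']: m.get('value', 1) for m in cell})
keys = set(chain(*df['q_'].apply(lambda cell: cell.keys())))
for key in keys:
    df['q_{}'.format(key).lower()] = df['q_'].apply(lambda cell: cell.get(key, numpy.nan))
df.drop(['stage', 'quality', 'q_'], axis=1, inplace=True)
print(df)  # one row: eventId 2, sId 6833, stage.value 1, stage.name FirstStage, q_color 124, q_length Short, q_printed 1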
