I have a pandas DataFrame that I would like to convert to JSON for a source system that requires a very specific JSON format.
I can't seem to reach the exact format shown in the expected output section using simple dictionary loops.
Is there any way I can convert a CSV/pd.DataFrame to nested JSON?
Is there a Python package specifically built for this?
Input Dataframe:
import pandas as pd

# Create input DataFrame
data = {
'col6':['A','A','A','B','B','B'],
'col7':[1, 1, 2, 1, 2, 2],
'col8':['A','A','A','B','B','B'],
'col10':['A','A','A','B','B','B'],
'col14':[1,1,1,1,1,2],
'col15':[1,2,1,1,1,1],
'col16':[9,10,26,9,12,4],
'col18':[1,1,2,1,2,3],
'col1':['xxxx','xxxx','xxxx','xxxx','xxxx','xxxx'],
'col2':[2.02011E+13,2.02011E+13,2.02011E+13,2.02011E+13,2.02011E+13,2.02011E+13],
'col3':['xxxx20201107023012','xxxx20201107023012','xxxx20201107023012','xxxx20201107023012','xxxx20201107023012','xxxx20201107023012'],
'col4':['yyyy','yyyy','yyyy','yyyy','yyyy','yyyy'],
'col5':[0,0,0,0,0,0],
'col9':['A','A','A','B','B','B'],
'col11':[0,0,0,0,0,0],
'col12':[0,0,0,0,0,0],
'col13':[0,0,0,0,0,0],
'col17':[51,63,47,59,53,56]
}
df = pd.DataFrame(data)
Expected Output:
{
    "header1": {
        "col1": "xxxx",
        "col2": "20201107023012",
        "col3": "xxxx20201107023012",
        "col4": "yyyy",
        "col5": "0"
    },
    "header2": {
        "header3": [
            {
                "col6": "A",
                "col7": 1,
                "header4": [
                    {
                        "col8": "A",
                        "col9": 1,
                        "col10": "A",
                        "col11": 0,
                        "col12": 0,
                        "col13": 0,
                        "header5": [
                            {
                                "col14": "1",
                                "col15": 1,
                                "col16": 1,
                                "col17": 51,
                                "col18": 1
                            },
                            {
                                "col14": "1",
                                "col15": 1,
                                "col16": 2,
                                "col17": 63,
                                "col18": 2
                            }
                        ]
                    },
                    {
                        "col8": "A",
                        "col9": 1,
                        "col10": "A",
                        "col11": 0,
                        "col12": 0,
                        "col13": 0,
                        "header5": [
                            {
                                "col14": "1",
                                "col15": 1,
                                "col16": 1,
                                "col17": 51,
                                "col18": 1
                            },
                            {
                                "col14": "1",
                                "col15": 1,
                                "col16": 2,
                                "col17": 63,
                                "col18": 2
                            }
                        ]
                    }
                ]
            }
        ]
    }
}
Maybe this will get you started. I'm not aware of an existing Python module that does exactly what you want, but this is how I'd start it, making some assumptions based on what you've provided.
Since each successive nesting level is based on some criteria, you'll need to loop through filtered dataframes. Depending on the size of your dataframes, groupby may be a better option than what I have here, but the idea is the same. Also, you'll have to create your key/value pairs correctly; this just creates the data to support what you are building.
# assume header1 is constant, so take the first row and select the relevant columns to create the dictionary
header1 = dict(df.iloc[0][['col1', 'col2', 'col3', 'col4', 'col5']])
print('header1', header1)

# for header3, it looks like you need the unique combinations, so create a dataframe
# and then iterate through it to get all the header3 dictionaries
header3_dicts = []
dfh3 = df[['col6', 'col7']].drop_duplicates().reset_index(drop=True)
for i in range(dfh3.shape[0]):
    header3_dicts.append(dict(dfh3.iloc[i][['col6', 'col7']]))
print('header3', header3_dicts)

# iterate over the header3 combinations to get header4
for i in range(dfh3.shape[0]):
    dfh4 = df.loc[(df['col6'] == dfh3.iat[i, 0]) & (df['col7'] == dfh3.iat[i, 1])]
    header4_dicts = []
    for j in range(dfh4.shape[0]):
        # note: index into the filtered dfh4, not the original df
        header4_dicts.append(dict(dfh4.iloc[j][['col8', 'col9', 'col10', 'col11', 'col12', 'col13']]))
    print('header4', header4_dicts)

# next level: repeat similar to above
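If you want to build the whole nested structure in one pass, a groupby-based sketch along these lines may work. The header names and grouping keys are assumptions based on the expected output above, so adjust them to your real schema:

import json

def build_nested(df):
    # header1: constant columns, taken from the first row
    out = {'header1': df.iloc[0][['col1', 'col2', 'col3', 'col4', 'col5']].astype(str).to_dict()}

    header3 = []
    # header3: one entry per unique (col6, col7) combination
    for (c6, c7), g34 in df.groupby(['col6', 'col7'], sort=False):
        header4 = []
        # header4: one entry per unique (col8..col13) combination within that group
        for keys, g5 in g34.groupby(['col8', 'col9', 'col10', 'col11', 'col12', 'col13'], sort=False):
            entry = dict(zip(['col8', 'col9', 'col10', 'col11', 'col12', 'col13'], keys))
            # header5: the remaining detail rows for this combination
            entry['header5'] = g5[['col14', 'col15', 'col16', 'col17', 'col18']].to_dict(orient='records')
            header4.append(entry)
        header3.append({'col6': c6, 'col7': c7, 'header4': header4})

    out['header2'] = {'header3': header3}
    return out

# default=str handles numpy integer types during JSON serialization
print(json.dumps(build_nested(df), indent=2, default=str))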
I'm parsing some XML data, doing some logic on it, and trying to display the results in an HTML table. The dictionary, after filling, looks like this:
{
"general_info": {
"name": "xxx",
"description": "xxx",
"language": "xxx",
"prefix": "xxx",
"version": "xxx"
},
"element_count": {
"folders": 23,
"conditions": 72,
"listeners": 1,
"outputs": 47
},
"external_resource_count": {
"total": 9,
"extensions": {
"jar": 8,
"json": 1
},
"paths": {
"/lib": 9
}
},
"complexity": {
"over_1_transition": {
"number": 4,
"percentage": 30.769
},
"over_1_trigger": {
"number": 2,
"percentage": 15.385
},
"over_1_output": {
"number": 4,
"percentage": 30.769
}
}
}
Then I'm using pandas to convert the dictionary into a table, like so:
data_frame = pandas.DataFrame.from_dict(data=extracted_metrics, orient='index').stack().to_frame()
The result is a table that is mostly correct:
While the first and second levels seem to render correctly, those categories with a sub-sub category get written as a string in the cell, rather than as a further column. I've also tried using stack(level=1) but it raises an error "IndexError: Too many levels: Index has only 1 level, not 2". I've also tried making it into a series with no luck. It seems like it only renders "complete" columns. Is there a way of filling up the empty spaces in the dictionary before processing?
How can I get, for example, external_resource_count -> extensions to have two daughter rows jar and json, with an additional column for the values, so that the final table looks like this:
Extra credit if anyone can tell me how to get rid of the first row with the index numbers. Thanks!
The way you load the dataframe is correct, but you should rename the 0 column to some meaningful column name.
# this function extracts the given keys from the nested dicts in each column
def explode_and_filter(df, filterdict):
    return [df[col].apply(lambda x: x.get(k) if type(x) == dict else x).rename(f'{k}')
            for col, nested in filterdict.items()
            for k in nested]

data_frame = pd.DataFrame.from_dict(data=extracted_metrics, orient='index').stack().to_frame(name='somecol')

# let's separate the rows where a dict is present and explode only those rows
mask = data_frame.somecol.apply(lambda x: type(x) == dict)
expp = explode_and_filter(data_frame[mask],
                          {'somecol': ['jar', 'json', '/lib', 'number', 'percentage']})

# here we concat the exploded series into a frame
exploded_df = pd.concat(expp, axis=1).stack().to_frame(name='somecol2') \
    .reset_index(level=2).rename(columns={'level_2': 'somecol'})

# and now we concat the rows with dict elements with the rows with non-dict elements
out = pd.concat([data_frame[~mask], exploded_df])
The output dataframe looks like this
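As an alternative sketch (assuming the same extracted_metrics dict and at most three levels of nesting, as in the example), pd.json_normalize can flatten the nested dict in one call, and the flat columns can then be reshaped into one row per leaf key:

import pandas as pd

# flatten nested keys into dotted column names, e.g. "external_resource_count.extensions.jar"
flat = pd.json_normalize(extracted_metrics, sep='.')

# reshape to one row per leaf key and split the dotted path into level columns
table = flat.melt(var_name='path', value_name='value')
table[['level_1', 'level_2', 'level_3']] = table['path'].str.split('.', expand=True)
print(table[['level_1', 'level_2', 'level_3', 'value']])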
I have to check the JSON data against the comma-separated e_codes in the table.
How do I filter only the data where the users' e_codes are available?
In the database:
id email age e_codes
1. abc#gmail 19 123456,234567,345678
2. xyz#gmail 31 234567,345678,456789
This is my JSON data
[
    {
        "ct": 1,
        "e_code": 123456
    },
    {
        "ct": 2,
        "e_code": 234567
    },
    {
        "ct": 3,
        "e_code": 345678
    },
    {
        "ct": 4,
        "e_code": 456789
    },
    {
        "ct": 5,
        "e_code": 456710
    }
]
If efficiency is not an issue, you could loop through the table, split the values into a list using case['e_codes'].split(','), and then, for each code, loop through the JSON to see whether it is present.
This might be a little inefficient if your table, JSON, or number of values is large.
It might be better to first create a lookup dictionary in which the codes are the keys:
lookup = {}
for e in my_json:
    # store the codes as strings, since the table values come from splitting a string
    lookup[str(e['e_code'])] = 1
You can then check how many of the codes in your table are actually in the JSON:
## Let's assume that the "e_codes" cell of the
## current line is data['e_codes'][i], where i is the line number
for i in lines:
    match = [0, 0]
    for code in data['e_codes'][i].split(','):
        try:
            match[0] += lookup[code]
            match[1] += 1
        except KeyError:
            match[1] += 1
    if match[1] > 0:
        share_present = match[0] / match[1]
For each case, you get a share_present value, which is 1.0 if all codes appear in the JSON, 0.0 if none of them do, and some value in between to indicate the share of codes that were present. Depending on your threshold for keeping a case, you can set a filter to True or False based on this value.
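Here is a minimal, self-contained sketch of the same idea using pandas; the column and variable names are assumptions based on the question:

import pandas as pd

# example table and JSON from the question
df = pd.DataFrame({
    'id': [1, 2],
    'email': ['abc#gmail', 'xyz#gmail'],
    'age': [19, 31],
    'e_codes': ['123456,234567,345678', '234567,345678,456789'],
})
my_json = [
    {"ct": 1, "e_code": 123456},
    {"ct": 2, "e_code": 234567},
    {"ct": 3, "e_code": 345678},
    {"ct": 4, "e_code": 456789},
    {"ct": 5, "e_code": 456710},
]

# set of codes present in the JSON, cast to str to match the split table values
known = {str(e['e_code']) for e in my_json}

def share_present(codes):
    codes = codes.split(',')
    return sum(c in known for c in codes) / len(codes)

df['share_present'] = df['e_codes'].apply(share_present)
# keep only the rows where every code appears in the JSON
print(df[df['share_present'] == 1.0])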
I'm trying to parse nested json results.
data = {
"results": [
{
"components": [
{
"times": {
"periods": [
{
"fromDayOfWeek": 0,
"fromHour": 12,
"fromMinute": 0,
"toDayOfWeek": 4,
"toHour": 21,
"toMinute": 0,
"id": 156589,
"periodId": 20855
}
],
}
}
],
}
],
}
I can get to and create dataframes for "results" and "components" lists, but cannot get to "periods" due to the "times" dict. So far I have this:
df = pd.json_normalize(data, record_path = ['results','components'])
Need a separate "periods" dataframe with the included column names and values. Would appreciate your help on this. Thank you!
level 1: results
level 2:   components
level 3:     times
level 4:       periods
pandas.json_normalize should be the correct way:
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.json_normalize.html
There are 4 levels of nesting. There can be x components in results and y times in components; however, that type of nesting seems like over-engineering?
The simplest way of getting at the data is chained indexing:
print(data['a']['b']['c']['d'])
In your case, since 'results' and 'components' are lists, you also need indices:
print(data['results'][0]['components'][0]['times']['periods'])
You can access a specific property inside periods by walking the nested lists:
def GetPropertyFromPeriods(prop):
    propertyList = []
    # 'results' and 'components' are lists, so iterate over them explicitly
    for result in data['results']:
        for component in result['components']:
            for period in component['times']['periods']:
                propertyList.append(period[prop])
    return propertyList
This gives you access to one property inside periods (fromDayOfWeek, fromHour, fromMinute, ...).
After converting the JSON values, transform them into a pandas dataframe:
print(pd.DataFrame(data, columns=["columnA", "columnB"]))
If stuck:
How to Create a table with data from JSON output in Python
Python - How to convert JSON File to Dataframe
pandas documentation:
pandas.DataFrame.from_dict
pandas.json_normalize
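For this particular structure, a minimal record_path sketch (assuming the data dict from the question) could look like this; json_normalize follows each key in the path, stepping through the intermediate lists and the times dict for you:

import pandas as pd

# one row per period, with columns fromDayOfWeek, fromHour, ..., periodId
periods_df = pd.json_normalize(data, record_path=['results', 'components', 'times', 'periods'])
print(periods_df)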
I am trying to store a Python Pandas DataFrame as a Parquet file, but I am experiencing some issues. One of the columns of my Pandas DF contains dictionaries as such:
import pandas as pd
df = pd.DataFrame({
"ColA": [1, 2, 3],
"ColB": ["X", "Y", "Z"],
"ColC": [
{ "Field": "Value" },
{ "Field": "Value2" },
{ "Field": "Value3" }
]
})
df.to_parquet("test.parquet")
Now, that works perfectly fine. The problem is when one of the nested values of the dictionary has a different type than the rest. For instance:
import pandas as pd
df = pd.DataFrame({
"ColA": [1, 2, 3],
"ColB": ["X", "Y", "Z"],
"ColC": [
{ "Field": "Value" },
{ "Field": "Value2" },
{ "Field": ["Value3"] }
]
})
df.to_parquet("test.parquet")
This throws the following error:
ArrowInvalid: ('cannot mix list and non-list, non-null values', 'Conversion failed for column ColC with type object')
Notice how, for the last row of the DF, the Field property of the ColC dictionary is a list instead of a string.
Is there any workaround to be able to store this DF as a Parquet file?
ColC is a UDT (user defined type) with one field called Field of type Union of String, List of String.
In theory arrow supports it, but in practice it has a hard time figuring out what the type of ColC is. Even if you were providing the schema of your data frame explicitly, it wouldn't work because this type of conversion (converting unions from pandas to arrow/parquet) isn't supported yet.
import pandas as pd
import pyarrow as pa

union_type = pa.union(
[pa.field("0",pa.string()), pa.field("1", pa.list_(pa.string()))],
'dense'
)
col_c_type = pa.struct(
[
pa.field('Field', union_type)
]
)
schema=pa.schema(
[
pa.field('ColA', pa.int32()),
pa.field('ColB', pa.string()),
pa.field('ColC', col_c_type),
]
)
df = pd.DataFrame({
"ColA": [1, 2, 3],
"ColB": ["X", "Y", "Z"],
"ColC": [
{ "Field": "Value" },
{ "Field": "Value2" },
{ "Field": ["Value3"] }
]
})
pa.Table.from_pandas(df, schema)
This gives you this error:
('Sequence converter for type union[dense]<0: string=0, 1: list<item: string>=1> not implemented', 'Conversion failed for column ColC with type object')
Even if you create the arrow table manually, it won't be able to convert it to parquet (again, unions are not supported).
import io
import pyarrow.parquet as pq
col_a = pa.array([1, 2, 3], pa.int32())
col_b = pa.array(["X", "Y", "Z"], pa.string())
xs = pa.array(["Value", "Value2", None], type=pa.string())
ys = pa.array([None, None, ["value3"]], type=pa.list_(pa.string()))
types = pa.array([0, 0, 1], type=pa.int8())
col_c = pa.UnionArray.from_sparse(types, [xs, ys])
table = pa.Table.from_arrays(
[col_a, col_b, col_c],
schema=pa.schema([
pa.field('ColA', col_a.type),
pa.field('ColB', col_b.type),
pa.field('ColC', col_c.type),
])
)
with io.BytesIO() as buffer:
    pq.write_table(table, buffer)
Unhandled type for Arrow to Parquet schema conversion: sparse_union<0: string=0, 1: list<item: string>=1>
I think your only option for now is to use a struct whose fields have different names for the string value and the list-of-strings value.
df = pd.DataFrame({
"ColA": [1, 2, 3],
"ColB": ["X", "Y", "Z"],
"ColC": [
{ "Field1": "Value" },
{ "Field1": "Value2" },
{ "Field2": ["Value3"] }
]
})
df.to_parquet('/tmp/hello')
I just had the same problem and fixed it by converting ColC to string:
df['ColC'] = df['ColC'].astype(str)
I am not sure whether this could create a problem down the line, so don't quote me on it.
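A variation on that idea, as a sketch: serialize ColC with json.dumps instead of astype(str), so the values can be turned back into Python objects with json.loads after reading the file:

import json
import pandas as pd

df = pd.DataFrame({
    "ColA": [1, 2, 3],
    "ColB": ["X", "Y", "Z"],
    "ColC": [
        {"Field": "Value"},
        {"Field": "Value2"},
        {"Field": ["Value3"]},
    ],
})

# store the mixed-type dicts as JSON strings
df["ColC"] = df["ColC"].apply(json.dumps)
df.to_parquet("test.parquet")

# restore the original Python objects on read
restored = pd.read_parquet("test.parquet")
restored["ColC"] = restored["ColC"].apply(json.loads)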
I extract data using an API and retrieve a list of servers and backups. Some servers have more than one backup. This is how I get the list of all servers with backup IDs.
import requests

bkplist = requests.get('https://heee.com/1.2/storage/backup')
bkplist_json = bkplist.json()
backup_list = bkplist.json()
backupl = backup_list['storages']['storage']
The JSON looks like this:
{
"storages": {
"storage": [
{
"access": "",
"created": "",
"license": ,
"origin": "01165",
"size": ,
"state": "",
"title": "",
"type": "backup",
"uuid": "01019",
"zone": ""
},
Firstly I create a dictionary to store this data:
backup = {}
for u in backup_list['storages']['storage']:
    srvuuidorg = u['origin']
    backup_uuid = u['uuid']
    backup[srvuuidorg] = backup_uuid
But then I found out there is more than one value for some servers. Since a dictionary can have just one value assigned to a key, I wanted to use some hybrid of a list and a dictionary, but I just can't figure out how to do this in my example.
Servers are nested in storages -> storage, and I need to assign several uuid values (backup IDs) to one origin (server ID).
I know about the collections module, and with a simple example it is quite understandable, but I have a problem with how to use it here, where the data is extracted through an API.
How do I extract origin and assign to that key the other uuid values stored in the JSON?
What's more, it is a massive amount of data, so I cannot add every value manually.
You can do something like this.
from collections import defaultdict

backup = defaultdict(list)
for u in backup_list['storages']['storage']:
    srvuuidorg = u['origin']
    backup_uuid = u['uuid']
    backup[srvuuidorg].append(backup_uuid)
Note that you can simplify your loop like this.
from collections import defaultdict

backup = defaultdict(list)
for u in backup_list['storages']['storage']:
    backup[u['origin']].append(u['uuid'])
But this may be considered less readable.
You could store a list of uuids for each origin key.
I suggest the following two ways:
Creating an empty list when an origin is first accessed, and then appending to it:
backup = {}
for u in backup_list['storages']['storage']:
    srvuuidorg = u['origin']
    backup_uuid = u['uuid']
    if not backup.get(srvuuidorg):
        backup[srvuuidorg] = []
    backup[srvuuidorg].append(backup_uuid)
Using defaultdict collection, which basically does the same for you under the hood:
from collections import defaultdict

backup = defaultdict(list)
for u in backup_list['storages']['storage']:
    srvuuidorg = u['origin']
    backup_uuid = u['uuid']
    backup[srvuuidorg].append(backup_uuid)
It seems to me that the last way is more elegant.
If you need to store a unique collection of uuids, you should use the same approach with a set instead of a list.
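For example, a minimal sketch of the set variant (the same loop as above, with set and .add instead of list and .append):

from collections import defaultdict

backup = defaultdict(set)
for u in backup_list['storages']['storage']:
    backup[u['origin']].add(u['uuid'])  # duplicate backup IDs are discarded automatically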
JSON allows an array to be stored under a key:
var= {
"array": [
{"id": 1, "value": "one"},
{"id": 2, "value": "two"},
{"id": 3, "value": "three"}
]
}
print(var)
{'array': [{'id': 1, 'value': 'one'}, {'id': 2, 'value': 'two'}, {'id': 3, 'value': 'three'}]}
var["array"].append({"id": 4, "value": "new"})
print(var)
{'array': [{'id': 1, 'value': 'one'}, {'id': 2, 'value': 'two'}, {'id': 3, 'value': 'three'}, {'id': 4, 'value': 'new'}]}
You can use a list for multiple values.
dict = {"Greetings": ["hello", "hi"]}