Convert CSV into Dictionary using Python

I have a CSV/Excel file. The sample data is shown below:
+-----------+------------+------------+
| Number | start_date | end_date |
+-----------+------------+------------+
| 987654321 | 2021-07-15 | 2021-08-15 |
| 999999999 | 2021-07-15 | 2021-08-15 |
| 888888888 | 2021-07-15 | 2021-08-15 |
| 777777777 | 2021-07-15 | 2021-09-15 |
+-----------+------------+------------+
I need to convert it into dictionaries (JSON) with some conditions applied, and then pass those dictionaries on as DB rows. This means the CSV can produce n dictionaries.
Conditions to be applied:
Numbers that have the same start date and end date should go into the same dictionary.
All numbers in a dictionary should be concatenated into a single comma-separated string.
Expected dictionaries from the above input:
dict1 = {
    "request": [
        {
            "key": "AMI_LIST",
            "value": "987654321,999999999,888888888"
        },
        {
            "key": "START_DATE",
            "value": "2021-07-15"
        },
        {
            "key": "END_DATE",
            "value": "2021-08-15"
        }
    ]
}
dict2 = {
    "request": [
        {
            "key": "AMI_LIST",
            "value": "777777777"
        },
        {
            "key": "START_DATE",
            "value": "2021-07-15"
        },
        {
            "key": "END_DATE",
            "value": "2021-09-15"
        }
    ]
}
All these dictionaries will be stored as model objects and then passed on to the DB. I am not creating separate variables for each dict; they will be handled in a loop. dict1 and dict2 are just notation I am using to explain the expected output.
NOTE: The maximum rows in a file will be 500 only.
I have tried using a for loop, but that increases the complexity. Is there any other way to approach this problem?
Thanks in advance for your help.

Yep, pandas is a really good option; you can do something like this:
import pandas as pd

df = pd.read_csv("table.csv")
dfgrp = df.groupby(['end_date', 'start_date'], as_index=False).agg({"Number": list})
dfgrp.to_json()
which gives you:
{
    "end_date": {"0": "2021-08-15", "1": "2021-09-15"},
    "start_date": {"0": "2021-07-15", "1": "2021-07-15"},
    "Number": {"0": [987654321, 999999999, 888888888], "1": [777777777]}
}
And you're almost there!
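To finish the job, here is a minimal sketch of that last step, assuming the column names from the sample table (Number, start_date, end_date):
import pandas as pd

df = pd.read_csv("table.csv")
# group rows that share the same dates and collect their numbers
grouped = df.groupby(["start_date", "end_date"], as_index=False).agg({"Number": list})

# build one request dictionary per (start_date, end_date) group
payloads = []
for _, row in grouped.iterrows():
    payloads.append({
        "request": [
            {"key": "AMI_LIST", "value": ",".join(str(n) for n in row["Number"])},
            {"key": "START_DATE", "value": row["start_date"]},
            {"key": "END_DATE", "value": row["end_date"]},
        ]
    })
With at most 500 rows, the groupby plus one pass over the groups keeps the complexity to a single loop over the groups.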

How about using pandas to do the job?
It should be capable of reading your Excel files in as a DataFrame (via pandas.read_excel), and then you can apply your conditions and use DataFrame.to_json to export your DataFrames to JSON files.
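For the Excel variant, a minimal sketch (the file name is assumed; .xlsx files need the openpyxl engine installed):
import pandas as pd

df = pd.read_excel("table.xlsx")
print(df.to_json(orient="records"))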

Related

merging list with nested list

I need to 'cross join' (for want of a better term!) 2 lists.
Between them they represent a tabular dataset: one holds the column header names, the other a nested array with the row values.
I've managed the easy bit:
col_names = [i['name'] for i in c]
which strips the column names out into a list without 'typeName'.
But just thinking how to extract the row field values and map them to the column names is giving me a headache!
Any pointers appreciated ;)
Thanks
Columns (as provided):
[
    {
        "name": "col1",
        "typeName": "varchar"
    },
    {
        "name": "col2",
        "typeName": "int4"
    }
]
Records (as provided):
[
    [
        {
            "stringValue": "apples"
        },
        {
            "longValue": 1
        }
    ],
    [
        {
            "stringValue": "bananas"
        },
        {
            "longValue": 2
        }
    ]
]
Required Result:
[
    {
        'col1': 'apples',
        'col2': 1
    },
    {
        'col1': 'bananas',
        'col2': 2
    }
]
You have to be able to assume there is a 1-to-1 correspondence between the names in the schema and the dicts in the records. Once you assume that, it's pretty easy:
names = [i['name'] for i in schema]

data = []
for row in records:
    d = {}
    # pair each column name with the single value inside each record dict
    for a, b in zip(names, row):
        d[a] = list(b.values())[0]
    data.append(d)
print(data)
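If you prefer it compact, the same zip-based mapping can be written as a nested comprehension; this is just a restatement of the loop above under the same 1-to-1 assumption:
data = [
    {name: next(iter(cell.values())) for name, cell in zip(names, row)}
    for row in records
]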

Multi-level Python Dict to Pandas DataFrame only processes one level out of many

I'm parsing some XML data, doing some logic on it, and trying to display the results in an HTML table. The dictionary, after filling, looks like this:
{
    "general_info": {
        "name": "xxx",
        "description": "xxx",
        "language": "xxx",
        "prefix": "xxx",
        "version": "xxx"
    },
    "element_count": {
        "folders": 23,
        "conditions": 72,
        "listeners": 1,
        "outputs": 47
    },
    "external_resource_count": {
        "total": 9,
        "extensions": {
            "jar": 8,
            "json": 1
        },
        "paths": {
            "/lib": 9
        }
    },
    "complexity": {
        "over_1_transition": {
            "number": 4,
            "percentage": 30.769
        },
        "over_1_trigger": {
            "number": 2,
            "percentage": 15.385
        },
        "over_1_output": {
            "number": 4,
            "percentage": 30.769
        }
    }
}
Then I'm using pandas to convert the dictionary into a table, like so:
data_frame = pandas.DataFrame.from_dict(data=extracted_metrics, orient='index').stack().to_frame()
The result is a table that is mostly correct: the first and second levels render as expected, but categories with a sub-sub-category get written as a string in the cell rather than as a further column. I've also tried using stack(level=1), but it raises "IndexError: Too many levels: Index has only 1 level, not 2". I've also tried making it into a series, with no luck. It seems like it only renders "complete" columns. Is there a way of filling up the empty spaces in the dictionary before processing?
How can I get, for example, external_resource_count -> extensions to have two daughter rows jar and json, with an additional column for the values?
Extra credit if anyone can tell me how to get rid of the first row with the index numbers. Thanks!
The way you load the dataframe is correct, but you should rename the 0 column to some meaningful name.
import pandas as pd

# this function extracts all the requested keys from your nested dicts
def explode_and_filter(df, filterdict):
    return [df[col].apply(lambda x: x.get(k) if type(x) == dict else x).rename(f'{k}')
            for col, nested in filterdict.items()
            for k in nested]

data_frame = pd.DataFrame.from_dict(data=extracted_metrics, orient='index').stack().to_frame(name='somecol')

# let's separate the rows where a dict is present & explode only those rows
mask = data_frame.somecol.apply(lambda x: type(x) == dict)
expp = explode_and_filter(data_frame[mask],
                          {'somecol': ['jar', 'json', '/lib', 'number', 'percentage']})

# here we concat the exploded series to a frame
exploded_df = pd.concat(expp, axis=1).stack().to_frame(name='somecol2') \
    .reset_index(level=2).rename(columns={'level_2': 'somecol'})

# and now we concat the rows with dict elements with the rows with non-dict elements
out = pd.concat([data_frame[~mask], exploded_df])
The output dataframe then combines the non-dict rows with the exploded rows, one leaf value per row.
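As an alternative to the manual explode, here is a hedged sketch using pandas.json_normalize, which flattens every nesting level into dotted column names (the extracted_metrics name is taken from the question):
import pandas as pd

# one-row frame with columns like 'external_resource_count.extensions.jar'
flat = pd.json_normalize(extracted_metrics, sep='.')
# transpose to get one metric per row, and name the value column
table = flat.T.rename(columns={0: 'value'})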

Create nested dictionary from Pandas DataFrame

I have a requirement to create a nested dictionary from a Pandas DataFrame.
Below is an example dataset in CSV format:
hostname,nic,vlan,status
server1,eth0,100,enabled
server1,eth2,200,enabled
server2,eth0,100
server2,eth1,100,enabled
server2,eth2,200
server1,eth1,100,disabled
Once the CSV is imported as a DataFrame I have:
>>> import pandas as pd
>>>
>>> df = pd.read_csv('test.csv')
>>>
>>> df
hostname nic vlan status
0 server1 eth0 100 enabled
1 server1 eth2 200 enabled
2 server2 eth0 100 NaN
3 server2 eth1 100 enabled
4 server2 eth2 200 NaN
5 server1 eth1 100 disabled
The output nested dictionary/JSON needs to group by the first two columns (hostname and nic), for example:
{
    "hostname": {
        "server1": {
            "nic": {
                "eth0": {
                    "vlan": 100,
                    "status": "enabled"
                },
                "eth1": {
                    "vlan": 100,
                    "status": "disabled"
                },
                "eth2": {
                    "vlan": 200,
                    "status": "enabled"
                }
            }
        },
        "server2": {
            "nic": {
                "eth0": {
                    "vlan": 100
                },
                "eth1": {
                    "vlan": 100,
                    "status": "enabled"
                },
                "eth2": {
                    "vlan": 200
                }
            }
        }
    }
}
I need to account for:
Missing data: for example, not all rows will include 'status'. If this happens we just skip it in the output dictionary.
Hostnames in the first column may be listed out of order. For example, rows 0, 1 and 5 must be correctly grouped under server1 in the output dictionary.
Extra columns beyond vlan and status may be added in future. These must be correctly grouped under hostname and nic.
I have looked at groupby and MultiIndex in the Pandas documentation, but as a newcomer I have got stuck.
Any help is appreciated on the best method to achieve this.
It may help to group the df first: df_new = df.groupby(["hostname", "nic"], as_index=False) - note that as_index=False preserves the dataframe format.
You can then use df_new.to_json(orient='records', lines=True) to convert your df to JSON format (as jtweeder mentions in the comments). Once you get the desired format and would like to write it out, you can do something like:
with open('temp.json', 'w') as f:
    f.write(df_new.to_json(orient='records', lines=True))
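If the goal is the exact nested structure shown in the question, a minimal sketch that builds it directly could look like this; it skips missing values such as an absent status and picks up any future columns automatically:
import pandas as pd

df = pd.read_csv('test.csv')

result = {"hostname": {}}
for (host, nic), group in df.groupby(["hostname", "nic"]):
    row = group.iloc[0]
    # keep every column beyond hostname/nic, skipping missing values
    entry = {col: row[col] for col in df.columns
             if col not in ("hostname", "nic") and pd.notna(row[col])}
    result["hostname"].setdefault(host, {"nic": {}})["nic"][nic] = entry
groupby also sorts the group keys, so hostnames listed out of order in the CSV still end up grouped correctly.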

Extracting data from nested json arrays in python

I am having trouble extracting data from nested JSON in Python. I want to create a one-column pandas dataframe of all the values of "bill", e.g.
bill
----
a1
a2
a3
Using the output from an API formatted like this:
{
    "status": "succeeded",
    "travels": [
        {
            "jobs": [
                {
                    "bill": "a1"
                },
                {
                    "bill": "a2"
                },
                {
                    "bill": "a3"
                }
            ],
            "vehicle": {
                "plate": "xyz123"
            }
        }
    ]
}
Loading the JSON directly into pandas gives me only the first instance of 'bill'. I have tried json_normalize() on 'jobs', but it raises a KeyError. Can anybody help me figure out how to grab just the 'bill' values?
Thanks
I think you were on the right track with json_normalize. With your input as a Python dictionary d:
import pandas as pd

pd.json_normalize(d, record_path=['travels', 'jobs'])
  bill
0   a1
1   a2
2   a3
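If you only need the one column, a plain comprehension over the parsed response works too (again assuming the dictionary is loaded into d):
import pandas as pd

bills = [job["bill"] for travel in d["travels"] for job in travel["jobs"]]
df = pd.DataFrame({"bill": bills})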

Is there a generic way to read multiline JSON in Spark, more specifically PySpark?

I have a multiline JSON like this:
{ "_id" : { "$oid" : "50b59cd75bed76f46522c34e" }, "student_id" : 0, "class_id" : 2, "scores" : [ { "type" : "exam", "score" : 57.92947112575566 }, { "type" : "quiz", "score" : 21.24542588206755 }, { "type" : "homework", "score" : 68.19567810587429 }, { "type" : "homework", "score" : 67.95019716560351 }, { "type" : "homework", "score" : 18.81037253352722 } ] }
This is just one line from the JSON, and there are other files too. I am looking for a method to read this file in PySpark/Spark. Can it be independent of the JSON format?
I need the output in the form of "scores" as individual columns: scores_exam should be one column with value 57.92947112575566, scores_quiz another column with value 21.24542588206755.
Any help is appreciated.
Yes, use the multiline option:
from pyspark.sql.functions import explode, col

df = spark.read.option("multiline", "true").json("multi.json")
You get the output below:
+--------------------------+--------+--------------------------------------------------------------------------------------------------------------------------------------------------+----------+
|_id |class_id|scores |student_id|
+--------------------------+--------+--------------------------------------------------------------------------------------------------------------------------------------------------+----------+
|[50b59cd75bed76f46522c34e]|2 |[[57.92947112575566, exam], [21.24542588206755, quiz], [68.1956781058743, homework], [67.95019716560351, homework], [18.81037253352722, homework]]|0 |
+--------------------------+--------+--------------------------------------------------------------------------------------------------------------------------------------------------+----------+
Add these lines to get:
df2 = df.withColumn("scores", explode(col("scores"))) \
    .select(col("_id.*"), col("class_id"), col("scores.*"), col("student_id"))
+------------------------+--------+-----------------+--------+----------+
|$oid |class_id|score |type |student_id|
+------------------------+--------+-----------------+--------+----------+
|50b59cd75bed76f46522c34e|2 |57.92947112575566|exam |0 |
|50b59cd75bed76f46522c34e|2 |21.24542588206755|quiz |0 |
|50b59cd75bed76f46522c34e|2 |68.1956781058743 |homework|0 |
|50b59cd75bed76f46522c34e|2 |67.95019716560351|homework|0 |
|50b59cd75bed76f46522c34e|2 |18.81037253352722|homework|0 |
+------------------------+--------+-----------------+--------+----------+
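To get the score types as individual columns, as the question asks (scores_exam and so on), one more step with pivot can be sketched; note the sample record has three homework scores, so some aggregate has to be chosen (first here, though avg may fit better):
from pyspark.sql.functions import first

scores_wide = df2.groupBy("student_id", "class_id") \
    .pivot("type") \
    .agg(first("score"))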
Note that we are using the "col" and "explode" functions from Spark; hence, you need the following import in order for these functions to work:
from pyspark.sql.functions import explode, col
You can read more on how to parse a JSON file with the multiline option on the page below:
https://docs.databricks.com/spark/latest/data-sources/read-json.html
Thanks
