I am reading a data table from an API that returns the data in JSON format, and one of the columns is itself a JSON string. I can create a Pandas dataframe for the overall table, but in the process of reading it, the double quotes in the JSON string get converted to single quotes, and I can't parse the nested JSON.
I can't provide a reproducible example, but here is the key code:
import requests
import pandas as pd

myResult = requests.get(myURL, headers=myHeaders).text
myDF = pd.read_json(myResult, orient="records", dtype={"custom": str}, encoding="unicode_escape")
where custom is the nested JSON string column. No matter how I set the dtype and encoding arguments, I cannot force Pandas to preserve the double quotes in the string.
So what started off as:
"custom": {"Field1":"Value1","Field2":"Value2"}
gets into the dataframe as:
{'Field1':'Value1','Field2':'Value2'}
I found this question which suggests using a custom parser for read_csv - but I can't see that this option is available for read_json.
I found a few suggestions here, but the only one I could try was manually replacing the single quotes with double quotes - and this causes fresh errors because there are apostrophes contained within the nested field values themselves...
The JSON strings are formatted correctly within myResult so it's the parsing applied by read_json that's the problem. Is there any way to change that or do I need to find some other way of reading this in?
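One workaround (a minimal sketch, assuming the API returns a JSON array of records and that custom is the nested key; myURL and myHeaders are the placeholders from the question): parse the response yourself with json.loads, then re-serialise the nested field with json.dumps so it stays a double-quoted JSON string instead of the repr of a Python dict.

import json
import requests
import pandas as pd

records = json.loads(requests.get(myURL, headers=myHeaders).text)
for rec in records:
    # if the nested field arrived as an object, keep it as a JSON string instead
    if isinstance(rec.get("custom"), dict):
        rec["custom"] = json.dumps(rec["custom"])
myDF = pd.DataFrame(records)
# the column can later be parsed row by row with myDF["custom"].apply(json.loads)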
Related
I have a CSV file which contains many fields and strings.
Here is that CSV file:
https://drive.google.com/file/d/1GVLvTqRv80Gfg1fvBYNx0z5xIiiLCHUo/view?usp=sharing
I want the proper JSON format. When I try with the normal Python functions, I get some fields in the form below:
But I want "population_mesh" and other similar fields as arrays (not strings).
One more thing: I also want the "population", "intervention", and "Outcomes" fields in array form.
Below is the sample target JSON file
https://drive.google.com/file/d/1Y7oGVyOG777APVqOsLBxc9eSd38pDMbc/view?usp=sharing
I did try converting "population" and other similar fields to arrays of strings, but some strings contain quotes that get treated as part of the JSON format, so escape characters have to be added to match the target JSON exactly.
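One possible approach (a sketch under assumptions: the list-like fields are stored as delimited strings in the CSV, and ";" is a guessed delimiter since the actual file contents aren't reproduced here) is to split those columns into real Python lists before dumping to JSON, and let the json module handle the quote escaping:

import json
import pandas as pd

df = pd.read_csv("input.csv")  # placeholder filename

array_fields = ["population_mesh", "population", "intervention", "Outcomes"]
for col in array_fields:
    # turn the delimited string into a list so it becomes a JSON array
    df[col] = df[col].apply(lambda s: [part.strip() for part in str(s).split(";")])

# json.dump escapes any embedded quotes automatically
with open("output.json", "w") as f:
    json.dump(df.to_dict(orient="records"), f, indent=2)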
I have what appears to be a very simple JSON dict I need to convert into a Pandas dataframe. The dict is being pulled in for me as a string which I have little control over.
{
"data": "[{'key1':'value1'}]"
}
I have tried the usual methods such as pd.read_json() and json_normalize() etc., but can't seem to get anywhere close. Does anyone have a few different suggestions to try? I think I've seen every error message Python has at this stage.
It seems to me that your JSON data is improperly formatted. The double quotation marks around the brackets indicate that everything within them is a string, so the data is treated as a string and not as an array of values. Remove the double quotes to create an array in your JSON file.
{
"data": [{"key1":"value1"}]
}
This will create the array and allow your JSON to be properly parsed using your previously stated methods.
The example provided has a single key, but in general you can use pandas to load JSON and nested JSON with pd.json_normalize(yourjsonhere).
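If the source can't be changed and the inner value really does arrive as a single-quoted string, here is a sketch of one way to handle it (using the exact payload from the question): parse the outer JSON with json.loads, parse the inner string with ast.literal_eval, which accepts single quotes, and then normalize the result.

import ast
import json
import pandas as pd

raw = '{"data": "[{\'key1\':\'value1\'}]"}'  # payload from the question
outer = json.loads(raw)
inner = ast.literal_eval(outer["data"])      # -> [{'key1': 'value1'}]
df = pd.json_normalize(inner)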
This is the code I'm using. I have also tried converting the datatype of my columns from object to float, but I got this error:
import pandas as pd

df = pd.read_csv('DDOSping.csv')
pearsoncorr = df.corr(method='pearson')
ValueError: could not convert string to float:
'172.27.224.251-172.27.224.250-56003-502-6'
Somewhere in your CSV this string value exists '172.27.224.251-172.27.224.250-56003-502-6'. Do you know why it's there? What does it represent? It looks to me like it shouldn't be in the data you include in your correlation matrix calculation.
The df.corr method is trying to convert the string value to a float, but that's obviously not possible because it's a long string with various characters, not a regular number.
You should clean your CSV of unnecessary data (or make a copy and clean that so you don't lose anything important). Remove anything, like metadata, that isn't the exact data that df.corr needs, including the string in the error message.
If it's just a few values you need to clean, then just open the file in Excel or a text editor to do the cleaning. If it's a lot, and all the irrelevant data to be removed is in specific rows and/or columns, you could just remove them from your DataFrame before calling df.corr instead of cleaning the file itself.
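A minimal sketch of the in-DataFrame route, assuming the string-valued columns simply need to be excluded (the column names aren't shown here, so select_dtypes is used to keep only the numeric ones):

import pandas as pd

df = pd.read_csv('DDOSping.csv')
numeric_df = df.select_dtypes(include='number')  # drops string/object columns like the one in the error
pearsoncorr = numeric_df.corr(method='pearson')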
I have an Excel file which has JSON-like data. I extracted the data from a column of that Excel file and converted it into a dictionary using the .to_dict() function. One cell of that column looks like this, and there are more rows filled with the same kind of data for that column:
{\currentPortfolioId":null/"isNewRTQ":true/"isNewInvestmentTenure":true/"isNearTermVolatility":false/"getPath":true/"riskProfile":"Moderate"/"initialInvestment":200000/"cashflowDate":"01-01-2021"/"currentWealth":200000/"goalPriority":"Wish"/"rebalancing":"yearly"/"goalAmount":2000000/"startDate":"16-06-2021"/"endDate":"01-01-2031"/"isNewGoalPriority":true/"infusions":[0/10000/10000/10000/10000/10000/10000/10000/10000/10000/10000/10000/10000/10000/10000/10000/10000/10000/10000/10000/10000/10000/10000/10000/10000/10000/10000/10000/10000/10000/10000/10000/10000/10000/10000/10000/10000/10000/10000/10000/10000/10000/10000/10000/10000/10000/10000/10000/10000/10000/10000/10000/10000/10000/10000/10000/10000/10000/10000/10000/10000/10000/10000/10000/10000/10000/10000/10000/10000/10000/10000/10000/10000/10000/10000/10000/10000/10000/10000/10000/10000/10000/10000/10000/10000/10000/10000/10000/10000/10000/10000/10000/10000/10000/10000/10000/10000/10000/10000/10000/10000/10000/10000/10000/10000/10000/10000/10000/10000/10000/10000/10000/10000/10000/10000/0]/"scenario_type":"regular"/"infusion_type":"monthly"/"xforwardForValue":"49.228.234.102:49907/ 13.86.190.104:3072/ 172.30.217.148:36243"}"
As you can see, the data is not clean, with special characters like "/" and "\" mixed in.
Can anyone help me in how to clean this data and convert it into a proper dictionary so that I can later do operations in it?
I did try ast.literal_eval() but it doesn't seem to work.
Please help!
It's very similar to JSON. Maybe it was mangled by autocorrect or something like that in Excel? The commas have been replaced with slashes, and every " is prefixed with a backslash. Also, a single " is missing before the first key.
If you add the missing ", strip out all the \ characters, and replace / with ,, you can parse it with json.loads().
data = r'''{\"currentPortfolioId\":null/\"isNewRTQ\":true/\"isNewInvestmentTenure\":true/\"isNearTermVolatility\":false/\"getPath\":true/\"riskProfile\":\"Moderate\"/\"initialInvestment\":200000/\"cashflowDate\":\"01-01-2021\"/\"currentWealth\":200000/\"goalPriority\":\"Wish\"/\"rebalancing\":\"yearly\"/\"goalAmount\":2000000/\"startDate\":\"16-06-2021\"/\"endDate\":\"01-01-2031\"/\"isNewGoalPriority\":true/\"infusions\":[0/10000/10000/10000/10000/10000/10000/10000/10000/10000/10000/10000/10000/10000/10000/10000/10000/10000/10000/10000/10000/10000/10000/10000/10000/10000/10000/10000/10000/10000/10000/10000/10000/10000/10000/10000/10000/10000/10000/10000/10000/10000/10000/10000/10000/10000/10000/10000/10000/10000/10000/10000/10000/10000/10000/10000/10000/10000/10000/10000/10000/10000/10000/10000/10000/10000/10000/10000/10000/10000/10000/10000/10000/10000/10000/10000/10000/10000/10000/10000/10000/10000/10000/10000/10000/10000/10000/10000/10000/10000/10000/10000/10000/10000/10000/10000/10000/10000/10000/10000/10000/10000/10000/10000/10000/10000/10000/10000/10000/10000/10000/10000/10000/10000/10000/0]/\"scenario_type\":\"regular\"/\"infusion_type\":\"monthly\"/\"xforwardForValue\":\"49.228.234.102:49907/ 13.86.190.104:3072/ 172.30.217.148:36243\"}'''
import json
json.loads(data.replace("\\", "").replace("/", ","))
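As a quick sanity check, the cleaned string parses into an ordinary dict; note that replace("/", ",") also turns the "/" separators inside the xforwardForValue value into commas:

cleaned = json.loads(data.replace("\\", "").replace("/", ","))
print(cleaned["riskProfile"])       # Moderate
print(cleaned["xforwardForValue"])  # 49.228.234.102:49907, 13.86.190.104:3072, 172.30.217.148:36243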
In this screenshot, data (a string) and df2 (a pandas DataFrame) store the same data - a timestamp and a value.
How do I get data into a similar DataFrame so I can append its values to df2, so that all the data records and all the df2 records end up in one DataFrame matching the current format of df2?
I can post what I've tried so far, but all I get is errors :(
import ast
import pandas as pd

data = "[[1212.1221, -10.5],[2232.55, -19.44],[32432.87655, -445.88]]"
# ast.literal_eval turns the string into a list of [timestamp, value] pairs
df = pd.DataFrame(ast.literal_eval(data), columns=['index', 'data'])
Looks like your string data is correctly formatted JSON (which, as far as I know, looks exactly like Python dictionaries but is strict about double quotes rather than single quotes). Try:
import json
parsed = json.loads(data)
This will convert your string into a Python object (for this data, a list of lists) from which you can easily create and manipulate DataFrames.
EDIT:
If any of your strings have single quotes, you can remedy this using str.replace("'", "\"") to convert them to double quotes. This will only cause problems if for whatever reason your data has incorrectly paired quotes.
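Putting the two answers together (a sketch; the column names of df2 aren't shown, so ['index', 'data'] from the snippet above are assumptions):

import json
import pandas as pd

data = "[[1212.1221, -10.5],[2232.55, -19.44],[32432.87655, -445.88]]"
parsed = json.loads(data.replace("'", '"'))   # the replace is a no-op here, but handles single-quoted input
new_rows = pd.DataFrame(parsed, columns=['index', 'data'])
# df2 = pd.concat([df2, new_rows], ignore_index=True)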