Convert JSON Dict to Pandas Dataframe

I have what appears to be a very simple JSON dict I need to convert into a Pandas dataframe. The dict is being pulled in for me as a string which I have little control over.
{
"data": "[{'key1':'value1'}]"
}
I have tried the usual methods such as pd.read_json() and json_normalize(), but can't get anywhere close. Does anyone have a few different suggestions to try? I think I've seen every error message Python has at this stage.

It seems to me that your JSON data is improperly formatted. The double quotes around the brackets mean that everything inside them is a string, so the value of "data" is a string and not an array of values. Remove the outer double quotes to create an array in your JSON file, and use double quotes instead of single quotes for the inner key and value, since JSON does not accept single-quoted strings.
{
"data": [{"key1":"value1"}]
}
This creates the array and allows your JSON to be parsed properly with the methods you mentioned.

The example provided has a single key, but in general you can use pandas to load JSON and nested JSON with pd.json_normalize(yourjsonhere), where yourjsonhere is already-parsed data (a dict or a list of dicts).
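If the string really cannot be fixed at the source, one possible workaround is to parse it in two steps. This is only a sketch, assuming the payload looks exactly like the example in the question: the outer text is valid JSON, while the inner value is a Python-style literal with single quotes. The names raw, outer and records are made up for illustration.

```python
import ast
import json

import pandas as pd

# The payload as it arrives: valid JSON outside, a quoted Python-style list inside.
raw = '{"data": "[{\'key1\':\'value1\'}]"}'

# Step 1: parse the outer JSON; the value of "data" is still a plain string.
outer = json.loads(raw)

# Step 2: evaluate the inner string as a Python literal (a list of dicts).
records = ast.literal_eval(outer["data"])

# A list of dicts maps directly onto rows of a dataframe.
df = pd.DataFrame(records)
print(df)
```

ast.literal_eval() only evaluates literals, so it is a safer choice than eval() for text you do not control.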

Related

null as value in dictionary or JSON object

I need to download data from a certain database where data is stored as JSON. A sample of an object that I download looks like this:
data = [
{
'field1': 'value1',
'field2': 'value2',
'field3': null
},
{
'field1': 'value1',
'field2': 'value2',
'field3': 'value3'
}
]
Note that for field3, what I receive in the data JSON object is not a string null, i.e. with single quotes wrapping it, but literally null. This is just the way the database server returns a data request.
When I want to work on data as a Python dictionary (or mashed into data frame, etc.), I get the error:
NameError: name 'null' is not defined
How would I convert all of these null into a string null, just like string value1, value2?
Further details for clarity:
This is how I obtain the object data (by making a request to a database):
import requests
import json
import pandas as pd
req = requests.get(...)
data = req.json()
print(json.dumps(data, indent=4, default=str))
df = pd.DataFrame(data)
The database that I call the requests.get() from returns a JSON object that include null as value. When I print the data out as above, what is displayed is as above (first snippet of code).
print(type(data))
gives <class 'dict'>.
The last line
df = pd.DataFrame(data)
gives the error above.

It is important to be pedantic here to understand the problem. Sorry if it's a bit painful.
The database that I call the requests.get() from returns a JSON object that include null as value.
No, it does not. The database returns bytes, which represent text, which uses the JSON format to represent a JSON object.
A "JSON object" is, at best, a thing that only exists in a JavaScript program. It is better to think of it as an abstraction. We say "object" because that's what JavaScript calls its dictionaries (also called "associative arrays", "mappings" and a few other things, depending on what programming language you are using, or possibly what theoretical background you're leaning on).
When you parse a JSON document (i.e.: a sequence of bytes like what the database returned) in a Python program, normally you can expect to get a native Python data structure, implemented using the built-in data types. So the top-level JSON object will be represented with a Python dict. JSON arrays will be represented with lists. Numbers will be represented with int and float as appropriate. Strings will, unsurprisingly, be str instances. true and false will become True and False, i.e. the two pre-defined Python boolean values.
And null....
When I print the data out as above, what is displayed is as above (first snippet of code).
No, it is not. You will indeed see the display of a Python dictionary, as you show; but the value for the 'field3' key is not rendered as null. Instead, it is rendered as None.
That is because None is a built-in Python object that every reasonable JSON parser (including the one built into requests, and the standard library one) uses to represent a "null" value in JSON.
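A quick way to check this mapping for yourself is to parse a small document with the standard library and look at the resulting types. The JSON text below is invented for illustration; it is not the database's actual response.

```python
import json

# A made-up JSON document covering each kind of JSON value.
text = '{"name": "x", "count": 3, "ratio": 0.5, "tags": ["a", "b"], "ok": true, "field3": null}'

parsed = json.loads(text)

# Print the Python type chosen for each JSON value.
for key, value in parsed.items():
    print(key, type(value).__name__, repr(value))
```

The last line printed shows NoneType and None for field3: the JSON null has become the Python None object.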
When I want to work on data as a Python dictionary (or mashed into data frame, etc.), I get the error:
You will not get this error if data actually comes from parsing JSON. You will get it when you try to hard-code that Python representation of JSON into your program.
That is because the hard-coded representation should not say null; it should say None. Because that is the way that you write code to specify the value that is used in Python to represent a JSON null value.
How would I convert all of these null into a string null, just like string value1, value2?
You do not want to do this. Data types are important, and exist for a reason. By using the string 'null' to represent a null value, you lose the ability to know whether it's really a null value, or an actual string with a lowercase n, lowercase u etc. This sort of thing has caused real problems for real people before.
What you want to do is write the literal None in your program when you create this kind of data structure from scratch; and when you deal with this data - whether you get it from parsed JSON, a hand-written structure or any other process - look for these values by checking whether something is None.
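As a small sketch of that advice, here is the structure from the question written as a Python literal with None, plus a second row that deliberately holds the string 'null' (my own addition, to show the difference). Checking with "is None" only picks up the real null.

```python
# Hand-written version of the data: None stands for the JSON null.
data = [
    {"field1": "value1", "field2": "value2", "field3": None},
    # A genuine string that merely spells "null"; not a null value.
    {"field1": "value1", "field2": "value2", "field3": "null"},
]

# Collect the positions of rows whose field3 is a real null.
missing = [i for i, row in enumerate(data) if row["field3"] is None]
print(missing)
```

Only the first row is reported, which is exactly the distinction you lose if every null is turned into the string 'null'.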
Just for completeness:
print(json.dumps(data, indent=4, default=str))
Doing this re-creates text in the JSON format which corresponds to the parsed JSON data, and displays it. So of course you will see null if you do this, because that is what the JSON format uses. (You will also see double quotes for all strings, because the JSON format does not allow single quotes for strings.)
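The round trip can be sketched with the standard library alone. The data below is a shortened, hand-written stand-in for the parsed response.

```python
import json

# Python side: single-quoted strings and None are just how Python writes it.
data = [{'field1': 'value1', 'field3': None}]

# Serializing produces JSON text: double quotes and null.
text = json.dumps(data)
print(text)

# Parsing that text again gives back an equal Python structure.
round_tripped = json.loads(text)
```

So null only ever appears in the JSON text, and None only ever appears in the Python objects.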

How do I clean this weird JSON data which I extracted from an excel file so that it becomes a proper dictionary?

I have an Excel file which has a JSON-like kind of data. I extracted the data from a column of that Excel file and converted it into a dictionary using the .to_dict() function. One cell of that column looks like this, and there are more rows filled with the same kind of data in that column:
{\currentPortfolioId":null/"isNewRTQ":true/"isNewInvestmentTenure":true/"isNearTermVolatility":false/"getPath":true/"riskProfile":"Moderate"/"initialInvestment":200000/"cashflowDate":"01-01-2021"/"currentWealth":200000/"goalPriority":"Wish"/"rebalancing":"yearly"/"goalAmount":2000000/"startDate":"16-06-2021"/"endDate":"01-01-2031"/"isNewGoalPriority":true/"infusions":[0/10000/10000/10000/10000/10000/10000/10000/10000/10000/10000/10000/10000/10000/10000/10000/10000/10000/10000/10000/10000/10000/10000/10000/10000/10000/10000/10000/10000/10000/10000/10000/10000/10000/10000/10000/10000/10000/10000/10000/10000/10000/10000/10000/10000/10000/10000/10000/10000/10000/10000/10000/10000/10000/10000/10000/10000/10000/10000/10000/10000/10000/10000/10000/10000/10000/10000/10000/10000/10000/10000/10000/10000/10000/10000/10000/10000/10000/10000/10000/10000/10000/10000/10000/10000/10000/10000/10000/10000/10000/10000/10000/10000/10000/10000/10000/10000/10000/10000/10000/10000/10000/10000/10000/10000/10000/10000/10000/10000/10000/10000/10000/10000/10000/10000/0]/"scenario_type":"regular"/"infusion_type":"monthly"/"xforwardForValue":"49.228.234.102:49907/ 13.86.190.104:3072/ 172.30.217.148:36243"}"
As is visible, the data is not clean: it contains special characters like "/" and "\".
Can anyone help me clean this data and convert it into a proper dictionary so that I can later do operations on it?
I did try ast.literal_eval(), but it doesn't seem to work.
Please help!

It's very similar to JSON. Maybe it was mangled by autocorrect or something like that in Excel? The commas have been replaced with forward slashes (/) and every " is prefixed with a backslash (\). Also, a single " is missing before the first key.
If you add the missing ", strip out all \ and replace / with , you can parse it with json.loads().
data = r'''{\"currentPortfolioId\":null/\"isNewRTQ\":true/\"isNewInvestmentTenure\":true/\"isNearTermVolatility\":false/\"getPath\":true/\"riskProfile\":\"Moderate\"/\"initialInvestment\":200000/\"cashflowDate\":\"01-01-2021\"/\"currentWealth\":200000/\"goalPriority\":\"Wish\"/\"rebalancing\":\"yearly\"/\"goalAmount\":2000000/\"startDate\":\"16-06-2021\"/\"endDate\":\"01-01-2031\"/\"isNewGoalPriority\":true/\"infusions\":[0/10000/10000/10000/10000/10000/10000/10000/10000/10000/10000/10000/10000/10000/10000/10000/10000/10000/10000/10000/10000/10000/10000/10000/10000/10000/10000/10000/10000/10000/10000/10000/10000/10000/10000/10000/10000/10000/10000/10000/10000/10000/10000/10000/10000/10000/10000/10000/10000/10000/10000/10000/10000/10000/10000/10000/10000/10000/10000/10000/10000/10000/10000/10000/10000/10000/10000/10000/10000/10000/10000/10000/10000/10000/10000/10000/10000/10000/10000/10000/10000/10000/10000/10000/10000/10000/10000/10000/10000/10000/10000/10000/10000/10000/10000/10000/10000/10000/10000/10000/10000/10000/10000/10000/10000/10000/10000/10000/10000/10000/10000/10000/10000/10000/10000/0]/\"scenario_type\":\"regular\"/\"infusion_type\":\"monthly\"/\"xforwardForValue\":\"49.228.234.102:49907/ 13.86.190.104:3072/ 172.30.217.148:36243\"}'''
import json
# Drop the stray backslashes, then turn the "/" separators back into commas.
parsed = json.loads(data.replace("\\", "").replace("/", ","))
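To apply the same idea to every cell without fixing the missing quote by hand, something like the helper below might work. It is a rough sketch: repair_cell is a name I made up, the sample cell is a shortened version of the one in the question, and it assumes every cell is damaged in the same way (a missing " before the first key and a stray " after the closing brace).

```python
import json


def repair_cell(text):
    """Best-effort repair of one mangled cell; returns a dict."""
    # Drop stray backslashes and turn the "/" separators back into commas.
    cleaned = text.replace("\\", "").replace("/", ",").strip()
    # Remove a stray quote left after the closing brace.
    if cleaned.endswith('}"'):
        cleaned = cleaned[:-1]
    # Restore the quote that is missing before the first key.
    if cleaned.startswith("{") and not cleaned.startswith('{"'):
        cleaned = '{"' + cleaned[1:]
    return json.loads(cleaned)


# Shortened sample in the same shape as the cell shown in the question.
cell = r'{\currentPortfolioId":null/"isNewRTQ":true/"riskProfile":"Moderate"/"infusions":[0/10000/0]}"'
parsed = repair_cell(cell)
print(parsed)
```

Be aware that this replaces every "/" in the cell, so a value that legitimately contains a slash (a date written as 01/01/2021, say) would be altered too.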

Force pandas to store type information in output files

I'm working with a dataset which is supposed to be pushed to an external web service, and that endpoint validates received JSON very strictly, while pandas treats types a bit more freely than it should.
For example, I need to push an integer value as a text value to the web service. Even if the value was correctly serialized to JSON as a string (with quotes around it), when it is read back pandas magically casts it back to an integer.
Is there any way around this? Can I force pandas to store the column type in the output file, or be stricter when reading the data?
Maybe another format is better for storing such data? Any clues would be greatly appreciated!

You can enforce dtypes while reading the data.
Example: considering the following JSON file called t.json:
{
"Strings": {
"A": "1",
"B": "2"
},
"Integers": {
"A": 3,
"B": 4
}
}
You can read the data, specifying the types in a dict:
import pandas as pd
df = pd.read_json('t.json', dtype={'Strings': str, 'Integers': str})
Which gives you the following dataframe:
  Strings Integers
A       1        3
B       2        4
and df.dtypes gives you:
Strings     object
Integers    object
dtype: object
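The same thing can be tried without creating a file. The sketch below assumes the JSON from t.json is held in a string and passed through io.StringIO; the variable names text and strict are my own.

```python
import io

import pandas as pd

# Same content as t.json above, kept in memory for the example.
text = '{"Strings": {"A": "1", "B": "2"}, "Integers": {"A": 3, "B": 4}}'

# Ask for both columns to be read as strings instead of letting pandas infer types.
strict = pd.read_json(io.StringIO(text), dtype={"Strings": str, "Integers": str})

print(strict)
print(strict.dtypes)
```

Every value in both columns is then a Python str. The exact dtype name printed by strict.dtypes may differ between pandas versions.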

If there is no explicit requirement to use JSON or CSV, it's also possible to store the data in a different, more advanced format.
Feather seems to be the most advanced according to the comparison article, but I had a problem getting it to run on a 32-bit version of Python, so I ended up with pickle serialization. It's supported by pandas out of the box, and it cost me only a slight variation from the initial code:
import pandas as pd
# Dump the dataframe to a pickle file
df.to_pickle(path)
# Read the dataframe back from the pickle file
df = pd.read_pickle(path)
Types were kept untouched.
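For anyone who wants to verify this, here is a minimal round-trip sketch. The dataframe and the temporary file name frame.pkl are invented for the example.

```python
import os
import tempfile

import pandas as pd

# A frame with digits stored as text in one column and real integers in the other.
df = pd.DataFrame({"Strings": ["1", "2"], "Integers": [3, 4]}, index=["A", "B"])

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "frame.pkl")
    # Write the frame to disk and read it straight back.
    df.to_pickle(path)
    restored = pd.read_pickle(path)

# The column types after the round trip match the original ones.
print(restored.dtypes)
```

Keep in mind that pickle files should only be loaded from sources you trust, so this suits local storage better than data exchange.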

How to parse integers in JSON file with a character following them (type marker)?

I have a JSON file which contains additional JSON data that follows the format below.
["{id:\"thaumcraft:celestial_notes\",Count:5b,Damage:10s}",
"{id:\"bloodmagic:ritual_stone\",Count:1b,Damage:0s}",
"{id:\"enderio:block_lava_generator\",Count:13b,Damage:0s}"]
Notice that type markers are appended to the numbers in the JSON.
How do I get Python to parse this without error (it thinks they are supposed to be strings)?
I cannot modify my JSON file, as it is 50,000 lines long and will change dynamically from user to user.
I've thought of different ways to parse the string, but they are all inefficient, or not practical and dynamic enough (there is another structure in the JSON data, shown below, that I will need to account for).
"{id:\"enderio:item_inventory_charger_basic\",Count:15b,tag:{enderio.darksteel.upgrade.energyUpgrade:{level:3,energy:5000000}},Damage:0s}",
The correct answer would end with the JSON loading correctly, e.g. by rewriting every string to look like this instead.
"{id:\"enderio:block_wired_charger\",Count:13,Damage:0}"

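One possible approach is to load the outer JSON as usual (it is a valid list of strings) and then strip the markers from each string with a regular expression. This is only a sketch: the pattern and the name strip_type_markers are my own, it assumes the markers are a single b, s or l letter directly after an integer value, and it does not try to tell apart text inside quoted values.

```python
import json
import re

# An integer right after ":" followed by a one-letter marker, ending at "," "}" or "]".
SUFFIX = re.compile(r'(?<=:)(-?\d+)[bBsSlL](?=[,}\]])')


def strip_type_markers(text):
    """Remove the trailing type letter from integer values, keeping the digits."""
    return SUFFIX.sub(r'\1', text)


# Shortened copy of the file content from the question.
raw = r'''["{id:\"thaumcraft:celestial_notes\",Count:5b,Damage:10s}",
"{id:\"bloodmagic:ritual_stone\",Count:1b,Damage:0s}"]'''

items = [strip_type_markers(s) for s in json.loads(raw)]
for item in items:
    print(item)
```

The cleaned strings match the target format shown in the question, but their keys are still unquoted, so they would need further work before json.loads() accepts the inner structures themselves.
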
Python/Pandas: read nested JSON

I am reading a data table from an API that returns me the data in JSON format, and one of the columns is itself a JSON string. I succeed in creating a Pandas dataframe for the overall table, but in the process of reading it, double quotes in the JSON string get converted to single quotes, and I can't parse the nested JSON.
I can't provide a reproducible example, but here is the key code:
myResult = requests.get(myURL, headers = myHeaders).text
myDF = pd.read_json(myResult, orient = "records", dtype = {"custom": str}, encoding = "unicode_escape")
Where custom is the nested JSON string. Try as I might by setting the dtype and encoding arguments, I cannot force Pandas to preserve the double quotes in the string.
So what started off as:
"custom": {"Field1":"Value1","Field2":"Value2"}
gets into the dataframe as:
{'Field1':'Value1','Field2':'Value2'}
I found this question which suggests using a custom parser for read_csv - but I can't see that this option is available for read_json.
I found a few suggestions here but the only one I could try was manually replacing the double quotes - and this causes fresh errors because there are apostrophes contained within the nested field values themselves...
The JSON strings are formatted correctly within myResult so it's the parsing applied by read_json that's the problem. Is there any way to change that or do I need to find some other way of reading this in?
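A note for readers with the same symptom: single quotes like these are usually just how Python displays a dict, which suggests the nested value was parsed into a dict rather than kept as text. One thing to try is parsing the response with the json module and re-serializing the nested field when a JSON string is needed. The payload below is invented, since the real response is not available, and custom_json is a hypothetical key name.

```python
import json

# Made-up response text with a nested object whose value contains an apostrophe.
payload = '[{"id": 1, "custom": {"Field1": "Value1", "Field2": "it\'s here"}}]'

records = json.loads(payload)

for record in records:
    # The nested object is now a dict; dump it again to get JSON text back.
    record["custom_json"] = json.dumps(record["custom"])

print(records[0]["custom"]["Field2"])
print(records[0]["custom_json"])
```

Because no quotes are replaced by hand, apostrophes inside the values survive. The resulting list of dicts can then be handed to pd.DataFrame(records), or to pd.json_normalize(records) if the nested fields should become separate columns.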
