JSON_Normalize (nested json) to csv

I have been trying via pandas to extract data from a txt file containing json utf-8 encoded data.
Direct link to data file - http://download.companieshouse.gov.uk/psc-snapshot-2022-02-06_8of20.zip
Data's structure looks like the following examples:
{"company_number":"04732933","data":{"address":{"address_line_1":"Windsor Road","locality":"Torquay","postal_code":"TQ1 1ST","premises":"Windsor Villas","region":"Devon"},"country_of_residence":"England","date_of_birth":{"month":1,"year":1964},"etag":"5623f35e4bb5dc9cb37e134cb2ac0ca3151cd01f","kind":"individual-person-with-significant-control","links":{"self":"/company/04732933/persons-with-significant-control/individual/8X3LALP5gAh5dAYEOYimeiRiJMQ"},"name":"Ms Karen Mychals","name_elements":{"forename":"Karen","surname":"Mychals","title":"Ms"},"nationality":"British","natures_of_control":["ownership-of-shares-50-to-75-percent"],"notified_on":"2016-04-06"}}
{"company_number":"10118870","data":{"address":{"address_line_1":"Hilltop Road","address_line_2":"Bearpark","country":"England","locality":"Durham","postal_code":"DH7 7TL","premises":"54"},"ceased_on":"2019-04-15","country_of_residence":"England","date_of_birth":{"month":9,"year":1983},"etag":"5b3c984156794e5519851b7f1b22d1bbd2a5c5df","kind":"individual-person-with-significant-control","links":{"self":"/company/10118870/persons-with-significant-control/individual/hS6dYoZ234aXhmI6Q9y83QbAhSY"},"name":"Mr Patrick John Burns","name_elements":{"forename":"Patrick","middle_name":"John","surname":"Burns","title":"Mr"},"nationality":"British","natures_of_control":["ownership-of-shares-25-to-50-percent","voting-rights-25-to-50-percent"],"notified_on":"2017-04-06"}}
A plain pd.read_json did not work initially (I would get ValueError: Trailing data errors) until I passed lines=True (I am using Jupyter Notebook for this).
import pandas as pd
import json
df = pd.read_json(r'E:\JSON_data\psc-snapshot-2022-02-06_8of20.txt', encoding='utf8', lines=True)
This is how the data is displayed via df.head():
company_number data
0 06851805 {'address': {'address_line_1': 'Briar Road', '...
1 04732933 {'address': {'address_line_1': 'Windsor Road',...
2 10118870 {'address': {'address_line_1': 'Hilltop Road',...
3 10118870 {'address': {'address_line_1': 'Hilltop Road',...
4 09565353 {'address': {'address_line_1': 'Old Hertford R...
After looking through Stack Overflow and several online tutorials, I tried using pd.json_normalize(df) but keep getting an AttributeError: 'str' object has no attribute 'values' error. Ultimately, I would like to export this JSON file to a CSV file.
Thank you in advance for any advice!

You can solve that problem by applying json_normalize just to the data column.
import pandas as pd
import json
df = pd.read_json(r'E:\JSON_data\psc-snapshot-2022-02-06_8of20.txt', encoding='utf8', lines=True)
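# Flatten the nested dicts in the 'data' column into separate columns
df2 = pd.json_normalize(df['data'])
# Put the flattened columns next to the original ones (the 'data' column is kept)
df = pd.concat([df, df2], axis=1)
# Write the result to a CSV file
df.to_csv("./OUTPUT_FILE_NAME")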
print(df)
company_number ... name_elements.middle_name
0 4732933 ... NaN
1 10118870 ... John
[2 rows x 24 columns]
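An alternative sketch, assuming each line of the file is one JSON object as in the samples above: parse the lines with the standard json module and pass the whole list of records to pd.json_normalize. The nested keys then become dotted column names such as data.address.locality, and company_number stays a string, so its leading zero is kept. The two shortened records and the variable names are illustrative assumptions, not the real file.

```python
import json

import pandas as pd

# Two shortened sample records in JSON Lines form (one object per line).
# In practice these lines would be read from the downloaded snapshot file.
raw_lines = [
    '{"company_number":"04732933","data":{"address":{"locality":"Torquay"},'
    '"name":"Ms Karen Mychals","name_elements":{"forename":"Karen","surname":"Mychals"}}}',
    '{"company_number":"10118870","data":{"address":{"locality":"Durham"},'
    '"name":"Mr Patrick John Burns","name_elements":{"forename":"Patrick",'
    '"middle_name":"John","surname":"Burns"}}}',
]

# Parse every line into a dict, then flatten all nesting levels at once.
records = [json.loads(line) for line in raw_lines]
flat = pd.json_normalize(records)

# With no path given, to_csv returns the CSV text instead of writing a file.
csv_text = flat.to_csv(index=False)
print(flat.columns.tolist())
```

Keys that are missing in a record, such as middle_name in the first one, simply end up as NaN in that row.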

Related

How to transform json format into string column for python dataframe?

I got this dataframe:
Dataframe: df_case_1
Id RecordType
0 1234 {'attributes': {'type': 'RecordType', 'url': '/services/data/v55.0/sobjects/RecordType/1234', 'name', 'XYZ'}}
1 4321 {'attributes': {'type': 'RecordType', 'url': '/services/data/v55.0/sobjects/RecordType/4321', 'name', 'ABC'}}
I want to have this dataframe:
Dataframe: df_case_final
Id RecordType
0 1234 'XYZ'
1 4321 'ABC'
At the moment I use this statement, but it gives me the name at position 0 for every row.
df_case_1['RecordType'] = df_case_1.RecordType[0]['Name']
How do I build the statement so that it gives me the correct name for every Id, as in df_case_final?
Thanks

There are three ways to convert JSON to a pandas DataFrame:
import json
import pandas as pd
# 1. Use pd.json_normalize() to convert parsed JSON to a DataFrame
parsed = json.loads(data)
df = pd.json_normalize(parsed['technologies'])
# 2. Convert a JSON string to a DataFrame using pd.read_json()
# (newer pandas versions expect the string wrapped in io.StringIO)
df2 = pd.read_json(jsonStr, orient='index')
# 3. Use pd.DataFrame.from_dict() on the parsed JSON
parsed = json.loads(data)
df2 = pd.DataFrame.from_dict(parsed, orient="index")
After converting the JSON to a DataFrame, take the last column and append it to your original dataframe.

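Split the RecordType column on commas and drop the unnecessary columns:
import pandas as pd
df = pd.read_csv(r"Hansmuff.csv")
# Split the text of RecordType into four pieces; the last one holds the name
df[['1', '2', '3', 'required']] = df['RecordType'].str.split(',', expand=True)
df = df.drop(columns=['RecordType', '1', '2', '3'])
# Remove the surrounding spaces and closing braces
df['required'] = df['required'].str.strip(' {}')
print(df)
Output: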
Id required
0 1234 'XYZ'
1 4321 'ABC'
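If the RecordType column already holds Python dicts rather than text, the value can be read per row instead of from row 0 only. A minimal sketch, assuming each dict carries the name under a top-level 'Name' key (as the df_case_1.RecordType[0]['Name'] lookup in the question suggests); the sample frame below is built by hand for illustration:

```python
import pandas as pd

# Hand-built sample; the top-level 'Name' key is an assumption about the real data.
df_case_1 = pd.DataFrame({
    "Id": [1234, 4321],
    "RecordType": [
        {"attributes": {"type": "RecordType"}, "Name": "XYZ"},
        {"attributes": {"type": "RecordType"}, "Name": "ABC"},
    ],
})


def extract_name(record):
    """Return the 'Name' entry of one cell, or None if the cell is not a dict."""
    return record.get("Name") if isinstance(record, dict) else None


# map() calls extract_name once per row, so every Id gets its own name.
df_case_final = df_case_1.assign(RecordType=df_case_1["RecordType"].map(extract_name))
print(df_case_final)
```

The original statement assigned one scalar (the name from row 0) to the whole column, which is why every row showed the same value.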

Dataframe or CSV to JSON object array

This is probably an easy one for the python pros. So, please forgive my naivety.
Here is my data:
0 xyz#tim.com 13239169023 jane bo
1 tim#tim.com 13239169023 lane ko
2 jim#jim.com 13239169023 john do
Here is what I get as output:
[{"email":"xyz#tim.com","phone_number":13239169023,"first_name":"jane","last_name":"bo"},{"email":"tim#tim.com","phone_number":13239169023,"first_name":"lane","last_name":"ko"},{"email":"jim#jim.com","phone_number":13239169023,"first_name":"john","last_name":"do"}]
My Code:
df = pd.read_csv('profiles.csv')
print(df)
data = df.to_json(orient="records")
print(data)
Output I want:
{"profiles":[{"email":"xyz#tim.com","phone_number":13239169023,"first_name":"jane","last_name":"bo"},{"email":"tim#tim.com","phone_number":13239169023,"first_name":"lane","last_name":"ko"},{"email":"jim#jim.com","phone_number":13239169023,"first_name":"john","last_name":"do"}]}
Adding below does NOT work.
output = {"profiles": data}
It puts single quotes around the data, and profiles is NOT in double quotes (so it is NOT valid JSON), like so:
{'profiles': '[{"email":"xyz#tim.com","phone_number":13239169023,"first_name":"jane","last_name":"bo"},{"email":"tim#tim.com","phone_number":13239169023,"first_name":"lane","last_name":"ko"},{"email":"jim#jim.com","phone_number":13239169023,"first_name":"john","last_name":"do"}]'}

You can use df.to_dict to output to a dictionary instead of a json-formatted string:
import pandas as pd
df = pd.read_csv('data.csv')
data = df.to_dict(orient='records')
output = {'profiles': data}
print(output)
Returns:
{'profiles': [{'0': 1, 'xyz#tim.com': 'tim#tim.com', '13239169023': 13239169023, 'jane': 'lane', 'bo': 'ko'}, {'0': 2, 'xyz#tim.com': 'jim#jim.com', '13239169023': 13239169023, 'jane': 'john', 'bo': 'do'}]}

I think I found a solution.
Changes:
data = df.to_dict(orient="records")
output = {}
output["profiles"] = data
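Keep in mind that print(output) shows the Python repr of the dict, which uses single quotes. To get a JSON string in the requested shape, the dict can be serialized with json.dumps. A minimal sketch, assuming the CSV has a header row with the four column names from the desired output (the inline CSV text stands in for profiles.csv):

```python
import io
import json

import pandas as pd

# Inline stand-in for profiles.csv; the header row is an assumption.
csv_text = (
    "email,phone_number,first_name,last_name\n"
    "xyz#tim.com,13239169023,jane,bo\n"
    "tim#tim.com,13239169023,lane,ko\n"
)
df = pd.read_csv(io.StringIO(csv_text))

# to_json gives a JSON string; parse it back into Python objects
# so it can be nested under the "profiles" key without extra quoting.
records = json.loads(df.to_json(orient="records"))
json_text = json.dumps({"profiles": records})
print(json_text)
```

Wrapping the to_json string directly in a dict, as in the first attempt, nests a string inside the dict, which is where the extra quotes came from.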

Flatten nested JSON and concatenate to dataframe using pandas

I have searched for a lot of similar topics online, but I have not found the solution yet.
My pandas dataframe looks like this:
index FOR
0 [{'id': '2766', 'name': '0803 Computer Softwar...
1 [{'id': '2766', 'name': '0803 Computer Softwar...
2 [{'id': '2766', 'name': '0803 Computer Softwar...
3 [{'id': '2766', 'name': '0803 Computer Softwar...
4 [{'id': '2766', 'name': '0803 Computer Softwar...
I would like to flatten all rows into a dataframe like the following (shown here for the first row only):
index id name
0 2766 0803 Computer Software
I found a similar solution here. Unfortunately, I got the following TypeError:
TypeError: the JSON object must be str, bytes or bytearray, not 'list'
My code was:
dfs = []
for i in test['FOR']:
    data = json.loads(i)
    dfx = pd.json_normalize(data)
    dfs.append(dfx)
df = pd.concat(dfs).reset_index(inplace = True)
print(df)
Could anyone help me here?
Thank you very much!

Try using literal_eval from the ast module of the standard library:
from ast import literal_eval
df_flattened = pd.json_normalize(df['FOR'].map(literal_eval))
Then drop duplicates:
print(df_flattened.drop_duplicates())
id name
0 2766 0803 Computer Software

After a few weeks away from this work, I ran into a similar case, and I think I have a solution now. Please feel free to correct me or suggest other ideas. I really appreciate all the help and generous support!
chunks = []
for i in range(len(test)):
    # Each cell of 'FOR' is a list of dicts; flatten it into its own small frame
    chunks.append(pd.json_normalize(test.iloc[i, :]['FOR']))
test_df = pd.concat(chunks)
Then drop the duplicated columns from test_df.
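Another way to write the same idea without an explicit loop is to explode the lists first and normalize once. A sketch, assuming every cell of FOR is already a Python list of dicts (which the TypeError in the question suggests); the second entry in the sample data is made up for illustration:

```python
import pandas as pd

# Sample frame; the '2790' entry is hypothetical.
test = pd.DataFrame({
    "FOR": [
        [{"id": "2766", "name": "0803 Computer Software"}],
        [{"id": "2766", "name": "0803 Computer Software"},
         {"id": "2790", "name": "Hypothetical Field"}],
    ]
})

# explode() turns each list element into its own row; dropna() skips empty lists.
exploded = test["FOR"].explode().dropna()

# Build one flat frame from all dicts and keep the original row labels.
flat = pd.json_normalize(exploded.tolist())
flat.index = exploded.index

unique_rows = flat.drop_duplicates().reset_index(drop=True)
print(unique_rows)
```

Keeping the exploded index on flat makes it possible to join the result back to the other columns of test if that is needed.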

convert json file to data frame in python

I am new to JSON file handling. I have the JSON file below:
{"loanAccount":{"openAccount":[{"accountNumber":"986985874","accountOpenDate":"2020-02-045-11:00","accountCode":"NA","relationship":"Main account","loanTerm":"120"}]}}
I want to convert this into a dataframe. I am using the code below:
import pandas as pd
from pandas.io.json import json_normalize
data1 = pd.read_json (r'./DLResponse1.json',lines=True)
df = pd.DataFrame.from_dict(data1, orient='columns')
This gives me the output below:
index loanAccount
0 {'openAccount': [{'accountNumber': '986985874', 'accountOpenDate': '2020-02-045-11:00', 'accountCode': 'NA', 'relationship': 'Main account', 'loanTerm': '120'}]}}
However, I want to extract it in the format below:
loanAccount openAccount accountNumber accountOpenDate accountCode relationship loanTerm
986985874 2020-02-045-11:00 NA Main account 120

You may use:
# s is your parsed JSON (a dict); you can load it from the file
pd.DataFrame(s["loanAccount"]["openAccount"])
If you also want the other JSON keys as columns, you may use:
pd.DataFrame([{"loanAccount": '', "openAccount": '', **s["loanAccount"]["openAccount"][0]}])
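To make the s above concrete, here is a sketch that parses the JSON text with the standard json module and flattens the openAccount list. The JSON is inlined as a string for illustration; for a file, json.load on the open file object would play the same role.

```python
import json

import pandas as pd

# The sample document from the question, inlined as text.
raw = (
    '{"loanAccount":{"openAccount":[{"accountNumber":"986985874",'
    '"accountOpenDate":"2020-02-045-11:00","accountCode":"NA",'
    '"relationship":"Main account","loanTerm":"120"}]}}'
)

s = json.loads(raw)

# Each dict in the openAccount list becomes one row.
df = pd.json_normalize(s["loanAccount"]["openAccount"])
print(df)
```

Because the values are parsed from JSON rather than from CSV, the accountCode "NA" stays an ordinary string and is not turned into a missing value.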

Extract a column's data into a variable

I've got a very large dataframe where one of the columns holds dictionaries (let's say column 12). Each dictionary contains part of a hyperlink, which I want to get.
In Jupyter, I want to display a table with columns 0 and 2, as well as the completed hyperlink.
I think I need to:
Extract that dictionary from the dataframe
Get a particular keyed value from it
Create the full hyperlink from the extracted value
Copy the dataframe and replace the column with the hyperlink created above
Let's just tackle step 1 and I'll make other questions for the next steps.
How do I extract values from a dataframe into a variable I can play with?
import pytd
import pandas
client = pytd.Client(apikey=widget_api_key.value, database=widget_database.value)
results = client.query(query)
dataframe = pandas.DataFrame(**results)
dataframe
# Not sure what to do next

If you only want to extract one key from the dictionary and the dictionary is already stored as a dictionary in the column, you can do it as follows:
import numpy as np
import pandas as pd
# assuming, your dicts are stored in column 'data'
# and you want to store the url in column 'url'
df['url'] = df['data'].map(lambda d: d.get('url', np.nan) if hasattr(d, 'get') else np.nan)
# from there you can do your transformation on the url column
Test data and results:
df = pd.DataFrame({
    'col1': [1, 5, 6],
    'data': [{'url': 'http://foo.org', 'comment': 'not interesting'}, {'comment': 'great site about beer receipes, but forgot the url'}, np.nan],
    'json': ['{"url": "http://foo.org", "comment": "not interesting"}', '{"comment": "great site about beer receipes, but forgot the url"}', np.nan]
})
# Result of the logic above:
col1 data url
0 1 {'url': 'http://foo.org', 'comment': 'not inte... http://foo.org
1 5 {'comment': 'great site about beer receipes, b... NaN
2 6 NaN NaN
If you need to check whether your data is already stored as Python dicts (rather than strings), you can do it as follows:
print(df['data'].map(type))
If your dicts are stored as strings, you can convert them to dicts first with the following code:
import json
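def get_url_from_json(document):
    # Missing values have no URL
    if pd.isnull(document):
        url = np.nan
    else:
        try:
            _dict = json.loads(document)
            url = _dict.get('url', np.nan)
        except (TypeError, ValueError, AttributeError):
            # Not valid JSON, or the JSON is not an object
            url = np.nan
    return url
df['url2'] = df['json'].map(get_url_from_json)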
# output:
print(df[['col1', 'url', 'url2']])
col1 url url2
0 1 http://foo.org http://foo.org
1 5 NaN NaN
2 6 NaN NaN
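Once the value is in its own column, building the full hyperlink is a string concatenation per row. A sketch of that later step, assuming a hypothetical base address BASE_URL and a hypothetical 'path' key holding the partial link; neither name comes from the question:

```python
import pandas as pd

BASE_URL = "https://example.com"  # hypothetical prefix

# Small stand-in frame; 'col0' and 'data' are illustrative column names.
df = pd.DataFrame({
    "col0": ["a", "b", "c"],
    "data": [{"path": "/items/1"}, {"comment": "no path here"}, None],
})


def full_link(cell):
    """Return BASE_URL + path for dict cells that have a 'path', else None."""
    if isinstance(cell, dict) and "path" in cell:
        return BASE_URL + cell["path"]
    return None


# Copy the frame with the dict column replaced by the finished link.
display_df = df.drop(columns="data").assign(link=df["data"].map(full_link))
print(display_df)
```

The original frame is left untouched, so the dict column remains available for further extraction.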
