JSON_Normalize (nested json) to csv - python

I have been trying to use pandas to extract data from a txt file containing UTF-8 encoded JSON data.
Direct link to data file - http://download.companieshouse.gov.uk/psc-snapshot-2022-02-06_8of20.zip
The data's structure looks like the following examples:
{"company_number":"04732933","data":{"address":{"address_line_1":"Windsor Road","locality":"Torquay","postal_code":"TQ1 1ST","premises":"Windsor Villas","region":"Devon"},"country_of_residence":"England","date_of_birth":{"month":1,"year":1964},"etag":"5623f35e4bb5dc9cb37e134cb2ac0ca3151cd01f","kind":"individual-person-with-significant-control","links":{"self":"/company/04732933/persons-with-significant-control/individual/8X3LALP5gAh5dAYEOYimeiRiJMQ"},"name":"Ms Karen Mychals","name_elements":{"forename":"Karen","surname":"Mychals","title":"Ms"},"nationality":"British","natures_of_control":["ownership-of-shares-50-to-75-percent"],"notified_on":"2016-04-06"}}
{"company_number":"10118870","data":{"address":{"address_line_1":"Hilltop Road","address_line_2":"Bearpark","country":"England","locality":"Durham","postal_code":"DH7 7TL","premises":"54"},"ceased_on":"2019-04-15","country_of_residence":"England","date_of_birth":{"month":9,"year":1983},"etag":"5b3c984156794e5519851b7f1b22d1bbd2a5c5df","kind":"individual-person-with-significant-control","links":{"self":"/company/10118870/persons-with-significant-control/individual/hS6dYoZ234aXhmI6Q9y83QbAhSY"},"name":"Mr Patrick John Burns","name_elements":{"forename":"Patrick","middle_name":"John","surname":"Burns","title":"Mr"},"nationality":"British","natures_of_control":["ownership-of-shares-25-to-50-percent","voting-rights-25-to-50-percent"],"notified_on":"2017-04-06"}}
A simple pd.read_json did not work initially (I would get ValueError: Trailing data errors) until lines=True was used (I am using Jupyter Notebook for this).
import pandas as pd
import json
df = pd.read_json(r'E:\JSON_data\psc-snapshot-2022-02-06_8of20.txt', encoding='utf8', lines=True)
This is how the data structure is displayed via df.head():
company_number data
0 06851805 {'address': {'address_line_1': 'Briar Road', '...
1 04732933 {'address': {'address_line_1': 'Windsor Road',...
2 10118870 {'address': {'address_line_1': 'Hilltop Road',...
3 10118870 {'address': {'address_line_1': 'Hilltop Road',...
4 09565353 {'address': {'address_line_1': 'Old Hertford R...
After looking through Stack Overflow and several online tutorials, I tried using pd.json_normalize(df) but keep getting an AttributeError: 'str' object has no attribute 'values' error. Ultimately, I would like to export this JSON file to a CSV file.
Thank you in advance for any advice!

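You can solve that problem by applying json_normalize just to the data column. pd.json_normalize expects a dict or a list of dicts; iterating over a whole DataFrame yields its column names (strings), which is where the 'str' object error comes from.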
import pandas as pd
import json
df = pd.read_json(r'E:\JSON_data\psc-snapshot-2022-02-06_8of20.txt', encoding='utf8', lines=True)
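# flatten the nested dicts in the 'data' column (nested keys become dotted column names)
df2 = pd.json_normalize(df['data'])
# put the flattened columns next to the original ones
df = pd.concat([df, df2], axis=1)
# output to csv
df.to_csv("./OUTPUT_FILE_NAME")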
print(df)
company_number ... name_elements.middle_name
0 4732933 ... NaN
1 10118870 ... John
[2 rows x 24 columns]
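The printed output above shows company_number without its leading zero (4732933). A minimal sketch of one way to avoid that, assuming the column should stay text: pass a dtype mapping to read_json. The two records below are shortened versions of the samples in the question, and joining natures_of_control with ";" is only an illustrative choice for making the CSV cell readable.

```python
import io
import pandas as pd

# Two shortened records in the same JSON Lines layout as the question's file.
raw = (
    '{"company_number":"04732933","data":{"address":{"locality":"Torquay"},'
    '"name":"Ms Karen Mychals",'
    '"natures_of_control":["ownership-of-shares-50-to-75-percent"]}}\n'
    '{"company_number":"10118870","data":{"address":{"locality":"Durham"},'
    '"ceased_on":"2019-04-15","name":"Mr Patrick John Burns",'
    '"natures_of_control":["ownership-of-shares-25-to-50-percent",'
    '"voting-rights-25-to-50-percent"]}}\n'
)

# dtype keeps company_number as text so the leading zero is not lost.
df = pd.read_json(io.StringIO(raw), lines=True, dtype={"company_number": str})

# Flatten the nested dicts; nested keys become dotted column names.
flat = pd.json_normalize(df["data"].tolist())

# json_normalize leaves lists as Python lists; join them into one string per cell.
flat["natures_of_control"] = flat["natures_of_control"].str.join(";")

# Drop the raw 'data' column and keep only the flattened columns.
result = pd.concat([df.drop(columns="data"), flat], axis=1)
csv_text = result.to_csv(index=False)
```

For the real file, the StringIO object would be replaced by the file path, and to_csv would be given an output path instead of returning a string.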

Related

How to transform json format into string column for python dataframe?

I got this dataframe:
Dataframe: df_case_1
Id RecordType
0 1234 {'attributes': {'type': 'RecordType', 'url': '/services/data/v55.0/sobjects/RecordType/1234', 'name', 'XYZ'}}
1 4321 {'attributes': {'type': 'RecordType', 'url': '/services/data/v55.0/sobjects/RecordType/4321', 'name', 'ABC'}}
I want to have this dataframe:
Dataframe: df_case_final
Id RecordType
0 1234 'XYZ'
1 4321 'ABC'
At the moment I use this statement, but it gives me the name at position 0 for every case object.
df_case_1['RecordType'] = df_case_1.RecordType[0]['Name']
How do I build the statement so that it gives me the correct name for every Id, as in df_case_final?
Thanks
There are three ways to convert JSON to a pandas DataFrame:
# 1. Use json_normalize() to convert JSON to DataFrame
parsed = json.loads(data)
df = pd.json_normalize(parsed['technologies'])
# 2. Convert JSON to DataFrame using read_json()
df2 = pd.read_json(jsonStr, orient='index')
# 3. Use pandas.DataFrame.from_dict() to convert JSON to DataFrame
parsed = json.loads(data)
df2 = pd.DataFrame.from_dict(parsed, orient="index")
After converting the JSON to a DataFrame, take the last column and append it to your original DataFrame.
Here, split the RecordType column on commas and drop the unnecessary columns:
import pandas as pd
df=pd.read_csv(r"Hansmuff.csv")
df[['1', '2','3','required']]=df['RecordType'].str.split(',', expand=True)
df = df.drop(columns=['RecordType', '1','2','3'])
df['required'] = df['required'].str.strip('{}')
print(df)
output
Id required
0 1234 'XYZ'
1 4321 'ABC'
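If the RecordType cells are real Python dicts rather than strings, a per-row lookup avoids the string splitting. This is only a sketch under an assumption: each dict has a top-level 'Name' key, as the asker's own statement RecordType[0]['Name'] suggests. The sample frame below is made up to match that shape.

```python
import pandas as pd

# Hypothetical frame: each RecordType cell is a dict with a top-level 'Name' key.
df_case_1 = pd.DataFrame({
    "Id": [1234, 4321],
    "RecordType": [
        {"attributes": {"type": "RecordType"}, "Name": "XYZ"},
        {"attributes": {"type": "RecordType"}, "Name": "ABC"},
    ],
})

# RecordType[0]['Name'] reads only the first row; map() applies the lookup to every row.
df_case_final = df_case_1.assign(
    RecordType=df_case_1["RecordType"].map(lambda d: d["Name"])
)
```

The original statement assigned one scalar (the first row's name) to the whole column, which is why every row ended up with the same value.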

Dataframe or CSV to JSON object array

This is probably an easy one for the python pros. So, please forgive my naivety.
Here is my data:
0 xyz#tim.com 13239169023 jane bo
1 tim#tim.com 13239169023 lane ko
2 jim#jim.com 13239169023 john do
Here is what I get as output:
[{"email":"xyz#tim.com","phone_number":13239169023,"first_name":"jane","last_name":"bo"},{"email":"tim#tim.com","phone_number":13239169023,"first_name":"lane","last_name":"ko"},{"email":"jim#jim.com","phone_number":13239169023,"first_name":"john","last_name":"do"}]
My Code:
df = pd.read_csv('profiles.csv')
print(df)
data = df.to_json(orient="records")
print(data)
Output I want:
{"profiles":[{"email":"xyz#tim.com","phone_number":13239169023,"first_name":"jane","last_name":"bo"},{"email":"tim#tim.com","phone_number":13239169023,"first_name":"lane","last_name":"ko"},{"email":"jim#jim.com","phone_number":13239169023,"first_name":"john","last_name":"do"}]}
Adding the line below does NOT work.
output = {"profiles": data}
It puts single quotes around the data, and profiles is NOT in double quotes (so it is NOT valid JSON), like so:
{'profiles': '[{"email":"xyz#tim.com","phone_number":13239169023,"first_name":"jane","last_name":"bo"},{"email":"tim#tim.com","phone_number":13239169023,"first_name":"lane","last_name":"ko"},{"email":"jim#jim.com","phone_number":13239169023,"first_name":"john","last_name":"do"}]'}
You can use df.to_dict to output to a dictionary instead of a json-formatted string:
import pandas as pd
df = pd.read_csv('data.csv')
data = df.to_dict(orient='records')
output = {'profiles': data}
print(output)
Returns:
{'profiles': [{'0': 1, 'xyz#tim.com': 'tim#tim.com', '13239169023': 13239169023, 'jane': 'lane', 'bo': 'ko'}, {'0': 2, 'xyz#tim.com': 'jim#jim.com', '13239169023': 13239169023, 'jane': 'john', 'bo': 'do'}]}
I think I found a solution.
Changes:
data = df.to_dict(orient="records")
output = {}
output["profiles"] = data
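Printing a dict shows Python's repr (single quotes), not JSON. A minimal sketch of getting an actual JSON string with the "profiles" wrapper: parse the string that to_json returns back into a list, wrap it, and serialize once with json.dumps. The header row in the sample CSV is an assumption based on the keys in the desired output.

```python
import io
import json
import pandas as pd

# Sample rows from the question, with an assumed header row.
csv_text = (
    "email,phone_number,first_name,last_name\n"
    "xyz#tim.com,13239169023,jane,bo\n"
    "tim#tim.com,13239169023,lane,ko\n"
)
df = pd.read_csv(io.StringIO(csv_text))

# to_json returns a string; json.loads turns it back into a list of plain dicts.
records = json.loads(df.to_json(orient="records"))

# Wrap the list and serialize once, so the result is a single valid JSON document.
payload = json.dumps({"profiles": records})
print(payload)
```

Wrapping the to_json string directly, as in the question, nests a string inside the dict, which is why the whole array appeared in quotes.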

Flatten nested JSON and concatenate to dataframe using pandas

I have searched for a lot of similar topics online, but I have not found the solution yet.
My pandas dataframe looks like this:
index FOR
0 [{'id': '2766', 'name': '0803 Computer Softwar...
1 [{'id': '2766', 'name': '0803 Computer Softwar...
2 [{'id': '2766', 'name': '0803 Computer Softwar...
3 [{'id': '2766', 'name': '0803 Computer Softwar...
4 [{'id': '2766', 'name': '0803 Computer Softwar...
I would like to flatten every row into the form below (shown here for the first row only):
index id name
0 2766 0803 Computer Software
I found a similar solution here. Unfortunately, I got a "TypeError" as the following:
TypeError: the JSON object must be str, bytes or bytearray, not 'list'
My code was:
dfs = []
for i in test['FOR']:
    data = json.loads(i)
    dfx = pd.json_normalize(data)
    dfs.append(dfx)
df = pd.concat(dfs).reset_index(inplace = True)
print(df)
Could anyone help me here?
Thank you very much!
Try using literal_eval from the ast standard library.
from ast import literal_eval
df_flattened = pd.json_normalize(df['FOR'].map(literal_eval))
Then drop duplicates.
print(df_flattened.drop_duplicates())
id name
0 2766 0803 Computer Software
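literal_eval only accepts strings, while the TypeError in the question suggests the cells may already be lists. A hedged sketch that handles both cases, assuming each cell is either a list of dicts or the string form of one; the helper name to_list is made up for this example.

```python
from ast import literal_eval
import pandas as pd

# Hypothetical frame where the cells are strings that look like lists of dicts.
df = pd.DataFrame(
    {"FOR": ["[{'id': '2766', 'name': '0803 Computer Software'}]"] * 2}
)

def to_list(cell):
    """Parse string cells with literal_eval; pass real lists through unchanged."""
    return literal_eval(cell) if isinstance(cell, str) else cell

parsed = df["FOR"].map(to_list)

# Each cell is a list of dicts, so collect all dicts into one flat list first.
records = [record for cell in parsed for record in cell]
df_flattened = pd.json_normalize(records).drop_duplicates()
```

Checking df['FOR'].map(type) first shows which of the two cases applies to a given frame.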
After a few weeks of not touching this work, I ran into a similar case, and I think I have a solution for it now.
Please feel free to correct me or suggest other ideas.
I really appreciate all the help and generous support!
chuck = []
for i in range(len(test)):
    chuck.append(pd.json_normalize(test.iloc[i, :]['FOR']))
test_df = pd.concat(chuck)
Then drop the duplicated columns from test_df.
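The loop can also be written without building one small DataFrame per row. A sketch using Series.explode, assuming every cell of FOR is a non-empty list of dicts; the second entry in the sample data is a made-up placeholder to show a row with two items.

```python
import pandas as pd

# Hypothetical frame: each FOR cell is a real list of dicts.
test = pd.DataFrame({
    "FOR": [
        [{"id": "2766", "name": "0803 Computer Software"}],
        [
            {"id": "2766", "name": "0803 Computer Software"},
            {"id": "0000", "name": "placeholder second entry"},
        ],
    ]
})

# explode() yields one dict per row and repeats the original index label.
exploded = test["FOR"].explode()

# Normalize all dicts at once, then restore the original row labels.
flat = pd.json_normalize(exploded.tolist())
flat.index = exploded.index
```

Keeping the original index makes it possible to join the flattened columns back onto the other columns of test.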

convert json file to data frame in python

I am new to JSON file handling. I have the JSON file below:
{"loanAccount":{"openAccount":[{"accountNumber":"986985874","accountOpenDate":"2020-02-045-11:00","accountCode":"NA","relationship":"Main account","loanTerm":"120"}]}}
I want to convert this into a DataFrame. I am using the code below:
import pandas as pd
from pandas.io.json import json_normalize
data1 = pd.read_json(r'./DLResponse1.json', lines=True)
df = pd.DataFrame.from_dict(data1, orient='columns')
This gives me the output below:
index loanAccount
0 {'openAccount': [{'accountNumber': '986985874', 'accountOpenDate': '2020-02-045-11:00', 'accountCode': 'NA', 'relationship': 'Main account', 'loanTerm': '120'}]}
However, I want to extract the data in the format below:
loanAccount openAccount accountNumber accountOpenDate accountCode relationship loanTerm
986985874 2020-02-045-11:00 NA Main account 120
You may use:
# s is your parsed JSON (a dict); you can load it from the file with json.load
pd.DataFrame(s["loanAccount"]["openAccount"])
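Output:
accountNumber accountOpenDate accountCode relationship loanTerm
0 986985874 2020-02-045-11:00 NA Main account 120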
If you also want the other JSON keys as columns, you may use:
pd.DataFrame([{"loanAccount": '', "openAccount": '', **s["loanAccount"]["openAccount"][0]}])
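The same result can be reached with pd.json_normalize and a record_path, which walks down to the nested list without indexing by hand. A small sketch using the JSON from the question as an inline string; in practice the text would come from the file.

```python
import json
import pandas as pd

# The JSON document from the question, inlined for a self-contained example.
text = (
    '{"loanAccount":{"openAccount":[{"accountNumber":"986985874",'
    '"accountOpenDate":"2020-02-045-11:00","accountCode":"NA",'
    '"relationship":"Main account","loanTerm":"120"}]}}'
)
s = json.loads(text)

# record_path names the keys to follow down to the list of records.
df = pd.json_normalize(s, record_path=["loanAccount", "openAccount"])
```

Each dict in the openAccount list becomes one row, so a file with several accounts would produce several rows.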

Extract a column's data into a variable

I've got a very large dataframe where one of the columns holds dictionaries (let's say column 12). Each dictionary contains part of a hyperlink, which I want to get.
In Jupyter, I want to display a table with columns 0 and 2, as well as the completed hyperlink.
I think I need to:
1. Extract that dictionary from the dataframe
2. Get a particular keyed value from it
3. Create the full hyperlink from the extracted value
4. Copy the dataframe and replace the column with the hyperlink created above
Let's just tackle step 1, and I'll ask separate questions for the next steps.
How do I extract values from a dataframe into a variable I can play with?
import pytd
import pandas
client = pytd.Client(apikey=widget_api_key.value, database=widget_database.value)
results = client.query(query)
dataframe = pandas.DataFrame(**results)
dataframe
# Not sure what to do next
If you only want to extract one key from the dictionary, and the column already holds Python dicts, you can do it as follows:
import numpy as np
import pandas as pd
# assuming your dicts are stored in column 'data'
# and you want to store the url in column 'url'
# (np.nan rather than np.NaN: the NaN alias was removed in NumPy 2.0)
df['url'] = df['data'].map(lambda d: d.get('url', np.nan) if hasattr(d, 'get') else np.nan)
# from there you can do your transformation on the url column
Test data and results:
df = pd.DataFrame({
    'col1': [1, 5, 6],
    'data': [{'url': 'http://foo.org', 'comment': 'not interesting'}, {'comment': 'great site about beer receipes, but forgot the url'}, np.nan],
    'json': ['{"url": "http://foo.org", "comment": "not interesting"}', '{"comment": "great site about beer receipes, but forgot the url"}', np.nan]
})
# Result of the logic above:
col1 data url
0 1 {'url': 'http://foo.org', 'comment': 'not inte... http://foo.org
1 5 {'comment': 'great site about beer receipes, b... NaN
2 6 NaN NaN
If you need to check whether your data is already stored as Python dicts (rather than strings), you can do it as follows:
print(df['data'].map(type))
If your dicts are stored as strings, you can convert them to dicts first with the following code:
import json
def get_url_from_json(document):
    # missing cells (NaN/None) have no URL
    if pd.isnull(document):
        url = np.nan
    else:
        try:
            _dict = json.loads(document)
            url = _dict.get('url', np.nan)
        except (TypeError, ValueError, AttributeError):
            # not valid JSON, or valid JSON that is not an object
            url = np.nan
    return url
df['url2'] = df['json'].map(get_url_from_json)
# output:
print(df[['col1', 'url', 'url2']])
col1 url url2
0 1 http://foo.org http://foo.org
1 5 NaN NaN
2 6 NaN NaN
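The remaining steps from the question (build the full hyperlink and show it next to two other columns) could be sketched as follows. Everything named here is an assumption for illustration: the column names, the 'path' key, the BASE_URL prefix, and the helper full_link are not from the original data.

```python
import pandas as pd

BASE_URL = "https://example.org"  # hypothetical prefix for the partial links

# Hypothetical frame: 'data' holds dicts, a dict without the key, and a missing cell.
df = pd.DataFrame({
    "col0": ["a", "b", "c"],
    "col2": [10, 20, 30],
    "data": [{"path": "/items/1"}, {"comment": "no path"}, None],
})

def full_link(cell, key="path"):
    """Return BASE_URL + cell[key], or None if the cell is not a dict or lacks the key."""
    if isinstance(cell, dict) and key in cell:
        return BASE_URL + cell[key]
    return None

# Keep the two display columns and add the completed link as a new column.
table = df[["col0", "col2"]].assign(link=df["data"].map(full_link))
```

Building a new frame with assign leaves the original dataframe untouched, which matches the "copy the dataframe" step in the question.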
