Convert parquet to list of objects in python

I am reading a parquet file with pandas:
import pandas as pd
df = pd.read_parquet('myfile.parquet', engine='pyarrow')
The file has the following structure:
   company_id  user_id           attribute_name  attribute_value  timestamp
1  116664      111f07000612      first_name      Tom              2022-03-23 17:11:58
2  116664      111f07000612      last_name       Cruise           2022-03-23 17:11:58
3  116664      111f07000612      city            New York         2022-03-23 17:11:58
4  116664      abcf0700d009d122  first_name      Matt             2022-02-23 10:11:59
5  116664      abcf0700d009d122  last_name       Damon            2022-02-23 10:11:59
I would like to group by user_id and generate a list of objects (that will be stored as json) with the following format:
[
    {
        "user_id": "111f07000612",
        "first_name": "Tom",
        "last_name": "Cruise",
        "city": "New York"
    },
    {
        "user_id": "abcf0700d009d122",
        "first_name": "Matt",
        "last_name": "Damon"
    }
]

Hi 👋🏻 Hope you are doing well!
You can achieve it with something similar to this 🙂
from pprint import pprint
import pandas as pd
# because I don't have the exact parquet file, I will just mock it
# df = pd.read_parquet("myfile.parquet", engine="pyarrow")
df = pd.DataFrame(
    {
        "company_id": [116664, 116664, 116664, 116664, 116664],
        "user_id": ["111f07000612", "111f07000612", "111f07000612", "abcf0700d009d122", "abcf0700d009d122"],
        "attribute_name": ["first_name", "last_name", "city", "first_name", "last_name"],
        "attribute_value": ["Tom", "Cruise", "New York", "Matt", "Damon"],
        "timestamp": ["2022-03-23 17:11:58", "2022-03-23 17:11:58", "2022-03-23 17:11:58", "2022-02-23 10:11:59", "2022-02-23 10:11:59"]
    }
)
records = []
for user_id, group in df.groupby("user_id"):
    # Turn the attribute_name/attribute_value rows of one user into a single wide row
    transformed_group = (
        group[["attribute_name", "attribute_value"]]
        .set_index("attribute_name")
        .transpose()
        .assign(user_id=user_id)
    )
    # to_dict("records") returns a one-element list here; unpack its only item
    record, *_ = transformed_group.to_dict("records")
    records.append(record)
pprint(records)
# [{'city': 'New York',
# 'first_name': 'Tom',
# 'last_name': 'Cruise',
# 'user_id': '111f07000612'},
# {'first_name': 'Matt', 'last_name': 'Damon', 'user_id': 'abcf0700d009d122'}]
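A shorter variant, offered as a sketch rather than a drop-in replacement: each group only needs its attribute names mapped to their values, so the per-user dictionary can be built with zip and the final list serialised with json.dumps. The mocked DataFrame stands in for the parquet file and holds only the columns that are used (an assumption), and the sketch assumes each user has at most one row per attribute_name.

```python
import json

import pandas as pd

# Mocked data standing in for pd.read_parquet("myfile.parquet", engine="pyarrow")
df = pd.DataFrame(
    {
        "user_id": ["111f07000612"] * 3 + ["abcf0700d009d122"] * 2,
        "attribute_name": ["first_name", "last_name", "city", "first_name", "last_name"],
        "attribute_value": ["Tom", "Cruise", "New York", "Matt", "Damon"],
    }
)

# One dict per user: user_id first, then every attribute_name -> attribute_value pair
records = [
    {"user_id": user_id, **dict(zip(group["attribute_name"], group["attribute_value"]))}
    for user_id, group in df.groupby("user_id")
]

# Serialise the list of objects as JSON text
print(json.dumps(records, indent=2))
```

If a user could have several rows for the same attribute, the later row would silently win in this version, so sorting by timestamp first may be worth considering.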

Related

How to map and update python dictionary with different key value pair?

I want to transform a Dictionary in Python, from Dictionary 1 Into Dictionary 2 as follows.
transaction = {
"trans_time": "14/07/2015 10:03:20",
"trans_type": "DEBIT",
"description": "239.95 USD DHL.COM TEXAS USA",
}
I want to transform the above dictionary to the following
transaction = {
"trans_time": "14/07/2015 10:03:20",
"trans_type": "DEBIT",
"description": "DHL.COM TEXAS USA",
"amount": 239.95,
"currency": "USD",
"location": "TEXAS USA",
"merchant_name": "DHL"
}
I tried the following but it did not work
dic1 = {
"trans_time": "14/07/2015 10:03:20",
"trans_type": "DEBIT",
"description": "239.95 USD DHL.COM TEXAS USA"
}
print(type(dic1))
copiedDic = dic1.copy()
print("copiedDic = ",copiedDic)
updatekeys = ['amount', 'currency', 'merchant_name', 'location', 'trans_category']
for key in dic1:
    if key == 'description':
        list_words = dic1[key].split(" ")
        newdict = {updatekeys[i]: x for i, x in enumerate(list_words)}
        copiedDic.update(newdict)
print(copiedDic)
I got The following result
{
'trans_time': '14/07/2015 10:03:20',
'trans_type': 'DEBIT',
'description': '239.95 USD DHL.COM TEXAS USA',
'amount': '239.95',
'currency': 'USD',
'merchant_name': 'DHL.COM',
'location': 'TEXAS',
'trans_category': 'USA'
}
My Intended output should look like this:
transaction = {
"trans_time": "14/07/2015 10:03:20",
"trans_type": "DEBIT",
"description": "DHL.COM TEXAS USA",
"amount": 239.95,
"currency": "USD",
"location": "TEXAS USA",
"merchant_name": "DHL"
}

I think it is easier to split the value into a list of words and parse that. Here the list aaa is created from the string transaction['description']. Where a field spans more than one word (list element), join turns the slice back into a string. The amount is converted from a string to a float. For merchant_name, the part of the word before the first dot is taken.
transaction = {
"trans_time": "14/07/2015 10:03:20",
"trans_type": "DEBIT",
"description": "239.95 USD DHL.COM TEXAS USA",
}
aaa = transaction['description'].split()
transaction['description'] = ' '.join(aaa[2:])
transaction['amount'] = float(aaa[0])
transaction['currency'] = aaa[1]
transaction['location'] = ' '.join(aaa[3:])
transaction['merchant_name'] = aaa[2].partition('.')[0]
print(transaction)
Output
{
'trans_time': '14/07/2015 10:03:20',
'trans_type': 'DEBIT',
'description': 'DHL.COM TEXAS USA',
'amount': 239.95,
'currency': 'USD',
'location': 'TEXAS USA',
'merchant_name': 'DHL'}
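The same parsing can be wrapped in a small function that leaves the input untouched. This is only a sketch: the name parse_transaction is made up for illustration, and it assumes the description always has the shape "<amount> <currency> <merchant> <location words...>".

```python
def parse_transaction(transaction):
    """Return a copy of transaction with the description split into fields."""
    # Star-unpacking collects every word after the merchant into `location`
    amount, currency, merchant, *location = transaction["description"].split()
    parsed = dict(transaction)
    parsed["description"] = " ".join([merchant, *location])
    parsed["amount"] = float(amount)
    parsed["currency"] = currency
    parsed["location"] = " ".join(location)
    # Keep only the part of the merchant token before the first dot
    parsed["merchant_name"] = merchant.partition(".")[0]
    return parsed


transaction = {
    "trans_time": "14/07/2015 10:03:20",
    "trans_type": "DEBIT",
    "description": "239.95 USD DHL.COM TEXAS USA",
}
result = parse_transaction(transaction)
print(result)
```

A description with fewer than three words raises ValueError at the unpacking step, which may be preferable to producing a half-filled record.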

If you want to transform the dictionary in place, you do not need a copy of the original dictionary.
Just do something like this:
new_keys = ['amount', 'currency', 'merchant_name', 'location', 'trans_category']
values = transaction["description"].split(' ')
# Words are assigned by position, so 'location' receives 'TEXAS' and 'trans_category' receives 'USA'
for idx, key in enumerate(new_keys):
    if key == "amount":
        transaction[key] = float(values[idx])
    else:
        transaction[key] = values[idx]

How to create dataframe from nested JSON?

I'm trying to create a dataframe using the following JSON structure -
{
"tables" : {
"name" : "PrimaryResult",
"columns" : [
{
"name" : "EmployeeID",
"type" : "Int"
},
{
"name" : "EmployeeName",
"type" : "String"
},
{
"name" : "DepartmentName",
"type" : "String"
}
],
"rows" : [
[
123,
"John Doe",
"IT"
],
[
234,
"Jane Doe",
"HR"
]
]
}
}
I tried a few of the suggestions from "How to create pandas DataFrame from nested Json with list" and "How to parse nested JSON objects in spark sql?".
But I'm still confused. Essentially, the output should look something like this -
+----------+------------+--------------+
|EmployeeId|EmployeeName|DepartmentName|
+----------+------------+--------------+
| 123| John Doe| IT|
| 234| Jane Doe| HR|
+----------+------------+--------------+
I'm trying to avoid pandas because it runs into memory issues when the data is huge (I'm not sure if there is a way to handle them).
Please help.

See the logic below -
import json
# js is assumed to be a string holding the JSON document from the question
data = [json.loads(js)]
print(data)
# Output
[{'tables': {'name': 'PrimaryResult', 'columns': [{'name': 'EmployeeID', 'type': 'Int'}, {'name': 'EmployeeName', 'type': 'String'}, {'name': 'DepartmentName', 'type': 'String'}], 'rows': [[123, 'John Doe', 'IT'], [234, 'Jane Doe', 'HR']]}}]
Now fetch the columns as below -
columns = [col['name'] for col in data[0]['tables']['columns']]
print(columns)
#Output
['EmployeeID', 'EmployeeName', 'DepartmentName']
Create a dictionary of columns and rows as below -
dict_JSON = {}
dict_JSON["columns"] = columns
dict_JSON["data"] = data[0]['tables']['rows']
print(dict_JSON)
#Output
{'columns': ['EmployeeID', 'EmployeeName', 'DepartmentName'], 'data': [[123, 'John Doe', 'IT'], [234, 'Jane Doe', 'HR']]}
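Once you have this dictionary, create a pandas DataFrame and from it the Spark DataFrame (spark is an existing SparkSession) -
from io import StringIO
import pandas as pd
# Recent pandas versions expect a file-like object rather than a raw JSON string
pdf = pd.read_json(StringIO(json.dumps(dict_JSON)), orient='split')
df = spark.createDataFrame(pdf)
df.show()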
+----------+------------+--------------+
|EmployeeID|EmployeeName|DepartmentName|
+----------+------------+--------------+
| 123| John Doe| IT|
| 234| Jane Doe| HR|
+----------+------------+--------------+
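Since the question asks to avoid pandas, here is a sketch that skips the intermediate pandas DataFrame: the column names and rows are pulled out with plain Python, and the Spark call is left as a comment because it needs a running SparkSession. The variable js (the JSON document as a string) and the session name spark are assumptions.

```python
import json

# The JSON document from the question, held as a string (assumed name: js)
js = """
{
  "tables": {
    "name": "PrimaryResult",
    "columns": [
      {"name": "EmployeeID", "type": "Int"},
      {"name": "EmployeeName", "type": "String"},
      {"name": "DepartmentName", "type": "String"}
    ],
    "rows": [[123, "John Doe", "IT"], [234, "Jane Doe", "HR"]]
  }
}
"""

tables = json.loads(js)["tables"]
# Column names, in the order the row values appear
columns = [col["name"] for col in tables["columns"]]
# Each row becomes a tuple so it can be handed to Spark as a record
rows = [tuple(row) for row in tables["rows"]]
print(columns)
print(rows)

# With an active SparkSession named `spark` (assumed, not created here):
# df = spark.createDataFrame(rows, schema=columns)
# df.show()
```

Passing only column names lets Spark infer the types from the data; the "type" entries in the JSON are ignored in this sketch.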

Converting the CSV file to specified Json format

I am new to Python and don't know how to achieve this. I am trying to convert a CSV file to JSON. Each address has a type (1. Primary, 2. Work), and address is a multi-valued attribute: a person can have two Primary addresses.
Input Data in CSV format
"f_name"|"l_name"|"address_type"|"address_line_1"|"city"|"state"|"postal_code"|"country"
Brad|Pitt|Primary|"18 Atherton"|Irvine|CA|"92620-2501"|USA
Brad|Pitt|work|"1325 S Grand Ave"|Santa Ana|CA|"92705-4406"|USA
Output Expecting in JSON Format
{
"f_name": "Brad",
"l_name": "Pitt",
"parsed_address": [
{
"address_type": "Primary",
"address": [
{
"address_line_1": "18 Atherton",
"city": "Irvine",
"state": "CA",
"postal_code": "92620-2501",
"country": "USA"
}
]
},
{
"address_type": "work",
"address": [
{
"address_line_1": "1325 S Grand Ave",
"city": "Santa Ana",
"state": "CA",
"postal_code": "92705-4406",
"country": "USA"
}
]
}
]
}
Code Tried
df = pd.read_csv("file")
g_cols = ['f_name','l_name']
address_field = ['address']
cols = ['address_line_1', 'address_line_2', 'address_line_3', 'city', 'state', 'postal_code', 'country']
for i in g_cols:
    if i in dict_val.keys():
        g_cols[g_cols.index(i)] = dict_val[i]
for i in cols:
    if i in dict_val.keys():
        cols[cols.index(i)] = dict_val[i]
df2 = df.drop_duplicates().groupby(g_cols)[cols].apply(lambda x: x.to_dict('records')).reset_index(
    name=address_field).to_dict('records')

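You were close. This should do exactly what you aim to do.
import pandas as pd
df = pd.read_csv("data.csv", sep="|")
df
people = []
# The sample CSV identifies a person by the f_name and l_name columns
for (f_name, l_name), person in df.groupby(by=["f_name", "l_name"]):
    dic = {"f_name": f_name, "l_name": l_name}
    dic["parsed_address"] = []
    # Group this person's rows (not the whole frame) by address type
    for address_type, group in person.groupby(by="address_type"):
        address_dic = {}
        address_dic["address_type"] = address_type
        address_dic["address"] = group.drop(columns=["f_name", "l_name", "address_type"]).to_dict(orient="records")
        dic["parsed_address"].append(address_dic)
    people.append(dic)
people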

I think you can keep a dictionary (json_data in the code below) that tracks each person's data while iterating through the rows of the dataframe with for _, row in df.iterrows():
import json
import pandas as pd
df = pd.read_csv("file", delimiter='|')
print(df)
json_data = {}
for _, row in df.iterrows():
    # The sample CSV has f_name and l_name columns; together they identify a person
    name = (row["f_name"], row["l_name"])
    address_type = row["address_type"]
    if name not in json_data:
        json_data[name] = {
            "f_name": row["f_name"],
            "l_name": row["l_name"],
            "parsed_address": []
        }
    # Reuse the entry for this address type if the person already has one
    address_list = None
    for address in json_data[name]["parsed_address"]:
        if address["address_type"] == address_type:
            address_list = address
    if address_list is None:
        address_list = {
            "address_type": address_type,
            "address": []
        }
        json_data[name]["parsed_address"].append(address_list)
    address_list["address"].append({
        "address_line_1": row["address_line_1"],
        "city": row["city"],
        "state": row["state"],
        "postal_code": row["postal_code"],
        "country": row["country"]
    })
lst = list(json_data.values())
# Verify data parsing
print(json.dumps(lst, indent=2))
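For a version that can be run without any file on disk, the sketch below feeds the sample rows from the question through io.StringIO and nests two groupby calls. Reading every column as a string (dtype=str) and the variable names are assumptions made for the example.

```python
import io
import json

import pandas as pd

# Sample rows from the question, read from memory instead of a file
csv_text = '''"f_name"|"l_name"|"address_type"|"address_line_1"|"city"|"state"|"postal_code"|"country"
Brad|Pitt|Primary|"18 Atherton"|Irvine|CA|"92620-2501"|USA
Brad|Pitt|work|"1325 S Grand Ave"|Santa Ana|CA|"92705-4406"|USA
'''
df = pd.read_csv(io.StringIO(csv_text), sep="|", dtype=str)

person_cols = ["f_name", "l_name"]
people = []
# Outer grouping: one output object per person
for (f_name, l_name), person in df.groupby(person_cols, sort=False):
    # Inner grouping: one entry per address type, holding a list of addresses
    parsed = [
        {
            "address_type": address_type,
            "address": rows.drop(columns=person_cols + ["address_type"]).to_dict(orient="records"),
        }
        for address_type, rows in person.groupby("address_type", sort=False)
    ]
    people.append({"f_name": f_name, "l_name": l_name, "parsed_address": parsed})

print(json.dumps(people, indent=2))
```

Because the addresses of one type are collected into a list, a person with two Primary addresses ends up with two entries under the same "Primary" block.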

dic = {}
# Column names here (id, first_name, last_name) are this answer's own; adapt them to your file
g_cols = ['id', 'first_name', 'last_name']
for name, person in df.groupby(g_cols):
    id = name[0]
    dic["id"] = id
    dic["parsed_address"] = []
    # Group this person's rows by address type
    for address_type, group in person.groupby(by="address_type"):
        address_dic = {}
        address_dic["address_type"] = address_type
        address_dic["address"] = group.drop(
            columns=["id", "first_name", "last_name", "address_type"]).to_dict("records")
        dic["parsed_address"].append(address_dic)

Move values from DataFrame json column into Dataframe rows

I can't work out how to solve this problem.
I have a JSON file with nested dicts in one of its fields.
I load the JSON into a DataFrame with:
df2=pd.read_json(filename)
Now I have a DataFrame with a main column, SKU, and another column that contains a dict like:
"other_stores":
{"addrcode1": {"address": "Address1", "price_current": 990.0, "in_stock_count": 1, "price_original": 990.0},
"addrcode2": {"address": "Address2", "price_current": 990.0, "in_stock_count": 1, "price_original": 990.0}}
With a command like apply(pd.Series) I can move each address code into its own column of my DataFrame and create a table like this:
SKU   brand   title   addrcode1         addrcode2
sku1  brand1  title1  {addrcode1:data}  {addrcode2:data}
But what I'm trying to do is create a table like:
SKU   brand   title   address   address_price   address_stock
sku1  brand1  title1  address1  address1_price  address1_stock
sku1  brand1  title1  address2  address2_price  address2_stock
Sample of JSON is:
[
{
"SKU": "sampleSKU",
"brand": "My Brand",
"title": "My SKU title",
"other_stores": {
"addrcode1": {
"address": "Address1",
"price_current": 990,
"in_stock_count": 1,
"price_original": 990
},
"addrcode2": {
"address": "Address2",
"price_current": 990,
"in_stock_count": 1,
"price_original": 990
}
}
}
]

Using multiple df.apply(pd.Series) with df.stack()
Your "JSON" is not really in a structure that can be easily normalized. For your nested dictionary structured JSON, you can use the following:
Convert the list of dicts to pandas dataframe.
Then for the other_stores column, apply pd.Series followed by stacking the 2 addrcode1 and addrcode2,
Then another apply pd.Series to extract the columns needed
Finally concatenate the original dataframe with the nested one.
d = [{'SKU': 'sampleSKU 1',
'brand': 'My Brand 1',
'title': 'My SKU 1 title',
'other_stores': {'addrcode1': {'address': 'Address1',
'price_current': 990,
'in_stock_count': 1,
'price_original': 990},
'addrcode2': {'address': 'Address2',
'price_current': 990,
'in_stock_count': 1,
'price_original': 990}}},
{'SKU': 'sampleSKU 2',
'brand': 'My Brand 2',
'title': 'My SKU 2 title',
'other_stores': {'addrcode1': {'address': 'Address1',
'price_current': 990,
'in_stock_count': 1,
'price_original': 990},
'addrcode2': {'address': 'Address2',
'price_current': 990,
'in_stock_count': 1,
'price_original': 990}}}]
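import pandas as pd
df = pd.DataFrame(d)
nested = df['other_stores'].apply(pd.Series).stack().apply(pd.Series)
# drop() no longer accepts the axis as a positional argument in pandas 2.x, so name the column explicitly
pd.concat([df.drop(columns='other_stores'), nested.reset_index(-1, drop=True)], axis=1)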
Using pd.json_normalize() on a slightly modified JSON
If you have control over how your JSON is structured, I would advise adding lists to store the nested dicts. This lets you use pd.json_normalize as a convenience function, with the record_path and meta parameters -
d_fixed = [
{
"SKU": "sampleSKU",
"brand": "My Brand",
"title": "My SKU title",
"other_stores": [{ #<---- list start
"addrcode1": [{ #<---- nested list start
"address": "Address1",
"price_current": 990,
"in_stock_count": 1,
"price_original": 990
}], #<---- nested list end
"addrcode2": [{ #<---- nested list start
"address": "Address2",
"price_current": 990,
"in_stock_count": 1,
"price_original": 990
}] #<---- nested list end
}] #<---- list end
}
]
a1 = pd.json_normalize(d_fixed, record_path = ['other_stores','addrcode1'], meta=['SKU','brand','title'])
a2 = pd.json_normalize(d_fixed, record_path = ['other_stores','addrcode2'], meta=['SKU','brand','title'])
df = pd.concat([a1,a2])
df = df[['SKU','brand','title','address','price_current','in_stock_count','price_original']]
df
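If the JSON cannot be changed, or the store keys are not known in advance, the nested dicts can also be flattened in plain Python before building the DataFrame. This is a sketch under the assumption that every record has SKU, brand, title and an other_stores dict of per-store dicts; the variable names are illustrative.

```python
import pandas as pd

data = [
    {
        "SKU": "sampleSKU",
        "brand": "My Brand",
        "title": "My SKU title",
        "other_stores": {
            "addrcode1": {"address": "Address1", "price_current": 990,
                          "in_stock_count": 1, "price_original": 990},
            "addrcode2": {"address": "Address2", "price_current": 990,
                          "in_stock_count": 1, "price_original": 990},
        },
    }
]

# One output row per (product, store) pair; the store's fields are merged into the row
rows = [
    {"SKU": item["SKU"], "brand": item["brand"], "title": item["title"], **store}
    for item in data
    for store in item["other_stores"].values()
]
df = pd.DataFrame(rows)
print(df)
```

Iterating over .items() instead of .values() would also let the store code (addrcode1, addrcode2) be kept as its own column.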

Use -
a=[
{
"SKU": "sampleSKU",
"brand": "My Brand",
"title": "My SKU title",
"other_stores": {
"addrcode1": {
"address": "Address1",
"price_current": 990,
"in_stock_count": 1,
"price_original": 990
},
"addrcode2": {
"address": "Address2",
"price_current": 990,
"in_stock_count": 1,
"price_original": 990
}
}
}
]
pd.json_normalize(a)
Output
SKU brand title other_stores.addrcode1.address other_stores.addrcode1.price_current other_stores.addrcode1.in_stock_count other_stores.addrcode1.price_original other_stores.addrcode2.address other_stores.addrcode2.price_current other_stores.addrcode2.in_stock_count other_stores.addrcode2.price_original
0 sampleSKU My Brand My SKU title Address1 990 1 990 Address2 990 1 990
Update
Try this example -
import json
import pandas as pd
# load data using Python JSON module
with open('data/simple.json', 'r') as f:
    data = json.load(f)
# Flattening JSON data
pd.json_normalize(data)

Output pandas dataframe to json in a particular format

My dataframe is
fname lname city state code
Alice Lee Athens Alabama PXY
Nor Xi Mesa Arizona ABC
The output of json should be
{
"Employees":{
"Alice Lee":{
"code":"PXY",
"Address":"Athens, Alabama"
},
"Nor Xi":{
"code":"ABC",
"Address":"Mesa, Arizona"
}
}
}
df.to_json() gives no hierarchy to the JSON. Can you please suggest what I am missing? Is there a way to combine columns and give them a key name while writing JSON in pandas?
Thank you.

Try:
names = df[["fname", "lname"]].apply(" ".join, axis=1)
addresses = df[["city", "state"]].apply(", ".join, axis=1)
codes = df["code"]
out = {"Employees": {}}
for n, a, c in zip(names, addresses, codes):
    out["Employees"][n] = {"code": c, "Address": a}
print(out)
Prints:
{
"Employees": {
"Alice Lee": {"code": "PXY", "Address": "Athens, Alabama"},
"Nor Xi": {"code": "ABC", "Address": "Mesa, Arizona"},
}
}

We can populate a new dataframe with columns being "code" and "Address", and index being "full_name" where the latter two are generated from the dataframe's columns with string addition:
new_df = pd.DataFrame({"code": df["code"],
"Address": df["city"] + ", " + df["state"]})
new_df.index = df["fname"] + " " + df["lname"]
which gives
>>> new_df
code Address
Alice Lee PXY Athens, Alabama
Nor Xi ABC Mesa, Arizona
We can now call to_dict with orient="index":
>>> d = new_df.to_dict(orient="index")
>>> d
{"Alice Lee": {"code": "PXY", "Address": "Athens, Alabama"},
"Nor Xi": {"code": "ABC", "Address": "Mesa, Arizona"}}
To match your output, we wrap d with a dictionary:
>>> {"Employees": d}
{
"Employees":{
"Alice Lee":{
"code":"PXY",
"Address":"Athens, Alabama"
},
"Nor Xi":{
"code":"ABC",
"Address":"Mesa, Arizona"
}
}
}

import json
# Use a name other than json so the module is not shadowed
records = json.loads(df.to_json(orient='records'))
employees = {}
employees['Employees'] = [{obj['fname']+' '+obj['lname']:{'code':obj['code'], 'Address':obj['city']+', '+obj['state']}} for obj in records]
This outputs -
{
'Employees': [
{
'Alice Lee': {
'code': 'PXY',
'Address': 'Athens, Alabama'
}
},
{
'Nor Xi': {
'code': 'ABC',
'Address': 'Mesa, Arizona'
}
}
]
}

You can solve this using df.iterrows()
employee_dict = {}
for row in df.iterrows():
    # row[0] is the index label, row[1] is the data for that row
    row_data = row[1]
    employee_name = row_data.fname + ' ' + row_data.lname
    employee_dict[employee_name] = {'code': row_data.code, 'Address':
                                    row_data.city + ', ' + row_data.state}
json_data = {'Employees': employee_dict}
Result:
{'Employees': {'Alice Lee': {'code': 'PXY', 'Address': 'Athens, Alabama'},
'Nor Xi': {'code': 'ABC', 'Address': 'Mesa, Arizona'}}}
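The question asks for JSON text rather than a Python dict, so a last step with json.dumps is needed. The sketch below rebuilds the sample DataFrame from the question, builds the nested dict with itertuples, and serialises it; the variable names payload and json_text are illustrative.

```python
import json

import pandas as pd

# Sample data from the question
df = pd.DataFrame(
    {
        "fname": ["Alice", "Nor"],
        "lname": ["Lee", "Xi"],
        "city": ["Athens", "Mesa"],
        "state": ["Alabama", "Arizona"],
        "code": ["PXY", "ABC"],
    }
)

# Key each employee by full name; combine city and state into one Address string
employees = {
    f"{row.fname} {row.lname}": {"code": row.code, "Address": f"{row.city}, {row.state}"}
    for row in df.itertuples(index=False)
}
payload = {"Employees": employees}

# Serialise to JSON text; use json.dump(payload, file_obj) to write to a file instead
json_text = json.dumps(payload, indent=4)
print(json_text)
```

Two employees with the same full name would collide on the dictionary key, so a unique identifier may be a safer key for real data.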
