Convert parquet to list of objects in python - python

I am reading a parquet file with pandas:
import pandas as pd
df = pd.read_parquet('myfile.parquet', engine='pyarrow')
The file has the following structure:
    company_id  user_id           attribute_name  attribute_value  timestamp
1   116664      111f07000612      first_name      Tom              2022-03-23 17:11:58
2   116664      111f07000612      last_name       Cruise           2022-03-23 17:11:58
3   116664      111f07000612      city            New York         2022-03-23 17:11:58
4   116664      abcf0700d009d122  first_name      Matt             2022-02-23 10:11:59
5   116664      abcf0700d009d122  last_name       Damon            2022-02-23 10:11:59
I would like to group by user_id and generate a list of objects (that will be stored as json) with the following format:
[
  {
    "user_id": "111f07000612",
    "first_name": "Tom",
    "last_name": "Cruise",
    "city": "New York"
  },
  {
    "user_id": "abcf0700d009d122",
    "first_name": "Matt",
    "last_name": "Damon"
  }
]

Hi 👋🏻 Hope you are doing well!
You can achieve it with something similar to this 🙂
from pprint import pprint
import pandas as pd
# because I don't have the exact parquet file, I will just mock it
# df = pd.read_parquet("myfile.parquet", engine="pyarrow")
df = pd.DataFrame(
    {
        "company_id": [116664, 116664, 116664, 116664, 116664],
        "user_id": ["111f07000612", "111f07000612", "111f07000612", "abcf0700d009d122", "abcf0700d009d122"],
        "attribute_name": ["first_name", "last_name", "city", "first_name", "last_name"],
        "attribute_value": ["Tom", "Cruise", "New York", "Matt", "Damon"],
        "timestamp": ["2022-03-23 17:11:58", "2022-03-23 17:11:58", "2022-03-23 17:11:58", "2022-02-23 10:11:59", "2022-02-23 10:11:59"]
    }
)
records = []
for user_id, group in df.groupby("user_id"):
    # turn this user's attribute rows into a single row with one column per attribute
    transformed_group = (
        group[["attribute_name", "attribute_value"]]
        .set_index("attribute_name")
        .transpose()
        .assign(user_id=user_id)
    )
    # the transposed frame has exactly one row, so take the first record
    record, *_ = transformed_group.to_dict("records")
    records.append(record)
pprint(records)
# [{'city': 'New York',
#   'first_name': 'Tom',
#   'last_name': 'Cruise',
#   'user_id': '111f07000612'},
#  {'first_name': 'Matt', 'last_name': 'Damon', 'user_id': 'abcf0700d009d122'}]
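As a possible alternative to the transpose approach, here is a minimal sketch that builds each object directly from the attribute pairs and then serializes the list with json.dumps. The helper name rows_to_objects and the mocked frame are assumptions for illustration; with the real file you would start from pd.read_parquet instead.

```python
import json

import pandas as pd

# Mocked data with the same columns as the question (only the ones used here).
df = pd.DataFrame(
    {
        "user_id": ["111f07000612"] * 3 + ["abcf0700d009d122"] * 2,
        "attribute_name": ["first_name", "last_name", "city", "first_name", "last_name"],
        "attribute_value": ["Tom", "Cruise", "New York", "Matt", "Damon"],
    }
)


def rows_to_objects(frame):
    """Hypothetical helper: one dict per user_id, one key per attribute_name."""
    records = []
    # sort=False keeps users in order of first appearance
    for user_id, group in frame.groupby("user_id", sort=False):
        record = {"user_id": user_id}
        # pair each attribute name with its value for this user
        record.update(zip(group["attribute_name"], group["attribute_value"]))
        records.append(record)
    return records


records = rows_to_objects(df)
print(json.dumps(records, indent=2))
```

Users that lack an attribute (here, city for the second user) simply do not get that key, which matches the format asked for in the question.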

Related

How to map and update python dictionary with different key value pair?

I want to transform a Dictionary in Python, from Dictionary 1 Into Dictionary 2 as follows.
transaction = {
"trans_time": "14/07/2015 10:03:20",
"trans_type": "DEBIT",
"description": "239.95 USD DHL.COM TEXAS USA",
}
I want to transform the above dictionary to the following
transaction = {
"trans_time": "14/07/2015 10:03:20",
"trans_type": "DEBIT",
"description": "DHL.COM TEXAS USA",
"amount": 239.95,
"currency": "USD",
"location": "TEXAS USA",
"merchant_name": "DHL"
}
I tried the following but it did not work
dic1 = {
"trans_time": "14/07/2015 10:03:20",
"trans_type": "DEBIT",
"description": "239.95 USD DHL.COM TEXAS USA"
}
print(type(dic1))
copiedDic = dic1.copy()
print("copiedDic = ",copiedDic)
updatekeys = ['amount', 'currency', 'merchant_name', 'location', 'trans_category']
for key in dic1:
    if key == 'description':
        list_words = dic1[key].split(" ")
        newdict = {updatekeys[i]: x for i, x in enumerate(list_words)}
        copiedDic.update(newdict)
print(copiedDic)
I got The following result
{
'trans_time': '14/07/2015 10:03:20',
'trans_type': 'DEBIT',
'description': '239.95 USD DHL.COM TEXAS USA',
'amount': '239.95',
'currency': 'USD',
'merchant_name': 'DHL.COM',
'location': 'TEXAS',
'trans_category': 'USA'
}
My Intended output should look like this:
transaction = {
"trans_time": "14/07/2015 10:03:20",
"trans_type": "DEBIT",
"description": "DHL.COM TEXAS USA",
"amount": 239.95,
"currency": "USD",
"location": "TEXAS USA",
"merchant_name": "DHL"
}
I think it is easier to split the value into a list of words and parse that. Here the list aaa is built from the string transaction['description']. Where a field spans more than one word (more than one list element), join turns the slice back into a string. The amount is converted from a string to a float. For merchant_name, the part before the first dot is taken.
transaction = {
"trans_time": "14/07/2015 10:03:20",
"trans_type": "DEBIT",
"description": "239.95 USD DHL.COM TEXAS USA",
}
aaa = transaction['description'].split()
transaction['description'] = ' '.join(aaa[2:])
transaction['amount'] = float(aaa[0])
transaction['currency'] = aaa[1]
transaction['location'] = ' '.join(aaa[3:])
transaction['merchant_name'] = aaa[2].partition('.')[0]
print(transaction)
Output
{
'trans_time': '14/07/2015 10:03:20',
'trans_type': 'DEBIT',
'description': 'DHL.COM TEXAS USA',
'amount': 239.95,
'currency': 'USD',
'location': 'TEXAS USA',
'merchant_name': 'DHL'}
If you want to transform the dictionary in place, you do not need a copy of the original.
Just do something like this:
new_keys = ['amount', 'currency', 'merchant_name', 'location', 'trans_category']
values = transaction["description"].split(' ')
# note: one word is assigned per key, so location becomes 'TEXAS' and trans_category 'USA'
for idx, key in enumerate(new_keys):
    if key == "amount":
        transaction[key] = float(values[idx])
    else:
        transaction[key] = values[idx]
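If the description always follows the pattern "amount currency merchant location", a regular expression is another way to pull the fields apart. The sketch below is only an illustration under that assumption; the pattern DESCRIPTION_RE and the function parse_transaction are hypothetical names, and descriptions in other layouts are returned unchanged.

```python
import re

# Assumed layout: "<amount> <3-letter currency> <merchant> <location...>"
DESCRIPTION_RE = re.compile(
    r"^(?P<amount>\d+(?:\.\d+)?)\s+"
    r"(?P<currency>[A-Z]{3})\s+"
    r"(?P<merchant>\S+)\s+"
    r"(?P<location>.+)$"
)


def parse_transaction(transaction):
    """Return a new dict with the fields parsed out of 'description'."""
    result = dict(transaction)
    match = DESCRIPTION_RE.match(transaction["description"])
    if match is None:
        # description does not follow the assumed layout; leave it as it is
        return result
    merchant = match["merchant"]
    result.update(
        description=f"{merchant} {match['location']}",
        amount=float(match["amount"]),
        currency=match["currency"],
        location=match["location"],
        # keep the part before the first dot, e.g. "DHL" from "DHL.COM"
        merchant_name=merchant.partition(".")[0],
    )
    return result


transaction = {
    "trans_time": "14/07/2015 10:03:20",
    "trans_type": "DEBIT",
    "description": "239.95 USD DHL.COM TEXAS USA",
}
parsed = parse_transaction(transaction)
print(parsed)
```

Returning a new dict leaves the original transaction untouched, which can make it easier to compare input and output while debugging.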

How to create dataframe from nested JSON?

I'm trying to create a dataframe using the following JSON structure -
{
"tables" : {
"name" : "PrimaryResult",
"columns" : [
{
"name" : "EmployeeID",
"type" : "Int"
},
{
"name" : "EmployeeName",
"type" : "String"
},
{
"name" : "DepartmentName",
"type" : "String"
}
],
"rows" : [
[
123,
"John Doe",
"IT"
],
[
234,
"Jane Doe",
"HR"
]
]
}
}
I tried a few of the suggestions from "How to create pandas DataFrame from nested Json with list" and "How to parse nested JSON objects in spark sql?", but I'm still confused.
Essentially, the output should look something like this:
+----------+------------+--------------+
|EmployeeId|EmployeeName|DepartmentName|
+----------+------------+--------------+
|       123|    John Doe|            IT|
|       234|    Jane Doe|            HR|
+----------+------------+--------------+
I'm trying to avoid pandas, as it runs into a lot of memory issues when the data is huge (I'm not sure whether there is a way to handle them).
Please help.
See the logic below -
import json
# js is assumed to hold the JSON document from the question as a string
data = [json.loads(js)]
print(data)
# Output
[{'tables': {'name': 'PrimaryResult', 'columns': [{'name': 'EmployeeID', 'type': 'Int'}, {'name': 'EmployeeName', 'type': 'String'}, {'name': 'DepartmentName', 'type': 'String'}], 'rows': [[123, 'John Doe', 'IT'], [234, 'Jane Doe', 'HR']]}}]
Now fetch the columns as below -
columns = [col['name'] for col in data[0]['tables']['columns']]
print(columns)
#Output
['EmployeeID', 'EmployeeName', 'DepartmentName']
Create a dictionary of columns and rows as below -
dict_JSON = {}
dict_JSON["columns"] = columns
dict_JSON["data"] = data[0]['tables']['rows']
print(dict_JSON)
#Output
{'columns': ['EmployeeID', 'EmployeeName', 'DepartmentName'], 'data': [[123, 'John Doe', 'IT'], [234, 'Jane Doe', 'HR']]}
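Once you have this dictionary, create a pandas DataFrame and, from it, the Spark DataFrame -
import pandas as pd
from io import StringIO
# newer pandas versions expect a file-like object rather than a literal JSON string
pdf = pd.read_json(StringIO(json.dumps(dict_JSON)), orient='split')
# spark is assumed to be an existing SparkSession
df = spark.createDataFrame(pdf)
df.show()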
+----------+------------+--------------+
|EmployeeID|EmployeeName|DepartmentName|
+----------+------------+--------------+
|       123|    John Doe|            IT|
|       234|    Jane Doe|            HR|
+----------+------------+--------------+
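Since the question asks to avoid pandas, the column names and rows can also be pulled out with the standard library alone. This is a minimal sketch, not the answer's method: the variable js stands in for the JSON text from the question, and the Spark call in the final comment assumes an existing SparkSession named spark.

```python
import json

# Stand-in for the JSON document from the question.
js = """
{
  "tables": {
    "name": "PrimaryResult",
    "columns": [
      {"name": "EmployeeID", "type": "Int"},
      {"name": "EmployeeName", "type": "String"},
      {"name": "DepartmentName", "type": "String"}
    ],
    "rows": [
      [123, "John Doe", "IT"],
      [234, "Jane Doe", "HR"]
    ]
  }
}
"""

tables = json.loads(js)["tables"]
# column names in declaration order
columns = [col["name"] for col in tables["columns"]]
# one tuple per row, values in the same order as the columns
rows = [tuple(row) for row in tables["rows"]]
# optional: one dict per row, handy for inspection or for writing JSON lines
records = [dict(zip(columns, row)) for row in rows]

print(columns)
print(records)

# With an existing SparkSession called `spark` (an assumption, not created here),
# the rows and column names could then be passed on directly:
# df = spark.createDataFrame(rows, schema=columns)
```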

Converting the CSV file to specified Json format

I am new to Python and don't know how to achieve this. I am trying to convert a CSV file to JSON. Each address has a type (1. Primary, 2. Work), and address is a multi-valued attribute: a person can have two Primary addresses.
Input Data in CSV format
"f_name"|"l_name"|"address_type"|"address_line_1"|"city"|"state"|"postal_code"|"country"
Brad|Pitt|Primary|"18 Atherton"|Irvine|CA|"92620-2501"|USA
Brad|Pitt|work|"1325 S Grand Ave"|Santa Ana|CA|"92705-4406"|USA
Expected output in JSON format
{
  "f_name": "Brad",
  "l_name": "Pitt",
  "parsed_address": [
    {
      "address_type": "Primary",
      "address": [
        {
          "address_line_1": "18 Atherton",
          "city": "Irvine",
          "state": "CA",
          "postal_code": "92620-2501",
          "country": "USA"
        }
      ]
    },
    {
      "address_type": "work",
      "address": [
        {
          "address_line_1": "1325 S Grand Ave",
          "city": "Santa Ana",
          "state": "CA",
          "postal_code": "92705-4406",
          "country": "USA"
        }
      ]
    }
  ]
}
Code Tried
df = pd.read_csv("file")
g_cols = ['f_name','l_name']
address_field = ['address']
cols = ['address_line_1', 'address_line_2', 'address_line_3', 'city', 'state', 'postal_code', 'country']
for i in g_cols:
    if i in dict_val.keys():
        g_cols[g_cols.index(i)] = dict_val[i]
for i in cols:
    if i in dict_val.keys():
        cols[cols.index(i)] = dict_val[i]
df2 = df.drop_duplicates().groupby(g_cols)[cols].apply(lambda x: x.to_dict('records')).reset_index(
    name=address_field).to_dict('records')
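You were close. This should do exactly what you aim to do.
import pandas as pd
df = pd.read_csv("data.csv", sep="|")
df
result = []
# one output object per person (the CSV has f_name and l_name columns)
for (f_name, l_name), person in df.groupby(["f_name", "l_name"]):
    dic = {"f_name": f_name, "l_name": l_name, "parsed_address": []}
    # group this person's rows only; a type may hold several addresses
    for address_type, addresses in person.groupby("address_type"):
        address_dic = {}
        address_dic["address_type"] = address_type
        address_dic["address"] = addresses.drop(columns=["f_name", "l_name", "address_type"]).to_dict(orient="records")
        dic["parsed_address"].append(address_dic)
    result.append(dic)
result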
I think you can keep a dictionary (json_data in the code below) to track each person's data and iterate through the rows of the dataframe with for _, row in df.iterrows():
import json
import pandas as pd
df = pd.read_csv("file", delimiter='|')
print(df)
json_data = {}
for _, row in df.iterrows():
    # the question's CSV has f_name and l_name rather than a single name column
    name = (row["f_name"], row["l_name"])
    address_type = row["address_type"]
    address_line_1 = row["address_line_1"]
    city = row["city"]
    state = row["state"]
    postal_code = row["postal_code"]
    country = row["country"]
    if name not in json_data:
        json_data[name] = {
            "f_name": row["f_name"],
            "l_name": row["l_name"],
            "parsed_address": []
        }
    # look for an existing entry for this address type
    address_list = None
    for address in json_data[name]["parsed_address"]:
        if address["address_type"] == address_type:
            address_list = address
    if address_list is None:
        address_list = {
            "address_type": address_type,
            "address": []
        }
        json_data[name]["parsed_address"].append(address_list)
    address_list["address"].append({
        "address_line_1": address_line_1,
        "city": city,
        "state": state,
        "postal_code": postal_code,
        "country": country
    })
lst = list(json_data.values())
# Verify data parsing
print(json.dumps(lst, indent=2))
# variant for a frame that has id, first_name and last_name columns (not the question's CSV)
dic = {}
g_cols = ['id', 'first_name', 'last_name']
for name, group in df.groupby(g_cols):
    dic["id"] = name[0]
    dic["parsed_address"] = []
    # group the current person's rows, not the whole frame
    for address_type, addresses in group.groupby("address_type"):
        address_dic = {}
        address_dic["address_type"] = address_type
        address_dic["address"] = addresses.drop(
            columns=["id", "first_name", "last_name", "address_type"]).to_dict("records")
        dic["parsed_address"].append(address_dic)
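To make the grouping idea easy to try out, here is a self-contained sketch that reads the two sample rows from the question out of a string and builds the nested structure with two levels of groupby. The names CSV_TEXT, person_cols and address_cols are assumptions for illustration; with a real file you would pass its path to pd.read_csv instead of the StringIO object.

```python
import io
import json

import pandas as pd

# The sample rows from the question, pipe-delimited.
CSV_TEXT = '''"f_name"|"l_name"|"address_type"|"address_line_1"|"city"|"state"|"postal_code"|"country"
Brad|Pitt|Primary|"18 Atherton"|Irvine|CA|"92620-2501"|USA
Brad|Pitt|work|"1325 S Grand Ave"|Santa Ana|CA|"92705-4406"|USA
'''

# dtype=str keeps postal codes and similar fields as plain strings
df = pd.read_csv(io.StringIO(CSV_TEXT), sep="|", dtype=str)

person_cols = ["f_name", "l_name"]
# every remaining column except address_type describes the address itself
address_cols = [c for c in df.columns if c not in person_cols + ["address_type"]]

people = []
for (f_name, l_name), person in df.groupby(person_cols, sort=False):
    parsed = [
        {
            "address_type": address_type,
            # several rows of the same type end up in the same list
            "address": addresses[address_cols].to_dict(orient="records"),
        }
        for address_type, addresses in person.groupby("address_type", sort=False)
    ]
    people.append({"f_name": f_name, "l_name": l_name, "parsed_address": parsed})

print(json.dumps(people, indent=2))
```

The result is a list with one object per person, so a file with several people yields several objects rather than overwriting a single dictionary.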

Move values from DataFrame json column into Dataframe rows

I can't solve the following problem.
I have a JSON file with nested dicts in one of its columns.
I load the JSON into a DataFrame with:
df2=pd.read_json(filename)
Now I have a DataFrame with a main column, SKU, and another column that contains a dict like:
"other_stores":
{"addrcode1": {"address": "Address1", "price_current": 990.0, "in_stock_count": 1, "price_original": 990.0},
"addrcode2": {"address": "Address2", "price_current": 990.0, "in_stock_count": 1, "price_original": 990.0}}
With a command like apply(pd.Series), I can move each addrcode into its own column of my data frame and get a table like this:
SKU   brand   title   addrcode1         addrcode2
sku1  brand1  title1  {addrcode1:data}  {addrcode2:data}
But what I'm trying to create is a table like this:
SKU   brand   title   address   address_price   address_stock
sku1  brand1  title1  address1  address1_price  address1_stock
sku1  brand1  title1  address2  address2_price  address2_stock
Sample of JSON is:
[
{
"SKU": "sampleSKU",
"brand": "My Brand",
"title": "My SKU title",
"other_stores": {
"addrcode1": {
"address": "Address1",
"price_current": 990,
"in_stock_count": 1,
"price_original": 990
},
"addrcode2": {
"address": "Address2",
"price_current": 990,
"in_stock_count": 1,
"price_original": 990
}
}
}
]
Using multiple df.apply(pd.Series) with df.stack()
Your "JSON" is not really in a structure that can be easily normalized. For your nested dictionary structured JSON, you can use the following:
Convert the list of dicts to a pandas DataFrame.
For the other_stores column, apply pd.Series, then stack the addrcode1 and addrcode2 columns.
Apply pd.Series again to extract the columns needed.
Finally, combine the original DataFrame with the nested one.
d = [{'SKU': 'sampleSKU 1',
'brand': 'My Brand 1',
'title': 'My SKU 1 title',
'other_stores': {'addrcode1': {'address': 'Address1',
'price_current': 990,
'in_stock_count': 1,
'price_original': 990},
'addrcode2': {'address': 'Address2',
'price_current': 990,
'in_stock_count': 1,
'price_original': 990}}},
{'SKU': 'sampleSKU 2',
'brand': 'My Brand 2',
'title': 'My SKU 2 title',
'other_stores': {'addrcode1': {'address': 'Address1',
'price_current': 990,
'in_stock_count': 1,
'price_original': 990},
'addrcode2': {'address': 'Address2',
'price_current': 990,
'in_stock_count': 1,
'price_original': 990}}}]
df = pd.DataFrame(d)
nested = df['other_stores'].apply(pd.Series).stack().apply(pd.Series)
# drop(..., 1) no longer works in pandas 2.x; join on the row index so each SKU row repeats once per address
df.drop(columns='other_stores').join(nested.reset_index(-1, drop=True))
Using pd.json_normalize() on a slightly modified JSON
If you do have control over how your JSON is structured, I would advise using lists to store the nested dicts. This lets you use pd.json_normalize as a convenience function, with the record_path and meta parameters -
d_fixed = [
{
"SKU": "sampleSKU",
"brand": "My Brand",
"title": "My SKU title",
"other_stores": [{ #<---- list start
"addrcode1": [{ #<---- nested list start
"address": "Address1",
"price_current": 990,
"in_stock_count": 1,
"price_original": 990
}], #<---- nested list end
"addrcode2": [{ #<---- nested list start
"address": "Address2",
"price_current": 990,
"in_stock_count": 1,
"price_original": 990
}] #<---- nested list end
}] #<---- list end
}
]
a1 = pd.json_normalize(d_fixed, record_path = ['other_stores','addrcode1'], meta=['SKU','brand','title'])
a2 = pd.json_normalize(d_fixed, record_path = ['other_stores','addrcode2'], meta=['SKU','brand','title'])
df = pd.concat([a1,a2])
df = df[['SKU','brand','title','address','price_current','in_stock_count','price_original']]
df
Use -
a=[
{
"SKU": "sampleSKU",
"brand": "My Brand",
"title": "My SKU title",
"other_stores": {
"addrcode1": {
"address": "Address1",
"price_current": 990,
"in_stock_count": 1,
"price_original": 990
},
"addrcode2": {
"address": "Address2",
"price_current": 990,
"in_stock_count": 1,
"price_original": 990
}
}
}
]
pd.json_normalize(a)
Output
SKU brand title other_stores.addrcode1.address other_stores.addrcode1.price_current other_stores.addrcode1.in_stock_count other_stores.addrcode1.price_original other_stores.addrcode2.address other_stores.addrcode2.price_current other_stores.addrcode2.in_stock_count other_stores.addrcode2.price_original
0 sampleSKU My Brand My SKU title Address1 990 1 990 Address2 990 1 990
Update
Try this example -
import json
# load data using Python JSON module
with open('data/simple.json','r') as f:
    data = json.loads(f.read())
# Flattening JSON data
pd.json_normalize(data)
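To get one row per store, as in the table the question asks for, the nested dict can also be flattened with a plain comprehension before the DataFrame is built. This is a sketch under the assumption that every item has an other_stores dict; the output column names address_price and address_stock are taken from the question's desired table.

```python
import pandas as pd

# Sample record from the question.
data = [
    {
        "SKU": "sampleSKU",
        "brand": "My Brand",
        "title": "My SKU title",
        "other_stores": {
            "addrcode1": {"address": "Address1", "price_current": 990,
                          "in_stock_count": 1, "price_original": 990},
            "addrcode2": {"address": "Address2", "price_current": 990,
                          "in_stock_count": 1, "price_original": 990},
        },
    }
]

# One output row per (item, store) pair; the addrcode keys themselves are dropped.
rows = [
    {
        "SKU": item["SKU"],
        "brand": item["brand"],
        "title": item["title"],
        "address": store["address"],
        "address_price": store["price_current"],
        "address_stock": store["in_stock_count"],
    }
    for item in data
    for store in item["other_stores"].values()
]

df = pd.DataFrame(rows)
print(df)
```

Doing the flattening in plain Python keeps the pandas step trivial and avoids applying pd.Series row by row, which tends to be slow on large frames.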

Output pandas dataframe to json in a particular format

My dataframe is
fname lname city state code
Alice Lee Athens Alabama PXY
Nor Xi Mesa Arizona ABC
The output of json should be
{
"Employees":{
"Alice Lee":{
"code":"PXY",
"Address":"Athens, Alabama"
},
"Nor Xi":{
"code":"ABC",
"Address":"Mesa, Arizona"
}
}
}
df.to_json() gives no hierarchy to the JSON. Can you please suggest what I am missing? Is there a way to combine columns and give them a key name while writing JSON in pandas?
Thank you.
Try:
names = df[["fname", "lname"]].apply(" ".join, axis=1)
addresses = df[["city", "state"]].apply(", ".join, axis=1)
codes = df["code"]
out = {"Employees": {}}
for n, a, c in zip(names, addresses, codes):
    out["Employees"][n] = {"code": c, "Address": a}
print(out)
Prints:
{
"Employees": {
"Alice Lee": {"code": "PXY", "Address": "Athens, Alabama"},
"Nor Xi": {"code": "ABC", "Address": "Mesa, Arizona"},
}
}
We can populate a new dataframe with columns being "code" and "Address", and index being "full_name" where the latter two are generated from the dataframe's columns with string addition:
new_df = pd.DataFrame({"code": df["code"],
"Address": df["city"] + ", " + df["state"]})
new_df.index = df["fname"] + " " + df["lname"]
which gives
>>> new_df
code Address
Alice Lee PXY Athens, Alabama
Nor Xi ABC Mesa, Arizona
We can now call to_dict with orient="index":
>>> d = new_df.to_dict(orient="index")
>>> d
{"Alice Lee": {"code": "PXY", "Address": "Athens, Alabama"},
"Nor Xi": {"code": "ABC", "Address": "Mesa, Arizona"}}
To match your output, we wrap d with a dictionary:
>>> {"Employees": d}
{
"Employees":{
"Alice Lee":{
"code":"PXY",
"Address":"Athens, Alabama"
},
"Nor Xi":{
"code":"ABC",
"Address":"Mesa, Arizona"
}
}
}
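import json
# use a name other than json for the result so the module is not shadowed
records = json.loads(df.to_json(orient='records'))
employees = {}
employees['Employees'] = [{obj['fname']+' '+obj['lname']:{'code':obj['code'], 'Address':obj['city']+', '+obj['state']}} for obj in records]
This outputs the following (note that Employees is a list of single-key dicts here, not one dict) -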
{
'Employees': [
{
'Alice Lee': {
'code': 'PXY',
'Address': 'Athens, Alabama'
}
},
{
'Nor Xi': {
'code': 'ABC',
'Address': 'Mesa, Arizona'
}
}
]
}
You can solve this using df.iterrows():
employee_dict = {}
for row in df.iterrows():
    # row[0] is the index label, row[1] is the row data for that index
    row_data = row[1]
    employee_name = row_data.fname + ' ' + row_data.lname
    employee_dict[employee_name] = {'code': row_data.code, 'Address':
                                    row_data.city + ', ' + row_data.state}
json_data = {'Employees': employee_dict}
Result:
{'Employees': {'Alice Lee': {'code': 'PXY', 'Address': 'Athens, Alabama'},
'Nor Xi': {'code': 'ABC', 'Address': 'Mesa, Arizona'}}}
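The answers above build a Python dict; to get actual JSON text, the dict still has to be serialized. The following sketch combines a dict comprehension over itertuples with json.dumps. The sample frame mirrors the question's data, and the variable names employees and payload are assumptions for illustration.

```python
import json

import pandas as pd

# Sample frame with the columns from the question.
df = pd.DataFrame(
    {
        "fname": ["Alice", "Nor"],
        "lname": ["Lee", "Xi"],
        "city": ["Athens", "Mesa"],
        "state": ["Alabama", "Arizona"],
        "code": ["PXY", "ABC"],
    }
)

# Key each employee by full name; combine city and state into one Address field.
employees = {
    f"{row.fname} {row.lname}": {
        "code": row.code,
        "Address": f"{row.city}, {row.state}",
    }
    for row in df.itertuples(index=False)
}

# Wrap under the top-level key and serialize to a JSON string.
payload = json.dumps({"Employees": employees}, indent=2)
print(payload)
```

If two employees share the same full name, the later row overwrites the earlier one, so a unique key such as the code may be a safer choice for real data.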
