How to decode dictionary column when using pyarrow to read parquet files? - python

I have three .snappy.parquet files stored in an s3 bucket. I tried to use pandas.read_parquet(), but it only works when I specify one single parquet file, e.g. df = pandas.read_parquet("s3://bucketname/xxx.snappy.parquet"). If I don't specify the filename, df = pandas.read_parquet("s3://bucketname"), it does not work and gives me the error: Seek before start of file.
I did a lot of reading and found this page, which suggests that we can use pyarrow to read multiple parquet files, so here's what I tried:
import s3fs
import pyarrow.parquet as pq
s3 = s3fs.S3FileSystem()
bucket_uri = f's3://bucketname'
data = pq.ParquetDataset(bucket_uri, filesystem=s3)
df = data.read().to_pandas()
This works, but I found that the value for one of the columns in this df is a dictionary. How can I decode this dictionary so that the selected keys become column names and the values become the corresponding column values?
For example, the current column:
column_1
{'Id': 'xxxxx', 'name': 'xxxxx','age': 'xxxxx'....}
The expected column:
Id age
xxx xxx
xxx xxx
Here's the output for data.read().schema:
column_0: string
  -- field metadata --
  PARQUET:field_id: '1'
column_1: struct<Id: string, name: string, age: string, ...>
  child 0, Id: string
    -- field metadata --
    PARQUET:field_id: '3'
  child 1, name: string
    -- field metadata --
    PARQUET:field_id: '7'
  child 2, age: string
    -- field metadata --
    PARQUET:field_id: '8'
  ...

You have a column with a "struct type" and you want to flatten it. To do so, call flatten before calling to_pandas:
import pandas as pd
import pyarrow as pa

COLUMN1_SCHEMA = pa.struct([('Id', pa.string()), ('Name', pa.string()), ('Age', pa.string())])
SCHEMA = pa.schema([("column1", COLUMN1_SCHEMA), ('column2', pa.int32())])

df = pd.DataFrame({
    "column1": [("1", "foo", "16"), ("2", "bar", "17")],
    "column2": [1, 2],
})
pa.Table.from_pandas(df, SCHEMA).to_pandas() # without flatten
| column1 | column2 |
|:----------------------------------------|----------:|
| {'Id': '1', 'Name': 'foo', 'Age': '16'} | 1 |
| {'Id': '2', 'Name': 'bar', 'Age': '17'} | 2 |
pa.Table.from_pandas(df, SCHEMA).flatten().to_pandas() # with flatten
| column1.Id | column1.Name | column1.Age | column2 |
|-------------:|:---------------|--------------:|----------:|
| 1 | foo | 16 | 1 |
| 2 | bar | 17 | 2 |
As a side note, you shouldn't call it a dictionary column: dictionary is a loaded term in pyarrow and usually refers to dictionary encoding.
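For the dataset in the question, the same idea would look roughly like this (a sketch, reusing the bucket name and the field names from the schema shown above):
import s3fs
import pyarrow.parquet as pq

s3 = s3fs.S3FileSystem()
data = pq.ParquetDataset('s3://bucketname', filesystem=s3)

# flatten() turns the struct column into top-level columns named
# "column_1.Id", "column_1.name", "column_1.age", ...
df = data.read().flatten().to_pandas()
df = df[['column_1.Id', 'column_1.age']]  # keep only the fields you care about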
Edit: how to read a subset of columns in parquet
import pyarrow.parquet as pq
table = pa.Table.from_pandas(df, SCHEMA)
pq.write_table(table, 'data.pq')
# Using read_table:
pq.read_table('data.pq', columns=['column1.Id', 'column1.Age'])
# Using ParquetDataSet:
pq.ParquetDataset('data.pq').read(columns=['column1.Id', 'column1.Age'])
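Applied to the S3 dataset from the question, this pushes the selection down so only those struct fields are read (a sketch; the column names are taken from the schema above, and support for nested-column selection can vary between pyarrow versions):
# Hypothetical: select just two fields of the struct column straight from S3.
table = pq.ParquetDataset('s3://bucketname', filesystem=s3).read(
    columns=['column_1.Id', 'column_1.age'])
df = table.flatten().to_pandas()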

Related

Python Pandas Flatten nested JSON

I'm new to python & pandas and it took me a while to get the results I wanted using the code below. Basically, I am trying to flatten a nested JSON. While json_normalize works, there are columns that contain a list of objects (key/value). I want to break down that list and add the entries as separate columns. The sample code I wrote below worked out fine, but I was wondering if it could be simplified or improved further, or whether there is an alternative. Most of the articles I've found rely on actually naming the columns and such (code that I can't pick up on just yet), but I would rather have this as a function and name the columns dynamically. The output csv will be used for Power BI.
import json
import pandas as pd

with open('json.json', 'r') as json_file:
    jsondata = json.load(json_file)

df = pd.json_normalize(jsondata['results'], errors='ignore')
y = df.columns[df.applymap(type).eq(list).any()]
for x in y:
    df = df.explode(x).reset_index(drop=True)
    df_exploded = pd.json_normalize(df.pop(x))
    for i in df_exploded.columns:
        df_exploded = df_exploded.rename(columns={i: x + '_' + i})
    df = df.join(df_exploded)
df.to_csv('json.csv')
Sample JSON Format (Not including the large JSON I was working on):
data = {
    'results': [
        {
            'recordType': 'recordType',
            'id': 'id',
            'values': {
                'orderType': [{'value': 'value', 'text': 'text'}],
                'type': [{'value': 'value', 'text': 'text'}],
                'entity': [{'value': 'value', 'text': 'text'}],
                'trandate': 'trandate'
            }
        }
    ]
}
The values part doesn't get flattened by json_normalize and requires an explode and a join.
You can use something like this:
df = pd.json_normalize(jsondata['results'], meta=['recordType', 'id'])[['recordType', 'id', 'values.trandate']]
record_paths = [['values', 'orderType'], ['values', 'type'], ['values', 'entity']]
for i in record_paths:
    df = pd.concat([df, pd.json_normalize(jsondata['results'], record_path=i, record_prefix=i[1])], axis=1)
df.to_csv('json.csv')
Or (much faster):
df = pd.DataFrame({
    'recordType': [i['recordType'] for i in jsondata['results']],
    'id': [i['id'] for i in jsondata['results']],
    'values.trandate': [i['values']['trandate'] for i in jsondata['results']],
    'orderTypevalue': [i['values']['orderType'][0]['value'] for i in jsondata['results']],
    'orderTypetext': [i['values']['orderType'][0]['text'] for i in jsondata['results']],
    'typevalue': [i['values']['type'][0]['value'] for i in jsondata['results']],
    'typetext': [i['values']['type'][0]['text'] for i in jsondata['results']],
    'entityvalue': [i['values']['entity'][0]['value'] for i in jsondata['results']],
    'entitytext': [i['values']['entity'][0]['text'] for i in jsondata['results']]
})
df.to_csv('json.csv')
Output:
| | recordType | id | values.trandate | orderTypevalue | orderTypetext | typevalue | typetext | entityvalue | entitytext |
|---:|:-------------|:-----|:------------------|:-----------------|:----------------|:------------|:-----------|:--------------|:-------------|
| 0 | recordType | id | trandate | value | text | value | text | value | text |
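If you would rather not hardcode the key names at all (as asked in the question), a rough generalisation of the second approach could look like the sketch below; flatten_results is a hypothetical helper and assumes every list field holds exactly one dict, as in the sample JSON:
import pandas as pd

def flatten_results(results):
    """Flatten each result's 'values' dict, expanding single-element
    list fields into prefixed columns (e.g. 'orderTypevalue')."""
    rows = []
    for item in results:
        row = {'recordType': item['recordType'], 'id': item['id']}
        for key, value in item['values'].items():
            if isinstance(value, list):
                # take the first (and only) dict in the list and prefix its keys
                for inner_key, inner_value in value[0].items():
                    row[key + inner_key] = inner_value
            else:
                row['values.' + key] = value
        rows.append(row)
    return pd.DataFrame(rows)

df = flatten_results(data['results'])
df.to_csv('json.csv')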

How to transform json format into string column for python dataframe?

I got this dataframe:
Dataframe: df_case_1
Id RecordType
0 1234 {'attributes': {'type': 'RecordType', 'url': '/services/data/v55.0/sobjects/RecordType/1234', 'name', 'XYZ'}}
1 4321 {'attributes': {'type': 'RecordType', 'url': '/services/data/v55.0/sobjects/RecordType/4321', 'name', 'ABC'}}
I want to have this dataframe:
Dataframe: df_case_final
Id RecordType
0 1234 'XYZ'
1 4321 'ABC'
At the moment I use this statement, but it gives me the name at position 0 for every case object.
df_case_1['RecordType'] = df_case_1.RecordType[0]['Name']
How do I build the statement so that it gives me the correct name for every Id, like in df_case_final?
Thanks
There are 3 ways you can convert JSON to a pandas DataFrame:
import json
import pandas as pd

# 1. Use json_normalize() to convert JSON to a DataFrame
data_dict = json.loads(data)
df = pd.json_normalize(data_dict['technologies'])
# 2. Convert JSON to a DataFrame using read_json()
df2 = pd.read_json(jsonStr, orient='index')
# 3. Use pandas.DataFrame.from_dict() to convert JSON to a DataFrame
data_dict = json.loads(data)
df3 = pd.DataFrame.from_dict(data_dict, orient="index")
Now, after converting the JSON to a df, take the last column and append it to your original dataframe.
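If each cell of RecordType is already a Python dict with a top-level 'Name' key (which the statement in the question implies), a simpler per-row extraction with apply is also possible; a hedged sketch:
# Sketch: extract the name per row instead of always taking row 0.
# Assumes every cell is a dict exposing a 'Name' key; adjust the lookup
# (e.g. rec['attributes']['name']) if your payload nests it differently.
df_case_1['RecordType'] = df_case_1['RecordType'].apply(lambda rec: rec['Name'])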
Split your df by comma & trim unnecessary cols:
import pandas as pd
df=pd.read_csv(r"Hansmuff.csv")
df[['1', '2','3','required']]=df['RecordType'].str.split(',', expand=True)
df = df.drop(columns=['RecordType', '1','2','3'])
df['required'] = df['required'].str.strip('{}')
print(df)
output
Id required
0 1234 'XYZ'
1 4321 'ABC'

Convert nested dictionary to csv

I have a particular case where I want to create a csv using the inner values of a nested dictionary as the rows and the inner keys as the header. The 'healthy' key can contain more subkeys other than 'height' and 'weight', but 'unhealthy' will only ever contain None or a string of values.
My current dictionary looks like this:
{0: {'healthy': {'height': 160,
                 'weight': 180},
     'unhealthy': None},
 1: {'healthy': {'height': 170,
                 'weight': 250},
     'unhealthy': 'alcohol, smoking, overweight'}
}
How would I convert this to this csv:
+------+--------+----------------------------+
|height| weight| unhealthy|
+------+--------+----------------------------+
|160 | 180| |
+------+--------+----------------------------+
|170 | 250|alcohol, smoking, overweight|
+------+--------+----------------------------+
Is there any way of doing this without hardcoding it and without pandas, and saving the result to a location?
With D being your dictionary, you can pass D.values() to pandas.json_normalize() and rename the columns if needed.
>>> import pandas as pd
>>> print(pd.json_normalize(D.values()).to_markdown(tablefmt='psql'))
+----+------------------------------+------------------+------------------+
|    | unhealthy                    |   healthy.height |   healthy.weight |
|----+------------------------------+------------------+------------------|
|  0 |                              |              160 |              180 |
|  1 | alcohol, smoking, overweight |              170 |              250 |
+----+------------------------------+------------------+------------------+
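To get exactly the header from the question and write the result out, a short hypothetical follow-up (the column and file names here are just examples):
df = pd.json_normalize(D.values()).rename(
    columns={'healthy.height': 'height', 'healthy.weight': 'weight'})
df = df[['height', 'weight', 'unhealthy']]
df.to_csv('output.csv', index=False)  # None shows up as an empty cell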
So this may be a very dumb way to do this, but if your dictionary has this structure and you don't mind hardcoding the actual keys, this might be the way:
import csv

dictionary = {0: {'healthy': {'height': 160,
                              'weight': 180},
                  'unhealthy': None},
              1: {'healthy': {'height': 170,
                              'weight': 250},
                  'unhealthy': 'alcohol, smoking, overweight'}
              }

with open("out.csv", "w") as f:
    writer = csv.writer(f)
    writer.writerow(['height', 'weight', 'unhealthy'])
    writer.writerows([
        [value['healthy']['height'],
         value['healthy']['weight'],
         value['unhealthy']
        ] for key, value in dictionary.items()])
So the point is you just create a list of [<height>, <weight>, <unhealthy>] rows and write it to the csv file using the builtin csv module's writer.writerows().
I used pandas to deal with the values and to save the result as a csv file.
I loaded the JSON-format data as a dataframe.
By transposing the data, I got the 'unhealthy' column.
By using json_normalize(), I parsed the nested dictionary data in 'healthy' and generated the two columns 'height' and 'weight'.
Concatenated the 'height' and 'weight' columns with the 'unhealthy' column.
Saved the dataframe as a csv file.
import pandas as pd
val = {
0: {'healthy': {'height': 160, 'weight': 180},
'unhealthy': None},
1: {'healthy': {'height': 170, 'weight': 250},
'unhealthy': 'alcohol, smoking, overweight'}
}
df = pd.DataFrame.from_dict(val)
df = df.transpose()
df = pd.concat([pd.json_normalize(df['healthy'], max_level=1), df['unhealthy']], axis=1)
df.to_csv('filename.csv', index=False) # A csv file generated.
This is the csv file I generated (I opened it using MS Excel).

Handle JSON objects in CSV File and save to PySpark DataFrame

I have a CSV file which contains JSON objects as well as other data like String, Integer in it.
If I try to read the file as CSV, then the JSON objects overlap into other columns.
Column1, Column2, Column3, Column4, Column5
100,ABC,{"abc": [{"xyz": 0, "mno": "h"}, {"apple": 0, "hello": 1, "temp": "cnot"}]},foo, pine
101,XYZ,{"xyz": [{"abc": 0, "mno": "h"}, {"apple": 0, "hello": 1, "temp": "cnot"}]},bar, apple
I am getting output as:
Column1 | Column2 | Column3 | Column4 | Column5
100 | ABC | {"abc": [{"xyz": 0, "mno": "h"} | {"apple": 0, "hello": 1 | "temp": "cnot"}]}
101 | XYZ | {"xyz": [{"abc": 0, "mno": "h"} | {"xyz": [{"abc": 0, "mno": "h"} | "temp": "cnot"}]}
Test_File.py
from pyspark.sql import SparkSession

# Initializing SparkSession and setting up the file source
spark = SparkSession.builder.getOrCreate()
filepath = "s3a://file.csv"
df = spark.read.format("csv").options(header="true", delimiter=',', inferSchema='true').load(filepath)
df.show(5)
Also tried handling this issue by reading the file as text as discussed in this approach
'100,ABC,"{\'abc\':["{\'xyz\':0,\'mno\':\'h\'}","{\'apple\':0,\'hello\':1,\'temp\':\'cnot\'}"]}", foo, pine'
'101,XYZ,"{\'xyz\':["{\'abc\':0,\'mno\':\'h\'}","{\'apple\':0,\'hello\':1,\'temp\':\'cnot\'}"]}", bar, apple'
But instead of creating a new file, I wanted to load this quoted string as a PySpark DataFrame to run SQL queries on it. To create a DataFrame, I need to split the string again to assign each column, which results in splitting the JSON object again.
The issue is with the delimiter you are using. You are reading the CSV with a comma as the delimiter, and your JSON string contains commas, so Spark splits the JSON string on those commas as well, hence the output above. You will need a CSV with a delimiter that is unique and does not appear in any of the column values to overcome this.
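A sketch of what that could look like, assuming the file can be (re)written with a pipe as the delimiter (the path and delimiter here are hypothetical). Note that if the JSON column were instead wrapped in double quotes with its inner quotes escaped, Spark's default CSV quote handling would also keep the embedded commas together.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Hypothetical: the same data written with a pipe delimiter, e.g.
# 100|ABC|{"abc": [{"xyz": 0, "mno": "h"}, {"apple": 0, "hello": 1, "temp": "cnot"}]}|foo|pine
filepath = "s3a://file_pipe_delimited.csv"
df = (spark.read
      .option("header", "true")
      .option("delimiter", "|")
      .option("inferSchema", "true")
      .csv(filepath))
df.show(5, truncate=False)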

Split pyspark dataframe to chunks and convert to dictionary

I have a pyspark dataframe which looks like the following:
+----+--------------------+
| ID| Email|
+----+--------------------+
| 1| sample#example.org|
| 2| sample2#example.org|
| 3| sampleexample.org|
| 4| sample#exampleorg|
+----+--------------------+
What I need to do is to split it into chunks and then convert those chunks to dictionaries like:
chunk1
[{'ID': 1, 'Email': 'sample#example.org'}, {'ID': 2, 'Email': 'sample2#example.org'}]
chunk2
[{'ID': 3, 'Email': 'sampleexample.org'}, {'ID': 4, 'Email': 'sample#exampleorg'}]
I've found this post on SO, but I figured it would not make sense to first convert the chunks to a pandas dataframe and from there to a dictionary when I might be able to do it directly. Using the idea in that post, I've got the following code, but I'm not sure if this is the best way of doing it:
columns = spark_df.schema.fieldNames()
chunks = spark_df.repartition(num_chunks).rdd.mapPartitions(lambda iterator: [iterator.to_dict('records')]).toLocalIterator()
for list_of_dicts in chunks:
    # do work locally on list_of_dicts
You can return [[x.asDict() for x in iterator]] in the mapPartitions function (no need for pandas). [x.asDict() for x in iterator] creates a list of dicts including all rows in the same partition. We then enclose it in another list so that it is treated as a single item by toLocalIterator():
from json import dumps
num_chunks = 2
chunks = spark_df.repartition(num_chunks).rdd.mapPartitions(lambda iterator: [[x.asDict() for x in iterator]]).toLocalIterator()
for list_of_dicts in chunks:
    print(dumps(list_of_dicts))
#[{"ID": "2", "Email": "sample2#example.org"}, {"ID": "1", "Email": "sample#example.org"}]
#[{"ID": "4", "Email": "sample#exampleorg"}, {"ID": "3", "Email": "sampleexample.org"}]
