I'm new to Python and pandas, and it took me a while to get the results I wanted with the code below. Basically, I am trying to flatten a nested JSON. While json_normalize works, there are columns that contain a list of objects (key/value pairs), and I want to break those lists down and add them as separate columns. The sample code I wrote below works fine, but I was wondering if it could be simplified or improved, or whether there is a better alternative. Most of the articles I've found rely on naming the columns explicitly (code I can't quite follow yet), but I would rather have this as a function and name the columns dynamically. The output CSV will be used for Power BI.
import json
import pandas as pd

with open('json.json', 'r') as json_file:
    jsondata = json.load(json_file)

df = pd.json_normalize(jsondata['results'], errors='ignore')

# columns whose cells contain lists
y = df.columns[df.applymap(type).eq(list).any()]
for x in y:
    df = df.explode(x).reset_index(drop=True)
    df_exploded = pd.json_normalize(df.pop(x))
    for i in df_exploded.columns:
        df_exploded = df_exploded.rename(columns={i: x + '_' + i})
    df = df.join(df_exploded)

df.to_csv('json.csv')
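For reference, this is roughly what I mean by having it as a function (the name flatten_json_file is just a placeholder):

import json
import pandas as pd

def flatten_json_file(in_path, out_path):
    with open(in_path, 'r') as json_file:
        jsondata = json.load(json_file)
    df = pd.json_normalize(jsondata['results'], errors='ignore')
    # columns whose cells hold lists of {key: value} objects
    list_cols = df.columns[df.applymap(type).eq(list).any()]
    for col in list_cols:
        df = df.explode(col).reset_index(drop=True)
        exploded = pd.json_normalize(df.pop(col)).add_prefix(col + '_')
        df = df.join(exploded)
    df.to_csv(out_path)

flatten_json_file('json.json', 'json.csv')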
Sample JSON format (not including the large JSON I was working on):
data = {
    'results': [
        {
            'recordType': 'recordType',
            'id': 'id',
            'values': {
                'orderType': [{'value': 'value', 'text': 'text'}],
                'type': [{'value': 'value', 'text': 'text'}],
                'entity': [{'value': 'value', 'text': 'text'}],
                'trandate': 'trandate'
            }
        }
    ]
}
The values part doesn't get flattened by json_normalize and requires the explode-and-join step above.
You can use something like this:
df = pd.json_normalize(jsondata['results'], meta=['recordType', 'id'])[['recordType', 'id', 'values.trandate']]
record_paths = [['values', 'orderType'], ['values', 'type'], ['values', 'entity']]
for i in record_paths:
    df = pd.concat([df, pd.json_normalize(jsondata['results'], record_path=i, record_prefix=i[1])], axis=1)
df.to_csv('json.csv')
Or (much faster):
df = pd.DataFrame({
    'recordType': [i['recordType'] for i in jsondata['results']],
    'id': [i['id'] for i in jsondata['results']],
    'values.trandate': [i['values']['trandate'] for i in jsondata['results']],
    'orderTypevalue': [i['values']['orderType'][0]['value'] for i in jsondata['results']],
    'orderTypetext': [i['values']['orderType'][0]['text'] for i in jsondata['results']],
    'typevalue': [i['values']['type'][0]['value'] for i in jsondata['results']],
    'typetext': [i['values']['type'][0]['text'] for i in jsondata['results']],
    'entityvalue': [i['values']['entity'][0]['value'] for i in jsondata['results']],
    'entitytext': [i['values']['entity'][0]['text'] for i in jsondata['results']]})
df.to_csv('json.csv')
Output:
| | recordType | id | values.trandate | orderTypevalue | orderTypetext | typevalue | typetext | entityvalue | entitytext |
|---:|:-------------|:-----|:------------------|:-----------------|:----------------|:------------|:-----------|:--------------|:-------------|
| 0 | recordType | id | trandate | value | text | value | text | value | text |
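If you want to avoid hardcoding record_paths, a rough sketch that derives them from the first record (it assumes every record's 'values' contains the same list-valued keys) and then reuses the loop above:

record_paths = [['values', key]
                for key, val in jsondata['results'][0]['values'].items()
                if isinstance(val, list)]

df = pd.json_normalize(jsondata['results'], meta=['recordType', 'id'])[['recordType', 'id', 'values.trandate']]
for i in record_paths:
    df = pd.concat([df, pd.json_normalize(jsondata['results'], record_path=i, record_prefix=i[1])], axis=1)
df.to_csv('json.csv')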
I have a particular case where I want to create a CSV using the inner keys of a nested dictionary as the header and the inner values as the rows. The 'healthy' key can contain more subkeys than just 'height' and 'weight', but 'unhealthy' will only ever contain either None or a string of values.
My current dictionary looks like this:
{0: {'healthy': {'height': 160,
                 'weight': 180},
     'unhealthy': None},
 1: {'healthy': {'height': 170,
                 'weight': 250},
     'unhealthy': 'alcohol, smoking, overweight'}
}
How would I convert this to this csv:
+------+--------+----------------------------+
|height| weight| unhealthy|
+------+--------+----------------------------+
|160 | 180| |
+------+--------+----------------------------+
|170 | 250|alcohol, smoking, overweight|
+------+--------+----------------------------+
Is there any way of doing this without hardcoding, without Pandas, and saving it to a location?
With D being your dictionary you can pass D.values() to pandas.json_normalize() and rename the columns if needed.
>>> import pandas as pd
>>> print(pd.json_normalize(D.values()).to_markdown(tablefmt='psql'))
+----+------------------------------+------------------+------------------+
|    | unhealthy                    |   healthy.height |   healthy.weight |
|----+------------------------------+------------------+------------------|
|  0 |                              |              160 |              180 |
|  1 | alcohol, smoking, overweight |              170 |              250 |
+----+------------------------------+------------------+------------------+
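If you also want to drop the 'healthy.' prefix, match the column order of your desired csv, and save it somewhere, a small follow-up sketch (the output path is a placeholder):

df = pd.json_normalize(D.values())
df.columns = [c.split('.')[-1] for c in df.columns]   # healthy.height -> height
df['unhealthy'] = df['unhealthy'].fillna('')          # None -> empty cell
df[['height', 'weight', 'unhealthy']].to_csv('out.csv', index=False)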
This may be a very dumb way to do it, but if your dictionary has this structure and you don't mind hardcoding the keys, this might be the way:
import csv

dictionary = {0: {'healthy': {'height': 160,
                              'weight': 180},
                  'unhealthy': None},
              1: {'healthy': {'height': 170,
                              'weight': 250},
                  'unhealthy': 'alcohol, smoking, overweight'}}

with open("out.csv", "w") as f:
    writer = csv.writer(f)
    writer.writerow(['height', 'weight', 'unhealthy'])
    writer.writerows([
        [value['healthy']['height'],
         value['healthy']['weight'],
         value['unhealthy']]
        for key, value in dictionary.items()])
The point is that you just create a list of [<height>, <weight>, <unhealthy>] rows and write them to the CSV file with csv.writer.writerows() from Python's built-in csv module.
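If you'd rather not hardcode the header at all, a rough sketch using csv.DictWriter that takes the field names from the first record (it assumes every 'healthy' dict has the same keys):

import csv

rows = []
for value in dictionary.values():
    row = dict(value['healthy'])                 # inner keys ('height', 'weight', ...) become columns
    row['unhealthy'] = value['unhealthy'] or ''  # None -> empty cell
    rows.append(row)

with open('out.csv', 'w', newline='') as f:
    writer = csv.DictWriter(f, fieldnames=list(rows[0]))
    writer.writeheader()
    writer.writerows(rows)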
I used pandas to deal with the values and to save them as a csv file.
I loaded the JSON-format data as a dataframe.
By transposing the data, I got the 'unhealthy' column.
By using json_normalize(), I parsed the nested dictionary data under 'healthy' and generated the two columns 'height' and 'weight'.
I concatenated the 'height' and 'weight' columns with the 'unhealthy' column.
I saved the dataframe as a csv file.
import pandas as pd
val = {
0: {'healthy': {'height': 160, 'weight': 180},
'unhealthy': None},
1: {'healthy': {'height': 170, 'weight': 250},
'unhealthy': 'alcohol, smoking, overweight'}
}
df = pd.DataFrame.from_dict(val)
df = df.transpose()
df = pd.concat([pd.json_normalize(df['healthy'], max_level=1), df['unhealthy']], axis=1)
df.to_csv('filename.csv', index=False) # A csv file generated.
This is the csv file I generated (I opened it using MS Excel).
I have three .snappy.parquet files stored in an S3 bucket. I tried to use pandas.read_parquet(), but it only works when I specify a single parquet file, e.g. df = pandas.read_parquet("s3://bucketname/xxx.snappy.parquet"). If I don't specify the filename, df = pandas.read_parquet("s3://bucketname"), it doesn't work and gives me the error: Seek before start of file.
I did a lot of reading, then I found this page, which suggests that we can use pyarrow to read multiple parquet files, so here's what I tried:
import s3fs
import pyarrow.parquet as pq

s3 = s3fs.S3FileSystem()
bucket_uri = f's3://bucketname'
data = pq.ParquetDataset(bucket_uri, filesystem=s3)
df = data.read().to_pandas()
This works, but I found that the values in one of the columns of this df are dictionaries. How can I unpack those dictionaries so that selected keys become column names and their values become the corresponding column values?
For example, the current column:
column_1
{'Id': 'xxxxx', 'name': 'xxxxx','age': 'xxxxx'....}
The expected column:
Id age
xxx xxx
xxx xxx
Here's the output for data.read().schema:
column_0: string
  -- field metadata --
  PARQUET:field_id: '1'
column_1: struct<Id: string, name: string, age: string, .......>
  child 0, Id: string
    -- field metadata --
    PARQUET:field_id: '3'
  child 1, name: string
    -- field metadata --
    PARQUET:field_id: '7'
  child 2, age: string
    -- field metadata --
    PARQUET:field_id: '8'
  ...........
  ...........
You have a column with a "struct type" and you want to flatten it. To do so, call flatten() before calling to_pandas():
import pandas as pd
import pyarrow as pa

COLUMN1_SCHEMA = pa.struct([('Id', pa.string()), ('Name', pa.string()), ('Age', pa.string())])
SCHEMA = pa.schema([("column1", COLUMN1_SCHEMA), ('column2', pa.int32())])

df = pd.DataFrame({
    "column1": [("1", "foo", "16"), ("2", "bar", "17")],
    "column2": [1, 2],
})

pa.Table.from_pandas(df, SCHEMA).to_pandas()  # without flatten
| column1 | column2 |
|:----------------------------------------|----------:|
| {'Id': '1', 'Name': 'foo', 'Age': '16'} | 1 |
| {'Id': '2', 'Name': 'bar', 'Age': '17'} | 2 |
pa.Table.from_pandas(df, SCHEMA).flatten().to_pandas() # with flatten
| column1.Id | column1.Name | column1.Age | column2 |
|-------------:|:---------------|--------------:|----------:|
| 1 | foo | 16 | 1 |
| 2 | bar | 17 | 2 |
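Applied to the dataset read in the question, that would look roughly like this (same bucket placeholder as in the question):

import s3fs
import pyarrow.parquet as pq

s3 = s3fs.S3FileSystem()
data = pq.ParquetDataset('s3://bucketname', filesystem=s3)
# flatten() turns column_1 into column_1.Id, column_1.name, column_1.age, ...
df = data.read().flatten().to_pandas()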
As a side note, you shouldn't call it a dictionary column. "Dictionary" is a loaded term in pyarrow and usually refers to dictionary encoding.
Edit: how to read a subset of columns in parquet
import pyarrow.parquet as pq
table = pa.Table.from_pandas(df, SCHEMA)
pq.write_table(table, 'data.pq')
# Using read_table:
pq.read_table('data.pq', columns=['column1.Id', 'column1.Age'])
# Using ParquetDataSet:
pq.ParquetDataset('data.pq').read(columns=['column1.Id', 'column1.Age'])
I have dataframe_1:
+-------------+----+---------+
| Name| Age| Salary|
+-------------+----+---------+
|Avery Bradley|25.0|7730337.0|
| Jae Crowder|25.0|6796117.0|
+-------------+----+---------+
and want to transform it to dataframe_2:
+----------------------------------------------------------------------------------------------------------------------+
| json_data |
+----------------------------------------------------------------------------------------------------------------------+
|[{"Name": "Avery Bradley", "Age": 25.0, "Salary" 7730337.0}, {"Name": "Jae Crowder", "Age": 25.0, "Salary" 6796117.0}]|
+----------------------------------------------------------------------------------------------------------------------+
I can do dataframe_1.toPandas().to_dict(orient="records"), but this would be a DataFrame-to-dict (JSON object) transformation, and I need a DataFrame-to-DataFrame transformation.
A solution in PySpark, if possible, would be appreciated.
You can do a collect_list of json:
import pyspark.sql.functions as F
df2 = df.agg(F.collect_list(F.to_json(F.struct('*'))).alias('json_data'))
df2.show(truncate=False)
+--------------------------------------------------------------------------------------------------------------+
|json_data |
+--------------------------------------------------------------------------------------------------------------+
|[{"Name":"Avery Bradley","Age":25.0,"Salary":7730337.0}, {"Name":"Jae Crowder","Age":25.0,"Salary":6796117.0}]|
+--------------------------------------------------------------------------------------------------------------+
You can merge the columns into a map and then create JSON out of it:
import pyspark.sql.functions as F

(df
 .withColumn('json', F.to_json(F.create_map(
     F.lit('name'), F.col('name'),
     F.lit('age'), F.col('age'),
     F.lit('salary'), F.col('salary'),
 )))
 .agg(F.collect_list('json').alias('json_value'))
)
+----------------------------------------------------------------------------------------------------------------------+
|json_value |
+----------------------------------------------------------------------------------------------------------------------+
|[{"name":"Avery Bradley","age":"25.0","salary":"7730337.0"}, {"name":"Jae Crowder","age":"25.0","salary":"6796117.0"}]|
+----------------------------------------------------------------------------------------------------------------------+
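Note that create_map casts all values to a common type, which is why the numbers show up as quoted strings above. If the original types matter, a struct-based variant (as in the previous answer) keeps them; a sketch, assuming you still want lowercase keys:

import pyspark.sql.functions as F

(df
 .withColumn('json', F.to_json(F.struct(
     F.col('Name').alias('name'),
     F.col('Age').alias('age'),
     F.col('Salary').alias('salary'),
 )))
 .agg(F.collect_list('json').alias('json_value'))
)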
I have a pyspark dataframe which looks like the following:
+----+--------------------+
| ID| Email|
+----+--------------------+
| 1| sample#example.org|
| 2| sample2#example.org|
| 3| sampleexample.org|
| 4| sample#exampleorg|
+----+--------------------+
What I need to do is to split it into chunks and then convert those chunks to dictionaries like:
chunk1
[{'ID': 1, 'Email': 'sample#example.org'}, {'ID': 2, 'Email': 'sample2#example.org'}]
chunk2
[{'ID': 3, 'Email': 'sampleexample.org'}, {'ID': 4, 'Email': 'sample#exampleorg'}]
I've found this post on SO, but I figured it would not make sense to first convert the chunks to a pandas dataframe and from there to a dictionary when I might be able to do it directly. Using the idea in that post, I've got the following code, but I'm not sure if this is the best way of doing it:
columns = spark_df.schema.fieldNames()
chunks = spark_df.repartition(num_chunks).rdd.mapPartitions(lambda iterator: [iterator.to_dict('records')]).toLocalIterator()
for list_of_dicts in chunks:
    # do work locally on list_of_dicts
You can return [[x.asDict() for x in iterator]] in the mapPartitions function (no need for Pandas). [x.asDict() for x in iterator] creates a list of dicts including all rows in the same partition; we then wrap it in another list so that it is treated as a single item by toLocalIterator():
from json import dumps

num_chunks = 2
chunks = spark_df.repartition(num_chunks).rdd.mapPartitions(lambda iterator: [[x.asDict() for x in iterator]]).toLocalIterator()
for list_of_dicts in chunks:
    print(dumps(list_of_dicts))

# [{"ID": "2", "Email": "sample2#example.org"}, {"ID": "1", "Email": "sample#example.org"}]
# [{"ID": "4", "Email": "sample#exampleorg"}, {"ID": "3", "Email": "sampleexample.org"}]
I have a large dataset of news articles loaded into a PySpark DataFrame. I am interested in filtering that DataFrame down to the set of articles that contain certain words of interest in their body text. At the moment the list of keywords is small, but I would like to store them in a DataFrame anyway as that list may expand in the future. Consider the following small example:
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

article_data = [{'source': 'a', 'body': 'Seattle is in Washington.'},
                {'source': 'b', 'body': 'Los Angeles is in California'},
                {'source': 'a', 'body': 'Banana is a fruit'}]
article_df = spark.createDataFrame(article_data)

keyword_data = [{'city': 'Seattle', 'state': 'Washington'},
                {'city': 'Los Angeles', 'state': 'California'}]
keyword_df = spark.createDataFrame(keyword_data)
This gives us the following DataFrames:
+--------------------+------+
| body|source|
+--------------------+------+
|Seattle is in Was...| a|
|Los Angeles is in...| b|
| Banana is a fruit| a|
+--------------------+------+
+-----------+----------+
| city| state|
+-----------+----------+
| Seattle|Washington|
|Los Angeles|California|
+-----------+----------+
As a first pass, I would like to filter down article_df so that it only contains articles whose body string contains any of the strings in keyword_df['city']. I'd also like to filter it down to articles that contain both a string from keyword_df['city'] and the corresponding entry (same row) in keyword_df['state']. How can I accomplish this?
I have managed to do this with a manually defined list of keywords:
from pyspark.sql.functions import udf
from pyspark.sql.types import BooleanType

def city_filter(x):
    cities = ['Seattle', 'Los Angeles']
    x = x.lower()
    return any(s.lower() in x for s in cities)

filterUDF = udf(city_filter, BooleanType())
Then article_df.filter(filterUDF(article_df.body)).show() gives the desired result:
+--------------------+------+
| body|source|
+--------------------+------+
|Seattle is in Was...| a|
|Los Angeles is in...| b|
+--------------------+------+
How can I implement this filter without having to manually define the list of keywords (or tuples of keyword pairs)? Do I need to use a UDF for this?
You can implement it using a leftsemi join with a custom join expression, for example:
from pyspark.sql.functions import expr

# keep articles whose body contains any city from keyword_df
body_contains_city = expr('body like concat("%", city, "%")')
article_df.join(keyword_df, body_contains_city, 'leftsemi').show()
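If you also need the second filter from the question, where the body must mention both a city and its corresponding state from the same row, the same pattern extends to a combined expression (continuing the snippet above):

# keep articles whose body mentions a city and the matching state from the same keyword row
body_contains_city_and_state = expr(
    'body like concat("%", city, "%") and body like concat("%", state, "%")'
)
article_df.join(keyword_df, body_contains_city_and_state, 'leftsemi').show()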