I have a CSV file which contains JSON objects along with other data such as strings and integers.
If I try to read the file as CSV, the JSON objects spill over into other columns.
Column1, Column2, Column3, Column4, Column5
100,ABC,{"abc": [{"xyz": 0, "mno": "h"}, {"apple": 0, "hello": 1, "temp": "cnot"}]},foo, pine
101,XYZ,{"xyz": [{"abc": 0, "mno": "h"}, {"apple": 0, "hello": 1, "temp": "cnot"}]},bar, apple
I am getting output as:
Column1 | Column2 | Column3 | Column4 | Column5
100 | ABC | {"abc": [{"xyz": 0, "mno": "h"} | {"apple": 0, "hello": 1 | "temp": "cnot"}]}
101 | XYZ | {"xyz": [{"abc": 0, "mno": "h"} | {"xyz": [{"abc": 0, "mno": "h"} | "temp": "cnot"}]}
Test_File.py
from pyspark.sql import SparkSession
# Initializing SparkSession and setting up the file source
spark = SparkSession.builder.getOrCreate()
filepath = "s3a://file.csv"
df = spark.read.format("csv").options(header="true", delimiter=",", inferSchema="true").load(filepath)
df.show(5)
I also tried handling this issue by reading the file as text, as discussed in this approach:
'100,ABC,"{\'abc\':["{\'xyz\':0,\'mno\':\'h\'}","{\'apple\':0,\'hello\':1,\'temp\':\'cnot\'}"]}", foo, pine'
'101,XYZ,"{\'xyz\':["{\'abc\':0,\'mno\':\'h\'}","{\'apple\':0,\'hello\':1,\'temp\':\'cnot\'}"]}", bar, apple'
But instead of creating a new file, I wanted to load this quoted string as a PySpark DataFrame so I can run SQL queries on it. To build a DataFrame I would need to split the string again to assign each column, which splits the JSON object once more.
The issue is with the delimiter you are using. You are reading the CSV with a comma as the delimiter, and your JSON string also contains commas, so Spark splits the JSON string as well, which produces the output above. You will need a CSV with a delimiter that is unique and does not appear in any of the column values to get around this.
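A minimal sketch of that approach, assuming the file can be re-exported with a pipe ("|") delimiter that never occurs inside the JSON column (the file name below is hypothetical):
# Hypothetical re-export of the same data, pipe-delimited
filepath = "s3a://file_pipe_delimited.csv"
df = (
    spark.read.format("csv")
    .options(header="true", delimiter="|", inferSchema="true")
    .load(filepath)
)
# Column3 now arrives intact as a single string column and could be parsed
# further with pyspark.sql.functions.from_json once its schema is known.
df.show(5, truncate=False)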
I have three .snappy.parquet files stored in an S3 bucket. I tried to use pandas.read_parquet(), but it only works when I specify a single parquet file, e.g. df = pandas.read_parquet("s3://bucketname/xxx.snappy.parquet"). If I don't specify the filename, df = pandas.read_parquet("s3://bucketname") doesn't work and gives me the error: Seek before start of file.
I did a lot of reading and then found this page;
it suggests that we can use pyarrow to read multiple parquet files, so here's what I tried:
import s3fs
import pyarrow.parquet as pq
s3 = s3fs.S3FileSystem()
bucket_uri = f's3://bucketname'
data = pq.ParquetDataset(bucket_uri, filesystem=s3)
df = data.read().to_pandas()
This works, but I found that the value of one of the columns in this df is a dictionary. How can I expand this dictionary so that selected keys become column names and the values become the corresponding column values?
For example, the current column:
column_1
{'Id': 'xxxxx', 'name': 'xxxxx','age': 'xxxxx'....}
The expected column:
Id age
xxx xxx
xxx xxx
Here's the output for data.read().schema:
column_0: string
-- field metadata --
PARQUET:field_id: '1'
column_1: struct<Id: string, name: string, age: string,.......>
child 0, Id: string
-- field metadata --
PARQUET:field_id: '3'
child 1, name: string
-- field metadata --
PARQUET:field_id: '7'
child 2, age: string
-- field metadata --
PARQUET:field_id: '8'
...........
...........
You have a column with a "struct type" and you want to flatten it. To do so, call flatten before calling to_pandas:
import pandas as pd
import pyarrow as pa
COLUMN1_SCHEMA = pa.struct([('Id', pa.string()), ('Name', pa.string()), ('Age', pa.string())])
SCHEMA = pa.schema([("column1", COLUMN1_SCHEMA), ('column2', pa.int32())])
df = pd.DataFrame({
"column1": [("1", "foo", "16"), ("2", "bar", "17"), ],
"column2": [1, 2],
})
pa.Table.from_pandas(df, SCHEMA).to_pandas() # without flatten
| column1 | column2 |
|:----------------------------------------|----------:|
| {'Id': '1', 'Name': 'foo', 'Age': '16'} | 1 |
| {'Id': '2', 'Name': 'bar', 'Age': '17'} | 2 |
pa.Table.from_pandas(df, SCHEMA).flatten().to_pandas() # with flatten
| column1.Id | column1.Name | column1.Age | column2 |
|-------------:|:---------------|--------------:|----------:|
| 1 | foo | 16 | 1 |
| 2 | bar | 17 | 2 |
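Applied to the snippet from the question, the same call can be chained in before to_pandas (a sketch, reusing the bucket URI from the question):
import s3fs
import pyarrow.parquet as pq

s3 = s3fs.S3FileSystem()
data = pq.ParquetDataset('s3://bucketname', filesystem=s3)
# flatten() expands the struct column into column_1.Id, column_1.name, column_1.age, ...
df = data.read().flatten().to_pandas()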
As a side note, you shouldn't call it a dictionary column; "dictionary" is a loaded term in pyarrow and usually refers to dictionary encoding.
Edit: how to read a subset of columns in parquet
import pyarrow.parquet as pq
table = pa.Table.from_pandas(df, SCHEMA)
pq.write_table(table, 'data.pq')
# Using read_table:
pq.read_table('data.pq', columns=['column1.Id', 'column1.Age'])
# Using ParquetDataSet:
pq.ParquetDataset('data.pq').read(columns=['column1.Id', 'column1.Age'])
I am trying to read, in Spark, a JSON file that has one JSON document per line:
["io", {"in": 8, "out": 0, "dev": "68", "time": 1532035082.614868}]
["io", {"in": 0, "out": 0, "dev": "68", "time": 1532035082.97122}]
["test", {"A": [{"para1":[], "para2": true, "para3": 68, "name":"", "observation":[[2,3],[3,2]],"time": 1532035082.97122}]}]
It is a bit tricky because each line on its own is valid JSON.
With pandas I do this directly:
pd.read_json(filepath, compression='infer', orient='records', lines=True)
But in Spark with a DataFrame it does not work:
spark.read.option('multiline','true').json(filepath)
I tried to read the file line by line but I still have an error:
lines = sc.textFile(filepath)
llist = lines.collect()
for line in llist:
    print(line)
    df = spark.read.option('multiline', 'true').json(line)
    df.printSchema()
the error is IllegalArgumentException:
java.net.URISyntaxException: Relative path in absolute URI: .....
Thanks for your help in finding a solution.
One possible way is to read as a text file and parse each row as an array of two strings:
import pyspark.sql.functions as F
df = spark.read.text(filepath).withColumn(
'value',
F.from_json('value', 'array<string>')
).select(
F.col('value')[0].alias('c0'),
F.col('value')[1].alias('c1')
)
df.show(truncate=False)
+----+------------------------------------------------------------------------------------------------------------+
|c0 |c1 |
+----+------------------------------------------------------------------------------------------------------------+
|io |{"in":8,"out":0,"dev":"68","time":1.532035082614868E9} |
|io |{"in":0,"out":0,"dev":"68","time":1.53203508297122E9} |
|test|{"A":[{"para1":[],"para2":true,"para3":68,"name":"","observation":[[2,3],[3,2]],"time":1.53203508297122E9}]}|
+----+------------------------------------------------------------------------------------------------------------+
But note that column c1 is of string type; it is not possible for Spark to behave like pandas, where a column can hold dictionaries with different schemas.
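If the schema of a given record type is known, c1 can be parsed one step further with from_json. A sketch for the "io" rows, with a schema guessed from the sample data:
import pyspark.sql.functions as F
from pyspark.sql.types import StructType, StructField, IntegerType, StringType, DoubleType

# Assumed schema for the "io" records, inferred by eye from the sample lines
io_schema = StructType([
    StructField('in', IntegerType()),
    StructField('out', IntegerType()),
    StructField('dev', StringType()),
    StructField('time', DoubleType()),
])

io_df = (
    df.filter(F.col('c0') == 'io')
      .withColumn('c1', F.from_json('c1', io_schema))
      .select('c0', 'c1.*')  # expand the struct into in/out/dev/time columns
)
io_df.show()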
I am working on getting data from an API using Python. The API returns data as JSON, which is normalised and written to a DataFrame, which is then written to a CSV file.
The API can return any number of columns, and this differs between records. I need only a fixed number of columns, which I define in the code.
In the scenario where a required column is not returned, my code fails.
I need a solution where, even if the required columns are not present in the DataFrame, the column headers still get created in the CSV and all rows get populated with null.
Required CSV structure:
name address phone
abc bcd 1214
bcd null null
I'm not sure if I understood you correctly, but I hope the following code solves your problem:
import json
import pandas as pd
# Declare json with missing values:
# - First element doesn't contain "phone" field
# - Second element doesn't contain "married" field
api_data = """
{ "sentences" :
[
{ "name": "abc", "address": "bcd", "married": true},
{ "name": "def", "address": "ghi", "phone" : 7687 }
]
}
"""
json_data = json.loads(api_data)
df = pd.DataFrame(
data=json_data["sentences"],
# Explicitly declare which columns should be presented in DataFrame
# If value for given column is absent it will be populated with NaN
columns=["name", "address", "married", "phone"]
)
# Save result to csv:
df.to_csv("tmp.csv", index=False)
The content of resulting csv:
name,address,married,phone
abc,bcd,True,
def,ghi,,7687.0
P.S.:
It should work even if columns are absent in all the records. Here is another example:
# Both elements do not contain "married" and "phone" fields
api_data = """
{ "sentences" :
[
{ "name": "abc", "address": "bcd"},
{ "name": "def", "address": "ghi"}
]
}
"""
json_data = json.loads(api_data)
json_data["sentences"][0]
df = pd.DataFrame(
data=json_data["sentences"],
# Explicitly declare which columns should be presented in DataFrame
# If value for given column is absent it will be populated with NaN
columns=["name", "address", "married", "phone"]
)
# Print first rows of DataFrame:
df.head()
# Expected output:
# name address married phone
# 0 abc bcd NaN NaN
# 1 def ghi NaN NaN
df.to_csv("tmp.csv", index=False)
In this case the resulting csv file will contain the following text:
name,address,married,phone
abc,bcd,,
def,ghi,,
The last two commas in the 2nd and 3rd lines mean "an empty/missing value"; if you create a DataFrame from the resulting csv with pd.read_csv, the "married" and "phone" columns will be populated with NaN values.
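Since the question mentions normalising the API response, the same guarantee is available via pd.json_normalize followed by reindex, which adds any missing columns filled with NaN (a sketch, reusing json_data from the example above):
import pandas as pd

# reindex guarantees the full column set even when every record lacks a field
df = pd.json_normalize(json_data["sentences"]).reindex(
    columns=["name", "address", "married", "phone"]
)
df.to_csv("tmp.csv", index=False)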
I have a pyspark dataframe which looks like the following:
+----+--------------------+
| ID| Email|
+----+--------------------+
| 1| sample#example.org|
| 2| sample2#example.org|
| 3| sampleexample.org|
| 4| sample#exampleorg|
+----+--------------------+
What I need to do is to split it into chunks and then convert those chunks to dictionaries like:
chunk1
[{'ID': 1, 'Email': 'sample#example.org'}, {'ID': 2, 'Email': 'sample2#example.org'}]
chunk2
[{'ID': 3, 'Email': 'sampleexample.org'}, {'ID': 4, 'Email': 'sample#exampleorg'}]
I've found this post on SO, but I figured it would not make sense to first convert the chunks to a pandas DataFrame and from there to dictionaries when I might be able to do it directly. Using the idea in that post, I've got the following code, but I am not sure if this is the best way of doing it:
columns = spark_df.schema.fieldNames()
chunks = spark_df.repartition(num_chunks).rdd.mapPartitions(lambda iterator: [iterator.to_dict('records')]).toLocalIterator()
for list_of_dicts in chunks:
# do work locally on list_of_dicts
You can return [[x.asDict() for x in iterator]] in the mapPartitions function (no need for pandas). [x.asDict() for x in iterator] creates a list of dicts including all rows in the same partition; we then enclose it in another list so that it is treated as a single item by toLocalIterator():
from json import dumps
num_chunks = 2
chunks = spark_df.repartition(num_chunks).rdd.mapPartitions(lambda iterator: [[x.asDict() for x in iterator]]).toLocalIterator()
for list_of_dicts in chunks:
print(dumps(list_of_dicts))
#[{"ID": "2", "Email": "sample2#example.org"}, {"ID": "1", "Email": "sample#example.org"}]
#[{"ID": "4", "Email": "sample#exampleorg"}, {"ID": "3", "Email": "sampleexample.org"}]
I am working with Python and I have a file (data.json) which contains multiple JSON objects, but the file as a whole is not valid JSON.
The file looks like this:
{ "_id" : 01, ..., "path" : "2017-12-12" }
{ "_id" : 02, ..., "path" : "2017-1-12" }
{ "_id" : 03, ..., "path" : "2017-5-12" }
In place of the ... there are about 30 more keys, some of which contain nested JSON (so each object above is pretty long).
Therefore, each of the blocks above in this single file is valid JSON, but the file as a whole is not, since the blocks are not separated by commas etc.
How can I read each of these JSON objects separately, either with pandas or with plain Python?
I have tried this:
import pandas as pd
df = pd.read_json('~/Desktop/data.json', lines=True)
and it actually creates a DataFrame where each row corresponds to one JSON object, but it also creates a column for each of the (1st-level) keys, which makes things messier instead of putting the whole JSON directly in one cell.
To be clearer, I would like my output to look like this in a pandas DataFrame (or another sensible data structure):
jsons
0 { "_id" : 01, ..., "path" : "2017-12-12" }
1 { "_id" : 02, ..., "path" : "2017-1-12" }
2 { "_id" : 03, ..., "path" : "2017-5-12" }
The idea is to use read_csv with a separator that does not exist in the data and then convert each value of the column to a dictionary:
import pandas as pd
import ast, json
from io import StringIO
temp=u"""{ "_id" : 1, "path" : "2017-12-12" }
{ "_id" : 2, "path" : "2017-1-12" }
{ "_id" : 3, "path" : "2017-5-12" }"""
# after testing, replace StringIO(temp) with 'filename.csv'
df = pd.read_csv(StringIO(temp), sep="|", names=['data'])
print (df)
#jsons
df['data'] = df['data'].apply(json.loads)
#dictionaries
#df['data'] = df['data'].apply(ast.literal_eval)
print (df)
data
0 {'_id': 1, 'path': '2017-12-12'}
1 {'_id': 2, 'path': '2017-1-12'}
2 {'_id': 3, 'path': '2017-5-12'}
As the file itself is not valid JSON, I will read it line by line,
and as each line is a string, I will convert it to a dict using yaml,
and finally put it all into a DataFrame.
import yaml
import pandas as pd

rows = []
with open('data.json') as f:
    for line in f:
        # string line to dict
        rows.append(yaml.safe_load(line))

# build the DataFrame from all parsed lines
df = pd.DataFrame(rows)
print(df)
#output
_id path
0 01 2017-12-12
1 02 2017-1-12
2 03 2017-5-12