I am trying to read in Spark a json file that has one json document per line:
["io", {"in": 8, "out": 0, "dev": "68", "time": 1532035082.614868}]
["io", {"in": 0, "out": 0, "dev": "68", "time": 1532035082.97122}]
["test", {"A": [{"para1":[], "para2": true, "para3": 68, "name":"", "observation":[[2,3],[3,2]],"time": 1532035082.97122}]}]
It is a bit tricky because each line on its own is a valid json document.
With pandas I can do it directly:
pd.read_json(filepath, compression='infer', orient='records', lines=True)
But in Spark, reading it into a DataFrame does not work:
spark.read.option('multiline','true').json(filepath)
I tried to read the file line by line, but I still get an error:
lines = sc.textFile(filepath)
llist = lines.collect()
for line in llist:
    print(line)
    df = spark.read.option('multiline', 'true').json(line)
    df.printSchema()
The error is an IllegalArgumentException:
java.net.URISyntaxException: Relative path in absolute URI: .....
Thanks for your help in finding a solution.
One possible way is to read as a text file and parse each row as an array of two strings:
import pyspark.sql.functions as F
df = spark.read.text(filepath).withColumn(
'value',
F.from_json('value', 'array<string>')
).select(
F.col('value')[0].alias('c0'),
F.col('value')[1].alias('c1')
)
df.show(truncate=False)
+----+------------------------------------------------------------------------------------------------------------+
|c0 |c1 |
+----+------------------------------------------------------------------------------------------------------------+
|io |{"in":8,"out":0,"dev":"68","time":1.532035082614868E9} |
|io |{"in":0,"out":0,"dev":"68","time":1.53203508297122E9} |
|test|{"A":[{"para1":[],"para2":true,"para3":68,"name":"","observation":[[2,3],[3,2]],"time":1.53203508297122E9}]}|
+----+------------------------------------------------------------------------------------------------------------+
But note that column c1 is of string type. It is not possible for Spark to behave like pandas where the column holds dictionaries with different schemas.
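If the rows that share a tag also share a schema, c1 can be parsed further with an explicit schema. A small sketch, building on the df above and assuming you only need the "io" rows:
import pyspark.sql.functions as F
from pyspark.sql.types import StructType, StructField, IntegerType, StringType, DoubleType

# schema of the "io" payloads, taken from the sample lines above
io_schema = StructType([
    StructField('in', IntegerType()),
    StructField('out', IntegerType()),
    StructField('dev', StringType()),
    StructField('time', DoubleType()),
])

io_df = df.filter(F.col('c0') == 'io').withColumn(
    'c1', F.from_json('c1', io_schema)
).select('c0', 'c1.*')
io_df.show()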
Related
I have a CSV file which contains JSON objects as well as other data like strings and integers.
If I try to read the file as CSV, the JSON objects overlap into other columns.
Column1, Column2, Column3, Column4, Column5
100,ABC,{"abc": [{"xyz": 0, "mno": "h"}, {"apple": 0, "hello": 1, "temp": "cnot"}]},foo, pine
101,XYZ,{"xyz": [{"abc": 0, "mno": "h"}, {"apple": 0, "hello": 1, "temp": "cnot"}]},bar, apple
I am getting output as:
Column1 | Column2 | Column3 | Column4 | Column5
100 | ABC | {"abc": [{"xyz": 0, "mno": "h"} | {"apple": 0, "hello": 1 | "temp": "cnot"}]}
101 | XYZ | {"xyz": [{"abc": 0, "mno": "h"} | {"xyz": [{"abc": 0, "mno": "h"} | "temp": "cnot"}]}
Test_File.py
from pyspark.sql import SparkSession
from pyspark.sql.types import *

# Initializing SparkSession and setting up the file source
spark = SparkSession.builder.getOrCreate()
filepath = "s3a://file.csv"
df = spark.read.format("csv").options(header="true", delimiter=',', inferSchema='true').load(filepath)
df.show(5)
I also tried handling this issue by reading the file as text, as discussed in this approach:
'100,ABC,"{\'abc\':["{\'xyz\':0,\'mno\':\'h\'}","{\'apple\':0,\'hello\':1,\'temp\':\'cnot\'}"]}", foo, pine'
'101,XYZ,"{\'xyz\':["{\'abc\':0,\'mno\':\'h\'}","{\'apple\':0,\'hello\':1,\'temp\':\'cnot\'}"]}", bar, apple'
But instead of creating a new file, I want to load this quoted string as a PySpark DataFrame to run SQL queries on it. To create a DataFrame I need to split the string again to assign each column, which splits the JSON object as well.
The issue is with the delimiter you are using. You are reading the CSV with a comma as the delimiter, and your JSON strings also contain commas, so Spark splits the JSON strings as well, which produces the output above. You will need a CSV with a delimiter that does not appear in any of the column values to overcome this.
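For example, if the file can be regenerated with a delimiter such as | that never appears in the values, the read becomes straightforward. A minimal sketch, assuming a hypothetical pipe-delimited copy of the file at s3a://file_pipe.csv:
# Hypothetical pipe-delimited version of the same data, e.g.:
# Column1|Column2|Column3|Column4|Column5
# 100|ABC|{"abc": [{"xyz": 0, "mno": "h"}, {"apple": 0, "hello": 1, "temp": "cnot"}]}|foo|pine
df = (spark.read.format("csv")
      .option("header", "true")
      .option("delimiter", "|")
      .load("s3a://file_pipe.csv"))
df.show(truncate=False)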
I need to return, from a web framework (Flask for instance), a few dataframes and a string in a single JSON object. My code looks something like this:
import json
import pandas as pd

data1 = [['Alex', 10], ['Bob', 12], ['Clarke', 13]]
df1 = pd.DataFrame(data1, columns=['Name', 'Age'])
data2 = [['Cycle', 5], ['Run', 1], ['Hike', 7]]
df2 = pd.DataFrame(data2, columns=['Sport', 'Duration'])
test_value = {}
test_value["df1"] = df1.to_json(orient='records')
test_value["df2"] = df2.to_json(orient='records')
print(json.dumps(test_value))
This outputs :
{"df1": "[{\"Name\":\"Alex\",\"Age\":10},{\"Name\":\"Bob\",\"Age\":12},{\"Name\":\"Clarke\",\"Age\":13}]", "df2": "[{\"Sport\":\"Alex\",\"Duration\":10},{\"Sport\":\"Bob\",\"Duration\":12},{\"Sport\":\"Clarke\",\"Duration\":13}]"}
So there are a number of escape characters in front of every key inside the values of "df1" and "df2". If, on the other hand, I look at test_value, I get:
{'df1': '[{"Name":"Alex","Age":10},{"Name":"Bob","Age":12},{"Name":"Clarke","Age":13}]', 'df2': '[{"Sport":"Cycle","Duration":5},{"Sport":"Run","Duration":1},{"Sport":"Hike","Duration":7}]'}
Which is not quite right: what I need is for 'df1' to be in double quotes, "df1". Short of doing a search and replace on the string, what is the way to achieve that?
I've even tried to create the string myself, doing something like this:
print('\{"test": "{0:.2f}"\}'.format(123))
but I get this error:
---------------------------------------------------------------------------
KeyError Traceback (most recent call last)
<ipython-input-266-1fa35152436c> in <module>
----> 1 print('\{"test": "{0:.2f}"\}'.format(123))
KeyError: '"test"'
which I really don't get :). That said, there must be a better way than searching/replacing 'df1' with "df1".
Ideas?
The data is converted to JSON twice: once by to_json and again by json.dumps. The solution is to convert the values to dictionaries with DataFrame.to_dict, and then convert to JSON only once with json.dumps:
test_value={}
test_value["df1"] = df1.to_dict(orient='records')
test_value["df2"] = df2.to_dict(orient='records')
print(json.dumps(test_value))
{"df1": [{"Name": "Alex", "Age": 10},
{"Name": "Bob", "Age": 12},
{"Name": "Clarke", "Age": 13}],
"df2": [{"Sport": "Alex", "Duration": 10},
{"Sport": "Bob", "Duration": 12},
{"Sport": "Clarke", "Duration": 13}]}
I am not one for hyperbole, but I am really stumped by this error and I am sure you will be too.
Here is a simple json object:
[
{
"id": "7012104767417052471",
"session": -1332751885,
"transactionId": "515934477",
"ts": "2019-10-30 12:15:40 AM (+0000)",
"timestamp": 1572394540564,
"sku": "1234",
"price": 39.99,
"qty": 1,
"ex": [
{
"expId": 1007519,
"versionId": 100042440,
"variationId": 100076318,
"value": 1
}
]
}
]
Now I saved the file into ex.json and then executed the following python code:
import pandas as pd
df = pd.read_json('ex.json')
When I look at the dataframe, the value of my id has changed from "7012104767417052471" to "7012104767417052160".
Does anyone understand why python does this? I tried it in Node.js and even Excel, and it looks fine everywhere else.
If I do this I get the right id:
import json
from pandas import json_normalize

with open('Siva.json') as data_file:
    data = json.load(data_file)
df = json_normalize(data)
But I want to understand why pandas doesn't process the json properly here.
This is a known issue:
This has been an OPEN issue since 2018-04-04
read_json reads large integers as strings incorrectly if dtype not explicitly mentioned #20608
As stated in the issue, explicitly specify the dtype to get the correct number.
import pandas as pd
df = pd.read_json('test.json', dtype={'id': 'int64'})
id session transactionId ts timestamp sku price qty ex
7012104767417052471 -1332751885 515934477 2019-10-30 12:15:40 AM (+0000) 2019-10-30 00:15:40.564 1234 39.99 1 [{'expId': 1007519, 'versionId': 100042440, 'variationId': 100076318, 'value': 1}]
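As a quick sanity check (a small sketch, assuming the JSON above is saved as test.json), the column should now be a true int64 and the value should round-trip exactly:
import pandas as pd

df = pd.read_json('test.json', dtype={'id': 'int64'})
print(df['id'].dtype)    # int64
print(df['id'].iloc[0])  # 7012104767417052471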
This question already has answers here:
Loading and parsing a JSON file with multiple JSON objects
I am working with Python and I have a file (data.json) which contains multiple json documents, but the file as a whole is not valid json.
So the file looks like that:
{ "_id" : 01, ..., "path" : "2017-12-12" }
{ "_id" : 02, ..., "path" : "2017-1-12" }
{ "_id" : 03, ..., "path" : "2017-5-12" }
At the place of ... there are about 30 more keys, some of which contain nested jsons (so my point is that each json above is pretty long).
Therefore, each of the blocks above in this single file is a json, but the whole file is not, since the blocks are not separated by commas etc.
How can I read each of these jsons separately either with pandas or with simple python?
I have tried this:
import pandas as pd
df = pd.read_json('~/Desktop/data.json', lines=True)
and it actually creates a dataframe where each row corresponds to one json, but it also creates a column for each of the (1st level) keys of the json, which makes things messier instead of putting the whole json directly in one cell.
To be more clear, I would like my output to be like this in a 'pandas' dataframe (or in another sensible data-structure):
jsons
0 { "_id" : 01, ..., "path" : "2017-12-12" }
1 { "_id" : 02, ..., "path" : "2017-1-12" }
2 { "_id" : 03, ..., "path" : "2017-5-12" }
The idea is to use read_csv with a separator that does not exist in the data, and then convert each value of the column to a dictionary:
import pandas as pd
import ast, json
from io import StringIO
temp=u"""{ "_id" : 1, "path" : "2017-12-12" }
{ "_id" : 2, "path" : "2017-1-12" }
{ "_id" : 3, "path" : "2017-5-12" }"""
#after testing, replace StringIO(temp) with your real file, e.g. 'filename.csv'
df = pd.read_csv(StringIO(temp), sep="|", names=['data'])
print (df)
#jsons
df['data'] = df['data'].apply(json.loads)
#dictionaries
#df['data'] = df['data'].apply(ast.literal_eval)
print (df)
data
0 {'_id': 1, 'path': '2017-12-12'}
1 {'_id': 2, 'path': '2017-1-12'}
2 {'_id': 3, 'path': '2017-5-12'}
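Individual keys can still be pulled out of the dictionaries on demand, without flattening everything up front. A small sketch building on the df above:
# extract one top-level key from each dictionary when needed
df['path'] = df['data'].apply(lambda d: d.get('path'))
print(df)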
As the file itself is not valid json, I will read it line by line.
Since each line is a string, I will convert it to a dict using yaml,
and finally append everything to a dataframe.
import yaml
import pandas as pd

f = open('data.json')
line = f.readline()
df = pd.DataFrame()
while line:
    # string line to dict
    d = yaml.safe_load(line)
    # temp dataframe
    df1 = pd.DataFrame(d, index=[0])
    # append in every iteration
    df = df.append(df1, ignore_index=True)
    line = f.readline()
f.close()
print(df)
#output
_id path
0 01 2017-12-12
1 02 2017-1-12
2 03 2017-5-12
I'm using pandas to convert multiple json files into a dataframe. I only want entries that match certain criteria from those files, but currently I append the whole converted files and filter afterwards.
Suppose I have 2 json files that look like this:
File 1500.json
[
{
"CodStore": 1500,
"CodItem": 10,
"NameItem": "Burger",
"Price": 10.0
},
{
"CodStore": 1500,
"CodItem": 20,
"NameItem": "Fries",
"Price": 3.0
},
{
"CodStore": 1500,
"CodItem": 30,
"NameItem": "Ice Cream",
"Price": 1.0
}
]
File 1805.json
[
{
"CodStore": 1805,
"CodItem": 10,
"NameItem": "Burger",
"Price": 9.0
},
{
"CodStore": 1805,
"CodItem": 20,
"NameItem": "Fries",
"Price": 2.0
},
{
"CodStore": 1805,
"CodItem": 30,
"NameItem": "Ice Cream",
"Price": 0.5
}
]
I only want entries with CodItem 10 and 30 on my dataframe, so my python code looks like this:
from pandas import DataFrame, read_json

df = DataFrame()
stores = [1500, 1805]
for store in stores:
    filename = '%s.json' % store
    df = df.append(read_json(filename))
df = df[(df.CodItem == 10) | (df.CodItem == 30)]
This is just an example; the real problem is that I have more than 600 json files, so reading takes a lot of time, the dataframe becomes very long, and memory consumption is very high.
Is there a way to read only the matching criteria to the dataframe?
One option would be to append your JSON data to a list, then convert once at the end and filter.
import pandas as pd
from pandas import read_json

coditems = [10, 30]
data = []
for filename in json_files:
    # accumulate plain records (dicts) instead of appending DataFrames
    data.extend(read_json(filename).to_dict('records'))
df = pd.DataFrame(data).query('CodItem in @coditems')
This should be a lot faster, because repeatedly appending to a DataFrame is a quadratic operation. You have to read all the data in anyway, so you may as well use pandas to speed it up.
Another option would be to initialise your DataFrames inside a loop and then call pd.concat after you're done.
import pandas as pd
from pandas import read_json

df_list = []
for filename in json_files:
    df_list.append(read_json(filename))
df = pd.concat(df_list, ignore_index=True).query('CodItem in @coditems')
You can create a temporary data frame within your loop and filter it before appending:
from pandas import DataFrame, read_json

df = DataFrame()
stores = [1500, 1805]
for store in stores:
    filename = '%s.json' % store
    temp_df = read_json(filename)
    df = df.append(temp_df[(temp_df.CodItem == 10) | (temp_df.CodItem == 30)])
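The same idea scales a bit better if the filtered, per-file frames are collected in a list and concatenated once at the end, which avoids the quadratic cost of repeated append while still keeping only the matching rows in memory. A minimal sketch combining both suggestions:
import pandas as pd
from pandas import read_json

stores = [1500, 1805]
coditems = [10, 30]

frames = []
for store in stores:
    temp_df = read_json('%s.json' % store)
    # keep only the rows of interest before accumulating
    frames.append(temp_df[temp_df.CodItem.isin(coditems)])
df = pd.concat(frames, ignore_index=True)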