How to speed up computation in PySpark - Python

The source data is event logs from devices, all in JSON format.
A sample of the raw JSON data:
{"sn": "123", "ip": null, "evt_name": "client_requestData", "evt_content": {"count": 1, "hour": 13, "dow": 0, "segmentation": {"requestService": "music", "requestData": "is_online", "requestOpcode": "get_state"}, "sum": 0}, "evt_ts": 1521350052, "app_key": "f6e7f4f8ec4b4d6dae6fa2b5ed8f90cb6a640759", "sdk_name": "countlysdk_0.0.9", "sdk_version": "17.05"}
{"sn": "123", "ip": null, "evt_name": "client_requestData2", "evt_content": {"count": 1, "hour": 13, "dow": 0, "segmentation": {"requestService": "fm", "requestData": "is_online", "requestOpcode": "get_state"}, "sum": 0}, "evt_ts": 1521350053, "app_key": "f6e7f4f8ec4b4d6dae6fa2b5ed8f90cb6a640759", "sdk_name": "countlysdk_0.0.9", "sdk_version": "17.05"}
{"sn": "123", "ip": null, "evt_name": "client_requestData3", "evt_content": {"count": 1, "hour": 13, "dow": 0, "segmentation": {"requestService": "video", "requestData": "is_online", "requestOpcode": "get_state"}, "sum": 0}, "evt_ts": 1521350054, "app_key": "f6e7f4f8ec4b4d6dae6fa2b5ed8f90cb6a640759", "sdk_name": "countlysdk_0.0.9", "sdk_version": "17.05"}
{"sn": "123", "ip": null, "evt_name": "client_requestData4", "evt_content": {"count": 1, "hour": 13, "dow": 0, "segmentation": {"requestService": "fm", "requestData": "is_online", "requestOpcode": "get_state"}, "sum": 0}
I have a list of events (e.g. tar_task_list) with 100 or more items. For each event,
I need to aggregate all matching events from the raw data and save them to a CSV file for that event.
Below is the code:
# read source data
raw_data = sc.textFile("s3://xxx").map(lambda x: json.loads(x))
# TODO: NEED TO SPEED UP THIS COMPUTING
for tar_evt_name in evts:
    print("...")
    table_name = out_table_prefix + tar_evt_name
    evt_one_rdd = raw_data.filter(lambda x: x.get("evt_name") == tar_evt_name)
    evt_one_rdd.cache()
    evt_one_dict = evt_one_rdd.first()
    Evt_one = Row(*sorted(['{}'.format(k) for k, v in evt_one_dict.items()]))
    col_len = len(evt_one_rdd.first())
    evt_one_rdd2 = evt_one_rdd.map(lambda x: to_list(x, col_len)).filter(lambda x: len(x) is not 0)
    evt_one_rdd2.cache()
    df = spark.createDataFrame(evt_one_rdd2.map(lambda x: Evt_one(*x)))
    out_csv_path = output + '/' + tar_evt_name + '/'  # add last '/' for copy err
    df.write.csv(out_csv_path, mode='overwrite', header=True, sep='|', nullValue="NULL")
The output data looks like this:
time : 2018-05-07 00:03|8dab4796-fa37-4114-0011-7637fa2b0001|f6e7f4f8ec4b4d6dae6fa2b5ed8f90cb6a640759|0.2.23|131074|2018-05-08 23:24:25|0|false|default|2.4.130

Here is my attempt.
I have noticed a few issues here:
for tar_evt_name in evts is a native Python for loop, which incurs a performance penalty when it seems you want a group-by operation;
.cache() is used, but seemingly for no reason;
I am unsure what to_list is;
I don't think evt_one_rdd2.map(lambda x: Evt_one(*x)) works.
import json
from pyspark.sql import functions as F
from pyspark.sql import Row
from pyspark.sql import Window

raw_data = sc.textFile('test.txt')
df = raw_data.map(
    # Map the raw input to python dict using `json.loads`
    json.loads,
).map(
    # Duplicate the evt_name and evt_ts for later use in a Row object
    lambda x: Row(evt_name=x['evt_name'], evt_ts=x.get('evt_ts', 1), data=x),
).toDF()  # Convert into a dataframe...
# ... (I am actually unsure if this is faster...
# ... but I am more comfortable with this)
filtered_df = df.withColumn(
    # NOTE: Assumed you want the first row, as you used `evt_one_rdd.first()`.
    # So we assign a row number (named rn) and then filter on rn = 1.
    # Here the evt_name and evt_ts become handy; you might want to set
    # your own evt_ts properly.
    'rn', F.row_number().over(
        Window.partitionBy(df['evt_name']).orderBy(df['evt_ts'])
    ),
).filter('rn = 1').where(
    # NOTE: Since you used `map(lambda x: to_list(x, col_len)).filter(lambda x: len(x) is not 0)`,
    # I assume you meant data should have more than 0 keys,
    # but this should be almost always true?
    # Since you are grouping by `evt_name`, which means
    # there is at least that key most of the time.
    F.size(F.col('data')) > 0
)
filtered_df.write(....)
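The group-by point above can be illustrated without Spark. This is a minimal sketch in plain Python, assuming the records fit in memory; group_by_event and the sample lines are hypothetical names and data, not part of the original code. It buckets every record by evt_name in a single pass instead of rescanning the whole input once per event name. In Spark, DataFrameWriter.partitionBy("evt_name") plays a comparable role at write time, producing one output directory per distinct value.

```python
import json
from collections import defaultdict

# Hypothetical sample lines in the same shape as the raw event logs.
raw_lines = [
    '{"sn": "123", "evt_name": "client_requestData", "evt_ts": 1521350052}',
    '{"sn": "123", "evt_name": "client_requestData2", "evt_ts": 1521350053}',
    '{"sn": "456", "evt_name": "client_requestData", "evt_ts": 1521350054}',
]


def group_by_event(lines):
    """Bucket parsed records by evt_name in a single pass over the input."""
    groups = defaultdict(list)
    for line in lines:
        record = json.loads(line)
        groups[record.get("evt_name")].append(record)
    return dict(groups)


groups = group_by_event(raw_lines)
for name, records in groups.items():
    print(name, len(records))
```

With 100 or more event names, the loop-and-filter version reads the data 100 or more times, while the single pass reads it once.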

Related

Databricks - Pyspark - Handling nested json with a dynamic key

I have an example json data file which has the following structure:
{
    "Header": {
        "Code1": "abc",
        "Code2": "def",
        "Code3": "ghi",
        "Code4": "jkl"
    },
    "TimeSeries": {
        "2020-11-25T03:00:00+00:00": {
            "UnitPrice": 1000,
            "Amount": 10000
        },
        "2020-11-26T03:00:00+00:00": {
            "UnitPrice": 1000,
            "Amount": 10000
        }
    }
}
When I parse this into databricks with command:
df = spark.read.json("/FileStore/test.txt")
I get as output 2 objects: Header and TimeSeries. With the TimeSeries I want to be able to flatten the structure so it has the following schema:
Date
UnitPrice
Amount
As the date field is a key, I am currently only able to access it via iterating through the column names and then using this in the dot-notation dynamically:
def flatten_json(data):
    columnlist = data.select("TimeSeries.*")
    count = 0
    for name in data.select("TimeSeries.*"):
        df1 = data.select("Header.*").withColumn(("Timeseries"), lit(columnlist.columns[count])).withColumn("join", lit("a"))
        df2 = data.select("TimeSeries." + columnlist.columns[count] + ".*").withColumn("join", lit("a"))
        if count == 0:
            df3 = df1.join(df2, on=['join'], how="inner")
        else:
            df3 = df3.union(df1.join(df2, on=['join'], how="inner"))
        count = count + 1
    return(df3)
This is far from ideal. Does anyone know a better method to create the described dataframe?
The idea:
Step 1: Extract Header and TimeSeries separately.
Step 2: For each field in the TimeSeries object, extract the Amount and UnitPrice, together with the name of the field, stuff them into a struct.
Step 3: Merge all these structs into an array column, and explode it.
Step 4: Extract Timeseries, Amount and UnitPrice from the exploded column.
Step 5: Cross join with the Header row.
import pyspark.sql.functions as F

header_df = df.select("Header.*")
timeseries_df = df.select("TimeSeries.*")
fieldNames = enumerate(timeseries_df.schema.fieldNames())
# One struct per dynamic date key, carrying the key itself as "Timeseries"
cols = [
    F.struct(
        F.lit(name).alias("Timeseries"),
        F.col(name).getItem("Amount").alias("Amount"),
        F.col(name).getItem("UnitPrice").alias("UnitPrice"),
    ).alias("ts_" + str(idx))
    for idx, name in fieldNames
]
combined = F.explode(F.array(cols)).alias("comb")
timeseries = timeseries_df.select(combined).select('comb.Timeseries', 'comb.Amount', 'comb.UnitPrice')
result = header_df.crossJoin(timeseries)
result.show(truncate=False)
Output:
+-----+-----+-----+-----+-------------------------+------+---------+
|Code1|Code2|Code3|Code4|Timeseries |Amount|UnitPrice|
+-----+-----+-----+-----+-------------------------+------+---------+
|abc |def |ghi |jkl |2020-11-25T03:00:00+00:00|10000 |1000 |
|abc |def |ghi |jkl |2020-11-26T03:00:00+00:00|10000 |1000 |
+-----+-----+-----+-----+-------------------------+------+---------+
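For a single small document, the same reshaping can be checked without Spark. This is only a sketch in plain Python, assuming the file has already been parsed into a dict; flatten_timeseries is a hypothetical helper name. It turns each dynamic date key into an ordinary "Date" value and repeats the Header fields on every row, which mirrors steps 2 to 5 above.

```python
# Same shape as the sample file from the question.
doc = {
    "Header": {"Code1": "abc", "Code2": "def", "Code3": "ghi", "Code4": "jkl"},
    "TimeSeries": {
        "2020-11-25T03:00:00+00:00": {"UnitPrice": 1000, "Amount": 10000},
        "2020-11-26T03:00:00+00:00": {"UnitPrice": 1000, "Amount": 10000},
    },
}


def flatten_timeseries(document):
    """Return one flat dict per TimeSeries key, each carrying the Header fields."""
    header = document["Header"]
    rows = []
    for date, values in document["TimeSeries"].items():
        # The dynamic key becomes an ordinary "Date" value.
        rows.append({**header, "Date": date, **values})
    return rows


rows = flatten_timeseries(doc)
for row in rows:
    print(row)
```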

Json file not formatted correctly when writing json differences with pandas and numpy

I am trying to compare two JSON files and then write another JSON file containing the column names and a yes/no flag for differences. I am using pandas and NumPy.
Below are the sample files. The actual JSON is dynamic, meaning we don't know up front how many keys there will be.
Input files:
fut.json
[
{
"AlarmName": "test",
"StateValue": "OK"
}
]
Curr.json:
[
{
"AlarmName": "test",
"StateValue": "OK"
}
]
Below code I have tried:
import json
import pandas as pd
import numpy as np

with open(r"c:\csv\fut.json", 'r+') as f:
    data_b = json.load(f)
with open(r"c:\csv\curr.json", 'r+') as f:
    data_a = json.load(f)
df_a = pd.json_normalize(data_a)
df_b = pd.json_normalize(data_b)
_, df_a = df_b.align(df_a, fill_value=np.NaN)
_, df_b = df_a.align(df_b, fill_value=np.NaN)
with open(r"c:\csv\report.json", 'w') as _file:
    for col in df_a.columns:
        df_temp = pd.DataFrame()
        df_temp[col + '_curr'], df_temp[col + '_fut'], df_temp[col + '_diff'] = df_a[col], df_b[col], np.where((df_a[col] == df_b[col]), 'No', 'Yes')
        #[df_temp.rename(columns={c:'Missing'}, inplace=True) for c in df_temp.columns if df_temp[c].isnull().all()]
        df_temp.fillna('Missing', inplace=True)
        with pd.option_context('display.max_colwidth', -1):
            _file.write(df_temp.to_json(orient='records'))
Expected output:
[
{
"AlarmName_curr": "test",
"AlarmName_fut": "test",
"AlarmName_diff": "No"
},
{
"StateValue_curr": "OK",
"StateValue_fut": "OK",
"StateValue_diff": "No"
}
]
Actual output: it cannot be parsed by a JSON validator. The problem is shown below: the ][ between the two arrays should be replaced by ',' to get valid JSON. I don't know why it is printed like that.
[{"AlarmName_curr":"test","AlarmName_fut":"test","AlarmName_diff":"No"}][{"StateValue_curr":"OK","StateValue_fut":"OK","StateValue_diff":"No"}]
Edit1:
Tried below as well
_file.write(df_temp.to_json(orient='records',lines=True))
Now I get JSON that is again not parsable: the ',' is missing, and unless I manually add a ',' between the two dicts and [ ] at the beginning and end, it does not parse.
[{"AlarmName_curr":"test","AlarmName_fut":"test","AlarmName_diff":"No"}{"StateValue_curr":"OK","StateValue_fut":"OK","StateValue_diff":"No"}]
Honestly pandas is overkill for this... however
load dataframes as you did
concat them as columns. rename columns
do calcs and map boolean to desired Yes/No
to_json() returns a string so json.loads() to get it back into a list/dict. Filter columns to get to your required format
import json
import pandas as pd

data_b = [
    {
        "AlarmName": "test",
        "StateValue": "OK"
    }
]
data_a = [
    {
        "AlarmName": "test",
        "StateValue": "OK"
    }
]
df_a = pd.json_normalize(data_a)
df_b = pd.json_normalize(data_b)
df = pd.concat([df_a, df_b], axis=1)
df.columns = [c+"_curr" for c in df_a.columns] + [c+"_fut" for c in df_a.columns]
# True means the two values differ, so identical values map to "No"
df["AlarmName_diff"] = df["AlarmName_curr"] != df["AlarmName_fut"]
df["StateValue_diff"] = df["StateValue_curr"] != df["StateValue_fut"]
df = df.replace({True:"Yes", False:"No"})
js = json.loads(df.loc[:,[c for c in df.columns if c.startswith("Alarm")]].to_json(orient="records"))
js += json.loads(df.loc[:,[c for c in df.columns if c.startswith("State")]].to_json(orient="records"))
js
output
[{'AlarmName_curr': 'test', 'AlarmName_fut': 'test', 'AlarmName_diff': 'No'},
 {'StateValue_curr': 'OK', 'StateValue_fut': 'OK', 'StateValue_diff': 'No'}]
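Since the question says the keys are not known up front, a version that does not hard-code AlarmName and StateValue may be useful. This is a minimal sketch with the standard library only, assuming both files hold lists of flat dicts of equal length; diff_records is a hypothetical helper name, and nested values would need flattening first (for example with pd.json_normalize). It builds one list and serializes it once, which avoids the back-to-back arrays seen in the question.

```python
import json


def diff_records(curr, fut):
    """Compare two lists of flat dicts key by key.

    Returns one dict per key with the current value, the future value,
    and "Yes"/"No" depending on whether they differ. A key absent on
    one side is reported as "Missing".
    """
    report = []
    for a, b in zip(curr, fut):
        # Union of keys from both sides, keeping first-seen order.
        keys = list(dict.fromkeys(list(a) + list(b)))
        for key in keys:
            va = a.get(key, "Missing")
            vb = b.get(key, "Missing")
            report.append({
                key + "_curr": va,
                key + "_fut": vb,
                key + "_diff": "No" if va == vb else "Yes",
            })
    return report


data_a = [{"AlarmName": "test", "StateValue": "OK"}]
data_b = [{"AlarmName": "test", "StateValue": "OK"}]
report = diff_records(data_a, data_b)
print(json.dumps(report, indent=4))
```

Writing the file then comes down to a single json.dump(report, f) call, so the output is always one valid JSON array.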

Multi Level JSON Data into SQL Using Python

I've been working on taking JSON data and dumping it into a SQL database. I've run across some data that is "multi level" and I'm stuck on finding the best approach to handle the data and create the correct table structure in SQL.
Here's the portion of the JSON data that has a multi level structure:
"LimitedTaxonomy": {
"Children": [
{
"Children": [
{
"Children": [],
"NewPartCount": 0,
"Parameter": "Categories",
"ParameterId": -8,
"PartCount": 1,
"Value": "Logic - Flip Flops",
"ValueId": "706"
}
],
"NewPartCount": 0,
"Parameter": "Categories",
"ParameterId": -8,
"PartCount": 1,
"Value": "Integrated Circuits (ICs)",
"ValueId": "32"
}
],
"NewPartCount": 0,
"Parameter": "Categories",
"ParameterId": -8,
"PartCount": 1,
"Value": "Out of Bounds",
"ValueId": "0"
}
I have a function that I call when I'm parsing the JSON data that takes the structure and puts data in the SQL tables. Using the JSON data above I'd be passing:
thetable = 'LimitedTaxonomy'
thevalue = {'Children': [{'Children': [{'Children': [], 'NewPartCount': 0, 'Parameter': 'Categories', 'ParameterId': -8, 'PartCount': 1, 'Value': 'Logic - Flip Flops', 'ValueId': '706'}], 'NewPartCount': 0, 'Parameter': 'Categories', 'ParameterId': -8, 'PartCount': 1, 'Value': 'Integrated Circuits (ICs)', 'ValueId': '32'}], 'NewPartCount': 0, 'Parameter': 'Categories', 'ParameterId': -8, 'PartCount': 1, 'Value': 'Out of Bounds', 'ValueId': '0'}
def create_sql(thetable, thevalue):
    if len(thevalue) > 0:
        #print(thevalue[0], len(thevalue))
        if type(thevalue) is list:
            x = tuple(thevalue[0])
        elif type(thevalue) is dict:
            x = tuple(thevalue)
        a = str(x).replace("'","").replace("(","")
        query = "INSERT INTO " + thetable + " (PartDetailsId, " + a + " VALUES(" + str(data['PartDetails']['PartId'])
        b = ""
        for i in range(len(x)):
            b += ", ?"  # I've also seen %s used as a placeholder
        b += ")"
        query += b
        print(query)
        print(thevalue)
        print()
        #TODO Need to check before entering data, list has many records, dict has 1
        if type(thevalue) is list:
            cursor.executemany(query, [tuple(d.values()) for d in thevalue])
        elif type(thevalue) is dict:
            cursor.execute(query, tuple(thevalue.values()))
        cursor.commit()
The above function seems to work well with "single level" JSON but this data has basically a table/array (Children) as a column.
In SQL I have the 2 tables as defined like this, which might not be the way to handle this:
SELECT TOP (1000) [LimitedTaxonomyId]
,[PartDetailsId]
,[NewPartCount]
,[Parameter]
,[ParameterId]
,[PartCount]
,[Value]
,[ValueId]
FROM [Components].[dbo].[LimitedTaxonomy]
SELECT TOP (1000) [ChildrenId]
,[LimitedTaxonomyId]
,[NewPartCount]
,[Parameter]
,[ParameterId]
,[PartCount]
,[Value]
,[ValueId]
FROM [Components].[dbo].[Children]
I think I need to check thevalue and if contains a list I need to first pull that list out of thevalue and then feed remaining data into my create_sql function. After that I then put that extracted list back through the create_sql function but first querying the SQL Database to grab the [LimitedTaxonomyId] value that was just entered.
This sounds like a big mess and maybe it's the only way but I'd like some 2nd opinions on how I'm going about this and if there's perhaps a better way.
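One way to picture the "pull the list out, insert the rest, then insert the children with the new id" idea is a small recursive function. The following is only a sketch under several assumptions: it uses sqlite3 in memory as a stand-in for SQL Server, a single self-referencing table named taxonomy (hypothetical, and different from the two tables shown above), and a trimmed set of columns. With SQL Server the new id would have to be obtained differently, for instance with SCOPE_IDENTITY() or an OUTPUT clause, rather than cursor.lastrowid.

```python
import sqlite3


def insert_node(cursor, node, part_id, parent_id=None):
    """Insert one taxonomy node, then recurse into its Children."""
    # Everything except the nested list goes into the row itself.
    cursor.execute(
        "INSERT INTO taxonomy (parent_id, part_details_id, parameter, value, value_id) "
        "VALUES (?, ?, ?, ?, ?)",
        (parent_id, part_id, node["Parameter"], node["Value"], node["ValueId"]),
    )
    new_id = cursor.lastrowid  # id of the row just inserted
    for child in node.get("Children", []):
        insert_node(cursor, child, part_id, parent_id=new_id)
    return new_id


conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute(
    "CREATE TABLE taxonomy ("
    "id INTEGER PRIMARY KEY, parent_id INTEGER, part_details_id INTEGER, "
    "parameter TEXT, value TEXT, value_id TEXT)"
)

# Trimmed copy of the LimitedTaxonomy sample from the question.
taxonomy = {
    "Children": [{
        "Children": [{"Children": [], "Parameter": "Categories",
                      "Value": "Logic - Flip Flops", "ValueId": "706"}],
        "Parameter": "Categories", "Value": "Integrated Circuits (ICs)", "ValueId": "32",
    }],
    "Parameter": "Categories", "Value": "Out of Bounds", "ValueId": "0",
}
insert_node(cur, taxonomy, part_id=1)
conn.commit()
rows = cur.execute("SELECT id, parent_id, value FROM taxonomy ORDER BY id").fetchall()
print(rows)
```

A self-referencing parent_id column handles any nesting depth with one table, whereas separate LimitedTaxonomy and Children tables only cover two levels.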

Get Mongo field names from collection using pymongo

I am trying to get the field names from a MongoDB collection using pymongo. Is there a way to do that?
Mongo Collection Format:
"_id" : ObjectId("5e7a773721ee63712e9d25a3"),
"effective_date" : "2020-03-24",
"data" : [
{
"Year" : 2020,
"month" : 1,
"Day" : 28,
"views" : 4994,
"clicks" : 3982
},
{
"Year" : 2020,
"month" : 1,
"Day" : 17,
"views" : 1987,
"clicks" : 3561
},
.
.
.
]
Is there a way I can get field names:
I want to get: _id, effective_date, data.Year, data.month, data.Day, data.views, data.clicks
This is what I have:
from datetime import datetime, timedelta, date
import pymongo
from pymongo import MongoClient
from pymongo.read_preferences import ReadPreference
from pprint import pprint
from bson.son import SON
from bson import json_util
from bson.json_util import dumps, loads
import re
client = pymongo.MongoClient(host='mongodb://00.00.00.0:00000')
db = client.collection
pprint(db)
def get_results(filters):
    col=db.results
    res = col.find()
    res = list(res)
    return dumps(res, indent=4)
Is there a way for me to get just the field names using pymongo?
We are not really filtering or aggregating in the example; we are doing a big find() and then we want all the field names. There is no projection either. So assuming that we are dragging over all the data anyway, let the client side do the work. Here's something that will capture unique field names including through arrays and give you a count of each unique field name as well:
r = [
    {"_id":0, "A":"A", "data":[
        {"Y":2020,"day":3,"clicks":12},
        {"Y":2020,"day":4,"clicks":192}
    ]} ,
    {"_id":1, "B":{"foo":"bar"}, "data":[
        {"Y":2020,"day":3,"clicks":888,"corn":"dog"},
        {"Y":2020,"day":4,"clicks":999,"zing":"zap"}
    ]} ,
    {"_id":2, "B":{"foo":"bit"} },
    {"_id":3, "B":{"fin":"bar"} }
]
coll.insert_many(r)

fieldNames = {}

def addFldName(s):
    if s not in fieldNames:
        fieldNames[s] = 0
    fieldNames[s] += 1

def process(path, v):
    addFldName(path)
    if isinstance(v, dict):
        walkMap(path, v)
    elif isinstance(v, list):
        walkList(path, v)

def walkMap(path, doc):
    dot = "" if path == "" else "."
    for k, v in doc.items():
        s = path + dot + k
        process(s, v)

def walkList(path, array):
    dot = "" if path == "" else "."
    for n in range(0, len(array)):
        s = path + dot + str(n)
        process(s, array[n])

for doc in coll.find():
    walkMap("", doc)
print(fieldNames)
{'_id': 4, 'A': 1, 'data': 2, 'data.0': 2, 'data.0.Y': 2, 'data.0.day': 2, 'data.0.clicks': 2, 'data.1': 2, 'data.1.Y': 2, 'data.1.day': 2, 'data.1.clicks': 2, 'B': 3, 'B.foo': 2, 'data.0.corn': 1, 'data.1.zing': 1, 'B.fin': 1}
It's a little weird, but yes, data.0.clicks is unique and shows up in 2 docs.
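The question asks for names like data.Year rather than data.0.Year. A variant that ignores array positions could look like the sketch below; collect_field_names is a hypothetical helper, it works on plain dicts such as those returned by find(), and it reports only leaf fields (so data itself is not listed). The _id is written as a plain string here instead of an ObjectId to keep the example self-contained.

```python
def collect_field_names(doc, prefix=""):
    """Return the set of dotted field names in doc, ignoring array positions."""
    names = set()
    for key, value in doc.items():
        path = prefix + "." + key if prefix else key
        # Treat a scalar or a dict like a one-element list.
        items = value if isinstance(value, list) else [value]
        nested = [item for item in items if isinstance(item, dict)]
        if nested:
            # Descend into sub-documents without adding an index to the path.
            for item in nested:
                names |= collect_field_names(item, path)
        else:
            names.add(path)
    return names


doc = {
    "_id": "5e7a773721ee63712e9d25a3",
    "effective_date": "2020-03-24",
    "data": [
        {"Year": 2020, "month": 1, "Day": 28, "views": 4994, "clicks": 3982},
        {"Year": 2020, "month": 1, "Day": 17, "views": 1987, "clicks": 3561},
    ],
}
fields = sorted(collect_field_names(doc))
print(fields)
```

To cover a whole collection, the sets returned for each document can be merged with a union.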

How to parse tab-delimited text file with 4th column as json and remove certain keys?

I have a 26 GB text file. The line format is as follows:
/type/edition /books/OL10000135M 4 2010-04-24T17:54:01.503315 {"publishers": ["Bernan Press"], "physical_format": "Hardcover", "subtitle": "9th November - 3rd December, 1992", "key": "/books/OL10000135M", "title": "Parliamentary Debates, House of Lords, Bound Volumes, 1992-93", "identifiers": {"goodreads": ["6850240"]}, "isbn_13": ["9780107805401"], "languages": [{"key": "/languages/eng"}], "number_of_pages": 64, "isbn_10": ["0107805405"], "publish_date": "December 1993", "last_modified": {"type": "/type/datetime", "value": "2010-04-24T17:54:01.503315"}, "authors": [{"key": "/authors/OL2645777A"}], "latest_revision": 4, "works": [{"key": "/works/OL7925046W"}], "type": {"key": "/type/edition"}, "subjects": ["Government - Comparative", "Politics / Current Events"], "revision": 4}
I'm trying to get only the last column, which is JSON, and from that JSON I only want to save "title", "isbn_13" and "isbn_10".
I was able to save only the last column with this code:
import csv
import sys

csv.field_size_limit(sys.maxsize)
# File names: to read in from and read out to
input_file = '../inputFile/ol_dump_editions_2019-10-31.txt'
output_file = '../outputFile/output.txt'
## ==================== ##
## Using module 'csv' ##
## ==================== ##
with open(input_file) as to_read:
    with open(output_file, "w") as tmp_file:
        reader = csv.reader(to_read, delimiter = "\t")
        writer = csv.writer(tmp_file)
        desired_column = [4]  # text column
        for row in reader:  # read one row at a time
            myColumn = list(row[i] for i in desired_column)  # build the output row (process)
            writer.writerow(myColumn)  # write it
But this doesn't return a proper JSON object; instead, everything comes back with doubled quotation marks. Also, how would I extract certain values from the JSON as a new JSON object?
EDIT:
"{""publishers"": [""Bernan Press""], ""physical_format"": ""Hardcover"", ""subtitle"": ""9th November - 3rd December, 1992"", ""key"": ""/books/OL10000135M"", ""title"": ""Parliamentary Debates, House of Lords, Bound Volumes, 1992-93"", ""identifiers"": {""goodreads"": [""6850240""]}, ""isbn_13"": [""9780107805401""], ""languages"": [{""key"": ""/languages/eng""}], ""number_of_pages"": 64, ""isbn_10"": [""0107805405""], ""publish_date"": ""December 1993"", ""last_modified"": {""type"": ""/type/datetime"", ""value"": ""2010-04-24T17:54:01.503315""}, ""authors"": [{""key"": ""/authors/OL2645777A""}], ""latest_revision"": 4, ""works"": [{""key"": ""/works/OL7925046W""}], ""type"": {""key"": ""/type/edition""}, ""subjects"": [""Government - Comparative"", ""Politics / Current Events""], ""revision"": 4}"
EDIT 2:
I'm trying to read this file, which is tab-separated with the following columns:
type - type of record (/type/edition, /type/work etc.)
key - unique key of the record. (/books/OL1M etc.)
revision - revision number of the record
last_modified - last modified timestamp
JSON - the complete record in JSON format
I'm trying to read the JSON column and, from that JSON, keep only "title", "isbn_13" and "isbn_10" as JSON, saving it to the file as a row,
so every row should look like the original but with only those keys and values.
Here's a straight-forward way of doing it. You would need to repeat this and extract the desired data from each line of the file as it's being read, line-by-line (the default way text file reading is handled in Python).
import json
line = '/type/edition /books/OL10000135M 4 2010-04-24T17:54:01.503315 {"publishers": ["Bernan Press"], "physical_format": "Hardcover", "subtitle": "9th November - 3rd December, 1992", "key": "/books/OL10000135M", "title": "Parliamentary Debates, House of Lords, Bound Volumes, 1992-93", "identifiers": {"goodreads": ["6850240"]}, "isbn_13": ["9780107805401"], "languages": [{"key": "/languages/eng"}], "number_of_pages": 64, "isbn_10": ["0107805405"], "publish_date": "December 1993", "last_modified": {"type": "/type/datetime", "value": "2010-04-24T17:54:01.503315"}, "authors": [{"key": "/authors/OL2645777A"}], "latest_revision": 4, "works": [{"key": "/works/OL7925046W"}], "type": {"key": "/type/edition"}, "subjects": ["Government - Comparative", "Politics / Current Events"], "revision": 4}'
csv_cols = line.split('\t')
json_data = json.loads(csv_cols[4])
#print(json.dumps(json_data, indent=4))
desired = {key: json_data[key] for key in ("title", "isbn_13", "isbn_10")}
result = json.dumps(desired, indent=4)
print(result)
Output from sample line:
{
"title": "Parliamentary Debates, House of Lords, Bound Volumes, 1992-93",
"isbn_13": [
"9780107805401"
],
"isbn_10": [
"0107805405"
]
}
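Repeating this for every line of the file could be sketched as below. It is a minimal sketch, assuming the JSON record is always the fifth tab-separated column; extract_fields, the file names, and the two demo lines are hypothetical. The file is streamed line by line, so the 26 GB input is never held in memory, and one reduced JSON object is written per line. The first four columns are dropped here; they could be written back in front of the JSON if each row should keep its original layout.

```python
import json
import os
import tempfile

KEEP = ("title", "isbn_13", "isbn_10")


def extract_fields(in_path, out_path, keep=KEEP):
    """Stream in_path line by line and write one reduced JSON object per line."""
    with open(in_path, encoding="utf-8") as src, \
         open(out_path, "w", encoding="utf-8") as dst:
        for line in src:
            line = line.rstrip("\n")
            if not line:
                continue
            # The JSON record is the fifth tab-separated column.
            record = json.loads(line.split("\t", 4)[4])
            # .get() tolerates records that lack one of the keys.
            reduced = {key: record.get(key) for key in keep}
            dst.write(json.dumps(reduced) + "\n")


# Tiny demonstration with a hypothetical two-line input file.
tmp_dir = tempfile.mkdtemp()
in_path = os.path.join(tmp_dir, "dump.txt")
out_path = os.path.join(tmp_dir, "out.jsonl")
with open(in_path, "w", encoding="utf-8") as f:
    f.write('/type/edition\t/books/OL1M\t4\t2010-04-24T17:54:01\t'
            '{"title": "A", "isbn_13": ["9780107805401"], "isbn_10": ["0107805405"], "revision": 4}\n')
    f.write('/type/edition\t/books/OL2M\t1\t2010-04-24T17:54:02\t{"title": "B"}\n')
extract_fields(in_path, out_path)
with open(out_path, encoding="utf-8") as f:
    results = [json.loads(line) for line in f]
print(results)
```

Using record.get(key) instead of record[key] is a deliberate choice: in a dump this large, some records may lack an ISBN, and a KeyError would stop the run partway through.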
So given that your current code returns the following:
result = '{""publishers"": [""Bernan Press""], ""physical_format"": ""Hardcover"", ""subtitle"": ""9th November - 3rd December, 1992"", ""key"": ""/books/OL10000135M"", ""title"": ""Parliamentary Debates, House of Lords, Bound Volumes, 1992-93"", ""identifiers"": {""goodreads"": [""6850240""]}, ""isbn_13"": [""9780107805401""], ""languages"": [{""key"": ""/languages/eng""}], ""number_of_pages"": 64, ""isbn_10"": [""0107805405""], ""publish_date"": ""December 1993"", ""last_modified"": {""type"": ""/type/datetime"", ""value"": ""2010-04-24T17:54:01.503315""}, ""authors"": [{""key"": ""/authors/OL2645777A""}], ""latest_revision"": 4, ""works"": [{""key"": ""/works/OL7925046W""}], ""type"": {""key"": ""/type/edition""}, ""subjects"": [""Government - Comparative"", ""Politics / Current Events""], ""revision"": 4}'
It looks like what you need to do first is replace those doubled double quotes with regular double quotes; otherwise the text is not parsable:
res = result.replace('""','"')
Now res is convertible to a JSON object:
import json
my_json = json.loads(res)
my_json now looks like this:
{'authors': [{'key': '/authors/OL2645777A'}],
'identifiers': {'goodreads': ['6850240']},
'isbn_10': ['0107805405'],
'isbn_13': ['9780107805401'],
'key': '/books/OL10000135M',
'languages': [{'key': '/languages/eng'}],
'last_modified': {'type': '/type/datetime',
'value': '2010-04-24T17:54:01.503315'},
'latest_revision': 4,
'number_of_pages': 64,
'physical_format': 'Hardcover',
'publish_date': 'December 1993',
'publishers': ['Bernan Press'],
'revision': 4,
'subjects': ['Government - Comparative', 'Politics / Current Events'],
'subtitle': '9th November - 3rd December, 1992',
'title': 'Parliamentary Debates, House of Lords, Bound Volumes, 1992-93',
'type': {'key': '/type/edition'},
'works': [{'key': '/works/OL7925046W'}]}
You can conveniently get any field you want from this object:
my_json['title']
# 'Parliamentary Debates, House of Lords, Bound Volumes, 1992-93'
my_json['isbn_10'][0]
# '0107805405'
Especially because your example is so large, I'd recommend using a specialized library such as pandas, which has a read_csv method, or even dask, which supports out-of-memory operations.
Both of these systems will automatically parse out the quotations for you, and dask will do so in "pieces" direct from disk so you never have to try to load 26GB into RAM.
In both libraries, you can then access the columns you want like this:
from pandas import read_csv  # or: from dask.dataframe import read_csv
data = read_csv(PATH)
data["ColumnName"]
You can then parse these rows either using json.loads() (import json) or you can use the pandas/dask json implementations. If you can give some more details of what you're expecting, I can help you draft a more specific code example.
Good luck!
I saved your data to a file to see if I could read just the rows. Let me know if this works:
import json

lines = zzread.split('\n')
temp=[]
for to_read in lines:
    if len(to_read) == 0:
        break
    new_to_read = '{' + to_read.split('{',1)[1]
    temp.append(json.loads(new_to_read))
for row in temp:
    print(row['isbn_13'])
If that works, this should create a JSON list for you:
lines = zzread.split('\n')
temp=[]
for to_read in lines:
    if len(to_read) == 0:
        break
    new_to_read = '{' + to_read.split('{',1)[1]
    temp.append(json.loads(new_to_read))
new_json=[]
for row in temp:
    new_json.append({'title': row['title'], 'isbn_13': row['isbn_13'], 'isbn_10': row['isbn_10']})
