Multi Level JSON Data into SQL Using Python

I've been working on taking JSON data and dumping it into a SQL database. I've run across some data that is "multi level" and I'm stuck on finding the best approach to handle the data and create the correct table structure in SQL.
Here's the portion of the JSON data that has a multi level structure:
"LimitedTaxonomy": {
"Children": [
{
"Children": [
{
"Children": [],
"NewPartCount": 0,
"Parameter": "Categories",
"ParameterId": -8,
"PartCount": 1,
"Value": "Logic - Flip Flops",
"ValueId": "706"
}
],
"NewPartCount": 0,
"Parameter": "Categories",
"ParameterId": -8,
"PartCount": 1,
"Value": "Integrated Circuits (ICs)",
"ValueId": "32"
}
],
"NewPartCount": 0,
"Parameter": "Categories",
"ParameterId": -8,
"PartCount": 1,
"Value": "Out of Bounds",
"ValueId": "0"
}
I have a function that I call when I'm parsing the JSON data that takes the structure and puts data in the SQL tables. Using the JSON data above I'd be passing:
thetable = 'LimitedTaxonomy'
thevalue = {'Children': [{'Children': [{'Children': [], 'NewPartCount': 0, 'Parameter': 'Categories', 'ParameterId': -8, 'PartCount': 1, 'Value': 'Logic - Flip Flops', 'ValueId': '706'}], 'NewPartCount': 0, 'Parameter': 'Categories', 'ParameterId': -8, 'PartCount': 1, 'Value': 'Integrated Circuits (ICs)', 'ValueId': '32'}], 'NewPartCount': 0, 'Parameter': 'Categories', 'ParameterId': -8, 'PartCount': 1, 'Value': 'Out of Bounds', 'ValueId': '0'}
def create_sql(thetable, thevalue):
    if len(thevalue) > 0:
        #print(thevalue[0], len(thevalue))
        if type(thevalue) is list:
            x = tuple(thevalue[0])
        elif type(thevalue) is dict:
            x = tuple(thevalue)
        a = str(x).replace("'", "").replace("(", "")
        query = "INSERT INTO " + thetable + " (PartDetailsId, " + a + " VALUES(" + str(data['PartDetails']['PartId'])
        b = ""
        for i in range(len(x)):
            b += ", ?"  # I've also seen %s used as a placeholder
        b += ")"
        query += b
        print(query)
        print(thevalue)
        print()
        #TODO Need to check before entering data, list has many records, dict has 1
        if type(thevalue) is list:
            cursor.executemany(query, [tuple(d.values()) for d in thevalue])
        elif type(thevalue) is dict:
            cursor.execute(query, tuple(thevalue.values()))
        cursor.commit()
The above function works well with "single level" JSON, but this data basically has a table/array (Children) as a column.
In SQL I have the 2 tables defined like this, which might not be the right way to handle this:
SELECT TOP (1000) [LimitedTaxonomyId]
      ,[PartDetailsId]
      ,[NewPartCount]
      ,[Parameter]
      ,[ParameterId]
      ,[PartCount]
      ,[Value]
      ,[ValueId]
  FROM [Components].[dbo].[LimitedTaxonomy]

SELECT TOP (1000) [ChildrenId]
      ,[LimitedTaxonomyId]
      ,[NewPartCount]
      ,[Parameter]
      ,[ParameterId]
      ,[PartCount]
      ,[Value]
      ,[ValueId]
  FROM [Components].[dbo].[Children]
I think I need to check thevalue and, if it contains a list, first pull that list out of thevalue and feed the remaining data into my create_sql function. After that I'd put the extracted list back through create_sql, but first query the SQL database to grab the [LimitedTaxonomyId] value that was just entered.
That sounds like a big mess, and maybe it's the only way, but I'd like some second opinions on how I'm going about this and whether there's a better way.
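One option is roughly what you describe: a small recursive helper that inserts the parent row, reads back the generated LimitedTaxonomyId, and then inserts each nested Children entry with that key. The sketch below is hypothetical, not a drop-in solution: it assumes a pyodbc connection to SQL Server (consistent with the cursor.executemany()/cursor.commit() calls above) and uses an OUTPUT INSERTED clause to read the new identity value. Note that with the Children table as currently defined, every nested level ends up keyed to the root LimitedTaxonomyId, so depth information is flattened unless you add something like a ParentChildrenId column.
def insert_taxonomy(cursor, part_details_id, node):
    # Sketch: insert the root node into LimitedTaxonomy,
    # then walk every nested Children entry into the Children table.
    cursor.execute(
        "INSERT INTO LimitedTaxonomy "
        "(PartDetailsId, NewPartCount, Parameter, ParameterId, PartCount, Value, ValueId) "
        "OUTPUT INSERTED.LimitedTaxonomyId "
        "VALUES (?, ?, ?, ?, ?, ?, ?)",
        (part_details_id, node['NewPartCount'], node['Parameter'], node['ParameterId'],
         node['PartCount'], node['Value'], node['ValueId']))
    taxonomy_id = cursor.fetchone()[0]  # identity value of the row just inserted

    def insert_children(children):
        for child in children:
            cursor.execute(
                "INSERT INTO Children "
                "(LimitedTaxonomyId, NewPartCount, Parameter, ParameterId, PartCount, Value, ValueId) "
                "VALUES (?, ?, ?, ?, ?, ?, ?)",
                (taxonomy_id, child['NewPartCount'], child['Parameter'], child['ParameterId'],
                 child['PartCount'], child['Value'], child['ValueId']))
            insert_children(child.get('Children', []))  # recurse into deeper levels

    insert_children(node.get('Children', []))
    cursor.commit()

# Usage, with thevalue being the LimitedTaxonomy dict shown above:
# insert_taxonomy(cursor, data['PartDetails']['PartId'], thevalue)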

Related

How to parse tab-delimited text file with 4th column as json and remove certain keys?

I have a text file that is 26 GB. The line format is as follows:
/type/edition /books/OL10000135M 4 2010-04-24T17:54:01.503315 {"publishers": ["Bernan Press"], "physical_format": "Hardcover", "subtitle": "9th November - 3rd December, 1992", "key": "/books/OL10000135M", "title": "Parliamentary Debates, House of Lords, Bound Volumes, 1992-93", "identifiers": {"goodreads": ["6850240"]}, "isbn_13": ["9780107805401"], "languages": [{"key": "/languages/eng"}], "number_of_pages": 64, "isbn_10": ["0107805405"], "publish_date": "December 1993", "last_modified": {"type": "/type/datetime", "value": "2010-04-24T17:54:01.503315"}, "authors": [{"key": "/authors/OL2645777A"}], "latest_revision": 4, "works": [{"key": "/works/OL7925046W"}], "type": {"key": "/type/edition"}, "subjects": ["Government - Comparative", "Politics / Current Events"], "revision": 4}
I'm trying to get only the last column, which is JSON, and from that JSON I only want to save the "title", "isbn_13" and "isbn_10".
I was able to save only the last column with this code:
import csv
import sys

csv.field_size_limit(sys.maxsize)

# File names: to read in from and read out to
input_file = '../inputFile/ol_dump_editions_2019-10-31.txt'
output_file = '../outputFile/output.txt'

## ==================== ##
##  Using module 'csv'  ##
## ==================== ##
with open(input_file) as to_read:
    with open(output_file, "w") as tmp_file:
        reader = csv.reader(to_read, delimiter="\t")
        writer = csv.writer(tmp_file)
        desired_column = [4]  # text column
        for row in reader:  # read one row at a time
            myColumn = list(row[i] for i in desired_column)  # build the output row (process)
            writer.writerow(myColumn)  # write it
but this doesn't return a proper JSON object; instead it returns everything with double quotations next to it. Also, how would I extract certain values from the JSON as a new JSON?
EDIT:
"{""publishers"": [""Bernan Press""], ""physical_format"": ""Hardcover"", ""subtitle"": ""9th November - 3rd December, 1992"", ""key"": ""/books/OL10000135M"", ""title"": ""Parliamentary Debates, House of Lords, Bound Volumes, 1992-93"", ""identifiers"": {""goodreads"": [""6850240""]}, ""isbn_13"": [""9780107805401""], ""languages"": [{""key"": ""/languages/eng""}], ""number_of_pages"": 64, ""isbn_10"": [""0107805405""], ""publish_date"": ""December 1993"", ""last_modified"": {""type"": ""/type/datetime"", ""value"": ""2010-04-24T17:54:01.503315""}, ""authors"": [{""key"": ""/authors/OL2645777A""}], ""latest_revision"": 4, ""works"": [{""key"": ""/works/OL7925046W""}], ""type"": {""key"": ""/type/edition""}, ""subjects"": [""Government - Comparative"", ""Politics / Current Events""], ""revision"": 4}"
EDIT 2:
So I'm trying to read this file, which is a tab-separated file with the following columns:
type - type of record (/type/edition, /type/work etc.)
key - unique key of the record. (/books/OL1M etc.)
revision - revision number of the record
last_modified - last modified timestamp
JSON - the complete record in JSON format
I'm trying to read the JSON column, and from that JSON I only want to get the "title", "isbn_13" and "isbn_10" as a JSON object and save it to the file as a row,
so every row should look like the original but with only those keys and values.
Here's a straightforward way of doing it. You would need to repeat this and extract the desired data from each line of the file as it's being read, line by line (the default way text-file reading is handled in Python).
import json
line = '/type/edition /books/OL10000135M 4 2010-04-24T17:54:01.503315 {"publishers": ["Bernan Press"], "physical_format": "Hardcover", "subtitle": "9th November - 3rd December, 1992", "key": "/books/OL10000135M", "title": "Parliamentary Debates, House of Lords, Bound Volumes, 1992-93", "identifiers": {"goodreads": ["6850240"]}, "isbn_13": ["9780107805401"], "languages": [{"key": "/languages/eng"}], "number_of_pages": 64, "isbn_10": ["0107805405"], "publish_date": "December 1993", "last_modified": {"type": "/type/datetime", "value": "2010-04-24T17:54:01.503315"}, "authors": [{"key": "/authors/OL2645777A"}], "latest_revision": 4, "works": [{"key": "/works/OL7925046W"}], "type": {"key": "/type/edition"}, "subjects": ["Government - Comparative", "Politics / Current Events"], "revision": 4}'
csv_cols = line.split('\t')
json_data = json.loads(csv_cols[4])
#print(json.dumps(json_data, indent=4))
desired = {key: json_data[key] for key in ("title", "isbn_13", "isbn_10")}
result = json.dumps(desired, indent=4)
print(result)
Output from sample line:
{
    "title": "Parliamentary Debates, House of Lords, Bound Volumes, 1992-93",
    "isbn_13": [
        "9780107805401"
    ],
    "isbn_10": [
        "0107805405"
    ]
}
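Applied to the whole dump line by line, the same extraction could look like the sketch below (the file paths are placeholders, and malformed lines are not handled):
import json

with open('ol_dump_editions_2019-10-31.txt') as src, open('output.jsonl', 'w') as dst:
    for line in src:
        record = json.loads(line.split('\t')[4])  # the fifth tab-separated column holds the JSON record
        desired = {key: record.get(key) for key in ("title", "isbn_13", "isbn_10")}
        dst.write(json.dumps(desired) + '\n')  # one JSON object per output line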
So given that your current code returns the following:
result = '{""publishers"": [""Bernan Press""], ""physical_format"": ""Hardcover"", ""subtitle"": ""9th November - 3rd December, 1992"", ""key"": ""/books/OL10000135M"", ""title"": ""Parliamentary Debates, House of Lords, Bound Volumes, 1992-93"", ""identifiers"": {""goodreads"": [""6850240""]}, ""isbn_13"": [""9780107805401""], ""languages"": [{""key"": ""/languages/eng""}], ""number_of_pages"": 64, ""isbn_10"": [""0107805405""], ""publish_date"": ""December 1993"", ""last_modified"": {""type"": ""/type/datetime"", ""value"": ""2010-04-24T17:54:01.503315""}, ""authors"": [{""key"": ""/authors/OL2645777A""}], ""latest_revision"": 4, ""works"": [{""key"": ""/works/OL7925046W""}], ""type"": {""key"": ""/type/edition""}, ""subjects"": [""Government - Comparative"", ""Politics / Current Events""], ""revision"": 4}'
Looks like what you need to do first is replace those double-double-quotes with regular double quotes, otherwise things are not parseable:
res = result.replace('""','"')
Now res is convertible to a JSON object:
import json
my_json = json.loads(res)
my_json now looks like this:
{'authors': [{'key': '/authors/OL2645777A'}],
 'identifiers': {'goodreads': ['6850240']},
 'isbn_10': ['0107805405'],
 'isbn_13': ['9780107805401'],
 'key': '/books/OL10000135M',
 'languages': [{'key': '/languages/eng'}],
 'last_modified': {'type': '/type/datetime',
                   'value': '2010-04-24T17:54:01.503315'},
 'latest_revision': 4,
 'number_of_pages': 64,
 'physical_format': 'Hardcover',
 'publish_date': 'December 1993',
 'publishers': ['Bernan Press'],
 'revision': 4,
 'subjects': ['Government - Comparative', 'Politics / Current Events'],
 'subtitle': '9th November - 3rd December, 1992',
 'title': 'Parliamentary Debates, House of Lords, Bound Volumes, 1992-93',
 'type': {'key': '/type/edition'},
 'works': [{'key': '/works/OL7925046W'}]}
You can conveniently get any field you want from this object:
my_json['title']
# 'Parliamentary Debates, House of Lords, Bound Volumes, 1992-93'
my_json['isbn_10'][0]
# '0107805405'
Especially because your example is so large, I'd recommend using a specialized library such as pandas, which has a read_csv method, or even dask, which supports out-of-memory operations.
Both of these systems will automatically parse out the quotations for you, and dask will do so in "pieces" directly from disk so you never have to try to load 26 GB into RAM.
In both libraries, you can then access the columns you want like this:
data = read_csv(PATH)
data["ColumnName"]
You can then parse these rows either using json.loads() (import json) or you can use the pandas/dask json implementations. If you can give some more details of what you're expecting, I can help you draft a more specific code example.
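In the meantime, here is a rough sketch with pandas chunked reading; the column names and chunk size are assumptions and the path is a placeholder, and quoting=csv.QUOTE_NONE is used so the quotes embedded in the JSON column are passed through untouched:
import csv
import json
import pandas as pd

cols = ["type", "key", "revision", "last_modified", "json"]
chunks = pd.read_csv("ol_dump_editions_2019-10-31.txt", sep="\t", names=cols,
                     chunksize=100_000, quoting=csv.QUOTE_NONE)

with open("output.jsonl", "w") as out:
    for chunk in chunks:                      # each chunk is a DataFrame small enough to fit in RAM
        for record in chunk["json"].map(json.loads):
            keep = {k: record.get(k) for k in ("title", "isbn_13", "isbn_10")}
            out.write(json.dumps(keep) + "\n")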
Good luck!
I saved your data to a file to see if I could read just the rows; let me know if this works:
import json

# zzread holds the contents of the saved file as one string
lines = zzread.split('\n')
temp = []
for to_read in lines:
    if len(to_read) == 0:
        break
    new_to_read = '{' + to_read.split('{', 1)[1]  # keep everything from the first '{'
    temp.append(json.loads(new_to_read))
for row in temp:
    print(row['isbn_13'])
If that works, this should create a JSON for you:
lines = zzread.split('\n')
temp = []
for to_read in lines:
    if len(to_read) == 0:
        break
    new_to_read = '{' + to_read.split('{', 1)[1]
    temp.append(json.loads(new_to_read))

new_json = []
for row in temp:
    new_json.append({'title': row['title'], 'isbn_13': row['isbn_13'], 'isbn_10': row['isbn_10']})

Structuring request JSON for API

I'm building a small API to interact with our database for other projects. I've built the database and have the API functioning fine; however, the data I get back isn't structured how I want it.
I am using Python with Flask/Flask-Restful for the API.
Here is a snippet of my Python that handles the interaction:
class Address(Resource):
    def get(self, store):
        print('Received a request at ADDRESS for Store ' + store)
        conn = sqlite3.connect('store-db.db')
        cur = conn.cursor()
        addresses = cur.execute('SELECT * FROM Sites WHERE StoreNumber like ' + store)
        for adr in addresses:
            return (adr, 200)
If I make a request to the /sites/42 endpoint, where 42 is the site id, this is what I'll receive:
[
    "42",
    "5000 Robinson Centre Drive",
    "",
    "Pittsburgh",
    "PA",
    "15205",
    "(412) 787-1330",
    "(412) 249-9161",
    "",
    "Dick's Sporting Goods"
]
Here is how it is structured in the database:
Ultimately I'd like to use the column name as the key in the JSON that's returned, but I need a bit of guidance in the right direction so I'm not Googling ambiguous terms hoping to find something.
Here is an example of what I'd like to receive after making a request to that endpoint:
{
    "StoreNumber": "42",
    "Street": "5000 Robinson Centre Drive",
    "StreetSecondary": "",
    "City": "Pittsburgh",
    "State": "PA",
    "ZipCode": "15205",
    "ContactNumber": "(412) 787-1330",
    "XO_TN": "(412) 249-9161",
    "RelocationStatus": "",
    "StoreType": "Dick's Sporting Goods"
}
I'm just looking to get some guidance on if I should change how my data is structured in the database (i.e. I've seen some just put the JSON in their database, but I think that's messy) or if there's a more intuitive method I could use to control my data.
Updated Code using Accepted Answer
class Address(Resource):
    def get(self, store):
        print('Received a request at ADDRESS for Store ' + store)
        conn = sqlite3.connect('store-db.db')
        cur = conn.cursor()
        addresses = cur.execute('SELECT * FROM Sites WHERE StoreNumber like ' + store)
        for r in addresses:
            column_names = ["StoreNumber", "Street", "StreetSecondary", "City", "State", "ZipCode",
                            "ContactNumber", "XO_TN", "RelocationStatus", "StoreType"]
            data = [r[0], r[1], r[2], r[3], r[4], r[5], r[6], r[7], r[8], r[9]]
            datadict = {column_names[itemindex]: item for itemindex, item in enumerate(data)}
            return (datadict, 200)
You could just convert your list to a dict and then parse it to a JSON string before passing it back out.
# These are the names of the columns in your database
>>> column_names = ["storeid", "address", "etc"]

# This is the data coming from the database.
# All columns are returned because you are using SELECT * in your query.
>>> data = [42, "1 the street", "blah"]

# This is a quick notation for creating a dict from a list.
# enumerate() gives us a list index and a list item;
# as the columns are in the same order as the data, we can use the list index to pull out the column_name.
>>> datadict = {column_names[itemindex]: item for itemindex, item in enumerate(data)}

# This just prints datadict in my terminal
>>> datadict
We now have a named dict containing your data and the column names.
{'etc': 'blah', 'storeid': 42, 'address': '1 the street'}
Now dump the datadict to a string so that it can be sent to the frontend.
>>> import json
>>> json.dumps(datadict)
The dict has now been converted to a string.
'{"etc": "blah", "storeid": 42, "address": "1 the street"}'
This would require no change to your database but the script would need to know about the column names or retrieve them dynamically using some SQL.
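For example, a minimal sketch of pulling the column names from the DB-API cursor.description metadata inside your get() method, instead of hard-coding them (it also uses a parameterized query, which sidesteps the SQL injection risk of concatenating store into the statement):
cur.execute('SELECT * FROM Sites WHERE StoreNumber = ?', (store,))
row = cur.fetchone()
if row is None:
    return ({'error': 'store not found'}, 404)
column_names = [desc[0] for desc in cur.description]  # first item of each description tuple is the column name
return (dict(zip(column_names, row)), 200)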
If the data in the database is in the correct format for passing to the frontend then you shouldn't need to change the database structure. If it was not in the correct format then you could either change the way it was stored or change your SQL query to manipulate it.

Python-How to find duplicated name/document in mongo db?

I want to find the duplicated document in my mongodb based on name, I have the following code:
def Check_BFA_DB(options):
    issue_list = []
    client = MongoClient(options.host, int(options.port))
    db = client[options.db]
    collection = db[options.collection]
    names = [{'$project': {'name': '$name'}}]
    name_cursor = collection.aggregate(names, cursor={})
    for name in name_cursor:
        issue_list.append(name)
        print(name)
It will print all names; how can I print only the duplicated ones?
Any help is appreciated!
The following query will show only duplicates:
db['collection_name'].aggregate([{'$group': {'_id':'$name', 'count': {'$sum': 1}}}, {'$match': {'count': {'$gt': 1}}}])
How it works:
Step 1:
Go over the whole collection and group the documents by the property called name; for each name, count how many times it is used in the collection.
Step 2:
Filter (using the $match stage) only the documents in which the count is greater than 1 (the $gt operator).
An example (written for mongo shell, but can be easily adapted for python):
db.a.insert({name: "name1"})
db.a.insert({name: "name1"})
db.a.insert({name: "name2"})
db.a.aggregate([{"$group": {_id:"$name", count: {"$sum": 1}}}, {$match: {count: {"$gt": 1}}}])
Result is { "_id" : "name1", "count" : 2 }
So your code should look something like this:
def Check_BFA_DB(options):
    issue_list = []
    client = MongoClient(options.host, int(options.port))
    db = client[options.db]
    name_cursor = db[options.collection].aggregate([
        {'$group': {'_id': '$name', 'count': {'$sum': 1}}},
        {'$match': {'count': {'$gt': 1}}}
    ])
    for document in name_cursor:
        name = document['_id']
        issue_list.append(name)
        print(name)
BTW (not related to the question), the Python naming convention for function names is lowercase with underscores, so you might want to call it check_bfa_db().

how to accelerate compute for pyspark

The source data is event logs from devices, and all of the data is in JSON format.
Sample of the raw JSON data:
{"sn": "123", "ip": null, "evt_name": "client_requestData", "evt_content": {"count": 1, "hour": 13, "dow": 0, "segmentation": {"requestService": "music", "requestData": "is_online", "requestOpcode": "get_state"}, "sum": 0}, "evt_ts": 1521350052, "app_key": "f6e7f4f8ec4b4d6dae6fa2b5ed8f90cb6a640759", "sdk_name": "countlysdk_0.0.9", "sdk_version": "17.05"}
{"sn": "123", "ip": null, "evt_name": "client_requestData2", "evt_content": {"count": 1, "hour": 13, "dow": 0, "segmentation": {"requestService": "fm", "requestData": "is_online", "requestOpcode": "get_state"}, "sum": 0}, "evt_ts": 1521350053, "app_key": "f6e7f4f8ec4b4d6dae6fa2b5ed8f90cb6a640759", "sdk_name": "countlysdk_0.0.9", "sdk_version": "17.05"}
{"sn": "123", "ip": null, "evt_name": "client_requestData3", "evt_content": {"count": 1, "hour": 13, "dow": 0, "segmentation": {"requestService": "video", "requestData": "is_online", "requestOpcode": "get_state"}, "sum": 0}, "evt_ts": 1521350054, "app_key": "f6e7f4f8ec4b4d6dae6fa2b5ed8f90cb6a640759", "sdk_name": "countlysdk_0.0.9", "sdk_version": "17.05"}
{"sn": "123", "ip": null, "evt_name": "client_requestData4", "evt_content": {"count": 1, "hour": 13, "dow": 0, "segmentation": {"requestService": "fm", "requestData": "is_online", "requestOpcode": "get_state"}, "sum": 0}
I have an event list (e.g. tar_task_list) of 100 or more items, and for each event
I need to aggregate all the matching events from the raw data and then save the result to a per-event CSV file.
Below is the code:
# read source data
raw_data = sc.textFile("s3://xxx").map(lambda x: json.loads(x))

# TODO: NEED TO SPEED UP THIS COMPUTING
for tar_evt_name in evts:
    print("...")
    table_name = out_table_prefix + tar_evt_name
    evt_one_rdd = raw_data.filter(lambda x: x.get("evt_name") == tar_evt_name)
    evt_one_rdd.cache()
    evt_one_dict = evt_one_rdd.first()
    Evt_one = Row(*sorted(['{}'.format(k) for k, v in evt_one_dict.items()]))
    col_len = len(evt_one_rdd.first())
    evt_one_rdd2 = evt_one_rdd.map(lambda x: to_list(x, col_len)).filter(lambda x: len(x) is not 0)
    evt_one_rdd2.cache()
    df = spark.createDataFrame(evt_one_rdd2.map(lambda x: Evt_one(*x)))
    out_csv_path = output + '/' + tar_evt_name + '/'  # add last '/' for copy err
    df.write.csv(out_csv_path, mode='overwrite', header=True, sep='|', nullValue="NULL")
The output data looks like below:
time : 2018-05-07 00:03|8dab4796-fa37-4114-0011-7637fa2b0001|f6e7f4f8ec4b4d6dae6fa2b5ed8f90cb6a640759|0.2.23|131074|2018-05-08 23:24:25|0|false|default|2.4.130
Here is my attempt. I have noticed a few issues here:
for tar_evt_name in evts is a native Python for loop, which incurs a performance penalty when it seems like you want to do a group-by operation;
.cache() is used, but seemingly for no reason;
unsure what to_list is;
don't think evt_one_rdd2.map(lambda x: Evt_one(*x)) works.
import json

from pyspark.sql import functions as F
from pyspark.sql import Row
from pyspark.sql import Window

raw_data = sc.textFile('test.txt')

df = raw_data.map(
    # Map the raw input to a python dict using `json.loads`
    json.loads,
).map(
    # Duplicate the evt_name and evt_ts for later use in a Row object
    lambda x: Row(evt_name=x['evt_name'], evt_ts=x.get('evt_ts', 1), data=x),
).toDF()  # Convert into a dataframe...
# ... (I am actually unsure if this is faster...
# ... but I am more comfortable with this)

filtered_df = df.withColumn(
    # NOTE: Assumed you want the first row, as you used `evt_one_rdd.first()`.
    # So we assign a row number (named rn) and then filter on rn = 1.
    # Here the evt_name and evt_ts become handy; you might want to set
    # your own evt_ts properly.
    'rn', F.row_number().over(
        Window.partitionBy(df['evt_name']).orderBy(df['evt_ts'])
    ),
).filter('rn = 1').where(
    # NOTE: Since you used `map(lambda x: to_list(x, col_len)).filter(lambda x: len(x) is not 0)`,
    # I assume you meant data should have more than 0 keys,
    # but this should almost always be true,
    # since you are grouping by `evt_name`, which means
    # there is at least that key most of the time.
    F.size(F.col('data')) > 0
)

filtered_df.write(....)
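If the end goal is still one CSV folder per event name (as in the original loop), a partitioned write might avoid the per-event Python loop entirely. A rough sketch, assuming the same filtered_df as above and the output base path from the question, and converting the map-typed data column to a JSON string so it can be written as CSV:
(filtered_df
    .withColumn('data_json', F.to_json('data'))  # CSV cannot hold a map column directly
    .drop('data', 'rn')
    .write
    .partitionBy('evt_name')                     # one sub-folder per event name
    .csv(output, mode='overwrite', header=True, sep='|', nullValue='NULL'))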

Execute user-defined query on list of dictionaries

I have a set of data that a user needs to query using their own query string. The current solution creates a temporary in-memory sqlite database that the query is run against.
The dataset is a list of "flat" dictionaries, i.e. there is no nested data. The query string does not need to be SQL, but it should be simple to define using an existing query framework.
It needs to support ordering (ascending, descending, custom) and filtering.
The purpose of this question is to get a range of different solutions that might work for this use case.
import sqlite3

items = [
    {'id': 1},
    {'id': 2, 'description': 'This is a description'},
    {'id': 3, 'comment': 'This is a comment'},
    {'id': 4, 'height': 1.78}
]

# Assemble temporary sqlite database
conn = sqlite3.connect(':memory:')
cur = conn.cursor()

knownTypes = {"id": "real", "height": "real", "comment": "text"}
allKeys = list(set().union(*(d.keys() for d in items)))
allTypes = list(knownTypes.get(k, "text") for k in allKeys)

createTable_query = "CREATE TABLE data ({});".format(", ".join(["{} {}".format(x[0], x[1]) for x in zip(allKeys, allTypes)]))
cur.execute(createTable_query)
conn.commit()

qs = ["?" for i in range(len(allKeys))]
insertRow_query = "INSERT INTO data VALUES ({});".format(", ".join(qs))
for p in items:
    vals = list([p.get(k, None) for k in allKeys])
    cur.execute(insertRow_query, vals)
conn.commit()

# modify user query here
theUserQuery = "SELECT * FROM data"

# Get data from query
data = [row for row in cur.execute(theUserQuery)]
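For example, a user-supplied query string against that temporary table can combine filtering and ordering (a hypothetical query, not part of the original code):
# Hypothetical user query: filter on a column and order the results
theUserQuery = "SELECT id, height FROM data WHERE height IS NOT NULL ORDER BY height DESC"
data = [row for row in cur.execute(theUserQuery)]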
YAQL is what I'm looking for.
It doesn't do SQL, but it does execute a query string - which is a simple way to do complex user-defined sorting and filtering.
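A rough sketch of what that can look like with YAQL; the factory/engine API and expression syntax here are recalled from the YAQL documentation, so treat the details as assumptions to verify against the current release:
import yaql

engine = yaql.factory.YaqlFactory().create()
expression = engine('$.where($.id > 1)')          # user-defined filter over the list of dicts
result = list(expression.evaluate(data=items))    # items is the list defined above
print(result)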
There's a library called litebox that does what you want. It is backed by SQLite.
from litebox import LiteBox
items = [
{'id': 1},
{'id': 2, 'description': 'This is a description'},
{'id': 3, 'comment': 'This is a comment'},
{'id': 4, 'height': 1.78}
]
types = {"id": int, "height": float, "comment": str}
lb = LiteBox(items, types)
lb.find("height > 1.5")
Result: [{'id': 4, 'height': 1.78}]
