I need to write automated Python code that creates a database table whose column names are the keys from a JSON file and whose column data are the values of those respective keys.
My JSON looks like this:
{
"Table_1": [
{
"Name": "B"
},
{
"BGE3": [
"Itm2",
"Itm1",
"Glass"
]
},
{
"Trans": []
},
{
"Art": [
"SYS"
]
}]}
My table name should be: Table_1.
So my column names should look like: Name | BGE3 | Trans | Art.
And the data should be their respective values.
Creation of the table and columns has to be dynamic, because I need to run this code on multiple JSON files.
So far I have managed to connect to the PostgreSQL database using Python.
So please help me with a solution. Thank you.
Postgres version 13.
Existing code:
cur.execute("CREATE TABLE Table_1(Name varchar, BGE3 varchar, Trans varchar, Art varchar)")
for d in data: cur.execute("INSERT into B_Json_3(Name, BGE3, Trans , Art) VALUES (%s, %s, %s, %s,)", d)
where data is a list of arrays I built, which only works for this particular JSON. I need a function that will work on any JSON I give it, where the value of any key can be a list of 100 elements.
Here is the table creation portion, using the Python json module to convert the JSON into a Python dict and the psycopg2.sql module to dynamically build the CREATE TABLE:
import json
import psycopg2
from psycopg2 import sql
tbl_json = """{
"Table_1": [
{
"Name": "B"
},
{
"BGE3": [
"Itm2",
"Itm1",
"Glass"
]
},
{
"Trans": []
},
{
"Art": [
"SYS"
]
}]}
"""
# Transform JSON string into Python dict. Use json.load if pulling from file.
# Pull out table name and column names from dict.
tbl_dict = json.loads(tbl_json)
tbl_name = list(tbl_dict)[0]
tbl_name
'Table_1'
col_names = [list(col_dict)[0] for col_dict in tbl_dict[tbl_name]]
# Result of above.
col_names
['Name', 'BGE3', 'Trans', 'Art']
# Create list of types and then combine column names and column types into
# psycopg2 sql composed object. Warning: sql.SQL() does no escaping so potential
# injection risk.
type_list = ["varchar", "varchar", "varchar"]
col_type = []
for i in zip(map(sql.Identifier, col_names), map(sql.SQL,type_list)):
col_type.append(i[0] + i[1])
# The result of above.
col_type
[Composed([Identifier('Name'), SQL('varchar')]),
 Composed([Identifier('BGE3'), SQL('varchar')]),
 Composed([Identifier('Trans'), SQL('varchar')]),
 Composed([Identifier('Art'), SQL('varchar')])]
# Build psycopg2 sql string using above.
sql_str = sql.SQL("CREATE table {} ({})").format(sql.Identifier(tbl_name), sql.SQL(',').join(col_type) )
con = psycopg2.connect("dbname=test host=localhost user=aklaver")
cur = con.cursor()
# Shows the CREATE statement that will be executed.
print(sql_str.as_string(con))
CREATE table "Table_1" ("Name"varchar,"BGE3"varchar,"Trans"varchar)
# Execute statement and commit.
cur.execute(sql_str)
con.commit()
# In psql client the result of the execute:
\d "Table_1"
Table "public.Table_1"
Column | Type | Collation | Nullable | Default
--------+-------------------+-----------+----------+---------
Name | character varying | | |
BGE3 | character varying | | |
Trans | character varying | | |
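The insert portion can be composed the same way. Below is a minimal sketch of that part (an assumption on my side: each list value is collapsed into one comma-separated string, since the columns are plain varchar):
# Build the row values from the same dict: join list values into a single
# string so they fit the varchar columns; keep plain strings as they are.
col_values = []
for col_dict in tbl_dict[tbl_name]:
    val = list(col_dict.values())[0]
    col_values.append(", ".join(val) if isinstance(val, list) else val)

# Compose the INSERT dynamically from the table and column names.
insert_str = sql.SQL("INSERT INTO {} ({}) VALUES ({})").format(
    sql.Identifier(tbl_name),
    sql.SQL(", ").join(map(sql.Identifier, col_names)),
    sql.SQL(", ").join([sql.Placeholder()] * len(col_names)),
)
cur.execute(insert_str, col_values)
con.commit()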
I am trying to convert the below JSON into a pandas dataframe. The JSON is being captured using flask request method.
{
"json_str": [{
"TileName ": "Master",
"ImageLink ": "Link1",
"Report Details": [{
"ReportName": "Primary",
"ReportUrl": "link1",
"ADGroup": ["operations", "Sales"],
"IsActive": 1
}, {
"ReportName": "Secondry",
"ReportUrl": "link2",
"ADGroup": ["operations", "Sales"],
"IsActive": 1
}],
"OpsFlag": 1
}]
}
Below are the steps I am following:
1. Using the request.json.get() method to get the JSON
2. Converting it to a string with json.dumps() and parsing it back with json.loads()
3. Normalizing it using pd.json_normalize and finally
4. Using pyodbc to run a stored procedure to insert the data into the database
Here are the code snippets:
###Step 1 and 2###
json_strg = request.json.get("json_str", None)  # in flask app.py
json_strf = json.dumps(json_strg)
js_obj = json.loads(json_strf)
###Step 3###
df = pd.json_normalize(js_obj, record_path='Report Details',
                       meta=['TileName', 'ImageLink', 'OpsFlag'],
                       errors='ignore').explode('ADGroup').apply(pd.Series)
Cols = ['TileName','ImageLink','ReportName','ReportUrl','ADGroup','OpsFlag','IsActive']
df= df[Cols]
###Step 4###
conn = pyodbc.connect(conn_string)
cur = conn.cursor()
for i, v in df.iterrows():
    sql = """SET NOCOUNT ON;
             EXEC [dbo].[mystored_proc] ?, ?, ?, ?, ?, ?, ?"""
    value = tuple(v)
    cur.execute(sql, value)
conn.commit()
The above code is giving the below error:
"('42000', '[42000] [Microsoft][ODBC Driver 18 for SQL Server][SQL Server]The incoming
tabular data stream (TDS) remote procedure call (RPC) protocol stream is incorrect.
Parameter 4 (\"\"): The supplied value is not a valid instance of data type float.
Check the source data for invalid values. An example of an invalid value is data of
numeric type with scale greater than precision. (8023) (SQLExecDirectW)')"
The strange thing is that when I run the above code directly on the value of the json_str part (i.e. the string starting from '{"TileName...'), no error occurs and the data is actually inserted into the DB.
This means there is no issue in steps 3 and 4; the issue is in steps 1 and 2.
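For reference, here is a minimal sketch of how steps 1 and 2 could be collapsed, since Flask's request.json already returns parsed Python objects (so the json.dumps/json.loads round trip should not be needed):
###Step 1 and 2 (simplified sketch)###
# request.json is already parsed, so js_obj is a Python list of dicts here.
js_obj = request.json.get("json_str", None)  # in flask app.py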
Any clue?
I am doing a migration from Cassandra on an AWS machine to Astra Cassandra and there are some problems:
I cannot insert new data into Astra Cassandra when a column value is around 2 million characters / 1.77 MB (and I have bigger data to insert, around 20 million characters). Does anyone know how to address this problem?
I am inserting it via a Python app (cassandra-driver==3.17.0) and this is the error stack I get:
start.sh[5625]: [2022-07-12 15:14:39,336]
INFO in db_ops: error = Error from server: code=1500
[Replica(s) failed to execute write]
message="Operation failed - received 0 responses and 2 failures: UNKNOWN from 0.0.0.125:7000, UNKNOWN from 0.0.0.181:7000"
info={'consistency': 'LOCAL_QUORUM', 'required_responses': 2, 'received_responses': 0, 'failures': 2}
If I use half of those characters, it works.
New Astra Cassandra CQL console table description:
token@cqlsh> describe mykeyspace.series;
CREATE TABLE mykeyspace.series (
type text,
name text,
as_of timestamp,
data text,
hash text,
PRIMARY KEY ((type, name, as_of))
) WITH additional_write_policy = '99PERCENTILE'
AND bloom_filter_fp_chance = 0.01
AND caching = {'keys': 'ALL', 'rows_per_partition': 'NONE'}
AND comment = ''
AND compaction = {'class': 'org.apache.cassandra.db.compaction.UnifiedCompactionStrategy'}
AND compression = {'chunk_length_in_kb': '64', 'class': 'org.apache.cassandra.io.compress.LZ4Compressor'}
AND crc_check_chance = 1.0
AND default_time_to_live = 0
AND gc_grace_seconds = 864000
AND max_index_interval = 2048
AND memtable_flush_period_in_ms = 0
AND min_index_interval = 128
AND read_repair = 'BLOCKING'
AND speculative_retry = '99PERCENTILE';
Old Cassandra table description:
ansible@cqlsh> describe mykeyspace.series;
CREATE TABLE mykeyspace.series (
type text,
name text,
as_of timestamp,
data text,
hash text,
PRIMARY KEY ((type, name, as_of))
) WITH bloom_filter_fp_chance = 0.01
AND caching = {'keys': 'ALL', 'rows_per_partition': 'NONE'}
AND comment = ''
AND compaction = {'class': 'org.apache.cassandra.db.compaction.SizeTieredCompactionStrategy', 'max_threshold': '32', 'min_threshold': '4'}
AND compression = {'chunk_length_in_kb': '64', 'class': 'org.apache.cassandra.io.compress.LZ4Compressor'}
AND crc_check_chance = 1.0
AND dclocal_read_repair_chance = 0.1
AND default_time_to_live = 0
AND gc_grace_seconds = 864000
AND max_index_interval = 2048
AND memtable_flush_period_in_ms = 0
AND min_index_interval = 128
AND read_repair_chance = 0.0
AND speculative_retry = '99PERCENTILE';
Data sample :
{"type": "OP", "name": "book", "as_of": "2022-03-17", "data": [{"year": 2022, "month": 3, "day": 17, "hour": 0, "quarter": 1, "week": 11, "wk_year": 2022, "is_peak": 0, "value": 1.28056854009628e-08}, .... ], "hash": "84421b8d934b06488e1ac464bd46e83ccd2beea5eb2f9f2c52428b706a9b2a10"}
where this JSON contains about 27,000 entries inside the data array, like:
{"year": 2022, "month": 3, "day": 17, "hour": 0, "quarter": 1, "week": 11, "wk_year": 2022, "is_peak": 0, "value": 1.28056854009628e-08}
Python part of the code:
def insert_to_table(self, table_name, **kwargs):
    try:
        ...
        elif table_name == "series":
            self.session.execute(
                self.session.prepare("INSERT INTO series (type, name, as_of, data, hash) VALUES (?, ?, ?, ?, ?)"),
                (
                    kwargs["type"],
                    kwargs["name"],
                    kwargs["as_of"],
                    kwargs["data"],
                    kwargs["hash"],
                ),
            )
        return True
    except Exception as error:
        current_app.logger.error('src/db/db_ops.py insert_to_table() table_name = %s error = %s', table_name, error)
        return False
Big thanks!
You are hitting the configured limit for the maximum mutation size. On Cassandra this defaults to 16 MB, while on Astra DB at the moment it is 4 MB (it's possible that it will be raised, but performing inserts with very large cell sizes is still strongly discouraged).
A more agile approach to storing this data would be to revise the data model and split the big row with the huge string into several rows, each containing a single item of the 27,000 or so entries. With proper use of partitioning, you would still be able to retrieve the whole contents with a single query (paginated between the database and the driver for your convenience), which would help avoid the annoying timeouts that may arise when reading such large individual rows.
Incidentally, I suggest you create the prepared statement only once, outside of the insert_to_table function (caching it somewhere), and in the insert function simply call self.session.execute(already_prepared_statement, (value1, value2, ...)); that would noticeably improve your performance.
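A minimal sketch of that idea (the class and attribute names here are illustrative, not from the original code):
# Prepare the statement once and reuse it for every insert, instead of
# calling session.prepare() on each call.
class SeriesStore:
    INSERT_SERIES_QUERY = (
        "INSERT INTO series (type, name, as_of, data, hash) VALUES (?, ?, ?, ?, ?)"
    )

    def __init__(self, session):
        self.session = session
        self.insert_series_stmt = session.prepare(self.INSERT_SERIES_QUERY)

    def insert_series(self, **kwargs):
        self.session.execute(
            self.insert_series_stmt,
            (kwargs["type"], kwargs["name"], kwargs["as_of"], kwargs["data"], kwargs["hash"]),
        )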
A last point: I believe the drivers are able to connect to Astra DB only starting from version 3.24.0, so I'm not sure how you are using version 3.17; I don't think 3.17 knows of the cloud argument to the Cluster constructor. In any case, I suggest you upgrade the drivers to the latest version (currently 3.25.0).
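For completeness, a hedged sketch of such a connection (requires cassandra-driver 3.24.0 or later; the bundle path and credentials below are placeholders):
from cassandra.cluster import Cluster
from cassandra.auth import PlainTextAuthProvider

# Connect to Astra DB with the secure connect bundle downloaded from the
# Astra console; the id/secret pair comes from an Astra DB token.
cluster = Cluster(
    cloud={"secure_connect_bundle": "/path/to/secure-connect-database.zip"},
    auth_provider=PlainTextAuthProvider("client_id", "client_secret"),
)
session = cluster.connect("mykeyspace")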
There's something not quite right with the details you posted in your question.
In the schema you posted, the data column is of type text:
data text,
But the sample data you posted looks like you are inserting key/value pairs which oddly seems to be formatted like a CQL collection type.
If it were really a string, it would be formatted as:
... "data": "key1: value1, key2: value2, ...", ...
Review your data and code then try again. Cheers!
Finally, the short-term solution that made the data migration from the in-house Cassandra DB to Astra Cassandra (DataStax) possible was to use compression (zlib).
The "data" field of each entry is compressed in Python with zlib and then stored in Cassandra, to reduce the size of the entries.
def insert_to_table(self, table_name, **kwargs):
    try:
        if table_name == 'series_as_of':
            ...
        elif table_name == 'series':
            list_of_bytes = bytes(json.dumps(kwargs["data"]), 'latin1')
            compressed_data = zlib.compress(list_of_bytes)
            a = str(compressed_data, 'latin1')
            self.session.execute(
                self.session.prepare(INSERT_SERIES_QUERY),
                (
                    kwargs["type"],
                    kwargs["name"],
                    kwargs["as_of"],
                    a,
                    kwargs["hash"],
                ),
            )
        ....
Then, when reading the entries, a decompression step is needed:
def get_series(self, query_type, **kwargs):
    try:
        if query_type == "data":
            execution = self.session.execute_async(
                self.session.prepare(GET_SERIES_DATA_QUERY),
                (kwargs["type"], kwargs["name"], kwargs["as_of"]),
                timeout=30,
            )
            decompressed = None
            a = None
            for row in execution.result():
                my_bytes = row[0].encode('latin1')
                decompressed = zlib.decompress(my_bytes)
                a = str(decompressed, encoding="latin1")
            return a
        ...
The data looks like this now in Astra Cassandra :
token@cqlsh> select * from series where type = 'OP' and name = 'ZTP_PFM_H_LUX_PHY' and as_of = '2022-09-30';
type | name | as_of | data | hash
------+-------------------+---------------------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+------------------------------------------------------------------
OP | ZTP_PFM_H_LUX_PHY | 2022-09-30 00:00:00.000000+0000 | x\x9c-\x8dÉ\nÂ#\x10D\x7fEúl`z\x16Mü\x80\x90\x83â\x1c\x14\F\x86îYÈÅ\x05\x92\x8b\x88ÿnÂx*ªÞ\x83\x82\x8f\x83ñýJ\x0e6\x0b\x07{ë`9å\x83îÿår°Þ¶;ßùíñämw.\x02\rþ\x99\x8b!\x85\x94\x95h*%\n\x8a4ÒL®·¹õ4ôÅÓÙ¨\x10\të \x99H\x04¡\x8cf6¹!\x95\x02'\x93"Jb\x1dë\x84ÈT¯U\x90\x19\x11W8\x1dp£\x8d\x83/ü\x00Ó<0\x99 | 4f53cda18c2baa0c0354bb5f9a3ecbe5ed12ab4d8e11ba873c2f11161202b945
(1 rows)
The long-term solution is to rearrange the data as @Stefano Lottini said.
How to update JSON field in sqlalchemy appending another json?
stmt = pg_insert(Users).values(
userid=user.id,
pricesjson=[{
"product1": user.product1,
"product2": user.product2,
"product3": user.product3
}],
tsins=datetime.now()
)
stmtUpsert = stmt.on_conflict_do_update(index_elements=[Users.userid],
set_={'pricesjson': cast({"product1": user.product1,
"product2": user.product2,
"product3": user.product3
} +
cast(stmt.excluded.pricesjson, JSONB),
JSON)
, 'tsvar': datetime.now()})
That way I don't receive errors, but it overwrites the JSON field instead of appending.
Thank you ;)
Solved: after altering the field on the table from json to jsonb, this is the working code:
stmtUpsert = stmt.on_conflict_do_update(
    index_elements=[Users.userid],
    set_={'pricesjson': cast([{"product1": user.product1,
                               "product2": user.product2,
                               "product3": user.product3}], JSONB) + Users.pricesjson,
          'tsvar': datetime.now()})
This is the equivalent sample query:
insert into users (userid, pricesjson) values ('1', '{"product1": "test1", "product2": "test2"}')
on conflict (userid)
do update set pricesjson = cast('[{"productX": "testX"}]' as jsonb) || users.pricesjson
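For reference, a self-contained consolidation of the snippets above (imports added; the Users model and its columns are assumptions taken from the question, so treat this as a sketch rather than a drop-in):
from datetime import datetime
from sqlalchemy import cast
from sqlalchemy.dialects.postgresql import insert as pg_insert, JSONB

def upsert_prices(session, user):
    new_prices = [{
        "product1": user.product1,
        "product2": user.product2,
        "product3": user.product3,
    }]
    stmt = pg_insert(Users).values(
        userid=user.id,
        pricesjson=new_prices,
        tsins=datetime.now(),
    )
    # On conflict, prepend the new entry to the stored jsonb array
    # (the pricesjson column must be jsonb, as noted above).
    stmt = stmt.on_conflict_do_update(
        index_elements=[Users.userid],
        set_={
            "pricesjson": cast(new_prices, JSONB) + Users.pricesjson,
            "tsvar": datetime.now(),
        },
    )
    session.execute(stmt)
    session.commit()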
I have the following complex data that would like to parse in PySpark:
records = '[{"segmentMembership":{"ups":{"FF6KCPTR6AQ0836R":{"lastQualificationTime":"2021-01-16 22:05:11.074357","status":"exited"},"QMS3YRT06JDEUM8O":{"lastQualificationTime":"2021-01-16 22:05:11.074357","status":"realized"},"8XH45RT87N6ZV4KQ":{"lastQualificationTime":"2021-01-16 22:05:11.074357","status":"exited"}}},"_aepgdcdevenablement2":{"emailId":{"address":"stuff#someemail.com"},"person":{"name":{"firstName":"Name2"}},"identities":{"customerid":"PH25PEUWOTA7QF93"}}},{"segmentMembership":{"ups":{"FF6KCPTR6AQ0836R":{"lastQualificationTime":"2021-01-16 22:05:11.074457","status":"realized"},"D45TOO8ZUH0B7GY7":{"lastQualificationTime":"2021-01-16 22:05:11.074457","status":"realized"},"QMS3YRT06JDEUM8O":{"lastQualificationTime":"2021-01-16 22:05:11.074457","status":"existing"}}},"_aepgdcdevenablement2":{"emailId":{"address":"stuff4#someemail.com"},"person":{"name":{"firstName":"TestName"}},"identities":{"customerid":"9LAIHVG91GCREE3Z"}}}]'
df = spark.read.json(sc.parallelize([records]))
df.show()
df.printSchema()
The problem I am having is with the segmentMembership object. The JSON object looks like this:
"segmentMembership": {
"ups": {
"FF6KCPTR6AQ0836R": {
"lastQualificationTime": "2021-01-16 22:05:11.074357",
"status": "exited"
},
"QMS3YRT06JDEUM8O": {
"lastQualificationTime": "2021-01-16 22:05:11.074357",
"status": "realized"
},
"8XH45RT87N6ZV4KQ": {
"lastQualificationTime": "2021-01-16 22:05:11.074357",
"status": "exited"
}
}
}
The annoying thing with this is that the key values ("FF6KCPTR6AQ0836R", "QMS3YRT06JDEUM8O", "8XH45RT87N6ZV4KQ") end up being defined as columns in PySpark.
In the end, if the status of the segment is "exited", I was hoping to get the results as follows.
+--------------------+----------------+---------+------------------+
|address |customerid |firstName|segment_id |
+--------------------+----------------+---------+------------------+
|stuff@someemail.com |PH25PEUWOTA7QF93|Name2 |[8XH45RT87N6ZV4KQ]|
|stuff4@someemail.com|9LAIHVG91GCREE3Z|TestName |[8XH45RT87N6ZV4KQ]|
+--------------------+----------------+---------+------------------+
After loading the data into a dataframe (above), I tried the following:
dfx = df.select("_aepgdcdevenablement2.emailId.address", "_aepgdcdevenablement2.identities.customerid", "_aepgdcdevenablement2.person.name.firstName", "segmentMembership.ups")
dfx.show(truncate=False)
seg_list = array(*[lit(k) for k in ["8XH45RT87N6ZV4KQ", "QMS3YRT06JDEUM8O"]])
print(seg_list)
# if v["status"] in ['existing', 'realized']
def confusing_compare(ups, seg_list):
seg_id_filtered_d = dict((k, ups[k]) for k in seg_list if k in ups)
# This is the line I am having a problem with.
# seg_id_status_filtered_d = {key for key, value in seg_id_filtered_d.items() if v["status"] in ['existing', 'realized']}
return list(seg_id_filtered_d)
final_conf_dx_pred = udf(confusing_compare, ArrayType(StringType()))
result_df = dfx.withColumn("segment_id", final_conf_dx_pred(dfx.ups, seg_list)).select("address", "customerid", "firstName", "segment_id")
result_df.show(truncate=False)
I am not able to check the status field within the value field of the dict.
You can actually do that without using UDF. Here I'm using all the segment names present in the schema and filtering out those with status = 'exited'. You can adapt it depending on which segments and status you want.
First, using the schema fields, get the list of all segment names like this:
from pyspark.sql.functions import array, col, expr, lit, when

segment_names = df.select("segmentMembership.ups.*").schema.fieldNames()
Then, by looping through the list created above and using the when function, you can create column expressions that hold either the segment name or null, depending on the status:
active_segments = [
when(col(f"segmentMembership.ups.{c}.status") != lit("exited"), lit(c))
for c in segment_names
]
Finally, add a new column segments of array type and use the filter function to remove null elements from the array (which correspond to status 'exited'):
dfx = df.withColumn("segments", array(*active_segments)) \
.withColumn("segments", expr("filter(segments, x -> x is not null)")) \
.select(
col("_aepgdcdevenablement2.emailId.address"),
col("_aepgdcdevenablement2.identities.customerid"),
col("_aepgdcdevenablement2.person.name.firstName"),
col("segments").alias("segment_id")
)
dfx.show(truncate=False)
#+--------------------+----------------+---------+------------------------------------------------------+
#|address |customerid |firstName|segment_id |
#+--------------------+----------------+---------+------------------------------------------------------+
#|stuff@someemail.com |PH25PEUWOTA7QF93|Name2 |[QMS3YRT06JDEUM8O] |
#|stuff4@someemail.com|9LAIHVG91GCREE3Z|TestName |[D45TOO8ZUH0B7GY7, FF6KCPTR6AQ0836R, QMS3YRT06JDEUM8O]|
#+--------------------+----------------+---------+------------------------------------------------------+
I am struggling to make Python play nicely with my UTF-8 encoded MySQL database containing, for example, the Norwegian characters æøå. I have searched around for hours but have not been able to find anything that works as expected. Here is an example table extracted from the database:
mysql> select * from my_table;
+----+-----------------+
| id | shop_group_name |
+----+-----------------+
| 1 | Frukt og grønt |
| 2 | Kjøtt og fisk |
| 3 | Meieriprodukter |
| 4 | Frysevarer |
| 5 | Bakevarer |
| 6 | Tørrvarer |
| 7 | Krydder |
| 8 | Hermetikk |
| 9 | Basisvarer |
| 10 | Diverse |
+----+-----------------+
10 rows in set (0.00 sec)
So the data is definitely UTF-8 encoded. When running the Python code below, however, the output is not in UTF-8. What could be wrong with it? It has nothing to do with the zipping; the tuples returned by cursor.fetchall() have already messed up the encoding.
#!/usr/bin/env python
import MySQLdb
db = MySQLdb.connect(host="localhost",
user="test",
passwd="passwd",
db="mydb",
charset='utf8',
use_unicode=True)
# Set desired conversion of data.
db.converter[MySQLdb.FIELD_TYPE.NEWDECIMAL] = float
db.converter[MySQLdb.FIELD_TYPE.DATETIME] = str
db.converter[MySQLdb.FIELD_TYPE.LONGLONG] = int
db.converter[MySQLdb.FIELD_TYPE.LONG] = int
cursor = db.cursor()
query = 'SELECT * FROM my_table'
allResults = {}
cursor.execute(query)
columns = [desc[0] for desc in cursor.description]
rows = cursor.fetchall()
results = []
for row in rows:
    row = dict(zip(columns, row))
    results.append(row)
allResults['my_table'] = results
cursor.close()
db.close()
The allResults dictionary now contains:
{
'my_table': [
{
'id': 1,
'shop_group_name': 'Fruktoggr\xf8nt'
},
{
'id': 2,
'shop_group_name': 'Kj\xf8ttogfisk'
},
{
'id': 3,
'shop_group_name': 'Meieriprodukter'
},
{
'id': 4,
'shop_group_name': 'Frysevarer'
},
{
'id': 5,
'shop_group_name': 'Bakevarer'
},
{
'id': 6,
'shop_group_name': 'T\xf8rrvarer'
},
{
'id': 7,
'shop_group_name': 'Krydder'
},
{
'id': 8,
'shop_group_name': 'Hermetikk'
},
{
'id': 9,
'shop_group_name': 'Basisvarer'
},
{
'id': 10,
'shop_group_name': 'Diverse'
}
]
}
I cannot really see what I am doing wrong. I am running the tests in Python 2.7.6 in Ubuntu.
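A useful diagnostic here is to ask MySQL for the hex of the stored bytes, which sidesteps any client-side decoding: an ø stored as latin1 shows up as F8, while utf8 storage shows up as C3 B8. A minimal sketch, reusing the cursor from the script above:
# Show the raw stored bytes for one row, independent of the connection charset.
cursor.execute("SELECT shop_group_name, HEX(shop_group_name) FROM my_table WHERE id = 2")
print(cursor.fetchone())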
Update (changing tables to UTF-8)
I tried changing the tables to UTF-8 by dumping the database and changing the character set and collation in the dump file and then inserting it into a new database. For example, this part of the dump file corresponds to the example above. This is how it was:
DROP TABLE IF EXISTS `my_table`;
/*!40101 SET @saved_cs_client = @@character_set_client */;
/*!40101 SET character_set_client = utf8 */;
CREATE TABLE `my_table` (
`id` int(11) NOT NULL AUTO_INCREMENT,
`shop_group_name` varchar(100) DEFAULT NULL,
PRIMARY KEY (`id`)
) ENGINE=InnoDB AUTO_INCREMENT=11 DEFAULT CHARSET=latin1;
/*!40101 SET character_set_client = @saved_cs_client */;
And this is what I changed this part to:
DROP TABLE IF EXISTS `my_table`;
/*!40101 SET @saved_cs_client = @@character_set_client */;
/*!40101 SET character_set_client = utf8 */;
CREATE TABLE `my_table` (
`id` int(11) NOT NULL AUTO_INCREMENT,
`shop_group_name` varchar(100) CHARACTER SET utf8 COLLATE utf8_general_ci DEFAULT NULL,
PRIMARY KEY (`id`)
) ENGINE=InnoDB AUTO_INCREMENT=11 DEFAULT CHARSET=utf8;
/*!40101 SET character_set_client = @saved_cs_client */;
However, this is still not working. The output is still the same as above. Running SELECT CHARACTER_SET_NAME FROM information_schema.columns WHERE TABLE_NAME = 'my_table'; now produces utf8.
When you create your table, create your columns in UTF-8:
CREATE TABLE my_table (
...
shop_group_name VARCHAR(100) CHARACTER SET utf8 COLLATE utf8_general_ci
);
If you don't specify the character set and collation, then MySQL falls back to its configured defaults. Alternatively, you can set the defaults in my.cnf.
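Once the column is utf8 (and assuming the stored bytes really are utf8), a quick sanity check from Python, reusing the connection settings from the question, would be something like:
# With charset='utf8' and use_unicode=True the driver should hand back
# unicode objects that encode cleanly for output.
cursor.execute("SELECT shop_group_name FROM my_table WHERE id = 2")
(name,) = cursor.fetchone()
print(type(name))            # expected: <type 'unicode'> on Python 2
print(name.encode("utf-8"))  # expected: Kjøtt og fisk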