Situation
I am using Python 3.7.2 with its built-in sqlite3 module. (sqlite3.version == 2.6.0)
I have a sqlite database that looks like:
| user_id | action | timestamp |
| ------- | ------ | ---------- |
| Alice | 0 | 1551683796 |
| Alice | 23 | 1551683797 |
| James | 1 | 1551683798 |
| ....... | ...... | .......... |
where user_id is TEXT, action is an arbitrary INTEGER, and timestamp is an INTEGER representing UNIX time.
The database has 200M rows, and there are 70K distinct user_ids.
Goal
I need to make a Python dictionary that looks like:
{
    "Alice": [(0, 1551683796), (23, 1551683797)],
    "James": [(1, 1551683798)],
    ...
}
that has user_ids as keys and the corresponding event logs as values, where each log is a list of (action, timestamp) tuples. Ideally each list should be sorted by timestamp in increasing order, but even if it isn't, that can easily be achieved by sorting each list after the dictionary is built.
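(For reference, sorting after the fact is a one-liner; this assumes the final dictionary is named classified_log, as in the code further down:)

```python
# Sort each user's event log in place by timestamp (index 1 of each (action, timestamp) tuple)
for log in classified_log.values():
    log.sort(key=lambda entry: entry[1])
```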
Effort
I have the following code to query the database. It first queries for the list of users (with user_list_cursor), and then queries for all rows belonging to each user (with log_cursor).
import sqlite3
connection = sqlite3.connect("database.db")
user_list_cursor = connection.cursor()
user_list_cursor.execute("SELECT DISTINCT user_id FROM EVENT_LOG")
user_id = user_list_cursor.fetchone()
classified_log = {}
log_cursor = connection.cursor()
while user_id:
    user_id = user_id[0]  # cursor.fetchone() returns a tuple
    query = (
        "SELECT action, timestamp"
        " FROM EVENT_LOG"
        " WHERE user_id = ?"
        " ORDER BY timestamp ASC"
    )
    parameters = (user_id,)
    log_cursor.execute(query, parameters)  # Here is the bottleneck
    classified_log[user_id] = list()
    for row in log_cursor.fetchall():
        classified_log[user_id].append(row)
    user_id = user_list_cursor.fetchone()
Problem
The query execution for each user is too slow. The single line of code commented as the bottleneck takes around 10 seconds for each user_id. I think I am taking the wrong approach with the queries. What is the right way to achieve the goal?
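(For what it's worth, one way to confirm whether each per-user query is doing a full table scan is to ask SQLite for its query plan; this is only a diagnostic sketch against the same database:)

```python
import sqlite3

connection = sqlite3.connect("database.db")
cursor = connection.cursor()
# Without an index on user_id this is expected to report something like
# "SCAN TABLE EVENT_LOG", i.e. every per-user query walks all 200M rows.
cursor.execute(
    "EXPLAIN QUERY PLAN "
    "SELECT action, timestamp FROM EVENT_LOG WHERE user_id = ? ORDER BY timestamp ASC",
    ("Alice",),
)
print(cursor.fetchall())
```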
I tried searching with keywords like "classify db by a column", "classify sql by a column", and "sql log to dictionary python", but nothing seems to match my situation. I don't think this is a rare need, so maybe I'm missing the right keyword to search with.
Reproducibility
If anyone is willing to reproduce the situation with a 200M row sqlite database, the following code will create a 5GB database file.
But I hope there is somebody who is familiar with such a situation and knows how to write the right query.
import sqlite3
import random
connection = sqlite3.connect("tmp.db")
cursor = connection.cursor()
cursor.execute(
"CREATE TABLE IF NOT EXISTS EVENT_LOG (user_id TEXT, action INTEGER, timestamp INTEGER)"
)
query = "INSERT INTO EVENT_LOG VALUES (?, ?, ?)"
parameters = []
for timestamp in range(200_000_000):
    user_id = f"user{random.randint(0, 70000)}"
    action = random.randint(0, 1_000_000)
    parameters.append((user_id, action, timestamp))
cursor.executemany(query, parameters)
connection.commit()
cursor.close()
connection.close()
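(Note that this script materializes all 200 million parameter tuples in memory before inserting. If that is a problem on your machine, a chunked variant keeps memory usage flat; this is just a sketch with the same schema and an arbitrary chunk size:)

```python
import itertools
import random
import sqlite3

def generate_rows():
    for timestamp in range(200_000_000):
        user_id = f"user{random.randint(0, 70000)}"
        action = random.randint(0, 1_000_000)
        yield (user_id, action, timestamp)

connection = sqlite3.connect("tmp.db")
cursor = connection.cursor()
cursor.execute(
    "CREATE TABLE IF NOT EXISTS EVENT_LOG (user_id TEXT, action INTEGER, timestamp INTEGER)"
)
rows = generate_rows()
while True:
    chunk = list(itertools.islice(rows, 1_000_000))  # one million rows per batch
    if not chunk:
        break
    cursor.executemany("INSERT INTO EVENT_LOG VALUES (?, ?, ?)", chunk)
connection.commit()
connection.close()
```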
Big thanks to @Strawberry and @Solarflare for their help in the comments.
The following solution achieved more than 70X performance increase, so I'm leaving what I did as an answer for completeness' sake.
I used indices and queried for the whole table, as they suggested.
import sqlite3
from operator import itemgetter

connection = sqlite3.connect("database.db")

# Creating the index, thanks to @Solarflare
cursor = connection.cursor()
cursor.execute("CREATE INDEX IF NOT EXISTS idx_user_id ON EVENT_LOG (user_id)")
connection.commit()

# Reading the whole table, then making lists by user_id. Thanks to @Strawberry
cursor.execute("SELECT user_id, action, timestamp FROM EVENT_LOG ORDER BY user_id ASC")
previous_user_id = None
log_per_user = list()
classified_log = dict()
for row in cursor:
    user_id, action, timestamp = row
    if user_id != previous_user_id:
        if previous_user_id:
            log_per_user.sort(key=itemgetter(1))
            classified_log[previous_user_id] = log_per_user[:]
        log_per_user = list()
    log_per_user.append((action, timestamp))
    previous_user_id = user_id
if previous_user_id:  # flush the last user's log
    log_per_user.sort(key=itemgetter(1))
    classified_log[previous_user_id] = log_per_user
So the points are:
- Indexing by user_id so that ORDER BY user_id ASC executes in acceptable time.
- Reading the whole table once and then classifying by user_id, instead of making individual queries for each user_id (a groupby-based variant is sketched below).
- Iterating over the cursor to read rows one by one, instead of using cursor.fetchall().
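(Since the rows come back ordered by user_id, the grouping step can also be written with itertools.groupby; this is just a sketch of the same idea, assuming the cursor and query from the code above:)

```python
from itertools import groupby
from operator import itemgetter

cursor.execute("SELECT user_id, action, timestamp FROM EVENT_LOG ORDER BY user_id ASC")
classified_log = {
    user_id: sorted(((action, ts) for _, action, ts in rows), key=itemgetter(1))
    for user_id, rows in groupby(cursor, key=itemgetter(0))
}
```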
Related
I have a MySql 5.7 transaction table with a DATETIME column, Transaction_Date, that is indexed. My Python 3.8.5 program is interested in retrieving the maximum transaction date on the table (ignoring the time portion). There are two possible queries (maybe even more):
select date(max(Transaction_Date)) from `transaction`
and
select max(date(Transaction_Date)) from `transaction`
However, depending on which query I use, pymysql or mysql.connector (it doesn't matter which one I use) returns a different data type as the result, i.e. a datetime.date instance for the first query and a str for the second:
The Table:
CREATE TABLE `transaction` (
  `Transaction_ID` varchar(32) NOT NULL,
  `Transaction_Date` datetime NOT NULL
) ENGINE=InnoDB DEFAULT CHARSET=latin1;

ALTER TABLE `transaction`
  ADD PRIMARY KEY (`Transaction_ID`),
  ADD KEY `Transaction_Date` (`Transaction_Date`);
The Program:
import pymysql
database = 'xxxx'
user_id = 'xxxx'
password = 'xxxx'
conn = pymysql.connect(db=database, user=user_id, passwd=password, charset='utf8mb4', use_unicode=True)
cursor = conn.cursor()
cursor.execute('select date(max(Transaction_Date)) from `transaction`')
row = cursor.fetchone()
d = row[0]
print(d, type(d))
cursor.execute('select max(date(Transaction_Date)) from `transaction`')
row = cursor.fetchone()
d = row[0]
print(d, type(d))
Prints:
2021-01-19 <class 'datetime.date'>
2021-01-19 <class 'str'>
Can anyone explain why a datetime.date is not returned for the second query? For what it's worth, I have another table with a DATE column and when I select the MAX of that column I am returned a datetime.date instance. So why am I not returned a datetime.date for a MAX(DATE(column_name))?
Update
mysql> create temporary table t1 as select max(date(Transaction_Date)) as d from `transaction`;
Query OK, 1 row affected (0.20 sec)
Records: 1 Duplicates: 0 Warnings: 0
mysql> show columns from t1;
+-------+------+------+-----+---------+-------+
| Field | Type | Null | Key | Default | Extra |
+-------+------+------+-----+---------+-------+
| d | date | YES | | NULL | |
+-------+------+------+-----+---------+-------+
1 row in set (0.01 sec)
This is what I would expect, so it is quite a puzzle.
I looked through the pymysql source and some MySQL documentation. It looks like the database server returns field descriptors, which the module then uses to pick a Python type for the value; the available converters are here: https://github.com/PyMySQL/PyMySQL/blob/master/pymysql/converters.py. At the bottom of that file, field types are mapped to their conversion functions.
The database server is describing max(date) as a VAR_STRING, so the module proceeds accordingly. I'm not sure of the specific reason for this description, which I suppose is the core of your question. It would take some digging through the MySQL source; the documentation is not very detailed.
As a workaround, casting the result to a date does make it work as expected:
select cast(max(date(Transaction_Date)) as date) from `transaction`
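(As a quick check with the same pymysql connection as above, the cast version should come back as a date:)

```python
cursor.execute('select cast(max(date(Transaction_Date)) as date) from `transaction`')
row = cursor.fetchone()
print(row[0], type(row[0]))  # expected: 2021-01-19 <class 'datetime.date'>
```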
I'm using petl and trying to create a simple table with a value from a query. I have written the following:
def get_base_price(self, date):
    # open connection to db
    # run SQL query to check if price exists for that date
    # set base price to that value if it exists
    # set it to 100 if it doesn't
    sql = '''SELECT [TimeSeriesValue]
             FROM [RAP].[dbo].[TimeSeriesPosition]
             WHERE TimeSeriesTypeID = 12
               AND SecurityMasterID = 45889
               AND FundID = 7
               AND EffectiveDate = %s''' % date
    with self.job.rap.connect() as conn:
        data = etl.fromdb(conn, sql).cache()
    return data
I'm connecting to the database, and if there's a value for that date, then I'll be able to create a table that would look like this:
+-----------------+
| TimeSeriesValue |
+=================+
| 100 |
+-----------------+
However, if the query returns nothing, what would the table look like?
I want to set the TimeSeriesValue to 100 if the query returns nothing. Not sure how to do that.
You should be passing in parameters when you execute the statement rather than munging the string, but that is not central to your question.
Possibly the simplest solution is to do all the work in SQL. If you are expecting at most one row from the query, then:
SELECT COALESCE(MAX(TimeSeriesValue), 100) as TimeSeriesValue
FROM [RAP].[dbo].[TimeSeriesPosition]
WHERE TimeSeriesTypeID = 12 AND
SecurityMasterID = 45889 AND
FundID = 7 AND
EffectiveDate = %s
This will always return one row and if nothing is found, it will put in the 100 value.
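(If you would rather keep the default on the Python side instead, one option is to check whether the petl table actually came back with a data row; a rough sketch, assuming the same query string and that petl is imported as etl:)

```python
import petl as etl

def get_base_price_or_default(conn, sql, default=100):
    data = etl.fromdb(conn, sql).cache()
    # etl.values() yields only the data rows (the header is excluded),
    # so an empty list means the query matched nothing.
    values = list(etl.values(data, 'TimeSeriesValue'))
    return values[0] if values else default
```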
A query that returns nothing would just display the column name(s) with nothing under them. I'd try something like this:
IF (TimeSeriesValue IS NOT NULL)
    <your query>
ELSE
    SET TimeSeriesValue = 100
Is there a way to produce this output using SQL for all tables in a given database (using MySQL) without having to specify individual table names and columns?
Table    Column   Count
------   ------   -----
Table1   Col1     0
Table1   Col2     100
Table1   Col3     0
Table1   Col4     67
Table1   Col5     0
Table2   Col1     30
Table2   Col2     0
Table2   Col3     2
...      ...      ...
The purpose is to identify columns for analysis based on how much data they contain (a significant number of columns are empty).
The 'workaround' solution using python (one table at a time):
# Libraries
import pymysql
import pandas as pd
import pymysql.cursors
# Connect to mariaDB
connection = pymysql.connect(host='localhost',
                             user='root',
                             password='my_password',
                             db='my_database',
                             charset='latin1',
                             cursorclass=pymysql.cursors.DictCursor)
# Get column metadata
sql = """SELECT *
FROM `INFORMATION_SCHEMA`.`COLUMNS`
WHERE `TABLE_SCHEMA`='my_database'
"""
with connection.cursor() as cursor:
    cursor.execute(sql)
    result = cursor.fetchall()
# Store in dataframe
df = pd.DataFrame(result)
df = df[['TABLE_NAME', 'COLUMN_NAME']]
# Build SQL string (one table at a time for now)
my_table = 'my_table'
df_my_table = df[df.TABLE_NAME==my_table].copy()
cols = list(df_my_table.COLUMN_NAME)
col_strings = [''.join(['COUNT(', x, ') AS ', x, ', ']) for x in cols]
col_strings[-1] = col_strings[-1].replace(',','')
sql = ''.join(['SELECT '] + col_strings + ['FROM ', my_table])
# Execute
with connection.cursor() as cursor:
    cursor.execute(sql)
    result = cursor.fetchall()
The result is a dictionary of column names and counts.
Basically, no. See also this answer.
Also, note that the closest match in the answer above is essentially the method you're already using, just implemented less efficiently in reflective SQL.
I'd do the same as you did - build a SQL query like
SELECT
    COUNT(*) AS `count`,
    SUM(IF(columnName1 IS NULL, 1, 0)) AS columnName1,
    ...
    SUM(IF(columnNameN IS NULL, 1, 0)) AS columnNameN
FROM tableName;
using information_schema as a source for table and column names, then execute it for each table in MySQL, then disassemble the single row returned into N tuple entries (tableName, columnName, total, nulls).
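(A rough sketch of that loop in Python, reusing the pymysql connection and information_schema query from the question. Table and column names have to be interpolated into the SQL string here because they cannot be bound as parameters, so this assumes a schema you trust:)

```python
results = []  # (table_name, column_name, total_rows, null_count)
with connection.cursor() as cursor:
    cursor.execute(
        "SELECT TABLE_NAME, COLUMN_NAME FROM INFORMATION_SCHEMA.COLUMNS "
        "WHERE TABLE_SCHEMA = %s ORDER BY TABLE_NAME, ORDINAL_POSITION",
        ('my_database',),
    )
    columns_by_table = {}
    for row in cursor.fetchall():
        columns_by_table.setdefault(row['TABLE_NAME'], []).append(row['COLUMN_NAME'])

    for table, columns in columns_by_table.items():
        select_list = ", ".join(
            ["COUNT(*) AS `__total`"]
            + [f"SUM(IF(`{col}` IS NULL, 1, 0)) AS `{col}`" for col in columns]
        )
        cursor.execute(f"SELECT {select_list} FROM `my_database`.`{table}`")
        row = cursor.fetchone()  # one row: total plus one null-count per column
        for col in columns:
            results.append((table, col, row['__total'], row[col]))
```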
It is possible, but it's not going to be quick.
As mentioned in a previous answer, you can work your way through the columns table in information_schema to build queries to get the counts. It's then just a question of how long you are prepared to wait for the answer, because you end up counting every row, for every column, in every table. You can speed things up a bit by excluding columns that are defined as NOT NULL (i.e. only looking at columns with IS_NULLABLE = 'YES').
The solution suggested by LSerni is going to be much faster, particularly if you have very wide tables and/or high row counts, but would require more work handling the results.
e.g.
DELIMITER //

DROP PROCEDURE IF EXISTS non_nulls //

CREATE PROCEDURE non_nulls (IN sname VARCHAR(64))
BEGIN
    -- Parameters:
    -- Schema name to check
    -- call non_nulls('sakila');
    DECLARE vTABLE_NAME varchar(64);
    DECLARE vCOLUMN_NAME varchar(64);
    DECLARE vIS_NULLABLE varchar(3);
    DECLARE vCOLUMN_KEY varchar(3);
    DECLARE done BOOLEAN DEFAULT FALSE;
    DECLARE cur1 CURSOR FOR
        SELECT `TABLE_NAME`, `COLUMN_NAME`, `IS_NULLABLE`, `COLUMN_KEY`
        FROM `information_schema`.`columns`
        WHERE `TABLE_SCHEMA` = sname
        ORDER BY `TABLE_NAME` ASC, `ORDINAL_POSITION` ASC;
    DECLARE CONTINUE HANDLER FOR NOT FOUND SET done := TRUE;

    DROP TEMPORARY TABLE IF EXISTS non_nulls;
    CREATE TEMPORARY TABLE non_nulls(
        table_name VARCHAR(64),
        column_name VARCHAR(64),
        column_key CHAR(3),
        is_nullable CHAR(3),
        rows BIGINT,
        populated BIGINT
    );

    OPEN cur1;
    read_loop: LOOP
        FETCH cur1 INTO vTABLE_NAME, vCOLUMN_NAME, vIS_NULLABLE, vCOLUMN_KEY;
        IF done THEN
            LEAVE read_loop;
        END IF;
        SET @sql := CONCAT('INSERT INTO non_nulls ',
            '(table_name,column_name,column_key,is_nullable,rows,populated) ',
            'SELECT \'', vTABLE_NAME, '\',\'', vCOLUMN_NAME, '\',\'', vCOLUMN_KEY, '\',\'',
            vIS_NULLABLE, '\', COUNT(*), COUNT(`', vCOLUMN_NAME, '`) ',
            'FROM `', sname, '`.`', vTABLE_NAME, '`');
        PREPARE stmt1 FROM @sql;
        EXECUTE stmt1;
        DEALLOCATE PREPARE stmt1;
    END LOOP;
    CLOSE cur1;

    SELECT * FROM non_nulls;
END //

DELIMITER ;

call non_nulls('sakila');
I have two tables below:
Stock:
-------------
Items | QTY
-------------
sugar | 14
mango | 10
apple | 50
berry | 1

Second_table:
-------------
Items | QTY
-------------
sugar | 10
mango | 5
apple | 48
berry | 1
I use the following query in Python to check the difference between the QTY values of table one and table two.
cur = conn.cursor()
cur.execute("select s.Items, s.qty - t.qty as quantity from Stock s join Second_table t on s.Items = t.Items;")
remaining_quantity = cur.fetchall()
I'm a bit stuck on how to go about what I need to accomplish. I need to check the difference between the quantities of table one and table two; if the difference for an item is under 5, I want to store that item in another table with a flag column set to 1, otherwise with the flag set to 0. How can I go about this?
Edit:
I have attempted this by looping through the rows and, if the column value is less than 5, inserting into the new table as below:
for row in remaining_quantity:
    print(row[1])
    if((row[1]) < 5):
        cur.execute('INSERT OR IGNORE INTO check_quantity_tb VALUES (select distinct s.Items, s.qty, s.qty - t.qty as quantity, 1 from Stock s join Second_table t on s.Items = t.Items'), row)
        print(row)
But I get an SQL syntax error, and I'm not sure where the error could be :/
First modify your first query so you retrieve all the relevant info and don't have to issue subqueries later:
readcursor = conn.cursor()
readcursor.execute(
    "select s.Items, s.qty, s.qty - t.qty as remain "
    "from Stock s join Second_table t on s.Items = t.Items;"
)
Then use it to update your third table:
writecursor = conn.cursor()
for items, qty, remain in readcursor:
    print(remain)
    if remain < 5:
        writecursor.execute(
            'INSERT OR IGNORE INTO check_quantity_tb VALUES (?, ?, ?, ?)',
            (items, qty, remain, 1)
        )
conn.commit()
Note the following points:
1/ We use two distinct cursors so we can iterate over the first one while writing with the second one. This avoids fetching all results in memory, which can be a real life saver on huge datasets.
2/ When iterating on the first cursor, we unpack the rows into their individual components. This is called "tuple unpacking" (but actually works for most sequence types):
>>> row = ("1", "2", "3")
>>> a, b, c = row
>>> a
'1'
>>> b
'2'
>>> c
'3'
3/ We let the db-api module do the proper sanitisation and escaping of the values we want to insert. This avoids headaches with escaping and quoting, and protects your code from SQL injection attacks (not that you necessarily have one here, but that's the correct way to write parameterized queries in Python).
NB: since you did not post your full table definitions nor clear specs - not even the full error message and traceback - I only translated your code snippet to something more sensible (avoiding the costly and useless subquery, which might or might not be the cause of your error). I can't guarantee it will work out of the box, but at least it should put you back on track.
NB2: you mentioned you had to set the last column to either 1 or 0 depending on the remain value. If that's the case, you want your loop to be:
writecursor = conn.cursor()
for items, qty, remain in readcursor:
    print(remain)
    flag = 1 if remain < 5 else 0
    writecursor.execute(
        'INSERT OR IGNORE INTO check_quantity_tb VALUES (?, ?, ?, ?)',
        (items, qty, remain, flag)
    )
conn.commit()
If you instead only want to process rows where remain < 5, you can specify it directly in your first query with a where clause.
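(For instance, the filtered version of the first query might look like this, reusing the table and column names from the snippets above:)

```python
readcursor.execute(
    "select s.Items, s.qty, s.qty - t.qty as remain "
    "from Stock s join Second_table t on s.Items = t.Items "
    "where s.qty - t.qty < 5;"
)
```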
I have the following data in a vertica db, Mytable:
+----+-------+
| ID | Value |
+----+-------+
| A | 5 |
| B | 9 |
| C | 10 |
| D | 7 |
+----+-------+
I am trying to create a query in Python to access a Vertica database. In Python I have a list:
ID_list= ['A', 'C']
I would like to create a query that basically inner joins Mytable with the ID_list and then applies a WHERE filter.
So it would basically be something like this:
SELECT *
FROM Mytable
INNER JOIN ID_list
ON Mytable.ID = ID_list as temp_table
WHERE Value = 5
I don't have write access on the database, so the table would need to be created locally. Or is there an alternative way of doing this?
If you have a small table, then you can do as Tim suggested and create an in-list.
I kind of prefer to do this the Python way, though. I would probably also make ID_list a set, to keep from having dups, etc.
in_list = '(%s)' % ','.join(str(id) for id in ID_list)
or better use bind variables (depends on the client you are using, and probably not strictly necessary if you are dealing with a set of ints since I can't imagine a way to inject sql with that):
in_list = '(%s)' % ','.join(['%d'] * len(ID_list))
and send in your ID_list as a parameter list for your cursor.execute. This method is positional, so you'll need to arrange your bind parameters correctly.
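(A sketch of the bind-variable route, assuming a DB-API client that accepts %s-style placeholders - the exact placeholder style depends on the driver you use:)

```python
# One placeholder per ID; the driver handles quoting/escaping of the values.
placeholders = ', '.join(['%s'] * len(ID_list))
sql = (
    "SELECT * FROM Mytable "
    "WHERE ID IN (" + placeholders + ") AND Value = 5"
)
cursor.execute(sql, list(ID_list))
```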
If you have a very, very large list... you could create a local temp and load it before doing your query with join.
CREATE LOCAL TEMP TABLE mytable ( id INTEGER );
COPY mytable FROM STDIN;
-- Or however you need to load the data. Using python, you'll probably need to stream in a list using `cursor.copy`
Then join to mytable.
I wouldn't bother doing the latter with a very small number of rows, too much overhead.
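(A rough sketch of that load-and-join step with the vertica_python client; the temp table name id_filter is made up here, and cursor.copy plus the exact COPY options are assumptions, so check your driver's documentation:)

```python
import io

# Assumes an open vertica_python connection `conn` and integer IDs in ID_list.
cursor = conn.cursor()
cursor.execute("CREATE LOCAL TEMP TABLE id_filter (id INTEGER) ON COMMIT PRESERVE ROWS")
payload = io.StringIO("\n".join(str(i) for i in ID_list))
cursor.copy("COPY id_filter FROM STDIN DELIMITER ','", payload)
cursor.execute(
    "SELECT m.* FROM Mytable m JOIN id_filter f ON m.ID = f.id WHERE m.Value = 5"
)
rows = cursor.fetchall()
```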
So I used the approach from Tim:
# create a string of all the IDs in ID_list so they can be inserted into a SQL query
Sql_string = '('
for ss in ID_list:
    Sql_string = Sql_string + " " + str(ss) + ","
Sql_string = Sql_string[:-1] + ")"
"SELECT * FROM
(SELECT * FROM Mytable WHERE ID IN " + Sql_string) as temp
Where Value = 5"
It works surprisingly fast.