I have an SQLite3 database file with the following two tables:
solicitari: id, id_im, data_inchidere, defect, tip_solicitare
operatii_aditionale: id, id_im, data_aditionala, descriere
First table (solicitari) contains:
id, id_im, data_inchidere, defect, tip_solicitare
---------------------------------------------------
1,123456,2017-01-01 10:00:00,faulty mouse,replacement
2,789456,2017-01-01 11:00:00,pos installed,intall
3,147852,2017-01-05 12:00:00, monitor installed,install
4,369852,2017-01-06 11:00:00, monitor installed,install
Second table (operatii_aditionale) contains additional operations:
id, id_im, data_aditionala, descriere
---------------------------------------------
1,123456,2017-01-02 10:00:00,mouse replaced need cd replacement to
2,123456,2017-01-03 10:00:00,cd replaced system ok
3,123456,2017-01-03 10:00:00,hdd replaced system ok
4,789456,2017-01-04 10:00:00,ac adapter not working anymore
What I want to do is build a table from these two tables, but only with the data that falls between two dates, which will look like this:
id_im, data_inchidere, defect,tip_solicitare, id_im, data_aditionala, descriere
-------------------------------------------------------------------------------
123456,2017-01-01 10:00:00,faulty mouse,replacement,123456,2017-01-02 10:00:00,mouse replaced need cd replacement to
123456,2017-01-03 10:00:00,cd replaced system ok
123456,2017-01-03 10:00:00,hdd replaced system ok
2,789456,2017-01-01 11:00:00,pos installed,intall,789456,2017-01-04 10:00:00,ac adapter not working anymore
3,147852,2017-01-05 12:00:00, monitor installed,install
4,369852,2017-01-06 11:00:00, monitor installed,install
I have used ',' as the column separator here.
I found something similar, but it fills in the left-hand columns for every additional row from the second table.
Is there a way to do this directly with an SQLite query, or do I need a Python script?
By the way, I need this for a Python app.
Thanks.
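One way to do this directly in SQLite is a LEFT JOIN between the two tables. A minimal sketch in Python with the built-in sqlite3 module, assuming the table and column names above and placeholder date bounds (the filter here is on data_inchidere; blanking the repeated left-hand values on continuation rows is display formatting, easiest to do when printing in the app):

import sqlite3

# Placeholder database path and date bounds; adjust to the real values.
conn = sqlite3.connect("solicitari.db")
start, end = "2017-01-01 00:00:00", "2017-01-07 00:00:00"

# LEFT JOIN keeps every solicitari row in the range, even those with no
# additional operations; rows with several matches come back repeated.
rows = conn.execute(
    """
    SELECT s.id_im, s.data_inchidere, s.defect, s.tip_solicitare,
           o.id_im, o.data_aditionala, o.descriere
    FROM solicitari AS s
    LEFT JOIN operatii_aditionale AS o ON o.id_im = s.id_im
    WHERE s.data_inchidere BETWEEN ? AND ?
    ORDER BY s.id_im, o.data_aditionala
    """,
    (start, end),
).fetchall()

conn.close()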
Is there an efficient way to delete the top "n" rows of a table in an SQLite database using sqlite3?
USE CASE:
I need to keep rolling time-series data in a table. To do this, I fetch n new data points regularly and append them to the table. However, to keep the table updated and a constant size (in terms of number of rows), I need to trim it by removing the top n rows of data.
DELETE TOP(n) FROM <table_name>
looked promising, but it appears to be unsupported by sqlite3.
For example:
import sqlite3
conn = sqlite3.connect("testDB.db")
table_name = "test_table"
c = conn.cursor()
c.execute("DELETE TOP(2500) FROM test_table")
conn.commit()
conn.close()
The following error is raised:
Traceback (most recent call last):
  File "test_db.py", line 9, in <module>
    c.execute("DELETE TOP(10) FROM test_table")
sqlite3.OperationalError: near "TOP": syntax error
The only workaround I've seen is to use c.executemany instead of c.execute, but this would require specifying the exact dates to delete, which is much more cumbersome than it needs to be.
SQLite supports this, but not out of the box.
You first have to create a custom sqlite3.c amalgamation file from the master source tree and compile it with the C preprocessor macro SQLITE_ENABLE_UPDATE_DELETE_LIMIT defined (by running ./configure --enable-update-limit; make), then put the resulting shared library where Python will load it instead of whatever version it would otherwise use. (This is the hard part compared to using it in C or C++, where you can just add the custom sqlite3.c to the project's source files instead of using a library.)
Once all that's done and you're successfully using your own custom sqlite3 library from Python, you can do
DELETE FROM test_table LIMIT 10
which will delete 10 unspecified rows. To control which rows to delete, you need an ORDER BY clause:
DELETE FROM test_table ORDER BY foo LIMIT 10
See the documentation for details.
I suspect most people would give up on this as too complicated and instead just find the rowids (or another primary/unique key) of the rows they want to delete, and then delete those.
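For example, a minimal sketch of that rowid approach with the stock sqlite3 module (the ts ordering column is an assumption; substitute whatever orders your time series):

import sqlite3

n = 2500  # number of oldest rows to trim

conn = sqlite3.connect("testDB.db")
c = conn.cursor()

# Delete the rows whose rowids belong to the oldest n entries; "ts" is an
# assumed timestamp column, so order by whatever defines "top" in your table.
c.execute(
    """
    DELETE FROM test_table
    WHERE rowid IN (
        SELECT rowid FROM test_table ORDER BY ts LIMIT ?
    )
    """,
    (n,),
)
conn.commit()
conn.close()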
There is no way to delete a specific row, as far as I know. You need to drop the table and add it again.
So basically, if I had to do it, I would fetch the whole table, store it in a variable, and remove the part I don't need. Then I would drop the table, create the same table again, and insert the rest of the data.
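A minimal sqlite3 sketch of that drop-and-recreate workaround, with a placeholder schema and ordering column (the rowid-based DELETE shown above is usually the simpler route):

import sqlite3

conn = sqlite3.connect("testDB.db")
c = conn.cursor()

# Fetch everything and keep only the rows we want (here: drop the oldest 2500).
rows = c.execute("SELECT * FROM test_table ORDER BY ts").fetchall()
keep = rows[2500:]

# Drop, recreate with the same (placeholder) schema, and reinsert the rest.
c.execute("DROP TABLE test_table")
c.execute("CREATE TABLE test_table (ts TEXT, value REAL)")
c.executemany("INSERT INTO test_table VALUES (?, ?)", keep)

conn.commit()
conn.close()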
I've looked at many similar questions on this topic, but none appear to apply.
Here are the details:
I have a table with 8 columns.
create table test (
    node_name varchar(200),
    parent varchar(200),
    actv int(11),
    fid int(11),
    cb varchar(100),
    co datetime,
    ub varchar(100),
    uo datetime
);
There is a trigger on the table:
CREATE TRIGGER before_insert_test
BEFORE INSERT ON test
FOR EACH ROW SET NEW.co = now(), NEW.uo = now(), NEW.cb = user(), NEW.ub = user()
I have a CSV file to load into this table. It's got just 2 columns in it.
First few rows:
node_name,parent
West,
East,
BBB: someone,West
Quebec,East
Ontario,East
Manitoba,West
British Columbia,West
Atlantic,East
Alberta,West
I have this all set up in a MySQL 5.6 environment. Using Python and SQLAlchemy, I run the load of the file without issue. It loads all records, with empty strings for the second field in the first 2 records. All as expected.
I have a MySQL 8 environment and run the exact same routine, all the same statements, etc. It fails with the 'Row 1 doesn't contain data for all columns' error.
The connection is made using this:
engine = create_engine(
    connection_string,
    pool_size=6, max_overflow=10, encoding='latin1', isolation_level='AUTOCOMMIT',
    connect_args={"local_infile": 1}
)
db_connection = engine.connect()
The command I place in the sql variable is:
LOAD DATA INFILE 'test.csv'
INTO TABLE test
FIELDS TERMINATED BY ',' ENCLOSED BY '\"' IGNORE 1 LINES SET fid = 526, actv = 1;
And execute it with:
db_connection.execute(sql)
So I basically load the first two columns from the file, I set the next 2 columns in the LOAD statement, and the final 4 are handled by the trigger.
To repeat: this works fine in the MySQL 5 environment, but not in MySQL 8.
I checked the MySQL character set variables in both DB environments, and they are equivalent (just in case the default character set change between 5.6 and 8 had an impact).
I will say that the MySQL 5 DB is running on Ubuntu 18.04.5 while MySQL 8 is running on Ubuntu 20.02.2; could there be something there?
I have tried all sorts of fiddling with the LOAD DATA statement. I tried filling in data for the first two records in the file in case that was it, and I tried using different line terminators in the LOAD statement. I'm at a loss for the next thing to look into.
Thanks for any pointers.
Unless you tell it otherwise, MySQL assumes that each row in your CSV contains a value for every column in the table.
Give the query a column list:
LOAD DATA INFILE 'test.csv'
INTO TABLE test
FIELDS TERMINATED BY ','
ENCLOSED BY '\"'
IGNORE 1 LINES
(node_name, parent)
SET fid = 526, actv = 1;
In addition to Tangentially Perpendicular's answer, there are other options.
You can add the IGNORE keyword, as per https://dev.mysql.com/doc/refman/8.0/en/sql-mode.html#ignore-effect-on-execution. It should come just before the INTO in the LOAD DATA statement, as per https://dev.mysql.com/doc/refman/8.0/en/load-data.html.
Alternatively, altering the sql_mode to be less strict will also work.
Due to the strict sql_mode, LOAD DATA isn't smart enough to realize that triggers are handling a couple of columns. It would be nice if they enhanced it to be that smart, but alas.
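For illustration, a hedged sketch of the question's statement with the IGNORE keyword in the position the docs describe (the connection URL is a placeholder; verify the resulting warnings against your data):

from sqlalchemy import create_engine

engine = create_engine("mysql+pymysql://user:pass@host/db")  # placeholder URL
db_connection = engine.connect()

sql = """
    LOAD DATA INFILE 'test.csv'
    IGNORE INTO TABLE test
    FIELDS TERMINATED BY ',' ENCLOSED BY '"'
    IGNORE 1 LINES
    SET fid = 526, actv = 1
"""
# IGNORE downgrades the "Row 1 doesn't contain data for all columns" error
# to a warning, and the columns missing from the file fall back to defaults.
db_connection.execute(sql)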
Suppose I have a Hive table (named table), like so:
row1 2341
row2 828242
row3 205252
...
The table itself is very long (thousands of lines). I am doing something like this to run a transformation using a Python script:
FROM (
    MAP table.row, table.num
    USING 'python script.py'
    AS output
    FROM table
) t1
INSERT OVERWRITE TABLE table2
SELECT (t1.output) AS output_result;
The issue is that because I'm actually reading over a table and not separate files, all of the rows are being passed to the same mapper. This, as you can imagine, takes a long time. Is there a way to force each row to go to a separate mapper so that whatever logic is in the script can take care of everything else? Essentially, I want to run MapReduce the way it's supposed to run, but only passing rows from a table to different mappers.
Thanks for the help.
The number of input splits is decided by Hadoop, but you can control it by setting the mapred.min.split.size parameter.
Whether the rows come from a table or from a file does not matter: behind the scenes both are text files.
By default, a file that is only a few kilobytes in size will be passed to a single mapper.
If you just want to try it out, you can create a file of around 1 GB in size and then run the query.
I have written a Python script that creates a table using a CREATE TABLE IF NOT EXISTS statement and then inserts rows from a dataframe into a Vertica database. The first time I run this script, I want it to create the table and insert the data; that works fine.
But from the next run onwards, I want it to create the table only if it does not exist (this works fine) and insert data only if the row is not already in the database.
I use both an INSERT statement and a COPY statement to insert data. How do I do this in Python? I am accessing the Vertica database from Python using pyodbc.
Editing the post to include some code:
There is a dataframe called tableframe_df, from which I need to populate content into a table created as below:
I am creating a table in Vertica with CREATE TABLE IF NOT EXISTS, which creates the table only if it does not already exist:
cursor.execute("create table if not exists <tablename> (fields in the table)")
Then I use a COPY statement to write to this table from a CSV that was created:
cursor.execute("COPY tablename1 FROM LOCAL 'tablename.csv' DELIMITER ',' exceptions 'exceptions' rejected data 'rejected'")
for i, row in tablename_df.iterrows():
    cursor.execute("insert into tablename2 values(?,?,?,?,?,?,?,?,?,?,?,?)",
                   row.values[0], row.values[1], row.values[2], row.values[3],
                   row.values[4], row.values[5], row.values[6], row.values[7],
                   row.values[8], row.values[9], row.values[10], row.values[11])
In the code above, I am creating the table and then inserting into tablename1 and tablename2 using COPY and INSERT. This works fine the first time it is executed (as there is no data in the tables). Now, if I run the same script twice by mistake, the data will be inserted twice into these tables. What check should I perform to ensure that data does not get inserted if it is already present?
First I'll mention that INSERT ... VALUES is pretty slow if you are doing a lot of rows. If you are using batch SQL and the standard Vertica drivers, it should convert the inserts to a COPY, but if it doesn't then they might take forever. I don't think this will happen with pyodbc, since it doesn't implement executemany() optimally. You might be able to with ceODBC, though I haven't tried it. Alternatively, you can use vertica_python, which has an efficient .copy('COPY FROM STDIN...', data) method.
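For reference, a minimal sketch of that vertica_python route (connection details and file name are placeholders; check the vertica_python docs for the exact options in your version):

import vertica_python

conn_info = {
    'host': 'vertica.example.com',  # placeholder connection details
    'port': 5433,
    'user': 'dbadmin',
    'password': 'secret',
    'database': 'mydb',
}

with vertica_python.connect(**conn_info) as conn:
    cur = conn.cursor()
    # Stream the CSV straight into the target table.
    with open('tablename.csv', 'rb') as fh:
        cur.copy("COPY tablename1 FROM STDIN DELIMITER ','", fh)
    conn.commit()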
Anyhow, for your question...
You can do it one of two ways. Also, for the inserts, I would really try to change this to a COPY or at least an executemany. Again, pyodbc does not do this properly, at least in the releases that I have used.
The first is to use a control table that somehow uniquely describes the set of data being loaded: insert into it, and check it before running the load so that the same data set is not loaded twice.
--Step 1. Check control table for data set load.
SELECT *
FROM mycontroltable
WHERE dataset = ?
--Step 2. If no row is found, insert the rows
for row in data:
    cursor.execute('INSERT INTO mytargettable....VALUES(...)')
-- Step 3. Insert row into control table
INSERT INTO mycontroltable( dataset ) VALUES ( ? )
-- Step 4. Commit data
COMMIT
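A hedged pyodbc sketch of that control-table flow; the DSN, table names, and the dataset label are placeholders taken from the steps above:

import pyodbc

dataset = "load-2017-01"  # placeholder label uniquely identifying this data set
data = [(1, "a", "b"), (2, "c", "d")]  # placeholder rows, e.g. from the dataframe

conn = pyodbc.connect("DSN=VerticaDSN")  # assumed ODBC DSN
cursor = conn.cursor()

# Step 1: has this data set been loaded before?
cursor.execute("SELECT 1 FROM mycontroltable WHERE dataset = ?", dataset)
if cursor.fetchone() is None:
    # Step 2: load the rows (COPY or executemany is preferable for volume).
    for row in data:
        cursor.execute("INSERT INTO mytargettable VALUES (?, ?, ?)", *row)
    # Step 3: record the load in the control table.
    cursor.execute("INSERT INTO mycontroltable (dataset) VALUES (?)", dataset)
    # Step 4: commit the data and the control row together.
    conn.commit()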
Alternatively, you can insert or merge data in based on a key. You can create a temp or other staging table to do it. If you don't want updates and the data does not change once inserted, then INSERT will be better, as it will not incur a delete vector. I'll show INSERT based on the way you phrased your question.
--Step 1. Create local temp for intermediate target
CREATE LOCAL TEMP TABLE mytemp (fields) ON COMMIT DELETE ROWS;
--Step 2. Insert data.
for row in data:
    cursor.execute('INSERT INTO mytemp....VALUES(...)')
--Step 3. Insert/select only data that doesn't exist by key value
INSERT INTO mytargettable (fields)
SELECT fields
FROM mytemp
WHERE NOT EXISTS (
    SELECT 1
    FROM mytargettable t
    WHERE t.key = mytemp.key
)
--Step 4. Commit
COMMIT;
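And a rough pyodbc sketch of this second flow, using executemany for the temp-table load as suggested above (the DSN, table, column, and key names are all placeholders):

import pyodbc

conn = pyodbc.connect("DSN=VerticaDSN")  # assumed ODBC DSN
cursor = conn.cursor()

# Step 1: intermediate target, emptied when the transaction commits.
cursor.execute("CREATE LOCAL TEMP TABLE mytemp (pk INT, val VARCHAR(100)) "
               "ON COMMIT DELETE ROWS")

# Step 2: batch-load the new rows into the temp table.
rows = [(1, 'a'), (2, 'b')]  # placeholder data, e.g. from the dataframe
cursor.executemany("INSERT INTO mytemp (pk, val) VALUES (?, ?)", rows)

# Step 3: move across only the rows whose key is not already present.
cursor.execute("""
    INSERT INTO mytargettable (pk, val)
    SELECT pk, val FROM mytemp
    WHERE NOT EXISTS (
        SELECT 1 FROM mytargettable t WHERE t.pk = mytemp.pk
    )
""")

# Step 4: commit (this also clears the temp table).
conn.commit()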
I am about to write a Python script to help me migrate data between different versions of the same application.
Before I get started, I would like to know if there is a script or module that does something similar which I can either use, or use as a starting point for rolling my own. The idea is to diff the data between specific tables and then store the diff as SQL INSERT statements to be applied to the earlier-version database.
Note: this script is not robust in the face of schema changes.
Generally the logic would be something along the lines of:
def diff_table(table1, table2):
    # return all rows in table2 that are not in table1
    pass

def persist_rows_tofile(rows, tablename):
    # save rows to file
    pass

dbnames = ('db.v1', 'db.v2')
tables_to_process = ('foo', 'foobar')

for table in tables_to_process:
    table1 = dbnames[0] + '.' + table
    table2 = dbnames[1] + '.' + table
    rows = diff_table(table1, table2)
    if len(rows):
        persist_rows_tofile(rows, table)
Is this a good way to write such a script, or could it be improved? I suspect it could be improved by caching database connections, etc., which I have left out because I am not too familiar with SQLAlchemy.
Any tips on how to add SQLAlchemy and generally improve such a script?
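For concreteness, a hedged sketch of what diff_table might look like if both versions are SQLite files with identical schemas (with a slightly different signature: database paths plus a table name), using ATTACH and EXCEPT for the row-level diff:

import sqlite3

def diff_table(old_db_path, new_db_path, table):
    # Return rows present in the new database's table but not in the old one's.
    conn = sqlite3.connect(old_db_path)
    conn.execute("ATTACH DATABASE ? AS newdb", (new_db_path,))
    # EXCEPT compares whole rows, so both tables must share the same columns.
    rows = conn.execute(
        "SELECT * FROM newdb.{0} EXCEPT SELECT * FROM main.{0}".format(table)
    ).fetchall()
    conn.close()
    return rows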
To move data between two databases I use pg_comparator. It's like diff and patch for SQL! You can use it to swap the order of columns, but if you need to split or merge columns you need to use something else.
I also use it to duplicate a database asynchronously. A cron job runs every five minutes and pushes all changes on the "master" database to the "slave" databases. It is especially handy if you only need to distribute a single table, or not all columns per table, etc.