I have successfully loaded large CSV files (with 1 header row) into a MySQL table from Python with the command:
LOAD DATA LOCAL INFILE 'file.csv' INTO TABLE 'table' FIELDS TERMINATED BY ';' IGNORE 1 LINES (@vone, @vtwo, @vthree) SET DatumTijd = @vone, Debiet = NULLIF(@vtwo,''), Boven = NULLIF(@vthree,'')
The file contains historic data back to 1970. Every month I get an update with roughly 4320 rows that need to be added to the existing table. Sometimes there is an overlap with the existing table, so I would like to use REPLACE. But this does not seem to work in combination with IGNORE 1 LINES. The primary key is DatumTijd, which follows the mysql datetime format.
I tried several combinations of REPLACE and IGNORE in different orders, both before the INTO TABLE 'table' clause and after the FIELDS TERMINATED BY part.
Any suggestions on how to solve this?
Apart from the possible typo of enclosing the table name in single quotes rather than backticks, the load statement works fine on my Windows device given the following data:
one;two;three
2023-01-01;1;1
2023-01-02;2;2
2023-01-01;3;3
2022-01-04;;;
Note that I prefer coalesce to nullif, and I have included an auto_increment id to demonstrate what REPLACE actually does, i.e. delete and insert.
drop table if exists t;
create table t(
id int auto_increment primary key,
DatumTijd date,
Debiet varchar(1),
boven varchar(1),
unique index key1(DatumTijd)
);
LOAD DATA INFILE 'C:\\Program Files\\MariaDB 10.1\\data\\sandbox\\data.txt'
replace INTO TABLE t
FIELDS TERMINATED BY ';'
IGNORE 1 LINES (@vone, @vtwo, @vthree)
SET DatumTijd = @vone,
Debiet = coalesce(@vtwo,''),
Boven = coalesce(@vthree,'')
;
select * from t;
+----+------------+--------+-------+
| id | DatumTijd | Debiet | boven |
+----+------------+--------+-------+
| 2 | 2023-01-02 | 2 | 2 |
| 3 | 2023-01-01 | 3 | 3 |
| 4 | 2022-01-04 | | |
+----+------------+--------+-------+
3 rows in set (0.001 sec)
It should not matter that REPLACE in effect creates a new record, but if it does matter to you, consider loading into a staging table and then doing INSERT ... ON DUPLICATE KEY UPDATE into the target, as sketched below.
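In case the staging-table route is of interest, here is a minimal sketch of that flow from Python (assumptions: mysql-connector-python, hypothetical connection details, a staging table named t_staging, and the table/column names from the example above):
import mysql.connector

# Assumed connection details; LOCAL INFILE must be enabled on both client and server.
conn = mysql.connector.connect(host="localhost", user="user", password="pw",
                               database="sandbox", allow_local_infile=True)
cur = conn.cursor()

# 1. Load the monthly file into an empty staging table with the same structure as t.
cur.execute("CREATE TEMPORARY TABLE t_staging LIKE t")
cur.execute("""
    LOAD DATA LOCAL INFILE 'file.csv'
    INTO TABLE t_staging
    FIELDS TERMINATED BY ';'
    IGNORE 1 LINES (@vone, @vtwo, @vthree)
    SET DatumTijd = @vone,
        Debiet = NULLIF(@vtwo, ''),
        Boven = NULLIF(@vthree, '')
""")

# 2. Upsert into the target: overlapping DatumTijd rows are updated in place,
#    so existing auto_increment ids are preserved (unlike REPLACE, which deletes and re-inserts).
cur.execute("""
    INSERT INTO t (DatumTijd, Debiet, Boven)
    SELECT DatumTijd, Debiet, Boven FROM t_staging
    ON DUPLICATE KEY UPDATE Debiet = VALUES(Debiet), Boven = VALUES(Boven)
""")
conn.commit()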
I've written some code to parse a website and insert the results into a MySQL db.
The problem is I am getting a lot of duplicates per FKToTech_id, like:
+----+------------------+-------------+
| id | ref              | FKToTech_id |
+----+------------------+-------------+
|  1 | website.com/path |           1 |
|  2 | website.com/path |           1 |
|  3 | website.com/path |           1 |
+----+------------------+-------------+
What I'm looking for instead is to have just one row when a ref has already been entered for that FKToTech_id, rather than multiple copies of the same row, like:
+----+------------------+-------------+
| id | ref              | FKToTech_id |
+----+------------------+-------------+
|  1 | website.com/path |           1 |
+----+------------------+-------------+
How can I modify my code below to simply pass in Python if such a row already exists (i.e. one ref with the same FKToTech_id)?
for i in elms:
    allcves = {cursor.execute("INSERT INTO TechBooks (ref, FKToTech_id) VALUES (%s, %s) ", (i.attrs["href"], row[1])) for row in cves}
mydb.commit()
Thanks
Make ref a unique column, then use INSERT IGNORE to skip the insert if it would cause a duplicate key error.
ALTER TABLE TechBooks ADD UNIQUE INDEX (ref);
for i in elms:
    cursor.executemany("INSERT IGNORE INTO TechBooks (ref, FKToTech_id) VALUES (%s, %s) ", [(i.attrs["href"], row[1]) for row in cves])
mydb.commit()
I'm not sure what your intent was by assigning the results of cursor.execute() to allcves. cursor.execute() doesn't return a value unless you use multi=True. I've replaced the useless set comprehension with use of cursor.executemany() to insert many rows at once.
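Tying it together, a self-contained sketch of that flow (a sketch only, assuming mysql-connector-python with hypothetical connection details, and the elms / cves variables from the question's parsing code):
import mysql.connector

# Assumed connection details.
mydb = mysql.connector.connect(host="localhost", user="user", password="pw", database="mydb")
cursor = mydb.cursor()

# One-time setup: make ref unique so duplicate inserts are rejected.
# Re-running this raises "Duplicate key name", hence the guard.
try:
    cursor.execute("ALTER TABLE TechBooks ADD UNIQUE INDEX (ref)")
except mysql.connector.Error:
    pass  # index already exists

# INSERT IGNORE silently skips rows that would violate the unique index.
for i in elms:
    cursor.executemany(
        "INSERT IGNORE INTO TechBooks (ref, FKToTech_id) VALUES (%s, %s)",
        [(i.attrs["href"], row[1]) for row in cves])
mydb.commit()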
In Spark SQL, I need to cast as_of_date to string and do multiple inner joins with 3 tables, selecting all rows & columns from tables 1, 2 and 3 after the join. Example table schemas are shown below.
Tablename : Table_01 alias t1
Column | Datatype
as_of_date | String
Tablename | String
Credit_Card | String
Tablename : Table_02 alias t2
Column | Datatype
as_of_date | INT
Customer_name | String
tablename | string
Tablename : Table_03 alias t3
Column | Datatype
as_of_date | String
tablename | String
address | String
Join use-case :
t1.as_of_date = t2.as_of_date AND t1.tablename = t2.tablename
t2.as_of_date = t3.as_of_date AND t2.tablename = t3.tablename
The tables are already created in Hive. I am doing a Spark transformation on top of these tables, and I am converting as_of_date in table_02 to string.
There are 2 approaches I have thought of, but I am unsure which is the best approach.
Approach 1:
df = spark.sql("select t1.*,t2.*,t3.* from table_1 t1 where cast(t1.as_of_date as string) inner join table_t2 t2 on t1.as_of_date = t2.as_of_date AND t1.tablename = t2.tablename inner join table_03 t3 on t2.as_of_date = t3.as_of_date and t2.tablename = t3.tablename")
Approach 2:
from pyspark.sql.functions import col
from pyspark.sql.types import StringType

df_t1 = spark.sql("select * from table_01")
df_t2 = spark.sql("select * from table_02")
df_t3 = spark.sql("select * from table_03")
## Cast as_of_date to string if its dtype is int
if dict(df_t2.dtypes)["as_of_date"] == 'int':
    df_t2 = df_t2.withColumn("as_of_date", col("as_of_date").cast(StringType()))
## Join condition
df = (df_t1.alias('t1')
      .join(df_t2.alias('t2'), on=[col('t1.as_of_date') == col('t2.as_of_date'), col('t1.tablename') == col('t2.tablename')], how='inner')
      .join(df_t3.alias('t3'), on=[col('t2.as_of_date') == col('t3.as_of_date'), col('t2.tablename') == col('t3.tablename')], how='inner')
      .select('t1.*', 't2.*', 't3.*'))
I feel that approach 2 is long-winded. I would like some advice on which approach to go with, for ease of maintenance of the scripts.
I would suggest using Spark SQL directly as below. You can cast every as_of_date column from all tables as a string regardless of its data type. You want to cast integer into string, but if you also cast string into a string, it does no harm.
df = spark.sql("""
select t1.*, t2.*, t3.*
from table_01 t1
join table_02 t2 on string(t1.as_of_date) = string(t2.as_of_date) AND t1.tablename = t2.tablename
join table_03 t3 on string(t2.as_of_date) = string(t3.as_of_date) AND t2.tablename = t3.tablename
""")
I'm trying to retrieve data from an sqlite table conditional on another sqlite table within the same database. The formats are the following:
master table
--------------------------------------------------
| id         | TypeA  | TypeB  | ...
--------------------------------------------------
| 2020/01/01 | ID_0-0 | ID_0-1 | ...
--------------------------------------------------
child table
--------------------------------------------------
| id     | Attr1  | Attr2 | ...
--------------------------------------------------
| ID_0-0 | 112.04 | -3.45 | ...
--------------------------------------------------
I want to write an sqlite3 query that takes:
A date D present in master.id
A list of types present in master's columns
A list of attributes present in child's columns
and returns a dataframe whose rows are the types and whose columns are the attributes. Of course, I could just read the tables into pandas and do the work there, but I think that would be more computationally intensive, and I want to learn more SQL syntax!
UPDATE
So far I've tried:
"""SELECT *
FROM child
WHERE EXISTS
(SELECT *
FROM master
WHERE master.TypeX = Type_input
AND master.dateX = date_input)"""
and then concatenate the strings over all required TypeX and dateX and execute with:
cur.executescript(script)
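For what it's worth, here is a minimal sketch of one way to parameterise this with sqlite3 and pandas (assumptions: the master/child layout shown above, a hypothetical database file mydata.db, exactly one master row per date, and that the type/attribute lists contain trusted column names, since identifiers cannot be bound as parameters):
import sqlite3
import pandas as pd

# Hypothetical inputs, following the layout in the question.
date_input = "2020/01/01"
types = ["TypeA", "TypeB"]          # columns of master
attrs = ["Attr1", "Attr2"]          # columns of child

con = sqlite3.connect("mydata.db")  # assumed database file

frames = []
for type_col in types:
    # Column names are interpolated from the trusted lists above;
    # the date is passed as a bound parameter.
    sql = f"""
        SELECT {', '.join('child.' + a for a in attrs)}
        FROM child
        JOIN master ON master."{type_col}" = child.id
        WHERE master.id = ?
    """
    frames.append(pd.read_sql_query(sql, con, params=(date_input,)))

# One row per requested type, one column per requested attribute
# (assumes each query above returns exactly one row).
result = pd.concat(frames, ignore_index=True)
result.index = types
print(result)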
Using pandas in Python, I need to be able to generate efficient queries from a dataframe into PostgreSQL. Unfortunately DataFrame.to_sql(...) only performs direct inserts, and the query I wish to make is fairly complicated.
Ideally, I'd like to do this:
WITH my_data AS (
    SELECT * FROM (
        VALUES
        <dataframe data>
    ) AS data (col1, col2, col3)
)
UPDATE my_table
SET
    col1 = my_data.col1,
    col2 = complex_function(my_table.col2, my_data.col2)
FROM my_data
WHERE my_table.col3 < my_data.col3;
However, to do that, I would need to turn my dataframe into a plain VALUES statement. I could, of course, write my own functions, but past experience has taught me that writing functions to escape and sanitize SQL should never be done by hand.
We are using SQLAlchemy, but bound parameters seem to only work with a limited number of arguments, and ideally I would like the serialization of the dataframe into text to be done at C speed.
So, is there a way, either through pandas or through SQLAlchemy, to efficiently turn my dataframe into the VALUES substatement and insert it into my query?
You could use psycopg2.extras.execute_values.
For example, given this setup
CREATE TABLE my_table (
col1 int
, col2 text
, col3 int
);
INSERT INTO my_table VALUES
(99, 'X', 1)
, (99, 'Y', 2)
, (99, 'Z', 99);
# | col1 | col2 | col3 |
# |------+------+------|
# | 99 | X | 1 |
# | 99 | Y | 2 |
# | 99 | Z | 99 |
The python code
import psycopg2
import psycopg2.extras as pge
import pandas as pd
import config
df = pd.DataFrame([
    (1, 'A', 10),
    (2, 'B', 20),
    (3, 'C', 30)])

with psycopg2.connect(host=config.HOST, user=config.USER, password=config.PASS, database=config.USER) as conn:
    with conn.cursor() as cursor:
        sql = '''WITH my_data AS (
SELECT * FROM (
VALUES %s
) AS data (col1, col2, col3)
)
UPDATE my_table
SET
col1 = my_data.col1,
-- col2 = complex_function(col2, my_data.col2)
col2 = my_table.col2 || my_data.col2
FROM my_data
WHERE my_table.col3 < my_data.col3'''
        pge.execute_values(cursor, sql, df.values)
updates my_table to be
# SELECT * FROM my_table
| col1 | col2 | col3 |
|------+------+------|
| 99 | Z | 99 |
| 1 | XA | 1 |
| 1 | YA | 2 |
Alternatively, you could use psycopg2 to generate the SQL.
The code in format_values is almost entirely copied from the source code for pge.execute_values.
import psycopg2
import psycopg2.extras as pge
import pandas as pd
import config
df = pd.DataFrame([
    (1, "A'foo'", 10),
    (2, 'B', 20),
    (3, 'C', 30)])

def format_values(cur, sql, argslist, template=None, page_size=100):
    enc = pge._ext.encodings[cur.connection.encoding]
    if not isinstance(sql, bytes):
        sql = sql.encode(enc)
    pre, post = pge._split_sql(sql)
    result = []
    for page in pge._paginate(argslist, page_size=page_size):
        if template is None:
            template = b'(' + b','.join([b'%s'] * len(page[0])) + b')'
        parts = pre[:]
        for args in page:
            parts.append(cur.mogrify(template, args))
            parts.append(b',')
        parts[-1:] = post
        result.append(b''.join(parts))
    return b''.join(result).decode(enc)

with psycopg2.connect(host=config.HOST, user=config.USER, password=config.PASS, database=config.USER) as conn:
    with conn.cursor() as cursor:
        sql = '''WITH my_data AS (
SELECT * FROM (
VALUES %s
) AS data (col1, col2, col3)
)
UPDATE my_table
SET
col1 = my_data.col1,
-- col2 = complex_function(col2, my_data.col2)
col2 = my_table.col2 || my_data.col2
FROM my_data
WHERE my_table.col3 < my_data.col3'''
        print(format_values(cursor, sql, df.values))
yields
WITH my_data AS (
SELECT * FROM (
VALUES (1,'A''foo''',10),(2,'B',20),(3,'C',30)
) AS data (col1, col2, col3)
)
UPDATE my_table
SET
col1 = my_data.col1,
-- col2 = complex_function(col2, my_data.col2)
col2 = my_table.col2 || my_data.col2
FROM my_data
WHERE my_table.col3 < my_data.col3
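A brief usage follow-up: format_values only builds the query string, so to actually apply the update you would pass the result to cursor.execute inside the same with block (a sketch reusing the names above):
        qry = format_values(cursor, sql, df.values)
        cursor.execute(qry)  # committed when the psycopg2 connection context manager exits cleanly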