LOAD DATA INFILE REPLACE with specific columns - python

I am currently using LOAD DATA LOCAL INFILE successfully, and typically with REPLACE. However, I am now trying to use the same command to load a CSV file into a table but only replace specific columns.
For example, table currently looks like:
ID date num1 num2
01 1/1/2017 100 200
01 1/2/2017 101 201
01 1/3/2017 102 202
where ID and date are the primary keys.
I have a similar CSV, but one that only has ID, date, num1 as columns. I want to load that new CSV into the table, but maintain whatever is in num2.
My current code per this post:
LOAD DATA LOCAL INFILE 'mycsv.csv'
REPLACE INTO TABLE mytable
FIELDS TERMINATED BY ','
OPTIONALLY ENCLOSED BY '\"'
LINES TERMINATED BY '\n'
IGNORE 1 LINES
(ID, date, num1)
Again, my code works flawlessly when I'm replacing all the columns, but when I try to replace only select columns, it fills the other columns with NULL values. The similar posts (like the one I referenced) haven't helped.
I don't know if this is a Python-specific issue, but I'm using MySQLdb to connect to the database, I'm familiar with the local_infile parameter, and that's all working well. MySQL version is 5.6.33.

Simply load the CSV into a similarly structured temp table and then run an UPDATE with a JOIN:
CREATE TABLE mytemptable AS SELECT * FROM mytable LIMIT 1; -- RUN ONLY ONCE
DELETE FROM mytemptable;
LOAD DATA LOCAL INFILE 'mycsv.csv'
INTO TABLE mytemptable
FIELDS TERMINATED BY ','
OPTIONALLY ENCLOSED BY '\"'
LINES TERMINATED BY '\n'
IGNORE 1 LINES
(ID, date, num1);
UPDATE mytable t1
INNER JOIN mytemptable t2 ON t1.ID = t2.ID AND t1.`date` = t2.`date`
SET t1.num1 = t2.num1;
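If you want to drive the whole sequence from Python with MySQLdb (as in the question), a minimal sketch might look like the following. The connection details are placeholders, and CREATE TEMPORARY TABLE ... LIKE is used here as a variation on the CREATE TABLE ... AS SELECT above:
import MySQLdb

# Placeholder connection details; adjust to your environment.
conn = MySQLdb.connect(host="localhost", user="user", passwd="password",
                       db="mydb", local_infile=1)
cur = conn.cursor()

# Reuse an empty temp table with the same structure as mytable.
cur.execute("CREATE TEMPORARY TABLE IF NOT EXISTS mytemptable LIKE mytable")
cur.execute("DELETE FROM mytemptable")

# Load only the columns present in the CSV into the temp table.
cur.execute("""
    LOAD DATA LOCAL INFILE 'mycsv.csv'
    INTO TABLE mytemptable
    FIELDS TERMINATED BY ',' OPTIONALLY ENCLOSED BY '"'
    LINES TERMINATED BY '\\n'
    IGNORE 1 LINES
    (ID, date, num1)
""")

# Copy num1 across without touching num2.
cur.execute("""
    UPDATE mytable t1
    INNER JOIN mytemptable t2 ON t1.ID = t2.ID AND t1.`date` = t2.`date`
    SET t1.num1 = t2.num1
""")
conn.commit()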

Related

Reading csv data into a table in postgresql via INSERT INTO with Python

I have a postgresql table created in python that I need to then populate with data from a csv file. The csv file has 4 columns and a header row. When I use a for loop with INSERT INTO it's not working correctly.
It is giving me an error telling me that a certain column doesn't exist, but the "column" it names is actually the first value in the ID column.
I've looked over all the similar issues reported on other questions and can't seem to find something that works.
The table looks like this (with more lines):
ID  Gender  Weight  Age
A   F       121     20
B   M       156     31
C   F       110     18
The code I am running is the following:
import pandas as pd
df = pd.read_csv('df.csv')
for x in df.index:
    cursor.execute("""
        INSERT INTO iddata (ID, Gender, Weight, Age)
        VALUES (%s, %s, %d, %d)""" % (df.loc[x]['ID'],
                                      df.loc[x]['Gender'],
                                      df.loc[x]['Weight'],
                                      df.loc[x]['Age']))
    conn.commit
The error I'm getting says
UndefinedColumn: column "a" does not exist
LINE 3: VALUES (A, F, 121, 20)
^
Replace the """ % with """,.
Also add () after .commit, and change the %d placeholders to %s (psycopg2 only accepts %s placeholders, regardless of the column type).
The fixed loop code becomes:
for x in df.index:
    cursor.execute("""
        INSERT INTO iddata (ID, Gender, Weight, Age)
        VALUES (%s, %s, %s, %s)""", (df.loc[x]['ID'],
                                     df.loc[x]['Gender'],
                                     df.loc[x]['Weight'],
                                     df.loc[x]['Age']))
    conn.commit()
The reason why you need a comma , is to pass the data (4-element tuple) as the 2nd argument of cursor.execute. By doing so, cursor.execute will take care of correct quoting and escaping. This matters for string values: proper escaping will make sure that strings containing any characters (including ', " and \) will be sent intact to the database.
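If you prefer not to loop row by row, the same parameterized insert can also be handed to executemany. A minimal sketch, assuming the same df, cursor and conn objects as above:
# Build a list of plain tuples, one per row, and insert them in a single call.
rows = list(df[['ID', 'Gender', 'Weight', 'Age']].itertuples(index=False, name=None))
cursor.executemany(
    "INSERT INTO iddata (ID, Gender, Weight, Age) VALUES (%s, %s, %s, %s)",
    rows)
conn.commit()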

how to avoid read_sql_query data truncation for the returned set

I'm executing a sql command which returns a create table statement provided by my database, PostgreSQL.
In order to execute the sql command I use:
import io
import json
import pandas as pd
import pandas.io.sql as psql
import psycopg2 as pg
import boto3
from datetime import datetime
conn = pg.connect(
    host=pgparams['url'],
    dbname=pgparams['db'],
    user=pgparams['usr'],
    password=pgparams['pwd'])

createTable_sql = "postgresql select which returns the create table statement"
df_create_table_script = pd.read_sql_query(createTable_sql, con=conn)
My goal is to create a table; the table-creation script is returned by PostgreSQL after executing the pd.read_sql_query command via pandas/Python.
If I execute "createTable_sql" in a pgsql interpreter (e.g. pgAdmin) it works fine, and the result is a single column containing the expected CREATE TABLE statement, i.e. a plain string with a length of 512 characters.
The content of "createTable_sql" variable is:
createTable_sql= "SELECT cast ('CREATE TABLE dbo.table1 (" ...
createTable_sql= createTable_sql + "|| string_agg(pa.attname || ' ' || pg_catalog.format_type(pa.atttypid, pa.atttypmod)|| coalesce(' DEFAULT ' || (select pg_catalog.pg_get_expr(d.adbin, d.adrelid) from pg_catalog.pg_attrdef d where d.adrelid = pa.attrelid and d.adnum = pa.attnum and pa.atthasdef), '') || ' ' || case pa.attnotnull when true then 'NOT NULL' else 'NULL' end, ',')"
createTable_sql= createTable_sql + " as column_from_script from pg_catalog.pg_attribute pa join pg_catalog.pg_class pc on pc.oid = pa.attrelid and pc.relname = 'tabl1_source' join pg_catalog.pg_namespace pn on pn.oid = pc.relnamespaceand pn.nspname = 'dbo' where pa.attnum > 0 and not pa.attisdropped group by pn.nspname, pc.relname, pa.attrelid;"
The result when executing this sql command should be:
CREATE TABLE dbo.table1 (col1 datatype, col2 datatype, ...etc) - the total number of characters in the script is 512.
My issue is that pandas read_sql_query (or read_sql) seems to have a limitation (or at least that is what I think) on the returned data set.
I was expecting the returned data set to have 512 characters, but the read_sql method is truncating it.
The result I'm getting when I try to access the result returned by PostgreSQL (the db engine) is:
' CREATE TABLE dbo.tabl1 (col1...'
Therefore, instead of the full text representing the table-creation script, I only get something that is cut off after the first couple of characters.
Initially I assumed it was only truncated when I used the print() function on the returned result, but the value itself is truncated as well.
I even tried another approach, such as:
conn = pg.connect(
    host=pgparams['url'],
    dbname=pgparams['db'],
    user=pgparams['usr'],
    password=pgparams['pwd'])

sql = createTable_sql
copy_func_csv = "COPY ({sql_cmd}) TO STDOUT WITH CSV {head}".format(sql_cmd=sql, head="HEADER")
cur = conn.cursor()
store = io.StringIO()
cur.copy_expert(copy_func_csv, store)
store.seek(0)
df_new = pd.read_csv(store, engine='python', true_values=[True, 't'], false_values=[False, 'f'])
table_script = df_new.column_from_script.to_string(header=False, index=False)
But the table_script content was still truncated, looking like:
' CREATE TABLE dbo.tabl1 (col1...'
Is there any way I can retrieve a response set, meaning a single column (e.g. Col1) which can have a datatype definition such as Varchar(1000), or STR?
Regards,
If running the same query from pgAdmin works fine, perhaps try running it directly through psycopg2? Try the following code chunk in place of pandas:
cur = conn.cursor()
cur.execute(createTable_sql)
result = cur.fetchall()
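Since the query returns a single row with a single column, the complete string can then be pulled straight out of that result. A small sketch, using the same variable names:
# result is a list of row tuples; the first column of the first row holds the script.
create_table_stmt = result[0][0]
print(len(create_table_stmt))  # should report the full length, e.g. 512
print(create_table_stmt)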

Python and Pandas to Query API's and update DB

I've been querying a few APIs with Python to individually create CSVs for a table.
Instead of recreating the table each time, I would like to try updating the existing table with any new API data.
At the moment, with the way the query works, I have a table that looks like this.
From this I am taking the suburbs of each state and copying them into a CSV for each different state.
Then, using this script, I am cleaning them into a list (the API needs "%20" for any spaces):
#suburbs = ["want this", "want this (meh)", "this as well (nope)"]
suburb_cleaned = []
#dont_want = frozenset( ["(meh)", "(nope)"] )
for urb in suburbs:
    cleaned_name = []
    name_parts = urb.split()
    for part in name_parts:
        if part in dont_want:
            continue
        cleaned_name.append(part)
    suburb_cleaned.append('%20'.join(cleaned_name))
Then taking the suburbs for each state and putting them into this API to return a csv,
timestr = time.strftime("%Y%m%d-%H%M%S")
Name = "price_data_NT"+timestr+".csv"
url_price = "http://mwap.com/api"
string = 'gxg&state='
api_results = {}
n = 0
y = 2
for urbs in suburb_cleaned:
    url = url_price + urbs + string + "NT"
    print(url)
    print(urbs)
    request = requests.get(url)
    api_results[urbs] = pd.DataFrame(request.json())
    n = n+1
    if n == y:
        dfs = pd.concat(api_results).reset_index(level=1, drop=True).rename_axis(
            'key').reset_index().set_index(['key'])
        dfs.to_csv(Name, sep='\t', encoding='utf-8')
        y = y+2
        continue
    print("made it through"+urbs)
    # print(request.json())
    # print(api_results)

dfs = pd.concat(api_results).reset_index(level=1, drop=True).rename_axis(
    'key').reset_index().set_index(['key'])
dfs.to_csv(Name, sep='\t', encoding='utf-8')
Then adding the states manually in excel, and combining and cleaning the suburb names.
# use pd.concat
df = pd.concat([act, vic,nsw,SA,QLD,WA]).reset_index().set_index(['key']).rename_axis('suburb').reset_index().set_index(['state'])
# apply lambda to clean the %20
f = lambda s: s.replace('%20', ' ')
df['suburb'] = df['suburb'].apply(f)
and then finally inserting it into a db
engine = create_engine('mysql://username:password@localhost/dbname')
with engine.connect() as conn, conn.begin():
    df.to_sql('Price_historic', conn, if_exists='replace', index=False)
Leading to this sort of output.
Now, this is a heck of a process. I would love to simplify it and have the database only update the values that are needed from the API, rather than have this much complexity in getting the data.
Would love some helpful tips on achieving this goal - I'm thinking I could do an update on the MySQL database instead of an insert, or something along those lines? And with the querying of the API, I feel like I'm overcomplicating it.
Thanks!
I don't see any reason why you would be creating CSV files in this process. It sounds like you can just query the data and then load it into a MySQL table directly. You say that you are adding the states manually in Excel? Is that data not available through your prior API calls? If not, could you find that information and save it to a CSV, so you can automate that step by loading it into a table and having Python look up the values for you?
Generally, you wouldn't want to overwrite the mysql table every time. When you have a table, you can identify the column or columns that uniquely identify a specific record, then create a UNIQUE INDEX for them. For example if your street and price values designate a unique entry, then in mysql you could run:
ALTER TABLE `Price_historic` ADD UNIQUE INDEX(street, price);
After this, your table will not allow duplicate records based on those values. Then, instead of creating a new table every time, you can insert your data into the existing table, with instructions to either update or ignore when you encounter a duplicate. For example:
final_str = "INSERT INTO Price_historic (state, suburb, property_price_id, type, street, price, date) " \
"VALUES (%s, %s, %s, %s, %s, %s, %s, %s) " \
"ON DUPLICATE KEY UPDATE " \
"state = VALUES(state), date = VALUES(date)"
con = pdb.connect(db_host, db_user, db_pass, db_name)
with con:
try:
cur = con.cursor()
cur.executemany(final_str, insert_list)
If the setup you are building is something for the longer term, I would suggest running two different processes in parallel:
Process 1:
Query API 1, obtain the required data and insert it into the DB table, with a binary/bit flag that records that only API 1 has been called.
Process 2:
Run a query on the DB to obtain all records needing API call 2, based on the binary/bit flag set in process 1; run call 2 for the corresponding data and write the results back to the DB table based on the primary key.
Database: I would suggest adding a primary key as well as bit flags (https://docs.oracle.com/cd/B28359_01/server.111/b28286/functions014.htm#SQLRF00612) that record the status of each API call. Bit flags also help you
- double-check whether a specific API call has been made for a specific record or not, and
- expand your project to additional API calls while still tracking the status of each API call at record level.
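An illustrative sketch of that flag-driven flow; the id, api1_done and api2_done columns (and the cur/con DB-API objects) are hypothetical, not part of the code above:
# Hypothetical 0/1 flag columns, one per API step.
cur.execute("ALTER TABLE Price_historic "
            "ADD COLUMN api1_done TINYINT(1) NOT NULL DEFAULT 0, "
            "ADD COLUMN api2_done TINYINT(1) NOT NULL DEFAULT 0")

# (The API 1 load would set api1_done = 1 as part of its INSERT.)
# Process 2: pick up rows API 1 has loaded but API 2 has not touched yet.
cur.execute("SELECT id, suburb FROM Price_historic "
            "WHERE api1_done = 1 AND api2_done = 0")
for row_id, suburb in cur.fetchall():
    # ... call API 2 for this suburb and write its data back here ...
    cur.execute("UPDATE Price_historic SET api2_done = 1 WHERE id = %s", (row_id,))
con.commit()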

Count the number of non-null values in each column of each table in MySQL

Is there a way to produce this output using SQL for all tables in a given database (using MySQL) without having to specify individual table names and columns?
Table Column Count
---- ---- ----
Table1 Col1 0
Table1 Col2 100
Table1 Col3 0
Table1 Col4 67
Table1 Col5 0
Table2 Col1 30
Table2 Col2 0
Table2 Col3 2
... ... ...
The purpose is to identify columns for analysis based on how much data they contain (a significant number of columns are empty).
The 'workaround' solution using python (one table at a time):
# Libraries
import pymysql
import pandas as pd
import pymysql.cursors
# Connect to mariaDB
connection = pymysql.connect(host='localhost',
                             user='root',
                             password='my_password',
                             db='my_database',
                             charset='latin1',
                             cursorclass=pymysql.cursors.DictCursor)
# Get column metadata
sql = """SELECT *
FROM `INFORMATION_SCHEMA`.`COLUMNS`
WHERE `TABLE_SCHEMA`='my_database'
"""
with connection.cursor() as cursor:
    cursor.execute(sql)
    result = cursor.fetchall()
# Store in dataframe
df = pd.DataFrame(result)
df = df[['TABLE_NAME', 'COLUMN_NAME']]
# Build SQL string (one table at a time for now)
my_table = 'my_table'
df_my_table = df[df.TABLE_NAME==my_table].copy()
cols = list(df_my_table.COLUMN_NAME)
col_strings = [''.join(['COUNT(', x, ') AS ', x, ', ']) for x in cols]
col_strings[-1] = col_strings[-1].replace(',','')
sql = ''.join(['SELECT '] + col_strings + ['FROM ', my_table])
# Execute
with connection.cursor() as cursor:
    cursor.execute(sql)
    result = cursor.fetchall()
The result is a dictionary of column names and counts.
Basically, no. See also this answer.
Also, note that the closest match of the answer above is actually the method you're already using, but less efficiently implemented in reflective SQL.
I'd do the same as you did: build SQL like
SELECT
COUNT(*) AS `count`,
SUM(IF(columnName1 IS NULL,1,0)) AS columnName1,
...
SUM(IF(columnNameN IS NULL,1,0)) AS columnNameN
FROM tableName;
using information_schema as a source for table and column names, then execute it for each table in MySQL, then disassemble the single row returned into N tuple entries (tableName, columnName, total, nulls).
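A rough Python sketch of that approach, reusing the pymysql DictCursor connection from the question: it builds one counting query per table from information_schema and flattens each single-row result into (table, column, total, nulls) tuples. The `_total` alias is just a placeholder name (it assumes no real column is called `_total`):
results = []

# Collect (table, column) pairs for the schema.
with connection.cursor() as cursor:
    cursor.execute("""SELECT TABLE_NAME, COLUMN_NAME
                      FROM INFORMATION_SCHEMA.COLUMNS
                      WHERE TABLE_SCHEMA = 'my_database'""")
    columns = cursor.fetchall()

tables = {}
for row in columns:
    tables.setdefault(row['TABLE_NAME'], []).append(row['COLUMN_NAME'])

# One query per table: total row count plus a NULL count per column.
with connection.cursor() as cursor:
    for table, cols in tables.items():
        select_list = ", ".join(
            "SUM(IF(`{0}` IS NULL,1,0)) AS `{0}`".format(c) for c in cols)
        cursor.execute("SELECT COUNT(*) AS `_total`, {0} FROM `{1}`"
                       .format(select_list, table))
        row = cursor.fetchone()
        total = row.pop('_total')
        for col, nulls in row.items():
            # SUM() is NULL for an empty table, so default to 0.
            results.append((table, col, total, nulls or 0))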
It is possible, but it's not going to be quick.
As mentioned in a previous answer you can work your way through the columns table in the information_schema to build queries to get the counts. It's then just a question of how long you are prepared to wait for the answer because you end up counting every row, for every column, in every table. You can speed things up a bit if you exclude columns that are defined as NOT NULL in the cursor (i.e. IS_NULLABLE = 'YES').
The solution suggested by LSerni is going to be much faster, particularly if you have very wide tables and/or high row counts, but would require more work handling the results.
e.g.
DELIMITER //

DROP PROCEDURE IF EXISTS non_nulls //

CREATE PROCEDURE non_nulls (IN sname VARCHAR(64))
BEGIN
    -- Parameters:
    -- Schema name to check
    -- call non_nulls('sakila');
    DECLARE vTABLE_NAME varchar(64);
    DECLARE vCOLUMN_NAME varchar(64);
    DECLARE vIS_NULLABLE varchar(3);
    DECLARE vCOLUMN_KEY varchar(3);
    DECLARE done BOOLEAN DEFAULT FALSE;
    DECLARE cur1 CURSOR FOR
        SELECT `TABLE_NAME`, `COLUMN_NAME`, `IS_NULLABLE`, `COLUMN_KEY`
        FROM `information_schema`.`columns`
        WHERE `TABLE_SCHEMA` = sname
        ORDER BY `TABLE_NAME` ASC, `ORDINAL_POSITION` ASC;
    DECLARE CONTINUE HANDLER FOR NOT FOUND SET done := TRUE;

    DROP TEMPORARY TABLE IF EXISTS non_nulls;
    CREATE TEMPORARY TABLE non_nulls(
        table_name VARCHAR(64),
        column_name VARCHAR(64),
        column_key CHAR(3),
        is_nullable CHAR(3),
        rows BIGINT,
        populated BIGINT
    );

    OPEN cur1;
    read_loop: LOOP
        FETCH cur1 INTO vTABLE_NAME, vCOLUMN_NAME, vIS_NULLABLE, vCOLUMN_KEY;
        IF done THEN
            LEAVE read_loop;
        END IF;
        SET @sql := CONCAT('INSERT INTO non_nulls ',
            '(table_name,column_name,column_key,is_nullable,rows,populated) ',
            'SELECT \'', vTABLE_NAME, '\',\'', vCOLUMN_NAME, '\',\'', vCOLUMN_KEY, '\',\'',
            vIS_NULLABLE, '\', COUNT(*), COUNT(`', vCOLUMN_NAME, '`) ',
            'FROM `', sname, '`.`', vTABLE_NAME, '`');
        PREPARE stmt1 FROM @sql;
        EXECUTE stmt1;
        DEALLOCATE PREPARE stmt1;
    END LOOP;
    CLOSE cur1;

    SELECT * FROM non_nulls;
END //

DELIMITER ;
call non_nulls('sakila');
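If you would rather drive the procedure from the Python side, a small sketch using the pymysql DictCursor connection from the question above (the procedure must already be installed in the database):
with connection.cursor() as cursor:
    # The procedure's final SELECT comes back as an ordinary result set.
    cursor.execute("CALL non_nulls(%s)", ('my_database',))
    for row in cursor.fetchall():
        print(row['table_name'], row['column_name'], row['rows'], row['populated'])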

Sorting Multi-indexed Large Database Table in Sqlite

I am attempting to select from (with a WHERE clause) and sort a large database table in sqlite3 via python. The sort is currently taking 30+ minutes on about 36 MB of data. I have a feeling it can work faster than this with indices, but I think the order of my code may be incorrect.
The code is executed in the order listed here.
My CREATE TABLE statement looks like this:
c.execute('''CREATE TABLE gtfs_stop_times (
    trip_id text,         -- REFERENCES gtfs_trips(trip_id),
    arrival_time text,    -- CHECK (arrival_time LIKE '__:__:__'),
    departure_time text,  -- CHECK (departure_time LIKE '__:__:__'),
    stop_id text,         -- REFERENCES gtfs_stops(stop_id),
    stop_sequence int NOT NULL
)''')
The rows are then inserted in the next step:
stop_times = csv.reader(open("tmp\\avl_stop_times.txt"))
c.executemany('INSERT INTO gtfs_stop_times VALUES (?,?,?,?,?)', stop_times)
Next, I create an index out of two columns (trip_id and stop_sequence):
c.execute('CREATE INDEX trip_seq ON gtfs_stop_times (trip_id, stop_sequence)')
Finally, I run a SELECT statement with a WHERE clause that sorts this data by the two columns used in the index and then write that to a csv file:
c.execute('''SELECT gtfs_stop_times.trip_id, gtfs_stop_times.arrival_time, gtfs_stop_times.departure_time, gtfs_stops.stop_id, gtfs_stop_times.stop_sequence
             FROM gtfs_stop_times, gtfs_stops
             WHERE gtfs_stop_times.stop_id = gtfs_stops.stop_code
             ORDER BY gtfs_stop_times.trip_id, gtfs_stop_times.stop_sequence''')
f = open("gtfs_update\\stop_times.txt", "w")
writer = csv.writer(f, dialect = 'excel')
writer.writerow([i[0] for i in c.description]) # write headers
writer.writerows(c)
del writer
Is there any way to speed up Step 4 (possibly by changing how I add and/or use the index), or should I just go to lunch while this runs?
I have added PRAGMA statements to try to improve performance to no avail:
c.execute('PRAGMA main.page_size = 4096')
c.execute('PRAGMA main.cache_size=10000')
c.execute('PRAGMA main.locking_mode=EXCLUSIVE')
c.execute('PRAGMA main.synchronous=NORMAL')
c.execute('PRAGMA main.journal_mode=WAL')
c.execute('PRAGMA main.cache_size=5000')
The SELECT executes extremely fast because there is no gtfs_stops table and you get nothing but an error message.
If we assume that there is a gtfs_stops table, then your trip_seq index is already quite optimal for the query.
However, you also need an index for looking up stop_code values in the gtfs_stops table.
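For example, something along these lines (a sketch, assuming gtfs_stops has been created and loaded the same way as gtfs_stop_times; the index name is arbitrary):
# Index the lookup column used in the join so SQLite does not scan gtfs_stops per row.
c.execute('CREATE INDEX stops_code ON gtfs_stops (stop_code)')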
