SQLite3 + Python CSV DictReader: Best Method to handle empty values

I'm still new to Python, and I ran into an issue earlier this month where the string '0' was being passed into my integer column (using a SQLite db). More information is in my original thread:
SQL: Can WHERE Statement filter out specific groups for GROUP BY Statement
This caused my SQL Query statements to return invalid data.
I'm having this same problem pop up in other columns in my database when the CSV file does not contain any value for the specific cell.
The source of my data is an external csv file that I download (Unicode format). I use the following code to insert my data into the DB:
with sqlite3.connect(db_filename) as conn:
    dbcursor = conn.cursor()
    with codecs.open(csv_filename, "r", "utf-8-sig") as f:
        csv_reader = csv.DictReader(f, delimiter=',')
        # This is a much smaller column example as the actual data has many columns.
        csv_dict = [(i['col1'], i['col2']) for i in csv_reader]
        dbcursor.executemany(sql_str, csv_dict)
From what I researched, SQLite by design does not enforce column types when inserting values. My solution to my original problem was to manually check for an empty value and replace it with the integer 0 using this code:
def Check_Session_ID(sessionID):
    if sessionID == '':
        sessionID = int(0)
    return sessionID
Each integer / float column will need to be checked when I insert the values into the database. Since there will be many rows on each import (100K+ rows x 50+ columns), I would imagine the imports will take quite a bit of time.
What are better ways to handle this problem instead of checking each value for each Int / Float column per row?
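One idea I had is to do the conversion once per row while building the parameter list, with one converter per column, instead of calling a check function for every value. A rough sketch (treating col1 as an integer column is just for illustration):
def to_int(value):
    # Empty CSV cells become the integer 0; everything else is cast to int
    return int(value) if value != '' else 0

with sqlite3.connect(db_filename) as conn:
    dbcursor = conn.cursor()
    with codecs.open(csv_filename, "r", "utf-8-sig") as f:
        csv_reader = csv.DictReader(f, delimiter=',')
        csv_rows = [(to_int(i['col1']), i['col2']) for i in csv_reader]
        dbcursor.executemany(sql_str, csv_rows)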
Thank you so much for the advice and guidance.

Related

Using Python - How can I parse CSV list of both integers and strings and add to SQL table through Insert Statement?

I am automating a task through Python that will run an SQL statement to insert into an existing table in a DB.
My CSV headers look as such:
ID,ACCOUNTID,CATEGORY,SUBCATEGORY,CREATION_DATE,CREATED_BY,REMARK,ISIMPORTANT,TYPE,ENTITY_TYPE
My values:
seq_addnoteid.nextval,123456,TEST,ADMN_TEST,sysdate,ME,This is a test,Y,1,A
NOTE: Currently, seq_addnoteid.nextval works from the DB, but in my code I added a small snippet to get the max ID, and the rows will increase this by one for each iteration.
Sysdate could also be passed in the format '19-MAY-22'.
If I were to run this from the DB, this would work:
insert into notes values(seq_addnoteid.nextval,'123456','TEST','ADMN_TEST',sysdate,'ME','This is a test','Y',1,'A');
# Snippet to get the current max ID
cursor.execute("SELECT MAX(ID) from NOTES")
max = cursor.fetchone()
max = int(max[0])

with open('sample.csv', 'r') as f:
    reader = csv.reader(f)
    columns = next(reader)
    query = 'INSERT INTO NOTES({0}) values ({1})'
    query = query.format(','.join(columns), ','.join('?' * len(columns)))
    cursor = conn.cursor()
    for data in reader:
        cursor.execute(query, data)
    conn.commit()
    print("Records inserted successfully")
    cursor.close()
    conn.close()
Currently, I'm getting Oracle error ORA-01036: illegal variable name/number, and I think it's because of my query.format line. However, I'm looking for help to get this code to handle the data types properly.
Thanks!
Try printing your query before you execute it. I think you'll find that it's printing this:
INSERT INTO NOTES(ID,ACCOUNTID,CATEGORY,SUBCATEGORY,CREATION_DATE,CREATED_BY,REMARK,ISIMPORTANT,TYPE,ENTITY_TYPE)
values(seq_addnoteid.nextval,123456,TEST,ADMN_TEST,sysdate,ME,This is a test,Y,1,A);
Which will also give you an ORA-01036 if you try to run it manually.
The problem is that you want some of your column values to be literal values and some of them to be strings escaped in single quotes, and your code doesn't do that. I don't think there's an easy way to do it with ','.join(), so you'll either need to modify your CSVs to quote the strings, like:
seq_addnoteid.nextval,"'123456'","'TEST'","'ADMN_TEST'",sysdate,"'ME'","'This is a test'","'Y'",1,"'A'"
Or modify your query.format to add the quotes around the parameters that you want to treat as strings:
query.format(','.join(columns), "?,'?','?','?',?,'?','?','?',?,'?'")
As the commenters mentioned, pandas does handle this all very nicely.
EDIT: I see what you're saying. I'm not sure pandas will help with the literal functions you want to pass to the insert. But yes, you should be able to change your CSV and then do:
query.format(','.join(columns) + ',ID,CREATION_DATE', "'?','?','?','?','?','?',?,'?',seq_addnoteid.nextval,sysdate")
As a side note, a lot of people do this sort of thing on the database side in a BEFORE INSERT trigger, e.g.:
create or replace trigger NOTES_INS_TRG
before insert on NOTES
for each row
begin
    :NEW.ID := seq_addnoteid.nextval;
    :NEW.CREATION_DATE := sysdate;
end;
/
Then you could leave those columns out of your insert entirely.
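With that trigger in place, the Python side only needs binds for the CSV columns. A rough sketch, reusing columns and reader from the snippet above (the :1, :2, ... numbered binds are cx_Oracle style):
# ID and CREATION_DATE are filled in by the trigger, so bind only the CSV columns
binds = ','.join(':%d' % (i + 1) for i in range(len(columns)))
query = 'INSERT INTO NOTES ({0}) VALUES ({1})'.format(','.join(columns), binds)
for data in reader:
    cursor.execute(query, data)
conn.commit()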
Edit again:
I'm not sure you can use ? for bind/substitution variables in cx_Oracle (see the documentation). So where your raw query is currently:
INSERT INTO NOTES(ID,CREATION_DATE,ACCOUNTID,CATEGORY,SUBCATEGORY,CREATED_BY,REMARK,ISIMPORTANT,TYPE,ENTITY_TYPE)
values (seq_addnoteid.nextval,sysdate,'?','?','?','?','?','?',?,'?')
You'd need something like:
INSERT INTO NOTES(ID,CREATION_DATE,ACCOUNTID,CATEGORY,SUBCATEGORY,CREATED_BY,REMARK,ISIMPORTANT,TYPE,ENTITY_TYPE)
values (seq_addnoteid.nextval,sysdate,:1,:2,:3,:4,:5,:6,:7,:8)
We can probably do that by modifying the format string again to generate some bind variables:
query.format('ID,CREATION_DATE,' + ','.join(columns),
             "seq_addnoteid.nextval,sysdate," + ','.join([':' + c for c in columns]))
Again, try printing the query before executing it to make sure the column names and values are lining up correctly.
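Putting the pieces together, a minimal sketch of the whole flow (this assumes cx_Oracle, and assumes the CSV header no longer contains ID and CREATION_DATE, per the edit above):
import csv
import cx_Oracle

conn = cx_Oracle.connect("user/password@host/service")  # hypothetical connect string
cursor = conn.cursor()

with open('sample.csv', 'r') as f:
    reader = csv.reader(f)
    columns = next(reader)  # e.g. ACCOUNTID,CATEGORY,...,ENTITY_TYPE
    binds = ','.join(':%d' % (i + 1) for i in range(len(columns)))
    query = ('INSERT INTO NOTES (ID, CREATION_DATE, %s) '
             'VALUES (seq_addnoteid.nextval, sysdate, %s)' % (','.join(columns), binds))
    cursor.executemany(query, [tuple(row) for row in reader])

conn.commit()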

200GB Sqlite database file search and retrieve

I have a 250GB sqlite database file on an SSD drive and need to search through this file and search for a specific value in a table.
I wrote a script to perform the lookup in python and here is a similar sql statement to the one that I wrote:
SELECT table FROM database WHERE table like X'003485FAd480'.
I am looking to compare hex values stored in a table against a given hex value. I am using the Anaconda command prompt and am not sure if this is the best route.
My question: are there recommendations or tools to help speed up the lookup?
Thanks!
LIKE converts both operands into strings, so it might not work correctly if a value contains zero bytes or bytes that are not valid in the UTF-8 encoding.
To compare for equality, use =:
SELECT ... FROM MyTable WHERE MyColumn = x'003485FAD480';
This search can be sped up with an index on the lookup column; if you do not already have a primary key or unique constraint on this column, you can create an index manually:
CREATE INDEX MyLittleIndex ON MyTable(MyColumn);
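From Python's sqlite3 module, the same equality lookup works with a parameterized query and a bytes value; a minimal sketch (the file, table, and column names are placeholders):
import sqlite3

conn = sqlite3.connect('huge.db')  # hypothetical filename
cur = conn.cursor()

# bytes.fromhex() builds the BLOB that x'003485FAD480' denotes in SQL
target = bytes.fromhex('003485FAD480')
cur.execute("SELECT * FROM MyTable WHERE MyColumn = ?", (target,))
print(cur.fetchall())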
I don't know if this is what you're looking for, but you mentioned using Python. If you're searching for different values that are in Python, have you thought about writing two functions, one to search the database and one to compare those results and do something with them?
import pyodbc

def queryFunction():
    cnxn = pyodbc.connect('DRIVER={SQLite3 ODBC Driver};SERVER=localhost;DATABASE=test.db;Trusted_connection=yes')  # for production use only
    cursor = cnxn.cursor()
    cursor.execute("SELECT table FROM database")
    for row in cursor.fetchall():
        yield str(row.table)

def compareFunction(row):
    search = '003485FAd480'
    if row == search:
        print('Yes')
    else:
        print('No')
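Tying the two together would then look something like this:
for row in queryFunction():
    compareFunction(row)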

SQLite3 Columns Are Not Unique

I'm inserting data from some csv files into my SQLite3 database with a Python script I wrote. When I run the script, it inserts the first row into the database, but gives this error when trying to insert the second:
sqlite3.IntegrityError: columns column_name1, column_name2 are not unique.
It is true the values in column_name1 and column_name2 are same in the first two rows of the csv file. But, this seems a bit strange to me, because reading about this error indicated that it signifies a uniqueness constraint on one or more of the database's columns. I checked the database details using SQLite Expert Personal, and it does not show any uniqueness constraints on the current table. Also, none of the fields that I am entering specify the primary key. It seems that the database automatically assigns those. Any thoughts on what could be causing this error? Thanks.
import sqlite3
import csv

if __name__ == '__main__':
    conn = sqlite3.connect('ts_database.sqlite3')
    c = conn.cursor()
    fileName = "file_name.csv"
    f = open(fileName)
    csv_f = csv.reader(f)
    for row in csv_f:
        command = "INSERT INTO table_name(column_name1, column_name2, column_name3)"
        command += " VALUES (%s, '%s', %s);" % (row[0], row[1], row[2])
        print command
        c.execute(command)
    conn.commit()
    f.close()
If SQLite is reporting an IntegrityError error it's very likely that there really is a PRIMARY KEY or UNIQUE KEY on those two columns and that you are mistaken when you state there is not. Ensure that you're really looking at the same instance of the database.
Also, do not write your SQL statement using string interpolation. It's dangerous and also difficult to get correct (as you probably know considering you have single quotes on one of the fields). Using parameterized statements in SQLite is very, very simple.
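For example, a parameterized version of the insert above (same table and column names) looks like this:
command = ("INSERT INTO table_name(column_name1, column_name2, column_name3) "
           "VALUES (?, ?, ?)")
c.execute(command, (row[0], row[1], row[2]))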
The error may be due to duplicate column names in the INSERT INTO statement; I am guessing it is a typo and you meant column_name3.

using Python to search a csv file and extract needed information

I have a large PC inventory in csv file format. I would like to write some code that will help me find needed information. Specifically, I would like to type in the name, or part of the name, of a user (user names are located in the 5th column of the file) and have the code give me the name of that computer (computer names are located in the second column of the file). My code doesn't work and I don't know what the problem is. Thank you for your help, I appreciate it!
import csv  # import csv library

# open PC Inventory file
info = csv.reader(open('Creedmoor PC Inventory.csv', 'rb'), delimiter=',')
key_index = 4  # Names are in column 5 (array index is 4)
user = raw_input("Please enter employee's name:")
rows = enumerate(info)
for row in rows:
    if row == user:  # name is in the PC Inventory
        print row  # show the computer name
You've got three problems here.
First, since rows = enumerate(info), each row in rows is going to be a tuple of the row number and the actual row.
Second, the actual row itself is a sequence of columns.
So, if you want to compare user to the fifth column of an (index, row) tuple, you need to do this:
if row[1][key_index] == user:
Or, more clearly:
for index, row in rows:
    if row[key_index] == user:
        print row[1]
Or, if you don't actually have any need for the row number, just don't use enumerate:
for row in info:
    if row[key_index] == user:
        print row[1]
But that just gets you to your third problem: You want to be able to search for the name or a part of the name. So, you need the in operator:
for row in info:
    if user in row[key_index]:
        print row[1]
It would be clearer to read the whole thing into a searchable data structure:
inventory = { row[key_index]: row for row in info }
Then you don't need a for loop to search for the user; you can just do this:
print inventory[user][1]
Unfortunately, however, that won't work for doing substring searches. You need a more complex data structure. A trie, or any sorted/bisectable structure, would work if you only need prefix searches; if you need arbitrary substring searches, you need something fancier, and that's probably not worth doing.
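For the prefix-search case specifically, a sorted list of keys plus the bisect module would be enough; a small sketch building on the inventory dict above:
import bisect

names = sorted(inventory)  # sorted user names (the dict keys)
i = bisect.bisect_left(names, user)
while i < len(names) and names[i].startswith(user):
    print inventory[names[i]][1]  # computer name for each matching user
    i += 1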
You could consider using a database for that. For example, with a SQL database (like sqlite3), you can do this:
cur = db.execute('SELECT Computer FROM Inventory WHERE Name LIKE ?', ('%' + name + '%',))
Importing a CSV file and writing a database isn't too hard, and if you're going to be running a whole lot of searches against a single CSV file it might be worth it. (Also, if you're currently editing the file by opening the CSV in Excel or LibreOffice, modifying it, and re-exporting it, you can instead just attach an Excel/LO spreadsheet to the database for editing.) Otherwise, it will just make things more complicated for no reason.
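A rough sketch of that route (the table layout and column positions are assumptions based on the question):
import csv
import sqlite3

db = sqlite3.connect(':memory:')
db.execute('CREATE TABLE Inventory (Computer TEXT, Name TEXT)')

with open('Creedmoor PC Inventory.csv', 'rb') as f:
    for row in csv.reader(f):
        # computer name is column 2 (index 1); user name is column 5 (index 4)
        db.execute('INSERT INTO Inventory (Computer, Name) VALUES (?, ?)',
                   (row[1], row[4]))

cur = db.execute('SELECT Computer FROM Inventory WHERE Name LIKE ?',
                 ('%' + user + '%',))
print cur.fetchall()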
enumerate returns an iterator of index, element pairs. You don't really need it. Also, you forgot to use key_index:
for row in info:
    if row[key_index] == user:
        print row
It's hard to tell what's wrong without knowing what your file looks like, but I'm pretty sure the error is here:
for row in info:
    if row[key_index] == user:  # name is in the PC Inventory
        print row  # show the computer name
You did define the column index, but forgot to use it to get that column from each line you're comparing to the user, so in the end you're comparing a string with a list.
And you don't need enumerate; by default you iterate over the rows.

How to determine if field exists in same table in another field in any other row

I am having troubles finding out if I can even do this. Basically, I have a csv file that looks like the following:
1111,804442232,1
1112,312908721,1
1113,A*2434,1
1114,A*512343128760987,1
1115,3512748,1
1116,1111,1
1117,1234,1
This is imported into a sqlite database in memory for manipulation. I will be importing multiple files into this database after some manipulation. Sqlite is allowing me to keep constraints on the tables and receive errors where needed without creating additional functions just to check each constraint while using arrays in python. I want to do a few things but the first of which is to prepend field2 where all field2 strings match an entry in field1.
For example, in the above data field2 in entry 6 matches entry 1. In this case I would like to prepend field2 in entry 6 with '555'
If this is not possible I do believe I could make do using a regex and just do this on every row with 4 digits in field2... though... I have yet to successfully get REGEX working using python/sqlite as it always throws me an error.
I am working within Python using Sqlite3 to connect/manipulate my sqlite database.
EDIT: I am looking for a method to manipulate the resultant tables which reside in a sqlite database rather than manipulating just the csv data. The data above is just a simple representation of what is contained in the files I am working with. Would it be better to work with arrays containing the data from the csv files? These files have 10,000+ entries and about 20-30 columns.
If you must do it in SQLite, how about this:
First, get the column names of the table by running the following and parsing the result
import sqlite3

def get_columns(table_name, cursor):
    cursor.execute('pragma table_info(%s)' % table_name)
    return [row[1] for row in cursor]

conn = sqlite3.connect('test.db')
columns = get_columns('test_table', conn.cursor())
For each of those columns, run the following update, which does the prepending:
def prepend(table, column, reference, prefix, cursor):
    query = '''
        UPDATE %s
        SET %s = '%s' || %s
        WHERE %s IN (SELECT %s FROM %s)
    ''' % (table, column, prefix, column, column, reference, table)
    cursor.execute(query)

reference = 'field1'
[prepend('test_table', column, reference, '555', conn.cursor())
 for column in columns
 if column != reference]
Note that this is expensive: O(n^2) for each column you want to do it for.
As per your edit and Nathan's answer, it might be better to simply work with Python's built-in data structures. You can always insert the result into SQLite afterwards.
10,000 entries is not really much so it might not matter in the end. It all depends on your reason for requiring it to be done in SQLite (which we don't have much visibility of).
There is no need to use regular expressions to do this; just throw the contents of the first column into a set and then iterate through the rows, updating the second field.
first_col_values = set(row[0] for row in rows)
for row in rows:
    if row[1] in first_col_values:
        row[1] = '555' + row[1]
So... I found the answer to my own question after a ridiculous amount of my own searching and trial and error. My unfamiliarity with SQL had me stumped as I was trying all kinds of crazy things. In the end... this was the simple type of solution I was looking for:
prefix = "555"
cur.execute("UPDATE table SET field2 = '%s' || field2 WHERE field2 IN (SELECT field1 FROM table)" % prefix)
I kept the small amount of python in there but what I was looking for was the SQL statement. Not sure why nobody else came up with something that simple =/. Unsatisfied with the answers so far, I had been searching far and wide for this simple line >_<.
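For the record, the same statement with the prefix bound as a parameter (a slightly safer variant) would be:
cur.execute("UPDATE table SET field2 = ? || field2 "
            "WHERE field2 IN (SELECT field1 FROM table)", (prefix,))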
