Inserting large amounts of data in sqlite - python

I am making an inverted-index lookup table for my database in sqlite3. The database I have consists of certain bloggers and their posts.
I have a table Post which has the columns id, text, blogger_id. This table contains ~680 000 posts. I want to build a table Blogger_Post_Word with the columns blogger_id, post_id, word_position, word_id.
I am using Python for this and have tried two approaches, but both have problems.
I read online that the best way to insert large amounts of data is with a bulk insert. That means I have to fetch all the posts and, for each word in a post, store it locally so I can do one bulk insert later. This requires far more memory than I have.
I have also tried inserting the words one by one, but that just takes far too long.
Is there an efficient way to solve this problem, or an SQL statement that does this in one go?
Edit:
This is the code I'm using now:
from functools import lru_cache

@lru_cache()
def get_word_id(word: str) -> int:
    word_w_id = db.get_one('Word', ['word'], (word,))
    if word_w_id is None:
        db.insert_one('Word', ['word'], (word,))
        word_w_id = db.get_one('Word', ['word'], (word,))
    return word_w_id[0]

for post_id, text, creation_date, blogger_id in db.get_all('Post'):
    split_text = text.split(' ')
    for word_position, word in enumerate(split_text):
        word_id = get_word_id(word)
        db.insert_one('Blogger_Post_Word',
                      ['blogger_id', 'post_id', 'word_position', 'word_id'],
                      (blogger_id, post_id, word_position, word_id))
The db is a class I wrote to handle the database, these are the functions in that class I use:
def get(self, table: str, where_cols: list = None, where_vals: tuple = None):
    query = 'SELECT * FROM ' + table
    if where_cols is not None and where_vals is not None:
        where_cols = [w + '=?' for w in where_cols]
        query += ' WHERE ' + ' and '.join(where_cols)
        return self.c.execute(query, where_vals)
    return self.c.execute(query)

def get_one(self, table: str, where_cols: list = None, where_vals: tuple = None):
    self.get(table, where_cols, where_vals)
    return self.c.fetchone()

def insert_one(self, table: str, columns: list, values: tuple):
    query = self.to_insert_query(table, columns)
    self.c.execute(query, values)
    self.conn.commit()

def to_insert_query(self, table: str, columns: list):
    return 'INSERT INTO ' + table + ' (' + ','.join(columns) + ') VALUES (' + ','.join('?' for _ in columns) + ')'

Okay, I hope this helps anyone.
The problem was indeed that inserting one row at a time is too slow, and I didn't have enough memory to store the whole list locally.
Instead I used a hybrid of the two and inserted the rows into the database incrementally.
I printed the size of my list to find the limit: roughly 150 000 of the 680 000 posts was about all my memory could hold, with the list taking about 4.5 GB.
from pympler.asizeof import asizeof
print(asizeof(indexed_data))
>>> 4590991936
I decided on an increment of 50 000 posts to keep everything running smoothly.
This is now my code:
# get all posts
c.execute('SELECT * FROM Post')
all_posts = c.fetchall()

increment = 50000
start = 0
end = increment
while start < len(all_posts):
    indexed_data = []
    print(start, ' -> ', end)
    for post_id, text, creation_date, blogger_id in all_posts[start:end]:
        split_text = text.split(' ')
        # for each word in the post, add a tuple with blogger id, post id,
        # word position in the post and the word itself
        indexed_data.extend([(blogger_id, post_id, word_position, word)
                             for word_position, word in enumerate(split_text)])
    print('saving...')
    c.executemany('''
        INSERT INTO Inverted_index (blogger_id, post_id, word_position, word)
        VALUES (?, ?, ?, ?)
    ''', indexed_data)
    conn.commit()  # commit each batch so finished work is persisted
    start += increment
    if end + increment > len(all_posts):
        end = len(all_posts)
    else:
        end += increment

Related

Python: Bulk upload records of Oracle table with executemany() in python

data = []
dataToInsert = []
for index, row in df.iterrows():
    contentid = row['CONTENTID']
    Objectsummary = row['OBJECT_SUMMARY']
    Title = row['TITLE']
    if Title is None:
        Title = ""
    if Objectsummary is None:
        Objectsummary = ""
    allSummeries = Title + ' ' + Objectsummary
    lists = function_togetNounsAndVerbs(allSummeries)
    verbList = lists[0]
    nounList = lists[1]
    NounSet = set(nounList)
    VerbSet = set(verbList)
    verbs = " ".join(VerbSet)
    nouns = " ".join(NounSet)
    verbs = re.sub(r" ", ", ", verbs)
    nouns = re.sub(r" ", ", ", nouns)
    # Here we are going to create the data set to be updated in the database table in batch form.
    data.append(nouns)
    data.append(verbs)
    data.append('PROCESSED')
    data.append(contentid)
    dataToInsert.append([data[0], data[1], data[2], data[3]])

print("ALL DATA TO BE UPDATED IN TABLE IS :---> ", dataToInsert)
statement = """UPDATE test_batch_update_python SET NOUNS = ?, Verbs = ? where CONTENTID = ?"""
a = cursor.executemany(statement, dataToInsert)
connection.commit()
In the above code, function_togetNounsAndVerbs(allSummeries) is the function that returns the lists.
I am getting the following exception:
a = cursor.executemany(statement, dataToInsert)
cx_Oracle.DatabaseError: ORA-01036: illegal variable name/number
Please help me with this.
Or what other ways are there to do this? Initially I updated a single row at a time using cursor.execute(), but it was very time-consuming. To minimize the time I am using bulk upload (i.e. cursor.executemany()).
Here's a related example that works. The table is created as:
DROP table test_batch_update_python;
CREATE TABLE test_batch_update_python (contentid NUMBER, nouns VARCHAR2(20), verbs VARCHAR2(20));
INSERT INTO test_batch_update_python (contentid) VALUES (1);
INSERT INTO test_batch_update_python (contentid) VALUES (2);
COMMIT;
cursor = connection.cursor()
dataToInsert = [
    ('shilpa', 'really fast', 1),
    ('venkat', 'also really fast', 2),
]
print("ALL DATA TO BE UPDATED IN TABLE IS :---> ", dataToInsert)
connection.autocommit = True
statement = """UPDATE test_batch_update_python SET nouns=:1, verbs=:2 WHERE contentid=:3"""
cursor.setinputsizes(20, 20, None)
cursor.executemany(statement, dataToInsert)
The output is:
ALL DATA TO BE UPDATED IN TABLE IS :---> [('shilpa', 'really fast', 1), ('venkat', 'also really fast', 2)]
And then querying the data gives:
SQL> select * from test_batch_update_python;
CONTENTID NOUNS VERBS
---------- -------------------- --------------------
1 shilpa really fast
2 venkat also really fast
Check this link for insert and update statement examples:
https://blogs.oracle.com/oraclemagazine/perform-basic-crud-operations-with-cx-oracle-part-3
Thanks.
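For what it's worth, the ORA-01036 in the question is consistent with a bind mismatch: each row in dataToInsert carries four items (nouns, verbs, 'PROCESSED', contentid) while the UPDATE statement has only three placeholders, and cx_Oracle expects :1-style bind markers rather than ?. (Note also that data is never reset inside the loop, so every appended row repeats the first row's values.) The same class of error can be reproduced with sqlite3, shown here purely as an illustration:

```python
import sqlite3

conn = sqlite3.connect(':memory:')
conn.execute('CREATE TABLE t (contentid INTEGER, nouns TEXT, verbs TEXT)')
conn.execute('INSERT INTO t (contentid) VALUES (1)')

# 3 items per row, 3 placeholders: accepted
good_rows = [('dog, cat', 'run, jump', 1)]
conn.executemany('UPDATE t SET nouns = ?, verbs = ? WHERE contentid = ?', good_rows)

# 4 items per row, 3 placeholders: rejected, like ORA-01036 in cx_Oracle
bad_rows = [('dog, cat', 'run, jump', 'PROCESSED', 1)]
try:
    conn.executemany('UPDATE t SET nouns = ?, verbs = ? WHERE contentid = ?', bad_rows)
except sqlite3.ProgrammingError as e:
    print('mismatch rejected:', e)
```

Dropping the 'PROCESSED' item (or adding a fourth column to the statement) makes the row shape match the placeholder count.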

Running chatbot on system

import re
import sqlite3
from collections import Counter
from string import punctuation
from math import sqrt

# initialize the connection to the database
connection = sqlite3.connect('chatbot.sqlite')
cursor = connection.cursor()

# create the tables needed by the program
create_table_request_list = [
    'CREATE TABLE words(word TEXT UNIQUE)',
    'CREATE TABLE sentences(sentence TEXT UNIQUE, used INT NOT NULL DEFAULT 0)',
    'CREATE TABLE associations (word_id INT NOT NULL, sentence_id INT NOT NULL, weight REAL NOT NULL)',
]
for create_table_request in create_table_request_list:
    try:
        cursor.execute(create_table_request)
    except:
        pass

def get_id(entityName, text):
    """Retrieve an entity's unique ID from the database, given its associated text.
    If the row is not already present, it is inserted.
    The entity can either be a sentence or a word."""
    tableName = entityName + 's'
    columnName = entityName
    cursor.execute('SELECT rowid FROM ' + tableName + ' WHERE ' + columnName + ' = ?', (text,))
    row = cursor.fetchone()
    if row:
        return row[0]
    else:
        cursor.execute('INSERT INTO ' + tableName + ' (' + columnName + ') VALUES (?)', (text,))
        return cursor.lastrowid

def get_words(text):
    """Retrieve the words present in a given string of text.
    The return value is a list of tuples where the first member is a lowercase word,
    and the second member the number of times it is present in the text."""
    wordsRegexpString = r'(?:\w+|[' + re.escape(punctuation) + ']+)'
    wordsRegexp = re.compile(wordsRegexpString)
    wordsList = wordsRegexp.findall(text.lower())
    return Counter(wordsList).items()

B = 'Hello!'
while True:
    # output bot's message
    print('B: ' + B)
    # ask for user input; if blank line, exit the loop
    H = raw_input('H: ').strip()
    if H == '':
        break
    # store the association between the bot's message words and the user's response
    words = get_words(B)
    words_length = sum([n * len(word) for word, n in words])
    sentence_id = get_id('sentence', H)
    for word, n in words:
        word_id = get_id('word', word)
        weight = sqrt(n / float(words_length))
        cursor.execute('INSERT INTO associations VALUES (?, ?, ?)', (word_id, sentence_id, weight))
    connection.commit()
    # retrieve the most likely answer from the database
    cursor.execute('CREATE TEMPORARY TABLE results(sentence_id INT, sentence TEXT, weight REAL)')
    words = get_words(H)
    words_length = sum([n * len(word) for word, n in words])
    for word, n in words:
        weight = sqrt(n / float(words_length))
        cursor.execute('INSERT INTO results SELECT associations.sentence_id, sentences.sentence, ?*associations.weight/(4+sentences.used) FROM words INNER JOIN associations ON associations.word_id=words.rowid INNER JOIN sentences ON sentences.rowid=associations.sentence_id WHERE words.word=?', (weight, word,))
    # if matches were found, give the best one
    cursor.execute('SELECT sentence_id, sentence, SUM(weight) AS sum_weight FROM results GROUP BY sentence_id ORDER BY sum_weight DESC LIMIT 1')
    row = cursor.fetchone()
    cursor.execute('DROP TABLE results')
    # otherwise, just randomly pick one of the least used sentences
    if row is None:
        cursor.execute('SELECT rowid, sentence FROM sentences WHERE used = (SELECT MIN(used) FROM sentences) ORDER BY RANDOM() LIMIT 1')
        row = cursor.fetchone()
    # tell the database the sentence has been used once more, and prepare the sentence
    B = row[1]
    cursor.execute('UPDATE sentences SET used=used+1 WHERE rowid=?', (row[0],))
This is code written for creating a chatbot. When I try running it from cmd with the command python chatbot.py, it returns an error saying invalid syntax.
Is there any way I can remove this error and run this code on my system?
It gives the error: File "chatbot.py", line 1, SyntaxError: invalid syntax
What version of Python are you running, and in what environment? I ran this code on Python 3.7.0b4 under Windows and it worked fine except for this line:
H = raw_input('H: ').strip()
Which you have to change to:
H = input('H: ').strip()
This is probably unrelated directly to your issue, but the code you posted did run fine for me in my environment after I made that one change (and of course installed any libraries or modules needed).

Counting Organizations by using Python and Sqlite

This application will read the mailbox data (mbox.txt), count up the number of email messages per organization (i.e. the domain name of the email address), and use a database with the following schema to maintain the counts.
CREATE TABLE Counts (org TEXT, count INTEGER)
When you have run the program on mbox.txt upload the resulting database file above for grading.
If you run the program multiple times in testing or with different files, make sure to empty out the data before each run.
You can use this code as a starting point for your application: http://www.pythonlearn.com/code/emaildb.py. The data file for this application is the same as in previous assignments: http://www.pythonlearn.com/code/mbox.txt.
This is my first time learning SQLite. I am very confused about this assignment, although it seems like it should be easy. I don't know how to connect Python code to SQLite. It seems they don't need the code for the assignment; all they need is the database file. How should I solve this problem? I don't know how to start. Much appreciated!
The starting code you've been given is a really good template for what you want to do. The difference is that in that example you're counting occurrences of email addresses, and in this problem you're counting domains.
First thing to do is think about how to get domain names from email addresses. Building from the code given (which sets email = pieces[1]):
domain = email.split('@')[1]
This will break the email on the '@' character and return the second item (the part after the '@'), which is the domain, the thing you want to count.
After this, go through the SQL statements in the code and replace 'email' with 'domain', so that you're counting the right thing.
One last thing: the template code reads 'mbox-short.txt', so you'll need to edit that as well for the file you want.
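To make the splitting step concrete, here is a quick check with a sample "From: " line in the same format as mbox.txt (the address itself is only illustrative):

```python
# A typical "From: " header line from an mbox file
line = 'From: stephen.marquard@uct.ac.za Sat Jan  5 09:14:16 2008'

pieces = line.split()         # split on whitespace
email = pieces[1]             # 'stephen.marquard@uct.ac.za'
domain = email.split('@')[1]  # everything after the '@'
print(domain)                 # uct.ac.za
```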
import sqlite3

conn = sqlite3.connect('emaildb2.sqlite')
cur = conn.cursor()

cur.execute('''DROP TABLE IF EXISTS Counts''')
cur.execute('''CREATE TABLE Counts (org TEXT, count INTEGER)''')

fname = input('Enter file name: ')
if len(fname) < 1:
    fname = 'mbox.txt'
fh = open(fname)

for line in fh:
    if not line.startswith('From: '):
        continue
    pieces = line.split()
    email = pieces[1]
    dom = email.find('@')
    org = email[dom + 1:]
    cur.execute('SELECT count FROM Counts WHERE org = ? ', (org,))
    row = cur.fetchone()
    if row is None:
        cur.execute('''INSERT INTO Counts (org, count)
            VALUES (?, 1)''', (org,))
    else:
        cur.execute('UPDATE Counts SET count = count + 1 WHERE org = ?',
                    (org,))
conn.commit()

# https://www.sqlite.org/lang_select.html
sqlstr = 'SELECT org, count FROM Counts ORDER BY count DESC LIMIT 10'
for row in cur.execute(sqlstr):
    print(str(row[0]), row[1])
cur.close()
I am still new here, but I want to thank Stidgeon for pointing me in the right direction. I suspect other Using Databases with Python students will end up here too.
There are two things you need to do with the source code (http://www.pythonlearn.com/code/emaildb.py):
Change the line that extracts the address so it takes the domain: domain = email.split('@')[1]
Change email TEXT to org TEXT when the database is generated.
That should get you on your way.
import sqlite3

conn = sqlite3.connect('emaildb.sqlite')
cur = conn.cursor()

cur.execute('DROP TABLE IF EXISTS Counts')
cur.execute('''CREATE TABLE Counts (org TEXT, count INTEGER)''')

fname = input('Enter file name: ')
if len(fname) < 1:
    fname = 'mbox-short.txt'
fh = open(fname)

for line in fh:
    if not line.startswith('From: '):
        continue
    pieces = line.split()
    org = pieces[1].split('@')
    cur.execute('SELECT count FROM Counts WHERE org = ? ', (org[1],))
    row = cur.fetchone()
    if row is None:
        cur.execute('''INSERT INTO Counts (org, count)
            VALUES (?, 1)''', (org[1],))
    else:
        cur.execute('UPDATE Counts SET count = count + 1 WHERE org = ?',
                    (org[1],))
conn.commit()

# https://www.sqlite.org/lang_select.html
sqlstr = 'SELECT org, count FROM Counts ORDER BY count DESC LIMIT 10'
for row in cur.execute(sqlstr):
    print(str(row[0]), row[1])
cur.close()
print('-----------------done----------------')
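Both versions above issue a SELECT plus an INSERT or UPDATE per matching line. An alternative sketch (not from either answer; the function name is illustrative): tally the domains in memory with collections.Counter first, then write all the counts in a single executemany.

```python
import sqlite3
from collections import Counter

def count_orgs(lines, db_path=':memory:'):
    """Count messages per domain from an mbox-style iterable of lines."""
    counts = Counter()
    for line in lines:
        if line.startswith('From: '):
            email = line.split()[1]
            counts[email.split('@')[1]] += 1  # tally the domain

    conn = sqlite3.connect(db_path)
    cur = conn.cursor()
    cur.execute('DROP TABLE IF EXISTS Counts')
    cur.execute('CREATE TABLE Counts (org TEXT, count INTEGER)')
    # one bulk insert instead of a SELECT/UPDATE round-trip per line
    cur.executemany('INSERT INTO Counts (org, count) VALUES (?, ?)', counts.items())
    conn.commit()
    return conn
```

This trades a little memory (one counter entry per distinct domain, which is small) for far fewer SQL statements.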

Python: Search database using SQL

I am new to programming and working on a homework assignment. I am trying to search a database by comparing a user's search term with matching values from a selected column. If a user searches "Smith" and clicks the "Smith" radio button in my GUI, all the records with "Smith" as their author should appear. I am able to print all the records in the database, but not the records that match my search.
db = None
colNum = None

def search_db(self):
    global db
    global colNum
    self.searchTerm = self.searchvalue.get()
    dbname = 'books.db'
    if os.path.exists(dbname):
        db = sqlite3.connect(dbname)
        cursor = db.cursor()
        sql = 'SELECT * FROM BOOKS'
        cursor.execute(sql)
        rows = cursor.fetchall()
        for record in rows:
            sql_search = 'SELECT * FROM BOOKS WHERE' + ' ' + record[colNum] + ' ' + 'LIKE "%' + ' ' + self.searchTerm + '%"'
            cursor.execute(sql_search)
            searched_rows = cursor.fetchall()
            print(searched_rows)
The error I'm receiving is "sqlite3.OperationalError: no such column:"
There isn't enough information in your question to be sure, but this certainly is fishy:
sql_search = 'SELECT * FROM BOOKS WHERE' + ' ' + record[colNum] + ' ' + 'LIKE "%' + ' ' + self.searchTerm + '%"'
That record[colNum] is the value in a row for your column, not the name of the column. For example, if the column you wanted is Title, you're going to treat every title of every book as if it were a column name.
So, you end up running queries like this:
SELECT * FROM BOOKS WHERE The Meaning of Life: The Script like %Spam%
Even if that were valid SQL (quoted properly), The Meaning of Life: The Script is probably not a column in the BOOKS table.
Meanwhile, SELECT * returns the columns in an arbitrary order, so using colNum isn't really guaranteed to do anything useful. But, if you really want to do what you're trying to do, I think it's this:
sql = 'SELECT * FROM BOOKS'
cursor.execute(sql)
colName = cursor.description[colNum][0]
sql_search = 'SELECT * FROM BOOKS WHERE ' + colName + ' LIKE "%' + self.searchTerm + '%"'
However, you really shouldn't be wanting to do that…
You need to get the column name from the fields of the table, or from somewhere else. Your query uses record[colNum] but record contains rows of data. Instead, to get the field names, use something like this:
fields = []
for field in cursor.description:
fields.append(field[0])
When you use rows = cursor.fetchall(), you are only getting data (and not the column headers).
It looks to me like you are just not forming the SQL correctly. Remember that whatever you put after the LIKE clause needs to be quoted. Your SQL needs to look like
SELECT * FROM BOOKS WHERE Title like '%Spam%'
So you need another set of single quotes in there so that's why I would use double quotes to surround your Python string:
sql_search = "SELECT * FROM BOOKS WHERE " + record[colNum] + " LIKE '%" + self.searchTerm + "%'"
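One further note, not in the answers above: the search term itself is best passed as a bound parameter, which sidesteps both the quoting problem and SQL injection. Only the column name must be interpolated, because placeholders cannot stand in for identifiers. A minimal sketch with an illustrative BOOKS table:

```python
import sqlite3

conn = sqlite3.connect(':memory:')
conn.execute('CREATE TABLE BOOKS (Title TEXT, Author TEXT)')
conn.executemany('INSERT INTO BOOKS VALUES (?, ?)',
                 [('Spam and Eggs', 'Smith'), ('Parrot Sketches', 'Jones')])

col_name = 'Author'     # must come from a trusted list, never from user input
search_term = 'Smith'

# '?' binds the value safely; the '%' wildcards go in the bound value itself
sql = 'SELECT * FROM BOOKS WHERE "{}" LIKE ?'.format(col_name)
rows = conn.execute(sql, ('%' + search_term + '%',)).fetchall()
print(rows)  # [('Spam and Eggs', 'Smith')]
```

Restricting col_name to a whitelist of known column names keeps the one remaining piece of string interpolation safe.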

Python script returns old MySQL values (until restart) [duplicate]

This question already has answers here:
Why are some mysql connections selecting old data the mysql database after a delete + insert?
(2 answers)
Closed 7 years ago.
I have the following code that I run once, and then again when a new value is inserted, to refresh.
def getStageAdditives(self, stage):
    stagesAdditivesSelectQuery = """SELECT a.id,
                                           a.name,
                                           IFNULL(sa.dose, 0) as dose,
                                           IFNULL(sa.last_dose, 0) as last
                                    FROM additives a
                                    LEFT JOIN stage_additives sa
                                           ON a.id = sa.additive_id
                                          AND sa.stage_id = (
                                              SELECT id
                                              FROM stages
                                              WHERE name = '""" + stage + """')
                                    ORDER BY a.name"""
    self.cursor.execute(stagesAdditivesSelectQuery)
    data = self.cursor.fetchall()
    additives = []
    for additive in data:
        id = additive[0]
        name = additive[1]
        dose = additive[2]
        last = additive[3]
        additives.append({'title': name, 'dose': dose, 'last': last})
    print stagesAdditivesSelectQuery
    return additives
The issue is that after I use the following code to insert a value into the 'additives' table, I get old values (the new value is missing).
def createAdditive(self, name):
    additiveInsertQuery = """INSERT INTO additives
                             SET name = '""" + name + """'"""
    try:
        self.cursor.execute(additiveInsertQuery)
        self.db.commit()
        return "True"
    except:
        self.db.rollback()
        return "False"
I can confirm that the values are being inserted into the database by looking at phpMyAdmin. If I restart the script I get the new values as expected. Running the query in phpMyAdmin also returns the new values. Refreshing the page and waiting 10+ seconds doesn't help; I still get the old values.
Both methods are in separate classes/files, if it matters. getStageAdditives is called with ajax after the createAdditive method has returned successfully.
DB initialisation:
import MySQLdb
import time

class Stage:
    def __init__(self):
        self.db = MySQLdb.connect('192.168.0.100', 'user', 'pass', 'dbname')
        self.cursor = self.db.cursor()
Another method that retrieves similar values gets the new values as expected (same class as createAdditives):
def getAdditives(self, additive=None):
    where = ''
    if additive is not None:
        where = "WHERE pa.additive_id = ("
        where += "SELECT id FROM additives "
        where += "WHERE name = '" + additive + "') "
        where += "AND a.name = '" + additive + "'"
    additiveSelectQuery = """SELECT a.name,
                                    pa.pump_id
                             FROM additives a,
                                  pump_additives pa """ + where + """
                             ORDER BY a.name"""
    self.cursor.execute(additiveSelectQuery)
    data = self.cursor.fetchall()
    additives = []
    for item in data:
        additives.append({'additive': item[0], 'pump': item[1]})
    return additives
For those who find this question: a similar question was solved in Why are some mysql connections selecting old data the mysql database after a delete + insert?
The trick is to add a connection.commit() on the reading connection as well. Under InnoDB's default REPEATABLE READ isolation level, a long-lived transaction keeps returning the snapshot it started with until it is committed or rolled back, which is why the reading script kept seeing stale rows.
