How can I select part of sqlite database using python - python

I have a very big database and I want to send part of that database (1/1000) to someone I am collaborating with to perform test runs. How can I (a) select 1/1000 of the total rows (or something similar) and (b) save the selection as a new .db file.
This is my current code, but I am stuck.
import sqlite3
import json
from pprint import pprint
conn = sqlite3.connect('C:/data/responses.db')
c = conn.cursor()
c.execute("SELECT * FROM responses;")

Create a another database with similar table structure as the original db. Sample records from original database and insert into new data base
import sqlite3
conn = sqlite3.connect("responses.db")
sample_conn = sqlite3.connect("responses_sample.db")
c = conn.cursor()
c_sample = sample_conn.cursor()
rows = c.execute("select no, nm from responses")
sample_rows = [r for i, r in enumerate(rows) if i%10 == 0] # select 1/1000 rows
# create sample table with similar structure
c_sample.execute("create table responses(no int, nm varchar(100))")
for r in sample_rows:
c_sample.execute("insert into responses (no, nm) values ({}, '{}')".format(*r))
c_sample.close()
sample_conn.commit()
sample_conn.close()

Simplest way to do this would be:
Copy the database file in your filesystem same as you would any other file (e.g. ctrl+c then ctrl+v in windows to make responses-partial.db or something)
Then open this new copy in an sqlite editor such as http://sqlitebrowser.org/ run the delete query to remove however many rows you want to. Then you might want to run compact database from file menu.
Close sqlite editor and confirm file size is smaller
Email the copy
Unless you need to create a repeatable system I wouldn't bother with doing this in python. But you could perform similar steps in python (copy the file, open it it run delete query, etc) if you need to.

The easiest way to do this is to
make a copy of the database file;
delete 999/1000th of the data, either by keeping the first few rows:
DELETE FROM responses WHERE SomeID > 1000;
or, if you want really random samples:
DELETE FROM responses
WHERE rowid NOT IN (SELECT rowid
FROM responses
ORDER BY random()
LIMIT (SELECT count(*)/1000 FROM responses));
run VACUUM to reduce the file size.

Related

SQLITE multiple files

Sorry about this unprofessional question but I'm kinda new to sqlite but I was wondering if there's any way I can open two files in same python command db = sqlite3.connect('./cogs/database/users.sqlite')
when I open this in my command it doesn't allow me to do same thing in the same command to open another file so for example
open db = sqlite3.connect('./cogs/database/users.sqlite') and read something from it if so
open db = sqlite3.connect('./cogs/database/anotherfile.sqlite') and insert to it
but it always accepts first file only and ignore second file
Assign db1 so it connects to users.sqlite,
and db2 so it connects to anotherfile.sqlite.
Then you can e.g. SELECT from one
and INSERT into the other,
with a temp var bridging the two.
Sqlite databases are single file based, so no - sqlite3.connect builds a connection object to a single data base file.
Even if you build two connection objects, you can't execute queries across them.
If you really need the data from two files at a time you need to merge that data into one data - or don't use sqlite.
You can execute queries across two SQLite files, but you will need to execute an ATTACH command on the first connection cursor.
conn = sqlite3.connect("users.sqlite")
cur = conn.cursor()
cmd = "ATTACH DATABASE 'anotherfile.sqlite' AS otra"
try:
cur.execute(cmd)
query = """
SELECT
t1.Id, t1.Name, t2.Address
FROM personnel t1
LEFT JOIN otra.location t2
ON t2.PersonId = t1.Id
WHERE t1.Status = 'current'
ORDER BY t1.Name;
"""
cur.execute(query)
rows = cur.fetchall()
except sqlite3.Error as err:
do_something_with(err)

SQLite Database: One Big vs. Several Small? Write to Database in Parallel?

As I'm new to sqlite databases, I highly appreciate every useful comment, answer or reference to interesting threads and websites. Here's my situation:
I have a directory with 400 txt files each with the size of ~7GB. The relevant information in these files are written into a sqlite database resulting in a 17.000.000x4 table, which takes approximately 1 day. Later on the database will be queried only by me to further analyze the data.
The whole process of creating the database could be significantly accelerated, if it is possible to write to a database in parallel. For instance, I could run several processes in parallel, each process taking only one of the 400 txt files as input and writing the results to the database. So is it possible to let several processes write to a database in parallel?
EDIT1: Answer w.r.t. W4t3randWinds comment: It is possible (and faster) to process 1 file per core, write the results into a database and merge all databases after that. However, write into 1 database using multi threading is not possible.
Furthermore, I was wondering whether it would be more efficient to create several databases instead of one big database? For instance, does it make sense to create a database per txt file resulting in 400 databases consisting of a 17.000.000/400 x 4 table?
At last, I'm storing the database as a file on my machine. However, I also read about the possibility to set up a server. So when does it make sense to use a server and more specifically, would it make sense to use a server in my case?
Please see below my code for the creation of the database.
### SET UP
# set up database
db = sqlite3.connect("mydatabase.db")
cur = db.cursor()
cur.execute("CREATE TABLE t (sentence, ngram, word, probability);")
# set up variable to store db rows
to_db = []
# set input directory
indir = '~/data/'
### PARSE FILES
# loop through filenames in indir
for filename in os.listdir(indir):
if filename.endswith(".txt"):
filename = os.path.join(indir, filename)
# open txt files in dir
with io.open(filename, mode = 'r', encoding = 'utf-8') as mytxt:
### EXTRACT RELEVANT INFORMATION
# for every line in txt file
for i, line in enumerate(mytxt):
# strip linebreak
line = line.strip()
# read line where the sentence is stated
if i == 0 or i % 9 == 0:
sentence = line
ngram = " ".join(line.split(" ")[:-1])
word = line.split(" ")[-1]
# read line where the result is stated
if (i-4) == 0 or (i-4) % 9 == 0:
result = line.split(r'= ')[1].split(r' [')[0]
# make a tuple representing a new row of db
db_row = (sentence, ngram, word, result)
to_db.append(db_row)
### WRITE TO DATABASE
# add new row to db
cur.executemany("INSERT INTO t (sentence, ngram, word, results) VALUES (?, ?, ?, ?);", to_db)
db.commit()
db.close()
The whole process of creating the database could be significantly accelerated, if it is possible to write to a database in parallel
I am not sure of that. You only have little processing, so the whole process is likely to be io bound. SQLite is a very nice tool, but it only support one single thread to write into it.
Possible improvements:
use x threads to read and process the text file, a single one to write to the database in large chunks and a queue. As the process is IO bound, the Python Global Interprocess Lock should not be a problem
use a full featured database like PostgreSQL or MariaDB on a separate machine and multiple processes on the client machine each processing its own set of input files
In either case, I am unsure of the benefit...
I do daily updates to an SQLite database using python mutlithreading. It works beautifully. Two different tables have nearly 20,000,000 records one with 8 fields the other with 10. This is on my laptop which is 4 years old.
If you are having performance issues I recommend looking into how your tables are constructed (a proper primary key and indexes) and your equipment. If you are still using an HDD you will gain amazing performance by upgrading to an SSD.

Database not consistent when accessed from different source files

I have a program in Python composed of 4 source files. One of them is the main file which imports the other 3. As I work with a small Sqlite database, I am creating tables in one of the "secondary" source files, but when I access the database again from the main source file, the tables just populated before are empty.
Can I save the tables' content in a more consistent way? I am quite surprised with what is happening.
So in the main file I typed:
conn = sqlite3.connect("bayes.db")
cur = conn.cursor()
cur.execute("select count(*) from TableA")
print cur.fetchone()
The result is 0 (rows).
Just before, in another source file I do the same thing and get size=8 of TableA.
You must call the commit function in order to save your changes in the database. You can see the full documentation here: http://docs.python.org/2/library/sqlite3.html#sqlite3.Connection.commit

pysqlite, save database, and open it later

Newbie to sql and sqlite.
I'm trying to save a database, then copy the file.db to another folder and open it. So far I created the database, copy and pasted the file.db to another folder but when I try to access the database the output says that it is empty.
So far I have
from pysqlite2 import dbapi2 as sqlite
conn = sqlite.connect('db1Thu_04_Aug_2011_14_20_15.db')
c = conn.cursor()
print c.fetchall()
and the output is
[]
You need something like
c.execute("SELECT * FROM mytable")
for row in c:
#process row
I will echo Mat and point out that is not valid syntax. More than that, you do not include any select request (or other sql command) in your example. If you actually do not have a select statement in your code, and you run fetchall on a newly created cursor, you can expect to get an empty list, which seems to be what you have.
Finally, do make sure that you are opening the file from the right directory. If you tell sqlite to open a nonexistent file, it will happily create a new, empty one for you.

Insert multiple tab-delimited text files into MySQL with Python?

I am trying to create a program that takes a number of tab delaminated text files, and works through them one at a time entering the data they hold into a MySQL database. There are several text files, like movies.txt which looks like this:
1 Avatar
3 Iron Man
3 Star Trek
and actors.txt that looks the same etc. Each text file has upwards of one hundred entries each with an id and corresponding value as seen above. I have found a number of code examples on this site and others but I can't quite get my head around how to implement them in this situation.
So far my code looks something like this ...
import MySQLdb
database_connection = MySQLdb.connect(host='localhost', user='root', passwd='')
cursor = database_connection.cursor()
cursor.execute('CREATE DATABASE library')
cursor.execute('USE library')
cursor.execute('''CREATE TABLE popularity (
PersonNumber INT,
Category VARCHAR(25),
Value VARCHAR(60),
)
''')
def data_entry(categories):
Everytime i try to get the other code I have found working with this I just get lost completely. Hopeing someone can help me out by either showing me what I need to do or pointing me in the direction of some more information.
Examples of the code I have been trying to adapt to my situation are:
import MySQLdb, csv, sys
conn = MySQLdb.connect (host = "localhost",user = "usr", passwd = "pass",db = "databasename")
c = conn.cursor()
csv_data=csv.reader(file("a.txt"))
for row in csv_data:
print row
c.execute("INSERT INTO a (first, last) VALUES (%s, %s), row")
c.commit()
c.close()
and:
Python File Read + Write
MySQL can read TSV files directly using the mysqlimport utility or by executing the LOAD DATA INFILE SQL command. This will be faster than processing the file in python and inserting it, but you may want to learn how to do both. Good luck!

Categories

Resources