I created a study to optimize a model with Optuna, which produced a .db file with the same name as the study_name.
The problem is that I'm trying to load the results by using:
study = optuna.create_study(study_name=study_name,
                            storage=f"sqlite:///{results_folder}/results.db",
                            directions=["maximize", "maximize"],
                            load_if_exists=True)
but I renamed the original .db file, and I can't remember its original name (i.e., the original study_name value).
From what I understand, I can rename the file and use the new file name in the "storage" argument when loading the study, but I have to use the original value of study_name. If I don't, I get messages saying:
[W 2022-08-31 14:55:26,962] Study instance does not contain completed trials.
[W 2022-08-31 14:55:26,964] Your study does not have any completed trials.
Is there any way I can get it back from the .db file?
Yep! The db file is queryable. Study names are stored in a table called "studies", so you can use your database interface of choice to query that table. However, if you're storing multiple study results in this one db file, you'll need to work out which of the study names you find is actually the one you're after.
Here's a quick example of how to do it in Python, probably not the best code but still:
import sqlite3

# connect to the renamed .db file
con = sqlite3.connect("/path/to/your/file/results.db")
cur = con.cursor()

# every study stored in the file shows up in the "studies" table
res = cur.execute("SELECT * FROM studies")
print(res.fetchall())
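Once you've found the right name in that output, loading the existing study is straightforward; a minimal sketch (the study name and path are placeholders for whatever you recover):
import optuna

# use the study_name recovered from the "studies" table
study = optuna.load_study(study_name="recovered_study_name",
                          storage="sqlite:///path/to/your/file/results.db")
print(len(study.trials))  # your completed trials should now be visible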
Related
As I'm new to SQLite databases, I'd highly appreciate any useful comment, answer, or reference to interesting threads and websites. Here's my situation:
I have a directory with 400 txt files, each ~7 GB in size. The relevant information in these files is written into a SQLite database, resulting in a 17,000,000 x 4 table, which takes approximately 1 day. Later on, the database will be queried only by me to further analyze the data.
The whole process of creating the database could be significantly accelerated if it were possible to write to a database in parallel. For instance, I could run several processes in parallel, each process taking only one of the 400 txt files as input and writing the results to the database. So is it possible to let several processes write to a database in parallel?
EDIT1: Answer w.r.t. W4t3randWinds' comment: It is possible (and faster) to process one file per core, write the results into separate databases and merge all databases after that. However, writing into one database using multithreading is not possible.
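For reference, the merge step I have in mind looks roughly like this (just a sketch; it assumes the main database and every per-core database already contain the table t that my code below creates):
import sqlite3

# merge several per-core databases into the main one
main = sqlite3.connect("mydatabase.db")
for part in ["part_001.db", "part_002.db"]:  # hypothetical per-core database files
    main.execute("ATTACH DATABASE ? AS part", (part,))
    main.execute("INSERT INTO t SELECT * FROM part.t")
    main.commit()
    main.execute("DETACH DATABASE part")
main.close()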
Furthermore, I was wondering whether it would be more efficient to create several databases instead of one big database. For instance, does it make sense to create one database per txt file, resulting in 400 databases, each holding a 17,000,000/400 x 4 table?
Lastly, I'm storing the database as a file on my machine. However, I also read about the possibility of setting up a server. So when does it make sense to use a server, and more specifically, would it make sense to use a server in my case?
Please see below my code for the creation of the database.
### SET UP
import io
import os
import sqlite3

# set up database
db = sqlite3.connect("mydatabase.db")
cur = db.cursor()
cur.execute("CREATE TABLE t (sentence, ngram, word, probability);")

# set up variable to store db rows
to_db = []

# set input directory
indir = os.path.expanduser('~/data/')

### PARSE FILES
# loop through filenames in indir
for filename in os.listdir(indir):
    if filename.endswith(".txt"):
        filename = os.path.join(indir, filename)
        # open txt files in dir
        with io.open(filename, mode='r', encoding='utf-8') as mytxt:

            ### EXTRACT RELEVANT INFORMATION
            # for every line in txt file
            for i, line in enumerate(mytxt):
                # strip linebreak
                line = line.strip()
                # read line where the sentence is stated
                if i == 0 or i % 9 == 0:
                    sentence = line
                    ngram = " ".join(line.split(" ")[:-1])
                    word = line.split(" ")[-1]
                # read line where the result is stated
                if (i - 4) == 0 or (i - 4) % 9 == 0:
                    result = line.split(r'= ')[1].split(r' [')[0]
                    # make a tuple representing a new row of db
                    db_row = (sentence, ngram, word, result)
                    to_db.append(db_row)

### WRITE TO DATABASE
# add new rows to db
cur.executemany("INSERT INTO t (sentence, ngram, word, probability) VALUES (?, ?, ?, ?);", to_db)
db.commit()
db.close()
The whole process of creating the database could be significantly accelerated if it were possible to write to a database in parallel
I am not sure about that. You do only a little processing, so the whole process is likely to be I/O bound. SQLite is a very nice tool, but it only supports a single thread writing into it.
Possible improvements:
use several threads to read and process the text files, a single one to write to the database in large chunks, and a queue in between. As the process is I/O bound, the Python Global Interpreter Lock should not be a problem (see the sketch after this list)
use a full-featured database like PostgreSQL or MariaDB on a separate machine, and multiple processes on the client machine, each processing its own set of input files
In either case, I am unsure of the benefit...
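A minimal sketch of the first option (the chunk size and file paths are made up, and it assumes the table t from the question already exists in mydatabase.db):
import glob
import os
import queue
import sqlite3
import threading

work_q = queue.Queue(maxsize=1000)   # parsed rows flow from the readers to the single writer

def parse_file(path):
    # placeholder for the parsing loop from the question:
    # it should yield (sentence, ngram, word, probability) tuples
    yield from ()

def reader(path):
    # one of several threads: parse a text file and queue its rows
    for row in parse_file(path):
        work_q.put(row)

def writer(db_path, chunk_size=10000):
    # the only thread that touches SQLite; inserts in large chunks
    db = sqlite3.connect(db_path)
    chunk = []
    while True:
        row = work_q.get()
        if row is None:                      # sentinel: all readers are finished
            break
        chunk.append(row)
        if len(chunk) >= chunk_size:
            db.executemany("INSERT INTO t VALUES (?, ?, ?, ?)", chunk)
            db.commit()
            chunk = []
    if chunk:
        db.executemany("INSERT INTO t VALUES (?, ?, ?, ?)", chunk)
        db.commit()
    db.close()

files = glob.glob(os.path.expanduser("~/data/*.txt"))
w = threading.Thread(target=writer, args=("mydatabase.db",))
w.start()
readers = [threading.Thread(target=reader, args=(f,)) for f in files]
for t in readers:
    t.start()
for t in readers:
    t.join()
work_q.put(None)   # tell the writer there is no more work
w.join()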
I do daily updates to an SQLite database using Python multithreading. It works beautifully. Two different tables have nearly 20,000,000 records, one with 8 fields and the other with 10. This is on my laptop, which is 4 years old.
If you are having performance issues, I recommend looking into how your tables are constructed (a proper primary key and indexes) and your equipment. If you are still using an HDD, you will gain amazing performance by upgrading to an SSD.
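For the question's table t, that could look something like the sketch below (the column types and the choice of indexed column are just assumptions about the data and the queries you run):
import sqlite3

db = sqlite3.connect("mydatabase.db")
# give the table an explicit primary key and column types instead of untyped columns
db.execute("""
    CREATE TABLE IF NOT EXISTS t (
        id INTEGER PRIMARY KEY,
        sentence TEXT,
        ngram TEXT,
        word TEXT,
        probability REAL
    )
""")
# index the column(s) you filter on most when analyzing the data
db.execute("CREATE INDEX IF NOT EXISTS idx_t_word ON t (word)")
db.commit()
db.close()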
I am very new to Python and am trying to move data from a MongoDB collection into a CSV document (needs to be done with Python, not mongoexport, if possible).
I am using the PyMongo and csv packages to bring the data from the database into the CSV.
This is the structure of the data I am querying from MongoDB:
Primary identifier - Computer Name (parent): R2D2
Details - Computer Details (parent): Operating System (Child), Owner (Child)
I need Operating System and Owner to have their own columns in the CSV sheet, but they keep falling under a single column called Computer Name.
Is there a way around this, so that the child objects can have their own columns instead of being grouped under their parent object?
I've done this kind of thing many times. You will have a loop iterating your mongo query with pymongo. At the bottom end of the loop you will use the csv module to write each line of the csv file. Your question relates to what goes on in the middle of the loop.
MongoDB's syntax is closest to JavaScript, which also fits with the 'associative array' objects in Mongo being JSON. PyMongo brings these things closer to Python types.
So the parent objects will come straight from the PyMongo iterable and, if they have children, will be Python dictionaries:
import csv
from pymongo import MongoClient

client = MongoClient(ipaddress, port)
db = client.MyDatabase
collection = db.mycollection

f = open('output_file.csv', 'wb')
csvwriter = csv.writer(f)

for doc in collection.find():
    computer_name = doc.get('Computer Name')
    operating_system = computer_name.get('Operating System')
    owner = computer_name.get('Owner')
    csvwriter.writerow([operating_system, owner])

f.close()
Of course there's probably a bunch of other columns in there without children that also come direct from the doc.
I'm trying to populate a SQLite database using Django with data from a file that consists of 6 million records. However, the code that I've written is taking far too long, even with 50,000 records.
This is the code with which I'm trying to populate the database:
import os

def populate():
    with open("filename") as f:
        for line in f:
            col = line.strip().split("|")
            duns = col[1]
            name = col[8]
            job = col[12]
            dun_add = add_c_duns(duns)
            add_contact(c_duns=dun_add, fn=name, job=job)

def add_contact(c_duns, fn, job):
    c = Contact.objects.get_or_create(duns=c_duns, fullName=fn, title=job)
    return c

def add_c_duns(duns):
    cd = Contact_DUNS.objects.get_or_create(duns=duns)[0]
    return cd

if __name__ == '__main__':
    print "Populating Contact db...."
    os.environ.setdefault("DJANGO_SETTINGS_MODULE", "settings")
    from web.models import Contact, Contact_DUNS
    populate()
    print "Done!!"
The code works fine, since I have tested it with dummy records and it gives the desired results. I would like to know if there is a way to lower the execution time of this code. Thanks.
I don't have enough reputation to comment, but here's a speculative answer.
Basically, the only way to do this through Django's ORM is to use bulk_create. So the first thing to consider is the use of get_or_create. If your database has existing records that might have duplicates in the input file, then your only choice is writing the SQL yourself. If you use it only to avoid duplicates inside the input file, then preprocess the file to remove duplicate rows.
So if you can live without the get part of get_or_create, then you can follow this strategy:
1) Go through each row of the input file and instantiate a Contact_DUNS instance for each entry (don't actually create the rows, just write Contact_DUNS(duns=duns)) and save all instances to an array. Pass the array to bulk_create to actually create the rows.
2) Generate a list of DUNS-id pairs with values_list and convert them to a dict with the DUNS number as the key and the row id as the value.
3) Repeat step 1 but with Contact instances. Before creating each instance, use the DUNS number to get the Contact_DUNS id from the dictionary of step 2. Then instantiate each Contact in the following way: Contact(duns_id=c_duns_id, fullName=fn, title=job). Again, after collecting the Contact instances, just pass them to bulk_create to create the rows.
This should radically improve performance, as you'll no longer be executing a query for each input line. But as I said above, this can only work if you can be certain that there are no duplicates in the database or the input file.
EDIT Here's the code:
import os

def populate_duns():
    # Will only work if there are no DUNS duplicates
    # (both in the DB and within the file)
    duns_instances = []
    with open("filename") as f:
        for line in f:
            duns = line.strip().split("|")[1]
            duns_instances.append(Contact_DUNS(duns=duns))
    # Run a single INSERT query for all DUNS instances
    # (actually it will be run in batches, but it's still quite fast)
    Contact_DUNS.objects.bulk_create(duns_instances)

def get_duns_dict():
    # This is basically a SELECT query for these two fields
    duns_id_pairs = Contact_DUNS.objects.values_list('duns', 'id')
    return dict(duns_id_pairs)

def populate_contacts():
    # Repeat the same process for Contacts
    contact_instances = []
    duns_dict = get_duns_dict()
    with open("filename") as f:
        for line in f:
            col = line.strip().split("|")
            duns = col[1]
            name = col[8]
            job = col[12]
            ci = Contact(duns_id=duns_dict[duns],
                         fullName=name,
                         title=job)
            contact_instances.append(ci)
    # Again, run only a single INSERT query
    Contact.objects.bulk_create(contact_instances)

if __name__ == '__main__':
    print "Populating Contact db...."
    os.environ.setdefault("DJANGO_SETTINGS_MODULE", "settings")
    from web.models import Contact, Contact_DUNS
    populate_duns()
    populate_contacts()
    print "Done!!"
CSV Import
First of all, 6 million records is quite a lot for SQLite, and worse still, SQLite isn't very good at importing CSV data directly.
There is no standard as to what a CSV file should look like, and the
SQLite shell does not even attempt to handle all the intricacies of
interpreting a CSV file. If you need to import a complex CSV file and
the SQLite shell doesn't handle it, you may want to try a different
front end, such as SQLite Database Browser.
On the other hand, MySQL and PostgreSQL are more capable of handling CSV data, and MySQL's LOAD DATA INFILE and PostgreSQL's COPY are both painless ways to import very large amounts of data in a very short period of time.
Suitability of SQLite
You are using Django => you are building a web app => more than one user will access the database. This is from the manual, about concurrency:
SQLite supports an unlimited number of simultaneous readers, but it
will only allow one writer at any instant in time. For many
situations, this is not a problem. Writers queue up. Each application
does its database work quickly and moves on, and no lock lasts for
more than a few dozen milliseconds. But there are some applications
that require more concurrency, and those applications may need to seek
a different solution.
Even your read operations are likely to be rather slow, because an SQLite database is just one single file. So with this amount of data there will be a lot of seek operations involved. The data cannot be spread across multiple files or even disks, as is possible with proper client-server databases.
The good news for you is that with Django you can usually switch from SQLite to MySQL or PostgreSQL just by changing your settings.py. No other changes are needed. (The reverse isn't always true.)
So I urge you to consider switching to MySQL or PostgreSQL before you get in too deep. It will help you solve your present problem and also help you avoid problems that you will run into sooner or later.
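For illustration, the settings.py change for PostgreSQL looks roughly like this (the database name, user and password are placeholders):
# settings.py (sketch): swap the SQLite backend for PostgreSQL
DATABASES = {
    'default': {
        'ENGINE': 'django.db.backends.postgresql_psycopg2',
        'NAME': 'mydb',            # placeholder database name
        'USER': 'myuser',          # placeholder credentials
        'PASSWORD': 'mypassword',
        'HOST': 'localhost',
        'PORT': '5432',
    }
}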
6,000,000 is quite a lot to import via Python. If Python is not a hard requirement, you could write a SQLite script that directly imports the CSV data and creates your tables using SQL statements. Even faster would be to preprocess your file using awk and output two CSV files corresponding to your two tables.
I used to import 20,000,000 records using the sqlite3 CSV importer and it took only a few minutes.
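For reference, that import looks roughly like this in the sqlite3 shell (the file and table names here are placeholders):
sqlite> CREATE TABLE contacts (duns TEXT, fullName TEXT, title TEXT);
sqlite> .mode csv
sqlite> .import contacts.csv contacts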
I have a script built using procedural programming that uses a sqlite database file. The script processes a CSV file, then uses a standard cursor to pass its particulars to a single SQLite DB.
After this, the script extracts from the DB to produce a number of spreadsheets in Excel via xlwt.
The problem with this is that the script only handles one input file at a time, whereas I will need to be iterating through about 70-90 of these files on any given day.
I've been trying to rewrite the script as object-oriented, but I'm having trouble with sharing the cursor.
The original input file comes in a zip archive that I would extract via Linux or Mac OS X command lines. Previously this was done manually; now I've managed to write classes and loop through totally ad-hoc numbers of multiple input files via the multi version of tkfiledialog.
Furthermore, the original input file (i.e. one of the 70-90) is a text file in CSV format with a DRF extension (obviously, not really important) that gets picked by a simple tkFileDialog box:
FILENAMER = tkFileDialog.askopenfilename(title="Open file",
                                         filetypes=[("txt file", ".DRF"),
                                                    ("txt file", ".txt"),
                                                    ("All files", ".*")])
The DRF file itself is issued daily by location and date, i.e. 'BHP0123.DRF' is for the BHP location, issued for 23 January. To keep everything as straightforward as possible, the procedural script further decomposes the DRF filename to just the BHP0123 part or prefix, then uses it to build a SQLite DB.
FBASENAME = os.path.basename(FILENAMER)
FBROOT = os.path.splitext(FBASENAME)[0]
OUTPUTDATABASE = 'sqlite_' + FBROOT + '.db'
Basically with the program as a procedural script I just had to create one DB, one connection and one cursor, which could be shared by all the functions in the script:
conn = sqlite3.connect(OUTPUTDATABASE) # <-- originally :: Is this the core problem?
curs = conn.cursor()
conn.text_factory = sqlite3.OptimizedUnicode
In the procedural version these variables above are global.
Procedurally I have
1) one function to handle formatting, and
2) another to handle the calculations needed. The DRF is indexed with about 2500 fields per row; I discard the majority and only use about 400-500 of these per row.
The formatting function parses out the CSV via a for-loop (discarding junk characters, incomplete data, etc.), then passes the formatted data on for the calculator to process and chew on. The core problem seems to be that, on the one hand, I need the DB connection to be constant for each input DRF file, but on the other, that connection has to be shared by the formatter and the calculator, and 'regenerated' for each DRF.
Crucially, I've tried to rewrite as little of the formatter and calculator as possible.
I've tried to create a separate dbcxn class, then create an instance to share, but I'm confused as to how to handle the output DB situation with the cursor (and pass it intact to both formatter and calculator):
class DBcxn(object):

    def __init__(self, OUTPUTDATABASE):
        OUTPUTDATABASE = ?????
        self.OUTPUTDATABASE = OUTPUTDATABASE

    def init_db_cxn(self, OUTPUTDATABASE):
        conn = sqlite3.connect(OUTPUTDATABASE)  # < ????
        self.conn = conn
        curs = conn.cursor()
        self.curs = curs
        conn.text_factory = sqlite3.OptimizedUnicode

dbtest = DBcxn( ???? )
If anyone might suggest a way of untangling this I'd be very grateful. Please let me know if you need more information.
Cheers
Massimo Savino
I have a program in Python composed of 4 source files. One of them is the main file, which imports the other 3. As I work with a small SQLite database, I am creating tables in one of the "secondary" source files, but when I access the database again from the main source file, the tables I just populated are empty.
Can I save the tables' content in a more consistent way? I am quite surprised by what is happening.
So in the main file I typed:
conn = sqlite3.connect("bayes.db")
cur = conn.cursor()
cur.execute("select count(*) from TableA")
print cur.fetchone()
The result is 0 (rows).
Just before that, in another source file, I do the same thing and get a count of 8 rows in TableA.
You must call the commit function in order to save your changes in the database. You can see the full documentation here: http://docs.python.org/2/library/sqlite3.html#sqlite3.Connection.commit
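A minimal sketch of what that looks like in the secondary source file (the table layout and values are placeholders):
import sqlite3

conn = sqlite3.connect("bayes.db")
cur = conn.cursor()
cur.execute("insert into TableA values (?, ?)", ("some", "row"))  # placeholder columns/values
conn.commit()   # without this, the pending changes are rolled back when the connection closes
conn.close()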