I have (what I would consider) a massive set of plain text files, around 400GB, that are being imported into a MySQL database (InnoDB engine). The .txt files range from 2GB to 26GB in size, and each file represents a table in the database. I was given a Python script which parses the .txt files and builds SQL statements. I have a machine specifically dedicated to this task with the following specs:
OS - Windows 10
32GB RAM
4TB hard drive
i7 3.40 GHz processor
I want to optimize this import to be as quick as possible. I've changed the following settings in the MySQL my.ini file based on Stack Overflow questions, the MySQL docs, and other sources:
max_allowed_packet=1073741824;
autocommit=0;
net_buffer_length=0;
foreign_key_checks=0;
unique_checks=0;
innodb_buffer_pool_size=8G; (this made a big difference in speed when I increased from the default of 128M)
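For what it's worth, autocommit, unique_checks and foreign_key_checks can also be set just for the importing session rather than server-wide. A minimal sketch of that (assuming MySQLdb, which the script shown later uses, and placeholder connection parameters):

import MySQLdb

# Hedged sketch: disable checks for this session only, run the bulk INSERTs,
# then restore the checks and commit once at the end.
conn = MySQLdb.connect(host="localhost", user="importer", passwd="secret", db="mydb")
cur = conn.cursor()
cur.execute("SET autocommit = 0")
cur.execute("SET unique_checks = 0")
cur.execute("SET foreign_key_checks = 0")

# ... run the bulk INSERT statements here ...

cur.execute("SET unique_checks = 1")
cur.execute("SET foreign_key_checks = 1")
conn.commit()
conn.close()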
Are there other settings in the config file that I missed (maybe around logging or caching) that would direct MySQL to use a significant portion of the machine's resources? Could there be another bottleneck I'm missing?
(Side note: not sure if this is related - when I start the import, the mysqld process spins up to use about 13-15% of the system's memory, but then never seems to purge it when I stop the Python script from continuing the import. I'm wondering if this is a result of messing with the logging and flush settings. Thanks in advance for any help.)
(EDIT)
Here is the relevant part of the Python script that populates the tables. It appears the script is connecting, committing and closing the connection for every 50,000 records. Could I remove the conn.commit() at the end of the function and let MySQL handle the committing? The comments below the while (True) are from the authors of the script, and I've adjusted that number so that it won't exceed the max_allowed_packet size.
conn = self.connect()

while (True):
    #By default, we concatenate 200 inserts into a single INSERT statement.
    #a large batch size per insert improves performance, until you start hitting max_packet_size issues.
    #If you increase MySQL server's max_packet_size, you may get increased performance by increasing maxNum
    records = self.parser.nextRecords(maxNum=50000)
    if (not records):
        break

    escapedRecords = self._escapeRecords(records) #This will sanitize the records
    stringList = ["(%s)" % (", ".join(aRecord)) for aRecord in escapedRecords]

    cur = conn.cursor()
    colVals = unicode(", ".join(stringList), 'utf-8')
    exStr = exStrTemplate % (commandString, ignoreString, tableName, colNamesStr, colVals)
    #unquote NULLs
    exStr = exStr.replace("'NULL'", "NULL")
    exStr = exStr.replace("'null'", "NULL")

    try:
        cur.execute(exStr)
    except MySQLdb.Warning, e:
        LOGGER.warning(str(e))
    except MySQLdb.IntegrityError, e:
        #This is likely a primary key constraint violation; should only be hit if skipKeyViolators is False
        LOGGER.error("Error %d: %s", e.args[0], e.args[1])

    self.lastRecordIngested = self.parser.latestRecordNum
    recCheck = self._checkProgress()
    if recCheck:
        LOGGER.info("...at record %i...", recCheck)

conn.commit()
conn.close()
As I'm new to SQLite databases, I'd appreciate any useful comment, answer, or reference to interesting threads and websites. Here's my situation:
I have a directory with 400 txt files, each around 7GB in size. The relevant information in these files is written into a SQLite database, resulting in a 17,000,000 x 4 table, which takes approximately one day. Later on, the database will be queried only by me to further analyze the data.
The whole process of creating the database could be significantly accelerated if it is possible to write to the database in parallel. For instance, I could run several processes in parallel, each taking only one of the 400 txt files as input and writing the results to the database. So is it possible to let several processes write to a database in parallel?
EDIT 1: Answer w.r.t. W4t3randWinds' comment: it is possible (and faster) to process one file per core, write the results into separate databases and merge those databases afterwards. However, writing into one database using multithreading is not possible.
Furthermore, I was wondering whether it would be more efficient to create several databases instead of one big database. For instance, does it make sense to create one database per txt file, resulting in 400 databases each holding a 17,000,000/400 x 4 table?
Finally, I'm storing the database as a file on my machine. However, I also read about the possibility of setting up a server. So when does it make sense to use a server, and more specifically, would it make sense to use a server in my case?
Please see below my code for the creation of the database.
import io
import os
import sqlite3

### SET UP
# set up database
db = sqlite3.connect("mydatabase.db")
cur = db.cursor()
cur.execute("CREATE TABLE t (sentence, ngram, word, probability);")

# set up variable to store db rows
to_db = []

# set input directory (expand ~ so os.listdir can find it)
indir = os.path.expanduser('~/data/')

### PARSE FILES
# loop through filenames in indir
for filename in os.listdir(indir):
    if filename.endswith(".txt"):
        filename = os.path.join(indir, filename)

        # open txt files in dir
        with io.open(filename, mode='r', encoding='utf-8') as mytxt:

            ### EXTRACT RELEVANT INFORMATION
            # for every line in txt file
            for i, line in enumerate(mytxt):
                # strip linebreak
                line = line.strip()

                # read line where the sentence is stated
                if i == 0 or i % 9 == 0:
                    sentence = line
                    ngram = " ".join(line.split(" ")[:-1])
                    word = line.split(" ")[-1]

                # read line where the result is stated
                if (i - 4) == 0 or (i - 4) % 9 == 0:
                    result = line.split(r'= ')[1].split(r' [')[0]

                    # make a tuple representing a new row of db
                    db_row = (sentence, ngram, word, result)
                    to_db.append(db_row)

### WRITE TO DATABASE
# add new rows to db (column name matches the CREATE TABLE above)
cur.executemany("INSERT INTO t (sentence, ngram, word, probability) VALUES (?, ?, ?, ?);", to_db)
db.commit()
db.close()
The whole process of creating the database could be significantly accelerated if it is possible to write to the database in parallel
I am not sure about that. You are only doing a little processing per row, so the whole process is likely to be IO bound. SQLite is a very nice tool, but it only supports a single thread writing to it at a time.
Possible improvements:
use x threads to read and process the text files, plus a single thread that writes to the database in large chunks, fed through a queue (see the sketch below). As the process is IO bound, the Python Global Interpreter Lock (GIL) should not be a problem
use a full featured database like PostgreSQL or MariaDB on a separate machine and multiple processes on the client machine each processing its own set of input files
In either case, I am unsure of the benefit...
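If you go with the first option, here is a minimal sketch of the reader/writer split (assuming Python 3; parse_file(), file_paths and the four-column table t are placeholders for your own parsing code and schema, and the table is assumed to exist already):

import queue
import sqlite3
import threading

ROW_QUEUE = queue.Queue(maxsize=100)   # bounded, so readers cannot outrun the writer
STOP = object()                        # sentinel telling the writer to finish

def reader(path):
    batch = []
    for row in parse_file(path):       # hypothetical: yields (sentence, ngram, word, probability) tuples
        batch.append(row)
        if len(batch) >= 10000:
            ROW_QUEUE.put(batch)
            batch = []
    if batch:
        ROW_QUEUE.put(batch)

def writer(db_path):
    # the connection is created and used only inside this thread
    db = sqlite3.connect(db_path)
    cur = db.cursor()
    while True:
        batch = ROW_QUEUE.get()
        if batch is STOP:
            break
        cur.executemany("INSERT INTO t VALUES (?, ?, ?, ?)", batch)
        db.commit()
    db.close()

writer_thread = threading.Thread(target=writer, args=("mydatabase.db",))
writer_thread.start()
reader_threads = [threading.Thread(target=reader, args=(p,)) for p in file_paths]
for t in reader_threads:
    t.start()
for t in reader_threads:
    t.join()
ROW_QUEUE.put(STOP)                    # all readers are done; let the writer drain and exit
writer_thread.join()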
I do daily updates to a SQLite database using Python multithreading. It works beautifully. Two different tables have nearly 20,000,000 records, one with 8 fields and the other with 10. This is on my laptop, which is 4 years old.
If you are having performance issues, I recommend looking into how your tables are constructed (a proper primary key and indexes) and your hardware. If you are still using an HDD, you will gain amazing performance by upgrading to an SSD.
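As a rough illustration of what "a proper primary key and indexes" could look like for the table in the question (the exact schema and index choice here are assumptions, not a prescription):

import sqlite3

db = sqlite3.connect("mydatabase.db")
cur = db.cursor()
cur.execute("""CREATE TABLE IF NOT EXISTS t (
                   id INTEGER PRIMARY KEY,   -- alias for the rowid, cheap to maintain
                   sentence TEXT,
                   ngram TEXT,
                   word TEXT,
                   probability REAL
               )""")
cur.execute("CREATE INDEX IF NOT EXISTS idx_t_word ON t (word)")  # speeds up later queries filtering on word
db.commit()
db.close()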
Say I have only 1GB of memory and 1TB of hard disk space.
This is my code, and I am using a Postgres database.
import psycopg2

try:
    db = psycopg2.connect("database parameters")
    conn = db.cursor()
    conn.execute(query)
    # At this point, I am running
    for row in conn:
For this case, I guess it is safe to assume that conn behaves like a generator, as I cannot seem to find a definitive answer online and I cannot try it in my environment, because I cannot afford to crash the system.
I am expecting this query to return data in excess of 100GB.
I am using Python 2.7 and the psycopg2 library.
If you use an anonymous cursor, which you are doing in your example, then the entire query result will be read into memory.
If you use a named cursor then it will read from the server in chunks as it loops over the data.
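A rough sketch of the named-cursor variant of the code in the question ("database parameters", query and process() are placeholders carried over from the question or assumed):

import psycopg2

db = psycopg2.connect("database parameters")
conn = db.cursor(name="big_query")   # giving the cursor a name makes it server-side
conn.itersize = 10000                # rows pulled from the server per network round trip
conn.execute(query)
for row in conn:                     # streams the result instead of loading ~100GB at once
    process(row)                     # hypothetical per-row handling
conn.close()
db.close()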
I have a script built using procedural programming that uses a sqlite database file. The script processes a CSV file, then uses a standard cursor to pass its particulars to a single SQLite DB.
After this, the script extracts from the DB to produce a number of spreadsheets in Excel via xlwt.
The problem with this is that the script only handles one input file at a time, whereas I will need to be iterating through about 70-90 of these files on any given day.
I've been trying to rewrite the script as object-oriented, but I'm having trouble with sharing the cursor.
The original input file comes in a zip archive that I would extract via the Linux or Mac OS X command line. Previously this was done manually; now I've managed to write classes and loop through an arbitrary number of input files via the multiple-file version of tkFileDialog.
Furthermore, the original input file (i.e. one of the 70-90) is a text file in CSV format with a DRF extension (obviously not really important) that gets picked by a simple tkFileDialog box:
FILENAMER = tkFileDialog.askopenfilename(title="Open file",
                                         filetypes=[("txt file", ".DRF"),
                                                    ("txt file", ".txt"),
                                                    ("All files", ".*")])
The DRF file itself is issued daily by location and date, i.e. 'BHP0123.DRF' is for the BHP location, issued for 23 January. To keep everything as straightforward as possible, the procedural script further decomposes the DRF filename to just the BHP0123 part or prefix, then uses it to build a SQLite DB:
FBASENAME = os.path.basename(FILENAMER)
FBROOT = os.path.splitext(FBASENAME)[0]
OUTPUTDATABASE = 'sqlite_' + FBROOT + '.db'
Basically with the program as a procedural script I just had to create one DB, one connection and one cursor, which could be shared by all the functions in the script:
conn = sqlite3.connect(OUTPUTDATABASE) # <-- originally :: Is this the core problem?
curs = conn.cursor()
conn.text_factory = sqlite3.OptimizedUnicode
In the procedural version these variables above are global.
Procedurally I have
1) one function to handle formatting, and
2) another to handle the calculations needed. The DRF is indexed with about 2500 fields per row; I discard the majority and only use about 400-500 of these per row.
The formatting function parses out the CSV via a for-loop (discards junk characters, incomplete data, etc), then passes the formatted data for the calculator to process and chew on. The core problem seems to be that on the one hand I need the DB connection to be constant for each input DRF file, but on the other that connection can only be shared by the formatter and calculator, and 'regenerated' for each DRF.
Crucially, I've tried to rewrite as little of the formatter and calculator as possible.
I've tried to create a separate dbcxn class, then create an instance to share, but I'm confused as to how to handle the output DB situation with the cursor (and pass it intact to both formatter and calculator):
class DBcxn(object):
    def __init__(self, OUTPUTDATABASE):
        OUTPUTDATABASE = ?????
        self.OUTPUTDATABASE = OUTPUTDATABASE

    def init_db_cxn(self, OUTPUTDATABASE):
        conn = sqlite3.connect(OUTPUTDATABASE) # < ????
        self.conn = conn
        curs = conn.cursor()
        self.curs = curs
        conn.text_factory = sqlite3.OptimizedUnicode

dbtest = DBcxn( ???? )
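Roughly, the shape I think I'm aiming for is something like this (a sketch only; Formatter, Calculator and drf_paths are hypothetical stand-ins for my real formatter/calculator code and the file list from tkFileDialog):

import os
import sqlite3

class DBcxn(object):
    """Owns the connection and cursor for one output database file."""
    def __init__(self, output_database):
        self.conn = sqlite3.connect(output_database)
        self.conn.text_factory = sqlite3.OptimizedUnicode
        self.curs = self.conn.cursor()

    def close(self):
        self.conn.commit()
        self.conn.close()

for drf_path in drf_paths:
    prefix = os.path.splitext(os.path.basename(drf_path))[0]
    cxn = DBcxn('sqlite_' + prefix + '.db')
    Formatter(cxn.curs).run(drf_path)   # both workers receive the same cursor
    Calculator(cxn.curs).run()
    cxn.close()                         # fresh DB and connection for the next DRF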
If anyone might suggest a way of untangling this I'd be very grateful. Please let me know if you need more information.
Cheers
Massimo Savino
I'm new to this forum, although I've already used several tips from posts published here.
I have a program that reads the GPS position and other data from another device. After this, I do some calculations and store the data in a database and a text file. My problem is that after a while it stops storing the data, but the code keeps running. Any idea why?
My code is a little long, but basically it creates two serial port objects so I can read from two different devices.
Then, in an infinite loop, I do the following:
open database
manipulate and analyze data
store data in txt file
store data in mysql
close connection
All of this is wrapped in a try/except so I can save any exception to another text file.
The code I use to store data in a text file is:
with open("datos.txt", "a") as mydata:
mydata.write("(" + str(hora)+","+str(latitud)+","+str(longitud)+",#"+str(color)+","+str(latitudReal)+","+str(longitudReal)+","+str(libre)+","+espectro+")" + ", " + cc1 + "\n")
mydata.flush()
mydata.close()
The code to store in the mysql database is:
try:
    sql2 = "INSERT INTO pichincha101(hora, latitud, longitud, color, latitudReal, longitudReal, porcentaje, espectro) VALUES ('"+str(hora)+"','"+str(latitud)+"','"+str(longitud)+"','#"+str(color)+"','"+str(latitudReal)+"','"+str(longitudReal)+"','"+str(libre)+"','"+espectro+"')"
    cur.execute(sql2)
    db.commit()
except Exception as e:
    print "error INSERT mysql"
    print e
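For reference, the same INSERT can also be written with query placeholders instead of string concatenation; a sketch, assuming the same cur, db and value variables as above:

# Hedged sketch: the driver does the quoting/escaping, which avoids problems with
# values containing quotes or special characters.
sql2 = ("INSERT INTO pichincha101 "
        "(hora, latitud, longitud, color, latitudReal, longitudReal, porcentaje, espectro) "
        "VALUES (%s, %s, %s, %s, %s, %s, %s, %s)")
cur.execute(sql2, (str(hora), str(latitud), str(longitud), "#" + str(color),
                   str(latitudReal), str(longitudReal), str(libre), espectro))
db.commit()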
I'm also deleting some arrays and clearing the RAM cache before doing it again, with this code:
del res1
del res2
del ampp
del amp1
os.system("echo 3 > /proc/sys/vm/drop_caches")
It's the same problem even if I don't include the last code section I just published.
Thanks in advance for your help.
Perhaps you have a timeout in your MySQL settings (SQL_TIMEOUT) or your server settings? For example, when I work with Apache, I have to adjust the max_execution_time variable. Cheers!
The solution was emptying the buffers from my serial devices. Hope it helps someone else.
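Concretely, that means something like this with pyserial (a sketch; ser1 and ser2 stand for the two serial connections mentioned above):

# Hedged sketch: clear any bytes still sitting in the serial buffers on each pass
# through the loop, so stale data cannot pile up.
ser1.reset_input_buffer()    # pyserial >= 3.0; use flushInput() on older versions
ser2.reset_input_buffer()
ser1.reset_output_buffer()   # optional: also drop anything not yet written out
ser2.reset_output_buffer()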
I have a small Python 2.7 script that uses LIKE statements to extract pieces of information from textual data stored in a SQLite database.
sql = "SELECT user_id, loc,\
FROM entity\
WHERE loc LIKE '%\"place\":%'\
AND loc LIKE '%\"geo\":%'\
AND loc LIKE '%\"coordinates\":%'"
cin.execute(sql)
entities = cin.fetchall()
cin is a cursor to a SQLite database (table entity with >10^6 rows, ~1.5 GB) which was established using
import sqlite3

try:
    dbin = sqlite3.connect(database=args['dbi'].name)
    dbin.row_factory = sqlite3.Row
    cin = dbin.cursor()
except sqlite3.Error, e:
    errorLogger.error('... %s' % e)
    sys.exit()
The script ran fine with database sizes of 10^2 MB, but now I get a
Traceback (most recent call last):
  File "C:\Users\...\migrate.py", line 247, in <module>
    entities = cin.fetchall()
MemoryError
after a few seconds. I am running a Windows 7 64-bit machine with 8GB RAM. While the script is running, the Windows resource monitor shows that all free memory is successively used up, and python.exe consumes as much as 1.9GB just before the program crashes. Still, there is about 3GB of standby memory available (but don't ask me what the difference between standby and free memory is).
What can I do about this besides pre-filtering my query, e.g. by only looking at, let's say, 10,000 rows per query?
Calling fetchall requires allocating memory for all result records.
You should instead read the result records from cin one by one.
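A minimal sketch of that, either iterating the cursor directly or pulling fixed-size chunks with fetchmany (process_row() is a placeholder for whatever the script does with each entity):

# Option 1: iterate the cursor; sqlite3 then fetches rows lazily instead of
# materializing the whole result set the way fetchall() does.
cin.execute(sql)
for row in cin:
    process_row(row)

# Option 2: read fixed-size chunks.
cin.execute(sql)
while True:
    rows = cin.fetchmany(10000)
    if not rows:
        break
    for row in rows:
        process_row(row)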