I have a small Python 2.7 script that uses SQL LIKE conditions to extract pieces of information from textual data stored in a SQLite database.
sql = "SELECT user_id, loc,\
FROM entity\
WHERE loc LIKE '%\"place\":%'\
AND loc LIKE '%\"geo\":%'\
AND loc LIKE '%\"coordinates\":%'"
cin.execute(sql)
entities = cin.fetchall()
cin is a cursor to a SQLite database (the entity table has more than 10^6 rows; the database file is ~1.5 GB), which was established using
import sqlite3
import sys

try:
    dbin = sqlite3.connect(database=args['dbi'].name)
    dbin.row_factory = sqlite3.Row
    cin = dbin.cursor()
except sqlite3.Error, e:
    errorLogger.error('... %s' % e)
    sys.exit()
The script ran fine with database sizes of 10^2 MB, but now I get a
Traceback (most recent call last):
  File "C:\Users\...\migrate.py", line 247, in <module>
    entities = cin.fetchall()
MemoryError
after some seconds. I am running a Windows 7 64-bit machine with 8 GB RAM. While the script is running, the Windows resource monitor shows that all free memory is gradually consumed and python.exe uses as much as 1.9 GB just before the program crashes. There are still about 3 GB of standby memory available (but don't ask me what the difference between standby and free memory is).
What can I do about this besides pre-filtering my query, e.g. by only looking at, let's say, 10'000 rows per query?
Calling fetchall() requires allocating memory for all result records at once.
You should instead read the result records from cin one by one.
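For example, something along these lines (only a sketch: process() is a hypothetical stand-in for your per-row logic, and the fetchmany() batch size is an arbitrary choice if you prefer batches over single rows):
cin.execute(sql)

# Option 1: iterate over the cursor; sqlite3 fetches rows lazily
for row in cin:
    process(row)  # process() is a placeholder for whatever you do per row

# Option 2: fetch in fixed-size batches to cap memory use
cin.execute(sql)  # re-run the query if the cursor was already consumed above
while True:
    rows = cin.fetchmany(10000)
    if not rows:
        break
    for row in rows:
        process(row)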
Related
I'm trying to read a huge PostgreSQL table (~3 million rows of jsonb data, ~30GB size) to do some ETL in Python. I use psycopg2 for working with the database. I want to execute a Python function for each row of the PostgreSQL table and save the results in a .csv file.
The problem is that I need to select the whole 30GB table, and the query runs for a very long time without any possibility to monitor progress. I have found out that there exists a cursor parameter called itersize which determines the number of rows to be buffered on the client.
So I have written the following code:
import psycopg2
conn = psycopg2.connect("host=... port=... dbname=... user=... password=...")
cur = conn.cursor()
cur.itersize = 1000
sql_statement = """
select * from <HUGE TABLE>
"""
cur.execute(sql_statement)
for row in cur:
    print(row)
cur.close()
conn.close()
Since the cursor buffers 1000 rows at a time on the client, I expect the following behavior:
1. The Python script buffers the first 1000 rows
2. We enter the for loop and print the buffered 1000 rows in the console
3. We reach the point where the next 1000 rows have to be buffered
4. The Python script buffers the next 1000 rows
5. GOTO 2
However, the code just hangs on the cur.execute() statement and no output is printed in the console. Why? Could you please explain what exactly is happening under the hood?
As I'm new to SQLite databases, I highly appreciate every useful comment, answer or reference to interesting threads and websites. Here's my situation:
I have a directory with 400 txt files, each ~7 GB in size. The relevant information in these files is written into a SQLite database, resulting in a table of 17,000,000 rows × 4 columns, which takes approximately 1 day. Later on, the database will be queried only by me to further analyze the data.
The whole process of creating the database could be significantly accelerated, if it is possible to write to a database in parallel. For instance, I could run several processes in parallel, each process taking only one of the 400 txt files as input and writing the results to the database. So is it possible to let several processes write to a database in parallel?
EDIT 1: Answer w.r.t. W4t3randWinds' comment: It is possible (and faster) to process 1 file per core, write the results into a database and merge all databases after that. However, writing into 1 database using multithreading is not possible.
Furthermore, I was wondering whether it would be more efficient to create several databases instead of one big database. For instance, does it make sense to create a database per txt file, resulting in 400 databases each consisting of a 17,000,000/400 × 4 table?
At last, I'm storing the database as a file on my machine. However, I also read about the possibility to set up a server. So when does it make sense to use a server and more specifically, would it make sense to use a server in my case?
Please see below my code for the creation of the database.
import io
import os
import sqlite3

### SET UP
# set up database
db = sqlite3.connect("mydatabase.db")
cur = db.cursor()
cur.execute("CREATE TABLE t (sentence, ngram, word, probability);")

# set up variable to store db rows
to_db = []

# set input directory (expanduser so that '~' resolves to the home directory)
indir = os.path.expanduser('~/data/')

### PARSE FILES
# loop through filenames in indir
for filename in os.listdir(indir):
    if filename.endswith(".txt"):
        filename = os.path.join(indir, filename)
        # open txt file in dir
        with io.open(filename, mode='r', encoding='utf-8') as mytxt:
            ### EXTRACT RELEVANT INFORMATION
            # for every line in txt file
            for i, line in enumerate(mytxt):
                # strip linebreak
                line = line.strip()
                # read line where the sentence is stated
                if i == 0 or i % 9 == 0:
                    sentence = line
                    ngram = " ".join(line.split(" ")[:-1])
                    word = line.split(" ")[-1]
                # read line where the result is stated
                if (i - 4) == 0 or (i - 4) % 9 == 0:
                    result = line.split(r'= ')[1].split(r' [')[0]
                    # make a tuple representing a new row of db
                    db_row = (sentence, ngram, word, result)
                    to_db.append(db_row)

### WRITE TO DATABASE
# add new rows to db (the column name matches the CREATE TABLE statement above)
cur.executemany("INSERT INTO t (sentence, ngram, word, probability) VALUES (?, ?, ?, ?);", to_db)
db.commit()
db.close()
The whole process of creating the database could be significantly accelerated, if it is possible to write to a database in parallel
I am not sure of that. You do only a little processing, so the whole process is likely to be IO bound. SQLite is a very nice tool, but it only supports a single writer at a time.
Possible improvements:
- use several threads to read and process the text files, a single writer thread that writes to the database in large chunks, and a queue in between (a sketch follows below). As the process is IO bound, the Python Global Interpreter Lock should not be a problem
- use a full-featured database like PostgreSQL or MariaDB on a separate machine, and multiple processes on the client machine, each processing its own set of input files
In either case, I am unsure of the benefit...
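For what it's worth, a minimal sketch of the first option, assuming Python 3; parse_file() is a hypothetical placeholder for the extraction logic from your question, the table t is assumed to exist already, and the batch/queue sizes are arbitrary:
import os
import queue
import sqlite3
import threading

NUM_READERS = 4
SENTINEL = None  # marker a reader puts on the queue when it is done

def reader(paths, out_queue):
    # parse_file() is a placeholder: it should return a list of
    # (sentence, ngram, word, probability) tuples for one txt file
    for path in paths:
        out_queue.put(parse_file(path))
    out_queue.put(SENTINEL)

def writer(db_path, in_queue):
    # the only thread that ever touches the SQLite database
    db = sqlite3.connect(db_path)
    cur = db.cursor()
    finished = 0
    while finished < NUM_READERS:
        chunk = in_queue.get()
        if chunk is SENTINEL:
            finished += 1
            continue
        cur.executemany("INSERT INTO t VALUES (?, ?, ?, ?);", chunk)
        db.commit()  # one commit per file keeps transactions large
    db.close()

indir = os.path.expanduser('~/data/')
files = [os.path.join(indir, f) for f in os.listdir(indir) if f.endswith(".txt")]

q = queue.Queue(maxsize=8)  # bounded, so readers cannot run far ahead of the writer
readers = [threading.Thread(target=reader, args=(files[i::NUM_READERS], q))
           for i in range(NUM_READERS)]
writer_thread = threading.Thread(target=writer, args=("mydatabase.db", q))

writer_thread.start()
for t in readers:
    t.start()
for t in readers:
    t.join()
writer_thread.join()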
I do daily updates to an SQLite database using Python multithreading. It works beautifully. Two different tables have nearly 20,000,000 records, one with 8 fields, the other with 10. This is on my laptop, which is 4 years old.
If you are having performance issues I recommend looking into how your tables are constructed (a proper primary key and indexes) and your equipment. If you are still using an HDD you will gain amazing performance by upgrading to an SSD.
Edit - I am using Windows 10
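For reference, this is roughly what that can look like for the table from the question (the column types and index choices here are assumptions on my part; index whatever columns your later analysis queries actually filter on, and for bulk loads it is usually faster to create the indexes after the data is inserted):
import sqlite3

db = sqlite3.connect("mydatabase.db")
cur = db.cursor()

# explicit types and an INTEGER PRIMARY KEY (SQLite then uses it as the rowid)
cur.execute("""
    CREATE TABLE IF NOT EXISTS t (
        id INTEGER PRIMARY KEY,
        sentence TEXT,
        ngram TEXT,
        word TEXT,
        probability REAL
    );
""")

# build indexes on the columns used in the WHERE clauses of the analysis queries
cur.execute("CREATE INDEX IF NOT EXISTS idx_t_word ON t (word);")
cur.execute("CREATE INDEX IF NOT EXISTS idx_t_ngram ON t (ngram);")

db.commit()
db.close()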
Is there a faster alternative to pd.read_sql_query for an MS SQL database?
I was using pandas to read the data and add some columns and calculations on the data. I have cut out most of the alterations now and I am basically just reading (1-2 million rows per day at a time; my query is to read all of the data from the previous date) the data and saving it to a local database (Postgres).
The server I am connecting to is across the world, and I have no privileges at all other than to query for the data. I want the solution to remain in Python if possible. I'd like to speed it up, though, and remove any overhead. Also, you can see that I am writing a file to disk temporarily and then opening it to COPY FROM STDIN. Is there a way to skip the file creation? It is sometimes over 500 MB, which seems like a waste.
from sqlalchemy import create_engine
import pandas as pd

engine = create_engine(engine_name)
query = 'SELECT * FROM {} WHERE row_date = %s;'
df = pd.read_sql_query(query.format(table_name), engine, params=[query_date])

# dump the result to a temporary CSV, then stream that file into Postgres
df.to_csv('../raw/temp_table.csv', index=False)
df = open('../raw/temp_table.csv')
process_file(conn=pg_engine, table_name=table_name, file_object=df)
UPDATE:
You can also try to unload the data using the bcp utility, which might be a lot faster compared to pd.read_sql(), but you will need a local installation of the Microsoft Command Line Utilities for SQL Server.
After that you can load the resulting file with PostgreSQL's COPY ... FROM ...
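Roughly along these lines (just a sketch: the bcp arguments, file paths, credentials, and table names are placeholders for your actual setup, and bcp is assumed to be on the PATH):
import subprocess
import psycopg2

# 1) export the day's rows from SQL Server to a flat file with bcp
#    (queryout runs an arbitrary query; -c = character mode, -t, = comma field terminator)
subprocess.check_call([
    "bcp",
    "SELECT * FROM some_table WHERE row_date = '2019-01-01'",  # placeholder query
    "queryout", r"C:\temp\extract.csv",
    "-c", "-t,",
    "-S", "remote_server", "-U", "user", "-P", "password",
])

# 2) bulk-load the file into Postgres with COPY
pg_conn = psycopg2.connect("host=localhost dbname=mydb user=me password=secret")
with pg_conn, pg_conn.cursor() as cur, open(r"C:\temp\extract.csv") as f:
    cur.copy_expert("COPY local_table FROM STDIN WITH (FORMAT csv)", f)
pg_conn.close()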
OLD answer:
You can try to write your DataFrame directly to PostgreSQL (skipping the df.to_csv(...) and df = open('../raw/temp_table.csv') parts):
from sqlalchemy import create_engine
engine = create_engine(engine_name)
query = 'SELECT * FROM {} WHERE row_date = %s;'
df = pd.read_sql_query(query.format(table_name), engine, params=[query_date])
pg_engine = create_engine('postgresql+psycopg2://user:password@host:port/dbname')
df.to_sql(table_name, pg_engine, if_exists='append')
Just test whether it's faster compared to COPY FROM STDIN...
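One more thing worth checking (this is an additional suggestion of mine, not part of the original approach): both pd.read_sql_query and DataFrame.to_sql accept a chunksize argument, so if memory is part of the overhead you can stream the day's data through in pieces instead of holding the full DataFrame; engine_name, table_name and query_date are the same variables as in your snippet:
import pandas as pd
from sqlalchemy import create_engine

src_engine = create_engine(engine_name)  # the MS SQL source from the question
pg_engine = create_engine('postgresql+psycopg2://user:password@host:port/dbname')

query = 'SELECT * FROM {} WHERE row_date = %s;'

# with chunksize, read_sql_query returns an iterator of DataFrames
for chunk in pd.read_sql_query(query.format(table_name), src_engine,
                               params=[query_date], chunksize=50000):
    chunk.to_sql(table_name, pg_engine, if_exists='append', index=False)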
I have (what I would consider) a massive set of plain text files, around 400GB, that are being imported into a MySQL database (InnoDB engine). The .txt files range from 2GB to 26GB in size, and each file represents a table in the database. I was given a Python script which parses the .txt files and builds SQL statements. I have a machine specifically dedicated to this task with the following specs:
OS - Windows 10
32GB RAM
4TB hard drive
i7 3.40 GHz processor
I want to optimize this import to be as quick and dirty as possible. I've changed the following config settings in the MySQL my.ini file based on Stack Overflow questions, the MySQL docs, and other sources:
max_allowed_packet=1073741824;
autocommit=0;
net_buffer_length=0;
foreign_key_checks=0;
unique_checks=0;
innodb_buffer_pool_size=8G; (this made a big difference in speed when I increased from the default of 128M)
Are there other settings in the config file that I missed (maybe around logging or caching) that would direct MySQL to use a significant portion of the machine's resources? Could there be another bottleneck I'm missing?
(Side note: not sure if this is related - when I start the import, the mysqld process spins up to use about 13-15% of the system's memory, but then never seems to purge it when I stop the Python script from continuing the import. I'm wondering if this is a result of messing with the logging and flush settings. Thanks in advance for any help.)
(EDIT)
Here is the relevant part of the Python script that populates the tables. It appears the script is connecting, committing and closing the connection for every 50,000 records. Could I remove the conn.commit() at the end of the function and let MySQL handle the committing? The comments below the while (True) are from the authors of the script, and I've adjusted that number so that it won't exceed the max_allowed_packet size.
conn = self.connect()

while (True):
    # By default, we concatenate 200 inserts into a single INSERT statement.
    # A large batch size per insert improves performance, until you start hitting max_packet_size issues.
    # If you increase MySQL server's max_packet_size, you may get increased performance by increasing maxNum.
    records = self.parser.nextRecords(maxNum=50000)
    if (not records):
        break

    escapedRecords = self._escapeRecords(records)  # This will sanitize the records
    stringList = ["(%s)" % (", ".join(aRecord)) for aRecord in escapedRecords]
    cur = conn.cursor()
    colVals = unicode(", ".join(stringList), 'utf-8')
    exStr = exStrTemplate % (commandString, ignoreString, tableName, colNamesStr, colVals)
    # unquote NULLs
    exStr = exStr.replace("'NULL'", "NULL")
    exStr = exStr.replace("'null'", "NULL")

    try:
        cur.execute(exStr)
    except MySQLdb.Warning, e:
        LOGGER.warning(str(e))
    except MySQLdb.IntegrityError, e:
        # This is likely a primary key constraint violation; should only be hit if skipKeyViolators is False
        LOGGER.error("Error %d: %s", e.args[0], e.args[1])

    self.lastRecordIngested = self.parser.latestRecordNum
    recCheck = self._checkProgress()
    if recCheck:
        LOGGER.info("...at record %i...", recCheck)

conn.commit()
conn.close()
Say I have only 1 GB of memory and 1 TB of hard disk space.
This is my code, and I am using a Postgres database.
import psycopg2

try:
    db = psycopg2.connect("database parameters")
    conn = db.cursor()
    conn.execute(query)
    # At this point, I am running
    for row in conn:
For this case, I guess it is safe to assume that conn is a generator, as I cannot seem to find a definitive answer online, and I cannot try it in my environment because I cannot afford the system to crash.
I am expecting this query to return data in excess of 100 GB.
I am using Python 2.7 and the psycopg2 library.
If you use an anonymous cursor, which you are doing in your example, then the entire query result will be read into client memory as soon as the query executes.
If you use a named (server-side) cursor instead, psycopg2 will fetch rows from the server in chunks as you loop over the data.
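A minimal sketch of the named-cursor variant, reusing the names from your snippet (the cursor name is arbitrary; itersize controls how many rows psycopg2 fetches per round trip):
import psycopg2

db = psycopg2.connect("database parameters")

# passing a name creates a server-side (named) cursor
conn = db.cursor(name='big_query')
conn.itersize = 10000  # rows fetched from the server per round trip

conn.execute(query)
for row in conn:
    pass  # process one row at a time; only itersize rows are held in memory

conn.close()
db.close()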