COPY Postgres table to CSV output, paginated over n files using python - python

Using psycopg2 to export Postgres data to CSV files (not all at once, 100 000 rows at a time). Currently using LIMIT OFFSET but obviously this is slow on a 100M row db. Any faster way to keep track of the offset each iteration?
for i in (0, 100000000, 100000):
"COPY
(SELECT * from users LIMIT %s OFFSET %s)
TO STDOUT DELIMITER ',' CSV HEADER;"
% (100000, i)
Is the code run in a loop, incrementing i

Let me suggest you a different approach.
Copy the whole table and split it afterward. Something like:
COPY users TO STDOUT DELIMITER ',' CSV HEADER
And finally, from bash execute the split command (btw, you could call it inside your python script):
split -l 100000 --numeric-suffixes users.csv users_chunks_
It'll generate a couple of files called users_chunks_1, users_chunks_2, etc.

Related

SQLite Database: One Big vs. Several Small? Write to Database in Parallel?

As I'm new to sqlite databases, I highly appreciate every useful comment, answer or reference to interesting threads and websites. Here's my situation:
I have a directory with 400 txt files each with the size of ~7GB. The relevant information in these files are written into a sqlite database resulting in a 17.000.000x4 table, which takes approximately 1 day. Later on the database will be queried only by me to further analyze the data.
The whole process of creating the database could be significantly accelerated, if it is possible to write to a database in parallel. For instance, I could run several processes in parallel, each process taking only one of the 400 txt files as input and writing the results to the database. So is it possible to let several processes write to a database in parallel?
EDIT1: Answer w.r.t. W4t3randWinds comment: It is possible (and faster) to process 1 file per core, write the results into a database and merge all databases after that. However, write into 1 database using multi threading is not possible.
Furthermore, I was wondering whether it would be more efficient to create several databases instead of one big database? For instance, does it make sense to create a database per txt file resulting in 400 databases consisting of a 17.000.000/400 x 4 table?
At last, I'm storing the database as a file on my machine. However, I also read about the possibility to set up a server. So when does it make sense to use a server and more specifically, would it make sense to use a server in my case?
Please see below my code for the creation of the database.
### SET UP
# set up database
db = sqlite3.connect("mydatabase.db")
cur = db.cursor()
cur.execute("CREATE TABLE t (sentence, ngram, word, probability);")
# set up variable to store db rows
to_db = []
# set input directory
indir = '~/data/'
### PARSE FILES
# loop through filenames in indir
for filename in os.listdir(indir):
if filename.endswith(".txt"):
filename = os.path.join(indir, filename)
# open txt files in dir
with io.open(filename, mode = 'r', encoding = 'utf-8') as mytxt:
### EXTRACT RELEVANT INFORMATION
# for every line in txt file
for i, line in enumerate(mytxt):
# strip linebreak
line = line.strip()
# read line where the sentence is stated
if i == 0 or i % 9 == 0:
sentence = line
ngram = " ".join(line.split(" ")[:-1])
word = line.split(" ")[-1]
# read line where the result is stated
if (i-4) == 0 or (i-4) % 9 == 0:
result = line.split(r'= ')[1].split(r' [')[0]
# make a tuple representing a new row of db
db_row = (sentence, ngram, word, result)
to_db.append(db_row)
### WRITE TO DATABASE
# add new row to db
cur.executemany("INSERT INTO t (sentence, ngram, word, results) VALUES (?, ?, ?, ?);", to_db)
db.commit()
db.close()
The whole process of creating the database could be significantly accelerated, if it is possible to write to a database in parallel
I am not sure of that. You only have little processing, so the whole process is likely to be io bound. SQLite is a very nice tool, but it only support one single thread to write into it.
Possible improvements:
use x threads to read and process the text file, a single one to write to the database in large chunks and a queue. As the process is IO bound, the Python Global Interprocess Lock should not be a problem
use a full featured database like PostgreSQL or MariaDB on a separate machine and multiple processes on the client machine each processing its own set of input files
In either case, I am unsure of the benefit...
I do daily updates to an SQLite database using python mutlithreading. It works beautifully. Two different tables have nearly 20,000,000 records one with 8 fields the other with 10. This is on my laptop which is 4 years old.
If you are having performance issues I recommend looking into how your tables are constructed (a proper primary key and indexes) and your equipment. If you are still using an HDD you will gain amazing performance by upgrading to an SSD.

Processing/loading huge gzip file into hive

I have a huge gzipped csv file (55GB compressed, 660GB expanded) that I am trying to to process and load into hive in a more useable format.
The file has 2.5B records of device events, with each row giving a device_id, event_id, timestamp, and event_name for identification and then about 160 more columns, only a few of which are non-null for a given event_name.
Ultimately, I'd like to format the data into the few identification columns and a JSON column that only stores the relevant fields for a given event_name and get this in a hive table partitioned by date, hour, and event_name (and partitions based on metadata like timezone, but hopefully that is a minor point) so we can query the data easily for further analysis. However, the sheer size of the file is giving me trouble.
I've tried several approaches, without much success:
Loading the file directly into hive and then doing 'INSERT OVERWRITE ... SELECT TRANSFORM(*) USING 'python transform.py' FROM ...' had obvious problems
I split the file into multiple gzip files of 2 million rows each with bash commands: gzip -cd file.csv.gz | split -l 2000000 -d -a 5 --filter='gzip > split/$FILE.gz' and just loading one of those into hive to do the transform, but I'm running into memory issues still, though we've tried to increase memory limits (I'd have to check to see what parameters we've changed).
A. Tried a transform script that uses pandas (because it makes it easy to group by event_name and then remove unneeded columns per event name) and limited pandas to reading in 100k rows at a time, but still needed to limit hive selection to not have memory issues (1000 rows worked fine, 50k rows did not).
B. I also tried making a secondary temp table (stored as ORC), with the same columns as in the csv file, but partitioned by event_name and then just selecting the 2 million rows into that temp table, but also had memory issues.
I've considered trying to start with splitting the data up by event_name. I can do this using awk, but I'm not sure that would be much better than just doing the 2 million row files.
I'm hoping somebody has some suggestions for how to handle the file. I'm open to pretty much any combination of bash, python/pandas, hadoop/hive (I can consider others as well) to get the job done, so long as it can be made mostly automatic (I'll have several other files like this to process too).
This is the hive transform query I'm using:
SET hive.exec.dynamic.partition.mode=nonstrict;
SET hive.exec.max.dynamic.partitions.pernode=10000;
SET mapreduce.map.memory.mb=8192;
SET mapreduce.reduce.memory.mb=10240;
ADD FILE transform.py;
INSERT OVERWRITE TABLE final_table
PARTITION (date,hour,zone,archetype,event_name)
SELECT TRANSFORM (*)
USING "python transform.py"
AS device_id,timestamp,json_blob,date,hour,zone,archetype,event_name
FROM (
SELECT *
FROM event_file_table eft
JOIN meta_table m ON eft.device_id=m.device_id
WHERE m.name = "device"
) t
;
And this is the python transform script:
import sys
import json
import pandas as pd
INSEP='\t'
HEADER=None
NULL='\\N'
OUTSEP='\t'
cols = [# 160+ columns here]
metaCols =['device_id','meta_blob','zone','archetype','name']
df = pd.read_csv(sys.stdin, sep=INSEP, header=HEADER,
names=cols+metaCols, na_values=NULL, chunksize=100000,
parse_dates=['timestamp'])
for chunk in df:
chunk = chunk.drop(['name','meta_blob'], axis=1)
chunk['date'] = chunk['timestamp'].dt.date
chunk['hour'] = chunk['timestamp'].dt.hour
for name, grp in chunk.groupby('event_name'):
grp = grp.dropna(axis=1, how='all') \
.set_index(['device_id','timestamp','date','hour',
'zone','archetype','event_name']) \
grp = pd.Series(grp.to_dict(orient='records'),grp.index) \
.to_frame().reset_index() \
.rename(columns={0:'json_blob'})
grp = grp[['device_id',timestamp','blob',
'date','hour','zone','archetype','event_name']]
grp.to_csv(sys.stdout, sep=OUTSEP, index=False, header=False)
Depending on the yarn configurations, and what hits the limits first, I've received error messages:
Error: Java heap space
GC overhead limit exceeded
over physical memory limit
Update, in case anybody has similar issues.
I did eventually manage to get this done. In our case, what seemed to work best was to use awk to split the file by date and hour. This allowed us to reduce memory overhead for partitions because we could load in a few hours at a time rather than potentially having hundreds of hourly partitions (multiplied by all the other partitions we wanted) to try to keep in memory and load at once. Running the file through awk took several hours, but could be done in parallel with loading another already split file into hive.
Here's the bash/awk I used for splitting the file:
gzip -cd file.csv.gz |
tail -n +2 |
awk -F, -v MINDATE="$MINDATE" -v MAXDATE="$MAXDATE" \
'{
if ( ($5>=MINDATE) && ($5<MAXDATE) ) {
split($5, ts, "[: ]");
print | "gzip > file_"ts[1]"_"ts[2]".csv.gz"
}
}'
where, obviously, the 5th column of the file was the timestamp, and MINDATE and MAXDATE were used to filter out dates we did not care about. This split the timestamp by spaces and colons so the first part of the timestamp was the date and the second the hour (third and fourth would be minutes and seconds) and used that to direct the line to the appropriate output file.
Once the the file was split by hour, I loaded the hours several at a time into a temporary hive table and proceeded with basically the same transform mentioned above.

How to fetch postgres table in parallel using python

I have a table named "order_lines" since it is has almost 7 million rows it takes me around 30 minutes to pull it and dump into a csv. I was hoping if I can pull it in parallel so as to reduce load time using python. My end objective is to replicate it in redshift.
Can someone please suggests methods to pull this table in python ?
Thanks in advance !
You can use below query with admin privilege :
COPY (select col1,col2 from your_table) TO 'some_file_location/filename.csv' DELIMITER ',' CSV HEADER;
OR on your command prompt of server :
psql -U user -d db_name -c "Copy (Select * From your_table) To STDOUT With CSV HEADER DELIMITER ',';" > filename_data.csv

Alternative to "write to file" for transfering CSV data to PostgreSQL using COPY for better performance?

I have a dataset in a CSV file consisting of 2500 lines. The file is structured that (simplified) way:
id_run; run_name; receptor1; receptor2; receptor3_value; [...]; receptor50_value
Each receptor of the file is already in a table and have a unique id.
I need to upload each line to a table with this format:
id_run; id_receptor; receptor_value
1; 1; 2.5
1; 2; 3.2
1; 3, 2.1
[...]
2500, 1, 2.4
2500, 2, 3.0
2500, 3, 1.1
Actually, I'm writing all the data I need to upload in a .txt file and I'm using the COPY command from postgreSQL to transfer the file to the destination table.
For 2500 runs (so 2500 lines in the CSV file) and 50 receptors, my Python program generates ~110000 records in the text file to be uploaded.
I'm dropping the foreign keys of the destination table and restoring them after the upload.
Using this method, it takes actually ~8 seconds to generate the text file and 1 second to copy the file to the table.
Is there a way, method, library or anything else I could use to accelerate the preparation of the data for the upload so that 90% of the time required isn't for the writing of the text file?
Edit:
Here is my (updated) code. I'm now using a bulk writing to the text file. It looks likes it faster (uploaded 110 000 lines in 3.8 seconds).
# Bulk write to file
lines = []
for line_i, line in enumerate(run_specs):
# the run_specs variable consists of the attributes defining a run
# (id_run, run_name, etc.). So basically a line in the CSV file without the
# receptors data
sc_uid = get_uid(db, table_name) # function to get the unique ID of the run
for rec_i, rec in enumerate(rec_uids):
# the rec_uids variable is the unique IDs in the database for the
# receptors in the CSV file
line_to_write = '%s %s %s\n' % (sc_uid, rec, rec_values[line_i][rec_i])
lines.append(line_to_write)
# write to file
fn = r"data\tmp_data_bulk.txt"
with open(fn, 'w') as tmp_data:
tmp_data.writelines(lines)
# get foreign keys of receptor_results
rr_fks = DB.get_fks(conn, 'receptor_results') # function to get foreign keys
# drop the foreign keys
for key in rr_fks:
DB.drop_fk(conn, 'receptor_results', key[0]) # funciton to drop FKs
# upload data with custom function using the COPY SQL command
DB.copy_from(conn, fn, 'receptor_results', ['sc_uid', 'rec_uid', 'value'],\
" ", False)
# restore foreign keys
for key in rr_fks:
DB.create_fk(conn, 'receptor_results', key[0], key[1], key[2])
# commit to database
conn.commit()
Edit #2:
Using cStringIO library, I replaced the creation of a temporary text file with a filelike object, but the speed gains is very very small.
Code changed:
outf = cStringIO.StringIO()
for rec_i, rec in enumerate(rec_uids):
outf.write('%s %s %s\n' % (sc_uid, rec, rec_values[line_i][rec_i]))
cur.copy_from(outf, 'receptor_results')
Yes, there is something you can do to speed up writing the data to the file in advance: don't bother!
You already fit the data into memory, so that isn't an issue. So, instead of writing the lines to a list of strings, write them to a slightly different object - a StringIO instance. Then the data can stay in memory and serve as the parameter to psycopg2's copy_from function.
filelike = StringIO.StringIO('\n'.join(['1\tA', '2\tB', '3\tC']))
cursor.copy_from(filelike, 'your-table-name')
Notice that the StringIO must contain the newlines, the field separators and so on - just as the file would have.
I'm writing all the data I need to upload in a .txt file and I'm using the COPY command from postgreSQL to transfer the file to the destination table.
It is a heavy and unnecessary round-trip for all your data. Since you already have it in memory, you should just translate it into a multi-row insert directly:
INSERT INTO table(col1, col2) VALUES (val1, val2), (val3, val4), ...
i.e. concatenate your data into such a query and execute it as is.
In your case you would probably generate and execute 50 such inserts, with 2500 rows in each, according to your requirements.
It will be the best-performing solution ;)

Extract and sort data from .mdb file using mdbtools in Python

I'm quite new to Python, so any help will be appreciated. I am trying to extract and sort data from 2000 .mdb files using mdbtools on Linux. So far I was able to just take the .mdb file and dump all the tables into .csv. It creates huge mess since there are lots of files that need to be processed.
What I need is to extract particular sorted data from particular table. Like for example, I need the table called "Voltage". The table consists of numerous cycles and each cycle has several rows also. The cycles usually go in chronological order, but in some cases time stamp get recorded with delay. Like cycle's one first row can have later time than cycles 1 first row. I need to extract the latest row of the cycle based on time for the first or last five cycles. For example, in table below, I will need the second row.
Cycle# Time Data
1 100.59 34
1 101.34 54
1 98.78 45
2
2
2 ...........
Here is the script I use. I am using the command python extract.py table_files.mdb. But I would like the script to just be invoked with ./extract.py. The path to filenames should be in the script itself.
import sys, subprocess, os
DATABASE = sys.argv[1]
subprocess.call(["mdb-schema", DATABASE, "mysql"])
# Get the list of table names with "mdb-tables"
table_names = subprocess.Popen(["mdb-tables", "-1", DATABASE],
stdout=subprocess.PIPE).communicate()[0]
tables = table_names.splitlines()
print "BEGIN;" # start a transaction, speeds things up when importing
sys.stdout.flush()
# Dump each table as a CSV file using "mdb-export",
# converting " " in table names to "_" for the CSV filenames.
for table in tables:
if table != '':
filename = table.replace(" ","_") + ".csv"
file = open(filename, 'w')
print("Dumping " + table)
contents = subprocess.Popen(["mdb-export", DATABASE, table],
stdout=subprocess.PIPE).communicate()[0]
file.write(contents)
file.close()
Personally, I wouldn't spend a whole lot of time fussing around trying to get mdbtools, unixODBC and pyodbc to work together. As Pedro suggested in his comment, if you can get mdb-export to dump the tables to CSV files then you'll probably save a fair bit of time by just importing those CSV files into SQLite or MySQL, i.e., something that will be more robust than using mdbtools on the Linux platform.
A few suggestions:
Given the sheer number of .mdb files (and hence .csv files) involved, you'll probably want to import the CSV data into one big table with an additional column to indicate the source filename. That will be much easier to manage than ~2000 separate tables.
When creating your target table in the new database you'll probably want to use a decimal (as opposed to float) data type for the [Time] column.
At the same time, rename the [Cycle#] column to just [Cycle]. "Funny characters" in column names can be a real nuisance.
Finally, to select the "last" reading (largest [Time] value) for a given [SourceFile] and [Cycle] you can use a query something like this:
SELECT
v1.SourceFile,
v1.Cycle,
v1.Time,
v1.Data
FROM
Voltage v1
INNER JOIN
(
SELECT
SourceFile,
Cycle,
MAX([Time]) AS MaxTime
FROM Voltage
GROUP BY SourceFile, Cycle
) v2
ON v1.SourceFile=v2.SourceFile
AND v1.Cycle=v2.Cycle
AND v1.Time=v2.MaxTime
To bring it directly to Pandas in python3 I wrote this little snippet
import sys, subprocess, os
from io import StringIO
import pandas as pd
VERBOSE = True
def mdb_to_pandas(database_path):
subprocess.call(["mdb-schema", database_path, "mysql"])
# Get the list of table names with "mdb-tables"
table_names = subprocess.Popen(["mdb-tables", "-1", database_path],
stdout=subprocess.PIPE).communicate()[0]
tables = table_names.splitlines()
sys.stdout.flush()
# Dump each table as a stringio using "mdb-export",
out_tables = {}
for rtable in tables:
table = rtable.decode()
if VERBOSE: print('running table:',table)
if table != '':
if VERBOSE: print("Dumping " + table)
contents = subprocess.Popen(["mdb-export", database_path, table],
stdout=subprocess.PIPE).communicate()[0]
temp_io = StringIO(contents.decode())
print(table, temp_io)
out_tables[table] = pd.read_csv(temp_io)
return out_tables
There's an alternative to mdbtools for Python: JayDeBeApi with the UcanAccess driver. It uses a Python -> Java bridge which slows things down, but I've been using it with considerable success and comes with decent error handling.
It takes some practice setting it up, but if you have a lot of databases to wrangle, it's well worth it.

Categories

Resources