I have a dump that was made from a PostgreSQL database. I want to check for some information in that dump, specifically checking if there are entries in a certain table with certain values in certain fields.
This is for a Python program that should run automatically on many different inputs on customer machines, so I need a programmatic solution, not manually opening the file and looking for where that table is defined. I could restore the dump to a database and then delete it, but I'm worried that this operation is heavy or that it has side-effects. I want my query to have no side-effects; I just want to do the check without it affecting anything on my system.
Is that possible in any way? Preferably in Python?
Any dump format: restore and query
The most practical thing to do is restore the dump to a temporary PostgreSQL database and then query that database. It's by far the simplest option. If you have a non-superuser role with createdb rights you can do this pretty trivially and safely with pg_restore.
SQL-format
If it's a plain-text (.sql) format dump, and you're desperate, and you know the dumps were not created with the --inserts or --column-inserts options and you don't use the same table name in multiple schemas, you could just search for the text
COPY tablename (
at the start of a line, then read the COPY-format data (see below) until you find \. at the start of a line.
If you do use the same table name in different schemas you have to parse the dump to find the SET search_path entry for the schema you want, then start looking for the desired table COPY statement.
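In Python that scan could look roughly like this (a sketch only; it assumes unquoted table names and the default tab-separated COPY format, and the table/column used in the final check are made up):

def copy_rows(dump_path, table):
    """Yield the COPY rows for `table` from a plain-text pg_dump file."""
    in_copy = False
    with open(dump_path) as f:
        for line in f:
            if not in_copy:
                # Matches e.g. "COPY public.orders (id, status, ...) FROM stdin;"
                if line.startswith("COPY ") and line.split()[1].split(".")[-1] == table:
                    in_copy = True
            elif line.startswith("\\."):
                return
            else:
                # NULL appears as \N and values use backslash escapes; ignored here.
                yield line.rstrip("\n").split("\t")

# hypothetical check: any row of "orders" with 'active' in its third column?
found = any(row[2] == "active" for row in copy_rows("out.sql", "orders"))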
Custom-format
However, if the dump is in the PostgreSQL custom format, which you should always prefer and request by using -Fc with pg_dump, it is an archive with its own header and a table of contents. You can either try to parse that format yourself, or you can use pg_restore to list its contents and then extract individual tables.
For this task I'd do the latter. To list tables in the dump:
pg_restore --list out.dump
To dump a particular table as tab-separated COPY format by qualified name, e.g. table address in schema public:
pg_restore -n public -t address out.dump
The output has a bunch of stuff at the start that you can't get pg_restore to skip, but your script can just look for the word COPY (uppercase) at the start of a line and start reading on the next line, until it reaches a \. on a line by itself. For details on the format see the PostgreSQL manual on COPY.
Of course you need the pg_restore binary for this.
Make sure there is no PGDATABASE environment variable set when you invoke pg_restore. Otherwise it'll restore to a DB instead of printing output to stdout.
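Wrapping that up in Python could look roughly like this (an untested sketch; it assumes pg_restore is on PATH, and the column position checked at the end is made up):

import os
import subprocess

def table_rows(dump_path, schema, table):
    """Run pg_restore on a custom-format dump and yield the COPY rows of one table."""
    env = {k: v for k, v in os.environ.items() if k != "PGDATABASE"}
    out = subprocess.run(
        ["pg_restore", "-n", schema, "-t", table, dump_path],
        capture_output=True, text=True, check=True, env=env,
    ).stdout
    in_copy = False
    for line in out.splitlines():
        if not in_copy:
            in_copy = line.startswith("COPY ")
        elif line.startswith("\\."):
            break
        else:
            yield line.split("\t")

# hypothetical check: is there an address row whose third column is 'Berlin'?
print(any(r[2] == "Berlin" for r in table_rows("out.dump", "public", "address")))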
Dump the database to a CSV file (or one CSV file per table); then you can load and query them using pandas.
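For example, a minimal sketch along those lines (the file, column names and values below are made up):

import pandas as pd

df = pd.read_csv("mytable.csv")                       # one CSV per table
hits = df[(df["status"] == "active") & (df["amount"] > 100)]
print(not hits.empty)                                 # True if matching rows exist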
You could convert your dump to an INSERT INTO style dump with this little tool I've written:
https://github.com/freddez/pg-dump2insert
It will be easier to grep specific table data in this form.
Related
I am running multiple (about 60) queries in Impala using impala-shell, reading them from a file and outputting to a file. I am using:
impala-shell -q "query_here; query_here; etc;" -o output_path.csv -B --output_delimiter=','
The issue is that they are not separated between queries, so query 2's results would be appended directly as new rows onto the bottom of query 1's. I need to separate the results to match them up with each query, but I do not know where one query's results end and the next one's begin, because it is one continuous CSV file.
Is there a way to run multiple queries like this and leave some type of space or delimiter between query results or any way to separate the results by which query they came from?
Thanks.
You could insert your own separators by issuing some extra queries, for example select '-----'; between the real queries.
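Your script can then split the combined output on those separator rows, e.g. (a sketch, assuming you added something like select '-----'; between the real queries and that no real result row is exactly -----):

chunks, current = [], []
with open("output_path.csv") as f:
    for line in f:
        if line.strip() == "-----":
            chunks.append(current)      # end of one query's results
            current = []
        else:
            current.append(line.rstrip("\n"))
chunks.append(current)                  # results of the last query
# chunks[i] now holds the rows produced by query i+1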
Writing results of individual queries to local files is not yet possible, but there is already a feature request for it (IMPALA-2073). You can, however, easily save query results into HDFS as CSV files. You just have to create a new table to store the results specifying row format delimited fields terminated by ',', then use insert into table [...] select [...] to populate it. Please refer to the documentation sections Using Text Data Files with Impala Tables and INSERT Statement for details.
One comment suggested running the individual queries as separate commands and saving their results into separate CSV files. If you choose this solution, please be aware that DDL statements like create table are only guaranteed to take immediate effect in the connection in which they were issued. This means that creating a table and then immediately querying it in another impala shell is prone to failure. Even if you find that it works correctly, it may fail the next time you run it. On the other hand, running such queries one after the other in the same shell is always okay.
I'm looking at a way to access a csv file's cells in a random fashion. If I use Python's csv module, I can only iterate through all lines, which is rather slow. I should also add that the file is pretty large (>100MB) and that I'm looking for short response times.
I could preprocess the file into a different data format for faster row/column access. Perhaps someone has done this before and can share some experiences.
Background:
I'd like to show an extract of the csv on screen provided by a web server (depending on scroll position). Keeping the file in memory is not an option.
I have found SQLite good for this sort of thing. It is easy to set up and you can store the data locally, but you also get easier control over what you select than csv files and you get the facility to add indexes etc.
There is also a built in facility for loading csv files into a table: http://www.sqlite.org/cvstrac/wiki?p=ImportingFiles.
Let me know if you want any further details on the SQLite route i.e. how to create the table, load the data in or query it from Python.
SQLite Instructions to load .csv file to table
To create a database file you can just add the required filename as an argument when opening SQLite. Navigate to the directory containing the csv file from the command line (I am assuming here that you want the SQLite .db file to be contained in the same dir). If using Windows, add SQLite to your PATH environment variable if not already done (instructions here if you need them), and open SQLite with the name that you want to give your database file as an argument, e.g.:
sqlite3 example.db
Check the database file has been created by entering:
.databases
Create a table to hold the data. I am using an example for a simple customer table here. If data types are inconsistent for any columns use text:
create table customers (ID integer, Title text, Forename text, Surname text, Postcode text, Addr_Line1 text, Addr_Line2 text, Town text, County text, Home_Phone text, Mobile text, Comments text);
Specify the separator to be used:
.separator ","
Issue the command to import the data; the syntax takes the form .import filename.ext table_name, e.g.:
.import cust.csv customers
Check that the data has loaded in:
select count(*) from customers;
Add an index for columns that you are likely to filter on (full syntax described here) e.g.:
create index cust_surname on customers(surname);
You should now have fast access to the data when filtering on any of the indexed columns. To leave SQLite use .exit, to get a list of other helpful non-SQL commands use .help.
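Querying it from Python then just uses the standard library's sqlite3 module, e.g. (a sketch; the filter value is made up):

import sqlite3

conn = sqlite3.connect("example.db")
cur = conn.execute(
    "select ID, Forename, Surname from customers where Surname = ?",
    ("Smith",),                      # hypothetical filter value
)
for row in cur.fetchmany(50):        # fetch only the slice you need to display
    print(row)
conn.close()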
Python Alternative
Alternatively, if you want to stick with pure Python and pre-process the file, you could load the data into a dictionary, which would allow much faster access: the dictionary keys behave like an index, meaning that you can get to the values associated with a key quickly without going through the records one by one. I would need further details of your input data, and of what fields the lookups would be based on, to say more about how to implement this.
However, unless you know in advance when the data will be required (so that you can pre-process the file before the request for data), you would still have the overhead of loading the file from disk into memory every time you run this. Depending on your exact usage this may make the database solution more appropriate.
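Purely as an illustration, and assuming (since I don't know your data) that lookups are keyed on an ID in the first column:

import csv

index = {}
with open("data.csv", newline="") as f:
    reader = csv.reader(f)
    header = next(reader)            # keep the header row separately
    for row in reader:
        index[row[0]] = row          # key on the (assumed) ID column

# later: constant-time lookup instead of re-scanning the file
row = index.get("12345")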
I need to export information from a legacy database system. The database is a 'Progress' database and the information is stored in a file with extension .b1
What is the easiest way to export all database tables to a text file?
The .b1 file is a part of the Progress database but actually not the database itself. It contains the "before-image" data. It is used for keeping track of transactions so that the database can undo in case of error/rollback etc. The data in that file will really not help you.
What you would want are the database files. Usually named .db, .d1, .d2, d3 etc.
However, reading those (binary) files will be very tricky. I'm not even sure there are any specifications on how they are built. It would be a lot easier to use the Progress built-in tools for dumping all data as text files. Those text files could easily be read by some simple programs in Python. If you have the database installed on a system you will find a directory with programs to serve the database etc. There you will also find some utilities.
Depending on version of OS and of Progress it might look a bit different. You would want to enter the Data Administration utility and go into Admin => Dump Data and Definitions.
If you take a look at the resulting files (.df for data definitions/schema and .d for the data itself) you should be able to work out how they're formatted. Relations are not stored in the database at all. In a Progress environment they basically only exist in the application accessing the DB.
You can also select Export Data for various formats ("Text" is probably most interesting).
If you can programmatically access the Progress environment it might even be easier to write a small program that exports individual tables. This will create a semicolon-delimited file for "table1":
OUTPUT TO C:\temp\table1.txt.
FOR EACH table1 NO-LOCK:
EXPORT DELIMITER ";" table1.
END.
OUTPUT CLOSE.
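The exported file is then easy to read back from Python with the csv module (a sketch, using the path from the snippet above):

import csv

with open(r"C:\temp\table1.txt", newline="") as f:
    for row in csv.reader(f, delimiter=";"):
        print(row)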
I have a CSV file and want to generate dumps of the data for sqlite, mysql, postgres, oracle, and mssql.
Is there a common API (ideally Python based) to do this?
I could use an ORM to insert the data into each database and then export dumps, however that would require installing each database. It also seems a waste of resources - these CSV files are BIG.
I am wary of trying to craft the SQL myself because of the variations with each database. Ideally someone has already done this hard work, but I haven't found it yet.
SQLAlchemy is a database library that (as well as ORM functionality) supports SQL generation in the dialects of the all the different databases you mention (and more).
In normal use, you could create a SQL expression / instruction (using a schema.Table object), create a database engine, and then bind the instruction to the engine, to generate the SQL.
However, the engine is not strictly necessary; the dialects each have a compiler that can generate the SQL without a connection; the only caveat being that you need to stop it from generating bind parameters as it does by default:
from sqlalchemy.sql import expression
from sqlalchemy import schema, types
import csv

# example for mssql
from sqlalchemy.dialects.mssql import base

dialect = base.dialect()
compiler_cls = dialect.statement_compiler

class NonBindingSQLCompiler(compiler_cls):
    def _create_crud_bind_param(self, col, value, required=False, **kw):
        # Instead of creating a bind parameter, render the value as an
        # inline literal (this overrides a private SQLAlchemy hook).
        return self.render_literal_value(value, col.type)

recipe_table = schema.Table(
    "recipe", schema.MetaData(),
    schema.Column("name", types.String(50), primary_key=True),
    schema.Column("culture", types.String(50)))

for row in [{"name": "fudge", "culture": "america"}]:  # csv.DictReader(open("x.csv", "r")):
    insert = expression.insert(recipe_table, row, inline=True)
    c = NonBindingSQLCompiler(dialect, insert)
    c.compile()
    sql = str(c)
    print(sql)
The above example actually works. It assumes you know the target database table schema, and it should be easy to adapt it to import from a CSV and generate SQL for multiple target database dialects.
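As a side note, newer SQLAlchemy versions (0.9 and later) expose a supported way to inline literal values, which avoids overriding the private compiler hook; a sketch reusing the recipe_table above:

from sqlalchemy.dialects.postgresql import base as pg
from sqlalchemy.dialects.sqlite import base as sqlite
from sqlalchemy.dialects.mysql import base as mysql

insert = recipe_table.insert().values(name="fudge", culture="america")
for d in (pg.dialect(), sqlite.dialect(), mysql.dialect()):
    # literal_binds renders the values inline instead of as bind parameters
    print(insert.compile(dialect=d, compile_kwargs={"literal_binds": True}))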
I am no database wizard, but AFAIK in Python there's no common API that will do out of the box what you ask for. There is PEP 249, which defines an API that should be used by modules accessing DBs and that AFAIK is used at least by the MySQL and Postgres Python modules (here and here), and that perhaps could be a starting point.
The road I would attempt to follow myself - however - would be another one:
Import the CSV into MySQL (this is just because MySQL is the one I know best and there is a ton of material on the net, such as this very easy recipe, but you could do the same procedure starting from another database).
Generate the MySQL dump.
Process the MySQL dump file in order to adapt it to SQLite (and other) syntax.
The scripts for processing the dump file could be very compact, although they might be somewhat tricky if you use regexes to parse the lines. Here's an example MySQL → SQLite script that I simply pasted from this page:
#!/bin/sh
mysqldump --compact --compatible=ansi --default-character-set=binary mydbname |
grep -v ' KEY "' |
grep -v ' UNIQUE KEY "' |
perl -e 'local $/;$_=<>;s/,\n\)/\n\)/gs;print "begin;\n";print;print "commit;\n"' |
perl -pe '
if (/^(INSERT.+?)\(/) {
$a=$1;
s/\\'\''/'\'\''/g;
s/\\n/\n/g;
s/\),\(/\);\n$a\(/g;
}
' |
sqlite3 output.db
You could write your script in Python (in which case you should have a look at re.compile for performance).
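A bare-bones Python counterpart of the grep/wrapping stages might start like this (a sketch only; the trailing-comma fix-up and the INSERT rewriting done by the perl parts would still need to be added):

import re
import sys

skip = re.compile(r'\s(UNIQUE )?KEY "')    # pre-compiled, as suggested

sys.stdout.write("begin;\n")
for line in sys.stdin:                     # pipe mysqldump output into this
    if skip.search(line):
        continue
    sys.stdout.write(line)
sys.stdout.write("commit;\n")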
The rationale behind my choice would be:
I get the heavy lifting (importing, and therefore data consistency checks, plus generating the starting SQL file) done for me by MySQL.
I only have to have one database installed.
I have full control on what is happening and the possibility to fine-tune the process.
I can structure my script in such a way that it will be very easy to extend it for other databases (basically I would structure it like a parser that recognises individual fields + a set of grammars - one for each database - that I can select via command-line option)
There is much more documentation on the differences between SQL flavours than on single DB import/export libraries.
EDIT: A template-based approach
If for any reason you don't feel confident enough to write the SQL yourself, you could use a sort of template-based script. Here's how I would do it:
Import and generate a dump of the table in all the 4 DB you are planning to use.
For each DB save the initial part of the dump (with the schema declaration and all the rest) and a single insert instruction.
Write a Python script that, for each DB export, will output the "header" of the dump plus the same "saved line", into which you will programmatically substitute the values for each line in your CSV file (a sketch of this step is shown below).
The obvious drawback of this approach is that your "template" will only work for one table. The strongest point of it is that writing such script would be extremely easy and quick.
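A sketch of step 3 (file names and the placeholder convention are invented, and note that naive string formatting does not escape quotes in the values):

import csv

# header.sql  : the schema part saved from the real dump
# insert.tmpl : one saved INSERT line turned into a template, e.g.
#               INSERT INTO mytable (a, b) VALUES ('{a}', {b});
with open("header.sql") as f:
    print(f.read())

with open("insert.tmpl") as f:
    template = f.read().strip()

with open("data.csv", newline="") as f:
    for row in csv.DictReader(f):
        print(template.format(**row))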
HTH at least a bit!
You could do this - Create SQL tables from CSV files
or Generate Insert Statements from CSV file
or try this Generate .sql from .csv python
Of course you might need to tweak the scripts mentioned to suit your needs.
I recently created a script that parses several web proxy logs into a tidy sqlite3 db file that is working great for me... with one snag: the file size. I have been pressed to use this format (a sqlite3 db) and Python handles it natively like a champ, so my question is this: what is the best form of string compression that I can use for db entries when file size is the sole concern? zlib? base-n? Klingon?
Any advice would help me loads, again just string compression for characters that are compliant for URLs.
Here is a page with an SQLite extension to provide compression.
This extension provides a function that can be called on individual fields.
Here is some of the example text from the page
create a test table
sqlite> create table test(name varchar(20),surname varchar(20));
insert some text into the test table by compressing it; you can also compress binary content and insert it into a blob field
sqlite> insert into test values(mycompress('This is a sample text'),mycompress('This is a sample text'));
this shows nothing because our data is in binary format and compressed
sqlite> select * from test;
following works, it uncompresses the data
sqlite> select myuncompress(name),myuncompress(surname) from test;
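If you would rather not build the extension, roughly the same effect is possible in plain Python with zlib from the standard library (a sketch; the table and column are made up, and the compressed value goes into a blob):

import sqlite3
import zlib

conn = sqlite3.connect("proxy.db")
conn.execute("create table if not exists log(url blob)")

url = "http://example.com/some/long/requested/path?with=parameters"
conn.execute("insert into log values (?)", (zlib.compress(url.encode()),))
conn.commit()

# decompress on the way back out
for (blob,) in conn.execute("select url from log"):
    print(zlib.decompress(blob).decode())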
What sort of parsing do you do before you put it in the database? I get the impression that it is fairly simple, with a single table holding each entry - if not then my apologies.
Compression is all about removing duplication, and in a log file most of the duplication is between entries rather than within each entry, so compressing each entry individually is not going to be a huge win.
This is off the top of my head so feel free to shoot it down in flames, but I would consider breaking the table into a set of smaller tables holding the individual parts of the entry. A log entry would then mostly consist of a timestamp (as DATE type rather than a string) plus a set of indexes into other tables (e.g. requesting IP, request type, requested URL, browser type etc.)
This would have a trade-off of course, since it would make the database a lot more complex to maintain, but on the other hand it would enable meaningful queries such as "show me all the unique IPs that requested page X in the last week".
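A rough sketch of that layout with sqlite3 (table and column names are invented for illustration):

import sqlite3

conn = sqlite3.connect("proxy.db")
conn.executescript("""
    create table if not exists urls(id integer primary key, url text unique);
    create table if not exists ips(id integer primary key, ip text unique);
    create table if not exists entries(
        ts integer,
        ip_id integer references ips(id),
        url_id integer references urls(id));
""")

def lookup_id(table, column, value):
    """Return the id for value, inserting it the first time it is seen."""
    conn.execute("insert or ignore into %s(%s) values (?)" % (table, column), (value,))
    return conn.execute("select id from %s where %s = ?" % (table, column),
                        (value,)).fetchone()[0]

conn.execute("insert into entries values (?, ?, ?)",
             (1700000000,
              lookup_id("ips", "ip", "10.0.0.1"),
              lookup_id("urls", "url", "http://example.com/page")))
conn.commit()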
Instead of inserting compression/decompression code into your program, you could store the table itself on a compressed drive.