How to read a Progress database file .b1 in Python


I need to export information from a legacy database system. The database is a 'Progress' database and the information is stored in a file with the extension .b1.
What is the easiest way to export all database tables to a text file?

The .b1 file is part of the Progress database but not the database itself. It contains the "before-image" data, which keeps track of transactions so that the database can undo changes in case of an error, rollback etc. The data in that file will not help you.
What you want are the database files, usually named .db, .d1, .d2, .d3 etc.
However, reading those (binary) files will be very tricky. I'm not even sure there are any specifications on how they are built. It would be a lot easier to use the built-in Progress tools for dumping all data as text files. Those text files can easily be read by simple Python programs. If you have the database installed on a system you will find a directory with programs to serve the database etc. There you will also find some utilities.
Depending on the version of the OS and of Progress it might look a bit different. You would want to open the Data Administration utility and go to Admin => Dump Data and Definitions.
If you take a look at the resulting files, .df for data definitions (schema) and .d for the data itself, you should be able to work out how they are formatted. Relations are not stored in the database at all. In a Progress environment they basically only exist in the application accessing the DB.
You can also select Export Data for various formats ("Text" is probably most interesting).
If you can access the Progress environment programmatically it might be even easier to write a small program that exports individual tables. This will create a semicolon-delimited file for "table1":
OUTPUT TO C:\temp\table1.txt.
FOR EACH table1 NO-LOCK:
EXPORT DELIMITER ";" table1.
END.
OUTPUT CLOSE.
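Once a table has been exported that way, reading it from Python is short. This is only a minimal sketch: it assumes the EXPORT output wraps character fields in double quotes (doubling any embedded quote) and writes the unknown value as a bare ?, and the function name read_progress_export is made up for illustration.

```python
import csv


def read_progress_export(stream, delimiter=";"):
    """Yield one list of field values per exported record.

    Assumption: a field that is exactly ? is the Progress unknown value
    and is mapped to None. After CSV parsing a quoted "?" string looks
    the same, so this sketch cannot tell the two apart.
    """
    reader = csv.reader(stream, delimiter=delimiter, quotechar='"', doublequote=True)
    for row in reader:
        yield [None if field == "?" else field for field in row]
```

With a real file you would pass something like open(r"C:\temp\table1.txt", newline="") as the stream. The right encoding depends on the database codepage, so an explicit encoding= argument may be needed.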

Related

Mapping fields inside other fields

Hello I would like to make an app that allows the user to import data from a source of his choice (Airtable, xls, csv, JSON) and export to a JSON which will be pushed to an Sqlite database using an API.
The "core" of the functionality of the app is that it allows the user to create a "template" and "map" of the source columns inside the destination columns. Which source column(s) go to which destination column is up to the user. I am attaching two photos here (used in airtable/zapier), so you can get a better idea of the end result:
[Images: adding fields inside fields - airtable; adding fields inside fields - zapier]
Can you recommend a library or a way to go about this problem? I have looked for Python and Node.js libraries, but I am lost: some people recommend ETL libraries, some recommend mapping/zipping features, and others recommend coding my own classes. Do you know any libraries that can do the same thing as Airtable/Zapier? Any suggestions?

Saving files in a database is really bad practice, since it takes up a lot of database storage space and adds latency to the communication.
I strongly recommend saving the file on disk and storing its path in the database.
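For what it's worth, that advice can be sketched like this with the standard library. The uploads table, its path column and the store_upload function are all hypothetical names, and sqlite3 only stands in for whatever database you actually use.

```python
import shutil
import sqlite3
from pathlib import Path


def store_upload(conn, src, storage_dir):
    """Copy src into storage_dir and record only its path in the database."""
    storage_dir = Path(storage_dir)
    storage_dir.mkdir(parents=True, exist_ok=True)
    dest = storage_dir / Path(src).name
    shutil.copy2(src, dest)  # the file contents stay on disk
    with conn:  # commits on success, rolls back on error
        # Hypothetical table: uploads(id INTEGER PRIMARY KEY, path TEXT)
        conn.execute("INSERT INTO uploads (path) VALUES (?)", (str(dest),))
    return dest
```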

in-memory sqlite in production with python

I am creating a Python system that needs to handle many files. Each file has more than 10 thousand lines of text data.
Because a DB server (like MySQL) cannot be used in that environment, when a file is uploaded by a user, I plan to save all of its data in an in-memory SQLite database so that I can use SQL to fetch specific data from it.
Then, when all operations by the program are finished, it saves the processed data in a file. This is the file users will receive from the system.
But some websites say SQLite shouldn't be used in production. In my case, I just keep the data temporarily in memory so I can use SQL on it. Is there any problem with using SQLite in production even in this scenario?
Edit:
The data in the in-memory DB doesn't need to be shared between processes. The program just creates tables, processes the data, then discards all data and tables after saving the processed data in a file. I just think saving everything in a list makes searching difficult and slow. So is using SQLite still a problem?

"SQLite shouldn't be used in production" is not a one-size-fits-all rule; it's more of a rule of thumb. Of course there are applications where one can make reasonable use of SQLite even in production environments.
However, your case doesn't seem to be one of them. While SQLite supports multi-threaded and multi-process environments, it locks the entire database (all tables) when it opens a write transaction. You need to ask yourself whether this is a problem for your particular case, but if you're uncertain go for "yes, it's a problem for me".
You'd probably be okay with in-memory structures alone, unless there are some details you haven't mentioned.

I'm not familiar with the specific context of your system, but SQLite is a reasonable choice if all of the following hold:
You are looking for a SQL database that is light.
Access is from a single process and a single thread.
If the system crashes in the middle, you have a good way to recover from it (either backing up the last stable version of the database or just creating it from scratch).
If you meet all these criteria, using SQLite in production is fine. OSX, for example, uses sqlite for a few purposes (e.g. ./var/db/auth.db).
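A minimal sketch of the single-process, throwaway use described in the question, using only the standard library. The "name,value" line layout, the records table and the load_lines function are assumptions made for the example, not anything from the original system.

```python
import sqlite3


def load_lines(lines):
    """Load 'name,value' text lines into a private in-memory database."""
    conn = sqlite3.connect(":memory:")  # disappears when conn is closed
    conn.execute(
        "CREATE TABLE records (line_no INTEGER PRIMARY KEY, name TEXT, value INTEGER)"
    )
    rows = (
        (n, *line.rstrip("\n").split(","))
        for n, line in enumerate(lines, start=1)
    )
    with conn:  # one transaction for the whole bulk insert
        conn.executemany("INSERT INTO records VALUES (?, ?, ?)", rows)
    return conn
```

After loading, ordinary SQL such as conn.execute("SELECT SUM(value) FROM records WHERE name = ?", (name,)) can be used, and closing the connection discards everything.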

Query data from a PostgreSQL dump

I have a dump that was made from a PostgreSQL database. I want to check for some information in that dump, specifically checking if there are entries in a certain table with certain values in certain fields.
This is for a Python program that should run automatically on many different inputs on customer machines, so I need a programmatic solution, not manually opening the file and looking for where that table is defined. I could restore the dump to a database and then delete it, but I'm worrying that this operation is heavy or that it has side-effects. I want there to be no side-effects to my query, I just want to do the check without it affecting anything in my system.
Is that possible in any way? Preferably in Python?

Any dump format: restore and query
The most practical thing to do is restore them to a temporary PostgreSQL database then query the database. It's by far the simplest option. If you have a non-superuser with createdb rights you can do this pretty trivially and safely with pg_restore.
SQL-format
If it's a plaintext (.sql) format dump, you are desperate, you know the dumps were not created with the --inserts or --column-inserts options, and you don't use the same table name in multiple schemas, you could just search for the text
COPY tablename (
at the start of a line, then read the COPY-format data (see below) until you find \. at the start of a line.
If you do use the same table name in different schemas you have to parse the dump to find the SET search_path entry for the schema you want, then start looking for the desired table COPY statement.
Custom-format
However, if the dump is in the PostgreSQL custom format, which you should always prefer and request by using -Fc with pg_dump, it is a PostgreSQL-specific archive with its own header and table of contents (not a tar file; tar is the separate -Ft format). Rather than picking the file apart yourself, use pg_restore to list its contents and then extract individual tables.
To list tables in the dump:
pg_restore --list out.dump
To dump a particular table as tab-separated COPY format by qualified name, e.g. table address in schema public:
pg_restore -n public -t address out.dump
The output starts with a bunch of stuff you can't get pg_restore to skip, but your script can just look for the word COPY (uppercase) at the start of a line and start reading on the next line, until it reaches a line containing only \. (backslash, period). For details on the format see the PostgreSQL manual on COPY.
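Of course you need the pg_restore binary for this. On PostgreSQL 12 and later, pg_restore also needs -f - to write to stdout when no database is given.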
Make sure there is no PGDATABASE environment variable set when you invoke pg_restore. Otherwise it'll restore to a DB instead of printing output to stdout.
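The COPY scanning described above can be sketched in a few lines of Python. The function works on any iterable of text lines, so it can be fed either a plain SQL dump or the captured output of pg_restore. copy_rows is a hypothetical helper; it assumes the default text COPY format and only handles the common single-character backslash escapes (no octal or hex forms). Depending on the pg_dump version the table name in the COPY line may or may not be schema-qualified, so pass it exactly as it appears in the dump.

```python
import re

# Single-character backslash escapes used by COPY's text format.
_ESCAPES = {"t": "\t", "n": "\n", "r": "\r", "b": "\b", "f": "\f", "v": "\v", "\\": "\\"}


def _unescape(field):
    if field == r"\N":  # COPY's default marker for NULL
        return None
    return re.sub(r"\\(.)", lambda m: _ESCAPES.get(m.group(1), m.group(1)), field)


def copy_rows(lines, table):
    """Yield one dict per data row of the first COPY block for table."""
    header = f"COPY {table} ("
    columns = None
    for raw in lines:
        line = raw.rstrip("\n")
        if columns is None:
            # Still looking for the COPY statement of the wanted table.
            if line.startswith(header):
                columns = line[len(header):line.index(")")].split(", ")
            continue
        if line == r"\.":  # end-of-data marker
            return
        yield dict(zip(columns, (_unescape(f) for f in line.split("\t"))))
```

All values come back as strings (or None), so the caller still has to convert types before comparing against numbers or dates.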

Dump the database to a CSV file (or a CSV file for each table) and then you can load and query them using pandas.

You could convert your dump to an INSERT INTO dump with this little tool I've written:
https://github.com/freddez/pg-dump2insert
It will be easier to grep specific table data in this form.

Importing a CSV file into a PostgreSQL DB using Python-Django

Note: Scroll down to the Background section for useful details. Assume the project uses Python-Django and South, in the following illustration.
What's the best way to import the following CSV
"john","doe","savings","personal"
"john","doe","savings","business"
"john","doe","checking","personal"
"john","doe","checking","business"
"jemma","donut","checking","personal"
Into a PostgreSQL database with the related tables Person, Account, and AccountType considering:
Admin users can change the database model and CSV import-representation in real-time via a custom UI
The saved CSV-to-Database table/field mappings are used when regular users import CSV files
So far two approaches have been considered
ETL-API Approach: Providing an ETL API with a spreadsheet, my CSV-to-Database table/field mappings, and connection info for the target database. The API would then load the spreadsheet and populate the target database tables. Looking at pygrametl, I don't think what I'm aiming for is possible. In fact, I'm not sure any ETL APIs do this.
Row-level Insert Approach: Parsing the CSV-to-Database table/field mappings, parsing the spreadsheet, and generating SQL inserts in "join-order".
I implemented the second approach but am struggling with algorithm defects and code complexity. Is there a python ETL API out there that does what I want? Or an approach that doesn't involve reinventing the wheel?
Background
The company I work at is looking to move hundreds of project-specific design spreadsheets hosted in sharepoint into databases. We're near completing a web application that meets the need by allowing an administrator to define/model a database for each project, store spreadsheets in it, and define the browse experience. At this stage of completion transitioning to a commercial tool isn't an option. Think of the web application as a django-admin alternative, though it isn't, with a DB modeling UI, CSV import/export functionality, customizable browse, and modularized code to address project-specific customizations.
The implemented CSV import interface is cumbersome and buggy, so I'm trying to get feedback and find alternative approaches.

How about separating the problem into two separate problems?
Create a Person class which represents a person in the database. This could use Django's ORM, or extend it, or you could do it yourself.
Now you have two issues:
Create a Person instance from a row in the CSV.
Save a Person instance to the database.
Now, instead of just CSV-to-Database, you have CSV-to-Person and Person-to-Database. I think this is conceptually cleaner. When the admins change the schema, that changes the Person-to-Database side. When the admins change the CSV format, they're changing the CSV-to-Person side. Now you can deal with each separately.
Does that help any?
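Here is a rough sketch of that split. It is deliberately simplified: a plain dataclass and sqlite3 stand in for Django's ORM and PostgreSQL, everything goes into one flat person table instead of the related Person/Account/AccountType tables, and all names (CSV_COLUMNS, person_from_row, save_person, import_csv) are invented for the example.

```python
import csv
import sqlite3
from dataclasses import astuple, dataclass


@dataclass
class Person:
    first_name: str
    last_name: str
    account: str
    account_type: str


# CSV-to-Person: the only place that knows the CSV column order.
CSV_COLUMNS = ("first_name", "last_name", "account", "account_type")


def person_from_row(row):
    return Person(**dict(zip(CSV_COLUMNS, row)))


# Person-to-Database: the only place that knows the table layout.
def save_person(conn, person):
    conn.execute(
        "INSERT INTO person (first_name, last_name, account, account_type) "
        "VALUES (?, ?, ?, ?)",
        astuple(person),
    )


def import_csv(conn, stream):
    """Import every CSV row in one transaction; return the row count."""
    count = 0
    with conn:
        for row in csv.reader(stream):
            save_person(conn, person_from_row(row))
            count += 1
    return count
```

If the admins reorder the CSV columns, only CSV_COLUMNS changes; if they change the schema, only save_person changes.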

I write import sub-systems almost every month at work, and because I do that kind of task so often, some time ago I wrote django-data-importer. The importer works like a Django form and has readers for CSV, XLS and XLSX files that give you lists of dicts.
With the data_importer readers you can read a file into a list of dicts, iterate over it with a for loop and save the lines to the DB.
With the importer you can do the same, with the bonus of validating each field of a line, logging errors and actions, and saving everything at the end.
Please take a look at https://github.com/chronossc/django-data-importer. I'm pretty sure it will solve your problem and help you process any kind of CSV file from now on :)
To solve your problem I suggest using data-importer with Celery tasks. You upload the file and fire the import task via a simple interface. The Celery task sends the file to the importer, and you can validate lines, save them, and log errors. With some effort you can even show the progress of the task to the users who uploaded the sheet.

I ended up taking a few steps back to address this problem per Occam's razor using updatable SQL views. It meant a few sacrifices:
Removing: South.DB-dependent real-time schema administration API, dynamic model loading, and dynamic ORM syncing
Defining models.py and an initial south migration by hand.
This allows for a simple approach to importing flat datasets (CSV/Excel) into a normalized database:
Define unmanaged models in models.py for each spreadsheet
Map those to updatable SQL Views (INSERT/UPDATE-INSTEAD SQL RULEs) in the initial south migration that adhere to the spreadsheet field layout
Iterate through the CSV/Excel spreadsheet rows and perform an INSERT INTO <VIEW> (<COLUMNS>) VALUES (<CSV-ROW-FIELDS>); for each one
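To show the shape of the idea in something runnable, here is a small sketch. Note that it is not the PostgreSQL rules setup described above: it uses SQLite's INSTEAD OF triggers, which play a similar role, so that it runs with only the standard library. The table, view and trigger names are made up, and the two-table schema is a simplification of the question's model.

```python
import sqlite3

SCHEMA = """
CREATE TABLE person (
    id INTEGER PRIMARY KEY,
    first_name TEXT, last_name TEXT,
    UNIQUE (first_name, last_name)
);
CREATE TABLE account (
    id INTEGER PRIMARY KEY,
    person_id INTEGER REFERENCES person (id),
    account TEXT, account_type TEXT
);
-- The view has the same flat column layout as the spreadsheet.
CREATE VIEW sheet AS
    SELECT p.first_name, p.last_name, a.account, a.account_type
    FROM account a JOIN person p ON p.id = a.person_id;
-- Route an insert on the view into the normalized tables.
CREATE TRIGGER sheet_insert INSTEAD OF INSERT ON sheet
BEGIN
    INSERT OR IGNORE INTO person (first_name, last_name)
        VALUES (NEW.first_name, NEW.last_name);
    INSERT INTO account (person_id, account, account_type)
        SELECT id, NEW.account, NEW.account_type FROM person
        WHERE first_name = NEW.first_name AND last_name = NEW.last_name;
END;
"""


def import_rows(conn, rows):
    """Insert flat spreadsheet rows through the view, in one transaction."""
    with conn:
        conn.executemany(
            "INSERT INTO sheet (first_name, last_name, account, account_type) "
            "VALUES (?, ?, ?, ?)",
            rows,
        )
```

The import loop never touches person or account directly; all knowledge of the normalized layout lives in the view and its trigger.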

Here is another approach that I found on GitHub. Basically it detects the schema and allows overrides. Its whole goal is just to generate raw SQL to be executed by psql or whatever driver you use.
https://github.com/nmccready/csv2psql
% python setup.py install
% csv2psql --schema=public --key=student_id,class_id example/enrolled.csv > enrolled.sql
% psql -f enrolled.sql
There are also a bunch of options for doing alters (creating primary keys from many existing cols) and merging / dumps.

SQLite table in RAM instead of FLASH

We're currently working on a python project that basically reads and writes M2M data into/from a SQLite database. This database consists of multiple tables, one of them storing current values coming from the cloud. This last table is worrying me a bit since it's being written very often and the application runs on a flash drive.
I've read that virtual tables could be the solution. I've thought of converting the critical table into a virtual one and then linking its contents to a real file (XML or JSON) stored in RAM (/tmp, for example, in Debian). I've been reading this article:
http://drdobbs.com/database/202802959?pgno=1
that explains more or less how to do what I want. It's quite complex and I think that this is not very doable using Python. Maybe we need to develop our own sqlite extension, I don't know...
Any idea how to "place" our problematic table in RAM while the rest of the database stays in flash? Any better/simpler approach for taking the virtual table route in Python?

A very simple, SQL-only solution for creating an in-memory table is SQLite's ATTACH command with the special ":memory:" pseudo-filename:
ATTACH DATABASE ":memory:" AS memdb;
CREATE TABLE memdb.my_table (...);
Since the whole database "memdb" is kept in RAM, the data will be lost once you close the database connection, so you will have to take care of persistence by yourself.
One way to do it could be:
Open your main SQLite database file
Attach an in-memory secondary database
Duplicate your performance-critical table in the in-memory database
Run all queries on the duplicate table
Once done, write the in-memory table back to the original table (BEGIN; DELETE FROM real_table; INSERT INTO real_table SELECT * FROM memory_table; COMMIT;)
But the best advice I can give you: Make sure that you really have a performance problem, the simple solution could just as well be fast enough!
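From Python, the steps above could look roughly like this with the standard sqlite3 module. The function names are made up for the sketch, and the table name is interpolated directly into the SQL, so it must come from your own code and never from user input.

```python
import sqlite3


def open_with_memory_copy(path, table):
    """Open the on-disk database and duplicate table into RAM as memdb.<table>."""
    conn = sqlite3.connect(path)
    conn.execute("ATTACH DATABASE ':memory:' AS memdb")
    # CREATE TABLE ... AS SELECT copies the rows but not indexes or constraints.
    conn.execute(f"CREATE TABLE memdb.{table} AS SELECT * FROM main.{table}")
    conn.commit()
    return conn


def flush_to_disk(conn, table):
    """Replace the on-disk table contents with the in-memory copy, atomically."""
    with conn:  # commits on success, rolls back on error
        conn.execute(f"DELETE FROM main.{table}")
        conn.execute(f"INSERT INTO main.{table} SELECT * FROM memdb.{table}")
```

Frequent writes then go to memdb.<table> (the prefix matters, because an unqualified name still resolves to the on-disk table), and flush_to_disk is called only as often as you can afford to lose data.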

Use an in-memory data structure server. Redis is a sexy option, and you can easily implement a table using lists. Also, it comes with a decent python driver.
