I am trying to create a database for a bunch of HDF5 files that are on my computer (more than a few hundred). This database will be used by many people, and I also need to create an API for accessing it.
So, any idea how I can create a database that holds all the HDF5 files in a single place?
I need to use Python to create the database.
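For illustration, this is roughly the kind of setup I'm picturing: a rough sketch that indexes the files' metadata into SQLite (assuming h5py; the paths and column names are just placeholders, not a final design).

import glob
import sqlite3

import h5py

# Rough sketch: walk a directory of HDF5 files and record each dataset's
# location, shape and dtype in a small SQLite catalog.
conn = sqlite3.connect("hdf5_catalog.db")
conn.execute(
    """CREATE TABLE IF NOT EXISTS datasets (
           file_path TEXT,
           dataset_name TEXT,
           shape TEXT,
           dtype TEXT
       )"""
)

for path in glob.glob("/data/hdf5/**/*.h5", recursive=True):  # placeholder directory
    with h5py.File(path, "r") as f:
        def record(name, obj):
            # Only datasets (not groups) carry shape/dtype information.
            if isinstance(obj, h5py.Dataset):
                conn.execute(
                    "INSERT INTO datasets VALUES (?, ?, ?, ?)",
                    (path, name, str(obj.shape), str(obj.dtype)),
                )
        f.visititems(record)

conn.commit()
conn.close()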
Related
Given a parquet file, how can I create the table associated with it in my Redshift database? The parquet file is Snappy-compressed.
If you're dealing with multiple files, especially over the long term, then I think the best solution is to upload them to an S3 bucket and run a Glue crawler.
In addition to populating the Glue data catalog, you can also use this information to configure external tables for Redshift Spectrum, and create your on-cluster tables using CREATE TABLE AS SELECT.
If this is just a one-off task, then I've used parquet-tools in the past. The version I've used is a Java library, but I see that there's also a version on PyPI.
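For the one-off case, a rough sketch of the same idea in Python with pyarrow (the file name is a placeholder and the type mapping is deliberately incomplete): read the parquet schema and print a matching CREATE TABLE statement that you can then run against Redshift.

import pyarrow.parquet as pq

# Map a few common Arrow types to Redshift column types; extend as needed.
TYPE_MAP = {
    "int32": "INTEGER",
    "int64": "BIGINT",
    "double": "DOUBLE PRECISION",
    "string": "VARCHAR(256)",
    "bool": "BOOLEAN",
}

# read_schema works on Snappy-compressed parquet files as well.
schema = pq.read_schema("data.snappy.parquet")  # placeholder file name
columns = [
    f'    {field.name} {TYPE_MAP.get(str(field.type), "VARCHAR(256)")}'
    for field in schema
]
print("CREATE TABLE my_table (\n" + ",\n".join(columns) + "\n);")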
I'm doing a web scraping + data analysis project that consists of scraping product prices every day, cleaning the data, and storing it in a PostgreSQL database. The final user won't have access to the data in this database (the daily scrapes add up, so eventually I won't be able to upload the data to GitHub), but I want to explain how to replicate the project. The steps are basically:
Scraping with Selenium (Python) and saving the raw data into CSV files (already on GitHub);
Reading these CSV files, cleaning the data, and storing it in the database (the cleaning script is already on GitHub);
Retrieving the data from the database to create dashboards and anything else I want (not yet implemented).
To clarify, my question is about how I can show someone who comes across my project how to replicate it, given that this person won't have the database info (tables, columns). My idea is:
Add SQL queries in a folder, showing how to create the database skeleton (same tables and columns);
Add info to the README, such as how to set the environment variables needed to access the database.
Is it okay to do that? I'm looking for best practices in this context. Thanks!
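To make the idea concrete, a minimal sketch of the kind of setup script I have in mind (assuming psycopg2 and a schema.sql file kept in the repo; the variable names are just examples):

import os

import psycopg2

# Connection details come from environment variables, so the repository never
# contains real credentials; sql/schema.sql holds the CREATE TABLE statements.
conn = psycopg2.connect(
    host=os.environ["DB_HOST"],
    dbname=os.environ["DB_NAME"],
    user=os.environ["DB_USER"],
    password=os.environ["DB_PASSWORD"],
)

with conn, conn.cursor() as cur, open("sql/schema.sql") as f:
    cur.execute(f.read())  # creates the empty tables (the "skeleton")

conn.close()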
I have a directory full of CSV files which I would like to use as the back-end of my Django web app, but I am struggling to populate my Django models with the data from the CSV files. This is because some of the files are join tables connecting the other CSV files in the directory. When I try to map the relationships from the CSV files to the SQL database, I always run into problems.
Is there a better or more straightforward way of populating these models?
Below is an example of loading a fairly large dataset (the entire USDA food nutrition database) from many CSV files that share IDs between them. As long as there is referential integrity in the CSV files (the IDs that cross-reference between the files are valid), it is a relatively straightforward approach.
https://github.com/pgalfi/homeapps/blob/master/foodtrack/management/commands/loadCSV.py
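The general pattern for this kind of load looks roughly like the following sketch (the model and column names here are made up for illustration, not your actual schema): load the "parent" CSVs first, keep a mapping from CSV IDs to model instances, then resolve the join-table rows against that mapping.

import csv

from django.core.management.base import BaseCommand

from myapp.models import Food, Nutrient, FoodNutrient  # hypothetical models


class Command(BaseCommand):
    help = "Load CSV files that reference each other by ID (simplified sketch)"

    def handle(self, *args, **options):
        # 1. Load the "parent" tables first and remember their CSV IDs.
        foods, nutrients = {}, {}
        with open("data/food.csv") as f:
            for row in csv.DictReader(f):
                foods[row["id"]] = Food.objects.create(name=row["name"])
        with open("data/nutrient.csv") as f:
            for row in csv.DictReader(f):
                nutrients[row["id"]] = Nutrient.objects.create(name=row["name"])

        # 2. Load the join table, resolving CSV IDs to the objects created above.
        with open("data/food_nutrient.csv") as f:
            for row in csv.DictReader(f):
                FoodNutrient.objects.create(
                    food=foods[row["food_id"]],
                    nutrient=nutrients[row["nutrient_id"]],
                    amount=row["amount"],
                )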
Could you share what problems you are running into with the relationships?
I have a subfolder in an S3 bucket where I store CSV files. These CSV files all contain data from one specific data source. The data source provides a new CSV file monthly, and I have about 4 years' worth of data.
At some point (~2 years ago) the data source decided to change the format of the data. The schema of the CSV changed (some columns were removed). The data is still more or less the same, and everything I want is still there.
I want to use a crawler to register both schemas, preferably in the same table. Ideally, I would like it to recognize the two versions of the schema.
How should I do that?
What I tried
I uploaded all the files to the subfolder and ran a crawler with "Create a single schema for each S3 path" enabled.
Result: I got one table with both schemas merged into one big schema containing all the columns from both formats.
I uploaded all the files to the subfolder and ran a crawler with "Create a single schema for each S3 path" disabled.
Result: I got two tables with the two distinct schemas.
Why I need this
The two different schemas need to be processed differently. I'm writing a Python shell job to process the files. My idea is to use the catalog to pull the two different versions of the schema and trigger a different treatment for each file depending on the schema of that file.
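To illustrate, this is roughly the shell-job logic I have in mind (a sketch with boto3; the database and table names are placeholders): pull the column lists for both catalog tables and pick the treatment by comparing each file's header against them.

import csv

import boto3

glue = boto3.client("glue")
s3 = boto3.client("s3")

# Placeholder names for the Glue database and the two tables the crawler created.
DATABASE = "my_catalog_db"
OLD_TABLE = "source_old_format"
NEW_TABLE = "source_new_format"


def catalog_columns(table_name):
    # Column names as registered in the Glue data catalog for one table.
    table = glue.get_table(DatabaseName=DATABASE, Name=table_name)
    return {c["Name"].lower() for c in table["Table"]["StorageDescriptor"]["Columns"]}


old_cols = catalog_columns(OLD_TABLE)
new_cols = catalog_columns(NEW_TABLE)


def process(bucket, key):
    # Read only the header line of the CSV and compare it (case-insensitively)
    # with the two schemas from the catalog.
    body = s3.get_object(Bucket=bucket, Key=key)["Body"]
    header = next(csv.reader([next(body.iter_lines()).decode("utf-8")]))
    columns = {h.strip().lower() for h in header}
    if columns == new_cols:
        print(f"{key}: new format")  # new-format treatment goes here
    elif columns == old_cols:
        print(f"{key}: old format")  # old-format treatment goes here
    else:
        raise ValueError(f"Unexpected schema in {key}")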
I need to export information from a legacy database system. The database is a 'Progress' database and the information is stored in a file with the extension .b1.
What is the easiest way to export all database tables to a text file?
The .b1 file is a part of the Progress database but not actually the database itself. It contains the "before-image" data, which is used for keeping track of transactions so that the database can undo them in case of errors/rollbacks etc. The data in that file will really not help you.
What you want are the database files, usually named .db, .d1, .d2, .d3 etc.
However, reading those (binary) files will be very tricky. I'm not even sure there are any specifications for how they are built. It would be a lot easier to use the built-in Progress tools for dumping all data as text files. Those text files can easily be read by some simple Python programs. If you have the database installed on a system, you will find a directory with the programs that serve the database etc.; there you will also find some utilities.
Depending on the OS and Progress version it might look a bit different. You want to enter the Data Administration utility and go to Admin => Dump Data and Definitions.
If you take a look at the resulting files, .df for the data definitions (schema) and .d for the data itself, you should be able to work out how they're formatted. Relations are not stored in the database at all; in a Progress environment they basically only exist in the application accessing the DB.
You can also select Export Data for various formats ("Text" is probably most interesting).
If you can access the Progress environment programmatically, it might even be easier to write a small program that exports individual tables. This will create a semicolon-delimited file for "table1":
OUTPUT TO C:\temp\table1.txt.
FOR EACH table1 NO-LOCK:
    EXPORT DELIMITER ";" table1.
END.
OUTPUT CLOSE.
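Reading the resulting file back in Python is then straightforward, something along these lines (using the delimiter and file name from the snippet above):

import csv

# Reads the semicolon-delimited file produced by the EXPORT statement above.
# Character fields are exported in double quotes, which the csv module handles.
with open(r"C:\temp\table1.txt", newline="") as f:
    for row in csv.reader(f, delimiter=";"):
        print(row)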