I'm doing a web scraping + data analysis project that consists of scraping product prices every day, cleaning the data, and storing it in a PostgreSQL database. The final user won't have access to the data in this database (the daily scrapes grow huge, so eventually I won't be able to upload them to GitHub), but I want to explain how to replicate the project. The steps are basically:
Scraping with Selenium (Python) and saving the raw data into CSV files (already on GitHub);
Reading these CSV files, cleaning the data, and storing it in the database (the cleaning script is already on GitHub);
Retrieving the data from the database to create dashboards and anything else I want (not yet implemented).
To clarify, my question is about how I can teach someone who sees my project to replicate it, given that this person won't have the database details (tables, columns). My idea is:
Add SQL scripts in a folder showing how to create the database skeleton (same tables and columns);
Add info to the README, such as how to set the environment variables needed to access the database (rough example below).
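For example, the README could point to a sql/ folder and environment-based connection settings, roughly like this (file and variable names are just placeholders):

import os
import psycopg2  # assuming psycopg2 is the Postgres driver in use

# Connection details come from environment variables documented in the README;
# the variable names here are only placeholders.
conn = psycopg2.connect(
    host=os.environ["DB_HOST"],
    port=os.environ.get("DB_PORT", "5432"),
    dbname=os.environ["DB_NAME"],
    user=os.environ["DB_USER"],
    password=os.environ["DB_PASSWORD"],
)

# Recreate the empty tables from the SQL scripts in the sql/ folder.
with conn, conn.cursor() as cur, open("sql/create_tables.sql") as schema:
    cur.execute(schema.read())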
Is it okay to do that? I'm looking for best practices in this context. Thanks!
I have a subfolder in an S3 bucket to store CSV files. These CSV files all contain data from one specific data source. The data source provides a new CSV file monthly. I have about 4 years' worth of data.
At some point (~2 years ago) the data source decided to change the format of the data. The schema of the CSV changed (some columns were removed). The data is still more or less the same, and everything I want is still there.
I want to use a crawler to register both schemas, preferably in the same table. Ideally, I would like it to recognize the two versions of the schema.
How should I do that?
What I tried
I uploaded all the files in the subfolder and ran a crawler with "Create a single schema for each S3 path" enabled.
Result: I got one table with both schemas merged into one big schema with all the columns from both formats.
I uploaded all the files in the subfolder and ran a crawler with "Create a single schema for each S3 path" disabled.
Result: I got two tables with the two distinct schemas.
Why I need this
The two different schemas need to be processed differently. I'm writing a Python shell job to process the files. My idea was to use the catalog to pull the two different versions of the schema and trigger a different treatment for each file depending on its schema.
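Roughly, what I had in mind looks like this (a sketch only; the database and table names are placeholders, and it assumes the two-table result from the second crawler run):

import boto3

glue = boto3.client("glue")

def catalog_columns(database, table):
    # Pull the column names registered for a table in the Glue Data Catalog.
    response = glue.get_table(DatabaseName=database, Name=table)
    return {col["Name"] for col in response["Table"]["StorageDescriptor"]["Columns"]}

old_columns = catalog_columns("my_database", "source_old_format")  # placeholder names
new_columns = catalog_columns("my_database", "source_new_format")  # placeholder names

def pick_treatment(file_header):
    # Dispatch on whichever registered schema the file's header matches.
    header = set(file_header)
    if header == new_columns:
        return "new-format treatment"
    if header == old_columns:
        return "old-format treatment"
    raise ValueError("Unknown schema: %s" % sorted(header))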
I am trying to create a database for a bunch of HDF5 files that are on my computer (more than a few hundred). This database will be used by many people, and I also need to create an API to access it.
So, any idea how I can create a database that keeps the HDF5 files in a single place?
I need to use Python to create the database.
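To illustrate what I mean by keeping them in a single place, I was picturing something like an index of the files and the datasets inside them, e.g. (just a sketch using h5py and sqlite3; paths and names are placeholders):

import sqlite3
from pathlib import Path

import h5py  # assuming the files can be opened with h5py

# Index each file and the datasets it contains; an API could query this index
# instead of walking the filesystem every time.
conn = sqlite3.connect("hdf5_index.db")
conn.execute(
    "CREATE TABLE IF NOT EXISTS datasets (file_path TEXT, dataset_name TEXT, shape TEXT)"
)

for path in Path("hdf5_files").glob("*.h5"):  # placeholder folder
    with h5py.File(path, "r") as hdf:
        def register(name, obj):
            if isinstance(obj, h5py.Dataset):
                conn.execute(
                    "INSERT INTO datasets VALUES (?, ?, ?)",
                    (str(path), name, str(obj.shape)),
                )
        hdf.visititems(register)

conn.commit()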
I need to export information from a legacy database system. The database is a 'Progress' database and the information is stored in a file with extension .b1
What is the easiest way to export all database tables to a text file?
The .b1 file is a part of the Progress database but actually not the database itself. It contains the "before-image" data, which is used for keeping track of transactions so that the database can undo them in case of error, rollback, etc. The data in that file will really not help you.
What you would want are the database files, usually named .db, .d1, .d2, .d3, etc.
However, reading those (binary) files will be very tricky. I'm not even sure there are any specifications on how they are built. It would be a lot easier to use the Progress built-in tools for dumping all data as text files. Those text files could easily be read by some simple programs in Python. If you have the database installed on a system you will find a directory with programs to serve the database, etc. There you will also find some utilities.
Depending on the version of the OS and of Progress it might look a bit different. You would want to enter the Data Administration utility and go into Admin => Dump Data and Definitions.
If you take a look at the resulting files (.df for the data definitions/schema and .d for the data itself) you should be able to work out how they're formatted. Relations are not stored in the database at all; in a Progress environment they basically only exist in the application accessing the DB.
You can also select Export Data for various formats ("Text" is probably most interesting).
If you can access the Progress environment programmatically, it might even be easier to write a small program that exports individual tables. This will create a semicolon-delimited file for "table1":
OUTPUT TO C:\temp\table1.txt.
FOR EACH table1 NO-LOCK:
EXPORT DELIMITER ";" table1.
END.
OUTPUT CLOSE.
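On the Python side, that semicolon-delimited export is then easy to read with the csv module, for example (the column names here are made up; take the real ones from the .df dump):

import csv

# Column names are illustrative; the real ones come from the .df schema dump.
columns = ["custnum", "name", "balance"]

with open(r"C:\temp\table1.txt", newline="") as exported:
    for row in csv.reader(exported, delimiter=";"):
        record = dict(zip(columns, row))
        print(record)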
Note: Scroll down to the Background section for useful details. Assume the project uses Python/Django and South in the following illustration.
What's the best way to import the following CSV
"john","doe","savings","personal"
"john","doe","savings","business"
"john","doe","checking","personal"
"john","doe","checking","business"
"jemma","donut","checking","personal"
Into a PostgreSQL database with the related tables Person, Account, and AccountType considering:
Admin users can change the database model and CSV import-representation in real-time via a custom UI
The saved CSV-to-Database table/field mappings are used when regular users import CSV files
So far, two approaches have been considered:
ETL-API Approach: Providing an ETL API with a spreadsheet, my CSV-to-Database table/field mappings, and connection info for the target database. The API would then load the spreadsheet and populate the target database tables. Looking at pygrametl, I don't think what I'm aiming for is possible. In fact, I'm not sure any ETL APIs do this.
Row-level Insert Approach: Parsing the CSV-to-Database table/field mappings, parsing the spreadsheet, and generating SQL inserts in "join-order".
I implemented the second approach but am struggling with algorithm defects and code complexity. Is there a Python ETL API out there that does what I want? Or an approach that doesn't involve reinventing the wheel?
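For context, a simplified, purely illustrative version of what one of these CSV-to-Database mappings looks like conceptually (table and column names are made up):

# Illustrative only: one entry per target table, listed in "join-order".
csv_to_db_mapping = [
    {"table": "person", "columns": {"first_name": 0, "last_name": 1}},
    {"table": "account_type", "columns": {"name": 3}},
    {"table": "account",
     "columns": {"kind": 2},
     "foreign_keys": {"person_id": "person", "account_type_id": "account_type"}},
]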
Background
The company I work at is looking to move hundreds of project-specific design spreadsheets hosted in SharePoint into databases. We're near completing a web application that meets the need by allowing an administrator to define/model a database for each project, store spreadsheets in it, and define the browse experience. At this stage of completion, transitioning to a commercial tool isn't an option. Think of the web application as a django-admin alternative, though it isn't, with a DB modeling UI, CSV import/export functionality, customizable browsing, and modularized code to address project-specific customizations.
The implemented CSV import interface is cumbersome and buggy, so I'm trying to get feedback and find alternative approaches.
How about splitting the problem into two separate problems?
Create a Person class which represents a person in the database. This could use Django's ORM, or extend it, or you could do it yourself.
Now you have two issues:
Create a Person instance from a row in the CSV.
Save a Person instance to the database.
Now, instead of just CSV-to-Database, you have CSV-to-Person and Person-to-Database. I think this is conceptually cleaner. When the admins change the schema, that changes the Person-to-Database side. When the admins change the CSV format, they're changing the CSV-to-Person side. Now you can deal with each separately.
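A rough sketch of what I mean (not tied to any particular ORM; the names, columns, and DB-API connection are just an example):

class Person:
    def __init__(self, first_name, last_name):
        self.first_name = first_name
        self.last_name = last_name

    @classmethod
    def from_csv_row(cls, row):
        # CSV-to-Person: only this method knows the CSV column layout.
        return cls(first_name=row[0], last_name=row[1])

    def save(self, connection):
        # Person-to-Database: only this method knows the table and column names.
        with connection.cursor() as cursor:
            cursor.execute(
                "INSERT INTO person (first_name, last_name) VALUES (%s, %s)",
                (self.first_name, self.last_name),
            )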
Does that help any?
I write import sub-systems almost every month at work, and since I do that kind of task so much, I wrote django-data-importer some time ago. This importer works like a Django form and has readers for CSV, XLS, and XLSX files that give you lists of dicts.
With the data_importer readers you can read a file into a list of dicts, iterate over it with a for loop, and save the lines to the DB.
With the importer you can do the same, but with the bonus of validating each field of each line, logging errors and actions, and saving it all at the end.
Please take a look at https://github.com/chronossc/django-data-importer. I'm pretty sure it will solve your problem and will help you with processing any kind of CSV file from now on :)
To solve your problem I suggest using data-importer with Celery tasks. You upload the file and fire the import task via a simple interface. The Celery task sends the file to the importer, and you can validate lines, save them, and log errors. With some effort you can even show the task's progress to the users who uploaded the sheet.
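A bare-bones sketch of the task side (using a plain CSV reader here instead of the data_importer classes, just to show the shape of it; save_row and log_error are hypothetical hooks):

import csv

from celery import shared_task

@shared_task
def import_csv(path):
    # Read the uploaded file line by line, validate/save, and log failures.
    with open(path, newline="") as uploaded:
        for line_number, row in enumerate(csv.DictReader(uploaded), start=1):
            try:
                save_row(row)  # hypothetical: your importer/model save
            except ValueError as error:
                log_error(line_number, error)  # hypothetical logging hook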
I ended up taking a few steps back to address this problem, per Occam's razor, using updatable SQL views. It meant a few sacrifices:
Removing the South.DB-dependent real-time schema administration API, dynamic model loading, and dynamic ORM syncing;
Defining models.py and an initial South migration by hand.
This allows for a simple approach to importing flat datasets (CSV/Excel) into a normalized database:
Define unmanaged models in models.py for each spreadsheet;
Map those to updatable SQL views (INSERT/UPDATE INSTEAD SQL RULEs) in the initial South migration, adhering to the spreadsheet field layout;
Iterate through the CSV/Excel spreadsheet rows and perform an INSERT INTO <VIEW> (<COLUMNS>) VALUES (<CSV-ROW-FIELDS>); as sketched below.
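Very roughly, the pieces look like this (view, table, and column names are placeholders; the view itself and its INSERT-INSTEAD rule live in the initial migration):

from django.db import connection, models

class AccountSheet(models.Model):
    # Unmanaged model mirroring the spreadsheet layout; db_table points at an
    # updatable SQL view, not a real table (placeholder names throughout).
    first_name = models.TextField()
    last_name = models.TextField()
    account = models.TextField()
    account_type = models.TextField()

    class Meta:
        managed = False
        db_table = "account_sheet_view"

def import_rows(csv_rows):
    # Each spreadsheet row becomes a plain INSERT against the view; the
    # INSERT-INSTEAD rule fans it out to the normalized tables.
    with connection.cursor() as cursor:
        for row in csv_rows:
            cursor.execute(
                "INSERT INTO account_sheet_view "
                "(first_name, last_name, account, account_type) "
                "VALUES (%s, %s, %s, %s)",
                row,
            )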
Here is another approach that I found on GitHub. Basically, it detects the schema and allows overrides. Its whole goal is just to generate raw SQL to be executed by psql and/or whatever driver.
https://github.com/nmccready/csv2psql
% python setup.py install
% csv2psql --schema=public --key=student_id,class_id example/enrolled.csv > enrolled.sql
% psql -f enrolled.sql
There are also a bunch of options for doing alters (creating primary keys from many existing columns) and merging/dumps.