dump CSV file from Django query to Github - python

We want to automate a process through django admin where, whenever a user makes a change to a record (or adds/deletes a record), a CSV file is created and then dumped into a Github repository with a commit message specified by the person who made the change.
Creating the csv file from a queryset is easy enough... But how would we go about then getting that csv file to a folder that is git initialized so that we can commit it to a repository?
Any ideas would be great. Essentially we're looking for a way of tracking specific changes to the database. With CSV files in github, we can really easily follow the changes, and we want to leverage that.
cheers

If you can already create your csv files, the next step would be to talk to GitHub via its API, or to keep a local clone of the git repo that gets synced after each file is created.
But if I may ask, why do you want to do this with csv files in a GitHub repo? My first response to a requirement like that would be to log changes with the Python logging infrastructure, or to create an additional model that tracks the specific changes in the db.
Eventually this could also meet your requirements: https://django-simple-history.readthedocs.io/en/latest/
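For the local-clone route, a rough sketch using the GitPython package (an assumption on my part; for the API route something like PyGithub would play the same role) might look like:

# minimal sketch of the local-clone route, assuming the GitPython package
# (pip install GitPython) and an existing clone at repo_path
import csv
import os
from git import Repo

def export_and_push(queryset, repo_path, csv_name, commit_message):
    repo = Repo(repo_path)  # repo_path is a git-initialized working copy
    csv_path = os.path.join(repo_path, csv_name)
    with open(csv_path, 'w', newline='') as f:
        writer = csv.writer(f)
        writer.writerow(['id', 'name'])  # hypothetical columns
        for obj in queryset:
            writer.writerow([obj.pk, str(obj)])
    repo.index.add([csv_name])
    repo.index.commit(commit_message)  # the message supplied by the admin user
    repo.remote('origin').push()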

This doesn't exactly answer the question, but have you thought of using something like django-simple-history?
It's a really easy-to-use Django package that tracks all Django model state on every create/update/delete. It should be much easier to get going with than fiddling around pushing CSVs to GitHub.
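For reference, the wiring is roughly this small (model name is hypothetical; 'simple_history' also needs to be added to INSTALLED_APPS):

from django.db import models
from simple_history.models import HistoricalRecords

class TrackedRecord(models.Model):  # hypothetical model
    name = models.CharField(max_length=100)
    history = HistoricalRecords()  # records every create/update/delete in a history table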

Related

Mapping fields inside other fields

Hello, I would like to make an app that allows the user to import data from a source of their choice (Airtable, xls, csv, JSON) and export it to JSON, which will be pushed to an SQLite database using an API.
The "core" functionality of the app is that it allows the user to create a "template" that "maps" the source columns onto the destination columns. Which source column(s) go to which destination column is up to the user. I am attaching two photos here (taken from Airtable/Zapier), so you can get a better idea of the end result:
(screenshots: "adding fields inside fields" in Airtable and in Zapier)
I would like to know if you can recommend a library or a way to go about this problem. I have tried to look for some Python or Node.js libraries, but I am torn between using ETL libraries, the mapping/zipping features some people recommended, and coding my own classes. Do you know any libraries that allow doing the same thing as Airtable/Zapier? Any suggestions?
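A bare-bones sketch of the kind of mapping being described, in plain Python (all the column names here are made up):

import csv

# the "template" a user would build in the UI: destination field -> source column(s)
mapping = {
    'full_name': ['First Name', 'Last Name'],   # several source columns joined together
    'email':     ['E-mail'],                    # a single source column
}

def apply_mapping(row, mapping):
    # row is a dict from the source (csv.DictReader here; an xls/JSON reader works the same way)
    return {dest: ' '.join(row[src] for src in sources)
            for dest, sources in mapping.items()}

with open('source.csv', newline='') as f:
    records = [apply_mapping(row, mapping) for row in csv.DictReader(f)]
# records is now a list of dicts shaped like the destination, ready to serialize to JSON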
Saving files in the database is really bad practice, since it takes up a lot of database storage space and adds latency to the communication.
I highly recommend saving them on disk and storing the path in the database.
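In Django terms (just a sketch; the model and field names are made up), the usual way to get this is a FileField, which keeps the file on disk under MEDIA_ROOT and stores only its path in the database:

from django.db import models

class SourceFile(models.Model):  # hypothetical model
    original = models.FileField(upload_to='imports/')  # file lives on disk; the DB stores only the relative path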

Import csv data to web2py database and process uploads

I've made a really simple single-user database application with web2py, to be deployed to a desktop machine. The reason I chose web2py is its simplicity and its unintrusive built-in web server.
My problem is that I need to migrate an existing database from another application, which I've preprocessed and prepared into a csv file that can now be imported perfectly into web2py's sqlite database.
Now, I have a problem with an 'upload' field in one of the tables, which corresponds to a small image. I've filled that field in the csv with the name of the corresponding .jpg file that I extracted from the original database. The problem is that I haven't figured out how to get those files correctly into the uploads folder, as the web2py engine automatically changes the filename of users' uploads to a safe format, and copying my files straight into the folder does not work.
My question is, does anyone know a proper way to include this image collection in the uploads folder? I don't know if there is a way to disable this protection or if I have to manually rename each file to a valid hash. I've also considered coding an automatic insert process into the database...
Thanks all for your attention!
EDIT (a working example):
An example database:
db.define_table('product',
    Field('name'),
    Field('color'),
    Field('picture', 'upload'),
)
Then using the default appadmin module from my application I import a csv file with entries of the form:
product.name,product.color,product.picture
"p1","red","p1.jpg"
"p2","blue","p2.jpg"
Then in my application I have the usual download function:
def download():
    return response.download(request, db)
Which I call to retrieve the images uploaded to the database, for example to include one in a view:
<img src="{{=URL('download', args=product.picture)}}" />
So my problem is that I have all the images corresponding to the database records, and I need to import them into my application by properly including them in the uploads folder.
If you want the files to be named via the standard web2py file upload mechanism (which is a good idea for security reasons) and easily downloaded via the built-in response.download() method, then you can do something like the following.
In /yourapp/controllers/default.py:
def copy_files():
    import os
    for row in db().select(db.product.id, db.product.picture):
        picture = open(os.path.join(request.folder, 'private', row.picture), 'rb')
        row.update_record(picture=db.product.picture.store(picture, row.picture))
    return 'Files copied'
Then place all the files in the /yourapp/private directory and go to the URL /default/copy_files (you only need to do this once). This will copy each file into the /uploads directory and rename it, storing the new name in the db.product.picture field.
Note, the above function doesn't have to be a controller action (though if you do it that way, you should remove the function when finished). Instead, it could be a script that you run via the web2py command line (needs to be run in the app environment to have access to the database connection and model, as well as reference to the proper /uploads folder) -- in that case, you would need to call db.commit() at the end (this is not necessary during HTTP requests).
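A minimal sketch of that script variant (assuming the app is called yourapp), run from the web2py directory with something like python web2py.py -S yourapp -M -R copy_files_script.py:

# copy_files_script.py -- relies on -S/-M having set up the app environment (db, request)
import os

for row in db().select(db.product.id, db.product.picture):
    picture = open(os.path.join(request.folder, 'private', row.picture), 'rb')
    row.update_record(picture=db.product.picture.store(picture, row.picture))

db.commit()  # needed here; commits are only automatic during HTTP requests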
Alternatively, you can leave things as they are and instead (a) manage uploads and downloads manually instead of relying on web2py's built-in mechanisms, or (b) create custom_store and custom_retrieve functions (unfortunately, I don't think these are well documented) for the picture field, which will bypass web2py's built-in store and retrieve functions. Life will probably be easier, though, if you just go through the one-time process described above.

Importing a CSV file into a PostgreSQL DB using Python-Django

Note: Scroll down to the Background section for useful details. In the following illustration, assume the project uses Python-Django and South.
What's the best way to import the following CSV
"john","doe","savings","personal"
"john","doe","savings","business"
"john","doe","checking","personal"
"john","doe","checking","business"
"jemma","donut","checking","personal"
Into a PostgreSQL database with the related tables Person, Account, and AccountType considering:
Admin users can change the database model and CSV import-representation in real-time via a custom UI
The saved CSV-to-Database table/field mappings are used when regular users import CSV files
So far two approaches have been considered
ETL-API Approach: Provide an ETL API with a spreadsheet, my CSV-to-Database table/field mappings, and connection info for the target database. The API would then load the spreadsheet and populate the target database tables. Looking at pygrametl, I don't think what I'm aiming for is possible. In fact, I'm not sure any ETL APIs do this.
Row-level Insert Approach: Parse the CSV-to-Database table/field mappings, parse the spreadsheet, and generate SQL inserts in "join order".
I implemented the second approach but am struggling with algorithm defects and code complexity. Is there a Python ETL API out there that does what I want? Or an approach that doesn't involve reinventing the wheel?
Background
The company I work at is looking to move hundreds of project-specific design spreadsheets hosted in SharePoint into databases. We're close to completing a web application that meets the need by allowing an administrator to define/model a database for each project, store spreadsheets in it, and define the browse experience. At this stage of completion, transitioning to a commercial tool isn't an option. Think of the web application as a django-admin alternative (though it isn't one), with a DB modeling UI, CSV import/export functionality, customizable browsing, and modularized code to address project-specific customizations.
The implemented CSV import interface is cumbersome and buggy, so I'm trying to get feedback and find alternate approaches.
How about separating the problem into two separate problems?
Create a Person class which represents a person in the database. This could use Django's ORM, or extend it, or you could do it yourself.
Now you have two issues:
Create a Person instance from a row in the CSV.
Save a Person instance to the database.
Now, instead of just CSV-to-Database, you have CSV-to-Person and Person-to-Database. I think this is conceptually cleaner. When the admins change the schema, that changes the Person-to-Database side. When the admins change the CSV format, they're changing the CSV-to-Person side. Now you can deal with each separately.
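In code, the split might look roughly like this (a sketch; the class and function names are mine, not from the question):

import csv
from dataclasses import dataclass

@dataclass
class Person:
    first_name: str
    last_name: str
    account_type: str    # e.g. "savings" / "checking"
    account_class: str   # e.g. "personal" / "business"

def person_from_row(row):
    # CSV-to-Person: the only place that knows the CSV column order
    return Person(row[0], row[1], row[2], row[3])

def save_person(person):
    # Person-to-Database: the only place that knows the Person/Account/AccountType schema
    ...  # stub

with open('accounts.csv', newline='') as f:
    for row in csv.reader(f):
        save_person(person_from_row(row))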
Does that help any?
I write import sub-systems almost every month at work, and since I do that kind of task so often, some time ago I wrote django-data-importer. This importer works like a Django form and has readers for CSV, XLS and XLSX files that give you lists of dicts.
With the data_importer readers you can read a file into a list of dicts, iterate over it with a for loop and save each line to the DB.
With the importer you can do the same, but with the bonus of validating each field of a line, logging errors and actions, and saving everything at the end.
Please take a look at https://github.com/chronossc/django-data-importer. I'm pretty sure it will solve your problem and help you process any kind of csv file from now on :)
To solve your problem I suggest using data-importer with Celery tasks. You upload the file and fire the import task via a simple interface. The Celery task sends the file to the importer, where you can validate lines, save them and log errors. With some effort you can even show the progress of the task to the user who uploaded the sheet.
I ended up taking a few steps back and, per Occam's razor, addressing this problem with updatable SQL views. It meant a few sacrifices:
Removing the South.DB-dependent real-time schema administration API, dynamic model loading, and dynamic ORM syncing
Defining models.py and an initial south migration by hand.
This allows for a simple approach to importing flat datasets (CSV/Excel) into a normalized database:
Define unmanaged models in models.py for each spreadsheet
Map each one, in the initial south migration, to an updatable SQL view (using INSERT/UPDATE INSTEAD SQL RULEs) that adheres to the spreadsheet's field layout
Iterate through the CSV/Excel spreadsheet rows and perform an INSERT INTO <VIEW> (<COLUMNS>) VALUES (<CSV-ROW-FIELDS>); for each row
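Step 3 in rough Python, assuming an updatable view named person_account_flat whose columns mirror the CSV (the view and column names here are hypothetical):

import csv
from django.db import connection

def import_flat_csv(path):
    cursor = connection.cursor()
    with open(path, newline='') as f:
        for row in csv.reader(f):
            # the INSTEAD rules on the view fan each flat row out into the
            # normalized Person / AccountType / Account tables
            cursor.execute(
                "INSERT INTO person_account_flat"
                " (first_name, last_name, account_type, account_class)"
                " VALUES (%s, %s, %s, %s)", row)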
Here is another approach that I found on GitHub: csv2psql. Basically it detects the schema and allows overrides. Its whole goal is just to generate raw SQL to be executed by psql or whatever driver you prefer.
https://github.com/nmccready/csv2psql
% python setup.py install
% csv2psql --schema=public --key=student_id,class_id example/enrolled.csv > enrolled.sql
% psql -f enrolled.sql
There are also a bunch of options for doing alters (creating primary keys from many existing cols) and merging / dumps.

What do I need to consider when scaling an application that stores files in the filesystem?

I am interested in making an app where users can upload large files (~2MB) that are converted into html documents. This application will not have a database. Instead, these html files are stored in a particular writable directory outside of the document source tree, so this directory will grow larger and larger as more files are added to it. Users should be able to view these html files by visiting the appropriate url. All security concerns aside, what do I need to be worried about if this directory continues to grow? Will accessing the files inside take longer when there are more of them? Will it potentially crash because of this? Should I create a new directory every 100 files or so to prevent this?
If it is important, I want to make this app using Pyramid and Python.
You might want to partition the directories by user, app or similar so that it's easy to manage anyway - like if a user stops using the service you could just delete their directory. Also I presume you'll be zipping them up. If you keep it well decoupled then you'll be able to change your mind later.
I'd be interested to see how using something like SQLite would work for you, as you could have a sqlite db per partitioned directory.
I presume the HTML files are larger than the files that were uploaded, so why store the big HTML file?
Are things like MongoDB etc. out of the question? As your app scales to multiple servers, you have the issue of accessing files that live on a different server, unless you pick the right server in the first place using some technique. Then it's possible you've got servers sitting idle because no one wants their documents.
Why the limitation of just storing files in a directory? Is it a POC?
EDIT
I find value in reading things like http://blog.fogcreek.com/the-trello-tech-stack/, and I'd advise you to find a site already doing what you do and read about their tech stack.
As someone already commented, why not use Amazon S3 or similar?
Ask yourself realistically how many users you imagine, and whether you really want to spend a lot of energy worrying about being the next Facebook and building the ultimate backend tech stack when you could get your stuff out there being used.
Years ago I worked on a system that stored insurance certificates on the filesystem; we used to run out of inodes!
Dare I say it's a case of suck it and see what works for you and your app.
EDIT
HAProxy, I believe, is meant to handle those load-balancing concerns.
As a user, I imagine I'd want to go to http://docs.yourdomain.com/myname/document.doc,
although I presume there are security concerns with it being such an obvious name.
This greatly depends on your filesystem. You might want to look up which problems the git folks encountered (they also use a purely filesystem-based database).
In general, it is wise to split that directory up, for example by taking the first two or three letters of the file name (or of a hash of it) and grouping the files into subdirectories based on that key. You'd have a structure like:
uploaddir/
    00/
        files whose name's sha1 starts with 00
    01/
        files whose name's sha1 starts with 01
and so on. This takes some load off the filesystem by partitioning the possibly large directories. If you want to be sure that no user can mount a denial-of-service attack by deliberately uploading files whose names hash to the same initial characters, you can also seed the hash differently, salt it, or something along those lines.
Specifically, the effects of large directories are pretty file-system specific. Some might become slow, some may cope really well, others may have per-directory limits for files.
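A small sketch of that sharding scheme (the salt and the two-character prefix are just the illustrative choices from above):

import hashlib
import os

UPLOAD_ROOT = 'uploaddir'        # hypothetical root directory
SALT = 'some-app-secret'         # optional, guards against crafted hash collisions

def sharded_path(filename):
    digest = hashlib.sha1((SALT + filename).encode('utf-8')).hexdigest()
    directory = os.path.join(UPLOAD_ROOT, digest[:2])   # '00' .. 'ff'
    os.makedirs(directory, exist_ok=True)
    return os.path.join(directory, filename)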

Force GAE to update all files on deployment

Is there a way to force GAE to upload and update all files, even if it thinks they don't require any updates?
Clarification - If I make quick back-to-back updates, I find that certain files that were definitely modified refuse to be updated online. Apart from assigning version numbers to force the update, which is very painful, is there another way?
EDIT - I'm referring to javascript files
Those files do get updated; you just don't see the update because of caching that happens somewhere along the chain. In order to get the latest files, load the file with a slug (e.g. http://myapp.com/scripts/script.js?slug) and update the slug each time you deploy your application.
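One way to avoid updating the slug by hand is to derive it from the deployed version; a sketch (GAE sets the CURRENT_VERSION_ID environment variable, which changes on each deployment):

import os

APP_VERSION = os.environ.get('CURRENT_VERSION_ID', 'dev')  # fallback for local dev

def versioned(path):
    # e.g. versioned('/scripts/script.js') -> '/scripts/script.js?v=<deployed version>'
    return '%s?v=%s' % (path, APP_VERSION)

# in a template: <script src="{{ versioned('/scripts/script.js') }}"></script>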
