I'm in the process of porting a MySQL database over to a Heroku-hosted, dedicated PostgreSQL instance. I understand how to get the initial data over to Heroku. However, there is a daily "feed" of data from an external company that will need to be imported each day. It is pushed up to an FTP server as a zip file containing several different CSV files. Normally, I would just scp it over to the Postgres box and have a cron job run a "COPY tablename FROM path/to/file.csv" to import the data. However, using Heroku has me a bit baffled as to the best way to do this. Note: I've seen and reviewed the Heroku dev article on importing data, but that covers a dump file; I'm just dealing with a daily import from a CSV file.
Does anyone do something similar to this on Heroku? If so, can you give any advice on the best way to handle it?
Just a bit more info: my application is Python/Django 1.3.3 on the Cedar stack, and my files can be a bit large; some of them can be over 50K records. So looping through them with the Django ORM is probably going to be a bit slow (but it still might be the best/only solution).
Two options:
Boot up a non-Heroku EC2 instance, fetch the zip from FTP, unzip it, and initiate the copy from there. By using the COPY ... FROM STDIN option (http://www.postgresql.org/docs/9.1/static/sql-copy.html) you can tell Postgres that the data is coming from the client connection, as opposed to a file on the server's filesystem, which you don't have access to. A sketch of this approach follows the two options.
How large is the file? It might fit in a dyno's ephemeral filesystem, so a one-off job could download the file from the FTP server and do the whole process from within a dyno. Once the process exits, the filesystem data goes away.
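A minimal sketch of the COPY ... FROM STDIN approach, assuming psycopg2 is available and a DATABASE_URL environment variable points at the Heroku Postgres instance; the FTP host, credentials, zip/CSV names, and table name are placeholders:

    import io
    import os
    import zipfile
    from ftplib import FTP

    import psycopg2

    # Fetch the daily zip from the FTP server into memory (placeholder host/credentials).
    ftp = FTP("ftp.example.com")
    ftp.login("user", "password")
    buf = io.BytesIO()
    ftp.retrbinary("RETR daily_feed.zip", buf.write)
    ftp.quit()

    # Stream each CSV straight into Postgres over the client connection,
    # so nothing has to exist on the database server's filesystem.
    conn = psycopg2.connect(os.environ["DATABASE_URL"])
    with zipfile.ZipFile(buf) as zf, conn, conn.cursor() as cur:
        with zf.open("tablename.csv") as csv_file:
            cur.copy_expert("COPY tablename FROM STDIN WITH CSV HEADER", csv_file)
    conn.close()

Since the CSV is streamed from memory, this works equally well as a cron job on a non-Heroku box or from within a one-off dyno.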
I'm a newbie to Django and I'm almost at the stage of deploying a web server.
I was just having some doubts about Django's database. Currently I'm using the default sqlite3 database to store all the user models as well as the info models. I'm thinking of using AWS to deploy my web server.
So when I get to that stage, should I continue with sqlite, or should I switch to one of AWS's databases or something like Firebase? If I continue with sqlite, where and how exactly will the information be stored? And what if I switch to something like PostgreSQL: where will the information be stored, and will it be secure/fast (even if I manage to get thousands of users)?
Thanks so much, this question might be really basic but I'm super confused.
sqlite is a flat-file database: it uses a single file in your project to save your data. This is fine in a local environment, but when deploying you need to consider that the server and the database are on the same machine and using the same disk. That means if you accidentally remove the machine (and its disk) used to serve the application, the database itself will be deleted along with all its records.
Plus you will face problems if you try to scale your servers: every server will have its own copy of the database, and syncing all those files will be a huge headache.
If your data is not that important then you can keep using sqlite, but if you are expecting high traffic and a complex DB structure, then I would recommend you consider a database engine like MySQL, or look at the databases offered by Amazon here:
https://aws.amazon.com/products/databases/
For Django, you will need to change the database backend in your settings when using a different database like MySQL, PostgreSQL, or anything else:
https://docs.djangoproject.com/en/3.0/ref/databases/
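For example, switching the project from sqlite3 to PostgreSQL is just a settings change. A minimal sketch; the database name, credentials, and host are placeholders (the host could be an RDS endpoint):

    # settings.py
    DATABASES = {
        "default": {
            "ENGINE": "django.db.backends.postgresql",
            "NAME": "mydatabase",
            "USER": "mydatabaseuser",
            "PASSWORD": "mypassword",
            "HOST": "mydb.example.com",  # e.g. your RDS endpoint
            "PORT": "5432",
        }
    }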
I am in the process of moving an internal company tool, written entirely in Python, to the AWS ecosystem, but I'm having issues figuring out the proper way to set up my data so that it stays organized. This tool is used by people throughout the company, with each person running the tool on their own datasets (which vary from a few megabytes to a few gigabytes in size). Currently, users clone the code to their local machines and then run the tool on their data locally; we are now trying to move this usage to the cloud.
For a single person, it is simple enough to have them upload their data to s3, then point the python code to that data to run the tool, but I'm worried that as more people start using the tool, the s3 storage will become cluttered/disorganized.
Additionally, each person might make slight changes to the python tool in order to do custom work on their data. Our code is hosted in a bitbucket server, and users will be forking the repo for their custom work.
My questions are:
Are S3 and EC2 the only AWS tools needed to support a project like this?
What is the proper way for users to upload their data, run the code, then download their results so that the data stays organized in S3?
What are the best practices for using EC2 in a situation like this? Do people usually spin up a new EC2 for each job or is scheduling multiple jobs on a single EC2 more efficient?
Is there a way to automate the data uploading process that will allow users to easily run the code on their data without needing to know how to code?
If anyone has any input as to how to set up this project, or has links to any relevant guides/documents, it would be greatly appreciated. Thanks!
You can do something like this.
a) A boto3 script to upload data to a specified S3 bucket, with, say, a timestamp appended to the key.
b) Configure the S3 bucket to send a notification over SQS when a new item arrives.
c) Keep 2-3 EC2 machines running, actively listening to SQS.
d) When a new item arrives, a worker gets the key from SQS and processes it, deleting the event from SQS after successful completion.
e) Put the processed data somewhere, delete the key from the bucket, and notify the user by email.
For custom work, users can create a new branch and reference it in the uploaded data; the EC2 worker reads that reference and checks out the required branch, and after the job the branch can be deleted. This can be a single line in the upload with the branch name in it, and it only needs one-time setup. You should probably also use a process manager on EC2 that restarts the worker process if it crashes. A rough sketch of steps (a) and (c)/(d) is below.
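A minimal sketch of (a) and the SQS worker loop in (c)/(d), assuming boto3 is configured with credentials; the bucket name and queue URL are placeholders:

    import json
    import os
    from datetime import datetime

    import boto3

    BUCKET = "my-tool-data"  # placeholder bucket
    QUEUE_URL = "https://sqs.us-east-1.amazonaws.com/123456789012/my-tool-jobs"  # placeholder queue

    def upload_dataset(path, user):
        """(a) Upload a dataset under a per-user, timestamped key."""
        key = f"{user}/{datetime.utcnow():%Y%m%dT%H%M%S}/{os.path.basename(path)}"
        boto3.client("s3").upload_file(path, BUCKET, key)
        return key

    def worker():
        """(c)/(d) Poll SQS for S3 event notifications and process each new key."""
        sqs = boto3.client("sqs")
        while True:
            resp = sqs.receive_message(QueueUrl=QUEUE_URL, WaitTimeSeconds=20)
            for msg in resp.get("Messages", []):
                body = json.loads(msg["Body"])
                for record in body.get("Records", []):
                    key = record["s3"]["object"]["key"]
                    # ... run the tool on s3://BUCKET/<key> and store the results ...
                # (d)/(e): only acknowledge once the job succeeded
                sqs.delete_message(QueueUrl=QUEUE_URL, ReceiptHandle=msg["ReceiptHandle"])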
I used python scripting to do a series of complex queries from 3 different RDS's, and then exported the data into a CSV file. I am now trying to find a way to automate publishing a dashboard that uses this data into Tableau server on a weekly basis, such that when I run my python code, it will generate new data, and subsequently, the dashboard on Tableau server will be updated as well.
I already tried several options, including using the full UNC path to the CSV file as the live connection, but Tableau Server had trouble reading this path. Now I'm thinking about just creating a PowerShell script that runs weekly, calls the Python script to create the dataset, refreshes Tableau Desktop, and finally re-publishes/overwrites the dashboard to Tableau Server.
Any ideas on how to proceed with this?
Getting data from a CSV file into Tableau Server:
1. Set up the UNC path so it is accessible from your server. If you do this, you can then set up an extract refresh to read in the UNC path at the desired frequency.
2. Create an extract with the Tableau SDK: use the SDK to read in the CSV file and generate an extract file.
In our experience, #2 is not very fast. The Tableau SDK seems very slow when generating the extract, and then the extract has to be pushed to the server. I would recommend transferring the file to a location accessible to the server. Even a daily file copy to a shared drive on the server could be used if you're struggling with UNC paths. (Tableau does support UNC paths; you just have to be sure to use them rather than a mapped drive in your setup.)
The extract can be generated as a file and then pushed to the server (which may be fastest), or it can be pushed remotely.
As far as scheduling the two steps (the Python script and the data extract refresh), I use a poor man's solution myself: I update the CSV file at one point in time (Task Scheduler or cron are some of the tools that could be used) and then set up the extract schedule at a slightly later point. While this doesn't actually link running the Python script to the extract refresh (surely there is a tabcmd for this), it works just fine for my purposes to put 30 minutes in between, as my processes are reliable and the app is not mission critical.
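If you want the two steps chained in one scheduled task instead, here is a minimal sketch using tabcmd, assuming tabcmd is installed and on PATH; the script name, server URL, credentials, and datasource name are placeholders:

    import subprocess

    # Step 1: rebuild the CSV from the RDS queries (hypothetical script name).
    subprocess.run(["python", "build_weekly_csv.py"], check=True)

    # Step 2: sign in and ask Tableau Server to refresh the extract that reads the CSV.
    subprocess.run(["tabcmd", "login", "-s", "https://tableau.example.com",
                    "-u", "service_account", "-p", "secret"], check=True)
    subprocess.run(["tabcmd", "refreshextracts", "--datasource", "Weekly RDS Export"],
                   check=True)

This could itself be the weekly Task Scheduler or cron job, replacing the 30-minute gap with an explicit ordering.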
I have been developing a fairly simple desktop application to be used by a group of 100-150 people within my department, mainly for reporting. Unfortunately, I have to build it within some pretty strict confines, similar to the specs called out in this post. The application will just be a self-contained executable with no need to install.
The problem I'm running into is figuring out how to handle the database need. There will probably only be about 1GB of data for the app, but it needs to be available to everyone.
I would embed the database with the application (SQLite), but the data needs to be refreshed every week from a centralized process, so I figure it would be easier to maintain one database, rather than pushing updates down to the apps. Plus users will need to write to the database as well and those updates need to be seen by everyone.
I'm not allowed to set up a server for the database, so that rules out any good options for a true database. I'm restricted to File Shares or SharePoint.
It seems like I'm down to MS Access or SQLite. I'd prefer to stick with SQLite because I'm a fan of Python and SQLAlchemy, but based on what I've read, SQLite is not a good solution for multiple users accessing it over the network (if it's even possible).
Is there another option I haven't discovered for this setup or am I stuck working with MS Access? Perhaps I'll need to break down and work with SharePoint lists and apps?
I've been researching this for quite a while now, and I've run out of ideas. Any help is appreciated.
FYI, as I'm sure you can tell, I'm not a professional developer. I have enough experience in web / python / vb development that I can get by - so I was asked to do this as a side project.
SQLite can operate across a network and be shared among different processes. It is not a good solution when the application is write-heavy (because it locks the database file for the duration of a write), but if the application is mostly reporting it may be a perfectly reasonable solution.
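If you go that route with SQLAlchemy, a minimal sketch of pointing it at a SQLite file on a shared drive; the UNC path and table name are placeholders, the creator callable sidesteps URL-escaping the UNC path, and a generous timeout lets a writer wait for the file lock:

    import sqlite3

    from sqlalchemy import create_engine, text

    DB_PATH = r"\\fileserver\share\reporting.db"  # placeholder shared location

    engine = create_engine(
        "sqlite://",
        creator=lambda: sqlite3.connect(DB_PATH, timeout=30),
    )

    with engine.connect() as conn:
        row_count = conn.execute(text("SELECT count(*) FROM reports")).scalar()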
As my options are limited, I decided to go with a built-in database for each app using SQLite. The DB will only need to be updated every week or two, so I figured a 30-second update pulling from flat files will be OK. Then the user will have all the data locally to browse as needed.
Background:
While coding against GAE's local development web server, users need to upload megabyte-scale data and store it in the Datastore using the deferred library (not a straightforward store; it needs many format checks and translations).
Usually there are about 50,000 entities, the CSV file size is about 5MB, and I tried to insert 200 entities at a time using the deferred library.
I'm using Python.
Problem
The development server is so slow that I need to wait an hour or more for this upload process to finish.
I already use the --use_sqlite option to speed up the development web server.
Question:
Is there any other method or tuning that can make it faster?
appengine-mapreduce is definitely an option for loading CSV files. Use the blobstore to upload the CSV file, then set up the BlobstoreLineInputReader mapper type to load the data into the datastore.
Some more links: the Python guide to mapreduce reader types is here. The one of interest is BlobstoreLineInputReader. The only input it requires is the key of the blobstore record containing the uploaded CSV file.
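A minimal sketch of what the mapper side can look like with the old appengine-mapreduce Python library, assuming a mapreduce.yaml job configured with input_reader: mapreduce.input_readers.BlobstoreLineInputReader and the blob key passed as a parameter; MyEntity and the column layout are placeholders:

    import csv

    from google.appengine.ext import db
    from mapreduce import operation as op


    class MyEntity(db.Model):  # placeholder model
        name = db.StringProperty()
        value = db.IntegerProperty()


    def process_csv_line(data):
        """Mapper: BlobstoreLineInputReader calls this with (byte_offset, line)."""
        _offset, line = data
        row = next(csv.reader([line]))
        # Do the per-row format checks / translation here before building the entity.
        yield op.db.Put(MyEntity(name=row[0], value=int(row[1])))

The framework shards the file and batches the datastore puts, so you don't have to manage the 200-entity chunks yourself.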