Transfer Files from one cloud storage to another - python

How do I transfer files from one cloud storage service to another? The files are CSV.
Where is the best place to start with this problem?
For the time being the script just needs to transfer the files once a week via manual execution. Eventually the files will be transferred on a scheduled basis.

You can start by searching for these services' APIs. For example, Dropbox has a well-documented Python API.
If you want to automate your script to run every X days/hours/etc., you can make use of cron if you are running a Unix-based system.
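As a concrete starting point, here is a minimal sketch of one such weekly transfer, assuming Dropbox as the source and an S3 bucket as the destination; the access token, bucket name, and file paths are placeholders, not values from your setup. A weekly cron entry could then invoke this script once you move past manual execution.

    # Minimal sketch: copy one CSV from Dropbox to an S3 bucket.
    # The token, bucket name, and paths are placeholders.
    import boto3
    import dropbox

    DROPBOX_TOKEN = "YOUR_DROPBOX_ACCESS_TOKEN"   # placeholder
    S3_BUCKET = "your-destination-bucket"         # placeholder

    def transfer_csv(dropbox_path, s3_key, local_path="/tmp/transfer.csv"):
        # Download the CSV from Dropbox to a local temporary file
        dbx = dropbox.Dropbox(DROPBOX_TOKEN)
        dbx.files_download_to_file(local_path, dropbox_path)

        # Upload the temporary file to the destination S3 bucket
        s3 = boto3.client("s3")
        s3.upload_file(local_path, S3_BUCKET, s3_key)

    if __name__ == "__main__":
        transfer_csv("/reports/weekly.csv", "reports/weekly.csv")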
Hope that helped.

Related

How to save linked data as embedded data with only a script, without opening Spotfire?

I am updating new data every day using a database connection.
However, due to the large amount of data and an unstable database, I want to embed the data and distribute it to customers.
The problem is that new data needs to be updated and embedded every day,
but there are many dxp files and they cannot all be opened manually every day.
Can this be automated with a Python package or C#?
Note: I have already succeeded in converting to an sbdf file using only Python (pip install spotfire).
Is there any way to embed data with the Python spotfire API?
Thank you.
I haven't done it with code before, but I use Spotfire's Automation Services to do this all the time. If you have a Spotfire server, it should work for you.
Under the Tools menu in Spotfire Analyst there is the Automation Services Job Builder. I create a folder in the Spotfire library for "published" analyses and then set up a job to open the original project and save it into the "published" folder. The trick is that when you save, there is a checkbox to say "embed data in analysis". You can see in the attached picture that the job only has two steps, but you could do a number of opens and saves in a row. I then save the job in the library and schedule it on the server to run nightly (via the Spotfire web admin tool, but it can be done on the command line too).
I then inform users that the published copy is updated nightly and opens in a few seconds. If you need the latest and greatest, you can still open the original project and wait the 5-10 minutes it takes to load.
It seems like the right candidate for something like an Automation Services job. You could even create a custom task for your use case if needed.
But please be aware of the limits on the size of library items in the database (if the analysis files get too large):
2 GB for SQL Server,
4 GB for Oracle.
See the KB article
https://support.tibco.com/s/article/Tibco-KnowledgeArticle-Article-48568

How do I work with large data with a web application?

I am wrapping up a personal project that involved using Flask, Python, and PythonAnywhere. I learned a lot and now I have some new ideas for personal projects.
My next project involves taking video files and converting them into other file types, for example JPGs. When I drafted up how my system could work, I quickly realized that the platform I am currently using for web application hosting, PythonAnywhere, will be too expensive and perhaps even too slow, since I will be working with large files.
I searched around and found AWS S3 for file storage, but I am having trouble finding out how I can operate on that data to do my conversions in Python. I definitely don't want to download from S3, operate on the data in PythonAnywhere, and then re-upload the converted files to a bucket. The project will be available for use on the internet, so I am trying to make it as robust and scalable as possible.
I found it hard to even word this question, as I am not sure I am asking the right questions. I guess I am looking for a way to manipulate large data files, preferably in Python, without having to work with the data locally, if that makes any sense.
I am open to learning new technologies and am looking for some direction on how I might achieve this personal project.
Have you looked into AWS Elastic Transcoder?
Amazon Elastic Transcoder lets you convert media files that you have stored in Amazon Simple Storage Service (Amazon S3) into media files in the formats required by consumer playback devices. For example, you can convert large, high-quality digital media files into formats that users can play back on mobile devices, tablets, web browsers, and connected televisions.
Like all things AWS, there are SDKs (e.g. the Python SDK, boto3) that allow you to programmatically access the service.
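For illustration, a minimal boto3 sketch of submitting such a job might look like the following; the pipeline ID, preset ID, and object keys are placeholders you would create and look up in your own account (the pipeline itself defines the input and output S3 buckets):

    # Minimal sketch: submit an Elastic Transcoder job via boto3.
    # PipelineId, PresetId, and the object keys are placeholders.
    import boto3

    transcoder = boto3.client("elastictranscoder", region_name="us-east-1")

    response = transcoder.create_job(
        PipelineId="YOUR_PIPELINE_ID",              # pipeline ties input/output buckets together
        Input={"Key": "uploads/input-video.mp4"},   # object key in the pipeline's input bucket
        Outputs=[{
            "Key": "converted/output-video.mp4",    # object key in the pipeline's output bucket
            "PresetId": "YOUR_PRESET_ID",           # a system or custom preset for the target format
        }],
    )
    print(response["Job"]["Id"], response["Job"]["Status"])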

Alternative to FileToGoogleCloudStorageOperator

So I found FileToGoogleCloudStorageOperator, which helps in moving files from my local system to Google Cloud Storage. But is there a similar Airflow operator to move an entire directory to Google Cloud Storage?
Not an official one, but it would be pretty easy to create one; you can reuse most of the logic from https://github.com/apache/incubator-airflow/blob/master/airflow/contrib/operators/file_to_gcs.py
You can use the same GoogleCloudStorageHook it uses to upload a single file and just iterate over the directory, uploading all the files. This is what any directory-upload function for GCS would do anyway.
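A rough sketch of such an operator, written against the old airflow.contrib API linked above, could look like this; the operator and parameter names (DirectoryToGoogleCloudStorageOperator, src_dir, dst_prefix) are made up for illustration:

    # Rough sketch of a custom operator that uploads every file in a directory
    # to GCS, reusing the same hook FileToGoogleCloudStorageOperator uses.
    import os

    from airflow.contrib.hooks.gcs_hook import GoogleCloudStorageHook
    from airflow.models import BaseOperator
    from airflow.utils.decorators import apply_defaults


    class DirectoryToGoogleCloudStorageOperator(BaseOperator):

        @apply_defaults
        def __init__(self, src_dir, dst_prefix, bucket,
                     google_cloud_storage_conn_id='google_cloud_default',
                     delegate_to=None, *args, **kwargs):
            super(DirectoryToGoogleCloudStorageOperator, self).__init__(*args, **kwargs)
            self.src_dir = src_dir
            self.dst_prefix = dst_prefix
            self.bucket = bucket
            self.google_cloud_storage_conn_id = google_cloud_storage_conn_id
            self.delegate_to = delegate_to

        def execute(self, context):
            hook = GoogleCloudStorageHook(
                google_cloud_storage_conn_id=self.google_cloud_storage_conn_id,
                delegate_to=self.delegate_to)
            # Upload each regular file in the directory under the destination prefix
            for name in os.listdir(self.src_dir):
                path = os.path.join(self.src_dir, name)
                if os.path.isfile(path):
                    hook.upload(self.bucket,
                                self.dst_prefix.rstrip('/') + '/' + name,
                                path)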
Depending on the number of files you routinely need to upload, you might be better off breaking the upload into multiple tasks. That way, should one upload task fail, you don't have to restart the upload for all files. It depends on your use case though.
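If you go that route, one way is to skip the custom operator and generate one upload task per file in the DAG definition itself. A rough sketch (the DAG id, directory, and bucket are placeholders, and the directory is listed at DAG-parse time) might be:

    # Rough sketch: one FileToGoogleCloudStorageOperator task per file,
    # so a single failed upload can be retried without redoing the others.
    import os
    from datetime import datetime

    from airflow import DAG
    from airflow.contrib.operators.file_to_gcs import FileToGoogleCloudStorageOperator

    SRC_DIR = "/data/exports"       # placeholder
    BUCKET = "your-gcs-bucket"      # placeholder

    dag = DAG("upload_directory_to_gcs",
              start_date=datetime(2018, 1, 1),
              schedule_interval="@daily")

    for name in os.listdir(SRC_DIR):
        FileToGoogleCloudStorageOperator(
            task_id="upload_{}".format(name.replace(".", "_")),
            src=os.path.join(SRC_DIR, name),
            dst="exports/{}".format(name),
            bucket=BUCKET,
            dag=dag,
        )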

How to set up AWS pipeline for data project with multiple users

I am in the process of moving an internal company tool written entirely in Python to the AWS ecosystem, but I am having issues figuring out the proper way to set up my data so that it stays organized. This tool is used by people throughout the company, with each person running the tool on their own datasets (which vary from a few megabytes to a few gigabytes in size). Currently, users clone the code to their local machines and then run the tool on their data locally; we are now trying to move this usage to the cloud.
For a single person, it is simple enough to have them upload their data to S3, then point the Python code to that data to run the tool, but I'm worried that as more people start using the tool, the S3 storage will become cluttered and disorganized.
Additionally, each person might make slight changes to the Python tool in order to do custom work on their data. Our code is hosted on a Bitbucket server, and users will be forking the repo for their custom work.
My questions are:
Are S3 and EC2 the only AWS tools needed to support a project like this?
What is the proper way for users to upload their data, run the code, then download their results so that the data stays organized in S3?
What are the best practices for using EC2 in a situation like this? Do people usually spin up a new EC2 for each job or is scheduling multiple jobs on a single EC2 more efficient?
Is there a way to automate the data uploading process that will allow users to easily run the code on their data without needing to know how to code?
If anyone has any input as to how to set up this project, or has links to any relevant guides/documents, it would be greatly appreciated. Thanks!
You can do something like this.
a) Write a boto3 script to upload the data to a specified S3 bucket, maybe with a timestamp appended to the key.
b) Configure the S3 bucket to send a notification over SQS when a new item arrives.
c) Keep 2-3 EC2 machines running, actively listening to SQS.
d) When a new item arrives, a worker gets the key from SQS and processes it. Delete the event from SQS after successful completion.
e) Put the processed data somewhere, delete the key from the bucket, and notify the user by mail.
For users doing custom work, they can create a new branch and include its name with the uploaded data; the EC2 worker reads it from there and checks out the required branch. After the job, the branch can be deleted. This can be a single line with the branch name in it. This involves a one-time setup. You should probably also use a process manager on EC2 that restarts the worker process if it crashes.
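A rough boto3 sketch of the worker in steps c) and d) could look like the following; the queue URL is a placeholder and process_file() stands in for whatever the tool actually does with each dataset:

    # Rough sketch: long-running worker on EC2 that polls SQS for S3
    # "object created" notifications, processes the object, and deletes
    # the message only after successful processing.
    import json

    import boto3

    QUEUE_URL = "https://sqs.us-east-1.amazonaws.com/123456789012/your-queue"  # placeholder

    sqs = boto3.client("sqs")

    def process_file(bucket, key):
        # Placeholder for the actual job: download the data, run the tool,
        # upload the results, delete the source key, email the user.
        pass

    while True:
        # Long-poll so the worker is not busy-waiting
        resp = sqs.receive_message(QueueUrl=QUEUE_URL,
                                   MaxNumberOfMessages=1,
                                   WaitTimeSeconds=20)
        for msg in resp.get("Messages", []):
            body = json.loads(msg["Body"])
            # S3 event notifications carry a "Records" list; test events do not
            for record in body.get("Records", []):
                bucket = record["s3"]["bucket"]["name"]
                key = record["s3"]["object"]["key"]
                process_file(bucket, key)
            # Only delete the message after successful processing
            sqs.delete_message(QueueUrl=QUEUE_URL,
                               ReceiptHandle=msg["ReceiptHandle"])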

Tableau: How to automate publishing dashboard to Tableau server

I used Python scripting to do a series of complex queries against three different RDS instances, and then exported the data into a CSV file. I am now trying to find a way to automate publishing a dashboard that uses this data to Tableau Server on a weekly basis, such that when I run my Python code, it will generate new data and, subsequently, the dashboard on Tableau Server will be updated as well.
I have already tried several options, including using the full UNC path to the CSV file as the live connection, but Tableau Server had trouble reading this path. Now I'm thinking about just creating a PowerShell script, run weekly, that calls the Python script to create the dataset, refreshes Tableau Desktop, and finally re-publishes/overwrites the dashboard to Tableau Server.
Any ideas on how to proceed with this?
There are two options for getting the data file to Tableau Server:
1. Set up the UNC path so it is accessible from your server. If you do this, you can then set up an extract refresh to read from the UNC path at the desired frequency.
2. Create an extract with the Tableau SDK: use the SDK to read in the CSV file and generate an extract file.
In our experience, option 2 is not very fast. The Tableau SDK seems very slow when generating the extract, and then the extract has to be pushed to the server. I would recommend transferring the file to a location accessible to the server. Even a daily file copy to a shared drive on the server could be used if you're struggling with UNC paths. (Tableau does support UNC paths; you just have to be sure to use them rather than a mapped drive in your setup.)
The extract can be transferred as a file and then pushed (which may be fastest), or it can be pushed remotely.
As far as scheduling the two steps (the Python script and the data extract refresh), I use a poor man's solution myself: I update the CSV file at one point in time (task scheduler or cron are some of the tools that could be used) and then set up the extract schedule at a slightly later point in time. While this does not link running the Python script to triggering the extract refresh (surely there is a tabcmd command for this), it works just fine for my purposes to put 30 minutes in between, as my processes are reliable and the app is not mission critical.
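If you do want to link the two steps rather than rely on a time gap, a rough sketch that calls tabcmd from the same scheduled Python job could look like this; the server URL, credentials, and workbook name are placeholders, and the exact tabcmd flags should be double-checked against your version:

    # Rough sketch: regenerate the CSV, then trigger the extract refresh
    # on Tableau Server with tabcmd. Server, credentials, and names are placeholders.
    import subprocess

    # Step 1: regenerate the CSV with the existing Python script
    subprocess.run(["python", "build_dataset.py"], check=True)

    # Step 2: refresh the published extract that reads the CSV
    subprocess.run(["tabcmd", "login", "-s", "https://your-tableau-server",
                    "-u", "your_user", "-p", "your_password"], check=True)
    subprocess.run(["tabcmd", "refreshextracts", "--workbook", "Your Dashboard"],
                   check=True)
    subprocess.run(["tabcmd", "logout"], check=True)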
