I have pricing data, stored as XML, that is generated on an hourly basis. It is roughly 100MB in size when stored as XML. I need to send this data to my main system for processing. In the future, it is also possible that this data will be sent every minute.
What would be the best way to send this data? My thinking thus far was:
- It would be too large to send as JSON to a POST endpoint
- Possible to send it as XML and store it on my server
Is there a more optimal way to do this?
As mentioned in the answer by Michael Anderson, you could possibly send only a diff of the changes between the two systems.
One way to do this is to introduce a tool such as git.
With git, you could:
Generate the data on the first system and push to a private repo
Have your second system pull the changes
This would be much more efficient than pulling the entire copy of the data every time.
It would also be compressed, and over an encrypted channel (depending on the git server/service)
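A rough sketch of the producer side in Python (the repository path, remote, and branch name are assumptions; the consumer just runs git pull from its own cron job):

import subprocess

REPO = "/srv/pricing-repo"   # hypothetical local clone that already has a remote configured

def push_snapshot(xml_name="prices.xml"):
    # Stage the regenerated XML, commit it, and push; git sends only the
    # compressed delta between this commit and what the remote already has.
    # Assumes the XML actually changed since the last run, otherwise the
    # commit exits non-zero and check=True raises.
    subprocess.run(["git", "-C", REPO, "add", xml_name], check=True)
    subprocess.run(["git", "-C", REPO, "commit", "-m", "hourly pricing snapshot"], check=True)
    subprocess.run(["git", "-C", REPO, "push", "origin", "main"], check=True)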
Assuming you're on Linux and the data is already written somewhere in your filesystem, why not just do a simple scp or rsync from a crontab entry?
You probably want to compress before sending, or enable compression in the protocol.
If your data only changes slightly, you could also try sending a patch against the previous version (generated with diff) instead of the entire data, and then regenerating on the other end.
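For example, a minimal sketch of the rsync route driven from Python and run hourly by cron (the paths and host are placeholders); -z compresses in transit, and when the destination file already exists rsync's delta algorithm only transfers the changed blocks:

import subprocess

SRC = "/data/prices.xml"                                   # hypothetical local dump
DEST = "user@mainsystem.example.com:/incoming/prices.xml"  # hypothetical target

# -a preserves attributes, -z compresses over the wire, --partial resumes interrupted transfers.
subprocess.run(["rsync", "-az", "--partial", SRC, DEST], check=True)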
So I have a lot of data in an Azure blob storage. Each user can upload some cases, and the end result can be represented as a series of pandas dataframes. Now I want to be able to display some of this data on our site, but the files are several hundred MB and there is no need to download all of it. What would be the best way to get part of the dataframe?
I can make a folder structure in each blob storage containing the different columns in each dataframe, and perhaps a more compact summary of the columns, but I would like to keep it in one file if possible.
I could also set up a database containing the info but I like the structure as it is - completely separated in cases.
Originally I thought I could do it in hdf5 but it seems that I need to download the entire file from the blob storage to my API backend before I can run my python code on it. I would prefer if I could keep the hdf5 files and get the parts of the columns from the blob storage directly but as far as I can see that is not possible.
I am thinking this is something that has been solved a million times before but it is a bit out of my domain so I have not been able to find a good solution for it.
Check out the BlobClient of the Azure Python SDK. The download_blob method might suit your needs. Use chunks() to get an iterator which allows you to iterate over the file in chunks. You can also set other parameters to ensure that a chunk doesn't exceed a set size.
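For example, a rough sketch along those lines (the connection string, container and blob names are placeholders, and I'm assuming the max_chunk_get_size keyword is available in your SDK version):

from azure.storage.blob import BlobClient

blob = BlobClient.from_connection_string(
    conn_str="<connection-string>",
    container_name="cases",
    blob_name="case_123/data.h5",
    max_chunk_get_size=4 * 1024 * 1024,  # cap the size of each downloaded chunk
)

# Pull only a byte range instead of the whole file...
header = blob.download_blob(offset=0, length=1024).readall()

# ...or stream the blob chunk by chunk without holding it all in memory.
total = 0
for chunk in blob.download_blob().chunks():
    total += len(chunk)
print(total, "bytes streamed")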
I need to log certain requests made to my app being developed in Flask.
I know in advance whether a request is one that needs to be logged or not.
I'll have at most 1000 requests to be logged. This is an upper limit, and it will probably be lower. The exact number is still uncertain for now, as we're adding them, but it won't change once deployed. Each request will generate on average about two A4 pages of text.
I want to save the content, the time, and the filenames (usually at most 2 to 3 files) that came in with each request.
I don't mind the space this log file will take up.
I would like it to be fast to open, and fast to search the log for previously made requests. I care more about opening the file and logging new information quickly than about the searching itself, since it'll be just a few hundred requests.
Should I pickle, save in a json, or use some other format?
The way I was thinking of structuring the information would be this:
{ "past_request": <sorted_list_filenames>,
"filename_1" : {"content":..., "time_requested":...,...},
"filename_102" : {"content":..., "time_requested":...,...}
...
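If JSON wins out, a minimal sketch of appending one request to that structure and saving it (the file name and fields are just placeholders):

import datetime
import json
import os

LOG_PATH = "request_log.json"  # hypothetical location of the log file

def log_request(filename, content):
    # Load the existing log (or start fresh), add the new entry, rewrite the file.
    log = {"past_request": []}
    if os.path.exists(LOG_PATH):
        with open(LOG_PATH) as f:
            log = json.load(f)
    log[filename] = {"content": content,
                     "time_requested": datetime.datetime.now().isoformat()}
    log["past_request"] = sorted(k for k in log if k != "past_request")
    with open(LOG_PATH, "w") as f:
        json.dump(log, f)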
Good morning all.
I have a generic question about the best approach to handle large files with Django.
I created a python project where the user is able to read a binary file (usually the size is between 30-100MB). Once the file is read, the program processes the file and shows relevant metrics to the user. Basically it outputs the max, min, average, std of the data.
At the moment, you can only run this project from the command line. I'm trying to create a user interface so that anyone can use it. I decided to create a webpage using Django. The page is very simple: the user uploads files, selects which file he wants to process, and the page shows him the metrics.
Working on my local machine I was able to implement it: I upload the files (they are saved locally and then processed). I then created an S3 account, and now the files are all uploaded to S3. The problem I'm having is that when I try to read the file directly (I'm using smart_open, https://pypi.org/project/smart-open/) it is really slow (a 30MB file takes 300sec), but if I download the file first and then read it, it only takes 8sec.
My question is: what is the best approach to retrieve files from S3 and process them? I'm thinking of simply downloading the file to my server, processing it, and then deleting it. I've tried this on my localhost and it's fast: downloading from S3 takes 5sec and processing takes 4sec.
Would this be a good approach? I'm a bit afraid that for instance if I have 10 users at the same time and each one creates a report then I'll have 10*30MB = 300MB of space that the server needs. Is this something practical, or will I fill up the server?
Thank you for your time!
Edit
To give a bit more context: what's making it slow is the f.read() call. Due to the format of the binary file, I have to read it in the following way:
name = f.read(30)
unit = f.read(5)
data_length = int.from_bytes(f.read(2), "big")  # the two raw bytes must be converted to an int (byte order depends on the file format)
data = f.read(data_length)  # <- this is the part that takes a lot of time when I read directly from S3; if I download the file first, it is super fast
All,
After some experimenting, I found a solution that works for me.
import os
import boto3

s3 = boto3.client("s3")  # assumes credentials are already configured
with open('temp_file_name', 'wb') as data:
    s3.download_fileobj(Bucket='YOURBUCKETNAME', Key='YOURKEY', Fileobj=data)
read_file('temp_file_name')  # the existing parser, now reading from local disk
os.remove('temp_file_name')
I don't know if this is the best approach or what the possible downsides of this approach are. I'll use it and come back to this post if I end up using a different solution.
The problem with my previous approach was that f.read() was taking too long; the problem seems to be that every time I need to do a new read, the program needs to connect to S3 (or something), and this takes too long. What ended up working for me was to download the file directly to my server, read it, and then delete it once I'm done. Using this solution I was able to get the speeds I was getting when working locally (reading directly from my laptop).
If you are working with medium-sized files (30-50MB) this approach seems to work. My only concern is whether the server will run out of disk space if we try to download a really large file.
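A variation on the same idea (just a sketch; the bucket and key are placeholders): write each download to a uniquely named temporary file so concurrent users don't overwrite each other, and delete it even if parsing fails.

import os
import tempfile

import boto3

s3 = boto3.client("s3")

def process_from_s3(bucket, key):
    # Download to a uniquely named temp file instead of a shared 'temp_file_name'.
    with tempfile.NamedTemporaryFile(delete=False) as tmp:
        s3.download_fileobj(Bucket=bucket, Key=key, Fileobj=tmp)
        path = tmp.name
    try:
        return read_file(path)   # the existing binary parser, reading from local disk
    finally:
        os.remove(path)          # free the disk space as soon as processing is done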
I have developed a website where the pages are simply HTML tables. I have also developed a server by extending Python's SimpleHTTPServer. Now I am developing my database.
Most of the table contents on each page are static and don't need to be touched. However, there is one column per table (i.e. page) that needs to be editable and stored. The values are simply text that the user can enter. The user enters the text via HTML textareas that are appended to the tables via JavaScript.
The database is to store key/value pairs where the value is the user entered text (for now at least).
Current situation
Because the original format of my webpages was xlsx files, I opted to use an Excel workbook as my database; it basically just mirrors the displayed HTML tables (pages).
I hook up to the Excel workbook through win32com. Every time the table (page) loads, JavaScript iterates through the HTML textareas and sends an individual request to the server to load in its respective text from the database.
Currently this approach works but is terribly slow. I have tried to optimize everything as much as I can and I believe the speed limitation is a direct consequence of win32com.
Thus, I see four possible ways to go:
Replace my current win32com functionality with xlrd
Try to load all the html textareas for a table (page) at once through one server call to the database using win32com
Switch to something like sql (probably use mysql since it's simple and robust enough for my needs)
Use xlrd but make a single call to the server for each table (page) as in (2)
My schedule to build this functionality is around two days.
Does anyone have any thoughts on the tradeoffs in time-spent-coding versus speed of these approaches? If anyone has any better/more streamlined methods in mind please share!
Probably not the answer you were looking for, but your post is very broad, and I've used win32com and Excel a fair bit and don't see those as good tools for your goal. An easier strategy is this:
for the server, use Flask: it is a Python HTTP server that makes it crazy easy to respond to HTTP requests via Python code and HTML templates. You'll have a fully capable server running in 5 minutes, then you will need a bit of time to write code that gets data from your DB and renders it with templates (which are really easy to use).
for the database, use SQLite (there is far more overhead integrating with MySQL), especially since you only have 2 days; see the sketch after this list.
you could also use a simple CSV file, since the API (Python has a CSV read/write module) is much simpler and there is less ramp-up time. One CSV per user, easy to manage. You don't worry about inserting rows for a user, you just append; and you don't implement removing rows for a user, you just mark them as inactive (a column for active/inactive in your CSV). When processing a GET request from the client, as you read the CSV you can count how many rows are inactive and rewrite the CSV, so once in a while a request will be a little slower to respond to the client.
even simpler yet, you could use an in-memory data structure of your choice if you don't need persistence across restarts of the server. If this is for a demo, this should be an acceptable limitation.
for the client side, use jQuery on top of JavaScript -- maybe you are doing that already. It makes it super easy to manipulate the DOM and use effects like slide-in/out etc. Get yourself the book "Learning jQuery"; you'll be able to make good use of jQuery in just a couple of hours.
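A minimal sketch of the Flask + SQLite combination (the table, database file, and routes are all made up): one route returns every key/value pair for a page in a single request, the other saves an edited cell.

import sqlite3
from flask import Flask, jsonify, request

app = Flask(__name__)
DB = "pages.db"  # hypothetical SQLite file

def db():
    conn = sqlite3.connect(DB)
    conn.execute("CREATE TABLE IF NOT EXISTS cells ("
                 "page TEXT, key TEXT, value TEXT, PRIMARY KEY (page, key))")
    return conn

@app.route("/cells/<page>")
def get_cells(page):
    # One request per table (page) instead of one request per textarea.
    rows = db().execute("SELECT key, value FROM cells WHERE page = ?", (page,)).fetchall()
    return jsonify(dict(rows))

@app.route("/cells/<page>/<key>", methods=["POST"])
def set_cell(page, key):
    with db() as conn:  # commits on success
        conn.execute("INSERT OR REPLACE INTO cells (page, key, value) VALUES (?, ?, ?)",
                     (page, key, request.get_data(as_text=True)))
    return "", 204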
If you only have two days it might be a little tight, but you will probably need more than 2 days to get around the issues you are facing with your current strategy, and issues you will face imminently.
I am trying to store 30-second user MP3 recordings as Blobs in my App Engine datastore. However, in order to enable this feature (App Engine has a 1MB limit per upload) and to keep costs down, I would like to compress the file before upload and decompress the file every time it is requested. How would you suggest I accomplish this? (It can happen in the background via a task queue, by the way, but an efficient solution is always good.)
Based on my own tests and research - I see two possible approaches to accomplish this
Zlib
For this I need to compress a certain number of blocks at a time using a while loop. However, App Engine doesn't allow you to write to the file system. I thought about using a temporary file to accomplish this, but I haven't had luck with that approach when trying to decompress the content from a temporary file.
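For what it's worth, zlib doesn't need the filesystem at all; a rough in-memory sketch (the block size is arbitrary, and already-compressed MP3 data won't shrink much):

import zlib

def compress_blocks(data, block_size=64 * 1024):
    # Feed the bytes to the compressor one block at a time, entirely in memory.
    compressor = zlib.compressobj()
    out = []
    for start in range(0, len(data), block_size):
        out.append(compressor.compress(data[start:start + block_size]))
    out.append(compressor.flush())
    return b"".join(out)

# Decompression is symmetric: original = zlib.decompress(compressed_bytes)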
Gzip
From reading around the web, it appears that the App Engine URL fetch function already requests content gzipped and then decompresses it. Is there a way to stop the function from decompressing the content so that I can just put it in the datastore in gzipped format and then decompress it when I need to play it back to a user on demand?
Let me know how you would suggest using zlib or gzip or some other solution to accomplish this. Thanks.
"Compressing before upload" implies doing it in the user's browser -- but no text in your question addresses that! It seems to be about compression in your GAE app, where of course the data will only be after the upload. You could do it with a Firefox extension (or other browsers' equivalents), if you can develop those and convince your users to install them, but that has nothing much to do with GAE!-) Not to mention that, as #RageZ's comment mentions, MP3 is, essentially, already compressed, so there's little or nothing to gain (though maybe you could, again with a browser extension for the user, reduce the MP3's bit rate and thus the file's dimension, that could impact the audio quality, depending on your intended use for those audio files).
So, overall, I have to second #jldupont's suggestion (also in a comment) -- use a different server for storage of large files (S3, Amazon's offering, is surely a possibility though not the only one).
While the technical limitations (mentioned in other answers) of compressing MP3 files via standard compression or reencoding at a lower bitrate are correct, your aim is to store 30 seconds of MP3 encoded data. Assuming that you can enforce that on your users, you should be alright without applying additional compression techniques if the MP3 bitrate is 256kbit constant bitrate (CBR) or lower. At 256kbit CBR, 30 seconds of audio would require:
(((256 * 1000) / 8) * 30) / 1048576 ≈ 0.92MB
The maximum standard bitrate is 320kbit, which equates to 1.14MB, so you'd have to use 256 or less. The most commonly used bitrate in the wild is 128kbit.
There are additional overheads that will increase the final file size, such as ID3 tags and framing, but you should be OK. If not, drop down to 224kbit as your maximum (30 secs = 0.80MB). There are other complexities, such as variable bitrate encoding, for which the file size is not so predictable, and I am ignoring these here.
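A quick way to sanity-check those numbers (CBR only, ignoring ID3/framing overhead):

def mp3_size_mb(bitrate_kbit, seconds):
    # bits per second -> bytes per second -> total bytes -> MB (1 MB = 1048576 bytes)
    return (bitrate_kbit * 1000 / 8) * seconds / 1048576

print(round(mp3_size_mb(256, 30), 2))  # 0.92
print(round(mp3_size_mb(320, 30), 2))  # 1.14
print(round(mp3_size_mb(224, 30), 2))  # 0.8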
So your problem is no longer how to compress MP3 files, but how to ensure that your users are aware that they cannot upload more than 30 seconds encoded at 256kbit CBR, and how to enforce that policy.
You could try the new Blobstore API, which allows the storage and serving of files up to 50MB:
http://www.cloudave.com/link/the-new-google-app-engine-blobstore-api-first-thoughts
http://code.google.com/appengine/docs/python/blobstore/
http://code.google.com/appengine/docs/java/blobstore/
As Aneto mentions in a comment, you will not be able to compress MP3 data with a standard compression library like gzip or zlib. However, you could reencode the MP3 at a MUCH lower bitrate, possibly with LAME.
You can store up to 10MB with a list of Blobs. Search for the Google file service.
It's much more versatile than the Blobstore in my opinion; I just started using the Blobstore API yesterday and I'm still figuring out whether it is possible to access the data bytewise, as in changing doc to pdf, or jpeg to gif.
You can store Blobs of 1MB * 10 = 10MB (the max entity size, I think), or you can use the Blobstore API and get the same 10MB, or 50MB if you enable billing (you can enable it, but if you don't exceed the free quota you don't pay).