Write/Read files: NodeJS vs Python

I need to know which mechanism is more efficient (less RAM/CPU) for reading and writing files, especially writing, possibly with a JSON data structure. The idea is to perform these operations in the context of WebSockets (client -> server -> read/write a file with the data of the current session -> response to client). Is the best idea to store the data in temporary variables and destroy the variables when they are no longer useful?
Any ideas?

They will probably both be about the same. I/O is generally a lot slower than CPU, so the entire process of reading and writing files will depend on how fast your disk can handle the requests.
It also will depend on the data-processing approach you take. If you opt to read the whole file in at once, then of course it will use more memory than if you choose to read the file piece-by-piece.
So, the answer is: the performance will only (very minimally) depend on your choice of language. Choice of algorithm and I/O performance will easily account for the majority of CPU or RAM usage.
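As a rough illustration of the whole-file vs. piece-by-piece trade-off (the file names and the JSON Lines layout are assumptions, not something from the question):

    import json

    # Whole-file read: simplest, but the entire document sits in memory at once.
    with open("session.json", "r", encoding="utf-8") as f:
        data = json.load(f)

    # Piece-by-piece read (one JSON record per line): memory use stays roughly
    # constant per record, which matters far more than the choice of language.
    with open("session.jsonl", "r", encoding="utf-8") as f:
        for line in f:
            record = json.loads(line)
            # ...handle one session record at a time...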

Call Python tasks from Golang

I have been building a big data application for stock market analysis - about 5 TB of records per day. I use Golang for data transformation/calculation and for saving into Cassandra/MySQL. Python has very good libraries for data analysis (pandas, Spark, etc.), but it has no easy way to do multicore processing, and that takes a lot of time.
So, I want to call Python data analysis tasks concurrently from Golang. One way is to execute a command-line task directly, but I think there should be a more scalable solution. Maybe there is a library for communication between Golang and Python. I thought maybe I should create multiple Python Flask servers and give tasks to them. Speed is important, but I can sacrifice some of it for a concise solution. Any ideas?
Splitting your app into multiple servers, as you've suggested, carries some trade-offs.
On the plus side, splitting it up gives you more flexibility in terms of load balancing. In other words, if your Flask servers are overburdened, you can always spin up a few more and scale horizontally behind a load balancer. Of course, this assumes that whatever you're doing on those Flask servers can be done in parallel (it depends on your actual business logic).
It also offers high-availability: you eliminate one potential single-point-of-failure.
However, this 'microservice' approach does incur some overheads:
more code to write, since now you're writing two kinds of servers
some network overhead, since now you're communicating over the network as opposed to making function calls
more machines to spin up (although you could run everything in containers, and they could all be on the same machine if you don't need the extra processing power)
You could consider using Google protocol buffers (protobuf) to serialize/deserialize the messages. It's language-agnostic and saves some of the network overhead. It's not as easy as sending JSON, but if efficiency is paramount, it might be worth the trouble. Plus, it's supported in both Python and Go.
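A minimal sketch of one of those Flask analysis workers, assuming the Go side simply POSTs a JSON batch to each worker; the endpoint name, payload shape, and the pandas calculation are placeholders, not anything from the question:

    from flask import Flask, request, jsonify
    import pandas as pd

    app = Flask(__name__)

    @app.route("/analyze", methods=["POST"])
    def analyze():
        records = request.get_json()                    # e.g. a batch of tick records
        df = pd.DataFrame(records)
        result = df.groupby("symbol")["price"].mean()   # hypothetical calculation
        return jsonify(result.to_dict())

    if __name__ == "__main__":
        app.run(host="0.0.0.0", port=5000)

Running several of these behind a load balancer keeps the Go side a plain HTTP client, and the payload could later be switched from JSON to protobuf without changing the overall shape.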

Python Multiprocessing: is locking appropriate for (large) disk writes?

I have multiprocessing code wherein each process does a disk write (pickling data), and the resulting pickle files can be upwards of 50 MB (and sometimes even more than 1 GB depending on what I'm doing). Also, different processes are not writing to the same file, each process writes a separate file (or set of files).
Would it be a good idea to implement a lock around disk writes so that only one process is writing to the disk at a time? Or would it be best to just let the operating system sort it out even if that means 4 processes may be trying to write 1 GB to the disk at the same time?
As long as the processes aren't fighting over the same file, let the OS sort it out. That's its job.
Unless your processes try to dump their data in one big write, the OS is in a better position to schedule the disk writes.
If you do use one big write, you might try partitioning it into smaller chunks. That might give the OS a better chance of handling them.
Of course you will hit a limit somewhere. Your program might be CPU-bound, memory-bound, or disk-bound. It might hit different limits depending on the input or load.
But unless you've got evidence that you're constantly disk-bound and you've got a good idea how to solve that, I'd say don't bother. The days when a write system call actually meant that the data was sent directly to disk are long gone.
Most operating systems these days use unallocated RAM as a disk cache, and HDDs have built-in caches as well. Unless you disable both of these (which will give you a huge performance hit), there is precious little connection between your program completing a write and the data actually hitting the platters or flash.
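A minimal sketch of the scenario in the question, with each worker pickling its own result to its own file and no lock around the writes; the file names and payload are placeholders:

    import multiprocessing as mp
    import pickle

    def worker(task_id):
        result = {"task": task_id, "data": list(range(1_000_000))}  # stand-in payload
        path = f"result_{task_id}.pkl"              # one file per process, no sharing
        with open(path, "wb") as f:
            pickle.dump(result, f, protocol=pickle.HIGHEST_PROTOCOL)
        return path

    if __name__ == "__main__":
        with mp.Pool(processes=4) as pool:
            paths = pool.map(worker, range(8))      # the OS schedules the concurrent writes
        print(paths)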
You might consider using memory-mapped files (mmap, if your OS supports it) and letting the OS's virtual memory do the work for you. See e.g. the architect notes for the Varnish cache.
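A rough sketch of that memory-mapped approach: write into the mapped region and let the virtual memory subsystem decide when the pages hit disk. The file name and size here are arbitrary assumptions:

    import mmap

    PATH = "session.dat"
    SIZE = 16 * 1024 * 1024                 # pre-size the file to 16 MB

    with open(PATH, "wb") as f:
        f.truncate(SIZE)

    with open(PATH, "r+b") as f:
        mm = mmap.mmap(f.fileno(), SIZE)
        mm[0:11] = b"hello world"           # writes land in the page cache first
        mm.flush()                          # optional: force dirty pages to disk
        mm.close()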

Python Strategy for Large Scale Analysis (on-the-fly or deferred)

To analyze a large number of websites or financial data and pull out parametric data, what are the optimal strategies?
I'm classifying the following strategies as either "on-the-fly" or "deferred". Which is best?
1. On-the-fly: process data on-the-fly and store the parametric data in a database
2. Deferred: store all the source data as ASCII on a file system and post-process later, or with a data-processing daemon
3. Deferred: store all pages as BLOBs in a database to post-process later, or with a data-processing daemon
Number 1 is simplest, especially if you only have a single server. Can #2 or #3 be more efficient with a single server, or do you only see the power with multiple servers?
Are there any python projects that are already geared toward this kind of analysis?
Edit: by best, I mean fastest execution (to keep the user from waiting), with ease of programming as a secondary concern.
I'd use Celery, either on a single machine or on multiple machines, with the "on-the-fly" strategy. You can have an aggregation task that fetches the data, and a processing task that analyzes it and stores it in a DB. This is a highly scalable approach, and you can tune it according to your computing power.
The "on-the-fly" strategy is more efficient in the sense that you process your data in a single pass. The other two involve an extra step: re-retrieving the data from wherever you saved it and processing it after that.
Of course, everything depends on the nature of your data and the way you process it. If the processing phase is slower than the aggregation, the "on-the-fly" strategy will hang and wait for the processing to complete. But again, you can configure Celery to be asynchronous and continue to aggregate while there is still unprocessed data.
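A minimal sketch of that two-task pipeline with Celery; the broker URL, task bodies, and storage step are assumptions for illustration:

    from celery import Celery

    app = Celery("analysis", broker="redis://localhost:6379/0")

    @app.task
    def fetch(url):
        import urllib.request
        raw = urllib.request.urlopen(url).read().decode("utf-8", "replace")
        process.delay(url, raw)              # hand off immediately; keep fetching
        return url

    @app.task
    def process(url, raw):
        params = {"length": len(raw)}        # placeholder "parametric data" extraction
        save_to_db(url, params)

    def save_to_db(url, params):
        pass                                 # e.g. an INSERT via your DB driver

Queueing fetch.delay(url) for each source keeps aggregation and processing decoupled, so slow processing only grows the queue instead of blocking the fetchers.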
First: "fastest execution to prevent user from waiting" means some kind of deferred processing. Once you decide to defer the processing -- so the user doesn't see it -- the choice between flat-file and database is essentially irrelevant with respect to end-user-wait time.
Second: databases are slow. Flat files are fast. Since you're going to use celery and avoid end-user-wait time, however, the distinction between flat file and database becomes irrelevant.
Store all the source data as ASCII on a file system and post-process later, or with a data-processing daemon
This is fastest. Use Celery to load the flat files.
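A hedged sketch of that flat-file-plus-Celery flow: dump the raw source to disk in the hot path, enqueue only the file path, and post-process later. The spool directory, broker URL, and parsing step are assumptions:

    import os
    import uuid
    from celery import Celery

    app = Celery("deferred", broker="redis://localhost:6379/0")
    SPOOL_DIR = "/var/spool/analysis"

    def store_and_defer(raw_bytes):
        path = os.path.join(SPOOL_DIR, uuid.uuid4().hex + ".txt")
        with open(path, "wb") as f:          # fast flat-file write, no DB in the hot path
            f.write(raw_bytes)
        process_file.delay(path)             # defer the slow work to a Celery worker
        return path

    @app.task
    def process_file(path):
        with open(path, "rb") as f:
            raw = f.read()
        # ...parse raw and store the parametric data in the database...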

How to reduce latency of data sent through a REST api

I have an application which obtains data in JSON format from one of our other servers. The problem I am facing is that there is a significant delay when requesting this information. Since a lot of data is passed (approx. 1000 records per request, where each record is pretty big), is there a way that compression could help reduce the delay? If so, which compression scheme would you recommend?
I read on another thread that the pattern of the data also matters a lot for the type of compression that should be used. The pattern of the data is consistent and resembles the following:
:desc=>some_description
:url=>some_url
:content=>some_content
:score=>some_score
:more_attributes=>more_data
Can someone recommend a way to reduce this delay? The delay is approximately 6-8 seconds. I'm using Ruby on Rails to develop this application, and the server providing the data uses Python for the most part.
I would first look at how much of this 8-second delay is related to:
Server-side processing (how much time it took for the data to be generated)
There are a lot of techniques to improve this time, including:
DB indexes
caching
a faster to_json library
Some excellent resources are the NewRelic podcasts on Rails scalability http://railslab.newrelic.com/2009/02/09/episode-7-fragment-caching
Transmission delay (how much time it took for the data to be sent between the server and the client)
If the keys are pretty much the same, you may implement the solution from "Compression algorithm for JSON encoded packets?"; you may also want to look at https://github.com/WebReflection/json.hpack/wiki/specs-details and http://www.nwhite.net/?p=242
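The core idea behind that kind of packing is simply to send the shared keys once and the values as rows; a toy sketch with placeholder records:

    import json

    records = [
        {"desc": "d1", "url": "u1", "content": "c1", "score": 1, "more_attributes": "m1"},
        {"desc": "d2", "url": "u2", "content": "c2", "score": 2, "more_attributes": "m2"},
    ]

    keys = list(records[0].keys())
    packed = {"keys": keys, "rows": [[r[k] for k in keys] for r in records]}
    payload = json.dumps(packed)             # the keys appear once instead of per record

    # Receiver side: rebuild the original dicts.
    unpacked = [dict(zip(packed["keys"], row)) for row in packed["rows"]]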
In addition to this, you may also compress (gzip) it from your frontend server:
http://httpd.apache.org/docs/2.0/mod/mod_deflate.html
http://wiki.nginx.org/NginxHttpGzipModule
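Regardless of whether Apache, nginx, or the app itself does the compressing, gzip tends to shrink repetitive JSON like this dramatically; a rough, self-contained illustration with made-up records:

    import gzip
    import json

    payload = json.dumps([{"desc": "x" * 50, "url": "http://example.com/item",
                           "content": "y" * 500, "score": 0.9}] * 1000).encode("utf-8")

    compressed = gzip.compress(payload)
    print(len(payload), "->", len(compressed))   # repetitive JSON usually compresses very well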
If the data structure is constant, you can also try to implement a binary service such as Thrift, which is much, much faster and includes compression, but is also more difficult to maintain:
http://www.igvita.com/2007/11/30/ruby-web-services-with-facebooks-thrift/
If this is suitable to your needs, maybe you can make some kind of a versioning/cache system server-side, and send only the records that were modified (but that is pretty heavy to implement)
gzip might significantly reduce the size of text data and optimize load speeds. It's also recommended by YSlow.

Downloading a Large Number of Files from S3

What's the fastest way to get a large number of files (relatively small, 10-50 kB each) from Amazon S3 using Python? (On the order of 200,000 to a million files.)
At the moment I am using boto to generate Signed URLs, and using PyCURL to get the files one by one.
Would some type of concurrency help? PyCurl.CurlMulti object?
I am open to all suggestions. Thanks!
I don't know anything about python, but in general you would want to break the task down into smaller chunks so that they can be run concurrently. You could break it down by file type, or alphabetical or something, and then run a separate script for each portion of the break down.
In the case of Python, since this is I/O-bound, multiple threads will make use of the CPU, but they will probably use only one core. If you have multiple cores, you might want to consider the multiprocessing module. Even then, you may want to have each process use multiple threads. You would have to do some tweaking of the number of processes and threads.
If you do use multiple threads, this is a good candidate for the Queue class.
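A minimal sketch of that threads + Queue approach, assuming the signed URLs already come from boto; urllib is used here instead of PyCURL just to keep the sketch short:

    import os
    import queue
    import threading
    import urllib.request

    NUM_WORKERS = 20
    jobs = queue.Queue()

    signed_urls = [
        # ("https://bucket.s3.amazonaws.com/key?Signature=...", "key"),  # from boto
    ]

    def worker():
        while True:
            item = jobs.get()
            if item is None:                         # sentinel: no more work
                break
            url, key = item
            try:
                urllib.request.urlretrieve(url, os.path.basename(key))
            except Exception as exc:
                print("failed:", key, exc)
            finally:
                jobs.task_done()

    threads = [threading.Thread(target=worker, daemon=True) for _ in range(NUM_WORKERS)]
    for t in threads:
        t.start()

    for url, key in signed_urls:
        jobs.put((url, key))

    jobs.join()                                      # wait until every download has finished
    for _ in threads:
        jobs.put(None)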
You might consider using s3fs, and just running concurrent file system commands from Python.
I've been using txaws with Twisted for S3 work, though what you'd probably want is just to get the authenticated URL and use twisted.web.client.downloadPage (by default it will happily go from stream to file without much interaction).
Twisted makes it easy to run at whatever concurrency you want. For something on the order of 200,000, I'd probably make a generator and use a cooperator to set my concurrency and just let the generator generate every required download request.
If you're not familiar with Twisted, you'll find the model takes a bit of time to get used to, but it's oh so worth it. In this case, I'd expect it to take minimal CPU and memory overhead, but you'd have to worry about file descriptors. It's quite easy to mix in Perspective Broker and farm the work out to multiple machines should you find yourself needing more file descriptors, or if you have multiple connections over which you'd like it to pull down.
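A rough sketch of that generator + cooperator pattern; the URL list and concurrency level are placeholders, and downloadPage has since been deprecated in favour of newer Twisted client APIs:

    from twisted.internet import defer, reactor, task
    from twisted.web.client import downloadPage

    signed_urls = [
        # (b"https://bucket.s3.amazonaws.com/key?Signature=...", "local-file-name"),
    ]

    def download_all(urls, concurrency=50):
        coop = task.Cooperator()
        work = (downloadPage(url, filename) for url, filename in urls)
        return defer.DeferredList([coop.coiterate(work) for _ in range(concurrency)])

    def main():
        d = download_all(signed_urls)
        d.addBoth(lambda _: reactor.stop())

    reactor.callWhenRunning(main)
    reactor.run()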
What about threads + a queue? I love this article: Practical threaded programming with Python
Each job can be done with appropriate tools :)
You want to use Python for stress-testing S3 :), so I suggest finding a large-volume downloader program and passing links to it.
On Windows I have experience installing the ReGet program (shareware, from http://reget.com) and creating download tasks via its COM interface.
Of course, other programs with a usable interface may exist.
Regards!
