Downloading a Large Number of Files from S3 - python

What's the fastest way to get a large number of relatively small files (10-50 kB each) from Amazon S3 using Python? (On the order of 200,000 to a million files.)
At the moment I am using boto to generate signed URLs, and using PyCURL to get the files one by one.
Would some type of concurrency help? PyCurl.CurlMulti object?
I am open to all suggestions. Thanks!

I don't know anything about Python, but in general you would want to break the task down into smaller chunks so that they can be run concurrently. You could break it down by file type, or alphabetically, or something, and then run a separate script for each portion of the breakdown.

In the case of Python, since this is I/O bound, multiple threads will make good use of the CPU, but they will probably only use one core. If you have multiple cores, you might want to consider the multiprocessing module. Even then you may want to have each process use multiple threads. You would have to do some tweaking of the number of processes and threads.
If you do use multiple threads, this is a good candidate for the Queue class.
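A minimal sketch of that thread + Queue approach (untested; the worker count and the signed_urls list of (url, destination) pairs are placeholders, not from the question):

    import pycurl
    import threading
    from queue import Queue  # `import Queue` on Python 2

    NUM_WORKERS = 20  # assumed; tune against your bandwidth and CPU

    url_queue = Queue()

    def worker():
        while True:
            signed_url, local_path = url_queue.get()
            try:
                with open(local_path, "wb") as fh:
                    c = pycurl.Curl()
                    c.setopt(pycurl.URL, signed_url)
                    c.setopt(pycurl.WRITEDATA, fh)
                    c.perform()
                    c.close()
            finally:
                url_queue.task_done()

    for _ in range(NUM_WORKERS):
        t = threading.Thread(target=worker)
        t.daemon = True
        t.start()

    # signed_urls: the (signed URL, destination path) pairs generated with boto
    for pair in signed_urls:
        url_queue.put(pair)
    url_queue.join()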

You might consider using s3fs, and just running concurrent file system commands from Python.

I've been using txaws with twisted for S3 work, though what you'd probably want is just to get the authenticated URL and use twisted.web.client.downloadPage (by default it will happily go from stream to file without much interaction).
Twisted makes it easy to run at whatever concurrency you want. For something on the order of 200,000, I'd probably make a generator and use a cooperator to set my concurrency and just let the generator generate every required download request.
If you're not familiar with twisted, you'll find the model takes a bit of time to get used to, but it's oh so worth it. In this case, I'd expect it to take minimal CPU and memory overhead, but you'd have to worry about file descriptors. It's quite easy to mix in Perspective Broker and farm the work out to multiple machines should you find yourself needing more file descriptors, or if you have multiple connections over which you'd like it to pull down.
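For the Cooperator pattern described above, a rough sketch (not from the answer; the concurrency value and file naming are made up, and error handling is omitted):

    from twisted.internet import defer, reactor, task
    from twisted.web.client import downloadPage

    CONCURRENCY = 50  # assumed; bounded mostly by file descriptors

    def download_jobs(signed_urls):
        # Yield one Deferred per download; coiterate waits on each in turn.
        for i, url in enumerate(signed_urls):
            yield downloadPage(url, "s3_object_%06d" % i)

    def run(signed_urls):
        coop = task.Cooperator()
        work = download_jobs(signed_urls)
        done = defer.DeferredList(
            [coop.coiterate(work) for _ in range(CONCURRENCY)])
        done.addBoth(lambda _: reactor.stop())
        reactor.run()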

What about thread + queue? I love this article: Practical threaded programming with Python

Each job can be done with the appropriate tools :)
You want to use Python for stress-testing S3 :), so I suggest finding a large-volume downloader program and passing the links to it.
On Windows I have experience installing the ReGet program (shareware, from http://reget.com) and creating download tasks via its COM interface.
Of course other programs with a usable interface may exist.
Regards!


Call Python tasks from Golang

I have been building a big data application for stock market analysis, about 5TB of records per day. I use Golang for data transformation/calculation and for saving to Cassandra/MySQL. Python has very good libraries for data analysis (Pandas, Spark, etc.), but there is no easy way to do multicore processing in it, and it takes a lot of time.
So, I want to call Python data analysis tasks concurrently from Golang. One way is to execute a command-line task directly, but I think there should be a more scalable solution. Maybe there is a library for communication between Golang and Python. I thought maybe I should create multiple Python Flask servers and give tasks to them. Speed is important, but I can sacrifice some of it for a concise solution. Any ideas?
Splitting your app into multiple servers, as you've suggested, carries some trade-offs.
On the plus side, splitting it up provides you with more flexibility in terms of load balancing. In other words, if your Flask servers are overburdened, you can always spin up a few more and scale horizontally with a load balancer. Of course this assumes that whatever it is you're doing on those Flask servers can be done in parallel (it depends on your actual business logic).
It also offers high-availability: you eliminate one potential single-point-of-failure.
However, this 'microservice' approach does incur some overheads:
more code to write, since now you're writing two kinds of servers
some network overhead, since now you're communicating over the network as opposed to making function calls
more machines to spin up (although you could run everything in containers and they could all be on the same machine, if you don't need the extra processing power)
You could consider using Google's protobuf to serialize/deserialize the messages. It's language-agnostic and saves some of the network overhead. It's not as easy as sending JSON, but if efficiency is paramount, it might be worth the trouble. Plus, it's supported in both Python and Go.
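If you do go the Flask-worker route, a minimal sketch of one such Python server (the endpoint name and the pandas computation are illustrative only; in practice you would run several of these behind a load balancer and call them from Go over HTTP):

    # pip install flask pandas
    from flask import Flask, jsonify, request
    import pandas as pd

    app = Flask(__name__)

    @app.route("/analyze", methods=["POST"])  # hypothetical endpoint
    def analyze():
        # Expect a JSON list of {"symbol": ..., "price": ...} records from Go.
        records = request.get_json()
        df = pd.DataFrame(records)
        result = df.groupby("symbol")["price"].mean().to_dict()
        return jsonify(result)

    if __name__ == "__main__":
        app.run(host="0.0.0.0", port=5000)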

Python multithreading - Global Interpreter Lock

The Python threading module documentation says something like this:
In CPython, due to the Global Interpreter Lock, only one thread can
execute Python code at once (even though certain performance-oriented
libraries might overcome this limitation). If you want your
application to make better use of the computational resources of
multi-core machines, you are advised to use multiprocessing. However,
threading is still an appropriate model if you want to run multiple
I/O-bound tasks simultaneously.
Can someone explain whether I can use threading module in my situation or not?
I'm going to detect the frameworks used by websites.
So here is how my app works
My MySQL database contains around 10 million domains (id, domain, frameworks)
Fetch 1000 rows from the database
Scrape domain one by one using requests module
Detect the frameworks
Update the database row with the results.
Since I have 10 million domains, it's going to take a very long time, so I would like to speed up the process by using threads.
But I'm not sure whether my app is I/O bound or not. Can someone explain?
Thank you
I guess the most time-expensive activity will be fetching all the URLs.
So the answer to your question is: yes, your app is very likely to be I/O bound.
You plan to scrape the domains one by one; this would lead to a really long processing time. You should definitely do it concurrently. One solution is described in my answer to a similar question related to scraping web sites.
Anyway, the number of URLs you have seems really large, so you might need to take advantage of splitting the work across multiple workers - for this purpose you might use e.g. the Celery framework. However, as your task is really I/O bound, you would only gain speed if your workers run on multiple computers, ideally with independent connectivity. I did a similar task on DigitalOcean machines and it worked very well.
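A minimal sketch of the threaded approach with requests (the thread count, the detect_frameworks/update_row helpers, and the batch handling are placeholders for your existing code):

    # pip install requests
    import concurrent.futures
    import requests

    NUM_THREADS = 50  # assumed; I/O-bound work tolerates many threads

    def check_domain(row):
        domain_id, domain = row
        try:
            resp = requests.get("http://" + domain, timeout=10)
            frameworks = detect_frameworks(resp.text)  # your detection logic
        except requests.RequestException:
            frameworks = []
        return domain_id, frameworks

    def process_batch(rows):
        # rows: the next 1000 (id, domain) pairs fetched from MySQL
        with concurrent.futures.ThreadPoolExecutor(NUM_THREADS) as pool:
            for domain_id, frameworks in pool.map(check_domain, rows):
                update_row(domain_id, frameworks)  # your UPDATE query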

Clarification of use-cases for Hadoop versus RabbitMQ+Celery

I know that there are similar questions to this, such as:
https://stackoverflow.com/questions/8232194/pros-and-cons-of-celery-vs-disco-vs-hadoop-vs-other-distributed-computing-packag
Differentiate celery, kombu, PyAMQP and RabbitMQ/ironMQ
but I'm asking this because I'm looking for a more particular distinction backed by a couple of use-case examples, please.
So, I'm a Python user who wants to make programs that either/both:
Are too large to
Take too long to
do on a single machine, and so I want to process them on multiple machines. I am familiar with the (single-machine) multiprocessing package in Python, and I write map-reduce-style code right now. I know that my function, for example, is easily parallelizable.
In asking my usual smart CS advice-givers, I have phrased my question as:
"I want to take a task, split it into a bunch of subtasks that are executed simultaneously on a bunch of machines, then those results to be aggregated and dealt with according to some other function, which may be a reduce, or may be instructions to serially add to a database, for example."
According to this break-down of my use-case, I think I could equally well use Hadoop or a set of Celery workers + RabbitMQ broker. However, when I ask the sage advice-givers, they respond to me as if I'm totally crazy to look at Hadoop and Celery as comparable solutions. I've read quite a bit about Hadoop, and also about Celery---I think I have a pretty good grasp on what both do---what I do not seem to understand is:
Why are they considered so separate, so different?
Given that they seem to be received as totally different technologies---in what ways? What are the use cases that distinguish one from the other or are better for one than another?
What problems could be solved with both, and what areas would it be particularly foolish to use one or the other for?
Are there possibly better, simpler ways to achieve multiprocessing-like Pool.map()-functionality to multiple machines? Let's imagine my problem is not constrained by storage, but by CPU and RAM required for calculation, so there isn't an issue in having too little space to hold the results returned from the workers. (ie, I'm doing something like simulation where I need to generate a lot of things on the smaller machines seeded by a value from a database, but these are reduced before they return to the source machine/database.)
I understand Hadoop is the big data standard, but Celery also looks well supported; I appreciate that it isn't Java (the streaming API Python has to use for Hadoop looked uncomfortable to me), so I'd be inclined to use the Celery option.
They are the same in that both can solve the problem you describe (map-reduce). They are different in that Hadoop is built entirely to solve only that use case, while Celery/RabbitMQ is built to facilitate task execution on different nodes using message passing. Celery also supports different use cases.
Hadoop solves the map-reduce problem by having a large, special filesystem from which the mapper takes its data, sends it to a bunch of map nodes, and reduces the result back to that filesystem. The advantage is that it is really fast at doing this. The downsides are that it only operates on text-based data input, that Python is not really supported, and that you can't easily do (slightly) different use cases.
Celery is a message-based task executor. In it you define tasks and group them together in a workflow (which can be a map-reduce workflow). Its advantages are that it is Python-based and that you can stitch tasks together into a custom workflow. Its disadvantages are its reliance on a single broker/result backend and its setup time.
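A rough sketch of what that looks like in Celery (the broker URL and task bodies are placeholders; the map-reduce-style fan-out/fan-in uses a chord):

    # pip install celery -- assumes a RabbitMQ broker is running locally
    from celery import Celery, chord

    app = Celery("sim", broker="amqp://guest@localhost//", backend="rpc://")

    @app.task
    def simulate(seed):
        # placeholder for the per-node "map" work
        return seed * seed

    @app.task
    def aggregate(results):
        # placeholder for the "reduce" step
        return sum(results)

    # Fan 10,000 simulations out across the workers, then reduce the results.
    job = chord(simulate.s(seed) for seed in range(10000))(aggregate.s())
    print(job.get())
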
So if you have a couple of GBs' worth of logfiles, don't mind writing Java, and have some servers to spare that are exclusively used to run Hadoop, use that. If you want flexibility in running workflowed tasks, use Celery. Or.....
Yes! There is a new project from one of the companies that helped create the messaging protocol AMQP that is used by RabbitMQ (and others). It is called ZeroMQ and it takes distributed messaging/execution to the next level by, strangely, going down a level in abstraction compared to Celery. It defines sockets that you can link together in various ways to create messaging links between nodes. Anything you want to do with these messages is up to you to write. Although this might sound like "what good is a thin wrapper around a socket", it is actually at the right level of abstraction. Right now at our company we are factoring out all our Celery messaging and rebuilding it with ZeroMQ. We found that Celery is just too opinionated about how tasks should be executed and that the setup/config in general is a pain. Also, that broker in the middle that has to handle all traffic was becoming too much of a bottleneck.
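To give a feel for that abstraction level, here is a minimal PUSH/PULL pipeline with pyzmq (ports and message format are made up; everything beyond moving the messages is yours to write):

    # pip install pyzmq
    import zmq

    def producer():
        ctx = zmq.Context()
        sender = ctx.socket(zmq.PUSH)
        sender.bind("tcp://*:5557")           # assumed port
        for seed in range(10000):
            sender.send_json({"seed": seed})  # message format is up to you

    def worker():
        # Run as many of these as you like, on any machine that can reach the producer.
        ctx = zmq.Context()
        receiver = ctx.socket(zmq.PULL)
        receiver.connect("tcp://localhost:5557")
        results = ctx.socket(zmq.PUSH)
        results.connect("tcp://localhost:5558")  # assumed results sink
        while True:
            msg = receiver.recv_json()
            results.send_json({"result": msg["seed"] ** 2})
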
To summarize:
Count the occurrences of "the" in a book with as little programming as possible and lots of setup/config time: Hadoop
Create atomic tasks and be able to have them work together without too much programming and a lot of setup/config time: Celery
Have complete control over what to do with your messages and how to program them, with almost no setup/config time: ZeroMQ
Have pain with no setup/config time: Sockets

A good persistent synchronous queue in python

I don't immediately care about FIFO or FILO options, but they might be nice in the future.
What I'm looking for is a nice, fast, simple way to store data on disk (at most a gig of data, or tens of millions of entries) that can be get and put by multiple processes. The entries are just simple 40-byte strings, not Python objects. I don't really need all the functionality of shelve.
I've seen this http://code.activestate.com/lists/python-list/310105/
It looks simple. It needs to be upgraded to the new Queue version.
Wondering if there's something better? I'm concerned that in the event of a power interruption, the entire pickled file becomes corrupt instead of just one record.
Try using Celery. It's not pure Python, as it uses RabbitMQ as a backend, but it's reliable, persistent and distributed, and, all in all, far better than using files or a database in the long run.
I think that PyBSDDB is what you want. You can choose a queue as the access type. PyBSDDB is a Python module based on Oracle Berkeley DB.
It has synchronous access and can be accessed from different processes, although I don't know whether that is possible from the Python bindings. Regarding multiple processes writing to the DB, I found this thread.
This is a very old question, but persist-queue seems to be a nice tool for this kind of task.
persist-queue implements a file-based queue and a series of
sqlite3-based queues. The goal is to achieve the following requirements:
Disk-based: each queued item should be stored in disk in case of any crash.
Thread-safe: can be used by multi-threaded producers and multi-threaded consumers.
Recoverable: Items can be read after process restart.
Green-compatible: can be used in greenlet or eventlet environment.
By default, persist-queue uses the pickle object serialization module to
support object instances. Most built-in types, like int, dict, and list, can
be persisted by persist-queue directly; to support customized objects,
please refer to Pickling and unpickling extension types (Python 2) and
Pickling Class Instances (Python 3)
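A minimal usage sketch with the SQLite-backed queue (the path is arbitrary):

    # pip install persist-queue
    from persistqueue import SQLiteQueue

    q = SQLiteQueue("/tmp/myqueue", auto_commit=True)  # assumed path
    q.put("a 40 byte string goes here")

    item = q.get()  # items live in SQLite, so they survive a crash or restart
    print(item)
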
Is using files not working?...
Use a journaling file system to recover from power interruptions. That's their purpose.

Need CGI (or another solution compatible with IIS 7) to handle *massive* uploads

We need to handle massive file uploads without spending resources on an IIS 7 server. To emphasize how light-weight this needs to be, let's say that we need to handle file uploads of sizes that are completely insane, like 100GB uploads, or something that can continue running for an extremely long time without consuming additional resources. Basically we need something that gives us control over the reception of the file from the moment it starts to the moment it ends.
A bit of background:
We're using ColdFusion as the server-side processor, but it has failed us when handling uploads beyond about 1GB and we've exhausted our configuration options. There's a long story behind that, but essentially, if a .cfm page (ColdFusion) is the destination of the file upload and it goes over about 1GB, it gives a 503 error... even if the target file doesn't exist. So clearly too much is going on merely by telling the server that we intend to process the file with a .cfm page.
We suspect that this is due to Java limitations because the server (or really, the workstation in this case) does not show any signs of load on CPU or memory. Since we have limited memory and this website is intended for a lot of concurrent uploads, we can't trust simply raising the virtual machine memory usage, especially because that simply doesn't work currently, even for a single connection... let alone the hundreds of concurrent connections we expect when we go live.
So we're down to writing a specialized solution using CGI that will handle file uploads only. Basically, we need control on the server side that we don't get with ColdFusion or ASP.NET, because those technologies do so many things on their own, behind the scenes, without giving us the control we need. They always end up spending too many resources one way or the other, for an arguably obvious reason: what we're trying to do is completely insane and not the intended function of those technologies. That's why we want a specialized uploader through CGI that bypasses all that ColdFusion/ASP.NET magic that keeps getting in the way, hoping it gives us the control we need.
But before we spent countless hours on this, I figured I'd ask around and see if anyone knows of a proper solution to this problem that might be viable in our case.
The only real restriction here is that it has to be CGI, and it has to run on IIS 7, therefore a Windows "Server" environment. We're fine with it being written in Python, Perl, you name it... provided it can run as a CGI, but it has to run as a CGI... unless of course someone has better ideas on how to do this.
So the magic question is: are there CGI solutions out there that already do this, or are we stuck with writing it on our own, hoping that the reason no one else has done it already is something other than it being impossible?
Thanks in advance.
You're not going to get reliable multi-GB uploads from a dumb client (e.g. a browser and standard upload behaviour). Been there, done that, written commercial digital asset management solutions handling huge files.
The key to any degree of reliability in this scenario is chunking - you need to be able to chunk the upload, send each chunk as a discrete file, and re-assemble it server side.
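As a rough illustration of the server side of that chunking (not from the answer: the form field names, chunk numbering, and upload directory are all assumptions, and a real implementation would need locking and cleanup), a Python CGI receiver might look like:

    #!/usr/bin/env python
    # Hypothetical sketch: receive one chunk per request, reassemble when done.
    import cgi
    import os
    import shutil
    import sys

    UPLOAD_DIR = r"C:\uploads"  # assumed destination on the IIS box

    def main():
        form = cgi.FieldStorage()
        name = os.path.basename(form.getfirst("filename", "upload.bin"))
        index = int(form.getfirst("chunk_index", 0))
        total = int(form.getfirst("total_chunks", 1))

        # Stream this chunk straight to disk.
        part = os.path.join(UPLOAD_DIR, "%s.part%06d" % (name, index))
        with open(part, "wb") as out:
            shutil.copyfileobj(form["chunk"].file, out, 64 * 1024)

        # Once every part is present, stitch them together in order.
        parts = [os.path.join(UPLOAD_DIR, "%s.part%06d" % (name, i))
                 for i in range(total)]
        if all(os.path.exists(p) for p in parts):
            with open(os.path.join(UPLOAD_DIR, name), "wb") as final:
                for p in parts:
                    with open(p, "rb") as src:
                        shutil.copyfileobj(src, final)
                    os.remove(p)

        sys.stdout.write("Content-Type: text/plain\r\n\r\nok\n")

    if __name__ == "__main__":
        main()
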
What are your client restrictions, though (if any)? Can you use a Java applet? Could you even have a client-side app?
One possible starting point for a browser-based solution would be the JUpload open-source project, but there are plenty of others.
You want WebDAV, not CGI. It provides all the nice bits that make file transfers not suck, like resuming and pausing.
The Windows TCP stack is limited to 4GB file uploads. Any more than that is not possible.
