Why is Theano (much) slower on Windows than on Linux?

I implemented a recursive autoencoder with Theano and tested it on both Linux and Windows. It took ~3 hours and 2.3 GB of memory on Linux, but ~9 hours and 0.5 GB of memory on Windows, with config.allow_gc=True in both cases.
It could be a Python issue, as discussed in the thread: Why is python so much slower on windows?
Is there any specific setting in Theano that could slow things down on Windows as well?
Thanks,
Ya

It could be that they use different BLAS libraries. From memory, an autoencoder's bottleneck is the matrix product, which calls BLAS, and different BLAS implementations can differ in speed by up to 10x.
So check whether you used the same BLAS on both systems. I would recommend installing Python via the EPD/Canopy or Anaconda distributions: their non-free versions link against a good BLAS, which Theano reuses, and the non-free version is free for academics.
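To compare the two setups, you can inspect which BLAS each installation is linked against. A minimal sketch (the Theano config attribute name may vary across versions; blas.ldflags is the usual flag):

# Run this on both machines and compare the output.
import numpy as np
import theano

np.show_config()                   # the BLAS/LAPACK libraries NumPy was built against
print(theano.config.blas.ldflags)  # the linker flags Theano uses for its BLAS calls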

Related

Use GPU for PIL in python on mac (macOS Catalina)

I optimized my code to process images with Pillow. It already uses all available resources to run as fast as possible; only the GPU could make it faster. I haven't found any solution besides CUDA, and that won't work on Catalina. Is there any way to use my GPU (NVIDIA GeForce GT 750M 2 GB / Intel Iris Pro 1536 MB) to make the processing more efficient?
Thanks for your help!
Actually, there is no way to do that with Pillow! If you need better speed, you can use ImageMagick (with Wand as the Python wrapper) or GraphicsMagick (with pgmagick as the Python wrapper). If you need the GPU, ImageMagick offers some options to use it where possible (I am not sure about GraphicsMagick), but that is neither as efficient nor as complete as using CUDA or OpenCL directly. I recommend Vulkan if you need better results and cross-platform support (NVIDIA and AMD; macOS, Linux, Windows...).
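For reference, a minimal sketch of what the Wand route looks like (assuming ImageMagick is installed on the system; 'input.jpg' and 'output.jpg' are placeholder paths):

# Resize an image via ImageMagick through the Wand bindings.
# If ImageMagick was built with OpenCL support, some operations may use the GPU.
from wand.image import Image

with Image(filename='input.jpg') as img:
    img.resize(800, 600)
    img.save(filename='output.jpg')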

Matrix calculation using gpu

Today I was wondering whether it is possible to do matrix calculations on the GPU instead of the CPU, since I know a GPU is designed to do them faster than a CPU.
I searched the net and found mentions of matrix calculation on the GPU with various Python libraries, but my question is: is there documentation that describes how we should write code to communicate with a GPU?
I'm asking because I want to develop my own implementation, to better understand how GPUs work and to try something different.
Thanks to all.
I solved that problem with OpenCL.
OpenCL is a standard API that each GPU vendor implements on its own; NVIDIA, for example, supports OpenCL (among other features) through its CUDA stack.
Here is a good guide to get started.
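To give a concrete feel for how Python code communicates with the GPU, here is a minimal PyOpenCL sketch (pip install pyopencl). The naive matrix-multiply kernel is for illustration, not performance:

import numpy as np
import pyopencl as cl

N = 256
a = np.random.rand(N, N).astype(np.float32)
b = np.random.rand(N, N).astype(np.float32)
c = np.empty_like(a)

ctx = cl.create_some_context()      # pick an OpenCL device (a GPU if available)
queue = cl.CommandQueue(ctx)

kernel_src = """
__kernel void matmul(__global const float *a, __global const float *b,
                     __global float *c, const int n)
{
    int row = get_global_id(0);
    int col = get_global_id(1);
    float acc = 0.0f;
    for (int k = 0; k < n; ++k)
        acc += a[row * n + k] * b[k * n + col];
    c[row * n + col] = acc;
}
"""
prg = cl.Program(ctx, kernel_src).build()

mf = cl.mem_flags
a_buf = cl.Buffer(ctx, mf.READ_ONLY | mf.COPY_HOST_PTR, hostbuf=a)
b_buf = cl.Buffer(ctx, mf.READ_ONLY | mf.COPY_HOST_PTR, hostbuf=b)
c_buf = cl.Buffer(ctx, mf.WRITE_ONLY, c.nbytes)

prg.matmul(queue, (N, N), None, a_buf, b_buf, c_buf, np.int32(N))
cl.enqueue_copy(queue, c, c_buf)    # copy the result back to host memory

assert np.allclose(c, a @ b, atol=1e-3)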

Prebuilt numpy with BLAS/ATLAS?

I'm implementing a real-time LMS algorithm, and numpy.dot takes more time than my sampling period, so I need NumPy to be faster (my arrays are 1-D and 100 elements long).
I've read about building NumPy with ATLAS and the like, but I've never done such a thing and spent a whole day trying, with zero success...
Can someone explain why there aren't builds with ATLAS included? Can anyone provide me with one? Is there any other way to speed up the dot product?
I've tried Numba and scipy.linalg.gemm_dot, but neither of them seemed to speed things up.
My system is Windows 8.1 with an Intel processor.
If you download the official binaries, they should come linked with ATLAS. If you want to make sure, check the output of np.show_config(). The problem is that ATLAS (Automatically Tuned Linear Algebra Software) tries many different combinations and algorithms and keeps the best ones at compile time. So when you run a precompiled ATLAS, you are running a build optimised for a computer different from yours.
So, your options to improve dot are:
Compile ATLAS yourself. On Windows it may be a bit challenging, but it is doable. Note: you must use the same compiler used to compile Python. That is, if you decide to go for MinGW, you need to get Python compiled with MinGW, or build it yourself.
Try Christoph Gohlke's NumPy builds. They are linked against MKL, which is much faster than ATLAS (and does all its optimisations at run time).
Try Continuum Analytics' Conda with Accelerate (also linked with MKL). It costs money unless you are an academic. On Linux, Conda is slower than the system Python because they have to use an old compiler for compatibility purposes; I don't know whether that is the case on Windows.
Use Linux. Your Python life will be much easier, as setting up the system to compile things is very easy. Also, setting up Cython is simple, and then you can compile your whole algorithm and probably get a further speed-up.
The note regarding Cython is valid for Windows too; it is just more difficult to get working. I tried a few years ago (when I used Windows) and failed after a few days; I don't know if the situation has improved.
Alternative:
You are doing the dot product of two vectors. In that case, np.dot is probably not the most efficient way. I would give spelling it out in plain NumPy a shot, (vec1 * vec2).sum() (this could be very good for Numba, since it is an expression it can actually optimise), or use numexpr:
ne.evaluate('sum(vec1 * vec2)')
Numexpr will also parallelise the expression automatically.
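For a length-100 dot product, a minimal sketch of the three alternatives side by side (numexpr imported as ne, as is conventional):

import numpy as np
import numexpr as ne

vec1 = np.random.rand(100)
vec2 = np.random.rand(100)

d1 = np.dot(vec1, vec2)               # BLAS-backed dot
d2 = (vec1 * vec2).sum()              # plain NumPy, avoids the BLAS call overhead
d3 = ne.evaluate('sum(vec1 * vec2)')  # numexpr, parallelised automatically

assert np.allclose([d1, d2, float(d3)], d1)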

Multithreading on numpy/pandas matrix multiplication?

I really want to know how to use multi-core processing for matrix multiplication with numpy/pandas.
What I'm trying is this:
M = pd.DataFrame(...) # super high dimensional square matrix.
A = M.T.dot(M)
This takes a huge amount of processing time because of the many sums of products, and multithreading seems a natural fit for huge matrix multiplications. I googled carefully, but I can't find how to do that with numpy/pandas. Do I need to write multithreaded code manually with one of Python's built-in threading libraries?
In NumPy, multithreaded matrix multiplication can be achieved with a multithreaded implementation of BLAS, the Basic Linear Algebra Subroutines. You need to:
Have such a BLAS implementation; OpenBLAS, ATLAS and MKL all include multithreaded matrix multiplication.
Have a NumPy compiled to use such an implementation.
Make sure the matrices you're multiplying both have a dtype of float32 or float64 (and meet certain alignment restrictions; I recommend using NumPy 1.7.1 or later where these have been relaxed).
A few caveats apply:
Older versions of OpenBLAS, when compiled with GCC, run into trouble in programs that use multiprocessing, which includes most applications that use joblib. In particular, they will hang. The reason is a bug (or lack of a feature) in GCC. A patch has been submitted but not yet included in the mainline sources.
The ATLAS packages you find in a typical Linux distro may or may not be compiled to use multithreading.
As for Pandas: I'm not sure how it does dot products. Convert to NumPy arrays and back to be sure.
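A minimal sketch of that round trip (assuming NumPy is linked against a multithreaded BLAS; np.show_config() tells you which one):

import numpy as np
import pandas as pd

np.show_config()                 # check which BLAS NumPy was built against

M = pd.DataFrame(np.random.rand(2000, 2000))
A = M.values.T.dot(M.values)     # plain ndarrays -> multithreaded BLAS GEMM
A = pd.DataFrame(A)              # back to a DataFrame if needed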
First of all, I would also propose converting to NumPy arrays and using NumPy's dot function. If you have access to an MKL build, which is more or less the fastest implementation at the moment, you should try setting the environment variable OMP_NUM_THREADS. This should activate the other cores of your system; on my Mac it seems to work properly. In addition, I would try np.einsum, which seems to be faster than np.dot.
But pay attention! If you have compiled a multithreaded library that uses OpenMP for parallelisation (like MKL), bear in mind that the "default gcc" on all Apple systems is not GCC but Clang/LLVM, and Clang cannot build with OpenMP support at the moment, unless you use the OpenMP trunk, which is still experimental. So you have to install the Intel compiler or another compiler that supports OpenMP.
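A sketch of that suggestion (OMP_NUM_THREADS=4 is just an example value; the variable must be set before NumPy is imported):

import os
os.environ['OMP_NUM_THREADS'] = '4'   # set before importing numpy

import numpy as np

M = np.random.rand(3000, 3000)
A1 = M.T.dot(M)                       # BLAS GEMM, multithreaded if the BLAS allows
A2 = np.einsum('ji,jk->ik', M, M)     # einsum equivalent of M.T.dot(M)

assert np.allclose(A1, A2)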

Can I link numpy with AMD's gpu accelerated blas library

I recognized that NumPy can be linked against BLAS, and I wondered: why not use a GPU-accelerated BLAS library?
Has anyone done so?
Update (2014-05-22)
AMD has produced a beta release of the AMD Core Math Library (ACML) version 6.0 that can offload FFT and BLAS functions to a GPU by using clMath internally. The announcement is here: ACML Beta 6.0 Release Leverages the Power of Heterogeneous Compute. The caveat is that input data must be transferred from CPU to GPU and output data returned to the CPU on each BLAS or FFT call. AMD therefore provides a set of scripts for tuning the threshold at which a problem is large enough that ACML will use the GPU instead of the CPU.
For the sake of completeness, I'll also mention that Nvidia supports similar functionality with its nvBLAS library but that relies on cuBLAS and CUDA so it won't work on anything but Nvidia GPUs.
Original answer
Unfortunately, AMD's GPU-accelerated BLAS library cannot be linked directly to NumPy or any other application expecting a standard CPU-based BLAS library. The reason is that existing GPU BLAS libraries all require the matrices to be copied to the GPU before the BLAS functions are called, so someone would have to modify NumPy to do this copying.
Edit: CLyther looks like it can replace some of what NumPy does, converting everything to OpenCL. See here: http://srossross.github.io/Clyther/for_numpy_users.html
If memory serves, PyCUDA at least, and probably PyOpenCL as well, can work with NumPy.
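A minimal PyCUDA sketch of that interop (requires an NVIDIA GPU, CUDA, and pycuda; gpuarray covers element-wise operations and vector dot products, while full GEMM needs an extra library such as scikit-cuda):

import numpy as np
import pycuda.autoinit             # initialises the CUDA context
import pycuda.gpuarray as gpuarray

a = np.random.rand(100).astype(np.float32)
b = np.random.rand(100).astype(np.float32)

a_gpu = gpuarray.to_gpu(a)         # host -> device copy
b_gpu = gpuarray.to_gpu(b)
d = gpuarray.dot(a_gpu, b_gpu)     # inner product computed on the GPU

print(float(d.get()), np.dot(a, b))   # device -> host copy for comparison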
