Optimising string generation and testing - Python

I am trying to run a simulation to test the average Levenshtein distance between random
binary strings.
To speed it up I am using this C extension.
My code is as follows.
import random
from Levenshtein import distance

for i in xrange(20):
    total = 0
    for j in xrange(1000):
        str1 = ''.join([random.choice("01") for x in xrange(2**i)])
        str2 = ''.join([random.choice("01") for x in xrange(2**i)])
        total += distance(str1, str2)
    print total / float(1000 * 2**i)
I think the slowest part is now the string generation. Can that be sped up somehow, or is there some other speed-up I could try?
I also have 8 cores, but I don't know how hard it would be to take advantage of those.
Unfortunately I can't use PyPy because of the C extension.

The following solution should be way better in terms of runtime.
It generates a number with 2**i random bits (random.getrandbits), converts it to a string of the number's binary representation (bin), takes everything from the third character to the end (because the result of bin is prefixed with '0b'), and pads the resulting string with leading zeros (zfill) to the length you want.
str1 = bin(random.getrandbits(2**i))[2:].zfill(2**i)
Quick timing for your maximum string length of 2**20:
>>> from timeit import Timer
>>> t=Timer("''.join(random.choice('01') for x in xrange(2**20))", "import random")
>>> sorted(t.repeat(10,1))
[0.7849910731831642, 0.787418033587528, 0.7894113893237318, 0.789840397476155, 0.7907980049587877, 0.7908638883536696, 0.7911707057912736, 0.7935838766477445, 0.8014726470912592, 0.8228315074311467]
>>> t=Timer("bin(random.getrandbits(2**20))[2:].zfill(2**20)", "import random")
>>> sorted(t.repeat(10,1))
[0.005115922216191393, 0.005215130351643893, 0.005234282501078269, 0.005451850921190271, 0.005531523863737675, 0.005627284612046424, 0.005746794025981217, 0.006217553864416914, 0.014556016781853032, 0.014710766150983545]
That's a speedup of a factor of 150 on average.
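For completeness, here is the simulation loop from the question rewritten with this generator (a minimal sketch; the builtin-shadowing name sum is renamed to total):

import random
from Levenshtein import distance

for i in xrange(20):
    n = 2 ** i
    total = 0
    for j in xrange(1000):
        str1 = bin(random.getrandbits(n))[2:].zfill(n)
        str2 = bin(random.getrandbits(n))[2:].zfill(n)
        total += distance(str1, str2)
    print total / float(1000 * n)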

You can create a Python string using the Python/C API, which will be significantly faster than any method that exclusively uses Python, since Python itself is implemented in C. Performance will likely depend primarily on the efficiency of the random number generator. If you are on a system with a reasonable random(3) implementation, such as the one in glibc, an efficient implementation of random string generation would look like this:
#include <Python.h>

/* gcc -shared -fpic -O2 -I/usr/include/python2.7 -lpython2.7 rnds.c -o rnds.so */

static PyObject *rnd_string(PyObject *ignore, PyObject *args)
{
    const char choices[] = {'0', '1'};
    PyObject *s;
    char *p, *end;
    int size;

    if (!PyArg_ParseTuple(args, "i", &size))
        return NULL;

    /* start with a two-char string to avoid the empty string singleton */
    if (!(s = PyString_FromString("xx")))
        return NULL;
    _PyString_Resize(&s, size);
    if (!s)
        return NULL;

    p = PyString_AS_STRING(s);
    end = p + size;
    for (;;) {
        unsigned long rnd = random();
        int i = 31;   /* random() provides 31 bits of randomness */
        while (i-- > 0 && p < end) {
            *p++ = choices[rnd & 1];
            rnd >>= 1;
        }
        if (p == end)
            break;
    }
    return s;
}

static PyMethodDef rnds_methods[] = {
    {"rnd_string", rnd_string, METH_VARARGS},
    {NULL, NULL, 0, NULL}
};

PyMODINIT_FUNC initrnds(void)
{
    Py_InitModule("rnds", rnds_methods);
}
Testing this code with halex's benchmark shows that it is 280x faster than the original code, and 2.3x faster than halex's code (on my machine):
# the above code
>>> t1 = Timer("rnds.rnd_string(2**20)", "import rnds")
>>> sorted(t1.repeat(10,1))
[0.0029861927032470703, 0.0029909610748291016, ...]
# original generator
>>> t2 = Timer("''.join(random.choice('01') for x in xrange(2**20))", "import random")
>>> sorted(t2.repeat(10,1))
[0.8376679420471191, 0.840252161026001, ...]
# halex's generator
>>> t3 = Timer("bin(random.getrandbits(2**20-1))[2:].zfill(2**20-1)", "import random")
>>> sorted(t3.repeat(10,1))
[0.007007122039794922, 0.007027149200439453, ...]
Adding C code to a project is a complication, but for a 280x speedup of a critical operation, it might well be worth it.
For further efficiency improvements, look into faster RNGs, and invoke them from separate threads so that random number generation is parallelized. The latter would benefit from a lock-free synchronization mechanism to make sure that inter-thread communication doesn't bog down the otherwise fast generation process.
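Since the question mentions 8 spare cores, here is a minimal multiprocessing sketch of the simulation built on the rnds module above (my addition; note that forked workers inherit the same random(3) state, so in practice each worker would need its own seed):

import multiprocessing
from Levenshtein import distance
import rnds  # the C extension defined above

def trial(n):
    # One trial: the distance between two fresh random strings of length n.
    return distance(rnds.rnd_string(n), rnds.rnd_string(n))

if __name__ == '__main__':
    pool = multiprocessing.Pool(8)
    for i in xrange(20):
        n = 2 ** i
        total = sum(pool.map(trial, [n] * 1000))
        print total / float(1000 * n)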

Related

C++ vs python numpy complex arrays performance

Can anyone tell me why these two programs have a huge difference in run time? I am simply multiplying two large complex arrays and comparing the time in Python (NumPy) and C++. I am using the -O3 flag with g++ to compile the C++ code. I find that the huge difference comes only when I use complex floats in C++; it's more than 20 times faster in NumPy.
Python code:
import numpy as np
import time

if __name__ == "__main__":
    # check the data type is the same
    a = np.zeros((1), dtype=np.complex128)
    a[0] = np.complex(3.4e38, 3.5e38)
    print(a)
    b = np.zeros((1), dtype=np.complex64)
    b[0] = np.complex(3.4e38, 3.5e38)
    print(b)  # imaginary part is infinity

    length = 5000
    A = np.ones((length), dtype=np.complex64) * np.complex(1, 1)
    B = np.ones((length), dtype=np.complex64) * np.complex(1, 0)

    num_iterations = 1000000
    time1 = time.time()
    for _ in range(num_iterations):
        A *= B
    time2 = time.time()
    duration = ((time2 - time1) * 1e6) / num_iterations
    print(duration)
C++ code:
#include <iostream>
#include <complex>
#include <chrono>
using namespace std::chrono;
using namespace std;

int main()
{
    // check the data type is the same
    complex<double> a = complex<double>(3.4e38, 3.5e38);
    cout << a << endl;
    complex<float> b = complex<float>(3.4e38, 3.5e38);
    cout << b << endl; // imaginary part is infinity

    const int length = 5000;
    static complex<float> A[length];
    static complex<float> B[length];
    for (int i = 0; i < length; i++) {
        A[i] = complex<float>(1, 1);
        B[i] = complex<float>(1, 0);
    }

    int num_iterations = 1000000;
    auto time1 = high_resolution_clock::now();
    for (int k = 0; k < num_iterations; k++)
        for (int i = 0; i < length; i++)
            A[i] *= B[i];
    auto time2 = high_resolution_clock::now();
    auto duration = duration_cast<microseconds>(time2 - time1);
    cout << "average time:" << duration.count() / num_iterations << endl;
}
The C++ compiler is doing some extra checking gymnastics for you in order to properly handle NaNs and other such "standard" behavior.
If you add the -ffast-math optimization flag, you'll get more sane speed, but less "standard" behavior. e.g. complex<float>(inf,0)*complex<float>(inf,0) won't be evaluated as complex<float>(inf,0). Do you really care?
numpy is doing what makes sense, not hindered by a narrow reading of the C++ standard.
For example, until very recent g++ versions, the second of the following functions is much faster unless -ffast-math is used.
complex<float> mul1(complex<float> a, complex<float> b)
{
    return a * b;
}

complex<float> mul2(complex<float> a, complex<float> b)
{
    float *fa = reinterpret_cast<float*>(&a);
    const float *fb = reinterpret_cast<float*>(&b);
    float cr = fa[0]*fb[0] - fa[1]*fb[1];
    float ci = fa[0]*fb[1] + fa[1]*fb[0];
    return complex<float>(cr, ci);
}
You can experiment with this at https://godbolt.org/z/kXPgCh to see the assembly output and how the first function defaults to calling __mulsc3.
P.S. Ready for another wave of anger at what the C++ standard says about std::complex<T>? Can you guess how std::norm must be implemented by default? Play along. Follow the link and spend ten seconds thinking about it.
Spoiler: it probably uses a sqrt and then squares it.
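As a quick check of the point above (my addition, not part of the original answer), NumPy's complex multiply follows the naive formula with no special-case recovery, much like the -ffast-math behaviour described:

import numpy as np

# Naive multiply: real = inf*inf - 0*0 = inf, imag = inf*0 + 0*inf = nan,
# so this typically prints [inf+nanj].
a = np.array([complex(np.inf, 0)], dtype=np.complex64)
print(a * a)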

Why is my Python NumPy code faster than C++?

Why is this Python NumPy code,
import numpy as np
import time

k_max = 40000
N = 10000

data = np.zeros((2, N))
coefs = np.zeros((k_max, 2), dtype=float)

t1 = time.time()
for k in xrange(1, k_max + 1):
    cos_k = np.cos(k * data[0, :])
    sin_k = np.sin(k * data[0, :])
    coefs[k-1, 0] = (data[1, -1] - data[1, 0]) + np.sum(data[1, :-1] * (cos_k[:-1] - cos_k[1:]))
    coefs[k-1, 1] = np.sum(data[1, :-1] * (sin_k[:-1] - sin_k[1:]))
t2 = time.time()

print('Time:')
print(t2 - t1)
faster than the following C++ code?
#include <cstdio>
#include <iostream>
#include <cmath>
#include <time.h>
using namespace std;

// consts
const unsigned int k_max = 40000;
const unsigned int N = 10000;

int main()
{
    time_t start, stop;
    double diff;
    // table with data
    double data1[N];
    double data2[N];
    // table of results
    double coefs1[k_max];
    double coefs2[k_max];

    // main loop
    time(&start);
    for (unsigned int j = 1; j < N; j++)
    {
        for (unsigned int i = 0; i < k_max; i++)
        {
            coefs1[i] += data2[j-1] * (cos((i+1) * data1[j-1]) - cos((i+1) * data1[j]));
            coefs2[i] += data2[j-1] * (sin((i+1) * data1[j-1]) - sin((i+1) * data1[j]));
        }
    }
    // end of main loop
    time(&stop);

    // speed result
    diff = difftime(stop, start);
    cout << "Time: " << diff << " seconds";
    return 0;
}
The first one shows: "Time: 8 seconds"
while the second: "Time: 11 seconds"
I know that NumPy is written in C, but I would still think that C++ example would be faster. Am I missing something? Is there a way to improve the C++ code (or the Python one)?
Version 2 of the code
I have changed the C++ code (dynamical tables to static tables) as suggested in one of the comments. The C++ code is faster now, but still much slower than the Python version.
Version 3 of the code
I have changed from debug to release mode and increased 'k' from 4000 to 40000. Now NumPy is just slightly faster (8 seconds to 11 seconds).
I found this question interesting, because every time I encountered a similar topic about the speed of NumPy (compared to C/C++) there were always answers like "it's a thin wrapper, its core is written in C, so it's fast", but this doesn't explain why C should be slower than C with an additional layer (even a thin one).
The answer is: your C++ code is not slower than your Python code when properly compiled.
I've done some benchmarks, and at first it seemed that NumPy is surprisingly faster. But I forgot about optimizing the compilation with GCC.
I've computed everything again and also compared the results with a pure C version of your code. I am using GCC version 4.9.2 and Python 2.7.9 (compiled from source with the same GCC). To compile your C++ code I used g++ -O3 main.cpp -o main; to compile my C code I used gcc -O3 main.c -lm -o main. In all examples I filled the data variables with some numbers (0.1, 0.4), as that changes the results. I also changed the np.arrays to use doubles (dtype=np.float64), because there are doubles in the C++ example. My pure C version of your code (it's similar):
#include <math.h>
#include <stdio.h>
#include <time.h>

const int k_max = 100000;
const int N = 10000;

int main(void)
{
    clock_t t_start, t_end;
    double data1[N], data2[N], seconds;
    double coefs1[k_max] = {0.0}, coefs2[k_max] = {0.0};  /* zero-initialize the accumulators */
    int z;
    for (z = 0; z < N; z++)
    {
        data1[z] = 0.1;
        data2[z] = 0.4;
    }

    int i, j;
    t_start = clock();
    for (i = 0; i < k_max; i++)
    {
        for (j = 0; j < N-1; j++)
        {
            coefs1[i] += data2[j] * (cos((i+1) * data1[j]) - cos((i+1) * data1[j+1]));
            coefs2[i] += data2[j] * (sin((i+1) * data1[j]) - sin((i+1) * data1[j+1]));
        }
    }
    t_end = clock();

    seconds = (double)(t_end - t_start) / CLOCKS_PER_SEC;
    printf("Time: %f s\n", seconds);
    return coefs1[0];  /* use the result so the loop isn't optimized away */
}
For k_max = 100000, N = 10000, the results were the following:
Python 70.284362 s
C++ 69.133199 s
C 61.638186 s
Python and C++ have basically the same time, but note that there is a Python loop of length k_max, which should be much slower than the C/C++ one. And it is.
For k_max = 1000000, N = 1000 we have:
Python 115.42766 s
C++ 70.781380 s
For k_max = 1000000, N = 100:
Python 52.86826 s
C++ 7.050597 s
So the difference increases with the ratio k_max/N, but Python is not faster even for N much bigger than k_max, e.g. k_max = 100, N = 100000:
Python 0.651587 s
C++ 0.568518 s
Obviously, the main speed difference between C/C++ and Python is in the for loop. But I wanted to find out the difference between simple operations on arrays in NumPy and in C. The advantages of using NumPy in your code consist of: 1. multiplying the whole array by a number, 2. calculating sin/cos of the whole array, 3. summing all elements of the array, instead of doing those operations on every single item separately. So I prepared two scripts to compare only these operations.
Python script:
import numpy as np
from time import time

N = 10000
x_len = 100000

def main():
    x = np.ones(x_len, dtype=np.float64) * 1.2345

    start = time()
    for i in xrange(N):
        y1 = np.cos(x, dtype=np.float64)
    end = time()
    print('cos: {} s'.format(end-start))

    start = time()
    for i in xrange(N):
        y2 = x * 7.9463
    end = time()
    print('multi: {} s'.format(end-start))

    start = time()
    for i in xrange(N):
        res = np.sum(x, dtype=np.float64)
    end = time()
    print('sum: {} s'.format(end-start))

    return y1, y2, res

if __name__ == '__main__':
    main()

# results
# cos: 22.7199969292 s
# multi: 0.841291189194 s
# sum: 1.15971088409 s
C script:
#include <math.h>
#include <stdio.h>
#include <time.h>

const int N = 10000;
const int x_len = 100000;

int main()
{
    clock_t t_start, t_end;
    double x[x_len], y1[x_len], y2[x_len], res, time;
    int i, j;
    for (i = 0; i < x_len; i++)
    {
        x[i] = 1.2345;
    }

    t_start = clock();
    for (j = 0; j < N; j++)
    {
        for (i = 0; i < x_len; i++)
        {
            y1[i] = cos(x[i]);
        }
    }
    t_end = clock();
    time = (double)(t_end - t_start) / CLOCKS_PER_SEC;
    printf("cos: %f s\n", time);

    t_start = clock();
    for (j = 0; j < N; j++)
    {
        for (i = 0; i < x_len; i++)
        {
            y2[i] = x[i] * 7.9463;
        }
    }
    t_end = clock();
    time = (double)(t_end - t_start) / CLOCKS_PER_SEC;
    printf("multi: %f s\n", time);

    t_start = clock();
    for (j = 0; j < N; j++)
    {
        res = 0.0;
        for (i = 0; i < x_len; i++)
        {
            res += x[i];
        }
    }
    t_end = clock();
    time = (double)(t_end - t_start) / CLOCKS_PER_SEC;
    printf("sum: %f s\n", time);

    return y1[0], y2[0], res;  /* use the results so the loops aren't optimized away */
}
// results
// cos: 20.910590 s
// multi: 0.633281 s
// sum: 1.153001 s
Python results:
cos: 22.7199969292 s
multi: 0.841291189194 s
sum: 1.15971088409 s
C results:
cos: 20.910590 s
multi: 0.633281 s
sum: 1.153001 s
As you can see NumPy is incredibly fast, but always a bit slower than pure C.
I am actually surprised that no one mentioned linear algebra libraries like BLAS, LAPACK, MKL, and so on...
NumPy is using complex linear algebra libraries!
Essentially, NumPy is most of the time not built on pure C/C++/Fortran code; it is actually built on complex libraries that take advantage of the most performant algorithms and ideas to optimize the code. These complex libraries are hardly matched by naive implementations of classic linear algebra computations. The simplest first example of improvement is the blocking trick.
I took the following image from the CSE lab of ETH, where they compare matrix-vector multiplication for different implementations. The y-axis represents the computational throughput (in GFLOP/s); long story short, it shows how fast the computations are done. The x-axis is the dimension of the matrix.
C and C++ are fast languages, but if you actually want to match the speed of these libraries, you might have to go one step deeper and use either Fortran or intrinsics instructions (which are perhaps the closest to assembly code you can get in C++).
Consider the question Benchmarking (python vs. c++ using BLAS) and (numpy), where the very good answer from jfs observes: "There is no difference between C++ and numpy on my machine."
Some more references:
Why is a naïve C++ matrix multiplication 100 times slower than BLAS?
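To make this concrete, here is a small illustration (my addition, not from the original answer): NumPy hands matrix products to whatever BLAS it was built against, so np.dot on large float64 arrays runs at speeds a naive triple loop cannot approach.

import numpy as np

n = 1024
A = np.random.rand(n, n)
B = np.random.rand(n, n)

# This call dispatches to the BLAS dgemm routine (when NumPy is linked
# against one), which uses blocking, vectorization, and often multiple
# threads internally.
C = A.dot(B)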
On my computer, your (current) Python code runs in 14.82 seconds (yes, my computer's quite slow).
I rewrote your C++ code to something I'd consider halfway reasonable (basically, I almost ignored your C++ code and just rewrote your Python into C++). That gave me this:
#include <cstdio>
#include <iostream>
#include <cmath>
#include <chrono>
#include <vector>
#include <assert.h>

const unsigned int k_max = 40000;
const unsigned int N = 10000;

template <class T>
class matrix2 {
    std::vector<T> data;
    size_t cols;
    size_t rows;
public:
    matrix2(size_t y, size_t x) : data(x*y), cols(x), rows(y) {}

    T &operator()(size_t y, size_t x) {
        assert(x < cols);
        assert(y < rows);
        return data[y*cols + x];
    }

    T operator()(size_t y, size_t x) const {
        assert(x < cols);
        assert(y < rows);
        return data[y*cols + x];
    }
};

int main() {
    matrix2<double> data(N, 2);
    matrix2<double> coeffs(k_max, 2);

    using namespace std::chrono;
    auto start = high_resolution_clock::now();
    for (int k = 0; k < k_max; k++) {
        for (int j = 0; j < N - 1; j++) {
            coeffs(k, 0) += data(j, 1) * (cos((k + 1)*data(j, 0)) - cos((k + 1)*data(j+1, 0)));
            coeffs(k, 1) += data(j, 1) * (sin((k + 1)*data(j, 0)) - sin((k + 1)*data(j+1, 0)));
        }
    }
    auto end = high_resolution_clock::now();
    std::cout << duration_cast<milliseconds>(end - start).count() << " ms\n";
}
This ran in about 14.4 seconds, so it's a slight improvement over the Python version--but given that the Python is mostly a pretty thin wrapper around some C code, getting only a slight improvement is pretty much what we should expect.
The next obvious step would be to use multiple cores. To do that in C++, we can add this line:
#pragma omp parallel for
...before the outer for loop:
#pragma omp parallel for
for (int k = 0; k < k_max; k++) {
    for (int j = 0; j < N - 1; j++) {
        coeffs(k, 0) += data(j, 1) * (cos((k + 1)*data(j, 0)) - cos((k + 1)*data(j+1, 0)));
        coeffs(k, 1) += data(j, 1) * (sin((k + 1)*data(j, 0)) - sin((k + 1)*data(j+1, 0)));
    }
}
With OpenMP enabled on the compiler's command line (-fopenmp for GCC/Clang, /openmp for MSVC), this ran in about 4.8 seconds. If you have more than 4 cores, you can probably expect a larger improvement than that (conversely, if you have fewer than 4 cores, expect a smaller improvement, but nowadays more than 4 is a lot more common than fewer).
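The Python side can be parallelized in much the same spirit. Here is a rough multiprocessing sketch (my addition; it splits the k range across worker processes and stacks the partial results):

import numpy as np
from multiprocessing import Pool

def coefs_for_range(args):
    # Coefficients for k in (k_lo, k_hi]; mirrors the NumPy loop in the question.
    k_lo, k_hi, data = args
    out = np.zeros((k_hi - k_lo, 2))
    for k in range(k_lo + 1, k_hi + 1):
        cos_k = np.cos(k * data[0, :])
        sin_k = np.sin(k * data[0, :])
        out[k - k_lo - 1, 0] = (data[1, -1] - data[1, 0]) + np.sum(
            data[1, :-1] * (cos_k[:-1] - cos_k[1:]))
        out[k - k_lo - 1, 1] = np.sum(data[1, :-1] * (sin_k[:-1] - sin_k[1:]))
    return out

if __name__ == '__main__':
    k_max, N, workers = 40000, 10000, 4
    data = np.zeros((2, N))
    bounds = np.linspace(0, k_max, workers + 1, dtype=int)
    jobs = [(bounds[w], bounds[w + 1], data) for w in range(workers)]
    coefs = np.vstack(Pool(workers).map(coefs_for_range, jobs))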
I tried to understand your Python code and reproduce it in C++. I found that you hadn't arranged the for-loops correctly for the calculation of the coefs, and hence should switch your for-loops. If that is the case, you should have the following:
#include <iostream>
#include <cmath>
#include <time.h>

const int k_max = 40000;
const int N = 10000;

int main(int argc, char const *argv[])
{
    time_t start, stop;
    double data[2][N];
    double coefs[k_max][2];

    time(&start);
    for (int i = 0; i < k_max; ++i)
    {
        for (int j = 1; j < N; ++j)  // start at 1 so data[...][j-1] stays in bounds
        {
            coefs[i][0] += data[1][j-1] * (cos((i+1) * data[0][j-1]) - cos((i+1) * data[0][j]));
            coefs[i][1] += data[1][j-1] * (sin((i+1) * data[0][j-1]) - sin((i+1) * data[0][j]));
        }
    }
    // End of main loop
    time(&stop);

    // Speed result
    double diff = difftime(stop, start);
    std::cout << "Time: " << diff << " seconds" << std::endl;
    return 0;
}
Switching the for-loops gives me 3 seconds for the C++ code (optimized with -O3), while the Python code runs in 7.816 seconds.
The Python code can't be faster than properly coded C++ code, since NumPy is coded in C, which is often slower than C++ (C++ can do more optimizations). The two will only be close, with Python running somewhere between the same time as C++ and about twice it, when the majority of your computation happens in large operations that Python pushes off to compiled binaries: large matrix multiplication, addition, scalar-on-matrix multiplication, and the like. Most anything beyond those will perform much worse in Python.

For example, look at the Benchmarks Game, where people submit solutions to various algorithms in various languages, and the website keeps track of the fastest submissions for each (algorithm, language) pair. You can even view the source code for each submission. For most test cases, Python is 2-15 times slower than C++. That makes sense too if you do anything other than simple math operations: anything with linked lists, binary search trees, procedural code, etc. The interpreted nature of Python, combined with it storing metadata for each object (even int, double, float, etc.), significantly bogs things down in a way that no Python programmer can fix.

Convert integer to a random but deterministically repeatable choice

How do I convert an unsigned integer (representing a user ID) to a random-looking but actually deterministically repeatable choice? The choice must be selected with equal probability (irrespective of the distribution of the input integers). For example, if I have 3 choices, i.e. [0, 1, 2], the user ID 123 may always be randomly assigned choice 2, whereas the user ID 234 may always be assigned choice 1.
Cross-language and cross-platform algorithmic reproducibility is desirable. I'm inclined to use a hash function and modulo unless there is a better way. Here is what I have:
>>> num_choices = 3
>>> id_num = 123
>>> int(hashlib.sha256(str(id_num).encode()).hexdigest(), 16) % num_choices
2
I'm using the latest stable Python 3. Please note that this question is similar but not exactly identical to the related question about converting a string to a random but deterministically repeatable choice with uniform probability.
Using hash and modulo
import hashlib
def id_to_choice(id_num, num_choices):
    id_bytes = id_num.to_bytes((id_num.bit_length() + 7) // 8, 'big')
    id_hash = hashlib.sha512(id_bytes)
    id_hash_int = int.from_bytes(id_hash.digest(), 'big')  # Explicit byteorder for system-agnostic reproducibility
    choice = id_hash_int % num_choices  # Use with small num_choices only
    return choice
>>> id_to_choice(123, 3)
0
>>> id_to_choice(456, 3)
1
Notes:
- The built-in hash method must not be used because it can preserve the input's distribution, e.g. with hash(123). Alternatively, it can return values that differ when Python is restarted, e.g. with hash('123').
- For converting an int to bytes, bytes(id_num) works but is grossly inefficient as it returns an array of null bytes, and so it must not be used. Using int.to_bytes is better. Using str(id_num).encode() works but wastes a few bytes.
- Admittedly, using modulo doesn't offer exactly uniform probability,[1][2] but this shouldn't bias much for this application because id_hash_int is expected to be very large and num_choices is assumed to be small.
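If exact uniformity ever mattered, rejection sampling would remove the modulo bias. A sketch (my addition; the counter salts the hash so each retry produces a fresh value, and id_to_choice_unbiased is a hypothetical name):

import hashlib

def id_to_choice_unbiased(id_num, num_choices):
    # Largest multiple of num_choices below 2**512; hash values at or
    # above it are rejected so every residue class is equally likely.
    limit = (1 << 512) - ((1 << 512) % num_choices)
    counter = 0
    while True:
        id_bytes = id_num.to_bytes((id_num.bit_length() + 7) // 8 or 1, 'big')
        salted = id_bytes + counter.to_bytes(2, 'big')
        val = int.from_bytes(hashlib.sha512(salted).digest(), 'big')
        if val < limit:
            return val % num_choices
        counter += 1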
Using random
The random module can be used with id_num as its seed, while addressing concerns surrounding both thread safety and continuity. Using randrange in this manner is comparable to and simpler than hashing the seed and taking modulo.
With this approach, not only is cross-language reproducibility a concern, but reproducibility across multiple future versions of Python could also be a concern. It is therefore not recommended.
import random

def id_to_choice(id_num, num_choices):
    localrandom = random.Random(id_num)
    choice = localrandom.randrange(num_choices)
    return choice
>>> id_to_choice(123, 3)
0
>>> id_to_choice(456, 3)
2
An alternative is to encrypt the user ID. If you keep the encryption key the same, then each input number will encrypt to a different output number, up to the block size of the cipher you use. DES uses 64-bit blocks, which cover IDs 0 to 18446744073709551615. That will give a random-appearing replacement for the user ID, which is guaranteed not to give two different user IDs the same 'random' number, because encryption is a one-to-one permutation of the block values.
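A minimal sketch of this idea (my addition, using the third-party cryptography package, which exposes TripleDES rather than single DES; the key is a placeholder):

from cryptography.hazmat.backends import default_backend
from cryptography.hazmat.primitives.ciphers import Cipher, algorithms, modes

KEY = b'0123456789abcdef01234567'  # placeholder; keep it fixed and secret

def id_to_pseudorandom_id(id_num):
    # Encrypting the 8-byte ID block is a one-to-one permutation of
    # 64-bit values, so distinct IDs can never collide.
    enc = Cipher(algorithms.TripleDES(KEY), modes.ECB(),
                 backend=default_backend()).encryptor()
    block = enc.update(id_num.to_bytes(8, 'big')) + enc.finalize()
    return int.from_bytes(block, 'big')

Taking the result modulo num_choices then yields the deterministic choice.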
I apologize that I don't have a Python implementation, but I do have a very clear, readable and self-evident implementation in Java, which should be easy to translate into Python with minimal effort. The following produce long, predictable, evenly distributed sequences covering the whole range except zero:
XorShift (http://www.arklyffe.com/main/2010/08/29/xorshift-pseudorandom-number-generator)
public int nextQuickInt(int number) {
    number ^= number << 11;
    number ^= number >>> 7;
    number ^= number << 16;
    return number;
}

public short nextQuickShort(short number) {
    number ^= number << 11;
    number ^= number >>> 5;
    number ^= number << 3;
    return number;
}

public long nextQuickLong(long number) {
    number ^= number << 21;
    number ^= number >>> 35;
    number ^= number << 4;
    return number;
}
or XorShift128Plus (need to re-seed state0 and state1 to non-zero values before using, http://xoroshiro.di.unimi.it/xorshift128plus.c)
public class XorShift128Plus {
    private long state0, state1; // One of these shouldn't be zero

    public long nextLong() {
        long state1 = this.state0;
        long state0 = this.state0 = this.state1;
        state1 ^= state1 << 23;
        return (this.state1 = state1 ^ state0 ^ (state1 >> 18) ^ (state0 >> 5)) + state0;
    }

    public void reseed(...) {
        this.state0 = ...;
        this.state1 = ...;
    }
}
or XorOshiro128Plus (http://xoroshiro.di.unimi.it/)
public class XorOshiro128Plus {
    private long state0, state1; // One of these shouldn't be zero

    public long nextLong() {
        long state0 = this.state0;
        long state1 = this.state1;
        long result = state0 + state1;
        state1 ^= state0;
        this.state0 = Long.rotateLeft(state0, 55) ^ state1 ^ (state1 << 14);
        this.state1 = Long.rotateLeft(state1, 36);
        return result;
    }

    public void reseed() {
    }
}
or SplitMix64 (http://xoroshiro.di.unimi.it/splitmix64.c)
public class SplitMix64 {
    private long state;

    public long nextLong() {
        long result = (state += 0x9E3779B97F4A7C15L);
        result = (result ^ (result >> 30)) * 0xBF58476D1CE4E5B9L;
        result = (result ^ (result >> 27)) * 0x94D049BB133111EBL;
        return result ^ (result >> 31);
    }

    public void reseed() {
        this.state = ...;
    }
}
or XorShift1024Mult (http://xoroshiro.di.unimi.it/xorshift1024star.c) or Pcg64_32 (http://www.pcg-random.org/, http://www.pcg-random.org/download.html)
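For reference, the 32-bit xorshift above might be ported to Python roughly like this (my addition; Python ints are unbounded, so masking emulates Java's fixed-width overflow, and >> matches Java's >>> because the value is kept non-negative):

def next_quick_int(number):
    # 32-bit xorshift with the (11, 7, 16) shift triple from the Java version.
    number &= 0xFFFFFFFF
    number ^= (number << 11) & 0xFFFFFFFF
    number ^= number >> 7
    number ^= (number << 16) & 0xFFFFFFFF
    return number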
The simplest method is to take the user_id modulo the number of options:
choice = user_id % number_of_options
It's very easy and fast. However, if you know the user IDs, you may be able to guess the algorithm.
Also, pseudorandom sequences can be obtained from random seeded with user constants (e.g. user_id):
>>> import random
>>> def generate_random_value(user_id):
... random.seed(user_id)
... return random.randint(1, 10000)
...
>>> [generate_random_value(x) for x in range(20)]
[6312, 2202, 927, 3899, 3868, 4186, 9402, 5306, 3715, 7586, 9362, 7412, 7776, 4244, 1751, 3424, 5924, 8553, 2970, 709]
>>> [generate_random_value(x) for x in range(20)]
[6312, 2202, 927, 3899, 3868, 4186, 9402, 5306, 3715, 7586, 9362, 7412, 7776, 4244, 1751, 3424, 5924, 8553, 2970, 709]
>>>

How to implement this C++ logic in Python?

I want to implement the C++ logic below in Python.
struct hash_string
{
    hash_string() {}

    uint32_t operator()(const std::string &text) const
    {
        static const uint32_t primes[16] =
        {
            0x01EE5DB9, 0x491408C3, 0x0465FB69, 0x421F0141,
            0x2E7D036B, 0x2D41C7B9, 0x58C0EF0D, 0x7B15A53B,
            0x7C9D3761, 0x5ABB9B0B, 0x24109367, 0x5A5B741F,
            0x6B9F12E9, 0x71BA7809, 0x081F69CD, 0x4D9B740B,
        };
        uint32_t sum = 0;
        for (size_t i = 0; i != text.size(); i++) {
            sum += primes[i & 15] * (unsigned char)text[i];
        }
        return sum;
    }
};
The Python version is below; it is not complete yet, since I haven't found a way to convert the text to unsigned chars. Please help!
# -*- coding: utf-8 -*-
text = u'连衣裙女韩范'
primes = [0x01EE5DB9, 0x491408C3, 0x0465FB69, 0x421F0141,
          0x2E7D036B, 0x2D41C7B9, 0x58C0EF0D, 0x7B15A53B,
          0x7C9D3761, 0x5ABB9B0B, 0x24109367, 0x5A5B741F,
          0x6B9F12E9, 0x71BA7809, 0x081F69CD, 0x4D9B740B]

# text[i] does not work here (of course), but how to mimic the C++ logic above?
rand = [primes[i & 15] * text[i] for i in range(len(text))]
print rand
sum_agg = sum(rand)
print sum_agg
Take text = u'连衣裙女韩范' for example: the C++ version returns 18 for text.size() and 2422173716 for the sum, while in Python I don't know how to make the length come out as 18. Matching the text sizes is essential, as a start at least.
Because you are using Unicode, for an exact reproduction you will need to turn the text into a series of bytes (chars in C++):
bytes_ = text.encode("utf8")
# when iterated over, this will yield ints (in Python 3)
# or single-character strings (in Python 2)
You should use more pythonic idioms for iterating over a pair of sequences
pairs = zip(bytes_, primes)
What if bytes_ is longer than primes? Use itertools.cycle
from itertools import cycle
pairs = zip(bytes_, cycle(primes))
All together:
from itertools import cycle

text = u'连衣裙女韩范'
primes = [0x01EE5DB9, 0x491408C3, 0x0465FB69, 0x421F0141,
          0x2E7D036B, 0x2D41C7B9, 0x58C0EF0D, 0x7B15A53B,
          0x7C9D3761, 0x5ABB9B0B, 0x24109367, 0x5A5B741F,
          0x6B9F12E9, 0x71BA7809, 0x081F69CD, 0x4D9B740B]

# if Python 3
rand = [byte * prime for byte, prime in zip(text.encode("utf8"), cycle(primes))]
# else if Python 2 (use ord to convert a single-character string to an int)
rand = [ord(byte) * prime for byte, prime in zip(text.encode("utf8"), cycle(primes))]

hash_ = sum(rand)
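One detail worth adding (not in the original answer): the C++ operator accumulates into a uint32_t, so its sum wraps modulo 2**32, while Python's sum is unbounded. Masking reproduces the C++ result exactly for long inputs:

# Emulate the uint32_t wraparound of the C++ accumulator.
hash_ = sum(rand) & 0xFFFFFFFF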

Returning C arrays into python scope from scipy's weave.inline

I am using scipy's weave.inline to perform computationally expensive tasks. I have problems returning a one-dimensional array back into the Python scope. weave.inline uses a special argument called "return_val" for the purpose of returning values back into the Python scope.
The following example returning an integer value works well:
>>> from scipy.weave import inline
>>> print inline(r'''int N = 10; return_val = N;''')
10
However, the following example, which indeed compiles without prompting an error, does not return the array I would expect:
>>> from scipy.weave import inline
>>> code = \
r'''
int *pairs;
int length = 0;
for (int i = 0; i < N; i++) {
    length += 1;
    pairs = (int *)malloc(sizeof(int) * length);
    pairs[i] = i;
    std::cout << pairs[i] << std::endl;
}
return_val = pairs;
'''
>>> N = 5
>>> R = inline(code,['N'])
>>> print "RETURN_VAL:",R
0
1
2
3
4
RETURN_VAL: 1
I need to resize the array "pairs" dynamically, which is why I can't pass a numpy.array or Python list per se.
All you need to do is use the raw Python C API calls, or, if you're looking for something a bit more convenient, the built-in scipy weave wrappers.
No guarantees about leaks or efficiency, but it should look something like this:
from scipy.weave import inline

code = r'''
py::list ret;
for (int i = 0; i < N; i++) {
    py::list item;
    for (int j = 0; j < i; j++) {
        item.append(j);
    }
    ret.append(item);
}
return_val = ret;
'''
N = 5
R = inline(code, ['N'])
print R
If you absolutely don't know the size of the output array in advance, you must create it in your inline code. I'm pretty sure that your array allocated using malloc will result in leaked memory, since you have no way of controlling when that memory is freed.
The solution is to create a numpy array, fill it with your function's results and return it.
import scipy.weave

code = r"""
npy_intp dims[1] = {n};
PyObject *out_array = PyArray_SimpleNew(1, dims, NPY_DOUBLE);
double *data = (double *) ((PyArrayObject *) out_array)->data;
for (int i = 0; i < n; ++i) data[i] = i;
return_val = out_array;
Py_XDECREF(out_array);
"""
n = 5
out_array = scipy.weave.inline(code, ["n"])
print "Array:", out_array
