I'm trying to work with an array of strings(words) in CUDA.
I tried flattening it into a single string, but then to index it I'd have to scan part of it every time a kernel runs. With 9000 words of 6 characters each, I'd have to examine 53994 characters in the worst case for each kernel call. So I'm looking for different ways to do it.
Update: Forgot to mention, the strings are of different lengths, so I'd have to find the end of each one.
The next thing I tried was copying each word to a separate memory location, collecting the addresses, and passing them to the GPU as an array with the following code:
# np = numpy
wordList = ['asd','bsd','csd']
d_words = []
for word in wordList:
    d_words.append(gpuarray.to_gpu(np.array(word, dtype=str)))
d_wordList = gpuarray.to_gpu(np.array([word.ptr for word in d_words], dtype=np.int32))
ker_test(d_wordList, block=(1,1,1), grid=(1,1,1))
and in the kernel:
__global__ void test(char** d_wordList) {
    printf("First character of the first word is: %c \n", d_wordList[0][0]);
}
The kernel should get an int32 array of pointers that point to the beginning of each word, effectively being a char** (or int**), but it doesn't work as I expect.
What is wrong with this approach?
Also what are the "standard" ways to work with strings in PyCUDA (or even in CUDA) in general?
Thanks in advance.
After some further thought, I've concluded that for this case of variable-length strings, using an "offset array" may not be much different than 2D indexing (i.e. double-pointer indexing), when considering the issue of data access within the kernel. Both involve a level of indirection.
Here's a worked example demonstrating both methods:
$ cat t5.py
#!/usr/bin/env python
import time
import numpy as np
from pycuda import driver, compiler, gpuarray, tools
import math
from sys import getsizeof
import pycuda.autoinit
kernel_code1 = """
__global__ void test1(char** d_wordList) {
    (d_wordList[blockIdx.x][threadIdx.x])++;
}
"""
kernel_code2 = """
__global__ void test2(char* d_wordList, size_t *offsets) {
    (d_wordList[offsets[blockIdx.x] + threadIdx.x])++;
}
"""
mod = compiler.SourceModule(kernel_code1)
ker_test1 = mod.get_function("test1")
wordList = ['asd','bsd','csd']
d_words = []
for word in wordList:
    d_words.append(gpuarray.to_gpu(np.array(word, dtype=str)))
d_wordList = gpuarray.to_gpu(np.array([word.ptr for word in d_words], dtype=np.uintp))
ker_test1(d_wordList, block=(3,1,1), grid=(3,1,1))
for word in d_words:
    result = word.get()
    print result
mod2 = compiler.SourceModule(kernel_code2)
ker_test2 = mod2.get_function("test2")
wordlist2 = np.array(['asdbsdcsd'], dtype=str)
d_words2 = gpuarray.to_gpu(np.array(['asdbsdcsd'], dtype=str))
offsets = gpuarray.to_gpu(np.array([0,3,6,9], dtype=np.uint64))
ker_test2(d_words2, offsets, block=(3,1,1), grid=(3,1,1))
h_words2 = d_words2.get()
print h_words2
$ python t5.py
bte
cte
dte
['btectedte']
$
Notes:
For the double-pointer case, the only change from the OP's example was to use the numpy.uintp type for the pointers, as suggested in the comments by @talonmies.
I don't think the double-pointer access of data will necessarily be quicker or slower than the indirection associated with the offset lookup method. One other performance consideration would be in the area of copying data from host to device and vice versa. The double pointer method effectively involves multiple allocations and multiple copy operations, in both directions, I believe. For a lot of strings, this will be noticeable in the host/device data copy operations.
Another possible merit of the offset method is that it is easy to determine the length of each string - just subtract two adjacent entries in the offset list. This could be useful so as to make it easy to determine how many threads can operate on a string in parallel, as opposed to having a single thread work on a string sequentially (or use a method in kernel code to determine string length, or pass the length of each string).
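The host-side bookkeeping for the offset method can be sketched in plain Python. This is a minimal sketch rather than part of the worked example above: the helper names flatten_words and word_at are hypothetical, it assumes ASCII-only words, and it is written for Python 3. It shows how variable-length words map to one flat buffer plus an offsets list with one extra trailing entry, so that each length is the difference of two adjacent offsets.

```python
from itertools import accumulate


def flatten_words(words):
    """Return (flat_bytes, offsets); offsets has len(words) + 1 entries."""
    # Assumption: the words are ASCII, so one character is one byte
    encoded = [w.encode("ascii") for w in words]
    flat = b"".join(encoded)
    # offsets[i] is where word i starts; the last entry is the total length
    offsets = [0] + list(accumulate(len(e) for e in encoded))
    return flat, offsets


def word_at(flat, offsets, i):
    """Recover word i from the flat buffer using two adjacent offsets."""
    return flat[offsets[i]:offsets[i + 1]].decode("ascii")


flat, offsets = flatten_words(["asd", "bsdx", "cs"])
# Length of each word = difference between adjacent offsets
lengths = [end - start for start, end in zip(offsets, offsets[1:])]
```

The flat buffer and the offsets could then be converted to NumPy arrays (for instance with dtype np.uint64, as in the worked example) and copied to the device in two transfers instead of one per word.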
I was wondering about the best way (or any good way, really) to simulate a small RAM in Python.
In most languages, I would simply create a fixed size array of char, but this seems to be surprisingly complex in Python.
The closest thing I found was this:
self.two_KB_internal_ram = [] #goes from $0000-$07FF
for x in range(2048):
    self.two_KB_internal_ram = 0
print ("two_KB_internal_ram: ", type(self.two_KB_internal_ram))
However, the type shows that the type is an int, and not char.
Is there a way of doing this with chars? If not (or even if there is), what would be a good way to emulate a ram memory?
To get a fixed-size array you can use np.array, which is capable of holding strings as well.
Something like this:
import numpy as np
np.array(["a"] * 2048)
But why would you do that in Python?
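Another option, not mentioned above, is the built-in bytearray, which is a mutable sequence of integers in the range 0-255 and so behaves much like a fixed-size array of unsigned char. A minimal sketch follows; the Ram class and its read and write methods are hypothetical names chosen for illustration, and the 8-bit wrap on write is an assumption about the machine being emulated.

```python
class Ram(object):
    """Hypothetical fixed-size, byte-addressable memory backed by a bytearray."""

    def __init__(self, size=2048):
        # bytearray(size) creates `size` bytes, all initialised to zero
        self._cells = bytearray(size)

    def _check(self, address):
        # Reject negative addresses explicitly; a bytearray would otherwise
        # treat them as indices counted from the end.
        if not 0 <= address < len(self._cells):
            raise IndexError("address out of range: %#06x" % address)

    def read(self, address):
        self._check(address)
        return self._cells[address]

    def write(self, address, value):
        self._check(address)
        # Assumption: writes wrap to 8 bits, as on an 8-bit data bus
        self._cells[address] = value & 0xFF


ram = Ram()               # 2 KB, addresses $0000-$07FF
ram.write(0x07FF, 0x1AB)  # only the low byte (0xAB) is stored
```

Each cell reads back as an int between 0 and 255, which is usually what an emulator wants; there is no separate char type in Python to store instead.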
I have a Python program that needs to pass an array to a .dll that is expecting an array of c doubles. This is currently done by the following code, which is the fastest method of conversion I could find:
from array import array
from ctypes import *
import numpy as np
python_array = np.array(some_python_array)
temp = array('d', python_array.astype('float'))
c_double_array = (c_double * len(temp)).from_buffer(temp)
...where 'np.array' is just there to show that in my case the python_array is a numpy array. Let's say I now have two c_double arrays: c_double_array_a and c_double_array_b, the issue I'm having is I would like to append c_double_array_b to c_double_array_a without reconverting to/from whatever python typically uses for arrays. Is there a way to do this with the ctypes library?
I've been reading through the docs here but nothing seems to detail combining two c_type arrays after creation. It is very important in my program that they can be combined after creation, of course it would be trivial to just append python_array_b to python_array_a and then convert but that won't work in my case.
Thanks!
P.S. if anyone knows a way to speed up the conversion code that would also be greatly appreciated, it takes on the order of 150ms / million elements in the array and my program typically handles 1-10 million elements at a time.
Leaving aside the construction of the ctypes arrays (for which Mark's comment is surely relevant), the issue is that C arrays are not resizable: you can't append or extend them. (There do exist wrappers that provide these features, which may be useful references in constructing this.) What you can do is make a new array of the size of the two existing arrays together and then ctypes.memmove them into it. It might be possible to improve the performance by using realloc, but you'd have to go even lower than normal ctypes memory management to use it.
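A minimal sketch of the "new array plus memmove" approach described above. The helper name concat_double_arrays is hypothetical; ctypes.memmove, ctypes.sizeof and ctypes.addressof are standard ctypes functions. Note that this still copies both inputs once, since the original arrays cannot grow in place.

```python
import ctypes


def concat_double_arrays(a, b):
    """Return a new c_double array holding the elements of a followed by b."""
    combined = (ctypes.c_double * (len(a) + len(b)))()
    # Copy a to the start of the new array
    ctypes.memmove(combined, a, ctypes.sizeof(a))
    # Copy b immediately after it; the destination is a raw byte address
    ctypes.memmove(ctypes.addressof(combined) + ctypes.sizeof(a),
                   b, ctypes.sizeof(b))
    return combined


c_double_array_a = (ctypes.c_double * 2)(1.0, 2.0)
c_double_array_b = (ctypes.c_double * 3)(3.0, 4.0, 5.0)
c_double_array_ab = concat_double_arrays(c_double_array_a, c_double_array_b)
```

The result is an ordinary ctypes array, so it can be passed to the .dll in the same way as the two inputs.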
This is my first question on this site.
First of all, I need to make a module with one function for python in C++, which must work with numpy, using <numpy/arrayobject.h>. This function takes one numpy array and returns two numpy arrays. All arrays are one-dimensional.
The first question is how to get the data from a numpy array. I want to collect the information from the array in a std::vector, so that I can easily work with it in C++.
The second: am I right that function should return a tuple of arrays, then user of my module can write like this in python:
arr1, arr2 = foo(arr)
?
And how to return like this?
Thank you very much.
NumPy includes lots of functions and macros that make it pretty easy to access the data of an ndarray object within a C or C++ extension. Given a 1D ndarray called v, one can access element i with PyArray_GETPTR1(v, i). So if you want to copy each element in the array to a std::vector of the same type, you can iterate over each element and copy it, like so (I'm assuming an array of doubles):
npy_intp vsize = PyArray_SIZE(v);
std::vector<double> out(vsize);
for (npy_intp i = 0; i < vsize; i++) {
    out[i] = *reinterpret_cast<double*>(PyArray_GETPTR1(v, i));
}
One could also do a bulk memcpy-like operation, but keep in mind that NumPy ndarrays may be mis-aligned for the data type, have non-native byte order, or other subtle attributes that make such copies less than desirable. But assuming that you are aware of these, one could do:
npy_intp vsize = PyArray_SIZE(v);
std::vector<double> out(vsize);
std::memcpy(out.data(), PyArray_DATA(v), sizeof(double) * vsize);
Using either approach, out now contains a copy of the ndarray's data, and you can manipulate it however you like. Keep in mind that, unless you really need the data as a std::vector, the NumPy C API may be perfectly fine to use in your extension as a way to access and manipulate the data. That is, unless you need to pass the data to some other function which must take a std::vector or you want to use C++ library code that relies on std::vector, I'd consider doing all your processing directly on the native array types.
As to your last question, one generally uses Py_BuildValue to construct a tuple which is returned from your extension functions. Your tuple would just contain two ndarray objects.
This is to understand things better; it is not an actual problem that I need to fix. A cStringIO object is supposed to emulate a string, a file, and also an iterator over the lines. Does it also emulate a buffer? In any case, ideally one should be able to construct a numpy array as follows:
import numpy as np
import cStringIO
c = cStringIO.StringIO('\x01\x00\x00\x00\x01\x00\x00\x00')
#Trying the iterator abstraction
b = np.fromiter(c,int)
# The above fails with: ValueError: setting an array element with a sequence.
#Trying the file abstraction
b = np.fromfile(c,int)
# The above fails with: IOError: first argument must be an open file
#Trying the sequence abstraction
b = np.array(c, int)
# The above fails with: TypeError: long() argument must be a string or a number
#Trying the string abstraction
b = np.fromstring(c)
#The above fails with: TypeError: argument 1 must be string or read-only buffer
b = np.fromstring(c.getvalue(), int) # does work
My question is why does it behave this way.
The practical problem where this came up is the following: I have an iterator which yields tuples. I am interested in making a numpy array from one of the components of each tuple with as little copying and duplication as possible. My first cut was to keep writing the interesting components of the yielded tuples into a StringIO object and then use its memory buffer for the array. I can of course use getvalue(), but that will create and return a copy. What would be a good way to avoid the extra copying?
The problem seems to be that numpy doesn't like being given characters instead of numbers. Remember, in Python, single characters and strings have the same type — numpy must have some type detection going on under the hood, and takes '\x01' to be a nested sequence.
The other problem is that a cStringIO iterates over its lines, not its characters.
Something like the following iterator should get around both of these problems:
def chariter(filelike):
    octet = filelike.read(1)
    while octet:
        yield ord(octet)
        octet = filelike.read(1)
Use it like so (note the seek!):
c.seek(0)
b = np.fromiter(chariter(c), int)
As cStringIO does not implement the buffer interface, if its getvalue returns a copy of the data, then there is no way to get its data without copying.
If getvalue returns the buffer as a string without making a copy, numpy.frombuffer(x.getvalue(), dtype='S1') will give a (read-only) numpy array referring to the string, without an additional copy.
The reason why np.fromiter(c, int) and np.array(c, int) do not work is that cStringIO, when iterated, returns a line at a time, similarly as files:
>>> list(iter(c))
['\x01\x00\x00\x00\x01\x00\x00\x00']
Such a long string cannot be converted to a single integer.
***
It's best not to worry too much about making copies unless it really turns out to be a problem. The reason is that the extra overhead in e.g. using a generator and passing it to numpy.fromiter may be actually larger than what is involved in constructing a list, and then passing that to numpy.array --- making the copies may be cheap compared to Python runtime overhead.
However, if the issue is with memory, then one solution is to put the items directly into the final Numpy array. If you know the size beforehand, you can pre-allocate it. If the size is unknown, you can use the .resize() method in the array to grow it as needed.
I know there have been some questions regarding file reading, binary data handling and integer conversion using struct before, so I come here to ask about a piece of code I have that I think is taking too much time to run. The file being read is a multichannel datasample recording (short integers), with intercalated intervals of data (hence the nested for statements). The code is as follows:
# channel_content is a dictionary, channel_content[channel]['nsamples'] is a string
for rec in xrange(number_of_intervals):
    for channel in channel_names:
        channel_content[channel]['recording'].extend(
            [struct.unpack("h", f.read(2))[0]
             for iteration in xrange(int(channel_content[channel]['nsamples']))])
With this code, I get 2.2 seconds per megabyte read on a dual-core machine with 2Mb RAM, and my files typically have 20+ Mb, which gives a very annoying delay (especially considering that another benchmark shareware program I am trying to mirror loads the file WAY faster).
What I would like to know:
If there is some violation of "good practice": bad-arranged loops, repetitive operations that take longer than necessary, use of inefficient container types (dictionaries?), etc.
If this reading speed is normal, or normal to Python, and if reading speed
If creating a C++ compiled extension would be likely to improve performance, and if it would be a recommended approach.
(of course) If anyone suggests some modification to this code, preferably based on previous experience with similar operations.
Thanks for reading
(I have already posted a few questions about this job of mine, I hope they are all conceptually unrelated, and I also hope not being too repetitive.)
Edit: channel_names is a list, so I made the correction suggested by @eumiro (removed the mistyped brackets).
Edit: I am currently going with Sebastian's suggestion of using array with the fromfile() method, and will soon put the final code here. Besides, every contribution has been very useful to me, and I gladly thank everyone who kindly answered.
Final Form after going with array.fromfile() once, and then alternately extending one array for each channel via slicing the big array:
fullsamples = array('h')
# number of remaining items = remaining bytes / bytes per item
fullsamples.fromfile(f, (os.path.getsize(f.filename) - f.tell()) // fullsamples.itemsize)
position = 0
for rec in xrange(int(self.header['nrecs'])):
    for channel in self.channel_labels:
        samples = int(self.channel_content[channel]['nsamples'])
        self.channel_content[channel]['recording'].extend(
            fullsamples[position:position+samples])
        position += samples
The speed improvement was very impressive over reading the file a bit at a time, or using struct in any form.
You could use array to read your data:
import array
import os
fn = 'data.bin'
a = array.array('h')
a.fromfile(open(fn, 'rb'), os.path.getsize(fn) // a.itemsize)
It is 40 times faster than struct.unpack from @samplebias's answer.
If the files are only 20-30M, why not read the entire file, decode the nums in a single call to unpack and then distribute them among your channels by iterating over the array:
data = open('data.bin', 'rb').read()
values = struct.unpack('%dh' % (len(data) // 2), data)
del data
# iterate over channels, and assign from values using indices/slices
A quick test showed this resulted in a 10x speedup over struct.unpack('h', f.read(2)) on a 20M file.
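The "assign from values using indices/slices" step can be sketched as follows. This is only an illustration under stated assumptions: the function name split_channels and the layout argument (a list of channel name and samples-per-record pairs) are hypothetical, the data is assumed to be little-endian 2-byte shorts, and the code is written for Python 3.

```python
import struct


def split_channels(data, layout, n_records):
    """Decode all shorts in `data` at once, then slice them per channel.

    `layout` is a list of (channel_name, samples_per_record) pairs, in the
    order the channels appear inside each record (hypothetical structure).
    """
    count = len(data) // 2
    # '<' = little-endian, standard 2-byte shorts (assumption about the file)
    values = struct.unpack("<%dh" % count, data[:2 * count])
    recordings = dict((name, []) for name, _ in layout)
    position = 0
    for _ in range(n_records):
        for name, nsamples in layout:
            recordings[name].extend(values[position:position + nsamples])
            position += nsamples
    return recordings


# Ten samples, two records; channel "a" has 2 samples per record, "b" has 3
raw = struct.pack("<10h", *range(10))
channels = split_channels(raw, [("a", 2), ("b", 3)], n_records=2)
```

Here the only per-sample work left in Python is list slicing; the decoding itself happens in one unpack call.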
A single array fromfile call is definitely fastest, but it won't work if the data series is interleaved with other value types.
In such cases, another big speed increase that can be combined with the previous struct answers is to precompile a struct.Struct object with the format for each chunk, instead of calling the unpack function multiple times. From the docs:
Creating a Struct object once and calling its methods is more efficient than calling the struct functions with the same format since the format string only needs to be compiled once.
So for instance, if you wanted to unpack 1000 interleaved shorts and floats at a time, you could write:
chunksize = 1000
structobj = struct.Struct("hf" * chunksize)
while True:
    chunkdata = structobj.unpack(fileobj.read(structobj.size))
(Note that the example is only partial and needs to account for changing the chunksize at the end of the file and breaking the while loop.)
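One way to handle the end of the file, sketched under assumptions rather than taken from the answer above: instead of a chunk-sized format string, this variant compiles a single-record Struct once and applies Struct.iter_unpack (Python 3.4+) to each chunk of bytes, so a short final chunk needs no second format. The "<hf" layout (little-endian, no padding, 6 bytes per record) and the name iter_records are assumptions for illustration.

```python
import io
import struct

# Compiled once; '<' selects standard sizes, so one record is 2 + 4 = 6 bytes
RECORD = struct.Struct("<hf")


def iter_records(fileobj, chunksize=1000):
    """Yield (short, float) tuples, reading `chunksize` records per read()."""
    chunk_bytes = RECORD.size * chunksize
    while True:
        data = fileobj.read(chunk_bytes)
        # Ignore a trailing partial record, if any
        usable = len(data) - len(data) % RECORD.size
        if usable == 0:
            break
        for record in RECORD.iter_unpack(data[:usable]):
            yield record
        if len(data) < chunk_bytes:
            break  # short read: end of file reached


# In-memory stand-in for a file holding 2500 interleaved short/float records
buf = io.BytesIO(b"".join(RECORD.pack(i, i * 0.5) for i in range(2500)))
records = list(iter_records(buf))
```

With native alignment (a format without the "<" prefix) the record size may include padding, so the format should match how the file was actually written.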
extend() accepts iterables; that is to say, instead of .extend([...]) you can write .extend(...). This is likely to speed up the program because extend() will consume a generator rather than a fully built list.
There is an inconsistency in your code: you first define channel_content = {}, and after that you perform channel_content[channel]['recording'].extend(...), which requires that a key channel and a subkey 'recording' with a list as its value already exist so that there is something to extend.
What is the nature of self.channel_content[channel]['nsamples'], so that it can be passed to the int() function?
Where does number_of_intervals come from? What is the nature of the intervals?
Inside the for rec in xrange(number_of_intervals) loop, rec is never used again. So it seems to me that you are repeating the same loop, for channel in channel_names:, as many times as the number given by number_of_intervals. Are there number_of_intervals * int(self.channel_content[channel]['nsamples']) * 2 values to read in f?
I read in the doc:
class struct.Struct(format)
Return a new Struct object which writes and reads binary data according to the format string format. Creating a Struct object once and calling its methods is more efficient than calling the struct functions with the same format since the format string only needs to be compiled once.
This expresses the same idea as samplebias.
If your aim is to create a dictionary, there is also the possibility to use dict() with a generator as argument.
EDIT
I propose
# channel_content is assumed to already hold an 'nsamples' entry and a 'recording' list for each channel
for rec in xrange(number_of_intervals):
    for channel in channel_names:
        N = int(channel_content[channel]['nsamples'])
        # read N shorts at once and decode them with a single unpack call
        channel_content[channel]['recording'].extend(struct.unpack(str(N) + "h", f.read(2 * N)))
I don't know how to take account of the J.F. Sebastian's suggestion to use array
Not sure if it would be faster, but I would try to decode chunks of words instead of one word at a time. For example, you could read 100 bytes of data at a time like:
s = f.read(100)
struct.unpack(str(len(s)/2)+"h", s)