I have a Python-wrapped C++ object whose underlying data is a std::vector<T> container that represents bits. I have a function that writes these bits to a PyBytes object. If the endianness is the same, there is no issue. However, if I wish to write the bytes in a different endianness, then I need to byteswap each word.
Ideally, I could pass an output iterator to the PyBytes_FromString constructor, where the output iterator just transforms the endianness of each word. This would be O(1) extra memory, which is the target.
Less ideally, I could somehow construct an empty PyBytes object, create the different-endianness char array manually, and somehow assign that to the PyBytes object (basically reimplementing the PyBytes constructors). This would also be O(1) extra memory. Unfortunately, the way to do this would be to use _PyBytes_FromSize, but that's not exposed in the public API.
The current way of doing this is to create an entire copy of the reversed words, just to then copy that representation over to the PyBytes object's representation.
I think the second option is the most practical way of doing this, but the only way I can see that working is by basically copying the _PyBytes_FromSize function into my source code, which seems hacky. I'm new to the Python C API and am wondering if there's a cleaner way to do this.
PyBytes_FromStringAndSize lets you pass NULL as the first argument, in which case it returns an uninitialized bytes object (which you can edit). It's really just equivalent to _PyBytes_FromSize and would let you do your second option.
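For example, a minimal sketch of that approach (assuming 32-bit words; words, n_words, and bswap32 stand in for your vector's data and your own byte-swap helper):

/* Allocate an uninitialized bytes object of the right size, then fill it
   in place with byte-swapped words -- O(1) extra memory. */
PyObject *bytes = PyBytes_FromStringAndSize(NULL, (Py_ssize_t)(n_words * sizeof(uint32_t)));
if (bytes == NULL)
    return NULL;
uint32_t *out = (uint32_t *)PyBytes_AS_STRING(bytes);
for (size_t i = 0; i < n_words; ++i)
    out[i] = bswap32(words[i]);  /* hypothetical endianness-swapping helper */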
If you wanted to try your "output iterator" option instead, then the solution would be to call PyBytes_Type:
PyObject *result = PyObject_CallFunctionObjArgs((PyObject*)&PyBytes_Type, your_iterable, NULL);
Any iterable that returns values between 0 and 255 should work. You can pick the PyObject_Call* that you find easiest to use.
I suspect writing the iterable in C/C++ will be more trouble than writing the loop though.
I have a list of objects, for example:
L = [<CustomObject object at 0x101992eb8>, <CustomObject object at 0x101763908>, ...]
The items in the list are "references", so I guess it's like a list of unsigned integers. Am I wrong?
In order to see if I can save some memory, I would like to pack this list using the struct module.
Is this possible? And if yes how to do it? (except if you know for sure I won't save memory like this)
The list is already an array of “integers” (pointers) internally; struct can’t compress that in any simple or significant fashion, and doing so would interfere with Python’s garbage collection.
The CustomObject instances themselves (if they are unique) take far more memory than the list's pointers—more than twice as much at a minimum, and closer to a hundred times unless you use __slots__ for the class.
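A quick way to see the per-instance difference (a sketch; Plain and Slotted are hypothetical classes, and the exact numbers vary by Python version):

import sys

class Plain:
    def __init__(self):
        self.x, self.y = 1, 2

class Slotted:
    __slots__ = ('x', 'y')
    def __init__(self):
        self.x, self.y = 1, 2

p, s = Plain(), Slotted()
# A plain instance also carries a per-instance __dict__; a slotted one does not.
print(sys.getsizeof(p) + sys.getsizeof(p.__dict__))  # object plus its dict
print(sys.getsizeof(s))                              # slotted object only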
Everyone says you lose the benefit of generators if you put the result into a list.
But you need a list, or a sequence, to even have a generator to begin with, right? So, for example, if I need to go through the files in a directory, don't I have to make them into a list first, as with os.listdir()? If so, then how is that more efficient? (I am always working with strings and files, so I really hate that all the examples use range and integers, but I digress.)
Taking it a step further, the mere presence of the yield keyword is supposed to make a function a generator. So if I do:
for x in os.listdir():
    yield x
Is a list still being created? Or is os.listdir() itself now also magically a generator? Is it possible that, os.listdir() not having been called yet, there really isn't a list here yet?
Finally, we are told that iterators need iter() and next() methods. But doesn’t that also mean they need an index? If not, what is next() operating on? How does it know what is next without an index? Before 3.6, dict keys had no order, so how did that iteration work?
No.
See, there's no list here:
def one():
    while True:
        yield 1
An index and next() are two independent tools for performing an iteration. Again, if you have an object whose iterator's next() always returns 1, you don't need any indices.
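For instance, a minimal Python 3 sketch of such an index-free iterator written as a class (the name Ones is hypothetical):

class Ones:
    # An iterator whose next() always returns 1; it holds no index and no list.
    def __iter__(self):
        return self
    def __next__(self):
        return 1

ones = Ones()
print(next(ones), next(ones))   # 1 1, with no state at all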
In deeper detail...
See, technically you can always associate a list and an index with any generator or iterator: simply write down all its returned values, and you get an at most countable sequence of values a₀, a₁, ... But that is merely a mathematical formalism which need not have anything in common with how a real generator works. For instance, take a generator that always yields one. You can count how many ones you have received from it so far and call that an index. You can write down all those ones, comma-separated, and call that a list. Do those two objects correctly describe the generator's output so far? Apparently so. Are they in the least bit important to the generator itself? Not really.
Of course, a real generator will probably have state (you can call it an index—provided you don't insist that an index must be a non-negative integral scalar; if the generator works deterministically, you could write down all its possible states, number them, and call the current state's number an index—yes, approximately that). It will always have some source for its states and returned values. So indices and lists can be regarded as abstractions that describe an object's behaviour, but they are not necessarily the concrete implementation details actually used.
Consider an unbuffered file reader: it retrieves a single byte from the disk and immediately yields it. There's no real list in memory, only the file contents on the disk (and there may not even be that, if our file reader is connected to a network socket instead of a real disk drive, with the Oracle of Delphi at the connection's other end). You can call the file position an index—until you read stdin, which is only forward-traversable, so indexing it makes no real physical sense; the same goes for network connections over unreliable protocols, BTW.
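For instance, such a reader could be written as a generator (a sketch; read(1) fetches one byte on demand, and no list is ever built):

def unbuffered_bytes(f):
    # Yield the stream one byte at a time; nothing is accumulated.
    while True:
        b = f.read(1)
        if not b:          # end of stream
            return
        yield b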
Something like this.
1) This is wrong; a list is just the easiest starting point for explaining a generator. Think of the eight queens problem: if you yield each solution as soon as the program finds it, there is no result list anywhere. Note that the standard library often offers iterators alongside list-producing alternatives (islice() vs. slice()), and an easy example not representable by a list is itertools.cycle().
In consequence, 2) and 3) are also wrong.
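For example, itertools.cycle() yields its values forever, so no finite list could stand in for its output:

import itertools

colors = itertools.cycle(['red', 'green', 'blue'])
for _ in range(5):
    print(next(colors))    # red, green, blue, red, green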
What is the use of .digest() in this statement? Why do we use it? I searched on Google (and the documentation) but still am not able to figure it out.
train_hashes = [hashlib.sha1(x).digest() for x in train_dataset]
What I found is that it converts to a string. Am I right or wrong?
The .digest() method returns the actual digest the hash is designed to produce.
It is a separate method because the hashing API is designed to accept data in multiple pieces:
import hashlib

hash = hashlib.sha1()
for chunk in large_amount_of_data:
    hash.update(chunk)
final_digest = hash.digest()
The above code creates a hashing object without passing any initial data in, then uses the hash.update() method to feed chunks of data in a loop. This helps avoid having to load all of the data into memory at once, so you can hash anything between 1 byte and the entire Google index, if you ever had access to something that large.
If hashlib.sha1(x) produced the digest directly you could never add additional data to hash first. Moreover, there is also an alternative method of accessing the digest, as a hexadecimal string using the hash.hexdigest() method (equivalent to hash.digest().hex(), but more convenient).
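A quick comparison of the two forms:

import hashlib

h = hashlib.sha1(b'foo')
print(h.digest())      # raw 20 bytes: b'\x0b\xee\xc7\xb5...'
print(h.hexdigest())   # the same digest as 40 hexadecimal characters
assert h.hexdigest() == h.digest().hex()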
The code you found uses the fact that the constructor of the hash object also accepts data; since that's all of the data you wanted to hash, you can call .digest() immediately.
The module documentation covers it this way:
There is one constructor method named for each type of hash. All return a hash object with the same simple interface. For example: use sha256() to create a SHA-256 hash object. You can now feed this object with bytes-like objects (normally bytes) using the update() method. At any point you can ask it for the digest of the concatenation of the data fed to it so far using the digest() or hexdigest() methods.
(bold emphasis mine).
This is to understand things better; it is not an actual problem that I need to fix. A cStringIO object is supposed to emulate a string, a file, and also an iterator over the lines. Does it also emulate a buffer? In any case, ideally one should be able to construct a numpy array as follows:
import numpy as np
import cStringIO
c = cStringIO.StringIO('\x01\x00\x00\x00\x01\x00\x00\x00')
#Trying the iterator abstraction
b = np.fromiter(c,int)
# The above fails with: ValueError: setting an array element with a sequence.
#Trying the file abstraction
b = np.fromfile(c,int)
# The above fails with: IOError: first argument must be an open file
#Trying the sequence abstraction
b = np.array(c, int)
# The above fails with: TypeError: long() argument must be a string or a number
#Trying the string abstraction
b = np.fromstring(c)
#The above fails with: TypeError: argument 1 must be string or read-only buffer
b = np.fromstring(c.getvalue(), int) # does work
My question is: why does it behave this way?
The practical problem where this came up is the following: I have an iterator which yields tuples. I am interested in making a numpy array from one of the components of each tuple, with as little copying and duplication as possible. My first cut was to keep writing the interesting component of each yielded tuple into a StringIO object and then use its memory buffer for the array. I can of course use getvalue(), but that will create and return a copy. What would be a good way to avoid the extra copying?
The problem seems to be that numpy doesn't like being given characters instead of numbers. Remember, in Python, single characters and strings have the same type — numpy must have some type detection going on under the hood, and takes '\x01' to be a nested sequence.
The other problem is that a cStringIO iterates over its lines, not its characters.
Something like the following iterator should get around both of these problems:
def chariter(filelike):
    octet = filelike.read(1)
    while octet:
        yield ord(octet)
        octet = filelike.read(1)
Use it like so (note the seek!):
c.seek(0)
b = np.fromiter(chariter(c), int)
As cStringIO does not implement the buffer interface, if its getvalue returns a copy of the data, then there is no way to get its data without copying.
If getvalue returns the buffer as a string without making a copy, numpy.frombuffer(x.getvalue(), dtype='S1') will give a (read-only) numpy array referring to the string, without an additional copy.
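For example (a Python 2 sketch, mirroring the question's setup):

from cStringIO import StringIO
import numpy as np

c = StringIO('\x01\x00\x00\x00\x01\x00\x00\x00')
b = np.frombuffer(c.getvalue(), dtype='S1')   # read-only array viewing the returned string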
The reason why np.fromiter(c, int) and np.array(c, int) do not work is that cStringIO, when iterated, returns a line at a time, similarly as files:
>>> list(iter(c))
['\x01\x00\x00\x00\x01\x00\x00\x00']
Such a long string cannot be converted to a single integer.
***
It's best not to worry too much about making copies unless it really turns out to be a problem. The reason is that the extra overhead of, e.g., using a generator and passing it to numpy.fromiter may actually be larger than what is involved in constructing a list and then passing that to numpy.array --- making the copies may be cheap compared to the Python runtime overhead.
However, if the issue is with memory, then one solution is to put the items directly into the final Numpy array. If you know the size beforehand, you can pre-allocate it. If the size is unknown, you can use the .resize() method in the array to grow it as needed.
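A minimal sketch of that approach (the collect helper and the two-element tuple layout are hypothetical):

import numpy as np

def collect(iterator, initial_size=1024):
    # Write values straight into a pre-allocated array, growing it geometrically.
    arr = np.empty(initial_size, dtype=int)
    n = 0
    for _, value in iterator:                      # keep one component of each tuple
        if n == len(arr):
            arr.resize(2 * len(arr), refcheck=False)
        arr[n] = value
        n += 1
    arr.resize(n, refcheck=False)                  # trim to the number of items stored
    return arr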
Is there a C data structure equivalent to the following Python structure?
data = {'X': 1, 'Y': 2}
Basically I want a structure where I can give it a pre-defined string and have it come out with an integer.
The data structure you are looking for is called a "hash table" (or "hash map"). You can find the source code for one here.
A hash table is a mutable mapping from a key (usually a string) to a value, just like the dict from Python that your sample code instantiates.
It's called a "hash table" because it performs a hash function on the string to get an integer result, and then uses that integer directly to locate your desired data.
This system makes it extremely quick to access and change your information, even if you have tons of it. It also means that the data is unordered, because a hash function returns a uniformly distributed result and scatters your data unpredictably all over the table (in a perfect world).
Also note that if you're doing a quick one-off lookup over two or three static keys, look at gperf, which generates a perfect hash function and simple code for that hash.
The above data structure is a dict type.
In C/C++ parlance, a hash map is the equivalent; Google for a hashmap implementation.
There's nothing built into the language or standard library itself but, depending on your requirements, there are a number of ways to do it.
If the data set will remain relatively small, the easiest solution is probably just to have an array of structures along the lines of:
typedef struct {
char *key;
int val;
} tElement;
then use a sequential search to look them up. Have functions which insert keys, delete keys and look up keys so that, if you need to change it in future, the API itself won't change. Pseudo-code:
def init:
    create g.key[100] as string
    create g.val[100] as integer
    set g.size to 0

def add (key, val):
    if lookup(key) != not_found:
        return already_exists
    if g.size == 100:
        return no_space
    g.key[g.size] = key
    g.val[g.size] = val
    g.size = g.size + 1
    return okay

def del (key):
    pos = lookup(key)
    if pos == not_found:
        return no_such_key
    if pos < g.size - 1:
        g.key[pos] = g.key[g.size-1]
        g.val[pos] = g.val[g.size-1]
    g.size = g.size - 1

def lookup (key):
    for pos goes from 0 to g.size-1:
        if g.key[pos] == key:
            return pos
    return not_found
Insertion means ensuring it doesn't already exist then just tacking an element on to the end (you'll maintain a separate size variable for the structure). Deletion means finding the element then simply overwriting it with the last used element and decrementing the size variable.
Now this isn't the most efficient method in the world, but you need to keep in mind that it usually only makes a difference as your dataset gets much larger. The difference between a binary tree or a hash and a sequential search is irrelevant for, say, 20 entries. I've even used bubble sort for small data sets where a more efficient sort wasn't available. That's because it's massively quick to code up and the performance is irrelevant.
Stepping up from there, you can remove the fixed upper size by using a linked list. The search is still relatively inefficient since you're doing it sequentially but the same caveats apply as for the array solution above. The cost of removing the upper bound is a slight penalty for insertion and deletion.
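A node for such a chained list might look like this (a sketch):

typedef struct tNode {
    char *key;
    int val;
    struct tNode *next;   /* next element; NULL terminates the list */
} tNode;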
If you want a little more performance and a non-fixed upper limit, you can use a binary tree to store the elements. This gets rid of the sequential search when looking for keys and is suited to somewhat larger data sets.
If you don't know how big your data set will be getting, I would consider this the absolute minimum.
A hash is probably the next step up from there. It performs a function on the string to get a bucket number (usually treated as an array index of some sort). This gives O(1) lookup, but the aim is to have a hash function that allocates only one item per bucket, so that no further processing is required to get the value.
A degenerate case of "all items in the same bucket" is no different to an array or linked list.
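For illustration, a classic string hash of this kind (djb2) reduces a key to a bucket number; this is a sketch, not tuned to any particular key set:

/* djb2: map a string key to a bucket index in [0, n_buckets). */
unsigned long hash_bucket(const char *key, unsigned long n_buckets)
{
    unsigned long h = 5381;
    while (*key)
        h = h * 33 + (unsigned char)*key++;
    return h % n_buckets;
}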
For maximum performance, and assuming the keys are fixed and known in advance, you can actually create your own hashing function based on the keys themselves.
Knowing the keys up front, you have extra information that allows you to fully optimise a hashing function to generate the actual value so you don't even involve buckets - the value generated by the hashing function can be the desired value itself rather than a bucket to get the value from.
I had to put one of these together recently for converting textual months ("January", etc.) into month numbers. You can see the process here.
I mention this possibility because of your "pre-defined string" comment. If your keys are limited to "X" and "Y" (as in your example) and you're using a character set with contiguous {W,X,Y} characters (which covers EBCDIC as well as ASCII, though not necessarily every esoteric character set allowed by ISO), the simplest hashing function would be:
char *s = "X";
int val = *s - 'W';   /* 'X' -> 1, 'Y' -> 2 */
Note that this doesn't work well if you feed it bad data. These are ideal for when the data is known to be restricted to certain values. The cost of checking data can often swamp the saving given by a pre-optimised hash function like this.
C doesn't have any collection classes. C++ has std::map.
You might try searching for C implementations of maps, e.g. http://elliottback.com/wp/hashmap-implementation-in-c/
A 'trie' or a 'hashmap' should do. The simplest implementation is an array of struct { char *s; int i; } pairs.
Check out 'trie' in 'include/nscript.h' and 'src/trie.c' here: http://github.com/nikki93/nscript . Change the 'trie_info' type to 'int'.
Try a Trie for strings, or a Tree of some sort for integer/pointer types (or anything that can be compared as "less than" or "greater than" another key). Wikipedia has reasonably good articles on both, and they can be implemented in C.