I'm trying to create a PyTables EArray on the fly based on one column from a numpy recarray. This works with createArray, since I can simply pass it the numpy array extracted from the recarray. However, for createEArray I need to define the atom, which is causing problems.
In the example MyRecArray is a recordarray with 1-D arrays for columns, Myhdf5 is a predefined Pytables file, and Mynode is a predefined group in that file from which the EArray leaves will hang.
Myfield = MyRecArray[Colname]
afieldtype = Myfield.dtype
Myatom = tables.atom.Atom(afieldtype, (1,), -9999)
MyEarray = Myhdf5.createEArray(Mynode, Colname, Myatom, (0,))
MyEarray.append(Myfield)
MyEarray.flush()
MyEarray.close()
Using this code gives the error:
NotImplementedError: ``Atom`` is an abstract class;
please use one of its subclasses
I could probably write a subroutine with case statements based on the array type and pass back an atom, but I was wondering whether there is a generic way to create such an atom by passing it the array type, instead of having to call a type-specific constructor such as "tables.atom.FloatAtom(....)" for each data type.
Thanks
I believe using the function:
tables.Atom.from_dtype(afieldtype, dflt=-9999)
will allow you to create an atom without going the subroutine route. The shape is contained in the dtype "afieldtype" (e.g. dtype([('col1', '<f8', (10,))])).
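For example, a minimal sketch using the names from the question (Myhdf5, Mynode, MyRecArray, and Colname are assumed to be defined as described there):
import tables
# Build the atom directly from the column's dtype, so no per-type case statements are needed
Myfield = MyRecArray[Colname]
Myatom = tables.Atom.from_dtype(Myfield.dtype, dflt=-9999)
# Create the extendable array and append the column data
MyEarray = Myhdf5.createEArray(Mynode, Colname, Myatom, (0,))
MyEarray.append(Myfield)
MyEarray.flush()
MyEarray.close()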
I have a commercial package which has a COM interface. I am trying to control it via the COM interface from Python. Most things are working just fine with regular input parameters and outputs.
However, one particular set of functions appears to take a pre-allocated data structure as input, which it then fills out with the results of the query. So: an out-parameter of array type, in this instance.
Some helpful example VBA code which accompanies the product alongside an Excel spreadsheet seems to work just fine. It declares the input array as: Dim myArray(customsize) as Integer
This then gets passed directly into the call on the COM object: cominterface.GetContent(myArray)
However when I try something similar in Python, I either get errors or no results depending on how I try to pass the array in:
import comtypes
from ctypes import c_ulong
... code to generate bindings, create object, grab the interface ...
# create an array for storing the results
my_array_type = c_ulong * 1000
my_array_instance = my_array_type()
# attempt to pass the array into the call on the COM interface
r = cominterface.GetContent(my_array_instance)
# expect to see id's 1,2,3,4
print(my_array_instance)
The above gives me error:
TypeError: Cannot put <__main__.c_ulong_Array_1000 object at 0x00000282514F5640> in VARIANT
So it would seem that comtypes does not support passing ctypes arrays through directly, as it tries to wrap the argument in a VARIANT.
Thus a different attempt:
# create an array for storing the results
my_array_instance = [0] * 100
# attempt to pass the array into the call on the COM interface
r = cominterface.GetContent(my_array_instance)
# expect to see id's 1,2,3,4
print(my_array_instance)
The above call has a return code indicating success, but the array is unchanged and still contains the initial 0's it was pre-seeded with.
So I am assuming here that comtypes is somehow not transporting the written values back into the Python list. But that's a big assumption - I really don't know.
I have tried a number of other approaches, including POINTER and byref(). Almost everything results in some kind of error, either from the code doing the bindings or from the COM function itself saying the parameter does not meet its requirements.
If someone knows how I can pass in a pre-allocated array for this COM function to write to, I would be very much appreciative.
EDIT:
I rewrote the code in C# and it had the same problem, so I began to suspect the COM interface was not correct. By providing my own interface with modified function signatures (adding a 'ref' for the parameters), I was able to get the calls to work.
I suspect the .tlb file was in error and just happened to work with VBA, but I am unsure.
Context
In pySpark I broadcast a variable to all nodes with the following code:
import csv
from pyspark import SparkFiles

sc = spark.sparkContext # Get context
# Extract stopwords from a file in hdfs
# The result looks like stopwords = {"and", "foo", "bar" ... }
stopwords = set([line[0] for line in csv.reader(open(SparkFiles.get("stopwords.txt"), 'r'))])
# The set of stopwords is broadcasted now
stopwords = sc.broadcast(stopwords)
After broadcasting the stopwords I want to make it accessible in mapPartitions:
# Some dummy-dataframe
df = spark.createDataFrame([(["TESTA and TESTB"], ), (["TESTB and TESTA"], )], ["text"])
# The method which will be applied to mapPartitions
def stopwordRemoval(partition, passed_broadcast):
    """
    Removes stopwords from "text"-column.
    #partition: iterator-object of partition.
    #passed_stopwords: Lookup-table for stopwords.
    """
    # Now the broadcast is passed
    passed_stopwords = passed_broadcast.value
    for row in partition:
        yield [" ".join((word for word in row["text"].split(" ") if word not in passed_stopwords))]
# re-partitioning in order to get mapPartitions working
df = df.repartition(2)
# Now apply the method
df = df.select("text").rdd \
.mapPartitions(lambda partition: stopwordRemoval(partition, stopwords)) \
.toDF()
# Result
df.show()
#Result:
+------------+
| text |
+------------+
|TESTA TESTB |
|TESTB TESTA |
+------------+
Questions
Even though it works I'm not quite sure if this is the right usage of broadcasting variables. So my questions are:
Is the broadcast correctly executed when I pass it to mapPartitions in the demonstrated way?
Is using broadcasting within mapPartitions useful, since stopwords would be distributed with the function to all nodes anyway (stopwords is never reused)?
The second question relates to this question, which partly answers my own. However, the specifics differ, which is why I've chosen to ask this one as well.
Some time went by and I read some additional information which answered the question for me. Thus, I wanted to share my insights.
Question 1: Is the broadcast correctly executed when I pass it to mapPartitions in the demonstrated way?
First, note that SparkContext.broadcast() returns a wrapper around the variable to broadcast, as can be read in the docs. This wrapper serializes the variable and adds the information to the execution graph to distribute this serialized form over the nodes. Calling the broadcast's .value attribute is what deserializes the variable again when it is used.
Additionally, the docs state:
After the broadcast variable is created, it should be used instead of the value v in any functions run on the cluster so that v [the variable] is not shipped to the nodes more than once.
Secondly, I found several sources stating that this works with UDFs (User Defined Functions), e.g. here. mapPartitions() and udf() should be considered analogous, since both, in the case of pySpark, pass the data to a Python instance on the respective nodes.
Regarding this, here is the important part: deserialization has to be part of the Python function (the udf() or whatever function is passed to mapPartitions()) itself, meaning the broadcast's .value must not be resolved and passed as the function parameter.
Thus, the broadcast above is done the right way: the broadcast wrapper is passed as the parameter and the variable is deserialized inside stopwordRemoval().
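To make the distinction concrete, a minimal sketch using the names from the question (removal_with_plain_set is a hypothetical variant of stopwordRemoval() that expects the plain set instead of the wrapper):
# Right: hand the Broadcast wrapper to the worker-side function and
# call .value inside it (as stopwordRemoval() does above)
df.select("text").rdd.mapPartitions(lambda partition: stopwordRemoval(partition, stopwords))
# Wrong: resolving .value on the driver ships the plain set inside the
# task closure, bypassing the broadcast mechanism
plain_set = stopwords.value
df.select("text").rdd.mapPartitions(lambda partition: removal_with_plain_set(partition, plain_set))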
Question 2: Is using broadcasting within mapPartitions useful, since stopwords would be distributed with the function to all nodes anyway (stopwords is never reused)?
It's documented that there is only an advantage if serialization yields any value for the task at hand:
The data broadcasted this way is cached in serialized form and deserialized before running each task. This means that explicitly creating broadcast variables is only useful when tasks across multiple stages need the same data or when caching the data in deserialized form is important.
This might be the case when you have a large reference to broadcast to your cluster:
[...] to give every node a copy of a large input dataset in an efficient manner. Spark also attempts to distribute broadcast variables using efficient broadcast algorithms to reduce communication cost.
If this applies to your broadcast, broadcasting has an advantage.
I'd like to create a tf.data.Dataset.from_generator(...) dataset. I need to pass in a Python generator.
I would like to pass in a property of a previous dataset to the generator like so:
dataset = dataset.interleave(
    map_func=lambda x: tf.data.Dataset.from_generator(generator=lambda: gen(x), output_types=tf.int64),
    cycle_length=2
)
Where I define gen(...) to take a value (which is a pointer to some data such as a filename which gen knows how to access).
This fails because gen receives a tensor object, not a python/numpy value.
Is there a way to resolve the tensor object to a value inside of gen(...)?
The reason for interleaving the generators is so I can manipulate the list of data-pointers/filenames with other dataset operations such as .shuffle() and .repeat() without the need to bake those into the gen(...) function, which would be necessary if I started with the generator directly from the list of data-pointers/filenames.
I want to use the generator because a large number of data values will be generated per data-pointer/filename.
TensorFlow now supports passing tensor arguments to the generator:
def map_func(tensor):
    dataset = tf.data.Dataset.from_generator(generator, tf.float32, args=(tensor,))
    return dataset
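For instance, a hypothetical sketch of how this might slot into the interleave from the question (gen and the file names are placeholders; with args=, each element is converted to a numpy value before being handed to the generator):
import tensorflow as tf
# Placeholder dataset of data-pointers/filenames
filenames = tf.data.Dataset.from_tensor_slices(["file_a.txt", "file_b.txt"])
# args=(x,) forwards each element to gen() as a numpy value rather than a tensor
dataset = filenames.interleave(
    lambda x: tf.data.Dataset.from_generator(gen, tf.int64, args=(x,)),
    cycle_length=2
)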
The answer is indeed no. Here are a couple of relevant GitHub issues (open as of the time of this writing) to follow for further developments on the question:
https://github.com/tensorflow/tensorflow/issues/13101
https://github.com/tensorflow/tensorflow/issues/16343
I have a whole series of arrays with similar names mcmcdata.rho0, mcmcdata.rho1, ... and I want to be able to loop through them while updating their values. I can't figure out how this might be done or even what such a thing might be called.
I read my data in from file like this:
names1='l b rho0 rho1 rho2 rho3 rho4 rho5 rho6 rho7 rho8 rho9 rho10 rho11 rho12 rho13 rho14 rho15 rho16 rho17 rho18 rho19 rho20 rho21 rho22 rho23'.split()
mcmcdata=np.genfromtxt(filename,names=names1,dtype=None).view(np.recarray)
and I want to update the "rho" arrays later on after I do some calculations.
for jj in range(dbins):
    mcmc_x, mcmc_y, mcmc_z = wf.lbd_to_xyz(mcmcdata.l, mcmcdata.b, d[jj], R_sun)
    rho, thindisk, thickdisk, halo = wf.total_density_fithRthinhRthickhzthinhzthickhrfRiA(mcmc_x, mcmc_y, mcmc_z, R_sun, params)
    eval("mcmcdata."+names1[2+jj]) = copy.deepcopy(rho)
    eval("mcmcthin."+names1[2+jj]) = copy.deepcopy(thindisk)
    eval("mcmcthick."+names1[2+jj]) = copy.deepcopy(thickdisk)
    eval("mcmchalo."+names1[2+jj]) = copy.deepcopy(halo)
But the eval command is giving an error:
File "<ipython-input-133-30322c5e633d>", line 13
eval("mcmcdata."+names1[2+jj]) = copy.deepcopy(rho)
SyntaxError: can't assign to function call
How can I loop through my existing arrays and update their values?
or
How can identify the arrays by name so I can update them?
The eval command doesn't work the way you seem to think it does. You appear to be using it like a text-replacement macro, hoping that Python will read the given string and then pretend you wrote that text in the original source code. Instead, it receives a string, and then it executes that code. You're giving it an expression that refers to an attribute of an object, which is fine, but the result of evaluating that expression does not yield a thing you can assign to. It yields the value of that attribute.
Although Python provides eval, it also provides many other things that often obviate the need for eval. In the case of your code, Python provides setattr. You give it an object, the name of an attribute on that object, and a value, and it assigns that object's attribute to refer to the given value.
setattr(mcmcdata, names1[2+jj], copy.deepcopy(rho))
It might make the code more readable to get rid of the names1 portion, too. I might write the code like this:
setattr(mcmcdata, 'rho' + str(jj), copy.deepcopy(rho))
That way, it's clear that I'm assigning the rho-related attributes of the object without having to go look at what's held in the names1 list; the name names1 doesn't offer much information about what's in it.
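Putting it together, the loop from the question might look like this (wf, d, dbins, params, R_sun, and the various mcmc* recarrays are assumed to be defined as in the question):
for jj in range(dbins):
    mcmc_x, mcmc_y, mcmc_z = wf.lbd_to_xyz(mcmcdata.l, mcmcdata.b, d[jj], R_sun)
    rho, thindisk, thickdisk, halo = wf.total_density_fithRthinhRthickhzthinhzthickhrfRiA(mcmc_x, mcmc_y, mcmc_z, R_sun, params)
    # setattr assigns each result to the matching rho<jj> field
    setattr(mcmcdata, 'rho' + str(jj), copy.deepcopy(rho))
    setattr(mcmcthin, 'rho' + str(jj), copy.deepcopy(thindisk))
    setattr(mcmcthick, 'rho' + str(jj), copy.deepcopy(thickdisk))
    setattr(mcmchalo, 'rho' + str(jj), copy.deepcopy(halo))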
I've installed ioapiTools, a Python module to manage IOAPI-format files. The module is supposed to handle these files and perform operations on them, including basic arithmetic. But something is wrong: when I try to, say, multiply an array by a float or an integer, the result is a zero-valued array (even though both the array and the float/integer are non-zero).
The module in question creates a temporary variable using cdms2 according to the following syntax:
import cdms2 as cdms, cdtime, MV2 as MV, cdutil
import numpy as N
..........
def __mul__(self, other):
    """
    Wrapper around cdms tvariable multiply
    """
    tmpVar = cdms.tvariable.TransientVariable.__mul__(self, other)
    iotmpVar = createVariable(tmpVar, self.ioM, id=self.id, \
                              attributes=self.attributes, copyFlag=False)
    return iotmpVar
But the variable returns nothing but zeros.
Any ideas?
I tried to use ioapiTools, and the latest version I found was 0.3.2 from http://www2-pcmdi.llnl.gov/Members/azubrow/ioapiTools/download-source-file .
Unfortunately, the code doesn't seem to have caught up with the evolution of CDAT, which now recommends using numpy instead of Numeric. An automated translation tool may resolve some of the problems, but not all. For example, the class iovar (defined in ioapiTools.py:2103) now needs a __new__ method, as it is a subclass of numpy's masked array (I don't know how things were in Numeric). With that, I seem to have __mul__ working. I couldn't reproduce your problem, though, because I couldn't even get an instance of iovar without a __new__ method defined.
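For reference, a minimal sketch of the general pattern for adding __new__ to a masked-array subclass (this is only an illustration, not the actual iovar code, which also carries IOAPI-specific metadata):
import numpy as np

class iovar(np.ma.MaskedArray):
    # Illustrative __new__ only; the real class in ioapiTools.py does more
    def __new__(cls, data, mask=np.ma.nomask, **kwargs):
        obj = np.ma.MaskedArray.__new__(cls, data, mask=mask, **kwargs)
        return obj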
I can pass what I have to you if you still need it, but I am sure there are more problems hiding... let me know if you need it, though.