How to modify a numpy.recarray using its two views

I am new to Python and NumPy, and I am facing a problem: I cannot modify a numpy.recarray when applying two masked views. I read the recarray from a file, create two masked views, and then try to modify the values in a for loop. Here is an example:
import numpy as np
import matplotlib.mlab as mlab
dat = mlab.csv2rec(args[0], delimiter=' ')
m_Obsr = dat.is_observed == 1
m_ZeroScale = dat[m_Obsr].scale_mean < 0.01
for d in dat[m_Obsr][m_ZeroScale]:
    d.scale_mean = 1.0
But when I write out the result:
newFile = args[0] + ".no-zero-scale"
mlab.rec2csv(dat[m_Obsr][m_ZeroScale], newFile, delimiter=' ')
all the scale_mean values in the file are still zero.
I must be doing something wrong. Is there a proper way of modifying values through a view? Is it because I am applying two views one after the other?
Thank you.

I think you have a misconception in the term "masked views" and should (re-)read The Book (now freely downloadable) to clarify your understanding.
I quote from section 3.4.2:
Advanced selection is triggered when the selection object, obj, is a non-tuple sequence object, an ndarray (of data type integer or bool), or a tuple with at least one sequence object or ndarray (of data type integer or bool). There are two types of advanced indexing: integer and Boolean. Advanced selection always returns a copy of the data (contrast with basic slicing that returns a view).
What you're doing here is advanced selection (of the Boolean kind), so you're getting a copy and never binding it anywhere -- you make your changes on the copy, let it go away, and then write a fresh copy taken from the unchanged original.
Once you understand the issue the solution should be simple: make your copy once, make your changes on that copy, and write that same copy. I.e.:
dat = mlab.csv2rec(args[0], delimiter=' ')
m_Obsr = dat.is_observed == 1
m_ZeroScale = dat[m_Obsr].scale_mean < 0.01
the_copy = dat[m_Obsr][m_ZeroScale]
for d in the_copy:
    d.scale_mean = 1.0
newFile = args[0] + ".no-zero-scale"
mlab.rec2csv(the_copy, newFile, delimiter=' ')
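Alternatively, if the goal is to update the original dat and write out the whole amended file, a single combined boolean mask sidesteps the chained advanced-indexing copies entirely. A minimal sketch, assuming the field names from the question:

# combine both conditions into one mask over the full array
mask = (dat.is_observed == 1) & (dat.scale_mean < 0.01)
# one advanced-indexing assignment writes into dat itself
dat.scale_mean[mask] = 1.0
mlab.rec2csv(dat, args[0] + ".no-zero-scale", delimiter=' ')

The assignment works because dat.scale_mean is a view of a single field of dat, and indexed assignment into that view goes through ndarray.__setitem__ rather than making a copy.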

Getting error saying the truth value of an array with more than one element is ambiguous

...and a suggestion to Use a.any() or a.all().
I am new to Python and I am trying to implement a SABR model. I have defined a function with the following parameters:
import math

def haganimpliedvol(a,f,k,B,v,t,p):
    if k != f:
        z = v/a*math.pow(f*k,(1-B)/2)*math.log(f/k)
        xz = math.log((math.sqrt(1-2*p*z+math.pow(z,2))+z-p)/(1-p))
        sigma = a/math.pow(f*k,(1-B)/2)*(1 + math.pow(1-B,2)/24* math.pow(math.log(f/k),2)+\
            math.pow(1-B,4)/1920* math.pow(math.log(f/k),4))*\
            xz*\
            (1+(math.pow(1-B,2)/24*math.pow(a,2)/math.pow(f/k,1-B)+1/4*(p*B*v*a)/math.pow(f/k,(1-B)/2)+\
            (2-3*math.pow(p,2))/24*math.pow(v,2)))*t
    else:
        sigma = a/math.pow(f,1-B)*\
            (1+(math.pow(1-B,2)/24*math.pow(a,2)/math.pow(f,(2-2*B))+\
            1/4*(p*B*a*v)/math.pow(f,1-B)+(2-3*math.pow(p,2))/24*math.pow(v,2)))*t
    return sigma
Now I define another function that calls haganimpliedvol():
import numpy as np

params = [0.4,0.6,0.1,-0.4]

def objective(params):
    global k,sigma_iv,t,f
    a = params[0]
    B = params[1]
    v = params[2]
    p = params[1]
    for (i,j,k) in zip(k,t,f):
        calc_vols = np.array([haganimpliedvol(a,f,k,B,v,t,p)])
    return calc_vols
As can be seen, a few of the parameters passed to the functions are lists. I want to get an array as output. However, I keep getting the error message in the subject line.
Pay attention to the variables in this call:
for (i,j,k) in zip(k,t,f):
    calc_vols = np.array([haganimpliedvol(a,f,k,B,v,t,p)])
For the zip to work, k, t, and f have to be lists or arrays of matching size.
Don't use k as an iteration variable; it is already used in the zip. I think you are just being careless here, or confused.
The same goes for the arguments to the haganimpliedvol function: are the f, k, t passed there supposed to be the sequences used in the zip? It would make more sense to pass the iteration variables (i, j, ...). Again, this looks like carelessness, or not caring what happens.
As for the ambiguity error, that most likely arises in the
if k != f:
If either k or f is an array (or both), then k != f is a boolean array. That can't be used in an if, which requires a single True or False value; the if does not iterate over the elements of the condition. It is a basic Python if - a switch.
This ambiguity error comes up frequently, in various contexts, but always with the same basic issue - using an array in a context that requires a scalar True/False. A simple web search should provide lots of examples.
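A minimal sketch of the error and of the suggested any()/all() fixes, with made-up arrays:

import numpy as np

k = np.array([1.0, 2.0, 3.0])
f = np.array([1.0, 2.5, 3.0])

print(k != f)           # [False  True False] -- elementwise, not one bool
# if k != f:            # would raise: ValueError: The truth value of an
#                       # array with more than one element is ambiguous.
if (k != f).any():      # True if at least one pair differs
    print("some elements differ")
if (k != f).all():      # True only if every pair differs
    print("all elements differ")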
@hpaulj thank you for leading me on the right path. I vectorized my function, made some edits, and now it is working fine.
haganimpliedvol = np.vectorize(haganimpliedvol, excluded=['a','B','v','p'])

params = [0.2,0.7,0.01,-0.4]

def objective(params):
    global k,sigma_iv,t,f
    a = params[0]
    B = params[1]
    v = params[2]
    p = params[1]
    calc_vols = haganimpliedvol(a,f,k,B,v,t,p)
    return calc_vols
Are you sure you want to pass arrays into the haganimpliedvol() function?
The general convention is to write functions which take a single input type.
Maybe call it once per item in the array?
Or write the function so that, if it sees the input is a list, it iterates, and if the inputs aren't lists, it just calculates once.
See this thread for ideas
How to make a function that can handle single inputs or lists of inputs
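Along those lines, a hypothetical wrapper (haganimpliedvol_any is my name, not from the thread) that accepts either scalars or sequences for f, k, t:

def haganimpliedvol_any(a, f, k, B, v, t, p):
    # If f, k, t are sequences, evaluate the scalar function elementwise;
    # otherwise fall back to a single scalar call.
    try:
        return [haganimpliedvol(a, fi, ki, B, v, ti, p)
                for fi, ki, ti in zip(f, k, t)]
    except TypeError:   # zip() failed: the inputs are scalars
        return haganimpliedvol(a, f, k, B, v, t, p)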

"AttributeError: 'generator' object has no attribute 'replace' "

I'm not sure why I'm seeing this error message: AttributeError: 'generator' object has no attribute 'replace' (on the line modified_file = hex_read_file.replace(batch_to_amend_final, batch_amendment_final)).
import binascii, os, re, time

os.chdir(...)
files_to_amend = os.listdir(...)
joiner = "00"

# Allow user to input the text to be replaced, and with what
while True:
    batch_to_amend3 = input("\n\nWhat number would you like to amend? \n\n >>> ")
    batch_amendment3 = input("\n\nWhat is the new number? \n\n >>> ")
    batch_to_amend2 = batch_to_amend3.encode()
    batch_to_amend = joiner.encode().join(binascii.hexlify(bytes((i,))) for i in batch_to_amend2)
    batch_amendment2 = batch_amendment3.encode()
    batch_amendment = joiner.encode().join(binascii.hexlify(bytes((i,))) for i in batch_amendment2)

    # Function to translate label files
    def lbl_translate(files_to_amend):
        with open(files_to_amend, 'rb') as read_file:
            read_file2 = read_file.read()
        hex_read_file = (binascii.hexlify(bytes((i,))) for i in read_file2)
        print(hex_read_file)
        modified_file = hex_read_file.replace(batch_to_amend, batch_amendment)
        with open(files_to_amend, 'wb') as write_file:
            write_file.write(modified_file)
            write_file.close()
        print("Amended: " + files_to_amend)

    # Calling function to modify labels
    for label in files_to_amend:
        lbl_translate(label)
hex_read_file is a generator expression (note the round brackets around it), defined here:
hex_read_file = (binascii.hexlify(bytes((i,))) for i in read_file2)
As many have already pointed out in the comments, generators don't have a replace method the way strings (and bytestrings) do, so you have two possibilities, depending on your specific use-case:
Consume the generator into a single bytestring and call replace on that (considering how you use write_file.write(modified_file) afterwards, this is the option that works with that directly):
hex_read_file = b''.join(binascii.hexlify(bytes((int(i),))) for i in read_file2)  # b''.join concatenates the yielded bytestrings; the added int() call fixes the issue mentioned in the comments
Filter and replace directly in the comprehension (and modify how you write out the result):
def lbl_translate(files_to_amend, replacement_map):
    with open(files_to_amend, 'rb') as read_file:
        read_file2 = read_file.read()
    hex_read_file = (replacement_map.get(binascii.hexlify(bytes((int(i),))), binascii.hexlify(bytes((int(i),)))) for i in read_file2)  # see Note below
    with open(files_to_amend, 'wb') as write_file:
        for b in hex_read_file:
            write_file.write(b)
    print("Amended: " + files_to_amend)
where replacement_map is a dict that you fill with batch_to_amend as the key and batch_amendment as the value (you can specify multiple amendments too and it will work just the same). The call would then be:
for label in files_to_amend:
    lbl_translate(label, {batch_to_amend: batch_amendment})
NOTE: with a standard Python dict, because of how the generator expression works, you need to compute binascii.hexlify(bytes((int(i),))) twice here: once as the lookup key and once as the .get() fallback. collections.defaultdict looks like the better option, but defaultdicts expect a zero-argument factory that generates the value for unknown keys, so the factory cannot return the missing key itself (see here for more context on why I say that). Instead, you need to create your own subclass of dict and implement the __missing__ method to obtain the desired behaviour:
hex_read_file = (replacement_map[binascii.hexlify(bytes((int(i),)))] for i in read_file2)  # replacement_map is the dict subclass defined below
and you define replacement_map as:
class dict_with_key_as_default(dict):  # find a better name for the type
    def __missing__(self, key):
        '''If a key is not in the dictionary, return the key itself instead.'''
        return key

replacement_map = dict_with_key_as_default()
replacement_map[batch_to_amend] = batch_amendment
for label in files_to_amend:
    lbl_translate(label, replacement_map)
(class dict_with_key_as_default taken from this answer and renamed for clarity)
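A quick self-contained check of that mapping behaviour, with made-up hex values:

rm = dict_with_key_as_default()
rm[b'41'] = b'58'
print(rm[b'41'])   # b'58' -- a known key is replaced
print(rm[b'42'])   # b'42' -- an unknown key falls through unchanged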
Edit note: as mentioned in the comments, the OP's generator expression calls hexlify() on one-byte bytestrings instead of integer values. The solution above adds a cast to int where relevant, but it is far from the best fix for this problem. Since the OP's intent is not clear, I left the code as close to the original as possible; an alternative approach should probably be used instead.

Difference between a numpy.array and numpy.array[:]

Me again... :)
I tried finding an answer to this question but again I was not fortunate enough. So here it is.
What is the difference between assigning to a numpy array as a whole (let's say "iris") and assigning to all of the data in that array (using iris[:], for instance)?
I'm asking because of the error that I get when I run the first example below, while the second example works fine.
Here is the code:
In this first part I load the library and fetch the dataset from the internet.
import statsmodels.api as sm
iris = sm.datasets.get_rdataset(dataname='iris',
                                package='datasets')['data']
If I run this code I get an error:
iris.columns.values = [iris.columns.values[x].lower() for x in range( len( iris.columns.values ) ) ]
print(iris.columns.values)
Now if I run this code it works fine:
iris.columns.values[:] = [iris.columns.values[x].lower() for x in range( len( iris.columns.values ) ) ]
print(iris.columns.values)
Best regards,
The difference is that iris.columns.values = ... tries to rebind the values property of iris.columns, which pandas exposes as read-only (see the pandas implementation of pandas.core.frame.DataFrame), whereas iris.columns.values[:] = ... writes into the data of the underlying np.ndarray and replaces it with new values. In the second assignment statement you never overwrite the reference to the numpy object: [:] is a slice covering the whole array, so the assignment is dispatched to the __setitem__ method of the numpy array.
EDIT:
The exact implementation (there are multiple; here is the pd.Series one, and pd.Index is analogous) of such a property is:
@property
def values(self):
    """ return the array """
    return self.block.values
Thus you try to overwrite a property that is constructed with the @property decorator around a getter function; it cannot be replaced, since it is provided with a getter only and no setter. See Python's docs on the property() builtin.
iris.columns.values = val
calls
type(iris.columns).__setattr__(iris.columns, 'values', val)
This is running pandas' code, because type(iris.columns) is pd.Index.
iris.columns.values[:] = val
calls
type(iris.columns.values).__setitem__(iris.columns.values, slice(None), val)
This is running numpy's code, because type(iris.columns.values) is np.ndarray.
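A minimal sketch of the same behaviour outside pandas (the Holder class is made up for illustration):

import numpy as np

class Holder:
    def __init__(self):
        self._arr = np.array(['A', 'B'], dtype=object)

    @property
    def values(self):              # getter only, no setter
        return self._arr

h = Holder()
h.values[:] = ['a', 'b']           # fine: ndarray.__setitem__ mutates the data
print(h.values)                    # ['a' 'b']
h.values = np.array(['x', 'y'])    # AttributeError: can't set attribute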

Changes to copies of object mutate original object

I have a class within which there is a DataFrame type property. I want to be able to perform arithmetic on the objects using the built-ins while keeping the original objects immutable. Unfortunately, the operations seem to be mutating the original objects as well. Here's an example:
import numpy as np
import pandas as pd

class Container:
    def __init__(self):
        self.data = pd.DataFrame()

    def generate(self):
        self.data = pd.DataFrame(np.random.randint(0,100,size=(100, 1)), columns=['A'])
        return self

    def __add__(self, other):
        copy = self
        new = Container()
        new.data['A'] = copy.data.eval(f"A + {0}".format(other))
        return new

one = Container().generate()
two = one + 1
print(one.data == two.data)
I think the problem is the copy = self line, but I can't seem to preserve the original object even using the copy() method.
How do I make sure the original object doesn't change when a new one is created from it?
Surprisingly, while copy = self isn't a copy, your bug doesn't actually have anything to do with that. I don't think you even need a copy there.
Your bug is due to double-formatting a string:
f"A + {0}".format(other)
f"A + {0}" is an f-string. Unlike format, it evaluates the text 0 as a Python expression and substitutes the string representation of the resulting object into the resulting string, producing "A + 0". Calling format on that doesn't do anything, since there's no format placeholder left. You end up calling
copy.data.eval("A + 0")
instead of adding what you wanted to add.
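For reference, either single formatting mechanism produces the intended expression:

new.data['A'] = copy.data.eval(f"A + {other}")            # f-string interpolation...
new.data['A'] = copy.data.eval("A + {0}".format(other))   # ...or str.format, but not both at once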
Did you deepcopy?
from copy import deepcopy

dupe = deepcopy(thing)
# now thing and dupe are two separate objects
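Applied to the Container class from the question, that might look like the following sketch (pandas' own DataFrame.copy() would serve just as well as deepcopy here):

from copy import deepcopy

def __add__(self, other):                  # method of Container
    new = Container()
    new.data = deepcopy(self.data)         # independent copy of the DataFrame
    new.data['A'] = new.data['A'] + other  # mutate only the copy
    return new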

scipy.io.loadmat nested structures (i.e. dictionaries)

Using the given routines (see how to load Matlab .mat files with scipy), I could not access deeper nested structures to recover them into dictionaries.
To present the problem I run into in more detail, I give the following toy example:
import scipy.io as spio
a = {'b':{'c':{'d': 3}}}
# my dictionary: a['b']['c']['d'] = 3
spio.savemat('xy.mat',a)
Now I want to read the mat-File back into python. I tried the following:
vig=spio.loadmat('xy.mat',squeeze_me=True)
If I now want to access the fields I get:
>> vig['b']
array(((array(3),),), dtype=[('c', '|O8')])
>> vig['b']['c']
array(array((3,), dtype=[('d', '|O8')]), dtype=object)
>> vig['b']['c']['d']
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
/<ipython console> in <module>()
ValueError: field named d not found.
However, by using the option struct_as_record=False the field could be accessed:
v=spio.loadmat('xy.mat',squeeze_me=True,struct_as_record=False)
Now it was possible to access it by
>> v['b'].c.d
array(3)
Here are the functions that reconstruct the dictionaries; just use this loadmat instead of scipy.io's loadmat:
import scipy.io as spio

def loadmat(filename):
    '''
    this function should be called instead of direct spio.loadmat
    as it cures the problem of not properly recovering python dictionaries
    from mat files. It calls the function _check_keys to cure all entries
    which are still mat-objects
    '''
    data = spio.loadmat(filename, struct_as_record=False, squeeze_me=True)
    return _check_keys(data)

def _check_keys(d):
    '''
    checks if entries in dictionary are mat-objects. If yes,
    _todict is called to change them to nested dictionaries
    '''
    for key in d:
        if isinstance(d[key], spio.matlab.mio5_params.mat_struct):
            d[key] = _todict(d[key])
    return d

def _todict(matobj):
    '''
    A recursive function which constructs nested dictionaries from matobjects
    '''
    d = {}
    for strg in matobj._fieldnames:
        elem = matobj.__dict__[strg]
        if isinstance(elem, spio.matlab.mio5_params.mat_struct):
            d[strg] = _todict(elem)
        else:
            d[strg] = elem
    return d
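For example, with the xy.mat file written at the top of this question, the wrapper should then allow plain dictionary access:

data = loadmat('xy.mat')
print(data['b']['c']['d'])   # -> 3, nested structs recovered as plain dicts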
Just an enhancement to mergen's answer, which unfortunately will stop recursing if it reaches a cell array of objects. The following version will make lists of them instead, and continuing the recursion into the cell array elements if possible.
import scipy.io as spio
import numpy as np

def loadmat(filename):
    '''
    this function should be called instead of direct spio.loadmat
    as it cures the problem of not properly recovering python dictionaries
    from mat files. It calls the function _check_keys to cure all entries
    which are still mat-objects
    '''
    def _check_keys(d):
        '''
        checks if entries in dictionary are mat-objects. If yes,
        _todict is called to change them to nested dictionaries
        '''
        for key in d:
            if isinstance(d[key], spio.matlab.mio5_params.mat_struct):
                d[key] = _todict(d[key])
        return d

    def _todict(matobj):
        '''
        A recursive function which constructs nested dictionaries from matobjects
        '''
        d = {}
        for strg in matobj._fieldnames:
            elem = matobj.__dict__[strg]
            if isinstance(elem, spio.matlab.mio5_params.mat_struct):
                d[strg] = _todict(elem)
            elif isinstance(elem, np.ndarray):
                d[strg] = _tolist(elem)
            else:
                d[strg] = elem
        return d

    def _tolist(ndarray):
        '''
        A recursive function which constructs lists from cellarrays
        (which are loaded as numpy ndarrays), recursing into the elements
        if they contain matobjects.
        '''
        elem_list = []
        for sub_elem in ndarray:
            if isinstance(sub_elem, spio.matlab.mio5_params.mat_struct):
                elem_list.append(_todict(sub_elem))
            elif isinstance(sub_elem, np.ndarray):
                elem_list.append(_tolist(sub_elem))
            else:
                elem_list.append(sub_elem)
        return elem_list

    data = spio.loadmat(filename, struct_as_record=False, squeeze_me=True)
    return _check_keys(data)
As of scipy >= 1.5.0 this functionality now comes built-in using the simplify_cells argument.
from scipy.io import loadmat
mat_dict = loadmat(file_name, simplify_cells=True)
I was advised on the scipy mailing list (https://mail.python.org/pipermail/scipy-user/) that there are two more ways to access this data.
This works:
import scipy.io as spio
vig = spio.loadmat('xy.mat')
print(vig['b'][0, 0]['c'][0, 0]['d'][0, 0])
Output on my machine:
3
The reason for this kind of access: "For historic reasons, in Matlab everything is at least a 2D array, even scalars."
So scipy.io.loadmat mimics Matlab behaviour by default.
Found a solution: the contents of a scipy.io.matlab.mio5_params.mat_struct object can be investigated via:
v['b'].__dict__['c'].__dict__['d']
Another method that works:
import scipy.io as spio
vig = spio.loadmat('xy.mat', squeeze_me=True)
print(vig['b']['c'].item()['d'])
Output:
3
I learned this method on the scipy mailing list, too. I certainly don't understand (yet) why '.item()' has to be added in, and:
print(vig['b']['c']['d'])
will throw an error instead:
IndexError: only integers, slices (:), ellipsis (...), numpy.newaxis (None) and integer or boolean arrays are valid indices
but I'll be back to supplement the explanation when I know it. Explanation of numpy.ndarray.item (from the NumPy reference):
Copy an element of an array to a standard Python scalar and return it.
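A sketch of my reading of why .item() is needed here, using the xy.mat from the question:

c = vig['b']['c']    # a 0-d object array wrapping the nested struct;
                     # indexing it by field name raises the IndexError above
rec = c.item()       # .item() unwraps the 0-d array into the record it holds,
print(rec['d'])      # whose fields are indexable again -> 3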
(Please notice that this answer is basically the same as the comment of hpaulj to the initial question, but I felt that the comment is not 'visible' or understandable enough. I certainly did not notice it when I searched for a solution for the first time, some weeks ago).
