spatial regression in Python - read matrix from list - python

I have the following problem. I am following this example about spatial regression in Python:
import numpy
import libpysal
import spreg
import pickle
# Read spatial data
ww = libpysal.io.open(libpysal.examples.get_path("baltim_q.gal"))
w = ww.read()
ww.close()
w_name = "baltim_q.gal"
w.transform = "r"
The example above works. But I would like to read my own spatial weights matrix, which I currently have as a list of lists. See my approach:
ww = libpysal.io.open(matrix)
But I got this error message:
Traceback (most recent call last):
File "/usr/lib/python3.8/code.py", line 90, in runcode
exec(code, self.locals)
File "<input>", line 1, in <module>
File "/home/vojta/Desktop/INTERNET_HANDEL/ZASILKOVNA/optimal-delivery-branches/venv/lib/python3.8/site-packages/libpysal/io/fileio.py", line 90, in __new__
cls.__registry[cls.getType(dataPath, mode, dataFormat)][mode][0]
File "/home/vojta/Desktop/INTERNET_HANDEL/ZASILKOVNA/optimal-delivery-branches/venv/lib/python3.8/site-packages/libpysal/io/fileio.py", line 105, in getType
ext = os.path.splitext(dataPath)[1]
File "/usr/lib/python3.8/posixpath.py", line 118, in splitext
p = os.fspath(p)
TypeError: expected str, bytes or os.PathLike object, not list
This is how the matrix looks:
[[0, 2, 1], [2, 0, 4], [1, 4, 0]]
EDIT:
If I try to insert my matrix into the GM_Lag like this:
model = spreg.GM_Lag(
    y,
    X,
    w=matrix,
)
I got the following error:
warn("w must be API-compatible pysal weights object")
Traceback (most recent call last):
File "/usr/lib/python3.8/code.py", line 90, in runcode
exec(code, self.locals)
File "<input>", line 2, in <module>
File "/home/vojta/Desktop/INTERNET_HANDEL/ZASILKOVNA/optimal-delivery-branches/venv/lib/python3.8/site-packages/spreg/twosls_sp.py", line 469, in __init__
USER.check_weights(w, y, w_required=True)
File "/home/vojta/Desktop/INTERNET_HANDEL/ZASILKOVNA/optimal-delivery-branches/venv/lib/python3.8/site-packages/spreg/user_output.py", line 444, in check_weights
if w.n != y.shape[0] and time == False:
AttributeError: 'list' object has no attribute 'n'
EDIT 2:
This is how I read the list of lists:
import pickle
with open("weighted_matrix.pkl", "rb") as f:
matrix = pickle.load(f)
How can I pass a list of lists to spreg.GM_Lag? Thanks.

Why do you want to pass it to the libpysal.io.open method? If I understand the code correctly, you first open a file and then read it (and the read method seems to return a list). So in your case, where you already have the matrix, you don't need to open or read any file.
What matters, though, is what w is supposed to look like after w = ww.read(). If it is a plain matrix, you can simply set w = matrix. If the read method also formats the data in a particular way, you will need to reproduce that formatting yourself. If you could describe the expected behaviour of the read method (e.g. what the input file contains and what is returned), that would help.
As mentioned, since the data needs to be a libpysal.weights object, you must build one yourself. This can supposedly be done with the libpysal.weights.W class. (I read the docs too quickly earlier.)
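A minimal sketch of how that could look (hedged: full2W and the row-standardization step are my assumptions based on the libpysal docs, and the matrix rows/columns must follow the same ordering as the observations in y and X):
import numpy as np
import libpysal
# the dense adjacency matrix from the question, as a list of lists
matrix = [[0, 2, 1], [2, 0, 4], [1, 4, 0]]
# full2W builds a pysal W object from a full (dense) array
w = libpysal.weights.full2W(np.asarray(matrix))
w.transform = "r"  # row-standardize, as in the original example
# w can now be passed as spreg.GM_Lag(y, X, w=w)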

Related

Using `xarray.apply_ufunc` with `np.linalg.pinv` returns an error with `dask.array`

I get an error when running the following MWE:
import xarray as xr
import numpy as np
from numpy.linalg import pinv
import dask
data = np.random.randn(4, 4, 3, 2)
da = xr.DataArray(data=data, dims=("x", "y", "i", "j"),)
da = da.chunk(x=1, y=1)
da_inv = xr.apply_ufunc(pinv, da,
    input_core_dims=[["i", "j"]],
    output_core_dims=[["i", "j"]],
    exclude_dims=set(("i", "j")),
    dask="parallelized",
)
This throws me this error:
Traceback (most recent call last):
File "/glade/scratch/tomasc/tracer_inversion2/mwe.py", line 14, in <module>
da_inv = xr.apply_ufunc(pinv, da,
File "/glade/u/home/tomasc/miniconda3/envs/py310/lib/python3.10/site-packages/xarray/core/computation.py", line 1204, in apply_ufunc
return apply_dataarray_vfunc(
File "/glade/u/home/tomasc/miniconda3/envs/py310/lib/python3.10/site-packages/xarray/core/computation.py", line 315, in apply_dataarray_vfunc
result_var = func(*data_vars)
File "/glade/u/home/tomasc/miniconda3/envs/py310/lib/python3.10/site-packages/xarray/core/computation.py", line 771, in apply_variable_ufunc
result_data = func(*input_data)
File "/glade/u/home/tomasc/miniconda3/envs/py310/lib/python3.10/site-packages/xarray/core/computation.py", line 747, in func
res = da.apply_gufunc(
File "/glade/u/home/tomasc/miniconda3/envs/py310/lib/python3.10/site-packages/dask/array/gufunc.py", line 489, in apply_gufunc
core_output_shape = tuple(core_shapes[d] for d in ocd)
File "/glade/u/home/tomasc/miniconda3/envs/py310/lib/python3.10/site-packages/dask/array/gufunc.py", line 489, in <genexpr>
core_output_shape = tuple(core_shapes[d] for d in ocd)
KeyError: 'dim0'
Even though when using dask.array.map_blocks directly, things seem to work right out of the box:
data_inv = dask.array.map_blocks(pinv, da.data).compute() # works!
What am I missing here?
(Same question answered on the xarray repository here.)
You were almost there, you just needed to add the sizes of the new output dimensions by including the kwarg
dask_gufunc_kwargs={'output_sizes': {'i': 2, 'j': 3}}
It does sort of say this in the docstring for apply_ufunc but it could definitely be clearer!
That's a very unhelpful error, but it's ultimately being thrown because the keys 'i' and 'j' don't exist in the dict of expected sizes of the output (because you didn't provide them).
The actual error message has been improved in xarray version v2023.2.0.
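Putting it together, the corrected call from the MWE above would look roughly like this (the sizes follow from pinv turning each (3, 2) block into a (2, 3) block):
da_inv = xr.apply_ufunc(pinv, da,
    input_core_dims=[["i", "j"]],
    output_core_dims=[["i", "j"]],
    exclude_dims=set(("i", "j")),
    dask="parallelized",
    # sizes of the new output core dimensions, so dask can build the graph
    dask_gufunc_kwargs={"output_sizes": {"i": 2, "j": 3}},
)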

How to Read nrows From Pandas HDF Storage?

What am I trying to do?
pd.read_csv(... nrows=###) can read the top nrows of a file. I'd like to do the same while using pd.read_hdf(...).
What is the problem?
I am confused by the documentation. start and stop look like what I need, but when I try them a ValueError is returned. The second thing I tried was nrows=10, thinking it might be an allowed keyword argument; no error is thrown, but the full dataset is returned instead of just 10 rows.
Question: How does one correctly read a smaller subset of rows from an HDF file? (edit: without having to read the whole thing into memory first!)
Below is my interactive session:
>>> import pandas as pd
>>> df = pd.read_hdf('storage.h5')
Traceback (most recent call last):
File "<pyshell#1>", line 1, in <module>
df = pd.read_hdf('storage.h5')
File "C:\Python35\lib\site-packages\pandas\io\pytables.py", line 367, in read_hdf
raise ValueError('key must be provided when HDF5 file '
ValueError: key must be provided when HDF5 file contains multiple datasets.
>>> import h5py
>>> f = h5py.File('storage.h5', mode='r')
>>> list(f.keys())[0]
'table'
>>> f.close()
>>> df = pd.read_hdf('storage.h5', key='table', start=0, stop=10)
Traceback (most recent call last):
File "<pyshell#6>", line 1, in <module>
df = pd.read_hdf('storage.h5', key='table', start=0, stop=10)
File "C:\Python35\lib\site-packages\pandas\io\pytables.py", line 370, in read_hdf
return store.select(key, auto_close=auto_close, **kwargs)
File "C:\Python35\lib\site-packages\pandas\io\pytables.py", line 740, in select
return it.get_result()
File "C:\Python35\lib\site-packages\pandas\io\pytables.py", line 1447, in get_result
results = self.func(self.start, self.stop, where)
File "C:\Python35\lib\site-packages\pandas\io\pytables.py", line 733, in func
columns=columns, **kwargs)
File "C:\Python35\lib\site-packages\pandas\io\pytables.py", line 2890, in read
return self.obj_type(BlockManager(blocks, axes))
File "C:\Python35\lib\site-packages\pandas\core\internals.py", line 2795, in __init__
self._verify_integrity()
File "C:\Python35\lib\site-packages\pandas\core\internals.py", line 3006, in _verify_integrity
construction_error(tot_items, block.shape[1:], self.axes)
File "C:\Python35\lib\site-packages\pandas\core\internals.py", line 4280, in construction_error
passed, implied))
ValueError: Shape of passed values is (614, 593430), indices imply (614, 10)
>>> df = pd.read_hdf('storage.h5', key='table', nrows=10)
>>> df.shape
(593430, 614)
Edit:
I just attempted to use where:
mylist = list(range(30))
df = pd.read_hdf('storage.h5', key='table', where='index=mylist')
I received a TypeError indicating a Fixed format store (the default format of df.to_hdf(...)):
TypeError: cannot pass a where specification when reading from a
Fixed format store. this store must be selected in its entirety
Does this mean I can't select a subset of rows if the format is Fixed format?
I ran into the same problem. I am pretty certain by now that https://github.com/pandas-dev/pandas/issues/11188 tracks this very problem. It is a ticket from 2015 and it contains a repro. Jeff Reback suggested that this is actually a bug, and he even pointed us towards a solution back in 2015. It's just that nobody has built that solution yet. I might give it a try.
Seems like this now works, at least with pandas 1.0.1. Just provide start and stop arguments:
df = pd.read_hdf('test.h5', '/floats/trajectories', start=0, stop=5)
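If rewriting the file is an option, here is a minimal sketch of the table-format route (assuming the original frame is still available as df): writing with format="table" also makes where selections possible, which the Fixed format refuses.
# write in table format instead of the default fixed format
df.to_hdf('storage.h5', key='table', format='table')
# now partial reads work without loading the whole dataset
first_10 = pd.read_hdf('storage.h5', key='table', start=0, stop=10)
subset = pd.read_hdf('storage.h5', key='table', where='index < 30')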

broadcasting linalg.pinv on a 3D theano tensor

In the example below there is a 3D numpy array of shape (4, 3, 3), plus a solution for calculating the pinv of each of those four 3x3 matrices in numpy. I also tried the same approach that worked in numpy in theano, hoping that it is implemented the same way, but it failed. Any idea how to do this in theano?
dt = np.dtype(np.float32)
a = [[[12, 3, 1],
      [2, 4, 1],
      [2, 4, 2]],
     [[12, 3, 3],
      [2, 4, 4],
      [2, 4, 5]],
     [[12, 3, 6],
      [2, 4, 5],
      [2, 4, 4]],
     [[12, 3, 3],
      [2, 4, 5],
      [2, 4, 6]]]
a = np.asarray(a, dtype=dt)
print(a.shape)
apinv=np.zeros((4,3,3))
print(np.linalg.pinv(a[0,:,:]).shape)
#numpy solution
apinv = map(lambda n: np.linalg.pinv(n), a)
apinv = np.asarray(apinv,dtype=dt)
#theano solution (not working)
at=T.tensor3('a')
apinvt = map(lambda n: T.nlinalg.pinv(n), at)
The error is:
Original exception was:
Traceback (most recent call last):
File "pydevd.py", line 2403, in <module>
globals = debugger.run(setup['file'], None, None, is_module)
File "pydevd.py", line 1794, in run
launch(file, globals, locals) # execute the script
File "exp_thn_pinv_map.py", line 35, in <module>
apinvt = map(lambda n: T.nlinalg.pinv(n), at)
File "theano/tensor/var.py", line 549, in __iter__
raise TypeError(('TensorType does not support iteration. '
TypeError: TensorType does not support iteration. Maybe you are using builtin.sum instead of theano.tensor.sum? (Maybe .max?)
The error message is
Traceback (most recent call last):
File "D:/Dropbox/source/intro_theano/pinv.py", line 32, in <module>
apinvt = map(lambda n: T.nlinalg.pinv(n), at)
File "d:\dropbox\source\theano\theano\tensor\var.py", line 549, in __iter__
raise TypeError(('TensorType does not support iteration. '
TypeError: TensorType does not support iteration. Maybe you are using builtin.sum instead of theano.tensor.sum? (Maybe .max?)
This is occurring because, as the error message indicates, the symbolic variable at is not iterable.
The fundamental problem here is that you're incorrectly mixing immediately executed Python code with delayed execution Theano symbolic code.
You need to use a symbolic loop, not a Python loop. The correct solution is to use Theano's scan operator:
import numpy as np
import theano
import theano.tensor as T

at = T.tensor3('a')
apinvt, _ = theano.scan(lambda n: T.nlinalg.pinv(n), sequences=at, strict=True)
f = theano.function([at], apinvt)
print(np.allclose(f(a), apinv))

Numpy array from pandas frames can't be count vectorized due to "'float' object has no attribute 'lower'" error

I have a pandas data frame that I am reading from a csv. It includes three columns: a subject line and two numbers I am not using yet.
>>> input
0 1 2
0 Stress Free Christmas Gift They'll Love 0.010574 8
I have converted the list of subjects to a numpy array, and I want to use CountVectorizer for Naive Bayes. When I do that, I get the following error.
>>> cv=CountVectorizer()
>>> subjects=np.asarray(input[0])
>>> cv.fit_transform(subjects)
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/Library/Python/2.7/site-packages/sklearn/feature_extraction/text.py", line 780, in fit_transform
vocabulary, X = self._count_vocab(raw_documents, self.fixed_vocabulary)
File "/Library/Python/2.7/site-packages/sklearn/feature_extraction/text.py", line 715, in _count_vocab
for feature in analyze(doc):
File "/Library/Python/2.7/site-packages/sklearn/feature_extraction/text.py", line 229, in <lambda>
tokenize(preprocess(self.decode(doc))), stop_words)
File "/Library/Python/2.7/site-packages/sklearn/feature_extraction/text.py", line 195, in <lambda>
return lambda x: strip_accents(x.lower())
AttributeError: 'float' object has no attribute 'lower'
These items should definitely all be strings. When I read the csv in with the csv library instead and created an array of that column, I didn't have any problems. Any ideas?
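For what it's worth, a quick diagnostic sketch (reusing the input frame from above): one common cause is that pandas turns empty csv cells into NaN, which is a float, while the csv module returns empty strings, which would explain why that path worked.
print(input[0].map(type).value_counts())  # any float entries are NaN
subjects = np.asarray(input[0].dropna().astype(str))  # drop or cast before vectorizing
cv.fit_transform(subjects)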

Python: fromfile array too big and/or gzip will not write

My goal is to extract a large subimage from an even larger uncompressed image (30000, 65536), without reading the whole image into memory, then save the subimage in a compressed format. At the moment, I only care about determining how well the compression worked as an indicator of image complexity; I don't need to save the image in a visible format, but I would love to. This is my first python script and I am getting stuck on the entry-limits for some function calls.
I get two related errors based on two alternate attempts (with boring lines removed):
Version 1:
fd = open(fname,'rb')
h5file=h5py.File(h5fname, "w")
data = h5file.create_dataset("data", (Height, Width), dtype='i', maxshape=(None, None)) # try i8?
data = fromfile(file=fd, dtype='h', count=Height*Width) #FAIL
fd.close()
h5file.close()
outfilez = gzip.open(outfilename,'wb')
outfilez.write(data)
outfilez.close()
Error:
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "C:...\sitecustomize.py", line 523, in runfile
execfile(filename, namespace)
File "C:...\Script_v3.py", line 183, in <module>
data = fromfile(file=fd, dtype=datatype, count=BandHeight_tracks*Width)
ValueError: array is too big.
Version 2 (for loop to reduce fromfile usage):
fd = open(fname,'rb')
h5file=h5py.File(h5fname, "w")
data = h5file.create_dataset("data", (Height, Width), dtype='i', maxshape=(None, None)) # try i8?
for i in range(0, Height-1):
    data[i:] = fromfile(file=fd, dtype='h', count=Width)
fd.close()
h5file.close()
outfilez = gzip.open(outfilename,'wb')
outfilez.write(data)
outfilez.close()
Error (I do not get this with the other version):
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "C:...\sitecustomize.py", line 523, in runfile
execfile(filename, namespace)
File "C:...\Script_v4.py", line 195, in <module>
outfilez.write(data)
File "C:...\gzip.py", line 235, in write
self.crc = zlib.crc32(data, self.crc) & 0xffffffffL
TypeError: must be string or read-only buffer, not Dataset
I am running code using spyder on a 64bit Win7 machine with 16GBRAM. The images are a max of 4GB.
You should check whether NumPy arrays suit your needs:
NumPy is a general-purpose array-processing package designed to efficiently manipulate large multi-dimensional arrays of arbitrary records without sacrificing too much speed for small multi-dimensional arrays. NumPy is built on the Numeric code base and adds features introduced by numarray as well as an extended C-API and the ability to create arrays of arbitrary type which also makes NumPy suitable for interfacing with general-purpose data-base applications.
Solution as I found out eventually:
Error 1:
Ignore it. If I need to operate on this much memory, I will switch to C++.
Error 2:
The type was not set somehow. From the terminal we see:
>>> data[0]
array([2, 2, 0, ..., 3, 4, 2])
It should be:
>>> data[0]
array([2, 2, 0, ..., 3, 4, 2], dtype=int16)
The fix is that I added the following line after the for loop:
data=int16(data)
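For what it's worth, here is a sketch of an alternative that sidesteps both errors (assuming the raw file is plain int16 samples, i.e. the 'h' dtype from the code above, and reusing fname, h5fname, Height and Width): map the file with np.memmap, slice out the subimage, and let h5py do the gzip compression.
import numpy as np
import h5py
# memmap views the raw file on disk without reading it all into memory
raw = np.memmap(fname, dtype=np.int16, mode='r', shape=(Height, Width))
sub = raw[1000:5000, 2000:10000]  # hypothetical subimage window
with h5py.File(h5fname, 'w') as h5file:
    # HDF5's built-in gzip filter; the resulting file size indicates compressibility
    h5file.create_dataset('data', data=sub, compression='gzip', compression_opts=4)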
