How to Read nrows From Pandas HDF Storage? - python

What am I trying to do?
pd.read_csv(... nrows=###) can read the top nrows of a file. I'd like to do the same while using pd.read_hdf(...).
What is the problem?
I am confused by the documentation. start and stop look like what I need, but when I try them, a ValueError is raised. The second thing I tried was passing nrows=10, thinking it might be an allowable **kwargs; no error is thrown, but the full dataset is returned instead of just 10 rows.
Question: How does one correctly read a smaller subset of rows from an HDF file? (edit: without having to read the whole thing into memory first!)
Below is my interactive session:
>>> import pandas as pd
>>> df = pd.read_hdf('storage.h5')
Traceback (most recent call last):
File "<pyshell#1>", line 1, in <module>
df = pd.read_hdf('storage.h5')
File "C:\Python35\lib\site-packages\pandas\io\pytables.py", line 367, in read_hdf
raise ValueError('key must be provided when HDF5 file '
ValueError: key must be provided when HDF5 file contains multiple datasets.
>>> import h5py
>>> f = h5py.File('storage.h5', mode='r')
>>> list(f.keys())[0]
'table'
>>> f.close()
>>> df = pd.read_hdf('storage.h5', key='table', start=0, stop=10)
Traceback (most recent call last):
File "<pyshell#6>", line 1, in <module>
df = pd.read_hdf('storage.h5', key='table', start=0, stop=10)
File "C:\Python35\lib\site-packages\pandas\io\pytables.py", line 370, in read_hdf
return store.select(key, auto_close=auto_close, **kwargs)
File "C:\Python35\lib\site-packages\pandas\io\pytables.py", line 740, in select
return it.get_result()
File "C:\Python35\lib\site-packages\pandas\io\pytables.py", line 1447, in get_result
results = self.func(self.start, self.stop, where)
File "C:\Python35\lib\site-packages\pandas\io\pytables.py", line 733, in func
columns=columns, **kwargs)
File "C:\Python35\lib\site-packages\pandas\io\pytables.py", line 2890, in read
return self.obj_type(BlockManager(blocks, axes))
File "C:\Python35\lib\site-packages\pandas\core\internals.py", line 2795, in __init__
self._verify_integrity()
File "C:\Python35\lib\site-packages\pandas\core\internals.py", line 3006, in _verify_integrity
construction_error(tot_items, block.shape[1:], self.axes)
File "C:\Python35\lib\site-packages\pandas\core\internals.py", line 4280, in construction_error
passed, implied))
ValueError: Shape of passed values is (614, 593430), indices imply (614, 10)
>>> df = pd.read_hdf('storage.h5', key='table', nrows=10)
>>> df.shape
(593430, 614)
Edit:
I just attempted to use where:
mylist = list(range(30))
df = pd.read_hdf('storage.h5', key='table', where='index=mylist')
I received a TypeError indicating a Fixed format store (the default format of df.to_hdf(...)):
TypeError: cannot pass a where specification when reading from a
Fixed format store. this store must be selected in its entirety
Does this mean I can't select a subset of rows if the format is Fixed format?

I ran into the same problem. I am pretty certain by now that https://github.com/pandas-dev/pandas/issues/11188 tracks this very problem. It is a ticket from 2015 and it contains a repro. Jeff Reback suggested that this is actually a bug, and he even pointed us towards a solution back in 2015. It's just that nobody has built that solution yet. I might give it a try.

This now seems to work, at least with pandas 1.0.1. Just provide the start and stop arguments:
df = pd.read_hdf('test.h5', '/floats/trajectories', start=0, stop=5)
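If you are on an older pandas, or you also want where-style queries, one workaround is to rewrite the store once in table format instead of the default fixed format, since only table stores support partial reads. A minimal sketch, assuming the data fits in memory for the one-time conversion and the index is a plain integer index:
import pandas as pd

# One-time conversion: a fixed-format store can only be read in its entirety.
df = pd.read_hdf('storage.h5', key='table')
df.to_hdf('storage_table.h5', key='table', format='table')

# A table-format store supports partial reads and where queries.
first_rows = pd.read_hdf('storage_table.h5', key='table', start=0, stop=10)
subset = pd.read_hdf('storage_table.h5', key='table', where='index < 30')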

Related

Using `xarray.apply_ufunc` with `np.linalg.pinv` returns an error with `dask.array`

I get an error when running the following MWE:
import xarray as xr
import numpy as np
from numpy.linalg import pinv
import dask
data = np.random.randn(4, 4, 3, 2)
da = xr.DataArray(data=data, dims=("x", "y", "i", "j"))
da = da.chunk(x=1, y=1)
da_inv = xr.apply_ufunc(
    pinv,
    da,
    input_core_dims=[["i", "j"]],
    output_core_dims=[["i", "j"]],
    exclude_dims=set(("i", "j")),
    dask="parallelized",
)
This throws me this error:
Traceback (most recent call last):
File "/glade/scratch/tomasc/tracer_inversion2/mwe.py", line 14, in <module>
da_inv = xr.apply_ufunc(pinv, da,
File "/glade/u/home/tomasc/miniconda3/envs/py310/lib/python3.10/site-packages/xarray/core/computation.py", line 1204, in apply_ufunc
return apply_dataarray_vfunc(
File "/glade/u/home/tomasc/miniconda3/envs/py310/lib/python3.10/site-packages/xarray/core/computation.py", line 315, in apply_dataarray_vfunc
result_var = func(*data_vars)
File "/glade/u/home/tomasc/miniconda3/envs/py310/lib/python3.10/site-packages/xarray/core/computation.py", line 771, in apply_variable_ufunc
result_data = func(*input_data)
File "/glade/u/home/tomasc/miniconda3/envs/py310/lib/python3.10/site-packages/xarray/core/computation.py", line 747, in func
res = da.apply_gufunc(
File "/glade/u/home/tomasc/miniconda3/envs/py310/lib/python3.10/site-packages/dask/array/gufunc.py", line 489, in apply_gufunc
core_output_shape = tuple(core_shapes[d] for d in ocd)
File "/glade/u/home/tomasc/miniconda3/envs/py310/lib/python3.10/site-packages/dask/array/gufunc.py", line 489, in <genexpr>
core_output_shape = tuple(core_shapes[d] for d in ocd)
KeyError: 'dim0'
Even though when using dask.array.map_blocks directly, things seem to work right out of the box:
data_inv = dask.array.map_blocks(pinv, da.data).compute() # works!
What am I missing here?
(Same question answered on the xarray repository here.)
You were almost there; you just needed to add the sizes of the new output dimensions by including the kwarg
dask_gufunc_kwargs={'output_sizes': {'i': 2, 'j': 3}}
It does sort of say this in the docstring for apply_ufunc but it could definitely be clearer!
That's a very unhelpful error, but it's ultimately being thrown because the keys 'i' and 'j' don't exist in the dict of expected sizes of the output (because you didn't provide them).
The actual error message has been improved in xarray version v2023.2.0.
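Putting it together, the corrected call would look something like this (a sketch based on the MWE above; the output sizes follow from pinv turning each 3x2 block into a 2x3 block):
da_inv = xr.apply_ufunc(
    pinv,
    da,
    input_core_dims=[["i", "j"]],
    output_core_dims=[["i", "j"]],
    exclude_dims=set(("i", "j")),
    dask="parallelized",
    dask_gufunc_kwargs={"output_sizes": {"i": 2, "j": 3}},
)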

spatial regression in Python - read matrix from list

I have the following problem. I am following this example about spatial regression in Python:
import numpy
import libpysal
import spreg
import pickle
# Read spatial data
ww = libpysal.io.open(libpysal.examples.get_path("baltim_q.gal"))
w = ww.read()
ww.close()
w_name = "baltim_q.gal"
w.transform = "r"
The example above works, but I would like to read my own spatial matrix, which I currently have as a list of lists. See my approach:
ww = libpysal.io.open(matrix)
But I got this error message:
Traceback (most recent call last):
File "/usr/lib/python3.8/code.py", line 90, in runcode
exec(code, self.locals)
File "<input>", line 1, in <module>
File "/home/vojta/Desktop/INTERNET_HANDEL/ZASILKOVNA/optimal-delivery-branches/venv/lib/python3.8/site-packages/libpysal/io/fileio.py", line 90, in __new__
cls.__registry[cls.getType(dataPath, mode, dataFormat)][mode][0]
File "/home/vojta/Desktop/INTERNET_HANDEL/ZASILKOVNA/optimal-delivery-branches/venv/lib/python3.8/site-packages/libpysal/io/fileio.py", line 105, in getType
ext = os.path.splitext(dataPath)[1]
File "/usr/lib/python3.8/posixpath.py", line 118, in splitext
p = os.fspath(p)
TypeError: expected str, bytes or os.PathLike object, not list
This is what the matrix looks like:
[[0, 2, 1], [2, 0, 4], [1, 4, 0]]
EDIT:
If I try to pass my matrix to GM_Lag like this:
model = spreg.GM_Lag(
    y,
    X,
    w=matrix,
)
I get the following error:
warn("w must be API-compatible pysal weights object")
Traceback (most recent call last):
File "/usr/lib/python3.8/code.py", line 90, in runcode
exec(code, self.locals)
File "<input>", line 2, in <module>
File "/home/vojta/Desktop/INTERNET_HANDEL/ZASILKOVNA/optimal-delivery-branches/venv/lib/python3.8/site-packages/spreg/twosls_sp.py", line 469, in __init__
USER.check_weights(w, y, w_required=True)
File "/home/vojta/Desktop/INTERNET_HANDEL/ZASILKOVNA/optimal-delivery-branches/venv/lib/python3.8/site-packages/spreg/user_output.py", line 444, in check_weights
if w.n != y.shape[0] and time == False:
AttributeError: 'list' object has no attribute 'n'
EDIT 2:
This is how I read the list of lists:
import pickle

with open("weighted_matrix.pkl", "rb") as f:
    matrix = pickle.load(f)
How can I pass a list of lists to spreg.GM_Lag? Thanks.
Why do you want to pass it to the libpysal.io.open method? If I understand this code correctly, you first open a file and then read it (the read method seems to return a list). So in your case, where you already have the matrix, you don't need to open or read any file.
What does matter, though, is what w is supposed to look like after w = ww.read(). If it is a simple matrix, then you can just set w = matrix. If the read method also formats the data in a certain way, you'll need to reproduce that another way. If you could describe the expected behavior of the read method (e.g. what the input file contains and what is returned), it would be useful.
As mentioned, since the data needs to end up as a libpysal.weights object, you must build one yourself. This can supposedly be done with libpysal.weights.W. (I read the doc too quickly.)
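A minimal sketch of that idea, building a W object directly from the dense list of lists (the helper libpysal.weights.full2W, if your version exposes it, may do the same conversion in one call):
import numpy as np
from libpysal.weights import W

matrix = [[0, 2, 1], [2, 0, 4], [1, 4, 0]]
arr = np.asarray(matrix)

# Translate the dense matrix into the neighbors/weights dicts W expects,
# skipping zero entries (no link between those observations).
neighbors = {i: [j for j in range(arr.shape[1]) if arr[i, j] != 0]
             for i in range(arr.shape[0])}
weights = {i: [arr[i, j] for j in range(arr.shape[1]) if arr[i, j] != 0]
           for i in range(arr.shape[0])}

w = W(neighbors, weights)
w.transform = "r"  # row-standardize, as in the original example

model = spreg.GM_Lag(y, X, w=w)  # y and X as in your question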

IndexingError: Too many indexers in a function

I am trying to diagnose the following error: IndexingError: Too many indexers.
The solutions already on Stack Overflow don't resolve my issue.
Here is my simplified code:
import pandas as pd
import numpy as np
def foo(T=None, *c):
    A = df.loc[c][0]
    B = df.loc[c][1]
    C = df.loc[c][2]
    return (10**(A - ((B)/(T + C))))*760
file = r'C:\Users\ayubi\Documents\Python Files\Chemical Engineering Files\T-xy_Data.xlsx'
df = pd.read_excel(file)
df_adj = df.set_index('Component')
print(df_adj)
compOne = df_adj.loc['Benzene']
foo(25,*compOne)
The reason I have the arguments of foo in that order is that, if T is not given, I will still be able to use fsolve.
Here is the error:
runcell(0, 'C:/Users/ayubi/Documents/Python Files/Chemical Engineering Files/test_example.py')
A B C
Component
Benzene 6.87987 1196.76 219.161
Toluene 6.95087 1342.31 219.187
Water 8.07131 1730.63 233.426
Traceback (most recent call last):
File "C:\Users\ayubi\Documents\Python Files\Chemical Engineering Files\test_example.py", line 24, in <module>
foo(25,*compOne)
File "C:\Users\ayubi\Documents\Python Files\Chemical Engineering Files\test_example.py", line 12, in foo
A = df.loc[c][0]
File "C:\Users\ayubi\Anaconda3\lib\site-packages\pandas\core\indexing.py", line 873, in __getitem__
return self._getitem_tuple(key)
File "C:\Users\ayubi\Anaconda3\lib\site-packages\pandas\core\indexing.py", line 1049, in _getitem_tuple
self._has_valid_tuple(tup)
File "C:\Users\ayubi\Anaconda3\lib\site-packages\pandas\core\indexing.py", line 700, in _has_valid_tuple
raise IndexingError("Too many indexers")
IndexingError: Too many indexers
The function should be able to take the argument T, if given, plus compOne, grab the values associated with compOne, and then calculate something.
Thanks for your help in advance.
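For what it's worth, one possible reading of the traceback: unpacking compOne with *compOne makes c a tuple of the three float values (A, B, C), so df.loc[c] is interpreted as three separate indexers on a single-level index, hence "Too many indexers". A sketch of one way around that, passing the row itself instead of its unpacked values (names taken from the question):
def foo(T, comp):
    # comp is the row for one component, e.g. df_adj.loc['Benzene']
    A, B, C = comp['A'], comp['B'], comp['C']
    return (10**(A - (B / (T + C)))) * 760

compOne = df_adj.loc['Benzene']
foo(25, compOne)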

Memory Error: Masking not so big dataframe throwing error

I am getting a memory error when trying to use a mask to select values from a 4M-row table with 3 columns.
When I run df.memory_usage().sum() it returns 173526080, which is only about 0.17 GB, and I have 32 GB of RAM. So it doesn't seem like it should run out of RAM, and there is no previous code consuming lots of RAM.
This method worked for previous versions of code with the same 4M rows.
The code I run is:
x = df[exit_point] > 0
print(df[x].shape)
The error I get is:
File "C:\Users\joaoa\AppData\Local\Programs\Python\Python36-32\lib\site-packages\pandas\core\frame.py", line 2133, in __getitem__
return self._getitem_array(key)
File "C:\Users\joaoa\AppData\Local\Programs\Python\Python36-32\lib\site-packages\pandas\core\frame.py", line 2175, in _getitem_array
return self._take(indexer, axis=0, convert=False)
File "C:\Users\joaoa\AppData\Local\Programs\Python\Python36-32\lib\site-packages\pandas\core\generic.py", line 2143, in _take
self._consolidate_inplace()
File "C:\Users\joaoa\AppData\Local\Programs\Python\Python36-32\lib\site-packages\pandas\core\generic.py", line 3677, in _consolidate_inplace
self._protect_consolidate(f)
File "C:\Users\joaoa\AppData\Local\Programs\Python\Python36-32\lib\site-packages\pandas\core\generic.py", line 3666, in _protect_consolidate
result = f()
File "C:\Users\joaoa\AppData\Local\Programs\Python\Python36-32\lib\site-packages\pandas\core\generic.py", line 3675, in f
self._data = self._data.consolidate()
File "C:\Users\joaoa\AppData\Local\Programs\Python\Python36-32\lib\site-packages\pandas\core\internals.py", line 3826, in consolidate
bm._consolidate_inplace()
File "C:\Users\joaoa\AppData\Local\Programs\Python\Python36-32\lib\site-packages\pandas\core\internals.py", line 3831, in _consolidate_inplace
self.blocks = tuple(_consolidate(self.blocks))
File "C:\Users\joaoa\AppData\Local\Programs\Python\Python36-32\lib\site-packages\pandas\core\internals.py", line 4853, in _consolidate
_can_consolidate=_can_consolidate)
File "C:\Users\joaoa\AppData\Local\Programs\Python\Python36-32\lib\site-packages\pandas\core\internals.py", line 4876, in _merge_blocks
new_values = new_values[argsort]
MemoryError
I am lost on how to start debugging this. Any clues and hints would be very much appreciated.
Maybe this helps:
[1] Use the low_memory=False argument while importing the file. For example:
df = pd.read_csv('filepath', low_memory=False)
[2] Use the dtype argument while importing the file (see the sketch after this list).
[3] If you use Jupyter Notebook: Kernel > Restart & Clear Output.
Hope this helps!
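A sketch of suggestions [1] and [2] combined; the file path, column names and dtypes are hypothetical and only illustrate the idea of declaring the types up front so pandas does not have to guess them and later consolidate mixed-type blocks:
import pandas as pd

df = pd.read_csv(
    'data.csv',                      # hypothetical path
    dtype={'exit_point': 'float32',  # hypothetical columns/dtypes
           'id': 'int32',
           'label': 'category'},
    low_memory=False,
)

mask = df['exit_point'] > 0
print(df[mask].shape)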

python multiprocessing struct.error

I am looping through a set of large files, and using multiprocessing for manipulation/writing. I create an iterable out of my dataframe and pass it to multiprocessing's map function. The processing is fine for the smaller files, but when I hit the larger ones (~10 GB) I get the error:
struct.error: 'i' format requires -2147483648 <= number <= 2147483647
the code:
data = np.array_split(data, 10)
with mp.Pool(processes=5, maxtasksperchild=1) as pool1:
    pool1.map(write_in_parallel, data)
    pool1.close()
    pool1.join()
Based on this answer I thought the problem was that the object I am passing to map is too large. So I tried first splitting the dataframe into 1.5 GB chunks and passing each one to map independently, but I am still receiving the same error.
Full traceback:
Traceback (most recent call last):
File "_FNMA_LLP_dataprep_final.py", line 51, in <module>
write_files()
File "_FNMA_LLP_dataprep_final.py", line 29, in write_files
'.txt')
File "/DATAPREP/appl/FNMA_LLP/code/FNMA_LLP_functions.py", line 116, in write_dynamic_columns_fannie
pool1.map(write_in_parallel, first)
File "/opt/Python364/lib/python3.6/multiprocessing/pool.py", line 266, in map
return self._map_async(func, iterable, mapstar, chunksize).get()
File "/opt/Python364/lib/python3.6/multiprocessing/pool.py", line 644, in get
raise self._value
File "/opt/Python364/lib/python3.6/multiprocessing/pool.py", line 424, in _handle_tasks
put(task)
File "/opt/Python364/lib/python3.6/multiprocessing/connection.py", line 206, in send
self._send_bytes(_ForkingPickler.dumps(obj))
File "/opt/Python364/lib/python3.6/multiprocessing/connection.py", line 393, in _send_bytes
header = struct.pack("!i", n)
struct.error: 'i' format requires -2147483648 <= number <= 2147483647
The answer you mentioned also contained another hint: the data should be loaded by the child function, which in your case is write_in_parallel. What I recommend is to alter your child function in the following way:
def write_in_parallel(path):
    """We'll assume that your data is stored in a csv file."""
    data = pd.read_csv(path)
    ...
Then your "Pool code" should look like this:
with mp.Pool(processes=(mp.cpu_count() - 1)) as pool:
chunks = pool.map(write_in_parallel, ('/path/to/your/data',))
df = pd.concat(chunks)
I hope that will help you.
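Putting the whole pattern together, a minimal sketch (file names and the per-chunk work are hypothetical): the parent writes each chunk to disk and sends only the short path strings through the pipe, so nothing large has to be pickled.
import multiprocessing as mp
import numpy as np
import pandas as pd

def write_in_parallel(path):
    # Each worker loads only its own chunk from disk; only the path string
    # travels through the multiprocessing pipe, avoiding the struct.error.
    chunk = pd.read_csv(path)
    # ... manipulate / write the chunk here ...
    return len(chunk)

if __name__ == '__main__':
    data = pd.read_csv('big_input.csv')  # hypothetical input file
    paths = []
    for i, part in enumerate(np.array_split(data, 10)):
        path = 'chunk_{}.csv'.format(i)
        part.to_csv(path, index=False)
        paths.append(path)
    with mp.Pool(processes=mp.cpu_count() - 1) as pool:
        results = pool.map(write_in_parallel, paths)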
