Dataframe memoryview - python

Is there any way to share the memory/data contained in a dataframe with another item? I naively tried this using memoryview, but I think it's quite a bit more complex than this (or possibly not supported at all):
>>> import pandas as pd
>>> df=pd.DataFrame([{'Name':'Brad'}])
>>> v=memoryview(df)
TypeError: cannot make memory view because object does not have the buffer interface
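A DataFrame itself does not implement the buffer protocol, but the NumPy array behind a numeric column does, so one possible direction is to take the memoryview of the underlying array instead. A minimal sketch (assuming a numeric column; the column name 'x' is just for illustration, and object/string columns like 'Name' above cannot be exported as a buffer):
import pandas as pd

df = pd.DataFrame({"x": [1, 2, 3]})
arr = df["x"].to_numpy()   # typically a view onto the DataFrame's data
v = memoryview(arr)        # works: NumPy arrays implement the buffer interface
print(v.shape, v.nbytes)   # (3,) and 24 bytes if the dtype is int64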

Related

Is the best way to test NaN and NaT in Numpy actually by using Pandas isnull()?

Working with a small csv dataset that I imported using Pandas read_csv(), some of the values that were missing came out as Numpy NaN and some of the missing datetimes came in as NaT. Those aren't the actual data types of the columns, but Pandas doesn't know that - it just infers them that way.
To test for both of these cases it seems like the best method is actually to use Pandas' isnull() function - my question is why does Numpy not have this functionality built in? Am I missing something, or is using Pandas the best way to test Numpy types? Numpy's built-in isnan() function doesn't seem to be the way to do it.
A bit more context:
I get a typing warning - Module is not callable - when type checking with Numpy's isnan(), and digging deeper, the error it gives me at runtime is TypeError: ufunc 'isnan' not supported for the input types, and the inputs could not be safely coerced to any supported types according to the casting rule ''safe''. So it seems to be complaining that I'm passing in an unsupported type.
To reproduce this: pass in a string to np.isnan() and it will throw the error. e.g. np.isnan('test') will throw that error.
Why isn't passing in different data types the expected way to use this method? Why is the default behavior to throw an error? And why does the best way to check a Numpy data type seem to be Pandas' isnull() when the datatype is actually a Numpy data type?
Thanks all
My code:
import pandas as pd
import numpy as np
from typing import Dict, Any

def a(my_obj: Dict[str, Any]):
    for key, value in my_obj.items():
        # this will throw an error if you pass in a string
        if np.isnan(value):
            my_obj[key] = None
        # this one gives no error and actually works
        if pd.isnull(value):
            my_obj[key] = None
    return my_obj

# this is true only if you comment out the `np.isnan()` lines
assert a({"working": "123", "typing_not_working": np.nan}) == {"working": "123", "typing_not_working": None}
np.nan is a float, and np.isnan() tests element-wise on numeric (float-like) input; it does not work on strings. Passing a string makes NumPy try to coerce the input to a supported numeric type, which fails, hence:
TypeError: ufunc 'isnan' not supported for the input types, and the inputs could not be safely coerced
pd.isnull() (also available as pd.isna()) accepts arbitrary Python objects, including strings, NaN and NaT, which is why it is the more convenient check here.
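A small sketch of the difference:
import numpy as np
import pandas as pd

# np.isnan only accepts float-like input
np.isnan(np.nan)       # True
# np.isnan('test')     # raises TypeError: ufunc 'isnan' not supported ...

# pd.isnull / pd.isna accepts arbitrary objects
pd.isnull('test')      # False
pd.isnull(np.nan)      # True
pd.isnull(pd.NaT)      # True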

Why does dask throw an error when setting a String column as an index?

I'm reading a large CSV with dask, setting the dtypes as string and then setting it as an index:
dataframe = dd.read_csv(file_path, dtype={"colName": "string"}, blocksize=100e6)
dataframe.set_index("colName")
and it throws the following error:
TypeError: Cannot interpret 'StringDtype' as a data type
Why does this happen? How can I solve it?
As stated in the bug report here for an unrelated issue: https://github.com/dask/dask/issues/7206#issuecomment-797221227
When constructing the dask Array's meta object, we're currently assuming the underlying array type is a NumPy array, when in this case, it's actually going to be a pandas StringArray. But unlike pandas, NumPy doesn't know how to handle a StringDtype.
Currently, changing the column type from string to object solves the issue, but it's unclear if this is a bug or expected behavior:
dataframe = dd.read_csv(file_path, dtype={"colName": "object"}, blocksize=100e6)
dataframe.set_index("colName")

Reading strings with pd.read_csv() and converters always creates objects, not strings

I use pandas pd.read_csv() to read a CSV and process a column of strings with a converter function while reading. I get 'object' as the data type, but 'string' would be much more space efficient.
Can I somehow convince pd.read_csv() to make the column of type 'string' from the beginning? I know how to convert later, but this may become a memory issue, the dataset is large.
f = lambda x: "/".join(x.split('/')[1:5])
pd.read_csv(..., converters={'path': f}, ...)
I use pandas 1.0.3 and python 3.8.2
It would be even better if I could create a category type (of strings) from the beginning ...
thank you,
Heiner
Pandas maps Python strings to the object dtype, which is why you are getting object as the dtype.
The usual mapping of Python types to pandas dtypes is: str or mixed -> object, int -> int64, float -> float64, bool -> bool, datetime -> datetime64[ns].
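One possible alternative (a sketch, assuming pandas >= 1.0 and a column named 'path'): skip the converter, request the extension "string" dtype at read time, and apply the transformation vectorised afterwards; a categorical of strings can be even more compact when values repeat.
import pandas as pd

# sketch: read the column with the extension "string" dtype ...
df = pd.read_csv("data.csv", dtype={"path": "string"})
# ... then apply the same transformation as the converter after reading
df["path"] = df["path"].str.split("/").str[1:5].str.join("/").astype("string")
# optionally store as a category of strings
df["path"] = df["path"].astype("category")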

How to convert all the memoryview columns to bytes columns in a Pandas dataframe?

I'm retrieving a large amount of data from PostgreSQL with:
it = pandas.read_sql_table(table, DB_CONN, chunksize=1000)
But Pandas uses the psycopg2 adapter for PostgreSQL, which returns a memoryview instead of bytes for historical reasons. To my knowledge, there is no option to make psycopg2 return bytes instead of a memoryview, so I'm stuck with this.
Now, the library I'm giving the Pandas dataframe to is written in C and doesn't accept memoryview and can only handle bytes, so I'd need a way to convert all the memoryview columns to bytes.
I tried to do this:
dataframe[column_name].astype(bytes)
but it doesn't work for memoryview -> bytes, apparently:
*** ValueError: setting an array element with a sequence
I also tried something like this:
dataframe.select_dtypes(include=[memoryview]).apply(bytes)
But it doesn't return any columns.
So does anyone know how I can have an efficient way of converting all the memoryview columns of an arbitrary pandas dataframe to bytes?
So, apparently when we use a memoryview, Pandas isn't able to recognize that datatype and just stores "object", so I ended up doing something like this:
def dataframe_memoryview_to_bytes(dataframe):
    for col in dataframe.columns:
        if type(dataframe[col][0]) == memoryview:
            dataframe[col] = dataframe[col].apply(bytes)
    return dataframe
It's really not ideal, and probably not very fast, but it seems to work reasonably well.
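A slightly more defensive variant (still a sketch): only look at object columns and at the first non-null value, so empty columns or a leading NULL don't break the type check.
def dataframe_memoryview_to_bytes(dataframe):
    for col in dataframe.select_dtypes(include="object").columns:
        non_null = dataframe[col].dropna()
        if len(non_null) and isinstance(non_null.iloc[0], memoryview):
            # keep non-memoryview values (e.g. None) untouched
            dataframe[col] = dataframe[col].apply(
                lambda v: bytes(v) if isinstance(v, memoryview) else v
            )
    return dataframe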

Arff Loader : AttributeError: 'dict' object has no attribute 'data'

I am trying to load a .arff file into a numpy array using liac-arff library. (https://github.com/renatopp/liac-arff)
This is my code.
import arff, numpy as np
dataset = arff.load(open('mydataset.arff', 'rb'))
data = np.array(dataset.data)
when executing, I am getting the error.
ArffLoader.py", line 8, in <module>
data = np.array(dataset.data)
AttributeError: 'dict' object has no attribute 'data'
I have seen similar threads, such as Smartsheet Data Tracker: AttributeError: 'dict' object has no attribute 'append'. I am new to Python and am not able to resolve this issue. How can I fix it?
Short version
dataset is a dict. For a dict, you access the values using the python indexing notation, dataset[key], where key could be a string, integer, float, tuple, or any other immutable data type (it is a bit more complicated than that, more below if you are interested).
In your case, the key is in the form of a string. To access it, you need to give the string you want as an index, like so:
import arff
import numpy as np
dataset = arff.load(open('mydataset.arff', 'rb'))
data = np.array(dataset['data'])
(you also shouldn't put the imports on the same line, although this is just a readability issue)
More detailed explanation
dataset is a dict, which in some languages is called a map or hashtable. In a dict, you access values in a similar way to how you index into a list or array, except that the "index" can be any data type that is "hashable" (ideally, a unique identifier for each possible value). This "index" is called a "key". In practice, at least for built-in types and most major packages, only immutable data types are hashable, but there is no actual rule that requires this to be the case.
Do you come from MATLAB? If so, then you are probably trying to use MATLAB's struct access technique. You could think of a dict as a much faster, more flexible struct, but the syntax for accessing values is different.
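A tiny illustration of the difference (a hand-built dict here; liac-arff's load() returns a dict with keys such as 'data' and 'attributes'):
dataset = {'data': [[1.0, 2.0]], 'attributes': [('a', 'NUMERIC'), ('b', 'NUMERIC')]}
print(dataset['data'])    # dict-style key access: works
# print(dataset.data)     # attribute access (MATLAB struct style): AttributeError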
It's easy to load arff data into Python using scipy.
from scipy.io import arff
import pandas as pd
data = arff.loadarff('dataset.arff')
df = pd.DataFrame(data[0])
df.head()
