I am trying to load a .arff file into a NumPy array using the liac-arff library (https://github.com/renatopp/liac-arff).
This is my code:
import arff, numpy as np
dataset = arff.load(open('mydataset.arff', 'rb'))
data = np.array(dataset.data)
When executing, I get the following error:
ArffLoader.py", line 8, in <module>
data = np.array(dataset.data)
AttributeError: 'dict' object has no attribute 'data'
I have seen similar threads, such as Smartsheet Data Tracker: AttributeError: 'dict' object has no attribute 'append'. I am new to Python and am not able to resolve this issue. How can I fix this?
Short version
dataset is a dict. For a dict, you access the values using the Python indexing notation, dataset[key], where key could be a string, integer, float, tuple, or any other immutable data type (it is a bit more complicated than that; more below if you are interested).
In your case, the key is a string. To access the value, you give that string as the index, like so:
import arff
import numpy as np
dataset = arff.load(open('mydataset.arff', 'rb'))
data = np.array(dataset['data'])
(You also shouldn't put the imports on the same line, although this is only a readability issue.)
More detailed explanation
dataset is a dict, which in some languages is called a map or hashtable. In a dict, you access values in a similar way to how you index a list or array, except that the "index" can be any data type that is "hashable" (ideally, a unique identifier for each possible value). This "index" is called a "key". In practice, at least for built-in types and most major packages, only immutable data types are hashable, but there is no actual rule requiring this to be the case.
Do you come from MATLAB? If so, then you are probably trying to use MATLAB's struct access technique. You can think of a dict as a much faster, more flexible struct, but the syntax for accessing values is different.
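As a quick illustration (a minimal sketch; the keys and values below are made up, only shaped like what liac-arff returns):
# A dict maps hashable keys to values; you access them with square brackets,
# not with attribute (dot) syntax.
dataset = {"data": [[1.0, 2.0], [3.0, 4.0]], "attributes": [("x", "REAL")]}

print(dataset["data"])    # [[1.0, 2.0], [3.0, 4.0]] -- indexing by key works
# print(dataset.data)     # would raise the AttributeError from the question

# Keys can be any hashable type, not just strings:
lookup = {("row", 0): 1.5, 42: "answer"}
print(lookup[("row", 0)], lookup[42])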
It's easy to load ARFF data into Python using scipy:
from scipy.io import arff
import pandas as pd
data = arff.loadarff('dataset.arff')
df = pd.DataFrame(data[0])
df.head()
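One detail worth knowing with this route (a small sketch; 'dataset.arff' is a placeholder path): loadarff returns a (data, meta) tuple, and nominal attributes come back as byte strings, so they usually need decoding before ordinary string operations behave as expected.
from scipy.io import arff
import pandas as pd

data, meta = arff.loadarff('dataset.arff')   # structured array + metadata
df = pd.DataFrame(data)

# Nominal (string) attributes are loaded as bytes; decode them if you want
# to compare or group on them as regular str values.
for col in df.select_dtypes([object]).columns:
    df[col] = df[col].str.decode('utf-8')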
Related
More of a conceptual question.
When I import files into Python without specifying the data types -- just straight up df = pd.read_csv("blah.csv") or df = pd.read_excel("blah.xls") -- pandas infers the data types of the columns.
No issues here.
However, sometimes when I am working with one of the columns, say an object column, and I know for certain that pandas guessed correctly, my .str functions don't work as intended, or I get an error. Yet if I specify the column data type as str after import, everything works as intended.
I also noticed that if I cast one of the object columns to str after import, the size of the DataFrame increases. So I am guessing pandas' object dtype is different from a "string" dtype? What causes this discrepancy?
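For concreteness, here is a small sketch of the kind of thing I mean (the column name and values are made up):
import pandas as pd

# An object column can hold non-string values even when most entries look
# like strings; that is when .str methods start returning NaN or raising.
df = pd.DataFrame({"code": ["A1", "B2", 3]})   # mixed str/int -> dtype object
print(df["code"].dtype)                        # object
print(df["code"].str.lower())                  # the int entry becomes NaN

# Casting makes every entry an actual str, so .str behaves as expected;
# measured deeply, the converted column can also take more memory.
df["code_str"] = df["code"].astype(str)
print(df["code_str"].str.lower())
print(df.memory_usage(deep=True))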
I'm reading a large CSV with dask, setting the dtype of a column to string and then setting that column as the index:
dataframe = dd.read_csv(file_path, dtype={"colName": "string"}, blocksize=100e6)
dataframe.set_index("colName")
and it throws the following error:
TypeError: Cannot interpret 'StringDtype' as a data type
Why does this happen? How can I solve it?
As stated in this comment on an otherwise unrelated dask issue: https://github.com/dask/dask/issues/7206#issuecomment-797221227
When constructing the dask Array's meta object, we're currently assuming the underlying array type is a NumPy array, when in this case, it's actually going to be a pandas StringArray. But unlike pandas, NumPy doesn't know how to handle a StringDtype.
Currently, changing the column type from string to object works around the issue, but it's unclear whether this is a bug or expected behavior:
dataframe = dd.read_csv(file_path, dtype={"colName": "object"}, blocksize=100e6)
dataframe.set_index("colName")
Is there any way to share the memory/data contained in a DataFrame with another object? I naively tried this using memoryview, but I think it's quite a bit more complex than that (or possibly not supported at all):
>>> import pandas as pd
>>> df=pd.DataFrame([{'Name':'Brad'}])
>>> v=memoryview(df)
TypeError: cannot make memory view because object does not have the buffer interface
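The error itself is accurate: a DataFrame does not implement the buffer protocol. What does work, at least for numeric columns, is taking a memoryview of the DataFrame's underlying NumPy array; whether that array is a zero-copy view of the DataFrame's memory depends on the dtypes and pandas version, so treat this as a sketch rather than a guarantee:
import pandas as pd

df_num = pd.DataFrame({"x": [1, 2, 3]})

arr = df_num.to_numpy()     # the values as a NumPy array (may or may not copy)
view = memoryview(arr)      # NumPy arrays do expose the buffer interface
print(view.format, view.shape)

# Object columns (such as the 'Name' column above) fail here too, because
# NumPy cannot put arbitrary Python objects into a buffer:
# memoryview(pd.DataFrame([{'Name': 'Brad'}]).to_numpy())  # raises ValueError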
I'm trying to convert a pandas DataFrame to a PCollection from Apache Beam.
Unfortunately, when I use the to_pcollection() function, I get the following error:
AttributeError: 'DataFrame' object has no attribute '_expr'
Does anyone know how to solve it?
I'm using pandas=1.1.4, beam=2.25.0 and Python 3.6.9.
to_pcollection was only ever intended to apply to Beam's deferred DataFrames, but looking at this, it makes sense that it should work, and it isn't obvious how to do it manually. https://github.com/apache/beam/pull/14170 should fix this.
I get this problem when I use a "native" Pandas dataframe instead of a dataframe created by to_dataframe within Beam. I suspect that the dataframe created by Beam wraps or subclasses a Pandas dataframe with new attributes (like _expr) that the native Pandas dataframe doesn't have.
The real answer involves knowing how to use apache_beam.dataframe.convert.to_dataframe, but I can't figure out how to set the proxy object correctly (I get Singleton errors when I later try to use to_pcollection). So since I can't get the "right" way to work in 2.25.0 (I'm new to Beam and pandas, and I don't know how proxy objects work, so take all this with a grain of salt), I use this workaround:
import apache_beam as beam
import pandas as pd

class SomeDoFn(beam.DoFn):
    def process(self, pair):  # pair is a key/value tuple
        df = pd.DataFrame(pair[1])  # just the array of values
        # ... do something with the dataframe ...
        records = df.to_dict('records')
        # return tuples with the same shape as the one we received
        return [(rec["key"], rec) for rec in records]
which I invoke with something like this:
rows = (
    pcoll
    | beam.ParDo(SomeDoFn())
)
I hope others will give you a better answer than this workaround.
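For reference, here is a rough sketch of the deferred-dataframe round trip that apache_beam.dataframe.convert is meant to support; the Record schema and values are made up, and I haven't verified this against 2.25.0, so treat it as a sketch of the intended pattern rather than a tested fix:
import typing
import apache_beam as beam
from apache_beam.dataframe.convert import to_dataframe, to_pcollection

class Record(typing.NamedTuple):   # schema for the PCollection elements
    key: str
    value: int

with beam.Pipeline() as p:
    pcoll = p | beam.Create([Record("a", 1), Record("b", 2)]).with_output_types(Record)

    deferred = to_dataframe(pcoll)   # Beam's deferred DataFrame, not a pandas one
    # ... deferred pandas-style operations would go here ...

    back = to_pcollection(deferred)  # only works on deferred DataFrames
    back | beam.Map(print)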
I need to convert a dask dataframe into a list of dictionaries as the response for an API endpoint. I know I can convert the dask dataframe to pandas and then from there convert it to a list of dictionaries, but it would be better to map each partition to a list of dicts and then concatenate.
What I tried:
df = dd.read_csv(path, usecols=cols)
dd.compute(df.to_dict(orient='records'))
Error I'm getting:
AttributeError: 'DataFrame' object has no attribute 'to_dict'
You can do it as follows:
import dask.bag as db
bag = db.from_delayed(
    df.map_partitions(pd.DataFrame.to_dict, orient='records').to_delayed()
)
This gives you a bag, which you can compute (if it fits in memory) or otherwise manipulate.
Note that to_delayed/from_delayed should not be necessary; there is also a to_bag method, but it doesn't seem to do the right thing.
Also, you are not really getting much from the dataframe model here; you may want to start with db.read_text and the builtin csv module, as sketched below.
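If you do go the bag route, here is a rough sketch of what that could look like (the glob path and column names are placeholders, and it assumes a plain comma-separated file with a header row):
import csv
import dask.bag as db

columns = ["id", "name", "value"]   # hypothetical column names

records = (
    db.read_text("data/*.csv")      # one bag element per text line
    .filter(lambda line: not line.startswith("id,"))   # drop header lines
    .map(lambda line: dict(zip(columns, next(csv.reader([line])))))
)

# records.take(5) previews a few dicts; records.compute() builds the full list.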
Try this:
data = list(df.map_partitions(lambda x: x.to_dict(orient="records")))
It will return a list of dictionaries, with each row converted to a dictionary.
Kunal Bafna's answer is the easiest to implement and has the fewest dependencies:
data = list(df.map_partitions(lambda x: x.to_dict(orient="records")))