How to Convert Dask DataFrame Into List of Dictionaries? - python

I need to convert a dask dataframe into a list of dictionaries as the response for an API endpoint. I know I can convert the dask dataframe to pandas and then convert that to a dictionary, but it would be better to map each partition to a list of dicts and then concatenate.
What I tried:
import dask.dataframe as dd

df = dd.read_csv(path, usecols=cols)
dd.compute(df.to_dict(orient='records'))
Error I'm getting:
AttributeError: 'DataFrame' object has no attribute 'to_dict'

You can do it as follows:
import pandas as pd
import dask.bag as db

bag = db.from_delayed(
    df.map_partitions(pd.DataFrame.to_dict, orient='records').to_delayed()
)
which gives you a bag that you can compute (if it fits in memory) or otherwise manipulate.
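For example, to materialize everything as one list of dicts (assuming it fits in memory):

records = bag.compute()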
Note that the to_delayed/from_delayed round-trip should not be necessary; there is also a to_bag method, but it doesn't seem to do the right thing here.
Also, you are not really getting much from the dataframe model here, so you may want to start with db.read_text and the built-in csv module.
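A minimal sketch of that approach, where path and the column names stand in for the question's values, and assuming the header line is filtered out separately:

import csv
import dask.bag as db

cols = ['a', 'b']  # hypothetical column names

records = (
    db.read_text(path)  # one element per line of text
      .map(lambda line: next(csv.DictReader([line], fieldnames=cols)))
)
records.take(2)  # peek at a couple of parsed dicts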

Try this:
data = list(df.map_partitions(lambda part: part.to_dict(orient="records")))
It will return a list of dictionaries in which each row of the dataframe becomes one dictionary.

Kunal Bafna's answer is the easiest to implement and has fewer dependencies.
data = list(df.map_partitions(lambda part: part.to_dict(orient="records")))

Related

Python convert list of large json objects to DF

I want to convert a list of objects to a pandas DataFrame. The objects are large and complex; a sample one can be found here.
The output should be a DataFrame with 3 columns: info, releases, URLs, as per the JSON object linked above. I've tried pd.DataFrame and from_records, but I keep getting errors. Can anyone suggest a fix?
Have you tried the pd.read_json() function?
Here's a link to the documentation:
https://pandas.pydata.org/docs/reference/api/pandas.io.json.read_json.html
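As a minimal sketch, assuming the objects are already in a JSON array on disk (the file name and column keys are placeholders, since the linked sample isn't shown here):

import json
import pandas as pd

with open('objects.json') as f:  # hypothetical file containing a JSON array
    objects = json.load(f)

# Keep only the three keys of interest; other keys are dropped.
df = pd.DataFrame(objects, columns=['info', 'releases', 'urls'])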

Pandas Dataframe to Apache Beam PCollection conversion problem

I'm trying to convert a pandas DataFrame to a PCollection from Apache Beam.
Unfortunately, when I use the to_pcollection() function, I get the following error:
AttributeError: 'DataFrame' object has no attribute '_expr'
Does anyone know how to solve it?
I'm using pandas=1.1.4, beam=2.25.0 and Python 3.6.9.
to_pcollection was only ever intended to apply to Beam's deferred DataFrames, but looking at this, it makes sense that it should also work on a plain pandas DataFrame, and it isn't obvious how to do the conversion manually. https://github.com/apache/beam/pull/14170 should fix this.
I get this problem when I use a "native" pandas DataFrame instead of a dataframe created by to_dataframe within Beam. I suspect that the dataframe created by Beam wraps or subclasses a pandas DataFrame with new attributes (like _expr) that the native pandas DataFrame doesn't have.
The real answer involves knowing how to use apache_beam.dataframe.convert.to_dataframe, but I can't figure out how to set the proxy object correctly (I get Singleton errors when I later try to use to_pcollection). Since I can't get the "right" way to work in 2.25.0 (I'm new to Beam and pandas, and I don't know how proxy objects work, so take all this with a grain of salt), I use this workaround:
import apache_beam as beam
import pandas as pd

class SomeDoFn(beam.DoFn):
    def process(self, pair):  # pair is a key/value tuple
        df = pd.DataFrame(pair[1])  # just the array of values
        ## do something with the dataframe
        ...
        records = df.to_dict('records')
        # return tuples with the same shape as the one we received
        return [(rec["key"], rec) for rec in records]
which I invoke with something like this:
rows = (
    pcoll
    | beam.ParDo(SomeDoFn())
)
I hope others will give you a better answer than this workaround.
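For reference, a minimal sketch of the intended round-trip through Beam's deferred dataframes (untested against 2.25.0, and the schema'd beam.Row input is an assumption; newer releases include the fix referenced in the other answer):

import apache_beam as beam
from apache_beam.dataframe.convert import to_dataframe, to_pcollection

with beam.Pipeline() as p:
    pcoll = p | beam.Create([beam.Row(key='a', value=1),
                             beam.Row(key='b', value=2)])
    df = to_dataframe(pcoll)     # deferred Beam dataframe (has _expr)
    result = to_pcollection(df)  # back to a PCollection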

Problem when storing dict into pandas DataFrame

Recently in my project, I need to store a dictionary in a pandas DataFrame with the code
self.device_info.loc[i,'interference'] = [temp_dict]
device_info is a pandas DataFrame. temp_dict is a dictionary, and I want it stored as a single element in the DataFrame for future use. The square brackets are added to avoid an error during assignment.
I just found today that with pandas version 0.22.0, this code packs the dictionary into a list and stores that list in the DataFrame. However, in version 0.24.2, the same code stores the dictionary directly.
For example, say i=0. After executing the code with pandas 0.22.0,
type(self.device_info.loc[0,'interference'])
returns list, while with pandas 0.24.2 the same code returns dict. I need consistent behavior: a dictionary should always be stored.
I am currently working on two PCs, one at home and one at my office, and I cannot update the older pandas version on my office PC. I would much appreciate it if anyone could help me figure out why this happens.
Pandas has a from_dict method, with many options, that takes a dict as input and returns a DataFrame.
You can choose to infer the types or to force them (to str, for example).
Manipulating and appending DataFrames is then much easier, as you will no longer have dict objects sitting in a row or column.
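A minimal sketch of that suggestion (the dict contents are hypothetical stand-ins for temp_dict):

import pandas as pd

temp_dict = {'device_1': 0.3, 'device_2': 0.7}  # hypothetical contents

# Expand the dict into its own DataFrame instead of storing the dict
# object in a single cell; this behaves the same across pandas versions.
interference = pd.DataFrame.from_dict(temp_dict, orient='index')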

What is the Dask equivalent of the Pandas .filter() attribute?

I am trying to make sub-DataFrames from a larger DataFrame in Dask. I realize that a lot of the tools found in pandas for manipulating DataFrames are present in Dask; however, the devs are very transparent about what is not.
One such tool is the df.filter() attribute for DataFrames. I have code that looks like this:
comp2phys = df.filter(FOI[1:], axis=1)
where df is a DataFrame and FOI is a list containing the "fieldnames of interest".
I get this error when I try to filter the dataframe (as I explained above, it is not a tool in Dask)
AttributeError: 'DataFrame' object has no attribute 'filter'
Is there a tool in Dask that will allow me to do this?
Thanks!
EDIT
As @ayhan pointed out, this is equivalent to df[FOI[1:]].
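A quick sketch of that selection in Dask (the path and fieldnames are placeholders):

import dask.dataframe as dd

FOI = ['time', 'x', 'y', 'z']   # hypothetical fieldnames of interest
df = dd.read_csv('data-*.csv')  # hypothetical input files

comp2phys = df[FOI[1:]]         # select every field of interest but the first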

Arff Loader: AttributeError: 'dict' object has no attribute 'data'

I am trying to load a .arff file into a numpy array using the liac-arff library (https://github.com/renatopp/liac-arff).
This is my code.
import arff, numpy as np
dataset = arff.load(open('mydataset.arff', 'rb'))
data = np.array(dataset.data)
when executing, I am getting the error.
ArffLoader.py", line 8, in <module>
data = np.array(dataset.data)
AttributeError: 'dict' object has no attribute 'data'
I have seen similar threads, e.g. Smartsheet Data Tracker: AttributeError: 'dict' object has no attribute 'append'. I am new to Python and am not able to resolve this issue. How can I fix this?
Short version
dataset is a dict. For a dict, you access values using Python's indexing notation, dataset[key], where key can be a string, integer, float, tuple, or any other immutable data type (it is a bit more complicated than that; more below if you are interested).
In your case, the key is a string. To access the value, give the string you want as the index, like so:
import arff
import numpy as np
dataset = arff.load(open('mydataset.arff', 'rb'))
data = np.array(dataset['data'])
(you also shouldn't put the imports on the same line, although this is just a readability issue)
More detailed explanation
dataset is a dict, which in some languages is called a map or hashtable. In a dict, you access values in a similar way to how you index a list or array, except that the "index" can be any data type that is "hashable" (ideally, a unique identifier for each possible value). This "index" is called a "key". In practice, at least for built-in types and most major packages, only immutable data types are hashable, but there is no actual rule that requires this to be the case.
Do you come from MATLAB? If so, then you are probably trying to use MATLAB's struct access syntax. You can think of a dict as a much faster, more flexible struct, but the syntax for accessing values is different.
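A tiny illustration of the difference (the contents are made up):

d = {'data': [1, 2, 3]}

d['data']   # key lookup: returns [1, 2, 3]
# d.data    # raises AttributeError: 'dict' object has no attribute 'data'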
It's easy to load arff data into Python using scipy:
from scipy.io import arff
import pandas as pd

data = arff.loadarff('dataset.arff')  # returns a (records, metadata) tuple
df = pd.DataFrame(data[0])
df.head()
