Pandas Dataframe to Apache Beam PCollection conversion problem - python

I'm trying to convert a pandas DataFrame to a PCollection from Apache Beam.
Unfortunately, when I use the to_pcollection() function, I get the following error:
AttributeError: 'DataFrame' object has no attribute '_expr'
Does anyone know how to solve it?
I'm using pandas=1.1.4, beam=2.25.0 and Python 3.6.9.

to_pcollection was only ever intended to apply to Beam's deferred DataFrames, but looking at this it makes sense that it should work on native pandas DataFrames too, and it isn't obvious how to do the conversion manually. https://github.com/apache/beam/pull/14170 should fix this.
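In the meantime, here is a sketch of how to_pcollection is meant to be used with Beam's deferred DataFrames via apache_beam.dataframe.convert.to_dataframe. It assumes a schema-aware PCollection of beam.Row elements and a Beam version where these conversions are available, so treat it as an illustration rather than a fix for 2.25.0:

import apache_beam as beam
from apache_beam.dataframe.convert import to_dataframe, to_pcollection

with beam.Pipeline() as p:
    # beam.Row gives the PCollection a schema, which to_dataframe requires.
    pcoll = p | beam.Create([beam.Row(key='a', value=1),
                             beam.Row(key='b', value=2)])
    deferred_df = to_dataframe(pcoll)     # Beam's deferred DataFrame, not native pandas
    result = to_pcollection(deferred_df)  # this is the kind of object to_pcollection expects
    result | beam.Map(print)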

I get this problem when I use a "native" Pandas dataframe instead of a dataframe created by to_dataframe within Beam. I suspect that the dataframe created by Beam wraps or subclasses a Pandas dataframe with new attributes (like _expr) that the native Pandas dataframe doesn't have.
The real answer involves knowing how to use apache_beam.dataframe.convert.to_dataframe, but I can't figure out how to set the proxy object correctly (I get Singleton errors when I later try to use to_pcollection). Since I can't get the "right" way to work in 2.25.0 (I'm new to Beam and Pandas, and don't know how proxy objects work, so take all this with a grain of salt), I use this workaround:
import apache_beam as beam
import pandas as pd

class SomeDoFn(beam.DoFn):
    def process(self, pair):  # pair is a key/value tuple
        df = pd.DataFrame(pair[1])  # just the array of values
        ## do something with the dataframe
        ...
        records = df.to_dict('records')
        # return tuples with the same shape as the one we received
        return [(rec["key"], rec) for rec in records]
which I invoke with something like this:
rows = (
    pcoll
    | beam.ParDo(SomeDoFn())
)
I hope others will give you a better answer than this workaround.

Related

Trouble translating from Pandas to PySpark

I'm having a lot of trouble translating a function that worked on a pandas DataFrame to a PySpark UDF. Mainly, PySpark is throwing an error that I don't really understand, because this is my first time using it. First, my dataset does contain some NaNs, which I didn't know would add some complexity to my task. That said, the dataset contains the standard data types, i.e. categories and integers. Finally, I am running my algorithm with pandas' groupby() method, applying it to every row with apply() and a lambda function. I'm told that PySpark supports all of these methods.
Now let me tell you about the algorithm. It's pretty much a counting game that I run on one column, and it is written in vanilla Python. The reason I'm saying this is that it's a bit too long to post. It returns three lists, i.e. arrays, which, from what I understand, PySpark also supports. This is what a super short version of the algo looks like:
def algo(x, col):
    # you will be looking at a specific pandas column --- pd.Series
    x = x[col]
    # LOGIC GOES HERE...
    return list1, list2, list3
I'm running the algorithm using:
data = df.groupby("GROUPBY_THIS").apply(lambda x: algo(x, "COLUMN1"))
And everything is working fine; I'm returning the three lists of the correct length. Now, when I try to run this algorithm using PySpark, I'm confused about whether to use UDFs or pandas UDFs. In addition, I'm getting an error that I can't quite understand. Can someone point me in the correct direction here? Thanks!
Error:
ValueError: Invalid udf: the udf argument must be a pandas_udf of type GROUPED_MAP.
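The error suggests that Spark expects the function passed to groupby().apply() to be a pandas_udf declared with PandasUDFType.GROUPED_MAP. A rough, untested sketch of what that could look like for this question; the output schema, the spark_df name and the wrapper around algo are hypothetical and would need to match your real data:

import pandas as pd
from pyspark.sql.functions import pandas_udf, PandasUDFType
from pyspark.sql.types import (StructType, StructField, StringType,
                               ArrayType, IntegerType)

# Hypothetical output schema: one row per group holding the three lists.
schema = StructType([
    StructField("GROUPBY_THIS", StringType()),
    StructField("list1", ArrayType(IntegerType())),
    StructField("list2", ArrayType(IntegerType())),
    StructField("list3", ArrayType(IntegerType())),
])

@pandas_udf(schema, PandasUDFType.GROUPED_MAP)
def algo_udf(pdf):
    # pdf is a pandas DataFrame containing one group
    list1, list2, list3 = algo(pdf, "COLUMN1")
    return pd.DataFrame({
        "GROUPBY_THIS": [pdf["GROUPBY_THIS"].iloc[0]],
        "list1": [list1],
        "list2": [list2],
        "list3": [list3],
    })

result = spark_df.groupby("GROUPBY_THIS").apply(algo_udf)

On newer Spark versions, groupBy(...).applyInPandas(...) is the preferred spelling of the same idea.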

How to Convert Dask DataFrame Into List of Dictionaries?

I need to convert a dask dataframe into a list of dictionaries as the response for an API endpoint. I know I can convert the dask dataframe to pandas and from there convert it to a list of dictionaries, but it would be better to map each partition to a list of dicts and then concatenate.
What I tried:
df = dd.read_csv(path, usecols=cols)
dd.compute(df.to_dict(orient='records'))
Error I'm getting:
AttributeError: 'DataFrame' object has no attribute 'to_dict'
You can do it as follows:
import dask.bag as db

db.from_delayed(
    df.map_partitions(pd.DataFrame.to_dict, orient='records').to_delayed()
)
which gives you a bag which you could compute (if it fits in memory) or otherwise manipulate.
Note that to_delayed/from_delayed should not be necessary: there is also a to_bag method, but it doesn't seem to do the right thing.
Also, you are not really getting much from the dataframe model here; you may want to start with db.read_text and the built-in csv module.
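For completeness, a minimal end-to-end sketch of the approach above; the CSV path and column names are made up, so adjust them to your data:

import dask.bag as db
import dask.dataframe as dd
import pandas as pd

df = dd.read_csv("data/*.csv", usecols=["id", "value"])  # hypothetical input

bag = db.from_delayed(
    df.map_partitions(pd.DataFrame.to_dict, orient="records").to_delayed()
)
records = bag.compute()  # a list of dicts, if it fits in memory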
Try this:
data=list(df.map_partitions(lambda x:x.to_dict(orient="records")))
It will return a list of dictionaries in which each row is converted to a dictionary.
Kunal Bafna's answer is the easiest to implement and has fewer dependencies.
data=list(df.map_partitions(lambda x:x.to_dict(orient="records")))

What is the Dask equivalent of the Pandas .filter() attribute?

I am trying to make sub-DataFrames from a larger DataFrame in Dask. I realize that a lot of the tools found in Pandas for manipulating DataFrames are present in Dask; however, the devs are very transparent about what is not.
One such tool is the df.filter() attribute for DataFrames. I have code that looks like this:
comp2phys = df.filter(FOI[1:], axis=1)
where df is a DataFrame and FOI is a list containing the "fieldnames of interest".
I get this error when I try to filter the dataframe (as I explained above, it is not a tool in Dask)
AttributeError: 'DataFrame' object has no attribute 'filter'
Is there a tool in Dask that will allow me to do this?
Thanks!
EDIT
As @ayhan pointed out, this is equivalent to df[FOI[1:]].
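In other words, plain column selection does the job in Dask. A small sketch, with a made-up file pattern and field names:

import dask.dataframe as dd

FOI = ["id", "field_a", "field_b", "field_c"]  # hypothetical fieldnames of interest
df = dd.read_csv("data/*.csv")

comp2phys = df[FOI[1:]]  # same result as pandas' df.filter(FOI[1:], axis=1)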

ValueError: No axis named 1 for object type <class 'pandas.core.series.Series'>

I'm new to programming. I'm trying to use scipy minimize; I've had several issues and gotten through most of them.
Right now this is the code, but I don't understand why I'm getting this error.
par_opt = so.minimize(fun=fun_obj, x0=par_ini, method='Nelder-Mead', args=[series_pt_cal, dt, series_caudal_cal])
Not enough info is given by the OP, but basically somewhere in the code something is told to operate along DataFrame columns (axis=1) on an object that is actually a pandas Series. If the code typically works but occasionally gives errors, check for degenerate cases where a DataFrame may have only 1 row. Pandas has a nasty habit of guessing what you want: it may decide to reduce a 1-row DataFrame to a Series (e.g., in the apply() function; you can disable that by passing reduce=False there).
Add a line of code that checks the object with isinstance(df, pd.DataFrame), or else convert the offending pandas Series to a DataFrame, something like s.to_frame().T for the problems I had to deal with.
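A tiny sketch of that defensive check (the helper name is made up):

import pandas as pd

def ensure_dataframe(obj):
    # If pandas reduced a 1-row frame to a Series, promote it back to a DataFrame.
    if isinstance(obj, pd.Series):
        return obj.to_frame().T
    return obj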
Use pd.DataFrame(df) before your so.minimize call.
Pandas wants to operate on a DataFrame in that function.

Adding meta-information/metadata to pandas DataFrame

Is it possible to add some meta-information/metadata to a pandas DataFrame?
For example, the instrument's name used to measure the data, the instrument responsible, etc.
One workaround would be to create a column with that information, but it seems wasteful to store a single piece of information in every row!
Sure, like most Python objects, you can attach new attributes to a pandas.DataFrame:
import pandas as pd
df = pd.DataFrame([])
df.instrument_name = 'Binky'
Note, however, that while you can attach attributes to a DataFrame, operations performed on the DataFrame (such as groupby, pivot, join or loc to name just a few) may return a new DataFrame without the metadata attached. Pandas does not yet have a robust method of propagating metadata attached to DataFrames.
Preserving the metadata in a file is possible. You can find an example of how to store metadata in an HDF5 file here.
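A quick illustration of the propagation caveat mentioned above (a sketch; exact behaviour can vary between pandas versions):

import pandas as pd

df = pd.DataFrame({'a': [1, 2]})
df.instrument_name = 'Binky'               # plain attribute, not a column
subset = df[df['a'] > 1]                   # many operations return a new DataFrame
print(hasattr(subset, 'instrument_name'))  # typically False: the attribute is lost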
As of pandas 1.0, possibly earlier, there is now a DataFrame.attrs property. It is experimental, but this is probably what you'll want in the future.
For example:
import pandas as pd
df = pd.DataFrame([])
df.attrs['instrument_name'] = 'Binky'
Find it in the docs here.
Trying this out with to_parquet and then read_parquet, the metadata doesn't seem to persist, so be sure to check that for your use case.
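A quick way to run that check yourself (a sketch; the parquet round trip needs pyarrow or fastparquet installed, and the path is made up):

import pandas as pd

df = pd.DataFrame({'a': [1]})
df.attrs['instrument_name'] = 'Binky'

df.to_parquet('/tmp/attrs_check.parquet')
round_tripped = pd.read_parquet('/tmp/attrs_check.parquet')
print(round_tripped.attrs)  # an empty dict means the metadata did not survive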
Just ran into this issue myself. As of pandas 0.13, DataFrames have a _metadata attribute on them that does persist through functions that return new DataFrames. Also seems to survive serialization just fine (I've only tried json, but I imagine hdf is covered as well).
Not really. Although you could add attributes containing metadata to the DataFrame class as @unutbu mentions, many DataFrame methods return a new DataFrame, so your metadata would be lost. If you need to manipulate your dataframe, then the best option would be to wrap your metadata and DataFrame in another class. See this discussion on GitHub: https://github.com/pydata/pandas/issues/2485
There is currently an open pull request to add a MetaDataFrame object, which would support metadata better.
The top answer of attaching arbitrary attributes to the DataFrame object is good, but if you use a dictionary, list, or tuple, it will emit a warning: "Pandas doesn't allow columns to be created via a new attribute name". The following solution works for storing arbitrary attributes.
from types import SimpleNamespace
import pandas as pd

df = pd.DataFrame()
df.meta = SimpleNamespace()
df.meta.foo = [1, 2, 3]
As mentioned in other answers and comments, _metadata is not part of the public API, so it's definitely not a good idea to use it in a production environment. But you may still want to use it for research prototyping and replace it if it stops working. And right now it works with groupby/apply, which is helpful. This is an example (which I couldn't find in other answers):
df = pd.DataFrame([1, 2, 2, 3, 3], columns=['val'])
df.my_attribute = "my_value"
df._metadata.append('my_attribute')
df.groupby('val').apply(lambda group: group.my_attribute)
Output:
val
1 my_value
2 my_value
3 my_value
dtype: object
As mentioned by @choldgraf, I have found xarray to be an excellent tool for attaching metadata when comparing data and plotting results across several dataframes.
In my work, we often compare the results of several firmware revisions and different test scenarios, and adding this information is as simple as this:
import pandas as pd
import xarray as xr

df = pd.read_csv(meaningless_test)
metadata = {'fw': foo, 'test_name': bar, 'scenario': sc_01}
ds = xr.Dataset.from_dataframe(df)
ds.attrs = metadata
Coming pretty late to this, I thought this might be helpful if you need metadata to persist over I/O. There's a relatively new package called h5io that I've been using to accomplish this.
It should let you do a quick read/write from HDF5 for a few common formats, one of them being a dataframe. So you can, for example, put a dataframe in a dictionary and include metadata as fields in the dictionary. E.g.:
import h5io

save_dict = dict(data=my_df, name='chris', record_date='1/1/2016')
h5io.write_hdf5('path/to/file.hdf5', save_dict)

in_data = h5io.read_hdf5('path/to/file.hdf5')
df = in_data['data']
name = in_data['name']
etc...
Another option would be to look into a project like xray, which is more complex in some ways, but I think it does let you use metadata and is pretty easy to convert to a DataFrame.
I have been looking for a solution and found that a pandas DataFrame has the attrs property:
frame = pd.DataFrame()
frame.attrs.update({'your_attribute': 'value'})
frame.attrs['your_attribute']
This attribute will always stick to your frame whenever you pass it!
Referring to the section Define original properties (of the official pandas documentation), and if subclassing pandas.DataFrame is an option, note that:
To let original data structures have additional properties, you should let pandas know what properties are added.
Thus, something you can do - where the name MetaedDataFrame is arbitrarily chosen - is
class MetaedDataFrame(pd.DataFrame):
    """s/e."""
    _metadata = ['instrument_name']

    @property
    def _constructor(self):
        return self.__class__

    # Define the following if providing attribute(s) at instantiation
    # is a requirement, otherwise, if YAGNI, don't.
    def __init__(self, *args, instrument_name: str = None, **kwargs):
        super().__init__(*args, **kwargs)
        self.instrument_name = instrument_name
And then instantiate your dataframe with your (_metadata-prespecified) attribute(s)
>>> mdf = MetaedDataFrame(instrument_name='Binky')
>>> mdf.instrument_name
'Binky'
Or even after instantiation:
>>> mdf = MetaedDataFrame()
>>> mdf.instrument_name = 'Binky'
>>> mdf.instrument_name
'Binky'
Without any kind of warning (as of 2021/06/15), serialization and ~.copy work like a charm. Also, such an approach allows you to enrich your API, e.g. by adding some instrument_name-based members to MetaedDataFrame, such as properties (or methods):
[...]
    @property
    def lower_instrument_name(self) -> str:
        if self.instrument_name is not None:
            return self.instrument_name.lower()
[...]
>>> mdf.lower_instrument_name
'binky'
... but this is rather beyond the scope of this question ...
I was having the same issue and used a workaround of creating a new, smaller DF from a dictionary with the metadata:
meta = {"name": "Sample Dataframe", "Created": "19/07/2019"}
dfMeta = pd.DataFrame.from_dict(meta, orient='index')
This dfMeta can then be saved alongside your original DF in a pickle, etc.
See Saving and loading multiple objects in pickle file? (Lutz's answer) for an excellent answer on saving and retrieving multiple dataframes using pickle.
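A sketch of what that looks like with sequential pickle dumps, as in the linked answer (the file name is made up):

import pickle
import pandas as pd

df = pd.DataFrame({'a': [1, 2, 3]})  # your original data
meta = {"name": "Sample Dataframe", "Created": "19/07/2019"}
dfMeta = pd.DataFrame.from_dict(meta, orient='index')

# Write both objects into one pickle file, then read them back in the same order.
with open('data_with_meta.pkl', 'wb') as f:
    pickle.dump(df, f)
    pickle.dump(dfMeta, f)

with open('data_with_meta.pkl', 'rb') as f:
    df_loaded = pickle.load(f)
    dfMeta_loaded = pickle.load(f)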
Adding raw attributes with pandas (e.g. df.my_metadata = "source.csv") is not a good idea.
Even on a recent version (1.2.4 on Python 3.8), doing this will randomly cause segfaults when doing very simple operations with things like read_csv. It is hard to debug, because read_csv will work fine, but later on (seemingly at random) you will find that the dataframe has been freed from memory.
The CPython extensions involved in pandas seem to make very explicit assumptions about the data layout of the dataframe.
attrs is the only safe way to use metadata properties currently:
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.attrs.html
e.g.
df.attrs.update({'my_metadata' : "source.csv"})
How attrs should behave in all scenarios is not fully fleshed out. You can help provide feedback on the expected behaviors of attrs in this issue: https://github.com/pandas-dev/pandas/issues/28283
For those looking to store the DataFrame in an HDFStore, according to pandas.pydata.org, the recommended approach is:
import pandas as pd
df = pd.DataFrame(dict(keys=['a', 'b', 'c'], values=['1', '2', '3']))
df.to_hdf('/tmp/temp_df.h5', key='temp_df')
store = pd.HDFStore('/tmp/temp_df.h5')
store.get_storer('temp_df').attrs.attr_key = 'attr_value'
store.close()
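Reading the attribute back later works the same way through the storer (a small sketch, assuming PyTables is installed):

import pandas as pd

store = pd.HDFStore('/tmp/temp_df.h5')
df_loaded = store['temp_df']
attr_value = store.get_storer('temp_df').attrs.attr_key
store.close()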
