Is it possible to add some meta-information/metadata to a pandas DataFrame?
For example, the instrument's name used to measure the data, the instrument responsible, etc.
One workaround would be to create a column with that information, but it seems wasteful to store a single piece of information in every row!
Sure, like most Python objects, you can attach new attributes to a pandas.DataFrame:
import pandas as pd
df = pd.DataFrame([])
df.instrument_name = 'Binky'
Note, however, that while you can attach attributes to a DataFrame, operations performed on the DataFrame (such as groupby, pivot, join or loc to name just a few) may return a new DataFrame without the metadata attached. Pandas does not yet have a robust method of propagating metadata attached to DataFrames.
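For instance, a quick sketch of what can happen (behaviour may vary by pandas version):
import pandas as pd

df = pd.DataFrame({'val': [1, 2, 3]})
df.instrument_name = 'Binky'

# An ordinary operation returns a new DataFrame that typically
# does not carry the attribute along.
df2 = df[df['val'] > 1]
print(hasattr(df2, 'instrument_name'))  # usually False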
Preserving the metadata in a file is possible. You can find an example of how to store metadata in an HDF5 file here.
As of pandas 1.0, possibly earlier, there is now a DataFrame.attrs property. It is experimental, but this is probably what you'll want in the future.
For example:
import pandas as pd
df = pd.DataFrame([])
df.attrs['instrument_name'] = 'Binky'
Find it in the docs here.
Trying this out with to_parquet and then read_parquet, the attrs don't seem to persist, so be sure to check this with your use case.
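A quick round-trip check of this (a rough sketch; assumes a parquet engine such as pyarrow is installed, and the file name is just an example):
import pandas as pd

df = pd.DataFrame({'val': [1, 2, 3]})
df.attrs['instrument_name'] = 'Binky'

df.to_parquet('example.parquet')
df2 = pd.read_parquet('example.parquet')
print(df2.attrs)  # likely {} -- attrs are not written into the parquet file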
Just ran into this issue myself. As of pandas 0.13, DataFrames have a _metadata attribute that does persist through functions that return new DataFrames. It also seems to survive serialization just fine (I've only tried JSON, but I imagine HDF is covered as well).
Not really. Although you could add attributes containing metadata to the DataFrame class as @unutbu mentions, many DataFrame methods return a new DataFrame, so your metadata would be lost. If you need to manipulate your dataframe, then the best option would be to wrap your metadata and DataFrame in another class. See this discussion on GitHub: https://github.com/pydata/pandas/issues/2485
There is currently an open pull request to add a MetaDataFrame object, which would support metadata better.
The top answer of attaching arbitrary attributes to the DataFrame object is good, but if you use a dictionary, list, or tuple, pandas will complain with "Pandas doesn't allow columns to be created via a new attribute name". The following solution works for storing arbitrary attributes.
from types import SimpleNamespace
import pandas as pd

df = pd.DataFrame()
df.meta = SimpleNamespace()
df.meta.foo = [1, 2, 3]
As mentioned in other answers and comments, _metadata is not part of the public API, so it's definitely not a good idea to use it in a production environment. But you may still want to use it in research prototyping and replace it if it stops working. And right now it works with groupby/apply, which is helpful. This is an example (which I couldn't find in other answers):
import pandas as pd

df = pd.DataFrame([1, 2, 2, 3, 3], columns=['val'])
df.my_attribute = "my_value"
df._metadata.append('my_attribute')
df.groupby('val').apply(lambda group: group.my_attribute)
Output:
val
1 my_value
2 my_value
3 my_value
dtype: object
As mentioned by @choldgraf, I have found xarray to be an excellent tool for attaching metadata when comparing data and plotting results between several dataframes.
In my work, we are often comparing the results of several firmware revisions and different test scenarios, and adding this information is as simple as this:
import pandas as pd
import xarray as xr

df = pd.read_csv(meaningless_test)
metadata = {'fw': foo, 'test_name': bar, 'scenario': sc_01}
ds = xr.Dataset.from_dataframe(df)
ds.attrs = metadata
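If you later need the metadata, or a plain DataFrame back, something like this should work (a small follow-up sketch on the snippet above):
print(ds.attrs['fw'], ds.attrs['test_name'])  # metadata travels with the Dataset
df_again = ds.to_dataframe()                  # convert back to pandas when needed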
Coming pretty late to this, I thought this might be helpful if you need metadata to persist over I/O. There's a relatively new package called h5io that I've been using to accomplish this.
It should let you do a quick read/write from HDF5 for a few common formats, one of them being a dataframe. So you can, for example, put a dataframe in a dictionary and include metadata as fields in the dictionary. E.g.:
import h5io
import pandas as pd

save_dict = dict(data=my_df, name='chris', record_date='1/1/2016')
h5io.write_hdf5('path/to/file.hdf5', save_dict)

in_data = h5io.read_hdf5('path/to/file.hdf5')
df = in_data['data']
name = in_data['name']
etc...
Another option would be to look into a project like xray (now xarray), which is more complex in some ways, but I think it does let you use metadata and is pretty easy to convert to a DataFrame.
I have been looking for a solution and found that the pandas DataFrame has an attrs property:
frame = pd.DataFrame()
frame.attrs.update({'your_attribute': 'value'})
frame.attrs['your_attribute']
This attribute will always stick to your frame whenever you pass it!
Referring to the section Define original properties of the official pandas documentation, and if subclassing pandas.DataFrame is an option, note that:
To let original data structures have additional properties, you should let pandas know what properties are added.
Thus, something you can do - where the name MetaedDataFrame is arbitrarily chosen - is
import pandas as pd


class MetaedDataFrame(pd.DataFrame):
    """s/e."""

    _metadata = ['instrument_name']

    @property
    def _constructor(self):
        return self.__class__

    # Define the following if providing attribute(s) at instantiation
    # is a requirement, otherwise, if YAGNI, don't.
    def __init__(self, *args, instrument_name: str = None, **kwargs):
        super().__init__(*args, **kwargs)
        self.instrument_name = instrument_name
And then instantiate your dataframe with your (_metadata-prespecified) attribute(s)
>>> mdf = MetaedDataFrame(instrument_name='Binky')
>>> mdf.instrument_name
'Binky'
Or even after instantiation
>>> mdf = MetaedDataFrame()
>>> mdf.instrument_name = 'Binky'
>>> mdf.instrument_name
'Binky'
Without any kind of warning (as of 2021/06/15): serialization and ~.copy work like a charm. Also, such an approach lets you enrich your API, e.g. by adding some instrument_name-based members to MetaedDataFrame, such as properties (or methods):
[...]
    @property
    def lower_instrument_name(self) -> str:
        if self.instrument_name is not None:
            return self.instrument_name.lower()
[...]
>>> mdf.lower_instrument_name
'binky'
... but this is rather beyond the scope of this question ...
I was having the same issue and used a workaround of creating a new, smaller DF from a dictionary with the metadata:
meta = {"name": "Sample Dataframe", "Created": "19/07/2019"}
dfMeta = pd.DataFrame.from_dict(meta, orient='index')
This dfMeta can then be saved alongside your original DF in a pickle file, etc.
See Saving and loading multiple objects in pickle file? (Lutz's answer) for an excellent answer on saving and retrieving multiple dataframes using pickle.
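A minimal sketch of that pattern (the file name and metadata values are just examples): dump both frames into a single pickle file, then load them back in the same order.
import pickle
import pandas as pd

df = pd.DataFrame({'val': [1, 2, 3]})  # your original data
meta = {"name": "Sample Dataframe", "Created": "19/07/2019"}
dfMeta = pd.DataFrame.from_dict(meta, orient='index')

# Write both objects into one pickle file...
with open('data_with_meta.pkl', 'wb') as f:
    pickle.dump(df, f)
    pickle.dump(dfMeta, f)

# ...and read them back in the same order they were written.
with open('data_with_meta.pkl', 'rb') as f:
    df_loaded = pickle.load(f)
    dfMeta_loaded = pickle.load(f)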
Adding raw attributes with pandas (e.g. df.my_metadata = "source.csv") is not a good idea.
Even on the latest version (1.2.4 on python 3.8), doing this will randomly cause segfaults when doing very simple operations with things like read_csv. It will be hard to debug, because read_csv will work fine, but later on (seemingly at random) you will find that the dataframe has been freed from memory.
The CPython extensions involved with pandas seem to make very explicit assumptions about the data layout of the dataframe.
attrs is the only safe way to use metadata properties currently:
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.attrs.html
e.g.
df.attrs.update({'my_metadata' : "source.csv"})
How attrs should behave in all scenarios is not fully fleshed out. You can help provide feedback on the expected behaviors of attrs in this issue: https://github.com/pandas-dev/pandas/issues/28283
For those looking to store the DataFrame in an HDFStore, according to pandas.pydata.org, the recommended approach is:
import pandas as pd
df = pd.DataFrame(dict(keys=['a', 'b', 'c'], values=['1', '2', '3']))
df.to_hdf('/tmp/temp_df.h5', key='temp_df')
store = pd.HDFStore('/tmp/temp_df.h5')
store.get_storer('temp_df').attrs.attr_key = 'attr_value'
store.close()
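Reading the attribute back follows the same pattern (a short sketch using the same key names as above):
store = pd.HDFStore('/tmp/temp_df.h5')
df_loaded = store['temp_df']                              # the DataFrame itself
attr_value = store.get_storer('temp_df').attrs.attr_key   # the stored metadata
store.close()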
Related
I'm more of an R user and have recently been "switching" to Python. So that means I'm way more used to the R way of dealing with things. In Python, the whole concept of mutability and passing by assignment is kind of hard to grasp at first.
I can easily understand the issues that mutability may lead to when using lists or dictionaries. However, when using pandas DataFrames, I find that mutability is especially difficult to understand.
For example: let's say I have a DataFrame (df) with some raw data. I want to use a function that receives df as a parameter and outputs a modified version of that df, but keeping the original df. If I wrote the function, maybe I can inspect it and be assured that it makes a copy of the input before applying any manipulation. However, if it's a function I don't know (let's say, from some package), should I always pass my input df as df.copy()?
In my case, I'm trying to write some custom function that transforms a df using a WoE encoder. The data parameter is a DataFrame with feature columns and a label column. It kinda looks like this:
import category_encoders


def my_function(data, var_list, label_column):
    # var_list = cols to be encoded
    encoder = category_encoders.WOEEncoder(cols=var_list)
    fit_encoder = encoder.fit(
        X=data[var_list],
        y=data[label_column]
    )
    new_data = fit_encoder.transform(
        data[var_list]
    )
    new_data[label_column] = data[label_column]
    return new_data
So should I be passing data[var_list].copy() instead of data[var_list]? Should I assume that every function that receives a df will modify it in place, or that it will return a different object? I mean, how can I be sure that fit_encoder.transform won't modify data itself? I also learned that pandas sometimes produces views and sometimes not, depending on the operation you apply to whatever subset of the df. So I feel like there's too much uncertainty surrounding operations on DataFrames.
The exercise on https://www.statology.org/pandas-copy-dataframe/ shows that if you don't use .copy() when manipulating a subset of your dataframe, you can change values in your original dataframe as well. This is not what you want, so you should use .copy() when passing your dataframe to your function.
The example on the link I listed above really illustrates this concept well (and no I'm not affiliated with their site lol, I was just searching for this answer myself).
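A tiny illustration of the point (a sketch; pandas' view/copy behaviour can differ by version and operation):
import pandas as pd

df = pd.DataFrame({'a': [1, 2, 3]})

def doubles_in_place(data):
    data['a'] = data['a'] * 2  # mutates whatever frame it was given
    return data

doubles_in_place(df)         # df is modified: 'a' is now [2, 4, 6]
doubles_in_place(df.copy())  # df is untouched; only the copy changes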
I'm a Pandas newbie, so please bear with me.
Overview: I started with a free-form text file created by a data harvesting script that remotely accessed dozens of different kinds of devices, and multiple instances of each. I used OpenRefine (a truly wonderful tool) to munge that into a CSV that was then input to dataframe df using Pandas in a JupyterLab notebook.
My first inspection of the data showed the 'Timestamp' column was not monotonic. I accessed individual data sources as follows, in this case for the 'T-meter' data source. (The technique was taken from a search result - I don't really understand it, but it worked.)
cond = df['Source']=='T-meter'
rows = df.loc[cond, :]
df_tmeter = pd.DataFrame(columns=df.columns)
df_tmeter = df_tmeter.append(rows, ignore_index=True)
then checked each as follows:
df_tmeter['Timestamp'].is_monotonic
Fortunately, the problem was easy to identify and fix: Some sensors were resetting, then sending bad (but still monotonic) timestamps until their clocks were updated. I wrote the function healing() to cleanly patch such errors, and it worked a treat:
df_tmeter['healed'] = df_tmeter['Timestamp'].apply(healing)
Now for my questions:
How do I get the 'healed' values back into the original df['Timestamp'] column for only the 'T-meter' items in df['Source']?
Given the function healing(), is there a clean way to do this directly on df?
Thanks!
Edit: I first thought I should be using 'views' into df, but other operations on the data would either generate errors, or silently turn the views into copies.
I wrote a wrapper function heal_row() for healing():
def heal_row(row):
    if row['Source'] == 'T-meter':  # Redundant check, but safe!
        row['Timestamp'] = healing(row['Timestamp'])
    return row
then did the following:
df = df.apply(lambda row: row if row['Source'] != 'T-meter' else heal_row(row), axis=1)
This ordering is important, since healing() is stateful based on the prior row(s), and thus can't be the default operation.
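An alternative sketch, assuming healing() only needs to see the 'T-meter' rows in their original order, is to apply it to the masked subset and assign the result straight back into the original column:
cond = df['Source'] == 'T-meter'
df.loc[cond, 'Timestamp'] = df.loc[cond, 'Timestamp'].apply(healing)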
I'm trying to convert a pandas DataFrame to a PCollection from Apache Beam.
Unfortunately, when I use to_pcollection() function, I get the following error:
AttributeError: 'DataFrame' object has no attribute '_expr'
Does anyone know how to solve it?
I'm using pandas=1.1.4, beam=2.25.0 and Python 3.6.9.
to_pcollection was only ever intended to apply to Beam's deferred DataFrames, but looking at this, it makes sense that it should work, and it isn't obvious how to do it manually. https://github.com/apache/beam/pull/14170 should fix this.
I get this problem when I use a "native" Pandas dataframe instead of a dataframe created by to_dataframe within Beam. I suspect that the dataframe created by Beam wraps or subclasses a Pandas dataframe with new attributes (like _expr) that the native Pandas dataframe doesn't have.
The real answer involves knowing how to use apache_beam.dataframe.convert.to_dataframe, but I can't figure out how to set the proxy object correctly (I get Singleton errors when I try to later use to_pcollection). So since I can't get the "right" way to work in 2.25.0 (I'm new to Beam and Pandas, and don't know how proxy objects work, so take all this with a grain of salt), I use this workaround:
import apache_beam as beam
import pandas as pd


class SomeDoFn(beam.DoFn):
    def process(self, pair):  # pair is a key/value tuple
        df = pd.DataFrame(pair[1])  # just the array of values
        ## do something with the dataframe
        ...
        records = df.to_dict('records')
        # return a tuple with the same shape as the one we received
        return [(rec["key"], rec) for rec in records]
which I invoke with something like this:
rows = (
    pcoll
    | beam.ParDo(SomeDoFn())
)
I hope others will give you a better answer than this workaround.
So, I'm unclear why doing certain operations on a DF updates it right away, but other times it does not update it unless you re-use the old name or use a new df variable name.
Doesn't this make it really confusing where the last 'real' change is?
First of all, a df behaves like a list in Python, meaning that if you make a soft copy of it and change it, the original df changes too. To answer your question, you must know that some ways of updating a df write to a copy of that df (pandas indicates this with a warning letting you know that you are writing to a copy), so the data might not change. The best and most reliable way to change data using pandas is either:
df.at[a_cell] = some_data
or:
df.loc[some_rows, some_columns] = some_data
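For example (a small sketch of the difference), chained indexing may write to a temporary copy, while .loc writes to the original frame:
import pandas as pd

df = pd.DataFrame({'a': [1, 2, 3], 'b': [4, 5, 6]})

# Chained indexing: may write to a temporary copy and trigger SettingWithCopyWarning.
df[df['a'] > 1]['b'] = 0      # df['b'] is likely unchanged

# .loc on the original frame: a reliable in-place update.
df.loc[df['a'] > 1, 'b'] = 0  # df['b'] is now [4, 0, 0]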
I'm using Dask to read a Parquet file that was generated by PySpark, and one of the columns is a list of dictionaries (i.e. array<map<string,string>>). An example of the df would be:
import pandas as pd

df = pd.DataFrame.from_records(
    [
        (1, [{'job_id': 1, 'started': '2019-07-04'}, {'job_id': 2, 'started': '2019-05-04'}], 100),
        (5, [{'job_id': 3, 'started': '2015-06-04'}, {'job_id': 9, 'started': '2019-02-02'}], 540),
    ],
    columns=['uid', 'job_history', 'latency'],
)
When using engine='fastparquet', Dask reads all the other columns fine but returns a column of Nones for the column with the complex type. When I set engine='pyarrow', I get the following exception:
ArrowNotImplementedError: lists with structs are not supported.
A lot of googling has made it clear that reading a column with a Nested Array just isn't really supported right now, and I'm not totally sure what the best way to handle this is. I figure my options are:
Somehow tell dask/fastparquet to parse the column using the standard json library. The schema is simple and that would do the job if possible.
See if I can possibly re-run the Spark job that generated the output and save it as something else, though this almost isn't an acceptable solution since my company uses parquet everywhere.
Turn the keys of the map into columns and break the data up across several columns with dtype list and note that the data across these columns are related/map to each other by index (e.g. the elements in idx 0 across these keys/columns all came from the same source). This would work, but frankly, breaks my heart :(
I'd love to hear how others have navigated around this limitation. My company uses nested arrays in their parquet files frequently, and I'd hate to have to let go of using Dask because of this.
It would be fairer to say that pandas does not support non-simple types very well (currently). It may be the case that pyarrow will, without conversion to pandas, and that at some future point, pandas will use these arrow structures directly.
Indeed, the most direct method that I can think of for you to use is to rewrite the columns as B/JSON-encoded text, and then load with fastparquet, specifying to load using B/JSON. You should get lists of dicts in the column, but the performance will be slow.
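A rough sketch of that idea, hand-rolled with the standard json module rather than fastparquet's own object encoding (so the details don't depend on engine options); the column names match the example frame above, and the file name is just an example:
import json
import pandas as pd

# Serialize the nested column to JSON strings before writing...
df['job_history'] = df['job_history'].apply(json.dumps)
df.to_parquet('jobs.parquet', engine='fastparquet')

# ...and decode it again after reading. The same works through Dask's
# read_parquet, since the column is now plain strings.
df2 = pd.read_parquet('jobs.parquet', engine='fastparquet')
df2['job_history'] = df2['job_history'].apply(json.loads)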
Note that the old project oamap and its successor awkward provide a way to iterate and aggregate over nested list/map/struct trees using Python syntax, but compiled with Numba, such that you never need to instantiate the intermediate Python objects. They were not designed for parquet, but had parquet compatibility, so they might just be useful to you.
I'm dealing with pyarrow.lib.ArrowNotImplementedError: Reading lists of structs from Parquet files not yet supported when I try to read using Pandas; however, when I read using pyspark and then convert to pandas, the data at least loads:
import pyspark
spark = pyspark.sql.SparkSession.builder.getOrCreate()
df = spark.read.load(path)
pdf = df.toPandas()
and the offending field is now rendered as a pyspark Row object, which has some structured parsing, but you would probably have to write custom pandas functions to extract data from it:
>>> pdf["user"][0]["sessions"][0]["views"]
[Row(is_search=True, price=None, search_string='ABC', segment='listing', time=1571250719.393951), Row(is_search=True, price=None, search_string='ZYX', segment='homepage', time=1571250791.588197), Row(is_search=True, price=None, search_string='XYZ', segment='listing', time=1571250824.106184)]
An individual record can be rendered as a dictionary; simply call .asDict(recursive=True) on the Row object you would like.
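For example (a small sketch building on the snippet above), the nested Rows can be turned into plain dictionaries that ordinary pandas/Python code can work with:
# Convert each nested Row into a regular dict, recursively.
views = [row.asDict(recursive=True) for row in pdf["user"][0]["sessions"][0]["views"]]
views[0]["search_string"]  # 'ABC'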
Unfortunately, it takes ~5 seconds to start the SparkSession context, and every Spark action also takes much longer than pandas operations (for small to medium datasets), so I would greatly prefer a more Python-native option.