While going through the pandas documentation for version 0.24.1, I came across this statement:
"All pandas data structures are value-mutable (the values they contain can be altered) but not always size-mutable. The length of a Series cannot be changed, but, for example, columns can be inserted into a DataFrame."
import pandas as pd
test_s = pd.Series([1,2,3])
id(test_s) # output: 140485359734400 (will vary)
len(test_s) # output: 3
test_s[3] = 37
id(test_s) # output: 140485359734400
len(test_s) # output: 4
My inference was that size immutability means operations like appending and deleting elements are not allowed, which is clearly not the case here. Even the identity of the object remains the same, ruling out the possibility that a new object was created under the same name.
So, what does size immutability actually mean?
Appending and deleting are allowed, but that doesn't necessarily imply the Series is size-mutable.
Series and DataFrames are internally backed by NumPy arrays, which are fixed in size, to allow a more compact memory representation and better performance.
When you assign to a Series at a label that doesn't exist yet, Series.__setitem__ (via the fallback indexing machinery) ends up allocating a new, larger array. This new array is then assigned back to the same Series (of course, as the end user, you don't get to see this), giving you the illusion of size mutability.
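A way to observe this (a minimal sketch; it relies on .values exposing the backing array, which is an implementation detail):
import pandas as pd

s = pd.Series([1, 2, 3])
before = s.values               # reference to the backing NumPy array
s[3] = 37                       # enlargement: a new, larger array is allocated
after = s.values

print(before is after)          # False: the backing array was replaced
print(len(before), len(after))  # 3 4: the old array itself never grew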
"All pandas data structures are value-mutable (the values they contain can be altered) but not always size-mutable.
A per my opinion, it is already written that all pandas data structures (Series, Dataframes) are value-mutable. It means we can add or delete values in Series and DataFrame.
"The length of a Series cannot be changed, but, for example, columns can be inserted into a DataFrame."
As this statement says, we cannot change the columns of a Series (by default it has only one column; we cannot add new columns, nor delete the existing one), but we can add and delete columns in a DataFrame. So here "length" means the number of columns, not the number of values.
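A quick illustration of that reading (a minimal sketch; the column name 'b' is just an example):
import pandas as pd

df = pd.DataFrame({'a': [1, 2, 3]})
df['b'] = [4, 5, 6]   # a column can be inserted into a DataFrame
del df['b']           # and deleted again
# a Series has no analogous column to insert or delete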
I have some problems with pandas' HDFStore being far too slow and unfortunately I'm unable to put together a satisfying solution from other questions here.
Situation
I have a big DataFrame, containing mostly floats and some integer columns, which goes through multiple processing steps (renaming, removing bad entries, aggregating by 30 minutes). Each row has a timestamp associated with it. I would like to save some intermediate steps to an HDF file, so that the user can do a single step iteratively without starting from scratch each time.
Additionally, the user should be able to plot certain columns from these saves in order to select bad data. Therefore I would like to retrieve only the column names without reading the data in the HDFStore.
Concretely, the user should get a list of all columns of all DataFrames stored in the HDF file; then they should select which columns they would like to see, after which I use matplotlib to present them the corresponding data.
Data
shape == (5730000, 339) does not seem large at all, which is why I'm confused... (I might get far more rows over time, but the columns should stay fixed.)
In the first step I iteratively append rows and columns (that runs okay), but once that's done I always process the entire DataFrame at once, only grouping or removing data.
My approach
I do all manipulations in memory, since pandas seems to be rather fast and I/O is slower (the HDF file is on a different physical server, I think).
I use a datetime index and automatically selected float or integer columns.
I save the steps with hdf.put('/name', df, format='fixed'), since hdf.put('/name', df, format='table', data_columns=True) seemed far too slow.
I use e.g. df.groupby(df.index).first() and df.groupby(pd.Grouper(freq='30Min')).agg(agg_dict) to process the data, where agg_dict is a dictionary with one function per column. This is incredibly slow as well.
For plotting, I have to read in the entire DataFrame and then get the columns: hdfstore.get('/name').columns.
Question
How can I retrieve all columns without reading any data from the HDFStore?
What would be the most efficient way of storing my data? Is HDF the right option? Table or fixed?
Does it matter in terms of efficiency if the index is a datetime index? Does a more efficient format exist in general (e.g. all columns the same, fixed dtype)?
Is there a faster way to aggregate than groupby (df.groupby(pd.Grouper(freq='30Min')).agg(agg_dict))?
Similar questions
How to access single columns using .select
I see that I can use this to retrieve only certain columns but only after I know the column names, I think.
Thank you for any advice!
You may simply load 0 rows of the DataFrame by specifying the same start and stop values, and leave all the internal index/column processing to pandas itself:
import numpy as np
import pandas as pd

idx = pd.MultiIndex.from_product([('A', 'B'), range(2)], names=('Alpha', 'Int'))
df = pd.DataFrame(np.random.randn(len(idx), 3), index=idx, columns=('I', 'II', 'III'))
df
>>> I II III
>>> Alpha Int
>>> A 0 -0.472412 0.436486 0.354592
>>> 1 -0.095776 -0.598585 -0.847514
>>> B 0 0.107897 1.236039 -0.196927
>>> 1 -0.154014 0.821511 0.092220
The following works for both the fixed and the table format:
with pd.HDFStore('test.h5') as store:
    store.put('df', df, format='fixed')
    meta = store.select('df', start=1, stop=1)
meta
meta.index
meta.columns
>>> I II III
>>> Alpha Int
>>>
>>> MultiIndex(levels=[[], []],
>>> codes=[[], []],
>>> names=['Alpha', 'Int'])
>>>
>>> Index(['I', 'II', 'III'], dtype='object')
As for the other questions:
As long as your data is mostly homogeneous (mostly float columns, as you mentioned) and you are able to store it in a single file without needing to distribute the data across machines, HDF is the first thing to try.
If you need to append/delete/query data, you must use the table format. If you only need to write once and read many times, the fixed format will improve performance.
As for the datetime index, I think the same idea as in the first point applies: if you are able to convert all data to a single type, it should increase your performance.
Nothing beyond what was proposed in the comments to your question comes to mind.
For an HDFStore hdf and a key (from hdf.keys()), you can get the column names with:
# Table stored with hdf.put(..., format='table')
columns = hdf.get_node('{}/table'.format(key)).description._v_names
# Table stored with hdf.put(..., format='fixed')
columns = list(hdf.get_node('{}/axis0'.format(key)).read().astype(str))
Note that hdf.get(key).columns works as well, but it reads all the data into memory, while the approach above only reads the column names.
Full working example:
#!/usr/bin/env python
import pandas as pd

data = pd.DataFrame({'a': [1, 1, 1, 2, 3, 4, 5], 'b': [2, 3, 4, 1, 3, 2, 1]})

with pd.HDFStore(path='store.h5', mode='a') as hdf:
    hdf.put('/DATA/fixed_store', data, format='fixed')
    hdf.put('/DATA/table_store', data, format='table', data_columns=True)
    for key in hdf.keys():
        try:
            # column names of a table store
            print(hdf.get_node('{}/table'.format(key)).description._v_names)
        except AttributeError:
            try:
                # column names of a fixed store
                print(list(hdf.get_node('{}/axis0'.format(key)).read().astype(str)))
            except AttributeError:
                # e.g. a dataset created by h5py instead of pandas
                print('unknown node in HDF.')
Columns without reading any data:
store.get_storer('df').ncols # substitute 'df' with your key
# you can also access nrows and other useful fields
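A minimal usage sketch (treat the attribute availability as an assumption to verify; it can differ between fixed and table storers):
import pandas as pd

df = pd.DataFrame({'a': [1, 2, 3], 'b': [4.0, 5.0, 6.0]})
with pd.HDFStore('store.h5') as store:
    store.put('df', df, format='table')
    storer = store.get_storer('df')
    print(storer.ncols, storer.nrows)  # counts come from metadata, not from reading the data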
From the docs (fixed format, table format):
[fixed] These types of stores are not appendable once written (though you can simply remove them and rewrite). Nor are they queryable; they must be retrieved in their entirety. They also do not support dataframes with non-unique column names. The fixed format stores offer very fast writing and slightly faster reading than table stores.
[table] Conceptually a table is shaped very much like a DataFrame, with rows and columns. A table may be appended to in the same or other sessions. In addition, delete and query type operations are supported.
You may try using milliseconds or nanoseconds since the epoch in place of datetimes. This way, you are just dealing with integer indices.
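For example, a sketch of such a conversion via DatetimeIndex.asi8, which exposes the underlying int64 nanoseconds since the epoch:
import pandas as pd

idx = pd.date_range('2019-01-01', periods=4, freq='30min')
ns_since_epoch = idx.asi8                 # int64 nanoseconds since the epoch
ms_since_epoch = ns_since_epoch // 10**6  # milliseconds since the epoch
df = pd.DataFrame({'x': range(4)}, index=ms_since_epoch)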
You may have a look at this answer if what you need is groupby on large data.
One piece of advice: if you have four questions to ask, it may be better to ask four separate questions on SO. This way, you'll get a higher number of (higher quality) answers, since each one is easier to tackle, and each will deal with a specific topic, making it easier to find for people who are looking for specific answers.
I have a large dataframe (10m rows, 40 columns, 7 GB in memory). I would like to create a view so that I have a shorthand name for a selection that is complicated to express, without adding another 2-4 GB to memory usage. In other words, I would rather type:
df2
Than:
df.loc[complicated_condition, some_columns]
The documentation states that, while using .loc ensures that setting values modifies the original dataframe, there is still no guarantee as to whether the object returned by .loc is a view or a copy.
I know I could assign the condition and column list to variables (e.g. df.loc[cond, cols]), but I'm generally curious to know whether it is possible to create a view of a dataframe.
Edit: Related questions:
What rules does Pandas use to generate a view vs a copy?
Pandas: Subindexing dataframes: Copies vs views
You generally can't return a view.
Your answer lies in the pandas docs:
returning-a-view-versus-a-copy.
Whenever an array of labels or a boolean vector are involved in the indexing operation, the result will be a copy. With single label / scalar indexing and slicing, e.g. df.ix[3:6] or df.ix[:, 'A'], a view will be returned.
This answer was found in the following post: Link.
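A small sketch to check which indexing operations share memory with the original (whether a view is actually returned remains an implementation detail, so treat this as illustrative only):
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6]})

sliced = df.iloc[0:2]      # plain positional slice: typically a view
masked = df[df['A'] > 1]   # boolean-vector indexing: a copy

print(np.shares_memory(sliced['A'].values, df['A'].values))  # True: shares memory
print(np.shares_memory(masked['A'].values, df['A'].values))  # False: independent copy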
Is there a data class or type in Python that matches these criteria?
I am trying to build an object that looks something like this:
ExperimentData
ID 1
sample_info_1: character string
sample_info_2: character string
Dataframe_1: pandas data frame
Dataframe_2: pandas data frame
ID 2
(etc.)
Right now, I am using a dict to hold the object ('ExperimentData'), which contains namedtuples for each ID. Each namedtuple has a named field for the corresponding data attached to the sample. This allows me to keep all the IDs indexed, and to have all of the fields under each ID indexed as well.
However, I need to update and/or replace the entries under each ID during downstream analysis. Since a tuple is immutable, this does not seem to be possible.
Is there a better implementation of this?
You could use a dict of dicts instead of a dict of namedtuples. Dicts are mutable, so you'll be able to modify the inner dicts.
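A minimal sketch of that layout (df_1 and df_2 stand in for your real DataFrames; the field names are illustrative):
import pandas as pd

df_1 = pd.DataFrame({'x': [1, 2]})
df_2 = pd.DataFrame({'y': [3, 4]})

experiment_data = {
    'ID1': {
        'sample_info_1': 'some string',
        'sample_info_2': 'another string',
        'Dataframe_1': df_1,
        'Dataframe_2': df_2,
    },
}

# unlike a namedtuple field, an inner dict entry can be replaced in place
experiment_data['ID1']['Dataframe_1'] = df_1.head(1)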
Given what you said in the comments about the structures of each Dataframe_1 and Dataframe_2 being comparable, you could also collect all of them into one big DataFrame, by adding a column to each DataFrame containing the value of sample_info_1 repeated across all rows, and likewise for sample_info_2. Then you could concat all the Dataframe_1s into one big frame, and likewise for the Dataframe_2s, getting all your data into two DataFrames. (Depending on the structure of those DataFrames, you could even join them into one.)
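A sketch of that consolidation, assuming the experiment_data dict from the previous example:
import pandas as pd

frames = []
for sample_id, record in experiment_data.items():
    df = record['Dataframe_1'].copy()
    df['sample_id'] = sample_id                    # tag each row with its ID
    df['sample_info_1'] = record['sample_info_1']  # repeat the metadata across rows
    frames.append(df)

big_df_1 = pd.concat(frames, ignore_index=True)    # one DataFrame for all Dataframe_1s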
As a followup to my question on mixed types in a column:
Can I think of a DataFrame as a list of columns or is it a list of rows?
In the former case, it means that (optimally) each column has to be homogeneous (type-wise) while different columns can be of different types. The latter case suggests that each row is type-wise homogeneous.
From the documentation:
DataFrame is a 2-dimensional labeled data structure with columns of potentially different types.
This implies that a DataFrame is a list of columns.
Does it mean that appending a row to a DataFrame is more expensive than appending a column?
You are fully correct that a DataFrame can be seen as a list of columns, or, even better, as an (ordered) dictionary of columns (see the explanation here).
Indeed, each column has to be homogeneous in type, and different columns can be of different types. But by using the object dtype you can still hold different types of objects in one column (although this is not recommended, apart from e.g. strings).
To illustrate, if you ask the data types of a DataFrame, you get the dtype for each column:
In [2]: df = pd.DataFrame({'int_col':[0,1,2], 'float_col':[0.0,1.1,2.5], 'bool_col':[True, False, True]})
In [3]: df.dtypes
Out[3]:
bool_col bool
float_col float64
int_col int64
dtype: object
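And to illustrate the object-dtype fallback mentioned above, a column holding mixed Python objects is reported as object:
In [4]: pd.Series([1, 'a', 3.5]).dtype
Out[4]: dtype('O')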
Internally, the values are stored as blocks of the same type. Each column, or collection of columns of the same type, is stored in a separate array.
And this indeed implies that appending a row is more expensive. In general, appending multiple single rows is not a good idea: it is better to e.g. preallocate an empty DataFrame to fill, or to put the new rows/columns in a list and concat them all at once.
See the note at the end of the concat/append docs (just before the first subsection "Set logic on the other axes").
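For example, a minimal sketch of the collect-then-build pattern (the column names are illustrative):
import pandas as pd

# slow pattern: growing a DataFrame row by row copies data on every append
# faster pattern: collect the rows first, then build the DataFrame once
rows = [{'a': i, 'b': i ** 2} for i in range(5)]
df = pd.DataFrame(rows)

# or concatenate a list of partial DataFrames in a single call
parts = [pd.DataFrame({'a': [i], 'b': [i ** 2]}) for i in range(5)]
df2 = pd.concat(parts, ignore_index=True)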
To address the question: is appending a row to a DataFrame more expensive than appending a column?
We need to take various factors into account, but the most important one is the internal physical data layout of a pandas DataFrame.
The short and kind of naive answer:
If the table (a.k.a. DataFrame) is stored in a column-wise physical layout, then adding or fetching a column is faster than the same operation on a row; if the table is stored in a row-wise physical layout, it's the other way around. In general, the default pandas DataFrame is stored column-wise (but NOT all the time). So in general, appending a row to a DataFrame is indeed more expensive than appending a column, and you can consider the nature of a pandas DataFrame to be a dict of columns.
A longer answer:
Pandas needs to choose a way to arrange the internal layout of a table in memory (such as a DataFrame of 10 rows and 2 columns). The two most common approaches are column-wise and row-wise.
Pandas is built on top of NumPy, and DataFrame and Series are built on top of NumPy arrays. But note that although a NumPy array is stored row-wise in memory, this is NOT the case for a pandas DataFrame. How a DataFrame is stored depends on how it was initiated; cf. this post: https://krbnite.github.io/Memory-Efficient-Windowing-of-Time-Series-Data-in-Python-2-NumPy-Arrays-vs-Pandas-DataFrames/
It's actually quite natural that pandas adopts a column-wise layout most of the time, because pandas was designed to be a data analysis tool that relies more heavily on column-oriented operations than on row-oriented operations. cf. https://www.stitchdata.com/columnardatabase/
In the end, the answer to the question (is appending a row to a DataFrame more expensive than appending a column?) also depends on caching, prefetching, etc. It is thus a rather complicated question and could depend on specific runtime conditions, but the most important factor is the data layout.
Answer from the authors of Pandas
The authors of pandas actually mention this point in their design documentation; cf. https://github.com/pydata/pandas-design/blob/master/source/internal-architecture.rst#what-is-blockmanager-and-why-does-it-exist
So, to do anything row oriented on an all-numeric DataFrame, pandas would concatenate all of the columns together (using numpy.vstack or numpy.hstack) then use array broadcasting or methods like ndarray.sum (combined with np.isnan to mind missing data) to carry out certain operations.
I am creating a 2D array full of zeros with the following code:
from numpy import zeros

MyNewArray = zeros([4, 12], float)
However, the first column will need to be populated with string-type textual data, while all the other columns will need to be populated with numerical data that can be manipulated mathematically.
How can I edit the code above so that the first column in the matrix can be of the string data type while keeping all the other columns as float?
You might want to use structured arrays:
from numpy import zeros

MyNewArray = zeros(12, dtype='S10,f4,f4,f4')
There are several ways of defining the structure; here I have defined 4 fields: one text field of 10 characters, and three floats (you could write float instead of f4).
It is important to note that the maximum number of characters in the text field has to be specified, for array memory management reasons. You won't be able to store strings longer than this maximum length.
Each field is referenced by a field name; in this case, the default field names f0 to f3 will be used. For example, to get the whole first column (the textual one):
MyNewArray['f0']
Of course, you can modify the field names as you wish.
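For instance, a sketch with custom field names (the names are illustrative):
import numpy as np

# one 10-character text field plus three floats, with custom field names
dt = np.dtype([('label', 'S10'), ('x', 'f4'), ('y', 'f4'), ('z', 'f4')])
MyNewArray = np.zeros(12, dtype=dt)

MyNewArray['label'][0] = b'sample'   # bytes, truncated beyond 10 characters
MyNewArray['x'] += 1.5               # numeric fields support array math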