Suppose you have the following dataframe:
import pandas as pd
import numpy as np
df = pd.DataFrame(np.nan, columns=['A', 'B', 'C'], index=[0, 1, 2])
Suppose I want an additional "layer" on top of this pandas dataframe, such that column A, row 0 has its own value, column B, row 0 has a different value, column C, row 0 has something else, and so on through column A, row 1 and beyond. In other words, like a dataframe stacked on top of this existing one.
Is it possible to add other layers? How does one access these layers? Is this efficient, or should I just use a separate dataframe altogether? And would one save these multiple layers as a CSV by accessing the individual layers, or is there a function that would break them down into different worksheets in the same workbook?
pandas.DataFrame cannot have 3 dimensions:
DataFrame is a 2-dimensional labeled data structure with columns of potentially different types.
However, there is a way to fake 3-dimensions with MultiIndex / Advanced Indexing:
Hierarchical indexing (MultiIndex)
Hierarchical / Multi-level indexing is very exciting as it opens the door to some quite sophisticated data
analysis and manipulation, especially for working with higher
dimensional data. In essence, it enables you to store and manipulate
data with an arbitrary number of dimensions in lower dimensional data
structures like Series (1d) and DataFrame (2d).
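For instance, here is a minimal sketch (layer names made up) of faking the third dimension with a row MultiIndex; it also answers the workbook question by writing each layer to its own worksheet:
import pandas as pd
import numpy as np

# Hypothetical layer names; each layer is one level of a row MultiIndex
layers = ['base', 'overlay']
idx = pd.MultiIndex.from_product([layers, [0, 1, 2]], names=['layer', 'row'])
df = pd.DataFrame(np.nan, index=idx, columns=['A', 'B', 'C'])

df.loc[('overlay', 0), 'A'] = 1.5     # set a value on the second layer
base = df.xs('base', level='layer')   # pull one layer back out as a 2-D frame

# One worksheet per layer in a single workbook (requires an Excel engine
# such as openpyxl or xlsxwriter to be installed)
with pd.ExcelWriter('layers.xlsx') as writer:
    for name in layers:
        df.xs(name, level='layer').to_excel(writer, sheet_name=name)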
If you really need that extra dimension, go with pandas.Panel:
Panel is a somewhat less-used, but still important container for 3-dimensional data.
but don't miss this important disclaimer from the docs:
Note: Unfortunately Panel, being less commonly used than Series and
DataFrame, has been slightly neglected feature-wise. A number of
methods and options available in DataFrame are not available in Panel.
There is also pandas.Panel4D (experimental) in the unlikely chance that you need it.
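For reference, a tiny sketch of the Panel API; note that Panel was deprecated in 0.20 and removed in pandas 1.0, so this only runs on older versions:
import pandas as pd
import numpy as np

# items is the "layer" axis; each item is a full 2-D DataFrame
pn = pd.Panel(np.random.randn(2, 3, 4),
              items=['layer1', 'layer2'],
              major_axis=[0, 1, 2],
              minor_axis=['A', 'B', 'C', 'D'])
layer1 = pn['layer1']   # a plain DataFrame
flat = pn.to_frame()    # the equivalent MultiIndex representation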
Related
I've just started learning numpy to analyse experimental data and I am trying to give a custom name to an ndarray column so that I can do operations without slicing.
The data I receive is generally a two-column .txt file (I'll call the columns X and Y for the sake of clarity) with a large number of rows corresponding to the measured data. I do operations on those data and generate new columns (I'll call them F(X,Y), G(X,Y,F), ...). I know I can do column-wise operations by slicing, i.e. Data[:,2] = Data[:,1] + Data[:,0], but with a large number of added columns this becomes tedious. Hence I'm looking for a way to label the columns so that I can refer to a column by its label, and can also label the new columns I generate. So essentially I'm looking for something that'll allow me to directly write F = X + Y (as a substitute for the example above).
Currently, I'm assigning the entire column to a new variable and doing the operations, and then 'hstack'ing it to the data, but I'm unsure of the memory usage here. For example,
X = Data[:, 0]
Y = Data[:, 1]
F = X + Y
# reshape(-1, 1) makes F a column vector without needing the row count n
Data = numpy.hstack((Data, F.reshape(-1, 1)))
I've seen the use of structured arrays and record arrays, but the data I'm dealing with is homogeneous and new columns are being added continuously. Also, I hear pandas is well suited for what I'm describing, but since I'm working with numerical data, I don't see the need to learn a new module unless it's really needed. Thanks in advance.
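For what it's worth, a minimal pandas sketch of the requested F = X + Y (assuming the two-column .txt loads cleanly; the file name here is made up):
import pandas as pd

# Named columns make the column-wise operation read exactly as intended
data = pd.read_csv('measurements.txt', sep=r'\s+', names=['X', 'Y'])
data['F'] = data['X'] + data['Y']              # F = X + Y, no slicing
data['G'] = data['F'] * data['X'] - data['Y']  # further derived columns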
I have one dataframe of readings that come in a particular arrangement due to the nature of the experiment. I also have another dataframe that records, for each point in the readings dataframe, which chemical was at that point. Note, there are only a few different chemicals, but they are arranged all over the dataframe.
What I want to do is to create a new, reorganised dataframe where the columns are the types of chemical. My initial thought was to compare the data and information dataframes to produce a dictionary, which I could then transform into a new dataframe. I could not figure out how to do this, and it might not be the best approach anyway!
I have previously achieved it by manually rearranging the points on the dataframe to match the pattern I want, but I'm not happy with this approach and there must be a better way.
Thanks in advance for any help!
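Without the actual frames it's hard to be definitive, but one hedged sketch (all data here is a toy stand-in): flatten both frames so each reading is paired with its chemical label, then build one column per chemical:
import pandas as pd

# Toy stand-ins (hypothetical): readings, and the chemical at each position
data = pd.DataFrame([[1.0, 2.0], [3.0, 4.0]], columns=['c1', 'c2'])
info = pd.DataFrame([['acid', 'base'], ['base', 'acid']], columns=['c1', 'c2'])

# Pair each reading with its chemical label
long = pd.DataFrame({'value': data.stack().values,
                     'chemical': info.stack().values})

# One column per chemical; rows are just the order of occurrence
reorganised = pd.concat(
    {chem: grp['value'].reset_index(drop=True)
     for chem, grp in long.groupby('chemical')},
    axis=1)
# reorganised now has columns 'acid' and 'base'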
I'm relatively new to data analysis using Python and I'm trying to determine the most practical and useful way to read in my data so that I can index into it and use it in calculations. I have many images in the form of np.arrays that each have a corresponding set of data such as x- and y-coordinates, size, filter number, etc. I just want to make sure each set of data is grouped together with its corresponding image. My first thought was sticking the data in an np.array of dataclass instances (where each element of the array is an instance that contains all my data). My second thought was a pandas dataframe.
My gut is telling me that using a dataframe makes more sense. Do np.arrays store nicely inside dataframes? What are the pros/cons of each method, and which would be best if I need to pull data from them often and always need to make sure the data can be matched with its corresponding image?
What variables I have to read in: x_coord - float, y_coord - float, filter - int, image - np.ndarray.
I've been trying to stick the image arrays into a pandas dataframe, but when indexing into it using .loc the Jupyter Notebook cell is extremely slow to run. It was also very slow to populate the dataframe using .from_dict(). I'm guessing dataframes weren't meant to hold np.ndarrays?
My biggest concerns are bookkeeping and ease of indexing: what can I do to always make sure I can retrieve the metadata for the corresponding image? In what form should my data be so I can easily extract an image and its metadata, or all images with the same filter number, etc.?
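For what it's worth, one hedged sketch (toy data): keep the scalar metadata as regular columns and the image arrays in an object column, so each row ties an image to its metadata:
import numpy as np
import pandas as pd

# Hypothetical records; each row pairs one image with its metadata
records = [
    {'x_coord': 1.0, 'y_coord': 2.0, 'filter': 3, 'image': np.zeros((4, 4))},
    {'x_coord': 5.0, 'y_coord': 6.0, 'filter': 5, 'image': np.ones((4, 4))},
]
df = pd.DataFrame(records)

img = df.loc[0, 'image']             # the array travels with its row
same_filter = df[df['filter'] == 3]  # all images taken with one filter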
I need only some of the functionality of the pandas DataFrame and want to remove the rest, or restrict users from using it. So, I am planning to write my own dataframe class which would have only a subset of the methods of pandas DataFrames.
The code for pandas DataFrame object can be found here.
Theoretically you could clone the repository and re-write sections of it. However, it's not a simple object and this may take a decent amount of reading into the code to understand how it works.
For example: pandas describes the dataframe object as a
Two-dimensional size-mutable, potentially heterogeneous tabular data
structure with labeled axes (rows and columns). Arithmetic operations
align on both row and column labels. Can be thought of as a dict-like
container for Series objects.
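A lighter-weight alternative to forking the source, sketched under the assumption that wrapping is acceptable (class and method names here are hypothetical): hold a DataFrame internally and forward only a whitelist of methods:
import pandas as pd

class RestrictedFrame:
    """Expose only an approved subset of DataFrame methods."""
    _allowed = {'head', 'tail', 'describe', 'to_csv'}

    def __init__(self, *args, **kwargs):
        self._df = pd.DataFrame(*args, **kwargs)

    def __getattr__(self, name):
        # Called only when normal attribute lookup fails, so _df is safe
        if name in self._allowed:
            return getattr(self._df, name)
        raise AttributeError(name + ' is not exposed by RestrictedFrame')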
In pandas you can replace the default integer-based index with an index made up of any number of columns using set_index().
What confuses me, though, is when you would want to do this. Regardless of whether a series is a column or part of the index, you can filter its values: with boolean indexing when it's a column, or with xs() when it's in the index. You can sort on the columns or index using sort_values() or sort_index() respectively.
The only real difference I've encountered is that indexes have issues when there are duplicate values, so it seems that using an index is more restrictive, if anything.
Why then, would I want to convert my columns into an index in Pandas?
In my opinion custom indexes are good for quickly selecting data.
They're also useful for aligning data for mapping, for arithmetic operations where the index is used for data alignment, for joining data, and for getting minimal or maximal rows per group.
DatetimeIndex is nice for partial string indexing and for resampling.
But you are right, a duplicate index is problematic, especially for reindexing.
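A small sketch of the quick-selection point (toy data):
import pandas as pd

s = pd.Series([10, 20, 30], index=['a', 'b', 'c'])
s.loc['b']  # direct label lookup, no boolean mask needed

df = pd.DataFrame({'val': [1, 2, 3]},
                  index=pd.MultiIndex.from_tuples(
                      [('x', 1), ('x', 2), ('y', 1)], names=['grp', 'n']))
df.xs('x', level='grp')  # all rows for one group in a single call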
Docs:
Identifies data (i.e. provides metadata) using known indicators, important for analysis, visualization, and interactive console display
Enables automatic and explicit data alignment
Allows intuitive getting and setting of subsets of the data set
Also, you can check the Modern pandas - Indexes post.
As of 0.20.2, some methods, such as .unstack(), only work with indices.
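For example, a quick .unstack() sketch (toy data): the inner index level pivots into columns, which only works because the labels live in the index:
import pandas as pd

df = pd.DataFrame({'group': ['x', 'x', 'y'],
                   'item':  ['a', 'b', 'a'],
                   'val':   [1, 2, 3]})
df.set_index(['group', 'item'])['val'].unstack()
# item    a    b
# group
# x     1.0  2.0
# y     3.0  NaN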
Custom indices, especially indexing by time, can be particularly convenient. Besides resampling and aggregating over any time interval (the latter is done using .groupby() with pd.TimeGrouper()), both of which require a DatetimeIndex, you can call the .plot() method on a column, e.g. df['column'].plot(), and immediately get a time series plot.
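A short sketch of those DatetimeIndex conveniences (random toy data; .plot() assumes matplotlib is available):
import numpy as np
import pandas as pd

rng = pd.date_range('2017-01-01', periods=100, freq='D')
ts = pd.DataFrame({'value': np.random.randn(100)}, index=rng)

ts['value'].resample('M').mean()  # monthly aggregation straight off the index
ts.loc['2017-02']                 # partial string indexing: all of February
ts['value'].plot()                # the x-axis is the dates automatically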
The most useful, though, is alignment: for example, suppose you have two sets of data that you want to add; they're labeled consistently, but sorted in a different order. If you set their labels to be the index of their dataframe, you can simply add the dataframes together and not worry about the ordering of the data.
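Concretely (toy data): the two series below are sorted differently, but addition lines them up by label:
import pandas as pd

s1 = pd.Series([1, 2, 3], index=['a', 'b', 'c'])
s2 = pd.Series([30, 10, 20], index=['c', 'a', 'b'])
s1 + s2  # a=11, b=22, c=33: aligned by label, not by position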