Is pandas / numpy's axis the opposite of R's MARGIN? - python

Is it correct to think about these two things as being opposite? This has been a major source of confusion for me.
Below is an example where I find the column means of a data frame in R and Python. Notice the opposite values for MARGIN and axis.
In R (using MARGIN=2, i.e. the column margin):
m <- matrix(1:6, nrow=2)
apply(m, MARGIN=2, mean)
[1] 1.5 3.5 5.5
In Python (using axis=0, i.e. the row axis):
In [25]: m = pd.DataFrame(np.array([[1, 3, 5], [2, 4, 6]]))
In [26]: m.apply(np.mean, axis=0)
Out[26]:
0 1.5
1 3.5
2 5.5
dtype: float64

Confusion arises because apply() talks both about which dimension the apply is "over", as well as which dimension is retained. In other words, when you apply() over rows, the result is a vector whose length is the number of columns in the input. This particular confusion is highlighted by Pandas' documentation (but not R's):
axis : {0 or ‘index’, 1 or ‘columns’}
0 or ‘index’: apply function to each column
1 or ‘columns’: apply function to each row
As you can see, axis=0 names the index (row) dimension, but the function is applied to each column: the row dimension is the one "applied over" (thus eliminated), and the result has one entry per column.
Put another way, applying a function to each column is axis=0 or MARGIN=2, and applying it to each row is axis=1 or MARGIN=1. The 1 values appear to match, but that's spurious: 1 in Python is the second dimension, because Python is 0-based.

You are correct, the "margin" concept in R's apply is opposite to the "axis" concept in numpy/pandas' apply function.
Say we are applying the function f to a 2-dimensional array arr. The function f takes a vector input.
R: The MARGIN argument indicates which array index of arr will be held fixed within each call to f. So if MARGIN=1, each call to f applies to all of the data with the same first array index. This means the function is applied once to each row.
So, f is applied to arr[1,], arr[2,], ..., arr[n,] in turn, where n is the number of rows in arr.
numpy/pandas: The axis argument indicates which array index of arr will be varied within each call to f. So if axis=0, for each call to f, the first array index is varied to generate an input vector. This means the function is applied once to each column.
So, f is applied to arr[:,0], arr[:,1], ..., arr[:,m-1] in turn, where m is the number of columns in arr.
The difference in indexing (0-based for Python, 1-based for R) can be confusing but is not the cause of the discrepancy. I have used the appropriate syntax for each language above.
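A minimal NumPy/pandas sketch of the rule above, reusing the 2x3 data from the question. The variable names are only illustrative, and a lambda is used in place of np.mean purely to keep the example self-contained.

```python
import numpy as np
import pandas as pd

arr = np.array([[1, 3, 5],
                [2, 4, 6]])  # 2 rows, 3 columns

# axis=0: the row index varies inside each call, so there is one result per column.
col_means = arr.mean(axis=0)

# axis=1: the column index varies inside each call, so there is one result per row.
row_means = arr.mean(axis=1)

# DataFrame.apply follows the same convention as the NumPy reductions.
df = pd.DataFrame(arr)
col_means_df = df.apply(lambda col: col.mean(), axis=0)
row_means_df = df.apply(lambda row: row.mean(), axis=1)

print(col_means)  # [1.5 3.5 5.5]
print(row_means)  # [3. 4.]
```

The axis=0 result has length 3 (the number of columns), which matches what R returns for MARGIN=2.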
Alternative Explanation
R asks "along which dimensions should the function be applied?". So, indicating rows to R means that you want the function applied to each row. Meanwhile, numpy/pandas think of their "axes" as indicating directions, like the axes of a graph. So when you tell apply to work along the row axis, it treats the row axis as vertical and works vertically, applying the function to each column.

In both Pandas and R, 'axis' and 'margin' are pretty much synonyms: a data frame has a 'columns' axis or margin going down, and a 'rows' axis or margin going to the right.
Pandas and R's apply implementations differ in what they do with the axis/margin keyword, as follows.
In R, calling Rows <- 1; apply(df, Rows, sum) means
R: "'Row' is the shape of the inputs. Each invocation of f gets passed one row as an argument."
Rows <- 1
Columns <- 2
df <- data.frame(c1 = 1:2, c2 = 3:4, c3 = 5:6, row.names=c('r1', 'r2'))
df
# c1 c2 c3
# r1 1 3 5
# r2 2 4 6
apply(df, Rows, sum)
# r1 9
# r2 12
In Python, calling Rows = 0; df.apply(sum, axis=Rows) means
Pandas: "'Row' is the shape of the output. Every invocation of f gets passed one column as an argument."
import pandas as pd
Rows = 0
Columns = 1
df = pd.DataFrame(
{'c1': [1, 2], 'c2': [3, 4], 'c3': [5, 6]},
index=['r1', 'r2']
)
df
# c1 c2 c3
# r1 1 3 5
# r2 2 4 6
df.apply(sum, axis=Rows)
# c1 c2 c3
# 3 7 11

Related

How do I reverse the first four elements of the 1st axis and reverse the 2nd axis of a numpy array in a single operation?

I have a numpy array M of shape (n, 1000, 6). This can be thought of as n matrices with 1000 rows and 6 columns. For each matrix I would like to reverse the order of the rows (i.e. the top row is now at the bottom and vice versa) and then reverse the order of just the first 4 columns (so column 0 is now column 3, column 1 is column 2, column 2 is column 1 and column 3 is column 0 but column 4 is still column 4 and column 5 is still column 5). I would like to do this in a single operation, without doing indexing on the left side of the expression, so this would not be acceptable:
M[:,0:4,:] = M[:,0:4,:][:,::-1,:]
M[:,:,:] = M[:,:,::-1]
The operation needs to be achievable using the Keras backend, which disallows this. It must be of the form
M = M[indexing here that solves the task]
If I wanted to reverse the order of all the columns instead of just the first 4, this could easily be achieved with M = M[:,::-1,::-1], so I've been trying to modify this to achieve my goal, but unfortunately I can't work out how. Is this even possible?
M[:, ::-1, [3, 2, 1, 0, 4, 5]]
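A quick way to convince yourself that this one-line index does what the question describes is to compare it against the two-step version on a small array. This is a sketch checked in NumPy only; the array below uses 3 rows instead of 1000 as a stand-in, and whether the same expression is accepted by a given Keras backend is not verified here.

```python
import numpy as np

n = 2
# Small stand-in for the (n, 1000, 6) array: 3 rows instead of 1000.
M = np.arange(n * 3 * 6).reshape(n, 3, 6)

# Reverse the rows (axis 1) and permute the columns (axis 2) so that
# only the first four are reversed; columns 4 and 5 stay in place.
result = M[:, ::-1, [3, 2, 1, 0, 4, 5]]

# Reference: the same thing done in two explicit steps on a copy.
expected = M[:, ::-1, :].copy()
expected[:, :, :4] = expected[:, :, :4][:, :, ::-1].copy()

print(np.array_equal(result, expected))  # True
```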

Transform Pandas DataFrame with n-level hierarchical index into n-D Numpy array

Question
Is there a good way to transform a DataFrame with an n-level index into an n-D Numpy array (a.k.a n-tensor)?
Example
Suppose I set up a DataFrame like
from pandas import DataFrame, MultiIndex
index = range(2), range(3)
value = range(2 * 3)
frame = DataFrame(value, columns=['value'],
index=MultiIndex.from_product(index)).drop((1, 0))
print(frame)
which outputs
     value
0 0      0
  1      1
  2      2
1 1      4
  2      5
The index is a 2-level hierarchical index. I can extract a 2-D Numpy array from the data using
print(frame.unstack().values)
which outputs
[[ 0. 1. 2.]
[ nan 4. 5.]]
How does this generalize to an n-level index?
Playing with unstack(), it seems that it can only be used to massage the 2-D shape of the DataFrame, but not to add an axis.
I cannot use e.g. frame.values.reshape(x, y, z), since this would require that the frame contains exactly x * y * z rows, which cannot be guaranteed. This is what I tried to demonstrate by drop()ing a row in the above example.
Any suggestions are highly appreciated.
Edit. This approach is much more elegant (and two orders of magnitude faster) than the one I gave below.
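import numpy as np
# create an empty array of NaN of the right dimensions
# (np.full needs a tuple; in Python 3, map() returns an iterator)
shape = tuple(map(len, frame.index.levels))
arr = np.full(shape, np.nan)
# fill it using Numpy's advanced indexing (the index must be a tuple of arrays)
arr[tuple(frame.index.codes)] = frame.values.flat
# ...or in Pandas < 0.24.0, use
# arr[tuple(frame.index.labels)] = frame.values.flat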
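The advanced-indexing approach can be wrapped in a small helper and checked against the 2-level example from the question. This is a sketch only: the function name multiindex_to_ndarray and its column argument are made up for illustration, and it assumes pandas 0.24 or later (for index.codes).

```python
import numpy as np
import pandas as pd

def multiindex_to_ndarray(frame, column):
    """Return a float array with one axis per index level; missing rows become NaN."""
    index = frame.index
    shape = tuple(len(level) for level in index.levels)
    arr = np.full(shape, np.nan)
    # index.codes holds, for each level, the integer position of every row's label.
    arr[tuple(index.codes)] = frame[column].to_numpy()
    return arr

index = pd.MultiIndex.from_product([range(2), range(3)])
frame = pd.DataFrame({'value': np.arange(6)}, index=index).drop((1, 0))

arr = multiindex_to_ndarray(frame, 'value')
print(arr)
```

For this frame the result has shape (2, 3), with NaN at position (1, 0) where the row was dropped.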
Original solution. Given a setup similar to above, but in 3-D,
from pandas import DataFrame, MultiIndex
from itertools import product
index = range(2), range(2), range(2)
value = range(2 * 2 * 2)
frame = DataFrame(value, columns=['value'],
index=MultiIndex.from_product(index)).drop((1, 0, 1))
print(frame)
we have
value
0 0 0 0
1 1
1 0 2
1 3
1 0 0 4
1 0 6
1 7
Now, we proceed using the reshape() route, but with some preprocessing to ensure that the length along each dimension will be consistent.
First, reindex the data frame with the full cartesian product of all dimensions. NaN values will be inserted as needed. This operation can be both slow and consume a lot of memory, depending on the number of dimensions and on the size of the data frame.
levels = map(tuple, frame.index.levels)
index = list(product(*levels))
frame = frame.reindex(index)
print(frame)
which outputs
value
0 0 0 0
1 1
1 0 2
1 3
1 0 0 4
1 NaN
1 0 6
1 7
Now, reshape() will work as intended.
shape = tuple(map(len, frame.index.levels))
print(frame.values.reshape(shape))
which outputs
[[[ 0. 1.]
[ 2. 3.]]
[[ 4. nan]
[ 6. 7.]]]
The (rather ugly) one-liner is
frame.reindex(list(product(*map(tuple, frame.index.levels)))).values\
.reshape(tuple(map(len, frame.index.levels)))
This can be done quite nicely using the Python xarray package which can be found here: http://xarray.pydata.org/en/stable/. It has great integration with Pandas and is quite intuitive once you get to grips with it.
If you have a multiindex series you can call the built-in method multiindex_series.to_xarray() (https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.to_xarray.html). This will generate a DataArray object, which is essentially a name-indexed numpy array, using the index values and names as coordinates. Following this you can call .values on the DataArray object to get the underlying numpy array.
If you need your tensor to conform to a set of keys in a specific order, you can also call .reindex(index_name = index_values_in_order) (http://xarray.pydata.org/en/stable/generated/xarray.DataArray.reindex.html) on the DataArray. This can be extremely useful and makes working with the newly generated tensor much easier!

Perform function on multiple columns in python

I have a data array of 30 trials (columns), each of 256 data points (rows), and would like to run a wavelet transform (which requires a 1D array) on each column, with the eventual aim of obtaining the mean coefficients of the 30 trials.
Can someone point me in the right direction please?
If you have a multidimensional numpy array then you can use a for loop:
import numpy as np
A = np.array([[1,2,3], [4,5,6]])
# A is the matrix: 1 2 3
# 4 5 6
for col in A.transpose():
    print("Column:", col)
    # Perform your wavelet transform here, you can save the
    # results to another multidimensional array.
This gives you access to each column as a 1D array.
Output:
Column: [1 4]
Column: [2 5]
Column: [3 6]
If you want to access the rows rather than the columns then loop through A rather than A.transpose().
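If the per-column function returns an array of fixed length, np.apply_along_axis can replace the explicit loop and collect the results for averaging. The sketch below is only an illustration under assumptions: transform is a hypothetical stand-in (an FFT magnitude, not a wavelet transform) and the data is random, shaped like the 256 x 30 array in the question.

```python
import numpy as np

def transform(signal):
    """Hypothetical stand-in for a 1-D wavelet transform: returns one coefficient array."""
    return np.abs(np.fft.rfft(signal))

rng = np.random.default_rng(0)
data = rng.normal(size=(256, 30))   # 256 data points (rows) x 30 trials (columns)

# Apply the 1-D function to every column; the outputs are stacked column-wise.
coeffs = np.apply_along_axis(transform, 0, data)   # shape (129, 30)

# Average the coefficients across the 30 trials.
mean_coeffs = coeffs.mean(axis=1)                  # shape (129,)

print(coeffs.shape, mean_coeffs.shape)
```

Swapping in a real wavelet routine should only require replacing the body of transform, provided it returns a single array of the same length for every column.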

Numpy nanmean and dataframe (possible bug?)

I'm wondering if this is a bug, or possibly I don't understand how nanmean should work with a dataframe. Seems to work if I convert the dataframe to an array, but not directly on the dataframe, nor is any exception raised. Originally noticed here: Fill data gaps with average of data from adjacent days
df1 = DataFrame({ 'x': [1,3,np.nan] })
df2 = DataFrame({ 'x': [2,np.nan,5] })
x
0 1
1 3
2 NaN
x
0 2
1 NaN
2 5
In [1503]: np.nanmean( [df1,df2], axis=0 )
Out[1503]:
x
0 1.5
1 NaN
2 NaN
In [1504]: np.nanmean( [df1.values, df2.values ], axis=0 )
Out[1504]:
array([[ 1.5],
[ 3. ],
[ 5. ]])
It's definitely strange behavior. I don't have the answers, but it mostly seems that entire pandas DataFrames can be elements of numpy arrays, which results in strange behavior. I'm guessing this should be avoided as much as possible, and I'm not sure why DataFrames are valid numpy elements at all.
np.nanmean probably converts the arguments into an np.array before applying operations. So let's look at
a = np.array([df1, df2])
First note that this is not a 3-d array like you might think, it's actually a 1-d array, where each element is a DataFrame.
print(a.shape)
# (2,)
print(type(a[0]))
# <class 'pandas.core.frame.DataFrame'>
So nanmean is taking the mean of both of the DataFrames, not of the values inside the dataframes. This also means that the axis argument isn't actually doing anything, and if you try using axis=1 you'll get an error because it's a 1-d array.
np.nanmean(a, axis=1)
# IndexError: tuple index out of range
print(np.nanmean(a))
# x
# 0 1.5
# 1 NaN
# 2 NaN
That's why you're getting a different answer than when you create the array with values. When you use values, it properly creates the 3-d array of numbers, rather than the weird 1-d array of dataframes.
b = np.array([df1.values, df2.values ])
print(b.shape)
# (2, 3, 1)
print(type(b[1]))
# <class 'numpy.ndarray'>
print(type(b[0,0,0]))
# <class 'numpy.float64'>
These arrays of dataframes have some especially weird behavior though. Say that we make a 3-length array where the third element is np.nan. You might expect to get the same answer from nanmean as we did with a before, as it should exclude the nan value, right?
print(np.nanmean(np.array([df1, df2, np.nan])))
# x
# 0 NaN
# 1 NaN
# 2 NaN
Yea, so I'm not sure. Best to avoid making these.
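A safer route, sketched below, is to stack the underlying float arrays explicitly so that nanmean only ever sees numbers, and then wrap the result back into a DataFrame. It uses the same df1 and df2 as the question; the names stacked, means and result are just for illustration.

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame({'x': [1, 3, np.nan]})
df2 = pd.DataFrame({'x': [2, np.nan, 5]})

# Stack the numeric values into a (2, 3, 1) array, then average over
# the first axis (across the two frames) while ignoring NaN.
stacked = np.stack([df1.to_numpy(), df2.to_numpy()])
means = np.nanmean(stacked, axis=0)

# Wrap the result back into a DataFrame with the original labels.
result = pd.DataFrame(means, index=df1.index, columns=df1.columns)
print(result)
```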

Grouping boxplots in seaborn when input is a DataFrame

I intend to plot multiple columns in a pandas dataframe, all grouped by another column using groupby inside seaborn.boxplot. There is a nice answer for a similar problem in matplotlib (matplotlib: Group boxplots), but given that seaborn.boxplot comes with a groupby option, I thought it could be much easier to do this in seaborn.
Here we go with a reproducible example that fails:
import seaborn as sns
import pandas as pd
df = pd.DataFrame([[2, 4, 5, 6, 1], [4, 5, 6, 7, 2], [5, 4, 5, 5, 1],
[10, 4, 7, 8, 2], [9, 3, 4, 6, 2], [3, 3, 4, 4, 1]],
columns=['a1', 'a2', 'a3', 'a4', 'b'])
# display(df)
a1 a2 a3 a4 b
0 2 4 5 6 1
1 4 5 6 7 2
2 5 4 5 5 1
3 10 4 7 8 2
4 9 3 4 6 2
5 3 3 4 4 1
#Plotting by seaborn
sns.boxplot(df[['a1','a2', 'a3', 'a4']], groupby=df.b)
What I get is something that completely ignores groupby option:
Whereas if I do this with one column it works thanks to another SO question Seaborn groupby pandas Series :
sns.boxplot(df.a1, groupby=df.b)
So I would like to get all my columns in one plot (all columns come in a similar scale).
EDIT:
The above SO question was edited and now includes a 'not clean' answer to this problem, but it would be nice if someone has a better idea for this problem.
As the other answers note, the boxplot function is limited to plotting a single "layer" of boxplots, and the groupby parameter only has an effect when the input is a Series and you have a second variable you want to use to bin the observations into each box.
However, you can accomplish what I think you're hoping for with the factorplot function (renamed catplot in seaborn 0.9), using kind="box". But, you'll first have to "melt" the sample dataframe into what is called long-form or "tidy" format where each column is a variable and each row is an observation:
df_long = pd.melt(df, "b", var_name="a", value_name="c")
Then it's very simple to plot:
sns.catplot(x="a", hue="b", y="c", data=df_long, kind="box")
You can use directly boxplot (I imagine when the question was asked, that was not possible, but with seaborn version > 0.6 it is).
As explained by @mwaskom, you have to "melt" the sample dataframe into its "long-form" where each column is a variable and each row is an observation:
df_long = pd.melt(df, "b", var_name="a", value_name="c")
# display(df_long.head())
b a c
0 1 a1 2
1 2 a1 4
2 1 a1 5
3 2 a1 10
4 2 a1 9
Then you just plot it:
sns.boxplot(x="a", hue="b", y="c", data=df_long)
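To see what the melted frame gives the plot to work with, the sketch below uses pandas only (no plotting) on the question's data. Each (a, b) pair becomes one box, and the group medians computed here correspond to the centre lines of those boxes; the medians variable is only for illustration.

```python
import pandas as pd

df = pd.DataFrame([[2, 4, 5, 6, 1], [4, 5, 6, 7, 2], [5, 4, 5, 5, 1],
                   [10, 4, 7, 8, 2], [9, 3, 4, 6, 2], [3, 3, 4, 4, 1]],
                  columns=['a1', 'a2', 'a3', 'a4', 'b'])

# Long form: one row per (original row, original column) pair.
df_long = pd.melt(df, id_vars="b", var_name="a", value_name="c")
print(df_long.shape)  # (24, 3)

# One box is drawn per (a, b) pair; these medians are the box centre lines.
medians = df_long.groupby(["a", "b"])["c"].median()
print(medians)
```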
Seaborn's groupby function takes Series not DataFrames, that's why it's not working.
As a workaround, you can do this:
import matplotlib.pyplot as plt
fig, ax = plt.subplots(1, 2, sharey=True)
for i, grp in enumerate(df.filter(regex="a").groupby(by=df.b)):
    # grp is a (group key, sub-DataFrame) pair
    sns.boxplot(grp[1], ax=ax[i])
it gives :
Note that df.filter(regex="a") is equivalent to df[['a1','a2', 'a3', 'a4']]
a1 a2 a3 a4
0 2 4 5 6
1 4 5 6 7
2 5 4 5 5
3 10 4 7 8
4 9 3 4 6
5 3 3 4 4
Hope this helps
It isn't really any better than the answer you linked, but I think the way to achieve this in seaborn is using the FacetGrid feature, as the groupby parameter is only defined for Series passed to the boxplot function.
Here's some code - the pd.melt is necessary because (as best I can tell) the facet mapping can only take individual columns as parameters, so the data need to be turned into a 'long' format.
g = sns.FacetGrid(pd.melt(df, id_vars='b'), col='b')
g.map(sns.boxplot, 'value', 'variable')
It's not adding a lot to this conversation, but after struggling with this for longer than warranted (the actual clusters are unusable), I thought I would add my implementation as another example. It's got a superimposed scatterplot (because of how annoying my dataset is), shows melt using indices, and some aesthetic tweaks. I hope this is useful for someone.
Here it is without using column headers (I saw a different thread that wanted to know how to do this using indices):
combined_array: ndarray = np.concatenate([dbscan_output.data, dbscan_output.labels.reshape(-1, 1)], axis=1)
cluster_data_df: DataFrame = DataFrame(combined_array)
If you want to use labelled columns:
column_names: List[str] = list(outcome_variable_names)
column_names.append('cluster')
# set_axis returns a new frame; the inplace argument was removed in pandas 2.0
cluster_data_df = cluster_data_df.set_axis(column_names, axis='columns')
graph_data: DataFrame = pd.melt(
frame=cluster_data_df,
id_vars=['cluster'],
# value_vars is an optional param - by default it uses columns except the id vars, but I've included it as an example
# value_vars=['outcome_var_1', 'outcome_var_2', 'outcome_var_3', 'outcome_var_4', 'outcome_var_5', 'outcome_var_6']
var_name='psychometric_test',
value_name='standard deviations from the mean'
)
The resulting dataframe (rows = sample_n x variable_n (in my case 1626 x 6 = 9756)):
index  cluster  psychometric_test  standard deviations from the mean
0      0.0      outcome_var_1      -1.276182
1      0.0      outcome_var_1      -1.118813
2      0.0      outcome_var_1      -1.276182
...
9754   0.0      outcome_var_6       0.892548
9755   0.0      outcome_var_6       1.420480
If you want to use indices with melt:
graph_data: DataFrame = pd.melt(
frame=cluster_data_df,
id_vars=cluster_data_df.columns[-1],
# value_vars=cluster_data_df.columns[:-1],
var_name='psychometric_test',
value_name='standard deviations from the mean'
)
And here's the graphing code:
(Done with column headings - just note that y-axis=value_name, x-axis = var_name, hue = id_vars):
# plot graph grouped by cluster
# font_scale and style are seaborn settings, not matplotlib Figure methods
sns.set_theme(style="ticks", font_scale=1.2)
sns.set_style("white")
fig = plt.figure(figsize=(10, 10))
# create boxplot
fig.ax = sns.boxplot(y='standard deviations from the mean', x='psychometric_test', hue='cluster', showfliers=False,
data=graph_data)
# set box alpha:
for patch in fig.ax.artists:
    r, g, b, a = patch.get_facecolor()
    patch.set_facecolor((r, g, b, .2))
# create scatterplot
fig.ax = sns.stripplot(y='standard deviations from the mean', x='psychometric_test', hue='cluster', data=graph_data,
dodge=True, alpha=.25, zorder=1)
# customise legend:
cluster_n: int = dbscan_output.n_clusters
## create list with legend text
i = 0
cluster_info: Dict[int, int] = dbscan_output.cluster_sizes # custom method
legend_labels: List[str] = []
while i < cluster_n:
    label: str = f"cluster {i+1}, n = {cluster_info[i]}"
    legend_labels.append(label)
    i += 1
if -1 in cluster_info.keys():
    cluster_n += 1
    label: str = f"Unclustered, n = {cluster_info[-1]}"
    legend_labels.insert(0, label)
## fetch existing handles and legends (each tuple will have 2*cluster number -> 1 for each boxplot cluster, 1 for each scatterplot cluster, so I will remove the first half)
handles, labels = fig.ax.get_legend_handles_labels()
index: int = int(cluster_n*(-1))
labels = legend_labels
plt.legend(handles[index:], labels[0:])
plt.xticks(rotation=45)
plt.show()
Just a note: Most of my time was spent debugging the melt function. I predominantly got the error "*only integer scalar arrays can be converted to a scalar index with 1D numpy indices array*". My output required me to concatenate my outcome variable value table and the clusters (DBSCAN), and I'd put extra square brackets around the cluster array in the concat method. So I had a column where each value was an invisible List[int], rather than a plain int. It's pretty niche, but maybe it'll help someone.