Numpy nanmean and dataframe (possible bug?) - python

I'm wondering if this is a bug, or possibly I don't understand how nanmean should work with a dataframe. Seems to work if I convert the dataframe to an array, but not directly on the dataframe, nor is any exception raised. Originally noticed here: Fill data gaps with average of data from adjacent days
df1 = DataFrame({ 'x': [1,3,np.nan] })
df2 = DataFrame({ 'x': [2,np.nan,5] })
     x
0    1
1    3
2  NaN
     x
0    2
1  NaN
2    5
In [1503]: np.nanmean( [df1,df2], axis=0 )
Out[1503]:
     x
0  1.5
1  NaN
2  NaN
In [1504]: np.nanmean( [df1.values, df2.values ], axis=0 )
Out[1504]:
array([[ 1.5],
       [ 3. ],
       [ 5. ]])

It's definitely strange behavior. I don't have all the answers, but the root of it seems to be that entire pandas DataFrames can be elements of numpy arrays, which leads to odd results. I'm guessing this should be avoided as much as possible, and I'm not sure why DataFrames are valid numpy elements at all.
np.nanmean probably converts its arguments into an np.array before applying the operation. So let's look at
a = np.array([df1, df2])
First, note that this is not a 3-d array like you might think; it's actually a 1-d object array where each element is a DataFrame.
print(a.shape)
# (2,)
print(type(a[0]))
# <class 'pandas.core.frame.DataFrame'>
So nanmean is taking the mean of both of the DataFrames, not of the values inside the dataframes. This also means that the axis argument isn't actually doing anything, and if you try using axis=1 you'll get an error because it's a 1-d array.
np.nanmean(a, axis=1)
# IndexError: tuple index out of range
print(np.nanmean(a))
#      x
# 0  1.5
# 1  NaN
# 2  NaN
That's why you're getting a different answer than when you create the array with values. When you use values, it properly creates the 3-d array of numbers, rather than the weird 1-d array of dataframes.
b = np.array([df1.values, df2.values ])
print(b.shape)
# (2, 3, 1)
print(type(b[1]))
# <class 'numpy.ndarray'>
print(type(b[0,0,0]))
# <class 'numpy.float64'>
These arrays of dataframes have some especially weird behavior though. Say that we make a 3-length array where the third element is np.nan. You might expect to get the same answer from nanmean as we did with a before, as it should exclude the nan value, right?
print(np.nanmean(np.array([df1, df2, np.nan])))
#     x
# 0 NaN
# 1 NaN
# 2 NaN
Yeah, so I'm not sure. Best to avoid making these object arrays of DataFrames at all.
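If the goal is just an element-wise nanmean of the two frames, here is a minimal sketch of a workaround, assuming both frames share the same shape, index, and columns: stack the underlying values into a real 3-d array and wrap the result back up afterwards.
import numpy as np
from pandas import DataFrame

df1 = DataFrame({'x': [1, 3, np.nan]})
df2 = DataFrame({'x': [2, np.nan, 5]})

# stack the raw values into a genuine (2, 3, 1) float array, take the
# nanmean across the first axis, then wrap the result back in a DataFrame
stacked = np.stack([df1.values, df2.values])
result = DataFrame(np.nanmean(stacked, axis=0),
                   index=df1.index, columns=df1.columns)
print(result)
#      x
# 0  1.5
# 1  3.0
# 2  5.0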

Related

How to compare if any value is similar to any other using numpy

I have many pairs of coordinate arrays like so
a=[(1.001,3),(1.334, 4.2),...,(17.83, 3.4)]
b=[(1.002,3.0001),(1.67, 5.4),...,(17.8299, 3.4)]
c=[(1.00101,3.002),(1.3345, 4.202),...,(18.6, 12.511)]
Any coordinate in any of the pairs can be a duplicate of another coordinate in another array of pairs. The arrays are also not the same size.
The duplicates will vary slightly in value; for example, I would consider the first coordinate in a, b and c to be duplicates.
I could iterate through each array and compare the values one by one using numpy.isclose; however, that will be slow.
Is there an efficient way to tackle this problem, hopefully using numpy to keep computing times low?
You might want to try the round() function, which will round off the numbers in your lists to the nearest integers.
The next thing I'd suggest might be too extreme: concat the arrays, put them into a pandas dataframe and drop_duplicates() (a sketch of this follows below).
This might not be the solution you want.
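A minimal sketch of that suggestion, rounding to 2 decimal places instead of whole integers, and using short made-up arrays in place of the real data:
import pandas as pd

# made-up stand-ins for the question's a, b and c
a = [(1.001, 3), (1.334, 4.2), (17.83, 3.4)]
b = [(1.002, 3.0001), (1.67, 5.4), (17.8299, 3.4)]
c = [(1.00101, 3.002), (1.3345, 4.202), (18.6, 12.511)]

# round to absorb the small differences, concatenate, then drop duplicates
df = pd.concat([pd.DataFrame(arr).round(2) for arr in (a, b, c)])
print(df.drop_duplicates().to_numpy())
# [[ 1.    3.  ]
#  [ 1.33  4.2 ]
#  [17.83  3.4 ]
#  [ 1.67  5.4 ]
#  [18.6  12.51]]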
You might want to take a look at numpy.testing if you allow for AssertionError handling.
import numpy as np
from numpy import testing as ts

a = np.array((1.001, 3))
b = np.array((1.000101, 3.002))
ts.assert_array_almost_equal(a, b, decimal=1)  # output None
but
ts.assert_array_almost_equal(a, b, decimal=3)
results in
AssertionError:
Arrays are not almost equal to 3 decimals
Mismatch: 50%
Max absolute difference: 0.002
Max relative difference: 0.00089891
x: array([1.001, 3.   ])
y: array([1.   , 3.002])
There are some more interesting functions from numpy.testing. Make sure to take a look at the docs.
I'm using pandas to give you an intuitive result, rather than just numbers. Of course you can expand the solution to your needs.
Say you create a pd.DataFrame from each array and tag each one with the array it came from. I am rounding the results to 2 decimal places; you may use whatever tolerance you want.
dfa = pd.DataFrame(a).round(2)
dfa['arr'] = 'a'
# build dfb and dfc the same way from b and c
Then, by concatenating, using duplicated and sorting, you get an intuitive DataFrame that might fulfill your needs:
df = pd.concat([dfa, dfb, dfc])
df[df.duplicated(subset=[0,1], keep=False)].sort_values(by=[0,1])
yields
       x    y arr
0   1.00  3.0   a
0   1.00  3.0   b
0   1.00  3.0   c
1   1.33  4.2   a
1   1.33  4.2   c
2  17.83  3.4   a
2  17.83  3.4   b
The indexes are duplicated, so you can simply use reset_index() at the end and use the newly-generated column as a parameter that indicates the corresponding index on each array. I.e.:
   index      x    y arr
0      0   1.00  3.0   a
1      0   1.00  3.0   b
2      0   1.00  3.0   c
3      1   1.33  4.2   a
4      1   1.33  4.2   c
5      2  17.83  3.4   a
6      2  17.83  3.4   b
So, for example, line 0 indicates a duplicate coordinate, and is found on index 0 of arr a. Line 1 also indicates a dupe coordinate, found on index 0 of arr b, etc.
Now, if you just want to delete the duplicates and get one final array with only non-duplicate values, you may use drop_duplicates:
df.drop_duplicates(subset=[0,1])[[0,1]].to_numpy()
which yields
array([[ 1.  ,  3.  ],
       [ 1.33,  4.2 ],
       [17.83,  3.4 ],
       [ 1.67,  5.4 ],
       [18.6 , 12.51]])

Pandas iloc wrong index causing problems with subtraction

I should start by saying that I am quite new to pandas and numpy (and machine learning in general).
I am trying to learn some basic machine learning algorithms and am doing linear regression. I have completed this problem using matlab, but wanted to try implementing it in python - as that is a more practically used language. I am having a very difficult time doing basic matrix operations with these libraries and I think it's down to a lack of understanding of how pandas is indexing the dataframe...
I have found several posts talking about the differences between iloc and ix and that ix is being deprecated so use iloc, but iloc is causing me loads of issues. I am simply trying to pull the first n-1 columns out of a dataframe into a new dataframe, then the final column into another dataframe to get my label values. Then I want to perform the cost function one time to see what my current cost is with theta = 0. Currently, my dataset has only one label - but I'd like to code as if I had more. Here is my code and my output:
path = os.getcwd() + '\\ex1data1.txt'
data = pd.read_csv(path, header=None)
numRows = data.shape[0]
numCols = data.shape[1]
X = data.iloc[:,0:numCols-1].copy()
theta = pd.DataFrame(np.zeros((X.shape[1], 1)))
y = data.iloc[:,-1].copy()
# start computing cost: sum((X*theta - y).^2)
predictions = X.dot(theta)
print("predictions shape: {0}".format(predictions.shape))
print(predictions.head())
print("y shape: {0}".format(y.shape))
print(y.head())
errors = predictions.subtract(y)
print("errors shape: {0}".format(errors.shape))
print(errors.head())
output:
predictions shape: (97, 1)
     0
0  0.0
1  0.0
2  0.0
3  0.0
4  0.0
y shape: (97, 1)
         1
0  17.5920
1   9.1302
2  13.6620
3  11.8540
4   6.8233
errors shape: (97, 2)
    0   1
0 NaN NaN
1 NaN NaN
2 NaN NaN
3 NaN NaN
4 NaN NaN
I can see that y and X have the same shape, but for some reason when I display them it seems that y is beginning its indexing at column 1 (its original position in the first dataframe) while X has its original column of 0. As a result, pandas aligns on the column labels when doing the subtraction and fills any missing values with NaN. As y has no column 0 values they are all NaN, and as X has no column 1 values they are all NaN, resulting in a 97x2 NaN matrix.
If I use y = data.ix[:,-1:0], the above code does the correct calculations. Output:
errors shape: (97, 1)
        0
0 -6.1101
1 -5.5277
2 -8.5186
3 -7.0032
4 -5.8598
But I am trying to stay away from ix since I've read it is being deprecated.
How do I tell pandas that the new matrix has a start column of 0, and why is this not the default behavior?
Looks like the calculation you actually want to do is on the series (individual columns). So you should be able to do:
predictions[0].subtract(y[1])
to get the value you want. This looks kind of confusing because you have numbers as DataFrame column labels: you are selecting the columns you want (0 and 1) and performing the subtraction between them.
Or, using iloc as you originally suggested, which gives you more matrix-style indexing, you could do this:
predictions.iloc[:, 0].subtract(y.iloc[:, 0])
This works because in each DataFrame you want all the rows and the first column.
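If you would rather keep working with whole DataFrames, another way to sidestep the column-label alignment (not part of the answer above, just a sketch with made-up sample data) is to drop to the underlying numpy arrays before subtracting:
import numpy as np
import pandas as pd

# made-up stand-in for the question's data: column 0 is the feature, column 1 the label
data = pd.DataFrame({0: [6.1101, 5.5277, 8.5186],
                     1: [17.5920, 9.1302, 13.6620]})
X = data.iloc[:, 0:1]                          # DataFrame with column label 0
y = data.iloc[:, -1:]                          # DataFrame with column label 1
theta = pd.DataFrame(np.zeros((X.shape[1], 1)))

predictions = X.dot(theta)                     # still labelled 0, so subtracting y directly misaligns
errors = pd.DataFrame(predictions.values - y.values)   # plain arrays, no label alignment
print(errors)
#          0
# 0 -17.5920
# 1  -9.1302
# 2 -13.6620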

Is pandas / numpy's axis the opposite of R's MARGIN?

Is it correct to think about these two things as being opposite? This has been a major source of confusion for me.
Below is an example where I find the column sums of a data frame in R and Python. Notice the opposite values for MARGIN and axis.
In R (using MARGIN=2, i.e. the column margin):
m <- matrix(1:6, nrow=2)
apply(m, MARGIN=2, mean)
[1] 1.5 3.5 5.5
In Python (using axis=0, i.e. the row axis):
In [25]: m = pd.DataFrame(np.array([[1, 3, 5], [2, 4, 6]]))
In [26]: m.apply(np.mean, axis=0)
Out[26]:
0    1.5
1    3.5
2    5.5
dtype: float64
Confusion arises because apply() talks both about which dimension the apply is "over" and about which dimension is retained. In other words, when you apply() over rows, the result is a vector whose length is the number of columns in the input. This particular confusion is highlighted by Pandas' documentation (but not R's):
axis : {0 or ‘index’, 1 or ‘columns’}
0 or ‘index’: apply function to each column
1 or ‘columns’: apply function to each row
As you can see, 0 means the index (row) dimension is retained, and the column dimension is "applied over" (thus eliminated).
Put another way, application over columns is axis=0 or MARGIN=2, and application over rows is axis=1 or MARGIN=1. The 1 values appear to match, but that's spurious: 1 in Python is the second dimension, because Python is 0-based.
You are correct, the "margin" concept in R's apply is opposite to the "axis" concept in numpy/panda's apply function.
Say we are applying the function f to a 2-dimensional array arr. The function f takes a vector input.
R: The MARGIN argument indicates which array index of arr will be held fixed within each call to f. So if MARGIN=1, each call to f applies to all of the data with the same first array index. This means the function is applied once to each row.
So, f is applied to arr[1,], arr[2,], ..., arr[n,] in turn, where n is the number of rows in arr.
numpy/pandas: The axis argument indicates which array index of arr will be varied within each call to f. So if axis=0, for each call to f, the first array index is varied to generate an input vector. This means the function is applied once to each column.
So, f is applied to arr[:,0], arr[:,1], ..., arr[:,m-1] in turn, where m is the number of columns in arr.
The difference in indexing (0-based for Python, 1-based for R) can be confusing but is not the cause of the discrepancy. I have used the appropriate syntax for each language above.
Alternative Explanation
R asks "along which dimensions should the function be applied?". So, indicating rows to R means that you want the function applied to each row. Meanwhile numpy/pandas think of its "axes" as indicating directions, like the axes of a graph. So when you tell apply to work along the row axis, it figures the row axis is vertical, and it works vertically, applying the function to each column.
In both Pandas and R, 'axis' and 'margin' are pretty much synonyms: a data frame has a 'columns' axis or margin going down, and a 'rows' axis or margin going to the right.
Pandas and R's apply implementations differ in what they do with the axis/margin keyword, as follows.
In R, calling Rows <- 1; apply(df, Rows, sum) means
R: "'Row' is the shape of the inputs. Each invocation of f gets passed one row as an argument."
Rows <- 1
Columns <- 2
df <- data.frame(c1 = 1:2, c2 = 3:4, c3 = 5:6, row.names=c('r1', 'r2'))
df
#    c1 c2 c3
# r1  1  3  5
# r2  2  4  6
apply(df, Rows, sum)
# r1 9
# r2 12
In Python, calling Rows = 0; df.apply(sum, axis=Rows) means
Pandas: "'Row' is the shape of the output. Every invocation of f gets passed one column as an argument."
import pandas as pd
Rows = 0
Columns = 1
df = pd.DataFrame(
{'c1': [1, 2], 'c2': [3, 4], 'c3': [5, 6]},
index=['r1', 'r2']
)
df
#     c1  c2  c3
# r1   1   3   5
# r2   2   4   6
df.apply(sum, axis=Rows)
# c1 c2 c3
# 3 7 11
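For comparison (my addition, not part of the original answer), applying along the other axis with the same df and the Columns constant defined above collapses each row instead, which matches the R apply(df, Rows, sum) result shown earlier:
df.apply(sum, axis=Columns)
# r1     9
# r2    12
# dtype: int64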

Python DeprecationWarning: Passing 1d arrays as data is deprecated in 0.17 and will raise ValueError in 0.19

Based on my previous question python pandas standardize column for regression I am rescaling specific columns in my dataframe to be between 0 and 1.
scaler = preprocessing.MinMaxScaler(feature_range=(0,1))
email['scaled_quantity'] = scaler.fit_transform(email['Quantity'])
Unfortunately, I get this error
DeprecationWarning: Passing 1d arrays as data is deprecated in 0.17 and will raise ValueError in 0.19. Reshape your data either using X.reshape(-1, 1) if your data has a single feature or X.reshape(1, -1) if it contains a single sample.
warnings.warn(DEPRECATION_MSG_1D, DeprecationWarning)
#Grr proposed that I apply the scaling to the whole dataframe, but that is not an option. I need to maintain the columns the way they are and only want to add additional scaled columns.
How can I address this deprecation warning?
What about doing
scaler.fit_transform(email[['Quantity']])
instead of
scaler.fit_transform(email['Quantity'])
Demo: I used your sample data set from the previous question:
In [56]: scaler.fit_transform(df[['Event_Counts']])
Out[56]:
array([[ 0.99722347],
       [ 1.        ],
       [ 0.        ]])
Notice - it produced an array with the shape (3, 1) instead of (3,), because selecting with double brackets (df[['Event_Counts']]) returns a one-column DataFrame (2-D), whereas single brackets return a 1-D Series.
as a new column:
In [58]: df['scaled_event_counts'] = scaler.fit_transform(df[['Event_Counts']])
In [59]: df
Out[59]:
       Date  Event_Counts  Category_A  Category_B  scaled_event_counts
0  20170401        982457           0           1             0.997223
1  20170402        982754           1           0             1.000000
2  20170402        875786           0           1             0.000000
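Alternatively, here is a sketch (not part of the original answer) that follows the warning's own suggestion: reshape the 1-d values into a single-feature 2-d array before scaling, using the same email DataFrame and scaler from the question.
scaled = scaler.fit_transform(email['Quantity'].values.reshape(-1, 1))
email['scaled_quantity'] = scaled.ravel()   # flatten back to 1-d for the new column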

Transform Pandas DataFrame with n-level hierarchical index into n-D Numpy array

Question
Is there a good way to transform a DataFrame with an n-level index into an n-D Numpy array (a.k.a n-tensor)?
Example
Suppose I set up a DataFrame like
from pandas import DataFrame, MultiIndex
index = range(2), range(3)
value = range(2 * 3)
frame = DataFrame(value, columns=['value'],
                  index=MultiIndex.from_product(index)).drop((1, 0))
print(frame)
which outputs
     value
0 0      0
  1      1
  2      2
1 1      4
  2      5
The index is a 2-level hierarchical index. I can extract a 2-D Numpy array from the data using
print(frame.unstack().values)
which outputs
[[  0.   1.   2.]
 [ nan   4.   5.]]
How does this generalize to an n-level index?
Playing with unstack(), it seems that it can only be used to massage the 2-D shape of the DataFrame, but not to add an axis.
I cannot use e.g. frame.values.reshape(x, y, z), since this would require that the frame contains exactly x * y * z rows, which cannot be guaranteed. This is what I tried to demonstrate by drop()ing a row in the above example.
Any suggestions are highly appreciated.
Edit. This approach is much more elegant (and two orders of magnitude faster) than the one I gave below.
# create an empty array of NaN of the right dimensions
shape = [len(level) for level in frame.index.levels]
arr = np.full(shape, np.nan)
# fill it using Numpy's advanced indexing on the integer codes of each level
arr[tuple(frame.index.codes)] = frame.values.flat
# ...or in pandas < 0.24.0, use
# arr[tuple(frame.index.labels)] = frame.values.flat
Original solution. Given a setup similar to above, but in 3-D,
from pandas import DataFrame, MultiIndex
from itertools import product
index = range(2), range(2), range(2)
value = range(2 * 2 * 2)
frame = DataFrame(value, columns=['value'],
                  index=MultiIndex.from_product(index)).drop((1, 0, 1))
print(frame)
we have
       value
0 0 0      0
    1      1
  1 0      2
    1      3
1 0 0      4
  1 0      6
    1      7
Now, we proceed using the reshape() route, but with some preprocessing to ensure that the length along each dimension will be consistent.
First, reindex the data frame with the full cartesian product of all dimensions. NaN values will be inserted as needed. This operation can be both slow and consume a lot of memory, depending on the number of dimensions and on the size of the data frame.
levels = map(tuple, frame.index.levels)
index = list(product(*levels))
frame = frame.reindex(index)
print(frame)
which outputs
       value
0 0 0      0
    1      1
  1 0      2
    1      3
1 0 0      4
    1    NaN
  1 0      6
    1      7
Now, reshape() will work as intended.
shape = [len(level) for level in frame.index.levels]
print(frame.values.reshape(shape))
which outputs
[[[  0.   1.]
  [  2.   3.]]

 [[  4.  nan]
  [  6.   7.]]]
The (rather ugly) one-liner is
frame.reindex(list(product(*map(tuple, frame.index.levels)))).values\
    .reshape([len(level) for level in frame.index.levels])
This can be done quite nicely using the Python xarray package which can be found here: http://xarray.pydata.org/en/stable/. It has great integration with Pandas and is quite intuitive once you get to grips with it.
If you have a multiindex series you can call the built-in method multiindex_series.to_xarray() (https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.to_xarray.html). This will generate a DataArray object, which is essentially a name-indexed numpy array, using the index values and names as coordinates. Following this you can call .values on the DataArray object to get the underlying numpy array.
If you need your tensor to conform to a set of keys in a specific order, you can also call .reindex(index_name = index_values_in_order) (http://xarray.pydata.org/en/stable/generated/xarray.DataArray.reindex.html) on the DataArray. This can be extremely useful and makes working with the newly generated tensor much easier!
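A minimal sketch of that route on the 3-level example frame built earlier (assuming the xarray package is installed; output shown approximately):
# Series with a 3-level MultiIndex -> xarray DataArray -> plain numpy array
arr = frame['value'].to_xarray().values
print(arr.shape)
# (2, 2, 2)
print(arr)
# [[[ 0.  1.]
#   [ 2.  3.]]
#  [[ 4. nan]
#   [ 6.  7.]]]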
