I should start by saying that I am quite new to pandas and numpy (and machine learning in general).
I am trying to learn some basic machine learning algorithms and am starting with linear regression. I have completed this exercise in MATLAB, but wanted to try implementing it in Python, as that is a more practically used language. I am having a very difficult time doing basic matrix operations with these libraries, and I think it comes down to a lack of understanding of how pandas indexes the DataFrame...
I have found several posts discussing the differences between iloc and ix, and saying that ix is being deprecated so use iloc, but iloc is causing me loads of issues. I am simply trying to pull the first n-1 columns out of a DataFrame into a new DataFrame, then the final column into another DataFrame to get my label values. Then I want to evaluate the cost function once to see what my current cost is with theta = 0. Currently, my dataset has only one label - but I'd like to code as if I had more. Here is my code and my output:
import os

import numpy as np
import pandas as pd

path = os.getcwd() + '\\ex1data1.txt'
data = pd.read_csv(path, header=None)
numRows = data.shape[0]
numCols = data.shape[1]
X = data.iloc[:,0:numCols-1].copy()
theta = pd.DataFrame(np.zeros((X.shape[1], 1)))
y = data.iloc[:,-1:].copy()
# start computing cost: sum((X*theta - y).^2)
predictions = X.dot(theta)
print("predictions shape: {0}".format(predictions.shape))
print(predictions.head())
print("y shape: {0}".format(y.shape))
print(y.head())
errors = predictions.subtract(y)
print("errors shape: {0}".format(errors.shape))
print(errors.head())
output:
predictions shape: (97, 1)
     0
0  0.0
1  0.0
2  0.0
3  0.0
4  0.0
y shape: (97, 1)
         1
0  17.5920
1   9.1302
2  13.6620
3  11.8540
4   6.8233
errors shape: (97, 2)
    0   1
0 NaN NaN
1 NaN NaN
2 NaN NaN
3 NaN NaN
4 NaN NaN
I can see that y and X have the same shape, but for some reason when I display them, it seems that y begins its indexing at column 1 (its original position in the first DataFrame) while X keeps its original column 0. As a result, pandas aligns on the column labels when doing the subtraction and fills any non-matching positions with NaN. Since y has no column 0 and X has no column 1, every entry is NaN, resulting in a 97x2 NaN matrix.
If I use y = data.ix[:,-1:0] - the above code does the correct calculations. Output:
errors shape: (97, 1)
        0
0 -6.1101
1 -5.5277
2 -8.5186
3 -7.0032
4 -5.8598
But I am trying to stay away from ix since it has been deprecated.
How do I tell pandas that the new matrix has a start column of 0, and why is this not the default behavior?
Looks like the calculation you actually want to do is on the series (individual columns). So you should be able to do:
predictions[0].subtract(y[1])
To get the values you want. This looks a bit confusing because the DataFrame columns are labelled with numbers: you are selecting the columns you want (0 and 1) and performing the subtraction between them.
Or, using iloc as you originally suggested, which gives you more matrix-style indexing, you could do this:
predictions.iloc[:, 0].subtract(y.iloc[:, 0])
Because in each DataFrame you want all the rows and the first column
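Another option, if you want to keep the matrix-style formulation from the question, is to drop down to the underlying NumPy arrays so that no label alignment happens at all. This is a minimal sketch with made-up toy numbers standing in for the real dataset:
import numpy as np
import pandas as pd

# toy stand-in for the (features, label) split described in the question
data = pd.DataFrame({0: [6.1101, 5.5277, 8.5186], 1: [17.592, 9.1302, 13.662]})
X = data.iloc[:, :-1]
y = data.iloc[:, -1:]
theta = np.zeros((X.shape[1], 1))

predictions = X.values.dot(theta)   # plain (m, 1) ndarray, no column labels
errors = predictions - y.values     # element-wise, no NaN from misaligned columns
cost = (errors ** 2).sum() / (2 * len(y))
print(cost)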
Related
I have a relatively large dataframe (~24000 rows and 15 columns) which has 2D coordinate data of rat movements, outputted by a neural network (DeepLabCut).
As part of this output data, there is a p-value score that measures how certain the neural network was when applying that label. I'm trying to filter out low-quality predictions by copying the previous row into its place each time a low p-value is encountered, which assumes that the rat remained still for that frame.
Here's my function thus far:
def checkPVals(DataFrame, CutOff):
    for Cols in DataFrame.columns.values:
        if Cols % 3 == 0:
            for Vals in DataFrame.index.values:
                if float(DataFrame[Cols][Vals]) < CutOff:
                    if (Vals != 0):
                        PreviousRow = DataFrame.loc[Vals - 1, Cols - 3:Cols]
                        DataFrame.loc[Vals, Cols - 3:Cols] = PreviousRow
    return(DataFrame)
Here is a sample of the input data frame:
pd.DataFrame(data={
    "x": [1, 2, 3, 4],
    "y": [5, 4, 3, 2],
    "likelihood": [1, 1, 0.3, 1]
})
Here is a sample of the desired output:
   x  y  likelihood
0  1  5         1.0
1  2  4         1.0
2  2  4         1.0
3  4  2         1.0
With the idea being that row index 2 is replaced with values from row index 1, such that when the inter-frame Euclidean distance between these coordinates is calculated, the distance is 0, implying the label (rat) has not moved.
Clearly, my current implementation is very inefficient. I was looking at iterrows(), but that converts my data into a Series and messes with it. My other thought was to convert the p-value columns into np.arrays, iterate through those, take the indices of the p-values below the threshold, and then swap those rows for the previous ones in an iterative manner. However, I feel like that would take just as long.
Any help is very much appreciated. Thank you!
I'm pretty sure I understood what you are attempting to do. If you could update your question to have a sample output that's paired with your sample input, that would be greatly beneficial.
If I understood correctly, you should be using a vectorized approach instead of explicit looping (this will massively speed up your data wrangling). Essentially, you can mask the rows of the dataframe depending on whether the "likelihood" column is below a certain value. Once you mask the low likelihoods away (i.e. replace those values with NaN), you can simply forward fill the entire dataframe to fill in the "bad" rows with the previous row's values.
df = pd.DataFrame(data={
    "x": [1, 2, 3, 4],
    "y": [5, 4, 3, 2],
    "likelihood": [1, 1, 0.3, 1]
})
cutoff = 0.5
new_df = df.mask(df["likelihood"] < cutoff).ffill()
print(new_df)
x y likelihood
0 1.0 5.0 1.0
1 2.0 4.0 1.0
2 2.0 4.0 1.0
3 4.0 2.0 1.0
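For the full 15-column frame described in the question (one x/y/likelihood triple per tracked point), the same mask-and-ffill idea can be applied per triple. This is a hedged sketch with hypothetical column names, not part of the original answer:
import numpy as np
import pandas as pd

# hypothetical frame with two tracked points, three columns each
df = pd.DataFrame({
    "x1": [1, 2, 3, 4], "y1": [5, 4, 3, 2], "p1": [1, 1, 0.3, 1],
    "x2": [9, 8, 7, 6], "y2": [0, 1, 2, 3], "p2": [1, 0.2, 1, 1],
})

cutoff = 0.5
new_df = df.copy()
# mask each (x, y, p) triple independently, based on its own p column
for i in range(0, df.shape[1], 3):
    cols = df.columns[i:i + 3]
    new_df.loc[df[cols[2]] < cutoff, cols] = np.nan
new_df = new_df.ffill()
print(new_df)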
I have a column of results that I'm trying to smooth out. Most of the data produces a smooth chart, but sometimes I get a random spike. I want to reduce the impact of the spike.
My thought was to take the outlier and just make it the mean of the values on either side of it, but I'm struggling and not getting the result I want.
Here's what I'm doing right now:
import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.randint(0, 100, size=(5, 1)), columns=list('A'))
def aDetection(inputs):
    median = inputs["A"].median()
    std = inputs["A"].std()
    outliers = (inputs["A"] - median).abs() > std
    print("outliers")
    print(outliers)
    inputs[outliers]["A"] = np.nan  # this isn't working
    inputs[outliers] = np.nan       # works but wipes out entire row
    inputs['A'].fillna(median, inplace=True)
    print("modified:")
    print(inputs)

print("original")
print(df)
aDetection(df)
original
A
0 4
1 86
2 40
3 99
4 97
outliers
0 True
1 False
2 True
3 False
4 False
Name: A, dtype: bool
modified:
A
0 86.0
1 86.0
2 86.0
3 99.0
4 97.0
For one, it seems to change the entire row, not just the single column. But the bigger problem is that every outlier in my example is replaced with 86. I realize this is because I filled with the median of the entire column, but I would instead like to use the mean of the values surrounding the missing data.
For a single column, you can do your task with the following one-liner
(for readability folded into 2 lines):
df.A = df.A.mask((df.A - df.A.median()).abs() > df.A.std(),
                 pd.concat([df.A.shift(), df.A.shift(-1)], axis=1).mean(axis=1))
Details:
(df.A - df.A.median()).abs() > df.A.std() - computes outliers.
df.A.shift() - computes a Series of previous values.
df.A.shift(-1) - computes a Series of following values.
pd.concat(...) - creates a DataFrame from both the above Series.
mean(axis=1) - computes means by rows.
mask(...) - takes original values of A column for non-outliers
and the value from concat for outliers.
The result is:
A
0 86.0
1 86.0
2 92.5
3 99.0
4 97.0
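To see where the replacement values come from, here is a small illustration of the shift/concat/mean building blocks described above, using the sample column from the question:
import pandas as pd

A = pd.Series([4, 86, 40, 99, 97])
neighbours = pd.concat([A.shift(), A.shift(-1)], axis=1)
print(neighbours)
#       0     1
# 0   NaN  86.0
# 1   4.0  40.0
# 2  86.0  99.0
# 3  40.0  97.0
# 4  99.0   NaN
print(neighbours.mean(axis=1))   # row-wise mean of the two neighbours
# 0    86.0
# 1    22.0
# 2    92.5
# 3    68.5
# 4    99.0
# dtype: float64
mask(...) then keeps these means only at the outlier positions (rows 0 and 2), which gives the result shown above.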
If you want to apply this mechanism to all columns of your DataFrame,
then:
Change the above code to a function:
def replOutliers(col):
    return col.mask((col - col.median()).abs() > col.std(),
                    pd.concat([col.shift(), col.shift(-1)], axis=1).mean(axis=1))
Apply it (to each column):
df = df.apply(replOutliers)
I have a pandas Data Frame which is a 50x50 correlation matrix. In the following picture you can see what I have as an example
What I would like to do, if it's possible of course, is to make a new data frame which has only the elements of the old one that are higher than 0.5 or lower than -0.5, indicating a strong linear relationship, but not 1, so the diagonal entries are excluded.
I don't think what I'm asking is exactly possible, because of course variable x0 won't have the same strong relationships that x1 has, etc., so the new data frame won't end up looking very neat.
But is there any way to scan quickly through this data frame, find the values I mentioned, and maybe at least put them into an array?
Any insight would be helpful. Thanks
You can't really keep a correlation matrix layout if you want to drop the correlation pairs that are too weak. One thing you could do is stack the frame and keep only the relevant correlation pairs.
having (randomly generated as an example):
0 1 2 3 4
0 0.038142 -0.881054 -0.718265 -0.037968 -0.587288
1 0.587694 -0.135326 -0.529463 -0.508112 -0.160751
2 -0.528640 -0.434885 -0.679416 -0.455866 0.077580
3 0.158409 0.827085 0.018871 -0.478428 0.129545
4 0.825489 -0.000416 0.682744 0.794137 0.694887
you could do:
import pandas as pd
import numpy as np
df = pd.DataFrame(np.random.uniform(-1, 1, (5, 5)))
df = df.stack()
df = df[((df > 0.5) | (df < -0.5)) & (df != 1)]
0  1   -0.881054
   2   -0.718265
   4   -0.587288
1  0    0.587694
   2   -0.529463
   3   -0.508112
2  0   -0.528640
   2   -0.679416
3  1    0.827085
4  0    0.825489
   2    0.682744
   3    0.794137
   4    0.694887
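If you then want the surviving values in a plain array, as the question mentions, note that the stacked and filtered object is a Series, so pulling out its labels and values is straightforward. A minimal self-contained sketch of that last step:
import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.uniform(-1, 1, (5, 5)))
stacked = df.stack()
strong = stacked[((stacked > 0.5) | (stacked < -0.5)) & (stacked != 1)]

pairs = strong.index.tolist()   # (row, column) label pairs of the strong correlations
values = strong.values          # the correlation values as a plain numpy array
print(pairs)
print(values)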
Question
Is there a good way to transform a DataFrame with an n-level index into an n-D Numpy array (a.k.a n-tensor)?
Example
Suppose I set up a DataFrame like
from pandas import DataFrame, MultiIndex
index = range(2), range(3)
value = range(2 * 3)
frame = DataFrame(value, columns=['value'],
                  index=MultiIndex.from_product(index)).drop((1, 0))
print(frame)
which outputs
     value
0 0      0
  1      1
  2      2
1 1      4
  2      5
The index is a 2-level hierarchical index. I can extract a 2-D Numpy array from the data using
print(frame.unstack().values)
which outputs
[[ 0. 1. 2.]
[ nan 4. 5.]]
How does this generalize to an n-level index?
Playing with unstack(), it seems that it can only be used to massage the 2-D shape of the DataFrame, but not to add an axis.
I cannot use e.g. frame.values.reshape(x, y, z), since this would require that the frame contains exactly x * y * z rows, which cannot be guaranteed. This is what I tried to demonstrate by drop()ing a row in the above example.
Any suggestions are highly appreciated.
Edit. This approach is much more elegant (and two orders of magnitude faster) than the one I gave below.
import numpy as np

# create an empty array of NaN of the right dimensions
shape = list(map(len, frame.index.levels))
arr = np.full(shape, np.nan)
# fill it using NumPy's advanced indexing
arr[tuple(frame.index.codes)] = frame.values.flat
# ...or in pandas < 0.24.0, use
# arr[tuple(frame.index.labels)] = frame.values.flat
Original solution. Given a setup similar to above, but in 3-D,
from pandas import DataFrame, MultiIndex
from itertools import product
index = range(2), range(2), range(2)
value = range(2 * 2 * 2)
frame = DataFrame(value, columns=['value'],
                  index=MultiIndex.from_product(index)).drop((1, 0, 1))
print(frame)
we have
       value
0 0 0      0
    1      1
  1 0      2
    1      3
1 0 0      4
  1 0      6
    1      7
Now, we proceed using the reshape() route, but with some preprocessing to ensure that the length along each dimension will be consistent.
First, reindex the data frame with the full cartesian product of all dimensions. NaN values will be inserted as needed. This operation can be both slow and consume a lot of memory, depending on the number of dimensions and on the size of the data frame.
levels = map(tuple, frame.index.levels)
index = list(product(*levels))
frame = frame.reindex(index)
print(frame)
which outputs
       value
0 0 0      0
    1      1
  1 0      2
    1      3
1 0 0      4
    1    NaN
  1 0      6
    1      7
Now, reshape() will work as intended.
shape = list(map(len, frame.index.levels))
print(frame.values.reshape(shape))
which outputs
[[[  0.   1.]
  [  2.   3.]]

 [[  4.  nan]
  [  6.   7.]]]
The (rather ugly) one-liner is
frame.reindex(list(product(*map(tuple, frame.index.levels)))).values \
    .reshape(list(map(len, frame.index.levels)))
This can be done quite nicely using the Python xarray package which can be found here: http://xarray.pydata.org/en/stable/. It has great integration with Pandas and is quite intuitive once you get to grips with it.
If you have a multiindex series you can call the built-in method multiindex_series.to_xarray() (https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.to_xarray.html). This will generate a DataArray object, which is essentially a name-indexed numpy array, using the index values and names as coordinates. Following this you can call .values on the DataArray object to get the underlying numpy array.
If you need your tensor to conform to a set of keys in a specific order, you can also call .reindex(index_name = index_values_in_order) (http://xarray.pydata.org/en/stable/generated/xarray.DataArray.reindex.html) on the DataArray. This can be extremely useful and makes working with the newly generated tensor much easier!
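As a minimal sketch of that workflow (assuming the xarray package is installed), using a small 2-level example like the one in the question:
import pandas as pd

# 2-level MultiIndex Series with one entry dropped, as in the question
index = pd.MultiIndex.from_product([range(2), range(3)])
s = pd.Series(range(6), index=index).drop((1, 0))

da = s.to_xarray()   # xarray.DataArray; the missing cell becomes NaN
print(da.values)     # the underlying n-D numpy array
# [[ 0.  1.  2.]
#  [nan  4.  5.]]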
I'm wondering if this is a bug, or possibly I don't understand how nanmean should work with a dataframe. Seems to work if I convert the dataframe to an array, but not directly on the dataframe, nor is any exception raised. Originally noticed here: Fill data gaps with average of data from adjacent days
import numpy as np
from pandas import DataFrame

df1 = DataFrame({'x': [1, 3, np.nan]})
df2 = DataFrame({'x': [2, np.nan, 5]})
x
0 1
1 3
2 NaN
x
0 2
1 NaN
2 5
In [1503]: np.nanmean( [df1,df2], axis=0 )
Out[1503]:
x
0 1.5
1 NaN
2 NaN
In [1504]: np.nanmean( [df1.values, df2.values ], axis=0 )
Out[1504]:
array([[ 1.5],
[ 3. ],
[ 5. ]])
It's definitely strange behavior. I don't have all the answers, but it mostly comes down to the fact that entire pandas DataFrames can be elements of numpy arrays, which results in strange behavior. I'm guessing this should be avoided as much as possible, and I'm not sure why DataFrames are valid numpy elements at all.
np.nanmean probably converts its arguments into an np.array before applying operations. So let's look at
a = np.array([df1, df2])
First note that this is not a 3-d array like you might think, it's actually a 1-d array, where each element is a DataFrame.
print(a.shape)
# (2,)
print(type(a[0]))
# <class 'pandas.core.frame.DataFrame'>
So nanmean is taking the mean of both of the DataFrames, not of the values inside the dataframes. This also means that the axis argument isn't actually doing anything, and if you try using axis=1 you'll get an error because it's a 1-d array.
np.nanmean(a, axis=1)
# IndexError: tuple index out of range
print(np.nanmean(a))
# x
# 0 1.5
# 1 NaN
# 2 NaN
That's why you're getting a different answer than when you create the array with values. When you use values, it properly creates the 3-d array of numbers, rather than the weird 1-d array of dataframes.
b = np.array([df1.values, df2.values ])
print(b.shape)
# (2, 3, 1)
print(type(b[1]))
# <class 'numpy.ndarray'>
print(type(b[0,0,0]))
# <class 'numpy.float64'>
These arrays of dataframes have some especially weird behavior, though. Say we make a length-3 array where the third element is np.nan. You might expect to get the same answer from nanmean as we did with a before, since it should exclude the nan value, right?
print(np.nanmean(np.array([df1, df2, np.nan])))
# x
# 0 NaN
# 1 NaN
# 2 NaN
Yea, so I'm not sure. Best to avoid making these.
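As a concrete example of the safer pattern (building a proper numeric array before calling nanmean), here is a minimal sketch using the frames from the question:
import numpy as np
import pandas as pd

df1 = pd.DataFrame({'x': [1, 3, np.nan]})
df2 = pd.DataFrame({'x': [2, np.nan, 5]})

# stack the underlying values into a real (2, 3, 1) float array,
# then let nanmean skip the NaNs element-wise
stacked = np.stack([df1.values, df2.values])
print(np.nanmean(stacked, axis=0))
# [[1.5]
#  [3. ]
#  [5. ]]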