I'm currently learning how to use Pandas, and I'm in a situation where I'm attempting to replace missing data (Horsepower feature) using a best-fit line generated from linear regression with the Displacement column. What I'm doing is iterating through only the parts of the dataframe that are marked as NaN in the Horsepower column and replacing the data by feeding in the value of Displacement in that same row into the best-fit algorithm. My code looks like this:
for row, value in auto_data.HORSEPOWER[pd.isnull(auto_data.HORSEPOWER)].iteritems():
    auto_data.HORSEPOWER[row] = int(round(slope * auto_data.DISPLACEMENT[row] + intercept))
Now, the code works and the data is replaced as expected, but it generates the SettingWithCopyWarning when I run it. I understand why the warning is generated, and that in this case I'm fine, but if there is a better way to iterate through the subset, or a method that's just more elegant, I'd rather avoid chained indexing that could cause a real problem in the future.
I've looked through the docs and through other answers on Stack Overflow. All solutions to this seem to use .loc, but I just can't seem to figure out the correct syntax to get the subset of NaN rows using .loc. Any help is appreciated. If it helps, the dataframe looks like this:
auto_data.dtypes
Out[15]:
MPG float64
CYLINDERS int64
DISPLACEMENT float64
HORSEPOWER float64
WEIGHT int64
ACCELERATION float64
MODELYEAR int64
NAME object
dtype: object
IIUC you should be able to just do:
auto_data.loc[auto_data['HORSEPOWER'].isnull(), 'HORSEPOWER'] = np.round(slope * auto_data['DISPLACEMENT'] + intercept)
The above is vectorised and avoids looping; the warning you get comes from the chained indexing in:
auto_data.HORSEPOWER[row]
I think if you did:
auto_data.loc[row,'HORSEPOWER']
then the warning should not be raised.
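For instance, a minimal sketch on made-up data (the values and the slope/intercept below are placeholders, not taken from the question):

import numpy as np
import pandas as pd

auto_data = pd.DataFrame({'HORSEPOWER': [130.0, np.nan, 150.0],
                          'DISPLACEMENT': [307.0, 350.0, 318.0]})
slope, intercept = 0.3, 30.0  # placeholder fit coefficients

mask = auto_data['HORSEPOWER'].isnull()
auto_data.loc[mask, 'HORSEPOWER'] = np.round(
    slope * auto_data.loc[mask, 'DISPLACEMENT'] + intercept)
# only the masked row is written to, and no SettingWithCopyWarning is raised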
Instead of looping through the DataFrame row-by-row, it would be more efficient to calculate the extrapolated values in a vectorized way for the entire column:
y = (slope * auto_data['DISPLACEMENT'] + intercept).round()
and then use update to write these values into the column:
auto_data['HORSEPOWER'].update(y)
Note that update uses every non-NaN value from y, so in this example it overwrites all of the HORSEPOWER values, not just the missing ones.
Ed Chum's solution shows how to replace the value in arbitrary rows by using a boolean mask and auto_data.loc.
For example,
import numpy as np
import pandas as pd
auto_data = pd.DataFrame({
    'HORSEPOWER': [1, np.nan, 2],
    'DISPLACEMENT': [3, 4, 5]})
slope, intercept = 2, 0.5
y = (slope * auto_data['DISPLACEMENT'] + intercept).round()
auto_data['HORSEPOWER'].update(y)
print(auto_data)
yields
DISPLACEMENT HORSEPOWER
0 3 6
1 4 8
2 5 10
I've searched around for an answer, but I cannot find one.
My goal: I'm trying to fill some missing values in a DataFrame, using supervised learning to decide how to fill them.
My code looks like this: NOTE - THIS FIRST PART IS NOT IMPORTANT, IT IS JUST TO GIVE CONTEXT
train_df = df[df['my_column'].notna()] #I need to train the model without using the missing data
train_x = train_df[['lat','long']] #Lat and long are the inputs
train_y = train_df[['my_column']] #My_column is the output
clf = neighbors.KNeighborsClassifier(2)
clf.fit(train_x,train_y) #clf is the classifier; here we train it
df_x = df[['lat','long']] #I need this part to do the prediction
prediction = clf.predict(df_x) #clf.predict() returns an array
series_pred = pd.Series(prediction) #now the array is a series
print(series_pred.shape) #RETURNS (2381,)
print(series_pred.isna().sum()) #RETURNS 0
So far, so good. I have my 2381 predictions (I need only a few of them) and there is no NaN value inside (why would there be a NaN value in the predictions? I just wanted to be sure, as I don't understand my error)
Here I try to assign the predictions to my Dataframe:
#test_1
df.loc[df['my_column'].isna(), 'my_column'] = series_pred #I assign the predictions using .loc
#test_2
df['my_column'] = df['my_column'].fillna(series_pred) #Double check: I assign the predictions using .fillna()
print(df['my_column'].shape) #RETURNS (2381,)
print(df['my_column'].isna().sum()) #RETURNS 6
As you can see, it didn't work: the missing values are still 6. I randomly tried a slightly different approach:
#test_3
df[['my_column']] = df[['my_column']].fillna(series_pred) #Will it work?
print(df[['my_column']].shape) #RETURNS (2381, 1)
print(df[['my_column']].isna().sum()) #RETURNS 6
Did not work. I decided to try one last thing: check the fillna result even before assigning the results to the original df:
In[42]:
print(df['my_column'].fillna(series_pred).isna().sum()) #extreme test
Out[42]:
6
So... where is my very very stupid mistake? Thanks a lot
EDIT 1
To show a little bit of the data,
In[1]:
df.head()
Out[1]:
my_column lat long
id
9df Wil 51 5
4f3 Fabio 47 9
x32 Fabio 47 8
z6f Fabio 47 9
a6f Giovanni 47 7
Also, I've added info at the beginning of the question
@Ben.T or @Dan should post their own answers; they deserve to be accepted as the correct one.
Following their hints, I would say that there are two solutions:
Solution 1 (Best): Use .loc
The problem
The problem with the current solution is that df.loc[df['my_column'].isna(), 'my_column'] expects to receive exactly as many values as there are missing entries, while my variable prediction contains predictions for every row, missing and non-missing alike.
The solution
pred_df = df[df['my_column'].isna()] #For the prediction, use a Dataframe with only the missing values. Problem solved
df_x = pred_df[['lat','long']]
prediction = clf.predict(df_x)
df.loc[df['my_column'].isna(), 'my_column'] = prediction
Solution 2: Use fillna()
The problem
The problem with the current solution is that df['my_column'].fillna(series_pred) requires the index of series_pred to align with the index of df, which does not happen here unless df has a simple RangeIndex like [0, 1, 2, 3, 4, ...].
The solution
Resetting the index of the df at the very beginning of the code.
Why is this not the best
The cleanest way is to run the prediction only for the rows you actually need. That is easy to do with .loc, and I do not know how you would do it with fillna(), because you would need to preserve the index through the classification step.
Edit: series_pred.index = df['my_column'].isna().index (thanks @Dan).
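Putting Solution 1 together, here is a self-contained sketch on toy data (the values are made up; the column names and classifier settings follow the question):

import numpy as np
import pandas as pd
from sklearn import neighbors

df = pd.DataFrame({
    'my_column': ['Wil', 'Fabio', np.nan, 'Fabio', np.nan],
    'lat': [51, 47, 47, 47, 47],
    'long': [5, 9, 8, 9, 7]})

# train only on the rows where the target is known
train_df = df[df['my_column'].notna()]
clf = neighbors.KNeighborsClassifier(2)
clf.fit(train_df[['lat', 'long']], train_df['my_column'])

# predict only the rows that are actually missing,
# so the number of predictions matches the number of NaNs
pred_df = df[df['my_column'].isna()]
df.loc[df['my_column'].isna(), 'my_column'] = clf.predict(pred_df[['lat', 'long']])

print(df['my_column'].isna().sum())  # 0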
I want to scale all the values of a column of a dataframe with a function. This is the function so far:
def scale0_1(cname):
    temp = array(cname)
    for i in range(len(temp)):
        value = temp[i]-min(temp)/(max(temp)-min(temp))
        temp[i] = value
    return pd.DataFrame(temp)
Here is a sample column to test the function with:
samplecolumn = pd.DataFrame([7.0, 15.8, 19.4, 11.4])
However, when I use the function with a column of a data frame (any numeric column should work), it just returns the original values, doing nothing. There is no error message. Does anyone have an idea how to fix this?
I would be very grateful for any help :)
Using np.interp
a=df[0].values
np.interp(a, (a.min(), a.max()), (0, +1))
Out[36]: array([0. , 0.70967742, 1. , 0.35483871])
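To write the scaled values back into the frame, assign the result to a column (a small sketch assuming the sample column from the question, whose default name is 0):

import numpy as np
import pandas as pd

df = pd.DataFrame([7.0, 15.8, 19.4, 11.4])
a = df[0].values
df['scaled'] = np.interp(a, (a.min(), a.max()), (0, 1))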
With pandas dataframes you can apply operations to entire columns. This allows you to do something like this:
def scale0_1(cname):
    col_min, col_max = cname.min(), cname.max()
    return (cname - col_min) / (col_max - col_min)
This also allows you to keep the data in a pandas Series or DataFrame through the whole operation and avoids the added complexity of converting it into an array and back.
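For example, applied to the sample column from the question (using the corrected function above), this yields the expected 0-1 range:

samplecolumn = pd.DataFrame([7.0, 15.8, 19.4, 11.4])
print(scale0_1(samplecolumn[0]))
# 0    0.000000
# 1    0.709677
# 2    1.000000
# 3    0.354839
# Name: 0, dtype: float64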
Where possible, you should use a vectorised approach rather than iterating rows explicitly. For example, you can calculate a column's maximum and minimum. Then, when performing operations with series, the calculations are automatically vectorised.
df = pd.DataFrame({'A': [7.0, 15.8, 19.4, 11.4]})
col_min = df['A'].min()
col_max = df['A'].max()
df['B'] = (df['A'] - col_min) / (col_max - col_min)
This is a frequent task, so you will find it exists in other 3rd party libraries. For example, using sklearn:
from sklearn import preprocessing
min_max_scaler = preprocessing.MinMaxScaler()
df['B'] = min_max_scaler.fit_transform(df[['A']])
Result
print(df)
A B
0 7.0 0.000000
1 15.8 0.709677
2 19.4 1.000000
3 11.4 0.354839
I have the following example:
import pandas as pd
import numpy as np
df = pd.DataFrame([(0,2,5), (2,4,None),(7,-5,4), (1,None,None)])
def clean(series):
    start = np.min(list(series.index[pd.isnull(series)]))
    end = len(series)
    series[start:] = series[start-1]
    return series
My objective is to obtain a dataframe in which each row that contains a None value is filled in with the last available numerical value.
So, for example, running this function on just row 3 of the dataframe, I would produce the following:
row = df.ix[3]
test = clean(row)
test
0 1.0
1 1.0
2 1.0
Name: 3, dtype: float64
I cannot get this to work using the .apply() method, i.e. df.apply(clean,axis=1)
I should mention that this is a toy example - the custom function I would write in the real one is more dynamic in how it fills the values - so I am not looking for basic utilities like .ffill or .fillna
The apply method didn't work because, when a row is completely filled, your clean function has no missing index to start from: it gets an empty array for that series.
So check for missing values before altering the series data, i.e.
def clean(series):
    # Create a copy for the sake of safety
    series = series.copy()
    # Alter the series only if it contains a None value;
    # for a completely filled row, series.index[pd.isnull(series)]
    # would return Int64Index([], dtype='int64')
    if pd.isnull(series).any():
        start = np.min(list(series.index[pd.isnull(series)]))
        end = len(series)
        series[start:] = series[start-1]
    return series

df.apply(clean, axis=1)
Output :
0 1 2
0 0.0 2.0 5.0
1 2.0 4.0 4.0
2 7.0 -5.0 4.0
3 1.0 1.0 1.0
Hope this clarifies why apply didn't work. I would also suggest considering the built-in methods for cleaning the data rather than writing functions from scratch.
First, here is the code that solves your toy problem, although it isn't what you asked for:
df.ffill(axis=1)
Next, let's test your code.
df.apply(clean,axis=1)
#...start = np.min(list(series.index[pd.isnull(series)]))...
#=>ValueError: ('zero-size array to reduction operation minimum
# which has no identity', 'occurred at index 0')
To understand the situation, test with a lambda function:
df.apply(lambda series:list(series.index[pd.isnull(series)]),axis=1)
0 []
1 [2]
2 []
3 [1, 2]
dtype: object
And the following expression raises the same ValueError:
import numpy as np
np.min([])
In conclusion, pandas.DataFrame.apply() works fine; it is the clean function that doesn't.
Could you use something like fillna with backfill? I think this might be more efficient, if backfill fits your scenario.
i.e.
df.fillna(method='backfill')
However, this assumes the missing cells contain np.nan.
https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.fillna.html
I have a large DataFrame (circa 4e+07 rows).
When summing it, I get 2 significantly different results whether I do the sum before or after the column selection.
Also, the type changes from float32 to float64 even though totals are all below 2**31
df[[col1, col2, col3]].sum()
Out[1]:
col1 9.36e+07
col2 1.39e+09
col3 6.37e+08
dtype: float32
df.sum()[[col1, col2, col3]]
Out[2]:
col1 1.21e+08
col2 1.70e+09
col3 7.32e+08
dtype: float64
I am obviously missing something, has anybody had the same issue?
Thanks for your help.
To understand what's going on here, you need to understand what Pandas is doing under the hood. I'm going to simplify a bit, since there are lots of bells and whistles and special cases to consider, but roughly it looks like this:
Suppose you've got a Pandas DataFrame object df with various numeric columns (we'll ignore datetime columns, categorical columns, and the like). When you compute df.sum(), Pandas:
Extracts the values of the dataframe into a two-dimensional NumPy array.
Applies the NumPy sum function to that 2d array with axis=0 to compute the column sums.
It's the first step that's important here. The columns of a DataFrame might have different dtypes, but a 2d NumPy array can only have a single dtype. If df has a mixture of float32 and int32 columns (for example), Pandas has to choose a single dtype that's appropriate for both columns simultaneously, and in this case it chooses float64. So when the sum is computed, it's computed on double-precision values, using double-precision arithmetic. This is what's happening in your second example.
On the other hand, if you cut down to just the float32 columns in the first place, then Pandas can and will use the float32 dtype for the 2d NumPy array, and so the sum computation is performed in single precision. This is what's happening in your first example.
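As a quick way to see the promotion step directly (a small sketch added for illustration), you can inspect the dtype of the extracted array:

>>> import numpy as np, pandas as pd
>>> small = pd.DataFrame({'a': np.ones(3, dtype=np.float32),
...                       'b': np.ones(3, dtype=np.int32)})
>>> small.values.dtype         # mixed float32/int32 columns are promoted to float64
dtype('float64')
>>> small[['a']].values.dtype  # a single float32 column keeps its dtype
dtype('float32')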
Here's a simple example showing this in action: we'll set up a DataFrame with 100 million rows and three columns, of dtypes float32, float32 and int32 respectively. All the values are ones:
>>> import numpy as np, pandas as pd
>>> s = np.ones(10**8, dtype=np.float32)
>>> t = np.ones(10**8, dtype=np.int32)
>>> df = pd.DataFrame(dict(A=s, B=s, C=t))
>>> df.head()
A B C
0 1.0 1.0 1
1 1.0 1.0 1
2 1.0 1.0 1
3 1.0 1.0 1
4 1.0 1.0 1
>>> df.dtypes
A float32
B float32
C int32
dtype: object
Now when we compute the sums directly, Pandas first turns everything into float64s. The computation is also done using the float64 type, for all three columns, and we get an accurate answer.
>>> df.sum()
A 100000000.0
B 100000000.0
C 100000000.0
dtype: float64
But if we first cut down our dataframe to just the float32 columns, then float32-arithmetic is used for the sum, and we get very poor answers.
>>> df[['A', 'B']].sum()
A 16777216.0
B 16777216.0
dtype: float32
The inaccuracy is of course due to using a dtype that doesn't have enough precision for the task in question: at some point in the summation, we end up repeatedly adding 1.0 to 16777216.0, and getting 16777216.0 back each time, thanks to the usual floating-point problems. The solution is to explicitly convert to float64 yourself before doing the computation.
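For instance (a quick sketch using the DataFrame from the example above), casting to float64 before summing restores the accurate result, at the cost of a temporary double-precision copy:

>>> df[['A', 'B']].astype(np.float64).sum()
A    100000000.0
B    100000000.0
dtype: float64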
However, this isn't quite the end of the surprises that Pandas has in store for us. With the same dataframe as above, let's try just computing the sum for column "A":
>>> df[['A']].sum()
A 100000000.0
dtype: float32
Suddenly we're getting full accuracy again! So what's going on? This has little to do with dtypes: we're still using float32 to do the summation. It's now the second step (the NumPy summation) that's responsible for the difference. What's happening is that NumPy can, and sometimes does, use a more accurate summation algorithm, called pairwise summation, and with float32 dtype and the size arrays that we're using, that accuracy can make a hugely significant difference to the final result. However, it only uses that algorithm when summing along the fastest-varying axis of an array; see this NumPy issue for related discussion. In the case where we compute the sum of both column "A" and column "B", we end up with a values array of shape (100000000, 2). The fastest-varying axis is axis 1, and we're computing the sum along axis 0, so the naive summation algorithm is used and we get poor results. But if we only ask for the sum of column "A", we get the accurate sum result, computed using pairwise summation.
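The same effect can be reproduced with plain NumPy (a rough sketch; the two arrays below need roughly 400 MB and 800 MB of memory):

import numpy as np

a = np.ones(10**8, dtype=np.float32)
print(a.sum())        # accurate: 1e8, pairwise summation along the contiguous axis

b = np.ones((10**8, 2), dtype=np.float32)
print(b.sum(axis=0))  # inaccurate: each column sums to 16777216, the naive float32 accumulation limit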
In sum, when working with DataFrames of this size, you want to be careful to (a) work with double precision rather than single precision whenever possible, and (b) be prepared for differences in output results due to NumPy making different algorithm choices.
You can lose precision with np.float32 relative to np.float64
np.finfo(np.float32)
finfo(resolution=1e-06, min=-3.4028235e+38, max=3.4028235e+38, dtype=float32)
And
np.finfo(np.float64)
finfo(resolution=1e-15, min=-1.7976931348623157e+308, max=1.7976931348623157e+308, dtype=float64)
A contrived example
df = pd.DataFrame(dict(
    x=[-60499999.315, 60500002.685] * int(2e7),
    y=[-60499999.315, 60500002.685] * int(2e7),
    z=[-60499999.315, 60500002.685] * int(2e7),
)).astype(dict(x=np.float64, y=np.float32, z=np.float32))
print(df.sum()[['y', 'z']], df[['y', 'z']].sum(), sep='\n\n')
y 80000000.0
z 80000000.0
dtype: float64
y 67108864.0
z 67108864.0
dtype: float32
I am trying to store the results of FFT calculations in a Pandas data frame:
ft = pd.DataFrame(index=range(90))
ft['y'] = ft.index.map(lambda x: np.sin(2*x))
ft['spectrum'] = np.fft.fft(ft['y'])
ft['freq'] = np.fft.fftfreq(len(ft.index)).real
ft['T'] = ft['freq'].apply(lambda f: 1/f if f != 0 else 0)
Everything seems to be working fine until the last line: the column T, which is supposed to store periods, for some reason contains all the columns of the frame, i.e.:
In [499]: ft.T[0]
Out[499]:
y 0j
spectrum (0.913756021471+0j)
freq 0j
T 0j
Name: 0, dtype: complex128
I cannot figure out why that is. It also happens when I only take the real part of freq:
ft['freq'] = np.fft.fftfreq(len(ft.index)).real
or when I try to calculate the T values in alternative ways, such as:
ft.T = ft.index.map(lambda i: 1/ft.freq[i] if ft.freq[i] else np.inf)
ft.T = 1/ft.freq
All other columns look tidy when I run head() or describe() on them, no matter whether they contain real or complex values. The freq column looks like a normal 1D series, because np.fft.fftfreq() returns a 1D array of complex numbers, so what could be the reason the column T is so messed up?
I am using Pandas v. 0.19.2 and Numpy v. 1.12.0.
Pandas DataFrame objects have a property called T, which is used "to transpose index and columns" of the DataFrame. Attribute access such as ft.T therefore resolves to that transpose property rather than to your column, which is why ft.T[0] returns the entire first row. If you use a different column name instead of T, everything works as expected.
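A minimal sketch of the collision, using a differently named column (period is just an illustrative name, not taken from the question):

import numpy as np
import pandas as pd

ft = pd.DataFrame(index=range(90))
ft['y'] = ft.index.map(lambda x: np.sin(2 * x))
ft['freq'] = np.fft.fftfreq(len(ft.index))

# ft.T is attribute access, so it resolves to the built-in transpose property;
# ft.T[0] therefore returns all the columns of row 0.
# A column whose name does not shadow a DataFrame attribute behaves as expected:
ft['period'] = ft['freq'].apply(lambda f: 1 / f if f != 0 else 0)
print(ft[['freq', 'period']].head())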