Loss of precision when operating on a pandas DataFrame with NaN values - Python

I have a pandas DataFrame where I would like to subtract two column values:
import pandas as pd
import numpy as np

df = pd.DataFrame({"Label": ["NoPrecisionLoss"],
                   "FirstNsae": [1577434369549916003],
                   "SecondNsae": [1577434369549938679]})
print(df.SecondNsae - df.FirstNsae)
The result of the subtraction is the correct 22676.
Now, when the input dataframe gets a second row with a nan value in it:
df2 = pd.DataFrame({"Label":["PrecisionLoss","NeedsToBeRemoved"],
"FirstNsae":[1577434369549916003,np.nan],
"SecondNsae":[1577434369549938679,66666666666666]})
This nan value is nasty so we will remove the row that contains it:
df2 = df2[np.isfinite(df2.FirstNsae) & np.isfinite(df2.SecondNsae)]
Let's convert the FirstNsae column back to int (FirstNsae was made float because of the NaN in the second row):
df2 = df2.astype({"FirstNsae": int})  # this is futile since precision has already been lost
print(df2.SecondNsae - df2.FirstNsae)
Printing the difference between the two columns produces 22775.
How can I avoid losing precision when constructing DataFrames with extremely large integers in the possible presence of NaNs?
Thank you!

To elaborate on piRSquared's answer (in the comments to the original question), here is an approach that solved the original issue:
df2 = pd.DataFrame({"Label":["PrecisionLoss","NeedsToBeRemoved"],
"FirstNsae":[1577434369549916003,np.nan],
"SecondNsae"[1577434369549938679,66666666666666]},
dtype=object)
df2 = df2[np.isfinite(df2.FirstNsae.astype(float)) &
np.isfinite(df2.SecondNsae.astype(float)]
print(df2.SecondNsae - df2.FirstNsae)
prints 22676!
Update: since pandas version 1.0.0, this is no longer an issue. The nullable integer dtype allows integer columns to hold missing values. https://pandas.pydata.org/pandas-docs/version/1.0.0/user_guide/missing_data.html#missing-data-na
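For example, here is a minimal sketch (assuming pandas >= 1.0) using the nullable Int64 dtype, where pd.NA marks the missing entry and both columns keep full integer precision:
import pandas as pd

df2 = pd.DataFrame({
    "Label": ["PrecisionLoss", "NeedsToBeRemoved"],
    "FirstNsae": pd.array([1577434369549916003, pd.NA], dtype="Int64"),
    "SecondNsae": pd.array([1577434369549938679, 66666666666666], dtype="Int64"),
})
df2 = df2.dropna()                     # drops the row with the missing value
print(df2.SecondNsae - df2.FirstNsae)  # 0    22676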

Related

Why does fillna with median on dataframe still leaves Na/NaN in pandas?

I've seen this and this thread here, but something else is wrong.
I have a very large pandas DataFrame, with many Na/NaN values. I want to replace them with the median value for that feature.
So, I first make a table that displays the Na values per feature, sorted by most Na values, then use fillna(), and then display that table again. Ideally, the second time, that table should have all 0's, because all the Na's have been filled.
nullCount = pd.DataFrame(TT_df.isnull().sum(),columns=["nullcount"]).sort_values(by="nullcount",ascending=False)
display(nullCount.head(10))
TT_df = TT_df.fillna(TT_df.median())
nullCount = pd.DataFrame(TT_df.isnull().sum(),columns=["nullcount"]).sort_values(by="nullcount",ascending=False)
display(nullCount.head(10))
However, I get these two tables:
(image: null count tables, before and after)
and if I take a look at the DataFrame, you can see NaN's in it:
display(TT_df[nullCount.index.tolist()[0:5]].head(50))
(image: NaN examples)
It seems like a common problem with fillna() is that it returns a copy, unless you use inplace=True (like in the linked threads above), but I'm not doing that: I'm overwriting TT_df, unless I'm misunderstanding something. You can see that the LotFrontage feature actually does disappear from the second table, implying that the fillna() did work for it. So why isn't it working for the others?
What I suspect is the culprit, though I don't know why, is that Na doesn't actually mean Na for these features: if I look at the data description file, it says:
GarageFinish: Interior finish of the garage
Fin Finished
RFn Rough Finished
Unf Unfinished
NA No Garage
Okay, that's fine. But it feels like those NA values should either count as Na for both isnull() and fillna(), or not count for either. Why does it appear to be counted by isnull() but not fillna()?
The problem is with this line:
TT_df = TT_df.fillna(TT_df.median())
Your DataFrame has string columns, and you are attempting to calculate medians on strings. That doesn't work: df.median() skips non-numeric columns, so fillna() receives no fill value for them and their NaNs are left untouched.
Here's a minimal example:
import pandas as pd, numpy as np
df = pd.DataFrame({'A': ['A', 'B', np.nan, 'B']})
df = df.fillna(df.median())
print(df)
     A
0    A
1    B
2  NaN
3    B
What you should do is fillna with median only for numeric columns:
for col in df.select_dtypes(include=np.number):
    df[col] = df[col].fillna(df[col].median())
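Equivalently (a sketch, not part of the original answer), you can pass the Series of per-column medians to fillna(); pandas matches the Series index against column names, so string columns are simply left untouched:
df = df.fillna(df.select_dtypes(include=np.number).median())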

AttributeError: 'float' object has no attribute 'split'

I am calling this line:
lang_modifiers = [keyw.strip() for keyw in row["language_modifiers"].split("|") if not isinstance(row["language_modifiers"], float)]
This seems to work where row["language_modifiers"] is a word (atlas method, central), but not when it comes up as nan.
I thought my if not isinstance(row["language_modifiers"], float) could catch the time when things come up as nan but not the case.
Background: row["language_modifiers"] is a cell in a tsv file, and comes up as nan when that cell was empty in the tsv being parsed.
You are right: such errors are mostly caused by NaN values representing empty cells.
It is common to filter out such data before applying further operations, using this idiom on your DataFrame df:
df_new = df[df['ColumnName'].notnull()]
Alternatively, it may be more convenient to use the fillna() method to impute (replace) null values with a default.
E.g., all nulls or NaNs can be replaced with the average value for the column:
housing['LotArea'] = housing['LotArea'].fillna(housing.mean()['LotArea'])
or can be replaced with a value like empty string "" or another default value
housing['GarageCond']=housing['GarageCond'].fillna("")
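As for why the isinstance check in the original comprehension did not help: the iterable row["language_modifiers"].split("|") is evaluated before the if filter ever runs, so .split() is called on the NaN float first and raises. A sketch of a corrected version that tests the cell before splitting:
value = row["language_modifiers"]
lang_modifiers = [] if isinstance(value, float) else [kw.strip() for kw in value.split("|")]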
You might also use df = df.dropna(thresh=n), where n is the tolerance: a row needs at least n non-NA values to be kept.
Mind you, this approach removes the whole row.
For example, if you have a DataFrame with 5 columns, df.dropna(thresh=5) drops any row that does not have 5 valid (non-NA) values.
In your case, if you only want to keep fully valid rows, set the threshold to the number of columns you have.
pandas documentation on dropna

Trying to divide a dataframe column by a float yields NaN

Background
I deal with a csv datasheet that prints out columns of numbers. I am working on a program that will take the first column, ask a user for a time as a float (i.e. 45 and a half hours = 45.5), and then subtract that number from the first column. I have been successful in that regard. Now I need to find the row index of the "zero" time point. I use min to find that index and then use it to look up the corresponding value in the following column, A1.1. I need the reading at time 0 so I can normalize A1.1 by it, so that on a graph the reading at the zero time point is 1 in column A1.1 (and eventually all subsequent columns, but baby steps for me).
time_zero = float(input("Which time would you like to be set to 0?"))
df['A1']= df['A1']-time_zero
This works fine so far to set the zero time.
zero_location_series = df[df['A1'] == df['A1'].min()]
r1 = zero_location_series[' A1.1']
df[' A1.1'] = df[' A1.1']/r1
Here's where I run into trouble. The first line will correctly identify a series that I can pull off of for all my other columns. Next r1 correctly identifies the proper A1.1 value and this value is a float when I use type(r1).
However when I divide df[' A1.1']/r1 it yields only one correct value and that value is where r1/r1 = 1. All other values come out NaN.
My Questions:
How to divide a column by a float I guess? Why am I getting NaN?
Is there a faster way to do this as I need to do this for 16 columns.(ie 'A2/r2' 'a3/r3' etc.)
Do I need to do inplace = True anywhere to make the operations stick prior to resaving the data? or is that only for adding/deleting rows?
Example
Dataframe that looks like this
(image: http://i.imgur.com/ObUzY7p.png)
zero time sets properly (image not shown)
after dividing the column
(image: http://i.imgur.com/TpLUiyE.png)
This should work:
df['A1.1']=df['A1.1']/df['A1.1'].min()
I think the reason df[' A1.1'] = df[' A1.1']/r1 did not work is that r1 is a Series, not an individual float. Try r1? instead of type(r1) and pandas will tell you so. When you divide one Series by another, pandas aligns them on the index, so only the row whose index label matches r1's gets a value; every other row comes out NaN.
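A sketch of that fix, reusing the lookup from the question: extract the scalar with .iloc[0] before dividing, so pandas does not try to align indexes:
r1 = zero_location_series[' A1.1'].iloc[0]  # now a plain float
df[' A1.1'] = df[' A1.1'] / r1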
To do this for every column in one pass, iterate over them:
for c in df:
    df[c] = df[c]/df[c].min()
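Since df.min() returns a Series of per-column minima that pandas broadcasts across the columns, a fully vectorized equivalent (a sketch) is:
df = df / df.min()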
If you want to divide every value in the column by r1, apply is one option, for example:
import pandas as pd
df = pd.DataFrame([1,2,3,4,5])
# apply an anonymous function to the first column ([0]), divide every value
# in the column by 3
df = df[0].apply(lambda x: x / 3.0)
print(df)
So you'd probably want something like this:
df = df["A1.1"].apply(lambda x: x/r1, 0)
This really only answers part 2 of your question. apply is probably your best bet for running a function over many rows and columns quickly. As for why you're getting NaNs when dividing by a float: is it possible the values in your columns are something other than floats or integers?

How to impute each categorical column in numpy array

There are good solutions for imputing a pandas DataFrame. But since I work mainly with numpy arrays, I have to create a new pandas DataFrame object, impute, and then convert back to a numpy array, as follows:
nomDF = pd.DataFrame(x_nominal)  # convert np.array to pd.DataFrame
nomDF = nomDF.apply(lambda x: x.fillna(x.value_counts().index[0]))  # replace NaN with the most frequent value in each column
x_nominal = nomDF.values  # convert back from pd.DataFrame to np.array
Is there a way to directly impute in numpy array?
We could use SciPy's mode to get the most frequent value in each column. The leftover work is to find the NaN indices and replace those entries in the input array with the column modes by indexing.
So, the implementation would look something like this -
import numpy as np
from scipy.stats import mode

R, C = np.where(np.isnan(x_nominal))
vals = mode(x_nominal, axis=0)[0].ravel()
x_nominal[R, C] = vals[C]
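Here is a minimal end-to-end sketch on a toy float array (note: recent SciPy versions accept a nan_policy argument for mode, and you may need nan_policy='omit' so the NaNs themselves are not counted; that argument is an assumption about your SciPy version):
import numpy as np
from scipy.stats import mode

x = np.array([[1., 8.],
              [np.nan, 8.],
              [1., np.nan],
              [1., 5.]])
R, C = np.where(np.isnan(x))                           # positions of the NaNs
vals = mode(x, axis=0, nan_policy='omit')[0].ravel()   # per-column modes, ignoring NaNs
x[R, C] = vals[C]                                      # fill each NaN with its column's mode
print(x)                                               # the NaNs become 1. and 8.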
Please note that with pandas' value_counts we would be choosing the highest value among categories/elements tied for the same highest count, whereas SciPy's mode picks the lowest one in such tie cases.
If you are dealing with such a mixed dtype of strings and NaNs, I would suggest a few modifications, keeping the last step unchanged, to make it work -
x_nominal_U3 = x_nominal.astype('U3')
R,C = np.where(x_nominal_U3=='nan')
vals = mode(x_nominal_U3,axis=0)[0].ravel()
This throws a warning for the mode calculation: RuntimeWarning: The input array could not be properly checked for nan values. nan values will be ignored. But since we actually want to ignore NaNs for that mode calculation, we should be okay there.

In Python, how do I select the columns of a dataframe satisfying a condition on the number of NaN?

I hope someone could help me. I'm new to Python, and I have a dataframe with 111 columns and over 40 000 rows. All the columns contain NaN values (some columns contain more NaN's than others), so I want to drop those columns having at least 80% of NaN values. How can I do this?
To solve my problem, I tried the following code
df1=df.apply(lambda x : x.isnull().sum()/len(x) < 0.8, axis=0)
The function x.isnull().sum()/len(x) is to divide the number of NaN in the column x by the length of x, and the part < 0.8 is to choose those columns containing less than 80% of NaN.
The problem is that when I run this code I only get the names of the columns together with the boolean "True" but I want the entire columns, not just the names. What should I do?
You could do this:
filt = df.isnull().sum()/len(df) < 0.8
df1 = df.loc[:, filt]
You want to achieve two things. First, you have to find all columns that contain fewer than 80% NaNs. Second, you want to keep only those columns in your DataFrame.
To get a pandas Series indicating whether a column should be kept, you can do:
df1 = df.isnull().sum(axis=0) < 0.8*df.shape[0]
(Btw. you have a typo in your question: you should drop the ==True, as it always tests whether 0.8==True.)
This will give True for every column to keep, since .isnull() gives True (or 1) for a NaN and False (or 0) for a valid value. The .sum(axis=0) then sums down each column, giving the number of NaNs per column, and the comparison checks whether that count is below 80% of the number of rows.
For the second task, you can use this to index your columns by using:
df = df[df.columns[df1]]
or as suggested in the comments by doing:
df.drop(df.columns[df1==False], axis=1, inplace=True)
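As a one-liner alternative (a sketch; the boundary at exactly 80% NaN differs slightly from the < 0.8 comparison above), dropna with a thresh can do the same job by keeping only columns that have at least 20% non-NA values:
df = df.dropna(axis=1, thresh=int(0.2 * len(df)))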
