Value "not a number" in my dataframe not found - python

I have a problem because I can't find the NaN values that appear when I use describe() on my dataframe.
I'm working with Jupyter.
Here is what it looks like:

[screenshot: output of df.describe()]

And when I use the .isnull() and info() functions I get:

[screenshot: output of df.isnull()]
[screenshot: output of df.info()]

Can you help me, please?

As you confirmed, it seems there are no NaN values in df.
I think you are confused about what df.describe() returns: it returns a summary of the dataframe, not the dataframe itself.
df is not df.describe().
In the output of describe(), a NaN means the statistic could not be calculated for that column. For example, columns such as Month, Date, and Age_Group have object dtype, so it is impossible to calculate a mean for them; those cells are therefore NaN.
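A minimal sketch of this behaviour (the column names are borrowed from the question, the values are made up):

import pandas as pd

df = pd.DataFrame({
    "Month": ["Jan", "Feb", "Mar"],            # object dtype
    "Age_Group": ["18-25", "26-35", "18-25"],  # object dtype
    "Sales": [100, 150, 120],                  # numeric dtype
})

# include='all' reports every statistic for every column; statistics that
# do not apply to a column's dtype show up as NaN (mean/std for the object
# columns, unique/top/freq for the numeric one).
print(df.describe(include="all"))

# The data itself contains no missing values:
print(df.isnull().sum())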

You're using describe() to get basic descriptive statistics so you can make inferences about your data. The NaNs it shows are not missing values in your data; they are statistics that don't apply to a given column. For example, 'unique' is only computed for non-numeric columns, so it shows up as NaN for a numeric column such as the year. So yes, there aren't any NaNs in your dataset.

Related

Ghost NaN values in Pandas DataFrame, strange behaviour with Numpy

This is a very strange problem; I have tried a lot of things but I can't find a way to solve it.
I have a DataFrame with data collected from an API (no problem with that). Then I use the pandas-ta library (https://github.com/twopirllc/pandas-ta), which adds new columns to the DataFrame.
Of course, sometimes there are NaN values in the new columns (for many reasons, but mainly because some indicators are length-based).
Basic problem, so basic solution: just call df.fillna(0, inplace=True) and it works!
But when I check df.values (or the conversion with to_numpy()) there are still NaN values.
Properties of the problem:
- the NaNs are not found with np.where() in the array, neither with np.nan nor with pandas-ta's npNaN
- df.isna().any().any() returns False
- the NaNs are float values, not strings
- the array's dtype is object
- I tried various methods to replace the NaNs, not only fillna, but since they are not recognized none of them work
- I also thought it was because of large numbers, but using to_numpy(dtype='float64') gives the same problem
So these values appear only when the DataFrame is converted to a numpy array, and they are not recognized as NaN.
They also show up when I run PCA on my dataset, where I get an error message because of the NaNs.
Thanks a lot for your time, and sorry for the mistakes, I'm not a native speaker.
Have a good day y'all.
Edit:
Here is a screenshot of the operations I'm doing and the printed result; you can see one NaN value.
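For reference, here is a minimal sketch of why NaNs become hard to find once an array has dtype object; it is illustrative only and does not reproduce the exact problem above:

import numpy as np
import pandas as pd

df = pd.DataFrame({"a": [1.0, np.nan, 3.0]}).astype(object)
arr = df.to_numpy()
print(arr.dtype)                      # object

# Equality never matches a NaN, because NaN != NaN by definition,
# so a lookup like this always comes back empty:
print(np.where(arr == np.nan))

# And np.isnan refuses to work on an object array:
try:
    np.isnan(arr)
except TypeError as err:
    print("TypeError:", err)

# Casting to float makes the NaN detectable again (in this toy case):
print(np.isnan(arr.astype("float64")))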

Plotting non-numerical data in python

I'm a beginner in coding and I wrote some code with Python pandas that I don't fully understand, and I need some clarification.
Let's say this is the data: DeathYear, Age, Gender and Country are all columns in an Excel file.
I saw the question "How to plot a table with non-numeric values in python?" and I used this command:
df.groupby('Gender')['Gender'].count().plot.pie(autopct='%.2f',figsize=(5,5))
It works and gives me a pie chart with the percentage of each gender, but the normal pie chart command that I know for numerical data looks like this:
df["Gender"].plot.pie(autopct="%.2f",figsize=(5,5))
My question is: why did we add .count()? Is it to transform non-numerical data into numerical data? And why did we use groupby and type the column twice, ('Gender')['Gender']?
I'll address the second part of your question first, since it makes more sense to explain it that way.
The reason that you use ('Gender')['Gender'] is that it does two different things. The first ('Gender') is the argument to the groupby function. It tells you that you want the DataFrame to be grouped by the 'Gender' column. Note that the groupby function needs to have a column or level to group by or else it will not work.
The second ['Gender'] tells you to only look at the 'Gender' column in the resulting DataFrame. The easiest way to see what the second ['Gender'] does is to compare the output of df.groupby('Gender').count() and df.groupby('Gender')['Gender'].count() and see what happens.
One detail that I omitted in the first part for clarity is that the output of df.groupby('Gender') is not a DataFrame, but actually a DataFrameGroupBy object. The details of what exactly this object is are not important to your question; the key is that to get a DataFrame back you need a function that tells pandas what to put in the rows of the DataFrame you wish to create. The .count() function is one of those options (along with many others such as .mean(), etc.).

In your case, since you want the total counts to make a pie chart, .count() does exactly that: it counts the number of times 'Female' and 'Male' appear in the 'Gender' column, and those sums become the entries in the corresponding rows. The result can then be used to create a pie chart. So you are correct in that .count() transforms the non-numeric 'Female' and 'Male' entries into numeric values, which correspond to how often those entries appeared in the initial DataFrame.
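To make that comparison concrete, here is a small sketch with made-up data standing in for the Excel sheet described in the question:

import pandas as pd

df = pd.DataFrame({
    "DeathYear": [1910, 1920, 1930, 1940],
    "Age":       [70, 65, 80, 75],
    "Gender":    ["Female", "Male", "Female", "Female"],
    "Country":   ["UK", "US", "UK", "FR"],
})

# Counts every column within each gender group -> a DataFrame
print(df.groupby("Gender").count())

# Counts only the 'Gender' column within each group -> a Series,
# which is exactly what .plot.pie() needs (Female: 3, Male: 1)
counts = df.groupby("Gender")["Gender"].count()
print(counts)

# counts.plot.pie(autopct="%.2f", figsize=(5, 5))   # same chart as in the question
# df["Gender"].value_counts() would give the same Series in one step.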

What is the best way to calculate the mean of the values of a pandas dataframe with np.nan in it?

I'm trying to calculate the mean of the values (all of them numeric, not like in the 'How to calculate the mean of a pandas DataFrame with NaN values' question) of a pandas dataframe containing a lot of np.nan in it.
I came up with this code, which works quite well by the way:
import numpy as np
import pandas as pd

my_df = pd.DataFrame([[0, 10, np.nan, 220],
                      [1, np.nan, 21, 221],
                      [2, 12, 22, np.nan],
                      [np.nan, 13, np.nan, np.nan]])
print(my_df.values.flatten()[~np.isnan(my_df.values.flatten())].mean())
However, I found that this line of code gives the same result, and I don't understand why:
print(my_df.values[~np.isnan(my_df.values)].mean())
Is this really the same, and can I use it safely?
I mean, my_df.values[~np.isnan(my_df.values)] indexes an array that is not flat, so what happened to the np.nan values in it?
Any improvement is welcome if you see a more efficient and pythonic way to do that.
Thanks a lot.
Is this really the same, and can I use it safely?
Yes: numpy masks away the NaNs here and then calculates the mean over that array. But you are making it more complicated than it needs to be.
You can use numpy's nanmean(..) [numpy-doc] here:
>>> np.nanmean(my_df)
52.2
The NaN values are thus not taken into account (neither in the sum nor in the count of the mean). I think this is more declarative than calculating the mean with masking, since the above says what you are doing rather than how you are doing it.
In case you want the NaNs to count towards the total, you can replace them with 0, as #abdullah.cu says:
>>> my_df.fillna(0).values.mean()
32.625
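To see why the two expressions in the question are equivalent, here is a small sketch (same frame as above): boolean indexing with a 2-D mask already returns a flat 1-D array of the selected elements, so the extra flatten() changes nothing.

import numpy as np
import pandas as pd

my_df = pd.DataFrame([[0, 10, np.nan, 220],
                      [1, np.nan, 21, 221],
                      [2, 12, 22, np.nan],
                      [np.nan, 13, np.nan, np.nan]])

values = my_df.values            # shape (4, 4)
mask = ~np.isnan(values)         # 2-D boolean mask

selected = values[mask]          # already 1-D
print(selected.shape)            # (10,)
print(selected.mean())           # 52.2
print(np.nanmean(values))        # 52.2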

How to ensure that a column in a dataframe loaded from a csv file is formatted as an integer (without decimal characters)

I am using Python 3.7
I need to load data from two different sources (both csv) and determine which rows from the one source are not in the second source.
I have used pandas data-frames to load the data and do a comparison between the two sources of data.
When I load the data from the csv file, a value like 2010392 turns into 2010392.0 in the data-frame column.
I have read quite a number of articles about formatting data-frame columns; unfortunately, most of them are about date and time conversions.
I came across the article "Format integer column of Data-frame in Python pandas" at http://www.datasciencemadesimple.com/format-integer-column-of-dataframe-in-python-pandas/, which does not solve my problem.
Based on the above mentioned article I have tried the following:
pd.to_numeric(data02['IDDLECT'], downcast='integer')
Out[63]:
0 2010392.0
1 111777967.0
2 2010392.0
3 2012554.0
4 2010392.0
5 2010392.0
6 2010392.0
7 1170126.0
and as you can see, the column values still have a decimal point followed by a zero.
I expect loading the dataframe from a csv file to keep a number such as 2010392 as 2010392, not 2010392.0.
Here is the code that I have tried:
import pandas as pd
data = pd.read_csv("timetable_all_2019-2_groups.csv")
data02 = data.drop_duplicates()
print(f'Len data {len(data)}')
print(data.head(20))
print(f'Len data02 {len(data02)}')
print(data02.head(20))
pd.to_numeric(data02['IDDLECT'], downcast='integer')
Here are a few lines of the content of the csv file:
The data in the one source looks like this:
IDDCYR,IDDSUBJ,IDDOT,IDDGRPTYP,IDDCLASSGROUP,IDDLECT,IDDPRIMARY
019,AAACA1B,VF,C,A1,2010392,Y
2019,AAACA1B,VF,C,A1,111777967,N
2019,AAACA3B,VF,C,A1,2010392,Y
2019,AAACA3B,VF,C,A1,2012554,N
2019,AAACB2A,VF,C,B1,2010392,Y
2019,AAACB2A,VF,P,B2,2010392,Y
2019,AAACB2A,VF,C,B1,2010392,N
2019,AAACB2A,VF,P,B2,1170126,N
2019,AAACH1A,VF,C,A1,2010392,Y
Looks like you have data which is not of integer type. Once loaded you should do something about that data and then convert the column to int.
From your error description, you have nans and/or inf values. You could impute the missing values with the mode, mean, median or a constant value. You can achieve that either with pandas or with sklearn imputer, which is dedicated to imputing missing values.
Note that if you use mean, you may end up with a float number, so make sure to get the mean as an int.
The imputation method you choose really depends on what you'll use this data for later. If you want to understand the data, filling nans with 0 may destroy aggregation functions later (e.g. if you'll want to know what the mean is, it won't be accurate).
That being said, I see you're dealing with categorical data. One option here is to use dtype='category'. If you later fit a model with this and you leave the ids as numbers, the model can conclude weird things which are not correct (e.g. that the sum of two ids equals some third id, or that higher ids are more important than lower ones... things that a priori make no sense and should not be left to chance).
Hope this helps!
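For the dtype='category' suggestion above, a minimal sketch (only the file and column names are taken from the question):

import pandas as pd

# Reading the id column as a category keeps the values as they appear in
# the file (strings), so 2010392 never becomes 2010392.0.
data = pd.read_csv("timetable_all_2019-2_groups.csv",
                   dtype={"IDDLECT": "category"})
print(data["IDDLECT"].dtype)            # category
print(data["IDDLECT"].cat.categories)   # the distinct lecturer ids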
data02['IDDLECT'] = data02['IDDLECT'].fillna(0).astype('int')
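Alternatively, if you want to keep the ids as whole numbers while still allowing missing entries, recent pandas versions have a nullable integer dtype. A minimal sketch, assuming the file name from the question:

import pandas as pd

data = pd.read_csv("timetable_all_2019-2_groups.csv",
                   dtype={"IDDLECT": "Int64"})   # capital I: nullable integer
data02 = data.drop_duplicates()
print(data02["IDDLECT"].head())   # 2010392, 111777967, ... with no trailing .0

# Converting after the fact also works, without imputing a constant:
# data02["IDDLECT"] = data02["IDDLECT"].astype("Int64")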

How to ignore NaN in the dataframe for the Mann-Whitney U test?

I have a dataframe as below.
I want the p-value of a Mann-Whitney U test comparing the columns.
As an example, I tried below.
from scipy.stats import mannwhitneyu
mannwhitneyu(df['A'], df['B'])
This results in the following values.
MannwhitneyuResult(statistic=3.5, pvalue=1.8224273379076809e-05)
I wondered whether the NaNs affected the result, so I made the df2 and df3 dataframes described in the figure and tried the following:
mannwhitneyu(df2, df3)
This resulted in
MannwhitneyuResult(statistic=3.5, pvalue=0.00025322465545184154)
So I think NaN values affected the result.
Does anyone know how to ignore NaN values in the dataframe?
You can use df.dropna(); you can find extensive documentation here: dropna.
As per your example, the syntax would go something like this:
mannwhitneyu(df['A'].dropna(), df['B'].dropna())
As you can see, there is no argument in the mannwhitneyu function allowing you to specify its behavior when it encounters NaN values, but if you inspect its source code, you can see that it doesn't take NaN values into account when calculating some of the key values (n1, n2, ranked, etc.). This makes me suspicious of any results that you'd get when some of the input values are missing. If you don't feel like implementing the function yourself with NaN-ignoring capabilities, probably the best thing to do is to either create new arrays without missing values as you've done, or use df['A'].dropna() as suggested in the other answer.
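A small end-to-end sketch of the dropna approach (the data here is made up, since the original frame is only shown in a figure):

import numpy as np
import pandas as pd
from scipy.stats import mannwhitneyu

df = pd.DataFrame({
    "A": [1.2, 3.4, np.nan, 5.6, 2.1],
    "B": [2.2, np.nan, 4.1, 6.3, 1.9],
})

# Drop the missing values from each column independently before the test;
# mannwhitneyu does not require the two samples to have the same length.
result = mannwhitneyu(df["A"].dropna(), df["B"].dropna())
print(result.statistic, result.pvalue)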
