Ghost NaN values in Pandas DataFrame, strange behaviour with Numpy - python

This is a very strange problem; I've tried a lot of things but I can't find a way to solve it.
I have a DataFrame with data collected from an API: no problem with that. Then I use a library, pandas-ta (https://github.com/twopirllc/pandas-ta), which adds new columns to the DataFrame.
Of course, sometimes there are NaN values in the new columns (there are a lot of reasons, but the main one is that some indicators are length-based).
Basic problem, so basic solution: just call df.fillna(0, inplace=True) and it works!
But when I check df.values (or the conversion with to_numpy()) there are still NaN values.
Properties of the problem:
- NaN not found with np.where() in the array, both with np.nan and pandas-ta's npNaN
- df.isna().any().any() returns False
- the NaNs are float values, not strings
- the array has dtype object
- I tried various methods to replace the NaNs, not only fillna, but since they are not recognized, nothing works at all
- I also thought it was because of large numbers, but using to_numpy(dtype='float64') gives the same problem
So these values appear only after conversion to a numpy array, and they are not recognized there.
They also show up when I apply PCA to my dataset, where I get an error message because of the NaNs.
Thanks a lot for your time, and sorry for the mistakes, I'm not a native speaker.
Have a good day y'all.
Edit :
Here is a screenshot of the operations I'm doing and the printed result; you can see one NaN value.
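(For anyone debugging something similar, here is a minimal sketch, with made-up data rather than the API output, of why equality-based searches never find NaN and of one way to coerce an object-dtype frame back to numeric before filling:)

import numpy as np
import pandas as pd

# Made-up frame with an object-dtype column, standing in for the pandas-ta output.
df = pd.DataFrame({"a": [1.0, np.nan, 3.0]}, dtype=object)
arr = df.to_numpy()

print(arr.dtype)                  # object
print(np.where(arr == np.nan))    # always empty: NaN != NaN by definition
print(pd.isna(arr).any())         # element-wise NaN check that works on object arrays

# Coerce to numeric first, so the fill actually reaches every value,
# then convert to a float array before numpy/PCA.
clean = df.apply(pd.to_numeric, errors="coerce").fillna(0)
print(clean.to_numpy(dtype="float64"))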

Related

pandas df.any() method returns True/False instead of the expected NaN

I am trying to get into pandas, so I'm working on some free data from Kaggle.
Basically, I have a DataFrame with a few rows containing only NaNs. I want to create a new column using df.any(). I expect it to return True or False in the other rows, but here is something I can't understand (and be sure I spent about 2 days debugging, reading all the documentation, online testing, forums, even Stack Overflow, etc. before I decided to write this question):
If I run the same code in an online Jupyter editor, df.any(axis=1, skipna=False) does the following:
- for rows containing at least one True value, it returns True in the new column
- for rows with no True value (but some other values present), it returns False in the new column
- for rows that contain only NaNs, it returns NaN in the new column
THIS IS THE BEHAVIOR I EXPECTED AND I WANT (ABOVE)
If I copy the code from the online Jupyter editor and run it locally on my notebook, the result looks like this:
- for rows containing at least one True value, it returns True in the new column
- for rows with no True value (but some other values present), it returns False in the new column
- for rows that contain only NaNs, it returns:
  - True, if I set skipna=False
  - False, if I don't pass skipna=False, which per the documentation means skipna is True.
I thought I was going insane (I guess you understand the stage of my insanity at this point), but as the data are free, I am posting a few screenshots to show you.
In both cases above, I expect NaN values in the column 'dissatisfied'.
Of course, I simplified the description a bit to make my point clearer. What I'm really doing (as can be seen in the screenshots) is creating a new column ['dissatisfied'] by inspecting only two columns ['Contributing Factors. Dissatisfaction', 'Contributing Factors. Job Dissatisfaction'] with the df.any() method. Still, I expect NaN to be returned for those rows.
But I only get this output online, not in Jupyter on my notebook.
I've read all of the documentation. I don't think it's caused by different pandas versions; based on my research, df.any() should NEVER, for ANY version, return True/False for rows containing only NaNs.
Can you guys please explain this? Or am I really missing something that is causing this? You will make my weekend much easier if you can provide an explanation/solution.
This is probably because you are using different versions of Python or pandas. In the documentation you can check the behavior of the any method for various versions.
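As an illustration, here is a small sketch (with invented data, not the Kaggle survey) of what a recent pandas release does for an all-NaN row; older releases could propagate NaN instead, which is why comparing pd.__version__ between the two environments is the first thing to check:

import numpy as np
import pandas as pd

print(pd.__version__)   # compare this between the online editor and the local notebook

df = pd.DataFrame({
    "Contributing Factors. Dissatisfaction": [True, False, np.nan],
    "Contributing Factors. Job Dissatisfaction": [False, False, np.nan],
})

# On recent pandas the all-NaN row gives True with skipna=False (NaN is truthy)
# and False with the default skipna=True (nothing is left after skipping NaN).
print(df.any(axis=1, skipna=False))
print(df.any(axis=1))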

Problem with pandas RuntimeWarning

I have recently started learning pandas and I was trying to analyze the Stack Overflow developer survey. I am trying to learn the groupby function:
import pandas as pd

# df is the developer-survey DataFrame loaded earlier
country_grp = df.groupby(['Country'])
country_grp.get_group('China')
ed = country_grp['EdLevel'].value_counts()
salary = country_grp['ConvertedComp'].value_counts()
response = country_grp['Country'].value_counts()
combine = pd.concat([ed, response, salary], axis='columns', sort=False)
combine
After this line it's giving me this warning:
RuntimeWarning: The values in the array are unorderable. Pass
`sort=False` to suppress this warning. uniq_tuples =
lib.fast_unique_multiple([self._values, other._values], sort=sort)
It gives me the data frame, but all the rows for the ['Country'] column are NaN. Can someone please guide me on how to solve this?
I am not sure if the problem is the same as mine since there is no underlying data, but I got a similar error, which was due to the index data type of each dataframe passed to concat being different, even though they look the same. After making all the index data types the same, the warning and the NaNs disappear.
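A small invented illustration of that point (the labels and values are placeholders, not the survey data):

import pandas as pd

# The two indexes carry the "same" labels, but one is int64 and the other
# holds strings, so concat cannot align them and pads the result with NaN.
ed = pd.Series([10, 5], index=pd.Index([2018, 2019]))
salary = pd.Series([3, 2], index=pd.Index(["2018", "2019"]))
print(pd.concat([ed, salary], axis="columns", sort=False))

# Normalising the index dtypes restores the expected alignment.
salary.index = salary.index.map(int)
print(pd.concat([ed, salary], axis="columns", sort=False))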

How to deal with NaN values in pandas (from csv file)?

I have a fairly large csv file filled with data obtained from a machine for material testing (compression test). The headers of the data are Time, Force, Stroke and they are repeated 10 times because of the sample size, so the last set of headers is Time.10, Force.10, Stroke.10.
Because of the nature of the experiment not all columns are equally long (some are approx. 2000 rows longer than others). When I import the file into my IDE (Spyder or Jupyter) using pandas, all the cells within the rows that are empty in the csv file are labeled as NaN.
The problem is... I can't do any mathematical operations within or between columns that have NaN values, as they are treated as str. I have tried the most recommended solutions on pretty much all forums: .fillna(), dropna(), replace() and interpolate(). The mentioned methods work, but only visually; e.g. df.fillna(0) replaces all NaN values with 0, but when I then try to, e.g., find the max value in a column, I still get the error saying there are strings in my column (TypeError: '>' not supported between instances of 'float' and 'str'). The problem is caused 100% by the NaN values that result from the empty cells in the csv file, as I have imported a csv file in which all columns were the same length (with no empty cells) and there were no problems. If anyone has any solution to this problem (it doesn't need to be within pandas, just within Python), which I've been stuck on for over 2 weeks, I would be grateful.
Try read_csv() with na_filter=False.
This should at least prevent "empty" source cells from being set to NaN.
But note that:
- such "empty" cells can have an empty string as their content,
- the type of each column containing at least one such cell is object (not a number),
- so that (for the time being) they can't take part in any numeric operations.
So probably, after read_csv() you should:
- replace such empty strings with e.g. 0 (or whatever numeric value), and
- call to_numeric(...) to change the type of each column from object to whatever numeric type is appropriate in each case.
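A sketch of those two steps, assuming a placeholder file name and the Force header from the question:

import pandas as pd

# "compression_test.csv" is a placeholder for the actual file.
df = pd.read_csv("compression_test.csv", na_filter=False)

# Empty source cells arrive as empty strings, so replace them with 0
# and coerce the affected object columns back to numeric dtypes.
df = df.replace("", 0)
df = df.apply(pd.to_numeric, errors="coerce")

print(df.dtypes)
print(df["Force"].max())   # numeric comparisons work again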

What is the best way to calculate the mean of the values of a pandas dataframe with np.nan in it?

I'm trying to calculate the mean of the values (all of them numeric, not like in the 'How to calculate the mean of a pandas DataFrame with NaN values' question) of a pandas dataframe containing a lot of np.nan values.
I've come up with this code, which works quite well by the way:
import numpy as np
import pandas as pd

my_df = pd.DataFrame([[0, 10, np.nan, 220], [1, np.nan, 21, 221],
                      [2, 12, 22, np.nan], [np.nan, 13, np.nan, np.nan]])
print(my_df.values.flatten()[~np.isnan(my_df.values.flatten())].mean())
However, I found that this line of code gives the same result, and I don't understand why:
print(my_df.values[~np.isnan(my_df.values)].mean())
Is this really the same, and can I use it safely?
I mean, my_df.values[~np.isnan(my_df.values)] is still an array that is not flat, so what happened to the np.nan values in it?
Any improvement is welcome if you see a more efficient and pythonic way to do that.
Thanks a lot.
Is this really the same, and can I use it safely?
Yes: boolean indexing a 2D array with a 2D mask already returns the selected elements as a flat 1D array, so numpy masks away the NaNs and then calculates the mean over that flat array. But you are overcomplicating things here.
You can use numpy's nanmean(..) [numpy-doc] here:
>>> np.nanmean(my_df)
52.2
The NaN values are thus not taken into account (neither in the sum nor in the count of the mean). I think this is more declarative than calculating the mean with masking, since the above says what you are doing rather than how you are doing it.
In case you want the NaNs to be counted (as zeros), you can replace them with 0, as #abdullah.cu says:
>>> my_df.fillna(0).values.mean()
32.625
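For completeness, a quick check of the sub-question about the mask: boolean indexing with a 2D mask already returns a flat 1D array, so nothing is lost:

import numpy as np
import pandas as pd

my_df = pd.DataFrame([[0, 10, np.nan, 220], [1, np.nan, 21, 221],
                      [2, 12, 22, np.nan], [np.nan, 13, np.nan, np.nan]])

# Boolean indexing of a 2D array with a 2D mask returns the selected
# elements as a flat 1D array, so flattening first changes nothing.
vals = my_df.values
masked = vals[~np.isnan(vals)]
print(masked.shape)        # (10,) -- already one-dimensional
print(masked.mean())       # 52.2
print(np.nanmean(vals))    # 52.2 as well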

How to ignore NaN in the dataframe for Mann-whitney u test?

I have a dataframe as shown below.
I want the p-value of a Mann-Whitney U test comparing the columns.
As an example, I tried the following.
from scipy.stats import mannwhitneyu
mannwhitneyu(df['A'], df['B'])
This results in the following values.
MannwhitneyuResult(statistic=3.5, pvalue=1.8224273379076809e-05)
I wondered whether NaN affected the result, so I made the df2 and df3 dataframes described in the figure and tried the following.
mannwhitneyu(df2, df3)
This resulted in
MannwhitneyuResult(statistic=3.5, pvalue=0.00025322465545184154)
So I think NaN values affected the result.
Does anyone know how to ignore NaN values in the dataframe?
You can use df.dropna(); you can find extensive documentation here: dropna.
As per your example, the syntax would go something like this:
mannwhitneyu(df['A'].dropna(), df['B'].dropna())
As you can see, there is no argument in the mannwhitneyu function allowing you to specify its behavior when it encounters NaN values, but if you inspect its source code, you can see that it doesn't take NaN values into account when calculating some of the key values (n1, n2, ranked, etc.). This makes me suspicious of any results that you'd get when some of the input values are missing. If you don't feel like implementing the function yourself with NaN-ignoring capabilities, probably the best thing to do is to either create new arrays without missing values as you've done, or use df['A'].dropna() as suggested in the other answer.
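A self-contained sketch of that approach, with invented numbers standing in for the dataframe from the figure:

import numpy as np
import pandas as pd
from scipy.stats import mannwhitneyu

# Invented values; the real df comes from the figure in the question.
df = pd.DataFrame({"A": [1.2, 3.4, np.nan, 2.2, 5.1],
                   "B": [2.0, np.nan, 4.4, 3.3, 1.8]})

# mannwhitneyu has no NaN handling of its own, so drop missing values
# from each sample independently before passing them in.
result = mannwhitneyu(df["A"].dropna(), df["B"].dropna())
print(result)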
