I have a DataFrame which has a lot of NAs. pandas's groupby operation is ignoring any combinations with NA in it. Is there a way to include NAs in groups? If not, what are the alternatives to pandas groupby? I really don't want to fill in NAs because the fact that something is missing is useful information.
Edit: I noticed that my question is exactly the same issue reported in groupby columns with NaN (missing) values
Have there been any developments to get around this issue?
I would use some kind of non-NA placeholder for NA, only for the groupby, that can't be confused with real data (e.g. -999999 or 'missing'):
df.fillna(-999999).groupby(...)
Since the inplace argument defaults to False, your original DataFrame will not be affected.
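A minimal sketch of the idea (column and sentinel names are arbitrary):

import numpy as np
import pandas as pd

df = pd.DataFrame({"key": ["a", np.nan, "a", np.nan], "val": [1, 2, 3, 4]})

# use a sentinel that cannot collide with real data, only for the groupby
out = df.fillna({"key": "missing"}).groupby("key")["val"].sum()
print(out)  # 'missing' now appears as its own group

# note: on pandas >= 1.1 there is also df.groupby("key", dropna=False),
# which keeps the NaN group without any sentinel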
This is a very strange problem. I have tried a lot of things, but I can't find a way to solve it.
I have a DataFrame with data collected from an API (no problem with that). Then I use the pandas-ta library (https://github.com/twopirllc/pandas-ta), which adds new columns to the DataFrame.
Of course, the new columns sometimes contain NaN values (there are a lot of reasons, but the main one is that some indicators are length-based).
A basic problem, so a basic solution: just type df.fillna(0, inplace=True) and it works!
But when I check df.values (or the to_numpy() conversion), there are still NaN values.
Properties of the problem:
- NaN is not found with np.where() in the array, testing both np.nan and pandas-ta's npNaN
- df.isna().any().any() returns False
- the NaNs are float values, not strings
- the array's dtype is object
- I tried various methods to replace the NaNs, not only fillna, but since they are not recognized, none of them work
- I also thought it was because of large numbers, but to_numpy(dtype='float64') gives the same problem
So these values are only there after conversion to a numpy array, and they are not recognized.
They are also there when I apply PCA to my dataset, where I get an error message because of the NaNs.
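For reference, here is a consolidated sketch of the checks above (run against the df from the pipeline); the x != x test catches any genuine float NaN, even inside an object-dtype column:

import numpy as np

print(df.isna().any().any())   # returns False for me
arr = df.to_numpy()
print(arr.dtype)               # object

# a float NaN is the only value that is not equal to itself
mask = df.applymap(lambda x: isinstance(x, float) and x != x)
print(df.columns[mask.any()])  # columns hiding NaN-like values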
Thanks a lot for your time, and sorry for the mistakes; I'm not a native speaker.
Have a good day y'all.
Edit:
Here is a screenshot of the operations I'm doing and the printed result; you can see one NaN value.
I have a problem: I can't find the NaN values that appear when I use describe() on my DataFrame.
I'm working with Jupyter.
Here is what it looks like:
[screenshot: df.describe() output]
And when I use the .isnull() and info() functions, I get:
[screenshot: isnull() output]
[screenshot: info() output]
Can you help me, please?
As you confirmed, it seems that there are no NaNs in df.
I think you are confused about what df.describe() returns: a summary of the DataFrame.
df is not df.describe().
When you use describe(), a NaN in the output means the statistic is impossible to calculate. For example, some columns hold object-type values, such as Month, Date, and Age_Group. It is impossible to calculate the mean of Month because Month's data type is object, not int.
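A small illustration with toy columns (not your data):

import pandas as pd

df = pd.DataFrame({"Month": ["Jan", "Feb", "Mar"], "Age": [30, 40, 50]})
print(df.describe(include="all"))
# mean/std come back NaN for Month (an object column has no mean), and
# unique/top/freq come back NaN for Age; the NaNs exist only in the
# summary table, not in df itself:
print(df.isna().any().any())  # False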
You're using describe() to get basic descriptive statistics in order to make inferences. Some cells in its output, such as unique for a numeric year column, are NaN simply because that statistic isn't computed for that dtype. So yes, there aren't any NaNs in your dataset.
I am doing some work with pandas in Python and came across a dataset that needs to be cleaned.
It looks like this:
[screenshot of the dataset]
I need to merge the rows at index 0, 1, 2, 3 into a single row, skipping NaN values. After doing this, I want to remove the default header and make the newly made row the column names.
I tried the groupby operation in pandas, but nothing happens.
Any idea on this?
Thanks
Akhi
# join the non-NaN values of each column into one string, then build a
# single-row frame; new_column_names is your own list of header names
merged = df.aggregate(lambda col: " ".join(str(v).strip() for v in col if not pd.isna(v)))
pd.DataFrame([merged.tolist()], columns=new_column_names)
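For example, with a toy frame (values invented):

import numpy as np
import pandas as pd

df = pd.DataFrame({"A": ["Store", np.nan, "Name", np.nan],
                   "B": [np.nan, "Total", np.nan, "Sales"]})
new_column_names = ["col1", "col2"]
merged = df.aggregate(lambda col: " ".join(str(v).strip() for v in col if not pd.isna(v)))
print(pd.DataFrame([merged.tolist()], columns=new_column_names))
#          col1         col2
# 0  Store Name  Total Sales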
Can I create another index on an existing column of pandas DataFrame? Just like what CREATE INDEX in SQL does. For example: My DataFrame has two columns id_a and id_b, both of them are unique for each row, and I'd like to index rows sometimes with id_a while other times with id_b (so I think MultiIndex won't work for me). I want these operations to be fast, so "index" must be created for both id_a and id_b.
Is it possible I can do this in pandas currently?
You can't have 2 indices in a Pandas DataFrame object.
You will have to work around this limitation, for example:
- by code logic,
- by using other pandas features,
- by using columns and flags, depending on which column needs to be used to index a given row.
The operations should be fast, and you can adjust the dtypes to your needs for additional performance. To match hashmap-style lookups, you will have to think through your use case and perhaps use a different approach, such as a separate mapping/dict.
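One possible sketch of the "other pandas features" route (names invented): keep two pre-indexed views of the same data, so each lookup goes through a hash-backed index:

import pandas as pd

df = pd.DataFrame({"id_a": [10, 20], "id_b": ["x", "y"], "val": [1.5, 2.5]})

by_a = df.set_index("id_a")  # view keyed by id_a
by_b = df.set_index("id_b")  # second view keyed by id_b

print(by_a.loc[20, "val"])   # 2.5
print(by_b.loc["x", "val"])  # 1.5

This trades some memory for fast label lookups on either key.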
I have a DataFrame with columns of different dtypes, and I need to use pandas.query to filter it.
Columns can include missing values: NaN, None, and NaT, and I need to display the rows that contain such values. Is there a way to do this in an expression passed to pandas.query? I am aware that it can be done using other methods, but I need to know whether it is doable through query.
For boolean columns I was able to use a workaround by stating:
df.query('col not in (True, False)')
but this won't work for other types of columns. Any help is appreciated, including workarounds.
NaN is not equal to itself, so you can simply test whether a column is unequal to itself to keep the rows with missing values. This also seems to work for None, although I'm not sure why; it may be getting cast to NaN at some point during the evaluation.
df.query('col != col')
For datetimes the following also works, but it feels pretty hacky; there might be a better way.
df.query('col in [@pd.NaT]')
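A quick sketch checking both tricks on a toy frame (column names invented):

import numpy as np
import pandas as pd

df = pd.DataFrame({
    "num": [1.0, np.nan, 3.0],
    "when": pd.to_datetime(["2021-01-01", None, "2021-01-03"]),
})

print(df.query("num != num"))    # the row where num is NaN
print(df.query("when != when"))  # NaT != NaT is also True, so the same
                                 # trick catches the missing datetime
# df.query("when in [@pd.NaT]") returns the same row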