Groupby Sum Equals 0 When min_count=1 - python

I have a dataframe that contains duplicate column names. I am trying to combine the duplicate columns into a single column using the following command (the dataframe below is for demo only; it doesn't contain duplicate column names, but the same problem occurs with duplicate column names as well).
import numpy as np
import pandas as pd

d = pd.DataFrame({'col1': [1, 2], 'col2': [3, 4]})
d['col2'] = d['col2'].astype(str)
d['col1'] = np.nan
d = d.groupby(lambda x: x, axis=1).sum(min_count=1)
the output is:
col1 col2
0 0.0 3.0
1 0.0 4.0
But I expect the output to be:
col1 col2
0 NaN 3.0
1 NaN 4.0
My hope is that, by using min_count=1, pandas will return NaN when the columns being summed up are all NaN. However, now it is returning 0 instead of NaN. Any idea why?

This depends on your pandas version when you set min_count=1.
If you have a version < 0.22.0, then you would indeed get np.nan when there are fewer than 1 non-NA values.
From version 0.22.0 onwards, the default value has been changed to 0 when there are only NA values.
This is also explained in the documentation.
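For reference, a minimal check of the documented Series.sum behaviour that min_count controls; it only shows what min_count does on a plain Series in pandas 0.22.0 and later, not the groupby case from the question:
import numpy as np
import pandas as pd

s = pd.Series([np.nan, np.nan])
print(s.sum())             # 0.0 -- default min_count=0 since pandas 0.22.0
print(s.sum(min_count=1))  # nan -- fewer than 1 non-NA value, so NaN is returned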

Related

How to replace a string value with the means of a column's groups in the entire dataframe

I have a large dataset with 400 columns and 30,000 rows. The dataset is all numerical, but some columns have weird string values in them (denoted as "#?") instead of being blank. This changes the dtype of the columns that contain "#?" to object (150 columns end up with object dtype).
I need to convert all the columns to float or int dtypes, and then fill the normal NaN values in the data with the means of each column's groups (e.g. the means of X and the means of Y in each column).
col1 col2 col3
X 21 32
X NaN 3
Y NaN 5
My end goal is to apply this to the entire data:
df.groupby("col1").transform(lambda x: x.fillna(x.mean()))
But I can't apply this to the columns that have "#?" in them; they get dropped.
I tried replacing the "#?" with a numerical value and then converting all the columns to float dtype, which works, but the replaced values should also be handled by the code above.
I thought about replacing "#?" with a weird value like -123.456 so that it doesn't get mixed up with actual data points, and then replacing all the -123.456 values with the column group means, but -123.456 would need to be excluded from the mean, and I just don't know how that would work. If I convert it back to NaN again, the dtype changes back to object.
I think the best way to go about it would be to directly replace the "#?" values with the column group means.
Any ideas?
edit: I'm so dumb lol
df = df.replace('#?', '').astype(float, errors='ignore')
this works.
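Whether that one-liner really yields float columns depends on what replaces "#?": an empty string cannot be cast to float, so errors='ignore' may quietly leave those columns as object dtype. A minimal alternative sketch (the small frame below is made up to mirror the example) using pd.to_numeric with errors='coerce', which turns the markers into NaN in one pass:
import numpy as np
import pandas as pd

# Made-up frame mirroring the example: '#?' markers mixed into numeric columns
df = pd.DataFrame({'col1': ['X', 'X', 'Y'],
                   'col2': [21, '#?', np.nan],
                   'col3': [32, 3, 5]})

# Convert every non-key column; anything unparseable ('#?') becomes NaN
num_cols = df.columns.drop('col1')
df[num_cols] = df[num_cols].apply(pd.to_numeric, errors='coerce')
print(df.dtypes)  # col2 and col3 are now numeric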
Use:
print (df)
col1 col2 col3
0 X 21 32
1 X #? 3
2 Y NaN 5
import numpy as np

df = (df.set_index('col1')
        .replace(r'#\?', np.nan, regex=True)
        .astype(float)
        .groupby("col1")
        .transform(lambda x: x.fillna(x.mean())))
print (df)
col2 col3
col1
X 21.0 32.0
X 21.0 3.0
Y NaN 5.0
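A small note on the result above: set_index('col1') leaves col1 as the index of the transformed frame; if it is needed back as an ordinary column, reset_index() restores it:
df = df.reset_index()
print(df)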

Pandas Series skipna not working as expected

So it seems to me that the standard aggregation functions in pandas don't behave very consistently across the three main pandas data types (GroupBy, DataFrame, Series) when trying to include NaNs.
There is already an issue open for the case of GroupBy objects here, but it seems like something similar is happening when using a pandas Series. Minimal example:
import pandas as pd
foo = pd.DataFrame({"user": ["a", "a", "b"], "value": [3, None, 5], "other_value": [None, 3, 6]})
foo
>>>
user value other_value
0 a 3.0 NaN
1 a NaN 3.0
2 b 5.0 6.0
Now when I try to get the maximum per row including NaNs using max(skipna=False) I get the expected:
foo[['value', 'other_value']].max(skipna=False, axis=1)
>>>>
0 NaN
1 NaN
2 6.0
However, when using the Series max operation via apply, it seems to behave inconsistently, depending on whether NaN was the first value in the Series:
foo.apply(lambda x: x[['value', 'other_value']].max(skipna=False), 1)
>>>
0 NaN
1 3.0
2 6.0
Is this a bug or am I doing something wrong?

How to select all elements greater than a given value in a dataframe

I have a csv that is read by my python code and a dataframe is created using pandas.
The CSV file is in the following format:
1 1.0
2 99.0
3 20.0
7 63
My code calculates the percentile and needs to find all rows that have a value in the 2nd column greater than 60.
df = pd.read_csv(io.BytesIO(body), error_bad_lines=False, header=None, encoding='latin1', sep=',')
percentile = df.iloc[:, 1:2].quantile(0.99) # Selecting 2nd column and calculating percentile
criteria = df[df.iloc[:, 1:2] >= 60.0]
While my percentile code works fine, the criteria to find all rows where column 2's value is greater than 60 returns
NaN NaN
NaN NaN
NaN NaN
NaN NaN
Can you please help me find the error?
Just correct the condition inside criteria. Since the second column has index 1, you should write df.iloc[:, 1].
Example:
import pandas as pd
import numpy as np
b =np.array([[1,2,3,7], [1,99,20,63] ])
df = pd.DataFrame(b.T) #just creating the dataframe
criteria = df[ df.iloc[:,1]>= 60 ]
print(criteria)
Why?
It seems the cause lies in the type of the condition. Let's inspect.
Case 1:
type( df.iloc[:,1]>= 60 )
Returns pandas.core.series.Series, so it gives
df[ df.iloc[:,1]>= 60 ]
#out:
0 1
1 2 99
3 7 63
Case 2:
type( df.iloc[:,1:2]>= 60 )
Returns a pandas.core.frame.DataFrame, and gives
df[ df.iloc[:,1:2]>= 60 ]
#out:
0 1
0 NaN NaN
1 NaN 99.0
2 NaN NaN
3 NaN 63.0
Therefore, the difference lies in how the boolean indexer is interpreted: a boolean DataFrame masks the frame element-wise, while a boolean Series selects whole rows.
Always keep in mind that 3 is a scalar, while 3:4 is a slice.
For more info, it is always good to take a look at the official docs on pandas indexing.
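To make the difference concrete, a small sketch (using the same demo data as the example above) contrasting the two indexers; indexing with a boolean DataFrame behaves like df.where, masking cells instead of selecting rows:
import pandas as pd

df = pd.DataFrame({0: [1, 2, 3, 7], 1: [1.0, 99.0, 20.0, 63.0]})

row_mask = df.iloc[:, 1] >= 60     # boolean Series: selects whole rows
cell_mask = df.iloc[:, 1:2] >= 60  # boolean DataFrame: masks element-wise

print(df[row_mask])   # only rows 1 and 3
print(df[cell_mask])  # effectively df.where(cell_mask): everything else becomes NaN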
Your indexing is a bit off, since you only have two columns [0, 1] and you are interested in selecting just the one with index 1. As @applesoup mentioned, the following is just enough:
criteria = df[df.iloc[:, 1] >= 60.0]
However, I would consider naming columns and just referencing based on name. This will allow you to avoid any mistakes in case your df structure changes, e.g.:
import pandas as pd
df = pd.DataFrame({'a': [1, 2, 3, 7], 'b': [1.0, 99.0, 20.0, 63.]})
criteria = df[df['b'] >= 60.0]
People here seem to be more interested in coming up with alternative solutions instead of digging into his code in order to find out what's really wrong. I will adopt a diametrically opposed strategy!
The problem with your code is that you are indexing your DataFrame df by another DataFrame. Why? Because you use slices instead of integer indexing.
df.iloc[:, 1:2] >= 60.0 # Return a DataFrame with one boolean column
df.iloc[:, 1] >= 60.0 # Return a Series
df.iloc[:, [1]] >= 60.0 # Return a DataFrame with one boolean column
So correct your code by using:
criteria = df[df.iloc[:, 1] >= 60.0]  # Don't slice!

Replace NULL and blank values in all Data Frame columns with the most frequent Non Null item of the respective columns

I am a novice to Python. I am trying to replace NULL and blank ('') values occurring in a column of a pandas data frame with the most frequent item in that column, but I need to be able to do it for all columns and all rows of the data frame. I have written the following code, but it takes a lot of time to execute. Can you please help me optimize it?
Thanks
Saptarshi
for column in df:
    #Get the value and frequency from the column
    tempDict = df[column].value_counts().to_dict()
    #pop the entries for 'NULL' and '?'
    tempDict.pop(b'NULL', None)
    tempDict.pop(b'?', None)
    #identify the max item of the remaining set
    maxItem = max(tempDict)
    #The next step is to replace all rows where '?' or 'null' appears with maxItem
    #df_test[column] = df_test[column].str.replace(b'NULL', maxItem)
    #df_test[column] = df_test[column].str.replace(b'?', maxItem)
    df[column][df[column] == b'NULL'] = maxItem
    df[column][df[column] == b'?'] = maxItem
You can use mode() to find the most common value in each column:
for val in ['', 'NULL', '?']:
    df = df.replace(val, df.mode().iloc[0])
Because there may be multiple modal values, mode() returns a dataframe. Using .iloc[0] takes the first row of that dataframe, i.e. the first mode of each column. You can use fillna() instead of replace(), as @Wen does, if you also want to convert NaN values.
I create some sample data here.
import numpy as np
import pandas as pd

df = pd.DataFrame({'col1': [6, 3, 'null', 4, 4, 2, '?'], 'col2': [6, 3, 2, 'null', '?', 2, 2]})
df.replace({'?': np.nan}, inplace=True)
df.replace({'null': np.nan}, inplace=True)
df.fillna(df.apply(lambda x: x.mode()[0]))
Out[98]:
col1 col2
0 6.0 6.0
1 3.0 3.0
2 4.0 2.0
3 4.0 2.0
4 4.0 2.0
5 2.0 2.0
6 4.0 2.0
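One caveat on the snippet above: the two replace(..., inplace=True) calls modify df in place, but the final fillna only displays the filled frame; assign it back if the result should persist:
df = df.fillna(df.apply(lambda x: x.mode()[0]))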

Opposite of dropna() in pandas

I have a pandas DataFrame that I want to separate into observations for which there are no missing values and observations with missing values. I can use dropna() to get rows without missing values. Is there any analog to get rows with missing values?
# Example DataFrame
import numpy as np
import pandas as pd

df = pd.DataFrame({'col1': [1, np.nan, 3, 4, 5], 'col2': [6, 7, np.nan, 9, 10]})

# Get observations without missing values
df.dropna()
Check for nulls by row and filter with boolean indexing:
df[df.isnull().any(axis=1)]
# col1 col2
#1 NaN 7.0
#2 3.0 NaN
~ = Opposite :-)
df.loc[~df.index.isin(df.dropna().index)]
Out[234]:
col1 col2
1 NaN 7.0
2 3.0 NaN
Or
df.loc[df.index.difference(df.dropna().index)]
Out[235]:
col1 col2
1 NaN 7.0
2 3.0 NaN
I use the following expression as the opposite of dropna: it keeps the rows where the specified column is null, and drops anything with a value in that column.
csv_df = csv_df.loc[~csv_df['Column_name'].notna(), :]
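Tying the answers together, a minimal sketch (reusing the question's example frame) that splits the data into rows without and with missing values using a single boolean mask; note that ~csv_df['Column_name'].notna() in the last answer is just csv_df['Column_name'].isna():
import numpy as np
import pandas as pd

df = pd.DataFrame({'col1': [1, np.nan, 3, 4, 5],
                   'col2': [6, 7, np.nan, 9, 10]})

has_missing = df.isna().any(axis=1)   # True for rows with at least one NaN
complete = df[~has_missing]           # same rows as df.dropna()
incomplete = df[has_missing]          # the "opposite of dropna"
print(incomplete)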
