Having Issues with pandas groupby.mean() not ignoring NaN as expected

I'm currently trying to get the mean() of a group in my dataframe (tdf), but I have a mix of NaN values and filled values in my dataset. Example shown below:
Test #  a  b
1       1  1
1       2  NaN
1       3  2
2       4  3
My code needs to take this dataset, and make a new dataset containing the mean, std, and 95% interval of the set.
i = 0
num_timeframes = 2  # writing this in for example's sake
new_df = pd.DataFrame(columns=tdf.columns)
while i < num_timeframes:
    results = tdf.loc[tdf["Test #"] == i].groupby(["Test #"]).mean()
    new_df = pd.concat([new_df, results])
    results = tdf.loc[tdf["Test #"] == i].groupby(["Test #"]).std()
    new_df = pd.concat([new_df, results])
    results = 2*tdf.loc[tdf["Test #"] == i].groupby(["Test #"]).std()
    new_df = pd.concat([new_df, results])
    new_df['Test #'] = new_df['Test #'].fillna(i)  # fill out test number values
    i += 1
For simplicity, I will show the desired output on the first pass of the while loop, only calculating the mean. The problem affects every row, however. The expected output for the mean of Test # 1 is shown below:
Test #  a  b
1       2  1.5
However, columns that contain any NaN rows are calculating the entire mean as NaN, resulting in the output shown below:
Test #  a  b
1       2  NaN
I have tried passing skipna=True, but got an error stating that mean doesn't have a skipna argument. I'm really at a loss here because it was my understanding that df.mean() ignores NaN rows by default. I have limited experience with Python, so any help is greatly appreciated.

Use the following:
DataFrame.mean(axis=None, skipna=True)

I eventually solved this by removing the groupby function entirely (I was looking through it and realized I had no reason to call groupby here other than benefit from groupby keeping my columns in the correct orientation). Figured I'd post my fix in case anyone ever comes across this.
for i in range(num_timeframes):
    results = tdf.loc[tdf["Test #"] == i].mean()
    results = pd.concat([results, tdf.loc[tdf["Test #"] == i].std()], axis=1)
    results = pd.concat([results, 2*tdf.loc[tdf["Test #"] == i].std()], axis=1)
    results = results.transpose()
    results["Test #"] = i
    new_df = pd.concat([new_df, results])
    new_df.loc[new_df.shape[0]] = [None]*len(new_df.columns)
All I had to do was transpose my results, because df.mean() returns a Series indexed by the column names (so the concatenated result comes out sideways), which is likely why I had tried using groupby in the first place.
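For anyone landing here later: on numeric columns, groupby(...).mean() and .std() already skip NaN, so the loop can usually be replaced by a single aggregation. The sketch below rebuilds the sample table from the question. The pd.to_numeric call is an assumption about the cause (the question does not show the dtypes); it only matters if column b was read in as strings or objects.

```python
import numpy as np
import pandas as pd

# Sample data reconstructed from the question
tdf = pd.DataFrame({
    "Test #": [1, 1, 1, 2],
    "a": [1, 2, 3, 4],
    "b": [1, np.nan, 2, 3],
})

# Assumption: if a column arrived as object/strings, coerce it to numeric first
tdf["b"] = pd.to_numeric(tdf["b"], errors="coerce")

# One pass: mean and std per test, NaN values are skipped within each group
summary = tdf.groupby("Test #").agg(["mean", "std"])
print(summary)
```

The result has one row per test and a two-level column index (column name, statistic), so the mean of b for test 1 is summary.loc[1, ("b", "mean")].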

Related

Pandas - change cell value based on conditions from cell and from column

I have a Dataframe with a lot of "bad" cells. Let's say, they have all -99.99 as values, and I want to remove them (set them to NaN).
This works fine:
df[df == -99.99] = None
But actually I want to delete all these cells ONLY if another cell in the same row is marked as 1 (e.g. in the column "Error").
I want to delete all -99.99 cells, but only if df["Error"] == 1.
The most straightforward solution I can think of is something like
df[(df == -99.99) & (df["Error"] == 1)] = None
but it gives me the error:
ValueError: cannot reindex from a duplicate axis
I tried every solution I could find on the internet, but I can't get it to work! :(
Since my Dataframe is big I don't want to iterate it (which of course, would work, but take a lot of time).
Any hint?

Try using broadcasting while passing numpy values:
# sample data, special value is -99
df = pd.DataFrame([[-99, -99, 1], [2, -99, 2],
                   [1, 1, 1], [-99, 0, 1]],
                  columns=['a', 'b', 'Errors'])
# note the double square brackets
df[(df == -99) & (df[['Errors']] == 1).values] = np.nan
Output:
     a     b  Errors
0  NaN   NaN       1
1  2.0 -99.0       2
2  1.0   1.0       1
3  NaN   0.0       1
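To make the broadcasting step explicit, here is a minimal sketch that builds the two masks separately and combines them as numpy arrays. The sample frame and the names bad, flagged and cleaned are made up for illustration; the column is called Error as in the question.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    [[-99.99, -99.99, 1], [2.0, -99.99, 2], [1.0, 1.0, 1], [-99.99, 0.0, 1]],
    columns=["a", "b", "Error"],
)

# Cell-level mask, shape (n_rows, n_cols)
bad = df.eq(-99.99).to_numpy()

# Row-level mask reshaped to (n_rows, 1) so it broadcasts across every column
flagged = df["Error"].eq(1).to_numpy()[:, None]

# mask() puts NaN wherever the combined condition is True
cleaned = df.mask(bad & flagged)
print(cleaned)
```

The -99.99 in row 1 survives because that row has Error == 2.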

At least, this is working (but with column iteration):
for i in df.columns:
    df.loc[df[i].isin([-99.99]) & df["Error"].isin([1]), i] = None

Set "meta name" for rows and columns in pandas dataframe

I am trying to "prettify" a confusion matrix, which is just returned as a 2D numpy array.
What I want is to add "legends": one above the columns saying "Pred" and one for the rows saying "Actual",
something like this:
            pred
          0     1
        ------------
      0 | 123    2
Actual  |
      1 | 17   200
(It would be perfect if "Actual" was rotated, but that's just a minor thing.)
I have the following lines for creating the dataframe w/o the meta-headers
conf_mat = confusion_matrix(y_true = y_true,y_pred = y_pred)
conf_mat = pd.DataFrame(conf_mat)
... #missing lines for the last part
and I have thought about using some multiindex but I cannot really make it work

Use:
conf_mat = pd.DataFrame(conf_mat)
Solution with MultiIndex:
conf_mat.columns = pd.MultiIndex.from_product([['pred'], conf_mat.columns])
conf_mat.index = pd.MultiIndex.from_product([['Actual'], conf_mat.index])
print(conf_mat)
         pred
            0    1
Actual 0  123    2
       1   17  200
Or a solution with index and columns names (but some pandas operations may remove this metadata):
conf_mat = conf_mat.rename_axis(index='Actual', columns='pred')
print(conf_mat)
pred      0    1
Actual
0       123    2
1        17  200
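As a side note, a similar labelled table can be built directly in pandas with pd.crosstab, which takes the axis names from the names of the Series you pass in. This is a different technique from the confusion_matrix call in the question, and the y_true / y_pred values below are made-up sample data.

```python
import pandas as pd

# Hypothetical labels, only for illustration
y_true = [0, 0, 1, 1, 1, 0]
y_pred = [0, 1, 1, 1, 0, 0]

# Series names become the index name and the columns name
conf_mat = pd.crosstab(
    pd.Series(y_true, name="Actual"),
    pd.Series(y_pred, name="pred"),
)
print(conf_mat)
```

The result carries the same kind of axis names as the rename_axis version above, so the same caveat about operations dropping them applies.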

python for loop using index to create values in dataframe

I have a very simple for loop problem and I haven't found a solution in any of the similar questions on Stack. I want to use a for loop to create values in a pandas dataframe. I want the values to be strings that contain a numerical index. I can make the correct value print, but I can't make this value get saved in the dataframe. I'm new to python.
# reproducible example
import pandas as pd
df1 = pd.DataFrame({'x':range(5)})
# for loop to add a row with an index
for i in range(5):
    print("data_{i}.txt".format(i=i))  # this prints the value that I want
    df1['file'] = "data_{i}.txt".format(i=i)
This loop prints the exact value that I want to put into the 'file' column of df1, but when I look at df1, it only uses the last value for the index.
   x        file
0  0  data_4.txt
1  1  data_4.txt
2  2  data_4.txt
3  3  data_4.txt
4  4  data_4.txt
I have tried using enumerate, but can't find a solution with this. I assume everyone will yell at me for posting a duplicate question, but I have not found anything that works and if someone points me to a solution that solves this problem, I'll happily remove this question.

There are better ways to create a DataFrame, but to answer your question:
Replace the last line in your code:
df1['file'] = "data_{i}.txt".format(i=i)
with:
df1.loc[i, 'file'] = "data_{0}.txt".format(i)
For more information, read about the .loc here: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.loc.html
On the same page, you can read about accessors like .at and .iloc as well.

You can do list-comprehension:
df1['file'] = ["data_{i}.txt".format(i=i) for i in range(5)]
print(df1)
Prints:
   x        file
0  0  data_0.txt
1  1  data_1.txt
2  2  data_2.txt
3  3  data_3.txt
4  4  data_4.txt
Or when creating the DataFrame:
df1 = pd.DataFrame({'x':range(5), 'file': ["data_{i}.txt".format(i=i) for i in range(5)]})
print(df1)
OR:
df1 = pd.DataFrame([{'x':i, 'file': "data_{i}.txt".format(i=i)} for i in range(5)])
print(df1)

I've found success with the .at method
for i in range(5):
    print("data_{i}.txt".format(i=i))  # this prints the value that I want
    df1.at[i, 'file'] = "data_{i}.txt".format(i=i)
Returns:
   x        file
0  0  data_0.txt
1  1  data_1.txt
2  2  data_2.txt
3  3  data_3.txt
4  4  data_4.txt

When you assign a value to a dataframe column the way you do,
using df['colname'] = 'val', it assigns the value across all rows.
That is why you are seeing only the last value.
Change your code to:
import pandas as pd
df1 = pd.DataFrame({'x': range(5)})
# build the values in a list inside the loop
to_assign = []
for i in range(5):
    print("data_{i}.txt".format(i=i))  # this prints the value that I want
    to_assign.append("data_{i}.txt".format(i=i))
# outside of the loop - only once - assign to all dataframe rows
df1['file'] = to_assign
As a thought, pandas has a great API for performing these types of actions without for loops.
You should start practicing those.
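A loop-free version of the same idea, as a minimal sketch: it assumes the number in the file name should come from the existing x column (in the question x and the row index happen to be identical).

```python
import pandas as pd

df1 = pd.DataFrame({"x": range(5)})

# Convert the numbers to strings, then concatenate element-wise
df1["file"] = "data_" + df1["x"].astype(str) + ".txt"
print(df1)
```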

Pandas Apply/Lambda returning dataframe and not single row

New to Python and Pandas, so please bear with me here.
I have created a dataframe with 10 rows, with a column called 'Distance', and I want to calculate a new column (TotalCost) with apply and a lambda function that I have created. Snippet of the function below:
def TotalCost(Distance, m, c):
    return m * df.Distance + c
where Distance is the column in the dataframe df, while m and c are just constants that I declare earlier in the main code.
I then try to apply it in the following manner:
df = df.apply(lambda row: TotalCost(row['Distance'], m, c), axis=1)
but when running this, I keep getting a dataframe as an output, instead of a single row.
EDIT: Adding in an example of input and desired output,
Input: df = {Distance: '1','2','3'}
if we assume m and c equal 10,
then the output of applying the function should be
df['TotalCost'] = 20,30,40
I will post the error below this, but what am I missing here? As far as I understand, my syntax is correct. Any assistance would be greatly appreciated :)
The error message:
ValueError: Wrong number of items passed 10, placement implies 1

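The function you call from the lambda in apply should process only one row, so use its Distance argument instead of df.Distance. Also, apply returns only the calculated column, not the whole dataframe, so assign the result to a new column:
def TotalCost(Distance, m, c):
    return m * Distance + c

df['TotalCost'] = df.apply(lambda row: TotalCost(row['Distance'], m, c), axis=1)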

apply passes one row (axis=1) or one column (axis=0) at a time to your function and collects the results into a new object. It returns that new object instead of altering the original dataframe.
Have a look at this link; it should help you gain more insight:
https://thispointer.com/pandas-apply-apply-a-function-to-each-row-column-in-dataframe/
import numpy as np
import pandas as pd

def star(x, m, c):
    return x*m + c

vals = [(1, 2, 4),
        (3, 4, 5),
        (5, 6, 6)]
df = pd.DataFrame(vals, columns=('one', 'two', 'three'))
res = df.apply(star, axis=0, args=(2, 3))
Initial DataFrame:
   one  two  three
0    1    2      4
1    3    4      5
2    5    6      6
After applying the function you should get this stored in res:
   one  two  three
0    5    7     11
1    9   11     13
2   13   15     15

This is a more memory-efficient and cleaner way (inside eval, local variables are referenced with an @ prefix):
df.eval('total_cost = @m * Distance + @c', inplace=True)
Update: I also sometimes stick to assign,
df = df.assign(total_cost=lambda x: TotalCost(x['Distance'], m, c))
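For a formula this simple, plain column arithmetic is probably enough and avoids apply altogether. A minimal sketch using the example values from the question (m = c = 10); the Distance values are assumed to be numeric here, not strings as written in the question's edit.

```python
import pandas as pd

m, c = 10, 10
df = pd.DataFrame({"Distance": [1, 2, 3]})

# The whole column is computed at once; no row-by-row function calls
df["TotalCost"] = m * df["Distance"] + c
print(df)
```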

faster replacement of -1 and 0 to NaNs in column for a large dataset

The 'azdias' is a dataframe which is my main dataset and meta data or feature summary of it lies in dataframe 'feat_info'. The 'feat_info' shows the values in every column that have been displayed as NaN.
Ex: column1 has values [-1,0] as NaN values. So my job will be to find and replace these -1,0 in column1 as NaN.
azdias dataframe:
feat_info dataframe:
I have tried following in jupyter notebook.
def NAFunc(x, miss_unknown_list):
    x_output = x
    for i in miss_unknown_list:
        try:
            miss_unknown_value = float(i)
        except ValueError:
            miss_unknown_value = i
        if x == miss_unknown_value:
            x_output = np.nan
            break
    return x_output

for cols in azdias.columns.tolist():
    NAList = feat_info[feat_info.attribute == cols]['missing_or_unknown'].values[0]
    azdias[cols] = azdias[cols].apply(lambda x: NAFunc(x, NAList))
Question 1: I am trying to impute NaN values. But my code is very slow. I wish to speed up my process of execution.
I have attached sample of both dataframes:
azdias_sample
   AGER_TYP  ALTERSKATEGORIE_GROB  ANREDE_KZ  CJT_GESAMTTYP  FINANZ_MINIMALIST
0        -1                     2          1            2.0                  3
1        -1                     1          2            5.0                  1
2        -1                     3          2            3.0                  1
3         2                     4          2            2.0                  4
4        -1                     3          1            5.0                  4
feat_info_sample
attribute             information_level  type         missing_or_unknown
AGER_TYP              person             categorical  [-1,0]
ALTERSKATEGORIE_GROB  person             ordinal      [-1,0,9]
ANREDE_KZ             person             categorical  [-1,0]
CJT_GESAMTTYP         person             categorical  [0]
FINANZ_MINIMALIST     person             ordinal      [-1]

If the azdias dataset is obtained from read_csv or a similar IO function, the na_values keyword argument can be used to specify column-specific missing value representations, so that the returned data frame already has NaN in place from the very beginning. Sample code:
from ast import literal_eval
feat_info.set_index("attribute", inplace=True)
# A more concise but less efficient alternative is
# na_dict = feat_info["missing_or_unknown"].apply(literal_eval).to_dict()
na_dict = {attr: literal_eval(val) for attr, val in feat_info["missing_or_unknown"].items()}
df_azdias = pd.read_csv("azidas.csv", na_values=na_dict)
As for the data type, there is no built-in NaN representation for integer data types. Hence a float data type is needed. If the missing values are imputed using fillna, the downcast argument can be specified to make the returned series or data frame have an appropriate data type.
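Here is a small self-contained sketch of the na_values approach. The CSV text and the two-row feat_info are made-up stand-ins for the real files, and it assumes every missing_or_unknown entry is a valid Python list literal that literal_eval can parse.

```python
import io
from ast import literal_eval

import pandas as pd

# Stand-in for the real CSV file (hypothetical content)
csv_text = "AGER_TYP,FINANZ_MINIMALIST\n-1,3\n2,-1\n0,4\n"

feat_info = pd.DataFrame({
    "attribute": ["AGER_TYP", "FINANZ_MINIMALIST"],
    "missing_or_unknown": ["[-1,0]", "[-1]"],
})

# Map each column name to the list of codes that mean "missing"
na_dict = {
    attr: literal_eval(val)
    for attr, val in zip(feat_info["attribute"], feat_info["missing_or_unknown"])
}

# The codes are turned into NaN per column while the file is parsed
azdias = pd.read_csv(io.StringIO(csv_text), na_values=na_dict)
print(azdias)
```

Columns that end up containing NaN come back as float, in line with the dtype remark above.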

Try using the DataFrame's replace method. How about this?
for c in azdias.columns.tolist():
    replace_list = feat_info[feat_info['attribute'] == c]['missing_or_unknown'].values
    azdias[c] = azdias[c].replace(to_replace=list(replace_list), value=np.nan)
A couple of things I'm not sure about without being able to execute your code:
In your example, you used .values[0]. Don't you want all the values?
I'm not sure if it's necessary to do to_replace=list(replace_list); it may work to just use to_replace=replace_list.
In general, I recommend thinking to yourself "surely Pandas has a function to do this for me." Often, it does. For performance with Pandas generally, avoid looping over rows and setting values one at a time. Vectorized methods tend to be much faster.
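If the data is already loaded, one vectorized option is isin plus mask, looping only over columns and never over cells. This is a sketch under assumptions: na_map is a hypothetical dict of already-parsed code lists (the real missing_or_unknown column holds strings that would need parsing first), and the frame is a cut-down copy of the sample in the question.

```python
import numpy as np
import pandas as pd

azdias = pd.DataFrame({
    "AGER_TYP": [-1, -1, -1, 2, -1],
    "CJT_GESAMTTYP": [2.0, 5.0, 3.0, 2.0, 5.0],
    "FINANZ_MINIMALIST": [3, 1, 1, 4, 4],
})

# Hypothetical, already-parsed version of feat_info's missing_or_unknown column
na_map = {"AGER_TYP": [-1, 0], "CJT_GESAMTTYP": [0], "FINANZ_MINIMALIST": [-1]}

for col, missing in na_map.items():
    # isin builds a boolean mask for the whole column; mask() writes NaN where it is True
    azdias[col] = azdias[col].mask(azdias[col].isin(missing))

print(azdias)
```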
