Calculate the mean in pandas when a column contains a string - python

I am currently learning pandas and I am using an IMDb movies database, in which one of the columns is the duration of the movies. However, one of the values is "None", so I can't calculate the mean because of this string in the middle. I thought of changing the "None" to 0, but that would skew the results, as can be seen with the code below.
dur_temp = duration.replace("None", 0)
dur_temp = dur_temp.astype(float)
descricao_duration = dur_temp.mean()
Any ideas on what I should do so that I don't skew the data? I also graphed it, and the skew becomes even clearer there.
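A toy example (hypothetical durations, not the real IMDb data) of how substituting 0 drags the mean down:
import pandas as pd

duration = pd.Series(["100", "140", "None", "120"])  # hypothetical values
dur_temp = duration.replace("None", 0).astype(float)
print(dur_temp.mean())  # 90.0, although the mean of the three real durations is 120.0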

You can replace "None" with numpy.nan instead of using 0.
Something like this should do the trick:
import numpy as np
dur_temp = duration.replace("None", np.nan)
descricao_duration = dur_temp.mean()

If you want it to work for any string in your pandas Series, you can use pd.to_numeric:
pd.to_numeric(dur_temp, errors='coerce').mean()
This way, all the values that cannot be converted to float are replaced by NaN, whatever they are.
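A quick sketch of the coercion, with made-up values:
import pandas as pd

s = pd.Series(["100", "None", "n/a", "120"])  # hypothetical mixed strings
print(pd.to_numeric(s, errors='coerce'))         # 100.0, NaN, NaN, 120.0
print(pd.to_numeric(s, errors='coerce').mean())  # 110.0, the NaNs are skipped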

Just filter by condition, like this:
df[df['a'] != 'None']  # assuming your mean values are in column a
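Note that after filtering, the column may still hold strings, so it needs a cast before taking the mean. A minimal sketch, assuming a hypothetical column a:
import pandas as pd

df = pd.DataFrame({'a': ['100', '140', 'None', '120']})  # hypothetical data
print(df[df['a'] != 'None']['a'].astype(float).mean())   # 120.0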

Make them np.nan values.
I am writing this as an answer because I can't comment:
df = df.replace('None', np.nan)
or
df.replace('None', np.nan, inplace=True)

You can use fillna(value=np.nan) as shown below:
descricao_duration = dur_temp.fillna(value=np.nan).mean()
Demo:
import pandas as pd
import numpy as np
dur_temp = pd.DataFrame({'duration': [10, 20, None, 15, None]})
descricao_duration = dur_temp.fillna(value=np.nan).mean()
print(descricao_duration)
Output:
duration    15.0
dtype: float64


Replace nan-values with the mean of their column/attribute

I have tried everything I can come up with and would appreciate some help! :)
This is a method that is supposed to return an imputed part of a data frame:
import pandas as pd
import numpy as np

def imputation(df, columns_to_imputed):
    # Step 1: Get a part of the dataframe using the columns received as a parameter.
    df = df.set_axis(['Pregnancies', 'Glucose', 'BloodPressure', 'SkinThickness', 'Insulin', 'BMI', 'DiabetesPedigreeFunction', 'Age', 'Outcome'], axis=1)  # Set the column headers
    part_of_df = pd.DataFrame(df.filter(columns_to_imputed, axis=1))
    part_of_df = part_of_df.drop([0], axis=0)
    # Step 2: Change the zero values in the columns to np.nan.
    part_of_df = part_of_df.replace('0', np.nan)
    # Step 3: Change the NaN values to the mean of each attribute (column).
    # You can use the apply() and fillna() functions.
    part_of_df = part_of_df.fillna(part_of_df.mean(axis=0))  # I've tried everything on this row, can't get it to work. I want to fill each NaN value with the mean of the column it's in.
    return part_of_df  # I'm returning this part to see if the NaNs are replaced, but nothing has happened...
You were on the right track, you just need to make a small change. Here I created a sample Df and introduced some NaNs:
dummy_df = pd.DataFrame({"col1":range(5), "col2":range(5)})
dummy_df['col1'][1] = None
dummy_df['col1'][3] = None
dummy_df['col2'][4] = None
and got this:
   col1  col2
0   0.0   0.0
1   NaN   1.0
2   2.0   2.0
3   NaN   3.0
4   4.0   NaN
Disclaimer: Don't use my method of value assignment. Use proper indexing through loc.
Now, I use apply() and lambda to iterate over each column and fill NaNs with the mean value:
dummy_df = dummy_df.apply(lambda x: x.fillna(x.mean()), axis=0)
This gives me:
   col1  col2
0   0.0   0.0
1   2.0   1.0
2   2.0   2.0
3   2.0   3.0
4   4.0   1.5
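As a side note, the same column-wise fill can be written without apply, since fillna aligns a Series of per-column means on the column names; a minimal equivalent sketch:
dummy_df = dummy_df.fillna(dummy_df.mean())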
Hope this helps!

Factorizing on a slice of a df

I'm trying to give numerical representations of strings, so I'm using pandas' factorize.
For example: Toyota = 1, Safeway = 2, Starbucks = 3.
Currently it looks like (and this works):
#Create easy unique IDs for subscription names i.e. 1,2,3,4,5...etc..
df['SUBS_GROUP_ID'] = pd.factorize(df['SUBSCRIPTION_NAME'])[0] + 1
However, I only want to factorize subscription names where SUBS_GROUP_ID is null. So my thought was: grab all the null rows, then run the factorize function.
mask_to_grab_nulls = df['SUBS_GROUP_ID'].isnull()
df[mask_to_grab_nulls]['SUBS_GROUP_ID'] = pd.factorize(df[mask_to_grab_nulls]['SUBSCRIPTION_NAME'])[0] + 1
This runs, but does not change any values... any ideas on how to solve this?
This is likely related to chained assignments (see more here). Try the solution below, which isn't optimal but should work fine in your case:
df2 = df[df['SUBS_GROUP_ID'].isnull()] # isolate the Null IDs
df2['SUBS_GROUP_ID'] = pd.factorize(df2['SUBSCRIPTION_NAME'])[0] + 1 # factorize
df = df.dropna(subset=['SUBS_GROUP_ID']) # drop the Null-ID rows from the original table
df_fin = pd.concat([df,df2]) # concat df and df2
What you are doing is called chained indexing, which has two major downsides and should be avoided:
It can be slower than the alternative, because it involves more function calls.
The result is unpredictable: Why does assignment fail when using chained indexing?
I'm a bit surprised you haven't seen a SettingWithCopy warning. The warning points you in the right direction:
... Try using .loc[row_indexer,col_indexer] = value instead
So this should work:
mask_to_grab_nulls = df['SUBS_GROUP_ID'].isnull()
df.loc[mask_to_grab_nulls, 'SUBS_GROUP_ID'] = pd.factorize(
df.loc[mask_to_grab_nulls, 'SUBSCRIPTION_NAME']
)[0] + 1
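A quick check on a toy frame (hypothetical data, not from the question) that this fills only the null IDs:
import pandas as pd
import numpy as np

df = pd.DataFrame({'SUBS_GROUP_ID': [7.0, np.nan, np.nan],
                   'SUBSCRIPTION_NAME': ['Toyota', 'Safeway', 'Starbucks']})
mask_to_grab_nulls = df['SUBS_GROUP_ID'].isnull()
df.loc[mask_to_grab_nulls, 'SUBS_GROUP_ID'] = pd.factorize(
    df.loc[mask_to_grab_nulls, 'SUBSCRIPTION_NAME']
)[0] + 1
print(df)  # row 0 keeps ID 7.0; Safeway and Starbucks get the new IDs 1 and 2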
You can use LabelEncoder.
from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()
df = df.dropna(subset=['SUBS_GROUP_ID'])  # drop null values
df_results = le.fit_transform(df.SUBS_GROUP_ID.values)  # encode strings to classes
df_results
I would use numpy.where to factorize only the non-NaN values.
import pandas as pd
import numpy as np
df = pd.DataFrame({'SUBS_GROUP_ID': ['ID-001', 'ID-002', np.nan, 'ID-004', 'ID-005'],
                   'SUBSCRIPTION_NAME': ['Toyota', 'Safeway', 'Starbucks', 'Safeway', 'Toyota']})
df['SUBS_GROUP_ID'] = np.where(~df['SUBS_GROUP_ID'].isnull(),
                               pd.factorize(df['SUBSCRIPTION_NAME'])[0] + 1,
                               np.nan)
>>> print(df)
   SUBS_GROUP_ID SUBSCRIPTION_NAME
0            1.0            Toyota
1            2.0           Safeway
2            NaN         Starbucks
3            2.0           Safeway
4            1.0            Toyota

How to prevent NaN when using str.lower in Python?

I'm looking to convert a column to lower case. The issue is there are some instances where the string within the column only contains numbers. In my real life case this is due to poor data entry. Instead of having these values converted to NaN, I would like to keep the numeric string as is. What is the best approach to achieving this?
Below is my current code and output
import pandas as pd
df = pd.DataFrame({'col':['G5051', 'G5052', 5053, 'G5054']})
df['col'].str.lower()
Current Output
0    g5051
1    g5052
2      NaN
3    g5054
Name: col, dtype: object
Desired Output
0    g5051
1    g5052
2     5053
3    g5054
Name: col, dtype: object
Just convert the column to strings first:
import pandas as pd
df = pd.DataFrame({'col':['G5051', 'G5052', 5053, 'G5054']})
print(df['col'].astype(str).str.lower())
Pre-define the data as str when constructing the DataFrame:
import pandas as pd
df = pd.DataFrame({'col':['G5051', 'G5052', 5053, 'G5054']}, dtype=str)
print(df['col'].str.lower())
To add a slight variation to Tim Roberts' solution, without using the .str accessor:
import pandas as pd
df = pd.DataFrame({'col':['G5051', 'G5052', 5053, 'G5054']})
print(df['col'].astype(str).apply(lambda x: x.lower()))

Python dataframe exclude rows based on condition not working

I have a dataframe that I am concatenating from dataframes and arrays.
Somehow it has inherited the index of the original dataframe, hence I am trying to exclude rows based on one of the columns, which should not have missing values.
If I view my dataframe, it shows as this:
print(model_data2['is_62p_days_overdue'][0:11])
Now, when I run:
print(model_data2['is_62p_days_overdue'].where(model_data2['is_62p_days_overdue'] != np.nan)[0:11])
I get the exact same output.
And when I run:
print(model_data2['is_62p_days_overdue'].where(model_data2['is_62p_days_overdue'] == np.nan)[0:11])
every value comes back as NaN. What am I missing? This is driving me nuts!
I've tried resetting the index - but this also does nothing.
IIUC:
Instead of this:
print(model_data2['is_62p_days_overdue'].where(model_data2['is_62p_days_overdue'] != np.nan)[0:11])
try the loc accessor with the notna() method:
print(model_data2.loc[model_data2['is_62p_days_overdue'].notna(),'is_62p_days_overdue'][0:11])
Answer to the comment:
There are two reasons for it:
1. You can't compare NaNs the way you do in your method:
model_data2['is_62p_days_overdue'] != np.nan
# this is wrong; use the notna() method instead
2. You are using the where method, so even after you correct the comparison, where will just set the NaN rows back to NaN instead of removing them:
model_data2['is_62p_days_overdue'].where(model_data2['is_62p_days_overdue'].notna())
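A minimal illustration of reason 1 (every comparison with NaN evaluates to False, so != np.nan is True everywhere and filters out nothing):
import numpy as np
import pandas as pd

print(np.nan == np.nan)  # False: NaN never equals anything, not even itself
s = pd.Series([0.0, np.nan, 1.0])
print((s != np.nan).tolist())  # [True, True, True], so the mask keeps every row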
See the "# rows you may want to see" line at the bottom of my code:
import pandas as pd
import numpy as np

# make a dataset
data = pd.DataFrame({'is_62p_days_overdue': [0, 0, 0, 0, 0, None, None, 0, None, 0, None]})
print(data)

# append the numbers 1 to 10 (df.append is deprecated, so pd.concat is used here)
data = pd.concat([data, pd.DataFrame({'is_62p_days_overdue': list(range(1, 10 + 1))})], ignore_index=True)
data

# rows you may want to see
data.loc[~(data.is_62p_days_overdue.isna())]
You can use .dropna() to drop the rows with NaN values.
Use this:
model_data2.dropna(subset=['is_62p_days_overdue'], inplace=True)
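A small sketch (made-up frame) of why the subset argument matters here: a bare dropna() would also drop rows that are NaN in any other column:
import pandas as pd
import numpy as np

df = pd.DataFrame({'is_62p_days_overdue': [0, np.nan, 1],
                   'other': [np.nan, 2, 3]})
print(df.dropna())                                # keeps only the last row
print(df.dropna(subset=['is_62p_days_overdue']))  # keeps the first and last rows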

Rename column values using pandas DataFrame

In one of the columns in my dataframe I have five values:
1, G, 2, 3, 4
How do I change every "G" to 1?
I tried:
df = df['col_name'].replace({'G': 1})
I also tried:
df = df['col_name'].replace('G',1)
"G" is in fact 1 (I do not know why there is a mixed naming)
Edit:
It works correctly with:
df['col_name'] = df['col_name'].replace({'G': 1})
If I am understanding your question correctly, you are trying to change the values in a column and not the column name itself.
Given you have mixed data type there, I assume that column is of type object and thus the number is read as string.
df['col_name'] = df['col_name'].str.replace('G', '1')
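A short check (assuming the column really does hold strings); note that str.replace keeps the column as strings, so the result is '1', not the number 1:
import pandas as pd

df = pd.DataFrame({'col_name': ['1', 'G', '2', '3', '4']})  # hypothetical string data
print(df['col_name'].str.replace('G', '1').tolist())  # ['1', '1', '2', '3', '4']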
You could try the following line
df.replace('G', 1, inplace=True)
Use numpy:
import numpy as np
df['a'] = np.where(df.a == 'G', 1, df.a)
You can try this. Let's say your data looks like this:
ab = pd.DataFrame({'a': [1, 2, 3, 'G', 5]})
Then you can replace it with:
ab1 = ab.replace('G', 4)
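Printing ab1 then shows the 'G' replaced:
print(ab1)
   a
0  1
1  2
2  3
3  4
4  5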
