Converting NaN in dataframe to zero - python

I have a dictionary and created a Pandas DataFrame using
cars = pd.DataFrame.from_dict(cars_dict, orient='index')
and then sorted the columns in alphabetical order with
cars = cars.sort_index(axis=1)
After sorting I noticed the DataFrame has NaN values, and I wasn't sure
whether they are really np.nan values.
print(cars.isnull().any()) shows False for every column.
I have tried different methods to convert those "NaN" values to zero, which is what I want to do, but none of them is working.
I have tried the replace and fillna methods and nothing works.
Below is a sample of my dataframe:
        speedtest  size
toyota         65   NaN
honda          77   800

Either use replace or np.where on the values if they are strings:
df = df.replace('NaN', 0)
Or,
df[:] = np.where(df.eq('NaN'), 0, df)
Or, if they're actually NaNs (which, it seems, is unlikely), then use fillna:
df.fillna(0, inplace=True)
Or, to handle both situations at the same time, use apply + pd.to_numeric (slightly slower but guaranteed to work in any case):
df = df.apply(pd.to_numeric, errors='coerce').fillna(0, downcast='infer')
Thanks to piRSquared for this one!
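As a minimal, runnable sketch of that last approach (rebuilding the question's sample frame with the "NaN" entries as strings, which is an assumption), pd.to_numeric coerces both the string 'NaN' and real np.nan to NaN, and fillna then zeroes them:
import pandas as pd
import numpy as np

# Hypothetical reconstruction of the sample frame; 'NaN' is a string here.
cars = pd.DataFrame({'speedtest': [65, 77], 'size': ['NaN', 800]},
                    index=['toyota', 'honda'])

# Everything parseable passes through; anything else becomes np.nan, then 0.
cars = cars.apply(pd.to_numeric, errors='coerce').fillna(0, downcast='infer')
print(cars)
#         speedtest  size
# toyota         65     0
# honda          77   800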

cs95's answer didn't work for me.
I had to import numpy as np and use replace with np.nan and inplace=True:
import numpy as np
df.replace(np.nan, 0, inplace=True)
Then all the columns got 0 instead of NaN.
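A small diagnostic sketch (hypothetical frame) for telling the two cases apart: isnull() only catches real np.nan, so if it reports False everywhere, the "NaN"s are almost certainly strings:
import pandas as pd

cars = pd.DataFrame({'size': ['NaN', 800]}, index=['toyota', 'honda'])

print(cars['size'].dtype)            # object -> likely strings
print(cars.isnull().any())           # size: False -> no real np.nan present
print(cars['size'].eq('NaN').any())  # True -> the literal string 'NaN' is present
print(cars.replace('NaN', 0))        # so the string replacement is what works here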

Replace nan-values with the mean of their column/attribute

I have tried everything I can come up with and would appreciate some help! :)
This is a method that is supposed to return an imputed part of a DataFrame:
import pandas as pd
import numpy as np

def imputation(df, columns_to_imputed):
    # Step 1: Get a part of the dataframe using the columns received as a parameter.
    df.set_axis(['Pregnancies', 'Glucose', 'BloodPressure', 'SkinThickness', 'Insulin',
                 'BMI', 'DiabetesPedigreeFunction', 'Age', 'Outcome'],
                axis=1, inplace=True)  # Sets the headers
    part_of_df = pd.DataFrame(df.filter(columns_to_imputed, axis=1))
    part_of_df = part_of_df.drop([0], axis=0)

    # Step 2: Change the zero values in the columns to np.nan.
    part_of_df = part_of_df.replace('0', np.nan)

    # Step 3: Change the nan values to the mean of each attribute (column),
    # e.g. with apply() or fillna().
    # I've tried everything on this row and can't get it to work. I want to fill
    # each nan value with the mean of the column it's in.
    part_of_df = part_of_df.fillna(part_of_df.mean(axis=0))

    # I'm returning this part to see if the nans are replaced, but nothing happens.
    return part_of_df
You were on the right track; you just need to make a small change. Here I created a sample DataFrame and introduced some NaNs:
dummy_df = pd.DataFrame({"col1":range(5), "col2":range(5)})
dummy_df['col1'][1] = None
dummy_df['col1'][3] = None
dummy_df['col2'][4] = None
and got this:
   col1  col2
0   0.0   0.0
1   NaN   1.0
2   2.0   2.0
3   NaN   3.0
4   4.0   NaN
Disclaimer: Don't use my method of value assignment. Use proper indexing through loc.
Now, I use apply() and lambda to iterate over each column and fill NaNs with the mean value:
dummy_df = dummy_df.apply(lambda x: x.fillna(x.mean()), axis=0)
This gives me:
   col1  col2
0   0.0   0.0
1   2.0   1.0
2   2.0   2.0
3   2.0   3.0
4   4.0   1.5
Hope this helps!
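As a follow-up sketch (assuming numeric columns), fillna can also be handed the per-column means directly, which is equivalent to the apply/lambda version above and uses the .loc assignment mentioned in the disclaimer:
import pandas as pd

dummy_df = pd.DataFrame({"col1": range(5), "col2": range(5)}, dtype="float")
dummy_df.loc[[1, 3], "col1"] = None
dummy_df.loc[4, "col2"] = None

# Passing a Series of column means to fillna fills each column with its own mean.
filled = dummy_df.fillna(dummy_df.mean())
print(filled)
#    col1  col2
# 0   0.0   0.0
# 1   2.0   1.0
# 2   2.0   2.0
# 3   2.0   3.0
# 4   4.0   1.5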

factorizing on a slice of a df

I'm trying to give numerical representations of strings, so I'm using Pandas'
factorize
For example Toyota = 1, Safeway = 2, Starbucks = 3.
Currently it looks like (and this works):
#Create easy unique IDs for subscription names i.e. 1,2,3,4,5...etc..
df['SUBS_GROUP_ID'] = pd.factorize(df['SUBSCRIPTION_NAME'])[0] + 1
However, I only want to factorize subscription names where SUBS_GROUP_ID is null. So my thought was: grab all the null rows, then run the factorize function.
mask_to_grab_nulls = df['SUBS_GROUP_ID'].isnull()
df[mask_to_grab_nulls]['SUBS_GROUP_ID'] = pd.factorize(df[mask_to_grab_nulls]['SUBSCRIPTION_NAME'])[0] + 1
This runs, but does not change any values... any ideas on how to solve this?
This is likely related to chained assignments (see more here). Try the solution below, which isn't optimal but should work fine in your case:
df2 = df[df['SUBS_GROUP_ID'].isnull()] # isolate the Null IDs
df2['SUBS_GROUP_ID'] = pd.factorize(df2['SUBSCRIPTION_NAME'])[0] + 1 # factorize
df = df.dropna() # drop Null rows from the original table
df_fin = pd.concat([df,df2]) # concat df and df2
What you are doing is called chained indexing, which has two major downsides and should be avoided:
It can be slower than the alternative, because it involves more function calls.
The result is unpredictable: Why does assignment fail when using chained indexing?
I'm a bit surprised you haven't seen a SettingWithCopyWarning. The warning points you in the right direction:
... Try using .loc[row_indexer,col_indexer] = value instead
So this should work:
mask_to_grab_nulls = df['SUBS_GROUP_ID'].isnull()
df.loc[mask_to_grab_nulls, 'SUBS_GROUP_ID'] = pd.factorize(
    df.loc[mask_to_grab_nulls, 'SUBSCRIPTION_NAME']
)[0] + 1
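A minimal runnable sketch of that approach (the sample values below are made up for illustration):
import pandas as pd
import numpy as np

df = pd.DataFrame({
    'SUBSCRIPTION_NAME': ['Toyota', 'Safeway', 'Starbucks', 'Safeway'],
    'SUBS_GROUP_ID': [7, np.nan, np.nan, np.nan],
})

mask_to_grab_nulls = df['SUBS_GROUP_ID'].isnull()
# Factorize only the rows whose ID is missing, and write back through .loc.
df.loc[mask_to_grab_nulls, 'SUBS_GROUP_ID'] = pd.factorize(
    df.loc[mask_to_grab_nulls, 'SUBSCRIPTION_NAME']
)[0] + 1
print(df)
#   SUBSCRIPTION_NAME  SUBS_GROUP_ID
# 0            Toyota            7.0
# 1           Safeway            1.0
# 2         Starbucks            2.0
# 3           Safeway            1.0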
You can use LabelEncoder.
from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()
df = df.dropna(subset=['SUBS_GROUP_ID'])  # drop null values
df_results = le.fit_transform(df.SUBS_GROUP_ID.values)  # encode strings to classes
df_results
I would use numpy.where to factorize only the non-NaN values.
import pandas as pd
import numpy as np
df = pd.DataFrame({'SUBS_GROUP_ID': ['ID-001', 'ID-002', np.nan, 'ID-004', 'ID-005'],
                   'SUBSCRIPTION_NAME': ['Toyota', 'Safeway', 'Starbucks', 'Safeway', 'Toyota']})
df['SUBS_GROUP_ID'] = np.where(~df['SUBS_GROUP_ID'].isnull(),
                               pd.factorize(df['SUBSCRIPTION_NAME'])[0] + 1,
                               np.nan)
>>> print(df)
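For reference, the expected output of that snippet (the factorized codes replace the non-null IDs, and the null row stays NaN):
   SUBS_GROUP_ID SUBSCRIPTION_NAME
0            1.0            Toyota
1            2.0           Safeway
2            NaN         Starbucks
3            2.0           Safeway
4            1.0            Toyota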

pandas index StringMethods loses index

I've just noticed that string operations on the index of a Pandas DataFrame don't maintain the index, so assigning the result back to the dataframe is kind of awkward. For example (and this is the case where I noticed it):
import pandas as pd
df = pd.DataFrame(
    [[1, 2], [3, 4], [5, 6]],
    index=['a11', 'b12', 'c13'])
df['num'] = df.index.str.extract('([0-9]+)')
gives me:
     0  1  num
a11  1  2  NaN
b12  3  4  NaN
c13  5  6  NaN
as the index has been lost and has just reverted to [0, 1, 2].
It took a bit of debugging to realise this index loss is why I was getting NaNs, but once I did, it was obvious that I could just do:
df['num'] = df.index.str.extract('([0-9]+)').set_index(df.index)
Is this right, or are there other methods that maintain the index?
You'll have to use the expand argument:
df['num'] = df.index.str.extract('([0-9]+)', expand=False)
from the docs:
expand : bool, default True
If True, return DataFrame with one column per capture group. If False, return a Series/Index if there is one capture group or
DataFrame if there are multiple capture groups.
New in version 0.18.0.
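A quick sketch using the question's frame to show the alignment working (the extracted values follow from the regex above):
import pandas as pd

df = pd.DataFrame(
    [[1, 2], [3, 4], [5, 6]],
    index=['a11', 'b12', 'c13'])

# With expand=False the extract returns an Index, which is assigned
# positionally, so the values line up with the existing rows.
df['num'] = df.index.str.extract('([0-9]+)', expand=False)
print(df)
#      0  1 num
# a11  1  2  11
# b12  3  4  12
# c13  5  6  13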
You can use the expand parameter to get the same desired result:
df['num'] = df.index.str.extract('([0-9]+)', expand=False)
expand=False returns a Series or Index (or a DataFrame when there are multiple capture groups); since you have only one capture group here, you can use it.
How about using assign?
df.assign(num=df.index.str.extract('([0-9]+)').values)

Replacing nan with blanks in Python

Below is my dataframe:
Id,ReturnCreated,ReturnTime,TS_startTime
O108808972773560,Return Not Created,nan,2018-08-23 12:30:41
O100497888936380,Return Not Created,nan,2018-08-18 14:57:20
O109648374050370,Return Not Created,nan,2018-08-16 13:50:06
O112787613729150,Return Not Created,nan,2018-08-16 13:15:26
O110938305325240,Return Not Created,nan,2018-08-22 11:03:37
O110829757146060,Return Not Created,nan,2018-08-21 16:10:37
I want to replace the nan values with blanks. I tried the below code, but it's not working.
import pandas as pd
import numpy as np
df = pd.concat({k:pd.Series(v) for k, v in ordercreated.items()}).unstack().astype(str).sort_index()
df.columns = 'ReturnCreated ReturnTime TS_startTime'.split()
df1 = df.replace(np.nan,"", regex=True)
df1.to_csv('OrderCreationdetails.csv')
Kindly help me understand where I am going wrong and how I can fix it.
You should try the DataFrame.fillna() method:
https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.fillna.html
In your case:
df1 = df.fillna("")
should work, I think.
I think the nans are strings, because of the .astype(str). So you need:
df1 = df.replace('nan',"")
You can either use df.fillna("") (I think that will perform better) or simply replace those values with blanks:
df1 = df.replace('NaN',"")
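A small sketch of what's going on (hypothetical two-row frame): .astype(str) turns real NaN into the literal string 'nan', which fillna no longer sees, so the string replacement is what actually clears it:
import pandas as pd
import numpy as np

df = pd.DataFrame({'ReturnTime': [np.nan, '12:30']}).astype(str)
print(df['ReturnTime'].tolist())  # ['nan', '12:30'] -> real NaN became the string 'nan'
print(df.fillna(""))              # unchanged, there are no real NaNs left
print(df.replace('nan', ""))      # the lowercase string replacement removes them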

Pandas recalculate index after a concatenation

I have a problem where I produce a pandas dataframe by concatenating along the row axis (stacking vertically).
Each of the constituent dataframes has an autogenerated index (ascending numbers).
After concatenation, my index is screwed up: it counts up to n (where n is the shape[0] of the corresponding dataframe), and restarts at zero at the next dataframe.
I am trying to "re-calculate the index, given the current order", or "re-index" (or so I thought). Turns out that isn't exactly what DataFrame.reindex seems to be doing.
Here is what I tried to do:
train_df = pd.concat(train_class_df_list)
train_df = train_df.reindex(index=[i for i in range(train_df.shape[0])])
It failed with "cannot reindex from a duplicate axis." I don't want to change the order of my data... just need to delete the old index and set up a new one, with the order of rows preserved.
If your index is autogenerated and you don't want to keep it, you can use the ignore_index option.
train_df = pd.concat(train_class_df_list, ignore_index=True)
This will autogenerate a new index for you, and my guess is that this is exactly what you are after.
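A minimal sketch of that option (toy frames in place of train_class_df_list):
import pandas as pd

train_class_df_list = [pd.DataFrame({'a': [1, 2]}), pd.DataFrame({'a': [3, 4]})]

# ignore_index=True throws away each frame's own index and builds a fresh 0..n-1 index.
train_df = pd.concat(train_class_df_list, ignore_index=True)
print(train_df.index)  # RangeIndex(start=0, stop=4, step=1)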
After vertical concatenation, if you get an index of [0, n) followed by [0, m), all you need to do is call reset_index:
train_df.reset_index(drop=True)
(you can do this in place using inplace=True).
>>> import pandas as pd
>>> pd.concat([
...     pd.DataFrame({'a': [1, 2]}),
...     pd.DataFrame({'a': [1, 2]})]).reset_index(drop=True)
   a
0  1
1  2
2  1
3  2
This should work:
train_df.reset_index(inplace=True, drop=True)
Set drop to True to avoid an additional column in your dataframe.
