I have a dataframe with columns of attributes for males and females. There is a column df['long_hair'] with 0=no and 1=yes. I want to fill the missing values in this column according to gender. This is my code. The problem is that inplace=True does not work on a conditionally selected column. How can I do this?
df[df['Male']==1]['long_hair'].fillna(0,inplace=True)
This code means get the people who are male and fill the missing values with 0 (meaning they don't have long hair).
You can bypass it by assigning the result of pd.Series.where. From the docs:
cond : bool Series/DataFrame, array-like, or callable
Where cond is True, keep the original value. Where False, replace with corresponding value from other. If cond is callable, it is computed on the Series/DataFrame and should return boolean Series/DataFrame or array. The callable must not change input Series/DataFrame (though pandas doesn’t check it).
df.long_hair = df.long_hair.where((df.Male != 1) | df.long_hair.notnull(), 0)
This keeps the original value wherever the row is not male or long_hair is not null, and fills the remaining cells (male and missing) with 0.
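Alternatively, a minimal `.loc` sketch (using the `Male`/`long_hair` columns from the question, with made-up data) that fills in place without chained indexing:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"Male": [1, 1, 0, 0],
                   "long_hair": [1.0, np.nan, np.nan, 1.0]})

# Fill missing long_hair with 0 only for the male rows; .loc assigns back
# into df itself, avoiding the chained-indexing copy that made
# fillna(..., inplace=True) a silent no-op.
df.loc[(df["Male"] == 1) & (df["long_hair"].isna()), "long_hair"] = 0
```

Female rows keep their missing values, so the same pattern can be repeated for the other gender with a different fill value.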
Pandas loc method when used with row filter throws an error
test[test.loc[0:1,['holiday','weekday']].apply(lambda x:True,axis=1)]
IndexingError: Unalignable boolean Series provided as indexer (index
of the boolean Series and of the indexed object do not match).
whereas the same code without the row filter works fine:
test[test.loc[:,['holiday','weekday']].apply(lambda x:True,axis=1)]
Steps to reproduce:
test=pd.DataFrame({"holiday":[0,0,0],"weekday":[1,2,3],"workingday":[1,1,1]})
test[test.loc[:,['holiday','weekday']].apply(lambda x:True,axis=1)] ##works fine
test[test.loc[0:1,['holiday','weekday']].apply(lambda x:True,axis=1)] ##fails
I am trying to understand what the difference is between these two that makes one fail while the other succeeds.
So the basic syntax is DataFrame[things to look for, e.g. row slices or columns].
With that in mind, you are trying to filter your dataframe test with the following commands (these are the code snippets in the brackets):
test.loc[:,['holiday','weekday']].apply(lambda x:True,axis=1)
This returns True for every row in the dataframe and therefore the "filter" returns the entire dataframe
test.loc[0:1,['holiday','weekday']].apply(lambda x:True,axis=1)
This part itself is working and it is doing so by slicing the rows 0 and 1 and then applying the lambda function. Therefore, the "filter" consists of True in only 2 rows. Now the point is, that there is no value for the third row and this causes your error: The indices of the dataframe that has to be sliced (3 rows) and the boolean Series used to slice it (2 values) don't match.
Solving this problem depends on what you actually want as your output, i.e. whether the lambda function is supposed to be applied only to a subset of the data or whether you want only a subset of the results being retrieved to work with.
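If the goal is the second reading, one sketch is to realign the shorter mask with the full index before filtering (assuming the `test` frame from the question):

```python
import pandas as pd

test = pd.DataFrame({"holiday": [0, 0, 0],
                     "weekday": [1, 2, 3],
                     "workingday": [1, 1, 1]})

# Build the boolean mask on the row subset, then realign it with the full
# index, treating the missing row 2 as False:
mask = test.loc[0:1, ["holiday", "weekday"]].apply(lambda x: True, axis=1)
result = test[mask.reindex(test.index, fill_value=False)]
```

`reindex` pads the mask back out to three entries, so its index matches the frame being sliced and the IndexingError disappears.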
I am trying to figure out whether or not a column in a pandas dataframe is boolean or not (and if so, if it has missing values and so on).
In order to test the function that I created I tried to create a dataframe with a boolean column with missing values. However, I would say that missing values are handled exclusively 'untyped' in python and there are some weird behaviours:
> boolean = pd.Series([True, False, None])
> print(boolean)
0 True
1 False
2 None
dtype: object
so the moment you put None into the list, the dtype becomes object, because pandas cannot represent a mixture of bool and NoneType values with the bool dtype. The same thing happens with math.nan and numpy.nan. The weirdest things happen when you try to force pandas into an area it does not want to go to :-)
> boolean = pd.Series([True, False, np.nan]).astype(bool)
> print(boolean)
0 True
1 False
2 True
dtype: bool
So 'np.nan' is being cast to 'True'?
Questions:
Given a data table where one column is of type 'object' but in fact it is a boolean column with missing values: how do I figure that out? After filtering for the non-missing values it is still of type 'object'... do I need to implement a try-catch-cast of every column into every imaginable data type in order to see the true nature of columns?
I guess that there is a logical explanation of why np.nan is cast to True, but this is unwanted behaviour of the software pandas/python itself, right? So should I file a bug report?
Q1: I would start by combining
np.any(pd.isna(boolean))
to identify whether there are any missing values in a column, with
set(boolean)
to identify whether only True, False and None occur inside. Combined with filtering (and, if you prefer, type casting) you should be done.
Q2: see comment of #WeNYoBen
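On Q2, a short sketch of why the cast behaves this way, plus the nullable `boolean` dtype (available since pandas 1.0) that avoids the problem:

```python
import numpy as np
import pandas as pd

# np.nan is a float, and any non-zero float is truthy, so astype(bool)
# turns it into True -- documented Python/NumPy behaviour, not a pandas bug:
assert bool(np.nan) is True

# The nullable "boolean" extension dtype keeps missing values missing
# instead of coercing them:
s = pd.Series([True, False, None], dtype="boolean")
```

With the nullable dtype, `s.dtype` reports `boolean` and the third entry stays `<NA>`, so no try-cast gymnastics are needed to detect a boolean column with gaps.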
I've hit the same problem. I came up with the following solution:
from pandas import Series

def is_boolean_series(col: Series):
    # Inspect the first non-missing value and check whether it is a bool
    val = col[~col.isna()].iloc[0]
    return type(val) == bool
I'm trying to clean test data from Kaggle's Titanic dataset, specifically the columns - sex, fare, pclass, and age. In order to do this, I'd like to find out if any of these columns have empty values. I load the data:
import pandas as pd
test_address = r'path_to_data\test.csv'  # raw string, so \t is not a tab
test = pd.read_csv(test_address)
When I try to look for empty values in the columns,
True in test['Sex'].isna()
outputs True.
However,
test['Sex'].isna().value_counts()
outputs
False 418
Name: Sex, dtype: int64
which should mean there aren't any empty values (I could confirm this by visually scanning the csv file). These commands on test['Pclass'] have similar outputs.
Why is the 'True in' command giving me the wrong answer?
The operator in, when applied to a Series, checks if its left operand is in the index of the right operand. Since there is a row #1 (the numeric representation of True) in the series, the operator evaluates to True.
For the same reason False in df['Sex'].isna() is True, but False in df['Sex'][1:].isna() is False (there is no row #0 in the latter slice).
You should check if True in df['Sex'].isna().values.
I am calling this line:
lang_modifiers = [keyw.strip() for keyw in row["language_modifiers"].split("|") if not isinstance(row["language_modifiers"], float)]
This seems to work where row["language_modifiers"] is a word (atlas method, central), but not when it comes up as nan.
I thought my if not isinstance(row["language_modifiers"], float) check would catch the cases where the value comes up as nan, but that is not the case.
Background: row["language_modifiers"] is a cell in a tsv file, and comes up as nan when that cell was empty in the tsv being parsed.
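The reason the isinstance check doesn't help: in a list comprehension the iterable expression is evaluated first, so `.split("|")` runs on the float nan and raises AttributeError before the `if` filter is ever consulted. A sketch of guarding before splitting (with a hypothetical cell value standing in for `row["language_modifiers"]`):

```python
cell = float("nan")  # what an empty tsv cell is parsed into

# cell.split("|") would raise AttributeError for a float, so branch first:
if isinstance(cell, float):
    lang_modifiers = []
else:
    lang_modifiers = [keyw.strip() for keyw in cell.split("|")]
```

The same guard also works inline as a conditional expression: `[] if isinstance(cell, float) else [k.strip() for k in cell.split("|")]`.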
You are right, such errors are mostly caused by NaN values representing empty cells.
It is common to filter out such data, before applying your further operations, using this idiom on your dataframe df:
df_new = df[df['ColumnName'].notnull()]
Alternatively, it may be handier to use the fillna() method to impute (replace) null values with a default.
E.g. all null or NaN values can be replaced with the average value of their column:
housing['LotArea'] = housing['LotArea'].fillna(housing['LotArea'].mean())
or can be replaced with a value like empty string "" or another default value
housing['GarageCond']=housing['GarageCond'].fillna("")
You might also use df = df.dropna(thresh=n), where n is the tolerance: a row needs at least n non-NA values in order not to be dropped.
Mind you, this approach removes whole rows.
For example: if you have a dataframe with 5 columns, df.dropna(thresh=5) would drop any row that does not have 5 valid, i.e. non-NA, values.
In your case you might only want to keep fully valid rows; if so, you can set the threshold to the number of columns you have.
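A minimal sketch of the thresh behaviour on a toy frame (column names invented for illustration):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"a": [1.0, np.nan, 3.0],
                   "b": [np.nan, np.nan, 6.0],
                   "c": [7.0, 8.0, 9.0]})

kept = df.dropna(thresh=2)  # rows need at least 2 non-NA values to survive
full = df.dropna(thresh=3)  # thresh == column count: only complete rows remain
```

Row 1 has a single non-NA value, so it is the only row dropped at thresh=2, while thresh=3 additionally drops row 0.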
pandas documentation on dropna
I want to set a cell of pandas dataframe equal to another. For example:
station_dim.loc[station_dim.nlc==573,'longitude']=station_dim.loc[station_dim.nlc==5152,'longitude']
However, when I checked
station_dim.loc[station_dim.nlc==573,'longitude']
It returns NaN
Besides directly setting station_dim.loc[station_dim.nlc==573,'longitude'] to a number, what other choice do I have? And why doesn't this method work?
Take a look at get_value (deprecated in newer pandas, where .iat/.iloc do the same job), or use .values:
station_dim.loc[station_dim.nlc==573,'longitude']=station_dim.loc[station_dim.nlc==5152,'longitude'].values[0]
For the assignment to work: .loc[] returns a pd.Series, and the index of that pd.Series would need to align with your df, which it probably doesn't. So either extract the value directly, using .get_value() (you need the index position first), or use .values, which returns a np.array, and take the first value of that array.
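A small sketch of the alignment failure and the fix, with a two-row stand-in for `station_dim` (data invented for illustration):

```python
import pandas as pd

station_dim = pd.DataFrame({"nlc": [573, 5152],
                            "longitude": [None, -0.12]})

# The slice for nlc==5152 is a Series with index [1]; assigning it to the
# row where nlc==573 (index [0]) aligns on index labels and produces NaN.
# Extracting the scalar first sidesteps the alignment:
value = station_dim.loc[station_dim.nlc == 5152, "longitude"].values[0]
station_dim.loc[station_dim.nlc == 573, "longitude"] = value
```

After the assignment, row 0 carries the longitude copied from row 1 instead of NaN.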