Creating new variable based on data in dataframe, ignore NaN

I have a dataframe like that below and want to create a new variable that is a 1/0 or True/False if all of the available scores in certain columns are equal to or above 4.
The data is quite messy. Some cells are NaN (respondent didn't provide a response), some are white space (bad formatting or respondent pressed space bar, maybe?).
ID Var1 Var2 Var3
id0001 2 NaN 2
id0002 10 3 10
id0003 8 0
id0004 NaN NaN NaN
id0005 7 3 7
id0006 NaN 9 9
I don't want to drop those rows with a missing value because most have a missing value. I can't just make NaN and white space cells 0 because 0 means something here. I can easily make all white space cells NaN, but I don't know how to ignore them as then I have instances of 'str' and 'int' when I do something like the following:
scoreoffouroraboveforall = [(df.Var1 >= 4) & (df.Var2 >= 4) & (df.Var3 >= 4)]
This is probably very simple to do, but I'm at a loss.

Use pd.to_numeric with errors='coerce' to convert each of the columns Var1, Var2 and Var3 to a numeric type, then use DataFrame.ge and DataFrame.all along axis=1 to build the boolean mask with True/False values. Note that NaN never satisfies >= 4, so any row with a missing value comes out False:
m = df[['Var1', 'Var2', 'Var3']].apply(
pd.to_numeric, errors='coerce').ge(4).all(axis=1)
Result:
print(m)
0 False
1 False
2 False
3 False
4 False
5 False
dtype: bool
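Because the mask above treats a missing score as a failure, id0006 (NaN, 9, 9) comes out False. If missing scores should instead be ignored, as the question asks, one possible sketch is below. The sample frame is rebuilt from the question (which cell of id0003 is blank is a guess), the column name all_ge_4 is made up for illustration, and keeping all-missing rows False is an assumption about the desired behaviour.

```python
import numpy as np
import pandas as pd

# Sample data modeled on the question; the position of the blank cell is assumed.
df = pd.DataFrame({
    "ID": ["id0001", "id0002", "id0003", "id0004", "id0005", "id0006"],
    "Var1": [2, 10, 8, np.nan, 7, np.nan],
    "Var2": [np.nan, 3, " ", np.nan, 3, 9],
    "Var3": [2, 10, 0, np.nan, 7, 9],
})

cols = ["Var1", "Var2", "Var3"]
# Whitespace and any other non-numeric cells become NaN.
scores = df[cols].apply(pd.to_numeric, errors="coerce")

# A cell passes if it is missing or at least 4.
passes = scores.ge(4) | scores.isna()
# Require at least one real score so all-missing rows stay False (assumption).
df["all_ge_4"] = passes.all(axis=1) & scores.notna().any(axis=1)

print(df[["ID", "all_ge_4"]])
```

With this data only id0006 is flagged True. Drop the notna().any(axis=1) term if a row with no scores at all should count as passing.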

Related

Pandas, numpy.where(), and numpy.nan

I want to use numpy.where() to add a column to a pandas.DataFrame. I'd like to use NaN values for the rows where the condition is false (to indicate that these values are "missing").
Consider:
>>> import numpy; import pandas
>>> df = pandas.DataFrame({'A':[1,2,3,4]}); print(df)
A
0 1
1 2
2 3
3 4
>>> df['B'] = numpy.nan
>>> df['C'] = numpy.where(df['A'] < 3, 'yes', numpy.nan)
>>> print(df)
A B C
0 1 NaN yes
1 2 NaN yes
2 3 NaN nan
3 4 NaN nan
>>> df.isna()
A B C
0 False True False
1 False True False
2 False True False
3 False True False
Why does B show "NaN" but C shows "nan"? And why does DataFrame.isna() fail to detect the NaN values in C?
Should I use something other than numpy.nan inside where? None and pandas.NA both seem to work and can be detected by DataFrame.isna(), but I'm not sure these are the best choice.
Thank you!
Edit: As per @Tim Roberts and @DYZ, numpy.where returns an array of type string, so the str constructor is called on numpy.nan. The values in column C are actually strings "nan". The question remains, however: what is the most elegant thing to do here? Should I use None? Or something else?

np.where coerces its second and third arguments to a common datatype. Since the second argument is a string, the third one is converted to a string too, which gives the same result as calling str():
str(numpy.nan)
# 'nan'
As a result, the values in column C are all strings.
You can instead pass None as the third argument and then convert the None values to numpy.nan with fillna():
df['C'] = numpy.where(df['A'] < 3, 'yes', None)
df['C'] = df['C'].fillna(numpy.nan)
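An alternative that avoids numpy.where altogether is sketched below. It is not taken from the answers here; it relies on Series.where, which keeps values where the condition holds and fills the rest with a missing value by default, so DataFrame.isna() can detect them.

```python
import pandas

df = pandas.DataFrame({"A": [1, 2, 3, 4]})

# Start from a constant "yes" column, then blank out the rows where the
# condition is False. Series.where fills those rows with a missing value.
df["C"] = pandas.Series("yes", index=df.index).where(df["A"] < 3)

print(df)
print(df["C"].isna().tolist())
```

Rows 2 and 3 are now reported as missing by isna(), because no string conversion of NaN ever happens.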

B is a pure numeric column. C holds Python strings (including the literal string 'nan'), so the column has type "object", and it prints differently.

How to fill a column under certain conditions?

I have two data frames, df (with 15000 rows) and df1 (with 20000 rows).
Where df looks like
Number Color Code Quantity
1 Red 12380 2
2 Bleu 14440 3
3 Red 15601 1
and df1 has two columns, Code and Quantity. I want to fill its Quantity column under certain conditions, using Python, to obtain something like this:
Code Quantity
12380 2
15601 1
15640 1
14400 0
The conditions I want to take into consideration are:
If the last two characters of the Code column of df1 are both zero, I want 0 in the Quantity column of df1.
If the Code is not found in df, I put 1 in the Quantity column of df1.
Otherwise, I take the Quantity value from df.

Let us try:
mask = df1['Code'].astype(str).str[-2:].eq('00')
mapped = df1['Code'].map(df.set_index('Code')['Quantity'])
df1['Quantity'] = mapped.mask(mask, 0).fillna(1)
Details:
Create a boolean mask specifying the condition where the last two characters of Code are both 0:
>>> mask
0 False
1 False
2 False
3 True
Name: Code, dtype: bool
Use Series.map to map the values in the Code column of df1 to the Quantity column of df based on the matching Code:
>>> mapped
0 2.0
1 1.0
2 NaN
3 NaN
Name: Code, dtype: float64
Mask the values in the mapped column above with 0 where the boolean mask is True, and lastly fill the remaining NaN values with 1:
>>> df1
Code Quantity
0 12380 2.0
1 15601 1.0
2 15640 1.0
3 14400 0.0
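For reference, here are the three steps as one runnable sketch. The two frames are rebuilt from the sample rows in the question; the final astype(int) is an addition to get integer quantities back, and the lookup assumes each Code appears only once in df (Series.map raises an error on a non-unique index).

```python
import pandas as pd

# Sample frames rebuilt from the question.
df = pd.DataFrame({
    "Number": [1, 2, 3],
    "Color": ["Red", "Bleu", "Red"],
    "Code": [12380, 14440, 15601],
    "Quantity": [2, 3, 1],
})
df1 = pd.DataFrame({"Code": [12380, 15601, 15640, 14400]})

# True where the last two characters of Code are "00".
ends_in_00 = df1["Code"].astype(str).str[-2:].eq("00")

# Look up Quantity by Code; codes missing from df become NaN.
lookup = df.set_index("Code")["Quantity"]  # assumes Code is unique in df
mapped = df1["Code"].map(lookup)

# Rule order: "00" codes get 0, remaining unknown codes get 1.
df1["Quantity"] = mapped.mask(ends_in_00, 0).fillna(1).astype(int)

print(df1)
```

Applying mask before fillna matters: a code ending in "00" gets 0 whether or not it also exists in df.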

How Can I drop a column if the last row is nan

I have found examples of how to remove a column when all of its values, or more than a threshold, are NaN, but I have not been able to find a solution to my particular problem, which is dropping a column if its last row is NaN. The reason is that I'm using time series data in which the collection of data doesn't all start at the same time. That is fine, but if I used one of the previous solutions it would remove 95% of the dataset. However, I do not want columns whose most recent value is NaN, as that means the series is defunct.
A B C
nan t x
1 2 3
x y z
4 nan 6
Returns
A C
nan x
1 3
x z
4 6

You can also do something like this
df.loc[:, ~df.iloc[-1].isna()]
A C
0 NaN x
1 1 3
2 x z
3 4 6

Try with dropna
df = df.dropna(axis=1, subset=[df.index[-1]], how='any')
Out[8]:
A C
0 NaN x
1 1 3
2 x z
3 4 6

You can use .iloc, .loc and .notna() to solve this.
import numpy as np
import pandas as pd
df = pd.DataFrame({"A": [np.nan, 1, "x", 4],
                   "B": ["t", 2, "y", np.nan],
                   "C": ["x", 3, "z", 6]})
# keep only the columns whose last row is not NaN
df = df.loc[:, df.iloc[-1, :].notna()]

You can use a boolean Series to select the column to drop
df.drop(df.loc[:,df.iloc[-1].isna()], axis=1)
Out:
A C
0 NaN x
1 1 3
2 x z
3 4 6

for col in list(temp_df.columns):
    if temp_df[col].iloc[-1] == 'nan':
        temp_df = temp_df.drop(columns=col)
This loops over all columns, checks whether the last entry is the string 'nan', and drops the column if it is. Note that the comparison only matches the literal string 'nan'; a real NaN never compares equal to anything, so it is not caught here.
list(temp_df.columns)
is a copy of the column labels, so dropping columns inside the loop is safe.
temp_df.drop(columns=col)
drops the column with the label col.
EDIT:
I read the other answers on this same post and it seems to me that notna would be best (I would use it), but the advantage of this method is that you can compare against anything you wish.
Another option is pd.isnull(), a pandas function that detects real NaN values and works like this:
for col in list(temp_df.columns):
    if pd.isnull(temp_df[col].iloc[-1]):
        temp_df = temp_df.drop(columns=col)

Pandas.read_excel sometimes incorrectly reads Boolean values as 1's/0's

I need to read a very large Excel file into a DataFrame. The file has string, integer, float, and Boolean data, as well as missing data and totally empty rows. It may also be worth noting that some of the cell values are derived from cell formulas and/or VBA - although theoretically that shouldn't affect anything.
As the title says, pandas sometimes reads Boolean values as float or int 1's and 0's, instead of True and False. It appears to have something to do with the amount of empty rows and type of other data. For simplicity's sake, I'm just linking a 2-sheet Excel file where the issue is replicated.
Boolean_1.xlsx
Here's the code:
import pandas as pd
df1 = pd.read_excel('Boolean_1.xlsx','Sheet1')
df2 = pd.read_excel('Boolean_1.xlsx','Sheet2')
print(df1, '\n' *2, df2)
Here's the print. Mainly note row ZBA, which has the same values in both sheets, but different values in the DataFrames:
Name stuff Unnamed: 1 Unnamed: 2 Unnamed: 3
0 AFD a dsf ads
1 DFA 1 2 3
2 DFD 123.3 41.1 13.7
3 IIOP why why why
4 NaN NaN NaN NaN
5 ZBA False False True
Name adslfa Unnamed: 1 Unnamed: 2 Unnamed: 3
0 asdf 6.0 3.0 6.0
1 NaN NaN NaN NaN
2 NaN NaN NaN NaN
3 NaN NaN NaN NaN
4 NaN NaN NaN NaN
5 ZBA 0.0 0.0 1.0
I was also able to get integer 1's and 0's output in the large file I'm actually working on (yay), but wasn't able to easily replicate it.
What could be causing this inconsistency, and is there a way to force pandas to read Booleans as they should be read?

Pandas type-casting is applied by column / series. In general, Pandas doesn't work well with mixed types, or object dtype. You should expect internalised logic to determine the most efficient dtype for a series. In this case, Pandas has chosen float dtype as applicable for a series containing float and bool values. In my opinion, this is efficient and neat.
However, as you noted, this won't work when you have a transposed input dataset. Let's set up an example from scratch:
import pandas as pd, numpy as np
df = pd.DataFrame({'A': [True, False, True, True],
'B': [np.nan, np.nan, np.nan, False],
'C': [True, 'hello', np.nan, True]})
df = df.astype({'A': bool, 'B': float, 'C': object})
print(df)
A B C
0 True NaN True
1 False NaN hello
2 True NaN NaN
3 True 0.0 True
Option 1: change "row dtype"
You can, without transposing your data, change the dtype for objects in a row. This will force series B to have object dtype, i.e. a series storing pointers to arbitrary types:
df.iloc[3] = df.iloc[3].astype(bool)
print(df)
A B C
0 True NaN True
1 False NaN hello
2 True NaN NaN
3 True False True
print(df.dtypes)
A bool
B object
C object
dtype: object
Option 2: transpose and cast to Boolean
In my opinion, this is the better option, as a data type is being attached to a specific category / series of input data.
df = df.T # transpose dataframe
df[3] = df[3].astype(bool) # convert series to Boolean
print(df)
0 1 2 3
A True False True True
B NaN NaN NaN False
C True hello NaN True
print(df.dtypes)
0 object
1 object
2 object
3 bool
dtype: object

read_excel appears to determine the dtype for each column from the first row in that column that has a value. If the first row of the column is empty, it continues to the next row until a value is found.
In Sheet1, the first row with values in columns B, C, and D contains strings. Therefore, all subsequent rows are treated as strings for these columns. In this case, FALSE = False.
In Sheet2, the first row with values in columns B, C, and D contains numbers. Therefore, all subsequent rows are treated as numbers for these columns. In this case, FALSE = 0.

Detecting certain columns and deleting these

I have a dataframe where some columns (not rows) are like ["","","",""].
I would like to delete the columns with that characteristic.
Is there an efficient way of doing that?

In pandas it would be del df['columnname'].

To delete columns where all values are empty, you first need to detect what columns contain only empty values.
So I made an example dataframe like this:
  empty  full  nanvalues  notempty
0            3        NaN         1
1            4        NaN         2
By casting the dataframe to str, we can compare entire columns to the empty string and then aggregate down with the .all() method.
empties = (df.astype(str) == "").all()
empties
empty True
full False
nanvalues False
notempty False
dtype: bool
Now we can drop these columns
empty_mask = empties.index[empties]
df.drop(empty_mask, axis=1)
full nanvalues notempty
0 3 NaN 1
1 4 NaN 2
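A self-contained version of the steps above might look like the sketch below. The example frame is rebuilt from the printed table, and stripping whitespace before the comparison is an extension (not part of the answer) so that cells containing only spaces also count as empty.

```python
import numpy as np
import pandas as pd

# Example frame rebuilt from the table above.
df = pd.DataFrame({
    "empty": ["", ""],
    "full": [3, 4],
    "nanvalues": [np.nan, np.nan],
    "notempty": [1, 2],
})

# Compare every cell, as text, to the empty string.
# Stripping first is an extension: whitespace-only cells count as empty too.
as_text = df.astype(str)
is_blank = as_text.apply(lambda col: col.str.strip().eq("")).all()

# Drop the columns in which every cell is blank.
cleaned = df.drop(columns=is_blank.index[is_blank])

print(list(cleaned.columns))
```

The all-NaN column survives because NaN is not an empty string; combine this with dropna(axis=1, how='all') if those columns should go as well.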
