What is a graceful way to fail when I want to access a value from a dataframe based on multiple conditions:
#Select from DataFrame using criteria from multiple columns
newdf = df[(df['column_one']>2004) & (df['column_two']==9)]
If no value satisfying the above condition exists, pandas returns a KeyError. How do I instead just store a NaN value in newdf?
If, instead of dropping the rows where the condition is not met, you want pandas to return a dataframe with rows of NaN where the condition is False and the original values otherwise, you can do the following.
Filter the dataframe with a boolean mask (a list of booleans with length equal to the number of rows) and assign the result back to a view on all rows of the dataframe (df1[:]). This gives you NaN on the rows where the mask is False and the original values on the rows where it is True. If the entire mask is False, you just get a dataframe full of NaN.
P.S. One of your column names is probably off. Even if everything is False, the selection should just return an empty dataframe rather than raise a KeyError.
Input:
print(df1)
df1[:] = df1[((df1["a"] > 2) & (df1["b"] > 1)).tolist()]
print(df1)
Output:
a b c
0 1 2 3
1 2 2 3
2 3 2 3
a b c
0 NaN NaN NaN
1 NaN NaN NaN
2 3.0 2.0 3.0
Related
I cannot find a pandas function (which I had seen before) to substitute the NaN's in a dataframe with values from another dataframe (assuming a common index which can be specified). Any help?
If you have two DataFrames of the same shape, then:
df[df.isnull()] = d2
Will do the trick.
Only locations where df.isnull() evaluates to True will be eligible for assignment.
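For example, a minimal sketch with two small same-shaped frames (the names df and d2 follow the question; the values here are made up):
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': [1.0, np.nan], 'B': [np.nan, 4.0]})
d2 = pd.DataFrame({'A': [10.0, 20.0], 'B': [30.0, 40.0]})

df[df.isnull()] = d2   # only the NaN positions in df get overwritten from d2
print(df)
#       A     B
# 0   1.0  30.0
# 1  20.0   4.0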
In practice, the DataFrames aren't always the same size / shape, and transforming methods (especially .shift()) are useful.
Data coming in is invariably dirty, incomplete, or inconsistent. Par for the course. There's a pretty extensive pandas tutorial and associated cookbook for dealing with these situations.
As I just learned, there is a DataFrame.combine_first() method, which does precisely this, with the additional property that if your updating data frame d2 is bigger than your original df, the additional rows and columns are added, as well.
df = df.combine_first(d2)
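A minimal sketch of that behaviour (made-up values; note how d2's extra row and extra column C are carried over into the result):
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': [1.0, np.nan], 'B': [np.nan, 4.0]})
d2 = pd.DataFrame({'A': [10.0, 20.0, 30.0], 'C': [7.0, 8.0, 9.0]})

print(df.combine_first(d2))
#       A    B    C
# 0   1.0  NaN  7.0
# 1  20.0  4.0  8.0
# 2  30.0  NaN  9.0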
This should be as simple as
df.fillna(d2)
A dedicated method for this is DataFrame.update:
Quoted from the documentation:
Modify in place using non-NA values from another DataFrame.
Aligns on indices. There is no return value.
Important to note is that this method modifies your data in place: df1 itself is overwritten and nothing is returned.
Example:
print(df1)
A B C
aaa NaN 1.0 NaN
bbb NaN NaN 10.0
ccc 3.0 NaN 6.0
ddd NaN NaN NaN
eee NaN NaN NaN
print(df2)
A B C
index
aaa 1.0 1.0 NaN
bbb NaN NaN 10.0
eee NaN 1.0 NaN
# update df1 NaN where there are values in df2
df1.update(df2)
print(df1)
A B C
aaa 1.0 1.0 NaN
bbb NaN NaN 10.0
ccc 3.0 NaN 6.0
ddd NaN NaN NaN
eee NaN 1.0 NaN
Notice the updated NaN values at the intersections (aaa, A) and (eee, B).
DataFrame.combine_first() answers this question exactly.
However, sometimes you want to fill/replace/overwrite some of the non-missing (non-NaN) values of DataFrame A with values from DataFrame B. That question brought me to this page, and the solution is DataFrame.mask()
A = B.mask(condition, A)
When condition is true, the values from A will be used, otherwise B's values will be used.
For example, you could solve the OP's original question with mask such that when an element from A is non-NaN, use it, otherwise use the corresponding element from B.
But using DataFrame.mask() you could replace the values of A that fail to meet arbitrary criteria (less than zero? more than 100?) with values from B. So mask is more flexible, and overkill for this problem, but I thought it was worthy of mention (I needed it to solve my problem).
It's also important to note that B could be a NumPy array instead of a DataFrame. DataFrame.combine_first() requires that B be a DataFrame, but DataFrame.mask() just requires that B is an NDFrame (or array-like) and that its dimensions match A's dimensions.
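A small sketch of both uses (the frame names df and d2 are illustrative and assume aligned indices and columns):
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': [1.0, np.nan], 'B': [np.nan, 4.0]})
d2 = pd.DataFrame({'A': [10.0, 20.0], 'B': [30.0, 40.0]})

# The OP's case: wherever df is NaN, take the value from d2
filled = df.mask(df.isnull(), d2)
print(filled)
#       A     B
# 0   1.0  30.0
# 1  20.0   4.0

# The more general case: overwrite df's values that meet an arbitrary criterion
capped = df.mask(df > 3, d2)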
One important piece of information missing from the other answers is that both combine_first and fillna align on the index, so you have to make the indices match across the DataFrames for these methods to work.
Oftentimes, there's a need to match on some other column(s) to fill in missing values. In that case, you need to use set_index first to turn the columns to be matched into the index.
df1 = df1.set_index(cols_to_be_matched).fillna(df2.set_index(cols_to_be_matched)).reset_index()
or
df1 = df1.set_index(cols_to_be_matched).combine_first(df2.set_index(cols_to_be_matched)).reset_index()
Another option is to use merge:
df1 = (df1.merge(df2, on=cols_to_be_matched, how='left', suffixes=('', '\x00'))
          .sort_index(axis=1).bfill(axis=1)[df1.columns])
The idea here is to left-merge and then sort the columns (we use '\x00' as the suffix for the columns coming from df2, since it is the character with the lowest Unicode value), which makes sure each pair of same-named columns ends up next to each other. Then bfill horizontally to fill the NaNs in df1's columns from the adjacent df2 columns, and finally keep only df1's original columns.
Example:
Suppose you had df1:
C1 C2 C3 C4
0 1 a 1.0 0
1 1 b NaN 1
2 2 b NaN 2
3 2 b NaN 3
and df2
C1 C2 C3
0 1 b 2
1 2 b 3
and you want to fill in the missing values in df1 with the values from df2 for each C1-C2 pair. Then
cols_to_be_matched = ['C1', 'C2']
and all of the snippets above produce the following output (where the values are indeed filled as required):
C1 C2 C3 C4
0 1 a 1.0 0
1 1 b 2.0 1
2 2 b 3.0 2
3 2 b 3.0 3
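For reference, here is a runnable reconstruction of this example using the set_index/fillna variant (the values are copied from the tables above; the other two snippets give the same result):
import numpy as np
import pandas as pd

df1 = pd.DataFrame({'C1': [1, 1, 2, 2],
                    'C2': ['a', 'b', 'b', 'b'],
                    'C3': [1.0, np.nan, np.nan, np.nan],
                    'C4': [0, 1, 2, 3]})
df2 = pd.DataFrame({'C1': [1, 2], 'C2': ['b', 'b'], 'C3': [2, 3]})

cols_to_be_matched = ['C1', 'C2']
df1 = (df1.set_index(cols_to_be_matched)
          .fillna(df2.set_index(cols_to_be_matched))
          .reset_index())
print(df1)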
The problem is to slice the dataframe at the indices of its fully non-null rows and concatenate the string values of each slice, without iterating over the indices with a for-loop after getting them.
I am able to do it, but only by using a for-loop after slicing the df according to the non-null indices. I want this to be done without separately iterating over the indices.
import numpy as np
import pandas as pd

df = pd.DataFrame([["a", 2, 1], ["b", np.nan, np.nan], ["c", np.nan, np.nan],
                   ["d", 3, 4], ["e", np.nan, np.nan]])
list1 = []
indexes = df.dropna().index.values.tolist()
indexes.append(df.shape[0])
for i in range(len(indexes) - 1):
    list1.append("".join(df[0][indexes[i]:indexes[i + 1]].tolist()))
# list1 becomes ['abc', 'de']
This is the sample DF:
0 1 2
0 a 2.0 1.0
1 b NaN NaN
2 c NaN NaN
3 d 3.0 4.0
4 e NaN NaN
The expected output is a list like: ['abc', 'de']
Explanation:
First string
a: not null (start picking)
b: null
c: null
Second string
d: not null (the next fully non-null row is encountered, so start the second string)
e: null
This is a case for cumsum:
# change all(axis=1) to any(axis=1) if only one NaN in a row is enough
s = df.iloc[:,1:].notnull().all(axis=1)
df[0].groupby(s.cumsum()).apply(''.join)
output:
1 abc
2 de
Name: 0, dtype: object
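Putting the question's frame and the two lines above together, a self-contained sketch:
import numpy as np
import pandas as pd

df = pd.DataFrame([["a", 2, 1], ["b", np.nan, np.nan], ["c", np.nan, np.nan],
                   ["d", 3, 4], ["e", np.nan, np.nan]])

# True where every value except column 0 is non-null, i.e. where a new group starts
s = df.iloc[:, 1:].notnull().all(axis=1)
result = df[0].groupby(s.cumsum()).apply(''.join)
print(result.tolist())   # ['abc', 'de']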
I tried to insert a new row into a dataframe named 'my_df1' using my_df1.loc. But in the result, the newly added row has NaN values.
my_data = {'A':pd.Series([1,2,3]),'B':pd.Series([4,5,6]),'C':('a','b','c')}
my_df1 = pd.DataFrame(my_data)
print(my_df1)
my_df1.loc[3] = pd.Series([5,5,5])
Result displayed is as below
A B C
0 1.0 4.0 a
1 2.0 5.0 b
2 3.0 6.0 c
3 NaN NaN NaN
The reason the row is all NaN is that my_df1.loc[3] has index (A, B, C) while pd.Series([5,5,5]) has index (0, 1, 2). When you do series1 = series2, pandas only copies values at common indices, hence the result.
To fix this, do as @anky_91 says, or, if you already have a Series, use its values:
my_df1.loc[3] = my_series.values
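For example, either of these avoids the index mismatch (a small sketch for illustration; it may or may not be what @anky_91 suggested):
import pandas as pd

my_data = {'A': pd.Series([1, 2, 3]), 'B': pd.Series([4, 5, 6]), 'C': ('a', 'b', 'c')}
my_df1 = pd.DataFrame(my_data)

# Assign a plain list, so no index alignment takes place:
my_df1.loc[3] = [5, 5, 5]

# Or give the Series an index that matches the DataFrame's columns:
my_df1.loc[4] = pd.Series([6, 6, 6], index=my_df1.columns)

print(my_df1)   # rows 3 and 4 now hold the intended values instead of NaN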
Finally I found out how to add a Series as a row or column to a dataframe
my_data = {'A':pd.Series([1,2,3]),'B':pd.Series([4,5,6]),'C':('a','b','c')}
my_df1 = pd.DataFrame(my_data)
print(my_df1)
Code 1 adds a new column 'D' with values 5, 5, 5 to the dataframe:
my_df1.loc[:,'D'] = pd.Series([5,5,5],index = my_df1.index)
print(my_df1)
Code 2 adds a new row with index 3 and values 3, 4, 3, 4 to the dataframe from Code 1:
my_df1.loc[3] = pd.Series([3,4,3,4],index = ('A','B','C','D'))
print(my_df1)
I need to read a very large Excel file into a DataFrame. The file has string, integer, float, and Boolean data, as well as missing data and totally empty rows. It may also be worth noting that some of the cell values are derived from cell formulas and/or VBA - although theoretically that shouldn't affect anything.
As the title says, pandas sometimes reads Boolean values as float or int 1's and 0's, instead of True and False. It appears to have something to do with the amount of empty rows and type of other data. For simplicity's sake, I'm just linking a 2-sheet Excel file where the issue is replicated.
Boolean_1.xlsx
Here's the code:
import pandas as pd
df1 = pd.read_excel('Boolean_1.xlsx','Sheet1')
df2 = pd.read_excel('Boolean_1.xlsx','Sheet2')
print(df1, '\n' *2, df2)
Here's the print. Mainly note row ZBA, which has the same values in both sheets, but different values in the DataFrames:
Name stuff Unnamed: 1 Unnamed: 2 Unnamed: 3
0 AFD a dsf ads
1 DFA 1 2 3
2 DFD 123.3 41.1 13.7
3 IIOP why why why
4 NaN NaN NaN NaN
5 ZBA False False True
Name adslfa Unnamed: 1 Unnamed: 2 Unnamed: 3
0 asdf 6.0 3.0 6.0
1 NaN NaN NaN NaN
2 NaN NaN NaN NaN
3 NaN NaN NaN NaN
4 NaN NaN NaN NaN
5 ZBA 0.0 0.0 1.0
I was also able to get integer 1's and 0's output in the large file I'm actually working on (yay), but wasn't able to easily replicate it.
What could be causing this inconsistency, and is there a way to force pandas to read Booleans as they should be read?
Pandas type-casting is applied by column / series. In general, Pandas doesn't work well with mixed types, or object dtype. You should expect internalised logic to determine the most efficient dtype for a series. In this case, Pandas has chosen float dtype as applicable for a series containing float and bool values. In my opinion, this is efficient and neat.
However, as you noted, this won't work when you have a transposed input dataset. Let's set up an example from scratch:
import pandas as pd, numpy as np
df = pd.DataFrame({'A': [True, False, True, True],
                   'B': [np.nan, np.nan, np.nan, False],
                   'C': [True, 'hello', np.nan, True]})
df = df.astype({'A': bool, 'B': float, 'C': object})
print(df)
A B C
0 True NaN True
1 False NaN hello
2 True NaN NaN
3 True 0.0 True
Option 1: change "row dtype"
You can, without transposing your data, change the dtype for objects in a row. This will force series B to have object dtype, i.e. a series storing pointers to arbitrary types:
df.iloc[3] = df.iloc[3].astype(bool)
print(df)
A B C
0 True NaN True
1 False NaN hello
2 True NaN NaN
3 True False True
print(df.dtypes)
A bool
B object
C object
dtype: object
Option 2: transpose and cast to Boolean
In my opinion, this is the better option, as a data type is being attached to a specific category / series of input data.
df = df.T # transpose dataframe
df[3] = df[3].astype(bool) # convert series to Boolean
print(df)
0 1 2 3
A True False True True
B NaN NaN NaN False
C True hello NaN True
print(df.dtypes)
0 object
1 object
2 object
3 bool
dtype: object
read_excel determines the dtype for each column based on the first row in that column that contains a value. If the first row of the column is empty, read_excel continues to the next row until a value is found.
In Sheet1, your first row with values in columns B, C, and D contains strings. Therefore, all subsequent rows will be treated as strings for these columns. In this case, FALSE = False.
In Sheet2, your first row with values in columns B, C, and D contains integers. Therefore, all subsequent rows will be treated as integers for these columns. In this case, FALSE = 0.
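If you need proper booleans back after reading, one possible post-processing workaround is sketched below. The file, sheet, and column names are taken from the question's output and are otherwise assumptions, and note that the 0/1-to-bool mapping would also convert any genuine numeric 0/1 in those columns:
import pandas as pd

df2 = pd.read_excel('Boolean_1.xlsx', sheet_name='Sheet2')

# Columns assumed to hold the Boolean cells in the source sheet
value_cols = ['Unnamed: 1', 'Unnamed: 2', 'Unnamed: 3']

# Map numeric 0/1 back to False/True, leaving every other value (floats, NaN, strings) untouched
df2[value_cols] = df2[value_cols].apply(
    lambda col: col.map(lambda v: bool(int(v)) if v in (0, 1) else v)
)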
I have a dataframe where some columns (not rows) look like ["", "", "", ""].
I would like to delete the columns with that characteristic.
Is there an efficient way of doing that?
In pandas it would be del df['columnname'].
To delete columns where all values are empty, you first need to detect what columns contain only empty values.
So I made an example dataframe like this:
  empty  full  nanvalues  notempty
0            3        NaN         1
1            4        NaN         2
We can compare the entire frame to the empty string and then aggregate each column down with the .all() method.
empties = (df.astype(str) == "").all()
empties
empty True
full False
nanvalues False
notempty False
dtype: bool
Now we can drop these columns
empty_mask = empties.index[empties]
df.drop(empty_mask, axis=1)
full nanvalues notempty
0 3 NaN 1
1 4 NaN 2
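For completeness, a consolidated, runnable sketch of the steps above (the example frame is reconstructed from the printed output, so the exact column contents are assumptions):
import numpy as np
import pandas as pd

df = pd.DataFrame({'empty': ['', ''],
                   'full': [3, 4],
                   'nanvalues': [np.nan, np.nan],
                   'notempty': [1, 2]})

empties = (df.astype(str) == "").all()        # True for columns that contain only empty strings
df = df.drop(columns=empties.index[empties])  # drop those columns
print(df)
#    full  nanvalues  notempty
# 0     3        NaN         1
# 1     4        NaN         2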