I have two CSV files that I'm comparing, returning side by side only the columns that have different values. If a value is empty in one of the columns, the code throws an error:
ValueError: Can only compare identically-labeled Series objects
import pandas as pd
df1=pd.read_csv('csv1.csv')
df2=pd.read_csv('csv2.csv')
def process_df(df):
    res = df.set_index('Country').stack()
    res.index.rename('Column', level=1, inplace=True)
    return res
df1 = process_df(df1)
df2 = process_df(df2)
mask = (df1 != df2) & ~(df1.isnull() & df2.isnull())
df3 = pd.concat([df1[mask], df2[mask]], axis=1).rename({0:'From', 1:'To'}, axis=1)
print(df3)
My current output without missing values:
                    From       To
Country Column
Bermuda 1980     0.00793  0.00093
        1981     0.00687  0.00680
        1986     0.00700  1.00700
Mexico  1980     3.72819  3.92819
If some values are missing, I just want an empty cell, like the example below:
                    From       To
Country Column
Bermuda 1980     0.00793  0.00093
        1981     0.00687           <--- Missing value
        1986     0.00700  1.00700
Mexico  1980     3.72819  3.92819
The issue is that the indexes don't match... As a simplified example (note that if you pass an empty element ('') into df1 instead of, say, the [4] element it produces the same result):
In [21]: df1 = pd.DataFrame([[1], [4]])
In [22]: df1
Out[22]:
0
0 1
1 4
Using the same DF structure but changing the index...
In [23]: df2 = pd.DataFrame([[3], [2]], index=[1, 0])
In [24]: df2
Out[24]:
0
1 3
0 2
Now to compare...
In [25]: df1[0] == df2[0]
ValueError: Can only compare identically-labeled Series objects
To prove out the index issue - recast df2 without the reverse index...
In [26]: df3 = pd.DataFrame([[3], [2]])
In [27]: df3
Out[27]:
0
0 3
1 2
And the resulting comparison:
In [28]: df1[0] == df3[0]
Out[28]:
0 False
1 False
Name: 0, dtype: bool
The Fix
You'll have to reindex one of the DataFrames so that the labels line up, like so (this example uses a "sortable" index, so it is more challenging with a more complex multi-index):
In [44]: df2.sort_index(inplace=True)
In [45]: df1[0] == df2[0]
Out[45]:
0 False
1 False
Name: 0, dtype: bool
If you can provide the CSV data, we could give it a try with a multi index...
Multi-Index
The .sort_index() method has a level= parameter. You can pass an int, a level name, or a list of ints or level names. So you could do something like:
df2.sort_index(level='level_name', inplace=True)
# as a list of levels... it will all depend on your original df index
levels = ['level_name1', 'level_name2']
df2.sort_index(level=levels, inplace=True)
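For the two-CSV comparison in the question, one possible way to avoid the error without sorting is to align the two stacked Series on the union of their labels before comparing. This is only a sketch: the small in-memory frames below are hypothetical stand-ins for csv1.csv and csv2.csv, and Series.align with join='outer' is a technique swapped in here rather than something taken from the answer above.

```python
import pandas as pd

# Hypothetical stand-ins for csv1.csv and csv2.csv (not the asker's real data)
df1 = pd.DataFrame({'Country': ['Bermuda', 'Mexico'],
                    '1980': [0.00793, 3.72819],
                    '1981': [0.00687, 1.0]})
df2 = pd.DataFrame({'Country': ['Bermuda', 'Mexico'],
                    '1980': [0.00093, 3.92819],
                    '1981': [None, 1.0]})  # one value is missing here

def process_df(df):
    # Long format: one value per (Country, Column) pair
    res = df.set_index('Country').stack()
    res.index = res.index.set_names('Column', level=1)
    return res

# Give both Series the same labels; labels missing on one side become NaN
s1, s2 = process_df(df1).align(process_df(df2), join='outer')

# Keep rows that differ, ignoring rows where both sides are missing
mask = (s1 != s2) & ~(s1.isnull() & s2.isnull())
df3 = pd.concat([s1[mask], s2[mask]], axis=1, keys=['From', 'To'])
print(df3)
```

Because both Series carry identical labels after align, the != comparison no longer raises, and a value missing on one side shows up as NaN in the corresponding From or To cell.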
Related
I am trying to do something relatively simple in summing all columns in a pandas dataframe that contain a certain string. Then making that a new column in the dataframe from the sum. These columns are all numeric float values...
I can get the list of columns which contain the string I want
StmCol = [col for col in cdf.columns if 'Stm_Rate' in col]
But when I try to sum them using:
cdf['PadStm'] = cdf[StmCol].sum()
I get a new column full of "nan" values.
You need to pass axis=1 to .sum; by default (axis=0) it sums down each column, giving one total per column:
In [11]: df = pd.DataFrame([[1, 2], [3, 4]], columns=["A", "B"])
In [12]: df
Out[12]:
A B
0 1 2
1 3 4
In [13]: df[["A"]].sum() # Here I'm passing the list of columns ["A"]
Out[13]:
A 4
dtype: int64
In [14]: df[["A"]].sum(axis=1)
Out[14]:
0 1
1 3
dtype: int64
Only the latter matches the index of df:
In [15]: df["C"] = df[["A"]].sum()
In [16]: df["D"] = df[["A"]].sum(axis=1)
In [17]: df
Out[17]:
A B C D
0 1 2 NaN 1
1 3 4 NaN 3
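Applied to the question's own setup, the same fix might look like the sketch below. The column names and values in cdf are made up for illustration; only the 'Stm_Rate' substring and the PadStm column name come from the question.

```python
import pandas as pd

# Hypothetical data; only the 'Stm_Rate' naming pattern comes from the question
cdf = pd.DataFrame({'Stm_Rate_1': [1.0, 2.0],
                    'Stm_Rate_2': [0.5, 1.5],
                    'Other': [9.0, 9.0]})

stm_cols = [col for col in cdf.columns if 'Stm_Rate' in col]

# axis=1 sums across the selected columns, producing one value per row,
# so the result lines up with cdf's index
cdf['PadStm'] = cdf[stm_cols].sum(axis=1)
print(cdf)
```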
I would like to ask how to count and show the number of missing values only for the columns of a dataframe that actually have them.
I am using:
df.isna().sum(), but it shows all columns, including those with no missing values. How can I count and show only the columns with missing values, sorted by count in descending order?
Thanks so much!
In my opinion the simplest way is to remove the 0 values by boolean indexing and then call sort_values:
s = df.isna().sum()
s = s[s != 0].sort_values(ascending=False)
Or use any to keep only the columns with at least one True (one NaN):
df1 = df.isna()
s = df1.loc[:, df1.any()].sum().sort_values(ascending=False)
Sample:
df = pd.DataFrame({
'A':list('abcdef'),
'B':[np.nan,5,np.nan,5,5,np.nan],
'C':[7,8,9,np.nan,2,3],
'D':[1,3,5,7,1,0],
'E':[np.nan,3,6,9,2,np.nan],
'F':list('aaabbb')
})
s = df.isna().sum()
s = s[s != 0].sort_values(ascending=False)
print (s)
B 3
E 2
C 1
dtype: int64
You can use pipe to remove zero values from your totals:
>>> df.isnull().sum().sort_values(ascending=False).pipe(lambda s: s[s > 0])
B 3
E 2
C 1
dtype: int64
Let's say one has a DataFrame df1 with INDEX, Column1, Column2 and another, df2, with INDEX, Column1, Column3.
Both INDEX have similar values so I want to use that to merge the information of one table on the other.
I have been told to do as follows by other users:
df1.update(df2, join='left', overwrite=True)
This works if both INDEXES have similar values. The result will be df1 will now have INDEX, Column1 (from df2) and Column2 (original from df1). Column3 is not added to df1 (this behaviour is wanted vs. the "merge" command that adds everything).
Now, I would like to update df1 only on a few cases and based on Column2. I thought this would work:
df1[df1['Column2'] == 'Cond'].update(df2, join='left', overwrite=True)
But it doesn't; sometimes I get an error, other times the command runs but ALL df1 values have been modified.
Any idea on how to do this?
PS: Using .loc won't work as that requires that whatever INDEX you search for exists and this is not the case.
EDIT: Additional example
In [37]: df1 = pd.DataFrame([['USA',1],['USA',2],['USA',3],['FRA',1],['FRA',2]], columns = ['country', 'value'])
In [38]: df2 = pd.DataFrame([['USA',10],['FRA',20]], columns = ['country', 'value'])
In [39]: df1 = df1.set_index('country')
In [40]: df2 = df2.set_index('country')
In [41]: mask = df1['value'] >= 2
In [42]: idx = df1.index[mask]
In [43]: idx = idx.unique()
In [44]: df1
Out[44]:
value
country
USA 1
USA 2
USA 3
FRA 1
FRA 2
In [45]: df2
Out[45]:
value
country
USA 10
FRA 20
In [46]: idx
Out[46]: array(['USA', 'FRA'], dtype=object)
In [47]: df1.update(df2.loc[idx])
In [48]: df1
Out[48]:
value
country
USA 10
USA 10
USA 10
FRA 20
FRA 20
Define the boolean mask
mask = (df1['Column2'] == 'Cond')
If df1.index is identical to df2.index, then mask can be used to select
rows from df2 -- i.e., df2.loc[mask]. But if they are not identical, then
df2.loc[mask] may raise an error (if len(df1) != len(df2)), or worse, silently select the wrong rows
because the boolean mask is not aligning index values between df1 and df2.
So in the more general case when the indexes are not identical, the trick is to
convert the boolean mask into an Index that can be used to restrict
df2.
If df1.index is unique then call df1.update on the restricted df2:
idx = df1.index[mask]
df1.update(df2.loc[idx])
For example,
import pandas as pd
df1 = pd.DataFrame({'Column1':[1,2,3], 'Column2':['Cond',5,'Cond']}, index=['A','B','C'])
# Column1 Column2
# A 1 Cond
# B 2 5
# C 3 Cond
df2 = pd.DataFrame({'Column1':[10,20,30], 'Column3':[40,50,60]}, index=['D','B','C'])
# Column1 Column3
# D 10 40
# B 20 50
# C 30 60
mask = df1['Column2'] == 'Cond'
idx = df1.index[mask]
df1.update(df2.loc[idx])
print(df1)
prints
Column1 Column2
A 1 Cond
B 2 5
C 30 Cond
If df1.index is not unique, then make the index unique by adding mask to it:
df1['mask'] = df1['value'] >= 2
df2['mask'] = True
df1 = df1.set_index('mask', append=True)
df2 = df2.set_index('mask', append=True)
Then calling df1.update(df2) produces the desired result because update aligns indices.
For example,
import pandas as pd
df1 = pd.DataFrame([['USA',1],['USA',2],['USA',3],['FRA',1],['FRA',2]],
columns = ['country', 'value'])
df2 = pd.DataFrame([['USA',10],['FRA',20]], columns = ['country', 'value'])
df1 = df1.set_index('country')
# value
# country
# USA 1
# USA 2
# USA 3
# FRA 1
# FRA 2
df2 = df2.set_index('country')
# value
# country
# USA 10
# FRA 20
df1['mask'] = df1['value'] >= 2
df2['mask'] = True
df1 = df1.set_index('mask', append=True)
# value
# country mask
# USA False 1
# True 2
# True 3
# FRA False 1
# True 2
df2 = df2.set_index('mask', append=True)
# value
# country mask
# USA True 10
# FRA True 20
df1.update(df2)
df1.index = df1.index.droplevel('mask')
print(df1)
yields
value
country
USA 1
USA 10
USA 10
FRA 1
FRA 20
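As an alternative to adding the mask to the index, the same result can probably be reached by looking up each row's replacement value and overwriting only where the condition holds. This is a sketch under assumptions not stated in the answer: df2's index is unique, only the single column 'value' is updated, and rows with no match in df2 should keep their old value (which mirrors how update skips NaN).

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame({'country': ['USA', 'USA', 'USA', 'FRA', 'FRA'],
                    'value': [1, 2, 3, 1, 2]}).set_index('country')
df2 = pd.DataFrame({'country': ['USA', 'FRA'],
                    'value': [10, 20]}).set_index('country')

# Row-wise condition on df1, as a plain array so no index alignment is involved
cond = (df1['value'] >= 2).to_numpy()

# Look up the df2 value for every row label of df1 (NaN where there is no match);
# this relies on df2.index being unique
looked_up = df1.index.map(df2['value']).to_numpy()

# Overwrite only rows that satisfy the condition and have a match in df2
use_new = cond & ~pd.isna(looked_up)
df1['value'] = np.where(use_new, looked_up, df1['value'])
print(df1)
```

This keeps df1's index untouched, at the cost of handling one column at a time.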
I've been searching for quite a while without getting anywhere close to what I want to do...
I have a pandas dataframe in which I want to compare the value of column A to B and write a 1 in a new column if A and B are equal, or a 0 if they are not.
I could write an ugly for loop, but I know this is not very Pythonic.
I'm pretty sure there is a way to do this with apply() but I'm not getting anywhere.
I'd like to be able to compare columns that contain integers as well as columns containing strings.
Thanks in advance for your help.
If df is a Pandas DataFrame, then
df['newcol'] = (df['A'] == df['B']).astype('int')
For example,
In [20]: df = pd.DataFrame({'A': [1,2,'foo'], 'B': [1,99,'foo']})
In [21]: df
Out[21]:
A B
0 1 1
1 2 99
2 foo foo
In [22]: df['newcol'] = (df['A'] == df['B']).astype('int')
In [23]: df
Out[23]:
A B newcol
0 1 1 1
1 2 99 0
2 foo foo 1
df['A'] == df['B'] returns a boolean Series:
In [24]: df['A'] == df['B']
Out[24]:
0 True
1 False
2 True
dtype: bool
astype('int') converts the True/False values to integers -- 0 for False and 1 for True.
How to check whether a pandas DataFrame is empty? In my case I want to print some message in terminal if the DataFrame is empty.
You can use the attribute df.empty to check whether it's empty or not:
if df.empty:
    print('DataFrame is empty!')
Source: Pandas Documentation
I use the len function. It's much faster than empty. len(df.index) is even faster.
import pandas as pd
import numpy as np
df = pd.DataFrame(np.random.randn(10000, 4), columns=list('ABCD'))
def empty(df):
    return df.empty
def lenz(df):
    return len(df) == 0
def lenzi(df):
    return len(df.index) == 0
'''
%timeit empty(df)
%timeit lenz(df)
%timeit lenzi(df)
10000 loops, best of 3: 13.9 µs per loop
100000 loops, best of 3: 2.34 µs per loop
1000000 loops, best of 3: 695 ns per loop
len on index seems to be faster
'''
To see if a dataframe is empty, I argue that one should test for the length of a dataframe's columns index:
if len(df.columns) == 0:
Reason:
According to the Pandas Reference API, there is a distinction between:
an empty dataframe with 0 rows and 0 columns
an empty dataframe with rows containing NaN hence at least 1 column
Arguably, they are not the same. The other answers are imprecise in that df.empty, len(df), and len(df.index) make no distinction between the two: the index length is 0 and empty is True in both cases.
Examples
Example 1: An empty dataframe with 0 rows and 0 columns
In [1]: import pandas as pd
df1 = pd.DataFrame()
df1
Out[1]: Empty DataFrame
Columns: []
Index: []
In [2]: len(df1.index) # or len(df1)
Out[2]: 0
In [3]: df1.empty
Out[3]: True
Example 2: A dataframe which is emptied to 0 rows but still retains n columns
In [4]: df2 = pd.DataFrame({'AA' : [1, 2, 3], 'BB' : [11, 22, 33]})
df2
Out[4]: AA BB
0 1 11
1 2 22
2 3 33
In [5]: df2 = df2[df2['AA'] == 5]
df2
Out[5]: Empty DataFrame
Columns: [AA, BB]
Index: []
In [6]: len(df2.index) # or len(df2)
Out[6]: 0
In [7]: df2.empty
Out[7]: True
Now, building on the previous examples, in which the index length is 0 and empty is True: reading the length of the columns index for the first dataframe, df1, returns 0 columns, which shows that it is indeed empty.
In [8]: len(df1.columns)
Out[8]: 0
In [9]: len(df2.columns)
Out[9]: 2
Critically, while the second dataframe df2 contains no data, it is not completely empty, because len(df2.columns) still reports the columns that persist.
Why it matters
Let's add a new column to these dataframes to understand the implications:
# As expected, the empty column displays 1 series
In [10]: df1['CC'] = [111, 222, 333]
df1
Out[10]: CC
0 111
1 222
2 333
In [11]: len(df1.columns)
Out[11]: 1
# Note the persisting series with rows containing `NaN` values in df2
In [12]: df2['CC'] = [111, 222, 333]
df2
Out[12]: AA BB CC
0 NaN NaN 111
1 NaN NaN 222
2 NaN NaN 333
In [13]: len(df2.columns)
Out[13]: 3
It is evident that the original columns in df2 have resurfaced. Therefore, it is prudent to instead check the length of the columns index, len(df.columns) (the columns attribute of pandas.core.frame.DataFrame), to see whether a dataframe is empty.
Practical solution
# New dataframe df
In [1]: df = pd.DataFrame({'AA' : [1, 2, 3], 'BB' : [11, 22, 33]})
df
Out[1]: AA BB
0 1 11
1 2 22
2 3 33
# This data manipulation approach results in an empty df
# because of a subset of values that are not available (`NaN`)
In [2]: df = df[df['AA'] == 5]
df
Out[2]: Empty DataFrame
Columns: [AA, BB]
Index: []
# NOTE: the df is empty, BUT the columns are persistent
In [3]: len(df.columns)
Out[3]: 2
# And accordingly, the other answers on this page
In [4]: len(df.index) # or len(df)
Out[4]: 0
In [5]: df.empty
Out[5]: True
# SOLUTION: conditionally check for empty columns
In [6]: if len(df.columns) != 0: # <--- here
            # Do something, e.g.
            # drop any columns containing rows with `NaN`
            # to make the df really empty
            df = df.dropna(how='all', axis=1)
        df
Out[6]: Empty DataFrame
Columns: []
Index: []
# Testing shows it is indeed empty now
In [7]: len(df.columns)
Out[7]: 0
Adding a new data series works as expected without the re-surfacing of empty columns (factually, without any series that were containing rows with only NaN):
In [8]: df['CC'] = [111, 222, 333]
df
Out[8]: CC
0 111
1 222
2 333
In [9]: len(df.columns)
Out[9]: 1
I prefer going the long route. These are the checks I follow to avoid using a try-except clause:
check that the variable is not None,
then check that it's a DataFrame, and
make sure it's not empty.
Here, DATA is the suspect variable -
DATA is not None and isinstance(DATA, pd.DataFrame) and not DATA.empty
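Wrapped in a small helper, that chain of checks could look like the sketch below. The function name is_usable_frame is hypothetical and not part of pandas.

```python
import pandas as pd

def is_usable_frame(obj):
    # True only if obj is a DataFrame with at least one row and one column.
    # The checks short-circuit, so .empty is never read on a non-DataFrame.
    return obj is not None and isinstance(obj, pd.DataFrame) and not obj.empty

print(is_usable_frame(None))                        # False
print(is_usable_frame([1, 2, 3]))                   # False
print(is_usable_frame(pd.DataFrame()))              # False
print(is_usable_frame(pd.DataFrame({'A': [1]})))    # True
```

Since isinstance(None, pd.DataFrame) is already False, the explicit None check is not strictly needed; it is kept here to follow the order of the checks listed above.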
If a DataFrame may contain both NaN and non-null values and you want to know whether it holds any actual data, try the code below.
When can this situation happen?
It happens when a single function is used to plot more than one DataFrame, each passed in as a parameter. In that case the function tries to plot the data even
when a DataFrame has no data, and so it draws an empty figure.
It makes more sense to simply display a 'DataFrame has no data' message.
Why?
If a DataFrame contains no data at all (mind you, a DataFrame holding only NaN values
is still considered non-empty by pandas), it is preferable not to plot it but to print a message instead.
Suppose we have two DataFrames, df1 and df2.
The function myfunc takes any DataFrame (df1 or df2 in this case) and prints a message
if the DataFrame has no data, instead of plotting it:
df1                df2
col1  col2         col1  col2
NaN   2            NaN   NaN
2     NaN          NaN   NaN
and the function:
def myfunc(df):
    # count the total number of non-NaN values; the sum is 0 if the DataFrame holds no data
    if df.count().sum() > 0:
        print('not empty')
        df.plot(kind='barh')
    else:
        # display a message instead of plotting if it is empty
        print('empty')
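To see the check on the two example frames without needing a plotting backend, the test can be pulled out on its own, as in this sketch. The helper name has_data is hypothetical; df1 and df2 are rebuilt from the table above.

```python
import numpy as np
import pandas as pd

def has_data(df):
    # count() gives the number of non-NaN values per column;
    # a total of 0 means the frame holds nothing but NaN (or no rows at all)
    return bool(df.count().sum() > 0)

df1 = pd.DataFrame({'col1': [np.nan, 2], 'col2': [2, np.nan]})
df2 = pd.DataFrame({'col1': [np.nan, np.nan], 'col2': [np.nan, np.nan]})

print(has_data(df1))   # True: two non-NaN values
print(has_data(df2))   # False: only NaN values
print(df2.empty)       # False: df.empty does not look at NaN
```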