AssertionError when comparing pd DataFrame

I'm writing a test for a function that I created. The function returns a pandas DataFrame, and the test compares it with a stored CSV file using the script below. When I run it, I get an AssertionError with no other message.
rates_over = get_rates_over(args)
gabarito = pd.read_csv(f'{ROOT_DIR}/data/static/rates_over_teste.csv', parse_dates=['date'])
assert rates_over.equals(gabarito)
But I suspected that my function was correct, so I ran the following check. It printed nothing, which suggests my intuition was right. What is happening?
for index, row in gabarito.iterrows():
    if not row.equals(rates_over.iloc[index]):
        print('Not equal!')
EDIT: As suggested by @gallen, here is the printed type and head of both gabarito and rates_over.

A DataFrame is never equal to a Series.
pd.DataFrame.equals
This function allows two Series or DataFrames to be compared against each other to see if they have the same shape and elements.
It is meant to compare a DataFrame with a DataFrame, or a Series with a Series, not a mixture of a Series with a DataFrame.
A Series and a DataFrame have entirely different dimensionality.
import pandas as pd
df = pd.DataFrame({'foo': [1,2,3]})
s = df['foo']
print(df.shape)
#(3, 1)
print(s.shape)
#(3,)
The first check in the equals method is to check the dimensionality, so it quickly returns False without ever checking the data.
def equals(self, other):
    self_axes, other_axes = self.axes, other.axes
    if len(self_axes) != len(other_axes):
        return False
    #...
len(s._data.axes)
#1
len(df._data.axes)
#2
If you are certain your DataFrame only has a single column, then you can squeeze it before comparing with your Series.
df.squeeze().equals(s)
#True
Alternatively convert your Series to a DataFrame using the Series name.
df.equals(s.to_frame(s.name))
#True
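For the test in the question, where a bare AssertionError gives no detail, pandas also ships `pandas.testing.assert_frame_equal`, which raises an AssertionError that describes the first mismatch it finds. The sketch below uses made-up stand-in frames (`expected`, `actual`) and assumes, purely for illustration, that the two frames differ only in dtype; the question does not confirm that this is the actual cause.

```python
import pandas as pd
from pandas.testing import assert_frame_equal

# Hypothetical stand-ins for the stored reference and the function result.
expected = pd.DataFrame({"a": [1, 2], "b": [3, 4]})
actual = pd.DataFrame({"a": [1, 2], "b": [3.0, 4.0]})  # same values, float column

# equals() only answers True/False and requires matching column dtypes.
strict_equal = actual.equals(expected)

# assert_frame_equal raises an AssertionError that describes the mismatch.
try:
    assert_frame_equal(actual, expected)
    message = ""
except AssertionError as exc:
    message = str(exc)
print(message)

# Relaxing the dtype check compares the values only; this call does not raise.
assert_frame_equal(actual, expected, check_dtype=False)
```

In a test, calling `assert_frame_equal(rates_over, gabarito)` in place of `assert rates_over.equals(gabarito)` would at least report which column, dtype, or index differs.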

Related

PYTHON check if a value in a column Dataset is within a range of values reported in another dataset

I have read through similar posts but can't find an exact solution.
I have a dataset with a column named "A" and want to check whether each value in this column falls within any of the intervals in another dataset, which has two columns, "Start" and "End". The result (True or False) should go in column "B". Please see the attached image (the data is always in ascending order). Thank you.

This is not the most efficient solution but it should do what you are asking:
import pandas as pd
import numpy as np
df1 = pd.DataFrame({"A":list(range(20))})
df2 = pd.DataFrame({"START":[1,3,5,7],
                    "END":[2,4,6,8]})
def compare_with_df(x,df):
    for row in range(df.shape[0]):
        if x >= df.loc[row,'START'] and x <= df.loc[row,'END']:
            return True
    return False
df1['B'] = df1['A'].apply(lambda x:compare_with_df(x,df2))
As you can see, the compare_with_df() function loops through df2 and compares a given x to every range (this can, and probably should, be optimized for larger datasets). The apply() method is equivalent to looping through the values of the given column (Series).
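One possible way to avoid the Python-level loop is NumPy broadcasting, which compares every value of "A" against every interval at once. This is a sketch under the same sample data as above, not a benchmarked solution; it builds a rows-by-intervals boolean matrix, so memory use grows with the product of the two lengths.

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame({"A": list(range(20))})
df2 = pd.DataFrame({"START": [1, 3, 5, 7], "END": [2, 4, 6, 8]})

# Column vector of shape (20, 1) against vectors of shape (4,)
# broadcasts to a (20, 4) matrix: one row per value, one column per interval.
a = df1["A"].to_numpy()[:, None]
inside = (a >= df2["START"].to_numpy()) & (a <= df2["END"].to_numpy())

# A value is covered if it falls inside at least one interval.
df1["B"] = inside.any(axis=1)
print(df1)
```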

ValueError: Cannot set a frame with no defined index and a value that cannot be converted to a Series

I'm using pandas 0.20.3 with Python 3.x. I want to add a column to a pandas DataFrame from another pandas DataFrame. Both DataFrames contain 51 rows, so I used the following code:
class_df['phone']=group['phone'].values
I got the following error message:
ValueError: Cannot set a frame with no defined index and a value that cannot be converted to a Series
class_df.dtypes gives me:
Group_ID object
YEAR object
Terget object
phone object
age object
and type(group['phone']) returns pandas.core.series.Series
Can you suggest what changes I need to make to remove this error?
The first 5 rows of group['phone'] are given below:
0 [735015372, 72151508105, 7217511580, 721150431...
1 []
2 [735152771, 7351515043, 7115380870, 7115427...
3 [7111332015, 73140214, 737443075, 7110815115...
4 [718218718, 718221342, 73551401, 71811507...
Name: phoen, dtype: object

In most cases, this error occurs when the DataFrame is empty. The approach that worked for me was to check whether the DataFrame is empty before using apply():
if len(df) != 0:
    df['indicator'] = df.apply(assign_indicator, axis=1)

You have a column of ragged lists. Your only option is to assign a list of lists, and not an array of lists (which is what .values gives).
class_df['phone'] = group['phone'].tolist()
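A minimal sketch of that assignment, using small made-up frames in place of the asker's 51-row data (the phone numbers and the "Group_ID" values are placeholders). It only illustrates that a list of ragged lists ends up as one list per row; it does not try to reproduce the original error.

```python
import pandas as pd

# Hypothetical stand-ins for the frames in the question.
group = pd.DataFrame({"phone": [[735015372, 7217511580], [], [735152771]]})
class_df = pd.DataFrame({"Group_ID": ["g1", "g2", "g3"]})

# tolist() yields a plain Python list whose elements are the per-row lists,
# so each row of the new column holds one (possibly empty) list.
class_df["phone"] = group["phone"].tolist()
print(class_df)
```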

The error in the question's title,
"ValueError: Cannot set a frame with no defined index and a value that cannot be converted to a Series"
can also occur if, for whatever reason, the table does not have any rows.

Instead of using an if-statement, you can set the result_type argument of the apply() function to "reduce".
df['new_column'] = df.apply(func, axis=1, result_type='reduce')

The data assigned to a column in the DataFrame must be a one-dimensional array. For example, consider a num_arr to be added to a DataFrame:
num_arr.shape
(1, 126)
For this num_arr to be added as a DataFrame column, it should be reshaped:
num_arr = num_arr.reshape(-1, )
num_arr.shape
(126,)
Now I can set this array as a DataFrame column:
df = pd.DataFrame()
df['numbers'] = num_arr
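The steps above can be run end to end as follows. The contents of `num_arr` are an assumption (the answer only gives its shape), and `ravel()` is used here as an equivalent way to flatten a (1, 126) array to (126,).

```python
import numpy as np
import pandas as pd

# Hypothetical data with the shape mentioned in the answer.
num_arr = np.arange(126).reshape(1, 126)

# Flatten the 2-D array to one dimension before assigning it as a column.
flat = num_arr.ravel()

df = pd.DataFrame()
df["numbers"] = flat
print(flat.shape, df.shape)
```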

Why is 'series' type not a special case of 'dataframe' type in pandas? [duplicate]

In pandas, when I select a label that has only one entry in the index I get back a Series, but when I select a label that has more than one entry I get back a DataFrame.
Why is that? Is there a way to ensure I always get back a data frame?
In [1]: import pandas as pd
In [2]: df = pd.DataFrame(data=range(5), index=[1, 2, 3, 3, 3])
In [3]: type(df.loc[3])
Out[3]: pandas.core.frame.DataFrame
In [4]: type(df.loc[1])
Out[4]: pandas.core.series.Series

Granted that the behavior is inconsistent, but I think it's easy to imagine cases where this is convenient. Anyway, to get a DataFrame every time, just pass a list to loc. There are other ways, but in my opinion this is the cleanest.
In [2]: type(df.loc[[3]])
Out[2]: pandas.core.frame.DataFrame
In [3]: type(df.loc[[1]])
Out[3]: pandas.core.frame.DataFrame

The TLDR
When using loc
df.loc[:] returns a DataFrame
df.loc[label] returns a Series if the label occurs once in the index and a DataFrame if it occurs more than once
df.loc[:, ["col_name"]] returns a DataFrame
df.loc[:, "col_name"] returns a Series
Not using loc
df["col_name"] returns a Series
df[["col_name"]] returns a DataFrame

Your index contains the label 3 three times. For this reason df.loc[3] returns a DataFrame.
The reason is that you don't specify the column. So df.loc[3] selects three items from all columns (here only column 0), while df.loc[3,0] returns a Series. Similarly, df.loc[1:2] returns a DataFrame, because you slice the rows.
Selecting a single row (as in df.loc[1]) returns a Series with the column names as the index.
If you want to be sure you always get a DataFrame, you can slice, as in df.loc[1:1]. Other options are boolean indexing (df.loc[df.index==1]) or the take method (df.take([0]), but note that this uses positions, not labels!).
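The options mentioned in these answers can be checked side by side. This is a small sketch on the frame from the question; the variable names are only illustrative.

```python
import pandas as pd

df = pd.DataFrame(data=range(5), index=[1, 2, 3, 3, 3])

single = df.loc[1]                # unique label: a Series
repeated = df.loc[3]              # repeated label: a DataFrame

# Each of these keeps the result two-dimensional, even for a unique label.
by_list = df.loc[[1]]             # list of labels
by_slice = df.loc[1:1]            # label slice (needs a sorted index here)
by_mask = df.loc[df.index == 1]   # boolean indexing
by_position = df.take([0])        # positional, not label-based

for name, obj in [("single", single), ("repeated", repeated),
                  ("by_list", by_list), ("by_slice", by_slice),
                  ("by_mask", by_mask), ("by_position", by_position)]:
    print(name, type(obj).__name__)
```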

Use df['columnName'] to get a Series and df[['columnName']] to get a Dataframe.

You wrote in a comment to joris' answer:
"I don't understand the design
decision for single rows to get converted into a series - why not a
data frame with one row?"
A single row isn't converted into a Series.
It IS a Series. (Retracted: no, in fact I don't think so; see the edit below.)
The best way to think about the pandas data structures is as flexible
containers for lower dimensional data. For example, DataFrame is a
container for Series, and Panel is a container for DataFrame objects.
We would like to be able to insert and remove objects from these
containers in a dictionary-like fashion.
http://pandas.pydata.org/pandas-docs/stable/overview.html#why-more-than-1-data-structure
The data model of pandas objects was chosen this way. The reason is presumably that it brings some advantages I'm not aware of (I don't fully understand the last sentence of the quotation; maybe that's the reason).
Edit: I no longer agree with what I wrote above.
A DataFrame can't be composed of elements that are Series, because the following code gives the same type, Series, for a row as well as for a column:
import pandas as pd
df = pd.DataFrame(data=[11,12,13], index=[2, 3, 3])
print('-------- df -------------')
print(df)
print('\n------- df.loc[2] --------')
print(df.loc[2])
print('type(df.loc[1]) :', type(df.loc[2]))
print('\n--------- df[0] ----------')
print(df[0])
print('type(df[0]) :', type(df[0]))
result
-------- df -------------
0
2 11
3 12
3 13
------- df.loc[2] --------
0 11
Name: 2, dtype: int64
type(df.loc[1]) : <class 'pandas.core.series.Series'>
--------- df[0] ----------
2 11
3 12
3 13
Name: 0, dtype: int64
type(df[0]) : <class 'pandas.core.series.Series'>
So it makes no sense to claim that a DataFrame is composed of Series, because what would those Series be: columns or rows? A pointless question and a flawed view.
Then what is a DataFrame?
In the previous version of this answer, I asked this question while trying to answer the "Why is that?" part of the OP's question, and the similar query in one of his comments, "single rows to get converted into a series - why not a data frame with one row?".
The "Is there a way to ensure I always get back a data frame?" part has been answered by Dan Allan.
Since the pandas docs cited above say that the pandas data structures are best seen as containers of lower dimensional data, it seemed to me that the why would be found in the characteristics of the DataFrame structure.
However, I realized that this advice must not be taken as a precise description of the nature of pandas data structures.
It doesn't mean that a DataFrame is a container of Series.
It says that picturing a DataFrame as a container of Series (either rows or columns, depending on which is convenient at a given point in the reasoning) is a good way to think about DataFrames, even if it isn't strictly the case in reality. "Good" here means that this view lets you use DataFrames efficiently. That's all.
Then what is a DataFrame object?
The DataFrame class produces instances that have a particular structure originated in the NDFrame base class, itself derived from the PandasContainer base class that is also a parent class of the Series class.
Note that this is correct for Pandas until version 0.12. In the upcoming version 0.13, Series will derive also from NDFrame class only.
# with pandas 0.12
from pandas import Series
print 'Series :\n',Series
print 'Series.__bases__ :\n',Series.__bases__
from pandas import DataFrame
print '\nDataFrame :\n',DataFrame
print 'DataFrame.__bases__ :\n',DataFrame.__bases__
print '\n-------------------'
from pandas.core.generic import NDFrame
print '\nNDFrame.__bases__ :\n',NDFrame.__bases__
from pandas.core.generic import PandasContainer
print '\nPandasContainer.__bases__ :\n',PandasContainer.__bases__
from pandas.core.base import PandasObject
print '\nPandasObject.__bases__ :\n',PandasObject.__bases__
from pandas.core.base import StringMixin
print '\nStringMixin.__bases__ :\n',StringMixin.__bases__
result
Series :
<class 'pandas.core.series.Series'>
Series.__bases__ :
(<class 'pandas.core.generic.PandasContainer'>, <type 'numpy.ndarray'>)
DataFrame :
<class 'pandas.core.frame.DataFrame'>
DataFrame.__bases__ :
(<class 'pandas.core.generic.NDFrame'>,)
-------------------
NDFrame.__bases__ :
(<class 'pandas.core.generic.PandasContainer'>,)
PandasContainer.__bases__ :
(<class 'pandas.core.base.PandasObject'>,)
PandasObject.__bases__ :
(<class 'pandas.core.base.StringMixin'>,)
StringMixin.__bases__ :
(<type 'object'>,)
So my understanding is now that a DataFrame instance has certain methods that were crafted to control how data are extracted from rows and columns.
How these extraction methods work is described on this page:
http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing
It covers the method given by Dan Allan as well as others.
Why were these extraction methods designed the way they were?
Presumably because they were judged to offer the best possibilities and the most ease in data analysis.
That is precisely what this sentence expresses:
The best way to think about the pandas data structures is as flexible
containers for lower dimensional data.
The reason data extraction from a DataFrame instance behaves as it does doesn't lie in its structure; it lies in the why of this structure. I guess that the structure and functioning of the pandas data structures were chiseled to be as intellectually intuitive as possible, and that to understand the details one must read Wes McKinney's blog.

If the objective is to get a subset of the data set using the index, one option that always returns a DataFrame is to avoid loc and iloc and use boolean indexing instead, with syntax similar to this:
df = pd.DataFrame(data=range(5), index=[1, 2, 3, 3, 3])
result = df[df.index == 3]
isinstance(result, pd.DataFrame) # True
result = df[df.index == 1]
isinstance(result, pd.DataFrame) # True

Every time we use [['column name']], it returns a pandas DataFrame object;
if we use ['column name'], we get a pandas Series object.

If you also select on the index of the DataFrame, the result can be a DataFrame or a Series (when the column is passed as a list), or a Series or a scalar (when the column is passed as a plain key).
This function ensures that you always get a list from your selection (provided df, index and column are valid):
def get_list_from_df_column(df, index, column):
    df_or_series = df.loc[index,[column]]
    # df.loc[index,column] is also possible and returns a series or a scalar
    if isinstance(df_or_series, pd.Series):
        resulting_list = df_or_series.tolist() #get list from series
    else:
        resulting_list = df_or_series[column].tolist()
        # use the column key to get a series from the dataframe
    return(resulting_list)
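A short usage sketch of that helper, restated here so the block runs on its own. The sample frame and its "value" column are made up for illustration.

```python
import pandas as pd

def get_list_from_df_column(df, index, column):
    # Selecting with a list of columns yields a Series for a unique
    # index label and a DataFrame for a repeated one.
    df_or_series = df.loc[index, [column]]
    if isinstance(df_or_series, pd.Series):
        return df_or_series.tolist()
    return df_or_series[column].tolist()

# Hypothetical frame: label 1 is unique, label 3 occurs three times.
df = pd.DataFrame({"value": [10, 20, 30, 40, 50]}, index=[1, 2, 3, 3, 3])

unique_label = get_list_from_df_column(df, 1, "value")
repeated_label = get_list_from_df_column(df, 3, "value")
print(unique_label, repeated_label)
```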

Return multiple objects from an apply function in Pandas

I'm practicing with using apply with Pandas dataframes.
So I have cooked up a simple dataframe with dates, and values:
dates = pd.date_range('2013',periods=10)
values = list(np.arange(1,11,1))
DF = DataFrame({'date':dates, 'value':values})
I have a second dataframe, which is made up of 3 rows of the original dataframe:
DFa = DF.iloc[[1,2,4]]
So I'd like to use the second DataFrame, DFa, take the date from each row (using apply), and then find and sum up the values in the original DataFrame whose dates came earlier:
def foo(DFa, DF=DF):
    cutoff_date = DFa['date']
    ans=DF[DF['date'] < cutoff_date]
DFa.apply(foo, axis=1)
Things work fine. My question is: since I've created three ans objects, how do I access these values?
Obviously I'm new to apply and I'm eager to get away from loops. I just don't understand how to return values from apply.

Your function needs to return a value. E.g.,
def foo(df1, df2):
    cutoff_date = df1.date
    ans = df2[df2.date < cutoff_date].value.sum()
    return ans
DFa.apply(lambda x: foo(x, DF), axis=1)
Also, note that apply returns a DataFrame. So your current function would return a DataFrame for each row in DFa, so you would end up with a DataFrame of DataFrames

There's a bit of a mixup the way you're using apply. With axis=1, foo will be applied to each row (see the docs), and yet your code implies (by the parameter name) that its first parameter is a DataFrame.
Additionally, you state that you want to sum up the original DataFrame's values for those less than the date. So foo needs to do this, and return the values.
So the code needs to look something like this:
def foo(row, DF=DF):
    cutoff_date = row['date']
    return DF[DF['date'] < cutoff_date].value.sum()
Once you make the changes, as foo returns a scalar, then apply will return a series:
>> DFa.apply(foo, axis=1)
1 1
2 3
4 10
dtype: int64
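Putting the question's setup and this answer together gives a self-contained sketch. The function name `sum_before` and its `full` parameter are illustrative choices, not names from the thread.

```python
import numpy as np
import pandas as pd

dates = pd.date_range("2013", periods=10)
DF = pd.DataFrame({"date": dates, "value": np.arange(1, 11)})
DFa = DF.iloc[[1, 2, 4]]

def sum_before(row, full=DF):
    # With axis=1, apply passes one row at a time as a Series.
    # Returning a scalar per row makes apply collect the results in a Series.
    return full.loc[full["date"] < row["date"], "value"].sum()

totals = DFa.apply(sum_before, axis=1)
print(totals)
```

The result keeps the index of DFa, so each total can be looked up by the original row label, for example `totals[4]`.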

calling apply() on an empty pandas DataFrame

I'm having a problem with the apply() method of the pandas DataFrame. My issue is that apply() can return either a Series or a DataFrame, depending on the return type of the input function; however, when the frame is empty, apply() (almost) always returns a DataFrame. So I can't write code that expects a Series. Here's an example:
import pandas as pd
def area_from_row(row):
    return row['width'] * row['height']
def add_area_column(frame):
    # I know I can multiply the columns directly, but my actual function is
    # more complicated.
    frame['area'] = frame.apply(area_from_row, axis=1)
# This works as expected.
non_empty_frame = pd.DataFrame(data=[[2, 3]], columns=['width', 'height'])
add_area_column(non_empty_frame)
# This fails!
empty_frame = pd.DataFrame(data=None, columns=['width', 'height'])
add_area_column(empty_frame)
Is there a standard way of dealing with this? I can do the following, but it's silly:
def area_from_row(row):
    # The way we respond to an empty row tells pandas whether we're a
    # reduction or not.
    if not len(row):
        return None
    return row['width'] * row['height']
(I'm using pandas 0.11.0, but I checked this on 0.12.0-1100-g0c30665 as well.)

You can set the result_type parameter in apply to 'reduce'.
From the documentation,
By default (result_type=None), the final return type is inferred from the return type of the applied function. Otherwise, it depends on the result_type argument.
And then,
‘reduce’ : returns a Series if possible rather than expanding list-like results. This is the opposite of ‘expand’.
In your code, update here:
def add_area_column(frame):
    # I know I can multiply the columns directly, but my actual function is
    # more complicated.
    frame['area'] = frame.apply(area_from_row, axis=1, result_type='reduce')
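A minimal check of this suggestion on the empty frame from the question. It assumes a pandas version in which apply() accepts result_type (this parameter does not exist in the 0.11/0.12 releases the asker mentions).

```python
import pandas as pd

def area_from_row(row):
    return row["width"] * row["height"]

empty_frame = pd.DataFrame(columns=["width", "height"])

# With result_type="reduce", apply returns an (empty) Series
# rather than a copy of the empty DataFrame.
areas = empty_frame.apply(area_from_row, axis=1, result_type="reduce")

# An empty Series can be assigned as a new column without special-casing.
empty_frame["area"] = areas
print(type(areas).__name__, list(empty_frame.columns))
```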
