I have a function that returns dictionaries of pandas dataframes and I wish to design a unit test for it.
I know how to unit-test equality over pandas dataframes:
import pandas as pd
from pandas.testing import assert_frame_equal  # pandas.util.testing is deprecated
import unittest

df1 = pd.DataFrame({'a': [1, 2, 3], 'b': [4, 5, 6], 'c': [7, 8, 9]})
df2 = pd.DataFrame(df1)

class DictEq(unittest.TestCase):
    def test_dict_eq(self):
        assert_frame_equal(df1, df2)

unittest.main()
However, I do not seem to grasp how to design a test that compares the following:
dict1 = {'a': df1}
dict2 = {'a': df2}
I have tried the following, all of which fail:
from nose.tools import assert_equal, assert_dict_equal

class DictEq(unittest.TestCase):
    def test_dict_eq1(self):
        assert_equal(dict1, dict2)

    def test_dict_eq2(self):
        assert_dict_equal(dict1, dict2)

    def test_dict_eq3(self):
        self.assertTrue(dict1 == dict2)
The assert_dict_equal function of pandas.util.testing fails as well.
Try this:
import pandas as pd

df1 = pd.DataFrame({'a': [1, 2, 3], 'b': [4, 5, 6], 'c': [7, 8, 9]})
df2 = pd.DataFrame(df1)

class DfWrap:
    def __init__(self, df):
        self.df = df

    def __eq__(self, other):
        # DataFrame.equals returns a single bool, unlike `==`,
        # which returns an element-wise DataFrame of comparisons
        return self.df.equals(other.df)

dic1 = {'a': DfWrap(df1)}
dic2 = {'a': DfWrap(df2)}

print(dic1 == dic2)
This outputs True. It should work with assert_dict_equal as well, as long as you wrap your dataframe objects in DfWrap.
Here's why it works:
To compare dictionaries, Python goes through each key and calls __eq__ (i.e. ==) on the corresponding values. The problem is that when you call __eq__ (or ==) on a dataframe, it doesn't return a bool. Instead it returns another dataframe of element-wise comparisons:
import pandas as pd

df1 = pd.DataFrame({'a': [1, 2, 3], 'b': [4, 5, 6], 'c': [7, 8, 9]})
df2 = pd.DataFrame(df1)

df_compare = df1 == df2
print(type(df_compare))
this outputs:
<class 'pandas.core.frame.DataFrame'>
So, instead, the wrapper makes it so that doing df1 == df2 outputs a bool instead of a dataframe:
DfWrap(df1) == DfWrap(df2)
evaluates to True.
HTH.
I am not sure, but you may do something like this:
import unittest
from pandas.testing import assert_frame_equal

class DictEq(unittest.TestCase):
    def test_dict_eq1(self):
        dict1 = {'a': df1}
        dict2 = {'a': df2}
        self.assertEqual(dict1.keys(), dict2.keys())
        for key in dict1:
            assert_frame_equal(dict1[key], dict2[key])
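For completeness, the same idea can be packaged as a standalone helper outside of unittest. This is just a sketch: the helper name `assert_dict_of_frames_equal` is illustrative, and it uses the modern `pandas.testing` module rather than the deprecated `pandas.util.testing`.

```python
import pandas as pd
from pandas.testing import assert_frame_equal

def assert_dict_of_frames_equal(d1, d2):
    # illustrative helper: compare the key sets first, then each frame pairwise
    assert d1.keys() == d2.keys(), "dictionaries have different keys"
    for key in d1:
        assert_frame_equal(d1[key], d2[key])

df1 = pd.DataFrame({'a': [1, 2, 3], 'b': [4, 5, 6]})
assert_dict_of_frames_equal({'x': df1}, {'x': df1.copy()})  # passes silently
```

On a mismatch, `assert_frame_equal` raises an AssertionError that pinpoints the differing cells, which is much more useful in a test failure than a plain `==` result.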
Related
I have a pandas DataFrame with several columns containing dicts. I am trying to identify columns that contain at least 1 dict.
import pandas as pd
import numpy as np

df = pd.DataFrame({
    'i': [0, 1, 2, 3],
    'd': [np.nan, {'p': 1}, {'q': 2}, np.nan],
    't': [np.nan, {'u': 1}, {'v': 2}, np.nan]
})

# Iterate over cols to find dicts (only checks the first row, which is NaN here)
cdict = [i for i in df.columns if isinstance(df[i][0], dict)]
cdict
[]
How do I find cols with dicts? Is there a solution to find cols with dicts without iterating over every cell / value of columns?
You can do:
s = df.applymap(lambda x:isinstance(x, dict)).any()
dict_cols = s[s].index.tolist()
print(dict_cols)
['d', 't']
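Note that `applymap` was deprecated in pandas 2.1 in favour of `DataFrame.map`. A version-tolerant sketch (the `getattr` fallback is just a convenience for older pandas installations):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'i': [0, 1, 2, 3],
    'd': [np.nan, {'p': 1}, {'q': 2}, np.nan],
})

# DataFrame.map exists from pandas 2.1; fall back to applymap on older versions
elementwise = getattr(df, "map", df.applymap)
s = elementwise(lambda x: isinstance(x, dict)).any()
dict_cols = s[s].index.tolist()
print(dict_cols)  # ['d']
```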
We can apply over the columns; this still iterates, but it makes use of apply:
df.apply(lambda x: [any(isinstance(y, dict) for y in x)], axis=0)
EDIT: I think using applymap is more direct. However, we can use our boolean result to get the column names:
any_dct = df.apply(lambda x: [any(isinstance(y, dict) for y in x)],
                   axis=0, result_type="expand")
df.iloc[:, any_dct.iloc[0, :].tolist()].columns.values
We have a DataFrame with many columns and need to cycle through the rows with df.itertuples(). Many column names are stored in variables; accessing the namedtuple row with getattr() works fine, but it is not very readable with many column accesses. Is there a way to enable the row[col_name] syntax, e.g. with a subclassed NamedTuple like here: https://stackoverflow.com/a/65301971/360265?
import pandas as pd

col_name = 'b'
df = pd.DataFrame([{'a': 1, 'b': 2.}, {'a': 3, 'b': 4.}])

for row in df.itertuples():
    print(row.a)                   # Using row._asdict() would disable this syntax
    print(getattr(row, col_name))  # works fine but is not as readable as row[col_name]
    print(row[col_name])           # how to enable this syntax?
Wrapping row in the following Frame class is a solution but not really a pythonic one.
from typing import NamedTuple

class Frame:
    def __init__(self, namedtuple: NamedTuple):
        self.namedtuple = namedtuple

    def __getattr__(self, item):
        return getattr(self.namedtuple, item)

    def __getitem__(self, item):
        return getattr(self.namedtuple, item)
Use to_dict
import pandas as pd

col_name = 'b'
df = pd.DataFrame([{'a': 1, 'b': 2}, {'a': 3, 'b': 4}])

for row in df.to_dict('records'):
    print(row[col_name])
Output
2
4
If you want to keep both notations, a possible approach would be to do:
def iterdicts(tuples):
    yield from ((tup, tup._asdict()) for tup in tuples)

df = pd.DataFrame([{'a': 1, 'b': 2}, {'a': 3, 'b': 4}])
for tup, row in iterdicts(df.itertuples()):
    print(tup.a)
    print(row[col_name])
Output
1
2
3
4
A similar approach to yours, just using df.iterrows():
import pandas as pd

df = pd.DataFrame([{'a': 1, 'b': 2}, {'a': 3, 'b': 4}])

for index, row in df.iterrows():
    print(row.b)
    print(getattr(row, 'b'))
    print(row['b'])
These lines were tested using pandas versions 0.20.3 and 1.0.1.
I have a pandas dataframe like so:
A
a
b
c
d
I am trying to create a python dictionary which would look like this:
df_dict = {'a':0, 'b':1, 'c':2, 'd':3}
What I've tried:
df.reset_index(inplace=True)
df = {x : y for x in df['A'] for y in df['index']}
But the df is 75k rows long and it's taking a while now; I'm not even sure this produces the result I need. Is there a neat, fast way of achieving this?
Use dict with zip and range:
d = dict(zip(df['A'], range(len(df))))
print (d)
{'a': 0, 'b': 1, 'c': 2, 'd': 3}
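An equivalent one-liner, if you prefer going through a Series (just an alternative sketch, not required):

```python
import pandas as pd

df = pd.DataFrame({'A': ['a', 'b', 'c', 'd']})

# index the row positions by the values of column A, then export as a dict
d = pd.Series(range(len(df)), index=df['A']).to_dict()
print(d)  # {'a': 0, 'b': 1, 'c': 2, 'd': 3}
```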
You can do it like this:
# creating an example dataframe with 75,000 rows
import uuid
import pandas as pd

df = pd.DataFrame({"col": [str(uuid.uuid4()) for _ in range(75000)]})

# your bit (note: the mapping goes value -> index, so v is the key)
{v: i for i, v in df.reset_index().values}
It runs in seconds.
You could convert series to list and use enumerate:
lst = { x: i for i, x in enumerate(df['A'].tolist()) }
I have a list of dictionaries. Each dictionary holds a single key/value pair, and each value is a data frame.
I would like to extract all the data frames and combine them into one.
I have tried:
df = pd.DataFrame.from_dict(data)
for both the full data file and for each dictionary in the list.
This gives the following error:
ValueError: If using all scalar values, you must pass an index
I have also tried turning the dictionary into a list, then converting to a pd.DataFrame, i get:
KeyError: 0
Any ideas?
It should be doable with pd.concat(). Let's say you have a list of dictionaries l:
import numpy as np
import pandas as pd

l = [
    {'a': pd.DataFrame(np.arange(9).reshape((3, 3)))},
    {'b': pd.DataFrame(np.arange(9).reshape((3, 3)))},
    {'c': pd.DataFrame(np.arange(9).reshape((3, 3)))}
]
You can feed dataframes from each dict in the list to pd.concat():
df = pd.concat([df_ for dict_ in l for df_ in dict_.values()])
In my example all data frames have the same columns, so the result has shape 9 x 3. If your dataframes have different columns, the output will be malformed and require extra steps to process.
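If you also want to keep track of which dictionary each block came from, pd.concat accepts a mapping, and the keys become the outer level of the resulting index (a sketch assuming single-entry dicts, as in the question):

```python
import numpy as np
import pandas as pd

l = [
    {'a': pd.DataFrame(np.arange(9).reshape((3, 3)))},
    {'b': pd.DataFrame(np.arange(9).reshape((3, 3)))},
]

# flatten the single-entry dicts into one mapping, then concatenate;
# each dict key becomes the outer level of the resulting MultiIndex
merged = {k: v for d in l for k, v in d.items()}
df = pd.concat(merged)
print(df.shape)  # (6, 3)
```

This avoids the ambiguity of duplicate row indices after concatenation, since every row stays addressable as (dict key, original index).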
This should work.
import pandas as pd

dict1 = {'d1': pd.DataFrame({'a': [1, 2, 3], 'b': ['one', 'two', 'three']})}
dict2 = {'d2': pd.DataFrame({'a': [4, 5, 6], 'b': ['four', 'five', 'six']})}
dict3 = {'d3': pd.DataFrame({'a': [7, 8, 9], 'b': ['seven', 'eight', 'nine']})}

# dicts list. you would start from here
dicts_list = [dict1, dict2, dict3]

df = None
for _dict in dicts_list:
    aux_df = list(_dict.values())[0]
    # DataFrame.append was removed in pandas 2.0; use pd.concat instead
    df = aux_df if df is None else pd.concat([df, aux_df])

# Resetting and dropping the old index
df = df.reset_index(drop=True)
print(df)
Just out of curiosity: Why are your sub-dataframes already included in a dictionary? An easy way of creating a dataframe from dictionaries is just building a list of dictionaries and then calling pd.DataFrame(list_with_dicts). If the keys are the same across all dictionaries, it should work. Just a suggestion from my side. Something like this:
list_with_dicts = [{'a': 1, 'b': 2}, {'a': 5, 'b': 4}, ...]
# my_df -> DataFrame with columns [a, b] and two rows with the values in the dict.
my_df = pd.DataFrame(list_with_dicts)
Can I use a tqdm progress bar with the map function to loop through dataframe/series rows?
Specifically, for the following case:
import pandas as pd

def example(x):
    x = x + 2
    return x

if __name__ == '__main__':
    dframe = pd.DataFrame([{'a': 1, 'b': 1}, {'a': 2, 'b': 2}, {'a': 3, 'b': 3}])
    dframe['b'] = dframe['b'].map(example)
Thanks to the integration of tqdm with pandas, you can use the progress_map function instead of map.
Note: for this to work you must add a tqdm.pandas() line to your code.
So try this:
import pandas as pd
from tqdm import tqdm

def example(x):
    x = x + 2
    return x

tqdm.pandas()  # <- added this line

if __name__ == '__main__':
    dframe = pd.DataFrame([{'a': 1, 'b': 1}, {'a': 2, 'b': 2}, {'a': 3, 'b': 3}])
    dframe['b'] = dframe['b'].progress_map(example)  # <- progress_map here
Here is the documentation reference:
(after adding tqdm.pandas()) ... you can use progress_apply instead of apply and progress_map instead of map
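The same registration also enables progress_apply; a minimal sketch (the desc label is optional and just names the bar):

```python
import pandas as pd
from tqdm import tqdm

tqdm.pandas(desc="example")  # registers progress_apply / progress_map on pandas objects

dframe = pd.DataFrame([{'a': 1, 'b': 1}, {'a': 2, 'b': 2}])
dframe['b'] = dframe['b'].progress_apply(lambda x: x + 2)
print(dframe['b'].tolist())  # [3, 4]
```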