How to use tqdm with map for DataFrames - python

Can I use a tqdm progress bar with the map function to loop through dataframe/series rows?
Specifically, for the following case:
import pandas as pd

def example(x):
    x = x + 2
    return x

if __name__ == '__main__':
    dframe = pd.DataFrame([{'a': 1, 'b': 1}, {'a': 2, 'b': 2}, {'a': 3, 'b': 3}])
    dframe['b'] = dframe['b'].map(example)

Thanks to tqdm's integration with pandas, you can use the progress_map function instead of map.
Note: for this to work you must call tqdm.pandas() before mapping.
So try this:
import pandas as pd
from tqdm import tqdm

def example(x):
    x = x + 2
    return x

tqdm.pandas()  # <- added this line

if __name__ == '__main__':
    dframe = pd.DataFrame([{'a': 1, 'b': 1}, {'a': 2, 'b': 2}, {'a': 3, 'b': 3}])
    dframe['b'] = dframe['b'].progress_map(example)  # <- progress_map here
Here is the documentation reference:
(after adding tqdm.pandas()) ... you can use progress_apply instead of apply and progress_map instead of map
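As the quoted documentation notes, the same tqdm.pandas() registration also enables progress_apply for row- or column-wise operations; a minimal sketch (the example DataFrame is made up here):

```python
import pandas as pd
from tqdm import tqdm

tqdm.pandas()  # registers progress_apply / progress_map on pandas objects

df = pd.DataFrame({'a': range(1000), 'b': range(1000)})

# row-wise apply with a progress bar instead of a silent loop
totals = df.progress_apply(lambda row: row['a'] + row['b'], axis=1)
```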

Related

How to style pandas dataframe using for loop

I have a dataset where I need to display different values with different colors; only some of the cells are highlighted.
Here are some of the colors:
dict_colors = {'a': 'red', 'b': 'blue', 'e': 'tomato'}
How can I highlight all these cells with the given colors?
MWE
# data
import pandas as pd
df = pd.DataFrame({'A': list('abcdef'), 'B': list('aabbcc'), 'C': list('aaabbb')})

# without a for loop
(df.style
   .apply(lambda dfx: ['background: red' if val == 'a' else '' for val in dfx], axis=1)
   .apply(lambda dfx: ['background: blue' if val == 'b' else '' for val in dfx], axis=1)
)

# How to do this using a for loop? (I have many values and different colors for them)
# My attempt
dict_colors = {'a': 'red', 'b': 'blue', 'e': 'tomato'}
s = df.style
for key, color in dict_colors.items():
    s = s.apply(lambda dfx: [f'background: {color}' if cell == key else '' for cell in dfx], axis=1)
display(s)
You can try this:
import pandas as pd

df = pd.DataFrame({'A': list('abcdef'), 'B': list('aabbcc'), 'C': list('aaabbb')})
dict_colors = {'a': 'red', 'b': 'blue', 'e': 'tomato'}

# create a Styler object for the DataFrame
s = df.style

def apply_color(val):
    if val in dict_colors:
        return f'background: {dict_colors[val]}'
    return ''

# apply the style to each cell (in pandas >= 2.1, Styler.applymap is renamed Styler.map)
s = s.applymap(apply_color)

# display the styled DataFrame
display(s)
I found a way using the eval method; it is not the most elegant approach (eval on built strings is fragile and unsafe with untrusted input), but it works.
import pandas as pd

df = pd.DataFrame({'A': list('abcdef'), 'B': list('aabbcc'), 'C': list('aaabbb')})
dict_colors = {'a': 'red', 'b': 'blue', 'e': 'tomato'}

lst = ['df.style']
for key, color in dict_colors.items():
    text = f".apply(lambda dfx: ['background: {color}' if cell == '{key}' else '' for cell in dfx], axis=1)"
    lst.append(text)
s = ''.join(lst)
display(eval(s))

How to enable row[col_name] syntax with Namedtuple Pandas from df.itertuples()

We have a DataFrame with many columns and need to cycle through the rows with df.itertuples(). Many column names are in variables, and accessing the namedtuple row with getattr() works fine but is not very readable with many column accesses. Is there a way to enable the row[col_name] syntax? E.g. with a subclassed NamedTuple like here https://stackoverflow.com/a/65301971/360265?
import pandas as pd

col_name = 'b'
df = pd.DataFrame([{'a': 1, 'b': 2.}, {'a': 3, 'b': 4.}])
for row in df.itertuples():
    print(row.a)  # Using row._asdict() would disable this syntax
    print(getattr(row, col_name))  # works fine but is not as readable as row[col_name]
    print(row[col_name])  # how to enable this syntax?
Wrapping row in the following Frame class is a solution, but not really a pythonic one.
class Frame:
    def __init__(self, namedtuple: NamedTuple):
        self.namedtuple = namedtuple

    def __getattr__(self, item):
        return getattr(self.namedtuple, item)

    def __getitem__(self, item):
        return getattr(self.namedtuple, item)
Use to_dict:
import pandas as pd

col_name = 'b'
df = pd.DataFrame([{'a': 1, 'b': 2}, {'a': 3, 'b': 4}])
for row in df.to_dict('records'):
    print(row[col_name])
Output
2
4
If you want to keep both notations, a possible approach would be to do:
def iterdicts(tuples):
    yield from ((tup, tup._asdict()) for tup in tuples)

df = pd.DataFrame([{'a': 1, 'b': 2}, {'a': 3, 'b': 4}])
for tup, row in iterdicts(df.itertuples()):
    print(tup.a)
    print(row[col_name])
Output
1
2
3
4
A similar approach to yours, just using df.iterrows():
import pandas as pd

df = pd.DataFrame([{'a': 1, 'b': 2}, {'a': 3, 'b': 4}])
for index, row in df.iterrows():
    print(row.b)
    print(getattr(row, 'b'))
    print(row['b'])
These lines were tested using pandas versions 0.20.3 and 1.0.1.
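If the wrapper from the question is acceptable, it can be folded into a small generator so the loop body stays clean and supports both syntaxes; a minimal sketch (the name iterframes is made up for illustration):

```python
import pandas as pd

class Frame:
    """Adapter exposing both row.attr and row[key] on an itertuples row."""
    def __init__(self, namedtuple):
        self._nt = namedtuple

    def __getattr__(self, item):
        return getattr(self._nt, item)

    def __getitem__(self, item):
        return getattr(self._nt, item)

def iterframes(df):
    # wrap each namedtuple row once, on the fly
    for tup in df.itertuples():
        yield Frame(tup)

col_name = 'b'
df = pd.DataFrame([{'a': 1, 'b': 2}, {'a': 3, 'b': 4}])
rows = [(row.a, row[col_name]) for row in iterframes(df)]
```

Note that wrapping every row adds per-row overhead, so the plain getattr() form may still be preferable in hot loops.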

Aggregate dictionary values with different functions

I have a dictionary which basically looks like this:
d = {'A': [1, 5, 6, 7],
     'B': [1, 8, 8]}
I want to group by keys and aggregate the values with different functions, i.e. mean or standard deviation.
Mean:
result = {'A': 4.75, 'B': 5.67}
etc.
Thanks
Using a dictionary comprehension and functions from the statistics module:
from statistics import mean, stdev

d = {'A': [1, 5, 6, 7], 'B': [1, 8, 8]}

d_mean = {k: round(mean(v), 2) for k, v in d.items()}
# {'A': 4.75, 'B': 5.67}

d_std = {k: round(stdev(v), 2) for k, v in d.items()}
# {'A': 2.63, 'B': 4.04}
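Since the rest of this thread leans on pandas, a pandas sketch of the same aggregation may be useful; note that lists of unequal length are padded with NaN, which mean() and std() skip by default:

```python
import pandas as pd

d = {'A': [1, 5, 6, 7], 'B': [1, 8, 8]}

# rows are the dict keys; the shorter list is padded with NaN
df = pd.DataFrame.from_dict(d, orient='index')

d_mean = df.mean(axis=1).round(2).to_dict()  # {'A': 4.75, 'B': 5.67}
d_std = df.std(axis=1).round(2).to_dict()    # {'A': 2.63, 'B': 4.04}
```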

Python unit tests for dictionaries of dataframes

I have a function that returns dictionaries of pandas dataframes, and I wish to design a unit test for it.
I know how to unit-test equality of pandas dataframes:
import pandas as pd
from pandas.testing import assert_frame_equal  # pandas.util.testing is deprecated
import unittest

df1 = pd.DataFrame({'a': [1, 2, 3], 'b': [4, 5, 6], 'c': [7, 8, 9]})
df2 = pd.DataFrame(df1)

class DictEq(unittest.TestCase):
    def test_dict_eq(self):
        assert_frame_equal(df1, df2)

unittest.main()
However, I do not seem to grasp how to design a test that compares the following:
dict1 = {'a': df1}
dict2 = {'a': df2}
I have tried the following, all of which fail:
from nose.tools import assert_equal, assert_dict_equal

class DictEq(unittest.TestCase):
    def test_dict_eq1(self):
        assert_equal(dict1, dict2)

    def test_dict_eq2(self):
        assert_dict_equal(dict1, dict2)

    def test_dict_eq3(self):
        self.assertTrue(dict1 == dict2)
The assert_dict_equal function of pandas.util.testing fails as well.
Try this:
import pandas as pd

df1 = pd.DataFrame({'a': [1, 2, 3], 'b': [4, 5, 6], 'c': [7, 8, 9]})
df2 = pd.DataFrame(df1)

class DfWrap():
    def __init__(self, df):
        self.df = df

    def __eq__(self, other):
        # DataFrame.equals returns a single bool, unlike ==
        return self.df.equals(other.df)

dic1 = {'a': DfWrap(df1)}
dic2 = {'a': DfWrap(df2)}

print(dic1 == dic2)
This outputs True. It should work with assert_dict_equal as well, as long as you wrap your dataframe objects in DfWrap.
Here's why it works:
You have to imagine that in order to compare dictionaries, python will go through each key (recursively) and call __eq__ (or ==) on the items to compare. The problem is that when you call __eq__ (or ==) on a dataframe, it doesn't return a bool. Instead it returns another dataframe:
import pandas as pd

df1 = pd.DataFrame({'a': [1, 2, 3], 'b': [4, 5, 6], 'c': [7, 8, 9]})
df2 = pd.DataFrame(df1)

df_compare = df1 == df2
print(type(df_compare))
this outputs:
<class 'pandas.core.frame.DataFrame'>
So, instead, the wrapper makes it so that doing df1 == df2 outputs a bool instead of a dataframe:
DfWrap(df1) == DfWrap(df2)
evaluates to True.
HTH.
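For context, pandas can also produce that single bool directly, without a wrapper; a minimal sketch:

```python
import pandas as pd

df1 = pd.DataFrame({'a': [1, 2, 3], 'b': [4, 5, 6]})
df2 = pd.DataFrame(df1)

# DataFrame.equals returns one bool (and treats NaNs in the same
# location as equal, which elementwise == does not)
same = df1.equals(df2)

# elementwise comparison followed by a double reduction also yields one bool
same_elementwise = (df1 == df2).all().all()
```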
I am not sure, but you may do something like this:
import unittest
import pandas as pd
from pandas.testing import assert_frame_equal

class DictEq(unittest.TestCase):
    def test_dict_eq1(self):
        dict1 = {'a': df1}
        dict2 = {'a': df2}
        self.assertEqual(dict1.keys(), dict2.keys())
        for key in dict1:
            assert_frame_equal(dict1[key], dict2[key])
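The two ideas above can be combined into a small reusable helper (the name assert_dict_of_frames_equal is made up here): compare the key sets first, then each DataFrame with assert_frame_equal.

```python
import pandas as pd
from pandas.testing import assert_frame_equal

def assert_dict_of_frames_equal(d1, d2):
    """Fail if the two dicts differ in keys or in any DataFrame value."""
    assert d1.keys() == d2.keys(), f"key mismatch: {d1.keys()} != {d2.keys()}"
    for key in d1:
        assert_frame_equal(d1[key], d2[key])

df1 = pd.DataFrame({'a': [1, 2, 3]})
dict1 = {'x': df1}
dict2 = {'x': df1.copy()}
assert_dict_of_frames_equal(dict1, dict2)  # passes silently
```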

Python if/elseif construction as dictionary

In Python I regularly use the following construct:
x = {'a': 1, 'b': 2, 'c': 3, 'd': 4}
y = x[v] if v in x.keys() else None
where v is normally one of the dictionary keys, and y gets the dictionary's value if the key exists, otherwise None.
I was wondering if this is a desired construct or if it can be enhanced?
x[v] can be a value as above, but I also use a similar construct to call a function depending on the value of v, like:
{'a': self.f1, 'b': self.f2, 'c': self.f3, 'd': self.f4}[v]()
Normally you'd use dict.get():
y = x.get(v)
.get() takes a default parameter to return if v is not present in the dictionary, but if you omit it None is returned.
Even if you were to use an explicit key test, you don't need to use .keys():
y = x[v] if v in x else None
Interestingly enough, the conditional expression option is slightly faster:
>>> [x.get(v) for v in 'acxz'] # demonstration of the test; two hits, two misses
[1, 3, None, None]
>>> timeit.timeit("for v in 'acxz': x.get(v)", 'from __main__ import x')
0.8269917964935303
>>> timeit.timeit("for v in 'acxz': x[v] if v in x else None", 'from __main__ import x')
0.67330002784729
until you avoid the attribute lookup for .get():
>>> timeit.timeit("for v in 'acxz': get(v)", 'from __main__ import x; get = x.get')
0.6585619449615479
so if speed matters, store a reference to the .get() method (note the get = x.get assignment).
What you have described can be said like this:
"y should be the value of key v if v exists in the dictionary, else it should be None."
In Python, that is:
y = x.get(v, None)
Note: None is already the default return value when the key is not present, so x.get(v) is equivalent.
In your question, you also mention that sometimes your dictionaries contain references to methods. In this case, y would be the method if it is not None, and you can call it normally:
y(*args, **kwargs)
Or, as in your example:
{'a': self.f1, 'b': self.f2, 'c': self.f3, 'd': self.f4}.get(v)()
Note that if the key is not in the dictionary, .get(v) returns None and the call will raise a TypeError.
After spending many hours profiling some similar code, I found the most efficient way to do this is to use defaultdict...
from collections import defaultdict

x = defaultdict(lambda: None, {'a': 1, 'b': 2, 'c': 3, 'd': 4})
y = x[v]  # note: a missing lookup inserts the key into x with value None
...and the next most efficient is two key lookups...
x = {'a': 1, 'b': 2, 'c': 3, 'd': 4}
y = x[v] if v in x else None
Both of these are much faster than this...
x = {'a': 1, 'b': 2, 'c': 3, 'd': 4}
y = x.get(v, None)
...and this...
x = {'a': 1, 'b': 2, 'c': 3, 'd': 4}
try:
    y = x[v]
except KeyError:
    y = None
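For the function-dispatch variant from the question, the same .get pattern takes a fallback handler, which avoids the TypeError on missing keys; a minimal sketch (handle_a, handle_b, and handle_default are hypothetical names):

```python
def handle_a():
    return 'handled a'

def handle_b():
    return 'handled b'

def handle_default():
    return 'no handler'

dispatch = {'a': handle_a, 'b': handle_b}

# .get with a default callable: missing keys fall through to the default
result = dispatch.get('z', handle_default)()
```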
