Pandas get single warning message per line - python

I have some code that calls Pandas functions millions of times, and many of these calls issue warnings. This is a background process, so I wanted to convert them to logs using logging.captureWarnings(True). The logging works fine, but no matter what filter options I set, the warnings are repeated millions of times. I have set the logger to record each row to a database table, which works fine with thousands of log records, but is slowed down by millions of warning messages as logs.
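For reference, the capture setup looks roughly like this (a minimal sketch; the real handler writes to a database table, so a StreamHandler stands in for it here):
import logging
# Route warnings through the logging system instead of stderr;
# captured warnings are emitted on the 'py.warnings' logger.
logging.captureWarnings(True)
logging.getLogger('py.warnings').addHandler(logging.StreamHandler())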
So, my question: is there any way to make sure a warning is emitted only once per call site, which is the default AFAIK? I think this bug may have something to do with it: https://github.com/python/cpython/issues/73858. I did look at the pandas source code but couldn't find an obvious place where the catch_warnings context manager was being used.
Does anyone have any potential solutions to return the emitted warnings to what should be the default behaviour?
A simple test script is instructive:
import pandas as pd
import warnings
warnings.simplefilter(action='default', category=FutureWarning)
d = {'col1': [1, 2], 'col2': [3, 4], 'col3': [5, 6]}
df = pd.DataFrame(data=d)
a = ['col1','col2']
df[a]
b = set(a)
for i in range(3):
    df[b]
Running it with python3 -Wd::FutureWarning test.py
Yields:
/home/stephen/test.py:16: FutureWarning: Passing a set as an indexer is deprecated and will raise in a future version. Use a list instead.
df[b]
/home/stephen/test.py:16: FutureWarning: Passing a set as an indexer is deprecated and will raise in a future version. Use a list instead.
df[b]
/home/stephen/test.py:16: FutureWarning: Passing a set as an indexer is deprecated and will raise in a future version. Use a list instead.
df[b]
This is not the behaviour I expected.

Related

Configuring python warnings control to show first occurrence only

I am trying to use the python warnings module to show only the first occurrence of a given warning. According to the Warning control documentation, this looks to be as simple as setting the action flag to 'once', but that doesn't work for me.
Example code.
# Tried with python 3.9.13 for all the pandas versions below,
# and python 3.8.13 and 3.10.6 with pandas 1.5.0.
# Recent pandas versions 1.4.4 and 1.5.0 result in a deprecated warning
# for DataFrame.append, pandas version 1.3.4 does not.
import pandas as pd
import warnings
# Set action parameter. Various options include 'default', 'once' and 'ignore'
warnings.filterwarnings(action='default')
data = [1]
appendDF = pd.DataFrame(data)
for i in range(5):
    print('Trial:', i)
    appendDF = appendDF.append(data)
For action='default' and action='once', I get several copies of the same deprecation warning. For action='ignore', I get none (as desired).
I can see a potential workaround in How to print only first occurrence of python warning?, but that doesn't go into the 'once' option for the action parameter and also involves additional bespoke coding.
Any suggestions as to why the 'once' option above does not seem to work?

xarray firing a FutureWarning

I just upgraded to Python 3.8 and updated xarray to v0.16, but now I am always getting this warning:
/usr/local/lib/python3.8/dist-packages/xarray/core/common.py:1123: FutureWarning: 'base' in .resample() and in Grouper() is deprecated.
The new arguments that you should use are 'offset' or 'origin'.
>>> df.resample(freq="3s", base=2)
becomes:
>>> df.resample(freq="3s", offset="2s")
grouper = pd.Grouper(
The only point in my script in which I am using .resample is this:
mydata = xr.open_dataset(ncfile).resample(time='3H').reduce(np.mean)
but I don't know how to change it to avoid the warning.
Until this is updated in xarray, warnings can be ignored with a call to warnings.filterwarnings.
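For example, a narrowly scoped filter could look like this (a sketch; the message regex is taken from the start of the warning text above):
import warnings
# Suppress only this specific deprecation until the resample call can be updated.
warnings.filterwarnings(
    "ignore",
    message=r"'base' in \.resample",
    category=FutureWarning,
)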
You're welcome to open an issue on GitHub for xarray (or even a PR!)

Deprecation warning for np.ptp

I'm using Python, and when I run the following code
df['timestamp'] = df.groupby(["id"]).timestamp.transform(np.ptp)
I'm getting the warning FutureWarning: Method .ptp is deprecated and will be removed in a future version. Use numpy.ptp instead. df is a Pandas DataFrame, and timestamp and id are columns. I think np.ptp is causing this warning.
What do I have to change?
It means that the method .ptp is being deprecated in favor of (from what I've read) the function np.ptp(), so you can either suppress the warning or replace the method with the function, as NumPy suggests.
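For example (a sketch, with hypothetical data using the same column names): calling the NumPy function on the underlying array avoids dispatching to the deprecated Series method:
import numpy as np
import pandas as pd
# Hypothetical data mirroring the question's 'id' and 'timestamp' columns.
df = pd.DataFrame({'id': [1, 1, 2, 2], 'timestamp': [10, 30, 5, 25]})
# np.ptp on the raw array sidesteps the deprecated Series.ptp method
# that pandas would otherwise dispatch to.
df['timestamp'] = df.groupby('id').timestamp.transform(lambda x: np.ptp(x.values))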
If you wish to suppress the warnings, you can try:
warnings.filterwarnings('ignore'), or warnings.simplefilter('ignore', FutureWarning) if it's only FutureWarning you are ignoring.

Dask prints warning to use client.scatter although I'm using the suggested approach

In dask distributed I get the following warning, which I would not expect:
/home/miniconda3/lib/python3.6/site-packages/distributed/worker.py:739: UserWarning: Large object of size 1.95 MB detected in task graph:
(['int-58e78e1b34eb49a68c65b54815d1b158', 'int-5cd ... 161071d7ae7'],)
Consider scattering large objects ahead of time
with client.scatter to reduce scheduler burden and
keep data on workers
future = client.submit(func, big_data) # bad
big_future = client.scatter(big_data) # good
future = client.submit(func, big_future) # good
% (format_bytes(len(b)), s))
The reason I'm surprised is that I'm doing exactly what the warning suggests:
import dask.dataframe as dd
import pandas
from dask.distributed import Client, LocalCluster
c = Client(LocalCluster())
dask_df = dd.from_pandas(pandas.DataFrame.from_dict({'A':[1,2,3,4,5]*1000}), npartitions=10)
filter_list = c.scatter(list(range(2,100000,2)))
mask = c.submit(dask_df['A'].isin, filter_list)
dask_df[mask.result()].compute()
So my question is: Am I doing something wrong or is this a bug?
pandas='0.22.0'
dask='0.17.0'
The main reason why dask is complaining isn't the list, it's the pandas dataframe inside the dask dataframe.
dask_df = dd.from_pandas(pandas.DataFrame.from_dict({'A':[1,2,3,4,5]*1000}), npartitions=10)
You are creating a biggish amount of data locally when you create a pandas dataframe in your local session. Then you operate with it on the cluster. This will require moving your pandas dataframe to the cluster.
You're welcome to ignore these warnings, but in general I would not be surprised if performance here is worse than with pandas alone.
There are a few other things going on here. Your scatter of a list produces a bunch of futures, which may not be what you want. You're calling submit on a dask object, which is usually unnecessary.
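For illustration, a sketch of the more idiomatic version (reusing the toy data from the question): build the mask lazily on the dask series itself, with no scatter or submit:
import pandas as pd
import dask.dataframe as dd
# Same toy data as in the question.
dask_df = dd.from_pandas(pd.DataFrame({'A': [1, 2, 3, 4, 5] * 1000}), npartitions=10)
filter_list = list(range(2, 100000, 2))
# isin on the dask series builds the mask as part of the task graph.
result = dask_df[dask_df['A'].isin(filter_list)].compute()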

How to reset warnings completely

How can I see a warning again without restarting Python? Right now I see them only once.
Consider this code for example:
import pandas as pd
pd.Series([1]) / 0
I get
RuntimeWarning: divide by zero encountered in true_divide
But when I run it again it executes silently.
How can I see the warning again without restarting python?
I have tried to do
del __warningregistry__
but that doesn't help.
It seems like only some types of warnings are stored there.
For example, if I do:
def f():
    X = pd.DataFrame(dict(a=[1, 2, 3], b=[4, 5, 6]))
    Y = X.iloc[:2]
    Y['c'] = 8
then this will raise a warning only the first time f() is called.
However, now if I do del __warningregistry__ I can see the warning again.
What is the difference between the first and second warning? Why is only the second one stored in this __warningregistry__? Where is the first one stored?
How can I see the warning again without restarting python?
As long as you do the following at the beginning of your script, you will not need to restart.
import pandas as pd
import numpy as np
import warnings
np.seterr(all='warn')
warnings.simplefilter("always")
At this point every time you attempt to divide by zero, it will display
RuntimeWarning: divide by zero encountered in true_divide
Explanation:
We are setting up a couple of warning filters. The first (np.seterr) tells NumPy how it should handle floating-point errors. I have set it to warn on all of them, but if you are only interested in seeing the divide-by-zero warnings, change the parameter from all to divide.
Next we tell the warnings module to always display warnings. We do this by setting up a warning filter.
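For instance, restricting NumPy's reporting to division errors only (a small sketch):
import numpy as np
# Only divide-by-zero raises a RuntimeWarning; other floating-point
# errors keep their default handling.
np.seterr(divide='warn')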
What is the difference between first and second warning? Why only the second one is stored in this __warningregistry__? Where is the first one stored?
This is described in the bug report reporting this issue:
If you didn't raise the warning before using the simple filter, this
would have worked. The undesired behavior is because of
__warningsregistry__. It is set the first time the warning is emitted.
When the second warning comes through, the filter isn't even looked at.
I think the best way to fix this is to invalidate __warningsregistry__
when a filter is used. It would probably be best to store warnings data
in a global then instead of on the module, so it is easy to invalidate.
Incidentally, the bug has been closed as fixed for versions 3.4 and 3.5.
warnings is a pretty awesome standard library module. You're going to enjoy getting to know it :)
A little background
The default behavior of warnings is to only show a particular warning, coming from a particular line, on its first occurrence. For instance, the following code will result in two warnings shown to the user:
import numpy as np
# 10 warnings, but only the first copy will be shown
for i in range(10):
    np.true_divide(1, 0)
# This is on a separate line from the other "copies", so its warning will show
np.true_divide(1, 0)
You have a few options to change this behavior.
Option 1: Reset the warnings registry
When you want Python to "forget" what warnings you've seen before, you can use resetwarnings:
import warnings
import numpy as np

# warns every time, because the warnings registry has been reset
for i in range(10):
    warnings.resetwarnings()
    np.true_divide(1, 0)
Note that this also resets any warning configuration changes you've made. Which brings me to...
Option 2: Change the warnings configuration
The warnings module documentation covers this in greater detail, but one straightforward option is just to use a simplefilter to change that default behavior.
import warnings
import numpy as np
# Show all warnings
warnings.simplefilter('always')
for i in range(10):
    # Now this will warn every loop
    np.true_divide(1, 0)
Since this is a global configuration change, it has global effects which you'll likely want to avoid (all warnings anywhere in your application will show every time). A less drastic option is to use the context manager:
with warnings.catch_warnings():
    warnings.simplefilter('always')
    for i in range(10):
        # This will warn every loop
        np.true_divide(1, 0)

# Back to normal behavior: only warn once
for i in range(10):
    np.true_divide(1, 0)
There are also more granular options for changing the configuration on specific types of warnings. For that, check out the docs.
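As a sketch, a more granular filter might target one warning category coming from one module (the module argument is a regex matched against the name of the module issuing the warning):
import warnings
# Always show RuntimeWarnings originating from numpy, without touching
# the filters for any other warnings.
warnings.filterwarnings('always', category=RuntimeWarning, module=r'numpy')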
