I'm doing some analysis with pandas in a Jupyter notebook, and since my apply function takes a long time, I would like to see a progress bar.
Through another post I found the tqdm library, which provides a simple progress bar for pandas operations.
There is also a Jupyter integration that provides a really nice progress bar where the bar itself changes over time.
However, I would like to combine the two and don't quite see how to do that.
Let's take the same example as in the documentation:
import pandas as pd
import numpy as np
from tqdm import tqdm
df = pd.DataFrame(np.random.randint(0, 100, (100000, 6)))
# Register `pandas.progress_apply` and `pandas.Series.map_apply` with `tqdm`
# (can use `tqdm_gui`, `tqdm_notebook`, optional kwargs, etc.)
tqdm.pandas(desc="my bar!")
# Now you can use `progress_apply` instead of `apply`
# and `progress_map` instead of `map`
df.progress_apply(lambda x: x**2)
# can also groupby:
# df.groupby(0).progress_apply(lambda x: x**2)
It even says "can use 'tqdm_notebook'", but I can't find a way to do so.
I've tried a few things, like
tqdm_notebook(tqdm.pandas(desc="my bar!"))
or
tqdm_notebook.pandas
but they don't work.
In the definition it looks to me like
tqdm.pandas(tqdm_notebook(desc="my bar!"))
should work, but the bar doesn't properly show the progress and there is still additional output.
Any other ideas?
My working solution (copied from the documentation):
from tqdm.auto import tqdm
tqdm.pandas()
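For reference, here is a minimal sketch of that import applied to the example from the question. It assumes a tqdm version recent enough to ship tqdm.auto, and it uses a smaller frame than the question's (my choice, so it finishes quickly). As far as I know, tqdm.auto picks the notebook widget when it detects a Jupyter kernel and falls back to the plain text bar otherwise.

```python
import numpy as np
import pandas as pd
from tqdm.auto import tqdm  # selects the notebook or console bar automatically

# Smaller than the frame in the question so the sketch runs fast (an assumption).
df = pd.DataFrame(np.random.randint(0, 100, (1000, 6)))

# Register progress_apply / progress_map on pandas objects.
tqdm.pandas(desc="my bar!")

# Same call as DataFrame.apply, but it reports progress while it runs.
squared = df.progress_apply(lambda col: col**2)
```

The result should be identical to what df.apply returns; only the progress display differs.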
You can use:
tqdm_notebook().pandas(*args, **kwargs)
This is because tqdm_notebook has a delayed adapter, so it's necessary to instantiate it before accessing its methods (including class methods).
In the future (>v5.1), you should be able to use a more uniform API:
tqdm_pandas(tqdm_notebook, *args, **kwargs)
I found that I also had to import tqdm_notebook. A simple example that works in a Jupyter notebook is given below.
Suppose you want to map a function over a column to create a new column in your pandas DataFrame.
# progress bar
from tqdm import tqdm, tqdm_notebook
# instantiate
tqdm.pandas(tqdm_notebook)
# replace map with progress_map
# where df is a pandas dataframe
df['new_variable'] = df['old_variable'].progress_map(some_function)
If you want to use more than one CPU for that slow apply step, consider using swifter. As a bonus, swifter automatically enables a tqdm progress bar on the apply step. To customize the bar description, use:
df.swifter.progress_bar(enable=True, desc='bar description').apply(...)
For tqdm versions 4.64.0 and greater:
from tqdm.notebook import tqdm
tqdm.pandas()
I was trying to replicate this code for statistical forecasting in Python and ran into a problem: I cannot import the model 'adida' from the statsforecast library.
Here is the link for reference: https://towardsdatascience.com/time-series-forecasting-with-statistical-models-f08dcd1d24d1
import random
from itertools import product
from IPython.display import display, Markdown
from multiprocessing import cpu_count
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from nixtlats.data.datasets.m4 import M4, M4Info
from statsforecast import StatsForecast
from statsforecast.models import (
adida,
croston_classic,
croston_sba,
croston_optimized,
historic_average,
imapa,
naive,
random_walk_with_drift,
seasonal_exponential_smoothing,
seasonal_naive,
seasonal_window_average,
ses,
tsb,
window_average
)
Attached is the error message. Could you please take a look and let me know why this import fails?
Given below is the error image:
I did some research and figured out that the issue is probably the version. Try installing this specific version of statsforecast:
pip install statsforecast==0.6.0
Try loading these models after that; hopefully this will work.
As of v1.0.0 of StatsForecast, the API changed to be more like sklearn, using classes instead of functions. You can find an example of the new syntax here: https://nixtla.github.io/statsforecast/examples/IntermittentData.html.
The new code would be
from statsforecast import StatsForecast
from statsforecast.models import ADIDA, IMAPA
model = StatsForecast(df=Y_train_df, # your data
models=[ADIDA(), IMAPA()],
freq=freq, # frequency of your data
n_jobs=-1)
If you want to use the old syntax, setting the version as suggested should work.
If you have updated the package, use ADIDA (the class name) instead of adida and it will work.
The model names in the new versions look like this:
ADIDA(),
IMAPA(),
(SimpleExponentialSmoothing(0.1)),
(TSB(0.3,0.2)),
(WindowAverage( 6))
I want to use Hypothesis's awesome features to create some sample data for my application. I use it roughly like this:
from hypothesis import strategies as st
ints = st.integers() #simplified example
ints.example()
I get this warning:
NonInteractiveExampleWarning: The .example() method is good for exploring strategies, but should only be used interactively
Is there a simple way to disable this warning? Just to be clear: I want to use the example data generation outside of testing and in a non-interactive context, and I'm aware of what the warning is referring to. I just want to get rid of it.
The warnings module lets you selectively ignore specific warnings; for example:
from hypothesis.errors import NonInteractiveExampleWarning
from hypothesis import strategies as st
import warnings
with warnings.catch_warnings():
    # The filter only applies inside this block
    warnings.filterwarnings("ignore", category=NonInteractiveExampleWarning)
    ints = st.integers()
    print(ints.example())
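To illustrate how the filter is scoped, here is a small sketch that runs without Hypothesis installed. DemoWarning and noisy_example are hypothetical stand-ins for NonInteractiveExampleWarning and .example(); the outer recording context exists only so the sketch can count which warnings got through.

```python
import warnings


class DemoWarning(UserWarning):
    """Hypothetical stand-in for NonInteractiveExampleWarning."""


def noisy_example():
    # Hypothetical helper: warns on every call, much like .example() in a script.
    warnings.warn("only meant for interactive use", DemoWarning)
    return 42


# Record warnings instead of printing them, so they can be counted afterwards.
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")

    with warnings.catch_warnings():
        warnings.filterwarnings("ignore", category=DemoWarning)
        inside = noisy_example()  # suppressed by the ignore filter

    outside = noisy_example()  # previous filters restored, so this one is recorded

print(len(caught))
```

Only the call made after the inner with block is recorded, which is why wrapping just the .example() call keeps the rest of the program's warnings intact.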
I have used pandas in the past, but I have recently run into a problem where my code does not display the output of .head() or .describe(). I have copied the code below from another website and it still does not display anything. Any help is appreciated.
import pandas as pd
import tensorflow as tf
from matplotlib import pyplot as plt
pd.options.display.max_rows = 10
pd.options.display.float_format = "{:.1f}".format
training_df = pd.read_csv(filepath_or_buffer="california_housing_train.csv")
training_df["median_house_value"] /= 1000.0
training_df.describe(include = 'all')
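Your code works in a notebook or REPL, where the value of the last expression is echoed automatically, but a script never prints it. Wrap the call in print(), for example print(training_df.describe(include='all')), to see the output when running as a script.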
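A minimal sketch of the difference, using a tiny made-up DataFrame in place of california_housing_train.csv (the three values are placeholders, not real rows from that file):

```python
import pandas as pd

# Hypothetical stand-in for the CSV from the question.
training_df = pd.DataFrame({"median_house_value": [66900.0, 80100.0, 85700.0]})
training_df["median_house_value"] /= 1000.0

# In a script, the value of a bare expression is discarded, so nothing appears.
training_df.describe(include="all")

# Printing explicitly shows the output everywhere.
summary = training_df.describe(include="all")
print(training_df.head())
print(summary)
```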
I am trying to clean up and streamline my code and recently came across named aggregation in pandas.
This note is on the page:
If your aggregation functions requires additional arguments, partially apply them with functools.partial().
Here is the setup code:
from functools import partial as fpart
import pandas as pd
import numpy as np
inputData = {'groupByVar1':['a','a','a','a','a','a','a','a','a','a','b','b','b','b','b','b','b','b','b','b'],
'groupByVar2':[1,1,1,1,1,2,2,2,2,2,3,3,3,3,3,4,4,4,4,4],
'value':[1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20]}
df = pd.DataFrame(inputData)
upperPercentile = 0.98
lowerPercentile=0.05
Here is the method I have been using
nsgdf = df.groupby(['groupByVar1','groupByVar2'])[['value']].quantile(upperPercentile).rename({'value':'upperPercentileValue'},axis=1).reset_index(drop=False)
nsgdf['lowerPercentileValue'] = df.groupby(['groupByVar1','groupByVar2'])[['value']].quantile(lowerPercentile).values
Here is the method I would like to use:
fpartUpper = fpart(np.quantile,q=upperPercentile)
fpartLower = fpart(np.quantile,q=lowerPercentile)
bdf = df.groupby(['groupByVar1','groupByVar2']).agg(
upperPercentileValue=pd.NamedAgg(column='value',aggfunc=fpartUpper),
lowerPercentileValue=pd.NamedAgg(column='value',aggfunc=fpartLower)
)
The following Error is returned from Pandas:
pandas.core.base.SpecificationError: Function names must be unique, found multiple named quantile
However if I execute the following I actually get a result:
fpartUpper([1,2,3,4,5])
Out[16]: 4.92
How can I get this particular method to work with pandas? What am I missing? Why is pandas finding multiple definitions for quantile, whereas running the bare function causes no issues?
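One possible workaround, offered as a sketch and not as a confirmed answer: the error text suggests that pandas derives a name for each aggregation function and that both partial objects resolve to the name of the wrapped function, quantile. Under that assumption, giving each aggregation its own plainly named function avoids the clash. The helper names upper_quantile and lower_quantile are hypothetical, and named aggregation needs pandas 0.25 or later.

```python
import pandas as pd

df = pd.DataFrame({
    "groupByVar1": ["a"] * 10 + ["b"] * 10,
    "groupByVar2": [1] * 5 + [2] * 5 + [3] * 5 + [4] * 5,
    "value": list(range(1, 21)),
})
upperPercentile = 0.98
lowerPercentile = 0.05


# Hypothetical helpers: what matters is that each has a distinct __name__.
def upper_quantile(values):
    return values.quantile(upperPercentile)


def lower_quantile(values):
    return values.quantile(lowerPercentile)


bdf = df.groupby(["groupByVar1", "groupByVar2"]).agg(
    upperPercentileValue=pd.NamedAgg(column="value", aggfunc=upper_quantile),
    lowerPercentileValue=pd.NamedAgg(column="value", aggfunc=lower_quantile),
)
print(bdf)
```

For group ('a', 1) the upper value comes out as 4.92, matching the result of calling the partial directly on [1, 2, 3, 4, 5].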
Through rpy2 in Jupyter, you can plot your data directly from Python using R objects. How can you set par(mfrow=c(1,2)) from Python?
For instance, I want to automatically feed a matrix of variable size from Python and plot it (among other statistical analyses) using rpy2. But instead of getting a single boxplot, I want all of them to be output.
Here's some sample code
import rpy2.ipython
import rpy2.robjects as ro
import scipy as sp
import re #python for regex
from rpy2.robjects.packages import importr
rpy2.robjects.numpy2ri.activate()
%load_ext rpy2.ipython
%R
test=[[1,3,2],[6,5,7,8,9]]
def funtoanalyze(grouparray):
    a={}
    data=numpy.array(test)
    for ig in range(len(grouparray)):
        key=grouparray[ig]
        value=data[ig]
        a[key]=value
        next
    rbox=ro.r('boxplot')
    for gro in a:
        datar=a[gro]
        ro.r('dev.new()')
        rbox(ro.FloatVector(datar[:]),xlab="",main=gro)
    return
funtoanalyze(["group33","group2"]) #only plots last group
Your use of %load_ext rpy2.ipython suggests that you want to have your figure in the Jupyter notebook.
R uses "graphical devices" to output figures; calling par(mfrow=c(...)) either applies the setting to the currently open graphical device or opens a new default device and sets the parameter there.
The %%R "magic" checks whether figures were generated on default devices and displays them in the notebook. The following should work:
%%R
par(mfrow=c(1,2))
plot(0, 0)
plot(0, 0)
If you do not want to use the R magic, rpy2 has other utilities for the Jupyter notebook. For plotting there is a context manager (see https://bitbucket.org/rpy2/rpy2/issues/330/ipython-plotting-wrapper - I don't remember if there is more documentation), but the most advanced utilities are tailored for ggplot2. See, for example, this slide and the ones that follow it:
https://lgautier.github.io/odsc-ppda-slides/#/5/13
The full notebook is here:
https://github.com/lgautier/odsc-ppda-slides/blob/master/notebooks/slides.ipynb
There is a docker container shipping with everything needed to run the notebook:
https://github.com/lgautier/pragmatic-polyglot-data-analysis