I updated all the relevant libraries to the latest versions, so I thought I could now use the new step feature in the rolling function:
print(df['600028.SS'].rolling(window=125, step=20).corr(df['600121.SS']))
However, when executing the command, I always get the following error message:
NotImplementedError: step not implemented for corr
How can I implement the step feature for corr, or is there any way to circumvent the error message? By the way, '600028.SS' and '600121.SS' are just the names of two columns in the dataframe.
I want to get the correlation coefficient for those two stocks on a rolling basis. Every correlation coefficient should include the last 125 observations, and the step size should be 20. Since the step feature was added in the latest pandas release (1.5.0), I thought it would work now, but I still receive the message that step is not implemented for corr.
The symptom is easily reproduced:
>>> df = pd.DataFrame([dict(a=1, b=2)])
>>> df.a.rolling(window=125, step=20).corr(df.b)
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/Users/jhanley/miniconda3/envs/problems/lib/python3.10/site-packages/pandas/core/window/rolling.py", line 2829, in corr
return super().corr(
File "/Users/jhanley/miniconda3/envs/problems/lib/python3.10/site-packages/pandas/core/window/rolling.py", line 1757, in corr
raise NotImplementedError("step not implemented for corr")
NotImplementedError: step not implemented for corr
I am reading the 1.5.1 documentation, https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.core.window.rolling.Rolling.corr.html .
It lists four possible args:
other
pairwise
ddof
numeric_only
It doesn't mention a step parameter.
It explains that additional args are accepted "for NumPy compatibility and will not have an effect on the result."
The diagnostic error message is accurate: you are attempting to use something that is not implemented, and that won't work.
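One way to circumvent the error is to compute the full rolling correlation without step and then subsample every 20th row yourself; this yields the same rows a working step=20 would, at the cost of evaluating every window. A minimal sketch, assuming df holds the two columns:
corr = df['600028.SS'].rolling(window=125).corr(df['600121.SS'])
corr_stepped = corr.iloc[::20]   # keep positions 0, 20, 40, ... - what step=20 would return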
Why didn't the commit 7138f470f0e55f2ebdb7638ddc4dfe2e78671403 trigger a new major version of dask, since the function read_metadata is incompatible with older versions? The commit introduced the return of 4 values, but the old version only returned 3. According to semantic versioning, this would have been the correct behavior.
cudf broke because of that commit.
Code from the issue:
>>> import cudf
>>> import dask_cudf
>>> dask_cudf.from_cudf(cudf.DataFrame({'a':[1,2,3]}),npartitions=1).to_parquet('test_parquet')
>>> dask_cudf.read_parquet('test_parquet')
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/nvme/0/vjawa/conda/envs/cudf_15_june_25/lib/python3.7/site-packages/dask_cudf/io/parquet.py", line 213, in read_parquet
**kwargs,
File "/nvme/0/vjawa/conda/envs/cudf_15_june_25/lib/python3.7/site-packages/dask/dataframe/io/parquet/core.py", line 234, in read_parquet
**kwargs
File "/nvme/0/vjawa/conda/envs/cudf_15_june_25/lib/python3.7/site-packages/dask_cudf/io/parquet.py", line 17, in read_metadata
meta, stats, parts, index = ArrowEngine.read_metadata(*args, **kwargs)
ValueError: not enough values to unpack (expected 4, got 3)
dask_cudf==0.14 is only compatible with dask<=0.19. In dask_cudf==0.16 the issue is fixed.
Edit: Link to the issue
Whilst Dask does not have a concrete policy around the value of the version string, one could argue that in this particular case, the IO code is non-core, and largely being pushed by upstream (pyarrow) development rather than our own initiative.
We are sorry that your code broke, but of course picking the correct versions of packages and expecting downstream packages to catch up is part of the open source ecosystem.
You may want to raise this as a GitHub issue if you'd like to get input from more of the dask maintenance team. (There isn't really much to "answer" here from a Stack Overflow perspective.)
I have a function that takes two samples as input and returns their distance, and from this function I have defined a metric:
def TwoPointsDistance(x1, x2):
    cord1 = f.rf.apply(x1)
    cord2 = f.rf.apply(x2)
    return 1 - (cord1 == cord2).sum() / f.n_trees

metric = sk.neighbors.DistanceMetric.get_metric('pyfunc', func=TwoPointsDistance)
Now I would like to cluster my data according to this metric. I would like to see some examples of algorithms for unsupervised clustering that use this as a distance metric.
EDIT: I am particularly interested in this algorithm:
http://scikit-learn.org/stable/modules/generated/sklearn.cluster.DBSCAN.html#sklearn.cluster.DBSCAN
EDIT: I have tried
DBSCAN(metric=metric, algorithm='brute').fit(Xor)
but I receive an error:
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/usr/local/lib/python3.4/dist-packages/sklearn/cluster/dbscan_.py", line 249, in fit
clust = dbscan(X, **self.get_params())
File "/usr/local/lib/python3.4/dist-packages/sklearn/cluster/dbscan_.py", line 100, in dbscan
metric=metric, p=p)
File "/usr/local/lib/python3.4/dist-packages/sklearn/neighbors/unsupervised.py", line 83, in __init__
leaf_size=leaf_size, metric=metric, **kwargs)
File "/usr/local/lib/python3.4/dist-packages/sklearn/neighbors/base.py", line 127, in _init_params
% (metric, algorithm))
ValueError: Metric '<sklearn.neighbors.dist_metrics.PyFuncDistance object at 0x7ff5c299f358>' not valid for algorithm 'brute'
I've tried to figure out why this error arises... I first thought sklearn.neighbors.NearestNeighbors (which is what DBSCAN is based upon) would be constrained to those distances listed in sklearn.neighbors.base.VALID_METRICS["brute"]. But judging from the source code, any callable function should be okay - so it seems your distance isn't callable?
Please try this:
DBSCAN(metric=TwoPointsDistance, algorithm='brute').fit(Xor)
i.e. without wrapping your distance as neighbors.DistanceMetric. It seems a bit inconsistent to me to not allow these to be used here...
Myself, I have used ELKI with great success with a custom distance function, and there is a short tutorial on how to write these available: http://elki.dbs.ifi.lmu.de/wiki/Tutorial/DistanceFunctions
Today, years later, I still stumbled over this in a different context. The solution is simple: pass the function directly as a metric.
DBSCAN(metric=TwoPointsDistance, algorithm='brute').fit(Xor)
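A minimal, self-contained sketch of this (the data and the distance are toy stand-ins, since f.rf and Xor are not shown in the question):
import numpy as np
from sklearn.cluster import DBSCAN

def toy_distance(x1, x2):
    # fraction of coordinates on which the two samples disagree
    return np.mean(x1 != x2)

X = np.random.randint(0, 2, size=(20, 5)).astype(float)
labels = DBSCAN(eps=0.4, min_samples=2, metric=toy_distance,
                algorithm='brute').fit(X).labels_
print(labels)   # -1 marks noise; other integers are cluster ids
The key point is that metric receives the plain Python callable, and algorithm='brute' makes DBSCAN evaluate it pairwise instead of trying to build a tree index.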
I'm having a weird issue with numpy right now on a current assignment. I'm making an ant colony AI for a class (based on the Google AI Ants Challenge), and I'm using a diffusion based approach where I essentially diffuse out a scent from food/enemy hills on each turn. I've been using numpy since each turn basically consists of doing a lot of matrix manipulations, but I just recently got a weird bug that I can't figure out.
At the beginning of each turn, I update the field associated with each scent before I run the diffusion iterations:
# Here I update the "potential" field (hills_f) and the
# diffusion values (hills_l) for the hills scent. Diffusion values
# (lambda values) are 1 except for on ants, where they are higher
# or lower depending on their colony.
self.hills_f *= TURN_DECAY
self.hills_l = np.ones_like(self.hills_l)
# Update the lambda matrix
for r, c in ants.my_ants():
    self.hills_l[r][c] = MY_HILLS_LAMBDA
for r, c in ants.enemy_ants():
    self.hills_l[r][c] = ENEMY_HILLS_LAMBDA
So this code runs at the beginning of each turn (along with a similar snippet for the food scent), but on a random turn (ranges from 10-40), I get the following error:
Traceback (most recent call last):
File "long_file_path...", line 167, in run
bot.do_turn(ants)
File "MyBot.py", line 137, in do_turn
self.hills_l[r][c] = ENEMY_HILLS_LAMBDA
TypeError: 'numpy.float64' object does not support item assignment
It looks like it randomly turns self.hills_l into a scalar between the two for-loops, which doesn't make any sense to me. It's also weird that there's similar code for the food scent that never crashes, and that this problem shows up so non-deterministically.
I can post more code if necessary, but I think everything should be there, especially since the problem seems to occur between the for loops.
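For what it's worth, here is the kind of guard I could drop in at the top of do_turn to catch the turn where the array gets replaced (a sketch; check_field is a hypothetical helper, not part of my bot):
import numpy as np

def check_field(field, name):
    # Fail fast on the turn where the field was rebound to a scalar,
    # instead of failing later inside the item-assignment loops.
    if not (isinstance(field, np.ndarray) and field.ndim == 2):
        raise TypeError("%s is no longer a 2-D array: %r" % (name, field))
    return field

# e.g. at the top of do_turn:
#     check_field(self.hills_l, "hills_l")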
Thanks!
I am trying to apply SVD to my matrix (3241 x 12596), obtained after some text processing (with the ultimate goal of performing Latent Semantic Analysis), and I am unable to understand why this is happening, as my 64-bit machine has 16 GB of RAM. The moment svd(self.A) is called, it throws an error. The precise error is given below:
Traceback (most recent call last):
File ".\SVD.py", line 985, in <module>
_svd.calc()
File ".\SVD.py", line 534, in calc
self.U, self.S, self.Vt = svd(self.A)
File "C:\Python26\lib\site-packages\scipy\linalg\decomp_svd.py", line 81, in svd
overwrite_a = overwrite_a)
MemoryError
So I tried using
self.U, self.S, self.Vt = svd(self.A, full_matrices= False)
and this time, it throws the following error:
Traceback (most recent call last):
File ".\SVD.py", line 985, in <module>
_svd.calc()
File ".\SVD.py", line 534, in calc
self.U, self.S, self.Vt = svd(self.A, full_matrices= False)
File "C:\Python26\lib\site-packages\scipy\linalg\decomp_svd.py", line 71, in svd
return numpy.linalg.svd(a, full_matrices=0, compute_uv=compute_uv)
File "C:\Python26\lib\site-packages\numpy\linalg\linalg.py", line 1317, in svd
work = zeros((lwork,), t)
MemoryError
Is this matrix so large that NumPy cannot handle it, and is there something I can do at this stage without changing the methodology itself?
Yes, the full_matrices parameter to scipy.linalg.svd is important: your input is highly rank-deficient (rank max 3,241), so you don't want to allocate the entire 12,596 x 12,596 matrix for V!
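A quick shape check illustrates the point (tiny stand-in matrix, not the real data):
import numpy as np

A = np.zeros((30, 120))            # stand-in for the 3241 x 12596 matrix
U, S, Vt = np.linalg.svd(A, full_matrices=False)
print(U.shape, S.shape, Vt.shape)  # (30, 30) (30,) (30, 120): V stays economy-sized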
More importantly, matrices coming from text processing are likely very sparse. The scipy.linalg.svd is dense and doesn't offer truncated SVD, which results in a) tragic performance and b) lots of wasted memory.
Have a look at the sparseSVD package from PyPI, which works over sparse input and lets you ask for the top K factors only. Or try scipy.sparse.linalg.svds, though that's not as efficient and is only available in newer versions of scipy.
Or, to avoid the gritty details completely, use a package that does efficient LSA for you transparently, such as gensim.
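For the truncated sparse route, a sketch with scipy.sparse.linalg.svds (random stand-in data; k is the number of latent dimensions to keep, 300 being a common LSA choice):
import numpy as np
import scipy.sparse as sp
from scipy.sparse.linalg import svds

A = sp.random(3241, 12596, density=0.001, format='csr')  # stand-in for the real matrix
U, S, Vt = svds(A, k=300)                                # compute only the top 300 factors
# svds returns singular values in ascending order; flip to the usual convention
order = np.argsort(S)[::-1]
U, S, Vt = U[:, order], S[order], Vt[order, :]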
As it turns out, thanks to @Ferdinand Beyer, I had not noticed that I was using a 32-bit version of Python on my 64-bit machine.
Using a 64-bit version of Python and reinstalling all the libraries solved the problem.
I am getting the error:
Warning: invalid value encountered in log
From Python and I believe the error is thrown by numpy (using version 1.5.0). However, since I am calling the "log" function in several places, I'm not sure where the error is coming from. Is there a way to get numpy to print the line number that generated this error?
I assume the warning is caused by taking the log of a number that is small enough to be rounded to 0 or smaller (negative). Is that right? What is the usual origin of these warnings?
Putting np.seterr(invalid='raise') in your code (before the errant log call)
will cause numpy to raise an exception instead of issuing a warning.
That will give you a traceback error message and tell you the line Python was executing when the error occurred.
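A minimal sketch of the effect:
import numpy as np

np.seterr(invalid='raise')   # promote 'invalid value' warnings to exceptions
np.log(np.array([-1.0]))     # now raises FloatingPointError with a full traceback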
If you have access to the numpy source, you should be able to find the line that prints that warning (using grep, etc) and edit the corresponding file to force an error (using an assertion, for example) when an invalid value is passed. That will give you a stack trace pointing to the place in your code that called log with the improper value.
I had a brief look in my numpy source and couldn't find anything that matches the warning you described (my version of numpy is older than yours, though).
>>> import numpy
>>> numpy.log(0)
-inf
>>> numpy.__version__
'1.3.0'
Is it possible that you're calling some other log function that isn't in numpy? For example, here is one that actually throws an exception when given invalid input.
>>> import math
>>> math.log(0)
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
ValueError: math domain error