Problem with hdbscan used with bertopic: OSError: [Errno 22] Invalid argument - python

I am writing because I have a problem (silly and obvious introduction, I know).
I am trying to use the BERTopic package using the Python interpreter in RStudio and the reticulate extension:
Python 3.6.13 (C:/Users/Francesco/AppData/Local/r-miniconda/envs/r-reticulate/python.exe)
Reticulate 1.18.9008 REPL -- A Python interpreter in R.
I managed to install it with
pip3 install bertopic
At first, trying to install bertopic resulted in an error relating to its hdbscan dependency, specifically to the wheel used. I worked around it by installing hdbscan with conda (with pip the problem seemed unsolvable); after that, both packages appeared to be installed correctly (pip confirmed this).
Afterwards, I tried to follow the package tutorial on Medium/Towards Data Science (here is the Colab version I'm following) to get accustomed to the package and to check that everything was working as it should.
I am basically copying and pasting the Colab code into the Python chunks of the RMarkdown file I am using, but when I apply the tutorial's code to the same dataset:
from sklearn.datasets import fetch_20newsgroups
docs = fetch_20newsgroups(subset='all', remove=('headers', 'footers', 'quotes'))['data']
from bertopic import BERTopic
topic_model = BERTopic(language="english", calculate_probabilities=True, verbose=True)
topics, probs = topic_model.fit_transform(docs)
I get the following error:
Batches: 100%|##########| 589/589 [28:21<00:00, 2.89s/it]
2021-04-29 16:24:25,973 - BERTopic - Transformed documents to Embeddings
2021-04-29 16:24:35,752 - BERTopic - Reduced dimensionality with UMAP
OSError: [Errno 22] Invalid argument
In theory, following the output on colab, I should get:
....................... - BERTopic - Clustered UMAP embeddings with HDBSCAN
Since I had problems with hdbscan, I believe the error is somehow related to it. I have read several GitHub and Stack Overflow pages pointing out problems with this package, but I do not know how to solve this one, and I really need to, since I need the package for my thesis.
Can someone help me, please?
PS: this is the first time I am asking something on Stack Overflow. I hope I have written down everything necessary; if some info is missing, please tell me.

Related

To support decoding 'mp3' audio files, please install 'sox'

I'm trying to build an ASR model using transfer learning on the wav2vec 2 model.
Whenever I want to show or modify an audio file, I get this problem:
def prepare_dataset(batch):
    audio = batch["audio"]
    # batched output is "un-batched"
    batch["input_values"] = processor(audio["array"], sampling_rate=audio["sampling_rate"]).input_values[0]
    batch["input_length"] = len(batch["input_values"])
    with processor.as_target_processor():
        batch["labels"] = processor(batch["sentence"]).input_ids
    return batch
common_voice_train = common_voice_train.map(prepare_dataset, remove_columns=common_voice_train.column_names)
common_voice_test = common_voice_test.map(prepare_dataset, remove_columns=common_voice_test.column_names)
The errors:
RuntimeError: Backend "sox_io" is not one of available backends: ['soundfile'].
ImportError: To support decoding 'mp3' audio files, please install 'sox'.
These are my PyTorch and torchaudio versions:
import torch
import torchaudio
print(torch.__version__)
print(torchaudio.__version__)
1.13.1+cu117
0.13.1+cu117
I really need help fixing this problem; it is part of my junior project! )':
I've tried reinstalling PyTorch and installing different versions, but nothing worked. The code works fine in Colab, but it's impossible for me to train it there, so I have to use Visual Studio Code...
First, note that the second error message is not from torchaudio, and it is not accurate: TorchAudio does not depend on an external sox package.
TorchAudio provides limited I/O features on Windows, as libsox does not compile on Windows with VS2019. This situation is being worked on, but as of v0.13, Windows users need a workaround.
A simple way is to use another library such as soundfile and convert the decoded NumPy ndarray object into a PyTorch Tensor.
Another way is to install FFmpeg and use torchaudio.io.StreamReader. You can write your own load function by following this tutorial:
https://pytorch.org/audio/0.13.1/tutorials/streamreader_basic_tutorial.html#sphx-glr-tutorials-streamreader-basic-tutorial-py
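For the soundfile route mentioned above, a minimal sketch could look like the following. The helper name load_audio is hypothetical (it is not part of torchaudio or soundfile), and whether a given format such as MP3 can be decoded depends on the libsndfile build that soundfile uses on your machine, so treat this as an assumption to verify rather than a guaranteed fix.

```python
def load_audio(path):
    """Decode an audio file with soundfile and return (waveform, sample_rate).

    Hypothetical helper: the waveform is a float32 torch.Tensor shaped
    (channels, frames), the layout torchaudio.load returns by default.
    """
    # Imported lazily so this module can be imported without these packages.
    import soundfile as sf
    import torch

    # always_2d=True yields shape (frames, channels), even for mono files.
    data, sample_rate = sf.read(path, dtype="float32", always_2d=True)
    # Transpose to (channels, frames) and copy into contiguous memory.
    waveform = torch.from_numpy(data.T.copy())
    return waveform, sample_rate
```

A call such as waveform, sr = load_audio("clip.wav") (the file name is a placeholder) can then stand in for torchaudio.load wherever the sox_io backend is unavailable.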

Top2Vec Model Failing To Train (Following Simple PyPi Tutorial)

I am trying to follow this tutorial on PyPi (See Example -> Train Model): https://pypi.org/project/top2vec/
It is a very short piece of code, and I am following it line by line:
from top2vec import Top2Vec
from sklearn.datasets import fetch_20newsgroups
newsgroups = fetch_20newsgroups(subset='all', remove=('headers', 'footers', 'quotes'))
model = Top2Vec(documents=newsgroups.data, speed="learn", workers=8)
I've tried running multiple times on different datasets, yet I keep running into the following error when training/building the model:
UFuncTypeError: ufunc 'correct_alternative_cosine' did not contain a loop with signature matching types <class 'numpy.dtype[float32]'> -> None
Has anyone encountered this error before and if so how have you fixed it? Otherwise, if anyone can run this same code please let me know if you run into the same error.
Thanks
Solved this by moving from a Jupyter notebook to a regular .py file, cloning the library, installing the requirements into a fresh virtualenv, and running the setup.py file.

Importing t5-base from T5Tokenizer fails

I have been trying to load the pretrained t5-base tokenizer with T5Tokenizer from the transformers library in Python. However, it is not working after repeated attempts.
The output shows "None".
!pip install sentencepiece==0.1.91
from transformers import T5Tokenizer
tokenizer = T5Tokenizer.from_pretrained("t5-base")
print(tokenizer)
The output of the above code is: None
A GitHub page says that version 0.1.91 of the sentencepiece library is required for t5-base. However, it is still not working, as the output above shows.
What can be done in this case?

I am doing a Docker Jupyter deep learning course and ran into a problem with importing Keras libraries and packages

I tried running this command, but I get errors saying that I don't have TensorFlow 2.2 or higher. However, I checked and I have the correct version of TensorFlow. I also ran the pip3 install keras command.
I know for a fact that all of the code is correct because it worked for my teacher the other day and nothing has changed. I just need to run his commands, but I keep running into problems.
I am doing this course following everything he does in a recorded video, so there should be no issue there, but for some reason it just doesn't work.
Just install TensorFlow as requested in the last line of the error message: pip install tensorflow. It is needed as the backend for Keras.
Also, since Keras is part of TensorFlow now, I recommend writing imports as from tensorflow.keras.[submodule name] import instead of from keras.[submodule name] import
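Before reinstalling, it may help to confirm which interpreter the notebook kernel is using and whether that interpreter can see TensorFlow at all, since a package installed for one Python is invisible to another. The sketch below uses only the standard library; the helper name module_available is hypothetical.

```python
import importlib.util
import sys


def module_available(name):
    """Return True if the top-level module `name` is importable here."""
    return importlib.util.find_spec(name) is not None


# Show which Python executable is running this code (for example, the
# notebook kernel inside the Docker container).
print("Interpreter:", sys.executable)
for name in ("tensorflow", "keras"):
    status = "found" if module_available(name) else "missing"
    print(f"{name}: {status}")
```

If tensorflow is reported as missing here while pip says it is installed, pip is most likely installing into a different interpreter than the one the kernel runs.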

Error "ValueError: bad marshal data (unknown type code)" with Python 2.7.13 and Keras 2.0.8

I get the ValueError: bad marshal data (unknown type code) above when trying to load a previously saved Keras model (I think it's a Python error though that has nothing to do with Keras, but not quite sure.)
from keras.models import load_model
from keras import __version__ as keras_version
model = load_model("model.h5")
I searched on Google but didn't find a working solution. I tried deleting .pyc files with: sudo find /usr -name '*.pyc' -delete but that didn't help either.
Do you have an idea how I can fix this error? Thank you!
I know the post is a bit old, but I just ran into the same problem.
As @Daniel Möller said, it was because I had installed different versions of Python, TensorFlow and Keras. Try to train the model again in the same environment that you use to load it afterwards, or at least make sure that the Python version and the modules used are the same versions in both.
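One way to make such mismatches visible is to record the environment when the model is saved and compare it when the model is loaded. The sketch below is only an illustration: the function names are hypothetical, and it relies on importlib.metadata, which exists in Python 3.8 and later, so it would not run on the Python 2.7 setup from the question.

```python
import json
import sys
from importlib import metadata


def snapshot_environment(packages):
    """Record the Python version and the installed version of each package."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None  # package is not installed here
    return {"python": list(sys.version_info[:3]), "packages": versions}


def diff_environments(saved, current):
    """Return (name, saved_version, current_version) for every mismatch."""
    mismatches = []
    if saved["python"] != current["python"]:
        mismatches.append(("python", saved["python"], current["python"]))
    for name, saved_version in saved["packages"].items():
        current_version = current["packages"].get(name)
        if saved_version != current_version:
            mismatches.append((name, saved_version, current_version))
    return mismatches


# The snapshot is plain JSON, so it can be stored next to the model file.
print(json.dumps(snapshot_environment(["keras", "tensorflow"])))
```

At training time the snapshot could be written to a JSON file beside model.h5; at load time a fresh snapshot is taken and diff_environments lists anything that differs before load_model is called.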
