I want to write Python code for Persian letter recognition. I have a dataset of the Farsi alphabet with 15 instances of each class, and there are 19 classes.
Actually, I don't have much experience in Python. I roughly know what the steps are in theory, but I don't know how to code them.
First I want to convert the images to feature vectors, but I don't know how to do this :/ I've searched a lot but couldn't find anything useful.
Any help would be highly appreciated.
As you don't have enough data to train a deep convolutional network, I suggest you take a look at this Python/OpenCV tutorial on a dataset very similar to yours (MNIST): https://www.learnopencv.com/handwritten-digits-classification-an-opencv-c-python-tutorial/
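To give a concrete idea of the "image to feature vector" step, here is a minimal sketch using OpenCV's HOG descriptor, in the spirit of the linked tutorial. The 32x32 size, the HOG parameters, and the letter.png filename are placeholder assumptions, not tuned values:

import cv2

# Load one letter image in grayscale and bring it to a fixed size.
img = cv2.imread("letter.png", cv2.IMREAD_GRAYSCALE)   # placeholder filename
img = cv2.resize(img, (32, 32))

# HOG descriptor: window size, block size, block stride, cell size, number of bins.
hog = cv2.HOGDescriptor((32, 32), (16, 16), (8, 8), (8, 8), 9)
feature_vector = hog.compute(img).flatten()            # 1-D vector you can feed to a classifier (e.g. SVM)
print(feature_vector.shape)

Doing this for all 15 x 19 images gives you a feature matrix plus a label vector, which is what most scikit-learn classifiers expect.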
I am trying to detect switches from one language to another within a sentence.
For instance, with the sentence "Je parle francais as well as English", I would like to return both "Fr" and "En".
So far, it seems that the tools available focus on the sentence as a whole.
I also tried working with langdetect's detect_langs as it returns a probability and is thus able to return two potential languages. However, I have found it to be quite inaccurate for this task.
This brought me to think about creating my own model with Keras.
However, I am far from mastering that tool and have many questions about what can and cannot be done with it.
Which is why I was wondering whether I can train a model on data where elements have two labels. That is, can I feed it a sentence containing two languages, with the labels French and English? I would do so while also training it on sentences that are only in French or only in English.
Does this make any sense? Are there otherwise alternative ways to go about it?
On top of that, if anyone has suggestions on how to conduct this language detection task, please share them, even if they do not involve Keras.
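(For what it's worth, framing this as multi-label classification in Keras is possible: each sentence gets a 0/1 label per language, so a mixed sentence can be both French and English at once. Below is a minimal hedged sketch; the toy sentences, the TF-IDF vectorisation, and the layer sizes are placeholders, not a recommended architecture.)

import numpy as np
import tensorflow as tf

# Toy data: label columns are (French, English); [1, 1] marks a code-switched sentence.
texts = np.array(["Je parle francais as well as English",
                  "I only speak English",
                  "Je parle francais"])
labels = np.array([[1, 1], [0, 1], [1, 0]], dtype="float32")

# Turn sentences into fixed-length TF-IDF vectors.
vectorizer = tf.keras.layers.TextVectorization(max_tokens=1000, output_mode="tf_idf")
vectorizer.adapt(texts)
X = vectorizer(texts)

# Sigmoid outputs + binary cross-entropy: each label is predicted independently,
# so both languages can be "on" at the same time (unlike softmax).
model = tf.keras.Sequential([
    tf.keras.layers.Dense(32, activation="relu"),
    tf.keras.layers.Dense(2, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["binary_accuracy"])
model.fit(X, labels, epochs=10, verbose=0)
print(model.predict(X))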
I am working with Indonesian data for NER, and as far as I can tell, there is no pretrained NLTK model for this language. So, to do this manually, I tried to extract all the unique words used in the entire data frame. I still don't know how to apply tags to the words, but this is what I did so far:
[code for the first, second, third, and fourth steps not reproduced here]
Please let me know if there is any more convenient way to do what I did in the code above. Also, let me know how to add tags to each row (if possible) and how to do NER on this data.
(I am new to coding, which is why I'm not sure how to ask, but I am trying my best to provide as much information as possible.)
Depending on what you want to do: if results are all that matters, you could use a pretrained transformer model from Hugging Face instead of NLTK. This will be more computationally heavy, but it will also give you better performance.
There is one fitting model I could find (I don't speak Indonesian, obviously, so excuse any errors in the sample sentence):
https://huggingface.co/cahya/xlm-roberta-large-indonesian-NER?text=Nama+saya+Peter+dan+saya+tinggal+di+Berlin.
The easiest way to use this would probably be either the API or an inference-only pipeline; check out this guide. All you would have to do to get this running for the Indonesian model is replace the previous model path (dslim/bert-base-NER) with cahya/xlm-roberta-large-indonesian-NER.
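A minimal usage sketch with the transformers pipeline could look like this; the sample sentence is the one from the model card, and the aggregation setting is just a convenience assumption:

from transformers import pipeline

# Token-classification (NER) pipeline with the Indonesian model linked above.
ner = pipeline(
    "ner",
    model="cahya/xlm-roberta-large-indonesian-NER",
    aggregation_strategy="simple",   # merge sub-word tokens into whole entities
)

print(ner("Nama saya Peter dan saya tinggal di Berlin."))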
Note that this Indonesian model is quite large, so you need some decent hardware. If you don't have it, you could alternatively use a (free) cloud computing service such as Google Colab.
Hi there ✌🏼 I am making a neural network that classifies text. First I need to prepare the text, and I ran into the problem of misspelled words. How can they be found and corrected? What ideas do you have? Thanks in advance!
You can correct spelling errors by maintaining a vocabulary and finding the closest valid word using a string metric like the Levenshtein distance. There are also some more advanced Python tools, like spaCy Hunspell. That being said, if you plan to use pre-trained word embeddings, I wouldn't worry too much about text normalisation, as the embeddings will likely contain most common spelling variants. You can check how many out-of-vocabulary words you have in your data to see whether it's worth investing time in extra cleaning beyond basic tokenisation (and converting everything to lowercase).
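As a rough illustration of the vocabulary-plus-edit-distance idea, here is a self-contained sketch; the tiny vocabulary and the helper names are made up for the example:

def levenshtein(a, b):
    """Classic dynamic-programming edit distance between two strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,                # deletion
                curr[j - 1] + 1,            # insertion
                prev[j - 1] + (ca != cb),   # substitution
            ))
        prev = curr
    return prev[-1]

def correct(word, vocabulary):
    """Return the in-vocabulary word with the smallest edit distance."""
    if word in vocabulary:
        return word
    return min(vocabulary, key=lambda v: levenshtein(word, v))

vocab = {"classify", "network", "neural", "text"}
print(correct("nueral", vocab))   # -> "neural"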
I have a dataset with 12000+ data points and 25 features, of which the last feature is the class label. This is a classification problem. Now I want to convert every data point into an image, but I have no idea how to do that. Please help. I work in Python. If anyone could provide sample code, I would be grateful. Thanks in advance.
There is already some work on that: you can use either Gramian Angular Fields (GAF) or Markov Transition Fields (MTF); a good description is in "Imaging Time-Series to Improve Classification and Imputation". Some other works have used recurrence plots, such as "Deep-Gap: deep learning framework". Imaging time series is an interesting way to think about them, because it lets you apply e.g. CNNs easily. But which method would you like to use? By the way, be aware this might not be an "efficient" way to classify time series :)
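If you want to try the GAF route, one possible starting point is the pyts library (my suggestion, not something from the paper); the sketch below assumes each row of 24 feature values (label column dropped) is treated as a short pseudo-time-series:

import numpy as np
from pyts.image import GramianAngularField

# Stand-in for the real data: rows x 24 feature columns.
X = np.random.rand(100, 24)

gaf = GramianAngularField(image_size=24, method="summation")
X_img = gaf.fit_transform(X)      # shape (100, 24, 24): one 24x24 image per data point
print(X_img.shape)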
I have a dataset which has items with the following layout/schema:
{
    words: "Hi! How are you? My name is Helennastica",
    ratio: 0.32,
    importantNum: 382,
    wordArray: ["dog", "cat", "friend"],
    isItCorrect: false,
    type: 2
}
where I have a lot of different types of data, including:
Arrays (of one type only, e.g. an array of strings or an array of numbers, never both)
Booleans
Numbers with a fixed min/max (i.e. on a scale of 0 to 1)
Unbounded integers (any integer from -∞ to ∞)
Strings, containing a mix of dictionary words and new words
The task is to create an RNN (well, more generally, a system that can quickly retrain when given one extra piece of data instead of reprocessing it all - I think an RNN is the best choice; see below for my reasoning) which can use all of these factors to categorise each item into one of 4 categories - labelled by the type key in the example above, a number from 0 to 3.
I have a large set of examples in the above format (with the answer provided), and I have a database filled with uncategorised examples. My intention is to run the ML model on that set and sort all of them into categories. The reason I need to be able to retrain quickly is the feedback feature: if the AI gets something wrong, any user can report it, in which case that specific JSON will be added to the dataset. Obviously, having to retrain on 1000+ JSONs just to add one extra would take ages - if I am not mistaken, an RNN can get around this.
I have found many possible use cases for something like this, yet I have spent literal hours browsing through GitHub trying to find an implementation, or some TensorFlow module/add-on to make this easier or to copy, but to no avail.
I assume this would not be too difficult using TensorFlow, and I understand a bit of the maths and logic behind it (but I'm not formally educated, so I probably have gaps!), but unfortunately I have essentially no experience with TensorFlow or any other ML framework (beyond copy-pasting code for some other projects). If someone could point me in the right direction in the form of a GitHub repo/Python framework, or even write some demo code to help solve this problem, it would be greatly appreciated. And if you're just going to correct some of my technical knowledge/tell me where I've gone horrendously wrong, I'd appreciate that feedback too (just leave it as a comment).
Thanks in advance!
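As a hedged baseline sketch (not an RNN, and not necessarily the best model): an online learner such as scikit-learn's SGDClassifier with partial_fit covers the "update with one extra example without full retraining" requirement. The flattening function, the hashing sizes, and the toy items below are all assumptions, just to show the shape of a solution:

import numpy as np
from sklearn.feature_extraction import FeatureHasher
from sklearn.linear_model import SGDClassifier

# Toy stand-ins for the real dataset, following the schema in the question.
labelled_items = [
    {"words": "Hi! How are you? My name is Helennastica", "ratio": 0.32,
     "importantNum": 382, "wordArray": ["dog", "cat", "friend"],
     "isItCorrect": False, "type": 2},
    {"words": "Another example sentence", "ratio": 0.81,
     "importantNum": -5, "wordArray": ["tree", "leaf"],
     "isItCorrect": True, "type": 0},
]

hasher = FeatureHasher(n_features=64, input_type="string")

def flatten(item):
    """Turn one item from the schema above into a fixed-length numeric vector."""
    text_vec = hasher.transform([item["words"].split()]).toarray()[0]
    array_vec = hasher.transform([item["wordArray"]]).toarray()[0]
    scalars = np.array([item["ratio"], item["importantNum"], float(item["isItCorrect"])])
    return np.concatenate([scalars, text_vec, array_vec])

clf = SGDClassifier()                         # linear model trained online with SGD
X = np.stack([flatten(it) for it in labelled_items])
y = np.array([it["type"] for it in labelled_items])
clf.partial_fit(X, y, classes=[0, 1, 2, 3])   # initial pass over the labelled set

# Later: a user reports one corrected item; update with just that example,
# no full retraining needed.
corrected = dict(labelled_items[0], type=3)
clf.partial_fit(flatten(corrected).reshape(1, -1), [corrected["type"]])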