So I have a basic question which I did not get answered by reading the documentation.
Suppose I want to do text classification (sentiment) on, let's say, an article about some topic.
I already have the plain text without the HTML stuff through Python libraries. Is it better to run an analysis on each sentence and then combine the results, or just pass the whole text as one string and get an already combined result? (I have already tried the whole-text option with flair and it worked quite well, I guess.)
The next thing would be how to check whether the sentences are actually about the given topic.
If you could give me some guidelines or hints on how to approach these problems, I would be happy.
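If you do go sentence by sentence, you still need a rule for combining the per-sentence results. A minimal sketch of two common combination schemes; `score_sentence` here is only a toy stand-in for whatever real classifier (e.g. flair) you use:

```python
def score_sentence(sentence):
    # Toy stand-in for a real sentiment classifier: returns an
    # integer polarity based on a tiny hand-made word list.
    positive = {"good", "great", "excellent"}
    negative = {"bad", "poor", "terrible"}
    tokens = sentence.lower().split()
    return sum((t in positive) - (t in negative) for t in tokens)

def document_sentiment(sentences):
    scores = [score_sentence(s) for s in sentences]
    mean = sum(scores) / len(scores)                          # average polarity
    majority = sum(s > 0 for s in scores) > len(scores) / 2   # majority vote
    return mean, majority

mean, majority = document_sentiment(["The article is great", "The layout is bad"])
```

Averaging keeps gradations ("mostly positive"), while majority voting gives a hard answer per document; which behaves better depends on how mixed your articles are.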
In my spare time, I am transcribing a very old, rare book written in Romanian (in fact, it is the only remaining copy, to my knowledge). It was written over a hundred years ago, well before any computers existed. As such, no digital copies exist, and I am manually transcribing and digitizing it.
The book is thousands of pages long, and it is surprisingly time-consuming (for me, at least) to add diacritic and accent marks (ă/â/î/ş/ţ) to every single word as I type. If I omit the marks and just type the bare letters (e.g. a instead of ă/â), I am able to type more than twice as fast, which is a huge benefit. Currently I am typing everything directly into a .tex file to apply special formatting for the pages and illustrations.
However, I know that eventually I will have to add all these marks back into the text, and it seems tedious/unnecessary to do all that manually, since I already have all the letters. I'm looking for some way to automatically/semi-automatically ADD diacritic/accent marks to a large body of text (not remove - I see plenty of questions asking how to remove the marks on SO).
I tried searching for large corpora of Romanian words (this and this were the most promising two), but everything I found fell short, missing at least a few words on any random sample of text I fed it (I used a short Python script). It doesn't help that the book uses many archaic/uncommon words or uncommon spellings of words.
Does anyone have any ideas on how I might go about this? There are no dumb ideas here - any document format, machine learning technique, coding language, professional tool, etc that you can think of that might help is appreciated.
I should also note that I have substantial coding experience, and would not consider it a waste of time to build something myself. Tbh, I think it might be beneficial to the community, since I could not find such a tool for any Western language (French, Czech, Serbian, etc.). Just need some guidance on how to get started.
What comes to my mind is a simple replacement. About 10% of the words are differentiated only by the diacritics, e.g. abandona and abandonă; those will not be fixed. But the other 90% will be.
const dictUrl = 'https://raw.githubusercontent.com/ManiacDC/TypingAid/master/Wordlists/Wordlist%20Romanian.txt';
async function init() {
  console.log('init');
  const response = await fetch(dictUrl);
  const text = await response.text();
  console.log(`${text.length} characters`);

  const words = text.split(/\s+/mg);
  console.log(`${words.length} words`);

  // Map each stripped (diacritic-free) form to its diacritised spellings.
  // A null-prototype object avoids collisions with Object.prototype keys.
  const denormalize = Object.create(null);
  let unique_count = 0;
  for (const w of words) {
    const nw = w.normalize('NFD').replace(/[^a-z]/ig, '');
    if (!(nw in denormalize)) {
      denormalize[nw] = [];
      unique_count += 1;
    }
    denormalize[nw].push(w);
  }
  console.log(`${unique_count} unique normalized words`);

  for (const el of document.querySelectorAll('textarea')) {
    handleSpellings(el, denormalize);
  }
}

function handleSpellings(el, dict) {
  el.addEventListener('keypress', function (e) {
    if (e.key == ' ')
      setTimeout(function () {
        const restored = el.value.replace(
          /\b\S+(?=[\x20-\x7f])/g,
          (s) => {
            // Replace the typed word with the first diacritised candidate, if any.
            const s2 = dict[s] ? dict[s][0] : s;
            console.log([s, dict[s], s2]);
            return s2;
          }
        );
        el.value = restored;
      }, 0);
  });
}
window.addEventListener('load', init);
<body>
  <textarea rows="10" cols="40" style="width: 40em; height: 10em;">
  </textarea>
</body>
Bob's answer is a static approach whose success will depend on how good the word list is.
So if a word is missing from that list, it will never be handled.
Moreover, as in many other languages, there are cases where two (or more) words exist with the same base characters but different diacritics.
For Romanian I found the following example: peste = over vs. peşte = fish.
These cases cannot be handled in a straightforward way either.
This is especially an issue if the text you're converting contains words which aren't used in today's language anymore, especially diacritised ones.
In this answer I will present an alternative using machine learning.
The only caveat to this is that I couldn't find a publicly available trained model doing diacritic restoration for Romanian.
You may find some luck in contacting the authors of the papers I will mention here to see if they'd be willing to send their trained models for you to use.
Otherwise, you'll have to train yourself, which I'll give some pointers on.
I will try to give a comprehensive overview to get you started, but further reading is encouraged.
Although this process may be laborious, it can give you 99% accuracy with the right tools.
Language Model
The language model is a model which can be thought of as having a high-level "understanding" of the language.
It's typically pre-trained on raw text corpora.
Although you can train your own, be wary that these models are quite expensive to pre-train.
Whilst multilingual models can be used, language-specific models typically fare better if trained with enough data.
Luckily, there are publicly available language models for Romanian, such as RoBERT.
This language model is based on BERT, an architecture used extensively in Natural Language Processing; it is more or less the standard in the field, as it attained state-of-the-art results in English & other languages.
In fact there are three variants: base, large, & small.
The larger the model, the better the results, due to the larger representation power.
But larger models will also have a higher footprint in terms of memory.
Loading these models is very easy with the transformers library.
For instance, the base model:
from transformers import AutoModel, AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("readerbench/RoBERT-base")
model = AutoModel.from_pretrained("readerbench/RoBERT-base")
inputs = tokenizer("exemplu de propoziție", return_tensors="pt")
outputs = model(**inputs)
The outputs above will contain vector representations of the inputted texts, more commonly known as "word embeddings".
Language models are then fine-tuned to a downstream task — in your case, diacritic restoration — and would take these embeddings as input.
Fine-tuning
I couldn't find any publicly available fine-tuned models for this task, so unless you track one down, you'll have to fine-tune your own.
To fine-tune a language model, we need to build a task-specific architecture which will be trained on some dataset.
The dataset is used to tell the model how the input is & how we'd like the output to be.
Dataset
From Diacritics Restoration using BERT with Analysis on Czech language, there's a publicly available dataset for a number of languages including Romanian.
The dataset annotations will also depend on which fine-tuning architecture you use (more on that below).
In general, you'd choose a dataset which you trust to have high-quality diacritics.
From this text you can then build annotations automatically by producing the undiacritised variants of the words as well as the corresponding labels.
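Producing the undiacritised side of each training pair is straightforward with Unicode normalization: NFD decomposition splits letters like ă into a base letter plus a combining mark, which can then be dropped. A small sketch:

```python
import unicodedata

def strip_diacritics(text):
    # NFD decomposes e.g. 'ă' into 'a' + a combining breve; drop the marks
    decomposed = unicodedata.normalize('NFD', text)
    return ''.join(ch for ch in decomposed if not unicodedata.combining(ch))

def make_pair(diacritised):
    # (model input, gold label) for the restoration task
    return strip_diacritics(diacritised), diacritised

print(make_pair('propoziție'))  # ('propozitie', 'propoziție')
```

Running this over a trusted corpus gives aligned (undiacritised, diacritised) pairs for free.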
Keep in mind that this or any other dataset you'll use will contain biases especially in terms of the domain the annotated texts originate from.
Depending on how much data you have already transcribed, you may also want to build a dataset using your texts.
Architecture
The architecture you choose will have a bearing on the downstream performance you get & the amount of custom code you'll have to write.
Word-level
The aforementioned work, Diacritics Restoration using BERT with Analysis on Czech language, uses a token-level classification mechanism where each word is labelled with a set of instructions for which type of diacritic mark to insert at which character index.
For example, the undiacritised word "dite" with instruction set 1:ACUTE;3:CARON indicates adding the appropriate diacritic marks at index 1 and index 3 to result in "dítě".
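To make the encoding concrete, here's a toy decoder for such an instruction set (the mark names and the `idx:MARK;idx:MARK` format are assumptions drawn from the example above, not the paper's exact code):

```python
import unicodedata

# Combining marks for the instruction names used in the example
MARKS = {'ACUTE': '\u0301', 'CARON': '\u030C'}

def apply_instructions(word, instructions):
    chars = list(word)
    if instructions:
        for item in instructions.split(';'):
            idx, mark = item.split(':')
            # Attach the combining mark, then recompose into a single character
            chars[int(idx)] = unicodedata.normalize('NFC', chars[int(idx)] + MARKS[mark])
    return ''.join(chars)

print(apply_instructions('dite', '1:ACUTE;3:CARON'))  # dítě
```

The model only has to predict these small instruction sets per token; reconstruction is then deterministic.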
Since this is a token-level classification task, there's not much custom code you have to do, as you can directly use a BertForTokenClassification.
Refer to the authors' code for a more complete example.
One sidenote is that the authors use a multilingual language model.
This can be easily replaced with another language model such as RoBERT mentioned above.
Character-level
Alternatively, the RoBERT paper uses a character-level model.
From the paper, each character is annotated as one of the following:
make no modification to the current character (e.g., a → a), add circumflex mark (e.g., a → â and i → î), add breve mark (e.g., a → ă), and two more classes for adding comma below (e.g., s → ş and t → ţ)
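As a concrete illustration of that per-character labelling scheme (the label names here are my own shorthand, not the paper's):

```python
import unicodedata

# One combining mark per predicted class (plus an implicit "no change" class)
COMBINING = {'CIRCUMFLEX': '\u0302', 'BREVE': '\u0306', 'COMMA_BELOW': '\u0326'}

def apply_labels(word, labels):
    out = []
    for ch, label in zip(word, labels):
        if label == 'NONE':
            out.append(ch)  # class "make no modification"
        else:
            # Attach the predicted combining mark and recompose
            out.append(unicodedata.normalize('NFC', ch + COMBINING[label]))
    return ''.join(out)

labels = ['NONE', 'NONE', 'COMMA_BELOW', 'NONE', 'NONE', 'BREVE']
print(apply_labels('masina', labels))  # mașină
```

So the model emits one class per input character, and the diacritised word is rebuilt mechanically from those predictions.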
Here you will have to build your own custom model (instead of the BertForTokenClassification above).
But, the rest of the training code will largely be the same.
Here's a template for the model class which can be used using the transformers library:
from transformers import BertModel, BertPreTrainedModel

class BertForDiacriticRestoration(BertPreTrainedModel):
    def __init__(self, config):
        super().__init__(config)
        self.bert = BertModel(config)
        ...

    def forward(
        self,
        input_ids=None,
        attention_mask=None,
        token_type_ids=None
    ):
        ...
Evaluation
In each section there's a plethora of options for you to choose from.
A bit of pragmatic advice I'll offer is to start simple & complicate things if you want to improve things further.
Keep a testing set to measure if the changes you're making result in improvements or degradation over your previous setup.
Crucially, I'd suggest that at least a small part of your testing set comes from the texts you have transcribed yourself; the more you use, the better.
Primarily, this is data you annotated yourself, so you can be more sure of its quality than of any other publicly available source.
Secondly, when you test on data from the target domain, you stand a better chance of evaluating your systems accurately on your actual task, since data from other domains may carry biases of its own.
I'm trying to write a Python program that will decide if a given post is about the topic of volunteering. My data sets are small (only the posts, which are examined one by one), so approaches like LDA do not yield results.
My end goal is a simple True/False, a post is about the topic or not.
I'm trying this approach:
Using Google's word2vec model, I'm creating a "cluster" of words that are similar to the word: "volunteer".
CLUSTER = [x[0] for x in MODEL.most_similar_cosmul("volunteer", topn=120)]
Getting the posts and translating them to English, using Google translate.
Cleaning the translated posts using NLTK (removing stopwords and punctuation, and lemmatizing the post)
Making a BOW out of the translated, clean post.
This stage is difficult for me. I want to calculate a "distance" / "similarity" / something that will help me get the True/False answer that I'm looking for, but I can't think of a good way to do that.
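(One naive way to turn the cluster + BOW into a True/False decision is to threshold the fraction of the post's tokens that fall inside the cluster; a rough sketch, where the 0.05 threshold is an arbitrary placeholder that would need tuning on labelled examples:)

```python
def topic_score(bow, cluster):
    # bow: {token: count} for one post; cluster: set of words similar to "volunteer"
    overlap = sum(count for token, count in bow.items() if token in cluster)
    total = sum(bow.values())
    return overlap / total if total else 0.0

def is_about_topic(bow, cluster, threshold=0.05):
    # True if enough of the post's vocabulary lies inside the topic cluster
    return topic_score(bow, cluster) >= threshold
```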
Thank you for your suggestions and help in advance.
You are attempting to intuitively improvise a set of steps that, in the end, will classify these posts into the two categories, "volunteering" and "not-volunteering".
You should look for online examples that do "text classification" and are similar to your task, work through them (with their original demo data) for understanding, then adapt them incrementally to work with your data instead.
At some point, word2vec might be a helpful contributor to your task - but I wouldn't start with it. Similarly, eliminating stop-words, performing lemmatization, etc might eventually be helpful, but need not be important up front.
You'll typically want to start by acquiring (by hand-labeling if necessary) a training set of text for which you know the "volunteering" or "not-volunteering" value (known labels).
Then, create some feature-vectors for the texts – A simple starting approach that offers a quick baseline for later improvements is a "bag of words" representation.
Then, feed those representations, with the known-labels, to some existing classification algorithm. The popular scikit-learn package in Python offers many. That is: you don't yet need to be worrying about choosing ways to calculate a "distance" / "similarity" / something that will guide your own ad hoc classifier. Just feed the labeled data into one (or many) existing classifiers, and check how well they're doing. Many will be using various kinds of similarity/distance calculations internally - but that's automatic and explicit from choosing & configuring the algorithm.
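To make that workflow concrete, here is a minimal scikit-learn sketch; the posts and labels below are invented purely for illustration:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Hand-labelled toy training set (invented for illustration)
posts = [
    "we need volunteers to help at the shelter this weekend",
    "join our volunteer program and give back to the community",
    "the new phone was released yesterday with a better camera",
    "stock prices fell sharply after the earnings report",
]
labels = [True, True, False, False]  # volunteering or not

# Bag-of-words features fed straight into an off-the-shelf classifier
clf = make_pipeline(CountVectorizer(), LogisticRegression())
clf.fit(posts, labels)

print(clf.predict(["looking for volunteers for the food drive"]))
```

With real data you'd swap in your labelled posts; the pipeline shape stays the same when you later try other featurizers or classifiers.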
Finally, when you have something working start-to-finish, no matter how modest in results, then try alternate ways of preprocessing text (stop-word removal, lemmatization, etc), featurizing text, and alternate classifiers/algorithm parameterizations - to compare results, and thus discover what works well given your specific data, goals, and practical constraints.
The scikit-learn "Working With Text Data" guide is worth reviewing & working-through, and their "Choosing the right estimator" map is useful for understanding the broad terrain of alternate techniques and major algorithms, and when different ones apply to your task.
Also, scikit-learn contributors/educators like Jake Vanderplas (github.com/jakevdp) and Olivier Grisel (github.com/ogrisel) have many online notebooks/tutorials/archived-video-presentations which step through all the basics, often including text-classification problems much like yours.
I have been trying to pull out financial statements embedded in annual-report PDFs and export them in Excel/CSV format using Python, but I am encountering some problems:
1. A specific financial statement can be on any page in the report. If I were to process hundreds of PDFs, I would have to specify page numbers, which takes a lot of time. Is there any way the scraper can know where the exact statement is?
2. Some statements span multiple pages, and the end result after scraping such a PDF isn't what I want.
3. Different annual reports have different financial statement formats. Is there any way to process them and convert them to a single standard format?
I would also appreciate it if anyone who has done something like this could share examples.
P.S. I am working with Python and have used tabula and Camelot.
I had a similar case where the problem was to extract specific form information from PDFs (name, date of birth, and so on). I used the Tesseract open-source software with pytesseract to perform OCR on the files.

Since I did not need the whole PDFs, but only specific information from them, I designed an algorithm to find that information. In my case I used simple heuristics (specific fields, specific line numbers, and some other domain-specific stuff), but you can also take a machine-learning approach and train a classifier which finds the needed text parts. You could use domain-specific heuristics as well, because I am sure that a financial statement has special vocabulary or some text markers which indicate its beginning/end.
I hope I could at least give you some ideas on how to approach the problem.
P.S.: With Tesseract you can also process multi-page PDFs. Regarding 3): a machine-learning approach would need some samples to learn a good generalization of what a financial statement may look like.
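The "text markers" idea might look something like this in Python (the marker list and scoring are illustrative guesses that would need tuning per report style):

```python
# Score each page's extracted text by how many statement-specific markers
# it contains, then pick the highest-scoring page.
MARKERS = ["balance sheet", "total assets", "total liabilities",
           "shareholders' equity", "cash flow from operating activities"]

def score_page(text):
    lower = text.lower()
    return sum(marker in lower for marker in MARKERS)

def find_statement_page(pages):
    # pages: one plain-text string per PDF page (e.g. from tesseract/tabula)
    scores = [score_page(p) for p in pages]
    best = max(range(len(pages)), key=scores.__getitem__)
    return best if scores[best] > 0 else None
```

This addresses problem 1 (locating the statement) without hard-coding page numbers; multi-page statements could be handled by also taking adjacent pages whose scores stay above some threshold.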
I am trying to create an image classifier based on RandomTrees from OpenCV (version 2.4). Right now, I initialize everything likes this:
self.model = cv.RTrees()
max_num_trees = 10
max_error = 1
max_d = 5
criteria = cv.TERM_CRITERIA_MAX_ITER + cv.TERM_CRITERIA_EPS
parameters = dict(max_depth=max_d, min_sample_count=5, use_surrogates=False,
                  nactive_vars=0, term_crit=(criteria, max_num_trees, max_error))
self.model.train(dataset, cv.CV_ROW_SAMPLE, responses, params=parameters)
I did it by looking at this question. The only problem is, whatever I change in the parameters, the classification always stays the same (and wrong). Since the Python documentation on this is very scarce, I have no choice but to ask here what to do and how to check what I am doing. How do I get the number of trees it generates, and all the other things that are explained for C++ but not for Python, like the training error? For example, I tried:
self.model.tree_count
self.model.get_tree_count()
but got an error every time. Also, am I doing the termination criteria initialization correctly?