Direction needed: finding terms in a corpus

Direction needed: finding terms in a corpus - python

My question isn't about a specific code issue, but rather about the best direction to take on a Natural Language Processing challenge.
I have a collection of several hundreds of Word and PDF files (from which I can then export the raw text) on the one hand, and a list of terms on the other. Terms can consist in one or more words. What I need to do is identify in which file(s) each term is used, applying stemming and lemmatization.
How would I best approach this? I know how to extract text, apply tokenization, lemmatization, etc., but I'm not sure how I could search for occurrences of terms in lemmatized form inside a corpus of documents.
Any hint would be most welcome.
Thanks!

I would suggest you create an inverted index of the documents, where you record the locations of each word in a list, with the word form as the index.
Then you create a secondary index, where you have as key the lemmatised forms, and as values a list of forms that belong to the lemma.
When you do a lookup of a lemmatised word, eg go, you go to the secondary index to retrieve the inflected forms, go, goes, going, went; next you go to the inverted index and get all the locations for each of the inflected forms.
For ways of implementing this in an efficient manner, look at Witten/Moffat/Bell, Managing Gigabytes, which is a fantastic book on this topic.
UPDATE: for multi-word units, you can work those out in the index. Looking for "software developer", look up "software" and "developer", and then merge the locations: everytime their location differs by 1, they are adjacent and you have found it. Discard all the ones where they are further apart.

Related

Use word2vec to expand a glossary in order to classify texts

I have a database containing about 3 million texts (tweets). I put clean texts (removing stop words, tags...) in a list of lists of tokens called sentences (so it contains a list of tokens for each text).
After these steps, if I write
model = Word2Vec(sentences, min_count=1)
I obtain a vocabulary of about 400,000 words.
I have also a list of words (belonging to the same topic, in this case: economics) called terms. I found that 7% of the texts contain at least one of these words (so we can say that 7% of total tweets talk about economics).
My goal is to expand the list terms in order to retrieve more texts belonging to the economic topic.
Then I use
results = model.most_similar(terms, topn=5000)
to find, within the list of lists of tokens sentences, the words most similar to those contained in terms.
Finally if I create the data frame
df = pd.DataFrame(results, columns=['key', 'similarity'])
I get something like that:
key similarity
word1 0.795432
word2 0.787954
word3 0.778942
... ...
Now I think I have two possibilities to define the expanded glossary:
I take the first N words (what should be the value of N?);
I look at the suggested words one by one and decide which one to include in the expanded glossary based on my knowledge (does this word really belong to the economic glossary?)
How should I proceed in a case like this?

There's no general answer for what the cutoff should be, or how much you should use your own manual judgement versus cruder (but fast/automatic) processes. Those are inherently decisions which will be heavily influenced by your data, model quality, & goals – so you have to try different approaches & see what works there.
If you had a goal for what percentage of the original corpus you want to take – say, 14% instead of 7% – you could go as deeply into the ranked candidate list of 'similar words' as necessary to hit that 14% target.
Note that when you retrieve model.most_similar(terms), you are asking the model to 1st average all words in terms together, then return words close to that one average point. To the extent your seed set of terms is tightly around the idea of economics, that might find words close to that generic average idea – but might not find other interesting words, such as close sysnonyms of your seed words that you just hadn't thought of. For that, you might want to get not 5000 neighbors for one generic average point, but (say) 3 neighbors for every individual term. To the extent the 'shape' of the topic isn't a perfect sphere around someplace in the word-vector-space, but rather some lumpy complex volume, that might better reflect your intent.
Instead of using your judgement of the candidate words standing alone to decide whether a word is economics-related, you could instead look at the texts that a word uniquely brings in. That is, for new word X, look at the N texts that contain that word. How many, when applying your full judgement to their full text, deserve to be in your 'economics' subset? Only if it's above some threshold T would you want to move X into your glossary.
But such an exercise may just highlight: using a simple glossary – "for any of these hand-picked N words, every text mentioning at least 1 word is in" – is a fairly crude way of assessing a text's topic. There are other ways to approach the goal of "pick a relevant subset" in an automated way.
For example, you could view your task as that of training a text binary classifier to classify texts as 'economics' or 'not-economics'.
In such a case, you'd start with some training data - a set of example documents that are already labeled 'economics' or 'not-economics', perhaps via individual manual review, or perhaps via some crude bootstrapping (like labeling all texts with some set of glossary words as 'economics', & all others 'not-economics'). Then you'd draw from the full range of potential text-preprocessing, text-feature-extracton, & classification options to train & evaluate classifiers that make that judgement for you. Then you'd evaluate/tune those – a process wich might also improve your training data, as you add new definitively 'economics' or 'not-economics' texts – & eventually settle on one that works well.
Alternatively, you could use some other richer topic-modeling methods (LDA, word2vec-derived Doc2Vec, deeper neural models etc) for modeling the whole dataset, then from some seed-set of definite-'economics' texts, expand outward from them – finding nearest-examples to known-good documents, either auto-including them or hand-reviewing them.
Separately: min_count=1 is almost always a mistake in word2vec & related algorihtms, which do better if you discard words so rare they lack the variety of multiple usage examples the algorithm needs to generate good word-vectors.

How Lucene positional index works so efficiently?

Usually any search engine software creates inverted indexes to make searches faster. The basic format is:-
word: <docnum ,positions>, <docnum ,positions>, <docnum ,positions> .....
Whenever there is a search query inside quote like "Harry Potter Movies" it means there should be exact match of positions of word and in searches like within k word queries like hello /4 world it generally means that find the word world in the range of 4 word distance either in left or right from the word hello. My question is that we can employ solution like linearly checking the postings and calculating distances of words like in query, but if collection is really large we can't really search all the postings. So is there any other data structure or kind of optimisation lucene or solr uses?
One first solution can be only searching some k postings for each word. Other solution can be only searching top docs(usually called champion list sorted by tf-idf or similar during indexing), but more better docs can be ignored. Both solutions have some disadvantage, they both don't ensure quality. But in Solr server we get assured quality of results even in large collections. How?

The phrase query you are asking about here is actually really efficient to compute the positions of, because you're asking for the documents where 'Harry' AND 'Potter' AND 'Movies' occur.
Lucene is pretty smart, but the core of its algorithm for this is that it only needs to visit the positions lists of documents where all three of these terms even occur.
Lucene's postings are also sharded into multiple files:
Inside the counts-files are: (Document, TF, PositionsAddr)+
Inside the positions-files are: (PositionsArray)
So it can sweep across the (doc, tf, pos_addr) for each of these three terms, and only consult the PositionsArray when all three words occur in the specific document. Phrase queries have the opportunity to be really quick, because you only visit at most all the documents from the least-frequent term.
If you want to see a phrase query run slowly (and do lots of disk seeks!), try: "to be or not to be" ... here the AND part doesn't help much because all the terms are very common.

Consolidating and comparing the text per document

I just started learning how NLP works. What I can do right now is to get the number of frequency of a specific word per document. But what I'm trying to do is to compare the four documents that I have to compare their similarities and different as well as displaying the words that are similar and the words that is unique to each document.
My documents are in .csv format imported using pandas. As each row has their own sentiment.

To be honest, the question you're asking is very high level and difficult (maybe impossible) to answer on a forum like this. So here are some ideas that might be helpful:
You could try to use [term frequency–inverse document frequency (TFIDF)] (https://en.wikipedia.org/wiki/Tf%E2%80%93idf) to compare the vocabularies for similarities and differences. This is not a large step from your current word-frequency analysis.
For a more detailed analysis, it might be a good idea to substitute the words of your documents with something like wordnet's synsets. This makes it possible to compare the sentence meanings at a higher level of abstraction than the actual words themselves. For example, if each of your documents mentions "planes", "trains", and "automobiles", there is an underlying similarity (vehicle references) that a simple word comparison will ignore not be able to detect.

Automatic tagging of words or phrases

I want to automatically tag a word/phrase with one of the defined words/phrases from a list. My list contains about 230 words in columnA which are tagged in columnB. There are around 16 unique tags and every of those 230 words are tagged with one of these 16 tags.
Have a look at my list:
The words/phrases in column A are tagged as words/phrases in column B.
From time to time, new words are added for which tag has to be given manually.
I want to build a predictive algorithm/model to tag new words automatically(or suggest). So if I write a new word, let say 'MIP Reserve' (A36), then it should predict the tag as 'Escrow Deposits'(B36) and not 'Operating Reserve'(B33). How should I predict the tags of new word precisely even if the words do not match with the words in its actual tag?
If someone is willing to see the full list, I can happily share.

Short version
I think your question is a little ill-defined and doesn't have a short coding or macro answer. Given that each item contains such little information, I don't think it is possible to build a good predictive model from your source data. Instead, do the tagging exercise once and look at how you control tagging in the future.
Long version
Here are the steps I would take to create a predictive model and why I don't think you can do this.
Understand why you want to have a predictive program at all
Why do you need a predictive program? Are you sorting through hundreds or thousands of records, all of which are changing and need tagging? If so, I agree, you wouldn't want to do this manually.
If this is a one-off exercise, because over time the tags have become corrupted from their original meaning, your problem is that your tags have become corrupted, not that you need to somehow predict where each item should be tagged. You should be looking at controlling use of the tags, not at predicting how people in the future might mistag or misname something.
Don't forget that there are lots of tools in Excel to make the problem easier. Let's say you know for certain that all items with 'cash' definitely go to 'Operating Cash'. Put an AutoFilter on the list and filter on the word 'cash' - now just copy and paste 'Operating Cash' next to all of these. This way, you can quickly get rid of the obvious ones from your list and focus on the tricky ones.
Understand the characteristics of the tags you want to use.
Take time to look at the tags you are using - what do each of them mean? What are the unique features or combinations of features that this tag is representing?
For example, your tag 'Operating Cash' carries the characteristics of being cash (i.e. not tied up so available for use fairly quickly) and as being earmarked for operations. From these, we could possibly derive further characteristics that it is held in a certain place, or a certain person has responsibility for it.
If you had more source data to go on, you could perhaps use fields such as 'year created', or 'customer' to help you categorise further.
Understand what it is about the items you want to tag that could give you an idea of where they should go.
This is your biggest problem. A quick example - what in the string "MIP Reserve" gives any clues that it should be linked to "Escrow Deposits"? You have no easy way of matching many of the items in your list - many words appear in multiple items across multiple tags.
However, try and look for unique identifiers that will give you clues - for example, all items with the word 'developer' seem to be tagged to 'Developer Fee Note & Interest'. Do you have any more of those? Use these to reduce your problem, since they should be a straightforward mapping.
Any unique identifiers will allow you to set up rules for these strings. You don't even need to stick to one word - perhaps when you see several words, you can narrow down where it will end up e.g. when I see 'egg' this could go into 'bird' or 'reptile', but if 'egg' is paired with 'wing', I can be fairly confident it's 'bird'.
You need to match the characteristics of the items you want to tag with the unique identifiers of the tags you developed in step 1.
Write a program or macro to look for the identifiers in step 2 and return the relevant tag from step 1.
This is the straightforward bit. Look for the identifiers you want (e.g. uses 'cash', contains tag 'Really Important Customer') and look for the best match in the tags you have earlier.
Ensure you catch any errors - what happens if no tag is found? Does it create a new one? Does it recommend contacting you for help? What happens if more than one tag is relevant? What are your tiebreaking criteria?
But be aware of...
Understand how you will control use of these unique identifiers.
Imagine you somehow manage to come up with a list of unique identifiers. How will you control their use? If you have decided to send any item with the word 'cash' to the tag 'Operating Cash' and then in a year, someone comes along and makes an item 'Capital Cash', because they want somewhere to put cash that is about to be spent on capital items, how do you stop this? How are you going to control use of these words?
You will effectively need to take control of the item naming system and set up an agreed list of identifying words. Whenever anyone makes an item, they need to include your identifiers somewhere. I can tell you that this will not work. Either they will use the wrong words and you will end up manually doing it anyway, or they will ring you up confused and you will end up manually doing it anyway.
If you are the only person doing this, just do the exercise once, to your own standard (that you record) and stick to that standard. When you need to hand it over, it's clearly ordered and makes sense. If more than one person is doing this, do the exercise once between you and the team and then agree a way of controlling it.
Writing a predictive program sounds great and might save you some time. But consider why you are writing it. Are you likely to need to tag accounts constantly in the future? If so, control their naming centrally and make it so a tag is mandatory when they are made. If not, why are you writing a program to do this? Just do it once, manually.

Creating a list of words from Wikipedia

I am creating a game and I need a dictionary (a list of plain words in this case) containing not only the base form, but all the others as well. In this case the language is Italian and, for example, the verbs have many forms and nouns too.
Since the language is very irregular, I want to get the words from a huge source which may contain them all. At first I thought about Wikipedia: I would download every article, extract the text, and filter the words.
This will take so much time that I'd like to know whether there could be better solutions, both in terms of time and completeness of the list.

If you're on a Linux system you might want to look in /usr/share/dict/words.

Develop Reference

Python is a programming language that lets you work quickly and integrate systems more effectively.