Creating a list of words from Wikipedia

I am creating a game and I need a dictionary (a list of plain words in this case) containing not only the base form of each word but all its other forms as well. The language is Italian, where verbs, for example, have many forms, and so do nouns.
Since the language is very irregular, I want to get the words from a huge source which may contain them all. At first I thought about Wikipedia: I would download every article, extract the text, and filter the words.
This will take so much time that I'd like to know whether there could be better solutions, both in terms of time and completeness of the list.

If you're on a Linux system you might want to look in /usr/share/dict/words.
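
A minimal sketch of reading such a word list, assuming the file holds one word per line. The function name load_words and the isalpha filter are my own choices, not part of any standard tool. Note that on many systems this file is English by default; other languages, where installed, usually live in separate files in the same directory, so the path is an assumption to adjust.

```python
from pathlib import Path


def load_words(path="/usr/share/dict/words"):
    """Return the set of lowercase alphabetic words in a one-word-per-line file."""
    words = set()
    with Path(path).open(encoding="utf-8", errors="ignore") as handle:
        for line in handle:
            word = line.strip().lower()
            # Skip blank lines and entries with apostrophes, hyphens or digits.
            if word.isalpha():
                words.add(word)
    return words
```

Calling load_words() with no argument reads the default path; pass another path to read a different word list.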

Related

How to find all words with any permutation of the given letters in SQL?

I am working with the sqlite3 module, using Python 3.10.0. I have created a database with a table of English words, where one of the columns is creatively named "word". My question is, how can I select all the words that contain at most the letters within a given word? For example, if the input was "establishment", valid outputs could be "meant", "tame", "mate", "team", "establish", "neat", and so on. Invalid outputs are words containing any letter not found in the input. I have done some research on this, but the only thing I found which even comes close is the LIKE keyword, which seems to be a limited version of regular expression matching. I mentioned Python 3.10 because I think I read somewhere that sqlite3 supports user-defined functions, but I figured I'd ask first to see if somebody knows an easier solution.

Your question is extremely vague.
Let me answer a related question: "How may I efficiently find anagrams of a given word?"
There is a standard approach to this.
Simply alphabetize all letters within a word, and store them in sorted order.
So given a dictionary containing these "known" words,
we would have the first three map to the same string:
pale <--> aelp
peal <--> aelp
plea <--> aelp
plan <--> alnp
Now given a query word of "leap", how shall we efficiently find its anagrams?
Turn it into "aelp".
Query for that string, retrieving three matching dictionary words.
Sqlite is an excellent fit for such a task.
It can easily produce suitable column indexes.
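
The anagram lookup described so far could be sketched like this with the sqlite3 module. The table name words, the column name sig, and the helper names are assumptions made for illustration; only the "word" column comes from the question.

```python
import sqlite3


def signature(word):
    """Letters of the word in alphabetical order, e.g. 'pale' -> 'aelp'."""
    return "".join(sorted(word.lower()))


def build_index(words):
    """Create an in-memory table of (word, sig) with an index on sig."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE words (word TEXT PRIMARY KEY, sig TEXT NOT NULL)")
    conn.execute("CREATE INDEX idx_words_sig ON words (sig)")
    conn.executemany(
        "INSERT OR IGNORE INTO words (word, sig) VALUES (?, ?)",
        [(w, signature(w)) for w in words],
    )
    return conn


def anagrams(conn, query):
    """Return the stored words whose letters are a rearrangement of query."""
    rows = conn.execute(
        "SELECT word FROM words WHERE sig = ? ORDER BY word",
        (signature(query),),
    )
    return [row[0] for row in rows]


conn = build_index(["pale", "peal", "plea", "plan"])
print(anagrams(conn, "leap"))  # ['pale', 'peal', 'plea']
```

The index on sig is what keeps the lookup cheap: the query is a single equality match on an indexed column.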
Now let's return to your problem.
I suspect it's a bit more complex than anagrams.
Consider using a related approach.
Rip through each dictionary word, storing digrams in standard order.
So for "pale", we would store:
pale <--> ap
pale <--> al
pale <--> el
Repeat for all other dictionary words.
Then, at query time, given an input of "leap",
you might consult the database for "el", "ae", and "ap".
Notice that "ae" missed, there.
If that troubles you, when processing the whole dictionary
feel free to store all 2-letter combinations, even ones that aren't consecutive.
Possibly going to trigrams, or all 3-letter combinations, would prove helpful.
Spend some time working with the problem to find out.
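
Here is a rough sketch of the digram idea, using a plain dict instead of a database table to keep it short. All function names are made up for illustration. The digram lookup only produces candidates, so the last helper adds a verification step of my own, assuming the question means "each letter used at most as often as it occurs in the input".

```python
from collections import Counter, defaultdict
from itertools import combinations


def digrams(word, consecutive_only=True):
    """Return the 2-letter keys for word, each key in alphabetical order."""
    letters = word.lower()
    if consecutive_only:
        pairs = zip(letters, letters[1:])
    else:
        pairs = combinations(letters, 2)  # every pair, not just neighbours
    return {"".join(sorted(pair)) for pair in pairs}


def build_digram_index(words, consecutive_only=True):
    """Map each digram key to the set of dictionary words containing it."""
    index = defaultdict(set)
    for word in words:
        for key in digrams(word, consecutive_only):
            index[key].add(word)
    return index


def candidates(index, query, consecutive_only=True):
    """Dictionary words sharing at least one digram key with query."""
    found = set()
    for key in digrams(query, consecutive_only):
        found |= index.get(key, set())
    return found


def uses_only_letters_of(word, source):
    """True if word can be spelled from the letters of source (with counts)."""
    return not (Counter(word.lower()) - Counter(source.lower()))


index = build_digram_index(["pale", "peal", "plea", "plan"])
print(sorted(digrams("leap")))            # ['ae', 'ap', 'el']
print(sorted(candidates(index, "leap")))  # ['pale', 'peal', 'plea']
```

Passing consecutive_only=False to both build_digram_index and candidates switches to all 2-letter combinations. Either way the result is a superset, so filter it with uses_only_letters_of(word, query) before returning matches.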

Direction needed: finding terms in a corpus

My question isn't about a specific code issue, but rather about the best direction to take on a Natural Language Processing challenge.
I have a collection of several hundred Word and PDF files (from which I can then export the raw text) on the one hand, and a list of terms on the other. Terms can consist of one or more words. What I need to do is identify in which file(s) each term is used, applying stemming and lemmatization.
How would I best approach this? I know how to extract text, apply tokenization, lemmatization, etc., but I'm not sure how I could search for occurrences of terms in lemmatized form inside a corpus of documents.
Any hint would be most welcome.
Thanks!

I would suggest you create an inverted index of the documents, where you record the locations of each word in a list, with the word form as the index.
Then you create a secondary index, where you have as key the lemmatised forms, and as values a list of forms that belong to the lemma.
When you do a lookup of a lemmatised word, e.g. go, you go to the secondary index to retrieve the inflected forms (go, goes, going, went); next you go to the inverted index and get all the locations for each of the inflected forms.
For ways of implementing this in an efficient manner, look at Witten/Moffat/Bell, Managing Gigabytes, which is a fantastic book on this topic.
UPDATE: for multi-word units, you can work those out in the index. Looking for "software developer", look up "software" and "developer", and then merge the locations: every time their locations differ by 1, they are adjacent and you have found it. Discard all the ones where they are further apart.
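
A small in-memory sketch of the two indexes and the adjacency check, assuming the documents are already tokenized and lowercased. The function names, the sample documents, and the hand-written lemma_forms table are all hypothetical; in practice the lemma table would come from your lemmatizer.

```python
from collections import defaultdict


def build_inverted_index(docs):
    """Map each word form to {doc_id: [token positions]}."""
    index = defaultdict(lambda: defaultdict(list))
    for doc_id, tokens in docs.items():
        for position, token in enumerate(tokens):
            index[token][doc_id].append(position)
    return index


def positions_for_lemma(index, lemma_forms, lemma):
    """Merge the positions of every inflected form of lemma, per document."""
    merged = defaultdict(set)
    for form in lemma_forms.get(lemma, [lemma]):
        for doc_id, positions in index.get(form, {}).items():
            merged[doc_id].update(positions)
    return merged


def find_term(index, lemma_forms, term_lemmas):
    """Return ids of documents where the lemmas occur at consecutive positions."""
    per_word = [positions_for_lemma(index, lemma_forms, lemma) for lemma in term_lemmas]
    hits = set()
    for doc_id, starts in per_word[0].items():
        for start in starts:
            # Word k of the term must sit exactly k positions after the first word.
            if all(start + k in per_word[k].get(doc_id, ()) for k in range(1, len(per_word))):
                hits.add(doc_id)
                break
    return hits


docs = {
    "a.txt": "she went to the shop".split(),
    "b.txt": "two software developers are going home".split(),
    "c.txt": "developer of software".split(),
}
lemma_forms = {
    "go": ["go", "goes", "going", "went"],
    "developer": ["developer", "developers"],
}
index = build_inverted_index(docs)
print(sorted(find_term(index, lemma_forms, ["go"])))                     # ['a.txt', 'b.txt']
print(sorted(find_term(index, lemma_forms, ["software", "developer"])))  # ['b.txt']
```

Note that c.txt contains both words but not next to each other in the right order, so it is discarded for the two-word term.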

How to generate homophones on substring level?

I want to generate homophones of words programmatically. Meaning, words that sound similar to the original words.
I've come across the Soundex algorithm, but it just replaces some characters with other characters (like t instead of d). Are there any lists or algorithms that are a little bit more sophisticated, providing at least homophone substrings?
Important: I want to apply this on words that aren't in dictionaries, meaning that I can't rely on whole, real words.
EDIT:
The input is a string which is often a proper name and therefore in no standard (homophone) dictionary. An example could be Google or McDonald's (just to name two popular named entities, but many are much more unpopular).
The output is then a (random) homophone of this string. Since words often have more than one homophone, a single (random) one is my goal. In the case of Google, a homophone could be gugel, or MacDonald's for McDonald's.

How to do this well is a research topic. See for example http://www.inf.ufpr.br/didonet/articles/2014_FPSS.pdf.
But suppose that you want to roll your own.
The first step is figuring out how to turn the letters that you are given into a representation of what they sound like. This is a very hard problem with guessing required. (E.g. what sound does "read" make? It depends on whether you are going to read, or you already read!) However, "text to phonemes converter" suggests that Arpabet has solved this for English.
Next you'll want this to have been done for every word in a dictionary. Assuming that you can do that for one word, that's just a script.
Then you'll want it stored in a data structure where you can easily find similar sounds. That is in principle no different from the sort of algorithms that are used for autocorrect for spelling, only with phonemes instead of letters. You can get a sense of how to do that with http://norvig.com/spell-correct.html. Or try to implement something like what is described in http://fastss.csg.uzh.ch/ifi-2007.02.pdf.
And that is it.
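
To make the "phonemes instead of letters" part concrete, here is a toy sketch. It assumes the letter-to-phoneme step has already been done, and the tiny PRONUNCIATIONS table is hand-written in Arpabet-style symbols purely for illustration. It does a plain linear scan with edit distance rather than the indexed lookups the linked articles describe.

```python
def edit_distance(a, b):
    """Levenshtein distance between two sequences of phoneme symbols."""
    previous = list(range(len(b) + 1))
    for i, symbol_a in enumerate(a, start=1):
        current = [i]
        for j, symbol_b in enumerate(b, start=1):
            cost = 0 if symbol_a == symbol_b else 1
            current.append(min(
                previous[j] + 1,         # delete symbol_a
                current[j - 1] + 1,      # insert symbol_b
                previous[j - 1] + cost,  # substitute (or match)
            ))
        previous = current
    return previous[-1]


def sound_alikes(phonemes, pronunciations, max_distance=1):
    """Words whose phoneme sequence is within max_distance edits of phonemes."""
    return sorted(
        word
        for word, other in pronunciations.items()
        if edit_distance(phonemes, other) <= max_distance
    )


# Hypothetical toy dictionary: word -> phoneme sequence (stress marks omitted).
PRONUNCIATIONS = {
    "red": ("R", "EH", "D"),
    "reed": ("R", "IY", "D"),
    "bed": ("B", "EH", "D"),
    "bread": ("B", "R", "EH", "D"),
    "road": ("R", "OW", "D"),
}

past_tense_read = ("R", "EH", "D")
print(sound_alikes(past_tense_read, PRONUNCIATIONS, max_distance=0))  # ['red']
print(sound_alikes(past_tense_read, PRONUNCIATIONS, max_distance=1))
```

With max_distance=0 you get exact homophones; raising it gives progressively looser sound-alikes. Treating every phoneme substitution as equally costly is a simplification, since some sounds are much closer to each other than others.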

Automatic tagging of words or phrases

I want to automatically tag a word/phrase with one of the defined words/phrases from a list. My list contains about 230 words in columnA which are tagged in columnB. There are around 16 unique tags, and each of those 230 words is tagged with one of these 16 tags.
Have a look at my list:
The words/phrases in column A are tagged as words/phrases in column B.
From time to time, new words are added, for which a tag has to be given manually.
I want to build a predictive algorithm/model to tag (or suggest tags for) new words automatically. So if I write a new word, let's say 'MIP Reserve' (A36), then it should predict the tag as 'Escrow Deposits' (B36) and not 'Operating Reserve' (B33). How should I predict the tags of a new word precisely, even if the words do not match the words in its actual tag?
If someone is willing to see the full list, I can happily share.

Short version
I think your question is a little ill-defined and doesn't have a short coding or macro answer. Given that each item contains so little information, I don't think it is possible to build a good predictive model from your source data. Instead, do the tagging exercise once and look at how you control tagging in the future.
Long version
Here are the steps I would take to create a predictive model and why I don't think you can do this.
Understand why you want to have a predictive program at all
Why do you need a predictive program? Are you sorting through hundreds or thousands of records, all of which are changing and need tagging? If so, I agree, you wouldn't want to do this manually.
If this is a one-off exercise, because over time the tags have become corrupted from their original meaning, your problem is that your tags have become corrupted, not that you need to somehow predict where each item should be tagged. You should be looking at controlling use of the tags, not at predicting how people in the future might mistag or misname something.
Don't forget that there are lots of tools in Excel to make the problem easier. Let's say you know for certain that all items with 'cash' definitely go to 'Operating Cash'. Put an AutoFilter on the list and filter on the word 'cash' - now just copy and paste 'Operating Cash' next to all of these. This way, you can quickly get rid of the obvious ones from your list and focus on the tricky ones.
Understand the characteristics of the tags you want to use.
Take time to look at the tags you are using - what does each of them mean? What are the unique features or combinations of features that this tag is representing?
For example, your tag 'Operating Cash' carries the characteristics of being cash (i.e. not tied up so available for use fairly quickly) and as being earmarked for operations. From these, we could possibly derive further characteristics that it is held in a certain place, or a certain person has responsibility for it.
If you had more source data to go on, you could perhaps use fields such as 'year created', or 'customer' to help you categorise further.
Understand what it is about the items you want to tag that could give you an idea of where they should go.
This is your biggest problem. A quick example - what in the string "MIP Reserve" gives any clues that it should be linked to "Escrow Deposits"? You have no easy way of matching many of the items in your list - many words appear in multiple items across multiple tags.
However, try and look for unique identifiers that will give you clues - for example, all items with the word 'developer' seem to be tagged to 'Developer Fee Note & Interest'. Do you have any more of those? Use these to reduce your problem, since they should be a straightforward mapping.
Any unique identifiers will allow you to set up rules for these strings. You don't even need to stick to one word - perhaps when you see several words, you can narrow down where it will end up e.g. when I see 'egg' this could go into 'bird' or 'reptile', but if 'egg' is paired with 'wing', I can be fairly confident it's 'bird'.
You need to match the characteristics of the items you want to tag with the unique identifiers of the tags you developed in step 1.
Write a program or macro to look for the identifiers in step 2 and return the relevant tag from step 1.
This is the straightforward bit. Look for the identifiers you want (e.g. uses 'cash', contains tag 'Really Important Customer') and look for the best match among the tags you identified earlier.
Ensure you catch any errors - what happens if no tag is found? Does it create a new one? Does it recommend contacting you for help? What happens if more than one tag is relevant? What are your tiebreaking criteria?
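
As a rough illustration of such a program, here is a keyword-rule sketch in Python. The two rules are taken from the examples above ('cash' and 'developer'); the function name suggest_tags, the rule format, and the "most specific rule wins" tiebreak are my own assumptions, not a recommendation of how your tags should actually map.

```python
import re

# Each rule: (set of identifying words that must all be present, tag to suggest).
RULES = [
    ({"developer"}, "Developer Fee Note & Interest"),
    ({"cash"}, "Operating Cash"),
]


def suggest_tags(item, rules=RULES):
    """Return the tags of the most specific matching rules.

    An empty list means no rule matched (tag it manually); more than one
    tag means the rules disagree and a person has to break the tie.
    """
    words = set(re.findall(r"[a-z]+", item.lower()))
    matches = [(len(keywords), tag) for keywords, tag in rules if keywords <= words]
    if not matches:
        return []
    most_specific = max(size for size, _ in matches)
    return sorted({tag for size, tag in matches if size == most_specific})


print(suggest_tags("Petty Cash"))   # ['Operating Cash']
print(suggest_tags("MIP Reserve"))  # []
```

The 'egg' example maps onto the same function: with rules ({"egg"}, "bird"), ({"egg"}, "reptile") and ({"egg", "wing"}, "bird"), an item containing only 'egg' returns both tags, while one containing 'egg' and 'wing' returns just 'bird'. Notice that 'MIP Reserve' returns nothing, which is exactly the hard case described above.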
But be aware of...
Understand how you will control use of these unique identifiers.
Imagine you somehow manage to come up with a list of unique identifiers. How will you control their use? If you have decided to send any item with the word 'cash' to the tag 'Operating Cash' and then in a year, someone comes along and makes an item 'Capital Cash', because they want somewhere to put cash that is about to be spent on capital items, how do you stop this? How are you going to control use of these words?
You will effectively need to take control of the item naming system and set up an agreed list of identifying words. Whenever anyone makes an item, they need to include your identifiers somewhere. I can tell you that this will not work. Either they will use the wrong words and you will end up manually doing it anyway, or they will ring you up confused and you will end up manually doing it anyway.
If you are the only person doing this, just do the exercise once, to your own standard (that you record) and stick to that standard. When you need to hand it over, it's clearly ordered and makes sense. If more than one person is doing this, do the exercise once between you and the team and then agree a way of controlling it.
Writing a predictive program sounds great and might save you some time. But consider why you are writing it. Are you likely to need to tag accounts constantly in the future? If so, control their naming centrally and make it so a tag is mandatory when they are made. If not, why are you writing a program to do this? Just do it once, manually.

Python interval based sparse container

I am trying to create an interface between structured data and NLTK. NLP libraries generally work with bags of words, hence I need to turn my structured data into bags of words.
I need to associate the offset of a word with its metadata. Therefore my best bet is to have some sort of container that holds ranges as keys (allowing nested ranges) and can retrieve all the metadata (multiple entries if the word offset is part of a nested range).
What code can I pick up that would do this efficiently (i.e., with a sparse representation of the data)? Efficient because my global corpus will be at least a few hundred megabytes.
Note:
I am serialising structured forum posts, which will include posts with sections of quotes within them. I want to know which topic a word belonged to, and whether it's a quote or user text. There will probably be additional metadata as my work progresses. Note that a word belonging to a quote is what I meant by nested metadata: the word is part of a quote that belongs to a post made by a user.
I know that one can tag words in NLTK, but I haven't looked into it. If it's possible to do what I want that way, please comment. But I am still looking for the original approach.
There is probably something in numpy that can solve my problem; I am looking at that now.
edit
The input data is far too complex to rip out and post. I have found what I was looking for though: http://packages.python.org/PyICL/. I needed to talk about intervals and not ranges :D I have used boost extensively; however, making that a dependency makes me a bit uneasy (sadly, I am having compiler errors with PyICL :( ).
The question now is: does anyone know an interval container library or data structure that can be used to index nested intervals in a sparse fashion? Or, put differently, one that provides similar semantics to boost.icl?

If you don't want to use PyICL or boost.icl, then instead of relying on a specialized library you could just use sqlite3 to do the job. An in-memory database will still be a few orders of magnitude slower than boost.icl (from experience coding other data structures vs sqlite3), but it should be more effective than a C++ std::vector style approach built on top of Python containers.
You can store two integers per row and put a date_type_low < offset < date_type_high predicate in your WHERE clause. Depending on your table structure, this will return all nested/overlapping ranges.
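
A minimal sketch of that idea with an in-memory database. The table name spans, the shorter column names low/high/kind/value, the sample rows, and the half-open interval convention (low <= offset < high) are assumptions for illustration; adjust the comparison operators if your intervals are closed or open.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE spans ("
    "low INTEGER NOT NULL, high INTEGER NOT NULL, "
    "kind TEXT NOT NULL, value TEXT NOT NULL)"
)
conn.execute("CREATE INDEX idx_spans_low_high ON spans (low, high)")
conn.executemany(
    "INSERT INTO spans (low, high, kind, value) VALUES (?, ?, ?, ?)",
    [
        (0, 500, "topic", "parsers"),  # whole thread
        (0, 200, "post", "alice"),     # first post
        (50, 120, "quote", "bob"),     # quote nested inside the first post
        (200, 500, "post", "carol"),   # second post
    ],
)


def metadata_at(conn, offset):
    """Return (kind, value) for every span containing offset, widest first."""
    rows = conn.execute(
        "SELECT kind, value FROM spans "
        "WHERE low <= ? AND ? < high "
        "ORDER BY high - low DESC",
        (offset, offset),
    )
    return rows.fetchall()


print(metadata_at(conn, 60))
# [('topic', 'parsers'), ('post', 'alice'), ('quote', 'bob')]
```

Only the spans are stored, not one row per word, so the representation stays sparse however large the corpus is.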
