n-gram name analysis in non-English languages (CJK, etc.) - python

I'm working on deduping a database of people. For a first pass, I'm following a basic 2-step process to avoid an O(n^2) operation over the whole database, as described in the literature. First, I "block": I iterate over the whole dataset and bin each record based on the n-grams AND initials present in the name. Second, all the records in each bin are compared using Jaro-Winkler to get a measure of the likelihood that they represent the same person.
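For concreteness, roughly what that two-step pipeline looks like in Python. This is only a sketch: the (record_id, name) layout is made up, and the jellyfish package is assumed for Jaro-Winkler (older versions call the function jaro_winkler rather than jaro_winkler_similarity).

from collections import defaultdict
from itertools import combinations

import jellyfish  # pip install jellyfish

def ngrams(s, n=3):
    s = s.lower()
    return {s[i:i + n] for i in range(len(s) - n + 1)}

def block_keys(name):
    # Block on character n-grams plus the initials of whitespace-separated tokens.
    keys = set(ngrams(name))
    keys.update(tok[0].lower() for tok in name.split() if tok)
    return keys

def candidate_duplicates(records, threshold=0.9):
    # records: iterable of (record_id, name) pairs -- hypothetical layout.
    bins = defaultdict(list)
    for rid, name in records:
        for key in block_keys(name):
            bins[key].append((rid, name))
    seen = set()
    for bucket in bins.values():
        for (id1, n1), (id2, n2) in combinations(bucket, 2):
            pair = tuple(sorted((id1, id2)))
            if pair not in seen:
                seen.add(pair)
                if jellyfish.jaro_winkler_similarity(n1, n2) >= threshold:
                    yield pair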
My problem- the names are Unicode. Some (though not many) of these names are in CJK (Chinese-Japanese-Korean) languages. I have no idea how to find word boundaries for something like initials in these languages. I have no idea whether n-gram analysis is valid on names in languages where names can be 2 characters. I also don't know if string edit-distance or other similarity metrics are valid in this context.
Any ideas from linguist programmers or native speakers?

Some more information regarding Japanese:
When it comes to splitting the names into family name and given name, morphological analyzers like mecab (mentioned in #Holden's answer) basically work, but the level of accuracy will not be very high, because they will only get those names right that are in their dictionary (the statistical 'guessing' capabilities of mecab mostly relate to POS tags and in dealing with ambiguous dictionary entries, but if a proper noun is not in the dictionary, mecab will most of the time split it into individual characters, which is almost always wrong). To test this, I used a random list of names on the web (this one, which contains 113 people's names), extracted the names, removed whitespace from them and tested mecab using the IPAdic. It got approx. 21% of the names wrong.
'Proper' Japanese names, i.e. names of Japanese people, consist of a family name (most of the time 2, but sometimes 1 or 3, Kanji) and a given name (most of the time 1 or 2, sometimes 3 Kanji, but sometimes 2-5 Hiragana instead). There are no middle names and there is no concept of initials. You could improve the mecab output by (1) using a comprehensive dictionary of family names, which you could build from web resources, (2) assuming the output is wrong whenever there are more than 2 elements, then using your self-made family name dictionary to recognise the family name part, and if that fails, falling back on default splitting rules based on the number of characters. The latter will not always be accurate.
Of course foreign names can be represented in Japanese, too. Firstly, there are Chinese and Korean names, which are typically represented using Kanji, i.e. whatever splitting rules for Chinese or Korean you use can be applied more or less directly. Western as well as Arabic or Indian names are either represented using Latin characters (possibly full-width, though), or Katakana characters, often (but not always) using white space or a middle dot ・ between family name and given name. While for names of Japanese, Chinese or Korean people the order in Japanese representation will always be family name, then given name, the order for Western names is hard to predict.
Do you even need to split names into family and given part? For the purposes of deduplication / data cleansing, this should only be required if some of the possible duplicates appear in different order or with optional middle initials. None of this is possible in Japanese names (nor Chinese, nor Korean names for that matter). The only thing to keep in mind is that if you are given a Katakana string with spaces or middle dots in it, you are likely dealing with a Western name, in which case splitting at the space / middle dot is useful.
While splitting is probably not really required, you must take care of a number of other issues not mentioned in the previous answers:
Transliteration of foreign names. Depending on how your database was constructed, there may be situations that involve a Western name, say 'Obama', in one entry, and the Japanese Katakana representation 'オバマ' in a duplicate entry. Unfortunately, the mapping from Latin to Katakana is not straightforward, as Katakana tries to reflect the pronunciation of the name, which may vary depending on the language of origin and the accent of whoever pronounces it. E.g. somebody who hears the name 'Obama' for the first time may be tempted to represent it as 'オバーマ' to emphasize the long vowel in the middle. Solving this is not trivial and will never work perfectly accurately, but if you think it is important for your cleansing problem, let's address it in a separate question.
Kanji variation. Japanese names (as well as Japanese representations of some Chinese or Korean names) use Kanji that are considered traditional versions of modern Kanji. For example, many common family names contain 澤, which is a traditional version of 沢. For instance, the family name Takazawa may be written as 高沢 or 高澤. Usually, only one is the correct variant used by any particular person of that name, but it is not uncommon that the wrong variant is used in a database entry. You should therefore definitely normalise traditional variants to modern variants before comparing names. This web page provides a mapping that is certainly not comprehensive, but is probably good enough for your purposes.
Both Latin and Katakana characters exist in full-width as well as half-width variants. Full-width is commonly used for Katakana and half-width for Latin, but there is no guarantee. You should normalise all Katakana to full-width and all Latin to half-width before comparing names (see the normalisation sketch after this list).
Perhaps needless to say, but there are various versions of white space characters, which you also must normalise before comparing names. Moreover, in a pure Kanji sequence, I recommend removing all whitespace before comparing.
As said, some first names (especially female ones) are written in Hiragana. It may happen that those same names are written in Katakana in some instances. A mapping between Hiragana and Katakana is trivially possible. You should consider normalising all Kana (i.e. Hiragana and Katakana) to a common representation (either Hiragana or Katakana) before making any comparisons.
It may also happen that some Kanji names are represented using Kana. This is because whoever made the database entry might not have known the correct Kanji for the name (especially with first names, guessing the correct Kanji after hearing a name e.g. on the phone is very often impossible even for native speakers). Unfortunately, mapping between Kanji representations and Kana representations is very difficult and highly ambiguous; for example, 真, 誠 and 実 are all possible Kanji for the first name 'Makoto'. Any individual of that name will consider only one of them correct for himself, but it is impossible to know which one if the only thing you know is that the name is 'Makoto'. But Kana is sound-based, so all three versions are the same マコト in Katakana. Dictionaries built into morphological analyzers like mecab provide mappings, but because there is more than one possible Kanji for any Kana sequence and vice versa, actually using this during data cleansing will complicate your algorithm quite a lot. Depending on how your database was created in the first place, this may or may not be a relevant problem.
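A minimal sketch of the normalisation steps above (Kanji variants, width folding, whitespace, Hiragana/Katakana). The variant-Kanji table here contains only the single example mentioned above and would need to be filled from a real mapping:

import re
import unicodedata

# Example only: traditional-to-modern Kanji variants; extend from a real mapping.
KANJI_VARIANTS = {"澤": "沢"}

def hiragana_to_katakana(ch):
    # The Hiragana block U+3041-U+3096 maps onto Katakana at a fixed offset of 0x60.
    return chr(ord(ch) + 0x60) if "\u3041" <= ch <= "\u3096" else ch

def normalize_name(name):
    # NFKC folds full-width Latin to half-width and half-width Katakana to full-width.
    s = unicodedata.normalize("NFKC", name)
    # Remove all whitespace variants (reasonable for pure Kanji/Kana strings).
    s = re.sub(r"\s+", "", s)
    # Hiragana -> Katakana, then traditional -> modern Kanji variants.
    s = "".join(hiragana_to_katakana(ch) for ch in s)
    return "".join(KANJI_VARIANTS.get(ch, ch) for ch in s)

print(normalize_name("高澤　はるこ"))  # -> 高沢ハルコ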
Edit specifically about publication author names: Japanese translations of non-Japanese books usually have the author name transliterated to Katakana. E.g. the book recommendation list of the Asahi newspaper has 30 books today; 7 have a Western author name in Katakana. They even have abbreviated first names and middle initials, which they keep in Latin, e.g.
H・S・フリードマン and L・R・マーティン
which corresponds to
H.S. Friedman (or Friedmann, or Fridman, or Fridmann?)
and
L.R. Martin (or Matin, or Mahtin?)
I'd say this exemplifies the most common way to deal with non-Japanese author names of books:
Initials are preserved as Latin
Unabbreviated parts of the name are given in Katakana (but there is no uniquely defined one-to-one mapping between Latin and Katakana, as described in 5.1)
The order is preserved: First, middle, surname. That is a very common convention for author names, but in something like a customer database that may be different.
Either whitespace, or middle dot (as above), or the standard ASCII dot are used to separate the elements
So as long as your project is related to author names of books, I believe the following is accurate with regards to non-Japanese authors:
The same author may appear in a Latin representation (in a non-Japanese entry) as well as a Katakana representation (in a Japanese entry). To be able to determine that two such entries refer to the same author, you'll need to map between Katakana and Latin. That is a non-trivial problem, but not totally insurmountable either (although it will never work 100% correctly). I am unsure if a good solution is available for free; but let's address this in a separate question (perhaps with the japanese tag) if required.
Even if for some reason we can assume that there are no Latin duplicates of Katakana names, there is still a good chance that there are multiple variants in Katakana (due to 5.1). However, for author names (in particular of well-known authors), it may be safe to assume that the amount of variation is relatively limited. Hence, for a start, it may be sufficient to normalize dots and whitespace.
Splitting into first and last name is trivial (whitespace and dots), and the order of names will generally be the same across all variants.
Western authors will generally not be represented using Kanji. There are a few people who consider themselves so closely related to Japan that they choose Kanji for their own name (it's a matter of choice, not just transliteration, because the Kanji carry meaning), but that will be so rare that it is hardly worth worrying about.
Now regarding Japanese authors, those will be represented in Kanji as described in part 2 of the main answer. In Western translations of their books, their name will generally be given in Latin, and the order will be exchanged. For example,
村上春樹
(村上 = Murakami, the family name, 春樹 = Haruki, the given name)
will be represented as
Haruki Murakami
on translations of his books. This kind of mapping between Kanji and Latin requires a very comprehensive dictionary and quite a lot of work. Also, the spelling in Latin cannot always be uniquely determined, even if the reading of the Kanji can. E.g. one of the most frequent Japanese family names, 伊藤, may be spelled 'Ito' as well as 'Itoh' in English. Even 'Itou' and 'Itoo' are not impossible.
If Japanese-Latin cross matching is not required, the only kind of variation amongst the Kanji representations themselves you will see are Kanji variants (5.2). But to be clear, even where a traditional as well as a modern variant of a Kanji exists, only one of them is correct for any given individual. Typing the wrong Kanji variant may easily happen when a phone operator enters names into a database, but in a database of author names this will be relatively rare because the correct spelling of an author can be verified relatively easily.
Regarding the question about 5.6 (Kana vs. Kanji):
Some people's given name has no Kanji representation, only a Hiragana one. Since there is a one-to-one correspondence between Hiragana and Katakana, there is a fair chance that both variants appear in a database. I recommend converting all Hiragana to Katakana (or vice versa) before comparing.
However, most people's names are written in Kanji. On the cover of a book, those Kanji will be used, so most likely they will also be used in your database. The only reasons why somebody might input Kana instead of Kanji are: (a) when he/she does not know the correct Kanji (perhaps unlikely since you can easily search Amazon or whatever to find out), (b) when the database is made for search purposes. Search engines for book catalogues might include Katakana versions because that enables users to find authors even if they don't know the correct Kanji. Hence, whether or not you need Kanji-Kana conversion (which is a hard problem) depends on the original purpose of the data and how the database was created.
Regarding nicknames: There are nicknames used in daily conversation, but I doubt you would find them in an author database. I realize there are languages (e.g. Polish) that use nicknames or diminutives (e.g. 'Gosia' instead of 'Małgorzata') in an almost regular way, but I wouldn't say that is the case with Japanese.
Regarding Chinese: I am unable to give a comprehensive answer, but at least the whole Kanji-Kana variation problem does not exist, because Chinese uses Kanji (under the name of Hanzi) only. There is a major Kanji variation problem, however (especially between traditional variants (used in Taiwan) and simplified variants (used on the mainland)).
Regarding Korean: As far as I know, Koreans are generally able to write their name in Hanja (= Kanji), although they don't use Hanja for most of the rest of the language most of the time; but there is obviously a Hangul version of the name, too. I am unsure to what extent Hanja-Hangul conversion is required for a cleansing problem like yours. If it is, it will be a very hard problem.
Regarding regional variants: There are no regional variants of the Kanji characters themselves in Japanese (at least not in modern times). The Kanji of any given author will be written in the same way all over Japan. Of course there are certain family names that are more frequent in one region than another, though. If you are interested in the names themselves (rather than the people they refer to), regional variants (as well as variation between traditional and modern forms of the Kanji) will play a role.

For Chinese, most names consist of 3 characters: the first character is the family name (!), and the other two characters are the personal name, e.g.
Mao Zedong = family name Mao and personal name Zedong.
There are also some 2-character names, in which case the first character is the family name and the second character is the personal name.
4-character names are rare, but then the split is usually 2-2.
Seeing this, it does not really make much sense to do n-gram analysis of Chinese names - you would just be rediscovering which Chinese family/personal names are the most common.
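If you do need the split, those length rules are simple enough to code directly (a sketch; it ignores two-character compound surnames such as 欧阳 in three-character names):

def split_chinese_name(name):
    # Split into (family name, personal name) using the length heuristics above.
    name = name.strip()
    if len(name) == 2:        # e.g. 李娜 -> 李 / 娜
        return name[0], name[1]
    if len(name) == 3:        # e.g. 毛泽东 -> 毛 / 泽东
        return name[0], name[1:]
    if len(name) == 4:        # usually a two-character surname, split 2-2
        return name[:2], name[2:]
    return name, ""           # fallback: leave anything else unsplit

print(split_chinese_name("毛泽东"))  # ('毛', '泽东')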

So doing bi-gram style matching is a common hack for doing search in Japanese, but there are better approaches you can use to determine word boundaries. In a project I've worked on in the past we had fairly good results with mecab for Japanese brand names and some other text. I imagine you could get better performance by training it on a list of Japanese names. Sadly it's written in C++, but we ended up using it anyway in Java through JNI; you could do something similar in your Python code.
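For completeness, calling MeCab directly from Python rather than through JNI is straightforward; a sketch assuming the mecab-python3 bindings and an installed dictionary such as IPADIC:

import MeCab  # pip install mecab-python3 (plus a dictionary, e.g. ipadic or unidic-lite)

tagger = MeCab.Tagger()

def tokenize(name):
    # Return the surface forms MeCab finds in a whitespace-free Japanese name.
    tokens = []
    node = tagger.parseToNode(name)
    while node:
        if node.surface:      # skip the empty BOS/EOS nodes
            tokens.append(node.surface)
        node = node.next
    return tokens

print(tokenize("村上春樹"))  # ideally ['村上', '春樹'], subject to the caveats above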

Related

How to extract text between two headings with regex, requires complicated non-capture groups

I want to pull abstracts out of a large corpus of scientific papers using a Python script. The papers are all saved as strings in a large csv. I want to do something like this: extracting text between two headers. I can write a regex to find the 'Abstract' heading. However, finding the next section heading is proving difficult. Headers vary wildly from paper to paper. They can be ALL CAPS or Just Capitalized. They can be one word or a long phrase spanning two lines. They are usually followed by one or two newlines. This is what I came up with:
abst = re.findall(r'(?:ABSTRACT\s*\n+|Abstract\s*\n+)(.*?)((?:[A-Z]+|(?:\n(?:[A-Z]+|(?:[A-Z][a-z]+\s*)+)\n+)',row[0],re.DOTALL)
Here is an example of an abstract:
'...\nAbstract\nFactorial Hidden Markov Models (FHMMs) are powerful models for
sequential\ndata but they do not scale well with long sequences. We
propose a scalable inference and learning algorithm for FHMMs that
draws on ideas from the stochastic\nvariational inference, neural
network and copula literatures. Unlike existing approaches, the
proposed algorithm requires no message passing procedure among\nlatent
variables and can be distributed to a network of computers to speed up
learning. Our experiments corroborate that the proposed algorithm does
not introduce\nfurther approximation bias compared to the proven
structured mean-field algorithm,\nand achieves better performance with
long sequences and large FHMMs.\n\n1\n\nIntroduction\n\n...'
So I'm trying to find 'Abstract' and 'Introduction' and pull out the text that is between them. However, it could be 'ABSTRACT' and 'INTRODUCTION', or 'ABSTRACT' and 'A SINGLE LAYER NETWORK AND THE MEAN FIELD\nAPPROXIMATION\n'.
Help?
Recognizing the next section is a bit vague - perhaps we can rely on Abstract-section ending with two newlines?
ABSTRACT\n(.*)\n\n
Or maybe we'll just assume that the next section title will start with an uppercase letter and be followed by any number of word characters. (That's rather vague too, and assumes there'll be no \n\n within the Abstract.)
ABSTRACT\n(.*)\n\n[A-Z][\w\s]*\n\n
Maybe that stimulates further fiddling on your end... Feel free to post examples where this did not match - maybe we can stepwise refine it.
N.B.: as Wiktor pointed out, I could not use the inline case-insensitive modifiers, so the whole regex should be used with a switch for case-insensitive matching (re.IGNORECASE in Python).
Update1: the challenge here is really how to identify that a new section has begun, and not to confuse that with paragraph breaks within the Abstract. Perhaps that can be dealt with by changing the rather tolerant [\w\s]* to [\w\s]{1,100}, which would only recognize text in a new paragraph as the title of the "abstract successor" if it had roughly between 2 and 100 characters (the quantifier's lower bound is 1, but the leading uppercase character [A-Z] adds one more).
ABSTRACT\n(.*)\n\n[A-Z][\w\s]{1,100}\n\n
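Putting this together in Python, a sketch only; the "next heading" heuristic is still as rough as discussed above, and stray section numbers such as the '1' in the example may end up in the captured text:

import re

# 'Abstract' heading, then the body (non-greedy), then a blank line followed by
# a short block starting with an uppercase letter -- the assumed next heading.
ABSTRACT_RE = re.compile(
    r"(?:ABSTRACT|Abstract)\s*\n+"
    r"(.*?)"
    r"\n\s*\n"
    r"(?=[A-Z][\w\s]{1,100}\n)",
    re.DOTALL,
)

def extract_abstract(paper_text):
    match = ABSTRACT_RE.search(paper_text)
    return match.group(1).strip() if match else None

# e.g. for row in reader: print(extract_abstract(row[0]))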

How to generate homophones on substring level?

I want to generate homophones of words programmatically. Meaning, words that sound similar to the original words.
I've come across the Soundex algorithm, but it just replaces some characters with other characters (like t instead of d). Are there any lists or algorithms that are a little bit more sophisticated, providing at least homophone substrings?
Important: I want to apply this on words that aren't in dictionaries, meaning that I can't rely on whole, real words.
EDIT:
The input is a string which is often a proper name and therefore in no standard (homophone) dictionary. An example could be Google or McDonald's (just to name two well-known named entities; many will be far more obscure).
The output is then a (random) homophone of this string. Since words often have more than one homophone, a single (random) one is my goal. In the case of Google, a homophone could be gugel, or MacDonald's for McDonald's.
How to do this well is a research topic. See for example http://www.inf.ufpr.br/didonet/articles/2014_FPSS.pdf.
But suppose that you want to roll your own.
The first step is figuring out how to turn the letters that you are given into a representation of what they sound like. This is a very hard problem that requires guessing. (E.g. what sound does "read" make? It depends on whether you are going to read or have already read!) However, text to phonemes converter suggests that Arabet has solved this for English.
Next you'll want this to have been done for every word in a dictionary. Assuming that you can do that for one word, that's just a script.
Then you'll want it stored in a data structure where you can easily find similar sounds. That is in principle no different from the sort of algorithms used for spelling autocorrect, only with phonemes instead of letters. You can get a sense of how to do that with http://norvig.com/spell-correct.html. Or try to implement something like what is described in http://fastss.csg.uzh.ch/ifi-2007.02.pdf.
And that is it.
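To make the "similar sounds" step concrete, here is a minimal sketch that compares phoneme sequences with a plain edit distance. to_phonemes() stands in for whatever grapheme-to-phoneme converter you end up with; it is purely hypothetical here.

def edit_distance(a, b):
    # Levenshtein distance between two sequences (here: lists of phonemes).
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        cur = [i]
        for j, y in enumerate(b, 1):
            cur.append(min(prev[j] + 1,              # delete x
                           cur[j - 1] + 1,           # insert y
                           prev[j - 1] + (x != y)))  # substitute
        prev = cur
    return prev[-1]

def sounds_alike(word1, word2, to_phonemes, max_dist=1):
    # Two spellings are treated as homophones if their phoneme sequences are close.
    return edit_distance(to_phonemes(word1), to_phonemes(word2)) <= max_dist

# e.g. both "Google" and "gugel" might map to something like
# ['G', 'UW', 'G', 'AH', 'L'], giving a distance of 0.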

Algorithm or tool in python to distinguish between gibberish/errors and foreign words/names?

I'm doing some machine extraction of sometimes-garbled PDF text, which often ends up with words incorrectly split up by spaces, or chunks of words put in incorrect order, resulting in pure gibberish.
I'd like a tool that can scan through and recognize these chunks of pure gibberish while skipping non-dictionary words that are likely to be proper names or simply words in a foreign language.
Not sure if this is even possible, but if it is I imagine something like this could be done using NLTK. I'm just wondering if this has been done before to save me the trouble of reinventing the wheel.
Hm, I imagine you could train an SVM or neural network on character n-grams... but you'd need pretty darn long ones. The problem is that this would probably have a high rate of false negatives (throwing out what you wanted), because the rates of character clusters can differ drastically across languages.
Take Polish, for example (it's my only second language written in easy-to-type Latin characters). Skrzywdy would be a highly unlikely series of letters in English, but is easily pronounceable in Polish.
A better technique might be to use language detection to detect languages used in a document above a certain probability, and then check the dictionaries for those languages...
This won't help for (for instance) a Linguistics textbook where a large variety of snippets of various languages are frequently used.
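A minimal sketch of that detect-then-check idea, assuming the langdetect package; in_dictionary() is a hypothetical lookup against whatever per-language word lists you have:

from langdetect import detect_langs  # pip install langdetect

def plausible_languages(text, min_prob=0.1):
    # Languages the detector considers plausible for the document as a whole.
    return [guess.lang for guess in detect_langs(text) if guess.prob >= min_prob]

def looks_like_gibberish(token, languages, in_dictionary):
    # Keep a token if it is a word in any plausible language; otherwise flag it.
    return not any(in_dictionary(token, lang) for lang in languages)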
** EDIT **
Idea 2:
You say this is bibliographic information. Meta-information such as its position in the text, or any font information your OCR software returns, is almost certainly more important than the series of characters you see showing up. If it's in the title, or near the position where the author usually goes, or in italics, it's worth considering as foreign...

fuzzy string matching with term weights

I'm working on an application that attempts to match an input set of potentially "messy" entity names to "clean" entity names in a reference list. I've been working with edit distance and other common fuzzy matching algorithms, but I'm wondering if there are any better approaches that allow for term weighting, such that common terms are given less weight in the fuzzy match.
Consider this example, using Python's difflib library. I'm working with organization names, which have many standardized components in common; those components therefore cannot be used to differentiate among entities.
from difflib import SequenceMatcher

e1a = SequenceMatcher(None, "ZOECON RESEARCH INSTITUTE",
                      "LONDON RESEARCH INSTITUTE")
print e1a.ratio()   # 0.88

e1b = SequenceMatcher(None, "ZOECON", "LONDON")
print e1b.ratio()   # 0.333333333333

e2a = SequenceMatcher(None, "WORLDWIDE SEMICONDUCTOR MANUFACTURING CORP",
                      "TAIWAN SEMICONDUCTOR MANUFACTURING CORP")
print e2a.ratio()   # 0.83950617284

e2b = SequenceMatcher(None, "WORLDWIDE", "TAIWAN")
print e2b.ratio()   # 0.133333333333
Both examples score highly on the full string because RESEARCH, INSTITUTE, SEMICONDUCTOR, MANUFACTURING, and CORP are high-frequency, generic terms in many organization names. I'm looking for any ideas of how to integrate term frequencies into fuzzy string matching (not necessarily using difflib), such that the scores aren't as influenced by common terms, and the results might look more like the "e1b" and "e2b" examples.
I realize I could just make a big "frequent term" list and exclude those from the comparison, but I'd like to use frequencies if possible because even common words add some information, and also the cutoff point for any list would of course also be arbitrary.
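One way to fold frequencies in rather than hard-excluding common terms: weight each token by its inverse document frequency over the clean reference list, and combine per-token fuzzy scores using those weights. A sketch with difflib (nothing here is tied to a particular library):

import math
from collections import Counter
from difflib import SequenceMatcher

def idf_weights(reference_names):
    # Inverse document frequency of each token across the clean reference list.
    docs = [set(name.upper().split()) for name in reference_names]
    counts = Counter(tok for doc in docs for tok in doc)
    n = len(docs)
    return {tok: math.log(n / df) for tok, df in counts.items()}

def weighted_similarity(messy, clean, idf, default_idf=1.0):
    # Best fuzzy match per messy token, weighted by the matched token's IDF.
    clean_tokens = clean.upper().split()
    num = den = 0.0
    for tok in messy.upper().split():
        best_tok, best = max(
            ((c, SequenceMatcher(None, tok, c).ratio()) for c in clean_tokens),
            key=lambda pair: pair[1],
        )
        w = idf.get(best_tok, default_idf)
        num += w * best
        den += w
    return num / den if den else 0.0

With weights like these, RESEARCH and INSTITUTE contribute little, so "ZOECON RESEARCH INSTITUTE" vs "LONDON RESEARCH INSTITUTE" is dominated by the ZOECON/LONDON comparison, which is closer to the behaviour of the "e1b" example.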
Here's a weird idea for you:
Compress your input and diff that.
You could use e.g. a Huffman or dictionary coder to compress your input; that automatically takes care of common terms. It may not do so well for typos, though: in your example, London is probably a relatively common word while the misspelt Lundon is not at all, and the dissimilarity between the compressed terms is much higher than between the raw terms.
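For what it's worth, the compression idea can be tried in a few lines via the normalized compression distance, with zlib standing in for a proper dictionary coder (a rough sketch; zlib's fixed overhead makes it unreliable for very short strings):

import zlib

def compressed_size(s):
    return len(zlib.compress(s.encode("utf-8")))

def ncd(a, b):
    # Normalized compression distance: ~0 for very similar strings, ~1 for unrelated ones.
    ca, cb, cab = compressed_size(a), compressed_size(b), compressed_size(a + b)
    return (cab - min(ca, cb)) / max(ca, cb)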
How about splitting each string into a list of words, and running your comparison on each word to get a list which holds the scores of the word matches? Then you can average the scores, or find the lowest/highest indirect match or partials...
This gives you the ability to add your own weights.
You would of course need to handle offsets like...
"the london company for leather"
and
"london company for leather"
In my opinion, a general solution will never match your idea of similarity. As soon as you have some implicit knowledge about your data, you have to put that somehow into code, which immediately disqualifies a fixed existing solution.
Perhaps you should have a look at http://nltk.org/ to get an idea of some NLP techniques. You don't tell us enough about your data, but a POS tagger might help to identify more and less relevant terms. Available databases with names of cities, countries, ... might help to clean up the data before processing it further.
There are many tools available, but to get high quality output, you will need a solution which is customized for your data and use case.
I am just proposing a different approach. Since you mentioned that the entity names are coming from a reference list, I am wondering if you have additional context information, like co-author names, product/paper titles, or addresses with city, state and country?
If you do have some useful context as above, you can actually build a graph of entities out of the relations between them. Relations could be, for example:
Author-paper relation
Co-author relation
Author-institute relation
Institute-city relation
....
Then it's time to use a graph-based entity resolution approach described in detail at:
http://linqs.cs.umd.edu/basilic/web/Publications/2007/bhattacharya:tkdd07/bhattacharya-tkdd.pdf
http://drum.lib.umd.edu/bitstream/1903/4021/1/4758.pdf
The approach performs very well in the co-author/paper domain.

Python - letter frequency count and translation

I am using Python 3.1, but I can downgrade if needed.
I have an ASCII file containing a short story written in one of the languages whose alphabet can be represented with upper and/or lower ASCII. I wish to:
1) Detect an encoding to the best of my abilities, get some sort of confidence metric (would vary depending on the length of the file, right?)
2) Automatically translate the whole thing using some free online service or a library.
Additional question: What if the text is written in a language where it takes 2 or more bytes to represent one letter and the byte order mark is not there to help me?
Finally, how do I deal with punctuation and misc characters such as space? It will occur more frequently than some letters, right? How about the fact that punctuation and characters can be sometimes mixed - there might be two representations of a comma, two representations for what looks like an "a", etc.?
Yes, I have read the article by Joel Spolsky on Unicode. Please help me with at least some of these items.
Thank you!
P.S. This is not homework, but it is for self-educational purposes. I prefer a letter-frequency library that is open-source and readable over one that is closed and efficient, even if the latter gets the job done well.
Essentially there are three main tasks to implement the described application:
1a) Identify the character encoding of the input text
1b) Identify the language of the input text
2) Get the text translated, by way of one of the online services' APIs
For 1a, you may want to take a look at decodeh.py; aside from the script itself, it provides many very useful resources regarding character sets and encoding at large. CharDet, mentioned in another answer, also seems worthy of consideration.
Once the character encoding is known, as you suggest, you may solve 1b) by calculating the character frequency profile of the text and matching it against known frequencies. While simple, this approach typically provides decent precision, although it may be weak on shorter texts and also on texts which follow particular patterns; for example, a text in French with many references to units in the metric system will have an unusually high proportion of the letters M, K and C.
A complementary and very similar approach uses bi-grams (sequences of two letters) and tri-grams (three letters) and the corresponding tables of frequency distributions for various languages.
Other language detection methods involve tokenizing the text, i.e. considering the words within the text. NLP resources include tables with the most used words in various languages. Such words are typically articles, possessive adjectives, adverbs and the like.
An alternative solution to language detection is to rely on the online translation service to figure this out for us. What is important is to supply the translation service with text in a character encoding it understands; supplying the language may be superfluous.
Finally, as with many practical NLP applications, you may decide to implement multiple solutions. By using a strategy design pattern, one can apply several filters/classifiers/steps in a particular order, and exit this logic at different points depending on the situation. For example, if a simple character/bigram frequency test matches the text to English (with a small deviation), one may just stop there. Otherwise, if the guessed language is French or German, perform another test, etc.
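A rough sketch of the character-frequency / n-gram idea: build an n-gram profile of the text and compare it against per-language reference profiles with cosine similarity. The reference_profiles dict is a placeholder you would have to build offline from real corpora:

import math
from collections import Counter

def ngram_profile(text, n=3):
    # Pad with spaces so word-boundary n-grams are captured too.
    text = " " + " ".join(text.lower().split()) + " "
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))

def cosine(p, q):
    dot = sum(count * q.get(gram, 0) for gram, count in p.items())
    norm = math.sqrt(sum(v * v for v in p.values())) * math.sqrt(sum(v * v for v in q.values()))
    return dot / norm if norm else 0.0

def guess_language(text, reference_profiles):
    # reference_profiles: {'en': Counter(...), 'fr': Counter(...), ...}
    profile = ngram_profile(text)
    return max(reference_profiles, key=lambda lang: cosine(profile, reference_profiles[lang]))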
If you have an ASCII file then I can tell you with 100% confidence that it is encoded in ASCII. Beyond that try chardet. But knowing the encoding isn't necessarily enough to determine what language it's in.
As for multibyte encodings, the only reliable way to handle them is to hope there are characters from the Latin alphabet and look for which half of each pair holds the NULL byte. Otherwise treat it as UTF-8 unless you know better (Shift-JIS, GB2312, etc.).
Oh, and UTF-8. UTF-8, UTF-8, UTF-8. I don't think I can stress that enough. And in case I haven't... UTF-8.
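Since chardet keeps coming up: this is all the code it takes, and its confidence field is exactly the kind of metric asked about (the filename is of course made up):

import chardet  # pip install chardet

with open("story.txt", "rb") as f:   # read raw bytes, not text
    raw = f.read()

guess = chardet.detect(raw)          # e.g. {'encoding': 'utf-8', 'confidence': 0.99, ...}
text = raw.decode(guess["encoding"] or "utf-8", errors="replace")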
Character frequency counting is pretty straightforward.
I just noticed that you are using Python 3.1, so this is even easier:
>>> from collections import Counter
>>> Counter("Μεταλλικα")
Counter({'α': 2, 'λ': 2, 'τ': 1, 'ε': 1, 'ι': 1, 'κ': 1, 'Μ': 1})
For older versions of Python:
>>> from collections import defaultdict
>>> letter_freq=defaultdict(int)
>>> unistring = "Μεταλλικα"
>>> for uc in unistring: letter_freq[uc]+=1
...
>>> letter_freq
defaultdict(<class 'int'>, {'τ': 1, 'α': 2, 'ε': 1, 'ι': 1, 'λ': 2, 'κ': 1, 'Μ': 1})
I have provided some conditional answers; however, your question is a little vague and inconsistent. Please edit your question to provide answers to my questions below.
(1) You say that the file is ASCII but you want to detect an encoding? Huh? Isn't the answer "ascii"?? If you really need to detect an encoding, use chardet
(2) Automatically translate what? encoding? language? If language, do you know what the input language is or are you trying to detect that also? To detect language, try guess-language ... note that it needs a tweak for better detection of Japanese. See this SO topic which notes the Japanese problem and also highlights that for ANY language-guesser, you need to remove all HTML/XML/Javascript/etc noise from your text otherwise it will heavily bias the result towards ASCII-only languages like English (or Catalan!).
(3) You are talking about a "letter-frequency library" ... you are going to use this library to do what? If language guessing, it appears that using frequency of single letters is not much help distinguishing between languages which use the same (or almost the same) character set; one needs to use the frequency of three-letter groups ("trigrams").
(4) Your questions on punctuation and spaces: it depends on your purpose (which we are not yet sure of). If the purpose is language detection, the idea is to standardise the text; e.g. replace all runs that are not (letter or apostrophe) with a single space, then remove any leading/trailing whitespace, then add 1 leading and 1 trailing space -- more precision is gained by treating start/end-of-word bigrams as trigrams. Note that, as usual in all text processing, you should decode your input into Unicode immediately and work with Unicode thereafter.
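A small sketch of that standardisation step (ASCII letters only for brevity; for other alphabets you would want a Unicode-aware letter class):

import re
from collections import Counter

def standardize(text):
    # Replace every run that is not a letter or apostrophe with a single space,
    # strip, then add one leading and one trailing space so that word-boundary
    # bigrams show up as trigrams.
    cleaned = re.sub(r"[^A-Za-z']+", " ", text).strip()
    return " " + cleaned + " "

def trigram_counts(text):
    s = standardize(text)
    return Counter(s[i:i + 3] for i in range(len(s) - 2))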
