I want to pull abstracts out of a large corpus of scientific papers using a Python script. The papers are all saved as strings in a large CSV. I want to do something like this: extract the text between two headers. I can write a regex to find the 'Abstract' heading. However, finding the next section heading is proving difficult. Headers vary wildly from paper to paper. They can be ALL CAPS or Just Capitalized. They can be one word or a long phrase, and they can span two lines. They are usually followed by one or two newlines. This is what I came up with:
abst = re.findall(r'(?:ABSTRACT\s*\n+|Abstract\s*\n+)(.*?)((?:[A-Z]+|(?:\n(?:[A-Z]+|(?:[A-Z][a-z]+\s*)+)\n+)',row[0],re.DOTALL)
Here is an example of an abstract:
'...\nAbstract\nFactorial Hidden Markov Models (FHMMs) are powerful models for
sequential\ndata but they do not scale well with long sequences. We
propose a scalable inference and learning algorithm for FHMMs that
draws on ideas from the stochastic\nvariational inference, neural
network and copula literatures. Unlike existing approaches, the
proposed algorithm requires no message passing procedure among\nlatent
variables and can be distributed to a network of computers to speed up
learning. Our experiments corroborate that the proposed algorithm does
not introduce\nfurther approximation bias compared to the proven
structured mean-field algorithm,\nand achieves better performance with
long sequences and large FHMMs.\n\n1\n\nIntroduction\n\n...'
So I'm trying to find 'Abstract' and 'Introduction' and pull out the text that is between them. However, it could be 'ABSTRACT' and 'INTRODUCTION', or 'ABSTRACT' and 'A SINGLE LAYER NETWORK AND THE MEAN FIELD\nAPPROXIMATION\n'.
Help?
Recognizing the next section is a bit vague - perhaps we can rely on the Abstract section ending with two newlines?
ABSTRACT\n(.*)\n\n
Or maybe we'll just assume that the next section title will start with an uppercase letter and be followed by any number of word characters. (That's rather vague too, and it assumes there'll be no \n\n within the Abstract.)
ABSTRACT\n(.*)\n\n\U[\w\s]*\n\n
Maybe that stimulates further fiddling on your end... Feel free to post examples where this did not match - maybe we can stepwise refine it.
N.B.: as Wiktor pointed out, I did not include the case-insensitive modifiers, so the whole regex should be used with the switch for case-insensitive matching enabled.
Update 1: the challenge here is really how to identify that a new section has begun, and not to confuse that with paragraph breaks within the Abstract. Perhaps that can also be dealt with by changing the rather tolerant [\w\s]* to [\w\s]{1,100}, which would only recognize text in a new paragraph as the title of the section following the abstract if it had between 2 and 101 characters (the leading uppercase character matched by \U plus 1-100 more).
ABSTRACT\n(.*)\n\n\U[\w\s]{1,100}\n\n
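For reference, here is a minimal, hedged sketch of how the last pattern above might be applied in Python. Python's re module has no \U uppercase class, so [A-Z] stands in for it, and the case-insensitive flag is scoped to the 'Abstract' heading only (applying it globally would also make [A-Z] match lowercase letters). The CSV layout (full text in column 0 of papers.csv) is an assumption taken from the question.

import csv
import re

# Grab everything after an "Abstract" heading up to the first blank-line-delimited
# block of 1-100 word/space characters that starts with an uppercase letter
# (treated as the next section heading). (?i:...) scopes case-insensitivity to
# the heading so [A-Z] keeps meaning "uppercase".
ABSTRACT_RE = re.compile(
    r'(?i:abstract)\s*\n+(.*?)\n\n[A-Z][\w\s]{1,100}\n\n',
    re.DOTALL,
)

def extract_abstract(paper_text):
    """Return the abstract text, or None if the pattern does not match."""
    match = ABSTRACT_RE.search(paper_text)
    if match is None:
        return None
    # Collapse line breaks inside the abstract into single spaces. Note that a
    # stray section number (the "\n\n1" before "Introduction" in the example)
    # may remain at the end of the capture.
    return re.sub(r'\s*\n\s*', ' ', match.group(1)).strip()

# Hypothetical usage: papers.csv stores one paper per row, full text in column 0.
with open('papers.csv', newline='', encoding='utf-8') as f:
    for row in csv.reader(f):
        abstract = extract_abstract(row[0])
        if abstract:
            print(abstract[:80], '...')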
I want to generate homophones of words programmatically. Meaning, words that sound similar to the original words.
I've come across the Soundex algorithm, but it just replaces some characters with other characters (like t instead of d). Are there any lists or algorithms that are a little bit more sophisticated, providing at least homophone substrings?
Important: I want to apply this on words that aren't in dictionaries, meaning that I can't rely on whole, real words.
EDIT:
The input is a string which is often a proper name and therefore in no standard (homophone) dictionary. An example could be Google or McDonald's (just to name two popular named entities, but many are much more unpopular).
The output is then a (random) homophone of this string. Since words often have more than one homophone, a single (random) one is my goal. In the case of Google, a homophone could be gugel, or MacDonald's for McDonald's.
How to do this well is a research topic. See for example http://www.inf.ufpr.br/didonet/articles/2014_FPSS.pdf.
But suppose that you want to roll your own.
The first step is figuring out how to turn the letters that you are given into a representation of what the word sounds like. This is a very hard problem that requires guessing. (E.g. what sound does "read" make? It depends on whether you are going to read, or you already read!) However, text to phonemes converter suggests that Arpabet has solved this for English.
Next you'll want this to have been done for every word in a dictionary. Assuming that you can do that for one word, that's just a script.
Then you'll want it stored in a data structure where you can easily find similar sounds. That is in principle no different from the sort of algorithms that are used for spelling autocorrect, only with phonemes instead of letters. You can get a sense of how to do that with http://norvig.com/spell-correct.html. Or try to implement something like what is described in http://fastss.csg.uzh.ch/ifi-2007.02.pdf.
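To make the shape of that pipeline concrete, here is a rough sketch (not a full solution) using the CMU Pronouncing Dictionary that ships with NLTK (ARPAbet phonemes) and a plain edit distance over phoneme sequences. For out-of-dictionary words, which is the hard part of the question, you would replace the dictionary lookup with a grapheme-to-phoneme converter. It assumes nltk is installed and the cmudict corpus has been downloaded.

# Requires: pip install nltk, then nltk.download('cmudict') once.
from nltk.corpus import cmudict

PRONUNCIATIONS = cmudict.dict()  # word -> list of phoneme sequences

def edit_distance(a, b):
    """Levenshtein distance over two sequences (here: lists of phonemes)."""
    dp = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        prev, dp[0] = dp[0], i
        for j, y in enumerate(b, 1):
            prev, dp[j] = dp[j], min(dp[j] + 1, dp[j - 1] + 1, prev + (x != y))
    return dp[-1]

def near_homophones(word, max_distance=1):
    """Dictionary words whose phoneme sequence is close to that of `word`."""
    targets = PRONUNCIATIONS.get(word.lower())
    if not targets:
        return []  # out-of-dictionary: plug a grapheme-to-phoneme converter in here
    results = set()
    # Linear scan for clarity; a real implementation would index the phoneme
    # sequences (e.g. with a deletion-neighbourhood scheme as in the FastSS paper).
    for other, prons in PRONUNCIATIONS.items():
        if any(edit_distance(t, p) <= max_distance for t in targets for p in prons):
            results.add(other)
    return sorted(results - {word.lower()})

print(near_homophones("bored"))  # includes 'board', which shares the same phonemes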
And that is it.
FontTools is producing some XML with all sorts of details, in this structure:
<cmap>
<tableVersion version="0"/>
<cmap_format_4 platformID="0" platEncID="3" language="0">
<map code="0x20" name="space"/><!-- SPACE -->
<!--many, many more characters-->
</cmap_format_4>
<cmap_format_0 platformID="1" platEncID="0" language="0">
<map code="0x0" name=".notdef"/>
<!--many, many more characters again-->
</cmap_format_0>
<cmap_format_4 platformID="0" platEncID="3" language="0"> <!--"cmap_format_4" again-->
<map code="0x20" name="space"/><!-- SPACE -->
<!--more "map" nodes-->
</cmap_format_4>
</cmap>
I'm trying to figure out every character this font supports, so these code attributes are what I'm interested in. I believe I am correct in thinking that all code attributes are UTF-8 values: is this correct? I am also curious why there are two cmap_format_4 nodes (they seem to be identical, but I haven't tested that across a thorough number of fonts, so if someone familiar with this module knows for certain, that is my first question).
To be assured I am seeing all characters contained in the typeface, do I need to combine all code attribute values, or just one or two? Will FontTools always produce these three XML nodes, or is the quantity variable? Any idea why? The documentation is a little vague.
The number of cmap_format_N nodes ("cmap subtables") is variable, as is the N (the format). There are several formats; the most common is format 4, but there are also format 12, format 0, format 6, and a few others.
Fonts may have multiple cmap subtables, but are not required to. The reason for this is the history of the development of TrueType (which has evolved into OpenType). The format was invented before Unicode, at a time when each platform had its own way(s) of character mapping. The different formats and the ability to have multiple mappings were a necessity at the time in order to have a single font file that could map everything without multiple files, duplication, etc. Nowadays most fonts that are produced will only have a single Unicode subtable, but there are many floating around that have multiple subtables.
The code values in the map node are code point values expressed as hexadecimal. They might be Unicode values, but not necessarily (see the next point).
I think your font may be corrupted (or possibly there was a copy/paste mix-up). It is possible to have multiple cmap_format_N entries in the cmap, but each combination of platformID/platEncID/language should be unique. Also, it is important to note that not all cmap subtables map Unicodes; some express older, pre-Unicode encodings. You should look at tables where platformID="3" first, then platformID="0" and finally platformID="2" as a last resort. Other platformIDs do not necessarily map Unicode values.
As for discovering "all Unicodes mapped in a font": that can be a bit tricky when there are multiple Unicode subtables, especially if their contents differ. You might get close by taking the union of all code values in all of the subtables that are known to be Unicode maps, but it is important to understand that most platforms will only use one of the maps at a time. Usually there is a preferred picking order similar to what I stated above; when one is found, that is the one used. There's no standardized order of preference that applies to all platforms (that I'm aware of), but most of the popular ones follow an order pretty close to what I listed.
Finally, regarding Unicode vs UTF-8: the code values are Unicode code points; NOT UTF-8 byte sequences. If you're not sure of the difference, spend some time reading about character encodings and byte serialization at Unicode.org.
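If it helps, here is a small sketch of how one might collect the mapped code points with fontTools directly, instead of parsing the dumped XML. getBestCmap() asks fontTools to pick a preferred Unicode subtable (roughly in the spirit of the ordering above), while the loop takes the union of every subtable that claims to be a Unicode map. The font path is a placeholder.

from fontTools.ttLib import TTFont

font = TTFont("MyFont.ttf")          # placeholder path
cmap_table = font["cmap"]

# Option 1: let fontTools pick a preferred Unicode subtable.
best = cmap_table.getBestCmap()      # dict: code point -> glyph name
print(len(best), "code points in the preferred subtable")

# Option 2: union of all subtables that claim to map Unicode code points.
codepoints = set()
for subtable in cmap_table.tables:
    if subtable.isUnicode():
        codepoints.update(subtable.cmap.keys())

for cp in sorted(codepoints)[:10]:
    print(f"U+{cp:04X}", best.get(cp, ""))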
I'm doing some machine extraction of sometimes-garbled PDF text, which often ends up with words incorrectly split up by spaces, or chunks of words put in incorrect order, resulting in pure gibberish.
I'd like a tool that can scan through and recognize these chunks of pure gibberish while skipping non-dictionary words that are likely to be proper names or simply words in a foreign language.
Not sure if this is even possible, but if it is I imagine something like this could be done using NLTK. I'm just wondering if this has been done before to save me the trouble of reinventing the wheel.
Hm, I imagine you could train an SVM or a neural network on character n-grams... but you'd need pretty darn long ones. The problem is that this would probably have a high rate of false negatives (throwing out what you wanted), because character clusters occur at drastically different rates in different languages.
Take Polish, for example (it's my only second language written in easy-to-type Latin characters). Skrzywdy would be a highly unlikely series of letters in English, but is easily pronounceable in Polish.
A better technique might be to use language detection to detect languages used in a document above a certain probability, and then check the dictionaries for those languages...
This won't help for (for instance) a Linguistics textbook where a large variety of snippets of various languages are frequently used.
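To sketch the character n-gram idea from the first paragraph (a toy illustration, not a trained model): score each token by the average log-probability of its character bigrams under a model built from known-good text, and flag low-scoring tokens as likely gibberish. The training file name and the cut-off are placeholders, and as noted above you would want per-language models to keep the false-negative rate down.

import math
import re
from collections import Counter

def train_bigrams(text):
    """Character-bigram log-probabilities estimated from known-good text."""
    text = " " + re.sub(r"[^a-z]+", " ", text.lower()) + " "
    counts = Counter(text[i:i + 2] for i in range(len(text) - 1))
    total = sum(counts.values())
    return {bg: math.log(c / total) for bg, c in counts.items()}

def score(token, model, floor=-12.0):
    """Average bigram log-probability; lower means more gibberish-like."""
    token = " " + token.lower() + " "
    bigrams = [token[i:i + 2] for i in range(len(token) - 1)]
    return sum(model.get(bg, floor) for bg in bigrams) / len(bigrams)

# Placeholder corpus; in practice train on a large sample of clean text.
model = train_bigrams(open("some_clean_english_text.txt").read())
for token in "the quick brown fox jmbv xwqzkt linguistics".split():
    print(token, round(score(token, model), 2))
# Tokens like 'jmbv' or 'xwqzkt' should score well below ordinary words;
# where you put the cut-off is the false-negative trade-off discussed above.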
** EDIT **
Idea 2:
You say this is bibliographic information. Meta-information, like its position in the text or any font information your OCR software is returning to you, is almost certainly more important than the series of characters you see showing up. If it's in the title, or near the position where the author goes, or in italics, it's worth considering as foreign...
I'm working on deduping a database of people. For a first pass, I'm following a basic 2-step process to avoid an O(n^2) operation over the whole database, as described in the literature. First, I "block": I iterate over the whole dataset and bin each record based on n-grams AND initials present in the name. Second, all the records per bin are compared using Jaro-Winkler to get a measure of the likelihood of their representing the same person.
My problem- the names are Unicode. Some (though not many) of these names are in CJK (Chinese-Japanese-Korean) languages. I have no idea how to find word boundaries for something like initials in these languages. I have no idea whether n-gram analysis is valid on names in languages where names can be 2 characters. I also don't know if string edit-distance or other similarity metrics are valid in this context.
Any ideas from linguist programmers or native speakers?
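(For reference, a minimal sketch of the two-step pipeline described in the question, assuming the jellyfish library for Jaro-Winkler (jellyfish >= 0.8 exposes jaro_winkler_similarity); the sample records, trigram length and 0.9 threshold are placeholders.)

from collections import defaultdict
from itertools import combinations
import jellyfish

def blocking_keys(name, n=3):
    """Character n-grams plus initials, as described in the question."""
    compact = name.lower().replace(" ", "")
    keys = {compact[i:i + n] for i in range(len(compact) - n + 1)}
    keys.add("".join(part[0] for part in name.lower().split()))
    return keys

def candidate_pairs(records):
    """Step 1: bin records by blocking key, then yield each within-bin pair once."""
    blocks = defaultdict(list)
    for rec_id, name in records:
        for key in blocking_keys(name):
            blocks[key].append((rec_id, name))
    seen = set()
    for bucket in blocks.values():
        for (id1, n1), (id2, n2) in combinations(bucket, 2):
            if (id1, id2) not in seen:
                seen.add((id1, id2))
                yield (id1, n1), (id2, n2)

# Step 2: compare only within-block pairs with Jaro-Winkler.
records = [(1, "Jane Smith"), (2, "Jayne Smyth"), (3, "John Doe")]
for (id1, n1), (id2, n2) in candidate_pairs(records):
    sim = jellyfish.jaro_winkler_similarity(n1.lower(), n2.lower())
    if sim > 0.9:
        print(id1, id2, round(sim, 3))   # flags (1, 2) as a likely duplicate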
Some more information regarding Japanese:
When it comes to splitting the names into family name and given name, morphological analyzers like mecab (mentioned in @Holden's answer) basically work, but the level of accuracy will not be very high, because they will only get those names right that are in their dictionary (the statistical 'guessing' capabilities of mecab mostly relate to POS tags and to dealing with ambiguous dictionary entries; if a proper noun is not in the dictionary, mecab will most of the time split it into individual characters, which is almost always wrong). To test this, I used a random list of names on the web (this one, which contains 113 people's names), extracted the names, removed whitespace from them and tested mecab using IPAdic. It got approx. 21% of the names wrong.
'Proper' Japanese names, i.e. names of Japanese people, consist of a family name (most of the time 2, but sometimes 1 or 3, Kanji) and a given name (most of the time 1 or 2, sometimes 3 Kanji, but sometimes 2-5 Hiragana instead). There are no middle names and there is no concept of initials. You could improve the mecab output by (1) using a comprehensive dictionary of family names, which you could build from web resources, and (2) assuming the output is wrong whenever there are more than 2 elements, then using your self-made family name dictionary to recognise the family name part, and if that fails, falling back to default splitting rules based on the number of characters. The latter will not always be accurate.
Of course foreign names can be represented in Japanese, too. Firstly, there are Chinese and Korean names, which are typically represented using Kanji, i.e. whatever splitting rules for Chinese or Korean you use can be applied more or less directly. Western as well as Arabic or Indian names are either represented using Latin characters (possibly full-width, though), or Katakana characters, often (but not always) using white space or a middle dot ・ between family name and given name. While for names of Japanese, Chinese or Korean people the order in Japanese representation will always be family name, then given name, the order for Western names is hard to predict.
Do you even need to split names into family and given part? For the purposes of deduplication / data cleansing, this should only be required if some of the possible duplicates appear in different order or with optional middle initials. None of this is possible in Japanese names (nor Chinese, nor Korean names for that matter). The only thing to keep in mind is that if you are given a Katakana string with spaces or middle dots in it, you are likely dealing with a Western name, in which case splitting at the space / middle dot is useful.
While splitting is probably not really required, you must take care of a number of other issues not mentioned in the previous answers:
5.1 Transliteration of foreign names. Depending on how your database was constructed, there may be situations that involve a Western name, say 'Obama', in one entry, and the Japanese Katakana representation 'オバマ' in a duplicate entry. Unfortunately, the mapping from Latin to Katakana is not straightforward, as Katakana tries to reflect the pronunciation of the name, which may vary depending on the language of origin and the accent of whoever pronounces it. E.g. somebody who hears the name 'Obama' for the first time may be tempted to represent it as 'オバーマ' to emphasize the long vowel in the middle. Solving this is not trivial and will never work perfectly accurately, but if you think it is important for your cleansing problem, let's address it in a separate question.
5.2 Kanji variation. Japanese names (as well as Japanese representations of some Chinese or Korean names) use Kanji that are considered traditional versions of modern Kanji. For example, many common family names contain 澤, which is a version of 沢; the family name Takazawa may therefore be written as 高沢 or 高澤. Usually only one is the correct variant used by any particular person of that name, but it is not uncommon that the wrong variant is used in a database entry. You should therefore definitely normalise traditional variants to modern variants before comparing names. This web page provides a mapping that is certainly not comprehensive, but is probably good enough for your purposes.
5.3 Both Latin and Katakana characters exist in full-width as well as half-width variants. For Katakana the full-width form and for Latin the half-width form is commonly used, but there is no guarantee. You should normalise all Katakana to full-width and all Latin to half-width before comparing names (see the sketch after this list).
5.4 Perhaps needless to say, but there are various versions of whitespace characters, which you also must normalise before comparing names. Moreover, in a pure Kanji sequence, I recommend removing all whitespace before comparing.
5.5 As said, some first names (especially female ones) are written in Hiragana. It may happen that those same names are written in Katakana in some instances. A mapping between Hiragana and Katakana is trivially possible. You should consider normalising all Kana (i.e. Hiragana and Katakana) to a common representation (either Hiragana or Katakana) before making any comparisons.
5.6 It may also happen that some Kanji names are represented using Kana. This is because whoever made the database entry might not have known the correct Kanji for the name (especially with first names, guessing the correct Kanji after hearing a name, e.g. on the phone, is very often impossible even for native speakers). Unfortunately, mapping between Kanji representations and Kana representations is very difficult and highly ambiguous; for example 真, 誠 and 実 are all possible Kanji for the first name 'Makoto'. Any individual of that name will consider only one of them correct for himself, but it is impossible to know which one if the only thing you know is that the name is 'Makoto'. But Kana is sound-based, so all three versions are the same マコト in Katakana. Dictionaries built into morphological analyzers like mecab provide mappings, but because there is more than one possible Kanji for any Kana sequence and vice versa, actually using this during data cleansing will complicate your algorithm quite a lot. Depending on how your database was created in the first place, this may or may not be a relevant problem.
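A small sketch of the mechanical parts of 5.3-5.5 (an illustration, not a complete cleansing routine): Unicode NFKC normalisation folds half-width Katakana to full-width and full-width Latin/digits to ASCII, whitespace variants and the middle dot are collapsed, and Hiragana is shifted to Katakana via the fixed 0x60 code-point offset between the two blocks. The Kanji-variant mapping of 5.2 would need an external table and is not included.

import re
import unicodedata

# Hiragana U+3041-U+3096 maps onto Katakana U+30A1-U+30F6 at a fixed offset.
HIRAGANA_TO_KATAKANA = {code: code + 0x60 for code in range(0x3041, 0x3097)}

def normalise_japanese_name(name):
    name = unicodedata.normalize("NFKC", name)     # width normalisation (5.3)
    name = name.translate(HIRAGANA_TO_KATAKANA)    # Kana normalisation (5.5)
    name = name.replace("・", " ")                  # middle dot -> space
    name = re.sub(r"\s+", " ", name).strip()       # whitespace variants (5.4)
    return name

print(normalise_japanese_name("たかはし\u3000まこと"))   # -> タカハシ マコト
print(normalise_japanese_name("Ｈ・Ｓ・フリードマン"))    # -> H S フリードマン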
Edit specifically about publication author names: Japanese translations of non-Japanese books usually have the author name transliterated to Katakana. E.g. the book recommendation list of the Asahi newspaper has 30 books today; 7 have a Western author name in Katakana. They even have abbreviated first names and middle initials, which they keep in Latin, e.g.
H・S・フリードマン and L・R・マーティン
which corresponds to
H.S. Friedman (or Friedmann, or Fridman, or Fridmann?)
and
L.R. Martin (or Matin, or Mahtin?)
I'd say this exemplifies the most common way to deal with non-Japanese author names of books:
Initials are preserved as Latin
Unabbreviated parts of the name are given in Katakana (but there is no uniquely defined one-to-one mapping between Latin and Katakana, as described in 5.1)
The order is preserved: First, middle, surname. That is a very common convention for author names, but in something like a customer database that may be different.
Either whitespace, or middle dot (as above), or the standard ASCII dot are used to separate the elements
So as long as your project is related to author names of books, I believe the following is accurate with regards to non-Japanese authors:
The same author may appear in a Latin representation (in a non-Japanese entry) as well as a Katakana representation (in a Japanese entry). To be able to determine that two such entries refer to the same author, you'll need to map between Katakana and Latin. That is a non-trivial problem, but not a totally insurmountable one either (although it will never work 100% correctly). I am unsure if a good solution is available for free; but let's address this in a separate question (perhaps with the japanese tag) if required.
Even if for some reason we can assume that there are no Latin duplicates of Katakana names, there is still a good chance that there are multiple variants in Katakana (due to 5.1). However, for author names (in particular of well-known authors), it may be safe to assume that the amount of variation is relatively limited. Hence, for a start, it may be sufficient to normalize dots and whitespace.
Splitting into first and last name is trivial (whitespace and dots), and the order of names will generally be the same across all variants.
Western authors will generally not be represented using Kanji. There are a few people who consider themselves so closely related to Japan that they choose Kanji for their own name (it's a matter of choice, not just transliteration, because the Kanji carry meaning), but that will be so rare that it is hardly worth worrying about.
Now regarding Japanese authors, those will be represented in Kanji as described in part 2 of the main answer. In Western translations of their books, their name will generally be given in Latin, and the order will be exchanged. For example,
村上春樹
(村上 = Murakami, the family name, 春樹 = Haruki, the given name)
will be represented as
Haruki Murakami
on translations of his books. This kind of mapping between Kanji and Latin requires a very comprehensive dictionary and quite a lot of work. Also, the spelling in Latin cannot always be uniquely determined, even if the reading of the Kanji can. E.g. one of the most frequent Japanese family names, 伊藤, may be spelled 'Ito' as well as 'Itoh' in English. Even 'Itou' and 'Itoo' are not impossible.
If Japanese-Latin cross matching is not required, the only kind of variation amongst the Kanji representations themselves you will see are Kanji variants (5.2). But to be clear, even where a traditional as well as a modern variant of a Kanji exists, only one of them is correct for any given individual. Typing the wrong Kanji variant may easily happen when a phone operator enters names into a database, but in a database of author names this will be relatively rare because the correct spelling of an author can be verified relatively easily.
Regarding the question about 5.6 (Kana vs. Kanji):
Some people's given name has no Kanji representation, only a Hiragana one. Since there is a one-to-one correspondence between Hiragana and Katakana, there is a fair chance that both variants appear in a database. I recommend converting all Hiragana to Katakana (or vice versa) before comparing.
However, most people's names are written in Kanji. On the cover of a book, those Kanji will be used, so most likely they will also be used in your database. The only reasons why somebody might input Kana instead of Kanji are: (a) when he/she does not know the correct Kanji (perhaps unlikely since you can easily search Amazon or whatever to find out), (b) when the database is made for search purposes. Search engines for book catalogues might include Katakana versions because that enables users to find authors even if they don't know the correct Kanji. Hence, whether or not you need Kanji-Kana conversion (which is a hard problem) depends on the original purpose of the data and how the database was created.
Regarding nicknames: There are nicknames used in daily conversation, but I doubt you would find them in an author database. I realize there are languages (e.g. Polish) that use nicknames or diminutives (e.g. 'Gosia' instead of 'Małgorzata') in an almost regular way, but I wouldn't say that is the case with Japanese.
Regarding Chinese: I am unable to give a comprehensive answer, but at least the whole Kanji-Kana variation problem does not exist, because Chinese uses Kanji (under the name of Hanzi) only. There is a major Kanji variation problem, however (especially between traditional variants (used in Taiwan) and simplified variants (used on the mainland)).
Regarding Korean: As far as I know, Koreans are generally able to write their name in Hanja (= Kanji), although they don't use Hanja for most of the rest of the language most of the time; but there is obviously a Hangul version of the name, too. I am unsure to what extent Hanja-Hangul conversion is required for a cleansing problem like yours. If it is, it will be a very hard problem.
Regarding regional variants: There are no regional variants of the Kanji characters themselves in Japanese (at least not in modern times). The Kanji of any given author will be written in the same way all over Japan. Of course there are certain family names that are more frequent in one region than another, though. If you are interested in the names themselves (rather than the people they refer to), regional variants (as well as variation between traditional and modern forms of the Kanji) will play a role.
For Chinese, most names consist of 3 characters: the first character is the family name (!), and the other two characters are the personal name, like
Mao Zedong = family name Mao and personal name Zedong.
There are also some 2-character names; then the first character is the family name and the second character is the personal name.
4-character names are rare, but then the split is usually 2-2.
Seeing this, it does not really make much sense to do n-gram analysis of Chinese names - you would just be rediscovering the most common Chinese family/personal names.
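A minimal sketch of the splitting rules above (a hypothetical helper; it ignores the rarer two-character compound family names such as 欧阳, which would need a small lookup table):

def split_chinese_name(name):
    """Return (family_name, personal_name) for a 2-4 character Chinese name."""
    if len(name) == 4:
        return name[:2], name[2:]   # rare 4-character names: split 2-2
    return name[:1], name[1:]       # 2- or 3-character names: 1 + rest

print(split_chinese_name("毛泽东"))  # ('毛', '泽东')  i.e. Mao / Zedong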
So doing bigram-style matching is a common hack for doing search in Japanese, but there are better approaches that you can use to determine word boundaries. In a project I worked on in the past we had fairly good results with mecab for Japanese brand names and some other text. I imagine you could get better performance by training it on a list of Japanese names. Sadly it's written in C++, but we ended up using it anyway from Java through the JNI; you could do something similar in your Python code.
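A short sketch of calling mecab from Python, assuming the mecab-python3 bindings and a dictionary (e.g. IPAdic or unidic-lite) are installed; the exact feature fields depend on the dictionary used.

import MeCab

tagger = MeCab.Tagger()

def tokenize(text):
    """Split text with mecab; returns (surface, part-of-speech) pairs."""
    tokens = []
    for line in tagger.parse(text).splitlines():
        if not line or line == "EOS":
            continue
        surface, features = line.split("\t", 1)
        tokens.append((surface, features.split(",")[0]))
    return tokens

print(tokenize("村上春樹"))  # ideally [('村上', '名詞'), ('春樹', '名詞')]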