FontTools is producing XML with all sorts of details, in this structure:
<cmap>
<tableVersion version="0"/>
<cmap_format_4 platformID="0" platEncID="3" language="0">
<map code="0x20" name="space"/><!-- SPACE -->
<!--many, many more characters-->
</cmap_format_4>
<cmap_format_0 platformID="1" platEncID="0" language="0">
<map code="0x0" name=".notdef"/>
<!--many, many more characters again-->
</cmap_format_0>
<cmap_format_4 platformID="0" platEncID="3" language="0"> <!--"cmap_format_4" again-->
<map code="0x20" name="space"/><!-- SPACE -->
<!--more "map" nodes-->
</cmap_format_4>
</cmap>
I'm trying to figure out every character this font supports, so these code attributes are what I'm interested in. I believe all the code attributes are UTF-8 values: is this correct? I am also curious why there are two cmap_format_4 nodes (they seem to be identical, but I haven't tested that against a thorough number of fonts, so if someone familiar with this module knows for certain, that is my first question).
To be assured I am seeing all characters contained in the typeface, do I need to combine all the code attribute values, or just one or two? Will FontTools always produce these three XML nodes, or is the quantity variable? Any idea why? The documentation is a little vague.
The number of cmap_format_N nodes ("cmap subtables") is variable, as is the N (the format). There are several formats; the most common is format 4, but there are also formats 12, 0, 6, and a few others.
Fonts may have multiple cmap subtables, but are not required to. The reason lies in the history of TrueType (which has evolved into OpenType): the format was invented before Unicode, at a time when each platform had its own way(s) of character mapping. The different formats and the ability to have multiple mappings were a necessity at the time, so that a single font file could map everything without multiple files, duplication, etc. Nowadays most newly produced fonts have only a single Unicode subtable, but there are many floating around that have several.
The code values in the map nodes are code point values expressed in hexadecimal. They might be Unicode values, but not necessarily (see the next point).
I think your font may be corrupted (or possibly there was a copy/paste mix-up). It is possible to have multiple cmap_format_N entries in the cmap, but each combination of platformID/platEncID/language should be unique. Also, it is important to note that not all cmap subtables map Unicode; some express older, pre-Unicode encodings. You should look at subtables where platformID="3" first, then platformID="0", and finally platformID="2" as a last resort. Other platformIDs do not necessarily map Unicode values.
As for discovering "all Unicodes mapped in a font": that can be a bit tricky when there are multiple Unicode subtables, especially if their contents differ. You might get close by taking the union of all code values in all of the subtables that are known to be Unicode maps, but it is important to understand that most platforms will only use one of the maps at a time. Usually there is a preferred picking order similar to what I stated above; when one is found, that is the one used. There's no standardized order of preference that applies to all platforms (that I'm aware of), but most of the popular ones follow an order pretty close to what I listed.
Finally, regarding Unicode vs UTF-8: the code values are Unicode code points; NOT UTF-8 byte sequences. If you're not sure of the difference, spend some time reading about character encodings and byte serialization at Unicode.org.
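If it helps, here is a minimal sketch of how you could gather that union directly with fontTools rather than via the XML dump, assuming a reasonably recent fontTools; "MyFont.ttf" is a placeholder path:
from fontTools.ttLib import TTFont

# Sketch: collect the Unicode code points mapped by the font's Unicode subtables.
font = TTFont("MyFont.ttf")
codepoints = set()
for subtable in font["cmap"].tables:
    if subtable.isUnicode():        # platform 0, or platform 3 with a Unicode encoding
        codepoints.update(subtable.cmap.keys())

print([hex(cp) for cp in sorted(codepoints)])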
I have data (mostly a series of numpy arrays) that I want to convert into text that can be copied/pasted/emailed, etc. I created the following function, which does this.
import base64 as b64
import pickle
import zlib

def convert_to_ascii85(x):
    # pickle the object, compress the bytes, then Base85-encode the result
    p = pickle.dumps(x)
    p = zlib.compress(p)
    return b64.b85encode(p)
My issue is that the string it produces is longer than it needs to be, because it only uses a subset of letters, numbers, and symbols. If I were able to encode using Unicode, I feel it could produce a shorter string because it would have access to more characters. Is there a way to do this?
Edit to clarify:
My goal is NOT the smallest amount of data/information/bytes. My goal is the smallest number of characters. The reason is that the channel I'm sending the data through is capped by characters (100k to be precise) instead of bytes (strange, I know). I've already tested that I can send 100k unicode characters, I just don't know how to convert my bytes into unicode.
UPDATE: I just saw that you changed your question to clarify that you care about character length rather than byte length. This is a really strange constraint; I've never heard of it before and don't quite know what to make of it. But if that's your need, and you want predictable blocking behavior, then your problem is pretty simple: pick a compatible character encoding that can represent the most unique characters, then map blocks of your binary across that character set such that each block is as long as possible while still having no more distinct values than the encoding has representable characters. Each such block then becomes a single character. Since this constraint is rather unusual, I don't know whether there are libraries out there that do this.
UPDATE 2: Being curious about the above myself, I just Googled and found this: https://qntm.org/unicodings. If your tools and communication channels can deal with UTF-16 or UTF-32, then you might be onto something in seeking to use them. If so, I hope this article opens up the solution you're looking for. I think the article is still optimizing for byte length rather than character length, so it may not provide the optimal solution, but it can only help (16 or more bits per character rather than the roughly 6.4 you get from Base85). I couldn't find anything that optimizes on character count alone, but maybe a code-point-packing scheme like Base65536 is your answer. Check out https://github.com/qntm/base65536 .
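To make the idea concrete, here is a hedged sketch of that block-to-character mapping: each 16-bit chunk of the compressed pickle becomes one code point in the supplementary planes. The 0x10000 offset is an arbitrary choice that stays clear of the surrogate range, and the function names are mine, not from any library; this illustrates the idea rather than being a vetted encoding like Base65536.
import pickle, zlib

BASE = 0x10000  # start of the supplementary planes; avoids the surrogate range

def pack_to_text(obj):
    data = zlib.compress(pickle.dumps(obj))
    pad_flag = "1" if len(data) % 2 else "0"
    if pad_flag == "1":                 # pad to an even length, remember that we did
        data += b"\x00"
    chars = [chr(BASE + int.from_bytes(data[i:i + 2], "big"))
             for i in range(0, len(data), 2)]
    return pad_flag + "".join(chars)    # roughly one character per two bytes

def unpack_from_text(text):
    pad_flag, chars = text[0], text[1:]
    data = b"".join((ord(c) - BASE).to_bytes(2, "big") for c in chars)
    if pad_flag == "1":
        data = data[:-1]
    return pickle.loads(zlib.decompress(data))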
If it is byte length that you care about, and you want to stick to what is usually meant by "printable characters" or "plain printable text", then here's my original answer...
There are options for getting better "readable text" space efficiency than Base85, from other encodings. There's also a case to be made for giving up some space efficiency and going with Base64. Here I'll make the case for both Base85 and Base64: if you can use Base85, you take only a 25% hit on the inflation of your binary, and you save a whole lot of headaches in doing so.
Base85 is pretty close to the best you're going to do if you seek to encode arbitrary binary as "plain text", and it is the BEST you can do if you want a "plain text" encoding that you can logically break into meaningful, predictable chunks. In theory you could use a character set that includes printable characters in the high-ASCII range, but experience has shown that many tools and communication channels don't deal well with high ASCII if they can't handle straight binary. And you don't gain much from the extra 5 or so bits per 4 binary bytes you could potentially squeeze out by using an 8-bit (256-character) set instead of 7-bit (128-character) ASCII.
For any BaseXX encoding, the algorithm takes incoming binary bits and encodes them as tightly as it can using the XX printable characters it has at its disposal. Base85 will be more compact than Base64 because it uses more of the printable characters (85) than Base64 does (64 characters).
There are 95 printable characters in standard ASCII, so a Base95 would be the most compact encoding possible using all of the printable characters. But trying to use all 95 characters is messy, because it leads to uneven blocking of the incoming bits: each 4 binary bytes would map to some fractional number of characters less than 5.
It turns out that 85 characters is exactly what you need to encode 4 bytes as 5 printable characters. Many are happy to give up that last bit of compactness in exchange for the guarantee that every 4 binary bytes encode to exactly 5 ASCII characters, which is only a 25% inflation of the binary. That's not bad at all for all of the headaches it saves; hence the motivation behind Base85.
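A quick sanity check of that arithmetic:
# Five base-85 digits are just enough to cover every 32-bit value,
# while five base-84 digits are not.
print(85 ** 5 >= 2 ** 32)   # True  (4437053125 >= 4294967296)
print(84 ** 5 >= 2 ** 32)   # False (4182119424 <  4294967296)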
Base64 produces longer but even less problematic encodings. Characters that cause trouble in various text formats, like HTML, XML, JSON, etc., are not used, so Base64 output is usable in almost any context without escaping. You have to be more careful with Base85, as it doesn't throw out any of these problematic characters. For encoding/decoding efficiency it uses the contiguous range 33 ("!") through 117 ("u"), starting at 33 rather than 32 simply to avoid the often problematic space character. The printable characters above "u" that it doesn't use are nothing special.
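If you want to see the difference on your own data, a quick comparison using Python's standard base64 module (which provides both encoders):
import base64, os

blob = os.urandom(4000)               # 4000 arbitrary binary bytes
print(len(base64.b64encode(blob)))    # 5336 characters, about 33% bigger
print(len(base64.b85encode(blob)))    # 5000 characters, exactly 25% bigger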
So that's pretty much the story on the binary-to-ASCII encoding side. The other question is what you can do to reduce the size of what you're representing before encoding its binary representation to ASCII. You're choosing to use pickle.dumps() and zlib.compress(); whether those are your best choices is left for another discussion...
Assume that I have some text (for example, given as a string). Later I am going to "edit" this text, meaning I want to add something somewhere or remove something; in this way I will get another version of the text. However, I do not want to keep two strings representing the two versions, since there is a lot of repetition (similarity) between two subsequent versions. In other words, the differences between the strings are small, so it makes more sense to save just the differences between them. For example, the first version:
This is my first version of the texts.
The second version:
This is the first version of the text, that I want to use as an example.
I would like to save these two versions as one object (it does not necessarily have to be XML; I use it just as an example):
This is the <removed>my</removed> <added>first</added> version of the text<added>, that I want to use as an example</added>.
Now I want to go further. I want to save all subsequent edits as one object. In other words, I am going to have more than two versions of the text, but I would like to save them as one object, such that it is easy to get a given version of the text and easy to find out what the differences are between two subsequent (or any two given) versions.
So, to summarize, my question is: what is the standard way to represent changes in a text and to work with this representation in Python?
I would probably go with difflib: https://docs.python.org/2/library/difflib.html
You can use it to represent changes between versions of a string and create your own class to store consecutive diffs.
EDIT: I just realised this doesn't really make sense in your use case, since the diffs from difflib essentially store both strings, so you would be better off just storing the versions themselves. However, I believe this is the standard (library-wise) way of working with changes in text, so I won't delete this answer.
EDIT 2: Although it seems that if you find a way to apply unified_diff output back to a string, that may be your answer. There is no way to do this with difflib yet: https://bugs.python.org/issue2057
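For what it's worth, a minimal sketch of generating such a diff with difflib, using the two example versions from the question (note that the diff text still contains the changed lines themselves, so this only pays off for small edits to large texts):
import difflib

v1 = "This is my first version of the texts.\n"
v2 = "This is the first version of the text, that I want to use as an example.\n"

diff = difflib.unified_diff(v1.splitlines(keepends=True),
                            v2.splitlines(keepends=True),
                            fromfile="v1", tofile="v2")
print("".join(diff))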
Some time ago I started writing a BNF-based grammar for the cables which WikiLeaks released. However, I now realize that my approach may not be the best, and I'm looking for some improvement.
A cable consists of three parts. The head has some RFC 2822-style format, which usually parses correctly. The text part has a more informal specification; for instance, there is a REF line that should start with REF:, but I have found different versions. The following regex catches most cases: ^\s*[Rr][Ee][Ff][Ss: ]. So there are leading spaces, different cases, and so on. The text part is mostly plain text with some specially formatted headings.
We want to recognize each field (date, REF, etc.) and put it into a database. We chose Python's SimpleParse. At the moment the parser stops at each field it doesn't recognize. We are now looking for a more fault-tolerant solution. All fields appear in some kind of order; when the parser doesn't recognize a field, it should add some 'not recognized' blob to the current field and go on (or maybe you have a better approach here).
What kind of parser or other kind of solution would you suggest? Is something better around?
Cablemap seems to do what you're searching for: http://pypi.python.org/pypi/cablemap.core/
I haven't looked at the cables, but let's take a similar problem and consider the options: say you wanted to write a parser for RFCs. There's an RFC for the formatting of RFCs, but not all RFCs follow it.
If you write a strict parser, you'll run into the situation you've run into: the outliers will halt your progress. In this case you've got two options:
Split them into two groups: the ones that are strictly formatted and the ones that aren't. Write your strict parser so that it gets the low-hanging fruit, and figure out, based on the number of outliers, what the best options are (hand processing, an outlier parser, etc.).
If the two groups are roughly equal in size, or there are more outliers than standard formats, write a flexible parser. In this case regular expressions are going to be more beneficial to you, as you can process an entire file looking for a series of flexible regexes; if one of the regexes fails, you can easily generate the outlier list. And since you match against a series of regexes, you can build a matrix of passes/fails for each regex.
For 'fuzzy' data where some entries follow the format and some don't, I much prefer the regex approach; that's just me, though. (Yes, it is slower, but having to engineer the relationship between each match segment so that a single query (or parser) fits every corner case is a nightmare when dealing with human-generated input.) A minimal sketch of the idea is below.
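As a hedged illustration only (the field names and patterns here are mine, not a real cable grammar), the "series of flexible regexes" idea might look like this, with unmatched fields collected instead of halting the run:
import re

FIELD_PATTERNS = {
    # deliberately forgiving patterns: optional leading space, any case,
    # optional trailing S, colon or whitespace as the separator
    "ref":  re.compile(r"^\s*REFS?\s*[:\s]\s*(.*)$", re.IGNORECASE | re.MULTILINE),
    "date": re.compile(r"^\s*DATE\s*[:\s]\s*(.*)$", re.IGNORECASE | re.MULTILINE),
}

def extract_fields(cable_text):
    found, unrecognized = {}, []
    for name, pattern in FIELD_PATTERNS.items():
        match = pattern.search(cable_text)
        if match:
            found[name] = match.group(1).strip()
        else:
            unrecognized.append(name)      # record the miss and keep going
    return found, unrecognized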
I'm trying to do a somewhat sophisticated diff between individual rows in two CSV files. I need to ensure that a row from one file does not appear in the other file, but I am given no guarantee of the order of the rows in either file. As a starting point, I've been trying to compare the hashes of the string representations of the rows (i.e. Python lists). For example:
import csv

hashes = []
for row in csv.reader(open('old.csv', 'rb')):
    hashes.append(hash(str(row)))

for row in csv.reader(open('new.csv', 'rb')):
    if hash(str(row)) not in hashes:
        print 'Not found'
But this is failing miserably. I am constrained by artificially imposed memory limits that I cannot change, so I went with hashes instead of storing and comparing the lists directly. Some of the files I am comparing can be hundreds of megabytes in size. Any ideas for a way to accurately compress Python lists so that they can be compared to other lists in terms of simple equality? I.e., a hashing scheme that actually works? Bonus points: why didn't the above method work?
EDIT:
Thanks for all the great suggestions! Let me clarify some things. "Miserable failure" means that two rows which contain the exact same data, after being read in by csv.reader, do not hash to the same value after calling str on the list object. I shall try hashlib, per some suggestions below. I also cannot hash the raw file, since two lines can contain the same data but different characters on the line:
1, 2.3, David S, Monday
1, 2.3, "David S", Monday
I am also already doing things like stripping strings to make the data more uniform, but it seems to be to no avail. I'm not looking for extremely smart diff logic, e.g. treating 0 as the same as 0.0.
EDIT 2:
Problem solved. Basically, I needed to do a bit more pre-formatting, like converting ints and floats, and I needed to change my hashing function. Both of these changes did the job for me.
It's hard to give a great answer without knowing more about your constraints, but if you can store a hash for each line of each file then you should be ok. At the very least you'll need to be able to store the hash list for one file, which you then would sort and write to disk, then you can march through the two sorted lists together.
The only reason I can imagine the above not working as written is that your hashing function doesn't always give the same output for a given input. You could test that a second run through old.csv generates the same list. It may have to do with errant spaces, tabs-instead-of-spaces, differing capitalization, "automatic
Mind, even if the hashes are equivalent you don't know that the lines match; you only know that they might match. You still need to check that the candidate lines do match. (You may also get the situation where more than one line in the input file generates the same hash, so you'll need to handle that as well.)
After you fill your hashes variable, you should consider turning it into a set (hashes = set(hashes)) so that your lookups can be faster than linear.
Given the loose syntactic definition of CSV, it is possible for two rows to be semantically equal while being lexically different. The various Dialect definitions give some clue as to how two rows could each be individually well-formed yet incommensurable, and this example shows how they could be in the same dialect and still not be string-equivalent:
0, 0
0, 0.0
More information would help yield a better answer to your question.
More information would be needed on what exactly "failing miserably" means. If you are just not getting a correct comparison between the two, perhaps hashlib might solve that.
I've run into trouble previously when using the built-in hash() function, and solved it with hashlib.
Edit: As someone suggested in another post, the issue could be with assuming that the two files are required to have each line be EXACTLY the same. You might want to try parsing the CSV fields and joining them into a string with identical formatting (maybe trim spaces, force lowercase, etc.) before computing the hash; a rough sketch follows.
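A hedged sketch of that normalize-then-hash idea (Python 3 syntax here, whereas the question uses Python 2; hashlib keeps the digests stable across runs, and the separator and normalization rules are just examples):
import csv, hashlib

def row_digest(row):
    # canonicalize each field, join with a separator unlikely to appear in the data
    canonical = '\x1f'.join(field.strip().lower() for field in row)
    return hashlib.md5(canonical.encode('utf-8')).hexdigest()

with open('old.csv', newline='') as f:
    old_digests = {row_digest(row) for row in csv.reader(f)}   # set: O(1) lookups

with open('new.csv', newline='') as f:
    for row in csv.reader(f):
        if row_digest(row) not in old_digests:
            print('Not found:', row)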
I'm pretty sure that the "failing miserably" line refers to a failure in time that comes from your current algorithm being O(N^2), which is quite bad for how big your files are. As has been mentioned, you can use a set to alleviate this problem (it becomes O(N)), or, if you aren't able to do that for some reason, you can sort the list of hashes and use a binary search on it (it becomes O(N log N), which is also doable). You can use the bisect module if you go the binary-search route.
Also, it has been mentioned that you may have the problem of a clash in the hashes: two lines yielding the same hash when the lines aren't exactly the same. If you discover that this is a problem you are experiencing, you will have to store, with each hash, information about where to find the line corresponding to that hash in the old.csv file, then seek the line out and compare the two lines.
An alternative to your current method is to sort the two files beforehand (using some sort of merge sort to disk, perhaps, or shell sort) and then, keeping a pointer to a line in each file, compare the two lines: check whether they match, and if not, advance whichever line compares as lesser. This algorithm is also O(N log N), as long as an O(N log N) method is used for the sorting. The sorting could also be done by putting each file into a database and having the database sort them. A sketch of the sorted-hashes-plus-bisect variant is below.
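A hedged sketch of the sorted-hashes-plus-bisect route (Python 3; row_digest here is a placeholder for however you choose to normalize and hash a row):
import bisect, csv, hashlib

def row_digest(row):
    return hashlib.md5(str(row).encode('utf-8')).hexdigest()

def digests(path):
    with open(path, newline='') as f:
        for row in csv.reader(f):
            yield row_digest(row)

old = sorted(digests('old.csv'))            # sort once: O(N log N)

for d in digests('new.csv'):
    i = bisect.bisect_left(old, d)          # binary search: O(log N) per row
    if i == len(old) or old[i] != d:
        print('Not found')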
You need to say what your problem really is. Your description "I need to ensure that a row from one file does not appear in the other file" is consistent with the body of your second loop being if hash(...) in hashes: print "Found (an interloper)" rather than what you have.
We can't tell you "why didn't the above method work" because you haven't told us what the symptoms of "failed miserably" and "didn't work" are.
Have you perhaps considered running a sort (if possible)? You'll have to go over the data twice, of course, but it might solve the memory problem.
This is likely a problem with (mis)using hash. See this SO question; as the answers there point out, you probably want hashlib.
I am using Python 3.1, but I can downgrade if needed.
I have an ASCII file containing a short story written in one of the languages whose alphabet can be represented with upper and/or lower ASCII. I wish to:
1) Detect an encoding to the best of my abilities, get some sort of confidence metric (would vary depending on the length of the file, right?)
2) Automatically translate the whole thing using some free online service or a library.
Additional question: What if the text is written in a language where it takes 2 or more bytes to represent one letter and the byte order mark is not there to help me?
Finally, how do I deal with punctuation and miscellaneous characters such as the space? It will occur more frequently than some letters, right? And what about the fact that punctuation and characters can sometimes be mixed: there might be two representations of a comma, two representations of what looks like an "a", etc.?
Yes, I have read the article by Joel Spolsky on Unicode. Please help me with at least some of these items.
Thank you!
P.S. This is not homework; it is for self-educational purposes. I would prefer a letter-frequency library that is open source and readable, as opposed to one that is closed and efficient but gets the job done well.
Essentially there are three main tasks to implement the described application:
1a) Identify the character encoding of the input text
1b) Identify the language of the input text
2) Get the text translated, by way of one of the online services' APIs
For 1a, you may want to take a look at decodeh.py; aside from the script itself, it provides many very useful resources regarding character sets and encodings at large. chardet, mentioned in another answer, also seems worthy of consideration.
Once the character encoding is known, as you suggest, you may solve 1b) by calculating the character-frequency profile of the text and matching it against known frequencies. While simple, this approach typically provides decent precision, although it may be weak on shorter texts and also on texts which follow particular patterns; for example, a text in French with many references to units in the metric system may have an unusually high proportion of the letters M, K and C.
A complementary and very similar approach uses bi-grams (sequences of two letters) and tri-grams (three letters) and the corresponding frequency-distribution reference tables for various languages.
Other language detection methods involve tokenizing the text, i.e. considering the words within the text. NLP resources include tables with the most used words in various languages. Such words are typically articles, possessive adjectives, adverbs and the like.
An alternative solution to the language detection is to rely on the online translation service to figure this out for us. What is important is to supply the translation service with text in a character encoding it understands, providing it the language may be superfluous.
Finally, as with many practical NLP applications, you may decide to implement multiple solutions. By using a strategy design pattern, one can apply several filters/classifiers/steps in a particular order, and exit this logic at different points depending on the situation. For example, if a simple character/bigram frequency check matches the text to English (with a small deviation), one might just stop there; otherwise, if the guessed language is French or German, perform another test, and so on. A rough sketch of the frequency-profile step is below.
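For illustration only, a hedged sketch of that letter-frequency step; the reference profiles below are tiny placeholders, and a real implementation would build them from a corpus:
from collections import Counter

REFERENCE_PROFILES = {          # hypothetical, truncated profiles
    "english": {"e": 0.127, "t": 0.091, "a": 0.082, "o": 0.075, "n": 0.067},
    "german":  {"e": 0.174, "n": 0.098, "i": 0.076, "s": 0.073, "r": 0.070},
}

def guess_language(text):
    letters = [ch.lower() for ch in text if ch.isalpha()]
    counts = Counter(letters)
    total = sum(counts.values()) or 1
    freq = {ch: n / total for ch, n in counts.items()}

    def distance(profile):
        # sum of absolute differences against the reference profile
        return sum(abs(freq.get(ch, 0.0) - p) for ch, p in profile.items())

    return min(REFERENCE_PROFILES, key=lambda lang: distance(REFERENCE_PROFILES[lang]))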
If you have an ASCII file then I can tell you with 100% confidence that it is encoded in ASCII. Beyond that try chardet. But knowing the encoding isn't necessarily enough to determine what language it's in.
As for multibyte encodings, the only reliable way to handle them is to hope the text contains characters from the Latin alphabet and look for which half of each pair holds the NUL byte. Otherwise, treat it as UTF-8 unless you know better (Shift-JIS, GB2312, etc.).
Oh, and UTF-8. UTF-8, UTF-8, UTF-8. I don't think I can stress that enough. And in case I haven't... UTF-8.
Character frequency counting is pretty straightforward.
I just noticed that you are using Python 3.1, so this is even easier:
>>> from collections import Counter
>>> Counter("Μεταλλικα")
Counter({'α': 2, 'λ': 2, 'τ': 1, 'ε': 1, 'ι': 1, 'κ': 1, 'Μ': 1})
For older versions of Python:
>>> from collections import defaultdict
>>> letter_freq=defaultdict(int)
>>> unistring = "Μεταλλικα"
>>> for uc in unistring: letter_freq[uc]+=1
...
>>> letter_freq
defaultdict(<class 'int'>, {'τ': 1, 'α': 2, 'ε': 1, 'ι': 1, 'λ': 2, 'κ': 1, 'Μ': 1})
I have provided some conditional answers; however, your question is a little vague and inconsistent. Please edit your question to provide answers to my questions below.
(1) You say that the file is ASCII but you want to detect an encoding? Huh? Isn't the answer "ascii"?? If you really need to detect an encoding, use chardet
(2) Automatically translate what? The encoding? The language? If the language, do you know what the input language is, or are you trying to detect that as well? To detect language, try guess-language ... note that it needs a tweak for better detection of Japanese. See this SO topic, which notes the Japanese problem and also highlights that for ANY language guesser you need to remove all HTML/XML/JavaScript/etc. noise from your text, otherwise it will heavily bias the result towards ASCII-only languages like English (or Catalan!).
(3) You are talking about a "letter-frequency library" ... what are you going to use this library for? If it's for language guessing, the frequency of single letters is not much help in distinguishing between languages which use the same (or almost the same) character set; one needs to use the frequency of three-letter groups ("trigrams").
(4) Your questions on punctuation and spaces: it depends on your purpose (which we are not yet sure of). If the purpose is language detection, the idea is to standardise the text; e.g. replace every run of anything that is not a letter or apostrophe with a single space, then remove any leading/trailing whitespace, then add 1 leading and 1 trailing space (more precision is gained by treating start/end-of-word bigrams as trigrams). Note that, as usual in all text processing, you should decode your input into Unicode immediately and work with Unicode thereafter. A sketch of this normalisation is below.
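A hedged sketch of that normalisation and trigram extraction (the regex is an approximation: \w also matches digits and the underscore, so a stricter letter class could be substituted):
import re

def trigrams(text):
    # collapse every run that is not a word character or apostrophe to one space
    cleaned = re.sub(r"[^\w']+", " ", text)
    cleaned = " " + cleaned.strip().lower() + " "   # one leading and one trailing space
    return [cleaned[i:i + 3] for i in range(len(cleaned) - 2)]

print(trigrams("Μεταλλικα  rule, don't they?"))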