I want to create a Caesar cipher that can encode/decode Unicode printable characters (single- and multi-codepoint grapheme clusters, emojis, etc.) from the whole of Unicode (except the private use area). Preferably, it will use a list of all printable characters.
NOTE: Even though I want to create a Caesar cipher, it is really not about encryption. The question is about investigating the properties of Unicode.
I found these questions:
What is the range of Unicode Printable Characters?
Cipher with all unicode characters
But I didn't get an answer to what I want.
Note:
If you give a coding answer, I am mostly interested in a solution that
uses either python3 or perl6, as they are my main languages.
Recently, I was given an assignment to write a Caesar cipher and then encode and decode an English text.
I solved it in python by using the string library's built-in string.printable constant. Here is a printout of the constant:
(I used Visual Studio Code)
[see python code and results below]
The documentation says:
'''
String of ASCII characters which are considered printable. This is a combination of digits, ascii_letters, punctuation, and whitespace.
'''
https://docs.python.org/3.6/library/string.html#string-constants
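For context, here is a minimal sketch of the kind of string.printable-based Caesar shift I mean (illustrative only, not my exact assignment code):
import string
# illustrative only: shift each character's index within string.printable;
# alphabet.index() raises ValueError for characters outside string.printable
def caesar(text, shift):
    alphabet = string.printable            # the 100 ASCII characters
    return "".join(alphabet[(alphabet.index(ch) + shift) % len(alphabet)]
                   for ch in text)
encoded = caesar("Hello, World!", 5)
assert caesar(encoded, -5) == "Hello, World!"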
I am wondering how you could create a Caesar cipher that could encode/decode all the possible printable characters you can make from Unicode codepoints (just assume you have all the necessary fonts to see those that should be visible on screen).
Here is my understanding of what it means for
something to be a printable character:
When I take the Python string constant above
and traverse it with the left or right arrow keys
on the keyboard, it takes me exactly 100 strokes to get
to the end (the same as the number of characters).
It looks like there is a one-to-one
correspondence between being a printable
character and being traversable with one stroke of an arrow key.
Now consider this string:
"👨👩👧👦ij
क्षि 🂡"
Based on Python's string.printable constant,
this string seems to me to be composed of the
following 7 printable characters:
(you can look up individual codepoints at: https://unicode-table.com/en/)
1 (family)
2 (Latin Small Ligature Ij)
3 (carriage return)
4 (Devanagari kshi)
5 (space)
6 (Zero Width No-Break Space)
7 (Ace of Spades)
👨👩👧👦
codepoints: 128104 8205 128105 8205 128103 8205 128102
(reference: https://emojipedia.org/family-man-woman-girl-boy/)
(Latin Small Ligature Ij)
ij
codepoint: 307
(Carriage Return)
codepoint: 13
(Devanagari kshi)
क्षि
codepoints: 2325 2381 2359 2367
(see this page: http://unicode.org/reports/tr29/)
(the codepoints on that page seem to be given in hexadecimal rather than decimal)
(Space)
codepoint: 32
(Zero Width No-Break Space)
codepoint: 65279
(AKA U+FEFF BYTE ORDER MARK (BOM))
(https://en.wikipedia.org/wiki/Byte_order_mark)
(Playing Card Ace of Spades)
🂡
codepoint: 127137
When I paste this string into Notepad and try to traverse it with an arrow key,
I end up using 10 key strokes rather than 7,
because the family emoji needs 4 key strokes
(probably because Notepad can't deal with the Zero Width Joiner,
codepoint: 8205, and of course Notepad can't display a family glyph).
On the other hand, when I paste the string into Google search,
I can traverse the whole string with 7 strokes.
Then I tried creating the string
in Perl6 to see what Perl6's grapheme
awareness would make of the string:
(I use the Atom editor)
[see perl6 code and results below]
Perl 6 thinks that the Devanagari kshi character क्षि (4 codepoints)
is actually 2 graphemes, each with 2 codepoints.
Even though it CAN be represented as two characters,
as seen in the above list,
I think this is a bug. Perl 6 is supposed to be grapheme
aware, and even my Windows Notepad (and Google search)
treats it as a single grapheme/character.
Based on the 2 strings,
the practical definition of
a printable character seems to be this:
'It is any combination of Unicode codepoints that can get traversed
by one push of a left or right arrow key on the keyboard
under ideal circumstances'.
"Under ideal circumstances" means that
you are using an environment that, so to speak,
acts like Google search:
that is, it recognizes for example an emoji
(the 4-person family) or a grapheme cluster
(the Devanagari character)
as one printable character.
3 questions:
1:
Is the above a fair definition of what it means
to be a printable character in Unicode?
2:
Regardless of whether you accept the definition,
do you know of any list of printable characters
that covers the currently used Unicode planes and possible
grapheme clusters, rather than just the 100 ASCII characters
in the Python string library
(if I had such a list, I imagine I could create a cipher
quite easily)?
3:
Given that such a list does not exist, and you
accept the definition,
how would you go about creating such a list with which
I could create a Caesar cipher
that could cipher any/all printable
characters, given the following 4 conditions?
NOTE: these 4 conditions are just
what I imagine is required for a proper
Caesar cipher.
condition a
The string to be encrypted will
be a valid UTF-8 string consisting of standard
Unicode codepoints (no unassigned or private use area
codepoints).
condition b
The encrypted string must also be a valid
UTF-8 string consisting of standard
Unicode codepoints.
condition c
You must be able to traverse the encrypted string
using the same number of strokes with
the left or right arrow keys on the keyboard as
the original string
(given ideal circumstances as described above).
This means that both the
man-woman-boy-girl family emoji
and the Devanagari character,
when encoded, must each correspond to
exactly one other printable character and not a set
of "nonsense" codepoints that the
arrow keys will interpret as different characters.
It also means that a single codepoint character can
potentially be converted into a multi-codepoint character
and vice versa.
condition d
As in any encrypt/decrypt algorithm,
the string to be encrypted and
the string that has been decrypted
(the end result) must
contain the exact same codepoints
(the 2 strings must be equal).
# Python 3.6:
import string  # built-in library
print(string.printable)
print(type(string.printable))
print(len(string.printable))  # length of the string (number of ASCII characters)
# perl6
use v6;
my @ordinals = <128104 8205 128105 8205 128103 8205 128102>;      # array of the family codepoints
@ordinals.append(<307 13 2325 2381 2359 2367 32 65279 127137>);   # add the other codepoints
my $result_string = '';
for @ordinals {
    $result_string = $result_string ~ $_.chr;
}
# get a string of characters from the ordinal numbers
say @ordinals;                 # the list of codepoints
say $result_string;            # the string
say $result_string.chars;      # the number of characters
say $result_string.comb.perl;  # a list of characters in the string
python results:
0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ!"#$%&'()*+,-./:;<=>?@[\]^_`{|}~
<class 'str'>
100
perl6 results:
[128104 8205 128105 8205 128103 8205 128102 307 13 2325 2381 2359 2367 32 65279 127137]
👨👩👧👦ij
क्षि 🂡
8
("👨👩👧👦", "ij", "\r", "क्", "षि", " ", "", "🂡").Seq
TL;DR I think your question is reasonable and deserves a better answer than the one I've written so far. Let's talk.
I don't think anyone can create a Caesar cipher with the requirements you list for multiple reasons.
But if your goal is really to "investigate properties of Unicode" rather than create a cipher then presumably that doesn't matter.
And of course I might just be experiencing a failure of imagination, or just a failure to understand Unicode despite spending years grappling with it.
If you critique technical aspects of my explanation below via comments I'll try to improve it and hopefully we'll both learn as we go. TIA.
"Caesar cipher with all Unicode printable characters"
This is the clean formulation you have in your title.
The only problematic parts are "Caesar", "cipher", "all", "Unicode", "printable", and "characters". Let's go thru them.
Caesar cipher
A Caesar cipher is an especially simple single alphabet cipher. Unicode isn't one big single alphabet. But perhaps you could treat a subset of its codepoints as if they were.
I'd say that's what the SO Cipher with all unicode characters was all about.
You've currently rejected that and introduced a bunch of extra aspects that are either impossible or so difficult that they might as well be.
Ignoring your priority of investigating Unicode properties it would make sense if you instead settled for a regular ASCII cipher. Or perhaps go back to that Cipher with all unicode characters SO and pick up where they left off, perhaps noting that, according to a comment on that SO they apparently stopped at just the BMP plane:
Note that you’re only using BMP code points (i.e. from U+0000 to U+FFFF). Unicode ranges from U+0000 to U+10FFFF, so you’re missing about a million code points :)
So perhaps you could do better. I don't think it would be worthwhile from the perspective of creating a cipher for its own sake but it might be for learning more about the properties of Unicode.
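For instance, here's a rough sketch of what "doing better" might look like: a codepoint-level shift over every Unicode scalar value except the surrogates. It freely maps assigned codepoints onto unassigned or private use ones and ignores graphemes entirely, so it doesn't satisfy your conditions; it's only meant to show the shape of that approach.
# A rough sketch: a "Caesar" shift over every Unicode scalar value except the
# surrogate range U+D800..U+DFFF (so the output stays a valid Unicode string).
# It ignores grapheme clusters and can land on unassigned or private-use
# codepoints, which is exactly why it falls short of the conditions below.
ALPHABET_SIZE = 0x110000 - 0x800          # all codepoints minus the surrogates
def _to_index(cp):                        # codepoint -> position in the "alphabet"
    return cp if cp < 0xD800 else cp - 0x800
def _from_index(i):                       # position in the "alphabet" -> codepoint
    return i if i < 0xD800 else i + 0x800
def shift_codepoints(text, key):
    return "".join(chr(_from_index((_to_index(ord(ch)) + key) % ALPHABET_SIZE))
                   for ch in text)
sample = "abc क्षि 🂡"
assert shift_codepoints(shift_codepoints(sample, 12345), -12345) == sample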
Cipher
@TomBlodget in their comment on your question notes that:
The complexity of text motivates modern ciphers to not deal with characters. They deal with bytes for both input and output. Where the input is text, the receiver has to be told the character encoding. Where further handling of the output must be as text, Base64 or similar is used. Expecting the output of a cipher to look like text is not generally a goal.
If you want a universal solution for a Unicode cipher, follow Tom's recipe.
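A minimal sketch of that byte-oriented recipe in Python (the shift-each-byte scheme is just an illustrative stand-in for a real cipher):
import base64
# Illustrative only: "encrypt" by shifting each UTF-8 byte, then Base64 the
# result so the output can safely be handled as text.
def shift_bytes(data, key):
    return bytes((b + key) % 256 for b in data)
def encrypt(text, key):
    return base64.b64encode(shift_bytes(text.encode("utf-8"), key)).decode("ascii")
def decrypt(token, key):
    return shift_bytes(base64.b64decode(token), -key).decode("utf-8")
# the codepoints from the question's example string
msg = "".join(chr(cp) for cp in
              [128104, 8205, 128105, 8205, 128103, 8205, 128102,
               307, 13, 2325, 2381, 2359, 2367, 32, 65279, 127137])
assert decrypt(encrypt(msg, 42), 42) == msg   # round-trips codepoint for codepoint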
All
In a comment on your question about the number of graphemes @nwellnhof noted that:
there's an infinite number
But you then also quite reasonably replied that there's only going to be a finite number in any given text; that Unicode's intent is that Unicode compliant software may/will generate mojibake results if given degenerate input (where what counts as degenerate is somewhat open to refinement in Unicode updates); and that that's the basis on which you hope to proceed.
That's a reasonable response, but you still can't have "all" even when restricted to "all non-degenerate" and "only ones that could appear in real life", because there's still an effectively infinite number of well formed and potentially reasonable characters.
I really ought to insert some calculations here to put some bounds on the problem. Is "effectively infinite" a trillion? Why? That sort of thing. But before digging into that I'll await comments.
Let's pretend it's a trillion, and that that's not a problem, and move on.
Unicode
Unicode is enormously complex.
You've been given an assignment to produce a Caesar cipher, a very simple thing.
They really don't mix well unless you lean heavily on keeping things simple.
But you want to investigate properties of Unicode. So perhaps you want to wade into all the complexity. But then the question is, how many years do you want to spend exploring the consequences of opening this Pandora's box? (I've been studying Unicode on and off for a decade. It's complicated.)
Printable
You linked to the SO question "What is the range of Unicode Printable Characters?". This includes an answer that notes:
The more you learn about Unicode, the more you realize how unexpectedly diverse and unfathomably weird human writing systems are. In particular whether a particular "character" is printable is not always obvious.
But you presumably read that and refused to be deterred. Which is both admirable and asking for trouble. For example, it seems to have driven you to define "printable" as something like "takes one or more keystrokes to traverse" which is so fraught it's hard to know where to start -- so I'll punt on that till later in this answer.
Characters
Given that your aim is to write a Caesar cipher, a cipher that was used thousands of years ago that acts on characters, it makes sense that you focused on "what a user thinks of as a character".
Per Unicode's definition, this is called a "grapheme".
One of your example characters makes it clear how problematic the distinction is between "what a user thinks of as a character" (a grapheme) and a codepoint (what Python thinks of as a character):
print('क्षि'[::-1])
िष्क
This shows mangling of a single "character" (a single grapheme) written in Devanagari, which is, according to Wikipedia, "one of the most used and adopted writing systems in the world".
(Or, if we want to ignore the half of the planet this mangling ever more routinely affects and just focus on the folk who thought they were safe:
print('🇬🇧'[::-1])
🇧🇬
That's a flag of one nation turning into another's. Fortunately flags rarely appear in text -- though that's changing now text is increasingly arbitrary Unicode text like this text I'm writing -- and flag characters are not that important and both Great Britain and Bulgaria are members of the EU so it's probably not nearly as bad as scrambling the text of a billion Indians.)
Graphemes
So you quite reasonably thought to yourself, "Maybe Perl 6 will help".
To quote UAX#29, the Unicode Annex document on "Unicode Text Segmentation":
This document defines a default specification for grapheme clusters.
Perl 6 has implemented a grapheme clustering mechanism. It could in principle cluster in a variety of ways but for now it's implemented the default specification. This is what allows Perl 6 to avoid the mistakes Python's making in the above.
But the Unicode document continues:
[the specification for grapheme clusters] may be customized for particular languages, operations, or other situations.
So you can't just eyeball some text (or give it to some software) and say what "characters" it contains if by "character" you mean "what a user thinks of as a character".
It gets worse...
Keystrokes
"👨👩👧👦ij क्षि 🂡" ... notepad ... 10 key strokes ... google search ... 7 strokes ... Perl6 ... Atom editor ... perl6 thinks क्षि ... is actually 2 graphemes ... I think this is a bug ... notepad (and google search) thinks it is a single grapheme/character
For me, Google search needs 10 keystrokes -- because it's not to do with Google search but instead aspects of my system, including which web browser I'm using (Firefox) and other details.
Some editors could be configurable so that cursoring over 'क्षि' (or 'fi') would be either 1 or 2 keystrokes depending on how you configure them and/or what language you specify the text is written in. For me, editing this SO answer using Firefox on Linux Mint, it takes 2 keystrokes to cursor over क्षि.
Perl 6 correctly reports the .chars result for 'क्षि' as 2 by default because that's what Unicode says it is per the default grapheme clustering specification. ("Extended Grapheme Clusters".) That happens to match what Firefox on Linux Mint does editing this SO answer because the stars line up and it's Sunday.
Notepad or other software reasonably takes just one keystroke to cursor over क्षि, while other editors reasonably take two, because both are reasonable per the Unicode specification:
arrow key movement ... could use knowledge specific to particular fonts to move in a more granular manner, in circumstances where it would be useful to edit individual components
My emphasis added. Unicode leaves it up to the software to decide how the cursor will move.
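To see what the default clustering says from Python, here's a sketch that assumes the third-party regex module (not the stdlib re); its \X pattern matches default extended grapheme clusters per UAX #29:
import regex  # third-party module (pip install regex), not the stdlib "re"
# \X matches default extended grapheme clusters; the exact count for क्षि
# depends on which Unicode version's rules your regex build implements.
for s in ["क्षि", "🇬🇧", "e\u0301"]:  # kshi, a flag pair, e + combining acute
    clusters = regex.findall(r"\X", s)
    print(len(s), "codepoints ->", len(clusters), "cluster(s):", clusters)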
Your questions
1: Is the above a fair definition of what it means to be a printable character in Unicode?
I don't think so. Hopefully the foregoing explains why, or at least points you in the directions you would need to research (for a year or three) to understand why.
2: ... do you know of any list of printable characters that covers the currently used Unicode planes and possible grapheme clusters ...
There's such a vast number of "possible grapheme clusters" that can reasonably occur that even excluding degenerate codepoint combinations leaves you with an effectively infinite list.
And any small subset anyone may have created would not be canonical because the Unicode consortium would not bless it and folk would argue about what should be included.
3: ... how would you go about creating such a list with which I could create a Caesar cipher that could cipher any/all printable characters given the following 4 conditions?
First, your conditions are far too strenuous. See the next section.
But even if you drop the ones that are too difficult, it's still far too difficult, and the outcome far too uninteresting, to make doing anything worthwhile.
4: If you think creating such a list is a terrible idea, how would you create the cipher?
If it were me and it had to be a Caesar cipher I'd make it just handle bytes, as per Tom's comment at the start of this answer.
Your conditions
The string to be encrypted will be a valid UTF-8 string consisting of standard Unicode codepoints (no unassigned or private use area codepoints)
It'll need to be more restricted than that, but it's reasonable to say it'll need to be a valid Unicode string. If you want to insist it's utf8 that's fine too.
The encrypted string must also be a valid UTF-8 string consisting of standard Unicode codepoints
Sure.
You must be able to traverse the encrypted string using the same number of strokes ... as the original string
You can have that for a small subset of Unicode characters. But...
This means [one keystroke for both the original and encoded version of] the Devanagari character [क्षि]
... is not a reasonable requirement.
You could ensure the same grapheme clustering (character) interpretation of a given text if you wrote a custom implementation of the grapheme clustering for your cipher that was a faithful copy of the implementation of grapheme clustering used to control the cursor.
But then you'd have to maintain these two code bases to keep them in sync. And that would be for just one particular system configuration.
It would be a ridiculous amount of pain. And for zero, or at most minuscule, gain.
the string to be encrypted and the string that has been decrypted (the end result) must contain the exact same codepoints (the 2 strings must be equal).
So, no normalization.
That rules out all Perl 6's grapheme aware goodies.
Then again, you need to drop paying attention to graphemes anyway because there's effectively an infinite number of them.
Conclusion
My answer only touches lightly on the topics it covers and probably contains lots of errors. If you critique it I'll try to improve it.
Related
Is there a way to access or look up control characters in Python, such as NUL, DEL, CR, LF, and BEL, in their single-character Unicode form, so they can be passed to the built-in ord() function to get a numeric value?
You can find their Unicode representation on the ASCII page on Wikipedia.
Key  Unicode  Unicode-Hex
NUL  ␀        \u2400
DEL  ␡        \u2421
CR   ␍        \u240d
LF   ␊        \u240a
BEL  ␇        \u2407
To "access" or "search" to the control characters, unicode database module provides access to a set of data of characters through different methods by each type or name representation. I was a bit confused as to its representation in the text data type format and how you could use it ord function; lookup returns the name of the corresponding character thus solving the unknown about the subject to "search" control characters by gap unicode characters due to little experience or knowledge of the library and ASCII standards.
It is also worth noting that the question is vaguely phrased and does not provide a reproducible example or a concrete algorithmic problem. These characters also have several different representations (the raw control character, a printable symbol, and a name), which follows from how ASCII is specified. That does not mean the question cannot be answered; it only needed the clarification given above about which representation is being asked for.
I have tried to convert strings from latin1 to ascii, and most of the strings were converted correctly except for some characters: æ, ø, Æ, and Ø.
I have checked that the characters were converted correctly when using the R function stringi::stri_trans_general(loc1, "latin-ascii"), but Python's unicodedata package did not work well.
Is there any way to convert them correctly in Python? I guess it may need an additional dictionary.
For information, I have applied the following function to change the format:
unicodedata.normalize('NFKD', "Latin strings...").encode('latin1', 'ignore').decode('ascii')
It's important to understand a) what encodings and decodings are; b) how text works; and c) what unicode normalization does.
Strings do not have a "format" in the sense that you describe, so talking about converting from latin1 to ascii format does not make sense. The string has representations (what it looks like when you print it out; or what the code looks like when you create it directly in your code; etc.), and it can be encoded. latin1, ascii etc. are encodings - that means, rules that explain how to store your string as a raw sequence of bytes.
So if you have a string, it is not "in latin1 format" just because the source data was in latin1 encoding - it is not in any format, because that concept doesn't apply. It's just a string.
Similarly, we cannot ask for a string "in ascii format" that we convert to. We can ask for an ascii encoding of the string - which is a sequence of bytes, and not text. (That "not" is one of the most important "not"s in all of computer science, because many people, tools and programs will lie to you about this.)
Of course, the problem here is that ascii cannot represent all possible text. There are over a million "code points" that can theoretically be used as elements of a string (this includes a lot of really weird things like emoji). The latin-1 and ascii encodings both use a single byte per code point in the string. Obviously, this means they can't represent everything. Latin-1 represents only the first 256 possible code points, and ascii represents only the first 128. So if we have data that comes from a latin-1 source, we can get a string with those characters like Æ in it, which cause a problem in our encoding step.
The 'ignore' option for .encode makes the encoder skip things that can't be handled by the encoding. So if you have the string 'barentsøya', since the ø cannot be represented in ascii, it gets skipped and you get the bytes b'barentsya' (using the unfortunately misleading way that Python displays bytes objects back to you).
When you normalize a string, you convert the code points into some plain format that's easier to work with, and treat distinct ways of writing a character - or distinct ways of writing very similar characters - the same way. There are a few different normalization schemes. The NFKD chooses decomposed representations for accented characters - that is, instead of using a single symbol to represent a letter with an accent, it will use two symbols, one that represents the plain letter, and one representing the "combining" version of the accent. That might seem useful - for example, it would turn an accented A into a plain A and an accent character. You might think that you can then just encode this as ascii, let the accent characters be ignored, and get the result you want. However, it turns out that this is not enough here, because of which characters actually decompose: æ is a separate letter with no decomposition mapping, and ø uses a stroke through the letter rather than a combining accent, so NFKD leaves both of them (and their uppercase forms) unchanged, and the ascii encoding step then simply drops them.
Unfortunately, I think the best you can do is to either use a third-party library (and please note that recommendations are off-topic for Stack Overflow) or build the look-up table yourself and just translate each character. (Have a look at the built-in string methods translate and maketrans for help with this.)
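A minimal sketch of that look-up-table approach, assuming you only need the handful of Latin-1 letters that NFKD cannot decompose (the ASCII spellings chosen here are just one reasonable option):
import unicodedata
# letters NFKD leaves alone; the ASCII replacements are illustrative choices
SPECIAL = str.maketrans({"æ": "ae", "Æ": "AE", "ø": "o", "Ø": "O",
                         "ð": "d", "Ð": "D", "þ": "th", "Þ": "Th", "ß": "ss"})
def to_ascii(text):
    text = text.translate(SPECIAL)                 # handle non-decomposable letters
    text = unicodedata.normalize("NFKD", text)     # split accents off base letters
    return text.encode("ascii", "ignore").decode("ascii")  # drop combining marks
print(to_ascii("Ærøskøbing café"))   # -> "AEroskobing cafe"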
I'm using BeautifulSoup to parse some web pages.
Occasionally I run into a "unicode hell" error like the following :
Looking at the source of this article on TheAtlantic.com [ http://www.theatlantic.com/education/archive/2013/10/why-are-hundreds-of-harvard-students-studying-ancient-chinese-philosophy/280356/ ]
We see this in the og:description meta property :
<meta property="og:description" content="The professor who teaches&nbsp;Classical Chinese Ethical and Political Theory claims, &quot;This course will change your life.&quot;" />
When BeautifulSoup parses it, I see this:
>>> print repr(description)
u'The professor who teaches\xa0Classical Chinese Ethical and Political Theory claims, "This course will change your life."'
If I try encoding it to UTF-8 , like this SO comment suggests : https://stackoverflow.com/a/10996267/442650
>>> print repr(description.encode('utf8'))
'The professor who teaches\xc2\xa0Classical Chinese Ethical and Political Theory claims, "This course will change your life."'
Just when I thought I had all my unicode issues under control, I still don't quite understand what's going on, so I'm going to lay out a few questions:
1- Why would BeautifulSoup convert the &nbsp; to \xa0 [a Latin charset space character]? The charset and headers on this page are UTF-8; I thought BeautifulSoup pulls that data for the encoding. Why wasn't it replaced with a <space>?
2- Is there a common way to normalize whitespaces for conversion ?
3- When I encoded to UTF-8, how did \xa0 become the sequence \xc2\xa0?
I can pipe everything through unicodedata.normalize('NFKD', string) to help get me where I want to be -- but I'd love to understand what's wrong and avoid problems like this in the future.
You aren't encountering a problem. Everything is behaving as intended.
&nbsp; indicates a non-breaking space character. This isn't replaced with a space because it doesn't represent a space; it represents a non-breaking space. Replacing it with a space would lose information: that where that space occurs, a text rendering engine shouldn't put a line break.
The Unicode code point for non-breaking space is U+00A0, which is written in a Unicode string in Python as \xa0.
The UTF-8 encoding of U+00A0 is, in hexadecimal, the two-byte sequence C2 A0, or written in a Python string representation, \xc2\xa0. In UTF-8, anything beyond the 7-bit ASCII set needs two or more bytes to represent it. In this case, the highest bit set is the eighth bit. That means that it can be represented by the two-byte sequence (in binary) 110xxxxx 10xxxxxx where the x's are the bits of the binary representation of the code point. In the case of A0, that is 10100000, or when encoded in UTF-8, 11000010 10100000, which is C2 A0.
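You can check those byte-level claims directly (Python 3 shown):
s = "\xa0"                                   # U+00A0 NO-BREAK SPACE
print(s.encode("utf-8"))                     # b'\xc2\xa0'
print(format(0xA0, "08b"))                   # 10100000
print([format(b, "08b") for b in s.encode("utf-8")])  # ['11000010', '10100000']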
Many people use &nbsp; in HTML to get spaces which aren't collapsed by the usual HTML whitespace collapsing rules (in HTML, all runs of consecutive spaces, tabs, and newlines get interpreted as a single space unless one of the CSS white-space rules are applied), but that's not really what they are intended for; they are supposed to be used for things like names, like "Mr. Miyagi", where you don't want there to be a line break between the "Mr." and "Miyagi". I'm not sure why it was used in this particular case; it seems out of place here, but that's more of a problem with your source, not the code that interprets it.
Now, if you don't really care about layout so you don't mind whether or not text layout algorithms choose that as a place to wrap, but would like to interpret this merely as a regular space, normalizing using NFKD is a perfectly reasonable answer (or NFKC if you prefer pre-composed accents to decomposed accents). The NFKC and NFKD normalizations map characters such that most characters that represent essentially the same semantic value in most contexts are expanded out. For instance, ligatures are expanded out (ffi -> ffi), archaic long s characters are converted into s (ſ -> s), Roman numeral characters are expanded into their individual letters (Ⅳ -> IV), and non-breaking space converted into a normal space. For some characters, NFKC or NFKD normalization may lose information that is important in some contexts: ℌ and ℍ will both normalize to H, but in mathematical texts can be used to refer to different things.
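For the non-breaking space specifically, the normalization looks like this:
import unicodedata
s = "teaches\xa0Classical"
print(unicodedata.normalize("NFKC", s))                          # non-breaking space becomes U+0020
print(unicodedata.normalize("NFKC", s) == "teaches Classical")   # True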
I am trying to build an encryption system using Python. It is based on the Lorenz cipher machine used by Germany in WWII, though a lot more complicated (7-bit ASCII encryption and 30 rotors compared with the original's 5-bit and 12 rotors).
So far I have worked out and written the stepping system. I have also created a system for chopping up the plaintext. But when checking the output character by character (by not stitching together the ciphertext), I got this for "hello":
['H', 'Z', '\x0e', '>', 'f']
I have realised that '\x0e' must be some special character in ASCII, but I am certain that when the program goes to decrypt, it will look at each of the letters in it individually. Can someone please tell me what '\x0e' signifies, if there are other such characters, and if there's an easy way to get around it.
Thanks in advance!
It's the ASCII "shift-out" control character and is nonprintable.
A control character which is used in conjunction with SHIFT IN and
ESCAPE to extend the graphic character set of the code. It may alter
the meaning of octets 33 - 126 (dec.). The effect of this character
when using code extension techniques is described in International
Standard ISO 2022.
'\x0e' is the ASCII SO (shift out) unprintable character. It is a single character, and any reasonable program dealing with the string will treat it as such; you're only seeing it represented like that because you're printing a list, which shows the repr of each value in the list.
As for the question of if there are others, yes, there are 33 of them; ASCII 0-31 and 127 are all generally considered "control characters" which aren't typically printable.
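A quick illustration of both points - it is a single, non-printable character, and the \x0e form is just how repr() displays it:
import string
ch = '\x0e'                        # ASCII SO (shift out)
print(len(ch))                     # 1 - it's one character
print(repr(ch))                    # '\x0e' - only the repr looks like four characters
print(ch.isprintable())            # False (Python 3)
print(ch in string.printable)      # False - control characters aren't included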
I am trying to write a generic document indexer from a bunch of documents with different encodings in Python. I would like to know whether it is possible to read all of my documents (which are encoded in UTF-8, ISO8859-xx and Windows-12xx) as UTF-8 without character loss.
The reading part is as follows:
import codecs
fin = codecs.open(doc_name, "r", "utf-8")
doc_content = fin.read()
I'm going to rephrase your question slightly. I believe you are asking, "can I open a document and read it as if it were UTF-8, provided that it is actually intended to be ISO8869-xx or Windows-12xx, without loss?". This is what the Python code you've posted attempts to do.
The answer to that question is no. The Python code you posted will mangle the documents if they contain any characters above ordinal 127. This is because the "codepages" use the numbers from 128 to 255 to represent one character each, where UTF-8 uses that number range to proxy multibyte characters. So, each character in your document which is not in ASCII will be either interpreted as an invalid string or will be combined with the succeeding byte(s) to form a single UTF-8 codepoint, if you incorrectly parse the file as UTF-8.
As a concrete example, say your document is in Windows-1252. It contains the byte sequence 0xC3 0xAE, or "Ã®" (A-tilde, registered trademark sign). In UTF-8, that same byte sequence represents one character, "î" (small 'i' with circumflex). In Windows-874, that same sequence would be "รฎ". These are rather different strings - a moral insult could become an invitation to play chess, or vice versa. Meaning is lost.
Now, for a slightly different question - "can I losslessly convert my files from their current encoding to UTF-8?" or, "can I represent all the data from the current files as a UTF-8 bytestream?". The answer to these questions is (modulo a few fuzzy bits) yes. Unicode is designed to have a codepoint for every character in any previously existing codepage, and by and large has succeeded in this goal. There are a few rough edges, but you will likely be well-served by using Unicode as your common interchange format (and UTF-8 is a good choice for a representation thereof).
However, to effect the conversion, you must already know and state the format in which the files exist as they are being read. Otherwise Python will incorrectly deal with non-ASCII characters and you will badly damage your text (irreparably, in fact, if you discard either the invalid-in-UTF8 sequences or the origin of a particular wrongly-converted byte range).
In the event that the text is all, 100% ASCII, you can open it as UTF-8 without a problem, as the first 127 codepoints are shared between the two representations.
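A minimal sketch of that conversion, assuming you know (or have recorded) each file's encoding; the file names and the windows-1252 value here are purely illustrative:
# Illustrative sketch: decode using the file's real encoding, re-encode as UTF-8.
def convert_to_utf8(src_path, dst_path, source_encoding):
    with open(src_path, "r", encoding=source_encoding) as fin:
        text = fin.read()                      # now a Python str (Unicode)
    with open(dst_path, "w", encoding="utf-8") as fout:
        fout.write(text)                       # written out losslessly as UTF-8
convert_to_utf8("doc_0001.txt", "doc_0001.utf8.txt", "windows-1252")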
UTF-8 covers everything in Unicode. I don't know for sure whether ISO-8859-xx and Windows-12xx are entirely covered by Unicode, but I strongly suspect they are.
I believe there are some encodings which include characters which aren't in Unicode, but I would be fairly surprised if you came across those characters. Covering the whole of Unicode is "good enough" for almost everything - that's the purpose of Unicode, after all. It's meant to cover everything we could possibly need (which is why it's grown :)
EDIT: As noted, you have to know the encoding of the file yourself, and state it - you can't just expect files to magically be read correctly. But once you do know the encoding, you could convert everything to UTF-8.
You'll need to have some way of determining which character set the document uses. You can't just open each one as "utf-8" and expect it to get magically converted. Open it with the proper character set, then convert.
The best way to be sure would be to convert a large set of documents, then convert them back and do a comparison.