Is there a way to access or find the control characters NUL, DEL, CR, LF, and BEL in Python, each in its form as a single ASCII/Unicode character, to use as a parameter to the built-in ord() function to get its numeric value?
You can find their Unicode representations on the ASCII page on Wikipedia.
Key    Unicode    Unicode-Hex
NUL    ␀          \u2400
DEL    ␡          \u2421
CR     ␍          \u240d
LF     ␊          \u240a
BEL    ␇          \u2407
To "access" or "search" to the control characters, unicode database module provides access to a set of data of characters through different methods by each type or name representation. I was a bit confused as to its representation in the text data type format and how you could use it ord function; lookup returns the name of the corresponding character thus solving the unknown about the subject to "search" control characters by gap unicode characters due to little experience or knowledge of the library and ASCII standards.
It is also important to note that the question is vaguely phrased and also does not provide a reproducible example or algorithm problem, and since these characters have different representations and they make them not only usable in form of text or letters or even names of control characters, this due to the wide variety in which ASCII is ruled, this does not mean that the question could not be answered because it is known that from the beginning it started as a problem for which a clarification or explanation has been provided to the doubts presented on this post.
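For example, a minimal sketch (Python 3.3+, where unicodedata.lookup() also accepts the formal name aliases from Unicode's NameAliases.txt; verify the alias strings against your Python version):

import unicodedata

# Look up the actual control characters by their Unicode name aliases
# (lookup() has accepted aliases since Python 3.3).
for alias in ("NULL", "DELETE", "CARRIAGE RETURN", "LINE FEED", "ALERT"):
    ch = unicodedata.lookup(alias)
    print("%-16s ord() = %3d" % (alias, ord(ch)))

# The table's symbols are separate printable characters, not controls:
print(ord("\u2400"))  # 9216 -- U+2400 SYMBOL FOR NULL
print(ord("\x00"))    # 0    -- the actual NUL control character

The escape sequences '\x00', '\x7f', '\r', '\n' and '\a' give the same five characters directly.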
Related
I have tried to change the format of strings from latin1 to ascii, and most of the strings were changed well, except for some characters: æ, ø, Æ, and Ø.
I have checked that the characters were changed correctly when using an R package (stringi::stri_trans_general(loc1, "latin-ascii")), but Python's unicodedata package did not work well.
Is there any way to convert them correctly in Python? I guess it may need an additional dictionary.
For information, I have applied the following function to change the format:
unicodedata.normalize('NFKD', "Latin strings...").encode('latin1', 'ignore').decode('ascii')
It's important to understand a) what encodings and decodings are; b) how text works; and c) what Unicode normalization does.
Strings do not have a "format" in the sense that you describe, so talking about converting from latin1 to ascii format does not make sense. A string has representations (what it looks like when you print it out, or what the code looks like when you create it directly in your code, etc.), and it can be encoded. latin1, ascii etc. are encodings: that is, rules that explain how to store your string as a raw sequence of bytes.
So if you have a string, it is not "in latin1 format" just because the source data was in latin1 encoding - it is not in any format, because that concept doesn't apply. It's just a string.
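A quick demonstration (the byte literals below are just 'Ærø' encoded two different ways):

# Two different byte encodings decode to the same string; once decoded,
# there is no "latin1 string" or "utf8 string", just a string.
raw_latin1 = b'\xc6r\xf8'          # 'Ærø' encoded as latin1
raw_utf8   = b'\xc3\x86r\xc3\xb8'  # 'Ærø' encoded as utf8
assert raw_latin1.decode('latin1') == raw_utf8.decode('utf8') == 'Ærø'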
Similarly, we cannot ask for a string "in ascii format" that we convert to. We can ask for an ascii encoding of the string - which is a sequence of bytes, and not text. (That "not" is one of the most important "not"s in all of computer science, because many people, tools and programs will lie to you about this.)
Of course, the problem here is that ascii cannot represent all possible text. There are over a million "code points" that can theoretically be used as elements of a string (this includes a lot of really weird things like emoji). The latin-1 and ascii encodings both use a single byte per code point in the string. Obviously, this means they can't represent everything. Latin-1 represents only the first 256 possible code points, and ascii represents only the first 128. So if we have data that comes from a latin-1 source, we can get a string with those characters like Æ in it, which cause a problem in our encoding step.
The 'ignore' option for .encode makes the encoder skip things that can't be handled by the encoding. So if you have the string 'barentsøya', since the ø cannot be represented in ascii, it gets skipped and you get the bytes b'barentsya' (using the unfortunately misleading way that Python displays bytes objects back to you).
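For example:

>>> 'barentsøya'.encode('ascii')
Traceback (most recent call last):
  ...
UnicodeEncodeError: 'ascii' codec can't encode character '\xf8' in position 7: ordinal not in range(128)
>>> 'barentsøya'.encode('ascii', 'ignore')
b'barentsya'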
When you normalize a string, you convert its code points into some preferred form that's easier to work with, treating distinct ways of writing a character, or distinct ways of writing very similar characters, the same way. There are a few different normalization schemes. NFKD chooses decomposed representations for accented characters: instead of using a single symbol to represent a letter with an accent, it uses two symbols, one representing the plain letter and one representing the "combining" version of the accent. That might seem useful: it would turn an accented A into a plain A plus an accent character, you could then encode as ascii with the accent characters ignored, and get the result you want. However, it turns out that this is not enough, because of how the normalization works: Unicode does not consider æ, ø, Æ and Ø to be accented letters at all. They are distinct letters in their own right, with no decomposition, so NFKD leaves them untouched and the 'ignore' step then drops them entirely.
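You can see the difference directly:

>>> import unicodedata
>>> [hex(ord(c)) for c in unicodedata.normalize('NFKD', '\xe9')]  # é decomposes
['0x65', '0x301']
>>> [hex(ord(c)) for c in unicodedata.normalize('NFKD', '\xf8')]  # ø does not
['0xf8']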
Unfortunately, I think the best you can do is to either use a third-party library (and please note that recommendations are off-topic for Stack Overflow) or build the look-up table yourself and just translate each character. (Have a look at the built-in string methods translate and maketrans for help with this.)
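A minimal sketch of that approach (the mapping below is an assumption; extend it with whatever transliterations you need):

import unicodedata

# Hand-made transliterations for the letters NFKD cannot decompose.
SPECIAL = str.maketrans({'æ': 'ae', 'Æ': 'AE', 'ø': 'o', 'Ø': 'O'})

def to_ascii(text):
    text = text.translate(SPECIAL)              # handle the special letters
    text = unicodedata.normalize('NFKD', text)  # split accents off letters
    return text.encode('ascii', 'ignore').decode('ascii')

print(to_ascii('Barentsøya, Ærø'))  # Barentsoya, AEro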
This question already has an answer here:
Find emojis in a tweet as whole clusters and not as individual chars
Environment: Python 3.6
There's a UTF-8 encoded text like this:
text_utf8 = b"\xf0\x9f\x98\x80\xef\xbc\x81\xef\xbc\x81\xef\xbc\x81"
And I want to match only the sequences in which three numbers or letters follow b'\xf0\x9f\x98\' (this actually indicates the facial expression emojis).
I tried this
if re.search(b'\xf0\x9f\x98\[a-zA-Z0-9]{3}$', text_utf8)
but it doesn't work, and when I print the pattern it comes out like this: b'\xf0\x9f\x98\\[a-zA-Z1-9]{3}' - a \ automatically gets into it.
Any way out? Thanks.
I can see two problems with your search:
you are trying to search the textual representation of the utf8 string (the \xXX represents a byte in hexadecimal). What you actually should be doing is matching against its content (the actual bytes).
you are including the "end-of-string" marker ($) in your search, where you're probably interested in its occurrence anywhere in the string.
Something like the following should work, though brittle (see below for a more robust solution):
re.search(b'\xf0\x9f\x98.', text_utf8)
This will give you the first occurrence of a 4-byte unicode sequences prefixed by \xf0\x9f\x98.
Assuming you're dealing only with UTF-8, this should, to the best of my knowledge, give unambiguous matches (i.e.: you don't have to worry about this prefix appearing in the middle of a longer sequence, since in UTF-8 a sequence's leading byte can never occur as a continuation byte).
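Putting it together with your sample input:

import re

text_utf8 = b"\xf0\x9f\x98\x80\xef\xbc\x81\xef\xbc\x81\xef\xbc\x81"
m = re.search(b'\xf0\x9f\x98.', text_utf8)
print(m.group())                 # b'\xf0\x9f\x98\x80'
print(m.group().decode('utf8'))  # the grinning-face emoji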
A more robust solution, if you have the option of third-party modules, would be installing the regex module and using the following:
regex.search(r'\p{Emoji=Yes}', text_utf8.decode('utf8'))
This has the advantages of being more readable and explicit, while probably being also more future-proof. (See here for more unicode properties that might help in your use-case)
Note that in this case you can also deal with text_utf8 as an actual unicode (str in py3) string, without converting it to a byte-string, which might have other advantages, depending on the rest of your code.
I want to create a Caesar cipher that can encode/decode Unicode printable characters (single- and multi-codepoint grapheme clusters, emojis, etc.) from the whole of Unicode (except the private use area). Preferably, it will use a list of all printable characters.
NOTE: Even though I want to create a Caesar cipher, this is really not about encryption. The question is about investigating the properties of Unicode.
I found these questions:
What is the range of Unicode Printable Characters?
Cipher with all unicode characters
But I didn't get an answer to what I want.
Note:
If you give a coding answer, I am mostly interested in a solution that
uses either python3 or perl6, as they are my main languages.
Recently, I was given an assignment to write a Caesar cipher and then encode and decode an English text.
I solved it in python by using the string library's built-in string.printable constant. Here is a printout of the constant:
(I used visual studio code)
[see python code and results below]
The documentation says:
'''
String of ASCII characters which are considered printable. This is a combination of digits, ascii_letters, punctuation, and whitespace.
'''
https://docs.python.org/3.6/library/string.html#string-constants
I am wondering how you could create a Caesar cipher that could encode/decode all the possible printable characters you can make from Unicode codepoints (just assume you have all the necessary fonts to see those that should be visible on screen).
Here is my understanding of what it means for
something to be a printable character:
When I take the Python string constant above
and traverse it with the left or right arrow keys
on the keyboard, it takes me exactly 100 strokes to get
to the end (the same as the number of characters).
It looks like there is a one-to-one
correspondence between being a printable
character and being traversable with one stroke of an arrow key.
Now consider this string:
"👨👩👧👦ij
क्षि 🂡"
Based on Python's string.printable constant,
this string seems to me to be composed of the
following 7 printable characters:
(you can look up individual codepoints at: https://unicode-table.com/en/)
1 (family) 2 (Latin Small Ligature Ij)
3 (carriage return) 4 (Devanagari kshi)
5 (space) 6 (Zero Width No-Break Space)
7 (Ace of Spades)
👨👩👧👦
codepoints: 128104 8205 128105 8205 128103 8205 128102
(reference: https://emojipedia.org/family-man-woman-girl-boy/)
(Latin Small Ligature Ij)
ij
codepoint: 307
(Carriage Return)
codepoint: 13
(Devanagari kshi)
क्षि
codepoints: 2325 2381 2359 2367
(see this page: http://unicode.org/reports/tr29/)
(on that page the codepoints are given in hexadecimal rather than decimal)
(Space)
codepoint: 32
(Zero Width No-Break Space)
codepoint: 65279
(AKA U+FEFF BYTE ORDER MARK (BOM))
(https://en.wikipedia.org/wiki/Byte_order_mark)
(Playing Card Ace of Spades)
🂡
codepoint: 127137
When I paste this
string into Notepad and try to traverse it with an arrow key,
I end up using 10 keystrokes rather than 7,
because the family emoji needs
4 keystrokes
(probably because Notepad can't deal with the Zero Width Joiner,
codepoint 8205, and of course Notepad can't display a family glyph).
On the other hand, when I paste the string into Google search,
I can traverse the whole string with 7 strokes.
Then I tried creating the string
in Perl6 to see what Perl6's grapheme
awareness would make of the string:
(I use the Atom editor)
[see perl6 code and results below]
Perl6 thinks that the Devanagari kshi character क्षि (4 codepoints)
is actually 2 graphemes, each with 2 codepoints.
Even though it CAN be represented as two characters,
as seen in the above list,
I think this is a bug. Perl6 is supposed to be grapheme
aware, and even my Windows Notepad (and Google search)
treats it as a single grapheme/character.
Based on the 2 strings,
the practical definition of
a printable character seems to be this:
'It is any combination of Unicode codepoints that can be traversed
by one push of a left or right arrow key on the keyboard
under ideal circumstances'.
'Under ideal circumstances' means that
you are using an environment that, so to speak,
acts like Google search:
that is, it recognizes for example an emoji
(the 4-person family) or a grapheme cluster
(the Devanagari character)
as one printable character.
4 questions:
1:
Is the above a fair definition of what it means
to be a printable character in Unicode?
2:
Regardless of whether you accept the definition,
do you know of any list of printable characters
that covers the currently used Unicode planes and possible
grapheme clusters, rather than just the 100 ASCII characters
the Python string library has
(if I had such a list I imagine I could create a cipher
quite easily)?
3:
Given that such a list does not exist, and
assuming you accept the definition,
how would you go about creating such a list with which
I could create a Caesar cipher
that could cipher any/all printable
characters, given the following 4 conditions?
NOTE: these 4 conditions are just
what I imagine is required for a proper
Caesar cipher.
condition a
The string to be encrypted will
be a valid utf8 string consisting of standard
Unicode code points (no unassigned or private use area
codepoints)
condition b
The encrypted string must also be a valid
utf8 string consisting of standard
Unicode code points.
condition c
You must be able to traverse the encrypted string
using the same number of strokes with
the left or right arrow keys on the keyboard as
the original string
(given ideal circumstances as described above).
This means that both the
man-woman-boy-girl family emoji
and the devanagari character,
when encoded, must each correspond to
exactly one other printable character and not a set
of "nonsence" codepoints that the
arrow keys will interpret as different characters.
It also means that a single codepoint character can
potentially be converted into a multi-codepoint character
and vice versa.
condition d
As in any encrypt/decrypt algorithm,
the string to be encrypted and
the string that has been decrypted
(the end result) must
contain the exact same codepoints
(the 2 strings must be equal).
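4:
If you think creating such a list is a terrible idea,
how would you create the cipher?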
# Python 3.6:
import string
# built-in library
print(string.printable)
print(type(string.printable))
print(len(string.printable))
# length of the string (number of ASCII characters)
#perl6
use v6;
my @ordinals = <128104 8205 128105 8205 128103 8205 128102>;
# array of the family codepoints
@ordinals.append(<307 13 2325 2381 2359 2367 32 65279 127137>);
# add the other codepoints
my $result_string = '';
for @ordinals {
    $result_string = $result_string ~ $_.chr;
}
# get a string of characters from the ordinal numbers
say @ordinals;                # the list of codepoints
say $result_string;           # the string
say $result_string.chars;     # the number of characters
say $result_string.comb.perl; # a list of characters in the string
python results:
0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ!"#$%&'()*+,-./:;<=>?@[\]^_`{|}~
<class 'str'>
100
perl6 results:
[128104 8205 128105 8205 128103 8205 128102 307 13 2325 2381 2359 2367 32 65279 127137]
👨👩👧👦ij
क्षि 🂡
8
("👨👩👧👦", "ij", "\r", "क्", "षि", " ", "", "🂡").Seq
TL;DR I think your question is reasonable and deserves a better answer than the one I've written so far. Let's talk.
I don't think anyone can create a Caesar cipher with the requirements you list for multiple reasons.
But if your goal is really to "investigate properties of Unicode" rather than create a cipher then presumably that doesn't matter.
And of course I might just be experiencing a failure of imagination, or just a failure to understand Unicode despite spending years grappling with it.
If you critique technical aspects of my explanation below via comments I'll try to improve it and hopefully we'll both learn as we go. TIA.
"Caesar cipher with all Unicode printable characters"
This is the clean formulation you have in your title.
The only problematic parts are "Caesar", "cipher", "all", "Unicode", "printable", and "characters". Let's go thru them.
Caesar cipher
A Caesar cipher is an especially simple single-alphabet cipher. Unicode isn't one big single alphabet. But perhaps you could treat a subset of its codepoints as if they were.
I'd say that's what the SO Cipher with all unicode characters was all about.
You've currently rejected that and introduced a bunch of extra aspects that are either impossible or so difficult that they might as well be.
Ignoring your priority of investigating Unicode properties, it would make sense if you instead settled for a regular ASCII cipher. Or perhaps go back to that Cipher with all unicode characters SO and pick up where they left off, noting that, according to a comment on that SO, they apparently stopped at just the BMP plane:
Note that you’re only using BMP code points (i.e. from U+0000 to U+FFFF). Unicode ranges from U+0000 to U+10FFFF, so you’re missing about a million code points :)
So perhaps you could do better. I don't think it would be worthwhile from the perspective of creating a cipher for its own sake but it might be for learning more about the properties of Unicode.
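For instance, here is a minimal sketch of such a single-alphabet cipher over every Unicode scalar value (it skips the surrogate block U+D800..U+DFFF so the output stays valid, but it ignores the grapheme/keystroke conditions discussed later and can emit unassigned codepoints):

# Treat all non-surrogate codepoints as one big alphabet and rotate it.
ALPHABET = [cp for cp in range(0x110000) if not 0xD800 <= cp <= 0xDFFF]
INDEX = {cp: i for i, cp in enumerate(ALPHABET)}

def caesar(text, shift):
    n = len(ALPHABET)
    return ''.join(chr(ALPHABET[(INDEX[ord(ch)] + shift) % n]) for ch in text)

enc = caesar('क्षि 🂡', 42)
assert caesar(enc, -42) == 'क्षि 🂡'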
Cipher
@TomBlodget in their comment on your question notes that:
The complexity of text motivates modern ciphers to not deal with characters. They deal with bytes for both input and output. Where the input is text, the receiver has to be told the character encoding. Where further handling of the output must be as text, Base64 or similar is used. Expecting the output of a cipher to look like text is not generally a goal.
If you want a universal solution for a Unicode cipher, follow Tom's recipe.
All
In a comment on your question about the number of graphemes @nwellnhof noted that:
there's an infinite number
But you then also quite reasonably replied that there's only going to be a finite number in any given text; that Unicode's intent is that Unicode compliant software may/will generate mojibake results if given degenerate input (where what counts as degenerate is somewhat open to refinement in Unicode updates); and that that's the basis on which you hope to proceed.
That's a reasonable response, but you still can't have "all" even when restricted to "all non-degenerate" and "only ones that could appear in real life", because there's still an effectively infinite number of well formed and potentially reasonable characters.
I really ought to insert some calculations here to put some bounds on the problem. Is "effectively infinite" a trillion? Why? That sort of thing. But before digging into that I'll await comments.
Let's pretend it's a trillion, and that that's not a problem, and move on.
Unicode
Unicode is enormously complex.
You've been given an assignment to produce a Caesar cipher, a very simple thing.
They really don't mix well unless you lean heavily on keeping things simple.
But you want to investigate properties of Unicode. So perhaps you want to wade into all the complexity. But then the question is, how many years do you want to spend exploring the consequences of opening this Pandora's box? (I've been studying Unicode on and off for a decade. It's complicated.)
Printable
You linked to the SO question "What is the range of Unicode Printable Characters?". This includes an answer that notes:
The more you learn about Unicode, the more you realize how unexpectedly diverse and unfathomably weird human writing systems are. In particular whether a particular "character" is printable is not always obvious.
But you presumably read that and refused to be deterred. Which is both admirable and asking for trouble. For example, it seems to have driven you to define "printable" as something like "takes one or more keystrokes to traverse" which is so fraught it's hard to know where to start -- so I'll punt on that till later in this answer.
Characters
Given that your aim is to write a Caesar cipher, a cipher that was used thousands of years ago that acts on characters, it makes sense that you focused on "what a user thinks of as a character".
Per Unicode's definition, this is called a "grapheme".
One of your example characters makes it clear how problematic the distinction is between "what a user thinks of as a character" (a grapheme) and a codepoint (what Python thinks of as a character):
print('क्षि'[::-1])
िष्क
This shows mangling of a single "character" (a single grapheme) written in Devanagari, which is, according to Wikipedia, "one of the most used and adopted writing systems in the world".
(Or, if we want to ignore the half of the planet this mangling ever more routinely affects and just focus on the folk who thought they were safe:
print('🇬🇧'[::-1])
🇧🇬
That's a flag of one nation turning into another's. Fortunately flags rarely appear in text -- though that's changing now text is increasingly arbitrary Unicode text like this text I'm writing -- and flag characters are not that important and both Great Britain and Bulgaria are members of the EU so it's probably not nearly as bad as scrambling the text of a billion Indians.)
Graphemes
So you quite reasonably thought to yourself, "Maybe Perl 6 will help".
To quote UAX#29, the Unicode Annex document on "Unicode Text Segmentation":
This document defines a default specification for grapheme clusters.
Perl 6 has implemented a grapheme clustering mechanism. It could in principle cluster in a variety of ways but for now it's implemented the default specification. This is what allows Perl 6 to avoid the mistakes Python's making in the above.
But the Unicode document continues:
[the specification for grapheme clusters] may be customized for particular languages, operations, or other situations.
So you can't just eyeball some text (or give it to some software) and say what "characters" it contains if by "character" you mean "what a user thinks of as a character".
It gets worse...
Keystrokes
"👨👩👧👦ij क्षि 🂡" ... notepad ... 10 key strokes ... google search ... 7 strokes ... Perl6 ... Atom editor ... perl6 thinks क्षि ... is actually 2 graphemes ... I think this is a bug ... notepad (and google search) thinks it is a single grapheme/character
For me, google search needs 10 keystrokes -- because it's not to do with google search but instead aspects of my system, including which web browser I'm using (Firefox) and other details.
Some editors could be configurable so that cursoring over 'क्षि' (or 'fi') would be either 1 or 2 keystrokes depending on how you configure them and/or what language you specify the text is written in. For me, editing this SO answer using Firefox on Linux Mint, it takes 2 keystrokes to cursor over क्षि.
Perl 6 correctly reports the .chars result for 'क्षि' as 2 by default because that's what Unicode says it is per the default grapheme clustering specification. ("Extended Grapheme Clusters".) That happens to match what Firefox on Linux Mint does editing this SO answer because the stars line up and it's Sunday.
Notepad or other software reasonably takes just one keystroke to cursor over क्षि, while other editors reasonably take two, because both are reasonable per the Unicode specification:
arrow key movement ... could use knowledge specific to particular fonts to move in a more granular manner, in circumstances where it would be useful to edit individual components
My emphasis added. Unicode leaves it up to the software to decide how the cursor will move.
Your questions
1: Is the above a fair definition of what it means to be a printable character in unicode?
I don't think so. Hopefully the foregoing explains why, or at least points you in the directions you would need to research (for a year or three) to understand why.
2: ... do you know of any list of printable characters that cover the currently used unicode planes and possible grapheme clusters ...
There's such a vast number of "possible grapheme clusters" that can reasonably occur that even excluding degenerate codepoint combinations leaves you with an effectively infinite list.
And any small subset anyone may have created would not be canonical because the Unicode consortium would not bless it and folk would argue about what should be included.
3: ... how would you go about creating such a list with which I could create a caesar cipher that could cipher any/all printable characters given the following 4 conditions?
First, your conditions are far too strenuous. See the next section.
But even if you drop the ones that are too difficult, it's still far too difficult, and the outcome far too uninteresting, to make doing anything worthwhile.
4: If you think creating such a list is a terrible idea, how would you create the cipher?
If it were me and it had to be a Caesar cipher I'd make it just handle bytes, as per Tom's comment at the start of this answer.
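A minimal sketch of that byte-level approach (note the ciphertext is bytes, generally not valid UTF-8 text, which is exactly the trade-off Tom describes):

# Rotate the UTF-8 bytes themselves; decryption is the inverse shift.
def encrypt(data, shift):
    return bytes((b + shift) % 256 for b in data)

def decrypt(data, shift):
    return bytes((b - shift) % 256 for b in data)

plain = 'क्षि 🂡'.encode('utf8')
assert decrypt(encrypt(plain, 7), 7) == plain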
Your conditions
The string to be encrypted will be a valid utf8 string consisting of standard unicode code points (no unassigned, or private use area codepoints)
It'll need to be more restricted than that, but it's reasonable to say it'll need to be a valid Unicode string. If you want to insist it's utf8 that's fine too.
The encrypted string must also be a valid utf8 string consisting of standard unicode code points
Sure.
You must be able to traverse the encrypted string using the same number of strokes ... as the original string
You can have that for a small subset of Unicode characters. But...
This means [one keystroke for both the original and encoded version of] the devanagari character [क्षि]
... is not a reasonable requirement.
You could ensure the same grapheme clustering (character) interpretation of a given text if you wrote a custom implementation of the grapheme clustering for your cipher that was a faithful copy of the implementation of grapheme clustering used to control the cursor.
But then you'd have to maintain these two code bases to keep them in sync. And that would be for just one particular system configuration.
It would be a ridiculous amount of pain. And for zero, or at most minuscule, gain.
the string to be encrypted and the string that has been decrypted (the end result) must contain the exact same codepoints (the 2 strings must be equal).
So, no normalization.
That rules out all Perl 6's grapheme aware goodies.
Then again, you need to drop paying attention to graphemes anyway because there's effectively an infinite number of them.
Conclusion
My answer only touches lightly on the topics it covers and probably contains lots of errors. If you critique it I'll try improve it.
This is my first post on here, so please excuse me if I have made any mistakes.
So, I was browsing around on the Metasploit page and I found these strange types of codes. I tried searching on Google and on here, but couldn't find any other questions and answers like mine. I also noticed that Elliot used the method in "Mr. Robot" while programming in Python. I can see that the code is usually used in viruses, but I need to know why. This is the code that I found using this method:
buf +=
"\x5b\x4d\x6f\x76\x69\x65\x50\x6c\x61\x79\x5d\x0d\x0a\x46\x69\x6c\x65\x4e\x61\x6d\x65\x30\x3d\x43\x3a\x5c"
It's a string, just as any other string like "Hello World!". However, it's written in a different way.
In computers, each character corresponds to a number, called a code-point, according to an encoding. One such encoding that you might have heard of is ASCII; another is UTF-8. To give an example, in both encodings the letter H corresponds to the number 72.
In Python, one usually specifies a string using the matching letters, like "Hello World!". However, it is also possible to use the code-points. In Python, this can be denoted with \xab, where ab is replaced with the hexadecimal form of the code-point. So H would become '\x48', because 48 is the hexadecimal notation for 72, the code-point for the letter H. In this notation, "Hello World!" becomes "\x48\x65\x6c\x6c\x6f\x20\x57\x6f\x72\x6c\x64\x21".
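You can check this equivalence in a Python session:

>>> "\x48\x65\x6c\x6c\x6f\x20\x57\x6f\x72\x6c\x64\x21"
'Hello World!'
>>> ord('H'), hex(ord('H'))
(72, '0x48')
>>> '\x48' == 'H'
True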
The string you specify consists of the hexadecimal code-point 5b (decimal 91, the code-point for the character [), followed by the code-point 4d (M), etc., leading to the full string [MoviePlay]\r\nFileName0=C:\\. Here \r and \n are special characters together representing a line-break, so one could also read it as:
[MoviePlay]
FileName0=C:\\
In principle this notation is not necessarily found in viruses, but that kind of programming often requires very specific manipulation of numbers in memory without a lot of regard for the actual characters represented by those numbers, so that could explain why you'd see it arise there.
The code is a sequence of ASCII characters encoded in hex.
It can be printed directly.
print('\x5b\x4d\x6f\x76\x69\x65\x50\x6c\x61\x79\x5d\x0d\x0a\x46\x69\x6c\x65\x4e\x61\x6d\x65\x30\x3d\x43\x3a\x5c')
The result is:
[MoviePlay]
FileName0=C:\
They use Metasploit (msfvenom, to be more specific) to generate shellcode for specially crafted exploit files such as documents (doc, ppt, xls, etc.) with different encodings.
I have some strings of roughly 100 characters each, and I need to detect whether each string contains a Unicode character. The final purpose is to check whether some particular emojis are present, but initially I just want a filter that catches all emojis (as well as potentially other special characters). This method should be fast.
I've seen Python regex matching Unicode properties, but I cannot use any custom packages. I'm using Python 2.7. Thanks!
There is no point in testing 'if a string contains Unicode characters', because all characters in a string are Unicode characters. The Unicode standard encompasses all codepoints that Python supports, including the ASCII range (Unicode codepoints U+0000 through to U+007F).
If you want to test for Emoji code, test for specific ranges, as outlined by the Unicode Emoji class specification:
re.compile(
    u'[\u231A-\u231B\u2328\u23CF\u23E9-\u23F3...\U0001F9C0]',
    flags=re.UNICODE)
where you'll have to pick and choose what codepoints you consider to be Emoji. I personally would not include U+0023 NUMBER SIGN in that category for example, but apparently the Unicode standard does.
Note: To be explicit, the above expression is not complete. There are 209 separate entries in the Emoji category and I didn't feel like writing them all out.
Another note: the above uses a \Uhhhhhhhh wide Unicode escape sequence; its use is only supported in a regex pattern in Python 3.3 and up, or in a wide (UCS-4) build for earlier versions of Python. For a narrow Python build, you'll have to match on surrogate pairs for codepoints over U+FFFF.
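A minimal usage sketch (Python 2.7; the character class below is a small assumed subset of the Emoji ranges discussed above, so extend it to suit):

# -*- coding: utf-8 -*-
import re

emoji_re = re.compile(u'[\u231A-\u231B\u2328\u23CF\u23E9-\u23F3]', re.UNICODE)

print(bool(emoji_re.search(u'meet at three \u231A')))  # True (U+231A WATCH)
print(bool(emoji_re.search(u'no emoji here')))         # False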