Python '\x0e' in character-by-character XOR encryption

I am trying to build an encryption system using Python. It is based on the Lorenz cipher machine used by Germany in WWII, though a lot more complicated (7-bit ASCII encryption and 30 rotors, compared with the original's 5-bit and 12 rotors).
So far I have worked out and written the stepping system. I have also created a system for chopping up the plaintext. But when checking the output character by character (by not stitching together the ciphertext), I got this for "hello":
['H', 'Z', '\x0e', '>', 'f']
I have realised that '\x0e' must be some special character in ASCII, but I am certain that when the program goes to decrypt, it will look at each of the letters individually. Can someone please tell me what '\x0e' signifies, whether there are other such characters, and whether there's an easy way to get around it?
Thanks in advance!

It's the ASCII "shift-out" control character and is nonprintable.
A control character which is used in conjunction with SHIFT IN and
ESCAPE to extend the graphic character set of the code. It may alter
the meaning of octets 33 - 126 (dec.). The effect of this character
when using code extension techniques is described in International
Standard ISO 2022.

'\x0e' is the ASCII SO (shift out) unprintable character. It is a single character, and any reasonable program dealing with the string will treat it as such; you're only seeing it represented like that because you're printing a list, which shows the repr of each value in the list.
As for whether there are others: yes, there are 33 of them. ASCII 0-31 and 127 are all generally considered "control characters", which aren't typically printable.
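A minimal sketch to confirm this (the ciphertext list is taken from the question above; the hex-encoding workaround shown is just one option, not the only fix):
cipher = ['H', 'Z', '\x0e', '>', 'f']
print(len(cipher[2]))          # 1 -- it really is a single character
print(ord(cipher[2]))          # 14 -- the SO (shift out) control code
# If you need the ciphertext to be printable (for display, copy/paste, etc.),
# hex-encode it instead of changing the cipher itself:
ciphertext = ''.join(cipher)
printable = ciphertext.encode('ascii').hex()
print(printable)               # '485a0e3e66'
print(bytes.fromhex(printable).decode('ascii') == ciphertext)  # True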

Related

String change from Latin to ASCII

I have tried to change the format of strings from latin1 to ascii, and most of the strings were changed fine except for some characters: æ, ø, Æ, and Ø.
I have checked that the characters were converted correctly when using R (stringi::stri_trans_general(loc1, "latin-ascii")), but Python's unicodedata module did not work well.
Is there any way to convert them correctly in Python? I guess it may need an additional dictionary.
For information, I have applied the following function to change the format:
unicodedata.normalize('NFKD', "Latin strings...").encode('latin1', 'ignore').decode('ascii')
It's important to understand a) what encodings and decodings are; b) how text works; and c) what unicode normalization does.
Strings do not have a "format" in the sense that you describe, so talking about converting from latin1 to ascii format does not make sense. The string has representations (what it looks like when you print it out; or what the code looks like when you create it directly in your code; etc.), and it can be encoded. latin1, ascii etc. are encodings - that means, rules that explain how to store your string as a raw sequence of bytes.
So if you have a string, it is not "in latin1 format" just because the source data was in latin1 encoding - it is not in any format, because that concept doesn't apply. It's just a string.
Similarly, we cannot ask for a string "in ascii format" that we convert to. We can ask for an ascii encoding of the string - which is a sequence of bytes, and not text. (That "not" is one of the most important "not"s in all of computer science, because many people, tools and programs will lie to you about this.)
Of course, the problem here is that ascii cannot represent all possible text. There are over a million "code points" that can theoretically be used as elements of a string (this includes a lot of really weird things like emoji). The latin-1 and ascii encodings both use a single byte per code point in the string. Obviously, this means they can't represent everything. Latin-1 represents only the first 256 possible code points, and ascii represents only the first 128. So if we have data that comes from a latin-1 source, we can get a string with those characters like Æ in it, which cause a problem in our encoding step.
The 'ignore' option for .encode makes the encoder skip things that can't be handled by the encoding. So if you have the string 'barentsøya', since the ø cannot be represented in ascii, it gets skipped and you get the bytes b'barentsya' (using the unfortunately misleading way that Python displays bytes objects back to you).
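For instance:
print('barentsøya'.encode('ascii', 'ignore'))   # b'barentsya' -- the ø is silently dropped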
When you normalize a string, you convert the code points into some preferred canonical form, which treats distinct ways of writing a character - or distinct ways of writing very similar characters - the same way. There are a few different normalization schemes. NFKD chooses decomposed representations for accented characters - that is, instead of using a single symbol to represent a letter with an accent, it will use two symbols, one that represents the plain letter and one representing the "combining" version of the accent. That might seem useful - for example, it would turn an accented A into a plain A plus an accent character. You might think that you can then just encode this as ascii, let the accent characters be ignored, and get the result you want. However, it turns out that this is not enough here: ø and æ (and their capitals) are not accented letters at all, but independent letters with no decomposition mapping, so NFKD leaves them untouched and the encoding step still drops them.
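You can check this directly with a couple of lines:
import unicodedata
print(len(unicodedata.normalize('NFKD', 'é')))   # 2 -- decomposes to 'e' + U+0301
print(len(unicodedata.normalize('NFKD', 'ø')))   # 1 -- no decomposition exists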
Unfortunately, I think the best you can do is to either use a third-party library (and please note that recommendations are off-topic for Stack Overflow) or build the look-up table yourself and just translate each character. (Have a look at the built-in string methods translate and maketrans for help with this.)
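A minimal sketch of that look-up table approach (extend the mapping with whatever other characters your data contains):
table = str.maketrans({'æ': 'ae', 'ø': 'o', 'Æ': 'AE', 'Ø': 'O'})
print('Barentsøya og Ærøya'.translate(table))   # 'Barentsoya og AEroya'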

Caesar cipher with all Unicode printable characters

I want to create a Caesar cipher that can encode/decode Unicode printable characters (single- and multi-codepoint grapheme clusters, emojis etc.) from the whole of Unicode (except the private use area). Preferably, it will use a list of all printable characters.
NOTE: Even though I want to create a Caesar cipher, it is really not about encryption. The question is about investigating the properties of Unicode.
I found these questions:
What is the range of Unicode Printable Characters?
Cipher with all unicode characters
But I didn't get an answer to what I want.
Note:
If you give a coding answer, I am mostly interested in a solution that
uses either python3 or perl6, as they are my main languages.
Recently, I was given an assignment to write a Caesar cipher and then encode and decode an English text.
I solved it in Python by using the string library's built-in string.printable constant. Here is a printout of the constant
(I used Visual Studio Code):
[see python code and results below]
The documentation says:
'''
String of ASCII characters which are considered printable. This is a combination of digits, ascii_letters, punctuation, and whitespace.
'''
https://docs.python.org/3.6/library/string.html#string-constants
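A minimal sketch of that string.printable approach (not my exact assignment code, but the idea):
import string

ALPHABET = string.printable   # the 100 ASCII characters shown below

def caesar(text, shift):
    # rotate each character's index within string.printable
    return ''.join(ALPHABET[(ALPHABET.index(c) + shift) % len(ALPHABET)]
                   for c in text)

encoded = caesar('Hello World!', 3)
print(caesar(encoded, -3))    # 'Hello World!'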
I am wondering how you could create a Caesar cipher that could encode/decode all the possible printable characters you can make from Unicode codepoints (just assume you have all necessary fonts to see those that should be visible on screen).
Here is my understanding of what it means for
something to be a printable character:
When I take the Python string constant above
and traverse it with the left or right arrow keys
on the keyboard, it takes me exactly 100 strokes to get
to the end (the same as the number of characters).
It looks like there is a one-to-one
correspondence between being a printable
character and being traversable with one stroke of an arrow key.
Now consider this string:
"👨‍👩‍👧‍👦ij
क्षि 🂡"
Based on Python's string.printable constant,
this string seems to me to be composed of the
following 7 printable characters:
(you can look up individual codepoints at: https://unicode-table.com/en/)
1 (family) 2 (Latin Small Ligature Ij)
3 (carriage return) 4 (Devanagari kshi)
5 (space) 6 (Zero Width No-Break Space)
7 (Ace of Spades)
👨‍👩‍👧‍👦
codepoints: 128104 8205 128105 8205 128103 8205 128102
(reference: https://emojipedia.org/family-man-woman-girl-boy/)
(Latin Small Ligature Ij)
ij
codepoint: 307
(Carriage Return)
codepoint: 13
(Devanagari kshi)
क्षि
codepoints: 2325 2381 2359 2367
(see this page: http://unicode.org/reports/tr29/)
(the codepoints on that page are given in hexadecimal rather than decimal)
(Space)
codepoint: 32
(Zero Width No-Break Space)
codepoint: 65279
(AKA U+FEFF BYTE ORDER MARK (BOM))
(https://en.wikipedia.org/wiki/Byte_order_mark)
(Playing Card Ace of Spades)
🂡
codepoint: 127137
When I paste this
string into Notepad and try to traverse it with an arrow key,
I end up using 10 key strokes rather than 7,
because the family emoji needs
4 key strokes
(probably because Notepad can't deal with the Zero Width Joiner,
codepoint 8205, and of course Notepad can't display a family glyph).
On the other hand, when I paste the string into google search,
I can traverse the whole string with 7 strokes.
Then I tried creating the string
in Perl6 to see what Perl6's grapheme
awareness would make of the string:
(I use the Atom editor)
[see perl6 code and results below]
Perl 6 thinks that the Devanagari kshi character क्षि (4 codepoints)
is actually 2 graphemes, each with 2 codepoints.
Even though it CAN be represented as two characters,
as seen in the above list,
I think this is a bug. Perl 6 is supposed to be grapheme
aware, and even my Windows Notepad (and google search)
treats it as a single grapheme/character.
Based on the 2 strings,
the practical definition of
a printable character seems to be this:
'It is any combination of Unicode codepoints that can be traversed
by one push of a left or right arrow key on the keyboard
under ideal circumstances.'
"Under ideal circumstances" means that
you are using an environment that, so to speak,
acts like google search:
that is, it recognizes for example an emoji
(the 4-person family) or a grapheme cluster
(the Devanagari character)
as one printable character.
3 questions:
1:
Is the above a fair definition of what it means
to be a printable character in unicode?
2:
Regardless of whether you accept the definition,
do you know of any list of printable characters
that cover the currently used unicode planes and possible
grapheme clusters, rather than just the 100 ASCII characters
the python string library has
(If I had such a list I imagine I could create a cipher
quite easily)?
3:
Given that such a list does not exist, and you
accept the definition,
how would you go about creating such a list with which
I could create a caesar cipher
that could cipher any/all printable
characters given the following 4 conditions?
NOTE: these 4 conditions are just
what I imagine is required for a proper
Caesar cipher.
condition a
The string to be encrypted will
be a valid utf8 string consisting of standard
unicode code points (no unassigned, or private use area
codepoints)
condition b
The encrypted string must also be a valid
utf8 string consisting of standard
unicode code points.
condition c
You must be able to traverse the encrypted string
using the same number of strokes with
the left or right arrow keys on the keyboard as
the original string
(given ideal circumstances as described above).
This means that both the
man-woman-boy-girl family emoji
and the devanagari character,
when encoded, must each correspond to
exactly one other printable character and not a set
of "nonsense" codepoints that the
arrow keys will interpret as different characters.
It also means that a single codepoint character can
potentially be converted into a multi-codepoint character
and vice versa.
condition d
As in any encrypt/decrypt algorithm,
the string to be encrypted and
the string that has been decrypted
(the end result) must
contain the exact same codepoints
(the 2 strings must be equal).
# Python 3.6:
import string
# built-in library
print(string.printable)
print(type(string.printable))
print(len(string.printable))
# length of the string (number of ASCII characters)
#perl6
use v6;
my @ordinals = <128104 8205 128105 8205 128103 8205 128102>;
# array of the family codepoints
@ordinals.append(<307 13 2325 2381 2359 2367 32 65279 127137>);
# add the other codepoints
my $result_string = '';
for @ordinals {
    $result_string = $result_string ~ $_.chr;
}
# get a string of characters from the ordinal numbers
say @ordinals;                # the list of codepoints
say $result_string;           # the string
say $result_string.chars;     # the number of characters
say $result_string.comb.perl; # a list of characters in the string
python results:
0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ!"#$%&'()*+,-./:;<=>?@[\]^_`{|}~
<class 'str'>
100
perl6 results:
[128104 8205 128105 8205 128103 8205 128102 307 13 2325 2381 2359 2367 32 65279 127137]
👨‍👩‍👧‍👦ij
क्षि 🂡
8
("👨‍👩‍👧‍👦", "ij", "\r", "क्", "षि", " ", "", "🂡").Seq
TL;DR I think your question is reasonable and deserves a better answer than the one I've written so far. Let's talk.
I don't think anyone can create a Caesar cipher with the requirements you list for multiple reasons.
But if your goal is really to "investigate properties of Unicode" rather than create a cipher then presumably that doesn't matter.
And of course I might just be experiencing a failure of imagination, or just a failure to understand Unicode despite spending years grappling with it.
If you critique technical aspects of my explanation below via comments I'll try to improve it and hopefully we'll both learn as we go. TIA.
"Caesar cipher with all Unicode printable characters"
This is the clean formulation you have in your title.
The only problematic parts are "Caesar", "cipher", "all", "Unicode", "printable", and "characters". Let's go thru them.
Caesar cipher
A Caesar cipher is an especially simple single-alphabet cipher. Unicode isn't one big single alphabet. But perhaps you could treat a subset of its codepoints as if they were one.
I'd say that's what the SO Cipher with all unicode characters was all about.
You've currently rejected that and introduced a bunch of extra aspects that are either impossible or so difficult that they might as well be.
Ignoring your priority of investigating Unicode properties it would make sense if you instead settled for a regular ASCII cipher. Or perhaps go back to that Cipher with all unicode characters SO and pick up where they left off, perhaps noting that, according to a comment on that SO they apparently stopped at just the BMP plane:
Note that you’re only using BMP code points (i.e. from U+0000 to U+FFFF). Unicode ranges from U+0000 to U+10FFFF, so you’re missing about a million code points :)
So perhaps you could do better. I don't think it would be worthwhile from the perspective of creating a cipher for its own sake but it might be for learning more about the properties of Unicode.
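If you went that route, the core is just a shift over whichever subset of codepoints you settle on. A sketch (choosing the subset well is the hard part, as discussed below; printable ASCII is only a stand-in here):
SUBSET = list(range(0x20, 0x7F))   # stand-in alphabet: printable ASCII codepoints
INDEX = {cp: i for i, cp in enumerate(SUBSET)}

def shift_codepoints(text, key):
    # a single-alphabet shift over the chosen codepoint list
    return ''.join(chr(SUBSET[(INDEX[ord(c)] + key) % len(SUBSET)]) for c in text)

msg = 'attack at dawn'
print(shift_codepoints(shift_codepoints(msg, 13), -13) == msg)   # True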
Cipher
@TomBlodget in their comment on your question notes that:
The complexity of text motivates modern ciphers to not deal with characters. They deal with bytes for both input and output. Where the input is text, the receiver has to be told the character encoding. Where further handling of the output must be as text, Base64 or similar is used. Expecting the output of a cipher to look like text is not generally a goal.
If you want a universal solution for a Unicode cipher, follow Tom's recipe.
All
In a comment on your question about the number of graphemes @nwellnhof noted that:
there's an infinite number
But you then also quite reasonably replied that there's only going to be a finite number in any given text; that Unicode's intent is that Unicode compliant software may/will generate mojibake results if given degenerate input (where what counts as degenerate is somewhat open to refinement in Unicode updates); and that that's the basis on which you hope to proceed.
That's a reasonable response, but you still can't have "all" even when restricted to "all non-degenerate" and "only ones that could appear in real life", because there's still an effectively infinite number of well formed and potentially reasonable characters.
I really ought to insert some calculations here to put some bounds on the problem. Is "effectively infinite" a trillion? Why? That sort of thing. But before digging into that I'll await comments.
Let's pretend it's a trillion, and that that's not a problem, and move on.
Unicode
Unicode is enormously complex.
You've been given an assignment to produce a Caesar cipher, a very simple thing.
They really don't mix well unless you lean heavily on keeping things simple.
But you want to investigate properties of Unicode. So perhaps you want to wade into all the complexity. But then the question is, how many years do you want to spend exploring the consequences of opening this pandora's box? (I've been studying Unicode on and off for a decade. It's complicated.)
Printable
You linked to the SO question "What is the range of Unicode Printable Characters?". This includes an answer that notes:
The more you learn about Unicode, the more you realize how unexpectedly diverse and unfathomably weird human writing systems are. In particular whether a particular "character" is printable is not always obvious.
But you presumably read that and refused to be deterred. Which is both admirable and asking for trouble. For example, it seems to have driven you to define "printable" as something like "takes one or more keystrokes to traverse" which is so fraught it's hard to know where to start -- so I'll punt on that till later in this answer.
Characters
Given that your aim is to write a Caesar cipher, a cipher that was used thousands of years ago that acts on characters, it makes sense that you focused on "what a user thinks of as a character".
Per Unicode's definition, this is called a "grapheme".
One of your example characters makes it clear how problematic the distinction is between "what a user thinks of as a character" (a grapheme) and a codepoint (what Python thinks of as a character):
print('क्षि'[::-1])
िष्क
This shows mangling of a single "character" (a single grapheme) written in Devanagari, which is, according to Wikipedia, "one of the most used and adopted writing systems in the world".
(Or, if we want to ignore the half of the planet this mangling ever more routinely affects and just focus on the folk who thought they were safe:
print('🇬🇧'[::-1])
🇧🇬
That's a flag of one nation turning into another's. Fortunately flags rarely appear in text -- though that's changing now that text is increasingly arbitrary Unicode text like this text I'm writing -- and flag characters are not that important, and both Great Britain and Bulgaria are members of the EU, so it's probably not nearly as bad as scrambling the text of a billion Indians.)
Graphemes
So you quite reasonably thought to yourself, "Maybe Perl 6 will help".
To quote UAX#29, the Unicode Annex document on "Unicode Text Segmentation":
This document defines a default specification for grapheme clusters.
Perl 6 has implemented a grapheme clustering mechanism. It could in principle cluster in a variety of ways but for now it's implemented the default specification. This is what allows Perl 6 to avoid the mistakes Python's making in the above.
But the Unicode document continues:
[the specification for grapheme clusters] may be customized for particular languages, operations, or other situations.
So you can't just eyeball some text (or give it to some software) and say what "characters" it contains if by "character" you mean "what a user thinks of as a character".
It gets worse...
Keystrokes
"👨‍👩‍👧‍👦ij क्षि 🂡" ... notepad ... 10 key strokes ... google search ... 7 strokes ... Perl6 ... Atom editor ... perl6 thinks क्षि ... is actually 2 graphemes ... I think this is a bug ... notepad (and google search) thinks it is a single grapheme/character
For me, google search needs 10 keystrokes -- because it's not to do with google search but instead aspects of my system, including which web browser I'm using (Firefox) and other details.
Some editors could be configurable so that cursoring over 'क्षि' (or 'fi') would be either 1 or 2 keystrokes depending on how you configure them and/or what language you specify the text is written in. For me, editing this SO answer using Firefox on Linux Mint, it takes 2 keystrokes to cursor over क्षि.
Perl 6 correctly reports the .chars result for 'क्षि' as 2 by default because that's what Unicode says it is per the default grapheme clustering specification. ("Extended Grapheme Clusters".) That happens to match what Firefox on Linux Mint does editing this SO answer because the stars line up and it's Sunday.
Notepad or other software reasonably takes just one keystroke to cursor over क्षि, while other editors reasonably take two, because both are reasonable per the Unicode specification:
arrow key movement ... could use knowledge specific to particular fonts to move in a more granular manner, in circumstances where it would be useful to edit individual components
My emphasis added. Unicode leaves it up to the software to decide how the cursor will move.
Your questions
1: Is the above a fair definition of what it means to be a printable character in unicode?
I don't think so. Hopefully the foregoing explains why, or at least points you in the directions you would need to research (for a year or three) to understand why.
2: ... do you know of any list of printable characters that cover the currently used unicode planes and possible grapheme clusters ...
There's such a vast number of "possible grapheme clusters" that can reasonably occur that even excluding degenerate codepoint combinations leaves you with an effectively infinite list.
And any small subset anyone may have created would not be canonical because the Unicode consortium would not bless it and folk would argue about what should be included.
3: ... how would you go about creating such a list with which I could create a caesar cipher that could cipher any/all printable characters given the following 4 conditions?
First, your conditions are far too strenuous. See the next section.
But even if you drop the ones that are too difficult, it's still far too difficult, and the outcome far too uninteresting, to make doing anything worthwhile.
4: If you think creating such a list is a terrible idea, how would you create the cipher?
If it were me and it had to be a Caesar cipher I'd make it just handle bytes, as per Tom's comment at the start of this answer.
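A byte-level sketch of that (the ciphertext is raw bytes, not text; Base64 it if it must travel as text, per the comment quoted above):
def caesar_bytes(data, key):
    # shift every byte mod 256; text goes in and out via an explicit encoding
    return bytes((b + key) % 256 for b in data)

encrypted = caesar_bytes('क्षि 🂡'.encode('utf8'), 42)   # bytes, likely not valid utf8
decrypted = caesar_bytes(encrypted, -42).decode('utf8')
print(decrypted)   # the original text, codepoint for codepoint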
Your conditions
The string to be encrypted will be a valid utf8 string consisting of standard unicode code points (no unassigned, or private use area codepoints)
It'll need to be more restricted than that, but it's reasonable to say it'll need to be a valid Unicode string. If you want to insist it's utf8 that's fine too.
The encrypted string must also be a valid utf8 string consisting of standard unicode code points
Sure.
You must be able to traverse the encrypted string using the same number of strokes ... as the original string
You can have that for a small subset of Unicode characters. But...
This means [one keystroke for both the original and encoded version of] the devanagari character [क्षि]
... is not a reasonable requirement.
You could ensure the same grapheme clustering (character) interpretation of a given text if you wrote a custom implementation of grapheme clustering for your cipher that was a faithful copy of the implementation used to control the cursor.
But then you'd have to maintain these two code bases to keep them in sync. And that would be for just one particular system configuration.
It would be a ridiculous amount of pain. And for zero, or at most minuscule, gain.
the string to be encrypted and the string that has been decrypted (the end result) must contain the exact same codepoints (the 2 strings must be equal).
So, no normalization.
That rules out all Perl 6's grapheme aware goodies.
Then again, you need to drop paying attention to graphemes anyway because there's effectively an infinite number of them.
Conclusion
My answer only touches lightly on the topics it covers and probably contains lots of errors. If you critique it I'll try to improve it.

What does the "\x5b\x4d\x6f etc.." mean in Python?

This is my first post on here, so please excuse me if I have made any mistakes.
I was browsing around on the Metasploit page and found these strange types of codes. I tried searching on Google and on here, but couldn't find any other questions and answers like this. I also noticed that Elliot used the method in "Mr. Robot" while programming in Python. I can see that the code is usually used in viruses, but I need to know why. This is the code that I found using this method:
buf +=
"\x5b\x4d\x6f\x76\x69\x65\x50\x6c\x61\x79\x5d\x0d\x0a\x46\x69\x6c\x65\x4e\x61\x6d\x65\x30\x3d\x43\x3a\x5c"
It's a string, just like any other string such as "Hello World!". However, it's written in a different way. In computers, each character corresponds to a number, called a code point, according to an encoding. One such encoding that you might have heard of is ASCII, another is UTF-8. To give an example, in both encodings the letter H corresponds to the number 72. In Python, one usually specifies a string using the matching letters, like "Hello World!". However, it is also possible to use the code points. In Python, this can be denoted with \xab, where ab is replaced with the hexadecimal form of the code point. So H would become '\x48', because 48 is the hexadecimal notation for 72, the code point for the letter H. In this notation, "Hello World!" becomes "\x48\x65\x6c\x6c\x6f\x20\x57\x6f\x72\x6c\x64\x21".
The string you specify consists of the hexadecimal code point 5b (decimal 91, the code point for the character [), followed by the code point 4d (M), etc., leading to the full string [MoviePlay]\r\nFileName0=C:\\. Here \r and \n are special characters together representing a line break, so one could also read it as:
[MoviePlay]
FileName0=C:\
In principle this notation is not necessarily found in viruses, but that kind of programming often requires very specific manipulation of numbers in memory without a lot of regard for the actual characters represented by those numbers, so that could explain why you'd see it arise there.
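You can verify the equivalence yourself:
print('\x48\x65\x6c\x6c\x6f' == 'Hello')   # True -- same string, different spelling
print('\x5b\x4d\x6f\x76\x69\x65\x50\x6c\x61\x79\x5d')   # [MoviePlay]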
The code is a sequence of ASCII characters encoded in hex.
It can be printed directly:
print('\x5b\x4d\x6f\x76\x69\x65\x50\x6c\x61\x79\x5d\x0d\x0a\x46\x69\x6c\x65\x4e\x61\x6d\x65\x30\x3d\x43\x3a\x5c')
The result is:
[MoviePlay]
FileName0=C:\
They use Metasploit (msfvenom, to be more specific) to generate shellcode, typically embedded in specially crafted or exploited files such as documents (doc, ppt, xls, etc.), with different encodings.

Bizarre behavior of python printing non-alphabetic ASCII characters

I have the following Python code:
for num in range(80, 150):
    input()
    print(num)
    print(chr(27))
    print(chr(num))
The input() statement is only there to control how quickly the for loop proceeds. I am not expecting this to do anything special, but when the loop hits certain numbers, printing that ASCII character, preceded by ASCII 27 (which is the ESC character) does some unexpected things:
At 92 and 94, the number does not print. http://i.stack.imgur.com/DzUew.png
At 99 (the letter c), a bunch of terminal output gets deleted. http://i.stack.imgur.com/5XPy3.png
At 108 (the letter l), the current line jumps up several lines (but text remains below). (didn't get a proper screencap, I'll add one later if it helps)
At 128 or 129, the first character starts getting masked. You have to type something (I typed "jjj") in order to prevent this from happening on that line. http://i.stack.imgur.com/DRwTm.png
I don't know why any of this happens although I imagine it has something to do with the ESC character interacting with the terminal. Could someone help me figure this out?
It is due to confusion between escape sequences and character encoding.
Your program is printing escape sequences, including
escape c (resets the terminal)
escape ^ (begins a privacy message, which causes other characters to be eaten)
In ISO-8859-1 (and ECMA-48), character bytes between 128 and 159 are considered control characters, referred to as C1 controls. Several of these are treated the same as escape combined with another character. The mapping between C1 and "another character" is not straightforward, but the interesting ones include
0x9a which is device attributes, causing characters to be sent to the host.
0x9b which is control sequence initiator, more usually seen as escape[.
On the other hand, bytes in the 128-159 range are legal parts of a UTF-8 character. If your terminal is not properly configured to match the locale settings, you can find that your terminal responds to control sequences.
OSX terminal implements (does not document...) many of the standard control sequences. XTerm documents these (and many others), so you may find the following useful:
XTerm Control Sequences
C1 (8-Bit) Control Characters (specifically)
Standard ECMA-48:
Control Functions for Coded Character Sets
For amusement, you are referred to the xterm FAQ: Interesting but misleading
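If you just want to inspect the characters without the terminal acting on them, print their repr() instead of the raw characters; a side-effect-free sketch of the loop from the question:
import unicodedata

for num in range(80, 150):
    ch = chr(num)
    # repr() escapes control characters rather than sending them to the terminal
    print(num, repr(ch), unicodedata.name(ch, '<control/unassigned>'))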
ESC combined with those characters forms a special code for the terminal.
A terminal control code is a special sequence of characters that is
printed (like any other text). If the terminal understands the code,
it won't display the character-sequence, but will perform some action.
You can print the codes with a simple echo command.
Terminal Codes
For example,
ESC \ = ST, String Terminator (chr(92))
ESC ^ = PM, Privacy Message (chr(94)).
Control sequences differ depending on which terminal you use.
More about:
Xterm Control Sequences
ANSI escape code
ANSI/VT100 Terminal Control Escape Sequences

Processing delimiters with python

I'm currently trying to parse an Apache log in a format I can't handle normally. (I tried using goaccess.)
In Sublime the delimiters show up as ENQ, SOH, and ETX, which to my understanding are "|", space, and superscript L. I'm trying to use re.split to separate the individual components of the log, but I'm not sure how to deal with the superscript L.
In Sublime it shows up as 3286d68255beaf010000543a000012f1/Madonna_Home_1.jpgENQx628a135bENQZ1e5ENQAB50632SOHA50.134.214.130SOHC98.138.19.91SOHD42857ENQwwww.newprophecy.net...
with ENQs as '|' and SOH as ' ' when I open the file in a plain text editor (like Notepad).
I just need to parse out the IP addresses, so the rest of the line is mostly irrelevant.
Currently I have
pkts = re.split(r"\s|\|", line)
but I don't know what to do for the L.
Those 3-letter codes are ASCII control codes: ASCII characters which occur before 32 (the space character) in the ASCII character set. You can find a full list online.
These characters do not correspond to anything printable, so you're incorrect in assuming they correspond to those printable characters. You can refer to them as literals in several languages using \x00 notation: for example, control code ETX corresponds to \x03 (see the list mentioned above). You can use these to split strings or anything else.
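For example (ENQ is \x05, SOH is \x01, ETX is \x03; the sample line here is abbreviated from the question):
import re

line = 'x628a135b\x05Z1e5\x05AB50632\x01A50.134.214.130\x01C98.138.19.91\x01D42857'

# split on the control codes directly...
print(re.split('[\x01\x03\x05]', line))

# ...or, since only the IP addresses matter, just pull those out
print(re.findall(r'(?:\d{1,3}\.){3}\d{1,3}', line))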
This is the literal answer to your question, but all this aside, I find it quite unlikely that you actually need to split your Apache log file by control codes. At a guess, what's actually happened is that some Unicode characters have crept into your log file somehow, perhaps with UTF-8 encoding. An encoding is a way of representing characters that extend beyond the 255 limit of a single byte by encoding extended characters with multiple bytes.
There are several types of encoding, but UTF-8 is one of the most popular. If you use UTF-8 it has the property that standard ASCII characters will appear as normal (so you might never even realise that UTF-8 was being used), but if you view the file in an editor which isn't UTF-8 aware (or which incorrectly identifies the file as plain ASCII) then you'll see these odd control codes. These are places where really the code and the character(s) before or after it should be interpreted together as a single unit.
I'm not sure that this is the reason, it's just an educated guess, but if you haven't already considered it then it's important to figure out the encoding of your file, since it'll affect how you interpret its entire content. I suggest loading the file into an editor that understands encodings (I'm sure something as popular as Sublime does, with proper configuration), forcing the encoding to UTF-8, and seeing if that makes the content more sensible.
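To illustrate that guess: UTF-8 bytes read with the wrong single-byte encoding turn into exactly this kind of control-code debris:
raw = 'Ærøya'.encode('utf-8')   # bytes from a UTF-8 source
print(raw.decode('latin-1'))    # 'Ã\x86rÃ¸ya' -- the \x86 byte is a C1 control,
                                # the kind of thing an editor shows as a control code
print(raw.decode('utf-8'))      # 'Ærøya' -- fine with the right encoding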
