Identify the contents a file through a program in python [duplicate]

Identify the contents a file through a program in python [duplicate] - python

This question already has answers here:
Tools to help reverse engineer binary file formats
(9 answers)
Closed 6 years ago.
I have a file here. To me it appears it is a binary file. This is raw file and I believe that it has the stock information in OHLCV (Open, High, Low, Close, Volume). Besides it may also have some text.
One of the entries that I could possibly have for OHLCV is
464.95, 468.3, 460, 465.65, 3957854
This is the code that I have tried. I dont fully understand about ASCII and Unicode.
input_file = "00063181.dat" # tata motors
with open(input_file, "rb") as fh:
buf = fh.read()
output_l = list(map(int , buf))
print (output_l)
My Doubt: How do I decode this file and make sense out of it? Is there any way for me to read this file through a program written in python and separate the text from int/float? I am using Python 3 and Win 10 64 bit.

You're looking to reverse engineer the structure of a binary file using Python. Since you've declared that the file is binary, it may prove difficult. You're going to need to examine the contents of the file and use your best intuition to try to infer the structure. The first thing you're going to want is a way to display each of the bytes of the file an a way that will help you understand the meaning.
Fortunately, someone has already written a tool to do this, hexdump. Install that package using pip.
The function you need from that package is hexdump, so let's import it the package and get help on the function.
>>> import hexdump
>>> help(hexdump.hexdump)
Help on function hexdump in module hexdump:
hexdump(data, result='print')
Transform binary data to the hex dump text format:
00000000: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
[x] data argument as a binary string
[x] data argument as a file like object
Returns result depending on the `result` argument:
'print' - prints line by line
'return' - returns single string
'generator' - returns generator that produces lines
Now you can start to explore the contents of your file. Use the slice operator to do it in chunks. For example, to render the contents of the first 1KB of your file:
>>> hexdump.hexdump(buf[:1024])
00000000: C3 8E C2 8F 22 13 C2 AA 66 2A 22 47 C3 94 C3 AA ...."...f*"G....
00000010: C3 89 C3 A0 C3 B1 C3 91 6A C2 A4 C3 BF 3C C2 AA ........j....<..
00000020: C2 91 73 C3 85 46 57 47 C2 88 C3 99 C2 B6 3E 2D ..s..FWG......>-
00000030: C3 BA 69 10 C2 93 C3 94 38 C3 81 7A 6A 43 30 7C ..i.....8..zjC0|
00000040: C3 BB C2 AA 01 2D C2 97 C3 83 C3 88 64 14 C3 9C .....-......d...
00000050: C2 AB C2 AA C3 A2 74 C2 85 5D C3 97 4E 64 68 C3 ......t..]..Ndh.
...
000003C0: 42 C2 8F 06 7F 12 33 7F 79 1E 2C 2A 0F C3 92 36 B.....3.y.,*...6
000003D0: C3 A6 C2 96 C2 93 C2 8B 43 C2 9F 4C C2 95 48 24 ........C..L..H$
000003E0: C2 B3 C2 82 26 C3 88 C3 BD C3 96 12 1E 5E 18 2E ....&........^..
000003F0: 37 C3 A7 C2 87 C3 AE 00 4F 3F C2 9C C3 A8 1C C2 7.......O?......
Hexdump has a nice property of rendering the byte position, the hex code, and then (if possible) the printable form of the character on the right.
Hopefully some of your text values will be visible there and that will give some clue as to how to reverse engineer your file.
Once you've started to determine how your file is structured, you can use the various string operators to manipulate your data. For example, if you find that your file is split into sections by the null byte (b'\x00'), you can get those sections thus:
>>> sections = buf.split(b'\x00')
There are a lot of things that you're likely to have to learn as you dig deeper, like character encodings, number encodings (including little-endian for integers and floating-point encoding for floating point numbers). You'll want to find some way to externally validate your results.
Best of luck.

Related

Difference in result while reading same file with node and python

I have been trying to read the contents of the genesis.block given in this file of the Node SDK in Hyperledger Fabric using Python. However, whenever I try to read the file with Python by using
data = open("twoorgs.genesis.block").read()
The value of the data variable is as follows:
>>> data
'\n'
With nodejs using fs.readFileSync() I obtain an instance of Buffer() for the same file.
var data = fs.readFileSync('./twoorgs.genesis.block');
The result is
> data
<Buffer 0a 22 1a 20 49 63 63 ac 9c 9f 3e 48 2c 2c 6b 48 2b 1f 8b 18 6f a9 db ac 45 07 29 ee c0 bf ac 34 99 9e c2 56 12 e1 84 01 0a dd 84 01 0a d9 84 01 0a 79 ... >
How can I read this file successfully using Python?

You file has a 1a in it. This is Ctrl-Z, which is an end of file on Windows.
So try binary mode like:
data = open("twoorgs.genesis.block", 'rb').read()

Speed up python code

I have some text file in following format (network traffic collected by tcpdump):
1505372009.023944 00:1e:4c:72:b8:ae > 00:23:f8:93:c1:af, ethertype IPv4 (0x0800), length 97: (tos 0x0, ttl 64, id 5134, offset 0, flags [DF], proto TCP (6), length 83)
192.168.1.53.36062 > 74.125.143.139.443: Flags [P.], cksum 0x67fd (correct), seq 1255996541:1255996572, ack 1577943820, win 384, options [nop,nop,TS val 356377 ecr 746170020], length 31
0x0000: 0023 f893 c1af 001e 4c72 b8ae 0800 4500 .#......Lr....E.
0x0010: 0053 140e 4000 4006 8ab1 c0a8 0135 4a7d .S..#.#......5J}
0x0020: 8f8b 8cde 01bb 4adc fc7d 5e0d 830c 8018 ......J..}^.....
0x0030: 0180 67fd 0000 0101 080a 0005 7019 2c79 ..g.........p.,y
0x0040: a6a4 1503 0300 1a00 0000 0000 0000 04d1 ................
0x0050: c300 9119 6946 698c 67ac 47a9 368a 1748 ....iFi.g.G.6..H
0x0060: 1c .
and want to change it to:
1505372009.023944
000000: 00 23 f8 93 c1 af 00 1e 4c 72 b8 ae 08 00 45 00 .#......Lr....E.
000010: 00 53 14 0e 40 00 40 06 8a b1 c0 a8 01 35 4a 7d .S..#.#......5J}
000020: 8f 8b 8c de 01 bb 4a dc fc 7d 5e 0d 83 0c 80 18 ......J..}^.....
000030: 01 80 67 fd 00 00 01 01 08 0a 00 05 70 19 2c 79 ..g.........p.,y
000040: a6 a4 15 03 03 00 1a 00 00 00 00 00 00 00 04 d1 ................
000050: c3 00 91 19 69 46 69 8c 67 ac 47 a9 36 8a 17 48 ....iFi.g.G.6..H
000060: 1c .
Here is what I have done:
import re
regexp_time =re.compile("\d\d\d\d\d\d\d\d\d\d.\d\d\d\d\d\d+")
regexp_hex = re.compile("(\t0x\d+:\s+)([0-9a-f ]+)+ ")
with open ('../Traffic/traffic1.txt') as input,open ('../Traffic/txt2.txt','w') as output:
for line in input:
if regexp_time.match(line):
output.write ("%s\n" % (line.split()[0]))
elif regexp_hex.match(line):
words = re.split(r'\s{2,}', line)
bytes=""
for byte in words[1].split():
if len(byte) == 4:
bytes += "%s%s %s%s "%(byte[0],byte[1],byte[2],byte[3])
elif len(byte) == 2:
bytes += "%s%s "%(byte[0],byte[1])
output.write ("%s %s %s \n" % (words[0].replace("0x","00"),"{:<47}".format (bytes),words[2].replace("\n","")))
input.close()
output.close()
Could some one help me in speed up?
Edit
Here is the new version of code depends on #Austin answer, It really speed up the code.
with open ('../Traffic/traffic1.txt') as input,open ('../Traffic/txt1.txt','w') as output:
for line in input:
if line[0].isdigit():
output.write (line[:16])
output.write ('\n')
elif line.startswith("\t0x"):#(Since there is line which is not hex and not start with timestamp I should check this as well)
offset = line[:10] # " 0x0000: "
words = line[10:51] # "0023 f893 c1af 001e 4c72 b8ae 0800 4500 "
chars = line[51:] # " .#......Lr....E."
line = [offset.replace('x', '0', 1)]
for a,b,c,d,space in zip (words[0::5],words[1::5],words[2::5],words[3::5],words[4::5]):
line.append(a)
line.append(b)
line.append(space)
line.append(c)
line.append(d)
line.append(space)
line.append (chars)
output.write (''.join (line))
input.close()
output.close()
Here is the result:
1505372009.02394
000000: 00 23 f8 93 c1 af 00 1e 4c 72 b8 ae 08 00 45 00 .#......Lr....E.
000010: 00 53 14 0e 40 00 40 06 8a b1 c0 a8 01 35 4a 7d .S..#.#......5J}
000020: 8f 8b 8c de 01 bb 4a dc fc 7d 5e 0d 83 0c 80 18 ......J..}^.....
000030: 01 80 67 fd 00 00 01 01 08 0a 00 05 70 19 2c 79 ..g.........p.,y
000040: a6 a4 15 03 03 00 1a 00 00 00 00 00 00 00 04 d1 ................
000050: c3 00 91 19 69 46 69 8c 67 ac 47 a9 36 8a 17 48 ....iFi.g.G.6..H
000060: 1c .

You haven't specified anything else about your file format, including what if any lines appear between blocks of packet data. So I'm going to assume that you just have paragraphs like the one you show, jammed together.
The best way to speed up something like this is to reduce the extra operations. You have a bunch! For example:
You use a regex to match the "start" line.
You use a split to extract the timestamp from the start line.
You use a %-format operator to write the timestamp out.
You use a different regex to match a "hex" line.
You use more than one split to parse the hex line.
You use various formatting operators to output the hex line.
If you're going to use regular expression matching, then I think you should just do one match. Create an alternate pattern (like a|b) that describes both lines. Use match.lastgroup or .lastindex to decide what got matched.
But your lines are so different that I don't think a regex is needed. Basically, you can decide what sort of line you have by looking at the very first character:
if line[0].isdigit():
# This is a timestamp line
else:
# This is a hex line
For timestamp processing, all you want to do is print out the 17 characters at the start of the line: 11 digits, a dot, and 6 more digits. So do that:
if line[0].isdigit():
output.write(line[:17], '\n')
For hex line processing, you want to make two kinds of changes: you want to replace the 'x' in the hex offset with a zero. That's easy:
hexline = line.replace('x', '0', 1) # Note: 1 replacement only!
Then, you want to insert spaces between the groups of 4 hex digits, and pad the short lines so the character display appears in the same column.
This is a place where regular expression replacement might help you. There's a limited number of occurrences, but it may be that the overhead of the Cpython interpreter costs more than the setup and teardown for a regex replacement. You probably should do some profiling on this.
That said, you can split the line into three parts. It's important to capture the trailing space on the middle part, though:
offset = line[:13] # " 0x0000: "
words = line[13:53] # "0023 f893 c1af 001e 4c72 b8ae 0800 4500 "
chars = line[53:] # " .#......Lr....E."
You already know how to replace the 'x' in the offset, and there's nothing to be done to the chars portion of the line. So we'll leave those alone. The remaining task is to spread out the characters in the
words string. You can do that in various ways, but it seems easy to process the characters in chunks of 5 (4 hex digits plus a trailing space).
We can do this because we captured the trailing space on the words part. If not, you might have to use itertools.zip_longest(..., fill_value=''), but it's probably easier just to grab one more character.
With that done, you can do:
for a,b,c,d,space in zip(words[0::5], words[1::5], words[2::5], words[3::5], words[4::5]):
output.write(a, b, space, c, d, space)
Alternatively, instead of making all those calls you could accumulate the characters in a buffer and then write the buffer one time. Something like:
line = [offset]
for ...:
line.extend(a, b, space, c, d, space)
line.append(chars)
line.append('\n')
output.write(''.join(line))
That's fairly straightforward, but like I said, it may not perform quite as well as a regular-expression replacement. That would be due to the regex code running as "C" rather than python bytecode. So you should compare it against a pattern replacement like:
words = re.sub(r'(..)(..) ', '\1 \2 ', words)
Note that I didn't require hex digits, in order to cause any trailing "padding" spaces on the last line of a paragraph to expand in proportion.
Again, please check the performance against the zip version above!

Python: convert hex bytestream to “int16"

So I'm working with incoming audio from Watson Text to Speech. I want to play the sound immediately when data arrives to Python with a websocket from nodeJS.
This is a example of data I'm sending with the websocket:
<Buffer e3 f8 28 f9 fa f9 5d fb 6c fc a6 fd 12 ff b3 00 b8 02 93 04 42 06 5b 07 e4 07 af 08 18 0a 95 0b 01 0d a2 0e a4 10 d7 12 f4 12 84 12 39 13 b0 12 3b 13 ... >
So the data arrives as a hex bytestream and I try to convert it to something that Sounddevice can read/play. (See documentation: The types 'float32', 'int32', 'int16', 'int8' and 'uint8' can be used for all streams and functions.) But how can I convert this?
I already tried something, but when I run my code I only hear some noise, nothing recognizable.
Here you can read some parts of my code:
def onMessage(self, payload, isBinary):
a = payload.encode('hex')
queue.put(a)
After I receive the bytesstream and convert to hex, I try to send the incoming bytestream to Sounddevice:
def stream_audio():
with sd.OutputStream(channels=1, samplerate=24000, dtype='int16', callback=callback):
sd.sleep(int(20 * 1000))
def callback(outdata, frames, time, status):
global reststuff, i, string
LENGTH = frames
while len(reststuff) < LENGTH:
a = queue.get()
reststuff += a
returnstring = reststuff[:LENGTH]
reststuff = reststuff[LENGTH:]
for char in returnstring:
i += 1
string += char
if i % 2 == 0:
print string
outdata[:] = int(string, 16)
string = ""

look at your stream of data:
e3 f8 28 f9 fa f9 5d fb 6c fc a6 fd 12 ff b3 00
b8 02 93 04 42 06 5b 07 e4 07 af 08 18 0a 95 0b
01 0d a2 0e a4 10 d7 12 f4 12 84 12 39 13 b0 12
3b 13
you see here that every two bytes the second one is starting with e/f/0/1 which means near zero (in two's complement).
So that's your most significant bytes, so your stream is little-endian!
you should consider that in your conversion.
If I have more data I would have tested but this is worth some miliseconds!

Writing hex data into a file

I'm trying to write hex data taken from ascii file to a newly created binary file
ascii file example:
98 af b7 93 bb 03 bf 8e ae 16 bf 2e 52 43 8b df
4f 4e 5a e4 26 3f ca f7 b1 ab 93 4f 20 bf 0a bf
82 2c dd c5 38 70 17 a0 00 fd 3b fe 3d 53 fc 3b
28 c1 ff 9e a9 28 29 c1 94 d4 54 d4 d4 ff 7b 40
my code
hexList = []
with open('hexFile.txt', 'r') as hexData:
line=hexData.readline()
while line != '':
line = line.rstrip()
lineHex = line.split(' ')
for i in lineHex:
hexList.append(int(i, 16))
line = hexData.readline()
with open('test', 'wb') as f:
for i in hexList:
f.write(hex(i))
Thought hexList holds already hex converted data and f.write(hex(i)) should write these hex data into a file, but python writes it with ascii mode
final output: 0x9f0x2c0x380x590xcd0x110x7c0x590xc90x30xea0x37 which is wrong!
where is the issue?

Use binascii.unhexlify:
>>> import binascii
>>> binascii.unhexlify('9f')
'\x9f'
>>> hex(int('9f', 16))
'0x9f'
import binascii
with open('hexFile.txt') as f, open('test', 'wb') as fout:
for line in f:
fout.write(
binascii.unhexlify(''.join(line.split()))
)

replace:
f.write(hex(i))
With:
f.write(chr(i)) # python 2
Or,
f.write(bytes((i,))) # python 3
Explanation
Observe:
>>> hex(65)
'0x41'
65 should translate into a single byte but hex returns a four character string. write will send all four characters to the file.
By contrast, in python2:
>>> chr(65)
'A'
This does what you want: chr converts the number 65 to the character single-byte string which is what belongs in a binary file.
In python3, chr(i) is replaced by bytes((i,)).

How can I make sense of a badly encoded message?

---------------------------
ƒGƒ‰[
---------------------------
ƒfƒBƒXƒvƒŒƒCƒ‚[ƒh‚ªÝ’è‚Å‚«‚Ü‚¹‚ñ.
---------------------------
OK
---------------------------
I get this clear error message out of Shooter's Solitude system 4, after I feed it this version of d3drm.dll (sigh.)
Here's an hexdump for your convenience:
00000000 c6 92 66 c6 92 42 c6 92 58 c6 92 76 c6 92 c5 92 |..f..B..X..v....|
00000010 c6 92 43 c6 92 e2 80 9a c2 81 5b c6 92 68 e2 80 |..C.......[..h..|
00000020 9a c2 aa c2 90 c3 9d e2 80 99 c3 a8 e2 80 9a c3 |................|
00000030 85 e2 80 9a c2 ab e2 80 9a c3 9c e2 80 9a c2 b9 |................|
00000040 e2 80 9a c3 b1 2e 0a |.......|
00000047
How would you turn this into a coherent error message -- that is, how would you go about to find the correct encoding/deconding couple for this error message?
Here's what I tried.
I guess the issue is the developer used the wrong encoding settings for this message (given the age of the game, developed for WinXP, this is unsurprising). By looking at it, one'd guess the message was encoded in some sort of multibyte encoding ( ƒf ƒB ƒX ƒv ƒŒ.)
However, each group seems to be made by three bytes (variable?). This rules out the usual suspects:
>>> wat = "ƒfƒBƒXƒvƒŒƒCƒ‚[ƒh‚ªÝ’è‚Å‚«‚Ü‚¹‚ñ. "
>>> wat.encode("UTF-8").decode("UTF-32")
UnicodeDecodeError: 'utf32' codec cannot decode bytes in position 0-3:
codepoint not in range(0x110000)
>>> wat.encode("UTF-8").decode("UTF-16")
UnicodeDecodeError: 'utf16' codec cannot decode bytes in position 70-70:
truncated data
>>> wat.encode("UTF-8")[:-1].decode("UTF-16")
'鋆왦䊒鋆왘皒鋆鋅鋆왃\ue292骀臂왛梒胢슚슪쎐\ue29d馀ꣃ胢쎚\ue285骀ꯂ胢쎚\ue29c骀맂胢쎚⺱'
#meaningless according to Google Translate.
I chose UTF-8 as the starting encoding because ASCII didn't work (UnicodeEncodeError: 'ascii' codec can't encode character '\u0192' in position 0: ordinal not in range(128)) and UTF-8 should be the default encoding for Windows 7 anyway (the OS I tried to use.)
Not quite there.
Kabie may be on something but that's not the full story. First off, I can't reproduce his encoding:
>>> print (wat.encode("UTF-8").decode("Shift-JIS"))
UnicodeDecodeError: 'shift_jis' codec cannot decode bytes in position 22-23: illegal multibyte sequence
>>> print (wat.encode("UTF-8")[:22].decode("Shift-JIS"))
ﾆ断ﾆ達ﾆ湛ﾆ致ﾆ椎槌辰ﾆ停
Wikipedia says there's a very similar encoding out there: cp932.
>>> print(wat.encode("UTF-8").decode("932"))
UnicodeDecodeError: 'cp932' codec cannot decode bytes in position 44-45: illegal multibyte sequence
>>> print(wat.encode("UTF-8")[:44].decode("932"))
ﾆ断ﾆ達ﾆ湛ﾆ致ﾆ椎槌辰ﾆ停喙ﾆ檀窶堋ｪﾃ昶凖ｨ窶堙
Again, very different from what he pasted. Let's see it, however:
>>> print("ディスプレイモ\x81[ドが\x90ﾝ定できません.\n")
ディスプレイモ[ドがﾝ定できません.
This is garbage for Google Translate, however. I then tried to remove some bits and pieces. Given that ディスプレイ means "display", if I removed "garbage" around the bits that can't be decoded I get:
ディスプレイモ\x81[ドが\x90ﾝ定できません.
→ ディスプレイ ドが ﾝ定できません.
→ The display mode is not specified.
However, since I asked on SO, this is not the full story. What is with those bytes that couldn't be decoded? How would you get these bytes to begin with.

=== file disupure.py ===
# start with the OP's hex dump:
hexbytes = """
c6 92 66 c6 92 42 c6 92 58 c6 92 76 c6 92 c5 92
c6 92 43 c6 92 e2 80 9a c2 81 5b c6 92 68 e2 80
9a c2 aa c2 90 c3 9d e2 80 99 c3 a8 e2 80 9a c3
85 e2 80 9a c2 ab e2 80 9a c3 9c e2 80 9a c2 b9
e2 80 9a c3 b1 2e 0a
"""
strg = ''.join(
chr(int(hexbyte, 16))
for hexbyte in hexbytes.split()
)
uc = strg.decode('utf8') # decodes OK but result is gibberish
uc_hex = ' '.join("%04X" % ord(x) for x in uc)
print uc_hex
# but it's stuffed ... U+0192??? oh yeah, 0x83
badenc = 'cp1252' # sort of, things like 0x81 have to be allowed for
fix_bad = {}
for i in xrange(256):
b = chr(i)
try:
fix_bad[ord(b.decode(badenc))] = i
except UnicodeDecodeError:
fix_bad[i] = i
recoded = uc.translate(fix_bad).encode('latin1')
better_uc = recoded.decode('cp932')
# It's on Windows; cp932 what would have been used
# but 'sjis' gives the same answer
better_uc_hex = ' '.join("%04X" % ord(x) for x in better_uc)
print better_uc_hex
print repr(better_uc)
print better_uc
Result of running this in IDLE (blank lines added for clarity):
0192 0066 0192 0042 0192 0058 0192 0076 0192 0152 0192 0043 0192 201A 0081 005B 0192 0068 201A 00AA 0090 00DD 2019 00E8 201A 00C5 201A 00AB 201A 00DC 201A 00B9 201A 00F1 002E 000A
30C7 30A3 30B9 30D7 30EC 30A4 30E2 30FC 30C9 304C 8A2D 5B9A 3067 304D 307E 305B 3093 002E 000A
u'\u30c7\u30a3\u30b9\u30d7\u30ec\u30a4\u30e2\u30fc\u30c9\u304c\u8a2d\u5b9a\u3067\u304d\u307e\u305b\u3093.\n'
ディスプレイモードが設定できません.
Google Translate: You can set the display mode.
Microsoft (Bing) Translate: Display mode is not set.
Update A bit more explanation on why the translation table is needed, and why it maps \x81 etc to U+0081, from the Wikipedia article on cp1252:
According to the information on
Microsoft's and the Unicode
Consortium's websites, positions 81,
8D, 8F, 90, and 9D are unused. However
the Windows API call for converting
from code pages to Unicode maps these
to the corresponding C1 control codes.

Obviously.
Since it is a Japanese game
'ディスプレイモ\x81[ドが\x90ﾝ定できません.\n'
'Disupureimo \ x81 [the de \ x90 applications can not be fixed. \ N'
Because I pasted the string, there are some missing.
The coding named Shift-JIS. I use my Opera to show the characters actually.
EDIT:
Sadly all my browsers can't add comments on SO. I guess it's about the network. So I have to update here.
You probably should set your display mode to 256 colors. That's many Japanese game needed.
EDIT2:
Interesting story.
About how I got the string, which is the most funny thing, is I DIDN'T directly encode the original bytes into it, as you may tried, only got this:
ﾆ断ﾆ達ﾆ湛ﾆ致ﾆ椎槌辰ﾆ停�堋ーﾆ檀窶堋ｪﾂ静昶�凖ｨ窶堙��堋ｫ窶堙懌�堋ｹ窶堙ｱ.
But pasting the string into another web page as source, then using Opera changed the coding to Shift-JIS.
Opera has this feature that let you modify source code of web page and show it. So I wrote a page like:
<!DOCTYPE html>
<head>
<title>test</title>
</head>
<body>
'ƒfƒBƒXƒvƒŒƒCƒ‚ƒh‚ªÝ’è‚Å‚«‚Ü‚¹‚ñ.
</body>
</html>
and that's what I got:
'ディスプレイモドがﾝ定できません.
Which is even more meaningless. And have you tried changing color mode to 256 colors?

Maybe this will help:
from binascii import unhexlify
data = '''\
c6 92 66 c6 92 42 c6 92 58 c6 92 76 c6 92 c5 92
c6 92 43 c6 92 e2 80 9a c2 81 5b c6 92 68 e2 80
9a c2 aa c2 90 c3 9d e2 80 99 c3 a8 e2 80 9a c3
85 e2 80 9a c2 ab e2 80 9a c3 9c e2 80 9a c2 b9
e2 80 9a c3 b1 2e 0a
'''
data = unhexlify(data.replace(' ','').replace('\n',''))
print data.decode('utf8').encode('windows-1252','xmlcharrefreplace').decode('shift-jis')
Output
ディスプレイモ[ドがﾝ定できません.
The hex data you provided was Shift_JIS decoded as windows-1252 and then re-encoded as UTF-8.
Edit
Building on John Machin's answer:
from binascii import unhexlify
import re
data = '''\
c6 92 66 c6 92 42 c6 92 58 c6 92 76 c6 92 c5 92
c6 92 43 c6 92 e2 80 9a c2 81 5b c6 92 68 e2 80
9a c2 aa c2 90 c3 9d e2 80 99 c3 a8 e2 80 9a c3
85 e2 80 9a c2 ab e2 80 9a c3 9c e2 80 9a c2 b9
e2 80 9a c3 b1 2e 0a
'''
data = unhexlify(data.replace(' ','').replace('\n',''))
data = data.decode('utf8').encode('windows-1252','xmlcharrefreplace')
# convert the XML entities that windows-1252 couldn't encode back into bytes
data = re.sub(r'&#(\d+);',lambda x: chr(int(x.group(1))),data)
print data.decode('shift-jis')
Output
ディスプレイモードが設定できません.

Develop Reference

Python is a programming language that lets you work quickly and integrate systems more effectively.

Identify the contents a file through a program in python [duplicate] - python

Related

Difference in result while reading same file with node and python

Speed up python code

Python: convert hex bytestream to “int16"

Writing hex data into a file

How can I make sense of a badly encoded message?

Categories

Resources