I have a Pandas DataFrame, luminance_df, that looks like this:
             barelyvisible  ultralight  light  abitlight  medium  abitdark  dark  evendarker  ultradark  almostblack
orange                  96          92     83         72      61        53    48          40         34           28
gold                    96          89     77         65      56        50    44          37         31           26
yellow                  95          88     77         64      53        47    40          33         29           26
chartreuse              95          89     80         67      55        44    35          27         23           20
green                   97          93     85         73      58        45    36          29         24           20
forest                  96          90     80         67      52        39    30          24         20           16
aqua                    97          89     78         64      50        40    32          26         22           19
teal                    96          90     82         69      53        43    36          31         27           24
lightblue               97          94     86         74      60        48    39          32         27           24
blue                    97          93     87         78      68        60    53          48         40           33
indigo                  97          94     89         82      74        67    59          51         41           34
purple                  98          95     92         85      76        66    58          50         42           35
royalpurple             98          95     92         85      75        65    56          47         39           32
magenta                 98          95     91         83      73        61    49          40         33           28
pink                    97          95     90         82      70        60    51          42         35           30
dustypink               97          95     90         82      71        60    50          41         35           30
red                     97          94     89         82      71        60    51          42         35           31
So far, I'm building a single multi-chart HTML file like this:
with open(os.path.join(cwd, 'testout.html'), 'w') as outfile:
    outfile.write("<p> </p><hr/><p> </p>".join([
        '<h1>Colors</h1>' + hex_styler.to_html(),
        '<h1>Hue</h1>' + hue_styler.to_html(),
        '<h1>Saturation</h1>' + saturation_styler.to_html(),
        '<h1>Luminance</h1>' + luminance_styler.to_html(),
        '<h1>Perceived Brightness</h1>' + perceived_brightness_pivot_styler.to_html(),
        '<h1>Base Data</h1>' + basic_df.to_html()
    ]))
I'd like to display an elevation/contour-style map of the Luminance right after luminance_styler.to_html(), a lot like this one that I produced in Excel:
I'd like the colors to stay sorted "top to bottom" as values on a y-axis and the darknesses to stay sorted "left to right" as values on an x-axis, just like in the example above.
Question
I'm not a data scientist, nor do I use Python terribly regularly. I'm proud of myself for having made luminance_df in the first place, but I am not, for the life of me, figuring out how to make Python simply ... treat numeric cell values in a DataFrame whose labels in both directions are strings ... as a z-axis and make a contour-chart of it.
Everything I Google leads to really complicated data science nuanced questions.
Could someone get me on the right track by giving me the basic "hello world" code to get at least as far with luminance_df's data in Python as I got with the "insert chart" button in Excel?
If you can get me to the point where I've got an img = BytesIO() that's image_base64 = base64.b64encode(img.read()).decode("utf-8")-able, I can f'<img src="data:image/png;base64, {image_base64}" />' it myself into the string concatenation that makes testout.html.
I'm on Windows and have myself set up to be able to pip install.
Notes
To be fair, I find these contour charts much more attractive and much easier to read than the one Excel made, but I'm fine with something sort of "brutish"-looking like the Excel version, as long as it makes "rising" & "falling" obvious and as long as it uses a ROYGBIV rainbow to indicate "less" vs. "more" (pet peeve of mine about the default Excel colors -- yes, I know, it's probably an accessibility thing):
While I'd like my chart's colors to follow a "rainbow" of sorts (because personally I find them easy to read), any "rainbow shading" on the chart should completely ignore the fact that the labels of the y-axis happen to describe colors. No correlation whatsoever. I'm simply plotting number facts between 16 and 98; colors of the chart should just indicate the change in "elevation" between those two extremes.
Effort so far
The only other "simple" question I've found so far that seems similar is Convert pandas DataFrame to a 3d graph using Index and Columns as X,Y and values as Z?, but that code didn't work for me at all, so I don't know what it outputs visually, and I have no idea whether it's even relevant:
from mpl_toolkits.mplot3d import Axes3D
import matplotlib.pyplot as plt

lumX = luminance_df.columns
lumY = luminance_df.index
lumZ = luminance_df.values

fig = plt.figure()
ax = plt.axes(projection='3d')
ax.contour3D(lumX, lumY, lumZ)
My script errored out with ValueError: could not convert string to float: 'orange', and I don't know enough about what I'm doing to accommodate the fact that this answer seems to have been written around a presumption of numeric x- and y-axis keys. (Also, it might not generate the type of chart I'm hoping for -- as I said, I can't tell, because it doesn't even execute and there's no visual sample in the answer.)
Dataset
Ready for pandas.DataFrame():
{"barelyvisible":{"orange":96,"gold":96,"yellow":95,"chartreuse":95,"green":97,"forest":96,"aqua":97,"teal":96,"lightblue":97,"blue":97,"indigo":97,"purple":98,"royalpurple":98,"magenta":98,"pink":97,"dustypink":97,"red":97},"ultralight":{"orange":92,"gold":89,"yellow":88,"chartreuse":89,"green":93,"forest":90,"aqua":89,"teal":90,"lightblue":94,"blue":93,"indigo":94,"purple":95,"royalpurple":95,"magenta":95,"pink":95,"dustypink":95,"red":94},"light":{"orange":83,"gold":77,"yellow":77,"chartreuse":80,"green":85,"forest":80,"aqua":78,"teal":82,"lightblue":86,"blue":87,"indigo":89,"purple":92,"royalpurple":92,"magenta":91,"pink":90,"dustypink":90,"red":89},"abitlight":{"orange":72,"gold":65,"yellow":64,"chartreuse":67,"green":73,"forest":67,"aqua":64,"teal":69,"lightblue":74,"blue":78,"indigo":82,"purple":85,"royalpurple":85,"magenta":83,"pink":82,"dustypink":82,"red":82},"medium":{"orange":61,"gold":56,"yellow":53,"chartreuse":55,"green":58,"forest":52,"aqua":50,"teal":53,"lightblue":60,"blue":68,"indigo":74,"purple":76,"royalpurple":75,"magenta":73,"pink":70,"dustypink":71,"red":71},"abitdark":{"orange":53,"gold":50,"yellow":47,"chartreuse":44,"green":45,"forest":39,"aqua":40,"teal":43,"lightblue":48,"blue":60,"indigo":67,"purple":66,"royalpurple":65,"magenta":61,"pink":60,"dustypink":60,"red":60},"dark":{"orange":48,"gold":44,"yellow":40,"chartreuse":35,"green":36,"forest":30,"aqua":32,"teal":36,"lightblue":39,"blue":53,"indigo":59,"purple":58,"royalpurple":56,"magenta":49,"pink":51,"dustypink":50,"red":51},"evendarker":{"orange":40,"gold":37,"yellow":33,"chartreuse":27,"green":29,"forest":24,"aqua":26,"teal":31,"lightblue":32,"blue":48,"indigo":51,"purple":50,"royalpurple":47,"magenta":40,"pink":42,"dustypink":41,"red":42},"ultradark":{"orange":34,"gold":31,"yellow":29,"chartreuse":23,"green":24,"forest":20,"aqua":22,"teal":27,"lightblue":27,"blue":40,"indigo":41,"purple":42,"royalpurple":39,"magenta":33,"pink":35,"dustypink":35,"red":35},"almostblack":{"orange":28,"gold":26,"yellow":26,"chartreuse":20,"green":20,"forest":16,"aqua":19,"teal":24,"lightblue":24,"blue":33,"indigo":34,"purple":35,"royalpurple":32,"magenta":28,"pink":30,"dustypink":30,"red":31}}
I believe you only need a contourf:
import matplotlib.pyplot as plt

plt.contourf(df, cmap='RdYlBu')
plt.xticks(range(df.shape[1]), df.columns, rotation=90)
plt.yticks(range(df.shape[0]), df.index)
plt.show()
Output:
Or a heatmap:
import seaborn as sns
sns.heatmap(df, cmap='RdYlBu')
Output:
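If you'd rather embed either chart in your HTML than call plt.show(), here is a minimal sketch of the BytesIO/base64 wiring you described (same contourf chart, just saved to an in-memory PNG; variable names are only illustrative):
import base64
from io import BytesIO

import matplotlib.pyplot as plt

# Draw the contour chart onto an explicit figure so it can be saved.
fig, ax = plt.subplots()
contour_set = ax.contourf(df, cmap='RdYlBu')
ax.set_xticks(range(df.shape[1]))
ax.set_xticklabels(df.columns, rotation=90)
ax.set_yticks(range(df.shape[0]))
ax.set_yticklabels(df.index)
fig.colorbar(contour_set)

# Save to an in-memory PNG and base64-encode it for the <img> tag.
img = BytesIO()
fig.savefig(img, format='png', bbox_inches='tight')
img.seek(0)
image_base64 = base64.b64encode(img.read()).decode("utf-8")
img_tag = f'<img src="data:image/png;base64, {image_base64}" />'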
I have raw data extracted from a PDF. I decompressed the raw data and compressed it again.
I expected the same header and trailer, but the header changed.
Original Hex Header
48 89 EC 57 ....
Converted Hex Header
78 9C EC BD ...
I dug into zlib compression and found that 48 is also a valid first byte for a zlib header,
but 78 is the byte most commonly used for zlib compression.
Here is my code, which decompresses and recompresses:
import zlib

decompress_wbit = 12
compress_variable = 6

output_data = zlib.decompress(open(raw_data, "rb").read(), decompress_wbit)
output_data = zlib.compress(output_data, compress_variable)

output_file = open(raw_data + '_', "wb")
output_file.write(output_data)
output_file.close()
I changed decompress_wbit and compress_variable, but the output still starts with 78, so I'm not sure how to get 48 as the header.
Here is a short description of the zlib header fields:
CINFO (bits 12-15)
Indicates the window size as a power of two, from 0 (256 bytes) to 7 (32768 bytes). This will usually be 7. Higher values are not allowed.
CM (bits 8-11)
The compression method. Only Deflate (8) is allowed.
FLEVEL (bits 6-7)
Roughly indicates the compression level, from 0 (fast/low) to 3 (slow/high)
FDICT (bit 5)
Indicates whether a preset dictionary is used. This is usually 0. 1 is technically allowed, but I don't know of any Deflate formats that define preset dictionaries.
FCHECK (bits 0-4)
A checksum (5 bits, 0..31), whose value is calculated such that the entire value divides 31 with no remainder.
Typically, only the CINFO and FLEVEL fields can be freely changed, and FCHECK must be calculated last, from the final value of the other fields. Assuming no preset dictionary, there is no choice in what the other fields contain, so a total of 32 possible headers are valid. Here they are:
CINFO  FLEVEL=0  FLEVEL=1  FLEVEL=2  FLEVEL=3
  0     08 1D     08 5B     08 99     08 D7
  1     18 19     18 57     18 95     18 D3
  2     28 15     28 53     28 91     28 CF
  3     38 11     38 4F     38 8D     38 CB
  4     48 0D     48 4B     48 89     48 C7
  5     58 09     58 47     58 85     58 C3
  6     68 05     68 43     68 81     68 DE
  7     78 01     78 5E     78 9C     78 DA
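A quick sanity check on any of these: the two header bytes, read as one big-endian 16-bit value, must divide by 31 with no remainder, e.g.:
# 48 89 (the original) and 78 9C / 78 DA (the most common headers)
# are all multiples of 31, per the FCHECK rule above.
for header in (0x4889, 0x789C, 0x78DA):
    assert header % 31 == 0, hex(header)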
Please let me know how to preserve the zlib header across decompression & compression.
Thanks for your time.
I will first note that it doesn't matter. The data will be decompressed fine with that zlib header. Why do you care?
You are giving zlib.compress a small amount of data that permits a smaller window. Since it is permitted, the Python library is electing to compress with a smaller window.
A way to avoid that would be to use zlib.compressobj instead. Upon initialization, it doesn't know how much data you will be feeding it, so it defaults to the largest window size.
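If you specifically want to reproduce the original 48 89 header (CINFO=4, i.e. a 4K window, at FLEVEL 2 per the table above), here is a sketch of forcing the window size explicitly on the decompressed data from your script; wbits=12 is my assumption, derived from CINFO = wbits - 8 = 4:
import zlib

# Request a 2**12 = 4096-byte window (CINFO = 12 - 8 = 4) up front.
compressor = zlib.compressobj(6, zlib.DEFLATED, 12)
recompressed = compressor.compress(output_data) + compressor.flush()

print(recompressed[:2].hex())  # should print '4889' at level 6 (FLEVEL 2)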
I'm having a weird problem regarding Unicode handling with SQLAlchemy.
In short, when I insert a Python unicode string into a Unicode column
of my MySQL database, I have no trouble getting it back out. On the database
side, however, it gets stored as a weird 4-byte sequence (and no, this
doesn't seem to have anything to do with the 'utf8mb4' default on
MySQL).
My problem is that I have a MySQL dump from another machine that
contains straight UTF8 characters in the SQL. When I try to retrieve
data imported from that other machine I get UnicodeDecodeErrors all the
time.
Below I've included a minimal example that illustrates the problem.
utf8test.sql: Set up a database and create one row with a Unicode
character in it
utf8test.py: Open the DB using SQLAlchemy, insert one row with
Python's idea of a UTF-8 character, and retrieve both rows.
It turns out that Python can retrieve the data it inserted itself just fine,
but it balks at the literal 'ü' I put into the SQL import script.
Investigation of the hexdumps of both a mysqldumped dataset
and the binary data files of MySQL itself shows that the UTF-8 character
inserted via SQL is the real deal (German umlaut 'ü' = UTF-8 'c3 bc'),
whereas the Python-inserted 'ä' gets converted to the sequence
'c3 83 c2 a4', which I don't understand (see hexdump below;
I've used 'xxx' and 'yyy' as markers to facilitate finding them
in the hexdump).
Can anybody shed any light on this?
This creates the test DB:
dh#jenna:~/python$ cat utf8test.sql
DROP DATABASE IF EXISTS utftest;
CREATE DATABASE utftest;
USE utftest;
CREATE TABLE x (
    id INTEGER PRIMARY KEY AUTO_INCREMENT,
    text VARCHAR(10)
);
INSERT INTO x(text) VALUES ('xxxü');
COMMIT;
dh#jenna:~/python$ mysql < utf8test.sql
Here's the Python script:
dh#jenna:~/python$ cat utf8test.py
# -*- encoding: utf8 -*-
from sqlalchemy import create_engine, Column, Unicode, Integer
from sqlalchemy.orm import sessionmaker
from sqlalchemy.ext.declarative import declarative_base

Base = declarative_base()

class X(Base):
    __tablename__ = 'x'
    id = Column(Integer, primary_key=True)
    text = Column(Unicode(10))

engine = create_engine('mysql://localhost/utftest', encoding='utf8')
Base.metadata.create_all(engine)

Session = sessionmaker(engine)
db = Session()
x = X(text=u'yyyä')
db.add(x)
db.commit()

rs = db.query(X.text).all()
for r in rs:
    print(r.text)
db.close()
This happens when I run the script (it runs without error when I
omit the INSERT INTO bit in utf8test.sql):
dh#jenna:~/python$ python utf8test.py
Traceback (most recent call last):
File "utf8test.py", line 23, in <module>
rs = db.query(X.text).all()
[...]
UnicodeDecodeError: 'utf8' codec can't decode
byte 0xfc in position 3: invalid start byte
Here's a hexdump to confirm that the two umlauts are indeed stored
differently in the DB. Using hd I've also confirmed that both the
Python and the SQL scripts are themselves UTF-8 encoded.
dh#jenna:~/python$ mysqldump utftest | hd
00000000 2d 2d 20 4d 79 53 51 4c 20 64 75 6d 70 20 31 30 |-- MySQL dump 10|
00000010 2e 31 36 20 20 44 69 73 74 72 69 62 20 31 30 2e |.16 Distrib 10.|
00000020 31 2e 33 37 2d 4d 61 72 69 61 44 42 2c 20 66 6f |1.37-MariaDB, fo|
00000030 72 20 64 65 62 69 61 6e 2d 6c 69 6e 75 78 2d 67 |r debian-linux-g|
00000040 6e 75 20 28 69 36 38 36 29 0a 2d 2d 0a 2d 2d 20 |nu (i686).--.-- |
[...]
00000520 4c 45 20 4b 45 59 53 20 2a 2f 3b 0a 49 4e 53 45 |LE KEYS */;.INSE|
00000530 52 54 20 49 4e 54 4f 20 60 78 60 20 56 41 4c 55 |RT INTO `x` VALU|
00000540 45 53 20 28 31 2c 27 78 78 78 c3 bc 27 29 2c 28 |ES (1,'xxx..'),(|
00000550 32 2c 27 79 79 79 c3 83 c2 a4 27 29 3b 0a 2f 2a |2,'yyy....');./*|
c3 83 c2 a4 is the "double encoding" of ä, as Ilja points out. It is discussed further at
http://mysql.rjweb.org/doc.php/charcoll#fixes_for_various_cases, which provides an UPDATE to fix the data.
Here is a checklist of things that may need to be fixed in your Python: http://mysql.rjweb.org/doc.php/charcoll#python
But this is scary: I see c3 bc (Mojibake for ü) and c3 83 c2 a4 (double encoding of ä). This implies that you have two different problems happening in the same code. Back up to ground zero and make sure you are using utf8 (or utf8mb4) at all stages of things. Your database may be too messed up to recover from, so consider starting over.
Possibly the only issue is the absence of # -*- encoding: utf8 -*- from one of the Python scripts. But no: you do need that, yet the double encoding occurred when you used it.
Bottom line: You have multiple errors.
Adding ?use_utf8=0 to the DB URL solves the problem. Found that in the SQLAlchemy docs.
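For reference, applied to the script above that looks something like this (the parameter name is copied verbatim from what worked for me; check your driver's documentation for the exact spelling it expects):
engine = create_engine('mysql://localhost/utftest?use_utf8=0', encoding='utf8')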
This question already has answers here:
Tools to help reverse engineer binary file formats
(9 answers)
Closed 6 years ago.
I have a file here. To me it appears to be a binary file. This is a raw file, and I believe it has stock information in OHLCV form (Open, High, Low, Close, Volume). Besides that, it may also contain some text.
One of the entries that I could possibly have for OHLCV is
464.95, 468.3, 460, 465.65, 3957854
This is the code that I have tried. I don't fully understand ASCII and Unicode.
input_file = "00063181.dat" # tata motors
with open(input_file, "rb") as fh:
buf = fh.read()
output_l = list(map(int , buf))
print (output_l)
My question: How do I decode this file and make sense of it? Is there any way for me to read this file through a program written in Python and separate the text from the ints/floats? I am using Python 3 and Windows 10 64-bit.
You're looking to reverse engineer the structure of a binary file using Python. Since you've declared that the file is binary, this may prove difficult. You're going to need to examine the contents of the file and use your best intuition to try to infer the structure. The first thing you're going to want is a way to display each byte of the file in a form that will help you understand its meaning.
Fortunately, someone has already written a tool to do this, hexdump. Install that package using pip.
The function you need from that package is hexdump, so let's import the package and get help on the function.
>>> import hexdump
>>> help(hexdump.hexdump)
Help on function hexdump in module hexdump:
hexdump(data, result='print')
Transform binary data to the hex dump text format:
00000000: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
[x] data argument as a binary string
[x] data argument as a file like object
Returns result depending on the `result` argument:
'print' - prints line by line
'return' - returns single string
'generator' - returns generator that produces lines
Now you can start to explore the contents of your file. Use the slice operator to do it in chunks. For example, to render the contents of the first 1KB of your file:
>>> hexdump.hexdump(buf[:1024])
00000000: C3 8E C2 8F 22 13 C2 AA 66 2A 22 47 C3 94 C3 AA ...."...f*"G....
00000010: C3 89 C3 A0 C3 B1 C3 91 6A C2 A4 C3 BF 3C C2 AA ........j....<..
00000020: C2 91 73 C3 85 46 57 47 C2 88 C3 99 C2 B6 3E 2D ..s..FWG......>-
00000030: C3 BA 69 10 C2 93 C3 94 38 C3 81 7A 6A 43 30 7C ..i.....8..zjC0|
00000040: C3 BB C2 AA 01 2D C2 97 C3 83 C3 88 64 14 C3 9C .....-......d...
00000050: C2 AB C2 AA C3 A2 74 C2 85 5D C3 97 4E 64 68 C3 ......t..]..Ndh.
...
000003C0: 42 C2 8F 06 7F 12 33 7F 79 1E 2C 2A 0F C3 92 36 B.....3.y.,*...6
000003D0: C3 A6 C2 96 C2 93 C2 8B 43 C2 9F 4C C2 95 48 24 ........C..L..H$
000003E0: C2 B3 C2 82 26 C3 88 C3 BD C3 96 12 1E 5E 18 2E ....&........^..
000003F0: 37 C3 A7 C2 87 C3 AE 00 4F 3F C2 9C C3 A8 1C C2 7.......O?......
Hexdump has a nice property of rendering the byte position, the hex code, and then (if possible) the printable form of the character on the right.
Hopefully some of your text values will be visible there and that will give some clue as to how to reverse engineer your file.
Once you've started to determine how your file is structured, you can use the various string operators to manipulate your data. For example, if you find that your file is split into sections by the null byte (b'\x00'), you can get those sections thus:
>>> sections = buf.split(b'\x00')
There are a lot of things that you're likely to have to learn as you dig deeper, like character encodings, number encodings (including little-endian for integers and floating-point encoding for floating point numbers). You'll want to find some way to externally validate your results.
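For example, if it turned out that each record were four little-endian 32-bit floats followed by a 32-bit unsigned volume (purely a hypothetical layout for illustration; yours will almost certainly differ), the struct module would decode it:
import struct

# Hypothetical layout: open/high/low/close as little-endian float32, volume as uint32.
record = struct.pack("<ffffI", 464.95, 468.3, 460.0, 465.65, 3957854)
o, h, l, c, v = struct.unpack("<ffffI", record)
print(o, h, l, c, v)  # expect slight float32 rounding on the prices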
Best of luck.
I am puzzled. I have a simple Python script that logs the status of an internet connection to a binary file. Every few seconds it pings a host on the public internet, then logs the timestamp (as an unsigned int) and the online status (as a bool).
I am using struct.pack to pack the binary data. Here is the function that adds an entry to the log:
def add(path, timestamp, online):
    handle = open(path, "ab")
    handle.write(struct.pack("<I?", timestamp, online))
    handle.close()
And here is the call to that function:
storage.add(settings["log_path"], time.time(), online)
Now, it worked fine for a long time (a dozen hours or so), and it logged everything properly. However, at some point I noticed that all the timestamps were nonsense. I realized that they were just out of alignment, so I traced back what happened at the point where they went out of alignment.
Here is the binary data around the point where they get out of alignment:
DB 0C F2 55 01
E5 0C F2 55 01
EF 0C F2 55 01
F9 0C F2 55 01
03 0D F2 55 01
0D F2 55 01 <----- What??
17 0D F2 55 01
21 0D F2 55 01
2B 0D F2 55 01
35 0D F2 55 01
3F 0D F2 55 01
49 0D F2 55 01
53 0D F2 55 01
5D 0D F2 55 01
67 0D F2 55 01
71 0D F2 55 01
7B 0D F2 55 01
85 0D F2 55 01
I just grouped the bytes in groups of five, with the boolean (am-I-online) at the end. The first four bytes of each group represent a meaningful timestamp (sometime last night) as a little-endian unsigned int.
However, at the point marked with the arrow, the group is just 4 bytes long..! It holds no meaningful timestamp, and everything after it gets out of alignment. In fact, it looks like the first byte of that timestamp went missing. As you can see, I am running and logging that ping test every 10 seconds, so around that point the leading bytes should read F9, 03, then 0D (which is missing...!), then 17, 21, 2B and so on. The short line should have been something like 0D 0D F2 55 01.
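For reference, each complete 5-byte group decodes cleanly with the same format string used to write it, e.g. for the first row of the dump above:
import struct

entry = bytes.fromhex("DB0CF25501")
timestamp, online = struct.unpack("<I?", entry)
print(timestamp, online)  # 1441926363 True (epoch seconds, September 2015)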
I am puzzled. How can this happen? I thought that maybe the program had been killed while writing to disk, but it actually ran the whole night without any pause. Is it possible that struct.pack drops a byte at some point...?
This is really odd. Any idea would be greatly appreciated.