How to keep the header and trailer while zlib decompress and compress - python

I have raw data extracted from PDF and I decompressed the raw data and compressed it again.
I expected the same header and trailer, but the header was changed.
Original Hex Header
48 89 EC 57 ....
Converted Hex Header
78 9C EC BD ...
I dug into zlib compression and found that 48 is also a valid first byte for a zlib header,
but 78 is the one that is mostly used.
This is my code, which decompresses and then recompresses the data:
import zlib

decompress_wbit = 12
compress_variable = 6
# raw_data is the path to the stream extracted from the PDF
with open(raw_data, "rb") as input_file:
    output_data = zlib.decompress(input_file.read(), decompress_wbit)
output_data = zlib.compress(output_data, compress_variable)
with open(raw_data + '_', "wb") as output_file:
    output_file.write(output_data)
I changed decompress_wbit and compress_variable, but the output header still starts with 78.
So I am not sure how to get 48 as the header.
Here is the short description about zlib.header.
CINFO (bits 12-15)
Indicates the window size as a power of two, from 0 (256 bytes) to 7 (32768 bytes). This will usually be 7. Higher values are not allowed.
CM (bits 8-11)
The compression method. Only Deflate (8) is allowed.
FLEVEL (bits 6-7)
Roughly indicates the compression level, from 0 (fast/low) to 3 (slow/high)
FDICT (bit 5)
Indicates whether a preset dictionary is used. This is usually 0. 1 is technically allowed, but I don't know of any Deflate formats that define preset dictionaries.
FCHECK (bits 0-4)
A checksum (5 bits, 0..31), whose value is calculated such that the entire 16-bit value is divisible by 31 with no remainder.
Typically, only the CINFO and FLEVEL fields can be freely changed, and FCHECK must be calculated based on the final value. Assuming no preset dictionary, there is no choice in what the other fields contain, so a total of 32 possible headers are valid. Here they are:
CINFO   FLEVEL 0   FLEVEL 1   FLEVEL 2   FLEVEL 3
0       08 1D      08 5B      08 99      08 D7
1       18 19      18 57      18 95      18 D3
2       28 15      28 53      28 91      28 CF
3       38 11      38 4F      38 8D      38 CB
4       48 0D      48 4B      48 89      48 C7
5       58 09      58 47      58 85      58 C3
6       68 05      68 43      68 81      68 DE
7       78 01      78 5E      78 9C      78 DA
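The table can be regenerated from the field layout quoted above. The following is only a sketch of that arithmetic. The function name zlib_header is made up for illustration and is not part of the zlib module.

```python
def zlib_header(cinfo: int, flevel: int, fdict: int = 0) -> bytes:
    """Build the 2-byte zlib header from its fields (CM is fixed at 8, Deflate)."""
    if not 0 <= cinfo <= 7 or not 0 <= flevel <= 3 or fdict not in (0, 1):
        raise ValueError("field out of range")
    # Pack the fields into one big-endian 16-bit value, FCHECK still zero.
    value = (cinfo << 12) | (8 << 8) | (flevel << 6) | (fdict << 5)
    # FCHECK rounds the value up to the next multiple of 31.
    value += (31 - value % 31) % 31
    return value.to_bytes(2, "big")


# Print one row per CINFO value, one column per FLEVEL value.
for cinfo in range(8):
    row = ["{:02X} {:02X}".format(*zlib_header(cinfo, flevel)) for flevel in range(4)]
    print(cinfo, *row, sep="   ")
```

With CINFO 4 and FLEVEL 2 this gives 48 89, the header of the original stream. CINFO 7 and FLEVEL 2 gives 78 9C, the header of the recompressed one.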
Please let me know how to keep the zlib header through decompression and recompression.
Thanks for your time.

I will first note that it doesn't matter. The data will be decompressed fine with that zlib header. Why do you care?
You are giving zlib.compress a small amount of data that permits a smaller window. Since it is permitted, the Python library is electing to compress with a smaller window.
A way to avoid that would be to use zlib.compressobj instead. Upon initialization, it doesn't know how much data you will be feeding it and will default to the largest window size.
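If the goal is to reproduce the original 48 89 bytes rather than the default 78 9C, zlib.compressobj also accepts an explicit wbits argument. The sketch below assumes a 4 KiB window (wbits=12, which gives CINFO 4) and level 6. The function name recompress and the sample payload are placeholders.

```python
import zlib


def recompress(data: bytes, level: int = 6, wbits: int = 12) -> bytes:
    """Compress data as a zlib stream with an explicit window size."""
    compressor = zlib.compressobj(level, zlib.DEFLATED, wbits)
    return compressor.compress(data) + compressor.flush()


payload = b"example stream contents " * 20  # placeholder for the decompressed PDF stream
stream = recompress(payload)
print(stream[:2].hex())  # 4889

# The stream still round-trips with the default decompression settings.
assert zlib.decompress(stream) == payload
```

Only the two header bytes and the Adler-32 trailer are determined by these settings and the data. The compressed bytes in between may still differ from what the original encoder produced.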

Python Crypto AES 128 with PKCS7Padding different outputs from Swift vs Python

The output produced by crypto with following key
key = base64.b64decode('PyxZO31GlgKvWm+3GLySzAAAAAAAAAAAAAAAAAAAAAA=') (16 bytes)
and the
message = "y_device=y_C9DB602E-0EB7-4FF4-831E-8DA8CEE0BBF5"
My IV object looks like this:
iv = base64.b64decode('AAAAAAAAAAAAAAAAAAAAAA==')
Objective C CCCrypt produces the following hash 4Mmg/BPgc2jDrGL+XRA3S1d8vm02LqTaibMewJ+9LLuE3mV92HjMvVs/OneUCLD4
It appears to use AlgorithmAES128 with PKCS7Padding and the key provided above.
I'm trying to implement the same crypto encode functionality to get an output like 4Mmg/BPgc2jDrGL+XRA3S1d8vm02LqTaibMewJ+9LLuE3mV92HjMvVs/OneUCLD4
This is what I've been able to put together so far
import base64
from Crypto.Util.Padding import pad, unpad
from Crypto.Cipher import AES

class MyCrypt():
    def __init__(self, key, iv):
        self.key = key
        self.iv = iv
        self.mode = AES.MODE_CBC

    def encrypt(self, text):
        cryptor = AES.new(self.key, self.mode, self.iv)
        length = 16
        text = pad(text, 16)
        self.ciphertext = cryptor.encrypt(text)
        return self.ciphertext

key = base64.b64decode('PyxZO31GlgKvWm+3GLySzAAAAAAAAAAAAAAAAAAAAAA=')
IV = base64.b64decode('AAAAAAAAAAAAAAAAAAAAAA==')
plainText = 'y_device=y_C9DB602E-0EB7-4FF4-831E-8DA8CEE0BBF5'.encode('utf-8')
crypto = MyCrypt(key, IV)
encrypt_data = crypto.encrypt(plainText)
encoder = base64.b64encode(encrypt_data)
print(encrypt_data, encoder)
This produces the following output Pi3yzpoVhax0Cul1VkYoyYCivZrEliTDBpDbqZ3dD1bwTUycstAF+MLSTIjSMiQj instead of 4Mmg/BPgc2jDrGL+XRA3S1d8vm02LqTaibMewJ+9LLuE3mV92HjMvVs/OneUCLD4
Which isn't my desired output.
Should I not be using MODE_ECB, or am I using the key as intended?
To add more context
I'm naive to Crypto/ Objective C.
I'm currently pentesting an app, which does some hashing behind the scenes.
Using frida I'm tracing these function calls, and I see the following get populated for swift Objc calls.
CCCrypt(operation: 0x0, CCAlgorithm: 0x0, CCOptions: 0x1, keyBytes: 0x1051f8639, keyLength: 0x10, ivBuffer: 0x1051f8649, inBuffer: 0x2814bd890, inLength: 0x58, outBuffer: 0x16f1c5d90, outLength: 0x60, outCountPtr: 0x16f1c5e10)
Where
CCCrypt(operation: 0x0, CCAlgorithm: 0x0, CCOptions: 0x1, keyBytes: 0x1051f8639, keyLength: 0x10, ivBuffer: 0x1051f8649, inBuffer: 0x280e41530, inLength: 0x2f, outBuffer: 0x16f1c56c0, outLength: 0x30, outCountPtr: 0x16f1c5710)
In buffer:
0 1 2 3 4 5 6 7 8 9 A B C D E F 0123456789ABCDEF
280e41530 79 5f 64 65 76 69 63 65 3d 79 5f 43 39 44 42 36 y_device=y_C9DB6
280e41540 30 32 45 2d 30 45 42 37 2d 34 46 46 34 2d 38 33 02E-0EB7-4FF4-83
280e41550 31 45 2d 38 44 41 38 43 45 45 30 42 42 46 35 1E-8DA8CEE0BBF5
Key: 16 47
0 1 2 3 4 5 6 7 8 9 A B C D E F 0123456789ABCDEF
1051f8639 3f 2c 59 3b 7d 46 96 02 af 5a 6f b7 18 bc 92 cc ?,Y;}F...Zo.....
IV: 16
0 1 2 3 4 5 6 7 8 9 A B C D E F 0123456789ABCDEF
1051f8649 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
I use https://opensource.apple.com/source/CommonCrypto/CommonCrypto-36064/CommonCrypto/CommonCryptor.h to reference the type of encryption happening based on pointers i.e for Options argument the following is passed 0x1
key = base64.b64decode('PyxZO31GlgKvWm+3GLySzAAAAAAAAAAAAAAAAAAAAAA=') (16 bytes)
Nope, that's 32 bytes. It's true that only 16 are non-zero, making a really poor key, but if you pass 256 bits, you are doing AES-256, and you'll get a different result than you would from AES-128 using the first 128 bits of that key.
Your title mentions PKCS #7 padding, but it looks like your code is padding with zeros. That will change the results as well.
ECB doesn't use an IV. If you can see that the Swift code is using the IV, you might be able to see what mode it's using too, or you could try CBC as a first guess. ECB is insecure in most cases. Of course, using a fixed IV is also insecure.
Your output is longer than it should be (64 bytes instead of 48). Your attempt to do the padding yourself is probably responsible for this.
From <CommonCryptor.h>, we can decode the parameters used in Swift's call to CCCrypt:
Type         Value  Name                   Comment
CCOperation  0x0    kCCEncrypt             Symmetric encryption.
CCAlgorithm  0x0    kCCAlgorithmAES128     Advanced Encryption Standard, 128-bit block
CCOptions    0x1    kCCOptionPKCS7Padding  Perform PKCS7 padding.
CCOptions    0x2    kCCOptionECBMode       Electronic Code Book Mode. Default is CBC.
CCOptions is a bit field, and kCCOptionECBMode is not set, so the default is used.
So this is AES-128 in CBC mode with PKCS #7 padding.
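Putting those points together, a Python counterpart might look like the sketch below. It assumes PyCryptodome is installed, which is why that import sits inside the function. The names aes128_key, pkcs7_pad and encrypt_cbc are illustrative. Whether the result matches the CCCrypt output has not been verified here, so treat it as a starting point.

```python
import base64

KEY_B64 = "PyxZO31GlgKvWm+3GLySzAAAAAAAAAAAAAAAAAAAAAA="
IV = bytes(16)  # sixteen zero bytes, as in the traced ivBuffer
PLAINTEXT = b"y_device=y_C9DB602E-0EB7-4FF4-831E-8DA8CEE0BBF5"


def aes128_key(key_b64: str) -> bytes:
    """Decode the key and keep only the 16 bytes CCCrypt was given (keyLength 0x10)."""
    return base64.b64decode(key_b64)[:16]


def pkcs7_pad(data: bytes, block_size: int = 16) -> bytes:
    """Append n bytes of value n so the length becomes a multiple of block_size."""
    n = block_size - len(data) % block_size
    return data + bytes([n]) * n


def encrypt_cbc(key: bytes, iv: bytes, plaintext: bytes) -> bytes:
    """AES-CBC with PKCS #7 padding; a 16-byte key selects AES-128."""
    # Third-party dependency (assumption): pip install pycryptodome
    from Crypto.Cipher import AES
    return AES.new(key, AES.MODE_CBC, iv).encrypt(pkcs7_pad(plaintext))
```

Calling base64.b64encode(encrypt_cbc(aes128_key(KEY_B64), IV, PLAINTEXT)) then gives a value to compare against the traced output. The 47-byte input pads to 48 bytes, which agrees with the outLength of 0x30 in the trace.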

How do I make a simple contour chart of a Pandas DataFrame with numeric cell values as Z and labeled rows/columns as X and Y coordinates?

I have a Pandas DataFrame, luminance_df, that looks like this:
             barelyvisible  ultralight  light  abitlight  medium  abitdark  dark  evendarker  ultradark  almostblack
orange       96             92          83     72         61      53        48    40          34         28
gold         96             89          77     65         56      50        44    37          31         26
yellow       95             88          77     64         53      47        40    33          29         26
chartreuse   95             89          80     67         55      44        35    27          23         20
green        97             93          85     73         58      45        36    29          24         20
forest       96             90          80     67         52      39        30    24          20         16
aqua         97             89          78     64         50      40        32    26          22         19
teal         96             90          82     69         53      43        36    31          27         24
lightblue    97             94          86     74         60      48        39    32          27         24
blue         97             93          87     78         68      60        53    48          40         33
indigo       97             94          89     82         74      67        59    51          41         34
purple       98             95          92     85         76      66        58    50          42         35
royalpurple  98             95          92     85         75      65        56    47          39         32
magenta      98             95          91     83         73      61        49    40          33         28
pink         97             95          90     82         70      60        51    42          35         30
dustypink    97             95          90     82         71      60        50    41          35         30
red          97             94          89     82         71      60        51    42          35         31
So far, I'm building a single multi-chart HTML file like this:
with open(os.path.join(cwd, 'testout.html'), 'w') as outfile:
    outfile.write("<p> </p><hr/><p> </p>".join([
        '<h1>Colors</h1>'+hex_styler.to_html(), '<h1>Hue</h1>'+hue_styler.to_html(),
        '<h1>Saturation</h1>'+saturation_styler.to_html(), '<h1>Luminance</h1>'+luminance_styler.to_html(),
        '<h1>Perceived Brightness</h1>'+perceived_brightness_pivot_styler.to_html(), '<h1>Base Data</h1>'+basic_df.to_html()]))
I'd like to display an elevation/contour style map of the Luminance right after luminance_styler.to_html(), a lot like this one that I produced in Excel:
I'd like the colors to stay sorted "top to bottom" as values on a y-axis and the darknesses to stay sorted "left to right" as values on an x-axis, just like in the example above.
Question
I'm not a data scientist, nor do I use Python terribly regularly. I'm proud of myself for having made luminance_df in the first place, but I am not, for the life of me, figuring out how to make Python simply ... treat numeric cell values in a DataFrame whose labels in both directions are strings ... as a z-axis and make a contour-chart of it.
Everything I Google leads to really complicated data science nuanced questions.
Could someone get me on the right track by giving me the basic "hello world" code to get at least as far with luminance_df's data in Python as I got with the "insert chart" button in Excel?
If you can get me so I've got a img = BytesIO() that's image_base64 = base64.b64encode(img.read()).decode("utf-8")-able, I can f'<img src="data:image/png;base64, {image_base64}" />' it myself into the string concatenation that makes testout.html.
I'm on Windows and have myself set up to be able to pip install.
Notes
To be fair, I find these contour charts much more attractive and much easier to read than the one Excel made, but I'm fine with something sort of "brutish"-looking like the Excel version, as long as it makes "rising" & "falling" obvious and as long as it uses a ROYGBIV rainbow to indicate "less" vs. "more" (pet peeve of mine about the default Excel colors -- yes, I know, it's probably an accessibility thing):
While I'd like my chart's colors to follow a "rainbow" of sorts (because personally I find them easy to read), any "rainbow shading" on the chart should completely ignore the fact that the labels of the y-axis happen to describe colors. No correlation whatsoever. I'm simply plotting number facts between 16 and 98; colors of the chart should just indicate the change in "elevation" between those two extremes.
Effort so far
The only other "simple" question I've found so far that seems similar is Convert pandas DataFrame to a 3d graph using Index and Columns as X,Y and values as Z?, but this code didn't work for me at all, so I don't even know what it outputs, visually, so I have no idea if it's even relevant:
from mpl_toolkits.mplot3d import Axes3D
import matplotlib.pyplot as plt
lumX = luminance_df.columns
lumY = luminance_df.index
lumZ = luminance_df.values
fig = plt.figure()
ax = plt.axes(projection = '3d')
ax.contour3D(lumX,lumY,lumZ)
My script errored out with a message: ValueError: could not convert string to float: 'orange', and I don't know what I'm doing enough to accommodate for the fact that this answer seems to have been written around a presumption of numeric X- and Y-axis keys. (Also, it might not generate the type of chart I'm hoping for -- as I said, can't tell because it doesn't even execute and there's no visual sample in the answer.)
Dataset
Ready for pandas.DataFrame():
{"barelyvisible":{"orange":96,"gold":96,"yellow":95,"chartreuse":95,"green":97,"forest":96,"aqua":97,"teal":96,"lightblue":97,"blue":97,"indigo":97,"purple":98,"royalpurple":98,"magenta":98,"pink":97,"dustypink":97,"red":97},"ultralight":{"orange":92,"gold":89,"yellow":88,"chartreuse":89,"green":93,"forest":90,"aqua":89,"teal":90,"lightblue":94,"blue":93,"indigo":94,"purple":95,"royalpurple":95,"magenta":95,"pink":95,"dustypink":95,"red":94},"light":{"orange":83,"gold":77,"yellow":77,"chartreuse":80,"green":85,"forest":80,"aqua":78,"teal":82,"lightblue":86,"blue":87,"indigo":89,"purple":92,"royalpurple":92,"magenta":91,"pink":90,"dustypink":90,"red":89},"abitlight":{"orange":72,"gold":65,"yellow":64,"chartreuse":67,"green":73,"forest":67,"aqua":64,"teal":69,"lightblue":74,"blue":78,"indigo":82,"purple":85,"royalpurple":85,"magenta":83,"pink":82,"dustypink":82,"red":82},"medium":{"orange":61,"gold":56,"yellow":53,"chartreuse":55,"green":58,"forest":52,"aqua":50,"teal":53,"lightblue":60,"blue":68,"indigo":74,"purple":76,"royalpurple":75,"magenta":73,"pink":70,"dustypink":71,"red":71},"abitdark":{"orange":53,"gold":50,"yellow":47,"chartreuse":44,"green":45,"forest":39,"aqua":40,"teal":43,"lightblue":48,"blue":60,"indigo":67,"purple":66,"royalpurple":65,"magenta":61,"pink":60,"dustypink":60,"red":60},"dark":{"orange":48,"gold":44,"yellow":40,"chartreuse":35,"green":36,"forest":30,"aqua":32,"teal":36,"lightblue":39,"blue":53,"indigo":59,"purple":58,"royalpurple":56,"magenta":49,"pink":51,"dustypink":50,"red":51},"evendarker":{"orange":40,"gold":37,"yellow":33,"chartreuse":27,"green":29,"forest":24,"aqua":26,"teal":31,"lightblue":32,"blue":48,"indigo":51,"purple":50,"royalpurple":47,"magenta":40,"pink":42,"dustypink":41,"red":42},"ultradark":{"orange":34,"gold":31,"yellow":29,"chartreuse":23,"green":24,"forest":20,"aqua":22,"teal":27,"lightblue":27,"blue":40,"indigo":41,"purple":42,"royalpurple":39,"magenta":33,"pink":35,"dustypink":35,"red":35},"almostblack":{"orange":28
,"gold":26,"yellow":26,"chartreuse":20,"green":20,"forest":16,"aqua":19,"teal":24,"lightblue":24,"blue":33,"indigo":34,"purple":35,"royalpurple":32,"magenta":28,"pink":30,"dustypink":30,"red":31}}
I believe you only need to do a contourf (here df is your luminance_df):
import matplotlib.pyplot as plt
plt.contourf(df, cmap='RdYlBu')
plt.xticks(range(df.shape[1]), df.columns, rotation=90)
plt.yticks(range(df.shape[0]), df.index)
plt.show()
Output:
Or a heatmap:
import seaborn as sns
sns.heatmap(df, cmap='RdYlBu')
Output:
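To get from the contourf call to the base64 string the question asks for, something along these lines may work. It is a sketch that assumes matplotlib is installed. The function names contour_png and img_tag are invented for illustration, and the "rainbow" colormap is just one choice that roughly follows a ROYGBIV ordering.

```python
import base64
from io import BytesIO


def contour_png(df, cmap="rainbow"):
    """Render a DataFrame as a filled contour plot and return the PNG bytes."""
    # Third-party imports kept local (assumption: matplotlib is installed).
    import matplotlib
    matplotlib.use("Agg")  # off-screen backend, no window is opened
    import matplotlib.pyplot as plt

    fig, ax = plt.subplots(figsize=(8, 6))
    filled = ax.contourf(df.to_numpy(), cmap=cmap)
    ax.set_xticks(range(df.shape[1]))
    ax.set_xticklabels(df.columns, rotation=90)
    ax.set_yticks(range(df.shape[0]))
    ax.set_yticklabels(df.index)
    ax.invert_yaxis()  # keep the first row of the DataFrame at the top
    fig.colorbar(filled, ax=ax)

    buf = BytesIO()
    fig.savefig(buf, format="png", bbox_inches="tight")
    plt.close(fig)
    return buf.getvalue()  # getvalue() avoids having to seek(0) before read()


def img_tag(png_bytes):
    """Wrap PNG bytes in an inline <img> element."""
    image_base64 = base64.b64encode(png_bytes).decode("utf-8")
    return f'<img src="data:image/png;base64, {image_base64}" />'
```

img_tag(contour_png(luminance_df)) can then be concatenated right after luminance_styler.to_html() in the existing join.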

SQLAlchemy Unicode conundrum

I'm having a weird problem regarding Unicode handling with SQLAlchemy.
In short, when I insert a Python unicode string into a Unicode column
of my MySQL database, I have no trouble getting it back out. On the database
side, however, it gets stored as a weird 4-byte sequence (and no, this
doesn't seem to have anything to do with the 'utf8mb4' default on
MySQL)
My problem is that I have a MySQL dump from another machine that
contains straight UTF8 characters in the SQL. When I try to retrieve
data imported from that other machine I get UnicodeDecodeErrors all the
time.
Below I've included a minimal example that illustrates the problem.
utf8test.sql: Set up a database and create one row with a Unicode
character in it
utf8test.py: Open DB using SQLAlchemy, insert 1 row with
Python's idea of a UTF character, and retrieve both rows.
It turns out that Python can retrieve the data it inserted itself fine,
but it balks at the literal 'ü' I put into the SQL import script.
Investigation of the hexdumps of both a mysqldumped dataset
and the binary data files of MySQL itself shows that the UTF character
inserted via SQL is the real deal (German umlaut 'ü' = UTF-8 'c3 bc'),
whereas the Python-inserted 'ä' gets converted to the sequence
'c3 83 c2 a4' which I don't understand (see hexdump down below;
I've used 'xxx' and 'yyy' as markers to facilitate finding them
in the hexdump).
Can anybody shed any light on this?
This creates the test DB:
dh#jenna:~/python$ cat utf8test.sql
DROP DATABASE IF EXISTS utftest;
CREATE DATABASE utftest;
USE utftest;
CREATE TABLE x (
id INTEGER PRIMARY KEY AUTO_INCREMENT,
text VARCHAR(10)
);
INSERT INTO x(text) VALUES ('xxxü');
COMMIT;
dh#jenna:~/python$ mysql < utf8test.sql
Here's the Python script:
dh#jenna:~/python$ cat utf8test.py
# -*- encoding: utf8 -*-
from sqlalchemy import create_engine, Column, Unicode, Integer
from sqlalchemy.orm import sessionmaker
from sqlalchemy.ext.declarative import declarative_base

Base = declarative_base()

class X(Base):
    __tablename__ = 'x'
    id = Column(Integer, primary_key=True)
    text = Column(Unicode(10))

engine = create_engine('mysql://localhost/utftest',
                       encoding='utf8')
Base.metadata.create_all(engine)
Session = sessionmaker(engine)
db = Session()
x = X(text=u'yyyä')
db.add(x)
db.commit()
rs = db.query(X.text).all()
for r in rs:
    print(r.text)
db.close()
This happens when I run the script (runs without error when I
omit the INSERT INTO bit in utf8test.sql):
dh#jenna:~/python$ python utf8test.py
Traceback (most recent call last):
File "utf8test.py", line 23, in <module>
rs = db.query(X.text).all()
[...]
UnicodeDecodeError: 'utf8' codec can't decode
byte 0xfc in position 3: invalid start byte
Here's a hexdump to confirm that the two umlauts are indeed stored
differently in the DB. Using hd I've also confirmed that both the
Python as well as the SQL scripts are indeed UTF.
dh#jenna:~/python$ mysqldump utftest | hd
00000000 2d 2d 20 4d 79 53 51 4c 20 64 75 6d 70 20 31 30 |-- MySQL dump 10|
00000010 2e 31 36 20 20 44 69 73 74 72 69 62 20 31 30 2e |.16 Distrib 10.|
00000020 31 2e 33 37 2d 4d 61 72 69 61 44 42 2c 20 66 6f |1.37-MariaDB, fo|
00000030 72 20 64 65 62 69 61 6e 2d 6c 69 6e 75 78 2d 67 |r debian-linux-g|
00000040 6e 75 20 28 69 36 38 36 29 0a 2d 2d 0a 2d 2d 20 |nu (i686).--.-- |
[...]
00000520 4c 45 20 4b 45 59 53 20 2a 2f 3b 0a 49 4e 53 45 |LE KEYS */;.INSE|
00000530 52 54 20 49 4e 54 4f 20 60 78 60 20 56 41 4c 55 |RT INTO `x` VALU|
00000540 45 53 20 28 31 2c 27 78 78 78 c3 bc 27 29 2c 28 |ES (1,'xxx..'),(|
00000550 32 2c 27 79 79 79 c3 83 c2 a4 27 29 3b 0a 2f 2a |2,'yyy....');./*|
c3 83 c2 a4 is the "double encoding" for ä, as Ilja points out. It is discussed further here
http://mysql.rjweb.org/doc.php/charcoll#fixes_for_various_cases provides an UPDATE to fix the data.
Here is a checklist of things that may need to be fixed in your Python: http://mysql.rjweb.org/doc.php/charcoll#python
But this is scary: I see c3 bc (Mojibake for ü) and c3 83 c2 a4 (double-encoding of ä). This implies that you have two different problems happening in the same code. Back up to ground zero, make sure you are using utf8 (or utf8mb4) at all stages of things. Your database may be too messed up to recover from, so consider starting over.
Possibly the only issue is the absence of # -*- encoding: utf8 -*- from one of the python scripts. But, no. You do need that, yet the double-encoding occurred when you used it.
Bottom line: You have multiple errors.
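The byte sequences in the hexdump and in the traceback can be reproduced in plain Python, which may make the "double encoding" easier to see. This is only an illustration of the byte arithmetic, using Python's latin-1 codec as a stand-in for the connection's character set. It does not touch MySQL at all.

```python
original = "ä"
utf8_bytes = original.encode("utf-8")        # b'\xc3\xa4'
# Misreading those two bytes as latin-1 yields two characters, 'Ã' and '¤' ...
misread = utf8_bytes.decode("latin-1")
# ... and encoding those two characters as UTF-8 again yields four bytes.
double_encoded = misread.encode("utf-8")
print(double_encoded.hex())                  # c383c2a4

# Reversing the two steps recovers the original character.
repaired = double_encoded.decode("utf-8").encode("latin-1").decode("utf-8")
print(repaired == original)                  # True

# The 0xfc in the traceback is 'ü' in latin-1, which is not valid UTF-8 by itself.
print("ü".encode("latin-1"))                 # b'\xfc'
```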
Adding ?use_utf8=0 to the DB URL solves the problem. Found that in the SQLAlchemy docs.

How to unpack ID3 header's size

I am trying to unpack the ID3v2.3 header with Python 2.7. However, I do not fully understand the first 10 bytes of the MP3 format. For example:
49 44 33 03 00 00 | 00 00 21 76 | 54 41 4C 42
.I .D .3 .3 .0 | RawSize | Size
Using Synalyze it! I can see that RawSize is 0x2176 and Size is 4342.
At offset 4352 is where the MPEG data frames begin. I need to know how
54 41 4C 42 gets converted to 4342 because when I tried:
>>> unpack('i', '\x54\x41\x4C\x42')
(1112293716,)
which does not look in any way like 4352!
How should I read them in general?
Firstly, you give 14 bytes there, not 10.
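Secondly, you've botched reading the size completely. The size is the 00 00 21 76 you labelled RawSize (the 54 41 4C 42 after it is ASCII "TALB", the start of the first frame), and it is stored as four 7-bit values, one per byte, rather than as 8-bit values.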
>>> 0x00 << 21 | 0x00 << 14 | 0x21 << 7 | 0x76
4342
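For reading the whole 10-byte header in one go, a sketch along these lines may help. It is written for Python 3 bytes, whereas the question uses Python 2.7. The function names synchsafe_to_int and parse_id3v2_header are invented for illustration.

```python
import struct


def synchsafe_to_int(raw: bytes) -> int:
    """Combine bytes that each carry 7 payload bits (top bit unused)."""
    value = 0
    for byte in raw:
        value = (value << 7) | (byte & 0x7F)
    return value


def parse_id3v2_header(header: bytes):
    """Return (major, revision, flags, tag_size) from the first 10 bytes of a file."""
    magic, major, revision, flags, raw_size = struct.unpack(">3sBBB4s", header[:10])
    if magic != b"ID3":
        raise ValueError("not an ID3v2 tag")
    return major, revision, flags, synchsafe_to_int(raw_size)


# The ten header bytes from the question: 49 44 33 03 00 00 00 00 21 76
header = bytes([0x49, 0x44, 0x33, 0x03, 0x00, 0x00, 0x00, 0x00, 0x21, 0x76])
print(parse_id3v2_header(header))  # (3, 0, 0, 4342)
```

Adding the 10 header bytes to the tag size gives 4352, the offset at which the question says the MPEG frames begin.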

Iterating through the rows in mysql in python

I have a mysql database table consisting of 8 columns as given
ID C1 C2 C3 C4 C5 C6 C7
1 25 33 76 87 56 76 47
2 67 94 90 56 77 32 84
3 53 66 24 93 33 88 99
4 73 34 52 85 67 82 77
5 78 55 52 100 78 68 32
6 67 35 60 93 88 53 66
I need to fetch 3 rows of all the column except the ID column at a time. So far I did this code in python which fetches me the rows with ID values 1,2,3.
import MySQLdb

ex = MySQLdb.connect(host=host, port=port, user=user, passwd=passwd, db=db)
with ex:
    ex_cur = ex.cursor()
    ex_cur.execute("SELECT C1,C2,C3,C4,C5,C6,C7 FROM table LIMIT 0, 3;")
In the second cycle I need to fetch rows with ID values 2,3,4, and the third cycle fetches rows with ID values 3,4,5, which should continue till the end of the table. What query should I use to iterate through the table so as to get the desired set of rows?
I believe there are three ways of doing this: (I'm going to explain at a very high level)
You can create a queue with a size limit of 3 and read in the rows as a stream. Once the queue reaches the max size of 3, do your processing, pop off the first element in your queue, and proceed with the stream. (More efficient)
You would need an iterator and reset your cursor for every set of 3 IDs that you have to do.
Since your table is relatively small (would not suggest this for larger tables), load the whole database into a data structure/into memory. Perhaps make an object for the rows and use an ORM to map rows to objects. Then you would simply have to iterate through each object, or set of 3 objects, and do the necessary processing.
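The first option, the size-limited queue, can be sketched with collections.deque. The generator below accepts any iterable of rows. The assumption is that a database cursor can be iterated after a single SELECT ... ORDER BY ID with no LIMIT. The name sliding_windows and the sample rows, taken from the table in the question, are for illustration only.

```python
from collections import deque


def sliding_windows(rows, size=3):
    """Yield overlapping groups of `size` consecutive rows."""
    window = deque(maxlen=size)  # the oldest row drops out automatically
    for row in rows:
        window.append(row)
        if len(window) == size:
            yield tuple(window)


# Stand-in for the cursor: the C1..C7 values of the six rows in the question.
sample_rows = [
    (25, 33, 76, 87, 56, 76, 47),
    (67, 94, 90, 56, 77, 32, 84),
    (53, 66, 24, 93, 33, 88, 99),
    (73, 34, 52, 85, 67, 82, 77),
    (78, 55, 52, 100, 78, 68, 32),
    (67, 35, 60, 93, 88, 53, 66),
]

for group in sliding_windows(sample_rows):
    print(group)
```

Six rows give four groups: rows 1-3, 2-4, 3-5 and 4-6. Each row is read from the source only once.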
