Haskell and Python don't seem to agree on MurmurHash2 results. Python, Java, and PHP return the same results, but Haskell doesn't. Am I doing something wrong with MurmurHash2 in Haskell?
Here is my code for Haskell Murmurhash2:
import Data.Digest.Murmur32
main = do
  print $ asWord32 $ hash32WithSeed 1 "woohoo"
And here is the code written in Python:
import murmur
if __name__ == "__main__":
    print murmur.string_hash("woohoo", 1)
Python returned 3650852671, while Haskell returned 3966683799.
From a quick inspection of the sources, it looks like the algorithm operates on 32 bits at a time. The Python version gets these by simply grabbing 4 bytes at a time from the input string, while the Haskell version converts each character to a single 32-bit Unicode index.
It's therefore not surprising that they yield different results.
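To make that concrete, here is a small illustration (assuming a little-endian machine, which is how the C/Python implementation typically reads its words) of the 32-bit words the two implementations see for the four characters "wooh":
import struct
struct.unpack('<I', b'wooh')[0]    # 0x686f6f77: four bytes packed into one 32-bit word
[ord(c) for c in "wooh"]           # [119, 111, 111, 104]: one 32-bit word per character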
The murmur-hash package (I am its author) does not promise to compute the same hashes as other languages. If you rely on hashes being compatible with other software that computes hashes, I suggest you create newtype wrappers that compute hashes the way you want them. For text, in particular, you need to at least specify the encoding. In your case you could convert the text to an ASCII string using Data.ByteString.Char8.pack, but that still doesn't give you the same hash, since the ByteString instance is more of a placeholder.
BTW, I'm not actively improving that package because MurmurHash2 has been superseded by MurmurHash3, but I keep accepting patches.
Related
Armin Ronacher, http://lucumr.pocoo.org/2013/7/2/the-updated-guide-to-unicode/
If you for instance pass [the result of os.fsdecode() or equivalent] to a template engine you [sometimes get a UnicodeEncodeError] somewhere else entirely, and because the encoding happens at a much later stage you no longer know why the string was incorrect. If you detect that error when it happens, the issue becomes much easier to debug.
Armin suggests a function:
def remove_surrogate_escaping(s, method='ignore'):
    assert method in ('ignore', 'replace'), 'invalid removal method'
    return s.encode('utf-8', method).decode('utf-8')
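For example, given a string that carries a raw byte 0xff smuggled in by the surrogateescape error handler (a Python 3 session, purely illustrative):
s = b'ab\xffcd'.decode('utf-8', 'surrogateescape')   # 'ab\udcffcd'
remove_surrogate_escaping(s)                          # 'abcd'   (lone surrogate dropped)
remove_surrogate_escaping(s, 'replace')               # 'ab?cd'  (str.encode's 'replace' substitutes '?')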
Nick Coghlan, 2014, [Python-Dev] Cleaning up surrogate escaped strings
The current proposal on the issue tracker is to ... take advantage of
the existing error handlers:
def convert_surrogateescape(data, errors='replace'):
    return data.encode('utf-8', 'surrogateescape').decode('utf-8', errors)
That code is short, but semantically dense - it took a few iterations to
come up with that version. (Added bonus: once you're alerted to the
possibility, it's trivial to write your own version for existing Python 3
versions. The standard name just makes it easier to look up when you come
across it in a piece of code, and provides the option of optimising it
later if it ever seems worth the extra work)
The functions are slightly different. The second was written with knowledge of the first.
Since Python 3.5, the backslashreplace error handler works on decoding as well as encoding. The first approach is not designed to use backslashreplace: e.g. the byte 0xff, which failed to decode, would get printed as the literal text "\udcff". The second approach is designed to solve this; it would print "\xff".
If you do not need backslashreplace, you might prefer the first version, for example if you have the misfortune to be supporting Python < 3.5 (including polyglot 2/3 code, ouch).
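Here is a small sketch of that difference, reusing Nick's function and assuming Python >= 3.5:
def convert_surrogateescape(data, errors='replace'):
    return data.encode('utf-8', 'surrogateescape').decode('utf-8', errors)

s = b'ab\xffcd'.decode('utf-8', 'surrogateescape')     # 'ab\udcffcd'
print(convert_surrogateescape(s))                      # 'ab\ufffdcd' (default errors='replace')
print(convert_surrogateescape(s, 'backslashreplace'))  # the byte comes back as the literal text \xff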
Question
Is there a better idiom for this purpose yet? Or do we still use this drop-in function?
Nick referred to an issue for adding such a function to the codecs module. As of 2019 the function has not been added, and the ticket remains open.
The latest comment says:
msg314682 Nick Coghlan, 2018
A recent discussion on python-ideas also introduced me to the third party library, "ftfy", which offers a wide range of tools for cleaning up improperly decoded data.
That includes a lone surrogate fixer: ftfy.fixes.fix_surrogates(text)
...
I do not find the function in ftfy appealing. The documentation does not say so, but it appears to be designed to handle both surrogateescape and ... be part of a workaround for CESU-8, or something like that?
Replace 16-bit surrogate codepoints with the characters they represent (when properly paired), or with � otherwise.
As part of a programming challenge, I need to read, from stdin, a sequence of space-separated integers (on a single line), and print the sum of those integers to stdout. The sequence in question can contain as many as 10,000,000 integers.
I have two solutions for this: one written in Haskell (foo.hs), and another, equivalent one, written in Python 2 (foo.py). Unfortunately, the (compiled) Haskell program is consistently slower than the Python program, and I'm at a loss to explain the discrepancy in performance between the two programs; see the Benchmark section below. If anything, I would have expected Haskell to have the upper hand...
What am I doing wrong? How can I account for this discrepancy? Is there an easy way of speeding up my Haskell code?
(For information, I'm using a mid-2010 MacBook Pro with 8 GB of RAM, GHC 7.8.4, and Python 2.7.9.)
foo.hs
main = print . sum =<< getIntList
getIntList :: IO [Int]
getIntList = fmap (map read . words) getLine
(compiled with ghc -O2 foo.hs)
foo.py
ns = map(int, raw_input().split())
print sum(ns)
Benchmark
In the following, test.txt consists of a single line of 10 million space-separated integers.
# Haskell
$ time ./foo < test.txt
1679257
real 0m36.704s
user 0m35.932s
sys 0m0.632s
# Python
$ time python foo.py < test.txt
1679257
real 0m7.916s
user 0m7.756s
sys 0m0.151s
read is slow. For bulk parsing, use bytestring or text primitives, or attoparsec.
I did some benchmarking. Your original version ran in 23.9 seconds on my computer. The version below ran in 0.35 seconds:
import qualified Data.ByteString.Char8 as B
import Control.Applicative
import Data.Maybe
import Data.List
import Data.Char
main = print . sum =<< getIntList
getIntList :: IO [Int]
getIntList =
  map (fst . fromJust . B.readInt) . B.words <$> B.readFile "test.txt"
By specializing the parser to your test.txt file, I could get the runtime down to 0.26 sec:
getIntList :: IO [Int]
getIntList =
  unfoldr (B.readInt . B.dropWhile (==' ')) <$> B.readFile "test.txt"
Read is slow
Fast read, from this answer, will bring you down to 5.5 seconds.
import Numeric
fastRead :: String -> Int
fastRead s = case readDec s of [(n, "")] -> n
Strings are Linked Lists
In Haskell the String type is a linked list of characters. Use a packed representation instead (ByteString if you really only want ASCII, but Text is also very fast and supports Unicode). As shown in this answer, the performance should then be neck and neck.
I would venture to guess that a big part of your problem is actually words. When you map read . words, what you're actually doing is this:
1. Scan the input looking for a space, building a list of non-spaces as you go. There are a lot of different kinds of spaces, and checking any character that's not a common type of space additionally involves a foreign call to a C function (slow). I'm planning to fix this sometime, but I haven't gotten around to it yet, and even then you'll still be building and throwing away lists for no good reason, and checking for spaces when you really just want to check for digits.
2. Read through the list of accumulated characters to try to make a number out of them. Produce the number. The accumulated list now becomes garbage.
3. Go back to step 1.
This is a fairly ridiculous way to proceed. I believe you can even do better using something horrible like reads, but it would make more sense to use something like ReadP. You can also try fancier sorts of things like stream-based parsing; I don't know if that will help much or not.
This is a cry for help from all you cryptologists out there.
Scenario: I have a Windows application (likely built with VC++ or VB and subsequently moved to .Net) that saves some passwords in an XML file. Given a password A0123456789abcDEFGH, the resulting "encrypted" value is 04077040940409304092040910409004089040880408704086040850404504044040430407404073040720407104070
Looking at the string, I've figured out that this is just character shifting: '04' delimits actual character values, which are decimal; if I then subtract these values from 142, I get back the original ASCII code. In Jython (2.2), my decryption routine looks like this (EDITED thanks to suggestions in comments):
blocks = [ pwd[i:i+5] for i in range(0, len(pwd), 5) ]
# now a block looks like '04093'
decrypted = [ chr( 142 - int(block[3:].lstrip('0')) ) for block in blocks ]
This is fine for ASCII values (127 in total) and a handful of accented letters, but 8-bit charsets have another 128 characters; limiting accepted values to 142 doesn't make sense from a decimal perspective.
EDIT: I've gone rummaging through our systems and found three non-ASCII chars:
è 03910
Ø 03926
Õ 03929
From these values, it looks like actually subtracting the 4-number block from 4142 (leaving only '0' as separator) gives me the correct character.
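A quick, purely illustrative check of that observation (the last pair is the 'A' block from the sample value above):
for ch, block in [(u'\xe8', '03910'), (u'\xd8', '03926'), (u'\xd5', '03929'), (u'A', '04077')]:
    assert '%05d' % (4142 - ord(ch)) == block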
So my question is:
is anybody familiar with this sort of obfuscation scheme in the Windows world? Could this be the product of a standard library function? I'm not very familiar with Win32 and .Net development, to be honest, so I might be missing something very simple.
If it's not a library function, can you think of a better method to de-obfuscate these values without resorting to the magic 142 number, i.e. a scheme that can actually be applied on non-ASCII characters without special-casing them? I'm crap at bit shifting and all that, so again I might be missing something obvious to the trained eye.
is anybody familiar with this sort of obfuscation scheme in the Windows world?
Once you understand it correctly, it's just a trivial rotation cipher like ROT13.
Why would anyone use this?
Well, in general, this is very common. Let's say you have some data that you need to obfuscate. But the decryption algorithm and key have to be embedded in software that the viewers have. There's no point using something fancy like AES, because someone can always just dig the algorithm and key out of your code instead of cracking AES. An encryption scheme that's even marginally harder to crack than finding the hidden key is just as good as a perfect encryption scheme—that is, good enough to deter casual viewers, and useless against serious attackers. (Often you aren't even really worried about stopping attacks, but about proving after the fact that your attacker must have acted in bad faith for contractual/legal reasons.) So, you use either a simple rotation cipher, or a simple xor cipher—it's fast, it's hard to get wrong and easy to debug, and if worst comes to worst you can even decrypt it manually to recover corrupted data.
As for the particulars:
If you want to handle non-ASCII characters, you pretty much have to use Unicode. If you used some fixed 8-bit charset, or the local system's OEM charset, you wouldn't be able to handle passwords from other machines.
A Python script would almost certainly handle Unicode characters, because in Python you either deal in bytes in a str, or Unicode characters in a unicode. But a Windows C or .NET app would be much more likely to use UTF-16, because Windows native APIs deal in UTF-16-LE code points in a WCHAR * (aka a string of 16-bit words).
So, why 4142? Well, it really doesn't matter what the key is. I'm guessing some programmer suggested 42. His manager then said "That doesn't sound very secure." He sighed and said, "I already explained why no key is going to be any more secure than… you know what, forget it, what about 4142?" The manager said, "Ooh, that sounds like a really secure number!" So that's why 4142.
If it's not a library function, can you think of a better method to de-obfuscate these values without resorting to the magic 142 number?
You do need to resort to the magic 4142, but you can make this a lot simpler:
def decrypt_block(block):
    return struct.pack('>H', (4142 - int(block, 10)) % 65536)
So, each block of 5 characters is the decimal representation of a UTF-16 code unit, subtracted from 4142, using C unsigned-short wraparound rules.
This would be trivial to implement in native Windows C, but it's slightly harder in Python. The best transformation function I can come up with is:
import struct

def decrypt_block(block):
    return struct.pack('>H', (4142 - int(block, 10)) % 65536)

def decrypt(pwd):
    blocks = [pwd[i:i+5] for i in range(0, len(pwd), 5)]
    return ''.join(map(decrypt_block, blocks)).decode('utf-16-be')
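As an illustrative round-trip (assuming Python 2 / Jython semantics as in the question, where struct.pack returns a byte string), feeding it the sample value from the top should give back the original password:
pwd = ('04077040940409304092040910409004089040880408704086'
       '040850404504044040430407404073040720407104070')
print decrypt(pwd)   # A0123456789abcDEFGH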
This would be a lot more trivial in C or C#, which is probably what they implemented things in, so let me explain what I'm doing.
You already know how to transform the string into a sequence of 5-character blocks.
My int(block, 10) is doing the same thing as your int(block.lstrip('0')), making sure that a '0' prefix doesn't make Python treat it as an octal numeral instead of decimal, but more explicitly. I don't think this is actually necessary in Jython 2.2 (it definitely isn't in more modern Python/Jython), but I left it just in case.
Next, in C, you'd just do unsigned short x = 4142U - y;, which would automatically underflow appropriately. Python doesn't have unsigned short values, just signed int, so we have to do the underflow manually. (Because Python uses floored division and remainder, the sign is always the same as the divisor—this wouldn't be true in C, at least not C99 and most platforms' C89.)
Then, in C, we'd just cast the unsigned short to a 16-bit "wide character"; Python doesn't have any way to do that, so we have to use struct.pack. (Note that I'm converting it to big-endian, because I think that makes this easier to debug; in C you'd convert to native-endian, and since this is Windows, that would be little-endian.)
So, now we've got a sequence of 2-character UTF-16-BE code points. I just join them into one big string, then decode it as UTF-16-BE.
If you really want to test that I've got this right, you'll need to find characters that aren't just non-ASCII, but non-Western. In particular, you need:
A character that's > U+4142 but < U+10000. Most CJK ideographs, like U+7000 (瀀), fit the bill. This should appear as '41006', because that's 4142-0x7000 rolled over as an unsigned short.
A character that's >= U+10000. This includes uncommon CJK characters, specialized mathematical characters, characters from ancient scripts, etc. For example, the Old Italic character U+10300 (𐌀) encodes to the surrogate pair (0xd800, 0xdf00); 4142-0xd800=14382, and 4142-0xdf00=12590, so you'd get '1438212590'.
The first will be hard to find—even most Chinese- and Japanese-native programmers I've dealt with use ASCII passwords. And the second, even more so; nobody but a historical linguistics professor is likely to even think of using archaic scripts in their passwords. By Murphy's Law, if you write the correct code, it will never be used, but if you don't, it's guaranteed to show up as soon as you ship your code.
As a side project I would like to try to parse binary files (Mach-O files specifically). I know tools exist for this already (otool) so consider this a learning exercise.
The problem I'm hitting is that I don't understand how to convert the binary elements found into a Python representation. For example, the Mach-O file format starts with a header which is defined by a C struct. The first item is a uint_32 'magic number' field. When I do
magic = f.read(4)
I get
b'\xcf\xfa\xed\xfe'
This is starting to make sense to me. It's literally a byte array of 4 bytes. However, I want to treat this like a 4-byte int that represents the original magic number. Another example is the numberOfSections field. I just want the number represented by the 4-byte field, not an array of literal bytes.
Perhaps I'm thinking about this all wrong. Has anybody worked on anything similar? Do I need to write functions to look at these 4-byte byte arrays and shift and combine their values to produce the number I want? Is endianness going to screw me here? Any pointers would be most helpful.
Take a look at the struct module:
In [1]: import struct
In [2]: magic = b'\xcf\xfa\xed\xfe'
In [3]: decoded = struct.unpack('<I', magic)[0]
In [4]: hex(decoded)
Out[4]: '0xfeedfacf'
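Extending that idea, here is a rough sketch of unpacking the whole 64-bit Mach-O header in one call (field names follow the mach_header_64 struct in mach-o/loader.h; the path is a placeholder, and a little-endian file is assumed):
import struct

with open('/path/to/your/file', 'rb') as f:
    data = f.read(32)                       # mach_header_64 is eight 32-bit fields = 32 bytes
(magic, cputype, cpusubtype, filetype,
 ncmds, sizeofcmds, flags, reserved) = struct.unpack('<8I', data)
print(hex(magic), ncmds)                    # e.g. '0xfeedfacf' and the number of load commands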
There's the Kaitai Struct project that solves exactly this problem. First, you describe a certain file format using a .ksy spec, then you compile it into a Python library (or, actually, a library in any other major programming language), import it, and, voila, parsing boils down to:
from mach_o import MachO
my_file = MachO.from_file("/path/to/your/file")
my_file.magic # => 0xfeedface
my_file.num_of_sections # => some other integer
my_file.sections # => list of objects that represent sections
They have a growing repository of file format specs. It doesn't have a Mach-O file format spec (yet?), but there are complex formats like Java .class or Microsoft's PE executable described there, so I guess it shouldn't be a major problem to write a spec for the Mach-O format as well.
It is actually better than Construct or Hachoir because it's compiled (as opposed to interpreted), and thus faster, and it includes tons of other helpful tools such as a visualizer and a format diagram maker (for example, it can generate an explanation diagram for the PE executable format).
I would suggest the Construct module. It offers a very high level interface.
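For instance, a minimal sketch of the same header in Construct's declarative style (Construct 2.8+ API; the field list mirrors the struct example above and is an assumption, not a complete Mach-O description):
from construct import Struct, Int32ul

# Declarative description of the mach_header_64 fields, all little-endian uint32
mach_header = Struct(
    "magic" / Int32ul,
    "cputype" / Int32ul,
    "cpusubtype" / Int32ul,
    "filetype" / Int32ul,
    "ncmds" / Int32ul,
    "sizeofcmds" / Int32ul,
    "flags" / Int32ul,
    "reserved" / Int32ul,
)

with open('/path/to/your/file', 'rb') as f:
    header = mach_header.parse(f.read(32))
print(hex(header.magic), header.ncmds)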
I'm trying to write a Python C extension that processes byte strings, and I have something basically working for Python 2.x and Python 3.x.
For the Python 2.x code, near the start of my function, I currently have a line:
if (!PyArg_ParseTuple(args, "s#:in_bytes", &src_ptr, &src_len))
...
I notice that the s# format specifier accepts both Unicode strings and byte strings. I really just want it to accept byte strings and reject Unicode. For Python 2.x, this might be "good enough"--the standard hashlib seems to do the same, accepting Unicode as well as byte strings. However, Python 3.x is meant to clean up the Unicode/byte string mess and not let the two be interchangeable.
So, I'm surprised to find that in Python 3.x, the s format specifiers for PyArg_ParseTuple() still seem to accept Unicode and provide a "default encoded string version" of the Unicode. This seems to go against the principles of Python 3.x, making the s format specifiers unusable in practice. Is my analysis correct, or am I missing something?
Looking at the implementation for hashlib for Python 3.x (e.g. see md5module.c, function MD5_update() and its use of GET_BUFFER_VIEW_OR_ERROUT() macro) I see that it avoids the s format specifiers, and just takes a generic object (O specifier) and then does various explicit type checks using the GET_BUFFER_VIEW_OR_ERROUT() macro. Is this what we have to do?
I agree with you -- it's one of several spots where the C API migration of Python 3 was clearly not designed as carefully and thoroughly as the Python coder-visible parts. I also agree that probably the best workaround for now is focusing on "buffer views", per that macro -- until and unless something better gets designed into a future Python C API (don't hold your breath waiting for that to happen, though ;-).