Haskell program involving `read` is much slower than an equivalent Python one

As part of a programming challenge, I need to read, from stdin, a sequence of space-separated integers (on a single line), and print the sum of those integers to stdout. The sequence in question can contain as many as 10,000,000 integers.
I have two solutions for this: one written in Haskell (foo.hs), and another, equivalent one, written in Python 2 (foo.py). Unfortunately, the (compiled) Haskell program is consistently slower than the Python program, and I'm at a loss for explaining the discrepancy in performance between the two programs; see the Benchmark section below. If anything, I would have expected Haskell to have the upper hand...
What am I doing wrong? How can I account for this discrepancy? Is there an easy way of speeding up my Haskell code?
(For information, I'm using a mid-2010 MacBook Pro with 8 GB of RAM, GHC 7.8.4, and Python 2.7.9.)
foo.hs
main = print . sum =<< getIntList
getIntList :: IO [Int]
getIntList = fmap (map read . words) getLine
(compiled with ghc -O2 foo.hs)
foo.py
ns = map(int, raw_input().split())
print sum(ns)
Benchmark
In the following, test.txt consists of a single line of 10 million space-separated integers.
# Haskell
$ time ./foo < test.txt
1679257
real 0m36.704s
user 0m35.932s
sys 0m0.632s
# Python
$ time python foo.py < test.txt
1679257
real 0m7.916s
user 0m7.756s
sys 0m0.151s

read is slow. For bulk parsing, use bytestring or text primitives, or attoparsec.
I did some benchmarking. Your original version ran in 23.9 seconds on my computer. The version below ran in 0.35 seconds:
import qualified Data.ByteString.Char8 as B
import Control.Applicative
import Data.Maybe
import Data.List
import Data.Char
main = print . sum =<< getIntList
getIntList :: IO [Int]
getIntList =
  map (fst . fromJust . B.readInt) . B.words <$> B.readFile "test.txt"
By specializing the parser to your test.txt file, I could get the runtime down to 0.26 sec:
getIntList :: IO [Int]
getIntList =
  unfoldr (B.readInt . B.dropWhile (==' ')) <$> B.readFile "test.txt"

Read is slow
Fast read, from this answer, will bring you down to 5.5 seconds.
import Numeric
fastRead :: String -> Int
fastRead s = case readDec s of [(n, "")] -> n
Strings are Linked Lists
In Haskell, the String type is a linked list of characters. Use a packed representation instead (ByteString if you really only need ASCII, but Text is also very fast and supports Unicode). As shown in this answer, the performance should then be neck and neck.

I would venture to guess that a big part of your problem is actually words. When you map read . words, what you're actually doing is this:
1. Scan the input looking for a space, building a list of non-spaces as you go. There are a lot of different kinds of spaces, and checking any character that's not a common type of space additionally involves a foreign call to a C function (slow). I'm planning to fix this sometime, but I haven't gotten around to it yet, and even then you'll still be building and throwing away lists for no good reason, and checking for spaces when you really just want to check for digits.
2. Read through the list of accumulated characters to try to make a number out of them. Produce the number. The accumulated list now becomes garbage.
3. Go back to step 1.
This is a fairly ridiculous way to proceed. I believe you can even do better using something horrible like reads, but it would make more sense to use something like ReadP. You can also try fancier sorts of things like stream-based parsing; I don't know if that will help much or not.

Related

python speed up this regex sub

p = re.compile('>.*\n')
p.sub('', text)
I want to delete all lines starting with a '>'. I have a really huge file (3GB) that I process in chunks of size 250MB, so the variable "text" is a string of size 250MB. (I tried different sizes, but the performance was always the same for the complete file).
Now, can I speed up this regex somehow? I tried the multi-line matching, but it was a lot slower. Or are there even better ways?
I already tried splitting the string and then filtering out the lines like this, but it was also slower. (I also tried a lambda instead of def del_line; the following might not be working code, it's just from memory.)
def del_line(x): return x[0] != '>'
def func():
    ....
    text = file.readlines(chunksize)
    text = filter(del_line, text)
    ...
EDIT:
As suggested in the comments, I also tried walking line by line:
text = []
for line in file:
    if line[0] != '>':
        text.append(line)
text = ''.join(text)
That's also slower: it needs ~12 seconds, while my regex needs ~7 seconds. (Yes, that's fast, but it must also run on slower machines.)
EDIT: Of course, I also tried str.startswith('>'), it was slower...
If you have the chance, running grep as a subprocess is probably the most pragmatic choice.
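For illustration, here is a minimal sketch of the subprocess route (the file names are placeholders, and it assumes grep is on the PATH):
import subprocess

# Drop every line that starts with '>' and write the rest to a new file.
# 'input.txt' and 'cleaned.txt' are hypothetical names.
with open('cleaned.txt', 'w') as out:
    subprocess.call(['grep', '-v', '^>', 'input.txt'], stdout=out)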
If for whatever reason you can't rely on grep, you could try implementing some of the "tricks" that make grep fast. From the author himself, you can read about them here: http://lists.freebsd.org/pipermail/freebsd-current/2010-August/019310.html
At the end of the article, the author summarizes the main points. The one that stands out to me the most is:
Moreover, GNU grep AVOIDS BREAKING THE INPUT INTO LINES. Looking for newlines would slow grep down by a factor of several times, because to find the newlines it would have to look at every byte!
The idea would be to load the entire file into memory and iterate over it at the byte level instead of the line level. Only when you find a match do you look for the line boundaries and delete the line.
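As a rough sketch of that idea in pure Python (so it will not match grep's speed; the function name is made up):
def drop_gt_lines(buf):
    # buf is the whole chunk as one string; scan for '>' directly and only
    # look for line boundaries around an actual hit.
    out = []
    pos = 0
    while True:
        hit = buf.find('>', pos)
        if hit == -1:
            out.append(buf[pos:])
            break
        if hit == 0 or buf[hit - 1] == '\n':
            # '>' starts a line: keep everything before it, skip past the line.
            out.append(buf[pos:hit])
            nl = buf.find('\n', hit)
            pos = len(buf) if nl == -1 else nl + 1
        else:
            # '>' in the middle of a line: keep it and move on.
            out.append(buf[pos:hit + 1])
            pos = hit + 1
    return ''.join(out)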
You say you have to run this on other computers. If it's within your reach and you are not doing it already, consider running it on PyPy instead of CPython (the default interpreter). This may (or may not) improve the runtime by a significant factor, depending on the nature of the program.
Also, as some comments already mentioned, benchmark with the actual grep to get a baseline of how fast you can go, reasonably speaking. Get it on Cygwin if you are on Windows, it's easy enough.
Is this not faster?
def cleanup(chunk):
    return '\n'.join(st for st in chunk.split('\n') if not(st and st[0] == '>'))
EDIT: yeah, that is not faster. That's twice as slow.
Maybe consider using subprocess and a tool like grep, as suggested by Ryan P. You could even take advantage of multiprocessing.

Murmurhash 2 results on Python and Haskell

Haskell and Python don't seem to agree on MurmurHash2 results. Python, Java, and PHP returned the same results, but Haskell doesn't. Am I doing something wrong regarding MurmurHash2 on Haskell?
Here is my code for Haskell Murmurhash2:
import Data.Digest.Murmur32
main = do
  print $ asWord32 $ hash32WithSeed 1 "woohoo"
And here is the code written in Python:
import murmur
if __name__ == "__main__":
    print murmur.string_hash("woohoo", 1)
Python returned 3650852671, while Haskell returned 3966683799.
From a quick inspection of the sources, it looks like the algorithm operates on 32 bits at a time. The Python version gets these by simply grabbing 4 bytes at a time from the input string, while the Haskell version converts each character to a single 32-bit Unicode index.
It's therefore not surprising that they yield different results.
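For illustration only, here is a rough Python sketch of the difference described above. It only shows what each implementation would feed to the hash, based on this answer's reading of the sources; it does not compute the hash itself:
s = "woohoo"

# The C-backed Python module reads the raw bytes four at a time:
py_blocks = [s[i:i+4] for i in range(0, len(s), 4)]   # ['wooh', 'oo']

# The Haskell package hashes one 32-bit value per character (its code point):
hs_blocks = [ord(c) for c in s]                       # [119, 111, 111, 104, 111, 111]

print py_blocks, hs_blocks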
The murmur-hash package (I am its author) does not promise to compute the same hashes as other languages. If you rely on hashes to be compatible with other software that computes hashes I suggest you create newtype wrappers that compute hashes the way you want them. For text, in particular, you need to at least specify the encoding. In your case you could convert the text to an ASCII string using Data.ByteString.Char8.pack, but that still doesn't give you the same hash since the ByteString instance is more of a placeholder.
BTW, I'm not actively improving that package because MurmurHash2 has been superseded by MurmurHash3, but I keep accepting patches.

Shortest Python Quine?

Python 2.x (30 bytes):
_='_=%r;print _%%_';print _%_
Python 3.x (32 bytes)
_='_=%r;print(_%%_)';print(_%_)
Is this the shortest possible Python quine, or can it be done better? This one seems to improve on all the entries on The Quine Page.
I'm not counting the trivial 'empty' program.
I'm just going to leave this here (save as exceptionQuine.py):
File "exceptionQuine.py", line 1
File "exceptionQuine.py", line 1
^
IndentationError: unexpected indent
Technically, the shortest Python quine is the empty file. Apart from this trivial case:
Since Python's print automatically appends a newline, the quine is actually _='_=%r;print _%%_';print _%_\n (where \n represents a single newline character in the file).
Neither
print open(__file__).read()
nor anything involving import is a valid quine, because a quine by definition cannot take any input. Reading an external file is considered taking input, and thus a quine cannot read a file -- including itself.
For the record, technically speaking, the shortest possible quine in python is a blank file, but that is sort of cheating too.
In a slightly non-literal approach, taking 'shortest' to mean short in terms of the number of statements as well as just the character count, I have one here that doesn't include any semicolons.
print(lambda x:x+str((x,)))('print(lambda x:x+str((x,)))',)
In my mind this contends, because it's all one function, whereas others are multiple. Does anyone have a shorter one like this?
Edit: User flornquake made the following improvement (backticks for repr() to replace str() and shave off 6 characters):
print(lambda x:x+`(x,)`)('print(lambda x:x+`(x,)`)',)
Even shorter:
print(__file__[:-3])
And name the file print(__file__[:-3]).py (Source)
Edit: actually,
print(__file__)
named print(__file__) works too.
Python 3.8
exec(s:='print("exec(s:=%r)"%s)')
Here is another similar to postylem's answer.
Python 3.6:
print((lambda s:s%s)('print((lambda s:s%%s)(%r))'))
Python 2.7:
print(lambda s:s%s)('print(lambda s:s%%s)(%r)')
I would say:
print open(__file__).read()
Source
As of Python 3.8 I have a new quine! I'm quite proud of it because until now I have never created my own. I drew inspiration from _='_=%r;print(_%%_)';print(_%_), but made it into a single function (with only 2 more characters). It uses the new walrus operator.
print((_:='print((_:=%r)%%_)')%_)
This one is the least cryptic; the core is a.format(a).
a="a={1}{0}{1};print(a.format(a,chr(34)))";print(a.format(a,chr(34)))
I am strictly against your solution.
The formatting parameter % is definitely too advanced a high-level language feature. If such constructs are allowed, I would say that import must be allowed as well. Then I can construct a shorter quine by introducing some other high-level language construct (which, BTW, is much less powerful than the % operator, so it is less advanced):
Here is a Unix shell script creating such a quine.py file and checking it really works:
echo 'import x' > quine.py
echo "print 'import x'" > x.py
python quine.py | cmp - quine.py; echo $?
outputs 0
Yes, that's cheating, like using %. Sorry.

fast string modification in python

This is partially a theoretical question:
I have a string (say UTF-8), and I need to modify it so that each character (not byte) becomes 2 characters, for instance:
"Nissim" becomes "N-i-s-s-i-m-"
"01234" becomes "0a1b2c3d4e"
and so on.
I would suspect that naive concatenation in a loop would be too expensive (it IS the bottleneck, this is supposed to happen all the time).
I would either use an array (pre-allocated) or try to make my own C module to handle this.
Anyone has better ideas for this kind of thing?
(Note that the problem is always about multibyte encodings, and must be solved for UTF-8 as well.)
Oh, and it's Python 2.5, so no shiny Python 3 thingies are available here.
Thanks
#gnosis, beware of all the well-intentioned responders saying you should measure the times: yes, you should (because programmers' instincts are often off-base about performance), but measuring a single case, as in all the timeit examples proffered so far, misses a crucial consideration -- big-O.
Your instincts are correct: in general (with a very few special cases where recent Python releases can optimize things a bit, but they don't stretch very far), building a string by a loop of += over the pieces (or a reduce and so on) must be O(N**2) due to the many intermediate object allocations and the inevitable repeated copying of those objects' contents; joining, regular expressions, and the third option that was not mentioned in the above answers (the write method of cStringIO.StringIO instances) are the O(N) solutions and therefore the only ones worth considering unless you happen to know for sure that the strings you'll be operating on have modest upper bounds on their length.
So what, if any, are the upper bounds in length on the strings you're processing? If you can give us an idea, benchmarks can be run on representative ranges of lengths of interest (for example, say, "most often less than 100 characters but some % of the time maybe a couple thousand characters" would be an excellent spec for this performance evaluation: IOW, it doesn't need to be extremely precise, just indicative of your problem space).
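For completeness, a minimal sketch of the cStringIO option mentioned above (Python 2; the function name and the hyphen filler are placeholders, and this is an illustration rather than a benchmark):
from cStringIO import StringIO

def interleave_cstringio(s_utf8, fill=u'-'):
    # Decode so we iterate over characters, not bytes, then buffer the output
    # with cheap writes instead of repeated string concatenation.
    buf = StringIO()
    for ch in s_utf8.decode('utf-8'):
        buf.write((ch + fill).encode('utf-8'))
    return buf.getvalue()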
I also notice that nobody seems to follow one crucial and difficult point in your specs: that the strings are Python 2.5 multibyte, UTF-8 encoded, strs, and the insertions must happen only after each "complete character", not after each byte. Everybody seems to be "looping on the str", which gives each byte, not each character as you so clearly specify.
There's really no good, fast way to "loop over characters" in a multibyte-encoded byte str; the best one can do is to .decode('utf-8'), giving a unicode object -- process the unicode object (where loops do correctly go over characters!), then .encode it back at the end. By far the best approach in general is to only, exclusively use unicode objects, not encoded strs, throughout the heart of your code; encode and decode to/from byte strings only upon I/O (if and when you must because you need to communicate with subsystems that only support byte strings and not proper Unicode).
So I would strongly suggest that you consider this "best approach" and restructure your code accordingly: unicode everywhere, except at the boundaries where it may be encoded/decoded if and when necessary only. For the "processing" part, you'll be MUCH happier with unicode objects than you would be lugging around balky multibyte-encoded strings!-)
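A minimal sketch of that structure, using the hyphen example from the question (the function name is made up):
def interleave(s_utf8, fill=u'-'):
    u = s_utf8.decode('utf-8')             # loop over characters, not bytes
    out = u''.join(ch + fill for ch in u)  # join is O(N), unlike += in a loop
    return out.encode('utf-8')             # back to bytes only at the boundary

print interleave('Nissim')   # N-i-s-s-i-m-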
Edit: forgot to comment on a possible approach you mention -- array.array. That's indeed O(N) if you are only appending to the end of the new array you're constructing (some appends will make the array grow beyond previously allocated capacity and therefore require a reallocation and copying of data, but, just like for list, a mildly exponential overallocation strategy allows append to be amortized O(1), and therefore N appends to be O(N)).
However, to build an array (again, just like a list) by repeated insert operations in the middle of it is O(N**2), because each of the O(N) insertions must shift all the O(N) following items (assuming the number of previously existing items and the number of newly inserted ones are proportional to each other, as seems to be the case for your specific requirements).
So, an array.array('u'), with repeated appends to it (not inserts!-), is a fourth O(N) approach that can solve your problem (in addition to the three I already mentioned: join, re, and cStringIO) -- those are the ones worth benchmarking once you clarify the ranges of lengths that are of interest, as I mentioned above.
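A sketch of that fourth, append-only approach (again just an illustration; the function name is made up):
from array import array

def interleave_array(s_utf8, fill=u'-'):
    buf = array('u')                       # array of unicode characters
    for ch in s_utf8.decode('utf-8'):
        buf.append(ch)                     # appends only: amortized O(1) each
        buf.append(fill)
    return buf.tounicode().encode('utf-8')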
Try to build the result with the re module. It will do the nasty concatenation under the hood, so performance should be OK. Example:
import re
re.sub(r'(.)', r'\1-', u'Nissim')
count = 1
def repl(m):
    global count
    s = m.group(1) + unicode(count)
    count += 1
    return s
re.sub(r'(.)', repl, u'Nissim')
This might be an effective Python solution:
s1="Nissim"
s2="------"
s3=''.join([''.join(list(x)) for x in zip(s1,s2)])
Have you tested how slow it is, or how fast you need it to be? I think something like this will be fast enough:
s = u"\u0960\u0961"
ss = ''.join(sum(map(list,zip(s,"anurag")),[]))
So try the simplest approach first, and if it doesn't suffice, try to improve upon it; a C module should be the last option.
Edit: This is also the fastest
import timeit
s1="Nissim"
s2="------"
timeit.f1=lambda s1,s2:''.join(sum(map(list,zip(s1,s2)),[]))
timeit.f2=lambda s1,s2:''.join([''.join(list(x)) for x in zip(s1,s2)])
timeit.f3=lambda s1,s2:''.join(i+j for i, j in zip(s1, s2))
N=100000
print "anurag",timeit.Timer("timeit.f1('Nissim', '------')","import timeit").timeit(N)
print "dweeves",timeit.Timer("timeit.f2('Nissim', '------')","import timeit").timeit(N)
print "SilentGhost",timeit.Timer("timeit.f3('Nissim', '------')","import timeit").timeit(N)
output is
anurag 1.95547590546
dweeves 2.36131184271
SilentGhost 3.10855625505
Here are my timings. Note that this is Python 3.1.
>>> s1
'Nissim'
>>> s2 = '-' * len(s1)
>>> timeit.timeit("''.join(i+j for i, j in zip(s1, s2))", "from __main__ import s1, s2")
3.5249209707199043
>>> timeit.timeit("''.join(sum(map(list,zip(s1,s2)),[]))", "from __main__ import s1, s2")
5.903614027402
>>> timeit.timeit("''.join([''.join(list(x)) for x in zip(s1,s2)])", "from __main__ import s1, s2")
6.04072124013328
>>> timeit.timeit("''.join(i+'-' for i in s1)", "from __main__ import s1, s2")
2.484378367653335
>>> timeit.timeit("reduce(lambda x, y : x+y+'-', s1, '')", "from __main__ import s1; from functools import reduce")
2.290644129319844
Use reduce.
>>> str = "Nissim"
>>> reduce(lambda x, y : x+y+'-', str, '')
'N-i-s-s-i-m-'
The same works with numbers too, as long as you know which character maps to which (a dict can be handy):
>>> mapper = dict([(repr(i), chr(i+ord('a'))) for i in range(9)])
>>> str1 = '0123'
>>> reduce(lambda x, y : x+y+mapper[y], str1, '')
'0a1b2c3d'
string="™¡™©€"
unicode(string,"utf-8")
s2='-'*len(s1)
''.join(sum(map(list,zip(s1,s2)),[])).encode("utf-8")

Awk, bash or python for converting a regular file?

I have a text file with lots of lines and with this structure:
[('name_1a',
'name_1b',
value_1),
('name_2a',
'name_2b',
value_2),
.....
.....
('name_XXXa',
'name_XXXb',
value_XXX)]
I would like to convert it to:
name_1a, name_1b, value_1
name_2a, name_2b, value_2
......
name_XXXa, name_XXXb, value_XXX
I wonder what would be the best way, whether awk, python or bash.
Thanks
Jose
Tried evaluating it in Python? It looks like a list of tuples to me.
eval(your_string)
Note, it's massively unsafe! If there's code in there to delete your hard disk, evaluating it will run that code!
I would use Python:
lines = open('filename.txt','r').readlines()
n = len(lines) # n % 3 == 0
for i in range(0,n,3):
    name1 = lines[i].strip("',[]\n\r")
    name2 = lines[i+1].strip("',[]\n\r")
    value = lines[i+2].strip("',[]\n\r")
    print name1,name2,value
It looks like legal Python. You might be able to just import it as a module and then write it back out after formatting it.
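A sketch of that idea, assuming you first rename the file to data.py and prepend something like "rows = " so the list is bound to a name (both the module and variable names here are made up):
from data import rows   # hypothetical module and variable names

for name_a, name_b, value in rows:
    print '%s, %s, %s' % (name_a, name_b, value)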
Oh boy, here is a job for ast.literal_eval.
(literal_eval is safer than eval, since it restricts the input string to literals such as strings, numbers, tuples, lists, dicts, booleans and None.)
import ast
filename='in'
with open(filename,'r') as f:
    contents=f.read()
data=ast.literal_eval(contents)
for elt in data:
    print(', '.join(map(str,elt)))
here's one way to do it with (g)awk
$ awk -vRS=")," ' { gsub(/\n|[\047\]\[)(]/,"") } 1' file
name_1a,name_1b,value_1
name_2a,name_2b,value_2
name_XXXa,name_XXXb,value_XXX
Awk is typically line-oriented, and bash is a shell with a limited number of string manipulation functions. It really depends on where your strength as a programmer lies, but all other things being equal, I would choose Python.
Did you ever consider that by redirecting the time it took to post this on SO, you could have had it done?
"AWK is a language for processing
files of text. A file is treated as a
sequence of records, and by default
each line is a record. Each line is
broken up into a sequence of fields,
so we can think of the first word in a
line as the first field, the second
word as the second field, and so on.
An AWK program is of a sequence of
pattern-action statements. AWK reads
the input a line at a time. A line is
scanned for each pattern in the
program, and for each pattern that
matches, the associated action is
executed." - Alfred V. Aho[2]
Asking what's the best language for a given task is a very different question from, say, asking what's the best way of doing a given task in a particular language. The first, which is what you're asking, is in most cases entirely subjective.
Since this is a fairly simple task, I would suggest going with what you know (unless you're doing this for learning purposes, which I doubt).
If you know any of the languages you suggested, go ahead and solve this in a matter of minutes. If you know none of them, here comes the subjective part: I would suggest learning Python, since it's so much more fun than the other two ;)
If the values are legal Python values, you can take advantage of eval(), since your data is a legal Python data structure. The following would work if the values are integers; otherwise, you might have to massage the print call a bit:
input = """[('name_1a',
'name_1b',
1),
('name_2a',
'name_2b',
2),
('name_XXXa',
'name_XXXb',
3)]"""
for e in eval(input):
    print '%s,%s,%d' % e
P.S. using eval() is quite controversial since it will execute any valid python code that you pass into it, so take care.
